[Kittyhawk] KH is not fault-tolerant/fault oblivious
rminnich at gmail.com
Mon Aug 22 13:04:58 EDT 2011
If one bad node in 512 can take KH out, or if it is that sensitive to
hardware problems, then I think it needs some changes. The Plan 9
structure follows the CNS-configured layout of one IO node per 64
compute nodes. That IO node is restricted to talking to that small number of compute
nodes. In my setup I talk to groups of IO nodes from the login node. A
gproc arrangement would follow this pattern since a single IO node for
(e.g.) 40K compute nodes is not scalable anyway. One IO node could
also talk to other IO nodes over 10GE.
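
To make the grouping argument concrete, here is a rough sketch comparing a single IO node for all 512 compute nodes against one IO node per 64. It is illustrative only: the function names are hypothetical, it is not KH, gproc, or Plan 9 code, and it assumes (optimistically) that a bad compute node only stalls the nodes sharing its IO node.

```python
# Hypothetical illustration only. The 64:1 ratio and the 512-node count come
# from the message above; every name here is made up for the sketch.

def assign_io_nodes(num_compute, per_io=64):
    """Map each IO node index to the list of compute node indices it serves."""
    groups = {}
    for node in range(num_compute):
        groups.setdefault(node // per_io, []).append(node)
    return groups

def nodes_affected(groups, bad_node):
    """Return the compute nodes that share an IO node with bad_node.

    Assumption: a fault is contained within the bad node's IO group.
    """
    for members in groups.values():
        if bad_node in members:
            return members
    return []

single = assign_io_nodes(512, per_io=512)   # one IO node for everything
grouped = assign_io_nodes(512, per_io=64)   # one IO node per 64 compute nodes

print(len(single), len(nodes_affected(single, 100)))    # 1 512
print(len(grouped), len(nodes_affected(grouped, 100)))  # 8 64
```

Under that containment assumption, one marginal node costs 64 nodes instead of all 512, and the login node still has seven healthy IO node groups to talk to.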
In other words, in my view, the single IO node for all nodes is
neither scalable nor fault tolerant. Experience is showing that it is
too sensitive to hardware faults. It's also driving me nuts as I try
to scale up Kyoto Tycoon. I never know when I'm going to hit the
landmine and watch it all lock up. So I'm back to CNK.
Asking users to debug marginal hardware is also a no-go. Were I to
drop standard users into this environment, have them lock it up at 512
nodes (a trivially small number), then tell them "Guess you'll have to
find the error yourself", well, that's the end of KH for users, I fear.
We've got to change this. I suggest we go with the IO structure we use
on Plan 9. How hard is it to set that up?