[Kittyhawk] KH is not fault-tolerant/fault oblivious

Eric Van Hensbergen ericvh at gmail.com
Mon Aug 22 13:27:55 EDT 2011


On Mon, Aug 22, 2011 at 12:04 PM, ron minnich <rminnich at gmail.com> wrote:
>
> We've got to change this. I suggest we go with the IO structure we use
> on Plan 9. How hard is it to set that up?
>

I'll let Jonathan give a more informed response, but KH does use all
the IO nodes as gateways to the ethernet/external-world.  The "single
point of failure/scale-issue" is the khctl node which is largely
command/control, not IO.  IIRC, we have talked about migrating the
khctl functionality to a more cellular/clustered-object model on the
IO nodes -- but it'll be a bit of work before we can get there because
there are a lot of assumptions in the existing configuration.  In such
a configuration I imagine we can scale gproc out similarly, although
I'm not sure that's the best course of action -- because gproc is
going to do more I/O and the Tree sucks as an I/O network.  It seems
like it would be better for gprocs to run almost exclusively on the
torus, which means no I/O nodes on BG/P -- although once we get Go
going we can certainly do this experiment and see what kind of results
we get.

However, I believe this won't solve the stability problem we've been
talking about w.r.t. the marginal h/w on the M1 midplane.  IIRC that
has more to do with the interconnects and their interaction with the
kittyhawk driver (I imagine it would be the Tree causing the issue).
Of course, I'm not sure if that's the same problem you are having.
Please go ahead and forward me the block numbers you are experiencing
the instability on and the scenario in which you see it (boot? after
launching the khfoxdev node? after launching the compute nodes?) so we
can correlate and know if there is one problem or several.

        -eric

