[Kittyhawk] KH is not fault-tolerant/fault oblivious

Eric Van Hensbergen ericvh at gmail.com
Wed Sep 7 22:38:49 EDT 2011


Okay - I spent today messing around with different simple
configurations.  I have reproduction instructions from both Jan and
Ron and will try to test them tomorrow.

Simple configurations at 500 and 1000 nodes seem stable with only
simple data transfers (i.e., pdsh -f 1000 -w host[1-1000] date works
again and again).

When I tried to do a khdo writefile to all 1000 nodes (in part
following Ron's script, I tried just writing the 1000-line hosts
file to all the nodes), things went south.

I tried:
cat root.1000.workers | khdo writefile root.1000.workers
133+1 records in
133+1 records out
32016 bytes (32 kB) copied, 0.001105 s, 29.0 MB/s

I got:

Message from syslogd at localhost at Sep  8 02:09:36 ...
 kernel:------------[ cut here ]------------
on the khfoxdev node - which I think indicates a panic.  After that
everything went unresponsive including the khctl node.  The fact that
we get the panic and then things wedge is a bit interesting -- based
on where I saw the panic messages, I'm thinking it definitely happened
on the khfoxdev node.

Unfortunately, no debug logs anywhere really gave any insight into
what caused the crash.  I guess the next step might be to instrument
the kernel with a circular buffer for the console like we have on Plan
9 so I can see what the panic actually is without relying on the kh
console.  I'll send out another update tomorrow once I know a bit
more.
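
A minimal sketch of the circular-buffer idea, in Python for brevity
rather than kernel C: a fixed-size buffer keeps the most recent console
bytes and overwrites the oldest, so the tail of a panic message can be
read back later. The ConsoleRing name and its methods are hypothetical
and are not taken from the Plan 9 or kh sources.

```python
class ConsoleRing:
    """Fixed-size byte ring: new console output overwrites the oldest bytes."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0    # index of the next byte to write
        self.count = 0   # number of valid bytes held, at most size

    def write(self, data):
        """Append bytes, wrapping around and discarding the oldest on overflow."""
        for b in data:
            self.buf[self.head] = b
            self.head = (self.head + 1) % self.size
            if self.count < self.size:
                self.count += 1

    def dump(self):
        """Return the retained bytes, oldest first."""
        start = (self.head - self.count) % self.size
        if start + self.count <= self.size:
            return bytes(self.buf[start:start + self.count])
        # The valid region wraps past the end of the buffer.
        return bytes(self.buf[start:]) + bytes(self.buf[:self.head])


if __name__ == "__main__":
    ring = ConsoleRing(8)
    ring.write(b"abcdefghij")
    print(ring.dump())  # only the last 8 bytes are retained
```

A kernel version would place the buffer at a known memory location so
it can be inspected after the node wedges; that part is not shown here.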

On the arp front: it's clearly not arp, since the pdsh works fine --
however, arps look like they time out, which means we will probably be
better off with static arp resolution in the long run just to avoid
lots of unnecessary traffic.  Also, something Jonathan said doesn't
appear to be the case -- he thought all the arps would be pre-cached
from the khbootapp pings -- but only one node gets pinged, so arps
don't happen until you try to access a node.
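
To illustrate static arp resolution, the sketch below generates one
permanent neighbor entry per node using the iproute2 "ip neigh replace"
command. It assumes an IP-to-MAC table is available from somewhere; the
addresses, the eth0 device name, and the static_arp_commands helper are
placeholders, not Kittyhawk values.

```python
def static_arp_commands(entries, dev="eth0"):
    """Build iproute2 commands that pin each (ip, mac) pair as a permanent entry.

    entries: iterable of (ip, mac) string pairs (hypothetical table).
    dev:     network device name (assumed; adjust for the real interface).
    """
    return [
        f"ip neigh replace {ip} lladdr {mac} nud permanent dev {dev}"
        for ip, mac in entries
    ]


if __name__ == "__main__":
    # Placeholder addresses for illustration only.
    table = [
        ("192.0.2.1", "02:00:00:00:00:01"),
        ("192.0.2.2", "02:00:00:00:00:02"),
    ]
    for cmd in static_arp_commands(table):
        print(cmd)
```

Permanent entries never expire, so no arp requests go out for those
nodes; the trade-off is that the table must be kept correct by hand or
by whatever assigns the addresses.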

       -eric

