[Kittyhawk] KH is not fault-tolerant/fault oblivious
jappavoo at bu.edu
Thu Sep 8 17:53:01 EDT 2011
Cool! Thanks Eric! Maybe we can try looking at this tomorrow.
BTW, with respect to ARP, we did already have that discussion about only
the single node being pinged and hence not populating all the ARP caches.
What I was saying was that a "Gratuitous ARP" should occur when the
nodes bring up their interfaces under Linux, and that this should induce population of the
ARP caches (I could not remember the term Gratuitous ARP at the time).
I saw this occur when I was porting lwip to uboot on KH; my assumption was that
Linux would likely do the same. The following is info about Gratuitous ARP:
RFC 3220 4.6 IP Mobility Support for IPv4 January 2002
- A Gratuitous ARP is an ARP packet sent by a node in order
to spontaneously cause other nodes to update an entry in their
ARP cache. A gratuitous ARP MAY use either an ARP Request or
an ARP Reply packet. In either case, the ARP Sender Protocol
Address and ARP Target Protocol Address are both set to the IP
address of the cache entry to be updated, and the ARP Sender
Hardware Address is set to the link-layer address to which this
cache entry should be updated. When using an ARP Reply packet,
the Target Hardware Address is also set to the link-layer
address to which this cache entry should be updated (this field
is not used in an ARP Request packet).
In either case, for a gratuitous ARP, the ARP packet MUST be
transmitted as a local broadcast packet on the local link. As
specified in , any node receiving any ARP packet (Request
or Reply) MUST update its local ARP cache with the Sender
Protocol and Hardware Addresses in the ARP packet, if the
receiving node has an entry for that IP address already in its
ARP cache. This requirement in the ARP protocol applies even
for ARP Request packets, and for ARP Reply packets that do not
match any ARP Request transmitted by the receiving node.
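
To make the field layout in the RFC text concrete, here is a minimal sketch in Python of how such a frame could be assembled. It is only an illustration of the quoted description, not what lwip, uboot, or Linux actually do internally; the function name `build_gratuitous_arp` and the example MAC/IP values are made up for this note.

```python
import socket
import struct

def build_gratuitous_arp(mac: bytes, ip: str, reply: bool = False) -> bytes:
    """Return an Ethernet frame carrying a gratuitous ARP for (ip, mac).

    Hypothetical helper: follows the field rules quoted from the RFC above.
    """
    if len(mac) != 6:
        raise ValueError("MAC address must be exactly 6 bytes")
    ip_bytes = socket.inet_aton(ip)

    # Gratuitous ARP MUST go out as a local broadcast on the link.
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    eth_header = broadcast + mac + struct.pack("!H", 0x0806)

    opcode = 2 if reply else 1  # 1 = ARP Request, 2 = ARP Reply
    # Target Hardware Address: unused (zeroed) in a Request,
    # set to the sender's link-layer address in a Reply.
    target_mac = mac if reply else b"\x00" * 6

    # Hardware type 1 (Ethernet), protocol type 0x0800 (IPv4),
    # hardware address length 6, protocol address length 4, opcode.
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, opcode)
    # Sender and Target Protocol Address are both the IP being announced.
    arp += mac + ip_bytes + target_mac + ip_bytes
    return eth_header + arp

# Example with made-up addresses.
frame = build_gratuitous_arp(bytes.fromhex("020000000001"), "10.0.0.1")
```

Actually putting the frame on the wire would need a raw socket (on Linux, an `AF_PACKET` socket bound to the interface, with root privileges), which is left out here.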
On Sep 7, 2011, at 10:38 PM, Eric Van Hensbergen wrote:
> Okay - I spent today messing around with different simple
> configurations. I have reproduction instructions from both Jan and
> Ron and will try to test them tomorrow.
> Simple configurations at 500 and 1000 nodes seem stable with only
> simple data transfers (ie. pdsh -f 1000 -w host[1-1000] date works
> again and again)
> When I tried to do some khdo writefile to all 1000 nodes (in part
> following Ron's script, I tried just writing the 1000-line hosts
> file to all the nodes) things went south.
> I tried:
> cat root.1000.workers | khdo writefile root.1000.workers
> 133+1 records in
> 133+1 records out
> 32016 bytes (32 kB) copied, 0.001105 s, 29.0 MB/s
> I got:
> Message from syslogd at localhost at Sep 8 02:09:36 ...
> kernel:------------[ cut here ]------------
> on the khfoxdev node - which I think indicates a panic. After that
> everything went unresponsive including the khctl node. The fact that
> we get the panic and then things wedge is a bit interesting -- based
> on where I saw the panic messages, I'm thinking it definitely happened
> on the khfoxdev node.
> Unfortunately, no debug logs anywhere really gave any insight into
> what caused the crash. I guess the next step might be to instrument
> the kernel with a circular buffer for the console like we have on Plan
> 9 so I can see what the panic actually is without relying on the kh
> console. I'll send out another update tomorrow once I know a bit
> On the arp front: it's clearly not arp, since the pdsh works fine --
> however, arps look like they time out, which means we will probably be
> better off with static arp resolution in the long run just to avoid
> lots of unnecessary traffic. Also, something Jonathan said doesn't
> appear to be the case -- he thought all the arps would be pre-cached
> from the khbootapp pings -- but only one node gets pinged, so arps don't
> happen until you try to access a node.