[Kittyhawk] KH is not fault-tolerant/fault oblivious

Jonathan Appavoo jappavoo at bu.edu
Thu Sep 8 17:53:01 EDT 2011


Cool! Thanks Eric!  Maybe we can try looking at this tomorrow.

BTW, with respect to ARP: we did already have that discussion about only
a single node being pinged, and hence not all the ARP caches being
populated.

What I was saying was that a "Gratuitous ARP" should occur when the
nodes bring up their interfaces under Linux, and that this should induce
population of the ARP caches. (I could not remember the term Gratuitous
ARP at the time.) I saw this occur when I was porting lwip to uboot on
KH; my assumption was that Linux would likely do the same. The following
is info about Gratuitous ARP:

eg:

RFC 3220 4.6          IP Mobility Support for IPv4          January 2002

      -  A Gratuitous ARP [45] is an ARP packet sent by a node in order
         to spontaneously cause other nodes to update an entry in their
         ARP cache.  A gratuitous ARP MAY use either an ARP Request or
         an ARP Reply packet.  In either case, the ARP Sender Protocol
         Address and ARP Target Protocol Address are both set to the IP
         address of the cache entry to be updated, and the ARP Sender
         Hardware Address is set to the link-layer address to which this
         cache entry should be updated.  When using an ARP Reply packet,
         the Target Hardware Address is also set to the link-layer
         address to which this cache entry should be updated (this field
         is not used in an ARP Request packet).

         In either case, for a gratuitous ARP, the ARP packet MUST be
         transmitted as a local broadcast packet on the local link.  As
         specified in [36], any node receiving any ARP packet (Request
         or Reply) MUST update its local ARP cache with the Sender
         Protocol and Hardware Addresses in the ARP packet, if the
         receiving node has an entry for that IP address already in its
         ARP cache.  This requirement in the ARP protocol applies even
         for ARP Request packets, and for ARP Reply packets that do not
         match any ARP Request transmitted by the receiving node [36].
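
To make the quoted text concrete, here is a rough sketch of the frame
layout and the receiver-side rule it describes, assuming Ethernet and
IPv4. The helper names (mac_bytes, gratuitous_arp, update_cache) and the
MAC/IP values are made up for illustration; this only builds and parses
bytes and is not how Linux, lwip, or uboot implement it.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6   # gratuitous ARP must go out as a local broadcast
ETHERTYPE_ARP = 0x0806
ARP_REQUEST, ARP_REPLY = 1, 2


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes (hypothetical helper)."""
    return bytes(int(part, 16) for part in mac.split(":"))


def gratuitous_arp(mac, ip, op=ARP_REQUEST):
    """Build an Ethernet frame carrying a gratuitous ARP for (ip, mac)."""
    sha = mac_bytes(mac)          # Sender Hardware Address
    spa = socket.inet_aton(ip)    # Sender Protocol Address
    # Target Hardware Address: same as sender in a Reply, unused in a Request.
    tha = sha if op == ARP_REPLY else b"\x00" * 6
    eth = BROADCAST_MAC + sha + struct.pack("!H", ETHERTYPE_ARP)
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, then opcode.
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, op)
    # Sender and Target Protocol Address are both the IP being announced.
    arp += sha + spa + tha + spa
    return eth + arp


def update_cache(cache, frame):
    """Apply the quoted rule: update only if an entry for that IP exists."""
    arp = frame[14:]                      # skip the 14-byte Ethernet header
    sha, spa = arp[8:14], arp[14:18]
    ip = socket.inet_ntoa(spa)
    if ip in cache:
        cache[ip] = sha
        return True
    return False
```

Note that under the rule as quoted, a node with no existing entry for
that IP does not add one. On Linux, whether a gratuitous ARP creates a
new entry is, as far as I know, controlled by the arp_accept sysctl,
which is off by default, so this is worth checking before relying on
gratuitous ARP to pre-populate caches.
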

On Sep 7, 2011, at 10:38 PM, Eric Van Hensbergen wrote:

> Okay - I spent today messing around with different simple
> configurations.  I have reproduction instructions from both Jan and
> Ron and will try to test them tomorrow.
> 
> Simple configurations at 500 and 1000 nodes seem stable with only
> simple data transfers (ie. pdsh -f 1000 -w host[1-1000] date works
> again and again)
> 
> When I tried to do some khdo writefile to all 1000 nodes (in part
> following Ron's script, I tried just writing the 1000-line hosts
> file to all the nodes), things went south.
> 
> I tried:
> cat root.1000.workers | khdo writefile root.1000.workers
> 133+1 records in
> 133+1 records out
> 32016 bytes (32 kB) copied, 0.001105 s, 29.0 MB/s
> 
> I got:
> 
> Message from syslogd at localhost at Sep  8 02:09:36 ...
> kernel:------------[ cut here ]------------
> on the khfoxdev node - which I think indicates a panic.  After that
> everything went unresponsive including the khctl node.  The fact that
> we get the panic and then things wedge is a bit interesting -- based
> on where I saw the panic messages, I'm thinking it definitely happened
> on the khfoxdev node.
> 
> Unfortunately, no debug logs anywhere really gave any insight into
> what caused the crash.  I guess the next step might be to instrument
> the kernel with a circular buffer for the console like we have on Plan
> 9 so I can see what the panic actually is without relying on the kh
> console.  I'll send out another update tomorrow once I know a bit
> more.
> 
> On the ARP front: it's clearly not ARP, since the pdsh works fine --
> however, ARP entries look like they time out, which means we will
> probably be better off with static ARP resolution in the long run just
> to avoid lots of unnecessary traffic.  Also, something Jonathan said
> doesn't appear to be the case -- he thought all the ARP entries would
> be pre-cached from the khbootapp pings -- but only one node gets
> pinged, so ARPs don't happen until you try to access a node.
> 
>       -eric