[Kittyhawk] KH is not fault-tolerant/fault oblivious

Dan Schatzberg dschatz at bu.edu
Mon Sep 12 15:03:51 EDT 2011


Did that back up the tree, or did it recover?

On Mon, Sep 12, 2011 at 2:44 PM, Eric Van Hensbergen <ericvh at gmail.com> wrote:

> got a new behavior during some poking around today.
>
>   {0}.0: bgtree: inject fifo timed out!
>
> ..and then everything hung.  Of course this could just be the IO node
> responding to a jammed up compute node tree level.
> FWIW, I had this on a 64-node bump doing the same sort of operations I
> was doing the other day although there was less determinism (it seemed
> to happen after I poked it several times instead of just the first
> time).  There were no other error messages on the console.
>
>         -eric
>
>
> On Mon, Sep 12, 2011 at 9:16 AM, Jonathan Appavoo <jappavoo at bu.edu> wrote:
> > We looked at it briefly on Friday and are going to start debugging it now.
> >
> > Jonathan.
> >
> > On Sep 12, 2011, at 9:29 AM, Eric Van Hensbergen wrote:
> >
> >> Actually -- looking at the code what Jan says is consistent - assuming
> >> it's returning from the alloc_skb it should be printing a message about
> >> not being able to alloc (within the tree driver) and dropping the
> >> packet which should dequeue it from fifo.  But I'm never seeing the
> >> message because TRACE() is defined NULL.  I'll fix that for my future
> >> runs.
> >>
> >> It could be just other parts of the system are more unhappy than us
> >> without GFP_ATOMIC allocation, but we are only seeing stack dumps from
> >> the tree receive interrupt....
> >>
> >> On Mon, Sep 12, 2011 at 6:39 AM, Jan Stoess <jan.stoess at kit.edu> wrote:
> >>> On 9/9/2011 11:15 PM, ron minnich wrote:
> >>>
> >>> ow.
> >>>
> >>> Wouldn't it make sense to have a less fatal reaction to out of memory
> >>> conditions, like "drop packet" ...
> >>>
> >>> It seems to me that this is just a stack dump, right? Doesn't mean that this
> >>> is killing Linux already. Might be that it's just unresponsive as hell
> >>> dumping out all those skb allocation failure messages, but eventually bails
> >>> out somewhere else.
> >>>
> >>>
> >>> --
> >>> Dr. Jan Stoess, KIT System Architecture Group
> >>> Phone: +49 (721) 6084 4056
> >> _______________________________________________
> >> Kittyhawk mailing list
> >> Kittyhawk at cs.bu.edu
> >> http://cs-mailman.bu.edu/mailman/listinfo/kittyhawk
> >
> >
>

