[Kittyhawk] KH is not fault-tolerant/fault oblivious
Eric Van Hensbergen
ericvh at gmail.com
Mon Sep 12 09:04:47 EDT 2011
I didn't get a chance to look at it over the weekend, but I think what's happening is that we are exhausting the GFP_ATOMIC reserved allocations due to the burst of messages from the console responses. If you recall, I'm triggering it via khdo writefile, which is essentially dumping a 1000-line hosts file to the shell prompt on 1000 nodes, which probably yields 10 million packets in a short interval. This likely overwhelms the 8MB or so reserved for GFP_ATOMIC on the khctl node, so we end up with an OOM in an interrupt handler. That may be a fatal panic, but even if it isn't, it probably results in packets not getting removed from the tree fifo, which clogs it up.

I'll go in today and set up some debugging to verify my understanding of the problem. This may or may not be related to the problems we are seeing when using IP (as opposed to console commands).

It seems to me there is an inherent instability in the way we are dealing with tree traffic. It's not clear to me how we detect that we are going to run out of skbuf allocation (since the panic is happening inside skballoc), but essentially we just need to start dropping packets when we hit this condition. If I can't find a good way to detect that we are about to run out of space, I may put in some sort of token allocator for tree skbufs so that we can do the accounting on our side of the kernel interface. Other suggestions welcome.
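To make the token allocator idea concrete, here is a rough user-space model of the accounting in Python. It is only a sketch, not kernel code: the names `TreeSkbTokens`, `try_acquire`, `release`, and `receive_burst` are hypothetical, and the budget is an assumed tunable rather than a measured value. The point is that the receive path takes a token before allocating a buffer, and drops the packet (while still draining it from the fifo) when no token is available.

```python
class TreeSkbTokens:
    """Fixed budget of tokens; one token is held per in-flight tree buffer."""

    def __init__(self, budget):
        self.budget = budget   # max buffers outstanding at once (assumed tunable)
        self.in_flight = 0     # tokens currently held
        self.dropped = 0       # packets refused because the budget was exhausted

    def try_acquire(self):
        """Call before allocating a buffer in the receive path."""
        if self.in_flight >= self.budget:
            self.dropped += 1  # caller discards the packet instead of allocating
            return False
        self.in_flight += 1
        return True

    def release(self):
        """Call when a buffer is freed after the stack has consumed it."""
        if self.in_flight == 0:
            raise RuntimeError("release without matching acquire")
        self.in_flight -= 1


def receive_burst(tokens, n_packets):
    """Drain n_packets from the fifo and return how many were accepted."""
    accepted = 0
    for _ in range(n_packets):
        if tokens.try_acquire():
            accepted += 1
        # Otherwise the packet is still read out of the fifo and discarded,
        # so the fifo keeps draining even when we are out of buffers.
    return accepted
```

With a budget of 4 and a burst of 10 packets, this model accepts 4 and counts 6 drops. Capacity returns only as buffers are released, which is the accounting we would need to keep on our side of the kernel interface.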
On Sep 12, 2011, at 6:39 AM, Jan Stoess <jan.stoess at kit.edu> wrote:
> On 9/9/2011 11:15 PM, ron minnich wrote:
>> Wouldn't it make sense to have a less fatal reaction to out of memory
>> conditions, like "drop packet" ...
> It seems to me that this is just a stack dump, right? Doesn't mean that this is killing Linux already. Might be that it's just unresponsive as hell dumping out all those skb allocation failure messages, but eventually bails out somewhere else.
> Dr. Jan Stoess, KIT System Architecture Group
> Phone: +49 (721) 6084 4056