[Kittyhawk] KH is not fault-tolerant/fault oblivious

Jan Stoess jan.stoess at kit.edu
Wed Sep 7 05:45:13 EDT 2011

I'm experiencing severe stability problems with 512 nodes as well. I 
can khqsub 512 nodes successfully, but as soon as I try to launch a 
Linux appliance via khfoxapp (from the console or from khfoxdev, it 
doesn't matter), it boots, but shortly after Linux has come up the 
nodes lock up and I can't even reach khctlserver anymore. It seems 
that the tree-based booting via u-boot is still working (at least 
partly; I can't tell whether all nodes come up), but communication 
afterwards doesn't work, either via the kh console or via the 
network. I can't tell if it's a torus or a tree lockup, but I suspect 
it's the tree.

@Eric: I'm in desperate need of a solution, so I'm willing to provide 
any info or demo on this.
Dr. Jan Stoess, KIT System Architecture Group
Phone: +49 (721) 6084 4056
