[Kittyhawk] KH is not fault-tolerant/fault oblivious
Eric Van Hensbergen
ericvh at gmail.com
Wed Sep 7 07:56:38 EDT 2011
Okay, I am on it this morning. Can you give me an exact set of steps that reproduces the problem? We had a couple of plans to go after likely candidates, but I want to make sure I fix your problem.
Sent from my iPad
On Sep 7, 2011, at 4:45 AM, Jan Stoess <jan.stoess at kit.edu> wrote:
> I'm experiencing severe stability problems with 512 nodes as well. I can khqsub 512 nodes successfully, but as soon as I try to launch a Linux appliance via khfoxapp (from the console or from khfoxdev, it doesn't matter), the nodes boot, but shortly after Linux has booted they lock up and I can't even reach khctlserver anymore. It seems that the tree-based booting via u-boot is still working (at least partly; I can't tell whether all nodes come up), but communication afterwards doesn't work, neither via kh console nor via the network. I can't tell whether it's a torus or a tree lockup, but I suspect it's the tree cloaking.
> @Eric: I'm in desperate need of a solution to this, so I'm willing to provide any info or demo you need.
> Dr. Jan Stoess, KIT System Architecture Group
> Phone: +49 (721) 6084 4056