Saturday, March 12, 2011

VC No Comm Event

Score a temporary victory for the machines in Man vs. Machine. A couple of weeks ago, an external event triggered a rebellion of our HP Blade Servers.

Without getting too technical, a bug in the firmware meant that the Virtual Connect network switches were checking an incorrect IP address when validating management communication. The Virtual Connect Manager apparently decided to stop talking to us when someone in China configured DNS for their new network using that address.

Once that happened, the Virtual Connect Manager started reporting NO_COMM and other disinformation. Just going into the manager to see what was going on made things worse, and when we eventually rebooted the switch (to clean up the mess), it abruptly took down every blade server, even though they are configured for complete redundancy.

(The HP Customer Advisory gives a long list of what can trigger disaster.)

Reminds me of the Uncle Remus Tar Baby story. It starts out with a lack of communication, but quickly becomes hopelessly sticky if you try to find out why and then keep slugging it out.
