With a lot of troubleshooting and a bit of red tape, our team reached a solution to the trouble we faced during the migration process earlier. Contrary to what was expected, the problem was within the ESXi host rather than the network.

After attempting to troubleshoot the TCP/IP stack on ESXi, we discovered that ESXi's troubleshooting tools are quite limited. Naturally, reaching out to our vendor for support was the next step. Our team treats every client with care, but ensuring that a healthcare organization's software remains operational is absolutely crucial! So when we didn't receive the urgent support we would expect from a vendor in this type of situation, we had to take matters into our own hands.

Though ESXi's TCP/IP troubleshooting tools were limited, we found a resolution. Navigating through the vendor's knowledge base and bug reports ultimately led us to our solution.

We are in the midst of transitioning a hospital from physical servers to a VMware vSphere platform. Meditech is this particular client's primary software. Our team set up a proof of concept (POC) and ran countless demonstrations to ensure that all systems, including Meditech, ran smoothly together.

We were able to complete about 90% of the migration before hitting a serious problem: the VM cluster began to malfunction. One by one, physical nodes lost connectivity to public networks, and there was no definite pattern that pointed to a concrete cause of the glitches.

Considering that the old switches ran at only 100 Mbps, we suspected they might be the cause of the problem. We added new switches in front of the core switch, but still had no success. Bouncing the network on a physical server caused it to re-establish a connection. Everything pointed to a network cable simply being unplugged, but that was not the case.

Even after reconfiguring NIC teaming and moving cables to different switches, the host would lose all network visibility. Many possibilities have been exhausted, but our team is getting closer to a solution.

Complex problems like this one cannot be resolved with a quick search. Then again, if these issues never arose, we wouldn't be in business. That's why we are continuing to tackle this problem head on, boosting our resources and doing everything we can to provide 24-hour support for our client.

I will continue to share updates on our progress.

BlackBerry has always been a convenient tool. Companies like MobileIron are now trying to solve problems that BlackBerry resolved a long time ago.

I found another convenient feature of BlackBerry Enterprise Server. If you have an intranet website that is reachable only internally, you can now access it from your BlackBerry using its private IP address. The BlackBerry Enterprise Server acts as a proxy for the BlackBerry device's web browser, so your BlackBerry effectively gains visibility into the intranet and you can easily browse your internal systems right from the device!

To reduce the chance that your equipment will be affected by a natural disaster, there are even more factors to consider when analyzing the construction of a datacenter.

To start, the datacenter should be on high ground, and communication and power cables should be routed into the building from two different ends.

In addition, you should analyze your internet connectivity and remote access. If you have your own AS number and are able to peer and set up BGP with multiple providers, you can guarantee uptime for yourself, but you'll need to find trusted partners. If not, you should find an internet service provider that has multiple upstreams of its own, not one that is merely connected to a "protected ring" or some other redundant connection running over the same ISP.
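
One practical way to verify that each upstream actually works on its own is to push a few test packets out of every provider-facing link and see whether they make it. Below is a minimal Python sketch of that idea; the interface names and target address are placeholders, and it assumes a Linux host where ping can be bound to a specific interface.

    #!/usr/bin/env python3
    """Rough check that each upstream link can reach the outside world on its own.

    The interface names and target below are placeholders; adjust them to your
    own provider-facing links. Assumes a Linux host where `ping -I <interface>`
    forces traffic out of a specific link.
    """
    import subprocess

    REFERENCE_HOST = "198.51.100.1"          # illustrative external target
    UPSTREAM_INTERFACES = ["eth1", "eth2"]   # one interface per provider (assumed)

    def link_is_up(interface: str) -> bool:
        """Send a few pings out of one interface and report whether they succeed."""
        result = subprocess.run(
            ["ping", "-I", interface, "-c", "3", "-W", "2", REFERENCE_HOST],
            capture_output=True,
        )
        return result.returncode == 0

    for iface in UPSTREAM_INTERFACES:
        print(f"upstream via {iface}: {'OK' if link_is_up(iface) else 'FAILED'}")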

Though our own core network was stable during Hurricane Sandy, we still ran into service-quality issues with multiple partners, Internap being the biggest problem we encountered. Latency would grow from 3 ms to 800 ms, and some packets timed out completely. I'm sure this was the result of equipment overload while Internap was re-routing traffic. Cases like these are difficult to troubleshoot because they come and go: our clients get complaints from their own customers about outages or slowness, but because the problems are intermittent, they are hard to catch.
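
Intermittent spikes like that are much easier to prove to a partner if they leave a timestamped trail. Here is a minimal Python sketch of a long-running probe that logs the round-trip time whenever it crosses a threshold or times out; the target address, threshold, and interval are illustrative, not our production monitoring.

    #!/usr/bin/env python3
    """Log ICMP round-trip times so intermittent latency spikes leave a trail.

    The target, threshold, and interval are illustrative placeholders.
    """
    import re
    import subprocess
    import time
    from datetime import datetime

    TARGET = "203.0.113.1"    # illustrative partner-facing IP, replace with your own
    THRESHOLD_MS = 100.0      # anything slower than this gets logged
    INTERVAL_SEC = 10

    def ping_once(host):
        """Return the round-trip time in ms, or None if the packet timed out."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            capture_output=True, text=True,
        )
        match = re.search(r"time=([\d.]+) ms", result.stdout)
        return float(match.group(1)) if match else None

    while True:
        rtt = ping_once(TARGET)
        stamp = datetime.now().isoformat(timespec="seconds")
        if rtt is None:
            print(f"{stamp} {TARGET} TIMEOUT")
        elif rtt > THRESHOLD_MS:
            print(f"{stamp} {TARGET} SLOW {rtt:.1f} ms")
        time.sleep(INTERVAL_SEC)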

Because these issues are so important, I constantly stress to my sales team and our clients the significance of having at least four redundant upstreams connected to different peering partners.

Professionals have said, "A datacenter is not just an air-conditioned warehouse."

After the recent hurricane, I would add: "They are not just random basements down the street from your office, either!" Making sure your datacenter is specifically built to maintain operation in the event of a disaster is essential.

Hurricane Sandy proved that there IS a difference between an actual purpose-built datacenter and the random buildings that have supposedly been "converted into one."

We cannot predict weather conditions or future tragic events, but there are many specific factors that should always be considered when IT management is selecting a datacenter. For instance:

  1. Generator locations.
  2. Location of batteries, PDUs, switching etc.
  3. Quality and separation of chases for power and communication cables.
  4. The size of oil tanks: two days of runtime should be considered a minimum requirement (see the quick estimate after this list).
  5. Availability of office space.
  6. On-site IT personnel above Level 1 should be available to support IT challenges when your own staff is paralyzed by electrical outages or gas shortages.
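
On point 4, a quick back-of-the-envelope calculation makes the two-day requirement concrete. The tank size and burn rate below are illustrative placeholders; plug in the figures from your generator's spec sheet and your facility's actual tanks.

    #!/usr/bin/env python3
    """Rough check of generator fuel runtime against a two-day target.

    The tank size and burn rate are illustrative placeholders; use the figures
    from your own generator's spec sheet and your facility's tanks.
    """
    TANK_GALLONS = 2000.0        # usable fuel on site (assumed)
    BURN_GAL_PER_HOUR = 35.0     # consumption at the expected load (assumed)
    TARGET_HOURS = 48            # "two days should be considered a minimum"

    runtime_hours = TANK_GALLONS / BURN_GAL_PER_HOUR
    print(f"Estimated runtime: {runtime_hours:.1f} hours")

    if runtime_hours < TARGET_HOURS:
        shortfall = (TARGET_HOURS - runtime_hours) * BURN_GAL_PER_HOUR
        print(f"About {shortfall:.0f} gallons short of the 48-hour target.")
    else:
        print("Meets the two-day minimum before any refueling deliveries.")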

For East Coast companies, this hurricane was a true test of all DR measures. Going forward, I will continue to publish my findings on failures caused by the hurricane and what effective techniques are being used against them.

A client asked whether a server had gone down after they noticed a reboot in their system. The first resource we check for this is the monitoring logs, and monitoring hadn't detected any downtime. To verify whether the server had actually been rebooted, our technician logged into the server and checked the event log. We discovered that the VM had in fact automatically rebooted.

Our client then asked how often we ping for ICMP monitoring and why we hadn't detected the reboot in the first place. For up/down events we have multiple sensors with two probing cycles: for this particular client, the normal cycle is 60 seconds for ICMP and 10 minutes for port probing and the heartbeat report. Because this is a virtual server, startup was extremely fast. The reboot completed within roughly 45 seconds, so our standard 60-second probing cycle could not have detected it.

Our clients count on us to let them know about any unexpected reboots. We understand that the server could go down in the middle of their transactions, so even the briefest reboot matters. To resolve this issue we implemented another monitoring sensor: we now check whether the reported uptime is smaller than the time elapsed since the last heartbeat. Having these two scripts allows us to detect downtime that falls inside the first monitor's 60-second window, lets us check log changes, and alerts us every 10 minutes.
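
Conceptually, the new sensor's check is simple: if the server reports an uptime shorter than the time since the last heartbeat, it must have rebooted somewhere inside that window, even if every ping in between succeeded. The Python sketch below illustrates that logic; reading /proc/uptime and keeping the heartbeat timestamp in memory are simplified stand-ins for what the monitoring platform actually does.

    #!/usr/bin/env python3
    """Detect a reboot that happened between two monitoring heartbeats.

    If the host's uptime is shorter than the time since the previous heartbeat,
    it must have gone down and come back up inside the probing window. Reading
    /proc/uptime assumes a Linux agent; the in-memory timestamp is a stand-in
    for the monitoring platform's own heartbeat store.
    """
    import time

    HEARTBEAT_INTERVAL_SEC = 600   # the 10-minute heartbeat cycle from above

    def read_uptime_seconds():
        """Return the number of seconds since the host booted (Linux)."""
        with open("/proc/uptime") as f:
            return float(f.read().split()[0])

    def rebooted_since(last_heartbeat):
        """True if the host rebooted after the previous heartbeat was recorded."""
        elapsed = time.time() - last_heartbeat
        return read_uptime_seconds() < elapsed

    last_heartbeat = time.time()
    while True:
        time.sleep(HEARTBEAT_INTERVAL_SEC)
        if rebooted_since(last_heartbeat):
            print("ALERT: host rebooted since the last heartbeat")
        last_heartbeat = time.time()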

This case proves that out-of-the-box solutions may not always work, and that day-to-day business needs force IT teams to constantly modify and improve their operations and techniques.

A CIO/CTO should not be complacent when listening to a marketing or sales pitch about "cloud" availability, global clusters, and so on.

They have to ask the questions: What will happen if one of the datacenters goes down for a day? For an hour?

What if the whole city is out…

Every company has its own tolerance for global failures, and these needs must be accounted for during the solution-building process, at the very beginning of the design stages. Companies should be wary of relying solely on the advertised availability of the selected "cloud(s)."

We have seen numerous cases; two of them are linked below.

Because of these issues and many others, I always question cloud providers hard on what savings they bring to the table versus the added risks that could cost my clients money in the future.

So… until someone can show me otherwise, I encourage being prepared and understanding all of these risks and challenges before implementing any solution. We saw Amazon go down for hours recently, and that should remind us that technology does break!

Amazon Cloud goes Down:

http://www.forbes.com/sites/anthonykosner/2012/06/30/amazon-cloud-goes-down-friday-night-taking-netflix-instagram-and-pinterest-with-it/

Datacenter Fire in Calgary:

http://www.datacenterknowledge.com/archives/2012/07/16/data-center-fire-disrupts-calgary
