Storage High Availability: The Achilles Heel of Server Clusters

Let me start by saying that not every business needs storage high availability in its computing infrastructure. Fault-tolerant systems in general are arguably unimportant to businesses that can tolerate one or two days of downtime. But without fault tolerance, you had better have a well-tested, documented recovery procedure. In fact, a great recovery procedure can probably cut the downtime from a storage system failure to 8 hours or less. In the real world, though, few SMBs have one.

Also, although there are a lot of questionable disaster recovery statistics out there, the fact remains that businesses are remarkably dependent on their computing infrastructure for survival. So it makes sense for businesses to take a close look at the real impact a computing service outage would have. I am going to be pretty skeptical of businesses that think they can go two days without e-mail, for example, and I am going to assume that some level of fault tolerance matters to most businesses.

In this article, I want to focus on a common scenario: businesses that have a server cluster but lack highly available storage. I am not talking about RAID that protects against disk failure, or SAN boxes with redundant controllers and redundant power supplies. I am talking about removing the entire SAN as a single point of failure, because in the real world it will be the one non-fault-tolerant component (the backplane, say) that fails.

The gold standard for larger businesses is probably to have all storage mirrored to a secondary NetApp or EqualLogic box. But that can get pricey. Microsoft is pushing its own Scale-Out File Server in Windows Server 2012, and that looks like it will offer a more robust and possibly cheaper alternative to the traditional SAN. But right now, few enclosures meet the requirements for true high availability (SCSI Enclosure Services version 3, or SES-3).

One cheap solution that we have rolled out in the past for virtualization clusters is to synchronize all of the virtual machines to a secondary server with plenty of storage, using a tool like vReplicator or Veeam. If the SAN fails, you can fire up a recent copy of the production VMs on this machine, albeit with reduced performance.
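How recent that "recent copy" is deserves monitoring, since a replica that silently stopped syncing is of little use when the SAN dies. Here is a minimal sketch of such a check in Python. It assumes you can obtain a last-successful-sync timestamp per VM from whatever your replication tool reports; it does not call any vReplicator or Veeam API, and the VM names and the four-hour threshold are purely hypothetical.

```python
from datetime import datetime, timedelta


def stale_replicas(last_sync, now, max_age):
    """Return the names of VMs whose replica is older than max_age, sorted."""
    return sorted(
        name for name, synced_at in last_sync.items()
        if now - synced_at > max_age
    )


# Hypothetical data: VM name -> time of the last successful replication.
EXAMPLE_NOW = datetime(2013, 1, 15, 12, 0)
EXAMPLE_LAST_SYNC = {
    "mail01": datetime(2013, 1, 15, 11, 30),  # 30 minutes old
    "sql01": datetime(2013, 1, 15, 2, 0),     # 10 hours old
    "file01": datetime(2013, 1, 14, 12, 0),   # 24 hours old
}

if __name__ == "__main__":
    # Flag anything that has not replicated in the last four hours.
    for name in stale_replicas(EXAMPLE_LAST_SYNC, EXAMPLE_NOW, timedelta(hours=4)):
        print(f"WARNING: replica of {name} is stale")
```

The threshold is effectively your recovery point: if you would not be comfortable losing four hours of mail, pick a smaller number and replicate more often.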

A more expensive solution that we have worked with comes from nScaled and uses a FalconStor agent installed on each VM or physical server. Disk snapshots are replicated to a custom-built ESXi server with plenty of storage and then replicated from there up to the cloud. That is a very robust solution that allows for either local or cloud recovery. But it’s pricey.

So another solution we are working on in the lab right now is to build out Linux servers running InfiniBand and DRBD as a cheap, highly available SAN alternative. It’s true that snapshotting will be tricky, and this definitely isn’t an off-the-shelf solution with a friendly GUI for administration. But what it lacks in polish, it more than makes up for in raw functionality. We can provide fully fault-tolerant storage for a fraction of the cost of even a single enterprise SAN. Building out and expanding storage is cheaper as well, because we can use after-market drives instead of paying the outrageous prices NetApp or Dell charge for hard drives. SSDs even become a viable option for a broader range of applications at this price point.
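Since there is no friendly GUI, you end up scripting your own health checks. As a rough sketch, assuming DRBD 8.x (where /proc/drbd lists each device with its connection state `cs:`, roles `ro:`, and disk states `ds:`), the Python below flags any device that is not connected to its peer with both sides up to date. The sample text is illustrative, not output from our lab, and newer DRBD releases report status differently, so treat this as a starting point only.

```python
import re

# Matches device lines such as:
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
# Lines without ro:/ds: fields (e.g. unconfigured devices) are skipped.
_STATUS = re.compile(r"^\s*(\d+): cs:(\S+) ro:(\S+) ds:(\S+)", re.MULTILINE)


def degraded_devices(proc_drbd_text):
    """Return {minor: (connection_state, disk_states)} for unhealthy devices."""
    problems = {}
    for minor, cs, _roles, ds in _STATUS.findall(proc_drbd_text):
        if cs != "Connected" or ds != "UpToDate/UpToDate":
            problems[int(minor)] = (cs, ds)
    return problems


# Illustrative sample in the DRBD 8.x /proc/drbd layout (not real output).
SAMPLE = """\
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
"""

if __name__ == "__main__":
    # On a real DRBD 8.x node you would read the text from /proc/drbd instead.
    for minor, (cs, ds) in degraded_devices(SAMPLE).items():
        print(f"drbd{minor} degraded: cs={cs} ds={ds}")
```

Run from cron or a monitoring agent, a check like this tells you when the mirror has quietly dropped to a single copy, which is exactly the moment your storage stops being highly available.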

I hope that I have shown that there is a broad range of highly available storage solutions to match practically any budget. I strongly advise all businesses to consider adding this feature to their computing systems architecture. Please feel free to contact us at info@aviator-it.com if you want to discuss your own challenges.
