Thursday, July 31, 2008

What good is a Business Continuity Plan if the hotsite is also under 20 feet of water?

The other day, someone asked me an interesting question: “What good is a Business Continuity Plan if the client’s hotsite is [also] under 20 feet of water?”

First, at a very high level, a good Business Continuity Plan is more business- and process-centric than data- and IT-centric. A BCP does not replace a functioning Disaster Recovery Plan, but rather builds upon it or includes it (assuming that the DRP is viable and tested) as one of its components.
The lines between a DRP and BCP are slightly blurred as each installation tends to implement things just a little bit differently.

A good DR plan is one that, when exercised (tested) or executed for real, recovers the data and IT infrastructure (including circuits, etc.) and provides a usable environment at a pre-determined level of service. In many cases the DRP may not attempt to replicate the entire production environment, but will provide a lesser, yet agreed-upon, level of service until a return-to-normal is possible or the production environment can be rebuilt.

In the situation where “the client’s hotsite is [also] under 20 feet of water” I have a couple of different observations but, unfortunately, no real solution to offer.

We can infer from the fact that the client did establish a hotsite that there was some sort of a Disaster Recovery Plan. Perhaps not as complete or effective as it could have been, but still a plan nonetheless. But was the plan successful? The most obvious answer is “NO” since the recovery site was also under water.

But is that really true? What if the DR planning process had correctly identified the risk of the hotsite suffering a concurrent failure AND management either actively accepted the risk or simply decided not to fund a solution?

In this case, the DR plan did not fail. I’d be hard-pressed to call it a “success,” but one could honestly say that it worked as expected given the situation. Now I know that this is very much like saying “The operation was a success but the patient died.” However, this does underscore the idea that simply having a DR plan is insufficient to protect the enterprise. The DR plan must be frequently tested and also be designed to support a wide range of possible situations.

Even if we find that the DRP didn’t fail, the Business Continuity Plan (or the lack of one) failed miserably. No business continuity is possible for this client, and the possibility of eventual recovery is dependent upon luck and superhuman efforts.

If, however, the risk associated with the hotsite flooding had not been identified to management, then the Disaster Recovery Planning failed along with the Business Continuity Plan.

Disaster Recovery and Business Continuity are both similar to an insurance policy: it is difficult to quantify a financial return on investment, and by the time any of the (obvious) benefits become tangible, it’s way too late to implement!

That’s one of the reasons why governing bodies have been forced to mandate recovery. One of the best sets of guidelines that I’ve seen is the draft regulations for financial institutions that were published after 9/11: “Interagency Concept Release: Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System”. [Links to this and similar documents can be found by visiting NASRP.]

The requirements were something like “each financial institution should have [at least] two data centers [at least] 200-300 miles apart. Both locations should be staffed and either location must be capable of supporting the entire workload.” Unfortunately, these regulations were never signed into law. I suspect this may be due in part to the realization by some elected officials that the mileage limit would move part of the tax revenue out of their local area!

Still, the guidelines were sound, at least as a starting point. Would that have prevented the submerged hotsite example? Maybe, maybe not. It depends on distance and many other factors. Even following these draft regulations, the possibility of multiple concurrent failures exists. There simply isn’t a guarantee.

This is precisely why some companies that are very serious about business continuity have gone beyond having a single hotsite and instead have moved into a three-site configuration. As you might imagine, there are several variations of a three-site BCDR configuration. One of my favorites is where there is a “local” (say, less than 100 miles distant) active mirror site that can instantly assume the workload (Recovery Time Objective = zero) with zero data loss (Recovery Point Objective = zero). This can be achieved by synchronous mirroring of the data. Because each write must be confirmed at both sites before it completes, every added mile adds latency, which tends to limit the distance between the two sites.
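To see why synchronous mirroring keeps the second site “local,” here is a rough back-of-the-envelope sketch. It assumes (my assumptions, not anything from a vendor) that light moves through fiber at about 200,000 km per second, that the fiber runs in a straight line between the sites, and that each mirrored write costs one round trip. Real circuits take longer routes and add equipment delay, so treat the result as a floor, not a prediction. The function name is made up for illustration.

```python
# Rough floor on the latency that distance adds to each synchronously mirrored write.
# Assumptions (illustrative only): ~200,000 km/s propagation in fiber, a straight-line
# fiber route, and one round trip per write unless told otherwise.

KM_PER_MILE = 1.609344
FIBER_KM_PER_SEC = 200_000.0  # roughly two-thirds of the speed of light in a vacuum


def sync_write_penalty_ms(distance_miles: float, round_trips: int = 1) -> float:
    """Minimum extra latency per write, in milliseconds, from propagation delay alone."""
    if distance_miles < 0 or round_trips < 1:
        raise ValueError("distance must be non-negative and round_trips at least 1")
    one_way_sec = distance_miles * KM_PER_MILE / FIBER_KM_PER_SEC
    return 2 * one_way_sec * round_trips * 1000.0


if __name__ == "__main__":
    for miles in (10, 100, 1000, 2000):
        print(f"{miles:>5} miles: at least {sync_write_penalty_ms(miles):.2f} ms per write")
```

Under these assumptions a site 100 miles away adds a bit over a millisecond and a half to every write, while a site 2000 miles away adds more than thirty. A busy transactional workload pays that penalty on every single write, which is why distance becomes the limiting factor.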

The third site is much farther away, say 1000 to 2000 miles. Data is propagated using an asynchronous data mirroring process. Because asynchronous mirroring is used to propagate data into this third site, some data can be lost. This equates to a Recovery Point Objective > zero. The amount of data that can be lost is anywhere from a “few seconds” to an hour or so, based on several factors including distance, available bandwidth and the particular mirroring methodology implemented. Generally this tertiary site will have a Recovery Time Objective > zero as well, as some additional processing or recovery steps may be needed before full services can be restored.
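How bandwidth and methodology feed into that Recovery Point Objective can be sketched with a deliberately simplified model. It assumes a mirroring product that ships changes in fixed cycles (the cycle length, change rate and link speed below are all hypothetical inputs, not figures from any particular product). In that model the worst-case exposure is one full cycle plus the time to transmit that cycle’s changes plus the one-way network delay.

```python
import math


def estimated_rpo_seconds(
    change_rate_mbps: float,
    link_mbps: float,
    cycle_seconds: float,
    one_way_latency_seconds: float = 0.0,
) -> float:
    """Worst-case data exposure for a cycle-based asynchronous mirror (simplified model).

    A write made just after a cycle starts waits a full cycle to be captured,
    then waits for that cycle's changes to cross the link.
    """
    if change_rate_mbps < 0 or link_mbps <= 0 or cycle_seconds <= 0:
        raise ValueError("rates and cycle length must be positive")
    if change_rate_mbps > link_mbps:
        # The link cannot keep up, so the backlog (and the exposure) grows without bound.
        return math.inf
    transfer_seconds = change_rate_mbps * cycle_seconds / link_mbps
    return cycle_seconds + transfer_seconds + one_way_latency_seconds


if __name__ == "__main__":
    # Hypothetical numbers: 40 Mbit/s of changes, a 100 Mbit/s link, 30-second cycles.
    print(estimated_rpo_seconds(40, 100, 30, one_way_latency_seconds=0.016))
```

The useful takeaway is the infinite case: if the data changes faster than the link can carry it, no mirroring methodology will hold the data loss to a fixed window.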

Is this a “belt & suspenders” approach? Yes it is. Is it expensive? Yes it is. Is it necessary for every client environment? Certainly not. But it is appropriate for some environments. Is this solution sufficient to guarantee recovery in all situations? No, there is still no 100% guarantee even with this solution.

With these issues in mind, I try to approach each Business Continuity Planning (or Disaster Recovery Planning) effort with the following two thoughts:

• The recovery planning process is essentially additive: Once you have a process that works up to its limits, then it’s time to evaluate solutions that address issues not covered by the current plan. In this fashion, there is always another recovery solution that can be added to the current BCDR environment, but each additional solution brings with it additional costs and provides marginally less benefit.
• At some point, the cost of the possible additive solutions will exceed what the company is able (or willing) to fund. Both the costs and alternatives must be clearly understood for management to make a decision that is appropriate for that company and situation.
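These two thoughts can be sketched as a small calculation. Everything in it is hypothetical: the layer names, the costs and the “share of scenarios covered” figures are invented purely to illustrate the shape of the trade-off, and a real risk assessment would produce its own numbers. Layers are added in order until the next one would exceed the budget, and whatever is left uncovered is the residual risk that management needs to see.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class RecoveryLayer:
    name: str
    annual_cost: float
    added_coverage: float  # extra share of disaster scenarios covered (0..1), hypothetical


def plan_within_budget(
    layers: Sequence[RecoveryLayer], budget: float
) -> Tuple[List[RecoveryLayer], float, float]:
    """Add layers in order until the next one no longer fits the budget.

    Returns the funded layers, the total spent, and the residual (uncovered) share.
    """
    funded: List[RecoveryLayer] = []
    spent = 0.0
    for layer in layers:
        if spent + layer.annual_cost > budget:
            break  # this is the point where cost exceeds what the company will fund
        funded.append(layer)
        spent += layer.annual_cost
    residual = 1.0 - sum(layer.added_coverage for layer in funded)
    return funded, spent, residual


# Illustrative figures only: each layer costs more and covers less than the one before.
LAYERS = [
    RecoveryLayer("tested off-site backups", 100_000, 0.60),
    RecoveryLayer("single hotsite", 400_000, 0.25),
    RecoveryLayer("local synchronous mirror", 1_500_000, 0.10),
    RecoveryLayer("remote asynchronous third site", 3_000_000, 0.04),
]

if __name__ == "__main__":
    funded, spent, residual = plan_within_budget(LAYERS, budget=2_500_000)
    for layer in funded:
        print(f"funded: {layer.name}")
    print(f"spent: {spent:,.0f}  residual risk: {residual:.0%}")
```

Note that even with every layer funded, the residual in this made-up table never reaches zero, which is the same point the three-site discussion makes.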

In summary, no BCDR solution can provide a 100% guarantee. It is very important that the limits and risks of the existing plans are correctly identified before disaster strikes.
