Thursday, July 31, 2008

What good is a Business Continuity Plan if the hotsite is also under 20 feet of water?

The other day, someone asked me an interesting question: “What good is a Business Continuity Plan if the client’s hotsite is [also] under 20 feet of water?”

First, at a very high level, a good Business Continuity Plan is more business- and process-centric than data- and IT-centric. A BCP does not replace a functioning Disaster Recovery Plan, but rather builds upon it or includes it (assuming that the DRP is viable and tested) as one of its components.
The lines between a DRP and BCP are slightly blurred as each installation tends to implement things just a little bit differently.

A good DR plan is one that, when exercised (tested) or executed for real, recovers the data and IT infrastructure (including circuits, etc.) and provides a usable environment at a pre-determined level of service. In many cases the DRP may not attempt to replicate the entire production environment, but will provide a lesser, yet agreed-upon, level of service until a return-to-normal is possible or the production environment can be rebuilt.

In the situation where “the client’s hotsite is [also] under 20 feet of water” I have a couple of different observations, but unfortunately, no real solution to offer.

We can infer from the fact that the client did establish a hotsite that there was some sort of a Disaster Recovery Plan. Perhaps not as complete or effective as it could have been, but a plan nonetheless. But was the plan successful? The most obvious answer is “NO” since the recovery site was also under water.

But, is that really true? What if the DR planning process had correctly identified the risk of the hotsite suffering a concurrent failure AND management either actively accepted the risk or simply decided not to fund a solution?

In this case, the DR plan did not fail. I’d be hard pressed to call it a “success,” but one could honestly say that it worked as expected given the situation. Now I know that this is very much like saying “The operation was a success but the patient died.” However, this does underscore the idea that simply having a DR plan is insufficient to protect the enterprise. The DR plan must be frequently tested and also be designed to support a wide range of possible situations.

Even if we find that the DRP didn’t fail, the Business Continuity Plan (or lack thereof) failed miserably. No business continuity is possible for this client, and any eventual recovery is dependent upon luck and superhuman effort.

If however, the risk associated with the hotsite flooding had not been identified to management, then the Disaster Recovery Planning failed as well as the Business Continuity Plan.

Disaster Recovery and Business Continuity are both similar to an insurance policy: it is difficult to quantify a financial return on investment, and by the time any of the (obvious) benefits become tangible it’s way too late to implement!

That’s one of the reasons why governing bodies have been forced to mandate recovery. One of the best sets of guidelines that I’ve seen is the draft regulation for financial institutions that was published after 9/11: “Interagency Concept Release: Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U. S. Financial System”. [Links to this and similar documents can be found by visiting NASRP.]

The requirements were something like “each financial institution should have [at least] two data centers [at least] 200-300 miles apart. Both locations should be staffed and either location must be capable of supporting the entire workload.” Unfortunately, these regulations were never signed into law. I suspect this may be due in part to the realization by some elected officials that the mileage limit would move part of the tax revenue out of their local area!

Still, the guidelines were sound – at least as a starting point. Would that have prevented the submersed hotsite example? Maybe / maybe not. It depends on distance and many other factors. Even following these draft regulations, the possibility of multiple concurrent failures exists. There simply isn’t a guarantee.

This is precisely why some companies that are very serious about business continuity have gone beyond having a single hotsite and instead have moved into a three-site configuration. As you might imagine, there are several variations of a three-site BCDR configuration. One of my favorites is where there is a “local” (say, less than 100 miles distant) active mirror site that can instantly assume the workload (Recovery Time Objective = zero) with zero data loss (Recovery Point Objective = zero). This can be achieved by synchronous mirroring of the data, but does tend to limit the distance between the two sites.

The third site is much farther away – say 1000 to 2000 miles away. Data is propagated using an asynchronous data mirroring process. Because asynchronous mirroring is used to propagate data into this third site, some data can be lost, which equates to a Recovery Point Objective > zero. The amount of data that can be lost is anywhere from a “few seconds” to an hour or so, based on several factors including distance, available bandwidth and the particular mirroring methodology implemented. Generally this tertiary site will have a Recovery Time Objective > zero as well, as some additional processing or recovery steps may be needed before full services can be restored.
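
To make the RPO discussion a bit more concrete, here is a minimal back-of-the-envelope sketch in Python. All of the input figures (write rate, replication bandwidth, distance, backlog) are assumptions invented for illustration, not measurements from any real configuration; the point is simply how the asynchronous lag translates into potential data loss.

```python
# Rough worst-case RPO estimate for an asynchronous third site.
# All input values are illustrative assumptions, not measurements.

def async_rpo_seconds(write_rate_mb_s, replication_bw_mb_s,
                      one_way_distance_km, burst_backlog_mb=0.0):
    """Approximate worst-case lag (seconds) of the remote copy behind production."""
    # Propagation delay: light in fiber covers roughly 200 km per millisecond.
    propagation_s = (one_way_distance_km / 200.0) / 1000.0

    # If production writes faster than the link can drain, a backlog builds up;
    # here we only model draining an assumed burst backlog.
    drain_bw = max(replication_bw_mb_s - write_rate_mb_s, 0.001)
    backlog_s = burst_backlog_mb / drain_bw

    return propagation_s + backlog_s

# Synchronous local mirror: every write waits for the mirror, so RPO = 0.
# Asynchronous remote mirror 1,500 km away, with an assumed 50 MB/s write rate,
# 100 MB/s of replication bandwidth and a 2 GB burst backlog:
lag = async_rpo_seconds(write_rate_mb_s=50, replication_bw_mb_s=100,
                        one_way_distance_km=1500, burst_backlog_mb=2048)
print(f"Estimated worst-case exposure (RPO) at the third site: ~{lag:.0f} seconds")
```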

Is this a “belt & suspenders” approach? Yes it is. Is it expensive? Yes it is. Is it necessary for every client environment? Certainly not. But it is appropriate for some environments. Is this solution sufficient to guarantee recovery in all situations? No, there is still no 100% guarantee even with this solution.

With these issues in mind, I try to approach each Business Continuity Planning (or Disaster Recovery Planning) effort with the following two thoughts:

• The recovery planning process is essentially additive: Once you have a process that works up to its limits, then it’s time to evaluate solutions that address issues not covered by the current plan. In this fashion, there is always another recovery solution that can be added to the current BCDR environment, but each additional solution brings with it additional costs and provides progressively less benefit.
• At some point, the cost of the possible additive solutions will exceed what the company is able (or willing) to fund. Both the costs and alternatives must be clearly understood for management to make a decision that is appropriate for that company and situation (a simple sketch of this trade-off follows below).
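
As a purely illustrative sketch of that trade-off, the funding decision can be thought of as walking down a list of candidate additions, each one costing more and closing a smaller remaining gap, until the budget runs out. The solution names, costs, budget and risk-reduction figures below are all invented for the example.

```python
# Illustrative only: candidate additions to an existing BCDR environment,
# each with an assumed annual cost and an assumed reduction in residual risk.
candidates = [
    ("Off-site tape vaulting",            50_000, 0.30),
    ("Asynchronous remote mirroring",    400_000, 0.25),
    ("Synchronous local mirror site",    900_000, 0.15),
    ("Third (out-of-region) site",     1_500_000, 0.05),
]

budget = 1_000_000
spent, residual_risk = 0, 1.00

for name, cost, risk_reduction in candidates:
    if spent + cost > budget:
        print(f"Stopped before '{name}': its cost exceeds the remaining budget")
        break
    spent += cost
    residual_risk -= risk_reduction
    print(f"{name}: spend {cost:,}, residual risk now {residual_risk:.0%}")

# Residual risk never reaches zero; each added layer costs more and
# buys a smaller improvement than the one before it.
```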

In summary, no BCDR solution can provide a 100% guarantee. It is very important that the limits and risks of the existing plans are correctly identified before disaster strikes.

Sunday, July 20, 2008

zIIP Assisted z/OS Global Mirror: Significant cost reduction

Going back through some of the more recent processor announcements reminded me that there are some significant cost savings available to the hundreds of XRC (Global Mirror for zSeries) customers out there.

Global Mirror for zSeries is, of course, an asynchronous disk mirroring technique that mirrors mainframe data across any distance. This process is controlled by the SDM (System Data Mover), which is application code running in one or more z/OS LPARs – usually at the recovery site.

Depending upon the size of the configuration (the number of volumes to be mirrored and the number of XRC “readers” defined), this can place a significant load on the processor CPs and may require multiple CPs just to support the SDMs.

Beginning with z/OS 1.8, IBM began enabling much of the SDM code to be eligible to run on the IBM System z9 and z10 Integrated Information Processors (zIIPs). The zIIP assisted z/OS Global Mirror functions can provide better price performance and improved utilization of resources at the mirrored site.

This improved TCO (Total Cost of Ownership) is accomplished in two ways: First, IBM charges less for the specialty processors – like zIIPs – than for the standard general purpose CPs. Second, the processing power used on zIIP processors enabled on the physical z9 or z10 CEC is NOT included in the general MSU (“Millions of Service Units”) figure that vendors use to determine your software charges.
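
As a rough, hedged illustration of that second point, the arithmetic looks something like the sketch below. The MSU consumption, zIIP-eligible fraction and per-MSU software rate are all assumed figures chosen for the example, not IBM pricing or measurements from any real SDM configuration.

```python
# Illustrative arithmetic only; every figure below is an assumption, not IBM pricing.
sdm_msu_on_general_cps = 80      # assumed MSUs the SDMs currently consume on general CPs
ziip_eligible_fraction = 0.90    # assumed portion of SDM work that is zIIP-eligible
software_cost_per_msu = 2_500    # assumed monthly software charge per MSU

# MSU-based software charges apply only to work running on general purpose CPs,
# so whatever moves to the zIIPs drops out of the monthly software bill.
offloaded_msu = sdm_msu_on_general_cps * ziip_eligible_fraction
monthly_software_savings = offloaded_msu * software_cost_per_msu

print(f"MSUs moved off the general CPs: {offloaded_msu:.0f}")
print(f"Estimated monthly software savings: ${monthly_software_savings:,.0f}")
```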

Thus, if you are not currently offloading the SDM processing onto the zIIP processors on your z9 or z10, then you could experience some significant cost savings by moving towards this configuration.

Friday, July 18, 2008

IBM and Sun both announce 1TB tape drives

It has been an interesting week, but certainly the most significant storage news is that IBM and Sun both introduced one-terabyte tape drives.

IBM and Sun Microsystems have once again brought enterprise tape storage drives into the spotlight: Sun's announcement was made on Monday July 14 and IBM announced its new product just a day later. Now, whichever vendor you embrace, you have new options for enterprise tape storage at a lower TCO (total cost of ownership) and increased storage capacity.

On Monday, with an exclamation of “Bigger is Better, Biggest is Best,” Sun announced that it had succeeded in developing the very first one-terabyte tape drive, the Sun StorageTek T10000B. This new drive provides a maximum of 1 TB of storage capacity on a single cartridge for open or mainframe system environments. Unfortunately for Sun, the bragging rights over “biggest” were short-lived, as the very next day IBM announced a new tape drive that offers the same capacity as the Sun drive but is also faster. Named the TS1130, IBM's new device will store up to 1 TB of data per cartridge and offers a native data rate of 160 MB/s, compared to 120 MB/s for the T10000B.
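
To put those two native data rates in perspective, here is a quick sketch of how long each drive would take to fill a full cartridge. It assumes decimal units (1 TB = 1,000,000 MB) and ignores compression, cartridge load/position time and any host or channel bottlenecks, so treat the results as rough figures only.

```python
# Time to write a full 1 TB cartridge at the native (uncompressed) data rate.
# Assumes decimal units and ignores compression, load time and host bottlenecks.
CARTRIDGE_MB = 1_000_000

for drive, native_mb_s in [("Sun T10000B", 120), ("IBM TS1130", 160)]:
    hours = CARTRIDGE_MB / native_mb_s / 3600
    print(f"{drive}: roughly {hours:.1f} hours to fill a 1 TB cartridge")
```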

Both drives re-use existing media, thus providing backward read/write compatibility and asset protection for the current customers, and claim to support up to 1 TB of native capacity (uncompressed) per tape cartridge.

The T10000B (like previous drives) has the ‘control unit’ function built into the drive and supports FICON and Fibre Channel.

The TS1130 has dual FC ports and can connect directly to open systems servers; FICON and ESCON attachment is available utilizing the TS1120 or 3592 Tape Controllers.

Here is a side by side comparison of some of the “speeds and feeds”:



Sun T10000B vs. IBM TS1130

Performance
• Data transfer rate, native (uncompressed): Sun T10000B 120 MB/sec; IBM TS1130 160 MB/sec
• Data transfer rate, maximum (compressed): Sun T10000B 360 MB/sec (4 Gb interface); IBM TS1130 400 MB/sec

Capacity
• Capacity, native (uncompressed): Sun T10000B 1 TB (240 GB for the Sport cartridge); IBM TS1130 1 TB (using JB/JX media), 640 GB (using JA/JW media) or 128 GB (using JJ/JR media)

Data connectivity
• Interface: Sun T10000B 4 Gb Fibre Channel, FICON; IBM TS1130 dual-ported 4-Gbps native switched fabric Fibre Channel. The IBM drives can be directly attached to open systems servers with Fibre Channel, or to ESCON or FICON servers with the TS1120 Tape Controller Model C06 or the IBM Enterprise Tape Controller 3592 Model J70.
• Interface specifications (Fibre Channel), Sun T10000B: N and NL port, FC-AL-2, FCP-2, FC-tape, 4 Gb FC
• Read/write compatibility, Sun T10000B: T10000 format

Mechanical
• Height: Sun T10000B 3.5 in. (8.89 cm); IBM TS1130 3.8 in. (95 mm)
• Depth: Sun T10000B 16.75 in. (42.55 cm); IBM TS1130 18.4 in. (467 mm)
• Width: Sun T10000B 5.75 in. (14.61 cm); IBM TS1130 5.9 in. (150 mm)

Environmental
• Operating temperature: Sun T10000B +50°F to +104°F (+10°C to +40°C); IBM TS1130 16°C to 32°C (60°F to 90°F)
• Operating relative humidity: Sun T10000B 20% to 80%; IBM TS1130 20% to 80% non-condensing (limited by media)

Tape format
• Recording format: Sun T10000B linear serpentine; IBM TS1130 linear serpentine

Power
• Consumption/dissipation (operating maximum continuous, not peak): Sun T10000B 63 W (drive only), 90 W (including power supply); IBM TS1130 46 W (drive and integrated blower)

Cooling
• Heat dissipation (operating maximum continuous, not peak): Sun T10000B 420 BTU/hr; IBM TS1130 307 BTU/hr

Encryption
• Sun T10000B: The crypto-ready StorageTek T10000B tape drive works in conjunction with the Sun Crypto Key Management System (KMS). The KMS delivers a simple, secure, centralized solution for managing the keys used to encrypt and decrypt data written by the T10000B tape drive. Developed on open security standards, the KMS consists of the Key Management Appliance (a security-hardened Sun Fire x2100 M2 rack-mounted server) and the KMS Manager graphical user interface (GUI) that is executed on a workstation. The KMS runs without regard to application, operating platform, or primary storage device, and complies with Federal Information Processing Standard (FIPS) 140-2 certification. Requirements and specifications may change, so check with your Sun representative.
• IBM TS1130: Built-in encryption of a tape's contents for z/OS, z/VM, IBM i, AIX, HP, Sun, Linux and Windows