Sunday, December 28, 2008

As 2008 draws to a close...

As 2008 draws to a close, I am reminded of some of the more interesting storage announcements of the last half of 2008. For those of you who might have missed some of these, here is a second chance.

High Performance FICON (zHPF) and Incremental Resync


IBM October 21, 2008

These announcements may have been overlooked because of the z10 processor announcements made on the same day.

Links:
IBM System Storage DS6000 series Models 511 and EX1 offer a new price/performance option with 450 GB 15,000 rpm disk drives

IBM System Storage DS8000 series (Machine type 2107) delivers new functional capabilities (zHPF and RMZ resync) for System z environments

IBM System Storage DS8000 series (Machine types 2421, 2422, 2423, and 2424) delivers new functional capabilities (zHPF and RMZ resync)

High Performance FICON for System z improves performance


IBM now provides High Performance FICON for System z (zHPF). Previously, FICON with DS8000 series functions provided a high-speed connection supporting multiplexed operation. High Performance FICON takes advantage of the hardware available today, with enhancements designed to reduce the overhead associated with supported commands, which can improve FICON I/O throughput on a single DS8000 port by 100%.

Enhancements have been made to the z/Architecture® and the FICON interface architecture to deliver improvements for online transaction processing (OLTP) workloads.

When exploited by the FICON channel, the z/OS operating system, and the control unit, zHPF is designed to help reduce overhead and improve performance. The changes to the architectures offer end-to-end system enhancements to improve reliability, availability, and serviceability (RAS). Existing adapters will be able to handle an intermix of transactions using FCP, FICON, and High Performance FICON protocols.

Realistic production workloads with a mix of data transfer sizes can see up to 30 to 70% of FICON I/Os utilizing zHPF, resulting in up to a 10 to 30% savings in channel utilization.

z/OS Metro/Global Mirror Incremental Resync allows efficient replication


The IBM DS8000 series now supports z/OS Metro/Global Mirror Incremental Resync, which can eliminate the need for a full copy after a HyperSwap™ situation in 3-site z/OS Metro/Global Mirror configurations.

Previously, the DS8000 series supported z/OS Metro/Global Mirror, which is a 3-site mirroring solution that utilizes IBM System Storage Metro Mirror and z/OS Global Mirror (XRC). The z/OS Metro/Global Mirror Incremental Resync capability is intended to enhance this solution by enabling resynchronization of data between sites using only the changed data from the Metro Mirror target to the z/OS Global Mirror target after a GDPS® HyperSwap. This can significantly reduce the amount of data to be copied after a HyperSwap situation and improve the resilience of an overall 3-site disaster recovery solution by reducing resync times.

And if you would like to review the z10 processor announcements, the links can be found here:

IBM System z10 Enterprise Class - The future runs on System z10, the future begins today

IBM System z10 Business Class - The smart choice for your business. z can do IT better

IBM System Storage DS8000 series delivers new flexibility and data protection options


August 12, 2008

Links:

IBM System Storage DS8000 series (Machine types 2421, 2422, 2423, and 2424) delivers new flexibility and data protection options

IBM System Storage DS8000 series (Machine type 2107) delivers new flexibility and data protection options

New functional capabilities for the DS8000™ series include:
* RAID-6
* 450 GB 15,000 rpm Fibre Channel disk drives
* Variable LPAR
* Extended Address Volumes
* IPv6


IBM XIV Storage System: designed to provide grid-based, enterprise-class storage capabilities


August 12, 2008

IBM XIV Storage System: designed to provide grid-based, enterprise-class storage capabilities

The IBM XIV Storage System is designed to be a scalable enterprise storage system based upon a grid array of hardware components. XIV is designed to:
* Support customers requiring Fibre Channel (FC) or Internet Small Computer System Interface (iSCSI) host connectivity
* Provide a high level of consistent performance with no data hot-spots in the grid-based storage device
* Provide high redundancy through the use of unique grid-based rebuild technology
* Provide support for 180 SATA hard drive spindles providing up to 79 TB of usable capacity
* Provide Capacity on Demand option starting at 21.2 TB raw capacity
* Support 24 x 4Gb Fibre Channel ports for host connectivity
* Support 6 x 1Gb iSCSI ports for host connectivity
* Support 120 GB of total system cache

Monday, August 11, 2008

Disasters come in all sizes

Burst pipe damages three businesses


Three businesses in the Millyard Technology Park were damaged by water Saturday when a pipe burst.

A restaurant and two technology companies were affected, according to Nashua Fire Rescue Lt. Byron Breda.

The pipe was in the third-floor restaurant, and water was spilling out for about an hour before the soaked fixtures triggered the fire alarm, Breda said. The water leaked from the restaurant to the second and first floors, he said.

A computer company sustained the worst damage because its mainframe got soaked, Breda said.

After arriving on scene, fire crews shut off the water, tried to recover as much property as possible and pumped out the water, Breda said. The dollar value of the damage is unclear, Breda said.


--Published: Sunday, August 10, 2008 in the Nashua Telegraph

Thursday, July 31, 2008

What good is a Business Continuity Plan if the hotsite is also under 20 feet of water?

The other day, someone asked me an interesting question: “What good is a Business Continuity Plan if the client’s hotsite is [also] under 20-feet of water?”

First, at a very high level, a good Business Continuity Plan is more business- and process-centric than data- and IT-centric. A BCP does not replace a functioning Disaster Recovery Plan, but rather builds upon it or includes it (assuming that the DRP is viable and tested) as one of its components.
The lines between a DRP and BCP are slightly blurred as each installation tends to implement things just a little bit differently.

A good DR plan is one that, when exercised (tested) or executed for real, recovers the data and IT infrastructure (including circuits, etc.) and provides a usable environment at a pre-determined level of service. In many cases the DRP may not attempt to replicate the entire production environment, but will provide a lesser, yet agreed-upon, level of service until a return-to-normal is possible or the production environment can be rebuilt.

In the situation where “the client’s hotsite is [also] under 20-feet of water” I have a couple of different observations, but unfortunately, no real solution to offer.

We can infer from the fact that the client did establish a hotsite that there was some sort of a Disaster Recovery Plan. Perhaps not as complete or effective as it could have been, but still a plan nonetheless. But was the plan successful? The most obvious answer is “NO” since the recovery site was also under water.

But, is that really true? What if the DR planning process had correctly identified the risk of the hotsite suffering a concurrent failure AND management either actively accepted the risk or simply decided not to fund a solution?

In this case, the DR plan did not fail. I’d be hard pressed to call it a “success,” but one could honestly say that it worked as expected given the situation. Now I know that this is very much like saying “The operation was a success but the patient died.” However, this does underscore the idea that simply having a DR plan is insufficient to protect the enterprise. The DR plan must be frequently tested and also be designed to support a wide range of possible situations.

If we find that the DRP didn’t fail, then the Business Continuity Plan, or BCP (or lack of one), failed miserably. No business continuity is possible for this client, and the possibility of eventual recovery is dependent upon luck and superhuman effort.

If however, the risk associated with the hotsite flooding had not been identified to management, then the Disaster Recovery Planning failed as well as the Business Continuity Plan.

Disaster Recovery and Business Continuity are both similar to an insurance policy: It is difficult to quantify a financial return on investment and by the time any of the (obvious) benefits become tangible it’s way too late to implement!

That’s one of the reasons why governing bodies have been forced to mandate recovery. One of the best set of guidelines that I’ve seen are the draft regulations for financial institutions that were published after 9/11: “Interagency Concept Release: Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U. S. Financial System”. [Links to this and similar documents can be found by visiting NASRP.]

The requirements were something like “each financial institution should have [at least] two data centers [at least] 200-300 miles apart. Both locations should be staffed and either location must be capable of supporting the entire workload.” Unfortunately, these regulations were never signed into law. I suspect this may be due in part to the realization by some elected officials that the mileage limit would move part of the tax revenue out of their local area!

Still, the guidelines were sound – at least as a starting point. Would that have prevented the submersed hotsite example? Maybe / maybe not. It depends on distance and many other factors. Even following these draft regulations, the possibility of multiple concurrent failures exists. There simply isn’t a guarantee.

This is precisely why some companies that are very serious about business continuity have gone beyond having a single hotsite and instead have moved into a three-site configuration. As you might imagine, there are several variations of a three-site BCDR configuration. One of my favorites is where there is a “local” (say, less than 100 miles distant) active mirror site that can instantly assume the workload (Recovery Time Objective = zero) with zero data loss (Recovery Point Objective = zero). This can be achieved by synchronous mirroring of the data, but does tend to limit the distance between the two sites.

The third site is much farther away – say 1000 to 2000 miles away. Data is propagated using an asynchronous data mirroring process. Because asynchronous mirroring is used to propagate data into this third site, there can be some data lost. This will equate to a Recovery Point Objective > zero. The amount of data that can be lost is anywhere from a “few seconds” to an hour or so based on several factors including distance, available bandwidth and the particular mirroring methodology implemented. Generally this tertiary site will have a Recovery Time Objective > zero as well, as some additional processing or recovery steps may be needed before full services can be restored.
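To make the Recovery Point Objective discussion a bit more concrete, here is a back-of-envelope sketch of how RPO behaves on the asynchronous leg. Every number (write rate, link bandwidth, burst length, cycle time) is invented for illustration; the actual exposure depends on the mirroring product, the distance and the workload profile.

# Back-of-envelope RPO estimate for the asynchronous third site.
# All figures below are hypothetical; plug in your own workload numbers.

def estimated_rpo_seconds(write_mb_s, link_mb_s, peak_seconds, cycle_seconds):
    """Worst-case data exposure: one replication cycle plus backlog drain time."""
    # Writes arriving faster than the link can move them pile up as a backlog.
    backlog_mb = max(0.0, (write_mb_s - link_mb_s) * peak_seconds)
    return cycle_seconds + backlog_mb / link_mb_s

# Example: a 10-minute batch burst writing 200 MB/s over a 120 MB/s replication
# link, with a 30-second consistency-group interval.
rpo = estimated_rpo_seconds(write_mb_s=200, link_mb_s=120,
                            peak_seconds=600, cycle_seconds=30)
print(f"Potential data loss: about {rpo / 60:.1f} minutes")   # roughly 7 minutes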

Is this a “belt & suspenders” approach? Yes it is. Is it expensive? Yes it is. Is it necessary for every client environment? Certainly not. But it is appropriate for some environments. Is this solution sufficient to guarantee recovery in all situations? No, there is still no 100% guarantee even with this solution.

With these issues in mind, I try to approach each Business Continuity Planning (or Disaster Recovery Planning) effort with the following two thoughts:

• The recovery planning process is essentially additive: Once you have a process that works up to its limits, then it’s time to evaluate solutions that address issues not covered by the current plan. In this fashion, there is always another recovery solution that can be added to the current BCDR environment, but each additional solution brings with it additional costs and provides marginally less benefit.
• At some point, the cost of the possible additive solutions will exceed what the company is able (or willing) to fund. Both the costs and alternatives must be clearly understood for management to make a decision that is appropriate for that company and situation.

In summary, no BCDR solution can provide a 100% guarantee. It is very important that the limits and risks of the existing plans are correctly identified before disaster strikes.

Sunday, July 20, 2008

zIIP Assisted z/OS Global Mirror: Significant cost reduction

Going back through some of the more recent processor announcements reminded me there are some significant cost savings available to the hundreds of XRC (Global Mirror for zSeries) customers out there.

Global Mirror for zSeries is, of course, an asynchronous disk mirroring technique that mirrors mainframe data across any distance. This process is controlled by the SDM (System Data Mover), which is application code running in one or more z/OS LPARs – usually at the recovery site.

Depending upon the size of the configuration (the number of volumes to be mirrored and the number of XRC “readers” defined), this could place a significant load on the processor CPs, and could require multiple CPs just to support the SDMs.

Beginning with z/OS 1.8, IBM began enabling much of the SDM code to be eligible to run on the IBM System z9 and z10 Integrated Information Processors (zIIPs). The zIIP assisted z/OS Global Mirror functions can provide better price performance and improved utilization of resources at the mirrored site.

This improved TCO (Total Cost of Ownership) is accomplished in two ways: First, IBM charges less for the specialty processors – like zIIPs – than the standard general purpose CPs. Secondly, the processing power used on zIIP processors enabled on the physical z9 or z10 CEC is NOT included in the general MSU ("Millions of Service Units”) figure that vendors use to determine your software charges.

Thus, if you are not currently offloading the SDM processing onto the zIIP processors on your z9 or z10, then you could experience some significant cost savings by moving towards this configuration.
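As a rough illustration of where the savings come from, here is a minimal sketch of the software-charge side of the calculation. The MSU figure, the zIIP-eligible fraction and the cost per MSU are all invented assumptions; actual savings depend on your configuration and your software contracts.

# Hypothetical estimate of monthly software-charge savings from zIIP offload.
sdm_msu_on_general_cps = 120        # MSUs the SDMs consume on general CPs today (assumed)
ziip_eligible_fraction = 0.90       # assumed portion of that SDM work that is zIIP-eligible
software_cost_per_msu_month = 200   # assumed blended software cost per MSU per month ($)

offloaded_msu = sdm_msu_on_general_cps * ziip_eligible_fraction
monthly_savings = offloaded_msu * software_cost_per_msu_month
print(f"~{offloaded_msu:.0f} MSUs moved to zIIPs, roughly ${monthly_savings:,.0f}/month "
      "removed from MSU-based software charges")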

Friday, July 18, 2008

IBM and Sun both announce 1TB tape drives

It has been an interesting week, but certainly the most significant storage news is that IBM and Sun both introduced one-terabyte tape drives.

IBM and Sun Microsystems have once again brought enterprise tape storage drives into the spotlight: Sun's announcement was made on Monday July 14 and IBM announced its new product just a day later. Now, whichever vendor you embrace, you have new options for enterprise tape storage at a lower TCO (total cost of ownership) and increased storage capacity.

On Monday, with an exclamation of “Bigger is Better, Biggest is Best,” Sun announced that it had succeeded in developing the very first one-terabyte tape drive, the Sun StorageTek T10000B. This new drive provides a maximum of 1TB of storage capacity on a single cartridge for open or mainframe system environments. Unfortunately for Sun, the bragging rights over “biggest” were short-lived, as the very next day IBM announced a new tape drive that offers the same capacity as the Sun drive, but is also faster. Named the TS1130, IBM's new device will store up to one TB of data per cartridge and offers a native data rate of 160 MB/s – compared to 120 MB/s for the T10000B.

Both drives re-use existing media, thus providing backward read/write compatibility and asset protection for the current customers, and claim to support up to 1 TB of native capacity (uncompressed) per tape cartridge.

The T10000B (like previous drives) has the ‘control unit’ function built into the drive and supports FICON and Fibre Channel.

The TS1130 has dual FC ports and can connect directly to open systems servers; FICON and ESCON support is available utilizing the TS1120 or 3592 Tape Controllers.

Here is a side by side comparison of some of the “speeds and feeds”:



Performance
  Data transfer rate (uncompressed): Sun T10000B 120 MB/sec; IBM TS1130 160 MB/sec
  Maximum data transfer rate (compressed): Sun T10000B 360 MB/sec (4 Gb interface); IBM TS1130 400 MB/sec

Capacity
  Native capacity (uncompressed): Sun T10000B 1 TB (240 GB for the Sport cartridge); IBM TS1130 1 TB (JB/JX media), 640 GB (JA/JW media) or 128 GB (JJ/JR media)

Data Connectivity
  Interface: Sun T10000B 4 Gb Fibre Channel (N and NL port, FC-AL-2, FCP-2, FC-tape) and FICON; IBM TS1130 dual-ported 4 Gbps native switched-fabric Fibre Channel, attaching directly to open systems servers or to ESCON/FICON servers via the TS1120 Tape Controller Model C06 or the IBM Enterprise Tape Controller 3592 Model J70

Mechanical
  Height: Sun T10000B 3.5 in. (8.89 cm); IBM TS1130 3.8 in. (95 mm)
  Depth: Sun T10000B 16.75 in. (42.55 cm); IBM TS1130 18.4 in. (467 mm)
  Width: Sun T10000B 5.75 in. (14.61 cm); IBM TS1130 5.9 in. (150 mm)

Environmental
  Operating temperature: Sun T10000B +50°F to +104°F (+10°C to +40°C); IBM TS1130 16° to 32°C (60° to 90°F)
  Operating relative humidity: Sun T10000B 20% to 80%; IBM TS1130 20% to 80% non-condensing (limited by media)

Tape format
  Format: Sun T10000B linear serpentine; IBM TS1130 linear serpentine

Power
  Consumption/dissipation (operating maximum continuous, not peak): Sun T10000B 63 W (drive only), 90 W (including power supply); IBM TS1130 46 W (drive and integrated blower)

Cooling
  Consumption/dissipation (operating maximum continuous, not peak): Sun T10000B 420 BTU/hr; IBM TS1130 307 BTU/hr

Encryption
  Sun T10000B: The crypto-ready StorageTek T10000B tape drive works in conjunction with the Sun Crypto Key Management System (KMS). The KMS delivers a simple, secure, centralized solution for managing the keys used to encrypt and decrypt data written by the T10000B tape drive. Developed on open security standards, the KMS consists of the Key Management Appliance (a security-hardened Sun Fire x2100 M2 rack-mounted server) and the KMS Manager graphical user interface (GUI) that is executed on a workstation. The KMS runs without regard to application, operating platform, or primary storage device. It complies with Federal Information Processing Standard (FIPS) 140-2 certification.
  IBM TS1130: Built-in encryption of a tape's contents for z/OS, z/VM, IBM i, AIX, HP, Sun, Linux and Windows

Wednesday, June 25, 2008

Tape is not dead!

For several years I have been chagrined by the various proclamations from industry experts that “Tape is dead”. These reports (many of which are published by vendors that do not offer a tape product) extol the virtues of using disk as a replacement for tape, but ignore the operational benefits and cost savings that tape provides in the enterprise IT environment.

These benefits become more apparent as the quantity of data and the size of the enterprise grows. At some point, it becomes evident that keeping all copies of all data on spinning disk storage is simply unsupportable – both operationally and financially.

Even so, the many myths regarding the demise of tape have been perpetuated over the years. Therefore I was very pleased to see an article debunking some of these myths appear in the June/July 2008 issue of Z/Journal.

This article, “Mainframe Tape Technology: Myths and Realities”
(by Stephen Kochishan and John Hill), discusses several of these popular myths and then describes the reality and the best practice that apply to each issue. This article should be an interesting read for the enterprise storage professional.

Saturday, June 7, 2008

AT&T Study confirms DRJ and Forrester results

The following study announcements were found clogging my inbox on Friday...

Survey finds most businesses prepared for disasters (Bizjournals.com, Charlotte, NC, USA): The survey found that 77 percent of Seattle and Portland executives indicate their companies have a business continuity plan.

AT&T 2008 Business Continuity Study (Continuity Central press release, Huddersfield, UK): AT&T has published the results of its latest annual survey of business continuity practices in US organizations. The 2008 survey is the seventh such survey...

AT&T Study: One in Five US Businesses Does Not Have a Business ... (Converge Network Digest, USA): For the seventh consecutive year, AT&T's Business Continuity Study surveyed IT executives from companies throughout the United States that have at least $25...

Reading just these excerpts, it took me a moment or two before I realized that they were all describing the same AT&T study.

These particular reports are a little light as far as presenting the specific results from the AT&T survey, but still very timely considering the topic of my last post. The consensus of these interpretations of this AT&T study seems to be that 80% of companies have a Business Continuity plan. 59% of the respondents have updated their plan within the last 12 months, but fewer (46 percent) have had the plans fully tested during the same time period.

Using these numbers in place of the percentages from the DRJ/Forrester study referenced in the previous post, we end up with a set of calculations that look something like this:

80% of the companies have a Business Continuity Plan.
59% of the companies update their plan at least once a year. This means that (.80 * .59 = 47%) 47% of the companies have a plan that is updated at least once a year.
46% of the companies actually test their recovery plans at least once a year. This indicates that (.46 * .47 = 22%) 22% of the companies have a plan, update and test it at least once a year.

This result isn't too bad, I guess, but it doesn't incorporate the result from the Gartner study that only 28% of the planned tests actually are successful and meet all of their objectives. If we apply this calculation to the results of the AT&T study, we find that:

(.28 * .22 = 6%) Only six percent of the surveyed companies can be expected to have successful Business Continuity exercises that meet all of their business requirements.
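For anyone who wants to check the arithmetic, the chain of percentages works out like this (treating each percentage as applying to the group before it, as the survey write-ups imply):

# Compounding the AT&T and Gartner percentages quoted above.
have_plan  = 0.80   # companies with a business continuity plan
updated    = 0.59   # of those, plan updated within the last 12 months
tested     = 0.46   # of those, plan fully tested within the last 12 months
successful = 0.28   # Gartner: last exercise met all service targets

print(f"{have_plan * updated:.0%} have an updated plan")                        # ~47%
print(f"{have_plan * updated * tested:.0%} also test it yearly")                # ~22%
print(f"{have_plan * updated * tested * successful:.0%} could expect a fully "
      "successful exercise")                                                    # ~6%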

This is discouraging news indeed.

Sunday, June 1, 2008

Are we prepared?

I attended a Disaster Recovery seminar last week aimed at building a better disaster recovery plan. Some of the statistics presented sparked my interest but rather than taking the presentation at face value, I thought that some additional discussion and analysis about some of these findings would be useful.

It is fortunate that I did take a closer look as some of the statistics that were presented as fact did not bear up to close scrutiny. In fact, upon verifying the presenters’ source information it became apparent that one of the statistics I had chosen as the starting point for my analysis had been misinterpreted and used out of context of the original source document.

However, I was still interested in where this information might have led with the proper analysis, so I discarded the seminar materials and went looking for similar – but more accurate and verifiable – data that could stand up to analysis.

These are the results.

The current state of DR/BC


79% – The percentage of enterprises that report having a formal and documented recovery plan in place. [Source: DRJ/Forrester article] This is a very strong showing and represents the significant progress that the industry as a whole has achieved. While no one can argue that the percentage should be anything less than 100%, 19% of the respondents indicated that they expect to have a plan in place within the next 6-12 months, leaving only 2% of the respondents with no plan whatsoever.

81% – Of those with a DR plan, 81% responded that their plans are updated at least once a year. 26% of the respondents indicated that their plans are updated in an ongoing fashion as part of the change and configuration management processes. Kudos to these folks! 14% update their plans quarterly, 18% update their plans twice a year, and 23% update their plans once a year. [Source: DRJ/Forrester article]

82% – 82% of those that responded perform a full exercise of their disaster recovery plans at least once a year. 50% test once a year, 22% test twice a year, while 10% test more than twice a year. [Source: DRJ/Forrester article]

At this point, the numbers look really encouraging. As a DR/BC professional myself I can look at these numbers and say “Wow! 80% is really good. We’re doing a great job!” On a personal level, this causes warm and fuzzy feelings as my GQ (Goodness Quotient) is set to 80.

However…

Simple numbers, such as these, can be deceiving. In fact, much better business decisions can be made if additional understanding and analysis of the overall results can be achieved.

First of all, even though each of the numbers presented so far are very close to 80%, it is important to realize that each additional statistic represents but a portion of the prior sample: There is a subsequent reduction in the effective end-product success at each iteration.

In other words, the numbers should be understood in this context:

  • 79 out of 100 enterprises have a disaster recovery plan. (GQ = 79)
  • 81% of the enterprises with a plan update them at least once a year: 100 * .79 * .81 ≈ 64 (GQ = 64)
  • 82% actually test their recovery plans at least once a year: 100 * .79 * .81 * .82 ≈ 52 (GQ = 52)


    Hmmm. So this means that only about 52% of all enterprises actually have a DR plan, update it and test it at least once a year. While that doesn’t make me as warm and fuzzy as the 80% number did, it’s still pretty good, right?

    Well, maybe not. While this still appears to be a relatively positive indication, it doesn’t yet include any indication of how many of these DR plans are successful and actually meet or exceed the client requirements.

    In order to proceed with this next analytical step, it is necessary to reference the results of an additional study – a recent Gartner study that found: “Twenty-eight percent of organizations reported that their last disaster recovery exercise went well and met all their service targets. However, 61 percent of survey participants reported that they had problems with the exercise.” So:

    28% – “Twenty-eight percent of organizations reported that their last disaster recovery exercise went well and met all their service targets.” So, a 28% “success” rate. [Source: Gartner article]

    However, we must remember that this percentage only applies to those that actually have a plan and update and test their plan. So in reality, this is a 28% success rate of only 52% of the total. (52% * .28 = 15%)

    The complete breakdown of this analysis is shown graphically here:

    [Spreadsheet: successful DR tests as a percentage of the whole]

    This indicates that only 15% of all companies will actually recover from a disaster as they have planned. In other words, 85% of all companies will either fail following a disaster or will experience difficulties that cause them to exceed their RTO, their RPO, or both.

    It would make a great - although completely irresponsible - headline if we were to state that "85% of all organizations will fail following a disaster". Tempting to some perhaps, but no.

    To do so would totally ignore the 61% (as reported by Gartner) who reported that they had "some problems with their last exercise". Since we do not have the detailed information regarding this statistic, we cannot ascertain the severity or number of the problems they encountered. It is, however, safe to assume that many of these problems have been corrected and that a portion of these enterprises will enjoy greater success the next time they exercise their DR validation program.

    The areas of Disaster Recovery and Business Continuity are ones that capitalize on the benefits of a Continual Service Improvement methodology. Keeping the plan current and validating it via frequent test executions are two of the cornerstones necessary for a compliant and resilient enterprise.

    Source Information
    Website: Disaster Recovery Journal, "The State Of DR Preparedness", http://www.drj.com/index.php?option=com_content&task=view&id=794&Itemid=159&ed=10, with references from the Forrester/Disaster Recovery Journal October 2007 Global disaster Preparedness Online Survey, verified 06/01/2008

    Website: Gartner, "Gartner Says Most Organizations Are Not Prepared For a Business Outage Lasting Longer Than Seven Days", http://www.gartner.com/it/page.jsp?id=579708, verified 06/01/2008

    Tuesday, May 27, 2008

    Simple Storage Cost Model (Part 2)

    One of the most obvious omissions of the previous cost model is anything that would account for the costs associated with providing a DR/BC (disaster recovery / business continuity) environment for the data. After all, if the data is important enough to your business under normal conditions then chances are it will still be required following a disaster.

    Costing Storage for DR/BC


    The DR/BC components to be included in the costing model will vary wildly from enterprise to enterprise depending on the specific installation choices and recovery methodologies supported. As the recovery environment becomes more complex, more costs must be accounted for in the cost model.

    In designing a fully mirrored operating environment with complete data replication, at first one might assume that a simple doubling of the costs associated with the primary site might be close enough. In rare cases this might be true, but this assumption should not be accepted by IT management unless it can be verified by a much more in-depth cost analysis.
    Some of the differences that must be considered are:

    Acquisition costs (CapEx):
    • Storage costs: The secondary storage may be a different device type with different features, performance and cost characteristics.
    • Network Equipment: Additional equipment such as channel extenders, routers, switches, etc. may be required.
    Operational costs (OpEx):
    • Support personnel costs: May be less than double due to economies of scale and less day-to-day management required of the recovery environment.
    • Maintenance (warranty) costs for the equipment: Will probably be different based on the secondary device configuration.
    • Environmental (power and cooling) costs: May be reduced at the secondary site by having redundant environmental systems in standby instead of active mode.
    • Network costs: Additional monthly costs to provide the bandwidth necessary to support data replication.
    Given these considerations, it is safe to assume that not only will the ratio of CapEx to OpEx expense be different but also that a simple doubling of the primary site costs would not be appropriate.

    Updating the model


    Again, for the purposes of illustration only, let us make up some numbers to insert into our expanded cost model. We will utilize the following assumptions as the basis for our additional entries:
    Acquisition costs (CapEx):
    • The secondary storage can be acquired at 1.2 times the primary storage

    • The necessary network equipment will be acquired at a one-time cost of $250,000
    Operational costs (OpEx):
    • The additional support personnel costs are estimated at an additional 40%
    • The maintenance costs assigned to the secondary storage are estimated at 80% of the primary unit
    • Secondary environmental costs are estimated at 60% of the primary site
    • The additional expense for network bandwidth is estimated to be $25k/month
    • The hardware maintenance expense associated with the additional network equipment is estimated to be $27,500 per year
    Using these assumptions, the revised cost model is shown here:


    Storage Cost Model

    Acquisition Costs
      Purchase 1 800 TB Disk Unit: $1,100,000
      Less any discounts or credits: $0
      Purchase secondary storage (1.2 * primary cost): $1,320,000
      Less any discounts or credits: $0
      Purchase network equipment: $250,000
      Less any discounts or credits: $0
      Total Acquisition Costs: $2,670,000

    Operational Costs (per year / 5-year term)
      Primary Site
        Support Personnel: $91,000 / $455,000
        Maintenance (excluding first year): $77,285 / $309,140
        Environmental: $51,300 / $256,500
      Secondary Site
        Support Personnel: $36,400 / $182,000
        Maintenance (excluding first year): $61,828 / $247,312
        Environmental: $30,780 / $153,900
      Network
        Circuit (Bandwidth) Expense: $300,000 / $1,500,000
        Equipment Maintenance (excluding first year): $27,500 / $110,000
      Total Operational Costs: $3,213,852

    Total Cost: $5,883,852
    Cost per TB: $7,354.82
    Cost/TB/Month: $122.58


    Obviously, in this more complete cost model, the price per TB has gone up due to the additional hardware and network costs that are incurred for this particular DR/BC option. (Please remember that the specific costs shown are for illustrative purposes only and should not be used for any purpose other than to demonstrate the components of the cost model).

    Once you have determined the appropriate numbers for your environment, it will be useful to compute the cost ratio between the non-mirrored storage and the fully mirrored and recoverable storage. This will allow you to provide two simple planning numbers to your data owners and application designers. The first is the Cost/TB/Month of the un-mirrored (and presumably unrecoverable) storage while the second number represents the monthly cost for the fully mirrored and automatically recovered storage. Once this process (and the associated costs) have been socialized throughout your enterprise, the business owners can determine early on what level of support they are willing to fund for their data.
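    As a rough sketch of that ratio calculation, here is the arithmetic behind the two sample models in code form. The figures are the same illustrative numbers used above and should not be treated as real costs.

# Cost/TB/month for the un-mirrored and fully mirrored configurations,
# using the illustrative figures from the two sample cost models.
TERM_YEARS, MONTHS, CAPACITY_TB = 5, 60, 800

def cost_per_tb_month(acquisition, yearly_opex, maint_yearly):
    # Maintenance is charged for the term excluding the first (warranty) year.
    total = acquisition + yearly_opex * TERM_YEARS + maint_yearly * (TERM_YEARS - 1)
    return total / CAPACITY_TB / MONTHS

unmirrored = cost_per_tb_month(
    acquisition=1_100_000,
    yearly_opex=91_000 + 51_300,                                  # personnel + environmental
    maint_yearly=77_285)

mirrored = cost_per_tb_month(
    acquisition=1_100_000 + 1_320_000 + 250_000,                  # primary + secondary + network gear
    yearly_opex=(91_000 + 51_300) + (36_400 + 30_780) + 300_000,  # both sites + circuit expense
    maint_yearly=77_285 + 61_828 + 27_500)

print(f"${unmirrored:.2f} vs ${mirrored:.2f} per TB/month, ratio {mirrored / unmirrored:.2f}x")
# -> $44.18 vs $122.58 per TB/month, ratio 2.77x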

    The spreadsheet that was used in creating this sample cost model is available by visiting the Recovery Specialties website.

    Wednesday, May 14, 2008

    Simple Storage Cost Model (Part 1)

    In the previous missive I described a means of assigning value to business data. In this post I will continue the discussion and show how to determine the storage costs associated with providing business data.

    The cost of providing business data


    The means by which IT organizations determine and assign costs are as varied as the organizations themselves. Some organizations may not have implemented any chargeback methodology or may use only hard dollar accounting costs. However, in order to gauge the true cost of data, the operational costs should be included as well.

    Accounting costs


    Acquisition costs (CapEx): The capital or acquisition costs are the easiest to calculate as you begin with a simple bottom-line number. In order to determine the true costs, however, it is important to understand that the capital acquisition costs typically represent less than half of the actual cost of acquiring and operating equipment.

    Operational costs (OpEx): There are many operational costs that can be associated with a cost model. For the purposes of an IT cost model, at least three operational costs should be considered:
    • Support personnel costs
    • Maintenance (warranty) costs for the equipment
    • Environmental (power and cooling) costs
    Some IT environments will also include depreciation and floorspace costs in the calculation, while others may need to include specific taxes, insurance fees, or other costs specific to the enterprise.

    Creating a cost model


    Once you determine the components that you will include in the cost model, it is simple enough to enter these values into a spreadsheet to calculate the costs. The spreadsheet should capture the agreed upon major aspects, which include the capital and operational costs, including personnel, maintenance, and power and cooling. A typical operational time frame may be picked, such as three years to five years (depending upon the accounting practices currently in place within the enterprise).

    Provided below is a very simple cost model demonstrating the costs associated with a single 800 TB disk storage unit over a five-year term. (Note: The specific costs shown are completely arbitrary and are supplied only for illustrative purposes).

    Storage Cost Model

    Acquisition Costs
      Purchase 1 800 TB Disk Unit: $1,100,000
      Less any discounts or credits: $0
      Total Acquisition Costs: $1,100,000

    Operational Costs (per year / 5-year term)
      Support Personnel: $91,000 / $455,000
      Maintenance (excluding first year): $77,285 / $309,140
      Environmental: $51,300 / $256,500
      Total Operational Costs: $1,020,640

    Total Cost: $2,120,640
    Cost per TB: $2,650.80
    Cost/TB/Month: $44.18

    Although this model is still quite simple, I favor it as the starting point for determining actual storage costs. In particular, I am an advocate of being able to provide a “Cost per TB per Month” figure that can be referenced as IT clients request additional storage for new projects and systems.
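    If it helps to see the same arithmetic outside of a spreadsheet, here is a minimal sketch that reproduces the figures above. The numbers remain the arbitrary, illustration-only values from the table.

# Reproducing the simple cost model: one 800 TB unit over a five-year term.
TERM_YEARS, CAPACITY_TB = 5, 800

acquisition = 1_100_000                     # purchase price, less any discounts or credits
support     =  91_000 * TERM_YEARS          # support personnel
maintenance =  77_285 * (TERM_YEARS - 1)    # hardware maintenance, excluding the first year
environment =  51_300 * TERM_YEARS          # power and cooling

total        = acquisition + support + maintenance + environment
per_tb       = total / CAPACITY_TB
per_tb_month = per_tb / (TERM_YEARS * 12)

print(f"Total: ${total:,.0f}")                  # $2,120,640
print(f"Cost per TB: ${per_tb:,.2f}")           # $2,650.80
print(f"Cost/TB/Month: ${per_tb_month:,.2f}")   # $44.18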

    The sample spreadsheet that was used in this example is available from Recovery Specialties.

    Thursday, April 24, 2008

    Data Classification - Revisited

    In an earlier post (see Simple data classification for Business Continuity) I described a simple means of beginning to classify the value of data for Business Continuity. I’d like to expand on that topic and perhaps approach the subject from a slightly different perspective.

    The value of business data

    When speaking of the value of business data, the one universal constant is “it depends”. All data is “important” to the business owner – otherwise it wouldn’t have been created in the first place, right?

    But looking at the question in terms of the “business,” the true value to the organization of any particular piece of data lies in how that data is accessed, not in any innate value placed on it by the data’s creator. In fact, the importance of data varies significantly among industries, and can even vary by application, and perhaps time of day, within any particular firm.


    Just as the true value of data will vary in nearly every case, the process of assigning a value to the data will be different from enterprise to enterprise. Take, for example, the case of a large web-based retailer. In this environment, the cost of an hour’s outage might be estimated as:


    Estimated Cost of Outage ($'s per hour)

                                              First     Second    Third     Fourth    Fifth
    Loss of Sales (hard dollars)              X         X         X         X         X
    Customer Satisfaction (soft dollars)      Z         Z+10%     Z+15%     Z+20%     Z+25%

    In this example, the retailer has determined that the cost of lost sales remains constant while the soft dollar loss relating to customer satisfaction (and future customer visits) gradually increases with the duration of the outage.
    Although this is a simplistic case, it does illustrate a starting point that can be used and built upon in support of different industries and/or clients. Take, for example, an enterprise in the banking or services industry. A chart such as the following might be used to quantify the cost of an outage:
    Estimated Cost of Outage ($'s per hour)

                                              First     Second    Third     Fourth    Fifth
    Loss of Fees (hard dollars)               X         X         X         X         X
    Loss of Float (hard dollars)              Y         Y         Y         Y         Y
    Customer Satisfaction (soft dollars)      Z         Z+10%     Z+15%     Z+20%     Z+25%

    In either case, once you have the anticipated costs assigned to components of both the “hard” and “soft” dollar categories, the value of the data to the business is represented by the sum of the individual columns.
    Performing this type of exercise is an important step in gaining management concurrence and understanding of the true business value of the various data components. It is also the basis of generating sustainable Service Level Agreements (SLA) as well as Recovery Time and Recovery Point Objectives (RTO and RPO).
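    As a small worked example of summing those columns, the sketch below totals a five-hour outage. The dollar values standing in for X, Y and Z are invented; only the structure of the calculation matters here.

# Illustrative only: the hard- and soft-dollar columns for a five-hour outage.
loss_of_fees  = 50_000     # hard dollars per hour (stand-in for "X")
loss_of_float = 20_000     # hard dollars per hour (stand-in for "Y")
customer_sat  = 10_000     # soft dollars in the first hour (stand-in for "Z")
escalation    = [0.00, 0.10, 0.15, 0.20, 0.25]   # soft-dollar growth by hour

hourly_cost = [loss_of_fees + loss_of_float + customer_sat * (1 + up)
               for up in escalation]
print([round(c) for c in hourly_cost])   # cost of each successive hour
print(sum(hourly_cost))                  # total value at risk for the outage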

    Monday, March 24, 2008

    Arthur C. Clarke Dead at 90

    Arthur C. Clarke, the science fiction writer, technology visionary and business resiliency futurist, died last week in Sri Lanka. Sir Arthur C. Clarke was 90 and the last surviving member of the “Big Three” science fiction writers (along with Robert A. Heinlein and Isaac Asimov).

    During his lifetime, he authored over 100 books and thousands of technical papers, was nominated for a Nobel Prize and predicted the existence of artificial satellites in geosynchronous orbit – Also known as the “Clarke Orbit” – and that man would land on the moon by 1970.

    It was in 1945 when “Wireless World”, a UK periodical, published Clarke’s technical paper "Extra-terrestrial Relays" in which he first set out the principles of satellite communication with satellites in geostationary orbits – an idea that was finally implemented 25 years later. He was paid £15 for the article.

    Having grown up with Clarke's works, I cannot say that I have a favorite. I believe that his most popular works were the books of the 2001 series, beginning with 2001: A Space Odyssey in 1968.

    Interestingly enough, though, I am partial to the Rendezvous with Rama series, the first book of which was published in 1972. The following is copied from the Wikipedia:

    “Rendezvous with Rama is a novel by Arthur C. Clarke first published in 1972. Set in the 22nd century, the story involves a thirty-mile-long cylindrical alien starship that passes through Earth's solar system. The story is told from the point of view of a group of human explorers, who intercept the ship in an attempt to unlock its mysteries.
    This novel won both the Hugo and Nebula awards upon its release, and is widely regarded as one of the cornerstones in Clarke's bibliography. It is considered a science fiction classic, and is particularly seen as a key hard science fiction text.”


    In this book, the strongest underlying philosophy is the basic resilience of the alien starship. Not only are all systems replicated and completely redundant, but these systems and processes are all implemented in groups of threes. In fact, it is eventually discovered that critical systems are designed utilizing three complete sets of threes.

    Just as Arthur C Clarke illustrated 36 years ago, three can be a very significant number in the information technology field today. Not only should you have (at a minimum) three copies of your data: The local copy, the onsite backup and an offsite backup, but certain advanced replication techniques also utilize a minimum of three copies. These would be the local data, the local synchronous copy and the remote (asynchronous) copy. And just like the Ramans, there should be additional copies of mission critical data.

    Many lessons can be taken from the Rama books and Sir Arthur's other writings and applied to help strengthen Business Resilience and Business Continuity principles today.

    But for now, a friend is gone.

    On his 90th birthday last December, he listed three wishes for the world: To embrace cleaner energy resources, for a lasting peace in his homeland of Sri Lanka, and for evidence of extraterrestrial beings.

    "Sometimes I am asked how I would like to be remembered," Clarke said. "I have had a diverse career as a writer, underwater explorer and space promoter. Of all these I would like to be remembered as a writer."

    In an interview with The Associated Press, Clarke said he did not regret having never followed his novels into space, adding that he had arranged to have DNA from strands of his hair sent into orbit.

    "One day, some super civilization may encounter this relic from the vanished species and I may exist in another time," he said. "Move over, Stephen King."

    Until then…
    Mission completed. Close the pod bay doors, Hal......

    Wednesday, March 19, 2008

    Disaster Recovery is…. Boring!

    Let’s face it. Disaster Recovery – at least the portion that IT is involved with – is boring. There’s no dramatic TV footage (we hope!), no flashing lights, no daring helicopter rescues and no one shouting “Clear!” as the patient is shocked back to life… In other words, there is nothing about an IT recovery that is very interesting at all to a majority of the general population.

    Unfortunately, this does not help prepare the client or end-user to connect a disaster event with a subsequent IT service outage. “Sure there was a class 5 hurricane, but why doesn’t the damn ATM work?” may be the prevalent attitude. Now, that is not to say that those who experienced the disaster first hand and can actually see damaged infrastructure will share in this perception; but for individuals who reside in a different state and did not experience the event first hand – there is no intuitive reason to link the effect – the IT service outage, with the disaster itself.

    So, ‘why doesn’t the damn ATM work?’ The only truly acceptable answer is “it should.”

    In general, IT services are perceived as a utility function. And just like any other utility – they are supposed to work. Just as when you ‘throw the switch’ there is an expectation that the light will come on, the ATM, email server, or airline reservation system is just supposed to work. Period.

    It is the industry’s realization of this that has been leading the paradigm shift from Disaster Recovery to Business Continuity.

    There is simply no such thing as an instantaneous Disaster Recovery event. Business Continuity, on the other hand, is implemented to continue the delivery of critical IT services when the normal IT infrastructure has suffered a catastrophic failure.

    Sunday, March 9, 2008

    Business Continuity announcements from February 26, 2008

    On the same day that IBM announced the new z10 processor, there were a couple of other product announcements of interest to the Business Continuity practitioner: GDPS v3.5 was announced along with enhancements to the DS8000. These announcements may have been overlooked by some because of the excitement generated by the processor announcement.

    GDPS V3.5

    GDPS V3.5 is planned for general availability on March 31, 2008. New functions include:
    • Distributed Cluster Management (DCM) - Designed to provide coordinated disaster recovery across System z™ and non-System z servers by integrating with distributed cluster managers. Added integration with Veritas Cluster Server (VCS) via GDPS/PPRC and GDPS/XRC.
    • GDPS/PPRC Multiplatform Resiliency for System z expanded to include Red Hat Enterprise Linux™ 4.
    • Enhancements to the GDPS GUI.
    • Added support for FlashCopy® Space Efficient.
    • Improved performance and system management with support for z/OS® Global Mirror Multi-Reader.
    • Increased availability with GDPS/MzGM support for z/OS Metro/Global Mirror Incremental Resync

    DS8000 (2107)

    The new DS8000 functions are currently available. They are delivered via Licensed Machine Code (LMC) update.
    • Extended Distance FICON for System z environments – helps avoid performance degradation at extended distances and reduces the need for channel extenders in DS8000 z/OS Global Mirror configurations.
    • Support for Extended Address Volume (EAV) – Increases the maximum number of cylinders per volume from 65,520 to 262,668 (223 GB of addressable storage; see the quick capacity check after this list).
    • Support for z/OS Metro/Global Mirror Incremental Resync.
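    Here is a quick sanity check of that 223 GB figure, assuming standard 3390 geometry (15 tracks per cylinder and 56,664 bytes per track):

# Addressable capacity of a 3390 volume at a given cylinder count.
BYTES_PER_TRACK = 56_664
TRACKS_PER_CYL  = 15

def capacity_gb(cylinders):
    return cylinders * TRACKS_PER_CYL * BYTES_PER_TRACK / 1_000_000_000

print(f"{capacity_gb(65_520):.0f} GB")    # previous limit: about 56 GB
print(f"{capacity_gb(262_668):.0f} GB")   # EAV limit: about 223 GB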


    Trademark Legal Info

    GDPS, System z, HyperSwap, Geographically Dispersed Parallel Sysplex, DS8000, System Storage, FICON, System z9, HACMP and Tivoli Enterprise are trademarks of International Business Machines Corporation in the United States or other countries or both.

    FlashCopy, z/OS, Tivoli, AIX, NetView, Parallel Sysplex, and zSeries are registered trademarks of International Business Machines Corporation in the United States or other countries or both.

    Linux is a trademark of Linus Torvalds in the United States, other countries or both.

    Other company, product, and service names may be trademarks or service marks of others.

    Friday, March 7, 2008

    EMC announces mainframe Virtual Tape Library (VTL) product

    Last month EMC announced their entry into the mainframe Virtual Tape Library market:

    “EMC Corporation (NYSE:EMC), the world leader in information infrastructure solutions, today extended its industry-leading virtual tape library (VTL) capabilities to customers in mainframe environments with the introduction of the EMC® Disk Library for Mainframe (EMC DLm). Delivering the industry's first 'tapeless' virtual tape system for use in IBM zSeries environments, the EMC DLm enables high-performance disk-based backup and recovery, batch processing and storage and eliminates the challenges associated with traditional tape-based operations to lower customers' data center operating costs.”

    “The EMC DLm connects directly to IBM zSeries mainframes using FICON or ESCON channels, and appears to the mainframe operating system as standard IBM tape drives. All tape commands are supported by DLm transparently, enabling customers to utilize their existing work processes and applications without making any modifications. Additionally, the EMC DLm enables asynchronous replication of data over IP networks, extending the benefits of array-based replication to mainframe data protection operations.” [Source: EMC press release]

    This is an interesting strategic move by EMC. Not only does it offer EMC entry into a portion of the mainframe storage market where they couldn’t play before, but in the longer term it may also tend to further solidify vendor allegiance in the mainframe storage market as recovery methodologies tend to be somewhat vendor centric.

    A “tapeless” implementation of virtual tape is an interesting proposition, but it is not without its own unique constraints. Seeing as how Murphy was, and always will be, an optimist, it will be interesting to see how “tapeless tape” plays out in the real world.

    First of all, a VTL implementation consisting of a disk buffer and no “back-end” physical tape tends to ignore the most attractive cost point of storing data on tape. Namely, that of data being stored – unused – for deep archive or other purposes. Data of this type that eventually resides on physical tape can be stored on the shelf for mere pennies/GB/month. However, if there is no “back-end” physical tape that can be ejected from the VTL, then the unused data must be up and spinning – perhaps forever – at a higher cost per GB/month.
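    To illustrate that cost point, here is a deliberately crude comparison of archive data sitting on shelved cartridges versus the same data kept spinning in a tapeless VTL. Every number is invented; a real comparison would also need to include drives, library hardware, floor space, power and the operational considerations discussed in this post.

# Hypothetical 5-year cost of 500 TB of deep-archive data.
TB_ARCHIVED, MONTHS = 500, 60

tape_media_per_tb      = 150.00   # cartridge purchased once, then shelved (assumed)
tape_vaulting_tb_month =   0.50   # offsite vaulting / handling per TB per month (assumed)
disk_cost_tb_month     =  40.00   # spinning-disk TCO per TB per month (assumed)

shelved_tape  = TB_ARCHIVED * (tape_media_per_tb + tape_vaulting_tb_month * MONTHS)
spinning_disk = TB_ARCHIVED * disk_cost_tb_month * MONTHS

print(f"Shelved tape: ${shelved_tape:,.0f}   Spinning disk: ${spinning_disk:,.0f}")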

    A second consideration is one of capacity. Tape usage in a mainframe environment tends to be somewhat inconsistent. There are the normal cycles of “weekly backups” and “month-end jobs” and the like, but there are also the unplanned events that can use hundreds or thousands of “tapes” without warning.

    Is there a mainframe shop that hasn’t run out of tapes in recent history? Even those mainframe environments that are running VTLs today and never plan to eject the physical tape media can still eject it should the need arise to add additional capacity on an emergency basis.

    These two points are certainly not the most important issues to be addressed when evaluating tape solutions. They are merely a couple of additional items to be considered along with cost, performance, availability and other components of the total solution.

    Monday, March 3, 2008

    Long Live the Mainframe!

    In the wake of IBM’s recent announcement of their new generation of mainframe, the z10, I thought it would be interesting to review some of the other mainframe headlines and related comments of early 2008.

    • January 23, 2008 (Techworld/IDG, by Chris Kanaracus) Up to three-quarters of an enterprise's data is managed or stored on a mainframe. Research by IBM user group SHARE has revealed that the mainframe, which conventional wisdom had said was old technology, is playing a big part in modern enterprise systems...
    • January 24, 2008 (ITbusiness.ca) COBOL coders needed again as mainframe projects increase. Mainframe installation projects are growing, but the talent needed to run them is in short supply.
    • February 4, 2008 (Computerworld) Palm Beach Community College bought an IBM zSeries mainframe for about a half-million dollars in 2005. Last month, the school agreed to sell it — for $40,000 on eBay.
    • February 26, 2008 (WSJ) Young Mainframe Programmers are the Cat’s Meow … Where do businesses find people who remember how to program the things? That’s a question IBM is grappling with, as well. Most computer-science students these days view mainframe programming as the tech equivalent of learning Latin. They’d rather learn Java, AJAX, Ruby on Rails and other hot new Web programming languages. So, since 2004, IBM has been trying to get colleges and universities to include mainframe classes in their curriculums. IBM estimates that 50,000 students have sat through a mainframe class since then…
    • March 3, 2008 (CBRonline) Hitachi to support IBM zSeries mainframe. Services-oriented storage applications provider Hitachi Data Systems has announced that it will support the IBM z10 zSeries mainframe, which IBM launched earlier this week. Hitachi said it will certify enterprise system connection, fiber connection, and Fibre Channel connectivity for the zSeries. It will also continue to support the z/OS, z/VSE, and z/VM operating systems.

    Interesting stuff, eh? Contrary to long-held popular opinion, the mainframe is not dead. Mainframe usage continues, and it continues to be the platform of choice for many critical application systems.

    Wednesday, February 27, 2008

    Simple data classification for Business Continuity

    The majority of today's businesses cannot survive a catastrophic loss of corporate data. In many cases, the data is the corporation's most important asset – and the amount of data is growing at exponential rates. This dramatic growth in the enterprise storage environment is forcing businesses to continually examine and enhance the availability, security and reliability of their enterprise storage environment.


    Quick Terms
    Business Continuity: The ability of an enterprise to continue to function during and after a catastrophic event
    HIPAA: Health Insurance Portability and Accountability Act
    Resilience: The ability to provide a minimum acceptable level of service during or following failures
    Sarbanes-Oxley: U.S. legislation to protect and preserve financial information


    Just as the volume and importance of data for day-to-day business use continues to grow, the requirement to archive data for future use has also grown dramatically. Compliance requirements, like Sarbanes-Oxley, HIPAA, and others, have helped to accelerate this growth. New data sources are also constantly being developed, including the digitization of formerly non-digital (paper) assets; this, plus the requirement to quickly and accurately retrieve legacy data for business purposes, has contributed to enterprise storage requirements as well.

    Businesses recognize that remote data replication, multi-site failover and other techniques along with faster backup and recovery times, are essential to their ability to survive in today’s 24x7 global economy. Several mechanisms are implemented by most businesses to ensure business continuity. These techniques include newer data replication (“mirroring”) processes as well as adaptations to or modifications of some of the more classic Disaster Recovery techniques that have been successfully utilized in the past.

    In many cases, the legacy Disaster Recovery methods have served the industry well for many years. These methods continue to be useful, but they are proving to be inadequate as the sole means of providing Business Continuity in today’s world. Business requirements for continuity plans and fault recovery demand greater levels of operational resilience, data protection and business continuance, requiring off-site data replication and automatic storage system failover, in addition to shorter backup windows and quicker recovery times.

    Luckily, new tools and facilities are constantly being developed to satisfy these requirements.

    Before deciding upon which of the newer business continuity facilities might be appropriate in your environment, it is important to try to understand how your data is classified according to business use. While a single solution might seem desirable from a support aspect, it is sometimes not the most cost effective means of satisfying the true business requirements.

    Using the broadest terms possible, the recovery requirements of data can be classified as:
    1. Immediate – This data is required to support critical business functions.
    2. As Soon As Possible – This data is required to support normal business functions.
    3. Eventually - This data will probably be used sometime or may need to be available to satisfy legal or archival requirements.

    While this list is a vast over-simplification of the complexities involved in data classification, it does illustrate the idea that different data may have different backup and recovery requirements. Once this idea has been accepted, it is possible to target specific data for the appropriate (and most cost effective) backup and recovery methodologies that have been (or can be) implemented in your environment.
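    As a trivial illustration of how even this coarse classification can drive the choice of technique, here is a sketch that maps each tier to an example recovery approach. The tiers come from the list above; the targets and techniques shown are placeholders, not recommendations for any particular environment.

# Illustrative mapping of recovery classification to backup/recovery techniques.
RECOVERY_TIERS = {
    "immediate":  {"rto_hours": 0,   "technique": "synchronous mirroring with automated failover"},
    "asap":       {"rto_hours": 24,  "technique": "asynchronous replication or restore from onsite backup"},
    "eventually": {"rto_hours": 168, "technique": "offsite tape or archive retrieval"},
}

def recovery_plan(dataset, tier):
    t = RECOVERY_TIERS[tier]
    return f"{dataset}: recover within {t['rto_hours']}h via {t['technique']}"

print(recovery_plan("order-entry database", "immediate"))
print(recovery_plan("monthly reporting extracts", "eventually"))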

    Monday, February 11, 2008

    Z/Systems Journal Article

    I read an interesting article from the Z/Systems Journal titled “Eight Tips for the New Mainframe Storage Manager” by B. Curtis Hall.

    The article is far too brief to serve as a comprehensive primer for the new storage manager, but the eight tips do include several pearls of wisdom (some obvious and some not so obvious) that should help the newbie storage manager.

    One interesting point made by Mr. Hall is that “storage won’t just manage itself.” While this is obvious to many, the complexities involved in proper storage management are sometimes poorly acknowledged by IT management. Recognizing the total cost of storage as a percentage of the total IT budget can sometimes be used to identify its level of importance – and the level of focus – that should be given to managing this important resource.

    By and large it is the customer data and not the technology that is of primary importance. This is, of course, the client view of the data.

    From a service delivery standpoint, however, the technology supporting the storage environment is the most critical aspect of providing access to the client data. Properly managing these technologies and supporting the appropriate levels of Business Resilience, Disaster Recovery, Business Continuity, backups, etc., is one of the most critical responsibilities of today’s IT management team.



    Mike Smith
    Recovery Specialties, LLC

    Thursday, February 7, 2008

    Mildly Interesting...

    I suppose I shouldn't be too surprised, but a good PSA can be hard to find!

    I went looking for some updated web PSAs (Public Service Announcements) to put on the http://www.nasrp.com/ website. I was actually kind of surprised that so few charitable organizations spend the time to make this information available to web sites!

    The American Red Cross website has the best selection that I was able to find. The presentation of their PSAs makes it very easy to select their banner ads to be displayed. The American Heart Association also has a fairly large selection of PSAs.

    What about other organizations? Not so many. Certainly some organizations that I would like to support do not seem to have any web banner PSAs available.

    To any large charitable organization that doesn't offer web PSAs today, I recommend that you take a look at what's available from the American Red Cross to see what you are missing!