Jun 04 2013

This is a follow-up to my blog post, The cost of a bit of a DDUD, which examined the total cost of ownership (TCO) of a network attached storage device operated by a research group. The TCO in that case included a malfunction and repair but no data loss.

In this post we go on to put some numbers on an actual data loss event.

Here is the context: Research Group A analyses anonymised longitudinal data supplied to them by collaborators elsewhere in the UK. The data are relatively large (2TB) and don’t move over the network very well, even at intra-JANET speeds, so it is their practice to acquire the data in large chunks (100-250GB) on physical media and keep them locally, attached to the compute machine. They stored the source data and the new data derived from their work on the same device, a desktop-quality four-disc array.

They did not keep a local backup. The reasons for this parlous circumstance were many: the original data could be re-acquired, albeit with some effort; they planned to deliver derived data to their colleagues offsite; they did not believe central services could provide them with enough networked storage; they were aware of RDTK and were waiting for us to provide a better solution; and they trusted their device not to fail.

The storage device failed. To get the most capacity out of their disc array they had used a RAID0 configuration, in which data are split across the discs with no redundancy, so when one disc failed it took the whole device with it. When the unit was returned to the manufacturer under warranty, the data turned out to be irrecoverable.

To calculate the cost of this event we will consider the costs of purchasing, regular maintenance prior to failure, power (@£0.11/kWh), and the effort expended in reacquiring the source data. Then we will add the cost of recomputing the lost work. We won’t apply a Power Usage Effectiveness factor since the device sat on a desk in normal office conditions. Staff costs in this case are higher than those used previously, at £264/day, reflecting a common situation in which a fairly senior researcher both conducts the work and maintains the equipment.

Capital:

8TB HD, 2-year warranty = £600

Labour prior to failure:

purchasing, setup, acquiring and loading data ~= 3 days;
regular maintenance/interventions over two years ~= 5 days;

Sum of effort = 8 days @£264 ~= £2112

Power:

Nominal 43 watts, in use ~ 0.05 kW x 24hr x 350 days x 2yrs ~= 840 kWh
840 kWh x £0.11/kWh ~= £92
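
If you want to play with the tariff or the duty cycle, here is the same sum written out (Python, purely illustrative; the constants are just the ones above):

    # Same arithmetic as above, written out so the assumptions can be tweaked.
    AVERAGE_DRAW_KW = 0.05        # nominal 43 W unit, rounded up for in-use draw
    HOURS_PER_DAY = 24
    DAYS_PER_YEAR = 350           # allows for a little downtime
    YEARS = 2
    TARIFF_GBP_PER_KWH = 0.11

    energy_kwh = AVERAGE_DRAW_KW * HOURS_PER_DAY * DAYS_PER_YEAR * YEARS
    print(f"{energy_kwh:.0f} kWh ~= £{energy_kwh * TARIFF_GBP_PER_KWH:.0f}")  # 840 kWh ~= £92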

Labour to replace device and reload original data:

local effort to recover data = 5 days;
contact vendor, arrange replacement part, recommission and reload source data = 3 days;
repeat data transport costs £150;

Sum of effort = 8 days @£264 ~= £2112

Labour to repeat lost research:

Data preparation/pre-processing of source data = 5 days
Research time = 40 days

Sum of effort = 45 days @£264 ~= £11880

Analysis

Let’s see what these numbers mean. If the device had not failed and Research Group A had gone on to fill 8TB in the two-year warranty period, then the TCO would have been: 600 (purchase) + 92 (power) + 2112 (8 days’ labour) = £2804 / 8TB / 2yrs ~= £175/TB/yr

In the event of the failure, the effort spent trying to recover the data and eventually having to repeat the research adds another 53 days, which for 2TB brings the TCO over the same period to: 2804 + 150 + 2112 + 11880 (8 + 45 days’ labour) = £16946 / 2TB / 2yrs ~= £4237/TB/yr

Ouch: TCO with a failure ~= 24 x TCO without one.

There is some good news. Most of the derived data was copied to a collaborator shortly before the outage, saving the 40 days research time.

Nevertheless the TCO by the time they were back in their original position, ready to continue work, was actually: 2804 + 150 + 2112 + 1320 (8 + 5 days’ labour) = £6386 / 2TB / 2yrs ~= £1597/TB/yr (this is consistent with the costs calculated in the previous post).
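
For anyone who wants to rerun these sums with their own figures, here is the arithmetic written out as a small Python sketch. It is purely illustrative: the tco_per_tb_year helper is mine, and the constants are simply the itemised costs above.

    # Reconstruction of the three TCO figures; constants are the itemised costs above.
    DAY_RATE = 264            # £/day for a fairly senior researcher
    CAPITAL = 600             # 8TB array
    POWER = 92                # two years' electricity
    TRANSPORT = 150           # repeat physical data transport
    YEARS = 2

    def tco_per_tb_year(fixed_costs, effort_days, capacity_tb):
        """£ per TB per year over the two-year warranty period."""
        return (sum(fixed_costs) + effort_days * DAY_RATE) / capacity_tb / YEARS

    # No failure, array filled to its 8TB capacity
    print(tco_per_tb_year([CAPITAL, POWER], 8, 8))                      # ~£175/TB/yr
    # Failure plus a full repeat of the lost research, 2TB held
    print(tco_per_tb_year([CAPITAL, POWER, TRANSPORT], 8 + 8 + 45, 2))  # ~£4237/TB/yr
    # What actually happened: recovery plus 5 days of re-preparation
    print(tco_per_tb_year([CAPITAL, POWER, TRANSPORT], 8 + 8 + 5, 2))   # ~£1597/TB/yr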

Learning

Beyond the plain figures, which are a rare commodity in RDM, there is a lot of learning in this and the previous blog post.

  • RAID0 should carry a very large health warning. I’d go so far as to say the only place it should ever be used is as a component in RAID10. If you have to use a bit of DDUD, never use RAID0 when you can mirror (RAID1) or stripe with parity (RAID5). For the sake of giving up one disc’s worth of capacity (3 usable discs out of 4 in a 4-disc array), the fault tolerance is so much better and the risk so much lower; the sketch after this list puts rough numbers on the trade-off.
  • We can see why the Distributed Datacentre Under the Desk is so pervasive in research practice. Less than £200/TB/yr compared with more than £800/TB/yr for tricksy cloud storage? The low cost of not doing the job properly looks very attractive unless you have been bitten already.
  • The cost of a problem when one occurs is, however, a big deal. It is almost insignificant in hardware terms; it is all about the human investment required to fix things or redo the research. In the case above this was about two weeks of a senior researcher’s time. It could have been 11 weeks, more than a quarter of a person-year, and more than enough to miss a publication deadline for, say, the Research Excellence Framework assessment.
  • We see a professional, committed research group trying to balance money, time and risk. They were moving toward a robust position but living with the expedient as they travelled there, and they lost out to the fates. This is the position most researchers are in, and it clearly underlines the need for better training and learning resources with regard to working data management.
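
To put rough numbers on the RAID trade-off mentioned in the first bullet, here is a small sketch. The 5% annual per-disc failure probability is an assumption for illustration only, failures are treated as independent, and rebuild windows are ignored:

    from math import comb

    P_FAIL = 0.05          # assumed annual probability that a single disc fails
    N = 4                  # discs in the array
    DISC_TB = 2            # capacity per disc

    # RAID0: full capacity, but any single disc failure loses everything.
    raid0_tb = N * DISC_TB
    raid0_loss = 1 - (1 - P_FAIL) ** N

    # RAID5: one disc's worth of capacity spent on parity; data are lost only
    # if two or more discs fail within the year.
    raid5_tb = (N - 1) * DISC_TB
    raid5_loss = sum(comb(N, k) * P_FAIL**k * (1 - P_FAIL)**(N - k)
                     for k in range(2, N + 1))

    print(f"RAID0: {raid0_tb} TB usable, ~{raid0_loss:.0%} chance of loss per year")   # ~19%
    print(f"RAID5: {raid5_tb} TB usable, ~{raid5_loss:.1%} chance of loss per year")   # ~1.4%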

I would like to thank Research Group A for their honesty and cooperation. This data loss event added £3582 (and it could have been £14,142) and a whole lot of stress to the conduct of their research, and it was good of them to share it for the benefit of the RDM cause. I am happy to report that they are now in a much more robust position: they are using RackSpace Cloud Files as their primary store, moving data back and forth to their working machine as required. The RDM team will continue to work with these researchers after the end of JISCMRD, primarily to see how the use of an off-the-shelf cloud service works in an HE environment, but perhaps also to take them to the next logical step, which would be to move the compute to the data and do it all in the cloud.

To conclude, I am aware that the universal optimist prevails in research culture and no amount of doom-mongering is going to change that. But I can’t quite see how to spin this unequivocal evidence in terms of a benefit. The best I can do is to return to the cost-reliability figures (Boardman, et al.) in Steve Hitchcock’s blog post, Cost-benefit analysis: experience of Southampton research data producers, and say:

the benefit of using UH networked storage is that the risk of data loss is tiny compared to not using it, and the benefit of using cloud storage is that the risk of data loss reduces to practically nil.

Not compelled to spend the cash? Ask Research Group A.