Jun 05 2014
 

The Service Oriented Toolkit for Research Data Management project was co-funded by the JISC Managing Research Data Programme 2011-2013 and the University of Hertfordshire. The project focused on realising practical benefits from operationalising an institutional approach to good practice in RDM. The objectives of the project were to audit current best practice, develop technology demonstrators with the assistance of leading UH research groups, and then reflect these developments back into the wider internal and external research community via a toolkit of services and guidance. The overall aim was to contribute to the efficacy and quality of research data plans, and to establish and cement good data management practice in line with local and national policy.

The final report is available via http://hdl.handle.net/2299/13636

Type | Output | Link / access
Blog | Survey based on Digital Asset Framework | http://bit.ly/18QUZR9
Survey | Survey results | http://bit.ly/1ao74vy
Report | Survey analysis | http://bit.ly/128uGMK
Blog | UH Research Data Policy in a nutshell | http://bit.ly/14cXC9w
Artefact | Interview protocol, used by project analyst and RDM champions | http://bit.ly/12Jr9KZ
Case studies | 12 case studies | http://bit.ly/19MjnD3
Review | Review of cloud storage services: features, costs, issues for HE | http://bit.ly/12Jn2yz
Blog | Files in the cloud | http://bit.ly/R583If
Test data | File transfer rate tests | http://bit.ly/1266WsJ
Blog | Analysis of barriers to use of local networked storage | http://bit.ly/12Gleqg
Blog | Hybrid-cloud model: when the cloud works and the attraction of Dropbox et al. | http://bit.ly/Xvmidr
Blog | Hybrid-cloud example: ZendTo on RackSpace, integrated with local systems | http://bit.ly/11In83q
Service | UH file exchange | https://www.exchangefile.herts.ac.uk/
Blog | Cost of ad hoc storage | http://bit.ly/19ilycQ
Blog | Cost of a data loss event | http://bit.ly/13RSckb
Blog | Reflection on use of RackSpace Cloud Files | –
Blog | Data encryption | http://bit.ly/XxDoEM
Training | Data encryption workshop | http://bit.ly/11rwLXA
Training | Data encryption guide | http://bit.ly/QHyN2y
Blog | Document management for clinical trials | http://bit.ly/15cfT5K
Artefact | eTMF (electronic Trial Master File), 1954 legacy documents scanned | no public access
Artefact | Research Project File Plan | http://bit.ly/11InVkW
Workflow | Post-award storage allocation | –
Workflow | Request ‘Research Storage’ form | http://bit.ly/17V7J8t
Workflow | Research grant and storage process | http://bit.ly/14kvCB0
Workflow | Request ‘Research Storage’ workflow | http://bit.ly/12d2aJP
Service | R: (R drive), workgroup space with external access | access by workgroups
Service | DMS, workgroup space with external access | access by workgroups
Dataset | 4 oral history datasets, ~300 interviews, 125GB | http://bit.ly/uh-hhub
Dataset | 1 leisure studies dataset (SPSS survey, interviews, transcripts), 8GB | in preparation
Blog | Comparison of data licenses | http://bit.ly/12DmXfR
Report | Comparison of data licenses | http://bit.ly/13NC7gA
Service | UHRA repository improvements, phase 1 | http://uhra.herts.ac.uk/
Blog | DOIs for datasets, includes mind map | http://bit.ly/QonFoN
Workflow | Deposit/access criteria for data with levels of openness | http://bit.ly/12cUqrq
Service | RDM microsite (aka Research Data Toolkit), 100+ pages and PDFs of RDM guidance | http://bit.ly/uh-rdm
Report | Register of programme engagement at external events; estimated audience 480, ~300 individuals | Appendix A
Blog | Programme engagement: 38 blog posts | http://research-data-toolkit.herts.ac.uk/
Presentation | Association of Research Managers and Administrators Conference 2013 | http://bit.ly/ZXv8RK
Presentation | UH RDM stakeholder briefing, June 2012 | http://bit.ly/11KkJGo
Presentation | UH Health and Human Sciences research forum, July 2012 | http://bit.ly/15cDUKb
Presentation | JISCMRD progress workshop, Nottingham 2012: storage | http://bit.ly/10qpry3
Presentation | JISCMRD progress workshop, Nottingham 2012: repository | http://bit.ly/126zjab
Presentation | JISCMRD progress workshop, Nottingham 2012: training | http://bit.ly/15cH1lj
Presentation | JANET/JISCMRD storage requirements workshop, Paddington 2013 | http://bit.ly/12QFu9S
Presentation | JISCMRD benefits evidence workshop, Bristol 2013 | http://bit.ly/ZXE09Y
Presentation | JISCMRD progress workshop, Aston 2013: training | http://bit.ly/11t3Lg0
Presentation | JISCMRD progress workshop, Aston 2013: agent of change | http://bit.ly/13NVIgH
Presentation | JISCMRD progress workshop, Aston 2013: storage | http://bit.ly/19Juixf
Report | Register of programme engagement at UH events: interviews (~60), meetings, seminars, workshops; total attendance 400, est. 200-300 individuals | Appendix B
DMP | 10 data management plans, facilitated by RDM champions and Research Grants Advisor | limited public access
Report | 6 project manager’s reports to Steering Group | no public access
Report | Benefits report | http://bit.ly/19V1rWS
Report | Final report | http://hdl.handle.net/2299/13636

Conclusions

There are many conclusions that could be drawn from the project. These are the headlines:

  • JISCMRD has been a success at UH.
  • The RDTK project has made an impact in awareness raising and service development, and made good inroads into professional development and training. There are good materials, a legacy of knowledge and a retained group of people to sustain and develop the learning.
  • We believe the service-oriented approach shows that better technology can facilitate better RDM, and that the project has been an effective Agent for Change.
  • We also understand that advocacy and training are as important as technology in bringing about cultural change.
  • Funding-body policy and the implications of the ever-increasing volume of data are understood. The business case is clear: the University cannot afford not to invest in RDM.
  • JISCMRD phase 2 has been an effective vehicle for knowledge transfer and collaboration. It provided an environment in which a new and complex discipline, and the many interacting, conflicting, seemingly endless issues therein, could be explored with common cause and mutual support.

Recommendations

JISCMRD activity should continue, and try to reach the part of the research community that is least able to adopt RDM best practice without assistance, and that won’t do so as a matter of course. A profitable strand for JISCMRD3 would be Collaborative Services. Appropriate services would include joint RDM support services, or shared specific services such as regional repositories (including DOI provision) or shared workgroup storage facilities. Institutions with advanced RDM capability could play a mentoring role. Another key strand would be Benefits of Data Re-use: gathering examples of innovative data use, and of academic merit and reward for individual data publishers.

The DCC should continue in its institutional support role. It should consolidate its DMPonline tool toward a cloud service, with features to allow organisational branding and template merging. It should place new emphasis on the selection and publishing of data, with a signposting tool for Tier 1 and Tier 2 repositories for subject-specific data, including selection criteria, metadata requirements, and citation rates.

Opportunities for organisations to learn from each other and establish collaborations, which have been effective at JISCMRD2 workshops, should continue to be facilitated in some way. In addition, more attempts should be made to reach researchers directly in order to demonstrate the potential personal benefit of good RDM.

The JISC should continue to pursue national agreements via the JANET brokerage. These negotiations should be widened beyond Infrastructure as a Service to include RDM Applications as a Service (RAaaS), for example Backup as a Service, Workgroup Storage, and Repository as a Service. The goal should be to achieve terms of use which satisfy institutional purchasing, IP and governance requirements, whilst allowing for acquisition by smaller intra-institutional units, from faculty down to workgroup level. JISC GRAIL (Generic RDM Applications Independently Licensed) might be a suitable brand for this activity. In addition, JANET should press cloud vendors for an alternative to ‘pay-by-access’ pricing for data, which is a barrier to uptake in fixed-cost project work.

Oct 29 2013
 

It has been a while… but there has been plenty of activity following the conclusion of our two JISCMRD projects in June. Here goes for a quick roundup:

We have continued to spread the message by working at as many levels as we can get access to. We have a foothold in Generic Training for Researchers, the CPD programme from the Staff Development Unit, and Research Institute induction programmes. Because RDM is not a very appealing prospect and many people prefer targeted support, we have added specific training for tools like DMPonline, Document Management, and Encryption to the broad-spectrum RDM tonic. At the senior management level we have made presentations to the Research Committee, the Chair of Board Designate and the Deputy Vice-Chancellor.

The trial of https://fileexchange.herts.ac.uk/ has been a success. This will soon be an ‘officially’ supported service once we migrate it from its current position running on RackSpace cloud servers to our own datacentre (you can use it as of now anyway). FileExchange allows multi-gigabyte files to be ‘dropped off’ and ‘picked up’, and automatically disposes of them after 7 days. In many cases this answers the requirement to share data with a collaborator, where the nature of the share is a transfer rather than live co-working.

We are also continuing to explore other ways of weaning researchers off the use of desktop storage, unregulated storage offers such as Dropbox, and fragile media such as USB sticks, by making improved central storage available. Working with Prolinx (www.prolinx.co.uk), who are a UH technology partner and JANET brokerage infrastructure provider, we hope to provide a storage solution that supports greater autonomous administration for research groups, backed by tiered levels of service, including backup and audit. Improved working data storage is one part of a new Research Storage offer, which also includes a seat at our enterprise Document Management System, which proved popular with Health researchers during the JISCMRD project and has been rolled out extensively since. Document Management is not an appropriate tool for storing large amounts of already structured data, but it is a great system for recording the conduct of a project, for when a project uses common desktop formats to store data, or in particular when a very high standard of data management and accountability is required.

Moving from working data to the end of the research data lifecycle, we are developing our institutional repository http://uhra.herts.ac.uk to support very long-term storage of datasets. DSpace consultants @mire (www.atmire.com) are working to attach Arkivum’s (www.arkivum.com) A-Stor cloud-based digital archiving service to UHRA. A-Stor is an ultra-robust, three-copy tape system. We aim to support different data journeys including Open Data, Embargoed, and access-by-criteria for sensitive data. A-Stor offers the lowest storage cost on appropriate terms, at around £200-£300/terabyte/year, which is about half the best price for data stored on disc-based storage. This is an important factor when there may be a requirement to retain very large volumes of data, toward petabytes within 3 to 5 years, for 10 to 30 years.
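
To see why that rate matters at scale, here is a minimal sketch of the arithmetic, assuming a flat mid-range rate of £250/TB/yr (an illustrative figure, not a quote):

```python
# Rough projection of long-term archival cost at the quoted A-Stor rate.
# Assumptions: flat £250/TB/yr (mid-range of £200-£300), no growth in
# volume and no price changes over the retention period.

rate_per_tb_year = 250      # £ per terabyte per year (assumed mid-range)
volume_tb = 1024            # 1 PB expressed in terabytes
retention_years = 10        # lower end of the 10-30 year requirement

total_cost = rate_per_tb_year * volume_tb * retention_years
print(f"1 PB for {retention_years} years: £{total_cost:,}")  # £2,560,000
```

Even at the cheapest tier, petabyte-scale retention is a seven-figure commitment, which is why the per-terabyte rate dominates the planning.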

Research Data Management is recognised as an important element of both pre- and post-award research support and the impetus generated by the JISCMRD work is being taken forward in that context. We have started on new arrangements and workflows to bring together all the elements of research provision across the University into a more cohesive Research Support Service.  The idea will be to use Information Managers to broker with Principal Investigators, consult with service specialists, and agree a kind of service level agreement for necessary support for each research project, including non-funded activity.  With no new money identified as yet this is of course a challenge, but we are still fairly well placed to deliver on these new systems and services within the constraints of existing resources, and intend to do so.

The RDM microsite at http://rdm.herts.ac.uk/ is the new focus for all our advice and training materials. Check it out – it is full of great stuff! In addition, Office of the Chief Information Officer (OCIO) staff are still available to address research group forums or particular RDM problems if you need them. Contact Bill Worthington, w.j.worthington@herts.ac.uk

 

Jul 09 2013
 

At the 2013 National Astronomy Meeting in St Andrews, I presented a review of the UHRA (the University of Hertfordshire Research Archive), including the plans to preserve data in very long-term cloud storage managed by Arkivum. I also described the recent discovery of photographs depicting the development of the Bayfordbury observatory, and how these and the observational data will be released for reuse via the UHRA. The audience was a mere 10 people: half interested in the history of astronomy, and the other half invited speakers from museums (the Science Museum in London and National Museums Scotland) and from Jodrell Bank, where they are petitioning for Heritage Site status.

The aim of the session was to discuss the strategy that the Royal Astronomical Society (RAS) should take with respect to preserving astronomy heritage for the future, but focused on historical artefacts from the 1900s and what should be kept and how for the next 100 years.  It was enlightening that the issues that surround objects such as telescopes, computers, instruments, and software, are similar to those that we are used to with digital data.

Storage for the vast number of objects is an issue, with warehouses filling up and only 7-10% of objects being exhibited. Storage is expensive, objects have to be catalogued and looked after, and access to objects is difficult due to the overcrowding of storage spaces. This could describe both digital data and historical objects; the question of what should be saved, what can be saved, and where it should be kept is challenging.

The museums operate reactively, saving what they can when items are donated from private collections, families, and universities, or can be purchased from auctions. However, without documentation, an item may not be identifiable, repairable, or recognised as historically important. A grey box could be anything or nothing without evidence: who owned it, what was it used for, or something as basic as what it is and when it was made. This is the metadata that puts the item in context; without it, the object could be destroyed. The same is true of digital data.

While these issues of storage and metadata are important in astronomy heritage, retention is the major concern.  The audience recoiled at the prospect of retaining data for only 10 years as this is nothing compared to the 100-year timescale that they consider.  A hundred years. The idea that digital data would still be useful in 100 years is incredible.  It is understandable that photos, interviews, and videos, of people, places, and events would have value to future historians, but would astronomical observations also be useful?

We know that stars are born and die, that objects move, and this temporal information is crucial to science, but how could these data be preserved in a useful format?

Just 50 years ago, images from optical telescopes were recorded on glass plates.  These images show fine details of nebulae and galaxies, but cannot be used for modern scientific work as they contain no data on intensities, wavelengths, or spectra.  Newer electronic data from 30 years ago, when the Very Large Array (VLA) in New Mexico was commissioned, are still accessible and can be processed using the radio image processing software AIPS.  Recently the array has been upgraded, the software is now being maintained by users, and in another 10 years these data may not be processable.  It begs the question: if there are new images, should the old data be kept? And should we continue to keep only the unprocessed ‘raw’ data?

To continue to make these data useful, we also need to keep the software and process instructions, and ensure that current operating systems can run the software – is this reasonable and cost-effective for 100 years?  Perhaps if there are many data that would need the software then it would be worthwhile, but maybe it would be more beneficial to keep the data in a processed form.  The question then is who processes them, and how?  Some calibration methods are quite subjective.  With historical objects, the amount of information that should be kept is equally difficult; is it sufficient to keep the documentation and photos and discard the item itself?

The group described some objects as ‘Rich’, where there is obvious importance behind an object – this may be something worth considering.  The racket that Andy Murray won Wimbledon with this weekend would be a good example of a Rich object, but with digital data, the PI who requested the observation will think their data is far more important than everyone else’s.  In this respect we have a far more complicated choice to make for the future of digital data in astronomy.

It was also interesting that many museums and universities with lots of historical instruments keep catalogues of these objects.  These catalogues are not currently open access, and while you can ask if something is in a catalogue, you can’t search for an object or compare it with other sites.  It is a comfort that institutional repositories are making their data catalogues open access, as comparisons are a vital part of research.  There was discussion about making a national list that would pull together the catalogues – the fact that so many museums and institutes keep catalogues shows there is a demand for one. The same is likely to happen with data repositories, once subject communities show a demand for national data catalogues.

In conclusion, the results of the strategic plan produced by the RAS should provide some guidance for preservation selection criteria and retention periods.  I for one have learnt that those brassy, old-looking bits of equipment in our lab are worth keeping, and I’m going to get a sticker put on them saying to contact the Science Museum if they’re at risk of being discarded.

Jul 09 2013
 

One of the results of the DAF at UH was that researchers are open to training materials as long as they are not long-winded or too generic.  However, the results of my interviews in Science and Technology, and interviews in the other research institutes by our champions, show that the best practice for looking after working data, including the storage and sharing of sensitive data, is universal.  This means that although training based on this best practice is largely generic, advertising it as such will not attract researchers.

I have tested an ‘Introduction to RDM’ course as an hour-long session aimed at new research students in the Centre for Astrophysics Research (CAR).  As only first years are required to attend, there were only six students at this session last November.  As an introduction, the session included why RDM is important and a summary of the DMP topics, with a basic DMP handed out during the session for the students to complete.

The feedback was positive and all of these students appear to have benefited from a better understanding of back-up policies and the storage solutions available to them.

This was encouraging and we continued to plan an RDM session in our ‘Generic Training for Researchers’ (GTR) programme.  Herein lies our biggest issue with training sessions.  As the RDM introduction session is both broad and generic, its relevance is not immediately clear to researchers.  They are also very busy and cannot justify spending an hour in a session that may not give them enough information to make it useful.

Making the session longer would allow us to give more details, but it is still generic training.  We have also had little interest from research students as it is not compulsory beyond the first year.  We’re now considering a different name for the session, perhaps “planning and managing your data”, or something that can be identified as relating to the DMPs that researchers will recognise.

So our strategy to train researchers is to run staff development courses on the tools, attach topics to existing training sessions, and run a poster campaign to advertise the website so researchers can get the answers and examples themselves quickly and easily.  This resolves the issue of a ‘time-consuming training session’, while still getting our best practice advice across in other sessions.

For research students, we plan to include RDM twice annually in the GTR programme and in the department training programmes. Even if only first-year students are reached, we hope that it will spread by word of mouth to their peers, and within 3 years all of the research students will have had the training.  The change in students’ RDM behaviour will hopefully be noticed by their support teams, who will then also benefit from their students’ training; a secondary method of getting best practice advice to our researchers.

Finally, we will be rolling out training to the service and technical staff so that they can all support the tools and the researchers when it comes to RDM.

So that we can re-use the materials for all audiences, and so that future trainers can also target their RDM sessions, I have split our training into 18 topics and produced a table to help trainers choose which slides to combine for their session.  The slides for finishing projects are not ready yet as the guidance for preservation is still inconclusive, but the table below shows the scope of what the training will include.  The training will also include packages of examples for the research groups, which will make the training relevant when delivered in the departmental programmes.  These topic presentations will be recorded using Camtasia this autumn so that they can be watched by researchers online if they want a refresher; this may be preferable to reading a how-to guide.

This table should help you select which slide packages to use for training different audiences

Jul 09 2013
 

We planned to produce discipline-specific example DMPs for our researchers.  However, as we prepared our best practice advice we learnt how many of the DMP answers for the working data stages are similar throughout the university irrespective of the subject, and that the main differences are in the funding-body requirements for archiving and preservation.

We therefore endeavoured to develop a DMP template for the University of Hertfordshire that would stand alone, cover the full life cycle of research, and not require a great deal of extra information on top of what is already answered by the researcher in other funding-body templates.  We are conscious that researchers are not inclined to repeat themselves and that by limiting the answers that are unique to the UH template, they are more likely to complete an additional template.

We therefore began by comparing the DMP questions within the existing RCUK templates on DMPonline to the checklist in our data policy – this checklist has subsequently been removed from the UH data policy in favour of a requirement to complete the DMPonline UH template.  We found that 95% of the UH checklist was covered by one RCUK template or another.

We therefore decided to include the 50 questions that were also in the RCUK templates, adding only contextual information at the beginning and four questions unique to UH which focus on file naming conventions and resources for computing.

We have sent the draft UH template to our champions to gather multidisciplinary advice before we upload our template this summer, by which point our website will also be finalised and published.  Our main concern with this template is that it should be sent post-award to our document management system (DMS), where it will be stored in perpetuity.

We have already received a number of requests for help with DMPs, and our champions have been approached about collaborations based on their basic knowledge of DMPs, all of which suggests that training sessions based on DMPs will be popular and that we should hurry to get our template in place on DMPonline before too many of our researchers complete other templates.

Jul 09 2013
 

We are not the first institution to produce a website of advice for our researchers, and we won’t be the last.  We already have in place the UH public website and two intranet resources: StudyNet, where students and staff communicate about courses and where information about research is available, and StaffNet, which gives information on policies and research services such as the intellectual property and contracts office (IPACS) and the research grants office.  It struck us that while much of the information related to good RDM is available on these sites, only one site is openly available, and the information relating to remote access, for example, is only available on an internal site.

We therefore decided that our advice and guidance would be best placed on the UH public website.  This does limit the look of the RDM site as we have no control over colour schemes or formats, and we have limited choices for the layout of case studies and the advice.  Hopefully, future iterations of the UH website will include more flexibility for its micro sites and we will be able to include dynamic content.

We chose to include as much information and advice as possible so that, if people are not available for one-to-one assistance beyond this project, sufficient advice would be on the site.  We currently have 50 pages covering 18 RDM topics as well as additional pages on governance, training, and examples.  There are 6 main sections covering the RDM life cycle – planning, starting, working, and finishing – as well as training and legal issues.  These sections were chosen to cover an equal number of topics and to make sensible splits in the life cycle.  The training materials are also divided into the four RDM life cycle sections.

The site is written in a relaxed tone with language which is not overcomplicated so that it is useful to researchers, research students, and support staff. We are now concentrating on open images to illustrate the site and supporting guides for the tools, whilst getting feedback on the content of the site from our stakeholders and all of the contacts that we have made during our project.  This includes collecting more case studies and getting authorisation to publish those that we have already written up.  We are now hoping to publish the site by the end of July at the same time as publishing our UH DMP Template.

Update May 2014 : the re-branded RDM pages are now available at http://bit.ly/uh-rdm

Jun 21 2013
 

The Service Oriented Toolkit for Research Data Management (RDTK) and Research Data Management Training in Physics and Astronomy (RDMTPA) projects were co-funded by the JISC Managing Research Data Programme 2011-2013 and the University of Hertfordshire.

Our draft final reports are available below. Both reports have been through one iteration and found to be largely fit for purpose by our Steering Group. There will be gremlins, but the reports can be made available for comment now.  The final versions will appear here in due course.

RDTK final report v05 (updated June 2014)

RDMTPA final report v04 (updated June 2014)

We owe thanks to all the participants of JISCMRD phase 2 who shared their experience and knowledge in a truly collaborative effort. This shared experience and the Digital Asset Framework survey results published by several projects show close commonality, so we believe the learning delivered in these reports, which we think is considerable, will be applicable and of use across the sector.

My thanks go to everyone involved in UH RDM project team, who worked with commitment and humour in the face of occasional chaos. To save data 😉 I am not going to name them here. You will find them scattered throughout the project’s blog posts. It was a privilege and an education to lead them.

We welcome feedback from JISCMRD colleagues or questions from newcomers to the field, and trust both can benefit from something they find in our project outputs.

 

Jun 04 2013
 

This is a follow-up to my blog post The cost of a bit of a DDUD, which examined the total cost of ownership (TCO) of a network-attached storage device operated by a research group. The TCO in that case included a malfunction and repair, but no data loss.

In this post we go on to put some numbers on an actual data loss event.

Here is the context: Research Group A does analysis of anonymised longitudinal data supplied to them by collaborators elsewhere in the UK. The data are relatively large (2TB) and don’t pass through the network very well, even at intra-JANET speeds, so it is their practice to acquire the data in large chunks (100-250GB) using physical media and keep it locally, attached to the compute machine. They stored the source data, and new data derived by their work, on the same device: a desktop-quality four-disc array.

They did not keep a local backup. The reasons for this parlous circumstance were many: the original data could be re-acquired, albeit with some effort; they planned to deliver derived data to their colleagues offsite; they did not believe central services could provide them with enough networked storage; they were aware of RDTK and waiting for us to provide a better solution; they trusted their device not to fail.

The storage device went wrong. To get the most capacity out of their disc array they used a RAID0 configuration, in which data is striped across discs with no redundancy, so when one disc failed it effectively failed the whole device. When the unit was returned to the manufacturer under warranty, the data turned out to be irrecoverable.

To calculate the cost of this event we will consider the costs of purchasing, regular maintenance prior to failure, power (@£0.11/kWh), and the effort expended in reacquiring the source data. Then we will add the cost of recomputing the lost work. We won’t use a Power Usage Effectiveness factor since the device was kept on a desk in normal office conditions. Staff costs in this case are higher than we previously used, at £264/day, reflecting a common situation, in which a fairly senior researcher conducts the work and also maintains the equipment.

Capital:

8TB HD, 2 year warranty  = £600

Labour prior to failure:

purchasing, setup, acquire and load data, ~= 3 days;
regular maintenance/interventions over two years ~= 5 days;

Sum of effort = 8 days @£264 ~= £2112

Power:

Nominal 43 watt, with use ~ 0.05 kW x 24 hr x 350 days x 2 yrs ~= 840 kWh
840 kWh x £0.11/kWh = £92

Labour to replace device and reload original data:

local effort to recover data = 5 days;
contact vendor, arrange replacement part, recommission and reload source data = 3 days;
repeat data transport costs £150;

Sum of effort = 8 days @£264 ~= £2112

Labour to repeat lost research:

Data preparation/ pre-processing source data = 5 days
Research time = 40 days

Sum of effort = 45 days @£264 ~= £11880

Analysis

Let’s see what these numbers mean. If the device had not failed and Research Group A had gone on to fill 8TB in the two-year warranty period, then the TCO would have been: 600 (purchase) + 92 (power) + 2112 (8 days) = £2804; 2804/8/2 = £175/TB/yr

In the event of the failure, the effort in trying to recover data and eventually having to repeat their research adds another 53 days, which for 2TB brings the TCO over the same period to 2804 + 150 + 2112 + 11880 (8 + 45 days) = £16946; 16946/2/2 = £4237/TB/yr

Ouch: TCO with failure = 24 x TCO without a failure.

There is some good news. Most of the derived data was copied to a collaborator shortly before the outage, saving the 40 days of research time.

Nevertheless, the TCO by the time they were back to their original position, ready to continue work, was actually: 2804 + 150 + 2112 + 1320 (8 + 5 days) = £6386; 6386/2/2 = £1597/TB/yr. (This is consistent with the costs calculated in the previous post.)
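
The sums above are easy to check mechanically. Here is a minimal sketch reproducing them (all figures are taken from this post; nothing new is assumed):

```python
# Reproduces the TCO sums above; all figures are from the post.
DAY_RATE = 264       # £/day for a senior researcher
CAPACITY_TB = 2      # data actually held
PERIOD_YEARS = 2     # warranty period

capital = 600                         # 8TB desktop disc array
power = 92                            # ~840 kWh @ £0.11/kWh
baseline_effort = 8 * DAY_RATE        # purchase, setup, maintenance = £2112

# No-failure case, assuming the full 8TB had been used over two years:
tco_ok = (capital + power + baseline_effort) / 8 / PERIOD_YEARS
print(f"TCO without failure: £{tco_ok:.0f}/TB/yr")            # ~£175

# Failure case: recovery effort, repeat transport, repeated research.
recovery = 8 * DAY_RATE + 150         # reload source data + media transport
repeat_research = 45 * DAY_RATE       # 5 days prep + 40 days research = £11880
tco_fail = (capital + power + baseline_effort + recovery + repeat_research) \
           / CAPACITY_TB / PERIOD_YEARS
print(f"TCO with failure:    £{tco_fail:.1f}/TB/yr")          # ~£4237
```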

Learning

Beyond the plain figures, which are a rare commodity in RDM, there is a lot of learning in this and the previous blog.

  • RAID0 should carry a very large health warning. I’d go so far as to say the only place it should ever be used is as a component in RAID10. If you have to use a bit of DDUD, never use RAID0 when you can mirror (RAID1) or stripe with parity (RAID5). For the sake of giving up one disc’s worth of capacity in a four-disc array (leaving 3 out of 4 discs’ worth), the fault tolerance is so much better and the risk so much lower.
  • We can see why the Distributed Datacentre Under the Desk is so pervasive in research practice. Less than £200/TB/yr compared with more than £800/TB/yr for tricksy cloud storage? The low cost of not doing the job properly looks very attractive, unless you have been bitten already.
  • The cost of a problem when one occurs is, however, a big deal. Almost insignificant in hardware terms, it is all about the human investment required to fix or redo the research. In the case above this was about two weeks of a senior researcher’s time. It could have been 11 weeks, more than a quarter of a person-year, more than enough to miss a publication deadline for, say, the Research Excellence Framework assessment.
  • We see a professional, committed research group trying to balance money, time and risk. They were moving toward a robust position but living with the expedient as they travelled there, and they lost out to the fates. This is the position most researchers are in, and it clearly underlines the need for better training and learning resources with regard to working data management.

I would like to thank Research Group A for their honesty and cooperation. This data loss event added £3582 (which could have been £14,142) and a whole lot of stress to the conduct of their research. It was good of them to share this for the benefit of the RDM cause. I am happy to report that they are now in a much more robust position. They are using RackSpace Cloud Files as their primary store, and moving data back and forth from their working machine as required. The RDM team will continue to work with these researchers after the end of JISCMRD, primarily to see how the use of an off-the-shelf cloud service works in an HE environment, but maybe also to take them to the next logical step: move the compute to the data, and do it all in the cloud.

To conclude, I am aware that the universal optimist prevails in research culture and no amount of doom-mongering is going to change that. But I can’t quite see how to spin this unequivocal evidence in terms of a benefit. The best I can do is to return to the cost-reliability figures (Boardman, et al.) in Steve Hitchcock’s blog, Cost-benefit analysis: experience of Southampton research data producers, and say:

the benefit of using UH networked storage is that the risk of data loss is tiny compared to not using it, and the benefit of using cloud storage is that the risk of data loss reduces to practically nil.

Not compelled to spend the cash? Ask Research Group A.

May 10 2013
 

In between procrastinating over final reports and filling in odd gaps in our new RDM advice, we are preparing some material on the cost of a data loss event. This will arrive in due course, but I thought it would be useful to precede it with a consideration of the cost of simply owning a typical storage device operated by a typical small research group. I have experience of this in a previous life when I used to buy and maintain kit for several research groups in the Science and Technology Research Institute (STRI) at the University of Hertfordshire.

Dr Phil Richards, CIO at Loughborough, coined the acronym DDUD – distributed datacentre under the desk [reference needed]. In STRI, our bit of the DDUD was probably quite advanced for its time: half a dozen network attached storage devices (NAS); an Uninterruptible Power Supply (UPS), mainly to protect against spikes in the office power supply; a partitioned section of an office, otherwise used as a machine graveyard; and a domestic air conditioning unit (AC).

Each NAS had four discs configured in a RAID5 array, so the mean time between failures (MTBF) for discs in a NAS was roughly a quarter of the single-disc figure. Consequently, my experience was that every NAS suffered a disc outage at least once in its three-year warranty period. We tended to retire them to non-essential use after the warranty due to the relatively high cost of replacement parts and increased rate of failure. When a disc failed, the RAID5 array protected our bytes and allowed us to continue working in a degraded state (though at considerable risk) while the replacement part was acquired, but the required downtime to replace the disc and rebuild the array (the discs were not hot-swappable) was quite an inconvenience. I mention all this not to prove my heritage as a geek, but to illustrate the point that maintaining local storage involves a lot of faffing about that needs to be accounted for.
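
The failure arithmetic is worth a moment. A minimal sketch, assuming a datasheet per-disc MTBF of 500,000 hours (an illustrative figure; field failure rates are notoriously higher than datasheets suggest):

```python
# Back-of-envelope disc failure arithmetic for a four-disc NAS.
# The 500,000-hour per-disc MTBF is an assumed datasheet figure;
# observed failure rates in the field are typically much higher.

disc_mtbf_hours = 500_000
n_discs = 4

# With four discs spinning, *some* disc fails ~4x as often as one disc.
array_disc_mtbf = disc_mtbf_hours / n_discs      # 125,000 h

warranty_hours = 3 * 365 * 24                    # three years ~ 26,280 h
expected_failures = warranty_hours / array_disc_mtbf
print(f"Expected disc failures per NAS in warranty: {expected_failures:.2f}")
# ~0.21 on paper; in practice we saw at least one outage per NAS.
```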

So, to the calculation. I am going to use the direct modern descendant of our NASs for capital costs; use the pay scale of a mid-career researcher for the labour cost (I was actually one, then two, whole pay scales higher at the time); and pick a Power Usage Effectiveness (PUE) ratio of 2. This means that the cost of running the room, UPS and AC is the same as that of the IT equipment, which is probably an underestimate, but it will do. I don’t know what our power costs are, so I will use a small business tariff that I happen to know about. I haven’t included a share of the capital cost of the UPS or AC.

Capital:

4 x 1TB NAS, 3 year warranty, £2260 – SnapServer DX1 Enterprise

Labour:

purchasing – 2 days (market evaluation, selection, vendor communication, requisition, payment);
delivery and setup – 2 days (goods inwards, commissioning, familiarisation, build RAID, testing, rollout to users);
regular maintenance/interventions – 1 hr@month ~= 36 hrs ~= 5 days;
1 disc failure intervention – 2 days (diagnose, contact vendor, arrange replacement part, swap disc and rebuild RAID, check data integrity)

1 day at UH6 = £38,500 per annum / 220 working days per year = £175/day

Sum of effort = 11 days ~= £1925

Power:

Nominal 80 watt, with use ~ 0.1 kW x 24 hr x 350 days x 3 yrs ~= 2500 kWh; x 2 PUE ~= 5000 kWh

5000 kWh x £0.15/kWh = £750

Total cost of ownership:

2260 + 1925 + 750 = £4935 for a RAID5 capacity of 3TB

= £1645 per terabyte year! Ouch.

You could argue that a RAID5 desktop-attached device could be acquired for ~20%-25% of the cost of a NAS, bringing the cost down to nearer £1000/TB/yr, but I would suggest the attendant risk of failure is not worth considering.

Even subject to a 100% margin of error, this means the cost of owning a bit of a DDUD is at least as much as, but probably twice, that of premium-rate cloud storage such as RackSpace Cloud Files or Amazon Simple Storage Service. And between 2 and 4 times the cost of storage in our own data centres.

Sticking with wild estimates, suppose we could consolidate 1PB of research data (~50% of our holdings) off the DDUD and into an efficiently managed hybrid cloud infrastructure @ £800/TB/yr? 1024 x 800 = £819,200.

1PB ~ £820k per annum, but you would save twice this much in distributed, unseen costs across the university. Net saving: close to £1 million per annum.

Someone check my figures please. We could all have new iMacs, free coffee, a well-resourced Research Data Management Service even.
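
For anyone taking up that invitation, here is a minimal sketch of the sums in this post (figures as given above; the power term is rounded the same way):

```python
# The sums from this post, for checking; all inputs as stated above.
capital = 2260                    # 4 x 1TB NAS, 3-year warranty
labour = 11 * 175                 # 11 days at £175/day = £1925
power_kwh = 5000                  # ~0.1 kW x 24 hr x 350 d x 3 yr, x2 PUE, rounded
power = power_kwh * 0.15          # £750 at the small business tariff

tco = capital + labour + power
print(f"TCO over warranty: £{tco:.0f} for 3TB usable (RAID5)")    # £4935
print(f"Headline figure:   £{tco / 3:.0f} per usable terabyte")   # £1645

# The wild estimate: 1PB consolidated into hybrid cloud at £800/TB/yr.
print(f"1PB in hybrid cloud: £{1024 * 800:,} per annum")          # £819,200
```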

If I have the sums right, there is an undeniably large amount of wasted money to add to all the reasons why we should be rationalising and centralising research data storage. The problem is, the waste is distributed and diluted – and the solution looks too big to countenance. We need to find a way to sell a research data storage service as a benefit, not a cure.

Apr 08 2013
 

Four delegates reporting on two projects, using three posters, three presentations and two demonstrations. And we still had time to come away with more useful collaborations to pursue!

The roundup workshop was a great way to see how far we have all come in 18 months, reflect that this is still just the beginning for Research Data Management as a professional discipline, and that JISCMRD has given all of us involved a head start, not to mention new opportunity.

Here are our presentations, posters and related posts:

Research Data Management Training in Physics and Astronomy Presentation (PDF, 1.7 Mbyte)

Research Data Management Training in Physics and Astronomy Poster (PDF, 440 Kbyte, commended in the poster competition)

Research Data ToolKit (@herts) Document Management Poster (PDF, 2.5 Mbyte)

Research Data ToolKit (@herts) Adventures in storage: towards the ideal Hybrid Cloud (PDF, 1 Mbyte)

Research Data ToolKit (@herts) Agent of Change: interventions and momentum (PDF, 2.9 Mbyte)

Research Data ToolKit (@herts) Poster (PDF, 1.1 Mbyte)

for JISCMRD toolkit:

Research Project File Plan

Comparison of ‘open’ licenses

ZendTo file exchange: a hybrid cloud implementation

 

Apr 04 2013
 

We have built an installation of ZendTo, an open-source system for transferring large files over the web, written by Julian Field of the University of Southampton and used at Southampton, Essex and Imperial College.

See http://fileexchange.herts.ac.uk/

The requirement for this system came from researchers who were trying to use email to move files larger than about 10MB, resulting in burst mailboxes or returned mail.

The deployment allowed us to test elements of the hybrid cloud approach. We used a Cloud Server from RackSpace, and we were able to integrate it with our LDAP service over SSL. This was important because it served to demonstrate that we could use local authentication with external services, and acted as proof of concept in that regard for a number of other University developments. In addition we were able to configure it to act with authority within our email domain, which has proved problematic with external servers in the past. The basic system, which uses the ubiquitous LAMP (Linux-Apache-MySQL-PHP) stack, was straightforward to install, and only the integrations above proved in any way burdensome (mainly in identifying and co-opting the appropriate network or system administrator). We will publish a ‘what to look out for’ guide in due course.
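
The ZendTo side of that integration is PHP configuration, but the underlying pattern – a cloud-hosted server validating users against the institutional directory over SSL – is simple to sketch. Here is a minimal illustration in Python using the ldap3 library; the hostname, port and DN layout are placeholders, not our actual directory settings:

```python
# Hybrid-cloud authentication pattern: a server hosted with a cloud
# provider validates credentials against the institution's LDAP
# directory over SSL. Hostname and DN layout below are placeholders.
from ldap3 import Server, Connection

def authenticate(username: str, password: str) -> bool:
    """Attempt a simple bind as the user; success means valid credentials."""
    server = Server("ldaps://ldap.example.ac.uk", port=636, use_ssl=True)
    user_dn = f"uid={username},ou=staff,dc=example,dc=ac,dc=uk"
    try:
        # auto_bind raises an exception if the bind (login) fails
        conn = Connection(server, user=user_dn, password=password, auto_bind=True)
        conn.unbind()
        return True
    except Exception:
        return False

print(authenticate("jbloggs", "not-a-real-password"))
```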

ZendTo looks like a good in-house alternative to services such as yousendit.com and mailbigfile.com. In the context of RDM requirements it mitigates the need for a full shared file system in some cases, because the ‘shared area’ requested by a research group is often used only for transfer of data between collaborators. It has the advantage of automatically disposing of what is transitory data after a short period, and in our system the storage container is elastic, so that if we were to generate significant demand we could meet it.

Although we have talked about authentication, this is only needed for added-value features; the system works perfectly well without the sender or recipient being logged in, using an exchange of tokens via email (so long as one of them has an @herts.ac.uk email address). So the foremost advantage from an administration point of view is that external users do not need to be managed.

We are using a fully managed cloud server from RackSpace (~£1500/yr). This carries a £65/month premium, which is not really necessary for such a simple installation. RackSpace is comparable on cost with Amazon Web Services but, based on prior experience, comes with a superior support offer. Because we chose not to use a Content Delivery Network, our data resides in a datacentre in London.

If you have an @herts.ac.uk address you are welcome to try the system, without logging in, to send files (or rather, ‘drop them off’). If you are external to the university you can also drop off files for pickup by university staff provided you know their email address. If you just want to try the system, please use it to drop off  files for rdmteam@herts.ac.uk.  We welcome feedback whilst we plan a sustainable service around this system.

http://fileexchange.herts.ac.uk/

Thanks again to Julian at http://zend.to for a great piece of kit.

Apr 04 2013
 

In Work Package 6 – Review data protection, IPR and licensing issues – we have undertaken a thorough review of the licences commonly used for ‘open’ access to research information. The review takes in Creative Commons, Open Data Commons, the Open Government Licence, the UK Data Archive licence and others. Considerations of when to use each licence and ‘what to watch out for’ are included, together with a Glossary of Terms taking in copyright and copyleft.

The Comparison of Open Licenses can be downloaded from http://research-data-toolkit.herts.ac.uk/document/comparison-of-open-licenses/  (PDF, 200kB).

The main intention was to find a license for University of Hertfordshire datasets that was recognisable to the research community but consistent with our data management policies. The work drew from, and fed back to, the institution wide deliberations of a working group on open access.

We found that Creative Commons Zero, the most open (or unconstrained) licence, which waives all copyrights and related rights, is not compatible with the University’s intellectual property policy. We settled on Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales (CC BY-NC-SA) for our open data. Further discussion may change this to Creative Commons Attribution-NonCommercial 2.0 UK: England & Wales (CC BY-NC), because we recognise that the cascade of licences caused by ShareAlike can be problematic.

We also anticipate having some datasets that are available on demand rather than being completely open and directly downloadable via our repository and propose to use a UH derivative of the UK Data Archive licence in these cases.

Apr 04 2013
 

In Work Package 3 – Document Management Pilot – we have scanned and loaded legacy documents into an electronic Trial Master File (eTMF) for clinical work carried out by the Centre for Lifespan and Chronic Illness Research (CLiCIR). This has been a very successful piece of work in terms of the engagement it has promoted throughout our health-related research community, and has also led to the development of a reusable template folder structure (or File Plan).  Despite a natural tendency for researchers to want free-form arrangements, there are a lot of cases where a consistent structure is desirable. Examples might be where there is a requirement to keep a ‘copy of record’ of the conduct of an entire project; or, more generally, where version control and check-in/check-out in a multi-user file space is required.

The file plan is appropriate for researchers, finance managers and administrators to use for all project documents, including research data, where this is kept in desktop application files and other free standing forms, as is often the case in small research endeavours outside the big Science and Technology areas.

The Research Project File Plan can be downloaded from http://research-data-toolkit.herts.ac.uk/document/research-project-file-plan/ (PDF, 400kB)

First developed by identifying the record types and activities from an existing paper Trial Master File, the file plan was then aligned to the JISC Business Classification Scheme and extended to be appropriate for any research project. Consultation with various research groups across the University allowed consensus of terminology to be formed, and confidence gained that the structure would in fact be widely applicable.

The File Plan has several advantages, not least that it allows researchers to more easily find information not only on their own project, but when working across projects. Structures for managing data and supporting documents typically grow organically and make finding information difficult over time, or for parties who are not the primary contributors of content. Having a predefined methodology for classification also means that preservation and retention policies can be aligned (or even applied) to sets of documents and files and those files can be migrated to other systems more easily for publishing.

Researchers are now being offered the use of the University’s corporate Document Management system (LiveLink from OpenText), in which we can deploy the File Plan as a default template for each research project. To date, four active research projects have started to use this facility, in addition to the original eTMF, with a further eight engagements under way.

 

Dec 06 2012
 

It has been a while since I have committed one of these management reports to the blog, but this one seems to gel well and wrap around our recent articles, not to mention providing a linkfest opportunity, so it seems worth it this time around. Enjoy it if you can.

WP2 Cloud Services Pilot

The focus for this activity has been in looking at options for a collaborative workspace, in which researchers from within and without University of Hertfordshire can share data. Several threads have been pursued:

iFolder and SparkleShare (Dropbox alternatives). This work complements that done on OwnCloud by the Orbital project.  iFolder originated from Novell and runs on SUSE Linux. We looked at this because its provenance and requirements seemed to match our local infrastructure; however, the feature set is inferior to more modern alternatives and it looks like it may be moribund. SparkleShare is open-source software which puts a team-sharing layer on top of the Git version control system. It is a good candidate for a cloud hosting service and looks promising, but it would require considerable technical investment to operationalise at UH. Further investigation is required.

DataStage. DataStage offers WebDAV and a web browser GUI, independent user allocation (albeit via the command line), and a bridge to data repositories. We have conducted tests running on desktop machines and within our virtual server environment. The release candidates we have tested are not yet stable enough to support an operational service. Development of the v1.0 release of DataStage, which has been talked about for several months, seems to have stalled.

SharePoint has returned to our thinking, since it is widely available as a cloud service and offers WebDAV, a web browser GUI, and version control. When combined with Active Directory services, it seems to offer a cloud service to complement our existing networked service in a hybrid service model. Further investigation is required.

The utility of the main offerings in the cloud files market is being assessed. This has been less a technical appraisal and more a review of the Costs (over and above the initial free offer), Terms and Conditions and options for keeping data within specified jurisdictions. Further investigation continues.

RDTK is attempting to increase the usage of centrally managed network storage at UH. We continue to regularly encounter researchers who don’t know how to use their storage allocation, or that they may request shared storage at a research group or project level (R: drive). We are producing new advice about, and encouraging the use of, the secure access client for our virtual private network, which is not well understood but gives much more effective access to the networked file store than the usually advertised method. We intend to offer a ‘request R: drive’ feature on the RDM web pages, and facilitate the adoption of this facility, which again is not known to many research staff.

A lengthy technical blog, Files in the cloud (http://bit.ly/R583If), and a presentation, A view over cloud storage (http://bit.ly/SB2cK8), bring the issues encountered in this work together. The presentation gave vent to the widespread interest amongst many projects at the recent JISCMRD workshop in the issues around the use of ‘Dropbox-like’ cloud services. I represented these voices in a session with the JANET brokerage, underlining the importance of nationally negotiated deals with Amazon, Microsoft, Google and particularly Dropbox for cloud storage, in addition to the infrastructure agreements already in place. A reflection on the JISCMRD Programme Progress Workshop (http://bit.ly/SdTqCz) refers to this encounter.

WP3 Document Management Pilot

A part time member of staff has been recruited to scan legacy documents into an electronic Trial Master File (eTMF) for work carried out by the Centre for Lifespan and Chronic Illness Research (CLiCIR). The work is progressing extremely well under the direction of the Principal Investigator. A journal of activity, issues and time and motion is being kept. After 8 x 0.5FTE weeks the first phase of scanning, covering the trial documentation, is all but complete. Only the original anonymous patient surveys remain. There are ethical issues and a debate about whether this remaining material is useful, publishable data to consider at this point.

This is proving to be a valuable collaboration between the researcher, the University Records Manager, and the EDRMS system consultant. The draft Research Project File Plan has been updated in the light of practical experience, and the work has attracted two further potential ‘clients’ from within the School of Life and Medical Sciences.

WP6 Review data protection, IPR and licensing issues

CT and SJ have begun reviewing literature on licensing data, including that from the DCC.

WP8 Data repository

An instance of DSpace 1.8 has been installed on a desktop machine with the aim of testing data deposit via the SWORD v2 protocol. Data deposit has been achieved using several mechanisms, including Linux shell commands, a simple shareware application, and the deposit component of DataStage. This latter piece was facilitated after generating interest at a programme workshop, whereupon a collaboration with other projects helped modify our instances of DSpace and DataStage so as to allow the SWORD protocol to work.
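
For readers who have not met SWORD: at its simplest, a v2 binary deposit is a single authenticated HTTP POST of a package to a collection URL. A minimal sketch using Python’s requests library; the endpoint and credentials are placeholders, and SimpleZip is the standard SWORD packaging identifier:

```python
# Minimal sketch of a SWORD v2 package deposit into a DSpace collection.
# Collection URL and credentials are placeholders for illustration.
import requests

COLLECTION = "https://dspace.example.ac.uk/swordv2/collection/123456789/2"

with open("dataset.zip", "rb") as package:
    response = requests.post(
        COLLECTION,
        data=package,
        auth=("depositor@example.ac.uk", "password"),
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": "filename=dataset.zip",
            "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        },
    )

# A successful deposit returns 201 Created with an Atom entry
# describing the new item.
print(response.status_code)
```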

SWORD deposit has also been demonstrated into the development system for the University Research Archive (UHRA – this is the old one). This is not working with DataStage as yet, as it needs further modification. Further progress awaits the roll-out of the new system, which in turn has been delayed due to its dependencies on our newly re-engineered Atira Pure Research Information System (UH access only).

A presentation DataStage to DSpace, progress on a workflow for data deposit ( http://bit.ly/Xvmidr ) refers to this work.

Inter alia, there was considerable interest generated by a blog post and submission to JISCMail with regard to organising a practical workshop on how to acquire Digital Object Identifiers for datasets. Since that time Data.Bris have minted their first DOI using the BL/DataCite API and service, and the Bodleian Library are expected to do the same shortly. The current position is that a workshop might be arranged as part of the JISC-sponsored BL/DataCite series. Simon Hodson is facilitating. The article DOIs for Datasets (http://bit.ly/QonFoN) produced the largest spike in traffic seen by this blog so far (175 page views, 55 bit.ly clicks).

Expressions of interest to publish datasets are beginning to be cultivated, including, for example, datasets of oral histories and historical socio-economic data.

WP9 Research Data Toolkit

Content for the Research Data Toolkit is progressing well, with all RDM team members contributing and refining the product.  We have decided to restrict the ‘toolkit’ brand for use with the project and adopt a more generic RDM brand for the published material and activity, so as to create a foothold for a sustainable service. ToolKit resources will appear at herts.ac.uk/rdm, which may redirect to herts.ac.uk/research-data-management or herts.ac.uk/research/research-data-management. We are still in negotiation with the University web team over a content delivery platform.

The content is being developed in a platform agnostic way as a set of self-contained pages of advice, which could be delivered under different overarching models, and/or re-purposed for print. There is still some debate as to whether to arrange the pages in groups by activity or by research lifecycle stage. The draft table of contents and sample content are under wraps for now.

WP11 Programme Engagement

RDM team members have made 5 presentations and participated in 14 days of programme events including:

  • DCC London Data Management Roadshow, London
  • BL/Datacite: Working with DataCite: a technical introduction, London
  • BL/Datacite: Describe, disseminate, discover: metadata for effective data citation, London
  • JISC Managing Research Data Programme / DCC Institutional Engagements Workshop, Nottingham (3 presentations)
  • RDM Training Strand Launch Meeting, London (1 presentation)
  • JISC Managing Research Data Evidence Gathering Workshop, Bristol  (1 presentation)

The project blog (including 4 new articles) has received over 800 visits and nearly 1500 page views in the previous quarter.

Other Activity: recruitment

The project has been recruiting three RDM Champions to work in the University’s research institutes for six months at 0.4 FTE. We were looking for an established member of each institute with sufficient experience to be able to quickly embed sustainable, good-practice RDM among their peers. The vehicle for this will be the objective of assisting PIs to prepare or improve a significant number of Data Management Plans within each institute. Recruitment has been only partially successful: one person started on 1 December in the Health and Human Sciences Research Institute (HHSRI); another is due, subject to final agreement, to start on 28 January in the Social Sciences, Arts and Humanities Research Institute (SSAHRI); and the remaining post, in the Science and Technology Research Institute (STRI), has not been filled.

Other Activity: data encryption workshop

We have produced a guide (http://bit.ly/QHyN2y), a blog post (http://bit.ly/XxDoEM) and a workshop (http://bit.ly/11rwLXA) on the encryption of sensitive data for sharing, transport, and security on removable media. Over 20 university staff participated in the first workshop, and we have a waiting list for the next date, which will be announced this month. The material was well received and the feedback, though not yet formally evaluated, looks positive. We equipped 6 researchers with large-capacity data sticks secured with TrueCrypt, and will evaluate their experience in February. We are encouraging the other attendees to try out the standalone encrypted container available via the toolkit blog (http://bit.ly/TKb8hG).
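The workshop material is built around TrueCrypt containers, but for those curious about the principle, here is a minimal sketch of symmetric file encryption using Python’s cryptography library. The file names are hypothetical, and this illustrates the idea rather than replacing the guide.

    from cryptography.fernet import Fernet

    # Generate a key once and keep it apart from the encrypted file;
    # whoever holds the key can decrypt.
    key = Fernet.generate_key()
    with open("container.key", "wb") as f:
        f.write(key)

    fernet = Fernet(key)

    # Encrypt a sensitive file before it goes onto removable media.
    with open("interviews.sav", "rb") as f:
        ciphertext = fernet.encrypt(f.read())
    with open("interviews.sav.enc", "wb") as f:
        f.write(ciphertext)

    # Decryption is the mirror image.
    with open("interviews.sav.enc", "rb") as f:
        plaintext = fernet.decrypt(f.read())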

Oct 19 2012

Attached to the University of Hertfordshire’s Data Policy is a handy DOs and DON’Ts guide to handling Personal and Confidential Information (PCI). Research data often falls under the definition of PCI, because it is ethically sensitive or has commercial value to the University or a sponsor. It probably won’t be a surprise to anyone engaged in JISCMRD that practice the guide describes as ‘unacceptable’ is actually common in the research community. Saving PCI on a non-University computer; using portable media devices to store or back up PCI; regular or unencrypted transfer of PCI via portable media – all of these happen… often.

Oct 17 2012

The hybrid cloud approach is being adopted in many organisations, particularly where high-quality local infrastructure is in place and likely to remain useful for a period of years, but needs extending or migrating. In these cases the cloud offers expansion of local facilities and a gradual route to full migration. The cloud element often takes the form of offsite failover systems or elastic storage to accommodate short- to medium-term changes in demand. RDTK has been investigating ways in which this extra storage capacity might be utilised at the University of Hertfordshire.

As I have blogged previously, our existing networked storage is underused by researchers. The key factors are the perceived lack of capacity and the difficulty of sharing data with external collaborators. In most cases both issues can be resolved with extra provision, beyond that usually allocated to research staff, but the way we do this is currently ad hoc. RDTK is looking at consolidating the process for this extra allocation in order to lower the barriers to uptake. In addition, we have looked at alternatives to the way we provide access to storage, including those we know our researchers are using, or would like to use.

In partnership with HRC3 we set up simple file storage, backup, database and Microsoft SharePoint facilities. These services are ‘in the cloud’ in the sense that they are off site (actually in Iceland), but they represent simple, functional services that have been available from Internet Service Providers since before the Cloud coalesced. The rest of this article focuses on the use of simple file space, though we found the same conclusions apply in general to SharePoint or database provision.

Cloud-attached file systems

Figures 1, 2 and 3 below show three connections to our test cloud storage hosted at the Thor datacentre in Iceland (Thor is operated by Advania in partnership with HRC3). We compared this with our in-house facilities, which are also available off campus via a virtual private network. To ease the discussion I will refer to the two parts of our newly formed hybrid cloud as CommerceCloud and UHStaffCloud.

Figure 1 – cloud CIFS volume mapped to a drive on Windows7 desktop
Figure 2 – cloud webdav volume mounted on a Mac OSX desktop

In Figures 1 and 2, CommerceCloud storage is attached to Windows 7 and Mac OS X respectively. This was achieved in the usual way of mounting remote volumes on these platforms, and was easy to do. CommerceCloud was distinguishable from UHStaffCloud only by performance, but the performance gap was considerable when the client was inside the local area network (LAN).

We tested CommerceCloud with both CIFS/Samba and WebDAV/HTTP protocols and found the latter performed much better, particularly for OS X, but both were significantly slower than the local UHStaffCloud. This is not unexpected: it is due to the delay, known as latency, introduced in moving data over long distances on a wide area network. CommerceCloud was 10 to 20 times slower than UHStaffCloud when working from my desk on our LAN. Again, this is unsurprising, given that we can move data around at 100 Mbit/s on the LAN.
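Part of WebDAV’s appeal is that it is plain HTTP underneath, so the same share can be scripted as well as mounted. As a minimal sketch – the endpoint and credentials below are placeholders, not the real CommerceCloud service:

    import requests

    # Placeholder WebDAV endpoint standing in for the CommerceCloud share.
    BASE = "https://dav.example.net/rdtk"
    AUTH = ("rdtk-user", "secret")

    # In WebDAV, uploading a file is an ordinary HTTP PUT...
    with open("results.csv", "rb") as f:
        r = requests.put(BASE + "/results.csv", data=f, auth=AUTH)
    r.raise_for_status()

    # ...and downloading it again is an ordinary GET.
    r = requests.get(BASE + "/results.csv", auth=AUTH)
    with open("results-copy.csv", "wb") as f:
        f.write(r.content)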

The situation is different when working at home or on some other remote public network. In this scenario both CommerceCloud and UHStaffCloud are remote from the user, so both suffer latency, though to different extents. Latency continues to influence performance, but bandwidth becomes the dominant factor.

Using a domestic cable connection with an effective download speed of about 20 Mbit/s, UHStaffCloud was faster than CommerceCloud by a factor of only 3 or 4. When uploading files, the two services were comparable, because the effective upload speed of the domestic connection was constrained to about 2–4 Mbit/s, well within the capacity of both target networks.

This suggests that over a typical domestic or public broadband connection of 4/0.5 Mbit/s (download/upload), the performance of CommerceCloud and UHStaffCloud becomes comparable.

In the special case where collaborators are working at different points on the JANET network, our own storage remains superior due to its advantageous connection to JANET, but we expect CommerceCloud, though inferior, to be acceptable.

Figure 3 – cloud storage via secure FTP client application

Figure 3 shows an alternative way of working. In this case we used FTP via FileZilla, which was the fastest FTP client we tested on both Windows and OS X. From within the UH network we consistently saw more than 25 Mbit/s to and from CommerceCloud (with a peak above 55 Mbit/s). This was often nearly as fast as UHStaffCloud, and never slower by more than a factor of 2. An equivalent comparison for off-campus use is hard to make because both UHStaffCloud and CommerceCloud were significantly limited by the available bandwidth, but we expect them to be comparable.

For people who can accept the use of a client application rather than desktop integration, or better still, are confident with the command line, FTP remains the best way to share files.

One factor to note is that we used files of between 2 MB and 120 MB for these tests – perhaps larger than the files most people will be sharing over a network. The reason we didn’t use small files is that the delay at the start and end of transfers, introduced by handshaking in the filesystem or application (not latency), was significant and would have made comparison difficult.
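For anyone who wants to repeat the exercise, a few lines of Python reproduce the essence of our transfer-rate tests. This is a hedged sketch: the host and credentials are placeholders, and it uses FTPS from the standard library rather than FileZilla.

    import os
    import time
    from ftplib import FTP_TLS  # FTPS, in the spirit of the client in Figure 3

    HOST, USER, PASS = "ftp.example.net", "rdtk-user", "secret"  # placeholders

    def timed_upload(name, size_bytes):
        """Upload one file and return the effective rate in Mbit/s."""
        ftps = FTP_TLS(HOST)
        ftps.login(USER, PASS)
        ftps.prot_p()  # encrypt the data channel as well as the control channel
        start = time.time()
        with open(name, "rb") as f:
            ftps.storbinary("STOR " + name, f)
        elapsed = time.time() - start
        ftps.quit()
        return (size_bytes * 8 / 1_000_000) / elapsed

    for mb in (2, 20, 120):  # file sizes spanning the range used in our tests
        name = f"test_{mb}MB.bin"
        size = mb * 1024 * 1024
        if not os.path.exists(name):
            with open(name, "wb") as f:
                f.write(os.urandom(size))  # incompressible test data
        print(f"{name}: {timed_upload(name, size):.1f} Mbit/s")

Because the timing brackets only the STOR command, it still includes the per-transfer handshake – exactly the overhead that made small files hard to compare.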

To some extent this work evidences what was already known: the latency of almost any remote connection makes it compare poorly with a local area network. However, the ideal situation of two people sharing data on the same LAN or even on a high speed WAN such as JANET is not the norm.

So to conclude..

Files in the cloud offer an opportunity to expand existing provision at the University of Hertfordshire. When collaborating from home, from abroad, or with colleagues at other institutions, the overall available bandwidth moderates the effect of connection latency, and in many cases the storage system can respond as fast as it can be accessed, regardless of its location.

The elephants in the room..

Dropbox, Microsoft SkyDrive, Google Drive. We know these consumer-level products are popular with a lot of researchers because they are just so easy. The work above underlines one of the reasons why these applications are so effective: latency would be an issue here too, but their desktop versions avoid it by using asynchronous background transfer. When you save, close or drop a file into the desktop folder associated with these applications, they synchronise, moving data to and from the cloud whilst you get on with something else. They are slow, slow, slow, but you don’t often notice, and they can use this ‘background time’ to do other good things like encryption and chunking, which allows only the parts of files that have changed to be transferred.
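To make the chunking idea concrete, here is a sketch of the bookkeeping involved. It is not Dropbox’s actual algorithm, just the general principle: hash fixed-size chunks, and only re-send the chunks whose hashes have changed.

    import hashlib

    CHUNK = 4 * 1024 * 1024  # fixed 4 MB chunks, purely illustrative

    def chunk_hashes(path):
        """Return one SHA-256 digest per fixed-size chunk of the file."""
        digests = []
        with open(path, "rb") as f:
            while True:
                block = f.read(CHUNK)
                if not block:
                    break
                digests.append(hashlib.sha256(block).hexdigest())
        return digests

    def changed_chunks(old_digests, path):
        """Indexes of chunks that differ from the previously synced version."""
        new_digests = chunk_hashes(path)
        return [i for i, d in enumerate(new_digests)
                if i >= len(old_digests) or d != old_digests[i]]

    # A sync client stores the digest list from the last upload; after the
    # next save, only the chunks reported here are re-transferred.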

So why even consider the old, ‘pre-Cloud’ technologies we have investigated above? Because the terms of use of Dropbox et al. remain problematic or unacceptable for some RDM scenarios (fewer scenarios than most policies would allow, but more than most researchers would consider). Brian Kelly’s and Joss Winn’s comments on Orbital’s very useful article about ownCloud begin the case against Dropbox nicely; I don’t intend to follow the trail of those arguments here.

One advantage of the methods we looked at above is that they sit relatively well with the storage and authentication systems currently found in Higher Education. When combined with smoother processes for setting up users, they offer a path of low resistance to improved services whilst staying within reach of our governance. This is why they remain important.

Until offerings such as ownCloud evolve into a scalable and robust ‘Academic Dropbox’, the old protocols used with cloud storage will still be useful.

Oct 15 2012

Call for interest in a DOIs for Datasets workshop.

Overshadowed by the subsequent trading of blows over the colour of Open Access, RCUK’s policy toward open data became more explicit in their announcement on July 16.

“and a statement on how the underlying research materials such as data, samples or models can be accessed”

Not if. How. At University of Hertfordshire we had already decided, in the context of our EPSRC roadmap, to extend our institutional repository to support datasets. A major aspect of making this work is the provision of Digital Object Identifiers for our data.

As a newcomer to the JISCMRD programme a year ago in 2011, I hope I would have been forgiven for thinking that the DOI piece of the MRD jigsaw was firmly in place; a given. I had grounds for this casual assumption. Witness: DOI was well established and seemingly uncontentious in the JISCMRD lexicon; UHRA, our own institutional repository, was littered with DOIs; they have been around since before the millennium; and, well, it is easy – a Digital Object Identifier is a widely used citation mechanism, a persistent, unique ID for a digital thing. This complacency was compounded by later experience: exposure to Data Dryad’s use of DOIs for datasets (tick), and then Louise Corti’s excellent presentations about data citation and versioning. Job done.

Right up until the point at which you begin to need one, DOIs look straightforward. However, as we approach the moment at which we will ask our researchers to begin publishing datasets in our repository, the hard questioning begins. At the first of the British Library DataCite workshop series (reported by Kaptur and data.bris), I began to see less clearly. Or at least, to feel like hyperopia had set in: the goal was still in sight, but the details in the foreground were not clear.

The questions began to pile up. How do we get DOIs for our datasets? Is there an API to a DataCite/BL service? Could, or should, the University of Hertfordshire mint DOIs? Would local minting consortia be more appropriate? What about the B-word – where is the benefit over the equivalent handle system already built into our repository and shared by umpteen thousand other DSpace installations? Panic in the detail.

Before this blog turns into a bleat, it is time to calm down and visualise the problem; this always helps:

Well, it helped me anyway.

In all seriousness, I think that most JISCMRD projects will eventually have to answer these questions and flesh out most of the lines on this mind map, particularly the detail over on the right-hand side. In all probability these issues are tractable, and it is just a matter of enough effort. But it seems sensible to share the problem if many of us are to be occupied by it. We have had some early discussions with the British Library, and with their encouragement I would like to propose a DOIs for Datasets workshop, over and above the continuing BL/DataCite series, specifically focused on how to acquire or mint DOIs for our datasets. The University of Hertfordshire would be pleased to arrange such an event if there is interest from enough programme members. The agenda would be dictated by demand, but we foresee some sessions already: the role of consortia vs national minting services; the service level agreement and obligations of a minting body; and an overview of existing services, APIs, scripts and other magic. The workshop would be held in London or Hatfield in early winter 2012/2013.
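As a taster for that last session, here is roughly what talking to the DataCite Metadata Store looks like from a script. This is a hedged sketch based on my reading of the MDS documentation: it uses the 10.5072 test prefix, and the credentials, DOI and URLs are placeholders.

    import requests

    MDS = "https://mds.datacite.org"   # DataCite Metadata Store
    AUTH = ("ALLOCATOR.UH", "secret")  # hypothetical datacentre credentials

    # 1. Register the dataset's metadata (DataCite XML, heavily abbreviated).
    metadata = """<?xml version="1.0" encoding="UTF-8"?>
    <resource xmlns="http://datacite.org/schema/kernel-2.2">
      <identifier identifierType="DOI">10.5072/uh-example-1</identifier>
      <creators><creator><creatorName>Example, A.</creatorName></creator></creators>
      <titles><title>Example oral history dataset</title></titles>
      <publisher>University of Hertfordshire</publisher>
      <publicationYear>2012</publicationYear>
    </resource>"""
    r = requests.post(MDS + "/metadata", data=metadata.encode("utf-8"),
                      headers={"Content-Type": "application/xml;charset=UTF-8"},
                      auth=AUTH)
    r.raise_for_status()

    # 2. Mint the DOI by binding it to its landing page.
    body = "doi=10.5072/uh-example-1\nurl=http://example.org/dataset-landing-page"
    r = requests.post(MDS + "/doi", data=body,
                      headers={"Content-Type": "text/plain;charset=UTF-8"},
                      auth=AUTH)
    r.raise_for_status()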

To register an interest in DOIs for Datasets please use the comment form below. If you feel moved to discuss the proposed workshop or any of the issues arising on twitter please use the #dois4datasets tag.

Aug 20 2012

An audit of research data holdings within University of Hertfordshire was conducted in the period May to July 2012.

The online survey (described in more detail here) was circulated to around 600 research staff, first via their regular monthly newsletter, with follow-up reminders sent by our information managers to schools and research centres, as well as via our continuing programme of RDTK awareness meetings and interviews. There were 67 responses, which represents 12% of those invited to take part. Most research-active disciplines were represented among the respondents, albeit with a strong showing from the STEM subjects.

The survey has brought insight into the extent of our research data. It allows us to estimate that we hold approximately 2 PB across the whole research landscape, a factor of ten larger than our current central provision. However, around 80–90% of this belongs to a very few research groups, who are relatively well organised and funded for RDM, and it tends to be working data for those who crunch numbers – so it may not necessarily be data that requires retention. The other 10–20% of research data, which belongs to the remaining 80% of researchers, looks like a manageable quantity. This suggests that cultural change, rather than capacity, may be the predominant issue when it comes to achieving a migration to a more robust infrastructure for working data for the majority of researchers. Likewise, we should expect to be able to manage the data that could be preserved, if we can build the culture and processes to make that possible.

In addition to requirements that have already been resolved (such as easy-to-use encryption and more flexible provision of storage for mixed staff/student/external research groups), the survey revealed some previously unvoiced requirements, such as centralised version control for source code, CAD and design files.

Perhaps influenced by the STEM respondents, the survey also showed that venerable FTP is alive and still working well amongst the new (and rebranded) offerings of the cloud. This indicates there continues to be profit in exploring an FTP-based cloud storage pilot.

The key messages from the survey support the anecdotal evidence acquired to date – in the main there was no big new news. However, the subtext obtained is valuable, and it underlines that considerable help and resources are needed over the whole project lifecycle, from planning to preservation, if we are to satisfy the demands of a rapidly developing (some would say hardening) funders’ policy regime.

Download the survey results (PDF  400KB).
Download further discussion and analysis (PDF 1.7MB)

May 23 2012

Atira PURE is a current research information system (CRIS) that has been adopted by around 20 UK HEIs. The UK PURE user group works closely with Atira to define requirements and maintain a unified data model across all UK implementations. The user group met last week at the University of Aberdeen, with representatives from several institutions that have JISCMRD projects. The preoccupation of the meeting was with the present, in the form of mock Research Excellence Framework assessments, but there was also discussion of the product roadmap, and some interest for those with a foot in the research data management camp.

CERIF2: PURE’s data model will be continually adjusted to match CERIF developments.

OpenAIRE Compliance: support for the OpenAIRE format will be added to PURE’s OAI-PMH harvesting interface.

PURE as a repository: a new player in the market? PURE currently supports ‘connectors’ to DSpace, ePrints, and Equella, so that research outputs originating in the CRIS feed through to an existing repository system. Whilst making a clear commitment to maintaining these interfaces, Atira restated their belief that PUREPortal offers an alternative that could replace a traditional repository system in full. The best example of this is at Aalborg University. At the University of Hertfordshire we maintain a DSpace repository, but our PURE CRIS is now the primary source for almost all our repository content; this is a similar position to that of the University of Edinburgh and several others. We have reasons for keeping DSpace for the moment, not least because it is open source and offers the opportunity to be hacked to try out new initiatives, such as publishing data. There are several new PURE repositories about to go live, mainly among universities that do not have an existing public presence. It will be interesting to see whether PURE continues to gain traction among those of us who already have systems online. I think Atira may struggle to penetrate further until the REF is concluded and everyone has time to breathe, reflect and address new projects. (RDTK is already experiencing inertia due to the REF, which is an overriding priority for researchers and administrators alike.)

PURE and Datasets: there was quite a lot of discussion about data, with two tangled – but, with hindsight, distinct – threads: the first about data as a primary research output in the REF, the other about the new imperative to publish data in support of traditional publications. The first thread came up when the meeting was considering how PURE currently expresses non-textual outputs, including physical art outputs, events, source code, and data. This naturally drifted into a discussion about metadata, wherein I began to fear we would be mired for the rest of the meeting, but the CERIF gurus rescued the day with a timely intervention about the likely outcomes of CERIF for Datasets (C4D). By this route we arrived neatly back at the first point above. Atira have previously told me that they are waiting on the inclusion of a metadata model for data in CERIF, and will implement it when it arrives. I pointed out that in order to fulfil their aspiration as a repository vendor they will also have to address more than just the metadata issues – for example, in the way that @mire have done with their media-streaming plugin for DSpace. (As an aside, @mire tell me the DSpace developer community is also taking a keen interest in C4D.)

Data working group: the conclusion of these discussions was that a working group should be convened to report on data publishing issues at future UK PURE user group meetings. If anyone in JISCMRD who is not in the PURE user group would like to feed into this, then #rdtk_herts can facilitate.

May 23 2012

The UH Research Data Assets Survey has been launched and is planned to run until mid-June. The survey can be found at http://sdu-surveys.herts.ac.uk/rdas.

All research activity generates data in some form, even if you don’t recognise it as such. Valuable data is often found in unstructured, everyday office formats, embedded in your working papers. It is as important to understand the requirements of this ‘free-form’ data as it is to understand well-defined collections, so every contribution from every research area will be valuable.

This survey aims to help the Research Data Toolkit team understand the research data landscape at the University and to plan the most effective support for research data management going forward. It should take no more than 20 minutes to complete.

The data collected in this survey will be held securely. The results will be anonymised and published at http://www.herts.ac.uk/research-data-toolkit. We ask for your name and email address so that we can contact you with regard to good practice and interesting issues to feed into the Research Data Toolkit. No personal data will be reported.

The survey has got off to a good start, with 20 respondents in the first 24 hours. Please find the time to complete it so that we can gather a true picture of research data requirements across the whole University, and in particular from within those departments and groups that would not traditionally see themselves as data generators.