Bill Worthington

Jun 05 2014
 

The Service Oriented Toolkit for Research Data Management project was co-funded by the JISC Managing Research Data Programme 2011-2013 and The University of Hertfordshire. The project focused on the realisation of practical benefits for operationalising an institutional approach to good practice in RDM. The objectives of the project were to audit current best practice, develop technology demonstrators with the assistance of leading UH research groups, and then reflect these developments back into the wider internal and external research community via a toolkit of services and guidance. The overall aim was to contribute to the efficacy and quality of research data plans, and establish and cement good data management practice in line with local and national policy.

The final report is available via http://hdl.handle.net/2299/13636

Blog Survey based on Digital Asset Framework http://bit.ly/18QUZR9
Survey Survey results http://bit.ly/1ao74vy
Report Survey analysis http://bit.ly/128uGMK
Blog UH Research Data Policy in a nutshell http://bit.ly/14cXC9w
Artefact Interview protocol, used by project analyst and RDM champions http://bit.ly/12Jr9KZ
Case studies 12 Case Studies http://bit.ly/19MjnD3
Review Review of cloud storage services: features, costs, issues for HE http://bit.ly/12Jn2yz
Blog Files in the cloud http://bit.ly/R583If
Test data Files transfer rate tests http://bit.ly/1266WsJ
Blog Analysis of barriers to use of local networked storage http://bit.ly/12Gleqg
Blog Hybrid-Cloud model: when the cloud works and the attraction of Dropbox et al. http://bit.ly/Xvmidr
Blog Hybrid-Cloud example: Zendto on Rackspace, integrated with local systems http://bit.ly/11In83q
Service UH file exchange https://www.exchangefile.herts.ac.uk/
Blog Cost of ad-hoc storage http://bit.ly/19ilycQ
Blog Cost of data loss event http://bit.ly/13RSckb
Blog Reflection on use of Rackspace CloudFiles
Blog Data Encryption http://bit.ly/XxDoEM
Training Data Encryption workshop http://bit.ly/11rwLXA
Training Data Encryption guide http://bit.ly/QHyN2y
Blog Document Management for Clinical Trials http://bit.ly/15cfT5K
Artefact eTMF – electronic Trial Master File, 1954 legacy documents scanned no public access
Artefact Research Project File Plan http://bit.ly/11InVkW
Workflow Post award storage allocation
Workflow Request ‘Research Storage’ Form http://bit.ly/17V7J8t
Workflow Research Grant and Storage Process http://bit.ly/14kvCB0
Workflow Request ‘Research Storage’ Workflow http://bit.ly/12d2aJP
Service R: (R drive), workgroup space with external access; access by workgroups
Service DMS, workgroup space with external access; access by workgroups
Dataset 4 Oral history datasets, ~300 interviews, 125GB http://bit.ly/uh-hhub
Dataset 1 Leisure studies dataset, SPSS survey, interviews, transcripts, 8GB in preparation
Blog Comparison of data licenses http://bit.ly/12DmXfR
Report Comparison of data licenses http://bit.ly/13NC7gA
Service UHRA repository improvements phase 1 http://uhra.herts.ac.uk/
Blog DOIs for datasets, includes mind map http://bit.ly/QonFoN
Workflow Deposit/access criteria for data with levels of openness http://bit.ly/12cUqrq
Service RDM micro site (aka Research Data Toolkit), 100+ pages and pdfs of RDM guidance http://bit.ly/uh-rdm
Report Register of Programme engagement at external events, estimated audience 480, ~300 individuals Appendix A
Blog Programme engagement: 38 Blog posts http://research-data-toolkit.herts.ac.uk/
Presentation Association of Research Managers and Administrators Conference 2013 http://bit.ly/ZXv8RK
Presentation UH RDM Stakeholder briefing June 2012 http://bit.ly/11KkJGo
Presentation UH Health and Human Sciences research forum July 2012 http://bit.ly/15cDUKb
Presentation JISCMRD progress workshop Nottingham 2012: storage http://bit.ly/10qpry3
Presentation JISCMRD progress workshop Nottingham 2012: repository http://bit.ly/126zjab
Presentation JISCMRD progress workshop Nottingham 2012: training http://bit.ly/15cH1lj
Presentation JANET/JISCMRD Storage Requirements workshop Paddington 2013 http://bit.ly/12QFu9S
Presentation JISCMRD benefits evidence workshop Bristol 2013 http://bit.ly/ZXE09Y
Presentation JISCMRD progress workshop Aston 2013: training http://bit.ly/11t3Lg0
Presentation JISCMRD progress workshop Aston 2013: agent of change http://bit.ly/13NVIgH
Presentation JISCMRD progress workshop Aston 2013: storage http://bit.ly/19Juixf
Report Register of programme engagement at UH events: interviews (~60), meetings, seminars , workshops. Total attendance 400, est 200-300 individuals Appendix B
DMP 10 data management plans, facilitated by RDM champions and Research Grants Advisor limited public access
Report 6 project manager’s reports to Steering Group no public access
Report Benefits report http://bit.ly/19V1rWS
Report Final Report http://hdl.handle.net/2299/13636

Conclusions

There are many conclusions that could be drawn from the project. These are the headlines:

  • JISCMRD has been a success at UH.
  • The RDTK project has made an impact in awareness raising and service development, and made good inroads into professional development and training. There are good materials, a legacy of knowledge and a retained group of people to sustain and develop the learning.
  • We believe the service orientated approach shows that better technology can facilitate better RDM, and that the project has been an effective Agent for Change.
  • We also understand that advocacy and training are as important as technology to bring about cultural change.
  • Funding body policy and the implications of the ever increasing volume of data are understood. The business case is clear: the University cannot afford not to invest in RDM.
  • JISCMRD phase 2 has been an effective vehicle for knowledge transfer and collaboration. It provided an environment in which a new and complex discipline, and the many interacting, conflicting, seemingly endless issues therein, could be explored with common cause and mutual support.

Recommendations

JISCMRD activity should continue, and should try to reach the part of the research community that is least able to adopt RDM best practice without assistance, and won't do so as a matter of course. A profitable strand for JISCMRD3 would be Collaborative Services. Appropriate services would include joint RDM support services, or shared specific services such as regional repositories (including DOI provision) or shared workgroup storage facilities. Institutions with advanced RDM capability could play a mentoring role. Another key strand would be the Benefit of Data Re-use: gathering examples of innovative data use, and of academic merit and reward for individual data publishers.

The DCC should continue in its institutional support role. It should consolidate its DMPonline tool toward a cloud service, with features to allow organisational branding, and template merging. It should place new emphasis on the selection and publishing of data, with a signposting tool for Tier 1 and Tier 2 repositories for subject specific data, including selection criteria, metadata requirements, and citation rates.

Opportunities for organisations to learn from each other and establish collaborations, which have been effective at JISCMRD2 workshops, should continue to be facilitated in some way. In addition, more attempts should be made to reach researchers directly in order to demonstrate the potential personal benefit of good RDM.

The JISC should continue to pursue national agreements via the JANET brokerage. These negotiations should be widened beyond Infrastructure as a Service to include RDM Applications as a Service (RAaaS), for example Backup as a Service, Workgroup Storage, and Repository as a Service. The goal should be to achieve terms of use which satisfy institutional purchasing, IP and governance requirements, whilst allowing for acquisition by smaller intra-institutional units, from faculty down to workgroup level. JISC GRAIL (Generic RDM Applications Independently Licensed) might be a suitable brand for this activity. In addition, JANET should press cloud vendors for an alternative to 'pay-by-access' for data, which is a barrier to uptake in fixed-cost project work.

May 20 2014
 

Research Data Management Training for the whole project lifecycle in Physics & Astronomy was co-funded by the JISC Managing Research Data Programme 2011-2013 and the University of Hertfordshire. The project was carried out in parallel with other JISCMRD work at the University of Hertfordshire and collaborated with researchers in the Centre for Astrophysics Research (CAR) and the Centre for Atmospheric & Instrumentation Research (CAIR) to develop a short course in RDM for postgraduate and early career researchers in the physical sciences. It adopted a whole project lifecycle approach, covering issues from data management planning, through good data safekeeping, to curation options and arrangements for data reuse. The resultant short course is available via 4 modules at www.jorum.ac.uk.

The final report is available via http://hdl.handle.net/2299/13638

Output / Outcome Type Brief Description and URLs (where applicable)
UH DMPonline Template Progression from an RDM checklist within our UH Data Policy to a DMPonline template that fulfils the UH data policy and stands alone as a record of the treatment and location of data.
Project Website Including guidance on best-practice RDM for topics related to the lifecycle of research projects and the following training materials http://bit.ly/uh-rdm
Training Slides Presentation slides covering 18 topics within four RDM sessions, available via JORUM

1 – Planning a project http://find.jorum.ac.uk/resources/18502

2 – Getting started http://find.jorum.ac.uk/resources/18503

3 – Safeguarding your data http://find.jorum.ac.uk/resources/18504

4 – Finishing touches http://find.jorum.ac.uk/resources/18505

Trainer Notes Aims and key points for each slide of the training.
Discipline Packages Examples to make the generic advice relevant in physical sciences; Physics and Astronomy. Also in Health sciences, History, and Business.   (Additional packages to follow in the coming months.)
How to choose training Advice on which training is suitable and how these materials can be used in training sessions for researchers, research students, support and technical staff within and without UH.
Case Studies Descriptions of 12 projects, highlighting RDM practices, and key issues and solutions that have affected researchers throughout the university, posted on our RDM website for the benefit of other researchers in the university. http://bit.ly/uh-rdmcs
Current and best-practice assessment Formal and informal interviews with researchers in Astronomy, Physics, Maths, Robotics, and Atmospherics to discuss the bespoke solutions they have adopted and the applicability of our RDM tools to the physical sciences.
Development Blogs Blog summaries on

  • the progression from RDM training sessions for astronomers to generic training sessions for researchers in all disciplines,
  • the development of the website
  • the development of the UH DMP Template
  • http://research-data-toolkit.herts.ac.uk/
Evaluation of Training Feedback evaluated after each training session, used to improve the sessions, in particular their content and duration.
Improved data management among astronomy research students Follow-up interviews with research students demonstrated improved awareness of data management, preservation requirements and security of data.
Workshop presentations This work has been presented at JISC workshops and RDM training related meetings: 24/10/12 – JISC Building Institutional RDM Meeting in Nottingham, “RDM Training for Physics and Astronomy”; 26/10/12 – RDM Training Strand workshop, “RDMTPA at UH”; 25/03/13 – JISC RDM Meeting in Birmingham, “RDM Training at UH”.
Presentations for researchers “Introduction to RDM” presented to researchers, staff and students.

  • Staff development: 16/10/12, and 30/04/13
  • GTR:   13/05/13
  • For Astronomy PGRS: 23/10/12
  • For STRI new PGRS: 01/03/13

Research Group seminars are planned for the autumn term 2013.

“Preserving Digital Data at UH” presented at the National Astronomy Meeting in St Andrews, 1-5/07/13.

Oct 29 2013
 

It has been a while….  but there has been plenty of activity following the conclusion of our two JISCMRD projects in June. Here goes for a quick roundup:

We have continued to spread the message by working at as many levels as we can get access to. We have a foothold in Generic Training for Researchers, the CPD programme from the Staff Development Unit, and Research Institute induction programmes. Because RDM is not a very appealing prospect and many people prefer targeted support, we have added specific training for tools like DMPonline, Document Management, and Encryption to the broad-spectrum RDM tonic. At the senior management level we have made presentations to Research Committee, the Chair of Board Designate and the Deputy Vice-Chancellor.

The trial of https://fileexchange.herts.ac.uk/ has been a success. This will soon be an 'officially' supported service once we migrate it from its current position running on RackSpace cloud servers to our own datacentre (you can use it as of now anyway). FileExchange allows multi-gigabyte files to be 'dropped off' and 'picked up' and automatically disposes of them after 7 days. In many cases this answers the requirement to share data with a collaborator, where the nature of the share is a transfer rather than live co-working.

We are also continuing to explore other ways of weaning researchers off the use of desktop storage, unregulated storage offers such as Dropbox and fragile media such as USB sticks by making improved central storage available. Working with Prolinx (www.prolinx.co.uk), who are a UH technology partner and JANET brokerage infrastructure provider, we hope to provide a storage solution that supports greater autonomous administration for research groups, backed by tiered levels of service, including backup and audit. Improved working data storage is one part of a new Research Storage offer, which also includes a seat at our enterprise Document Management System, which proved popular with Health researchers during the JISCMRD project and has been rolled out extensively since. Document Management is not an appropriate tool for storing large amounts of already structured data, but it is a great system for recording the conduct of a project, for when a project uses common desktop formats to store data, or in particular, when a very high standard of data management and accountability is required.

Moving from working data to the end of the research data lifecycle, we are developing our institutional repository http://uhra.herts.ac.uk to support very long-term storage of datasets. DSpace consultants @mire (www.atmire.com) are working to attach the Arkivum (www.arkivum.com) A-Stor cloud-based digital archiving service to UHRA. A-Stor is an ultra-robust, 3-copy, tape system. We aim to support different data journeys including Open Data, Embargoed, and access by criteria for sensitive data. A-Stor offers the lowest storage cost on appropriate terms, at around £200-£300/Terabyte/Year, which is about half the best price for data stored on disc-based storage. This is an important factor when there may be a requirement to retain very large volumes of data, toward Petabytes, within 3 to 5 years, for 10 to 30 years.

Research Data Management is recognised as an important element of both pre- and post-award research support and the impetus generated by the JISCMRD work is being taken forward in that context. We have started on new arrangements and workflows to bring together all the elements of research provision across the University into a more cohesive Research Support Service.  The idea will be to use Information Managers to broker with Principal Investigators, consult with service specialists, and agree a kind of service level agreement for necessary support for each research project, including non-funded activity.  With no new money identified as yet this is of course a challenge, but we are still fairly well placed to deliver on these new systems and services within the constraints of existing resources, and intend to do so.

The RDM microsite at http://rdm.herts.ac.uk/ is the new focus for all our advice and training materials. Check it out – it is full of great stuff! In addition, Office of the Chief Information Officer (OCIO) staff are still available to address research group forums or particular RDM problems if you need them. Contact Bill Worthington, w.j.worthington@herts.ac.uk

 

Jun 21 2013
 

The Service Oriented Toolkit for Research Data Management (RDTK) project and the Research Data Management Training in Physics and Astronomy  (RDMTPA) projects were co-funded by the JISC Managing Research Data Programme 2011-2013 and The University of Hertfordshire.

Our draft final reports are available below. Both reports have been through one iteration and found to be largely fit for purpose by our Steering Group. There will be gremlins, but the reports can be made available for comment now. The final versions will appear here in due course.

RDTK final report v05 (updated June 2014)

RDMTPA final report v04 (updated June 2014)

We owe thanks to all the participants of JISCMRD phase 2 who shared their experience and knowledge in a truly collaborative effort. This shared experience and the Digital Asset Framework survey results published by several projects show close commonality, so we believe the learning delivered in these reports, which we think is considerable, will be applicable and of use across the sector.

My thanks go to everyone involved in the UH RDM project team, who worked with commitment and humour in the face of occasional chaos. To save data 😉 I am not going to name them here. You will find them scattered throughout the project's blog posts. It was a privilege and an education to lead them.

We welcome feedback from JISCMRD colleagues or questions from newcomers to the field, and trust both can benefit from something they find in our project outputs.

 

Jun 04 2013
 

This is a follow up to my blog post The cost of a bit of a DDUD which examined the total cost of ownership (TCO) of a network attached storage device operated by a research group. The TCO in that case included a malfunction and repair but no data loss.

In this post we go on to put some numbers on an actual data loss event.

Here is the context: Research Group A does analysis of anonymised longitudinal data supplied to them by collaborators elsewhere in the UK. The data are relatively large (2TB) and don't pass through the network very well, even at intra-JANET speeds, so it is their practice to acquire the data in large chunks (100-250GB) using physical media and keep it locally, attached to the compute machine. They stored the source data and the new data derived by their work on the same device, which was a desktop-quality four-disc array.

They did not keep a local backup. The reasons for this parlous circumstance were many: the original data could be re-acquired, albeit with some effort; they planned to deliver derived data to their colleagues offsite; they did not believe central services could provide them with enough networked storage; they were aware of RDTK and waiting for us to provide a better solution; they trusted their device not to fail.

The storage device went wrong. To get the most capacity out of their disc array they used a RAID0 configuration where data is split between discs with no redundancy, so when one disc failed, it effectively failed the whole device. When the unit was returned to the manufacturer under warranty, the data turned out to be irrecoverable.

To calculate the cost of this event we will consider the costs of purchasing, regular maintenance prior to failure, power (@£0.11/kWh), and the effort expended in reacquiring the source data. Then we will add the cost of recomputing the lost work. We won’t use a Power Usage Effectiveness factor since the device was kept on a desk in normal office conditions. Staff costs in this case are higher than we previously used, at £264/day, reflecting a common situation, in which a fairly senior researcher conducts the work and also maintains the equipment.

Capital:

8TB HD, 2 year warranty  = £600

Labour prior to failure:

purchasing, setup, acquire and load data, ~= 3 days;
regular maintenance/interventions over two years ~= 5 days;

Sum of effort = 8 days @£264 ~= £2112

Power:

Nominal 43 watt, with use ~0.05 kW x 24hr x 350 days x 2yrs ~= 840 kWh
840 kWh x £0.11/kWh = £92

Labour to replace device and reload original data:

local effort to recover data = 5 days;
contact vendor, arrange replacement part, recommission and reload source data = 3 days;
repeat data transport costs £150;

Sum of effort = 8 days @£264 ~= £2112

Labour to repeat lost research:

Data preparation/ pre-processing source data = 5 days
Research time = 40 days

Sum of effort = 45 days @£264 ~= £11880

Analysis

Let's see what these numbers mean. If the device had not failed and Research Group A had gone on to fill 8TB in the two year warranty period, then the TCO would have been: 600 (purchase) + 92 (power) + 2112 (8 days) = £2804; £2804 / 8TB / 2yr ~= £175/TB/yr

In the event of the failure, the effort in trying to recover data and eventually having to repeat their research adds another 53 days, which for 2TB brings the TCO over the same period to 2804 + 150 + 2112 + 11880 (8 + 45 days) = £16946; £16946 / 2TB / 2yr ~= £4237/TB/yr

Ouch: TCO with failure = 24 x TCO without a failure.

There is some good news. Most of the derived data was copied to a collaborator shortly before the outage, saving the 40 days research time.

Nevertheless, the TCO by the time they were back to their original position, ready to continue work, was actually: 2804 + 150 + 2112 + 1320 (8 + 5 days) = £6386; £6386 / 2TB / 2yr ~= £1597/TB/yr (broadly consistent with the costs calculated in the previous post).
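
For anyone who wants to rework these figures, here is a minimal Python sketch that reproduces the arithmetic above; the day rate, power tariff and effort estimates are simply the numbers quoted in this post.

```python
# Rough reproduction of the TCO arithmetic above (all figures from the post:
# £264/day staff rate, power at £0.11/kWh, 2TB of data, 2-year ownership).

DAY_RATE = 264          # senior researcher, £/day
CAPACITY_TB = 8         # usable capacity of the array
DATA_TB = 2             # data actually held
YEARS = 2               # warranty / ownership period

capital = 600                           # 8TB array
power = 0.05 * 24 * 350 * 2 * 0.11      # ~840 kWh at £0.11/kWh ~= £92
routine_labour = 8 * DAY_RATE           # setup + maintenance, 8 days

# Scenario 1: no failure, array filled to capacity
tco_no_failure = (capital + power + routine_labour) / CAPACITY_TB / YEARS

# Scenario 2: failure, data re-acquired and research repeated in full
recovery_labour = 8 * DAY_RATE          # recover, replace, reload
repeat_research = 45 * DAY_RATE         # re-prepare data + 40 days of research
transport = 150
tco_failure = (capital + power + routine_labour + transport
               + recovery_labour + repeat_research) / DATA_TB / YEARS

# Scenario 3: as it actually played out (derived data already copied offsite,
# so only ~5 days of pre-processing had to be repeated)
tco_actual = (capital + power + routine_labour + transport
              + recovery_labour + 5 * DAY_RATE) / DATA_TB / YEARS

print(f"No failure:   £{tco_no_failure:.0f}/TB/yr")   # ~£175
print(f"With failure: £{tco_failure:.0f}/TB/yr")      # ~£4237
print(f"Actual:       £{tco_actual:.0f}/TB/yr")       # ~£1597
```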

Learning

Beyond the plain figures, which are a rare commodity in RDM, there is a lot of learning in this and the previous blog.

  • RAID0 should carry a very large health warning. I'd go so far as to say the only place it should ever be used is as a component in RAID10. If you have to use a bit of DDUD, never use RAID0 when you can mirror (RAID1) or stripe with parity (RAID5); the standard trade-offs are sketched after this list. For the sake of getting only 3 out of 4 discs' worth of capacity in a 4-disc array, the fault tolerance is so much better and the risk so much lower.
  • We can see why the Distributed Datacentre Under the Desk is so pervasive in research practice. Less than £200/TB/yr compared with >£800/TB/yr for tricksy cloud storage? The low cost of not doing the job properly looks very attractive unless you have been bitten already.
  • The cost of a problem when one occurs is, however, a big deal. Almost insignificant in hardware terms, it is all about the human investment required to fix or redo the research. In the case above this was about two weeks of a senior researcher's time. It could have been 11 weeks, more than a quarter of a person-year, more than enough to miss a publication deadline for, for example, the Research Excellence Framework assessment.
  • We see a professional, committed research group trying to balance money, time and risk. They were moving toward a robust position but living with the expedient as they travelled there, and they lost out to the fates. This is the position most researchers are in, and it clearly underlines the need for better training and learning resources with regard to working data management.
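
As background to the RAID0 warning in the first bullet, here is a small sketch of the textbook capacity and fault-tolerance trade-offs for a four-disc array; these are generic figures, not measurements from the device discussed above.

```python
# Usable capacity and fault tolerance of a 4 x 1TB array under common RAID
# levels -- background to the RAID0 warning above. These are the standard
# textbook values, not figures taken from the failed device in this post.

DISCS, DISC_TB = 4, 1

raid_levels = {
    # level: (usable TB, disc failures survivable)
    "RAID0 (striping, no redundancy)": (DISCS * DISC_TB, 0),
    "RAID10 (striped mirrors)":        (DISCS * DISC_TB // 2, 1),  # at least 1, depending on which discs fail
    "RAID5 (striping with parity)":    ((DISCS - 1) * DISC_TB, 1),
}

for level, (usable, survivable) in raid_levels.items():
    print(f"{level}: {usable}TB usable, survives {survivable} disc failure(s)")
```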

I would like to thank Research Group A for their honesty and cooperation. This data loss event added £3582 (which could have been £14,142) and a whole lot of stress to the conduct of their research. It was good of them to share this for the benefit of the RDM cause. I am happy to report that they are now in a much more robust position. They are using RackSpace Cloud Files as their primary store, and moving data back and forth from their working machine as required. The RDM team will continue to work with these researchers after the end of JISCMRD, primarily to see how the use of an off-the-shelf cloud service works in an HE environment, but maybe also to take them to the next logical step, which would be to move the compute to the data and do it all in the cloud.

To conclude, I am aware that the universal optimist prevails in research culture and no amount of doom-mongering is going to change that. But I can't quite see how to spin this unequivocal evidence in terms of a benefit. The best I can do is to return to the cost-reliability figures (Boardman, et al.) in Steve Hitchcock's blog, Cost-benefit analysis: experience of Southampton research data producers, and say:

the benefit of using UH networked storage is that the risk of data loss is tiny compared to not using it, and the benefit of using cloud storage is that the risk of data loss reduces to practically nil.

Not compelled to spend the cash? Ask Research Group A.

May 10 2013
 

In between procrastinating over final reports and filling in odd gaps in our new RDM advice, we are preparing some material on the cost of a data loss event. This will arrive in due course, but I thought it would be useful to precede it with a consideration of the cost of simply owning a typical storage device operated by a typical small research group. I have experience of this in a previous life when I used to buy and maintain kit for several research groups in the Science and Technology Research Institute (STRI) at the University of Hertfordshire.

Dr Phil Richards, CIO at Loughborough, coined the acronym DDUD – distributed datacentre under the desk [reference needed]. In STRI, our bit of the DDUD was probably quite advanced for its time: half a dozen network attached storage devices (NAS); an Uninterruptible Power Supply (UPS), mainly to protect against spikes in the office power supply; a partitioned section of an office, otherwise used as a machine graveyard; and a domestic air conditioning unit (AC).

Each NAS had four discs, configured in a RAID5 array, so the mean time between failures (MTBF) for a NAS was roughly a quarter of that of a single disc. Consequently, my experience was that every NAS suffered a disc outage at least once in its three-year warranty period. We tended to retire them to non-essential use after the warranty due to the relatively high cost of replacement parts and increased rate of failure. When a disc failed, the RAID5 array protected our bytes and allowed us to continue working in a degraded state (though at considerable risk) while the replacement part was acquired, but the required downtime to replace the disc and rebuild the array (the discs were not hot-swappable) was quite an inconvenience. I mention all this not to prove my heritage as a geek, but to illustrate the point that maintaining local storage involves a lot of faffing about that needs to be accounted for.

So, to the calculation. I am going to use the direct modern descendant of our NASs for capital costs; use a pay scale of a middle career researcher for the labour cost (I was actually one, then two, whole pay scales higher at the time); and pick a Power Usage Effectiveness (PUE) ratio = 2. This means that the cost of running the room, UPS and AC is the same as the IT equipment, which is probably an underestimate, but it will do. I don’t know what our power costs are, so I will use a small business tariff that I happen to know about. I haven’t included a share of the capital cost of the UPS or AC.

Capital:

4 x 1TB NAS, 3 year warranty, £2260 – SnapServer DX1 Enterprise

Labour:

purchasing – 2 days (market evaluation, selection, vendor communication, requisition, payment);
delivery and setup – 2 days (goods inwards, commissioning, familiarisation, build RAID, testing, rollout to users);
regular maintenance/interventions – 1 hr/month ~= 36 hrs ~= 5 days;
1 disc failure intervention – 2 days (diagnose, contact vendor, arrange replacement part, swap disc and rebuild RAID, check data integrity)

1 day at UH6 = £38,500 per annum / 220 working days per year = £175/day

Sum of effort = 11 days ~= £1925

Power:

Nominal 80 watt, with use ~0.1 kW x 24hr x 350 days x 3yrs ~= 2500 kWh, x 2 PUE ~= 5000 kWh

5000 kWh x £0.15/kWh = £750

Total cost of ownership:

2260 + 1925 + 750 = £4935 for a RAID5 capacity of 3TB

= £1645 per terabyte year! Ouch.

You could argue that a RAID5 desktop-attached device could be acquired for ~20%-25% of the cost of a NAS, bringing the cost down to nearer £1000/TB/yr, but I would suggest the attendant risk of failure is not worth considering.

Even subject to a 100% margin of error this means the cost of owning a bit of a DDUD is at least as much, but probably twice, that of premium rate cloud storage such as RackSpace Cloud Files or Amazon Simple Storage. And between 2 and 4 times the cost of storage in our own data centres.

Sticking with wild estimates, suppose we could consolidate 1PB of research data (~50% of our holdings) off the DDUD and into an efficiently managed hybrid cloud infrastructure @ £800/TB/yr? 1024 x 800 = £819,200.

1PB ~ £820k per annum, but you would save twice this much in distributed, unseen costs across the university.  Net saving: close to £1 million per annum.  

Someone check my figures please. We could all have new iMacs, free coffee, even a well-resourced Research Data Management Service.

If I have the sums right, there is an undeniably large amount of wasted money to add to all the reasons why we should be rationalising and centralising research data storage. The problem is, the waste is distributed and diluted – and the solution looks too big to countenance. We need to find a way to sell a research data storage service as a benefit, not a cure.

Apr 08 2013
 

Four delegates reporting on two projects, using three posters, three presentations and two demonstrations. And we still had time to come away with more useful collaborations to pursue!

The roundup workshop was a great way to see how far we have all come in 18 months, reflect that this is still just the beginning for Research Data Management as a professional discipline, and that JISCMRD has given all of us involved a head start, not to mention new opportunity.

Here are our presentations, posters and related posts:

Research Data Management Training in Physics and Astronomy Presentation  (PDF, 1.7 Mbyte )

Research Data Management Training in Physics and Astronomy Poster  (PDF, 440 Kbyte, commended in the poster competition )

Research Data ToolKit (@herts) Document Management Poster  (PDF, 2.5 Mbyte )

Research Data ToolKit (@herts) Adventures in storage: towards the ideal Hybrid Cloud  (PDF, 1 Mbyte )

Research Data ToolKit (@herts) Agent of Change: interventions and momentum  (PDF, 2.9 Mbyte )

Research Data ToolKit (@herts) Poster (PDF, 1.1 Mbyte )

For the JISCMRD toolkit:

Research Project File Plan

Comparison of ‘open’ licenses

ZendTo file exchange: a hybrid cloud implementation

 

Apr 04 2013
 

We have built an installation of ZendTo, an open-source system for transferring large files over the web, written by Julian Field from the University of Southampton and used at Southampton, Essex and Imperial College.

See http://fileexchange.herts.ac.uk/

The requirement for this system came from researchers who were trying to use email to move files larger than about 10MB, resulting in burst mailboxes or returned mail.

The deployment allowed us to test elements of the hybrid cloud approach. We used a Cloud Server from RackSpace and we were able to integrate it with our LDAP service over SSL. This was important because it served to demonstrate that we could use local authentication with external services and acted as proof of concept in this regard for a number of other University developments. In addition we were able to configure it to act with authority within our email domain, which has proved problematic with external servers in the past. The basic system, which uses the ubiquitous LAMP (Linux-Apache-MySQL-PHP) stack, was straightforward to install, and only the integrations above proved in any way burdensome (mainly in identifying and co-opting the appropriate network or system administrator). We will publish a 'what to look out for' guide in due course.
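
As a loose illustration of the sort of check involved in the LDAP integration, the sketch below attempts an LDAP-over-SSL bind from Python using the ldap3 library; the hostname, port and bind DN are placeholders, and this is not the project's actual integration code (ZendTo itself is PHP and has its own configuration).

```python
# Minimal LDAP-over-SSL bind check of the sort used to confirm that a cloud
# server can authenticate against an institutional directory. The hostname,
# port and DNs below are placeholders; the ldap3 library is simply one
# convenient client and is not what ZendTo itself uses.
from ldap3 import Server, Connection, ALL

server = Server("ldaps.example.ac.uk", port=636, use_ssl=True, get_info=ALL)

def can_authenticate(user_dn: str, password: str) -> bool:
    """Return True if the supplied credentials bind successfully over SSL."""
    try:
        conn = Connection(server, user=user_dn, password=password, auto_bind=True)
        conn.unbind()
        return True
    except Exception:
        return False

if __name__ == "__main__":
    ok = can_authenticate("uid=testuser,ou=people,dc=example,dc=ac,dc=uk", "secret")
    print("LDAP bind over SSL succeeded" if ok else "LDAP bind failed")
```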

ZendTo looks like a good in-house alternative to services such as yousendit.com and mailbigfile.com. In the context of RDM requirements it mitigates the need for a full shared file system in some cases, because the 'shared area' requested by a research group is often used only for transfer of data between collaborators. It has the advantage of automatically disposing of what is transitory data after a short period, and in our system the storage container is elastic, so that if we were to generate significant demand we could meet it.

Although we have talked about authentication, this is only needed for added-value features; the system works perfectly well without the sender or recipient being logged in, using an exchange of tokens via email instead (so long as one of them has an @herts.ac.uk email address). So the foremost advantage from an administration point of view is that external users do not need to be managed.

We are using a fully managed cloud server from RackSpace (~£1500/yr). This carries a £65/month premium but is not really necessary for such a simple installation. RackSpace is comparable on cost with Amazon Web Services but, based on prior experience, comes with a superior support offer. Because we choose not to use a Content Delivery Network, our data resides in a datacentre in London.

If you have an @herts.ac.uk address you are welcome to try the system, without logging in, to send files (or rather, ‘drop them off’). If you are external to the university you can also drop off files for pickup by university staff provided you know their email address. If you just want to try the system, please use it to drop off  files for rdmteam@herts.ac.uk.  We welcome feedback whilst we plan a sustainable service around this system.

http://fileexchange.herts.ac.uk/

Thanks again to Julian at http://zend.to for a great piece of kit.

Apr 04 2013
 

In Work Package 6 – Review data protection, IPR and licensing issues we have undertaken a thorough review of the licences commonly used for ‘open’ access to research information. The review takes in Creative Commons, Open Data Commons, Open Government License, UK Data Archive licence and others. Considerations of when to use each license and ‘what to watch out for’ are included, together with a Glossary of Terms taking in copyright and copyleft.

The Comparison of Open Licenses can be downloaded from http://research-data-toolkit.herts.ac.uk/document/comparison-of-open-licenses/  (PDF, 200kB).

The main intention was to find a license for University of Hertfordshire datasets that was recognisable to the research community but consistent with our data management policies. The work drew from, and fed back to, the institution wide deliberations of a working group on open access.

We found that Creative Commons Zero, the most open (or unconstrained) licence, which waives all copyrights and related rights, is not compatible with the University's intellectual property policy. We settled on Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales (CC BY-NC-SA) for our open data. Further discussion may change this to Creative Commons Attribution-NonCommercial 2.0 UK: England & Wales (CC BY-NC) because we recognise that the cascade of licences caused by ShareAlike can be problematic.

We also anticipate having some datasets that are available on demand rather than being completely open and directly downloadable via our repository and propose to use a UH derivative of the UK Data Archive licence in these cases.

Apr 04 2013
 

In Work Package 3 – Document Management Pilot we have scanned and loaded legacy documents into an electronic Trial Master File (eTMF) for clinical work carried out by the Centre for Lifespan and Chronic Illness Research (CLiCIR). This has been a very successful piece of work in terms of the engagement it has promoted throughout our health related research community, and has also led to the development of a reusable template folder structure (or File Plan). Despite a natural tendency for researchers to want free-form arrangements, there are a lot of cases when a consistent structure is desirable. Examples of this might be where there is a requirement to keep a 'copy of record' of the conduct of an entire project; or, more generally, where version control and checkin/checkout in a multi-user file space is required.

The file plan is appropriate for researchers, finance managers and administrators to use for all project documents, including research data, where this is kept in desktop application files and other free standing forms, as is often the case in small research endeavours outside the big Science and Technology areas.

The Research Project File Plan can be downloaded from http://research-data-toolkit.herts.ac.uk/document/research-project-file-plan/ (PDF, 400kB)

First developed by identifying the record types and activities from an existing paper Trial Master File, the file plan was then aligned to the JISC Business Classification Scheme and extended to be appropriate for any research project. Consultation with various research groups across the University allowed consensus of terminology to be formed, and confidence gained that the structure would in fact be widely applicable.

The File Plan has several advantages, not least that it allows researchers to more easily find information not only on their own project, but when working across projects. Structures for managing data and supporting documents typically grow organically and make finding information difficult over time, or for parties who are not the primary contributors of content. Having a predefined methodology for classification also means that preservation and retention policies can be aligned (or even applied) to sets of documents and files and those files can be migrated to other systems more easily for publishing.

Researchers are now being offered the use of the University’s corporate Document Management system (LiveLink from OpenText), in which we can deploy the File Plan as a default template for each research project. To date, four active research projects have started to use this facility, in addition to the original eTMF, with a further eight engagements under way.

 

Dec 06 2012
 

It has been a while since I have committed one of these management reports to the blog, but this one seems to gel well and wrap around our recent articles, not to mention providing a linkfest opportunity, so it seems worth it this time around. Enjoy it if you can.

WP2 Cloud Services Pilot

The focus for this activity has been in looking at options for a collaborative workspace, in which researchers from within and without University of Hertfordshire can share data. Several threads have been pursued:

iFolder and SparkleShare (Dropbox alternatives). This work complements that done on OwnCloud by the Orbital project. iFolder originated from Novell and runs on SuSE Linux. We looked at this because its provenance and requirements seem to match our local infrastructure; however, the feature set is inferior to more modern alternatives and it looks like it may be moribund. SparkleShare is open-source software which puts a team-sharing layer on top of the Git version control system. It is a good candidate for a cloud hosting service and looks promising, but it would require considerable technical investment to operationalise at UH. Further investigation is required.

DataStage. DataStage offers WebDAV and a web browser GUI, independent user allocation (albeit via the command line), and a bridge to data repositories. We have conducted tests running on desktop machines and within our virtual server environment. The release candidates we have tested are not yet stable enough to support an operational service. Development of the v1.0 release of DataStage, which has been talked about for several months, seems to have stalled.

SharePoint has returned to our thinking, since it is widely available as a cloud service and offers WebDAV, a web browser GUI, and version control. When combined with Active Directory services, it seems to offer a cloud service to complement our existing networked service in a hybrid service model. Further investigation is required.

The utility of the main offerings in the cloud files market is being assessed. This has been less a technical appraisal and more a review of the Costs (over and above the initial free offer), Terms and Conditions and options for keeping data within specified jurisdictions. Further investigation continues.

RDTK is attempting to increase the usage of centrally managed network storage at UH. We continue to regularly encounter researchers who don't know how to use their storage allocation, or that they may request shared storage at a research group or project level (R: drive). We are producing new advice about, and encouraging the use of, the secure access client for our virtual private network, which is not well understood, but gives much more effective access to the networked file store than the usually advertised method. We intend to offer a 'request R: drive' feature on the RDM web pages, and facilitate the adoption of this facility, which again, is not known to many research staff.

A lengthy technical blog Files in the cloud ( http://bit.ly/R583If ) and a presentation A view over cloud storage (http://bit.ly/SB2cK8 ) bring the issues encountered in this work together. The presentation gave vent to the widespread interest, amongst many projects at the recent JISCMRD Workshop, in the issues around the use of 'dropbox-like' cloud services. I represented these voices in a session with the JANET brokerage, underlining the importance of nationally negotiated deals with Amazon, Microsoft, Google and particularly Dropbox, for cloud storage, in addition to the infrastructure agreements already in place. A reflection on the JISCMRD Programme Progress Workshop (http://bit.ly/SdTqCz) refers to this encounter.

WP3 Document Management Pilot

A part-time member of staff has been recruited to scan legacy documents into an electronic Trial Master File (eTMF) for work carried out by the Centre for Lifespan and Chronic Illness Research (CLiCIR). The work is progressing extremely well under the direction of the Principal Investigator. A journal of activity, issues and time and motion is being kept. After 8 x 0.5FTE weeks the first phase of scanning, covering the trial documentation, is all but complete. Only the original anonymous patient surveys remain. At this point there are ethical issues to consider, and a debate about whether this remaining material is useful, publishable data.

This is proving to be a valuable collaboration between the researcher, the University Records Manager, and the EDRMS system consultant. The draft Research Project File Plan has been updated in the light of practical experience and the work has attracted two further potential 'clients' from within the School of Life and Medical Sciences.

WP6 Review data protection, IPR and licensing issues

CT and SJ have begun reviewing literature on licensing data, including that from the DCC.

WP8 Data repository

An instance of DSpace 1.8 has been installed on a desktop machine with the aim of testing data deposit via the SWORD v2 protocol. Data deposit has been achieved using several mechanisms, including Linux shell commands, a simple shareware application, and the deposit component of DataStage. This latter piece was facilitated after generating interest at a programme workshop, whereupon a collaboration with other projects helped modify our instances of DSpace and DataStage so as to allow the SWORD protocol to work.
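
For readers unfamiliar with the mechanics, a SWORD v2 binary deposit is essentially an HTTP POST of a packaged zip to a collection URI with a handful of SWORD headers. The sketch below shows what that looks like using Python's requests library; the endpoint path, collection handle, packaging identifier and credentials are illustrative assumptions rather than our actual configuration.

```python
# Illustrative SWORD v2 deposit of a zipped package into a DSpace collection,
# using the requests library. The endpoint path, collection handle, packaging
# identifier and credentials are placeholder assumptions, not our real setup.
import requests

COLLECTION_URI = "http://dspace.example.ac.uk/swordv2/collection/123456789/2"

with open("dataset_package.zip", "rb") as package:
    response = requests.post(
        COLLECTION_URI,
        data=package,
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": "attachment; filename=dataset_package.zip",
            "Packaging": "http://purl.org/net/sword/package/SimpleZip",
            "In-Progress": "false",   # complete the deposit in a single step
        },
        auth=("depositor", "password"),
    )

# A 201 Created response indicates the item was accepted; the Location header
# and Atom entry in the body point at the newly created item.
print(response.status_code)
print(response.headers.get("Location"))
```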

SWORD deposit has also been demonstrated into the development system for the University Research Archive (UHRA – this is the old one). This is not working with DataStage as yet, as it needs further modification. Further progress awaits the roll out of the new system, which in turn has been delayed due to its dependencies on our newly re-engineered Atira Pure Research Information System (UH access only).

A presentation DataStage to DSpace, progress on a workflow for data deposit ( http://bit.ly/Xvmidr ) refers to this work.

Inter alia, there was considerable interest generated by a blog and submission to JISCMail with regard to organising a practical workshop on how to acquire Digital Object Identifiers for Datasets. Since that time Data.Bris have minted their first DOI using the BL/DataCite API and service, and the Bodleian Library are expected to do the same shortly. The current position is that a workshop might be arranged as part of the JISC sponsored BL/DataCite series. Simon Hodson is facilitating. The article DOIs for Datasets (http://bit.ly/QonFoN ) produced the largest spike in traffic seen by this blog so far (175 page views, 55 bit.ly clicks).

Expressions of interest to publish datasets are beginning to be cultivated, including, for example, datasets of oral histories and historical socio-economic data.

WP9 Research Data Toolkit

Content for the Research Data Toolkit is progressing well with all RDM team members contributing and refining the product. We have decided to restrict the 'toolkit' brand for use with the project and adopt a more generic RDM brand for the published material and activity, so as to create a foothold for a sustainable service. ToolKit resources will appear at herts.ac.uk/rdm which may re-direct to herts.ac.uk/research-data-management or herts.ac.uk/research/research-data-management. We are still in negotiation with the University web team over a content delivery platform.

The content is being developed in a platform agnostic way as a set of self-contained pages of advice, which could be delivered under different overarching models, and/or re-purposed for print. There is still some debate as to whether to arrange the pages in groups by activity or by research lifecycle stage. The draft table of contents and sample content are under wraps for now.

WP11 Programme Engagement

RDM team members have made 5 presentations and participated in 14 days of programme events including:

  • DCC London Data Management Roadshow, London
  • BL/Datacite: Working with DataCite: a technical introduction, London
  • BL/Datacite: Describe, disseminate, discover: metadata for effective data citation, London
  • JISC Managing Research Data Programme / DCC Institutional Engagements Workshop, Nottingham (3 presentations)
  • RDM Training Strand Launch Meeting, London (1 presentation)
  • JISC Managing Research Data Evidence Gathering Workshop, Bristol  (1 presentation)

The project blog (including 4 new articles) has received over 800 visits and nearly 1500 page views in the previous quarter.

Other Activity: recruitment

The project has been recruiting for 3 x RDM Champions to work in the University's research institutes for six months at 0.4 FTE. We were looking for an established member of each institute, with sufficient experience to be able to quickly embed sustainable, good-practice RDM among their peers. The vehicle for this will be the objective of assisting PIs to prepare or improve a significant number of Data Management Plans within each institute. Recruitment has been only partially successful. One person started on December 1 in the Health and Human Sciences Research Institute (HHSRI), another is due, subject to final agreement, to start on 28 January in the Social Sciences, Arts and Humanities Research Institute (SSAHRI). The remaining post, in the Science and Technology Research Institute (STRI), has not been filled.

Other Activity: data encryption workshop

We have produced a guide (http://bit.ly/QHyN2y), blog (http://bit.ly/XxDoEM) and workshop (http://bit.ly/11rwLXA) with regard to encryption of sensitive data for sharing, transport, and security on removable media. Over 20 university staff participated in the first workshop, and we have a waiting list for the next date, which will be announced this month. The material was well received and the feedback, though not yet properly evaluated, looks positive. We equipped 6 researchers with large capacity data sticks secured with TrueCrypt, and will evaluate their experience in February. We are encouraging the other attendees to try out the standalone encrypted container available via the toolkit blog (http://bit.ly/TKb8hG ).
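
Purely as an illustration of the idea of encrypting a file before it goes onto removable media (and not the TrueCrypt workflow distributed at the workshop), here is a sketch using the Python cryptography library's Fernet recipe; filenames and key handling are placeholders.

```python
# Illustration only: encrypting a file before it is copied to removable media,
# using the cryptography library's Fernet recipe. This is a stand-in for
# demonstration purposes, not the TrueCrypt container workflow used at the
# workshop; the filenames and key handling below are placeholders.
from cryptography.fernet import Fernet

# Generate a key once and keep it somewhere safe (NOT on the same USB stick).
key = Fernet.generate_key()
fernet = Fernet(key)

with open("sensitive_interviews.csv", "rb") as f:
    ciphertext = fernet.encrypt(f.read())

with open("sensitive_interviews.csv.enc", "wb") as f:
    f.write(ciphertext)

# To recover the data later, the same key is required.
with open("sensitive_interviews.csv.enc", "rb") as f:
    plaintext = fernet.decrypt(f.read())
```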

Oct 22 2012
 

One year in! Time flies when you are having fun, or trying to pin the tail on a donkey, which at times is how it feels to be a JISCMRD project manager. This isn't a complaint; it is a stimulating and worthwhile endeavour, and I think the programme is working well at UH. The Research Data ToolKit, even before it is properly manifest, is acting as an agent of change, and gaining momentum as the RDM team expands from 1 person, to 3, now 6, soon to be 9.

Most of JISCMRD 2011-2013 convened at NCSL in Nottingham Wed 24-Thu 25 October.  I was taken by the increased confidence and authority of my fellow travellers, compared to the prevailing feeling a year ago. In some senses, the horizon is no closer, indeed it may have receded further in the light of the knowledge we have all acquired; the difference is, perhaps, that the benefit of experience gives us conviction. The RDM problem won’t be fixed by JISCMRD, but those of us involved will be well placed to carry the effort forward beyond the life of the programme.

The progress workshop was packed with interesting sessions, touching all parts of the life cycle of research data. The only disappointment  I had was that I couldn’t divide myself in three to attend parallel sessions.

In my first presentation, A view over Cloud Storage, I sought to explore the circumstances under which cloud storage can and can't be utilised. Part of the intent was to stimulate discussion, and in this it was successful, as I seemed to touch a nerve by naming the elephants in the room: Dropbox, Skydrive, Googledrive (D, S & G). The issues around using these applications seemed to resonate throughout both days of the workshop. Before I become identified as an advocate for Dropbox I would like, in the manner of a minister redressing a half-baked policy, to 'clarify'. It is not a specific incarnation of any of these cloud storage apps that I am advocating; it is their feature set. Unless you work with more than a few gigabytes of data, the ease of use of these public cloud services makes them irresistible to researchers. The implications of the terms and conditions of use, which fall foul of pretty much any institutional policy that you could find, have little impact: usability wins over regulation. During the workshop Marieke Guy tweeted a list of alternative applications, and we discussed some of these, but no one could wholeheartedly endorse any of the candidates for a robust, reliable service. D, S & G simply work better than our own networked storage offerings in many, many RDM scenarios. Like it or not, this is the case.

In the final workshop session, John Milner gave an account of the major cloud and data centre framework agreement already concluded and the negotiations that the JANET Brokerage is planning to undertake with Amazon, Microsoft, Google and Dropbox. An agreement with Microsoft on Office 365 has been reached, and it is hoped that favourable terms with Amazon (for example, for EC2 and Glacier) and Microsoft Azure can be achieved in co-operation with Internet2 in the USA. Talks with Dropbox and Google have recently been initiated. John indicated that a 'negotiation' typically takes at least three to six months to see through. It was encouraging that John indicated that, despite their strong market positions, these companies are willing to discuss HE needs, and it is likely that education and research will attract favourable prices and terms and conditions of service, the latter of which (I suggest) is the higher hurdle to adoption. So perhaps JANET may yet resolve the search for an easy-to-use cloud storage application that can be brought within the constraints of our governance, use our authentication and work with our infrastructure; they are certainly working on it and keen to hear requirements from the sector!

I am seeing an app like D, S or G; sitting over hybrid storage; in our own datacentres or within the European Economic Area public cloud; accessed using our own passwords; and governed by our own Ts and Cs. Maybe for Christmas? Unlikely, but worth the thought.

RDTK’s presentations are available below:

RDTK A view over Cloud Storage, in Parallel Session 1B: Managing Active Data: storage, access, academic ‘dropbox’ services, JISCMRD progress workshop, Nottingham, 2012 (PDF, 0.6 MB)

RDTK DataStage to DSpace, progress on a workflow for data deposit, in Parallel Session 2B: Data Repositories and Storage: options for repository service solutions, JISCMRD progress workshop, Nottingham, 2012 (PDF, 1.5 MB)

RDMTPA Research Data Management Training for Physics and Astronomy, in Parallel Session 3A: Training and Guidance, JISCMRD progress workshop, Nottingham, 2012  (PDF, 1.8 MB)

RDTK Poster, Service Oriented Toolkit for Research Data Management, in Poster Session, JISCMRD progress workshop, Nottingham, 2012, Poster (PDF, 1.9 MB)

Other recent blogs:

 

Oct 19 2012
 

Attached to the University of Hertfordshire's Data Policy is a handy DOs and DON'Ts guide to handling Personal and Confidential Information (PCI). Research data often falls under the definition of PCI, because it is ethically sensitive or has commercial value to the University or a sponsor. It probably won't be a surprise to anyone engaged in JISCMRD that we find that practice given as 'unacceptable' by the guide is actually common in the research community. Saving PCI on a non-University computer; use of portable media devices to store or backup PCI; regular transfer or unencrypted transfer of PCI via portable media – all these happen…often.

Oct 17 2012
 

The hybrid cloud approach is being adopted in many organisations, particularly where high quality local infrastructure is in place and is likely to remain useful for a period of years but needs extending or migrating. In these cases the cloud offers expansion of local facilities and a gradual route to full migration. The cloud element often takes the form of offsite failover systems or elastic storage to accommodate short- to medium-term changes in demand. RDTK has been investigating ways in which this extra storage capacity might be utilised at the University of Hertfordshire.

As I have blogged previously, our existing networked storage is underused by researchers. The key factors in this are the perceived lack of capacity and difficulty of sharing data with external collaborators. In most cases both these issues can be resolved with extra provision, beyond that which is usually allocated to research staff, but the way we do this is currently ad-hoc. RDTK is looking at consolidating the process of this extra allocation in order to ease the barriers to uptake. In addition, we have looked at alternatives to the way we provide access to storage, including those we know our researchers are using, or would like to use.

In partnership with HRC3 we set up simple file storage, backup, database and Microsoft SharePoint facilities. These services are 'in the cloud' in the sense that they are off site (actually in Iceland) but they represent simple functional services that have been available from Internet Service Providers since before the Cloud coalesced. The rest of this article focuses on the use of simple file space, though we found the same conclusions can be applied in general to SharePoint or database provision.

Cloud attached files systems

Figures 1, 2 and 3 below show three connections to our test cloud storage hosted at the Thor datacentre in Iceland (Thor is operated by Advania in partnership with HRC3). We compared this with our in-house facilities, which are also available off-campus, via a virtual private network. To ease the discussion I will refer to the two parts of our newly formed hybrid cloud as CommerceCloud and UHStaffCloud.

Figure 1 – cloud CIFS volume mapped to a drive on a Windows 7 desktop
Figure 2 – cloud WebDAV volume mounted on a Mac OS X desktop

In Figures 1 and 2 CommerceCloud storage is attached to Windows7 and Mac OSX respectively.  This was achieved in the usual way of mounting remote volumes on these platforms, and was easy to do. CommerceCloud was distinguishable from UHStaffCloud only by performance, but this performance gap was considerable when the client was inside the local area network (LAN).

We tested CommerceCloud with both CIFS/Samba and WebDAV/HTTP protocols and found that the latter performed much better, particularly for OS X, but both were significantly slower than the local UHStaffCloud. This is not unexpected: it is due to the delay, known as latency, introduced by moving data over long distances on a wide area network. CommerceCloud was 10 to 20 times slower than UHStaffCloud when working from my desk on our LAN. Again, this is unsurprising, given that we can move data around at 100 Mbit/s on the LAN.

The situation is different when working at home or on some other remote public network. In this scenario, both CommerceCloud and UHStaffCloud are remote from the user’s location, and both suffer latency, though to different extents. Latency continues to influence performance, but bandwidth becomes the dominant factor.

Using a domestic cable connection with an effective download speed of about 20 Mbit/s, UHStaffCloud was faster than CommerceCloud by a factor of only 3 or 4. When uploading files, the two services were comparable, because the effective upload speed of the domestic connection was constrained to about 2–4 Mbit/s, well within the capacity of both target networks.

This suggests that on a typical domestic or public broadband connection of 4/0.5 Mbit/s (download/upload), the performance of CommerceCloud and UHStaffCloud becomes comparable.
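A rough back-of-the-envelope sketch makes the point (this is illustrative arithmetic only, not one of our measurements, and it assumes an otherwise idle line): the minimum time to move a file is simply its size divided by the line speed, whatever the latency.

# Illustrative only: the floor on transfer time is size / line speed,
# regardless of latency. Figures assume an otherwise idle connection.
def min_transfer_seconds(size_mb, link_mbit_per_s):
    return size_mb * 8 / link_mbit_per_s   # megabytes -> megabits, then divide by rate

for label, link in [("100 Mbit/s LAN", 100),
                    ("20 Mbit/s cable download", 20),
                    ("4 Mbit/s broadband download", 4),
                    ("0.5 Mbit/s broadband upload", 0.5)]:
    print("120 MB file over {}: at least {:.0f} s".format(label, min_transfer_seconds(120, link)))

On the 0.5 Mbit/s upload path the two services are indistinguishable, because neither can be fed any faster than that.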

In the special case where collaborators are working at different points on the JANET network, our own storage remains superior due to its advantageous connection to JANET, but we expect CommerceCloud, though inferior, to be acceptable.

Figure 3 – cloud storage via a secure FTP client application

Figure 3 shows an alternative way of working. In this case we used FTP via FileZilla, which was the fastest FTP client we tested on both Windows and OS X. From within the UH network we consistently saw more than 25 Mbit/s to and from CommerceCloud (with a peak above 55 Mbit/s). This was often nearly as fast as UHStaffCloud, and never slower by more than a factor of 2. An equivalent comparison for off-campus use is hard to make, because both UHStaffCloud and CommerceCloud were significantly limited by the available bandwidth, but we expect them to be comparable.

For people who can accept the use of a client application rather than desktop integration or, better still, are confident with the command line, FTP remains the best way to share files.
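As a minimal sketch of that scripted route (the host, credentials and file name below are placeholders, not the details of our test service), something like the following is enough to push a file over FTPS and report the effective rate:

# Minimal sketch: upload a file over FTPS and report the effective throughput.
# Host, credentials and file name are placeholders, not our real service.
import os
import time
from ftplib import FTP_TLS

HOST = "files.example-cloud.net"     # hypothetical endpoint
LOCAL_FILE = "dataset.zip"

ftps = FTP_TLS(HOST)
ftps.login(user="researcher", passwd="********")
ftps.prot_p()                        # encrypt the data channel as well as the control channel

start = time.time()
with open(LOCAL_FILE, "rb") as fh:
    ftps.storbinary("STOR " + os.path.basename(LOCAL_FILE), fh)
elapsed = time.time() - start
ftps.quit()

mbits = os.path.getsize(LOCAL_FILE) * 8 / 1e6
print("effective rate: {:.1f} Mbit/s".format(mbits / elapsed))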

One factor to note is that we used files of between 2 MB and 120 MB for these tests, which are perhaps larger than the files most people will be sharing over a network. The reason we didn’t use small files was that the delay at the start and end of transfers, introduced by handshaking in the filesystem or application (not latency), was significant and would have made comparison difficult.
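A toy model shows how badly that fixed handshake cost skews the apparent throughput for small files (the 1.5 second per-transfer overhead below is an assumption chosen for illustration, not a measured value):

# Toy model: a fixed per-transfer overhead swamps small transfers, so small
# files make the link look much slower than it really is. Overhead is assumed.
def apparent_mbit_per_s(size_mb, line_mbit_per_s=25, overhead_s=1.5):
    total_s = size_mb * 8 / line_mbit_per_s + overhead_s
    return size_mb * 8 / total_s

for size_mb in (0.1, 2, 120):
    print("{:>6} MB file: {:4.1f} Mbit/s apparent".format(size_mb, apparent_mbit_per_s(size_mb)))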

To some extent this work confirms what was already known: the latency of almost any remote connection makes it compare poorly with a local area network. However, the ideal situation of two people sharing data on the same LAN, or even on a high-speed WAN such as JANET, is not the norm.

So to conclude..

Files in the cloud offer an opportunity to expand existing provision at University of Hertfordshire. When working in a collaboration from home, abroad, or with colleagues at other institutions, the available bandwidth moderates the effect of connection latency, and in many cases the storage system can respond as fast as the connection can deliver, regardless of its location.

The elephants in the room..

Dropbox, Microsoft SkyDrive, Google Drive. We know these consumer-level products are popular with a lot of researchers because they are just so easy. The work above underlines one of the reasons why these applications are so effective: latency would still be an issue here too, but their desktop versions avoid it by using asynchronous background transfer. When you save, close or drop a file into the desktop folder associated with these applications, they synchronise, moving data to and from the cloud while you get on with something else. They are slow, slow, slow, but you don’t often notice, and they can also use this ‘background time’ to do other good things like encryption and chunking, which allows only the parts of a file that have changed to be transferred.
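To make the chunking idea concrete, here is a toy sketch of the general approach: split the file into fixed-size blocks, hash each block, and queue only the changed blocks for transfer. This illustrates the principle rather than any vendor’s actual implementation, and the 4 MB block size is an arbitrary assumption.

# Toy illustration of chunked ("delta") sync: hash fixed-size blocks and
# re-transfer only the blocks whose content changed since the last sync.
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, purely illustrative

def block_hashes(path):
    """Ordered SHA-256 digests, one per block of the file."""
    digests = []
    with open(path, "rb") as fh:
        while True:
            block = fh.read(BLOCK_SIZE)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests

def blocks_to_upload(path, previous_digests):
    """Indices of blocks that differ from (or extend beyond) the last sync."""
    current = block_hashes(path)
    return [i for i, digest in enumerate(current)
            if i >= len(previous_digests) or digest != previous_digests[i]]

Save a small change in the middle of a big file and only one block goes back to the cloud, which is one reason you rarely notice the transfer happening.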

So why even consider the old, ‘pre-Cloud’ technologies that we have investigated above? Because the terms of use of Dropbox et al. remain problematic or unacceptable for some RDM scenarios (fewer scenarios than most policies would allow, but more than most researchers would consider). Brian Kelly and Joss Winn’s comments on Orbital’s very useful article about ownCloud begin the case against Dropbox nicely; I don’t intend to follow the trail of those arguments here.

One advantage of the methods we looked at above is that they sit relatively well with the storage and authentication systems currently found in Higher Education. When combined with smoother processes for setting up users, they offer a path of low resistance to improved services whilst staying within the reach of our governance. This is why they remain important.

Until offerings such as ownCloud evolve into a scalable and robust ‘Academic Dropbox’, the old protocols used with cloud storage will still be useful.

Oct 15 2012

Call for interest in a DOIs for Datasets workshop.

Overshadowed by the subsequent trading of blows over the colour of Open Access, RCUK’s policy toward open data became more explicit in their announcement on July 16.

“and a statement on how the underlying research materials such as data, samples or models can be accessed”

Not if. How. At University of Hertfordshire we had already decided, in the context of our EPSRC roadmap, to extend our institutional repository to support datasets. A major aspect of making this work is the provision of Digital Object Identifiers for our data.

As a newcomer to the JISCMRD programme a year ago in 2011, I hope I would have been forgiven for thinking that the DOI piece of the MRD jigsaw was firmly in place; a given. I had grounds for this casual assumption. Witness: DOI was well established and seemingly uncontentious in the JISCMRD lexicon; UHRA, our own institutional repository, was littered with DOIs; they have been around since before the millennium; and well – it is easy – Digital Object Identifier, a widely used citation mechanism, a persistent, unique ID for a digital thing. This complacency was compounded by later experience: exposure to Data Dryad’s use of DOIs for datasets (tick) and then Louise Corti’s excellent presentations about data citation and versioning. Job done.

Right up until the point at which you begin to need one, DOIs look straightforward. However, as we approach the moment at which we will be asking our researchers to begin publishing datasets in our repository, the hard questioning begins. At the first of the British Library DataCite workshop series (reported by Kaptur and data.bris), I began to see less clearly. Or at least, to feel that hyperopia had set in. The goal was still in sight but the details in the foreground were not clear.

The questions began to pile up. How do we get DOIs for our datasets? Is there an API to a DataCite/BL service? Could or should University of Hertfordshire mint DOIs? Would local minting consortia be more appropriate? What about the B-word – where is the benefit over the equivalent handle system already built into our repository and shared by umpteen thousand other DSpace installations? Panic in the detail.

Before this blog turns into a bleat, it is time to calm down and visualise the problem; this always helps:

Well, it helped me anyway.

In all seriousness, I think that most JISCMRD projects will have to answer these questions and flesh out most of the lines on this mind map eventually, particularly in the detail over on the right-hand side. In all probability these issues are tractable, and it is just a matter of enough effort. But it seems sensible to share the problem if many of us are to be occupied by it. We have had some early discussions with the British Library and, with their encouragement, I would like to propose a DOIs for Datasets workshop, over and above the continuing BL/DataCite series, specifically focused on how to acquire or mint DOIs for our datasets. The University of Hertfordshire would be pleased to arrange such an event if there is interest from enough programme members. The agenda would be dictated by demand, but we foresee some sessions already: the role of consortia vs national minting services; the service level agreements and obligations of a minting body; and an overview of existing services, APIs, scripts and other magic. The workshop would be held in London or Hatfield in early winter 2012/2013.
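By way of a taster for the ‘existing services, APIs, scripts and other magic’ session, the sketch below shows roughly what minting might look like against the DataCite Metadata Store (MDS) interface. It is a hedged illustration only: the allocator symbol, credentials, metadata file and landing page are placeholders, and whether we would talk to MDS directly, through a consortium, or via the British Library is precisely one of the open questions above.

# Hedged sketch of DOI minting against the DataCite Metadata Store (MDS).
# Credentials, prefix, metadata file and landing page are all placeholders.
import requests

MDS = "https://mds.datacite.org"
AUTH = ("ALLOCATOR.SYMBOL", "********")                 # issued by the allocator
DOI = "10.5072/uh-example-0001"                         # 10.5072 is the DataCite test prefix
LANDING = "http://uhra.herts.ac.uk/handle/2299/99999"   # hypothetical repository record

# 1. Deposit a DataCite XML record that declares the DOI.
with open("datacite_record.xml", "rb") as fh:
    r = requests.post(MDS + "/metadata", data=fh.read(), auth=AUTH,
                      headers={"Content-Type": "application/xml;charset=UTF-8"})
    r.raise_for_status()

# 2. Register the DOI itself by binding it to the landing page URL.
body = "doi={0}\nurl={1}".format(DOI, LANDING).encode("utf-8")
r = requests.put(MDS + "/doi/" + DOI, data=body, auth=AUTH,
                 headers={"Content-Type": "text/plain;charset=UTF-8"})
r.raise_for_status()
print("registered", DOI)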

To register an interest in DOIs for Datasets please use the comment form below. If you feel moved to discuss the proposed workshop, or any of the issues arising, on Twitter, please use the #dois4datasets tag.

Aug 14 2012

The University of Hertfordshire’s Research Data Management activity is being extended after a successful bid to Strand E (Research Data Management Training) of the JISC Managing Research Data Programme 2011-13.

In a one-year project, Research Data Management Training for the whole project lifecycle in Physics & Astronomy research (RDMTPA), the University will develop Research Data Management (RDM) training materials directed at postgraduate and early-career researchers in the physical sciences. The project will collaborate with the University’s Centre for Astrophysics Research and the Centre for Atmospheric Instrumentation Research to produce a short course in RDM. It will leverage the outputs of existing JISCMRD work, within and beyond the University. The short course will adopt a whole project lifecycle approach, from data management planning, through good data safekeeping, to curation options and arrangements for data re-use. The course will be designed to integrate with, and extend, the Generic Training for Researchers programme at University of Hertfordshire. Although the primary market will be early-career researchers, we expect the materials to be useful to information professionals such as discipline liaison librarians and research liaison officers.

We are pleased to welcome Dr Jo Goodger to the University’s RDM team to work on this exciting new project. Jo is an active astrophysics researcher in the field of radio-loud active galaxies and also has extensive experience in science outreach, including the development of the Luggage Lab.

The short course will be made available via a variety of channels, including the University’s StudyNet VLE, the JORUM repository of teaching and learning resources, and here on the Research Data Toolkit website. RDMTPA will also share project management and governance with the Research Data Toolkit.

See http://research-data-toolkit.herts.ac.uk/category/training/

May 23 2012

Atira PURE is a current research information system (CRIS) that has been adopted by around 20 UK HEIs. The UK PURE user group works closely with Atira to define requirements and maintain a unified data model across all UK implementations. The user group met last week at The University of Aberdeen, with several institutions that have JISCMRD projects represented. The preoccupation of the meeting was with the present, in the shape of mock Research Excellence Framework assessments, but there was also discussion of the product roadmap and some items of interest for those with a foot in the research data management camp.

CERIF2: PURE’s data model will be continually adjusted to match CERIF developments.

OpenAIRE Compliance: support for the OpenAIRE format will be added to PURE’s OAI-PMH harvesting interface.

PURE as a repository: a new player in the market? PURE currently supports ‘connectors’ to DSpace, ePrints, and Equella, so that research outputs originating in the CRIS feed through to an existing repository system. Whilst making a clear commitment to maintaining these interfaces, Atira restated their belief that PUREPortal offers an alternative that could replace a traditional repository system in full. The best example of this is at Aalborg University. At University of Hertfordshire we maintain a DSpace repository, but our PURE CRIS is now the primary source for almost all our repository content. This is a similar position to that of the University of Edinburgh and several others. We have reasons for keeping DSpace at the moment, not least because it is open source and offers the opportunity to be hacked to try out new initiatives, such as publishing data. There are several new PURE repositories about to go live, mainly at universities that do not have an existing public presence. It will be interesting to see if PUREPortal continues to gain traction among those of us who already have systems online. I think it may struggle to penetrate further until the REF is concluded and everyone has time to breathe, reflect and address new projects. (RDTK is already experiencing inertia due to the REF, which is an overriding priority for researchers and administrators alike.)

PURE and Datasets: there was quite a lot of discussion about data, with two tangled – but, with hindsight, distinct – threads: the first about data as a primary research output in the REF, and the other about the new imperative to publish data in support of traditional publications. The first thread came up when the meeting was considering how PURE currently expresses non-textual outputs, including physical art outputs, events, source code, and data. This naturally drifted into a discussion about metadata, wherein I began to fear we would be mired for the rest of the meeting; but the CERIF gurus rescued the day with a timely intervention about the likely outcomes of CERIF for Datasets (C4D). By this route we arrived neatly back at the first point above. Atira have previously told me that they are waiting on the inclusion of a metadata model for data in CERIF, and will implement it when it arrives. I pointed out that in order to fulfil their aspiration as a repository vendor they will also have to address more than just the metadata issues, for example in the way that @mire have done with their media streaming plugin for DSpace. (As an aside – @mire tell me the DSpace developer community is also taking a keen interest in C4D.)

Data working group: the conclusion of these discussions was that a working group should be convened to report on data publishing issues at future UK PURE user group meetings. If anyone in JISCMRD who is not in the PURE user group would like to feed into this, then #rdtk_herts can facilitate.

May 23 2012

The UH Research Data Assets Survey has been launched and is planned to run until mid-June. The survey can be found at http://sdu-surveys.herts.ac.uk/rdas.

All research activity generates data in some form, even if you don’t recognise it as such. Valuable data is often found in unstructured, everyday office formats, embedded in your working papers. It is as important to understand the requirements of this ‘free-form’ data as it is to understand well-defined collections, so every contribution from every research area will be valuable.

This survey aims to help the Research Data Toolkit team understand the research data landscape at the University and to plan the most effective support for research data management going forward. It should take no more than 20 minutes to complete.

The data collected in this survey will be held securely. The results will be anonymised and published at http://www.herts.ac.uk/research-data-toolkit. We ask for your name and email address so that we can contact you with regard to good practice and interesting issues to feed into the Research Data Toolkit. No personal data will be reported.

The survey has got off to a good start, with 20 respondents in the first 24 hours. Please find the time to complete it so that we can gather a true picture of research data requirements across the whole University, and in particular from within those departments and groups that would not traditionally see themselves as data generators.

May 04 2012

The second project Steering Group meeting took place on 30th March, with Professor John Senior, Pro Vice-Chancellor (Research), in the chair. The meeting considered and approved Terms of Reference for the Stakeholder Group and a six-month progress report. Some re-scheduling of work packages and modifications to the risk register were also approved.

The meeting papers are attached below:

 

Apr 04 2012

The first project Steering Group meeting took place on 9th December, with Professor John Senior, Pro Vice-Chancellor (Research), in the chair. The meeting approved the RDTK project plan and considered terms of reference and a risk register.

The meeting papers are attached below: