Globus Download in Hyrax

Bess Sadler

March 1, 2021

Globus is a tool for transferring very large datasets. It has many advantages over older systems for transferring files, and researchers increasingly expect data repositories to offer Globus integration. While the Samvera community does not yet offer an official Globus integration, several institutions have integrated Globus into their repository systems. Notch8 was recently asked to write such an integration for the Hyrax-based Rutgers Virtual Data Collaboratory (which has not yet launched). This blog post describes the research and design process behind that work, and provides links to some sample code and pointers for future development.

Previous Work

As we undertook this work, we were aided greatly by informational interviews with Nabeela Jaffer at the University of Michigan and David Chandek-Stark at Duke University. UM and Duke have implemented similar strategies for Globus integration, with a few differences. We are grateful to our colleagues at these institutions for sharing their time and expertise. This is a wonderful example of how working in the open advances the state of data repositories much faster than teams working in isolation can.

High Level Architecture

Globus Integration for Rutgers VDC Data Download

To enable download via Globus, we are following the same general pattern that both UM and Duke are using:

-- Create a shared volume that is writeable by the Hyrax application process
-- Create a Globus endpoint that reads from that same volume
-- Automate the export of data sets from Hyrax to that shared volume, organized by unique id
-- Generate a predictable link that includes the institution’s Globus ID and the item’s unique id, which will allow a user access to the files via the Globus web client
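
As a rough illustration of the last two steps, here is a minimal Ruby sketch of how an export directory and a download link might be derived from a work's unique id. The mount point, the endpoint UUID, and both method names are placeholders invented for this example, and the link format assumes the Globus web app's file manager URL with origin_id and origin_path query parameters. It is not taken from the UM, Duke, or Rutgers code.

```ruby
require "uri"

# Assumed values: replace with the real mount point and Globus endpoint UUID.
EXPORT_ROOT = "/mnt/globus_exports"
GLOBUS_ENDPOINT_ID = "00000000-0000-0000-0000-000000000000"

# Directory on the shared volume that holds one work's files.
def export_dir_for(work_id, root: EXPORT_ROOT)
  File.join(root, work_id)
end

# Link that opens the Globus web client's file manager at the work's directory.
# The path is relative to the root that the endpoint exposes.
def globus_download_url(work_id, endpoint_id: GLOBUS_ENDPOINT_ID)
  query = URI.encode_www_form(origin_id: endpoint_id, origin_path: "/#{work_id}/")
  "https://app.globus.org/file-manager?#{query}"
end

puts export_dir_for("abc123")
puts globus_download_url("abc123")
```

Because both values are computed from the same id, the link can be rendered without looking anything up, as long as the export has actually landed in that directory.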

A single work from the Duke Research Data Repository, available for download via the Globus client

The top-level directory of the Duke Research Data Repository, visible via the Globus client, showing all of the datasets

Implementation choices

While the UM, Duke, and Rutgers solutions all share the same high-level pattern, there are some key differences. Please note that this document is not a complete analysis of each solution; it is only a report of the analysis done at Notch8 in order to fulfill a specific contract for Rutgers University.

Michigan: On-demand export

The University of Michigan's Deep Blue Data repository copies files on demand for Globus download. The user is offered a button that copies a dataset to the Globus download space in a background job, and the user is emailed when the item is ready for download. Heavily used datasets remain in the Globus download space, and for those items the user can download via Globus immediately, with no waiting. The advantages of this approach include more efficient use of space and thus reduced cost for repository operation. The disadvantages include increased complexity (e.g., the need for an on-demand job to copy the files and notify the user when their files are ready) and the need for active storage management (the Globus download space must periodically be cleaned out).
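
To make the on-demand flow concrete, here is a small plain-Ruby sketch of the decision it implies: serve immediately if the dataset is already staged, otherwise queue a copy and notify the user afterwards. The class name, the in-memory queue, and the notifier callable are all invented for this illustration. This is not Deep Blue Data's code, and a real application would use a proper background job system and mailer.

```ruby
require "fileutils"

# Hypothetical helper for an on-demand Globus staging area.
class OnDemandStager
  def initialize(source_root:, staging_root:, notifier:)
    @source_root = source_root
    @staging_root = staging_root
    @notifier = notifier # any callable taking (email, work_id)
    @queue = []          # stand-in for a real background job queue
  end

  # True when the dataset already sits in the Globus download space.
  def staged?(work_id)
    Dir.exist?(File.join(@staging_root, work_id))
  end

  # Returns :ready when the user can download now, :queued otherwise.
  def request(work_id, email)
    return :ready if staged?(work_id)

    @queue << [work_id, email]
    :queued
  end

  # Stand-in for the background worker: copy each queued dataset, then notify.
  def work_off
    until @queue.empty?
      work_id, email = @queue.shift
      unless staged?(work_id)
        FileUtils.mkdir_p(@staging_root)
        FileUtils.cp_r(File.join(@source_root, work_id),
                       File.join(@staging_root, work_id))
      end
      @notifier.call(email, work_id)
    end
  end
end
```

The periodic cleanup mentioned above would be a separate task that removes staged directories nobody has used recently. It is left out here because the source does not describe how that selection is made.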

Duke: Nightly batch exports

The Duke University solution instead makes all of its public data available for download at any given time. Dataset export is tracked via a Rails ApplicationRecord object called Globus::Export, which records whether a work has been exported, whether that export succeeded, and when the last export occurred. A nightly scheduled process scans the repository for newly added works by checking each work against the table of Globus::Export records, kicking off an export for any work that has not yet been exported.
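
The nightly scan can be pictured with a few lines of plain Ruby. In this sketch a Struct stands in for the Globus::Export record, and the method names and the exporter callable are invented for the example. It treats any work that already has a record as exported, which is one reading of the description above. It is not Duke's code.

```ruby
# Plain-Ruby stand-in for the Globus::Export record described above.
ExportRecord = Struct.new(:work_id, :succeeded, :exported_at)

# Ids of works that have no export record yet.
def works_needing_export(work_ids, export_records)
  work_ids - export_records.map(&:work_id)
end

# One pass of the nightly job: export each missing work and record the outcome.
# `exporter` is any callable that returns true on success.
def run_nightly_export(work_ids, export_records, exporter, now: Time.now)
  works_needing_export(work_ids, export_records).each do |work_id|
    ok = begin
      exporter.call(work_id) ? true : false
    rescue StandardError
      false # a failed export is still recorded, with succeeded = false
    end
    export_records << ExportRecord.new(work_id, ok, now)
  end
  export_records
end
```

In a Rails application the array of records would be a database table, and the set difference would more likely be a query than an in-memory subtraction.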

Rutgers: Using the Hyrax Actor Stack

The Rutgers solution is closer to real time than either of the above. We adopted the Globus::Export ApplicationRecord from the Duke solution, but our version of Globus::Export has two additional fields: expected_file_sets and completed_file_sets. One of the challenges around data import in Hyrax is that file attachment happens via background jobs, and there is no obvious way to know when a work has been totally assembled. However, by the end of the initial run of the Actor Stack, we know the list of FileSet objects that are attached to a work. We record that list of FileSet identifiers on a Globus::Export. Then we insert into the file-attachment background job a method that kicks off a Globus export of a particular FileSet and, assuming all goes to plan, records that FileSet id in the corresponding Globus::Export#completed_file_sets. When generating the user-facing view of a work, we check the Globus::Export object for that work, and if the #expected_file_sets match the #completed_file_sets, we display the generated download link.
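
The bookkeeping described above comes down to comparing two sets of FileSet ids. The following plain-Ruby sketch shows that comparison outside of Rails. The class name ExportTracker and its methods are invented for this example, and only the two field names come from the description above, so treat it as an illustration rather than the code that was delivered.

```ruby
require "set"

# Hypothetical stand-in for the two extra fields on Globus::Export.
class ExportTracker
  attr_reader :expected_file_sets, :completed_file_sets

  # Called once the initial Actor Stack run has produced the list of FileSet ids.
  def initialize(expected_file_sets)
    @expected_file_sets = Set.new(expected_file_sets)
    @completed_file_sets = Set.new
  end

  # Called from the file-attachment job after one FileSet has been exported.
  # Ids that were never expected are ignored and reported as false.
  def mark_completed(file_set_id)
    return false unless @expected_file_sets.include?(file_set_id)

    @completed_file_sets << file_set_id
    true
  end

  # The download link is shown only when every expected FileSet has been exported.
  def ready?
    !@expected_file_sets.empty? && @expected_file_sets == @completed_file_sets
  end
end

tracker = ExportTracker.new(%w[fs1 fs2])
tracker.mark_completed("fs2")
puts tracker.ready? # false: fs1 is still outstanding
tracker.mark_completed("fs1")
puts tracker.ready? # true
```

Using sets means the order in which background jobs finish does not matter. Treating a work with no expected FileSets as not ready is a choice made for this sketch, since a link to an empty directory is of little use.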

Future work

Future work for this integration might include:

-- leveraging the browse-everything gem’s file system integration to also allow for Globus upload to a particular directory, where data would then be available for cataloging and deposit into Hyrax
-- improved error checking, including more robust checksum validation when the files are copied
-- extraction of this functionality into a gem that could be installed and configured into a Hyrax application without the need for much local customization
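
As one possible shape for the checksum validation item, here is a short Ruby sketch that copies a file and verifies the copy by comparing SHA-256 digests. The method name and the error class are invented for this example, and the choice of SHA-256 is an assumption rather than something the project specified.

```ruby
require "digest"
require "fileutils"

# Hypothetical error raised when the copy does not match the source.
class ChecksumMismatch < StandardError; end

# Copies source to destination and returns the verified SHA-256 hex digest.
# Removes the bad copy and raises ChecksumMismatch if the digests differ.
def copy_with_checksum(source, destination)
  expected = Digest::SHA256.file(source).hexdigest
  FileUtils.mkdir_p(File.dirname(destination))
  FileUtils.cp(source, destination)
  actual = Digest::SHA256.file(destination).hexdigest
  return actual if actual == expected

  FileUtils.rm_f(destination)
  raise ChecksumMismatch, "checksum mismatch copying #{source}"
end
```

If the repository already stores a checksum for each file, comparing against that stored value instead of recomputing it from the source would also catch corruption that happened before the copy.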

This has been a rewarding project, and we are so grateful to the team at Rutgers for the opportunity to better understand the needs of research scientists working with large data sets!