Core concepts behind Rucio explained
Stories from Victor Azizi
Rucio [1] is a data management solution for big data in data-intensive experiments [2] [3] [4] [5].
In the context of Rucio, big data can be defined as distributed data at terabyte scale (or larger) that has to be managed across multiple sites. Rucio also works for smaller datasets, but for those one is usually better off with a traditional data management solution, as they can easily be stored in one place.
Concepts
Data identifiers
Rucio manages a collection of files, datasets, and containers. Each of those has a unique Data Identifier (DID).
Datasets contain files; containers contain datasets and other containers.
Files do not need to be in datasets, and datasets do not need to be in containers.
File-DIDs have a checksum attached to them, so that their content can be verified, which helps ensure reproducibility.
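The hierarchy above can be sketched in a few lines of Python. This is a toy model for illustration only, not Rucio's own API: the `DID` and `Catalogue` classes and the example names are hypothetical, the `scope:name` form follows the way Rucio writes DIDs, and Adler-32 is used here as one of the checksum types Rucio supports.

```python
import zlib
from dataclasses import dataclass

# Which kinds of DID each kind may contain (files contain nothing).
ALLOWED_CHILDREN = {
    "file": set(),
    "dataset": {"file"},
    "container": {"dataset", "container"},
}


@dataclass(frozen=True)
class DID:
    """Hypothetical stand-in for a Data Identifier, written as scope:name."""
    scope: str
    name: str
    kind: str  # "file", "dataset" or "container"

    def __str__(self):
        return f"{self.scope}:{self.name}"


class Catalogue:
    """Toy catalogue: remembers children of collections and file checksums."""

    def __init__(self):
        self.children = {}   # dataset/container DID -> list of child DIDs
        self.checksums = {}  # file DID -> Adler-32 checksum as 8 hex digits

    def add_file(self, scope, name, content):
        did = DID(scope, name, "file")
        self.checksums[did] = f"{zlib.adler32(content):08x}"
        return did

    def add_collection(self, scope, name, kind):
        if kind not in ("dataset", "container"):
            raise ValueError(f"not a collection kind: {kind}")
        did = DID(scope, name, kind)
        self.children[did] = []
        return did

    def attach(self, parent, child):
        # Enforce: datasets hold files, containers hold datasets/containers.
        if child.kind not in ALLOWED_CHILDREN[parent.kind]:
            raise ValueError(f"a {parent.kind} cannot contain a {child.kind}")
        self.children[parent].append(child)

    def files_in(self, did):
        """Resolve any DID to the list of file DIDs it ultimately contains."""
        if did.kind == "file":
            return [did]
        found = []
        for child in self.children[did]:
            found.extend(self.files_in(child))
        return found


catalogue = Catalogue()
run_file = catalogue.add_file("user.alice", "run1.dat", b"hello")
dataset = catalogue.add_collection("user.alice", "run1", "dataset")
container = catalogue.add_collection("user.alice", "all_runs", "container")
catalogue.attach(dataset, run_file)
catalogue.attach(container, dataset)
print([str(d) for d in catalogue.files_in(container)])
```

Attaching a file directly to a container raises a `ValueError` in this sketch, which mirrors the containment rule described above.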
Storage Elements and file replicas
Rucio manages a collection of Rucio Storage Elements (RSEs).
A storage element is an (editable) description of how to access data at a particular location.
A replica is the existence of a File-DID at a specific storage element.
A File-DID can have zero, one, or many replicas.
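To make the relation between storage elements and replicas concrete, here is a minimal Python sketch. It is not Rucio's API: the `ReplicaCatalogue` class, the host names, and the way the URL is assembled are all assumptions made for illustration (a real deployment maps DIDs to physical paths in its own way).

```python
from collections import defaultdict


class ReplicaCatalogue:
    """Toy model: storage elements are access descriptions, replicas are
    (file DID, storage element) pairs."""

    def __init__(self):
        self.rses = {}                    # RSE name -> access description
        self.replicas = defaultdict(set)  # file DID -> set of RSE names

    def add_rse(self, name, protocol, hostname, prefix):
        # The description is plain data, so it can be edited later,
        # for example when a site changes its hostname.
        self.rses[name] = {
            "protocol": protocol,
            "hostname": hostname,
            "prefix": prefix,
        }

    def add_replica(self, did, rse):
        if rse not in self.rses:
            raise KeyError(f"unknown storage element: {rse}")
        self.replicas[did].add(rse)

    def locations(self, did):
        """Names of all storage elements holding a replica (may be empty)."""
        return sorted(self.replicas.get(did, set()))

    def url(self, did, rse):
        """Build an access URL from the RSE description (simplified)."""
        if rse not in self.replicas.get(did, set()):
            raise LookupError(f"{did} has no replica at {rse}")
        desc = self.rses[rse]
        path = did.replace(":", "/")
        return f"{desc['protocol']}://{desc['hostname']}{desc['prefix']}/{path}"


catalogue = ReplicaCatalogue()
catalogue.add_rse("SITE_A", "root", "storage.site-a.example", "/data")
catalogue.add_rse("SITE_B", "https", "storage.site-b.example", "/rucio")

catalogue.add_replica("user.alice:run1.dat", "SITE_A")
catalogue.add_replica("user.alice:run1.dat", "SITE_B")

print(catalogue.locations("user.alice:run1.dat"))  # two replicas
print(catalogue.locations("user.alice:run2.dat"))  # known DID, no replica yet
print(catalogue.url("user.alice:run1.dat", "SITE_A"))
```

Keeping the access description separate from the replica record is what lets a storage element be edited without touching the DIDs stored on it.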
Metadata and replication rules
All DIDs and RSEs can have extra metadata in the form of {key: value} pairs, e.g. {'tier': 0}.
A filter can be applied to the metadata to find a subset of DIDs or RSEs.
A rule specifies the number of replicas that must exist for a subset of DIDs on a subset of RSEs (both selected with the help of filters).
Rucio will try to satisfy the rule and create (or delete) replicas until all rules are satisfied.
Rucio will not allow contradictory rules.
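The interplay of metadata, filters, and rules can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not how Rucio implements rule evaluation: the rule layout (`did_filter`, `rse_filter`, `copies`), the helper names, and the sample data are all hypothetical, and only replica creation is modelled (deletion is left out).

```python
def matching(items, wanted):
    """Names of the items whose metadata contains every wanted key/value."""
    return sorted(
        name
        for name, meta in items.items()
        if all(meta.get(key) == value for key, value in wanted.items())
    )


def plan_transfers(rule, dids, rses, replicas):
    """List the (did, rse) replicas to create so that the rule is satisfied."""
    target_rses = matching(rses, rule["rse_filter"])
    plan = []
    for did in matching(dids, rule["did_filter"]):
        present = [rse for rse in target_rses if rse in replicas.get(did, set())]
        candidates = [rse for rse in target_rses if rse not in present]
        # Never negative: extra replicas are simply left alone in this sketch.
        missing = max(0, rule["copies"] - len(present))
        if missing > len(candidates):
            raise ValueError(f"not enough matching RSEs to satisfy rule for {did}")
        plan.extend((did, rse) for rse in candidates[:missing])
    return plan


# Hypothetical metadata for three storage elements and three files.
rses = {
    "SITE_A": {"tier": 0},
    "SITE_B": {"tier": 1},
    "SITE_C": {"tier": 1},
}
dids = {
    "user.alice:run1.dat": {"project": "demo"},
    "user.alice:run2.dat": {"project": "demo"},
    "user.bob:notes.txt": {"project": "other"},
}
# Current state: which storage elements already hold which file.
replicas = {
    "user.alice:run1.dat": {"SITE_B"},
    "user.alice:run2.dat": {"SITE_A"},
}

# "Keep two replicas of every 'demo' file on tier 1 storage."
rule = {"did_filter": {"project": "demo"}, "rse_filter": {"tier": 1}, "copies": 2}

for did, rse in plan_transfers(rule, dids, rses, replicas):
    print(f"create replica of {did} at {rse}")
```

In this example `run1.dat` only needs one more copy (at `SITE_C`), while `run2.dat` needs two, because its existing replica at `SITE_A` is not on a tier 1 storage element and so does not count towards the rule.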
Replication management
Rucio manages replicas with the help of a file transfer service. Usually this means that data is copied directly between RSEs where possible.
Rucio runs background daemons that help ensure data integrity: some automatically try to recover lost data, others test RSEs for data corruption, and there are many more.
This covers the most important aspects of Rucio. Some other notable features are:
Integration with DIRAC [3].
A fully fledged account and identity system, which can use personal X.509 certificates or decentralized OAuth solutions.
Setting up a Rucio instance
Rucio is fully open source and development happens on GitHub [7]. It can be installed in several ways: directly on a server, with Docker, or in a Kubernetes [9] cluster. Installing in Kubernetes is the recommended way, and Helm charts [8] are provided to make this as easy as possible. The process is also well covered in the documentation [10]. We therefore chose to follow the recommended practice and set up Rucio with Helm charts on Kubernetes.
For reference, there are a few existing open Rucio deployments that may be useful to copy from, or whose authors you can contact to see whether a collaboration is possible:
Pic.es has a setup where they can spin up individual Rucio instances for projects if needed [11]
CERN created the Virtual Research Environment, a fully fledged solution for data analysis with a notebook server and much more [12]
KM3NeT has a standalone deployment with the full configuration accessible, which is synchronized to their cluster with Flux [13]
Managing data
At its core, Rucio offers a REST API [14]. It is possible to write a client yourself to interact with this API. Luckily, Rucio also offers several wrappers for it: