Data Claims to ORCID: an EMBL-EBI Perspective

THOR Disciplinary Workshop Series, part II

The European Bioinformatics Institute (EMBL-EBI) is a centre for research and services in bioinformatics. It performs basic research in computational biology and offers an extensive user training programme, supporting researchers in academia and industry.

EMBL-EBI is part of EMBL, Europe’s flagship laboratory for the life sciences, and houses a diverse range of data repositories. All of these databases link to each other and to the scientific literature, providing a deeply integrated ecosystem that reflects the natural connectivity of the life sciences. Each repository uses its own identifier system, known in bioinformatics as accession numbers. Most of these predate the use of DOIs for data, in some cases by decades, and are instantly recognisable to the researchers who use them.

Existing Use of ORCID at EMBL-EBI

EMBL-EBI has been proactive in promoting ORCID iDs within the organisation. In 2013, we introduced the requirement that all staff register with ORCID and allow Europe PMC to update their ORCID records with their publications. The ORCID account quickly became the central place where staff tracked their publication record. In return, EMBL-EBI made sure that this information then flowed to EMBL-EBI staff pages automatically.

During the THOR project, the successful ORCID integration in Europe PMC was followed by the development of the EBI ORCID Hub, which enables EBI databases to add ORCID authentication and data claiming to ORCID to their user interfaces. The Hub has since provided a solid technical foundation for persuading those who manage the many diverse databases housed by EMBL-EBI to adopt ORCID iDs within their systems.
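To make the idea of a data claim concrete, the sketch below shows one way an association between an ORCID iD and a database accession could be represented and checked before it is stored. It is an illustration only: the `DataClaim` class, its field names and the placeholder accession are assumptions and do not describe the actual API of the EBI ORCID Hub. The checksum routine follows the ISO 7064 MOD 11-2 scheme that ORCID documents for the final character of an iD.

```python
from dataclasses import dataclass


def orcid_check_digit(base_digits: str) -> str:
    """Return the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the 0000-0000-0000-000X layout and the trailing check character."""
    parts = orcid.split("-")
    if len(parts) != 4 or any(len(p) != 4 for p in parts):
        return False
    compact = "".join(parts)
    base = compact[:15]
    if not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_digit(base) == compact[15]


@dataclass(frozen=True)
class DataClaim:
    """Hypothetical record linking one ORCID iD to one database accession."""
    orcid: str
    database: str
    accession: str

    def __post_init__(self):
        # Reject malformed iDs before the association is stored.
        if not is_valid_orcid(self.orcid):
            raise ValueError(f"not a valid ORCID iD: {self.orcid}")


# The iD is a sample used in ORCID's documentation; the accession is a placeholder.
claim = DataClaim("0000-0002-1825-0097", "MetaboLights", "MTBLS0000")
```

Validating the iD at construction time means a mistyped identifier is rejected before any association is recorded, rather than being discovered later when the claim is reported.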

The technical challenge of associating a public database record with an ORCID iD has centred mainly on user interaction. This, too, was simplified by the EBI ORCID Hub, which provides standardised claiming and reporting APIs, JavaScript libraries and a centralised infrastructure for storing associations between ORCID iDs and data records. As a result, various prototype services for data claiming to ORCID are now being developed across EMBL-EBI, such as the ‘Claim this study to ORCID’ button within the MetaboLights database:

Ongoing Engagement with EMBL-EBI Databases

We are now consulting with the data resources hosted at EMBL-EBI to find out more about their requirements for adopting ‘data claiming to ORCID’. Currently, this focuses more on technical assistance and problem resolution in the implementation of claiming functionality. But we still have a long way to go to convince EMBL-EBI resources and their users about the long-term value of linking data to ORCID iDs.

We need a critical mass of data claims to ORCID across multiple resources that are linked to other information, such as affiliations and funders. This semantic network of relationships could then be made available to both EMBL-EBI resources and external users for searching, along with visualisation interfaces that showcase the potential of such a network for funders, institutes and individual researchers to answer important questions about the research they produce or enable.

We are now developing a prototype retrospective batch claiming interface, which allows a given researcher to claim all their datasets identified in a search, such as in the image below, via one button click.

We plan to use EMBL-EBI’s BioStudies database as a test bed for trying out interesting search scenarios based on data collated during the THOR project. BioStudies holds descriptions of biological studies, links to data from these studies in other databases (be it within EMBL-EBI or externally), as well as data that do not fit in the structured archives at EMBL-EBI. The repository enables manuscript authors to submit supplementary information and link to it from the publication. It is this link to the publication that allows BioStudies to retrieve the funder information from Europe PMC, and in turn link it with the datasets associated with that publication.

Drawing on that connectivity, we plan to implement a faceted search by funder in BioStudies that will enable the user to find all datasets associated with a Europe PMC publication that has been tagged with that funder. Once a sufficiently large number of data claims to ORCID have accumulated, we plan to allow BioStudies users to search its data by ORCID iD, returning a results page faceted by funder.
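The chain of links described here (ORCID iD to claimed dataset, dataset to publication, publication to funder) can be sketched as a small in-memory join. The following is a minimal illustration under stated assumptions: the three dictionaries, the identifiers in them and the `facet_by_funder` function are all hypothetical, and the real BioStudies and Europe PMC services are not queried.

```python
from collections import defaultdict

# Hypothetical sample data; none of these identifiers refer to real records.
claims = {"0000-0002-1825-0097": ["S-EXMP1", "S-EXMP2", "S-EXMP3"]}  # ORCID iD -> claimed datasets
dataset_publications = {"S-EXMP1": ["PMC-A"], "S-EXMP2": ["PMC-A", "PMC-B"], "S-EXMP3": []}
publication_funders = {"PMC-A": ["Funder X"], "PMC-B": ["Funder X", "Funder Y"]}


def facet_by_funder(orcid):
    """Group the datasets claimed by one ORCID iD under each linked funder."""
    facets = defaultdict(set)
    for dataset in claims.get(orcid, []):
        for publication in dataset_publications.get(dataset, []):
            for funder in publication_funders.get(publication, []):
                facets[funder].add(dataset)
    # Sort for stable display; a set per funder avoids double counting a dataset.
    return {funder: sorted(datasets) for funder, datasets in facets.items()}


facets = facet_by_funder("0000-0002-1825-0097")
for funder, datasets in sorted(facets.items()):
    print(funder, len(datasets), datasets)
```

In this toy data the dataset with no linked publication appears under no funder at all, which illustrates why the funder facet depends on the dataset-to-publication link being present.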

Finally, with enough data-claims data available, we also plan to generate a number of Tableau visualisations offering insights into the correlations between researchers, datasets (and potentially metadata), funders and possibly timelines. Our hope is that the prototypes and visualisations will act as catalysts for our users to generate new analytics scenarios that will stimulate future PID-related development work at EMBL-EBI.


Outstanding Challenges – Managing Complexity…

To date, we have not attempted to solve a number of semantic challenges in the interpretation of data claims to ORCID. For example, does a data claim imply more than just a participative role in the generation of the data? Should specific roles be distinguished, such as the original researcher, the data submitter or the subsequent data curator? Some EMBL-EBI resources, such as ArrayExpress, accept experimental data submissions described by metadata that is curated and harmonised using ontology mapping tools, whereas submissions to other resources, such as Rfam, are themselves results of curation, making role distinctions even more complicated.

Furthermore, databases such as the European Nucleotide Archive (ENA), PRoteomics IDEntifications (PRIDE) and UniProt may contain secondary data (protein identifications, translated reads, genome references) that are derived from existing, finer-grained original data, such as sequencing reads or peptides, that may have been submitted previously and independently. A claim to ORCID may be made for the derived data, but can a claim to the original data legitimately be extended to the derived data? Conversely, the submitters of the original data may have a legitimate reason to question the ethics of extending a derived-data claim to the original data. In such cases, the difficulty of transferring claims from papers to data lies in recognising where those sometimes subtle and often database-specific timelines and inter-dependencies of data exist.
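One way to make such inter-dependencies visible is to record, for each derived record, which records it was derived from, and to walk that graph before a claim is extended. The sketch below is hypothetical throughout: the `derived_from` mapping, the record names, the role labels and the `upstream_claimants` function are assumptions for illustration and do not reflect how any EMBL-EBI resource models provenance.

```python
# Hypothetical provenance links: derived record -> records it was computed from.
derived_from = {
    "protein-ids-1": ["peptides-1", "peptides-2"],
    "peptides-2": ["spectra-1"],
}

# Hypothetical role-qualified claims: record -> list of (ORCID iD, role).
# The iDs are samples from ORCID's documentation, not real claimants.
claims = {
    "protein-ids-1": [("0000-0002-1825-0097", "curator")],
    "peptides-1": [("0000-0002-1694-233X", "submitter")],
    "spectra-1": [("0000-0002-1694-233X", "original researcher")],
}


def upstream_records(record):
    """Return every record that the given record depends on, directly or indirectly."""
    seen = set()
    stack = list(derived_from.get(record, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(derived_from.get(current, []))
    return seen


def upstream_claimants(record):
    """List claims on upstream records held by people who have not claimed this record."""
    own = {orcid for orcid, _ in claims.get(record, [])}
    found = []
    for upstream in sorted(upstream_records(record)):
        for orcid, role in claims.get(upstream, []):
            if orcid not in own:
                found.append((upstream, orcid, role))
    return found


# Claims that a curator might want to review before extending a claim either way.
review = upstream_claimants("protein-ids-1")
```

A list like `review` does not settle whether a claim should be extended; it only surfaces the original contributors and their roles so that the decision can be made knowingly.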

We are excited about the work at EMBL-EBI that will yield tangible results towards the end of THOR, but recognise that our prospecting for the rich insights made available by data claims to ORCID is only just beginning.