THOR at Digital Infrastructures for Research

In the last week of September 2016, several THOR partners headed to the city of churches, Krakow, to participate in the Digital Infrastructures for Research conference (DI4R). DI4R was an event organised by Europe’s leading e-infrastructures, EGI, EUDAT, GÉANT, OpenAIRE and the Research Data Alliance (RDA) Europe, in which researchers, developers and service providers brainstormed, discussed the adoption of digital infrastructure services and promoted user-driven innovation. Adam Farquhar (British Library), Josh Brown (ORCID), Robin Dasler (CERN) and I, Kristian Garza (DataCite), closed the first day of activities with a talk that emphasised that PIDs are a set of tools and systems to be integrated and promoted in infrastructures and services for researchers.

Our session was divided into short presentations that showcased how ORCID iDs and DataCite DOIs are integrated into research systems and connected with other platforms. After that, we presented CERN’s PID integration as a case study, showing how PIDs enabled linking, attribution, claiming and citation of contributors and datasets.

The session was followed by a discussion on nationwide ORCID use cases and the need to improve compliance in metadata capture for DOIs. Finally, the DI4R audience put a provocative series of questions to the THOR panel. For example:

    – “How should we deal with credit attribution of collections of datasets, when in some areas data collections are created by a contributor but each item in the collection has a different producer?”

    – “Do we need PIDs for machines and instrumentation?”

    – “What about PIDs for projects?”

Certainly, some of those questions need further thought and exploration by the THOR members and the community at large. Join us at PIDapalooza if you want to be part of this discussion.

Overall, the THOR session at DI4R highlighted the project’s work (specifically DataCite’s Event-Data and ORCID’s auto-update) and ended with a good discussion about future lines of work.

 

ORCID Integration Series: EMBL-EBI

In this third blog post we introduce the EBI ORCID Hub, which we developed at EMBL-EBI as part of the THOR project to integrate ORCID iDs into life science databases.

The European Bioinformatics Institute (EMBL-EBI) is a centre for research and services in bioinformatics, and is part of the European Molecular Biology Laboratory (EMBL). There are hundreds of life sciences resources serving the biomedical research community, and at the European Bioinformatics Institute a number of essential resources do not yet incorporate ORCID iDs in their workflows. To support the adoption of ORCID iDs in data repositories we envisioned a Hub which manages the programmatic communication with the ORCID registry, keeps track of relevant ORCID records and makes integration with ORCID as easy as possible. Furthermore, the creation of an ORCID Hub avoids duplication of integration efforts across many repositories.

[Figure: EBI ORCID Hub Overview]

As a first milestone, the Hub allows EBI databases to easily add ORCID authentication, e.g. on submission forms. Because ORCID records may already contain some of the necessary information, submission forms can be filled in automatically using this information. Over the last couple of months we have worked heavily on improving the EBI ORCID Hub, and supported our first adopters, MetaboLights and EMPIAR, as they integrated ORCID iDs in their workflows.

MetaboLights is a database for metabolomic data and derived information. It holds data from metabolic experiments, as well as metabolite structures, their roles, and other related metadata. The EBI ORCID Hub allows MetaboLights’ submitters to authenticate their login using their ORCID iD.

[Figure: MetaboLights registration form integrated with ORCID authentication]

Our second adopter is EMPIAR, a repository of electron microscopy images in structural biology. Like MetaboLights, they are using the ORCID Hub to identify their submitters by ORCID iD and to easily autofill their submission form.

[Figure: EMPIAR registration form integrated with ORCID]

We are now focussing on milestone 2: expanding the Hub functionality to push information to the ORCID registry. In practice, this means that data repositories will be able to let their users claim records to their ORCID profile. Following this, we would like to begin keeping track of ORCID records that were claimed through our Hub, and managing this information for the databases linked to it. The idea is to let databases know when their records are being claimed, or when claimed records are changed. For those among you who want to build something similar, and for all the curious developers, we have deposited the code on GitHub (https://github.com/thor-project/ebi).

ORCID Integration Series: PANGAEA

This is the first in a series of posts describing how THOR partners have recently integrated ORCID in their disciplinary data repositories. This post describes ORCID integration in PANGAEA, the Data Publisher for Earth & Environmental Science.

PANGAEA is rolling out a new version of its website. Developers and designers are currently ironing out a few remaining open issues. The release is expected for autumn 2016. Among major improvements in search, design, and usability, a key new feature is the integration of ORCID.

The new feature enables existing PANGAEA users to connect their PANGAEA profile with their ORCID iD, as demonstrated in the video below. 

With this connection, PANGAEA obtains the validated ORCID iD of its users from ORCID. By connecting their ORCID iD, users can also choose to sign in to PANGAEA using ORCID, as an alternative to signing in using PANGAEA user credentials. This can be handy when a user is already signed in to ORCID, or when ORCID credentials are quicker to recall.

Obtaining the validated ORCID iDs of its users is significant for PANGAEA as, contrary to a researcher’s name, the iD is unambiguous: two researchers with the same name can be distinguished by their respective iDs. The iD is also persistent through possible changes in a person’s name: the same researcher may change marital status, or their name may appear in different permutations, at times with the full name, with initials for the first name, and with or without a middle name (or initial). Furthermore, the iD is actionable and can be used to discover information about the researcher.

For researchers, the greatest advantage of connecting their ORCID iD to their PANGAEA profile is that PANGAEA can then record the relationships between dataset publication DOIs and contributor ORCID iDs. This information is then shared with the global network of PID infrastructures, and researchers benefit from automated updates to their ORCID Record for data published at PANGAEA, gaining unambiguous attribution for published datasets and benefiting from greater credit for sharing data early.

Let’s take a look at how the ORCID integration in PANGAEA is making a difference to Dr Alice Lefebvre, GLOMAR Associate Scientist at the MARUM Center for Marine Environmental Sciences of the University of Bremen.

Alice has recently joined ORCID and decided to claim the 14 data publications deposited at PANGAEA that she has authored. As a consequence, Alice gains a more complete ORCID Record, one that does not just include her journal article publications but also her authorship in data publications, a record that better reflects her true contribution to the scientific record. Alice was also surprised to learn about DataCite and the overview DataCite provides about her contributions.

The upcoming release of the PANGAEA website automates the sharing of information with the global network of PID infrastructures. Authors of datasets published at PANGAEA who have connected their ORCID iD, like Alice, will benefit from a workflow that ensures information appears automatically and accurately on their ORCID Record.

This shows how far the integration between disciplinary repositories and the global network of PID infrastructures has come over the past years. It also shows how the persistent identification of contributors and research artefacts, and the infrastructures that aggregate, process, and share information about persistently identified resources, are driving and shaping 21st-century attribution, credit, communication, and measurement of scholarly activity.

Want to Know More?
Readers interested in performing an ORCID integration in their own disciplinary repository can find more information in our recent report, ‘Demonstration of Services to Integrate ORCIDs into Data Records and Database Systems’.

ORCID Integration in Disciplinary Data Repositories

Researchers need to be linked to their data. Within THOR, we’ve been busy developing approaches to support the inclusion of ORCID iDs in disciplinary data repositories and data publication workflows.

The results are published in our latest report, ‘Demonstration of Services to Integrate ORCIDs into Data Records and Database Systems’ (10.5281/zenodo.58971), where you can read about the successful integration of ORCID in the databases and services of three THOR partners, each serving a distinct discipline: PANGAEA for Earth and Environmental Sciences, EMBL-EBI for Life Sciences, and CERN for High-Energy Physics.

These integrations were applied to live and operational production systems. This means that researchers in these disciplines are already benefiting from automated persistent identifier linking and linkage-information sharing within the global network of persistent identifier infrastructures.

The report describes the common experiences and challenges as well as the specific concerns each institution faced. These case studies can therefore serve as models for other institutions looking at integrating ORCID in their own systems and workflows.

As a companion to the report, over the next month PANGAEA, CERN, and EMBL-EBI will contribute to a series of posts on the THOR blog that summarise their recent advancements with ORCID integration. We will demonstrate the benefits of ORCID integration, and offer a practical guide to performing your own integrations. 

If you have any questions, please email info@project-thor.eu for more information.

Contributor Information in DataCite Metadata

The Force11 Joint Declaration of Data Citation Principles highlights the importance of giving scholarly credit to all contributors:

Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.

The EC-funded THOR project, in which DataCite is involved, addresses these issues, and I have summarized the findings of one of our first reports in a previous blog post. One of the problems identified in the report was the use of a single entry field for personal names, as done by DataCite and many other scholarly services. We need separate fields for family and given names; the most important reason is to allow proper formatting of a data citation (different citation styles have different rules about author name formatting). As a small first step I have implemented proper personal name parsing, using the Namae tool, in DataCite Labs Search and the upcoming DataCite Labs claim store. One of the next places we can implement this is the DOI content negotiation service, where we currently provide personal names as literal strings even when using an output format that supports family and given names (http://data.datacite.org/application/citeproc+json/10.6084/M9.FIGSHARE.791569):

{
  "type": "dataset",
  "DOI": "10.6084/M9.FIGSHARE.791569",
  "URL": "http://dx.doi.org/10.6084/M9.FIGSHARE.791569",
  "title": "rOpenSci - a collaborative effort to develop R-based tools for facilitating Open Science",
  "publisher": "Figshare",
  "issued": {
    "raw": "2013"
  },
  "author": [{
    "literal": "Scott Chamberlain"
  }, {
    "literal": "Edmund Hart"
  }, {
    "literal": "Karthik Ram"
  }, {
    "literal": "Carl Boettiger"
  }]
}
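
To illustrate why separate family and given names matter for citation formatting, here is a minimal Python sketch that splits the literal strings from the response above and formats them in two ways. The splitting rule (the last whitespace-separated token is the family name) and both style labels are assumptions made for this example; this is not how Namae works, and it will fail on names with particles or suffixes.

```python
def split_name(literal):
    """Naively split 'Given Family' (assumption: last token is the family name)."""
    parts = literal.strip().split()
    if len(parts) < 2:
        # Single-token names (or organisations) are kept as a literal string.
        return {"literal": literal.strip()}
    return {"family": parts[-1], "given": " ".join(parts[:-1])}


def format_name(name, style="family-first"):
    """Format a parsed name; the style labels are hypothetical."""
    if "literal" in name:
        return name["literal"]
    if style == "family-first":
        # Reduce each given name to its initial, e.g. "Diego P." -> "D. P."
        initials = " ".join(p[0] + "." for p in name["given"].split())
        return f"{name['family']}, {initials}"
    return f"{name['given']} {name['family']}"


authors = [{"literal": "Scott Chamberlain"}, {"literal": "Edmund Hart"}]
parsed = [split_name(a["literal"]) for a in authors]
print([format_name(n) for n in parsed])
print([format_name(n, style="given-first") for n in parsed])
```

With only a literal string, the family-first form cannot be produced reliably, which is the practical argument for storing the two parts separately.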

To correctly identify contributors we have to use unique identifiers rather than personal names. The Data Citation Principles highlight the importance of unique identifiers for data, and I had suggested in an early draft of the principles to also mention the importance of unique identifiers for contributors.

ORCID identifiers are by far the most widely used identifiers in DataCite metadata – they can be found in the metadata of about 208,000 DOI names (other identifiers such as ISNI are also supported). In addition there are self-claims of DataCite DOI names in the ORCID registry (e.g. generated via the DataCite Search & Link Service that is part of Labs Search), the exact number of which we currently don’t know. DataCite is working with ORCID on a frictionless exchange of these DataCite/ORCID links in both directions.
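
One practical property of ORCID identifiers is that the last character is a check digit (ISO 7064 MOD 11-2), so a service receiving them in metadata can catch many typos before storing a link. The following Python sketch shows such a check; the function names are made up for this example, and it only verifies the shape and check digit, not whether the iD is actually registered.

```python
import re


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(value):
    """Accept a bare iD or an orcid.org URL and verify shape and check digit."""
    # Keep only the last path segment, then drop the hyphens.
    bare = value.rstrip("/").rsplit("/", 1)[-1].replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", bare):
        return False
    return orcid_check_digit(bare[:15]) == bare[15]


print(is_valid_orcid("http://orcid.org/0000-0003-1444-9135"))
```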

But how are these DataCite/ORCID links shared with other services? A good starting point is the DataCite Search API. We can include creator and nameIdentifier in the results, but unfortunately these two fields are not linked together. Until we update the Solr schema for the Search API we therefore have to use the xml field that includes all metadata, and parse out the creator names and associated identifiers. We have recently implemented this in Labs Search, turning names with associated ORCID identifiers into clickable links that return a list of all DataCite DOI names associated with that person (http://search.labs.datacite.org/?q=10.6084%2FM9.FIGSHARE.791569).
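
The parsing step described above can be sketched in Python as follows. This is not the Labs Search implementation; it assumes the xml field has already been decoded into a text string, and the sample record is a trimmed, hypothetical one written for this example. The point is that walking each creator element keeps the name and its identifier together, which the two flat search fields cannot do.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed DataCite metadata record used only for this example.
SAMPLE_XML = """<resource xmlns="http://datacite.org/schema/kernel-3">
  <creators>
    <creator>
      <creatorName>Chamberlain, Scott</creatorName>
      <nameIdentifier schemeURI="http://orcid.org/" nameIdentifierScheme="ORCID">0000-0003-1444-9135</nameIdentifier>
    </creator>
    <creator>
      <creatorName>Hart, Edmund</creatorName>
    </creator>
  </creators>
</resource>"""


def local(tag):
    """Strip the XML namespace so the code does not depend on a schema version."""
    return tag.rsplit("}", 1)[-1]


def creators_with_identifiers(xml_text):
    """Return one dict per creator, keeping each name linked to its identifier."""
    root = ET.fromstring(xml_text)
    results = []
    for elem in root.iter():
        if local(elem.tag) != "creator":
            continue
        entry = {"name": None, "identifier": None, "scheme": None}
        for child in elem:
            if local(child.tag) == "creatorName":
                entry["name"] = (child.text or "").strip()
            elif local(child.tag) == "nameIdentifier":
                entry["identifier"] = (child.text or "").strip()
                entry["scheme"] = child.get("nameIdentifierScheme")
        results.append(entry)
    return results


for creator in creators_with_identifiers(SAMPLE_XML):
    print(creator)
```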

Labs Search also provides a Cite button that formats the metadata according to common citation styles such as APA, or in common exchange formats such as BibTeX. These formats unfortunately don’t support ORCID identifiers (nothing has changed since I wrote about this in 2011), so the DataCite/ORCID links are lost when using these formats.

Citeproc JSON is a modern alternative to BibTeX, RIS and similar exchange formats, and is used as the machine-readable representation to format references in the reference managers Zotero, Mendeley, Papers (and others) using Citation Style Language. Although Citeproc JSON doesn’t support ORCID identifiers, it is much easier to extend than, for example, BibTeX, where adding ORCID identifiers without breaking the format is difficult to impossible. Last week I implemented this modified Citeproc JSON in a new DataCite service I am working on (e.g. using the example from above: http://cls.labs.datacite.org/api/works/10.6084/M9.FIGSHARE.791569):

"author": [{
      "family": "Chamberlain",
      "given": "Scott",
      "ORCID": "http://orcid.org/0000-0003-1444-9135"
    }, {
      "family": "Hart",
      "given": "Edmund"
    }, {
      "family": "Ram",
      "given": "Karthik",
      "ORCID": "http://orcid.org/0000-0002-0233-1757"
    }, {
      "family": "Boettiger",
      "given": "Carl",
      "ORCID": "http://orcid.org/0000-0002-1642-628X"
    }]

DataCite is not the first DOI registration agency to implement this; CrossRef has been doing the same for some time in their REST API, e.g. for http://api.crossref.org/works/10.1111/1365-2745.12293:

"author": [{
  "affiliation": [{
    "name": "Department of Biological Sciences; Simon Fraser University; Burnaby BC Canada"
  }],
  "family": "Chamberlain",
  "given": "Scott",
  "ORCID": "http://orcid.org/0000-0003-1444-9135"
}, {
  "affiliation": [{
    "name": "CONICET; Instituto Argentino de Investigaciones de las Zonas Aridas; Mendoza Argentina"
  }, {
    "name": "Instituto de Ciencias Básicas; Universidad Nacional de Cuyo; Mendoza Argentina"
  }],
  "family": "Vázquez",
  "given": "Diego P."
}, {
  "affiliation": [{
    "name": "School of Biology; University of Leeds; Leeds UK"
  }, {
    "name": "Naturalis Biodiversity Center; PoBox 9517 Leiden 2300RA The Netherlands"
  }],
  "family": "Carvalheiro",
  "given": "Luisa"
}, {
  "affiliation": [{
    "name": "Department of Biological Sciences; Simon Fraser University; Burnaby BC Canada"
  }],
  "family": "Elle",
  "given": "Elizabeth"
}, {
  "affiliation": [{
    "name": "Biology Department; University of Calgary; Calgary AB Canada"
  }],
  "family": "Vamosi",
  "given": "Jana C."
}]

You can see one difference: CrossRef also provides the affiliation, as a list of text fields. DataCite metadata also contain an affiliation field, but it is a plain text string. Ideally DataCite should also support unique identifiers for the affiliation, as we already do for HostingInstitution, which can have a nameIdentifier and nameIdentifierScheme.

Funding information is similar to affiliation in that it is something not related to the dataset itself, but to one or more contributors. We could therefore encode funding information in the same way as affiliation, as a funding field for each author. The big advantage would be that DataCite and ORCID would have consistent funding information, rather than DataCite listing funding for works and ORCID listing funding for people, with no direct connection between the two.
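
As a rough sketch of how per-author funding could still yield the work-level view, the Python below collects the distinct funding entries attached to individual authors. The field names (funding, funder-identifier, award-number) are assumptions taken from the illustrative JSON in this post, not an existing DataCite or ORCID schema, and the second author’s grant is made up for the example.

```python
def work_level_funding(authors):
    """Collect the distinct funding entries attached to individual authors."""
    seen = set()
    merged = []
    for author in authors:
        for grant in author.get("funding", []):
            # Assumption: funder identifier plus award number identifies a grant.
            key = (grant.get("funder-identifier"), grant.get("award-number"))
            if key not in seen:
                seen.add(key)
                merged.append(grant)
    return merged


sloan = {
    "funder-name": "Alfred P. Sloan Foundation",
    "funder-identifier": "http://doi.org/10.13039/100000879",
    "award-number": "555-1212",
}
authors = [
    {"family": "Chamberlain", "given": "Scott", "funding": [sloan]},
    {"family": "Ram", "given": "Karthik", "funding": [dict(sloan)]},  # made up
    {"family": "Hart", "given": "Edmund"},
]
print(work_level_funding(authors))
```

Because the funding sits on the person, the same records can answer both questions: which grants supported this work, and which works a person produced under a given grant.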

Lastly, we can use Citeproc JSON to describe the contributor role of the author. DataCite distinguishes between creator – the main researchers involved in producing the data, or the authors of the publication, in priority order – and contributor for other contributions, with a controlled vocabulary for contributorType. The THOR report mentioned above goes into detail on the different contributor role vocabularies used by DataCite and ORCID (there is little overlap), and also describes Project CRediT, a community initiative to harmonize contributor roles across stakeholders, standardizing on 14 common roles. CRediT is closely linked to contributorship badges, a project started by the Mozilla Science Lab, with an example journal article using the CRediT roles and badges here:

Taking all the above together, the JSON to describe all this information could look similar to the following (some of the data are made up):

"author": [{
      "affiliation": [{
        "name": "Department of Biological Sciences; Simon Fraser University; Burnaby BC Canada",
        "ISNI": "0000-0004-1936-7494"
      }],
      "funding": [{
        "funder-name": "Alfred P. Sloan Foundation",
        "funder-identifier": "http://doi.org/10.13039/100000879",
        "award-number": "555-1212",
        "award-uri": "http://www.sloan.org/awards/555-1212"
      }],
      "family": "Chamberlain",
      "given": "Scott",
      "ORCID": "http://orcid.org/0000-0003-1444-9135",
      "CRediT": ["conceptualization", "writing_initial", "writing_review"]
    }, {
      "family": "Hart",
      "given": "Edmund"
    }, {
      "family": "Ram",
      "given": "Karthik",
      "ORCID": "http://orcid.org/0000-0002-0233-1757"
    }, {
      "family": "Boettiger",
      "given": "Carl",
      "ORCID": "http://orcid.org/0000-0002-1642-628X"
    }]

The above obviously contains a lot more information than the original Citeproc JSON. And even though affiliation, funding and CRediT are optional fields, this goes beyond the scope of Citeproc JSON, which is used to format references rather than as a generic bibliographic exchange format. We should therefore give this JSON a different name, and I propose Crosscite JSON, a common JSON format to describe scholarly works used by the DOI registration agencies CrossRef and DataCite. One particular challenge will be to strike the right balance between sharing important information easily and keeping the JSON simple, without moving too far away from Citeproc JSON, which after all is already implemented in a lot of tools and workflows. While the above JSON example looks a bit scary at first, it provides the level of detail asked for by institutions and funders, and – in contrast to the Data Citation Principles – uses a single mechanism of attribution applicable to all scholarly works, including data.
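
One way to stay close to Citeproc JSON is to treat the extra fields as strictly additive, so that a tool which only formats references can drop them. The Python sketch below does exactly that. The set of name keys kept is an assumption for this example and may not cover everything a given Citeproc processor understands.

```python
# Name keys assumed to be understood by Citeproc JSON processors.
CITEPROC_NAME_KEYS = {
    "family", "given", "literal", "suffix",
    "dropping-particle", "non-dropping-particle",
}


def to_citeproc_authors(authors):
    """Drop extension fields (ORCID, affiliation, funding, CRediT) from each author."""
    return [
        {key: value for key, value in author.items() if key in CITEPROC_NAME_KEYS}
        for author in authors
    ]


extended = [
    {
        "family": "Chamberlain",
        "given": "Scott",
        "ORCID": "http://orcid.org/0000-0003-1444-9135",
        "CRediT": ["conceptualization", "writing_initial", "writing_review"],
    },
    {"family": "Hart", "given": "Edmund"},
]
print(to_citeproc_authors(extended))
```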

The next step for open science: a state-of-the-art identifier network

The THOR project has officially launched!

THOR (Technical and Human infrastructure for Open Research) will build on the services provided by ORCID and DataCite to ensure that every researcher, at any phase of their career and at any institution, will have seamless and free access to Persistent Identifiers (PIDs) for their research artefacts, and that their work will be uniquely attributed to them. THOR represents a European-led solution to a global problem. The services THOR creates will be open to all.

Over its 30-month project term, the THOR consortium will deliver sustainable, accessible PID-based services and enhanced community expertise to provide every researcher in Europe and around the world with a state-of-the-art, federated PID infrastructure. It will work with established platforms and disciplinary communities to ensure that researchers benefit from the added value that PIDs can bring to existing infrastructure. Innovative new services will be added to this toolkit.

Project co-ordinator, Dr Adam Farquhar, said “The THOR project brings together leading providers in persistent identifier services, world-leading research institutes, and major players in data and publishing. Our ambition is to establish seamless integration among articles, data, and researchers. We believe that this will stimulate a new service ecosystem that will transform the research landscape and support the European Commission’s goal of making every researcher digital.”

“THOR aims to build, test, and implement tools to enable seamless interaction of digital persistent identifier systems for articles, data, and researchers. Our goal in this project is to bring all of the stakeholders together to launch tools and services that support and streamline data citation, across disciplines and research sectors. We are excited to be part of the project team,” said ORCID EU Executive Director, Dr Laurel Haak.