Assessing the PID Landscape: Where is THOR in Context?

Part of knowing how well THOR is doing is knowing how our work fits into the overall context of persistent identifiers (PIDs) at large. This is why we began the project with an eye toward sustainability and also why we developed the metrics dashboard in the early days of the project. (That report is on Zenodo, if you’d like to read it again.)

Now that THOR has celebrated its first birthday, it’s time to pause and see what the PID landscape looks like now compared to when we first started. Assessing these changes now will help THOR tweak our roadmap for the future, making sure we stay on track for the remainder of the project. All of these assessment and evaluation efforts will eventually turn into a formal report at the end of the project, but we know how hard it is to wait. To tide you over, we’ve released a white paper based on our internal midtrack assessment.

Your feedback, questions, and comments are always welcome at info@project-thor.eu.

Persistent IDs and Theses: ETD2016, Lille

The International Symposium on Electronic Theses and Dissertations (ETD) is an annual get-together exploring all things thesis and PhD research. I was there to present a poster on how THOR is developing improved support for identifiers in the British Library’s thesis service, EThOS.

We have previously engaged with UK universities to see how they are already applying identifiers to their theses and data, and what could be done to move that along; we then added support in the EThOS metadata for author identifiers and thesis DOIs. Now as part of THOR, we are planning to push that further by facilitating the development work necessary to enable users to claim theses in EThOS in their ORCID record. This will also enable EThOS to look at completing the round trip, pulling ORCIDs from those claims into EThOS.


Currently, anyone with a thesis in EThOS can only add it to their ORCID record manually. This process is prone to errors. Enabling a claim button on EThOS records will make it quicker and easier for researchers to add their thesis to ORCID. If we can then retrieve claims information from ORCID, we can add links for users to find more works by that author.

There is still one link in this process that is slow to appear in the UK: persistent identifiers for the thesis itself. Many have Handle identifiers from their host repository, but we want to encourage further use of persistent identifiers for theses to make them more easily discoverable and accessible, especially where they are being cited. Having a link for the thesis and its data will also help to maintain the link between the two. We hope this will encourage students to think of their data as a separate, valuable output from their years of hard work, and implant the seed of good data management and sharing right at the start of their careers. So as well as the technical work to develop EThOS, we are working with universities to encourage them to apply persistent identifiers to their theses – and the data from the thesis.

ETD was a great venue to talk to the other repository managers who were interested in applying this work within their own repositories, and a welcome opportunity to answer their questions about the advantages to their institutions – and their students – of our planned approach.

A couple of recurring questions arose:

1. I have Handles in my repository for items already. Will they do?
Technically, yes. Having Handles on your theses and related data will certainly enable you to take advantage of consistent linking and citation of the theses. But we do see two additional advantages in the use of DOIs: 1) recognition by researchers; and 2) the additional governance of DOIs, which provides a safety net in terms of long-term persistence.

2. When should our students get ORCIDs? How can we encourage them?
Your students should make sure they have an ORCID as soon as they are ready to publish their first output, whether that be a paper, a dataset, a poster or a conference proceeding. The first thing you can do to encourage them is to practice what you preach: demonstrate how you, as repository staff, can bring together your own publications and outputs, and the advantages it has for you!

The poster, which outlines our aims, challenges and potential solutions, can be found online at: https://zenodo.org/record/61176#.V8V8bvkrJpg.

PIDapalooza, the festival of persistent identifiers, is coming soon!

This blog post by Laura Rueda has been cross-posted from the DataCite Blog.

Passionate as we are about persistent identifiers, we are delighted to invite you to PIDapalooza, the festival of PIDs this November in Reykjavik. Together with colleagues from Crossref and the California Digital Library, THOR partners DataCite and ORCID have envisioned this community gathering for everyone who’s working with identifiers: digital tech experts, publishers, researchers, tool builders, organisations, infrastructure providers… and you!


The program will include a mixture of PID demos, workshops, brainstorming, updates on the state of the art, and more – and we invite your contributions. Working together we can catalyze the development of innovative tools, services and community actions.

Come share your ideas with a crowd of like-minded innovators! Send your session proposal, in a very lightweight format, by September 18. The festival lineup will be announced the first week of October.

Registration is already open. Sponsorship offers are welcome; please get in touch if you want to support the initiative.


Where: Radisson Blu Saga Hotel Reykjavik, Hagatorg, 107 Reykjavik, Iceland

When: 9th and 10th November 2016

See you in Reykjavik!

 

ORCID Integration Series: EMBL-EBI

In this third blog post we introduce you to the EBI ORCID Hub we developed as part of the THOR project at EMBL-EBI, to integrate ORCID iDs into life science databases.

The European Bioinformatics Institute (EMBL-EBI) is a centre for research and services in bioinformatics, and is part of the European Molecular Biology Laboratory (EMBL). There are hundreds of life sciences resources serving the biomedical research community, and at the European Bioinformatics Institute a number of essential resources do not yet incorporate ORCID iDs in their workflows. To support the adoption of ORCID iDs in data repositories, we envisioned a Hub which manages the programmatic communication with the ORCID registry, keeps track of relevant ORCID records and makes integration with ORCID as easy as possible. Furthermore, a single ORCID Hub spares each repository from duplicating the same integration effort.

EBI ORCID Hub Overview

As a first milestone, the Hub allows EBI databases to easily add ORCID authentication, e.g. on submission forms. Because ORCID records may already contain some of the information that is necessary, submission forms can be automatically filled in using this information. In the last couple of months we worked heavily on improving the EBI ORCID Hub, and supported our first adopters MetaboLights and EMPIAR as they integrated ORCID iDs in their workflows.

MetaboLights is a database for metabolomic data and derived information. It holds data from metabolic experiments, as well as metabolite structures, their roles, and other related metadata. The EBI ORCID Hub allows MetaboLights’ submitters to authenticate their login using their ORCID iD.

MetaboLights registration form integrated with ORCID authentication

Our second adopter is EMPIAR, a repository of electron microscopy images in structural biology. Like MetaboLights, they are using the ORCID Hub to identify their submitters by ORCID iD and to easily autofill their submission form.

EMPIAR registration form integrated with ORCID

We are now focussing on milestone 2: expanding the Hub functionality to push information to the ORCID registry. In practice, this means that data repositories will be able to let their users claim records to their ORCID profile. Following this, we would like to begin keeping track of ORCID records that were claimed through our Hub, and managing this information for the databases linked to it. The idea is to let databases know when their records are being claimed, or when claimed records are changed. For those among you who want to build something similar, and for all the curious developers, we have deposited the code on GitHub (https://github.com/thor-project/ebi).

ORCID Integration Series: CERN

CERN is a hub for all things High-Energy Physics (or HEP for short). Nearly all researchers in the HEP field make CERN their home for all or part of their research careers. Most of these researchers maintain separate university affiliations as well, making the CERN research community a distributed decentralized global network. When we’re designing information services, we have to consider this global family and devise ways for them to keep track of all their research, all in one place, automatically. Fortunately for us, we can take advantage of third party services developed by our partners in THOR in order to add needed functionality in a way that’s consistent, reliable, and shares our Open Science values.

Inspire, the primary database for HEP literature, provides a number of ways for researchers at CERN and abroad to stay on top of what’s happening in their field. Inspire is a literature aggregator, meaning that it harvests metadata from a suite of HEP-relevant journals that users can then search for pertinent literature. This metadata then feeds other services, such as HEPData, the repository for supplementary publication data in HEP, and allows us to automatically generate author profiles. Handling much of this information automatically is a great benefit for our users, and it makes Inspire a rich source for information specific to research in HEP. But this usefulness naturally doesn’t extend to other systems or disciplines. Tapping into the ORCID iD system will let our users be identified in a variety of scholarly systems and will help them link their HEP work to any other area of their research life.

In the Inspire author profiles, we already had a homegrown system for pushing and pulling works information to and from ORCID. For those authors who have associated an ORCID iD with their profile (a process that formerly required manual entry and manual verification), we are able to append works information from Inspire to their ORCID record, and we are able to pull works information from their ORCID record to display on the External works tab in their Inspire profile. We have now extended this functionality with the ability to authenticate through ORCID for other Inspire functions. This authentication is in place for Inspire’s literature and author suggestion functions and for correction of authors. Further modification of Inspire data via ORCID authentication will be rolled out with the new release of Inspire slated for later this year.

This additional functionality is an extension of Inspire’s upgrade to an all-new version of its underlying Invenio platform. The completely overhauled Invenio 3 includes a module for ORCID authentication, making Inspire’s integration painless. And since Invenio is underneath all of CERN’s scientific information systems (Inspire, HEPData, and Zenodo), this means we’re one step closer to an interoperable platform for researcher outputs.

We’ve also implemented ORCID authentication in HEPData. HEPData gathers its bibliographic metadata from Inspire, and Inspire pulls information on data related to publications from HEPData and displays it in the relevant author’s profile. There is already a direct connection to Inspire, so logging in with ORCID isn’t necessary to make this author-publication-data triangle possible. However, users now have the option of logging in with ORCID to access HEPData’s review and submission functions, providing a third party authentication choice that’s compatible with other scholarly systems.

At CERN, we were able to implement ORCID authentication straight out of the box, making it a simple and practical choice to offer our users for unifying and managing their scholarly identification needs.

ORCID Integration Series: PANGAEA

This is the first in a series of posts describing how THOR partners have recently integrated ORCID in their disciplinary data repositories. This post describes ORCID integration in PANGAEA, the Data Publisher for Earth & Environmental Science.

PANGAEA is rolling out a new version of its website. Developers and designers are currently ironing out a few remaining open issues. The release is expected for autumn 2016. Among major improvements in search, design, and usability, a key new feature is the integration of ORCID.

The new feature enables existing PANGAEA users to connect their PANGAEA profile with their ORCID iD, as demonstrated in the video below. 

With this connection, PANGAEA obtains the validated ORCID iD of its users from ORCID. By connecting their ORCID iD, users can also choose to sign in to PANGAEA using ORCID, as an alternative to signing in using PANGAEA user credentials. This can be handy when a user is already signed in to ORCID, or it is quicker to recall ORCID credentials.

Obtaining the validated ORCID iDs of its users is significant for PANGAEA as, contrary to a researcher’s name, the iD is unambiguous: two researchers with the same name can be distinguished by their respective iDs. The iD is also persistent through possible changes in a person’s name: the same researcher may change marital status, or their name may appear in different permutations, at times appear with full name, initials for first name, and with or without middle name (initial). Furthermore, the iD is actionable and can be used to discover information about the researcher.
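
To make the idea of an unambiguous, actionable iD a little more concrete, here is a minimal Python sketch of how a repository might sanity-check an ORCID iD and turn it into a resolvable link. It is an illustration only, not PANGAEA's implementation: the function names are our own, and the check character calculation assumes the ISO 7064 MOD 11-2 scheme that ORCID documents for its iDs. The sample iD belongs to the fictitious test researcher used in ORCID's documentation.

```python
import re

# Four groups of four characters; the very last character may be "X".
ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_character(base_digits):
    """Return the check character for the first 15 digits of an iD.

    Assumes the ISO 7064 MOD 11-2 calculation described by ORCID.
    """
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the format and the check character of a hyphenated iD."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_character(compact[:15]) == compact[15]


def orcid_uri(orcid):
    """Turn a well-formed iD into a link that resolves to the public record."""
    if not is_valid_orcid(orcid):
        raise ValueError(f"not a well-formed ORCID iD: {orcid!r}")
    return "https://orcid.org/" + orcid


if __name__ == "__main__":
    print(orcid_uri("0000-0002-1825-0097"))
```

A check like this only catches typos. It says nothing about whether the iD really belongs to the person in front of you, which is why obtaining the iD through authenticated sign-in, as described above, matters.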

For researchers, the greatest advantage of connecting their ORCID iD to their PANGAEA profile is that PANGAEA can then record the relationships between dataset publication DOIs and contributor ORCID iDs. This information is then shared with the global network of PID infrastructures, and researchers benefit from automated updates to their ORCID Record for data published at PANGAEA, gaining unambiguous attribution for published datasets and benefiting from greater credit for sharing data early.

Let’s take a look at how the ORCID integration in PANGAEA is making a difference to Dr Alice Lefebvre, GLOMAR Associate Scientist at the MARUM Center for Marine Environmental Sciences of the University of Bremen.

Alice has recently joined ORCID and decided to claim the 14 data publications deposited at PANGAEA that she has authored. As a consequence, Alice gains a more complete ORCID Record, one that does not just include her journal article publications but also her authorship in data publications: a record that better reflects her true contribution to the scientific record. Alice was also surprised to learn about DataCite and the overview DataCite provides about her contributions.

The upcoming release of the PANGAEA website automates the sharing of information with the global network of PID infrastructures. Authors of datasets published at PANGAEA who have connected their ORCID iD, like Alice, will benefit from a workflow that ensures information appears automatically and accurately on their ORCID Record.

This shows how far the integration between disciplinary repositories and the global network of PID infrastructures has come over the past years, and how the persistent identification of contributors and research artefacts together with infrastructures that aggregate, process, and share information about persistently identified resources are driving and shaping 21st-century attribution, credit, communication, and measurement of scholarly activity.

Want to Know More?
Readers interested in performing an ORCID integration in their own disciplinary repository can find more information in our recent report, ‘Demonstration of Services to Integrate ORCIDs into Data Records and Database Systems’.

ORCID Integration in Disciplinary Data Repositories

Researchers need to be linked to their data. Within THOR, we’ve been busy developing approaches to support the inclusion of ORCID iDs in disciplinary data repositories and data publication workflows.

The results are published in our latest report, ‘Demonstration of Services to Integrate ORCIDs into Data Records and Database Systems’ (10.5281/zenodo.58971), where you can read about the successful integration of ORCID in the databases and services of three THOR partners, each serving a distinct discipline: PANGAEA for Earth and Environmental Sciences, EMBL-EBI for Life Sciences, and CERN for High-Energy Physics.

These integrations were applied to live and operational production systems. This means that researchers in these disciplines are already benefiting from automated persistent identifier linking and linkage-information sharing within the global network of persistent identifier infrastructures.

The report describes the common experiences and challenges as well as the specific concerns each institution faced. These case studies can therefore serve as models for other institutions looking at integrating ORCID in their own systems and workflows.

As a companion to the report, over the next month PANGAEA, CERN, and EMBL-EBI will contribute to a series of posts on the THOR blog that summarise their recent advancements with ORCID integration. We will demonstrate the benefits of ORCID integration, and offer a practical guide to performing your own integrations. 

If you have any questions, please email info@project-thor.eu for more information.

Year 1 in Review

It has been a year since THOR launched in June 2015, which makes this a natural time to take stock of what the project has achieved and some of the ways that our understanding has matured.

In the THOR vision, persistent identifiers are the default. They are the new normal. And they are interlinked and embedded in the services that researchers use every day. They help researchers to get clear unambiguous credit for the full range of their work – articles, data, software, and more. They enable data centres, universities and funders to track the impact of the research that they enable. They enable publishers to fully incorporate data into scholarly communications. They support a new research infrastructure.

Taken together, this means better evidence-based research and credit where it is due.

The THOR partners are working to make this vision a reality. We’ve made healthy progress. The THOR Dashboard helps to track activity in the persistent identifier space. If you visit it, you can see the month-to-month progress from years of data.


The dashboard currently tracks the activity of THOR partners DataCite and ORCID. DataCite is the leading provider of persistent identifiers for data. It assigns DOIs through more than 700 data centres around the world, and that number is growing. ORCID is the leading provider of persistent identifiers for researchers.

The graph shows continued strong growth, with over six million DataCite DOIs assigned to data and other research artefacts, and over two million ORCID iDs for individual researchers.

Over the past year, our research efforts have focused on better understanding how persistent identifiers can be more interoperable and better interlinked. We’ve published a report on how to overcome barriers between PID platforms for contributors, artefacts and organisations. We’ve also produced a report on persistent identifier linking in scholarly e-Infrastructure that extends the thinking about PIDs to cover institutions and funding information. This hard thinking is now resulting in new services to build up links between PIDs and exchange information about them.

We know that we can only achieve our vision by changing the way that the systems work – the ones used by researchers every day. In THOR’s first year, we’ve integrated ORCIDs into essential production services in life sciences, high energy physics, and earth and environmental sciences. This means that they can automatically link deposited datasets with a unique and persistent identifier for their contributors.

Through the website, social media, events and webinars we’ve shared and learned from you about how persistent identifier services can make a difference in research. We’ve talked to and heard from many thousands of people – researchers, data managers, administrators, funders, journal editors and publishers, and more. This has enriched our understanding and, we hope, will result in better services, more robust infrastructure, and more rapid adoption. Many of the events are recorded and are available on the THOR YouTube channel.

If you are passionate about the possibilities that persistent identifiers present for research, you may want to become an Ambassador. Ambassadors work together and with THOR partners to encourage wider understanding and adoption of persistent identifiers. To learn more about getting started with adopting and using persistent identifiers, you can also visit the Knowledge Hub.

THOR stands for technical and human infrastructure for open research. As you can see, we’ve been working hard throughout the first year to make a difference from both perspectives: new understanding, services, integrations; more listening, talking, and sharing what we learn.

We plan to be blogging over the coming weeks to share more about new persistent identifier services and integrations in production services.

Dynamic Data Citation Webinar

This blog post by Martin Fenner has been cross-posted from the DataCite blog.

On July 12, 2016, DataCite invited Andreas Rauber to present the recommendations for dynamic data citation of the RDA Data Citation Working Group in a webinar.


Andreas is one of the co-chairs of the RDA working group, and he gave a thorough overview of the recommendations and the thinking that went into them. The final recommendations have been available since last fall, and the current focus of the working group is to help with implementations.

The recommendations have to be implemented in the data center, but DataCite is happy to help coordinate the work, and to provide feedback to Andreas and the rest of the working group where needed. Of particular importance from a DataCite perspective is recommendation 8:

Query PID: Assign a new PID to the query if either the query is new or if the result set returned from an earlier identical query is different due to changes in the data. Otherwise, return the existing PID.
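
The decision rule in this recommendation can be sketched in a few lines of Python. This is a toy illustration under our own assumptions, not the working group's reference implementation: the class name, the `example-pid` prefix, the in-memory storage, the whitespace-and-case query normalisation and the SHA-256 checksum over the result rows are all hypothetical choices made for the example.

```python
import hashlib
import itertools


class QueryPidStore:
    """Toy in-memory registry that maps queries to PIDs (hypothetical design)."""

    def __init__(self, prefix="example-pid"):
        self._prefix = prefix
        self._counter = itertools.count(1)
        # normalised query -> list of (result checksum, PID), oldest first
        self._history = {}

    @staticmethod
    def _normalise(query):
        # Assumption: collapsing whitespace and case is enough to recognise
        # "identical" queries; a real system needs a proper normal form.
        return " ".join(query.lower().split())

    @staticmethod
    def _checksum(rows):
        # Hash the result set so a later run can be compared with an earlier
        # one. Rows are assumed to arrive in a stable, sorted order.
        digest = hashlib.sha256()
        for row in rows:
            digest.update(repr(row).encode("utf-8"))
            digest.update(b"\n")
        return digest.hexdigest()

    def pid_for(self, query, rows):
        """Return the existing PID, or mint a new one if anything changed."""
        key = self._normalise(query)
        checksum = self._checksum(rows)
        versions = self._history.setdefault(key, [])
        if versions and versions[-1][0] == checksum:
            # Same query, same result set: reuse the existing PID.
            return versions[-1][1]
        # New query, or the data behind an earlier query has changed.
        pid = f"{self._prefix}/{next(self._counter)}"
        versions.append((checksum, pid))
        return pid


if __name__ == "__main__":
    store = QueryPidStore()
    first = store.pid_for("SELECT * FROM obs", [(1, 2.5)])
    again = store.pid_for("select *  from OBS", [(1, 2.5)])
    changed = store.pid_for("SELECT * FROM obs", [(1, 2.5), (2, 3.1)])
    print(first == again, first != changed)
```

A production data center would persist this history, timestamp each query and mint real identifiers, but the branching logic stays the same: reuse the PID when both the query and its result set are unchanged, otherwise assign a new one.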

Assigning a persistent identifier not only when a dataset is originally generated, but also when a dataset is about to be cited, is central to the working group recommendations for dynamic data citation, and it is also crucial for other data citation use cases. Data exist at different levels, from raw data possibly generated by a machine, to highly processed data used in a publication. The figure below – presented by Robin Dasler from CERN at the THOR Workshop on July 7 in Amsterdam – demonstrates this for high-energy physics (HEP):

Figure: Pyramid of data levels in high-energy physics

DataCite DOIs are intended as citation identifiers. They are persistent identifiers and provide standardized metadata, including links to associated publications, contributors and funders. They thus focus on the data in the top section of the pyramid. While we can also use DataCite DOIs for the other levels of the pyramid, sometimes other identifiers are more appropriate for raw, non-persistent data generated by machines. Dynamic data citation can be seen as a variant of the process that this pyramid describes.

If you could not attend last week or you want to review the session, a recording of the webinar is available.

The THOR project will work with interested data centers on dynamic data citation in the coming 12 months, hopefully leading to important feedback and a few more implementations of the RDA working group recommendations. Please contact us if you work for a data center and are interested in participating.

Highlights Workshop: Identifiers – Infrastructure, Impact and Innovation

On Thursday July 7 2016, project THOR organised the workshop: Identifiers – Infrastructure, Impact and Innovation to showcase the research and work done by all THOR partners during the project’s first year. The event in Amsterdam attracted a mixed audience of representatives from publishing companies, universities and research institutions.

After an introduction to the THOR project by Adam Farquhar (British Library), the day was divided into three sessions. The first one focused on persistent identifier linking, the next session on data publishing and the last one on THOR services. Slides of all presentations can be found on the THOR Knowledge Hub.


Photo: Introduction to Project THOR and persistent identifiers

Persistent Identifier Linking

During the first session on persistent identifier linking, Martin Fenner (DataCite), Laura Rueda (DataCite) and Tom Demeranville (ORCID) explained more about the challenges in linking data sets to other data sets, handling dynamic data and identifying multiple versions of the same data set. The complexities involved in cross-linking databases and how to establish a fully interoperable system were discussed as well. Good quality metadata is crucial. Lack of standards and low adoption complicate matters even more. Despite these challenges, the THOR team has achieved a lot during the project’s first year. For example, THOR partners have contributed to the ORCID auto update functionality and DataCite event data.

The ORCID auto update functionality enables researchers to easily search for their works via DataCite search and link them to their ORCID records. With DataCite’s event data it is possible to collect events, e.g. data citations in journal articles, around DataCite DOIs. These are great achievements, and the THOR project will do more research to address the other challenges.


Photo: Tom Demeranville, Martin Fenner and Laura Rueda presenting on persistent identifier linking

Data Publishing

The second session of the day focused on data publishing: Catriona MacCallum (PLOS), Michaela Torkar (F1000) and Hylke Koers (Elsevier) presented on data policies in their respective publishing companies. A lot of data that is generated is not being published, because most authors only focus on article publication. A cultural change is needed, as by the time a paper is submitted to a journal it is generally too late.

Martin Fenner (DataCite) agreed in his presentation that it is challenging to make the underlying data of a publication publicly available, and that even when the data is made available it is often not very accessible, for example because it is hidden in a file format like PDF. Other challenges for data-article linking are again the lack of good quality metadata and the fact that there is a wide range of data submission systems. Integrating persistent identifiers into the data publishing workflow might overcome these problems. However, globally unique identifiers should be used instead of local identifiers. Challenges for a centralized infrastructure are authentication and ownership for the data infrastructure management.


Photo: Josh Brown (ORCID) and Paul Groth introducing the publishing panel

After the presentations, Paul Groth (Elsevier Labs) led a panel discussion on the challenges and opportunities of data publishing. Key questions that were discussed included: Should a publisher be responsible for data publishing? Or are data repositories responsible for data publishing? These questions are not easily answered, but all panellists agreed publishers should work together with researchers and other stakeholders to establish community standards for good quality data. The persistency of data accessibility is a stamp of approval, so good quality metadata and the use of persistent identifiers are crucial.

Next to these technical infrastructure requirements, it is evident a human infrastructure needs to be in place as well. Another question arose: Would authors commit to having their data accessible forever? According to the panel, incentives and a cultural change are needed for researchers to publish their data. In order to make this change and to achieve a shared infrastructure to push data publishing, more research and workshop discussions between the different stakeholders should take place. The THOR team will continue these discussions in the coming months of the project.


Photo: Panel discussion on Data Publishing

THOR Services

In the final session of the day, Florian Graef (EMBL-EBI), Markus Stocker (PANGAEA), Robin Dasler (CERN) and Laura Rueda (DataCite) presented on THOR Services. They gave demonstrations of ORCID integration in data submission systems within their respective repositories in biological and medical sciences, earth and environmental sciences and high-energy physics.

The demonstrations of ORCID integration within data set claiming services and workflows showed clear advantages. At EBI, for example, there is a wide variety of databases, and maintaining one single service is a lot easier than maintaining a separate integration for each of them. See the demonstration of ORCID integration within PANGAEA as well. Next steps for continued implementation of persistent identifiers within the research cycle across the different disciplines have been identified: claiming services for previously published data and alignment of identifiers.


Photo: Markus Stocker explaining more about ORCID integration at PANGAEA

Of course, a lot more was discussed during the workshop, so check out the presentations, and please get in touch if you were unable to join us and have any questions! In the coming year we will keep you up to date with further achievements of the THOR project through our blog posts and website. Thanks to everybody for their outstanding contributions to valuable discussions in Amsterdam; we look forward to welcoming you at one of our next events!