Infrastructure series: Data Citations

FORCE11 Blogs: Infrastructure Series

The third entry in the year-long FORCE11 Blogs series on scholarly infrastructure is an interview with Rachael Lammey and Helena Cousijn.

Data Citations

Last month’s post in this year-long series focused on the role of metadata in infrastructure. This month continues that thread with an interview with Rachael Lammey and Helena Cousijn on the work of Crossref and DataCite, two DOI registration agencies that complement each other’s infrastructure work and have now been running for more than 20 and 10 years, respectively. Supporting persistent identifiers (PIDs) and our members is central to our* shared goals and to linking among the full breadth of research outputs. The efforts of the two organizations to enable data citations, in particular, help to illustrate the importance of collaboration in addressing evolving community needs.

 

Interview with Rachael Lammey and Helena Cousijn

Interview by Jennifer Kemp (*Full disclosure: Crossref is my employer and I am admittedly biased in favor of this kind of infrastructure work!)

 

What does ‘infrastructure’ mean to you/your organization, in the context of research communications?

The technical infrastructures are very important to our organizations, but there is also a more social component to infrastructure. When we think about what makes our infrastructures an important part of the research community, it’s about persistence, governance, sustainability. There is no point making links between data and the related research outputs if these links break easily over time. To help sustain and update the metadata, the organizations behind the infrastructure also need to be well-governed by their communities and work towards becoming/remaining sustainable. 

 

How do you describe data citations to people unfamiliar with them?

Data citation is a way to link datasets to other research outputs. Researchers routinely provide bibliographic references to other scholarly resources, such as articles. Data, however, are often shared but seldom cited in the same way as journal articles or other publications. Persistent identifiers, provided by our organizations, and the accompanying metadata are key components of data citations. Our organizations have collaborated for years on ensuring that datasets are cited, that these citations can be counted and displayed, and that researchers can start getting credit for sharing their data.

 

What’s the difference between DataCite and Crossref? 

Crossref members generally follow a publishing workflow, whereas DataCite members generally deposit content in repositories. In practice this means that the constituencies and content types we work with are different, but it’s not as simple as ‘Crossref is for articles’ and ‘DataCite is for datasets’. Some of our members are the same and join both organizations to register different content types, e.g. OSF and Morressier. We have many similarities too, in terms of the initiatives we support (ROR, data citation, Project FREYA) and our goals in persistently supporting the communities we work with and their needs.

 

What is the one thing you wish ‘Silicon Valley’ would do or do differently to better support scholarly metadata?

It is important that they a) use it and b) make it clear if they are using it and where they got it from. Crossref and DataCite metadata comes from the organizations that steward the content, and we make it openly available to encourage its use. When well-known systems harvest the metadata and use it, this can extend its reach and ensure the underlying research has maximum impact. In addition, it can expose errors and gaps in the metadata, which is useful because they can then be identified and fixed.

 

What is the one thing you wish non-technical people understood better about the challenges of scholarly metadata?

That it isn’t a homogeneous blob. There are so many different flavours which don’t quite all map together. This contributes to some of the errors we see, in terms of incorrect or missing information, when information is passed between systems. For example, in data citation, links to the data within the paper are often published online but not provided in the Crossref metadata because of issues passing the information along. We need to highlight that to publishers to make sure these links can be made as part of the standard production process.

 

What other areas of infrastructure do you work most closely with/are most dependent on (& how)?

For data citation to work well, Crossref and DataCite need to work together so that data-to-article and article-to-data links can be easily passed between our organisations in a standard way. Information about data citations can come from publishers when datasets are cited in articles, but also from data repositories when they indicate in their metadata that an article was built on the dataset. We’ve collaborated on the Event Data service as a means to do this. On the Crossref side we need to commit more resources to uphold our end of it, but we’re working on it!
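
As a rough illustration of the two-directional flow described in this answer, the sketch below merges link claims from publishers (article cites dataset) with claims from repositories (dataset points to article) into one deduplicated set. It is not the Event Data service or any Crossref or DataCite API: the `Link` class, the `merge_links` function, the tuple layouts, and the DOIs are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One article/dataset pair, independent of who asserted it (hypothetical model)."""
    article_doi: str
    dataset_doi: str


def normalize(doi: str) -> str:
    """Lower-case a DOI and strip common prefixes so equal DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def merge_links(publisher_claims, repository_claims):
    """Combine claims from both directions.

    publisher_claims: iterable of (article_doi, dataset_doi) pairs,
        e.g. taken from article reference lists.
    repository_claims: iterable of (dataset_doi, article_doi) pairs,
        e.g. taken from dataset metadata.
    Returns a dict mapping each Link to the set of sources that asserted it.
    """
    sources = {}
    for article, dataset in publisher_claims:
        link = Link(normalize(article), normalize(dataset))
        sources.setdefault(link, set()).add("publisher")
    for dataset, article in repository_claims:
        link = Link(normalize(article), normalize(dataset))
        sources.setdefault(link, set()).add("repository")
    return sources


# Placeholder DOIs, made up for this example only.
publisher = [("https://doi.org/10.5555/ARTICLE-1", "10.5555/data-1")]
repository = [
    ("doi:10.5555/DATA-1", "10.5555/article-1"),
    ("10.5555/data-2", "10.5555/article-1"),
]
merged = merge_links(publisher, repository)
for link, asserted_by in merged.items():
    print(link.article_doi, "<->", link.dataset_doi, sorted(asserted_by))
```

A link asserted from only one side still counts here, which reflects the point that either a publisher or a repository can be the source of the information.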

 

Explain in some detail the issue you think is the most vexing/interesting/consequential/etc.

The need for a cultural change to support data citation. We believe a lot of the infrastructure is there. Following the FORCE11 Joint Declaration of Data Citation Principles, and the corresponding roadmaps for publishers and repositories, many organizations have implemented the workflows to support data citation, including our organizations. What remains is the cultural change, including 1) it becoming standard practice for researchers to cite the data they’ve used and 2) research institutions and funding agencies including data citations in their evaluation systems.

 

In a perfect world, how would data citation be funded and governed?

A lot of the infrastructure is there, so the remaining piece that needs to be funded is bibliometric work to interpret the numbers that come out and ensure data citation becomes a meaningful concept. When that work is done, data citations and other data metrics can become part of the reward system for researchers.

 

What are your favorite blogs, conferences, Twitter accounts, etc. to keep up on scholarly metadata?

PIDapalooza is definitely the conference to keep an eye on if you want to know more about PIDs & their potential. It opened my eyes to the wider world of PIDs and their associated metadata. Anything Dominika Tkaczyk writes on the Crossref blog is worth your time, as is everything by Alice Meadows.

 

Favorite little-known fact or unsung hero?

Daniella Lowenberg has been doing great work at Make Data Count. The project is addressing the significant social as well as technical barriers to widespread incorporation of data-level metrics in the research data management ecosystem - and it has a very community-led approach, so worth looking at the work they’re doing and considering implementing their recommendations. Based on the outcomes of the last Make Data Count project, DataCite is now displaying data citations at the dataset level.

 

What question do you wish we asked but didn’t and why?

What can publishers do to support data citation? There’s still some confusion around this, which is limiting the number of article/data links, but we try to fill in the gaps here. Data citation needs to be a standard component of publication so that links from other research outputs to the data that support them are comprehensive and help the transparency and reproducibility of research. Publishers are working on this, e.g. by providing and trying to standardize policies that encourage authors to link to and share data, integrating prompts to add links to data in submission system workflows, and working with vendors to ensure the information gets passed to Crossref. It’ll take time to iterate across publishers, but some are really making inroads into making this happen as a matter of course.

 

More Information: Rachael Lammey, Helena Cousijn and their organizations (Crossref & DataCite)

Rachael Lammey is the Head of Community Outreach at Crossref and Helena Cousijn is the Director of Community Engagement at DataCite. Crossref (crossref.org) and DataCite (datacite.org) make research outputs easy to find, cite, link, assess, and reuse. Both are not-for-profit membership organizations that exist to make scholarly communications better. Crossref works mainly with stakeholders in the research community that follow a publishing workflow, whereas DataCite works with the organizations that house repositories where content is deposited. Crossref and DataCite collaborate in many areas, one of which is data citation.

About Jennifer Kemp

Jennifer Kemp is Partnerships at Crossref, where she works primarily with organizations that use Crossref metadata. She also co-chairs their Books Advisory Group and the Metadata 2020 Best Principles and Practices project. Prior to Crossref, she was most recently Senior Manager of Policy and External Relations, North America for Springer Nature...