Data citation FAQs for publishers


The following FAQs and recommendations are based on the recommendations in Starr et al (2015), on other publications by members of the DCIP working group, and on those members' experience.  Some reflect issues raised at the Publisher Early Adopter meeting held in London on July 22, 2016.  The purpose of these FAQs is to assist those who are interested in, or are trying to implement, systems of data citation for the current pilot project.  We have therefore tried to make concrete recommendations that can be implemented within a 6-month time frame with minimal effort.  We also tailor these recommendations to biomedicine, as this pilot is supported by a grant from the US National Institutes of Health as part of the bioCADDIE project.  We recognize that these recommendations are not the only means of implementing the principles, nor will they hold for all communities.  Indeed, the Joint Declaration of Data Citation Principles (JDDCP) explicitly affords communities the flexibility to implement solutions that work for them.  However, we welcome experiences and advice from all domains that have expertise in data citation.

Q1:  We endorsed the JDDCP, now what am I supposed to do?

Q2:  Why as a publisher should I care about citing data?

Q3:  How should a publisher handle data that is restricted either by its provenance or by law?

Q4:  How can I cite data?

Q5:  How should authors cite a range of datasets?  Or a data set that has multiple parts?  

Q6: How should authors handle versioning?

Q7: What do I need to know about dataset identifiers for data citation?

Q8: What makes a repository an acceptable repository?

Q9: What is meant by persistence with regard to data?

Q10: What are the advantages of making data available in a public repository versus in supplemental material?

Q11: What is JATS?  How can I get more information about it?

Q12:  What data should be described in the data availability statement?

Q1:  We endorsed the JDDCP, now what am I supposed to do?

As a publisher or journal, endorsing the principles means beginning to look for a way to implement them.  Publishers were an integral constituency in the development of the JDDCP, as well as in the subsequent implementation teams, which have been developing materials to assist early adopters, including publishers.  An example of work aimed specifically at publishers is the new JATS standard (see Q11 below for more information).  Nevertheless, the JDDCP deliberately grants a lot of leeway in implementation, in recognition that publishers serve different communities and may be at different starting points for achieving a fully JDDCP-compliant solution.  This pilot, too, provides a range of strategies for achieving compliance with the principles, many of which can be implemented with relatively little effort.

Q2: Why as a publisher should I care about citing data?

Citing data supports scientific reproducibility and enriches the scholarly communications ecosystem with a broader array of research products.  The JDDCP, endorsed by many in the publishing community, states that data is a primary product of scholarship and that “Data citations should facilitate identification of, access to, and verification of the specific data that support a claim.”  “Citing data formally in reference lists helps facilitate the tracking of data reuse and may help assign credit for individuals’ contributions to research” (Springer-Nature data policy FAQ), as well as for the contributions of the data repository in making these data available.

Q3:  How should a publisher handle data that is restricted either by its provenance or by law?

The recommended best practice for all data citation is for the repository to maintain a landing page containing citation and descriptive metadata, including access restrictions if any.  For data that is restricted, however, the landing page is mandatory, as the identifier for that data set must resolve to something that provides basic descriptive data and information on how access rights to the data may be obtained, if possible.

See also FAQ #2 for Repositories: Why shouldn’t the identifier resolve directly to the data themselves?  Why do I need a landing page?

Q4:  How can I cite data?  

The recommended best practice is for datasets to be deposited in public repositories that provide globally unique, web-resolvable identifiers, and then for these datasets  to be formally cited in reference lists. The citations should include the minimum information recommended by the JDDCP standard and follow journal style. The Data Citation Primer provides several examples.
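
As an illustration of the recommendation above, the following minimal Python sketch assembles a reference-list entry from a small set of citation fields.  The field names, the generic author-year layout, and the sample values (including the DOI, which is hypothetical) are assumptions made for this example; they are not taken from the JDDCP or from any journal style, and a real implementation would follow the journal's own style.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCitation:
    """Illustrative citation record; the field names are assumptions."""
    creators: List[str]
    year: int
    title: str
    repository: str
    identifier: str  # globally unique, web-resolvable identifier
    version: Optional[str] = None


def format_citation(c: DatasetCitation) -> str:
    """Render one reference-list entry in a generic author-year layout."""
    parts = [f"{', '.join(c.creators)} ({c.year}).", f"{c.title}."]
    if c.version:
        # Include the version only when the repository supplies one.
        parts.append(f"Version {c.version}.")
    parts.append(f"{c.repository}.")
    parts.append(c.identifier)
    return " ".join(parts)


# Hypothetical dataset used only to show the output shape.
example = DatasetCitation(
    creators=["Doe J", "Roe R"],
    year=2016,
    title="Example imaging dataset",
    repository="Example Repository",
    identifier="https://doi.org/10.1234/example",
    version="2",
)
print(format_citation(example))
```

Keeping the identifier as a full URI at the end of the entry means that the citation remains clickable in the published reference list.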

Q5:  How should authors cite a range of datasets?  Or a data set that has multiple parts?  

It is important to have a means of tracking the elements of an aggregate data set, which might even span repositories, both for provenance (what exactly did I analyze to produce a set of findings, and what exactly am I referring to?) and for attribution (who used my data?).  We will call this aggregate data set a collection.  Some repositories, e.g., NITRC, are exploring how to create and mint identifiers for these collections, while others are not.  Two key questions here are how to create and characterize these collections, and how far an author needs to go in citing the sources of a collection.  If I analyze all structures available in the PDB at a given time, do I need to provide a file that lists all the data available, or is a date and time stamp sufficient?

For cases where a reasonable number of data sets or elements are combined into a collection, we recommend that the collection be given its own identifier and its own landing page, which should reference the parts of that data set in a machine-readable fashion.  Each of the parts should be given its own ID, and the parts should be related to the whole and vice versa.  Several existing schemas allow for tracing the parts of a data set, e.g., DataCite or DATS “has part”.  If these parts are datasets from other repositories, they should be individually cited within the collection.
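
The part/whole linking described above can be sketched in Python as follows.  This is a simplified record layout loosely modeled on DataCite's related-identifier relations ("HasPart" and "IsPartOf"); the function name, the dictionary keys other than the relation types, and all identifiers are hypothetical, and a repository would use its own metadata schema.

```python
import json


def build_collection_record(collection_id, title, part_ids):
    """Return (collection, parts) records linked in both directions."""
    collection = {
        "identifier": collection_id,
        "title": title,
        # The collection's landing-page metadata lists every part.
        "relatedIdentifiers": [
            {"relationType": "HasPart", "relatedIdentifier": pid}
            for pid in part_ids
        ],
    }
    parts = [
        {
            "identifier": pid,
            # Each part points back to the collection it belongs to.
            "relatedIdentifiers": [
                {"relationType": "IsPartOf", "relatedIdentifier": collection_id}
            ],
        }
        for pid in part_ids
    ]
    return collection, parts


# Hypothetical identifiers, used only for illustration.
collection, parts = build_collection_record(
    "https://example.org/collection/1",
    "Example collection",
    ["https://example.org/dataset/a", "https://example.org/dataset/b"],
)
print(json.dumps(collection, indent=2))
```

Recording the relation on both sides lets a machine start from either the collection or a single part and still recover the other.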

For datasets comprised of large numbers of files, some repositories package them as a compressed file and assign an identifier only to the data set package. This is a common practice and is certainly practical, but the ideal would be to assign identifiers to the individual parts, regardless of size, to allow for granular citation and versioning.

Q6: How should authors handle versioning?

Principle 5 of the JDDCP states: “Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data”.  To fulfil this requirement, we are recommending as best practice that the identifier resolve to a landing page that provides these details and access to the data.    

Principle 7 also states that “Data citations should facilitate identification of, access to, and verification of the specific data that support a claim.”  Specifying the version or subset of data in a citation is therefore also required.  Current repository practice generally is to have a main landing page for the data set that provides a listing of, and access to, the different versions of the data.  Ideally, documentation exists that provides information on each version.  In this case, the version number would be provided in the citation, but each version would not have its own landing page.  A larger question is whether each version should have its own resolvable identifier and whether that identifier should contain information on the original dataset (see Q7).  As far as the citation is concerned, however, if a version of a data set has its own identifier that is used in a data citation, it should resolve to a landing page that provides the metadata required for understanding its provenance.

Q7: What do I need to know about dataset identifiers for data citation?

A dataset identifier suitable for data citation is one that is:  a) resolvable, meaning you can click on it as a web link and get to a resource landing page; and b) globally unique (on the Web), meaning that it can appear anywhere on the World Wide Web without conflicting with any other identifier.  Globally unique, Web-resolvable identifiers in data citations are important because they allow data citations to be readily processed by machine and compiled into useful search indexes (such as DataMed by bioCADDIE) without requiring additional knowledge.  Bottom line: readers clicking on the link get to the exact end point they intended to reach.

From a somewhat more technical point of view: by a globally unique web-resolvable identifier we mean:

  • an http: URI for which an HTTP GET request with an appropriate MIME type will return a machine-readable descriptor of the desired resource.
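
The request described in the bullet above might be constructed as in the following minimal Python sketch, which uses only the standard library.  The identifier is hypothetical, and the choice of "application/ld+json" as the requested MIME type is an assumption: which media types a given resolver or repository honours varies and should be checked against its documentation.  The sketch builds the request without sending it.

```python
import urllib.request


def build_metadata_request(identifier_uri, mime_type="application/ld+json"):
    """Build a GET request asking for a machine-readable descriptor.

    The Accept header carries the MIME type; the default value here is
    an assumption, not a requirement of the JDDCP.
    """
    return urllib.request.Request(
        identifier_uri,
        headers={"Accept": mime_type},
        method="GET",
    )


# Hypothetical identifier, used only for illustration.
request = build_metadata_request("https://doi.org/10.1234/example")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would send it; a resolver that supports the requested type would then respond with metadata rather than the human-readable landing page.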

Examples of resolvable identifiers:

The JDDCP leaves it up to a community to adopt specific identifier schemes.

See also:  FAQ # 8 for repositories:  My dataset is registered in several different repositories.  Each repository gave it a persistent identifier.   Which one should I use for citation? Or should I use all of them?

Q8: What makes a repository an acceptable repository?

Starr et al (2015) suggest that archives and repositories have five responsibilities: “(a) Identifiers, (b) resolution behavior, (c) landing page metadata elements, (d) dataset description and (e) data access methods, that should all conform to the technical recommendations in this article.”  This is closely aligned with (although not identical to) Springer-Nature’s Data policy on repositories:

Data repositories for data supporting peer-reviewed publications generally should:

i. Ensure long-term persistence and preservation of datasets

ii. Be recognized by a research community or research institution

iii. Provide deposited datasets with stable and persistent identifiers, such as Digital Object Identifiers (DOIs)

iv. Allow access to data without unnecessary restrictions

v. Provide a clear license or terms of use for deposited datasets

Q9: What is meant by persistence with regard to data?

“Data preservation, or more specifically, digital data preservation, refers to the series of managed activities necessary to ensure continued access to digital materials for as long as necessary.” (IFDO Data Preservation; emphasis added).  Continued access is what many, if not most, people think of as persistence.  However, according to the JDDCP, persistence of data citations is achieved via the landing page, which is expected to persist even beyond the lifespan of the data.

Q10: What are the advantages of making data available in a public repository versus in supplemental material?

The whole point of data citation according to the JDDCP is to eliminate the idea that data are supplemental objects.  Data should always be deposited into a repository and given a persistent identifier, not just for proper handling and further analysis but to accord them a primary place in scholarly communication.  Supplemental materials generally are objects that are not essential to the main points of the paper.  Data are always central to the main points of the paper.

Q11: What is JATS?  How can I get more information about it?

As noted in Starr et al (2015), “a relevant National Information Standards Organization (NISO) specification, NISO Z39.96-2012, ...is increasingly used by publishers” as the Document Data Model for scholarly articles. It is called the NISO Journal Article Tag Suite, or JATS.  A recent revision to JATS supports direct data citation. NISO-JATS version 1.1d2 (National Center for Biotechnology Information, 2014) is available as a tag library with full documentation and help. For additional information, see the excellent presentation by Debbie Lapeyre, Citing Data in Journal Articles using JATS (2015).
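
To give a rough idea of what a data citation can look like in JATS, the following Python sketch builds a single reference entry with the standard library's XML tools.  The element and attribute names used (element-citation with publication-type="data", data-title, source, year, pub-id) reflect our reading of the JATS tag library and should be checked against it; the function name and all values, including the DOI, are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_data_citation(ref_id, surname, given_names, title, source, year, doi):
    """Build one <ref> holding a data citation (illustrative sketch only)."""
    ref = ET.Element("ref", {"id": ref_id})
    citation = ET.SubElement(ref, "element-citation", {"publication-type": "data"})
    group = ET.SubElement(citation, "person-group", {"person-group-type": "author"})
    name = ET.SubElement(group, "name")
    ET.SubElement(name, "surname").text = surname
    ET.SubElement(name, "given-names").text = given_names
    ET.SubElement(citation, "data-title").text = title   # title of the dataset
    ET.SubElement(citation, "source").text = source      # repository name
    ET.SubElement(citation, "year").text = year
    ET.SubElement(citation, "pub-id", {"pub-id-type": "doi"}).text = doi
    return ref


# Hypothetical values, used only to show the resulting structure.
ref = build_data_citation(
    "ref1", "Doe", "J", "Example imaging dataset",
    "Example Repository", "2016", "10.1234/example",
)
print(ET.tostring(ref, encoding="unicode"))
```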

Q12:  What data should be described in the data availability statement?

Data availability statements should provide a statement about where data supporting the results reported in the article can be found. For example, Nature Research journals strongly encourage provision of a “minimal data set” underlying the figures provided in the paper that is necessary to support the central findings of the study, and to interpret, analyse or reproduce the methods and findings. The data would include datasets generated or analysed during the study, and source data that are necessary to interpret, replicate and build upon the methods and findings reported in the article.  Ideally, this minimal dataset would be provided through deposition in public repositories but may be associated with the paper in supplementary information files and figure source data files. For more information, see: http://www.nature.com/authors/policies/data/data-availability-statements-FAQs.pdf