Data citation FAQs for repositories

The following FAQs and recommendations are based on the recommendations in Starr et al. (2015), on other publications by members of the DCIP working group, and on the members' own experience.  Some reflect issues raised at the Repository Early Adopter meeting held in San Diego on June 22, 2016.

The purpose of these FAQs is to assist those who are interested in or trying to implement systems of data citation for the current pilot project.  We have therefore tried to make concrete recommendations that can be implemented within a 6-month time frame with a minimal amount of effort.  We also tailor these recommendations to biomedicine, as this pilot is supported by a grant from the US National Institutes of Health as part of the bioCADDIE project. We recognize that these recommendations are not the only means of implementing the principles, nor will they hold for all communities.  Indeed, the Joint Declaration of Data Citation Principles (JDDCP) explicitly affords communities the flexibility to implement solutions that will work for them.  However, we do welcome experiences and advice from all domains that have expertise in data citation.

Q1:  Why should repositories care about data citation?  We already ask authors to cite an article that describes our resource.  Isn’t that good enough?

Q2:  Why shouldn’t the identifier resolve directly to the data themselves?  Why do I need a landing page?

Q3:  What is a machine-accessible landing page?

Q4:  What is a globally unique web-resolvable identifier and why is it important?

Q5:  I have my own accession numbers.  Do I have to switch to DOIs?

Q6: How should I handle landing pages for datasets with multiple versions?

Q7:  Should I put a version number (i.e. semantic information) in the identifier?

Q8:  My dataset is registered in several different repositories. Each repository gave it a persistent identifier. Which one should I use for citation? Or should I use all of them?

Q9:  How do I handle a data set made of multiple parts?  Do all the parts need separate identifiers?

Q1:  Why should repositories care about data citation?  We already ask authors to cite an article that describes our resource.  Isn’t that good enough?

Citing a dataset achieves the twin aims of crediting the repository and crediting the data contributor.

Through a formal system of data citation, a repository can ensure that it gets credit as the "publisher" of a data set. Some registries, e.g., SciCrunch and re3data, are assigning persistent identifiers to repositories to make it easier to track their impact statistics.

A formal standard for data citation also makes your repository more valuable to users: it ensures that those who reuse the data properly cite the data authors or contributors, so that reuse of the data can be tracked.

Q2:  Why shouldn’t the identifier resolve directly to the data themselves?  Why do I need a landing page?

  1. The JDDCP recommends that citations be human *and* machine readable. It's very hard to ensure that all machines (or people) are ready to consume, interpret or access the data. A landing page provides the additional information they need to do so. A landing page can also serve as the intermediary for complex data packages, e.g., .zip, .tar, .gz, to provide a unique point of access.
  2. Data (and other cited resources) may someday go away, but the citation will live on in the scholarly literature. The citation *must* resolve. The landing page allows for a smooth, moderated experience.
  3. The data/resource may have access restrictions. The landing page allows for an access point where metadata (if not the data) can be shared, or access validation can occur.
  4. While we recognize the need in some cases and for some purposes to have identifiers that can resolve directly to data, this does not apply to identifiers used in data citation, which should always and without fail resolve to a landing page.

Q3:  What is a machine-accessible landing page?

Landing pages should ensure that both the metadata and the data are “machine accessible”, i.e., that the landing page provides access by well-documented Web services to data and metadata stored in a robust repository, independent of browser access by humans.  Specific recommendations for how to achieve these goals may be found in Starr et al. (2015).  The DCIP Expert Group on Repository Metadata has issued a preliminary set of guidelines for creating machine-accessible landing pages.

Q4:  What is a globally unique web-resolvable identifier and why is it important?

This is an identifier that is:  a) resolvable, meaning you can click on it as a web link and get to a resource landing page;  b) globally unique (on the Web), meaning that it can appear on the World Wide Web and not be in conflict with any other alphanumeric string. Globally unique Web-resolvable identifiers in data citations are important because they allow data citations to be readily processed by machine algorithms and compiled into useful search indexes (such as DataMed by bioCADDIE), without requiring additional knowledge. Bottom line: readers who click on the link get to the exact end point they intended to reach.

From a somewhat more technical point of view: by a globally unique web-resolvable identifier we mean

an HTTP URI for which an HTTP GET request with an appropriate MIME type will return a machine-readable descriptor of the desired resource.

Examples of resolvable identifiers:

DOI:  http://dx.doi.org/10.5061/dryad.2h0q3

ARK:  http://n2t.net/ark:/21547/AUw2
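
As a rough illustration of the technical definition above, the following Python sketch builds (but does not send) an HTTP GET request for the Dryad DOI shown above and asks for a machine-readable descriptor through the `Accept` header. The function name `build_metadata_request` is hypothetical, and the MIME type `application/vnd.datacite.datacite+json` is an assumption made for this example; which MIME types a given resolver or repository actually honors depends on the service.

```python
from urllib.request import Request


def build_metadata_request(identifier_url,
                           mime_type="application/vnd.datacite.datacite+json"):
    """Build an HTTP GET request asking for machine-readable metadata.

    The default MIME type is an assumption for this sketch; use whatever
    format the resolver or repository documents.
    """
    return Request(identifier_url, headers={"Accept": mime_type}, method="GET")


# The request is only constructed here, not sent over the network.
request = build_metadata_request("http://dx.doi.org/10.5061/dryad.2h0q3")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would send it; what comes back depends on the resolver and on the landing page behind the identifier.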

The JDDCP leaves it up to a community to adopt specific identifier schemes.   

Q5:  I have my own accession numbers.  Do I have to switch to DOIs?

The JDDCP are agnostic with respect to identifier schemes beyond the requirements listed for what comprises a good scheme.  Several schemes were recommended in Starr et al. (2015).  Some publishers at the Boston Workshop made a strong recommendation for using DOIs.

bioCADDIE, the biomedical data discovery index, is making specific recommendations for data set identifiers in biomedicine. Researchers, repository managers, publishers and other interested parties in biomedicine are encouraged to follow the progress of this initiative.

We know that repositories have their own histories and communities have their own practices already in place.  The use of accession numbers, for example, is prevalent in many biomedical repositories and is typically coupled with a repository-specific prefix; many of these prefixes are registered in the EBI’s Identifiers.org registry.  We are more concerned at this point that there be a unique identifier that resolves. Accession numbers can be made to fit these requirements, but it will require some work on the part of the Pilot Team, which is currently working on the problem; the repositories should not have to worry about it.  We therefore recommend that data preferably be cited using a DOI; if this isn’t possible, use the "compact identifier format".  A compact identifier is constructed by taking the official repository namespace prefix, as registered with Identifiers.org, followed by a colon and then the local accession number, with no space after the colon, e.g., "ENA:BN000065.1" or "UNIPROT:P62158".

Compact identifiers may be resolved by services at the European Bioinformatics Institute (http://identifiers.org/) and/or the California Digital Library (http://n2t.net/), which are scheduled to become active by December 1, 2016. Just prepend the service URI to the compact identifier, e.g. "http://identifiers.org/ENA:BN000065.1".
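
The construction described above can be sketched in a few lines of Python. The function names `compact_identifier` and `resolver_url` are hypothetical helpers written for this illustration, not part of any resolver's API, and the validation rules are assumptions rather than an official specification of the format.

```python
def compact_identifier(prefix, accession):
    """Join a registered namespace prefix and a local accession number.

    Follows the pattern "PREFIX:ACCESSION" with no space after the colon.
    """
    prefix = prefix.strip()
    accession = accession.strip()
    if not prefix or not accession:
        raise ValueError("both a prefix and an accession number are required")
    if ":" in prefix:
        raise ValueError("the prefix itself must not contain a colon")
    return f"{prefix}:{accession}"


def resolver_url(compact_id, service="http://identifiers.org/"):
    """Prepend a resolver service URI to a compact identifier."""
    return service.rstrip("/") + "/" + compact_id


ena_id = compact_identifier("ENA", "BN000065.1")
print(ena_id)
print(resolver_url(ena_id))
print(resolver_url(compact_identifier("UNIPROT", "P62158"), "http://n2t.net/"))
```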

Q6: How should I handle landing pages for datasets with multiple versions?

Principle 5 of the JDDCP states:  “Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data”.  To fulfil this requirement, we are recommending as best practice that the identifier resolve to a landing page that provides these details and access to the data (see question Q2).    

Principle 7 also states that “Data citations should facilitate identification of, access to, and verification of the specific data that support a claim.”  So specifying the version or subset of data in a citation is also required.    Current repository practice generally is to have a main landing page for the data set that provides a listing of and access to different versions of the data.  Ideally, documentation exists that provides information on each version.  So, in this case, the version number would be provided in the citation, but each version would not have its own landing page.  A larger question is whether each version should have its own resolvable identifier and whether that identifier should contain information on the original dataset (see question Q7).  As far as the citation is concerned, however, if a version of a data set has its own identifier that is used in a data citation, it should resolve to a landing page that provides the required metadata for understanding its provenance.   

Q7:  Should I put a version number (i.e. semantic information) in the identifier?

In general, it is a best practice for the longevity of an identifier to avoid semantics that are subject to change. Semantics are ok if they are intrinsic to the object being registered.  For example, if a data granule covers a particular date, it is standard practice to incorporate that date into the granule's filename and/or identifier. That works because the coverage date is never going to change.  If a granule has data for 2013-06-05, it will always have data for 2013-06-05.  By contrast, it is also common practice to incorporate an institution name into an object identifier, and this is a problematic practice precisely because the curating institution is typically *not* an integral attribute of an object.  

Q8:  My dataset is registered in several different repositories. Each repository gave it a persistent identifier. Which one should I use for citation? Or should I use all of them?

The recommended approach is to use only one persistent identifier (PID) when citing any given resource, because using one PID makes the job of impact tracking much easier. If multiple identifiers are used in citations of the same resource, it becomes necessary to total the impact metrics for all of those identifiers in order to understand the resource’s impact. Therefore, if there are copies of the dataset in multiple locations, make a decision about which copy is the “copy of record,” so to speak, and cite that one.
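
To make the bookkeeping cost concrete, here is a minimal Python sketch of the totaling step that becomes necessary when one resource is cited under several identifiers. All identifiers and counts below are made up for illustration, and `total_citations` is a hypothetical helper, not part of any citation-tracking service.

```python
from collections import defaultdict


def total_citations(counts_by_pid, copy_of_record):
    """Sum citation counts per resource.

    counts_by_pid maps each cited identifier to its citation count.
    copy_of_record maps an alternative identifier to the identifier of
    the copy of record; identifiers not listed are treated as canonical.
    """
    totals = defaultdict(int)
    for pid, count in counts_by_pid.items():
        canonical = copy_of_record.get(pid, pid)
        totals[canonical] += count
    return dict(totals)


# Hypothetical data: one dataset cited under two different identifiers.
counts = {
    "doi:10.1234/example.1": 12,
    "EXAMPLE:ACC0001": 5,
    "doi:10.1234/other.9": 3,
}
aliases = {"EXAMPLE:ACC0001": "doi:10.1234/example.1"}
totals = total_citations(counts, aliases)
print(totals)
```

The alias table is exactly the extra knowledge that a single agreed-upon PID makes unnecessary.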

Q9:  How do I handle a data set made of multiple parts?  Do all the parts need separate identifiers?

Both for provenance (what exactly did I analyze to produce a set of findings, and what exactly am I referring to?) and for attribution (who used my data?), it is important to have a means of tracking the elements of an aggregate data set, which might even span repositories.  We will call this aggregate data set a collection.  Some repositories, e.g., NITRC, are exploring how to create and mint these collections, while others are not.  Two key questions here are how to create and characterize these collections, and how far one needs to go in citing the sources of a collection. If I analyze all structures available in the PDB at a given time, do I need to provide a file that lists all the data available, or is a date and time stamp sufficient?

For cases where a reasonable number of data sets or elements are combined into a collection, we recommend that the collection be given its own identifier and have its own landing page, which should reference the parts of that data set in a machine-readable fashion.  Each of the parts should be given its own ID, if it does not have one already, and the parts should be related to the whole and vice versa (see Q7 about embedding semantic information into identifiers). Several existing schemas allow for tracing the parts of a data set, e.g., DataCite or DATS “has part”.  If these parts are datasets from other repositories, they should be individually cited within the collection.
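
One way to picture the part/whole links described above is the following Python sketch. It is loosely modeled on the DataCite relation types `HasPart` and `IsPartOf`, but the record layout is simplified and is an assumption for this example, not a complete or valid DataCite record. All identifiers are hypothetical.

```python
import json


def collection_record(collection_id, part_ids):
    """Describe a collection and point to each of its parts."""
    return {
        "identifier": collection_id,
        "relatedIdentifiers": [
            {"relatedIdentifier": pid, "relationType": "HasPart"}
            for pid in part_ids
        ],
    }


def part_record(part_id, collection_id):
    """Describe one part and point back to its collection."""
    return {
        "identifier": part_id,
        "relatedIdentifiers": [
            {"relatedIdentifier": collection_id, "relationType": "IsPartOf"}
        ],
    }


# Hypothetical identifiers for a collection made of two parts.
parts = ["doi:10.1234/part.1", "doi:10.1234/part.2"]
collection = collection_record("doi:10.1234/collection.1", parts)
first_part = part_record(parts[0], collection["identifier"])
print(json.dumps(collection, indent=2))
print(json.dumps(first_part, indent=2))
```

Recording the relation in both directions lets a machine start from either the collection or a single part and find the other.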

For datasets composed of large numbers of files, some repositories package them as a zip file and assign an identifier only to the data set package.  This is a common practice and is certainly practical, but the ideal would be to assign identifiers to the individual parts, regardless of size, to allow for granular citation and versioning.