Following on from our March and April posts, this month we see how metadata and technology issues intersect in publishing infrastructure within the context of Open Journal Systems (OJS), with James MacGregor and Mike Nason from the Public Knowledge Project (PKP). Open-source software that must support multiple, foundational functions such as peer review and content hosting involves many diverse stakeholders and evolving needs, making OJS an excellent example to highlight interdependencies and context in scholarly infrastructure.
Interview with James MacGregor and Mike Nason
Interview by Jennifer Kemp
What does ‘infrastructure’ mean to you/your organization, in the context of research communications?
In our little corner of the scholarly publishing world, infrastructure for PKP means the underlying software and related protocols and standards that are developed explicitly to support scholarly research and communication. This includes the software PKP develops, such as Open Journal Systems, Open Monograph Press, and the new Open Preprint Systems, alongside other scholarly publication, communication, preservation and support tools like Islandora, DSpace/Fedora, LOCKSS, and so on. The standards and protocols that these tools use to interoperate, such as SWORD, COUNTER, JATS, OAI, are equally a part of this infrastructure. All of these bits cohere into a whole that, ideally, is entirely invisible to the research community while also supporting their work fully.
Our software has been in use for over 20 years and has active user communities around the world. It *is* global scholarly publishing infrastructure, and this demands a level of responsibility and accountability from PKP to that community. We have a responsibility to provide stable, usable software. It has to be accessible, in word and deed. It must support governmental privacy requirements. It must be interoperable with existing standards and protocols, to better interact with other bits of the infrastructure. It must support existing scholarly communications practices, while at the same time making space for new models like open peer review. We must also ensure that our project is sustainable and that we retain an international, equitable vision of our role in this space.
This view is reinforced by our experience with Coalition Publica, where we are partners with Érudit in delivering national scholarly publishing infrastructure for social science and humanities journals in Canada. This has been and continues to be a major infrastructure project, federally funded and with international stakeholders, and I draw from this experience in answering some of the further questions below.
How do you describe what you do to people unfamiliar with it?
I like to describe our software as "WordPress, but for journals (or presses, or preprints) instead of blogs". Our tools allow folks to operate journals, presses, etc. much in the same way they would a WordPress site, and with a lot of the same sort of capabilities - but instead of publishing blogs, users of our software are publishing journals, books, etc. We provide editing, review, typesetting and user management tools to folks who publish content online. Our software aims to be simple to use, so the barrier to entry is low. And, of course, it's all open source - free as in beer *and* speech.
Circling back to the first question, I also increasingly describe what we do explicitly as providing infrastructure for scholarly communication, and frame our contribution more as an invisible conduit of information (much like a highway, phone line, or the Internet) than as a discrete software application - at least, to end-users who don't have to worry about the day-to-day maintenance of said infrastructure. For example, PKP develops OJS, which is used to generate scholarly content that is in turn harvested by the Érudit platform and disseminated to partners in Canada and elsewhere. There is a lot of invisible piping that goes into making that seemingly simple process work. Given that the information we manage is almost entirely metadata and full-text content, maybe a paraphrasing of the Dune line, "The (metadata) must flow!" would also resonate. But, probably, only with fellow nerds.
What is the one thing you wish ‘Silicon Valley’ would do or do differently to better support scholarly metadata?
I wonder how well scholarly infrastructure work and the management of scholarly metadata conform to a 'Silicon Valley' mindset. I am concerned with how tempting it can be to try and do something new and shiny to attract grant money as some sort of simulacrum of the Silicon Valley VC funding model. I think there is a lot of time and money spent on reinventing the wheel unnecessarily, and less time and money spent on existing infrastructure support and long-term sustainability. Of course, as someone who provides longstanding (and sometimes very boring) infrastructure, that's what I have to say! Over the past few years I think we have seen a shift in our collective approach here, with SCOSS being just one example of a new approach towards recognizing and funding sustainability as a practice, and that is an approach that I think will pay real dividends in terms of stability, scalability and sustainability in the long term.
What is the one thing you wish non-technical people understood better about the challenges of scholarly metadata?
Maybe how important - and difficult - it is to keep metadata consistent across platforms and venues. Once content is published, it has a life of its own well beyond its original publication venue: it lives on in indexing services and databases, it is republished on other platforms, it can be ingested into research datasets, you name it. Changing content at the source should be really easy to propagate to these other venues, but that isn't necessarily the case, and often a fair bit of conscious caretaking is required. Metadata isn't something that just happens once - it's very process-heavy, and the process deserves respect.
How, if at all, does metadata differ when it comes to text vs data or journals, books, etc.?
I'm of the mind that schemas can differ, but the differences can be minor, and it can become a real question about when and where specialist or generalist tools are best suited for a given job. Our tools (OJS, OPS, OMP) are becoming increasingly abstracted into one main shared code library, with individual workflow components isolated in their own standalone codebases. A *lot* of the metadata management we rely on for journals, books and preprints is processed using the shared library, with any minor adjustments made at the individual application level. For example, we have generic plugins to support Crossref, Google Scholar, OAI, etc. - these are (mostly) available across all platforms but massaged just a bit for their own specific context.
What other areas of infrastructure do you work most closely with/are most dependent on (& how)?
We do a lot of work in the PID space and have relationships with Crossref, ORCID and DataCite. We also work with Google Scholar to make sure their indexing service is supported by our software. We spend a lot of time working with XML tools, most specifically collaborating with eLife, SciELO and Érudit on JATS XML editing. And we spend a fair bit of time, though less than we should at the moment, working in metrics with COUNTER for usage tracking, and Paperbuzz and Crossref Event Data for altmetrics. Where possible we try and work with other partners in this space. These are all collaborative tools, after all. And most important, we are working with Érudit on Coalition Publica to aggregate, package and disseminate distributed scholarly material (mostly journal content, but also books, and hopefully someday preprints). This project heavily involves all of the infrastructure components I mentioned above: metadata (and some full text) is encoded in JATS and transferred via OAI; usage is tracked as per COUNTER rules; all published content has a DOI; and so on.
Explain in some detail the issue you think is the most vexing/interesting/consequential/etc.
I was originally going to say that "operating technology at scale" was the most interesting issue at hand these days, given the hundreds of journals, dozens of OJS installs, and a whole bunch of technologies that have to work together properly. But to be honest, the technical component of developing national infrastructure is (or can be) relatively straightforward: implement a DOI solution, manage stats through COUNTER, use JATS for metadata encoding, use OAI for data transfer, job done. But it turns out that any technical decision in fact requires a community discussion, and often requires a political solution. "Easy" technical issues are not necessarily easy or quick to implement! It’s all well and good, for example, to mandate that issue-level metadata in OJS must conform to a specific standard in order to be compatible with the Érudit platform - but try explaining that to a journal that only recently moved online and has its own design requirements that are met only by “mangling” the issue metadata in a non-standard way. The human and institutional management element of this endeavour is vexing, interesting and consequential - often all three at the same time.
In a perfect world, how would scholarly infrastructure be funded and governed?
Again from a Coalition Publica and national infrastructure-development perspective, I think that a lot of the development of this infrastructure should be (and already is) primarily federally funded. The resulting work should be financially supported by our federal initiatives, and available to the Canadian and international public via immediate open access. This moves us beyond subscription and even APC OA models and I think clarifies the general value proposition: scholarly communication is funded by and available to the public for public good.
The issue is in doing this at scale in a way that satisfies the requirements of all stakeholders: developers, infrastructure providers, publishers, hosting institutions like libraries, the journal management team, authors and finally, the readers. We are discovering with Coalition Publica that although our primary goals are universally aligned across all stakeholders, a lot of the specifics need continued reworking and consideration. Again, this speaks to the human component that I referenced in the previous question, which is fundamentally an issue of governance.
What are your favorite blogs, conferences, Twitter accounts, etc. to keep up on scholarly infrastructure?
Our own PKP Scholarly Publishing Conference is a favourite, but I'm biased. :-) We always have a great mix of presentations from service providers, infrastructure developers, institutions who use our software, and end-users (editors, researchers, etc.). It's an unusually good conference for understanding how all the different stakeholders uniquely traverse the scholarly publishing landscape, and it does a great job of reinforcing how important all stakeholder voices are. From a library publishing perspective, the Library Publishing Forum has been a must-attend event. Although the scope is perhaps a bit more limited to the experience and practices of libraries in this space, it also encompasses a wide range of user stories and experiences, and it has been fascinating to watch the role of “libraries as publisher” develop.
Favorite little-known fact or unsung hero?
If I could shout out a group of people as a collection of unsung heroes, it would be the documentation writers. Documentation writing was one of my first jobs at PKP, and it's such a tough thing to keep on top of. You have to track and document software changes, your knowledge has to cross technical boundaries, you have to be really good at explaining complicated concepts simply. It's a never-ending job where you are often playing catch-up to keep content current with the environment. At PKP we are really lucky to have a great documentation team that writes technical docs, end-user application guides, and more general scholarly publishing resources (like Getting Found, Staying Found, Increasing Impact, which talks a lot about caretaking metadata in a digital world and is worth reading regardless of whether you use our software). It took us 20 years, but I think our documentation effort can stand as a great example of what can be done by a dedicated and smart team.
What question do you wish we asked but didn’t and why?
I would love to have been asked, and would love to hear others' thoughts, about what we use as the infrastructure that supports the infrastructure we develop, in particular around the use of OSS technologies (or not). Right now PKP uses GitHub to foster code development, and Slack, Google Docs and Notion to foster communication and planning. None of these services are open source. We are continually interrogating our dependency on non-OSS, SaaS tools and whether they are a justifiable strategy for us. I have no clear answers here, except that for PKP expediency and ease of use often win the day over any moral argument for OSS, and that's a compromised position that we would like to move away from.
More Information: James MacGregor, Mike Nason, and PKP/OJS
James MacGregor has been working with PKP since late 2007, and in that time has dabbled in documentation writing, development, support, and outreach. James coordinated the development of PKP's Publishing Services into a core component of PKP's revenue model. His recent focus is on infrastructure building and larger service and project development, in particular as concerns Coalition Publica, a pan-Canadian scholarly infrastructure project. He also finds himself perpetually volunteered as PKP's usage stats expert.
Mike Nason is the Scholarly Communications and Publishing Librarian at the University of New Brunswick in beautiful Fredericton, New Brunswick, Canada. He's also the Crossref and Metadata Liaison/Support Associate with PKP Publishing Services. When he is not screaming/crying about metadata, he is on a bicycle or making loud noises.
Public Knowledge Project (PKP) is a multi-university initiative developing (free) open source software and conducting research to improve the quality and reach of scholarly publishing. Open Journal Systems (OJS) is an open source software application for managing and publishing scholarly journals. Originally developed and released by PKP in 2001 to improve access to research, it is the most widely used open source journal publishing platform in existence.