The tyranny of formatting
Scene: It is 4 am and the grant is due the next day. You have 12 pages to tell a story that will determine whether or not you can pay your salary and support your lab next year. You delete a sentence on page 5. The document swells to 13 pages. Oh no! The figures have all jumped around! The next paragraph is now in Courier 14 bold! You hit your head against the screen and spend the next hour trying to get everything back in place and the sentence deleted. The next day, you’re tired, irritable and have a broken computer screen.
Scene: My colleagues and I are writing a paper together. We spend weeks in Google Docs working on a collaborative document. We are all almost happy with the text. We then download it and import it into Word so we can work on the formatting and the figures. We then put it in Dropbox or email it around where all the versions get out of sync. My final edits don’t get copied into the final version. We get criticized for being sloppy.
Scene: You just got your rejection letter from Nature. They liked your paper, but just not enough for this journal. They recommend Glia. You studiously look up the instructions for authors in Glia. You used a reference manager, but the references got screwed up anyway when you put them in Nature format. All the EndNote numbers are there, but it messed up your Word document, so you did some hand editing. You have to change “Methods” to “Materials and Methods” and make other cosmetic changes. You spend a day reformatting the paper for Glia; you don’t prepare for your lecture, you don’t mentor a student, you don’t do your research; you don’t do anything but move around and rename sections and numbers, and list the first 6 authors instead of the first 5. Science is stalled.
Scene: My child has a rare neurological disorder. A group of text miners offer to help mine the literature for a new druggable target. We ask a Foundation for access to all the literature on this disease. We are given 5000 abstracts and 600 pdfs from open access journals. We spend months working on agreements with the publishers, each of whom owns their own chunk of the field, for their content and go to open source repositories and authors. We get the other 4400 pdfs after a year. We spend 6 months cracking the pdf, trying to align the sections, and massaging the XML. My child can no longer walk.
Who is the common villain in all of this? You might be tempted to blame it on the publisher, and you wouldn’t be 100% wrong. But actually, you should blame it on formatting. At Beyond the PDF, Kaveh Bazargan put forth his proposal in the Vision session: “Why we should publish in XML and nothing else”. He didn’t win. But if you think about it, it’s actually a rather compelling vision. It’s not that he’s against formatting, it’s that formatting should be the “clothes that make the man” and not the man himself. The man should be born naked, as he is, and then dressed up to suit the occasion. And if all papers, as naked babies, are structured more or less the same, we can build all sorts of ways to add value to them.
So now let’s think about formatting. It is formatting that causes me to hit my head against the screen at 4 am. My number of words didn’t change in a significant way; the only thing that changed is how they were laid out on the page. Page numbers are a relic of the printing press. Look at your e-book on Kindle or the iPad. There are no page numbers because there are no pages. If limits on grants and publications were consistently stated in words and not pages, the problem would go away.
Why did I have to leave Google Docs? Because everyone knows that when it comes down to formatting and figures, Google Docs doesn’t cut it just yet. So we have to exit our workspace and go to a word processor. Our paper didn’t change in a fundamental way; we were happy with the content. Anyone try to reconcile 3 Word documents on a Mac? There are probably great ways to do it, but I haven’t found them. Additional formatting is truly “an exercise in irrelevance”, to quote Philip Lord’s excellent blog.
Why, when we have a reference ID, the DOI, that allows a publisher to retrieve all metadata for a piece of scholarly work, is the author being tasked with conforming to a reference style? Even if it requires a publisher to pay a company somewhere to format it, isn’t that what they do already? Why is the scholar doing this? Why do the publishers then have to pay someone else (CrossRef) to put back all the information that was stripped out of the reference in the first place?
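To make the point concrete, here is a minimal Python sketch of what "reference style" amounts to once the metadata is structured. Everything in it is an assumption for illustration: the record is a made-up stand-in for what a DOI lookup might return, the DOI and names are fictional, and `format_reference` is a hypothetical helper, not any publisher's or reference manager's actual code.

```python
# Hypothetical metadata record, standing in for what a DOI lookup might return.
# The DOI, names, and field layout are made up for illustration.
RECORD = {
    "doi": "10.1234/example.5678",
    "authors": ["Ng A", "Ortiz B", "Patel C", "Quinn D", "Rossi E", "Sato F", "Tran G"],
    "year": 2013,
    "title": "An example title",
    "journal": "Example Journal",
}


def format_reference(record, max_authors):
    """Render one reference, listing at most max_authors names."""
    authors = record["authors"]
    names = ", ".join(authors[:max_authors])
    if len(authors) > max_authors:
        names += ", et al."
    return (
        f"{names} ({record['year']}). {record['title']}. "
        f"{record['journal']}. doi:{record['doi']}"
    )


if __name__ == "__main__":
    print(format_reference(RECORD, 5))  # a journal that lists the first 5 authors
    print(format_reference(RECORD, 6))  # a journal that lists the first 6 authors
```

Switching from the first 5 authors to the first 6 is a one-argument change for a machine. It is a lost day only when a human has to do it by hand.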
And finally, as I learned from the School of Data, computers hate formatting. The first step of converting data into something that a machine can read is to strip out formatting. So if machines like to access XML or HTML5 or whatever, why are we requiring all these transformations from text to XML to PDF to XML to PDF to XML as the paper wends from word processor to submission to editing to final version? Who knows what comes out the other side and what happens to it along the way?
Kaveh points out that there needs to be a version of record. Isn't it somewhat amazing now that we don't have one? What is the version of record? The original document? I never go back and make the changes that the copy editor requests on my original text file, but that might be what I upload to my institutional repository or what I send to a colleague, if I don’t have the rights to the pdf. The pdf at the publisher? The version at PubMed Central? Can we guarantee that they are all the same? It seems rather ridiculous that in 100 years, we might have people publishing doctoral dissertations on 21st century writings trying to determine the author's original intent, much as current scholars try to reconstruct Euripides' or Shakespeare's intent from surviving versions.
So what is the alternative? Let’s envision a world where we have a standard XML or at least an interoperable XML for representing scholarly papers. Wouldn’t it make more sense, as Kaveh suggested, to have the XML version be the version of record and then have publishers, authors, whomever, dress it up whatever way they want? How many different ways are there of structuring a scientific paper? Why can’t the authors write their papers in generic format and then have the publishers present them as they wish? Why can’t the XML peer reviewed and typo-corrected version be deposited in a global ArXiv repository where it is accessible by text miners along with everything else? And why can’t this version appear as a PDF, HTML, DOC, or anything else it needs to be?
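Here is a small Python sketch of that idea: one XML record, dressed two different ways. The markup is invented for this example (the `paper`, `section`, and `p` elements are assumptions, not a real publishing schema), and `render_text` and `render_html` are hypothetical helpers. The version of record is never modified; only the rendering changes.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, minimal markup invented for this sketch; it is not JATS
# or any other real publishing schema.
PAPER_XML = """
<paper>
  <title>A naked paper</title>
  <section name="Methods"><p>We wrote the content once.</p></section>
  <section name="Results"><p>It rendered everywhere.</p></section>
</paper>
"""


def render_text(root, headings=None):
    """Plain-text view; headings maps generic section names to a house style."""
    headings = headings or {}
    lines = [root.findtext("title"), ""]
    for section in root.findall("section"):
        name = section.get("name")
        lines.append(headings.get(name, name).upper())
        lines.extend(p.text for p in section.findall("p"))
        lines.append("")
    return "\n".join(lines)


def render_html(root, headings=None):
    """HTML view of the same record, with the same optional heading map."""
    headings = headings or {}
    parts = [f"<h1>{escape(root.findtext('title'))}</h1>"]
    for section in root.findall("section"):
        name = section.get("name")
        parts.append(f"<h2>{escape(headings.get(name, name))}</h2>")
        parts.extend(f"<p>{escape(p.text)}</p>" for p in section.findall("p"))
    return "\n".join(parts)


if __name__ == "__main__":
    record = ET.fromstring(PAPER_XML.strip())
    print(render_html(record))
    # A journal that prefers "Materials and Methods" supplies its own label.
    print(render_text(record, headings={"Methods": "Materials and Methods"}))
```

Renaming “Methods” to “Materials and Methods” becomes a publisher-side lookup table rather than an author-side edit.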
Isn’t it time that we as scholars throw our copy of Microsoft Word out the window and say “I’m mad as hell and I’m not going to take it anymore”? Has anyone tried to calculate the unproductive hours spent by scholars on formatting? Has anyone else been reduced to tears by jumping figures at 4 am? Has anyone then asked “Why?”
Kaveh may not have won because to the crowd at Beyond the PDF it seemed too obvious or unsophisticated. But isn’t it the obvious that we should be going after in FORCE11? We have on-line authoring tools available; we have the formats; we have institutional repositories and the arXiv model. If we have all the tools and technology and formats necessary to make content accessible to the web, text miners and publishers alike, then the obvious questions are “Why aren’t we?” and, “What would it take to make it happen?”
Scene (2014): I sit down to write my paper, because despite all advances, I still like to craft an argument in narrative, as my forebears have done for millennia. I log into (Annotum, Authorea, WordPress, Google, Microsoft, LaTeX; insert your tool here). These authoring tools are specially designed for scholarly publishing styles. My co-authors and I log in with our ORCID iDs. We compose our paper with the appropriate sections. I insert my reference DOIs from (Mendeley, EndNote, Zotero). I insert my figures’ reference IDs from (my desktop, Cell Image Library, Figshare). I reference my data sets’ DOIs according to Amsterdam Manifesto standards from (Dataverse, NIF, CDL); my workflows from (WorkFlows 4 ever, MyExperiment); my code from (GitHub, SourceForge). All my coauthors hit “Acceptable” and I hit “Send”.
Send to whom? Why, my preferred broker for peer review (Nature, Journal of Neuroscience, PeerJ, Rubriq). My first choice doesn’t want me, so I go to my second. They will take it but want it reduced to 5000 words and 5 figures. OK, deal. I refine my argument and shorten. By doing that, some references disappear from the text. Bye. I send it off. The peer reviewers request refinements and catch some typos. I fix them. The Glia copy editor finds some additional mistakes. I correct them. When the mistakes are fixed, my coauthors say “Fine” and I say “Final”. We have gotten our permission slip and our work may now be admitted into the Body Scholarly for posterity. The (XML, HTML5, whatever) version is stamped “Version of Record” and it enters into the global ArXiv for scholarly communications.
Glia takes my XML and makes it look very pretty. They advertise it on their website and sell the pretty version back to institutions and individuals so they can read it. Some scholars have accused the publishers of existing only to sell us formatting. That’s OK. I pay for formatting all the time. I pay extra for the floral pattern on the Kleenex box because it goes with my color scheme. But my mother, who has no institutional subscription, can go to PubMed Central and get my vanilla-formatted but still functional document. Sometimes, I buy generic.
Meanwhile, my text miner friends have a subscription to the Global ArXiv via their institution or a personal one, if they are not affiliated with an institution. They agree that they will not try to recreate an individual article for resale and then mine away. Perhaps a new target is discovered for my child’s disease or maybe there isn’t enough information available. But at least we have uniform access to the entire corpus in a form suitable for mining.
Now, with all my extra time saved by no formatting, I am largely done writing my grant proposal at 11 pm instead of 4 am. It needs to be 10,000 words with each figure counting for 500 words. At 11 pm, I am 10 words over. I delete a sentence and hit “Done”. At 11:15, I’m brushing my teeth. Grant reviewers get access to a nifty program that allows them to view/print/read in a variety of formats depending on their medium. One reviewer prefers to print things out on paper. Each figure is a full page and the font is 18 point so the grant is 40 pages long. But a younger reviewer with better vision who still likes paper chooses 12 point font and a half page for figures and her version is 12 pages. Another reviewer has an iPad and likes to swipe and zoom but he still has my same 8,000 words and 4 figures. Same content, same length, no extra burden on the reviewers, just different formatting.
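The arithmetic behind that limit is simple enough to sketch in a few lines of Python. The two constants come straight from the scenario above; the function names are hypothetical. Note that 8,000 words plus 4 figures at 500 words each comes to exactly the 10,000-word limit, whatever the font size or page count.

```python
WORDS_PER_FIGURE = 500  # from the scenario: each figure counts for 500 words
WORD_LIMIT = 10_000     # from the scenario: the proposal limit


def effective_length(words, figures, words_per_figure=WORDS_PER_FIGURE):
    """Length of a proposal in word-equivalents, independent of layout."""
    return words + figures * words_per_figure


def words_over_limit(words, figures, limit=WORD_LIMIT):
    """How many words must be cut; 0 if the proposal already fits."""
    return max(0, effective_length(words, figures) - limit)


if __name__ == "__main__":
    print(words_over_limit(8010, 4))  # the 11 pm draft, before deleting a sentence
    print(words_over_limit(8000, 4))  # after deleting it
```

The check depends only on content, so deleting a sentence can never make the document longer.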
There are many other current ways to enhance scholarly communication and many that haven’t yet been invented. But I think decoupling the front-end human-readable format from the back-end machine-readable format is so fundamental to our current difficulties that we should tackle it collectively. And I believe it is doable and just might be the carrot that induces our scholars to abandon their current models and move towards the future. From that, all else flows.