For several months now, I've been seriously toying with the idea of creating a set of materials to help scholars increase their "data literacy". This issue is an important one for FORCE11, as the ability to make data readily available is one of the core drivers of advances in scholarly communication. Surely, you might ask, scholars understand data. Don't scientists generate and work with data all the time? Yes, they surely do. Biologists in particular are data-generating machines. But traditionally, scholars haven't published the actual data; rather, they publish papers about their data. The raw or derived data on which their results and analysis are based are rarely provided.
We know from all the activities reported via FORCE11 that this situation is changing rapidly. We've seen a plethora of platforms and initiatives designed to promote publication and sharing of research data. But making data available and making them available in a form where they can be reused are not the same thing. For example, in a project where we at the Neuroscience Information Framework were charged with extracting gene expression information from tables and supplementary materials, we found examples where a supplementary table was provided as a JPEG. In other words, the supplemental table giving us the data was a picture of numbers. Pictures of numbers may be useful to a human, but not to a computer. For them to be useful to a computer, they have to be encoded as numbers. The same goes for text. Have you ever opened a PDF and found that you couldn't copy the text? The only way to copy it in that case is to take a screenshot. However, if you run the document through optical character recognition (OCR), it gets encoded as text and your computer allows you to perform operations that are appropriate to text: cut, copy, paste, spell check, etc.
In the supplemental material project, we found many more examples where it took considerable human effort to turn the data into something that could be used. Some of the problems were not as obvious as the JPEG. For example, authors like to insert subheadings into the middle of tables. We could extract the content using OCR, but subheadings screw up spreadsheets badly: a row that should hold data holds a label instead, so sorting, filtering and column calculations no longer work cleanly.
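To make the subheading problem concrete, here is a minimal Python sketch of one way to repair such a table. Everything in it is made up for illustration: the table contents, the column names, and the `tidy` function are assumptions, not data or code from the project described above. It treats any row that has a label but no values as a subheading and moves that label into a column of its own.

```python
import csv
import io

# Hypothetical extracted table: the rows "Cortex" and "Hippocampus" are
# subheadings that interrupt the data rows.
RAW = """gene,expression
Cortex,
Gad1,12.5
Bdnf,3.1
Hippocampus,
Gad1,8.0
Bdnf,4.7
"""


def tidy(raw_text):
    """Move subheading rows into a 'region' column on each data row."""
    reader = csv.reader(io.StringIO(raw_text))
    header = next(reader)
    tidy_rows = []
    region = None
    for row in reader:
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip completely blank lines
        if all(cell == "" for cell in cells[1:]):
            # A label with no values: treat it as a subheading.
            region = cells[0]
            continue
        # A real data row: tag it with the most recent subheading.
        tidy_rows.append([region, cells[0], float(cells[1])])
    return ["region"] + header, tidy_rows


columns, rows = tidy(RAW)
print(columns)
for row in rows:
    print(row)
```

After this step every row is a data row, and the information that used to live in the subheadings is still there as an ordinary column that can be sorted or filtered. Real supplementary tables are messier, so the rule for spotting a subheading would need adjusting case by case.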
So that is why I thought that some tutorials on working with data, directed at non-computer scientists, were sorely needed. Well, sometimes procrastinating saves you a lot of effort. I don't have to write my tutorials, because the School of Data did that and a lot more. School of Data is a project of the Open Knowledge Foundation designed to increase data literacy! The current site provides information and online courses on working with data, geared towards people who are not data experts. The current courses include data fundamentals, starting with What is Data?, and other modules such as basic data analytics. The site also includes a glossary, where terms such as API are explained. Most of the current content is geared towards and illustrated with spreadsheets, but future modules will be added for other platforms and for topics like visualizing networks. If you've ever struggled with Excel or would like to know why data and formatting don't go together, this is a great website.