If one were to list the traits that separate 'scientific knowledge' from 'intuition' or 'revealed knowledge', one might choose traits such as a) demonstrability, b) testability, c) reproducibility, d) consistency with previous scientific knowledge, and e) openness to revision in light of new observations, among others1. As some scientific fields become more complex, involving highly specialized, technical analyses of ever-larger amounts of data, the question of 'reproducibility' is becoming a serious concern.
One concern is practical: If I generate terabytes upon terabytes of data, I have to be able to provide those data to the community, should anyone want to reproduce my work. Not every lab has the expertise (or the resources) to host its own data on the web, so such data have been stored in large, public repositories such as those found at the National Center for Biotechnology Information (NCBI). However, as several op-eds in the recent scientific literature have pointed out, the cost of data generation (mostly in the form of nucleic acid sequencing) is dropping faster than the cost of the hard drives required to store the output. It's therefore quickly becoming cheaper to simply store the samples (and resequence them as needed) than to store the raw data themselves2.
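To see why the arithmetic can tilt toward resequencing, here's a toy back-of-the-envelope sketch in Python. Every number in it is a placeholder I've invented for illustration (not a real price quote), and the break-even point obviously shifts with real figures:

```python
# Toy break-even sketch: store the raw data vs. keep the sample and resequence on demand.
# Every number below is a hypothetical placeholder, NOT a real market price.

STORAGE_COST_PER_TB_YEAR = 25.0  # $/TB/year to keep raw data on disk (assumed)
RESEQUENCING_COST = 500.0        # $ to rerun a stored sample if anyone asks (assumed)
DATA_SIZE_TB = 4.0               # raw sequencing output per sample, in TB (assumed)
REUSE_PROBABILITY = 0.05         # chance anyone ever re-requests the data (assumed)

def cost_to_store(years: float) -> float:
    """Total cost of keeping the raw data online for `years` years."""
    return STORAGE_COST_PER_TB_YEAR * DATA_SIZE_TB * years

def expected_cost_to_resequence() -> float:
    """Expected cost of discarding the data and resequencing only if asked."""
    return RESEQUENCING_COST * REUSE_PROBABILITY

for years in (1, 5, 10):
    print(f"{years:2d} yr: store = ${cost_to_store(years):7.2f}, "
          f"resequence (expected) = ${expected_cost_to_resequence():.2f}")
```

The specific numbers don't matter; the structural point is that storage is a recurring cost that scales with data volume and time, while resequencing is a one-off cost you only pay if someone actually asks for the data.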
A second concern is more personal: Some of the analyses being done nowadays are very, very big and very, very complicated. Anyone who's ever been involved in a genome consortium paper (and I've been involved in a few now) knows that a < 10 page Nature paper may involve the work of 40+ people, each contributing significant details to an analysis. Collating all of those details into a single story is challenging, and unfortunately, crucial details about how things were done sometimes get lost in the shuffle.
It's not only consortium papers, though - anecdotally, I've heard several scientists complain, upon reading new papers, that they couldn't reproduce the details of the analysis. Either the descriptions are too vague, or the available data are not in a format that makes it easy to understand where they came from or how they were generated.
In the past month, I've come across a few papers whose details and methods I'd love to adapt to answer some of my own questions. However (and you can probably see where this is going), I can't figure out how to square the supplementary data that are provided with the details of the analysis: Why is it that the paper consistently refers to 8 classes of data, whereas the public repository entry contains 14? Unfortunately, it's an unnecessary impediment to getting my work done.
If scientific data are going to be held to the somewhat lofty standards of the discipline, we need to make sure that they're well documented and reproducible3.
1The difference between 'demonstrability' and 'testability' is the subject of much debate in some scientific circles. For instance, if a theory is testable in principle (e.g., given the appropriate conditions/equipment) but not in practice (e.g., the equipment required is far beyond the scope of current knowledge), can it truly be called scientific? There are many reasonable hypotheses about things like life on other planets that just aren't testable in the foreseeable future. That's not to say that such fields aren't 'useful' or interesting; however, the more appropriate question may be whether funding should be prioritized towards testable hypotheses.
2There are all kinds of issues of practicality and reproducibility when we discuss such things - I'm only making the point in purely economic terms.
3I have a nitpicky, specific discussion point about online data repositories that I think I'm going to save for a follow-up post.