Fears about 'Discovery Science'...
Tuesday, August 9, 2011 at 7:25PM
Given all of the recent difficulties I've had with regard to my career[1], I'm beginning to worry that my view of science has been somewhat naïve. See, I spent a good 10 years of training having the idea that hypotheses are important drilled into my head. Furthermore, every time I've applied for funding, I've been told (both by my supervisors and the granting agencies themselves) that having clearly defined hypotheses in my project proposal is a necessary requirement for success. And yet, having spoken to some fellow postdocs about their own work, I seem to be noticing a trend towards generating large, expensive datasets (particularly of the 'next-generation sequencing' variety) and then searching for 'interesting stories' within said sets. I've heard this referred to as 'discovery science' in the past.
Now here's a concession: There first needs to be an observation upon which to formulate a hypothesis. In other words, you first need to see some pattern in the data before you can speculate on its potential cause(s). Generating a large, expensive dataset may produce the observation wanting an explanation. So let me be more precise in saying that my issue here isn't so much with the lack of a hypothesis itself (which may come later), but rather with a lack of focus, or even of a scientific question in the first place.
There are an infinite number of potential datasets that one could generate - I could, for example, produce a dataset estimating gene expression levels in an adult dog's liver as well as a baby chimpanzee's cheek and call that a dataset. My observation may then be that 'sheesh, there sure are a lot of expression differences between these two species/tissues/developmental time points!' I could then formulate a number of hypotheses as to why this would be the case; but now I encounter a difficulty: It's very likely that this 'dataset' is totally inadequate for testing any meaningful, realistic hypothesis that I may generate. In fact, one may say that comparing these two 'unpaired' tissues between two distantly related species is rather 'random' and even 'weird'.
So let me step away from such an egregious example of poor experimental design by using it to make a point: the question you seek to answer ultimately determines how an experiment should be designed. As I've said numerous times in previous posts, experimental design is one of the most difficult things to learn as a junior scientist (and learning this should ultimately be the result of a Ph.D.). By extension, it's perhaps arguable that the ultimate 'thing' to master as one develops one's abilities as a scientist is how to come up with interesting, tractable questions.
My naïveté may ultimately stem from my conviction, up to this point, that everything within a given project stems from the question being asked, and the hypothesis being tested[2]. It determines the design of the experiment, the interpretation of the results, and the writing of the manuscript detailing those results. And yet, I've found myself - as well as folks I've spoken to - stuck in situations where I'm trying to find a 'story' for a number of observations at the point where I'm writing a manuscript. While I suppose that there's nothing wrong with generating new hypotheses even this late in the game, you may have to go do additional experiments in order to properly test said hypotheses (again, something I'm not particularly used to, as I generally don't begin writing a paper until I have a fairly complete story).
I'd be lying if I said that I didn't worry about such completely discovery-based projects, both on scientific and on practical grounds. Scientifically, hypotheses are best tested by data generated for this express purpose - oftentimes these large, exploratory projects 'test' hypotheses using pre-generated data (typically the data from which the observation precipitating the hypothesis was drawn) and a set of assumptions. This isn't always bad, but it could be avoided in some cases by more detailed experimental design in the first place. On the practical side, these projects are often very expensive. I've worked in labs both 'rich' and 'poor' and my personal experience is that one spends a lot more time carefully thinking about experimental design in the latter - you need to be able to get as much 'bang for your buck' as possible.
Regardless, this isn't an easily resolved issue: there are good arguments for generating 'big' datasets - they're often very useful to the community for generating and testing hypotheses down the road, for instance. On the flip side, when I hear folks saying things like 'Nature will have to publish this because it's such a huge dataset!', it somewhat undermines my 'naïve' view of what science is supposed to be.
[1] I left my current postdoc because of dissatisfaction with the work that I was doing. I'd like to discuss this carefully, in a blog post someday.
[2] I would like to note here that nowhere am I opposed to the question and/or hypotheses changing (even radically) based on new information during the process - only that the end result will be determined by the final question that's been settled upon.
Carlo
Reader Comments (4)
It is amazing how much genetics has been driven by technological progress. This has led to the return of descriptive science. I think many researchers are a bit bashful about that, because it sounds rather old-fashioned. Still, exploring and getting some intuition for the material is something that has to happen before we can do hypothesis-driven research again. If only we could get funding by just saying: 'we are going to start this genomics project because it seems like a lot of fun'.
I think the term "discovery science" is a bit of a misnomer - or perhaps redundant - since most science is involved in "discovery". I think the difference now is that the technologies available allow scientists to drastically increase the number of observations made simultaneously, and hence to generate several distinct questions (leading to a number of distinct hypotheses) from the same data set. The issue arises when the researchers do not properly report their stories. This will need addressing as the technology progresses - though some science may go back to cataloguing discoveries, similar to what happened before the 20th century.
Darrell
@Corneel, I agree, and that's why I say that there are good arguments for generating such data. Perhaps then, I could also bring up the concept of the 'new technology bandwagon', where people do things because they're 'hot' without actually thinking about the details of what they're doing? Honestly, I'm not against big data, but rather I've literally been in situations where a huge dataset that I wasn't particularly scientifically interested in was dropped into my lap with the words 'look for something interesting in here'.
@Darrell I don't disagree; however, I suppose that my contention would be that there needs to be some focus on what questions we're interested in asking before we generate the observations. As Darwin wrote in an 1861 letter to Henry Fawcett: ‘How odd it is that anyone should not see that all observation must be for or against some view if it is to be of any service!’
Oh, the Discovery Science conversation again. Hmmm. Not that I'm bored with such things, not at all, more that I have so many vague, nebulous opinions on these issues in my mind that I don't know where to begin. I'll just leave this little comment: the phenomenon of technology-driven studies is not restricted to molecular genetics and its offshoots. Any technological advance breaks trade-offs in data collection and analysis that existed before - you had to choose how much of your resources (time, money) to devote to collecting a dataset, and how much to analysing the data. A new technology breaks that trade-off by allowing one to devote the same resources to both activities as were previously devoted to only one, or even more.
In my own work, a big part of my current research is being driven by such a technological breakthrough - we have a machine that measures three gases (CO2, CH4, N2O) from soils simultaneously, where the previous generation of technology (which still retains some important advantages in certain circumstances) could only measure those gases one at a time; thus data collection is three times faster, for the same input of time and money (our machine costs about the same, in purchase, maintenance, and operation, as the previous tech). Analysis thus becomes at least three times more involved, and arguably more: interactions between analyses of formerly separately-considered phenomena mean that analysis complexity grows faster than data-collection complexity.
I hope that example illustrates that such things as sexy-hotness-tech drive many branches of science. I don't know how much closer such an acknowledgement brings us to a decision or conclusion about the relative merits of a priori and post hoc hypothesis generation, though.