If all goes according to plan, I will be co-teaching a class this fall structured around topics in the analysis of genomes (genomics, transcriptomics, proteomics, and much more!). This will be part of a new 'section' or department within the small educational wing of our institution, focused on genomics and bioinformatics - a field in which many people are apparently quite keen to gain more experience. Interestingly, the organizer of this new section told me in conversation that one of the core principles he wants these classes to convey is an appreciation for both the utility of computational biology and the practical challenges that come with it (I'm paraphrasing, of course).
If it were up to me, familiarity with Linux, Perl/Python, and R would be part of the required coursework for any honors major in the life sciences (and I'm sure it would be useful in other sciences as well). We're quickly reaching a point where the ability to manipulate large datasets is unavoidable. Some labs cope with the lack of such knowledge by hiring dedicated statisticians or computational research associates, but that option isn't always available. Personally, I don't even like this route, because it forces you to rely on someone else's knowledge (and potential to make mistakes) in order to interpret your own data. If it is unavoidable, then you should at least be very clear about what you expect of your own ability to understand what was done1.
The sad fact of the matter is that the folks I've met who are completely unfamiliar with computational work are sometimes either a) intimidated by it, and thus not particularly interested in learning about it, or b) overconfident in what computers can actually do. The latter is typically more troublesome, because it can lead to all kinds of messes. For instance, I once heard about a group of researchers who had generated a very large and very comprehensive dataset for a genome-wide association study of an interesting disease. They had assumed that they could analyze their data on their personal desktop computers, and thus had not left any room in their budget to purchase the sort of high-end clusters (or access to such computing power) that are actually required to process data of that size.
As one of my former supervisory committee members was fond of saying: there are more possible alignments of two 300-nucleotide DNA sequences than there are elementary particles in the known universe. Even modestly sized datasets (I realize that 'modest' is subjective) can take weeks to analyze on high-end desktop PCs. People who work with this kind of stuff day in and day out use much more powerful clusters of processors designed for this type of work.
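If you want a back-of-the-envelope check on that claim, a few lines of Python will do. The count below uses one common convention, under which the number of gapped global alignments of two length-n sequences is C(2n, n); the exact number depends on how you define an alignment, so treat this as an illustrative sketch rather than a definitive figure.

```python
from math import comb

n = 300                      # length of each DNA sequence
alignments = comb(2 * n, n)  # C(2n, n): one common count of gapped global alignments
particles = 10 ** 80         # rough order-of-magnitude estimate of elementary particles

print(f"alignments ~ 10^{len(str(alignments)) - 1}")  # ~10^179
print("more alignments than particles?", alignments > particles)  # True, by ~100 orders of magnitude
```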
It's also important to keep in mind that a computer can only do what you tell it to do: hence the famous garbage in/garbage out principle. If you cannot conceptualize the solution to a complex problem, no amount of computational power is going to provide that solution for you. At first glance, this appears laughably obvious - but unfortunately, it comes up more often than many of us would like to admit. I don't know how many times I've seen people generate huge datasets of, say, expression data, and then expect a computer to find 'interesting patterns' in said data.
Like any other science, computational biology is still about testing hypotheses. Computers are allowing us to test hypotheses that were previously out of reach, but they do not obviate the need to follow the scientific method.
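To make that concrete, here is a minimal (and entirely hypothetical) sketch of the difference between asking a computer for 'interesting patterns' and testing a hypothesis you stated up front, e.g. "gene X is differentially expressed between control and treated samples." The expression values are invented purely for illustration.

```python
# A pre-specified hypothesis: gene X is differentially expressed between
# control and treated samples. We test exactly that, and nothing more.
# The values below are made up for illustration.
from scipy.stats import ttest_ind

control = [5.1, 4.8, 5.3, 5.0, 4.9]  # hypothetical log2 expression values for gene X
treated = [6.2, 6.0, 5.9, 6.4, 6.1]

t_stat, p_value = ttest_ind(control, treated)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

The computer does the arithmetic; the hypothesis, and the judgment about whether it is worth testing, still has to come from you.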
1Collaboration is both a necessary and desirable part of science - pooling minds from different backgrounds to tackle a project is almost always useful. However, I think it's important to remember that collaboration should not be an assembly line: each participant should be reasonably familiar with all parts of the process. Nullius in verba!