Just returned from a recent trip to China, during which I attended an epigenetics retreat organized by Yang Shi and a young bioinformatics PI workshop organized by Yi Zhao. Both were excellent meetings, and one common theme came up: should computational biologists do experiments? I should first say that even as a computational biologist, I am relatively weak in computer science, statistics, and machine learning. If I had Hastie/Tibshirani's machine learning skills or Speed/Jun Liu's statistics skills, doing experiments might not be necessary. With my limited quantitative abilities, my answer to the question is: we should do cutting-edge high-throughput experiments, develop computational methods that help advance good techniques, answer useful biological questions in our own domains of interest, and continue to collaborate with experimental biologists.
I believe genomics and bioinformatics are like molecular biology in the 1980s: very useful, but only tools to answer specific biological questions. From the recent epigenetics retreat, it is quite clear that many experimental labs are becoming very genomics- and bioinformatics-savvy. They have overcome the genomics learning curve, which will force computational biologists to be more independent. We have to have our own biological domain and our own biological questions. Sometimes it is impossible to answer a specific biological question by mining public data alone, so doing some experiments to generate data is necessary. Of course, for very focused questions, if simple experiments are enough to answer a great biological question, by all means do them. And if we can pay a company or core facility to do them for us, even better. Make sure the hypothesis is generated from mining a lot of data, and check the literature to make sure it hasn't been published already. If it is an important biological question that any experimental biologist could easily come up with, and the experiments are easy, rest assured the experimental biologists have tried them already. If the hypothesis is novel and arises from mining and modeling public data, but experimental validation is complicated, I would suggest collaborating with an experimental group to validate it. The problem with doing experimental validation ourselves is that, depending on the hypothesis, validation could mean a cell biology experiment one day, a biochemistry experiment the next, an imaging experiment the third day, and an animal model experiment the week after. We simply don't have the capacity to learn them all. This is no different from experimental biologists collaborating with each other, and collaborating with experimental biologists doesn't mean we don't have our own biological questions.
As long as the biological question is really good and the evidence for our hypothesis is strong enough, there will be experimental groups willing to help us, especially if we have helped those groups with informatics before.
There is one type of experiment that computational biologists should give priority to: experiments that not only generate data for us to analyze or validate, but produce interesting data that let us develop computational algorithms. I would quote a colleague and collaborator of mine, Mitch Lazar: "computational biologists should stay at the cutting edge of technologies". As computational biologists, we are most likely not strong enough in genomics to invent new techniques (if you can, that's great), and inventing a new technique is risky and consumes time and money. Instead, early adoption of new techniques might be a better option. We can adopt these techniques to answer interesting biological questions, develop computational methods that help other biologists adopt them (e.g. MACS for ChIP-seq), identify potential biases the technique developers had in analyzing their own data (stay tuned for our Nature Methods paper on DNase-seq analysis, which is in revision), and find novel uses of these techniques that the original developers didn't intend (e.g. using histone mark ChIP-seq to predict nucleosome positioning and TF binding).
In addition to being cutting edge, the experiments we do should:
1. Generate high-throughput data so we can apply our bioinformatics expertise. E.g. even though CRISPR/Cas9 is cutting edge, it might not be high throughput enough by itself, so we need to combine it with another good cutting-edge high-throughput technique to make it work for us.
2. Really be as good as the technique developers claimed in their paper figures. With high-throughput data, anybody can pick some examples to show how well a technique works. The key is to download the data and see how noisy it is, whether the conclusions hold, and whether it really answers questions that previous experimental methods couldn't. Sometimes one dataset is not enough and might lead to methods that overfit, so we need at least three good datasets to develop a working algorithm.
3. Answer the biological questions we are interested in. E.g. if I am interested in transcription regulation, then Ribo-seq would not be a good fit for me, because it investigates translational efficiency, not transcription regulation.
4. Not be too hard or too expensive to do. First, this allows us to master the experiment on a reasonable time frame and budget. Also, for an algorithm to work well, there must eventually be enough public data to help improve it, and the wisdom of the crowd is important for evaluating different algorithms. E.g. for protein-protein interactions (PPI), there are only three groups generating the high-throughput data and 1,000 bioinformatics groups developing algorithms; in the end the data-generating groups' algorithms always win, no matter how much better the other algorithms are.
Computational biology PIs like me are often no good at attempting these experiments ourselves. We simply don't have the time or the steady hands to make the experiments work, although as a young PI I had the opportunity to make such foolish attempts. Instead, hire a capable experimental postdoc to do it, just as experimental PIs recruit capable bioinformatics postdocs. Even better, co-supervise the experimental postdoc with an experimental collaborator. This helps recruit a better experimental postdoc, and together the two labs can come up with a better experimental design and biological question. The experimental postdoc will also be more motivated trying these new techniques than serving as an experimental technician for a computational biology lab. The experimental group can help troubleshoot experimental difficulties, and the computational group can share its informatics expertise in analysis and algorithm development. Once we master a technique, we can use it to answer interesting biological questions in other systems and help the community adopt it, both experimentally (teaching others how to do the experiments) and computationally (developing algorithms for people to use in analyzing their new data). I really admire Myles Brown and Jason Carroll, as well as many pioneers in the genomics community, for their generosity in helping so many labs learn new genomic techniques. The field is moving so quickly that keeping techniques secret only results in their being overtaken by newer techniques. Also, I believe in Benjamin Franklin's idea of "doing well (for themselves) by doing good (for others)".
Besides the above reasons, doing experiments is also a logistical necessity in China. Funding agencies cap salaries at 15% of the total grant amount. With an RMB 1M grant, if a computational biologist only spends 150K on salary and another 150K on equipment and supplies, the funding agency might give only 300K the next round, of which only 45K can be used for salary, and so on. Computational biology labs would be forced to close down within a few years if they didn't do experiments.
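To make the arithmetic concrete, here is a toy sketch of how that shrinkage compounds. The 15% salary cap and the RMB 1M starting grant are from the paragraph above; the assumption that the next grant roughly tracks what was actually spent, and the function name itself, are mine for illustration.

```python
def next_grant(current_grant, salary_frac=0.15, supplies_frac=0.15):
    """Assume the next grant roughly equals what was actually spent:
    the 15%-capped salary plus a similar amount on equipment/supplies."""
    return current_grant * (salary_frac + supplies_frac)

grant = 1_000_000  # RMB 1M starting grant
for funding_round in range(3):
    salary_cap = 0.15 * grant
    print(f"round {funding_round}: grant {grant:,.0f}, salary cap {salary_cap:,.0f}")
    grant = next_grant(grant)
```

Under these assumptions the grant shrinks to 30% each round (1M to 300K to 90K), and the usable salary shrinks with it (150K to 45K to 13.5K), which is the few-years closure the paragraph describes.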
Update in 2014: CRISPR/Cas9 is becoming high throughput, which means we are getting excited about it. Wink wink!