I am preparing for the STAT115/STAT215/BIO512/BIST298 course for spring 2013. For the first lecture, an introduction to bioinformatics, I tried to look up a history of the field. A quick search turned up two interesting resources, 1 & 2. Both cover only early bioinformatics (pre-2005), but they are a very interesting read.
Bioinformatics research progresses in cycles. Whenever a new high-throughput technology appears, there is often a wave of new algorithm development in the field. Looking back over the last 15 years, expression microarrays, proteomics, tiling arrays, SNP arrays, and 2nd-generation sequencing represent these technology waves, although of course 2nd-gen sequencing has many more applications than the other technologies. Informatics groups that do well in these waves include those with earlier access to the new data types (e.g. Broad Institute), those who truly understand the biology behind the technology's applications (e.g. alternative splicing), and those with unique statistical and algorithmic expertise (e.g. the Burrows-Wheeler transform). Algorithms developed during these waves can be wildly successful, although the developers must keep maintaining and updating them to hold the lead and have longer-term impact.
Then there is the period in between technology bursts, such as motif finding in 2004, microarrays in 2006, and sequencing now. I wouldn’t say there is no technology development now, but it is certainly not at the same level as 2nd-gen sequencing in 2007. This is a time when many groups publish algorithms for the same tasks with “small improvements”, although most people stick to a few winners (e.g. RMA and LIMMA). Some develop tools, pipelines (which involve less algorithm development) and databases that help the community. Many informatics groups start collaborating with experimental biologists, using existing algorithms or tools to better understand biological mechanisms. Most groups are very busy but are also trying to find a breakthrough.
Actually there is a third type of bioinformatics research, which is data integration. Investigators either develop new tools or use existing ones, but start from integrating publicly available data instead of working on collaborators’ unpublished data, to make novel and interesting biological discoveries. Success in this area requires not only excellent informatics sense and biological knowledge, but also excellent infrastructure to store and integrate the data and to experimentally validate the predictions (by themselves or with collaborators), and sometimes a little storytelling skill also helps. There are groups that integrate unpublished data (e.g. ENCODE), but more respectable are the groups who can integrate public data and still make good discoveries. This type of bioinformatics scientist can transcend time and trend, much as Warren Buffett does among investors. However, it takes dedicated effort to gradually accumulate the expertise and infrastructure.
As bioinformatics scientists, we could alternate between the first and second types depending on the times, but should always think about developing in the third area.
In medicine, a biomarker is a measurable characteristic that reflects the presence or severity of some disease state. Pharmaceutical companies devote tremendous money and effort to identifying biomarkers; a drug is dead without biomarkers. However, in academia, biomarker research seems to have hit a wall. There have been many 50-gene, 70-gene, 100-gene, etc. biomarker panels derived from some cohorts that totally fail to work in other cohorts. As a result, computational biomarker studies are now often published in BMC-type journals. What can be done?
Cancer is a heterogeneous disease, so it is not surprising that biomarkers derived from genomics studies in medium-sized cohorts fail in other cohorts. Most of the time, scientists don’t know the mechanisms underlying these biomarkers or why they work or fail. Theoretically, better biological knowledge (from pathways, signal transduction, protein-protein interaction, transcriptional and epigenetic regulation, microRNA regulation, etc.) could help explain the mechanism of the biomarkers. As a result, biomarkers identified with mechanistic understanding might not be the strongest (e.g. most differentially expressed) but could be the most robust (reproducible). Besides, a biomarker with a few features is a concept much older than genomics techniques such as expression microarrays and RNA-seq. As the cost of RNA-seq continues to fall, why not use the whole transcriptome, as it should be the most informative and robust biomarker? In addition, some biomarker tests have been patented and are off-limits to many companies. A transcriptome biomarker would overcome this barrier, since every gene could be considered.
For many years, we have been using binding sites within 100 kb of a gene's TSS, weighted by distance, as a way to assign binding to genes. Most of the time, we thought binding further away might not be as important for transcriptional regulation. Techniques such as Hi-C and ChIA-PET can detect long-range chromatin interactions, thus allowing better assignment of binding to genes. However, so far the resolution and sequencing depth of Hi-C and ChIA-PET are still too limited to really help with binding site assignment.
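The distance-weighted assignment described above can be sketched in a few lines. This is only an illustration: the post does not say which weighting function is used, so the exponential decay, the `decay` length of 10 kb, the function name `regulatory_score`, and the peak coordinates are all assumptions made for the example. Only the 100 kb window comes from the text.

```python
import math

def regulatory_score(tss, peaks, window=100_000, decay=10_000):
    """Sum distance-weighted contributions of binding sites near one TSS.

    tss    : TSS coordinate (bp) on a single chromosome
    peaks  : binding-site coordinates (bp) on the same chromosome
    window : sites farther than this from the TSS are ignored (100 kb in the text)
    decay  : hypothetical decay length; the weight exp(-d / decay) is an
             assumed form, not a weighting stated in the post
    """
    score = 0.0
    for pos in peaks:
        d = abs(pos - tss)
        if d <= window:
            score += math.exp(-d / decay)
    return score

# Made-up coordinates: two sites inside the window, one 250 kb away.
peaks = [1_000_500, 1_020_000, 1_250_000]
print(regulatory_score(1_000_000, peaks))
```

Note that the site 250 kb away contributes nothing at all under this scheme, which is exactly the limitation the long-range interaction data call into question.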
In a recent Mol Cell paper and another Cell paper in press from Matt Freedman's lab, binding sites more than 100 kb away were found to have a strong transcriptional effect, so long-range chromatin interactions can indeed have a major effect on transcriptional regulation. The Mol Cell paper is really interesting in showing, for the first time to my knowledge, that the production of eRNA at distal p53 binding sites has a strong transcriptional regulation effect on faraway genes important for stress response. This is a little surprising given that data from Lee Kraus’ lab show that knockdown of eRNA does not influence binding and histone marks of the gene, which suggested that eRNA doesn’t have much effect on transcription. Maybe their experiment was done too soon after siRNA treatment, and histone marks are more stable. The Mol Cell paper also showed that some binding sites persist despite siRNA knockdown of the TF itself, which is also what we see in our own studies.
This has several implications:
1. We should really learn to do GRO-seq. If a distal binding site shows eRNA in GRO-seq, it is more likely to be a functional binding site.
2. Recently we have been using histone marks to define Hi-C domains and trying to use them to help predict target genes. Actually, binding sites even across multiple domains might still be physically close and interact with each other. So binding assignment should look not only at distance, but also at whether the binding site is in the SAME or a SIMILAR (not necessarily nearby) domain.
3. Should we bite the bullet and try ChIA-PET or Hi-C, and hope sequencing costs come down sooner rather than later?
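As a rough illustration of point 2, here is one way the SAME / SIMILAR domain check could look. Everything concrete here is hypothetical: the domain coordinates, the "active"/"repressed" labels, and the reading of "similar" as "different domain with the same label" are assumptions for the sketch, not the actual method.

```python
from bisect import bisect_right

# Hypothetical domains on one chromosome: (start, end, label),
# sorted by start and non-overlapping. Labels stand in for whatever
# histone-mark-based domain type is used in practice.
DOMAINS = [
    (0, 500_000, "active"),
    (500_000, 900_000, "repressed"),
    (900_000, 1_600_000, "active"),
]
STARTS = [d[0] for d in DOMAINS]

def domain_index(pos):
    """Return the index of the domain containing pos, or None if uncovered."""
    i = bisect_right(STARTS, pos) - 1
    if i >= 0 and pos < DOMAINS[i][1]:
        return i
    return None

def domain_relation(peak, tss):
    """Classify a binding site / TSS pair by their domains."""
    p, g = domain_index(peak), domain_index(tss)
    if p is None or g is None:
        return "unassigned"
    if p == g:
        return "same"
    if DOMAINS[p][2] == DOMAINS[g][2]:
        return "similar"  # different domains sharing a label
    return "different"

print(domain_relation(100_000, 400_000))    # both in the first domain
print(domain_relation(100_000, 1_000_000))  # two separate "active" domains
```

A relation like this could then be combined with the distance weight, for example by not discarding a distal site outright when it falls in the same or a similar domain as the gene.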