Oct 29, 2014

Rick Young has been a pioneer in transcriptional and epigenetic gene regulation. I have spent almost my entire scientific career reading Rick’s papers. Besides the scientific findings and insights, there are some other interesting things I have learned from them:

1. Beautiful paper figures. To get our papers published in high profile journals, significant findings and convincing data are certainly important, but clear and beautiful figures are also very important for getting the points across. The figures in Rick’s papers, e.g. his recent Dowen et al, Cell 2014 paper, are works of art in themselves. Even if we can’t always publish high profile papers, it wouldn’t hurt to learn to generate beautiful figures that look like those appearing in high profile papers (as the Chinese saying goes: even if you have never eaten pork, surely you have seen a pig run).

2. Updated and comprehensive paper website. A great example is the website accompanying Rick’s seminal Lee et al, Science 2002 paper, with comprehensive data download / protocol / reagent / analysis descriptions. His current website download page not only has a long list of links to his published data, but also a similarly long list of unpublished data. You kinda marvel at how organized he and his lab must be.

Oct 11, 2014

Over the years, I have decided to categorize bioinformatics (or computational biology) research into the following five levels. Here I don’t make any distinction between bioinformatics and computational biology, and so I use the words interchangeably.

Level 0 (if you remember Kung Fu Panda) is “modeling for modeling’s sake”. I remember years ago someone asked me, “There are plenty of opportunities to do modeling with the large amount of available GEO data, so what project shall we work on?” I asked him, “What problems would you like to answer?”, and he answered, “Modeling problems”. This is totally OK if the scientists only consider themselves mathematicians, statisticians, computer scientists, or physicists, since there are indeed many good theoretical modeling problems in their respective fields, but not OK if they are serious about bioinformatics or computational biology research. Many Level 0 bioinformaticians never read or publish in biological journals or attend biological conferences, so in a way they haven’t got in the door of biomedical research yet. Level 0 research is often read and cited only by the authors themselves and by other people who also conduct only Level 0 research, so it is quite a waste of resources.

Level 1 is analyzing unpublished data from one’s own lab or collaborators and trying to make novel biological findings. This is a much more useful endeavor than Level 0 bioinformatics, and is a great way to train bioinformaticians. We can practice our existing bioinformatics skills to make real biological findings, learn new bioinformatics skills and a great deal of biology, and more importantly stimulate insights and ideas for Level 2 and Level 3 projects. A Level 1 study can be evaluated on several points: how complicated the data is (e.g. the total data volume and the data types); whether the bioinformatician needed to create new algorithms or only used other people’s tools to analyze the data (e.g. from the methods section); how essential the bioinformatics analysis is to the overall project (e.g. how many figures were generated by the bioinformatician and whether the main hypothesis came from an informatics analysis); whether the experimental and computational sides had real, fruitful interactions; and whether there are real and significant biological findings in the study (from reading the abstract). On the interaction point, more cycles of experimental / computational result description in a published paper suggest that the experiments and computational analyses informed each other for the next step. This is in contrast to studies where all the data was generated first and bioinformatics analysis followed to summarize and integrate it; such studies sometimes have no real findings, and thus only descriptive results and no experimental validation.

Level 2 is developing 1) a method to solve a general quantitative problem in big data studies that is especially relevant to biomedical research (e.g. Qvalue for FDR), 2) computational algorithms for analyzing data from a new high throughput technique (e.g. RMA or Bowtie), or 3) databases or resources for integrating many other public data (e.g. Oncomine). I consider this a higher level of bioinformatics research, since in a Level 1 project the bioinformatician only helps their own collaborator, while a good Level 2 project can help thousands of other biologists. Usually these algorithms or resources should address an important and timely biological problem or technical challenge. They don’t have to be published in a high profile venue, and only time can tell their real significance, based on usage and citations. The method may or may not be extremely novel (a previously developed statistical or computational method applied to a new biological problem is sufficiently novel), but it really has to work and be user friendly. The developers often need to put in a lot of additional effort after the initial publication to maintain and update the algorithm / resource, even without future publications. The developers don’t necessarily get sufficient credit from the publication directly, but will do well (when their papers or grants get reviewed) by doing good for the community. Also, to do well in Level 2 research, bioinformaticians should stay focused on their biological domain, so that they have a good understanding of the new computational methods or experimental techniques that are the most relevant or useful in that domain.

Level 3 is integrating public high throughput data in a smart way to make good biological findings, so the study often starts from public data and ends in experimental validation. This requires the bioinformatician to have solid biological knowledge and to be able to come up with their own interesting biological questions. The bioinformatician can lead a biological project in which experimental collaborators trust the correctness and significance of the predictions enough to be willing to conduct experimental validation. Some well designed Level 3 findings can even be validated in silico, although unfortunately experimental biologists sometimes might not accept even a solid in silico validation. With more and more public data in resources like GEO, there will be increasing opportunities for Level 3 research. These studies should be evaluated by whether the biological question is interesting, whether the integration is smart and sound, and often by the level of the journal where the study is published (as compared to purely experimental studies).

Level X is where bioinformaticians provide the key integration and modeling for the massive amount of data generated by big consortia. A good biomarker for Level X bioinformatics (good specificity, not so good sensitivity) is when numbers appear in the paper’s title. Only bioinformaticians with a good Level 1 and Level 2 track record and good leadership in team science are recruited to the consortia and eligible for Level X research. These studies often get published in very high profile journals with excellent citations, and take tremendous effort from the informatics lead authors and coordination from all the senior authors. Although the informatics integration is necessary to get the consortium paper published, sometimes the data trumps the informatics, i.e. the journal judges the paper by its data and potential citations and not by the informatics. Also, first authorship often better represents the leadership of the first author’s PI than the technical capability and creativity of the first author, so first authors may not really get sufficient recognition in their future careers for Level X publications. Therefore, first authors in these studies, especially after they become independent, need to establish their own scientific reputation independent of the Level X projects. It might be beneficial for a PI to be involved in some Level X consortium, as these consortia often have members who are pioneers of their respective communities. However, being funded only by Level X projects and publishing only Level X studies might be a sign that the PI is more about the politics than the science.

A bioinformatician in training should probably first learn the basic bioinformatics skills and start on a Level 1 project, then move towards Level 2 and Level 3 projects as his / her biological understanding and computational techniques improve. As the bioinformatician matures and gains experience over time, s/he should preferably have a balance of Level 1, 2, and 3 projects, with the option of doing some Level X studies. In fact, if resources allow, it is probably healthier for an established bioinformatics PI to conduct research at all levels from 1 to X than at just one level. There are also many bioinformaticians, including myself, who are starting to conduct experimental research and generate experimental data themselves. The experimental component of their research should be compared with that of other experimental biologists, and the informatics component could still be evaluated according to the above five categories. Next time you read genomics and bioinformatics papers, ask “what is the level of their bioinformatics work?” Try to evaluate the bioinformatics work objectively, instead of by the impact factor of the journal in which the study is published.