Oct 29 2014

Rick Young has been a pioneer in transcriptional and epigenetic gene regulation. I have spent almost my entire scientific career reading Rick’s papers. Besides the scientific findings and insights, I have also learned some other interesting things from them:

1. Beautiful paper figures. To get our papers published in high profile journals, having significant findings and convincing data is certainly important, but clear and beautiful figures are also very important for getting the points across. The figures in Rick’s papers, e.g. his recent Dowen et al, Cell 2014 paper, are works of art in themselves. Even if we can’t always publish high profile papers, it wouldn’t hurt to learn to generate beautiful figures that look like those appearing in high profile papers (as the Chinese saying goes: even if you have never eaten pork, haven’t you at least seen a pig run?).

2. Updated and comprehensive paper website. A great example is the website accompanying Rick’s seminal Lee et al, Science 2002 paper, with comprehensive data downloads and descriptions of protocols, reagents, and analyses. His current website download page has not only a long list of links to his published data, but also a similarly long list of unpublished data. You kinda marvel at how organized he and his lab must be.

Oct 11 2014

Over the years, I have decided to categorize bioinformatics (or computational biology) research into the following five levels. Here I don’t make any distinction between bioinformatics and computational biology, and so I use the words interchangeably.

Level 0 (if you remember Kung Fu Panda) is “modeling for modeling’s sake”. I remember years ago someone asked me, “there are plenty of opportunities to do modeling with the large amount of available GEO data, and what project shall we work on”. I asked him, “what problems would you like to answer”, and he answered “modeling problems”. This is totally OK if the scientists only consider themselves mathematicians, statisticians, computer scientists, or physicists, since there are indeed many good theoretical modeling problems in their respective fields, but not OK if they are serious about bioinformatics or computational biology research. Many Level 0 bioinformaticians never read or publish in biological journals or attend biological conferences, so in a way they haven’t gotten in the door of biomedical research yet. Level 0 research is often only read and cited by the authors themselves and by other people who also only conduct Level 0 research, so it is quite a waste of resources.

Level 1 is analyzing unpublished data from one’s own lab or collaborators and trying to make novel biological findings. This is a much more useful endeavor than Level 0 bioinformatics, and is a great way to train bioinformaticians. We can practice our existing bioinformatics skills to make real biological findings, learn new bioinformatics skills and a great deal of biology, and more importantly stimulate insights and ideas for Level 2 and Level 3 projects. A Level 1 study can be evaluated on several criteria: how complicated the data is (e.g. the total data volume and data types); whether the bioinformatician needed to create new algorithms or only used other people’s tools to analyze the data (e.g. from the methods section); how essential the bioinformatics analysis is to the overall project (e.g. how many figures were generated by the bioinformatician and whether the main hypothesis came from an informatics analysis); whether the experimental and computational sides had real, fruitful interactions; and whether the study contains real and significant biological findings (from reading the abstract). On the interaction criterion, more cycles of experimental / computational result description in a published paper suggest that experiments and computational analyses informed each other at each step. This is in contrast to studies where all the data was generated first, followed by bioinformatics analysis to summarize and integrate it, which sometimes yields no real findings and therefore only descriptive results with no experimental validation.

Level 2 is developing 1) a method to solve a general quantitative problem in big data studies that is especially relevant to biomedical research (e.g. Qvalue for FDR), 2) computational algorithms for analyzing data from a new high throughput technique (e.g. RMA or Bowtie), or 3) databases or resources for integrating many other public data (e.g. Oncomine). I consider this a higher level of bioinformatics research, since in a Level 1 project the bioinformatician only helps their own collaborator, while a good Level 2 project can help thousands of other biologists. Usually these algorithms or resources should address an important and timely biological problem or technical challenge. They don’t have to be published in high profile venues, and only time can tell their real significance based on usage and citations. The method may or may not be extremely novel (a previously developed statistical or computational method applied to a new biological problem is sufficiently novel), but it really has to work and be user friendly. The developers often need to put in a lot of additional effort after the initial publication to maintain and update the algorithm / resources, even without future publications. The developers don’t necessarily get sufficient credit from the publication directly, but will do well (when their papers or grants get reviewed) by doing good for the community. Also, to do well in Level 2 research, bioinformaticians should stay focused on their biological domain, so that they have a good understanding of the new computational methods or experimental techniques that are most relevant or useful in that domain.

Level 3 is integrating public high throughput data in a smart way to make good biological findings, so the study often starts from public data and ends in experimental validations. This requires the bioinformatician to have solid biological knowledge and to come up with their own interesting biological questions. The bioinformatician can lead a biological project where experimental collaborators trust the correctness and significance of the predictions enough to be willing to conduct experimental validation. Some well-designed Level 3 findings can even be validated in silico, although unfortunately experimental biologists sometimes do not accept even a solid in silico validation. With more and more public data on resources like GEO, there will be increasing opportunities for Level 3 research. These studies should be evaluated by whether the biological question is interesting, whether the integration is smart and sound, and often by the level of the journal where the study is published (as compared to pure experimental studies).

Level X is where bioinformaticians provide the key integration and modeling for the massive amount of data generated by big consortia. A good biomarker for Level X bioinformatics (good specificity, not so good sensitivity) is when numbers appear in the paper’s title. Only bioinformaticians with a good Level 1 and Level 2 track record and good leadership in team science are recruited to the consortia and eligible for Level X research. These studies often get published in very high profile journals with excellent citations, and take tremendous effort from the informatics lead authors and coordination from all the senior authors. Although the informatics integration is necessary to get the consortium paper published, sometimes the data trumps the informatics, i.e. the journal judges the paper by its data and potential citations and not by the informatics. Also, first authorship often better represents the leadership of the first author’s PI than the technical capability and creativity of the first authors, so first authors may not really get sufficient recognition in their future careers for Level X publications. Therefore, first authors in these studies, especially after they become independent, need to establish their own scientific reputation independent of the Level X projects. It might be beneficial for a PI to be involved in some Level X consortium, as these consortia often have members who are pioneers of their respective communities. However, being funded only by Level X projects and publishing only Level X studies might be a sign that the PI is more focused on the politics than on the science.

A bioinformatician in training should probably first learn the basic bioinformatics skills and start on a Level 1 project, then move towards Level 2 and Level 3 projects as his / her biological understanding and computational techniques improve. As the bioinformatician matures and gains experience over time, s/he should preferably have a balance of Level 1, 2, and 3 projects, with the option of doing some Level X studies. In fact, if resources allow, it is probably healthier for an established bioinformatics PI to conduct research at all levels from 1 to X than at just one level. There are also many bioinformaticians, including myself, who are starting to conduct experimental research and generate experimental data themselves. The experimental component of their research should be compared with that of other experimental biologists, and the informatics component could still be evaluated according to the above 5 categories. Next time, when you read genomics and bioinformatics papers, ask “what is the level of their bioinformatics work?” Try to evaluate the bioinformatics work objectively, instead of by the impact factor of the journal the study is published in.

Aug 14 2014

Recently a postdoc in the lab asked me whether it is worth joining the editorial board of a new open access journal and whether this will be considered favorably during his later faculty job search.

There has been a wave of new open access journals. Frankly, with the large number of journals making papers open 6 months after publication, the NIH requirement to put all NIH-funded publications into PubMed Central, and the large number of existing good open access journals, it doesn’t make a whole lot of sense from a scientific point of view to start new ones. Unless we work in a new field (like bioinformatics 15 years ago or nanotechnology 10 years ago), don’t we have enough places to publish good science already?

Many good journals don’t ask their faculty editors to handle peer reviews; instead they have well-trained scientists as full time editors to handle the logistics. These journals consult their faculty editors on topics to cover in a special issue, ask them to write special reviews or give expert opinions in interviews, or ask for help with a paper decision where reviewers could not reach consensus. I understand that, depending on the field, this may not be financially possible for some good journals, but I believe it is a much better use of faculty expertise and time.

Having been on several faculty recruitment and promotion committees, and written promotion evaluation letters for many colleagues, I would say that being on a journal’s editorial board is useful, but only if the journal is reputable. Instead of being on the editorial board of low profile journals, postdocs and junior faculty can probably benefit more from the experience of reviewing papers for high profile journals. Since “high profile” might mean different things to different people / fields, I would say only serve on the editorial board of a journal if you often read papers from that journal.

Aug 02 2014

Recently I encountered a number of students from China using my signature. When visiting students ask me to write letters for visiting invitations, apartment rentals, or bank applications, I often ask them to draft the letter so I can put it on my letterhead with my signature. When the students sent me the drafted letter, to my surprise the Word document had my electronic signature on it a number of times. I asked the students where they got my signature, and they answered that they had cropped it from a PDF I had sent them before and pasted it into the Word document.

Students probably don’t understand that it is a very serious problem to reuse other people’s signatures: permission from the signature owner is needed every time a signature is used. This kinda violates the honor code, and is considered similar to cheating in exams, fabricating data in papers, or stealing other people’s credit cards. People who crop someone else’s signature to use in one letter will always be under suspicion of fabricating reference letters later. I would like to seriously warn students against ever doing this.

Jul 25 2014

Just finished the book “Walt Disney: The Triumph of the American Imagination”. Although I later found that it was not the Disney book with the best reviews, I still thoroughly enjoyed it. The introduction at the beginning was a bit scattered and boring, but as soon as the story begins, it is fun to read.

As I was “reading” (listening to the audiobook on my commute), I felt that in many ways Disney is similar to Steve Jobs. Like Steve Jobs, Walt Disney was always totally passionate about and absorbed in his projects, and extremely detail-oriented. Unlike Steve Jobs, who was always (at least as portrayed in the book) completely confident like a maniac, Walt Disney had his doubts and struggles when what he believed in didn’t get the desired outcome, and he would revise his approaches or let his delegates try new things to see how things worked. He continued to challenge himself for better creativity, and was willing to make a lot of practical compromises. The book also covered a more humane side of him than of Jobs, e.g. what Disney was like with his family, his old colleagues and teachers, and how he worked with the Disneyland workers late into the night before it opened. As a scientist, I was totally inspired!

By chance I found this blog: Ten Things I’ve Learned from Walt Disney. Can’t agree more!

Jul 01 2014

Recently heard some talks about single cell gene expression using either Fluidigm’s microfluidic chip or CyTOF. Fluidigm is a microfluidics approach that can do multiplex RNA expression of a few hundred genes in single cells, and the user just needs to custom design qPCR primers for the target genes. CyTOF is a proteomics approach to look at the protein expression of ~50 genes in single cells, if antibodies for the proteins of interest are available. Both give amazingly robust results and, at the current level, seem to be more cost effective than single-cell RNA-seq. For most biological systems, robustly testing 50-300 genes in single cells in a population will be enough to gain the insights people get from single-cell RNA-seq, at a much lower cost. A potentially interesting bioinformatics problem will be better selection of the genes for testing. Heard from colleagues that Fluidigm bought the company that made CyTOF, and that they will be pushing out a new machine that combines the two capabilities. I look forward to many exciting new discoveries and opportunities in this area.

May 01 2014

I have found Skype interviews (with video) to be very effective in screening applicants. An hour of Skype investment can save a whole-day process. Through these interviews, I found some common issues with candidates that I would like to discuss in this blog. Hopefully it will help future candidates to better prepare for a Skype interview.

    The day before or a few hours before the Skype meeting, the candidate should send an email with his/her CV and presentation slides. The presentation slides should summarize the candidate’s previous research work, and present one good study in more detail. The total time of the presentation should be about 20 min.
    The meeting starts with the candidate checking with the faculty that s/he has received all the application material and reference letters. If any letters are missing, the candidate should follow up with his/her references after the meeting to get those letters sent.
    The first 20-25 min will be spent on the candidate going through the presentation on his/her previous work. This will showcase the candidate’s technical and communication skills.
    The next 20 min is often spent discussing the current research in the faculty’s lab. This includes discussion of the faculty’s published work, about which the candidate can ask questions or request clarifications. The candidate should do his/her due diligence, and show that s/he has a good mastery of the faculty’s published work. The discussion also includes the faculty explaining recent unpublished results in the lab, which will showcase some of the most exciting projects and opportunities in the lab.
    The last 15 min will be about an area or a project that the candidate plans to work on in the faculty’s lab during his/her postdoc period. It should be an area of great interest to both the candidate and the faculty, one that can use the candidate’s and the faculty’s existing expertise but allows both (especially the postdoc) to expand their expertise, learn new things, and make an impact. It is very likely that, depending on new technology development, resources, or other opportunities, the postdoc will end up doing something different from what s/he proposed initially. However, having this discussion during the interview demonstrates the candidate’s ability to think independently and critically, and to identify good problems to solve.

In summary, candidates should not treat a Skype interview as an hour of free chat. Instead, they should read, think, and prepare well, so that the candidate and faculty can learn the most about each other in a short time. This not only allows the faculty to better evaluate a candidate, but should also be a beneficial experience for the candidate in learning and planning his/her future work.

Feb 25 2014

Saw some very provocative blog articles on Lior Pachter’s blog attacking Barabasi and Kellis’s recent network biology papers, and the Kellis group response. Lior is probably a little harsh in his tone, but the irregularities he pointed out in the Kellis study are probably true. Otherwise, it would have been easy for Manolis to show Lior the code / data to reproduce the figures and claim the $100, story closed. The group’s reputation is worth the effort, regardless of whether the wager is $100 or $60K. In fact, any bright graduate student working on network biology should try out the step-by-step instructions by Manolis, study the code, and post their results on Lior’s blog. This not only is a good learning experience for the students, but also a big favor to the many scientists who are curious about the results.

Confucius says (don’t know whether I can translate this well): if someone points out a problem of ours and we indeed have it, we correct it; if we don’t, we should remind ourselves not to fall into such problems. We have experienced some difficulties in trying to reproduce results from other high profile papers, mostly because the supplementary material was not detailed enough. It is a painful process, I have to say, so I think calling for more detailed supplementary material and code is reasonable.

I myself might fall victim too if others take a closer look at our studies, and that’s what I will ask my group to be careful about in our future publications. For the papers we have already published, I can only pray we did as much as we should have. Jun Liu once told me that the papers we publish are like sprinkled water, a phrase used in Chinese to describe married daughters. Actually my close colleagues published a Nat Genetics paper evaluating the analysis results of all the Nat Genetics papers in the previous 2 years. Our own Carroll et al paper was evaluated and unfortunately was not among the 2 they called reproducible. Although I wasn’t happy that we were called “irreproducible” when their differentially expressed gene list was within 1% of our reported list, I could only ask lab members to be more specific about our parameter settings the next time.

Lior’s next blog is quite personal and damning, which I am not sure I approve of. But I like the last two paragraphs on “Methods matter”, and especially the objective comments by Erik van Nimwegen and Marc Robinson-Rechavi. The blog calls for scientists to provide enough methodological details and code for their papers, for reviewers to take a more serious look at the methods in manuscript evaluation, and for newcomers not to judge a paper merely by its journal’s impact factor. To get bioinformatics papers into high profile journals, studies often overstate their results. The blog about publishing bioinformatics in high profile journals, although funny and cynical, has some truth to it. Good computational biology methods will stand the test of time, and good conceptual ideas might benefit many other computational biology studies, even if they don’t look totally novel or revolutionary or get published in high profile journals.

It might not be fair to focus just on individual investigators. Computational biology as a discipline should aim to establish its credibility with, and earn respect from, colleagues in maths / statistics, computer science, and biology. I hope computational biologists can recognize the problem and form a community of peers to discuss it and work on it together. I hope knowing there are scientists like Lior out there will make all of us more rigorous scientists. In fact, Lior’s blog puts him to the test too, and he has to be a good model in his own future studies. Time will tell…

P.S. Had a talk with Rafa and Cliff 2 days after I posted the original blog, and they made some excellent points. It is OK for Lior to attack Manolis’ paper on technical grounds. Scientists should be able to openly criticize other people’s science, whether the authors are friends or foes. But accusing the authors of fraud is something very serious and totally different in nature from calling a paper nonsense. We genomics and informatics people, when reading the blog, can understand this. However, if people outside the field who don’t understand the nuances get the message that an MIT computational biology professor committed fraud, it really hurts people. With this type of damage, you can’t put the genie back in the bottle, and this kind of accusation is unhealthy for the field. From Lior’s blog, what I agreed with is that Manolis’ method might not be as novel as claimed, the parameter setting was not rigorous, and it might not work as well as the paper claimed. This has been a systemic problem in a promising new field, an issue Rafa and many other colleagues have raised before and one we should all be more careful about, but I would never call the authors frauds. Lior’s points about “methods matter” and “not overstating the results” could have reached a wider audience if he had attacked the science of the paper rather than the authors’ integrity. If my blog above has led people to believe otherwise, then I can understand the authors’ chagrin, and I offer my public apologies to Manolis and his co-authors.

Feb 11 2014

Just heard about Illumina’s NextSeq machine. It is a desktop machine that delivers both the speed (one-day runs) and the reads (400M 2 × 150 bp reads), and will really democratize sequencing. This might be the best machine for several labs in a department, a floor, or a small center. It might be extremely valuable for clinical applications, and will probably replace HiSeq, MiSeq, or Ion Proton as the workhorse for research investigator sequencing. Found a very interesting blog about NextSeq. I do believe that the NextSeq will be very appealing to many labs if the two-color thing works out.

If departments or several labs do indeed share a NextSeq, the informatics might become a bottleneck. That’s where Bina Technologies might come to the rescue.

Update May 2014: DFCI bought a NextSeq and it is really delivering both the speed and the reads, so we are getting another one. It is amazing how quickly and reliably Illumina is pushing out each new generation of their sequencing machines.

Jan 16 2014

Here is to another unfinished blog article I started last summer…

When I was a first semester graduate student at Stanford, because of some difficulties in the AI class (I took CS221 without taking the prerequisite CS121), I felt that Stanford made a mistake admitting me there. When people praise our work after my talk, I also often feel afraid that they will find out some caveats in our algorithms or findings that our work could not fully address. I have a wonderful team of students, postdocs, and research scientists at DFCI, and I often worry that my team will think I am not smart, hard working, or caring enough.

During the career training in Texas in 2012, I learned that as women we are particularly vulnerable to feeling that we are not as great as people think, and to being afraid that sooner or later people will find out what a fraud we are. This is called Impostor Syndrome, and Sheryl Sandberg mentioned it in her book Lean In as well. It was an interesting revelation to me, although it didn’t stop me from feeling that way just the same.

On a recent China trip, I watched the movie Hyde Park on Hudson. We normally see pictures of FDR (Franklin D. Roosevelt, not false discovery rate 🙂 ) as the charming and confident president. But seeing FDR being carried from place to place by his valet, I wondered how humiliated he must have felt. In the movie, his night conversation with George VI was quite interesting and endearing. We all have our vulnerabilities, but that’s OK.

When reading shorter biographies of George Washington before, I couldn’t help marveling at his character, beneficence and good judgement. During the China trip, I read a more detailed Washington biography by Ron Chernow. By the way, Ron Chernow is quite a master of biographies, and I have read his biography of Alexander Hamilton 3 times. Anyway, the Washington book mentioned not only some blunders of his youth, but also his personality flaws and quirkiness. I guess none of us are saints… George Washington might not have been the most brilliant of generals, but his character and integrity made him one of the most respected founding fathers of America and probably one of the best human beings I have read about.

There are two things I learned from the Washington book that are directly applicable to the impostor syndrome. The first is that George Washington was always modest and respectful to his colleagues, even the competing generals during the war who reviled him behind his back. The second is that George Washington was extremely loyal and supportive to his team members. It is like saying, “Sure, I may not be the best, but I never acted as if I were. I just do the best I can.” Who can criticize that?? This really disarms the impostor syndrome, but it is easier said than done. Interestingly, looking at my colleagues, I found Bing Ren to best fit these characteristics. No wonder he has earned the respect of so many colleagues!