In 2011, we spent some effort looking at integrating ChIP-seq with GWAS data. That led me to the realization that for cancer studies, it is much more fruitful to study somatic mutations than germline mutations, and that studying normal populations is less likely to be cost-effective.
Ever since we sequenced the LNCaP / abl and MCF7 / LTED genomes, I have been thinking of establishing our whole genome sequence analysis capacity at Tongji University, China. Our assistant professor Jianxing Feng got his CS PhD from Tsinghua University specializing in algorithms, so we thought that he would like the computational challenge. We held a focused journal club reviewing the high-impact computational and biological papers on genome sequencing. To our surprise and disappointment, most of the existing algorithms are just brute-force, intuitive software with little algorithmic or statistical component.
Going to IBW, I realized that we are late in the whole genome or exome sequencing game. Many computational groups, both domestic and overseas, are already analyzing massive amounts of genome/exome sequencing data. The trend is clear: the first group can publish a good paper with only one whole genome; the second group will need to sequence 2 genomes; then future groups will need to sequence 5 (pairs of) genomes, 10, 50, 100, etc. to publish a good paper. The bar will rise just as it did for GWAS studies: the community will expect sequencing studies to explain the function and consequences of these mutations. That’s where we have some expertise and should be prepared to make an impact.
Recent exome sequencing and whole genome sequencing studies comparing cancer vs. normal or primary vs. metastatic cancer genomes have yielded many exciting findings. The easy cases for investigating functional mutations are genes with copy number gain or loss; most of these genes are clear oncogenes or tumor suppressors that were likely already identified with CGH or SNP arrays. The functional consequences of these alterations are easy to investigate with knockdown / knockout or overexpression assays. Our current approach of combining RNA-seq with DNase-seq to profile the wild-type vs. knockdown / overexpression conditions is a good screening approach for generating initial hypotheses.
One area that is likely to create new research opportunities is long noncoding RNAs (lncRNAs). In theory, CGH and SNP array studies should already contain information on their copy number changes, except that previously people didn’t realize that they were genes. In addition to using RNA-seq and DNase-seq to investigate their function, one informative experiment might be to use oligo probes to specifically pull down a lncRNA, followed by mass spectrometry to identify the proteins that interact with it. John Rinn seems to have some expertise in this area, and we should also explore this technique.
If enough tumors have been sequenced and people still observe only point mutations but not copy number variations in a gene, it would indicate that the mutation is not simply weakening or strengthening the regulation of the existing network of genes. The reason is that tumors could increase or decrease copy number to achieve the same goal of exerting stronger or weaker regulation. Instead, the mutation must be creating new links in the regulatory network. This type of gain-of-function mutation could be investigated by knocking in genes carrying the specific mutation and examining the downstream consequences. This is not a trivial experiment, and we might need to think of more efficient ways to study these mutations.
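The reasoning above can be sketched as a simple triage rule over per-gene counts from a sequenced cohort. This is only an illustration of the argument, not an established method: the `GeneSummary` record, the `classify` function, the gene names, and both thresholds (`min_tumors`, `min_recurrence`) are hypothetical and would need to be chosen from real data.

```python
from dataclasses import dataclass


@dataclass
class GeneSummary:
    """Per-gene counts across a sequenced tumor cohort (hypothetical record)."""
    gene: str
    n_tumors: int          # tumors sequenced
    n_point_mutated: int   # tumors with a point mutation in this gene
    n_copy_altered: int    # tumors with copy number gain or loss of this gene


def classify(summary, min_tumors=50, min_recurrence=0.05):
    """Triage a gene by the kind of recurrent alteration it shows.

    Both thresholds are illustrative assumptions, not validated cutoffs.
    """
    if summary.n_tumors < min_tumors:
        # Too few tumors to conclude that copy number changes are absent.
        return "insufficient data"
    point_frac = summary.n_point_mutated / summary.n_tumors
    cnv_frac = summary.n_copy_altered / summary.n_tumors
    if cnv_frac >= min_recurrence:
        # Tumors change dosage here: stronger or weaker existing regulation.
        return "dosage-like"
    if point_frac >= min_recurrence:
        # Recurrent point mutations without copy number changes:
        # candidate for creating new links in the regulatory network.
        return "candidate new-link"
    return "not recurrent"


if __name__ == "__main__":
    cohort = [
        GeneSummary("GENE_A", 100, 12, 0),
        GeneSummary("GENE_B", 100, 3, 20),
        GeneSummary("GENE_C", 10, 5, 0),
    ]
    for g in cohort:
        print(g.gene, classify(g))
```

Genes flagged as "candidate new-link" would be the ones worth the more expensive knock-in experiments, while "dosage-like" genes can go through the knockdown / overexpression screen.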