The Cancer Genome Atlas (TCGA) has compiled genomic, epigenomic, and proteomic data from more than 10,000 samples derived from 33 types of cancer, aiming to improve our understanding of the molecular basis of cancer development. The availability of this genome-wide information provides an unprecedented opportunity to uncover new key regulators of signaling pathways or new roles for known pathway members. To take advantage of this advance, researchers will need to learn systematic approaches that help uncover novel genes associated with genetic alterations, prognosis, or response to treatment. This minireview describes the current status of the TCGA project and explains how to use TCGA data.
The Human Genome Project (HGP) was successfully completed in 2003 and opened a new era of genome-based medicine (1–3). The success of the HGP motivated the development of new technologies for genome-wide copy number alteration analysis and gene expression profiling, as well as better sequencing methods that can comprehensively characterize entire genomes at low cost. The comprehensive genome map has improved our understanding of the complex genetic networks involved in the development of human disease and has allowed us to uncover the functions of the associated genetic elements at the molecular level. There are two major branches of genomics. Structural genomics collects and catalogues all genetic and epigenetic elements by massive sequencing. In contrast, functional genomics aims to uncover the functional roles of genetic or epigenetic elements in the context of different biological systems. Microarrays are another important technology for genomics, designed to capture the expression patterns of coding and non-coding genes, alterations in copy number, and methylation status across the entire genome simultaneously. Because the functional activity of genes is well reflected in gene expression patterns, microarray technology has been used extensively to generate expression profile data from diseased tissues, for identifying disease-associated genes, and from cell lines, for characterizing newly discovered genes.
In recent years, many genome-wide analyses of publicly available genomic and epigenomic data have uncovered novel genes associated with human diseases, as well as genes with unexpected roles in different cellular contexts (4–6). Thus, in-depth analysis of multiple genomic data sets will undoubtedly yield novel insights into the regulation of many signaling pathways and reveal novel key regulators of those pathways.
In this review, I will briefly describe major progress in cancer genomics, particularly in The Cancer Genome Atlas (TCGA) project. I will also describe the data generated by the different platforms and the analytical tools that have been developed over the course of the TCGA project.
The Cancer Genome Atlas (TCGA) project is a multi-institutional research program supported by the National Institutes of Health. TCGA was launched to build a comprehensive understanding of cancer genetics using state-of-the-art genomic technologies and analysis tools. Its goals are to catalogue all potential cancer drivers, identify robust prognostic and predictive biomarkers and novel druggable therapeutic targets, and uncover molecular subtypes of tumors that differ in prognosis and response to treatment. Using several technical platforms, TCGA collects and maintains many types of genome-wide data, including expression of coding and non-coding RNA, somatic mutations, copy number alterations, and epigenomic data such as promoter methylation. In addition to genomic and epigenomic data, it collects proteomic data using reverse phase protein arrays (RPPA). The project plans to collect multi-platform data from hundreds of tissues for each cancer type and to share the data, without restriction on their use, with any investigator interested in genome-based medicine or in studying gene function. In 2005, a pilot study (phase I) was started to test the feasibility of these ideas and to develop the research infrastructure by characterizing a few selected, understudied cancer types: lung squamous cell cancer, glioblastoma, and ovarian cancer (7–10). Phase II started in 2009 and expanded the project to 33 cancer types.
New oncogenes and tumor suppressor genes have been identified through analysis of TCGA data. Some of the findings were unexpected and showed significant association with clinical outcomes. For instance, genome-level analysis showed that non-hypermutated adenocarcinomas of the colon and rectum are almost indistinguishable at the molecular level (11). Alterations in the FGFR kinase genes are very common in lung squamous cell cancer, while
TCGA has established a pipeline for collecting and processing tissues from numerous source sites (tissue banks at hospitals), generating high-quality genomic and proteomic data, and distributing and analyzing the data. The major bodies for data generation and analysis are the Genome Characterization Centers (GCCs), the Genome Sequencing Centers (GSCs), and the Genome Data Analysis Centers (GDACs). The GCCs aim to identify all genomic alterations in the tumors of each cancer type. Each GCC uses the most advanced platform technologies to generate mRNA and miRNA expression data, DNA methylation data, and copy number alteration data. The genetic changes identified by the GCCs are further characterized by the GSCs, which perform large-scale genomic sequencing using the latest sequencing technologies to identify small genomic changes that could play a role in cancer. All of the data generated by the GCCs and GSCs on the multiple genomic platforms from thousands of tissue samples are transferred to the GDACs through the Data Coordinating Center (DCC). The GDACs are responsible for analyzing the data and developing new bioinformatics tools that facilitate use of TCGA data by the entire research community.
Six types of platform data are currently generated by the GCCs and GSCs and are available to the general public: somatic mutation data, mRNA expression data, miRNA expression data, DNA methylation data, copy number alteration data, and proteomic data.
The majority of the mutation data were generated by whole exome sequencing using second-generation DNA sequencing instruments (mostly Illumina and ABI SOLiD). Whole exome sequencing covers only the DNA that codes for protein products and omits sequences that do not directly code for proteins. However, about 10% of the samples in the TCGA project underwent whole genome sequencing, which covers every base pair of DNA and can therefore reveal alterations in regulatory regions of the genome.
mRNA expression profile data were first generated using microarray technologies from Affymetrix or Agilent, but RNA sequencing (RNA-seq) technology from Illumina was used in the later stages of the TCGA project. RNA-seq has several advantages over microarray platforms: it can quantify both rare and common transcripts, alternative splicing, previously unrecognized transcripts, gene fusions, and non-coding RNAs. It can also capture the distribution of somatic mutations and edited RNAs (12).
A microRNA (miRNA) is a small non-coding RNA (~22 nucleotides long) that regulates other genes post-transcriptionally (13). miRNA expression profile data were generated by directly sequencing small RNAs using RNA-seq technology from Illumina. These data were processed and maintained separately from the mRNA-seq data because the biological and molecular characteristics of miRNAs differ from those of coding RNAs.
DNA methylation is an epigenetic mark that is frequently associated with the transcriptional activity of genes. TCGA DNA methylation data were initially generated using the Illumina 27K DNA methylation array (HumanMethylation27, containing 27,578 probes in 14,495 genes). This array was later replaced by the 450K methylation array (HumanMethylation450, containing 485,512 probes covering 99% of RefSeq genes).
Copy number alterations are probably the most frequent genetic events during the course of tumor development. Copy number data were generated using Affymetrix SNP 6.0 arrays containing 1.8 million genetic markers, including more than 906,600 single nucleotide polymorphisms (SNPs) and more than 946,000 probes for the detection of copy number variation.
RPPA is an antibody-based quantitative method that assesses hundreds of protein markers in thousands of samples in a cost-effective, sensitive, and high-throughput manner (14). This technology has been extensively validated for both cell line and patient samples, and its applications range from building reproducible prognostic models to assessing the underlying biology associated with prognosis. Current RPPA data from the TCGA project include the expression and modification of ~200 proteins.
In addition to genomic and proteomic data, TCGA data also include slide images for histopathology and detailed patient information such as tumor stage, race, potential etiology, treatment, and survival.
All of the genomic, proteomic, and clinical data from the TCGA project were originally available from the TCGA Data Portal. However, as of July 15th, 2016, the TCGA Data Portal is no longer operational and all TCGA data now reside at the Genomic Data Commons (GDC, https://gdc-portal.nci.nih.gov/). The vast majority of TCGA data in the GDC are publicly available without restriction, meaning that no authentication or authorization is necessary to access them. Some of the data are under controlled access, meaning that a special authorization process is required. Access to controlled data is typically granted by program-specific Data Access Committees (https://gdc.nci.nih.gov/access-data/obtaining-access-controlleddata). Public availability of the data is governed by the NIH Genomic Data Sharing Policy (https://gds.nih.gov/). Open-access data typically include data that cannot identify individuals, such as high-level genomic and proteomic data, as well as most clinical and all biospecimen data elements. Controlled data include individually identifiable data such as low-level genomic sequencing data, germline variants, SNP6 genotype data, and certain clinical data elements.
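The distinction between open and controlled access can be made concrete with a small query sketch. The Python code below only builds a request for open-access files from one TCGA project; it does not contact any server. The endpoint URL, the field names (`cases.project.project_id`, `data_category`, `access`), and the example values are assumptions based on the public GDC REST interface and are not taken from this review, so they should be checked against the current GDC documentation before use.

```python
import json
from urllib.parse import urlencode

# Assumption: public GDC REST endpoint for file metadata (not cited in this review).
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_open_access_filter(project_id, data_category):
    """Return a nested filter selecting open-access files for one project."""

    def is_in(field, values):
        # One "field is in values" clause in the assumed GDC filter syntax.
        return {"op": "in", "content": {"field": field, "value": values}}

    return {
        "op": "and",
        "content": [
            is_in("cases.project.project_id", [project_id]),
            is_in("data_category", [data_category]),
            # Restrict to data that need no authorization.
            is_in("access", ["open"]),
        ],
    }


def build_query_url(project_id, data_category, size=10):
    """Encode the filter and a few metadata fields into a GET URL."""
    params = {
        "filters": json.dumps(build_open_access_filter(project_id, data_category)),
        "fields": "file_id,file_name,data_type,access",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)
```

Calling `build_query_url("TCGA-OV", "Transcriptome Profiling")` would, under these assumptions, produce a URL listing open-access expression files for the ovarian cancer project; the project identifier and data category are illustrative values. Controlled-access files would additionally require an authorization token obtained through the process described above.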
Processed high-level data are also available from the UCSC Cancer Genomics Browser (https://genome-cancer.ucsc.edu/), which offers the processed data in a more user-friendly form along with a limited set of visualization tools. Histology information is available from The Cancer Digital Slide Archive, CDSA (http://cancer.digitalslidearchive.net/), which provides interactive tools for viewing and annotating diagnostic and tissue slide images from the TCGA project (15). In addition to genomic, proteomic, and clinical data, TCGA also offers radiological imaging data from TCGA patients through The Cancer Imaging Archive, TCIA (http://www.cancerimagingarchive.net), to stimulate studies linking imaging phenotypes to genotypes (16).
Comprehensive genomic data from a large number of patients will undoubtedly improve our understanding of cancer-related genes and their clinical relevance. However, analysis of such “big data” requires substantial skill with computational tools, statistics, and programming languages. It is therefore necessary to develop easy-to-use, intuitive genomic tools that help researchers and clinicians analyze and interpret all of the data types in a meaningful way. TCGA provides intuitive web-based tools for this purpose.
The analysis tools from the TCGA project were developed so that basic scientists without training in informatics, statistics, or clinical research can analyze the data and interpret the results. The potential involvement of genes of interest in cancer development can be easily assessed. For example, genetic alterations of the peroxiredoxin family in all cancer types can be assessed through cBioPortal (Fig. 1A), and alterations of individual genes in a particular cancer type (e.g., ovarian cancer) are visualized in OncoPrint format (Fig. 1B). Furthermore, the clinical relevance of an alteration is estimated and displayed in Kaplan-Meier plots (Fig. 1C). The clinical association of genes of interest can be further validated using the tools in PROGgeneV2. Correlations between different types of genomic data are also readily visualized through cBioPortal and Firehose (Fig. 2).
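To illustrate what a Kaplan-Meier plot summarizes, the following minimal Python sketch computes the product-limit survival estimate from follow-up times and event indicators. It is a generic textbook estimator written for illustration, not the implementation used by cBioPortal or PROGgeneV2, and the function name and input layout are assumptions.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if the event (e.g., death) was observed, 0 if censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")

    # Number of observed events at each time point.
    deaths = Counter(t for t, e in zip(times, events) if e)
    # Number of patients (events plus censored) leaving the risk set at each time.
    leaving = Counter(times)

    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving this time.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Patients who died or were censored at t are no longer at risk.
        at_risk -= leaving[t]
    return curve
```

To assess the clinical relevance of an alteration in this way, the estimator would be run separately on patients with and without the alteration, and the two curves compared, typically with a log-rank test, which is not shown here.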
TCGA is an unprecedented and powerful public resource of cancer genomic data that gives researchers a great opportunity to expand present knowledge of cancer. Multi-layer analyses performed on different platforms, each reflecting distinct biological characteristics, provide a better understanding of cancer biology, leading to improved patient stratification, identification of novel prognostic or predictive markers, and discovery of novel, potentially druggable therapeutic targets. The translation of genomic knowledge into biological insights will move these new findings to the next level and lead to a new era of data-driven molecular biology.
This study was supported in part by National Institutes of Health grant CA150229, the 2016 cycle of Institutional Research Grants from The University of Texas MD Anderson Cancer Center, the 2016 cycle of the Sister Institute Network Grant from The University of Texas MD Anderson Cancer Center, and a grant from The University of Texas MD Anderson Cancer Center Duncan Family Institute for Cancer Prevention and Risk Assessment.