Traditionally, biologists have devoted their careers to studying individual biological entities of their own interest, partly due to a lack of available data regarding those entities. Large, high-throughput datasets too complex for conventional processing methods (i.e., “big data”) have now accumulated in cancer biology and are freely available in public data repositories. This shift urges biologists to inspect their biological entities of interest using novel approaches, beginning with repository data retrieval. Essentially, these revolutionary changes demand new interpretations of huge datasets at a systems level, by so-called “systems biology”. One of the representative applications of systems biology is to generate a biological network from high-throughput big data, providing a global map of molecular events associated with specific phenotype changes. In this review, we introduce the repositories of cancer big data and cutting-edge systems biology tools for network generation and improved identification of therapeutic targets.
Traditionally, researchers have focused their efforts on single biological phenomena (e.g., a single gene mutation) or a specific signaling pathway (1). Now, the age of “omics” big data has brought about cutting-edge processing methods for interpreting biological mega data, and these methods have been universally adopted. Based on such mega data (so-called “big data”), researchers aim to understand systems-level phenotype changes (1, 2) by assessing entire pathways/networks, and not just a single entity. Systems biology is defined as a framework (3) that enables systems-level understanding and the generation of new biological hypotheses through computational modeling of massive high-throughput data.
Currently, systems biology has broadened its applications from basic science (including small RNAs) (4–6) toward translational medicine, including biomarker and therapeutic target identification (1–3, 7, 8). Systems biology often begins from high-throughput experimental data. Due to mammoth data deposition, as well as data generation by various next-generation sequencing (NGS) techniques (9), big data science has emerged, in particular, from the field of cancer genomics (10). The most widely used repositories include The Cancer Genome Atlas (TCGA) Research Network (11) and the International Cancer Genome Consortium (ICGC) (12). The development of applications for big data science (10) has been facilitated by systems biology frameworks to allow interpretation of systems-level tumorigenesis and molecular mechanisms.
Systems biology covers several diverse areas (13), including hypothesis generation, network construction (or inference), and network simulation (e.g., ordinary differential equations, Boolean dynamics). In this review, we restrict our discussion to network generation, while also describing analysis tools and related databases in the field of cancer.
Systems biology has a straightforward workflow of components (13), as shown in Fig. 1A. To understand systems-level biology, observations for all entries are necessary, and high-throughput data is merely a starting point. Computational modeling takes the high-throughput data and, in certain circumstances, selected prior knowledge (including pathways and gene sets), and yields network inference and hypothesis generation (13). Depending on whether computational modeling is used with or without prior knowledge, one may employ hybrid network modeling or data-driven network modeling, respectively (14). In both cases, computational modeling is a key component, due to its ability to deal with the complexity of interconnectivity among systems entries (13, 14).
Currently, there are numerous types of high-throughput data (i.e., “omics”), including genomics, epigenomics, transcriptomics, metabolomics, and proteomics (15). As shown in Fig. 1B, the omics data types are aligned with the flow of genetic information in biology. Cancer genomics data for various types of cancers, including whole genome sequencing (WGS), whole exome sequencing (WES), and SNP array data, have already been deposited in several public repositories, including The Cancer Genome Atlas (TCGA) (11) and the International Cancer Genome Consortium (ICGC) (12, 16) (Fig. 1B). Public epigenomics databases, including the Encyclopedia of DNA Elements (ENCODE) (17) and the Database of Genotypes and Phenotypes (dbGaP) (18), hold next-generation sequencing datasets for genome-wide DNA methylation, histone modifications, transcription factor binding, and non-coding RNAs (e.g., miRNAs, piRNAs). Transcriptomic datasets have been deposited in the Gene Expression Omnibus (GEO) (19) and ArrayExpress (20) for more than 10 years. Proteomics and metabolomics data have now begun accumulating in the PeptideAtlas (21) and the PRoteomics IDEntifications (PRIDE) (22) databases. No repository in Fig. 1B is restricted to one specific data type, so users should inspect all the data types of their interest across multiple repositories, rather than a single one. Brief information on the repositories is given in Table 1.
The two representative categories of prior knowledge are gene sets and pathway databases (including protein-protein interactions). A gene set consists of the relevant biological description and its gene entries. The MIT MSigDB Collections (23) (software.broadinstitute.org/gsea/msigdb/collections.jsp), one of the most comprehensive repositories of gene sets, contains 13,311 entries. Recently, gene sets have begun including miRNA genes (and their expression), as well as protein-coding genes (24). By definition, however, gene sets do not contain hierarchy or mutual interaction for their gene entries (25). Because of this lack of hierarchy, gene sets have mainly been applied to various enrichment analyses that utilize the Kolmogorov–Smirnov test statistic, ANOVA, or the hypergeometric test (further reviewed in (26, 27)). A recent approach (28) identifies the conditional dependency in a gene set, to reconstruct hierarchical relationships. Thus, numerous gene sets have now been recognized as prior knowledge for use in network generation.
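As an illustration of the enrichment analyses mentioned above, the following is a minimal sketch of a one-sided hypergeometric test. It assumes a hypothetical gene universe, gene set, and list of differentially expressed genes; the function name and all numbers are illustrative and are not taken from MSigDB or from any of the cited tools.

```python
from math import comb


def hypergeom_enrichment_p(universe, gene_set_size, selected, overlap):
    """Return P(X >= overlap), where X is the number of gene-set members
    found among `selected` genes drawn without replacement from `universe`."""
    max_overlap = min(gene_set_size, selected)
    tail = sum(
        comb(gene_set_size, k) * comb(universe - gene_set_size, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / comb(universe, selected)


# Hypothetical numbers: 20,000 measured genes, a 100-gene set,
# 200 differentially expressed genes, 8 of which fall in the gene set.
p_value = hypergeom_enrichment_p(20000, 100, 200, 8)
print(f"enrichment P value: {p_value:.2e}")
```

A small P value indicates that the overlap is larger than expected by chance. When many gene sets are tested, the resulting P values would also need multiple-testing correction, which this sketch omits.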
Unlike gene sets, pathways or protein-protein interactions have hierarchy or mutual relationships among the entries. Of the numerous, diverse pathway databases, we describe the Kyoto Encyclopedia of Genes and Genomes (KEGG) (29), Reactome (30), STRING (31), and human-integrated pathway (hiPathDB) (32) databases. In particular, the KEGG (29) pathway database, one of the popular manually curated pathway resources, consists of seven types of network contexts: cellular processes, metabolism, genetic information processing, environmental information processing, human diseases, organismal systems, and drug development (29). The KEGG pathway information is machine-readable via KGML (KEGG Markup Language). Reactome (30) is another popular peer-reviewed pathway database, and contains > 6,700 reactions (e.g., phosphorylation, acetylation) extracted from 15,000 publications. For machine readability, the SBML (Systems Biology Markup Language) version of Reactome data is also available (33).
The database and web resource STRING (Search Tool for the Retrieval of Interacting Genes/Proteins, string-db.org) contains a very extensive collection of protein-protein interactions, based on publications and predictions. The interaction entries of STRING (31) amount to 932,553,897, from 2,031 organisms (as of 2016-04-19). While KEGG and Reactome both have directed network structures, STRING also has undirected network structures. The hiPathDB (32) introduces a unique concept of “superpathways,” which consolidate multiple pathway database resources (NCI-Nature PID (34), Reactome (30), BioCarta (35) and KEGG (29)), resulting in the most extensive hierarchical network structures.
Depending on prior knowledge usage, computational modeling, a key component in systems biology frameworks, can be divided into two modeling methods: hybrid method and data-driven method. The former incorporates prior knowledge in model development, while the latter infers networks or hypotheses directly from measurements, without prior knowledge. The tools described below are summarized in Table 2.
Data-driven methods have used correlation or mutual information to define gene-gene connections for network construction (36–38), resulting in undirected networks. ARACNE (minet.meyerp.com) (39), a widely used free web-based tool, uses mutual information to construct gene regulatory networks from transcriptome datasets. In principle, starting from all connected entries, ARACNE applies a
The R package,
Despite the great success of correlation- and mutual information-based approaches, these approaches often generate extensive links between network entries. Consequently, methods have now been introduced to reduce non-significant links. For example, sparse inverse covariance selection (SICS) (45, 46) infers a gene regulatory network from various data types by reducing non-significant links. The main function of SICS is to identify a subset of network entries that consists of statistically significant or optimal pairwise correlations, based on the entire correlation (equivalent to covariance) matrix between all the entries. The benefit of subset identification is that it can provide statistically direct relations with a smaller number of entries. SICS methods aim to maximize or optimize the log-likelihood of pairwise correlations, modeling them with Gaussian graphical models (46, 47) or multivariate Gaussian models (45). Cancer Landscapes (cancerlandscapes.org) utilizes SICS, not only to provide multiple cancer network modules, but also to integrate multilevel omics data types into statistical network modules (45).
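The link between the inverse covariance matrix and direct relations can be illustrated with a small sketch. It assumes a hypothetical three-gene chain X → Y → Z with a known covariance matrix (Y = X + noise, Z = Y + noise, all with unit variance), and it only inverts that matrix exactly. Actual SICS methods additionally estimate a sparse inverse from noisy data with a penalized likelihood, which is not shown here; the helper names are illustrative.

```python
from fractions import Fraction
from math import sqrt


def invert(matrix):
    """Invert a small, non-singular square matrix exactly (Gauss-Jordan)."""
    n = len(matrix)
    # Augment the matrix with the identity: [M | I].
    aug = [
        [Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Pick a row with a non-zero pivot (fails for singular input).
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def direct_links(cov, names, tol=1e-9):
    """Return partial correlations for pairs with a non-zero precision entry."""
    precision = invert(cov)
    links = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            partial = -float(precision[i][j]) / sqrt(
                float(precision[i][i] * precision[j][j])
            )
            if abs(partial) > tol:
                links[(names[i], names[j])] = partial
    return links


# Hypothetical covariance of the chain X -> Y -> Z described above.
cov = [[1, 1, 1], [1, 2, 2], [1, 2, 3]]
links = direct_links(cov, ["X", "Y", "Z"])
print(links)
```

Although X and Z are marginally correlated in this toy example, their entry in the inverse covariance matrix is zero, so only the X-Y and Y-Z links are retained as direct relations.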
Unlike ARACNE and WGCNA, several approaches generate directed networks (Fig. 1C). Bayesian networks, another data-driven approach, are built on conditional independence (48–51). By definition, a Bayesian network expresses the joint probability density of biological entries (e.g., genes) as the product of the conditional probabilities of the entries in the omics data (38, 52). This definition naturally confers the ability to prune edges between conditionally independent entries. Also, conditional dependency defines statistically causal relationships among gene entries, resulting in directed networks. The purpose of Bayesian networks is to identify the set of conditional probabilities that best describes the measurements (e.g., gene expression) of biological entries in omics databases.
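The factorization of the joint probability into conditional probabilities can be made concrete with a minimal sketch. It assumes a hypothetical three-gene chain A → B → C with binary (low/high) states and made-up conditional probability tables; it evaluates a given network and does not perform the structure learning that tools such as Banjo carry out.

```python
from itertools import product

# Hypothetical network: each gene lists its parents (A -> B -> C).
parents = {"A": (), "B": ("A",), "C": ("B",)}

# Hypothetical tables: cpt[gene][parent states] = P(gene is high | parents),
# with states coded as 0 = low and 1 = high.
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.2, (1,): 0.9},
}


def joint_probability(state):
    """Multiply P(gene | parents) over all genes for one joint state."""
    prob = 1.0
    for gene, gene_parents in parents.items():
        p_high = cpt[gene][tuple(state[p] for p in gene_parents)]
        prob *= p_high if state[gene] == 1 else 1.0 - p_high
    return prob


# The joint probabilities of all eight states should sum to one.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
all_high = joint_probability({"A": 1, "B": 1, "C": 1})
print(f"sum over states: {total:.6f}, P(all high): {all_high:.3f}")
```

Because C depends on A only through B in this toy network, no direct A-C edge is needed, which is the edge-pruning property described above.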
Banjo ( users.cs.duke.edu/software) is another gene regulatory network generation tool that utilizes Bayesian network frameworks, resulting in directed networks (48). Banjo is applicable not only for single-state transcriptome data, but also for time-series data. Banjo (
One obstacle to all these prediction methods is that there are no “gold standards” for data-driven network generation tools. Consequently, the performance of the data-driven methods depends on data types, model parameter settings, network size, and network topology (55).
In hybrid methods, models are generated to analyze high-throughput data via prior knowledge (e.g., gene sets, pathways) (56), resulting in network inference. To date, hybrid methods have traditionally used pathways as prior knowledge. Recently, gene sets have been recognized as another prior knowledge source for inferring networks that consist of entries and their mutual interactions.
Another tool, EDDY (
Prior pathway information with omics data has been incorporated into statistical frameworks for the past ten years (7, 8, 57), successfully generating network structures. In this approach, the challenge in building the statistical framework is developing and defining a statistic that reflects pathway topology. Pathway topology indicates interaction types (e.g., activation, inhibition, modification) as well as the order (e.g., upstream, downstream) of biological entries. Another tool, SPIA (signaling pathway impact analysis) (58) (bioconductor.org/packages/release/bioc/html/SPIA.html), utilizes the KEGG pathway database as prior knowledge. Instead of utilizing the individual signaling molecules (in KEGG pathways), SPIA aligns the consecutive KEGG signaling “flows” with omics data. Additionally, SPIA considers two types of flow between two adjacent signaling molecules: activation and inhibition. SPIA quantitatively measures the influence (i.e., the perturbation statistic in a given pathway) on signal cascading flows by using omics data from two experimental groups. For any given pathway, SPIA obtains P values for the perturbation statistic by using permutation tests. SPIA also reconstructs statistically significant pathways in a network. Recently, SPIA was applied to aggressive prostate cancer, discovering that the disease shares a pathway network with small cell lung cancer (59).
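The permutation testing mentioned above can be sketched generically as follows. This is not SPIA's perturbation statistic or its implementation: as a stand-in, the sketch assumes a simple statistic (the absolute difference in group means of one hypothetical pathway score) and enumerates every relabeling of the samples, which is feasible only for small groups.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation P value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    # Every way of assigning n_a samples to the first group is one relabeling.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        perm_a = [pooled[i] for i in idx]
        perm_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so that ties with the observed value are counted.
        if abs(mean(perm_a) - mean(perm_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical pathway scores for two experimental groups.
tumor = [5.1, 5.4, 5.8, 6.0]
normal = [3.9, 4.2, 4.4, 4.7]
p_value = permutation_p_value(tumor, normal)
print(f"permutation P value: {p_value:.4f}")
```

With larger sample sizes, the full enumeration would normally be replaced by a random subset of relabelings.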
We also developed pathway topology-driven hybrid methods (7, 8), specifically for network generation, including PATHOME (7). Both methods take the KEGG database (29) as prior knowledge for network generation. The earlier algorithm (8) (henceforth, pre-PATHOME) identified subsets of all KEGG pathways by utilizing permutation-oriented statistical tests, based on a whole transcriptome. Because the graph structures of the KEGG pathways are too complex, we decomposed them into all possible paths (equivalently, subpathways; ~130 million) by traversing the graph structures.
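The decomposition of a pathway graph into paths can be sketched with a depth-first traversal. The example assumes a hypothetical toy pathway with generic node names rather than KEGG data, and it enumerates every simple directed path with at least one edge; the exact path definition used for the ~130 million subpathways may differ.

```python
def enumerate_paths(graph):
    """Yield every simple directed path (as a list of nodes) in `graph`."""

    def extend(path):
        for target, _interaction in graph.get(path[-1], []):
            if target in path:
                continue  # skip nodes already on the path to avoid cycles
            new_path = path + [target]
            yield new_path
            yield from extend(new_path)

    for start in graph:
        yield from extend([start])


# Hypothetical toy pathway: node -> list of (target, interaction type).
toy_pathway = {
    "A": [("B", "activation")],
    "B": [("C", "activation"), ("D", "inhibition")],
    "C": [("E", "activation")],
    "D": [],
    "E": [],
}

paths = list(enumerate_paths(toy_pathway))
for path in paths:
    print(" -> ".join(path))
```

Even this five-node example yields eight paths, which suggests how quickly the number of subpathways grows for full pathway graphs.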
In pre-PATHOME, each path consists of biological entries and the interactions, either activation or inhibition, between two adjacent entries. Given a subpathway, we devised a statistic that considers the interactions (equivalently, edges) between two adjacent entries, as well as the order of biological entities (8). We assumed the first-order Markov property (denoted as Fedge in (8)), where the fold-changes of the entities were regarded as observations. Subsequently, we performed permutation-based statistical tests for the product of Fedge and two additional statistics in each path. The statistically significant paths were collected and visualized. The pre-PATHOME was applied to an early onset colorectal cancer (CRC) dataset (60), revealing the pathways of epithelial-to-mesenchymal transition and immunosuppression even in normal adjacent cells of the CRC patients (8). The pre-PATHOME (8) was also deployed to identify trastuzumab-resistance pathways relating to networks in HER2(+) breast cancer (61), revealing five biomarker candidates associated with trastuzumab non-responsiveness (
Our group recently developed another hybrid method, PATHOME (7). The pre-PATHOME (8) assumed that all interactions in a subpathway are dependent on their upstream entities (the so-called first-order Markov property). In contrast, PATHOME assumes that all edges in a subpathway are independent, and adopts a two-stage strategy in its statistical framework (7). In the first stage, out of 130 million KEGG subpathways, PATHOME selects those whose edges are aligned with correlations. In the second stage, we test the selected subpathways under the null hypothesis that no differential correlation patterns are observed between two groups. Despite the independence assumption among edges, PATHOME showed better agreement with a cancer signaling reference set (62) than other gene set analysis tools (e.g., DAVID (63) and GSEA (25)).
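The first-stage selection can be illustrated with a rough sketch. It rests on an assumption that is not stated in the text: that an edge is "aligned" when the sign of the Pearson correlation between the two adjacent entries matches the edge type (positive for activation, negative for inhibition). The published criterion may differ, and the expression values, gene names, and function names below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (non-zero variance assumed)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def edges_aligned(path_edges, expression):
    """Check whether every edge's correlation sign matches its type.

    Assumed rule (not from the source): activation expects r > 0,
    inhibition expects r < 0.
    """
    for source, target, interaction in path_edges:
        r = pearson(expression[source], expression[target])
        if interaction == "activation" and r <= 0:
            return False
        if interaction == "inhibition" and r >= 0:
            return False
    return True


# Hypothetical expression values for three genes across five samples.
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "G2": [2.1, 3.9, 6.2, 8.0, 9.8],
    "G3": [9.0, 7.0, 6.0, 4.0, 2.0],
}

consistent = [("G1", "G2", "activation"), ("G2", "G3", "inhibition")]
inconsistent = [("G1", "G2", "inhibition")]
print(edges_aligned(consistent, expression))
print(edges_aligned(inconsistent, expression))
```

Only subpathways passing such a filter would proceed to the second-stage test of differential correlation between two groups, which is not sketched here.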
PATHOME has also been applied to delineate druggable target candidates, as well as molecular mechanisms, in both gastric and breast cancers (7, 64, 65). Recently, PATHOME was applied to gastric cancer (GC) transcriptome datasets, suggesting an HNF4α/WNT5A signaling axis to be newly druggable, as well as clinically relevant in diffuse-type GC (64, 65). Since trastuzumab treatment of HER2-positive GC tumors shows limited benefit, compared with ERBB2-positive breast cancer (66), PATHOME was applied to high
Systems biology is a general modeling framework that utilizes high-throughput data and prior knowledge to produce network inferences and hypotheses. Most network generation tools are based on whole transcriptome data. Integrating other data types into network topology using statistical models is still challenging. For example, for effective targeted therapy, the effects of mutations need to be incorporated into pathway topology under systems biology frameworks (68). Also, to facilitate the translation of cancer big data toward therapeutic benefit, pharmacokinetics/pharmacodynamics assessments (69–71) need to be considered in network generation in the future.
This work was supported by the Gachon University Gil Medical Center (Grant number: 2016-06), and performed as a subproject of KISTI (Korea Institute of Science and Technology Information) project No. P16018 (Development of HPC-based Big Data for healthy Aging Society), funded by the Ministry of Science, ICT, and Future Planning. The Author thanks Curt Balch for editing the manuscript.
Cancer-related, high-throughput data repositories. The databases in Fig. 1B are described with additional information, including the number of available datasets, data types, and websites. The numbers of entries are valid as of 05/02/2016.
|Name||Description||Address||Cancer-related data|
|TCGA||The Cancer Genome Atlas (TCGA): now one of the programs organized by the NCI’s newly established Center for Cancer Genomics (11)||cancergenome.nih.gov||34 cancer studies (types), 11,091 samples|
|dbGaP||The database of Genotypes and Phenotypes (dbGaP): archive of genome and phenotype in human||www.ncbi.nlm.nih.gov/gap||991 datasets|
|SRA||Sequence Read Archive (SRA): raw sequencing files and alignment files from next generation sequencing||www.ncbi.nlm.nih.gov/sra||1,950 cancer studies|
|cBioPortal||Multi-functional platform: supporting intuitive visualization, literate clinical pie chart, and simple data access (75). TCGA data visualization included.||cbioportal.org||126 cancer genomics studies, 26,080 samples|
|ICGC||The International Cancer Genome Consortium (ICGC): global-scale cancer projects (16)||dcc.icgc.org/||66 cancer projects, 17,867 donors|
|ArrayExpress||An archive of functional genomics data (76)||www.ebi.ac.uk/arrayexpress||14,974 datasets|
|EGA||The European Genome-phenome Archive (EGA)||www.ebi.ac.uk/ega/home||1,997 datasets|
|UCSC CGB||UCSC Cancer Genomics Browser (UCSC CGB): supplying interactive heat-map based visualization, and ready-to-use tab-delimited genomics and clinical data download (77). TCGA data visualization included.||genome-cancer.ucsc.edu||720 datasets|
|GEO||The Gene Expression Omnibus (GEO) (19): a public repository for microarray and next-generation sequencing data sets, and one of the representative repositories.||www.ncbi.nlm.nih.gov/geo||19,554 datasets|
|ENCODE||The Encyclopedia of DNA Elements (ENCODE) Consortium: decoding functional elements in DNA (17).||www.encodeproject.org||Cancer cell lines available|
|CCLE||The Cancer Cell Line Encyclopedia (CCLE) project: genomics and visualization in about 1,000 cell lines. Drug sensitivity available for the cell lines (78).||www.broadinstitute.org/ccle/home||Genomic characterization of 1,000 cell lines|
|PeptideAtlas||An archive of proteome information (21)||www.peptideatlas.org||99 datasets|
|PRIDE||PRoteomics IDEntifications (PRIDE) database: protein and peptide identifications, post-translational modifications (22). Mass spectrometry based proteomics data available.||www.ebi.ac.uk/pride/archive||290 datasets|
Summary of tools in network construction. The short descriptions and homepages of some tools in the manuscript are summarized.
|Class||Name||Homepage and description|
|Data-driven model||ARACNE (39)|
|Cancer Landscapes (45)|
|Ultranet (46, 47)|
|Hybrid model||EDDY (28)|