Researchers have long sought molecular biomarkers for the diagnosis of diseases (1-3). With the development of sequencing technology, it became possible to study all genes at once rather than only the individual genes associated with a disease. High-throughput sequencing produces omics data sets and allows the identification of modifications of the genome, transcriptome, and epigenome. Sequencing of entire genomes reveals individual variants called single-nucleotide polymorphisms (SNPs), which have been used to predict the diagnosis and prognosis of disease through analysis of genetic diversity and population genomics (4-8). Comparing transcriptomes across conditions can uncover disease-specific or stage-specific genes, which can be used as biomarkers. In addition, epigenomic factors, which regulate gene expression without being DNA mutations, have potential as a new class of biomarkers. While biomarkers could be found with each sequencing approach separately, integrative analysis of omics data has recently made it possible to discover high-confidence biomarkers. However, because each omics data set is produced for a specific purpose, integrating and analyzing them requires a careful understanding of the characteristics of each data type.
The liver is the largest internal organ in the body and has essential roles such as digesting food, detoxifying chemicals, and storing energy. Chronic liver disease and cirrhosis, in which the liver is damaged, are a cause of global mortality and morbidity (9). Liver disease can be caused by a variety of factors, such as infection with hepatitis A, B, or C virus, persistent alcoholic hepatitis, and fat accumulation in the liver. Regardless of the cause, repeated injury provokes inflammatory damage, parenchymal cell death, and matrix deposition, leading to advanced fibrosis (10). Liver disease progresses through multiple steps, including fatty liver, steatosis, cirrhosis, and hepatocellular carcinoma (HCC). The scar matrix typically accumulates very slowly, over approximately 5-50 years, before cirrhosis develops, and the early stages of chronic liver disease can be reversed to a healthy state (10). Once cirrhosis occurs, however, the damage becomes irreversible. It often leads to complications and can even progress to cancer (10).
Thus, in this review, we summarize the major research relevant to next-generation sequencing (NGS) techniques from their beginnings and introduce more recent studies that use integrative analysis of epigenome sequencing, classified by the character of each type of omics data, with a focus on liver disease (11-15).
Since the genome contains the genetic material of an organism, investigating its nucleotide sequence is a powerful way to examine the control systems that regulate cell functions. The first DNA sequencing method, the chain termination method now known as Sanger sequencing, was developed by Frederick Sanger in 1977 (16). Using this approach, the Human Genome Project was completed, and interpreting gene sequences has greatly helped in understanding human life and disease (17, 18). However, the function of the non-coding region, which makes up most of the human genome, was not yet precisely known. The development of NGS technology, which overcomes the shortcomings of Sanger sequencing, has provided a great deal of information on the features of the non-coding region.
Since the introduction of NGS, many specialized sequencing methods for detecting modifications of the genome, transcriptome, and epigenome have been developed. In this section, the most widely used of these methods are summarized under the categories of genome-, chromatin-, and transcriptome-based studies (Fig. 1).
Genetic variation refers to variation in gene frequencies and mutations (Fig. 1A). The first studies using NGS techniques focused on finding significant mutations that trigger disease. Whole genome sequencing (WGS) and whole exome sequencing (WES) are typically used to detect genetic variations, and they can also be adapted for targeted sequencing of specific regions (18).
First, there are several methods for targeted sequencing. Oligonucleotide-selective sequencing (OS-Seq) was developed to capture target genome regions with high specificity, enabling effective and reproducible analysis of cancer genomes (19, 20). Duplex-Seq has shown increased mutation frequencies in small selected regions of the nuclear genome (19, 20). Repeat sequences occupy a large portion of the eukaryotic genome. Because of their distinctive character, they have been studied in the context of genome evolution and genomic diversity, and their role in the genome has been investigated using targeted sequencing. For example, the molecular inversion probes short tandem repeats (MIPSTR) method specifically targets short tandem repeats (STRs), which makes it possible to detect low-frequency somatic STR variants (21). Transposon insertion sequencing (TN-Seq) provides information about transposon insertion sites (22). In a mutant population, it can identify gene disruptions and thereby reveal suppressors or other mutations. Retrotransposon capture sequencing (RC-Seq) works on a similar principle and has been applied to analyze HCC samples and identify the activation of oncogenic pathways (23).
SNPs and/or single-nucleotide variants (SNVs) can also be detected by sequencing methods based on restriction enzyme digestion, such as restriction site-associated DNA sequencing (RAD-Seq), specific locus amplified fragment sequencing (SLAF-Seq), and restriction site-associated DNA capture (Rapture) (24-26).
Gene expression can also be regulated by methylation patterns in CpG regions and/or promoter regions (Fig. 1B). DNA methylation is one form of epigenetic modification. It regulates gene expression through changes in methylation and demethylation status, especially in the CpG and promoter regions of target genes. Therefore, many sequencing techniques have been developed to detect methylation patterns in the genome. Whole genome bisulfite sequencing (WGBS) is the most popular tool for identifying methylated cytosines across whole-genomic DNA, and bisulfite amplicon sequencing (BSAS) and reduced representation bisulfite sequencing (RRBS) are also used to identify DNA methylation (27, 28). Another method is methylase-assisted bisulfite sequencing (MAB-Seq), which allows quantitative mapping of both 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC), two marks that indicate demethylation events (29).
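As an illustration of how bisulfite-based methods are read out, the sketch below computes a per-cytosine methylation level. It relies on the general principle that bisulfite treatment converts unmethylated cytosines so that they are read as T, while methylated cytosines are still read as C. The function name, the coverage threshold, and the toy counts are hypothetical and chosen only for illustration; they are not taken from the cited studies.

```python
def methylation_levels(counts, min_coverage=5):
    """Estimate the methylation level at each cytosine position.

    counts maps a (chromosome, position) key to a pair
    (reads_with_C, reads_with_T) observed after bisulfite conversion.
    Positions with fewer than min_coverage reads are skipped because
    their estimates would be unreliable (the threshold is an assumption).
    """
    levels = {}
    for site, (c_reads, t_reads) in counts.items():
        coverage = c_reads + t_reads
        if coverage < min_coverage:
            continue
        # Fraction of reads that still show C, i.e. were protected by methylation
        levels[site] = c_reads / coverage
    return levels


# Toy read counts (hypothetical values)
toy_counts = {
    ("chr1", 1000): (18, 2),   # mostly methylated
    ("chr1", 1050): (1, 19),   # mostly unmethylated
    ("chr1", 1100): (2, 1),    # too few reads, skipped
}
levels = methylation_levels(toy_counts)
for site, level in sorted(levels.items()):
    print(site, round(level, 2))
```

Dedicated bisulfite aligners and methylation callers handle strand, conversion efficiency, and sequencing errors, which this sketch deliberately ignores.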
Several methods detect genome-level events such as DNA replication and DNA strand breaks (Fig. 1C). DNA replication is fundamental to cellular life and serves as important evidence of various forms of genome regulation. Many techniques have therefore been introduced to screen for the initiation sites of DNA replication. Repli-Seq maps sequences of newly replicated DNA to the phases of cell division, which helps to validate active DNA replication origins (30). Similarly, Bubble-Seq, nascent strand sequencing (NS-Seq), and nascent strand capture and release (NSCR) can be used to verify the origins of DNA replication (31-33).
DNA strand breaks can also be detected by sequencing. Single strand break sequencing (SSB-Seq) reveals single-strand breaks by directly detecting pathological and physiological breaks in DNA. Double strand break sequencing (DSB-Seq), Break-Seq, and breaks labeling, enrichment on streptavidin and next-generation sequencing (BLESS) make it possible to find double-strand breaks (DSBs) on a genome-wide scale (34, 35). Genome-wide unbiased identification of DSBs enabled by sequencing (GUIDE-Seq), another way to detect DSBs, relies on the integration of double-stranded oligodeoxynucleotides into DSBs (36).
Physical access to DNA is an important property of chromatin that plays a crucial role in cellular characteristics (Fig. 1D). Chromatin structure can be analyzed by MNase-Seq and methidiumpropyl-EDTA sequencing (MPE-Seq), which are based on the observation of nucleosomes, and by covalent attachment of tags to capture histones and identify turnover (CATCH-IT), which measures nucleosome turnover and disruption using metabolic labeling followed by capture of newly synthesized histones (37-39). In addition, DNase-Seq, formaldehyde-assisted isolation of regulatory elements sequencing (FAIRE-Seq), and transposase hypersensitive sites sequencing (THS-Seq) can be used to reveal genomic accessibility and open chromatin structure by representing nucleosome positioning and occupancy (40-42). Assay for transposase-accessible chromatin sequencing (ATAC-Seq) relies on a hyperactive Tn5 transposase that acts at accessible regions of the genome. Because proteins can bind to open chromatin regions, active chromatin regions are used to infer candidate protein binding regions. DNA-protein interactions can be mapped by ChIP-Seq, chromatin immunoprecipitation-exonuclease digestion (ChIP-Exo), Chem-Seq, and systematic evolution of ligands by exponential enrichment sequencing (SELEX-Seq) (43-45).
The 3D structure of the genome in the nucleus is related to gene expression, and its importance has been steadily increasing. Many scientists are therefore working to reveal chromatin looping and physical interactions (Fig. 1E). Various techniques have been developed to illustrate the structure of chromatin, such as ChIA-PET, Hi-C, Capture-C, tethered conformation capture (TCC), and 4C-Seq (46-49). Chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) incorporates a ChIP-based technique and has been used to build a new model of CTCF function in organizing chromosome structure, regulating gene transcription, and linking enhancers to promoters (50).
At the level of transcription, the measurement of gene expression status is referred to as gene expression profiling (Fig. 1F). Here, gene expression levels are usually compared between specific conditions. RNA-Seq is a commonly used technique that examines the whole transcriptome for gene expression patterns (51). RNA-mediated oligonucleotide annealing, selection and ligation sequencing (RASL-Seq) and Capture-Seq are similar techniques for quantifying gene expression levels (52-54). Non-coding RNAs (ncRNAs) have emerged as additional regulators of gene expression. MicroRNAs, one class of ncRNAs, also play an essential role in controlling gene expression levels and are detected by miRNA-Seq (55). To profile transcriptional diversity in single cells, massively parallel RNA sequencing (MARS-Seq), cell expression by linear amplification sequencing (CEL-Seq), and DROP-Seq have been used (56, 57). Several techniques focus on specific regions of RNA. Cap analysis gene expression sequencing (CAGE-Seq) and simultaneous mapping of RNA ends sequencing (SMORE-Seq) can be used to locate transcription start sites and measure gene expression levels (58, 59). Transcript leader sequencing (TL-Seq) sequences the 5' UTR, and TAIL-Seq reveals the 3' ends of RNAs (60). In addition, TAIL-Seq makes it possible to estimate poly(A) tail length (61).
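The comparison of expression levels between conditions described above can be sketched as follows. This minimal example normalizes raw read counts to counts per million (CPM) and then computes a log2 fold change per gene. The gene names, counts, and pseudocount are hypothetical, and the approach is a simplified stand-in: studies with biological replicates would normally use a dedicated statistical method rather than a plain ratio.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(reference, condition, pseudocount=1.0):
    """Return log2(condition / reference) per gene on CPM-normalized counts.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for genes with no reads in one of the samples.
    """
    cpm_ref = counts_per_million(reference)
    cpm_cond = counts_per_million(condition)
    return {
        gene: math.log2((cpm_cond[gene] + pseudocount) / (cpm_ref[gene] + pseudocount))
        for gene in cpm_ref
    }


# Toy read counts for three hypothetical genes in two samples
control = {"geneA": 100, "geneB": 300, "geneC": 600}
disease = {"geneA": 400, "geneB": 300, "geneC": 300}

lfc = log2_fold_change(control, disease)
for gene, value in lfc.items():
    print(gene, round(value, 2))
```

In this toy data, geneA is higher in the disease sample, geneB is unchanged, and geneC is lower, so the signs of the log2 fold changes are positive, near zero, and negative, respectively.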
Some proteins regulate RNA by binding to it (Fig. 1G). RNA polymerase and other proteins interact with the unwound DNA strand to produce RNA transcripts. To analyze this process, precision nuclear run-on sequencing (PRO-Seq) maps the sites of active RNA polymerase, while bromouridine sequencing (Bru-Seq) and global run-on sequencing (GRO-Seq) capture nascent RNA transcripts to analyze the synthesis and stability of RNAs (62, 63). GRO-Seq has also been used to identify enhancer RNAs (64). To predict the protein binding sites of RNAs, RNA immunoprecipitation sequencing (RIP-Seq) and targets of RNA binding proteins identified by editing (TRIBE) are used to determine RNA-protein associations and identify the target RNA sequences of RNA binding proteins (RBPs) (65-67). After transcription, ribosomes interact with RNAs for protein synthesis. Ribo-Seq is a ribosome profiling technique that determines the location of ribosomes on mRNAs during translation (68). Translating ribosome affinity purification sequencing (TRAP-Seq) is another method, used to identify translating mRNAs and profile cell type-specific translatomes (69).
RNA methylation is another mechanism of regulating gene expression epigenetically at the transcriptional level (Fig. 1H). Since such RNA modifications were discovered in cancer, many studies have been carried out to characterize the methylation patterns of RNAs (70). Methylated RNA immunoprecipitation sequencing (MeRIP-Seq) was developed to detect m6A-methylated RNA, and miCLIP also identifies m6A locations (71, 72). RNA degradation can also be detected using sequencing techniques. Parallel analysis of RNA ends sequencing (PARE-Seq) was developed to identify microRNA cleavage sites in degrading RNA, and genome-wide mapping of uncapped and cleaved transcripts (GMUCT) was introduced to discover uncapped and cleaved transcripts (73, 74). The secondary structure of RNA is now also a focus for understanding RNA modification between transcription and translation, and several techniques have been published for this purpose. Selective 2'-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq) is an RNA structure analysis technique (75). Additionally, Structure-Seq and parallel analysis of RNA secondary structure sequencing (PARS-Seq) probe RNA secondary structures on a genome-wide scale (76, 77). These techniques can simultaneously measure secondary and tertiary structure information at single-nucleotide resolution for many RNA molecules of arbitrary sequence.
Because liver disease can be caused by a variety of factors, such as viruses and alcohol, treatment differs depending on the cause. Liver disease develops through several stages over a long time. Unfortunately, patients are often asymptomatic and can remain unaware of their condition until the late stages of the disease. Chronic liver disease is characterized by progressive hepatic fibrosis, which leads to cirrhosis, HCC, and liver failure, often requiring liver transplantation. The liver can return to a favorable state only in the early stages of the disease (10), which is why biomarkers are needed for early diagnosis. Many studies have examined the differences between cirrhosis and cancer, which are the late stages of the disease (10). It is therefore necessary to discover biomarkers capable of detecting liver disease at an early stage by comparatively analyzing specific markers for each stage of liver disease.
NGS techniques have been widely used to identify functional mechanisms and novel biomarkers in diverse diseases (3,6,17,78-82). In previous studies, significant characteristics of different tissue or cell states in disease, development, or specific conditions have been identified with a single type or multiple types of NGS data. Biomarkers for liver disease identified with NGS techniques are summarized in Table 1. Genome sequencing is one of the most popular approaches for identifying genomic mutations and elucidating the mechanisms of disease. Analysis of mutations using WGS or WES makes it possible to predict disease progression or discover influential driver genes. Marker genes have been identified by mining somatic mutation patterns that vary with disease state. For the progression to HCC, in which normal hepatocytes become carcinoma cells, genetic alterations were analyzed using WES data to verify the disruption of cellular pathways related to cancer occurrence and to identify driver genes (83). Genomic variations have also been observed in disease stages before tumorigenesis. Hepatitis virus infection, alcohol abuse, and non-alcoholic steatohepatitis (NASH) are commonly known causal factors of liver disease before tumorigenesis, and they ultimately lead to cirrhosis, a stage of liver fibrosis (6, 84, 85). A WGS study focused on cirrhosis derived from two chronic liver disease states, alcohol-related liver disease (ARLD) and non-alcoholic fatty liver disease (NAFLD) (6). The authors observed heterogeneity in somatic mutations, and the results suggested that chronic liver disease is associated with increased mutation rates, complex structural variation, and few mutations targeting known HCC genes (6). Another study focused on chronic liver disease states before cirrhosis (84).
Recurrent mutations in chronic liver disease tissue were found through WES analysis (84). This provided evidence that somatic mutations are closely related to liver fibrosis stage and specific mutations –
In addition, large-scale studies are underway to confirm the correlation between genomic abnormalities, including mutations, and liver disease (3, 6). For example, copy-number variations (CNVs) in 38 types of cancer were identified by the Pan-Cancer Analysis of Whole Genomes (PCAWG) consortium, which analyzed 2,658 cancers, and the results suggested that CNVs could be used as diagnostic markers in the early stages of cancer (86). In addition, somatic mutations found in HCC were related to highly expressed liver-specific genes, providing evidence of liver tumorigenesis (7). Another study showed that genomic markers of liver cancer can also be identified by genomic subtyping with WGS (3). The authors determined the correlation between single-nucleotide variation (SNV) load and two types of heterozygosity mutations, gain-of-heterozygosity (GOH) and loss-of-heterozygosity (LOH), by categorizing the SNV loads of 110 liver cancers obtained by paired blood-tumor WGS (3). They also showed that recurrent somatic survival-related CNVs (srCNVs) are linked to LOHs, as they are more relevant to short survival in HCC (3). WGS analysis combined with prognostic survival analysis indicated that malignant cancers tend to have large numbers of SNVs, LOHs, and CNVs (3). Based on these results, SNV load, LOH%, Signature a%, and srCNV were proposed as genomic markers (3).
Furthermore, a transcriptome study of the liver identified differentially expressed genes between NAFLD and HCC by analyzing RNA-Seq data (87). Co-expression analysis between NAFLD and HCC was performed with a focus on FANS, a putative key regulatory gene in the progression and development of several disease stages, and the results confirmed that the expression levels of PCSK9, PNPLA3, and PSCK9 were associated with disease severity (87). Similarly, non-coding RNAs significant for disease progression, such as microRNAs, have been found using sequencing techniques such as small RNA-Seq and miRNA-Seq (88-90). These techniques have also been applied together with RNA-Seq (91-93).
Although NGS data have accumulated as sequencing methods have advanced, they are still not sufficient to explain all biological phenomena. Integrating multiple NGS data types can therefore reveal further biological meaning. However, designing an integrative analysis is not simple, because research purposes are diverse and the design can make or break the overall study. How to integrate multiple sequencing data sets has thus recently become a central question in research on molecular mechanisms. Previous studies have adopted integrative analysis to understand the pathology of carcinoma and other diseases of the liver.
Although genomic sequencing has revealed relations between mutations and diseases, there are still many limitations in understanding the gene regulation relevant to disease progression. Beyond the importance of genome analysis for verifying mutations, estimating gene expression levels is essential for research on the mechanisms of gene regulation. Thus, most integrative sequencing studies have been based on RNA-Seq to obtain transcriptomic information and to uncover unexplored aspects of different cancers or complex diseases in specific organs (82,94-100). Through integrative analysis, previously proposed theories have been confirmed by combining RNA-Seq with other sequencing techniques, such as WGS, small RNA-Seq or miRNA-Seq, ATAC-Seq or DNase-Seq, ChIP-Seq, and/or WGBS or RRBS (Fig. 2A). Integrative analysis of RNA-Seq and genome sequencing, including WGS and WES, has confirmed the correlation between altered gene expression levels and genetic variations (2, 13, 82, 101). As mentioned above, RNA-Seq combined with small RNA-Seq or miRNA-Seq, which reveal the regulation of non-coding RNAs, can also relate that regulation to gene expression levels (91-93). Enhancer formation and activity are also related to gene expression levels, and this relationship has been studied with RNA-Seq, ATAC-Seq or DNase-Seq, and ChIP-Seq (14). Combining RNA-Seq, ChIP-Seq, and Hi-C or TCC data makes it possible to show that chromatin structural modifications and enhancer activities contribute to alterations in gene expression levels (11, 12, 98). RNA-Seq and bisulfite sequencing (BS)-related techniques, such as WGBS and RRBS, can be used to find the inverse correlation between gene expression levels and methylation patterns in CpG and/or promoter regions (15, 82).
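The inverse relationship between promoter methylation and expression mentioned above can be checked, in its simplest form, with a correlation coefficient computed across samples. The sketch below uses a plain Pearson correlation on made-up values for one hypothetical gene. It is meant only to illustrate the idea and is not the method used in the cited studies, which may rely on other statistics and corrections.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for one gene across five samples:
# promoter methylation level (0 to 1) and expression (arbitrary log scale)
promoter_methylation = [0.9, 0.7, 0.5, 0.3, 0.1]
expression = [1.0, 2.5, 4.0, 6.5, 8.0]

r = pearson(promoter_methylation, expression)
print(round(r, 3))  # strongly negative for this toy data
```

A strongly negative coefficient for a gene is consistent with methylation-associated silencing, but correlation alone does not establish a regulatory mechanism.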
For instance, related studies have reported strong correlations between the genome and the transcriptome (Fig. 2B) (2,13,101-103). These showed that somatic variations caused over-expression of oncogenes in HCC and emphasized the necessity of integrative analysis (101). In another attempt to integrate genome and transcriptome sequencing in HCC, differentially expressed genes (DEGs) were found in large CNV segments, and functional analysis was performed to examine the results (2). TTK, a protein kinase related to p53 signaling, was identified by this integrative analysis as a prognostic marker in HCC (2). For therapeutic purposes, integrative analysis was also performed to discover alternatives to sorafenib, an established HCC drug whose use turned out to be limited by high toxicity, based on a clinical trial (13). The multi-omics analysis included genetic, transcriptomic, and proteomic data from 34 liver cancer cell lines (LCCLs), including the hepatoblastoma lines HepG2 and Huh6 (13). Genome analysis with WES was conducted to validate the similarity of genetic alterations between LCCLs and HCC, and expression patterns from miRNA and mRNA analysis were integrated by elastic net regression (13). The integrated results were used to predict drug sensitivity, followed by the identification of molecular markers (13). In sum, integrative analysis of genetic, transcriptomic, and proteomic profiles was performed to find novel candidate therapeutic markers in HCC (13). The results were combined with single agents, validated combinations, or drug screening involving compounds that were already approved or in clinical development. This opened the possibility of applying the identified markers in clinical trials (13).
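For reference, elastic net regression in its common form fits coefficients by minimizing a least-squares loss with a combined L1 and L2 penalty. The notation below is a generic textbook parameterization assumed here for illustration: y is the response (for example, a drug sensitivity measure), X is the feature matrix (for example, miRNA and mRNA expression), n is the number of samples, and lambda and alpha are tuning parameters. The exact formulation and settings used in the cited study are not given in this review.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  \;+\; \lambda \left( \alpha\,\lVert \beta \rVert_1
  \;+\; \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right),
  \qquad \lambda \ge 0,\quad 0 \le \alpha \le 1 .
```

With alpha = 1 the penalty reduces to the lasso and with alpha = 0 to ridge regression. Intermediate values keep the sparsity of the lasso while handling groups of correlated features, which is one reason the method suits high-dimensional expression data.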
Data integration approaches that include the epigenome, which acts independently of the DNA sequence, have been increasingly used to study the various factors that influence gene regulatory mechanisms in disease (97). Integrative analysis based on transcriptomic data has been conducted with varied combinations of epigenome data to characterize disease progression. Gene regulation by environmental factors can be explained by epigenetic modifications observed in sequencing data. Representative epigenetic factors include interactions of transcription factors (TFs) with specific genomic regions, DNA methylation patterns, histone acetylation or methylation, and the formation of chromatin loops. We summarize previously published studies that integrate epigenetic data with transcriptomic data, broadly categorized into three classes: methylome, chromatin modification, and chromatin structure. In those studies, individual sequencing methods were selected depending on the particular purpose of each study.
Recently, integrative analysis of the transcriptome and methylome was carried out to identify biomarkers in osteoporosis (82). Multi-omics data, including the transcriptome, methylome, and metabolome, were integrated by sparse multiple discriminative canonical correlation analysis (SMDCCA), a multivariate integration method that searches for an optimal linear combination of features, and the integrated data were combined with the genome (82). SMDCCA is a valid method for finding potential biomarkers. The integrated result was evaluated against pre-integration data of 1,5994 DEGs with 1,219 differentially methylated regions (DMRs) and 204 differentially methylated positions (DMPs) to investigate potential causal effects of the previously found biomarkers (82).
NGS technology has advanced dramatically over the past two decades, and many attempts have been made to adapt and improve the techniques for specific purposes. As a result, a wide variety of NGS-based methods have emerged to investigate genomic and epigenetic regulation of biological phenomena in greater detail. These discoveries offer a new approach to identifying novel biomarkers. In contrast to earlier biomarkers uncovered by radiology, sequencing data provide more convincing evidence at a microscopic level. Although the number of novel biomarkers for diverse diseases is growing with the abundance of sequencing data, data analysis procedures have not yet been systematically established. Even with the same type of data, integration depends considerably on the analysis tools (100). Integrative analysis therefore still requires careful consideration of how to integrate sequencing data with different properties and how to handle their huge size. Most previous studies relied on traditional statistical methods for data analysis. However, given the speed at which bulk sequencing data are now generated, many studies have begun to apply computational algorithms such as artificial intelligence (AI) (97). AI is an emerging trend in data analysis, especially for classification. With NGS analysis results as input, the iterative modeling process of AI makes it possible to classify samples into disease stages and can also suggest significant genes as biomarkers. Advanced integrative analysis of NGS data and more elaborate AI modeling will allow us to discover novel biomarkers not seen before.
This research was supported by the Collaborative Genome Program for Fostering New Post-Genome Industry of the National Research Foundation (NRF), funded by the Ministry of Science and ICT (MSIT) (NRF-2017M3C9A6044519).
The authors have no conflicting interests.