
The approximately two-meter linear length of the human genome is intricately compacted to fit within a nuclear space roughly 10 μm in diameter. During this process, the genome is folded, looped, and entangled to form a sophisticated 3-dimensional structure, which we call 3D genome folding. Although the precise geometrical configuration of 3D genome folding remains elusive, advances in molecular technologies have revealed that: 1) the 3D genome is organized in a hierarchical order, and 2) the 3D genome plays a key role in gene transcription, eventually controlling crucial steps of disease development.
The base of the hierarchical structure of the 3D genome is the nucleosome fiber, which is subsequently folded to form loops. During loop formation, the fiber dynamically interacts with structural proteins such as CTCF and cohesin. The loops facilitate interactions between two non-adjacent loci, which can be several hundred kilobases to megabases apart (1). The next class of structure is the Topologically Associating Domain (TAD). A TAD is an empirically defined region in which interactions between loci inside the domain are significantly more frequent than interactions with loci outside the domain (2). The class further up comprises compartments A and B, which take a globular form and represent transcriptionally active and inactive portions of a chromosome, respectively (3).
Rapid technical advancement has led to the characterization of 3D genome folding. In the pre-NGS era, scientists were able to identify interactions between specific loci of interest using fluorescence in situ hybridization (FISH) under a microscope or using molecular data from 3C with PCR (4). After NGS was introduced, its massively parallel sequencing capability enabled 3C, 4C, 5C, and Hi-C to map 3D genome folding at high resolution (3-8). The evolution of these technologies is still ongoing. For example, adopting micrococcal nuclease (i.e., micro-C) together with extremely deep NGS coverage creates super-resolution 3D genome folding maps, down to sub-kilobase resolution (9). The development of super-resolution microscopic imaging combined with high-throughput FISH (e.g., Optical Reconstruction of Chromatin Architecture, or ORCA) allows us to look at 3D genome folding in individual cells (10, 11).
In this review, we provide a comprehensive view of 3D genome folding and its implications in diseases. We discuss how each hierarchical class of 3D genome folding is formed and how molecular technologies have been developed to characterize 3D genome folding in the NGS era. We also discuss diseases that are known to involve, or potentially involve, the 3D genome.
Phenotypes are determined by the combined effects of the genome and the environment. For example, although identical twins share inherited genomic information, they differ in appearance and in susceptibility to diseases. This is often interpreted to mean that the environment affects the expression of specific genes and, eventually, the phenotypes. Such environmentally driven changes, which act in addition to the genome sequence, are termed epigenetics (12).
In the late 2000s, NGS brought a ground-breaking understanding of epigenetics. To understand the functions and characteristics of the genome beyond its sequence, scientists combined traditional molecular methods with NGS and identified the genome-wide distribution of epigenetic marks. Chromatin immunoprecipitation sequencing (ChIP-seq) targeting specific histone modifications has mapped the locations of these marks and characterized their functions in the genome, such as H3K27ac as a mark of active enhancers (13) and H3K4me3 as a mark of active promoters (14). The bisulfite method has revealed the distribution of methyl groups on DNA (15). Since MNase and DNase preferentially digest more accessible regions of the genome, MNase-seq and DNase-seq have been used to identify open and closed chromatin conformations (16, 17). Adopting the Tn5 transposase further revolutionized the technology: the resulting method, called ATAC-seq, maps chromatin accessibility from very low input cell numbers, or even at the single-cell level, within just a few hours of hands-on time (18). Epigenetic studies have become more comprehensive and systematic as many consortia have identified epigenetic marks in most cell types. For example, the ENCODE, Roadmap Epigenomics, International Human Epigenome Consortium, and FANTOM projects have marked millions of candidate regulatory elements and their relations with genes (19-22). The 1000 Genomes Project (23, 24) and The Cancer Genome Atlas (TCGA) (25) have enrolled large populations of healthy individuals and cancer patients, respectively, to map disease-implicated epigenetic marks. Despite these huge advances in epigenetic studies, there is still a gap in knowledge that traditional epigenetics cannot explain.
The gap comes from the limits of understanding biology on the linear scale of the epigenome. Some transcription elements (e.g., enhancers) and their functional target genes are located far from each other in the linear genome, separated by several hundred kilobases or even megabases. Therefore, a model of dynamic chromatin tethering has been introduced (26): non-histone structural proteins tether specific loci together, where transcription factors and other elements assemble into a large transcription machinery for active transcription. Molecular approaches such as co-immunoprecipitation (co-IP) and PCR have shown the sophisticated interactions of transcription factors with distal enhancers. Optical techniques such as FISH and chromosome painting have visually validated the existence of chromosome territories (27) and suggested that transcriptionally active regions are located in the central part of the nucleus, whereas inactive regions reside at the periphery (28).
Further development of molecular approaches combined with NGS has introduced a new view of the 3D genome. Dekker and his team first introduced Chromosome Conformation Capture (3C), which digests DNA and ligates two fragments that are in proximity in 3D nuclear space (4). The first 3C was combined with quantitative PCR, which showed PCR bands representing two ligated restriction fragments in proximity in 3D space. Although this technology was already capable of revealing 3-dimensional interactions between specific loci, it became even more ground-breaking when combined with NGS. The 3C method was modified and combined with massively parallel sequencing (i.e., NGS) to show a comprehensive geometrical conformation of the entire genome on 2-dimensional heatmaps, a method now called Hi-C (3). Although the resolution of the first Hi-C was just 1 Mb per pixel, it was powerful enough to identify the sophisticated organization of chromatin into globular structures called “compartments”, a sub-chromosomal class of the hierarchical structures (3). As Hi-C was further improved and modified, and as high-coverage NGS became more accessible, the resolution and precision of 3D genome folding maps have increased dramatically, enabling the investigation of sub-compartment structures such as TADs and loops (2, 3, 8).
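As a rough illustration of how ligation junctions become a 2-dimensional heatmap, the sketch below bins pairs of genomic coordinates into a symmetric contact matrix at a chosen pixel size. The function name `bin_contacts`, the coordinates, and the 3 Mb toy chromosome are hypothetical and are not taken from the cited studies; real Hi-C pipelines also filter, deduplicate, and normalize reads, which is omitted here.

```python
def bin_contacts(pairs, chrom_length, bin_size):
    """Count ligation junctions per (bin_i, bin_j) on a single chromosome."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # mirror off-diagonal contacts to keep the map symmetric
    return matrix


# Hypothetical ligation junctions in base-pair coordinates (not real data).
toy_pairs = [(100_000, 250_000), (120_000, 2_900_000), (2_100_000, 2_950_000)]

# 1 Mb per pixel, the resolution quoted for the first Hi-C maps.
heatmap = bin_contacts(toy_pairs, chrom_length=3_000_000, bin_size=1_000_000)
for row in heatmap:
    print(row)
```

Shrinking `bin_size` raises the nominal resolution but spreads the same reads over many more pixels, which is one way to see why higher-resolution maps depend on deeper sequencing.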
As the perspective of epigenomic research has expanded to 3D, its implications for pathological conditions have been actively investigated. For example, the complex chromatin networks around the oncogene MYC and its super-enhancers are key players in cancer development (29). Subsequent studies also revealed that super-enhancers and their surrounding chromatin circuitry determine cell fates during development (30, 31). Additional evidence for the disease implications of 3D genome folding has been extensively reported in neurodegenerative (32), musculoskeletal (33), hematologic (34), and oncologic (35) diseases. To analyze and organize the data more systematically, the 4D Nucleome (4DN) consortium was launched to elucidate the complex structures and organization of the 3D genome and the impact of their disruption on disease biology (36).
The basic structural and functional unit of the 3D genome is the nucleosome. Double-helical DNA wraps around an octamer of histones, spanning ∼146 bp, to form a mononucleosome (Fig. 1A). Mononucleosomes are joined by intervening “linker” DNA and make a ∼10 nm thick fiber that looks like “beads-on-a-string” (37-40). The most common histone octamer in a typical mononucleosome is composed of H2A, H2B, H3, and H4 (39, 41), although there are histone variants with distinct functions such as kinetochore formation (CENP-A) or transcriptional control (H2A.Z) (42, 43). Linker histone H1 is not part of the nucleosome beads. Instead, it binds to the linker DNA between nucleosomes and facilitates chromatin compaction, resulting in a 30 nm fiber architecture (44).
Histone modifications and DNA methylation are the two most studied epigenetic marks. Histone modification is a post-translational modification at specific amino acids of histone tails. Histone-modifying enzymes such as acetyltransferases, deacetylases, and methyltransferases (e.g., HAT, HDAC, PKMT) act on lysine residues of histone tails to promote loosening or compaction of the nucleosome fibers, which is referred to as the open or closed conformation (i.e., euchromatin or heterochromatin), respectively (45-48). On the other hand, DNA methyltransferase (DNMT) adds a methyl group to cytosine bases of DNA, and ten-eleven translocation (TET) enzymes reverse it (49). DNA methylation preferentially occurs in CpG islands, which indicate active transcription modulator sites or promoters (50, 51). DNA methylation at these sites therefore obstructs the binding of RNA polymerase II or transcription factors to the promoters and reduces transcription activity (52).
The nucleosome fiber is subsequently folded to form a loop, in which two distal loci are brought together. Loop formation depends on the interplay of non-histone structural proteins. One of the main players is CTCF, a highly conserved zinc finger protein. CTCF recognizes a specific motif sequence that determines the directionality of its DNA binding (Fig. 1B) (53). The orientation of CTCF-DNA binding is critical in loop formation, in which two CTCFs facing each other dimerize and form a loop with high affinity (Fig. 1B) (53). Another main player is cohesin, an SMC protein complex that forms a ring-like structure (54-57). Cohesin binds to the DNA with the aid of NIPBL (58), traps the DNA inside the hollow of its ring-like structure, and slides along the DNA until it meets an insulation site formed by CTCF dimers (Fig. 1B) (59, 60). As cohesin slides along, DNA is extruded and forms dynamic looping structures (Fig. 1B).
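The extrusion-and-stalling behavior described above can be sketched as a toy one-dimensional model. This is a deliberately simplified illustration under stated assumptions, not the mechanism established in the cited work: the lattice, the `'>'`/`'<'` orientation encoding, the perfectly impermeable CTCF sites, and the function name `extrude_loop` are all hypothetical.

```python
def extrude_loop(n_sites, ctcf, load_site, max_steps=1000):
    """Toy loop extrusion on a lattice of n_sites positions.

    ctcf maps a site index to '>' (motif pointing right) or '<' (motif
    pointing left). Assumption: the left leg of cohesin stalls only at a
    '>' site and the right leg only at a '<' site, so a stable loop is
    anchored by two motifs that face each other.
    """
    left = right = load_site
    for _ in range(max_steps):
        left_blocked = left == 0 or ctcf.get(left) == ">"
        right_blocked = right == n_sites - 1 or ctcf.get(right) == "<"
        if left_blocked and right_blocked:
            break
        if not left_blocked:
            left -= 1   # left leg reels in DNA toward lower coordinates
        if not right_blocked:
            right += 1  # right leg reels in DNA toward higher coordinates
    return left, right


# Hypothetical CTCF sites: the '>' motif at 60 does not stop the right leg.
toy_ctcf = {20: ">", 60: ">", 80: "<"}
anchors = extrude_loop(n_sites=100, ctcf=toy_ctcf, load_site=40)
print(anchors)  # (20, 80)
```

In this sketch the loop anchors depend only on motif orientation and on where cohesin loads, so removing a site or flipping its orientation changes which loci end up paired.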
On 2D Hi-C heatmaps, loops typically appear as punctate corner dots, indicating high contact frequencies between two focal loci (Fig. 2). In addition, because cohesin slides along the DNA, loop extrusion appears as a stripe pattern, like a long vertical or horizontal tail extending from the punctate dots (Fig. 2) (8). Auxin-induced cohesin depletion showed that the loops disappeared although CTCF remained intact, and that the loops rapidly recovered once cohesin was restored (61), suggesting that loop extrusion by cohesin is necessary for loop formation. However, the precise mechanism behind the interplay between insulation and loop extrusion remains elusive.
More structural proteins are involved in loop formation, such as Yin Yang 1 (YY1) and condensin. YY1 is a ubiquitous zinc-finger transcription factor that dimerizes between active enhancers and promoters, analogous to CTCF-mediated loops (62-64). When mouse embryonic stem cells were differentiated into neural progenitor cells (NPC), YY1 facilitated the interactions between NPC-specific promoters and enhancers (65). When YY1 was knocked down, however, NPC-specific loops were abolished (65), suggesting a key role of YY1-mediated loops in cell development through the control of cell-type-specific gene expression. Condensin is another SMC protein complex whose primary role is to control chromosome condensation during mitosis (66). Although condensin I was originally thought to be localized in the nucleus only during mitosis, new evidence has shown that ∼10% of condensin I remains associated with DNA during interphase and organizes chromatin interactions and gene expression (67, 68). The structure of the condensin complex is controversial. One hypothesis is that condensin forms a ring-like structure that allows it to slide along the DNA for loop extrusion (69, 70), reminiscent of cohesin, another member of the SMC family. Another hypothesis is that condensin forms a coiled-coil between two core subunits, producing a rod shape without room for the DNA to slide through (71-74). Although further investigation is needed, condensin’s function in demarcating the borders of 3D genome domains (i.e., TADs) suggests a role in 3D genome organization (75).
The loops come together and make TADs, in which the interaction frequencies between loci within the demarcated domain are higher than those with loci outside the domain (Fig. 2). The term “TAD” was originally coined from computational analysis when Hi-C heatmap resolution was limited to > 40 kb per pixel. Thus, the original TAD sizes ranged from tens of kilobases to a few megabases (2). The geometrical morphology of TADs remains elusive. A compelling hypothesis is that TADs are globular in shape with interconnecting boundaries (76). As technologies have improved and higher-resolution maps of the 3D genome have become available, more complex domain structures such as nested TADs or subTADs have been identified (8, 77, 78). Although subTADs resemble TADs, they are smaller and are located within their nesting TADs (Fig. 2). SubTADs either share their boundaries with their nesting TADs or have their own boundaries (Fig. 2). Since micro-C allows analysis of the 3D genome at sub-kilobase resolution, micro-TADs have recently been suggested (78, 79), reflecting current technical advances and their power to detect genome structures. However, the functional and kinetic relationships between TADs and subTADs are poorly understood.
The most distinct characteristic of a TAD is its demarcating boundaries (Fig. 2). TAD boundaries are commonly enriched with structural proteins such as CTCF and cohesin. Therefore, the loop extrusion model is often used to explain how TADs are formed: the ring-like SMC protein complex slides along the DNA until it reaches an insulation site bound by CTCF. This view is supported by the stripe patterns that are often observed at the edges of the domains where cohesin is bound (Fig. 1B).
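One common way to locate such boundaries computationally is an insulation-style score: for each position, average the contacts that cross it within a small window, and look for local minima. The sketch below is a minimal version of that idea on a synthetic map. It is offered as an assumption about how boundaries can be called, not as the method used in the studies cited here, and the function name, window size, and toy values are hypothetical.

```python
def insulation_profile(matrix, window):
    """Mean contact frequency crossing each candidate boundary.

    The boundary with index i lies between bin i-1 and bin i. Contacts are
    averaged over a window x window square linking the bins just upstream
    to the bins just downstream; low values mean strong insulation.
    """
    n = len(matrix)
    profile = {}
    for i in range(window, n - window + 1):
        values = [matrix[a][b]
                  for a in range(i - window, i)
                  for b in range(i, i + window)]
        profile[i] = sum(values) / len(values)
    return profile


# Synthetic map: two 4-bin domains, 10 contacts within a domain, 1 across.
domain_of_bin = [0, 0, 0, 0, 1, 1, 1, 1]
toy_map = [[10 if domain_of_bin[a] == domain_of_bin[b] else 1
            for b in range(8)] for a in range(8)]

profile = insulation_profile(toy_map, window=2)
boundary = min(profile, key=profile.get)
print(profile)   # {2: 10.0, 3: 5.5, 4: 1.0, 5: 5.5, 6: 10.0}
print(boundary)  # 4
```

The minimum falls exactly where the two synthetic domains meet. On real data the profile is noisy, so a threshold on how deep a minimum must be is usually needed as well.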
TADs function as hotspots of chromatin interconnection and transcription regulation. The most commonly detected boundaries are usually highly conserved across species or lineages, and they adopt an open chromatin conformation to which transcription factors bind along with CTCF (2, 80). Therefore, long-range interactions between active enhancers and promoters are often mediated by TAD boundaries or are located within them. These interactions control gene transcription and, eventually, critical biological phenomena. For example, when TAD boundaries were deleted or inverted using CRISPR-Cas9 at Wnt6, Ihh, Epha4, and Pax3, the gene-enhancer interactions were reorganized, leading to misexpression of genes that affect limb development in mice (31). Functional analysis of individual enhancers around the TADs near Shh reveals that enhancers with similar functions in embryo development are collectively located within the same TADs (81, 82), suggesting that TADs confine the regulatory activities of specific gene functions. Disrupting the CTCF insulation that isolates a gene within a TAD from an active enhancer activates expression of that gene (83). However, disrupting CTCF insulation is not always sufficient to perturb gene function. Rocha and his team switched the orientations of CTCF motifs to reshape the TAD and its boundary locations around Sox2 in embryonic stem cells: inverting, splitting, and merging domains. Despite dramatic topological changes around Sox2, its transcription was not fully abolished (84), suggesting that more complex regulatory mechanisms can overcome the topological barriers. The detailed functions and roles of TADs in transcription remain open questions.
Compartments appear as a plaid pattern on Hi-C maps, with each compartment nesting several TADs (Fig. 2) (3). Because the plaid signals alternate, it is thought that alternating compartments cluster together within each territory. The two resulting clusters are labeled compartments A and B (Fig. 2) (3). Compartments A and B coincide with active and inactive gene markers, respectively (85, 86). Thus, compartment A is thought to comprise transcriptionally active regions and compartment B transcriptionally inactive regions. Interestingly, Phillips-Cremins and her team showed that when CGG short tandem repeats expand at the FMR1 gene in fragile X syndrome, FMR1 and its neighboring genes flip from compartment A to B and the transcription of those genes is vastly reduced (87), suggesting functional and pathogenic roles of compartments. This finding is discussed in more detail later.
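A plaid pattern means that bins fall into two groups whose contact profiles correlate within a group and anti-correlate between groups. One widely used way to recover the two groups is to take the leading eigenvector of the bin-by-bin correlation matrix and split bins by its sign. The sketch below illustrates this on a synthetic 4-bin map using plain power iteration. It is a hedged illustration rather than the procedure of the cited studies: the helper names are hypothetical, distance normalization (observed/expected) is omitted, and the sign of the eigenvector alone does not say which group is A or B, which in practice requires external marks of activity.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long contact profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def leading_eigenvector(symmetric, iterations=100):
    """Power iteration; assumes the start vector is not orthogonal to the answer."""
    n = len(symmetric)
    vec = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        nxt = [sum(symmetric[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec


def split_compartments(matrix):
    """Return one boolean per bin; bins sharing a value share a cluster."""
    n = len(matrix)
    corr = [[pearson(matrix[i], matrix[j]) for j in range(n)] for i in range(n)]
    return [v > 0 for v in leading_eigenvector(corr)]


# Synthetic plaid map: bins 0 and 2 contact each other, as do bins 1 and 3.
toy_plaid = [[9, 1, 8, 1],
             [1, 9, 1, 8],
             [8, 1, 9, 1],
             [1, 8, 1, 9]]
groups = split_compartments(toy_plaid)
print(groups)  # [True, False, True, False]
```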
The mechanisms by which compartments are formed are still elusive. One hypothesis is that regions with similar histone marks co-segregate and form a cluster (3). Another hypothesis is phase separation (88, 89), a biophysical process in which small liquid droplets condense into one large droplet upon certain physiological or biochemical changes in the environment (90, 91). This concept gained attention when Young and his team demonstrated it in vitro. When they isolated the coactivator proteins MED1 and BRD4, which have intrinsically disordered regions (IDR), and changed the salt concentration, phase separation was induced and large droplets of MED1 or BRD4 clusters formed (92, 93). Strikingly, transcription factors such as SOX2 and OCT4 were recruited to these large liquid droplets, suggesting that transcription elements may form a large cluster through phase separation (92, 93). The model suggests that transcription factors and their associated super-enhancers and gene promoters come together and form a large network through IDR aggregation (94). Such a large cluster of transcription machinery drastically boosts transcription activity and determines cell fates (88). There appears to be a biophysical equilibrium between looping and compartmentalization, given that cohesin knockdown strengthened plaid signals while corner dots became fainter, and that plaid signals weakened as cohesin expression recovered (61). Many questions remain unanswered, and further investigation is needed.
To understand structural interactions at or beyond the compartment level, Dekker and his team developed liquid chromatin Hi-C (95). They fragmented chromatin to different sizes before Hi-C and analyzed chromatin stability. Their data showed that chromosomal compartmentalization was retained when the fragment size was > 10-25 kb, suggesting a minimum genomic block size for phase separation and compartment formation (95). However, when the fragment size was < 6 kb, the chromatin lost its subnuclear structure. Further characterization suggests that lamina-associated regions were relatively stable, whereas regions associated with nuclear speckles and polycomb were less stable, and heterochromatin protein-associated regions were somewhere in between (95). This suggests that chromatin dynamics differ between sections of the nucleus. Further characterization of the lamina and speckles with respect to the 3D genome and its structural rigidity will be interesting.
The study of the astonishing structures of 3D genome folding has been growing rapidly. One of the main reasons is the fast development of cutting-edge chromatin-capture technologies (i.e., 3C and its derivatives) that have been combined with NGS since the late 2000s. However, the actual history of 3D genome studies is much longer.
In the 1970s, Paulson et al. took one of the most famous electron microscopy (EM) images in the history of cell biology (96). They biochemically depleted histones from mitotic chromosomes in vitro and took EM images, showing that the nucleosome fibers became disorganized and disheveled, losing the original X-shaped morphology. However, an X-shaped mesh-like structure still remained in the middle of the nucleosome fibers, reminiscent of the shape of mitotic chromosomes (96). Earnshaw et al. isolated that mesh-like structure and found that it is a network of many proteins. They named some of its main components Scc1 and Scc2, which are now called cohesin and condensin, respectively (67, 97), and which are among the main structural proteins of the 3D genome.
In addition to EM and mass spectrometry, cytological approaches such as FISH were also used for key discoveries about the 3D genome, such as chromosome territories. Chromosome territories are distinct spatial segments of the nucleus, each occupied by the DNA of a single chromosome (98-100). The discovery of chromosome territories supports the idea that cis-interactions are predominant and are correlated with transcription activity (101).
In the early 2000s, Dekker developed an innovative molecular method called Chromosome Conformation Capture (3C) (Fig. 3) (4). He digested the genome inside the nucleus using a restriction enzyme and subsequently re-ligated it. This simple but delicate process ligates DNA fragments that are in spatial proximity in the 3D nucleus. In most cases, ligation occurs between fragments that are adjacent in the linear genome. However, fragments with long-range interactions are also positioned in proximity in the 3D nucleus and are ligated at a relatively high frequency as well. Using the ligated product as a template, he performed quantitative PCR with primers targeting the restriction enzyme cut sites. The resulting PCR bands indicate which targeted genomic loci, at a resolution of a few kilobases, are connected (102).
3C brought new biological insights by cataloging chromatin loops and identifying critical mechanisms of transcriptional regulation (103). For example, a 40-60 kb chromatin interaction between the active genes and enhancers around the β-globin locus is present in erythroid cells but not in the brain, where β-globin is not expressed (104). Dekker and his team also used 3C to identify long-range interactions between the CFTR gene and active enhancers in a 460 kb surrounding region (102).
A few years after the invention of 3C, de Laat and his team developed 4C as a 3C derivative. The name “4C” stands for “3C on Chip” because the 4C library was initially analyzed on a microarray (5), which was later replaced with NGS (4C-seq) (105). The experimental approach starts with preparing a 3C library. The 3C library is then trimmed using a second enzyme and re-ligated, resulting in circular DNA (Fig. 3). Primers for inverse PCR are designed to sit on the locus of interest, called the viewpoint, and to amplify the DNA outward from the viewpoint (i.e., the DNA that interacts with the viewpoint) (Fig. 3). The resulting 4C or 4C-seq data provide a quantitative measure of how much each locus interacts with the viewpoint. Comprehensive 4C data helped identify how genes of interest shape the 3D genome architecture together with their interacting regions. For example, an active β-globin gene preferentially interacts with other active genes in fetal liver cells but not in the brain (5, 105). 4C has also been used to identify the differential roles of cohesin and CTCF in shaping chromatin architecture in humans (106).
In the meantime, Dekker and his team invented another 3C-derivative method called 5C (Chromosome Conformation Capture Carbon Copy). While 4C adopts inverse PCR, which enriches all loci specifically interacting with the viewpoint, 5C uses multiplex PCR, which enriches an entire region of interest ranging from a few hundred kilobases to megabases (6). Traditional 5C primers were designed to bind specifically to the sequences next to the restriction enzyme sites with specific orientations: the forward primer binds to the sense strand at the 3’ end of the restriction fragment, and the reverse primer binds to the antisense strand at the 3’ end (6). These 50 bp-long 5C primers are annealed to a 3C library so that a forward and a reverse primer sit adjacently across the ligation junction of a 3C template, facing each other. They are then nick-ligated, resulting in 100 bp-long 5C primer pairs (Fig. 3). These 100 bp-long 5C primer pairs are enriched using their universal tails, typically T7 and T3 sequences, before next-generation sequencing (Fig. 3) (6). The resulting sequencing data are used to count 5C primer pairs, which represent relative interaction frequencies between restriction fragments.
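The final counting step can be pictured as a simple tally of which forward and reverse primers were found ligated together. The following minimal sketch assumes that each sequenced product has already been resolved to a (forward, reverse) primer pair; the primer identifiers and the function name are hypothetical and carry no meaning beyond this illustration.

```python
from collections import Counter


def count_primer_pairs(ligation_products):
    """Tally how often each (forward, reverse) 5C primer pair was sequenced."""
    counts = Counter()
    for forward, reverse in ligation_products:
        counts[(forward, reverse)] += 1
    return counts


# Hypothetical sequenced products, already assigned to primer identifiers.
toy_products = [("FOR_3", "REV_17"), ("FOR_3", "REV_17"), ("FOR_5", "REV_9")]
pair_counts = count_primer_pairs(toy_products)

print(pair_counts[("FOR_3", "REV_17")])  # 2
print(pair_counts[("FOR_5", "REV_17")])  # 0, a pair that was never observed
```

Because each primer marks one restriction fragment, the count for a pair serves as the relative interaction frequency between the two corresponding fragments.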
One significant advantage of 5C is that it expanded our view of 3D genome topology to multiple dimensions. It quantifies interactions between all restriction fragments in the region and displays them on 2-dimensional heatmaps, which are currently the most widely used format for 3D genome folding data (Fig. 2). 5C is also useful in that it provides a high-resolution 3D genome architecture map with limited sequencing coverage. Just a few million sequencing reads can provide data at 1-4 kb resolution (7, 65, 77, 107, 108), which is sufficient to depict the punctate looping signal between Sox2 and a pluripotency enhancer that is only ∼120 kb away (7, 107).
Recently, 5C became more powerful with the addition of double alternating primers and in situ ligation (7). Dekker and his team created a new double alternating primer design, which adds 5C primers (called left forward and left reverse primers; LFOR and LREV, respectively) that bind to the 5’ ends of each restriction fragment, in addition to the conventional 5C primers at the 3’ ends (83), resulting in higher sensitivity for detecting looping signals. In situ ligation, on the other hand, was originally suggested by de Laat for 4C-seq (105, 109) and was later adopted in Hi-C (8); it ligates the restriction fragments within the intact nucleus during 3C library preparation. Combining these two improvements with 5C significantly improved the sensitivity of loop detection and reduced background noise (7).
High-resolution 5C maps have provided many key findings in 3D epigenomics. 5C on mouse embryonic stem cells (mESC) and their derived neural progenitor cells (NPC) has revealed that CTCF, cohesin, and Mediator organize distinct looping interactions at the sub-megabase scale in different cell lineages (77). Another 5C study identified dynamic looping between super-enhancers and Sox2 during mESC differentiation to NPC and its reprogramming to iPSC (107). 5C has been useful for characterizing mESC- and NPC-specific loops organized by CTCF and YY1, respectively (65). High-resolution 5C has also been used to identify the focal loop induced by an optogenetic tool combined with dCas9 technology, called Light Activated Dynamic Looping (LADL) (108).
Shortly after 5C was invented, Dekker published the first genome-wide study using Hi-C (High-throughput chromosome conformation capture). The workflow of Hi-C is similar to that of 3C. However, sticky ends of digested restriction fragments in Hi-C are filled with biotinylated nucleotides before proximity ligation (3). Resulting ligation products are sheared by sonication, and only the fragments retaining the ligation junctions are enriched using streptavidin-biotin pulldown (Fig. 3) (3).
The Hi-C library is sequenced by NGS and analyzed using approaches similar to 5C analysis. The difference is that 5C counts the number of 5C primer pairs, whereas Hi-C counts actual genomic sequences. Thus, one of the keys to successful Hi-C is a high complexity of ligation junctions. Another key difference is that 5C mostly focuses on cis-interactions at high resolution, whereas genome-wide analysis by Hi-C enables analysis of both cis- and trans-interactions. The advantage of genome-wide analysis is that it can depict the compartment structure of the 3D genome hierarchy, which was one of the first genome architectures characterized using Hi-C (3). However, the low resolution of genome-wide analysis remained a technical challenge until the early 2010s.
To overcome this challenge, Lieberman-Aiden and his team made several ground-breaking technical improvements (8). First, they performed in situ ligation as mentioned in the previous section (7, 105), dramatically reducing the required cell numbers, unwanted ligations, and background noise. Second, adopting a 4-cutter restriction enzyme (e.g., MboI) instead of the previous 6-cutter enzyme (e.g., HindIII) made it possible to fragment DNA to a smaller average size, thus increasing the resolution of proximity ligation. Third, they drastically improved the computational pipeline, for example by adopting a donut expected model that rigorously tests looping signals. Combining these technical innovations with very deep sequencing coverage applied to the human trio cell line GM12878 (23), they achieved 1 kb resolution for the heatmap (8). This technology, often called in situ Hi-C, has been widely accepted as the most representative NGS-based chromatin conformation capture method to date. Efforts continue to push data resolution to the sub-kilobase range by adopting micrococcal nuclease or DNaseI for finer genome fragmentation (9, 110), or probe-based hybridization to enrich targeted regions (111-114).
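The effect of switching from a 6-cutter to a 4-cutter can be estimated with simple arithmetic. Under the idealized assumption of a random sequence with equal base frequencies (which real genomes do not satisfy), a recognition site of length n occurs about once every 4^n bases. The sketch below computes this expectation for the MboI (GATC) and HindIII (AAGCTT) recognition sites and performs a toy in silico digest. The function names and the toy sequence are hypothetical, and the cut is placed at the first base of the site as a simplification.

```python
def expected_fragment_length(recognition_site):
    """Expected spacing between sites in a random, equal-frequency sequence."""
    return 4 ** len(recognition_site)


def digest(sequence, recognition_site):
    """Fragment lengths after cutting at the start of every recognition site."""
    cut_positions = []
    start = sequence.find(recognition_site)
    while start != -1:
        cut_positions.append(start)
        start = sequence.find(recognition_site, start + 1)
    edges = [0] + cut_positions + [len(sequence)]
    # Skip empty fragments, which arise when a site sits at position 0.
    return [b - a for a, b in zip(edges, edges[1:]) if b > a]


print(expected_fragment_length("GATC"))    # 256  (4-cutter, MboI site)
print(expected_fragment_length("AAGCTT"))  # 4096 (6-cutter, HindIII site)
print(digest("AAGATCTTGATCAA", "GATC"))    # [2, 6, 6]
```

Under this idealization, the 4-cutter yields fragments roughly 16 times shorter on average, which is consistent with the resolution gain described above.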
Technological advancement has allowed characterization of the distinct patterns of 3D genome folding in pathological conditions. Genome instability, in forms such as SNPs and structural variations that are known to mis-regulate gene transcription, is one of the best-known causes of disease. Recent studies have suggested that genome instability also causes significant disruptions in 3D genome folding around pathogenically implicated genes. In this section, we discuss some examples of pathological conditions that are thought to involve 3D genome folding.
Short tandem repeat (STR) expansion is a good example of genome instability causing neurological diseases. STRs are repetitive sequences in which a unit of a few bases is present in tandem in the genome. STRs of normal length are widely distributed across the genome and are stable across generations and across somatic tissues of the same individual (115). However, STRs can rapidly expand to abnormally long lengths through poorly understood mechanisms and cause diverse neurological diseases such as fragile X syndrome, Huntington’s disease, amyotrophic lateral sclerosis, and Friedreich’s ataxia (116-119). A potential model to explain how abnormal STR expansion occurs is the formation of secondary and tertiary structures during DNA replication or DNA repair (120), yet further investigation into the detailed mechanism is necessary.
Phillips-Cremins and her team discovered an interesting characteristic of the locations of pathological STR expansions. The neurological disease-associated STRs located in the FMR1, HTT, DMPK, FXN, C9orf72, and ATXN1 genes also lie at highly conserved TAD boundaries (121). They found that STR expansion sites co-localize with TAD boundaries conserved across H1 human embryonic stem cells, mesendoderm, mesenchymal stem cells, neural progenitor cells, and trophoblast-like cells, suggesting that the expansion of these genomic fragments might affect 3D genome topology in a way that mis-regulates gene expression. They subsequently compared 5C heatmaps around the FMR1 gene in germ line cells (B-lymphocytes) collected from fragile X syndrome patients with those from their healthy siblings, and the comparison suggested significant disruptions of CTCF binding and of the surrounding 3D architecture (121).
Phillips-Cremins and her team further investigated the topological effects of STR expansion across the genome. They obtained induced pluripotent stem cell (iPSC) lines from healthy donors, fragile X syndrome patient donors, and premutation donors whose STR lengths are intermediate between those of healthy controls and patients. They then differentiated those iPSC lines into neural progenitor cells (NPCs) and performed Hi-C. Surprisingly, the TADs and compartments at FMR1 and its surrounding region are completely abolished when the STR is longer than 200 copies, the pathological threshold (Fig. 4A) (87). FMR1 and other genes in the region, such as SLITRK2, switch from compartment A to B, and the heterochromatin mark H3K9me3 covers the entire region as the STR expands. Their data also show that STR expansion, along with the heterochromatin mark, occurs in many other regions across the genome that form interchromosomal clusters (87), suggesting that STR expansion is involved in more sophisticated mechanisms of higher-order genome organization.
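The three donor groups above can be expressed as a simple length-based classification. The sketch below is only illustrative: the 200-copy threshold comes from the text, whereas the lower bound of the premutation range (55 copies) is an assumption that the text does not state, and the function and label names are hypothetical.

```python
FULL_MUTATION_ABOVE = 200  # "longer than 200 copies" is the pathological threshold in the text
PREMUTATION_MIN = 55       # ASSUMPTION: lower bound of the premutation range, not given in the text

def classify_repeat_length(copies: int) -> str:
    """Assign a repeat copy number to one of three illustrative groups."""
    if copies < 0:
        raise ValueError("copy number cannot be negative")
    if copies > FULL_MUTATION_ABOVE:
        return "full mutation"
    if copies >= PREMUTATION_MIN:
        return "premutation"
    return "below premutation range"

for copies in (30, 100, 200, 201):
    print(copies, classify_repeat_length(copies))
```

Keeping the thresholds as named constants makes it easy to substitute the cut-offs used in a particular study.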
Further evidence supports the involvement of disease-associated STR expansion in 3D genome folding. Hi-C and 4C data from Huntington’s disease mice show that STR expansion disrupts interactions between Htt and its conserved enhancers (122). In cerebellar ataxia, neuropathy, and vestibular areflexia syndrome (CANVAS), which is characterized by bi-allelic expansions of the AAGGG repeat in the RFC1 gene (123), the expanded repeats seem to form G-quadruplexes (124).
Repetitive sequences derived from retrotransposons, such as LINE-1 and SINEs including Alu, are implicated in cancer development through genome rearrangement (125, 126). Endogenous long terminal repeats (LTRs) aberrantly activate the myeloid-specific proto-oncogene CSF1R (colony-stimulating factor 1 receptor), together with reduced expression of its co-repressor CBFA2T3, in Hodgkin’s lymphoma cells (127). Reduced methylation of transposable elements also affects cell development. A genome-wide analysis of methylation within 928 transposable elements suggests that hypomethylation of transposable elements is strongly correlated with the tissue-specific functions of genes during development. In addition, the hypomethylated transposable elements seem to gain the tissue-specific enhancer mark H3K4me1, suggesting their potential role in reshaping 3D genome folding to determine cell fate (128). A hypomethylated LINE-1 inserted in an intronic region of the oncogene MET promotes its transcription in both bladder cancers and premalignant cells from tumor-bearing bladders (129). Consistently, a study of 77 colorectal cancers revealed that LINE-1 elements located within intronic regions of proto-oncogenes, such as MET, RAB3IP, and CHRM3, promote the transcription of these genes, leading to metastasis (130). Additional large-scale cancer studies provide further evidence supporting the role of transposable elements in cancer. A whole-genome study of 202 colorectal tumors identified retrotransposon insertions in 15 known cancer-associated genes, among which the insertion in APC appears to initiate cancer (131). A recent report on 899 single-cell clones of colon epithelium from 28 individuals showed that the rate of L1 retrotransposition rapidly increases, and that insertions spread widely throughout the genome, during colon cancer development (132).
Interestingly, transposable elements have been reported to affect CTCF binding and may contribute to divergent 3D genome organization across individuals. Genome-wide CTCF binding profiling in livers from five mammals suggests that retrotransposable elements redistributed CTCF-binding motifs during evolution, thus affecting species-specific genome looping (Fig. 4B) (133). Recent evidence in humans and mice suggests that CTCF redistribution by retrotransposable elements reshaped 3D genome looping (Fig. 4B) (134). Given the pathological implications of transposable elements and their connections to CTCF binding, further investigation is needed to understand the roles of retrotransposable elements in 3D genome folding and gene misregulation in diseases.
Disordered 3D genome organization between oncogenes and their regulatory elements could misregulate oncogene expression and thereby drive cancers (135, 136). Here, we discuss some examples of disorganized 3D chromatin topology in cancers.
One example is the acquisition of a cancer-specific super-enhancer and its chromatin circuitry (Fig. 4C). After super-enhancers were first identified and characterized (30), Hnisz et al. systematically compared genome-wide enhancer profiles between cancer cell lines and normal cell lines from matched lineages and identified cancer-specific super-enhancers located near MYC or other known oncogenes. They termed these “acquired” super-enhancers, which promote oncogene expression during cancer development (Fig. 4C) (137). Since then, chromatin circuitry between super-enhancers and oncogenes during cancer development has been reported repeatedly. A mutation in the intergenic region between TAL1 and STIL, to which MYB and other leukemogenic transcriptional complexes bind, is related to super-enhancer acquisition during leukemogenesis (138). Disruption of the CTCF-mediated insulation that normally separates proto-oncogenes from active enhancers activates the expression of these genes in T-cell acute lymphoblastic leukemia (T-ALL) (83). A study combining ChIA-PET and CRISPR-Cas9 genome editing identified conserved CTCF binding sites that work as a master hub for super-enhancer-MYC interactions in colon cancer, breast cancer, chronic myeloid leukemia, and acute T cell leukemia cells (Fig. 4E) (139). Translocation of highly active enhancers into the MYC-containing TAD, termed enhancer hijacking, has been reported in neuroblastoma patients (140). Overexpression of CCAT1-L is regulated by a distal active enhancer located 515 kb upstream of the MYC gene in gastric adenocarcinomas (141, 142). Because of the critical roles of super-enhancers in regulating oncogenes in cancers, JQ1, an inhibitor of the enhancer-binding protein BRD4, has been extensively studied as a potential cancer therapeutic (143-146).
Structural proteins are key components in the formation of 3D genome folding. Thus, dysfunctional structural proteins could disrupt 3D genome conformation. Interestingly, many structural protein-associated diseases are developmental disorders, although their molecular mechanisms and implications for genome organization are poorly understood.
Cornelia de Lange syndrome (CdLS) is the best-known cohesin-associated disorder. It is clinically characterized by neurodevelopmental delay and intellectual disability (147). One of the leading hypotheses about the cause of CdLS is that mutation of the NIPBL gene, whose encoded protein is necessary for cohesin binding to the genome (58), could misregulate cohesin binding to the genome, thus delaying the cell cycle and proper development (Fig. 4F) (148-150). A recent report has proposed a new model in which NIPBL mutation in CdLS might impair cohesin-mediated loop extrusion (151).
Premature chromosome condensation is known to be a cause of microcephaly. The primary cause of microcephaly was thought to be mutation of MCPH1, which causes premature chromosome condensation at G2 phase (152, 153). A recent study reported that condensin II is a major interacting partner of MCPH1 in homologous recombination repair (154). Mcph1 deletion in mouse embryonic stem cells recapitulated premature chromosome condensation in G1 and G2 phases (155), suggesting that MCPH1 might inhibit uncontrolled binding of condensin II to the genome. Mutations of condensin I and II in mice caused chromosome bridges and lagging chromosomes during chromosome segregation and, ultimately, a reduction in brain size (156). Mass spectrometry data have shown that a fraction of condensin I remains on the genome during interphase (67), where it might cooperate with condensin II in misregulated chromosome condensation in vertebrate DT40 cells (68).
The 3D genome is thought to be crucial for immune cell functions and for preserving their unique characteristics. Bediaga et al. comprehensively profiled activated T cells with ATAC-seq, Hi-C, and RNA-seq, and their data suggest drastic remodeling of chromatin organization between CD4+ and CD8+ T cells, which represent helper T cells and cytotoxic T cells, respectively. They found that the activation of CD4+ and CD8+ T cells is correlated with the partitioning of TADs into many smaller TADs and with changes in the expression of the corresponding target genes (Fig. 4G) (157). More recently, it has been shown that TCF-1, which is required for T cell development, interplays with CTCF to reshape TADs and facilitate long-range interactions that control genes for late T cell development (158).
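The partitioning of a TAD into smaller TADs can be made concrete by counting how many TADs called in one condition nest inside each TAD called in another. The sketch below is a simplified illustration and not the analysis performed in (157). The function name and the toy intervals are hypothetical, and real TAD calls rarely align this cleanly.

```python
def count_nested_tads(parent_tads, child_tads):
    """For each parent TAD (start, end), count child TADs fully contained in it."""
    return {
        parent: sum(1 for start, end in child_tads
                    if parent[0] <= start and end <= parent[1])
        for parent in parent_tads
    }

# Toy TAD calls on one chromosome (hypothetical coordinates in bp).
resting = [(0, 1_000_000), (1_000_000, 2_000_000)]
activated = [(0, 400_000), (400_000, 1_000_000), (1_000_000, 2_000_000)]

counts = count_nested_tads(resting, activated)
# A parent TAD that contains two or more child TADs is treated as partitioned.
partitioned = [tad for tad, n in counts.items() if n >= 2]
print(partitioned)  # [(0, 1000000)]
```

In practice, a tolerance on boundary positions would be needed, because boundary calls shift by a few bins between samples.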
Chromatin organization during B cell differentiation has also been characterized. Pax5, a transcription factor for B cell-specific gene activation (159), organizes B cell-specific chromatin domain distribution (160). However, whether Pax5 itself functions as a structural protein or cooperates with structural proteins is unclear.
Dysregulation of 3D genome structure can cause severe immune dysfunction in humans, such as autoimmune diseases and malignancies (161, 162). For example, H3K27ac HiChIP on primary T cells has revealed that hundreds of autoimmune-specific intergenic variants are connected to, and functionally linked with, their gene targets (163). The asthma-associated variants rs4065275 and rs12936231 are known to contain a CTCF-binding motif and switch CTCF binding from ZPBP2 to ORMDL3 in CD4+ T cells, which seems to reduce IL-2 production (164). A study using capture Hi-C identified SNPs associated with rheumatoid arthritis, type 1 diabetes, psoriatic arthritis, and juvenile idiopathic arthritis that distally interact with FOXO1, AZI2, PTPRC, DEXI, and ZFP36L1 in B and T cell lines (GM12878 and Jurkat, respectively) (165).
Advances in studying 3D genome folding have enriched our understanding of genomics and epigenomics. Classical molecular and cytological approaches have shown that higher-order genome structures are dynamically organized in the 3D nucleus. The rapid development of 3C and its derivative technologies, together with deep-coverage NGS, has allowed us to look deeper inside chromosomes and realize that the genome is highly organized in a hierarchical order: from compartments to TADs and subTADs, then to loops and nucleosome fibers. Working models have been proposed for the formation of each type of genome structure, such as loop extrusion for loop formation and phase separation for compartment formation. A growing body of evidence suggests that topological changes of the 3D genome are implicated in embryo development, immune cell differentiation and function, and the pathogenesis of neurological and mental diseases and cancers. Despite this expansion of knowledge, many critical questions regarding the dynamics between the 3D genome and gene transcription remain unanswered.
The first critical question is how looping causally affects gene transcription. We have discussed the evidence suggesting that long-range interactions between promoters and active enhancers are correlated with transcription of the associated genes. However, the mechanisms by which loops and gene transcription causally influence each other are not fully understood. The second question is whether a nucleosome communicates with another nucleosome brought into proximity by 3D chromatin conformation. If epigenetic writer/eraser enzymes bind to a locus, it is conceivable that such enzymes could jump to another interacting locus through 3D genome folding and change its epigenetic marks. The third question is: what are the mechanistic relationships between 3D genome folding, gene transcription, and phenotypes? For example, when mESCs are differentiated into NPCs and pluripotency is induced again (i.e., iPSCs), the mESC-specific loop between Sox2 and its super-enhancer seems to be restored in iPSCs, but Sox2 expression is not (107). Inverting the CTCF orientation around Sox2 changes its surrounding TAD morphology but not Sox2 expression (84), suggesting that additional, unknown mechanisms are involved in transcriptional activity beyond 3D genome conformation. The last question is whether, if the 3D genome can determine disease, reshaping it could reprogram diseased cells to a healthy state.
3D genome engineering could provide clues to these questions. In the last few years, a number of innovative methods to engineer loops have been proposed. For example, structural proteins such as YY1 and CTCF are conjugated to dCas9 to dimerize and form a loop between two target loci (62, 166). An optogenetic technology called LADL, which combines plant-derived CRY2 and CIBN with dCas9, can induce a loop between a gene promoter and a distally located functional super-enhancer in response to blue light stimulation (108). Single-molecule RNA FISH on LADL-expressing cells has shown that transcription of the target gene in the LADL-induced loop significantly increases in response to blue light exposure (108). Engineering loops using these tools will help test the relationships between loops, epigenetic marks, and gene transcription mechanisms, as well as their functional roles in diseases. In addition, using CRY2’s self-aggregating function in response to blue light exposure (93, 108, 167), LADL could be modified to induce phase separation and engineer compartments. Given that compartment formation, in which many super-enhancers and genes are condensed together, is a primary mechanism of cell fate determination (88), we might be able to induce compartments to test this hypothesis and reprogram diseased cells to normal, or vice versa.
Continuous development of new technologies is necessary to keep building our knowledge of the 3D genome and its clinical applications. We have seen that disruption or reorganization of the 3D genome is involved in pathogenesis, such as compartment disruption around FMR1 in fragile X syndrome (Fig. 4A), chromatin re-circuitry around oncogenes and proto-oncogenes (Fig. 4C-E), genome instability in cancers, and so on. This leaves us with many questions: what happens if we correct the disease-associated 3D genome conformation to that found in healthy people? What if we reconstruct the compartment surrounding FMR1 in fragile X syndrome? What if we disrupt hotspots of 3D chromatin circuitry around oncogenes in cancers? Would this reprogram diseased cells into healthy cells? To answer these questions, it is indispensable to develop new technologies that precisely identify key 3D genome architectures in individuals and engineer them. The key to success is collaboration between science and engineering. Combining 3D genome folding with multiomics such as transcriptomics, proteomics, and single-cell biology will provide a more holistic view of cellular function and regulation, making precise characterization and personalization possible. High-throughput computational tools must be developed in parallel to enable precise data interpretation from small inputs. Developing 3D genome engineering tools and an efficient delivery system to bring them into nuclei is also essential for extending 3D genome folding research to clinical studies. Combined efforts across these fields will continue to unravel the mysteries of the 3D genome, expand the horizon of novel therapeutics, and offer hope for a range of incurable diseases.
This study was supported by the KAIST UP Program, a Medical Scientist Training Program, and the National Research Foundation of Korea (NRF) grant from the Ministry of Science & ICT of Korea (RS-2024-00334460).
The authors have no conflicting interests.