The human oral cavity contains a highly personalized microbiome that is essential to maintaining health but is also capable of causing oral and systemic diseases. Thus, an in-depth definition of a “healthy oral microbiome” is critical to understanding variations in disease states, from preclinical conditions and disease onset through progressive states of disease. With rapid advances in DNA sequencing and analytical technologies, population-based studies have documented the range and diversity of both taxonomic compositions and functional potentials observed in the oral microbiome in healthy individuals. Besides factors specific to the host, such as age and race/ethnicity, environmental factors also appear to contribute to the variability of the healthy oral microbiome. Here, we review bioinformatic techniques for metagenomic datasets, including their strengths and limitations. In addition, we summarize the interpersonal and intrapersonal diversity of the oral microbiome, taking into consideration the recent large-scale and longitudinal studies, including the Human Microbiome Project.
The human microbiota (the collection of microbes that live on and inside us) consists of a wide range of microorganisms whose aggregate number exceeds that of human somatic and germ cells by at least an order of magnitude (1, 2). The collection of genes in the microbiota is called the human microbiome (2); however, “microbiota” and “microbiome” are often used interchangeably (3). One of the most clinically relevant microbial habitats, the human oral cavity is colonized by a personalized set of microorganisms, including bacteria, archaea, fungi, and viruses (4). Under healthy conditions, the oral microbiota lives in harmony with the host, as at other body sites. The host provides an environment in which the microbiome flourishes, and the microbiome in turn keeps its host healthy (5). Conversely, the oral microbiome is also considered a key cause of oral diseases, including dental caries and periodontal diseases, as well as many systemic diseases such as diabetes and cardiovascular diseases (5, 6). Because of its crucial role in oral and systemic health, the oral microbiome has become an essential part of microbiomics.
An in-depth definition of a healthy microbiome is an indispensable step toward detecting significant variations in disease states and pre-clinical conditions, as well as understanding the disease onset and progression (7). The advent of next generation sequencing (NGS) or high-throughput sequencing has revolutionized the field of microbiome analysis, providing the tools necessary to address the issue (8). This prompted the launch of the NIH’s Human Microbiome Project (HMP), constructed as a large, genome-scale community research project (NIH HMP Working Group, 2009). Over 200 healthy adults were enrolled, and samples were collected from 15 to 18 body sites, including oral, stool, skin, nasal, and vaginal areas, over a period of 1 to 3 visits (9). Besides two major scientific reports (9, 10), several companion papers analyzed the HMP oral datasets (7, 11–13), revealing great variability of the oral microbiome among and within healthy individuals. Furthermore, other recent large-scale and longitudinal studies have augmented our view of the oral microbiome, beyond that of the HMP.
In this paper, we review bioinformatic techniques for metagenomic datasets, including microbial community profiling, and highlight the strengths and weaknesses of the experimental approaches. We also summarize important findings that have led to the current understanding of the range of healthy microbial diversity. Although viruses, fungi, archaea and protozoa form a part of the normal microbiome (4), the majority of research has concentrated on the domain Bacteria. Therefore, we will focus exclusively on oral bacteria in this review.
Two distinct metagenomics approaches are commonly used: marker gene metagenomics and full shotgun metagenomics. Marker gene metagenomics is a fast and cost-effective way to obtain a taxonomic distribution profile. In this approach, specific regions of evolutionarily conserved marker genes are first amplified by PCR, and subsequently sequenced (14). In the case of bacterial (and/or archaeal) community analysis, the target region usually lies within the 16S ribosomal RNA (rRNA) gene (15). Hence, the approach is referred to as 16S rRNA profiling. Full shotgun metagenomics, also referred to as metagenomic whole genome sequencing (WGS), does not target a specific locus or marker gene, but instead breaks the isolated metagenomic DNA into smaller fragments, and subsequently sequences the individual pieces (14). The sequenced small fragments (i.e., raw sequencing reads) can be used for taxonomic profiling (who is there?) as well as for functional profiling (what are they doing?) (14). In this section, we briefly describe the techniques involved and the bioinformatic pipelines used to analyze microbiome sequence data obtained from these two methods.
The 16S rRNA gene was introduced as a marker for bacterial phylogeny by Woese
Although standard protocols exist for generating NGS sequencing libraries, stochastic errors in the biological processes of library creation and/or incomplete chemical reactions during sequencing can affect the overall quality of the sequencing library and the resulting datasets. Therefore, raw sequencing reads should be carefully checked in the preprocessing step to ensure successful downstream analysis. A number of computational tools have been used for preprocessing: FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) provides a quick quality check by running a modular set of analyses such as “per base sequence quality”, “per sequence quality score”, “sequence length distribution”, “adapter content”, etc.; FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) allows detecting and trimming the low-quality regions of individual reads (especially the 3′-end); DUST is used to remove low-complexity regions in the sequencing reads (24). Intrinsically, NGS techniques can introduce various errors into the sequencing reads, such as imprecise signals from longer homopolymer runs and chimeric sequences. In the denoising step, these errors are identified and corrected to allow accurate taxonomic assignment of the sequencing reads. Many popular software packages, such as QIIME (25) and mothur (26), have implemented denoising algorithms. In particular, UCHIME is designed to detect chimeric sequences by comparing reference sequences to a database, or by performing
NGS allows investigators to detect and identify novel bacteria that have previously gone undetected. Consequently, assigning 16S rRNA reads that originate from uncultured bacterial genomes to a specific bacterial taxonomy is even more difficult. In two frequently used methods, the reads are assigned into bins according to either homology between the reads and known reference sequences (i.e., phylotyping) or homology among the reads themselves (i.e., operational taxonomic units [OTUs]) (28). The former method relies upon aligning reads with a reference 16S rRNA database using sequence alignment algorithms, such as BLAST (29). Besides NCBI GenBank, a number of rRNA databases have been constructed and used for taxonomic assignment (Table 1). Each database has its own criteria for the curation of data from the original resources. For example, the Human Oral Microbiome Database (HOMD) (30) and CORE (31) have been constructed using 16S rRNA sequences exclusively from human oral bacteria. The second approach is to group 16S rRNA sequencing reads into bins called OTUs with distance-based agglomerative clustering methods, such as CD-HIT (32) and UCLUST (33). Defining species by 97% identity in the 16S rRNA gene sequence is a commonly used criterion, but this threshold is still controversial (11, 34).
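The OTU binning idea described above can be illustrated with a minimal greedy, centroid-based sketch. It is not the CD-HIT or UCLUST algorithm: it assumes the reads are already aligned to equal length, measures identity as the fraction of matching positions, and uses hypothetical function names (`identity`, `greedy_otus`) chosen only for this illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned reads."""
    if not a or len(a) != len(b):
        raise ValueError("reads must be non-empty and pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold=0.97):
    """Assign each read to the first centroid it matches at >= threshold identity.

    A read that matches no existing centroid founds a new OTU and becomes
    its centroid. The result depends on input order (an inherent property
    of greedy clustering); real tools typically sort reads first.
    """
    centroids = []   # one representative read per OTU
    clusters = []    # reads assigned to each OTU, parallel to centroids
    for read in reads:
        for i, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                clusters[i].append(read)
                break
        else:
            centroids.append(read)
            clusters.append([read])
    return clusters
```

With 100-base reads, a read carrying one mismatch to a centroid (99% identity) joins that OTU, whereas a read carrying five mismatches (95% identity) founds a new one. The 97% default mirrors the commonly used species-level criterion mentioned in the text and is a parameter rather than a fixed rule.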
NGS platforms generate a far greater number of reads than classical Sanger sequencing, although the reads are much shorter. Unfortunately, current databases and methods are not able to assign species names or provide sufficient phylogenetic information for all of the billions of sequence reads (11). For example, the most commonly used tool for assigning taxonomy, the Ribosomal Database Project (RDP) Classifier (35), does not assign taxonomic names below the genus level (11, 36). Moreover, as revealed in our previous study, the RDP shows insufficient resolution for classifying the GN02 and
To understand the structure and dynamics of a microbial community, the measurement of diversity is essential. Two diversity measurements are frequently used to assess and compare microbial communities: alpha (or within-sample) diversity and beta (or between-sample) diversity. Alpha diversity is usually characterized using the number of distinct organisms within a sample (richness, which may be measured as the number of OTUs), the relative abundances of the organisms (evenness), or indices that combine these two dimensions. In contrast, beta diversity is often characterized using the number of species (or OTUs) shared between two communities. In particular, UniFrac, a robust method for comparing microbial communities between samples, measures the proportion of shared branch lengths on a phylogenetic tree between samples (3, 38). Principal Coordinates Analysis (PCoA) can summarize and visualize the UniFrac distances between samples in a scatterplot in which points (representing samples) that are more distant from one another are more dissimilar.
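As a rough illustration of these measures, the sketch below computes richness, the Shannon index (one common index combining richness and evenness), Pielou's evenness, and a Jaccard distance based on shared OTUs. The choice of these particular indices and the function names are assumptions made for the example; UniFrac is not implemented here because it additionally requires a phylogenetic tree.

```python
import math


def richness(counts):
    """Number of OTUs observed at least once in a sample."""
    return sum(1 for c in counts.values() if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i) over observed OTUs."""
    total = sum(counts.values())
    h = 0.0
    for c in counts.values():
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h


def pielou_evenness(counts):
    """Shannon index scaled by its maximum, ln(richness); 0.0 if fewer than 2 OTUs."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


def jaccard_distance(sample_a, sample_b):
    """Beta diversity as 1 - (shared OTUs / OTUs present in either sample)."""
    a = {otu for otu, c in sample_a.items() if c > 0}
    b = {otu for otu, c in sample_b.items() if c > 0}
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

Each sample is represented as a dictionary mapping an OTU label to its read count. A sample with two equally abundant OTUs has a Shannon index of ln 2 and an evenness of 1, while two samples sharing one of three OTUs have a Jaccard distance of 2/3.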
Although the 16S rRNA profiling is a powerful, effective and straightforward technique to study microbial communities, it only provides the taxonomic composition. The metagenomic WGS data can provide not only taxonomy, but also the biological functional profiles for the microbial communities. The principles of taxonomy profiling processes that employ WGS data are similar to those described above. This section will therefore focus on the functional profiling of the microbial community. The analysis pipeline can be divided into four stages: (1) preprocessing, (2) reconstruction of raw sequencing reads (assembly), (3) gene prediction, and (4) functional annotations.
Preprocessing assesses the overall quality of WGS data, and most steps are similar to those of 16S rRNA profiling. Additionally, raw metagenomic NGS reads from host-associated samples (e.g. human) are checked for host DNA contamination, and the contaminating reads are removed. Fast short-read mapping tools, such as BWA (39) and Bowtie 2 (40), are used to detect the contaminating reads by aligning raw sequencing reads against the host genome (e.g. human genome).
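The principle of host read removal can be sketched with a toy filter. Note that this swaps in a much simpler technique than the read aligners named above: instead of alignment, it flags a read as host-derived when a large fraction of its k-mers also occur in the host reference. The function names, the k-mer size, and the 0.5 cutoff are all assumptions for illustration, and reverse-complement matches are ignored.

```python
def kmers(seq, k):
    """Set of all overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def remove_host_reads(reads, host_reference, k=8, min_fraction=0.5):
    """Keep reads whose k-mers are mostly absent from the host reference.

    A read is discarded when at least min_fraction of its distinct k-mers
    are found in the host sequence. Reads shorter than k cannot be judged
    and are kept.
    """
    host_kmers = kmers(host_reference, k)
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            kept.append(read)
            continue
        shared = len(read_kmers & host_kmers) / len(read_kmers)
        if shared < min_fraction:
            kept.append(read)
    return kept
```

In practice a dedicated mapper is preferable, because alignment tolerates sequencing errors and scales to a full host genome; the sketch only conveys the filtering logic of comparing every read against the host sequence and retaining what does not match.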
The metagenomic WGS technique generates raw sequencing reads from the whole genomes of the microbial community. Thus, to accurately identify specific genomes and/or complete protein-coding genes, it is helpful to reconstruct the microbial genomes from the raw sequencing reads. However, obtaining complete genomes has been challenging not only because of the highly repetitive DNA sequences abundant in a broad range of species (from bacteria to mammals), but also because of the short reads and high data volumes produced by NGS technology. Therefore, an assembly of shorter reads into genomic contigs and their orientation into scaffolds is often performed. Most metagenomic WGS read assembly tools are designed and implemented based on the de Bruijn graph. Initially, all sequencing reads are fragmented into k-mers, which are then used as the edges of the de Bruijn graph. Each k-mer edge links two nodes, the (k-1)-mer prefix and the (k-1)-mer suffix of that k-mer. Finally, the assembler identifies the Eulerian paths that traverse every edge in the graph exactly once (41). Velvet (42), ABySS (41) and SOAPdenovo (44) use the de Bruijn graph to assemble whole metagenomes from raw sequencing reads. In the HMP, the raw sequencing reads from 749 metagenomic samples were successfully used to assemble contigs using an optimized SOAPdenovo protocol (8). Recently, more sophisticated algorithms have been developed and applied to the next-generation assemblers, such as Meta-IDBA (45), MetaVelvet-SL (46) and IDBA-UD (47).
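The de Bruijn graph construction and Eulerian path search described above can be sketched on a toy example. This is a deliberately simplified illustration and not how the assemblers cited above are implemented: it assumes error-free reads from a single linear sequence, collapses duplicate k-mers (so repeat multiplicity is lost), and does not check that the graph is connected. All function names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix node to the (k-1)-mer suffix nodes it links to."""
    distinct_kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(distinct_kmers):      # each distinct k-mer is one edge
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: visit every edge once, returning the node order."""
    if not graph:
        return []
    in_deg = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    # Start where out-degree exceeds in-degree by one; otherwise the graph
    # is a cycle and any node with outgoing edges will do.
    start = next((n for n in sorted(graph) if len(graph[n]) - in_deg[n] == 1),
                 min(graph))
    remaining = {n: list(targets) for n, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())   # follow an unused edge
        else:
            path.append(stack.pop())              # dead end: emit node
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the de Bruijn graph."""
    path = eulerian_path(build_de_bruijn(reads, k))
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])
```

For the three overlapping reads `ATGGCGT`, `GGCGTGC` and `CGTGCA` with k = 4, the graph is a simple chain of seven edges and `assemble` spells out `ATGGCGTGCA`. With smaller k or repetitive sequences, several Eulerian paths can exist, which is one reason real assemblers output contigs rather than a single path.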
Following reconstruction, the next stage is to identify genes in the reads or in the assembled contigs and/or scaffolds. The prediction of genes in metagenomic data is still a fairly difficult problem, although several gene prediction algorithms have been successfully employed for prokaryotic genomes. To predict genes in metagenomic studies, especially for
After gene prediction, the identified genes are functionally annotated by comparing them with known genes in functional annotation databases such as PFAM (54), IMG/M (55), COG (56) and MetaRef (57). Further analysis of the relationship between the microbiome and the host phenotype is performed using metabolic pathway information resources, e.g. KEGG (58), eggNOG (59) and MinPath (60). As part of the HMP, Abubucker
The HMP assessed the oral microbiome composition of seven intraoral sites (buccal mucosa, hard palate, keratinized gingiva, saliva, sub- and supragingival plaque, and tongue dorsum) and two oropharyngeal sites (throat and palatine tonsils) from 182–206 healthy subjects (18 to 40 years old). A total of 185–322 genera belonging to 13–19 bacterial phyla were discovered (13). The dominant phyla were
The oral cavity is a humid environment that is kept at a fairly constant temperature (34°C to 36°C) and a relatively neutral pH in most areas, and thus provides favorable conditions for the growth of various microorganisms (63). The oral cavity is composed of diverse habitats with different anatomical structures and physicochemical factors. The oral mucosa covers the cheek, tongue, gingiva, palate, and floor of the mouth and allows rapid elimination of adhering bacteria through continuous shedding of its surface epithelial cells (63). On the other hand, the papillary surface of the tongue provides shelter for adhering bacteria and protects these bacteria from mechanical cleaning. The hard surface of teeth offers many sites for bacterial colonization, in both supra- and subgingival areas. The gingival crevice (the area between the junctional epithelium of the gingiva and the teeth) provides a distinctive microbial colonization site, consisting of both hard and soft tissues (63). The epithelium may be keratinized (palate) or nonkeratinized (gingival crevice). Hence, the oral cavity is not considered a uniform environment.
The HMP revealed a substantial divergence in species richness and evenness among different oral habitats, and also identified microorganisms with specific niche preferences. The hard palate showed the lowest total richness, whereas the gingival plaque showed the highest total richness (11) (Table 2). Oral sites, especially saliva, have the highest evenness, while buccal mucosa and keratinized gingiva have lower alpha diversity than the other oral sites (10, 13). Each oral habitat in almost every subject was characterized by one or a few signature taxa making up the plurality of the community, with highly variable relative abundance among both the individuals and the oral habitats. Most oral habitats are dominated by
Although the HMP produced a huge volume of data, the resulting 16S rRNA datasets are composed of samples from medical students in the USA, and host information is difficult to access, which removes the potential to observe any systematic patterns or regional and ethnic differences (67). A population-scale study of 120 healthy individuals from 12 worldwide locations showed significant variation in the saliva microbiome according to location (68). Notably, the saliva microbiome of Batwa Pygmies, a former hunter-gatherer group from Africa, was much more diverse than the saliva microbiome of two agricultural African groups, probably owing to their different lifestyle and diet (69). Another study of 3 human groups from different geographic and climatic areas (76 native Alaskans, 10 Germans and 66 Africans) found that the saliva microbiome was distinctive for each group, although the reasons (e.g. different lifestyles and/or host genetics and physiology) remain to be clarified (70). In that study, alpha diversity was highest for the German group and lowest for the African group, while the opposite was true for beta diversity. It is intriguing to speculate that the higher population density of Germany may provide more opportunities for bacteria to be spread among individuals (70).
Ethnicity is likely to exert a selection pressure on the oral microbiome. Mason
Vertical transmission from mother to child starts at birth (79). Depending on the delivery mode (vaginal or Caesarian), infants acquire bacterial communities similar to their mother’s vaginal microbiota or skin microbiota (80). A study of healthy three-month-old infants delivered vaginally (25 infants) and born by C-section (38 infants) found differences in the infant’s oral microbiota owing to the mode of delivery, with vaginally delivered infants having a higher taxonomic diversity (81). The method of feeding also affects the infant’s microbiome: oral lactobacilli with antimicrobial properties were found in breast-fed infants but not found in formula-fed infants (82, 83). Horizontal transmission of oral microbiota among siblings and other individuals sharing the same environment also contributes to oral microbiome diversity. In a study, 264 saliva samples were collected from 107 individuals (including 45 twin pairs), at up to three time-points during a 10-year period, spanning adolescence. The twins resembled each other more closely than the whole population at all time-points, but became less similar to each other when they aged and no longer cohabited (84).
Studies looking at the temporal variation of the oral microbiome have revealed conflicting results: in a longitudinal study of 5 adults at three time-points (from 5 to 29 days), the salivary microbial community appeared to be stable at different time points (85). The HMP consortium (10) and Zhou
Along with the variety of physiological changes that accompany aging, microbial habitats in the oral cavity also change greatly. The eruption of primary teeth and replacement of the primary dentition with permanent dentition may lead to shifts in the microbial community composition at different phases of life (87). Edentulous infants have been found to have lower diversity in their oral microbial composition than their mothers or primary caregivers (88). In deciduous dentition, a higher proportion of Proteobacteria (
We have only begun to understand the tremendous diversity of the oral microbiome and a number of challenges remain, such as the vast uncultivated species and the lack of reference genomes (90). Until recently, about half of all known bacterial phyla were identified only from their 16S rRNA gene sequences (91). In fact, the bacteria that can be grown in the laboratory are only a portion of the total diversity that exists in the oral cavity (92). One method to address this challenge is single-cell genomics, which is a powerful tool for accessing genetic information from uncultivated microorganisms (93). Future work combining metagenomics and single cell genomics, as well as advances in each separate method, should help to overcome these issues, providing new insights into the uncultivated lineages (94).
Rapidly developing sequencing methods and analytical techniques are enhancing our ability to understand the human microbiome, leading to the concept of a ‘personal microbiome’. The focus is now shifting from characterizing the oral microbiota to functional studies encompassing genomics, transcriptomics, and metabolomics of both host and microbes. Future investigations will inevitably involve personal omics profiling in order to probe the temporal patterns associated with both molecular changes and related physiological health and disease. This knowledge is vital for the development of efficacious prevention and treatment protocols for oral diseases and will ultimately contribute to the development of personalized medicine and personalized dental medicine.
This work was supported by the Ministry of Science, ICT & Future Planning (NRF-2015R1C1A2A01054588) and the National Honor Scientist Program (NRF-2012R1A3A1050385). This work was also supported in part by 2016 Sabbatical Research Program of Kyung Hee University.
A list of 16S ribosomal RNA databases
|Name||16S rRNA coverage||Database URL (reference)|
|CORE||Human Oral Bacteria||http://microbiome.osu.edu/ (32)|
|RDP||Archaea and Bacteria||https://rdp.cme.msu.edu/ (33)|
|Human Oral Microbiome Database||Human Oral Bacteria||http://www.homd.org/index.php (65)|
|EzTaxon-e||Archaea and Bacteria||http://www.ezbiocloud.net/eztaxon (95)|
|SILVA||Archaea and Bacteria||https://www.arb-silva.de/ (96)|
|Greengenes||Archaea and Bacteria||http://greengenes.secondgenome.com/ (97)|
Counts of patients included, OTUs and estimated richness (number of species) found for both the V1–V3 and the V3–V5 regions (11)
|Patients||OTUs||Estimated richness||Patients||OTUs||Estimated richness|
aUpper and lower confidence limits are not included in this table.
bExample of extraoral sites. The stool samples have the highest estimate of total richness, followed by the oral samples, particularly the plaque and tonsils. The vaginal sites, such as the posterior fornix, have the lowest estimated richness.