In December 2019, an outbreak of a new severe respiratory illness was observed (1). It was named novel coronavirus disease 2019 (COVID-19) on January 7, 2020 by the World Health Organization (WHO) and led to the COVID-19 pandemic (2). As of March 2022, 451,611,588 cases of COVID-19 infection resulting in 6,022,199 deaths had been reported in 196 countries and territories worldwide (3, 4). For over 2 years, the COVID-19 pandemic has affected politics, society, the economy, and healthcare worldwide. Establishing healthcare strategies for COVID-19 patients requires a critical understanding of the pathophysiology of COVID-19 infection.
The difficulty of predicting disease progression has made it challenging to set up healthcare services for the quarantine, vaccination, and treatment of COVID-19 patients. Although several studies have reported the dynamics between the COVID-19 virus and the immune system of the infected host (4-9), the molecular and cellular mechanisms underlying the variable symptoms of COVID-19 infection remain poorly understood. The countries of the European Union and Taiwan, among others, have emphasized building research infrastructure and collecting data and biospecimens to devise multifaceted responses to the COVID-19 pandemic (10-12). However, understanding of the complicated infection biology of COVID-19 is limited by a lack of comprehensive and systematic multimodal datasets with matched clinical data and laboratory testing (13). Therefore, longitudinal multimodal data are needed to elucidate the viral infection dynamics and immune response to COVID-19.
In this study, we constructed a scalable multimodal dataset from 300 COVID-19 patients and 120 healthy controls. Comprehensively collected biospecimens and clinical data were deposited in the National Biobank of Korea (NBK). Laboratory testing, HLA typing, and whole-genome sequencing (WGS) of samples from the study participants were performed on admission. Additionally, longitudinal single-cell immune profiling, comprising single-cell RNA sequencing combined with T cell receptor (TCR) and B cell receptor (BCR) sequencing (scRNA(+scTCR/BCR)-seq) and bulk BCR and TCR sequencing (bulk BCR/TCR-seq), as well as profiling of 192 cytokines, was used to investigate the dynamic progression of infection in COVID-19 patients. Our dataset can potentially improve the understanding of COVID-19 susceptibility and severity and facilitate the development of target molecules and treatment strategies against COVID-19.
From October 2020 to June 2021, 120 healthy controls and 300 COVID-19 patients were enrolled at Chungnam National University Hospital, Seoul Medical Center, and Samsung Medical Center. The study protocol was reviewed and approved by these institutes and the Institutional Review Board of Korea National Institute of Health (CNUH 2020-12-002-008, SEOUL 2021-02-016, SMC-2021-03-160, and 2020-09-03-C-A). All the participants provided written informed consent.
Participants were enrolled when diagnosed with COVID-19, and biospecimens were collected at several time points based on each patient’s disease progression. The clinical course of the patients was recorded daily until discharge. Moreover, we recruited 120 healthy controls who had no history of COVID-19 infection or COVID-19 vaccination.
The host response against COVID-19 at the molecular and cellular levels can be examined based on a multidimensional analysis of optimal biospecimens. Blood samples from participants collected using various sampling techniques (Supplementary Fig. 1) were used for complete blood count (CBC), laboratory testing, cytokine profiling, and multi-omics analyses such as WGS, HLA typing, scRNA(+scTCR/BCR)-seq, and bulk BCR/TCR-seq. Nasopharyngeal swabs (NPS) were collected for sequencing of the COVID-19 viral genome. Urine samples were collected and stored for use in future experiments, such as metabolic studies or trace determination of COVID-19. Detailed procedures for sample collection are described in the Supplementary Methods.
The collected biospecimens are distributed through the NBK website, which also provides biospecimen information (https://nih.go.kr/biobank). Researchers wishing to use these biospecimens can apply to the NBK for access approval based on research proposals.
The severity of COVID-19 in patients was classified according to the WHO guidelines (14). Patients were classified as “severe” when they presented COVID-19 symptoms with pneumonia and at least one of the following: a respiratory rate of more than 30 breaths/min, an oxygen saturation of less than 93% on room air, a PaO2/FiO2 ratio of less than 300 mmHg, or rapid progression of infiltration (> 50%) observed by chest computed tomography (CT) imaging.
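The classification rule above can be sketched as a small function. This is only an illustration under stated assumptions: the `Assessment` record and its field names are hypothetical, patients who do not meet the rule are labeled "moderate" because the cohort contains only those two groups, and a missing measurement is treated as not meeting its criterion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assessment:
    """Hypothetical record of one patient's findings (field names are assumptions)."""
    has_pneumonia: bool
    respiratory_rate: Optional[float] = None           # breaths per minute
    spo2_percent: Optional[float] = None               # oxygen saturation, %
    pao2_fio2_mmhg: Optional[float] = None             # PaO2/FiO2 ratio, mmHg
    ct_infiltrate_progression: Optional[float] = None  # fraction between 0 and 1


def classify_severity(a: Assessment) -> str:
    """Return "severe" for pneumonia plus at least one criterion, else "moderate"."""
    if not a.has_pneumonia:
        return "moderate"
    criteria = [
        a.respiratory_rate is not None and a.respiratory_rate > 30,
        a.spo2_percent is not None and a.spo2_percent < 93,
        a.pao2_fio2_mmhg is not None and a.pao2_fio2_mmhg < 300,
        a.ct_infiltrate_progression is not None
        and a.ct_infiltrate_progression > 0.5,
    ]
    # Assumption: an unmeasured value (None) never counts toward severity.
    return "severe" if any(criteria) else "moderate"
```

For example, `classify_severity(Assessment(has_pneumonia=True, spo2_percent=90))` returns `"severe"`, whereas the same patient with an oxygen saturation of 97% and no other criterion met is labeled `"moderate"`.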
Out of 300 COVID-19 patients, 243 (81.0%) were classified as moderate and 57 (19.0%) as severe (Fig. 1 and Table 1). Two patients with severe COVID-19, whose biospecimens were collected at three time points, died during hospitalization. Other patients were discharged after recovery. The average hospitalization period was 13 and 16 days for moderate and severe patients, respectively. Multi-omics data were obtained at 7 time points from 12 patients with severe COVID-19.
Clinical information collected daily using an electronic clinical data management system (https://icreat.nih.go.kr) and clinical characteristics of study participants are shown in Supplementary Table 1 and Table 1, respectively. The median age was 42.0 for healthy controls, and 52.0 and 64.0 for moderate and severe COVID-19 patients, respectively. Hypertension and diabetes were the most common comorbidities, and pneumonia was observed in 44.4% and 93.0% of the moderate and severe patients, respectively. As the clinical information used in this study covers a broad range of categories recorded on a daily basis (Supplementary Table 1), we envisage that these data will provide insights into COVID-19 infection, for example through studies on correlations between COVID-19 severity and demographics, comorbidity, or treatment, and on changes in clinical status during hospitalization according to COVID-19 severity.
CBC and blood chemistry tests were performed on admission for the participants (Table 2). Among the 37 tested parameters, hsCRP levels were significantly higher in COVID-19 patients [12.98 mg/L in moderate and 51.05 mg/L in severe] than in healthy controls [0.79 mg/L], which indicated an association between hsCRP level and the severity of COVID-19. This is consistent with previous studies reporting hsCRP as an indicator of COVID-19 progression (15, 16). Ferritin was also significantly elevated in COVID-19 patients [214.50 ng/ml in moderate and 490.09 ng/ml in severe] compared to that in healthy controls [119.56 ng/ml]. This finding agrees with previous reports (17, 18) suggesting that ferritin is relevant for distinguishing disease severity in COVID-19 patients and leads to immune dysregulation and cytokine storms, especially in severe COVID-19 (19). Conversely, iron levels were lower in COVID-19 patients [71.73 μg/dl in moderate and 48.00 μg/dl in severe] than in healthy controls [108.01 μg/dl], consistent with previously reported results (20, 21). Interestingly, the WBC count was lower in moderate COVID-19 patients and higher in severe COVID-19 patients than in healthy controls. These findings can help clarify the association between blood biochemical characteristics and COVID-19 severity, as well as the pathophysiology of COVID-19 progression.
The longitudinal profiles of 191 cytokines in COVID-19 patients, obtained by measuring cytokine levels at multiple time points from hospitalization to discharge, can provide an in-depth understanding of dynamic cytokine patterns during the progression of COVID-19 (Supplementary Table 2). To comprehensively characterize cytokine-associated inflammation in COVID-19 patients, cytokine expression levels at the first time point were evaluated (Supplementary Fig. 2). Among the top 25 significant cytokines, the expression levels of complement component C9, LRG1, CD14, CEACAM1-1, and IL-23 increased with severity, while those of serpin A4, properdin, fetuin A, and fibroblast activation protein alpha (FAP) decreased.
This reveals an association between cytokines and COVID-19 severity and facilitates prediction of COVID-19 disease development (5, 22, 23). Additionally, integrated analysis with a matched transcriptome at the single-cell level can provide a better understanding of dynamic inflammatory responses under multi-layered control. Detailed procedures to perform blood biochemistry and cytokine profiling are described in the Supplementary Methods.
Here, we focused on genetic polymorphisms in
Additionally, we found relatively high frequencies of
It has been reported that HLA polymorphisms play a pivotal role in the immune response to COVID-19 infection, especially through the presentation of pathogen-derived peptides (31). Our HLA typing data, along with matched clinical and other omics data, could help delineate the detailed mechanisms by which HLA genotypes influence susceptibility to, and the progression and severity of, COVID-19 infection.
In this study, 279 COVIDSeq sequences were collected after excluding data that failed QC. The variants in these sequences were analyzed to determine the variation among COVID-19 patients in terms of Pango lineage (32) and GISAID clade (33) (https://www.gisaid.org) (Supplementary Fig. 3).
Fourteen COVID-19 lineages were identified in this study (Supplementary Fig. 3A). B.1.497 was the most common lineage found in our study, followed by B.1.619. The COVID-19 lineages mostly showed a change from B.1.497 to B.1.619 in biospecimens collected from January to June 2021 (Supplementary Fig. 3B). From the virus samples collected between January and March, 89.3% were of the B.1.497 lineage, which was mainly prevalent in South Korea (32). The proportions of B.1.619 lineage in the virus samples collected in April, May, and June were 8.3%, 49.0%, and 50.0%, respectively. Additionally, the B.1.620 lineage was present in virus samples collected in May and June at proportions of 15.6% and 25.0%, respectively.
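Monthly lineage proportions of the kind reported above can be tabulated from per-sample lineage calls. The sketch below is a minimal illustration, not the pipeline used in the study: the function name and the `(month, lineage)` input layout are assumptions, and the example records are toy values rather than study data.

```python
from collections import Counter, defaultdict


def lineage_share_by_month(records):
    """Map each month to {lineage: percentage of that month's samples}."""
    counts = defaultdict(Counter)
    for month, lineage in records:
        counts[month][lineage] += 1
    shares = {}
    for month, counter in counts.items():
        total = sum(counter.values())
        shares[month] = {
            lineage: 100.0 * n / total for lineage, n in counter.items()
        }
    return shares


# Toy input for illustration only; these are NOT the study's samples.
toy_records = [
    ("2021-05", "B.1.619"),
    ("2021-05", "B.1.497"),
    ("2021-05", "B.1.497"),
    ("2021-05", "B.1.620"),
    ("2021-06", "B.1.619"),
    ("2021-06", "B.1.619"),
]
toy_shares = lineage_share_by_month(toy_records)
```

In practice, the lineage label for each sample would come from a Pango lineage assignment, and the month from the collection date of the nasopharyngeal swab.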
Multi-omics datasets were generated for longitudinal time points to investigate genomic and immunogenic dynamics in COVID-19 patients (Fig. 1). The omics datasets, collected at three time points for moderate patients and seven time points for severe patients, included scRNA(+scTCR/BCR)-seq, bulk TCR/BCR-seq, and cytokine profiling.
For scRNA-seq data, 4,525 cells on average were captured and 24,530 genes on average were detected, and a total of 11,921,825 cells were obtained for 483 samples. A total of 184,549 B cells with complete heavy and light chain sequences and 386,553 T cells with complete TCR alpha and beta chain sequences were obtained (Supplementary Tables 5-7). Quality control data for bulk TCR/BCR-seq are recorded in Supplementary Tables 8 and 9. In addition, WGS was performed for the samples at the first time point to study genome-wide genetic variants of the participants. Detailed QC metrics of all WGS data are given in Supplementary Table 10.
The entire set of multi-omics data is available through the Clinical and Omics Data Archive (CODA) to researchers with approval. An in-depth analysis of these datasets is also available on an analytical platform on CODA (https://coda.nih.go.kr). Detailed procedures for generating the multi-omics datasets are described in the Supplementary Methods.
This study presents a comprehensive collection of longitudinal multimodal data from Korean COVID-19 patients and healthy controls based on state-of-the-art high-resolution technologies. The multi-omics data generated in this study include matched clinical and laboratory testing data, and thus serve as a powerful data source to delineate dysregulated immune biology and the underlying mechanisms of the response to novel viruses.
There are several international resources for COVID-19 research. The Global Initiative for Sharing All Influenza Data (GISAID) shares over 11 million genomes to track the evolution of COVID-19 viruses and to study the worldwide genomic epidemiology of SARS-CoV-2 in real time (33). The UK Biobank also provides health record data on a regular basis to facilitate rapid research on COVID-19 and to obtain epidemiological insights into molecular characteristics of COVID-19 (34, 35), combined with the extensive data previously collected on genetic factors of UK Biobank participants.
Compared to these international resources, our multi-omics study has several strengths for studying COVID-19 pathogenesis. The time-resolved set of multi-omics data and biospecimens collected from each patient allows in-depth study of the dynamic progression of COVID-19. Moreover, the entire set of multi-omics analyses, including single-cell omics, bulk TCR/BCR-seq, and cytokine profiling, was performed simultaneously for each time point, enabling the investigation of the relationship between dynamic cytokine levels and the peripheral immune response and revealing patient-specific immune responses. Considering that clinical data for each patient were collected daily, our dataset can also provide an in-depth understanding of disease progression after COVID-19 infection from the clinical point of view. Additionally, our results include a dataset for the early response to COVID-19 infection in moderate and severe patients that can be used to develop risk prediction models via multi-omics integration methodologies and deep learning. The multi-omics dataset associated with specific COVID-19 lineages (Supplementary Fig. 3) can facilitate the study of coordinated immune responses strongly correlated with COVID-19 pathogenesis (36).
We expect that the combined single-cell omics, WGS, HLA typing, cytokine profiles, laboratory testing, and matched clinical data in our study will contribute to future integrative meta-analyses of COVID-19 to help investigate potential functions or mechanisms driving COVID-19 infection (37).
This COVID-19 project has several limitations. First, as the participants were recruited from October 2020 to June 2021, new variants of COVID-19, such as delta or omicron, were not included in this project. We are collecting the same set of resources for additional COVID-19 patients and vaccinated participants for a subsequent project, expecting to obtain biospecimens for novel COVID-19 variants and study antibody responses against COVID-19 vaccines (38). This can enable exploration of the immunologic landscape based on COVID-19 lineages or of complications after COVID-19 infection. Second, the biospecimens were collected after COVID-19 diagnosis, so the time elapsed since the actual infection may vary between patients. Therefore, the very early immunological responses could not be included in this dataset. Lastly, as the datasets in this study were produced from blood samples, local immune responses in infected tissues such as the lungs or other organs could not be accounted for (39, 40).
Our dataset, generated from biospecimens of COVID-19 patients, can facilitate a better understanding of the dynamic peripheral immune responses during COVID-19 infection and be used to develop predictive models for estimating the severity of COVID-19 or of infections with newly emerging viruses. This study can potentially uncover the genetic and biological basis of COVID-19 by relating these factors to clinical phenotypes. We anticipate that our data will provide a valuable resource for future COVID-19 studies and integrative meta-analyses of multi-omics datasets of COVID-19 patients worldwide.
All data and biospecimens from this project are available from the National Biobank of Korea (NBK) on prior approval (https://nih.go.kr/biobank). Fundamental datasets, including clinical, laboratory testing and cytokine profiling data, were deposited at NBK, while the multi-omics dataset was deposited at the Clinical and Omics Data Archive (CODA; https://coda.nih.go.kr).
The accession numbers for the clinical data, laboratory characteristics, cytokine profiling, WGS, HLA typing, bulk TCR-seq, bulk BCR-seq, and scRNA(+scTCR/BCR)-seq reported in this study are CODA-000034, CODA-000035, CODA-000036, CODA-000037, CODA-000038, CODA-000039, CODA-000040, and CODA-000041, respectively.
Any additional information required to analyze the dataset collected in this project is available from the lead contact upon request.
Further detailed information is provided in the Supplementary Information.
We acknowledge all the healthcare workers involved in this study from the Chungnam National University Hospital, Seoul Medical Center, and Samsung Medical Center for their efforts in collecting samples and creating medical records. We also thank all the managers and staff at the hospitals and biobank for sample handling and preprocessing, as well as for the production of high-quality data.
This work was supported by the Korea National Institute of Health Infrastructural Research Program 4800-4861-312-210-13 and operation of data center for national biomedical data resources (2021-NI-017-00).
The authors have no conflicting interests.
Clinical characteristics of COVID-19 patients
Clinical characteristics | Healthy control (n = 120) | COVID-19 moderate (n = 243) | COVID-19 severe (n = 57) | P-value |
---|---|---|---|---|
Age | 42 (23-80) | 52 (21-97) | 64 (27-93) | - |
Male (%) | 59 (49.2%) | 128 (52.7%) | 33 (57.9%) | - |
Current Smoker | 15 (12.5%) | 27 (11.1%) | 0 (0.0%) | 0.02 |
Comorbidity | | | | |
Hypertension | 20 (16.7%) | 63 (25.9%) | 29 (50.9%) | <0.001 |
Diabetes | 7 (5.8%) | 34 (14.0%) | 26 (45.6%) | <0.001 |
Coronary heart disease | 3 (2.5%) | 8 (3.3%) | 6 (10.5%) | 0.05 |
Stroke | 1 (0.8%) | 3 (1.2%) | 2 (3.5%) | 0.53 |
Malignant neoplasm | 1 (0.8%) | 4 (1.6%) | 2 (3.5%) | 0.71 |
Chronic hepatitis/liver cirrhosis | 1 (0.8%) | 4 (1.6%) | 1 (1.8%) | <0.001 |
Symptom | | | | |
Cough | - | 99 (40.7%) | 28 (49.1%) | 0.32 |
Dyspnea | - | 9 (3.7%) | 16 (28.1%) | <0.001 |
Fever | - | 84 (34.6%) | 27 (47.4%) | 0.10 |
Sore throat | - | 82 (33.7%) | 14 (24.6%) | 0.24 |
Sputum production | - | 46 (18.9%) | 19 (33.3%) | 0.03 |
Rhinorrhea | - | 18 (7.4%) | 2 (3.5%) | 0.44 |
Myalgia | - | 93 (38.3%) | 22 (38.6%) | 1.00 |
Malaise | - | 56 (23.0%) | 15 (26.3%) | 0.73 |
Headache | - | 56 (23.0%) | 14 (24.6%) | 0.94 |
Nausea | - | 3 (1.2%) | 3 (5.3%) | 0.15 |
Diarrhea | - | 4 (1.6%) | 6 (10.5%) | 0.00 |
Pneumonia during hospitalization | - | 108 (44.4%) | 53 (93.0%) | <0.001 |
The number and percentage of events in each group are shown for each clinical characteristic. P-values were calculated by chi-squared test for the categorical groups. Statistical significance was set at P < 0.05. Values in age are presented as median (minimum-maximum).
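As a rough illustration of the chi-squared test named in the table note, the sketch below computes the statistic and P-value for a two-group comparison using only the Python standard library. The function name is hypothetical, and the use of a Yates continuity correction is an assumption, since the note does not state whether one was applied.

```python
import math


def chi2_2x2(events_a, n_a, events_b, n_b, yates=True):
    """Pearson chi-squared test for a 2x2 table; returns (statistic, p_value).

    Assumes both columns (events and non-events) have non-zero totals.
    """
    observed = [
        [events_a, n_a - events_a],
        [events_b, n_b - events_b],
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [events_a + events_b, total - events_a - events_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(observed[i][j] - expected)
            if yates:
                # Continuity correction (an assumption, not stated in the note).
                diff = max(0.0, diff - 0.5)
            stat += diff * diff / expected
    # With 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Event counts from Table 1 (moderate n = 243, severe n = 57).
dyspnea_stat, dyspnea_p = chi2_2x2(9, 243, 16, 57)
myalgia_stat, myalgia_p = chi2_2x2(93, 243, 22, 57)
cough_stat, cough_p = chi2_2x2(99, 243, 28, 57)
```

The comorbidity rows also involve healthy controls, so a comparison across all three groups would need a 3x2 table with two degrees of freedom, which this sketch does not cover.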
Laboratory characteristics of healthy controls and COVID-19 patients
Laboratory characteristics | Healthy control (n = 120) | COVID-19 moderate (n = 243) | COVID-19 severe (n = 57) |
---|---|---|---|
WBC (Thous/ul) | 5.96 ± 1.55 | 4.68 ± 1.63* | 7.02 ± 4.94§,† |
RBC (Mil/ul) | 4.60 ± 0.43 | 4.50 ± 0.59 | 4.37 ± 0.59§ |
Platelet (Thous/ul) | 225.06 ± 48.98 | 197.28 ± 63.66* | 178.95 ± 68.62§ |
Hemoglobin (g/dl) | 14.06 ± 1.53 | 13.906 ± 1.79 | 13.48 ± 1.69§ |
Hematocrit (%) | 41.82 ± 4.09 | 41.15 ± 5.19 | 39.96 ± 4.90§ |
hs-CRP (mg/L) | 0.79 ± 1.11 | 12.98 ± 24.57* | 51.05 ± 56.30§,† |
Iron (ug/dl) | 108.01 ± 42.47 | 71.73 ± 41.86* | 48.00 ± 39.05§,† |
Ferritin (ng/ml) | 119.56 ± 110.05 | 214.50 ± 219.43* | 490.09 ± 390.64§,† |
UIBC (ug/dl) | 217.55 ± 72.33 | 222.04 ± 60.26 | 196.40 ± 58.05† |
Vitamin B12 (pg/ml) | 631.26 ± 275.59 | 765.97 ± 468.82* | 1045.74 ± 742.19§,† |
Folate (ng/ml) | 10.84 ± 6.12 | 12.73 ± 8.43 | 13.13 ± 10.00 |
Total protein (g/dl) | 6.87 ± 0.35 | 6.68 ± 0.49* | 6.23 ± 0.71§,† |
Albumin (g/dl) | 4.56 ± 0.23 | 4.34 ± 0.36* | 3.91 ± 0.53§,† |
Homocysteine (umol/L) | 14.08 ± 3.80 | 14.32 ± 4.65 | 13.34 ± 4.95 |
ALT (U/L) | 24.48 ± 16.38 | 28.88 ± 26.29 | 38.09 ± 31.52§,† |
γ-GTP (U/L) | 24.11 ± 42.07 | 33.08 ± 67.44 | 45.18 ± 48.50§ |
AST (U/L) | 24.48 ± 12.51 | 29.51 ± 16.10* | 39.12 ± 20.16§,† |
Total bilirubin (mg/dl) | 0.83 ± 0.34 | 0.75 ± 0.35* | 0.67 ± 0.36§ |
Direct bilirubin (mg/dl) | 0.24 ± 0.11 | 0.24 ± 0.13 | 0.25 ± 0.17§ |
ALP (U/L) | 62.26 ± 16.35 | 67.12 ± 21.75* | 74.74 ± 27.10§,† |
Calcium (mg/dl) | 9.51 ± 0.33 | 9.04 ± 0.42* | 8.59 ± 0.53§,† |
Phosphorus (mg/dl) | 94.58 ± 0.42 | 3.38 ± 0.68* | 3.21 ± 0.58§ |
BUN (mg/dl) | 14.54 ± 3.94 | 14.70 ± 11.00 | 19.79 ± 12.05§,† |
Cystatin C (mg/L) | 0.70 ± 0.18 | 0.96 ± 0.79* | 1.24 ± 1.03§,† |
Creatinine (mg/dl) | 0.77 ± 0.18 | 0.88 ± 1.08 | 0.97 ± 0.95§ |
Uric acid (mg/dl) | 5.24 ± 1.54 | 4.94 ± 1.57 | 4.50 ± 1.61§ |
Total CPK (U/L) | 142.60 ± 212.26 | 96.51 ± 143.74* | 122.89 ± 145.91§ |
Glucose (mg/dl) | 94.58 ± 0.42 | 98.21 ± 35.96 | 142.09 ± 74.29§,† |
HbA1c (%) | 5.46 ± 0.74 | 5.89 ± 0.91* | 7.29 ± 2.25§,† |
Total cholesterol (mg/dl) | 179.28 ± 33.34 | 167.35 ± 34.58* | 144.28 ± 38.52§,† |
Triglyceride (mg/dl) | 130.72 ± 99.32 | 116.28 ± 58.64 | 138.46 ± 56.92† |
HDL cholesterol (mg/dl) | 56.71 ± 14.05 | 44.47 ± 12.91* | 34.89 ± 8.31† |
LDL cholesterol (mg/dl) | 104.80 ± 30.57 | 105.18 ± 34.84 | 87.77 ± 37.73§,† |
Apolipoprotein A-I (mg/dl) | 154.93 ± 22.60 | 127.36 ± 26.50* | 105.32 ± 20.23§,† |
Apolipoprotein A-II (mg/dl) | 31.55 ± 5.11 | 28.98 ± 5.86* | 23.28 ± 6.06§,† |
Apolipoprotein B (mg/dl) | 88.46 ± 24.04 | 90.89 ± 23.52 | 85.91 ± 30.20 |
Lipoprotein(a) (mg/dl) | 16.56 ± 12.03 | 18.41 ± 14.09 | 20.30 ± 19.64 |
Continuous variables are presented as means ± standard deviations for each laboratory feature. P-values were calculated by two-sample t-test after imputation using the K-nearest neighbor (KNN) method. The number of neighbors (K) was 10, the default setting. Statistical significance was set at P < 0.05. *, §, and † indicate significance between healthy controls versus COVID-19 moderate, healthy controls versus COVID-19 severe, and COVID-19 moderate versus severe patients, respectively. WBC, white blood cell; RBC, red blood cell; hsCRP, high-sensitivity C-reactive protein; UIBC, unsaturated iron binding capacity; ALT, alanine transaminase; AST, aspartate aminotransferase; BUN, blood urea nitrogen; CPK, creatine phosphokinase; HDL, high-density lipoprotein; LDL, low-density lipoprotein.
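For readers who wish to approximate the group comparisons from the summary values in Table 2, a two-sample t statistic can be computed directly from means, standard deviations, and group sizes. The sketch below is only illustrative: the function name is hypothetical, the unequal-variance (Welch) form is an assumption because the note does not specify the t-test variant, and the calculation uses the published summary values rather than the imputed per-patient data.

```python
import math


def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1
    v2 = sd2 ** 2 / n2  # squared standard error of group 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# hs-CRP (mg/L) from Table 2: moderate (n = 243) versus severe (n = 57).
t_stat, dof = welch_t_from_summary(12.98, 24.57, 243, 51.05, 56.30, 57)
```

A P-value would then follow from the t distribution with `dof` degrees of freedom, for example through a statistics library.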