Temple researchers have identified the progenitor genome of the SARS-CoV-2 coronavirus.
In the field of molecular epidemiology, the worldwide scientific community has been steadily sleuthing to solve the riddle of the early history of SARS-CoV-2.
Since the first SARS-CoV-2 infection was discovered in December 2019, tens of thousands of its genomes have been sequenced worldwide, revealing that the coronavirus is mutating, albeit slowly, at a rate of 25 mutations per genome per year.
But despite great efforts, no one has yet identified the first case of human transmission, or “patient zero”, in the COVID-19 pandemic. Finding such a case is essential to a better understanding of how the virus first jumped from its animal host to infect humans, as well as the history of how the SARS-CoV-2 genome mutated over time and spread globally.
“The SARS-CoV-2 virus, which carries an RNA genome, has already infected more than 35 million people worldwide,” said Sudhir Kumar, director of the Institute for Genomics and Evolutionary Medicine at Temple University. “We need to find this common ancestor, which we call the progenitor genome.”
This ancestral genome is the mother of all SARS-CoV-2 viruses that infect humans today.
In the absence of patient zero, Kumar and his research team at Temple University may now have found the next best thing to aid molecular epidemiology detective work around the world. “We set out to reconstruct the progenitor genome using a large dataset of coronavirus genomes obtained from infected individuals,” said Sayaka Miura, one of the study’s senior authors.
They identified the “mother” of all SARS-CoV-2 genomes and its early offspring strains, which subsequently mutated and spread to dominate the global pandemic. “We have now reconstructed the progenitor genome and mapped where and when the earliest mutations occurred,” said Kumar, the corresponding author of the preprint study.
In doing so, their work provided new insights into the early mutation history of SARS-CoV-2. For example, their study indicates that a mutation in the SARS-CoV-2 spike protein (D614G), often implicated in increased infectivity and spread, occurred after several other mutations, weeks after the onset of COVID-19. “It is almost always found alongside many other protein mutations, so it remains difficult to establish its role in increasing infectivity,” said Sergei Pond, one of the study’s senior co-authors.
Along with their findings on the early history of SARS-CoV-2, Kumar’s group has developed mutational fingerprints to quickly identify the strains and sub-strains that infect an individual or dominate a region of the world.
To identify the progenitor genome, they used a mutation order analysis technique, which relies on a phylogenetic analysis of mutant strains and on how often pairs of mutations appear together in SARS-CoV-2 genomes.
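The pairwise idea can be illustrated with a small sketch. It is not the team's actual method, only a toy under stated assumptions: each genome is reduced to a set of mutation labels (the labels `m1`, `m2`, `m3` are hypothetical), and mutation A is taken to precede mutation B when A is more common than B and nearly every genome carrying B also carries A. The function name `infer_precedence` and the `tolerance` parameter are invented for illustration.

```python
from itertools import permutations


def infer_precedence(genomes, tolerance=0.05):
    """Return (earlier, later) mutation pairs inferred from co-occurrence.

    genomes: list of sets of mutation labels (hypothetical toy input).
    A is called earlier than B when A is carried by more genomes than B
    and at most `tolerance` of the genomes carrying B lack A, which
    leaves some room for sequencing errors.
    """
    mutations = set().union(*genomes)
    carriers = {m: sum(m in g for g in genomes) for m in mutations}
    ordered_pairs = []
    for a, b in permutations(sorted(mutations), 2):
        with_b = carriers[b]
        with_both = sum(a in g and b in g for g in genomes)
        if carriers[a] > with_b and (with_b - with_both) / with_b <= tolerance:
            ordered_pairs.append((a, b))
    return ordered_pairs


# Toy data: m1 arose first, then m2 on an m1 background, then m3.
toy_genomes = [
    set(),
    {"m1"},
    {"m1"},
    {"m1", "m2"},
    {"m1", "m2"},
    {"m1", "m2", "m3"},
]
print(infer_precedence(toy_genomes))
```

In this toy, a genome carrying none of the mutations plays the role of the progenitor, which mirrors the idea that the genome before the first mutation is the ancestor.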
First, Kumar’s team examined data on nearly 30,000 complete genomes of SARS-CoV-2, the virus that causes COVID-19. Altogether, they analyzed 29,681 SARS-CoV-2 genomes, each containing at least 28,000 bases of sequence data. These genomes were sampled between December 24, 2019 and July 7, 2020, and represent 97 countries and territories around the world.
Kumar says that many previous attempts to analyze such large datasets were unsuccessful because of “the focus on building an evolutionary tree of SARS-CoV-2. This coronavirus evolves very slowly, the number of genomes to analyze is very large, and the quality of the genomic data is extremely variable. I immediately saw parallels between the properties of these genetic data from the coronavirus and the genetic data from the clonal spread of another nefarious disease, cancer.”
Kumar’s group has developed and investigated several techniques to analyze genetic data from tumors in cancer patients. They adapted and extended those techniques and built a trail of mutations that automatically traces back to the progenitor. “Basically, the genome before the first mutation occurred was the progenitor,” Kumar said. “The mutation-tracing approach is beautiful and predicted the ‘major strains’ of SARS-CoV-2. It is a great example of how big data combined with biologically informed data mining reveals important patterns.”
Kumar’s team inferred the progenitor genome sequence of all SARS-CoV-2 genomes (proCoV2). In the proCoV2 genome, they identified 170 non-synonymous substitutions (mutations that cause an amino acid change in a protein) and 958 synonymous substitutions compared to the genome of a closely related coronavirus, RaTG13, found in the bat Rhinolophus affinis. While the intermediate animal between bats and humans remains unknown, this amounts to a sequence similarity of 96.12% between the proCoV2 and RaTG13 sequences.
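The distinction between synonymous and non-synonymous substitutions can be sketched in a few lines. This is a minimal illustration that assumes the standard genetic code and compares one codon at a time. The function name `classify_substitution` is hypothetical, and the example codons are illustrative rather than taken from the study.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(ref_codon, alt_codon):
    """Label a codon change as identical, synonymous, or non-synonymous."""
    # Accept RNA or DNA spelling by mapping U to T.
    ref = ref_codon.upper().replace("U", "T")
    alt = alt_codon.upper().replace("U", "T")
    if ref == alt:
        return "identical"
    if CODON_TABLE[ref] == CODON_TABLE[alt]:
        return "synonymous"
    return "non-synonymous"


# An aspartate (D) to glycine (G) change, for example GAT to GGT,
# alters the protein; CTT to CTC leaves leucine unchanged.
print(classify_substitution("GAT", "GGT"))
print(classify_substitution("CTT", "CTC"))
```

A genome whose differences from a reference are all synonymous encodes exactly the same proteins as that reference.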
Next, they identified 49 single nucleotide variants (SNVs) that occurred with a variant frequency greater than 1% in their dataset. These were examined further to assess their mutational patterns and global prevalence.
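The frequency cutoff described above can be sketched as a simple filter. This assumes each genome has already been reduced to the set of variants it carries. The function name `common_variants`, the variant labels, and the toy counts are all hypothetical.

```python
from collections import Counter


def common_variants(genomes, min_frequency=0.01):
    """Return {variant: frequency} for variants above the cutoff.

    genomes: list of sets of variant labels (hypothetical toy input).
    The comparison is strict, matching "greater than 1%".
    """
    counts = Counter(v for g in genomes for v in set(g))
    total = len(genomes)
    return {v: c / total for v, c in counts.items() if c / total > min_frequency}


# Toy dataset of 200 genomes: v1 in 50, v2 in 2 (exactly 1%), v3 in 3.
toy = [{"v1"}] * 50 + [{"v2"}] * 2 + [{"v3"}] * 3 + [set()] * 145
print(common_variants(toy))
```

Rare variants are the ones most easily confused with sequencing errors, so a cutoff like this trades completeness for reliability.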
“A mutation tree predicts a phylogenetic tree,” Kumar said. “You can also build a phylogenetic tree first, and predict the order of mutations. However, this method is strongly affected by the quality of the sequences. When the mutation rate is low, it is difficult to distinguish an error due to low quality from a true mutation. Our approach is more robust to sequencing errors because analyzing pairs of positions across the genome is more informative.”
An earlier timeline emerges
When a comparison of the inferred proCoV2 sequence with the genomes in their collection revealed no perfect match at the nucleotide level, Kumar’s research team learned that the original timeline for the start of the pandemic was off.
“This ancestral genome had a different sequence from what some people call the reference sequence, which was first observed in China and deposited in the GISAID SARS-CoV-2 database,” said Kumar.
The closest match was a genome sampled 12 days after the earliest sampled virus, which became available on December 24, 2019. Multiple closest matches were found on all sampled continents and were detected in Europe as late as April 2020. Overall, 120 of the genomes Kumar’s group analyzed contained only synonymous differences from proCoV2. That is, all of their proteins were identical to the corresponding proCoV2 proteins in amino acid sequence. The majority (80 genomes) of these protein-level matches were from coronaviruses sampled in China and other Asian countries.
These spatiotemporal patterns indicate that proCoV2 already possessed the full repertoire of protein sequences needed to infect, spread, and persist in the global human population.
They found that proCoV2 and its earliest offspring arose in China, based on the earliest mutations of proCoV2 and their locations. Moreover, they showed that a population of strains with as many as six mutational differences from proCoV2 was present at the time of the first detection of COVID-19 cases in China. With SARS-CoV-2 estimated to mutate 25 times a year, this means that the virus must have been infecting people several weeks before the December 2019 cases.
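The back-of-the-envelope reasoning behind "several weeks" can be written out. This sketch assumes a strictly constant rate of 25 mutations per genome per year along a single lineage, which is a simplification and not the study's dating method. The function name `weeks_to_accumulate` is hypothetical.

```python
def weeks_to_accumulate(n_mutations, mutations_per_year=25.0):
    """Weeks needed to accumulate n_mutations at a constant rate."""
    years = n_mutations / mutations_per_year
    return years * 52


# Six differences from proCoV2, the upper bound quoted in the article.
print(f"{weeks_to_accumulate(6):.1f} weeks")
```

Under these assumptions, six mutations correspond to roughly three months of evolution, while strains with fewer differences imply proportionally shorter intervals.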
Because there was strong evidence of many mutations prior to those in the reference genome, Kumar’s group had to come up with a new nomenclature of mutational signatures to classify SARS-CoV-2, designating each one with a Greek letter symbol.
For example, they found that the μ and α variants of the SARS-CoV-2 genome emerged before the first reports of COVID-19. This strongly implies that there was some sequence diversity in the ancestral SARS-CoV-2 population. All 17 genomes sampled from China in December 2019, including the designated SARS-CoV-2 reference genome, carry all three μ variants and all three α variants. Interestingly, the six genomes containing μ variants but not α variants were sampled in China and the United States in January 2020. Therefore, the earliest sampled genomes (including the designated reference) were not the ancestral strains.
The analysis also predicts that the progenitor genome had offspring that were spreading around the world during the earliest stages of COVID-19. It was ready to infect from the start.
“The progenitor had all the capacity it needed to spread,” said Sergei Pond. “There is little evidence of selection on the lineage between bats and humans, although there is strong selection on MERS in bats.”
Moreover, they found tantalizing evidence of other mutations accompanying the D614G spike protein mutation.
“A lot of people are interested in the mutations in the spike protein because of its functional properties,” said Kumar. “But what we observe is that in addition to the spike protein, there have been many additional changes within the genome that are always present alongside the change in the spike protein (D614G). We call this the beta group of mutations, and the spike mutation is one of them. Whatever we think that mutation does, it is better not to forget that other mutations may be involved as well. Alternatively, these mutations may just be along for the ride together, and we can’t yet know that.”
“Also interesting is that the genome that contains the spike protein mutation has undergone many other mutations. What we call epsilon mutations (there are three of them) occurred on the background of the spike mutation, and they change arginine residues in a very important protein, the nucleocapsid (N) protein. Epsilon mutations are widespread in Europe, and are always present with the spike protein mutation. Therefore, the epsilon mutations started a dominant lineage in both Europe and Asia.”
Altogether, they identified seven major evolutionary lineages that arose after the epidemic began, some of which appeared in Europe and North America after the ancestral strains arose in China.
“It is the Asian strains that founded the entire pandemic,” Kumar said. “But over time, the sub-strain containing the epsilon mutations, which may have arisen outside of China (it was first observed in the Middle East and Europe), has been infecting Asia much more.”
Their mutation-based analyses also demonstrated that coronaviruses in North America carry very different genome fingerprints than those prevalent in Europe and Asia.
“This is a dynamic process,” Kumar said. “Clearly, there are very different pictures of spread painted by the emergence of new mutations, namely the three epsilons, gamma and delta, which we found to occur after the change in the spike protein. We need to find out whether any functional properties of these mutations have accelerated the pandemic.”
Moving forward, they will continue to refine their results as new data becomes available.
“There are over 100,000 genomes of SARS-CoV-2 that have now been sequenced,” Pond said. “The strength of this approach is that the more data you have, the more easily you can determine the exact frequency of individual mutations and pairs of mutations. For the variants that are produced, the single nucleotide variants, or SNVs, their frequency and history can be told very well with more data. Therefore, our analyses provide a reliable root for the evolution of SARS-CoV-2.”
Their results are automatically updated online as new genomes are reported (the collection now exceeds 50,000 samples and can be found at http://igem.temple.edu/COVID-19).
“These findings and our intuitive mutational fingerprints of SARS-CoV-2 strains have overcome the daunting challenges of developing a retrospective on how, when and why COVID-19 emerged and spread, which is a prerequisite for creating treatments to overcome this pandemic through science, technology, public policy and medicine efforts,” Kumar said.
Reference: “An evolutionary portrait of the progenitor SARS-CoV-2 and its dominant offshoots in COVID-19 pandemic” by Sudhir Kumar, Qiqing Tao, Steven Weaver, Maxwell Sanderford, Marcos A. Caraballo-Ortiz, Sudip Sharma, Sergei L. K. Pond and Sayaka Miura, 29 September 2020, bioRxiv.
DOI: 10.1101/2020.09.24.311845