MSK Library Guides: SARS-CoV-2: Lineages Explained

PANGO

What is PANGO

Phylogenetic Assignment of Named Global Outbreak (PANGO) lineages are similar to formal scientific species names and are assigned according to a suite of detailed evolutionary and epidemiological criteria.

New PANGO lineages are evaluated by a committee based on criteria such as novel evolutionary features, transmissibility, pathogenicity, a notable increase in frequency, and new movement across regions.

PANGO Naming

PANGO lineage labels are assigned chronologically and consist of the following:

Alphabetical prefix (A, B, C....AA, AB, AC, etc.)
Numerical suffix (A.1, A.2, A.3...B.1, B.2...AA.1, etc.)
Each full-stop (period/dot) within the numerical suffix represents a "descendant of"

Examples:

BA.2.75 is a descendant of BA.2.
BA.2.75.2 is a descendant of BA.2.75.
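The "descendant of" rule can be read directly off a label by dropping one numerical level at a time. The following is a minimal Python sketch of that idea; the function name pango_ancestors is hypothetical and not part of any PANGO tooling. It only recovers the ancestry visible in the label itself, so it stops at the alphabetical prefix.

```python
def pango_ancestors(lineage: str) -> list[str]:
    """Return the ancestors implied by the dots in a label, nearest first."""
    parts = lineage.split(".")
    # parts[0] is the alphabetical prefix; each later part is one numerical level.
    # Dropping trailing levels one at a time walks up the visible ancestry.
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


print(pango_ancestors("BA.2.75.2"))  # ['BA.2.75', 'BA.2', 'BA']
```

A bare prefix such as "A" has no dots, so the function returns an empty list for it.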

Recombinant Naming

A separate system is used for recombinant (hybrid) variants. Their names begin with X followed by one or more letters (proceeding in sequence XA, XB, ... then XAA, XAB, ...) and carry no numerical suffix except for unambiguous descendants (e.g., XB, XB.1).
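The order in which recombinant prefixes are issued can be sketched as a simple generator. This is an illustration only: the function name recombinant_prefixes is hypothetical, and the sketch assumes every letter of the alphabet is used after the X, whereas the actual designation list may skip some letters.

```python
from itertools import count, islice, product
from string import ascii_uppercase


def recombinant_prefixes():
    """Yield XA, XB, ..., XZ, then XAA, XAB, ... in order."""
    for width in count(1):
        # All letter combinations of the current width, in alphabetical order.
        for letters in product(ascii_uppercase, repeat=width):
            yield "X" + "".join(letters)


first = list(islice(recombinant_prefixes(), 28))
print(first[0], first[1], first[25], first[26], first[27])  # XA XB XZ XAA XAB
```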

PANGO Limitations

Maximum Length and Rolling Over

PANGO labels have a maximum depth of three layers of descendants, after which the label rolls over to a new prefix. Descendants of lineages with three-layer (tertiary) suffixes are assigned the next available prefix in alphabetical order (following the criteria above). This new prefix acts as an "alias" for the name of the parent lineage.

Examples:

The first descendant of the lineage B.1.1.1 was not named B.1.1.1.1 but is instead named C.1
- The prefix C is an alias for B.1.1.1
BQ.1.1 is an alias for B.1.1.529.5.3.1.1.1.1.1.1
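Expanding an alias back to its full name amounts to swapping the prefix for the parent lineage it stands for. The sketch below assumes a small hand-written alias table built only from the examples above (the BQ entry is derived by removing the trailing ".1.1" from the full name of BQ.1.1); the names ALIASES and expand_alias are hypothetical, and a real table would be far larger.

```python
# Illustrative alias table derived from the examples in this guide, not a complete list.
ALIASES = {
    "C": "B.1.1.1",
    "BQ": "B.1.1.529.5.3.1.1.1.1",
}


def expand_alias(lineage: str, aliases: dict[str, str]) -> str:
    """Replace an alias prefix with the full name of its parent lineage."""
    prefix, _, suffix = lineage.partition(".")
    full_prefix = aliases.get(prefix, prefix)  # unknown prefixes are left as-is
    return f"{full_prefix}.{suffix}" if suffix else full_prefix


print(expand_alias("C.1", ALIASES))     # B.1.1.1.1
print(expand_alias("BQ.1.1", ALIASES))  # B.1.1.529.5.3.1.1.1.1.1.1
```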

Getting Lost in the Soup

This naming structure can lead to confusion, especially when discussing different variants in non-technical settings or when trying to communicate information to the public.

It would help if the WHO designated BA.2 and BA.5, the two primary Omicron parent lineages, as Pi and Rho. That way, the ancestry of a lineage could be identified quickly even when the nomenclature is confusing.

Where are the Recombinants?

This has become especially concerning for the descendants of the XBB recombinant. In February 2023 two new PANGO lineages appeared, EG and EK. The names do not make clear that both are in fact descendants of XBB: because PANGO has a three-layer limit, XBB descendants beyond three layers are given new alias prefixes.

EG.1 -- alias of XBB.1.9.2.1
EK.1 -- alias of XBB.1.5.13.1

However, while XBB descendants are receiving new aliases, so are descendants of other variants that are NOT recombinant lineages:

EF.1 -- alias of BQ.1.1.13.1
EH.1 -- alias of BQ.1.1.28.1
EJ.1 -- alias of BN.1.3.8.1

See the problem? Can you easily distinguish between EG.1 and EF.1? Do you know which is recombinant and which is not?
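With an alias table in hand, the question can be answered mechanically: follow the alias prefix back until it no longer resolves, then check whether the root prefix begins with X. The sketch below uses a hand-written table limited to some of the aliases listed above; the names ALIAS_PARENTS and is_recombinant are hypothetical, and the check assumes that only recombinant prefixes start with X.

```python
# Illustrative table: alias prefix -> parent lineage, taken from the examples above.
ALIAS_PARENTS = {
    "EG": "XBB.1.9.2",
    "EK": "XBB.1.5.13",
    "EF": "BQ.1.1.13",
    "EJ": "BN.1.3.8",
}


def is_recombinant(lineage: str, aliases: dict[str, str]) -> bool:
    """Follow alias prefixes to the root and test whether it starts with X."""
    prefix = lineage.split(".")[0]
    seen = set()  # guards against a malformed table that loops back on itself
    while prefix in aliases and prefix not in seen:
        seen.add(prefix)
        prefix = aliases[prefix].split(".")[0]
    return prefix.startswith("X")


print(is_recombinant("EG.1", ALIAS_PARENTS))  # True
print(is_recombinant("EF.1", ALIAS_PARENTS))  # False
```

The lookup stops at the first prefix missing from the table, so the answer is only as good as the table supplied.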

It has been proposed that PANGO include a prefix or suffix to designate recombinant lineages.

How SARS-CoV-2 Big Data Are Challenging Viral Taxonomy Rules

Focosi D, Maggi F. How SARS-CoV-2 Big Data Are Challenging Viral Taxonomy Rules. Viruses. 2023; 15(3):715.

Nextstrain

Nextstrain Pathogen Surveillance

Nextstrain.org provides real-time snapshots of evolving pathogen populations. It uses interactive visualizations to enable exploration of curated datasets and analyses, which are continually updated as new genomes become available. This offers a powerful pathogen surveillance tool to virologists, epidemiologists, public health officials, and community scientists.

Nextstrain SARS-CoV-2 Clades

Nextstrain introduced informal clade designations for SARS-CoV-2 on 4 March 2020, largely to aid internal discussions and to create URL links allowing ‘automatic zoom’ to an area of the tree that was of interest. These clade names were ad-hoc letter-number combinations (e.g. A2a); they were never intended to be a permanent naming system and were never visible by default.

Initial Nextstrain Clade Naming Strategy

In June 2020 we put forth an initial Nextstrain clade naming strategy. This basic strategy of flat “year-letter” names was born out of work with seasonal influenza, where the nested names of 3c2.A1b (etc…) can become unwieldy. In the “year-letter” scheme, years are there to make it easy to know what’s being discussed in ~5 years when, for example, clade 20A is referenced. Our June strategy called for naming of a clade when it reached >20% global frequency for more than 2 months.
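The flat “year-letter” scheme can be illustrated with a short Python sketch. It assumes that letters are handed out in alphabetical order within each year, which the text above implies but does not state; the function name next_clade_name is hypothetical.

```python
from string import ascii_uppercase


def next_clade_name(year: int, existing: list[str]) -> str:
    """Return the next unused year-letter name for the given year."""
    yy = f"{year % 100:02d}"  # two-digit year, e.g. 2020 -> "20"
    used = {name[2:] for name in existing if name.startswith(yy)}
    for letter in ascii_uppercase:
        if letter not in used:
            return yy + letter
    raise ValueError(f"no single letters left for year {year}")


print(next_clade_name(2020, ["20A", "20B", "20C"]))  # 20D
print(next_clade_name(2021, ["20A", "20B", "20C"]))  # 21A
```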

January 2021 Naming Strategy Update

However, as the pandemic progressed, lack of international travel made it so that no clades beyond the initial clades 20A, 20B and 20C made it past 20% global frequency. Instead, we’ve seen “regional” clades that hit appreciable frequency in different continent-level regions of the world.

Consequently, we propose an updated strategy, where major (year-letter) clades are named when any of the following criteria are hit:

A clade reaches >20% global frequency for 2 or more months
A clade reaches >30% regional frequency for 2 or more months
A VOC (‘variant of concern’) is recognized (applies currently to 501Y.V1 and 501Y.V2)

April 2022 Naming Strategy Update

The rapid dominance and increased diversity of the Omicron variant and its constituent sub-lineages has triggered another update of the Nextstrain clade naming guidelines and labels. Once again, we propose a backwards-compatible update, this time to allow more flexible and faster designation of clades as new variants appear and spread.

However, with the dominance and diversification of the Omicron family of variants, we are seeing trends in rising variants that are notable, of public and scientific interest, and genetically distinct, but that do not yet meet previous criteria to be labeled. In order to allow easier reference to such variants at an earlier time, we propose an update to the current guidelines. This will allow designation of a clade if it shows consistent >0.05 per day growth in frequency where it is circulating, in addition to reaching >5% regional frequency.

Our revised set of guidelines will therefore now be to designate a clade when any of the following criteria are met:

A VOC or VOI is recognized by the WHO and given a Greek letter label
A clade reaches >20% global frequency for 2 or more months
A clade reaches >30% regional frequency for 2 or more months
A clade shows consistent >0.05 per day growth in frequency where it’s circulating and has reached >5% regional frequency
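The four criteria above can be expressed as a simple check. The Python sketch below is illustrative only and is not Nextstrain code: the CladeStats fields and the designation_reasons function are hypothetical names, and the inputs (for example, whether growth has been "consistent") are assumed to have been worked out beforehand.

```python
from dataclasses import dataclass


@dataclass
class CladeStats:
    who_greek_label: bool               # WHO recognized a VOC/VOI with a Greek letter
    months_above_20pct_global: float    # months spent above 20% global frequency
    months_above_30pct_regional: float  # months spent above 30% regional frequency
    consistent_daily_growth: float      # consistent growth in frequency per day
    regional_frequency: float           # highest regional frequency reached (0 to 1)


def designation_reasons(s: CladeStats) -> list[str]:
    """Return the criteria a clade meets; any single one is enough."""
    reasons = []
    if s.who_greek_label:
        reasons.append("WHO Greek letter label")
    if s.months_above_20pct_global >= 2:
        reasons.append(">20% global frequency for 2+ months")
    if s.months_above_30pct_regional >= 2:
        reasons.append(">30% regional frequency for 2+ months")
    if s.consistent_daily_growth > 0.05 and s.regional_frequency > 0.05:
        reasons.append(">0.05/day growth and >5% regional frequency")
    return reasons


emerging = CladeStats(False, 0, 0, 0.08, 0.07)
print(designation_reasons(emerging))  # ['>0.05/day growth and >5% regional frequency']
```

A clade that returns an empty list meets none of the criteria and would not be designated under these guidelines.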

GISAID

Global Initiative on Sharing Avian Influenza Data

The GISAID Initiative promotes the rapid sharing of data from all influenza viruses and the coronavirus causing COVID-19. Established in 2008, it was created in response to the threat of an H5N1 avian influenza pandemic to provide open access to genomic data of influenza viruses. This includes genetic sequence and related clinical and epidemiological data associated with human viruses, and geographical as well as species-specific data associated with avian and other animal viruses, to help researchers understand how viruses evolve and spread during epidemics and pandemics.

GISAID does so by overcoming disincentive hurdles and restrictions that discouraged or prevented the sharing of virological data prior to formal publication.

The Initiative ensures that open access to data in GISAID is provided free of charge to all individuals who agree to identify themselves and to uphold the GISAID sharing mechanism governed through its Database Access Agreement.

All bona fide users with GISAID access credentials agree to uphold a scientific etiquette: acknowledging the Originating laboratories that provide the specimens and the Submitting laboratories that generate sequence and other metadata, and ensuring fair exploitation of results derived from the data. All users also agree that no restrictions shall be attached to data submitted to GISAID, to promote collaboration among researchers on the basis of open sharing of data and respect for all rights and interests.

GISAID EpiCoV

After COVID-19 was identified as a newly emerging viral respiratory disease, the first hCoV-19 genomes were made available to the scientific community on 10 January 2020 on GISAID's newly established EpiCoV™ platform. Prestigious institutions around the globe then contributed experts to GISAID's team of curators so that vast amounts of data could be reviewed, curated, and annotated in real time prior to release. Their remarkable contribution remains key to the unprecedented speed of progress in understanding COVID-19 and in the research and development of candidate medical countermeasures.

GISAID maintains the world's largest repository of SARS-CoV-2 sequences, with over 14,500,000 sequences submitted.