Monday, July 10, 2023

GPT4 Explains A Bit About NGS CGP and MSI

User - MSI? Please help me understand.

I am a clinical molecular biologist and I am reviewing a test that does comprehensive genomic profiling (sequencing 200 tumor genes by NGS).   

The test "assesses MSI" but I can't really tell the method.  What would be a likely way of assessing "MSI" microsatellite instability in this setting of tumor NGS sequencing?


Microsatellite instability (MSI) refers to the hypermutability (or propensity for mutation) of short repetitive DNA sequences (microsatellites) due to defects in the DNA mismatch repair (MMR) system. This is frequently observed in certain types of cancer and can be an important indicator for the suitability of certain types of treatment, particularly with checkpoint inhibitor immunotherapies.

In the context of Next-Generation Sequencing (NGS) of a tumor's genomic profile, there are multiple methods to assess MSI status, but one of the most likely ways is by using bioinformatics analysis of the sequencing data.

Here's a typical method:

Sequencing the tumor's genomic DNA: The genomic DNA from a tumor tissue sample (and ideally a matched normal sample) would be sequenced using an NGS platform like Illumina's MiSeq or NovaSeq. The comprehensive genomic profiling that you mention, focusing on 200 tumor genes, would typically also cover a number of microsatellite loci within or near the targeted regions.

Alignment and Variant Calling: The sequencing reads would be aligned to a reference genome, and variants (mutations) are identified.

Bioinformatics Analysis for MSI: Microsatellite loci in the genome are known and can be identified in the sequence data. An algorithm can then examine the repeat lengths seen in reads spanning these loci and count how many loci show insertions or deletions of repeat units. A high proportion of unstable loci is indicative of MSI, and this can be quantified as a score.

There are also specific bioinformatics tools and algorithms available, like MSIsensor, MANTIS, or MSIseq, which are designed to detect MSI in NGS data. These tools essentially work by comparing repeat lengths or indel counts at microsatellites in the tumor sample to a baseline (either a matched normal sample or a precomputed reference).
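The tumor-versus-baseline comparison described above can be sketched in a few lines of Python. This is only an illustration, not the algorithm used by any of the named tools: the function names, the toy read data, the distance measure (total variation distance), and the 0.2 threshold are all assumptions chosen for the example.

```python
from collections import Counter

def length_distribution(repeat_lengths):
    """Convert observed repeat lengths (one per read) into fractions."""
    counts = Counter(repeat_lengths)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}

def distribution_distance(tumor_lengths, normal_lengths):
    """Total variation distance between two length distributions (0 to 1)."""
    t = length_distribution(tumor_lengths)
    n = length_distribution(normal_lengths)
    lengths = set(t) | set(n)
    return 0.5 * sum(abs(t.get(k, 0.0) - n.get(k, 0.0)) for k in lengths)

def msi_score(loci, threshold=0.2):
    """Return (fraction of unstable loci, names of unstable loci).

    `threshold` is an arbitrary illustrative cutoff, not a validated value.
    """
    unstable = [
        name
        for name, (tumor, normal) in loci.items()
        if distribution_distance(tumor, normal) > threshold
    ]
    return len(unstable) / len(loci), unstable

# Toy data: repeat length seen in each read, as (tumor reads, normal reads).
loci = {
    "locus_A": ([15] * 18 + [14] * 2, [15] * 19 + [14] * 1),
    "locus_B": ([21] * 8 + [18] * 7 + [17] * 5, [21] * 19 + [20] * 1),
    "locus_C": ([12] * 20, [12] * 20),
    "locus_D": ([25] * 9 + [22] * 11, [25] * 20),
}

score, unstable = msi_score(loci)
print(score, unstable)  # 0.5 ['locus_B', 'locus_D']
```

In this toy data, two of the four loci show shortened repeat tracts in the tumor reads, so half of the tested loci are flagged as unstable.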

It's also possible to assess MSI status by looking for mutations in the MMR genes themselves (MLH1, MSH2, MSH6, PMS2), as such mutations often lead to MSI. This is less direct and may miss some MSI-positive cases, but could be used as a complementary approach.

Please note that while this provides an overall idea, the specific method can vary based on different laboratory protocols or workflows.

User - CNV

Thank you. I understand they may also assess specific tumor genes like MSH2 and MSH6. They also report copy number variation (CNV). I thought this was hard to do with NGS libraries, and more likely done with quantitative microarrays. However, they assess CNV with NGS. How might they do that accurately?


Copy number variations (CNVs), usually called copy number alterations (CNAs) when they are acquired in a tumor, are alterations of the DNA of a genome that result in cells having an abnormal number of copies of one or more sections of the DNA. CNAs have been associated with various diseases, including cancer.

You're correct in that traditionally, CNAs have been assessed with comparative genomic hybridization (CGH) microarrays. However, with advances in next-generation sequencing (NGS) and bioinformatics, it is now possible to accurately assess CNAs using NGS data.

There are multiple methods to do this, but in general, the approach involves comparing the read depth (the number of sequenced reads that align to a particular genomic region) in the tumor sample to a baseline. This baseline could be a matched normal sample, a pooled normal control, or a computed expectation based on factors like GC content. Regions with a higher or lower read depth than expected are inferred to have copy number gains or losses, respectively.

Here's a more detailed process:

Sequencing and Alignment: As with the MSI analysis, the genomic DNA from the tumor sample is sequenced and the reads are aligned to a reference genome.

Normalization and Segmentation: The genome is divided into bins (for example, each bin might be 1 kilobase long), and the number of reads in each bin is counted. This raw read count is then normalized to correct for various factors that can affect sequencing depth, like GC content, mappability, and sequencing biases. After normalization, a segmentation algorithm is applied to group adjacent bins with similar read depth into segments.

Calling Copy Number Variations: For each segment, the normalized read depth is compared to the expected baseline read depth. Segments with a significantly higher or lower read depth are called as copy number gains or losses, respectively.

Several bioinformatics tools have been developed to carry out this process, including tools like CNVkit, Control-FREEC, and Sequenza. As with MSI analysis, the specific method can vary based on the lab protocol or workflow.

This process can accurately detect large-scale CNAs, but detecting smaller CNAs can be more challenging due to factors like noise in the data. The accuracy can also depend on the sequencing depth - deeper sequencing provides more data and can allow for more accurate CNA detection. It's also important to note that this method only provides relative copy number (compared to the baseline), not absolute copy number.
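The read-depth comparison described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of CNVkit or any other tool: the function names and toy counts are hypothetical, median scaling stands in for full GC and mappability correction, the +/-0.4 log2 cutoffs are arbitrary, and merging adjacent identical calls is a crude stand-in for a real segmentation algorithm. It also assumes every bin has a nonzero read count.

```python
import math
from itertools import groupby
from statistics import median

def normalize(counts):
    """Scale raw bin counts by the sample median so library size cancels out."""
    m = median(counts)
    return [c / m for c in counts]

def log2_ratios(tumor_counts, baseline_counts):
    """Per-bin log2 ratio of normalized tumor depth to normalized baseline depth."""
    t = normalize(tumor_counts)
    b = normalize(baseline_counts)
    return [math.log2(ti / bi) for ti, bi in zip(t, b)]

def call_bins(ratios, gain=0.4, loss=-0.4):
    """Label each bin using illustrative (not validated) log2 cutoffs."""
    calls = []
    for r in ratios:
        if r >= gain:
            calls.append("gain")
        elif r <= loss:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls

def merge_calls(calls):
    """Merge runs of adjacent bins with the same call into (call, first, last)."""
    segments, start = [], 0
    for call, group in groupby(calls):
        n = len(list(group))
        segments.append((call, start, start + n - 1))
        start += n
    return segments

# Toy read counts for nine consecutive bins.
tumor = [200, 210, 190, 600, 620, 205, 100, 95, 200]
baseline = [100, 105, 95, 100, 102, 100, 100, 98, 100]

ratios = log2_ratios(tumor, baseline)
calls = call_bins(ratios)
print(merge_calls(calls))
# [('neutral', 0, 2), ('gain', 3, 4), ('neutral', 5, 5), ('loss', 6, 7), ('neutral', 8, 8)]
```

Because both samples are scaled by their own median, the result is a relative measure: it shows which bins deviate from the bulk of the sample, not how many copies are physically present.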


User - Remark on CNA

That's very interesting. I know that NGS is reported with an average read depth (a quality measure), but here, if an area locally appears to have a very high read depth (60X instead of 20X) we can infer the copy number there is 3X as high as elsewhere. Clever.


Exactly, you've got it! In this context, "read depth" refers to the number of individual sequencing reads that cover or align to a particular region in the genome. It's essentially a measure of how much sequencing data you have for that region.

In the context of Next-Generation Sequencing (NGS), the average read depth across the whole genome or exome is indeed often used as a quality metric. A higher average read depth allows for more confident variant calling and can also allow detection of variants present in a small proportion of cells (useful for detecting tumor heterogeneity or low-frequency variants in liquid biopsy samples, for example).

For copy number alteration (CNA) analysis, we're using read depth in a slightly different way. Instead of looking at the average read depth, we're looking at how the read depth varies across different regions of the genome. If one region has a higher read depth than the rest of the genome, it suggests that there are more copies of that region in the sample's cells - hence, a copy number gain. Conversely, a region with lower read depth might indicate a copy number loss.

This approach is indeed quite clever! It's a great example of how bioinformatics can extract valuable information from NGS data beyond just identifying variants. Of course, it also has its challenges and limitations (like distinguishing between true CNAs and variations in read depth due to technical factors), but with careful normalization and analysis, it can provide useful insights into the genomic alterations in a tumor.
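As a worked example of the arithmetic, using the 60X versus 20X figures from the question above: the sketch below converts a local depth and a typical depth into a ratio, a log2 ratio, and an implied copy number. The function name is hypothetical, and the copy-number conversion assumes that the typical-depth regions carry two copies and that the sample consists entirely of tumor cells, which is rarely true of a real specimen.

```python
import math

def relative_copy_number(local_depth, typical_depth, baseline_copies=2):
    """Return (depth ratio, log2 ratio, implied copies).

    Assumes the typical-depth regions have `baseline_copies` copies and
    that the sample is pure tumor (both are simplifying assumptions).
    """
    ratio = local_depth / typical_depth
    return ratio, math.log2(ratio), ratio * baseline_copies

ratio, log2_ratio, copies = relative_copy_number(60, 20)
print(ratio, round(log2_ratio, 3), copies)  # 3.0 1.585 6.0
```

By the same arithmetic, a single extra copy on a two-copy background would give 30X against 20X, a ratio of 1.5.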

Are the MSI variations actually "mutations" like point mutations, or are they copy number variations (like the triplet repeats in Huntington's disease)? And these small repeated regions, how long are the repeats? 5 bases? 500?


The changes seen in microsatellite instability (MSI) are genuine mutations, but they are not point mutations. They are small insertions or deletions of whole repeat units within short, repeated sequences of DNA, known as microsatellites.

Microsatellites, also known as short tandem repeats (STRs), consist of simple sequences of DNA that are repeated in a direct head-to-tail manner. Mononucleotide runs (e.g., A repeated n times) and dinucleotide repeats (e.g., CA repeated n times) are the most common types, but tri-, tetra-, penta-, and hexanucleotide repeats also exist. So in terms of length, the repeated unit is usually 1-6 base pairs, and the total length of the microsatellite region can vary significantly but is typically in the range of about 10 to 60 base pairs. Longer tracts, up to about 100 base pairs in total, can still be analyzed with some methods.
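To make the definition concrete, here is a minimal Python sketch that scans a DNA string for perfect tandem repeats with a unit of 1-6 bases. The function name, the example sequence, the requirement of at least three copies, and the 10-base minimum tract length are assumptions chosen for illustration. Real microsatellite catalogs are built with dedicated tools and also handle imperfect repeats.

```python
import re

# A motif of 1-6 bases (shortest motif tried first) followed by at least
# two more copies of itself, so three or more copies in total.
PATTERN = re.compile(r"([ACGT]{1,6}?)\1{2,}")

def find_microsatellites(seq, min_total_length=10):
    """Return (start, motif, repeat_count) for each perfect tandem repeat."""
    hits = []
    for m in PATTERN.finditer(seq.upper()):
        motif = m.group(1)
        total = m.end() - m.start()
        if total >= min_total_length:
            hits.append((m.start(), motif, total // len(motif)))
    return hits

# Toy sequence: a (CG)x5 tract and an (A)x12 tract with unrelated flanks.
seq = "TGACT" + "CG" * 5 + "TAGCT" + "A" * 12 + "GCTAG"
print(find_microsatellites(seq))  # [(5, 'CG', 5), (20, 'A', 12)]
```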

MSI is caused by a deficiency in the DNA mismatch repair (MMR) system. Normally, the MMR system recognizes and repairs mistakes made during DNA replication, such as if a microsatellite length changes due to errors in replication. If the MMR system is deficient, these mistakes are not repaired and the length of the microsatellites can change, leading to instability. In other words, the number of repeats in the microsatellites can increase or decrease, leading to a different length of the microsatellite region in the DNA. This is different from copy number variations, where whole sections of DNA, usually much larger than microsatellites, are duplicated or deleted.

These microsatellite mutations often occur in noncoding regions of the genome, but can sometimes occur within genes or in regulatory regions and can affect gene function. MSI has been associated with certain types of cancer, particularly colorectal and endometrial cancer, and can be an important consideration in treatment decisions.


Just to confirm, these regions could be "CGCGCGCGCG," which would be five pairs repeated up to a length of 10 bases.


Yes, that's correct. Microsatellites, or short tandem repeats (STRs), are sequences in which a short motif is repeated multiple times in a row. The repeated motif is usually 1-6 bases long.

In your example, "CGCGCGCGCG" is a sequence in which the dinucleotide "CG" is repeated five times, resulting in a total length of 10 bases. If you saw this sequence in a patient's normal cells but then saw "CGCGCGCGCGCGCG" (7 repeats of "CG") at the same locus in that patient's tumor cells, that would represent microsatellite instability (MSI). (A difference in repeat number between two individuals is ordinary inherited polymorphism, not MSI.) The DNA mismatch repair system normally corrects these types of replication errors, and when it doesn't, it can lead to variability in the length of these microsatellite regions.

So, your example perfectly represents what a microsatellite is, and the changes in length that can happen due to MSI.


And it's a variation in the number of repeats (here, 7 instead of 5) rather than in the sequence itself (like CGCGCGTTCG)


Exactly. Microsatellite instability (MSI) refers to changes in the length of the microsatellite region due to a difference in the number of repeats. So, if we were looking at the "CG" microsatellite you used in your example, MSI could result in a sequence like "CGCGCGCGCGCGCG" (7 repeats) instead of the original "CGCGCGCGCG" (5 repeats).

It's not about point mutations within the repeats. If you have a sequence like "CGCGCGTTCG", that would be considered a base substitution (here, one "CG" unit replaced by "TT") within the repeated region, but it's not what we're referring to when we talk about MSI. MSI is about changes in the number of repeats, not changes in the sequence of the repeats themselves.
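The distinction can be sketched in code using the sequences from this exchange. The function names are hypothetical, and the logic only handles perfect repeats of a known motif. It is meant to show the difference between a change in repeat number and a change in the bases themselves, not to serve as a variant caller.

```python
def count_repeats(seq, motif):
    """Return the repeat count if seq is a perfect tandem repeat of motif, else None."""
    if len(seq) % len(motif) != 0:
        return None
    n = len(seq) // len(motif)
    return n if motif * n == seq else None

def describe_change(reference, observed, motif):
    """Classify how `observed` differs from a perfect-repeat `reference`."""
    ref_n = count_repeats(reference, motif)
    if ref_n is None:
        raise ValueError("reference is not a perfect repeat of the motif")
    obs_n = count_repeats(observed, motif)
    if obs_n is not None:
        if obs_n == ref_n:
            return "unchanged"
        return f"repeat-length change: {ref_n} -> {obs_n} repeats"
    if len(observed) == len(reference):
        diffs = sum(a != b for a, b in zip(reference, observed))
        return f"base substitution at {diffs} position(s), length unchanged"
    return "complex change"

reference = "CGCGCGCGCG"
print(describe_change(reference, "CGCGCGCGCGCGCG", "CG"))
# repeat-length change: 5 -> 7 repeats
print(describe_change(reference, "CGCGCGTTCG", "CG"))
# base substitution at 2 position(s), length unchanged
```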

However, it's important to note that both point mutations and MSI can have significant impacts on gene function and can be associated with various diseases, including cancer. The DNA mismatch repair system is involved in correcting both types of errors, and a defect in this system can lead to both an increased mutation rate and MSI.


So it is quite a bit like trinucleotide repeat diseases like Huntington's which I am quite familiar with.


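Yes, the principle is indeed similar. Both microsatellite instability (MSI) in cancer and trinucleotide repeat disorders like Huntington's disease involve changes in the number of copies of a short repeated DNA sequence. In Huntington's disease the change is an expansion, whereas in MSI the repeat tracts can either lengthen or shorten.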

Huntington's disease, for example, is caused by an expansion of a trinucleotide repeat sequence (CAG) in the HTT gene. Healthy individuals usually have fewer than 36 CAG repeats in this gene, while people with Huntington's disease have 40 or more repeats (36 to 39 repeats is a reduced-penetrance range). The repeat can expand further when it is passed to the next generation, which leads to earlier onset in successive generations, a phenomenon known as anticipation.

Microsatellite instability (MSI), on the other hand, involves changes in the length of microsatellites (which can be mono-, di-, tri-, tetra-, penta-, or hexanucleotide repeats) due to deficiencies in the DNA mismatch repair (MMR) system. These changes can occur throughout the genome and are a hallmark of certain types of cancer, particularly colorectal and endometrial cancer.

The key difference is that while trinucleotide repeat disorders involve specific genes and are usually inherited, MSI is often acquired (due to mutations in, or epigenetic silencing of, the MMR genes during an individual's lifetime) and can affect many different regions of the genome.

So, while the underlying mechanism - changes in the length of repeating DNA sequences - is the same, the context, causes, and consequences of these changes can vary significantly between MSI in cancer and trinucleotide repeat disorders.


Does the specific nature of the MSI matter?   Do some cases focus on producing certain pentanucleotide repeats, or different types of repeat (CGC vs TAT for example).   Or is MSI just a big bucket.


The overall presence of microsatellite instability (MSI) in a tumor is usually the most critical factor considered clinically because it indicates a deficiency in the DNA mismatch repair (MMR) system. This information can be important for guiding treatment decisions, particularly with respect to the use of checkpoint inhibitor immunotherapies. However, the specific nature of MSI can vary, and these variations can sometimes provide additional information.

Type of repeat: The most commonly used MSI testing methods examine multiple different microsatellite loci, most often mononucleotide repeats, sometimes combined with dinucleotide repeats. However, the specific loci tested can vary between different tests.

Location: While MSI can occur throughout the genome, some microsatellites are located within genes or regulatory regions, and instability in these regions can potentially affect gene function. However, predicting the effects of such changes can be challenging due to the complex nature of gene regulation and function.

Level of instability: A tumor can be classified as microsatellite stable (MSS), MSI-low (MSI-L), or MSI-high (MSI-H) depending on the proportion of tested microsatellites that show instability. This can provide some information about the severity of the MMR deficiency and can be relevant for treatment decisions.

Underlying cause: MSI can be caused by inherited defects in the MMR genes (as in Lynch syndrome), or by acquired (somatic) mutations or epigenetic changes in these genes. Determining the underlying cause of MSI can be important for assessing a patient's risk of other cancers and for providing information to their relatives.

The specific nature of the MSI might potentially matter more for research purposes. For example, it's possible that some types of repeats or some loci might be more prone to instability, or that the patterns of MSI might differ between different types of cancer. Understanding these patterns could potentially provide insights into the biology of cancer and the DNA mismatch repair system. However, this is a complex area of research and much is still not known.
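The "Level of instability" classification above can be expressed as a small function. This is a sketch under stated assumptions: the function name is hypothetical, and the cutoffs (MSI-H at 30% or more unstable loci, MSI-L for any lower nonzero fraction) are illustrative defaults. Each validated assay defines its own loci and thresholds.

```python
def classify_msi(unstable_loci, tested_loci, high_cutoff=0.30):
    """Classify MSI status from the proportion of unstable loci.

    `high_cutoff` is an assumed, illustrative threshold; validated assays
    set their own.
    """
    if tested_loci <= 0:
        raise ValueError("tested_loci must be positive")
    if not 0 <= unstable_loci <= tested_loci:
        raise ValueError("unstable_loci must be between 0 and tested_loci")
    fraction = unstable_loci / tested_loci
    if fraction >= high_cutoff:
        return "MSI-H"
    if fraction > 0:
        return "MSI-L"
    return "MSS"

# With a five-marker panel:
for unstable in (0, 1, 2):
    print(unstable, classify_msi(unstable, 5))
# 0 MSS
# 1 MSI-L
# 2 MSI-H
```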