Friday, January 13, 2023

CHATGPT Produces Passable Scientific Abstracts (Nature; Gao)


Back in mid-December, I wrote a blog post about CHATGPT in which I asked it a number of questions related to genomics (here). (Separately, I recently asked CHATGPT to find an obscure Medicare regulation about self-administered drugs, here.)

By now, the internet and YouTube are ablaze with articles and videos about countless aspects of CHATGPT.

Here is a news article in Nature reporting that CHATGPT can produce scientific abstracts that, in many cases, human reviewers cannot detect as AI products. The Nature news article is based on a bioRxiv preprint by Gao et al.

Find Nature here:  https://www.nature.com/articles/d41586-023-00056-7

Find Gao et al. here:  https://www.biorxiv.org/content/10.1101/2022.12.23.521610v1

Human reviewers correctly identified about 7 in 10 (68%) of the AI-generated abstracts, but in doing so they misclassified 14% of the human-written abstracts as AI generated. If I read the abstract correctly, an automated detection tool could correctly classify essentially 100% of the AI and human abstracts, better than the human readers.
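
To put those two reviewer figures side by side, here is a minimal Python sketch (mine, not from the preprint). It treats the 68% as the reviewers' sensitivity for spotting AI abstracts and the 14% as their false-positive rate on human abstracts. The function name and the choice of balanced accuracy as a one-number summary are my own assumptions for illustration.

```python
def reviewer_metrics(sensitivity: float, false_positive_rate: float) -> dict:
    """Summarize detection performance from two rates.

    sensitivity: share of AI abstracts correctly flagged as AI.
    false_positive_rate: share of human abstracts wrongly flagged as AI.
    """
    # Specificity is the share of human abstracts correctly left unflagged.
    specificity = 1.0 - false_positive_rate
    # Balanced accuracy averages the two rates, so it does not depend
    # on how many AI vs. human abstracts each reviewer happened to see.
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
    }


# Figures quoted above: 68% of AI abstracts caught, 14% of human abstracts misflagged.
metrics = reviewer_metrics(sensitivity=0.68, false_positive_rate=0.14)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

On these numbers the reviewers' specificity works out to 0.86 and their balanced accuracy to about 0.77, which is well above a coin flip but well short of the near-perfect classification I took the abstract to be describing for the automated tool.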

###

For fun, I asked CHATGPT to write a methods section for an NGS-based breast cancer risk gene project.

  • AI RESPONSE
  • "Next-generation sequencing (NGS) was used to identify breast cancer risk genes in this study. DNA was extracted from blood samples of participants, who were self-reported as having a family history of breast cancer. Library preparation and sequencing were performed using the Illumina HiSeq platform. Raw NGS data were processed and analyzed using a pipeline that included quality control, alignment, and variant calling. Identified variants were annotated and filtered based on their potential functional impact and frequency in a control population. Candidate breast cancer risk genes were then further validated using Sanger sequencing and functional assays."

Can you write a methods section on the same topic but about twice as long and twice as detailed?

  • AI RESPONSE
  • "Next-generation sequencing (NGS) was utilized in this study to identify breast cancer risk genes among individuals who reported a family history of the disease. Blood samples were collected from participants and DNA was extracted using standard protocols. The extracted DNA was quantified using a fluorometer and quality was assessed on a 2% agarose gel.
  • Library preparation for NGS was performed using the Illumina TruSeq DNA PCR-Free Library Prep Kit. The libraries were then quantified using a fluorometer and the size distribution was assessed on a Bioanalyzer. The libraries were then pooled in equimolar concentrations and sequencing was performed on an Illumina HiSeq 2500 platform.
  • Raw NGS data was processed using a pipeline that included quality control, alignment, and variant calling. Quality control was performed using FastQC to assess the quality of the raw sequencing data. Alignment of the processed data was performed using STAR aligner and the reference genome (GRCh38) was used as the reference. Variants were called using GATK HaplotypeCaller.
  • The identified variants were annotated using Annovar and filtered based on their potential functional impact and frequency in a control population. Candidate breast cancer risk genes were further validated using Sanger sequencing on a ABI 3730xl DNA Analyzer, and functional assays such as cell proliferation and migration assays were conducted to evaluate the functional effects of the variants on breast cancer cell lines.
  • Statistical analysis was performed to evaluate the association of the identified variants with breast cancer risk. Logistic regression models were used to calculate odds ratios and 95% confidence intervals for each variant. A p-value of less than 0.05 was considered statistically significant. The study was approved by the institutional review board and written informed consent was obtained from all participants."