Discoveries in Health Policy: AI Corner: Large Study of AI Test Performance, 32 Courses, Compared to Humans

Thursday, September 7, 2023

In a large-scale study published in the Nature-family journal Scientific Reports, ChatGPT's grades on essay questions were compared with those of human students in 32 classes spread over 8 subject matter areas. For example, in the Psychology course "Biopsychology," ChatGPT outperformed the students, while in the Psychology course "Social Psychology," it did slightly worse. In the chart below, the AI score is shown in green, the human score in blue.

Find Ibrahim et al. here.

See a news report about the study here.

In each topic area, three real student answers and three ChatGPT answers (six answers in total) to each of 10 questions were graded by three graders. (The inter-rater reliability of grading the six answers varied by subject area.)

The AI classification programs GPTZero and AI Text Classifier were quite imperfect at detecting which answers were AI-generated and which were student-written, making many errors in both directions. Detection performance fell even further when AI answers were "processed" through a rephrasing program called Quillbot.
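To make "errors in both directions" concrete, here is a minimal Python sketch of how the two error types for a detector could be tallied. The function name `detector_error_rates` and the six sample labels are hypothetical and made up for illustration; they are not data or code from the study.

```python
def detector_error_rates(is_ai, flagged_ai):
    """Return (false_positive_rate, false_negative_rate) for an AI-text detector.

    is_ai:      True where the answer really was AI-generated
    flagged_ai: True where the detector labeled the answer as AI-generated
    """
    pairs = list(zip(is_ai, flagged_ai))
    human_flags = [flag for truth, flag in pairs if not truth]
    ai_flags = [flag for truth, flag in pairs if truth]
    if not human_flags or not ai_flags:
        raise ValueError("need at least one human and one AI answer")
    # False positive: a student-written answer flagged as AI
    fpr = sum(human_flags) / len(human_flags)
    # False negative: an AI-generated answer that slipped through as human
    fnr = sum(not flag for flag in ai_flags) / len(ai_flags)
    return fpr, fnr


# Made-up labels for illustration only (NOT results from the study)
is_ai = [True, True, True, False, False, False]
flagged_ai = [True, False, False, True, False, False]

fpr, fnr = detector_error_rates(is_ai, flagged_ai)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

A rephrasing step such as Quillbot would be expected to show up here mainly as a higher false negative rate, since more AI answers would pass as student-written.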

___

Some of the computer/math classes required math or coding answers only, not essays.