ChatGPT in OBGYN: Strengths and Weaknesses
We recently tested ChatGPT on the MRCOG exams, a demanding series of examinations in obstetrics and gynaecology administered by the Royal College of Obstetricians and Gynaecologists and widely regarded as the international gold standard in the specialty. The exams measure both factual knowledge and clinical reasoning, requiring candidates to demonstrate factual understanding and to make decisions in complex scenarios. ChatGPT achieved 72.2% accuracy on Part One, which focuses on foundational scientific knowledge, but only 50.4% on Part Two, which tests advanced clinical reasoning. These results highlight both the promise and the current limits of AI in medical settings.
One of the key findings was ChatGPT’s varied performance across topics. It excelled in biochemistry, achieving nearly 80% accuracy, and handled clinical management questions with similar success. These are areas where clear, factual answers dominate, which suits AI’s pattern-recognition strengths. However, ChatGPT struggled with subjects like labour management, where answers depend on a nuanced reading of clinical context. Making decisions during childbirth, for example, requires integrating multiple factors at once, including patient history, real-time data, and competing risks, which current models struggle to weigh reliably. This gap reflects the broader challenge of applying AI to complex, dynamic fields like medicine.
We also studied ChatGPT’s confidence in its answers. Surprisingly, it was often just as confident about answers it got wrong as about those it got right. This is concerning: overconfidence in incorrect answers can mislead users, especially in high-stakes settings like healthcare, and a model that cannot gauge its own uncertainty cannot reliably signal when it might be making a mistake. For example, when ChatGPT misidentified the best clinical intervention in certain cases, it gave no indication that its choice might be flawed. This highlights an important limitation: current AI models cannot meaningfully self-assess their reliability.
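To make the calibration idea concrete, here is a minimal sketch of how one might compare a model's stated confidence on correct versus incorrect answers. The data below is invented for illustration only; it is not taken from our study, and the specific numbers and format are assumptions.

```python
# Minimal sketch of a confidence-calibration check on graded answers.
# The records below are illustrative, NOT data from the MRCOG study.
from statistics import mean

# Each record: (model's self-reported confidence on a 0-1 scale,
#               whether its answer was actually correct)
graded = [
    (0.90, True), (0.90, False), (0.80, True), (0.85, False),
    (0.95, True), (0.90, False), (0.80, True), (0.90, True),
]

correct = [conf for conf, ok in graded if ok]
wrong = [conf for conf, ok in graded if not ok]

# A well-calibrated model should report noticeably lower confidence
# on answers it gets wrong; a gap near zero (or negative) signals
# overconfidence of the kind described above.
gap = mean(correct) - mean(wrong)
print(f"mean confidence when correct: {mean(correct):.2f}")
print(f"mean confidence when wrong:   {mean(wrong):.2f}")
print(f"calibration gap:              {gap:+.2f}")
```

With these toy numbers the gap is essentially zero, mirroring the finding that wrong answers were delivered with the same assurance as right ones.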
Another interesting finding was the role of question format. ChatGPT performed better on Single Best Answer (SBA) questions, where the correct choice is often clear-cut, compared to Extended Matching Questions (EMQs), which require comparing multiple options. EMQs often include more subtle distinctions and demand higher levels of reasoning. This difference in performance underscores that AI, while capable of processing large amounts of data, still struggles with tasks requiring deep contextual understanding or prioritization of conflicting information.
Finally, we evaluated whether the complexity of language in the questions influenced ChatGPT’s performance. Questions with more complex vocabulary and structure slightly reduced accuracy, but the effect was small. This suggests that ChatGPT’s limitations are less about understanding language and more about the underlying reasoning required for certain medical tasks. While the model can parse intricate language, it still lacks the deeper, clinical insight that comes from years of medical training and experience.
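As a rough illustration of how question wording can be scored for complexity, the sketch below uses average word length as a crude readability proxy. Both the scoring function and the sample questions are hypothetical stand-ins, not the measure or items used in the study.

```python
# Illustrative sketch: scoring question complexity with a crude
# readability proxy. The metric (average word length) and the sample
# questions are hypothetical, not taken from the MRCOG papers.

def complexity(text: str) -> float:
    """Average word length: a rough stand-in for vocabulary difficulty."""
    words = text.split()
    return sum(len(w) for w in words) / len(words)

# (question text, whether the model answered it correctly)
questions = [
    ("Which hormone maintains the corpus luteum in early pregnancy?", True),
    ("A multiparous woman at 39 weeks with a pathological cardiotocograph "
     "requires which immediate management?", False),
]

for text, answered_correctly in questions:
    print(f"complexity={complexity(text):.1f}  correct={answered_correctly}")
```

Plotting such a score against accuracy across many questions is one simple way to test whether wording difficulty, rather than clinical reasoning, drives errors; in our data the effect of wording was small.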
These findings are important. They show where AI might be useful in medicine and where it still falls short. ChatGPT could assist with basic knowledge retrieval or act as a learning tool for trainees, particularly in straightforward areas with less nuance. However, its limitations in reasoning and reliability mean it is far from ready for independent clinical use. Understanding these boundaries is crucial as we explore how AI can safely complement human expertise in healthcare.
Read our publication in Nature Partner Journals – Women’s Health here.