Improving LLM Safety in Women’s Health with Semantic Entropy
Large language models (LLMs) have shown promise in clinical decision support, but their tendency to generate incorrect or misleading information, known as hallucination, limits their safe use in healthcare. The problem is particularly critical in women’s health, where errors in medical reasoning can have serious consequences for maternal and neonatal outcomes. Traditional measures of uncertainty in AI-generated content, such as perplexity, score how likely a particular sequence of words is, so they often fail to capture inconsistencies in meaning: two differently worded answers that say the same thing can look uncertain, while confidently worded answers that contradict each other may not. A new study, now available as a preprint on arXiv, evaluates semantic entropy (SE), a metric that assesses uncertainty at the level of meaning rather than individual words, as a way to improve the detection of AI-generated errors in obstetrics and gynaecology.
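To make the idea of meaning-level uncertainty concrete, the following is a minimal Python sketch, not the study's implementation. It assumes the simplest frequency-based form of semantic entropy: sample several answers to the same question, group answers that mean the same thing, and take the entropy of the group sizes. The names cluster_by_meaning, semantic_entropy and toy_same_meaning are hypothetical, and the equivalence check is a deliberately crude placeholder (case-insensitive string matching). A real system would more plausibly decide equivalence with an entailment model or another LLM.

```python
import math


def cluster_by_meaning(answers, same_meaning):
    """Greedily group answers: each answer joins the first cluster whose
    first member is judged equivalent, otherwise it starts a new cluster."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(answers, same_meaning):
    """Entropy (in nats) of the distribution of answers over meaning clusters.
    Assumption: cluster probability is estimated from sample frequency."""
    clusters = cluster_by_meaning(answers, same_meaning)
    total = len(answers)
    probabilities = [len(cluster) / total for cluster in clusters]
    return -sum(p * math.log(p) for p in probabilities)


def toy_same_meaning(a, b):
    """Placeholder equivalence check (assumption): ignore case, surrounding
    whitespace and a trailing full stop. Not a real semantic comparison."""
    def normalise(text):
        return text.strip().lower().rstrip(".")
    return normalise(a) == normalise(b)


# Illustrative answers only, not data from the study.
consistent = ["Option A", "option a.", "Option A."]
mixed = ["Option A", "Option B", "Option C", "option a."]

print(semantic_entropy(consistent, toy_same_meaning))  # one meaning: zero entropy
print(semantic_entropy(mixed, toy_same_meaning))       # three meanings: higher entropy
```

In this sketch, rewordings of one answer collapse into a single cluster and contribute no uncertainty, whereas genuinely different answers spread probability across clusters and raise the score. That is the property a word-level measure such as perplexity lacks.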
Using a dataset drawn from the UK Royal College of Obstetricians and Gynaecologists (RCOG) MRCOG examinations, the study found that SE significantly outperformed perplexity at identifying unreliable AI-generated responses. SE achieved an AUROC of 0.76, compared with 0.62 for perplexity, showing that it was better at flagging potentially misleading content. In a further validation by clinical experts, SE showed near-perfect discrimination (AUROC: 0.97) between reliable and uncertain outputs. Semantic clustering, which groups responses that share the same meaning, was fully successful in only 30% of cases, yet SE still proved a valuable tool for improving the safety of AI-generated medical content.
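For readers unfamiliar with AUROC, it can be read as the probability that a randomly chosen incorrect response receives a higher uncertainty score than a randomly chosen correct one, so 0.5 is chance and 1.0 is perfect separation. The short Python sketch below computes it by direct pairwise comparison. The function name auroc and the toy scores are illustrative assumptions, not the study's code or data.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive has the
    higher score; ties count as half. Labels: 1 = incorrect response
    (should be flagged), 0 = correct response."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy uncertainty scores and correctness labels (illustrative only).
uncertainty = [0.1, 0.4, 0.35, 0.8]
is_incorrect = [0, 0, 1, 1]

print(auroc(uncertainty, is_incorrect))
```

Here one incorrect response (0.35) scores below one correct response (0.4), so three of the four pairs are ordered correctly. For large datasets a library routine such as scikit-learn's roc_auc_score would be the usual choice over this quadratic loop.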
As AI continues to be integrated into healthcare, ensuring its reliability and safety is paramount. Semantic entropy provides a practical and scalable approach to improving trust in AI-driven clinical tools, particularly in settings where medical expertise is scarce. By allowing for better identification and filtering of uncertain responses, this approach has the potential to support safer and more effective AI applications in women’s health. Further research will be crucial in refining these methods and expanding their use in real-world clinical environments.