Comparative efficacy of ChatGPT 3.5, ChatGPT 4, and other large language models in gynecology and infertility research
  1. Pallav Sengupta (a,*),
  2. Sulagna Dutta (b),
  3. Srikumar Chakravarthi (c),
  4. Ravindran Jegasothy (d),
  5. Ravichandran Jeganathan (e) and
  6. Anuradha Pichumani (f)
  1. (a) College of Medicine, Gulf Medical University, Ajman, United Arab Emirates
  2. (b) School of Medical Sciences, Bharath Institute of Higher Education and Research, TN, India
  3. (c) Faculty of Medicine, SEGi University, Kota Damansara, Malaysia
  4. (d) Faculty of Medicine, MAHSA University, Jenjarom, Malaysia
  5. (e) Department of Obstetrics & Gynaecology, Hospital Sultanah Aminah, Johor Bahru, Malaysia
  6. (f) Sree Renga Hospital, Chengalpattu, Tamil Nadu, India
  1. (*) Corresponding author. Department of Biomedical Sciences, College of Medicine, Gulf Medical University, Ajman, United Arab Emirates. pallav_cu{at}

Dear Editor,

As experts in gynecology and infertility research, we have witnessed the rapid advancement of artificial intelligence (AI) technologies, particularly language models, which have greatly improved their capabilities. In this communication, we aim to compare the proficiency of advanced language models, such as Chat Generative Pre-Trained Transformer (ChatGPT) 3.5, ChatGPT 4, and others, in relation to our field.

AI-driven algorithms have the potential to accelerate research by analyzing vast amounts of academic literature, identifying complex correlations, and synthesizing fragmented knowledge into a cohesive understanding.1 These advanced systems could revolutionize the way we conduct research, extract valuable insights, and translate them into effective clinical applications. AI-based frameworks can speed up scientific progress and ease the challenging process of extracting valuable insights from extensive and complex existing knowledge repositories.2

ChatGPT 3.5, a cutting-edge generative pre-trained transformer by OpenAI, augments numerous disciplines, including content generation, linguistic metamorphosis, and complex data analysis. However, its knowledge repository, limited to pre-September 2021 data, requires supplementation for optimal utility.3 ChatGPT 4 surpasses its predecessor in linguistic cognition, generalizability, and response quality. Enhanced contextual processing and discernment facilitate more refined interactions in gynecology and infertility research, while adeptly addressing intricate queries, thus elevating the investigative process to unprecedented levels.4

Compared to GPT-3, ChatGPT 3.5 showcased superior natural language generation, while ChatGPT 4 further refined this quality. The performance, however, varies among large language models (LLMs). Whenever gynecology- or infertility-related questions or instructions are provided, understanding of the context is vital for the LLMs in order to deliver coherent outputs. ChatGPT 3.5 enhanced comprehension of context and user input, a trait that ChatGPT 4 further sharpened, coupled with a better memory for user interactions. Yet, this understanding is inconsistent across LLMs. When synthesizing data, especially complex research in fields like gynecology or infertility, ChatGPT 3.5 effectively interprets findings, a capability ChatGPT 4 excels in by delivering coherent summaries. This ability, though, depends on the LLMs in use, with some requiring fine-tuning. In customization, ChatGPT 3.5 provides fine-tuning opportunities, especially in specialized fields. ChatGPT 4 enhances this feature, supporting extensive user-driven fine-tuning and domain adaptation, though the extent varies among LLMs. Ethically, ChatGPT 3.5 addressed biases and concerns in gynecologic or infertility data or information better than GPT-3, and ChatGPT 4 strengthened these efforts with advanced bias control. However, ethical attentiveness differs among LLMs. Challenges include increased processing times due to model complexities, overfitting risks, limited interpretability, and higher deployment costs, with ChatGPT 4 demanding more resources and incurring greater costs than its predecessor. These concerns differ across LLMs.

Focusing on gynecology, several studies validated the efficacy of ChatGPT. To assess the efficacy of ChatGPT, Kemp MW et al.5 conducted a rigorous investigation within the context of a simulated clinical evaluation pertaining to the Royal College of Obstetricians and Gynecologists (RCOG) membership virtual Objective Structured Clinical Examination (OSCE). 
The advanced language model was subjected to a series of seven meticulously crafted, structured discourse queries, with its subsequent responses undergoing an unbiased assessment by a cohort of 14 certified examiners. These assessments were subsequently juxtaposed with the historical performance of human examinees. Remarkably, ChatGPT attained an average performance metric of 77.2%, surpassing human candidates across multiple knowledge domains relevant to obstetrics and gynecology. In another study by Santo DSE et al.,6 the utility of ChatGPT as a resource for guidance during unanticipated labor events was scrutinized. The findings of this study demonstrated that ChatGPT possesses considerable potential as a valuable adjunctive instrument to assist individuals confronted with unforeseen labor situations. However, considering these reports and also the limitations of AI-driven LLMs in gynecology practice and infertility research, Grunebaum A et al.7 have opined that ChatGPT possesses considerable efficacy in proffering foundational knowledge pertaining to the domain of obstetrics and gynecology, as corroborated by its elaborate, eloquent, erudite, and syntactically coherent responses to a multifarious assortment of queries (Table 1).5–11

Table 1

Comparative view of ChatGPT 3.5, ChatGPT 4, and other LLMs, and studies in Gynecology Research.

Given the rapid expansion of AI, gynecology and infertility researchers and clinicians should stay abreast of state-of-the-art developments and incorporate these technological tools to improve the quality and rigor of their empirical work. Language models can aid in identifying therapeutic targets, analyzing clinical trial data, and exploring innovative treatments by processing vast scientific texts.7 They enable rapid access to relevant data on gynecological pathologies, therapeutic interventions, and fertility alternatives through natural language inquiries and assimilation of evidence-based insights from extensive medical literature. They can also generate diagnostic hypotheses and offer informed therapeutic recommendations for specific conditions, augmenting the expertise of healthcare professionals.12 This approach enhances the decision-making abilities of clinicians and empowers superior diagnostic accuracy and patient outcome prediction through data-driven algorithms. It expedites the mining of medical records via natural language processing, obviating laborious chart reviews. Furthermore, AI fosters educational innovation, enabling context-aware training modules, augmenting scholastic proficiency, and catalyzing groundbreaking advancements in healthcare.13 (Fig. 1).

Fig. 1

Multi-faceted applications of language models in gynecology and infertility research.

Elucidating the intricate mechanisms governing male and female infertility has long been a subject of paramount importance.14 Leveraging the prodigious capabilities of LLMs presents a promising avenue to unravel the complex underpinnings of infertility and expedite advancements in this domain. By integrating LLMs into investigative frameworks, researchers and clinicians can synergistically consolidate disparate sources of information, facilitate hypothesis generation, and harness a robust knowledge base to propel future investigations.1 Current knowledge about infertility is riddled with gaps, primarily due to the convoluted nature of reproductive processes and the myriad factors that modulate them. A comprehensive understanding of the molecular, cellular, and physiological mechanisms in both male and female infertility remains elusive. Consequently, the multifactorial etiology of infertility, encompassing genetic, epigenetic, and environmental factors, necessitates innovative approaches to bridge these lacunae in our understanding.15–17 By consistently refreshing their information repository, LLMs can rapidly assimilate new findings and information from various domains associated with infertility, including genetics, endocrinology, embryology, and more. Their sophisticated analysis abilities can pinpoint previously missed patterns or links, resulting in innovative theories in infertility studies and therapeutic strategies, potentially tailoring treatments to individual needs. For example, utilizing LLMs to analyze high-dimensional, multifaceted data sets derived from genomic, transcriptomic, proteomic, and metabolomic studies can facilitate the identification of novel biomarkers and pathways implicated in infertility.18 This knowledge can subsequently be employed to guide the development of targeted therapeutic interventions and bolster personalized medicine strategies.18 LLMs can also analyze patient responses to the treatments based on intricate datasets. 
Moreover, LLMs can assist in uncovering novel gene-gene and gene-environment interactions, thereby illuminating the intricate interplay between genetic predispositions and environmental exposures in the context of infertility. By capitalizing on the capacity of LLMs to scrutinize vast troves of scientific literature, researchers and clinicians can derive contextually relevant, data-driven insights that can augment our understanding of the etiopathogenesis of infertility12 (Fig. 1). Furthermore, use of these modalities can level the playing field in many nations where clinicians of varying experience and knowledge levels can be brought up to par in deciding on evidence-based treatment modalities and strategies in an exponentially evolving field. Patient safety can be strengthened by ensuring research-backed treatment.

LLMs can also point to directions for future investigation: integrating LLMs with machine learning and artificial intelligence algorithms to establish predictive models for infertility diagnosis, prognosis, and response to treatment; leveraging LLMs to identify potential targets for pharmacological intervention by analyzing molecular pathways and gene networks implicated in infertility12; employing LLMs to discern epigenetic modifications, such as DNA methylation and histone modifications, that may contribute to the pathogenesis of infertility and serve as targets for therapeutic modulation; exploiting LLMs to facilitate cross-disciplinary collaboration by synthesizing findings from diverse fields, including endocrinology, immunology, and genetics, to elucidate the multifactorial nature of infertility; and utilizing LLMs to enhance patient counseling and education by generating personalized, evidence-based recommendations grounded in the latest scientific advancements.1

Nevertheless, ChatGPT has certain limitations: it often misinterprets human language and may occasionally provide inaccurate information due to constant updates and user adaptation. Researchers and clinicians should use ChatGPT judiciously to avoid plagiarism.19 It is noteworthy that the training data of the model is static, so users must remain aware that its knowledge may be outdated. Moreover, adding citation features could improve its relevance in the research field. Besides ChatGPT, it also remains crucial to assess the potential of alternative large language models, such as BERT, T5, and GPT-Neo. Systems that update themselves regularly and release new versions will gain an advantage over their competitors. A review of 118 publications shows that ChatGPT can act as a healthcare “clinical assistant” for tasks like patient inquiries and research but struggles with issues like inconsistency, bias, and legality. Additionally, it has potential in academic writing but faces challenges like plagiarism and a lack of human-like qualities, which call into question its authority as an author.20 Thus, the effectiveness and reliability of LLMs should be enhanced further, both in healthcare and academic settings, and the developers of ChatGPT may need to focus on addressing the identified issues, such as bias and plagiarism, and on improving human-like interactions.21 The application of LLMs to gynecology and infertility research presents notable constraints, including prolonged deployment times, limited data accessibility, a dearth of expertise, and the need for further extensive investigation to thoroughly evaluate ethical, legal, and data security considerations. Temporal considerations factor significantly into LLM deployment. 
The construction of these models requires an extended period for data aggregation, model training, and optimization, further delaying the implementation of such technologies in clinical practice.12 In gynecology and infertility research, where time can often play a critical role, these protracted deployment durations can represent a substantial limitation. Accessibility and quality of data pose another significant barrier. Infertility and gynecological data, like other medical data, are subject to stringent privacy regulations, which restrict data accessibility. In addition, given the sensitive nature of gynecological and fertility data, acquiring a sufficiently large and diverse dataset to train the LLMs can prove arduous.12,13 The inherent complexity of gynecological conditions, their multifactorial etiologies, and heterogeneous presentations demand comprehensive, high-quality, and diverse datasets that can be challenging to amass. Regarding the human aspect, there is a shortage of expert-driven input when it comes to LLMs. An interdisciplinary skill set is requisite for effective LLM implementation and for tailoring the models to specific research questions. Without a critical mass of such expertise, the models risk misinterpretation or misuse. Further extensive studies are indispensable to holistically understand the applicability of LLMs in this domain. As LLMs grow more complex, there is an increasing need to evaluate not only their performance but also their reliability, transparency, and fairness. Protocols for fertility-related information usage must ensure informed consent, clarifying to patients that their data will be processed with this tool while maintaining confidentiality. Importantly, as the tool is data-driven, it avoids plagiarism concerns when multiple researchers investigate identical topics. 
Finally, and critically, the application of LLMs to gynecology and infertility research carries with it substantial ethical, legal, and data security implications. The use of sensitive medical data demands stringent measures to ensure patient privacy and data security. Ethical considerations, such as ensuring fair and unbiased model outputs, maintaining transparency in model decisions, and obtaining informed consent from patients for data usage, also necessitate careful navigation. Legal frameworks regulating the use of AI in healthcare, often lagging behind technological advances, further complicate the deployment of LLMs in this field.13

In light of the rapidly expanding domain of artificial intelligence, it is incumbent upon investigators in the fields of gynecology and infertility research to maintain a comprehensive awareness of these state-of-the-art developments, judiciously incorporating such technological instruments to augment the quality and intellectual rigor of their empirical endeavors.1 LLMs aid in the identification of prospective therapeutic targets, analysis of clinical trial data, and exploration of innovative treatment modalities through the processing of voluminous scientific texts.1 LLMs enable expeditious access to germane data regarding gynecological pathologies, therapeutic interventions, and fertility alternatives, through the processing of natural language inquiries and assimilation of evidence-based insights from an extensive corpus of medical literature. They can also proffer specific diagnostic conjectures for gynecological- and infertility-associated conditions, supplementing the expertise of healthcare professionals and expediting the diagnostic trajectory.7 A study revealed that Flan-T5, a publicly available LLM, efficiently phenotyped patients with postpartum hemorrhage (PPH), achieving a 0.95 positive predictive value and identifying 47% more patients than standard claims codes, even without manual annotation. Its ability to extract 24 detailed concepts allowed for the creation of intricate phenotypes and subtypes related to PPH, surpassing claims-based approaches and offering a flexible, easily updatable algorithm.22 LLMs can also generate informed therapeutic propositions for specific gynecological and infertility challenges by appraising current medical research and guidelines, thus empowering clinicians and researchers in decision-making and enhancing comprehension of available options.7
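
For readers less familiar with the two metrics quoted for the Flan-T5 study, the short Python sketch below shows how a positive predictive value and a relative gain in identified patients are conventionally computed. It is only an illustration: the function names and all patient counts are hypothetical, chosen so that the arithmetic lands on the reported 0.95 and 47%, and they are not taken from the cited study. It also assumes that "47% more patients" denotes a relative increase over the claims-code count.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Share of flagged patients who truly have the condition: TP / (TP + FP)."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no patients were flagged, so PPV is undefined")
    return true_positives / flagged


def relative_increase(new_count: int, baseline_count: int) -> float:
    """Fractional gain of new_count over baseline_count."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return (new_count - baseline_count) / baseline_count


# Hypothetical counts (assumption, for illustration only; not study data).
ppv = positive_predictive_value(true_positives=95, false_positives=5)
gain = relative_increase(new_count=147, baseline_count=100)

print(f"PPV = {ppv:.2f}")
print(f"Additional patients identified = {gain:.0%}")
```

A high positive predictive value says only that most flagged patients are true cases; it does not, on its own, indicate how many true cases were missed.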

This article serves both as a testament to the indispensable role AI is poised to play in catalyzing groundbreaking advancements within the realm of reproductive health research and as an exhortation to expeditiously capitalize on the plethora of possibilities furnished by this dynamic and ever-expanding technological milieu. Employing these sophisticated computational methods and algorithms can provide unprecedented insights and facilitate innovative solutions to overcome challenges in the gynecological and infertility research landscape.

Declaration of competing interest