3.3. Discussion
In this study, we analyzed the accuracy of ChatGPT in the MRI domain. We found that ChatGPT was generally accurate in answering simple, well-defined questions but was less able to solve off-the-beaten-path multiple-choice questions that required specialized MRI knowledge and judgment.
ChatGPT, developed by OpenAI, is a large-scale language generation model that has been trained on a large corpus of text data. It uses a deep neural network with self-attention mechanisms and a transformer-based architecture to understand natural language and generate contextually appropriate and linguistically coherent responses. The success of ChatGPT can be attributed to its large number of parameters and its ability to learn from vast amounts of data. It is well-suited for complex language tasks due to its ability to handle long-range dependencies [10].
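To make the self-attention mechanism mentioned above concrete, the following is a minimal sketch of scaled dot-product self-attention, the core operation of transformer models. It is illustrative only: real models such as ChatGPT add learned query/key/value projection matrices, multiple attention heads, and many stacked layers, all of which are omitted here for brevity.

```python
import numpy as np

def self_attention(X):
    """Simplified single-head self-attention.

    X: (seq_len, d_model) matrix of token embeddings.
    Queries, keys, and values are all X itself; the learned
    projection matrices of a real transformer are omitted.
    """
    d_k = X.shape[-1]
    # Pairwise token similarities, scaled to stabilize the softmax
    scores = X @ X.T / np.sqrt(d_k)
    # Row-wise softmax -> attention weights that sum to 1 per token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output embedding is a weighted mix of ALL input tokens,
    # which is how long-range dependencies are captured
    return weights @ X

X = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, 8 dims
out = self_attention(X)  # same shape as X: (4, 8)
```

Because every output token attends to every input token in a single step, distant words influence each other directly, unlike in recurrent architectures where information must propagate step by step.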
ChatGPT has shown significant potential in the healthcare domain as a tool for generating natural language responses to medical questions and assisting healthcare professionals in decision-making. It has been pre-trained on a large corpus of text data, allowing it to learn a wide range of linguistic patterns and structures. Additionally, the model has been fine-tuned on specialized medical text data, improving its accuracy in answering medical questions. Its ability to generate accurate and contextually appropriate responses has the potential to improve patient outcomes and increase efficiency in healthcare settings [11].
The integration of Large Language Models (LLMs) like ChatGPT into clinical practice, particularly in radiology, brings both advantages and challenges. On the positive side, LLMs offer advanced natural language processing capabilities, enabling tasks such as text summarization, translation, and question-answering, which can be particularly useful in interpreting and communicating complex radiological reports in a patient-friendly language. Additionally, LLMs can be instrumental in generating code for medical imaging research, and, when combined with convolutional neural networks (CNNs), they can assist in image recognition and relevant text generation, enhancing research in medical image analysis.
However, several risks and difficulties are inherent in the use of LLMs in clinical settings. One major concern is privacy, as sensitive patient information could be compromised when uploaded to LLMs, raising serious ethical concerns [12]. Moreover, LLMs may generate fabricated or potentially harmful information, such as incorrect translations or diagnostic conclusions, which necessitates thorough validation of LLM-generated content, particularly in patient care.
Another challenge is the interpretability and transparency of LLMs [12]. It is crucial for medical professionals to understand why a model produces a certain output, especially when these outputs have a direct impact on patient care. The accuracy of LLM outputs is heavily dependent on the quality and diversity of the training data. Generic models not specifically trained on medical data might provide inaccurate responses to medical tasks [13]. Even medically oriented LLMs could have limitations in their representation of certain information [13]. Additionally, the rapid evolution of medical knowledge poses a challenge, as LLMs might not have access to the latest data and guidelines.
Large language models like ChatGPT are less accurate in specialized domains such as the basics of MR, which is what we studied, and clinical medicine for several reasons, and we need to be aware of this fact. Firstly, these models lack a deeper understanding of the meaning and context of the language they generate, as they rely on statistical patterns and word associations [14]. This lack of understanding can lead to the production of factually incorrect statements and the overlooking of crucial medical findings [15]. Additionally, these models may recommend unnecessary investigations and overtreatment, which can be potentially harmful [16]. Furthermore, the responses generated by these models can vary inconsistently with repeat queries, indicating a lack of stability in their performance [17]. Lastly, these models’ fallibility and the difficulty of distinguishing their responses from human answers pose potential threats to teaching and learning in healthcare education [18]. Therefore, while large language models have potential in healthcare, their limitations and risks must be carefully considered before widespread use in specialized domains such as clinical medicine. Understanding and addressing these challenges is critical for healthcare professionals. The following schematic illustrates the key takeaways (Figure 4).
In our study, ChatGPT performed poorly in solving multiple-choice questions (MCQs) related to the medical field (Figure 5). MCQs often require a deeper understanding of the context, which is challenging for machine learning algorithms. Additionally, the complex answer options with subtle differences in medical MCQs make it difficult for the algorithm to differentiate accurately. The presence of distracting or irrelevant information in MCQs can lead to misinterpretation by the algorithm, resulting in inaccurate answers [19,20,21].
The findings of our study reveal a significant disparity in ChatGPT’s performance between answering straightforward MRI questions and tackling specialized multiple-choice questions. This variance underscores the complexities inherent in applying AI in medical education and clinical decision-making. ChatGPT’s proficiency in handling basic questions suggests it can be a valuable tool for educating medical students and junior radiologists. However, its limitations in solving complex, case-based scenarios indicate a need for cautious integration of AI in clinical practice, especially in diagnostic processes.
AI models, such as ChatGPT, have limitations in context-heavy and specialized tasks like radiology. These limitations are particularly pronounced in fields that require nuanced interpretation and detailed anatomical knowledge. To bridge the gap between general knowledge dissemination and specialized case handling, it is crucial to enhance AI models’ capabilities in context awareness and depth of medical knowledge. This study emphasizes the need for AI models that not only process text but also understand the underlying medical concepts [19,22].
Furthermore, the need for high-quality labeled training data is a significant limitation in using machine learning algorithms in the medical domain. Obtaining labeled data in the medical field is costly, leading to biased, incomplete, or insufficiently diverse datasets for training medical models. This results in suboptimal performance of the models [23]. Additionally, models trained on labeled data may struggle when faced with new data that differ significantly from the training data, known as a domain shift. The medical domain is particularly prone to domain shifts due to the wide variation in patients’ clinical presentations and conditions [24].
ChatGPT has demonstrated high accuracy in generating responses to text input thanks to its ability to recognize and utilize abstract relationships between words within a neural network [25]. It is effective for answering general or straightforward questions and generating coherent and contextually appropriate responses [5]. As a large language model (LLM), ChatGPT can improve its performance when adapted, through further training, to new data and contexts [26].
ChatGPT has impressive capabilities but should not be overly relied upon in specialized healthcare, particularly in radiology. It lacks the specific knowledge and experience required in radiology [25]. ChatGPT’s accuracy depends on the data it has been trained on, which may be insufficient for specialized medical applications [27]. Furthermore, ChatGPT cannot understand the complexity of medical images or interpret their significance, which is crucial in radiological interpretation [28]. As a result, ChatGPT’s answers may be of limited usefulness and accuracy for radiological diagnosis and treatment planning [29].
To enhance AI performance in complex medical fields, incorporating diverse and extensive datasets, including complex case studies and advanced imaging interpretations, could be beneficial [2]. Integrating AI models with clinical decision support systems may offer a more comprehensive understanding and improved accuracy in complex scenarios [30]. Collaborations between AI developers and medical professionals are essential in creating datasets that reflect the intricacies of real-world medical cases [31].
MRI basics and physics are essential for the practice of radiology and can be used for clinical purposes. Understanding the principles of MRI physics, signal generation, and image contrast mechanisms is crucial for non-radiology clinicians to interpret MR images and facilitate interdisciplinary understanding [32]. Some resources, such as textbooks, provide a conceptual approach to understanding the basics of MRI from both clinical and technological perspectives. However, in anticipation of the many uses of large language models, ChatGPT can be used partly for educational purposes on MRI basics by providing concise and easy-to-understand explanations without requiring extensive technical background. In our study, ChatGPT showed high accuracy on simple, short-answer questions about MRI physics (observer 1, 86% correct; observer 2, 88% correct). However, given the generally inaccurate results for multiple-choice questions, it may be a stretch to use ChatGPT for clinical medicine education, including MRI physics, without validation.
In fact, from its inception, ChatGPT has been used by the public and medical professionals, including professors and residents, for problem-solving and obtaining information. However, its use in medical education and practice should be approached cautiously due to its nature as a language generation model and its accuracy in specialized medical areas. While ChatGPT offers potential applications in medical education, such as generating exercises, quizzes, and scenarios for students to practice and evaluate their understanding of medical concepts, there are challenges and limitations to consider. These include the need to carefully assess the accuracy and reliability of ChatGPT responses, address its limitations in understanding medical terminology and context, and consider ethical concerns regarding patient privacy. Despite the potential benefits, further research is needed to effectively integrate ChatGPT into medical education and explore its impact on learning outcomes, critical thinking skills, and student and faculty satisfaction [33].
Moreover, the integration of AI in healthcare raises ethical and practical considerations. Patient safety and data privacy are paramount, and stringent quality checks and human oversight are necessary to ensure the reliability of AI responses [34]. Continuous updates and learning for AI systems are also crucial due to the evolving nature of medical knowledge and practice [35]. Establishing guidelines and protocols for the safe and effective use of AI in healthcare is essential as it becomes more prevalent [36].
Our study has several limitations. First, the study was conducted by a small number of researchers at a single institution, which limited the amount of question input. Second, although we obtained offline multiple-choice questions from a public open website (Courtesy of Allen D. Elster, MRIquestions.com), we could only obtain a few questions due to the specificity and scope of MRI. This problem could be addressed in the future if multiple researchers submit many questions for input or if a more extensive set of publicly available questions is obtained.
This study opens several avenues for future research. Exploring the integration of visual data, such as MRI scans, with textual data could enhance AI’s diagnostic capabilities. Longitudinal studies involving a more extensive set of questions and diverse clinical scenarios could provide deeper insights into AI’s applicability in radiology. Additionally, comparative studies involving different AI models could determine the most effective approaches in medical AI.