1. Introduction
Since the introduction of natural language processing (NLP) into the field of artificial intelligence (AI) and the market launch of the first freely available large language model (LLM), ChatGPT®, there have been signs of an increasing integration of AI into medical fields. Initial publications report a progressive replacement of humans by AI systems in direct customer communication, which is sometimes reflected in job cuts in this area [1,2,3]. It can be assumed that these developments in the private sector will also have an impact on the medical sector in the future [4,5,6]. The use of AI in the healthcare sector, particularly to improve doctor-patient communication, is the subject of intense discussion and research [7,8,9]. However, trends are emerging that support the use of AI in this area [10,11]. Whether its use is sensible at present depends on a number of factors.
It would be desirable if LLMs could automate the process of informing patients and obtaining consent for medical procedures [12]. This could save cost-intensive personnel resources and allow staff to focus on core competencies [13]. Using NLP, information documents could be created individually and automatically; the optimization of existing information sheets is another option. In particular, linking to a patient's existing medical records with risk stratification represents an opportunity for optimizing care using AI [14].
Although modern AI systems offer many advantages, they also have limitations, particularly with regard to the classification of specific risks and possible procedural risks [15]. The classification of specific medical risks often requires extensive knowledge of the context and the individual circumstances of the patient, the procedure, and the infrastructural environment. There is a danger that specific risks will be overlooked or misjudged and not mapped accordingly, so much so that even the World Health Organization (WHO) warns against the undifferentiated use of AI systems [16].
In today’s society, providing legally compliant and individualized information prior to medical interventions is a core task of medical practice. To meet these information requirements with a corresponding procedural mandate, numerous commercial providers offer ready-made consent forms for the corresponding interventions, sometimes even in multilingual and digital form [17]. Accordingly, when AI systems are used to create individualized information via LLMs, any misinformation they contain poses an unmanageable liability risk. This risk can be circumvented with standardized, legally verified forms [18].
Overall, it is currently unclear what potential AI systems have for creating accurate patient information documents for informed consent for medical interventions in anesthesia. Regardless of individual risk stratification, the extent to which currently available LLMs can be used to design a legally compliant patient information sheet for standard procedures has rarely been investigated [18,19]. The aim of this study was to evaluate the suitability of LLMs for preparing standardized patient information for consent to anesthesiology interventions from the end user’s perspective, without including patient-specific risks.
2. Materials and Methods
This study was conducted according to the ethical principles of the Declaration of Helsinki (Ethical Principles for Medical Research Involving Human Subjects) [20]. Since the present study was only an evaluation of publicly available data created by LLMs and no clinical research was conducted with humans, approval by an institutional review board (IRB; the ethics committee of the University Hospital Frankfurt) was not required. This manuscript adheres to the current CONSORT guidelines [21].
2.1. Preparation of the Prompts and Generation of the LLM Output
As part of the preparation, a study team was assembled that comprised four anesthesiologists with varying levels of expertise: one resident, one specialist, and two senior physicians. Two prompts were formulated for each of the six topics (general anesthesia, anesthesia for ambulatory surgery, peripheral regional anesthesia, peridural anesthesia, spinal anesthesia, and central venous line (CVL) placement) (see translated Supplement S1). All prompts were written in German to allow comparison with German-language information sheets. These prompts were fed simultaneously to ChatGPT (versions 3.5 and 4.0) and Google Gemini (on 7 July 2023 between 09:00 and 12:15 CET) to generate patient information. This selection was based on the expectation that ChatGPT-4.0, as the more advanced and up-to-date (but paid) version relative to ChatGPT-3.5, might provide more accurate results. There is also a significant difference between ChatGPT-3.5 and Google Gemini in terms of internet access: Gemini can pull its answers from the internet in real time, whereas ChatGPT-4 relies on a dataset that was current only until the end of 2021. This limitation means that ChatGPT-4 may not provide the most up-to-date information, while Gemini can deliver the latest answers. In addition, differences could arise from the underlying training datasets of these LLMs. All responses were saved in individual Microsoft Word documents (MS Word 365, Microsoft Corporation, Redmond, WA, USA). The results of the LLMs were generated in German, analogous to the prompts, and were not translated during the follow-up.
2.2. Evaluation
The evaluation utilized commercially available questionnaires that are standard in the industry; these tools were chosen for their relevance and widespread use in assessing informed consent procedures. Diomed® questionnaires (Thieme Group, New York, NY, USA) and Perimed® questionnaires (perimed Fachbuch Verlag Dr. med. Straube GmbH, Fürth, Germany) were used, in addition to questionnaires from a university hospital and a district hospital. The use of the German language was essential for comparing the AI-generated information sheets against standard German medical documents.
In the first step, a catalog of risks, complications, and relevant procedure descriptions was created for the official informed consent sheets (n = 2) and each of the topics (n = 6). In addition, five experts in the field of anesthesia were consulted in a concerted effort to review the checklists for completeness and, if necessary, to expand them based on existing guidelines for available informed consent items (procedures). A complete list of the items used in the checklists can be found in the translated Supplement S2.
In this process, the numbers of items to be investigated were identified, as shown in Table 1.
There were 216 items from six areas (including risk, procedure, preparation, and notices) across all six topics. With three LLM providers, a duplicate provision of each questionnaire, six topics, and four investigators, this resulted in a total of 5328 data points.
In the second step, all automatically generated questionnaires were checked for congruence with the pre-generated catalogs. This agreement process was performed by two independent, experienced anesthesiologists; disagreements were resolved by involving a third specialist until a consensus was reached. The evaluation used a three-point scale with the possible answers “applicable”, “not applicable”, and “paraphrase” (a paraphrase of the characteristic in question).
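The rating procedure described above can be expressed as a simple decision rule. The following is a minimal illustrative sketch, not the authors’ actual tooling; the function and variable names are hypothetical, while the three-point scale and the third-rater consensus rule follow the text.

```python
# Illustrative sketch of the consensus rule: two independent raters code each
# checklist item; on disagreement, a third specialist decides.
from typing import Optional

SCALE = {"applicable", "not applicable", "paraphrase"}  # three-point scale

def consensus(rater_a: str, rater_b: str, rater_c: Optional[str] = None) -> str:
    """Return the consensus code for one checklist item."""
    assert rater_a in SCALE and rater_b in SCALE
    if rater_a == rater_b:
        return rater_a  # agreement between the two primary raters
    if rater_c is None:
        # Disagreement: the protocol requires a third specialist.
        raise ValueError("disagreement - third rater required")
    assert rater_c in SCALE
    return rater_c  # the third specialist's decision stands

# Agreement between the two raters needs no tie-breaker.
print(consensus("applicable", "applicable"))                 # applicable
# Disagreement is resolved by the third specialist.
print(consensus("applicable", "paraphrase", "paraphrase"))   # paraphrase
```

In practice the third specialist moderated a joint discussion rather than simply overruling; the sketch only captures the final coded outcome per item.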
Figure 1 shows the flow of the study, including the corresponding numbers. The results were saved in a Microsoft Excel spreadsheet (Microsoft Corporation, Redmond, WA, USA).
2.3. Statistics
The results were collected using Microsoft Excel (Microsoft Corporation). Data analysis was performed using SPSS (version 29, IBM Corp., Armonk, NY, USA). Continuous data are presented as the mean (± standard deviation). Categorical data are presented as frequencies and percentages. A p-value of <0.05 was considered to indicate statistical significance.
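The mean (± standard deviation) summary used for continuous data can be reproduced with standard tools. A minimal sketch follows, using invented illustrative values (not the study data) and assuming the sample standard deviation (n − 1 denominator), as SPSS reports by default:

```python
# Sketch of the descriptive statistic "mean (± SD)" for continuous data.
# The timing values are hypothetical, chosen only to illustrate the format.
from statistics import mean, stdev  # stdev = sample SD (n - 1)

times_s = [10.2, 12.5, 13.1, 11.8, 12.9]  # hypothetical per-document times (s)
summary = f"{mean(times_s):.1f} s (± {stdev(times_s):.1f} s)"
print(summary)  # 12.1 s (± 1.2 s)
```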
3. Results
A corresponding evaluation of the data was performed for the six most common informed consent topics in modern anesthesia. The assessment against the previously completed topic-specific checklists of professional providers revealed that the degree of fulfillment achieved by the LLMs was consistently below 50% of the items to be tested. The representation of the risks and precautions required for an adequate risk assessment, as well as the existing requirements regarding a physician’s consultation or the documentation of its time and success, was generated very variably. The necessity of written consent in the form of a signature was, however, by far the most frequently generated aspect, together with the explicit invitation to pose open questions to the attending anesthesiologist. A detailed illustration of the averaged percentage fulfillment levels of the LLMs is depicted in Figure 2.
Figure 2 shows a comparison of the three LLMs used (green = Gemini, light blue = ChatGPT 3.5, dark blue = ChatGPT 4.0) with regard to the degree of fulfillment of the summarized aspects of legally compliant consent forms. The graphical representation is given as a percentage; a fully satisfactory information form would therefore trace the circular structure at the outer edge.
Certain phrasings of the request to draft an appropriate consent sheet did not lead to any results; only after rewording did the LLMs produce a corresponding draft document. Furthermore, there were significant differences in the time required to create the anesthesia consent sheets (Figure 3). Although the paid version of ChatGPT 4.0 promises faster creation, we were unable to detect this advantage in our investigation. On the contrary, processing was significantly slower with ChatGPT 4.0 than with the free version of ChatGPT 3.5 (p < 0.001) and Google Gemini (p < 0.001). Across all three systems, Google Gemini was the fastest at 12.1 s (± 2.1 s), followed by ChatGPT 3.5 (27.4 s (± 10.6 s)) and ChatGPT 4.0 (85.9 s (± 9.9 s)). Creating all of the questionnaires took Google Gemini 09:42 min, ChatGPT 3.5 21:53 min, and ChatGPT 4.0 68:43 min.
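As a back-of-the-envelope cross-check (not part of the study’s analysis), dividing each model’s reported total creation time by its reported mean per-document time implies roughly the same number of generated documents for all three systems:

```python
# Cross-check of the reported timings: implied document count per model.
# Means and totals are taken from the results text; the implied count is
# a derived figure, not a number stated in the paper.
timings = {
    # model: (mean seconds per document, total time in seconds)
    "Google Gemini": (12.1, 9 * 60 + 42),   # total 09:42 min
    "ChatGPT 3.5":   (27.4, 21 * 60 + 53),  # total 21:53 min
    "ChatGPT 4.0":   (85.9, 68 * 60 + 43),  # total 68:43 min
}
for model, (mean_s, total_s) in timings.items():
    print(model, round(total_s / mean_s))   # implied documents generated
```

All three ratios land on the same implied count, which suggests the per-model means and totals are internally consistent.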
4. Discussion
The results of our study show that LLMs do not currently appear capable of creating patient information sheets for standard anesthesiology procedures and, therefore, do not yet represent an alternative to commercially available products.
The use of artificial intelligence in medicine has attracted great interest and considerable concern. The purpose of this study was to test whether LLMs that are popular among medical laypersons and physicians are capable of generating legally compliant patient consent forms. It was explicitly not the intention of this study to identify the most suitable AI tool for this task or to test machine learning approaches to automated document analysis. For barely six months, the major medical journals have been raising numerous critical questions in special issues about patient safety and the difficulties that may arise for physicians when using AI [22,23]. The first discussions addressed issues related to regulating language models such as ChatGPT [24]. The potential solutions offered by artificial intelligence appear highly promising, particularly for the analysis of constantly growing datasets in various medical fields [25]. However, because LLMs are accessible to the public and comprehensible to laypeople, they are indisputably beyond the scope of national regulations or the wishes of professional societies with regard to the decision to apply them in medical settings. Accordingly, at present, there are mainly descriptive attempts to make this new technology understandable for medical professionals and to critically classify the results of its application.
With regard to the further development of patient-specific models of education, numerous obstacles have been discussed by Hunter and Holmes, who identified the poor verifiability of automated evidence in terms of the selection and quality classification of the previous literature, as well as the related risks [26]. However, our comparison of the two ChatGPT versions (3.5 without online data access and 4.0) indicates that little of the existing literature has been incorporated. It therefore remains unclear how an AI determines the risk allocation for a specific anesthesia procedure or intervention; this determination may thus be subject to severe errors. The considerable difficulty LLMs have in generating legally compliant patient information could be due to the small number of questionnaires available online, most of which are subject to strict copyright. Similarly, the risks are not based on any clear logic but, for example, on past case law. In addition to information on standardized anesthesia procedures, an algorithm for individualized consent forms would require a risk assessment, which is difficult to derive from the complex specialist literature and requires a correspondingly detailed evaluation by a specialist. A corresponding proof of concept would have to be checked manually against a suitable dataset, and forensic hazards would still be possible in the event of a secondary misjudgment resulting in patient harm [27,28].
There are few relevant differences between the three large language models tested. This appears all the more astonishing in a direct comparison of ChatGPT 3.5 with ChatGPT 4.0: the commercial version 4.0 was supposed to include online data in its generative answers, yet it delivered no quantifiable advantages. Nonetheless, numerous informed consent sheets from various clinics can be found online under common keywords, and in some cases even lists of relevant content for informed consent sheets provided by specialist societies. In this respect, our results may be interpreted as indicating that the training datasets are primarily responsible for the answers to our various queries.
The performance of the largest freely available online LLMs in creating patient information sheets for the most common anesthesiology procedures, which we examined as an example, illustrates their inadequacy for medical specialist settings. The models fail through a consistently inadequate identification of the relevant risks and inadequate patient education. It was unlikely that the models would prove fully suitable immediately after their creation; however, the extent of the open issues suggests an alarming current situation in the context of the general public’s access to these models and their expected private application to medical questions by laypersons.
The initial phrasing of the task requesting the creation of patient informed consent sheets appeared to have surprisingly limited influence on the results. The LLMs revealed remarkably consistent task completion in this respect, which can be linked to the operational principles of the models. However, in individual cases, the language models were unable to generate the form from a given phrasing for unclear reasons. Furthermore, the significant differences in the time needed to complete the task were surprising. Although the ChatGPT application simulates a display similar to manual typing when a task is presented, the fact that the commercial version took more than 45 s longer on average than the other models is difficult to explain. The supposed parallel online research to answer the question did not provide any significant added value in terms of content. However, it remains unclear how internet access affects the capabilities of LLMs. Further studies should conduct repeated assessments of performance.