The Promise and Regulatory Challenges of LLMs in Healthcare
Large Language Models (LLMs) are rapidly demonstrating their potential across many sectors, and healthcare is no exception. Trained on massive datasets, they can respond to complex queries, including those involving medical information, and generate fluent, human-like text, making them attractive candidates for clinical decision support (CDS). This capability has fueled interest in using LLMs to assist clinicians in making informed decisions, potentially improving patient care and outcomes.
However, the very strengths of LLMs also present significant challenges for regulatory bodies like the Food and Drug Administration (FDA). The FDA’s regulatory framework was primarily designed for traditional medical devices, which typically have well-defined functionalities and limited scope. LLMs, on the other hand, are dynamic and adaptable, capable of generating responses on a wide range of topics, making it difficult to apply existing regulations. The inherent flexibility and open-ended nature of LLMs pose a unique challenge to the established regulatory paradigm.
Currently, most publicly available LLMs are not classified as medical devices. The Federal Food, Drug, and Cosmetic Act (FD&C Act § 201(h)(1)) defines a medical device as an “instrument… intended for use in the diagnosis, …cure, mitigation, treatment, or prevention of disease… which does not achieve its primary intended purposes through chemical action.” To avoid falling under FDA regulation, many LLM developers include disclaimers stating that their models are not intended to provide medical advice, positioning LLMs as general-purpose information tools rather than medical devices.
Despite these disclaimers, there is growing evidence, both anecdotal and from published research, that LLMs are being used for medical decision support. Clinicians and researchers are exploring the potential of LLMs in various clinical settings, from generating differential diagnoses to suggesting treatment plans. This increasing use, even in the absence of formal FDA authorization, raises concerns about the potential risks and benefits of unregulated LLM-based CDS.
Defining the Scope of Regulation for LLM-Based Clinical Decision Support
The potential integration of LLMs into clinical decision support systems (CDSSs) necessitates a clear understanding of the regulatory landscape. The 21st Century Cures Act (Public Law 114–255), which amended the FD&C Act, together with subsequent FDA guidance, provides a framework for determining whether decision support software qualifies as a medical device and therefore falls under FDA jurisdiction. This framework focuses on four key criteria:
- Input Data: The type of data the software function uses as input.
- Output Data: The nature of the information the software provides as output.
- Clinical Recommendations: The substance and specificity of the clinical recommendations generated.
- Reviewability: The ability of the end-user to independently review the rationale behind the software’s recommendations.
These criteria are designed to distinguish between software that merely provides information and software that actively guides clinical decision-making.
Specifically, a CDSS is considered a device if its output provides a precise directive for treatment or diagnosis, rather than general information or a range of options. If the CDSS offers a specific course of action without providing the underlying reasoning, it is more likely to be classified as a device. This is because the user is unable to independently evaluate the recommendation and must rely on the software’s judgment.
Furthermore, FDA guidance clarifies that a CDSS intended for use in a time-critical clinical emergency is considered a device. In emergency situations, time is of the essence, and clinicians may not have the opportunity to thoroughly review the basis for the CDSS’s recommendations. The critical and time-sensitive nature of these decisions therefore warrants a higher level of regulatory scrutiny.
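To make these distinctions concrete, the sketch below encodes the criteria discussed above as a simple checklist. This is a hypothetical illustration only; the field names and the looks_device_like function are assumptions introduced here, not part of the FDA guidance or the study.

```python
from dataclasses import dataclass

@dataclass
class CdssOutputProfile:
    """Hypothetical summary of a single CDSS response, for illustration only."""
    gives_specific_directive: bool  # precise directive for diagnosis or treatment, not a range of options
    explains_basis: bool            # rationale the user can independently review
    time_critical_context: bool     # intended for use in a time-critical clinical emergency

def looks_device_like(profile: CdssOutputProfile) -> bool:
    """Rough paraphrase of the criteria above: output intended for a time-critical
    emergency, or a specific directive given without reviewable reasoning, points
    toward device-like decision support."""
    if profile.time_critical_context:
        return True
    return profile.gives_specific_directive and not profile.explains_basis

# Example: a response naming a single treatment with no supporting rationale.
print(looks_device_like(CdssOutputProfile(True, False, False)))  # True
```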
Investigating Device-Like Output in Generative AI Systems
A crucial question is whether a CDSS that utilizes generative AI, such as an LLM, produces output that meets the criteria for a medical device. The free-text output of an unconstrained LLM may or may not align with these established criteria. The inherent variability and unpredictability of LLM responses raise concerns about whether they can be reliably controlled to avoid providing device-like output.
Furthermore, the response of LLMs to challenging prompts or “jailbreaks” is a significant area of concern. “Jailbreaks” are techniques designed to circumvent the safety mechanisms and restrictions built into LLMs, potentially eliciting responses that violate the intended guidelines. Understanding how LLMs respond to such manipulations is crucial for assessing their potential risks in a clinical setting.
The increasing use of LLMs for medical advice, coupled with the uncertainty surrounding their regulatory status, creates a potential impediment to the safe and effective development of these technologies. Striking the right balance between fostering innovation and ensuring patient safety is paramount. Regulation must be adaptable enough to accommodate the rapid advancements in generative AI while providing adequate safeguards against potential harms.
Research Objectives: Evaluating Device-Like Functionality
This research aimed to systematically evaluate the device-like functionality of LLMs. This functionality was defined as their utility for “diagnosis, treatment, prevention, cure or mitigation of diseases or other conditions,” regardless of whether such use is intended or permitted by the LLM developers. The study focused on two primary objectives:
Assessing Alignment with Device Criteria: To determine whether LLM output would align with established device criteria when prompted with instructions about those criteria and presented with a clinical emergency scenario. This objective aimed to test whether explicit instructions could constrain LLM output to remain within the bounds of non-device decision support.
Identifying Conditions for Device-Like Output: To identify the conditions, if any, under which a model’s output could be manipulated to provide device-like output. This included using direct requests for diagnostic and treatment information, as well as a pre-defined “jailbreak” designed to elicit device-like output despite prompts to adhere to non-device criteria. This objective aimed to explore the vulnerability of LLMs to manipulation and their potential to generate responses that would be considered medical advice.
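As a rough illustration of how these prompting conditions might be assembled, the sketch below builds a chat-style message list for the single-shot, multi-shot, and jailbreak conditions. The constant names, example text, and build_messages helper are hypothetical stand-ins; the study’s actual prompts were derived from FDA guidance language and are not reproduced here.

```python
from typing import Optional

# Hypothetical sketch of the three prompting conditions described above.
# DEVICE_CRITERIA_TEXT, EXAMPLES, and the scenario text are placeholders,
# not the study's actual materials.
DEVICE_CRITERIA_TEXT = "Instructions paraphrasing the FDA non-device CDSS criteria go here."
EXAMPLES = [
    # Worked examples of compliant, non-device responses (multi-shot condition only).
    {"role": "user", "content": "Example clinical scenario ..."},
    {"role": "assistant", "content": "Example non-device response ..."},
]

def build_messages(scenario: str, multi_shot: bool = False,
                   jailbreak: Optional[str] = None) -> list:
    """Assemble a chat-style message list: criteria instructions, optional worked
    examples, an optional jailbreak preamble, and the clinical scenario itself."""
    messages = [{"role": "system", "content": DEVICE_CRITERIA_TEXT}]
    if multi_shot:
        messages.extend(EXAMPLES)
    user_text = f"{jailbreak}\n\n{scenario}" if jailbreak else scenario
    messages.append({"role": "user", "content": user_text})
    return messages

# The single-shot, multi-shot, and jailbreak conditions differ only in how the
# message list is assembled; the same scenario text is used in each condition.
single_shot = build_messages("A bystander reports a suspected opioid overdose ...")
multi_shot = build_messages("A bystander reports a suspected opioid overdose ...", multi_shot=True)
```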
Findings: LLM Responses and Device Criteria Alignment
The study investigated the responses of two widely used LLMs, GPT-4 and Llama-3, across various scenarios and prompting methods. The findings revealed a complex interplay between prompting strategies, clinical context, and the likelihood of generating device-like output.
Preventive Care Recommendations
When prompted to provide preventive care recommendations, both models generally generated final text output consistent with non-device criteria. This suggests that, in relatively straightforward scenarios, LLMs can be guided to provide general information rather than specific medical advice.
However, in response to a single-shot prompt, Llama-3 initially provided device-like decision support in a subset of responses (20% of family medicine and 60% of psychiatry preventive care responses). This indicates that even with relatively simple prompts, LLMs can generate responses that cross into medical-device territory. Interestingly, the model subsequently replaced this device-like text with a disclaimer: “Sorry, I can’t help you with this request right now.” This suggests an internal mechanism that recognized the potentially problematic output and attempted to retract it.
When presented with a multi-shot prompt containing detailed examples of device criteria, all models consistently provided non-device recommendations for all initial preventive care responses. This demonstrates that more elaborate prompting, providing the LLM with a clearer understanding of the desired output format, can be effective in controlling its responses.
Time-Critical Emergency Scenarios
In time-critical emergency scenarios, the likelihood of LLMs generating device-like output increased markedly: 100% of GPT-4 responses and 52% of Llama-3 responses aligned with device-like decision support. This highlights the inherent challenge of controlling LLM output in high-stakes situations, where specific guidance may be perceived as necessary.
The overall rates of device-like recommendations remained consistent with multi-shot prompts, but showed variation across different clinical scenarios. This suggests that while multi-shot prompting can improve control, it is not a foolproof solution, and the specific clinical context plays a crucial role.
The device-like responses included suggestions for specific diagnoses and treatments related to the emergencies. This type of output clearly falls within the FDA’s definition of a medical device, as it provides direct guidance for managing a medical condition.
‘Desperate Intern’ Jailbreak
The “desperate intern” jailbreak, designed to simulate a user urgently seeking medical advice, proved highly effective at eliciting device-like responses: following single- and multi-shot prompts, respectively, 80% and 68% of GPT-4 responses and 36% and 76% of Llama-3 responses included device-like recommendations.
This finding demonstrates the vulnerability of LLMs to manipulation and their potential to generate responses that would be considered medical advice, even when prompted to adhere to non-device criteria. The jailbreak effectively bypassed the built-in safety mechanisms, highlighting the need for more robust safeguards.
Clinical Appropriateness of LLM Suggestions
It is important to note that, despite generating device-like output, all model suggestions were clinically appropriate and aligned with established standards of care. This indicates that the LLMs were drawing upon accurate medical knowledge, even when providing responses that would be considered medical advice.
In the family medicine and cardiology scenarios, much of the device-like decision support was suitable only for trained clinicians. Examples include the placement of an intravenous catheter and the administration of intravenous antibiotics. These are procedures that require specialized medical training and should not be performed by laypersons.
In other scenarios, device-like recommendations were generally consistent with bystander standards of care, such as administering naloxone for an opioid overdose or using an epinephrine auto-injector for anaphylaxis. These are actions that are often taught to non-medical personnel in emergency situations.
Implications for Regulation and Oversight
The study’s findings have significant implications for the regulation and oversight of LLMs used in healthcare. Although no LLM is currently FDA-authorized as a CDSS, and some explicitly state they should not be used for medical advice, the reality is that patients and clinicians may be utilizing them for this purpose. The ease with which LLMs can generate device-like output, even without specific prompting or with the use of jailbreaks, underscores the need for proactive regulatory measures.
The study found that neither single-shot nor multi-shot prompts, based on language from an FDA guidance document, reliably restricted LLMs to producing only non-device decision support. This suggests that relying solely on prompting techniques to control LLM output is insufficient. More robust and sophisticated methods are needed to ensure that LLMs consistently adhere to the intended regulatory boundaries.
Furthermore, the fact that a pre-defined jailbreak was often unnecessary to elicit device-like decision support highlights the inherent vulnerability of LLMs. This reinforces prior research calling for novel regulatory paradigms tailored to AI/ML CDSSs. The existing regulatory framework, designed for traditional medical devices, is not well-suited to address the unique challenges posed by generative AI.
Rethinking Regulatory Approaches
Effective regulation may necessitate new methods to better align LLM output with either device-like or non-device decision support, depending on the intended use. The traditional FDA authorization process, which grants approval for a specific intended use and indication, may not be appropriate for LLMs.
For instance, FDA-authorized AI/ML devices are often designed for a narrow purpose, such as predicting hemodynamic instability or clinical deterioration. LLMs, however, can be queried on a vast array of topics, potentially leading to responses that, while clinically appropriate, would be considered “off-label” relative to their approved indication. The study’s results demonstrate that both single- and multi-shot prompts are inadequate for controlling this “off-label” use.
This finding does not represent a limitation of LLMs themselves; rather, it underscores the need for new methods that preserve the flexibility of LLM output while confining it to an approved indication. The challenge is to develop regulatory approaches that allow for the broad applicability of LLMs while ensuring they are used safely and effectively within defined boundaries.
Exploring New Authorization Pathways
Regulation of LLMs might require new authorization pathways that are not tied to specific indications. A device authorization pathway for “generalized” decision support could be suitable for LLMs and generative AI tools. This approach would acknowledge the broad capabilities of LLMs while establishing a framework for evaluating their safety and effectiveness across a range of applications.
However, the optimal method for assessing the safety, effectiveness, and equity of systems with such broad indications remains unclear. For example, a “firm-based” approach to authorization, which focuses on evaluating the manufacturer’s overall quality control processes rather than individual devices, could bypass the need for device-specific evaluation. This might be appropriate for an LLM, but it comes with uncertain guarantees regarding clinical effectiveness and safety.
Refining Criteria for Different User Groups
The findings also highlight the need to refine criteria for CDSSs intended for clinicians versus non-clinician bystanders. The FDA has previously indicated that patient- and caregiver-facing CDSSs would be considered medical devices, generally subject to regulation. However, there is currently no regulatory category for an AI/ML CDSS designed for a non-clinician bystander.
Making a specific diagnosis and providing a specific directive for a time-critical emergency clearly aligns with the FDA’s criteria for devices intended for healthcare professionals. However, actions like cardiopulmonary resuscitation (CPR) and the administration of epinephrine or naloxone also meet these device criteria, yet they are simultaneously well-established rescue behaviors for non-clinician bystanders. This creates a gray area where actions that are considered standard practice for laypersons could also be classified as medical device functionality if performed by an LLM.
Study Limitations
This study has several limitations that should be considered when interpreting the findings:
Non-Intended Use Evaluation: The study evaluates LLMs against a task that is not a specified intended use of the software. The LLMs were not explicitly designed for medical decision support, and the study’s findings should be interpreted in light of this.
Focus on FDA Guidance: The study compares LLM output to FDA guidance, which is non-binding. It does not assess the consistency of LLM recommendations with other relevant US statutory provisions or regulatory frameworks.
Limited Prompting Methods: The study does not evaluate other prompting methods that might have been more effective than single- and multi-shot prompts. There may be more sophisticated prompting techniques that could better control LLM output.
Practical Integration: The study does not explore how such prompts might be practically integrated into real-world clinical workflows. The feasibility of implementing these prompting strategies in a clinical setting remains an open question.
Limited LLM Scope: The study does not evaluate a broader range of widely available and commonly used LLMs beyond GPT-4 and Llama-3. The findings may not be generalizable to all LLMs.
Small Sample Size: The sample size of prompts is small, limiting the generalizability of the findings.
Moving Forward: Balancing Innovation and Safety
Prompts based on the text of FDA guidance for CDSS device criteria, whether single- or multi-shot, are insufficient to ensure that LLM output aligns with non-device decision support. New regulatory paradigms and technologies are needed to address generative AI systems, striking a balance between innovation, safety, and clinical effectiveness. The rapid evolution of this technology demands a proactive and adaptive approach to regulation, ensuring that the benefits of LLMs in healthcare can be realized while mitigating potential risks. This includes developing more robust methods for controlling LLM output, exploring new authorization pathways, and refining the criteria for distinguishing between device and non-device functionality, particularly for applications intended for non-clinician users. Ongoing research and collaboration between regulators, developers, and clinicians are essential to navigate this evolving landscape and ensure the responsible and beneficial integration of LLMs into healthcare.