The Automation Paradox in Medical AI: Why Your Interface May Be Creating Errors


Dennis Lenard

Mar 2026

Experienced radiologists' accuracy dropped from 82% to 45.5% when shown AI scores first. Bainbridge predicted this in 1983. Clinical AI is now living it, and interface design is where it gets solved.

This article draws on Creative Navy's project work in medtech UX, spanning practice management software, surgical equipment, ventilators, blood pumps, infusion systems, and patient monitoring devices, including Class II and Class III regulated products. Our work in this sector covers clinical environments including the ICU and operating theatre, designing for surgeons, nurses, and biomedical engineers. Dennis Lenard, who leads this work at Creative Navy, is the author of User Interface Design For Medical Devices And Software, a practitioner reference for the field. Our approach integrates IEC 62366 usability engineering requirements and FDA Human Factors guidance as structural inputs to the design process, not post-hoc compliance activities.

The case against clinical AI is usually made through its failures. The deeper problem is what happens to human judgment after AI gets it right ten thousand times in a row.

Key Statistics

  • 228 passengers and crew who died in the Air France 447 crash (2009), after the autopilot disengaged and the aircraft stalled; pilots who had not manually flown in those conditions for months took 42 seconds to correctly identify the situation
  • 82% to 45.5% the drop in experienced radiologists' independent diagnostic accuracy when shown AI-generated scores before forming their own assessment (2023 study)
  • 40+ years the time since Lisanne Bainbridge formally predicted this problem in "Ironies of Automation" (1983), writing about industrial process control
  • 0 the number of IEC 62366 hazard analysis frameworks that currently include a mandatory category for automation-induced deskilling as a foreseeable use error
  • 85–99% of clinical alarms are non-actionable, training clinicians to clear alerts automatically, the same cognitive pattern that makes AI confirmation steps vulnerable to reflexive acceptance
  • 3x the approximate increase in clinical AI decision support tools deployed in hospital settings between 2020 and 2024

The Problem With Getting It Right Most of the Time

The case against AI in clinical settings is usually made through its failures. The misdiagnosis. The hallucinated drug interaction. The recommendation generated for the wrong patient. These are real problems. They are also the wrong place to look for the deepest risk.

The deeper problem is what happens to human judgment after the AI gets it right ten thousand times in a row.

On the night of 1 June 2009, Air France Flight 447 was crossing the Atlantic at 35,000 feet. The autopilot disengaged after ice crystals blocked the pitot tubes, giving the flight computers inconsistent airspeed readings. The aircraft entered a stall. The crew had four minutes and twenty-four seconds to identify the situation and recover.

They took forty-two seconds before anyone correctly diagnosed what was happening. By then recovery was no longer possible. All 228 passengers and crew died.

The accident investigation found that the pilots had not manually flown at high altitude without autopilot assistance for months. The skills required to manage that specific failure (the exact situation the autopilot had been protecting them from) had atrophied under the conditions that were supposed to make them safer.

The automation hadn't failed. The humans had. Because the automation had succeeded for too long.

Lisanne Bainbridge named this problem in 1983. Aviation lived it in 2009. Clinical AI is now in the early stages of the same arc. Reliable automation removes the conditions that maintain the human competence required when automation fails. The better the system, the wider the gap it creates. The higher the stakes when that gap is exposed.

What Are the Ironies of Automation?

In 1983, Lisanne Bainbridge published "Ironies of Automation" in the journal Automatica. She was writing about industrial process control: oil refineries, power stations, chemical plants. Not hospitals. Not AI. The paper has aged without needing revision.

Bainbridge identified three structural problems that emerge in any highly automated system where a human remains nominally responsible for outcomes.

The first irony: automation eliminates the practice it depends on. The more reliable the automation, the less opportunity the human operator has to exercise the manual skills that automation was supposed to make unnecessary. Those skills atrophy. When automation fails, which is when those skills are most needed, the human is least prepared to use them. The designers who built the automation to remove the need for manual skill also removed the conditions that would have maintained it.

The second irony: humans inherit the hardest tasks. Automation typically removes the routine, repetitive decisions and leaves the human responsible for the exceptional cases. But skill at the exceptional is built through practice with the routine. Surgeons develop judgment about when to deviate from protocol partly through the accumulated experience of following it thousands of times. Remove the routine and you remove the substrate from which that judgment grows. The human has been assigned the most cognitively demanding tasks and denied the experience that builds competence for them.

The third irony: monitoring is harder than doing. A human whose only role is to watch for failures in a system that rarely fails will be in a state of low vigilance when failure occurs. This became known as the "out of the loop" problem. Monitoring a working system is more cognitively demanding, and less intrinsically engaging, than operating one. Operators become passive. When the system finally fails, response time has degraded precisely because the system was reliable enough to require no active engagement in the interim.

All three ironies are now present in clinical AI environments. Addressing them at the interface design level has not yet become standard practice.

What the Evidence Shows Is Already Happening

The Air France 447 case is useful because it is far enough from healthcare to read as cautionary rather than accusatory. The clinical evidence is closer and more uncomfortable.

A 2023 study examining radiologist performance with AI decision support found that experienced clinicians' independent diagnostic accuracy fell from 82% to 45.5% when they were shown AI-generated scores before forming their own assessments. The AI scores were sometimes correct and sometimes deliberately incorrect. Clinicians' judgments shifted systematically toward the AI output regardless.

The mechanism matters. The interface did not ask clinicians to defer. It simply showed them an answer before they had reached one. That alone was enough to shift their reasoning from independent analysis to evaluation of the AI's conclusion. The cognitive task changed from "what do I see?" to "do I agree with what the system sees?", and accuracy fell almost in half.

This is Bainbridge's third irony made clinical. The monitoring task (evaluate the AI recommendation) is categorically different from, and harder than, the original diagnostic task. The interface assigned the harder cognitive role while removing the practice that built competence for the easier one.

The pattern appears elsewhere. Studies of clinical decision support systems have consistently found that alert acceptance rates fall as alert volume increases, but the alerts that get accepted are not reliably the most clinically important ones. Clinicians learn to clear alerts by default. Override becomes the automatic response to any notification. Then, when an alert is genuinely critical, the automatic response fires before deliberate evaluation does.

Clinicians know this. It is one of the most consistent findings in user research conducted in clinical settings: not that clinical staff are inattentive, but that alert volume and uniformity have trained them into response patterns that are reasonable adaptations to the environment they actually work in. The problem is not the clinicians. It is the design of the information environment they are embedded in.

What "Human in the Loop" Actually Means

The phrase "human in the loop" has become a regulatory and product development reassurance. It appears in safety analyses, FDA submissions, hospital procurement criteria. It is used to mean that a human confirms the AI's recommendation before it is acted upon.

At the interface level, this is frequently not what it describes.

A human whose only interaction with an AI-generated recommendation is to confirm or reject it is not in the loop. They are at the end of it. The loop completed before they arrived: the AI processed the data, weighted the variables, reached a conclusion, and presented an output. The human's role is to ratify or veto. This is not reasoning. It is closer to quality control on a production line: the product arrives at the end of a process the human did not participate in, and they decide whether to approve it.

The error modes differ depending on who generated the value being confirmed.

When a human generates a value and then confirms it, the confirmation catches execution errors: transcription mistakes, unit confusions, patient selection slips. The human has already done the reasoning. The confirmation checks the output of that reasoning.

When an AI generates a value and the human confirms it, the confirmation is supposed to do something harder: evaluate reasoning the human did not perform. Identify the cases where the AI's conclusion is wrong. Notice the patient presentation that falls outside the training distribution. Maintain independent clinical calibration across thousands of confirmations. That is not a confirmation task. It is a continuous expert evaluation task, and it is being performed through an interface designed for the easier job.

Three interface approaches can change this. They do not eliminate the paradox. But they address it at the level where it is actually experienced.

Present data before recommendation. Before the AI recommendation is disclosed, the interface should require the clinician to record a brief independent clinical assessment. Not a full diagnosis: a directional judgment, a noted concern, a flagged anomaly. The 2023 radiologist study shows that seeing an AI score first changes the entire trajectory of clinical reasoning. Reversing the sequence preserves independent judgment as a genuine input to the decision rather than a reaction to the AI's output. Some EHR implementations do this for specific high-stakes decisions. It is not universal and not standard.

Show reasoning, not just recommendation. An interface that displays the AI's conclusion without its reasoning asks the clinician to evaluate a verdict without access to the evidence. Showing the weighted factors, the confidence interval, and the edge conditions that affected the output allows meaningful evaluation. The clinician is checking logic, not endorsing a number. Where AI systems flag that a case falls outside their validated performance range, that information needs to be surfaced in the interface, not buried in documentation.

Design override to preserve independent judgment, not penalise it. Most AI confirmation interfaces ask for justification when the clinician overrides the recommendation. The burden of explanation falls on disagreement, not agreement. This is backwards. The clinical decision is the clinician's responsibility regardless of what the AI suggests. An interface that treats departure from AI recommendation as a deviation requiring justification creates steady institutional pressure toward acceptance. Over time, that pressure is Bainbridge's second irony operating through social architecture: the exceptional case (clinician overrides AI) requires effort; the routine case (clinician confirms AI) requires none. Competence in the exceptional degrades.

How AI Causes Clinical Deskilling

Bainbridge's second irony has a time dimension that the current conversation about clinical AI is not confronting.

If AI handles routine diagnostic or dosing decisions, and humans handle only the escalated or failed cases, clinical skill in the AI-covered domain degrades from disuse. Slowly. Over months and years, not in the immediate aftermath of implementation. The curve is not visible in the first year of deployment. By the time it becomes visible, the gap may be wide.

Air France 447 is a compressed version of this curve: six to twelve months of minimal manual flying before a crisis that required it. The clinical equivalent plays out over a longer timescale. But the mechanism is the same.

Clinicians who routinely defer to algorithmic dosing recommendations may lose calibration for when a recommendation is outside the normal range for their specific patient. Radiologists who routinely incorporate AI preliminary reads may lose confidence in unassisted assessment in cases where AI has been withdrawn, is unavailable, or is wrong. Junior clinicians who have trained in AI-assisted environments may develop competency profiles that are genuinely different from those who trained before AI integration, with implications that are difficult to assess in advance and will only become visible retrospectively.

We do not have validated design patterns for maintaining clinical skill in AI-assisted environments. That is an honest statement, not a placeholder. The literature on automation and deskilling in aviation, nuclear operations, and process control is substantial. The equivalent research in clinical AI is thin, and the design frameworks that would translate it into interface specifications do not exist in any robust form.

This matters in practice. When working through the design rationale for an AI-recommendation interface with a client, this is often the point where the conversation becomes genuinely uncertain, not about what the risk is, but about what adequate mitigation looks like. We can specify the sequence of AI disclosure. We can specify what the audit trail should record. We can design override architecture that does not penalise independent judgment. What we cannot do with confidence is specify how frequently a clinician needs to make unaided decisions to maintain meaningful calibration in a domain where AI has taken over the routine cases. That number probably varies by domain, by individual, by the quality of feedback the clinician receives on unaided decisions. It has not been studied at the granularity that would support a design specification.

Some aviation training programmes build in mandatory manual flying hours, defined, documented, reviewed. The principle is established. The clinical equivalent has not been formalised. The design space is open in a way that is unusual for a problem this clearly identified: What decisions should be reserved for unaided judgment? At what frequency? With what feedback? Across which clinical roles and settings? These are empirical questions that need research, not design opinions that need confidence.

We are not aware of any current medical device or clinical AI interface that has addressed them in its design rationale. The closest available structural model is aviation's Crew Resource Management framework, which emerged from the United Airlines Flight 173 accident in 1978 and was established as a global standard by 1992. CRM didn't just address immediate failure modes; it built a systematic approach to maintaining human judgment and team performance in highly automated environments. Clinical AI will need something equivalent. Interface design is where it will have to start, because the interface is where the human-AI boundary is actually negotiated in practice.

The Independent Judgment Preservation Model

There is no single design template for AI-recommendation interfaces that prevents the automation paradox. Clinical contexts vary too much, AI systems vary too much, and the honest answer is that the right calibration requires contextual judgment rather than universal prescription.

What follows is a set of design principles: four characteristics of an AI-recommendation interface designed to maintain rather than displace the clinical reasoning it assists. We call them, collectively, the Independent Judgment Preservation Model. Each addresses one of Bainbridge's three ironies directly.

Principle 1: Sequence clinical reasoning before AI disclosure for high-consequence decisions

This addresses the third irony, monitoring replacing doing. For any decision where a significant error would cause irreversible harm, the interface should require the clinician to form and record a directional clinical judgment before the AI recommendation is disclosed. Brief is sufficient. The purpose is not comprehensive documentation; it is preserving the cognitive habit of reaching a conclusion before evaluating the AI's. That habit is what the 2023 radiologist study shows is eroded by AI-first interfaces.

Not every decision warrants this. The risk classification that determines which decisions are high-consequence enough to justify the additional friction should be explicit in the design rationale and documented in the hazard analysis.
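To make the sequencing constraint concrete, here is a minimal sketch in TypeScript. It is illustrative, not a specification: the type names, the risk classification field, and the gating function are our assumptions about one way an interface layer could encode the rule.

```typescript
// Hypothetical sketch: gate AI disclosure behind a recorded independent assessment.
// All names here are illustrative, not drawn from any real product or standard.

type RiskClass = "routine" | "high-consequence";

interface IndependentAssessment {
  clinicianId: string;
  directionalJudgment: string; // brief, e.g. "suspicious density, upper outer quadrant"
  recordedAt: Date;
}

interface CaseContext {
  caseId: string;
  riskClass: RiskClass; // from the documented risk classification in the hazard analysis
  assessment?: IndependentAssessment; // absent until the clinician records one
}

// High-consequence decisions require a recorded directional judgment
// before the AI output becomes reachable at all.
function canRevealRecommendation(ctx: CaseContext): boolean {
  return ctx.riskClass === "routine" || ctx.assessment !== undefined;
}

function revealRecommendation(
  ctx: CaseContext,
  fetchRecommendation: (caseId: string) => string
): string {
  if (!canRevealRecommendation(ctx)) {
    throw new Error("Record an independent clinical assessment before AI disclosure.");
  }
  return fetchRecommendation(ctx.caseId);
}
```

The structural point is that the recommendation is unreachable, not merely hidden, until the independent judgment exists. A display toggle can be bypassed; a missing precondition cannot.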

Principle 2: Make AI reasoning transparent, including its limits

This addresses the second irony, exceptional cases without the foundation of routine engagement. A clinician who cannot see why the AI reached its conclusion cannot meaningfully evaluate whether that conclusion is correct for this patient. They can only accept or reject a verdict.

The interface should show the factors driving the recommendation, the confidence level, and, critically, any indication that the current case falls outside the AI's validated performance range. An AI system that fails silently is significantly more dangerous than one that flags its own uncertainty. That uncertainty needs to be surfaced where the clinical decision is made, not in the system's technical documentation.
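A sketch of what that payload can look like, again hypothetical: the field names are our assumptions, and the out-of-range flag stands in for whatever out-of-distribution signal a given AI system can actually emit.

```typescript
// Hypothetical sketch: a recommendation that carries its reasoning and its limits,
// not just a verdict. Field names are illustrative assumptions.

interface WeightedFactor {
  name: string;   // e.g. "lesion density"
  weight: number; // contribution to the conclusion, normalised to [0, 1]
}

interface AiRecommendation {
  conclusion: string;                   // the verdict, e.g. a category or score
  confidenceInterval: [number, number]; // e.g. [0.72, 0.88]
  factors: WeightedFactor[];            // the evidence behind the verdict
  outsideValidatedRange: boolean;       // true when the case falls outside the
                                        // population the model was validated on
}

// Surfacing rule: an out-of-range case is rendered as a first-class warning
// at the point of decision, never left to the technical documentation.
function presentationPriority(rec: AiRecommendation): "warning-first" | "standard" {
  return rec.outsideValidatedRange ? "warning-first" : "standard";
}
```

The type does nothing clever. Its value is that the interface cannot render a conclusion without also having the reasoning and the limit flag in hand.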

Principle 3: Make override the path of least friction, not most

This addresses the first irony, the atrophy of the skills most needed when the system fails. Override architecture that burdens disagreement creates cumulative pressure toward AI acceptance. The interface should treat AI recommendations as clinical inputs, not clinical defaults. Audit trails should record the clinical reasoning alongside the confirmation or override, not just the deviation from recommendation. Institutional AI acceptance rates should not be framed as quality metrics.

This is the design choice most organisations get wrong, because the instinct is to ensure the AI's outputs are being used. That instinct produces an interface that gradually deskills the clinicians using it.
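One way to express this symmetry in code, as a sketch rather than a prescription: confirming and overriding share one record shape and one code path, so agreement carries the same evidentiary burden as disagreement. All names are illustrative.

```typescript
// Hypothetical sketch: symmetric decision recording. Names are illustrative.

type DecisionKind = "confirm" | "override";

interface ClinicalDecisionRecord {
  caseId: string;
  clinicianId: string;
  decision: DecisionKind;
  clinicalReasoning: string; // recorded for BOTH confirm and override
  aiConclusionShown: string; // what the clinician saw, for the audit trail
  recordedAt: Date;
}

// Deliberately no extra "justification" field for overrides: the AI output
// is treated as a clinical input, not a default the clinician deviates from.
function recordDecision(
  caseId: string,
  clinicianId: string,
  decision: DecisionKind,
  clinicalReasoning: string,
  aiConclusionShown: string
): ClinicalDecisionRecord {
  return {
    caseId,
    clinicianId,
    decision,
    clinicalReasoning,
    aiConclusionShown,
    recordedAt: new Date(),
  };
}
```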

Principle 4: Reserve defined decisions for unaided clinical judgment

This is the most underused design principle and the one most directly relevant to the deskilling curve.

Some decisions in any AI-assisted workflow should be explicitly handled without AI support, not occasionally, not as a failsafe, but as a defined, documented, recurring practice. Rotated. Reviewed. Known to the clinician as a deliberate skill-maintenance exercise. The aviation equivalent is mandatory manual flying hours. The clinical equivalent has not yet been formalised, but the principle is implementable now at the interface level: a defined category of decisions where the AI recommendation is not shown, the clinician's independent judgment is recorded, and the outcome is tracked.
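A sketch of one possible mechanism, with a loud caveat: the reservation rate below is a placeholder, because, as discussed above, the frequency that actually maintains calibration has not been studied. The deterministic hash is there so the rotation is documented and auditable rather than ad hoc.

```typescript
// Hypothetical sketch: withhold the AI recommendation for a defined fraction
// of eligible cases. All names and the 10% rate are illustrative assumptions.

interface ReservationPolicy {
  eligibleDecisionTypes: Set<string>; // decision categories in the rotation
  reservedFraction: number;           // e.g. 0.1; NOT an evidence-based number
}

// Deterministic hash: the same case always gets the same treatment, so the
// actual reservation rate can be audited after the fact.
function hashToUnitInterval(caseId: string): number {
  let h = 0;
  for (const ch of caseId) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return h / 0xffffffff;
}

function isReservedForUnaidedJudgment(
  caseId: string,
  decisionType: string,
  policy: ReservationPolicy
): boolean {
  return (
    policy.eligibleDecisionTypes.has(decisionType) &&
    hashToUnitInterval(caseId) < policy.reservedFraction
  );
}

// For reserved cases: the AI recommendation is not shown, the clinician's
// unaided judgment is recorded, and the outcome is tracked for feedback.
```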

This is friction. It will not be welcomed in workflows under time pressure. The design challenge is calibrating it to a frequency and decision type that maintains the skill without degrading the workflow, and that requires contextual research in real use environments that lab usability testing cannot substitute for.

For the medical device interfaces we design and validate, the gap between lab performance and deployed performance is one of the most consistent findings we return to. This principle in particular requires that gap to be understood before the design is finalised.

The Curve Is Not Linear

Clinical AI will become more capable, more integrated, and more depended upon. That is a description of what is already underway, not a prediction.

The automation paradox follows. Bainbridge established more than forty years ago that reliable automation removes the conditions that maintain human competence for failure management. The better the system, the wider the gap. The only open question is whether the medical device and clinical software industry designs for that gap now or discovers it later.

Discovery tends to happen at the Air France 447 moment: a system failure, in conditions the humans were no longer equipped to handle, because the system's reliability had removed the conditions that would have kept them equipped. The accident investigation identifies the gap. The design response follows. Two hundred and twenty-eight people are not available to assess whether the response was timely.

Clinical AI is earlier on this curve. The gap is not yet wide. The design decisions (what sequence the interface uses, whether it shows reasoning or only a recommendation, how it handles override, whether it preserves any space for unaided clinical judgment) are still being made. In many products they are still open.

The interface is where these decisions get made. Most products currently deploying clinical AI have not made them explicitly, which means they have been made by default, through whatever was easiest to build. That is a design choice with a known trajectory.

If you are making these decisions for a clinical AI product, we would be glad to work through them with you.

FAQ: Automation Bias, Clinical AI, and Interface Design

What is automation bias in healthcare, and why does it matter for medical device design? Automation bias is the tendency to over-rely on automated recommendations, particularly when the automation is usually correct. In clinical settings, it means clinicians may accept AI-generated recommendations without adequate independent evaluation, not because they are negligent, but because the interface structure and accumulated experience of correct recommendations trains that behaviour. For medical device design, it matters because a confirmation step that a clinician accepts reflexively is not a safety mechanism. The 2023 radiologist study found diagnostic accuracy fell from 82% to 45.5% when AI scores were shown before independent assessment, without the interface asking for deference in any way.

What are Bainbridge's Ironies of Automation and how do they apply to clinical AI? Lisanne Bainbridge's 1983 paper identified three structural problems in highly automated systems: automation removes the practice that maintains human skills; humans inherit the most demanding tasks while losing the routine experience that builds competence for them; and monitoring a working system is harder and less engaging than operating one. All three apply to clinical AI. Clinicians confirm AI recommendations without performing the reasoning that generated them, inherit the responsibility for exceptional cases while losing routine practice, and must maintain vigilance across thousands of AI-generated outputs that are usually correct.

How should AI recommendation interfaces be designed differently from standard medical device confirmation steps? A confirmation step designed for human-generated inputs verifies that the user's entry was correct. A confirmation step for AI-generated recommendations must do something harder: enable genuine evaluation of a recommendation the clinician did not produce. This requires sequencing clinical assessment before AI disclosure for high-consequence decisions; showing the AI's reasoning and confidence level, not just its conclusion; and designing override architecture that treats departures from AI recommendation as legitimate clinical judgment rather than deviations requiring justification.

What does IEC 62366 require for AI-assisted medical devices, and where does the current framework fall short? IEC 62366 requires manufacturers to identify foreseeable use errors and implement validated mitigations. For AI-assisted devices, the foreseeable use errors include automation bias (accepting incorrect recommendations without evaluation), distribution shift errors (the AI operating outside its validated performance range without the interface flagging this), and deskilling effects (loss of clinician calibration for unaided judgment over extended AI reliance). Current hazard analysis frameworks do not systematically require the third category to be addressed. This is a regulatory gap that post-market surveillance data will eventually expose.

What is the "human in the loop" problem in clinical AI interface design? "Human in the loop" describes an interface where a clinician confirms an AI recommendation before it is acted upon. At the interface level, this frequently means the clinician is at the end of the AI's decision loop, not inside it. The AI has already reasoned, concluded, and presented an output. The human ratifies or vetoes. This is not independent clinical reasoning, it is approval-gating a completed recommendation. The design challenge is building interfaces that keep the human genuinely inside the reasoning process.

How can medical device designers address the clinical deskilling risk from AI integration? The clearest available design principle, drawn from aviation's approach to maintaining manual proficiency alongside automated flight, is to reserve defined decisions for unaided clinical judgment on a deliberate, documented, recurring basis. Not as a failsafe but as a skill-maintenance practice. Which decisions to reserve, at what frequency, and with what feedback are questions for contextual design and research rather than a universal template. The principle is implementable now. The validated framework for clinical AI specifically does not yet exist, and any designer or organisation claiming to have fully solved this problem should be treated with scepticism.
