This article draws on Creative Navy's project work in medtech UX, spanning practice management software, surgical equipment, ventilators, blood pumps, infusion systems, and patient monitoring devices, including Class II and Class III regulated products. Our work in this sector covers clinical environments including the ICU and operating theatre, designing for surgeons, nurses, and biomedical engineers. Dennis Lenard, who leads this work at Creative Navy, is the author of User Interface Design For Medical Devices And Software, a practitioner reference for the field. Our approach integrates IEC 62366 usability engineering requirements and FDA Human Factors guidance as structural inputs to the design process, not post-hoc compliance activities.
Key Statistics
- 47% reduction in surgical deaths when the WHO Surgical Safety Checklist was fully and genuinely implemented (Haynes et al., NEJM, 2009)
- No statistically significant reduction in operative mortality after mandatory province-wide checklist implementation in Ontario, despite reported compliance of 92 to 98%
- 0.23: the odds ratio for patient death when checklists were fully completed with genuine engagement (Dutch study), a 77% reduction in the odds of dying
- No benefit was found for partial checklist completion. Non-completion was associated with higher mortality than before the checklist existed
- 8 seconds: the approximate time in which a surgical safety checklist can be completed when one person reads out the questions and answers them without verification
- 60% of infusion pump administrations are associated with errors in clinical practice, despite confirmation steps being standard on most devices
- 38x: the overdose magnitude in the UCSF infusion pump case, in which a confirmation alert was cleared without the error being caught
The Completed Checklist That Didn't Help
The surgical team completes the WHO Surgical Safety Checklist before incision. Every item answered. Every box ticked. Compliance: 100%.
The patient is harmed by something the checklist was specifically designed to prevent.
This is not a hypothetical. It is a documented pattern across healthcare systems in multiple countries. And it exposes a problem that most organisations involved in clinical safety have been reluctant to name precisely: the difference between a completed checklist and a protective one is not found in the audit trail. It is found in the interface design.
The 2014 Ontario province-wide study, published in the New England Journal of Medicine, found no statistically significant reduction in operative mortality after mandatory surgical checklist implementation across more than 100 hospitals, despite reported completion rates of 92 to 98%. The same checklist that had produced a 47% reduction in surgical deaths in a controlled 2009 study was producing nothing at scale.
If the intervention works under controlled conditions and fails at scale with near-perfect compliance, the problem is not the checklist. The problem is something about what compliance means in practice. That is a design problem.
What the Evidence on Checklist Failure Shows
Three studies in sequence tell the story more precisely than the headline numbers suggest.
The Original Controlled Evidence
The Haynes et al. study, published in the New England Journal of Medicine in 2009, introduced the WHO Surgical Safety Checklist across eight hospitals in eight countries under controlled conditions. Surgical death rates fell from 1.5% to 0.8%, a 47% reduction. Major complications fell from 11% to 7%.
Atul Gawande, the lead investigator, described the results as startling. The reductions held equally across high-income and low-income sites.
The checklist worked. Under those conditions, with that implementation.
The Province-Wide Failure
Five years later, Ontario made the checklist mandatory across all hospitals in the province. Reported compliance reached 92 to 98%. The NEJM study examining this found no statistically significant reduction in operative mortality or surgical complications.
Same checklist. Near-perfect reported compliance. No protective effect.
The study noted that no standardised training was required before public reporting began and implementation was not standardised across institutions. But this explanation locates the problem in people and institutions. It does not explain the mechanism.
The Dutch Study That Found the Mechanism
A Dutch hospital study isolated the variable the Ontario data could not. It separated outcomes by the quality of checklist engagement: full completion, partial completion, and non-completion.
The results were precise and important.
Full completion, with genuine engagement: odds ratio for patient death of 0.23. A 77% reduction in the odds of dying.
Partial completion: no statistically significant benefit.
Non-completion: odds ratio of 1.57. Patients were more likely to die than before the checklist existed.
The checklist, genuinely engaged with, is one of the most protective interventions in surgical care. The checklist, treated as a form to be completed, is worse than nothing. The false confidence of a ticked checklist displaces whatever informal vigilance it replaced.
This is the mechanism. And it is a design mechanism.
Why Checklists Become Ritual
The standard explanation for checklist failure is cultural. Clinicians are busy. Surgeons are senior and impatient. Hierarchies make challenge difficult. Teams treat it as box-ticking. The solution, under this framing, is more training, stronger mandates, better culture.
These explanations are not wrong. But they locate the problem entirely in people, which means every proposed solution is also about people, and people-focused solutions consistently fail to move the outcome data at scale.
The design explanation is different. And more actionable.
The WHO checklist, in most implementations, is a paper form that cannot distinguish genuine engagement from reflexive attestation. It looks identical whether completed in 45 seconds with real verification or in 8 seconds with one person reading and answering their own questions. It has no mechanism for registering the difference. An instrument that cannot distinguish the thing it is meant to enforce from the simulacrum of that thing is not a safety system. It is a documentation system.
Three specific design failures explain how this happens.
No Independent Verification Architecture
Aviation's challenge-response checklist requires two operators with access to independent information sources. The monitoring pilot challenges; the flying pilot responds. But the monitoring pilot does not take the response on faith. They have instrument access to the same data and would catch a discrepancy.
Most surgical checklist implementations have one person reading items to a team who can answer without checking. The circulating nurse asks: "Has the site been marked?" The surgeon answers: "Yes." There is no instrument, no independent data source, no second verifier. The verification is social. It is a question and an answer. Under time pressure, it becomes a question and the answer that ends the question fastest.
This is not a failure of professionalism. It is a failure of architecture.
No Forcing Functions
A forcing function is an interface element that makes it structurally impossible to proceed without genuinely completing a required action.
A paper checklist has none. You can move through every item without any system registering whether the underlying state was verified. In medical device interfaces, the equivalent failure is everywhere. A confirmation button that appears after a high-dose entry requires a tap. It does not require the user to demonstrate that they read the dose. It does not require them to confirm the unit of measurement explicitly. It registers completion. It does not register comprehension.
The UCSF infusion pump case is the clearest example. A physician entered a dose 38 times higher than intended because the default unit of measurement, mg/kg versus mg, was not sufficiently legible. An alert fired. It was cleared. The patient suffered a grand mal seizure and respiratory arrest. The confirmation step existed. It was completed. It was not a forcing function, because it could be cleared without demonstrating comprehension of what was being confirmed.
Most confirmation interfaces test for completion. The use error happens in the gap between completing the step and actually processing what it said.
Social Pressure Has No Countermeasure
In an operating room with a surgeon waiting, the circulating nurse reading checklist items is operating under real social pressure to complete quickly. The paper form does not protect against this. It does not know the surgeon is waiting. It does not register that "site confirmed" was answered before anyone looked.
A well-designed digital implementation could change this. "Antibiotics administered?" could link to the pharmacy dispensing system and display the confirmed administration time. The answer is not an attestation. It is a data verification. Social pressure to rush becomes structurally irrelevant, because the system requires confirmation of a fact rather than a verbal agreement that a fact is true.
Most implementations produce a compliance record while believing they have a safety mechanism. The audit trail is identical either way.
Across drug delivery and infusion device projects, this pattern recurs in a way that is difficult to ignore. The confirmation step is present in the hazard analysis as a named risk mitigation. In multiple cases it has existed across two or three product generations without being formally evaluated as a safety control. When we run formative usability sessions under conditions approximating real ward use (concurrent alarms, a second operator asking questions in the background, an explicitly imposed time constraint), participants clear confirmation alerts within seconds. Asked immediately afterward what value they had just confirmed, many cannot accurately recall it. The step appeared in the regulatory documentation. It had never been tested as a safety mechanism. Those are not the same thing, and the difference is exactly what the Dutch study's odds ratios are measuring.
When Compliance Is Not Safety: IEC 62366
This distinction matters beyond clinical outcomes. It matters for regulatory liability.
Under IEC 62366, the standard governing usability engineering for medical devices, manufacturers are required to identify foreseeable use errors, assess their severity, and implement mitigation measures that reduce either the probability of the error or the severity of its consequences. The adequacy of those mitigations must be validated through summative usability testing with representative users in realistic conditions.
A use error is a foreseeable action or omission that differs from intended use. A device deficiency is a design characteristic that makes a use error foreseeable and insufficiently mitigated.
The difference is the manufacturer's liability question.
When a clinician clears a high-dose alert without reading it, the question is not only whether a user error occurred. The question is whether the confirmation step was designed so that a reasonably busy, reasonably stressed clinician under foreseeable use conditions could be expected to engage with it genuinely. If the only friction was a button tap that looked identical whether the warning was read or ignored, the confirmation step may be a device deficiency rather than a use error mitigation.
Stating that a confirmation step exists is not a risk mitigation argument. The risk mitigation argument is: "We have a confirmation step that requires the user to actively enter the dose value, which addresses the foreseeable use error of overlooking the default unit of measurement, and which summative testing has shown is not bypassed under foreseeable use conditions including time pressure and concurrent task load."
These are different regulatory positions. The second one is defensible. The first is documentation.
A confirmation step that exists and a confirmation step that works are not the same regulatory claim. The difference between them, in IEC 62366 terms, is whether the design can be shown to reduce the probability of the target use error under conditions the device will actually encounter. The medical device UX work we have documented consistently shows that the summative testing step, conducted under realistic conditions, is where this distinction becomes visible and auditable.
The FDA's human factors guidance and ANSI/AAMI HE75 reinforce this framing. Inadequate user interface design is listed as a distinct adverse event cause category in the FDA's MAUDE database, separate from software defects and hardware failures. A confirmation step that cannot be shown to reduce the probability of its target use error is, in regulatory terms, an unresolved risk, not a managed one.
For anyone building the internal case for better confirmation design, this reframing is useful. It transforms the conversation from "we should improve the UX," which sounds like aesthetics, to "our current confirmation step may not constitute adequate risk mitigation under IEC 62366," which sounds like compliance. Those conversations have different outcomes in budget discussions.
What Genuine Verification Requires
There is no single universal answer to what a protective confirmation workflow looks like. Clinical contexts vary too much. But there are five design characteristics that distinguish a genuine verification mechanism from a documented one, collectively forming the Designed Safety Verification Model.
Each characteristic can be evaluated in a design review, tested in summative usability research, and documented as a specific risk mitigation in a hazard analysis.
A confirmation step that genuinely reduces the probability of a high-consequence use error shares five characteristics: it requires active engagement with specific content rather than acknowledgement of a prompt; it cross-references independent data where available rather than relying on attestation; it is calibrated to the severity of the error it addresses; it is designed for foreseeable conditions including time pressure and interruption; and its audit trail records what was processed, not merely that the step was completed. Where any of these characteristics is absent, the step may satisfy a compliance requirement without functioning as a safety control.
Requires engagement, not acknowledgement
A confirmation step that presents information and asks for acknowledgement tests whether the user saw the prompt. A confirmation step that requires active engagement with specific content tests whether they processed it.
Concretely: rather than displaying the dose and asking the user to confirm, ask the user to enter the dose value they are confirming. Require explicit selection of the unit of measurement rather than acceptance of a default. Ask the user to confirm the patient weight driving the calculation.
These inputs cannot be completed without reading the specific content. The interface cannot be advanced by someone who has not processed what they are confirming. That is the operational definition of a forcing function.
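As a minimal sketch of what this could look like in software, assuming a hypothetical dose-confirmation flow (the names and the console interaction are illustrative, not a production pattern):

```python
# Minimal sketch of a forcing-function confirmation. All names are
# hypothetical; a real device would use its own UI toolkit, not input().

ALLOWED_UNITS = ("mg", "mg/kg")

def confirm_dose(ordered_value: float, ordered_unit: str) -> bool:
    """Advance only if the user actively reproduces the dose being confirmed."""
    entered = input("Re-enter the dose value you are confirming: ").strip()
    unit = input(f"Select the unit ({' or '.join(ALLOWED_UNITS)}): ").strip()

    try:
        value_matches = float(entered) == ordered_value
    except ValueError:
        value_matches = False  # non-numeric entry counts as a failed confirmation

    # No default unit: the user must produce the unit themselves, so a
    # mg vs mg/kg confusion cannot survive the step unnoticed.
    return value_matches and unit in ALLOWED_UNITS and unit == ordered_unit
```

The contrast with an acknowledgement dialog is structural: a dialog returns success on any tap, whereas this step can only return success if the entered value and the explicitly selected unit reproduce the order being confirmed.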
Uses independent data where available
Where a system has access to a second data source, pharmacy dispensing records, previous device entries, an EMR, the confirmation step can cross-reference rather than rely on attestation.
"Antibiotics confirmed in dispensing record: 08:42" requires no user input and cannot be falsified by time pressure. The answer is not what the user says. It is what the system can verify independently.
This is the architecture aviation uses. The monitoring pilot does not ask the flying pilot about fuel quantity and accept the answer. They look at the same instrument. The challenge-response creates a shared information state, not a social transaction. Where medtech systems have access to equivalent data, the same principle applies.
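A sketch of the pharmacy cross-check described earlier, under the assumption that a dispensing record is queryable; the lookup interface here is hypothetical:

```python
from datetime import datetime
from typing import Optional

def antibiotics_item(dispensing_times: dict[str, datetime], patient_id: str) -> str:
    """Render the checklist item from verifiable data rather than attestation."""
    administered_at: Optional[datetime] = dispensing_times.get(patient_id)
    if administered_at is not None:
        # The system answers the question; there is nothing to rush or attest.
        return f"Antibiotics confirmed in dispensing record: {administered_at:%H:%M}"
    # A missing record blocks the item instead of inviting a verbal "yes".
    return "NOT VERIFIED: no dispensing record found for this patient"
```

Given a record showing administration at 08:42, the item renders as "Antibiotics confirmed in dispensing record: 08:42" with no user input at all; given no record, it blocks rather than asks.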
Calibrated to consequence, not uniform
Forcing function intensity should scale with the severity of the use error it addresses. High consequence (wrong patient, wrong drug, dose magnitude error) warrants high friction. Low consequence (suboptimal timing, minor parameter adjustment) warrants low friction or none.
Uniform confirmation requirements across all actions have a documented failure mode: they train users to complete them automatically. Once a confirmation step is completed automatically, it is no longer engaging cognition. The probability of the target use error reverts toward its pre-mitigation baseline.
This is the alarm fatigue dynamic applied to confirmation steps. The same mechanism that makes clinical alarms ineffective when 85 to 99% are non-actionable makes confirmation workflows ineffective when they fire for every action regardless of risk level.
We should be honest that we do not have a universal calibration to offer. The right friction level for a given action depends on the clinical context, the patient population, the use environment, and the specific error being mitigated. What we do know from the projects we have worked on is that uniform friction is worse than no friction for low-risk actions, because it systematically erodes attentiveness to the high-risk ones. The risk classification must exist and must drive the design. The exact thresholds require contextual judgment, not a formula, and they should be validated rather than assumed.
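One way to make the risk classification drive the design, sketched with hypothetical tiers and friction levels that would in practice come from the hazard analysis and be validated per context:

```python
from enum import Enum

class Risk(Enum):
    HIGH = "high"      # e.g. wrong patient, wrong drug, dose magnitude error
    MEDIUM = "medium"  # e.g. atypical but plausible parameter change
    LOW = "low"        # e.g. suboptimal timing, minor parameter adjustment

# Friction scales with consequence; there is deliberately no uniform prompt.
FRICTION_BY_RISK = {
    Risk.HIGH: "re-enter value and explicitly select unit",
    Risk.MEDIUM: "single confirmation naming the specific parameter changed",
    Risk.LOW: "no confirmation: proceed and log",
}

def friction_for(action_risk: Risk) -> str:
    # Low-risk actions get no prompt at all, so the prompts that do appear
    # retain their claim on attention.
    return FRICTION_BY_RISK[action_risk]
```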
Designed for foreseeable conditions
A confirmation step is only a risk mitigation for the conditions under which it will be used. If it is evaluated under ideal conditions (rested user, single task, unhurried) and then deployed into an environment that is none of those things, the validation is not valid for the deployment context.
The user research we conduct in live clinical environments reliably produces findings that controlled lab testing misses, not because the lab methodology is wrong, but because the stressors that reveal interface weaknesses are absent from it. A confirmation step that cannot be penetrated by a focused test participant in a quiet room may be penetrated routinely by a night-shift nurse managing three concurrent alarms.
Summative testing should include scenarios replicating the foreseeable conditions under which the target use error is most likely to occur: time pressure, interruption, concurrent task load. If the confirmation step is bypassed under those conditions, it is not adequate mitigation for those conditions, regardless of lab results.
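A sketch of how those foreseeable conditions could be encoded as explicit test scenarios, so a summative protocol cannot quietly default to the quiet-room case; the fields and values are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StressScenario:
    """One summative-test scenario replicating foreseeable use conditions."""
    name: str
    time_limit_s: int       # explicit time pressure; 0 means unhurried
    concurrent_alarms: int  # background alarm load during the task
    interruptions: int      # scripted interruptions mid-task
    target_use_error: str   # the specific error the step must catch

SCENARIOS = [
    StressScenario("quiet-room baseline", 0, 0, 0, "unit confusion (mg vs mg/kg)"),
    StressScenario("night shift", 30, 3, 2, "unit confusion (mg vs mg/kg)"),
    StressScenario("handover interruption", 60, 1, 3, "wrong-patient selection"),
]

# Pass criterion: the confirmation step is adequate mitigation only if it is
# not bypassed in any scenario, not only in the baseline.
```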
Audit trail of engagement, not completion
For regulatory purposes, a confirmation step is only a documented risk mitigation if the record it generates reflects what actually occurred. A log entry that records "confirmation step completed at 02:17" is documentation. A log entry that records "dose value 250mg entered by user at 02:17, unit mg confirmed" is evidence that the mitigation was executed as designed.
This distinction matters in post-market surveillance. If an adverse event occurs and the confirmation step was completed, the investigation's first question is whether the step was genuinely engaged with or procedurally cleared. An audit trail that cannot answer this question cannot demonstrate that the mitigation was functioning.
Designing the audit trail as a record of engagement rather than a record of completion is a small technical decision with significant regulatory consequences.
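Sketched as a log entry, with illustrative field names rather than any standard schema, the difference between recording completion and recording engagement looks like this:

```python
import json
from datetime import datetime, timezone

def log_dose_confirmation(user_id: str, entered_value: float, selected_unit: str,
                          ordered_value: float, ordered_unit: str) -> str:
    """Record what was processed, not merely that a step was completed."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user_id,
        "entered_value": entered_value,   # the value the user actually typed
        "selected_unit": selected_unit,   # explicit selection, never a default
        "matched_order": (entered_value == ordered_value
                          and selected_unit == ordered_unit),
    }
    # A bare "confirmation step completed at 02:17" cannot answer whether the
    # mitigation executed as designed; this entry can.
    return json.dumps(entry)
```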
When the Device Confirms for You
There is a harder version of this problem emerging, and it deserves naming here even though its full implications belong to a separate discussion.
As smart drug libraries, algorithmic dosing recommendations, and AI-assisted clinical decision support become standard features of medical devices, the confirmation problem changes character in a specific and important way.
When a device calculates and recommends a dose, and then asks the clinician to confirm it, the confirmation becomes an endorsement of the device's reasoning rather than a genuine verification.
The clinician's role shifts from independent verifier to approver. They are no longer confirming that a value they entered is correct. They are ratifying a value the system generated. The cognitive task is different. The risk profile is different. And the design requirements for the confirmation step are different.
A 2023 study found that experienced radiologists' independent diagnostic accuracy fell from 82% to 45.5% when they were presented with incorrect AI-generated scores, not because they were negligent, but because the interface had effectively replaced their reasoning with an endorsement prompt. They were confirming the system's output rather than exercising independent judgment.
This is the automation bias problem made concrete in clinical practice. It means that a confirmation step in an AI-assisted device is not a safety mechanism unless it is specifically designed to restore independent clinical reasoning, not just ratify algorithmic output.
What that looks like in practice, how to keep the clinician genuinely in the loop when the device is doing the cognitive work, is a design challenge that goes beyond confirmation step architecture. We examine it in depth in our next article on automation and clinical decision support.
For now, the relevant point is direct: if your current confirmation workflow was designed for a device where the user generates the values being confirmed, and your product roadmap adds AI-generated recommendations to that workflow, the confirmation step needs to be redesigned. It is not the same interaction with new content. It is a fundamentally different cognitive task with different failure modes.
The Design Brief That Follows
Return to the completed checklist. The 98% compliance rate. The unimproved mortality.
The problem was not that people failed to complete the form. The problem was that the form could not distinguish completion from verification, and that at scale, under normal clinical conditions, the two diverged.
A safety confirmation mechanism is only as protective as the engagement quality it enforces. That quality cannot come from mandate alone. Ninety-eight percent compliance with a mandate produced no patient benefit. Engagement quality must be structural: built into what the interface requires in order to proceed.
For every confirmation step in your device, two questions apply.
First: could a user complete this step in under three seconds, under time pressure, without having processed the specific content being confirmed? If yes, the friction is insufficient for the risk level the step addresses.
Second: if an adverse event occurred and the confirmation step was completed, could the audit trail demonstrate that the step was genuinely engaged with and not just cleared? If not, the step is documentation. It may satisfy a compliance requirement. It is not a safety control.
The gap between those two states, documented and protective, is a design gap. It is addressable through design. And it is where the difference between 0.23 and 1.57 lives: the Dutch study's odds ratios are the clearest possible evidence of what happens when you close that gap, or leave it open.
FAQ: Checklist Failures and Confirmation Design
Why did the WHO checklist fail in Ontario?
The Ontario study found 92 to 98% reported completion rates but no statistically significant mortality reduction. The most likely explanation, supported by the Dutch study, is that completion rates measure whether the form was filled out, not whether the verification actions within it were genuinely performed. A checklist completed by one person reading and answering their own questions in eight seconds looks identical in the audit trail to a checklist completed through a genuine challenge-response process. The instrument could not distinguish between the two, so mandating it at scale produced mandated completion without mandated engagement.
What distinguishes a use error from a device deficiency under IEC 62366?
A use error is a foreseeable action or omission that differs from intended use. A device deficiency is a design characteristic that makes a foreseeable use error insufficiently mitigated. If a user makes an error that a well-designed confirmation step would have caught, the question is whether the actual confirmation step was adequate: whether it was designed to reduce the probability of that specific error under foreseeable use conditions. If not, the error may reflect a device deficiency rather than residual use error risk. The distinction determines where regulatory liability sits.
What is a forcing function in medical device UX?
A forcing function is an interface element that makes it structurally impossible to proceed without completing a required action. In medical device design, a confirmation step that requires the user to enter a specific value, rather than tap a button acknowledging that a value was displayed, is a forcing function. It cannot be completed without processing the specific content. A button tap that looks the same whether the alert was read or ignored is not a forcing function. It registers completion, not comprehension.
What does IEC 62366 require of confirmation workflows?
IEC 62366 requires manufacturers to identify foreseeable use errors, implement mitigation measures to reduce their probability or severity, and validate those mitigations through summative usability testing with representative users under realistic use conditions. A confirmation step qualifies as a risk mitigation only if it can be shown, through that testing, to actually reduce the probability of the target use error. Testing under ideal conditions with unhurried participants does not validate performance under foreseeable conditions including time pressure, interruption, and concurrent task load.
How should confirmation steps be designed for time pressure?
Design for the assumption that social and time pressure will be applied. Ask: could this step be completed in under three seconds without genuine engagement? If yes, the friction is insufficient for high-consequence actions. Make the step require active engagement with specific content: entering a value, explicitly selecting a unit, confirming a patient-specific parameter, rather than acknowledging a static display. Where system data is available for cross-reference, use it rather than relying on attestation. Test the step under realistic stress conditions before submission.
How does automation bias affect AI-assisted medical devices?
When a device generates a recommendation and asks the clinician to confirm it, the clinician's cognitive task shifts from independent verification to ratification of the system's output. Research shows this shift significantly reduces independent judgment accuracy. In one clinical study, experienced clinicians' accuracy fell from 82% to 45.5% when exposed to incorrect AI-generated values. A confirmation step designed for user-generated values may be entirely inadequate for AI-generated ones, because the failure mode is different. The step must restore independent reasoning, not request endorsement.
References
- Haynes, A. B., et al. (2009). A surgical safety checklist to reduce morbidity and mortality in a global population. New England Journal of Medicine, 360(5), 491-499. https://www.nejm.org/doi/full/10.1056/NEJMsa0810119
- Urbach, D. R., et al. (2014). Introduction of surgical safety checklists in Ontario, Canada. New England Journal of Medicine, 370(11), 1029-1038. https://www.nejm.org/doi/full/10.1056/NEJMsa1308261
- de Vries, E. N., et al. (2010). Effect of a comprehensive surgical safety system on patient outcomes. New England Journal of Medicine, 363(20), 1928-1937.
- UCSF School of Nursing. (2019). UCSF and others strive to create true health care teams. https://nursing.ucsf.edu/scienceofcaring/news/ucsf-and-others-strive-create-true-health-care-teams
- IEC. (2015, amended 2020). IEC 62366-1: Medical devices, Part 1: Application of usability engineering to medical devices. International Electrotechnical Commission. https://www.iso.org/standard/73007.html
- ANSI/AAMI. (2009, revised 2018). HE75: Human factors engineering design of medical devices. Association for the Advancement of Medical Instrumentation. https://www.aami.org/detail-pages/store/technical-information-report/ansi-aami-he752009r2018
- FDA. (2016). Applying human factors and usability engineering to medical devices. U.S. Food and Drug Administration.