FMEA: When Your Organization Stops Discovering Failures After They
Happen and Starts Predicting Them Before They Occur — and the Risks You
Never Prioritized Became the Disasters You Never Saw Coming
The Funeral Nobody Attended
In 2018, a major automotive supplier in Stuttgart produced 14 million
fuel injector assemblies without a single field failure related to a
specific sealing surface. By any measure, the process was a success. The
customer was satisfied. Scrap rates were below 0.3%. Audits passed with
flying colors.
Then, in the spring of 2019, three vehicles caught fire.
The investigation traced every fire back to the same sealing surface
— a dimension that had drifted 12 microns beyond specification over six
months, slowly and silently, because nobody had identified it as a
critical characteristic during product development. The FMEA team had
listed 47 failure modes for that assembly. The sealing surface wasn’t
one of them.
Fourteen million parts. Three fires. Two injuries. One recall that
cost €23 million.
And the most painful truth: the failure mode had been discussed
during the FMEA session. A junior engineer mentioned it. The team lead
said, “That’s never been a problem before,” and moved on. The risk
priority number was never calculated. The column was left blank.
This is the story of FMEA — not the textbook version with its tidy
matrices and color-coded risk priority numbers, but the real version:
the one where human judgment meets organizational pressure, where the
most dangerous failure modes are the ones your team dismisses because
they’ve never happened yet.
What FMEA Actually Is: Beyond the Spreadsheet
Failure Mode and Effects Analysis is, at its core, a structured
imagination exercise. It is a discipline that forces a team to ask a
question most organizations avoid: What could go wrong?
Not what has gone wrong. Not what went wrong last time. Not what the
competitor got wrong. What could go wrong — including the
failures nobody has ever seen, the combinations nobody has considered,
and the degradation paths nobody has monitored.
The AIAG-VDA FMEA Handbook, the first edition to harmonize the
American and German automotive approaches, defines
the process across seven steps:
- Planning and Preparation: Define scope, form the team, gather baseline data
- Structure Analysis: Map the system, subsystem, and component hierarchy
- Function Analysis: Define what each element is supposed to do
- Failure Analysis: Identify how each function could fail
- Risk Analysis: Assess severity, occurrence, and detection for each failure mode
- Optimization: Define actions to reduce risk
- Results Documentation: Capture lessons learned and communicate findings
Seven steps. Simple enough to fit on a wall poster. Complex enough
that most organizations execute the first five, skip the sixth, and file
the seventh without ever reading it.
The Three Faces of FMEA
FMEA is not one tool. It is three tools wearing the same name, and
confusing them is where most organizations begin their descent into
compliance theater.
Design FMEA (DFMEA) focuses on product design — the
geometry, material selection, tolerances, and interactions that could
cause a product to fail in the hands of the customer. The team asks:
What could go wrong with this design, even if we manufacture it
perfectly?
Process FMEA (PFMEA) focuses on the manufacturing
process — the sequence of operations, machines, tooling, and human
actions that could introduce defects. The team asks: What could go
wrong while making this product, even if the design is perfect?
System FMEA (SFMEA) focuses on the system level —
the interactions between subsystems, the interfaces, the integration
risks that don’t belong to any single component. The team asks: What
could go wrong when we put everything together?
Each type requires different expertise, different perspectives, and
different questions. Running a DFMEA staffed only by manufacturing
engineers, or a PFMEA without the operators who actually run the process,
is like asking a cardiologist to diagnose a software bug. The title says
“expert.” The results say otherwise.
The RPN Illusion, and What Replaced It
For decades, FMEA was synonymous with the Risk Priority Number:
Severity × Occurrence × Detection. A number between 1 and 1,000 that was
supposed to tell you which failure modes mattered most.
The problem was that RPN was a lie dressed in mathematical
clothing.
A failure mode with Severity 10 (catastrophic), Occurrence 1
(remote), and Detection 10 (undetectable) scores an RPN of 100. A
failure mode with Severity 5 (moderate), Occurrence 5 (occasional), and
Detection 4 (moderately detectable) also scores 100. These two failure
modes are not equally important. One could kill someone. The other could
cause a minor cosmetic defect. The math treated them as equivalent.
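The arithmetic behind that collision is easy to check. The following is a minimal sketch, assuming the usual 1 to 10 rating scales; the class name `FailureMode` and the two labels are invented for illustration, while the ratings are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str        # hypothetical label, for illustration only
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (remote) .. 10 (very frequent)
    detection: int   # 1 (almost certain to detect) .. 10 (undetectable)

    def rpn(self) -> int:
        """Classic Risk Priority Number: Severity x Occurrence x Detection."""
        return self.severity * self.occurrence * self.detection


# Ratings taken from the two cases described in the text.
hidden_hazard = FailureMode("catastrophic, remote, undetectable", 10, 1, 10)
moderate_issue = FailureMode("moderate, occasional, detectable", 5, 5, 4)

for fm in (hidden_hazard, moderate_issue):
    print(f"{fm.name}: S={fm.severity} O={fm.occurrence} "
          f"D={fm.detection} RPN={fm.rpn()}")
```

Both rows come out with the same RPN even though their severities differ by a factor of two, which is exactly the information the product throws away.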
Worse, organizations set arbitrary RPN thresholds — “anything above
200 requires action” — which meant that engineers learned to game the
scoring. Severity was usually honest; it’s hard to argue that a brake
failure isn’t severe. But Occurrence and Detection were subjective,
negotiable, and frequently massaged until the RPN landed below the
threshold. “We’ve never seen that failure” became a justification for
Occurrence = 1, even when the process had changed, the supplier had
changed, and the operating conditions had changed.
The AIAG-VDA harmonization replaced RPN with the Action
Priority (AP) system: High, Medium, or Low, read from a lookup table
that rates each combination of Severity, Occurrence, and Detection
instead of multiplying them. Severity carries the most weight: a
catastrophic severity paired with a detection gap lands in High priority
across most of the occurrence range, so a favorable Occurrence score can
no longer cancel it out by simple multiplication.
This is better. But it still depends on the quality of the
conversation happening around the table.
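The difference between multiplying the ratings and weighting severity first can be seen in a small sketch. It does not reproduce the AIAG-VDA Action Priority table, which has to be taken from the handbook itself; it only ranks by severity, then occurrence, then detection. The function names and the third failure mode are assumptions made up for this example.

```python
# Each tuple: (label, severity, occurrence, detection).
# The first two rating sets are the ones discussed in the text;
# the third is a hypothetical addition.
failure_modes = [
    ("moderate, occasional, detectable", 5, 5, 4),
    ("catastrophic, remote, undetectable", 10, 1, 10),
    ("minor, frequent, easy to detect", 3, 8, 2),
]


def rpn(fm):
    """Multiplicative score: severity can be diluted by the other two."""
    _, s, o, d = fm
    return s * o * d


def severity_first(fm):
    """Lexicographic key: severity decides, the others only break ties."""
    _, s, o, d = fm
    return (s, o, d)


by_rpn = sorted(failure_modes, key=rpn, reverse=True)
by_severity = sorted(failure_modes, key=severity_first, reverse=True)

print("Ranked by RPN:     ", [label for label, *_ in by_rpn])
print("Ranked by severity:", [label for label, *_ in by_severity])
```

Under RPN the catastrophic failure mode only ties with the moderate one. Under the severity-first key it always comes out on top. A real Action Priority lookup is more nuanced than either ranking, but the sketch shows why multiplication was the wrong operation.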
The Human Factor: Why FMEA Sessions Fail
The FMEA form is a structured document. The FMEA process is a human
conversation. And human conversations are vulnerable to every cognitive
bias known to psychology.
Anchoring on historical data. Teams anchor on what
has failed before, not what could fail next. Past failure data is
invaluable, but it creates a dangerous frame: “We’ve never had a problem
with X.” The Stuttgart fuel injector team had never had a problem with
that sealing surface. That was exactly why they didn’t analyze it.
Authority gradient. When the team lead or senior
engineer dismisses a failure mode, junior team members stop
contributing. The most dangerous failure modes are often the ones
visible only to the people closest to the work — the operators, the
technicians, the new engineers who haven’t yet learned what they’re
“supposed to” ignore.
Completion pressure. FMEA sessions are long,
tedious, and often scheduled against project deadlines. By hour six, the
team is tired, the coffee is cold, and the remaining failure modes get
perfunctory treatment. “Low risk, move on” becomes the default answer
for everything past page twelve.
Groupthink. The team converges on consensus too
quickly. Someone suggests a failure mode, the group agrees it’s
unlikely, and the conversation moves on before anyone has a chance to
say, “But what if…?”
The “We’ll Catch It Later” fallacy. Detection scores
are systematically underestimated. Teams assume that existing controls —
inspections, tests, SPC charts — will catch failures that, in reality,
those controls were never designed to detect. The detection column
becomes a list of aspirations rather than a list of verified
capabilities.
The Detection Trap
The Detection rating is the most misunderstood element of FMEA. It
does not ask “How likely is this failure to occur?” That’s Occurrence.
It asks: “If this failure mode occurs, how likely are your current
controls to detect it before it reaches the customer?”
This distinction is critical. Many teams score Detection based on
whether they have a control in place, not whether that control actually
works. Having an inspection station does not mean the inspection catches
the defect. Having an SPC chart does not mean the chart is monitored.
Having a test does not mean the test is capable.
I once reviewed a PFMEA for a machining line where the team had given
a Detection rating of 2 (very likely to detect) for a burr failure mode
because “the operator visually inspects every part.” When I asked how
often the operator actually caught burrs during the visual inspection,
the quality engineer admitted they didn’t know. They had never measured
the effectiveness of the visual inspection. The Detection rating of 2
was based on the existence of the inspection, not its performance.
Effective FMEA requires that Detection scores be based on evidence:
Gage R&R studies for measurement systems, detection capability
studies for visual inspections, escape rate data for end-of-line tests.
If you can’t prove the control detects the failure, the Detection score
should reflect that uncertainty.
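One way to put evidence behind a visual inspection is a seeded-defect study: mix parts with known defects into normal flow and count how many the inspection catches. The sketch below assumes such a study and reports the observed capture rate with a Wilson score interval. The function name `wilson_interval` and the study numbers are hypothetical, and how a capture rate maps to a Detection rating is left to your own rating table.

```python
import math


def wilson_interval(caught: int, seeded: int, z: float = 1.96):
    """Wilson score interval for a capture rate (z=1.96 is roughly 95%)."""
    if seeded <= 0 or not 0 <= caught <= seeded:
        raise ValueError("need seeded > 0 and 0 <= caught <= seeded")
    p = caught / seeded
    denom = 1 + z**2 / seeded
    centre = (p + z**2 / (2 * seeded)) / denom
    half = z * math.sqrt(p * (1 - p) / seeded + z**2 / (4 * seeded**2)) / denom
    return centre - half, centre + half


# Hypothetical study: 50 parts with known burrs, 41 flagged by the operator.
caught, seeded = 41, 50
low, high = wilson_interval(caught, seeded)
print(f"observed capture rate: {caught / seeded:.0%}")
print(f"plausible range:       {low:.1%} to {high:.1%}")
```

The lower bound is the number to argue about in the session. Even a perfect 50 out of 50 leaves the lower bound at roughly 93%, which is a useful reminder that small studies cannot justify the most optimistic Detection scores.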
FMEA as a Living Document: The Lie Every Organization Tells
Every FMEA manual says the document should be updated throughout the
product lifecycle. Every auditor checks that the FMEA exists. Almost no
organization actually updates it.
The FMEA is treated as a deliverable — a box to check during APQP, a
document to show the customer, a file to store in the quality system.
Once the product launches, the FMEA is archived. When field failures
occur, root cause analysis is performed in isolation, disconnected from
the original FMEA. Lessons are learned but never fed back into the
failure mode database.
This is not a process problem. It is a management commitment problem.
Updating an FMEA after launch requires:
- A trigger mechanism that activates when new failure data arrives
- A responsible owner who is accountable for maintaining the document
- Time allocated in the team’s workload for FMEA maintenance
- A culture that treats the FMEA as a living knowledge repository, not a
  compliance artifact
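The trigger mechanism does not need to be sophisticated. A minimal sketch, assuming field failures and FMEA rows can both be reduced to a (component, failure mode) pair, is a comparison that flags anything the FMEA never listed. The record layout, names, and data here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldFailure:
    report_id: str
    component: str
    failure_mode: str


def unlisted_failures(fmea_rows, field_failures):
    """Return field failures whose (component, failure mode) is not in the FMEA."""
    known = {(c.lower(), m.lower()) for c, m in fmea_rows}
    return [
        f for f in field_failures
        if (f.component.lower(), f.failure_mode.lower()) not in known
    ]


# Hypothetical data for illustration.
fmea_rows = [("injector body", "crack"), ("o-ring", "extrusion")]
field_failures = [
    FieldFailure("W-001", "O-ring", "Extrusion"),
    FieldFailure("W-002", "Sealing surface", "Dimensional drift"),
]

for f in unlisted_failures(fmea_rows, field_failures):
    print(f"FMEA review trigger: {f.report_id} "
          f"({f.component} / {f.failure_mode})")
```

Exact string matching is the weak point of this sketch. In practice the pairing would rely on a controlled vocabulary or part and characteristic identifiers, but the principle is the same: every unmatched field failure opens an FMEA review with a named owner.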
Organizations that master this — and they exist — treat their FMEA
library as institutional memory. When a new product is developed, the
team starts not with a blank form, but with the FMEA from the previous
generation, enriched with field data, warranty claims, and lessons
learned. Each new product stands on the shoulders of every product that
came before it.
The FMEA-SPC Connection That Most Organizations Miss
FMEA tells you what could go wrong. SPC tells you when it’s starting
to go wrong. Together, they form a prevention system that neither can
achieve alone.
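To make "starting to go wrong" concrete, here is a hedged sketch of an individuals control chart applied to a slowly drifting dimension. It uses the standard 2.66 multiplier on the average moving range and one common run rule (eight consecutive points on the same side of the centre line). The data are synthetic and the function names are assumptions for this example.

```python
from statistics import mean


def individuals_limits(baseline):
    """Centre line and control limits for an individuals (I-MR) chart."""
    centre = mean(baseline)
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    # 2.66 = 3 / d2, where d2 = 1.128 for moving ranges of two points.
    spread = 2.66 * mean(moving_ranges)
    return centre - spread, centre, centre + spread


def first_signal(values, lcl, centre, ucl, run_length=8):
    """Index of the first out-of-control signal, or None if there is none."""
    run, side = 0, 0
    for i, v in enumerate(values):
        if v < lcl or v > ucl:
            return i  # point beyond a control limit
        current = (v > centre) - (v < centre)  # +1 above, -1 below, 0 on line
        if current != 0 and current == side:
            run += 1
        else:
            run = 1 if current != 0 else 0
        side = current
        if run >= run_length:
            return i  # too many consecutive points on one side
    return None


# Synthetic deviations from nominal, in microns.
baseline = [0.4, -0.6, 0.2, -0.3, 0.5, -0.4, 0.1, -0.2, 0.3, -0.1]
drifting = [0.05 + 0.1 * k for k in range(24)]  # slow upward drift

lcl, centre, ucl = individuals_limits(baseline)
print("signal with run rule:   ", first_signal(drifting, lcl, centre, ucl))
print("signal with limits only:",
      first_signal(drifting, lcl, centre, ucl, run_length=10**6))
```

With these synthetic numbers the run rule fires well before any single point crosses a control limit. That is the point of pairing SPC with the FMEA: a slow, silent drift is visible long before it becomes a defect, provided someone decided the characteristic was worth charting.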
Every high-priority failure mode identified in the FMEA should
correspond to at least one process parameter being monitored by SPC. The
control plan — which is derived directly from the FMEA — should specify
what is monitored, how it’s monitored, the sampling frequency, and the
reaction plan when the parameter moves out of control.
This chain — FMEA to Control Plan to SPC — is the backbone of
preventive quality. Break any link, and the system collapses. FMEA
without SPC is speculation. SPC without FMEA is monitoring without
understanding. The Control Plan without either is a document without
substance.
Yet in practice, these three documents are often developed by
different people, at different times, with different levels of
engagement. The FMEA is done by the engineering team. The control plan
is written by the quality engineer. The SPC charts are set up by the
process engineer. Nobody connects the dots.
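Connecting the dots can start as a simple cross-check between the three documents. The sketch below assumes each one can be exported to a basic structure: the FMEA as failure mode to priority, the control plan as failure mode to monitored characteristics, and the SPC system as a set of charted characteristics. All names and data are hypothetical.

```python
def unmonitored_failure_modes(fmea, control_plan, spc_charts):
    """High-priority failure modes with no SPC-charted characteristic."""
    gaps = []
    for failure_mode, priority in fmea.items():
        if priority != "High":
            continue
        characteristics = control_plan.get(failure_mode, [])
        if not any(c in spc_charts for c in characteristics):
            gaps.append(failure_mode)
    return gaps


# Hypothetical exports from the three documents.
fmea = {
    "seal leak": "High",
    "injector clog": "High",
    "label misprint": "Low",
}
control_plan = {
    "seal leak": ["sealing surface flatness"],
    "injector clog": ["nozzle bore diameter"],
}
spc_charts = {"nozzle bore diameter"}

print("High-priority failure modes without SPC coverage:",
      unmonitored_failure_modes(fmea, control_plan, spc_charts))
```

A check like this only proves that a chart exists, not that anyone reacts to it. It is still a cheap way to find the broken links in the FMEA to Control Plan to SPC chain before an auditor or a customer does.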
Practical Guide: Running an FMEA That Actually Works
Before the session:
- Define the scope with surgical precision. “The fuel injector assembly”
  is too broad. “The sealing surface between the injector body and the
  O-ring at Station 12” is actionable. Scope determines the quality of
  the analysis.
- Assemble the right team. You need the design engineer, the process
  engineer, the quality engineer, the maintenance technician, and,
  critically, the operator who runs the process. If the person closest to
  the work isn’t in the room, you’re doing FMEA in a vacuum.
- Gather data before you start. Warranty claims, customer complaints,
  internal scrap data, audit findings, lessons learned from similar
  products. The team should walk into the session with evidence, not just
  opinions.
During the session:
- Start with functions, not failures. Define what the system is supposed
  to do before you discuss how it could fail. “The sealing surface must
  maintain a leak-tight seal at 200 bar operating pressure for 150,000
  miles” is a function. “The seal leaks” is a failure mode. The function
  definition determines how many failure modes you identify.
- Use the “Five Whys” for failure causes. Don’t stop at “operator error”
  or “tool wear.” Drill down to the root cause: “Operator mispositions the
  part because the fixture lacks a positive stop, because the fixture was
  designed for a different part geometry, because the design change wasn’t
  communicated to tooling.”
- Challenge every Detection score. Ask: “When was the last time this
  control actually caught this type of failure?” If the answer is “It’s
  never happened,” then you don’t know if the control works. Score it
  accordingly.
- Assign actions with owners and deadlines. “Improve detection” is not an
  action. “Install vision system at Station 15 by March 31, owner: Jan
  Müller, budget: €45,000” is an action. Vague actions are the graveyard
  of FMEA.
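The difference between a vague action and a real one can be enforced mechanically. The following is a minimal sketch, assuming actions are recorded with an owner and a due date; the class `FmeaAction` and its fields are hypothetical, and the year in the example date is an assumption, since the text gives only "March 31".

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FmeaAction:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None

    def missing_fields(self) -> list[str]:
        """List the fields that keep this entry from being a real action."""
        missing = []
        if not (self.owner or "").strip():
            missing.append("owner")
        if self.due is None:
            missing.append("due")
        return missing


vague = FmeaAction("Improve detection")
specific = FmeaAction(
    "Install vision system at Station 15",
    owner="Jan Müller",
    due=date(2025, 3, 31),  # year is an assumption for illustration
)

for action in (vague, specific):
    gaps = action.missing_fields()
    status = "OK" if not gaps else "rejected, missing: " + ", ".join(gaps)
    print(f"{action.description}: {status}")
```

A check like this cannot judge whether the description itself is specific enough, which still takes a human reviewer. It does stop ownerless, undated entries from being saved as actions.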
After the session:
- Track actions to completion. Every action item from the FMEA should be
  tracked in the same system that tracks CAPAs and project milestones. If
  it’s not tracked, it won’t happen.
- Update the FMEA when reality diverges from prediction. Field failures
  that weren’t predicted by the FMEA are the most valuable input you will
  ever receive. They represent the gap between your imagination and
  reality. Close that gap.
- Carry forward to the next project. The FMEA is not just a record of
  what you thought. It is the starting point for what you will think next
  time.
The Cost of Skipping FMEA, or Doing It Badly
Organizations that treat FMEA as a paperwork exercise pay for it in
ways that don’t show up on the FMEA form:
- Warranty costs that could have been prevented by identifying a failure
  mode during design, when the fix costs €1, instead of after launch, when
  the fix costs €100,000.
- Recall costs that destroy not just margins but reputation: the trust
  that takes decades to build and weeks to lose.
- Launch delays caused by late discovery of failure modes that should
  have been identified during APQP, forcing emergency engineering changes
  while the production line waits.
- Audit findings that cascade into customer scorecards, threatening
  business with your most important accounts.
- Human costs: the injuries and worse that result from failure modes that
  a competent FMEA would have identified and mitigated.
The Stuttgart supplier could have prevented €23 million in recall
costs with a single line item on their FMEA. A sealing surface. A junior
engineer’s observation. A moment of intellectual humility from the team
lead.
That’s the return on investment of FMEA done right. Not the
spreadsheet. Not the scoring. The conversation. The willingness to say,
“We don’t know what we don’t know, and that’s exactly why we need to do
this.”
The Paradox at the Heart of FMEA
Here is the deepest truth about Failure Mode and Effects Analysis:
the failure modes that matter most are the ones your team is least
likely to identify.
The known risks — the ones with historical data, the ones that
happened before, the ones everyone agrees are possible — those are easy.
They fill the FMEA form with comforting rows of familiar problems.
The unknown risks — the ones that emerge from new technologies, new
suppliers, new operating conditions, new combinations of factors that
have never coexisted — those are the ones that cause recalls. And those
are the ones that require not just a structured methodology, but a
culture of curiosity, intellectual humility, and relentless
questioning.
FMEA is not a form. It is a mindset. The form captures the output.
The mindset generates the insight.
The organizations that understand this difference are the ones that
don’t make headlines for the wrong reasons.
Peter Stasko is a Quality Architect with 25+ years of experience
transforming organizations across automotive, aerospace, and
pharmaceutical industries. He has led FMEA programs that have prevented
failures most organizations never imagined, and cleaned up after the
ones that happened anyway. His approach integrates behavioral science with quality
engineering to build systems that work with human nature, not against
it.