Quality and Ergodicity: When Your Organization’s Averages Conceal the Failures That Destroy Individual Customers
The Dashboard That Lied
The quality dashboard showed 99.7% on-time delivery. The customer
satisfaction score held steady at 4.2 out of 5. The defect rate sat
comfortably at 0.3% — well within tolerance. By every metric the
executive team tracked, the quality system was performing
beautifully.
Then the letter arrived.
It was from a mid-tier automotive supplier in Bavaria — not their
largest customer, not their smallest, but a steady partner for eleven
years. The letter was three paragraphs long, and every sentence was a
slow-motion disaster. In the past eighteen months, this customer had
received three shipments with incorrect labeling, two deliveries that
arrived late, one batch with dimensional non-conformances that shut down
their assembly line for fourteen hours, and a corrective action request
that went unanswered for forty-seven days.
The quality manager read the letter twice. He pulled up the
dashboards. He ran the numbers. And he realized something that made his
stomach drop: every single number on the dashboard was accurate. The
averages were real. The aggregate performance was exactly what it
claimed to be.
But for this one customer — this single path through the system — the
experience had been catastrophic. And nobody had seen it because the
math they were using was designed to hide it.
This is the ergodicity problem in quality management. And most
organizations don’t even know it exists.
What Ergodicity Actually Means
The concept comes from statistical mechanics, and it sounds more
intimidating than it is. Here’s the core idea in plain language:
An ergodic system is one where the average across the whole group at
a single point in time matches the average any one individual
experiences over a long period. Think of a roulette wheel: if you spin
it a thousand times, your average outcome will be essentially the same
as if a thousand people each spin it once. The time average equals the
ensemble average. The system is ergodic.
A non-ergodic system is one where these two averages
diverge dramatically. Think of Russian roulette. If a thousand people
each play once, the survival rate is 83.3%, which is not bad on paper.
But if one person pulls the trigger six times without re-spinning the
cylinder, their survival rate is zero, and even with a fresh spin each
time it falls to about 33% and keeps shrinking with every further
round. The ensemble average and the time average are completely
different. And the time average is the one that actually kills you.
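The arithmetic behind the Russian roulette comparison is small enough
to check directly. The sketch below assumes a six-chamber revolver with
one loaded chamber and contrasts two assumptions about repeated play:
re-spinning the cylinder before every pull, and spinning it only once.
The function names are illustrative, not taken from any library.

```python
from fractions import Fraction

CHAMBERS = 6  # assumption: six chambers, exactly one loaded


def survival_with_respin(rounds: int) -> Fraction:
    """Chance of surviving `rounds` pulls when the cylinder is re-spun each time."""
    return Fraction(CHAMBERS - 1, CHAMBERS) ** rounds


def survival_without_respin(rounds: int) -> Fraction:
    """Chance of surviving `rounds` pulls when the cylinder is spun only once."""
    return Fraction(max(CHAMBERS - rounds, 0), CHAMBERS)


for rounds in (1, 6, 20):
    print(
        f"{rounds:>2} rounds: "
        f"re-spin {float(survival_with_respin(rounds)):.4f}, "
        f"no re-spin {float(survival_without_respin(rounds)):.4f}"
    )
```

Under either assumption the single-round figure is the same 5 in 6. It
is only the repeated exposure of one player that drives survival
toward zero.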
Most quality systems are non-ergodic. They track ensemble averages —
how the system performs across all customers, all batches, all shifts,
all days. But individual customers experience the time average — what
happens to them over the course of their relationship with you.
And in non-ergodic systems, these two realities can be worlds apart.
Your 99.7% delivery performance might mean that 99.7% of all
shipments arrive on time. But for the customer who happens to be on the
receiving end of three consecutive late shipments, the reality is 0% —
not 99.7%. Their experience is catastrophic. And your dashboard will
never show it.
Why Quality Systems Are Non-Ergodic
Quality systems become non-ergodic for several structural reasons,
and understanding them is the first step toward addressing the
problem.
Correlated failures. In a system where failures are independent, one
defect doesn’t predict the next. But in real manufacturing, failures
cluster. A worn tool produces defective parts in runs. A misaligned
fixture affects every piece until it’s corrected. A fatigued operator
makes progressively worse decisions throughout a long shift. When
failures correlate, the ensemble average looks fine, but individual
paths through the system experience bursts of concentrated misery.
Path dependence. Quality outcomes aren’t independent
events — they build on each other. A supplier quality issue that isn’t
caught at incoming inspection propagates through production, shows up at
final test, and reaches the customer. The next shipment from the same
supplier carries the same risk. The customer’s experience isn’t a random
sample of your overall performance; it’s a chain of dependent events
where one failure increases the probability of the next.
Uneven distribution. Your worst-performing shift,
line, operator, or supplier doesn’t uniformly distribute its failures
across all customers. Some customers get hit repeatedly because of
routing rules, geographic proximity, or production scheduling. These
customers experience a reality that bears no resemblance to your
aggregate metrics.
Feedback loops. When a customer experiences a
quality failure, their behavior changes. They may tighten incoming
inspection, which slows your response time. They may reduce order
quantities, which changes your production scheduling. They may escalate
every minor issue, consuming disproportionate resources from your
quality team. The initial failure triggers a cascade that makes
subsequent failures more likely — for that specific customer.
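A small simulation can make the effect of these mechanisms concrete.
The sketch below is an illustration under invented assumptions, not a
model of any real plant: 2,000 hypothetical customers each receive 50
shipments, the long-run failure rate is 3% in both scenarios, and in
the "clustered" scenario a failed shipment is followed by another
failure 60% of the time. All names and parameters are made up for the
example.

```python
import random


def simulate(n_customers, n_shipments, p_fail_after_ok, p_fail_after_fail, seed):
    """Simulate per-customer shipment histories (True = failed shipment)
    with a two-state Markov chain."""
    rng = random.Random(seed)
    # Long-run failure rate implied by the two transition probabilities.
    stationary = p_fail_after_ok / (1 - p_fail_after_fail + p_fail_after_ok)
    histories = []
    for _ in range(n_customers):
        failed = rng.random() < stationary
        history = [failed]
        for _ in range(n_shipments - 1):
            p = p_fail_after_fail if failed else p_fail_after_ok
            failed = rng.random() < p
            history.append(failed)
        histories.append(history)
    return histories


def longest_failure_run(history):
    """Length of the longest streak of consecutive failures."""
    longest = current = 0
    for failed in history:
        current = current + 1 if failed else 0
        longest = max(longest, current)
    return longest


def summarize(histories):
    """Return (aggregate failure rate, customers with 3+ failures in a row)."""
    total = sum(len(h) for h in histories)
    rate = sum(sum(h) for h in histories) / total
    hit = sum(1 for h in histories if longest_failure_run(h) >= 3)
    return rate, hit


RATE = 0.03    # assumed long-run failure rate in both scenarios
STICKY = 0.6   # assumed chance that a failure is followed by another

independent = simulate(2000, 50, RATE, RATE, seed=1)
clustered = simulate(2000, 50, RATE * (1 - STICKY) / (1 - RATE), STICKY, seed=1)

ind_rate, ind_hit = summarize(independent)
clu_rate, clu_hit = summarize(clustered)
print(f"independent: rate={ind_rate:.3%}, customers with 3 in a row={ind_hit}")
print(f"clustered:   rate={clu_rate:.3%}, customers with 3 in a row={clu_hit}")
```

Both scenarios would show nearly the same number on an aggregate
dashboard. The difference only appears when the histories are read
customer by customer.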
The Insurance Illusion
Here’s where ergodicity becomes genuinely dangerous: most
organizations treat their quality metrics as if they were insurance
policies. They believe that an acceptable average rate protects
everyone.
Consider a pharmaceutical manufacturer with a sterility assurance
level of 10⁻⁶, meaning a probability of no more than one non-sterile
unit per million. On paper, this is exceptional. The ensemble average is
world-class. But sterility failures don’t occur independently, unit by
unit. A single bioburden excursion in the water system can contaminate
an entire batch: thousands of units going to thousands of patients. For
the patients who receive those units, the failure rate isn’t one in a
million. It’s one in one. One hundred percent.
The average didn’t protect them. The average wasn’t designed to
protect them. The average was designed to make the organization feel
safe while specific paths through the system carried catastrophic
risk.
This isn’t a hypothetical. Many of the best-known quality disasters,
from the Ford Pinto to the Takata airbags to the Boeing 737 MAX, looked
acceptable on aggregate metrics right up until the moment they didn’t.
The organizations weren’t ignoring data. They were looking at the wrong
data. They were tracking ensemble averages in a non-ergodic system and
believing those averages told them something about individual risk.
How to Spot the Ergodicity Gap
You don’t need advanced mathematics to identify whether your quality
system has an ergodicity problem. You need to ask a different set of
questions.
Instead of asking “What is our average defect rate?” ask
“What does the worst customer’s experience look like?” Not the
worst customer by revenue or strategic importance, but the worst
customer by accumulated quality failures. Find the customer who has
experienced the most defects, the most late shipments, the most
corrective actions. Look at their experience as a continuous path, not
as isolated incidents. What you see may surprise you.
Instead of asking “What percentage of batches pass
inspection?” ask “What is the longest consecutive run of batches that
passed without a single failure?” When failures are independent, the
lengths of success runs follow a predictable (geometric) pattern. When
they are not, success runs cluster and failure runs cluster, and the
clusters tell you where your hidden risks are.
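One way to look at runs is sketched below. It splits a pass/fail
sequence into consecutive runs and compares the observed average
failure-run length with what independent failures at the same rate
would give, which is 1 / (1 - failure rate). The `batches` sequence is
invented purely for illustration.

```python
from itertools import groupby


def run_lengths(results):
    """Split a pass/fail sequence (True = pass) into lengths of consecutive runs."""
    pass_runs, fail_runs = [], []
    for passed, group in groupby(results):
        length = sum(1 for _ in group)
        (pass_runs if passed else fail_runs).append(length)
    return pass_runs, fail_runs


# Hypothetical inspection history: 50 batches, 5 failures in two clusters.
batches = [True] * 12 + [False] * 3 + [True] * 20 + [False] * 2 + [True] * 13

pass_runs, fail_runs = run_lengths(batches)
fail_rate = batches.count(False) / len(batches)

# If failures were independent, a failure run would continue with
# probability fail_rate, so its mean length would be 1 / (1 - fail_rate).
expected_fail_run = 1 / (1 - fail_rate)
observed_fail_run = sum(fail_runs) / len(fail_runs)

print(f"longest pass run: {max(pass_runs)}")
print(f"mean failure run: observed {observed_fail_run:.2f}, "
      f"expected if independent {expected_fail_run:.2f}")
```

An observed failure-run length well above the independent expectation
is a hint, not proof, that failures are arriving in clusters.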
Instead of asking “How many CAPAs did we close this quarter?”
ask “What is the probability that a customer who experiences one failure
will experience another within the next three shipments?” This
is a conditional probability question, and it reveals the correlation
structure that your aggregate metrics hide. If the probability goes up
after a first failure, failures are clustering on specific customers,
the system is behaving non-ergodically, and your averages are
misleading you.
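The conditional probability can be estimated from per-customer
shipment histories. The following is a minimal sketch with made-up
data: `histories` maps hypothetical customer names to shipment
outcomes (1 = failure), and the baseline assumes failures are
independent at the overall rate.

```python
def repeat_failure_probability(histories, window=3):
    """Estimate P(another failure within `window` shipments | a failure occurred).

    Failures too close to the end of a history to have a full window
    of following shipments are skipped.
    """
    followed = eligible = 0
    for history in histories.values():
        for i, failed in enumerate(history):
            following = history[i + 1 : i + 1 + window]
            if failed and len(following) == window:
                eligible += 1
                followed += any(following)
    return followed / eligible if eligible else float("nan")


def baseline_probability(histories, window=3):
    """P(at least one failure in `window` shipments) if failures were independent."""
    shipments = [outcome for history in histories.values() for outcome in history]
    rate = sum(shipments) / len(shipments)
    return 1 - (1 - rate) ** window


# Hypothetical histories, oldest shipment first (1 = failure, 0 = ok).
histories = {
    "customer_a": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "customer_b": [0, 0, 1, 0, 1, 1, 0, 0, 0, 0],
    "customer_c": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "customer_d": [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
}

conditional = repeat_failure_probability(histories)
baseline = baseline_probability(histories)
print(f"after a failure: {conditional:.2f}, independent baseline: {baseline:.2f}")
```

With real data the sample of failures is often small, so the gap
between the two numbers is best read as a prompt to investigate rather
than a precise estimate.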
Instead of asking “What is our overall customer satisfaction
score?” ask “How many customers have satisfaction scores below our
threshold, and how long have they been there?” Aggregate
satisfaction scores are ensemble averages. Individual customer
trajectories reveal whether some customers are on a path to defection
that the average completely obscures.
Building an Ergodic Quality System
You cannot make a quality system fully ergodic — the real world
doesn’t work that way. Failures will cluster. Paths will diverge. Some
customers will have worse experiences than others. But you can build a
system that acknowledges this reality and accounts for it.
Track customer-level trajectories. Don’t just
monitor aggregate metrics. Build a system that tracks the cumulative
quality experience of every customer over time. Flag customers whose
trajectory is deteriorating — even if the aggregate numbers look fine. A
customer who has experienced three quality events in six months is on a
different path than one who has experienced three events in three years,
even if both contribute equally to your overall defect rate.
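A trajectory flag of this kind can be quite simple. The sketch below
assumes quality events are available as (customer, date) pairs and
flags any customer with three or more events inside any 180-day
window. The threshold, the window, and the customer names are
illustrative assumptions, not recommended values.

```python
from collections import defaultdict
from datetime import date, timedelta


def flag_deteriorating(events, threshold=3, window_days=180):
    """Return customers with at least `threshold` events inside any window."""
    by_customer = defaultdict(list)
    for customer, day in events:
        by_customer[customer].append(day)

    window = timedelta(days=window_days)
    flagged = set()
    for customer, days in by_customer.items():
        days.sort()
        # Slide over each group of `threshold` consecutive events and
        # check whether the first and last fall inside one window.
        for i in range(len(days) - threshold + 1):
            if days[i + threshold - 1] - days[i] <= window:
                flagged.add(customer)
                break
    return flagged


# Hypothetical event log: both customers have exactly three events.
events = [
    ("customer_x", date(2024, 1, 10)),
    ("customer_x", date(2024, 3, 2)),
    ("customer_x", date(2024, 5, 20)),
    ("customer_y", date(2022, 2, 1)),
    ("customer_y", date(2023, 3, 15)),
    ("customer_y", date(2024, 4, 1)),
]

print(flag_deteriorating(events))
```

Both customers add the same three events to the aggregate count, but
only the one whose events fall within six months is flagged.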
Decouple failure paths. If a tool wears out and
produces defective parts, the defect doesn’t just affect one customer —
it affects every customer whose product ran through that tool.
Redundancy in critical processes, diverse sourcing for key materials,
and production rotation that distributes risk across multiple paths can
reduce the correlation that makes your system non-ergodic.
Design for the tail, not the mean. Most quality
standards are designed around acceptable average performance. But in a
non-ergodic system, the tail — the worst-case path — is what destroys
reputations, triggers recalls, and ends careers. Your quality system
should have specific controls designed to limit the severity of the
worst individual experience, not just maintain an acceptable
average.
Implement survival metrics. Instead of tracking
defect rates, track customer survival rates — the percentage of
customers who have experienced zero quality failures over rolling time
windows. This metric is much closer to the time average that individual
customers actually experience, and it reveals problems that defect rate
averages completely hide.
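As a rough sketch of such a metric, the function below computes the
share of customers with no recorded quality failure in a trailing
window ending on a given date. The customer list, the event log, and
the window lengths are hypothetical.

```python
from datetime import date, timedelta


def failure_free_share(customers, events, as_of, window_days=365):
    """Share of customers with zero failure events in the trailing window."""
    start = as_of - timedelta(days=window_days)
    hit = {customer for customer, day in events if start < day <= as_of}
    return sum(1 for customer in customers if customer not in hit) / len(customers)


# Hypothetical data: four customers, four failure events.
customers = ["A", "B", "C", "D"]
events = [
    ("B", date(2024, 2, 1)),
    ("B", date(2024, 3, 1)),
    ("B", date(2024, 4, 1)),
    ("D", date(2023, 1, 15)),
]

one_year = failure_free_share(customers, events, as_of=date(2024, 6, 30))
two_years = failure_free_share(customers, events, as_of=date(2024, 6, 30),
                               window_days=730)
print(f"failure-free over 1 year: {one_year:.0%}, over 2 years: {two_years:.0%}")
```

Recomputing the share on a schedule, for example monthly, gives the
rolling view. Note that three of the four events land on one customer,
which a per-shipment defect rate would not reveal.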
Recognize absorbing states. In non-ergodic systems,
some states are absorbing — once you enter them, you can’t leave. A
customer who has a catastrophic quality failure may never trust you
again, regardless of how much you improve. A product recall creates a
permanent record. A safety failure creates a permanent injury. Your
quality system should be designed to prevent entry into these absorbing
states with much greater rigor than your aggregate metrics would suggest
is necessary.
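The effect of an absorbing state can be sketched with a three-state
Markov chain. The states and monthly transition probabilities below
are invented for illustration only: a customer relationship is
"healthy", "strained", or "lost", and "lost" can never be left.

```python
STATES = ("healthy", "strained", "lost")

# Hypothetical monthly transition probabilities; each row sums to 1.
# The last row makes "lost" absorbing: once there, the customer stays there.
TRANSITIONS = [
    [0.97, 0.03, 0.00],  # from healthy
    [0.30, 0.60, 0.10],  # from strained
    [0.00, 0.00, 1.00],  # from lost
]


def step(distribution):
    """Advance the state distribution by one period."""
    n = len(STATES)
    return [
        sum(distribution[i] * TRANSITIONS[i][j] for i in range(n))
        for j in range(n)
    ]


distribution = [1.0, 0.0, 0.0]  # every customer starts healthy
lost_by_period = []
for _ in range(36):
    distribution = step(distribution)
    lost_by_period.append(distribution[2])

for month in (1, 12, 24, 36):
    print(f"month {month:>2}: share lost so far = {lost_by_period[month - 1]:.3f}")
```

Because nothing ever flows back out of the absorbing state, the lost
share can only grow, even while the month-to-month picture for the
remaining customers looks stable.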
The Story Behind the Numbers
Let me return to the Bavarian supplier. After the letter arrived, the
quality manager did something he’d never done before: he reconstructed
the entire customer experience as a continuous narrative rather than a
collection of isolated incidents.
The labeling errors had started when a new operator was trained on
the packaging line without adequate work instructions. The late
deliveries were a consequence of the labeling errors — each labeling
failure triggered a containment action that delayed the next shipment.
The dimensional non-conformances came from a tool that had worn past its
replacement threshold but hadn’t been flagged because the replacement
schedule was based on piece count, not actual tool condition. And the
unanswered corrective action request had fallen into a gap between two
quality engineers who both thought the other was responsible.
Every single failure was connected. Each one increased the
probability of the next. The customer’s experience wasn’t a random
sample of the organization’s overall quality — it was a cascade of
correlated failures that the aggregate metrics had rendered
invisible.
The corrective action wasn’t to improve the average. The average was
already excellent. The corrective action was to redesign the system so
that no single customer could fall into a failure cascade without being
detected and rescued. That meant customer-level monitoring, failure
correlation analysis, and a fundamentally different way of thinking
about what quality metrics are supposed to tell you.
The Deeper Implications
The ergodicity problem in quality management isn’t just about
metrics. It’s about a fundamental misunderstanding of what quality
means.
Most organizations define quality as conformance to requirements,
measured in aggregate. If 99.8% of your output meets specifications, you
have a 99.8% quality rate. This definition is clean, measurable, and
wrong — or at least incomplete.
Quality isn’t experienced in the aggregate. It’s experienced one
customer, one product, one moment at a time. The customer who receives
the defective unit doesn’t experience your 99.8% quality rate. They
experience 0%. And if they receive two defective units, they don’t
experience 99.6%. They experience betrayal.
This isn’t sentimentality. It’s mathematics. In non-ergodic systems,
the relevant metric for any individual actor is not the ensemble average
but the time average — the accumulated experience of their specific path
through the system. And when failure paths correlate, when outcomes are
path-dependent, when absorbing states exist, the time average can be
catastrophically worse than the ensemble average for the unlucky
few.
The organizations that understand this — truly understand it, not
just acknowledge it in a management review presentation — design their
quality systems differently. They don’t just track averages. They track
individual trajectories. They don’t just prevent average failures. They
prevent catastrophic paths. They don’t just aim for acceptable aggregate
performance. They aim to ensure that no single customer’s experience can
deteriorate below a survivable threshold without the system detecting it
and intervening.
The Question You Should Be Asking
The next time you review your quality dashboard and everything looks
green, ask yourself one question:
“Is there a customer out there right now whose experience looks
nothing like this dashboard — and if so, will I find out before or after
they find someone else?”
In an ergodic system, the answer would be reassuring. In the real
world — in your factory, with your processes, serving your customers —
the system is almost certainly non-ergodic. Which means the answer
should make you uncomfortable enough to start looking at your data
differently.
Not at the average. At the paths. Not at the aggregate. At the
individual trajectories. Not at what’s happening to everyone on average,
but at what’s happening to someone in particular.
Because the failure that destroys your reputation won’t show up in
your average. It will show up in one customer’s experience. And by the
time it does, the path they’ve been on has already carried them
somewhere your dashboards were never designed to see.
Peter Stasko is a Quality Architect with 25+ years
of experience transforming organizations across automotive, aerospace,
and pharmaceutical industries. He specializes in building quality
systems that don’t just look good on dashboards — they protect every
customer, every time.