Design of Experiments: When Your Systematic Experimentation Becomes a Statistical Ritual Nobody Understands — and the Factors You Optimized Became the Interactions You Never Tested

You know the scenario. A process is producing defects. An engineer
proposes a designed experiment to identify the root causes. Management
agrees, because they have heard that DOE is the “right way” to solve
problems. A two-level full factorial design is set up. Data is
collected. A software package generates p-values, main effects plots,
and interaction plots. The team identifies three significant factors.
Control plans are updated. And six months later, the defect rate has
barely moved.

What went wrong? The experiment was designed correctly. The
statistics were sound. The software was legitimate. The team followed
every step in the textbook. And yet the answer they found was not the
answer they needed — because the answer they needed was hiding in the
interaction they never tested, the factor they held constant out of
convenience, or the noise variable they never measured because their
design matrix did not include it.

Design of Experiments is one of the most powerful tools in quality
engineering. It is also one of the most systematically misused. Not
because people calculate the wrong statistics, but because they approach
experimentation with the wrong mental model — treating DOE as a
statistical ritual that produces answers rather than a structured
learning process that generates understanding.

The Promise of DOE

Let us start with what DOE is supposed to do. When you have a process
with multiple input variables (factors) and you want to understand which
ones affect the output (response), you have two options. The first is
One-Factor-At-A-Time (OFAT) experimentation: change one variable, hold
everything else constant, measure the result, then change the next
variable. The second is DOE: change multiple variables simultaneously
according to a structured matrix, and use statistics to separate the
individual effects and their interactions.

DOE is objectively superior to OFAT. This is not opinion — it is
mathematical fact. OFAT cannot detect interactions between factors. If
temperature and pressure interact in a way that high temperature only
causes defects at high pressure, OFAT will never find this, because it
tests temperature at one pressure level and pressure at one temperature
level. DOE tests all combinations (in a factorial design) or a
structured subset (in a fractional factorial), making interactions
visible.
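A minimal sketch of the temperature and pressure case, using an invented defect model (the `defects` function and its numbers are illustrative assumptions, not data from a real process). It compares what OFAT sees from a low/low baseline with what a 2x2 factorial estimates.

```python
from itertools import product

LOW, HIGH = -1, 1

def defects(temp, pressure):
    """Hypothetical defect count: high temperature hurts only at high pressure."""
    return 2 + (10 if temp == HIGH and pressure == HIGH else 0)

# OFAT: start at the low/low baseline and change one factor at a time.
baseline = defects(LOW, LOW)
ofat_temp_effect = defects(HIGH, LOW) - baseline       # pressure held low
ofat_pressure_effect = defects(LOW, HIGH) - baseline   # temperature held low

# 2x2 full factorial: run every combination of the coded levels.
runs = [(t, p, defects(t, p)) for t, p in product((LOW, HIGH), repeat=2)]

def effect(contrast):
    """Mean response where the contrast is +1 minus mean where it is -1."""
    plus = [y for t, p, y in runs if contrast(t, p) == 1]
    minus = [y for t, p, y in runs if contrast(t, p) == -1]
    return sum(plus) / len(plus) - sum(minus) / len(minus)

factorial_temp_effect = effect(lambda t, p: t)
factorial_pressure_effect = effect(lambda t, p: p)
interaction_effect = effect(lambda t, p: t * p)

print("OFAT effects:", ofat_temp_effect, ofat_pressure_effect)
print("Factorial effects:", factorial_temp_effect,
      factorial_pressure_effect, interaction_effect)
```

From this baseline OFAT reports zero for both factors, while the factorial reports a nonzero interaction. An OFAT study started from a different baseline would see a change, but it still could not attribute that change to the combination of the two factors.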

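DOE also delivers more precision per experimental run. A full factorial
with 5 factors at 2 levels requires 32 runs. Testing each factor at 2
levels individually would require only 10 runs, but it would give you no
information about interactions and far less precise estimates of main
effects: each OFAT effect rests on a single pair of runs, while each
factorial main effect compares the average of 16 runs against the
average of the other 16. Matching that precision with OFAT would take
roughly three times as many runs as the factorial. For detecting real
effects with adequate statistical power, DOE is dramatically more
efficient.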
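The run-count comparison can be made concrete with a little arithmetic. The sketch below assumes independent runs with a common error standard deviation `sigma`, and an OFAT plan that shares a single baseline condition across all factors. Both are simplifying assumptions for illustration.

```python
def effect_variance(runs_per_level, sigma=1.0):
    """Variance of a difference between two means of `runs_per_level` runs each."""
    return 2 * sigma**2 / runs_per_level

k = 5                                   # number of two-level factors

# Full factorial: all 32 runs contribute to every main effect (16 vs 16).
factorial_runs = 2**k
factorial_var = effect_variance(factorial_runs // 2)

# OFAT with one run per level per factor: each effect is one run minus another.
ofat_runs_single = 2 * k
ofat_var_single = effect_variance(1)

# OFAT runs needed to match the factorial's precision:
# 16 runs at the baseline plus 16 runs at each of the k changed settings.
runs_per_condition = factorial_runs // 2
ofat_runs_matched = (k + 1) * runs_per_condition

print(factorial_runs, factorial_var)        # 32 runs, variance 0.125 * sigma^2
print(ofat_runs_single, ofat_var_single)    # 10 runs, variance 2.0 * sigma^2
print(ofat_runs_matched)                    # 96 runs for equal precision
```

Under these assumptions the 10-run OFAT plan is cheaper but sixteen times noisier per effect, and buying that precision back costs 96 runs against the factorial's 32.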

This is the promise. The reality in most manufacturing organizations
is something quite different.

How DOE Actually Plays Out in Practice

Here is what typically happens. An engineer — possibly trained to
Green Belt or Black Belt level — designs an experiment. They use
software like Minitab, JMP, or even Python libraries to generate the
design matrix. They identify factors based on engineering knowledge,
historical data, or brainstorming sessions. They select response
variables. They determine factor levels (usually the current setting and
a proposed new setting). They calculate the number of runs needed and
the resources required.

So far, so good. The problems begin at the execution stage.

Problem 1: Convenience Sampling of Factors

The factors chosen for the experiment are rarely the factors most
likely to be the root cause. They are the factors that are easy to
manipulate, easy to measure, and politically safe to question. If raw
material supplier is a likely contributor to defects but questioning the
supplier would create commercial tension, it is excluded from the
experiment and listed under “held constant.” If operator skill level is
suspected but testing it would require cross-training and schedule
disruption, it is declared a “noise variable” and randomized away rather
than studied.

The result is an experiment that is statistically valid within the
factor space it examines but practically irrelevant because the factor
space does not include the variables that actually drive the problem.
The team runs a beautiful experiment, gets clean significance results,
and optimizes factors that collectively account for 20% of the variation
while the factor responsible for 60% of the variation was excluded
before the first run was conducted.

Problem 2: Two Levels Are Not Enough

Most industrial DOE uses two-level designs (typically a 2^k factorial
or a 2^(k-p) fractional factorial). Two levels are efficient — they
minimize runs while estimating main effects and two-factor interactions.
But two-level designs assume that the relationship between factor and
response is linear across the tested range. If there is curvature — a
quadratic relationship where the optimum is in the middle of the range —
a two-level design will miss it entirely.

You get a significant main effect, conclude that higher is better (or
lower is better), and update your process settings accordingly. But the
true optimum was at the center of your range, and you moved away from
it. The experiment was not wrong — the design was inadequate for
detecting the curvature that existed. You needed center points or a
Response Surface Methodology (RSM) design, but those require more runs
and more time, and the project deadline did not allow for a second phase
of experimentation.
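A small sketch of how this plays out, using a made-up response curve with its peak near the center of the coded range (the `response` function and its coefficients are assumptions chosen only to illustrate the effect). The two-level comparison sees a modest upward slope, while a single center point exposes the curvature.

```python
def response(x):
    """Hypothetical yield with a peak near the center of the coded range [-1, +1]."""
    return 90 - 20 * x**2 + 3 * x

low, high, center = response(-1), response(1), response(0)

# What a two-level design estimates: the difference between the two ends.
main_effect = high - low

# Center-point check: compare the center response with the average of the ends.
factorial_mean = (low + high) / 2
curvature = center - factorial_mean

print("main effect:", main_effect)   # positive, so "higher is better"
print("curvature:", curvature)       # large, so the straight-line model is wrong
```

In a real experiment the curvature estimate would be judged against the run-to-run noise, which is why center points are usually replicated.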

Problem 3: Interaction Blindness in Fractional Designs

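Fractional factorial designs are popular because they reduce the
number of runs. A 2^(5-1) design gives you 16 runs instead of 32 for
five factors. The trade-off is that effects become confounded with one
another. In the usual half fraction of five factors, main effects are
confounded with four-factor interactions and two-factor interactions
with three-factor interactions; in more aggressive fractions, two-factor
interactions are confounded with other two-factor interactions or even
with main effects. This is called aliasing or confounding.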

If you use a Resolution III design (the most economical), your main
effects are confounded with two-factor interactions. This means that
what you think is a significant main effect might actually be an
interaction between two other factors. If you use a Resolution IV
design, main effects are clear of two-factor interactions, but
two-factor interactions are confounded with each other — you know an
interaction exists, but you cannot tell which pair of factors is
responsible.
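The alias structure can be worked out by hand before any run is made. The sketch below is a simplified illustration, not the output of any particular DOE package: it multiplies effect "words" by the defining relation, with repeated letters cancelling. The helper names `multiply` and `aliases` are invented for this example, and `defining_words` must hold every word of the defining relation, which for a half fraction is a single word.

```python
def multiply(word_a, word_b):
    """Product of two effect words; a letter appearing twice cancels (A*A = I)."""
    return "".join(sorted(set(word_a) ^ set(word_b))) or "I"

def aliases(effect, defining_words):
    """Effects that are indistinguishable from `effect` in the fraction."""
    products = (multiply(effect, word) for word in defining_words)
    return sorted(products, key=lambda w: (len(w), w))

res3 = ["ABC"]      # 2^(3-1) half fraction, I = ABC   (Resolution III)
res4 = ["ABCD"]     # 2^(4-1) half fraction, I = ABCD  (Resolution IV)
res5 = ["ABCDE"]    # 2^(5-1) half fraction, I = ABCDE (Resolution V)

print("Res III, A  is aliased with", aliases("A", res3))    # a two-factor interaction
print("Res IV,  A  is aliased with", aliases("A", res4))    # a three-factor interaction
print("Res IV,  AB is aliased with", aliases("AB", res4))   # another two-factor interaction
print("Res V,   AB is aliased with", aliases("AB", res5))   # a three-factor interaction
```

Reading this list for every effect you intend to interpret takes a few minutes and shows which "significant factor" could in fact be something else.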

In practice, most engineers using DOE software do not carefully
examine the alias structure. They run the design the software suggests,
look at the Pareto chart of effects, and proceed with the significant
ones. If the aliasing means their “significant factor” is actually a
confounded interaction, they optimize the wrong variable and then wonder
why the process does not improve.

Problem 4: Measurement System Inadequacy

DOE assumes that your measurement system can detect the differences
your process produces. If your gauge R&R is poor — if measurement
variation is a significant fraction of total observed variation — then
the signal from your experiment is buried in measurement noise. You
either fail to detect real effects (Type II error) or detect phantom
effects that are actually measurement artifacts (Type I error).

This is the most under-discussed failure mode of DOE. Teams spend
days designing experiments, weeks running them, and hours analyzing
results without ever verifying that their measurement system can
reliably distinguish between the factor levels they have chosen. A
measurement system with 40% gauge R&R can turn a well-designed
experiment into a random number generator.
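A rough sketch of why this matters, assuming gauge R&R is expressed as a percentage of total observed standard deviation (one common convention) and that gauge and process variation are independent. The numbers are hypothetical and the function names are invented for this example.

```python
import math

def grr_percent(sigma_gauge, sigma_process):
    """Gauge R&R as a percentage of total observed standard deviation."""
    sigma_total = math.hypot(sigma_gauge, sigma_process)
    return 100 * sigma_gauge / sigma_total

def effect_standard_error(sigma, n_runs):
    """Standard error of a two-level effect estimated from n_runs in total."""
    return 2 * sigma / math.sqrt(n_runs)

# Hypothetical standard deviations, in the units of the response.
sigma_process, sigma_gauge = 4.0, 3.0
sigma_total = math.hypot(sigma_gauge, sigma_process)

pct = grr_percent(sigma_gauge, sigma_process)
se_with_gauge = effect_standard_error(sigma_total, 16)
se_perfect_gauge = effect_standard_error(sigma_process, 16)

print(f"gauge R&R: {pct:.0f}% of total variation")
print(f"effect standard error, 16 runs: {se_with_gauge:.2f} "
      f"(would be {se_perfect_gauge:.2f} with a perfect gauge)")
```

The wider the standard error, the larger a real effect has to be before the experiment can separate it from noise. Running this kind of calculation before choosing factor levels shows whether the planned design has a realistic chance of seeing the effect you care about.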

Problem 5: Confirmation Runs That Confirm Nothing

The textbook DOE process includes a confirmation run — a set of
trials at the optimal settings predicted by the experiment, to verify
that the predicted improvement actually materializes. In theory, this is
excellent practice. In reality, confirmation runs are often conducted
under different conditions than the original experiment.

The original experiment was run on a specific machine, with a
specific batch of material, on a specific shift, with a specific ambient
temperature. The confirmation run is conducted weeks later, on a
different machine (because the original was in production), with a
different material batch (because the original was consumed), on a
different shift (because the original operator was reassigned). When the
confirmation run “confirms” the result, the team declares victory — but
they have confirmed nothing, because the noise variables that changed
between experiment and confirmation could have masked or mimicked the
factor effects.

Then the optimized settings are rolled out to production, and the
defect rate stays the same, and nobody can explain why.

The Deeper Failure: DOE as Answer Machine

All of these technical failures share a common root: the belief that
DOE is a machine that produces answers. You put in a design matrix and
data, and you get out p-values and optimal settings. The thinking — the
engineering judgment, the process knowledge, the iterative learning — is
treated as overhead that delays the experiment rather than the essential
ingredient that makes the experiment meaningful.

This is backwards. DOE is not an answer machine. It is a structured
method for asking better questions. Each experiment should narrow your
understanding — eliminating some hypotheses, supporting others, and
generating new questions that the next experiment will address. The
power of DOE is not in any single experiment but in the sequence: screen
first to identify important factors from many candidates, then follow up
with a focused design to characterize those factors and their
interactions, then use response surface methods to find the optimum.

This sequential approach requires something that most quality
organizations do not have: patience and a culture that values
understanding over speed. When the mandate is “find the root cause by
Friday,” DOE becomes a single-shot exercise that produces a
statistically defensible answer regardless of whether that answer is
correct.

What Good DOE Practice Looks Like

Organizations that use DOE effectively share several
characteristics:

They invest in measurement first. Before any
experiment is designed, the measurement system is evaluated. If gauge
R&R is above 10% (or above 30% for less critical applications), the
measurement system is improved before experimentation begins. This is
unglamorous work, but without it, every subsequent result is
suspect.

They screen before they optimize. Rather than
throwing all factors into one large design, they start with a screening
design (a Plackett-Burman or a low-resolution fractional factorial) to
eliminate the obviously insignificant factors. Then they run a second,
focused design on the 3-5 factors that matter, with enough resolution to
detect interactions. Then, if curvature is suspected, they augment with
center points or axial points to enable response surface analysis.

They randomize properly. Randomization is not
optional. It is the mechanism that protects against time-varying noise
variables contaminating the factor effects. Running the design matrix in
standard order (Run 1, Run 2, Run 3…) because it is easier to set up
means that any drift in temperature, material properties, or operator
fatigue over the course of the experiment will be systematically
correlated with the factors — destroying the validity of the
results.
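To see the drift problem in numbers, the sketch below builds a 2^3 design in a fixed systematic order and asks what "effect" each factor would show if the response were nothing but a steady drift over time (modeled here, as an assumption, by the run index itself). It then shuffles the run order with a seeded generator; the seed is there only to make the run sheet reproducible.

```python
import random
from itertools import product

# 2^3 design in a systematic order: the first factor changes slowest.
standard_order = list(product((-1, 1), repeat=3))

def drift_bias(run_order):
    """Apparent effect of each factor when the response is pure linear drift."""
    biases = []
    for j in range(3):
        plus = [i for i, run in enumerate(run_order) if run[j] == 1]
        minus = [i for i, run in enumerate(run_order) if run[j] == -1]
        biases.append(sum(plus) / len(plus) - sum(minus) / len(minus))
    return biases

rng = random.Random(42)                 # fixed seed for a reproducible run sheet
randomized_order = standard_order[:]
rng.shuffle(randomized_order)

print("systematic order bias:", drift_bias(standard_order))
print("randomized order bias:", drift_bias(randomized_order))
```

In the systematic order the slowest-changing factor absorbs most of the drift and looks like a real effect. A single shuffle does not guarantee zero bias, but it makes the bias a matter of chance rather than a built-in feature of the run sheet.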

They replicate. Replication means running the entire
design (or at least the center points) more than once. This provides an
estimate of pure error, which allows you to test for lack of fit —
whether your model adequately describes the data or whether there are
effects (like curvature) that the model misses. Without replication, you
cannot distinguish between “no significant effects” and “inadequate
experimental precision.”
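Pure error is simple to compute once replicates exist. A minimal sketch with invented data for a replicated 2x2 design: within each factor combination, deviations from the cell mean can only come from noise, so pooling them gives an error estimate that does not depend on any model.

```python
# Hypothetical data: (A, B) coded levels -> responses from repeated runs
replicates = {
    (-1, -1): [10, 12],
    (1, -1): [15, 17],
    (-1, 1): [11, 13],
    (1, 1): [20, 22],
}

ss_pure_error = 0.0   # sum of squared deviations from each cell mean
df_pure_error = 0     # one degree of freedom per extra run in each cell
for ys in replicates.values():
    cell_mean = sum(ys) / len(ys)
    ss_pure_error += sum((y - cell_mean) ** 2 for y in ys)
    df_pure_error += len(ys) - 1

ms_pure_error = ss_pure_error / df_pure_error   # pooled variance estimate

print("pure error SS:", ss_pure_error)
print("pure error df:", df_pure_error)
print("pure error MS:", ms_pure_error)
```

The residual variation left over after fitting a model can then be compared against this mean square; a residual much larger than pure error suggests the model is missing something.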

They verify with production-scale confirmation. The
confirmation run is not a formality. It is conducted under production
conditions — same machines, same materials, same operators, same
environment — and it is run long enough to capture the normal variation
of the process. If the confirmation run spans only a few hours, it has
no statistical power to detect anything but the most dramatic
improvements.

They document what they learned, not just what they
decided. Every DOE generates information beyond the immediate
decision. Factors that were not significant are still valuable to know.
Interactions that were detected but not fully characterized are starting
points for future experiments. The experimental record is a knowledge
asset that accumulates over years — if it is preserved.

The Organizational Dimension

The technical failures of DOE are well-documented in statistics
textbooks. The organizational failures are not, because they are not
statistics problems — they are management problems.

DOE requires resources: machine time, material, operator hours,
engineering time. In organizations where production schedule is king,
securing these resources is a political battle. The compromise is
typically a reduced design — fewer runs, fewer factors, fewer replicates
— that fits within the available window but compromises the statistical
integrity of the experiment.

DOE requires cross-functional collaboration. The engineer designing
the experiment needs input from operators (who know which factors are
practically important), maintenance (who knows which machine settings
drift), and quality (who knows the measurement system capabilities). In
organizations with strong functional silos, this collaboration does not
happen, and the experiment is designed based on the engineer’s
incomplete mental model of the process.

DOE requires a tolerance for ambiguity. A well-designed experiment
may conclude that none of the tested factors are significant — meaning
the root cause lies outside the factor space examined. This is a valid
and valuable result (it eliminates a set of hypotheses), but in
organizations that equate “no significant findings” with “experiment
failed,” the response is to either question the methodology or massage
the analysis until something significant emerges. Both responses corrupt
the learning process.

Rebuilding the Practice

If your organization’s DOE practice is producing statistically valid
but practically irrelevant results, the fix is not more training on
statistics. Your engineers probably know how to use the software. The
fix is rebuilding the process around DOE:

Start with the problem, not the method. Before
anyone opens DOE software, the team should write down: What is the
specific problem we are trying to solve? What is our current
understanding of the causal chain? What are the hypotheses we are
testing? What would we do differently if each hypothesis is confirmed or
refuted? If you cannot answer these questions, you are not ready to
design an experiment.

Budget for sequential learning. Accept that one
experiment will not give you the answer. Plan for at least two phases:
screening and characterization. Budget the resources for both upfront,
so the second phase does not require a new justification battle.

Invest in the boring fundamentals. Measurement
system analysis, process mapping, root cause hypothesis generation —
these are the foundations upon which DOE builds. Skimping on them to get
to the experiment faster is like building a house on sand to move in
sooner.

Create an experimental knowledge base. Every DOE —
whether it produced significant results or not — generates
organizational learning. Maintain a database of experiments, their
designs, their results, and their limitations. Future experimenters will
avoid repeating dead ends and build on prior findings.

Teach the difference between statistical significance and
practical significance. A factor with a p-value of 0.001 that
accounts for 2% of total variation is statistically significant and
practically irrelevant. Conversely, a factor with a p-value of 0.08 that
accounts for 30% of total variation might be practically important even
if it does not meet the arbitrary 0.05 threshold. Statistical training
without judgment produces technicians, not problem solvers.
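One way to keep both kinds of significance in view is to report percent contribution next to every p-value. The sketch below is an illustration with assumed inputs: the sums of squares are hypothetical, the 10% threshold in `triage` is an arbitrary placeholder a team would set for itself, and the two calls at the end simply reuse the p-values and percentages from the paragraph above.

```python
def percent_contribution(sums_of_squares):
    """Share of total variation attributed to each term, in percent."""
    total = sum(sums_of_squares.values())
    return {term: 100 * ss / total for term, ss in sums_of_squares.items()}

def triage(p_value, pct, alpha=0.05, min_pct=10.0):
    """Label an effect using both its p-value and its share of variation."""
    significant = p_value < alpha
    large = pct >= min_pct
    if significant and large:
        return "act on it"
    if significant:
        return "real but small"
    if large:
        return "possibly important, gather more data"
    return "deprioritize"

# Hypothetical ANOVA sums of squares
contributions = percent_contribution(
    {"factor_1": 4.0, "factor_2": 60.0, "residual": 136.0}
)
print(contributions)

print(triage(p_value=0.001, pct=2.0))    # statistically significant, tiny share
print(triage(p_value=0.08, pct=30.0))    # misses 0.05, large share
```

The labels are less important than the habit: no effect gets reported with a p-value alone.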

The Real Lesson

Design of Experiments is not a technique you apply. It is a
discipline you practice. The difference between organizations that get
value from DOE and organizations that do not is not statistical
sophistication — it is the patience to build understanding sequentially,
the rigor to validate assumptions before and after experimentation, and
the humility to accept that the first experiment rarely provides the
final answer.

When your DOE practice becomes a statistical ritual — a sequence of
software clicks that produce p-values and Pareto charts and confidence
intervals that nobody truly understands — you have not just wasted
experimental resources. You have created something worse than ignorance:
the illusion of knowledge. You have given your organization
statistically defensible answers to questions that may not be the right
questions, based on designs that may not capture the relevant factors,
interpreted by people who may not understand the assumptions.

And the factors you optimized became the interactions you never
tested. The answers you published became the questions you stopped
asking. The experiment you completed became the learning you never
did.

The most valuable DOE result is not the optimal setting. It is the
understanding of why that setting is optimal — and under what conditions
it would stop being optimal. That understanding only comes from
disciplined, sequential, properly resourced experimentation. There are
no shortcuts.

About the Author: Peter Stasko is a Quality
Architect with over 25 years of experience in manufacturing quality
management, process optimization, and continuous improvement. He has
implemented quality systems across automotive, electronics, and
precision manufacturing industries and writes about the real-world
failures and recoveries that quality engineering methodologies rarely
address.
