Experimental Design Fundamentals
After reading my article on how all research revolves around choices, you’ll know that I classify all research into one of three categories: understanding choices, making choices and tracking choices. The latter two categories (making and tracking choices) have access to a method you should understand, the gold standard in those categories: experimental design.
A design of experiments, or experimental design as it’s also known, is the only method I’m aware of that provides a cause-and-effect assessment of choices. If you’re looking to compare several choices and see if they drive impact, experimental design is the tool of choice. Experimental design helps not only in making choices but also in tracking performance.
Most people are familiar with experimental designs as they’re portrayed in popular media. Even if you don’t work in market research, you’ve likely encountered experimental design in medical research studies, where one group of headache sufferers gets a new pain medication and the other gets a placebo, or sugar pill. Sound familiar? At the end of the study, we measure the decline in headaches among the group that took the headache meds vs. the placebo (sugar pill) group. We refer to these groups as test (took the medication) and control (took the placebo).
This method of measuring the effect of medical treatments has been in practice since the mid-1700s, when James Lind, a surgeon in the British Royal Navy, used the approach to cure scurvy (you can read about Lind’s experiment on the National Institutes of Health site at http://goo.gl/Rh4Ys). Little has changed in the fundamental approach to measuring the effects of medical treatments. We owe our raft of COVID-19 vaccines to accelerated experiments pushed through by the pharmaceutical companies.
These studies are the gold standard in medical testing, but they’re difficult to set up; that’s one reason we waited so long for vaccines to make it into distribution. As an example, imagine a study using two groups of at-risk COVID-19 patients. One group gets a saline placebo and the other group gets a new vaccine. Is it a fair comparison if the vaccine group is all men and the saline group is all women? No, of course not. We wouldn’t know whether the vaccine works or whether biological gender differences are playing a role in the vaccine’s effectiveness.
Perhaps we tell the placebo group they’re getting the saline solution and we tell the test group they’re getting the vaccine. Is that a fair comparison? No. The test group will assume they’re protected and put themselves in situations where infection risk is higher, while the placebo group will isolate more to avoid getting infected. In this scenario, we can’t separate what the vaccine is doing from what the participants are doing in response to what we told them. These two medical examples point to three core concepts in experimental designs.
Comparisons: All experimental designs rely on the concept of comparisons. In every study, you’re comparing groups of people who are engaging in different behaviors, taking different medications, trying different products or even seeing different advertisements. One experiment could compare a control group to three different vaccine candidates to see which one will be most effective. The comparison is critical to this method: if your measurement doesn’t have a comparison, then it’s not an experimental design.
Blinding: A good experimental design will keep participants ignorant of which group they’re in, also known as a blinded test. This means that the subjects either have no idea they’re part of research, or they don’t know whether they’re in the test or control group. You’ll often hear the term blinded in medical research, where it refers to the patient not knowing which drug they’re taking. There’s also double blinding, where both the participant and the researcher (i.e., the doctor) don’t know which drug the patients are taking.
Equality: In experiments, the control and test groups should be equal in all ways except for what’s being measured. If one group is 30% women, the other group should be 30% women. This equality is most commonly achieved through randomization, an important concept in experimental designs. By randomly assigning everyone to either the test or control group, I can maximize the chance that the test and control groups are equal in all ways. Randomization is essential for good experimental design research, but it isn’t always effective, especially with small populations. There, we attain fairness through blocking or stratification: ensuring that certain variables are represented equally in both the test and control groups.
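The stratified approach can be sketched in a few lines of Python. This is a purely hypothetical illustration (the participant data and function names are invented, not drawn from any real study): we split each stratum, here gender, evenly between test and control so both groups end up with the same mix.

```python
import random
from collections import defaultdict

def stratified_assign(participants, strata_key, seed=42):
    """Randomly assign participants to 'test' or 'control', stratifying
    so each stratum (e.g., each gender) is split evenly between groups."""
    rng = random.Random(seed)

    # Bucket participants by stratum (e.g., "F" / "M").
    strata = defaultdict(list)
    for p in participants:
        strata[strata_key(p)].append(p)

    # Shuffle within each stratum, then send half to each group.
    assignment = {}
    for members in strata.values():
        rng.shuffle(members)
        half = len(members) // 2
        for p in members[:half]:
            assignment[p["id"]] = "test"
        for p in members[half:]:
            assignment[p["id"]] = "control"
    return assignment

# Example: 200 made-up participants, 30% women, matching the text.
people = [{"id": i, "gender": "F" if i < 60 else "M"} for i in range(200)]
groups = stratified_assign(people, strata_key=lambda p: p["gender"])
```

With this split, both the test and control groups are guaranteed to be 30% women; plain randomization would only get there on average.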
Comparisons, fairness and blinding: sounds simple, doesn’t it? Well, as I mentioned, experimental designs assume that your test and control cells are equal in every way, and that’s difficult to achieve. Let me provide an example. Assume I’m developing a new headache medication and have recruited a group of 200 participants who suffer from headaches. To make this an experiment, I’ll split my group in two so I can create comparison groups. Comparison: check.
I’ll give one group the new medication and the other a sugar pill, but I won’t tell anyone which treatment they’re getting. Blinding: check.
Next, I randomly assign people to test (those taking the meds) and control (those taking the sugar pill). Because assignment was random, you might assume the two groups are equal: roughly the same numbers of men and women, similar age distributions, and so on. Fairness: check?
Unfortunately, 200 people split into two groups of 100 aren’t enough to wash out the differences between groups. Humans are complex creatures, and we all have different biological processes at work. Perhaps someone in the test group has a genetic condition that triggers migraines, while no one in the control group does. Likewise, someone in the control group could have a heart condition that limits blood flow and brings on headaches, while no one in the test group suffers the same problem. There are too many factors to assume you can wash them out through random assignment between test and control, especially when you’re dealing with small group sizes.
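You can see this for yourself with a quick simulation. The sketch below is illustrative, with made-up numbers: it repeatedly splits 200 people in half at random, where four of them carry a rare migraine-triggering condition, and counts how often the condition lands unevenly across the two groups.

```python
import random

def count_imbalanced_splits(n_people=200, n_with_condition=4,
                            n_trials=10_000, seed=0):
    """Repeatedly split n_people in half at random, where a handful carry
    a rare condition, and return the fraction of splits in which the
    condition is unevenly distributed (e.g., 3-1 or 4-0 instead of 2-2)."""
    rng = random.Random(seed)
    people = [True] * n_with_condition + [False] * (n_people - n_with_condition)
    imbalanced = 0
    for _ in range(n_trials):
        rng.shuffle(people)
        in_test = sum(people[: n_people // 2])  # condition carriers in test
        if in_test != n_with_condition // 2:
            imbalanced += 1
    return imbalanced / n_trials
```

With these numbers, a simple random split leaves the rare condition unevenly distributed more often than not, which is exactly why careful trials layer blocking on top of randomization.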
This is one reason these designs are difficult to execute. Without a perfect sense of fairness, someone could later call the results into question. It’s not just putting people into test and control buckets; you have to think about, and control for, everything that could differ between the two groups. If you join a modern medical experiment, you’ll answer 400-question medical and genetic histories and find yourself subjected to physicals and X-rays, all in the name of making sure that the people in the test cell have the same incidence of medical conditions as those in the control group.
We’ve talked about designs needing to make comparisons between test and control; we’ve covered fairness, randomizing across test and control to keep both cells equal; and we’ve covered blinded studies, which keep participants from changing their behavior out of awareness of their role in the research. There’s one more thing to keep in mind with experimental design research: timing.
Timing is a critical element in experimental designs, and one that relates to the comparisons we make. Collecting data from all groups simultaneously is essential to classifying a study as an experimental design. Why? To prevent the changes that happen over time from affecting results. Take our headache example. Let’s say we start our experiment by giving the placebo pill to control participants in January, but due to supply chain issues, we don’t give the new headache medicine to test respondents until April. We have a comparison, but it’s flawed because of the different timing. All sorts of issues around the timing of collection can crop up. Perhaps allergies trigger the onset of headaches, which is more likely in the spring when we test the new medication and less likely in the winter when we collect our control data. The difference in the time of data collection can affect our results. Imagine a study of dieting techniques where we measure the control group (not dieting) in the summer and the test group (doing the diet) over the holiday break, while people are eating cookies, cakes and other holiday foods. Not a fair comparison.
The timing (temporal) issue is especially important to understand because business research often masquerades as an experiment while being riddled with temporal issues. In fact, the most commonly used method for tracking choices is pre-post measurement, which doesn’t control for timing differences: a lot of things could have happened between the pre and the post measurements. The takeaway: if you want a cause-and-effect measurement, your test and control groups need to be measured simultaneously. Without that, you’re not running an “experimental design”.
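To make the temporal problem concrete, here’s a toy simulation with entirely invented numbers: a seasonal drift (think spring allergies) raises everyone’s headache count between the “pre” and “post” periods. The pre-post estimate absorbs that drift and gets the wrong answer, while a concurrent control group shares the drift with the test group, so it cancels out.

```python
import random

def simulate(true_effect=2.0, seasonal_drift=3.0, n=100, seed=1):
    """Compare a pre-post design against a concurrent test/control design
    when a seasonal drift raises everyone's baseline headache count.
    Returns each design's estimate of the reduction in headaches."""
    rng = random.Random(seed)

    def noise():
        return rng.gauss(0, 0.5)

    # Pre-post: the same group measured before treatment (winter) and
    # after treatment (spring). The post period includes the drift.
    pre = [10 + noise() for _ in range(n)]
    post = [10 - true_effect + seasonal_drift + noise() for _ in range(n)]
    pre_post_estimate = sum(pre) / n - sum(post) / n

    # Concurrent: test and control measured at the same time, so both
    # carry the seasonal drift and it subtracts out of the comparison.
    control = [10 + seasonal_drift + noise() for _ in range(n)]
    test = [10 - true_effect + seasonal_drift + noise() for _ in range(n)]
    concurrent_estimate = sum(control) / n - sum(test) / n

    return pre_post_estimate, concurrent_estimate
```

With these invented numbers, the pre-post design reports that the drug made headaches worse by about one per month, while the concurrent design recovers the true two-headache reduction.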
So why aren’t more market research projects experimental designs? Logistics make them difficult to run at scale. Controlling all the variables needed to ensure a fair comparison is hard, which is why we have alternate designs such as tracking, pre-post, and matched-market studies. In most studies, we’re gambling that the biases introduced by weaker designs have little to no effect on outcomes. Even so, experimental design is one of the most effective measurement techniques ever created. It’s not just for medical researchers, and I’d recommend the method for many projects: ad tests, website comparisons, product comparisons, message tests, package tests, and in-market performance evaluations.