A taxonomy of biases in Marketing Mix Model effect estimation – Part 4

For those of you perusing LinkedIn for interview prep material, this is your pay dirt.  Previously we've covered omitted variable bias, mediation (we might call this variable inclusion bias?), and aggregation bias. 

But you know what? None of those were on anyone's interview question list when I was regularly on a panel interviewing MMM analysts. And what was on our question list? Multicollinearity. 

Which, honestly, doesn't properly belong in a series about biases, because multicollinearity isn’t a cause of bias!  I’m including it in this series regardless because it is a topic of concern to modelers (perhaps doubly so because it is always asked about in interviews) as it can be the root cause of a wonky effect estimate.  

Multicollinearity 

Sometimes helpfully spelt multi-collinearity (if you remember when we sent e-mail, then you probably don’t need interview coaching), it is perhaps easiest understood as multi(ple) – co – linear-ity. I.e., this is what happens when two or more (multi) variables lie on the same (co) line (linearity).

For that to make sense, you need to hold a geometrical view of your dataset in your head: imagine that each variable in the dataset is a vector in the (p+1)-dimensional space of your dataset (where p is the number of predictors in the model and the +1 is for the dependent variable). The angle between any pair of these vectors is not likely to be exactly 90 degrees. Some pairs of these vectors might be perfectly parallel (i.e. the dot product is equal to the product of the magnitudes). Those would be perfectly correlated variables, or a pair of perfectly collinear variables.

For other variables, we might see that no single variable is perfectly parallel to them, but that we can construct a weighted sum (i.e. a linear combination) of variables that is parallel. These variables, both the variables in the weighted sum and the variable that is parallel to that weighted sum, are ‘multi-collinear.’
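If the vector picture feels abstract, a rough numpy sketch (with invented data, so the exact numbers are illustrative only) shows the weighted-sum case: c's pairwise correlations with a and b look well short of 1, yet c is nearly parallel to a linear combination of them.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

# Two predictors that are (roughly) uncorrelated with each other
a = rng.normal(size=n)
b = rng.normal(size=n)

# A third variable that is almost exactly a weighted sum of the first two
c = 2 * a + 3 * b + rng.normal(scale=0.05, size=n)

# Pairwise correlations with a and b are well short of 1...
print(round(np.corrcoef(a, c)[0, 1], 3), round(np.corrcoef(b, c)[0, 1], 3))

# ...but c is almost perfectly parallel to the best weighted sum of a and b
X = np.column_stack([a, b])
weights, *_ = np.linalg.lstsq(X, c, rcond=None)
fitted = X @ weights
cos_angle = fitted @ c / (np.linalg.norm(fitted) * np.linalg.norm(c))
print(round(cos_angle, 5))  # ~1: the dot product is (almost) the product of the magnitudes
```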

Back to that geometric view: if you can plot the observations in your dataset and then find a rotation where some of the vectors fall onto a shared plane, then you have multicollinearity. The short movie below might help to illustrate . . . or it might just hypnotize you into reading more of my blog posts:

[Animation: rotating view of the dataset, showing the collinear variables falling onto a shared plane]

Perfect collinearity is rare in real data. When we have a set of predictors that are perfectly collinear, there is a plane that some of the predictors lie on in the geometric view, and the X’X matrix will be rank deficient, which makes the OLS coefficients impossible to estimate. But near collinearity is pretty common (and is what most of us are talking about when we discuss multicollinearity in regression). In near collinearity we have a set of predictors that are very close to lying in a shared plane, and while the X’X matrix is full rank, it has a determinant near 0.

It's Only the Xs

If you are a particularly engaged reader, you might be thinking: “Gee, Mister, this multicollinearity sounds like what I want between my target variable and my explanatory variables. Why is this bad?” 

Yes, ideally, a weighted sum of the explanatory variables will be (nearly) perfectly correlated to the target variable; the process of building a linear regression model is the process of finding a set of variables and their weights where this is true! 

But when we have subsets of the explanatory variables where this is true, it can create difficulties with our effect estimation. Multicollinearity is ONLY a problem for an analyst when it is within the intended explanatory variables. 
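Since the problem lives entirely in the X’s, the diagnostic lives in X’X too. To make the perfect-versus-near distinction concrete, here is a small numpy sketch (made-up data, nothing from this post) that looks at the rank and determinant of X'X in both cases:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
a = rng.normal(size=n)
b = rng.normal(size=n)

# Perfect collinearity: c is exactly a linear combination of a and b
X_perfect = np.column_stack([a, b, 2 * a + 3 * b])
print(np.linalg.matrix_rank(X_perfect.T @ X_perfect))  # 2, not 3: rank deficient
print(np.linalg.det(X_perfect.T @ X_perfect))          # ~0 (up to floating-point noise)

# Near collinearity: a little independent noise in c restores full rank
X_near = np.column_stack([a, b, 2 * a + 3 * b + rng.normal(scale=0.1, size=n)])
print(np.linalg.matrix_rank(X_near.T @ X_near))        # 3: full rank

# But compare its determinant with that of a design whose third column is unrelated noise
X_indep = np.column_stack([a, b, rng.normal(scale=np.sqrt(13), size=n)])
print(np.linalg.det(X_near.T @ X_near) / np.linalg.det(X_indep.T @ X_indep))  # << 1
```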

In real data, without the spin-y cube, what is this?   

Leaving our n-dimensional geometry aside, marketing activity data typically has some degree of collinearity because media buys are planned as campaigns that involve multiple media channels that all start and stop at (approximately) the same time. 

This leads to groups of model variables that are all strongly pairwise correlated with each other. Any group of variables where that is true will also be participating in a near collinear relationship.

And an MMM-building analyst will often make this even worse when adding variables to control for seasonal demand. Marketing campaigns are often planned around high-season sales, so adding seasonality terms frequently adds variables that are pairwise correlated with the marketing drivers.
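A toy version of that pattern (invented spend numbers, purely illustrative): two channels that flight together as part of the same campaigns, plus a seasonality dummy defined around the same weeks, end up strongly pairwise correlated.

```python
import numpy as np

rng = np.random.default_rng(1)
weeks = 104

# Campaigns run as 4-week bursts each quarter; both channels switch on and off together
campaign_on = (np.arange(weeks) % 13 < 4).astype(float)
tv = campaign_on * rng.uniform(80, 120, size=weeks)
digital = campaign_on * rng.uniform(40, 60, size=weeks)

# A seasonality dummy planned around the same high-demand weeks
season = campaign_on.copy()

X = np.column_stack([tv, digital, season])
print(np.round(np.corrcoef(X, rowvar=False), 2))  # all pairwise correlations well above 0.9
```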

You said it doesn’t cause bias, but we all know it’s bad. What gives? 

Well, effectively, multicollinearity increases the uncertainty of estimated effects. 

A bias in an effect estimation process, such as might be caused by an omitted variable, is systematic. In a frequentist mindset, we might say that a bias is a difference between the true population parameter and the average of the estimator, one that persists even as the sample size approaches infinity.

Multicollinearity doesn’t do that. 

Multicollinearity increases the standard error of the estimated parameters (or the variance of the posterior distribution, if you use Bayesian models).

In the world of Marketing Mix Modeling, that might well sound like a secondary concern. After all, as long as the point estimate I have is good, the budget allocation I recommend is good, so a wide band of believable values around that point estimate doesn't really matter, right?

And, honestly, it often _IS_ fine, just like we have to accept that the known biases we’ve discussed previously will exist in every model we create. 

But it is also possible that we could get a point estimate that is NOT fine. We could have an award-winning, buzz-generating, obviously great campaign that your stakeholders really care about, with a true contribution of 2% of sales, show up in the model with a -1.38% contribution because the standard error on the estimate is 5 times as big as the true coefficient. And that would surely make for a model no one can use, which is the very definition of a bad effect estimate, I would think.
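A quick simulation makes the 'no bias, plenty of variance' point concrete. The numbers below are invented (two near-duplicate drivers with true coefficients 2 and 1), so treat it as a sketch rather than anything tied to the tables later in this post.

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_sims = 104, 2000
true_beta = np.array([2.0, 1.0])
estimates = np.empty((n_sims, 2))

for i in range(n_sims):
    x1 = rng.normal(size=n)
    x2 = 0.98 * x1 + 0.02 * rng.normal(size=n)   # x2 is nearly a copy of x1
    y = true_beta[0] * x1 + true_beta[1] * x2 + rng.normal(size=n)
    estimates[i] = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0]

print(np.round(estimates.mean(axis=0), 2))     # averages land close to (2, 1): no bias
print(np.round(estimates.std(axis=0), 2))      # but any single fit can be miles away
print(round((estimates[:, 1] < 0).mean(), 2))  # share of fits where the second coefficient flips sign
```

Averaged over many re-fits the estimates are centred on the truth, but a worrying share of the individual fits put the wrong sign on a genuinely positive driver, which is exactly the '-1.38% contribution' failure mode.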

Why does this happen, again? 

I think the clearest answer to ‘why does collinearity cause this increased uncertainty’ requires us to center our focus on OLS as an optimization. OLS is the process of identifying the coefficients that minimize the sum of squared error of predictions. With that lens, the standard error around the point estimate for a coefficient reflects the range of values that don’t make much difference to the sum of squared error.

With perfect multicollinearity there are (at least) two sets of coefficients that are equally good at minimizing the sum of squared error. As an example, let’s have 3 variables, A, B, and C, such that C := 2*A + 3*B. Then we could have coefficients for (A, B, C) of either (0, 0, 10) or (20, 30, 0) and get the same predicted values:

A     B     C     Y = 10*C     Y = 20*A + 30*B
10    0     20    200          200
100   0     200   2000         2000
10    90    290   2900         2900
100   5     215   2150         2150
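If you'd rather not check that by hand, a short sketch confirms that both coefficient sets produce identical predictions:

```python
import numpy as np

A = np.array([10, 100, 10, 100])
B = np.array([0, 0, 90, 5])
C = 2 * A + 3 * B                 # C is exactly collinear with A and B

print(10 * C)                     # [ 200 2000 2900 2150]
print(20 * A + 30 * B)            # identical predictions from a different set of coefficients
```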

In the real world of _near_ collinearity we could have that C really is 2A + 3B + ɛ, where that ɛ has enough variance to smudge the relationship to A and B, or where C has a systematic component unrelated to A or B that is small enough that the 2A + 3B part generates most of the variation in the series.

To show a very slightly more realistic example, I’ve added a ‘Y’ column that has some gaussian noise in it, and I’ve added 10% to every value of C. The columns with a ‘~’ in the header show the results of an OLS regression (via Excel) for 3 cases: only C; A and B; and all 3 variables offered as independent variables:

I Invite You To Have a 3rd Grader Check This Arithmetic

A     B     C       Y         Y ~ 9.51*C    Y ~ 20.08*A + 29.98*B    Y ~ 20.08*A + 29.95*B + 0.0089*C
10    0     22      195.85    209.22        200.80                   201.00
100   0     220     2009.54   2092.20       2008.00                  2009.96
10    90    319     2899.12   3033.69       2899.00                  2899.14
100   5     193.5   2158.55   1840.19       2157.90                  2159.47

Your 3rd Grader Might Ask For a Calculator

Let’s walk through this in more detail. The TRUE data generating process here was Y = 20*A + 30*B + noise. And C = (2*A + 3*B)*1.10. So, we might reasonably expect that the _best_ model would be the 2nd one, as it has only the true effects specified.

And it is the best model if judged by which model has the closest effect estimates to the true values. 

But the predicted values are closer to the truth for the 3-parameter model. Even though there is no information in C that isn’t in A and B, we see here that the random noise in Y happens to correlate very slightly with C even after A and B’s contribution to the Y-hat values is accounted for. If we are judging models based on in-sample fit alone, the 3-variable model is ‘best’. Now, your intuition might be that out-of-sample fits wouldn’t be better for this model . . . and you could be right. Or you could be wrong. It would depend on whether the ‘noise’ in Y is truly white noise and not in any way (e.g. through an unobserved confounder) related to variation in C.

As a sidebar, you might wonder why anyone would judge a model by its fit and not by how closely the estimated effects capture the true effects. I promise you: every analyst would love to judge a model by how close its effects were to the truth. But in real models, we don’t _know_ the true effects and so we can’t tell how close we are to them directly. Model predictive performance (a.k.a. model ‘fit’) is a proxy for whether the model has close-to-correct effect estimates. It is not always a good proxy! But it is the one we always get when we fit a model.

Hey, wait a minute - those are only point estimates! 

I knew I liked you! There really isn’t much point to this story without considering the uncertainty in the point estimates. As I said above, multicollinearity damages our certainty about estimates and does not introduce a bias. In this simple example, here is what that looks like: 

Driver    2-Coefficient Model          3-Coefficient Model
          Coefficient    Std Error     Coefficient    Std Error
A         20.09          0.03          20.07          0.33
B         29.98          0.04          29.95          0.55
C         *              *             0.01           0.17

This is what near collinearity does: adding the 3rd coefficient, where C is mostly a linear combination of A and B, has increased the standard errors by a factor of 10! Notice that we aren’t appreciably further from the true values of 20 and 30 . . . but the confidence intervals got a whole bunch bigger.
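If you'd like to reproduce this kind of comparison without Excel, a statsmodels sketch along these lines does the job. The noise draws (and the small wobble added to C so the three-variable model is estimable) are mine, so expect to reproduce the pattern rather than the exact digits in the tables above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

# True data generating process: Y = 20*A + 30*B + noise, with C ~ (2*A + 3*B)*1.10
A = np.tile([10.0, 100.0, 10.0, 100.0], 10)   # repeat the design so OLS has some degrees of freedom
B = np.tile([0.0, 0.0, 90.0, 5.0], 10)
C = (2 * A + 3 * B) * 1.10 + rng.normal(scale=2.0, size=A.size)  # small wobble: near, not perfect, collinearity
Y = 20 * A + 30 * B + rng.normal(scale=5.0, size=A.size)

for label, X in [("C only   ", np.column_stack([C])),
                 ("A and B  ", np.column_stack([A, B])),
                 ("A, B, C  ", np.column_stack([A, B, C]))]:
    fit = sm.OLS(Y, X).fit()                  # no intercept, matching the spreadsheet models above
    print(label, "coef:", np.round(fit.params, 3), "std err:", np.round(fit.bse, 3))
```

Expect the standard errors on A and B to balloon once C joins the model, just as in the table above.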

The Fix 

As with the other articles in this series, I can’t really prescribe a ‘solution’ because the answer isn’t a question of modeling approach – maybe even more so with multicollinearity than with the list of biases.   

There is nothing wrong with a dataset with sets of collinear variables in it. There is nothing wrong with including near collinear variables in the model! [NB: that might be considered a ‘hot take’ by some folks in the world of data science and analytics, so don’t deploy this as your first statement in an interview!] 

But collinear variables do dramatically increase uncertainty in parameter estimates. And when we want to estimate many small effects (e.g., in a marketing mix model that has 45 marketing variables in it) that increased uncertainty looks exactly like a lack of evidence for any effect at all. 

So, while we can’t ‘fix collinearity’, we do have a few tools to deploy [NB: this is what you should lead with in that interview].

The first step is detection of collinearity. Most analysts’ main tool is the VIF (or equivalently ‘tolerance’) statistic, which is reported by every software tool I’ve used for regression. Each coefficient has its own VIF, and those with high VIFs are participating in a ‘near collinear’ relationship. How high is a high VIF? It depends on your favorite textbook. I like 10. Many people were taught 5 (check the Wikipedia footnotes for sources for each of those numbers) . . . but, of course, quantizing a continuous metric into ‘ok’ and ‘too high’ is arbitrary, so the best advice I can offer is to ignore VIF until you have a problem and then go and see how high your VIFs are.
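For reference, this is roughly what that check looks like with statsmodels' variance_inflation_factor (the spend series here are invented; swap in your own model matrix):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 104
campaign = (np.arange(n) % 13 < 4).astype(float)

# Hypothetical model matrix: tv and digital flight together, price moves independently
X = pd.DataFrame({
    "tv": campaign * rng.uniform(80, 120, size=n),
    "digital": campaign * rng.uniform(40, 60, size=n),
    "price": rng.normal(10, 1, size=n),
})
X = sm.add_constant(X)

# One VIF per coefficient (skipping the intercept, whose VIF isn't meaningful)
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:],
)
print(vifs.round(1))   # tv and digital come out high; price sits near 1
```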

One deficit of the VIF calculation is that it doesn’t tell you which other independent variables are in the collinear relationship with a high-VIF variable. You can only leave in or take out the high-VIF variable itself.

But Belsley, Kuh, and Welsch solved that for us in the 80s. The approach outlined in chapter 3 of their book allows the analyst to diagnose each near collinear relationship in an iterative process, with full knowledge of which variables are contributing. Their condition indices are reported by most software packages; I wrote a statsmodels-based calculation when I couldn’t quickly find an existing python package and posted it here, if you have a need.
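Purely to illustrate the underlying calculation (scale the columns to unit length, take an SVD, form condition indices and variance-decomposition proportions), a from-scratch sketch might look like this; it is not the implementation linked above.

```python
import numpy as np

def condition_diagnostics(X):
    """Belsley-Kuh-Welsch-style condition indices and variance-decomposition
    proportions. A bare-bones sketch, not the implementation linked above."""
    Xs = X / np.linalg.norm(X, axis=0)          # scale each column to unit length
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    indices = s.max() / s                       # one condition index per singular value
    phi = (Vt.T / s) ** 2                       # phi[j, k] = v_jk^2 / s_k^2
    proportions = phi / phi.sum(axis=1, keepdims=True)
    return indices, proportions                 # proportions: rows = coefficients, cols = indices

# Toy check: columns 0 and 2 are nearly collinear, column 1 is not
rng = np.random.default_rng(11)
a, b = rng.normal(size=100), rng.normal(size=100)
X = np.column_stack([a, b, 2 * a + rng.normal(scale=0.05, size=100)])

indices, proportions = condition_diagnostics(X)
print(np.round(indices, 1))       # one large condition index flags the near dependency
print(np.round(proportions, 2))   # in its column, variables 0 and 2 carry nearly all the variance
```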

Having identified collinearity, the analyst then has a few options for working around the high uncertainties.   

Most straightforwardly, you can drop the high-VIF variables from a model. This does expose the remaining effect estimates to omitted variable bias, but that can be a worthwhile trade-off.

Almost as easily, the analyst can aggregate together variables that are in a collinear relationship. This also has costs in terms of biases, and there is no guarantee that the aggregate won’t be collinear with another model term, but frequently an aggregated driver is usable and useful.
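As an illustration of what that looks like in practice (again with made-up series), summing two channels that always flight together into one combined driver trades two shaky per-channel estimates for one steadier blended one:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 104
campaign = (np.arange(n) % 13 < 4).astype(float)
tv = campaign * rng.uniform(80, 120, size=n)
digital = campaign * rng.uniform(40, 60, size=n)
price = rng.normal(10, 1, size=n)
y = 0.5 * tv + 1.0 * digital - 20 * price + rng.normal(scale=10, size=n)

# Separate, near-collinear drivers: two noisy per-channel estimates
separate = sm.OLS(y, sm.add_constant(np.column_stack([tv, digital, price]))).fit()

# One aggregated 'campaign media' driver: a single, blended per-unit effect
aggregated = sm.OLS(y, sm.add_constant(np.column_stack([tv + digital, price]))).fit()

print(np.round(separate.bse, 2))     # standard errors for const, tv, digital, price
print(np.round(aggregated.bse, 2))   # the combined driver's std error is much smaller
```

The cost is exactly the one flagged above: the blended coefficient can't tell you whether tv or digital did the work.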

Thirdly, the analyst can use a model estimation method that is robust to collinearity. Tikhonov regularization/Ridge regression is the historical choice, but really, many model types are variations on the regularization/penalized estimation approach. With all of them, the trade-off is the loss of the guarantee of unbiasedness in exchange for reduced uncertainty in the estimate. Please recall that ridge regression doesn’t “fix multicollinearity” or even reduce the epistemic uncertainty about causes that collinearity might represent. It is a model-fitting approach whose estimates vary less under near collinearity, and which can even produce coefficients in a dataset with perfect collinearity, which makes it a good workaround in practice. But it doesn’t so much fix collinearity as use near collinearity as an excuse to enforce a preference for small effect sizes.
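A quick sklearn illustration of that trade (the penalty strength alpha is an arbitrary choice here, and the data are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(13)
n = 104
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # nearly a duplicate of x1
y = 2 * x1 + 1 * x2 + rng.normal(size=n)      # true coefficients: 2 and 1
X = np.column_stack([x1, x2])

ols = LinearRegression(fit_intercept=False).fit(X, y)
ridge = Ridge(alpha=10.0, fit_intercept=False).fit(X, y)   # penalised fit; alpha chosen by hand

print(np.round(ols.coef_, 2))     # a single OLS fit can land a long way from (2, 1)
print(np.round(ridge.coef_, 2))   # the near-duplicates get similar, stabilised coefficients:
                                  # less variance, at the price of some bias
```

Re-run it with different seeds and the OLS pair bounces around wildly while the ridge pair barely moves, which is the whole appeal, and the whole caveat.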

It is worth mentioning that there is a non-analytics ‘fix’ for collinearity in marketing variables, of course. We can encourage marketers to plan media buys with minimal correlation in time. When that would mean poor marketing, e.g. turning down marketing during peak season, it is likely not a great choice. But not every season is high season, and not every change in marketing levels has to be downward! Strategically adding some high- and low-spend weeks of specific marketing tactics to improve effect estimation is valid and worthwhile for large marketing expenditures. If we were particularly pro-marketing mix models, we might even try to recast the entire concept of designed experiments in marketing as an attempt to get good variation in marketing activity into the data history to improve our model estimates. . .

In Conclusion 

I imagine this conclusion is going to feel very familiar to repeat readers. This article continues to emphasize that effect estimation in marketing mix models is an exercise in carefully judged trade-offs. Multicollinearity doesn’t cause a bias in effect estimation, but instead inflates the variance of effect estimates. So, in that sense it is a different beast from omitted variable bias or aggregation bias . . . but the analyst’s response should be similar: know that the model is wrong, and think hard to make good, defensible modeling choices with model usability as a top priority.