# Background

The following sections provide a very brief overview of the BLP model and how it is estimated. The goal is to concisely introduce the notation and terminology used throughout the rest of the documentation. For a more in-depth overview, refer to Conlon and Gortmaker (2020).

## The Model

There are \(t = 1, 2, \dotsc, T\) markets, each with \(j = 1, 2, \dotsc, J_t\) products produced by \(f = 1, 2, \dotsc, F_t\) firms, for a total of \(N\) products across all markets. There are \(i = 1, 2, \dotsc, I_t\) individuals/agents who choose among the \(J_t\) products and an outside good \(j = 0\). These numbers also represent sets. For example, \(J_t = \{1, 2, \dots, J_t\}\).

### Demand

Observed demand-side product characteristics are contained in the \(N \times K_1\) matrix of linear characteristics, \(X_1\), and the \(N \times K_2\) matrix of nonlinear characteristics, \(X_2\), which is typically a subset of \(X_1\). Unobserved demand-side product characteristics, \(\xi\), are an \(N \times 1\) vector.

In market \(t\), observed agent characteristics are an \(I_t \times D\) matrix called demographics, \(d\). Unobserved agent characteristics are an \(I_t \times K_2\) matrix, \(\nu\).

The indirect utility of agent \(i\) from purchasing product \(j\) in market \(t\) is

\[U_{ijt} = \delta_{jt} + \mu_{ijt} + \epsilon_{ijt}, \tag{1}\]

in which the mean utility is, in vector-matrix form,

\[\delta = X_1\beta + \xi. \tag{2}\]

The \(K_1 \times 1\) vector of demand-side linear parameters, \(\beta\), is partitioned into two components: \(\alpha\) is a \(K_1^\text{en} \times 1\) vector of parameters on the \(N \times K_1^\text{en}\) submatrix of endogenous characteristics, \(X_1^\text{en}\), and \(\beta^\text{ex}\) is a \(K_1^\text{ex} \times 1\) vector of parameters on the \(N \times K_1^\text{ex}\) submatrix of exogenous characteristics, \(X_1^\text{ex}\). Usually, \(X_1^\text{en} = p\), prices, so \(\alpha\) is simply a scalar.

The agent-specific portion of utility in a single market is, in vector-matrix form,

\[\mu = X_2(\Sigma \nu' + \Pi d'). \tag{3}\]

The model incorporates both observable (demographics) and unobservable taste heterogeneity through random coefficients. For the unobserved heterogeneity, we let \(\nu\) denote independent draws from the standard normal distribution. These are scaled by a \(K_2 \times K_2\) lower-triangular matrix \(\Sigma\), which denotes the Cholesky root of the covariance matrix for unobserved taste heterogeneity. The \(K_2 \times D\) matrix \(\Pi\) measures how agent tastes vary with demographics.

In the above expression, random coefficients are assumed to be normally distributed. To incorporate one or more lognormal random coefficients, the associated columns in the parenthesized expression can be exponentiated before being pre-multiplied by \(X_2\). For example, this allows for the coefficient on price to be lognormal so that demand slopes down for all agents. For lognormal random coefficients, a constant column is typically included in \(d\) so that its coefficients in \(\Pi\) parametrize the means of the logs of the random coefficients.

Random idiosyncratic preferences, \(\epsilon_{ijt}\), are assumed to be Type I Extreme Value, so that conditional on the heterogeneous coefficients, market shares follow the well-known logit form. Aggregate market shares are obtained by integrating over the distribution of individual heterogeneity. They are approximated with Monte Carlo integration or quadrature rules defined by the \(I_t \times K_2\) matrix of integration nodes, \(\nu\), and an \(I_t \times 1\) vector of integration weights, \(w\):

\[s_{jt} \approx \sum_{i \in I_t} w_{it} s_{ijt}, \tag{4}\]

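where the probability that agent \(i\) chooses product \(j\) in market \(t\) is

\[s_{ijt} = \frac{\exp(\delta_{jt} + \mu_{ijt})}{1 + \sum_{k \in J_t} \exp(\delta_{kt} + \mu_{ikt})}. \tag{5}\]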

There is a one in the denominator because the utility of the outside good is normalized to \(U_{i0t} = 0\). The scale of utility is normalized by the variance of \(\epsilon_{ijt}\).
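
As a rough illustration of how the share approximation works, the following NumPy sketch computes choice probabilities and aggregate shares in a single market. It assumes the standard logit form with the outside good normalized to zero utility and omits demographics. The function `market_shares` and all variable names are hypothetical and are not part of any package API.

```python
import numpy as np

def market_shares(delta, mu, weights):
    """Approximate shares in one market.

    delta: (J,) mean utilities; mu: (J, I) agent-specific utilities;
    weights: (I,) integration weights that sum to one.
    """
    exp_u = np.exp(delta[:, None] + mu)
    # The 1 in the denominator is the outside good, whose utility is zero.
    probabilities = exp_u / (1.0 + exp_u.sum(axis=0, keepdims=True))
    return probabilities @ weights, probabilities

rng = np.random.default_rng(0)
J, I, K2 = 3, 500, 2
X2 = rng.normal(size=(J, K2))
sigma = np.array([[0.5, 0.0], [0.1, 0.3]])  # lower-triangular Cholesky root
nu = rng.normal(size=(I, K2))               # Monte Carlo integration nodes
mu = X2 @ (sigma @ nu.T)                    # no demographics in this sketch
delta = np.array([0.5, -0.2, 0.1])
weights = np.full(I, 1.0 / I)

shares, probabilities = market_shares(delta, mu, weights)
outside_share = 1.0 - shares.sum()
```

With equal weights this is plain Monte Carlo integration; quadrature rules would only change `nu` and `weights`.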

### Supply

Observed supply-side product characteristics are contained in the \(N \times K_3\) matrix of supply-side characteristics, \(X_3\). Prices cannot be supply-side characteristics, but non-price product characteristics often overlap with the demand-side characteristics in \(X_1\) and \(X_2\). Unobserved supply-side product characteristics, \(\omega\), are an \(N \times 1\) vector.

Firm \(f\) chooses prices in market \(t\) to maximize the profits of its products \(J_{ft} \subset J_t\):

\[\pi_{ft} = \sum_{j \in J_{ft}} (p_{jt} - c_{jt}) s_{jt}. \tag{6}\]

In a single market, the corresponding multi-product differentiated Bertrand first order conditions are, in vector-matrix form,

\[p = c + \eta, \tag{7}\]

where the multi-product Bertrand markup \(\eta\) depends on \(\Delta\), a \(J_t \times J_t\) matrix of intra-firm (negative) demand derivatives:

\[\eta = \Delta^{-1}s, \quad \Delta = -\mathscr{H} \odot \frac{\partial s}{\partial p}. \tag{8}\]

Here, \(\mathscr{H}\) denotes the market-level ownership or product holdings matrix in the market, where \(\mathscr{H}_{jk}\) is typically \(1\) if the same firm produces products \(j\) and \(k\), and \(0\) otherwise.
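
To make the role of the ownership matrix concrete, here is a small sketch that computes Bertrand markups for a plain logit model, where the demand derivatives have the closed form \(\partial s_j / \partial p_k = \alpha s_j (1\{j = k\} - s_k)\). The plain logit assumption, the function `logit_markups`, and the example numbers are illustrative choices, not taken from the text above.

```python
import numpy as np

def logit_markups(shares, alpha, firm_ids):
    """Solve Delta @ eta = s for plain logit demand with price coefficient alpha."""
    # Ownership matrix: 1 if the same firm produces both products, else 0.
    ownership = (firm_ids[:, None] == firm_ids[None, :]).astype(float)
    # Plain logit demand derivatives (symmetric in this special case).
    jacobian = alpha * (np.diag(shares) - np.outer(shares, shares))
    capital_delta = -ownership * jacobian
    return np.linalg.solve(capital_delta, shares)

shares = np.array([0.2, 0.1, 0.3])
alpha = -2.0
single = logit_markups(shares, alpha, np.array([0, 1, 2]))  # single-product firms
merged = logit_markups(shares, alpha, np.array([0, 0, 1]))  # first two products merge
```

Holding shares fixed, merging the first two products raises their markups because the merged firm internalizes substitution between them, while the third product's markup is unchanged.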

To include a supply side, we must specify a functional form for marginal costs:

\[\tilde{c} = f(c) = X_3\gamma + \omega. \tag{9}\]

The most common choices are \(f(c) = c\) and \(f(c) = \log(c)\).

## Estimation

A demand side is always estimated but including a supply side is optional. With only a demand side, there are three sets of parameters to be estimated: \(\beta\) (which may include \(\alpha\)), \(\Sigma\) and \(\Pi\). With a supply side, there is also \(\gamma\). The linear parameters, \(\beta\) and \(\gamma\), are typically concentrated out of the problem. The exception is \(\alpha\), which cannot be concentrated out when there is a supply side because it is needed to compute demand derivatives and hence marginal costs. Linear parameters that are not concentrated out along with unknown nonlinear parameters in \(\Sigma\) and \(\Pi\) are collectively denoted \(\theta\).

The GMM problem is

\[\min_\theta q(\theta) = \bar{g}(\theta)'W\bar{g}(\theta), \tag{10}\]

in which \(q(\theta)\) is the GMM objective. By default, PyBLP scales this value by \(N\) so that objectives across different problem sizes are comparable. This behavior can be disabled. In some of the BLP literature and in earlier versions of this package, the objective was scaled by \(N^2\).

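Here, \(W\) is an \(M \times M\) weighting matrix and \(\bar{g}\) is an \(M \times 1\) vector of averaged demand- and supply-side moments:

\[\bar{g} = \begin{bmatrix} \bar{g}_D \\ \bar{g}_S \end{bmatrix} = \frac{1}{N} \begin{bmatrix} \sum_{j,t} Z_{D,jt}'\xi_{jt} \\ \sum_{j,t} Z_{S,jt}'\omega_{jt} \end{bmatrix}, \tag{11}\]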

where \(Z_D\) and \(Z_S\) are \(N \times M_D\) and \(N \times M_S\) matrices of demand- and supply-side instruments.

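The vector \(\bar{g}\) contains sample analogues of the demand- and supply-side moment conditions \(E[g_{D,jt}] = E[g_{S,jt}] = 0\) where

\[\begin{bmatrix} g_{D,jt} & g_{S,jt} \end{bmatrix} = \begin{bmatrix} \xi_{jt}Z_{D,jt} & \omega_{jt}Z_{S,jt} \end{bmatrix}. \tag{12}\]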

In each GMM stage, a nonlinear optimizer finds the \(\hat{\theta}\) that minimizes the GMM objective value \(q(\theta)\).

### The Objective

Given a \(\theta\), the first step to computing the objective \(q(\theta)\) is to compute \(\delta(\theta)\) in each market with the following standard contraction:

\[\delta \leftarrow \delta + \log s - \log s(\delta, \theta), \tag{13}\]

where \(s\) are the market’s observed shares and \(s(\delta, \theta)\) are calculated market shares. Iteration terminates when the norm of the change in \(\delta(\theta)\) is less than a small number.
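
The contraction can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the helper names `compute_shares` and `solve_delta` are hypothetical, the termination rule uses the max norm, and the starting value is the plain logit inversion. Production implementations typically add acceleration and numerical safeguards.

```python
import numpy as np

def compute_shares(delta, mu, weights):
    """Calculated shares s(delta, theta) in one market; mu is (J, I)."""
    exp_u = np.exp(delta[:, None] + mu)
    return (exp_u / (1.0 + exp_u.sum(axis=0))) @ weights

def solve_delta(observed_shares, mu, weights, tol=1e-12, max_iter=5000):
    """Iterate delta <- delta + log(s) - log(s(delta, theta)) until the step is tiny."""
    # Start from the plain logit solution (an assumed, convenient choice).
    delta = np.log(observed_shares) - np.log(1.0 - observed_shares.sum())
    for _ in range(max_iter):
        step = np.log(observed_shares) - np.log(compute_shares(delta, mu, weights))
        delta = delta + step
        if np.max(np.abs(step)) < tol:
            break
    return delta

rng = np.random.default_rng(0)
mu = 0.5 * rng.normal(size=(3, 200))
weights = np.full(200, 1.0 / 200)
true_delta = np.array([0.5, -0.2, 0.1])
observed = compute_shares(true_delta, mu, weights)
recovered = solve_delta(observed, mu, weights)
```

In this example the shares are generated from a known `true_delta`, so `recovered` should match it up to the tolerance.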

With a supply side, marginal costs are then computed according to (7):

\[\tilde{c}(\theta) = f(c(\theta)) = f(p - \eta(\theta)). \tag{14}\]

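Concentrated out linear parameters are recovered with linear IV-GMM:

\[\begin{bmatrix} \hat{\beta}^\text{ex} \\ \hat{\gamma} \end{bmatrix} = (X'ZWZ'X)^{-1}X'ZWZ'Y(\theta), \tag{15}\]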

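where

\[X = \begin{bmatrix} X_1^\text{ex} & 0 \\ 0 & X_3 \end{bmatrix}, \quad Z = \begin{bmatrix} Z_D & 0 \\ 0 & Z_S \end{bmatrix}, \quad Y(\theta) = \begin{bmatrix} \delta(\theta) - X_1^\text{en}\alpha \\ \tilde{c}(\theta) \end{bmatrix}. \tag{16}\]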

With only a demand side, \(\alpha\) can be concentrated out, so \(X = X_1\), \(Z = Z_D\), and \(Y = \delta(\theta)\) recover the full \(\hat{\beta}\) in (15).
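
For the demand-only case, the linear IV-GMM step amounts to one linear solve. The sketch below assumes the usual estimator \((X'ZWZ'X)^{-1}X'ZWZ'Y\); the function name `iv_gmm` and the synthetic data are hypothetical and purely illustrative.

```python
import numpy as np

def iv_gmm(X, Z, W, Y):
    """Linear IV-GMM estimate: (X'Z W Z'X)^{-1} X'Z W Z'Y."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ Y))

rng = np.random.default_rng(0)
N = 200
Z = rng.normal(size=(N, 4))                     # instruments
X = Z[:, :2] + 0.1 * rng.normal(size=(N, 2))    # characteristics correlated with Z
beta = np.array([1.0, -2.0])
Y = X @ beta                                    # noiseless "delta" for the example
W = np.linalg.inv(Z.T @ Z / N)                  # 2SLS weighting matrix
beta_hat = iv_gmm(X, Z, W, Y)
```

Because `Y` is generated without a structural error here, `beta_hat` reproduces `beta` exactly up to floating point error. Using `np.linalg.solve` avoids forming an explicit inverse.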

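Finally, the unobserved product characteristics (i.e., the structural errors),

\[\begin{bmatrix} \xi(\theta) \\ \omega(\theta) \end{bmatrix} = \begin{bmatrix} \delta(\theta) - X_1\hat{\beta} \\ \tilde{c}(\theta) - X_3\hat{\gamma} \end{bmatrix}, \tag{17}\]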

are interacted with the instruments to form \(\bar{g}(\theta)\) in (11), which gives the GMM objective \(q(\theta)\) in (10).

### The Gradient

The gradient of the GMM objective in (10) is

\[\nabla q(\theta) = 2\bar{G}(\theta)'W\bar{g}(\theta), \tag{18}\]

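where

\[\bar{G} = \begin{bmatrix} \bar{G}_D \\ \bar{G}_S \end{bmatrix} = \frac{1}{N} \begin{bmatrix} \sum_{j,t} Z_{D,jt}'\frac{\partial\xi_{jt}}{\partial\theta} \\ \sum_{j,t} Z_{S,jt}'\frac{\partial\omega_{jt}}{\partial\theta} \end{bmatrix}. \tag{19}\]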

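Writing \(\delta\) as an implicit function of \(s\) in (4) gives the demand-side Jacobian:

\[\frac{\partial\delta}{\partial\theta} = -\left(\frac{\partial s}{\partial\delta}\right)^{-1}\frac{\partial s}{\partial\theta}. \tag{20}\]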

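The supply-side Jacobian is derived from the definition of \(\tilde{c}\) in (9):

\[\frac{\partial\tilde{c}_{jt}}{\partial\theta_p} = -\frac{\partial\tilde{c}_{jt}}{\partial c_{jt}}\frac{\partial\eta_{jt}}{\partial\theta_p}. \tag{21}\]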

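The second term in this expression is derived from the definition of \(\eta\) in (7):

\[\frac{\partial\eta}{\partial\theta_p} = -\Delta^{-1}\left(\frac{\partial\Delta}{\partial\theta_p}\eta + \frac{\partial\Delta}{\partial\xi}\eta\frac{\partial\xi}{\partial\theta_p}\right). \tag{22}\]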

One thing to note is that \(\frac{\partial\xi}{\partial\theta} = \frac{\partial\delta}{\partial\theta}\) and \(\frac{\partial\omega}{\partial\theta} = \frac{\partial\tilde{c}}{\partial\theta}\) need not hold during optimization if we concentrate out linear parameters because these are then functions of \(\theta\). Fortunately, one can use orthogonality conditions to show that it is fine to treat these parameters as fixed when computing the gradient.

### Weighting Matrices

Conventionally, the 2SLS weighting matrix is used in the first stage:

\[W = \begin{bmatrix} (Z_D'Z_D / N)^{-1} & 0 \\ 0 & (Z_S'Z_S / N)^{-1} \end{bmatrix}. \tag{23}\]

With two-step GMM, \(W\) is updated before the second stage according to

\[W = S^{-1}. \tag{24}\]

For heteroscedasticity robust weighting matrices,

\[S = \frac{1}{N}\sum_{j,t} g_{jt}g_{jt}'. \tag{25}\]

For clustered weighting matrices with \(c = 1, 2, \dotsc, C\) clusters,

\[S = \frac{1}{N}\sum_{c=1}^C g_c g_c', \tag{26}\]

where, letting the set \(J_{ct} \subset J_t\) denote products in cluster \(c\) and market \(t\),

\[g_c = \sum_{t \in T} \sum_{j \in J_{ct}} g_{jt}. \tag{27}\]

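For unadjusted weighting matrices,

\[S = \frac{1}{N} \begin{bmatrix} \sigma_\xi^2 Z_D'Z_D & \sigma_{\xi\omega} Z_D'Z_S \\ \sigma_{\xi\omega} Z_S'Z_D & \sigma_\omega^2 Z_S'Z_S \end{bmatrix}, \tag{28}\]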

where \(\sigma_\xi^2\), \(\sigma_\omega^2\), and \(\sigma_{\xi\omega}\) are estimates of the variances and covariance between the structural errors.
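
The following demand-only sketch illustrates how the three moment covariance estimates differ. It assumes sample moments \(g_{jt} = Z_{D,jt}'\xi_{jt}\) and estimates \(\sigma_\xi^2\) with the uncentered mean of \(\xi_{jt}^2\); both the function names and that variance estimate are assumptions made for illustration.

```python
import numpy as np

def robust_S(Z, xi):
    """Heteroscedasticity robust: average outer product of g_jt = Z_jt * xi_jt."""
    g = Z * xi[:, None]
    return g.T @ g / Z.shape[0]

def clustered_S(Z, xi, cluster_ids):
    """Clustered: sum g_jt within clusters before taking outer products."""
    g = Z * xi[:, None]
    _, inverse = np.unique(cluster_ids, return_inverse=True)
    g_c = np.zeros((inverse.max() + 1, Z.shape[1]))
    np.add.at(g_c, inverse, g)
    return g_c.T @ g_c / Z.shape[0]

def unadjusted_S(Z, xi):
    """Unadjusted: sigma^2 * Z'Z / N, with an assumed uncentered variance estimate."""
    return np.mean(xi ** 2) * (Z.T @ Z) / Z.shape[0]

rng = np.random.default_rng(0)
N = 100
Z = rng.normal(size=(N, 3))
xi = rng.choice([-1.0, 1.0], size=N)        # errors with constant magnitude
W_first = np.linalg.inv(Z.T @ Z / N)        # 2SLS first-stage weighting matrix
W_second = np.linalg.inv(robust_S(Z, xi))   # two-step update W = S^{-1}
```

With errors of constant magnitude, the robust and unadjusted estimates coincide, and clustering with one observation per cluster reduces to the robust estimate.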

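Simulation error can be accounted for by resampling agents \(r = 1, \dots, R\) times, evaluating each \(\bar{g}_r\), and adding the following to \(S\):

\[\frac{1}{R - 1} \sum_{r=1}^R (\bar{g}_r - \bar{\bar{g}})(\bar{g}_r - \bar{\bar{g}})', \quad \bar{\bar{g}} = \frac{1}{R} \sum_{r=1}^R \bar{g}_r. \tag{29}\]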

### Standard Errors

An estimate of the asymptotic covariance matrix of \(\sqrt{N}(\hat{\theta} - \theta_0)\) is

\[(\bar{G}'W\bar{G})^{-1}\bar{G}'WSW\bar{G}(\bar{G}'W\bar{G})^{-1}. \tag{30}\]

Standard errors are the square root of the diagonal of this matrix divided by \(N\).

If the weighting matrix was chosen such that \(W = S^{-1}\), this simplifies to

\[(\bar{G}'W\bar{G})^{-1}. \tag{31}\]

Standard errors extracted from this simpler expression are called unadjusted.
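
A short numerical check of the simplification, assuming the usual GMM sandwich form for the covariance matrix. The function name `gmm_standard_errors` and the random inputs are hypothetical.

```python
import numpy as np

def gmm_standard_errors(G, W, S, N):
    """Square root of the diagonal of the sandwich covariance matrix, divided by N."""
    bread = np.linalg.inv(G.T @ W @ G)
    covariance = bread @ G.T @ W @ S @ W @ G @ bread
    return np.sqrt(np.diag(covariance) / N)

rng = np.random.default_rng(0)
N, M, P = 500, 5, 2
G = rng.normal(size=(M, P))                 # Jacobian of the moments
A = rng.normal(size=(M, M))
S = A @ A.T + M * np.eye(M)                 # a positive definite moment covariance
W = np.linalg.inv(S)                        # efficient weighting matrix

robust = gmm_standard_errors(G, W, S, N)
unadjusted = np.sqrt(np.diag(np.linalg.inv(G.T @ W @ G)) / N)
```

When `W` is exactly the inverse of `S`, the two sets of standard errors agree. With any other weighting matrix they generally differ.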

## Fixed Effects

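The unobserved product characteristics can be partitioned into

\[\begin{bmatrix} \xi_{jt} \\ \omega_{jt} \end{bmatrix} = \begin{bmatrix} \xi_{k_1} + \xi_{k_2} + \dotsb + \xi_{k_{E_D}} + \Delta\xi_{jt} \\ \omega_{\ell_1} + \omega_{\ell_2} + \dotsb + \omega_{\ell_{E_S}} + \Delta\omega_{jt} \end{bmatrix}, \tag{32}\]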

where \(k_1, k_2, \dotsc, k_{E_D}\) and \(\ell_1, \ell_2, \dotsc, \ell_{E_S}\) index unobserved characteristics that are fixed across \(E_D\) and \(E_S\) dimensions. For example, with \(E_D = 1\) dimension of product fixed effects, \(\xi_{jt} = \xi_j + \Delta\xi_{jt}\).

Small numbers of fixed effects can be estimated with dummy variables in \(X_1\), \(X_3\), \(Z_D\), and \(Z_S\). However, this approach does not scale with high dimensional fixed effects because it requires constructing and inverting an infeasibly large matrix in (15).

Instead, fixed effects are typically absorbed into \(X\), \(Z\), and \(Y(\theta)\) in (15). With one fixed effect, these matrices are simply de-meaned within each level of the fixed effect. Both \(X\) and \(Z\) can be de-meaned just once, but \(Y(\theta)\) must be de-meaned for each new \(\theta\).

This procedure is equivalent to replacing each column of the matrices with residuals from a regression of the column on the fixed effect. The Frisch-Waugh-Lovell (FWL) theorem of Frisch and Waugh (1933) and Lovell (1963) guarantees that using these residualized matrices gives the same results as including fixed effects as dummy variables. When \(E_D > 1\) or \(E_S > 1\), the matrices are residualized with more involved algorithms.
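
The FWL equivalence for a single fixed effect can be checked numerically. The sketch below uses ordinary least squares rather than IV-GMM to keep the example short; the helper `demean` and the synthetic data are hypothetical.

```python
import numpy as np

def demean(matrix, ids):
    """Subtract within-group column means, where groups are given by ids."""
    _, inverse = np.unique(ids, return_inverse=True)
    counts = np.bincount(inverse)
    sums = np.zeros((counts.size, matrix.shape[1]))
    np.add.at(sums, inverse, matrix)
    return matrix - (sums / counts[:, None])[inverse]

rng = np.random.default_rng(0)
N, groups = 120, 6
ids = np.arange(N) % groups                       # one fixed effect with six levels
x = rng.normal(size=(N, 1)) + ids[:, None]
y = 2.0 * x[:, 0] + 0.5 * ids + rng.normal(size=N)

# Approach 1: include a dummy variable for every level of the fixed effect.
dummies = (ids[:, None] == np.arange(groups)).astype(float)
coef_dummies = np.linalg.lstsq(np.hstack([x, dummies]), y, rcond=None)[0][0]

# Approach 2: absorb the fixed effect by de-meaning within each level.
coef_absorbed = np.linalg.lstsq(demean(x, ids), demean(y[:, None], ids)[:, 0], rcond=None)[0][0]
```

Both approaches give the same coefficient on `x`, but the second never builds the dummy matrix, which is what makes it feasible with many levels.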

Once fixed effects have been absorbed, estimation is as described above with the structural errors \(\Delta\xi\) and \(\Delta\omega\).

## Micro Moments

More detailed micro data on individual choices can be used to supplement the standard demand- and supply-side moments \(\bar{g}_D\) and \(\bar{g}_S\) in (11) with an additional \(m = 1, 2, \ldots, M_M\) micro moments, \(\bar{g}_M\), for a total of \(M = M_D + M_S + M_M\) moments:

\[\bar{g} = \begin{bmatrix} \bar{g}_D \\ \bar{g}_S \\ \bar{g}_M \end{bmatrix}. \tag{33}\]

Each micro moment \(m\) is the difference between an observed value \(f_m(\bar{v})\) and its simulated analogue \(f_m(v)\):

\[\bar{g}_{M,m} = f_m(\bar{v}) - f_m(v), \tag{34}\]

in which \(f_m(\cdot)\) is a function that maps a vector of \(p = 1, \ldots, P_M\) micro moment parts \(\bar{v} = (\bar{v}_1, \dots, \bar{v}_{P_M})'\) or \(v = (v_1, \dots, v_{P_M})'\) into a micro statistic. Each sample micro moment part \(p\) is an average over observations \(n \in N_{d_p}\) in the associated micro dataset \(d_p\):

\[\bar{v}_p = \frac{1}{N_{d_p}} \sum_{n \in N_{d_p}} v_{pi_nj_nt_n}. \tag{35}\]

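Its simulated analogue is

\[v_p = \frac{\sum_{t \in T} \sum_{i \in I_t} \sum_{j \in J_t \cup \{0\}} w_{it} s_{ijt} w_{d_pijt} v_{pijt}}{\sum_{t \in T} \sum_{i \in I_t} \sum_{j \in J_t \cup \{0\}} w_{it} s_{ijt} w_{d_pijt}}, \tag{36}\]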

in which \(w_{it} s_{ijt} w_{d_pijt}\) is the probability an observation in the micro dataset is for an agent \(i\) who chooses \(j\) in market \(t\).

The simplest type of micro moment is just an average over the entire sample, with \(f_m(v) = v_1\). For example, with \(v_{1ijt}\) equal to the income for an agent \(i\) who chooses \(j\) in market \(t\), micro moment \(m\) would match the average income in dataset \(d_p\). Observed values such as conditional expectations, covariances, correlations, or regression coefficients can be matched by choosing the appropriate function \(f_m\). For example, with \(v_{2ijt}\) equal to the interaction between income and an indicator for the choice of the outside option, and with \(v_{3ijt}\) equal to an indicator for the choice of the outside option, \(f_m(v) = v_2 / v_3\) would match an observed conditional mean income within those who choose the outside option.

A micro dataset \(d\), often a survey, is defined by survey weights \(w_{dijt}\). For example, \(w_{dijt} = 1\{j \neq 0, t \in T_d\}\) defines a micro dataset that is a selected sample of inside purchasers in a few markets \(T_d \subset T\), giving each market an equal sampling weight. Different micro datasets are independent.

A micro dataset will often admit multiple micro moment parts. Each micro moment part \(p\) is defined by its dataset \(d_p\) and micro values \(v_{pijt}\). For example, a micro moment part \(p\) with \(v_{pijt} = y_{it}x_{jt}\) delivers the mean \(\bar{v}_p\) or expectation \(v_p\) of an interaction between some demographic \(y_{it}\) and some product characteristic \(x_{jt}\).

A micro moment is a function of one or more micro moment parts. The simplest type is a function of only one micro moment part, and matches the simple average defined by the micro moment part. For example, \(f_m(v) = v_p\) with \(v_{pijt} = y_{it} x_{jt}\) matches the mean of an interaction between \(y_{it}\) and \(x_{jt}\). Non-simple averages such as conditional means, covariances, correlations, or regression coefficients can be matched by choosing an appropriate function \(f_m\). For example, \(f_m(v) = v_1 / v_2\) with \(v_{1ijt} = y_{it}x_{jt}1\{j \neq 0\}\) and \(v_{2ijt} = 1\{j \neq 0\}\) matches the conditional mean of an interaction between \(y_{it}\) and \(x_{jt}\) among those who do not choose the outside option \(j = 0\).
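
The conditional-mean example can be sketched for a single market as follows. The sketch assumes that each simulated micro moment part is a weighted average of micro values with weights \(w_{it} s_{ijt} w_{d_pijt}\); the function `simulated_part`, the random utilities, and the demographic and characteristic values are all hypothetical.

```python
import numpy as np

def simulated_part(values, probabilities, agent_weights, survey_weights):
    """Weighted average of micro values over agents (rows) and choices (columns)."""
    mass = agent_weights[:, None] * probabilities * survey_weights
    return (mass * values).sum() / mass.sum()

rng = np.random.default_rng(0)
I, J = 400, 3
exp_u = np.exp(rng.normal(size=(I, J)))
inside = exp_u / (1.0 + exp_u.sum(axis=1, keepdims=True))
probabilities = np.column_stack([1.0 - inside.sum(axis=1), inside])  # column 0 is j = 0
agent_weights = np.full(I, 1.0 / I)
survey_weights = np.ones((I, J + 1))        # every agent-choice pair is sampled
y = rng.lognormal(size=I)                   # demographic y_it
x = np.array([0.0, 1.0, 2.0, 3.0])          # characteristic x_jt, zero for j = 0
is_inside = np.array([0.0, 1.0, 1.0, 1.0])  # indicator 1{j != 0}

v1 = simulated_part(y[:, None] * x[None, :] * is_inside,
                    probabilities, agent_weights, survey_weights)
v2 = simulated_part(np.broadcast_to(is_inside, (I, J + 1)),
                    probabilities, agent_weights, survey_weights)
conditional_mean = v1 / v2                  # f_m(v) = v_1 / v_2
```

The ratio of the two parts equals the mean of the interaction among simulated inside purchasers, which is the quantity that would be compared with its observed counterpart.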

Micro moments are computed for each \(\theta\) and contribute to the GMM objective \(q(\theta)\) in (10). Their derivatives with respect to \(\theta\) are added as rows to \(\bar{G}\) in (19), and blocks are added to both \(W\) and \(S\) in (23) and (24). The covariance between standard moments and micro moments is zero, so these matrices are block-diagonal. The delta method delivers the covariance matrix for the micro moments:

\[S_M = \frac{\partial f(v)}{\partial v'} S_P \left(\frac{\partial f(v)}{\partial v'}\right)'. \tag{37}\]

The scaled covariance between micro moment parts \(p\) and \(q\) in \(S_P\) is zero if they are based on different micro datasets \(d_p \neq d_q\); otherwise, if based on the same dataset \(d_p = d_q = d\),

\[S_{P,pq} = \frac{N}{N_d} \text{Cov}(v_{pi_nj_nt_n}, v_{qi_nj_nt_n}), \tag{38}\]

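in which

\[\text{Cov}(v_{pi_nj_nt_n}, v_{qi_nj_nt_n}) = \frac{\sum_{t \in T} \sum_{i \in I_t} \sum_{j \in J_t \cup \{0\}} w_{it} s_{ijt} w_{dijt} (v_{pijt} - v_p)(v_{qijt} - v_q)}{\sum_{t \in T} \sum_{i \in I_t} \sum_{j \in J_t \cup \{0\}} w_{it} s_{ijt} w_{dijt}}. \tag{39}\]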

Micro moment parts based on second choice are averages over values \(v_{pijkt}\) where \(k\) indexes second choices, and are based on datasets defined by survey weights \(w_{dijkt}\). A sample micro moment part is

\[\bar{v}_p = \frac{1}{N_{d_p}} \sum_{n \in N_{d_p}} v_{pi_nj_nk_nt_n}. \tag{40}\]

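Its simulated analogue is

\[v_p = \frac{\sum_{t \in T} \sum_{i \in I_t} \sum_{j, k \in J_t \cup \{0\}} w_{it} s_{ijt} s_{ik(-j)t} w_{d_pijkt} v_{pijkt}}{\sum_{t \in T} \sum_{i \in I_t} \sum_{j, k \in J_t \cup \{0\}} w_{it} s_{ijt} s_{ik(-j)t} w_{d_pijkt}}, \tag{41}\]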

in which second choice probabilities are \(s_{ik(-j)t} = \frac{s_{ikt}}{1 - s_{ijt}}\) if \(k \neq j\) and zero if \(k = j\). Covariances are defined analogously.
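
The second choice probabilities for one agent can be laid out as a matrix whose entry in row \(j\) and column \(k\) is \(s_{ikt} / (1 - s_{ijt})\), with a zero diagonal. This is a minimal sketch; the function name and the example probabilities are hypothetical.

```python
import numpy as np

def second_choice_probabilities(s):
    """Entry [j, k] is the probability of choosing k when first choice j is removed."""
    matrix = s[None, :] / (1.0 - s[:, None])
    np.fill_diagonal(matrix, 0.0)           # the first choice cannot be chosen again
    return matrix

# One agent's choice probabilities over the outside good (index 0) and three products.
s_i = np.array([0.4, 0.3, 0.2, 0.1])
second = second_choice_probabilities(s_i)
```

Each row sums to one because removing option \(j\) rescales the remaining probabilities by \(1 - s_{ijt}\).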

## Random Coefficients Nested Logit

Incorporating parameters that measure within nesting group correlation gives the random coefficients nested logit (RCNL) model of Brenkers and Verboven (2006) and Grigolon and Verboven (2014). There are \(h = 1, 2, \dotsc, H\) nesting groups and each product \(j\) is assigned to a group \(h(j)\). The set \(J_{ht} \subset J_t\) denotes the products in group \(h\) and market \(t\).

In the RCNL model, idiosyncratic preferences are partitioned into

\[\epsilon_{ijt} = \bar{\epsilon}_{ih(j)t} + (1 - \rho_{h(j)})\bar{\epsilon}_{ijt}, \tag{42}\]

where \(\bar{\epsilon}_{ijt}\) is Type I Extreme Value and \(\bar{\epsilon}_{ih(j)t}\) is distributed such that \(\epsilon_{ijt}\) is still Type I Extreme Value.

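The nesting parameters, \(\rho\), can either be an \(H \times 1\) vector or a scalar so that for all groups \(\rho_h = \rho\). Letting \(\rho \to 0\) gives the standard BLP model and \(\rho \to 1\) gives division by zero errors. With \(\rho_h \in (0, 1)\), the expression for choice probabilities in (5) becomes more complicated:

\[s_{ijt} = \frac{\exp[V_{ijt} / (1 - \rho_{h(j)})]}{\exp[V_{ih(j)t} / (1 - \rho_{h(j)})]} \cdot \frac{\exp V_{ih(j)t}}{1 + \sum_{h \in H} \exp V_{iht}}, \tag{43}\]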

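where

\[V_{iht} = (1 - \rho_h) \log \sum_{k \in J_{ht}} \exp[V_{ikt} / (1 - \rho_h)], \quad V_{ijt} = \delta_{jt} + \mu_{ijt}. \tag{44}\]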

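The contraction for \(\delta(\theta)\) in (13) is also slightly different:

\[\delta_{jt} \leftarrow \delta_{jt} + (1 - \rho_{h(j)})[\log s_{jt} - \log s_{jt}(\delta, \theta)]. \tag{45}\]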

Otherwise, estimation is as described above with \(\rho\) included in \(\theta\).

## Logit and Nested Logit

Letting \(\Sigma = 0\) gives the simpler logit (or nested logit) model where there is a closed-form solution for \(\delta\). In the logit model,

\[\delta_{jt} = \log s_{jt} - \log s_{0t}, \tag{46}\]

and a lack of nonlinear parameters means that nonlinear optimization is often unneeded.
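
As a quick check of the closed form, the sketch below inverts plain logit shares with the standard inversion \(\delta_{jt} = \log s_{jt} - \log s_{0t}\), where \(s_{0t}\) is the outside share. The function names are hypothetical.

```python
import numpy as np

def logit_shares(delta):
    """Plain logit shares with the outside good's utility normalized to zero."""
    exp_delta = np.exp(delta)
    return exp_delta / (1.0 + exp_delta.sum())

def logit_delta(shares):
    """Closed-form mean utilities: log inside share minus log outside share."""
    return np.log(shares) - np.log(1.0 - shares.sum())

true_delta = np.array([0.5, -0.2, 0.1])
recovered = logit_delta(logit_shares(true_delta))
```

No iteration is needed here, which is why the contraction is skipped in this special case.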

In the nested logit model, \(\rho\) must be optimized over, but there is still a closed-form solution for \(\delta\):

\[\delta_{jt} = \log s_{jt} - \log s_{0t} - \rho_{h(j)}[\log s_{jt} - \log s_{h(j)t}], \tag{47}\]

where

\[s_{ht} = \sum_{j \in J_{ht}} s_{jt}. \tag{48}\]

In both models, a supply side can still be estimated jointly with demand. Estimation is as described above with a representative agent in each market: \(I_t = 1\) and \(w_1 = 1\).
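
The nested logit inversion can be verified in the same way. This sketch assumes a scalar nesting parameter and the standard nested logit share formula; the function names and group assignments are hypothetical.

```python
import numpy as np

def nested_logit_shares(delta, groups, rho):
    """Nested logit shares for a scalar nesting parameter rho in [0, 1)."""
    scaled = np.exp(delta / (1.0 - rho))
    _, inverse = np.unique(groups, return_inverse=True)
    group_sums = np.bincount(inverse, weights=scaled)
    inclusive = group_sums ** (1.0 - rho)
    group_shares = inclusive / (1.0 + inclusive.sum())
    return scaled / group_sums[inverse] * group_shares[inverse]

def nested_logit_delta(shares, groups, rho):
    """Closed form: log(s_j) - log(s_0) - rho * [log(s_j) - log(s_h(j))]."""
    _, inverse = np.unique(groups, return_inverse=True)
    group_shares = np.bincount(inverse, weights=shares)[inverse]
    outside_share = 1.0 - shares.sum()
    return (np.log(shares) - np.log(outside_share)
            - rho * (np.log(shares) - np.log(group_shares)))

true_delta = np.array([0.5, -0.2, 0.1, 0.3])
groups = np.array([0, 0, 1, 1])
rho = 0.4
shares = nested_logit_shares(true_delta, groups, rho)
recovered = nested_logit_delta(shares, groups, rho)
```

Setting `rho = 0.0` reduces both functions to their plain logit counterparts.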

## Equilibrium Prices

Counterfactual evaluation, synthetic data simulation, and optimal instrument generation often involve solving for prices implied by the Bertrand first order conditions in (7). Solving this system with Newton’s method is slow and iterating over \(p \leftarrow c + \eta(p)\) may not converge because it is not a contraction.

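Instead, Morrow and Skerlos (2011) reformulate the solution to (7):

\[p = c + \zeta, \quad \zeta = \Lambda^{-1}(\mathscr{H} \odot \Gamma)'(p - c) - \Lambda^{-1}s, \tag{49}\]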

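where \(\Lambda\) is a diagonal \(J_t \times J_t\) matrix approximated by

\[\Lambda_{jj} \approx \sum_{i \in I_t} w_{it} s_{ijt} \frac{\partial U_{ijt}}{\partial p_{jt}} \tag{50}\]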

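and \(\Gamma\) is a dense \(J_t \times J_t\) matrix approximated by

\[\Gamma_{jk} \approx \sum_{i \in I_t} w_{it} s_{ijt} s_{ikt} \frac{\partial U_{ikt}}{\partial p_{kt}}. \tag{51}\]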

Equilibrium prices are computed by iterating over the \(\zeta\)-markup equation in (49),

\[p \leftarrow c + \zeta(p), \tag{52}\]

which, unlike (7), is a contraction. Iteration terminates when the norm of firms’ first order conditions, \(||\Lambda(p)(p - c - \zeta(p))||\), is less than a small number.

If marginal costs depend on quantity, then they also depend on prices and need to be updated during each iteration: \(c_{jt} = c_{jt}(s_{jt}(p))\).
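
The \(\zeta\)-markup iteration can be sketched for a plain logit model with constant marginal costs, where \(\partial U_{ijt} / \partial p_{jt} = \alpha\), so that \(\Lambda_{jj} = \alpha s_j\) and \(\Gamma_{jk} = \alpha s_j s_k\). The plain logit setting, the function names, and the example numbers are assumptions made for illustration only.

```python
import numpy as np

def logit_shares(prices, x, alpha):
    """Plain logit shares with utility x_j + alpha * p_j and a zero-utility outside good."""
    exp_u = np.exp(x + alpha * prices)
    return exp_u / (1.0 + exp_u.sum())

def solve_prices(costs, x, alpha, ownership, tol=1e-10, max_iter=1000):
    """Iterate p <- c + zeta(p) until the first order conditions are nearly satisfied."""
    prices = costs.copy()
    for _ in range(max_iter):
        s = logit_shares(prices, x, alpha)
        lam = alpha * s                     # diagonal of Lambda
        gamma = alpha * np.outer(s, s)      # Gamma
        zeta = ((ownership * gamma).T @ (prices - costs) - s) / lam
        # Terminate on the norm of Lambda(p)(p - c - zeta(p)).
        if np.linalg.norm(lam * (prices - costs - zeta)) < tol:
            break
        prices = costs + zeta
    return prices

x = np.array([1.0, 1.5, 0.5])
costs = np.array([1.0, 1.2, 0.8])
alpha = -2.0
firm_ids = np.array([0, 0, 1])
ownership = (firm_ids[:, None] == firm_ids[None, :]).astype(float)
prices = solve_prices(costs, x, alpha, ownership)

# Cross-check against the Bertrand markup eta = Delta^{-1} s at the solution.
s = logit_shares(prices, x, alpha)
capital_delta = -ownership * (alpha * (np.diag(s) - np.outer(s, s)))
eta = np.linalg.solve(capital_delta, s)
```

At the fixed point, `prices - costs` should coincide with `eta`. If marginal costs depended on quantities, `costs` would need to be recomputed from the shares inside the loop.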