Download the Jupyter Notebook for this section: blp.ipynb

Supply Side Tutorial with Automobile Data

[1]:
import pyblp
import numpy as np
import pandas as pd

pyblp.options.digits = 2
pyblp.options.verbose = False
pyblp.__version__
[1]:
'1.1.0'

In this tutorial, we’ll use data from Berry, Levinsohn, and Pakes (1995) to solve the paper’s automobile problem. This tutorial is similar to the fake cereal tutorial, but exhibits some other features of pyblp:

  • Incorporating a supply side into demand estimation.

  • Allowing for simple price-income demographic effects.

  • Calculating clustered standard errors.

Loading Data

We’ll use pandas to load two sets of data:

  1. product_data, which contains prices, shares, and other product characteristics.

  2. agent_data, which contains draws from the distribution of heterogeneity.

[2]:
product_data = pd.read_csv(pyblp.data.BLP_PRODUCTS_LOCATION)
product_data.head()
[2]:
market_ids clustering_ids car_ids firm_ids region shares prices hpwt air mpd ... supply_instruments2 supply_instruments3 supply_instruments4 supply_instruments5 supply_instruments6 supply_instruments7 supply_instruments8 supply_instruments9 supply_instruments10 supply_instruments11
0 1971 AMGREM71 129 15 US 0.001051 4.935802 0.528997 0 1.888146 ... 0.0 1.705933 1.595656 87.0 -61.959985 0.0 46.060389 29.786989 0.0 1.888146
1 1971 AMHORN71 130 15 US 0.000670 5.516049 0.494324 0 1.935989 ... 0.0 1.680910 1.490295 87.0 -61.959985 0.0 46.060389 29.786989 0.0 1.935989
2 1971 AMJAVL71 132 15 US 0.000341 7.108642 0.467613 0 1.716799 ... 0.0 1.801067 1.357703 87.0 -61.959985 0.0 46.060389 29.786989 0.0 1.716799
3 1971 AMMATA71 134 15 US 0.000522 6.839506 0.426540 0 1.687871 ... 0.0 1.818061 1.261347 87.0 -61.959985 0.0 46.060389 29.786989 0.0 1.687871
4 1971 AMAMBS71 136 15 US 0.000442 8.928395 0.452489 0 1.504286 ... 0.0 1.933210 1.237365 87.0 -61.959985 0.0 46.060389 29.786989 0.0 1.504286

5 rows × 33 columns

The product_data contains market IDs, product IDs, firm IDs, shares, prices, a number of product characteristics, and instruments. The product IDs are called clustering_ids because they will be used to compute clustered standard errors. For more information about the instruments and the example data as a whole, refer to the data module.
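
As a quick check (plain pandas, assuming the demand_instruments/supply_instruments column-name prefixes that pyblp's example data use), we can count the excluded instrument columns directly:

demand_columns = [c for c in product_data.columns if c.startswith('demand_instruments')]
supply_columns = [c for c in product_data.columns if c.startswith('supply_instruments')]
len(demand_columns), len(supply_columns)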

The agent_data contains market IDs, integration weights \(w_{it}\), integration nodes \(\nu_{it}\), and demographics \(d_{it}\). Here we use the \(I_t = 200\) importance sampling weights and nodes from the original paper.

In non-example problems, it is usually a better idea to use many more draws, or a more sophisticated Integration configuration such as sparse grid quadrature.
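
For reference, a minimal sketch of such a configuration (illustrative sizes, not part of this tutorial's replication) builds nodes and weights with pyblp.Integration, which Problem can accept in place of explicit nodes in agent_data:

# Sparse grid quadrature at an illustrative accuracy level
sparse_grid = pyblp.Integration('grid', size=7)

# Simple Monte Carlo draws with a fixed seed, as an alternative
monte_carlo = pyblp.Integration('monte_carlo', size=500, specification_options={'seed': 0})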

[3]:
agent_data = pd.read_csv(pyblp.data.BLP_AGENTS_LOCATION)
agent_data.head()
[3]:
market_ids weights nodes0 nodes1 nodes2 nodes3 nodes4 income
0 1971 0.000543 1.192188 0.478777 0.980830 -0.824410 2.473301 109.560369
1 1971 0.000723 1.497074 -2.026204 -1.741316 1.412568 -0.747468 45.457314
2 1971 0.000544 1.438081 0.813280 -1.749974 -1.203509 0.049558 127.146548
3 1971 0.000701 1.768655 -0.177453 0.286602 0.391517 0.683669 22.604045
4 1971 0.000549 0.849970 -0.135337 0.735920 1.036247 -1.143436 170.226032

Setting up the Problem

Unlike the fake cereal problem, we won’t absorb any fixed effects in the automobile problem, so the linear part of demand \(X_1\) has more components. We also need to specify a formula for the random coefficients \(X_2\), including a random coefficient on the constant, which captures correlation among all inside goods.

The primary new addition to the model relative to the fake cereal problem is that we add a supply side formula for product characteristics that contribute to marginal costs, \(X_3\). The patsy-style formulas support functions of regressors such as the log function used below.

We stack the three product formulations in order: \(X_1\), \(X_2\), and \(X_3\).

[4]:
product_formulations = (
   pyblp.Formulation('1 + hpwt + air + mpd + space'),
   pyblp.Formulation('1 + prices + hpwt + air + mpd + space'),
   pyblp.Formulation('1 + log(hpwt) + air + log(mpg) + log(space) + trend')
)
product_formulations
[4]:
(1 + hpwt + air + mpd + space,
 1 + prices + hpwt + air + mpd + space,
 1 + log(hpwt) + air + log(mpg) + log(space) + trend)

The original specification for the automobile problem includes the term \(\log(y_i - p_j)\), in which \(y\) is income and \(p\) are prices. Instead of including this term, which gives rise to a host of numerical problems, we’ll follow Berry, Levinsohn, and Pakes (1999) and use its first-order linear approximation, \(p_j / y_i\).

The agent formulation for demographics, \(d\), includes a column of \(1 / y_i\) values, which we’ll interact with \(p_j\). To do this, we will treat draws of \(y_i\) as demographic variables.

[5]:
agent_formulation = pyblp.Formulation('0 + I(1 / income)')
agent_formulation
[5]:
I(1 / income)

As in the cereal example, the Problem can be constructed by combining the product_formulations, product_data, agent_formulation, and agent_data. We’ll also choose the functional form of marginal costs \(c_{jt}\). A linear marginal cost specification is the default setting, so we’ll need to use the costs_type argument of Problem to employ the log-linear specification used by Berry, Levinsohn, and Pakes (1995).

When initializing the problem, we get a warning about integration weights not summing to one. This is because the above agent data were created by the original paper with importance sampling. To disable this warning, we could increase pyblp.options.weights_tol.
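
For illustration only (we leave the default in place so the warning is visible in the output below), the check could be disabled by raising the tolerance:

# Raising the tolerance (here to infinity) silences the integration weights warning
pyblp.options.weights_tol = np.inf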

[6]:
problem = pyblp.Problem(product_formulations, product_data, agent_formulation, agent_data, costs_type='log')
problem
Integration weights in the following markets sum to a value that differs from 1 by more than options.weights_tol: all markets. Sometimes this is fine, for example when weights were built with importance sampling. Otherwise, it is a sign that there is a data problem.
[6]:
Dimensions:
=======================================================
 T    N     F    I     K1    K2    K3    D    MD    MS
---  ----  ---  ----  ----  ----  ----  ---  ----  ----
20   2217  26   4000   5     6     6     1    13    18
=======================================================

Formulations:
=======================================================================================
       Column Indices:             0           1       2       3          4         5
-----------------------------  ----------  ---------  ----  --------  ----------  -----
 X1: Linear Characteristics        1         hpwt     air     mpd       space
X2: Nonlinear Characteristics      1        prices    hpwt    air        mpd      space
X3: Log Cost Characteristics       1       log(hpwt)  air   log(mpg)  log(space)  trend
       d: Demographics         1*1/income
=======================================================================================

The problem outputs a table of dimensions:

  • \(T\) denotes the number of markets.

  • \(N\) is the length of the dataset (the number of products across all markets).

  • \(F\) denotes the number of firms.

  • \(I = \sum_t I_t\) is the total number of agents across all markets (200 draws per market times 20 markets).

  • \(K_1\) is the number of linear demand characteristics.

  • \(K_2\) is the number of nonlinear demand characteristics.

  • \(K_3\) is the number of linear supply characteristics.

  • \(D\) is the number of demographic variables.

  • \(M_D\) is the number of demand instruments, including exogenous regressors.

  • \(M_S\) is the number of supply instruments, including exogenous regressors.

The formulations table describes all four formulas for demand-side linear characteristics, demand-side nonlinear characteristics, supply-side characteristics, and demographics.
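
These dimensions are also exposed as attributes on the problem instance. Assuming attribute names that mirror the table headers, a quick sanity check looks like:

# Should match the dimensions table above
(problem.T, problem.N, problem.F, problem.I,
 problem.K1, problem.K2, problem.K3, problem.D, problem.MD, problem.MS)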

Solving the Problem

The only remaining decisions are:

  • Choosing \(\Sigma\) and \(\Pi\) starting values, \(\Sigma_0\) and \(\Pi_0\).

  • Potentially choosing bounds for \(\Sigma\) and \(\Pi\).

The decisions we will use are:

  • Use published estimates as our starting values in \(\Sigma_0\).

  • Interact the inverse of income, \(1 / y_i\), only with prices, and use the published estimate on \(\log(y_i - p_j)\) as our starting value for \(\alpha\) in \(\Pi_0\).

  • Bound \(\Sigma_0\) to be positive since it is a diagonal matrix where the diagonal consists of standard deviations.

When using a routine that supports bounds, it’s usually a good idea to set your own more conservative bounds so that the routine doesn’t try out large parameter values that create numerical issues.

[7]:
initial_sigma = np.diag([3.612, 0, 4.628, 1.818, 1.050, 2.056])
initial_pi = np.c_[[0, -43.501, 0, 0, 0, 0]]

Note that there are only 5 nonzeros on the diagonal of \(\Sigma\), which means that we only need 5 columns of integration nodes to integrate over these 5 dimensions of unobserved heterogeneity. Indeed, agent_data contains exactly 5 columns of nodes. If we were to ignore the \(\log(y_i - p_j)\) term (by not configuring \(\Pi\)) and instead include a term on prices in \(\Sigma\), we would need 6 columns of integration nodes in our agent_data.
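
If we did want to impose our own bounds rather than rely on pyblp's defaults, a hedged sketch (illustrative values only, not what we pass to solve below) conforms them to the shapes of \(\Sigma_0\) and \(\Pi_0\):

# Illustrative bounds: keep estimated standard deviations in [0, 100] and the price-income
# coefficient in [-100, 0]; these would be passed to problem.solve as sigma_bounds and pi_bounds
example_sigma_bounds = (np.zeros_like(initial_sigma), np.full_like(initial_sigma, 100.0))
example_pi_bounds = (np.full_like(initial_pi, -100.0), np.zeros_like(initial_pi))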

A downside of the log-linear marginal costs specification is that nonpositive estimated marginal costs can create problems for the optimization routine when computing \(\log c(\hat{\theta})\). We’ll use the costs_bounds argument to bound marginal costs from below by a small number.

Finally, as in the original paper, we’ll use W_type and se_type to cluster by product IDs, which were specified as clustering_ids in product_data, and set initial_update=True to update the initial GMM weighting matrix and the mean utility at the starting parameter values.

[8]:
results = problem.solve(
    initial_sigma,
    initial_pi,
    costs_bounds=(0.001, None),
    W_type='clustered',
    se_type='clustered',
    initial_update=True,
)
results
[8]:
Problem Results Summary:
=======================================================================================================================
GMM   Objective    Projected    Reduced Hessian  Reduced Hessian  Clipped  Clipped  Weighting Matrix  Covariance Matrix
Step    Value    Gradient Norm  Min Eigenvalue   Max Eigenvalue   Shares    Costs   Condition Number  Condition Number
----  ---------  -------------  ---------------  ---------------  -------  -------  ----------------  -----------------
 2    +5.0E+02     +1.1E-08        +4.9E-01         +5.1E+02         0        0         +4.2E+09          +3.8E+08
=======================================================================================================================

Cumulative Statistics:
===========================================================================
Computation  Optimizer  Optimization   Objective   Fixed Point  Contraction
   Time      Converged   Iterations   Evaluations  Iterations   Evaluations
-----------  ---------  ------------  -----------  -----------  -----------
 00:10:14       No           58           126         36807       112905
===========================================================================

Nonlinear Coefficient Estimates (Robust SEs Adjusted for 999 Clusters in Parentheses):
===================================================================================================
Sigma:      1        prices      hpwt        air         mpd        space     |   Pi:    1*1/income
------  ----------  --------  ----------  ----------  ----------  ----------  |  ------  ----------
  1      +2.0E+00                                                             |    1      +0.0E+00
        (+6.1E+00)                                                            |
                                                                              |
prices   +0.0E+00   +0.0E+00                                                  |  prices   -4.5E+01
                                                                              |          (+9.2E+00)
                                                                              |
 hpwt    +0.0E+00   +0.0E+00   +6.1E+00                                       |   hpwt    +0.0E+00
                              (+2.2E+00)                                      |
                                                                              |
 air     +0.0E+00   +0.0E+00   +0.0E+00    +4.0E+00                           |   air     +0.0E+00
                                          (+2.1E+00)                          |
                                                                              |
 mpd     +0.0E+00   +0.0E+00   +0.0E+00    +0.0E+00    +2.5E-01               |   mpd     +0.0E+00
                                                      (+5.5E-01)              |
                                                                              |
space    +0.0E+00   +0.0E+00   +0.0E+00    +0.0E+00    +0.0E+00    +1.9E+00   |  space    +0.0E+00
                                                                  (+1.1E+00)  |
===================================================================================================

Beta Estimates (Robust SEs Adjusted for 999 Clusters in Parentheses):
==========================================================
    1          hpwt        air         mpd        space
----------  ----------  ----------  ----------  ----------
 -7.3E+00    +3.5E+00    -1.0E+00    +4.2E-01    +4.2E+00
(+2.8E+00)  (+1.4E+00)  (+2.1E+00)  (+2.5E-01)  (+6.6E-01)
==========================================================

Gamma Estimates (Robust SEs Adjusted for 999 Clusters in Parentheses):
======================================================================
    1       log(hpwt)      air       log(mpg)   log(space)    trend
----------  ----------  ----------  ----------  ----------  ----------
 +2.8E+00    +9.0E-01    +4.2E-01    -5.2E-01    -2.6E-01    +2.7E-02
(+1.2E-01)  (+7.2E-02)  (+8.7E-02)  (+7.3E-02)  (+2.1E-01)  (+3.1E-03)
======================================================================

There are some discrepancies between our estimates and those reported in the original paper, but the results are broadly similar.
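
To compare estimates with the published values more directly, the estimated arrays can be pulled off of the results object (attribute names as documented for pyblp's ProblemResults):

# Final parameter estimates: linear demand, supply, random coefficient, and demographic terms
results.beta, results.gamma, results.sigma, results.pi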