Download the Jupyter Notebook for this section: blp.ipynb
Random Coefficients Logit Tutorial with the Automobile Data¶
[1]:
import pyblp
import numpy as np
import pandas as pd
pyblp.options.digits = 2
pyblp.options.verbose = False
pyblp.__version__
[1]:
'0.7.0'
In this tutorial, we’ll use data from Berry, Levinsohn, and Pakes (1995) to solve the paper’s automobile problem.
Application of Random Coefficients Logit with the Automobile Data¶
This tutorial is similar to the fake cereal tutorial, but exhibits some other features of pyblp:
Incorporating a supply side into demand estimation.
Allowing for simple priceincome demographic effects.
Calculating clustered standard errors.
Loading the Data¶
We’ll use pandas to load two sets of data:
product_data
, which contains prices, shares, and other product characteristics.agent_data
, which contains draws from the distribution of heterogeneity.
[2]:
product_data = pd.read_csv(pyblp.data.BLP_PRODUCTS_LOCATION)
product_data.head()
[2]:
market_ids  clustering_ids  car_ids  firm_ids  region  shares  prices  hpwt  air  mpd  ...  demand_instruments2  demand_instruments3  demand_instruments4  demand_instruments5  supply_instruments0  supply_instruments1  supply_instruments2  supply_instruments3  supply_instruments4  supply_instruments5  

0  1971  AMGREM71  129  15  US  0.001051  4.935802  0.528997  0  1.888146  ...  0.566217  0.365328  0.659480  0.141017  0.011161  1.478879  0.546875  0.163302  0.833091  0.301411 
1  1971  AMHORN71  130  15  US  0.000670  5.516049  0.494324  0  1.935989  ...  0.566217  0.290959  0.173552  0.128205  0.079317  1.088327  0.546875  0.095609  0.390314  0.289947 
2  1971  AMJAVL71  132  15  US  0.000341  7.108642  0.467613  0  1.716799  ...  0.566217  0.599771  0.546387  0.002634  0.021034  0.609213  0.546875  0.449818  0.400461  0.434632 
3  1971  AMMATA71  134  15  US  0.000522  6.839506  0.426540  0  1.687871  ...  0.566217  0.620544  1.122968  0.089023  0.090014  0.207461  0.546875  0.454159  0.934641  0.331099 
4  1971  AMAMBS71  136  15  US  0.000442  8.928395  0.452489  0  1.504286  ...  0.566217  0.877198  1.258575  0.153840  0.038013  0.385211  0.546875  0.728959  1.146654  0.520555 
5 rows × 25 columns
The product_data
contains market IDs, product IDs, firm IDs, shares, prices, a number of product characteristics, and some precomputed excluded instruments. The product IDs are called clustering_ids
because they will be used to compute clustered standard errors. For more information about the instruments and the example data as a whole, refer to the data module.
The agent_data
contains market IDs, integration weights \(w_{ti}\), integration nodes \(\nu_{ti}\), and demographics \(d_{ti}\). Here we use \(I_{t} = 200\) equally weighted Monte Carlo draws per market.
In nonexample problems, it is usually a better idea to use many more draws, or a more sophisticated Integration configuration such as sparse grid quadrature.
[3]:
agent_data = pd.read_csv(pyblp.data.BLP_AGENTS_LOCATION)
agent_data.head()
[3]:
market_ids  weights  nodes0  nodes1  nodes2  nodes3  nodes4  income  

0  1971  0.005  0.548814  0.457760  0.564690  0.395537  0.392173  9.728478 
1  1971  0.005  0.715189  0.376918  0.839746  0.844017  0.041157  7.908957 
2  1971  0.005  0.602763  0.702335  0.376884  0.150442  0.923301  11.079404 
3  1971  0.005  0.544883  0.207324  0.499676  0.306309  0.406235  17.641671 
4  1971  0.005  0.423655  0.074280  0.081302  0.094570  0.944282  12.423995 
Setting up the Problem¶
Unlike the fake cereal problem, we won’t absorb any fixed effects in the automobile problem, so the linear part of demand \(X_1\) has more components. We also need to specify a formula for the random coefficients \(X_2\), including a random coefficient on the constant, which captures correlation among all inside goods.
The primary new addition to the model relative to the fake cereal problem is that we add a supply side formula for product characteristics that contribute to marginal costs, \(X_3\). The patsystyle formulas support functions of regressors such as the log
function used below.
We stack the three product formulations in order: \(X_1\), \(X_2\), and \(X_3\).
[4]:
product_formulations = (
pyblp.Formulation('1 + hpwt + air + mpd + space'),
pyblp.Formulation('1 + prices + hpwt + air + mpd + space'),
pyblp.Formulation('1 + log(hpwt) + air + log(mpg) + log(space) + trend')
)
product_formulations
[4]:
(1 + hpwt + air + mpd + space,
1 + prices + hpwt + air + mpd + space,
1 + log(hpwt) + air + log(mpg) + log(space) + trend)
The original specification for the automobile problem includes the term \(\log(y_i  p_j)\), in which \(y\) is income and \(p\) are prices. Instead of including this term, which gives rise to a host of numerical problems, we’ll follow Berry, Levinsohn, and Pakes (1999) and use its firstorder linear approximation, \(p_j / y_i\).
The agent formulation for demographics, \(d\), includes a column of \(1 / y_i\) values, which we’ll interact with \(p_j\). To do this, we will treat draws of \(y_i\) as demographic variables.
[5]:
agent_formulation = pyblp.Formulation('0 + I(1 / income)')
agent_formulation
[5]:
I(1 / income)
As in the cereal example, the Problem can be constructed by combining the product_formulations
, product_data
, agent_formulation
, and agent_data
.
[6]:
problem = pyblp.Problem(product_formulations, product_data, agent_formulation, agent_data)
problem
[6]:
Dimensions:
=======================================================
T N F I K1 K2 K3 D MD MS
         
20 2217 26 4000 5 6 6 1 11 12
=======================================================
Formulations:
======================================================================================
Column Indices: 0 1 2 3 4 5
      
X1: Linear Characteristics 1 hpwt air mpd space
X2: Nonlinear Characteristics 1 prices hpwt air mpd space
X3: Cost Characteristics 1 log(hpwt) air log(mpg) log(space) trend
d: Demographics 1/income
======================================================================================
The problem outputs a table of dimensions:
\(T\) denotes the number of markets.
\(N\) is the length of the dataset (the number of products across all markets).
\(F\) denotes the number of firms.
\(I = \sum_t I_t\) is the total number of agents across all markets (200 draws per market times 20 markets).
\(K_1\) is the number of linear demand characteristics.
\(K_2\) is the number of nonlinear demand characteristics.
\(K_3\) is the number of linear supply characteristics.
\(D\) is the number of demographic variables.
\(M_D\) is the number of demand instruments, including exogenous regressors.
\(M_S\) is the number of supply instruments, including exogenous regressors.
The formulations table describes all four formulas for linear characteristics, nonlinear characteristics, cost characteristics, and demographics.
Solving the Problem¶
The only remaining decisions are:
Choosing \(\Sigma\) and \(\Pi\) starting values, \(\Sigma_0\) and \(\Pi_0\).
Potentially choosing bounds for \(\Sigma\) and \(\Pi\).
Choosing a form for marginal costs, \(c_{jt}\): either a linear or loglinear functional form.
The decisions we will use are:
Use published estimates as our starting values in \(\Sigma_0\).
Interact the inverse of income, \(1 / y_i\), only with prices, and use the published estimate on \(\log(y_i  p_j)\) as our starting value for \(\alpha\) in \(\Pi_0\).
Bound \(\Sigma_0\) to be positive since it is a diagonal matrix where the diagonal consists of standard deviations.
Constrain the \(p_j / y_i\) coefficient to be negative. Specifically, we’ll use a bound that’s slightly smaller than zero because when the parameter is exactly zero, there are matrix inversion problems with estimating marginal costs.
When using a routine that supports bounds, Problem chooses some default bounds. These bounds are often very wide, so it’s usually a good idea to set your own more restrictive bounds.
[7]:
initial_sigma = np.diag([3.612, 0, 4.628, 1.818, 1.050, 2.056])
initial_pi = np.c_[[0, 43.501, 0, 0, 0, 0]]
sigma_bounds = (
np.zeros_like(initial_sigma),
np.diag([100, 0, 100, 100, 100, 100])
)
pi_bounds = (
np.c_[[0, 100, 0, 0, 0, 0]],
np.c_[[0, 0.1, 0, 0, 0, 0]]
)
Note that there are only 5 nonzeros on the diagonal of \(\Sigma\), which means that we only need 5 columns of integration nodes to integrate over these 5 dimensions of unobserved heterogeneity. Indeed, agent_data
contains exactly 5 columns of nodes. If we were to ignore the \(\log(y_i  p_j)\) term (by not configuring \(\Pi\)) and include a term on prices in \(\Sigma\) instead, we would have needed 6 columns of integration nodes in our agent_data
.
A linear marginal cost specification is the default setting, so we’ll need to use the costs_type
argument of Problem.solve to employ the loglinear specification used by Berry, Levinsohn, and Pakes (1995). A downside of this specification is that nonpositive estimated marginal costs can create problems for the optimization routine when computing \(\log c(\hat{\theta})\).
We’ll use the costs_bounds
argument to bound marginal costs from below by a small number.
Finally, as in the original paper, we’ll use W_type
and se_type
to cluster by product IDs, which were specified as clustering_ids
in product_data
.
[8]:
results = problem.solve(
initial_sigma,
initial_pi,
sigma_bounds=sigma_bounds,
pi_bounds=pi_bounds,
costs_type='log',
costs_bounds=(0.001, None),
W_type='clustered',
se_type='clustered'
)
results
[8]:
Problem Results Summary:
==================================================================================================================================
Smallest Largest Clipped
Computation GMM Optimization Objective Fixed Point Contraction Objective Gradient Hessian Hessian Marginal
Time Step Iterations Evaluations Iterations Evaluations Value Infinity Norm Eigenvalue Eigenvalue Costs
          
00:31:03 2 39 49 72322 217903 +2.1E+05 +2.8E+03 +1.8E+02 +1.8E+04 0
==================================================================================================================================
Nonlinear Coefficient Estimates (Robust SEs Adjusted for 999 Clusters in Parentheses):
=====================================================================================================
Sigma: 1 prices hpwt air mpd space  Pi: 1/income
         
1 +1.2E+00 +0.0E+00 +0.0E+00 +0.0E+00 +0.0E+00 +0.0E+00  1 +0.0E+00
(+2.1E+00) 

prices +0.0E+00 +0.0E+00 +0.0E+00 +0.0E+00 +0.0E+00  prices 1.2E+01
 (+4.8E+00)

hpwt +0.0E+00 +0.0E+00 +0.0E+00 +0.0E+00  hpwt +0.0E+00
(+1.0E+01) 

air +0.0E+00 +0.0E+00 +0.0E+00  air +0.0E+00
(+2.6E+00) 

mpd +2.6E+00 +0.0E+00  mpd +0.0E+00
(+1.4E+00) 

space +1.1E+00  space +0.0E+00
(+1.9E+00) 
=====================================================================================================
Beta Estimates (Robust SEs Adjusted for 999 Clusters in Parentheses):
==========================================================
1 hpwt air mpd space
    
7.6E+00 +4.8E+00 +2.6E+00 2.2E+00 +1.8E+00
(+1.2E+00) (+2.7E+00) (+1.4E+00) (+1.5E+00) (+1.3E+00)
==========================================================
Gamma Estimates (Robust SEs Adjusted for 999 Clusters in Parentheses):
======================================================================
1 log(hpwt) air log(mpg) log(space) trend
     
+2.3E+00 +6.4E01 +8.8E01 3.9E01 +2.8E01 +1.4E02
(+1.4E01) (+1.2E01) (+7.1E02) (+1.2E01) (+1.2E01) (+3.5E03)
======================================================================
There are some discrepancies between our results and the original paper. The instruments we constructed to are meant to mimic the original instruments, which we were unable to reconstruct perfectly. We also use different agent data and the firstorder linear approximation to the \(\ln(y_i  p_j)\) term.