# Logit and Nested Logit Tutorial¶

[1]:
import pyblp
import numpy as np
import pandas as pd

pyblp.options.digits = 2
pyblp.options.verbose = False
pyblp.__version__
[1]:
'0.7.0'

In this tutorial, we’ll use data from Nevo (2000) to solve the paper’s fake cereal problem. Locations of CSV files that contain the data are in the data module.

We will compare two simple models, the plain (IIA) logit model and the nested logit (GEV) model using the fake cereal dataset of Nevo (2000).

## Theory of Plain Logit¶

Let’s start with the plain logit model under independence of irrelevant alternatives (IIA). In this model (indirect) utility is given by

(1)$U_{jti} = \alpha p_{jt} + x_{jt} \beta^x + \xi_{jt} + \epsilon_{jti},$

where $$\epsilon_{jti}$$ is distributed IID with the Type I Extreme Value (Gumbel) distribution. It is common to normalize the mean utility of the outside good to zero so that $$U_{0ti} = \epsilon_{0ti}$$. This gives us aggregate marketshares

(2)$s_{jt} = \frac{\exp(\alpha p_{jt} + x_{jt} \beta^x + \xi_{jt})}{1 + \sum_k \exp(\alpha p_{jt} + x_{kt} \beta^x + \xi_{kt})}.$

If we take logs we get

(3)$\log s_{jt} = \alpha p_{jt} + x_{jt} \beta^x + \xi_{jt} - 0 - \log \sum_k \exp(\alpha p_{jt} + x_{kt} \beta^x + \xi_{kt})$

and

(4)$\log s_{0t} = 0 - \log \sum_k \exp(\alpha p_{jt} + x_{kt} \beta^x + \xi_{kt}).$

By differencing the above we get a linear estimating equation:

(5)$\log s_{jt} - \log s_{0t} = \alpha p_{jt} + x_{jt} \beta^x + \xi_{jt}.$

Because the left hand side is data, we can estimate this model using linear IV GMM.

## Application of Plain Logit¶

A Logit Problem can be created by simply excluding the formulation for the nonlinear parameters, $$X_2$$, along with any agent information. In other words, it requires only specifying the linear component of demand.

We’ll set up and solve a simple version of the fake data cereal problem from Nevo (2000). Since we won’t include any nonlinear characteristics or parameters, we don’t have to worry about configuring an optimization routine.

There are some important reserved variable names:

• market_ids are the unique market identifiers which we subscript with $$t$$.

• shares specifies the marketshares which need to be between zero and one, and within a market ID, $$\sum_{j} s_{jt} \leq 1$$.

• prices are prices $$p_{jt}$$. These have some special properties and are always treated as endogenous.

• demand_instruments0, demand_instruments1, and so on are numbered demand instruments. These represent only the excluded instruments. The exogenous regressors in $$X_1$$ will be automatically added to the set of instruments.

We begin with two steps:

1. Load the product data which at a minimum consists of market_ids, shares, prices, and at least a single column of demand instruments, demand_instruments0.

2. Define a Formulation for the $$X_1$$ (linear) demand model.

• This and all other formulas are similar to R and patsy formulas.

• It includes a constant by default. To exclude the constant, specify either a 0 or a -1.

• To efficiently include fixed effects, use the absorb option and specify which categorical variables you would like to absorb.

• Some model reduction may happen automatically. The constant will be excluded if you include fixed effects and some precautions are taken against collinearity. However, you will have to make sure that differently-named variables are not collinear.

3. Combine the Formulation and product data to construct a Problem.

4. Use Problem.solve to estimate paramters.

The product_data argument of Problem should be a structured array-like object with fields that store data. Product data can be a structured NumPy array, a pandas DataFrame, or other similar objects.

[2]:
[2]:
market_ids city_ids quarter product_ids firm_ids brand_ids shares prices sugar mushy ... demand_instruments10 demand_instruments11 demand_instruments12 demand_instruments13 demand_instruments14 demand_instruments15 demand_instruments16 demand_instruments17 demand_instruments18 demand_instruments19
0 C01Q1 1 1 F1B04 1 4 0.012417 0.072088 2 1 ... 2.116358 -0.154708 -0.005796 0.014538 0.126244 0.067345 0.068423 0.034800 0.126346 0.035484
1 C01Q1 1 1 F1B06 1 6 0.007809 0.114178 18 1 ... -7.374091 -0.576412 0.012991 0.076143 0.029736 0.087867 0.110501 0.087784 0.049872 0.072579
2 C01Q1 1 1 F1B07 1 7 0.012995 0.132391 4 1 ... 2.187872 -0.207346 0.003509 0.091781 0.163773 0.111881 0.108226 0.086439 0.122347 0.101842
3 C01Q1 1 1 F1B09 1 9 0.005770 0.130344 3 0 ... 2.704576 0.040748 -0.003724 0.094732 0.135274 0.088090 0.101767 0.101777 0.110741 0.104332
4 C01Q1 1 1 F1B11 1 11 0.017934 0.154823 12 0 ... 1.261242 0.034836 -0.000568 0.102451 0.130640 0.084818 0.101075 0.125169 0.133464 0.121111

5 rows × 30 columns

The product data contains market_ids, product_ids, firm_ids, shares, prices, a number of other IDs and product characteristics, and some pre-computed excluded demand_instruments0, demand_instruments1, and so on. The product_ids will be incorporated as fixed effects.

For more information about the instruments and the example data as a whole, refer to the data module.

### Setting Up the Problem¶

We can combine the Formulation and product_data to construct a Problem. We pass the Formulation first and the product_data second. We can also display the properties of the problem by typing its name.

[3]:
logit_formulation = pyblp.Formulation('prices', absorb='C(product_ids)')
logit_formulation
[3]:
prices + Absorb[C(product_ids)]
[4]:
problem = pyblp.Problem(logit_formulation, product_data)
problem
[4]:
Dimensions:
================================
T    N     F    K1    MD    ED
---  ----  ---  ----  ----  ----
94   2256   5    1     20    1
================================

Formulations:
=====================================
Column Indices:           0
-----------------------------  ------
X1: Linear Characteristics    prices
=====================================

Two sets of properties are displayed:

1. Dimensions of the data.

2. Formulations of the problem.

The dimensions describe the shapes of matrices as laid out in Notation. They include:

• $$T$$ is the number of markets.

• $$N$$ is the length of the dataset (the number of products across all markets).

• $$F$$ is the number of firms, which we won’t use in this example.

• $$K_1$$ is the dimension of the linear demand parameters.

• $$M_D$$ is the dimension of the instrument variables (excluded instruments and exogenous regressors).

• $$E_D$$ is the number of fixed effect dimensions (one-dimensional fixed effects, two-dimensional fixed effects, etc.).

There is only a single Formulation for this model.

• $$X_1$$ is the linear component of utility for demand and depends only on prices (after the fixed effects are removed).

### Solving the Problem¶

The Problem.solve method always returns a ProblemResults class, which can be used to compute post-estimation outputs. See the post estimation tutorial for more information.

[5]:
logit_results = problem.solve()
logit_results
[5]:
Problem Results Summary:
=========================================
Computation  GMM    Objective   Objective
Time      Step  Evaluations    Value
-----------  ----  -----------  ---------
00:00:01     2         2       +4.2E+05
=========================================

Beta Estimates (Robust SEs in Parentheses):
==========
prices
----------
-3.0E+01
(+1.0E+00)
==========

## Theory of Nested Logit¶

We can extend the logit model to allow for correlation within a group $$h$$ so that

(6)$U_{jti} = \alpha p_{jt} + x_{jt} \beta^x + \xi_{jt} + \bar{\epsilon}_{h(j)ti} + (1 - \rho) \bar{\epsilon}_{jti}.$

Now, we require that $$\epsilon_{jti} = \bar{\epsilon}_{h(j)ti} + (1 - \rho) \bar{\epsilon}_{jti}$$ is distributed IID with the Type I Extreme Value (Gumbel) distribution. As $$\rho \rightarrow 1$$, all consumers stay within their group. As $$\rho \rightarrow 0$$, this collapses to the IIA logit. Note that if we wanted, we could allow $$\rho$$ to differ between groups with the notation $$\rho_{h(j)}$$.

This gives us aggregate marketshares as the product of two logits, the within group logit and the across group logit:

(7)$s_{jt} = \frac{\exp[V_{jt} / (1 - \rho)]}{\exp[V_{h(j)t} / (1 - \rho)]}\cdot\frac{\exp V_{h(j)t}}{1 + \sum_h \exp V_{ht}},$

where $$V_{jt} = \alpha p_{jt} + x_{jt} \beta^x + \xi_{jt}$$.

After some work we again obtain the linear estimating equation:

(8)$\log s_{jt} - \log s_{0t} = \alpha p_{jt}+ x_{jt} \beta^x +\rho \log s_{j|h(j)t} + \xi_{jt},$

where $$s_{j|h(j)t} = s_{jt} / s_{h(j)t}$$ and $$s_{h(j)t}$$ is the share of group $$h$$ in market $$t$$. See Berry (1994) or Cardell (1997) for more information.

Again, the left hand side is data, though the $$\ln s_{j|h(j)t}$$ is clearly endogenous which means we must instrument for it. Rather than include $$\ln s_{j|h(j)t}$$ along with the linear components of utility, $$X_1$$, whenever nesting_ids are included in product_data, $$\rho$$ is treated as a nonlinear $$X_2$$ parameter. This means that the linear component is given instead by

(9)$\log s_{jt} - \log s_{0t} - \rho \log s_{j|h(j)t} = \alpha p_{jt} + x_{jt} \beta^x + \xi_{jt}.$

This is done for two reasons:

1. It forces the user to treat $$\rho$$ as an endogenous parameter.

2. It extends much more easily to the RCNL model of Brenkers and Verboven (2006).

A common choice for an additional instrument is the number of products per nest.

## Application of Nested Logit¶

By including nesting_ids (another reserved name) as a field in product_data, we tell the package to estimate a nested logit model, and we don’t need to change any of the formulas. We show how to construct the category groupings in two different ways:

1. We put all products in a single nest (only the outside good in the other nest).

2. We put products into two nests (either mushy or non-mushy).

We also construct an additional instrument based on the number of products per nest. Typically this is useful as a source of exogenous variation in the within group share $$\ln s_{j|h(j)t}$$. However, in this example because the number of products per nest does not vary across markets, if we include product fixed effects, this instrument is irrelevant.

We’ll define a function that constructs the additional instrument and solves the nested logit problem. We’ll exclude product ID fixed effects, which are collinear with mushy, and we’ll choose $$\rho = 0.7$$ as the initial value at which the optimization routine will start.

[6]:
def solve_nl(df):
groups = df.groupby(['market_ids', 'nesting_ids'])
df['demand_instruments20'] = groups['shares'].transform(np.size)
nl_formulation = pyblp.Formulation('0 + prices')
problem = pyblp.Problem(nl_formulation, df)
return problem.solve(rho=0.7)

First, we’ll solve the problem when there’s a single nest for all products, with the outside good in its own nest.

[7]:
df1 = product_data.copy()
df1['nesting_ids'] = 1
nl_results1 = solve_nl(df1)
nl_results1
[7]:
Problem Results Summary:
==================================================================================
Computation  GMM   Optimization   Objective   Objective    Gradient      Hessian
Time      Step   Iterations   Evaluations    Value    Infinity Norm  Eigenvalue
-----------  ----  ------------  -----------  ---------  -------------  ----------
00:00:11     2         3             8       +4.6E+05     +3.1E-06      +2.4E+07
==================================================================================

Rho Estimates (Robust SEs in Parentheses):
==========
All Groups
----------
+9.8E-01
(+1.4E-02)
==========

Beta Estimates (Robust SEs in Parentheses):
==========
prices
----------
-1.2E+00
(+4.0E-01)
==========

When we inspect the Problem, the only changes from the plain logit model is the additional instrument that contributes to $$M_D$$ and the inclusion of $$H$$, the number of nesting categories.

[8]:
nl_results1.problem
[8]:
Dimensions:
===============================
T    N     F    K1    MD    H
---  ----  ---  ----  ----  ---
94   2256   5    1     21    1
===============================

Formulations:
=====================================
Column Indices:           0
-----------------------------  ------
X1: Linear Characteristics    prices
=====================================

Next, we’ll solve the problem when there are two nests for mushy and non-mushy.

[9]:
df2 = product_data.copy()
df2['nesting_ids'] = df2['mushy']
nl_results2 = solve_nl(df2)
nl_results2
[9]:
Problem Results Summary:
==================================================================================
Computation  GMM   Optimization   Objective   Objective    Gradient      Hessian
Time      Step   Iterations   Evaluations    Value    Infinity Norm  Eigenvalue
-----------  ----  ------------  -----------  ---------  -------------  ----------
00:00:13     2         3             8       +1.6E+06     +9.4E-06      +1.3E+07
==================================================================================

Rho Estimates (Robust SEs in Parentheses):
==========
All Groups
----------
+8.9E-01
(+1.9E-02)
==========

Beta Estimates (Robust SEs in Parentheses):
==========
prices
----------
-7.8E+00
(+4.8E-01)
==========

For both cases we find that $$\hat{\rho} > 0.8$$.

Finally, we’ll also look at the adjusted parameter on prices, $$\alpha / (1-\rho)$$.

[10]:
nl_results1.beta[0] / (1 - nl_results1.rho)
[10]:
array([[-67.39338888]])
[11]:
nl_results2.beta[0] / (1 - nl_results2.rho)
[11]:
array([[-72.27074638]])

### Treating Within Group Shares as Exogenous¶

The package is designed to prevent the user from treating the within group share, $$\log s_{j|h(j)t}$$, as an exogenous variable. For example, if we were to compute a group_share variable and use the algebraic functionality of Formulation by including the expression log(shares / group_share) in our formula for $$X_1$$, the package would raise an error because the package knows that shares should not be included in this formulation.

To demonstrate why this is a bad idea, we’ll override this feature by calculating $$\log s_{j|h(j)t}$$ and including it as an additional variable in $$X_1$$. To do so, we’ll first re-define our function for setting up and solving the nested logit problem.

[12]:
def solve_nl2(df):
groups = df.groupby(['market_ids', 'nesting_ids'])
df['group_share'] = groups['shares'].transform(np.sum)
df['within_share'] = df['shares'] / df['group_share']
df['demand_instruments20'] = groups['shares'].transform(np.size)
nl2_formulation = pyblp.Formulation('0 + prices + log(within_share)')
problem = pyblp.Problem(nl2_formulation, df.drop(columns=['nesting_ids']))
return problem.solve()

Again, we’ll solve the problem when there’s a single nest for all products, with the outside good in its own nest.

[13]:
nl2_results1 = solve_nl2(df1)
nl2_results1
[13]:
Problem Results Summary:
=========================================
Computation  GMM    Objective   Objective
Time      Step  Evaluations    Value
-----------  ----  -----------  ---------
00:00:01     2         2       +4.6E+05
=========================================

Beta Estimates (Robust SEs in Parentheses):
=============================
prices    log(within_share)
----------  -----------------
-1.0E+00       +9.9E-01
(+2.4E-01)     (+7.9E-03)
=============================

And again, we’ll solve the problem when there are two nests for mushy and non-mushy.

[14]:
nl2_results2 = solve_nl2(df2)
nl2_results2
[14]:
Problem Results Summary:
=========================================
Computation  GMM    Objective   Objective
Time      Step  Evaluations    Value
-----------  ----  -----------  ---------
00:00:01     2         2       +1.6E+06
=========================================

Beta Estimates (Robust SEs in Parentheses):
=============================
prices    log(within_share)
----------  -----------------
-6.8E+00       +9.3E-01
(+2.9E-01)     (+1.1E-02)
=============================

One can observe that we obtain parameter estimates which are quite different than above.

[15]:
nl2_results1.beta[0] / (1 - nl2_results1.beta[1])
[15]:
array([-86.37368446])
[16]:
nl2_results2.beta[0] / (1 - nl2_results2.beta[1])
[16]:
array([-100.14496892])