pyblp.Simulation¶

class pyblp.Simulation(product_formulations, product_data, beta, sigma=None, pi=None, gamma=None, rho=None, phi=None, agent_formulation=None, agent_data=None, integration=None, xi=None, omega=None, xi_variance=1, omega_variance=1, correlation=0.9, rc_types=None, epsilon_scale=1.0, costs_type='linear', seed=None)¶

Simulation of data in BLP-type models.

Any data left unspecified are simulated during initialization. Simulated prices and shares can be replaced by Simulation.replace_endogenous() with equilibrium values that are consistent with true parameters. Less commonly, simulated exogenous variables can be replaced instead by Simulation.replace_exogenous(). To choose your own prices, refer to the first note in Simulation.replace_endogenous(). Simulations are typically used for two purposes:

- Solving for equilibrium prices and shares under more complicated counterfactuals than is possible with ProblemResults.compute_prices() and ProblemResults.compute_shares(). For example, this class can be initialized with estimated parameters, structural errors, and marginal costs from a ProblemResults, but with changed data (fewer products, new products, different characteristics, etc.), and Simulation.replace_endogenous() can then be used to compute the corresponding prices and shares.

- Simulation of BLP-type models from scratch. For example, a model with fixed true parameters can be simulated many times, converted into problems with SimulationResults.to_problem(), and solved with Problem.solve() to evaluate in a Monte Carlo study how well the true parameters can be recovered.

If data for variables (used to formulate product characteristics in \(X_1\), \(X_2\), and \(X_3\), as well as agent demographics, \(d\), and endogenous prices and market shares \(p\) and \(s\)) are not provided, the values for each unspecified variable are drawn independently from the standard uniform distribution. In each market \(t\), market shares are divided by the number of products in the market \(J_t\). Typically, Simulation.replace_endogenous() is used to replace prices and shares with equilibrium values that are consistent with true parameters.
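The default draws described above can be sketched with NumPy alone. This is an illustration of the stated rule (independent standard uniform draws, with shares divided by \(J_t\)), not the library's internal code; the market structure and variable names are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical market structure: three products in market 0, two in market 1.
market_ids = np.array([0, 0, 0, 1, 1])

# Unspecified exogenous variables and prices: independent standard uniform draws.
x = rng.uniform(size=market_ids.size)
prices = rng.uniform(size=market_ids.size)

# Number of products J_t in each row's market.
unique_ids, counts = np.unique(market_ids, return_counts=True)
J = counts[np.searchsorted(unique_ids, market_ids)]

# Placeholder shares: uniform draws divided by J_t, which keeps the sum of
# inside shares in each market below one.
shares = rng.uniform(size=market_ids.size) / J

for t in unique_ids:
    print(t, shares[market_ids == t].sum())
```

Dividing by \(J_t\) leaves room for the outside good in every market, so the placeholder shares are valid even before they are replaced with equilibrium values.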

If data for unobserved demand- and supply-side product characteristics, \(\xi\) and \(\omega\), are not provided, they are by default drawn from a mean-zero bivariate normal distribution.

After variables are loaded or simulated, any unspecified integration nodes and weights, \(\nu\) and \(w\), are constructed according to a specified Integration configuration.

Parameters

product_formulations (Formulation or sequence of Formulation) –
Formulation configuration or a sequence of up to three Formulation configurations for the matrix of demand-side linear product characteristics, \(X_1\), for the matrix of demand-side nonlinear product characteristics, \(X_2\), and for the matrix of supply-side characteristics, \(X_3\), respectively. If the formulation for \(X_2\) is not specified or is None, the logit (or nested logit) model will be simulated.

The shares variable should not be included in the formulations for \(X_1\) or \(X_2\). If shares is included in the formulation for \(X_3\) and product_data does not include shares, one will likely want to set constant_costs=False in Simulation.replace_endogenous().

The prices variable should not be included in the formulation for \(X_3\), but it should be included in the formulation for \(X_1\) or \(X_2\) (or both). Variables that cannot be loaded from product_data will be drawn from independent standard uniform distributions. Unlike in Problem, fixed effect absorption is not supported during simulation.

Warning

Characteristics that involve prices, \(p\), or shares, \(s\), should always be formulated with the prices and shares variables, respectively. If another name is used, Simulation will not understand that the characteristic is endogenous. For example, to include a \(p^2\) characteristic, include I(prices**2) in a formula instead of manually constructing and including a prices_squared variable.
product_data (structured array-like) –
Each row corresponds to a product. Markets can have differing numbers of products. The convenience function build_id_data() can be used to construct the following required ID data:
- market_ids : (object) - IDs that associate products with markets.
- firm_ids : (object) - IDs that associate products with firms.
Custom ownership matrices can be specified as well:
- ownership : (numeric, optional) - Custom stacked \(J_t \times J_t\) ownership or product holding matrices, \(\mathscr{H}\), for each market \(t\), which can be built with build_ownership(). By default, to reduce memory usage, standard ownership matrices are built only when they are needed. If specified, there should be as many columns as there are products in the market with the most products. Rightmost columns in markets with fewer products will be ignored.
Note

The ownership field can either be a matrix or can be broken up into multiple one-dimensional fields with column index suffixes that start at zero. For example, if there are three products in each market, an ownership field with three columns can be replaced by three one-dimensional fields: ownership0, ownership1, and ownership2.
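The stacked layout and the split into suffixed fields can be sketched as follows. This is a NumPy-only illustration of standard ownership (an entry is one when two products share a firm), not the implementation of build_ownership(); the helper name stacked_ownership and the use of NaN to fill ignored columns are assumptions made here.

```python
import numpy as np

def stacked_ownership(market_ids, firm_ids):
    """Stack one J_t x J_t standard ownership matrix per market."""
    market_ids = np.asarray(market_ids)
    firm_ids = np.asarray(firm_ids)
    unique_ids, counts = np.unique(market_ids, return_counts=True)

    # As many columns as there are products in the largest market.
    ownership = np.full((market_ids.size, counts.max()), np.nan)
    for t in unique_ids:
        rows = np.flatnonzero(market_ids == t)
        firms = firm_ids[rows]
        # Entry (j, k) is one when products j and k belong to the same firm.
        ownership[rows, :rows.size] = firms[:, None] == firms[None, :]
    return ownership

# Hypothetical data: three products in market 0, two in market 1.
ownership = stacked_ownership([0, 0, 0, 1, 1], ['f', 'f', 'g', 'f', 'g'])

# Equivalent one-dimensional fields with zero-based column suffixes.
fields = {f'ownership{k}': ownership[:, k] for k in range(ownership.shape[1])}
```

In the second market, the third column is filler and corresponds to the rightmost columns that are ignored.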

To simulate a nested logit or random coefficients nested logit (RCNL) model, nesting groups must be specified:
- nesting_ids : (object, optional) - IDs that associate products with nesting groups. When these IDs are specified, rho must be specified as well.
It may be convenient to define IDs for different products:
- product_ids : (object, optional) - IDs that identify products within markets. There can be multiple columns.
To specify unobservable autocorrelation with phi, indices that define lags of the data must be specified:
- lag_indices : (int, optional) - Indices that take on values from \(0\) to \(N - 1\), which define the lag operator \(L\) on the data. For example, if markets \(t\) are simply time periods and the identities of products \(j\) are persistent across periods, then \(L x_{jt} = x_{j,t-1}\).
  
  A value equal to the current row's own index indicates that this is the initial period for a product. Otherwise, the value should be the index of the row that is the lagged version of the current row.
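One way to construct such indices is sketched below for a hypothetical panel in which markets are consecutive integer time periods and product IDs persist across them. The variable names are illustrative.

```python
import numpy as np

# Hypothetical panel: product 'a' appears in periods 0, 1, and 2, and
# product 'b' appears in periods 0 and 1.
periods = np.array([0, 0, 1, 1, 2])
products = np.array(['a', 'b', 'a', 'b', 'a'])

# Map each (product, period) pair to its row index.
row_of = {(j, t): row for row, (j, t) in enumerate(zip(products, periods))}

# Point each row at the same product's row in the previous period. Rows
# without a previous period point at themselves, marking an initial period.
lag_indices = np.array([
    row_of.get((j, t - 1), row)
    for row, (j, t) in enumerate(zip(products, periods))
])
print(lag_indices)  # [0 1 0 1 2]
```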
Along with market_ids, firm_ids, product_ids, and nesting_ids, the names of any additional fields can typically be used as variables in product_formulations. However, there are a few variable names such as 'X1', which are reserved for use by Products.
beta (array-like) – Vector of demand-side linear parameters, \(\beta\). Elements correspond to columns in \(X_1\), which is formulated by product_formulations.
sigma (array-like, optional) – Lower-triangular Cholesky root of the covariance matrix for unobserved taste heterogeneity, \(\Sigma\). Rows and columns correspond to columns in \(X_2\), which is formulated by product_formulations. If \(X_2\) is not formulated, this should not be specified, since the logit model will be simulated.
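If a covariance matrix for the random coefficients is the starting point, its lower-triangular Cholesky root can be computed with NumPy. The covariance values below are hypothetical.

```python
import numpy as np

# Hypothetical covariance matrix of unobserved tastes for two characteristics.
covariance = np.array([
    [1.0, 0.3],
    [0.3, 0.5],
])

# numpy.linalg.cholesky returns the lower-triangular root, so that
# sigma @ sigma.T recovers the covariance matrix.
sigma = np.linalg.cholesky(covariance)
print(sigma)
```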
pi (array-like, optional) – Parameters that measure how agent tastes vary with demographics, \(\Pi\). Rows correspond to the same product characteristics as in sigma. Columns correspond to columns in \(d\), which is formulated by agent_formulation. If \(d\) is not formulated, this should not be specified.
gamma (array-like, optional) – Vector of supply-side linear parameters, \(\gamma\). Elements correspond to columns in \(X_3\), which is formulated by product_formulations. If \(X_3\) is not formulated, this should not be specified.
rho (array-like, optional) – Parameters that measure within nesting group correlation, \(\rho\). If this is a scalar, it corresponds to all groups defined by the nesting_ids field of product_data. If this is a vector, it must have \(H\) elements, one for each nesting group. Elements correspond to group IDs in the sorted order of Simulation.unique_nesting_ids. If nesting IDs are not specified, this should not be specified either.
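The sorted-order convention for a vector rho can be checked with a short sketch. The nesting IDs and parameter values are hypothetical, and numpy.unique is used here only because it returns IDs in sorted order.

```python
import numpy as np

# Hypothetical nesting IDs for five products.
nesting_ids = np.array(['suv', 'car', 'car', 'suv', 'truck'])

# Sorted unique IDs: car, suv, truck.
unique_nesting_ids = np.unique(nesting_ids)

# One parameter per group, in the sorted order of the unique IDs.
rho = np.array([0.2, 0.5, 0.7])

# Parameter that applies to each product's nesting group.
rho_by_product = rho[np.searchsorted(unique_nesting_ids, nesting_ids)]
print(dict(zip(unique_nesting_ids, rho)))
```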
phi (array-like, optional) –
Parameters measuring unobservable autocorrelation,

(1)\[\begin{split}\phi = \begin{bmatrix} \phi_\xi & \phi_{\xi\omega} \\ \phi_{\omega\xi} & \phi_\omega \end{bmatrix} ,\end{split}\]

which must be specified if lag_indices in product_data are specified. This is ignored during simulation if xi and omega are specified. Otherwise, if specified, unobservables are drawn according to AR(1) processes:

(2)\[\begin{split}\xi_{jt} = \phi_\xi \cdot L \xi_{jt} + \phi_{\xi\omega} \cdot L \omega_{jt} + \tilde{\xi}_{jt}, \\ \omega_{jt} = \phi_\omega \cdot L \omega_{jt} + \phi_{\omega\xi} \cdot L \xi_{jt} + \tilde{\omega}_{jt},\end{split}\]

where the lag_indices field in product_data defines the lag operator \(L\).
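The recursion in (2) can be sketched as follows. This is an illustration rather than the library's implementation: the helper name simulate_ar1 is hypothetical, rows are assumed to be ordered so that each lagged row precedes the row that references it, and initial values are passed in directly instead of being drawn from a stationary distribution.

```python
import numpy as np

def simulate_ar1(lag_indices, phi, innovations, initial_values):
    """Apply the recursion in (2) row by row.

    Each row of the returned array holds (xi, omega) for one product-market.
    Assumes every lagged row appears before the row that references it.
    """
    lag_indices = np.asarray(lag_indices)
    phi = np.asarray(phi, dtype=float)
    innovations = np.asarray(innovations, dtype=float)
    initial_values = np.asarray(initial_values, dtype=float)
    values = np.zeros_like(innovations)
    for row, lag in enumerate(lag_indices):
        if lag == row:
            # Initial period for this product: no lag is available.
            values[row] = initial_values[row]
        else:
            # phi @ (L xi, L omega) produces the own and cross terms in (2).
            values[row] = phi @ values[lag] + innovations[row]
    return values

# Two products observed over three periods, with rows ordered by period.
lag_indices = [0, 1, 0, 1, 2, 3]
phi = [[0.5, 0.1], [0.0, 0.8]]
innovations = np.ones((6, 2))

unobservables = simulate_ar1(lag_indices, phi, innovations, initial_values=innovations)
xi, omega = unobservables[:, 0], unobservables[:, 1]
```

With unit innovations, the second-period values are \(\xi = 0.5 + 0.1 + 1 = 1.6\) and \(\omega = 0.8 + 1 = 1.8\), which makes the cross term \(\phi_{\xi\omega}\) easy to see.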
agent_formulation (Formulation, optional) – Formulation configuration for the matrix of observed agent characteristics called demographics, \(d\), which will only be included in the model if this formulation is specified. Any variables that cannot be loaded from agent_data will be drawn from independent standard uniform distributions.
agent_data (structured array-like, optional) –
Each row corresponds to an agent. Markets can have differing numbers of agents. Since simulated agents are only used if there are demand-side nonlinear product characteristics, agent data should only be specified if \(X_2\) is formulated in product_formulations. If agent data are specified, market IDs are required:
- market_ids : (object) - IDs that associate agents with markets. The set of distinct IDs should be the same as the set in product_data. If integration is specified, there must be at least as many rows in each market as the number of nodes and weights that are built for the market.
If integration is not specified, the following fields are required:
- weights : (numeric, optional) - Integration weights, \(w\), for integration over agent choice probabilities.
- nodes : (numeric, optional) - Unobserved agent characteristics called integration nodes, \(\nu\). If there are more than \(K_2\) columns (the number of demand-side nonlinear product characteristics), only the first \(K_2\) will be used. If any columns of sigma are fixed at zero, only the first few columns of these nodes will be used.
The convenience function build_integration() can be useful when constructing custom nodes and weights.

Note

If nodes has multiple columns, it can be specified as a matrix or broken up into multiple one-dimensional fields with column index suffixes that start at zero. For example, if there are three columns of nodes, a nodes field with three columns can be replaced by three one-dimensional fields: nodes0, nodes1, and nodes2.
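A sketch of hand-built agent data with split node fields follows. It assumes simple Monte Carlo integration (standard normal nodes with equal weights in each market); the numbers of markets, agents, and columns are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)

unique_market_ids = np.array([0, 1])  # should match the IDs in product_data
agents_per_market = 100               # hypothetical number of draws per market
K2 = 2                                # number of nonlinear characteristics

market_ids = np.repeat(unique_market_ids, agents_per_market)

# Monte Carlo nodes: standard normal draws, one column per characteristic.
nodes = rng.normal(size=(market_ids.size, K2))

# Equal weights that sum to one within each market.
weights = np.full(market_ids.size, 1 / agents_per_market)

# The two-column nodes matrix broken up into one-dimensional fields.
agent_data = {
    'market_ids': market_ids,
    'weights': weights,
    'nodes0': nodes[:, 0],
    'nodes1': nodes[:, 1],
}
```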

It may be convenient to define IDs for different agents:
- agent_ids : (object, optional) - IDs that identify agents within markets. There can be multiple of the same ID within a market.
Along with market_ids and agent_ids, the names of any additional fields can typically be used as variables in agent_formulation. The exception is the name 'demographics', which is reserved for use by Agents.

In addition to standard demographic variables \(d_{it}\), it is also possible to specify product-specific demographics \(d_{ijt}\). A typical example is geographic distance of agent \(i\) from product \(j\). If agent_formulation has, for example, 'distance', instead of including a single 'distance' field in agent_data, one should instead include 'distance0', 'distance1', 'distance2' and so on, where the index corresponds to the order in which products appear within each market in product_data. Because indices start at zero, 'distance5' should measure the distance of agents to the sixth product within the market, as ordered in product_data. The last index should be the number of products in the largest market, minus one. For markets with fewer products than this maximum number, the later columns will be ignored.

Finally, by default each agent \(i\) in market \(t\) is faced with the same choice set of products \(j\), but it is possible to specify agent-specific availability \(a_{ijt}\) much in the same way that product-specific demographics are specified. To do so, the following field can be specified:
- availability : (numeric, optional) - Agent-specific product availability, \(a\). Choice probabilities in (5) are modified according to
  
  (3)\[s_{ijt} = \frac{a_{ijt} \exp V_{ijt}}{1 + \sum_{k \in J_t} a_{ikt} \exp V_{ikt}},\]
  
  and similarly for the nested logit model and consumer surplus calculations. By default, all \(a_{ijt} = 1\). To have a product \(j\) be unavailable to agent \(i\), set \(a_{ijt} = 0\).
  
  Agent-specific availability is specified in the same way that product-specific demographics are specified. In agent_data, one can include 'availability0', 'availability1', 'availability2', and so on, where the index corresponds to the order in which products appear within each market in product_data. The last index should be the number of products in the largest market, minus one. For markets with fewer products than this maximum number, the later columns will be ignored.
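The availability-adjusted choice probabilities can be sketched for a single market as follows. The helper name choice_probabilities is hypothetical, and the sketch covers only the plain logit form, with each product's own availability weighting its term in the denominator.

```python
import numpy as np

def choice_probabilities(V, availability):
    """Logit choice probabilities for an agents-by-products utility matrix."""
    V = np.asarray(V, dtype=float)
    availability = np.asarray(availability, dtype=float)

    # Unavailable products contribute nothing to either the numerator or
    # the denominator.
    weighted = availability * np.exp(V)
    return weighted / (1 + weighted.sum(axis=1, keepdims=True))

# Two agents and two products with zero utilities. The second product is
# unavailable to the second agent.
V = np.zeros((2, 2))
availability = np.array([[1, 1], [1, 0]])
s = choice_probabilities(V, availability)
print(s)
```

The first agent splits probability evenly among both products and the outside good (one third each), while the second agent splits it between the first product and the outside good (one half each).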
integration (Integration, optional) –
Integration configuration for how to build nodes and weights for integration over agent choice probabilities, which will replace any nodes and weights fields in agent_data. This configuration is required if nodes and weights in agent_data are not specified. It should not be specified if \(X_2\) is not formulated in product_formulations.

If this configuration is specified, \(K_2\) columns of nodes (the number of demand-side nonlinear product characteristics) will be built. However, if sigma is left unspecified or is specified with columns fixed at zero, fewer columns will be used.
xi (array-like, optional) –
Demand-side unobservable, \(\xi\). This must be specified if \(X_3\) is not formulated or if omega is specified.

By default, if \(X_3\) is formulated, this and \(\omega_{jt}\) are drawn from a mean-zero bivariate normal distribution. If phi is specified, then innovations \(\tilde{\xi}_{jt}\) and \(\tilde{\omega}_{jt}\) in (2) are drawn instead, and initial values are draws from their stationary distribution.
omega (array-like, optional) –
Supply-side unobservable, \(\omega\). This must be specified if \(X_3\) is formulated and xi is specified. It is ignored if \(X_3\) is not formulated.

By default, if \(X_3\) is formulated, this and \(\xi_{jt}\) are drawn from a mean-zero bivariate normal distribution. If phi is specified, then innovations are drawn instead, as described for xi.
xi_variance (float, optional) – Variance of \(\xi_{jt}\) (or its innovation if phi is specified). The default value is 1.0. This is ignored if xi or omega is specified.
omega_variance (float, optional) – Variance of \(\omega_{jt}\) (or its innovation if phi is specified). The default value is 1.0. This is ignored if xi or omega is specified.
correlation (float, optional) – Correlation between \(\xi_{jt}\) and \(\omega_{jt}\) (or their innovations if phi is specified). The default value is 0.9. This is ignored if xi or omega is specified.
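The three settings above pin down a covariance matrix for \((\xi_{jt}, \omega_{jt})\). A NumPy sketch of the corresponding draws, using the default values and an arbitrary number of products, is:

```python
import numpy as np

rng = np.random.RandomState(0)

xi_variance, omega_variance, correlation = 1.0, 1.0, 0.9

# The covariance is the correlation times the product of standard deviations.
covariance_term = correlation * np.sqrt(xi_variance * omega_variance)
covariance = np.array([
    [xi_variance, covariance_term],
    [covariance_term, omega_variance],
])

# One mean-zero bivariate normal draw per product.
draws = rng.multivariate_normal([0, 0], covariance, size=10_000)
xi, omega = draws[:, 0], draws[:, 1]
print(np.corrcoef(xi, omega)[0, 1])
```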
rc_types (sequence of str, optional) –
Random coefficient types:
- 'linear' (default) - The random coefficient is as defined in (3).
- 'log' - The random coefficient’s column in (3) is exponentiated before being pre-multiplied by \(X_2\). It will take on values bounded from below by zero.
- 'logit' - The random coefficient’s column in (3) is passed through the inverse logit function before being pre-multiplied by \(X_2\). It will take on values bounded from below by zero and above by one.
The list should have as many strings as there are columns in \(X_2\). Each string determines the type of the random coefficient on the corresponding product characteristic in \(X_2\).

A typical example of when to use 'log' is to have a lognormal coefficient on prices. Implementing this typically involves having an I(-prices) in the formulation for \(X_2\), and instead of including prices in \(X_1\), including a 1 in the agent_formulation. Then the corresponding coefficient in \(\Pi\) will serve as the mean parameter for the lognormal random coefficient on negative prices, \(-p_{jt}\).
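The effect of each type on a column of random coefficients can be sketched as follows. The helper name transform_coefficients is hypothetical, and the input stands in for the untransformed agents-by-\(K_2\) coefficient matrix.

```python
import numpy as np

def transform_coefficients(coefficients, rc_types):
    """Apply each column's random coefficient type to an agents-by-K2 array."""
    transformed = np.array(coefficients, dtype=float)
    for k, rc_type in enumerate(rc_types):
        if rc_type == 'log':
            # Exponentiation bounds the coefficient from below by zero.
            transformed[:, k] = np.exp(transformed[:, k])
        elif rc_type == 'logit':
            # The inverse logit bounds the coefficient between zero and one.
            transformed[:, k] = 1 / (1 + np.exp(-transformed[:, k]))
        elif rc_type != 'linear':
            raise ValueError(f"Unknown random coefficient type: {rc_type}")
    return transformed

rng = np.random.RandomState(0)
raw = rng.normal(size=(1000, 3))
coefficients = transform_coefficients(raw, ['linear', 'log', 'logit'])
```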
epsilon_scale (float, optional) –
Factor by which the Type I Extreme Value idiosyncratic preference term, \(\epsilon_{ijt}\), is scaled. By default, \(\epsilon_{ijt}\) is not scaled. The typical use of this parameter is to approximate the pure characteristics model of Berry and Pakes (2007) by choosing a value smaller than 1.0. As this scaling factor approaches zero, the model approaches the pure characteristics model in which there is no idiosyncratic preference term.

For more information about choosing this parameter and estimating models where it is smaller than 1.0, refer to the same argument in Problem.solve(). In some situations, it may be easier to solve simulations with small epsilon scaling factors by using Simulation.replace_exogenous() rather than Simulation.replace_endogenous().
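The limiting behavior can be illustrated with plain logit probabilities. If utility is \(V_{ijt}\) plus the scaled term, choice probabilities for one agent are logit probabilities evaluated at \(V_{ijt}\) divided by the scaling factor. The helper name scaled_logit_shares is hypothetical.

```python
import numpy as np

def scaled_logit_shares(V, epsilon_scale):
    """Logit probabilities for one agent when epsilon is scaled."""
    scaled = np.asarray(V, dtype=float) / epsilon_scale

    # Subtract the largest exponent (including the outside good's zero)
    # before exponentiating to avoid overflow for small scaling factors.
    shift = max(scaled.max(), 0.0)
    exp_inside = np.exp(scaled - shift)
    exp_outside = np.exp(-shift)
    return exp_inside / (exp_outside + exp_inside.sum())

V = [1.0, 0.5]
for scale in [1.0, 0.1, 0.01]:
    print(scale, scaled_logit_shares(V, scale))
```

As the scaling factor shrinks, nearly all probability moves to the product with the highest \(V_{ijt}\), which is the deterministic choice of the pure characteristics model.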
costs_type (str, optional) –
Specification of the marginal cost function \(\tilde{c} = f(c)\) in (9). The following specifications are supported:
- 'linear' (default) - Linear specification: \(\tilde{c} = c\).
- 'log' - Log-linear specification: \(\tilde{c} = \log c\).
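The two specifications can be compared with a short sketch. It assumes that the referenced equation has the usual linear form \(\tilde{c} = X_3 \gamma + \omega\); the helper name marginal_costs and the data are hypothetical.

```python
import numpy as np

def marginal_costs(X3, gamma, omega, costs_type='linear'):
    """Recover marginal costs c from X3, gamma, and omega."""
    X3 = np.asarray(X3, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    omega = np.asarray(omega, dtype=float)

    # Assumed form of the cost function: c_tilde = X3 @ gamma + omega.
    c_tilde = X3 @ gamma + omega
    if costs_type == 'linear':
        return c_tilde          # c_tilde = c
    if costs_type == 'log':
        return np.exp(c_tilde)  # c_tilde = log(c)
    raise ValueError(f"Unknown costs type: {costs_type}")

X3 = [[1.0, 0.5], [1.0, 1.0]]
gamma = [1.0, 2.0]
omega = [0.1, -0.1]

linear_costs = marginal_costs(X3, gamma, omega, 'linear')
log_costs = marginal_costs(X3, gamma, omega, 'log')
```

The log-linear specification guarantees positive marginal costs, whereas the linear one can produce negative values for sufficiently negative \(\omega\).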
seed (int, optional) – Passed to numpy.random.RandomState to seed the random number generator before data are simulated. By default, a seed is not passed to the random number generator.

product_formulations¶

Formulation configurations for \(X_1\), \(X_2\), and \(X_3\), respectively.

Type: tuple

agent_formulation¶

Formulation configuration for \(d\).

Type: Formulation

product_data¶

Synthetic product data that were loaded or simulated during initialization. Typically, Simulation.replace_endogenous() is used to replace prices and shares with equilibrium values that are consistent with true parameters. The data_to_dict() function can be used to convert this into a more usable data type.

Type: recarray

agent_data¶

Synthetic agent data that were loaded or simulated during initialization. The data_to_dict() function can be used to convert this into a more usable data type.

Type: recarray

integration¶

Integration configuration for how any nodes and weights were built during initialization.

Type: Integration

products¶

Product data structured as Products, which consists of data taken from Simulation.product_data along with matrices built according to Simulation.product_formulations. The data_to_dict() function can be used to convert this into a more usable data type.

Type: Products

agents¶

Agent data structured as Agents, which consists of data taken from Simulation.agent_data or built by Simulation.integration along with any demographics formulated by Simulation.agent_formulation. The data_to_dict() function can be used to convert this into a more usable data type.

Type: Agents

unique_market_ids¶

Unique market IDs in product and agent data.

Type: ndarray

unique_firm_ids¶

Unique firm IDs in product data.

Type: ndarray

unique_nesting_ids¶

Unique nesting IDs in product data.

Type: ndarray

unique_product_ids¶

Unique product IDs in product data.

Type: ndarray

unique_agent_ids¶

Unique agent IDs in agent data.

Type: ndarray

beta¶

Demand-side linear parameters, \(\beta\).

Type: ndarray

sigma¶

Cholesky root of the covariance matrix for unobserved taste heterogeneity, \(\Sigma\).

Type: ndarray

gamma¶

Supply-side linear parameters, \(\gamma\).

Type: ndarray

pi¶

Parameters that measure how agent tastes vary with demographics, \(\Pi\).

Type: ndarray

rho¶

Parameters that measure within nesting group correlation, \(\rho\).

Type: ndarray

phi¶

Parameters that measure unobservable autocorrelation, \(\phi\).

Type: ndarray

xi¶

Unobserved demand-side product characteristics, \(\xi\).

Type: ndarray

omega¶

Unobserved supply-side product characteristics, \(\omega\).

Type: ndarray

rc_types¶

Random coefficient types.

Type: list of str

epsilon_scale¶

Factor by which the Type I Extreme Value idiosyncratic preference term, \(\epsilon_{ijt}\), is scaled.

Type: float

costs_type¶

Functional form of the marginal cost function \(\tilde{c} = f(c)\).

Type: str

T¶

Number of markets, \(T\).

Type: int

N¶

Number of products across all markets, \(N\).

Type: int

F¶

Number of firms across all markets, \(F\).

Type: int

I¶

Number of agents across all markets, \(I\).

Type: int

K1¶

Number of demand-side linear product characteristics, \(K_1\).

Type: int

K2¶

Number of demand-side nonlinear product characteristics, \(K_2\).

Type: int

K3¶

Number of supply-side characteristics, \(K_3\).

Type: int

D¶

Number of demographic variables, \(D\).

Type: int

MD¶

Number of demand-side instruments, \(M_D\), which is always zero because instruments are added or constructed in SimulationResults.to_problem().

Type: int

MS¶

Number of supply-side instruments, \(M_S\), which is similarly always zero.

Type: int

MC¶

Number of covariance instruments, \(M_C\).

Type: int

ED¶

Number of absorbed dimensions of demand-side fixed effects, \(E_D\), which is always zero because simulations do not support fixed effect absorption.

Type: int

ES¶

Number of absorbed dimensions of supply-side fixed effects, \(E_S\), which is always zero because simulations do not support fixed effect absorption.

Type: int

H¶

Number of nesting groups, \(H\).

Type: int

Examples

Tutorial

Methods

`replace_endogenous`([costs, prices, …])	Replace simulated prices and market shares with equilibrium values that are consistent with true parameters.
`replace_exogenous`(X1_name[, X3_name, delta, …])	Replace exogenous product characteristics with values that are consistent with true parameters.