pyblp.Problem
class pyblp.Problem(product_formulations, product_data, agent_formulation=None, agent_data=None, integration=None, rc_types=None, epsilon_scale=1.0, costs_type='linear', add_exogenous=True)
A BLP-type problem.
This class is initialized with relevant data and solved with Problem.solve().
- Parameters
product_formulations (Formulation or sequence of Formulation) –
Formulation configuration or a sequence of up to three Formulation configurations for the matrix of demand-side linear product characteristics, \(X_1\), for the matrix of demand-side nonlinear product characteristics, \(X_2\), and for the matrix of supply-side characteristics, \(X_3\), respectively. If the formulation for \(X_3\) is not specified or is None, a supply side will not be estimated. Similarly, if the formulation for \(X_2\) is not specified or is None, the logit (or nested logit) model will be estimated.
Variable names should correspond to fields in product_data. The shares variable should not be included in the formulations for \(X_1\) or \(X_2\). The formulation for \(X_3\) can include shares to allow marginal costs to depend on quantity.
The prices variable should not be included in the formulation for \(X_3\), but it should be included in the formulation for \(X_1\) or \(X_2\) (or both). The absorb argument of Formulation can be used to absorb fixed effects into \(X_1\) and \(X_3\), but not \(X_2\). Characteristics in \(X_2\) should generally be included in \(X_1\). The typical exception is characteristics that are collinear with fixed effects that have been absorbed into \(X_1\).
By default, characteristics in \(X_1\) that do not involve prices, \(X_1^\text{ex}\), will be combined with excluded demand-side instruments (specified below) to create the full set of demand-side instruments, \(Z_D\). Any fixed effects absorbed into \(X_1\) will also be absorbed into \(Z_D\). Similarly, characteristics in \(X_3\) that do not involve shares, \(X_3^\text{ex}\), will be combined with the excluded supply-side instruments to create \(Z_S\), and any fixed effects absorbed into \(X_3\) will also be absorbed into \(Z_S\). The add_exogenous flag can be used to disable this behavior.
Warning
Characteristics that involve prices, \(p\), or shares, \(s\), should always be formulated with the prices and shares variables, respectively. If another name is used, Problem will not understand that the characteristic is endogenous, so it will be erroneously included in \(Z_D\) or \(Z_S\), and derivatives computed with respect to prices or shares will likely be wrong. For example, to include a \(p^2\) characteristic, include I(prices**2) in a formula instead of manually constructing and including a prices_squared variable.
product_data (structured array-like) –
Each row corresponds to a product. Markets can have differing numbers of products. The following fields are required:
market_ids : (object) - IDs that associate products with markets.
shares : (numeric) - Market shares, \(s\), which should be between zero and one, exclusive. Outside shares should also be between zero and one. Shares in each market should sum to less than one.
prices : (numeric) - Product prices, \(p\).
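The constraints on shares listed above can be checked before a problem is constructed. The following is a minimal sketch in plain Python, not part of pyblp: the helper name check_shares and the example data are hypothetical, and plain lists stand in for the columns of product_data.

```python
from collections import defaultdict


def check_shares(market_ids, shares):
    """Validate shares and return the implied outside share for each market.

    Hypothetical helper (not a pyblp function): each share must lie strictly
    between zero and one, and shares within a market must sum to less than one.
    """
    inside_totals = defaultdict(float)
    for market_id, share in zip(market_ids, shares):
        if not 0 < share < 1:
            raise ValueError(f"share {share} in market {market_id!r} is not in (0, 1)")
        inside_totals[market_id] += share

    outside_shares = {}
    for market_id, total in inside_totals.items():
        if total >= 1:
            raise ValueError(f"shares in market {market_id!r} sum to {total}, not less than one")
        outside_shares[market_id] = 1 - total
    return outside_shares


# Illustrative data: two markets with differing numbers of products.
market_ids = ["A", "A", "B", "B", "B"]
shares = [0.25, 0.5, 0.125, 0.25, 0.125]
outside_shares = check_shares(market_ids, shares)
```

The returned outside shares are one minus the sum of inside shares in each market, so they are also strictly between zero and one whenever the check passes.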
If a formulation for \(X_3\) is specified in product_formulations, firm IDs are also required, since they will be used to estimate the supply side of the problem:
firm_ids : (object, optional) - IDs that associate products with firms.
Excluded instruments are typically specified with the following fields:
demand_instruments : (numeric) - Excluded demand-side instruments, which, together with the formulated exogenous demand-side linear product characteristics, \(X_1^\text{ex}\), constitute the full set of demand-side instruments, \(Z_D\). To instead specify the full matrix \(Z_D\), set add_exogenous to False.
supply_instruments : (numeric, optional) - Excluded supply-side instruments, which, together with the formulated exogenous supply-side characteristics, \(X_3^\text{ex}\), constitute the full set of supply-side instruments, \(Z_S\). To instead specify the full matrix \(Z_S\), set add_exogenous to False.
covariance_instruments : (numeric, optional) - Covariance instruments \(Z_C\). If specified, additional moments \(E[g_{C,jt}] = E[\xi_{jt}\omega_{jt}Z_{C,jt}] = 0\) will be added, as in MacKay and Miller (2025). The default 2SLS weighting matrix will have an additional \((Z_C'Z_C / N)^{-1}\) block after the first two.
Note
Using covariance restrictions to identify a parameter on price can sometimes yield two solutions, where the “upper” solution may be positive (i.e., implying upward-sloping demand). See MacKay and Miller (2025) for more discussion of this point. When the “lower” root is the correct solution, consider ensuring the appropriate sign either by imposing a one-sided bound (e.g., zero) on the parameter on price with beta_bounds (if the parameter is in beta) or by replacing it with a lognormal coefficient on price via the rc_types argument to Problem.
Note
In the current implementation, these covariance restrictions only affect the nonlinear parameters. The linear parameters are estimated using other moments. In the case of overidentification, the estimator may not be fully efficient because of this implementation decision.
If firm_ids are specified, custom ownership matrices can be specified as well:
ownership : (numeric, optional) - Custom stacked \(J_t \times J_t\) ownership or product holding matrices, \(\mathscr{H}\), for each market \(t\), which can be built with build_ownership(). By default, standard ownership matrices are built only when they are needed, which reduces memory usage. If specified, there should be as many columns as there are products in the market with the most products. Rightmost columns in markets with fewer products will be ignored.
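To make the stacked layout concrete, the sketch below builds standard ownership rows by hand, with a one wherever two products in the same market share a firm. It is an illustration in plain Python that does not call build_ownership(): the function name stacked_ownership is hypothetical, and padding the unused rightmost columns with zeros is an assumption (those columns are ignored in any case).

```python
def stacked_ownership(market_ids, firm_ids, fill=0.0):
    """Return one ownership row per product, padded to the largest market's width.

    Hypothetical helper: entry (j, k) is 1.0 when products j and k are in the
    same market and share a firm ID, and 0.0 otherwise.
    """
    # Collect firm IDs by market in the order that products appear.
    firms_by_market = {}
    for market_id, firm_id in zip(market_ids, firm_ids):
        firms_by_market.setdefault(market_id, []).append(firm_id)

    # The number of columns is the number of products in the largest market.
    width = max(len(firms) for firms in firms_by_market.values())

    rows = []
    for market_id, firm_id in zip(market_ids, firm_ids):
        row = [1.0 if firm_id == other else 0.0 for other in firms_by_market[market_id]]
        rows.append(row + [fill] * (width - len(row)))
    return rows


# Illustrative data: three products in market 1 and two in market 2.
market_ids = [1, 1, 1, 2, 2]
firm_ids = ["a", "a", "b", "a", "b"]
ownership = stacked_ownership(market_ids, firm_ids)
```

Each market contributes a \(J_t \times J_t\) block in its leftmost columns; the rows for market 2 here carry one padded column because market 1 has three products.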
Note
Fields that can have multiple columns (demand_instruments, supply_instruments, and ownership) can either be matrices or can be broken up into multiple one-dimensional fields with column index suffixes that start at zero. For example, if there are three columns of excluded demand-side instruments, a demand_instruments field with three columns can be replaced by three one-dimensional fields: demand_instruments0, demand_instruments1, and demand_instruments2.
To estimate a nested logit or random coefficients nested logit (RCNL) model, nesting groups must be specified:
nesting_ids (object, optional) - IDs that associate products with nesting groups. When these IDs are specified, rho must be specified in Problem.solve() as well.
It may be convenient to define IDs for different products:
product_ids (object, optional) - IDs that identify products within markets. There can be multiple columns.
To estimate unobservable autocorrelation with phi in Problem.solve(), indices that define lags of the data must be specified:
lag_indices : (int, optional) - Indices that take on values from \(0\) to \(N - 1\), which define the lag operator \(L\) on the data. For example, if markets \(t\) are simply time periods and the identities of products \(j\) are persistent across periods, then \(L x_{jt} = x_{j,t-1}\).
A value equal to the current row's own index indicates that this is the initial period for a product. Otherwise, the value should be the index of the row that is the lagged version of the current row.
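One way to construct such indices is sketched below in plain Python. The helper build_lag_indices is hypothetical, and it assumes that markets are consecutive integer time periods and that a product absent from the previous period should be treated as being in its initial period.

```python
def build_lag_indices(product_ids, periods):
    """Return, for each row, the row index of the same product one period earlier.

    Hypothetical helper: a row whose product has no observation in the previous
    period points to itself, which marks it as an initial period.
    """
    row_of = {(product, period): row for row, (product, period) in enumerate(zip(product_ids, periods))}
    return [
        row_of.get((product, period - 1), row)
        for row, (product, period) in enumerate(zip(product_ids, periods))
    ]


# Illustrative data: product "z" enters in period 1 and "y" exits after it.
product_ids = ["x", "y", "x", "y", "z", "x"]
periods = [0, 0, 1, 1, 1, 2]
lag_indices = build_lag_indices(product_ids, periods)
```

In this example rows 0, 1, and 4 point to themselves because they are the first observations of their products, while row 5 points to row 2, the previous observation of product "x".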
Finally, clustering groups can be specified to account for within-group correlation while updating the weighting matrix and estimating standard errors:
clustering_ids (object, optional) - Cluster group IDs, which will be used if W_type or se_type in Problem.solve() is 'clustered'.
Along with market_ids, firm_ids, nesting_ids, product_ids, clustering_ids, and prices, the names of any additional fields can typically be used as variables in product_formulations. However, a few variable names, such as 'X1', are reserved for use by Products.
agent_formulation (Formulation, optional) –
Formulation configuration for the matrix of observed agent characteristics called demographics, \(d\), which will only be included in the model if this formulation is specified. Since demographics are only used if there are demand-side nonlinear product characteristics, this formulation should only be specified if \(X_2\) is formulated in product_formulations. Variable names should correspond to fields in agent_data. See the information under agent_data for how to give fields for product-specific demographics \(d_{ijt}\).
agent_data (structured array-like, optional) –
Each row corresponds to an agent. Markets can have differing numbers of agents. Since simulated agents are only used if there are demand-side nonlinear product characteristics, agent data should only be specified if \(X_2\) is formulated in product_formulations. If agent data are specified, market IDs are required:
market_ids : (object) - IDs that associate agents with markets. The set of distinct IDs should be the same as the set in product_data. If integration is specified, there must be at least as many rows in each market as the number of nodes and weights that are built for the market.
If integration is not specified, the following fields are required:
weights : (numeric, optional) - Integration weights, \(w\), for integration over agent choice probabilities.
nodes : (numeric, optional) - Unobserved agent characteristics called integration nodes, \(\nu\). If there are more than \(K_2\) columns (the number of demand-side nonlinear product characteristics), only the first \(K_2\) will be retained. If any columns of sigma in Problem.solve() are fixed at zero, only the first few columns of these nodes will be used.
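As a rough illustration of what custom weights and nodes fields might contain, the sketch below draws simple Monte Carlo nodes in plain Python. It is not how pyblp builds nodes internally: the function monte_carlo_agents is hypothetical, standard normal nodes with equal weights that sum to one within each market are an assumption, and each column of nodes is stored in a one-dimensional field with a zero-based suffix.

```python
import random


def monte_carlo_agents(market_ids, draws, K2, seed=0):
    """Build one dict per simulated agent with market_ids, weights, and node fields.

    Hypothetical helper: nodes are independent standard normal draws and every
    agent in a market receives the same weight, 1 / draws.
    """
    rng = random.Random(seed)
    rows = []
    for market_id in market_ids:
        for _ in range(draws):
            row = {"market_ids": market_id, "weights": 1.0 / draws}
            for column in range(K2):
                # One field per column of nodes: nodes0, nodes1, and so on.
                row[f"nodes{column}"] = rng.gauss(0.0, 1.0)
            rows.append(row)
    return rows


# Illustrative configuration: two markets, four agents each, K2 = 2.
agent_rows = monte_carlo_agents(["A", "B"], draws=4, K2=2)
```

Records like these could then be converted into whatever structured array-like object is passed as agent_data.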
The convenience function build_integration() can be useful when constructing custom nodes and weights.
Note
If nodes has multiple columns, it can be specified as a matrix or broken up into multiple one-dimensional fields with column index suffixes that start at zero. For example, if there are three columns of nodes, a nodes field with three columns can be replaced by three one-dimensional fields: nodes0, nodes1, and nodes2.
It may be convenient to define IDs for different agents:
agent_ids (object, optional) - IDs that identify agents within markets. There can be multiple of the same ID within a market.
Along with market_ids and agent_ids, the names of any additional fields can typically be used as variables in agent_formulation. Exceptions are the names 'demographics' and 'availability', which are reserved for use by Agents.
In addition to standard demographic variables \(d_{it}\), it is also possible to specify product-specific demographics \(d_{ijt}\). A typical example is geographic distance of agent \(i\) from product \(j\). If agent_formulation has, for example, 'distance', instead of including a single 'distance' field in agent_data, one should instead include 'distance0', 'distance1', 'distance2' and so on, where the index corresponds to the order in which products appear within each market in product_data. For example, because indices start at zero, 'distance5' should measure the distance of agents to the sixth product within the market, as ordered in product_data. The last index should be the number of products in the largest market, minus one. For markets with fewer products than this maximum number, the trailing columns will be ignored.
Finally, by default each agent \(i\) in market \(t\) is faced with the same choice set of products \(j\), but it is possible to specify agent-specific availability \(a_{ijt}\) much in the same way that product-specific demographics are specified. To do so, the following field can be specified:
availability : (numeric, optional) - Agent-specific product availability, \(a\). Choice probabilities in (5) are modified according to
(1) \[s_{ijt} = \frac{a_{ijt} \exp V_{ijt}}{1 + \sum_{k \in J_t} a_{ikt} \exp V_{ikt}},\]
and similarly for the nested logit model and consumer surplus calculations. By default, all \(a_{ijt} = 1\). To have a product \(j\) be unavailable to agent \(i\), set \(a_{ijt} = 0\).
Agent-specific availability is specified in the same way that product-specific demographics are specified. In agent_data, one can include 'availability0', 'availability1', 'availability2', and so on, where the index corresponds to the order in which products appear within each market in product_data. The last index should be the number of products in the largest market, minus one. For markets with fewer products than this maximum number, the trailing columns will be ignored.
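The effect of availability on a single agent's choice probabilities can be sketched in a few lines of plain Python. The function below is a hypothetical illustration of the modified logit expression, with each product's own availability weighting its term in both the numerator and the denominator; it is not pyblp's implementation and makes no attempt to guard against overflow.

```python
import math


def choice_probabilities(V, availability=None):
    """Compute logit choice probabilities for one agent in one market.

    Hypothetical helper: V holds the utilities V_ijt for each product j, and
    availability holds a_ijt (all ones when omitted). The outside good
    contributes the 1 in the denominator.
    """
    if availability is None:
        availability = [1.0] * len(V)
    weighted = [a * math.exp(v) for a, v in zip(availability, V)]
    denominator = 1.0 + sum(weighted)
    return [term / denominator for term in weighted]


# Three products with equal utility, first all available, then the second removed.
V = [0.0, 0.0, 0.0]
all_available = choice_probabilities(V)
second_unavailable = choice_probabilities(V, availability=[1.0, 0.0, 1.0])
```

Setting \(a_{ijt} = 0\) drives the corresponding probability to zero and redistributes it across the remaining products and the outside good.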
integration (Integration, optional) –
Integration configuration for how to build nodes and weights for integration over agent choice probabilities, which will replace any nodes and weights fields in agent_data. This configuration is required if nodes and weights in agent_data are not specified. It should not be specified if \(X_2\) is not formulated in product_formulations.
If this configuration is specified, \(K_2\) columns of nodes (the number of demand-side nonlinear product characteristics) will be built. However, if sigma in Problem.solve() is left unspecified or specified with columns fixed at zero, fewer columns will be used.
rc_types (sequence of str, optional) –
Random coefficient types:
'linear' (default) - The random coefficient is as defined in (3). All elliptical distributions are supported, including the normal distribution.
'log' - The random coefficient’s column in (3) is exponentiated before being pre-multiplied by \(X_2\). It will take on values bounded from below by zero. All log-elliptical distributions are supported, including the lognormal distribution.
'logit' - The random coefficient’s column in (3) is passed through the inverse logit function before being pre-multiplied by \(X_2\). It will take on values bounded from below by zero and above by one.
The list should have as many strings as there are columns in \(X_2\). Each string determines the type of the random coefficient on the corresponding product characteristic in \(X_2\).
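The three transformations can be written out for a single scalar draw as follows. This is a minimal sketch in plain Python, assuming the value passed in is one entry of the random coefficient's column before transformation; the function transform_coefficient is hypothetical and not part of pyblp.

```python
import math


def transform_coefficient(value, rc_type="linear"):
    """Apply the transformation implied by a random coefficient type.

    Hypothetical helper: 'linear' leaves the value unchanged, 'log'
    exponentiates it, and 'logit' applies the inverse logit function.
    """
    if rc_type == "linear":
        return value
    if rc_type == "log":
        return math.exp(value)
    if rc_type == "logit":
        return 1.0 / (1.0 + math.exp(-value))
    raise ValueError(f"unsupported rc_type: {rc_type!r}")


# The same untransformed draw under each type.
draw = -2.0
transformed = {rc_type: transform_coefficient(draw, rc_type) for rc_type in ("linear", "log", "logit")}
```

A negative draw stays negative under 'linear', becomes a small positive number under 'log', and lands strictly between zero and one under 'logit'.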
A typical example of when to use 'log' is to have a lognormal coefficient on prices. Implementing this typically involves having an I(-prices) in the formulation for \(X_2\), and instead of including prices in \(X_1\), including a 1 in the agent_formulation. Then the corresponding coefficient in \(\Pi\) will serve as the mean parameter for the lognormal random coefficient on negative prices, \(-p_{jt}\).
epsilon_scale (float, optional) –
Factor by which the Type I Extreme Value idiosyncratic preference term, \(\epsilon_{ijt}\), is scaled. By default, \(\epsilon_{ijt}\) is not scaled. The typical use of this parameter is to approximate the pure characteristics model of Berry and Pakes (2007) by choosing a value smaller than 1.0. As this scaling factor approaches zero, the model approaches the pure characteristics model in which there is no idiosyncratic preference term.
In practice, this is implemented by dividing \(V_{ijt} = \delta_{jt} + \mu_{ijt}\) by the scaling factor when solving for the mean utility \(\delta_{jt}\). For small scaling factors, this leads to large values of \(V_{ijt}\), which when exponentiated in the logit expression can lead to overflow issues discussed in Berry and Pakes (2007). The safe versions of the contraction mapping discussed in the documentation for fp_type in Problem.solve() (which are used by default) eliminate overflow issues at the cost of introducing fewer (but still common for a small scaling factor) underflow issues. Throughout the contraction mapping, some values of the simulated shares \(s_{jt}(\delta, \theta)\) can underflow to zero, causing the contraction to fail when taking logs. By default, shares_bounds in Problem.solve() bounds these simulated shares from below by 1e-300, which eliminates these underflow issues at the cost of making it more difficult for iteration routines to converge.
With this in mind, scaling epsilon is not supported for nonlinear contractions, and is also not supported when there are nesting groups, since these further complicate the problem. In practice, if the goal is to approximate the pure characteristics model, it is a good idea to slowly decrease the scale of epsilon (e.g., starting with 0.5, trying 0.1, etc.) until the contraction begins to fail. To further decrease the scale, there are a few things that can help. One is passing a different Iteration configuration to iteration in Problem.solve(), such as 'lm', which can be robust in this situation. Another is to set pyblp.options.dtype = np.longdouble when on a system that supports extended precision (see options for more information about this) and choose a smaller lower bound by configuring shares_bounds in Problem.solve(). Ultimately the model will stop being solvable at a certain point, and this point will vary by problem, so approximating the pure characteristics model requires some degree of experimentation.
costs_type (str, optional) –
Functional form of the marginal cost function \(\tilde{c} = f(c)\) in (9). The following specifications are supported:
'linear' (default) - Linear specification: \(\tilde{c} = c\).
'log' - Log-linear specification: \(\tilde{c} = \log c\).
This specification is only relevant if \(X_3\) is formulated.
add_exogenous (bool, optional) –
Whether to add characteristics in \(X_1\) that do not involve prices, \(X_1^\text{ex}\), to the demand_instruments field in product_data (including absorbed fixed effects), and similarly, whether to add characteristics in \(X_3\) that do not involve shares, \(X_3^\text{ex}\), to the supply_instruments field. This is by default True so that only excluded instruments need to be specified.
If this is set to False, demand_instruments and supply_instruments should specify the full sets of demand- and supply-side instruments, \(Z_D\) and \(Z_S\), and fixed effects should be manually absorbed (for example, with the build_matrix() function). This behavior can be useful, for example, when price is not the only endogenous product characteristic over which consumers have preferences. This model could be correctly estimated by manually adding the truly exogenous characteristics in \(X_1\) to \(Z_D\).
Warning
If this flag is set to False because there are multiple endogenous product characteristics, care should be taken when including a supply side or computing optimal instruments. These routines assume that price is the only endogenous variable over which consumers have preferences.
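When add_exogenous is False, the full instrument matrix has to be assembled by hand. The sketch below shows the idea in plain Python by joining excluded instruments and truly exogenous characteristics row by row. The helper full_instruments is hypothetical, the column order (excluded instruments first) is an arbitrary assumption rather than pyblp's convention, and no fixed effects are absorbed here.

```python
def full_instruments(excluded_instruments, exogenous_characteristics):
    """Join excluded instruments and exogenous characteristics row by row.

    Hypothetical helper: both arguments are lists of rows, one row per product.
    """
    if len(excluded_instruments) != len(exogenous_characteristics):
        raise ValueError("both inputs need one row per product")
    return [
        list(z_row) + list(x_row)
        for z_row, x_row in zip(excluded_instruments, exogenous_characteristics)
    ]


# Illustrative data: one excluded instrument and two exogenous columns
# (a constant and one characteristic) for two products.
excluded = [[0.5], [0.75]]
exogenous = [[1.0, 2.0], [1.0, 3.0]]
Z_D = full_instruments(excluded, exogenous)
M_D = len(Z_D[0])
```

Here the number of demand-side instruments is the one excluded instrument plus the two exogenous characteristics.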
-
product_formulations
Formulation configurations for \(X_1\), \(X_2\), and \(X_3\), respectively.
- Type
Formulation or sequence of Formulation
-
agent_formulation
Formulation configuration for \(d\).
- Type
Formulation
-
products
Product data structured as Products, which consists of data taken from product_data along with matrices built according to Problem.product_formulations. The data_to_dict() function can be used to convert this into a more usable data type.
- Type
Products
-
agents
Agent data structured as Agents, which consists of data taken from agent_data or built by integration along with any demographics built according to Problem.agent_formulation. The data_to_dict() function can be used to convert this into a more usable data type.
- Type
Agents
-
unique_market_ids
Unique market IDs in product and agent data.
- Type
ndarray
-
unique_firm_ids
Unique firm IDs in product data.
- Type
ndarray
-
unique_nesting_ids
Unique nesting group IDs in product data.
- Type
ndarray
-
unique_product_ids
Unique product IDs in product data.
- Type
ndarray
-
unique_agent_ids
Unique agent IDs in agent data.
- Type
ndarray
-
rc_types
Random coefficient types.
- Type
list of str
-
epsilon_scale
Factor by which the Type I Extreme Value idiosyncratic preference term, \(\epsilon_{ijt}\), is scaled.
- Type
float
-
costs_type
Functional form of the marginal cost function \(\tilde{c} = f(c)\).
- Type
str
-
T
Number of markets, \(T\).
- Type
int
-
N
Number of products across all markets, \(N\).
- Type
int
-
F
Number of firms across all markets, \(F\).
- Type
int
-
I
Number of agents across all markets, \(I\).
- Type
int
-
K1
Number of demand-side linear product characteristics, \(K_1\).
- Type
int
-
K2
Number of demand-side nonlinear product characteristics, \(K_2\).
- Type
int
-
K3
Number of supply-side product characteristics, \(K_3\).
- Type
int
-
D
Number of demographic variables, \(D\).
- Type
int
-
MD
Number of demand-side instruments, \(M_D\), which is typically the number of excluded demand-side instruments plus the number of exogenous demand-side linear product characteristics, \(K_1^\text{ex}\).
- Type
int
-
MS
Number of supply-side instruments, \(M_S\), which is typically the number of excluded supply-side instruments plus the number of exogenous supply-side linear product characteristics, \(K_3^\text{ex}\).
- Type
int
-
MC
Number of covariance instruments, \(M_C\).
- Type
int
-
ED
Number of absorbed dimensions of demand-side fixed effects, \(E_D\).
- Type
int
-
ES
Number of absorbed dimensions of supply-side fixed effects, \(E_S\).
- Type
int
-
H
Number of nesting groups, \(H\).
- Type
int
Examples
Methods
solve([sigma, pi, rho, phi, beta, gamma, …])
Solve the problem.