pyblp.MicroDataset
class pyblp.MicroDataset(name, observations, compute_weights, eliminated_product_ids_index=None, market_ids=None)
Configuration for a micro dataset \(d\) on which micro moments are computed.
A micro dataset \(d\), often a survey, is defined by survey weights \(w_{dijt}\), which are used in (33). For example, \(w_{dijt} = 1\{j \neq 0, t \in T_d\}\) defines a micro dataset that is a selected sample of inside purchasers in a few markets \(T_d \subset T\), giving each market an equal sampling weight. Different micro datasets are independent.
See Conlon and Gortmaker (2025) for a more in-depth discussion of the standardized framework used by PyBLP for incorporating micro data into BLP-style estimation.
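The inside-purchaser example above, \(w_{dijt} = 1\{j \neq 0, t \in T_d\}\), can be sketched as a weights function. This is a minimal sketch, not PyBLP's own code: the market IDs in SURVEYED_MARKETS and the stand-in arrays are hypothetical, and it assumes only that the products and agents arguments expose a NumPy-style size attribute.

```python
import numpy as np

# Hypothetical set of surveyed markets T_d (these market IDs are made up).
SURVEYED_MARKETS = {"t1", "t3"}

def compute_weights(t, products, agents):
    """Return an I x J_t array of w_dijt = 1{j != 0, t in T_d}."""
    shape = (agents.size, products.size)
    # Having no outside-option column conditions on inside purchases.
    return np.ones(shape) if t in SURVEYED_MARKETS else np.zeros(shape)

# Stand-ins for one market's products (3 rows) and agents (4 rows).
products = np.zeros(3)
agents = np.zeros(4)
surveyed = compute_weights("t1", products, agents)
not_surveyed = compute_weights("t2", products, agents)
```

When market_ids is supplied, the zero branch is not needed, because compute_weights is not called for markets outside that list.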
Parameters
name (str) – The unique name of the dataset, which will be used for outputting information about micro moments.
observations (int) – The number of observations \(N_d\) in the micro dataset.
compute_weights (callable) –
Function for computing survey weights \(w_{dijt}\) in a market of the following form:
compute_weights(t, products, agents) --> weights
where
t is the market in which to compute weights, products is the market’s Products (with \(J_t\) rows), and agents is the market’s Agents (with \(I_t\) rows), unless pyblp.options.micro_computation_chunks is larger than its default of 1, in which case agents is a chunk of the market’s Agents. Denoting the number of rows in agents by \(I\), the returned weights should be an array of one of the following shapes:
\(I \times J_t\): Conditions on inside purchases by assuming \(w_{di0t} = 0\). Rows correspond to agents \(i \in I\) in the same order as agent_data in Problem or Simulation and columns correspond to inside products \(j \in J_t\) in the same order as product_data in Problem or Simulation.
\(I \times (1 + J_t)\): The first column indexes the outside option, which can have nonzero survey weights \(w_{di0t}\).
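The two shapes above can be illustrated with a small sketch. Everything here is an assumption for illustration: the rule that only agents with a positive first demographic are sampled, the helper drop_outside, and the SimpleNamespace stand-ins that mimic only the size and demographics attributes.

```python
import numpy as np
from types import SimpleNamespace

def compute_weights(t, products, agents):
    """Return I x (1 + J_t) weights; column 0 is the outside option."""
    # Hypothetical sampling rule: keep agents whose first demographic is positive.
    sampled = (agents.demographics[:, 0] > 0).astype(float)  # length I
    # Repeat the agent-level indicator across the outside option and J_t products.
    return np.repeat(sampled[:, None], 1 + products.size, axis=1)

def drop_outside(weights):
    """Return the I x J_t form, which conditions on inside purchases."""
    return weights[:, 1:]

# Stand-ins for one market with J_t = 3 products and I = 4 agents.
products = SimpleNamespace(size=3)
agents = SimpleNamespace(size=4, demographics=np.array([[1.0], [-1.0], [2.0], [0.0]]))

with_outside = compute_weights("t1", products, agents)
inside_only = drop_outside(with_outside)
```

Returning the narrower array is a shorthand for setting the outside-option column to zero.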
Warning
If using different lambda functions to define different compute_weights functions in a loop, any variables that are changing within the loop should be passed as extra arguments with default values to the function, so that each function keeps the value from its own iteration. For example, lambda t, p, a: weights[t] where weights is some dictionary that is changing in the outer loop should instead be lambda t, p, a, weights=weights: weights[t]; otherwise, the weights in the current loop’s iteration will be lost.
Warning
If using product-specific demographics, agents.demographics will be an \(I_t \times D \times J_t\) array, instead of an \(I_t \times D\) array like usual. Non-product-specific demographics will be repeated \(J_t\) times.
Note
Particularly when using product-specific demographics or second choices, it may be convenient to use numpy.einsum, which handles multiplying many multi-dimensional arrays with common dimensions in an elegant way.
If the micro dataset contains second choice data, weights can have a third axis corresponding to second choices \(k\) in \(w_{dijkt}\):
\(I \times J_t \times J_t\): Conditions on inside purchases by assuming \(w_{di0kt} = w_{dij0t} = 0\).
\(I \times (1 + J_t) \times J_t\): The first column indexes the outside option, but the second choice is assumed to be an inside option, \(w_{dij0t} = 0\).
\(I \times J_t \times (1 + J_t)\): The first index in the third axis indexes the outside option, but the first choice is assumed to be an inside option, \(w_{di0kt} = 0\).
\(I \times (1 + J_t) \times (1 + J_t)\): The first column and the first index in the third axis index the outside option as the first and second choice.
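A three-axis weights array of the first shape, \(I \times J_t \times J_t\), can be built with numpy.einsum. This is a sketch under stated assumptions: the sampling rule (positive first demographic), the restriction of first choices to the first two products, and the SimpleNamespace stand-ins are all hypothetical.

```python
import numpy as np
from types import SimpleNamespace

def compute_weights(t, products, agents):
    """Return I x J_t x J_t weights over inside first and second choices."""
    # Hypothetical rule: sample agents whose first demographic is positive.
    sampled = (agents.demographics[:, 0] > 0).astype(float)  # axis i
    # Hypothetical rule: the survey covers first choices among the first two products.
    first = np.zeros(products.size)
    first[:2] = 1.0                                          # axis j
    second = np.ones(products.size)                          # axis k
    # Outer product of the three indicators gives w[i, j, k].
    return np.einsum("i,j,k->ijk", sampled, first, second)

# Stand-ins for one market with J_t = 3 products and I = 4 agents.
products = SimpleNamespace(size=3)
agents = SimpleNamespace(size=4, demographics=np.array([[1.0], [-1.0], [2.0], [0.0]]))
weights = compute_weights("t1", products, agents)
```

The same array could be written with explicit broadcasting, but the einsum subscripts make the roles of the agent, first-choice, and second-choice axes explicit.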
Warning
Second choice moments can use a lot of memory, especially when \(J_t\) is large. If this becomes an issue, consider setting pyblp.options.micro_computation_chunks to a value higher than its default of 1, such as the highest \(J_t\). This will cut down on memory usage without much affecting speed.
eliminated_product_ids_index (int, optional) – This option determines which products are eliminated from the choice set before the dataset’s second choices are made. If only the first choice product \(j\) is eliminated, this should be None, the default. If a group of products including the first choice product is eliminated, this should be a number between 0 and the number of columns in the product_ids field of product_data minus one, inclusive. The column of product_ids determines the groups.
market_ids (array-like, optional) – Distinct market IDs with nonzero survey weights \(w_{dijt}\). For other markets, \(w_{dijt} = 0\), and compute_weights will not be called.
Examples
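The warning above about defining compute_weights with lambda functions in a loop reflects how Python closures look up variables when called, not when defined. The following plain-Python sketch shows the difference. The dataset names and scalar "weights" are hypothetical stand-ins for real weight arrays, and no PyBLP objects are involved.

```python
# Hypothetical per-dataset weights; scalars stand in for weight arrays.
weights_by_dataset = {"survey_a": 1.0, "survey_b": 2.0}

late_bound = {}
early_bound = {}
for name, weights in weights_by_dataset.items():
    # Looks up `weights` when called, so it sees the last loop value.
    late_bound[name] = lambda t, p, a: weights
    # The default argument captures the value from this iteration.
    early_bound[name] = lambda t, p, a, weights=weights: weights

late_result = late_bound["survey_a"](None, None, None)
early_result = early_bound["survey_a"](None, None, None)
```

After the loop, late_result is the value from the final iteration, while early_result is the value that was current when the function for survey_a was defined.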
Methods