pyblp.MicroDataset
class pyblp.MicroDataset(name, observations, compute_weights, eliminated_product_ids_index=None, market_ids=None)
Configuration for a micro dataset \(d\) on which micro moments are computed.
A micro dataset \(d\), often a survey, is defined by survey weights \(w_{dijt}\), which are used in (34). For example, \(w_{dijt} = 1\{j \neq 0, t \in T_d\}\) defines a micro dataset that is a selected sample of inside purchasers in a few markets \(T_d \subset T\), giving each market an equal sampling weight. Different micro datasets are independent.
See Conlon and Gortmaker (2023) for a more in-depth discussion of the standardized framework used by PyBLP for incorporating micro data into BLP-style estimation.
Parameters
name (str) – The unique name of the dataset, which will be used for outputting information about micro moments.
observations (int) – The number of observations \(N_d\) in the micro dataset.
compute_weights (callable) –
Function for computing survey weights \(w_{dijt}\) in a market of the following form:
compute_weights(t, products, agents) --> weights
where t is the market in which to compute weights, products is the market’s Products (with \(J_t\) rows), and agents is the market’s Agents (with \(I_t\) rows), unless pyblp.options.micro_computation_chunks is larger than its default of 1, in which case agents is a chunk of the market’s Agents. Denoting the number of rows in agents by \(I\), the returned weights should be an array of one of the following shapes:
\(I \times J_t\): Conditions on inside purchases by assuming \(w_{di0t} = 0\). Rows correspond to agents \(i \in I\) in the same order as agent_data in Problem or Simulation and columns correspond to inside products \(j \in J_t\) in the same order as product_data in Problem or Simulation.
\(I \times (1 + J_t)\): The first column indexes the outside option, which can have nonzero survey weights \(w_{di0t}\).
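As a minimal sketch of the two shapes above, the functions below return weights that condition on inside purchases and weights that also cover the outside option. The function names, the half-rate sampling of the outside option, and the stand-in arrays used in place of the market’s Products and Agents are all hypothetical. The sketch assumes only that products and agents expose a size equal to their number of rows, as NumPy structured arrays do.

```python
import numpy as np


def inside_only_weights(t, products, agents):
    """Return I x J_t weights: every inside purchase gets an equal weight."""
    return np.ones((agents.size, products.size))


def with_outside_weights(t, products, agents):
    """Return I x (1 + J_t) weights: column 0 is the outside option."""
    weights = np.ones((agents.size, 1 + products.size))
    weights[:, 0] = 0.5  # hypothetical: outside option sampled at half the rate
    return weights


# Hypothetical stand-ins for one market with J_t = 3 products and I = 4 agent rows.
products = np.zeros(3, dtype=[("prices", "f8")])
agents = np.zeros(4, dtype=[("weights", "f8")])

inside = inside_only_weights("market_1", products, agents)
full = with_outside_weights("market_1", products, agents)
print(inside.shape, full.shape)
```

Sizing the array from agents.size rather than from a fixed \(I_t\) keeps the function valid when agents is only a chunk of the market’s Agents.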
Warning
If using different lambda functions to define different compute_weights functions in a loop, any variables that are changing within the loop should be passed as extra arguments with default values to preserve their scope. For example, lambda t, p, a: weights[t] where weights is some dictionary that is changing in the outer loop should instead be lambda t, p, a, weights=weights: weights[t]; otherwise, the weights in the current loop’s iteration will be lost.
Warning
If using product-specific demographics, agents.demographics will be an \(I_t \times D \times J_t\) array, instead of an \(I_t \times D\) array like usual. Non-product-specific demographics will be repeated \(J_t\) times.
Note
Particularly when using product-specific demographics or second choices, it may be convenient to use numpy.einsum, which handles multiplying many multi-dimensional arrays with common dimensions in an elegant way.
If the micro dataset contains second choice data, weights can have a third axis corresponding to second choices \(k\) in \(w_{dijkt}\):
\(I \times J_t \times J_t\): Conditions on inside purchases by assuming \(w_{di0kt} = w_{dij0t} = 0\).
\(I \times (1 + J_t) \times J_t\): The first column indexes the outside option, but the second choice is assumed to be an inside option, \(w_{dij0t} = 0\).
\(I \times J_t \times (1 + J_t)\): The first index in the third axis indexes the outside option, but the first choice is assumed to be an inside option, \(w_{di0kt} = 0\).
\(I \times (1 + J_t) \times (1 + J_t)\): The first column and the first index in the third axis index the outside option as the first and second choice.
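One way to build a three-axis weights array of shape \(I \times J_t \times J_t\) is sketched below with numpy.einsum. The sampling rule, \(w_{dijkt} = r_i \cdot 1\{j \neq k\}\) with a per-agent rate \(r_i\), is a hypothetical choice made for illustration, as are the names second_choice_weights and agent_rates and the stand-in arrays; none of them come from PyBLP.

```python
import numpy as np


def second_choice_weights(t, products, agents):
    """Return I x J_t x J_t weights equal to rate_i * 1{j != k}."""
    num_agents, num_products = agents.size, products.size
    agent_rates = np.ones(num_agents)      # hypothetical per-agent sampling rates
    differs = 1.0 - np.eye(num_products)   # 1 where second choice k differs from first choice j
    # Outer product of the agent axis with the (j, k) product-pair axes.
    return np.einsum("i,jk->ijk", agent_rates, differs)


# Hypothetical stand-ins for one market with J_t = 3 products and I = 2 agent rows.
products = np.zeros(3, dtype=[("prices", "f8")])
agents = np.zeros(2, dtype=[("weights", "f8")])

weights = second_choice_weights("market_1", products, agents)
print(weights.shape)
```

The subscript string names each axis once, so adding an axis for the outside option or a demographic dimension only changes the inputs and the subscripts, not the structure of the call.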
Warning
Second choice moments can use a lot of memory, especially when \(J_t\) is large. If this becomes an issue, consider setting pyblp.options.micro_computation_chunks to a value higher than its default of 1, such as the highest \(J_t\). This will cut down on memory usage without much affecting speed.
eliminated_product_ids_index (int, optional) – This option determines what is eliminated from the choice set before the dataset’s second choices are made. If only the first choice product \(j\) is eliminated, this should be None, the default. If a group of products including the first choice product is eliminated, this should be a number between 0 and the number of columns in the product_ids field of product_data minus one, inclusive. The chosen column of product_ids determines the groups.
market_ids (array-like, optional) – Distinct market IDs with nonzero survey weights \(w_{dijt}\). For other markets, \(w_{dijt} = 0\), and compute_weights will not be called.
Examples
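The scoping warning for lambda functions defined in a loop can be reproduced in plain Python, without PyBLP. The sketch below uses a hypothetical dictionary of constants in place of real weight arrays and compares a late-bound lambda with one that binds the loop variable as a default argument; all names in it are illustrative.

```python
# Hypothetical per-dataset lookups keyed by market; real code would hold arrays.
weights_by_dataset = {
    "survey_a": {"market_1": 1.0},
    "survey_b": {"market_1": 2.0},
}

late_bound = {}
bound_now = {}
for name, weights in weights_by_dataset.items():
    # Looks up `weights` only when called, so every lambda sees the last iteration's value.
    late_bound[name] = lambda t, p, a: weights[t]
    # The default argument captures the current iteration's `weights`.
    bound_now[name] = lambda t, p, a, weights=weights: weights[t]

late = [late_bound[n]("market_1", None, None) for n in ("survey_a", "survey_b")]
bound = [bound_now[n]("market_1", None, None) for n in ("survey_a", "survey_b")]
print(late)   # [2.0, 2.0]
print(bound)  # [1.0, 2.0]
```

Both late-bound functions return the value from the final iteration, which is the failure the warning describes; the default-argument form keeps each dataset’s own weights.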
Methods