pyblp.MicroDataset

class pyblp.MicroDataset(name, observations, compute_weights, eliminated_product_ids_index=None, market_ids=None)

Configuration for a micro dataset \(d\) on which micro moments are computed.

A micro dataset \(d\), often a survey, is defined by survey weights \(w_{dijt}\), which are used in (34). For example, \(w_{dijt} = 1\{j \neq 0, t \in T_d\}\) defines a micro dataset that is a selected sample of inside purchasers in a few markets \(T_d \subset T\), giving each market an equal sampling weight. Different micro datasets are independent.

See Conlon and Gortmaker (2023) for a more in-depth discussion of the standardized framework used by PyBLP for incorporating micro data into BLP-style estimation.

Parameters
  • name (str) – The unique name of the dataset, which will be used for outputting information about micro moments.

  • observations (int) – The number of observations \(N_d\) in the micro dataset.

  • compute_weights (callable) –

    Function for computing survey weights \(w_{dijt}\) in a market of the following form:

    compute_weights(t, products, agents) --> weights
    

    where t is the market in which to compute weights, products is the market’s Products (with \(J_t\) rows), and agents is the market’s Agents (with \(I_t\) rows), unless pyblp.options.micro_computation_chunks is larger than its default of 1, in which case agents is a chunk of the market’s Agents. Denoting the number of rows in agents by \(I\), the returned weights should be an array of one of the following shapes:

    • \(I \times J_t\): Conditions on inside purchases by assuming \(w_{di0t} = 0\). Rows correspond to agents \(i \in I\) in the same order as agent_data in Problem or Simulation and columns correspond to inside products \(j \in J_t\) in the same order as product_data in Problem or Simulation.

    • \(I \times (1 + J_t)\): The first column indexes the outside option, which can have nonzero survey weights \(w_{di0t}\).
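The two shapes above can be sketched as follows. This is a minimal illustration, not PyBLP's own code: the `products` and `agents` arrays are hypothetical NumPy stand-ins for the market's Products and Agents, and the function names and the 0.5 outside-option weight are assumptions made up for the example. Only the number of rows (`.size`) is used.

```python
import numpy as np

# Hypothetical stand-ins for one market's Products and Agents record arrays.
products = np.zeros(3, dtype=[("prices", "f8")])            # J_t = 3 rows
agents = np.zeros(5, dtype=[("demographics", "f8", (2,))])  # I = 5 rows, D = 2


def inside_only_weights(t, products, agents):
    """Return an I x J_t array, which conditions on inside purchases."""
    return np.ones((agents.size, products.size))


def with_outside_weights(t, products, agents):
    """Return an I x (1 + J_t) array whose first column is the outside option."""
    weights = np.ones((agents.size, 1 + products.size))
    weights[:, 0] = 0.5  # hypothetical: outside-option choosers sampled half as often
    return weights


inside = inside_only_weights("t1", products, agents)
outside = with_outside_weights("t1", products, agents)
print(inside.shape, outside.shape)  # (5, 3) (5, 4)
```

Either function has the `compute_weights(t, products, agents)` signature described above, so either could be passed as `compute_weights`.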

    Warning

    If different compute_weights functions are defined with lambda functions in a loop, any variable that changes within the loop should be passed to the lambda as an extra default argument so that its current value is captured. For example, lambda t, p, a: weights[t], where weights is some dictionary that changes in the outer loop, should instead be lambda t, p, a, weights=weights: weights[t]; otherwise, each lambda looks up weights only when it is called and sees whatever value it has at that time, typically the one from the loop's final iteration.
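The scoping issue in this warning is ordinary Python late binding and can be reproduced without PyBLP. In the sketch below, `weights_by_dataset` and its contents are hypothetical placeholders, and the lambdas return a scalar instead of a weights array only to keep the output easy to read.

```python
# Hypothetical per-dataset weight dictionaries, keyed by market ID.
weights_by_dataset = [{"t1": 1.0}, {"t1": 2.0}]

late_bound = []
early_bound = []
for weights in weights_by_dataset:
    # Looks up `weights` when called, so it sees the last loop value.
    late_bound.append(lambda t, p, a: weights[t])
    # Captures the current `weights` as a default argument.
    early_bound.append(lambda t, p, a, weights=weights: weights[t])

late_results = [f("t1", None, None) for f in late_bound]
early_results = [f("t1", None, None) for f in early_bound]
print(late_results)   # [2.0, 2.0]
print(early_results)  # [1.0, 2.0]
```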

    Warning

    If using product-specific demographics, agents.demographics will be an \(I_t \times D \times J_t\) array, instead of an \(I_t \times D\) array as usual. Demographics that are not product-specific will be repeated \(J_t\) times.

    Note

    Particularly when using product-specific demographics or second choices, it may be convenient to use numpy.einsum, which handles multiplying many multi-dimensional arrays with common dimensions in an elegant way.
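As a rough illustration of the numpy.einsum suggestion, the sketch below builds weights from a hypothetical \(I \times D \times J_t\) demographics array. The random data, the `selector` vector, and the product form \(w_{ijk} = a_{ij} b_{ik}\) are assumptions chosen only to show the index notation.

```python
import numpy as np

rng = np.random.default_rng(0)
I, D, J = 4, 2, 3

# Hypothetical product-specific demographics, shaped I x D x J_t.
demographics = rng.uniform(size=(I, D, J))

# Pick out the first demographic for every agent-product pair: I x J_t.
selector = np.array([1.0, 0.0])
first_choice_weights = np.einsum("idj,d->ij", demographics, selector)

# Hypothetical second choice weights w_ijk = a_ij * b_ik: I x J_t x J_t.
second_choice_factor = rng.uniform(size=(I, J))
second_choice_weights = np.einsum(
    "ij,ik->ijk", first_choice_weights, second_choice_factor
)
print(first_choice_weights.shape, second_choice_weights.shape)
```

Writing the subscripts out explicitly makes it harder to mix up the agent, demographic, and product axes than with manual reshaping and broadcasting.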

    If the micro dataset contains second choice data, weights can have a third axis corresponding to second choices \(k\) in \(w_{dijkt}\):

    • \(I \times J_t \times J_t\): Conditions on inside purchases by assuming \(w_{di0kt} = w_{dij0t} = 0\).

    • \(I \times (1 + J_t) \times J_t\): The first column indexes the outside option, but the second choice is assumed to be an inside option, \(w_{dij0t} = 0\).

    • \(I \times J_t \times (1 + J_t)\): The first index in the third axis indexes the outside option, but the first choice is assumed to be an inside option, \(w_{di0kt} = 0\).

    • \(I \times (1 + J_t) \times (1 + J_t)\): The first column and the first index in the third axis index the outside option as the first and second choice.
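One way to read the four shapes above is that the smaller arrays are the largest one with the outside-option slices fixed at zero. The sketch below shows that relationship with numpy.pad on made-up weights; it illustrates the stated assumptions and is not a description of how PyBLP handles the arrays internally.

```python
import numpy as np

I, J = 2, 3

# Hypothetical I x J_t x J_t weights: first and second choices are both inside.
inside_weights = np.ones((I, J, J))

# Prepend a zero slice for the outside option on the first and second choice axes.
full_weights = np.pad(inside_weights, ((0, 0), (1, 0), (1, 0)))

print(full_weights.shape)  # (2, 4, 4)
```

Here `full_weights[:, 0, :]` and `full_weights[:, :, 0]` are all zero, matching \(w_{di0kt} = w_{dij0t} = 0\), and `full_weights[:, 1:, 1:]` equals the original inside weights.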

    Warning

    Second choice moments can use a lot of memory, especially when \(J_t\) is large. If this becomes an issue, consider setting pyblp.options.micro_computation_chunks to a value higher than its default of 1, such as the highest \(J_t\). This will cut down on memory usage without affecting speed much.
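A back-of-the-envelope calculation gives a sense of scale. The sketch below counts only the bytes in a single float64 weights array of the largest second choice shape, for a hypothetical market with 1,000 agents and 200 products. It assumes agents are split into roughly equal chunks, which is an assumption for illustration and not a statement about PyBLP's exact chunking.

```python
import numpy as np

I, J = 1000, 200  # hypothetical market size
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per weight

# One I x (1 + J_t) x (1 + J_t) weights array for the whole market.
full_bytes = I * (1 + J) * (1 + J) * itemsize

# Assumed roughly even split of agents into chunks (illustrative only).
chunks = J  # "such as the highest J_t"
agent_chunks = np.array_split(np.arange(I), chunks)
largest_chunk = max(len(chunk) for chunk in agent_chunks)
chunk_bytes = largest_chunk * (1 + J) * (1 + J) * itemsize

# With PyBLP, the option itself would be set with:
# pyblp.options.micro_computation_chunks = chunks
print(full_bytes, chunk_bytes)  # 323208000 1616040
```

Under these assumptions a single weights array shrinks from about 323 MB to about 1.6 MB per chunk.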

  • eliminated_product_ids_index (int, optional) – Determines which products are eliminated from the choice set before second choices are made. If this is None, the default, only the first choice product \(j\) is eliminated. If instead a group of products that includes the first choice product is eliminated, this should be an integer between 0 and the number of columns in the product_ids field of product_data minus one, inclusive. The corresponding column of product_ids determines the groups.

  • market_ids (array-like, optional) – Distinct market IDs with nonzero survey weights \(w_{dijt}\). For other markets, \(w_{dijt} = 0\), and compute_weights will not be called.
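Putting the parameters together, the selected-sample example \(w_{dijt} = 1\{j \neq 0, t \in T_d\}\) from the description might be configured roughly as below. The survey array, dataset name, and market IDs are hypothetical, and the arguments are collected in a plain dictionary so the sketch runs without PyBLP installed.

```python
import numpy as np

# Hypothetical survey: the market each respondent was sampled from.
survey_market_ids = np.array(["t2", "t1", "t2", "t2", "t1"])

dataset_kwargs = dict(
    name="hypothetical survey",
    observations=survey_market_ids.size,  # N_d
    # I x J_t array of ones: inside purchasers only, equal weights.
    compute_weights=lambda t, p, a: np.ones((a.size, p.size)),
    # Distinct markets T_d; compute_weights is not called for other markets.
    market_ids=np.unique(survey_market_ids),
)

# With PyBLP installed, these would be passed as:
# micro_dataset = pyblp.MicroDataset(**dataset_kwargs)
print(dataset_kwargs["observations"], list(dataset_kwargs["market_ids"]))
```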

Examples

Methods