Policy Learning with Adaptively Collected Data

Bibliographic Details
Published in: Management Science, 2024-08, Vol. 70(8), pp. 5270-5297
First author: Zhan, Ruohan
Format: Article
Language: English
Online access: Full text
Description
Abstract: In a wide variety of applications, including healthcare, bidding in first-price auctions, digital recommendations, and online education, it can be beneficial to learn a policy that assigns treatments to individuals based on their characteristics. The growing policy-learning literature focuses on settings in which policies are learned from historical data in which the treatment assignment rule is fixed throughout the data-collection period. However, adaptive data collection is becoming more common in practice from two primary sources: (1) data collected from adaptive experiments that are designed to improve inferential efficiency and (2) data collected from production systems that progressively evolve an operational policy to improve performance over time (e.g., contextual bandits). Yet adaptivity complicates the problem of learning an optimal policy ex post for two reasons: first, samples are dependent and, second, an adaptive assignment rule may not assign each treatment to each type of individual sufficiently often. In this paper, we address these challenges. We propose an algorithm based on generalized augmented inverse propensity weighted (AIPW) estimators, which nonuniformly reweight the elements of a standard AIPW estimator to control worst-case estimation variance. We establish a finite-sample regret upper bound for our algorithm and complement it with a regret lower bound that quantifies the fundamental difficulty of policy learning with adaptive data. When equipped with the best weighting scheme, our algorithm achieves minimax rate-optimal regret guarantees even with diminishing exploration. Finally, we demonstrate our algorithm's effectiveness using both synthetic data and public benchmark data sets.
This paper was accepted by Hamid Nazerzadeh, data science.
Funding: This work is supported by the National Science Foundation [Grant CCF-2106508]. R. Zhan was supported by Golub Capital and the Michael Yao and Sara Keying Dai AI and Digital Technology Fund. Z. Ren was supported by the Office of Naval Research [Grant N00014-20-1-2337]. S. Athey was supported by the Office of Naval Research [Grant N00014-19-1-2468]. Z. Zhou is generously supported by New York University's 2022–2023 Center for Global Economy and Business faculty research grant and the Digital Twin research grant from Bain & Company.
Supplemental Material: The data files are available at https://doi.org/10.1287/mnsc.2023.4921.
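For orientation, the abstract's central object, an AIPW estimator whose elements are nonuniformly reweighted, can be sketched in a few lines. The Python sketch below is illustrative only: the names (weighted_aipw_value, mu_hat, e_hat, h) are hypothetical, and it is not the authors' exact estimator or weighting scheme, which are defined in the paper itself.

```python
import numpy as np

def weighted_aipw_value(X, A, Y, pi, mu_hat, e_hat, h=None):
    """Nonuniformly weighted AIPW estimate of the value of policy `pi`.

    X : length-T sequence of contexts
    A : length-T array of arms actually assigned by the logging policy
    Y : length-T array of observed outcomes
    pi(x)          -> arm the candidate policy assigns to context x
    mu_hat(x, a)   -> estimated mean outcome of arm a at context x (outcome model)
    e_hat(t, x, a) -> probability that the logging policy assigned arm a at time t
                      (known exactly in adaptive experiments and contextual bandits)
    h : optional length-T nonnegative weights; h=None (uniform weights)
        recovers a standard AIPW estimator.
    """
    T = len(Y)
    scores = np.empty(T)
    for t in range(T):
        a = pi(X[t])                                # arm the candidate policy would choose
        direct = mu_hat(X[t], a)                    # outcome-model ("direct") term
        ipw = (A[t] == a) / e_hat(t, X[t], a)       # inverse-propensity correction factor
        scores[t] = direct + ipw * (Y[t] - direct)  # doubly robust AIPW score for arm a
    if h is None:
        h = np.ones(T)
    return float(np.dot(h, scores) / np.sum(h))     # weighted average of per-period scores
```

Policy learning would then search a candidate class for the policy maximizing this estimate. One variance-stabilizing choice of weights discussed in this line of work is proportional to the square root of the assignment probability of the policy's arm, which tempers the large inverse-propensity factors that arise as exploration diminishes; the weighting scheme actually analyzed in the paper, and its regret guarantees, are given in the full text.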
ISSN: 0025-1909, 1526-5501
DOI: 10.1287/mnsc.2023.4921