Policy Learning with Adaptively Collected Data
Published in: | Management science 2024-08, Vol.70 (8), p.5270-5297 |
---|---|
Author: | Zhan, Ruohan |
Format: | Article |
Language: | eng |
Subjects: | adaptive data collection; contextual bandits; minimax optimality; off-line policy learning; personalized decision making |
Online access: | Full text |
container_end_page | 5297 |
---|---|
container_issue | 8 |
container_start_page | 5270 |
container_title | Management science |
container_volume | 70 |
creator | Zhan, Ruohan |
description | In a wide variety of applications, including healthcare, bidding in first price auctions, digital recommendations, and online education, it can be beneficial to learn a policy that assigns treatments to individuals based on their characteristics. The growing policy-learning literature focuses on settings in which policies are learned from historical data in which the treatment assignment rule is fixed throughout the data-collection period. However, adaptive data collection is becoming more common in practice from two primary sources: (1) data collected from adaptive experiments that are designed to improve inferential efficiency and (2) data collected from production systems that progressively evolve an operational policy to improve performance over time (e.g., contextual bandits). Yet adaptivity complicates the problem of learning an optimal policy ex post for two reasons: first, samples are dependent and, second, an adaptive assignment rule may not assign each treatment to each type of individual sufficiently often. In this paper, we address these challenges. We propose an algorithm based on generalized augmented inverse propensity weighted (AIPW) estimators, which nonuniformly reweight the elements of a standard AIPW estimator to control worst case estimation variance. We establish a finite-sample regret upper bound for our algorithm and complement it with a regret lower bound that quantifies the fundamental difficulty of policy learning with adaptive data. When equipped with the best weighting scheme, our algorithm achieves minimax rate-optimal regret guarantees even with diminishing exploration. Finally, we demonstrate our algorithm’s effectiveness using both synthetic data and public benchmark data sets.
This paper was accepted by Hamid Nazerzadeh, data science.
Funding:
This work is supported by the National Science Foundation [Grant CCF-2106508]. R. Zhan was supported by Golub Capital and the Michael Yao and Sara Keying Dai AI and Digital Technology Fund. Z. Ren was supported by the Office of Naval Research [Grant N00014-20-1-2337]. S. Athey was supported by the Office of Naval Research [Grant N00014-19-1-2468]. Z. Zhou is generously supported by New York University’s 2022–2023 Center for Global Economy and Business faculty research grant and the Digital Twin research grant from Bain & Company.
Supplemental Material:
The data files are available at https://doi.org/10.1287/mnsc.2023.4921. |
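The description above centers on a generalized AIPW estimator: each round's doubly robust score is reweighted nonuniformly before averaging, so that worst-case estimation variance under adaptive data collection stays controlled. The sketch below is only a rough illustration of that idea, not the paper's implementation: it computes a weighted AIPW estimate of one candidate policy's value from synthetic adaptively logged bandit data, using a simple square-root-of-propensity weight as a stand-in for the paper's variance-controlling scheme. The function names, the toy logging loop, and the zero reward model are all assumptions made for this example.

```python
# Illustrative sketch (not the authors' reference implementation) of a
# weighted AIPW policy-value estimate on adaptively collected bandit data.
# Assumptions: logging propensities e_t(x, a) were recorded at collection
# time, mu_hat stands in for a reward model fit on earlier data, and the
# square-root-of-propensity weight is a placeholder, not the paper's
# variance-controlling weighting scheme.
import numpy as np

def aipw_scores(y, a, propensities, mu_hat):
    """Per-round AIPW score matrix: Gamma_t[k] = mu_hat[t, k]
    + 1{A_t = k} / e_t(k) * (Y_t - mu_hat[t, k])."""
    T = len(y)
    scores = mu_hat.copy()                       # (T, n_arms) model predictions
    rows = np.arange(T)
    correction = (y - mu_hat[rows, a]) / propensities[rows, a]
    scores[rows, a] += correction                # IPW correction only on the observed arm
    return scores

def weighted_policy_value(policy_arms, scores, weights):
    """Nonuniformly weighted average of the AIPW scores of the arms the
    candidate policy would have chosen; weights are sum-normalized."""
    rows = np.arange(len(policy_arms))
    w = weights / weights.sum()
    return float(np.sum(w * scores[rows, policy_arms]))

# Toy usage with synthetic adaptively collected data.
rng = np.random.default_rng(0)
T, n_arms = 500, 3
mu_true = np.array([0.2, 0.5, 0.8])
propensities = np.zeros((T, n_arms))
a = np.zeros(T, dtype=int)
y = np.zeros(T)
for t in range(T):
    # A crude adaptive logger: exploration probability decays over time,
    # so some arms are assigned less and less often.
    eps = max(0.05, 1.0 / np.sqrt(t + 1))
    greedy = 2                                   # pretend the logger has settled on arm 2
    p = np.full(n_arms, eps / n_arms)
    p[greedy] += 1.0 - eps
    propensities[t] = p
    a[t] = rng.choice(n_arms, p=p)
    y[t] = rng.normal(mu_true[a[t]], 1.0)

mu_hat = np.zeros((T, n_arms))                   # trivial (zero) reward model for the sketch
scores = aipw_scores(y, a, propensities, mu_hat)
pi = np.full(T, 1)                               # candidate policy: always play arm 1
h = np.sqrt(propensities[np.arange(T), pi])      # illustrative stabilizing weights
print("weighted AIPW value of pi:", weighted_policy_value(pi, scores, h))
```

In the paper, it is the estimator equipped with the derived weighting scheme, maximized over a policy class, that yields the minimax rate-optimal regret guarantee even under diminishing exploration; the sketch above only evaluates a single fixed policy.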
doi_str_mv | 10.1287/mnsc.2023.4921 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 0025-1909 |
ispartof | Management science, 2024-08, Vol.70 (8), p.5270-5297 |
issn | 0025-1909 1526-5501 |
language | eng |
recordid | cdi_proquest_journals_3095236006 |
source | INFORMS PubsOnLine |
subjects | adaptive data collection; Algorithms; Assignment; Auctions; Bidding; contextual bandits; Data collection; Decision making; Distance learning; Estimating techniques; Health care; Learning; Management science; minimax optimality; off-line policy learning; personalized decision making; Process engineering; Regret; Weighting |
title | Policy Learning with Adaptively Collected Data |