ATM: Improving Model Merging by Alternating Tuning and Merging
Model merging has recently emerged as a cost-efficient paradigm for multi-task learning. Among current approaches, task arithmetic stands out for its simplicity and effectiveness. In this paper, we motivate the effectiveness of task vectors by linking them to multi-task gradients. We show that in a single-epoch scenario, task vectors are mathematically equivalent to the gradients obtained via gradient descent in a multi-task setting, and still approximate these gradients in subsequent epochs. Furthermore, we show that task vectors perform optimally when equality is maintained, and their effectiveness is largely driven by the first epoch's gradient. Building on this insight, we propose viewing model merging as a single step in an iterative process that Alternates between Tuning and Merging (ATM). This method acts as a bridge between model merging and multi-task gradient descent, achieving state-of-the-art results with the same data and computational requirements. We extensively evaluate ATM across diverse settings, achieving up to 20% higher accuracy in computer vision and NLP tasks, compared to the best baselines. Finally, we provide both empirical and theoretical support for its effectiveness, demonstrating increased orthogonality between task vectors and proving that ATM minimizes an upper bound on the loss obtained by jointly finetuning all tasks.
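To make the stated gradient link concrete, here is a sketch of the single-epoch equivalence claimed in the abstract (the learning rate symbol η is a notational assumption; the paper's exact setting may differ): one full-batch gradient step on task t from the shared initialization θ₀ makes the task vector a scaled task gradient, so the summed task vectors recover the multi-task gradient, and with more steps the equality relaxes to the approximation the abstract describes.

```latex
\tau_t \;=\; \theta_t - \theta_0 \;=\; -\eta\, \nabla_\theta \mathcal{L}_t(\theta_0)
\qquad\Longrightarrow\qquad
\sum_{t=1}^{T} \tau_t \;=\; -\eta\, \nabla_\theta \sum_{t=1}^{T} \mathcal{L}_t(\theta_0).
```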
Published in: | arXiv.org 2024-11 |
---|---|
Main authors: | Zhou, Luca; Solombrino, Daniele; Donato Crisostomi; Bucarelli, Maria Sofia; Silvestri, Fabrizio; Rodolà, Emanuele |
Format: | Article |
Language: | eng |
Subjects: | Bridge maintenance; Computer vision; Effectiveness; Orthogonality; Tuning; Upper bounds |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Zhou, Luca; Solombrino, Daniele; Donato Crisostomi; Bucarelli, Maria Sofia; Silvestri, Fabrizio; Rodolà, Emanuele |
description | Model merging has recently emerged as a cost-efficient paradigm for multi-task learning. Among current approaches, task arithmetic stands out for its simplicity and effectiveness. In this paper, we motivate the effectiveness of task vectors by linking them to multi-task gradients. We show that in a single-epoch scenario, task vectors are mathematically equivalent to the gradients obtained via gradient descent in a multi-task setting, and still approximate these gradients in subsequent epochs. Furthermore, we show that task vectors perform optimally when equality is maintained, and their effectiveness is largely driven by the first epoch's gradient. Building on this insight, we propose viewing model merging as a single step in an iterative process that Alternates between Tuning and Merging (ATM). This method acts as a bridge between model merging and multi-task gradient descent, achieving state-of-the-art results with the same data and computational requirements. We extensively evaluate ATM across diverse settings, achieving up to 20% higher accuracy in computer vision and NLP tasks, compared to the best baselines. Finally, we provide both empirical and theoretical support for its effectiveness, demonstrating increased orthogonality between task vectors and proving that ATM minimizes an upper bound on the loss obtained by jointly finetuning all tasks. |
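The description above specifies the ATM loop procedurally; the following Python sketch illustrates it. This is not the authors' implementation: `finetune`, `rounds`, and `alpha` are hypothetical names, a simple unweighted sum of task vectors is assumed for the merge, and model parameters are modeled as plain dicts.

```python
import copy

def merge(theta, task_vectors, alpha=1.0):
    # Task arithmetic: add the scaled sum of all task vectors to the base.
    merged = dict(theta)
    for tv in task_vectors:
        for name, delta in tv.items():
            merged[name] = merged[name] + alpha * delta
    return merged

def atm(theta0, tasks, finetune, rounds=5, alpha=1.0):
    """Alternating Tuning and Merging (sketch under stated assumptions).

    theta0   -- dict mapping parameter names to values (floats or arrays)
    finetune -- hypothetical user-supplied routine: runs one epoch of
                finetuning on a task and returns the updated parameter dict
    """
    theta = dict(theta0)
    for _ in range(rounds):
        # Tuning: branch the current merged model and finetune it per task.
        branches = [finetune(copy.deepcopy(theta), task) for task in tasks]
        # Task vectors: per-parameter difference from the shared base.
        task_vectors = [{n: b[n] - theta[n] for n in theta} for b in branches]
        # Merging: one task-arithmetic step, then repeat from the new base.
        theta = merge(theta, task_vectors, alpha)
    return theta
```

With `rounds=1` the loop reduces to standard task arithmetic; the abstract's argument is that repeating it keeps each merged update close to a multi-task gradient step.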
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3125867779 |
source | Free E-Journals |
subjects | Bridge maintenance; Computer vision; Effectiveness; Orthogonality; Tuning; Upper bounds |
title | ATM: Improving Model Merging by Alternating Tuning and Merging |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-20T21%3A20%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=ATM:%20Improving%20Model%20Merging%20by%20Alternating%20Tuning%20and%20Merging&rft.jtitle=arXiv.org&rft.au=Zhou,%20Luca&rft.date=2024-11-06&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3125867779%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3125867779&rft_id=info:pmid/&rfr_iscdi=true |