ATM: Improving Model Merging by Alternating Tuning and Merging

Model merging has recently emerged as a cost-efficient paradigm for multi-task learning. Among current approaches, task arithmetic stands out for its simplicity and effectiveness. In this paper, we motivate the effectiveness of task vectors by linking them to multi-task gradients. We show that in a single-epoch scenario, task vectors are mathematically equivalent to the gradients obtained via gradient descent in a multi-task setting, and still approximate these gradients in subsequent epochs. Furthermore, we show that task vectors perform optimally when equality is maintained, and their effectiveness is largely driven by the first epoch's gradient. Building on this insight, we propose viewing model merging as a single step in an iterative process that Alternates between Tuning and Merging (ATM). This method acts as a bridge between model merging and multi-task gradient descent, achieving state-of-the-art results with the same data and computational requirements. We extensively evaluate ATM across diverse settings, achieving up to 20% higher accuracy in computer vision and NLP tasks compared to the best baselines. Finally, we provide both empirical and theoretical support for its effectiveness, demonstrating increased orthogonality between task vectors and proving that ATM minimizes an upper bound on the loss obtained by jointly finetuning all tasks.
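
To make the stated link concrete, here is a one-step derivation under simplifying assumptions (plain gradient descent with one step per task; the learning rate \eta, merging coefficient \alpha, task count T, and per-task losses \mathcal{L}_t are illustrative notation, not defined in this record). Fine-tuning task t from the shared initialization \theta_0 gives \theta_t = \theta_0 - \eta \nabla \mathcal{L}_t(\theta_0), so each task vector is a scaled task gradient, and task-arithmetic merging amounts to one multi-task gradient step:

\tau_t \;=\; \theta_t - \theta_0 \;=\; -\eta\,\nabla\mathcal{L}_t(\theta_0),
\qquad
\theta_0 + \alpha \sum_{t=1}^{T} \tau_t
\;=\; \theta_0 - \alpha\,\eta\,\nabla\Big(\sum_{t=1}^{T} \mathcal{L}_t\Big)(\theta_0).

Over multiple epochs the equality relaxes to an approximation, which matches the abstract's claim that task vectors still approximate multi-task gradients in later epochs.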

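The alternating procedure itself can be sketched in a few lines of PyTorch-style Python. This is a minimal illustration of the tune-then-merge loop described in the abstract, not the authors' implementation: the helper finetune_one_epoch, the uniform merging coefficient alpha, and all hyperparameters are illustrative assumptions.

import copy
import torch

def finetune_one_epoch(model, loader, lr):
    # One epoch of plain SGD fine-tuning (illustrative; any optimizer works).
    # Assumes the loader yields (inputs, targets) batches for a classifier.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        optimizer.step()

def atm_merge(base_model, task_loaders, rounds=3, alpha=1.0, lr=1e-4):
    # Alternate between per-task tuning and task-arithmetic merging.
    merged = copy.deepcopy(base_model)
    for _ in range(rounds):
        task_vectors = []
        for loader in task_loaders:
            # Tuning step: fine-tune a copy of the current merged model.
            tuned = copy.deepcopy(merged)
            finetune_one_epoch(tuned, loader, lr)
            # Task vector: parameter delta relative to the merged model.
            task_vectors.append([p_t.detach() - p_m.detach()
                                 for p_t, p_m in zip(tuned.parameters(),
                                                     merged.parameters())])
        # Merging step: add back the scaled sum of task vectors
        # (task arithmetic).
        with torch.no_grad():
            for i, p_m in enumerate(merged.parameters()):
                p_m += alpha * sum(tv[i] for tv in task_vectors)
    return merged

With rounds=1 this reduces to standard task arithmetic from the base model; the repeated rounds are what tie the procedure to multi-task gradient descent, per the single-step equivalence sketched above.
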
Bibliographic Details
Published in: arXiv.org, 2024-11-06
Authors: Zhou, Luca; Solombrino, Daniele; Crisostomi, Donato; Bucarelli, Maria Sofia; Silvestri, Fabrizio; Rodolà, Emanuele
Format: Article
Language: English
Subjects: Bridge maintenance; Computer vision; Effectiveness; Orthogonality; Tuning; Upper bounds
Online access: Full text
EISSN: 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org
Rights: CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/)