ATM: Improving Model Merging by Alternating Tuning and Merging
Model merging has recently emerged as a cost-efficient paradigm for multi-task learning. Among current approaches, task arithmetic stands out for its simplicity and effectiveness. In this paper, we motivate the effectiveness of task vectors by linking them to multi-task gradients. We show that in a single-epoch scenario, task vectors are mathematically equivalent to the gradients obtained via gradient descent in a multi-task setting, and still approximate these gradients in subsequent epochs. Furthermore, we show that task vectors perform optimally when equality is maintained, and their effectiveness is largely driven by the first epoch's gradient. Building on this insight, we propose viewing model merging as a single step in an iterative process that Alternates between Tuning and Merging (ATM). This method acts as a bridge between model merging and multi-task gradient descent, achieving state-of-the-art results with the same data and computational requirements. We extensively evaluate ATM across diverse settings, achieving up to 20% higher accuracy in computer vision and NLP tasks, compared to the best baselines. Finally, we provide both empirical and theoretical support for its effectiveness, demonstrating increased orthogonality between task vectors and proving that ATM minimizes an upper bound on the loss obtained by jointly finetuning all tasks.
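To make the stated gradient link concrete, here is a sketch of the single-epoch equivalence claimed in the abstract (the learning rate symbol η is a notational assumption; the paper's exact setting may differ): one full-batch gradient step on task t from the shared initialization θ₀ makes the task vector a scaled task gradient, so the summed task vectors recover the multi-task gradient, and with more steps the equality relaxes to the approximation the abstract describes.

```latex
\tau_t \;=\; \theta_t - \theta_0 \;=\; -\eta\, \nabla_\theta \mathcal{L}_t(\theta_0)
\qquad\Longrightarrow\qquad
\sum_{t=1}^{T} \tau_t \;=\; -\eta\, \nabla_\theta \sum_{t=1}^{T} \mathcal{L}_t(\theta_0).
```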
Published in: | arXiv.org 2024-11 |
---|---|
Main authors: | Zhou, Luca; Solombrino, Daniele; Donato Crisostomi; Bucarelli, Maria Sofia; Silvestri, Fabrizio; Rodolà, Emanuele |
Format: | Article |
Language: | eng |
Subjects: | Bridge maintenance; Computer vision; Effectiveness; Orthogonality; Tuning; Upper bounds |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Zhou, Luca; Solombrino, Daniele; Donato Crisostomi; Bucarelli, Maria Sofia; Silvestri, Fabrizio; Rodolà, Emanuele |
description | Model merging has recently emerged as a cost-efficient paradigm for multi-task learning. Among current approaches, task arithmetic stands out for its simplicity and effectiveness. In this paper, we motivate the effectiveness of task vectors by linking them to multi-task gradients. We show that in a single-epoch scenario, task vectors are mathematically equivalent to the gradients obtained via gradient descent in a multi-task setting, and still approximate these gradients in subsequent epochs. Furthermore, we show that task vectors perform optimally when equality is maintained, and their effectiveness is largely driven by the first epoch's gradient. Building on this insight, we propose viewing model merging as a single step in an iterative process that Alternates between Tuning and Merging (ATM). This method acts as a bridge between model merging and multi-task gradient descent, achieving state-of-the-art results with the same data and computational requirements. We extensively evaluate ATM across diverse settings, achieving up to 20% higher accuracy in computer vision and NLP tasks, compared to the best baselines. Finally, we provide both empirical and theoretical support for its effectiveness, demonstrating increased orthogonality between task vectors and proving that ATM minimizes an upper bound on the loss obtained by jointly finetuning all tasks. |
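The description above specifies the ATM loop procedurally; the following Python sketch illustrates it. This is not the authors' implementation: `finetune`, `rounds`, and `alpha` are hypothetical names, a simple unweighted sum of task vectors is assumed for the merge, and model parameters are modeled as plain dicts.

```python
import copy

def merge(theta, task_vectors, alpha=1.0):
    # Task arithmetic: add the scaled sum of all task vectors to the base.
    merged = dict(theta)
    for tv in task_vectors:
        for name, delta in tv.items():
            merged[name] = merged[name] + alpha * delta
    return merged

def atm(theta0, tasks, finetune, rounds=5, alpha=1.0):
    """Alternating Tuning and Merging (sketch under stated assumptions).

    theta0   -- dict mapping parameter names to values (floats or arrays)
    finetune -- hypothetical user-supplied routine: runs one epoch of
                finetuning on a task and returns the updated parameter dict
    """
    theta = dict(theta0)
    for _ in range(rounds):
        # Tuning: branch the current merged model and finetune it per task.
        branches = [finetune(copy.deepcopy(theta), task) for task in tasks]
        # Task vectors: per-parameter difference from the shared base.
        task_vectors = [{n: b[n] - theta[n] for n in theta} for b in branches]
        # Merging: one task-arithmetic step, then repeat from the new base.
        theta = merge(theta, task_vectors, alpha)
    return theta
```

With `rounds=1` the loop reduces to standard task arithmetic; the abstract's argument is that repeating it keeps each merged update close to a multi-task gradient step.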
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3125867779 |
source | Free E-Journals |
subjects | Bridge maintenance; Computer vision; Effectiveness; Orthogonality; Tuning; Upper bounds |
title | ATM: Improving Model Merging by Alternating Tuning and Merging |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-20T21%3A20%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=ATM:%20Improving%20Model%20Merging%20by%20Alternating%20Tuning%20and%20Merging&rft.jtitle=arXiv.org&rft.au=Zhou,%20Luca&rft.date=2024-11-06&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3125867779%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3125867779&rft_id=info:pmid/&rfr_iscdi=true |