PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning

Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks. LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment to be usable in downstream applications. However, this sequential training pipeline incurs an alignment tax that degrades LLM performance. This paper introduces PAFT, a new PArallel training paradigm for effective LLM Fine-Tuning, which independently performs SFT and preference alignment (e.g., DPO or ORPO) on the same pre-trained model with their respective datasets. The model produced by SFT and the model from preference alignment are then merged into a final model by parameter fusing for use in downstream applications. This work reveals the important finding that preference alignment such as DPO naturally yields a sparse model, while SFT yields a naturally dense model that must be sparsified for effective model merging. The paper accordingly introduces an effective interference-resolution method that reduces redundancy by sparsifying the delta parameters. The LLM resulting from this new training paradigm achieved Rank #1 on the HuggingFace Open LLM Leaderboard. Comprehensive evaluation shows the effectiveness of the parallel training paradigm.
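The parameter-fusing step can be made concrete with a short sketch. The PyTorch code below is a minimal illustration under stated assumptions: the sparsification scheme (here, magnitude-based top-k on each delta tensor), the density value, and all function names are hypothetical, since the abstract does not specify the paper's exact interference-resolution procedure.

```python
# Hypothetical sketch of PAFT-style parameter fusing; the density value,
# function names, and magnitude-based top-k sparsification are
# illustrative assumptions, not the paper's implementation.
import torch

def delta(tuned_sd: dict, base_sd: dict) -> dict:
    """Delta parameters: fine-tuned weights minus pre-trained weights."""
    return {k: tuned_sd[k] - base_sd[k] for k in base_sd}

def sparsify(delta_sd: dict, density: float = 0.2) -> dict:
    """Zero all but the largest-magnitude fraction of each delta tensor,
    reducing redundancy (and thus interference) before merging."""
    out = {}
    for name, d in delta_sd.items():
        flat = d.abs().flatten()
        keep = max(1, int(density * flat.numel()))
        threshold = torch.topk(flat, keep).values.min()
        out[name] = torch.where(d.abs() >= threshold, d, torch.zeros_like(d))
    return out

def fuse(base_sd: dict, sft_sd: dict, dpo_sd: dict, density: float = 0.2) -> dict:
    """Merge the SFT model and the preference-aligned (e.g., DPO) model:
    the dense SFT delta is sparsified first, the DPO delta is taken as-is
    (the paper finds it naturally sparse), and both are added back to the
    shared pre-trained weights."""
    sft_delta = sparsify(delta(sft_sd, base_sd), density)
    dpo_delta = delta(dpo_sd, base_sd)
    return {k: base_sd[k] + sft_delta[k] + dpo_delta[k] for k in base_sd}
```

In practice the three state dicts would come from the shared pre-trained checkpoint and the two independently fine-tuned checkpoints (e.g., via `model.state_dict()`).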

Bibliographic Details
Main Authors: Pentyala, Shiva Kumar; Wang, Zhichao; Bi, Bin; Ramnath, Kiran; Mao, Xiang-Bo; Radhakrishnan, Regunathan; Asur, Sitaram; Na, Cheng
Format: Article
Language: English
Subjects: Computer Science - Computation and Language
DOI: 10.48550/arxiv.2406.17923
Online Access: https://arxiv.org/abs/2406.17923