Cross-layer Attention Sharing for Large Language Models

As large language models (LLMs) evolve, increases in model depth and parameter count lead to substantial redundancy. To make the attention mechanism more efficient, previous work has primarily compressed the KV cache or grouped attention heads, while largely overlooking redundancy between layers...

Bibliographic Details
Published in:	arXiv.org 2024-08
Main authors:	Mu, Yongyu; Wu, Yuzhang; Fan, Yuchun; Wang, Chenglong; Li, Hengyu; He, Qiaozhi; Yang, Murun; Tong, Xiao; Zhu, Jingbo
Format:	Article
Language:	English
Subjects:	Large language models; Redundancy; Weight reduction
Online access:	Full text