Enhancing Parameter Efficiency in Model Inference Using an Ultralight Inter-Transformer Linear Structure

Bibliographic Details
Published in: IEEE Access, 2024, Vol. 12, pp. 43734-43746
Main Authors: Shi, Haoxiang; Sakai, Tetsuya
Format: Article
Language: English
Description
Abstract: Pre-trained language models are the cornerstone of modern natural language processing and information retrieval. However, fine-tuning all of their parameters reduces the efficiency of models in both training and inference owing to their increasingly heavy structures. Existing methods for parameter efficiency still require approximately 1 MB of storage and roughly 10^7 operations during model deployment and inference. This strains the storage and processor capacity of end devices such as smartphones and IoT equipment, and slow model inference adversely affects the user experience. To achieve more efficient and storage-friendly inference than mainstream methods such as low-rank adaptation (LoRA) and Adapter, this paper proposes LayerConnect (hyper-network-assisted interlayer connectors). Extensive experiments were conducted to validate the performance of LayerConnect on two essential tasks with completely different learning frameworks and purposes: natural language understanding (using the General Language Understanding Evaluation (GLUE) benchmark) and information retrieval (using the contextualized inverted list (COIL) framework). On both tasks, LayerConnect saves up to 95.31% and 91.18% of the parameters used by LoRA and Adapter, respectively. At the same time, it keeps the performance degradation on GLUE and COIL below 8% and 3% compared to LoRA, and below 5% and 3% compared to Adapter. In addition, LayerConnect requires approximately 100 kB of storage per task-specific trained model on both tasks and reduces the number of operations in model inference by four orders of magnitude, to approximately 10^3.
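
The abstract describes LayerConnect only at a high level: lightweight linear connector modules, assisted by a hyper-network, sit between the frozen transformer layers, so only the connectors need to be trained and stored per task. The sketch below is an illustrative assumption of what such an inter-layer linear connector could look like in PyTorch; the class name InterLayerConnector, the bottleneck size, and the residual wiring are hypothetical, and the hyper-network component is omitted because the abstract does not specify its structure. This is not the authors' implementation.

```python
# Illustrative sketch only (not the authors' released code): a tiny linear
# "connector" placed between two frozen transformer layers. The class name,
# bottleneck size, and residual wiring are assumptions for illustration.
import torch
import torch.nn as nn


class InterLayerConnector(nn.Module):
    """A low-dimensional linear map applied to the hidden states passed
    from one frozen transformer layer to the next; only these connectors
    would be trained and stored per task."""

    def __init__(self, hidden_size: int, bottleneck: int = 8):
        super().__init__()
        # Down- and up-projections keep the trainable footprint small:
        # roughly 2 * hidden_size * bottleneck parameters per connector.
        self.down = nn.Linear(hidden_size, bottleneck, bias=False)
        self.up = nn.Linear(bottleneck, hidden_size, bias=False)
        # Zero-initialising the up-projection makes the connector start as an
        # identity map, so the frozen backbone's behavior is unchanged at step 0.
        nn.init.zeros_(self.up.weight)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual form: backbone output plus a small learned correction.
        return hidden_states + self.up(self.down(hidden_states))


if __name__ == "__main__":
    connector = InterLayerConnector(hidden_size=768, bottleneck=8)
    x = torch.randn(2, 16, 768)  # (batch, sequence length, hidden size)
    print(connector(x).shape)    # torch.Size([2, 16, 768])
    print(sum(p.numel() for p in connector.parameters()))  # 12288 parameters
```

Freezing the backbone and training only such small per-layer modules is what keeps the per-task storage in the kilobyte range; the exact parameter budget and operation count reported in the paper come from its own design, not from this toy example.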
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3378518