Efficient self-attention with smart pruning for sustainable large language models

Large Language Models (LLMs) have revolutionized artificial intelligence by enabling multitasking across diverse fields. However, their high computational demands result in significant environmental impacts, particularly in terms of energy and water consumption. This pap...

Bibliographic Details
Main Author: Samir Brahim Belhaouari (9427347) (author)
Other Authors: Insaf Kraidia (19198012) (author)
Published: 2025
_version_ 1864513534734368768
author Samir Brahim Belhaouari (9427347)
author2 Insaf Kraidia (19198012)
author2_role author
author_facet Samir Brahim Belhaouari (9427347)
Insaf Kraidia (19198012)
author_role author
dc.creator.none.fl_str_mv Samir Brahim Belhaouari (9427347)
Insaf Kraidia (19198012)
dc.date.none.fl_str_mv 2025-03-24T09:00:00Z
dc.identifier.none.fl_str_mv 10.1038/s41598-025-92586-5
dc.relation.none.fl_str_mv https://figshare.com/articles/journal_contribution/Efficient_self-attention_with_smart_pruning_for_sustainable_large_language_models/30393217
dc.rights.none.fl_str_mv CC BY 4.0
info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Information and computing sciences
Artificial intelligence
Machine learning
Theory of computation
Large Language Models (LLMs)
Consumption
Computational Demands
Self-attention
Compression
Pruning
dc.title.none.fl_str_mv Efficient self-attention with smart pruning for sustainable large language models
dc.type.none.fl_str_mv Text
Journal contribution
info:eu-repo/semantics/publishedVersion
text
contribution to journal
description <p dir="ltr">Large Language Models (LLMs) have revolutionized artificial intelligence by enabling multitasking across diverse fields. However, their high computational demands result in significant environmental impacts, particularly in terms of energy and water consumption. This paper addresses these issues by proposing a compression approach to reduce LLM size. We focus on compressing the internal transformer layers, which are critical contributors to LLMs’ computational complexity. Our approach combines new mathematical and structural methods for model compression. First, we apply Forward Propagation Pruning (FPP) to compress the embedding and feed-forward layers, using a weight freezing and zeroing technique for parameters suspected to be unused. This reduces the number of trainable parameters, accelerating the overall training process and enabling faster convergence. Second, we introduce the Weight Matrix Folding method to prune the self-attention layer matrices with a simple and efficient mathematical model. This method integrates Identical Row Compression (IRC) to optimize the compression of the Query and Key matrices, alongside Diagonal Weight Compression (DWC), which reformulates the Value matrix into a diagonal structure. Consequently, this technique significantly diminishes parameter variability across the three matrices, enhancing consistency and performance while reducing complexity. The compression approach is evaluated on three language modeling datasets and eight widely used classification datasets and compared with various pruning methods. Our method compresses transformer layers by 99% and linear layers by 70%, resulting in an overall model compression of around 70%, while maintaining nearly the same accuracy. Notably, with moderate compression rates of 20% to 40%, model performance not only remained stable but even improved. This leads to substantial reductions in memory usage and computational demands, making LLMs more resource-efficient and highlighting the potential to optimize them for a more sustainable AI future.</p><h2>Other Information</h2><p dir="ltr">Published in: Scientific Reports<br>License: <a href="https://creativecommons.org/licenses/by/4.0" target="_blank">https://creativecommons.org/licenses/by/4.0</a><br>See article on publisher's website: <a href="https://dx.doi.org/10.1038/s41598-025-92586-5" target="_blank">https://dx.doi.org/10.1038/s41598-025-92586-5</a></p>
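The description above says that Diagonal Weight Compression reformulates the Value matrix into a diagonal structure. The record gives no further detail, so the following is only a minimal sketch of what a diagonal projection implies in general, not the authors' procedure. It assumes the Value projection is square (d x d) and that "diagonal" means per-feature scaling; the function names `dense_project`, `diagonal_project`, and `param_reduction` are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: the catalog record does not specify how DWC is
# implemented. Assumption: a square d x d Value projection is replaced by a
# diagonal matrix, so the projection becomes an element-wise scaling.

def dense_project(x, weight):
    """Multiply a row vector x (length d) by a dense d x d matrix."""
    d = len(x)
    return [sum(x[i] * weight[i][j] for i in range(d)) for j in range(d)]


def diagonal_project(x, diag):
    """Multiply x by diag(diag) without building the full matrix."""
    return [xi * wi for xi, wi in zip(x, diag)]


def param_reduction(d):
    """Fraction of parameters removed when a d x d matrix is stored as d diagonal entries."""
    return 1.0 - d / (d * d)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    diag = [0.5, -1.0, 2.0]
    # Build the equivalent dense matrix to check both paths agree.
    dense = [[diag[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
    assert dense_project(x, dense) == diagonal_project(x, diag)
    print(diagonal_project(x, diag))
    print(round(param_reduction(768), 4))
```

Under these assumptions, a diagonal projection stores d values instead of d squared and costs d multiplications per token instead of d squared. How this relates to the compression rates reported in the description is not stated in the record.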
eu_rights_str_mv openAccess
id Manara2_cc89bcc34420a21c5a44a39d1c7e11a8
identifier_str_mv 10.1038/s41598-025-92586-5
network_acronym_str Manara2
network_name_str Manara2
oai_identifier_str oai:figshare.com:article/30393217
publishDate 2025
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
rights_invalid_str_mv CC BY 4.0
status_str publishedVersion
title Efficient self-attention with smart pruning for sustainable large language models
topic Information and computing sciences
Artificial intelligence
Machine learning
Theory of computation
Large Language Models (LLMs)
Consumption
Computational Demands
Self-attention
Compression
Pruning