Efficient self-attention with smart pruning for sustainable large language models

Large Language Models (LLMs) have revolutionized artificial intelligence by enabling multitasking across diverse fields. However, their high computational demands result in significant environmental impacts, particularly in terms of energy and water consumption. This pap...

Bibliographic Details
Main Author:	Samir Brahim Belhaouari (9427347) (author)
Other Authors:	Insaf Kraidia (19198012) (author)
Published:	2025
Subjects:	Information and computing sciences; Artificial intelligence; Machine learning; Theory of computation; Large Language Models (LLMs); Consumption; Computational Demands; Self-attention; Compression; Pruning

_version_	1864513534734368768
author	Samir Brahim Belhaouari (9427347)
author2	Insaf Kraidia (19198012)
author2_role	author
author_facet	Samir Brahim Belhaouari (9427347); Insaf Kraidia (19198012)
author_role	author
dc.creator.none.fl_str_mv	Samir Brahim Belhaouari (9427347); Insaf Kraidia (19198012)
dc.date.none.fl_str_mv	2025-03-24T09:00:00Z
dc.identifier.none.fl_str_mv	10.1038/s41598-025-92586-5
dc.relation.none.fl_str_mv	https://figshare.com/articles/journal_contribution/Efficient_self-attention_with_smart_pruning_for_sustainable_large_language_models/30393217
dc.rights.none.fl_str_mv	CC BY 4.0; info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv	Information and computing sciences; Artificial intelligence; Machine learning; Theory of computation; Large Language Models (LLMs); Consumption; Computational Demands; Self-attention; Compression; Pruning
dc.title.none.fl_str_mv	Efficient self-attention with smart pruning for sustainable large language models
dc.type.none.fl_str_mv	Text; Journal contribution; info:eu-repo/semantics/publishedVersion; text; contribution to journal
description	Large Language Models (LLMs) have revolutionized artificial intelligence by enabling multitasking across diverse fields. However, their high computational demands result in significant environmental impacts, particularly in terms of energy and water consumption. This paper addresses these issues by proposing an innovative compression approach to reducing LLM sizes. We focus on compressing the internal transformer layers, which are critical contributors to LLMs’ computational complexity. Our approach combines new mathematical and structural key methods for model compression. We begin by applying Forward Propagation Pruning (FPP) to compress the embedding and feed-forward layers, utilizing a weight freezing and zeroing technique for suspected unused parameters. This reduces the number of trainable parameters, accelerating the overall training process and enabling faster convergence. Second, the Weight Matrix Folding method is introduced to efficiently prune the self-attention layer matrices in a simple and efficient mathematical model. This method integrates Identical Row Compression (IRC) to optimize the compression of the Query and Key matrices, alongside Diagonal Weight Compression (DWC), which reformulates the Value matrix into a diagonal structure. Consequently, this technique significantly diminishes parameter variability across the three metrics, enhancing consistency and performance while simplifying complexity. The compression approach is evaluated on three language modeling datasets and eight widely used classification datasets, comparing it to various pruning methods. Our method successfully compresses transformer layers by 99% and linear layers by 70%, resulting in an overall model compression of around 70%, while maintaining nearly the same accuracy. Notably, with moderate compression rates of 20% to 40%, model performance not only remained stable but even improved. This leads to substantial reductions in memory usage and computational demands, making LLMs more resource-efficient and highlighting the potential to optimize them for a more sustainable AI future.
	Other Information
	Published in: Scientific Reports
	License: https://creativecommons.org/licenses/by/4.0
	See article on publisher's website: https://dx.doi.org/10.1038/s41598-025-92586-5
eu_rights_str_mv	openAccess
id	Manara2_cc89bcc34420a21c5a44a39d1c7e11a8
identifier_str_mv	10.1038/s41598-025-92586-5
network_acronym_str	Manara2
network_name_str	Manara2
oai_identifier_str	oai:figshare.com:article/30393217
publishDate	2025
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
rights_invalid_str_mv	CC BY 4.0
spellingShingle	Efficient self-attention with smart pruning for sustainable large language models; Samir Brahim Belhaouari (9427347); Information and computing sciences; Artificial intelligence; Machine learning; Theory of computation; Large Language Models (LLMs); Consumption; Computational Demands; Self-attention; Compression; Pruning
status_str	publishedVersion
title	Efficient self-attention with smart pruning for sustainable large language models
title_full	Efficient self-attention with smart pruning for sustainable large language models
title_fullStr	Efficient self-attention with smart pruning for sustainable large language models
title_full_unstemmed	Efficient self-attention with smart pruning for sustainable large language models
title_short	Efficient self-attention with smart pruning for sustainable large language models
title_sort	Efficient self-attention with smart pruning for sustainable large language models
topic	Information and computing sciences; Artificial intelligence; Machine learning; Theory of computation; Large Language Models (LLMs); Consumption; Computational Demands; Self-attention; Compression; Pruning

Similar Items