STEM: spatial speech separation using twin-delayed DDPG reinforcement learning and expectation maximization

Bibliographic details
Main author: Muhammad Salman Khan (7202543) (author)
Other authors: Sania Gul (18272227) (author)
Published: 2025
Subjects:
_version_ 1864513539910139904
author Muhammad Salman Khan (7202543)
author2 Sania Gul (18272227)
author2_role author
author_facet Muhammad Salman Khan (7202543)
Sania Gul (18272227)
author_role author
dc.creator.none.fl_str_mv Muhammad Salman Khan (7202543)
Sania Gul (18272227)
dc.date.none.fl_str_mv 2025-08-23T15:00:00Z
dc.identifier.none.fl_str_mv 10.1016/j.apacoust.2025.111022
dc.relation.none.fl_str_mv https://figshare.com/articles/journal_contribution/STEM_spatial_speech_separation_using_twin-delayed_DDPG_reinforcement_learning_and_expectation_maximization/30135205
dc.rights.none.fl_str_mv CC BY 4.0
info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Information and computing sciences
Artificial intelligence
Machine learning
Speech separation
Reinforcement learning
Continuous action space
Spatial cues
Reward function
Time–frequency masking
dc.title.none.fl_str_mv STEM: spatial speech separation using twin-delayed DDPG reinforcement learning and expectation maximization
dc.type.none.fl_str_mv Text
Journal contribution
info:eu-repo/semantics/publishedVersion
text
contribution to journal
description <p>Although many high-performing speech separation models have been proposed recently, little attention has been paid to making them lightweight. This paper proposes a novel speech separation algorithm that integrates a twin-delayed deep deterministic policy gradient (TD3) reinforcement learning (RL) agent with the expectation maximization (EM) algorithm to cluster the spatial cues of individual sources separated in azimuth. For stationary sources, the proposed system performs satisfactorily in terms of quality, intelligibility, and separation speed, and generalizes well to test data from a mismatched speech corpus. Its perceptual evaluation of speech quality (PESQ) score is 0.55 points higher than that of a self-supervised learning (SSL) model and almost equivalent to that of diffusion models, at a computational cost and training-data requirement many times lower than those of these algorithms. Additionally, compared with a recently proposed lightweight transformer-based encoder-decoder framework, it reduces the required training data by a factor of 39, training time by a factor of 36, model size by a factor of 6, real-time factor (RTF) by 1 point, and multiply-accumulate operations (MACs) by a factor of 9, at the cost of a slight decrease in PESQ score (0.45 points).</p><h2>Other Information</h2> <p> Published in: Applied Acoustics<br> License: <a href="http://creativecommons.org/licenses/by/4.0/" target="_blank">http://creativecommons.org/licenses/by/4.0/</a><br>See article on publisher's website: <a href="https://dx.doi.org/10.1016/j.apacoust.2025.111022" target="_blank">https://dx.doi.org/10.1016/j.apacoust.2025.111022</a></p>
eu_rights_str_mv openAccess
id Manara2_035a33b8adb3c0d2149359e752f33647
identifier_str_mv 10.1016/j.apacoust.2025.111022
network_acronym_str Manara2
network_name_str Manara2
oai_identifier_str oai:figshare.com:article/30135205
publishDate 2025
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
rights_invalid_str_mv CC BY 4.0
spellingShingle STEM: spatial speech separation using twin-delayed DDPG reinforcement learning and expectation maximization
Muhammad Salman Khan (7202543)
Information and computing sciences
Artificial intelligence
Machine learning
Speech separation
Reinforcement learning
Continuous action space
Spatial cues
Reward function
Time–frequency masking
status_str publishedVersion
title STEM: spatial speech separation using twin-delayed DDPG reinforcement learning and expectation maximization
title_full STEM: spatial speech separation using twin-delayed DDPG reinforcement learning and expectation maximization
title_fullStr STEM: spatial speech separation using twin-delayed DDPG reinforcement learning and expectation maximization
title_full_unstemmed STEM: spatial speech separation using twin-delayed DDPG reinforcement learning and expectation maximization
title_short STEM: spatial speech separation using twin-delayed DDPG reinforcement learning and expectation maximization
title_sort STEM: spatial speech separation using twin-delayed DDPG reinforcement learning and expectation maximization
topic Information and computing sciences
Artificial intelligence
Machine learning
Speech separation
Reinforcement learning
Continuous action space
Spatial cues
Reward function
Time–frequency masking