STEM: spatial speech separation using twin-delayed DDPG reinforcement learning and expectation maximization
<p>Although many high-performing speech separation models have been proposed recently, little attention has been paid to making them lightweight. In this paper, a novel speech separation algorithm is proposed that integrates the twin-delayed deep deterministic (TD3) policy gradient reinforceme...
| Main author: | |
|---|---|
| Other authors: | |
| Published: | 2025 |
| Subjects: | |
| _version_ | 1864513539910139904 |
|---|---|
| author | Muhammad Salman Khan (7202543) |
| author2 | Sania Gul (18272227) |
| author2_role | author |
| author_facet | Muhammad Salman Khan (7202543); Sania Gul (18272227) |
| author_role | author |
| dc.creator.none.fl_str_mv | Muhammad Salman Khan (7202543); Sania Gul (18272227) |
| dc.date.none.fl_str_mv | 2025-08-23T15:00:00Z |
| dc.identifier.none.fl_str_mv | 10.1016/j.apacoust.2025.111022 |
| dc.relation.none.fl_str_mv | https://figshare.com/articles/journal_contribution/STEM_spatial_speech_separation_using_twin-delayed_DDPG_reinforcement_learning_and_expectation_maximization/30135205 |
| dc.rights.none.fl_str_mv | CC BY 4.0; info:eu-repo/semantics/openAccess |
| dc.subject.none.fl_str_mv | Information and computing sciences; Artificial intelligence; Machine learning; Speech separation; Reinforcement learning; Continuous action space; Spatial cues; Reward function; Time–frequency masking |
| dc.title.none.fl_str_mv | STEM: spatial speech separation using twin-delayed DDPG reinforcement learning and expectation maximization |
| dc.type.none.fl_str_mv | Text; Journal contribution; info:eu-repo/semantics/publishedVersion; text; contribution to journal |
| description | <p>Although many high-performing speech separation models have been proposed recently, little attention has been paid to making them lightweight. This paper proposes a novel speech separation algorithm that integrates a twin-delayed deep deterministic policy gradient (TD3) reinforcement learning (RL) agent with the expectation maximization (EM) algorithm, which clusters the spatial cues of individual sources separated in azimuth. For stationary sources, the proposed system gives satisfactory performance in terms of quality, intelligibility, and separation speed, and generalizes well to test data from a mismatched speech corpus. Its perceptual evaluation of speech quality (PESQ) score is 0.55 points higher than that of a self-supervised learning (SSL) model and almost equivalent to that of diffusion models, at a computational cost and training data requirement many times lower than those algorithms need. Additionally, compared to a recently proposed lightweight transformer-based encoder-decoder framework, it reduces the required training data by a factor of 39, training time by a factor of 36, model size by a factor of 6, real-time factor (RTF) by 1 point, and multiply-accumulate operations (MACs) by a factor of 9, at the cost of a slightly lower PESQ score (by 0.45 points).</p><h2>Other Information</h2> <p> Published in: Applied Acoustics<br> License: <a href="http://creativecommons.org/licenses/by/4.0/" target="_blank">http://creativecommons.org/licenses/by/4.0/</a><br>See article on publisher's website: <a href="https://dx.doi.org/10.1016/j.apacoust.2025.111022" target="_blank">https://dx.doi.org/10.1016/j.apacoust.2025.111022</a></p> |
| eu_rights_str_mv | openAccess |
| id | Manara2_035a33b8adb3c0d2149359e752f33647 |
| identifier_str_mv | 10.1016/j.apacoust.2025.111022 |
| network_acronym_str | Manara2 |
| network_name_str | Manara2 |
| oai_identifier_str | oai:figshare.com:article/30135205 |
| publishDate | 2025 |
| repository.mail.fl_str_mv | |
| repository.name.fl_str_mv | |
| repository_id_str | |
| rights_invalid_str_mv | CC BY 4.0 |
| status_str | publishedVersion |
| title | STEM: spatial speech separation using twin-delayed DDPG reinforcement learning and expectation maximization |
| topic | Information and computing sciences; Artificial intelligence; Machine learning; Speech separation; Reinforcement learning; Continuous action space; Spatial cues; Reward function; Time–frequency masking |
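
The abstract in this record describes using expectation maximization (EM) to cluster the spatial cues of sources separated in azimuth. The record does not give the model details, so the following is only a minimal sketch of the general idea, not the authors' implementation: it assumes one hypothetical scalar spatial cue per time-frequency bin, two sources, and a plain two-component 1-D Gaussian mixture. The function name `em_two_gaussians`, the synthetic cue values, and all parameters are illustrative assumptions, and the TD3 agent is not modeled here.

```python
import math
import random


def em_two_gaussians(cues, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture to `cues` with EM.

    Returns (means, variances, weights, responsibilities), where
    responsibilities[i][k] is the posterior probability that cue i
    belongs to component k.
    """
    n = len(cues)
    # Assumption for this sketch: start the two means at the data extremes.
    means = [min(cues), max(cues)]
    mean_all = sum(cues) / n
    var_all = max(sum((x - mean_all) ** 2 for x in cues) / n, 1e-6)
    variances = [var_all, var_all]
    weights = [0.5, 0.5]
    resp = []

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each cue.
        resp = []
        for x in cues:
            likes = [
                weights[k]
                * math.exp(-((x - means[k]) ** 2) / (2 * variances[k]))
                / math.sqrt(2 * math.pi * variances[k])
                for k in range(2)
            ]
            total = sum(likes) or 1e-300  # guard against underflow to zero
            resp.append([like / total for like in likes])

        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, cues)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, cues)) / nk
            variances[k] = max(var_k, 1e-6)  # variance floor
    return means, variances, weights, resp


# Synthetic cues for two hypothetical sources at different azimuths.
rng = random.Random(0)
cues = [rng.gauss(-0.8, 0.15) for _ in range(300)]
cues += [rng.gauss(0.6, 0.15) for _ in range(300)]

means, variances, weights, resp = em_two_gaussians(cues)
# The posterior of component 0 can be read as a soft mask for source 0.
soft_mask_source0 = [r[0] for r in resp]
```

In a time-frequency masking setup, a soft mask of this kind would be applied to the mixture spectrogram bin by bin. How the published system forms its cues and masks, and how the RL agent interacts with EM, should be taken from the article itself.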