Bibliographic Details
Main Author: Qingnan Ji (22662198) (author)
Other Authors: Jinxia Wang (355966) (author), Lixian Wang (465239) (author)
Published: 2025
Subjects:
Description
Abstract: In modern multimodal interaction design, integrating information from diverse modalities—such as speech, vision, and text—presents a significant challenge. These modalities differ in structure, timing, and data volume, often leading to mismatches, low computational efficiency, and suboptimal user experiences during integration. This study aims to enhance both the efficiency and accuracy of multimodal information fusion. To achieve this, publicly available datasets—Carnegie Mellon University Multimodal Opinion Sentiment Intensity (CMU-MOSI) and Interactive Emotional Dyadic Motion Capture (IEMOCAP)—are employed to collect speech, visual, and textual data relevant to multimodal interaction scenarios. The data undergo preprocessing steps including noise reduction, feature extraction (e.g., Mel Frequency Cepstral Coefficients and keypoint detection), and temporal alignment. An improved Kuhn-Munkres algorithm is then proposed, extending the traditional bipartite graph matching model to support weighted multimodal matching. The algorithm dynamically adjusts weight coefficients based on the importance scores of each modality, while also incorporating a cross-modal correlation matrix as a constraint to improve the robustness of the matching process. The enhanced algorithm's performance is validated through information matching efficiency tests and user interaction satisfaction surveys. Experimental results show that it improves multimodal information matching accuracy by 28.2% over the baseline method. Integration efficiency increases by 18.7%, and computational complexity is significantly reduced, with average computation time decreased by 15.4%. User satisfaction also improves, with a 19.5% increase in experience ratings. Ablation studies further confirm the critical contribution of both the dynamic weighting mechanism and the correlation matrix constraint to the overall performance. This study introduces a novel optimization strategy for multimodal information integration, offering substantial theoretical value and broad applicability in intelligent interaction design and human-computer collaboration. These advancements contribute meaningfully to the development of next-generation multimodal interaction systems.
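
To make the matching idea concrete, the sketch below shows the general shape of weighted bipartite matching with an importance-based re-weighting step and a correlation-matrix constraint, solved with a standard Kuhn-Munkres (Hungarian) routine. This is a minimal illustration under assumed inputs, not the authors' implementation: the function name, the importance vectors, the correlation threshold, and the penalty scheme are all hypothetical choices made for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # standard Kuhn-Munkres / Hungarian solver


def weighted_multimodal_matching(similarity, importance_a, importance_b,
                                 correlation, corr_threshold=0.2):
    """Illustrative weighted bipartite matching between two sets of modality items.

    similarity   : (n, m) pairwise similarity scores between modality-A and modality-B items
    importance_a : (n,) importance scores for modality-A items (assumed inputs)
    importance_b : (m,) importance scores for modality-B items (assumed inputs)
    correlation  : (n, m) cross-modal correlation matrix used here as a hard constraint
    """
    # Dynamically re-weight pairwise similarities by per-item importance scores.
    weights = similarity * np.outer(importance_a, importance_b)

    # Kuhn-Munkres minimises cost, so negate the weights, then penalise pairs
    # whose cross-modal correlation falls below the threshold.
    cost = -weights
    cost[correlation < corr_threshold] = 1e6

    rows, cols = linear_sum_assignment(cost)

    # Drop assignments that only exist because of the penalty value.
    keep = cost[rows, cols] < 1e6
    return list(zip(rows[keep].tolist(), cols[keep].tolist()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.random((4, 5))      # e.g., speech-to-vision similarity scores
    corr = rng.random((4, 5))     # cross-modal correlation estimates
    imp_a, imp_b = rng.random(4), rng.random(5)
    print(weighted_multimodal_matching(sim, imp_a, imp_b, corr))
```

In this toy form the correlation matrix acts as a binary gate on admissible pairs; the paper's constraint could equally be folded into the cost as a soft term, which the same solver would handle without modification.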