Performance comparison of CAL module variants.

Current visual question answering (VQA) models are limited in their multimodal feature fusion capabilities and often give inadequate consideration to local information. To address these limitations, this study proposes a multimodal Transformer VQA network based on local and global i...
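The description is cut off before it explains how the proposed network combines local and global information, so the following is only a generic sketch of one common way to do so. It is not the authors' method and not their CAL module. It assumes a single question embedding and a set of image-region embeddings of the same dimension. Scaled dot-product cross-attention, with the question as the query, builds a question-guided local context. Mean-pooling the regions gives a global context, and the two are concatenated with the question. All function names (`cross_attend`, `mean_pool`, `fuse_local_global`) are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over key/value vectors."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    # Weighted sum of the value vectors, one output element per dimension.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def mean_pool(features: List[Vector]) -> Vector:
    """Global descriptor: element-wise mean of all region features."""
    n = len(features)
    return [sum(f[i] for f in features) / n for i in range(len(features[0]))]


def fuse_local_global(question: Vector, regions: List[Vector]) -> Vector:
    """Hypothetical fusion of question-guided local context, global context, and question."""
    local_ctx = cross_attend(question, regions, regions)  # emphasises question-relevant regions
    global_ctx = mean_pool(regions)                        # whole-image summary, question-agnostic
    return local_ctx + global_ctx + question               # list concatenation -> 3 * d features


if __name__ == "__main__":
    q = [1.0, 0.0]
    regs = [[1.0, 0.0], [0.0, 1.0]]
    print(fuse_local_global(q, regs))
```

With one-hot region vectors, the local context equals the attention weights, so the region aligned with the question receives the larger share. A real multimodal Transformer would add learned projections, multiple heads, and stacked layers on top of this basic pattern.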

Bibliographic Details
Main Author:	Cuiyang Huang (21647898) (author)
Other Authors:	Zihan Hu (15363084) (author)
Published:	2025
Subjects:	Biotechnology, Science Policy, Space Science, Biological Sciences not elsewhere classified, Information Systems not elsewhere classified

Similar Items