Bowen Chen: SSAT-Adapter: Enhancing Vision-Language Model Few-shot Learning with Auxiliary Tasks

Bibliographic Details
Main Author: Bowen Chen (12156618) (author)
Other Authors: Yun Sing Koh (1221624) (author), Gill Dobbie (1192893) (author)
Published: 2025
Description
Summary: Traditional deep learning models often struggle in few-shot learning scenarios, where only limited labeled data is available. While the Contrastive Language-Image Pre-training (CLIP) model demonstrates impressive zero-shot capabilities, its performance in few-shot scenarios remains limited. Existing methods primarily aim to better exploit the small labeled dataset, which offers limited room for improvement. To overcome this constraint, we introduce a novel framework, SSAT-Adapter, that leverages CLIP's language understanding to generate informative auxiliary tasks, improving CLIP's performance and adaptability in few-shot settings.
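
The summary does not detail the method itself. For background, the sketch below shows the common frozen-CLIP-plus-residual-adapter recipe that few-shot CLIP adapters build on: a small trainable bottleneck module sits on top of frozen CLIP image features, and classification scores come from cosine similarity against text embeddings of class prompts. It does not implement the paper's auxiliary-task generation; the checkpoint name, prompt template, and hyperparameters are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP backbone; only the small adapter below would be trained
# on the few labeled shots. The checkpoint name is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in model.parameters():
    p.requires_grad = False

class Adapter(nn.Module):
    """Bottleneck MLP applied residually to frozen CLIP image features."""
    def __init__(self, dim, reduction=4, ratio=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )
        self.ratio = ratio  # blend adapted features with the originals

    def forward(self, x):
        return self.ratio * self.net(x) + (1.0 - self.ratio) * x

adapter = Adapter(dim=model.config.projection_dim)

@torch.no_grad()
def class_probabilities(images, class_names):
    """Score images against text embeddings of simple class prompts."""
    prompts = [f"a photo of a {name}" for name in class_names]
    inputs = processor(text=prompts, images=images,
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = F.normalize(adapter(img), dim=-1)
    txt = F.normalize(txt, dim=-1)
    # Cosine similarities scaled and normalized to class probabilities,
    # shape (num_images, num_classes).
    return (100.0 * img @ txt.T).softmax(dim=-1)

In a plain few-shot setup, the adapter would be trained by minimizing cross-entropy of these probabilities over the labeled shots; SSAT-Adapter additionally constructs auxiliary training tasks from CLIP's language side, which is the paper's contribution and is not reproduced here.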