Binary Classification of 3D Small-Scale Medical Images Using Video Vision Transformers


Bibliographic Details
Main Author: Abi Younes, Simon (author)
Format: masterThesis
Published: 2025
Online Access: http://hdl.handle.net/10725/16688
https://doi.org/10.26756/th.2023.767
http://libraries.lau.edu.lb/research/laur/terms-of-use/thesis.php
Description
Summary: Applying the pure Transformer model to sequences of image patches has achieved promising results, comparable to those of Convolutional Neural Networks (CNNs), the leading models for computer vision tasks. However, Vision Transformers need large volumes of data, which makes their behavior on smaller-scale datasets worth investigating. Despite rapid advances and a wide range of applications, the model still lags in the field of 3D images. In general, low-resolution images also pose problems for model learning. Hence, this study leverages the ability of Vision Transformers (ViTs) to capture global linkages and long-range interdependencies within an image, with the aim of achieving performance comparable to the benchmark established by the MedMNIST3D v2 family of datasets, which offers small-scale images at high and low resolutions. Previous studies have demonstrated many methods for treating 3D images, increasing interest in applying ViT models trained from scratch to that data modality, specifically on small-scale datasets. The binary classification experiment on the VesselMNIST3D dataset was implemented by treating the 3D image as a video in which the third dimension represents the number of frames. This introduces temporal information for the model to learn, enriching the relationships across spatial information at a higher dimension. The study provides a robustness experiment demonstrating the high performance of the vanilla Video Vision Transformer model, which scores on average 0.877 for Area Under the Curve (AUC) and 0.916 for Accuracy (ACC) across 30 independent experiments. The study further shows that pretraining the model at a higher resolution improves its learning capacity at a lower resolution, boosting the AUC score by 3%.
The study applies multiple levels of interpretation and caution to obtain sound inferential results, in order to make the Vision Transformer model competitive in its weak areas, in a critical domain in need of growth.
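The idea of reading a 3D image as a video, with the third dimension acting as the frame axis, can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `volume_to_tubelets`, the 28×28×28 input size, and the tubelet size `(4, 7, 7)` are assumptions chosen only for illustration. It shows how a volume could be cut into non-overlapping spatio-temporal blocks (tubelets) that a video Transformer would then embed as tokens.

```python
import numpy as np


def volume_to_tubelets(volume, tubelet=(4, 7, 7)):
    """Split a (frames, height, width) volume into flattened, non-overlapping tubelets.

    The tubelet size is a hypothetical choice for this sketch, not a value
    taken from the study.
    """
    frames, height, width = volume.shape
    t, p_h, p_w = tubelet
    if frames % t or height % p_h or width % p_w:
        raise ValueError("volume dimensions must be divisible by the tubelet size")
    # Expose the block grid: (nT, t, nH, p_h, nW, p_w)
    blocks = volume.reshape(frames // t, t, height // p_h, p_h, width // p_w, p_w)
    # Put grid indices first, then the voxels inside each block: (nT, nH, nW, t, p_h, p_w)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)
    # One row per tubelet, one column per voxel in the tubelet
    return blocks.reshape(-1, t * p_h * p_w)


# Assumed input: one single-channel 28x28x28 volume, first axis read as frames.
volume = np.arange(28 * 28 * 28, dtype=np.float32).reshape(28, 28, 28)
tokens = volume_to_tubelets(volume)
print(tokens.shape)  # (112, 196): 7 x 4 x 4 tubelets, each holding 4 x 7 x 7 voxels
```

Each row of `tokens` would then pass through a learned linear projection and receive a positional embedding before entering the Transformer encoder; those steps are omitted here.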