Keywords:
Artificial Intelligence; Visio-linguistic; Computer Vision; Natural Language Processing; Prototype Learning; Visual Question Answering
Issue Date:
2023
Publisher:
FPTU Hà Nội
Abstract:
Research on Medical Visual Question Answering (Med-VQA) [1] has recently become increasingly popular. Given an image containing vital, clinically relevant information, Med-VQA aims to answer questions about it, helping physicians diagnose diseases and giving patients better insight into their illness. Med-VQA performs worse than general-domain VQA, partly because accurate data for typical modalities such as X-ray images is scarce, and partly because proposed models are complicated in both the image encoder and the text encoder yet still fail to deliver outstanding performance. To deal with the data limitation of Med-VQA, recent studies primarily refine the fusion module responsible for combining question features and image features, and provide models pre-trained on newly self-collected datasets, overlooking the effect of question and image history.
In this thesis, we introduce a visio-linguistic model whose architecture employs an Associative Memory Module that separately stores individual visual and linguistic experiences and their relationships to enhance context. Additionally, we introduce a Prototype Learning block that carries out stratified prototype learning on textual and visual embeddings using modern Hopfield layers. Our model seeks to acquire the most significant prototypes from the text and image embeddings, augmented with memory from the associative memory module, rather than directly learning concrete representations of joint features for the different meanings in text and image. The learned prototypes can then represent more complex semantics for producing the answer. On the VQA-RAD dataset, the proposed method achieves state-of-the-art performance with a notable accuracy improvement of 0.45%.
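To make the prototype-learning idea concrete, the following is a minimal sketch, not the thesis implementation: learnable prototype patterns for one modality stratum are retrieved with a modern-Hopfield-style softmax update over the input embeddings. The class name `PrototypeBlock`, the parameters `num_prototypes` and `beta`, and all dimensions are illustrative assumptions; the associative-memory augmentation described in the abstract is assumed to happen before the embeddings enter this block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeBlock(nn.Module):
    """Hopfield-style prototype retrieval for one modality (sketch only)."""

    def __init__(self, embed_dim: int, num_prototypes: int, beta: float = 1.0):
        super().__init__()
        # Learnable stored patterns (prototypes) for this stratum.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, embed_dim))
        self.beta = beta  # inverse temperature of the Hopfield update

    def forward(self, queries: torch.Tensor) -> torch.Tensor:
        # queries: (batch, seq_len, embed_dim) textual or visual embeddings,
        # optionally already augmented with associative-memory context.
        scores = self.beta * queries @ self.prototypes.t()   # (B, L, P)
        weights = F.softmax(scores, dim=-1)                  # retrieval weights
        return weights @ self.prototypes                     # (B, L, D)


if __name__ == "__main__":
    text_block = PrototypeBlock(embed_dim=256, num_prototypes=32)
    image_block = PrototypeBlock(embed_dim=256, num_prototypes=32)
    text_emb = torch.randn(2, 20, 256)   # dummy question-token embeddings
    image_emb = torch.randn(2, 49, 256)  # dummy image-patch embeddings
    fused = text_block(text_emb).mean(1) + image_block(image_emb).mean(1)
    print(fused.shape)  # torch.Size([2, 256])
```

In this sketch the fusion is a simple sum of pooled prototype retrievals; the actual model's fusion of the two strata and its memory module are more elaborate than shown here.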