Artificial Intelligence High-Resolution Image Text Detection
Issue Date: 2023
Publisher: FPTU Hà Nội
Abstract:
Many scene text detection datasets have emerged alongside the progress of deep learning. These datasets feature high-resolution images containing small text instances, establishing a growing trend in the field. A conventional way to handle small text in such images is to downsize the image, but this blurs the text and degrades detection performance. The alternative, training large models on enlarged input scales, demands substantial GPU memory and long training times. In this work, we introduce "TextFocus," an algorithm designed to exploit a multi-scale training strategy efficiently. Instead of scrutinizing every pixel across an image pyramid, TextFocus samples context regions that enclose ground-truth text instances, referred to as "chips," and then detects all text regions within each sampled chip. The detections gathered from the chips are post-processed to produce the final text detection results. The strength of TextFocus lies in converting large image samples, up to 4000x4000 pixels, into low-resolution 640x640 chips. This both speeds up training and permits larger batch sizes, up to a batch size of 50 on a single GPU, even under conventional scaling.
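The chip-sampling idea described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes ground-truth boxes are given as `[x1, y1, x2, y2]` arrays and simply crops a fixed-size window centred on each text instance, clamped to the image bounds, keeping the crop offset so detections can later be mapped back to full-image coordinates.

```python
import numpy as np

def sample_chips(image, text_boxes, chip_size=640):
    """Crop fixed-size 'chips' around ground-truth text boxes.

    image: H x W (or H x W x C) array, e.g. a 4000x4000 sample.
    text_boxes: (N, 4) array of [x1, y1, x2, y2] ground-truth boxes.
    Returns a list of (chip, (left, top)) pairs; the offset is kept
    so per-chip detections can be mapped back to image coordinates.
    """
    h, w = image.shape[:2]
    chips = []
    for x1, y1, x2, y2 in text_boxes:
        # centre the chip on the text instance, clamped to the image bounds
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        left = int(np.clip(cx - chip_size / 2, 0, max(w - chip_size, 0)))
        top = int(np.clip(cy - chip_size / 2, 0, max(h - chip_size, 0)))
        chip = image[top:top + chip_size, left:left + chip_size]
        chips.append((chip, (left, top)))
    return chips
```

A real chip sampler would also merge nearby instances into shared chips and add background chips; this sketch only shows the core crop-and-offset mechanics.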
While conventional wisdom holds that results improve as training resolution grows, our findings deviate from this paradigm: our experiments show that training at high resolution does not necessarily yield the best performance. Our implementation uses a ResNet-18 backbone with a segmentation-style head. It achieves an F1 score of 0.828 on the SCUT-CTW1500 dataset [1] and 0.611 on the Large CTW dataset [2], while running in real time at an acceptable frames-per-second (FPS) rate.
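The post-processing step that accumulates per-chip detections into final full-image results can be illustrated with a simple sketch. This is an assumption-laden stand-in for the paper's actual post-processing: it maps each chip-local box back to image coordinates using the chip's crop offset, then greedily drops near-duplicate boxes that adjacent, overlapping chips detected twice (the IoU threshold is an arbitrary choice here).

```python
def box_iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_chip_detections(chip_results, iou_thresh=0.5):
    """Map per-chip boxes to full-image coordinates and drop duplicates.

    chip_results: list of (boxes, (left, top)) pairs, where boxes is a
    list of [x1, y1, x2, y2] in chip-local coordinates and (left, top)
    is the chip's crop offset in the original image.
    """
    merged = []
    for boxes, (left, top) in chip_results:
        for x1, y1, x2, y2 in boxes:
            merged.append([x1 + left, y1 + top, x2 + left, y2 + top])
    # greedy suppression of overlapping duplicates from adjacent chips
    kept = []
    for box in merged:
        if all(box_iou(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return kept
```

In practice, a detector with a segmentation-style head would emit polygons or masks rather than plain boxes, and the merge step would fuse them accordingly; the offset-then-deduplicate pattern shown here is the common core.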