Deep Learning Approaches To Image Classification: Exploring The Future Of Visual Data Analysis
Abstract
Over the past decade, deep learning has emerged as a revolutionary technology for the analysis of visual data, particularly images. This master's thesis focuses on deep learning approaches to image classification, a key task in many applications of visual data analysis. A state-of-the-art deep learning model, the Vision Transformer (ViT), is explored for image classification. ViT is trained using transfer-learning techniques on a new dataset of over 350,000 photographs of European buildings in eight cities, captured by a drone-mounted camera during two separate flights. Initial results demonstrate that fine-tuning models pre-trained on large datasets such as JFT-300M achieves performance competitive with models trained from scratch on smaller datasets, and that ViT outperforms convolutional neural networks on drone-captured images. Further, the prospects of deep learning for image classification are discussed, highlighting the potential impact of new research directions in vision transformer architectures (e.g., Swin Transformer, CrossViT, T2T-ViT) and new training techniques (e.g., vision-language pre-training, multimodal input).
The exponential increase in data generated by cameras, mobile devices, and Internet-of-Things (IoT) sensors has escalated the need for automated processing and analysis of visual data. Furthermore, images and video frames are a popular medium for data collection across various domains, including commercial and industrial settings. Image classification, that is, finding the most relevant label for a given photograph, underpins many of these applications. Popular examples include multimedia search engines, mobile applications that navigate to points of interest (POI), and anomaly detection in industrial camera feeds.
As a consequence, many datasets containing millions of photographs, collected and labeled by city, object, or scene, have been assembled. Deep neural networks trained end-to-end directly on pixels have become the state of the art in image classification. More recently, architectures based solely on attention mechanisms, eschewing convolutions, have challenged the long-standing dominance of convolutional neural networks.
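To make the attention-only idea concrete, the following is a minimal sketch of how an image can be cut into patch tokens and mixed by scaled dot-product self-attention without any convolution. It is an illustration under stated assumptions, not the model used in the thesis: the function names `image_to_patches` and `self_attention` are hypothetical, the image is a toy grayscale grid, and the learned query, key, and value projections, multiple heads, and positional embeddings of a real Vision Transformer are omitted (the tokens themselves serve as queries, keys, and values).

```python
import math


def image_to_patches(image, patch):
    """Split a 2D grayscale image (list of rows) into flattened patch x patch tokens."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tokens.append([image[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption for illustration: identity projections, so each token acts
    as its own query, key, and value. A real model learns these projections.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all tokens.
        outputs.append([sum(w * value[j] for w, value in zip(weights, tokens))
                        for j in range(dim)])
    return outputs


# Toy 4x4 "image" split into four 2x2 patches, then mixed by attention.
toy_image = [[0.0, 0.1, 0.2, 0.3],
             [0.4, 0.5, 0.6, 0.7],
             [0.8, 0.9, 1.0, 1.1],
             [1.2, 1.3, 1.4, 1.5]]
patch_tokens = image_to_patches(toy_image, 2)
mixed_tokens = self_attention(patch_tokens)
```

Every output token is a weighted average of all patch tokens, so each patch can draw on information from the whole image in a single step, whereas a convolution only sees a local neighborhood.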