This paper presents Qonstraint 5.9, a model designed for efficient image classification under resource constraints, evaluated on TinyImageNet. The approach does not lean on common shortcuts such as pretraining or aggressive input resizing; instead, it is built from scratch to showcase the strengths of a convolutional-transformer hybrid. Its foundation is the QonvViT block, which weaves together convolutional token mixing inspired by MobileViT, adaptive feed-forward processing, and optimized downsampling. The model learns both local and global features, capturing broad context while still attending to fine-grained image detail, all under tight resource constraints and without pretrained weights or layers; how this is achieved is explained in the sections that follow. What sets Qonstraint apart is not merely that it is dynamic, but how its parameters and behaviors are made dynamic: a reinforcement-learning-inspired, confidence-aware training regime modulates regularization intensity and feature scaling in real time without adding meaningful overhead, so the model allocates compute where and when it is needed. A classifier head follows, an ensemble of global, local, and auxiliary branches whose outputs are fused according to confidence. Qonstraint is an adaptive solution that balances these factors to address high-performance image classification under limited resource availability. The outcome is a model that does not merely sit as a head on top of established backbones but serves as a foundation model in its own right, combining several state-of-the-art techniques and setting a new standard for from-scratch learning in constrained environments.
Introduction
Qonstraint 5.9 is a resource-efficient hybrid vision model developed to tackle the challenges of image classification on small-scale datasets (e.g., TinyImageNet, CIFAR-100) without relying on pretraining, image resizing, or large compute resources. It uniquely combines convolutional layers and transformer-like modules to achieve strong accuracy and generalization—from scratch.
Motivation & Problem Statement
Current Limitations: State-of-the-art (SOTA) models like ViTs and CNN-ViT hybrids perform well on large datasets but rely heavily on:
Pretrained weights
Large-scale training compute
Extensive data augmentations and resizing
Real-World Challenge: These models aren't practical for scenarios with limited data, low computational power (e.g., mobile devices, basic GPUs), or domain-specific data.
Objective: Develop a model that is accurate, memory-efficient, fast-converging, and independent of pretraining, built specifically for low-resource environments.
Core Contributions
Qonstraint 5.9 introduces a suite of novel architectural and training innovations:
1. QonvViT Block
A custom block that fuses:
Convolutional local processing for spatial patterns
Token mixing and transformer traits for global context
Features depth-wise convolutions, channel MLPs, and adaptive regularization
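The paper's code for this block is not reproduced here; the following is a minimal PyTorch sketch of how such a unit could be organized, assuming a MobileViT-style split between depth-wise convolutional token mixing and a channel MLP with learnable residual scaling. All names and hyperparameters (QonvViTBlock, mlp_ratio, the dropout stand-in for adaptive regularization) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class QonvViTBlock(nn.Module):
    """Illustrative hybrid block: depth-wise conv token mixing + channel MLP.

    A sketch of the described design, not the authors' code.
    """
    def __init__(self, dim: int, mlp_ratio: float = 2.0, drop_rate: float = 0.0):
        super().__init__()
        # Local token mixing via depth-wise convolution (MobileViT-inspired)
        self.mix_norm = nn.BatchNorm2d(dim)
        self.token_mix = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Channel MLP implemented with 1x1 convolutions
        self.mlp_norm = nn.BatchNorm2d(dim)
        hidden = int(dim * mlp_ratio)
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1),
        )
        # Learnable residual scaling; a simple dropout stands in for the
        # adaptive regularization described in the paper.
        self.gamma1 = nn.Parameter(torch.ones(1, dim, 1, 1) * 1e-2)
        self.gamma2 = nn.Parameter(torch.ones(1, dim, 1, 1) * 1e-2)
        self.reg = nn.Dropout(drop_rate)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.reg(self.gamma1 * self.token_mix(self.mix_norm(x)))
        x = x + self.reg(self.gamma2 * self.channel_mlp(self.mlp_norm(x)))
        return x
```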
2. Confidence-Based Reinforcement Training
Model dynamically adjusts:
Hyperparameters
Regularization intensity
Training focus (on hard vs. easy samples)
Based on real-time confidence trends during training
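As a rough illustration of such a confidence-driven schedule (the exact update rules are not given here), the sketch below tracks a running mean of per-batch softmax confidence and nudges a regularization strength up when the model grows over-confident and down when confidence drops. The thresholds, step sizes, and class name are placeholders, not the paper's values.

```python
import torch
import torch.nn.functional as F

class ConfidenceScheduler:
    """Toy confidence-aware controller: adapts a regularization strength
    from the running mean of per-batch top-1 softmax confidence."""

    def __init__(self, init_reg: float = 0.1, momentum: float = 0.9):
        self.reg_strength = init_reg
        self.momentum = momentum
        self.running_conf = 0.5

    def update(self, logits: torch.Tensor) -> float:
        # Mean top-1 probability of the batch as the confidence signal
        conf = F.softmax(logits.detach(), dim=-1).max(dim=-1).values.mean().item()
        self.running_conf = self.momentum * self.running_conf + (1 - self.momentum) * conf
        # Over-confident -> regularize harder; under-confident -> ease off
        if self.running_conf > 0.85:
            self.reg_strength = min(self.reg_strength + 0.01, 0.5)
        elif self.running_conf < 0.55:
            self.reg_strength = max(self.reg_strength - 0.01, 0.0)
        return self.reg_strength
```

In training, the returned value could be mapped onto a DropPath probability, a MixUp alpha, or the loss weight placed on hard samples, which is the spirit of the regime described above.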
3. Multi-Path Classifier Head
Triple-branch ensemble-style head:
Transformer branch (global reasoning)
Convolutional branch (local detail)
Auxiliary branch (stabilization)
Branch weightings are dynamically adjusted based on confidence
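One way such a confidence-weighted three-branch head could be fused is sketched below; the branch definitions are simplified stand-ins (a pooled "global" branch, a convolutional "local" branch, and a lightweight auxiliary branch), and the learned gate is an assumption rather than the authors' exact mechanism.

```python
import torch
import torch.nn as nn

class TriplePathHead(nn.Module):
    """Illustrative three-branch classifier with confidence-style gating."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        # Global branch: pooled features through a linear classifier
        self.global_branch = nn.Linear(dim, num_classes)
        # Local branch: extra depth-wise conv before pooling
        self.local_conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.local_branch = nn.Linear(dim, num_classes)
        # Auxiliary branch: small bottleneck for stabilization
        self.aux_branch = nn.Sequential(nn.Linear(dim, dim // 2), nn.GELU(),
                                        nn.Linear(dim // 2, num_classes))
        # Gate producing per-branch fusion weights from pooled features
        self.gate = nn.Linear(dim, 3)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = self.pool(x).flatten(1)                        # (B, dim)
        local = self.pool(self.local_conv(x)).flatten(1)        # (B, dim)
        logits = torch.stack([self.global_branch(pooled),
                              self.local_branch(local),
                              self.aux_branch(pooled)], dim=1)  # (B, 3, C)
        weights = torch.softmax(self.gate(pooled), dim=-1)      # (B, 3)
        return (weights.unsqueeze(-1) * logits).sum(dim=1)      # (B, C)
```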
4. Advanced Augmentation Pipeline
Toggles between strong and weak augmentation regimes over the course of training
Uses CutMix, MixUp, and batch-level augmentation via custom collate functions
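Batch-level augmentation via a custom collate function can be sketched as follows, using MixUp only for brevity (CutMix would follow the same pattern by mixing image patches instead of whole images); the alpha value and the enabled flag standing in for the strong/weak toggle are placeholders.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, default_collate

def mixup_collate(batch, alpha: float = 0.2, enabled: bool = True):
    """Custom collate_fn: applies MixUp across the assembled batch.
    Returns mixed images, both label sets, and the mixing coefficient."""
    images, labels = default_collate(batch)
    if not enabled:  # e.g., toggled off during a weak-augmentation phase
        return images, labels, labels, 1.0
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, labels, labels[perm], lam

# Usage sketch:
# loader = DataLoader(train_set, batch_size=128, shuffle=True,
#                     collate_fn=mixup_collate)
# loss = lam * criterion(out, y_a) + (1 - lam) * criterion(out, y_b)
```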
5. Resource-Efficient Design
Trained entirely from scratch
No image resizing or pretrained weights
Can run on Kaggle GPU (e.g., NVIDIA P100)
Experimental Foundation
TinyImageNet was chosen for its small size, class granularity, and real-world constraints.
Previous top models plateau under 55% accuracy when trained from scratch.
Qonstraint 5.9 exceeds these limits without external supervision, demonstrating its standalone strength.
Related Work
Qonstraint 5.9 builds on and improves aspects of:
MobileViT: lightweight hybrid design for edge devices
DeiT, T2T-ViT, SHViT, StructViT: transformer models optimized for efficiency
Neighborhood attention and squeeze-excite modules: for localized and adaptive attention
Medical image classification & ensemble learning: inspiration for triple-path heads
Architecture Pipeline
Convolutional Stem: Initial layers extract low-level features with downsampling and normalization.
QonvViT Blocks: Modular hybrid units perform local and global mixing, reinforced by:
Residual scaling
Channel MLPs
Token Excite modules
Confidence-based DropPath and distributional clamping
Feature Fusion: Combines high- and low-level features using attention mechanisms.
Triple-path Classifier Head: Dynamically weights the three branches for each sample based on confidence.
Training Setup: Reproducible, single-GPU training with deterministic seeds.
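As a concrete illustration of the reproducible, single-GPU setup mentioned above, a typical PyTorch seeding routine looks like the following; the specific seed value and flags are generic practice, not taken from the paper.

```python
import random
import numpy as np
import torch

def set_deterministic(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch, and request deterministic cuDNN kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_deterministic(42)
```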
Key Takeaways
No Pretraining Needed: Strong from-scratch performance
Efficient & Scalable: Deployable on mid-range GPUs
Accuracy + Practicality: Breaks the trade-off between model performance and resource constraints
Adaptive Learning: Real-time reinforcement and modular regularization make it robust and versatile
References
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, “Attention is All You Need,” Neural Information Processing Systems, 2017. https://arxiv.org/pdf/1706.03762.pdf
[2] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” International Conference on Learning Representations, 2021. https://arxiv.org/pdf/2010.11929.pdf
[3] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis E.H. Tay, Jiashi Feng, Shuicheng Yan, “Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet,” International Conference on Computer Vision, 2021. https://arxiv.org/pdf/2101.11986.pdf
[4] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou, “Training data-efficient image transformers & distillation through attention,” International Conference on Machine Learning, 2021. https://arxiv.org/pdf/2012.12877.pdf
[5] Kai Wang, Yifan Sun, Jian Liang, Chunjing Xu, “Learning Correlation Structures for Vision Transformers,” arXiv, 2024. https://arxiv.org/pdf/2404.03924.pdf
[6] Daehee Yun, Yong Man Ro, “SHViT: Single-Head Vision Transformer with Memory-Efficient Macro Design,” arXiv, 2024. https://arxiv.org/pdf/2401.16456.pdf
[7] Ali Hassani, Steven Walton, Hessam Bagherinezhad, Mohammad Rastegari, “Neighborhood Attention Transformer,” Conference on Computer Vision and Pattern Recognition, 2023. https://openaccess.thecvf.com/content/CVPR2023/papers/Hassani_Neighborhood_Attention_Transformer_CVPR_2023_paper.pdf
[8] Sachin Mehta, Mohammad Rastegari, “MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer,” International Conference on Learning Representations, 2022. https://arxiv.org/pdf/2110.02178.pdf
[9] Abhishek Abai, Rahul Rajmalwar, “DenseNet Models for Tiny ImageNet Classification,” arXiv, 2019. https://arxiv.org/pdf/1904.10429.pdf
[10] Stanford CS231n, “Tiny ImageNet Challenge Report,” 2017. http://cs231n.stanford.edu/reports/2017/pdfs/300.pdf
[11] Amir Irandoust, Hamed R. Tavakoli, Mohammad Sabokrou, “Training a Vision Transformer from Scratch in Less than 24 Hours with 1 GPU,” Neural Information Processing Systems Workshop, 2022. https://arxiv.org/pdf/2211.05187.pdf
[12] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, Lucas Beyer, “How to Train Your ViT? Data, Augmentation, and Regularization in Vision Transformers,” 2021. https://arxiv.org/pdf/2106.10270.pdf
[13] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, “DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification,” Neural Information Processing Systems, 2021. https://arxiv.org/pdf/2106.02034.pdf
[14] Zongwei Wang, Ioan A. Voiculescu, “Triple-View Feature Learning for Medical Image Segmentation,” Medical Image Computing and Computer Assisted Intervention, 2022. https://arxiv.org/pdf/2208.06303.pdf
[15] “Cancer-Cell Deep-Learning Classification by Integrating Spatial, Temporal, and Quantitative Information,” PubMed Central, 2021. https://pmc.ncbi.nlm.nih.gov/articles/PMC8699730/
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep Residual Learning for Image Recognition,” arXiv, 2015. https://arxiv.org/pdf/1512.03385.pdf
[17] Karen Simonyan, Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv, 2014. https://arxiv.org/pdf/1409.1556.pdf
[18] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, Kurt Keutzer, “SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and <0.5MB Model Size,” arXiv, 2016. https://arxiv.org/pdf/1602.07360.pdf
[19] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, “Going Deeper with Convolutions,” arXiv, 2015. https://arxiv.org/pdf/1409.4842.pdf
[20] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, Jian Sun, “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design,” arXiv, 2018. https://arxiv.org/pdf/1807.11164.pdf
[21] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” arXiv, 2018. https://arxiv.org/pdf/1801.04381.pdf
[22] Dongyoon Han, Jiwhan Kim, Junmo Kim, “Deep Pyramidal Residual Networks,” arXiv, 2016. https://arxiv.org/pdf/1610.02915.pdf
[23] Sergey Zagoruyko, Nikos Komodakis, “Wide Residual Networks,” arXiv, 2016. https://arxiv.org/pdf/1605.07146.pdf