This paper presents a unified framework for compressing large language and vision transformer models using Recursive Knowledge Distillation (RKD), QLoRA, and pruning. Our experiments on the SST-2 and Beans datasets show that it is possible to achieve up to 10× model size reduction with only a minor drop in accuracy. The study benchmarks Straightforward, Successive, and Multi-Agent distillation techniques and applies quantization and magnitude-based pruning post-distillation to produce highly efficient models suitable for real-world deployment.
Introduction
1. Background and Motivation
Large pre-trained models like BERT (NLP) and ViT (Vision) offer high performance but are too large and computationally expensive for edge devices. Researchers have developed model compression techniques such as:
Knowledge Distillation (KD)
Quantization
Low-Rank Adaptation (LoRA)
Pruning
However, their combined use—especially across both NLP and Vision domains—is underexplored. This project proposes a hybrid optimization pipeline combining Recursive KD (RKD), QLoRA, and Pruning.
2. Problem Statement
Existing methods focus on either NLP or Vision—not both.
Limited analysis exists on multi-agent or recursive KD strategies, especially in combination with quantization and pruning.
There’s no clear consensus on the best combination of these techniques to balance compression and accuracy.
3. Objectives and Scope
The project proposes a three-part compression pipeline:
Recursive & Multi-Agent KD
QLoRA (4-bit Quantized LoRA)
Magnitude-based pruning
It evaluates:
BERT models on the SST-2 dataset (sentiment analysis)
ViT models on the Beans dataset (image classification)
Three KD strategies: Straightforward, Successive, Multi-Agent
Impact of quantization and pruning on model performance
Goal: Create a generalized optimization framework for deploying lightweight, high-performance models across NLP and Vision tasks.
4. Literature Review
KD, introduced by Hinton et al., helps smaller models mimic large ones.
Variants include successive and multi-teacher KD.
BERT has inspired smaller models like DistilBERT and TinyBERT.
ViT models are large but have been optimized via DeiT and others.
Quantization and QLoRA reduce memory footprint.
Pruning removes low-importance weights to improve inference time.
Gaps identified:
Lack of unified pipelines combining KD, QLoRA, and pruning.
Insufficient evaluation of multi-agent KD across domains.
5. Methodology
Datasets:
NLP: SST-2 from GLUE
Vision: Beans dataset
Models:
BERT: base, medium, small, mini
ViT: 12, 9, 6, 3 encoder layers
Pipeline Stages (illustrative code sketches for each stage follow this list):
Recursive Knowledge Distillation: three KD strategies (Straightforward, Successive, Multi-Agent)
LoRA-based fine-tuning: trains only lightweight adapter matrices
QLoRA: 4-bit quantization using the NF4 data type
Magnitude-Based Pruning: 30% sparsity via L1 magnitude thresholding
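To make the distillation stage concrete, the sketch below shows a standard soft-target KD loss in PyTorch. It is a minimal illustration rather than the exact loss used in our experiments; the temperature and alpha values are placeholder assumptions, and multi-agent (multi-teacher) fusion is modeled here by simply averaging the teachers' softened distributions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels,
                      temperature=4.0, alpha=0.5):
    """Soft-target KD loss (Hinton-style). `teacher_logits_list` holds one
    logits tensor per teacher; averaging their softened distributions is one
    simple way to realize multi-agent fusion. Temperature and alpha are
    illustrative values, not the settings reported in this paper."""
    # Fuse teachers by averaging their softened probability distributions
    # (with a single teacher this reduces to plain KD).
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    # KL divergence between the softened student and fused teacher distributions,
    # scaled by T^2 as is conventional for soft-target distillation.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * (temperature ** 2)

    # Ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

Straightforward KD applies this loss once from the largest teacher to the target student; Successive KD applies it repeatedly along a chain of progressively smaller models; the Multi-Agent strategy passes several teachers at once so their distributions are fused as above.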
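The QLoRA stage can be sketched with the Hugging Face transformers and peft libraries (bitsandbytes provides the 4-bit NF4 kernels). The checkpoint name, LoRA rank, and target modules below are illustrative assumptions, not the exact configuration used in our runs.

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# NF4 4-bit quantization of the (distilled) student's base weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# "prajjwal1/bert-mini" is an illustrative student checkpoint, not necessarily
# the exact model used in the experiments; num_labels=2 matches SST-2.
model = AutoModelForSequenceClassification.from_pretrained(
    "prajjwal1/bert-mini", num_labels=2, quantization_config=bnb_config
)

# LoRA: freeze the quantized base weights and train only low-rank adapters.
lora_config = LoraConfig(
    r=8,                                 # adapter rank (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],   # BERT attention projections
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only the adapter matrices are trainable
```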
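For the pruning stage, a minimal sketch using torch.nn.utils.prune is shown below. It assumes pruning is applied to full-precision linear weights (this report does not pin down the exact ordering relative to quantization), and the 30% amount matches the sparsity level listed above.

```python
import torch
from torch.nn.utils import prune

def apply_magnitude_pruning(model: torch.nn.Module, amount: float = 0.3) -> None:
    """Zero out the `amount` fraction of smallest-magnitude (L1) weights in
    every linear layer, then make the resulting sparsity permanent."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # bake the zero mask into the weights

# Usage on a hypothetical distilled student:
# apply_magnitude_pruning(student_model, amount=0.3)
```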
Evaluation Metrics (a small helper for the size-related metrics is sketched after this list):
Accuracy
Model size
Compression ratio
Performance retention
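The size-related metrics follow directly from parameter counts and accuracies; the helper below sketches the definitions assumed here (compression ratio as teacher parameters over student parameters, performance retention as student accuracy over teacher accuracy).

```python
def count_parameters(model) -> int:
    """Total number of parameters in a PyTorch model."""
    return sum(p.numel() for p in model.parameters())

def compression_metrics(teacher_params: int, student_params: int,
                        teacher_acc: float, student_acc: float) -> dict:
    """Compression ratio and performance retention, as defined above."""
    return {
        "compression_ratio": teacher_params / student_params,
        "performance_retention": student_acc / teacher_acc,
    }

# Example with the BERT numbers reported in Section 6 (110M -> 11M, 91.7% -> 85.1%):
# compression_metrics(110e6, 11e6, 0.917, 0.851)
# -> {'compression_ratio': 10.0, 'performance_retention': ~0.93}
```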
6. Experimental Results
NLP (BERT on SST-2):
Base BERT: 91.7% accuracy, 110M parameters
Mini-BERT (after full compression): 85.1% accuracy, 11M parameters
Compression: 10× smaller with only a ~6.6-point drop in accuracy
Vision (ViT on Beans):
ViT-3: Improved from 32.8% (raw) to 82.03% via KD (Successive)
Multi-Agent KD helped ViT-6 slightly but hurt ViT-3
ViT-3 used only ~34% of ViT-9’s parameters with near-equal accuracy
7. Analysis and Discussion
Key Takeaways:
Recursive KD significantly boosts performance of compressed models.
QLoRA + pruning provides efficient inference with minor accuracy loss (~2.2% for Mini-BERT).
Successive KD generally outperforms other strategies.
Multi-agent KD helps only when the student model has sufficient capacity.
Challenges:
Sensitivity to hyperparameters
Computationally expensive training (due to large teacher models)
Results may not generalize beyond tested datasets (SST-2, Beans)
8. Future Directions
Cross-domain KD: Train on NLP, apply to Vision or vice versa
Model-agnostic pruning: e.g., pruning neurons or attention heads
Hardware-aware optimization: Benchmark energy and speed on edge devices
Conclusion
This project demonstrates the effectiveness of combining Recursive Knowledge Distillation (RKD) with QLoRA and pruning to compress both NLP and Vision Transformer models without severely compromising accuracy. On the SST-2 sentiment classification task, we compressed BERT from 110M to 11M parameters with only a 6.6-point drop in accuracy. In the vision domain, RKD lifted ViT-3 by nearly 50 percentage points over its raw baseline. These results confirm that student models can retain most of their teachers' performance while being significantly more efficient and deployable.
This work proposes and validates a cross-modality Recursive Knowledge Distillation (RKD) pipeline applicable to both natural language processing (BERT) and computer vision (ViT) models. We introduce multi-agent teacher fusion and successive distillation strategies, which significantly improve accuracy in extremely compressed models.
To enable real-world deployment, we integrate QLoRA-based 4-bit quantization and 30% pruning post-distillation, yielding highly efficient models with minimal performance loss. Finally, we conduct quantitative benchmarking across all KD strategies, highlighting accuracy-compression trade-offs using parameter-vs-accuracy plots for both modalities.
Future work can explore extending the RKD + QLoRA framework to larger and more diverse benchmarks such as GLUE, ImageNet, or COCO to validate generalizability. Investigating the impact of heterogeneous teacher architectures and cross-modal knowledge transfer could further improve student generalization. Incorporating AutoML techniques to automate the tuning of distillation hyperparameters, such as the temperature and the loss-weighting coefficients, using methods like reinforcement learning or Bayesian optimization presents another promising direction. Finally, testing pruned and quantized models on edge devices would offer valuable insights into real-world metrics including latency, energy consumption, and memory efficiency.
References
[1] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient Finetuning of Quantized LLMs,” arXiv preprint arXiv:2305.14314, 2023. [Online]. Available: https://arxiv.org/abs/2305.14314
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of NAACL-HLT, 2019, pp. 4171–4186. [Online]. Available: https://arxiv.org/abs/1810.04805
[3] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge Distillation: A Survey,” International Journal of Computer Vision, vol. 129, pp. 1789–1819, 2021. [Online]. Available: https://doi.org/10.1007/s11263-021-01453-z
[4] A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in Proceedings of the International Conference on Learning Representations (ICLR), 2021. [Online]. Available: https://arxiv.org/abs/2010.11929
[5] H. Touvron, M. Cord, et al., “Training Data-Efficient Image Transformers & Distillation Through Attention,” in Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR vol. 139, 2021, pp. 10376–10386.
[6] X. Jiao et al., “TinyBERT: Distilling BERT for Natural Language Understanding,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 45–56.
[7] P. Sollich and A. Krogh, “Learning with Ensembles: A Theoretical Analysis,” in Advances in Neural Information Processing Systems, 1996, pp. 190–196.
[8] D. Blalock, J. J. Gonzalez Ortiz, J. Frankle, and J. Guttag, “What Is the State of Neural Network Pruning?,” arXiv preprint arXiv:2003.03033, 2020.
[9] Z. Sun et al., “Patient Knowledge Distillation for BERT-Based Models,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 773–781.
[10] E. J. Hu, Y. Shen, et al., “LoRA: Low-Rank Adaptation of Large Language Models,” arXiv preprint arXiv:2106.09685, 2021.