This survey reviews advances and deployment strategies for assistive vision systems that combine continuous on-device object detection with selective, high-fidelity scene segmentation and succinct audio narration. We synthesize evidence concerning mobile-optimized detectors (including the YOLO family and recent YOLOv8 variants), promptable segmentation foundation models (SAM and SAM2), vision-language approaches for narration, and hybrid on-device/edge/cloud architectures that trade off latency, privacy, and capability. We discuss datasets captured by visually impaired users (VizWiz family, ORBIT), propose evaluation metrics beyond classical mAP (latency, answerability, safety-critical misses), and identify open problems and near-term research directions for making hybrid detection–segmentation–TTS pipelines practical.
Introduction
This survey examines the design of assistive vision systems for visually impaired users, focusing on how to balance real-time safety alerts with richer scene understanding. It highlights the need for hybrid architectures in which lightweight models (such as YOLOv8) run continuously on-device for fast hazard detection, while heavier models (such as SAM2 and vision-language models) are invoked selectively for detailed segmentation and explanation.
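The split between an always-on detector and selectively invoked heavy models can be illustrated with a small routing function. This is only a minimal sketch under stated assumptions: the hazard labels, the thresholds, the use of bounding-box area as a proximity proxy, and the names `Detection` and `route_frame` are all hypothetical and are not taken from any system covered by this survey.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    label: str
    confidence: float
    area_fraction: float  # box area / frame area, used here as a rough proximity proxy


# Hypothetical values chosen for illustration only.
HAZARD_LABELS = {"stairs", "car", "bicycle"}
ALERT_CONF = 0.5   # minimum confidence before a hazard triggers an alert
NEAR_AREA = 0.15   # a hazard covering this much of the frame is treated as close


def route_frame(detections: List[Detection], user_requested_detail: bool = False) -> List[str]:
    """Decide which tiers to run for one frame of lightweight detector output."""
    actions = []
    hazards = [
        d for d in detections
        if d.label in HAZARD_LABELS and d.confidence >= ALERT_CONF
    ]
    if hazards:
        actions.append("alert")  # fast on-device path
    # The heavy segmentation path runs only on request or when a hazard looks close.
    if user_requested_detail or any(d.area_fraction >= NEAR_AREA for d in hazards):
        actions.append("segment")
    # Vision-language description is strictly on demand.
    if user_requested_detail:
        actions.append("describe")
    return actions
```

In this sketch a confident, close hazard yields both an alert and a segmentation request, while a non-hazard object yields no action at all, so the heavy models stay idle for most frames.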
It explains that assistive systems must be designed around user needs such as low latency, low cognitive load, privacy, and customizable feedback (e.g., prioritizing objects like stairs or people). Real-world datasets like VizWiz and ORBIT are emphasized because they reflect the noisy, imperfect conditions of images captured by visually impaired users, unlike standard datasets such as COCO.
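One way to picture customizable feedback with low cognitive load is to rank detections by a user-defined priority and announce only the top few. The following is a rough sketch under assumptions: the priority weights, the confidence floor, the cap on announcements, and the function name `select_announcements` are hypothetical illustrations rather than a method proposed in the surveyed work.

```python
from typing import Dict, List, Tuple


def select_announcements(
    detections: List[Tuple[str, float]],   # (label, confidence) pairs from the detector
    priorities: Dict[str, float],          # user-chosen weight per label; missing labels are ignored
    max_items: int = 2,                    # cap on spoken items, to limit cognitive load
    min_conf: float = 0.4,                 # hypothetical confidence floor
) -> List[str]:
    """Return at most max_items distinct labels, highest priority-weighted confidence first."""
    scored = [
        (priorities[label] * conf, label)
        for label, conf in detections
        if conf >= min_conf and priorities.get(label, 0.0) > 0
    ]
    scored.sort(key=lambda item: item[0], reverse=True)

    chosen: List[str] = []
    for _, label in scored:
        if label not in chosen:            # announce each object class once
            chosen.append(label)
        if len(chosen) == max_items:
            break
    return chosen
```

For example, with weights `{"stairs": 3.0, "person": 2.0, "chair": 0.5}`, a frame containing a person, stairs, and a chair would announce stairs first and the person second, and the chair would be dropped.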
The survey compares object detection approaches, noting that two-stage models are accurate but too slow for real-time assistive use, while single-stage models such as the YOLO family (especially YOLOv8) offer a better speed–accuracy trade-off for mobile deployment. However, small-object detection remains a challenge, often requiring architectural improvements or hybrid methods.
It also describes segmentation methods, including semantic, instance, and panoptic segmentation, and highlights SAM and SAM2 as powerful promptable models for detailed and temporally consistent segmentation, though too computationally heavy for continuous use.
Finally, it discusses vision-language models and text-to-speech systems, emphasizing that simple template-based speech is best for urgent alerts, while richer AI-generated descriptions should be used only on demand due to latency and hallucination risks.
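The tiered narration idea can be sketched as follows: urgent alerts come from a fixed template that needs no model inference, and a generated description is used only when one was explicitly requested and is available. The three-way direction split, the phrasing, and the function names are assumptions made for illustration only.

```python
from typing import Optional


def direction_phrase(x_center: float) -> str:
    """Map a normalized horizontal box center (0 = left edge, 1 = right edge) to a coarse phrase."""
    if x_center < 1 / 3:
        return "on your left"
    if x_center > 2 / 3:
        return "on your right"
    return "ahead"


def urgent_alert(label: str, x_center: float) -> str:
    """Template-based alert: deterministic, short, and free of generated text."""
    return f"{label.capitalize()} {direction_phrase(x_center)}."


def narrate(label: str, x_center: float, on_demand_caption: Optional[str] = None) -> str:
    """Prefer a requested caption when present; otherwise fall back to the template."""
    if on_demand_caption:
        return on_demand_caption
    return urgent_alert(label, x_center)
```

Because the template path never calls a language model, it cannot hallucinate and its latency is bounded by the detector alone; the string it returns would then be passed to whatever text-to-speech engine the device provides.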
Conclusion
Hybrid architectures that combine continuous on-device detection (YOLOv8n or similar) with selective, promptable segmentation (SAM2) and a tiered narration strategy offer a pragmatic path toward deployable assistive vision systems. Achieving field readiness requires careful trigger policies, energy-efficient scheduling, privacy-preserving offload mechanisms, and extensive user-centered evaluation to ensure practical utility.
This survey consolidates current progress in assistive vision and highlights how emerging detection, segmentation, and vision-language techniques can converge to build safer, more intelligent, and context-aware systems for visually impaired individuals.
References
[1] J. P. Bigham et al., “VizWiz: Nearly real-time answers to visual questions,” in Proc. UIST, 2010.
[2] D. Gurari et al., “VizWiz-Captions: Captioning images taken by people who are blind,” VizWiz Workshop / CVPR, 2020.
[3] D. B. Walker et al., “ORBIT: A dataset for few-shot personal object recognition,” 2021.
[4] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. ECCV, 2014.
[5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE TPAMI, vol. 39, no. 6, pp. 1137–1149, 2017.
[6] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. ICCV, 2017.
[7] T.-Y. Lin et al., “Focal loss for dense object detection,” IEEE TPAMI, vol. 42, no. 2, pp. 318–327, 2020.
[8] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv:1804.02767, 2018.
[9] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. CVPR, 2017.
[11] G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLO,” 2023. (Ultralytics YOLOv8 resources and documentation uploaded.)
[12] M. Talib et al., “YOLOv8-CAB: Improved YOLOv8 for real-time object detection,” Karbala Int. J. Mod. Sci., 2024.
[13] T.-W. Sung et al., “Improvement of YOLOv8 object detection based on lightweight neck model for complex images,” Image Anal. Stereol., 2025.
[14] A. Kirillov et al., “Segment Anything,” in Proc. ICCV, 2023.
[15] N. Ravi et al., “SAM 2: Segment anything in images and videos,” arXiv:2408.00714, 2024.
[16] Y. Yamagishi et al., “SAM2 for zero-shot 3D segmentation,” JMIR AI, 2025.
[17] D. Bolya et al., “YOLACT: Real-time instance segmentation,” in Proc. ICCV, 2019.
[18] J. Li et al., “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proc. ICML, 2022.
[19] A. Radford et al., “CLIP: Learning transferable visual models from natural language supervision,” in Proc. ICML, 2021.
[20] WalkVLM and related video-VLM works (uploaded arXiv entries, 2024).
[21] G. I. Okolo et al., “Assistive systems for visually impaired persons: Challenges and opportunities for navigation assistance,” Sensors, 2024.
[22] P. Pfreundschuh et al., “Sight Guide: A wearable assistive perception and navigation system,” arXiv:2506.02676, 2025.
[23] M. Talib et al., “Leveraging assistive technology for visually impaired people through optimal deep transfer learning based object detection,” Sci. Rep., 2025.
[24] B.-H. Le et al., “Leveraging large vision-language models for visual question answering in VizWiz Grand Challenge,” CVPR Workshop, 2024.
[25] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017.
[26] D. Ahmetović, C. Gleason, C. Ruan, K. M. Kitani, H. Takagi, and C. Asakawa, “NavCog: A navigational cognitive assistant for the blind,” in Proc. MobileHCI, Florence, Italy, 2016, pp. 1–10, doi:10.1145/2935334.2935361.