Monocular metric depth estimation is important for autonomous driving, road-scene understanding, navigation, and 3D perception from low-cost cameras. Recent foundation models such as Depth Anything V2, UniDepth, UniDepthV2, and Metric3D v2 have improved zero-shot generalization and metric depth prediction, while depth-specific augmentation methods such as CutDepth have shown that training strategies can improve depth learning without destroying object boundaries. However, dashcam imagery contains a difficult combination of road-plane geometry, horizon and vanishing-point structure, sky or infinity regions, and thin foreground objects such as poles, traffic signs, traffic lights, logos, wires, and distant roadside structures. This paper presents SkyRoad-CutDepth, a theoretical and qualitative framework for sky-aware and road-geometric pretext learning in metric monocular depth estimation. The work is motivated by qualitative outputs from a Depth Anything V2 model trained on a custom dashcam dataset with canonical-space transformations. These outputs show that global road depth can become more consistent after custom training, but finite foreground objects located near the sky region or vanishing point can still be suppressed, partially missed, or assigned invalid depth. The proposed framework combines road-region priors, sky-region validity, sky-adjacent foreground preservation, boundary-aware losses, and region-guided CutDepth augmentation. By integrating these cues into a unified pretext-learning formulation, the framework aims to improve the geometric reliability of metric depth estimation in regions where current foundation models often produce over-smoothed, invalid, or semantically inconsistent depth predictions.
Introduction
Monocular depth estimation from a single RGB image is challenging because 3D structure must be inferred from 2D cues alone. Accurate depth is especially critical in autonomous driving, where the distance to vehicles, pedestrians, road signs, and other objects must be estimated reliably. Although modern deep learning approaches (transformers, diffusion models, foundation models, and pseudo-labeling techniques) have greatly improved depth prediction, they still struggle with scale ambiguity, sparse supervision, and domain shift.
A key issue in road-scene depth estimation is that models often fail to preserve small but safety-critical objects such as poles, traffic lights, and signs, especially when they appear against the sky or near the horizon. Because these objects occupy few pixels, global depth metrics can look good while such local failures go unnoticed.
To address this, this paper introduces SkyRoad-CutDepth, a theoretical framework built on the idea that different image regions should be treated differently: road areas, sky, sky-adjacent regions, and vanishing-point regions each require separate geometric handling. The goal is to preserve finite-depth objects that appear against the sky while still modeling pure sky as an “infinite” or undefined-depth area.
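One way to make this region split concrete is sketched below, assuming a per-pixel semantic label map is available. The class ids (SKY, ROAD, OTHER), the helper names, and the dilation radius are hypothetical choices for illustration and are not taken from the paper. Sky-adjacent foreground is approximated here as any pixel that is neither sky nor road and lies within a few pixels of sky.

```python
import numpy as np

# Hypothetical class ids for illustration; a real label map would use its own ids.
SKY, ROAD, OTHER = 0, 1, 2


def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a (2r+1) x (2r+1) square window, NumPy only."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    # OR together every shifted copy of the mask inside the window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def region_masks(labels: np.ndarray, radius: int = 2) -> dict:
    """Split a label map into sky, road, and sky-adjacent foreground masks."""
    sky = labels == SKY
    road = labels == ROAD
    # Foreground pixels (not sky, not road) that lie within `radius` pixels of sky.
    sky_adjacent = dilate(sky, radius) & ~sky & ~road
    return {"sky": sky, "road": road, "sky_adjacent": sky_adjacent}


if __name__ == "__main__":
    # Toy scene: sky on top, road below, a two-pixel "pole" standing in the sky.
    labels = np.full((6, 6), SKY)
    labels[3:, :] = ROAD
    labels[1:3, 2] = OTHER
    masks = region_masks(labels, radius=1)
    print(int(masks["sky_adjacent"].sum()))  # 2: only the pole pixels
```

Excluding road from the sky-adjacent band is a design assumption: at the horizon the road touches the sky, but it is handled by the road-geometry cue rather than by foreground preservation.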
The framework connects several research directions, including transformer-based depth models, metric depth estimation methods, CutDepth augmentation, semantic-guided learning, and sky/infinity masking techniques. It also uses qualitative analysis from a custom-trained Depth Anything V2 model to show that current systems often lose thin structures (like poles and signs) even when overall depth maps appear accurate.
The literature review shows that while methods such as DPT [4], ZoeDepth [6], Metric3D v2 [7], and Depth Anything V2 [11] have significantly advanced monocular depth estimation, they still do not fully solve road-scene-specific failures. Existing augmentation methods such as CutDepth [12] also lack awareness of road geometry and sky-object relationships.
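To illustrate what region awareness could add to CutDepth-style augmentation, the following is a simplified sketch. It assumes the basic CutDepth operation of pasting a rectangle of the ground-truth depth map onto the same location of the input image, and adds one hypothetical rule: a candidate rectangle is rejected if it overlaps a protected mask (for example, sky-adjacent foreground). The function name, parameters, and rejection rule are illustrative assumptions, not the paper's specification.

```python
import numpy as np


def region_guided_cutdepth(image, depth, protected, rng, max_tries=20, max_frac=0.5):
    """Paste a depth rectangle onto the image, avoiding protected pixels.

    image:     (H, W, 3) float array in [0, 1]
    depth:     (H, W) metric depth map
    protected: (H, W) bool mask of pixels that must stay untouched
    rng:       numpy.random.Generator
    Returns the augmented image and the box (top, left, height, width),
    or the unchanged copy and None if no admissible box was found.
    """
    h, w = depth.shape
    out = image.copy()
    # Scale depth to [0, 1] so it is comparable to image intensities.
    depth_norm = depth / max(float(depth.max()), 1e-6)
    max_h = max(1, int(h * max_frac))
    max_w = max(1, int(w * max_frac))
    for _ in range(max_tries):
        bh = int(rng.integers(1, max_h + 1))
        bw = int(rng.integers(1, max_w + 1))
        top = int(rng.integers(0, h - bh + 1))
        left = int(rng.integers(0, w - bw + 1))
        # Hypothetical region rule: never overwrite protected foreground.
        if protected[top:top + bh, left:left + bw].any():
            continue
        patch = depth_norm[top:top + bh, left:left + bw, None]
        out[top:top + bh, left:left + bw, :] = patch  # broadcast to 3 channels
        return out, (top, left, bh, bw)
    return out, None


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = np.full((8, 8, 3), 0.5)
    depth = np.tile(np.linspace(80.0, 5.0, 8)[:, None], (1, 8))
    protected = np.zeros((8, 8), dtype=bool)
    protected[1:4, 3] = True  # a thin pole-like structure
    augmented, box = region_guided_cutdepth(image, depth, protected, rng)
    print(box)
```

Rejection sampling keeps the sketch short; a fuller version might instead bias the rectangle toward road regions or limit how much pure sky it covers.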
Conclusion
This paper presented SkyRoad-CutDepth, a theoretical framework for sky-aware and road-geometric pretext learning in metric monocular depth estimation for dashcam imagery. The work was motivated by qualitative outputs from a Depth Anything V2 model trained on a custom dashcam dataset with canonical-space transformations. These outputs show that global road and building depth can be consistent while thin foreground objects near sky and vanishing-point regions remain partially suppressed or invalid.
The proposed framework combines road-plane priors, sky validity, foreground-object preservation, boundary-aware losses, and region-guided CutDepth augmentation. The central idea is that pure sky, road, horizon, and sky-adjacent foreground objects should not be treated uniformly during training. Pure sky can be down-weighted, road can be constrained by near-to-far geometry, and finite objects such as poles, signs, traffic lights, and wires should be protected through sky-object validity and boundary preservation. Although full quantitative validation is future work, the proposed framework provides a focused and literature-supported direction for robust metric depth estimation in road scenes.
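The region-dependent treatment summarized above can be sketched as two simple loss terms. This is a minimal illustration under stated assumptions, not the paper's formulation: the weights w_sky and w_adj, the L1 error, and the function names are hypothetical, and the road term assumes a forward-facing camera over a roughly flat road, so that depth should not increase when moving down the image within the road region.

```python
import numpy as np


def region_weighted_l1(pred, target, sky, sky_adjacent, valid, w_sky=0.1, w_adj=3.0):
    """Weighted mean absolute depth error with per-region weights.

    Pure sky is down-weighted, sky-adjacent foreground is up-weighted,
    and pixels without valid ground truth are ignored.
    """
    weights = np.ones(pred.shape, dtype=np.float64)
    weights[sky] = w_sky           # hypothetical down-weight for pure sky
    weights[sky_adjacent] = w_adj  # hypothetical up-weight for thin foreground
    weights = weights * valid      # zero out invalid ground-truth pixels
    denom = weights.sum()
    if denom == 0:
        return 0.0
    return float((weights * np.abs(pred - target)).sum() / denom)


def road_monotonicity_penalty(pred, road):
    """Penalize road depth that increases from one row to the row below it."""
    upper, lower = pred[:-1], pred[1:]
    both_road = road[:-1] & road[1:]
    # Lower image rows are assumed nearer, so lower > upper is a violation.
    violation = np.maximum(lower - upper, 0.0)
    n = int(both_road.sum())
    return float(violation[both_road].sum() / n) if n else 0.0


if __name__ == "__main__":
    pred = np.array([[30.0, 30.0], [20.0, 20.0], [10.0, 10.0]])
    road = np.ones_like(pred, dtype=bool)
    print(road_monotonicity_penalty(pred, road))        # consistent near-to-far order
    print(road_monotonicity_penalty(pred[::-1], road))  # inverted order is penalized
```

In this sketch the two terms would be summed with a balancing coefficient during training; a boundary-aware term is omitted for brevity.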
References
[1] A. Bhoi, “Monocular depth estimation: A survey,” arXiv preprint arXiv:1901.09402, Jan. 2019, doi: 10.48550/arXiv.1901.09402.
[2] Y. Ming, X. Meng, C. Fan, and H. Yu, “Deep learning for monocular depth estimation: A review,” Neurocomputing, vol. 438, pp. 14-33, May 2021, doi: 10.1016/j.neucom.2020.12.089.
[3] T. Ehret, “Monocular depth estimation: A review of the 2022 state of the art,” Image Processing On Line, vol. 13, pp. 38-56, 2023, doi: 10.5201/ipol.2023.459.
[4] R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 12179-12188, doi: 10.1109/ICCV48922.2021.01196.
[5] G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, “Transformer-based attention networks for continuous pixel-wise prediction,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 16269-16279, doi: 10.1109/ICCV48922.2021.01596.
[6] S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller, “ZoeDepth: Zero-shot transfer by combining relative and metric depth,” arXiv preprint arXiv:2302.12288, Feb. 2023, doi: 10.48550/arXiv.2302.12288.
[7] M. Hu et al., “Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation,” arXiv preprint arXiv:2404.15506, Apr. 2024, doi: 10.48550/arXiv.2404.15506.
[8] L. Piccinelli et al., “UniDepth: Universal monocular metric depth estimation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 10106-10116, doi: 10.1109/CVPR52733.2024.00963.
[9] L. Piccinelli, C. Sakaridis, Y.-H. Yang, M. Segu, S. Li, W. Abbeloos, and L. Van Gool, “UniDepthV2: Universal monocular metric depth estimation made simpler,” arXiv preprint arXiv:2502.20110, Feb. 2025, doi: 10.48550/arXiv.2502.20110.
[10] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth Anything: Unleashing the power of large-scale unlabeled data,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 10371-10381, doi: 10.1109/CVPR52733.2024.00987.
[11] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth Anything V2,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 37, pp. 21875-21911, 2024, doi: 10.52202/079017-0688.
[12] Y. Ishii and T. Yamashita, “CutDepth: Edge-aware data augmentation in depth estimation,” arXiv preprint arXiv:2107.07684, Jul. 2021, doi: 10.48550/arXiv.2107.07684.
[13] D. Kim, W. Ka, P. Ahn, D. Joo, S. Chun, and J. Kim, “Global-local path networks for monocular depth estimation with vertical CutDepth,” arXiv preprint arXiv:2201.07436, Jan. 2022, doi: 10.48550/arXiv.2201.07436.
[14] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 3827-3837, doi: 10.1109/ICCV.2019.00393.
[15] Y. Wang, Y. Liang, H. Xu, S. Jiao, and H. Yu, “SQLDepth: Generalizable self-supervised fine-structured monocular depth estimation,” in Proc. AAAI Conf. Artif. Intell., vol. 38, no. 6, pp. 5713-5721, 2024, doi: 10.1609/aaai.v38i6.28383.
[16] R. Wang, S. Xu, C. Dai, J. Xiang, Y. Deng, X. Tong, and J. Yang, “MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025, pp. 5261-5271, doi: 10.1109/CVPR52734.2025.00496.
[17] M. Klingner, J.-A. Termöhlen, J. Mikolajczyk, and T. Fingscheidt, “Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 582-600, doi: 10.1007/978-3-030-58565-5_35.
[18] P. Rottmann et al., “Improving monocular depth estimation by semantic pre-training,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2021, doi: 10.1109/IROS51168.2021.9636546.
[19] H. Jung, E. Park, and S. Yoo, “Fine-grained semantics-aware representation enhancement for self-supervised monocular depth estimation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 12642-12652, doi: 10.1109/ICCV48922.2021.01241.
[20] M. Cordts et al., “The Cityscapes dataset for semantic urban scene understanding,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 3213-3223, doi: 10.1109/CVPR.2016.350.
[21] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2012, pp. 3354-3361, doi: 10.1109/CVPR.2012.6248074.
[22] Y. Cabon, N. Murray, and M. Humenberger, “Virtual KITTI 2,” arXiv preprint arXiv:2001.10773, Jan. 2020, doi: 10.48550/arXiv.2001.10773.
[23] B. Ke, A. Obukhov, S. Huang, N. Metzger, R. Caye Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 9492-9502, doi: 10.1109/CVPR52733.2024.00907.
[24] S. Patni, A. Agarwal, and C. Arora, “ECoDepth: Effective conditioning of diffusion models for monocular depth estimation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 28285-28295, doi: 10.1109/CVPR52733.2024.02672.
[25] W. Zhao, Y. Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, “Unleashing text-to-image diffusion models for visual perception,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 5729-5739, doi: 10.1109/ICCV51070.2023.00527.