In recent years, the field of computer vision has witnessed significant advances in understanding and interpreting human motion through pose estimation. While traditional 2D human pose estimation methods can detect body joints in images, they fail to capture the depth, structure, and realistic appearance of the human body. To overcome this limitation, this project develops a system for 3D Human Pose Estimation with Realistic 3D Output, which reconstructs a lifelike three-dimensional human model from a single 2D image or video frame using deep learning techniques. The proposed system follows a structured pipeline of 2D keypoint detection, 3D pose estimation, and 3D mesh reconstruction to achieve accurate and visually realistic results. The system employs pretrained deep learning models such as HMR (Human Mesh Recovery), SPIN (SMPL oPtimization IN the loop), and PARE (Part Attention Regressor) to predict 3D joint coordinates and body-shape parameters efficiently. These parameters are then passed to the SMPL (Skinned Multi-Person Linear) parametric model to generate a smooth and anatomically plausible 3D human mesh. The reconstructed model is rendered and visualized using standard 3D rendering tools, allowing users to rotate, zoom, and observe the model from different viewpoints. System performance is evaluated with standard accuracy metrics such as MPJPE and PA-MPJPE, ensuring reliable and precise pose estimation. The project demonstrates that realistic 3D human reconstruction can be achieved without complex motion-capture systems or expensive hardware. By bridging the gap between 2D image perception and 3D geometric understanding, the proposed system offers a practical and scalable solution for realistic human modeling.
The results obtained from this work are highly suitable for real-world applications such as animation, sports analysis, motion tracking, virtual reality, and healthcare monitoring.
Introduction
Human motion and posture are critical for communication and interaction, and accurately modeling them in computer vision is a longstanding challenge. Traditional 2D pose estimation methods detect body keypoints in images but cannot capture depth, realistic body shape, or structure, limiting applications like animation, VR, sports analysis, and medical rehabilitation.
To overcome these limitations, 3D human pose estimation reconstructs the full three-dimensional structure of the body. Advances in deep learning—especially CNNs, GCNs, RNNs, and Transformers—enable accurate prediction of 3D poses and generation of realistic human meshes. Parametric models like SMPL allow reconstruction of lifelike body shapes using pose and shape parameters. Pretrained models such as HMR, SPIN, and PARE facilitate mesh recovery even from a single image.
Project Objective:
Design and implement a system that estimates 3D human poses from 2D images and reconstructs realistic 3D human meshes, bypassing the need for expensive motion-capture setups. The pipeline includes:
2D keypoint detection from images.
Estimation of 3D joint coordinates and body shape parameters.
Reconstruction of the full 3D human mesh and visualization from multiple angles.
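The three pipeline stages above can be sketched as a data-flow skeleton. All function names and internals below are illustrative stand-ins, not the real models: an actual system would call a pretrained 2D detector, a regressor such as HMR/SPIN/PARE, and the true SMPL forward pass. Only the tensor shapes reflect common conventions (17 COCO-style joints, 72 SMPL pose parameters, 10 shape coefficients, 6,890 SMPL mesh vertices).

```python
import numpy as np

def detect_2d_keypoints(image):
    """Stage 1 (stand-in): return (J, 2) pixel coordinates for J = 17 joints."""
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    return rng.uniform([0, 0], [w, h], size=(17, 2))

def regress_3d_parameters(keypoints_2d):
    """Stage 2 (stand-in): predict SMPL pose (72,) and shape (10,) parameters."""
    pose = np.zeros(72)    # 24 joints x 3 axis-angle values
    shape = np.zeros(10)   # low-dimensional body-shape coefficients
    return pose, shape

def smpl_mesh(pose, shape):
    """Stage 3 (stand-in): SMPL forward pass -> (6890, 3) mesh vertices."""
    return np.zeros((6890, 3))

image = np.zeros((256, 256, 3))            # dummy input frame
kp2d = detect_2d_keypoints(image)
pose, shape = regress_3d_parameters(kp2d)
vertices = smpl_mesh(pose, shape)
print(kp2d.shape, pose.shape, shape.shape, vertices.shape)
```

The value of writing the pipeline this way is that each stage can be swapped independently, e.g. replacing the 3D regressor while keeping the detector and renderer unchanged.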
Literature Insights:
Evolution: From 2D joint detection → 3D skeletal reconstruction → realistic 3D mesh recovery.
Architectures: CNNs extract visual features; GCNs model skeletal relationships; RNNs/Transformers handle temporal dynamics in videos. Hybrid models combine these strengths for improved accuracy.
Learning Approaches: Supervised, weakly-supervised, self-supervised methods, and 2D-to-3D lifting pipelines dominate current research.
Challenges: Occlusion, depth ambiguity, limited 3D datasets, complex backgrounds, and computational cost remain significant obstacles.
Datasets: Common benchmarks include Human3.6M, MPI-INF-3DHP, MuPoTS-3D, 3DPW (3D), and COCO, MPII (2D).
Evaluation Metrics: MPJPE, PA-MPJPE, PCK, PCP, mAP, and temporal consistency metrics are standard.
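The two primary metrics, MPJPE and PA-MPJPE, are simple to compute. A minimal NumPy sketch, assuming predicted and ground-truth joints as (J, 3) arrays in the same units (typically millimetres): MPJPE is the mean Euclidean joint error, and PA-MPJPE first removes scale, rotation, and translation via a Procrustes (Umeyama) alignment.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance per joint."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment: scale, rotation, translation."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance matrix.
    U, s, Vt = np.linalg.svd(p.T @ g)
    if np.linalg.det(Vt.T @ U.T) < 0:   # fix a possible reflection
        Vt[-1] *= -1
        s[-1] *= -1
    R = Vt.T @ U.T
    scale = s.sum() / (p ** 2).sum()    # optimal isotropic scale (Umeyama)
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)

gt = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
pred = 2.0 * gt + 0.5                   # scaled and translated copy of gt
print(round(mpjpe(pred, gt), 3))        # → 1.46  (raw error is large)
print(round(pa_mpjpe(pred, gt), 3))     # → 0.0   (alignment removes it all)
```

The gap between the two numbers is itself informative: a low PA-MPJPE but high MPJPE indicates correct pose structure with wrong global scale or placement.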
Methodology:
Input images/videos undergo preprocessing, normalization, and data augmentation.
Features are extracted via deep CNN backbones or Vision Transformers.
3D pose regression predicts joint coordinates, while SMPL-based or template-free networks reconstruct meshes.
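As a toy illustration of the preprocessing and augmentation steps above, 2D keypoints can be normalised relative to a person bounding box and horizontally flipped for augmentation. The bounding-box format and the left/right joint pairing here are assumptions for the sketch, not a fixed convention:

```python
import numpy as np

def normalize_keypoints(kp, bbox):
    """Map pixel keypoints into roughly [-1, 1], relative to a person bbox."""
    x, y, w, h = bbox
    center = np.array([x + w / 2, y + h / 2])
    scale = max(w, h) / 2
    return (kp - center) / scale

def augment_flip(kp_norm, pairs):
    """Horizontal-flip augmentation: negate x and swap left/right joint pairs."""
    flipped = kp_norm.copy()
    flipped[:, 0] *= -1
    for l, r in pairs:
        flipped[[l, r]] = flipped[[r, l]]
    return flipped

kp = np.array([[120., 80], [140, 80], [130, 200]])   # toy joints in pixels
bbox = (100, 50, 60, 180)                            # x, y, width, height
kp_n = normalize_keypoints(kp, bbox)
kp_f = augment_flip(kp_n, pairs=[(0, 1)])            # joints 0/1 as a L/R pair
print(kp_n)
```

Swapping the paired joints after negating x is essential: without it, a flipped image would label the subject's left shoulder as the right one, corrupting the training signal.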
Significance:
The proposed system provides practical, cost-effective, and visually realistic 3D human pose estimation suitable for applications in animation, VR, healthcare, sports, and HCI.
Conclusion
This literature survey has provided a clear picture of the rapid progress in human pose estimation, moving from basic 2D joint detection through 3D skeletal reconstruction to realistic 3D human mesh generation. The analyzed studies make it evident that deep learning has transformed how human pose estimation is performed, making it possible to reconstruct accurate and lifelike 3D human models from ordinary 2D images and videos. Large-scale benchmark datasets, powerful neural network architectures, and standardized evaluation metrics have together improved the accuracy, robustness, and practical applicability of modern pose estimation systems.
The survey also highlights the importance of parametric human body models such as SMPL, which play a crucial role in generating realistic 3D human shapes by representing both pose and body structure mathematically. Together, the findings from the literature form a strong theoretical and technical foundation for the proposed project, “3D Human Motion Reconstruction from 2D Images Using Deep Learning and Computer Vision.” By integrating 2D pose detection, 3D pose regression, and mesh reconstruction techniques, the project aims to develop a practical system capable of producing visually realistic 3D human models. Thus, the literature study not only validates the relevance of the project but also directly guides its system design, methodology, and evaluation strategy.
References
[1] Zhou, L., Meng, X., Liu, Z., Wu, M., Gao, Z., & Wang, P. (2023). Human pose-based estimation, tracking and action recognition with deep learning: A survey. arXiv. https://doi.org/10.48550/arXiv.2310.13039
[2] Ibne, M. B., Islam, K. R., & Hasan, K. M. A. (2025). A survey on deep 3D human pose estimation. Artificial Intelligence Review, 58, Article 24. https://doi.org/10.1007/s10462-024-11019-3
[3] Zheng, C., Wu, W., Chen, C., Yang, T., Zhu, S., Shen, J., Kehtarnavaz, N., & Shah, M. (2024). Deep learning-based human pose estimation: A survey. ACM Computing Surveys, 56(1), Article 11, 1–37. https://doi.org/10.1145/3603618
[4] Liu, Y., Qiu, C., & Zhang, Z. (2024). Deep learning for 3D human pose estimation and mesh recovery: A survey. Neurocomputing, 596, 128049. https://doi.org/10.1016/j.neucom.2024.128049
[5] Liu, W., Bao, Q., Sun, Y., & Mei, T. (2022). Recent advances in monocular 2D and 3D human pose estimation: A deep learning perspective. ACM Computing Surveys, 55(4), 1–41. https://doi.org/10.1145/3524497
[6] Lin, J., Li, S., Qin, H., Wang, H., Cui, N., Jiang, Q., Jian, H., & Wan, G. (2023). Overview of 3D human pose estimation. Computer Modeling in Engineering & Sciences, 134(3), 1621–1651. https://doi.org/10.32604/cmes.2023.018597
[7] Venkatrayappa, D., Trémeau, A., Muselet, D., & Colantoni, P. (2024). Survey of 3D human body pose and shape estimation methods for contemporary dance applications. arXiv. https://doi.org/10.48550/arXiv.2401.02383
[8] Yang, W., Ouyang, W., Wang, X., Ren, J., Li, H., & Wang, X. (2018). 3D human pose estimation in the wild by adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/CVPR.2018.00551
[9] Zhang, Y., Ji, P., Wang, A., Mei, J., Kortylewski, A., & Yuille, A. L. (2023). 3D-aware neural body fitting for occlusion robust 3D human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). https://doi.org/10.1109/ICCV51070.2023.00862
[10] Zhan, Y., Li, F., Weng, R., & Choi, W. (2022). Ray3D: Ray-based 3D human pose estimation for monocular absolute 3D localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/CVPR42600.2022.01284
[11] Yuan, Y., Wei, S.-E., Simon, T., Kitani, K. M., & Saragih, J. M. (2021). SimPoE: Simulated character control for 3D human pose estimation. CoRR, abs/2104.00683. https://arxiv.org/abs/2104.00683
[12] Rajasegaran, J., Pavlakos, G., Kanazawa, A., Feichtenhofer, C., & Malik, J. (2023). On the benefits of 3D pose and tracking for human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 640–649). https://doi.org/10.1109/CVPR52688.2023.00071
[13] Lee, K., Lee, I., & Lee, S. (2018). Propagating LSTM: 3D pose estimation based on joint interdependency. In V. Ferrari, C. Sminchisescu, M. Hebert, & Y. Weiss (Eds.), Computer vision – ECCV 2018 (Vol. 11211, pp. 123–141). Springer, Cham. https://doi.org/10.1007/978-3-030-01234-2_8
[14] Agarwal, A., & Triggs, B. (2006). Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1), 44–58. https://doi.org/10.1109/TPAMI.2006.9
[15] Belagiannis, V., Amin, S., Andriluka, M., Schiele, B., Navab, N., & Ilic, S. (2016). 3D pictorial structures revisited: Multiple human pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10), 1929–1942. https://doi.org/10.1109/TPAMI.2015.2509986
[16] Wang, K., Lin, L., Jiang, C., Qian, C., & Wei, P. (2019). 3D human pose machines with self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(5), 1069–1082. https://doi.org/10.1109/TPAMI.2019.2892452
[17] Marín-Jiménez, M. J., Romero-Ramírez, F. J., Muñoz-Salinas, R., & Medina-Carnicer, R. (2018). 3D human pose estimation from depth maps using a deep combination of poses. Journal of Visual Communication and Image Representation, 55, 127–136. https://doi.org/10.1016/j.jvcir.2018.07.010
[18] Wang, J., Yan, S., Xiong, Y., & Lin, D. (2020). Motion guided 3D pose estimation from videos. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 764–780). https://doi.org/10.1007/978-3-030-58601-0_45
[19] Moon, G., Chang, J. Y., & Lee, K. M. (2018). V2V-PoseNet: Voxel-to-voxel prediction network for accurate 3D hand and human pose estimation from a single depth map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5079–5088). IEEE. https://doi.org/10.1109/CVPR.2018.00532
[20] Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., & Theobalt, C. (2017). Monocular 3D human pose estimation in the wild using improved CNN supervision. In Proceedings of the 2017 International Conference on 3D Vision (3DV) (pp. 506–516). IEEE Computer Society.
[21] Leibe, B., Leonardis, A., & Schiele, B. (2008). Learning an alphabet of shape and appearance for multi-class object detection. International Journal of Computer Vision, 80(1), 16–44. https://doi.org/10.1007/s11263-007-0119-2