Life expectancy is a key measure of societal well-being. However, the relationships among the socioeconomic, environmental, and healthcare factors that affect it are still not fully understood. This paper introduces a machine learning framework that combines gradient-boosted ensemble modeling, Shapley Additive Explanations (SHAP), and DoWhy-based causal inference to predict life expectancy and identify actionable factors that contribute to a longer life.
We utilized the WHO Global Health Statistics dataset, along with World Bank longitudinal data and the Indian National Family Health Survey (NFHS-5) state-level indicators. We trained an XGBoost regressor that achieved R² = 0.9696 and MAE = 1.08 years.
The SHAP analysis identifies HIV/AIDS prevalence, adult mortality rate, and income composition of resources as the three most influential features. The DoWhy causal model estimates that above-median schooling increases life expectancy by 4.29 years on average (the average treatment effect, ATE).
An interactive Streamlit dashboard combines these findings for global comparisons, state-level analyses in India, three scenario-based projections to 2050, and a personalized life expectancy estimator. Our results highlight the need for transparent, causally grounded AI systems to support evidence-based public health policies.
Introduction
This paper presents a data-driven study of global and Indian life expectancy that aims to explain disparities, improve prediction accuracy, and support policy decisions using machine learning and causal analysis.
Life expectancy has improved globally, but large inequalities persist between countries (e.g., Japan vs. African nations) and within India (e.g., Kerala vs. Bihar). Traditional statistical methods struggle to capture the complex, non-linear factors that influence life expectancy, and many machine learning models lack interpretability.
To address this, the study proposes a framework with four main contributions:
An XGBoost regression model trained on WHO data with high accuracy (R² ≈ 0.97).
SHAP explainability, which identifies and ranks key factors influencing life expectancy at both global and individual levels.
Causal analysis using DoWhy, which estimates the true causal impact of factors like education and income rather than simple correlations.
A scenario-based simulation tool (Streamlit dashboard) to project India’s life expectancy up to 2050 under different policy conditions.
The literature review shows that tree-based models like XGBoost outperform other methods for tabular health data, while SHAP is widely used for interpretable AI in healthcare. It also highlights the importance of causal inference methods (like DoWhy) for understanding policy impacts.
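To make the difference between a raw correlation and an adjusted causal estimate concrete, the following is a minimal sketch of backdoor adjustment by stratification, written with the Python standard library only. It is not DoWhy's API and not necessarily the estimator used in this study; DoWhy organizes the same idea into model, identify, estimate, and refute steps. The rows, the single confounder (an income group), and all outcome values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from collections import defaultdict
from statistics import fmean

# Each row: (treated, income_group, life_expectancy).
# treated = 1 stands for "above-median schooling". All values are hypothetical.
ROWS = [
    (0, "low", 60.0), (0, "low", 60.0), (0, "low", 60.0), (1, "low", 62.0),
    (0, "high", 75.0), (1, "high", 77.0), (1, "high", 77.0), (1, "high", 77.0),
]

def naive_difference(rows):
    """Difference in mean outcome between treated and untreated units."""
    treated = [y for t, _, y in rows if t == 1]
    control = [y for t, _, y in rows if t == 0]
    return fmean(treated) - fmean(control)

def stratified_ate(rows):
    """Average treatment effect after adjusting for one discrete confounder.

    The effect is computed inside each stratum and then averaged with
    weights proportional to the stratum size.
    """
    strata = defaultdict(lambda: {0: [], 1: []})
    for treated, stratum, outcome in rows:
        strata[stratum][treated].append(outcome)

    total = len(rows)
    ate = 0.0
    for groups in strata.values():
        if not groups[0] or not groups[1]:
            raise ValueError("each stratum needs treated and untreated units")
        size = len(groups[0]) + len(groups[1])
        effect = fmean(groups[1]) - fmean(groups[0])
        ate += effect * size / total
    return ate

naive = naive_difference(ROWS)
adjusted = stratified_ate(ROWS)
print(f"naive difference: {naive:.2f} years")
print(f"adjusted ATE:     {adjusted:.2f} years")
```

In this toy data the treated units are concentrated in the high-income group, so the naive difference is much larger than the within-group effect. The adjustment is only valid if the chosen confounders block every backdoor path, which is an assumption that has to be argued from domain knowledge.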
The dataset combines WHO global health data, World Bank time series, and India’s NFHS-5 state-level survey, including indicators such as mortality rates, immunization, nutrition, education, income, and healthcare access.
Methodologically, the study uses an optimized XGBoost model as the main predictor and compares it with Random Forest and linear regression. XGBoost achieves the best fit of the three, with R² ≈ 0.97.
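For reference, the two evaluation metrics reported in this paper, R² and MAE, can be computed as in the sketch below. It uses only the Python standard library; the observed and predicted values are hypothetical placeholders, not results from the study. R² is the share of variance in the target that the predictions explain, and MAE is the average absolute error in the target's own unit (years).

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical life expectancies in years, for illustration only.
observed = [60.0, 70.0, 80.0]
predicted = [61.0, 69.0, 80.0]

r2 = r2_score(observed, predicted)
mae = mean_absolute_error(observed, predicted)
print(f"R^2 = {r2:.4f}, MAE = {mae:.4f} years")
```

Both metrics should be computed on held-out data; values computed on the training set overstate how well the model generalizes.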
SHAP analysis identifies key drivers of life expectancy, including HIV/AIDS prevalence, adult mortality, income, education, nutrition, and immunization. Causal analysis further examines how factors like education and healthcare expenditure directly influence life expectancy outcomes.
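The attribution idea behind SHAP can be illustrated with exact Shapley values on a tiny model. The sketch below enumerates every feature coalition, which is only feasible for a handful of features; the SHAP library uses much more efficient algorithms for tree ensembles, so this is not how the study's values were computed. The model `toy_model`, its coefficients, the input, and the baseline are hypothetical, and "absent" features are simulated by substituting a single baseline value, which is a simplifying assumption.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    A feature outside the coalition takes its baseline value.
    Cost grows as 2**n, so this is for illustration only.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                contribution += weight * (value(members | {i}) - value(members))
        phi.append(contribution)
    return phi

def toy_model(features):
    """Hypothetical predictor with one interaction term."""
    hiv, income, schooling = features
    return 70.0 - 2.0 * hiv + 3.0 * income + income * schooling

x = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split equally between the two features involved. Averaging the absolute attributions over many rows gives a global ranking of the kind reported here.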
Finally, the study builds scenario-based forecasts for India up to 2050, comparing optimistic, pessimistic, and business-as-usual development paths to inform policy planning.
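A scenario comparison of this kind can be sketched as follows. The paper does not state the projection method here, so the linear rule, the starting year, the baseline value, and the annual gains are all hypothetical placeholders; only the three scenario names come from the text.

```python
def project(baseline, start_year, end_year, annual_gain):
    """Linear projection: baseline plus a constant gain per elapsed year."""
    return {
        year: baseline + annual_gain * (year - start_year)
        for year in range(start_year, end_year + 1)
    }

# Hypothetical inputs, for illustration only.
BASELINE_YEARS = 70.0
START_YEAR = 2025
END_YEAR = 2050
SCENARIO_GAINS = {
    "optimistic": 0.08,
    "business_as_usual": 0.04,
    "pessimistic": 0.0,
}

projections = {
    name: project(BASELINE_YEARS, START_YEAR, END_YEAR, gain)
    for name, gain in SCENARIO_GAINS.items()
}

for name, path in projections.items():
    print(f"{name}: {path[END_YEAR]:.2f} years in {END_YEAR}")
```

A more faithful version would change the input indicators under each scenario and pass them through the trained model, so that the projected values follow from the predictor and not from an assumed growth rate.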
Conclusion
We have presented a comprehensive, interpretable, and causally grounded machine learning framework for life expectancy prediction. The XGBoost model achieves R² = 0.97 on WHO data. SHAP analysis identifies HIV/AIDS prevalence, adult mortality, and income composition as high-impact predictors. DoWhy causal modeling estimates that above-median education increases life expectancy by 4.29 years. When applied to India, the framework reveals a 12-year sub-national gap and projects that aggressive policy reform could yield a 2-year increase by 2050.