Data Correlation and Feature Importance Analysis in Predictive Modeling

Authors: Ms. B Ysujana, Nathani Vikas, Bejjanki Vamshi, Amburi Vishal

DOI Link: https://doi.org/10.22214/ijraset.2026.79498

Abstract

Predictive modeling is an essential component of modern data science, driving decision-making across domains such as healthcare, business, and engineering. One of the primary challenges in building reliable models is identifying correlations among features and selecting the most influential variables. This paper presents an interactive web application that automates correlation analysis and evaluates feature importance using machine learning techniques. The platform integrates a Flask backend with libraries such as Pandas, Scikit-learn, Matplotlib, Seaborn, Plotly, XGBoost, and SHAP to deliver real-time data analysis, interactive visualizations, and model interpretability. The system supports Pearson and Spearman correlation, Random Forest and permutation-based feature importance, and optional model training using decision trees, regression models, and ensemble methods. Experiments conducted on datasets from the Kaggle and UCI repositories show that the platform reduces analysis time by 95% and increases model accuracy by up to 14% compared to traditional manual workflows.

Introduction

The proposed system is an interactive web-based platform designed to simplify predictive modeling by automating correlation analysis and feature selection. It addresses the limitations of traditional data science workflows, which require extensive coding, technical expertise, and manual effort, making them difficult for non-technical users.

The system allows users to upload datasets and automatically performs preprocessing, including handling missing values, encoding categorical data, and normalization. It then conducts correlation analysis using Pearson and Spearman methods and visualizes relationships through interactive heatmaps to identify multicollinearity and feature dependencies.
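The preprocessing and correlation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's actual code: the function names (`preprocess`, `correlation_matrices`, `collinear_pairs`), the median/mode imputation policy, min-max scaling, and the 0.9 multicollinearity threshold are all choices made for this example.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values, min-max scale numerics, one-hot encode categoricals."""
    out = df.copy()
    numeric_cols = list(out.select_dtypes(include="number").columns)
    categorical_cols = [c for c in out.columns if c not in numeric_cols]

    # Assumed imputation policy: median for numeric columns, most frequent
    # value for categorical ones (an entirely missing column is not handled).
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in categorical_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    # Min-max normalization to [0, 1]; constant columns are set to 0.
    for col in numeric_cols:
        low, high = out[col].min(), out[col].max()
        out[col] = (out[col] - low) / (high - low) if high > low else 0.0

    if not categorical_cols:
        return out
    return pd.get_dummies(out, columns=categorical_cols, dtype=int)


def correlation_matrices(df: pd.DataFrame):
    """Return (Pearson, Spearman) correlation matrices over numeric columns."""
    numeric = df.select_dtypes(include="number")
    return numeric.corr(method="pearson"), numeric.corr(method="spearman")


def collinear_pairs(corr: pd.DataFrame, threshold: float = 0.9):
    """List column pairs whose absolute correlation meets the threshold."""
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Toy data (made up for illustration): x_cubed is a monotonic but
# nonlinear function of x, so Spearman and Pearson disagree on it.
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_cubed": [1.0, 8.0, 27.0, 64.0, 125.0],
    "noise": [0.3, np.nan, 0.1, 0.9, 0.2],
    "group": ["a", "b", "a", None, "b"],
})
clean = preprocess(demo)
pearson, spearman = correlation_matrices(clean)
print(pearson.round(3))
print(spearman.round(3))
print(collinear_pairs(pearson))
```

On the toy data, Spearman treats `x` and `x_cubed` as perfectly associated because their ranks agree, while Pearson reports a weaker linear relationship. Either matrix could then be passed to a heatmap function such as `seaborn.heatmap` for the visual inspection the text describes.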

For feature selection, the platform combines several techniques: Random Forest importance and permutation importance give global feature rankings, while SHAP values explain model behavior both globally and for individual predictions. Users can optionally run predictive models such as decision trees and linear regression to evaluate how feature selection affects performance.
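To make the two model-based rankings concrete, the sketch below computes Random Forest impurity importance and permutation importance with scikit-learn on synthetic data. The helper name `rank_features`, the hyperparameters, and the data are assumptions made for this example and do not come from the paper. SHAP values are left out so the sketch depends only on scikit-learn; for tree models they would typically be obtained through the `shap` package.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def rank_features(X: pd.DataFrame, y: pd.Series, seed: int = 0) -> pd.DataFrame:
    """Rank features by impurity-based and permutation importance."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)

    # Permutation importance is measured on held-out data: each column is
    # shuffled in turn and the resulting drop in the R^2 score is recorded.
    perm = permutation_importance(
        model, X_test, y_test, n_repeats=10, random_state=seed
    )

    table = pd.DataFrame(
        {
            "impurity_importance": model.feature_importances_,
            "permutation_importance": perm.importances_mean,
        },
        index=X.columns,
    )
    return table.sort_values("permutation_importance", ascending=False)


# Synthetic data (illustrative only): the target depends strongly on
# "signal", weakly on "weak", and not at all on "noise".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=400),
    "weak": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
y_demo = (
    3.0 * X_demo["signal"]
    + 0.5 * X_demo["weak"]
    + rng.normal(scale=0.1, size=400)
)
ranking = rank_features(X_demo, y_demo)
print(ranking)
```

Reporting both columns side by side is a deliberate choice: impurity importance is computed from the training data and can favor high-cardinality features, so comparing it with the held-out permutation scores gives a simple sanity check.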

Built using React.js (frontend) and Flask (backend), the system integrates visualization tools like Matplotlib, Seaborn, and Plotly. It provides interactive dashboards, model evaluation metrics, and automated ML workflow support.
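As a rough idea of how a Flask backend could expose correlation analysis to a browser frontend, the sketch below accepts a CSV upload and returns the matrix as JSON. The route `/api/correlation`, the form field names `file` and `method`, and the response layout are hypothetical; the paper does not describe its API.

```python
import io

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/api/correlation", methods=["POST"])  # hypothetical route
def correlation():
    upload = request.files.get("file")  # hypothetical form field name
    if upload is None:
        return jsonify(error="expected a CSV upload in the 'file' field"), 400

    method = request.form.get("method", "pearson")
    if method not in {"pearson", "spearman"}:
        return jsonify(error="method must be 'pearson' or 'spearman'"), 400

    df = pd.read_csv(io.BytesIO(upload.read()))
    corr = df.select_dtypes(include="number").corr(method=method)

    # NaN is not valid JSON, so undefined correlations (for example, for a
    # constant column) are sent as null.
    matrix = [
        [None if pd.isna(v) else round(float(v), 4) for v in row]
        for row in corr.to_numpy()
    ]
    return jsonify(method=method, columns=list(corr.columns), matrix=matrix)


if __name__ == "__main__":
    app.run(debug=True)
```

A frontend could send the file as multipart form data and render the returned `columns` and `matrix` as an interactive heatmap. Input validation, file size limits, and authentication are omitted here.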

Results show that the system significantly reduces analysis time and improves model accuracy by about 10–15% compared to traditional manual approaches. It also improves interpretability and efficiency, making data analysis faster and more accessible to non-technical users.

Conclusion

This study presents an interactive web application designed to streamline the processes of correlation analysis, feature importance ranking, and preliminary predictive modeling for structured datasets. The system acts as a bridge between technically intensive data science workflows and a user-friendly analytical interface, thereby empowering both domain experts and individuals with limited technical background to perform sophisticated data exploration without writing a single line of code. By automating crucial stages of the data analysis pipeline (preprocessing, correlation computation, model-driven feature evaluation, and visualization), the platform significantly reduces the cognitive and operational burden traditionally associated with manual analytical procedures.

Experimental evaluations conducted using real-world datasets demonstrate that the integrated use of correlation metrics, machine learning–based feature importance techniques, and SHAP-based explainability substantially improves the transparency, interpretability, and predictive accuracy of models. These capabilities are particularly valuable when dealing with high-dimensional datasets, where identifying influential variables and understanding their interactions is often challenging. The application not only accelerates the feature engineering process but also enhances the reliability of downstream model predictions.

Furthermore, the platform achieves a remarkable reduction in overall analysis time, transforming tasks that typically require hours of manual computation into an automated workflow that completes within minutes. This efficiency makes the system highly suitable for real-world deployment across various environments, including academic research, data-driven classrooms, business intelligence teams, and organizations seeking rapid analytical insights without dedicating extensive computational resources.

By providing an interactive, intuitive, and fully automated analytical environment, the proposed system delivers a substantial advancement in the fields of exploratory data analysis and explainable machine learning. Traditional analytical workflows often require users to rely on multiple tools, scripts, and platforms to preprocess data, compute correlations, generate visualizations, and interpret model behavior. In contrast, the developed system unifies all these operations within a single, seamless interface, thereby eliminating fragmentation and significantly reducing the learning curve for individuals entering the world of data analytics. Its ability to integrate statistical techniques, machine learning algorithms, and visual interpretation tools transforms the often complex and code-intensive analytical process into an accessible and user-friendly experience. This integration facilitates clarity in understanding relationships within datasets, supports transparency in the feature selection process, and enhances trust in model predictions through explainability tools such as SHAP. The system not only accelerates the analytical workflow but also promotes rigorous, data-driven reasoning by providing users with immediate, visually interpretable insights.

In addition to its educational and professional benefits, the platform also supports collaborative and organizational workflows by enabling consistent, repeatable, and shareable analytical processes. In many real-world environments, teams struggle to maintain uniform analysis standards due to variability in coding practices, tool preferences, and individual expertise. The proposed system eliminates these inconsistencies by providing a unified, standardized interface where all users, regardless of background, follow the same analysis pipeline. This not only improves the quality and reproducibility of insights but also reduces dependency on specialized personnel. The platform’s automated visual outputs, such as correlation heatmaps, feature ranking plots, and model performance graphs, can be easily exported for documentation, presentations, and stakeholder communication. As a result, organizations can integrate the system into their decision-making workflows, enhancing efficiency and fostering a more data-driven culture.

Moreover, the system’s ability to simplify complex analytical workflows plays a crucial role in empowering organizations that may lack dedicated data science teams or advanced computational infrastructure. By integrating preprocessing, correlation analysis, feature evaluation, and model interpretation into a single cohesive platform, it minimizes the need for specialized software and reduces reliance on manual scripts that often vary from analyst to analyst. This uniformity not only enhances operational efficiency but also ensures that analytical insights maintain a high standard of accuracy and consistency across different projects. The system’s capability to generate clear, visually interpretable outputs further bridges the communication gap between technical experts and decision-makers, enabling stakeholders to understand and trust the analytical process. As industries continue to adopt data-driven strategies, platforms like this serve as essential tools that promote accessibility, collaboration, and informed decision-making at every level of an organization.

Copyright

Copyright © 2026 Ms. B Ysujana, Nathani Vikas, Bejjanki Vamshi, Amburi Vishal. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET79498

Publish Date : 2026-04-05

ISSN : 2321-9653

Publisher Name : IJRASET
