This study presents a comparative analysis of five feature selection methods—Chi-Square, Mutual Information, RFE, LASSO, and Random Forest Importance—applied to the Breast Cancer Wisconsin Diagnostic dataset. Their effectiveness was evaluated using Logistic Regression, SVM, and Random Forest classifiers based on accuracy, F1-score, ROC-AUC, runtime, and Jaccard-based stability. RFE achieved the highest predictive performance, whereas Chi-Square and Mutual Information provided the strongest stability and fastest computation. Random Forest Importance offered a balanced trade-off, while LASSO showed reduced stability due to aggressive regularization. The results highlight clear performance–stability trade-offs and provide practical guidelines for selecting reliable feature selection techniques in breast cancer prediction.
Introduction
Breast cancer is one of the most common and life-threatening diseases among women worldwide, making early and accurate diagnosis essential for improving patient outcomes. With the increasing availability of biomedical data, machine learning (ML) has become an effective tool for building predictive models. However, high-dimensional medical datasets pose challenges such as overfitting, low interpretability, and high computational cost, emphasizing the need for efficient feature selection techniques.
Feature selection methods—filter, wrapper, and embedded—help identify the most informative attributes while removing redundant features. Despite extensive research on improving prediction accuracy, many studies overlook important aspects such as feature stability (reproducibility across data samples) and runtime efficiency, both crucial for clinical deployment. This study addresses these gaps by comparing five feature selection methods: Chi-Square, Mutual Information, RFE, LASSO, and Random Forest Importance, evaluated with three classifiers—Logistic Regression, SVM, and Random Forest—using accuracy, F1-score, ROC-AUC, stability (Jaccard similarity), and runtime.
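As a concrete illustration, the five methods span all three families and can be instantiated with scikit-learn. The sketch below is an assumption about configuration, not the study's exact setup: the choice of k = 10 selected features, the LASSO alpha, and the use of Logistic Regression inside RFE are all illustrative.

```python
# Illustrative sketch of the five feature-selection methods (filter,
# wrapper, embedded) on the WBCD data. k=10, alpha=0.01, and the RFE
# base estimator are assumed settings, not taken from the study.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (RFE, SelectFromModel, SelectKBest,
                                       chi2, mutual_info_classif)
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_pos = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs

selectors = {
    # Filter methods: score each feature independently of any classifier.
    "chi2": SelectKBest(chi2, k=10),
    "mutual_info": SelectKBest(mutual_info_classif, k=10),
    # Wrapper method: recursively eliminates features using a model.
    "rfe": RFE(LogisticRegression(max_iter=5000), n_features_to_select=10),
    # Embedded methods: selection is a by-product of model fitting.
    "lasso": SelectFromModel(Lasso(alpha=0.01),
                             max_features=10, threshold=-np.inf),
    "rf_importance": SelectFromModel(
        RandomForestClassifier(n_estimators=200, random_state=0),
        max_features=10, threshold=-np.inf),
}

for name, sel in selectors.items():
    sel.fit(X_pos, y)
    print(name, "->", int(sel.get_support().sum()), "features selected")
```

Each selector exposes the same `get_support()` interface, which makes it straightforward to swap methods while holding the downstream classifiers fixed.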
Methodology
The Wisconsin Breast Cancer Diagnostic (WBCD) dataset (569 samples, 30 features) was used. Data preprocessing involved feature scaling, 5-fold cross-validation, and application of the five feature-selection techniques. After obtaining selected feature subsets, three classifiers were trained and evaluated. Feature stability and computational runtime were also recorded to ensure a comprehensive comparison.
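The evaluation procedure described above can be sketched as a scikit-learn pipeline, so that scaling and feature selection are refit inside each cross-validation fold rather than on the full dataset (avoiding leakage). The RFE + Logistic Regression pairing and k = 10 below are illustrative assumptions; the study evaluated all five selectors with all three classifiers.

```python
# Illustrative 5-fold CV pipeline: scaler -> selector -> classifier,
# scored on accuracy, F1, and ROC-AUC as in the study. The specific
# RFE + Logistic Regression pairing and k=10 are assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", RFE(LogisticRegression(max_iter=5000),
                   n_features_to_select=10)),
    ("clf", LogisticRegression(max_iter=5000)),
])

# cross_validate also records fit_time, covering the runtime comparison.
scores = cross_validate(pipe, X, y, cv=5,
                        scoring=["accuracy", "f1", "roc_auc"])
for metric in ("test_accuracy", "test_f1", "test_roc_auc", "fit_time"):
    print(metric, round(scores[metric].mean(), 3))
```

Because the selector sits inside the pipeline, the reported fold scores reflect feature subsets chosen only from training data, which is the setting under which the stability figures below are meaningful.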
Results
Best Performance: RFE + SVM and RFE + Logistic Regression achieved the highest accuracy (~0.97) and AUC (~0.993), indicating that RFE captures the most discriminative features.
Runtime Efficiency: Chi-Square and Mutual Information were the fastest, while Random Forest Importance had the longest runtime due to the cost of training tree ensembles.
Feature Stability:
Chi-Square: 1.00 (perfect stability)
Random Forest Importance: 0.945
Mutual Information: 0.927
RFE: 0.794
LASSO: 0.507 (lowest stability)
Wrapper methods like RFE were less stable because small variations in the training data alter the model-driven feature rankings, while LASSO's aggressive L1 regularization made its selections highly sensitive to resampling, yielding the poorest reproducibility.
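The Jaccard-based stability scores above can be computed as the average pairwise Jaccard similarity between the feature subsets selected on different cross-validation folds. The sketch below uses Chi-Square with k = 10 as the example selector; the fold scheme and k are assumptions for illustration.

```python
# Sketch of Jaccard-based stability: average pairwise Jaccard similarity
# of the feature subsets selected across 5 CV folds. Chi-square with
# k=10 is an assumed example configuration.
from itertools import combinations

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two feature-index sets."""
    return len(a & b) / len(a | b)


X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)  # chi2 needs non-negative values

subsets = []
for train_idx, _ in StratifiedKFold(5, shuffle=True,
                                    random_state=0).split(X, y):
    sel = SelectKBest(chi2, k=10).fit(X[train_idx], y[train_idx])
    subsets.append(set(np.flatnonzero(sel.get_support())))

stability = np.mean([jaccard(a, b) for a, b in combinations(subsets, 2)])
print("chi2 stability:", round(float(stability), 3))
```

A score of 1.00, as reported for Chi-Square, means every fold selected exactly the same feature subset; lower values indicate that resampling changes which features are chosen.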
Interpretation
RFE is best for maximizing predictive accuracy but sacrifices stability and runtime.
Chi-Square excels in consistency and speed, making it highly suitable for clinical use where reproducibility matters.
Random Forest Importance provides a strong balance between accuracy and stability.
LASSO alone is not ideal for this dataset due to low stability and weaker performance.
Conclusion
This study provides a comprehensive evaluation of five feature selection techniques—Chi-Square, Mutual Information, RFE, LASSO, and Random Forest Importance—for breast cancer prediction using Logistic Regression, Random Forest, and SVM. The results reveal a clear trade-off between predictive accuracy, feature stability, and computational efficiency. RFE achieved the highest accuracy and ROC-AUC, demonstrating the strength of wrapper-based methods, while Chi-Square and Mutual Information offered the greatest stability and fastest computation, highlighting the reliability of filter-based approaches. Random Forest Importance provided a balanced compromise, whereas LASSO showed limited stability on small biomedical datasets.
These findings suggest that RFE with SVM or Logistic Regression is optimal for maximizing predictive performance, whereas Chi-Square or Mutual Information is ideal when reproducibility and interpretability are prioritized. By integrating accuracy, stability, and runtime considerations, this study offers practical guidance for developing robust, interpretable, and clinically relevant machine learning models for breast cancer prediction.