Authors: Hardi Patel, Dr. Mehul P. Barot
Breast cancer is the second leading cause of death among women. Early prediction of breast cancer improves patients' chances of survival. Machine learning and data mining have been widely used for the prediction and early detection of breast cancer. This paper compares the machine learning techniques used for breast cancer prediction.
Worldwide, breast cancer is the most common cancer in women and one of the most dangerous. According to the WHO report for 2020, it is estimated that over 685,000 women worldwide died of breast cancer.
Data mining and machine learning have been widely used in the diagnosis of breast cancer. They help medical researchers identify relationships between different variables and predict the outcome of a disease from datasets. Machine learning can be applied to improve breast cancer detection and can support accurate decision making. The aim of this research is therefore to analyse data mining and machine learning techniques for breast cancer detection. The paper is organized as follows: Section 2 introduces breast cancer. Section 3 explains the data mining and machine learning algorithms and tools used to predict breast cancer. Section 4 discusses the breast cancer dataset. Section 5 presents the literature survey. Section 6 explains the proposed architecture for comparing the accuracy of different algorithms. Finally, Section 7 concludes the survey.
II. BREAST CANCER
Normally, cells in the body divide (reproduce) only when new cells are needed. Sometimes cells grow and divide out of control, creating a mass of tissue called a tumor. If the cells growing out of control are normal cells, the tumor is called benign. If, however, the cells growing out of control are abnormal and do not function like the body's normal cells, the tumor is called malignant.
Cancers are named after the body part in which they originate. Cancer that originates in the breast tissue is called breast cancer. Like other cancers, breast cancer can grow into the tissue surrounding the breast. It can also travel from the breast to other parts of the body and create new tumors, a process called metastasis.
A. Types of Tumors
Tumors can be benign or malignant.
B. Symptoms of Breast Cancer
Different people have different symptoms of breast cancer. Some people do not have any symptoms at all.
Some different types of symptoms are as follows:
C. Stages of Breast Cancer
Breast Cancer has four stages.
a. T1mi: The tumor is 1 mm or smaller.
b. T2: The tumor is larger than 20 mm but not larger than 50 mm.
c. T3: The tumor is larger than 50 mm.
d. T4: The tumor falls into 1 of the following groups:
III. BIG DATA ANALYTICS AND MACHINE LEARNING
Big data analytics is the use of advanced analytic techniques against large, diverse data sets that include structured, semi-structured and unstructured data, from different sources and in different sizes.
Big data is a term applied to datasets whose size or type is beyond the ability of relational databases to capture, manage and process the data with low latency. Big data has the following characteristics: high volume, high velocity, high variety, veracity, and value.
Applications of big data analytics can improve patient-based services, detect diseases earlier, generate new insights into disease mechanisms, monitor the quality of medical and healthcare institutions, and provide better methods of treatment.
2. Machine Learning
Machine learning is the study of programs that learn from experience to improve their performance without explicit human instruction.
There are two types of learning:
a. Supervised Learning
b. Unsupervised Learning
A. Data Mining Algorithms
There are many algorithms, such as Naïve Bayes, K-Nearest Neighbor, k-means, and Random Forest, that are used for analysing huge amounts of data.
Some popular Data Mining Algorithms are discussed as follows:
B. Data Mining Tools
Data mining tools provide ready-to-use implementations of the mining algorithms. Most of them are free, open-source software. Some of the popular data mining tools are discussed below:
IV. BREAST CANCER DATASET
For the prediction of breast cancer, we used the Breast Cancer Wisconsin (Original) dataset. The dataset includes 699 instances and 11 attributes, including the class label. Of these instances, 458 belong to the benign class and the other 241 belong to the malignant class.
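The class counts above imply an imbalanced dataset, which is worth keeping in mind when reading the accuracies reported later. The following is a minimal sketch that derives the class shares from the two counts stated in this section. The majority-class baseline is an illustrative addition, not something the paper reports.

```python
# Class counts as stated for the Wisconsin (Original) dataset in this section.
class_counts = {"benign": 458, "malignant": 241}

total = sum(class_counts.values())

# Share of each class as a percentage of all instances.
class_share = {label: 100.0 * n / total for label, n in class_counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class
# (an illustrative reference point, not a result from the paper).
majority_baseline = max(class_counts.values()) / total

for label, share in class_share.items():
    print(f"{label}: {class_counts[label]} instances ({share:.1f}%)")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any classifier compared on this dataset should be read against that baseline, since predicting "benign" for every instance is already right for roughly two thirds of the data.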
V. LITERATURE SURVEY
A. Mining Big Data: Breast Cancer Prediction using DT-SVM Hybrid Model
In this paper, K. Sivakami uses a hybrid model that combines Decision Tree and Support Vector Machine (DT-SVM) methods to predict disease status. The experiment was performed with the Weka tool. The author considered the Wisconsin breast cancer dataset, which includes 699 instances, of which 458 belong to the non-cancer (benign) class and the other 241 belong to the cancer (malignant) class. Finally, the author compared the output of the DT-SVM model with Naive Bayes (NB), instance-based learning (IBK), and sequential minimal optimization (SMO) and concluded that DT-SVM gives better accuracy, i.e., 91%, than NB, IBK, and SMO.
B. Big Data Analytics to Predict Breast Cancer Recurrence on SEER Dataset using MapReduce Approach
In this paper, D.R. Umesh and B. Ramachandra used the Expectation Maximization (EM) algorithm to identify breast cancer recurrence. To measure classification accuracy, they used the SEER dataset, which contains 220,811 instances with 17 attributes. The authors performed their experiment in the Amazon cloud computing environment (EC2) and report that the expectation maximization algorithm gives an accuracy of 88.54%.
C. Breast Cancer Diagnosis and Prediction Using Machine Learning and Data Mining Techniques: A Review
In this paper, Hiba Asri et al. performed an experiment to determine the efficiency and effectiveness of various algorithms: Support Vector Machine (SVM), K Nearest Neighbor (K-NN), Decision Tree (C4.5), and Naive Bayes (NB). They used the Wisconsin breast cancer (original) dataset from the UCI machine learning repository, which contains 699 instances with 11 attributes. The experiment was performed with the WEKA tool, and the outcomes show that SVM gives the highest accuracy, 97.13%, compared with 95.27% for K-NN and 95.13% for C4.5.
D. Prediction of Breast Cancer using Big Data Analytics
In this paper, K. Shailaja et al. use the KNN algorithm to classify tumors as either benign or malignant. The approach is evaluated and compared using the Wisconsin Breast Cancer dataset. The authors applied feature selection to the dataset to remove duplicate and irrelevant features. The experimental results show that accuracy, precision, recall and F-measure are increased by the proposed method when compared with different models. Accuracy is 96.6% before feature selection and 98.14% after it.
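To illustrate how a KNN classifier assigns a tumor to a class, here is a minimal sketch of majority voting among the k nearest training samples. It is not the implementation used by the cited authors. The function name `knn_predict`, the choice of Euclidean distance, the value k=3, and the two-feature sample values are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample.
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    ]
    # Keep the k closest samples and vote on their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature samples (made-up values, not taken from the dataset).
train_X = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

print(knn_predict(train_X, train_y, (2, 2)))
print(knn_predict(train_X, train_y, (8, 8)))
```

Because every feature contributes to the distance, removing irrelevant features, as the authors describe, changes which neighbours are considered nearest and can therefore change the prediction.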
E. Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis
In this paper, Hiba Asri et al. employed four main algorithms, SVM, Naïve Bayes, KNN, and C4.5, on the Wisconsin Breast Cancer (original) dataset. The authors compare the efficiency and effectiveness of those algorithms in terms of accuracy, precision, sensitivity, and specificity to find the best classifier. SVM reaches the highest accuracy, 97.13%. In conclusion, SVM proved its efficiency in breast cancer prediction and diagnosis and achieved the best performance in terms of precision and low error rate.
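The four measures named above follow directly from the counts of true and false positives and negatives. The sketch below uses the standard definitions. The function name `classification_metrics` and the example counts are hypothetical and do not come from any of the surveyed papers.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Compute standard binary-classification measures from raw counts."""
    return {
        # Fraction of all instances that were classified correctly.
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        # Fraction of predicted positives that are truly positive.
        "precision": _ratio(tp, tp + fp),
        # Fraction of actual positives that were detected (also called recall).
        "sensitivity": _ratio(tp, tp + fn),
        # Fraction of actual negatives that were correctly rejected.
        "specificity": _ratio(tn, tn + fp),
    }


# Hypothetical counts chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, tn=180, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a screening setting, sensitivity is often the measure to watch, because a false negative corresponds to a malignant tumor classified as benign.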
F. Early Diagnosis of Breast Cancer Prediction using Random Forest Classifier
In this paper, P. R. Anisha et al. used six machine learning algorithms to predict and diagnose breast cancer: Logistic Regression, Decision Tree, K-Nearest Neighbor, Naïve Bayes, Support Vector Classifier, and Random Forest Classifier. The authors obtained the highest accuracy, 98%, with the Random Forest classifier.
G. Performance Analysis of Different Classifiers in Prediction of Breast Cancer
In this paper, S. Roobini et al. analyse the performance of different classifiers in the prediction of breast cancer.
In this research, 10-fold cross-validation is used to validate the results. The dataset is randomly divided into ten equal subsets. In each round, one partition acts as the testing set, whereas the remaining partitions act as the training set used to train the model. A comparative study of the performance of the existing and proposed classification models is discussed based on accuracy, error rate, F-measure, precision, and recall. Accuracy quantifies how many of the given tuples are classified correctly. TP refers to positive tuples and TN refers to negative tuples that are correctly labeled by the classifier. Similarly, FP refers to negative tuples that are incorrectly labeled as positive, and FN refers to positive tuples that are incorrectly labeled as negative.
The performance of Fuzzy C-Means Clustering [FCM] with Naive Bayesian classifier provides a better prediction when compared to other classifiers.
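The 10-fold procedure described above can be sketched as follows. This is a minimal illustration of the partitioning step only, not the authors' code. The function name `k_fold_indices` and the fixed seed are assumptions. Because 699 is not divisible by ten, the folds here are near-equal rather than exactly equal in size.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffle once so that the partitioning is random but reproducible.
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


# 699 is the instance count of the Wisconsin dataset discussed in this paper.
splits = list(k_fold_indices(699, k=10))
for round_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(round_number, len(train_idx), len(test_idx))
```

Each instance is tested exactly once across the ten rounds, and the reported score is typically the average of the per-round scores.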
VI. PROPOSED ARCHITECTURE
To understand the efficiency of different algorithms, we construct a confusion matrix for each of them and compare Naïve Bayes, SVM (Support Vector Machine), KNN and Random Forest.
A. Confusion Matrix
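For a two-class problem, a confusion matrix tallies each prediction against the actual class. The following minimal sketch shows that tallying and the accuracy derived from it. The function name `confusion_matrix`, the choice of "malignant" as the positive class, and the six example labels are assumptions for illustration and are not results from this study.

```python
from collections import Counter


def confusion_matrix(actual, predicted, positive="malignant"):
    """Count TP, FP, FN and TN for a binary classification result."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct only if the actual class is positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: correct only if the actual class is negative.
            counts["FN" if a == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}


# Hypothetical labels, made up to exercise every cell of the matrix.
actual = ["malignant", "malignant", "benign", "benign", "benign", "malignant"]
predicted = ["malignant", "benign", "benign", "malignant", "benign", "malignant"]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["TP"] + cm["TN"]) / len(actual)

print(cm)
print(f"accuracy: {accuracy:.3f}")
```

Comparing algorithms by their full matrices rather than by accuracy alone shows whether errors fall mainly on missed malignant cases or on false alarms.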
VII. CONCLUSION
In this paper, we compared different machine learning algorithms to find the most accurate one for classifying the breast cancer dataset into two classes, benign and malignant. We ran these algorithms in the WEKA tool. The experiment shows a different accuracy for each algorithm. KNN achieved the highest accuracy, 97.6%.
REFERENCES
D.R. Umesh et al., “Big Data Analytics to Predict Breast Cancer Recurrence on SEER Dataset using MapReduce Approach”, International Journal of Computer Applications, volume 7, 2016.
https://my.clevelandclinic.org/health/diseases/3986-breast-cancer
https://www.cancer.net/cancer-types/breast-cancer/stages
https://jamanetwork.com/journals/jamaoncology/fullarticle/2768634
https://www.ibm.com/in-en/analytics/hadoop/big-data-analytics
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6340124/
Saria Eltalhi and Huda Kutrani, “Breast Cancer Diagnosis and Prediction Using Machine Learning and Data Mining Techniques: A Review”, IOSR Journal of Dental and Medical Sciences (IOSR JDMS), vol. 18, no. 04, 2019, pp. 85-94.
https://www.cdc.gov/cancer/breast/basic_info/symptoms.htm
https://www.researchgate.net/figure/Breast-cancer-dataset_tbl1_323952426
G. Sumalatha et al., “A Study on Early Prevention and Detection of Breast Cancer using Data Mining Techniques”, International Journal of Innovative Research in Computer and Communication Engineering, volume 5, 2017.
Hiba Asri, “Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis”, The 6th International Symposium on Frontiers in Ambient and Mobile Systems, pp. 1064-1069.
K. Shailaja, “Prediction of Breast Cancer Using Big Data Analytic”, International Journal of Engineering & Technology, volume 7, 2018.
S. Roobini and J. Fenila Naomi, “Performance Analysis of Different Classifier in Prediction of Breast Cancer”, International Journal of Science and Technology, volume 12(8), 2019.
Emanelwerfally, Huda Kutrani, Saria Eltalhi and Naeima Ashleik, “Predicting Breast Cancer Treatment Using Decision Tree Algorithms and Statistical Metrics”, IOSR Journal of Dental and Medical Sciences, vol. 20, 2021, pp. 48-54.
V. Sivakumar et al., “Feasibility Study on Data Mining Techniques in Diagnosis of Breast Cancer”, International Journal of Machine Learning and Computing, volume 9, 2019.
Copyright © 2022 Hardi Patel, Dr. Mehul P. Barot. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.