This study examines machine learning algorithms for estimating automobile prices using Python-based frameworks. It covers data collection, preprocessing, feature selection, model evaluation, and implementation. We compare how well several regression-based algorithms, namely Linear Regression, Random Forest, Gradient Boosting, Support Vector Regression, and K-Nearest Neighbors, predict automobile prices. The paper also addresses practical issues such as data quality, feature selection, and model interpretability. The findings contribute to a data-driven methodology that can help buyers, sellers, and automotive experts make better decisions through accurate, real-time price estimates.
Introduction
This research focuses on using machine learning (ML), particularly regression-based models, to predict used car prices, a task critical for manufacturers, dealerships, and consumers. Unlike traditional software, ML models learn from historical data to improve prediction accuracy. The study compares various regression models—Linear Regression, Random Forest, Gradient Boosting, Support Vector Regression, and K-Nearest Neighbors—using a cleaned dataset sourced from platforms like Kaggle and sahibinden.com.
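The model comparison described above can be sketched as follows. This is a minimal illustration using scikit-learn, not the study's actual pipeline: the synthetic data, the three features (age, mileage, engine power), the price formula, and all hyperparameters are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for a cleaned used-car dataset (assumed, not the study's data).
rng = np.random.default_rng(0)
n = 400
age = rng.uniform(0, 15, n)          # years
mileage = rng.uniform(0, 250, n)     # thousand km
power = rng.uniform(60, 300, n)      # horsepower
price = 30 - 1.2 * age - 0.05 * mileage + 0.08 * power + rng.normal(0, 1.5, n)
X = np.column_stack([age, mileage, power])

# 70/30 train/test split, as stated in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.3, random_state=0
)

# SVR and KNN are distance-based, so they are wrapped with feature scaling.
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Support Vector Regression": make_pipeline(StandardScaler(), SVR(C=10.0)),
    "K-Nearest Neighbors": make_pipeline(
        StandardScaler(), KNeighborsRegressor(n_neighbors=5)
    ),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} R^2 = {score:.3f}")
```

Because the synthetic price here is linear in the features, the ranking it prints says nothing about which model performs best on real listings.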
The used car market is growing rapidly, driven by factors such as high new car prices and the rise of electric vehicles, making accurate price prediction crucial. Existing online valuation platforms show inconsistent pricing, which this study aims to improve by leveraging advanced ML techniques.
Data preprocessing involved cleaning, feature selection, and encoding, resulting in a dataset with detailed vehicle attributes (e.g., brand, mileage, engine power, advanced features). Automated web scraping was used for efficient data collection. The models were trained on 70% of the data and tested on 30%.
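A minimal sketch of these preprocessing steps, assuming pandas and scikit-learn, is shown below. The column names (`brand`, `mileage_km`, `engine_power_hp`, `price`) and the sample rows are hypothetical; the study's actual schema and cleaning rules are not given in the text.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw listings: one duplicate row and two rows with missing values.
raw = pd.DataFrame({
    "brand": ["Fiat", "BMW", "Fiat", "Renault", "BMW", "Renault", "Fiat",
              "BMW", "Renault", "Fiat", None, "BMW", "Fiat"],
    "mileage_km": [120000, 45000, 120000, 150000, 30000, 110000, 95000,
                   60000, 80000, 70000, 80000, None, 135000],
    "engine_power_hp": [95, 190, 95, 75, 245, 115, 100,
                        150, 90, 105, 110, 170, 85],
    "price": [9500, 31000, 9500, 6400, 42000, 9900, 10200,
              27500, 8700, 11800, 12000, 29000, 7900],
})

# Cleaning: drop exact duplicates and rows with missing values.
clean = raw.drop_duplicates().dropna().reset_index(drop=True)

X = clean.drop(columns="price")
y = clean["price"]

# 70% of the rows for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Encoding: one-hot encode the brand, standardize the numeric columns.
# The encoder is fitted on the training split only to avoid leaking test data.
encoder = ColumnTransformer([
    ("brand", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
    ("numeric", StandardScaler(), ["mileage_km", "engine_power_hp"]),
])
X_train_enc = encoder.fit_transform(X_train)
X_test_enc = encoder.transform(X_test)

print(X_train_enc.shape, X_test_enc.shape)
```

Setting `handle_unknown="ignore"` is one way to keep the transform from failing when a brand appears only in the test split; other strategies, such as grouping rare brands, may suit a real dataset better.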
Random Forest (RF) and Support Vector Machine (SVM) were emphasized for their strong performance. RF, an ensemble of decision trees, reduces overfitting and handles large, high-dimensional datasets well, making it suitable for price forecasting. SVM excels at classification and regression with well-structured data. The study found individual classifiers effective but limited in accuracy, leading to the proposal of an ensemble approach combining RF, SVM, and Artificial Neural Networks (ANN) for better prediction.
The ensemble model introduces a new categorical variable, "price rank," classifying cars into cheap, moderate, and expensive groups to improve prediction interpretability and accuracy. This integrated approach helps manufacturers analyze demand trends and aids buyers and sellers in making informed decisions, contributing to the digital transformation of the automotive sector.
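The text does not state how the boundaries between the price groups are chosen. As one possible illustration, the sketch below derives a "price rank" column by splitting hypothetical prices into three equally sized groups (tertiles) with pandas; the tertile rule and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical listing prices (currency units are arbitrary).
prices = pd.Series([6400, 8700, 9500, 9900, 10200, 12000, 27500, 31000, 42000])

# Assumed rule: split at the 1/3 and 2/3 quantiles so each group has equal size.
price_rank = pd.qcut(prices, q=3, labels=["cheap", "moderate", "expensive"])

summary = pd.DataFrame({"price": prices, "price_rank": price_rank})
print(summary)
```

Fixed thresholds via `pd.cut` would be an alternative when the class boundaries should stay stable as new listings arrive.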
Conclusion
Estimating car prices poses a complex challenge, primarily due to the diverse attributes that affect a vehicle's market valuation. This research emphasizes that thorough data collection and preprocessing are essential steps in improving the precision of machine learning-based forecasts. By creating Python scripts for normalizing, cleaning, and organizing the raw data, we significantly enhanced the dataset’s quality, ensuring its suitability for machine learning analysis. While these preprocessing steps reduced noise and discrepancies, they could not entirely eliminate the complexities inherent in such a varied dataset.
Initial trials with individual machine learning classifiers, namely Random Forest (RF), Support Vector Machine (SVM), and Artificial Neural Network (ANN), showed moderate predictive capability. However, their accuracy was still lacking, especially in capturing nuanced market dynamics. Acknowledging the drawbacks of relying on a single classifier, this study proposed an ensemble method that combines RF, SVM, and ANN. This hybrid model draws on the strengths of each algorithm and achieved a prediction accuracy of up to 92.38%, a significant improvement over the individual classifiers. To implement this ensemble method, we converted the continuous price variable into categorical classes (Budget, Mid-Range, and Premium), creating a more structured and interpretable classification system. The ensemble model’s success illustrates its effectiveness in managing high-dimensional, real-world data while providing a scalable and dependable framework for practical applications.
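The text does not specify how the three classifiers are combined. One common option is majority voting, sketched below with scikit-learn's `VotingClassifier`. The synthetic data, the tertile-based class boundaries, the network size, and the hard-voting rule are all assumptions for illustration, so the accuracies it prints are unrelated to the 92.38% reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the vehicle data (assumed, not the study's dataset).
rng = np.random.default_rng(42)
n = 600
age = rng.uniform(0, 15, n)
mileage = rng.uniform(0, 250, n)
power = rng.uniform(60, 300, n)
price = 30 - 1.2 * age - 0.05 * mileage + 0.08 * power + rng.normal(0, 1.5, n)
X = np.column_stack([age, mileage, power])

# Convert the continuous price into three classes using tertiles (assumed rule).
edges = np.quantile(price, [1 / 3, 2 / 3])
classes = np.array(["Budget", "Mid-Range", "Premium"])
y = classes[np.digitize(price, edges)]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Base classifiers; SVM and ANN are scale-sensitive, so they get a scaler.
members = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
    ("ann", make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    )),
]

# Hard voting: each classifier casts one vote and the majority class wins.
ensemble = VotingClassifier(estimators=members, voting="hard")

# Evaluate each member on its own, then the ensemble (which refits clones).
accuracy = {}
for name, model in members + [("ensemble", ensemble)]:
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in accuracy.items():
    print(f"{name:10s} accuracy = {acc:.3f}")
```

Soft voting (averaging predicted probabilities) or stacking would be alternative ways to combine the same three models; each would require `SVC(probability=True)` or a meta-learner, respectively.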
It is crucial to acknowledge, however, that this enhanced performance requires additional computational resources. The ensemble model necessitates more processing time and memory in comparison to single classifier approaches. Nonetheless, the trade-off is warranted given the considerable improvements in accuracy and reliability.
In summary, this research affirms the efficacy of ensemble learning in predicting automotive prices and lays the groundwork for future investigations. Integrating deeper neural networks, real-time pricing data, and economic indicators could further enhance the adaptability and accuracy of such predictive systems, making them essential tools for stakeholders in the automotive sector.