Accurate real estate price prediction helps buyers, sellers, and investors make informed decisions. This study applies three machine learning algorithms (Linear Regression, Decision Tree Regression, and Random Forest Regression) to model and predict housing prices from a set of influential features. The methodology covers data preprocessing, feature engineering, and model evaluation using two standard metrics, R² score and Root Mean Squared Error (RMSE). The models are trained on the Boston Housing dataset, and the results show that ensemble learning outperforms a traditional linear approach. Random Forest gives the most accurate predictions of the three models and is suitable for practical applications in real estate.
Introduction
1. Objective
The study explores automated real estate price prediction using machine learning, aiming to overcome the limitations of traditional methods that rely on human judgment and comparable sales analysis. It compares three regression models:
Linear Regression
Decision Tree Regression
Random Forest Regression
The goal is to assess their performance and suitability for real-world deployment in housing price prediction.
2. Dataset: Boston Housing Dataset
Total Records: 506
Features: 13 predictors + 1 target (MEDV, median home value)
No missing values
Key Features:
RM: Avg. number of rooms per dwelling
LSTAT: % of lower-status population
CHAS: Binary indicator of whether the tract bounds the Charles River
Other economic and geographic features, such as per-capita crime rate (CRIM), property-tax rate (TAX), and accessibility to radial highways (RAD)
Data Preprocessing:
Normalized features using Min-Max scaling
Removed outliers in variables such as CRIM, TAX, and LSTAT
CHAS required no encoding (already binary)
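The preprocessing steps above could be sketched as follows. This is a minimal illustration rather than the study's actual code: the outlier rule (1.5 × IQR fences) is an assumption, since the text does not say which criterion was used, and the function names and toy values are hypothetical. The scaler is fitted on the training rows only, which is one common way to keep test data from influencing the scaling.

```python
import numpy as np


def remove_iqr_outliers(X, cols, k=1.5):
    """Drop rows whose value in any of `cols` lies outside the k*IQR fences.

    Assumption: the outlier criterion is not stated in the text; the
    1.5*IQR rule is used here purely for illustration.
    """
    X = np.asarray(X, dtype=float)
    keep = np.ones(len(X), dtype=bool)
    for c in cols:
        q1, q3 = np.percentile(X[:, c], [25, 75])
        iqr = q3 - q1
        keep &= (X[:, c] >= q1 - k * iqr) & (X[:, c] <= q3 + k * iqr)
    return X[keep], keep


def min_max_scale(X_train, X_other):
    """Scale each column to [0, 1] using the training minimum and maximum."""
    lo = X_train.min(axis=0)
    hi = X_train.max(axis=0)
    # Constant columns get a span of 1 to avoid division by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X_train - lo) / span, (X_other - lo) / span


# Toy rows standing in for (CRIM, TAX); the last row has an extreme CRIM value.
X = np.array([
    [0.1, 300.0],
    [0.2, 310.0],
    [0.3, 290.0],
    [0.2, 305.0],
    [50.0, 300.0],
])
clean, keep = remove_iqr_outliers(X, cols=[0])
train_scaled, test_scaled = min_max_scale(clean[:3], clean[3:])
```

Values in the held-out rows can fall slightly outside [0, 1] when they exceed the training range; that is expected with this approach.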
3. Related Work
Linear Regression is easy to interpret but fits non-linear relationships poorly (Kumar & Singh, 2020).
Tree-based models such as Random Forest handle complex feature interactions and outperform simpler regressors (Zhang & Lee, 2019).
Deep Learning models perform well but need more data and lack interpretability (Patel & Mehta, 2021).
This paper focuses on interpretable, effective models for practical use.
4. Methodology
Model Training: 80/20 train-test split with 5-fold cross-validation
Random Forest Tuning: Used GridSearchCV for best parameters (e.g., number of trees, depth)
Models Used:
Linear Regression – Simple, interpretable baseline
Decision Tree – Captures non-linear patterns but can overfit
Random Forest – Ensemble model that improves accuracy and generalization
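The training setup described in this section could look roughly like the sketch below, using scikit-learn. Several details are assumptions: the data is a synthetic stand-in with the same shape as the housing data (506 rows, 13 features), the parameter grid values are hypothetical because the text names the tuned parameters but not the values tried, and cross-validation is applied to the training portion only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption: same shape only;
# the values are not real housing records).
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# 80/20 hold-out split; the test set is not used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid over the number of trees and the maximum depth.
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,  # 5-fold cross-validation within the training portion
    scoring="r2",
)
search.fit(X_train, y_train)

models = {
    "Linear Regression": LinearRegression().fit(X_train, y_train),
    "Decision Tree": DecisionTreeRegressor(random_state=0).fit(X_train, y_train),
    # best_estimator_ is refitted on the full training portion by default.
    "Random Forest": search.best_estimator_,
}

for name, model in models.items():
    # .score() returns R² for scikit-learn regressors.
    print(f"{name}: test R² = {model.score(X_test, y_test):.3f}")
```

Because the stand-in data is generated from a linear model, the scores printed here say nothing about how the three models rank on real housing data; the sketch only shows the workflow.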
5. Evaluation Metrics
Two key metrics are used for performance comparison:
R² Score (Coefficient of Determination):
Measures how well the model explains variance in the target.
Higher R² = better fit
RMSE (Root Mean Squared Error):
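Indicates the typical size of prediction errors, in the units of the target (MEDV is expressed in $1000s). Because errors are squared before averaging, large errors are penalized more heavily.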
Lower RMSE = more accurate predictions
Why Both?
R² shows how well the model fits and RMSE shows how large the errors are; together, they give a fuller picture of performance.
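Both metrics can be computed directly from their definitions, as in the minimal sketch below. The function names and the toy values are hypothetical and are not taken from the study's data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return float(1.0 - ss_res / ss_tot)


# Toy values in $1000s (made up for illustration).
y_true = [20.0, 30.0, 40.0, 50.0]
y_pred = [22.0, 28.0, 42.0, 48.0]

print(rmse(y_true, y_pred))                 # 2.0
print(round(r_squared(y_true, y_pred), 3))  # 0.968
```

Here every prediction is off by 2, so the RMSE is 2 (that is, $2,000), while R² compares the squared errors (16 in total) with the total variation of the target around its mean (500).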
Conclusion
This study demonstrates the effectiveness of machine learning algorithms in predicting real estate prices using structured housing data. Using the Boston Housing dataset, we evaluated three regression models: Linear Regression, Decision Tree Regression, and Random Forest Regression. The models were assessed based on their R² Score and RMSE performance on test data.
The results indicate that Random Forest Regression outperforms the other models, achieving the highest R² score (0.89) and the lowest RMSE (4.12). Its ensemble nature reduces overfitting and improves generalization, making it well suited to regression tasks with moderately sized datasets. Linear Regression served as a reliable and interpretable baseline, while Decision Tree Regression showed signs of overfitting. The top features influencing housing prices were the number of rooms (RM), the percentage of lower-status population (LSTAT), and the pupil-teacher ratio (PTRATIO). These findings reaffirm that both physical and socioeconomic attributes affect real estate value.
Overall, this research highlights how data-driven approaches can augment or replace traditional real estate valuation techniques. When properly trained and validated, machine learning models offer fast, accurate, and scalable solutions for property price estimation.
References
[1] D. Harrison and D. L. Rubinfeld, “Hedonic prices and the demand for clean air,” Journal of Environmental Economics and Management, vol. 5, pp. 81–102, 1978.
[2] D. A. Belsley, E. Kuh, and R. E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley, 1980.
[3] J. R. Quinlan, “Combining instance-based and model-based learning,” in Proc. 10th Int. Conf. on Machine Learning, Amherst, MA, 1993, pp. 236–243.
[4] A. Kumar and R. Singh, “Real Estate Valuation Using Machine Learning,” International Journal of Engineering Research & Technology (IJERT), vol. 9, no. 3, pp. 24–29, 2020.
[5] L. Zhang and H. Lee, “Comparative Study on Regression Models in Real Estate,” IEEE Access, vol. 7, pp. 106123–106132, 2019.
[6] S. Patel and V. Mehta, “Deep Learning for Housing Price Estimation,” Elsevier Journal of Advanced Computational Intelligence, vol. 33, no. 4, pp. 1012–1020, 2021.
[7] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd ed., O’Reilly Media, 2019.
[8] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, 2nd ed., Springer, 2021.