This paper compares the Mean Square Error (MSE) and Least Square Error (LSE) loss functions when modeling a noisy sine wave. A sine curve corrupted by Gaussian noise is generated, and two simple linear regression models, implemented in Python on Google Colab, are fit to it: one optimizing MSE and the other LSE. Their performance is evaluated using Mean Absolute Error (MAE) and R-squared (R²). The experimental results offer insights into the efficiency and effectiveness of each loss function in capturing underlying trends within noisy data.
Introduction
This study explores the impact of loss function choice—specifically Mean Square Error (MSE) and Least Square Error (LSE)—on model performance and convergence in regression tasks using noisy sine wave data. Loss functions are fundamental to machine learning, guiding optimization by quantifying discrepancies between predicted and true values.
MSE computes the average of squared errors, which leads to smoother gradient updates and more stable convergence.
LSE, defined here as the sum of squared errors, lacks this averaging: its gradients scale with the number of samples, making it more sensitive to noise and prone to unstable training behavior.
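The distinction can be made concrete with a small numeric sketch (the values below are illustrative and not taken from the paper's experiment): LSE equals the sample count times MSE, so at the same parameters its gradients are proportionally larger.

```python
import numpy as np

# Toy predictions and targets (illustrative values only)
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([0.1, 0.9, 2.2, 2.8])

errors = y_pred - y_true

mse = np.mean(errors ** 2)  # averaged: independent of sample count
lse = np.sum(errors ** 2)   # summed: grows with the number of samples

# LSE equals n * MSE, so its gradients are n times larger at the same parameters
assert np.isclose(lse, len(y_true) * mse)
```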
Literature Context
Seminal works, e.g., Bishop [1], Hastie, Tibshirani, and Friedman [3], and Goodfellow et al. [2], emphasize that:
MSE’s convex and differentiable properties make it ideal for gradient-based optimization.
LSE, by not normalizing errors by the sample count, introduces greater variance in gradient updates, particularly in noisy settings.
Deep learning frameworks further validate that the choice of loss function affects training stability, speed, and outcome quality.
Methodology
A synthetic noisy sine wave dataset was generated.
A simple linear regression model (PyTorch) was used to isolate the effects of the loss functions.
Models were trained using Stochastic Gradient Descent with identical settings (learning rate = 0.01, 1000 epochs).
Performance was evaluated using Mean Absolute Error (MAE) and R-squared (R²) metrics.
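The pipeline described above can be sketched as follows. The learning rate (0.01) and epoch count (1000) follow the text; the data range, sample count, noise level, and seed are assumptions, since the paper does not state them, so the resulting metrics will not reproduce the reported figures exactly. Note that with a summed loss, a fixed learning rate of 0.01 can make full-batch gradient descent diverge outright, which is itself an instance of the instability reported in the Findings.

```python
import torch

torch.manual_seed(0)

# Synthetic noisy sine wave; range, size, and noise level are assumptions
x = torch.linspace(0, 2 * torch.pi, 200).unsqueeze(1)
y = torch.sin(x) + 0.3 * torch.randn_like(x)

def train(reduction):
    """Fit y = wx + b by SGD under a 'mean' (MSE) or 'sum' (LSE) loss."""
    model = torch.nn.Linear(1, 1)
    torch.nn.init.zeros_(model.weight)
    torch.nn.init.zeros_(model.bias)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)  # settings from the paper
    for _ in range(1000):
        opt.zero_grad()
        sq = (model(x) - y) ** 2
        loss = sq.mean() if reduction == "mean" else sq.sum()
        loss.backward()
        opt.step()
    return model

def metrics(model):
    """Mean Absolute Error and R-squared on the training data."""
    with torch.no_grad():
        pred = model(x)
    mae = (pred - y).abs().mean().item()
    ss_res = ((y - pred) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return mae, (1.0 - ss_res / ss_tot).item()

mse_model = train("mean")
lse_model = train("sum")  # may diverge at this learning rate: unscaled gradients
```

Comparing `metrics(mse_model)` against `metrics(lse_model)` reproduces the qualitative pattern reported in the Conclusion: the averaged loss converges smoothly while the summed loss is unstable at the same learning rate.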
Findings
Convergence: MSE led to smoother and faster convergence. LSE exhibited erratic loss behavior due to unscaled gradients.
Sensitivity to Noise: LSE’s sensitivity to outliers resulted in higher MAE and lower R² scores.
Model Isolation: Using a simple linear model ensured that performance differences were due solely to the loss function.
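The gradient-scale argument behind the first finding can be checked directly with autograd: under a summed loss, every gradient is exactly n times its averaged counterpart, so a learning rate tuned for MSE is effectively n times too large for LSE. The snippet below is a minimal sketch with assumed data (n = 100 points over one period of the sine).

```python
import torch

n = 100
x = torch.linspace(0, 2 * torch.pi, n).unsqueeze(1)
y = torch.sin(x)

w = torch.zeros(1, requires_grad=True)
sq = (x * w - y) ** 2  # squared errors of a through-origin linear model

# Gradient of the averaged (MSE-style) loss with respect to the weight
g_mse = torch.autograd.grad(sq.mean(), w, retain_graph=True)[0]
# Gradient of the summed (LSE-style) loss with respect to the same weight
g_lse = torch.autograd.grad(sq.sum(), w)[0]

# The summed loss multiplies every gradient by n
assert torch.allclose(g_lse, n * g_mse)
```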
Conclusion
The experimental results demonstrate that the choice of loss function can have a significant impact on model performance, particularly in noisy environments. The MSE-based model achieved a final loss of 0.1938, an MAE of 0.3985, and an R² score of 0.5892, indicating relatively stable convergence and a better fit to the noisy sine curve. In contrast, the LSE-based model, with a final loss of 48.9643, an MAE of 0.6194, and an R² score of -0.0251 (worse than simply predicting the mean of the data), exhibited unstable training dynamics and poor predictive performance. The averaging mechanism inherent in MSE appears to smooth out gradient fluctuations, thereby enhancing training stability and the quality of the final fit.
References
[1] Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
[2] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[3] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer, 2009.
[4] Scikit-learn. “scikit-learn: Machine Learning in Python.” scikit-learn.org, 2021, https://scikit-learn.org/stable/documentation.html.