This paper presents the development and evaluation of a RandomForestClassifier for identifying black holes using a combined dataset from NASA/ESA observations and additional astronomical catalogs. The dataset, comprising 894 samples (447 black holes and 447 synthetic non-black holes), was preprocessed to handle missing values and scaled for machine learning. The model achieved an accuracy of 0.9535 on a test set of 86 samples (82 of 86 classified correctly), with perfect recall for non-black holes and high precision for black holes. The study highlights the risk of overfitting and proposes future improvements.
Introduction
This research focuses on identifying black holes using machine learning, aggregating data from NASA/ESA’s “Black Holes Observed So Far” dataset and supplementary astronomical catalogs (SIMBAD, VizieR). After preprocessing, the black-hole records were balanced with synthetic non-black-hole samples, yielding a final dataset of 894 samples and 419 features.
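The preprocessing described above (handling missing values, then scaling) can be sketched as follows. This is a minimal illustration rather than the pipeline used in this study: the feature matrix is random placeholder data with the same shape as the reported dataset, and median imputation followed by standardization is an assumed choice, since the exact imputation and scaling methods are not stated here.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder feature matrix standing in for the merged catalog data.
# The shape mirrors the reported dataset (894 samples, 419 features);
# the values are random and purely illustrative.
X = rng.normal(size=(894, 419))
X[rng.random(X.shape) < 0.05] = np.nan  # simulate missing catalog entries

# Balanced labels: 1 = black hole, 0 = synthetic non-black hole.
y = np.array([1] * 447 + [0] * 447)

# Assumed preprocessing: fill gaps with the column median, then
# standardize each feature to zero mean and unit variance.
preprocess = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
)
X_clean = preprocess.fit_transform(X)

print(X_clean.shape, int(np.isnan(X_clean).sum()))
```

In practice the imputer and scaler would be fitted on the training split only and then applied to the test split, so that test-set statistics do not leak into training.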
A RandomForestClassifier was chosen for its robustness and trained on 80% of the data with hyperparameters tuned for generalization. The model achieved a test accuracy of 95.35%, with strong precision, recall, and F1 scores across classes. Confusion matrix and feature importance analyses were conducted, and the model was saved for future use.
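The training and evaluation workflow described above might look roughly like the sketch below. It runs on synthetic data from `make_classification`, not the actual catalog data, and the hyperparameter values (`n_estimators`, `max_depth`, `min_samples_leaf`) and the output filename are hypothetical placeholders, because the tuned values and file path are not given here.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the reported shape (894 samples, 419 features).
X, y = make_classification(
    n_samples=894, n_features=419, n_informative=20,
    weights=[0.5, 0.5], random_state=0,
)

# Stratified 80/20 split keeps both classes balanced in each subset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical hyperparameters: limiting depth and leaf size is one common
# way to trade training fit for generalization in a random forest.
model = RandomForestClassifier(
    n_estimators=200, max_depth=10, min_samples_leaf=2, random_state=0
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
report = classification_report(
    y_test, y_pred, target_names=["non-black hole", "black hole"]
)

# Indices of the ten features with the highest impurity-based importance.
top_features = np.argsort(model.feature_importances_)[::-1][:10]

# Persist the fitted model; the filename is a placeholder.
model_path = os.path.join(tempfile.gettempdir(), "black_hole_rf.joblib")
joblib.dump(model, model_path)

print(f"accuracy: {accuracy:.4f}")
print(cm)
print(report)
```

An 80/20 split of 894 samples produces 179 test samples in this sketch, so it does not attempt to reproduce the 86-sample test set reported in this paper.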
Despite these promising results, concerns about overfitting remain because earlier versions of the model achieved perfect accuracy. Planned improvements include feature selection, evaluation of alternative classifiers, and refinement of the synthetic data generation.
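One way to probe the overfitting concern and try feature selection at the same time is to compare training and validation accuracy under cross-validation, with the selection step placed inside the pipeline. The sketch below is an assumed approach, not a procedure taken from this study: it uses synthetic data, a median importance threshold, and five stratified folds, all chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# Synthetic stand-in with the reported shape (894 samples, 419 features).
X, y = make_classification(
    n_samples=894, n_features=419, n_informative=20,
    weights=[0.5, 0.5], random_state=0,
)

# Feature selection sits inside the pipeline so that it is refitted on each
# training fold and never sees the corresponding validation fold.
pipeline = make_pipeline(
    SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold="median",  # keep features at or above median importance
    ),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    pipeline, X, y, cv=cv, scoring="accuracy", return_train_score=True
)

train_mean = scores["train_score"].mean()
test_mean = scores["test_score"].mean()
gap = train_mean - test_mean  # a large gap suggests overfitting

print(f"train: {train_mean:.4f}  validation: {test_mean:.4f}  gap: {gap:.4f}")
```

A near-perfect training score combined with a clearly lower validation score would point to overfitting, while near-perfect scores on both may indicate that the synthetic negatives are too easy to separate from the real black-hole records.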
Conclusion
This study demonstrates a viable machine learning approach to black hole classification, achieving 95.35% accuracy on the held-out test set. The methodology and resources provide a foundation for further astrophysical research, with opportunities to refine both the model and the dataset.