Conventional machine learning models generally treat the rows of a tabular dataset as independent, ignoring any underlying connections among similar data instances. This study presents an architecture that converts a tabular dataset into a graph by applying a K-Nearest Neighbor (KNN) method, so that a Graph Neural Network (GNN) built from GraphSAGE (SAGEConv) layers can be used to predict electric vehicle adoption patterns. The model is validated against actual adoption data (Electric Vehicle Population Data), with each prediction drawing on both the features of the vehicle data instance itself and its local geographical and manufacturer neighborhood.
Introduction
This study proposes a Graph Neural Network (GNN)-based framework for improving prediction on structured tabular data by transforming it into a graph representation. While traditional machine learning models such as XGBoost and Random Forest assume that data instances are independent, the proposed approach captures relationships between similar data points, enabling more informative predictions through neighborhood learning.
The framework first preprocesses the tabular dataset by handling missing values, applying one-hot encoding to categorical features, and standardizing numerical attributes. It then constructs a K-Nearest Neighbor (KNN) graph, where each data instance becomes a node connected to its five nearest neighbors (K = 5) using Euclidean distance. This graph captures similarities among instances that conventional models cannot exploit.
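The graph construction step can be illustrated with a minimal NumPy sketch. It assumes the features are already numeric (that is, missing values handled and categorical columns one-hot encoded), and it uses a brute-force distance matrix that is practical only for small tables. The function names `standardize` and `knn_edge_index` and the random input are hypothetical and are not taken from the study's implementation; in practice a library routine such as scikit-learn's `kneighbors_graph` would likely be preferred.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - mean) / std

def knn_edge_index(X, k=5):
    """Return a (2, n*k) array of directed edges.

    Row 0 holds each neighbor (source); row 1 holds the node that
    receives its message (target).
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances via broadcasting (O(n^2) memory)
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # a node is never its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]  # shape (n, k)
    targets = np.repeat(np.arange(n), k)
    sources = neighbors.reshape(-1)
    return np.stack([sources, targets])

# Hypothetical stand-in for the preprocessed table: 20 rows, 4 numeric features
rng = np.random.default_rng(0)
X = standardize(rng.normal(size=(20, 4)))
edge_index = knn_edge_index(X, k=5)
print(edge_index.shape)  # (2, 100)
```

Each node receives exactly K incoming edges, but the graph is not necessarily symmetric: a node may appear in another node's neighbor list without the reverse being true.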
A GraphSAGE (SAGEConv)-based Graph Neural Network is used to learn node representations by aggregating information from neighboring nodes. The network uses ReLU activations, with batch normalization and dropout to improve generalization and reduce overfitting, and it is trained with the Adam optimizer under a cross-entropy loss. Hyperparameters such as the learning rate, hidden layer size, and dropout rate are tuned automatically using Optuna.
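To make the neighborhood aggregation concrete, the following is a minimal NumPy sketch of a single GraphSAGE-style layer using the mean aggregator described by Hamilton et al. [2]. It is a forward pass only: batch normalization, dropout, and training are omitted, and the function name `sage_mean_layer`, the toy graph, and the identity weights are illustrative assumptions rather than the configuration used in the study.

```python
import numpy as np

def sage_mean_layer(H, edge_index, W_self, W_neigh, bias):
    """One GraphSAGE-style layer: mean aggregation followed by ReLU."""
    n = H.shape[0]
    src, dst = edge_index
    # Sum the features of incoming neighbors for each destination node
    agg = np.zeros_like(H)
    np.add.at(agg, dst, H[src])
    # Divide by in-degree to turn the sum into a mean
    # (nodes with no incoming edges keep a zero vector)
    deg = np.bincount(dst, minlength=n).astype(H.dtype)
    agg = agg / np.maximum(deg, 1.0)[:, None]
    # Combine the node's own features with the aggregated neighborhood
    out = H @ W_self + agg @ W_neigh + bias
    return np.maximum(out, 0.0)  # ReLU

# Toy graph (hypothetical): edges 1 -> 0, 2 -> 0, 0 -> 1
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
edge_index = np.array([[1, 2, 0],   # sources
                       [0, 0, 1]])  # targets
W_self = np.eye(2)
W_neigh = np.eye(2)
bias = np.zeros(2)

out = sage_mean_layer(H, edge_index, W_self, W_neigh, bias)
print(out)
# [[1.5 1. ]
#  [1.  1. ]
#  [1.  1. ]]
```

Node 0 combines its own features with the average of nodes 1 and 2, while node 2, which has no incoming edges, keeps only its own features. Stacking two such layers lets each node draw on information from two hops away.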
The methodology was evaluated using an Electric Vehicle Population dataset. Exploratory data analysis examined EV registration growth, manufacturer distribution, and feature correlations before graph construction. During training, GraphSAGE recursively aggregates neighboring node features to classify vehicle adoption tiers, while t-SNE visualization is used to assess the quality of learned node embeddings.
Experimental results indicate that converting tabular data into a graph lets the model capture structural dependencies that traditional tabular learning methods ignore. The GraphSAGE model produced well-separated clusters in t-SNE visualizations, which suggests effective feature learning and improved class discrimination. Overall, the study shows that graph-based learning is a promising alternative to conventional machine learning models for tabular datasets, because it leverages relationships between similar instances to enhance prediction performance.
Conclusion
In conclusion, this work shows that the rows of a tabular dataset need not be treated as isolated, independent instances. Converting table entries into a graph of multi-dimensional similarities with the K-Nearest Neighbor algorithm made it possible to apply Graph Neural Networks such as GraphSAGE to tabular data classification. The resulting graph embeddings were found to perform comparably to traditional gradient boosting algorithms, within certain constraints, while incorporating local neighborhood structure that conventional pipelines lack. Future iterations of the architecture can explore dynamic edge-weighting schemes and much larger datasets.
References
[1] Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[2] Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive Representation Learning on Large Graphs. Advances in Neural Information Processing Systems (NeurIPS).
[3] Kipf, T. N., & Welling, M. (2016). Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations (ICLR).
[4] Fey, M., & Lenssen, J. E. (2019). Fast Graph Representation Learning with PyTorch Geometric. ICLR Representation Learning on Graphs and Manifolds Workshop.
[5] Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD.
[6] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, G. (2008). The Graph Neural Network Model. IEEE Transactions on Neural Networks, 20(1), 61-80.
[7] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., & Bengio, Y. (2017). Graph Attention Networks. International Conference on Learning Representations (ICLR).
[8] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., & Grisel, O. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
[9] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., & Chanan, G. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32.
[10] Shrikumar, A., Greenside, P., & Kundaje, A. (2017). Learning Important Features Through Propagating Activation Differences. International Conference on Machine Learning (ICML).
[11] Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for Hyper-Parameter Optimization. Advances in Neural Information Processing Systems.
[12] Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd International Conference on Machine Learning (ICML).
[13] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
[14] Van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(11), 2579-2605.
[15] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
[16] Loshchilov, I., & Hutter, F. (2016). SGDR: Stochastic Gradient Descent with Warm Restarts. International Conference on Learning Representations (ICLR).
[17] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems.
[18] You, J., Ying, R., & Leskovec, J. (2020). Design Space for Graph Neural Networks. Advances in Neural Information Processing Systems (NeurIPS).
[19] Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., & Adams, R. P. (2015). Convolutional Networks on Graphs for Learning Molecular Fingerprints. Advances in Neural Information Processing Systems.
[20] Li, P., Wang, Y., et al. (2024). Graph Neural Networks for Tabular Data Learning: A Survey with Taxonomy and Directions. ACM Computing Surveys.