Autism Spectrum Disorder (ASD) is a developmental condition affecting communication, behavior, and social interaction. Early detection is crucial for timely intervention, yet it is often delayed by a shortage of specialists and the difficulty of recognizing symptoms. In this study, we propose a machine learning-based approach using Natural Language Processing (NLP) and PySpark to analyze unstructured text data from online forums, social media, and caregiver reports. Our methodology involves data collection, feature selection, and classification using deep learning models such as LSTM-RNN. By leveraging PySpark’s scalability, we process large text datasets efficiently to identify linguistic markers of ASD. The goal is to enhance early autism detection by analyzing caregiver-reported observations, ultimately supporting early intervention efforts. Future research will explore advanced ML techniques to reduce overfitting and improve model performance. This study contributes to ASD research by demonstrating the potential of NLP-driven approaches for scalable and automated autism detection.
Introduction
Autism Spectrum Disorder (ASD) is a neurodevelopmental condition marked by challenges in communication and social interaction and by repetitive behaviors. Early diagnosis is vital but often delayed because of a shortage of specialists, social stigma, and the difficulty of recognizing symptoms. Recently, unstructured texts written by caregivers on digital platforms (social media, blogs, forums) have emerged as valuable sources for early ASD detection.
This research proposes using Natural Language Processing (NLP) and machine learning (ML), particularly deep learning models like LSTM-RNN, combined with PySpark’s big data capabilities, to analyze large-scale caregiver narratives for early ASD indicators. The system involves collecting and preprocessing textual data, extracting linguistic features, developing classification models, and evaluating their performance to accurately identify potential ASD cases.
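The preprocessing and feature-extraction steps described above can be sketched on a single machine as follows. This is a minimal illustration, not the study's implementation: the stop-word list, the function names `tokenize` and `tf_idf`, and the example sentences are all assumptions introduced here. The IDF term uses the smoothed form log((N + 1) / (df + 1)), which is the form documented for Spark's `IDF` estimator; in a PySpark pipeline the same stages would typically be expressed with `Tokenizer`, `StopWordsRemover`, `HashingTF`, and `IDF` from `pyspark.ml.feature`.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration; "not" is kept on purpose
# because negation matters in caregiver descriptions.
STOP_WORDS = {"a", "an", "and", "at", "does", "he", "his", "my",
              "she", "the", "to", "up", "when"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower())
            if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document using smoothed IDF."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty posts
        vectors.append({
            term: (count / total)
            * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return vectors


# Invented example sentences, not drawn from any dataset.
posts = [
    "My son does not respond to his name and avoids eye contact.",
    "She lines up toys and repeats the same phrase.",
    "He waves at friends and makes eye contact when greeting.",
]
features = tf_idf(posts)
```

Terms that appear in fewer posts receive a higher weight, so in the first post "name" outweighs "eye", which also occurs in the third post. The resulting sparse vectors are the kind of input a downstream classifier would consume.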
Traditional clinical assessments are time-consuming and limited by expert availability, whereas AI-driven automated text analysis offers a scalable, real-time alternative to support early detection and timely intervention. The methodology also includes federated learning with classifiers like SVM and Logistic Regression, enabling privacy-preserving, collaborative model training across multiple clients.
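To make the idea of privacy-preserving, collaborative training concrete, the sketch below trains a logistic regression model across two clients using federated averaging, one common aggregation scheme. The text does not state which aggregation method the study uses, so federated averaging is an assumption here, as are the function names, hyperparameters, and toy data. Each client runs gradient descent on its own examples, and only the model weights are sent to the server.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def local_update(weights, data, lr=0.5, epochs=20):
    """Run a few epochs of gradient descent on one client's private data."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


def federated_average(client_weights, client_sizes):
    """Average client models, weighting each by its number of examples."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


def train_federated(clients, dim, rounds=10):
    """Alternate local training and server-side averaging."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        updates = [local_update(global_w, data) for data in clients]
        global_w = federated_average(updates, [len(d) for d in clients])
    return global_w


def predict(weights, x):
    score = sum(wi * xi for wi, xi in zip(weights, x))
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy data (assumed): each example is ([bias, feature], label),
# with label 1 when the feature is positive.
clients = [
    [([1.0, 2.0], 1), ([1.0, -1.5], 0), ([1.0, 1.0], 1)],
    [([1.0, -2.0], 0), ([1.0, 1.5], 1), ([1.0, -1.0], 0)],
]
model = train_federated(clients, dim=2)
```

Raw examples never leave a client in this scheme; the server sees only weight vectors. A production system would also need secure aggregation and communication handling, which this sketch omits.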
The study highlights the potential of AI and NLP in healthcare, aiming to assist caregivers and professionals with preliminary ASD screening via online platforms. The approach addresses challenges in early autism diagnosis and offers a foundation for developing effective, privacy-aware automated detection tools.
Conclusion
The assessment of ASD behavioral traits is a time-consuming process that is further complicated by overlapping symptomatology. There is currently no diagnostic test that can quickly and accurately detect ASD, nor an optimized and thorough screening tool explicitly developed to identify the onset of ASD. We have designed an automated ASD prediction model using a minimal set of behaviors selected from each diagnostic dataset. Of the five models that we applied to our dataset, Logistic Regression gave the highest accuracy. The primary limitation of this research is the scarcity of large, open-source ASD datasets. Building an accurate model requires a large dataset, and the dataset we used did not have a sufficient number of instances. Nevertheless, our research has provided useful insights into the development of an automated model that can assist medical practitioners in detecting autism in children. In the future, we will consider using a larger dataset to improve generalization. We also plan to employ deep learning techniques that integrate CNNs and classification to improve the robustness and overall performance of the system. Overall, our research has analyzed various classification models that can detect ASD in children from attributes based on the child’s behavioral and medical information. Other researchers can use this analysis as a basis for further exploring this dataset or other Autism Spectrum Disorder datasets.