Personality Prediction Using Machine Learning and Social Media Data: A Myers-Briggs Approach

Authors: Nakul Bhangale, Aryan Jasuja, Vraj Gujrathi

DOI Link: https://doi.org/10.22214/ijraset.2025.69336

Abstract

With the explosive growth of user-generated content on social media, there is a rising interest in utilizing digital footprints to infer personality traits. This study explores how social media text can be analyzed using Natural Language Processing (NLP) and machine learning to predict an individual’s personality, focusing on the Myers-Briggs Type Indicator (MBTI) framework. Leveraging linguistic and behavioral features extracted from social media content, we apply Support Vector Machines (SVM) and Random Forest algorithms to classify users into MBTI personality types. Our approach highlights the scalability and potential of automated personality assessment in domains such as targeted marketing, recruitment, and mental health.

Introduction

Background

Personality analysis plays a vital role in psychology, marketing, and human-computer interaction. Traditional assessment methods like MBTI (Myers-Briggs Type Indicator) involve time-consuming, subjective questionnaires. With the rise of social media, vast amounts of user-generated text have made it possible to infer personality traits computationally.

Objective

This study proposes a machine learning framework to predict MBTI personality types from users' social media posts. It uses Support Vector Machines (SVM) and Random Forest classifiers, focusing on interpretability, scalability, and effectiveness on real-world, noisy datasets.

MBTI and Machine Learning Setup

MBTI classifies individuals into 16 personality types across 4 binary dimensions:
1. Introversion (I) vs. Extraversion (E)
2. Sensing (S) vs. Intuition (N)
3. Thinking (T) vs. Feeling (F)
4. Judging (J) vs. Perceiving (P)
The classification task is broken down into 4 independent binary classification problems, suitable for supervised learning.
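
The decomposition above can be sketched as a small label-encoding helper. This is a minimal illustration, not the authors' code: the names `DIMENSIONS`, `mbti_to_binary`, and `binary_to_mbti` are hypothetical, and the choice of which letter maps to 0 and which to 1 is an assumption made only for the example.

```python
# The four MBTI dimensions, in the order their letters appear in a type code.
# Assumption: the first letter of each pair is encoded as 0, the second as 1.
DIMENSIONS = [("I", "E"), ("S", "N"), ("T", "F"), ("J", "P")]


def mbti_to_binary(mbti_type):
    """Map a four-letter MBTI code to four 0/1 labels, one per dimension."""
    code = mbti_type.strip().upper()
    if len(code) != 4:
        raise ValueError(f"expected a four-letter MBTI code, got {mbti_type!r}")
    labels = []
    for letter, (first, second) in zip(code, DIMENSIONS):
        if letter == first:
            labels.append(0)
        elif letter == second:
            labels.append(1)
        else:
            raise ValueError(f"unexpected letter {letter!r} in {mbti_type!r}")
    return labels


def binary_to_mbti(labels):
    """Inverse mapping: four 0/1 labels back to a four-letter MBTI code."""
    return "".join(pair[bit] for bit, pair in zip(labels, DIMENSIONS))
```

Each position in the returned list can then serve as the target of one binary classifier, and the four predictions can be recombined into a full type with `binary_to_mbti`.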

Data Collection & Preprocessing

Social media datasets with self-declared MBTI labels were used.
Preprocessing steps:
- Text cleaning: lowercasing, punctuation removal, stopword filtering, tokenization, stemming.
- Lexical normalization and removal of MBTI mentions to avoid data leakage.
- Minimum 500-word threshold per user document.
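
A rough sketch of these preprocessing steps, using only the Python standard library, might look as follows. Several details are assumptions for illustration: the tiny `STOPWORDS` set stands in for a full stopword list (the study lists NLTK among its libraries), stemming is omitted, the MBTI pattern is one possible way to strip type mentions, and the word threshold is applied here to the raw text, which the source does not specify.

```python
import re

# Matches any of the 16 type codes, optionally pluralised (e.g. "intj", "infps").
MBTI_PATTERN = re.compile(r"\b[ie][ns][tf][jp]s?\b")

# Illustrative stopword subset only; a real run would use a full list.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"}

MIN_WORDS = 500  # per-user threshold stated in the text


def clean_text(text):
    """Lowercase, drop MBTI mentions and punctuation, tokenize, filter stopwords."""
    text = text.lower()
    text = MBTI_PATTERN.sub(" ", text)        # remove type mentions (leakage)
    text = re.sub(r"[^a-z\s]", " ", text)     # drop punctuation and digits
    tokens = text.split()                     # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]


def keep_user(raw_text, min_words=MIN_WORDS):
    """Keep only users whose combined posts reach the minimum word count."""
    return len(raw_text.split()) >= min_words
```

Removing the type codes before feature extraction matters because users often state their own type in posts, which would otherwise let a classifier read the label directly from the text.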

Feature Extraction

Textual and behavioral features extracted include:
- TF-IDF vectors (unigrams and bigrams)
- Linguistic style: sentence length, punctuation, pronoun use
- Sentiment: polarity and subjectivity (via TextBlob)
- Part-of-speech distribution
- Posting behavior: frequency and timing
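
As an illustration of the linguistic-style features in this list, the sketch below computes average sentence length, punctuation density, and pronoun ratio from raw text. The function name `style_features`, the `PRONOUNS` set, and the exact feature definitions are assumptions; the source does not state how these quantities were computed.

```python
import re

# Illustrative pronoun list; the study does not specify which pronouns it counts.
PRONOUNS = {"i", "me", "my", "mine", "we", "us", "our", "you", "your",
            "he", "she", "him", "her", "they", "them", "their"}


def style_features(text):
    """Return simple stylistic statistics for one user's raw (uncleaned) text."""
    # Split into sentences on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    punctuation = re.findall(r"[^\w\s]", text)
    pronouns = [w for w in words if w.lower() in PRONOUNS]

    n_words = len(words) or 1          # guard against division by zero
    n_sentences = len(sentences) or 1

    return {
        "avg_sentence_length": len(words) / n_sentences,
        "punctuation_per_word": len(punctuation) / n_words,
        "pronoun_ratio": len(pronouns) / n_words,
    }
```

These features are computed on the raw text rather than the cleaned tokens, since punctuation removal and stopword filtering would erase exactly the signals being measured.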

Model Architecture

SVM:
- Uses linear and RBF kernels
- Effective for high-dimensional, sparse data (like text)
- Tuned using grid search and cross-validation
Random Forest:
- Uses 100–200 trees
- Handles noisy data well
- Provides feature importance scores

Each MBTI dimension has a separate classifier trained independently.
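
One way to realise this setup with scikit-learn is sketched below. It is a simplified illustration under several assumptions: only TF-IDF features are used (the stylistic, sentiment, and behavioral features are left out), the grid over `C` and the `f1_macro` scoring are placeholder choices, and the helper names `build_models` and `train_per_dimension` are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def build_models(n_splits=10):
    """Create one untrained SVM (grid-searched) and one Random Forest pipeline."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    svm = GridSearchCV(
        Pipeline([
            ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
            ("clf", SVC()),
        ]),
        # Assumed grid: both kernels named in the text, illustrative C values.
        param_grid={"clf__kernel": ["linear", "rbf"], "clf__C": [0.1, 1, 10]},
        cv=cv,
        scoring="f1_macro",
    )
    forest = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ])
    return {"svm": svm, "random_forest": forest}


def train_per_dimension(texts, label_rows, model_name="random_forest", n_splits=10):
    """Fit one independent classifier per MBTI dimension.

    label_rows holds one [IE, SN, TF, JP] list of 0/1 labels per user.
    """
    fitted = {}
    for idx, dim in enumerate(["IE", "SN", "TF", "JP"]):
        y = [row[idx] for row in label_rows]
        model = build_models(n_splits)[model_name]  # fresh model per dimension
        fitted[dim] = model.fit(texts, y)
    return fitted
```

With the Random Forest variant, the feature importance scores mentioned above are available after fitting through `fitted["IE"].named_steps["clf"].feature_importances_`, aligned with the TF-IDF vocabulary of the same pipeline.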

Training & Evaluation

Models are evaluated with 10-fold stratified cross-validation, which preserves the class proportions in each fold.
SMOTE is used to address class imbalance.
Evaluation metrics:
- Accuracy, Precision, Recall, F1-score, ROC-AUC
- Confusion matrices and ROC curves used for visual performance analysis
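
For reference, the threshold-based metrics in this list follow directly from the entries of a binary confusion matrix. The sketch below shows the standard definitions in plain Python; the function names are hypothetical, and ROC-AUC is left out because it needs predicted scores rather than hard labels.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one binary dimension."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from the confusion-matrix counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

On imbalanced dimensions, accuracy alone can look high for a classifier that mostly predicts the majority class, which is why precision, recall, and F1 are reported alongside it.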

Implementation

Language: Python 3.9
Libraries: Scikit-learn, NLTK, TextBlob, Pandas, NumPy
Environment: Jupyter Notebook
Hardware: Intel i5, 16 GB RAM, Ubuntu 22.04

Related Work

Early work (Golbeck et al.) showed the viability of using Facebook data for personality prediction.
Studies using Twitter datasets (Plank & Hovy) proved the effectiveness of SVM and LIWC-based features.
Deep learning methods (e.g., CNNs, BERT) have been explored but lack interpretability.
Research continues to highlight issues like class imbalance, generalizability, and data sparsity.

Results & Discussion

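Both classifiers perform well across all four MBTI dimensions, with Random Forest consistently outperforming SVM and reaching an average accuracy of 77% and an F1-score of 0.76.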
Classical models like SVM and Random Forest are effective for this task due to:
- Simpler architecture
- Better interpretability
- Robustness on small and noisy datasets
The approach demonstrates potential for real-world applications such as:
- Personalized recommendations
- User profiling
- Mental health diagnostics

Conclusion

This research presents a machine learning-based framework for predicting Myers-Briggs Type Indicator (MBTI) personality traits from social media text data. By decomposing the classification task into four binary subtasks corresponding to the MBTI dimensions, we leverage both Support Vector Machines (SVM) and Random Forest classifiers, with extensive natural language processing for feature extraction.

Our experimental results demonstrate that the Random Forest model consistently outperforms SVM across all MBTI dimensions, achieving an average accuracy of 77% and F1-score of 0.76. In particular, the model excels in predicting the Introversion/Extraversion and Judging/Perceiving dimensions, which are often linguistically more distinguishable. The Random Forest model also offers interpretability through feature importance, enabling psychological insights from linguistic behavior.

The study also confirms that social media text, despite its informal nature, contains rich linguistic and behavioral cues that can be mined to infer personality with reasonable accuracy. Feature engineering using sentiment scores, lexical statistics, and part-of-speech patterns contributed significantly to classification performance.

However, challenges such as class imbalance, linguistic ambiguity, and limited training data for rare MBTI types remain. Addressing these issues will be essential for improving real-world applicability. Future work could involve incorporating deep learning models like BERT for contextual understanding, exploring multi-modal inputs (e.g., images, interactions), or adapting the framework for longitudinal personality tracking.

In conclusion, our findings highlight the potential of machine learning models, particularly ensemble approaches like Random Forest, in advancing computational personality recognition. The proposed framework offers a scalable, interpretable, and effective solution for personality inference, opening up applications in areas such as personalized content delivery, mental health screening, and digital user profiling.

References

[1] J. Golbeck, C. Robles, M. Edmondson, and K. Turner, “Predicting personality from Twitter,” in IEEE International Conference on Privacy, Security, Risk and Trust, 2011, pp. 149–156.
[2] B. Verhoeven, W. Daelemans, and B. Plank, “TwiSty: A multilingual Twitter stylometry corpus for gender and personality profiling,” in Proc. LREC, 2016.
[3] B. Plank and D. Hovy, “Personality traits on Twitter—or—How to get 1,500 personality tests in a week,” in Proc. 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 2015, pp. 92–98.
[4] T. Yamada, K. Sugiura, and Y. Ogawa, “MBTI Personality Prediction using Pre-trained Language Models,” in Proc. 9th Workshop on NLP for Social Media, 2021.
[5] C. Filardi, J. Burger, and L. Sen, “Explaining MBTI Personality Prediction with Interpretable Machine Learning,” in ACM Transactions on Interactive Intelligent Systems, 2021.
[6] Jothi Prakash and Arul Antran Vijay, “A Unified Framework for Analyzing Textual Context and Intent in Social Media,” ACM Trans. Intell. Syst. Technol., 2024.
[7] Bo Han, Paul Cook, and Timothy Baldwin, “Lexical normalization for social media text,” ACM Trans. Intell. Syst. Technol., 2013.
[8] Hetal Vora et al., “Personality Prediction from Social Media Text: An Overview,” IJERT, 2020.
[9] Mourad Ellouze and Lamia Hadrich Belguith, “AI for Personality Traits and Mental Health in Social Media: A Survey,” ACM Trans. Asian Low-Resour. Lang. Inf. Process., 2024.
[10] G. Park, M. A. Schwartz, J. Sap, et al., “Automatic personality prediction from Facebook profiles,” in J. of Personality and Social Psychology, vol. 108, no. 6, pp. 934–952, 2015.

Copyright

Copyright © 2025 Nakul Bhangale, Aryan Jasuja, Vraj Gujrathi. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET69336

Publish Date : 2025-04-21

ISSN : 2321-9653

Publisher Name : IJRASET
