This chapter presents a follow-along methodological exercise that shows how social media data can be used to study generative AI training and conditioning. It is written to be reproducible by both technical and non-technical readers and does not assume access to large commercial training pipelines. Using a simulation-based experimental approach, the chapter walks through collecting text from Reddit and Twitter (X), constructing two matched datasets of 2,500 texts per platform, and preparing those datasets for comparison through careful cleaning and documentation. Analysis is carried out in Orange Data Mining, using visual, screenshot-led steps to explore sentiment, basic linguistic patterns, and framing differences across platforms. The chapter then demonstrates prompt-based conditioning with a small open-source generative model, keeping prompts and generation settings fixed while varying only the platform-specific input text. The emphasis is on methodological clarity, transparency, and cautious interpretation rather than substantive claims about public opinion.
Introduction
This chapter takes a transparent, simulation-based approach to studying generative AI (large language models). Modern generative systems produce fluent, adaptive text, but their outputs are shaped by training data, data curation, and alignment processes rather than by general intelligence. Understanding these systems therefore requires attention to data provenance and to how different data environments influence model behaviour.
Social media platforms are important sources of training data because they provide large volumes of up-to-date, real-world discourse. However, platforms like Reddit and Twitter (X) differ in structure, communication style, moderation, and algorithmic design, which can shape language tone and framing. The chapter uses these two platforms as a comparative case study to examine how different discourse environments may influence generative model outputs.
Because proprietary training pipelines are not accessible, the chapter adopts a simulation-based experimental design. Instead of retraining a model, it uses prompt-based conditioning with a fixed open-source language model. The only variable changed is the exposure corpus (Reddit vs. Twitter texts about UK cost-of-living issues in 2025). This controlled design lets researchers observe how different text environments influence generated outputs while holding all other factors constant. The goal is methodological illustration, not performance optimization.
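The conditioning design described above can be sketched in a few lines: the prompt template and generation settings stay fixed, and only the exposure corpus varies. The prompt wording, settings, and the stand-in `echo_generate` function below are illustrative assumptions, not the chapter's exact setup; in practice `generate` would wrap a call to a fixed open-source model.

```python
# Sketch of the simulation design: everything except the exposure corpus is
# held fixed. `generate` stands in for any fixed open-source model call;
# the stand-in below exists only so the design can be exercised offline.
FIXED_PROMPT = ("Context:\n{context}\n\n"
                "Question: How do people describe the UK cost of living?\nAnswer:")
FIXED_SETTINGS = {"max_new_tokens": 60, "temperature": 0.7, "seed": 42}

def run_condition(corpus_sample, generate):
    """Build the prompt from one platform's texts; model and settings never change."""
    context = "\n".join(corpus_sample)
    return generate(FIXED_PROMPT.format(context=context), **FIXED_SETTINGS)

def echo_generate(prompt, **settings):
    # Toy stand-in generator (hypothetical): echoes part of the context back,
    # so the two conditions visibly differ without a model download.
    return f"[seed={settings['seed']}] " + prompt.splitlines()[1][:40]

reddit_out = run_condition(["Rent takes half my salary now."], echo_generate)
twitter_out = run_condition(["Energy bills up again #costofliving"], echo_generate)
```

Because the seed, settings, and template are identical across runs, any difference between the two outputs can only come from the platform-specific context.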
The workflow includes:
Data collection using platform APIs
Construction of matched corpora (2,500 texts per platform)
Corpus cleaning and preparation
Exploratory analysis using Orange Data Mining
Prompt-based conditioning and comparison of generated outputs
Discussion of limitations and adaptability to other contexts
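The corpus-construction and cleaning steps in the workflow above can be sketched as a minimal script. The specific cleaning rules (stripping URLs and user handles, deduplicating, capping each platform at 2,500 texts) follow the chapter's stated design; the exact regular expressions are an illustrative assumption.

```python
import re

def clean_text(text):
    """Minimal cleaning: strip URLs, user handles, and extra whitespace."""
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@\w+", "", text)  # drop handles: no user identifiers are kept
    return re.sub(r"\s+", " ", text).strip()

def build_matched_corpus(texts, size=2500):
    """Deduplicate (case-insensitively), drop empties, cap at a fixed size."""
    seen, corpus = set(), []
    for t in texts:
        c = clean_text(t)
        if c and c.lower() not in seen:
            seen.add(c.lower())
            corpus.append(c)
        if len(corpus) == size:
            break
    return corpus

sample = ["Prices up! https://x.co/abc", "prices up!", "@user Rent is mad", ""]
corpus = build_matched_corpus(sample)  # two cleaned, deduplicated texts remain
```

Applying the same function with the same `size` to both platforms is what makes the two corpora matched and directly comparable.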
The study emphasizes transparency, reproducibility, and careful interpretation of results, avoiding overgeneralization about platforms or social reality.
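The exploratory step in the workflow is carried out visually in Orange, but its logic can be made explicit in a short scripted equivalent. The mini-lexicon below is an illustrative assumption (the chapter itself relies on Orange's sentiment tooling); it shows only the shape of a lexicon-based comparison between platform corpora.

```python
# Scripted sketch of lexicon-based sentiment scoring, as a transparent
# stand-in for the Orange-based exploration. The word lists are toy
# assumptions, not a validated sentiment lexicon.
POSITIVE = {"affordable", "relief", "help"}
NEGATIVE = {"expensive", "struggle", "crisis", "brutal"}

def sentiment_score(text):
    """Count positive minus negative lexicon hits in a single text."""
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

def mean_sentiment(corpus):
    """Average score across a corpus, for platform-level comparison."""
    return sum(sentiment_score(t) for t in corpus) / len(corpus)
```

Comparing `mean_sentiment` across the Reddit and Twitter corpora mirrors the platform-level contrast the chapter inspects via Orange's visual workflow.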
The methodological framework is grounded in simulation-based research, which helps study complex systems when direct access to full training processes is unavailable. The approach demonstrates how varying data inputs under controlled conditions can influence generative AI outputs.
The chapter also explains why Reddit and Twitter were selected, highlighting differences in community structure, post length, engagement mechanisms, and discourse style. Ethical considerations are addressed by collecting only text content, excluding user identifiers, and following responsible data practices.
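The text-only collection principle can be enforced mechanically at ingestion time: keep the post text (and, optionally, a timestamp) and discard every identifier before anything is stored. The field names below describe a generic API response, not a specific platform schema.

```python
# Sketch of identifier-free collection: an allow-list of fields is applied
# to each raw API record before storage. Field names are illustrative.
KEEP = {"text", "created_at"}  # no usernames, user IDs, or profile fields

def to_text_record(raw_post):
    """Retain only allow-listed fields from a raw platform record."""
    return {k: v for k, v in raw_post.items() if k in KEEP}

post = {"id": "123", "author": "someone",
        "text": "Bills are brutal", "created_at": "2025-01-03"}
record = to_text_record(post)
```

An allow-list is safer than a block-list here: any identifier field the researcher did not anticipate is dropped by default rather than accidentally retained.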
Conclusion
This chapter has limitations by design rather than by oversight. It relies on an illustrative, small-scale corpus and uses prompt-based conditioning instead of full model training. It focuses on a single topic and a single year, and it operationalises the UK focus through practical filtering strategies rather than guaranteed geolocation. These constraints mean that the chapter does not claim empirical representativeness, nor does it attempt to replicate the scale or complexity of commercial model development. The conclusions drawn are therefore specific to the sample and simulation presented.
At the same time, these constraints also bring clarity. The chapter demonstrates a workflow that is accessible, replicable, and adaptable. Readers can substitute alternative topics, such as housing affordability, energy prices, or wage stagnation, adjust the time window, or compare different platform pairs. They can replace Orange with other analytical tools or extend the workflow to include additional methods, such as topic modelling or network analysis, depending on their technical experience. The conditioning approach can also be adapted, shifting from prompt-based exposure to lightweight fine-tuning where resources allow, while preserving the same comparative logic.
In conclusion, training generative AI on social media data raises methodological and interpretive challenges that require transparent workflows and careful documentation. By combining practical steps, including API access, corpus extraction, analysis in Orange, and simulated conditioning, with ongoing theoretical reflection, this chapter offers a transferable approach to studying how platform-specific discourse environments shape generative outputs. Such approaches are essential for advancing research that is both technically informed and socially grounded.