The widespread manipulation of online product reviews by spammers has emerged as one of the most pressing integrity challenges facing modern e-commerce platforms. Conventional spam detection approaches that rely on fixed feature sets and standard classifiers are increasingly outpaced by a new generation of crowdsourced spammers who systematically mimic the behavioral patterns of genuine users. To address this growing threat, this paper proposes hPSD, a Hybrid Positive-Unlabeled Learning-Based Spammer Detection model that jointly leverages individual user behavioral features and the structural information embedded in user-product relational networks. Operating under a semi-supervised framework, hPSD begins with a small seed set of confirmed spammers and iteratively refines its detection through a reliable negative extraction algorithm and a Bayesian hybrid classifier. The model is capable of identifying multiple distinct types of spammers in a single unified pipeline. Extensive experiments on both a synthetic movie-review dataset with injected shilling attacks and a real-world Amazon review corpus demonstrate that hPSD significantly outperforms eight state-of-the-art baseline detectors, achieving a precision of 0.81, recall of 0.97, and F-measure of 0.90 on real-world data. The framework additionally uncovers hidden employer organizations coordinating spammer activity, demonstrating practical value beyond standard detection metrics.
Introduction
Online product reviews and star ratings strongly influence consumer decisions on e-commerce and service platforms. High ratings increase trust and revenue, while negative reviews can quickly harm a seller's reputation, which creates incentives for review manipulation. Modern spammers, often organized human crowds, write convincing reviews, so detection based on individual behavior alone is difficult. The key signal lies in relational patterns, such as a reviewer disproportionately targeting a single brand or seller.
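To make the relational signal concrete, the following minimal sketch computes, for each reviewer, the share of reviews that fall on that reviewer's most-reviewed brand. It is an illustration only and not the relational model used by hPSD; the function name `brand_concentration`, the `(user, brand)` input format, and the sample data are assumptions introduced here.

```python
from collections import Counter

def brand_concentration(reviews):
    """Return {user: (top_brand, share)} for an iterable of (user, brand) pairs.

    share is the fraction of the user's reviews that target the user's
    most-reviewed brand. Hypothetical helper, not part of hPSD.
    """
    per_user = {}
    for user, brand in reviews:
        per_user.setdefault(user, Counter())[brand] += 1
    result = {}
    for user, counts in per_user.items():
        brand, n = counts.most_common(1)[0]
        result[user] = (brand, n / sum(counts.values()))
    return result

# Illustrative data: u1 concentrates on one brand, u2 spreads reviews evenly.
reviews = [
    ("u1", "acme"), ("u1", "acme"), ("u1", "acme"), ("u1", "zen"),
    ("u2", "acme"), ("u2", "zen"), ("u2", "orb"), ("u2", "kite"),
]
scores = brand_concentration(reviews)
print(scores["u1"])  # ('acme', 0.75)
```

A high share is not evidence of spamming by itself, since loyal customers also concentrate on one brand; it only becomes informative in combination with other evidence, which is the motivation for a hybrid model.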
This paper introduces hPSD, a hybrid Positive-Unlabeled (PU) learning framework for detecting such sophisticated spammers. hPSD integrates behavioral features with user-product relational structures in a semi-supervised Bayesian model. It discretizes features, extracts reliable negatives from unlabeled users, and iteratively identifies multiple spammer types, including targeted promoters, duplicate reviewers, and colluding groups, through repeated detection loops. The system architecture comprises six layers: input data consolidation, feature discretization, reliable negative extraction, hybrid semi-supervised learning, iterative multi-type detection, and secure data access. Evaluations on synthetic and real-world Amazon datasets show that hPSD detects organized review fraud more accurately than the baseline detectors, even when labeled data are scarce.
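The discretization, reliable negative extraction, and Bayesian classification steps can be illustrated with a generic two-step PU-learning sketch. This is not the hPSD algorithm: it omits the relational component and the iterative multi-type loop, and it substitutes a plain categorical naive Bayes classifier. The feature choices (rating deviation, reviews per day), cut points, the `fraction` parameter, and all names are assumptions made for illustration.

```python
import math
from collections import defaultdict

def discretize(value, cut_points):
    """Map a numeric feature to a bin index given sorted cut points."""
    return sum(value >= c for c in cut_points)

class NaiveBayes:
    """Categorical naive Bayes with Laplace smoothing (1 = spammer, 0 = normal)."""

    def fit(self, rows, labels, n_bins):
        self.n_bins = n_bins
        self.class_count = {0: 0, 1: 0}
        self.counts = defaultdict(int)  # (label, feature index, bin) -> count
        for row, y in zip(rows, labels):
            self.class_count[y] += 1
            for j, b in enumerate(row):
                self.counts[(y, j, b)] += 1
        return self

    def prob_spammer(self, row):
        total = sum(self.class_count.values())
        log_p = {}
        for y in (0, 1):
            lp = math.log((self.class_count[y] + 1) / (total + 2))
            for j, b in enumerate(row):
                lp += math.log((self.counts[(y, j, b)] + 1)
                               / (self.class_count[y] + self.n_bins))
            log_p[y] = lp
        m = max(log_p.values())  # subtract the max for numerical stability
        e0, e1 = math.exp(log_p[0] - m), math.exp(log_p[1] - m)
        return e1 / (e0 + e1)

def reliable_negatives(positives, unlabeled, n_bins, fraction=0.5):
    """Step 1: train P vs. U (U treated as negative) and return the indices
    of the least spammer-like part of U."""
    model = NaiveBayes().fit(positives + unlabeled,
                             [1] * len(positives) + [0] * len(unlabeled), n_bins)
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: model.prob_spammer(unlabeled[i]))
    return ranked[:max(1, int(len(unlabeled) * fraction))]

# Hypothetical raw features per user: (rating deviation, reviews per day).
cuts = ([0.5, 1.2], [3, 5])  # assumed cut points, three bins per feature
def to_bins(row):
    return tuple(discretize(v, c) for v, c in zip(row, cuts))

positives = [to_bins(r) for r in [(1.8, 9), (1.6, 7)]]           # seed spammers
unlabeled = [to_bins(r) for r in [(0.2, 1), (0.3, 1), (1.7, 8),
                                  (0.1, 2), (0.4, 1), (1.9, 6)]]

rn_idx = reliable_negatives(positives, unlabeled, n_bins=3)
rn = [unlabeled[i] for i in rn_idx]

# Step 2: retrain on seed positives vs. reliable negatives, then score the rest of U.
clf = NaiveBayes().fit(positives + rn, [1] * len(positives) + [0] * len(rn), n_bins=3)
flagged = [i for i in range(len(unlabeled))
           if i not in rn_idx and clf.prob_spammer(unlabeled[i]) > 0.5]
print(rn_idx, flagged)  # [0, 1, 3] [2, 5]
```

In an iterative variant, users flagged in one round could be added to the positive set before the next round; how hPSD itself schedules these rounds is not specified in this section.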
Conclusion
This paper has presented hPSD, a Hybrid Positive-Unlabeled Learning-Based Spammer Detection framework that addresses the core limitations of existing review spam detection approaches through three design innovations: a semi-supervised PU-learning paradigm that operates effectively under realistic label scarcity conditions, a hybrid Bayesian classifier that jointly models individual behavioral features and user-product relational network structure, and an iterative multi-type detection loop that systematically uncovers heterogeneous spammer populations within a single unified pipeline.
The experimental results demonstrate that this combination delivers meaningfully superior detection performance relative to eight state-of-the-art baselines on both controlled synthetic attack scenarios and challenging real-world Amazon review data. Most significantly, the relational component of the hybrid model uncovers a substantial population of sophisticated promoters: paid reviewers who concentrate their activity on products from specific employers, whose individual behavioral profiles are unremarkable, and who would be missed entirely by feature-space-only detectors. The discovery of underlying employer organizations coordinating spamming campaigns represents a capability that substantially extends the practical value of automated spammer detection beyond identifying individual bad actors.
The framework’s design also surfaces several directions for future enhancement. As spammer strategies continue to evolve in response to deployed detection systems, the ability to adapt in real time becomes increasingly critical. Online or incremental learning variants of hPSD that update their models continuously as new review data arrives would allow the system to maintain detection performance without periodic full retraining cycles. Incorporating richer relational signals beyond the user-product review graph, including device fingerprint sharing, IP address clustering, temporal coordination of account activity, and infrastructure-level linkages between accounts, would provide additional evidence channels for detecting more subtle forms of collusion that the current model may miss.
Deepening the textual analysis component represents another high-value enhancement direction. The current framework treats review text as a source of behavioral metadata rather than analyzing the content directly. Integrating advanced natural language processing techniques, including writing style fingerprinting, sentiment trajectory analysis, and cross-review semantic similarity scoring, would enable the framework to detect spammers who have successfully calibrated their behavioral metadata to evade detection but whose writing patterns still carry identifiable signatures of paid or coordinated authorship.
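As a rough indication of what cross-review similarity scoring could look like, the sketch below flags near-duplicate reviews using Jaccard similarity over word bigrams. This is a lexical measure standing in for the semantic similarity named above, and it is not part of the current framework; the function names, the threshold of 0.5, and the sample reviews are assumptions.

```python
import re
from itertools import combinations

def shingles(text, k=2):
    """Return the set of k-word shingles of a lowercased review."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_duplicates(reviews, threshold=0.5):
    """Return pairs of review ids whose shingle sets overlap above threshold."""
    sets = {rid: shingles(text) for rid, text in reviews.items()}
    return [(x, y) for x, y in combinations(sorted(sets), 2)
            if jaccard(sets[x], sets[y]) >= threshold]

# Illustrative reviews: r1 and r2 differ by a single word.
reviews = {
    "r1": "Great blender, works perfectly every time",
    "r2": "Great blender, works perfectly every day",
    "r3": "Arrived late and the lid was cracked",
}
pairs = near_duplicates(reviews)
print(pairs)  # [('r1', 'r2')]
```

A lexical measure of this kind only catches copy-and-edit behavior; paraphrased reviews would require embedding-based or stylometric comparisons.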
Explainability and interpretability improvements are essential for operational deployment. Platform trust and safety teams reviewing flagged accounts need to understand why specific users were identified as spammers in order to make defensible moderation decisions and build legally compliant enforcement cases. Developing structured explanation outputs that clearly articulate the combination of behavioral features and relational signals driving each detection decision would substantially improve the framework’s practical utility. Finally, deployment in live production environments with real-time processing constraints, large-scale distributed infrastructure, and human-in-the-loop review workflows would provide the operational validation needed to confirm that hPSD’s detection capabilities translate effectively to production-scale platforms.