In contemporary law enforcement and forensic investigations, the accurate identification of suspects plays a pivotal role in solving crimes and ensuring justice. Traditional methods of suspect identification, such as composite sketches and eyewitness descriptions, often suffer from subjectivity and inconsistency. To address these limitations, there is growing interest in leveraging advanced technologies, particularly deep learning-based approaches, to improve the accuracy and reliability of suspect identification. This research focuses on developing a deep learning-based system that generates realistic facial images of potential suspects from textual descriptions or other relevant input. The scope of the project covers the development and evaluation of this system within the context of criminal investigations, with the aim of providing law enforcement agencies and forensic experts with a more objective and data-driven approach to suspect identification.
Introduction
Traditional suspect identification methods, such as sketches or verbal descriptions, often suffer from subjectivity and inaccuracies. This research addresses these challenges by developing a deep learning-based text-to-image system using Generative Adversarial Networks (GANs) to generate facial images from textual descriptions, aiming to provide a more objective and consistent approach for forensic investigations.
Core Contributions:
Text-to-Image Synthesis with GANs:
Utilizes GANs, especially DCGAN and AttnGAN, to generate realistic facial images from descriptive text.
Integrates NLP models (BERT, GPT) to extract rich features from the text, enhancing the precision of image generation.
Elimination of Human Bias:
Shifts the suspect identification process from subjective sketching to data-driven AI models.
Reduces errors from memory limitations or artist interpretations.
Literature Review Highlights:
GANs have evolved significantly since their introduction by Goodfellow et al. (2014).
Key models include DCGAN, CGAN, AttnGAN, and StackGAN, which progressively improved image realism and diversity.
Attention mechanisms and global-local collaborative models enhanced image quality from complex texts.
Comparative studies show DCGAN outperforms older GAN models in forensic applications.
Methodology Overview:
A. Text Processing Module:
Uses Transformer-based NLP (BERT & GPT) for understanding and encoding descriptions (e.g., "A man with short black hair and narrow eyes").
Converts descriptions into vector embeddings representing facial attributes.
Implements preprocessing, tokenization, and normalization steps.
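To make this step concrete, the following Python sketch shows how a description could be normalized, tokenized, and encoded into a fixed-length embedding with a pretrained BERT model via the Hugging Face transformers library. The checkpoint name (bert-base-uncased) and the use of the [CLS] vector as the sentence embedding are illustrative assumptions; the paper does not specify these details.

import torch
from transformers import AutoTokenizer, AutoModel

# Minimal sketch of the text-processing module, assuming a pretrained BERT
# encoder (the exact checkpoint is not specified in the paper).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

description = "A man with short black hair and narrow eyes"

# Preprocessing/normalization: lowercase and collapse extra whitespace.
description = " ".join(description.lower().split())

# Tokenization and encoding.
tokens = tokenizer(description, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = encoder(**tokens)

# Use the [CLS] token's hidden state as a 768-dimensional text embedding
# that later conditions the image generator.
text_embedding = outputs.last_hidden_state[:, 0, :]   # shape: (1, 768)
print(text_embedding.shape)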
B. Image Generation with GANs:
Generator: Produces facial images from input vectors using transposed convolutions.
Discriminator: Distinguishes real from fake images.
Trained adversarially with a focus on realism and attribute alignment.
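The sketch below illustrates this generator/discriminator pairing in PyTorch: a DCGAN-style generator that upsamples a concatenated noise-plus-text-embedding vector through transposed convolutions, and a convolutional discriminator that scores images as real or generated. The layer widths, the 100-dimensional noise vector, the 768-dimensional text embedding, and the 64x64 output resolution are assumptions for illustration rather than the paper's exact architecture; a complete system would typically also condition the discriminator on the text embedding to enforce attribute alignment.

import torch
import torch.nn as nn

# Assumed dimensions (not specified in the paper): 100-d noise, 768-d text
# embedding from the BERT encoder, 64x64 RGB output images.
NOISE_DIM, TEXT_DIM, IMG_CHANNELS = 100, 768, 3

class Generator(nn.Module):
    """DCGAN-style generator: maps (noise + text embedding) to a 64x64 face."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(NOISE_DIM + TEXT_DIM, 512, 4, 1, 0, bias=False),
            nn.BatchNorm2d(512), nn.ReLU(True),                       # 4x4
            nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(True),                       # 8x8
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128), nn.ReLU(True),                       # 16x16
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(True),                        # 32x32
            nn.ConvTranspose2d(64, IMG_CHANNELS, 4, 2, 1, bias=False),
            nn.Tanh(),                                                # 64x64, range [-1, 1]
        )

    def forward(self, noise, text_embedding):
        # Concatenate noise and text condition, reshape to a 1x1 feature map.
        z = torch.cat([noise, text_embedding], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)

class Discriminator(nn.Module):
    """DCGAN-style discriminator: outputs a real/fake logit per image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(IMG_CHANNELS, 64, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),                          # 32x32
            nn.Conv2d(64, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),     # 16x16
            nn.Conv2d(128, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),     # 8x8
            nn.Conv2d(256, 512, 4, 2, 1, bias=False),
            nn.BatchNorm2d(512), nn.LeakyReLU(0.2, inplace=True),     # 4x4
            nn.Conv2d(512, 1, 4, 1, 0, bias=False),                   # 1x1 logit
        )

    def forward(self, image):
        return self.net(image).view(-1)

# Example forward pass with a batch of 8 (text embeddings would come from
# the BERT encoder sketched above; random tensors stand in here).
noise = torch.randn(8, NOISE_DIM)
text_embedding = torch.randn(8, TEXT_DIM)
fake_images = Generator()(noise, text_embedding)    # shape: (8, 3, 64, 64)
scores = Discriminator()(fake_images)               # shape: (8,)

In adversarial training, the discriminator is updated to separate real faces from generated ones while the generator is updated to fool it, typically with a binary cross-entropy objective on these logits.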
C. Evaluation Metrics:
Fréchet Inception Distance (FID): Measures how closely the distribution of generated images matches that of real images (lower is better).
Inception Score (IS): Evaluates quality and diversity.
Inference Speed & Training Time: Computational efficiency.
Qualitative Evaluation: Realism, relevance, and diversity via expert judgment.
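For reference, FID compares the mean and covariance of Inception-network features extracted from real and generated images. The short sketch below computes it from precomputed feature matrices; this is a generic, assumed setup, since the paper does not describe its FID implementation.

import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats, fake_feats):
    """Compute FID from two feature matrices of shape (N, D).

    Lower FID means the generated-image feature distribution is closer
    to the real-image distribution.
    """
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)

    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # discard tiny imaginary parts from numerical error

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Example with random stand-in features; real use would pass Inception-v3
# pool activations for real and generated face images.
rng = np.random.default_rng(0)
fid = frechet_inception_distance(rng.normal(size=(500, 64)), rng.normal(size=(500, 64)))
print(round(fid, 3))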
Experimental Results:
Quantitative Metrics (Table II):
Precision: 89%
Recall: 87%
Accuracy: 90%
F1-Score: 88%
Image Quality (Table III):
FID Score: 70 (lower = better)
IS: 0.8 (closer to 1 = better)
Computational Performance (Table IV):
Training Time: ~10 hours
Inference Speed: 0.5s per image
Model Size: 150 MB
Qualitative Evaluation (Table V):
Realism: 90%
Relevance: 88%
Diversity: 86%
Model Comparison (Table VI):
Vanilla GAN: Precision 55%, Recall 50%, Accuracy 52%, F1-Score 52%
CGAN: Precision 84%, Recall 85%, Accuracy 83%, F1-Score 85%
DCGAN: Precision 89%, Recall 87%, Accuracy 90%, F1-Score 88%
DCGAN outperforms all others in all evaluated metrics.
Conclusion
DCGAN consistently performs best across all metrics, achieving the highest Precision, Recall, Accuracy, F1-Score, IS, and qualitative scores while maintaining the lowest FID. This combination of accuracy and realism makes it the most effective model in this study for generating realistic, high-quality, and diverse criminal facial images, and highly suitable for forensic applications where these properties are critical. The generated faces are visually authentic and lifelike, closely resembling real human faces, which is essential when accurate representations are needed. The model also captures the key attributes described in the text inputs, so the generated images align well with descriptions provided by eyewitnesses or derived from forensic sketches. Finally, DCGAN exhibits strong diversity, producing varied facial features from different textual descriptions, which indicates that it neither overfits nor produces monotonous results.
References
[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial nets,’’ in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[2] S. Hong, D. Yang, J. Choi, and H. Lee, ‘‘Inferring semantic layout for hierarchical text-to-image synthesis,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7986–7994.
[3] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, and A. Graves, ‘‘Conditional image generation with PixelCNN decoders,’’ in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4790–4798.
[4] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, ‘‘StackGAN++: Realistic image synthesis with stacked generative adversarial networks,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1947–1962, Aug. 2019.
[5] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, ‘‘The Caltech-UCSD Birds-200-2011 dataset,’’ California Inst. Technol., Pasadena, CA, USA, Tech. Rep., 2011.
[6] M.-E. Nilsback and A. Zisserman, ‘‘Automated flower classification over a large number of classes,’’ in Proc. 6th Indian Conf. Comput. Vis., Graph. Image Process., Dec. 2008, pp. 722–729.
[7] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, ‘‘Microsoft COCO: Common objects in context,’’ in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 740–755.
[8] D. P. Kingma and M. Welling, ‘‘Auto-encoding variational Bayes,’’ 2013, arXiv:1312.6114. [Online]. Available: http://arxiv.org/abs/1312.6114
[9] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, ‘‘Generative adversarial text to image synthesis,’’ 2016, arXiv:1605.05396. [Online]. Available: http://arxiv.org/abs/1605.05396
[10] A. Radford, L. Metz, and S. Chintala, ‘‘Unsupervised representation learning with deep convolutional generative adversarial networks,’’ 2015, arXiv:1511.06434. [Online]. Available: http://arxiv.org/abs/1511.06434
[11] H. Dong, S. Yu, C. Wu, and Y. Guo, ‘‘Semantic image synthesis via adversarial learning,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5706–5714.
[12] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, ‘‘StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5907–5915.
[13] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, ‘‘Learning what and where to draw,’’ in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 217–225.
[14] S. Sharma, D. Suhubdy, V. Michalski, S. Ebrahimi Kahou, and Y. Bengio, ‘‘ChatPainter: Improving text to image generation using dialogue,’’ 2018, arXiv:1802.08216. [Online]. Available: http://arxiv.org/abs/1802.08216
[15] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, ‘‘AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1316–1324.
[16] T. Qiao, J. Zhang, D. Xu, and D. Tao, ‘‘MirrorGAN: Learning text-to-image generation by redescription,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1505–1514.
[17] Z. Zhang, Y. Xie, and L. Yang, ‘‘Photographic text-to-image synthesis with a hierarchically-nested adversarial network,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6199–6208.
[18] A. Gatt, M. Tanti, A. Muscat, P. Paggio, R. A. Farrugia, C. Borg, K. P. Camilleri, M. Rosner, and L. van der Plas, ‘‘Face2Text: Collecting an annotated image description corpus for the generation of rich face descriptions,’’ 2018, arXiv:1803.03827. [Online]. Available: http://arxiv.org/abs/1803.03827
[19] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, ‘‘MS-Celeb-1M: A dataset and benchmark for large-scale face recognition,’’ in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 87–102.
[20] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, ‘‘Labeled faces in the wild: A database for studying face recognition in unconstrained environments,’’ Tech. Rep., 2008.
[21] Z. Liu, P. Luo, X. Wang, and X. Tang, ‘‘Deep learning face attributes in the wild,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 3730–3738.
[22] M. Mirza and S. Osindero, ‘‘Conditional generative adversarial nets,’’ 2014, arXiv:1411.1784. [Online]. Available: http://arxiv.org/abs/1411.1784
[23] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, ‘‘High-resolution image synthesis and semantic manipulation with conditional GANs,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8798–8807.
[24] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua, ‘‘Towards open-set identity preserving face synthesis,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6713–6722.