As the volume of published chemical structure data grows, manually extracting and curating structures from earlier papers that are available only in printed or scanned form has become impractical. Rule-based optical chemical structure recognition (OCSR) systems have been created to automate this procedure, but they tend to be slow and error-prone. To address this problem, the Deep Chem project has developed an application that uses deep learning to recognize chemical structures in printed or scanned articles, aiming to provide a faster and more reliable alternative to rule-based tools. Using a collection of 50–100 million molecules, the approach predicts SMILES encodings of chemical structure renderings with over 96% accuracy for structures without stereochemical information and over 89% accuracy for structures with stereochemical information. The app also lets users compare the potential health impact of chemical structures found in different foods on the basis of their molar mass. This work is entirely based on open-source software and open data and is available to the general public for any purpose.
Introduction
Chemical information is typically conveyed through text and graphics in the scientific literature, and extracting it manually is time-consuming. Automated methods, especially Optical Chemical Structure Recognition (OCSR), have improved over the past 30 years, but most of them only handle images that contain nothing but a chemical structure. Existing tools such as OSRA and ChemSchematicResolver rely on rule-based or clustering methods and struggle with scanned or unlabeled images. A 2019 deep learning approach using a U-Net CNN improved segmentation, but offered only limited public access to its models.
Building on this, the DECIMER project developed Deep Chem, an open-source deep learning application that segments chemical structures from scanned journal pages using the DECIMER Segmentation algorithm. It extracts the chemical structures, converts them to SMILES notation, and compares their molar masses to study health impacts. The code and models are publicly available through a web app that accepts both bitmap images from older scanned articles and modern PDFs.
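The molar-mass comparison step can be illustrated with a minimal sketch. The source does not say how Deep Chem derives molar mass from a recognized structure, so the code below assumes that a molecular formula is already available and simply sums rounded standard atomic masses. The names ATOMIC_MASS, molar_mass, and rank_by_molar_mass are hypothetical and are not taken from the Deep Chem code base.

```python
import re

# Rounded standard atomic masses in g/mol. Only a few common elements are
# listed; this table is an illustrative assumption, not Deep Chem's data.
ATOMIC_MASS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "P": 30.974, "S": 32.06, "Cl": 35.45, "Na": 22.990,
}


def molar_mass(formula):
    """Return the molar mass (g/mol) of a flat formula such as 'C6H12O6'.

    Parentheses, charges, and isotopes are not supported in this sketch.
    """
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in ATOMIC_MASS:
            raise ValueError(f"unknown element: {symbol}")
        # A missing count means a single atom of that element.
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


def rank_by_molar_mass(compounds):
    """Turn a {name: formula} mapping into (name, mass) pairs, lightest first."""
    pairs = ((name, molar_mass(formula)) for name, formula in compounds.items())
    return sorted(pairs, key=lambda pair: pair[1])


if __name__ == "__main__":
    foods = {"sucrose": "C12H22O11", "glucose": "C6H12O6", "caffeine": "C8H10N4O2"}
    for name, mass in rank_by_molar_mass(foods):
        print(f"{name}: {mass:.3f} g/mol")
```

In practice the formula, or the molar mass itself, would more likely be obtained from the predicted SMILES string with a cheminformatics toolkit; the hand-written parser here only keeps the example self-contained.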
Theory highlights:
Deep Learning: A powerful AI method using multi-layer neural networks, excelling in computer vision, NLP, drug discovery, and more. It learns hierarchical data representations but requires large labeled datasets and has interpretability challenges.
TensorFlow: An open-source machine learning library widely used to develop models in vision, NLP, healthcare, finance, and social media analysis, offering flexible deployment and a strong community.
Molar Mass and Viscosity: Viscosity generally increases with molar mass, because larger molecules tend to have stronger intermolecular forces and more complex shapes and, in polymers, entangle with one another.
Molar Mass and Diffusion: Diffusion rates generally decrease with increasing molar mass, because larger molecules move less freely through liquids owing to their size and shape and to the higher viscosity of the medium.
Deep Learning Algorithms: Various architectures such as CNNs (AlexNet), DBNs, GANs, ResNets, and attention networks have advanced AI performance in image recognition, speech, and NLP by learning complex patterns and overcoming technical challenges like vanishing gradients.
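The qualitative molar-mass trends for diffusion and viscosity described above can be made concrete with two standard textbook relations. Neither is stated in the source, so this sketch assumes them: the Stokes-Einstein equation, D = k_B T / (6 π η r), for a sphere diffusing in a liquid, and the Mark-Houwink equation, [η] = K M^a, for the intrinsic viscosity of a polymer solution. The function names are hypothetical, and K and a are polymer- and solvent-specific constants that have to be looked up or fitted.

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def stokes_einstein_diffusion(radius_m, viscosity_pa_s, temperature_k=298.15):
    """Diffusion coefficient (m^2/s) of a sphere: D = k_B T / (6 pi eta r)."""
    return BOLTZMANN * temperature_k / (6 * math.pi * viscosity_pa_s * radius_m)


def diffusion_ratio(molar_mass_a, molar_mass_b):
    """D_a / D_b for compact spheres of equal density in the same liquid.

    Assumes the radius scales as M**(1/3), so D scales as M**(-1/3).
    """
    return (molar_mass_b / molar_mass_a) ** (1 / 3)


def mark_houwink(molar_mass, k, a):
    """Intrinsic viscosity [eta] = K * M**a for a polymer in a given solvent."""
    return k * molar_mass ** a


if __name__ == "__main__":
    # A 1 nm sphere in a liquid with a viscosity of 1 mPa*s at 25 degrees C.
    print(stokes_einstein_diffusion(1e-9, 1e-3))
    # An eightfold increase in molar mass halves D under the sphere assumption.
    print(diffusion_ratio(8.0, 1.0))
    # Placeholder constants (k=1.0, a=0.7), not measured values.
    print(mark_houwink(2.0e5, 1.0, 0.7) / mark_houwink(1.0e5, 1.0, 0.7))
```

Both relations are idealizations. Stokes-Einstein treats the solute as a rigid sphere much larger than the solvent molecules, and Mark-Houwink applies to polymer solutions rather than to small molecules in food, so the sketch illustrates the direction of the trends rather than giving quantitative predictions.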
Conclusion
The Deep Chem tool has the potential to benefit the public in many ways. One notable area of application is the food industry, where it can be used to assess the impact of various foods on health based on their molar mass, thereby facilitating informed dietary choices and promoting healthy eating habits. Moreover, the app's ability to automatically recognize chemical structures from printed or scanned documents has broad applicability in fields such as drug discovery, environmental analysis, and patent infringement detection.
Looking ahead, there is considerable room to develop Deep Chem further. The app's accuracy could be improved by expanding its training datasets, adopting more advanced machine learning algorithms, and using more computational power. This would allow the app to recognize more intricate chemical structures and identify molecules with greater precision. The deep learning approach used by Deep Chem could also be extended to other areas of chemical research, such as toxicity prediction and reaction optimization. Overall, its potential to drive advances in chemical research, support healthier eating habits, and promote public health makes Deep Chem a promising tool for the future.