4th international edition and 13th Iranian Conference on Bioinformatics
In-silico Drug Generation using Masked Language Modeling
Authors:
Seyed Hassan Alavi (1), Zahra Ghorbanali (2), Fatemeh Zare-Mirakabad (3)
1- Computational Biology Research Center (CBRC), Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran
2- Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran
3- Computational Biology Research Center (CBRC), Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran
Keywords:
De novo drug discovery, Lead optimization, Chemical language modeling, IC50 improvement
Abstract:
Introduction: De novo drug discovery is a complex, costly, and time-intensive process involving several phases, such as target discovery, screening, lead optimization, and clinical trials (Paul et al., 2010). Lead optimization is crucial for refining compounds into viable drug candidates by enhancing their bioactivity and optimizing their pharmacological properties. Because the chemical space is vast, estimated to contain over 10^60 possible molecular structures, traditional methods are insufficient on their own (Polishchuk et al., 2013). Consequently, machine learning (ML) techniques, especially deep generative models, have been adopted to explore this space efficiently (Elton et al., 2019). Inspired by advances in natural language processing (NLP), transformer-based models such as ChemBERTa-2 have been developed for molecular machine learning. ChemBERTa-2 applies a transformer-based architecture to the SMILES representation of chemical compounds to learn intricate chemical features such as functional groups, chirality, and atomic connectivity (Ahmad et al., 2022). It is trained on two tasks: masked language modeling (MLM) and multi-task regression (MTR). The MLM task is particularly beneficial for lead optimization.

Method: This study uses ChemBERTa-2 to generate new chemical compounds from existing leads. Using the BACE dataset, which contains data on beta-secretase inhibitors, the model generates novel SMILES sequences by masking the atoms in a compound’s SMILES representation one at a time and predicting replacements. Only high-likelihood predictions are used to modify the original structure, producing a tree of molecular variations. The generation process is bounded by constraints: a maximum tree depth of 10 and a limit of 200 variations per compound.

Results and discussion: Our approach generated 28,911 novel SMILES structures from 303 compounds in the BACE dataset’s test set. These were evaluated based on synthetic accessibility and pIC50 improvement.
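The mask-and-replace procedure described in the Method can be sketched as a breadth-first tree expansion. This is a minimal illustration under stated assumptions, not the authors' implementation: the tokenizer regex, the `predict` callback signature, the `min_prob` threshold, and the `toy_predict` stand-in are all hypothetical. In practice `predict` would wrap a masked language model such as ChemBERTa-2; here a stub is used so the sketch runs without any model.

```python
import re
from collections import deque
from typing import Callable, List, Tuple

# Simplified SMILES tokenizer (assumption: common organic-subset tokens only).
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#+\-\\/.:@]")
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]")

# Hypothetical predictor interface: given the tokens and the index to mask,
# return (candidate_token, probability) pairs.
Predictor = Callable[[List[str], int], List[Tuple[str, float]]]


def tokenize(smiles: str) -> List[str]:
    tokens = TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in SMILES: {smiles!r}")
    return tokens


def generate_variants(
    seed: str,
    predict: Predictor,
    min_prob: float = 0.1,    # assumed cutoff for a "high-likelihood" prediction
    max_depth: int = 10,      # depth limit from the abstract
    max_variants: int = 200,  # per-compound limit from the abstract
) -> List[str]:
    """Expand a tree of SMILES variants by masking one atom at a time."""
    seen = {seed}
    variants: List[str] = []
    queue = deque([(seed, 0)])
    while queue:
        smiles, depth = queue.popleft()
        if depth >= max_depth:
            continue  # nodes at the depth limit are kept but not expanded
        tokens = tokenize(smiles)
        for i, tok in enumerate(tokens):
            if not ATOM_RE.fullmatch(tok):
                continue  # only atom tokens are masked
            for candidate, prob in predict(tokens, i):
                if prob < min_prob or candidate == tok:
                    continue
                new = "".join(tokens[:i] + [candidate] + tokens[i + 1:])
                if new in seen:
                    continue
                seen.add(new)
                variants.append(new)
                queue.append((new, depth + 1))
                if len(variants) >= max_variants:
                    return variants
    return variants


def toy_predict(tokens: List[str], index: int) -> List[Tuple[str, float]]:
    """Stand-in for a masked language model; always proposes the same atoms."""
    return [("N", 0.6), ("O", 0.3), ("S", 0.01)]


one_level = generate_variants("CCO", toy_predict, max_depth=1)
```

The sketch does not check that a generated string is a chemically valid molecule; a real pipeline would presumably validate each variant with a cheminformatics toolkit before scoring it.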
The Synthetic Accessibility Score (SAS) was used to assess the feasibility of synthesizing the new compounds, with scores below 5 considered experimentally feasible. Most generated compounds had SAS scores under this threshold. We also predicted the pIC50 values of the generated compounds using a computational model. Notably, for 76 compounds the pIC50 increased by 1 unit, while for 12 compounds it increased by 2 units. A 1-unit increase in pIC50 corresponds to a tenfold reduction in IC50, the half-maximal inhibitory concentration, representing a significant enhancement in drug efficacy.

Conclusion: Our findings demonstrate that ChemBERTa-2, even when trained only with self-supervision on SMILES sequences, effectively captures crucial chemical features that influence molecular properties. By applying ChemBERTa-2 to lead optimization, we improved the predicted efficacy of several drug candidates. This approach underscores the potential of transformer-based models to reshape the drug discovery pipeline, offering a scalable and efficient method for exploring vast chemical spaces.
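The pIC50 arithmetic behind the reported gains follows from the standard definition pIC50 = -log10(IC50 in mol/L). The short sketch below only illustrates that relationship; the function names are hypothetical and the numeric inputs are examples, not values from the study.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 (-log10 of molar IC50)."""
    return -math.log10(ic50_nm * 1e-9)


def ic50_fold_reduction(delta_pic50: float) -> float:
    """Fold reduction in IC50 implied by a given increase in pIC50."""
    return 10.0 ** delta_pic50


# Illustrative values only: an IC50 of 100 nM gives a pIC50 of about 7.
example_pic50 = pic50_from_ic50_nm(100.0)
fold_for_one_unit = ic50_fold_reduction(1.0)   # tenfold
fold_for_two_units = ic50_fold_reduction(2.0)  # hundredfold
```

Under this definition, the 2-unit increases reported for 12 compounds correspond to a hundredfold lower predicted IC50.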