4th international edition and 13th Iranian Conference on Bioinformatics
In-silico Drug Generation using Masked Language Modeling
Authors:
Seyed Hassan Alavi (1), Zahra Ghorbanali (2), Fatemeh Zare-Mirakabad (3)
1- Computational Biology Research Center (CBRC), Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran
2- Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran
3- Computational Biology Research Center (CBRC), Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran
Keywords:
De novo drug discovery, Lead optimization, Chemical language modeling, IC50 improvement
Abstract:
Introduction: De novo drug discovery is a complex, costly, and time-intensive process involving several phases, such as target discovery, screening, lead optimization, and clinical trials (Paul et al., 2010). Lead optimization refines compounds into viable drug candidates by enhancing their bioactivity and optimizing their pharmacological properties. Because the chemical space is vast, estimated to contain over 10^60 possible molecular structures, traditional methods are insufficient on their own (Polishchuk et al., 2013). Consequently, machine learning (ML) techniques, especially deep generative models, have been adopted to explore this space efficiently (Elton et al., 2019). Inspired by advances in natural language processing (NLP), transformer-based models such as ChemBERTa-2 have been developed for molecular machine learning. ChemBERTa-2 applies a transformer architecture to the SMILES representation of chemical compounds to learn chemical features such as functional groups, chirality, and atomic connectivity (Ahmad et al., 2022). It is trained on two tasks: masked language modeling (MLM) and multi-task regression (MTR). The MLM task is particularly useful for lead optimization.

Method: This study uses ChemBERTa-2 to generate new chemical compounds from existing leads. Using the BACE dataset, which contains data on beta-secretase inhibitors, the model generates novel SMILES sequences by masking the atoms in a compound's SMILES representation one at a time and predicting replacements. Only high-likelihood predictions are used to modify the original structure, and applying the procedure again to each modified structure produces a tree of molecular variations. Generation is bounded by constraints: a maximum tree depth of 10 and at most 200 variations per compound.

Results and discussion: Our approach generated 28,911 novel SMILES structures from 303 compounds in the test set of the BACE dataset. These were evaluated on synthetic accessibility and pIC50 improvement. The Synthetic Accessibility Score (SAS) was used to assess how feasible the new compounds are to synthesize, with scores below 5 considered experimentally feasible. Most generated compounds had SAS scores under this threshold. We also evaluated the pIC50 values of the generated compounds using a computational model. Notably, the pIC50 increased by 1 unit for 76 compounds and by 2 units for 12 compounds. A 1-unit increase in pIC50 corresponds to a tenfold reduction in IC50, the half-maximal inhibitory concentration, which represents a substantial gain in inhibitory potency.

Conclusion: Our findings indicate that ChemBERTa-2, even when trained only with self-supervision on SMILES sequences, captures chemical features that influence molecular properties. By using ChemBERTa-2 for lead optimization, we improved the predicted pIC50 of several drug candidates. This approach underscores the potential of transformer-based models in the drug discovery pipeline, offering a scalable and efficient method for exploring vast chemical spaces.
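
The masking procedure described under Method can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `atom_spans`, `generate_variants`, `predict_fn`, and `is_valid` are hypothetical, the `<mask>` string and the probability threshold `min_prob` are assumed values, the breadth-first expansion order is an assumption, and `toy_predict` is a fixed stand-in for the ChemBERTa-2 MLM head that returns made-up scores. Only the limits of depth 10 and 200 variations per compound come from the abstract.

```python
import re
from collections import deque

# Bracket atoms first, then two-letter halogens, then one-letter organic-subset atoms.
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]")


def atom_spans(smiles):
    """Return (start, end) character spans of the atom tokens in a SMILES string."""
    return [m.span() for m in ATOM_RE.finditer(smiles)]


def toy_predict(masked_smiles):
    """Hypothetical stand-in for a masked-language-model head (fixed, made-up scores)."""
    return [("C", 0.5), ("N", 0.3), ("O", 0.15), ("S", 0.05)]


def generate_variants(lead, predict_fn, min_prob=0.1, max_depth=10,
                      max_variants=200, is_valid=lambda s: True):
    """Expand a tree of single-atom substitutions from a lead SMILES.

    predict_fn(masked_smiles) -> list of (token, probability) pairs.
    is_valid(smiles) -> bool; a placeholder for a real chemical validity check.
    """
    seen = {lead}
    variants = []
    queue = deque([(lead, 0)])  # breadth-first expansion (an assumption)
    while queue and len(variants) < max_variants:
        smiles, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for start, end in atom_spans(smiles):
            original = smiles[start:end]
            masked = smiles[:start] + "<mask>" + smiles[end:]
            for token, prob in predict_fn(masked):
                # Keep only high-likelihood replacements that change the atom.
                if prob < min_prob or token == original:
                    continue
                child = smiles[:start] + token + smiles[end:]
                if child in seen or not is_valid(child):
                    continue
                seen.add(child)
                variants.append(child)
                queue.append((child, depth + 1))
                if len(variants) >= max_variants:
                    return variants
    return variants
```

Calling `generate_variants("CCO", toy_predict, max_depth=1)` returns the single-substitution children of the lead. In practice `predict_fn` would wrap the model's masked-token predictions and `is_valid` would wrap a cheminformatics validity check; both are left as stubs here because the abstract does not specify them.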
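
The tenfold relationship between pIC50 and IC50 follows from the standard definition pIC50 = -log10(IC50 in mol/L), which the abstract relies on but does not spell out. The short sketch below works through the arithmetic; the function names are hypothetical and the IC50 values are arbitrary examples, not results from the study.

```python
import math


def pic50_from_ic50_nM(ic50_nM):
    """Convert an IC50 in nanomolar to pIC50, assuming pIC50 = -log10(IC50 in mol/L)."""
    # 1 nM = 1e-9 mol/L, so -log10(ic50_nM * 1e-9) = 9 - log10(ic50_nM).
    return 9.0 - math.log10(ic50_nM)


def ic50_fold_reduction(delta_pic50):
    """Factor by which IC50 drops when pIC50 rises by delta_pic50 units."""
    return 10.0 ** delta_pic50
```

For example, moving from an IC50 of 100 nM to 10 nM raises pIC50 from 7 to 8, and a 2-unit gain in pIC50 corresponds to a hundredfold lower IC50.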