Iranian National Conference on Bioinformatics

4th International Edition and 13th Iranian Conference on Bioinformatics

In-silico Drug Generation using Masked Language Modeling

Authors:

Seyed Hassan Alavi¹, Zahra Ghorbanali², Fatemeh Zare-Mirakabad³

1- Computational Biology Research Center (CBRC), Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran
2- Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran
3- Computational Biology Research Center (CBRC), Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran

Keywords:

De novo drug discovery, Lead optimization, Chemical language modeling, IC50 improvement

Abstract:

Introduction: De novo drug discovery is a complex, costly, and time-intensive process involving several phases, such as target discovery, screening, lead optimization, and clinical trials (Paul et al., 2010). Lead optimization is crucial for refining compounds into viable drug candidates by enhancing their bioactivity and optimizing their pharmacological properties. Given the vast chemical space, estimated to contain over 10^60 possible molecular structures, traditional methods alone are insufficient (Polishchuk et al., 2013). Consequently, machine learning (ML) techniques, especially deep generative models, have been adopted to explore this space efficiently (Elton et al., 2019). Inspired by advances in natural language processing (NLP), transformer-based models such as ChemBERTa-2 have been developed for molecular machine learning. ChemBERTa-2 applies a transformer-based architecture to the SMILES representation of chemical compounds to learn intricate chemical features such as functional groups, chirality, and atomic connectivity (Ahmad et al., 2022). It is trained on two tasks: masked language modeling (MLM) and multi-task regression (MTR). The MLM task is particularly beneficial for lead optimization.

Method: This study uses ChemBERTa-2 to generate new chemical compounds from existing leads. Using the BACE dataset, which contains data on beta-secretase inhibitors, the model generates novel SMILES sequences by sequentially masking atoms in a compound’s SMILES representation and predicting replacements. Only high-likelihood predictions are used to modify the original structure, producing a tree of molecular variations. The generation process is bounded by constraints such as a maximum depth of 10 and a limit of 200 variations per compound.

Results and discussion: Our approach generated 28,911 novel SMILES structures from the 303 compounds in the test set of the BACE dataset. These were evaluated based on synthetic accessibility and pIC50 improvement.
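The mask-and-replace procedure described under Method can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regex tokenizer is a simplification rather than ChemBERTa-2's tokenizer, `toy_predict` is a hypothetical stand-in for the masked language model, and the `min_prob` cutoff of 0.5 is an assumed value, since the abstract only says that high-likelihood predictions are kept. The defaults for `max_depth` and `max_variants` mirror the constraints stated in the abstract.

```python
import re
from collections import deque

# Simplified SMILES tokenizer (assumption: not ChemBERTa-2's actual tokenizer).
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|.")
# Tokens treated as atoms; bonds, ring digits, and branch symbols are never masked.
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]")


def tokenize(smiles):
    """Split a SMILES string into atom and non-atom tokens."""
    return TOKEN_RE.findall(smiles)


def generate_variants(seed, predict_fn, min_prob=0.5, max_depth=10, max_variants=200):
    """Build a breadth-first tree of variants by masking one atom at a time."""
    seen = {seed}
    variants = []
    queue = deque([(seed, 0)])
    while queue and len(variants) < max_variants:
        smiles, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand nodes at the depth limit
        tokens = tokenize(smiles)
        for i, tok in enumerate(tokens):
            if not ATOM_RE.fullmatch(tok):
                continue
            masked = tokens[:i] + ["<mask>"] + tokens[i + 1:]
            for cand, prob in predict_fn(masked, i):
                if prob < min_prob or cand == tok:
                    continue  # keep only likely replacements that change the atom
                new = "".join(tokens[:i] + [cand] + tokens[i + 1:])
                if new in seen:
                    continue
                seen.add(new)
                variants.append(new)
                queue.append((new, depth + 1))
                if len(variants) >= max_variants:
                    return variants
    return variants


def toy_predict(masked_tokens, position):
    """Hypothetical predictor: fixed (token, probability) pairs for any mask."""
    return [("N", 0.7), ("O", 0.6), ("S", 0.1)]


if __name__ == "__main__":
    print(generate_variants("CCO", toy_predict, max_depth=1))
```

The sketch does not check chemical validity. In practice each candidate would presumably be parsed with a cheminformatics toolkit (for example RDKit's `Chem.MolFromSmiles`, which returns `None` for SMILES it cannot parse) before being added to the tree, and `predict_fn` would wrap the real model's masked-token predictions.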
The Synthetic Accessibility Score (SAS) was used to assess the feasibility of synthesizing the new compounds, with scores below 5 considered experimentally feasible. Most generated compounds had SAS scores under this threshold. We also evaluated the pIC50 values of the generated compounds using a computational model. Notably, for 76 compounds, the pIC50 increased by 1 unit, while for 12 compounds, it increased by 2 units. A 1-unit increase in pIC50 corresponds to a tenfold reduction in the effective inhibitory concentration, representing a significant enhancement in drug efficacy.

Conclusion: Our findings demonstrate that ChemBERTa-2, even with self-supervised training on SMILES sequences, effectively captures crucial chemical features that influence molecular properties. By leveraging ChemBERTa-2 for lead optimization, we improved the predicted efficacy of several drug candidates. This approach underscores the potential of transformer-based models to transform the drug discovery pipeline, offering a scalable and efficient method for exploring vast chemical spaces.
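The relationship between pIC50 gains and inhibitory concentration reported in the results follows from the standard definition pIC50 = -log10(IC50), with IC50 expressed in mol/L. The short Python sketch below works through that arithmetic; the function names are illustrative and are not taken from the study.

```python
import math


def pic50_from_ic50_molar(ic50_molar):
    """pIC50 is the negative base-10 logarithm of IC50 in mol/L."""
    return -math.log10(ic50_molar)


def ic50_fold_reduction(delta_pic50):
    """Factor by which IC50 drops when pIC50 rises by delta_pic50 units."""
    return 10.0 ** delta_pic50


if __name__ == "__main__":
    # A compound with IC50 = 1 micromolar has a pIC50 of about 6.
    print(pic50_from_ic50_molar(1e-6))
    # Gains of 1 and 2 pIC50 units mean 10-fold and 100-fold lower IC50.
    print(ic50_fold_reduction(1), ic50_fold_reduction(2))
```

On this scale, the 2-unit gains reported for 12 compounds correspond to a hundredfold lower predicted IC50.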