Identifying Protein Features and Pathways Responsible for Toxicity Using Machine Learning and Tox21: Implications for Predictive Toxicology
Abstract
1. Introduction
2. Results and Discussion
2.1. Preprocessed Data
2.2. Comparison to Other Tox21 Studies
2.3. Enrichment Analysis
2.4. Case Study for NR-AhR
3. Limitations and Future Work
4. Materials and Methods
4.1. Tox21 Datasets
4.2. UniProt
4.3. Protein-Compound Interaction Scoring Protocol
4.4. Study Design
4.4.1. Data Pre-Processing and Feature Generation
4.4.2. Data Resampling for Predictive Modeling
4.4.3. Random Forest
- (1) Draw a bootstrap sample: we randomly sample N compounds with replacement from the original dataset;
- (2) Grow a maximal (unpruned) decision tree: we construct a decision tree on the bootstrap sample by randomly sampling a subset of features at each node and choosing the best split among those features;
- (3) Construct a forest by repeating steps 1 and 2 once for each of the T trees;
- (4) Predict the outcome: the forest's prediction is obtained by aggregating the predictions of the T trees (i.e., majority vote for classification and averaging for regression tasks).
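
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names (`grow_tree`, `fit_forest`, `predict_forest`), the Gini split criterion, the square-root rule for the feature subset size, and the toy data are all hypothetical choices made for this example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, feature_ids):
    """Best (feature, threshold) among feature_ids by weighted Gini, or None."""
    n = len(y)
    best, best_score = None, gini(y)
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i in range(n) if X[i][f] <= t]
            right = [y[i] for i in range(n) if X[i][f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(X, y, n_sub, rng):
    """Step 2: grow an unpruned tree, sampling n_sub features at every node."""
    majority = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1:
        return majority  # pure node becomes a leaf
    split = best_split(X, y, rng.sample(range(len(X[0])), n_sub))
    if split is None:
        return majority  # no sampled feature improves impurity
    f, t = split
    left = [i for i in range(len(y)) if X[i][f] <= t]
    right = [i for i in range(len(y)) if X[i][f] > t]
    return (f, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], n_sub, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], n_sub, rng))

def predict_tree(tree, row):
    """Walk from the root to a leaf; internal nodes are 4-tuples."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def fit_forest(X, y, n_trees=25, n_sub=None, seed=0):
    """Steps 1 and 3: one bootstrap sample and one tree per forest member."""
    rng = random.Random(seed)
    n_sub = n_sub or max(1, int(len(X[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], n_sub, rng))
    return forest

def predict_forest(forest, row):
    """Step 4: majority vote over all trees (classification)."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Hypothetical toy data: features 0 and 1 separate the classes, feature 2 is noise.
X = [[0.0, 5.0, i % 2] for i in range(10)] + [[1.0, 9.0, i % 2] for i in range(10)]
y = [0] * 10 + [1] * 10
forest = fit_forest(X, y, n_trees=25, n_sub=2, seed=0)
print(predict_forest(forest, [0.0, 5.0, 1]), predict_forest(forest, [1.0, 9.0, 0]))
```

Because each node only sees a random subset of the features, individual trees differ from one another, and the vote in step 4 averages out their individual errors. A production analysis would more likely rely on an established library than on a hand-written forest like this one.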
4.4.4. Random Forest for Feature Selection
4.4.5. Model Training and Testing
4.4.6. Performance Evaluation Metrics
4.4.7. Enrichment Analysis
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Sample Availability
In Vitro qHTS Assay Identifier | Target/Assay | Number of Compounds | Training Set: Inactive | Training Set: Active | Training Set: IR | Test Set: Inactive | Test Set: Active | Test Set: IR |
---|---|---|---|---|---|---|---|---|
NR-AhR | Aryl hydrocarbon receptor | 7103 | 5777 | 734 | 7.87 | 521 | 71 | 7.34 |
NR-ER-LBD | Estrogen receptor (luciferase assay) | 7509 | 6643 | 282 | 23.56 | 564 | 20 | 28.20 |
NR-ER | Estrogen receptor | 6630 | 5474 | 651 | 8.41 | 456 | 49 | 9.31 |
NR-Aromatase | Aromatase | 6286 | 5496 | 274 | 20.06 | 479 | 37 | 12.94 |
NR-PPAR-γ | Peroxisome proliferator-activated receptor | 7039 | 6283 | 167 | 37.62 | 559 | 30 | 18.63 |
NR-AR | Androgen receptor | 7783 | 6958 | 252 | 27.61 | 561 | 12 | 46.75 |
NR-AR-LBD | Androgen receptor (luciferase assay) | 7298 | 6521 | 211 | 30.90 | 558 | 8 | 69.75 |
SR-MMP | Mitochondrial membrane potential | 6316 | 4899 | 888 | 5.52 | 474 | 55 | 8.62 |
SR-ARE | Nuclear factor (erythroid-derived 2)-like 2 antioxidant responsive element | 6339 | 4919 | 881 | 5.58 | 450 | 89 | 5.06 |
SR-ATAD5 | Genotoxicity indicated by ATAD5 | 7646 | 6787 | 256 | 26.51 | 569 | 34 | 16.73 |
SR-p53 | DNA damage p53-pathway | 7358 | 6351 | 409 | 15.53 | 560 | 38 | 14.74 |
SR-HSE | Heat shock factor response element | 7040 | 6144 | 305 | 20.14 | 574 | 17 | 33.76 |
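
The IR column in the table above appears to be an imbalance ratio, i.e., the number of inactive compounds divided by the number of active compounds. That reading is an assumption checked against the tabulated counts rather than a definition quoted from the text. A short sketch (the helper name `imbalance_ratio` is hypothetical) recomputes it for three training sets:

```python
def imbalance_ratio(n_inactive, n_active):
    """Majority-to-minority class ratio (assumed meaning of the IR column)."""
    return n_inactive / n_active

# Training-set counts (inactive, active) copied from the table above.
training_counts = {
    "NR-AhR": (5777, 734),
    "NR-ER-LBD": (6643, 282),
    "SR-MMP": (4899, 888),
}

for assay, (inactive, active) in training_counts.items():
    print(f"{assay}: IR = {imbalance_ratio(inactive, active):.2f}")
```

Under this definition an IR of 1 would indicate perfectly balanced classes, and larger values indicate stronger dominance of inactive compounds.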
Assays | F1 | Precision | Recall | AUCROC | AUPRC | BA | MCC | Specificity | Accuracy |
---|---|---|---|---|---|---|---|---|---|
NR-AhR | 0.471 | 0.318 | 0.901 | 0.896 | 0.560 | 0.819 | 0.438 | 0.737 | 0.757 |
NR-ER-LBD | 0.342 | 0.333 | 0.350 | 0.810 | 0.272 | 0.663 | 0.318 | 0.975 | 0.954 |
NR-ER | 0.420 | 0.301 | 0.694 | 0.806 | 0.414 | 0.760 | 0.370 | 0.827 | 0.814 |
NR-Aromatase | 0.317 | 0.250 | 0.432 | 0.795 | 0.282 | 0.666 | 0.260 | 0.900 | 0.866 |
NR-PPAR-γ | 0.286 | 0.308 | 0.267 | 0.745 | 0.241 | 0.617 | 0.251 | 0.968 | 0.932 |
NR-AR | 0.261 | 0.273 | 0.250 | 0.706 | 0.196 | 0.618 | 0.178 | 0.988 | 0.970 |
NR-AR-LBD | 0.000 | 0.000 | 0.000 | 0.618 | 0.036 | 0.493 | −0.014 | 0.986 | 0.972 |
SR-MMP | 0.488 | 0.331 | 0.927 | 0.916 | 0.597 | 0.855 | 0.478 | 0.783 | 0.798 |
SR-ARE | 0.425 | 0.305 | 0.697 | 0.757 | 0.403 | 0.692 | 0.294 | 0.687 | 0.688 |
SR-ATAD5 | 0.325 | 0.283 | 0.382 | 0.744 | 0.230 | 0.662 | 0.282 | 0.942 | 0.910 |
SR-p53 | 0.235 | 0.159 | 0.447 | 0.830 | 0.198 | 0.643 | 0.182 | 0.839 | 0.814 |
SR-HSE | 0.286 | 0.308 | 0.267 | 0.759 | 0.240 | 0.617 | 0.251 | 0.968 | 0.932 |
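
The threshold-based metrics in the table above all follow from a single confusion matrix. The sketch below uses the standard definitions and, as an illustration, confusion-matrix counts for NR-AhR that are back-calculated from the reported recall and specificity together with the 71 active and 521 inactive test compounds. Those four counts are an inference for this example and are not stated in the source.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)                      # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    balanced_accuracy = (recall + specificity) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "F1": f1,
        "Precision": precision,
        "Recall": recall,
        "BA": balanced_accuracy,
        "MCC": mcc,
        "Specificity": specificity,
        "Accuracy": accuracy,
    }

# Assumed counts: 64 of 71 actives and 384 of 521 inactives classified correctly.
nr_ahr = classification_metrics(tp=64, fn=7, tn=384, fp=137)
for name, value in nr_ahr.items():
    print(f"{name}: {value:.3f}")
```

AUCROC and AUPRC are not covered here because they depend on the ranked prediction scores rather than on a single thresholded confusion matrix.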
Pathway | Total Proteins | Selected Proteins (Gene IDs) | p-Value |
---|---|---|---|
Nonhomologous End-Joining (NHEJ) | 52 | RNF8,UBE2N,BRCA1,NSD2 | 1.36 × 10 |
Recruitment and ATM-mediated phosphorylation of repair and signaling proteins at DNA double strand breaks | 59 | RNF8,UBE2N,BRCA1,NSD2 | 2.22 × 10 |
TRAF6 mediated NF-κB activation | 24 | TRAF2,TRAF6 | 2.35 × 10 |
DNA Double Strand Break Response | 60 | RNF8,UBE2N,BRCA1,NSD2 | 2.36 × 10 |
TRAF6 mediated IRF7 activation | 28 | TRAF2,TRAF6 | 3.73 × 10 |
Neurofascin interactions | 7 | NRCAM,CNTN1 | 5.28 × 10 |
DDX58/IFIH1-mediated induction of interferon-alpha/beta | 77 | TRAF2,RNF125,TRAF6,DDX58 | 6.04 × 10 |
RUNX3 regulates YAP1-mediated transcription | 8 | TEAD1,TEAD4 | 7.01 × 10 |
SUMOylation of transcription cofactors | 42 | RNF2,UHRF2,PIAS3 | 1.22 × 10 |
IRAK1 recruits IKK complex | 14 | TRAF6,UBE2N | 2.21 × 10 |
IRAK1 recruits IKK complex upon TLR7/8 or 9 stimulation | 14 | TRAF6,UBE2N | 2.21 × 10 |
YAP1- and WWTR1 (TAZ)-stimulated gene expression | 14 | TEAD1,TEAD4 | 2.21 × 10 |
TRAF6 mediated IRF7 activation in TLR7/8 or 9 signaling | 14 | TRAF6,UBE2N | 2.21 × 10 |
TICAM1, RIP1-mediated IKK complex recruitment | 19 | TRAF6,UBE2N | 4.05 × 10 |
Signal transduction by L1 | 20 | NRP1,NCAM1 | 4.48 × 10 |
G2/M DNA damage checkpoint | 78 | RNF8,UBE2N,BRCA1,NSD2,RPA1 | 4.64 × 10 |
Regulation of FZD by ubiquitination | 21 | LRP6,LGR5 | 4.92 × 10 |
IKK complex recruitment mediated by RIP1 | 22 | TRAF6,UBE2N | 5.39 × 10 |
JNK (c-Jun kinases) phosphorylation and activation mediated by activated human TAK1 | 22 | TRAF6,UBE2N | 5.39 × 10 |
Processing of DNA double-strand break ends | 81 | RNF8,UBE2N,BRCA1,NSD2,RPA1 | 5.55 × 10 |
Activated TAK1 mediates p38 MAPK activation | 23 | TRAF6,UBE2N | 5.87 × 10 |
Formation of Incision Complex in GG-NER | 43 | UBE2N,PIAS3,RBX1,RPA1 | 6.52 × 10 |
Recognition of DNA damage by PCNA-containing replication complex | 31 | RBX1,RPA1 | 1.03 × 10 |
TAK1 activates NFkB by phosphorylation and activation of IKKs complex | 32 | TRAF6,UBE2N | 1.10 × 10 |
DNA strand elongation | 32 | GINS2,RPA1 | 1.10 × 10 |
Sialic acid metabolism | 33 | GLB1,NANP | 1.16 × 10 |
Transcriptional Regulation by E2F6 | 34 | RNF2,BRCA1 | 1.23 × 10 |
Negative regulators of DDX58/IFIH1 signaling | 34 | RNF125,DDX58 | 1.23 × 10 |
NOD1/2 Signaling Pathway | 35 | TRAF6,UBE2N | 1.30 × 10 |
RUNX1 interacts with co-factors whose precise effect on RUNX1 targets is not known | 36 | RNF2,PCGF5 | 1.37 × 10 |
HDR through Single Strand Annealing (SSA) | 37 | BRCA1,RPA1 | 1.44 × 10 |
Ovarian tumor domain proteases | 38 | TRAF6,DDX58 | 1.51 × 10 |
Presynaptic phase of homologous DNA pairing and strand exchange | 39 | BRCA1,RPA1 | 1.58 × 10 |
Formation of Fibrin Clot (Clotting Cascade) | 39 | PROCR,GP1BB | 1.58 × 10 |
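
Each row of the pathway table pairs a pathway size with the number of selected proteins that fall in it, which is the input to an over-representation test. The table does not say which test produced the p-values, so the sketch below uses a one-sided hypergeometric test, one common choice, purely as an illustration. The function name and the background and selection sizes in the example call are hypothetical and are not taken from the study.

```python
from math import comb

def hypergeom_enrichment_p(background, pathway_size, selected, hits):
    """P(X >= hits) when `selected` proteins are drawn without replacement
    from `background` proteins, of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, selected)
    total = comb(background, selected)
    tail = sum(
        comb(pathway_size, k) * comb(background - pathway_size, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Pathway size (52) and hit count (4) follow the NHEJ row above;
# the background (10000) and selection size (100) are made-up placeholders.
p_example = hypergeom_enrichment_p(background=10000, pathway_size=52,
                                   selected=100, hits=4)
print(f"p = {p_example:.3e}")
```

A small p-value here means that observing at least that many pathway members among the selected proteins would be unlikely under random selection. When many pathways are tested at once, a multiple-testing correction is normally applied on top of the raw p-values.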
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Moukheiber, L.; Mangione, W.; Moukheiber, M.; Maleki, S.; Falls, Z.; Gao, M.; Samudrala, R. Identifying Protein Features and Pathways Responsible for Toxicity Using Machine Learning and Tox21: Implications for Predictive Toxicology. Molecules 2022, 27, 3021. https://doi.org/10.3390/molecules27093021