Mumtaz, S ORCID: https://orcid.org/0000-0001-6364-6149, 2024. Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning. Nature Machine Intelligence. ISSN 2522-5839
Text: 1894381_Mumtaz.pdf - Published version (1MB)
Abstract
Pretrained language models have shown promise in analysing nucleotide sequences, yet a versatile model excelling across diverse tasks with a single pretrained weight set remains elusive. Here we introduce RNAErnie, an RNA-focused pretrained model built upon the transformer architecture, employing two simple yet effective strategies. First, RNAErnie enhances pretraining by incorporating RNA motifs as biological priors and introducing motif-level random masking in addition to masked language modelling at base/subsequence levels. It also tokenizes RNA types (for example, miRNA, lncRNA) as stop words, appending them to sequences during pretraining. Second, subject to out-of-distribution tasks with RNA sequences not seen during the pretraining phase, RNAErnie proposes a type-guided fine-tuning strategy that first predicts possible RNA types using an RNA sequence and then appends the predicted type to the tail of the sequence to refine feature embedding in a post hoc way. Our extensive evaluation across seven datasets and five tasks demonstrates the superiority of RNAErnie in both supervised and unsupervised learning. It surpasses baselines with up to 1.8% higher accuracy in classification, 2.2% greater accuracy in interaction prediction and 3.3% improved F1 score in structure prediction, showcasing its robustness and adaptability with a unified pretrained foundation.
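The two strategies described in the abstract can be illustrated with a small, self-contained Python sketch. It is not the RNAErnie implementation: the function names, the `[MASK]` and `[type]` token formats, the example motif and the `predict_types` callable are all assumptions made for illustration, and the real model works on learned embeddings rather than plain token lists.

```python
import random

MASK = "[MASK]"  # assumed mask token; the real vocabulary may differ


def tokenize(seq):
    """Base-level tokenisation: one token per nucleotide."""
    return list(seq.upper())


def find_motif_spans(seq, motifs):
    """Return sorted (start, end) spans for every occurrence of each motif."""
    spans = []
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            spans.append((start, start + len(motif)))
            start = seq.find(motif, start + 1)
    return sorted(spans)


def mask_motifs(seq, motifs, mask_prob=0.5, rng=None):
    """Motif-level masking: each motif occurrence is masked as a whole unit."""
    rng = rng or random.Random()
    tokens = tokenize(seq)
    for start, end in find_motif_spans(seq.upper(), motifs):
        if rng.random() < mask_prob:
            tokens[start:end] = [MASK] * (end - start)
    return tokens


def append_type(tokens, rna_type):
    """Append the RNA type as one extra token at the tail of the sequence."""
    return tokens + [f"[{rna_type}]"]


def type_guided_inputs(seq, predict_types, top_k=2):
    """Build one input per candidate type, as in type-guided fine-tuning.

    predict_types is a hypothetical callable mapping a sequence to a
    {type_name: score} dict; here it stands in for the pretrained model.
    """
    scores = predict_types(seq)
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [append_type(tokenize(seq), t) for t in best]
```

For example, `mask_motifs("AUGGCUAUG", ["AUG"], mask_prob=1.0)` masks both occurrences of the illustrative motif `AUG` and leaves the three bases between them untouched. `type_guided_inputs` returns one candidate input per predicted type, which a downstream head could then combine.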
Item Type: | Journal article |
---|---|
Publication Title: | Nature Machine Intelligence |
Creators: | Mumtaz, S. |
Publisher: | Nature Research |
Date: | 13 May 2024 |
ISSN: | 2522-5839 |
Identifiers: | DOI: 10.1038/s42256-024-00836-4; Other: 1894381 |
Rights: | © the author(s) 2024. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. |
Divisions: | Schools > School of Science and Technology |
Record created by: | Jonathan Gallacher |
Date Added: | 14 May 2024 09:57 |
Last Modified: | 14 May 2024 09:57 |
URI: | https://irep.ntu.ac.uk/id/eprint/51433 |