Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning

Mumtaz, S. (ORCID: https://orcid.org/0000-0001-6364-6149), 2024. Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning. Nature Machine Intelligence. ISSN 2522-5839

1894381_Mumtaz.pdf - Published version


Abstract

Pretrained language models have shown promise in analysing nucleotide sequences, yet a versatile model excelling across diverse tasks with a single pretrained weight set remains elusive. Here we introduce RNAErnie, an RNA-focused pretrained model built upon the transformer architecture, employing two simple yet effective strategies. First, RNAErnie enhances pretraining by incorporating RNA motifs as biological priors and introducing motif-level random masking in addition to masked language modelling at base/subsequence levels. It also tokenizes RNA types (for example, miRNA, lncRNA) as stop words, appending them to sequences during pretraining. Second, subject to out-of-distribution tasks with RNA sequences not seen during the pretraining phase, RNAErnie proposes a type-guided fine-tuning strategy that first predicts possible RNA types using an RNA sequence and then appends the predicted type to the tail of the sequence to refine feature embedding in a post hoc way. Our extensive evaluation across seven datasets and five tasks demonstrates the superiority of RNAErnie in both supervised and unsupervised learning. It surpasses baselines with up to 1.8% higher accuracy in classification, 2.2% greater accuracy in interaction prediction and 3.3% improved F1 score in structure prediction, showcasing its robustness and adaptability with a unified pretrained foundation.
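
The two strategies named in the abstract can be illustrated with a toy sketch. This is not the RNAErnie implementation: the motif list, the token spellings ([MASK], [miRNA]), the function names and the predict_types callable are all hypothetical, and the sketch appends only the single top-ranked type, which is a simplification of "predicts possible RNA types". It shows only the general idea of masking whole motif occurrences rather than single bases, and of appending a type token to the tail of a sequence.

```python
import random

MASK = "[MASK]"  # hypothetical mask token spelling

# Placeholder motifs for illustration only; not the motif set used in the paper.
MOTIFS = ["AUG", "GGAC"]


def find_motif_spans(seq, motifs):
    """Return sorted (start, end) spans of every motif occurrence in seq."""
    spans = []
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            spans.append((start, start + len(motif)))
            start = seq.find(motif, start + 1)
    return sorted(spans)


def mask_motifs(seq, motifs, mask_prob, rng):
    """Motif-level masking: each motif occurrence is masked as a whole
    with probability mask_prob."""
    tokens = list(seq)
    for start, end in find_motif_spans(seq, motifs):
        if rng.random() < mask_prob:
            for i in range(start, end):
                tokens[i] = MASK
    return tokens


def append_type(tokens, rna_type):
    """Append an RNA type token (hypothetical format) to the tail."""
    return tokens + [f"[{rna_type}]"]


def type_guided_input(seq, predict_types):
    """Type-guided input: predict candidate types, then append the top one.

    predict_types is a hypothetical classifier that returns RNA type
    names ranked best first.
    """
    ranked = predict_types(seq)
    return append_type(list(seq), ranked[0])


if __name__ == "__main__":
    rng = random.Random(0)
    masked = mask_motifs("GGAUGCCAUGA", MOTIFS, mask_prob=0.5, rng=rng)
    print(append_type(masked, "miRNA"))
    print(type_guided_input("ACGU", lambda s: ["miRNA", "lncRNA"]))
```

In this sketch, masking operates on spans found by exact string matching; a real pipeline would more plausibly draw motifs from a curated database and work on model tokens rather than raw characters.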

Item Type: Journal article
Publication Title: Nature Machine Intelligence
Creators: Mumtaz, S.
Publisher: Nature Research
Date: 13 May 2024
ISSN: 2522-5839
Identifiers:
DOI: 10.1038/s42256-024-00836-4
Other: 1894381
Rights: © the author(s) 2024. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Divisions: Schools > School of Science and Technology
Record created by: Jonathan Gallacher
Date Added: 14 May 2024 09:57
Last Modified: 14 May 2024 09:57
URI: https://irep.ntu.ac.uk/id/eprint/51433
