RNAcentral Blog: RNAcentral Release 26

We are thrilled to announce RNAcentral release 26, which introduces a major milestone in how we organize and present non-coding RNA data: the creation of gene-level entries for all sequences across 204 organisms.

Why genes matter

Until now, RNAcentral has been a sequence-based resource, where each unique RNA sequence receives its own identifier (URS ID) and is treated as a separate entry. While this approach has served us well, it has created some challenges for our users:

Fragmented related sequences: Transcripts that differ by just a single nucleotide are treated as completely separate entries, even when they represent variants of the same biological entity.
Confusing multiplicity: For highly studied genes like rRNAs, there can be thousands of near-identical sequences with slight variations in length or a few nucleotide changes.
Lost biological context: Many experiments and research questions operate at the gene level rather than the transcript level, making it difficult to find all relevant sequences for a particular gene.

These issues led us to develop a gene building pipeline that groups related transcripts together while maintaining RNAcentral's comprehensive coverage.

Building genes at scale

Creating gene-level entries for RNAcentral presented unique challenges. We couldn't simply adopt gene definitions from resources like Ensembl, because RNAcentral contains many sequences absent from other databases. We needed an automated approach that could:

Build gene-level entries comparable to those in Ensembl
Handle the addition of thousands of new sequences with each release
Maintain stable gene identifiers even as transcripts are added or modified
Work consistently across all organisms in RNAcentral

Our solution uses a graph clustering algorithm combined with machine learning. We compare pairs of transcripts using a random forest model trained on manually curated human genes from Ensembl/GENCODE. The model considers three types of features: the distance between transcript start sites, the overlap of exons, and the similarity of RNA types. Transcripts predicted to come from the same gene are connected in a graph, and communities within this graph become our genes.
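The general shape of this approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RNAcentral pipeline: the same_gene function is a hypothetical hand-written rule standing in for the trained random forest, the 500 nt threshold and all identifiers are made up, and connected components are used as the simplest possible form of graph community, since the community detection method is not named here.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Transcript:
    urs: str                              # made-up transcript identifier
    chrom: str
    strand: str
    start: int                            # transcript start site
    exons: tuple[tuple[int, int], ...]    # inclusive (start, end) exon intervals
    rna_type: str


def exon_overlap(a: Transcript, b: Transcript) -> int:
    """Number of bases shared between the exons of two transcripts."""
    return sum(
        max(0, min(a_end, b_end) - max(a_start, b_start) + 1)
        for a_start, a_end in a.exons
        for b_start, b_end in b.exons
    )


def same_gene(a: Transcript, b: Transcript) -> bool:
    """Hypothetical rule standing in for the trained pairwise classifier."""
    if (a.chrom, a.strand) != (b.chrom, b.strand):
        return False
    close_starts = abs(a.start - b.start) <= 500   # illustrative threshold
    return close_starts and exon_overlap(a, b) > 0 and a.rna_type == b.rna_type


def build_genes(transcripts: list[Transcript]) -> list[set[str]]:
    """Link transcripts predicted to share a gene and return the linked groups."""
    parent = {t.urs: t.urs for t in transcripts}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps lookups short
            x = parent[x]
        return x

    # An edge between two transcripts merges their groups (union-find).
    for a, b in combinations(transcripts, 2):
        if same_gene(a, b):
            parent[find(a.urs)] = find(b.urs)

    groups: dict[str, set[str]] = {}
    for t in transcripts:
        groups.setdefault(find(t.urs), set()).add(t.urs)
    return list(groups.values())


transcripts = [
    Transcript("URS_A", "chr1", "+", 1000, ((1000, 1200), (1500, 1700)), "lncRNA"),
    Transcript("URS_B", "chr1", "+", 1010, ((1010, 1200), (1500, 1650)), "lncRNA"),
    Transcript("URS_C", "chr1", "+", 9000, ((9000, 9100),), "lncRNA"),
]
genes = build_genes(transcripts)
print(genes)
```

Comparing every pair of transcripts is quadratic, so a sketch like this would only be practical after restricting comparisons to transcripts that lie near each other on the genome.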

To ensure genes remain stable across releases, we developed logic to track and merge gene clusters between versions. This means that as RNAcentral grows, gene identifiers persist even as new transcript variants are discovered and added.
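One simple way to picture identifier persistence is to let each new cluster inherit the identifier of the previous gene it shares the most transcripts with. The sketch below assumes exactly that greedy rule; the function name carry_over_ids, the GENE_n placeholder identifiers, and the tie-breaking behaviour are all hypothetical and are not taken from the RNAcentral tracking logic.

```python
from itertools import count
from typing import Callable


def carry_over_ids(
    previous: dict[str, set[str]],
    clusters: list[set[str]],
    mint_id: Callable[[], str],
) -> dict[str, set[str]]:
    """Reuse the ID of the best-overlapping previous gene, else mint a new one."""
    result: dict[str, set[str]] = {}
    used: set[str] = set()
    for cluster in clusters:
        # Previous genes that share a transcript with this cluster
        # and whose ID has not already been handed to another cluster.
        candidates = [
            (len(members & cluster), gene_id)
            for gene_id, members in previous.items()
            if gene_id not in used and members & cluster
        ]
        if candidates:
            _, gene_id = max(candidates)   # largest shared-transcript count wins
            used.add(gene_id)
        else:
            gene_id = mint_id()            # no overlap: this is a new gene
        result[gene_id] = cluster
    return result


# Made-up data: two existing genes, then a release that adds transcripts t4-t6.
previous = {"GENE_1": {"t1", "t2"}, "GENE_2": {"t3"}}
clusters = [{"t1", "t2", "t4"}, {"t3", "t5"}, {"t6"}]
counter = count(3)
result = carry_over_ids(previous, clusters, lambda: f"GENE_{next(counter)}")
print(result)
```

In this toy run the two existing genes keep their identifiers while gaining a transcript each, and only the cluster with no overlap receives a new identifier.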

What's in this release

RNAcentral release 26 now contains 103,814 human non-coding RNA genes built from 600,225 transcripts, covering 56 different RNA types. The most abundant are long non-coding RNAs (lncRNAs) with 65,187 genes, followed by antisense lncRNAs (16,790 genes) and pre-miRNAs (8,560 genes). Beyond human, we have predicted 367,909 genes across 203 species, totalling 1,189,743 transcripts. The average species has about 1,800 predicted genes.

The average ncRNA gene in RNAcentral contains 6 transcripts, though this varies widely depending on the RNA type. We've successfully built single genes for well-studied lncRNAs like MALAT1 and NEAT1, and correctly separated miRNA sequences that map to different genomic locations.
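For readers who want to check the headline numbers, the short calculation below derives the per-species and per-gene averages from the totals quoted above. It uses only those totals. Which subset of genes the quoted average of 6 transcripts refers to is not stated, so the comparison is only indicative.

```python
# Totals quoted for release 26.
human_genes, human_transcripts = 103_814, 600_225
other_genes, other_transcripts, other_species = 367_909, 1_189_743, 203

genes_per_species = other_genes / other_species
human_transcripts_per_gene = human_transcripts / human_genes
other_transcripts_per_gene = other_transcripts / other_genes
total_genes = human_genes + other_genes

print(f"{genes_per_species:.0f} predicted genes per non-human species")
print(f"{human_transcripts_per_gene:.1f} transcripts per human gene")
print(f"{other_transcripts_per_gene:.1f} transcripts per non-human gene")
print(f"{total_genes:,} genes in total")
```

The non-human figure works out to roughly 1,812 genes per species. The human ratio is about 5.8 transcripts per gene, close to the quoted average of 6, while the non-human ratio is about 3.2.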

Each gene receives a unique identifier following the pattern RNACG<species-prefix><11-digit hash>.<version>, allowing you to track genes across releases. Genes are assigned RNA types and descriptions based on expert database annotations, Rfam families, and R2DT structural information.
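If you need to split such identifiers in your own scripts, a regular expression is usually enough. The sketch below makes several assumptions that are not confirmed here: that the hash consists of exactly 11 decimal digits, that the species prefix is alphanumeric, and that the version is an integer. The identifier RNACGXX00000012345.2 is a made-up placeholder, not a real gene.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: RNACG + alphanumeric prefix + 11 digits + "." + integer version.
GENE_ID = re.compile(
    r"^RNACG(?P<prefix>[A-Za-z0-9]+?)(?P<digits>\d{11})\.(?P<version>\d+)$"
)


class GeneId(NamedTuple):
    species_prefix: str
    digits: str      # kept as text so leading zeros survive
    version: int


def parse_gene_id(text: str) -> Optional[GeneId]:
    """Split a gene identifier into its parts, or return None if it does not match."""
    match = GENE_ID.match(text)
    if match is None:
        return None
    return GeneId(match["prefix"], match["digits"], int(match["version"]))


parsed = parse_gene_id("RNACGXX00000012345.2")   # placeholder identifier
print(parsed)
print(parse_gene_id("RNACGXX123.1"))             # too few digits, so no match
```

Because the 11 digits are anchored directly before the dot, the split stays unambiguous even if the prefix itself ends in digits.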

How to use genes in RNAcentral

We've made genes accessible throughout the RNAcentral website:

Text search: Select 'Genes' in the Entry Type facet to see only gene-level results
Sequence pages: All transcript entries now link to their parent gene (if applicable), e.g. https://rnacentral.org/rna/URS0000D59DC9/9606
GFF files: Genes appear as 'predicted_gene' entries in our downloadable annotation files, available on our FTP site
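To pull the gene entries out of a downloaded annotation file, something like the following sketch may be a starting point. It assumes the files follow the standard nine-column GFF3 layout and that 'predicted_gene' appears in the feature type column (column 3). The sample lines, the source value and the attribute names are invented for illustration and may not match the real files.

```python
import io
from typing import Iterable, Iterator


def predicted_genes(lines: Iterable[str]) -> Iterator[tuple]:
    """Yield (seqid, start, end, strand, attributes) for 'predicted_gene' rows."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank lines and GFF directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "predicted_gene":
            continue                      # assumption: type lives in column 3
        seqid, _source, _type, start, end, _score, strand, _phase, attrs = cols
        attributes = dict(
            pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
        )
        yield seqid, int(start), int(end), strand, attributes


# Made-up sample standing in for a downloaded file.
sample = io.StringIO(
    "##gff-version 3\n"
    "chr1\tRNAcentral\tpredicted_gene\t1000\t1700\t.\t+\t.\tID=example_gene\n"
    "chr1\tRNAcentral\ttranscript\t1000\t1700\t.\t+\t.\tID=example_transcript\n"
)
rows = list(predicted_genes(sample))
print(rows)
```

For a real file, pass an open file handle instead of the StringIO object; compressed downloads can be opened in text mode with gzip.open(path, "rt") from the standard library.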

We recognize that some complex lncRNAs with extensive alternative splicing may not yet be built perfectly, and we're actively working on improvements. This is the first iteration of our gene building pipeline, and we'll continue to refine it based on user feedback and emerging edge cases.

Database Updates

Release 26 does not update any of the underlying data, so database versions are as they were in Release 25.

Get in touch

As always, all data are freely available on the RNAcentral website, via the API, our public database, and in the FTP archive. If you have any feedback about the new gene-level entries or any other aspect of RNAcentral, please get in touch by email, on Twitter, or by submitting an issue on GitHub. We look forward to hearing from you!

8 Oct 2025
