11 Apr 2018

RNAcentral release 9



A new version of RNAcentral (release 9) is available. This release features the Rat Genome Database (RGD) as a new Expert Database as well as updated data from ENA, RefSeq, Ensembl, PDBe and HGNC. Additionally, we have added more search options, provided a feature viewer for sequences and improved the display of genomic locations.

More search options

A common request has been to make it easy to search by length. This was already possible using the advanced syntax documented in our help, but it wasn’t intuitive, so we added a length slider.
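For readers who prefer to script such searches, here is a minimal sketch of how a length-restricted query might be built. It assumes a Lucene-style range of the form `length:[MIN TO MAX]` and a search URL of the form `https://rnacentral.org/search?q=...`; neither is spelled out in this post, so please check the help pages for the exact syntax. The function names are made up for illustration.

```python
from urllib.parse import quote


def length_query(term, min_len, max_len):
    """Combine a free-text term with a length range (assumed syntax)."""
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    # Assumption: the search accepts Lucene-style ranges on a `length` field.
    return f"{term} AND length:[{min_len} TO {max_len}]"


def search_url(query):
    """Build a search URL; the `q` parameter name is an assumption."""
    return "https://rnacentral.org/search?q=" + quote(query)


query = length_query("HOTAIR", 1000, 3000)
print(query)
print(search_url(query))
```

The slider does the same job interactively; a helper like this would mainly be useful for bookmarking or sharing a search.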



Another request has been for sorting options. You can now choose how results are sorted instead of relying on our default order: by popular species and descending length, or by length alone in descending or ascending order. Let us know which other sorting options or search features you would like!

Sequence feature viewer

Now that RNAcentral shows Rfam annotations of sequences, we want to ensure these results are easy to understand. To do this we added a feature viewer for sequences, which we use to show the Rfam annotations of sequences and any modifications or non-standard bases. This viewer is particularly nice in cases where a sequence is composed of several Rfam models, like:


Here are a few interesting examples. Can you find any others?
  1. An incomplete sequence
  2. A complex sequence
  3. A simple, well annotated sequence
  4. An example that shows the evolutionary history of the 5.8S

More useful genomic location display

RNAcentral displays the genomic location of ncRNAs in selected organisms. For example, the genome browser shows human HOTAIR on chromosome 12. RNAcentral now also has a table that summarizes all known locations, with the current sequence highlighted.


This helps clarify the localizations when a sequence is found in many databases or locations. For example, the human hsa-mir-10a precursor is only found on chromosome 17, while the human hsa-mir-3648 precursor appears twice on chromosome 21. Without the summary table you would have to carefully read the entire cross reference table to learn this. Additionally, this table provides links to view this region in the Ensembl and UCSC genome browsers. We are working on other improvements to our genomic locations, so stay tuned for big changes!

Welcome RGD

We have imported another Model Organism Database, RGD. This database serves as the primary resource for genomic, phenotype and disease data generated from rat research.

Other data updates

The following databases have also been updated:
  • ENA (release 134) 
  • Ensembl (release 91) 
  • RefSeq 
  • PDB 
  • HGNC 
  • WormBase

Get in touch

The data are available on the RNAcentral website, via the API, and in the FTP archive. We plan to make the next release available in June, 2018. In the meantime, if you have any feedback please feel free to get in touch by email, on Twitter, or by submitting an issue on GitHub. We look forward to hearing from you!

6 Dec 2017

RNAcentral release 8



A new version of RNAcentral (release 8) is now available. This release includes Mouse Genome Informatics (MGI) as a new Expert Database as well as new data from ENA, PDB, Ensembl, snoPY, RefSeq, HGNC, and Rfam. This release also features Rfam annotations of all sequences and secondary structures from GtRNAdb.

Rfam annotations for all sequences

Rfam is a database of functional non-coding RNA families. It provides covariance models that can be used by the Infernal software to classify ncRNA sequences into families.


Starting with this release, all RNAcentral sequences are compared against all Rfam families. Each RNA sequence page has a new Rfam section showing whether the sequence matched an Rfam family. For example, sequence URS00005B7DD8, originally annotated as miscellaneous RNA (misc_RNA), matches a conserved domain of the MALAT1 Rfam family and contains MEN beta RNA that is processed from MALAT1 by RNase P:




Rfam annotations may provide additional functional context to poorly annotated sequences and help identify potential problems.


The majority of RNAcentral sequences (84%) were mapped to one or more Rfam families. However, not all RNAcentral sequences are expected to match Rfam families, as Rfam does not include piRNAs, full-length lncRNAs, and several other ncRNA types. To find out more about this work, see the latest Rfam paper.

Quality control based on Rfam annotations

The automatic annotations based on Rfam classification can be compared with the annotations provided by Expert Databases, which provides an opportunity for quality control. Currently, RNAcentral warns about three types of potential problems:
  1. Incomplete sequences
    When an RNAcentral sequence matches only a part of the Rfam covariance model, the sequence is flagged as incomplete. For example, the following sequence matches less than half of the model:

    More than 4.5 million RNAcentral sequences fall into this category, most of which are partial rRNA sequences.
  2. Possible contamination
    When a Eukaryotic sequence matches an Rfam family that is only found in Bacteria, this could indicate bacterial contamination or taxonomic misclassification. For example, the following Eukaryotic sequence matches Archaeal rRNA:

  3. Missing Rfam hits
    The majority of RNAcentral sequences annotated as rRNA or tRNA match the corresponding Rfam families. However, some sequences do not match the expected Rfam families, which could mean either that the sequence has an incorrect RNA type or that the Rfam model needs to be updated. For example, the following tRNA sequence did not match any Rfam families, which may indicate a problem:



The table below shows the number of sequences with and without annotation problems:


Type                       Number of sequences
No problems detected       8,188,217
Incomplete sequence        4,501,585
Missing hit                625,197
Potential contamination    58,536
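The three checks described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code RNAcentral runs: the `RfamHit` structure, the `qc_warnings` function, the 0.5 coverage cut-off, and the family name in the usage lines are all hypothetical, and the real pipeline may use different thresholds and taxonomy rules.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class RfamHit:
    family: str                     # Rfam family name (illustrative)
    model_coverage: float           # fraction of the model matched, 0.0-1.0
    family_domains: FrozenSet[str]  # domains of life where the family occurs


def qc_warnings(rna_type: str, domain: str, hits: List[RfamHit],
                min_coverage: float = 0.5) -> List[str]:
    """Return the warnings that apply to one sequence (hypothetical rules)."""
    if not hits:
        # Missing hit: only rRNA and tRNA are expected to match Rfam here.
        return ["missing hit"] if rna_type in {"rRNA", "tRNA"} else []

    warnings = []
    # Judge the sequence by its best-covered hit.
    best = max(hits, key=lambda hit: hit.model_coverage)
    if best.model_coverage < min_coverage:
        warnings.append("incomplete sequence")
    # A eukaryotic sequence matching a Bacteria-only family is suspicious.
    if domain == "Eukaryota" and best.family_domains == frozenset({"Bacteria"}):
        warnings.append("possible contamination")
    return warnings


partial_rrna = RfamHit("bacterial SSU rRNA", 0.4, frozenset({"Bacteria"}))
print(qc_warnings("rRNA", "Bacteria", [partial_rrna]))
print(qc_warnings("tRNA", "Eukaryota", []))
```

A sequence can collect more than one warning in this sketch, whereas the table above counts each category separately.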


Rfam warnings are displayed in text search results. For example, C. elegans RNA URS000049E54F_6239 matches a Bacterial RNA, which is surprising and may require further investigation:


A new text search facet lets you filter sequences based on the quality controls:


It is important to interpret the results of this automatic analysis with caution. For example, eukaryotic sequences found in organelles are expected to match bacterial Rfam models. However, you may see warnings on some RNAcentral pages when the software cannot recognise that the sequence comes from an organelle.


Read more about the quality checks in the documentation and let us know if you have any feedback.

tRNA secondary structures from GtRNAdb

Following a major upgrade of the tRNAscan-SE software, GtRNAdb now provides a much broader range of tRNA sequences, including tRNAs with possible introns. RNAcentral imported Bacterial, Archaeal, and selected model organism sequences from GtRNAdb, increasing the coverage from 382 species to 4,239.


RNAcentral also displays RNA secondary structures provided by GtRNAdb using Forna, for example:


Welcome MGI

We have imported a new Model Organism Database, MGI (Mouse Genome Informatics), which serves as a primary resource for a spectrum of genetic, genomic and biological data supporting the use of the mouse as a model for understanding human biology and disease.


Other data updates

The following databases have also been updated:
  • ENA (release 133)
  • Ensembl (release 90)
  • RefSeq
  • PDB
  • HGNC
  • FlyBase
  • WormBase

Get in touch

The data are available on the RNAcentral website, via the API, and in the FTP archive. We plan to make the next release available in February, 2018. In the meantime, if you have any feedback please feel free to get in touch by email, on Twitter, or by submitting an issue on GitHub. We look forward to hearing from you!

17 May 2017

RNAcentral release 7

We are happy to announce that the seventh release of RNAcentral is now available. The latest release includes FlyBase, Ensembl, and GENCODE as new Expert Databases as well as updates from ENA, RefSeq, snoPY, HGNC, and PDB. The data are available on the RNAcentral website, via the API, and in the FTP archive.

Welcome FlyBase

Model Organism Databases (MODs) perform an invaluable service for the community by annotating genomes of key species, such as worm, yeast, and fly, with functional information. RNAcentral already links to four MODs (dictyBase, PomBase, SGD, and WormBase). Starting with this release we also integrate ncRNAs from FlyBase, the database for Drosophila genes and genomes. FlyBase contributed over 13 thousand ncRNA sequences from 12 Drosophila species, with the majority coming from D. melanogaster. You can browse FlyBase sequences or view the FlyBase summary page in RNAcentral.

Goodbye Vega, hello Ensembl

Since the first RNAcentral release, the Vega database provided RNAcentral with high quality annotations of human and mouse genomes. Recently, the Vega website has been archived, but the HAVANA team continues producing annotations and makes them available in Ensembl and GENCODE. In this release we retire Vega and begin importing non-coding RNAs from Ensembl, including GENCODE.
Ensembl provides manually curated, experimentally verified gene annotations for human and mouse genomes, as well as comprehensive gene annotations for over 60 other vertebrate genomes. Ensembl releases are built off GENCODE annotations for human and mouse where possible. We have imported release 87 of Ensembl, which contains 346,509 ncRNA sequences from 66 organisms.

Better sequence descriptions

We made improvements to the descriptions that RNAcentral displays for each sequence. We try to select the most informative name from the available descriptions submitted by different sources. Here is an example description, where the new name is much more specific than the old one:
The selected names come from high quality data sources like GENCODE, HGNC, and miRBase. This work is an ongoing process and we are always happy to get feedback on descriptions that could be improved.

Improved genome browser

RNAcentral features a genome browser that shows RNAcentral sequences alongside genes and transcripts from Ensembl or Ensembl Genomes. The browser now supports deep linking: the URL is continuously updated as you scroll around or switch between species, so you can bookmark your favorite view to come back to later or share the URL with anyone. For example, here is a link to the mouse Xist gene. The genome browser also received a fresh coat of paint and was updated to the latest version of Genoverse.

Other updates


  • Latest data from RefSeq, PDB, HGNC, and ENA 
  • Rfam families from release 12.2 and select cis-regulatory families (for example, here are SAM riboswitch sequences) 
  • Mouse sequences from snoPY 
  • Chicken, chimp, rat and cow sequences from NONCODE 
  • The GPI file now contains ncRNA types
  • New FTP section for database identifiers. We have added database-specific mappings from RNAcentral URS identifiers to database ids in ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/7.0/id_mapping/database_mapping. This is a preliminary release of the data and will be refined in the future. The format and contents may change. We are looking for feedback on the current files.
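As a rough idea of how the identifier mappings could be consumed, the sketch below groups external ids by URS. The column layout (URS, database name, external id, separated by tabs) is an assumption, since the file format is preliminary and may change, and `load_mapping` is a made-up helper. The first sample row uses the HOTAIR pairing quoted in the release 6 notes further down this page; the second row is a placeholder, not real data.

```python
import csv
import io
from collections import defaultdict


def load_mapping(handle):
    """Group (database, external id) pairs by URS from tab-separated text."""
    mapping = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        # Assumed column order: URS, database, external id.
        urs, database, external_id = row[:3]
        mapping[urs].append((database, external_id))
    return dict(mapping)


# In-memory sample standing in for a downloaded mapping file.
sample = io.StringIO(
    "URS000075C808\tREFSEQ\tNR_003716\n"
    "URS_PLACEHOLDER\tEXAMPLE_DB\tEXAMPLE_ID\n"
)
mapping = load_mapping(sample)
print(mapping["URS000075C808"])
```

Reading from a file handle rather than a path keeps the helper easy to try on a few pasted lines before pointing it at a real download.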

Database growth over time

RNAcentral now contains almost 13 million unique RNA sequences from 25 Expert Databases. There are 760 thousand new distinct ncRNA sequences and 2 million additional cross-references compared to release 6. To see how the RNAcentral database grew over time, explore our interactive charts.

Get in touch

We plan to make the next release available in August, 2017. In the meantime, if you have any feedback please feel free to get in touch by email, on Twitter, or by submitting an issue on GitHub. We look forward to hearing from you!

6 Jan 2017

RNAcentral release 6



To kick off 2017 we are happy to announce that a new version of RNAcentral is now available. The latest release includes official human gene names from the HGNC database as well as new data from ENA, RefSeq, and PDB. The data are available on the RNAcentral website, via the API, and in the FTP archive.

Official human ncRNA gene names from HGNC

Starting from this release, RNAcentral links to ncRNAs from HGNC, which is a database that assigns unique and stable names to human genes. The HGNC gene symbols are the official names for human genes and are widely used in the literature and across many resources. You can find out more about HGNC in this NAR paper.

In addition to gene names, HGNC provides manually curated links to relevant publications and database accessions from RefSeq, Vega, and other resources. We used these accessions to map HGNC identifiers to RNAcentral entries so that each HGNC entry is matched to one RNAcentral sequence. For example, the HGNC entry for HOTAIR corresponds to RefSeq accession NR_003716, which is found in RNAcentral under the identifier URS000075C808.

As a result of this mapping, over 95% of the 6,357 HGNC ncRNA genes were connected to RNAcentral identifiers using RefSeq, Vega, or GtRNAdb identifiers from HGNC. If none of these were found in RNAcentral, we retrieved sequences for Ensembl genes (where available) using the Ensembl REST API and matched them to RNAcentral accessions by sequence identity. Only about 300 HGNC ncRNA entries (<5%) remained unmapped, most of which are piRNA clusters, rRNAs, and snoRNAs. Some of these ncRNAs will be matched to RNAcentral in future releases, as they get integrated into RefSeq and other RNAcentral databases. Other RNAs, such as piRNA clusters, are unlikely to be mapped to RNAcentral because they correspond to a large number of RNA sequences.
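The two-step mapping described above could be sketched like this. Everything here is illustrative: the entry keys (`refseq`, `vega`, `gtrnadb`, `ensembl_gene`), the lookup dictionaries, and the `fetch_ensembl_sequence` callback are hypothetical stand-ins, and treating "sequence identity" as an exact match after normalising case and U/T is our simplifying assumption. No network calls are made; the Ensembl lookup is passed in as a function.

```python
def normalise(seq):
    """Upper-case and write U as T so RNA and DNA spellings compare equal."""
    return seq.upper().replace("U", "T")


def map_hgnc_entry(entry, accession_to_urs, sequence_to_urs,
                   fetch_ensembl_sequence):
    """Return a URS identifier for one HGNC entry, or None if unmapped."""
    # Step 1: try the accessions curated by HGNC, in a fixed order.
    for key in ("refseq", "vega", "gtrnadb"):
        accession = entry.get(key)
        if accession and accession in accession_to_urs:
            return accession_to_urs[accession]
    # Step 2: fall back to the Ensembl gene sequence, matched exactly.
    # Keys of sequence_to_urs are expected to be normalised already.
    gene_id = entry.get("ensembl_gene")
    if gene_id:
        sequence = fetch_ensembl_sequence(gene_id)
        if sequence:
            return sequence_to_urs.get(normalise(sequence))
    return None


# The HOTAIR pairing comes from the text above; the rest is placeholder data.
accession_to_urs = {"NR_003716": "URS000075C808"}
hotair = {"symbol": "HOTAIR", "refseq": "NR_003716"}
print(map_hgnc_entry(hotair, accession_to_urs, {}, lambda gene_id: None))
```

Entries that fall through both steps return `None`, which corresponds to the unmapped remainder discussed above.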

Browse ncRNAs from HGNC or view the HGNC summary page in RNAcentral.

Database growth over time

RNAcentral now contains almost 11 million unique RNA sequences from 23 Expert Databases. There are 750 thousand new distinct ncRNA sequences and 2 million additional cross-references from ENA, PDB, and RefSeq in release 6 compared to release 5. To see how the RNAcentral database grew over time, explore the interactive charts at the RNAcentral stats page.

New NAR paper

If you haven’t seen the latest RNAcentral paper, the final version was published in the 2017 Database Issue of Nucleic Acids Research.

Get in touch

We plan to make the next release available in March, 2017. In the meantime, if you have any feedback please feel free to get in touch by email, on Twitter, or by submitting an issue on GitHub. We look forward to hearing from you!

1 Nov 2016

New paper is out



A new RNAcentral paper has been published in Nucleic Acids Research as part of the 2017 Database issue. The paper gives an overview of activities over the last two years since the first RNAcentral release and shows examples of how RNAcentral is used in the wild.

Read the paper at the NAR website and feel free to get in touch with any comments or feedback.