6 Dec 2017

RNAcentral release 8



A new version of RNAcentral (release 8) is now available. This release includes Mouse Genome Informatics (MGI) as a new Expert Database as well as new data from ENA, PDB, Ensembl, snoPY, RefSeq, HGNC, and Rfam. It also features Rfam annotations for all sequences, as well as tRNA secondary structures from GtRNAdb.

Rfam annotations for all sequences

Rfam is a database of functional non-coding RNA families. It provides covariance models that can be used by the Infernal software to classify ncRNA sequences into families.


Starting with this release, all RNAcentral sequences are compared against all Rfam families. Each RNA sequence page has a new Rfam section showing whether the sequence matched an Rfam family. For example, sequence URS00005B7DD8, originally annotated as miscellaneous RNA (misc_RNA), matches a conserved domain of the MALAT1 Rfam family and contains MEN beta RNA that is processed from MALAT1 by RNase P:




Rfam annotations may provide additional functional context to poorly annotated sequences and help identify potential problems.


The majority of RNAcentral sequences (84%) were mapped to one or more Rfam families. However, not all RNAcentral sequences are expected to match Rfam families, as Rfam does not include piRNAs, full-length lncRNAs, and several other ncRNA types. To find out more about this work, see the latest Rfam paper.

Quality control based on Rfam annotations

The automatic annotations based on Rfam classification can be compared with the annotations provided by Expert Databases, which provides an opportunity for quality control. Currently, RNAcentral warns about three types of potential problems:
  1. Incomplete sequences
    When an RNAcentral sequence matches only a part of the Rfam covariance model, the sequence is flagged as incomplete. For example, the following sequence matches less than half of the model:

    More than 4.5 million RNAcentral sequences fall into this category, most of which are partial rRNA sequences.
  2. Possible contamination
    When a Eukaryotic sequence matches an Rfam family that is only found in Bacteria, this could indicate bacterial contamination or taxonomic misclassification. For example, the following Eukaryotic sequence matches Archaeal rRNA:

  3. Missing Rfam hits
    The majority of RNAcentral sequences annotated as rRNA or tRNA match the corresponding Rfam families. However, some sequences do not match the expected Rfam families, which could mean either that the sequence has an incorrect RNA type or that the Rfam model needs to be updated. For example, the following tRNA sequence did not match any Rfam families, which may indicate a problem:
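The three checks above can be sketched in a few lines of Python. This is only an illustration of the logic as described in this post, not the code RNAcentral runs: the `RfamHit` and `Sequence` classes, the field names, and the 0.5 coverage threshold are all hypothetical, and the "missing hit" rule is simplified to "no Rfam hits at all" rather than "no hit to the expected family".

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class RfamHit:
    """Hypothetical record of one match between a sequence and an Rfam family."""
    family: str
    model_coverage: float           # fraction of the covariance model matched (0..1)
    family_domains: FrozenSet[str]  # taxonomic domains in which the family is found


@dataclass
class Sequence:
    """Hypothetical, minimal view of an RNAcentral sequence."""
    urs: str
    rna_type: str                   # e.g. "rRNA", "tRNA", "misc_RNA"
    domain: str                     # "Eukaryota", "Bacteria" or "Archaea"
    hits: List[RfamHit] = field(default_factory=list)
    from_organelle: bool = False    # organelle sequences are expected to look bacterial


def rfam_warnings(seq: Sequence, min_coverage: float = 0.5) -> List[str]:
    """Return the warnings that apply to a sequence (threshold is an assumption)."""
    warnings = []

    # 1. Incomplete sequence: a hit covers only part of the covariance model.
    if any(hit.model_coverage < min_coverage for hit in seq.hits):
        warnings.append("incomplete_sequence")

    # 2. Possible contamination: a eukaryotic, non-organelle sequence matches
    #    a family that is only found in Bacteria.
    if seq.domain == "Eukaryota" and not seq.from_organelle:
        if any(hit.family_domains == frozenset({"Bacteria"}) for hit in seq.hits):
            warnings.append("possible_contamination")

    # 3. Missing Rfam hit (simplified): an rRNA or tRNA with no Rfam hits at all.
    if seq.rna_type in {"rRNA", "tRNA"} and not seq.hits:
        warnings.append("missing_rfam_hit")

    return warnings
```

Skipping the contamination check when `from_organelle` is set avoids false warnings for mitochondrial and chloroplast sequences, which are expected to match bacterial models. This only works when the organelle origin is actually recorded for the sequence.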



The table below shows the number of sequences with and without annotation problems:


Type                      Number of sequences
No problems detected      8,188,217
Incomplete sequence       4,501,585
Missing hit               625,197
Potential contamination   58,536
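
For a sense of proportion, the four counts can be turned into shares of their sum. This is a minimal sketch that assumes the categories are mutually exclusive, which this post does not state. If a sequence can carry more than one warning, the sum overcounts and the shares are only approximate.

```python
# Counts copied from the table in this post.
counts = {
    "No problems detected": 8_188_217,
    "Incomplete sequence": 4_501_585,
    "Missing hit": 625_197,
    "Potential contamination": 58_536,
}

# Assumption: each sequence falls into exactly one category.
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```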


Rfam warnings are displayed in text search results. For example, C. elegans RNA URS000049E54F_6239 matches a Bacterial RNA, which is surprising and may require further investigation:


There is a new text search facet that lets you filter sequences based on the quality controls:


It is important to interpret the results of this automatic analysis with caution. For example, eukaryotic sequences found in organelles are expected to match bacterial Rfam models. However, you may see warnings on some RNAcentral pages when the software cannot recognise that the sequence comes from an organelle.


Read more about the quality checks in the documentation and let us know if you have any feedback.

tRNA secondary structures from GtRNAdb

Following a major upgrade of the tRNAscan-SE software, GtRNAdb now provides a much broader range of tRNA sequences, including tRNAs with possible introns. RNAcentral imported Bacterial, Archaeal, and selected model organism sequences from GtRNAdb, increasing the coverage from 382 species to 4,239.


RNAcentral also displays RNA secondary structures provided by GtRNAdb using Forna, for example:


Welcome MGI

We have imported a new Model Organism Database, MGI (Mouse Genome Informatics), which serves as a primary resource for a spectrum of genetic, genomic and biological data supporting the use of the mouse as a model for understanding human biology and disease.


Other data updates

The following databases have also been updated:
  • ENA (release 133)
  • Ensembl (release 90)
  • RefSeq
  • PDB
  • HGNC
  • FlyBase
  • WormBase

Get in touch

The data are available on the RNAcentral website, via the API, and in the FTP archive. We plan to make the next release available in February 2018. In the meantime, if you have any feedback, please feel free to get in touch by email, on Twitter, or by submitting an issue on GitHub. We look forward to hearing from you!
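
As a starting point for API access, the sketch below retrieves a single sequence record using only the Python standard library. The URL layout (`https://rnacentral.org/api/v1/rna/<URS id>`) and the helper names `rna_url` and `fetch_rna` are assumptions made for illustration. Please check the API documentation for the current endpoints and response fields.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed API root; verify against the RNAcentral API documentation.
API_ROOT = "https://rnacentral.org/api/v1"


def rna_url(urs: str) -> str:
    """Build the (assumed) URL of the record for one URS identifier."""
    return f"{API_ROOT}/rna/{quote(urs)}"


def fetch_rna(urs: str, timeout: float = 10.0) -> dict:
    """Download one sequence record and parse it as JSON."""
    request = Request(rna_url(urs), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # URS00005B7DD8 is the example sequence mentioned earlier in this post.
    record = fetch_rna("URS00005B7DD8")
    print(sorted(record.keys()))
```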