A major part of any data-science project consists of finding appropriate data that contain enough signal to tackle the problem of interest. Cleaning the data, to ensure uniform measurements, compatibility between the various data sources and protocols, and a reasonable amount of noise, is then sometimes the most time-consuming step.
Here we propose a first attempt at a standardized, automatically generated dataset dedicated to RNA, combining:
We hope this dataset will speed up advances in machine-learning-based approaches to RNA secondary and/or 3D structure prediction by sparing researchers the time spent on data gathering and cleaning.
Extract the archives with the commands gunzip RNANet.db.gz, which recreates RNANet.db, and tar -xvzf RNANET_datapoints_latest.tar.gz, which recreates a folder of text files.
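For users who prefer to stay inside Python (for instance on a system without gunzip or tar), the same extraction can be sketched with the standard library alone. The helper names gunzip_file and extract_tar_gz are hypothetical and are not part of RNANet; only the archive names come from the text above.

```python
import gzip
import shutil
import tarfile


def gunzip_file(src, dest):
    """Decompress a .gz file into dest. Unlike gunzip, the archive is kept."""
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


def extract_tar_gz(archive, dest_dir="."):
    """Unpack a gzip-compressed tar archive into dest_dir.

    Only use this on archives from a source you trust: extractall()
    writes whatever paths the archive contains.
    """
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=dest_dir)


# Example calls, assuming both archives sit in the current directory:
# gunzip_file("RNANet.db.gz", "RNANet.db")
# extract_tar_gz("RNANET_datapoints_latest.tar.gz")
```

Copying with shutil.copyfileobj streams the database in chunks, so the whole file never has to fit in memory.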
Additional files and metadata:
These may help you assess the quality of the dataset or perform further filtering.
You can also browse all past releases of the flat-text files (approx. one per month).
RNANet is updated monthly to take newer structures from the BGSU representative sets into account.
RNANet is updated from scratch twice a year to take into account new sequences from Rfam, and updates in covariance models or PDB structures.
You can read our OpenAccess paper. Extensive documentation can also be found in the code repository:
For each RNA chain available in 3D and mapped to an RNA family, we provide the following list of descriptors:
|Description||Column name(s)||Type or values|
|Index of the residue in the chain (from 1 to N)||index_chain||int >= 1|
|Position of the nucleotide in the Rfam family model (covariance model). NaN in insertions.||cm_coord||int >= 1|
|Position of the nucleotide in the multiple-sequence alignment of the 3D chains mapped to this family.||index_small_ali||int >= 1|
|Index of the residue in the source mmCIF file||old_nt_resnum||int >= 1|
|Position of the nucleotide in the chain, normalized by its length (value between 0 and 1)||nt_position||float|
|Nucleotide name, including modified bases (like 5MC)||nt_name||str|
|One-letter name. Lowercase "acgtu" letters are used for modified "ACGTU" bases||nt_code||char|
|Letter used for sequence alignment (A,C,G,U,N, or -)||nt_align_code||char|
|One-hot encoded sequence. 'other' contains gaps, unknown and modified nucleotides||is_A, is_C, is_G, is_U, is_other||0 or 1|
|Nucleotide frequencies (PSSM) at the current position in this RNA family||freq_A, freq_C, freq_G, freq_U, freq_other||float|
|Frequency of gaps at this position in the alignment (between 0.0 and 1.0)||gap_percent||float|
|Consensus nucleotide from the alignment at this position (A,C,G,U,N or -).||consensus||char|
|Secondary structure in dot-bracket notation of this position||dbn||char|
|Zero, or comma-separated values of index_chain of the nucleotide(s) which is(are) paired with this one. Canonical (Watson-Crick or Wobble) basepairs are first in the list.||paired||int, int, ...|
|Number of bases interacting with this one||nb_interact||int >= 0|
|Type of basepair in Leontis-Westhof nomenclature (comma-separated list)||pair_type_LW||str, str, ...|
|Type of basepair in DSSR nomenclature (comma-separated list)||pair_type_DSSR||str, str, ...|
|The six torsion angles of the backbone, from 5' to 3', between 0 and 2pi||alpha, beta, gamma, delta, epsilon, zeta||float (rad)|
|Difference between epsilon and zeta torsion angles||epsilon_zeta||float (rad)|
|Conformation of the backbone||bb_type||BI, BII, '..', or 'n/a'|
|Chi torsion angle (between ribose and base)||chi||float (rad)|
|Conformation of the sugar with respect to the base (depends on Chi)||glyco_bond||syn or anti|
|Torsion angles of the ribose cycle||v0, v1, v2, v3, v4||float (rad)|
|If the nucleotide is involved in a stem, the stem type||form||A, B, Z or '.'|
|Z-coordinate of the 3' phosphorus atom with reference to the 5' base plane||ssZp||float|
|Perpendicular distance of the 3' P atom to the glycosidic bond||Dp||float|
|Pseudotorsions between P and C1'||eta, theta||float (rad)|
|Pseudotorsions between P and C4'||eta_prime, theta_prime||float (rad)|
|Pseudotorsions between P and the base center||eta_base, theta_base||float (rad)|
|Conformation of the ribose cycle||phase_angle||float (rad)|
|Amplitude of the sugar puckering||amplitude||float|
|Conformation of the ribose cycle (10 classes corresponding to specific ranges of phase)||puckering||str|
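As an illustration of how these descriptors can be turned into a learning target, the sketch below builds a symmetric base-pair contact matrix from the index_chain and paired columns. It assumes that the paired values arrive as strings such as "0", "12" or "12,45", as described in the table; the function names are hypothetical, and loading the columns from a file is left out because it depends on how you read the data.

```python
def parse_paired(value):
    """Turn a 'paired' cell ("0", "12" or "12,45") into a list of partner indices."""
    partners = [int(token) for token in str(value).split(",") if token.strip()]
    # Zero is the marker for an unpaired nucleotide, so it is dropped.
    return [p for p in partners if p != 0]


def contact_matrix(index_chain, paired):
    """Build an N x N matrix with 1 where two residues of the chain are paired."""
    n = len(index_chain)
    # Map each 1-based index_chain value to its 0-based row in the matrix.
    row_of = {int(idx): row for row, idx in enumerate(index_chain)}
    matrix = [[0] * n for _ in range(n)]
    for idx, cell in zip(index_chain, paired):
        for partner in parse_paired(cell):
            if partner in row_of:  # ignore partners outside the given chain
                i, j = row_of[int(idx)], row_of[partner]
                matrix[i][j] = matrix[j][i] = 1
    return matrix


# Toy chain of four residues: 1 pairs with 4, and 2 pairs with both 3 and 4.
toy = contact_matrix([1, 2, 3, 4], ["4", "3,4", "2", "1,2"])
```

The result is a plain list of lists, which can be handed to numpy.array or a tensor constructor if needed. Missing values (NaN) in the paired column are not handled here and would need to be filtered beforehand.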
More descriptors are available in the SQL tables. The database schema is illustrated below:
It is possible to build sub-datasets by querying the results/RNANet.db file. We provide examples using Python 3 with the sqlite3 and pandas packages:
import sqlite3
import pandas as pd

# List the chains of every structure with a resolution below 4.0, oldest first
with sqlite3.connect("results/RNANet.db") as connection:
    df = pd.read_sql("""SELECT structure_id, chain_name
                        FROM chain JOIN structure ON chain.structure_id = structure.pdb_id
                        WHERE resolution < 4.0 ORDER BY date ASC;""", con=connection)
More examples of SQL queries can be found in the README.md file on the IBISC forge, see section How to further filter the dataset.
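When the resolution threshold needs to vary, the query above can be wrapped in a small function that passes the cutoff as a bound parameter instead of hard-coding it. This is only a sketch: the function name chains_below_resolution is hypothetical, and it relies solely on the tables and columns that appear in the example above (chain, structure, structure_id, chain_name, pdb_id, resolution, date).

```python
import sqlite3


def chains_below_resolution(db_path, max_resolution):
    """Return (structure_id, chain_name) tuples for structures under the cutoff."""
    query = """SELECT structure_id, chain_name
               FROM chain JOIN structure ON chain.structure_id = structure.pdb_id
               WHERE resolution < ? ORDER BY date ASC;"""
    connection = sqlite3.connect(db_path)
    try:
        # The "?" placeholder lets sqlite3 bind the value safely.
        return connection.execute(query, (max_resolution,)).fetchall()
    finally:
        # Close explicitly: "with connection" only manages transactions.
        connection.close()


# Example call, assuming the database was extracted to results/RNANet.db:
# rows = chains_below_resolution("results/RNANet.db", 4.0)
```

The connection is closed in a finally clause because using a sqlite3 connection as a context manager commits or rolls back the transaction but does not close the connection.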