EvryRNA : miRBoost

MicroRNA precursor classifier using boosting SVM

Tutorial

    • Users can provide their own training sequences.
    • Published positive and negative training datasets have already been uploaded for the prediction mode.
    • Available species are Human and Cross-species.
    • Uploaded file needs to meet the following requirements:
      • Fasta format.
      • Each sequence is displayed in one line.
    • We control the weakness of the SVM component classifiers by imposing a lower bound of 1/2 - Delta and an upper bound of 1/2 on their training error.
    • Delta is restricted to values between 0.25 and 0.5 (see Tran et al.).
    • The choice of Delta plays an important role in determining the weakness of the SVM component classifiers, and thus in the performance of miRBoost.
    • When Delta is small (close to 0.25), the training errors of component classifiers get close to each other, and thus the diversity among component classifiers is reduced.
    • Conversely, when Delta is large (close to 0.5), the weakness is not always guaranteed.
    • We found a critical value of Delta = 0.25 for both data sets (human and cross-species), at which miRBoost reaches Pareto optimality with respect to the different measures, i.e., a state in which no measure can be improved without making another measure worse.
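The bounds on the training error described above can be illustrated with a minimal Python sketch. The function names `error_bounds` and `is_acceptably_weak` are hypothetical and are not part of miRBoost; treating both bounds as inclusive is also an assumption.

```python
def error_bounds(delta):
    """Return the (lower, upper) bounds on a component classifier's training error.

    Assumption: Delta lies in [0.25, 0.5], as stated in the tutorial.
    """
    if not 0.25 <= delta <= 0.5:
        raise ValueError("Delta is expected to lie between 0.25 and 0.5")
    return 0.5 - delta, 0.5


def is_acceptably_weak(training_error, delta):
    """Check whether a training error falls inside [1/2 - Delta, 1/2].

    Assumption: both bounds are inclusive.
    """
    lower, upper = error_bounds(delta)
    return lower <= training_error <= upper


# Show how the admissible interval widens as Delta grows.
for delta in (0.25, 0.5):
    lower, upper = error_bounds(delta)
    print(f"Delta = {delta}: training error allowed in [{lower}, {upper}]")
```

Under these assumptions, Delta = 0.25 gives the narrowest interval, so the admitted classifiers have similar training errors. Delta = 0.5 lets the lower bound drop to 0, so a classifier with a training error near 0 would also be admitted.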
  • Press CTRL + D on the waiting page to bookmark it, so that you can access the results later.
    • Sequences predicted as pre-miRNAs are displayed in green.
    • Sequences predicted as non-pre-miRNAs are displayed in red.
    • A result file is available at the top right of the result page. The prediction is displayed next to the identifier of each sequence.
  • We study the genomes of eukaryotes that have at least 100 miRNAs in the miRBase database.
  • From these genomes we take pre-miRNAs of <400 nt. Because miRBase is known to contain a number of mis-annotated miRNAs, we first remove the sequences reported as mis-annotated in later versions.
  • The remaining pre-miRNAs are filtered with ncRNAclassifier (Tempel et al. 2012) to discard those that are mis-annotated because they correspond to transposable elements.
  • The obtained sequences are then considered as positive data. They include 1279 sequences for human and 3082 sequences for cross-species.
  • To avoid overfitting, we use EMBOSS skipredundant to remove the sequences that have an identity of >97% with other sequences. Finally, we obtain 863 pre-miRNAs for human and 1677 pre-miRNAs for cross-species.
  • From the selected genomes, we randomly choose exonic regions of protein-coding genes at NCBI. We also take noncoding RNAs that are not miRNAs, including tRNA, siRNA, snRNA and snoRNA, from the fRNAdb, NONCODE, and snoRNA-LBME-db databases. All of them contain <400 nt.
  • We use miRNAFold (Tempel and Tahi 2012) to predict a hairpin-like structure in each selected sequence. miRNAFold first identifies the longest exact stem from the given sequence. The identified stem is then extended into the longest nonexact stem (i.e., succession of exact stems separated by symmetric internal loops).
  • The hairpin secondary structure corresponding to this nonexact stem is finally predicted. Various constraints are applied to the structure prediction. The hairpin-like structure should have a folding free energy ΔG0 < -25.0 kcal/mol, and its hairpin must be formed with at least one exact stem of >5 nt.
  • Moreover, at least 90% of the features introduced in miRNAFold must be satisfied. We reduce the sequence redundancy to 97% using EMBOSS skipredundant, giving 7123 human and 7916 cross-species sequences from exonic regions that are not pre-miRNAs, and 299 human and 350 cross-species noncoding RNA sequences.
  • The two sets of coding and noncoding sequences are then combined to constitute the negative data set for cross-validation. In total, this gives 7422 human and 8266 cross-species sequences.
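The selection constraints for negative sequences listed above can be summarized in a short Python sketch. The `HairpinCandidate` record and the `passes_negative_set_filter` function are hypothetical names introduced for illustration. They do not reflect miRNAFold's actual output format or the authors' implementation; only the thresholds and the counts come from the text.

```python
from dataclasses import dataclass


@dataclass
class HairpinCandidate:
    # Hypothetical record; the real miRNAFold output is not described here.
    length_nt: int                # sequence length in nucleotides
    free_energy_kcal_mol: float   # folding free energy of the hairpin
    longest_exact_stem_nt: int    # length of the longest exact stem
    features_satisfied: int       # number of miRNAFold features satisfied
    features_total: int           # number of miRNAFold features evaluated


def passes_negative_set_filter(c):
    """Apply the thresholds quoted in the text to one candidate."""
    return (
        c.length_nt < 400
        and c.free_energy_kcal_mol < -25.0
        and c.longest_exact_stem_nt > 5
        # "at least 90%" checked with integer arithmetic to avoid rounding.
        and 10 * c.features_satisfied >= 9 * c.features_total
    )


# Negative data set sizes: exonic sequences plus noncoding RNA sequences.
human_total = 7123 + 299
cross_species_total = 7916 + 350
print(human_total, cross_species_total)
```

For example, `passes_negative_set_filter(HairpinCandidate(120, -30.2, 7, 18, 20))` evaluates to `True`, while the same candidate with a free energy of -20.0 kcal/mol would be rejected.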
For any questions, comments or suggestions about miRBoost, please feel free to contact: thuong.tran@isb-sib.ch and fariza.tahi@ibisc.univ-evry.fr