RNA - Algorithm

NCYPred is a deep learning model trained to classify non-coding RNA transcripts into 13 distinct classes: 5.8S rRNA, 5S rRNA, CD-box, HACA-box, Intron-gpI, Intron-gpII, Leader, miRNA, Riboswitch, Ribozyme, tRNA, Y RNA (vertebrates), sbRNA and CeY RNA (nematodes), sbRNA (insects), Y RNA-like (bacterial). The model architecture is based on bidirectional long short-term memory (biLSTM) networks with an attention mechanism and is implemented with TensorFlow and Keras.

Before being fed to the neural network, input sequences are preprocessed in three steps:
I) Sequences are decomposed into overlapping 3-mers (subsequences of 3 nt).
II) Each unique 3-mer is “tokenized”, that is, mapped to an integer token.
III) Sequences are padded with zeros until the maximum length (500 nt) is reached.

After preprocessing, input sequences are fed to an Embedding layer, which maps each token to a 10-dimensional representation that is optimized during training. The biLSTM layers then encode sequential information, and the Attention layer assigns an importance weight to each position, condensing the most relevant information into a context vector. This final sequence representation is used by a 128-layer feed-forward neural network to classify each sequence into 1 of 13 classes.

For details of the training and evaluation process, please check our article at: (link). The datasets and trained models are also available in our GitHub repository (https://github.com/diegodslima/NCYPred).
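
The three preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the NCYPred implementation: the function names (kmers, encode), the vocabulary and its ID assignment, the UNK token for ambiguous bases, the T-to-U conversion, the truncation of long inputs, and the choice to pad at the end of the sequence up to 500 tokens are all assumptions made for illustration. Only the 3-mer size, the zero padding, and the maximum length of 500 come from the description.

```python
from itertools import product

K = 3          # k-mer size, as stated in the description
MAX_LEN = 500  # maximum length, as stated in the description
PAD = 0        # padding value ("zeros"), as stated in the description

# Hypothetical vocabulary: each 3-mer over A, C, G, U gets an integer ID
# starting at 1, so that 0 stays reserved for padding. The ID assignment
# used by the actual tokenizer may differ.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGU", repeat=K), start=1)}
UNK = len(VOCAB) + 1  # hypothetical ID for 3-mers containing other symbols (e.g. N)


def kmers(seq, k=K):
    """Step I: decompose a sequence into overlapping k-mers (stride 1)."""
    seq = seq.upper().replace("T", "U")  # assumption: DNA-style input is accepted
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def encode(seq, max_len=MAX_LEN):
    """Steps II and III: map each 3-mer to an integer token, then zero-pad."""
    tokens = [VOCAB.get(kmer, UNK) for kmer in kmers(seq)]
    tokens = tokens[:max_len]  # assumption: longer inputs are truncated
    # Assumption: zeros are appended at the end (post-padding).
    return tokens + [PAD] * (max_len - len(tokens))
```

For example, kmers("ACGUAC") returns ['ACG', 'CGU', 'GUA', 'UAC'], and encode("ACGUAC") returns a list of 500 integers whose first four entries are the token IDs of those 3-mers and whose remaining entries are zeros. A fixed-length integer list of this kind is the form of input a Keras Embedding layer expects.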