Exon and splice site prediction
An exon is a nucleic acid sequence that is represented in the mature form of an RNA molecule after introns have been removed by splicing. The mature RNA molecule can be a messenger RNA or a functional form of a non-coding RNA such as rRNA or tRNA. Originally, exon denoted protein-coding transcripts that are spliced before being translated; however, many noncoding exons are now known in human genes. Depending on the context, exon can refer to the sequence in the DNA or in its RNA transcript.
An intron is a nucleic acid sequence within a gene that is not translated into protein. These non-coding sections are transcribed into precursor mRNA (pre-mRNA) and some other RNAs (such as long noncoding RNAs), and are subsequently removed by a process called splicing during processing to mature RNA. After intron splicing (i.e. removal), the mRNA consists only of exon-derived sequences, which are translated into a protein.
Splicing and alternative splicing
Splicing is a modification of an RNA after transcription, in which introns are removed and exons are joined. This is needed for the typical eukaryotic messenger RNA before it can be used to produce a correct protein through translation [3,4].
However, when different tissues or developmental stages are compared, the mRNA produced from the same gene may differ depending on how the RNA is processed. Thus, many different proteins can be produced from an identical gene. This is possible because the exons produced by transcription of a gene can be reconnected in multiple ways. This is called alternative splicing. In eukaryotes, alternative splicing is a common phenomenon; for example, in humans over 80% of genes are alternatively spliced. A simple illustration of the splicing process is presented in Figure 1.
Figure 1. Illustration of exons and introns in pre-mRNA and the formation of mature mRNA by splicing
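The idea that the same exons can be joined in more than one way can be illustrated with a short Python sketch. It covers only the simplest case, exon skipping, and assumes that the first and last exon are always kept and that exon order never changes; the function name cassette_isoforms and the exon sequences are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def cassette_isoforms(exons):
    """Return every transcript obtainable by skipping internal exons.

    Assumption for this sketch: the first and last exon are always kept
    and exon order never changes (simple exon skipping only).
    """
    first, internal, last = exons[0], exons[1:-1], exons[-1]
    isoforms = []
    for k in range(len(internal) + 1):
        # combinations() preserves the original order of the kept exons
        for kept in combinations(internal, k):
            isoforms.append("".join((first, *kept, last)))
    return isoforms

# Hypothetical exon sequences (RNA)
exons = ["AUGGCU", "GAAUUC", "CCGGAA", "UGGUAA"]
for transcript in cassette_isoforms(exons):
    print(transcript)
```

With two internal exons the sketch lists four transcripts. Real alternative splicing also includes other event types, such as alternative donor or acceptor sites, which are not modelled here.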
Most introns start with the dinucleotide GU and end with the dinucleotide AG (in the 5' to 3' direction). These consensus sequences are referred to as the splice donor and splice acceptor sites, respectively. They are known to be critical, because changing one of the conserved nucleotides inhibits splicing. Upstream (5'-ward) from the AG there is a region rich in pyrimidines (C and U), the polypyrimidine tract. Upstream from the polypyrimidine tract is the branch point. The branch point always contains an adenine, but it is otherwise loosely conserved. A typical sequence is YNYYRAY, where Y indicates a pyrimidine (C or U), N denotes any nucleotide, R denotes any purine (G or A), and A denotes adenine [6,7].
In over 60% of cases, the exon sequence is (A/C)AG at the donor site, and G at the acceptor site.
Figure 2. Consensus sequences at the DNA level in introns of complex eukaryotes
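The signals described above can be checked in a given intron with a few lines of Python. The sketch below is only an illustration: it tests for the GU and AG dinucleotides at the intron ends and searches for the branch point consensus YNYYRAY written as a regular expression. The function name check_intron_signals and the example intron are hypothetical, and real splice-site prediction tools use far richer models.

```python
import re

# Branch point consensus YNYYRAY as an RNA regular expression:
# Y = C or U, N = any base, R = A or G.
# The lookahead lets overlapping matches be reported as well.
BRANCH_POINT = re.compile(r"(?=([CU][ACGU][CU][CU][AG]A[CU]))")

def check_intron_signals(intron):
    """Report the canonical splice signals found in an intron (5' to 3')."""
    intron = intron.upper().replace("T", "U")  # accept DNA or RNA spelling
    return {
        "donor_GU": intron.startswith("GU"),
        "acceptor_AG": intron.endswith("AG"),
        # 0-based start positions of every branch point consensus match
        "branch_points": [m.start() for m in BRANCH_POINT.finditer(intron)],
    }

# Hypothetical intron: donor, branch point UACUAAC, pyrimidine tract, acceptor
example = "GUAAGUAUGC" + "UACUAAC" + "GUCUUUUCUUCCAG"
print(check_intron_signals(example))
```

Because the branch point is only loosely conserved, a consensus match like this is a hint rather than proof of a functional site.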
Splicing is catalyzed by the spliceosome, a large RNA-protein complex composed of five small nuclear ribonucleoproteins (snRNPs). The RNA components of the snRNPs interact with the intron and determine the exon–intron borders of the pre-mRNA. Finally, a set of enzymes cuts the intron from the RNA and joins the ends of the two exons.
Cis-splicing and trans-splicing
Most often, signal elements act only on the nucleotide sequence of the molecule to which they are attached. They are said to act "in cis", and this kind of splicing is called cis-splicing. This is in contrast to trans-splicing, where separately transcribed exons from two different primary RNA transcripts are joined together.
The factors that influence the whole splicing process are:
- Correct identification of splicing signals: the donor and acceptor sites of the intron, and the branch point.
- Splicing regulatory elements (SREs or cis-acting elements), that is exonic splicing enhancers (ESEs), intronic splicing enhancers (ISEs), exonic splicing silencers (ESSs), and intronic splicing silencers (ISSs), which are defined by their effects on adjacent splice sites, e.g., ESEs tend to promote inclusion and ESSs promote exclusion of the exons they reside in.
- Other factors such as genomic architecture, extracellular signalling, RNA secondary structure, etc.
Bioinformatics tools for mRNA splicing analysis
Algorithmic approaches to splice site prediction rely mainly on the consensus patterns found at the boundaries between protein-coding and non-coding regions. Additionally, to predict the results of splicing, we also need to identify splicing regulatory elements (SREs). These sequences, residing at variable distances from splice sites, have been shown to function as binding sites for cis-acting factors. Although splicing regulators have been identified in both exons and introns, exonic splicing regulators (ESRs) are generally better characterized and are probably more common. Particular attention has been given to exonic splicing enhancers (ESEs), which promote the inclusion (as opposed to skipping) of the exons in which they reside.
Several bioinformatics tools to study or predict splice signals have been developed and are available online. To improve the analysis, machine-learning techniques such as Markov models or neural networks are usually applied. These algorithms use a set of known examples (the training set) and a set of features describing the data to construct a model. For example, GeneSplicer uses Markov modelling techniques in addition to Maximal Dependency Decomposition analysis, and MaxEnt uses a maximum entropy approach to rank and select "constraints" (features) for splice-site prediction.
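The general idea of building a model from a training set of known sites can be sketched in Python with a position weight matrix. This is a deliberately simpler technique than the Markov models of GeneSplicer or the maximum entropy approach of MaxEnt, and it is not how either tool is implemented. The training sequences, the uniform background and the function names are assumptions made only for this illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def train_pwm(sites, pseudocount=1.0):
    """Build per-position log-odds scores from aligned, equal-length sites.

    Assumptions: uniform background (0.25 per base); pseudocounts avoid log(0).
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pwm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / 0.25)
            for base in BASES
        })
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds of a candidate window."""
    return sum(column[base] for column, base in zip(pwm, window))

def scan(pwm, sequence):
    """Score every window of the sequence; return (position, score) pairs."""
    width = len(pwm)
    return [(i, score(pwm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

# Hypothetical training set: 3 exonic + 6 intronic bases around donor sites
donor_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "AAGGTAAGT", "CAGGTGAGT"]
pwm = train_pwm(donor_sites)
best_pos, best_score = max(scan(pwm, "TTCCAGGTAAGTCCGT"), key=lambda hit: hit[1])
print(best_pos, round(best_score, 2))
```

A position weight matrix treats every position independently. Markov and maximum entropy models exist precisely because neighbouring positions in real splice sites are not independent.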
Splicing analysis tools can be divided into the following groups:
- Donor, acceptor and branch-site evaluation programs.
- Splicing regulatory elements (SREs) evaluation programs.
- General splicing utilities.
As donor and acceptor elements are reasonably conserved in humans, programs that focus on these tend to be more accurate than those that analyse less conserved SREs.
mRNA splicing databases also play an important role as a source of reference for clinicians and researchers. Depending on their content, they can be divided into two groups:
- Splicing mutations databases which are repositories of pathological gene mutations for many diseases.
- Alternative splicing databases which collect data on particular genes.
A list of various tools for splicing analysis is shown in Figure 3.
Figure 3. Putative splicing mutations can be analysed using various publicly available bioinformatic tools that provide predictions on potential disruption of basic splicing sequences (acceptor, donor and branch-point sites), regulatory elements, and other features such as RNA secondary structure and protein binding sites. Splicing mutations are then collected in various databases, and the data stored can be used to improve the predictive analysis of mutation-detecting software.
Source: Baralle et al. Splicing in action: assessing disease causing sequence changes. J.Med.Genet., 2005, 42, 10, 737-748
Open reading frame search
A reading frame is a contiguous and non-overlapping set of three-nucleotide codons in DNA or RNA. There are three possible reading frames in an mRNA strand and six in a double-stranded DNA molecule, because transcription is possible from either of the two strands.
Open reading frame
An open reading frame (ORF) is a portion of an organism's genome that contains a sequence of bases that could potentially encode a protein. It is a reading frame that contains a start codon, a subsequent region whose length is usually a multiple of 3 nucleotides, and a stop codon at the end.
The start codon (or initiation codon) indicates the place in a chain where translation starts. Usually it is the AUG sequence, which encodes the amino acid methionine (Met) in eukaryotes and a modified Met (fMet) in prokaryotes. Alternative start codons (depending on the organism) include GUG (valine) and UUG (leucine).
A stop codon (or termination codon) is a nucleotide triplet within mRNA that signals the termination of translation. In the standard genetic code, there are three stop codons:
- UAG (amber)
- UAA (ochre)
- UGA (opal)
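The definition above (a start codon, a run of codons, then a stop codon) can be turned into a small Python sketch that scans the three reading frames of an mRNA. It is a minimal illustration under stated assumptions: only AUG is used as start codon by default, an ORF ends at the first in-frame stop codon, and the function name find_orfs and the example sequence are hypothetical.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}  # amber, ochre, opal

def find_orfs(mrna, start_codon="AUG"):
    """Return (start, end, sequence) for each ORF in the three forward frames.

    An ORF here runs from a start codon to the first in-frame stop codon
    (stop included); positions are 0-based and 'end' is exclusive.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if start is None and codon == start_codon:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, mrna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs

print(find_orfs("GGAUGGCUUCUCAAUAGCC"))
```

In this hypothetical sequence the only ORF starts at the AUG at position 2 and ends with the UAG stop codon.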
Determination of the correct ORF is a very important part of identifying a gene. ORFs are usually encountered when sifting through pieces of DNA while trying to locate a gene. The existence of an ORF, especially a long one, is usually a good indication of the presence of a gene in the surrounding sequence. Theoretically, the DNA sequence can be read in six reading frames in organisms with double-stranded DNA: three on each strand.
ORF determination - example
For a given nucleotide sequence GCTTCTCAAACGAGAA, we can start reading from the first (frame 1), second (frame 2) or third (frame 3) nucleotide:
- In frame 1, "GCT TCT CAA" translates into Ala Ser Gln
- In frame 2, "CTT CTC AAA" translates into Leu Leu Lys
- In frame 3, "TTC TCA AAC" translates into Phe Ser Asn
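The three translations listed above can be reproduced with a short Python sketch. It assumes that the given DNA is the coding strand, so that the RNA is obtained by replacing T with U, and it uses the standard genetic code with one-letter amino acid symbols (A = Ala, S = Ser, Q = Gln, L = Leu, K = Lys, F = Phe, N = Asn, * = stop). The function name translate_frame is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G at each position
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate_frame(dna, frame):
    """Translate a coding-strand DNA sequence, starting at frame 1, 2 or 3."""
    rna = dna.upper().replace("T", "U")
    # Incomplete codons at the end of the sequence are ignored
    codons = [rna[i:i + 3] for i in range(frame - 1, len(rna) - 2, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)

for frame in (1, 2, 3):
    print(frame, translate_frame("GCTTCTCAAACGAGAA", frame))
```

The first three residues printed for each frame correspond to the peptides in the list above; the remaining residues come from the rest of the sequence.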
The amino acid encoded by each nucleotide triplet can be looked up in an RNA codon table. Of course, we first need to convert the DNA sequence into the corresponding RNA sequence (T is replaced by U).
However, we often do not know the orientation of the DNA that we are sequencing with respect to the RNA. We therefore also have to translate the complementary strand, read in the opposite orientation, which yields three more reading frames. In total we obtain six different amino acid sequences, but only one of them represents the correct ORF. Typically only one reading frame is used in translating a gene (in eukaryotes), so to find it we simply search for the longest stretch of DNA with no stop codons. Once the open reading frame is known, the DNA sequence can be translated into its corresponding amino acid sequence.
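The search over all six reading frames can be sketched in Python as follows. The sketch implements only the simple heuristic described above, the longest run of codons without a stop codon, and does not require a start codon. The function names are hypothetical, frames on the minus strand are numbered along the reverse complement, and when several frames tie the first one found is reported.

```python
STOPS = {"TAA", "TAG", "TGA"}  # DNA spelling of UAA, UAG, UGA

def reverse_complement(dna):
    """Return the complementary strand read in the opposite orientation."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def longest_open_stretch(dna):
    """Return (strand, frame, codons) for the longest stop-free run of codons."""
    best = ("+", 1, [])
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            run = []
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if codon in STOPS:
                    run = []  # a stop codon closes the current stretch
                else:
                    run.append(codon)
                    if len(run) > len(best[2]):
                        best = (strand, frame + 1, list(run))
    return best

print(reverse_complement("GCTTCTCAAACGAGAA"))
print(longest_open_stretch("GCTTCTCAAACGAGAA"))
```

For very short sequences such as this one, several frames are free of stop codons, so the heuristic only becomes informative on longer stretches of DNA.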
ORF finding tools
ORF finding is usually part of the gene discovery process. ORF finding tools typically take a DNA sequence as input and analyze it in all six reading frames. The nucleotide triplets obtained are translated into the corresponding amino acid sequences. In addition, some applications provide a tool for searching the deduced sequence against sequence databases.
Below is a list of some popular ORF finding tools:
- NCBI ORF Finder
- ORF and ATG context scoring