1 The definition of ORF revisited

Forum

The Definition of Open Reading Frame Revisited

Patricia Sieber,1 Matthias Platzer,2 and Stefan Schuster1,*

The term open reading frame (ORF) is of central importance to gene finding. Surprisingly, at least three definitions are in use. We discuss several molecular biological and bioinformatics aspects, and we recommend using the definition in which an ORF is bounded by stop codons.

Open reading frame (ORF) is a basic term in molecular genetics and bioinformatics.
The detection of ORFs is an important step in finding protein-coding genes in genomic sequences, including analyses based on highly fragmented draft (meta)genome assemblies [1–3]. ORFs can be detected by simple in silico analysis, while proving that a sequence is really protein-coding requires more effort.

Surprisingly, many textbooks pay little attention to defining the term ORF, apparently taking its meaning for granted. Moreover, the given definitions are often not perfectly clear-cut. For example, the standard textbook Genes VII by Lewin [4] states on p. 26: ‘A reading frame that consists exclusively of triplets representing amino acids is called an open reading frame or ORF. A sequence that is translated into protein has a reading frame that starts with a special initiation codon (AUG) and that extends through a series of triplets representing amino acids until it ends at one of the three types of termination codon’. The first sentence defines an ORF as bounded by stop codons (stop/stop definition), whereas the second sentence may be (mis)understood as beginning with a start codon (start/stop definition). Currently at least three definitions are in use, which differ in the location of the ORF boundaries [5] (Box 1).

Before going into detail, it is worth recalling the different meanings of the term ‘definition’ itself. A ‘lexical definition’ reports the most common usage of a term [6,7]. It is the definition likely to be found in a dictionary and can change over time. An ‘operational definition’ focuses on a specific objective or application and may differ from the lexical definition [6]. In our case, the main objective is gene finding using bioinformatics software.

The question arises of how the term ORF differs from that of a coding DNA sequence (CDS). A CDS is a nucleotide sequence that is eventually translated into a protein [8]. This implies that the CDS of a particular protein is bounded by translation start and stop codons.
In some cases the term ORF is considered equivalent to that of CDS [9]. Other authors describe an ORF as a potential protein-coding sequence that can be determined by sequence features alone [8].

Note that there is a difference between the concepts of reading frame and ORF. A reading frame is one of six possibilities for translating a given double-stranded genomic sequence into amino acids. For a particular reading frame, an ORF is a region that is not interrupted by a stop codon and is bounded in accordance with a particular definition (Box 1) [5]. Thus, an ORF is a sequence region that is ‘open’ for translation. It is an indicator for a potential protein-coding gene [3].

We revisit and discuss here the different ORF definitions to finally recommend one definition universally applicable in finding protein-coding genes by bioinformatics tools.

All three ORF definitions currently in use (Box 1 and Figure 1) consider stop codons. In the most widely used genetic code, three of the 64 triplets are stop codons (TAG, TGA, and TAA). In DNA sequences not subject to selection and with a G+C content of 50%, the average distance between stop codons is 64/3 ≈ 21 codons. It is even shorter when the G+C content is lower [10]. By contrast, the median length of protein-coding sequences is considerably higher than 21 codons [5,10].

Definition 1 is the current lexical definition because most textbooks use it. This may have historical reasons because the first completely sequenced genomes (except viruses) were prokaryotic, and gene structure in prokaryotes is less complex than in eukaryotes because of the absence of splicing. Definition 1 focuses on identifying potential CDSs in prokaryotic genomes. In eukaryotes, it is applicable only to mature mRNAs, to the few genes containing only a single translated exon, and to genes with introns of a length divisible by three and not containing stop codons in the respective reading frame.
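The spacing estimate quoted above (roughly 21 codons between stop codons at 50% G+C, and fewer at lower G+C) can be reproduced with a small calculation. The sketch below is only an illustration under a simplifying assumption that is not taken from the article: bases are drawn independently, with G and C equally likely and A and T equally likely. The function names `stop_codon_probability` and `mean_stop_spacing` are hypothetical.

```python
def stop_codon_probability(gc: float) -> float:
    """Probability that a random codon is TAA, TAG or TGA.

    Assumes independent bases with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2 (an illustrative model only).
    """
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must lie between 0 and 1")
    a = t = (1.0 - gc) / 2.0
    g = gc / 2.0
    # TAA + TAG + TGA
    return t * a * a + t * a * g + t * g * a


def mean_stop_spacing(gc: float) -> float:
    """Expected number of codons per stop codon in one reading frame."""
    p = stop_codon_probability(gc)
    return float("inf") if p == 0.0 else 1.0 / p


if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7):
        print(f"G+C = {gc:.1f}: one stop codon every {mean_stop_spacing(gc):.1f} codons")
```

At a G+C content of 0.5 the model gives a stop-codon probability of 3/64, that is, the 64/3 spacing stated in the text; lowering the G+C content raises the frequency of the A/T-rich stop codons and shortens the expected spacing.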
However, in both prokaryotes and eukaryotes there is the problem that internal methionine codons can be mistaken for start codons [11]. Context information (such as potential promoter sequences or homology search) can be included to find the correct start of the ORF [11]. In the case of multiple ATG triplets, several tools based on Definition 1 consider only the first as the start codon.

In eukaryotic genomes, it is much more complicated to predict CDSs because most introns contain stop codons and/or cause shifts between reading frames (when comparing mature transcripts with the DNA sequence). Furthermore, it is difficult to identify the correct splice sites.

Splicing can be considered easily when applying Definition 2, which can deal with stop codons located within introns. An ORF according to this definition does not necessarily contain an entire CDS, but a potential exon or group of exons. In addition, metagenomic or other fragments, such as transcript contigs obtained by RNA-seq with missing start or stop codons, can be analyzed. In this case, the search for ATG triplets is often meaningless. The definition should then be relaxed by considering a maximal stretch of a nucleotide sequence not interrupted by internal stop codons in the considered reading frame. It can be applied to complete genomic sequences as well, and could be proposed as a general definition. Another point is that the 5′ untranslated region (5′-UTR) frequently contains stop codons, such that the ORF according to Definition 2 is not much longer than when beginning with a start codon [9]. Furthermore, it is easier to apply than Definition 1 because stop codons simply need to be found. Among others, Definition 2 was applied in the algorithm of OrfM [3], which has been shown to be significantly faster than methods that search for both start and stop codons.
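To make the stop-to-stop idea concrete, the following sketch scans all six reading frames and reports every maximal codon run that contains no stop codon, treating the ends of the sequence as boundaries so that fragments without a terminal stop codon are still reported (the relaxed form described above). It is a minimal illustration, not the OrfM algorithm; the function names, the `min_codons` parameter, and the coordinate convention (0-based, half-open, relative to the scanned strand) are assumptions made for this example, and the standard genetic code is assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def stop_to_stop_orfs(seq: str, min_codons: int = 1):
    """Yield (strand, frame, start, end, orf) for each stop-free codon run.

    start/end are 0-based, half-open positions on the scanned strand.
    The bounding stop codons themselves are not part of the reported ORF.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            # Position just past the last complete codon in this frame.
            last = frame + ((len(s) - frame) // 3) * 3
            run_start = frame
            for pos in range(frame, last, 3):
                if s[pos:pos + 3] in STOP_CODONS:
                    if pos - run_start >= 3 * min_codons:
                        yield strand, frame, run_start, pos, s[run_start:pos]
                    run_start = pos + 3
            # Run that reaches the end of the fragment without a stop codon.
            if last - run_start >= 3 * min_codons:
                yield strand, frame, run_start, last, s[run_start:last]


if __name__ == "__main__":
    for orf in stop_to_stop_orfs("TAAATGGCCTGA"):
        print(orf)
```

Because only stop codons have to be recognized, a single pass per frame suffices and no decision about candidate start codons is needed, which is the practical advantage the text attributes to Definition 2.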
By contrast, Definition 3 is limited to eukaryotic internal and (potentially) completely protein-coding exons because they are identified by specific algorithms of eukaryotic gene annotation before determining translation start and stop positions. Although finding splice sites is more complicated than finding start and stop codons, Definition 3 is useful for these algorithms, but is only rarely mentioned in the literature.

All three definitions are based on operational rules and are well suited for implementation. All three are employed in ORF prediction software, as shown in Box S1 in the supplemental information online.

In this context, it is also worthwhile comparing the concepts of ORF and exon. First, they obviously differ because stop codons and splice sites are clearly not identical. Second, there may not be a stop codon in the neighboring introns, and an ORF may therefore include more than one exon (legend to Figure 1).

Box 1. Three ORF Definitions Currently in Use

In all definitions, an ORF is regarded as a stretch of nucleotide sequence that is not interrupted by stop codons in a given reading frame^a [5], while they differ as follows:

Definition 1: an ORF is a sequence that has a length divisible by three and begins with a translation start codon (ATG) and ends at a stop codon [2,8–10].

Definition 2: an ORF is a sequence that has a length divisible by three and is bounded by stop codons [3,5,12].

Definition 3: an ORF is a sequence delimited by an acceptor and a donor splice site [1]. Thus, it refers to a potentially translated eukaryotic internal exon. 5′- and 3′-terminal exons of a putative gene are determined at the end of the gene prediction process and are not considered for the actual ORF detection.

^a This overarching ‘boundless’ definition is inherent in all three definitions and is necessary when analyzing very short sequence stretches, as in the case of metagenome assemblies.
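For comparison with the stop/stop definition, Definition 1 in Box 1 can be sketched as a scan that opens an ORF at the first in-frame ATG and closes it at the next in-frame stop codon. This is only an illustrative sketch under stated assumptions: it covers the three forward frames, uses ATG as the sole start codon, follows the 'first ATG' convention mentioned in the text, and reports 0-based, half-open coordinates that include the stop codon. The function name `start_to_stop_orfs` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
START_CODON = "ATG"


def start_to_stop_orfs(seq: str):
    """Return (frame, start, end) tuples for start/stop ORFs on the forward strand.

    start is the position of the first in-frame ATG after the previous stop
    codon (or the sequence start); end is the position just past the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == START_CODON:
                start = pos  # later in-frame ATGs are treated as internal
            elif codon in STOP_CODONS:
                if start is not None:
                    orfs.append((frame, start, pos + 3))
                start = None
    return orfs


if __name__ == "__main__":
    # Frame 0 reads: TAA CCC ATG GCC ATG AAA TGA
    print(start_to_stop_orfs("TAACCCATGGCCATGAAATGA"))
```

In this toy sequence the start/stop ORF in frame 0 runs from the first ATG to the TGA, and the second ATG is treated as an internal methionine codon. Under Definition 2 the same frame would instead give the stretch between TAA and TGA, which additionally contains the CCC codon upstream of the ATG.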
[Figure 1: schematic with a prokaryote panel (Definitions 1 and 2) and a eukaryote panel (Definitions 1, 2, and 3), showing the positions of Stop, ATG, 5′UTR + CDS, and Exon elements along the DNA strand.]

Figure 1. Applying the Three Definitions Leads to Different Open Reading Frames (ORFs) (Indicated by Orange Lines) Concerning Their Boundaries. The corresponding ORFs vary between prokaryotes and eukaryotes. An ORF is delimited by a start codon and a stop codon (Definition 1; in the case of prokaryotes practically redundant with CDS), two stop codons (Definition 2), or donor and acceptor splice sites (Definition 3; only for eukaryotes). In all cases the ORFs are not interrupted by internal stop codons in the considered reading frame. According to Definition 2, the ORFs of a eukaryotic gene need not lie in the same reading frame. An ORF according to Definitions 1 or 2 may involve more than one exon if there are no stop codons in the intronic region in between and if they lie in the same reading frame.

Definition 2 distinguishes clearly between ORF, CDS, and exon. It can easily be processed by a computer and is the most general definition. Furthermore, this definition can be applied even in the case of prokaryotes and metagenomic sequences. Overall, we conclude that Definition 2 is to be preferred, and we suggest making it the lexical definition in the future. Definition 2 is preferable from an operational, pragmatic point of view: from stop to stop.

Acknowledgments

We thank Günter Theißen and Martin Hölzer for stimulating discussions. Financial support by the University of Jena and the Deutsche Forschungsgemeinschaft (Transregio 124 FungiNet, project B1) is gratefully acknowledged.

Supplemental Information

Supplemental information associated with this article can be found online at https://doi.org/10.1016/j.tig.2017.12.009.

1 Department of Bioinformatics, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany
2 Leibniz Institute on Aging – Fritz Lipmann Institute (FLI), Beutenbergstraße 11, 07745 Jena, Germany

*Correspondence: [email protected] (S. Schuster).

Trends in Genetics, March 2018, Vol. 34, No. 3
https://doi.org/10.1016/j.tig.2017.12.009

References

1. Brent, M.R. (2005) Genome annotation past, present, and future: How to define an ORF at each locus. Genome Res. 15, 1777–1786
2. Mir, K. et al. (2012) Predicting statistical properties of open reading frames in bacterial genomes. PLoS One 7, e45103
3. Woodcroft, B.J. et al. (2016) OrfM: a fast open reading frame predictor for metagenomic data. Bioinformatics 32, 2702–2703
4. Lewin, B. (ed.) (1999) Genes VII, Oxford University Press
5. Claverie, J.-M. (1997) Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet. 6, 1735–1744
6. Sevilla, C.G. et al. (2007) Research Methods, Rex Book Store
7. Lau, J.Y.F. (2011) An Introduction to Critical Thinking and Creativity. Think More, Think Better, John Wiley & Sons
8. Andrews, S.J. and Rothnagel, J.A. (2014) Emerging evidence for functional peptides encoded by short open reading frames. Nat. Rev. Genet. 15, 193–204
9. Min, X.J. et al. (2005) OrfPredictor: predicting protein-coding regions in EST-derived sequences. Nucleic Acids Res. 33, W677–W680
10. Pohl, M. et al. (2012) GC content dependency of open reading frame prediction via stop codon frequencies. Gene 511, 441–446
11. Guigo, R. et al. (1992) Prediction of gene structure. J. Mol. Biol. 226, 141–157
12. Fermin, D. et al. (2006) Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biol. 7, R35
