
1 The definition of ORF revisited

the evidence of these papers, another
curtain appears behind it. To pull this
second curtain back, we must develop
strategies to isolate cells directly from
tissues. What this successful technology
will be is not yet clear, but without this
information, our understanding of the
roles miRNAs play in cellular differentiation and homeostasis remains incomplete. Accurate miRNA expression
estimates are even more important as
we learn about the importance of the
relative abundance of miRNAs to their
targets and sponges [13].
All together, these insights and resources
greatly advance miRNA research.
5. Juzenas, S. et al. (2017) A comprehensive, cell specific
microRNA catalogue of human peripheral blood. Nucleic
Acids Res. 45, 9290–9301
6. McCall, M.N. et al. (2017) Toward the human cellular
microRNAome. Genome Res. 27, 1769–1781
7. Ludwig, N. et al. (2016) Distribution of miRNA expression
across human tissues. Nucleic Acids Res. 44, 3865–3877
8. Fromm, B. et al. (2015) A uniform system for the annotation
of vertebrate microRNA genes and the evolution of the
human microRNAome. Annu. Rev. Genet. 49, 213–242
9. Pritchard, C.C. et al. (2012) MicroRNA profiling:
approaches and considerations. Nat. Rev. Genet. 13,
358–369
10. Baras, A.S. et al. (2015) miRge – a multiplexed method of
processing small RNA-seq data to determine microRNA
entropy. PLoS One 10, e0143066
11. Kuosmanen, S.M. et al. (2017) MicroRNA profiling reveals
distinct profiles for tissue-derived and cultured endothelial
cells. Sci. Rep. 7, 10943
12. Schwarz, E.C. et al. (2016) Deep characterization of blood
cell miRNomes by NGS. Cell. Mol. Life Sci. 73, 3169–3181
13. Pinzon, N. et al. (2017) microRNA target prediction programs predict many false positives. Genome Res. 27,
234–245
Acknowledgments
M.K.H. and M.N.M. are supported by grant
1R01HL137811 from the National Institutes of Health.
M.K.H. is also supported by an American Heart Association Grant-in-Aid (17GRNT33670405). M.N.M. is
also supported by the University of Rochester CTSA
award number UL1 TR002001 from the National Center for Advancing Translational Sciences of the National
Institutes of Health. B.F. is supported by the SouthEastern Norway Regional Health Authority (Grant No.
2014041). K.J.P. is supported by NASA-Ames.
1 Department of Pathology, Johns Hopkins University
School of Medicine, Baltimore, MD 21205, USA
2 Department of Tumor Biology, Institute for Cancer
Research, The Norwegian Radium Hospital, Oslo
University Hospital, N-0424 Oslo, Norway
3 Department of Biological Sciences, Dartmouth College,
Hanover, NH 03755, USA
4 Department of Biostatistics and Computational Biology,
University of Rochester Medical Center, Rochester, NY
14642, USA
Twitter: @Marc_Halushka
*Correspondence: [email protected] (M.K. Halushka).
URL: http://labs.pathology.jhu.edu/halushka/.
https://doi.org/10.1016/j.tig.2017.12.015
References
1. Kent, O.A. et al. (2014) Lessons from miR-143/145: the
importance of cell-type localization of miRNAs. Nucleic
Acids Res. 42, 7528–7538
2. McCall, M.N. et al. (2011) MicroRNA profiling of diverse
endothelial cell types. BMC Med. Genom. 4, 78
3. Witwer, K.W. and Halushka, M.K. (2016) Toward the
promise of microRNAs – enhancing reproducibility and
rigor in microRNA research. RNA Biol. 13, 1103–1116
4. de Rie, D. et al. (2017) An integrated expression atlas of
miRNAs and their promoters in human and mouse. Nat.
Biotechnol. 35, 872–878
Forum
The Definition of Open Reading Frame Revisited
Patricia Sieber,1 Matthias Platzer,2 and Stefan Schuster1,*
The term open reading frame
(ORF) is of central importance to
gene finding. Surprisingly, at least
three definitions are in use. We
discuss several molecular biological and bioinformatics aspects,
and we recommend using the
definition in which an ORF is
bounded by stop codons.
Open reading frame (ORF) is a basic term
in molecular genetics and bioinformatics.
The detection of ORFs is an important
step in finding protein-coding genes in
genomic sequences, including analyses
based on highly fragmented draft (meta)
genome assemblies [1–3]. ORFs can be
detected by simple in silico analysis,
while proving that a sequence is really
protein-coding requires more effort. Surprisingly, many textbooks pay little
attention to defining the term
ORF, apparently taking its meaning for
granted. Moreover, the definitions given
are often not perfectly clear-cut. For
example, the standard textbook Genes
VII by Lewin [4] states on p. 26: ‘A reading
frame that consists exclusively of triplets
representing amino acids is called an
open reading frame or ORF. A sequence
that is translated into protein has a reading frame that starts with a special initiation codon (AUG) and that extends
through a series of triplets representing
amino acids until it ends at one of the
three types of termination codon’. The
first sentence defines an ORF as bounded
by stop codons (stop/stop definition)
whereas the second sentence may be
(mis)understood as beginning with a start
codon (start/stop definition). Currently at
least three definitions are in use, which
differ in the location of the ORF boundaries [5] (Box 1).
Before going into detail, it is worth recalling the different meanings of the term
‘definition’ itself. A ‘lexical definition’
reports the most common usage of a
term [6,7]. It is the definition likely to be
found in a dictionary and can change over
time. An ‘operational definition’ focuses
on a specific objective or application and
may differ from the lexical definition [6]. In
our case, the main objective is gene finding using bioinformatics software. The
question arises of how the term ORF differs from the term coding DNA sequence
(CDS). A CDS is a nucleotide
sequence that is eventually translated into
a protein [8]. This implies that the CDS of a
particular protein is bounded by translation start and stop codons. In some cases
the term ORF is considered equivalent to
the term CDS [9]. Other authors describe an
ORF as a potential protein-coding
sequence which can be determined by
sequence features alone [8]. Note that
there is a difference between the
concepts of reading frame and ORF. A
reading frame is one of six possibilities
for translating a given double-stranded
genomic sequence into amino acids. For
a particular reading frame, an ORF is a
region that is not interrupted by a stop
codon and is bounded in accordance with
a particular definition (Box 1) [5]. Thus, an
ORF is a sequence region that is ‘open’ for
translation. It is an indicator for a potential
protein-coding gene [3].
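
The distinction between a reading frame and an ORF can be illustrated with a minimal Python sketch that splits a sequence into the codons of its six reading frames. The function names and the frame labels (+1 to +3, -1 to -3) are illustrative assumptions and are not taken from any tool cited here; the sketch assumes an unambiguous A/C/G/T sequence.

```python
# Complement table for unambiguous DNA bases (assumption: input uses A, C, G, T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def six_frames(seq: str) -> dict:
    """Split a sequence into codons for each of the six reading frames.

    Keys "+1".."+3" refer to the given strand, "-1".."-3" to the
    reverse-complement strand. Trailing bases that do not fill a
    complete codon are ignored.
    """
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


if __name__ == "__main__":
    for name, codons in six_frames("ATGAAATAG").items():
        print(name, " ".join(codons))
```

An ORF, under any of the definitions in Box 1, is then a region within one of these six codon series.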
We revisit and discuss here the different
ORF definitions to finally recommend one
definition universally applicable in finding
protein-coding genes by bioinformatics
tools. All three ORF definitions currently
in use (Box 1 and Figure 1) consider stop
codons. In the most widely used genetic
code, three of the 64 triplets are stop
codons (TAG, TGA, and TAA). In DNA
sequences not subject to selection and
with a G+C content of 50%, the average
distance between stop codons is
64/3 ≈ 21 codons. It is even shorter when
the G+C content is lower [10]. By contrast, the median length of protein-coding
sequences is considerably greater than 21
codons [5,10].
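
The dependence of stop codon spacing on G+C content can be made concrete with a short calculation. The following Python sketch assumes a simple model that is not taken from ref. [10]: bases are independent, with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2. The function names are hypothetical.

```python
def stop_codon_probability(gc_content: float) -> float:
    """Probability that a random triplet is TAA, TAG, or TGA.

    Assumed model: independent bases with P(G) = P(C) = gc/2 and
    P(A) = P(T) = (1 - gc)/2.
    """
    if not 0.0 <= gc_content < 1.0:
        raise ValueError("gc_content must be in [0, 1)")
    p_at = (1.0 - gc_content) / 2.0  # probability of A (and of T)
    p_gc = gc_content / 2.0          # probability of G (and of C)
    p_taa = p_at * p_at * p_at
    p_tag = p_at * p_at * p_gc
    p_tga = p_at * p_gc * p_at
    return p_taa + p_tag + p_tga


def mean_stop_spacing(gc_content: float) -> float:
    """Expected number of codons per stop codon (mean of a geometric distribution)."""
    return 1.0 / stop_codon_probability(gc_content)


if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7):
        spacing = mean_stop_spacing(gc)
        print(f"G+C = {gc:.0%}: one stop codon per {spacing:.1f} codons on average")
```

At a G+C content of 50% the model reproduces 64/3 ≈ 21.3 codons, and the spacing shrinks as the G+C content falls because all three stop codons are A+T-rich.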
Definition 1 is the current lexical definition
because most textbooks use it. This may
have historical reasons because the first
completely sequenced genomes (except
viruses) were prokaryotic, and gene
structure in prokaryotes is less complex
than in eukaryotes because of the
absence of splicing. Definition 1 focuses
on identifying potential CDSs in prokaryotic genomes. In eukaryotes, it is applicable only to mature mRNAs, to the few genes
containing only a single translated exon,
and to genes whose introns have a length
divisible by three and contain no stop
codons in the respective reading frame.
However, in both prokaryotes and eukaryotes there is the problem that internal
methionine codons can be mistaken for
start codons [11]. Context information
(such as potential promoter sequences
or homology search) can be included to
find the correct start of the ORF [11]. In
the case of multiple ATG triplets, several
tools based on Definition 1 only consider
the first as the start codon.
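
A minimal Python sketch of an ORF search under Definition 1 with the "first ATG" policy might look as follows. It is an illustration under stated assumptions, not the algorithm of any tool cited here: it scans only the three frames of the given strand, uses the standard stop codons, ignores context information such as promoters or homology, and the function name is hypothetical.

```python
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})


def find_orfs_start_stop(seq: str) -> list:
    """Definition 1 ORFs on the given strand: first ATG through the next in-frame stop.

    Returns (start, end, frame) tuples with 0-based start, exclusive end,
    and frame in {0, 1, 2}. The stop codon is included in the ORF.
    Later in-frame ATG triplets inside an open ORF are treated as
    internal methionine codons and do not start a new ORF.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAAATGCCCTAGGG"
    for start, end, frame in find_orfs_start_stop(dna):
        print(frame, start, end, dna[start:end])
```

For the example sequence, the second ATG lies in frame with the first and is therefore reported as part of the same ORF rather than as a separate start.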
In eukaryotic genomes, it is much more
complicated to predict CDSs because
most introns contain stop codons and/
or cause shifts between reading frames
(when comparing mature transcripts with
the DNA sequence). Furthermore, it is
difficult to identify the correct splice sites.
Splicing can be considered easily when
applying Definition 2, which can deal with
stop codons located within introns. An
ORF according to this definition does
not necessarily contain an entire CDS,
but a potential exon or group of exons.
In addition, metagenomic or other fragments such as transcript contigs
obtained by RNA-seq with missing start
or stop codons can be analyzed. In this
case, the search for ATG triplets is often
meaningless. The definition should then
be relaxed by considering a maximal
stretch of a nucleotide sequence not
interrupted by internal stop codons in
the considered reading frame. It can be
applied to complete genomic sequences
as well, and could be proposed as a general definition. Another point is that the 5′ untranslated region (5′-UTR) frequently
contains stop codons, such that the
ORF according to Definition 2 is not much
longer than one beginning with a start
codon [9]. Furthermore, Definition 2 is easier to
apply than Definition 1 because stop
codons simply need to be found. Among
others, Definition 2 was applied in the
algorithm of OrfM [3], which proved to
be significantly faster than methods that
search for start and stop codons. By contrast, Definition 3 is limited to eukaryotic
internal and (potentially) completely protein-coding exons because they are identified by specific algorithms of eukaryotic
gene annotation before determining
translation start and stop positions.
Although finding splice sites is more complicated than finding start and stop
codons, Definition 3 is useful for these
algorithms, but is only rarely mentioned
in the literature. All three definitions are
based on operational rules and are well suited to implementation. All three
are employed in ORF prediction software,
as shown in Box S1 in the supplemental
information online.
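
To illustrate why Definition 2 is simple to implement, the following Python sketch reports every maximal stretch of codons not interrupted by a stop codon. It is not the OrfM implementation; the function name and the `min_codons` parameter are hypothetical, only the three frames of the given strand are scanned, and the ends of the sequence are treated as boundaries so that fragments lacking a flanking stop codon are still reported (the relaxed variant described above).

```python
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})


def find_orfs_stop_stop(seq: str, min_codons: int = 1) -> list:
    """Definition 2 ORFs on the given strand: maximal stop-free codon stretches.

    Returns (start, end, frame) tuples with 0-based start, exclusive end,
    and frame in {0, 1, 2}. The flanking stop codons are not included.
    Sequence ends also act as boundaries, so ORFs in fragments without
    a flanking stop codon are reported as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        run_start = frame  # first base of the current stop-free stretch
        last = frame       # position just past the last complete codon seen
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if (i - run_start) // 3 >= min_codons:
                    orfs.append((run_start, i, frame))
                run_start = i + 3
            last = i + 3
        # Close the stretch that runs to the end of the sequence.
        if (last - run_start) // 3 >= min_codons:
            orfs.append((run_start, last, frame))
    return orfs


if __name__ == "__main__":
    dna = "TAGAAACCCTGAGGG"
    for start, end, frame in find_orfs_stop_stop(dna, min_codons=2):
        print(frame, start, end, dna[start:end])
```

No search for ATG triplets is needed: a single pass per frame that looks only for stop codons suffices, which is the practical advantage noted in the text.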
In this context, it is also worthwhile comparing the concepts of ORF and exon.
First, they obviously differ because stop
codons and splice sites are clearly not
identical. Second, there may not be a
stop codon in the neighboring introns,
and an ORF may therefore include more
than one exon (legend to Figure 1).
Box 1. Three ORF Definitions Currently in Use
In all definitions, an ORF is regarded as a stretch of nucleotide sequence that is not interrupted by stop
codons in a given reading frameᵃ [5], while they differ as follows:
Definition 1: an ORF is a sequence that has a length divisible by three and begins with a translation start
codon (ATG) and ends at a stop codon [2,8–10].
Definition 2: an ORF is a sequence that has a length divisible by three and is bounded by stop codons
[3,5,12].
Definition 3: an ORF is a sequence delimited by an acceptor and a donor splice site [1]. Thus, it refers to a
potentially translated eukaryotic internal exon. 5′- and 3′-terminal exons of a putative gene are determined at
the end of the gene prediction process and are not considered for the actual ORF detection.
ᵃ This overarching ‘boundless’ definition is inherent in all three definitions and is necessary when analyzing very short sequence stretches, as in the case of metagenome assemblies.
[Figure 1: schematic with two panels. Prokaryotes: ORFs under Definitions 1 and 2, drawn on a DNA strand with stop codons, ATG, and 5′UTR + CDS marked. Eukaryotes: ORFs under Definitions 1, 2, and 3, drawn on a DNA strand with stop codons, ATG, and exons marked.]
Figure 1. Applying the Three Definitions Leads to Different Open Reading Frames (ORFs) (Indicated by Orange Lines) Concerning Their Boundaries.
The corresponding ORFs vary between prokaryotes and eukaryotes. An ORF is delimited by a start codon and a stop codon (Definition 1; in the case of prokaryotes
practically redundant with CDS), two stop codons (Definition 2), or donor and acceptor splice sites (Definition 3; only for eukaryotes). In all cases the ORFs are not
interrupted by internal stop codons in the considered reading frame. According to Definition 2, the ORFs of a eukaryotic gene need not lie in the same reading frame. An
ORF according to Definitions 1 or 2 may involve more than one exon if there are no stop codons in the intronic region in between and if they lie in the same reading frame.
Definition 2 distinguishes clearly between
ORF, CDS, and exon. It can easily be
processed by a computer and is the most
general definition. Furthermore, this definition can be applied even in the case of
prokaryotes and metagenomic
sequences. Overall, we conclude
that Definition 2 is to be
preferred, and we suggest making it the
lexical definition in the future. Definition 2
is preferable from an operational, pragmatic
point of view: from stop to stop.
Acknowledgments
We thank Günter Theißen and Martin Hölzer for stimulating
discussions. Financial support by the University of Jena and the Deutsche
Forschungsgemeinschaft (Transregio 124 FungiNet,
project B1) is gratefully acknowledged.
Supplemental Information
Supplemental information associated with this article
can be found online at https://doi.org/10.1016/j.tig.2017.12.009.
1 Department of Bioinformatics, Friedrich Schiller
University Jena, Ernst-Abbe-Platz 2, 07743 Jena,
Germany
2 Leibniz Institute on Aging – Fritz Lipmann Institute (FLI),
Beutenbergstraße 11, 07745 Jena, Germany
*Correspondence: [email protected] (S. Schuster).
https://doi.org/10.1016/j.tig.2017.12.009
Trends in Genetics, March 2018, Vol. 34, No. 3, pp. 167–170
References
1. Brent, M.R. (2005) Genome annotation past, present, and
future: How to define an ORF at each locus. Genome Res.
15, 1777–1786
2. Mir, K. et al. (2012) Predicting statistical properties of open
reading frames in bacterial genomes. PLoS One 7, e45103
3. Woodcroft, B.J. et al. (2016) OrfM: a fast open reading
frame predictor for metagenomic data. Bioinformatics 32,
2702–2703
4. Lewin, B. (ed.) (1999) Genes VII, Oxford University Press
5. Claverie, J.-M. (1997) Computational methods for the
identification of genes in vertebrate genomic sequences.
Hum. Mol. Genet. 6, 1735–1744
6. Sevilla, C.G. et al. (2007) Research Methods, Rex Book
Store
7. Lau, J.Y.F. (2011) An Introduction to Critical Thinking and
Creativity. Think More, Think Better, John Wiley & Sons
8. Andrews, S.J. and Rothnagel, J.A. (2014) Emerging evidence
for functional peptides encoded by short open
reading frames. Nat. Rev. Genet. 15, 193–204
9. Min, X.J. et al. (2005) OrfPredictor: predicting protein-coding
regions in EST-derived sequences. Nucleic Acids
Res. 33, W677–W680
10. Pohl, M. et al. (2012) GC content dependency of open
reading frame prediction via stop codon frequencies. Gene
511, 441–446
11. Guigo, R. et al. (1992) Prediction of gene structure. J. Mol.
Biol. 226, 141–157
12. Fermin, D. et al. (2006) Novel gene and gene model detection
using a whole genome open reading frame analysis in
proteomics. Genome Biol. 7, R35