Quantitative Historical Linguistics
OXFORD STUDIES IN DIACHRONIC AND HISTORICAL LINGUISTICS

General editors
Adam Ledgeway and Ian Roberts, University of Cambridge

Advisory editors
Cynthia Allen, Australian National University; Ricardo Bermúdez-Otero, University of Manchester; Theresa Biberauer, University of Cambridge; Charlotte Galves, University of Campinas; Geoff Horrocks, University of Cambridge; Paul Kiparsky, Stanford University; Anthony Kroch, University of Pennsylvania; David Lightfoot, Georgetown University; Giuseppe Longobardi, University of York; George Walkden, University of Konstanz; David Willis, University of Cambridge

Recently published in the series
19 The Syntax of Old Romanian
   Edited by Gabriela Pană Dindelegan
20 Grammaticalization and the Rise of Configurationality in Indo-Aryan
   Uta Reinöhl
21 The Rise and Fall of Ergativity in Aramaic: Cycles of Alignment Change
   Eleanor Coghill
22 Portuguese Relative Clauses in Synchrony and Diachrony
   Adriana Cardoso
23 Micro-change and Macro-change in Diachronic Syntax
   Edited by Eric Mathieu and Robert Truswell
24 The Development of Latin Clause Structure: A Study of the Extended Verb Phrase
   Lieven Danckaert
25 Transitive Nouns and Adjectives: Evidence from Early Indo-Aryan
   John J. Lowe
26 Quantitative Historical Linguistics: A Corpus Framework
   Gard B. Jenset and Barbara McGillivray

For a complete list of titles published and in preparation for the series, see pp. 230–2
Quantitative Historical Linguistics
A Corpus Framework

Gard B. Jenset and Barbara McGillivray
Great Clarendon Street, Oxford, ox2 6dp,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Gard B. Jenset and Barbara McGillivray 2017
The moral rights of the authors have been asserted
First Edition published in 2017
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2017933972
ISBN 978–0–19–871817–8
Printed and bound by
CPI Group (UK) Ltd, Croydon, cr0 4yy
Links to third party websites are provided by Oxford in good faith and
for information only. Oxford disclaims any responsibility for the materials
contained in any third party website referenced in this work.
Contents

Series preface
List of figures and tables

1 Methodological challenges in historical linguistics
  1.1 Aims of this book
  1.2 Context and motivation
    1.2.1 Empirical methods
    1.2.2 Models in historical linguistics
    1.2.3 A new pace
  1.3 Main claims
    1.3.1 The example-based approach
    1.3.2 The importance of corpus annotation
    1.3.3 Problems with certain quantitative analyses
    1.3.4 Problems with the research process
    1.3.5 Conceptual difficulties
  1.4 Can quantitative historical linguistics cross the chasm?
    1.4.1 Who uses new technology?
    1.4.2 One size does not fit all: the chasm
    1.4.3 Perils of the chasm
  1.5 A historical linguistics meta study
    1.5.1 An empirical baseline
    1.5.2 Quantitative historical research in 2012

2 Foundations of the framework
  2.1 A new framework
    2.1.1 Scope
    2.1.2 Basic assumptions
    2.1.3 Definitions
  2.2 Principles
    2.2.1 Principle 1: Consensus
    2.2.2 Principle 2: Conclusions
    2.2.3 Principle 3: Almost any claim is possible
    2.2.4 Principle 4: Some claims are stronger than others
    2.2.5 Principle 5: Strong claims require strong evidence
    2.2.6 Principle 6: Possibly does not entail probably
    2.2.7 Principle 7: The weakest link
    2.2.8 Principle 8: Spell out quantities
    2.2.9 Principle 9: Trends should be modelled probabilistically
    2.2.10 Principle 10: Corpora are the prime source of quantitative evidence
    2.2.11 Principle 11: The crud factor
    2.2.12 Principle 12: Mind your stats
  2.3 Best practices and research infrastructure
    2.3.1 Divide and conquer: reproducible research
    2.3.2 Language resource standards and collaboration
    2.3.3 Reproducibility in historical linguistics research
    2.3.4 Historical linguistics and other disciplines
  2.4 Data-driven historical linguistics
    2.4.1 Corpus-based, corpus-driven, and data-driven approaches
    2.4.2 Data-driven approaches outside linguistics
    2.4.3 Data and theory
    2.4.4 Combining data and linguistic approaches

3 Corpora and quantitative methods in historical linguistics
  3.1 Introduction
  3.2 Early experiments
  3.3 A bad case of glottochronology
  3.4 The advent of electronic corpora
  3.5 Return of the numbers
  3.6 What’s in a number anyway?
  3.7 The case against numbers in historical linguistics
    3.7.1 Argumentation from convenience
    3.7.2 Argumentation from redundancy
    3.7.3 Argumentation from limitation of scope
    3.7.4 Argumentation from principle
    3.7.5 The pseudoscience argument
  3.8 Summary

4 Historical corpus annotation
  4.1 Content, structure, and context in historical texts
    4.1.1 The value of annotation
    4.1.2 Annotation and historical corpora
    4.1.3 Ways to annotate a historical corpus
  4.2 Annotation in practice
  4.3 Adding linguistic annotation to texts
    4.3.1 Annotation formats
    4.3.2 Levels of linguistic annotation
    4.3.3 Annotation schemes and standards
  4.4 Case study: a large-scale Latin corpus
  4.5 Challenges of historical corpus annotation

5 (Re)using resources for historical languages
  5.1 Historical languages and language resources
    5.1.1 Corpora and language resources
    5.1.2 Corpus-based and corpus-driven lexicons
  5.2 Beyond language resources
  5.3 Linking historical (language) data
    5.3.1 Linked data
    5.3.2 An example from the ALPINO Treebank
    5.3.3 Linked historical data
  5.4 Future directions

6 The role of numbers in historical linguistics
  6.1 The benefits of quantitative historical linguistics
    6.1.1 Reaching across to the majority
    6.1.2 The benefits of corpora
    6.1.3 The benefits of quantitative methods
    6.1.4 Numbers and the aims of historical linguistics
  6.2 Tackling complexity with multivariate techniques
  6.3 The rise of existential there in Middle English
    6.3.1 Data
    6.3.2 Exploration
    6.3.3 The choice of statistical technique
    6.3.4 Quantitative modelling
    6.3.5 Summary

7 A new methodology for quantitative historical linguistics
  7.1 The methodological framework
  7.2 Core steps of the research process
  7.3 Case study: verb morphology in early modern English
    7.3.1 Data
    7.3.2 Exploration
    7.3.3 The models
    7.3.4 Discussion
  7.4 Concluding remarks

References
Index
Series preface
Modern diachronic linguistics has important contacts with other subdisciplines,
notably first-language acquisition, learnability theory, computational linguistics, sociolinguistics, and the traditional philological study of texts. It is now recognized in
the wider field that diachronic linguistics can make a novel contribution to linguistic
theory, to historical linguistics, and arguably to cognitive science more widely.
This series provides a forum for work in both diachronic and historical linguistics,
including work on change in grammar, sound, and meaning within and across
languages; synchronic studies of languages in the past; and descriptive histories of
one or more languages. It is intended to reflect and encourage the links between these
subjects and fields such as those mentioned above.
The goal of the series is to publish high-quality monographs and collections of
papers in diachronic linguistics generally, i.e. studies focusing on change in linguistic
structure, and/or change in grammars, which are also intended to make a contribution
to linguistic theory, by developing and adopting a current theoretical model, by raising
wider questions concerning the nature of language change or by developing theoretical
connections with other areas of linguistics and cognitive science as listed above. There
is no bias towards a particular language or language family, or towards a particular
theoretical framework; work in all theoretical frameworks, and work based on the
descriptive tradition of language typology, as well as quantitatively based work using
theoretical ideas, also feature in the series.
Adam Ledgeway and Ian Roberts
University of Cambridge
List of figures and tables

Figures
1.1 Technology adoption life cycle modelled as a normal distribution (Moore, 1991)
1.2 Proportions of empirical studies appearing in Language (1960–2011)
1.3 MCA plot of the journals considered for the meta study and their attributes
1.4 The number of observations for various quantitative techniques in the selected studies, for LVC and other journals
2.1 Main elements of our framework for quantitative historical linguistics
3.1 Illustration of Moore’s law with selected corpora plotted on a base 10 logarithmic scale
3.2 Sizes of some selected corpora plotted on a base 10 logarithmic scale, over time
3.3 Log-linear regression model showing the relationship between the growth in computing power and the growth in corpus size for some selected corpora
3.4 Relative frequencies of linguistics terms every 1,000 instances of the word linguistics in the twentieth century, taken from the BYU Google Corpus
4.1 Phrase-structure tree (left) and dependency tree (right) for Example (2)
4.2 The dependency tree of Example (3) from the Latin Dependency Treebank
5.1 Lexical entry for the verb impono from the lexicon for the Latin Dependency Treebank
5.2 Page containing information about the text of Chaucer’s Parson’s Tale from the Penn–Helsinki Parsed Corpus of Middle English
5.3 Part of the entry for Adriatic Sea in Pleiades
6.1 Geometric representation of Table 6.1 in a two-dimensional Cartesian space
6.2 Line that best fits the four points in Figure 6.1
6.3 Plot from MCA on the variables ‘construction’, ‘era’, ‘preverb’, ‘sp’, and ‘class’
6.4 Graph showing the shift in relative frequencies of existential there and empty existential subjects during the Middle English period
6.5 Distribution of V1 and V2 word-order patterns
6.6 Box-and-whiskers plot of conditional probabilities of elements following existential there and empty existential subjects
6.7 Box-and-whiskers plot of the maximum degree of embedded (phrase-structure) elements for sentences with there and empty existential subjects
6.8 Maximum degree of embedding for all sentences in the sample over time, with added non-parametric regression line
6.9 Bar plot of counts of existential subjects by genre
6.10 Bar plot of counts of existential subjects by dialect
6.11 Binned residuals plot of the logistic regression model, indicating acceptable fit to the data
7.1 Plot showing the shifting probabilities over time between -(e)s and -(e)th in the context of third person singular present tense verbs
7.2 Plots of the trends of lemma frequency over time for verb forms occurring with -(e)s and -(e)th
7.3 MCA plot of suffix, corpus sub-period, gender, and phonological context
7.4 Binned residuals plot for the mixed-effects logistic regression model described in Example (2)
7.5 Binned residuals plot for the mixed-effects logistic regression model described in Example (3)
7.6 Binned residuals plot for the mixed-effects logistic regression model described in Example (4)

Tables
1.1 Classification of sample papers according to whether they are corpus-based/quantitative
1.2 Classification of papers from Language (2012) according to whether they are corpus-based/quantitative
1.3 Confidence intervals (95%) for the percentage of quantitative papers in Language (2012) and the historical sample
1.4 Classification of sampled papers according to whether they are corpus-based/quantitative (excluding LVC)
4.1 The first four lines of Virgil’s Aeneid in tabular format, where each row corresponds to a line
4.2 Example of bibliographical information on a hypothetical collection of texts in tabular format
4.3 Example of metadata and linguistic information encoded for the first three word tokens of Virgil’s Aeneid
6.1 Example of a data set recording the century of the texts in which prefixed verbs were observed, and the proportion of their spatial arguments expressed as a PP out of all their spatial arguments
6.2 Subset of data frame used for study on Latin preverbs in McGillivray (2013)
6.3 Frequencies of there1 and Ø according to dialect in Middle English
6.4 Frequencies of there1 and Ø according to genre in Middle English
6.5 Coefficients for the binary logistic regression model showing the log odds ratio for switching from there1 to Ø
7.1 Part of the metadata extracted from the PPCEME documentation
7.2 Part of the data extracted from PPCEME
7.3 Frequencies of verb tokens in the sample from texts produced by female and male writers, broken down by corpus sub-period
7.4 Summary of fixed effects from the mixed-effects logistic regression model for E2 described in Example (3)
7.5 Summary of predictors from the mixed-effects logistic regression model for E3 described in Example (4)
1 Methodological challenges in historical linguistics
1.1 Aims of this book
The principal aims of this book are to introduce the framework for quantitative
historical linguistics, and to provide some examples of how this framework can be
applied in research. Ours is a framework and not a ‘theory’ in any of the senses
commonly used in historical linguistics. For example, we do not take a position in
favour of a particular formalism for corpus annotation, nor do we offer answers
to metaphysical questions such as ‘what is language?’ or ‘how is it learned?’. What we
are interested in is how corpus data can be employed to gather evidence that we
can analyse quantitatively to model various historical linguistics phenomena. To this
end, we set out principles for the research process as a whole. Ultimately, the aim of
quantitative historical linguistics is to make it easier to settle disputes in historical
linguistics by means of quantitative corpus evidence, so as to progress the field as a whole.
The more concrete desirable outcome is the increased use of quantitative corpus
methods in historical linguistics through the adoption of a systematic methodological
framework.
Because the present book is about methodology in historical linguistics, it does not
primarily explain specific, individual methods, but discusses the relationship between
corpus data, aims, methods, and ways of doing research in historical linguistics. More
specifically, given some desirable outcomes, we discuss the necessary steps that should
be taken to achieve those outcomes (Andersen and Hepburn, 2015). There are three
focal points in this discussion:
(i) Why should historical linguistics adopt quantitative corpus methods to a larger
extent?
(ii) What are the obstacles for a more widespread adoption of such methods?
(iii) How ought such methods to be used in historical linguistics?
The first two points are addressed in the present chapter and in the next one, and
set out the context for the original contribution of this publication; the last point is the
focus of our framework, and is dealt with throughout the book.
1.2 Context and motivation
From what we have said so far it should be clear that this book is not an introduction
to corpus linguistics, nor is it an introduction to quantitative techniques. There are
already some very good introductions to corpus linguistics in print, such as McEnery
and Wilson (2001), Gries (2009b), and McEnery and Hardie (2012). There are also
good books introducing quantitative techniques to linguists, including Baayen (2008)
and Johnson (2008). Our position is that core corpus linguistics concepts such as
collocations, concordances, and frequency lists can be taught without necessarily
referring to historical data and still be transposed to historical linguistics. Likewise,
statistical techniques such as null-hypothesis testing, regression modelling, or correspondence analysis (CA) can be explained and illustrated with synchronic data just as well as with historical data.
So if corpus linguistics and quantitative techniques can be taught without specific reference to historical linguistics, is there a need for a dedicated methodology for quantitative corpus-based historical linguistics? We believe there is, as we explain here.
Historical linguistics is an endeavour that is highly data-centric, as Labov (1972, 100)
observed when he described historical linguistics as making the best use of ‘bad data’,
i.e. imperfect pieces of evidence riddled with gaps. We also agree with Rydén (1980, 38)
that the ‘study of the past [. . .] must be basically empirical’, and with Fischer (2004, 57)
that ‘[t]he historical linguist has only one firm knowledge base and that is the historical
documents’. Moreover, we subscribe to what Penke and Rosenbach (2007b, 1) write:
‘nowadays most linguists will probably agree that linguistics is indeed an empirical
science’, and the thorny questions are instead what kind of evidence ought to be used,
and how it ought to be used.
In spite of the high-level awareness of historical linguistics as data-focused, quantitative corpus methods are still underused and often misused in historical linguistics,
and an overarching methodological structure inside which to place such methods is
missing, as we illustrate in sections 1.3 and 1.5. We believe that the question of what
it means for historical linguistics to be empirical (in the corpus-driven quantitative
sense that we define in our framework) is much less clear, as Penke and Rosenbach
(2007b) acknowledge is also the case for linguistics in general. With the additional
challenges posed by the special nature of historical language data, the concern with methodological development should certainly be no weaker in historical linguistics
than in other linguistic disciplines. Therefore, the most pressing gap to fill is not a
book introducing corpus methods or statistical techniques to historical linguists, but
a book that tackles what it means to be empirical in historical linguistics research,
and how to go about doing it. That is precisely what we want to achieve with the
present book.
1.2.1 Empirical methods
The term ‘empirical’ is of use to us to the extent that the practices covered by it improve the precision of professional linguistic communication regarding data, evidence, hypotheses, and quantitative models in historical linguistics.
Penke and Rosenbach (2007b, 3–9) show how the term ‘empirical’ is used to
mean very different things in linguistics, including testing (i.e. attempting to falsify)
hypotheses, rational enquiry by means of counter-evidence, as well as data-driven
approaches that may rely on qualitative or quantitative evidence. We agree with
Penke and Rosenbach (2007b, 4–5) that a strict Popperian falsificationist definition
of empirical research (with the requirement that it collects data that can falsify a
hypothesis or theory) is problematic, since it quickly runs into grey areas of the kind
‘exactly how many counter-examples does it take to falsify the hypothesis?’. Instead, we
argue that a distinction conceptualized as a probabilistic continuum, where individual
pieces of evidence can increase or reduce support for a given hypothesis, is more
useful. Such a probabilistic approach is transparent to the extent that the data forming
the basis for the continuum are objectively verifiable. For the same reason, we consider
approaches based exclusively on intuitions about acceptability or grammaticality to be
less useful, since what constitutes sound empirical proof of grammaticality is subject
to individual judgements that vary greatly.
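To make the idea of a probabilistic continuum more concrete, the following sketch (ours, with an invented prior and invented likelihoods) shows how each piece of evidence can raise or lower the support for a hypothesis H through Bayes’ rule:

    # A sketch of evidence as a probabilistic continuum: each observation nudges
    # the support for a hypothesis H up or down via Bayes' rule. The prior and
    # the likelihoods are invented for illustration.

    def update(prior, p_e_given_h, p_e_given_not_h):
        """Posterior P(H | E) from the prior P(H) and the two likelihoods."""
        joint_h = prior * p_e_given_h
        joint_not_h = (1 - prior) * p_e_given_not_h
        return joint_h / (joint_h + joint_not_h)

    support = 0.5  # start out agnostic about H
    # (P(E | H), P(E | not H)) for each observation: three pieces of evidence
    # that favour H, then one counter-example that lowers the support again.
    observations = [(0.8, 0.3), (0.7, 0.4), (0.8, 0.3), (0.2, 0.6)]
    for p_h, p_not_h in observations:
        support = update(support, p_h, p_not_h)
        print(f"support for H: {support:.2f}")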
For the purposes of the present book, what it means to be ‘empirical’ in historical
linguistics is thus a matter of transparency and objective verifiability. This is related to
the point made by Geeraerts (2006) who argues that empirical methods are needed to
decide between competing conclusions in linguistics.
The ideal of transparency and objectivity can in principle be approached either
by means of a categorical argument or a probabilistic one. In their discussion of
how to set up a linguistic argument, Beavers and Sells (2014) point out that at the
end of the day linguistic argumentation is about classification: is item x an instance
of morpheme/phoneme/construction/etc. y or some other morpheme/phoneme/
construction/etc. z? This is a prime example of categorical argumentation. In categorical terms, an item x cannot partially belong to a class, or belong to it to some degree.
This contrasts with a probabilistic approach where arguments based on probabilities
derived from e.g. corpus frequencies can be used to establish a graded classification
scheme whereby x is an instance of y with a given probability. Probabilistic approaches
have become increasingly popular, especially in the computational linguistics community; for example, Lau et al. (2015) describe unsupervised language models for
predicting human acceptability judgements in a probabilistic way, and argue: ‘it is
reasonable to suggest that humans represent linguistic knowledge as a probabilistic,
rather than as a binary system. Probability distributions provide a natural explanation
of the gradience that characterises acceptability judgements. Gradience is intrinsic
to probability distributions, and to the acceptability scores that we derive from these
distributions’ (Lau et al., 2015, 1619).
The opposition between categorical and probabilistic approaches corresponds in
many ways to the distinction between a classical category structure based on necessary
and sufficient features, and a category structure based on degrees of resemblance to a
prototype, as discussed in Croft and Cruse (2004, 76–81).
To some extent, a qualitative approach is necessary, especially when dealing with the
clear, central cases. For instance, the morphemes in or at can clearly function as prepositions. However, the marginal cases such as concerning, regarding, following, given
are more difficult to place. Should they be considered as prepositions in some cases,
or not? Or only to some degree? A probabilistic approach might answer the question
differently by stating that some morphemes occur more often than others in certain
grammatical contexts, allowing us to establish a probabilistic membership of the class.
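As a toy illustration (the counts below are invented for a hypothetical tagged corpus), such graded class membership can be read directly off tag frequencies:

    # Graded class membership read off corpus frequencies: invented counts of
    # part-of-speech tags assigned to 'concerning' in a hypothetical tagged corpus.
    from collections import Counter

    tag_counts = Counter({"PREP": 312, "VERB-ING": 95, "ADJ": 3})  # hypothetical

    total = sum(tag_counts.values())
    for tag, count in tag_counts.items():
        print(f"P({tag} | 'concerning') = {count / total:.2f}")
    # Instead of a yes/no answer to 'is concerning a preposition?', we obtain a
    # graded membership: a preposition with a probability of roughly 0.76.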
It should be clear that such a probabilistic approach to description and classification
does not (and is not intended to) completely do away with qualitative linguistic
judgements. For instance, how to decide what counts as a grammatical context?
A strictly probabilistic approach might run the risk of descending into an infinite
regression problem of probability estimates that rely on other probability estimates,
without any clear starting point for practical investigation of the phenomena of
interest. Therefore, we are content to take as axiomatic certain statements about
language and the conceptual framework for analysing language.
At first glance, this might seem like a half-way solution at best; at worst it may
suggest quantitative methods as a form of freeloading. Or as Campbell (2013, 484)
phrases it: quantitative methods appear ‘to involve methods that depend on the
results of the prior application of linguistic methods, made to masquerade as numbers
and algorithms’. However, this view is far too negative, and grossly overstates the
differences between quantitative and qualitative models in linguistics, as we will
explain in the next section.
1.2.2 Models in historical linguistics
A model is a representation, and any kind of linguistics is about creating models.
Zuidema and de Boer (2014) argue that, although all kinds of linguistics involve
modelling of some kind, the nature of the models differs. The model might be a
representation of a genealogical relationship between languages, or it might represent
a particular part of a grammatical system. Zuidema and de Boer (2014) discuss four
main types: symbolic models, statistical models, memory-based models, and connectionist models. We will only discuss the first two here, since they are of particular interest
in the context of our framework.
The key differences between symbolic models and statistical models are how they
deal with variation and complexity. In a symbolic model, such as the phrase-structure
tree in Example (1), no reference is typically made to how many times the individual parts occur together in a corpus. The model operates with discrete, qualitative categories (such as S, VP, and NP), and provides the rules that connect the categories in specific ways.

(1) [S [NP Trees] [VP [V are] [NP symbolic models]]]
As Zuidema and de Boer (2014) point out, such symbolic models tend to be vulnerable
to linguistic variation and performance factors. A statistical model, on the other
hand, is crucially reliant on quantitative information about how often combinations
of words, categories, or features are found. Since statistical models by default assume
a certain amount of variation in the data, they are very well equipped to deal
with variation, and they are uniquely able to disentangle very complex patterns of
probabilistic dependence between categories or features. This is particularly suited to
the case of corpus data, which always contain a frequency or quantitative dimension.
However, a purely statistical model may struggle with other types of complexity.
Zuidema and de Boer (2014) mention long-distance syntactic dependencies as one
example. As Zuidema and de Boer (2014) point out, when the two types of models are
combined, they can complement each other by allowing a probabilistic analysis that
builds on the symbolic model.
Manning (2003) discusses one way to build the statistical modelling on the symbolic
model. For instance, rather than adhering to a hard distinction between different
argument patterns for verbs, Manning (2003, 303) gives the example of representing
the different subcategorization patterns for the English verb retire as probabilities
like this:
(2) P(NP[obj] | V = retire) = 0.52
    P(PP[from] | V = retire) = 0.05
The annotation expresses that with the verb retire there is a probability of 0.52 of
encountering an NP functioning as an object, and a probability of 0.05 of encountering
a PP headed by the preposition from; this way, we do not have to choose only one
option for the argument patterns of this verb. This annotation keeps the same symbolic
(or qualitative) categories as the one above (NP, V, PP), but uses probabilities to
encode the relations between them.
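The estimation behind such figures can be sketched as follows; the counts are invented, chosen only so that the resulting proportions reproduce Example (2):

    # Estimating the probabilities of Example (2) as relative frequencies of
    # subcategorization frames observed with a verb. The counts are invented;
    # they are chosen so that the output reproduces the figures in Example (2).
    from collections import Counter

    frame_counts = Counter({   # hypothetical counts for V = retire
        "NP[obj]": 520,
        "PP[from]": 50,
        "intransitive": 430,
    })

    n_retire = sum(frame_counts.values())
    for frame, count in frame_counts.items():
        print(f"P({frame} | V = retire) = {count / n_retire:.2f}")
    # P(NP[obj] | V = retire) = 0.52, P(PP[from] | V = retire) = 0.05, and the
    # remaining probability mass goes to other frames.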
Alternatively, the statistical modelling may take the form of a statistical analysis
of frequency information derived from a collection of symbolic models, without
the intention of feeding the probabilities back into the grammatical model, as in
Example (2). A typical instance of this approach is the statistical analysis of annotated
corpora that are enriched with part-of-speech information or syntactic annotation,
in order to draw conclusions about usage, grammar, or language change. Clearly,
the scepticism expressed by Campbell (2013, 484) about quantitative models being
qualitative models ‘masquerading as numbers’ is not warranted. On the contrary:
investigating the same phenomenon by means of different types of models (what
Zuidema and de Boer (2014) call ‘model parallelization’) can lead to rich new insights
that combine the best qualities of both types of models. Thus, there is no real
opposition between qualitative (or symbolic) models and quantitative models. The
real question is how to achieve this in practice, as we discuss next.
1.2.3 A new pace
Although there certainly are concrete challenges in building corpora and adopting
specific quantitative methods, we believe that the main obstacle is not concrete. In
a discussion about the French eighteenth-century scholar Pierre Louis Maupertuis
(who formulated a theory stating that material particles from the seminal fluids of
both the mother and the father were responsible for forming the foetus), Gould (1985,
151) makes the following observation:
We often think, naïvely, that missing data are the primary impediments to intellectual
progress—just find the right facts and all problems will dissipate. But barriers are often
deeper and more abstract in thought. We must have access to the right metaphor, not only
to the requisite information. Revolutionary thinkers are not, primarily, gatherers of facts, but
weavers of new intellectual structures. Ultimately, Maupertuis failed because his age had not
yet developed a dominant metaphor of our own time-coded instructions as the precursor to
material complexity.
This quote very effectively stresses the important role of metaphors in preparing
the ground for true innovations in a field. Returning to historical linguistics, we
believe that the availability of historical corpora and statistical techniques alone
are insufficient to achieve the methodological shift that we propose here. What is
required is just as much a conceptual change of pace, whereby linguistic problems
are reformulated as complex interplays of factors that can be addressed quantitatively
by means of corpus data. Such a reconceptualization has a knock-on effect in terms
of what we consider as data and evidence, as well as the status of theoretical concepts.
This is why the present treatment goes beyond a collection of best practices for doing
historical corpus linguistics, although such advice is also discussed both in the present
and in the subsequent chapters.
Some might argue that the change we are discussing here is unnecessary, since
historical linguistics is already making use of corpora and quantitative techniques.
After all, Labov (1972) commended historical linguists for what he considered their
superior methodological rigour compared to synchronic linguists. It might also be
argued that the change is already well under way, and that corpus methods and
quantitative techniques are becoming more important in the historical linguist’s
toolbox. Hilpert and Gries (2009) state that large corpora are increasingly being used
in historical linguistics; and with growing corpus size comes the need for statistical
techniques to handle large and complex data.
The first question is an empirical question about the present: to what extent are
historical linguists already using quantitative techniques and corpus methods, and
are they using them more or less than some relevant level of comparison? This is a
question we return to in more detail in Chapter 3, along with a discussion of how
quantitative methods have been used in historical linguistics previously. The second
argument, that the change we are advocating is already well under way, is more subtle,
since it is in fact a prediction. It assumes that we can observe some changes and
that those changes will continue until their natural completion. However, as with any
prediction, the result is only as good as the assumptions it builds on. In this case, the
assumption that the adoption of a specific set of technologies (corpus methods and
quantitative techniques) will continue at the present rate is an assumption that may
not be warranted. In section 1.4 we discuss some of the dynamics involved with the
adoption of new technologies, which we will argue also apply in the case of quantitative
historical linguistics.
Of course, the conceptual difficulties should not completely overshadow the practical obstacles involved in doing quantitative historical linguistics. However, the distinction can sometimes be hard to make. This is the reason for our efforts in compiling
a proper methodology which constitutes a framework within which to discuss these
matters. Specifically, sections 2.1, 2.2, and 2.3 set out a series of definitions, principles,
and best practices for quantitative historical linguistics. With the fundamentals we
set out acting as a common ground, the impetus for solving the practical obstacles becomes all the stronger.
In summary, there is a real need for a methodological treatment of quantitative corpus methods in historical linguistics, one that sketches the place of such methods in the broader historical linguistics landscape, and that offers a link between
the more conceptual level and the concrete computational and quantitative techniques taught in general courses for linguists. The present book takes on this
challenge by first acknowledging the conceptual hurdles represented by a required
shift in thinking as much as in doing. In the spirit of Gould (1985), we take
seriously the need for appropriate metaphors to help make concrete the changes
involved.
In addition to the metaphor already mentioned above, namely seeing the spread
of quantitative corpus methods in historical linguistics as analogous to a technology
adoption process (further discussed in section 1.4), we see the following as some of the
governing metaphors of the approach we propose here. We do not claim uniqueness
or novelty in conceptualizing language via the metaphors below, but we do consider
them to be central to our approach:
• language change phenomena as outcomes modelled by a set of predictors;
• language data as multidimensional;
• historical linguistics as a humanistic field that not only analyses the particular,
but also looks for patterns and extends these to include probabilistic patterns.
This chapter will add more meat to the bones of the suggested methodological
approach. However, before that is discussed, the next section will elaborate on some
of the main claims involved in our argument.
1.3 Main claims
This section highlights the methodological gaps in historical linguistics and how our
proposal addresses them.
1.3.1 The example-based approach
As shown by the evidence we have collected and which we will illustrate in section
1.5.2, historical linguistics generally does not make full use of corpora. This is not to
say that research in historical linguistics disregards primary sources of evidence, nor that exceptions to this statement are not becoming increasingly common.
However, historical linguistics is still far from considering corpora as the default or
preferred source of evidence.
Not using corpora is justified in a limited number of circumstances. In some cases,
for example, the only evidence sources for a historical language are so limited that it
is not possible to build a corpus; examples include languages not attested in written
form (like Proto-Indo-European), or languages for which we only have access to an
extremely limited number of fragments. Apart from such particular instances, corpora
should be built for historical languages and diachronic phenomena, when they are not
already available, and should be used as an integral part of the research process.
In the literature review reported on in section 1.5.2 we will observe that the
proportion of historical linguistics research articles employing corpus data is lower than in general linguistics. When texts or corpora are the source
of evidence, we often enter the realm of example-based approaches. Example-based
approaches do not aim at an exhaustive account of the data and can be suitable to
show whether or not a particular construction or form is attested, which is in line with
a qualitative view of language. However, if we want to quantify how much a particular
form or construction is used, we need to resort to a larger pool of data that have
been collected systematically. As we discuss in Chapter 6, such a quantitative approach may (though not necessarily) be coupled with a deeper view of language as inherently
probabilistic. In any case, the example-based approach is not appropriate as a basis
for probabilistic conclusions about language and comes with a full range of problems,
which we discuss in this section.
Let us consider Rovai (2012) as a methodological case study; this article is very clear
and detailed, and we will use it as an illustration of the example-based methodology.
The paper analyses Latin gender doublets, i.e. those nouns that occur as both neuter
and masculine nouns. To support his statements, the author lists illustrative examples
(97–100), such as:
Corium ‘skin’ is currently attested as a thematic neuter at all stages of the languages, but in
Plautus’ plays (e. g. Poen. 139: ∼ 197 bc) and in Varro’s Menippeae (Men. 135: 80–60 bc) there
also occurs the masculine gender
The examples are taken from a canonical body of texts, whose critical editions are
listed in the bibliography. However, it is not clear how the author selected the examples
provided. This is all the more important when the examples reported in the research publication are not meant for illustration purposes only, but are the object of the analysis itself. We cannot tell whether the author omitted occurrences that contradict the hypothesis, which brings with it the risk of so-called ‘confirmation bias’ (see Risen and Gilovich, 2007, 110–30 and Kroeber and Chrétien, 1937). Generally,
the lack of transparency on the selection criteria for the examples presented has
negative implications for the replicability of the studies. If another researcher were
to go through the same texts, due to the lack of clear selection criteria, he or she
would probably choose a different cohort of examples, and potentially reach different
conclusions. When the examples constitute the main basis of the argumentation, and
no more details about the rest of the evidence are given, the research conclusions
themselves may rest on unstable grounds.
Another problem with the example-based approach is that it limits the range
of questions that can be addressed in the research task. In fact, by not explicitly
stating the total number of words or instances from which the examples were drawn,
this approach cannot give a good sense of the quantitative value of the phenomena
illustrated, and cannot draw quantitative generalizations beyond the examples given.
The role of examples is limited to providing evidence that a linguistic phenomenon is
attested or not. Questions relating to the variation in the data, like ‘How many times
is corium attested as a thematic neuter?’ or ‘How many times does corium occur as
a masculine noun in Plautus and Varro?’ cannot be answered by an example-based
methodology.
Another case of the example-based approach is Bentein (2012). The main object
of study here is the function of the periphrastic perfect in Ancient Greek according
to the period and according to the discourse primitives as theorized in the mental
spaces theory. His evidence basis consists of 784 examples taken from previous studies.
Although the author says that ‘[t]aken together, these studies comprise a large part of
Ancient Greek literature, both prose and poetry’ (175), it is not clear which texts he did not analyse, nor from how many instances he selected the examples, which makes it impossible
to place the data into their correct quantitative context.
The example-based approach raises further issues. As we will see more extensively in Chapter 6, it does not allow for a quantitative analysis, or, when it does, it typically rests on too few data points to obtain statistically significant results and sufficiently large effects. This is accompanied by a lack of formal hypothesis testing, as we will motivate further in section 1.3.3.
Moreover, analyses from example-based studies are not easily reproducible. Negative evidence for an argument is as critical as positive evidence. Which factors were
considered, which ones were found to be important, and which ones were not found
important? Also, this approach allows the researcher to perform the analysis of the
published examples on an ad hoc basis, according to criteria that vary depending on
the specific examples being analysed. An example may be used to show the relevance
of a particular feature (say, animacy for word order) and another one to demonstrate
another feature (say, the case of the object), but we are not given a full overview of
all the relevant features for all examples. This is what we call the practice of ‘post hoc
analysis’, and we will explain it further in section 1.3.3.
1.3.2 The importance of corpus annotation
García García (2000, 121) says: ‘[a]n exhaustive analysis of any linguistic issue in a
corpus language should be based, ideally, on a study of all the available texts in that
language or that period of the language [. . .] This is obviously a task that exceeds the
possibilities of any individual. Therefore, any feasible study must necessarily be based
on a limited and therefore incomplete corpus.’ This claim is justified if we assume that
the data need to be collected and analysed manually, which is not the only way, as we
discuss in this section.
When corpora are available and when the phenomena studied fall into the scope
discussed in section 2.1.1, corpora should constitute the source of linguistic data, and
larger corpora should be preferred to smaller corpora, all other things being equal.
Fortunately, it is not necessary to analyse all corpus data if the corpus has been
annotated.
Let us consider the example of a study on word order in Latin. Word-order change
is a complex phenomenon where morphological, syntactic, semantic, and pragmatic
factors play a role. Let us assume that our study focuses on morphosyntactic aspects
of word-order change. For this purpose, a morphosyntactically annotated corpus
(treebank) is the ideal evidence source (for an illustration of treebanks, see section
4.3.1). The example-based approach would imply analysing a set of texts to identify, for
example, the different word-order patterns used (SVO, OVS, etc.). Instead, Passarotti
et al. (2015) systematically used the data from the Latin Dependency Treebank and
the Index Thomisticus Treebank (via the Latin Dependency Treebank Valency Lexicon
and the Index Thomisticus Treebank Valency lexicon, see McGillivray 2013, 31–60) to
automatically retrieve such patterns, together with the metadata information relative
to the authors of the texts where each pattern was observed. After the phase of data
extraction from the corpus sources, the authors carried out a quantitative analysis of
the distribution of every word-order pattern by author, identifying a trend that has a
diachronic component and a genre component.
Passarotti et al. (2015) kept the phase of data collection and the phase of data analysis
completely separate, as the data were collected from corpora that had been annotated
by independent research projects. This has the advantage of eliminating the bias that
could arise when the researcher aims at proving a particular theoretical statement and
may unconsciously select examples that support that statement. Because the authors
conducted a systematic analysis of all available corpus data from the treebanks, there
was no option to analyse only specific examples. Also, the presence of the annotation
meant that they could use a much larger evidence base than they would have used if
they had had to manually analyse every instance.
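A minimal sketch of this kind of extraction is given below. It is our simplified illustration, not the pipeline of Passarotti et al. (2015): each token is represented as a tuple of position, lemma, syntactic relation, and head, with relation labels loosely following those of the Latin Dependency Treebank:

    # A sketch of reading word-order patterns off a dependency-annotated corpus.
    # This is our simplified illustration, not the actual pipeline of Passarotti
    # et al. (2015); tokens are (position, lemma, relation, head), with relation
    # labels loosely following the Latin Dependency Treebank (PRED, SBJ, OBJ).
    from collections import Counter

    def word_order(tokens):
        """Return a pattern such as 'SOV' from the linear positions of subject,
        object, and predicate in one sentence, or None if one is missing."""
        role_of = {"SBJ": "S", "OBJ": "O", "PRED": "V"}
        slots = {role_of[rel]: pos for pos, lemma, rel, head in tokens
                 if rel in role_of}
        return "".join(sorted(slots, key=slots.get)) if len(slots) == 3 else None

    # 'puella rosam amat' (the girl loves the rose): S at 1, O at 2, V at 3
    sentence = [(1, "puella", "SBJ", 3), (2, "rosa", "OBJ", 3), (3, "amo", "PRED", 0)]
    corpus = {"Cicero": [sentence]}  # invented corpus keyed by author metadata

    # Aggregate pattern counts by author, as input for a quantitative analysis.
    counts = {author: Counter(word_order(s) for s in sents)
              for author, sents in corpus.items()}
    print(counts)  # {'Cicero': Counter({'SOV': 1})}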
If we search a corpus that has not been annotated, our search may have low precision,
because we may find a large number of irrelevant instances. Imagine that we are
interested in the uses of the English determiner that. If we search a corpus for the
string ‘that’, we will find a high number of occurrences of conjunctions. If we have a
corpus annotated by part of speech, however, we can limit our searches to include only
determiners whose lemma is ‘that’ and avoid a very time-consuming manual post-selection.
Another risk in using an unannotated corpus concerns low recall. Imagine that we
want to identify relative clauses not introduced by relative pronouns, as in the train
they took was delayed. A corpus annotated for clause type would make it easy for us to obtain those instances; conversely, if the corpus does not have this kind of annotation,
it is very difficult to find the relevant patterns.
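The contrast can be sketched as follows; the mini-corpus of (token, tag, lemma) triples and its Penn-style tags are invented for illustration:

    # Illustrating the precision point: with part-of-speech and lemma annotation
    # a query can target only determiner uses of 'that'. The mini-corpus and its
    # Penn-style tags (DT, IN, WDT, NN) are invented for illustration.
    corpus = [
        ("that", "DT", "that"),    # determiner: 'that book'
        ("that", "IN", "that"),    # complementizer: 'said that ...'
        ("book", "NN", "book"),
        ("that", "WDT", "that"),   # relative pronoun: 'the book that ...'
    ]

    # Plain string search: every 'that' matches, whatever its function.
    string_hits = [tok for tok in corpus if tok[0] == "that"]

    # Annotation-aware search: only determiners whose lemma is 'that'.
    det_hits = [tok for tok in corpus if tok[1] == "DT" and tok[2] == "that"]

    print(len(string_hits), len(det_hits))  # 3 1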
Another advantage of using annotated corpora has to do with the research methodology, particularly the distinction between the annotation phase and the analysis
phase, and the relationship between annotation and linguistic theoretical framework,
as we discuss more extensively in section 2.4.3. Let us consider the case of verbal
phrases in Old English. The York–Helsinki Parsed Corpus of Old English (Pintzuk
and Plug, 2002) annotates a number of linguistic features, but does not annotate verb
phrases (VPs) specifically, due to a number of reasons, including the fact that the
boundaries of VPs in Old English are still disputed. If we were interested in using this
corpus to further the research on VPs in Old English, we could use the annotation of the corpus to investigate the elements that define VPs, so as to obtain a corpus-based distributional definition of VPs. This way, the corpus analysis could support
the definition of VPs themselves, thus leading to empirically informed theoretical
statements.
1.3.3 Problems with certain quantitative analyses
We have talked about the problems with manually collecting and analysing examples
for a study on historical linguistics. Even when a corpus is used, there can be
methodological problems in the subsequent phase, the analysis of the data. This book
addresses this point in detail in Chapter 6. Here we will summarize two main aspects:
the use of raw frequency counts and the practice of what we call ‘post hoc analysis’.
Letting numbers speak for themselves
In the literature review we present in section
1.5 we will see that, in the cases where quantitative evidence is used in the historical
linguistics publications we examined, there is a large variability in the statistical
techniques used, ranging from simple interpretation of raw frequency counts or
percentages, to null-hypothesis testing and multivariate statistical techniques. This
highlights a lack of standardization and best practices on which techniques are best
suited to study the particular phenomenon at hand, and we will cover this in more
detail in Chapter 6. Here, we will focus on the problems caused by the practice of
using raw frequencies and ‘letting the numbers speak for themselves’.
Let us consider the example of Bentein (2012, 186–187). After introducing the previous literature and his theoretical framework, and after analysing a series of examples,
the author introduces some quantitative data in terms of frequency counts of Ancient
Greek periphrastic perfect forms, broken down by author and person/number features. He uses the raw frequency counts to argue for ‘a general increase of the periphrastic perfect’, which ‘must have been—at least partially—morpho-phonologically
motivated’. The frequency data are presented as follows: ‘almost all examples occur with the 3sg/pl’. It is not at all clear how such a diachronic trend was detected,
since the frequency counts presented do not even follow a monotonic distribution;
moreover, the author gives no indication of the relevance of those forms with respect
to the overall amount of data available for each author, making it impossible to assess
the raw frequencies in any meaningful way. As for the predominance of third person singular or plural, the statement seems to be based purely on the raw
frequencies as well. In other words, letting the raw frequencies ‘speak for themselves’
is problematic, as we further explain below.
Let us take the example of McGillivray (2013, 57), who collected the frequencies
of the Latin word-order patterns VO and OV in two corpus-driven lexicons, one
based on classical Latin authors and one based on St Thomas’s and St Jerome’s texts.
OV has a higher frequency than VO in the classical data set (152 vs 52 occurrences)
and VO is more frequent than OV in the later age data set (107 vs 38). A simple
inspection of the raw frequencies would lead us to conclude that OV is preferred by
the classical authors and VO by the later authors. However, we may not have enough
data to exclude that the differences are due to chance. The rational way to answer this
question is by performing a statistical significance test. The author used Pearson’s chi-square test (illustrated in section 6.3.3) and found a significant result,1 which points to a difference between the two groups of authors as regards their choice of word-order pattern; more precisely, the probability of finding those frequencies under the assumption that the two variables (author group and word-order pattern) are independent is less than 1 per cent. Once we have established whether or not the two variables are significantly associated, it is important to consider the size of
the detected difference, the effect size, since high frequencies tend to magnify small
deviations (Mosteller, 1968). Effect sizes provide a standardized measure of how large
the detected difference is. In the case of Latin word-order patterns mentioned above,
the author found a large effect size,2 which justifies the conclusion that the two groups
of authors have indeed very different preferences for word-order patterns.
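For illustration, the test and the effect size can be recomputed from the counts given above; this is our sketch, using scipy, whose default Yates continuity correction for 2×2 tables yields the statistic reported in footnote 1:

    # Recomputing the significance test and effect size from the counts above:
    # classical authors OV = 152, VO = 52; later authors OV = 38, VO = 107.
    from math import sqrt
    from scipy.stats import chi2_contingency

    table = [[152, 52],   # classical: OV, VO
             [38, 107]]   # later age: OV, VO

    # scipy applies Yates' continuity correction to 2x2 tables by default.
    chi2, p, dof, expected = chi2_contingency(table)
    n = sum(sum(row) for row in table)
    phi = sqrt(chi2 / n)  # effect size for a 2x2 contingency table

    print(f"chi-square({dof}) = {chi2:.2f}, p = {p:.2g}, phi = {phi:.2f}")
    # chi-square(1) = 77.79, matching footnote 1; phi comes out around 0.47,
    # a large effect by conventional thresholds.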
Adger (2015, 133) presents a special case of the argument that numbers should speak
for themselves.3 He starts by quoting Cohen’s (1988) rule of thumb that a large effect
size is one that can be identified with the naked eye. However, he then commits a
logical fallacy when he conflates the estimation of the size of an effect with the problem
of establishing whether or not we are faced with a meaningful difference or correlation.
To speak of a ‘large effect’ implies that we have enough data to speak of such an effect
to begin with. This is precisely the purpose of statistical testing, and only after this step
is it meaningful to speak of effect size. As Adger (2015, 133) puts it: ‘most syntacticians
feel justified in not subjecting [data] to statistical testing’. We cannot help but conclude
that such confidence is misplaced.
1 p < 0.01, χ²(1) = 77.79.
2 ϕ = 0.474.
3 We are grateful to Kristian Rusten for bringing this publication to our attention.

Dealing with linguistic complexity
The second problem affecting quantitative analyses that we will examine here is the tendency towards what we call ‘post hoc analysis’, which is related to the example-based approach covered in section 1.3.1. Post hoc analysis consists in collecting occurrence counts of a phenomenon in a set of texts or a corpus, and then focusing the analysis on specific examples drawn from this evidence basis, highlighting the role played by certain variables, which are analysed in a non-systematic way and are introduced after the data collection. This approach attempts to account for the multidimensional nature of the phenomenon at hand, but it does so without employing techniques from multivariate statistical analysis (see section 6.2 for more details). This is an instance of the search for particular elements, as opposed to recurrent, general patterns (see discussion in section 1.3.5). For instance, we may say that in a particular example the choice of word-order pattern seems to be related to a particular grammatical case of the object, and explain why this is the case based
on our theoretical statements. Then, we may argue for the role played by semantics by
illustrating it with an example showing a particular semantic class of the subject.
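By way of contrast, the sketch below shows what the systematic alternative looks like on invented data: every observation is coded for all the variables of interest, and a single multivariate model (here a binary logistic regression fitted with the statsmodels library) estimates their joint effects, instead of one variable being invoked per hand-picked example:

    # A sketch of the systematic alternative to post hoc analysis: every
    # observation is coded for all variables of interest, and one multivariate
    # model estimates their joint effects. Data below are invented; the column
    # names are ours, not those of the studies discussed.
    import pandas as pd
    import statsmodels.formula.api as smf

    rows = [
        {"order_ov": 1, "obj_case": "acc", "subj_class": "animate"},
        {"order_ov": 1, "obj_case": "acc", "subj_class": "inanimate"},
        {"order_ov": 0, "obj_case": "dat", "subj_class": "inanimate"},
        {"order_ov": 0, "obj_case": "acc", "subj_class": "animate"},
        {"order_ov": 1, "obj_case": "dat", "subj_class": "animate"},
        {"order_ov": 0, "obj_case": "dat", "subj_class": "inanimate"},
    ] * 25  # replicate the toy pattern so the model has something to fit

    df = pd.DataFrame(rows)
    # Binary logistic regression: OV vs VO as a function of object case and
    # subject semantic class, estimated jointly rather than example by example.
    model = smf.logit("order_ov ~ obj_case + subj_class", data=df).fit(disp=False)
    print(model.params)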
For example, in Bentein (2012, 187–8) the author introduces the following post
hoc variables to analyse periphrastic perfects in fifth-century classical Greek: passive
voice, object-oriented and resultative nature, and telicity of the verbs. As part of the
discursive description of the examples, the author adds some quantitative details
in a footnote on page 188. Such details further specify the statement ‘especially in
Sophocles and Euripides one can find relatively more subject-oriented resultatives
than in the historians’ (Bentein, 2012, 187–8). The author provides frequencies and
percentages of active vs medio-passive forms in poetry and in prose, but does not test
the statistical significance of such effects, nor their size. Next, the author introduces
the placement of temporal/locational adverbials in the verbal group. However, the role
of this variable is not measured, and only a few examples are given. This is a missed
opportunity to add a quantitative dimension to the analysis. Similarly, the author
argues for the diachronic shift from resultative perfect to anterior perfect. However,
the argumentation stands on underspecified quantitative statements like ‘the active
transitive perfect (with an anterior meaning) is indeed rather uncommon in fifth-century writers’ (Bentein, 2012, 189). Phrases like ‘various examples’, ‘several examples’,
and ‘many cases’ (Bentein, 2012, 190) indicate attempts to argue for the quantitative
relevance of the phenomenon described, but the lack of precise measures undermines
the efficacy of the arguments. In general, the argumentation develops throughout the
article adding more variables to the picture (such as the telicity of the predicates and
the agentivity of the clauses) in a post hoc fashion, and keeping them outside the scope
of the frequency-based analysis.
The practice of post hoc analysis may be coupled with an argumentation strategy
that relies heavily on anecdotal evidence. In this respect, a very instructive example
is again given in Bentein (2012, 192), where four examples are considered sufficient
to show a diachronic development of the periphrastic perfect towards an increased
degree of agentivity.
Let us consider another case of post hoc analysis, this time used in the context of
the presence vs absence of Latin gender doublets over time. Rovai (2012, 120) performs
a quantitative analysis by counting the occurrences of the feminine and neuter forms
in a given set of texts. The quantitative data are thus frequency counts according to
one variable (gender). After presenting the count data, the article contains a detailed
analysis of each of the sixteen lemmas, specifying the declension class, stem, and
number features of the forms found in the texts (Rovai, 2012, 102–3). This is a well-motivated step, because obviously counting the number of occurrences of each gender
form is not sufficient for a good analysis of the phenomenon at hand, and more factors
need to be taken into consideration. It is also a step that we can consider part of a
qualitative analysis, because it goes into the detail of each instance. This analysis is
followed by a summary of the data according to the time variable, showing the cases
where the feminine forms are more ancient than the neuter ones (eight out of sixteen)
and those where the feminine forms do not occur after the archaic age (but with five
exceptions), while the neuter forms are attested in the later centuries. From these
observations the author draws the conclusion that the feminine forms ‘seem to be the
last occurrences of unproductive remnants already in early times’ (Rovai, 2012, 103).
To support this claim, he provides the example of the fossilized form ramentā and
the form caementa, mainly occurring in the conservative context of public law texts.
Therefore, the type of texts where the feminine forms are attested and their fossilized
nature are used as arguments for proving the fact that such forms are more ancient
than the neuter ones. In this case, the main analysis focused on the gender of the
forms and the age of the texts; however, later on, text type and formulaic features are
considered as well, but with respect to only two of the sixteen nouns (rament- and
caement-). It is natural to ask: how many times does each of the sixteen nouns occur
in fossilized forms or in legal texts? Adding such variables to the original analysis
would make the approach systematic and appropriate to the multidimensional nature
of the phenomenon studied.
Another variable considered in a post hoc fashion in the article is related to lexical
connectionism (Rovai, 2012, 106). Limited to a subset of the nouns analysed, this is
used as an argument supporting the hypothesis that ancient feminine forms were later
on reanalysed as thematic neuter forms. According to this argument, some feminine
nouns shared the same semantic field as some second-declension neuter nouns,
and therefore occurred in the same contexts. To support this, the author provides
two examples. However, it is not clear how to quantify the role played by lexical
connectionism in the phenomenon under investigation. How many counter-examples
can be found that contrast with the two examples provided? What is the relevance
of these two examples in the context of all occurrences of the nouns considered? As
the author says in Rovai (2012, 107–11), lexical connectionism cannot account for the
development of ten of the sixteen nouns analysed. For this reason, the author analyses
constructions that are ambiguous between the personal passive and the impersonal
interpretation (e.g. dicitur 'it is said'), and uses the fact that the latter gradually became
more common over time to argue for the original first-declension feminine forms
(such as menda ‘error’) to be reanalysed as second-declension neuter forms (mendum
‘error’). However, no measure of the relevance of this argument is given as to how
many times these ambiguous constructions occur out of all occurrences of the nouns
considered, how many instances are available to support it, and how this account
compares quantitatively to the other factors considered for explaining the reanalysis.
1.3.4 Problems with the research process
We have seen some of the problems affecting the data collection and analysis phases.
Here we want to focus on the research process as a whole, and in section 2.3 we will
summarize the main claims of our proposal in this respect.
Traditionally, access to and automatic processing of large amounts of text have been
difficult, due to technological limitations, as we illustrate later on in this section. These
constraints had an impact on how research was carried out, leading researchers in
historical linguistics to focus on relatively small data sets and to publish only the final
results of their investigations, typically in the form of articles or monographs. As we
have noted, the fact that the analyses themselves were not published meant that they
were not easily reproducible. In spite of the technological advances of the past decades,
the focus on the final results of the analysis and the lack of documentation of the
intermediate phases of the research process are still the norm, both in the sciences and
in the humanities.
Following an increasingly popular line of thought (Candela et al., 2015), in our
proposed framework we argue that more emphasis should be placed on documenting,
publishing, and sharing all phases of the research process, from data collection to
interpretation. In section 2.3 we will outline our suggestions in this area.
New technologies The dramatic increase in digitization projects in the late 1990s
made it possible to encode documents in digital formats, and growing computing
power and storage capacity have allowed computers to hold more and more data at increasingly lower costs.
A number of projects aimed at digitizing historical material have led to large amounts
of data being available to the academic community, such as the Internet Archive,4
Europeana (Bülow and Ahmon, 2011),5 and Project Gutenberg,6 just to mention a few.
This has meant that archives and libraries can make their collections more accessible
and preserve them better. In parallel, the development of disciplines like
computational linguistics and its applied field of natural language processing has made
it possible to analyse large amounts of text automatically.
Let us imagine that we were interested in studying the usage of a as a preposition
(meaning ‘in’ as in We go there twice a week) in English in the seventeenth century.
We would not be able to read all texts written in the seventeenth century and note all
usages of a as a preposition. In the pre-digital era, we would have probably selected
a sample of the texts, checked existing theories, possibly formulated a hypothesis and
checked it against the selected texts. This way, we would be less likely to find patterns
that contradict our intuition, and if we did, we would only be able to collect a very
limited number of examples, and we would have no idea of how common the
evidence contradicting our intuition is.
With the wealth of digitized texts we have at our disposal nowadays (especially for
English), we are able to resort to a much broader evidence basis, and this triggers
new research questions that were not conceivable before. Such increasingly large
text collections cannot be tackled with the so-called ‘close-reading’ approach. On the
4 https://archive.org/index.php.
5 http://www.europeana.eu/portal/.
6 https://www.gutenberg.org.
other hand, simply searching for a in a raw-text collection leads to a high number
of spurious results, including all cases where a is used as a determiner. Even if we
searched for certain patterns (such as instances preceding ‘day’, ‘week’, or ‘year’), we
would only capture a subset of the relevant occurrences. If, instead, we are able to
automatically analyse all texts of interest by part of speech with appropriate natural
language processing (NLP) tools, we can identify the cases where a is a
preposition. This puts us in a position to answer questions like 'how has
the relative frequency of a as a preposition and a as a determiner changed?’ or ‘which
factors might have driven this change?’.
As we have suggested in the example above, the new possibilities offered by digital
technologies have had profound implications on research practice and methodologies.
In addition to historical linguistics, numerous other areas of human knowledge have
witnessed an explosion in the size of the data sets available. Ranging from market
analysis to traffic data, the phenomenon of 'big data' (generally understood as data
sets characterized by large volume, variety, and velocity) has become a reality that
organizations cannot afford to ignore (Mayer-Schönberger and Cukier, 2013). In
this book we argue that historical linguistics has not taken full advantage of this
technological and cultural change, and we suggest a framework which supports the
transition of this field to a new state that is more in harmony with the current scientific
landscape. This transition does not consist only of a set of new techniques applied
to traditional research questions or the ability to carry out traditional analyses on a
larger scale. We believe that this transition allows a whole set of new questions to be
answered.
In their abstract, Bender and Good (2010, 1) summarize the need for linguistics to
scale up its approach as follows:
The preeminent grand challenge facing the field of linguistics is the integration of theories and
analyses from different levels of linguistic structure and aspects of language use to develop
comprehensive models of language. Addressing this challenge will require massive scaling-up in the size of data sets used to develop and test hypotheses in our field as well as new
computational methods, i.e., the deployment of cyberinfrastructure on a grand scale, including
new standards, tools and computational models, as well as requisite culture change. Dealing
with this challenge will allow us to break the barrier of only looking at pieces of languages to
actually being able to build comprehensive models of all languages. This will enable us to answer
questions that current paradigms cannot adequately address, not only transforming Linguistics
but also impacting all fields that have a stake in linguistic analysis.
This extract applies to the whole field of linguistics and the authors identify the main
challenges ahead of linguistics today as consisting of data sharing, collaboration, and
interdisciplinarity, as well as standards and scaling up of data sets used for formulating
and testing hypotheses on language (with the help of NLP tools for the automatic
analysis). They also underline the need for overcoming such challenges to allow higher
goals to be achieved. We fully support this view, and in the present book we will
combine it with further points pertaining specifically to historical linguistics, in the
context of a general methodological framework.
1.3.5 Conceptual difficulties
Why has historical linguistics not yet fully embraced the methodological shift we
outline in this book? There are many reasons for this. Inadequate technical means
and skills and insufficient computing power and storage capacity are certainly concrete
obstacles that have stood in the way of a complete transition of historical linguistics
into the empirical, data-driven quantitative science we argue for in this book, as we
have seen in section 1.3.4. Here we want to briefly discuss other, more serious obstacles
which concern the place of historical linguistics and the humanities in general in the
scientific landscape.
Bod (2014) offers a comprehensive overview of the history of the humanities, while
at the same time taking the opportunity to discuss the defining elements of the
humanities and their relationship with the sciences. The humanities have been defined
as ‘the disciplines that investigate the expressions of the human mind’ (Dilthey,
1991); however, this definition is not unproblematic: it would, for example, apply to
mathematics as well. In fact, Bod chooses a more pragmatic one according to which
the humanities are ‘the disciplines that are taught and studied at humanities faculties’
(Bod, 2014, 2).
From Bod's (2014) overview it is clear that a radical dichotomy between the
humanities and the sciences is not supported by historical evidence. In fact, he finds a
unifying feature shared by scientific and humanistic disciplines in the development of
methodological principles and the search for patterns (Bod, 2014, 355), which in the
case of the humanities focuses on humanistic material (texts, language, art, music, and
so on). The nature of such patterns varies across disciplines, with examples of local
and approximate patterns found both in the humanities and, for example, in biology.
According to Bod (2014, 300):
linguistics is the humanistic field that is ideally suited to the pattern-seeking nomothetic
method, which has indeed become common currency [. . .] Despite its general pattern-seeking
character, present-day linguistics displays a striking lack of unity [. . .] In one cluster we see
the approaches that champion a rule-based, discrete method, whereas in the other cluster an
example-based, gradient method is advocated.
This perspective is in contrast with the view according to which the humanities
are not concerned with finding general patterns, and instead are only concerned
with analysing particular human artefacts, whether they are texts, or manuscripts’
transmission histories, or works of art. Rather than stressing a strict opposition between
scientific and humanistic disciplines, then, it is helpful to appreciate the differences
that exist within the sciences themselves and opt for a more nuanced approach. In this
book we propose a methodological framework that encompasses a large portion of
the practice of historical linguistics (for the scope of our framework, see section
2.1.1), and is concerned with empirical, corpus-driven quantitative approaches. In our
framework, historical linguistics research looks for patterns and tests hypotheses in
historical language data, mainly historical corpora, and builds models of historical
language phenomena.
1.4 Can quantitative historical linguistics cross the chasm?
A fundamental assumption for this book is that historical linguists already work with
technology. The Greek root tekhnē can refer to any acquired or specialized skill. In the
more conventional sense of technology as some invented means by which we achieve
something (books are also a technology), historical linguistics is a technological field,
or at the very least not an atechnological one. Therefore, it is anachronistic to create an
artificial contradiction between historical linguistics on the one hand, and technology
on the other. In the discussion that follows, we will consider 'technology'
as having the broadest possible scope, pointing out that historical linguists already
use 'technologies'. On this conceptualization of technology, a symbolic analytical
framework (such as X-bar annotation) counts as a ‘technology’ just as much as a
software platform like R. This broad use of technology can then be distinguished from
the very advanced and possibly more recent high-tech type of technologies, such as
cutting-edge lab equipment or statistical and computational software or algorithms.
It is probably safe to say that historical linguistics is not typically or commonly
associated with high-tech approaches, and this impression will be further discussed
in later chapters. Above, we indicated that a more high-tech approach could benefit
historical linguistics. Since such an approach is already in use in other branches of
linguistics, it is clearly technically possible to adopt it, and there are examples of
historical linguists who already have made use of state-of-the-art techniques from
computational and corpus linguistics, and applied statistics. What we are more
concerned with here is the possibility for making these approaches mainstream. The
present section deals with the problem of disseminating such a methodology beyond
a small group of linguists who have already adopted it, and making it available to a
much larger share of historical linguists. To do this, we will base our discussion on a
much-touted model of technology adoption in the world of business, the problem of
crossing the chasm (Moore, 1991).
The technology adoption life cycle we have in mind is based on Moore (1991),
and views technology adoption as a process of diffusion. The market is viewed as
consisting of relatively distinct groups who will adopt a new technology or product
for very different reasons. Crucially, the different market segments will act as reference
points for each other, so that a product or a technology can seemingly be transmitted
from one group to the next. As we will see, this highly idealized model can bring
some real insights regarding the adoption of quantitative corpus methods in historical
linguistics, as much as it can inform marketers about how to push the latest high-tech gadgets to consumers. The key to the insight is not in the details of the model
as such, but in the way it throws light on people’s motivation for deciding to make
use of a specific technology. It is this pivotal insight that we think merits the model’s
application to the problem of how to advocate a more widespread adoption of
quantitative corpus methods in historical linguistics, and what the obstacles are.
1.4.1 Who uses new technology?
To better understand why people adopt a technology, the model operates with five
groups of highly idealized technology users. These groups can again be grouped
together into two broad types, namely the early adopters and the mainstream adopters.
Early adopters will typically have very different motivations for picking up a new
technology compared to mainstream users. From this simple observation follows
the conclusion that a technology that appeals to early users may fall flat on its face
when presented to the mainstream. The gap in expectations and requirements that
separates the early adopters from the majority of potential users of the technology
is what constitutes the metaphorical chasm. But before tackling how to cross the
chasm, we will look into what defines the different groups of users. We have adapted
the business-oriented examples from Moore (1991) and situated them in a linguistic
context where needed.
The innovators The innovators are the technology enthusiasts. These are people who
are interested in new technology for its own sake, and they will eagerly pick up
something simply because the new technology appeals to them. They are typically not
deterred by cost, and since they have a high level of technological competence, they are
not put off by prototypes and a lack of formalized user support. If a new technology,
such as a piece of software, requires modification or configuration to function, they
will be able to do this themselves, or find out how to do it via technology discussion
forums on the web. In a more linguistic context, innovators are linguists who introduce
new technologies from other fields, or even create their own. This idealized user type
might remind us of the caricature of the quantitative corpus linguist from Fillmore
(1992) who is mostly concerned about corpora, tools, and corpus frequencies for their
own sake.
The visionaries The next group of users, the visionaries, are also technologically
savvy, but unlike the innovators, they are not primarily interested in the technology
for its own sake. The visionaries have a strategic interest in the technology and are
primarily interested in the subject matter, i.e. historical linguistics. To the visionaries
the exact properties of the new technology are subordinate to what it can help them
achieve in linguistics. Such achievements could be anything from answering a linguistic question that has hitherto been considered too hard to be adequately answered,
to gaining an advantage in the academic job market by mastering a new, trendy
technology. The visionaries and the innovators make up the early adopters, and
the visionaries will have the innovators as a reference group. If the innovators can
demonstrate that, in principle, a new technology can tackle new questions, or answer
old questions on a new scale, then the visionaries are happy to start making use of
that new technology in order to gain an advantage.
The early majority With the visionaries behind us, we leave the early adopter groups
and enter mainstream territory. Here we find the early majority. This group
will constitute a large share of the overall users, and the defining characteristic of the
group, according to Moore (1991), is pragmatism. They will adopt a new technology
when it is both convenient and beneficial for them to do so. They are more interested
in incremental improvements than in huge leaps forward, and will avoid the risks
associated with new technology by finding out how others, typically the visionaries,
have fared with it (Moore, 1991, 31). This means that the early majority are much slower
to adopt new technologies than the early adopters, but they are more likely to stick
to their new technology once it has caught on. Moore (1991, 31) points out that the
early majority is difficult to characterize, but we can think of them as linguists who
have adopted corpus linguistics methods or quantitative tools as a purely pragmatic
measure after seeing that the visionaries have successfully used the same tools to
answer questions in a new way, but only after those tools have reached a sufficient
level of maturity and user-friendliness.
The late majority The next large segment of users of technology are the conservatives.
The conservative users are not concerned about the latest high-tech tools; indeed, they
might be wary of them (Moore, 1991, 34). Conservatives are highly focused on ease
of use, and will stick with their chosen technology for as long as possible. They are
reluctant to change it for another technology, and will do so only when the new
technology has become a virtual standard, is easy to use, and covers all their needs in
the area it is meant to cover. A hypothetical example might be a linguist who adopts a
new technology because it has become so widely adopted that it is a near requirement.
Incentives could be negative, as in the loss of support for an older technology, with the
new one being introduced as the standard; or they could be positive, e.g. some journals
favouring articles that make use of the technology in question.
The sceptics The sceptics, or ‘laggards’, as Moore (1991, 39) also calls them, make up
a small tail end of the technology adoption cycle. The sceptics, as the label implies, do
not adopt new technology and will instead stick to their tried-and-trusted methods, no
matter the cost in lost productivity or perceived coolness. The linguistic
example in this case might be the caricature of the 'armchair' linguist from Fillmore
(1992), who accesses relevant data purely through introspection and intuition.
As Moore (1991, 40) points out, there are important lessons to be learned from this
group, since they are more prone to seeing the flaws in any new technology, and are
sensitive to the hyperbole that inevitably accompanies a new technology making its
way into an academic field. Thus, while we are fundamentally in disagreement with
the sceptics regarding the value of new technology in historical linguistics, we are also
highly interested in their arguments, since we can learn much from them about the
discrepancy between how a new technology is marketed and its actual capabilities. We
remain committed to the idea that introducing new technologies can benefit historical
linguistics, but only to the extent that such technologies are fairly evaluated on their
actual merits, not hyperbole.
1.4.2 One size does not fit all: the chasm
As the characterization of the types of technology users above should make clear,
these are idealizations that do not necessarily fit any one person, and one person
might fit in several idealized groups to some degree. However, each idealized user
type captures very different motivations for taking on a new technology, and the broad
differentiation between early adopters and the mainstream captures the fact that some
of these motivations are more closely aligned than others. The key insight that they
confer is that motivations for adopting new technology differ. Essentially, one size does
not fit all. This means that although a new technology might be outright attractive to
the innovators, the visionaries might fail to see how it can be used in a meaningful
way to answer the linguistic questions they care about. In that case, the technology in
question is likely to remain a niche phenomenon. Alternatively, the technology itself
might appeal to the innovators and at the same time offer the visionaries the strategic
advantage they seek in answering linguistic questions. In this case, the technology
will have fully engaged the early adopters. However, the technology might still not
permeate the mainstream market, because it fails to cross the metaphorical chasm.
The idealized segments of user types are not continuous, hence there are gaps
separating them. However, one gap stands out as larger than the others: the chasm that
separates the early users from the mainstream users. This is illustrated in Figure 1.1,
which shows the relatively larger gap between early adopters and the mainstream as
a noticeable discontinuity. As the figure also makes clear, the chasm separates the
relatively small number of early adopters from the bulk of users who are found in the
mainstream part of the model. Thus, the chasm not only separates qualitatively different users from each other, it also represents a quantitative difference that separates a
minority of users from the vast majority of users.
We consider the chasm model a useful basis for analysing the status of quantitative
approaches to historical linguistics for two reasons. First, it covers technology in the
broad sense (even new analytic frameworks). Thus it provides a way to understand not
only the point that this book is trying to make, but also a tool for understanding the
current situation. Second, the model provides some insights about what can be done
to change the situation, provided that our argument in favour of increasing historical
linguistics’ reliance on high-tech approaches is accepted.
Figure 1.1 Technology adoption life cycle modelled as a normal distribution, based on
Moore (1991, 13). The curve runs from the innovators and early adopters through the
early majority and late majority to the sceptics; the 'chasm' separates the early adopters
from the majority.
Although the chasm model can be used in many ways, we will focus on the key
component, namely the insight about the chasm that divides the early adopters from
the majority of users. To understand why some technologies never go mainstream, we
must consider what prevents them from crossing the chasm, as we will see in the next
section.
1.4.3 Perils of the chasm
There are a number of reasons why a technology might never arrive at the mainstream
segment of users. For instance, it might never reach the chasm at all, because it
fails to catch on among the innovators and visionaries. According to Moore (1991),
this is likely to happen if the vision behind the technology is marketed before the
technology itself is actually viable. For example, the vision of large-scale data-driven
corpus approaches to study language crucially depends on specific types of computer
technology. Without a suitably mature version of this technology, the vision may be
appealing, but the practical problems would prevent it from really catching on. As we
shall see later, we might find parallels to this in historical linguistics.
In the case of quantitative corpus methods, we have a technology that has at the very
least been embraced by early adopters (innovators and visionaries) in historical
linguistics. We argue that it has not yet fully entered the mainstream of users in
historical linguistics, and we will further substantiate this claim in section 1.5. As we
consider some of the potential reasons for this failure, it is useful to look into the
general pitfalls that face a new technology attempting to cross the chasm, adapted from Moore (1991, 41–3):
(i) lack of respect for established expertise and experience;
(ii) forgetting the linguistics and only focusing on the technology;
(iii) lack of concern for established ways of working;
(iv) practical problems such as missing standards, lack of training opportunities,
     or educational practices.
Point (i) prevents a new technology from crossing the chasm because it alienates the
majority of mainstream users. For all the high-tech buzz about disruptive technology,
it is clear that technologies that are able to adapt to existing practices have an advantage
when it comes to crossing the chasm. The majority of users are pragmatic and the disruptive, innovative aspects of a new technology are simply not what appeals to them.
This brings us to point (ii), which is the insight that for the majority of users, such high-tech approaches must present a better option for doing historical linguistics. Without
that perspective, we would hardly expect any attempt to push a new technology to the
majority of mainstream users to succeed. Point (iii) captures the fact that historical
linguists, as any users of technology, are interested in tools that work. Established tools,
such as the qualitative methods of historical comparative linguistics, clearly work.
Thus, the chasm model suggests that innovative technology ought to work best where
the established methods have their weakest points. Finally, point (iv) addresses all the
practical or financial problems associated with a new technology, such as acquiring the
technology itself, learning new skills (and transferring them to students), finding new
ways to integrate the existing technology with the new, establishing standards (e.g. for
annotation), and best practices (e.g. for peer review).
None of these points needs to be fatal for a new technology attempting to enter the
mainstream, but in combination they would seriously impede its chances of reaching
out. In the case of high-tech approaches to historical linguistics, we can easily find
examples of all four problems which taken together would prevent full adoption of
the technology advocated here.
As the following sections and chapters will make clear, our aim is to provide
a roadmap for how these potential problems can be avoided. Specifically, we seek
to address points (i) to (iii) by presenting quantitative historical linguistics in an
accessible and relatively jargon-free manner, with the aim of highlighting how this
particular approach in many ways can exist alongside established ways of working.
We also aim to illustrate how the approach we advocate in some cases will result in
better or perhaps more interesting results, which we believe make the investment in
the technology well worth it from a historical linguistics point of view. The final point
dealing with practical problems lies to some extent outside the scope of the book.
There are a large number of books and courses teaching these skills and aimed at
linguists, and the software we advocate is free. However, we do tackle the problem of
standards and some terminology, thus hoping to also help to ease some of the practical
problems associated with the new technology. A crucial step for achieving these aims
is to have a clear understanding of the situation in historical linguistics, as we do in
section 1.5.
1.5 A historical linguistics meta study
In this section we focus on the level of adoption of quantitative corpus methods
in historical linguistics compared to linguistics in general. We also report on a
quantitative study we have carried out on a selection of publications from existing
literature in historical linguistics.
1.5.1 An empirical baseline
Before looking into the current use of corpora and quantitative methods in historical linguistics, it is worth considering just how quantitative we expect historical
linguistics to be. A reasonable benchmark is the field of linguistics overall. After
all, those linguists working on contemporary languages have a wider spectrum of
methods available to them that are out of reach for most historical linguists: native
speaker intuitions, surveys, recordings, interviews, controlled experiments, and so on.
Given that these methods are all considered acceptable in mainstream linguistics (see
Podesva and Sharma 2014 for an overview), and given that the primary source of data
for most historical linguists is textual, our position is that historical linguistics should
not be using corpora and quantitative methods to a lesser degree than linguistics
overall.
For this benchmark we have relied on data from Sampson (2005 and 2013). These
two studies analyse research articles (excluding editorials and reviews) published in
the journal Language between 1960 and 2011. Sampson wanted to know the extent
to which mainstream linguistics relied on empirical and usage data. To this end he
sampled the volumes of what is arguably the leading linguistics journal, Language,
at regular intervals between 1960 and 2011. As a baseline he chose the journal’s 1950
volume, so as to reflect the period prior to the increased reliance on intuition-based
methods in the 1960s. Sampson devised a three-way classification system to label
articles as ‘empirical’, ‘intuition-based’, or ‘neutral’. The last category was designed to
cover papers that did not readily fit into the first two categories, such as methodological papers or papers dealing with the history of linguistics. To classify articles
he used a number of rules of thumb, including an admittedly arbitrary threshold
of two usage-based or corpus-based examples to classify a paper as evidence-based.
However, he also employed positive criteria for labelling papers as intuition-based,
notably the presence of grammaticality judgements. Thus, while the criteria for being
evidence-based might seem liberal, the presence of additional criteria ensures a
reasonable classification accuracy. For full details about the sampling, procedure, and
criteria, see Sampson (2005).
Although the data in Sampson (2005) indicated a trend towards an increasing
number of evidence-based papers, his main conclusion was cautious, suggesting that
linguistics still had some way to go before empirical scientific methods were fully
accepted in this field. The proportion of evidence-based papers (calculated from the
total number of non-neutral articles) was only growing slowly and showing some signs
of dipping. Picking up the thread from the previous study, Sampson (2013) continued
the exercise and found that what had appeared as a downward trend around 2000
was simply due to fluctuations. The addition of more data confirmed a continued
upward trend since the nadir in the 1970s. Figure 1.2, based on data from Sampson
(2013), illustrates this trend. Since 2005 the proportion of evidence-based studies has
exceeded the 1950 baseline, represented by the horizontal line in the plot.
As Figure 1.2 shows, empirical methods (according to Sampson’s criteria) have
made a remarkable comeback. Already in the 1980s approximately half the research
articles published in Language were based on empirical evidence (in Sampson’s sense
of the word), with a rapid increase setting off in the 1990s.

Figure 1.2 Proportions of empirical studies appearing in the journal Language between
1960 and 2011. The horizontal dotted line represents the baseline of the 1950 volume.
After figure 1 in Sampson (2013).

This is perhaps not surprising, since it coincides with the availability of electronic
corpora around the
same time, as discussed in section 3.5, and Sampson (2005) also makes the link to the
availability of corpora explicitly. It is clear that searchable, electronic corpora foster
empirical research. However, it is all too easy to mistake correlation for causation, and
corpora are only one piece of the puzzle, as shown by the fact that what is published
in Language is still (as we believe it should be) a mixture of empirical papers (in the
sense of Sampson 2005) and other studies. Corpora do not determine what kind of
research is published. Thus, we cannot simply assume that a similar situation is found
in historical linguistics. To complement the picture, we therefore surveyed the field of
historical linguistics.
1.5.2 Quantitative historical research in 2012
Our meta study differs from those in Sampson (2005) and Sampson (2013) in that
we surveyed several journals published in one particular year, as opposed to a single
journal over several decades. We found this to be a reasonable approach, since our aim
was to present a snapshot of the field of historical linguistics as it currently appears.
For the literature survey we carefully read a selection of research articles published
in 2012, taken from six journals. These six journals are clearly a small sample of all
that is published within historical linguistics in a given year, but should nevertheless
provide some insight into the breadth of research currently being published. To make
the effort feasible, we applied a number of exclusion criteria, and focused on the cases
that met all the following criteria:
1. research journals (excluding monographs, yearbooks, and edited books);
2. journals published in English;
3. journals focusing specifically on historical linguistics and/or language change;
4. journals with a general coverage, excluding those focusing on specific languages
   or subfields (like historical pragmatics or syntax);
5. linguistics journals (excluding interdisciplinary ones).
Applying these criteria resulted in the following final list of journals:
• Diachronica
• Folia Linguistica Historica (FLH)
• Journal of Historical Linguistics (JHL)
• Language Dynamics and Change (LDC)
• Language Variation and Change (LVC)
• Transactions of the Philological Society
From these journals we selected only the full-length research papers, thus excluding
book reviews, editorials, and squibs. This left us with sixty-nine papers, a number
which was pruned down to sixty-seven, after removing two papers that were deemed
out of scope. We then read and classified the final set of papers. The data and the code
for this study are available on the GitHub repository https://github.com/gjenset.
For each paper, a number of variables were recorded, including journal, the type of
techniques employed in the analysis, whether or not corpora were used, and whether
or not the paper could be classified as quantitative, qualitative, or neutral. For the
neutral classification we employed the criteria from Sampson (2005). Five papers with
a mainly methodological or overview focus were included in the neutral category,
leaving us with sixty-two papers for the quantitative vs qualitative categories.
Our classification differs from that in Sampson (2005) in that it relies on more
variables, notably recording the use of corpus data, but also other data sources such
as word lists (e.g. in phylogenetic studies). Furthermore, we decided to distinguish
between the source of data (such as corpora vs quoted examples), and the use to which
they were put (e.g. if they were treated quantitatively or qualitatively). This was done to
obtain a classification that was both more fine-grained and easier to operationalize for
historical linguistics than the criteria from Sampson (2005), since none of the papers
relied on native speaker intuitions.
Whether or not a paper was corpus-based was judged based on the discussion
of the data in the paper. We relied on the accepted definition of a corpus as a
machine-readable collection of naturalistic language data aiming at representativity
(with obvious allowances being made for historical data with their gaps and genre
bias). This excluded sources of data such as the World Atlas of Language Structures or
word lists. Furthermore, we required the corpus to be published or at least in principle
accessible to others, which excluded private, purpose-built collections made for a
specific study, but we accepted as corpus-based those studies relying on a subset of
data from a corpus that would otherwise fulfil these criteria.
The distinction between quantitative and qualitative studies was made by assessing
whether or not the conclusion, as presented by the article’s author(s), relied on
quantitative evidence or not. Essentially, we considered whether or not the author(s)
argued along qualitative lines or quantitative ones by looking for phrases that would
imply a quantitative proof of the article’s point, such as ‘x is frequent/infrequent/
statistically correlated with y'. Qualitative papers were thus mostly defined as non-quantitative ones, but we also applied positive criteria. We judged arguments based on
the presence or absence of a feature or phenomenon to be indicative of a qualitative
line of argumentation. Phylogenetic studies, while not typically based on frequency
data, were counted as quantitative, since the underlying assumptions are based on
computing distances between features or clusters of features.
Applying these criteria we found that thirty-seven papers (60 per cent) were
qualitative, while the remaining twenty-five (40 per cent) were quantitative. Table 1.1
lists the number of papers grouped according to whether or not they are corpus-based
and whether they are qualitative or quantitative. A Pearson chi-square test of
independence reveals that there is a statistically significant, medium-to-strong
association between corpus use and the use of quantitative methods (χ²(1) = 12.68,
p = 0.0004, φ = 0.49).

Table 1.1 Classification of sample papers according to whether or not they are
corpus-based, and whether or not they are quantitative (percentages in parentheses)

                        Qualitative    Quantitative    Total
    Not corpus-based    33 (53)        11 (18)         44 (71)
    Corpus-based        4 (6)          14 (23)         18 (29)
    Total               37 (60)        25 (40)         62 (100)

Table 1.2 Classification of papers from Language 2012 according to whether or not
they are corpus-based, and whether or not they are quantitative

                    Not corpus-based    Corpus-based
    Qualitative     3                   0
    Quantitative    6                   6

Perhaps unsurprisingly, corpus-based studies tend to favour a
quantitative approach, although four qualitative corpus studies were also identified,
which illustrates that there is no simple one-to-one relationship between corpus data
and quantitative methods. Of the quantitative studies we see that a little over half
(fourteen out of twenty-five) were corpus-based.
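The test reported above can be reproduced in R directly from the counts in Table 1.1. The following is a minimal sketch using the standard chisq.test() function; computing φ from the uncorrected statistic is our assumption about the convention used, although it does reproduce the reported value.

    # Counts from Table 1.1: rows = not corpus-based / corpus-based,
    # columns = qualitative / quantitative
    counts <- matrix(c(33, 4, 11, 14), nrow = 2,
                     dimnames = list(c("not corpus-based", "corpus-based"),
                                     c("qualitative", "quantitative")))

    # Pearson chi-square test of independence; for 2x2 tables R applies
    # Yates' continuity correction by default, giving X-squared = 12.68
    chisq.test(counts)

    # Effect size: the phi coefficient, here from the uncorrected statistic
    sqrt(chisq.test(counts, correct = FALSE)$statistic / sum(counts))  # ~0.49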
Comparing this to the benchmark from Sampson (2013), it seems that the leading
linguistics journal Language has gone further than historical linguistics in adopting
quantitative methods. Recall that around 80 per cent of papers in the most recent
samples studied by Sampson were classified as empirical, whereas we only found
40 per cent. Some caution is required in the interpretation, since the criteria used
by Sampson differ subtly from ours due to Sampson’s focus on the use of native
speaker intuitions and authentic examples as the minimum criteria for what he terms
empirical.
To investigate how well Sampson’s classification corresponds to our own, we classified the 2012 volume of Language according to our own criteria. This classification,
based on fifteen research articles, yielded a similar result to Sampson’s conclusion for
recent articles in his sample: we deemed twelve out of fifteen (i.e. 80 per cent) to be
quantitative, with six out of fifteen (i.e. 40 per cent) being corpus-based. As Table 1.2
shows, there were no qualitative corpus-based articles. This indicates that, although
the minimum criteria employed by Sampson differ from ours, both sets of
criteria in fact point towards the same conclusion.
A sample of fifteen articles is obviously tiny, and the relative frequencies above are
offered as an easy means of comparison with the sampled historical linguistics articles,
and not as some generalized prediction about linguistics overall. However, we take
the numbers in Table 1.2 to indicate that our classification is at least comparable to
that made by Sampson, even if our exact criteria differ. Although the sample of fifteen
articles from Language 2012 is too small for a direct comparison with the historical
and diachronic linguistics sample via statistical methods, it certainly strengthens the
case for our main claim that historical linguistics journal articles are using quantitative
methods less than the articles in Language.
The claim can be further strengthened if we take into consideration the likely variation, or error margins, around these estimates. Based on the comparisons above, it
seems fair to compare Sampson’s estimate of 80 per cent empirical articles in Language
with the 40 per cent quantitative papers identified in our historical sample, since our
own classification of the 2012 volume of Language showed that the two correspond. It is
clear that 80 per cent is a higher percentage than 40 per cent, but how much should we
really read into this difference? One way to better understand the difference between
the two numbers is to think of them as estimates from an underlying distribution,
where we must account for some measurement error. Put differently: our estimates
might be incorrect, and the two samples might in fact be exaggerating the differences.
We can calculate the range or interval around each of these percentages using the
normal distribution as a model. The range of variation we calculate is a 95 per cent
confidence interval, which indicates the range within which the true proportion in the
underlying population (i.e. articles from the journals) can plausibly be expected to lie,
if our sample is representative. The intervals are listed in Table 1.3.
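The intervals in Table 1.3 can be reproduced with the simple normal-approximation (Wald) formula p ± 1.96 × sqrt(p(1 − p)/n), as the following minimal R sketch (assuming that interval) shows:

    # 95 per cent confidence interval for a proportion, using the normal
    # distribution as a model (Wald interval), reported as percentages
    prop_ci <- function(p, n) {
      se <- sqrt(p * (1 - p) / n)                       # standard error
      round(100 * (p + c(-1, 1) * qnorm(0.975) * se))
    }

    prop_ci(0.80, 15)   # Language 2012:     60 100
    prop_ci(0.40, 62)   # historical sample: 28  52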
If the error margin around our percentages were excessive, we would expect to
see the 95 per cent confidence intervals overlapping, i.e. we would expect to see the
upper range of variation for the historical sample reaching into the range surrounding
the estimate from Language. As the numbers in Table 1.3 show, this is not the case,
however. Even if we have underestimated the percentage of true quantitative papers
in the historical sample, and overestimated the percentage of true quantitative papers
in the Language 2012 sample, we see that the two samples are still likely to be different.
The likely theoretical maximum percentage of quantitative papers in the historical
sample is 52 per cent, whereas the theoretical minimum for Language is 60 per cent,
Table . 95 confidence intervals for the percentage of
quantitative papers in Language 2012 and the historical
sample. Note that the confidence intervals do not overlap
Language
Historical sample
Proportion of
quantitative papers
95 confidence interval
80
40
[60, 100]
[28, 52]
i
i
i
i
i
i
OUP CORRECTED PROOF – FINAL, 11/7/2017, SPi
i
i
A historical linguistics meta study

so Language is clearly different even under this worst-case scenario. Using the same
logic, we can test this formally using the prop.test() function in R. The p-value
returned by the test is extremely small (χ²(1) = 58.6, p < 0.001), which shows that a
sample of sixty-two (the size of the historical sample) is sufficiently large to establish
that the percentage of quantitative papers (40 per cent) is statistically different from
the percentage reported by Sampson (2013) and found in our Language 2012 sample
(80 per cent).
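The exact call is not shown here, but the reported statistic is consistent with a one-sample test of the twenty-five quantitative papers out of sixty-two against a reference value of 80 per cent; the following sketch is a reconstruction under that assumption:

    # 25 quantitative papers out of 62, tested against the 80 per cent rate
    # observed for Language; with Yates' continuity correction (the default)
    # this returns X-squared = 58.55 on 1 degree of freedom
    prop.test(x = 25, n = 62, p = 0.8)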
However, there is another aspect to the question of how similar the two samples
are, namely the percentage of corpus-based papers, which is the same (40 per cent).
Looking only at the proportion of corpus-based papers, we might assume that the
situation in historical linguistics journals is very similar to the one in Language.
However, we also need to consider how dispersed the corpus-based papers are in the
historical sample before attempting to draw further conclusions.
Our aggregated results may hide different research traditions and methodological
conventions within the field of historical and diachronic linguistics. If this were
the case, then we would expect to see some differentiation among types of studies
depending on the journal. In fact, this is what we find if we group the classifications
by journal. We carried out an exploratory multiple correspondence analysis (MCA) to
look for the links between journals, evidence source type (corpus-based or not), and
the quantitative–qualitative distinction. MCA is an exploratory multivariate technique
that seeks to compress the variation in a large set of data into a smaller number of
dimensions that can be visualized in a two-dimensional plot (Greenacre, 2007).
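Such an analysis can be run in R. The sketch below assumes the FactoMineR package (an assumption on our part; the category labels in Figure 1.3 are equally compatible with other MCA implementations) and uses a toy data frame standing in for the real classifications, which are available from the GitHub repository mentioned above.

    # A sketch of the MCA, assuming the FactoMineR package; 'papers' is a
    # toy stand-in with one row per classified article
    library(FactoMineR)

    papers <- data.frame(
      journal      = factor(c("Diachronica", "LVC", "LVC", "JHL", "FLH",
                              "TrPhilSoc")),
      corpus.based = factor(c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)),
      quantitative = factor(c(FALSE, TRUE, TRUE, TRUE, FALSE, FALSE))
    )

    # Multiple correspondence analysis; graph = TRUE draws the
    # two-dimensional category map corresponding to Figure 1.3
    mca <- MCA(papers, graph = TRUE)

    # Percentage of variation captured by each dimension
    mca$eig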
The MCA analysis (shown in Figure 1.3) found that the first dimension (represented
by the horizontal axis) explained virtually all the variation in the data, accounting
for 90.9 per cent of the total variation. This means that the plot can simply be read
from left to right (or right to left), as a continuum where the leftmost journal is
maximally different from the rightmost journal. We can interpret how the data relate
to this first dimension by looking at the projection of the points on the horizontal
axis in Figure 1.3. We can see that the journals can be grouped along a continuum
from non-corpus-based and qualitative, to corpus-based and quantitative. On the
qualitative/non-corpus-based extreme we find Transactions of the Philological Society,
followed by Language Dynamics and Change, Diachronica, Folia Linguistica Historica,
and Journal of Historical Linguistics. The other, i.e. quantitative, end of the continuum
is represented by Language Variation and Change.
The results are hardly surprising for someone familiar with the scope of these
journals, and it is obvious that this picture represents some form of mutual self-selection between journals and scholars: journals attract submissions that are in line
with their explicitly stated profile. However, the conclusion that Language and the
historical linguistics data set as a group are similar in their use of corpora is clearly not
warranted. Instead, what we see in Figure 1.3 is that Language Variation and Change
is (not surprisingly) different from the other historical linguistics journals, and that
it is Language Variation and Change that is primarily associated with both corpora
and quantitative methods.

Figure 1.3 MCA plot of the journals considered for the meta study and their attributes.
Dim 1 is the dimension with the most explanatory value; Dim 2 is the dimension with
the second most explanatory value.

Table 1.4 Classification of sampled papers according to whether or not they are
corpus-based, and whether or not they are quantitative, with LVC left out (percentages
in parentheses)

                    Not corpus-based    Corpus-based
    Qualitative     33 (66)             4 (8)
    Quantitative    6 (12)              7 (14)

We can still observe a continuum among the remaining
historical linguistics journals, reflecting the degree to which we find quantitative and
corpus-based articles in their 2012 publications. However, once Language Variation
and Change is excluded, the numbers change substantially, as Table 1.4 shows.
Once the data from Language Variation and Change are set aside, we find that the
corpus-based studies account for only 22 per cent of the articles, and the quantitative
methods account for only 26 per cent. Without Language Variation and Change, the
sample is down to fifty articles, but the test of equal proportions (prop.test()
in R) tells us that we still have enough data to distinguish the historical sample from
Language.
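Again, a minimal sketch of the corresponding call, under the same assumptions as before (thirteen quantitative papers out of fifty, tested against the 80 per cent reference value):

    # With LVC excluded: 13 quantitative papers out of 50, tested against
    # the 80 per cent rate observed for Language (a reconstruction)
    prop.test(x = 13, n = 50, p = 0.8)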
Having established that the historical sample seems to use quantitative methods
and corpus methods less than what the state-of-the-art journal in general linguistics
(Language) does, we can turn to how this is related to the life-cycle model of adopting
new technology that we introduced in section 1.4. It is worthwhile reiterating the
theoretical proportions accounted for by the different adopter groups in the technology adoption model, with the chasm sitting between the early adopters and the early
majority:
• Early adopters: 16 per cent
• Early majority: 34 per cent (cumulative percentage: 50 per cent)
• Late majority: 34 per cent (cumulative percentage: 84 per cent)
• Sceptics: 16 per cent (cumulative percentage: 100 per cent)
If we make the working assumption that the published articles more or less correspond to the research technologies adopted by their authors, we can compare the
observed proportion of quantitative and corpus-based articles with the theoretical
proportions predicted by the technology adoption model. Of course, this assumption
cannot be taken literally, since mastery of quantitative corpus research techniques
does not preclude using qualitative methods. However, we consider this a useful
approximation, since the sampled journals can select their articles from a larger set of
submissions. Based on this, at least for our purposes, we can assimilate journal authors
to users of research technology.
Comparing the technology adoption model to the data from Language collected by
Sampson (2013), we see that the proportion of articles employing quantitative methods
in that journal (around 80 per cent) is close to what we would see with full adoption
of such technologies by the late majority. In our historical linguistics and language
change sample we found that 40 per cent of the studies were quantitative, which would
suggest that those methods extend to the early majority. However, as we saw above,
this is a little too optimistic due to the effect of papers from Language Variation and
Change. If that journal is excluded, the proportion of quantitative papers drops to 26
per cent, which, although still within the early majority range, suggests a less widespread adoption. If we look more specifically at the intersection of corpus data and quantitative
methods in the sample of historical and diachronic change articles, we see that 23 per
cent are both corpus-based and quantitative (Table 1.1). However, if we again exclude
Language Variation and Change, we see that the percentage drops to 14 (Table 1.4),
which is in the early adopter range, according to the technology adoption model.
This position is corroborated by taking into account the actual quantitative techniques employed by the quantitative articles in our historical sample. Figure 1.4 shows
[Figure 1.4 is a bar chart plotting counts (0–14) of linear models, raw frequencies, percentages, null-hypothesis tests (NHT), trees, and PCA, with separate bars for LVC and the other journals.]
Figure . The number of observations for various quantitative techniques in the selected
studies, for LVC and other journals. Some studies employed more than one of the techniques.
the number of times different quantitative techniques were encountered. Multivariate
techniques such as linear regression models (including Varbrul) are clearly the largest
single group. However, Language Variation and Change is again intimately involved in the details. Most uses of linear models are found in that
journal, as are most uses of null-hypothesis tests. The numbers in Figure 1.4
are small, but sufficient to give us the impression of Language Variation and Change
as a methodological outlier among the quantitative papers in the sample.
Thus, we can conclude that, based on our sample, articles from the journals specializing in historical linguistics and language change that we considered (published in
2012) use quantitative methods to a lesser degree than a relevant comparison journal
in general linguistics (Language). Furthermore, if we exclude Language Variation and
Change, which is biased towards quantitative methods, we see that the percentage
of quantitative papers and corpus-based papers drops even further. If we consider
historical papers that use both quantitative methods and corpora (excluding Language
Variation and Change), the percentage is low enough to be compared to the early
adopter section of the curve from the technology adoption model presented in
section 1.4.
Having described the state of adoption of quantitative corpus methods in historical
linguistics, we are ready to carve out a niche for this technology in historical linguistics,
and we will do this in Chapter 2.
2 Foundations of the framework
2.1 A new framework
In this chapter we outline the foundations of the new methodological framework we
propose. This framework is not meant to replace all existing ways of doing historical
linguistics. Instead, we present a carefully scoped framework for doing certain parts of
historical linguistics that we think would benefit from this approach. Other areas
of historical linguistics might not require the kind of innovation we propose, or they
might require innovations of a different kind. However, we strongly believe that the
approach outlined in this book is the right choice for what we define in the scope
of quantitative historical linguistics. We think many, if not most, historical linguists
would agree with us that corpora and frequencies are potentially very informative
in answering questions in historical linguistics. Our aim is to take this intuition one
step further by proposing principles and guidelines for best practices, essentially an
agreement as to what constitutes quantitative historical linguistics. The next section
addresses the question of scope for the framework.
2.1.1 Scope
We submit that the principles of quantitative historical linguistics pertain to any
branch or part of historical linguistics. These principles are not only meant as guides
to carrying out quantitative research, but also establish a hierarchy of claims about
evidence which also encompasses non-quantitative data. In this respect, quantitative
historical linguistics is just as much a framework for evaluating research as for doing
research.
The basic assumptions and principles laid out in sections 2.1.2 and 2.2 establish a
basis for evaluating and comparing research in historical linguistics, whether quantitative or qualitative. Our main focus is nevertheless the methodological implications
of these assumptions and principles for how to do historical linguistics research.
The principles and guidelines of quantitative historical linguistics can be applied
within any conventional area of historical linguistics, such as phonology, morphology,
syntax, and semantics. In line with Gries (2006b), we argue that corpora serve as the
best source of quantitative evidence in linguistics, and by extension also in historical
linguistics. This might at first glance seem to exclude e.g. historical phonology from
quantitative historical linguistics; however, this is a practical consideration based on
available corpus resources, not an inherent feature of quantitative historical linguistics.
In fact, historical corpora can be illuminating when it comes to questions of sound
change, as demonstrated by recent studies using the Origins of New Zealand English
corpus (Hay et al., 2015; Hay and Foulkes, 2016). Nevertheless, we stress that in
some areas, such as phonology, quantitative historical linguistics is to a large extent
complementary to traditional historical comparative linguistics (see section 2.1.2).
In the following chapters we give examples and case studies from morphology,
syntax, and semantics, with some discussion on phonology. Our focus in this book is
predominantly on corpus linguistics, since corpora constitute the best source of quantitative evidence. However, quantitative does not automatically entail corpus-based.
For instance, historical phylogenetic modelling attempts to establish relationships,
classification, and chronology of languages based on historical data by probabilistic
means (Forster and Renfrew, 2006; Campbell, 2013, 473–4). Phylogenetic models may
employ typological traits, such as Dunn et al. (2005), or lexical data, such as Atkinson
and Gray (2006), or corpus data (Pagel et al., 2007). Quantitative historical linguistics
is deliberately agnostic regarding the use of specific statistical techniques, since such
techniques must reflect the specific research question. The caveat here is that the
choice of statistical technique should reflect best practices in applied statistics and
be sufficiently advanced to tackle the full complexity of the research problems (see
section 2.2.12).
Thus, although we consider corpora the preferred and recommended source of
quantitative evidence, quantitative historical linguistics does not necessarily equate to
corpus linguistics. In addition to the source of data, quantitative historical linguistics
relies on a number of other principles and basic assumptions, which we turn to next.
2.1.2 Basic assumptions
The scope of our framework builds on a number of premises and relies on different
levels of analysis. As in other historical disciplines, different skills are needed for
different stages of the problem-solving process. Historians must judge sources in
light of both the physical documents, their literary genre, and the context of the
source, which might require very different sets of skills, as discussed in chapter 2 of
Carrier (2012). Similarly, quantitative historical linguistics must make a number of
assumptions, some of which rely on other scholarly disciplines. Thus, the approach is
not all-inclusive, but rests on and interacts with other pursuits of knowledge, by means
of the following assumptions. We are indebted to Carrier (2012) for inspiration, but
have reworked the material to match the case of historical linguistics.
The historical linguistic reality is lost Whether we study the history of particular
languages, the relationship between languages and language families over time, or
how language change proceeds in general, we all face the same inescapable problem:
whatever reality we wish to describe, understand, or approximate is irrecoverably lost.
It cannot be directly accessed and hence we can only study it indirectly. Because of
this inaccessibility, our models of the past historical linguistic reality will always be
imperfect. However, they may still be useful. A key question for the present book is to
show what we think constitutes a useful model, and in which circumstances it is useful.
Philological and text-critical research is fundamental No corpus is better than the
quality of what goes into it. Consequently, sound groundwork in terms of philological,
paleographical, and text-critical research must be assumed. Put differently, the proposed approach cannot replace these pursuits of knowledge. Instead, it complements
them and relies on them to critically study the physical manuscripts and philological
and stemmatological context of the text contained in them. Based on such research,
critical editions can be created, and these critical editions can subsequently form the
basis for corpora.
Grammars and dictionaries are indispensable Another level in the research process is
the creation of grammars and dictionaries that make it possible to annotate historical
corpora. Of course, such research is not only a means to create corpora, but it illustrates
the degree to which quantitative historical linguistics rests on other approaches to
historical linguistics. We would again like to emphasize that the present approach is
in many respects complementary to existing approaches, although, as we explain in
Chapter 5, it is desirable to create corpus-driven dictionaries. Reaching back to the
extended notion of technology that we introduced in section 1.4, we see no reason to
replace existing approaches where they work well. As the levels of analysis outlined
here illustrate, several approaches can and must coexist.
Qualitative models We agree with Gries (2006b) that corpora provide one type of
evidence only: quantitative evidence. It follows from this that quantitative claims or
hypotheses are best addressed by corpus evidence. However, not all hypotheses are
quantitative, as we illustrate in section 2.4.3. Qualitative approaches in historical linguistics have more than proved their worth in establishing genealogical relationships
between languages, especially through the study of regular sound correspondences.
Although such qualitative correspondences might be a simplification (i.e. imperfect
models), they might nevertheless be useful and successful. Similarly, the simplifications and generalizations involved in establishing paradigmatic grammatical patterns
might be useful without being a one-to-one correspondence to the lost historical
linguistic reality.
Where we do see the limits of qualitative approaches is in distributional claims,
especially as they relate to claims or hypotheses about syntagmatic patterns. In
the following sections we will elaborate the terminology and basic tenets of our
framework.
2.1.3 Definitions
In the present section we define the core terminology based on which we will
formulate the principles of our framework.
Evidence By evidence we mean facts or properties that can be observed, independently
accessed, or verified by other researchers. Such facts can be pre-theoretical or based
on some hypotheses. A pre-theoretical fact could be the observation that in English
the word the is among the most frequent ones, alongside words such as you. We can
observe facts in light of a hypothesis by assuming grammatical classes of articles and
pronouns that group words together. Based on this hypothesis we can gather facts that
constitute evidence that the classes article and pronoun are among the most frequent
ones in English.
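Such frequency facts are trivially verifiable from machine-readable text. As a toy illustration (the sentence is invented; a real study would of course use a corpus), we can count word tokens in R:

txt <- "the cat saw the dog and you saw the cat before you left"
sort(table(strsplit(txt, " ")[[1]]), decreasing = TRUE)
# 'the' comes out on top; on realistic English text it reliably does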
It follows from this definition that empirical evidence is a pleonasm, since all
evidence conforming to it must be empirical. The definition above explicitly excludes
the intuitions of the researcher as evidence in historical linguistics. Such intuitions
are problematic as evidence for languages where native speakers can judge them; for
extinct languages and language varieties we consider such intuitions inadmissible as
evidence. This position does not imply that intuitions are without value. For instance,
intuitions are undoubtedly valuable in formulating research questions and hypotheses,
and when collecting and evaluating data, as we stress in section 2.4.3. Thus, we think
intuitions can and should play a role in the research process, but we do not consider
them as evidence.
We can distinguish between different types of evidence, namely quantitative evidence
and distributional evidence. Quantitative evidence is based on numerical or probabilistic observation or inference. The quantification must be precise enough to be
independently verifiable. As a consequence, quantifying the observations by means of
e.g. the words many or few will not suffice, since these terms are underspecified.
In the classic linguistic sense, distributional evidence is empirical in that it can be independently verified that certain linguistic units (be they phonemes,
morphemes, or other units) do or do not (tend to) occur in certain contexts. To the
extent that such distributional patterns can be reduced to hard, binary rules (e.g. x
does/does not occur in context y), distributional evidence is qualitative. However,
we also keep the option open that such distributional evidence may be recast in
probabilistic terms.
Finally, we need to consider criteria for strong and weak evidence, since independent verifiability is a necessary but not sufficient criterion for evidence. We can establish
the following hierarchy of evidence:
(i) More is better: a larger sample will yield better evidence than a small one, other
things being equal.
(ii) Clean is better than noisy: clean, accurate, and well-curated data will yield
better evidence than noisy data, i.e. data with (more) errors.
(iii) Direct evidence is better than evidence by proxy: it is better to measure or
observe directly what is being studied, rather than through some proxy or
stand-in for the object of study.
(iv) Evidence that rests on fewer assumptions (be they linguistic, philological, or
mathematical) is preferable, other things being equal.
It is obvious from the list above that some of the statements in the hierarchy will
conflict. For instance, the ‘more is better’ requirement (i) will almost always conflict
to some degree with the requirement for precise, well-curated data (ii). This implies
that the end result will always be some kind of compromise which entails that perfect,
incontrovertible evidence is a goal that can be approximated, but never fully reached.
We believe this is an important consideration, since no numerical method can salvage
bad data. Instead, the realization that all data sets are imperfect to some degree breeds
humility and ushers along the need to explicitly argue for the strength of the evidence,
independently of the strength of the claims being made on the evidence. The next
section deals with claims.
Claim We follow Carrier (2012) in considering anything that is not evidence a claim.
A claim can be small or large in its scope, and it may rest directly on evidence, or it
may rest on other claims. A claim must always rest on evidence, directly or indirectly,
to be valid. The following are examples of different types of scientific claims of variable
complexity and scope:
• Classification: x is an instance of class y.
• Hypothesis: we assume x to be responsible for an observed change y.
• Model interpretation: based on the model w, x is related to y by mechanism z.
• Conclusion: we conclude that x was responsible for bringing about z.
All claims are subject to a number of constraints discussed further in section
2.2. However, we want to stress the distinction between evidence and claims, as it
is fundamental to the subsequent principles. In particular, we consider linguistic
frameworks (sometimes called ‘linguistic theories’) to be series of claims which cannot
be admitted as evidence for other claims. It also implies that such frameworks are
subject to the same standards of evaluation as other claims (see section 2.2).
Truth and probability Following chapter 2 of Carrier (2012) we consider a claim, be
it a classification or a hypothesis, to be a question of truth. However, the truth value of
such a claim, e.g. x belongs to class y, can be stated in categorical or probabilistic terms.
We choose to think of the truth value of claims about the past in probabilistic terms,
since there is always a risk that we are mistaken, even in the most well-established
claims. To be sure, the probability may be vanishingly small, but it is never zero.
Furthermore, such probabilities about the truth value of claims can be interpreted
in at least two ways:
(i) As facts about the world (i.e. physical probabilities).
(ii) As beliefs about the world (i.e. epistemic probabilities).
Carrier (2012) makes the distinction above in the context of an explicitly Bayesian
statistical framework. Bayesian statistics is a branch of statistics that considers probabilities as subjective and as degrees of belief (Bod, 2003, 12). So-called frequentist
statistics will tend to conceptualize probabilities as long-term relative frequencies of
events in the world. The distinction is discussed in more depth in Hájek (2012). For
our purposes it is sufficient to say that when we talk about the probability of some
claim being true, we are talking about the epistemic probability, i.e. how likely we are
to be correct when we claim that x belongs to class y. Although the difference between
(i) and (ii) is sometimes overstated, there is a real difference between claiming that
‘8 out of 10 times in the past verb x belonged to conjugation class y’, versus claiming
that ‘if we assign verb x to conjugation class y, the probability that we are making the
correct classification is 0.8’. The latter statement is explicitly made contingent on our
knowledge and our argumentation in a manner that is different from and better than
the former case.
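To make the epistemic reading concrete, we can give the second statement a simple Bayesian form. The sketch below assumes a flat Beta(1, 1) prior and the 8-out-of-10 record mentioned above; the numbers are purely illustrative:

# Posterior belief that the classification is correct: 8 successes and
# 2 failures with a Beta(1, 1) prior give a Beta(9, 3) posterior.
9 / (9 + 3)                                     # posterior mean, 0.75
qbeta(c(0.025, 0.975), shape1 = 9, shape2 = 3)  # 95% credible interval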
Historical corpus In this book, we are concerned with historical corpora and define
them as any set of machine-readable texts collected systematically from earlier stages of
extant languages or from extinct languages. We follow Gries (2009b, 7–9) in defining
a corpus as a collection of texts with some or all of these characteristics:
(i) Machine-readable: the corpus is stored electronically and can be searched by computer programs (see the sketch after this list).
(ii) Natural language: the corpus consists of authentic instances of language use for
communicative purposes, not texts created for making a corpus.
(iii) Representative: representativity is taken to refer to the language variety being
investigated.
(iv) Balanced: the corpus should ideally reflect the physical probabilities of use
within some language variety.
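Criterion (i) is worth making tangible, since it is what separates corpora from printed editions. A toy illustration in R (the sentence is invented):

line <- "he loveth and she loves"
regmatches(line, gregexpr("[a-z]+eth\\b", line))[[1]]  # returns "loveth"

The same idea scales from one string to a directory of corpus files, which is what makes claims about a corpus independently checkable.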
These characteristics are ideals, even for corpora based on extant languages. To
create a balanced and representative corpus of extinct language varieties is in most
cases not a realistic aim. Therefore we do not take these to be necessary and sufficient
features for what constitutes a (historical) corpus. In fact, we agree with Gries and
Newman (2014) who consider the notion of a ‘corpus’ to be a prototype-based category,
with some corpora being more prototypical than others.
However, the definition above is clearly also too broad, since it extends to other
types of text collections that are not normally considered corpora in the strict sense
(Gries, 2009b, 9). For instance, a text archive containing the writings of a single author
would fulfil criterion (i) and criterion (ii), but could not lay claim to representativity
beyond the author in question. Gries (2009b, 9) argues that in practice the distinction
between corpora and text archives can be diffuse, and for our purposes we take the
representativity criterion in (iii) to be sufficient to rule out many text archives from
the definition.
A more pressing exclusion is perhaps that of collections of examples. As Gries (2009b, 9)
points out, any such collection is prone to errors and omissions, and it is doubtful
what it can be taken to be representative of. For this reason, we follow Gries (2009b)
in excluding example collections from the definition of what constitutes a corpus.
This exclusion also applies to collections based on examples from historical corpora
or quotations, since such text fragments are by definition handpicked and presented
outside the communicative context they can be said to be representative of. Thus,
our definition of a corpus only includes machine-readable, natural, representative
(within limits) text that has been systematically sampled for the purpose of the corpus.
This is not to say that example collections cannot be useful, but we exclude them for
the purpose of terminological clarity. Finally, we exclude word lists (or sememe lists
of cognates based on the Swadesh lists, see section 3.3) since they fall short of the
requirement of texts collected for natural, communicative purposes.
The notion of ‘historical corpus’ can also be problematic, since it is not clear exactly
how historical a corpus needs to be in order to count as ‘historical’. We are inclined to
take a pragmatic approach to this question and consider as historical in the broad sense
any corpus that either covers an extinct language (or language variety), or that covers
a sufficient time span of an extant language variety that it can be used diachronically,
i.e. to detect trends (see also section 2.2).
We would also stress that annotation of corpora and analysis of data are two separate
and independent steps of the research process. The annotation step could for instance
involve enrichment from other linked external resources, not necessarily corpora. The
relationship between data, corpora, and annotation is discussed further in Chapter 4.
Linguistic annotation scheme By linguistic annotation scheme we mean the set of
guidelines that instruct annotators on how to annotate linguistic phenomena occurring in a corpus according to a specific format. Such schemes rely on certain theoretical
assumptions and usually contain a set of categories (tags) that are to be applied to the
corpus text. An example of a linguistic annotation scheme is the set of guidelines for
the annotation of the Latin Dependency Treebank and the Index Thomisticus Treebank
(Bamman et al., 2008). Section 4.3.3 gives a full description of annotation schemes.
In our framework we do not impose constraints on the particular schemes to
be used, as long as they are explicit, and allow the annotators to interpret the text
consistently and map it to the predefined categories.
Hypothesis By hypothesis we mean a claim that can be tested empirically, i.e. through
statistical hypothesis testing on corpus data. Hypotheses can come from previous
research, logical arguments, or intuition, and, as long as they can be tested empirically,
they have a place in our framework.
An example of a hypothesis is the statement ‘there is a statistically significant
difference in the relative distribution of the -(e)th and -(e)s endings of early modern
English verbs by gender of the speaker in corpus X’. This formulation is a technical one,
and there is usually some work involved in going from an under-specified hypothesis,
such as ‘the verbal endings -(e)th and -(e)s in early modern English vary by gender’, to
an operationalized one as in the example above. For a fuller explanation of hypothesis
testing and concrete examples, see section 6.3.4.
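To show what the operationalized formulation amounts to in practice, here is a hedged sketch of one possible test. The counts are invented for illustration and are not drawn from any corpus; a real study would normally prefer a multivariate model (see section 2.2.12):

# Invented counts of -(e)th versus -(e)s by speaker gender in 'corpus X'.
endings <- matrix(c(40, 18,
                    25, 37),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(gender = c("female", "male"),
                                  ending = c("-(e)th", "-(e)s")))
chisq.test(endings)  # tests the operationalized hypothesis above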
Generating hypotheses is one of the main steps in the research process, and it helps
focus the efforts in the analysis. Instead of considering all possible variables that might
remotely affect the phenomenon under study (the so-called strategy of ‘boiling the
ocean’), we can concentrate our attention on those factors that are promising, based
on what we know of the phenomenon. If the hypothesis is generated from data exploration, it can be defined as data-driven, although the process itself of exploring the
data will have relied on some theoretical assumptions, as we explain in section 2.4.3.
Model As we explained in section 1.2.2, by ‘model’ we mean a representation of
a linguistic phenomenon, be it statistical or symbolic. Not all models, however, are
allowed in our framework: only those that derive from hypotheses tested quantitatively
against corpus data or from statistical analysis of corpus data. An example of such a
model is given in Jenset (2013), where the use of the morpheme there in Early English
is modelled as a function of the presence of the verb be followed by an NP, and the
complexity of the sentence. In section 7.3.3 we provide a full description of a model for
historical English verb morphology.
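Schematically, such a model can be written as a logistic regression of the morpheme's presence on the two predictors. This is not Jenset's actual code; the data frame and variable names below are hypothetical, with synthetic data so that the sketch runs:

set.seed(1)
d <- data.frame(there      = rbinom(100, 1, 0.5),          # 'there' present?
                be_np      = factor(rbinom(100, 1, 0.5)),  # be + NP follows?
                complexity = rnorm(100))                   # complexity score
m <- glm(there ~ be_np + complexity, family = binomial, data = d)
summary(m)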
Trend We define a trend as the directional change in the probability of some
linguistic phenomenon x over time that is detectable and verifiable by means of
statistical procedures (Andersen, 1999). In other words, a trend cannot be established
by impressions or intuitions. Furthermore, it can only be counted as a trend if reliable
and appropriate statistical evidence can be presented to back it up.
By ‘trend’, we mean the combination of innovation and spread of a linguistic
phenomenon. For a linguistic change to happen, a speaker (or a group of speakers)
needs to create a new form (innovation), and for this to be more than a nonce
formation, the use of such form needs to spread and be adopted more broadly. For
example, the use of ‘like’ in quoted speech must have been an innovation at first, and
was then adopted by a broader set of speakers until it became established in current
spoken English.
We believe that linguistic innovation is best dealt with probabilistically, although
this does not mean that our framework is incompatible with categorical views of
language innovation. When a new linguistic form is used for the first time (or
according to the terminology of Andersen (1999), it is ‘actualized’ in a speaker’s
usage), it will differ from the old form in some aspect, for example in a semantic
sense, or in a morphological realization. According to a categorical view of language,
this difference will be displayed as an opposition between the ‘old’ and the ‘new’
category; for example, the English affix -ism may be used as a noun as well (see ‘ism’
in the sentence We will talk about the isms of the 20th century). The innovative usage
consists of the nominal use and the opposition is between the two part-of-speech
categories. According to a non-categorical view of language, the innovative form could
for example be characterized by a 'fuzzy' nature which can be described as more noun-like than affix-like.
We argue that both the categorical view and the non-categorical view are
compatible with a probabilistic modelling of the linguistic innovation. In the categorical view, we can describe the innovative use of ‘ism’ in terms of a low probability
of the affix category and a higher probability of the noun category. In the
non-categorical view, we can describe this innovation as change along a continuum,
so that, for example, the innovative form ‘ism’ is found in contexts more similar to
those of 'theory' (e.g. following a determiner) than those of '-ian' (e.g. as a morpheme
following ‘Darwin’).
On the other hand, the spread of new linguistic behaviours among speakers
through genres, linguistic environments, and social contexts, is a time-dependent
phenomenon. The old form and the innovative form will coexist for a period of time,
thus realizing synchronic variation, and there will be a more or less rapid adoption
of the new form by the language communities. This can be described as a shift in
probabilities, and it is clear that language spread should be dealt with in probabilistic
terms. Quantitative multivariate analysis of corpus data allows us to measure the
evidence for the spread of a linguistic phenomenon, and the effect of different variables
on it. This way, it is in principle possible to model the way an innovation is increasingly
used by a community; in section 7.3 we will provide a concrete example of this in a
study on English verb morphology.
2.2 Principles
Figure 2.1 shows the diagram of the research process in our framework, and is based
on the entities defined in section 2.1.3 and the principles illustrated in that section. As
shown in Figure 2.1, the aim of quantitative historical linguistics is to arrive at models
of language that are quantitatively derived from evidence. Such a definition of 'model'
includes statistical models and their linguistic interpretation. Section 7.2 will outline
the steps of this process in a linear way, and we will describe these steps in more detail
throughout this book.
In the present section we describe the basic principles of quantitative historical
linguistics, which are valid within the scope defined above. The principles are inspired
by, and in some cases adapted from, chapter 2 of Carrier (2012), a work advocating the
use of statistical methods in history. However much history and historical linguistics
[Figure 2.1 is a diagram with boxes for the historical linguistic reality, primary sources (documents etc.), secondary sources (grammars, dictionaries, etc.), linguistic annotation schemes*, intuition, examples, hypotheses, annotated corpora*, models*, and quantitative distributional evidence*, connected by arrows.]
Figure . Main elements of our framework for quantitative historical linguistics. Boxes are
entities, arrows are actions or processes; asterisks mark terms for which we use our definitions
(see section 2.1.3). The dashed line from models to the (lost) historical linguistic reality implies
an approximation.
have in common, the differences are nevertheless sufficiently great to warrant a
reframing of the issues to fit into the context of historical linguistics.
The adoption of those principles allows for improved communication between
scholars regarding claims and evidence, which in turn will make it easier to resolve
contentious claims by means of empirical evidence. However, such a resolution is only
possible to the extent that historical linguists agree with and adhere to the principles
presented below. For this reason, the first issue deals with the question of consensus
in the historical linguistics community.
.. Principle : Consensus
To achieve the aim of quantitative historical linguistics research, it is necessary to
reach consensus among those scholars who accept the premises of quantitative historical
linguistics.
The basic premise for all the following principles is that the aim, indeed the duty,
of historical linguists is to seek consensus. However, consensus is only valuable to the
extent that it reflects an empirical evidence base. We therefore limit the consensus
to those scholars who accept the basic premises of empirical argumentation, as it is
grounded in the concepts of evidence and claims (section 2.1.3). Since we consider
these principles fundamental to empirical reasoning about historical linguistics, no
consensus would be possible without them, even in theory. This means that the
effort of creating consensus without a common ground of fundamental principles is
probably going to be futile.
The requirement of seeking consensus might seem overly optimistic and even a
negative constraint to the development of the field. However, all serious scholars
already abide by the consensus principle to some limited extent, by submitting their
research articles to a scientific peer review. This particular type of consensus does
not necessarily extend beyond the peer reviewers, the editors, and the scope of
journals, but the principle remains the same: all research is ultimately an attempt to
influence others by making claims grounded in some form of evidence. Without the
requirement to seek consensus, any claim could in principle be made and defended
by resorting to some private standard of evidence and argumentation. In contrast, the
consensus requirement provides an impetus to follow the principles of quantitative
historical linguistics as closely as possible, since this will help to persuade other
scholars of the validity of the claims being made. However, the principle cannot be
understood as an injunction to achieve consensus, only to seek it, since consensus by
definition must involve more than one researcher.
A hypothetical objection to the principle might be that it constrains creativity and
development of the field. However, we view the matter differently. We agree with the
argument made in chapter 2 of Carrier (2012) that when we have no direct access to
historical realities, our best approximation must be the consensus among the experts
in the field, in this case historical linguists. Naturally, experts may be mistaken, but
on the whole we must assume that their beliefs and claims are accurate, given the
current state of knowledge in the field. This final refinement of the point is crucial,
since the consensus by definition must rest on what has been discovered and argued
up until the present. Hence, new claims will always be in a position to challenge the
consensus. But to challenge the consensus is to seek its amendment. When facing a
new, possibly controversial, claim that goes against the current consensus, the experts
in the field must evaluate the claim according to the empirical principles. If the claim
is solid enough, consensus will follow. Similarly, any claim might have gaps that
require fixing before other historical linguists will accept it. After such modifications,
the claim might be strong enough to alter the reigning consensus. We consider those
claims that are too weak to persuade other experts in the field to be of no interest.
If a creative, controversial claim cannot persuade those who are experts in the field,
then it is questionable whether it can bring the field forward. Thus, we do not consider
a plurality of claims regarding historical linguistics to be an aim in itself, but only a
means of providing suggestions for altering the current consensus.
.. Principle : Conclusions
All conclusions in quantitative historical linguistics must follow logically from shared
assumptions and evidence available to the historical linguistics community.
Following the definition of evidence in section 2.1.3, a piece of research is empirical if
it relies on empirical evidence that is observable and verifiable independently of the researcher and her attitudes and beliefs. Above all, intuitions (even those stemming from
long, in-depth study of the material under scrutiny) are inadmissible as evidence. This
principle is supported by the previous one: beliefs and intuitions are not independently
verifiable, hence they do not form a good basis on which to build consensus. This is
not to say that intuitions do not belong in historical linguistics research; quite the
contrary. Such intuitions can be a very valuable starting point for insightful research.
However, the intuitions can never be more than a starting point, or guidance, for
creating hypotheses and deriving empirically testable predictions from them.
.. Principle : Almost any claim is possible
Every claim has a non-zero probability of being true, unless it is logically or physically
impossible.
We consider this insight from Carrier (2012) to be a key principle when evaluating
claims regarding historical linguistics. Carrier (2012) points out that almost any claim
about the past has some probability of truth to it, with the exception of claims that are
logically impossible (such as ‘Julius Caesar was murdered on the Ides of March and
lived happily ever after’) or physically impossible (such as ‘Julius Caesar moved his
army from Gaul into Italy on a route that took them via the Moon’). We consider this
statement equally applicable to historical linguistics as to history.
Another way of phrasing the principle is that identifying sufficient conditions is
not enough to establish a strong claim. A very similar point is made by Beavers and
Sells (2014) who argue that since linguistic data can support many conclusions, it is
not enough to find data that support the claims we wish to make. It is also necessary
to consider all the other claims those same data might support, that is, what is the
evidence against our chosen interpretation of the data.
The take-home message in both cases is that the set of all possible claims (i.e. physically and logically possible) contains both profitable and misleading claims, but both
these types of claims can be supported by historical linguistic data, albeit to different
degrees. It follows from this principle that a claim that ‘fits the data’ in historical
linguistics is near worthless unless it is further substantiated. Such a claim could be a
very strong one, or it might have an associated probability so small that it would be indistinguishable from zero for all practical purposes. The subsequent section discusses
the problem of ranking claims that all have a non-zero probability of being true.
.. Principle : Some claims are stronger than others
There is a hierarchy of claims from weakest to strongest.
It follows from principle 3 that all possible claims in historical linguistics have
some probability of being true, ranging from completely implausible to extremely
well attested and likely. In other words, there exists a hierarchy of claims where some
claims stand above others. For instance, the claim by Emonds and Faarlund (2014) that
Old English simply died out and was replaced by a variant of Norse (making modern
English genealogically North Germanic) has very little support in the data and is hence
an extremely weak claim, as demonstrated by Bech and Walkden (2016). Since the
claim that Middle English evolved from Old English (albeit from other dialects than
the dominant West Saxon variety of Old English) is based on much stronger evidence,
it takes precedence over the replacement argument. Essentially, not all claims are created
equal, and even if some kind of historical linguistic data can be made to fit a claim,
this is in itself unsurprising and constitutes an insufficient ground for accepting that
claim. The key question then becomes what distinguishes a weak claim from a strong
one. The following principle will dig further into the problem of how to rank claims.
.. Principle : Strong claims require strong evidence
The strength of any claim is always proportional to the strength of evidence supporting it.
Section 2.1.3 dealt with how we can judge the strength of the evidence. Here we
spell out the relationship this has to claims and their strength. Carrier (2012) argues,
correctly in our view, that evidence based only on a small number of examples is very
weak. Furthermore, when a claim is a generalization, its supporting evidence must
consist of more than one example. That is, the evidence for any generalization that
goes beyond the observed piece of data must consist of more than one observation.
Such arguments follow from the principle that the strength of a claim is proportional
to the evidence backing it up.
Since no claim is stronger than the evidence supporting it, the nature of the
supporting evidence is key. Other things being equal, more evidence implies stronger
support for a claim, as we stated in section 2.1.3. However, the principle is not only
about finding strong evidence. The opposite also applies: if your evidence is weak, your
claims ought to reflect this fact. In some cases a weak claim is all that can be supported
by a body of evidence. In this situation, we feel that the adage ‘better an approximate
answer to an exact question, than an exact answer to an approximate question’ applies.
That is, if the combination of a research question and some data only allows a weak or
tentative conclusion, then this should be explicitly acknowledged without attempts to
overstate the results.
In historical linguistics this means that in some cases certain generalizations might
be impossible. As the statistician John Tukey phrased it, ‘the combination of some data
and an aching desire for an answer does not ensure that a reasonable answer can be
extracted from a given body of data’ (Tukey, 1977). This applies to historical linguistics
as much as to statistics. The example from section 2.2.4 about the typological status of
English within the Germanic language family is also relevant here. Since the evidence
provided by Emonds and Faarlund (2014) is narrowly focused on one area (syntax)
and is also very sparse, the evidence is clearly not proportional to the claims being
made, as demonstrated by Bech and Walkden (2016).
.. Principle : Possibly does not entail probably
The inference from ‘possibly’ to ‘probably’ is not logically valid.
In section 2.2.3 we argued that merely fitting the data is not sufficient for accepting
a claim. A special case that deserves its own principle is the logical fallacy that Carrier
(2012) describes as ‘possibly, therefore probably’. The principle can be made clearer
when recast as probabilities, where the notation P(x) means ‘probability of x’ for some
claim x:
• If P(x) > 0, x is possible.
• If P(x) is close to 1, x is probable.
• If P(x) = 0.01, x is possible but not probable.
• If P(x) = 0.99, x is both possible and probable.
Put differently, all probable claims are possible, but not all possible claims are
probable. The example-based approach described in section 1.3.1 should only be
associated with claims about events being possible or not; in order to state anything
about their probability value, quantitative data and systematic analysis are required.
We turn again to the claim that Old English died out and that Middle English
descended from Norse. We certainly agree with Emonds and Faarlund (2014) that this
is a possible scenario. The process of languages falling out of use and being substituted
by others, possibly with some substrate influence from the language falling out of use,
is clearly possible. However, since all logically and physically possible claims have a
non-zero probability of being true, it is trivial to state that Old English might have
died out and been replaced with a variant of Norse.
The possible does not automatically entail the probable because probable claims are
only a subset of all the possible claims. Thus the argument, ‘this might have been the
case, therefore it was probably the case’ is logically invalid without further supporting
evidence. It also follows from section 2.2.3 that the set of possible claims is very,
very large since it is constrained only by the physically and logically impossible. This
in turn raises the question: in the absence of stronger corroborating evidence, why
privilege one particular possible claim out of the much larger set of other possible
claims? To present a possible claim as probable without sufficient evidence, whether by
arbitrariness or sheer wishful thinking on the part of the researcher, does not support
that claim. In particular, such an inference cannot adequately support a conclusion, as
discussed in the next section.
.. Principle : The weakest link
The conclusion is only as strong as the weakest premise it builds on.
This principle entails that any conclusion will be evaluated by its weakest point, not
its strongest. This may sound counter-intuitive, because surely we want the strongest
evidence to inform our claims. The reason can be traced back to the principle that
any claim that is physically or logically possible has a non-zero probability of being
true (section 2.2.3). The great number of possible interpretations of evidence from
the linguistic past thus enables us to find individual strong arguments in favour of a
conclusion. However, the conclusion might nevertheless be undermined by a number
of weak premises.
.. Principle : Spell out quantities
Implicitly quantitative claims are still quantitative and require quantitative evidence.
One of the key aims of the principles outlined here is to enable a fair evaluation
of claims about historical linguistics in terms of quantities and frequencies. However,
such an evaluation is only possible when the quantification is spelled out. Terms such
as those in the following list are ambiguous and should be avoided when presenting
evidence in historical corpus linguistics: few, little, rare, scarce, uncommon, infrequent,
some, common, frequent, normal, recurrent, numerous, many, much.
The list is obviously not exhaustive, but it illustrates words that represent quantities and frequencies in a subjective and coarse-grained manner. They are subjective
because what counts as few or many depends on the circumstances and the person
doing the evaluation. They are coarse-grained because it is difficult to compare the
quantities they designate. Is an 'uncommon' phenomenon as scarce as something
that is ‘infrequent’? Or is it perhaps more common? Or less? Such quantification is
hard to evaluate and verify independently and hence violates the fundamental requirement that the evidence for a claim must be objectively accessible to all researchers in
the field. This is not to say that such words cannot be used, but they render an argument
less powerful by making it opaque.
.. Principle : Trends should be modelled probabilistically
Quantitative historical linguistics can rely on different types of evidence, but only
quantitative evidence can serve as evidence for trends.
In section 2.1.3, we defined trends in explicitly probabilistic terms. The approach
defined here is deliberately agnostic about whether language is inherently based on
probabilities, or categorical rules, or some combination of the two. However, a trend
should be modelled as a probabilistic, quantitative entity since it denotes a directed
shift in variation over time. Sample sizes may vary at different points along a time line,
which makes statistical tools the correct choice for identifying and evaluating trends.
Linked to this point is the question of adequate statistical power. Thus, having three
points connected by a straight line pointing upwards does not qualify as a trend unless
this line can be shown to be both statistically significant and a good fit to the data.
Like any claim in historical linguistics, a proposed trend is subject to the principle in
section 2.2.3 that any claim has a greater than zero probability of being true, provided
that the claim is not logically or physically impossible. Any claim about a possible
trend is consequently liable to a number of errors: the trend might not be a trend at all,
but merely random variation. Or the claimed trend might represent wishful thinking
or biased attention on the part of the researcher (as pointed out by Kroeber and
Chrétien 1937, 97); the data for the claimed trend might give the appearance of a trend
due to inadequate sampling procedures, and so on. In other words, the requirement
that a trend be verified by statistical means is an insurance against overstating the case
beyond what the data can back up.
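As a minimal sketch of such verification, consider the following R example with invented counts of an innovative form against a conservative competitor in four periods. A trend claim would rest on the slope being statistically significant and the model fitting the data adequately:

period       <- c(1400, 1450, 1500, 1550)
innovative   <- c(5, 12, 30, 55)    # invented counts, for illustration only
conservative <- c(95, 88, 70, 45)
m <- glm(cbind(innovative, conservative) ~ period, family = binomial)
summary(m)  # a significant positive slope supports, but does not prove, a trend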
.. Principle : Corpora are the prime source of quantitative evidence
Corpora are the optimal sources of quantitative evidence in quantitative historical
linguistics.
Above we defined corpora as sources of quantitative data (section 2.1.2). We also
defined quantitative variation (including variation implicitly stated by means of words
like much or few) as subject to quantitative evidence (principles 8 and 9). However, we
reserve a separate principle for the statement that quantitative evidence in historical
linguistics should come from corpora. This is not to say that quantitative evidence
cannot come from other sources; there are clearly other possible sources for quantitative evidence (see section 2.1.3). However, when available, corpora should always be
the preferred source of quantitative evidence for a number of reasons:
(i) Corpora (as defined in section 2.1.3) have a better claim to be representative
than other text collections, other things being equal.
(ii) Publicly available corpora allow reproducibility, to the extent that they are
available to the research community.
Thus, following principle 4 (section 2.2.4), we consider a claim based on quantitative
evidence coming from corpora stronger than a claim that is not based on corpus evidence, as long as the two claims are equally capable of accounting for the relevant facts.
.. Principle : The crud factor
Language is multivariate and should be studied as such.
For the purpose of historical linguistic research, we consider language, and language
use, to be an inherently multivariate object of study. Bayley (2002, 118) explains
a similar ‘principle of multiple causes’ as the need to include multiple potentially
explanatory factors in an analysis, since it is likely that a single factor can explain only
some of the observed variation in the data. In other words, it is essential to be open
to a potentially large number of explanatory variables for any linguistic phenomenon.
This principle does not imply that this is inherent in language, only in language as
an object of study. From this principle it follows that a large number of potential
explanatory variables should be considered. This is consonant with principle 3 (section
2.2.3), since finding a single variable that is correlated with the phenomenon being
studied is trivial. The real aim in quantitative historical linguistics is to find one or
more variables that are more strongly correlated with the phenomenon being studied,
compared to other potential variables. In essence, this guards against spuriously positive
results, since we aim to build counter-arguments into the quantitative model. Doing
so protects against what Meehl (1990, 123–7) calls the ‘crud factor’, or ‘soft correlation
noise’, since many factors involved in language will be correlated with each other
at some level. Stacking them up against each other helps separate the wheat from
the chaff.
.. Principle : Mind your stats
Quantitative analyses of language data must adhere to best practices in applied statistics.
From principle 11 it follows that statistical methods are required to distinguish
the more important correlations from the less important ones. Bayley (2002, 118)
describes this as the ‘principle of quantitative modeling’, which implies calculating
likelihoods for linguistic forms given context features. This implies that multivariate
statistical methods, such as regression models or dimensionality reduction techniques,
are typically required. For instance, a single multivariate regression model with all
relevant variables is superior to a series of individual null-hypothesis tests, since the
latter do not take the simultaneous effect of all the other variables into account and are
vulnerable to false positive results by testing the same data several times over. Testing
the same data over and over again with a null-hypothesis test such as Pearson's Chi-square is a little like having several attempts to hit the bull's eye in darts: more attempts
make it more likely to get a statistically significant result, but the approach artificially
inflates the strength of the claim. Furthermore, as Gelman and Loken (2014) make
clear, null-hypothesis tests are often under-specified, a point also raised by Baayen
(2008, 236), which means that in practice they can often be supported or refuted
by the data in more than one way. Furthermore, comparing null-hypothesis tests
is conceptually difficult. Although the p-values may look comparable, they actually
represent a series of alternative hypotheses, each of which has been compared against a
null-hypothesis (Gelman and Loken, 2014). This is not to say that we proscribe the use
of simple null-hypothesis tests in quantitative historical linguistics; we merely consider
them to provide weaker evidence than multivariate techniques in those cases where a
multivariate approach is possible and gainful.
Similarly, the direct interpretation of raw counts, or what Stefanowitsch (2005) calls
‘the raw frequency fallacy’, constitutes the weakest form of quantitative evidence, since
such numbers are void of context. Without a frame of reference, it is impossible to
judge objectively (see the requirement that all evidence be accessible to all linguists)
whether an integer is large or small. Also, the direct interpretation of proportions or
percentages needs to be done with care. Proportions can be misleading since they
can inflate small quantities unless accompanied by the actual number of observed instances. Furthermore, the proportion constitutes a point estimate, i.e. a single number
that in reality comes with an error margin attached. When we perform a formalized
null-hypothesis test and the test compares observed and expected frequencies, we
account for such error margins. If we interpret proportion data directly, they should
be accompanied by a confidence interval, as we exemplify in section 1.5.
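For example (a minimal sketch using the counts from our own meta-study), the 26 per cent of quantitative articles in the sample without Language Variation and Change should travel with its uncertainty:

prop.test(13, 50)$conf.int  # 95% confidence interval, roughly 0.15 to 0.41

Reporting '26 per cent (13/50, 95% CI approximately 0.15-0.41)' is far more informative than the bare percentage.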
2.3 Best practices and research infrastructure
In section 1.3.4 we highlighted some of the problems with common practices in the
historical linguistics research process. In this section we will outline our proposed
solutions, which are meant to accompany the principles outlined above and create the
context for an infrastructure that facilitates and optimizes research achievements in
quantitative historical linguistics.
2.3.1 Divide and conquer: reproducible research
As we will see in more detail in Chapter 5, documentation and sharing of processes and
data are at the core of our framework. Transparency in the research process facilitates
reproducibility of the research results, as well as their generalization to other data sets,
thus advancing the field itself. Moreover, if the process is transparent, it is easier to
credit all the people who participated in it, including those responsible for gathering
and cleaning the data, and building language resources like corpora and lexicons, an
aspect that is still undervalued in the historical linguistics community. Replicability
is also aligned with principle 1 (section 2.2.1), which stresses the importance of
consensus in quantitative historical linguistics.
Transparency (and therefore replicability and reproducibility) is achieved by documenting the data and phases of the research process and by making them available.
In addition to being transparent about the research methodology used, corpora, data
sets, metadata, and computer code1 should be made publicly available whenever
possible and appropriate. In the case of historical data, questions of privacy are rarely
a problem, so compared to other fields of study historical linguistics is in a fortunate
position in this respect.
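In line with footnote 1 on scriptable tools and open formats, two small habits already go a long way; the data frame and file name here are of course hypothetical:

results <- data.frame(ending = c("-(e)th", "-(e)s"), count = c(58, 62))
write.csv(results, "results.csv", row.names = FALSE)  # open, tool-agnostic format
sessionInfo()  # records the R version and packages behind the analysis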
Once we have taken all steps to ensure transparent and reproducible results, and
have made the data openly available, the research practice can move beyond the scope
of an individual study to that of a larger, collaborative effort. Each study may still
concentrate on just one aspect of the process (design of a resource or generalization of
previous results, for example), while keeping a view to documenting and making the
tools and data sets available to the community. Efforts in this direction have already
had some success, for example in the case of the Perseus Digital Library2 and the
1 Generally speaking, using code/scriptable tools like Python and open formats like csv instead of tools
with graphical user interfaces and proprietary formats like Excel is essential for reproducibility.
2 http://www.perseus.tufts.edu/hopper/.
Open Greek and Latin Project.3 The Open Science Framework (https://osf.io/) offers
a platform for managing research projects in an open way, facilitating reproducibility
and data sharing; see the page https://osf.io/ek6pt/ for one such project dealing with
Latin usage in the Croatian author Paulus Ritter Vitezović. We believe that such
an approach will allow the field of historical linguistics to move forward in a less
fragmented way than it has so far.
2.3.2 Language resource standards and collaboration
Since the time of the comparative philologists, historical linguists have often resorted
to gathering their own data. Although this is sometimes warranted (and even the only
available option), historical linguistics as a scientific endeavour would benefit from a
greater reliance on reuse of existing resources, and on the creation of publicly available
standardized corpora and resources, whenever reuse is not an option.
Electronic resources like lexical databases (WordNet, FrameNet, valency lexicons)
provide valuable information complementary to corpora. Such resources are still not
widely used in historical linguistics, partly for epistemological reasons and partly for
technical reasons, as argued by Geeraerts (2006). Our framework provides historical
linguists with the methodological scaffolding to incorporate computational resources
into their research practice.
As we will argue more extensively in Chapter 5, the design, creation, and maintenance of language resources should be a crucial component of the work of historical
linguists, and in order to maximize their reuse and compatibility, language resources
should be developed in the spirit of linked open data (Freitas and Curry, 2012), when
possible.
Reusing resources means that conclusions and results can more easily be replicated
and tested by other researchers, which is a crucial point of our framework (see section
2.3.1). Moreover, if a study on a specific linguistic phenomenon is carried out on a
resource built in an ad hoc fashion, there is always the lingering doubt that the results
were influenced by the choice of data. Conversely, if the results are obtained from a
pre-existing resource or corpus, they are less likely to have been influenced by factors
directly related to the research in question. A greater reliance on reuse gives an impetus
to creating corpora also for less-resourced languages (McGillivray, 2013).
For all its benefits, gathering and annotating data is costly in terms of time and
resources. The labour-intensive tasks involved in creating language resources often
involve technical expertise which is not normally part of standard linguistics training;
therefore, the development of language resources is an interdisciplinary team effort
and is at the core of the collaborative approach to research that we propose. If a group
creates a resource that is well documented, has a standard format, and is compatible
3 http://www.dh.uni-leipzig.de/wo/projects/open-greek-and-latin-project/.
and interoperable with other resources (for example a historical lexicon of named entities), this makes it possible for another group to build on this work and either
enrich the resource itself, or use its data (alone or in combination with data from
other sources) for further analyses. If such analyses are well documented, they will
be more likely to be reproduced by others, who can check the validity of the results,
generalize them further to other data sets, or add more components to the analysis.
However, researchers do not currently have sufficient incentives to spend time on
building a corpus or other language resources. We believe that the publication of such
resources ought to carry substantial weight in terms of academic merit, as much as the
publication of studies carried out on them.
2.3.3 Reproducibility in historical linguistics research
In sections 1.3.1 and 1.3.3 we considered the major weaknesses concerning certain
research practices in historical linguistics. In this section we will broaden the perspective to cover the issues of transparency, replicability, and reproducibility and their
impact on the field of historical linguistics in general.
Section 1.3.1 dealt with the negative effects of the lack of transparency in the evidence
sources employed in historical linguistics research. As a matter of fact, the issue of
transparency concerns all phases of the research process, from data collection to
annotation and analysis. Making all phases of the research process more transparent
has a number of benefits. First, it makes it possible to replicate the research results
obtained by a study in the context of other studies dealing with the same data,
method, and research questions. This increases the chances of detecting omissions
and correcting errors. Second, transparency forms the basis for generalizing the
research results, thus advancing the field itself: this generalization can involve applying
the same method to a different data set or extending the approach. For example, a
researcher can test alternative approaches based on the data from a reproducible piece
of research. Third, transparency ensures that the work involved in building a data set
(for example a historical language resource) is visible, and therefore acknowledged
and credited appropriately. Considering the emphasis on publishing research articles that report on analyses of particular phenomena or formulations of theories, this level of transparency about the data behind the analysis would encourage more researchers
to dedicate their time to building language resources, which play an essential role in
advancing the field.
The issue of lack of transparency is, of course, not unique to (historical) linguistics,
and has very negative consequences that in some fields like medicine span well beyond
the academic community to impact directly people’s lives.4 Although it does not affect
4 For an example of how current this issue is in medicine and psychology, see https://www.nature.com/news/first-results-from-psychology-s-largest-reproducibility-test-.
human lives, this issue is current in linguistics research as well, as demonstrated,
for example, by the recent special issue of the journal Language Resources and
Evaluation dedicated to the replicability and reproducibility of language technology
experiments.
Transparency (and therefore replicability and reproducibility) is achieved in two
main steps: by describing the data and phases of the research process, and by making
such data and processes available, which we will discuss in more detail in the next
sections.
Documentation As we saw in section 1.3.1, research papers in historical linguistics
dedicate a lot of space and attention to the theoretical framework(s) and the final
results of the research, as well as to linguistic examples, either as illustration of the
phenomenon studied or as the evidence base of the analysis. However, little attention
is usually dedicated to the following aspects, in spite of the crucial role they play in the
research process: how the data were collected, how the hypotheses were formulated
and tested, which variables were measured (if any), how the analysis was carried out.
For example, Bentein (2012) presents the details of his data collection criteria in the
footnotes, and describes the corpus used in four lines (Bentein, 2012, 175).
As there are no agreed standards on how to build and annotate a corpus, how
to carry out the analysis, and how to report the results, we argue that the following
guidelines would significantly increase the level of transparency in historical linguistics research.
• Include references to the resources (including corpora) used, with exact locations
and URL links.
• Specify the size of the corpus or linguistic sample(s) used.
• Describe how the corpus/sample was collected by detailing the inclusion/exclusion criteria.
• Detail the annotation schema used, even when the researcher performed the
annotation as a by-product of the subsequent analysis.
• Add information about the analysis methods employed and their motivation, as well as the statistical techniques, programming languages, and software used (with version number); a minimal sketch of recording such details follows this list.
• Give details of the different analyses performed (including the ones that did not
lead to the desired results), to eliminate the risk of ‘cherry-picking’ results that
conform to the researcher’s expectations.
• Add all relevant information to allow the reader to interpret and reproduce the
data visualizations.
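To illustrate the guideline on software and version information, the following minimal Python sketch (the file name and package list are merely illustrative, not prescribed by our framework) writes the interpreter and package versions used in an analysis to an open csv format:

import csv
import platform

def write_environment(path, package_names):
    """Record Python and package versions in a csv file for reproducibility."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["component", "version"])
        writer.writerow(["python", platform.python_version()])
        for name in package_names:
            module = __import__(name)
            writer.writerow([name, getattr(module, "__version__", "unknown")])

write_environment("environment.csv", ["csv", "platform"])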
Sharing and publishing research objects Being transparent about the research methodology used is very important, but may not ensure full replicability of the results when the work is complex. Therefore, it is important that the corpora, data
sets, metadata, and computer code on which the research was based are made
publicly available whenever this is possible and appropriate from an ethical point
of view.
Evidence from a study on 516 biology articles published between 1991 and 2011
reported on in Vines et al. (2014) has shown that informally stored data associated
with published works disappear at a rate of around 17 per cent per year. Even though
we do not have evidence of this kind for (historical) linguistics research, it would not
be surprising if a similar pattern were found. Access to the data and the process behind
a research work is essential, and should be ensured in a systematic way and through
platforms that researchers can use.
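To see what such a rate implies, a constant annual loss compounds geometrically: the fraction of data sets still available after n years is (1 - 0.17) to the power n. The following sketch (our own extrapolation, not a figure from Vines et al. (2014)) makes the arithmetic explicit:

# Surviving fraction of informally stored data, assuming a constant
# 17 per cent annual loss; an extrapolation for illustration only.
for years in (5, 10, 20):
    surviving = (1 - 0.17) ** years
    print(f"after {years:2d} years: {surviving:.1%} of data sets survive")

On this assumption, under 16 per cent of data sets would remain after a decade, which underlines the case for systematic, platform-based archiving.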
There are a number of repositories for language resources, including corpora, lexica,
terminologies, and multimedia resources. One (non-free) catalogue of such resources
is available through the European Language Resources Association (ELRA).5 Another
example is CLARIN (Common Language Resources and Technology Infrastructure),6
a large repository of language resources. Examples of research data repositories
which are not specific to linguistics but are widely used in the sciences are Figshare7 and Dryad.8 Figshare allows researchers to upload figures, data sets, media, papers, posters, and presentations; data deposited in Figshare receive a digital object identifier (DOI), which makes them citable.
designed to track versions of computer code, attribute it to its authors, and share
it is GitHub.9 Specific to the humanities, Humanities Commons10 is a platform for
sharing data and work in progress and constitutes a positive example of this sharing
attitude.
An interesting publishing model that is gaining popularity among the scientific community is that of so-called ‘data journals’. Such peer-reviewed publications collect descriptions of data sets rather than traditional article publications
reporting on theoretical considerations or the results of particular studies. Such citable
‘data descriptors’ or ‘data papers’ receive persistent identifiers and give publication
credits for the authors. The methodological importance of such data publications
consists in allowing other researchers to use the data described and benefit from
them, and ensuring that scientists who collect and share data in a reusable fashion
receive credit for that. Examples of open access data journals in the scientific domain
are Scientific Data11 and Gigascience.12 One notable example in the humanities is the
Research Data Journal for the Humanities and Social Sciences13 published by Brill in
collaboration with Data Archiving and Networked Services.
5 http://catalog.elra.info/.
6 http://clarin.eu/.
7 http://figshare.com/.
8 http://datadryad.org/.
9 https://github.com/.
10 https://hcommons.org/.
11 http://www.nature.com/sdata/.
12 http://www.gigasciencejournal.com/.
13 http://dansdatajournal.nl/.
2.3.4 Historical linguistics and other disciplines
In spite of their clear connections, historical linguistics and historical disciplines like
history and archaeology have largely followed separate paths (Faudree and Hansen,
2014). To take a concrete example, with the exception of corpora created for historical
sociolinguists, early historical corpora contained very limited metadata about the texts
themselves, and focused primarily on annotating linguistic features. In section 5.2
we argue for a stronger interaction between historical linguistics and history, and
make a case for a stronger connection between historical language resources and other
resources (like collections of information on people or places). This strengthened link
has the potential to enrich historical linguistics research by accounting for the sociohistorical context of the language data in a direct way.
Linked data provides a valid solution to this problem because it makes it possible to connect
linguistic corpora with general resources on various aspects of the historical context
of the texts. This enables a more historically accurate investigation of the language
and facilitates interdisciplinary efforts, which would benefit historical linguistics
research.
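As a minimal sketch of what such a link can look like in practice (assuming the Python rdflib library; every URI below is an invented placeholder rather than a real identifier), a corpus text can be connected to an external record about its author:

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/corpus/")
g = Graph()
text = EX["text/123"]
g.add((text, EX.title, Literal("Epistula I")))
# Link the text to an (invented) external authority record for its author,
# so socio-historical metadata can be queried across resources.
g.add((text, EX.author, EX["persons/author-1"]))
print(g.serialize(format="turtle"))

In a real resource the object of the author property would be a stable identifier in an external data set, which is precisely what makes cross-resource queries possible.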
In Chapter 4, we also make a case for cooperation between historical linguistics and
digital humanities. In particular, the Text Encoding Initiative has established standards
for annotating a range of information on texts and their contexts. This type of
annotation would definitely make the traditional corpus annotation more exhaustive
and therefore allow corpus analyses to consider a wider range of properties of texts and
their context; this, in turn, would make the linguistic results more comprehensive.
2.4 Data-driven historical linguistics
In section 1.3 we stressed the importance of using corpora as the evidence basis for
research in historical linguistics. We dedicate this section to defining ‘corpus-driven’
and ‘data-driven’ in the context of the methodological framework we propose, and to
explaining how this approach interacts with linguistic theory.
2.4.1 Corpus-based, corpus-driven, and data-driven approaches
Once we have established the necessity of using corpus data as evidence sources for
the historical linguistics investigation, we need to clarify how this evidence (which
by definition relates to individual instances of language use, or parole in Saussurian
terms) relates to more general statements about language as a system (or langue,
to follow Saussurian terminology). Are corpus data going to support general claims
about language, or will they determine them? Will the investigation start from corpus
data, or from theoretical statements, or a combination of the two?
According to a terminology that is well established in corpus linguistics (Tognini-Bonelli, 2001, 65), corpus-based approaches involve starting from a research question
and testing it against (annotated) corpus data, often by analysing selected examples.
Theoretical hypotheses play a prominent role in this approach, and corpus data are
used to support (or more rarely refute) them; therefore, we could categorize such
approaches as ‘confirmatory’.
On the other hand, ‘corpus-driven’ approaches (Tognini-Bonelli, 2001, 85) rely on
unannotated corpus data with the aim of revising existing assumptions about language
that pre-dated electronic corpora; in fact, annotated corpora are seen as sources of
skewed data because they reflect such pre-existing assumptions. In corpus-driven
approaches the corpus evidence is the starting point of all analyses and needs to
be reflected in the theoretical statements, which makes the primary focus of such
approaches exploratory. The researcher draws generalizations from the observation of
individual instances of unannotated corpus data to theoretical statements about the
language system. In other words, corpus-driven approaches aim ‘to derive linguistic
categories systematically from the recurrent patterns and the frequency distributions
that emerge from language in context’ (Tognini-Bonelli, 2001, 87).
Rayson (2008) proposes the use of the term ‘data-driven’ as a compromise between
the ‘corpus-based’ and the ‘corpus-driven’ approaches contrasted above. His starting
point is the automatic annotation of two corpora by part-of-speech and semantic
fields; then, he conducts a quantitative analysis of the keywords extracted from the
two corpora. At this point, in his model the researcher’s contribution consists in
examining qualitatively ‘concordance examples of the significant words, POS and
semantic domains’ (Rayson, 2008, 528) to formulate research questions. This way, the
research questions arise from the qualitative analysis of quantitatively processed data
from automatically annotated corpora, rather than from theoretical hypotheses, as in
corpus-based approaches.
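The keyword comparison at the heart of this procedure is typically based on a log-likelihood statistic. The following Python sketch (our own illustration with invented counts, not Rayson’s implementation) shows the core calculation for a single word observed in two corpora:

import math

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Dunning-style log-likelihood for a word's frequencies in two corpora."""
    expected_a = total_a * (freq_a + freq_b) / (total_a + total_b)
    expected_b = total_b * (freq_a + freq_b) / (total_a + total_b)
    ll = 0.0
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

# Invented counts: 120 occurrences in a 1m-word corpus vs 30 in another.
print(log_likelihood(120, 1_000_000, 30, 1_000_000))

Words with high scores are candidate keywords, whose concordances the researcher then examines qualitatively, as in Rayson’s model.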
In this book we will employ the term ‘corpus-driven’ in a sense that is different
from the ones outlined above. We will accept the confirmatory view according to
which corpus analyses can test hypotheses from previous theories (and we discuss
the term ‘theory’ in section 2.4.3), but we also allow for exploratory views in which
such hypotheses emerge directly from corpus data. Moreover, unlike in the traditional
definition of ‘corpus-driven’, we consider annotated corpora as legitimate sources of
evidence. Finally, we do not consider it acceptable to analyse selected examples from
corpora in order to test theoretical statements, as done in large part in corpus-based
research.
In our definition, ‘corpus-driven’ will refer to those approaches whereby evidence
from (annotated) corpus data is collected systematically, usually with automatic
means. This evidence (whose size is typically relatively large) undergoes a systematic
and exhaustive quantitative analysis. Such analysis aims at testing theoretical hypotheses (in confirmatory studies) or at formulating new ones (in exploratory studies). ‘Data-driven’ refers to the same procedure as ‘corpus-driven’ defined above, but specifically
affects other types of data in addition to corpus data, for example metadata on authors,
genres, geography, or data from language resources and other resources. Because
historical corpora necessarily contain some form of metadata, any ‘corpus-driven’
methodology is also ‘data-driven’.
Doing linguistics research in this data-driven quantitative way accounts for the
variability in language use and lends itself to a usage-based and probabilistic view of
language, whereby frequency distributions are built into the language models (Penke
and Rosenbach, 2007b, 20). However, as explained by de Marneffe and Potts (2014,
8), as we discussed in section 1.2.1, and as we will see in Chapter 6, corpus research is
compatible with non-probabilistic approaches as well, because the statistical evidence
collected from corpora may reflect the interaction of various discrete phenomena.
2.4.2 Data-driven approaches outside linguistics
The emphasis on data-driven approaches is common to a general movement
affecting a range of disciplines, particularly in the sciences and in business. The
terms ‘data-intensive’ (Hey et al., 2009) and ‘data-centric’ science (Leonelli, 2016)
have acquired specific senses in today’s scientific context. They refer to approaches
characterized by large-scale networks of scientists, a focus on open data sharing and
data publishing, and a drive towards large collections of data employed as evidence
sources in research.
Following a similar trend, the business world has witnessed an exponential growth
in the demand for data scientists and a general shift towards data-centred attitudes
in organizations in recent years. Data-driven approaches are increasingly employed
in designing business strategy by relying on large-scale analyses of data on users’
behaviour and preferences, as well as data from internal systems, including workflow
and sales databases (Redman, 2008).
Mason and Patil (2015, 10) define the ‘data scientific method’ for organizations as
follows:
1. Start with data.
2. Develop intuitions about the data and the questions it can answer.
3. Formulate your question.
4. Leverage your current data to better understand if it is the right question to ask. If not, iterate until you have a testable hypothesis.
5. Create a framework where you can run tests/experiments.
6. Analyse the results to draw insights about the question.
This list of steps highlights the importance of data exploration in the initial phases
(steps 1 and 2) and largely overlaps with the exploratory approaches we referred to in
section 2.4.1.
Examples from the scientific world and the business world highlight a general
trend in society, which may be explained by the cultural innovations driven by new
technologies, as we discussed in section 1.3.4. Since linguistics research does not
happen in isolation, but interacts with this changing external context, we believe that
keeping in mind this broader perspective can help us to understand and frame data-driven approaches in historical linguistics as well. However, one important difference
between business applications and research in historical linguistics is the ultimate aim
of the investigation. Historical linguistics intended as an empirical enterprise aims to
model and ultimately explain language phenomena in the past. This theoretical aim
has implications for the data-driven research process we propose, as we will see in the
next section.
2.4.3 Data and theory
Given the explanatory aim of historical linguistics, the corpus-driven framework
we propose must be compatible with the creation of a historical linguistics theory,
intended as a system of laws for historical languages and language change which allows
us to explain and predict phenomena affecting these languages.
As a matter of fact, the term ‘theory’ is used quite generously in linguistics to refer
to annotation schemes like HPSG, X-BAR, LFG, dependency grammar, construction
grammar, approaches like distributional semantics, or other formalisms; we agree with
Köhler (2012, 21) when he states:
there is no linguistic theory as of yet. The philosophy of science defines the term ‘theory’ as
a system of interrelated, universally valid laws and hypotheses [. . . ] which enables to derive
explanations of phenomena within a given scientific field.
Data and theory interact in complex ways in corpus-driven historical linguistics.
In this section we will examine individually what we consider the main aspects of this
interaction in the context of the data-driven approach to historical linguistics research
we propose. We will present these different aspects one by one for reasons of clarity,
although we recognize that they often occur together and interact.
Theory in data representation In spite of their intended objectivity, whenever some
data are collected as part of a research study, by necessity they reflect a specific way of
understanding, representing, and encoding the recorded entities or events. Therefore,
they are tied to a particular historical moment and theoretical views. As Tognini-Bonelli (2001, 85) summarizes very effectively, ‘[t]here is no such thing as pure induction’ and even corpus-driven approaches (in the sense she defines) acknowledge this.
Let us imagine that we have taken records of daily measurements of the temperature. In order to make sense of the pairs of numbers and characters collected, we would
need to read them as temperatures (e.g. in centigrade degrees) and day–month–year
triples, if that is the way we decided to represent dates.
When it comes to linguistics research, the notations chosen for representing and
collecting the corpus data play an important role in any subsequent analysis. In the
case of annotated corpus data, the annotation is always performed with reference
to a specific notational framework and, therefore, annotated data will reflect that
viewpoint, which we may call ‘theoretical’ (de Marneffe and Potts, 2014, 17). If the
annotation includes part of speech, for instance, it has to rely on definitions for the
different part of speech categories and how these labels apply to the language or variety
in question. If the corpus is a treebank, we may want to choose a phrase-structure
model or a dependency-based model for representing syntactic structures, and this
choice will depend on our preferences, the features of the language annotated, as well
as other considerations. In a similar vein, a corpus that has not been annotated will
still need to be interpreted according to a specific theoretical perspective in order to
form the basis for any subsequent linguistic analysis.
Theoretical assumptions In addition to the way we represent the entities that we want
to analyse and their context, whenever we carry out a data analysis we rely on a set of
assumptions, which we may call ‘theoretical’, too.
Let us go back to the example of daily temperatures. When we collect and then
interpret the data, we need to keep in mind that they are limited to a specific range, so
that if we spot a measurement of −400, for instance, we can quickly identify it as an
error. In this case, any data-driven analysis would only make sense if we have access
to the domain knowledge concerning temperatures on the earth.
When we annotate and then analyse a corpus, in addition to the notational framework chosen, we rely on a set of assumptions on which there is general consensus
among linguists: for example that nouns in French are inflected by number and gender,
that verbs in Latin can display different endings depending on their person, number,
tense, voice, etc. When we analyse verb data in a treebank, for instance, we assume that
verbs do not occur with their arguments in a random way, but that they display specific
syntactic and lexical–semantic preferences according to their argument structure.
This kind of domain knowledge also supports the design and interpretation of
exploratory analyses. The choice of which variables we decide to study will need
to make sense according to this domain knowledge or in the context of specific
hypotheses we want to test. To take a slightly absurd example, we may collect data
relative to a number of events happening by a beach and we may find a strong
correlation between the number of shark attacks in a day and the amount of ice
cream sold in the same day. However, our domain knowledge tells us that, rather than
concluding that buying more ice cream increases the chances of being attacked by
a shark, we could hypothesize that both variables are correlated with (and possibly
caused by) the number of visitors to the beach in that day.
As Köhler (2012, 15) states, even exploratory approaches need some theoretical
grounding:
It is impossible to ‘find’ units, categories, relations, or even explanations by data inspection—
statistical or not. Even if there are only a few variables, there are principally infinitely many
formulae, categories, or other models which would fit in with the observed data.
For instance, in McGillivray (2013, 127–78), the author studied the change of the
argument structure of Latin verbs prefixed with spatial preverbs, and particularly
the opposition between prepositional constructions and bare-case constructions. This
phenomenon involves the interplay of a range of variables, including morphological,
lexical, and semantic features such as the type of verbal prefix, the particular spatial
relation expressed in conjunction with the verb, the semantics of the verb, and the case
of the verbal argument. McGillivray (2013, 169–72) employed an exploratory approach
to deal with the complexity of the phenomenon and to measure the contribution
of various variables to it. She resorted to exploratory data analysis (Tukey, 1977)
(specifically CA, see section 6.2), which aims at letting the model ‘emerge’ from the
data. However, the author chose the set of variables based on a combination of findings
from previous research and linguistic domain knowledge.
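For readers unfamiliar with the technique, the core computation of CA is a singular value decomposition of the standardized residuals of a contingency table. The sketch below (with a small invented table, not McGillivray’s data; see section 6.2 for the method itself) assumes the numpy library:

import numpy as np

def correspondence_analysis(table):
    """Row and column principal coordinates from a contingency table."""
    P = table / table.sum()
    r = P.sum(axis=1)  # row masses
    c = P.sum(axis=0)  # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sigma) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]
    return row_coords, col_coords

# Invented counts, e.g. two constructions across three periods.
table = np.array([[80.0, 20.0], [30.0, 70.0], [50.0, 50.0]])
rows, cols = correspondence_analysis(table)
print(rows[:, 0])  # positions of the periods on the first principal axis

Plotting rows and columns on the first two axes yields the kind of map from which patterns are allowed to ‘emerge’ from the data.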
Data and theoretical hypotheses We have seen that exploratory approaches to historical linguistics analysis need access to domain knowledge and need to be theoretically
grounded. But, of course, theory plays a crucial role in confirmatory approaches as
well, which are essential to the progress of any empirical research.
When approaching corpus data with a theoretical hypothesis, it is important to
avoid the risk of confirmation bias, which would lead us to only find positive evidence
of the claims we intend to make. To address this issue, McEnery and Hardie (2012,
15) define the principle of total accountability according to which we should always
aim at using the entire corpus, or at least random samples when the corpus is too
large. This way we can satisfy the criterion of falsifiability, identified by Popper (1959)
as the defining feature of the scientific method. If we follow the principle of total
accountability, we are very likely to employ quantitative analysis techniques, as manual
analysis is often inadequate to deal with the size and complexity of the data.
By relying on the systematic evidence from a corpus, corpus-driven approaches can
address the question of whether a phenomenon is attested by finding occurrences of
certain patterns or constructions. However, finding few examples of such patterns in a
corpus in itself does not guarantee that these are not annotation errors, typos, or other
anomalies; this is also true in the case of historical texts, for which spelling and other
elements often do not follow standards and cannot be easily categorized because they
are captured at the moment in which they undergo diachronic change. A systematic
quantitative account of corpus evidence based on all available data, together with a
theoretical model and the effort to make our results consistently replicable (Doyle,
2005), can help to avoid spurious conclusions in these situations and increase their
validity in an empirical context (McEnery and Hardie, 2012, 16).
Corpus data may also support statements about possible but unseen phenomena,
by relying on seen events and statistical estimation, coupled with domain knowledge
(Stefanowitsch, 2005; de Marneffe and Potts, 2014, 11). Corpora, paired with a theoretical model, can also predict that a phenomenon is impossible (or has a negligible
probability of occurring). For example, Pereira (2000) used a statistical model trained
on a newspaper corpus to predict that colourless green ideas sleep furiously is about
200,000 times more probable than furiously sleep ideas green colourless, thus addressing
Chomsky (1957)’s challenge. As Pereira’s study illustrates, corpora can also address
probabilistic hypotheses about language, as well as binary ones. This is explored in
more depth in Chapter 6.
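To convey the flavour of such a comparison (though emphatically not Pereira’s actual model, which was a sophisticated smoothed model trained on a large corpus), here is a toy add-one-smoothed bigram model over an invented two-sentence corpus:

from collections import Counter

def train(corpus):
    """Collect unigram and bigram counts from whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def probability(sentence, unigrams, bigrams):
    """Add-one-smoothed bigram probability of a sentence."""
    tokens = ["<s>"] + sentence.split()
    v = len(unigrams)
    p = 1.0
    for a, b in zip(tokens, tokens[1:]):
        p *= (bigrams[(a, b)] + 1) / (unigrams[a] + v)
    return p

corpus = ["green ideas sleep", "ideas sleep furiously"]
u, b = train(corpus)
print(probability("green ideas sleep", u, b) / probability("sleep ideas green", u, b))

Even this toy model assigns the attested word order a probability roughly fourteen times higher than the reversed one, because the former’s bigrams have been seen before.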
As an example of such a probabilistic hypothesis from historical linguistics, in the
context of the diachronic change in the argument structure of Latin prefixed verbs,
McGillivray (2013, 127–78) formulated various hypotheses including the following,
concerning one of the constructions that pertain to these verbs, specifically the bare-case construction:
Construction 1 [ . . . ] is significantly more frequent in the archaic age and in works by poets
than in the later ages and in prose writers.
This hypothesis operationalizes a generalizing statement in terms that can be
addressed by corpus data. McGillivray (2013) tested the hypothesis above with a
statistical significance test (chi-square test; see section 6.3.3 for details on this test), and
obtained a confirmation of the hypothesis, together with a measure of the size of the
detected effect. The process involved all available corpus data and therefore fulfilled the
principle of total accountability, which is fundamental to corpus-driven approaches.
This way, the results contributed new quantitative evidence which can support more
general theoretical models.
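For concreteness, the sketch below runs this kind of chi-square test on a small contingency table (the counts are invented and do not come from McGillivray (2013); the scipy library is assumed):

from scipy.stats import chi2_contingency

# Invented counts: construction 1 vs other constructions, by period.
#                 construction 1   other
table = [[80, 120],   # archaic age
         [60, 340]]   # later ages
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")

A small p-value indicates that the difference in proportions between the periods is unlikely under the null hypothesis of independence; the size of the effect should then be reported separately, as noted above.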
2.4.4 Combining data and linguistic approaches
Following Köhler (2012, 2), we define linguistic theory as a series of connected claims
from which predictions about historical languages can be made. As we have seen,
our framework includes theoretical hypotheses, properly tested against corpus data.
From a series of such contingent statements corresponding to tested theoretical
hypotheses, we can proceed towards formulating theoretical models of the historical
linguistics phenomenon at hand. By this term we mean those generalized explanations
of observed phenomena that some linguists call ‘theories’. Our framework does not
impose restrictions on which particular models can be derived from this process, nor
on the ontological setup that allows this generalization step from contingent claims to
theoretical models. Our main concern is the way in which such a process is performed. In the
rest of this book we will provide more details of this process; particularly, Chapter 6
will give some concrete examples of how this can be realized in practice.
Thus, our framework is not meant to replace other approaches to historical linguistics rooted in e.g. generative theory or traditional comparative linguistics. We consider
work on X-bar theory, grammaticalization, and language history equally compatible
with our framework. A linguistic description can be characterized as a hypothesis
(Carnie, 2012, §3.1). As mentioned above, the key characteristic of a good hypothesis
is that it is falsifiable and has predictive power, and that it needs to be tested against alternative hypotheses, as stressed by Beavers and Sells (2014). The reason for this is captured in our principle number 3 (section 2.2.3): since
almost any claim is possible, merely fitting a hypothesis to the data is insufficient.
Instead, the hypotheses must be compared and tested against data. This ought to be
uncontroversial, and both Carnie (2012) and Beavers and Sells (2014) are rooted in
generative theory, thus demonstrating that such hypothesis testing is not restricted
to probabilistic approaches to linguistics. We go beyond Carnie (2012) and Beavers
and Sells (2014) in insisting that such hypothesis testing and comparison in historical
linguistics ought to be done quantitatively using corpus data, whenever possible.
Furthermore, we argue that multivariate techniques for quantitative modelling are
superior to others, due to the complex nature of language. We also see this focus on
quantitative techniques and corpus data as a means to compare results across linguistic
frameworks, hence our emphasis on an empirically based consensus (see section 2.2.1) informed by appropriate statistical techniques (see section 2.2.12). In short, the
present framework extends commonly accepted guidelines for constructing linguistic
arguments. The framework takes explicit issue with hypothesis testing in historical
linguistics by means of intuitions and qualitative judgements about frequencies
(as well as quantitative arguments that do not follow state-of-the-art standards). It
is precisely this methodological focus that makes our framework compatible with
different paradigms in historical linguistics.
3 Corpora and quantitative methods in historical linguistics
3.1 Introduction
Historical linguistics and quantitative research have enjoyed a long and tangled coexistence over the years. It must be stressed that any attempt to paint a picture of
a gradual, one-directional diachronic shift from qualitative to quantitative methods
in historical linguistics is an oversimplification, and not even a particularly useful one.
Instead, we would like to repeat the image of the chasm separating the early innovators
and visionaries from the majority or mainstream, discussed in Chapter 1.
Looking back at the history of quantitative and corpus methods in historical
linguistics through the lens of the chasm model, we can compare the degree to which
quantitative corpus methods are used within the groups defined in the chasm model.
For instance, the early adopters would correspond to roughly 16 per cent of the
potential users. A technology adopted by the early majority would bring the total up
to about 50 per cent, whereas including the late majority too would mean that the
technology has reached more than 80 per cent of potential users. This is essentially an
empirical question (contingent on the validity of the chasm model). As this chapter
will show, in the case of historical linguistics, quantitative corpus technologies have
not transitioned much beyond the early stages of the adoption curve. However, we
also want to better understand why these methods have failed to transition from the
ranks of early innovators to the majority of linguists practising historical linguistics.
It is indisputable that the early models of linguistic change associated with the development of the comparative method, such as the family-tree model and wave theory,
largely fall under the rubric of qualitative methodology (Campbell, 2013, 187–90). The
comparative method remains a vital approach to historical linguistics, and Campbell
(2013, 471) argues that what he calls ‘mathematical solutions’ to historical linguistic
problems are neither necessary nor sufficient, implying that historical linguistics can
do without quantitative methods. The chapter on quantitative methods only appeared
as a full chapter in the third edition of Campbell’s book, suggesting perhaps a growing
need for addressing these methods in historical linguistics, albeit largely to refute
them. Yet the casualness of the refutation also points to the status of the qualitative approaches, and particularly the comparative method, as the hegemon of historical linguistics. That state of affairs undoubtedly stems at least partially from both the success
and age of the comparative method (McMahon and McMahon, 2005, 5–14). However,
as Campbell’s treatment of quantitative approaches to historical linguistics illustrates,
a certain antagonism can also be traced back to early attempts at statistical approaches
to historical linguistics, leading to the perhaps surprising conclusion that the full
acceptance of quantitative methods in historical linguistics is not only hampered by
the novelty of the methods, but also by a somewhat painful previous exposure.
3.2 Early experiments
The success of the comparative method notwithstanding, it is possible to find examples
of researchers proposing ‘mathematical solutions’ to problems in historical linguistics
at least as far back as the nineteenth century. Köhler (2012, 12) claims that ‘in linguistics, the history of quantitative research is only 60 years old’, a claim that appears to be
founded on his view that quantitative linguistics in the modern sense began with the
work of George K. Zipf in the late 1940s, although Köhler does point out that studies
based on ‘statistical counting’ can be found as far back as the nineteenth century
(Köhler, 2012, 13). Gorrell (1895), to take but one example, certainly took a quantitative
approach in his study of indirect discourse in Old English, with tables displaying
counts of constructions appearing every few pages. However, it is paramount to avoid
simplistic generalizations, and the overall picture of historical linguistics a century or
so ago is above all one of variation. McGillivray (2013, 144–7) discusses a study of Latin
preverbs by Bennett, dating back to 1914, where the author made the choice
of classifying occurrences above ten as ‘frequent’, but without providing the reader
with access to the actual numbers of occurrences, leaving the reader to guess exactly
what evidence underpins such unquantified yet implicitly numerical distinctions as
‘many’ vs ‘most’.
Statistics beyond word counts also enjoys some seniority within historical linguistics. Kroeber and Chrétien (1937) calculated correlation coefficients for linguistic
features in order to arrive at a statistically based classification of Indo-European
languages. Their work was criticized by Ross (1950), who (although generally sympathetic to the approach) took issue with their calculations, favouring instead
the chi-square statistic, a test with its own inherent problems when employed in
linguistics. However, both Kroeber and Chrétien and Ross were somewhat pessimistic
in their conclusions, prompting a rebuttal from Ellegård (1959). While taking care
to emphasize that statistical measures of linguistic similarity refer to similarity with
respect to some specifically chosen traits or features (rather than a global, a-theoretical
similarity), Ellegård proposed an alternative approach to interlanguage correlations.
Ellegård’s conclusion is mainly methodological, but rounds off with the insight that
the application of statistical methods in historical linguistics is not merely a methodological choice. Instead, he outlines a dynamic relationship where quantitative
methods spur on theoretical developments, and where statistical methods ‘will require
a linguistic taxonomy, will help to establish it, and can be used for bringing taxonomic
and developmental studies into fruitful contact’ (Ellegård, 1959, 156). In hindsight it is
clear, however, that the impact of statistical methods on historical linguistic theorizing
remained limited.
The limited impact of statistical methods is perhaps best judged by the insistent tone
in the publications advocating their use, which (following the logic from historical
studies that what is frequently prescribed by law generally reflects common actual
behaviour) quickly raises the image of a besieged minority. At the same time, the
arguments are often pithy and convey a message which in many cases remains relevant
today. Take the point made by Kroeber and Chrétien (1937, 97), who suggested that the
linguist working only with intuition easily becomes biased when the linguist
observes a certain affiliation which is real enough, but perhaps secondary; thereafter he notes
mentally every corroborative item, but unconsciously overlooks or weighs more lightly items
which point in other directions.
The quote, which is a polite way of saying that non-quantitative studies are prone
to bias by over-emphasizing rare or unexpected phenomena, has held up well and
is in tune with more recent critiques of qualitative methods, such as that raised by
Sandra and Rice (1995). The simple psychological fact that the human mind is not well
equipped to deal objectively with relative frequencies in an intuitive way remains a
key objection to non-quantitative work in historical linguistics. However, the fact that
similar critiques are being made decades after Kroeber and Chrétien says something
of their impact, or more precisely lack of such. Occasional arguments for the virtues
of a fully quantitative linguistics can be found around the middle of the twentieth
century, but their relative rareness as well as their timbre are testament to a lack of
impact. Consider the acerbic yet slightly despondent tone in the following observation
by Guiraud (1959, 15), published over twenty years after Kroeber and Chrétien:
La linguistique est la science statistique type; les statisticiens le savent bien; la plupart des
linguistes l’ignorent encore.
(‘Linguistics is the typical statistical science; the statisticians know this well; most linguists are
still ignorant of it.’)
If the quote from Guiraud suggests ignorance of (and hence lack of involvement in)
quantitative research on the part of the linguists, the following passage from Ellegård
(1959, 151–2) has the air of a well-rehearsed response to familiar criticism: ‘Even
intuitive judgments must be based on evidence. Now if that evidence turns out to
be insufficient statistically, it will be insufficient also for an intuitive judgment.’ The
comment is poignant, the tone is one of calm reason; however, the implications went
largely unheeded by historical linguistics as a discipline, suggesting that Guiraud and
Ellegård were early adopters of a technology that did not quite catch on. There are
clearly several reasons for this: first and foremost is probably the undoubted success
of the comparative method mentioned earlier, which (following the old adage that if it
ain’t broke don’t fix it) must have made the rather tedious mathematical calculations
seem subject to diminishing returns. Second, the lack of electronic corpora, desktop
computers, and statistical software meant that quantitative work was slow and almost
impossible to perform at the large scale where it really comes into its own. Third,
the advent of generative linguistics, vividly chronicled in Harris (1993), heralded a
period where numerical approaches to linguistics generally were no longer in vogue,
or were even regarded with some hostility (Pullum, 2009). And finally, there were
the stains left behind by a specific, much revered (and later much reviled) method:
glottochronology.
3.3 A bad case of glottochronology
In the 1950s linguistics was both changing and expanding with a mature and optimistic
sense of security, enjoying ‘measured dissent, pluralism, and exploration’ (Harris,
1993, 37). Such exploration was also taking place in historical linguistics where Morris
Swadesh launched the term glottochronology in the early 1950s; see e.g. Swadesh (1952)
and Swadesh (1953).
Glottochronology was proposed by Swadesh as an approach to lexicostatistics more
generally. The distinction is worth making since lexicostatistics is generally taken to
mean statistical treatments of lexical material for the purposes of studying historical
linguistics. McMahon and McMahon (2005, 33) offer the following definition of
lexicostatistics: ‘the use of standard meaning lists to assess degrees of relatedness
among languages’. Campbell (2013, 448), like McMahon and McMahon (2005, 33–4),
notes that ‘glottochronology’ and ‘lexicostatistics’ are frequently used interchangeably,
but Campbell goes on to claim that ‘in more recent times scholars have called for
the two to be distinguished’. However, the attempts at making the distinction are as
old as the confusion itself. To Hockett (1958, 529) the two terms appear to have been
synonyms, whereas Hymes (1960, 4) argues for a distinction:
Glottochronology is the study of rate of change in language, and the use of the rate for historical
inference, especially for the estimation of time depths, and the use of such time depths to
provide a pattern of internal relationships within a language family. Lexicostatistics is the study
of vocabulary statistically for historical inference. . . . Lexicostatistics and glottochronology are
thus best conceived as intersecting fields.
Hymes goes on to point out that lexicostatistics could in fact refer to any numerical
study of lexical material, synchronic or diachronic, but that the term has received a
‘specialized association with historical studies’.
At its core, glottochronology operates with three basic elements: word lists, or
strictly speaking lists of sememes or ‘meaning lists’ (McMahon and McMahon, 2005,
34), of ‘basic’ vocabulary for the languages to be compared, the number of cognate
items within the list, and the retention rate over time (Hymes, 1960, 3). Two lists
predominate in the literature, one containing 100 items, the other 200 (Campbell, 2013,
448–51). As subsequent criticism would show, all three variables turned out to have
their particular trapdoors, including the problem of defining culturally neutral and
replicable versions of the lists themselves, what to count as ‘basic’ (which again had
an impact on the cognates), but also the assumption of a constant rate of change. The
constant retention rate over 1,000 years, argued to be 86 per cent for the 100-word
list and 81 per cent for the 200-word list, was boldly presented as a real, mathematical
fact (a physical probability, see section 2.1.3) with evidence ‘sufficient to eliminate the
possibility of chance’ (Swadesh, 1952, 455).
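The time-depth formula that follows from this assumption is t = log c / (2 log r), where c is the proportion of shared cognates, r the retention rate per millennium, and t the estimated number of millennia since divergence. The sketch below applies it with invented figures:

import math

def time_depth(c, r):
    """Glottochronological estimate, in millennia, of time since divergence."""
    return math.log(c) / (2 * math.log(r))

# Invented example: 70 per cent shared cognates on the 100-word list (r = 0.86).
print(time_depth(0.70, 0.86))  # roughly 1.2 millennia

The apparent precision of such estimates rests entirely on the assumed constancy of r, which is exactly what the subsequent criticism targeted.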
Glottochronology was met with an initial rush of enthusiasm (Hymes, 1960, 32), and
made it into the introductory-course university curriculum in linguistics (Hockett,
1958, 526–35). However, some methodological problems were pointed out both by
Swadesh himself and by others. Ellegård (1959, 155) criticized the lexico-statistical
method used by Swadesh (1953), commenting that the latter seemed ‘somewhat rash
in assuming a uniform rate of development’. Hockett questioned the assumption of
a ‘basic vocabulary’, but nevertheless rounded off his introduction of the approach
to undergraduate students rather optimistically by stating that ‘no development in
historical linguistics in many decades has showed such great promise’ (1958, 534).
Many, however, went further, possibly out of confusion, enthusiasm, or both. Hymes
(1960, 4) notes that some academics leaped from the method’s treatment of a narrowly
circumscribed basic vocabulary to endorsing it for tackling the problem of language
change at large. In this we can recognize the pitfall pointed out by Moore (1991)
regarding any new technology, namely the risk of overselling the ‘vision’ of the new
technology before it is sufficiently mature to back up that vision with concrete results.
The detailed contemporary critiques summarized in Hymes (1960) cover the now-familiar criticisms against glottochronology, namely problems with basic lists, problems with judging sameness, problems with cultural bias, problems with synonyms,
problems with borrowings, problems with taboo words, the problematic assumption
of a constant rate of change, as well as specific mathematical problems.1 Although
critical, Hymes (1960, 15) nevertheless argued that more research should go into the
method, and he approved of its continued application.
However, problems were mounting. In addition to the problems listed above,
different linguists were reporting different results for glottochronological studies of
the same languages, as discussed in e.g. Bergsland and Vogt (1962), suggesting that
1 The core criticism against glottochronology is concisely presented in McMahon and McMahon (2005) and Campbell (2013).
the method was introducing more vagueness rather than more objective replicability.
The central tenet of glottochronology, a universal constant rate of change in basic
vocabulary, did not hold empirical water, as shown by Fodor (1961) and Bergsland
and Vogt (1962). In one of the studies conducted by Bergsland and Vogt (1962), the
authors found that lexical replacement rates in the basic vocabulary list were either
far higher (Icelandic) or far lower (Riksmål/Norwegian Bokmål) than those predicted
by the model. Fodor (1961), on the other hand, found split dates for Slavic languages
that were not only at odds with the comparative method, but also with well-attested
historical facts.
Add to this further criticism of the mathematics involved (Chrétien, 1962), and
the result was a predictable and considerable dampening of the initial enthusiasm.
In 1964 Lunt, in an editorial quoted in McMahon and McMahon (2005, 45), declared
glottochronology an ‘idle delusion’ and bluntly denied the usefulness of continuing the
project. As the 1960s and 1970s went on, glottochronology, and quantitative methods
in linguistics more generally, largely fell out of favour, and well-known exceptions
such as William Labov’s quantitative studies of sociolinguistic variation implied some
opposition to the orthodoxy (Sampson, 2003; Lüdeling et al., 2011).
Glottochronology did not reduce the interest in quantitative methods in historical
linguistics on its own. As we have seen, structuralist approaches to linguistics were
sceptical towards statistical evidence, and that scepticism was inherited and refined by
mainstream transformational-generative grammar in subsequent decades (Sampson,
2003; Lüdeling et al., 2011; Gelderen, 2014). Cognitive–functional approaches (Sandra
and Rice, 1995) also displayed a lack of attention to statistical methods (Deignan,
2005), which suggests a general tenor of linguistic research that went beyond historical
linguistics. However, judging from the fact that glottochronology is still being discussed in the context of quantitative approaches to historical linguistics (McMahon
and McMahon, 2005; Campbell, 2013; Pereltsvaig and Lewis, 2015), it seems clear
that the negative perception of quantitative methods stemming from the failure of
glottochronology has endured beyond the method itself. According to Moore (1991),
such negative impressions could be a contributing factor when a technology fails to
cross the chasm. As we pointed out in section 2.2.6, we cannot go logically from this
possibility to concluding that this is in fact the case. We need to take a much richer
context into account. In the next section, we turn from quantitative methods to the
use of corpora.
3.4 The advent of electronic corpora
From a methodological point of view, it is interesting that some of the early publications referred to here, notably Kroeber and Chrétien (1937), Ross (1950), and Ellegård
(1959), while predominantly concerned with statistical methods, kept returning to the
question of data. Statistical methods in themselves will not yield answers without
appropriate quantitative data, which means that methodological advances in one
area are contingent on the other. Ross (1950) built upon the data from Kroeber
and Chrétien (1937) and added more data, whereas Ellegård (1959) returned to the
question of ‘relative frequencies’ from ‘random samples’ several times when discussing
the methodological shortcomings of statistical methods that only deal with binary
features. Without reading too much into this, it is clear that even if they do not
phrase it in those terms, these authors were intimately aware of the problems caused
by the lack of (and to some extent solved by the presence of) today’s electronic
annotated corpora.
That is not to say that text corpora were something new in the mid-twentieth
century. Käding published his 11-million-word corpus concordance in 1897 (McEnery and
Wilson, 2001, 12), and the first half of the twentieth century saw a string of studies
relying on corpus linguistic methods, with the early 1950s witnessing both Firth’s
work on collocations (Gries, 2006a, 3) and Fries’s corpus study of spoken
American English (Gries, 2011, 81–2). The 1960s saw the introduction of the so-called
first generation of machine-readable corpora whose characteristics today are the
defining hallmarks of corpora: electronically stored, searchable, possibly annotated,
and with an aim at representativeness. In the field of historical language studies, the
Index Thomisticus corpus coevolved alongside the technological development from
punched cards to magnetic storage, and finally online publication over its thirty-year
construction phase (Busa, 1980). Pioneering work on corpus linguistics continued
from the 1950s to the 1980s (McEnery and Wilson, 2001, 20). However, with notable
exceptions such as the Index Thomisticus corpus and the Helsinki Corpus of English
Texts, these efforts were mainly directed at contemporary languages.
Today it is perhaps easy to underestimate the financial and technical difficulties facing early corpus builders. As Baayen (2003, 229–30) points out, early computers were
few and expensive, which provided both a positive incentive for a formal approach
to language, as well as a negative incentive against statistical investigation of large
corpora. The case of the Index Thomisticus corpus provides an interesting illustration of
the difficulties: it took some thirty years to complete (including adaptation to changing
technologies along the way), and it was reliant on large-scale funding from the IBM
corporation (Busa, 1980). In the face of such financial and technical obstacles, perhaps
it is not surprising that historical linguistics (with its data being considerably less
interesting from a commercial point of view) lagged behind in corpus creation, or
at least in the modern sense of general, representative, machine-readable corpora.
We have already mentioned Käding’s large late nineteenth-century corpus;
however, the usefulness of such a corpus would be severely limited by the available
(manual) search technology. Thus, the pragmatic pressures imposed by technology
for creating and searching relatively large corpora, alongside the financial costs, would
naturally favour smaller collections of purpose-built corpora that could be collected
and searched manually. For all their merits, such corpora are nevertheless limited in
their usefulness. Organized as lists of sentences, they are difficult to search, except by
manually reading each sentence. Organized as collections of index cards, they can take
a lot of space and are not easily distributed. The size limitation imposed by storage and
searching points naturally in the direction of relatively small, purpose-built specialized
corpora, rather than large general ones. Although such specialization can be valuable,
it might also limit the potential for reuse. The lack of shared, reusable resources would
also mean that each corpus in the worst-case scenario would have to be created afresh
for each new project.
This view of the situation is perhaps slightly too sombre. As the studies in Kroeber
and Chrétien (1937), Ross (1950), and Ellegård (1959) attest to, collections of data could
be shared and expanded gradually. Some specialized early corpora have enjoyed considerable
longevity, perhaps most notably the data on the history of the English periphrastic
do from Ellegård (1953), which have been reanalysed by Kroch (1989) and Vulanović
and Baayen (2007), among others. Nevertheless, the central critiques of such early
corpora remain: their specialized nature leads to a proliferation of isolated resources,
rather than general ones that are suited for at least a majority of research questions.
Furthermore, idiosyncrasies in sampling and annotation might make comparing or
merging data sets difficult, a difficulty which would be compounded by a lack of
standardized annotation. Although quantitative work based on a corpus methodology
was being carried out in historical linguistics prior to the emergence of electronic
historical corpora, reduced costs and improved computing power (together with the
availability of lessons learned from the efforts to build corpora of contemporary
language) meant that by the 1990s the scene was set for mainstream electronic
historical corpora.
3.5 Return of the numbers
By the end of the 1980s, the stage was set for a growing interest in corpora. The two
decades that had passed since the release of the Brown corpus in 1967 had seen a
gradual growth in corpus size, as well as a growth in corpus use, including in commercial projects like the Cobuild Dictionary. The evolution of a scientific community
which refined and promoted the building and use of corpora was undoubtedly vital. So
was another development taking place: the growth of computing power. In computer
science, the power of computing hardware (measured by the number of transistors that
could fit into an integrated circuit) has been argued to follow what is commonly known
as Moore’s law, a prediction made in the late 1960s that computing power would double
every two years, that is, grow at an exponential rate. Especially since the computing industry has to some extent calibrated its development efforts to match the law,
the law itself is perhaps less interesting than the result, namely a massive growth
in computing power at a greatly reduced cost. Figure 3.1 illustrates Moore’s law as a
regression line showing the growth in computing power over time on a logarithmic
scale, with some corpora added to the plot according to the year of their release. The data and the code for the figures in this chapter are available on the GitHub repository https://github.com/gjenset.

Figure 3.1. Illustration of Moore's law with selected corpora plotted on a base 10 logarithmic scale. Corpora marked with an asterisk (*) are historical.
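For concreteness, the doubling prediction can be written as a simple exponential (our formulation, not the book's; P_0 denotes computing power at a reference year t_0):

$$P(t) = P_0 \cdot 2^{(t - t_0)/2}, \qquad \log_{10} P(t) = \log_{10} P_0 + \frac{t - t_0}{2}\,\log_{10} 2.$$

Since log10 P(t) is linear in t, exponential growth appears as a straight regression line on the logarithmic scale of Figure 3.1.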
Unsurprisingly, we see a cluster of corpora from the year 2000 onwards. It would be
grossly simplistic to claim that computing power alone powered this growth. Corpora
are created for a number of reasons, and typically require established research projects
(which again require a certain intellectual climate), long-term funding, an ecosystem
of tools and standards, and so on. However, keep in mind the observations from
Baayen (2003) about how the technological bottleneck of early computing provided an
incentive towards formal and non-corpus based approaches to linguistics. Clearly, at
the very least, we can hypothesize an interplay between intellectual development and
new technological possibilities (see also section 1.3.4). The historiographical problem
of deciding the exact causality of this development is obviously outside the scope of
this book. It is also secondary to what we consider far more important: the growth in
computing power, coupled with easier access and lower price, obviously removed an important bottleneck that was present in the 1950s and 1960s.

Figure 3.2. Sizes of some selected corpora plotted on a base 10 logarithmic scale, over time. Corpora marked with an asterisk (*) are historical.
It is instructive to consider the growth in corpus size, which has also followed an
exponential curve during the same period. Figure 3.2 illustrates this by means of a
bubble plot. The vertical axis shows the size of the plotted corpora on a logarithmic
scale, whereas the bubbles (each representing a corpus) are scaled to be proportional
to the corpus size.
As the plot shows, there has been a rapid growth in the potential for building large
corpora since the 1990s. A caveat is in order here, since the potential for building
large corpora does not prevent small corpora from being built. Take for example the
syntactically annotated historical corpora in the lower-right corner of the plot. These
corpora have remained small for a number of reasons unrelated to computing power:
when dealing with historical texts, there is only a finite set of data on which to base the corpus.
Furthermore, the annotation step requires manual coding, since machine-learning
algorithms for adding annotation to corpora cannot normally be used with good results on historical texts without some manually annotated historical data as training material. However, if we remove these historical, syntactically annotated corpora, and fit a log-linear regression model (see section 6.2 for an introduction to linear regression models) relating computing power (on a base 10 logarithmic scale) to corpus size (also on a base 10 logarithmic scale), we find a significant relationship between the two.2 According to this model, every 1 per cent increase in computing power corresponds to a 44 per cent increase in corpus size. The model is illustrated in Figure 3.3.

Figure 3.3. Log-linear regression model showing the relationship between the growth in computing power and the growth in corpus size for some selected corpora. Corpora marked with an asterisk (*) are historical.

2 F(1, 8) = …, p << …, adjusted R2 = …
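To make the modelling step concrete, the following minimal sketch (our own illustration, not the authors' code; the data points are invented placeholders, whereas the real data live in the GitHub repository cited above) fits a log-linear regression of corpus size on computing power. In such a log-log model the slope is an elasticity: the percentage change in corpus size associated with a 1 per cent increase in computing power.

```python
# Sketch of a log-linear (log-log) regression of corpus size on
# computing power. All data points are invented placeholders.
import numpy as np
import statsmodels.api as sm

computing_power = np.array([1e4, 1e6, 1e8, 1e10, 1e11])  # hypothetical
corpus_size = np.array([1e6, 1e7, 5e8, 1e10, 1e11])      # hypothetical

X = sm.add_constant(np.log10(computing_power))
fit = sm.OLS(np.log10(corpus_size), X).fit()

# The slope estimates an elasticity: a 1 per cent increase in computing
# power is associated with roughly (slope) per cent more corpus size.
print(fit.params, fit.rsquared_adj, fit.f_pvalue)
```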
As mentioned earlier, it is impossible to claim that the increases in computing power directly caused corpora to grow, since the creation of corpora depends on much more than computing power alone. However, cheaper and faster computers with more
storage capacity meant that some obstacles in corpus creation were removed, or at
least gradually became less important. As we have seen with some of the historical
syntactically annotated corpora mentioned earlier, corpora may remain small for a
number of reasons; however, the creation of part-of-speech-annotated diachronic corpora, such as CoHA (Davies, 2010) with its 400 million words, illustrates that attempts
to draw a sharp distinction between small historical corpora and large contemporary
ones must necessarily fail. Thus, electronic corpora that are too large to be read
manually in their entirety are firmly established in historical linguistics. The obvious
question then is: how ought historical linguists to respond to this development?
The existence of corpora creates a potential for use; it does not indicate that they are
being used, or used by more than a small group of die-hard corpus linguists. However,
there are signs that the interest in corpus use is growing. The Brigham Young
corpora created by Mark Davies, freely searchable through a web interface, report
170,000 unique users every month.3 If we look at the term ‘corpus linguistics’ in the
Brigham Young Google corpus itself (Davies, 2011), we find that the phrase has been
increasing in frequency since the 1980s.
Figure 3.4 shows the occurrences of ‘corpus linguistics’ scaled to reflect the number
of instances per 1,000 instances of the word ‘linguistics’. Figure 3.4 also shows the
use of the phrase ‘historical linguistics’, which is still more common than corpus
linguistics. Also shown are ‘mathematical linguistics’ and ‘quantitative linguistics’,
which we merged, due to their low numbers. The phrases show a different behaviour,
with a peak around the 1960s (chiefly due to ‘mathematical linguistics’) after which
they gradually decline. It is of course impossible to tell from such graphs how the
frequencies relate to the use of corpora and quantitative methods, or to the interest
in historical linguistics for that matter. Does a peak mean that many researchers use
such tools and methods, or are they merely being very vocal in their denunciations?
We have carried out a detailed quantitative study on this and described it in section
1.5. Of course, we cannot say that the changing relative frequencies in Figure 3.4
represent physical probabilities in the sense that they directly represent interest or
activity as described by these terms. However, we can draw some conclusions based
on the corpus data, within the scope of the Google corpus. First, there is obviously
increasing attention to corpus linguistics: it is being talked, or more accurately,
written about to a larger extent. Second, there is no clear correlation between corpus
linguistics and quantitative linguistics. If anything, the correlation appears to be negative. This raises the question of whether ‘quantitative linguistics’ is simply superfluous.
If corpus linguistics is taken to be inherently quantitative, this could certainly be
the case. However, corpora may also serve as sources of examples, and the use of
corpora may not automatically entail the use of quantitative argumentation. Finally,
the corpus we used does not tell us how these tendencies play out within historical linguistics.

3 http://corpus.byu.edu/faq.asp.

Figure 3.4. Relative frequencies of linguistics terms per 1,000 instances of the word linguistics in the twentieth century, taken from the BYU Google Corpus.
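The scaling used in Figure 3.4 is straightforward to reproduce. A minimal sketch (ours, not the authors' code, with invented counts):

```python
# Relative frequency of a phrase per 1,000 occurrences of the base word
# 'linguistics' in the same period. Counts below are invented.
def per_thousand(phrase_count: int, base_count: int) -> float:
    return 1000 * phrase_count / base_count

# Hypothetical decade: 150 hits for 'corpus linguistics' against
# 12,000 hits for 'linguistics' gives 12.5 per 1,000.
print(per_thousand(150, 12_000))
```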
In short, both intellectual and technological developments have facilitated a growth
in corpus size and availability. Availability is of course a prerequisite for use, but not a
determining factor. More importantly, corpus linguistics is to some extent a linguistic
subdiscipline in itself, with journals, books, and conferences. The increased attention
to corpus linguistics in general does not entail a similar focus within historical
linguistics, which is our main concern in this book. Consequently, we are interested
in investigating how the potential represented by increased availability of corpus
material is played out in practice within historical linguistics, and to what extent
corpus material is used in quantitative (or probabilistic) argumentation.
3.6 What's in a number anyway?
It is a truism that not everything that can be counted matters, and that not everything
that matters can be counted. However, this is a quip, not an argument. The fact that
not everything can be quantified does not impinge on the usefulness of quantifying
some things in some situations and for some purposes. Specifically, the question we
must face in historical linguistics is whether quantification is more useful than the
lack of quantification, i.e. a purely qualitative line of argumentation. A qualitative
argument is, strictly speaking, a binary argument. That is, the qualitative argument
may support the assertion that some phenomenon is either present or not. Any
argument that deals with degrees is by definition quantitative, although numbers
need not be explicitly mentioned in the argumentation. Modifiers such as much, little,
hardly, seldom, frequently, infrequently are clearly about degrees and bear witness to
an ordinal quantification. In the case of these modifiers the quantification is veiled
by the language involved, but the underlying ordinal (and hence quantitative) nature
of the relationship or entity described is obvious. In short, we recognize as genuinely
qualitative only those arguments that deal with the presence or absence of some feature
or phenomenon (e.g. morpheme x occurs in language variety y), although this may
need some further specification in some cases, as discussed below. All other arguments
are quantitative in some way or another, including ordinal observations or arguments
expressed via ordinary language rather than numbers, and they are hence subject to
the kind of methodological scrutiny required by a quantitative study.
The definition of quantitative and qualitative studies outlined above might seem
excessively strict. However, our point is not that all quantitative arguments always
need to be expressed in numbers, merely that such arguments must be recognized
as fundamentally quantitative, as we stated in principle 8 (see section 2.2.8). The use
of ordinary language to express quantitative facts on an ordinal scale may be fully
justified, depending on the context. For instance, we might state that in present-day
English corpus texts the determiner the is much more frequent than the noun book.
This well-established fact is uncontroversial and, if part of an argument, would not
normally need to be directly backed up by probabilities (although we could provide
such probabilities to back up the claim if challenged). However, in a more controversial
argument, the probabilities ought to be made explicit because such a move makes
the argument more transparent and open to criticism, which again makes it a much
stronger argument provided that it prevails against the criticism. Carrier (2012) makes
the point that such explicit quantification of an argument forces the opposing side to
do the same and quantify their argument, lest they be left with a much weaker argument. Thus, we contend that any claim in historical linguistics that does not simply
involve a binary choice (present/absent, true/false), but somehow resorts to degrees, is
inherently quantitative. Furthermore, our position is that any quantitative claim that
is not completely uncontroversial ought to be explicitly quantified, if at all possible.
If this is not possible, then the person making the statement must accept
that such lack of quantification leaves him or her with an epistemologically weaker
argument. This follows from principle 2 (outlined in section 2.2.2) that all conclusions
must be based on publicly observable facts, and from principle 4 (section 2.2.4) that
some claims are stronger than others. If the claim supports a certain level of certainty,
then making it explicit by means of numbers makes the claim more accessible for
public scrutiny and criticism, i.e. stronger. Conversely, if a claim is left expressed on an
ordinal scale in everyday language, then the claim is less open to public scrutiny, since
what counts as very frequent in ordinary language may vary depending on the context
and personal beliefs. The latter are by definition not accessible to objective inspection
and hence carry no weight in an argument regarding the empirical facts of historical
linguistics.
We agree with the point made in Gries (2006b, 198) that the only data provided
by corpora are quantitative, and that the logical consequence of this is that corpus
data ought to be subject to quantitative analysis. That corpora can be employed
as a repository of examples is not a counterargument to Gries’s claim. Irrespective
of how they are used, corpora are full of quantitative data. Of course, we have no
objection to the practice of picking illustrative examples from corpora as means to
showcase the phenomenon under investigation. Quite the contrary, we believe that
such examples are better taken from corpora than from dictionaries or textbooks (or
made up), whenever this is possible, but only as illustrations of a phenomenon (or
a source of hypotheses; see Chapter 1), and not as the evidential basis for the research
itself. One swallow does not a summer make, and one corpus example (or a handful
of such examples) does not constitute data in any meaningful sense unless the aim
is to demonstrate that the construction or phenomenon under investigation occurs in
the corpus, or that a particular constellation of phenomena or features occur together.
We consider this point sufficiently important to repeat it: a qualitative, example-based approach to corpus linguistics allows the historical linguist to state that the
phenomenon being investigated appears in the corpus material, period. Of course,
there is always the risk that such an occurrence represents some kind of error, as we
stressed in section 2.4.3. However, even if we discount errors, the occurrence of the
feature or phenomenon represents a modest level of evidence.
Since language is varied and subject to a number of different types of influences,
it is important to know whether some feature or phenomenon is common or rare
(either in general or given some specific context). Qualitative evidence (examples)
cannot inform us about this rarity. Furthermore, qualitative evidence has nothing to say about non-occurrence, since a particular feature or phenomenon can be absent from a corpus for a number of very different reasons: sampling errors in the corpus
construction, sparsity in the written records, skewed representations of the extant
written records (typically towards male-dominated elite language characteristic of registers associated with writing), or combinations of all three. Frequency information,
on the other hand, can be employed to make estimations of the expected number
of occurrences, with which the observed number (e.g. zero observations) can be
compared. If expected and observed numbers converge, then an observation of zero
occurrences would be entirely undramatic. Conversely, if the observed and expected
frequencies are sufficiently divergent, a case could be made for some linguistic explanation (provided we have taken the other reasons for missing data mentioned above into
account). This is what Stefanowitsch (2005, 296) refers to as the ‘expected-frequency
epiphany’, which allows us to convert raw counts into linguistically meaningful scientific facts. A strictly qualitative approach is unable to make the observed–expected
distinction in a principled manner, leading inevitably to either imprecise or faulty
estimates of frequency information, or the abandonment of frequencies as a source
of information altogether. The resulting loss of information regarding the frequency-governed aspects of language such as word frequencies (Baayen, 2001; Köhler, 2012)
constrains the linguist to focus on the occurrence (or non-occurrence) of phenomena
and features. We consider this detrimental to the enterprise of historical linguistics.
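To make the observed-versus-expected comparison concrete, consider a minimal sketch (our illustration, with invented numbers, not a method taken from the studies cited): under a simple binomial model, a per-word rate estimated from a reference corpus yields an expected count for a sample, and the probability of the observed count, for instance zero, tells us whether an absence is undramatic or calls for explanation.

```python
# Observed-vs-expected check for a zero count under a simple binomial
# model. All numbers are invented for illustration.
from scipy.stats import binom

def prob_at_most(k: int, n: int, p: float) -> float:
    """Probability of observing at most k occurrences in n words,
    given a per-word rate p estimated from a reference corpus."""
    return binom.cdf(k, n, p)

# Rate of 1 per 10,000 words; a 50,000-word sample; zero hits observed.
# The expected count is 5, so observing 0 is quite surprising (~0.0067).
print(prob_at_most(0, 50_000, 1 / 10_000))
```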
Instead, subsequent chapters will show that quantitative information is an important
aspect of historical linguistic research, and furthermore that certain standards must
be adhered to, in order to fully exploit this source of information. The next section will
discuss the core arguments levied at quantitative methods and corpus methods, and
show that they do not apply to historical linguistics.
3.7 The case against numbers in historical linguistics
This section will counter the core arguments against the use of corpora and quantitative methods in historical linguistics. Not all arguments are specific to historical
linguistics, so the refutation of the arguments will also be more general in some cases.
Before dealing with the substantial arguments against quantitative and corpus
methods in general, we must dispense with the straw man of glottochronology.
Section 3.3 provided an overview of glottochronology, and discussed its role as the
whipping boy of quantitative research in historical linguistics since at least the 1970s.
In hindsight, it is safe to say that the initial enthusiasm for glottochronology as
the future of historical linguistics was premature. Glottochronology was a structure
built on shaky methodological and mathematical foundations, whose flaws served to
marginalize it in little over a decade after it was first proposed. However, seen from a
certain perspective, the case of glottochronology is a good example of normal scientific
principles operating as they ought to: novel proposals and experimentation, followed
by empirical testing and methodological criticism, leading (in this case) to rejection
of the suggested approach. Nevertheless, we need to keep in mind exactly what was
being rejected. It would be a logical fallacy to assume that the rejection of one specific
approach constitutes a wholesale refutation of the usefulness of quantitative methods
in historical linguistics. Glottochronology focused on lexical items, and it is undeniably the case that the success of the historical comparative method is founded on considering lexical items in light of phonology and morphology, as is evident from such
works as Campbell (2013, 232–3) and Pereltsvaig and Lewis (2015). Thus, if we compare
the success of the traditional method in its most successful domains (phonology
and morphology) with the misadventure of glottochronology applied to exclusively
lexical material (an area where the traditional comparative method also encounters
difficulties), we are hardly comparing like with like. That glottochronology was
unsuccessful, and perhaps started with poor odds due to its focus on lexical material, is
another matter, since the method needed to be tested in any case before being judged.
However, we are less concerned about glottochronology per se than about the fact
that criticism against glottochronology in particular runs the risk of deteriorating
into an unfounded, blanket criticism of all forms of quantitative methods in historical
linguistics. As mentioned above, this is a formal, logical fallacy (technically an illicit
minor) of the following form:
1. Glottochronology is a flawed research method.
2. Glottochronology is a quantitative research method.
3. Therefore, all quantitative research methods are invalid.
As should be evident from presenting the argument in this form, the syllogism is
invalid since glottochronology is only one among many possible quantitative research
methods in historical linguistics, and there is no logically necessary connection
between the failure of one such method and the viability of other methods.
However, we find evidence of such unfortunate tendencies of logically invalid
blanket-critique of quantitative methods in Campbell (2013). That specific, lexically
based quantitative methods in historical linguistics can be criticized without throwing
quantitative methods in general out with the bath water is thoroughly demonstrated
in Pereltsvaig and Lewis (2015). The authors explicitly say they take no issue with
quantitative methods from computational and corpus linguistics applied to historical
problems, despite delivering a sharp criticism of phylogenetic methods in historical
comparative linguistics. Thus, we wish to embrace a position in which criticism of
individual methods (be they qualitative or quantitative) is possible without wholesale
rejection of an entire mode of reasoning (see also the discussion of quantitative and
qualitative methods in section 3.6). It is perhaps no coincidence that the general
criticism of quantitative methods in historical linguistics resonates with criticism put
forth by previous critics against such methods in linguistics generally. Campbell’s
position can thus be seen as a special case of arguments against quantitative methods
in linguistics more generally.
While proponents of quantitative methods in linguistics may agree in many ways,
the arguments against such methods can be made from a more diverse range of
positions. However, some organizing principles can be detected, and some of the
arguments against quantitative methods will be discussed below, categorized as
follows:
(i) quantitative methods are potentially useful, but not very convenient or
practical;
(ii) quantitative methods are potentially useful but redundant, since the same
results can be achieved by other means;
(iii) quantitative methods are useful, but should be limited to certain types of
linguistic problems;
(iv) quantitative methods cannot as a matter of principle contribute to the goals of
linguistics.
The arguments have been ordered from the weakest in their epistemological impact (‘quantitative methods are not convenient’) to the strongest (‘quantitative methods
are unable to contribute to linguistics as a matter of principle’). Below we will reject all
these arguments and show that quantitative methods have an important role to play
in linguistics and in historical linguistics.
3.7.1 Argumentation from convenience
Bloomfield (1933, 37) argued that a detailed statistical study of language use would
be very informative, particularly for studies of language change. However, having
made this point, he immediately dismissed it as unnecessary. His argument was
that since language is a convention-bound activity, all that the linguist really needs
to do is to describe the norms that govern this convention-bound activity, i.e. the
grammatical rules. Bloomfield’s motivation comes across as at least partly pragmatic,
since he refers to the simplicity of the latter method compared with the former
(Bloomfield, 1933, 37). He took the same view for language change, noting that an
inventory counting every use of every linguistic form would be welcome. However,
after having noted that this is ‘beyond our powers’, he reassured the reader that such
a record is not necessary since the changes can anyway be deduced by comparing
linguistic (structural) systems diachronically (Bloomfield, 1933, 394–5). Given the
resources available to Bloomfield, his views were not entirely unreasonable, although,
as section 3.2 showed, some of his contemporaries were experimenting with statistical
approaches. His argument has nonetheless been superseded by the technological
development since 1933. The availability of large, easily accessible corpora, fast and
cheap computers, and advanced statistical software packages mean that what was a
major (perhaps an insurmountable) undertaking at the time when Bloomfield was
writing, has become not only achievable, but in some cases outright trivial. Of course,
Bloomfield the structuralist was not merely discounting the method of counting for
practical reasons, a point to which we return below. Nevertheless, it is undeniable that
some of his arguments come across as remarkably pragmatic and convenience-based.
More recently, Mair (2004) presents a modified version of this argument, and
argues that when ‘superficial’ statistical analyses (Mair discusses raw counts and
proportions) are inadequate, the linguist should turn to what he calls ‘philological
methods’ (i.e. looking at examples in context) as a next step (Mair, 2004, 134). The
obvious alternative, turning from a ‘superficial’ to a more advanced statistical analysis,
is surprisingly absent. Since Mair is clearly open to using quantitative evidence to
start with, the reluctance to recommend more advanced quantitative methods cannot
stem from some principled dismissal of their value, and we can only assume that he
does not consider them a practical or cost-effective alternative. Thus, Mair’s advice to
only consider qualitative methods when simple quantitative methods are insufficient
appears to be a variant of the argument from convenience.
The argument from convenience has little or no persuasive force at present,
although some exceptions need to be made for historical languages where so little data
is available that no quantitative investigation is meaningful. We would nevertheless
argue that in such cases the argument is not one of convenience, but rather of assessing
whether statistical methods can correctly and meaningfully—not conveniently—be
applied. The availability of historical corpora is increasing, and the tools to create
corpora are also becoming more sophisticated. Although computational tools to
create linguistic electronic resources for historical language varieties tend to lag
behind the tools for (large) contemporary languages, the situation is undoubtedly
improving (Piotrowski, 2012; McGillivray, 2013). In conclusion, there is ample reason
to dismiss the convenience argument as a relic from the past.
3.7.2 Argumentation from redundancy
An argument independent of technological infrastructure is the view that quantitative
methods in linguistics are potentially useful, but in practice redundant since they
inevitably end up discovering nothing more than what has already been established
by means of traditional methods. As with the previous argument, contemporary
advocates can be found, but the roots of the argument can be traced back in time.
Structuralism espoused a static view of synchronic language (or rather: langue,
as opposed to the usage captured by the term parole) as a natural object with a
set of rules to be discovered and described in qualitative terms, e.g. by means of
algebraic relationships and set-theoretical notions of class membership (Harris, 1993,
16–24; Köhler, 2012, 13). In such a system, the numerical distribution of an item
(phonemes, words, syntactic units, etc.) may be recognized as potentially informative,
as Bloomfield (1933) hinted at; however, it is equally clear that quantification was not
crucial. The task for the (synchronic) American structuralist in Bloomfield’s view
was to observe the ‘speech habits of a community without resorting to statistics’,
and to record and report the results as objectively and conscientiously as possible
(Bloomfield, 1933, 37–8). Chapter 22 of Bloomfield’s book, which deals with fluctuation
in language forms over time and thus (inevitably) with changing relative frequencies,
is remarkable in how it sidesteps the entire issue of quantification by arguing that a
number of correlates (or proxies) for statistical quantification can be used instead. The
message is again clear: numbers would be nice if we could get our hands on them, but
there is no need to worry about that since we are capable of observing and studying
all the relevant factors already without quantification: algebra (a qualitative branch of
mathematics), not probability theory, is the only mathematical tool needed, if any. The
conclusion that quantitative, or probabilistic, methods are redundant is continuously
present between the lines.
The same argument is advanced by Campbell, although when presented in its purest
form the author attributes the viewpoint to some unspecified and unnamed historical
linguists. The most forthright articulation is found in the following passage from
Campbell (2013, 471):
Some say that if a solution to a particular problem cannot be reach [sic] by tried-and-true
historical linguistic methods, then they cannot trust a proposed mathematical solution, but at
the same time they ask: if a solution is provided by standard linguistic methods, then what is
the need for the mathematical solution in the first place?
Note that the argument here has shifted subtly from the position occupied by
Bloomfield. For Campbell (or the unnamed historical linguists he attributes the
view to), the primary problem with quantitative methods (and thus cause for their
redundancy) is not their difficulty, or the supremacy of observations of a social
communicative structure or norm. Instead, the suggested problem is the quantitative
method’s reliability, or rather lack of such. The same redundancy is implied when
Campbell (2013, 485) again lets others speak for him: ‘To most traditional linguists, the
scholars who have invested in quantitative approaches to historical linguistic questions
have appeared to progress by gradually reinventing the wheel.’
Campbell’s slightly sheepish rhetorical shuffle of letting other linguists (unnamed
and uncited) speak for him whenever he is unleashing a full broadside against quantitative methods can be interpreted in at least two ways, neither of which excludes the
other. The most obvious interpretation is of course that he misrepresents quantitative
methods. He appears to briefly consider the possibility that his rendering is ‘not fair’
before retreating to his original position (Campbell, 2013, 486). The second obvious
reason is that Campbell recognizes that his arguments are not particularly strong
and consequently resorts to reporting what comes across as departmental lunchroom
hearsay in order to put quantitative methods in a less than flattering light.
The argument from redundancy can be rebutted by following a line of reasoning described by Gibson and Fedorenko (2013) in their argument for quantitative
methods in synchronic linguistics. Were the criticism against quantitative methods
presented above to be true, i.e. were quantitative methods in historical linguistics to be
redundant, we would expect the following: first, no study should exist in the published
historical linguistics literature which disproves conclusions in traditional, qualitative
studies by means of quantitative methods. Second, the avoidance of quantitative
methods should not have harmed or impeded the progress of linguistics in general,
or historical linguistics in particular. On the other hand, if we do find examples of
conclusions from qualitative studies being disproved by quantitative methods, or that
progress in linguistics and historical linguistics has been impeded through a lack of
quantitative methods, the only possible conclusion must be that quantitative methods
are not redundant.
As it happens, studies that employ quantitative methods to correct linguistic
conclusions based upon qualitative methods do exist. A prime example from synchronic linguistics is Bresnan et al. (2007) who studied the so-called English dative
alternation (i.e. the existence of parallel syntactic structures like ‘she gave him the
book’ and ‘she gave the book to him’). Previous studies relying on qualitative methods,
such as Levin (1993) and many others, had concluded that certain verbs could only
occur with one of the two constructions. The consensus was that the dative alternation
was too difficult to properly characterize in terms of a grammatical system (Bresnan
et al., 2007). However, using quantitative methods, Bresnan et al. (2007) demonstrated
that properties such as animacy and discourse accessibility, alongside formal and
semantic features, could predict the dative alternation. Furthermore, drawing on usage
data rather than intuitions, Bresnan et al. (2007) found well-formed examples of verbs
occurring with constructional variants that had been proclaimed ungrammatical by
previous qualitative studies.
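To illustrate the kind of modelling involved, here is a minimal sketch (ours; Bresnan et al. used a considerably richer specification, and the feature values below are invented) in which the choice of construction is predicted from properties such as animacy and givenness with a logistic regression:

```python
# Sketch: predicting the dative alternation (1 = double object,
# 0 = prepositional) from invented binary predictors for recipient
# animacy, recipient givenness, and theme pronominality.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1],
              [0, 1, 1], [1, 1, 1], [0, 0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
# Probability of each variant for an animate, given recipient and a
# non-pronominal theme.
print(clf.predict_proba([[1, 1, 0]]))
```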
Another synchronic example, similar in many ways, is the study by Grondelaers
et al. (2007) which deals with the use and non-use of the Dutch existential er (similar
to English existential there) in postverbal position. According to standard Dutch
grammars, no rules could be formulated for the presence or absence of this morpheme
in postverbal position (Grondelaers et al., 2007, 152). However, using corpus data and
quantitative methods, they found regional differences between Belgian and Netherlandic Dutch, but also systematic variation with respect to register (for Belgian Dutch),
as well as the topicality and concreteness of preverbal adjuncts. As with the study by
Bresnan et al. (2007), what appeared chaotic to the naked eye looking for deterministic
rules was in fact systematic variation governed by probabilistic rules.
Turning to historical linguistics, we can find similar examples. One such example is
the case of referential null subjects in Old English and Middle English. Rusten (2014)
and Rusten (2015) report on corpus-based quantitative studies of referential null subjects from Old English to Early Modern English, using data from syntactically
annotated corpora. Referential null subjects (or ‘pro-drop’) are an established typological parameter for language classification. This is also a feature which has gradually
been lost in Germanic languages. Historical studies of this phenomenon in historical
varieties of English have mainly relied on qualitative methods (Rusten, 2014, 250).
Despite few systematic studies beyond the Old English period, the phenomenon
has been proclaimed ‘quite frequent’ in later periods of English by previous studies
(Rusten, 2014, 250). However, Rusten’s detailed longitudinal study of a sample of texts
from Old, Middle, and Early Modern English revealed that referential null subjects in
those texts were extremely infrequent in all periods, which raises important questions
about the status of the pro-drop rule in earlier stages of English, a rule which previous
qualitative studies had assumed rested on much firmer empirical foundations.
Without leaving the topic of English historical syntax, we can consider another
case, namely the question of word-order change. Old English is typically considered
to have had some form of verb-second constraint, even if this constraint was less
rigid than in other Germanic languages (Allen, 1995, 32–6; Fischer, 1996). This verb-second constraint was lost somewhere in the transition to Middle English; however, the exact details remain disputed. One attempt at explaining this change is the degree-zero learner hypothesis (Lightfoot, 1989; 2006), which rests on the assumption of
an abrupt decline in verb-final word-order patterns in subordinate clauses. However,
as Heggelund (2015) demonstrates, the data from Old English and Middle English
corpora in fact disprove this hypothesis.
The two examples above are by no means trivial. They deal with fundamental
questions regarding the typological status and trajectory of change in earlier stages
of English, a well-researched language. In both cases, a lack of proper quantitative
investigation has hampered theoretical development in the respective areas of historical linguistic research, since hypotheses that ought to have been culled (or modified)
by empirical means have remained to clutter the picture. In both cases, the questions
are undoubtedly important (both related to the historical–typological status of earlier
stages of English) and cannot be brushed aside as peripheral or instances of merely
clarifying residual details.
The examples above clearly demonstrate that quantitative methods are not redundant, either in historical linguistics or in linguistics in general. Quite the contrary:
quantitative methods serve as necessary corrective steps for hypotheses formed based
on qualitative methods. When generalizations are made based on qualitative methods
alone, the results are at risk of missing important variation in the data, either due to
cognitive bias, small samples (which means the samples may contain too few types or
tokens to reveal the full variation), or a failure to detect complex sets of connections
between properties that are more readily disentangled by computers. Furthermore,
relying exclusively on qualitative methods is potentially harmful to linguistics, since
theorizing and development may be led astray by incorrect results, as Gibson and
Fedorenko (2013) and Sampson (2005) argue. Far from being redundant, quantitative
methods are a necessary part of the historical linguist’s toolkit, alongside qualitative
methods. Rather than replacing qualitative methods altogether, quantitative methods
should, in our opinion, replace qualitative ones for testing hypotheses that may or
may not have been arrived at via qualitative means. As we stressed in section 2.4.3,
we believe that quantitative methods should be a natural tool of choice for historical
linguistics. Which particular problems those tools are best used for is a question for
the next section.
3.7.3 Argumentation from limitation of scope
A superficially more relevant criticism of quantitative methods is that they are
potentially useful, but that their scope is limited to certain types of problems or
certain areas of research. Such views have been expressed by e.g. Sampson (2001)
and Talmy (2007). Sampson (2001, 181) argues that while syntactic phenomena such
as collocations and word order can benefit from quantitative studies, semantics (or
‘word meaning’) cannot. Talmy (2007, xiii), while implicitly admitting that other
methods can also be used to study meaning, argues that qualitative methods (specifically introspection) constitute the superior tool in this case. In historical linguistics,
Campbell (2013, 491) writes that quantitative methods ‘hold out promise’ of understanding the role played by frequency of usage for lexical change, while remaining
pessimistic about their application in other areas of historical linguistics.
All of these three positions can be considered instances of the limitation of scope
argument.
If we assume that quantitative methods have a limited scope in linguistics, there
must clearly be some areas in which their use is justified. Identifying collocations and
other examples involving frequency of use would seem like a good candidate for areas
of linguistic research where the application of quantitative methods is uncontroversial.
Conversely, the authors cited above could be interpreted as agreeing that semantics
is the most difficult problem to tackle for quantitative methods. Consequently, the
degree to which quantitative methods really can be applied to semantics must be
considered a fair test of the limitation of scope argument.
The limitation of scope argument is particularly pertinent in historical linguistics,
since in most—if not all—cases no native speakers are available to give qualitative
judgements, let alone perform any kind of introspective analysis. At this point we
will emphatically take issue with the assertion by Fischer (2007) (see §1.1.3) that the
linguist’s intuition, no matter how widely read she is in the language in question,
has any value or status as evidence whatsoever. The postulate that the intuitions of
linguists with something that putatively approaches native speaker competence in an
extant language can be admitted as evidence is not compatible with our definition of
evidence, which states that historical linguistic evidence must be open and accessible
to everyone (section 2.1.3). Of course, this view should in no way whatsoever be
taken to indicate that we dismiss the value of philological knowledge or primary
sources for extant languages. However, we maintain that the intuitions arising from
such knowledge are starting points or hypotheses, rather than facts with the ability
to carry a logically valid argument (we exclude arguments from authority here, for
obvious reasons). Thus, the putative benefits of native speaker judgements or native
speaker intuitions over quantitative evidence have no bearing on the limitation of
scope argument as far as historical linguistics is concerned.
The limitation of scope argument would seem to predict that quantitative methods
are either unsuccessful when applied to problems in semantics, or alternatively, that
their successes (e.g. in terms of practical applications) are less interesting than the
results arrived at without quantitative methods. However, the argument itself can be
formulated generally, and as in previous sections we will provide counterexamples
from both synchronic and diachronic linguistics.
Campbell (2013, 222) writes that, traditionally, work in diachronic semantics has
mainly been concerned with lexical semantics, and the types of change that lexemes
undergo. However, even if lexical semantics has been the traditional focus in work of
semantic change, we must not lose sight of the fact that other branches of semantics
exist, such as grammatical semantics, formal semantics, and linguistic pragmatics
(Cruse, 2011, 17–18). We will consider studies within any of these branches of semantics
as being able to support our argument, which is that quantitative methods have a useful
role to play in semantics.
We must first rid ourselves of any notion that semantics is by definition outside
the scope of quantitative methods. Such an argument falters already at the impossible
effort of drawing a clear and consistent line between semantics on the one hand and
other areas of linguistic scholarship on the other. Defining semantics in terms of what
can and cannot be investigated quantitatively runs the risk of circularity, and smacks
of the ‘no true Scotsman’ logical fallacy—an attempt to redefine semantics in terms
that would disqualify any branch of semantics if it benefits from quantitative research. Thus, we are prepared to admit studies from any branch
of semantics as evidence against the argument from limitation of scope as it applies to
semantics.
If we look at the published literature, it quickly becomes apparent that an entire subfield, distributional semantics, is concerned with working out how semantics can be
studied by quantitative corpus methods. Since early seminal studies such as Schütze (1998), the field has expanded, and in the words of Baroni (2013, 511) ‘Distributional
Semantic Models, which automatically induce word meaning representations from
naturally occurring textual data, are a success story of computational linguistics’.
Distributional semantics is not only a case of practical applications, and Lenci (2008)
argues that it has a theoretical import as well, especially for usage-based and functional
theoretical fields of study.
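The core idea can be sketched in a few lines (our toy illustration with invented counts, not any of the cited systems): words are represented by vectors of co-occurrence counts, and semantic similarity is approximated by the similarity of those vectors.

```python
# Toy distributional semantics: co-occurrence vectors and cosine
# similarity. Counts are invented; context words are [drink, read, bark].
import numpy as np

vectors = {
    "tea":    np.array([12.0, 0.0, 0.0]),
    "coffee": np.array([10.0, 1.0, 0.0]),
    "novel":  np.array([0.0, 15.0, 1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors["tea"], vectors["coffee"]))  # high: similar use
print(cosine(vectors["tea"], vectors["novel"]))   # low: dissimilar use
```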
In synchronic linguistics, quantitative corpus methods have been applied to
semantic problems like the selectional preferences of verbs (the fact that some verbs
take objects whereas others do not) and the semantics of verb classes, as in e.g. Schulte
im Walde (2004 and 2007) and Lenci et al. (2008). These examples of grammatical
semantics are further complemented by attempts to integrate formal semantics with
quantitative methods and distributional semantics (Baroni and Zamparelli, 2010).
Thus, there is a rich literature describing the use of quantitative corpus methods
in semantics. Such efforts have proved particularly useful in grammatical semantics
when dealing with verb subcategorization or selectional preferences. We have already
mentioned Bresnan et al.’s (2007) study on the English dative alternation, which used
a quantitative, corpus-based approach to correct previous theorizing based on native
speakers’ intuitions or anecdotal evidence. However, the study is also relevant in
grammatical semantics. As the discussion in Manning (2003) makes clear, selectional
preferences and argument structures are better handled in probabilistic terms than as
discrete and categorical rules.
In the fields of historical linguistics and linguistic change, we also find examples of
quantitative methods employed fruitfully for semantic purposes. Barðdal et al. (2012)
use dimensionality reduction models based on occurrence and non-occurrence of
verb classes as an aid to reconstructing the semantics of the dative subject construction
in Indo-European. McGillivray (2013, 78–87) shows how the framework for using
quantitative methods to study verb subcategorization can be extended from contemporary languages to a non-extant language such as Latin. Meanwhile, Jenset (2013)
similarly used dimensionality reduction methods on corpus data to more precisely describe the semantics of existential there in early English, in a corpus-driven approach
to lexical meaning defined as patterns of co-occurrence (Cruse, 2011, 215–22).
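As a schematic illustration of this family of techniques (ours alone; the cited studies apply related methods to much richer data), dimensionality reduction projects an occurrence matrix into a low-dimensional space in which similar rows end up close together:

```python
# Sketch: dimensionality reduction over an invented binary matrix of
# constructions (rows) by verb classes (columns), reduced to two
# dimensions for inspecting similarity.
import numpy as np
from sklearn.decomposition import PCA

occurrence = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
])
coords = PCA(n_components=2).fit_transform(occurrence)
print(coords)  # two coordinates per construction, ready for plotting
```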
In summary, the limitation of scope argument, which states that some areas such
as semantics constitute a no-go area for quantitative methods in linguistics, can
only be sustained if semantics is narrowed down to exclude quantitative methods by
definition. As we stressed above, this would lead to a circular argument (‘quantitative
methods are not applicable to semantics because we define semantics as that which
cannot be studied quantitatively’). We believe that the erection of such barriers
between semantics and other areas of linguistic research would require much stronger
arguments than those we have argued against here.
In Chapter 1 we presented the defining principles of our approach, and stressed that
only data that are open to public scrutiny can be admitted as evidence in historical
linguistics. The corollary of this principle is that empirical evidence trumps intuition.
It does not follow that the empirical evidence in question needs to be quantitative,
but nor does the principle preclude quantitative evidence. As we have argued, any
non-circular (i.e. non-trivial) definition of semantics must take into account subfields
where quantitative evidence has made a substantial difference. The conclusion is that
anyone wishing to defend the argument from limitation must retreat into a trivial,
straw-man-like position arguing that quantitative evidence is not applicable to a
particular semantic question or some highly specific subfield within semantics. In any
case, the argument from limitation in its strong form cannot be upheld. That leaves us
with arguments against quantitative evidence based on deeper principles.
3.7.4 Argumentation from principle
The strongest possible refutation of quantitative methods is based on the axiom
that probability and quantification are in principle uninformative when it comes to
language. Such positions can be arrived at via several distinct paths. However, we argue
that this position, while possible, is less desirable than the alternative of accepting
quantitative methods in historical linguistics.
The two positions we will consider are the following:
(i) linguistics (including historical linguistics) is inherently qualitative;
(ii) the primary object of linguistic study is competence, not performance.
The first argument, that linguistics and also historical linguistics are inherently
qualitative, has been around for some time. It can be found embedded in the quote
from Bloomfield (1933, 37–8) (see section 3.7.1), which states that the convention-bound
nature of language makes statistics redundant (see also section 3.7.2). That this was
a view shared by other structuralists is shown by the following quote from Joos,
cited in Manning (2003, 290): ‘[linguistics] does not even make any compromise
with continuity as statistics does . . . All continuities, all possibilities of infinitesimal
gradation, are shoved outside of linguistics in one direction or the other.’ As Manning
(2003, 290) points out, this view implies casting language as a set of discrete (or
categorical), qualitative rules. However, Manning (2003), along with previously cited
studies such as Bresnan et al. (2007) and Grondelaers et al. (2007), demonstrates the
problems of preserving such a set of rigidly categorical rules in the face of linguistic
data. Thus, in exchange for clear, algebraic, qualitative rules, we risk losing the details
of gradients and variability found in language. As we have attempted to illustrate
above, such a move would come at a dire loss of empirical descriptive power. If
language is probabilistic, as e.g. Manning (2003) suggests, then discrete rules will
necessarily be only an imperfect approximation. Even if one is sceptical of the idea
of language as inherently probabilistic, the difficulty involved in measuring language
(whether language use or grammatical competence) with perfect precision entails that
a probabilistic model is nevertheless the best choice, since it will be better able to deal
with imperfect measurements than a categorical model. In other words, the view that
languages consist of qualitative, categorical rules at a deeper level, does not contradict
modelling language probabilistically. In fact, as Zuidema and de Boer (2014) argue,
creating linguistic models by both qualitative means (rules) and probabilistic means
(quantitative techniques), offers greater insight into the underlying reality we try to
capture. This is what Zuidema and de Boer (2014) call ‘model parallelization’ (see section 1.2.2). One way of doing such model parallelization is building a statistical model
based on qualitative analyses, such as those found in treebanks, an example of what Zuidema
and de Boer (2014) call ‘model serialization’. Thus, the argument that language, and
linguistics, are inherently qualitative, is simply not an argument against quantitative
modelling, since many things that are inherently qualitative can be understood via
statistical models. This general point is well illustrated by an example from public
health research given in Gelman and Hill (2007, 86–101), who present a statistical
model of which wells are used by local communities in Bangladesh. Of course, the
use of a particular well is qualitative (specifically, binary: you use it or not). Gelman
and Hill show that this qualitative choice is conditioned by both the level of pollution
in the well and the distance to other non-polluted wells. Similarly, even if language
were to be proved inherently qualitative, such a finding would have no relevance for
the usefulness of understanding language by means of statistical models.
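To illustrate how a qualitative, binary outcome can be modelled probabilistically, here is a minimal sketch in Python with the statsmodels library. The data are simulated and the variable names are invented for illustration; this is not Gelman and Hill's actual data or model specification:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
arsenic = rng.gamma(2.0, 1.0, n)        # hypothetical pollution level
distance = rng.uniform(0, 300, n)       # hypothetical distance to a safe well
# simulate the qualitative, binary outcome: switch wells (1) or not (0)
eta = -0.5 + 0.6 * arsenic - 0.01 * distance
switch = rng.binomial(1, 1 / (1 + np.exp(-eta)))

X = sm.add_constant(np.column_stack([arsenic, distance]))
result = sm.Logit(switch, X).fit(disp=False)
print(result.params)    # how pollution and distance condition the binary choice

The point of the sketch is that the outcome itself remains strictly categorical, while the model estimating it is probabilistic.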
The second argument, that the real object of linguistic enquiry is not performance or
language use, but linguistic competence, is related to but still subtly different from the
first argument. It is similar in the sense that it potentially rejects the use of probabilities
derived from usage in linguistics. In particular, a concern here is whether or not
corpora are capable of providing negative evidence to establish the limits of linguistic
competence (Ringe and Eska, 2013, chapter 1). However, the argument is also subtly
different, since it implicitly acknowledges that, even if studying language use by means
of frequencies may not fall under the purview of linguistics, such an approach can
nevertheless be carried out to serve other purposes. This position has most famously
been articulated by Noam Chomsky in several publications, arguing that linguistics
should concern itself with an internal language, or a language capability, rather than
the external patterns of language use. Chomsky has compared the latter activity to
studying bees by videotaping bees flying4 or collecting butterflies.5 Manning (2003,
296) shares Chomsky’s concern for the importance of going beyond description in
order to identify explanations (as do we). However, as Manning (2003, 296) also
points out, any explanatory hypothesis that is ‘disconnected from verifiable linguistic
data’ also ought to give rise to some concern. Incidentally, Manning also criticizes
corpus linguistics for being overly concerned with ‘surface facts of language’, despite
(or perhaps because of) its empirical approach. Manning (2003) goes on to argue
for a model where syntax is seen as inherently probabilistic, an approach that shares
many points of contact with that outlined in Köhler (2012). We sympathize with these
probabilistic approaches, but do not consider them necessary to refute position (ii) as
far as historical linguistics is concerned for the purposes of the present volume.
A key contested point regarding linguistic competence as the primary object of
linguistic study is the issue of negative evidence (see Carnie, 2012, §3.2; Ringe and
Eska, 2013, chapter 1). However, the primary concern of statistical modelling is not
merely the observed corpus counts, but the difference between the observed data
and the counts we would expect under different circumstances; drawing conclusions
from raw counts alone is what Stefanowitsch (2005) calls the 'raw frequency fallacy'.
There are well-known techniques for estimating the number of unseen items in a corpus
(Gale and Sampson, 1995). Pereira (2000) uses statistical techniques and corpus data to
estimate the considerable difference in probability between a grammatical and an
ungrammatical sentence, the key point being that both sentences have an observed
probability of zero in the corpus. Thus, to the extent that argument against corpora
relies on the status of negative evidence, it should be clear that negative evidence is also
something that can be approached fruitfully from a corpus-based, quantitative angle.
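For illustration, a minimal sketch of the core Good–Turing estimate underlying such techniques (this is our simplification, not Gale and Sampson's full procedure): the probability mass of unseen types is estimated from the proportion of hapax legomena in the observed sample.

from collections import Counter

tokens = "arma virumque cano troiae qui primus ab oris arma qui".split()
freqs = Counter(tokens)
N = sum(freqs.values())                          # total number of tokens
N1 = sum(1 for c in freqs.values() if c == 1)    # types seen exactly once
print(f"estimated probability mass of unseen types: {N1 / N:.2f}")   # 0.60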
On a more practical level, it is also worth noting that corpus linguistics has a notable
tradition within generative linguistics itself. Gelderen (2014) argues that
generative linguistics traditionally has been sceptical of corpus data, due in part to
the framework’s distinction between surface and underlying structures, as well as the
4 See e.g. http://languagelog.ldc.upenn.edu/myl/PinkerChomskyMIT.html.
5 As cited in Manning (2003).
norm of basing theoretical statements on native speaker intuitions. However, according to Gelderen (2014), interest in phenomena such as word-order variation, argument
structure, and information structure has accelerated interest in and acceptance of
corpora within generative linguistics. Moreover, the research tradition of variationist
diachronic generative linguistics routinely relies on quantitative techniques, as evident
from e.g. Kroch (1989) and Pintzuk (2003). This particular tradition of research has
spurred the creation of syntactically annotated historical corpora, including Taylor
et al. (2003), Kroch and Taylor (2000), Kroch and Delfs (2004), and Wallenberg et al.
(2011b), to name a few.
Thus, the initial reluctance toward corpora in generative linguistics, broadly construed, noted by Gelderen (2014) might be waning. In fact, Lightfoot (2013, e28) states
that ‘[i]f we can identify meaningful properties of I-languages, then that imposes limits
on possible diachronic changes, and those limits, alongside the contingent environmental factors, explain why changes take the form they have.’ This quote appears to
give priority to I-language, i.e. native speaker competence; however, we do not see how
the phrase ‘contingent environmental factors’ can have any other interpretation than
a probabilistic one. And, in historical linguistics, probabilistic evidence is strongly connected
with corpus data, as specified in principle 10 (section 2.2.10). Further, for historical
language varieties, such probabilities can only be reliably obtained from corpora. It
follows from this that if the limits imposed by I-language are played out alongside
contingent factors, then measuring the contingent factors through corpus data quickly
becomes an empirical hypothesis test of the extent and manner in which I-language
and contingent factors compete in explaining diachronic change.
To sum up, neither point (i) nor point (ii) presents a strong case against corpora
and quantitative evidence. As we saw, even if language were to be conclusively proven
to consist of rigidly discrete, qualitative rules, this would not invalidate the usefulness
of probabilistic models of such rules. Similar applications of probabilistic models to
discrete, or qualitative, phenomena can be found in other scientific disciplines. This
implies that against such a broad scientific consensus, a much stronger argument
than the one presented in (i) would be required. Furthermore, the objection in (ii)
was found wanting in so far as the objection concerns the importance of studying
linguistic competence as a means to discover the limits of grammaticality. We gave
examples from the research literature showing that quantitative methods and corpora
can be employed for estimating the probability of grammatical and ungrammatical
sentences. It is also worth noting that, as Gale and Sampson (1995) attest, these are not
specific statistical procedures for linguistics, but general statistical techniques applied
to linguistic data. As with (i), this implies that a much stronger argument (involving
the statistical details of those techniques) than what is presented in (ii) would be
needed to refute corpora and statistical techniques on a principled basis. Finally, we
note that in practical terms, there does not seem to be a clear distinction between
corpus use and non-corpus use drawn along lines corresponding to generative
and non-generative linguists. Thus, even the arguments from principle do not
amount to a sound basis for rejecting corpora and quantitative methods in historical
linguistics.
3.7.5 The pseudoscience argument
In section 3.7 we argued that the arguments against using quantitative corpus methods
in historical linguistics do not stand up to scrutiny. But we must still deal with the
challenging question of how well linguistics lends itself to the techniques of statistics.
To put it bluntly, are linguists engaging in pseudoscience if they take a quantitative
approach? The exact argument we are challenging here might take different forms,
but its different strands can be paraphrased or summarized as follows: quantitative
historical linguistics is pseudoscience if all it involves is using p-values and other
trappings of experimental or quantitative sciences as a mere rhetorical device. The
reason we bring the discussion to the forefront is that there is some truth to it. If
statistical tests are applied erroneously, if their assumptions are not matched by the
data, or if their results are misinterpreted, then no amount of decimal precision can
salvage the results of such testing.
An informed and thoughtful critique of the application of null-hypothesis testing
in linguistics is presented in Kilgarriff (2005), who makes the legitimate point that
null-hypothesis testing (e.g. Pearson’s chi-square test, Fisher’s exact test, and other
similar statistical procedures) are all based on a common underlying logic: summarize
the observed data into a test statistic, and compare that test statistic to an expected
value arising from a theoretical scenario where the data are showing no systematic
patterns of association. If the difference between the two values is large enough to be
labelled ‘significant’ (according to common conventions), we can safely assume that
our observed distribution of counts is unlikely to have arisen under the randomness
scenario.
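To make this shared logic concrete, here is a minimal sketch with invented counts, using Python's scipy library: the observed table is summarized into a test statistic and compared with the counts expected under independence.

from scipy.stats import chi2_contingency

# rows: two hypothetical subcorpora; columns: counts of constructions A and B
observed = [[120, 380],
            [ 90, 410]]
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
print(expected)   # the counts expected if corpus and construction were unrelated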
As Kilgarriff (2005) stresses, language is far from random, and given a sufficiently
large sample, any null-hypothesis test will return a verdict of ‘statistically significant’.
This latter property, false positive results arising from a large sample size, is in fact
well known to statisticians and has been discussed as far back as the 1930s (Berkson,
1938; Mosteller, 1968; Cohen, 1994).
Nevertheless, solutions to this problem can be found in sensible practices. Cohen
(1994), writing about psychology, advocates reporting effect sizes and confidence
intervals, rather than focusing narrowly on specific p-values. Similarly, Gries
(2005), in a response to Kilgarriff, shows that applying effect size measures to corpus
data to a large extent will solve the problem of spuriously positive results stemming
from this inflation effect.
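A sketch of this point with invented counts: scaling the same proportions up tenfold makes the p-value collapse, while an effect size measure (here Cramér's V, one common choice; not necessarily the measure Gries uses) stays constant.

import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Effect size for a contingency table, on a 0-1 scale."""
    table = np.asarray(table)
    chi2 = chi2_contingency(table, correction=False)[0]
    return np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))

small = np.array([[52, 48], [45, 55]])
large = small * 10      # same proportions, ten times the sample
for name, tab in (("small", small), ("large", large)):
    p = chi2_contingency(tab, correction=False)[1]
    print(f"{name}: p = {p:.4f}, Cramer's V = {cramers_v(tab):.3f}")
# p drops from about 0.32 to about 0.002, while the effect size stays at 0.070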
However, a seemingly more problematic requirement is that of random sampling.
As taught in most introductory statistics courses, random sampling (where every unit
of interest to the study has the same probability of being included in the sample)
allows us to generalize from a sample to a whole population (Hinton, 2004, 51–7).
In historical linguistics, the sample might correspond to a corpus built on extant texts
from a language variety, and the population might correspond to the linguistic system
(whether it is conceptualized as a social or psychological one). However, viewed
from a traditional sampling perspective, the composition of corpora is problematic,
as Evert (2006) argues in a thought-provoking article. Like Kilgarriff (2005), Evert
observes that language is not used at random. Quite the contrary: language is imbued
with structure, which makes the assumption of random sampling required by null-hypothesis tests such as Pearson's chi-square problematic (Hinton, 2004, 258), a fact
glossed over in some published presentations of the chi-square test aimed at linguists.
If the requirement of random sampling is unrealistic even for present-day language
varieties, where increasingly vast collections of texts are being built, the situation
would seem hopeless for historical linguistics, where we are left with whatever text
material survived by being passed down in time through copying and preservation,
often for specific reasons (such as author prestige or the topic of the text) rather than
random selection. In his discussion of statistical testing for corpus linguistics, Evert
(2006) proposes to adopt what he calls ‘the library metaphor’. Essentially, his point
is that, although the words occurring in a particular order in any part of a corpus
do not constitute a random sample, the decision to include a particular stretch of
text in a corpus can be viewed in analogy with randomly picking a book out of the
shelves of a vast library. As long as the ‘book’, i.e. the stretch of text, was not included
in the corpus because of the particular words it contains (or the order in which
they combine), the assumptions of random sampling are not too seriously violated in
Evert’s view.
Although we find the argument in Evert (2006) intriguing, it should be clear that
the metaphor works better for modern languages where very large corpora can be
built and where there is much written material to choose from, possibly from a wide
range of genres. As mentioned earlier, historical corpus linguists face a convenience
sample, i.e. a non-random sample that represents the data we could find. That is, the
data might not only be few in number, but also neither random nor arbitrary. Consider
for example the language represented in a hypothetical surviving canonical religious
text, as opposed to the language represented in a lost, heretical religious text. The
mechanisms ensuring the inclusion and exclusion of those two hypothetical language
samples in a historical corpus are anything but random. However, this does not spell
the end of statistical testing for quantitative corpus linguists.
There are a number of compelling reasons to reject the strict random sampling
requirement outright. First, the requirement is primarily motivated by the use of a
statistical null-hypothesis paradigm that we consider outdated in some respects. This
debate is not restricted to linguistics, as Cohen (1994) bears witness to. Although
the arguments from Gries (2005) and Evert (2006) to some extent counter the
objections against such a paradigm in linguistics raised by Kilgarriff (2005), it does
not follow that such testing is the best way to proceed. As statisticians know well, the
classical null-hypothesis tests were developed to find any relevant differences between
groups in small samples of experimental data. This is simply not descriptive of the
situation in corpus linguistics, since corpus data are not experimentally collected
data. Instead, a more detailed understanding of both the data and the effects of small
and skewed samples ought to guide our conclusions. By drawing on research on
word-frequency distributions it is possible to draw inferences about the amount of
missing, i.e. unseen, data. Such knowledge informs us about the degree of adequacy
(which is always a matter of degree) of a given corpus for a given study. Jenset
(2010, 167–72) does precisely that and concludes that the historical corpus at hand
is adequate for the study being performed. This might seem questionable, but in
fact the procedure rests on a principle that is entirely uncontroversial in historical
linguistics, namely the uniformitarian principle that languages in the past behaved
like today’s languages, from which it follows that this principle also applies to word-frequency distributions. If this is not persuasive, we offer the point argued by Gelman
(2012). Gelman, a professor of statistics and political science (and an active blogger
on matters of statistics), argues that in the absence of random sampling, it is still
possible to carry out meaningful statistical testing, since random sampling is neither
an end in itself nor an absolute prerequisite. Instead, random sampling is merely
one possible manner in which variation can be dealt with in the context of data
analysis. Better statistical procedures are available, and ought to replace the classical
null-hypothesis tests. Multilevel/mixed-effects models, as advocated by Harald Baayen
(e.g. Baayen, 2008; Baayen, 2014) and Stefan Gries (e.g. Gries, 2015) to mention two
prominent linguists, allow us to adjust for known biases in the corpus. This approach,
which we take inspiration from, involves the desirable move away from simplistic
null-hypothesis testing in the search for p-values towards a more meaningful (and
potentially much more linguistically informed) modelling approach where numerous
sources of variation are pitted against each other in a single statistical model, in order
to better capture the true complexities of language.
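As a hedged sketch of what such a model might look like in Python with the statsmodels library (Baayen and Gries present their examples in R; the data file, variable names, and grouping factor below are invented for illustration):

import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("corpus_observations.csv")   # hypothetical data set
# a linguistic outcome modelled with fixed effects for period and genre,
# and by-text random intercepts to adjust for known corpus biases
model = smf.mixedlm("outcome ~ period + genre", data, groups=data["text"])
print(model.fit().summary())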
3.8 Summary
In this chapter, we have argued that quantitative methods, corpus methods, and
quantitative corpus methods have a long history in historical linguistics. However,
both technological and non-technological factors have prevented them from taking a
more prominent role in the mainstream of historical linguistics work, even though
the aims of historical linguistics are perfectly in line with a quantitative approach
(see section 1.1).
As we suggested in section 1.1, adoption of new technology is not merely a question
of availability, and the results discussed in the present chapter seem to corroborate
this view. In the preceding sections we have argued that a whole range of arguments
against quantitative methods in historical linguistics do not hold water. The argument
that such methods are inconvenient has little merit. We also showed the view that such
methods simply reproduce results arrived at by qualitative means to be erroneous,
since quantitative methods have an important role in checking whether the empirical
facts square with arguments based on qualitative research. Furthermore, we showed
that there is no principled reason to limit the topics to which quantitative methods
in historical corpus linguistics (as defined in Chapter 1) can be applied, since such
methods are not inherently limited to a single research topic like syntax. We also
argued that defining linguistics as inherently non-quantitative creates a tautology,
and that historical linguists have a long tradition (irrespective of self-identification
with a specific linguistic paradigm) of making claims based on quantitative data.
We also demonstrated that arguments to the effect that quantitative methods do
not belong in historical linguistics because of deficiencies, gaps, or peculiarities in
the data or sampling do not hold. By dispensing with summary statistics or simple
null-hypothesis testing as the only basis of quantitative argumentation, and instead
adopting more sophisticated statistical modelling that better informs us about the
strengths and weaknesses of the analysis, historical linguistics can better benefit from
quantitative information.
In short, the preceding sections have argued that not only are quantitative techniques a valid part of the historical linguist’s toolbox; they are indispensable and can
co-exist with other tools, including symbolic models. We have shown that quantitative
corpus linguistics fills an important role in historical linguistics, and that such techniques are neither redundant nor limited to certain types of research questions. The
subsequent chapters deal in more detail with how historical linguistics can best profit
from the opportunities that corpora and quantitative evidence represent.
4
Historical corpus annotation

4.1 Content, structure, and context in historical texts
Empirical science needs evidence, and the primary type of evidence directly analysed
in historical linguistics consists of written sources: text collections, fragments, word
lists from language families, etc. (Joseph and Janda, 2003; Campbell, 2013).
In order to make full use of written sources for historical linguistic investigation, we
need to look at their content, as well as their structure and context. For example, if we
are interested in the evolution of a particular grammatical class (say, pronouns) in a
given language, we need to be able to identify all instances of the pronoun class in the
texts under consideration. This involves going beyond seeing the text as a continuous
flow of characters and being able to dissect it in order to separate pronouns from
the other words. Moreover, the location of pronoun occurrences in the text can be
an important feature to be considered: do they appear in the title, in the appendix,
or maybe in a footnote? To answer these questions, we must identify the internal
structure of the text, and make it explicit in order to use it as a factor in the linguistic
analysis. For instance, we need to be able to separate the instances of pronouns
occurring in the title from those occurring in the appendix, if that is needed to test
our hypothesis.
Further, the information about when, how, where, and by whom the text was written
is essential to place its language in the correct historical context and to model its
contribution to diachronic change. Who is its author? What is the title of the work? We
can answer similar questions by incorporating context information into the analysis.
If the text reports recorded utterances, for example, knowing the demographic characteristics of the speakers may also help to explain the interconnections between social
change and linguistic change. In addition, information on the contexts of use of the
text and on its relationships to other written sources can be equally critical. Was the
text published as a book? Did it have many readers? Where was it distributed? How
many printed copies were produced? All this information can measure the volume
and quality of the audience and the popularity of the text, thus contributing to a better
interpretation of its language and shedding light on its relevance to the change in the
general language.
Following a common practice in corpus linguistics and information science, we will
use the technical term metadata to refer to the description of written sources from
the point of view of their content, context, and structure. Metadata can be defined
as ‘the sum total of what one can say about any information object at any level of
aggregation’ (Gilliland, 2008, 2). As far as historical linguistics is concerned,
archival and manuscript metadata, as well as bibliographical metadata, are among the
most important types of contextual and structural metadata about texts, and include
information on author, title, publisher, year of publication, etc. Historical texts present
an additional challenge compared with most contemporary texts, because in some
cases this information may be lost for ever.
In this chapter we will discuss the relationship between data and metadata in historical linguistics, with particular attention to the way we can collect information about
the texts (when it is available) and the language represented in historical corpora,
and how this can be achieved through corpus annotation. The links from corpora to
external resources such as demographic databases will be covered in Chapter 5.
4.1.1 The value of annotation
According to principle 10 discussed in section 2.2.10, corpora are the prime source of
quantitative evidence and are therefore an essential element of quantitative historical
linguistics. However, linguistic analyses that can be performed on so-called ‘raw text
corpora’ are limited. To take just one example, if we want to investigate the registers
used in a play, we may want to extract the lists of words uttered by each character. In
order to do that, we need to exploit metadata about the names of the characters and
their utterances. The more information of this kind is encoded in a corpus, the more
advanced the analyses that can be performed on it.
Annotation is the process of marking the implicit information about a text in
an explicit and computationally retrievable and interpretable way (McEnery and
Wilson, 2001, 32). Leech (1997, 2) defines annotation as ‘the practice of adding
interpretative, linguistic information to an electronic corpus of spoken and/or written
language data’. In this chapter we will extend this definition to cover not only linguistic
information, but also information about the context of the texts and their structure.
Annotated corpora are very valuable because annotation adds structure, thus
making language data into information (Lenci et al., 2005, 64–5). Annotation enriches
corpora, and therefore makes them more powerful resources for linguistics research.
Linguistic information in an annotated corpus can be more easily retrieved, as the
range of searches that we can perform on a corpus largely depends on its annotation. For example, a lemmatized and morphologically annotated corpus contains
the morphological analysis of all its words in terms of their lemmas and their
morphological features. This allows the user not only to search for individual forms in
a text (e.g. all passages containing the form gave), but also for all occurrences of a given
lemma (e.g. all passages where any form of the verb give is present, or where stroke is
a singular noun). Principle 11 (section 2.2.11) stresses the importance of multivariate
approaches to studying language in general, and historical languages in particular.
The combination of linguistic data and metadata through corpus annotation makes
it possible to address this multidimensional perspective.
Furthermore, annotated corpora allow us to gather and analyse quantitative information about language data, which makes it possible to enrich symbolic modelling of
language with statistical modelling, as we discussed in sections 1.1 and 1.2.2.
Finally, from a research infrastructure perspective, annotated corpora can be used
over and over again, and in different contexts from those from which the annotation
originated. Thus, they can form the basis for further analyses, and in this sense they
constitute reference resources. In section 5.1.2 we will illustrate some examples of
language resources built on existing annotated corpora. Using corpora (and transparent research processes) means that we can conduct replicable analyses, which we
recommend in section 2.3 as one of the best practices.
4.1.2 Annotation and historical corpora
Nowadays, modern language corpora are often collected from the web for the purposes
of synchronic research. Examples of such web corpora are: the COW (COrpora
from the Web) collection1 containing large web corpora for Dutch, English, French,
German, Spanish, and Swedish with between 10 and 20 billion tokens; the UKWaC
British English web corpus (Ferraresi et al., 2008); the ItWaC Italian web corpus
(Baroni and Kilgarriff, 2006); the Brigham Young Corpora.2
Unlike synchronic corpora, historical corpora present unique challenges in a
number of different aspects. First, the long history of philological research on some
historical languages like Latin or Ancient Greek means that compilers and annotators
of historical corpora often need to consider a variety of different critical editions
and commentaries, and a large body of scholarly literature produced both on the
transmission of the texts and on their interpretation. The high philological and literary
interest in historical texts means that the creation of a corpus from such texts has
to take into account the interest of scholars in the texts themselves, in addition to
the language. As a consequence, in the initial phases of historical corpus collection,
we need to give special attention to the specific choice of source data. Sometimes
the corpus compilers choose the first known editions, as in the case of the Austrian
Baroque Corpus (Czeitschner et al., 2013); in other cases the corpus compilers have to
resort to the only texts that history has preserved, and those texts may be fragmentary
and may only represent a portion of an author’s work. Other times, pragmatic factors
related to copyright issues and text format play a decisive role, as in the case of the
PROIEL Treebank (Haug et al., 2009).
1 http://corporafromtheweb.org/.
2 http://corpus.byu.edu/.
The special status of historical texts has important consequences for their annotation. Linguistic annotation is always an interpretative process. However, annotation is particularly subject to debate in the case of historical languages, for
which no native speakers are available and for which the texts we have today are
often the result of a complex series of manuscript transmissions and text alterations.
For a model of annotation based on the annotator’s ownership, see Bamman and
Crane (2009), who present this model on the portion of the Ancient Greek Treebank containing the complete works by Aeschylus. The example of Bamman and
Crane (2009) presents a promising adaptation of the methods of corpus linguistics
to the special case of historical languages with a long philological tradition. This
is certainly a positive example, which researchers in historical corpus linguistics
have started to follow, and demonstrates that this discipline has progressed beyond a straightforward application of modern methods to historical texts towards a
more independent position. Computational philology has already developed original
approaches to the creation of digital editions of historical texts and their analysis,
which make best use of the potential of digital humanities. An example is the
Homer Multitext project,3 which aims at making texts, images, and tools available to
readers interested in studying the textual tradition of Iliad and the Odyssey in their
historical context.
In spite of the importance of these issues, most historical corpora lack a philologically satisfactory account of the texts. In section 2.1.1 we assumed that the historical
corpora we work with in quantitative historical linguistics rely on adequate editions.
We believe that in the future it will be beneficial to combine the wealth of scholarship
already generated by computational philologists and digital humanists with the experience of historical corpus linguists, thus letting corpus annotation extend its scope
beyond strictly linguistic information, in the direction suggested by Boschetti (2010).
Only such a collaborative model will allow historical corpus linguistics to reach a
deeper level of analysis and to fully account for the variety of complex phenomena
that characterize historical texts.
4.1.3 Ways to annotate a historical corpus
When the first corpora were created, the corpus compilers employed manual annotation extensively. This approach to annotation is supposed to guarantee optimal quality, as it completely relies on humans. However, numerous studies have highlighted
how manual annotation is prone to inconsistencies and errors that are very difficult
to detect, precisely because they tend to be unsystematic (see, for example, McCarthy,
2001, 2; 19–21).
3 http://www.homermultitext.org.
One way to aim at the best possible quality of annotation is to design clearly defined
annotation guidelines and rely on an independent team of annotators. The degree of
agreement between annotators is very important, and can be measured in various ways
(see, for example, Cohen 1960).
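For illustration, a minimal sketch of one such measure, Cohen's kappa (Cohen 1960), computed with the scikit-learn library on two hypothetical annotators' part-of-speech labels:

from sklearn.metrics import cohen_kappa_score

annotator_a = ["noun", "verb", "noun", "adj", "verb", "noun"]
annotator_b = ["noun", "verb", "adj", "adj", "verb", "noun"]
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
# 1.0 would be perfect agreement; 0.0 is the agreement expected by chance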
Given its higher costs in terms of time and human resources, manual annotation
is preferred when dealing with small-to-medium-sized corpora, or when automatic
annotation tools have not yet been developed. For these reasons, manual annotation
is very popular for historical corpora, and the large majority of them have been
annotated this way. One such example is the Oxford Corpus of Old Japanese.4
Once complete, this corpus will collect all extant texts in Japanese from the Old
Japanese period, and it is currently being annotated at the orthographic, phonological,
morphological, syntactic, semantic, and lexical levels, with additional annotation of
literary, biographical, historical, geographical, and other information.
Because manual annotation is expensive, time-consuming, prone to inconsistency
errors, and only really feasible on small corpora, an increasing number of projects
have recently started to explore the option of automatic annotation. Research in the
field of NLP is devoted precisely to developing better and better tools for analysing
language data in an automatic way. Automatic annotation programs can cover a variety
of levels: for example, lemmatizers are employed for lemmatizing corpora, part-of-speech taggers for part-of-speech annotation, and parsers for syntactic annotation.
NLP tools are able to annotate vast amounts of data at low costs and are therefore
particularly useful (and sometimes essential) when annotating large corpora.
An exciting emerging field in the NLP community concerns the development or
adaptation of NLP tools for historical language data.
Such data present special challenges to NLP, as we will illustrate in more detail
in section 4.3. Although manual annotation is not error-free, it relies on the long
tradition of close reading and on the assumption that it is always possible to check
the texts analysed. On the other hand, using automatic annotation means accepting
the fact that errors are an unavoidable part of the data that the researcher will analyse.
Some projects have chosen a compromise between the two approaches by combining automatic and manual annotation in so-called semi-automatic annotation.
Semi-automatic annotation consists in human correction of the output of automatic
analysers, and therefore combines the speed and consistency of automatic annotation
with the quality of manual annotation. Among others, the developers of the Index
Thomisticus Treebank (Passarotti, 2007b) and the PROIEL project (Haug and Jøndal,
2008) have used this approach to build their corpora.
An alternative way to combine the advantages of automatic and manual annotation
while at the same time covering large amounts of texts is via crowdsourcing. This
has also been made easier thanks to the availability of crowdsourcing platforms like
4 http://vsarpj.orinst.ox.ac.uk/corpus/.
Amazon Mechanical Turk (Sabou et al., 2014). The crowdsourcing approach is in line
with the increasing popularity of user-generated metadata, which involves scholarly
and non-scholarly content. In non-academic contexts, user-generated metadata are
employed for tagging web content such as photos or videos, but also texts. So-called
folksonomies are an example of metadata generated collaboratively and consist in
tagging objects based on an open or closed set of categories. Typically, users are asked
to associate an object with one or more terms that describe it in some way. Halpin et al.
(2007) have shown that users tend to agree on a shared set of tags, even when they are
not provided with a controlled vocabulary to choose from.
Crowdsourcing has increasingly gained popularity in the computational linguistics
community (Snow et al. 2008; Callison-Burch 2009; Munro et al. 2010). The Digital
Humanities community has also engaged with a range of crowdsourced projects, some
of which deal with historical language material. For example, Europeana 1914–1918⁵ is
a large project aimed at collecting historical material about the First World War from
libraries, public archives, and family collections. On a more linguistically related note,
the Papyrological Editor6 is an open collaborative editing environment for papyrological texts, as well as translations, commentaries, bibliography, and images. Another
good example is GitClassics, whose projects aim at involving a large community of
scholars and enthusiasts in a ‘collaborative effort to edit, translate, and publish new
Latin texts using GitHub’.7
Crowdsourced annotation has the advantage of allowing large-scale annotation at
a low cost, and is certainly a very promising idea. In the case of historical corpora, the
task of annotation has traditionally been assigned to small groups of highly qualified
persons. Pursuing a crowdsourcing approach in this context would require adapting
the general-purpose model to the case of a limited scholarly community. Shared
infrastructures and built-in mechanisms for checking the quality of the data are the
next challenge to face in order to make this approach sustainable and optimal for
historical corpora.
4.2 Annotation in practice
So far we have highlighted how corpus annotation allows us to incorporate various
types of information into the analysis of the texts for the purposes of historical
linguistics research. This section will deal with the question of how we can achieve
this in practice.
5 http://europeana1914-1918.eu/en.
6 http://papyri.info/.
7 Quote from the GitClassics website, http://gitclassics.github.io/. GitHub (https://github.com/) is a service for hosting web-based repositories, and is very popular among computer scientists as a platform for storing and sharing code as well as data sets.
Let us consider the following text, taken from the beginning of the Aeneid by Virgil:
(1)
Arma virumque cano, Troiae qui primus ab oris
Italiam, fato profugus, Laviniaque venit
litora, multum ille et terris iactatus et alto
vi superum saevae memorem Iunonis ob iram;
This is an example of raw text, a typical case of so-called ‘unstructured data’. We know
that Example (1) is part of a poetic text, and that it consists of four lines. In other words,
the text has an internal structure which is shown in the way it appears visually.
Humans can untangle the complex configuration of elements in a text in a relatively
easy way. For example, books usually display chapter titles in a particular font, which
is different from the rest of the text, and readers can easily detect chapters thanks to
such widespread conventions. On the other hand, if we cannot or do not want to read
the full text, but are interested in analysing certain patterns, we need to be able to
find those patterns reliably. For example, to identify the lines in a poem, a computer
program needs to know where the line boundaries are placed. In Example (1), the fact
that lines end with a line break gives an indication of their boundaries, but even this
needs to be explicitly encoded in order to be retrievable by a computer program. If
we want to retrieve this type of structural detail from the text, we need to represent it
explicitly. In other words, we need to add structure to the data in the form of metadata.
This can be achieved in various ways.
One way to add this type of metadata is to use a table format, where each row (or
record) represents a line, and the columns (or fields) contain the unique identifier and
the content of the line. A table like Table 4.1 can be included in a very simple database
where each row corresponds to a record (a line of text) and every column corresponds
to a field. The fields in Table 4.1 are ‘Line_identifier’, ‘Line_text’, and ‘Author’. For every
record, i.e. a line in the text, ‘Line_identifier’ provides a unique code. Table 4.1 is a
very simple example of structured data, where we can imagine that we have defined
Table . The first four lines of Virgil’s Aeneid in a tabular format,
where each row corresponds to a line
Line_identifier
Line_text
Author
1
Arma virumque cano, Troiae qui
primus ab oris
Italiam, fato profugus, Laviniaque
venit
litora, multum ille et terris iactatus
et alto
vi superum saevae memorem
Iunonis ob iram;
Publius Vergilius Maro
2
3
4
Publius Vergilius Maro
Publius Vergilius Maro
Publius Vergilius Maro
Table . Example of bibliographical information on a hypothetical
collection of texts in tabular format, to be considered for illustration
purposes only
Work_identifier
Author
Title
Genre
Location
IX001
IX003
IX002
Publius Vergilius Maro
Giacomo Leopardi
Honoré de Balzac
Aeneid
Canti
Eugénie
Grandet
epic
poem
novel
Ancient Rome
Italy
France
in advance the set of fields (identifier and text), as well as their data types (numeric,
string, date, etc.) and any constraints on their content. For example, we do not expect
the field ‘Line_identifier’ to contain anything other than a number between 1 and 9996,
which is the number of lines in the Aeneid.
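As an illustration, a minimal sketch of how such typed fields and constraints might be declared, using Python's built-in sqlite3 module (the schema follows Table 4.1; the upper bound is the line count mentioned above):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lines (
        Line_identifier INTEGER PRIMARY KEY
            CHECK (Line_identifier BETWEEN 1 AND 9996),
        Line_text TEXT NOT NULL,
        Author TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO lines VALUES (1, "
             "'Arma virumque cano, Troiae qui primus ab oris', "
             "'Publius Vergilius Maro')")
# an out-of-range identifier now raises sqlite3.IntegrityError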
In addition to structural information, we may want to collect bibliographical
information about the texts. For example, Table 4.2 contains a portion of a hypothetical
structured data set with bibliographical information, presented here for illustration
purposes only. Like Table 4.1, Table 4.2 contains structured data. The values for
‘Author’ range over the closed list of all possible authors of the books contained in
our imaginary collection, with the option of adding new ones to the list as we make
new acquisitions. Depending on the size of the collection, we can draw this list from
a potentially very large set. One way to keep this list manageable is to allow only one
variant for a given author name, or a limited number of variants; for example, if we
decide to use only the original names of authors, we would need to map the English
name ‘Virgil’ to its Latin equivalent ‘Publius Vergilius Maro’. Similar arguments hold
for the other fields, which we should keep within some closed boundary in order to
ensure consistency of the data, whenever that is possible.
If we wanted to combine structural and contextual information about our text, we
could link Table 4.1 and Table 4.2 using the fact that they share the field ‘Author’.
Instead of repeating the bibliographical information for every line of the text, having
two separate but linked tables is an efficient way of storing metadata in a so-called
relational database.
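To illustrate the linking just described, here is a self-contained sketch in which the metadata of Table 4.2 is stored once and recombined with the line-level records through the shared 'Author' field:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lines (Line_identifier INTEGER, Line_text TEXT, Author TEXT)")
conn.execute("CREATE TABLE works (Work_identifier TEXT, Author TEXT, Title TEXT, "
             "Genre TEXT, Location TEXT)")
conn.execute("INSERT INTO lines VALUES (1, "
             "'Arma virumque cano, Troiae qui primus ab oris', 'Publius Vergilius Maro')")
conn.execute("INSERT INTO works VALUES ('IX001', 'Publius Vergilius Maro', "
             "'Aeneid', 'epic', 'Ancient Rome')")
query = """SELECT l.Line_identifier, l.Line_text, w.Title, w.Genre
           FROM lines AS l JOIN works AS w ON l.Author = w.Author"""
for row in conn.execute(query):
    print(row)   # (1, 'Arma virumque cano, ...', 'Aeneid', 'epic')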
An alternative way to combine the data in Table 4.1 and Table 4.2
is via a markup language like XML (Extensible Markup Language). XML is also the
preferred format in corpus linguistics, and we will describe it in more detail in section
4.3.1. For now, we will just use two simple examples to show how XML is particularly
suited to expressing deep hierarchical structures in documents:
<body>
<text number="1">
This is an example text.
</text>
<text number="2">
This is another example text.
</text>
</body>
The text between the signs ‘<’ and ‘>’ is a tag; in the example above we see two
types of tags, body and text. The first opening tag is <body>, and its closing tag
is </body>. Within the scope of the <body> tag, we see two instances of the tag
<text>, which indicates that <text> is nested inside <body>. The expression
number="1" inside the tag <text> is an attribute: its value, given in double quotes, is
used in this case to indicate the number of the text embedded within the main body. Let us see an example of XML
for Virgil’s text discussed above.
<collection>
  <work Work_identifier="IX001" author="Publius Vergilius Maro"
        title="Aeneid" genre="epic" country="Ancient Rome">
    <book identifier="1">
      <line identifier="1">Arma virumque cano, Troiae qui primus ab oris</line>
      <line identifier="2">Italiam, fato profugus, Laviniaque venit</line>
      <line identifier="3">litora, multum ille et terris iactatus et alto</line>
      <line identifier="4">vi superum saevae memorem Iunonis ob iram;</line>
    </book>
  </work>
</collection>
After the opening tag <work>, the tag <book> has an attribute identifier with
value 1, which refers to the fact that this is the first book of the work. Then, every line of
the poem is enclosed between the opening tag <line> and the closing tag </line>.
This example shows how it is possible to annotate structural information in the text.
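To illustrate how such markup can be exploited, here is a minimal sketch using Python's standard xml.etree.ElementTree module (the document is abbreviated to two lines):

import xml.etree.ElementTree as ET

doc = """<collection>
  <work Work_identifier="IX001" title="Aeneid">
    <book identifier="1">
      <line identifier="1">Arma virumque cano, Troiae qui primus ab oris</line>
      <line identifier="2">Italiam, fato profugus, Laviniaque venit</line>
    </book>
  </work>
</collection>"""

root = ET.fromstring(doc)
for line in root.iter("line"):                 # find every <line> element
    print(line.get("identifier"), line.text)   # attribute value and line content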
In the next section we will focus on linguistic information.
4.3 Adding linguistic annotation to texts
So far, we have focused on how to represent structural and contextual metadata. Of
course, it is essential to represent the content of the text as well, as we will see in
this section.
Humans are very skilled at identifying implicit layers of linguistic information in
language. For example, native speakers of English can easily recognize that the word
book in the sentence They are going to book the flight tonight is a verb, and it is a noun
in She read that book in one day, although their ability to make the distinction explicit
may depend on the level of their grammatical training.
When the texts are analysed by a computer, we need to make such information
explicit in order to interpret the text and retrieve its elements. For instance, in Example
Table . Example of metadata and linguistic information encoded for the first
three word tokens of Virgil’s Aeneid
Work ID
Title
Token ID
Token
form
Lemma
Part of
speech
Case
Number
IX001
IX001
IX001
Aeneid
Aeneid
Aeneid
T00101
T00102
T00103
Arma
virum
que
arma
vir
que
noun
noun
conjunction
accusative
accusative
–
plural
singular
–
(1), discussed on page 104, arma is the accusative of the plural noun arma ‘weapons’;
que is an enclitic which means ‘and’ and is attached to the end of the word virum, which
is the accusative of the noun vir ‘man’. Because this type of morphological information
is at the level of individual words (more precisely, tokens in corpus linguistics terms),
rather than phrases or larger segments, one way to encode it is to define each row
as the minimal analytical unit, i.e. the token, and add new fields called ‘lemma’, ‘part
of speech’, ‘case’, and ‘number’, as in Table 4.3. Once we have the information for the
whole text, we can run searches on any combination of the fields; for instance, we can
retrieve all occurrences of the singular accusative of vir.
Alternatively, if we choose to use XML, we can embed every token in the XML
presented on pages 105–6 in a new tag <token>, and add the attributes tokenID,
lemma, part of speech, case, and number to it, as shown below.
<collection>
  <work Work_identifier="IX001" title="Aeneid">
    <book identifier="1">
      <line identifier="1">
        <token tokenID="T00101" lemma="arma" part-of-speech="noun"
               case="acc" number="plural">Arma</token>
        <token tokenID="T00102" lemma="vir" part-of-speech="noun"
               case="acc" number="singular">virum</token>
        <token tokenID="T00103" lemma="que" part-of-speech="conjunction"
               case="-" number="-">que</token>
        ...
      </line>
    </book>
  </work>
</collection>
We could also decide to encode other types of linguistic information, like the
English translation of every word, their syntactic relations, or their synonyms. In
any case, this added information contributes to making such elements searchable; for
example, we can retrieve all instances of the lemma vir in the text by simply limiting
the search to the lemma attribute of the tag <token>.
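For illustration, a minimal sketch of such an attribute-based search with Python's xml.etree.ElementTree, here reduced to a single line of annotated tokens:

import xml.etree.ElementTree as ET

doc = """<line identifier="1">
  <token tokenID="T00101" lemma="arma" part-of-speech="noun"
         case="acc" number="plural">Arma</token>
  <token tokenID="T00102" lemma="vir" part-of-speech="noun"
         case="acc" number="singular">virum</token>
</line>"""

root = ET.fromstring(doc)
for token in root.iter("token"):
    if (token.get("lemma") == "vir" and token.get("case") == "acc"
            and token.get("number") == "singular"):
        print(token.get("tokenID"), token.text)   # -> T00102 virum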
4.3.1 Annotation formats
There are different ways to include annotation in a corpus. In the so-called embedded
format, annotation is included in the original text and is displayed in the form of tags.
The example below indicates that reading is a participle form, as the tag
‘PARTICIPLE’ follows the form reading, separated from it by a forward slash:
reading/PARTICIPLE
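A minimal sketch of reading this embedded format in Python (splitting on the last forward slash, so that word forms containing slashes survive):

tagged = "She/PRONOUN was/VERB reading/PARTICIPLE"
pairs = [tuple(item.rsplit("/", 1)) for item in tagged.split()]
print(pairs)   # [('She', 'PRONOUN'), ('was', 'VERB'), ('reading', 'PARTICIPLE')]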
When the units being annotated span more than one token, we need some
way of grouping together their elements; this is sometimes achieved by bracketing or
nesting tags, as in phrase-structure syntactic annotation. The example below shows a
parse tree from the Early Modern English Treebank (Kroch et al., 2004).
( (IP-MAT (NP-SBJ (D The) (N Chancelor))
          (VBD saide)
          (CP-THT (C that)
                  (IP-SUB (PP (P after)
                              (NP (ADJ long) (N debating)))
                          (NP-SBJ (PRO they))
                          (VBD departyd)
                          (PP (P for)
                              (NP (D that) (N tyme)))
                          (, ,)
                          (IP-PPL (CONJ nedyr)
                                  (IP-PPL (VAG falling)
                                          (PP (P to)
                                              (NP (Q any) (N poynt))))
                                  (CONJP (CONJ nor)
                                         (ADJP (ADJ lyke)
                                               (IP-INF (TO to)
                                                       (VB com)
                                                       (PP (P to)
                                                           (NP (Q any)))))))))
  (. .)) (ID AMBASS-E1-P2,3.2,25.20))
The phrase-structure of the sentence is represented with embedded bracketing
corresponding to syntactic constituents, and the leaf nodes consist of tags followed by
word forms. ‘IP-MAT’ signals the whole sentence, ‘NP-SBJ’ the subject noun phrase,
consisting of a determiner node (‘D’) and a noun node (‘N’); ‘VBD’ is the past tense
verb and ‘CP-THT’ is a complementizer phrase introduced by the conjunction that,
while the last node contains the ID of the sentence.8
Another way to represent embedded structures in corpus annotation is by using
the XML format, introduced in section 4.3. An example of embedded dependency
annotation in XML format is given below and is taken from the Latin Dependency
Treebank (Bamman and Crane, 2006):
<sentence id="7" document_id="Perseus:text: 1999.02.0066"
subdoc="book=1:poem=1" span="ergo0:puellam0">
<primary> alexlessie </primary>
<primary>sneil01</primary>
<secondary>millermo</secondary>
<word id="1" form="ergo" lemma="ergo1" postag="d--------"
head="3" rel="AuxY" />
<word id="2" form="velocem" lemma="velox1" postag="a-s---fa-"
head="5" rel="ATR" />
<word id="3" form="potuit" lemma="possum1" postag="v3sria---"
head="0" rel="PRED" />
<word id="4" form="domuisse" lemma="domo1" postag="v--rna---"
head="3" rel="OBJ" />
<word id="5" form="puellam" lemma="puella1" postag="n-s---fa-"
head="4" rel="OBJ" />
</sentence>
The tags <sentence> and </sentence> indicate respectively the beginning
and end of the sentence being annotated; the attributes of the <sentence> tag
indicate various properties of the sentence: the sentence’s unique identifier (id), the
identifier of the text (document_id), the portion of the text containing the sentence
(subdoc), and the first and last words of the chunk of text (span). Inside the tag
sentence, we find the names of the primary and secondary annotators, followed by
the words making up the sentence. The tag word indicates every word of the sentence.
Inside the tag, the attribute id uniquely identifies the word in the corpus, form represents the word form, lemma its lemma, and postag contains a series of codes for
morphological features such as part-of-speech tag, gender, mood, case, and number.
In this example the type of syntactic annotation is relational, as opposed to the
structural type of the phrase-structure example from the Early Modern English Treebank.
The attribute head contains the ID of the dependency head of each word, while the
attribute rel indicates the syntactic dependency relation between the word and its head.
For example, the first word of the sentence above is ergo, and is a sentence adverbial
(‘AuxY’) depending on the third word of the sentence, i.e. potuit. Its lemma is ergo1
8 For a full description of the tags, see https://www.ling.upenn.edu/hist-corpora/annotation/index.html.
as it is the first (and only) homograph of the lemma ergo in the Lewis–Short Latin
dictionary (Lewis and Short, 1879).
In addition to linguistic information, as we noted in section 4.1, it is important
to record contextual information about a text; this is sometimes included as part
of the corpus annotation itself, as in the Helsinki Corpus. McEnery and Wilson
(2001, 39–40) list a document header from this corpus, where for example, the tag
<A BEAUMONT ELIZABETH> indicates an author’s name, and <X FEMALE> her
gender. Such metadata can then be used by corpus programs to restrict the search
criteria on texts’ attributes and their linguistic content.
So far, we have examined examples of embedded annotation. Standalone annotation
retains the annotation information in a separate document, which is linked to the
original text. The American National Corpus (Ide and Macleod, 2001) has followed
this approach (Gries and Berez, 2015). For example, each word of the sentence We
then read is assigned an identifier:
<w id="1">We</w>\\
<w id="2">then</w>\\
<w id="3">read</w>\\
<w id="4">.</w>\\
Each word is then associated with its part-of-speech tag in the standalone annotation
by means of identifiers:
<word id="1">PRONOUN</word>\\
<word id="2">ADVERB</word>\\
<word id="3">VERB</word>\\
<word id="4">PUNCTUATION</word>\\
Standalone annotation makes it possible to have multiple formats or levels of
annotation for the same text. Although standalone annotation is recommended by
the standard for corpus annotation (the Corpus Encoding Standard), most corpora
have embedded annotation; therefore, in the rest of this chapter we will refer to this
type of annotation.
4.3.2 Levels of linguistic annotation
Linguistic annotation is typically performed in an incremental way, by adding successive layers to the original text, starting from the most basic ones, such as lemma or
part of speech, and moving to the most advanced ones, with semantic and pragmatic information.
In this section we will cover these main levels of annotation, with particular attention
to the peculiarities of historical corpora.
The challenges of text pre-processing When building a historical corpus, researchers
usually acquire texts held in non-electronic formats. Optical character recognition
(OCR) and direct transcription are the most popular ways to convert the texts into
a digital format. Alternatively, manual transcription is an option when automatic
methods are not able to reach an acceptable level of accuracy.
Automatic and manual transcription are not mutually exclusive options, as the
results of an automatic process can be further refined by manual intervention. This
approach was the one chosen by the Impact Centre of Competence in Digitization,9
a collaborative network of libraries, industry partners, and researchers working towards the goal of digitizing historical printed texts from Europe’s cultural heritage
material. Concerning OCR, Impact has developed OCR software whose results are
further improved thanks to the involvement of volunteers through an interface for
crowdsourcing.
Historical texts also present challenges regarding their characters, which typically
span a much larger set than those of modern texts. In the past, the lack of a common
framework for encoding historical texts meant that customized processing tools were
created which could not be shared across different systems. Over the past decades,
the character encoding Unicode has gradually become the universal standard, and now
contains more than one million characters.
New characters often need to be added to the Unicode repository, especially to deal
with historical scripts, and this is achieved via the Script Encoding Initiative.10 As
Piotrowski (2012, 53–60) points out, the wide coverage of Unicode facilitates the
sharing of tools and texts across different projects. For an overview of the issues
concerning the digitization of historical texts and historical character encoding, see
Piotrowski (2012).
In the next sections, we focus on the levels of linguistic annotation that can be
performed on historical corpora, stressing their features and challenges.
Tokenization The first step in automatically processing the language in a corpus
usually consists of tokenization. Tokenization segments a running text into linguistic
units such as words, punctuation marks, or numbers. Once we have identified such
units (called tokens), we can perform further levels of annotation. The task of word
segmentation is more complex for East Asian languages like Chinese, Japanese,
Korean, and Thai, which do not use white spaces to separate words. This is also relevant
to those historical languages that were written in scriptio continua, such as classical
Greek and classical Latin, for which the word separation is sometimes disputed by
different philological interpretations.
Even in languages like English, Italian, or French, where white spaces are used to
separate tokens in many cases, we can find several exceptions. For example, the English
sequence I’m, the French l’oiseau ‘the bird’, and the Italian l’anguilla ‘the eel’ comprise
two tokens each; on the other hand, the English name New York should count as one
single token. Moreover, compounds may not contain spaces, as in the German compound
Computerlinguistik 'computational linguistics'. Another challenge in tokenization is
given by the different possible uses of hyphens, for example to split a word at the end
of a line for typesetting or to join elements of a single complex unit like forty-two.
What counts as a token, therefore, depends on the language, the context of use, and
further processing. For languages that use a Latin-based, Cyrillic-based, or Greek-based writing system, tokenization is often performed by a combination of rules
that rely on white spaces and punctuation marks as delimiters of token boundaries.
In addition to applying these general rules, we need to take into account language-specific exceptions drawn from lists of acronyms and abbreviations. For example, such
lists for English should contain Dr. and Mrs., because in these cases the dot should be
considered part of the token. One challenge with abbreviations is that the same string
may be a full word in certain contexts and an abbreviation in others, like in. for inches.
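The rule-plus-exception strategy just described can be sketched in a few lines of Python; the regular expression and the tiny abbreviation list are toy assumptions for illustration, not a production tokenizer:

import re

ABBREVIATIONS = {'Dr.', 'Mrs.', 'in.'}  # language-specific exception list (toy)

def tokenize(text):
    tokens = []
    # candidates: runs of word characters, optionally ending in a dot,
    # or any single punctuation mark
    for match in re.finditer(r'\w+\.?|[^\w\s]', text):
        token = match.group()
        if token.endswith('.') and token not in ABBREVIATIONS:
            tokens.extend([token[:-1], '.'])  # split the sentence-final dot off
        else:
            tokens.append(token)
    return tokens

print(tokenize('Dr. Smith lives in New York.'))
# ['Dr.', 'Smith', 'lives', 'in', 'New', 'York', '.']

Note that New York still comes out as two tokens, and that an exception list cannot by itself resolve cases like in., where only context tells the abbreviation apart from the full word.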
Sentence segmentation is another crucial task related to tokenization and can
present challenges for historical texts which do not employ punctuation marks consistently. For an overview of such challenges in Latin and Armenian, and respective
solutions adopted to build the PROIEL Project corpus, see Haug et al. (2009, 24–6).
Morphological annotation From a historical perspective, researchers have expended
much effort on written texts, and therefore the morphological, syntactic, semantic,
and pragmatic levels have received most of the attention, compared to other levels of
annotation such as phonetic/phonemic and prosodic annotation. In this section we
will describe morphological annotation in more detail.
Morphological annotation is the first layer of annotation that is normally added
to raw corpora. It usually involves spelling-variant processing, lemmatization, part-of-speech tagging, and annotation of other morphological features such as number,
gender, animacy, and case for nouns and adjectives, degree for adjectives, and mood,
voice, aspect, tense, and person for verbs. In this section we will examine the main
challenges posed by morphological annotation of historical texts, and how current
or past projects have tackled them.
Tackling spelling variation One major challenge of historical texts relates to the
amount of spelling variation they typically contain. Many historical corpora cover
large time spans, during which spelling standards were often lacking and spelling
conventions changed. Moreover, data capture errors and philological issues sometimes
make spelling uncertain. For these reasons, a single approach to spelling is often not
viable for historical texts.
The field of NLP has developed tools that generally assume consistent spelling
and consequently work well for modern languages, which normally have a much
smaller degree of spelling variation than their historical counterparts. When applying
NLP tools to historical texts, a common strategy is to normalize historical spellings to their
modern equivalents. Normalization can be acceptable for certain applications, such
as retrieving information from historical documents, where the user wants to find the
relevant content by searching for a limited number of terms. Normalization requires a
set of mappings between the historical variants and the modern ones (if available) or
rules that prescribe how to infer one from the other (for example -yng or -ynge endings
in place of the modern -ing ending of verbs in English). This approach was adopted
by the designers of VARD (http://ucrel.lancs.ac.uk/vard/about/), a spelling analysis tool for early modern English. When
such lexicons or rules are not available, we can adopt several different approaches to
identify the relationship between spelling variants. One such example is the so-called
edit distance, which measures the ‘distance’ between two strings by considering the
number of deletions, insertions, and replacements of characters needed to transform
one into the other. We can employ similar methods also when correcting OCR errors,
a common challenge of digitized historical documents. For an overview of this topic,
see Piotrowski (2012, 69–83).
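Both strategies mentioned above can be illustrated with a short, self-contained sketch: a rewrite rule for the -yng/-ynge endings, and a textbook dynamic-programming implementation of edit distance (the example forms are invented):

import re

# rule-based normalization: -yng or -ynge endings become modern -ing
print(re.sub(r'yng(e)?\b', 'ing', 'lovynge'))  # 'loving'

def edit_distance(a, b):
    # number of deletions, insertions, and replacements turning a into b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # replacement
        prev = cur
    return prev[-1]

print(edit_distance('louyng', 'loving'))  # 2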
As an example of the challenges of historical spelling for English, Archer et al.
(2003) present a historical adaptation of USAS, which is the Semantic Analysis
System developed by UCREL, the University Centre for Computer Corpus Research
on Language of Lancaster University. Because USAS was designed for present-day
English, when it was applied to early modern English texts, it failed to part-of-speechtag a number of items. The issues concerned spelling, because some historical variants
were not present in the lexicon used by USAS. A straightforward modification of
the lexicon that included historical variants would have led to incorrect results; for
example, one historical spelling of the verb be is bee, which is also a noun in present-day English. Therefore, the authors decided to keep the present-day lexicon separate
from the early modern lexicon, and to create the historical lexicon manually by
analysing the items that were not tagged by the part-of-speech tagger. The system then
assigned the correct tags to such items based on some rules; for example, it would
analyse bee as a form of the verb be if it was preceded by a modal verb.
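A toy version of such a contextual rule can be written directly; the modal list and the tag names below are illustrative assumptions, not the actual rules of Archer et al. (2003):

MODALS = {'can', 'could', 'may', 'might', 'must', 'shall', 'should',
          'will', 'would'}

def tag_bee(tokens):
    # treat 'bee' as the verb 'be' only when preceded by a modal verb
    tagged = []
    for i, token in enumerate(tokens):
        if token.lower() == 'bee':
            is_verb = i > 0 and tokens[i - 1].lower() in MODALS
            tagged.append((token, 'verb be' if is_verb else 'noun bee'))
        else:
            tagged.append((token, None))  # left to the ordinary tagger
    return tagged

print(tag_bee(['it', 'may', 'bee', 'so']))  # 'bee' analysed as the verb be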
Tagging by part-of-speech Part-of-speech tagging is a crucial step in annotating
corpora. As with other levels of annotation, automatic part-of-speech taggers exist
alongside manual systems; however, compared with part-of-speech taggers for modern languages, historical part-of-speech taggers present some specific challenges, as
Piotrowski (2012, 86–96) explains. Here we will summarize some of the main solutions
devised to perform part-of-speech tagging for historical languages.
Machine learning algorithms for part-of-speech taggers have become increasingly
popular in recent years. A typical so-called supervised machine learning system
for linguistic annotation relies on an annotated corpus used as a training set; the
model learns the patterns observed in the training set and subsequently uses these
patterns to annotate a new corpus. Following this approach, Passarotti and Dell’Orletta
(2010) trained a part-of-speech tagger on the morphologically annotated data from the
Latin Index Thomisticus Treebank (61,024 tokens), and automatically disambiguated
the lemmas of the Index Thomisticus.
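The supervised set-up described here can be reproduced in miniature with NLTK's simple n-gram taggers; the two 'training sentences' below merely stand in for a real annotated treebank:

from nltk import BigramTagger, UnigramTagger

# stand-in training data: sentences as lists of (form, tag) pairs
train_sents = [[('sumant', 'V'), ('exordia', 'N'), ('fasces', 'N')],
               [('exordia', 'N'), ('sumant', 'V')]]

# a bigram model that backs off to unigram statistics for unseen contexts
unigram = UnigramTagger(train_sents)
bigram = BigramTagger(train_sents, backoff=unigram)
print(bigram.tag(['fasces', 'sumant']))  # [('fasces', 'N'), ('sumant', 'V')]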
Scholars have adopted different solutions with the aim of improving the accuracy
of part-of-speech taggers for historical languages. Some, such as Rayson et al. (2007),
have used part-of-speech taggers for modern language varieties to analyse historical
varieties by modernizing their spelling. Another approach consists in using a part-of-speech tagger for the modern variety of the historical language being studied and
expanding its lexicon with historical forms, as Sanchez-Marco et al. (2011) did for Spanish.
An alternative method is to first use a modern-language tagger and then incrementally
correct it for historical data. This was the approach followed by Resch et al. (2014),
who used the modern-German version of TreeTagger (Schmid, 1995) to tag
the Austrian Baroque Corpus, a corpus of printed German-language texts dating from
the Baroque era (particularly from 1650 to 1750). Given the high number of incorrectly
tagged and lemmatized items, they manually corrected a portion of the output of the
tagger; they then retrained TreeTagger on the extended training set. This procedure
was sufficient to improve TreeTagger's performance significantly. Bamman
and Crane (2008) use a similar approach and report on experiments on part-of-speech
tagging of classical Latin with TreeTagger (Schmid, 1994), trained on a treebank for
classical Latin.
Lemmatization and morphological annotation Lemmatization associates every
word form with its lemma, together with its homograph number, where needed. We
can perform lemmatization both on inflected forms and on spelling variants; for
example, if we want to use a list of lemmas from British English, we can lemmatize the
American variant color as colour. Lemmatization is closely related to morphological
analysis and part-of-speech tagging. In fact, if we know the part of speech of a given
form in a given context, we can often assign the correct lemma to it. For example,
the Latin form rosa can be an inflected form of the noun rosa ‘rose’, but also the
feminine past participle of the verb rodo ‘gnaw’, and its correct lemma will depend
on the context. For this reason, lemmatization is often coupled with part-of-speech
tagging in corpus annotation.
Just like other levels of linguistic annotation, lemmatization can be performed
either manually or automatically, through tools called lemmatizers. Examples of
historical corpora which have been manually lemmatized are treebanks, which we will
introduce later in this section. While possible, attempts at automatic lemmatization of historical corpora have overall been rare. One method for automatic
lemmatization is based on a set of rules that prescribe how to analyse a given word
form depending on which category it falls in. Examples of rule-based systems are
LGeRM (Souvay and Pierrel, 2009), which identifies the dictionary entry of a given
form in Middle French, and the morphological model built by Borin and Forsberg
(2008) for Old Swedish. Along similar lines, several software systems are available
for performing automatic lemmatization and morphological analysis of Latin and
Ancient Greek. For example, CHLT-LEMLAT (Passarotti, 2007a; http://webilc.ilc.cnr.it/~ruffolo/lemlat/index.html) is a lemmatizer and
morphological analyser for Latin created at the Institute of Computational Linguistics
(ILC-CNR) in Pisa. Another morphological analyser for Latin and Ancient Greek is
Morpheus (Crane, 1991; http://www.perseus.tufts.edu/hopper/morph.jsp). Morpheus contains rules for generating inflected forms
automatically and allows users to search the digital library by word forms and lemmas. Kestemont et al. (2010) propose a machine-learning approach to lemmatization
of Middle Dutch.
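In spirit, such rule-based systems map suffixes to candidate lemmas and validate the candidates against a lexicon; the rules and word list in this sketch are invented for illustration and are far simpler than LGeRM or LEMLAT:

# toy suffix rules: (inflectional ending, replacement yielding the lemma)
SUFFIX_RULES = [('amus', 'o'), ('atis', 'o'), ('ae', 'a'), ('as', 'a')]

def lemmatize(form, lexicon):
    for suffix, replacement in SUFFIX_RULES:
        if form.endswith(suffix):
            candidate = form[:-len(suffix)] + replacement
            if candidate in lexicon:  # validate against a lemma list
                return candidate
    return form if form in lexicon else None

lexicon = {'rosa', 'amo'}
print(lemmatize('rosae', lexicon))   # 'rosa'
print(lemmatize('amamus', lexicon))  # 'amo'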
Syntactic annotation Syntactic annotation consists of assigning each element of
the sentences in a corpus to its syntactic role. Given the complexity of the task of
syntactic annotation, historical corpora with this type of annotation are quite small,
and attempts in the direction of automatic annotation have been rare. In this section
we will give a brief overview of the research in this area.
Manual syntactic annotation and treebanks Syntactically annotated corpora are
usually called treebanks because we can represent syntactically annotated sentences as
trees. For an overview of existing treebanks for modern and historical languages and
some methodological points, see Abeillé (2003). Here we will focus on methodological
issues specific to historical treebanks.
There are two main kinds of syntactic annotation: constituency annotation and dependency annotation. In constituency annotation, phrases are identified and marked
so that it is clear which phrase each element belongs to. Constituency annotation makes
use of bracketing to represent the syntactic embedding of constituents and is the style
followed by the early treebanks. We presented an example of this kind of annotation in
section 4.3.1. Examples of constituent-based historical treebanks are the Penn Corpora
of Historical English (Kroch and Taylor, 2000; Kroch and Delfs, 2004; Kroch and
Diertani, 2010).
On the other hand, dependency annotation is based on the theoretical assumptions
of Dependency Grammar (Tesnière, 1959), which represents the syntactic structure
of a sentence with the dependency relations between its words. In a Dependency
Grammar annotation, each lexical element corresponds to a node in the syntactic tree
of the sentence; in order to mark its syntactic role in the sentence, we assign each node
a label (such as 'predicate', 'object', or 'attribute') and link it to the node that governs it.
Figure 4.1 shows the phrase-structure tree and the dependency tree for Example (2):
(2) She ate the apple.
In the dependency tree of Figure 4.1 we can see the nodes corresponding to the words
of the sentence, and the edges representing the dependencies between the words
(‘Pred’ for predicate, ‘Sb’ for subject, ‘Obj’ for objects, and ‘Det’ for determiner). In
the constituent tree we can see the terminal nodes corresponding to the words of
the sentence, and non-terminal symbols corresponding to the constituents (e.g. noun
phrases 'NP' and verb phrases 'VP' in Figure 4.1) or to parts of speech (such as pronouns
'PR', verbs 'V', determiners 'DET', and nouns 'N' in Figure 4.1).

Figure 4.1 Phrase-structure tree (left) and dependency tree (right) for Example (2).
Dependency annotation has become increasingly popular among treebank creators.
One common model of annotation is that of the Prague Dependency Treebank
(Böhmová et al., 2003), developed under the Dependency Grammar theoretical
framework of Functional Generative Description (Sgall et al., 1986). This treebank
contains part of the Czech National Corpus annotated at three levels: morphological,
so-called ‘analytical’ (with dependency trees of all sentences), and semantic, so-called
‘tectogrammatical’. Dependency annotation is generally considered to be very suitable
for morphologically rich languages with free word order such as Czech and Latin.
Examples of historical treebanks that followed this framework are: the Ancient Greek
Dependency Treebank (Bamman and Crane, 2011), the PROIEL Treebank (Haug and
Jøndal, 2008), the Latin Dependency Treebank (Bamman and Crane, 2007), and the
Index Thomisticus Treebank (Passarotti, 2007b).
Let us consider an example. Figure 4.2 from McGillivray (2013, 45) shows the
dependency tree of the Latin sentence in Example (3), where movet and pervenit are coordinated
predicates, governing respectively the direct object castra, and the adverbial diebus and
the indirect object fines introduced by the preposition ad.
(3) Re frumentaria provisa castra
    provisions:abl.f.sg provide:ptcp.pf.abl.f.sg camp:acc.n.pl
    movet diebus -que circiter XV ad
    move:ind.prs.3sg day:abl.m.pl and about:adv fifteen to
    fines Belgarum pervenit
    border:acc.m.pl Belgian:gen.m.pl arrive:ind.pf.3sg
    'After providing his provisions, he moved his camp, and in about fifteen days
    reached the borders of the Belgae'
Figure 4.2 The dependency tree of Example (3) from the Latin Dependency Treebank.
As we have seen from Example (3), the high level of complexity of the annotation
in treebanks makes them very valuable resources for linguistic analyses, allowing for
complex searches involving syntactic functions. Treebanks can also help linguists test
their theories, as they can provide examples and counter-examples for illustrating
linguistic phenomena in qualitative research. As a matter of fact, empirical linguistic
analysis provided the prevalent motivation behind the creation of the early treebanks
(and corpora in general).
Treebanks can also constitute the basis for corpus-driven analyses as defined in
section 2.4. This latter use is the one that makes the most of the potential of treebanks,
because they offer the kind of systematic information and frequency data that is
needed in this type of linguistic analyses. Moreover, there is a significant educational
potential in the use of treebanks, as testified by the Visual Interactive Syntax Learning
project at the University of Southern Denmark (http://beta.visl.sdu.dk/), which contains syntactically annotated sentences and games for modern and historical languages (Latin and Ancient
Greek). Finally, treebanks have been recently used as gold-standard resources for
historical NLP, as they can be used to train automatic syntactic analysers or parsers,
as explained in the next section.
Automatic annotation of syntax: parsing Parsing consists in automatically annotating a corpus from a syntactic point of view. Parsing is a very important field of research
in NLP, and has a variety of practical and commercial applications, ranging from
machine translation to natural language understanding and lexicography. Parsing can
be achieved in two main ways: rule-based and statistical. Rule-based parsers exploit
some manually constructed rules to parse a sentence; on the other hand, statistical
parsers, based on machine-learning techniques, are trained on treebanks, from which
they learn patterns of linguistic regularities that can then be applied when analysing
new unannotated texts. As with any other automatic method, parsing involves a
number of errors, which we must take into consideration when using parsed data
directly. Depending on the end use of the annotated corpus, this margin of error may
constitute a problem, as traditionally historical linguists and philologists have aimed
at an almost perfect level of analysis and often require the same accuracy to carry
out further analyses based on annotated corpora. When the historical corpora are so
small that it is possible to manually check the annotation, semi-automatic annotation
is often the preferred solution.
As illustrated in Piotrowski (2012, 98–100), parsing experiments for historical
languages have highlighted interesting challenges and have often originated from
adaptations of parsers developed for modern languages. For example, comparing
classical Chinese and modern Chinese, Huang et al. (2002) report an accuracy of 82.3
per cent for a parser trained on a 1,000-word treebank. The challenges involved in
segmenting the text are less serious for classical Chinese, which has a higher number
of single-character words compared to modern Chinese; on the other hand, part-of-speech ambiguity is more extreme for classical Chinese and therefore makes part-of-speech tagging more difficult.
Given the historical importance of Latin in Western culture, it is not surprising that
significant efforts have been devoted to parsing this language. Koch (1993) describes a
first attempt at parsing Latin. McGillivray et al. (2009), Passarotti and Ruffolo (2009),
and Passarotti and Dell’Orletta (2010) report on more recent experiments on parsing
Latin corpora using machine learning. For example, following the approach
described earlier, which consists in adapting parsers developed for modern languages
to the case of historical languages, Passarotti and Dell’Orletta (2010) applied the DeSR
parser (Attardi, 2006) to medieval Latin, and designed some specific features for this
language.
English is the other language for which considerable research has been done on
parsing historical texts. Considering that modern English is the language with the
highest number of language-processing tools, it is not surprising that such tools have
also been tested on historical varieties of this language. One such tool is the Pro3Gres
parser (Schneider, 2008), a hybrid dependency parser for modern English. Pro3Gres
is based on a combination of handwritten rules and statistical disambiguation, and can
be adapted to historical language varieties. Schneider (2012) evaluated Pro3Gres on the
historical corpus ARCHER (A Representative Corpus of Historical English Registers,
Biber and Atkinson 1994), constructed by Douglas Biber and Edward Finegan in the
1990s and consisting of British and American English texts written between 1650 and
1999. A preprocessing step was performed before parsing and led to a normalization
of the text with the tool for spelling normalization VARD2 (Baron, 2009). Schneider’s
evaluation results range from 70 per cent on seventeenth-century texts to 80 per cent
on early-twentieth-century texts for the unadapted parser. If we compare these results
with the state-of-the-art parsers for modern English we can see that the difference
is not as great as one might expect. For example, Kolachina and Kolachina (2012)
evaluated a number of dependency and phrase-structure parsers for English and
found accuracy ranges between 70 per cent and 90 per cent (the authors first converted the parses of a constituency parser into dependency structures; then, they measured labelled attachment score (LAS), unlabelled attachment score (UAS), and label accuracy (LA)).
Semantic, pragmatic, and sociolinguistic annotation Semantic annotation often
builds on syntactic annotation and involves interpreting a variety of different linguistic
phenomena. These include indicating the semantic fields of a text like sport or
medicine, for example, but also tagging named entities such as names of people or
places, indicating whether an entity is animate or inanimate, whether it is an event
or an abstract entity, and so on.
Sense tagging is another important way to semantically annotate a corpus and
consists in associating every word with its correct sense in context, based on an
external ontology such as WordNet (Miller et al., 1990). WordNet is a lexical–semantic
database for the English lexicon. Lexical items are assigned to sets of synonyms
(synsets) representing lexical concepts, which are linked through semantic and lexical
relations like hyponymy, hyperonymy, and meronymy. An example of an English
synchronic semantically annotated corpus is SemCor (Fellbaum, 1998). Semantic
annotation of historical corpora also covers the automatic detection of named entities
such as people, organizations, locations, and time expressions, which are of particular
relevance to historical research (Toth, 2013). This section will focus on the semantic
annotation of historical corpora, and provide some examples.
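As a quick illustration of the sense inventory that sense tagging draws on, WordNet can be queried through NLTK (assuming the WordNet data have been downloaded); sense tagging then amounts to picking the contextually correct synset for each occurrence:

from nltk.corpus import wordnet as wn

# the candidate senses a sense tagger must choose between for 'bank'
for synset in wn.synsets('bank')[:3]:
    print(synset.name(), '-', synset.definition())

# synsets are linked by semantic and lexical relations such as hypernymy
print(wn.synset('bank.n.01').hypernyms())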
Semantically annotated historical corpora Annotating a historical corpus at the
semantic level is challenging for a variety of reasons, including the complexity of the
task, the high degree of linguistic interpretation required, the scarcity of annotation
standards, and the diachronic change of meaning. Some historical corpora have
successfully attempted this kind of annotation and have approached it from different
points of view.
The PROIEL corpus, introduced in section 4.3.2, contains a semantic annotation for
its Ancient Greek portion (Haug et al., 2009, 40–3), in addition to morphological and
syntactic annotation. The semantic annotation in PROIEL has the form of type-level
animacy tagging, and follows the framework developed by Zaenen et al. (2004).
Every Greek noun lemma is associated with one category taken from the following
set: HUMAN, ORG (for organizations), ANIMAL, VEH (for vehicles), CONC (for
concrete entities), PLACE, NONCONC (for non-concrete, inanimate entities), and
TIME. These tags provide a ‘flat’ annotation, because they are not organized in any
hierarchy. The treebank annotators tagged nouns in the corpus; then, thanks to anaphoric links, the tags were transferred from nouns to pronouns. Since the annotation
is generally done at the level of the lemma rather than at the token level, it represents
the animacy values of the majority of the corpus tokens, rather than a strictly context-specific identification of animacy. Moreover, this corpus-driven approach means that
every lemma is annotated based on the collection of its tokens and not on its general
meaning. Therefore the noun kardia ‘heart’, for example, is labelled as NONCONC
because none of its corpus occurrences refer to physical hearts.
Another type of semantic annotation of historical texts is that of Declerck et al.
(2011), who report on the semantic annotation of the Viennese Danse Macabre
Corpus, consisting of a digital collection of printed German texts from 1650 to 1750.
The aim of the annotation is to identify different conceptualizations of the theme of
death, and hence the annotation specifically concerns this domain, and uses a tagset
which conforms to the Text Encoding Initiative (TEI). Below we give an example of
the annotation, taken from Declerck et al. (2011):
<rs type="death" subtype="figure">Mors, Tod, Todt</rs>
<rs type="death" subtype="figureAlternative">General Haut und Bein,
Menschenfeind</rs>
This example shows two instances of the tag <rs>, which is used for general-purpose names or strings; in this case the two tags annotate two personifications
of violent death. This annotation allows for semantically informed searches on the
corpus; for example, we can retrieve the personifications of death as a figure.
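Such a semantically informed search is straightforward once the annotation is in place; the following sketch retrieves the death personifications from the example above with Python's ElementTree, adding only a wrapper element so that the fragment parses:

import xml.etree.ElementTree as ET

fragment = """<text>
<rs type="death" subtype="figure">Mors, Tod, Todt</rs>
<rs type="death" subtype="figureAlternative">General Haut und Bein,
Menschenfeind</rs>
</text>"""

# retrieve every <rs> element annotated with type 'death'
for rs in ET.fromstring(fragment).iter('rs'):
    if rs.get('type') == 'death':
        print(rs.get('subtype'), '->', ' '.join(rs.text.split()))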
A different approach to semantic annotation of historical corpora focuses on the
historical context of the texts. The Hansard Corpus, which contains 1.6 billion words
from 7.6 million speeches in the British Parliament from the period 1803–2005, is
semantically tagged, which allows for powerful meaning-based searches. Users can
create ‘virtual corpora’ by speaker, time period, house of parliament, and party in
power, and make comparisons across these corpora.
Semantic annotation can also be performed automatically with the support of
computational tools. For instance, Archer et al. (2003) present a tool for semantic
annotation of English historical corpora based on USAS (see section 4.3.2 for an
introduction to USAS), which was designed and initially implemented for present-day English. USAS assigns semantic labels based on a thesaurus consisting of over
45,000 words and almost 19,000 multi-word expressions. It works on a set of rules that
rank the most likely analysis of a word based on some context-specific disambiguation
rules and a frequency lexicon which records all semantic analyses of a word in order
of frequency. Archer et al. (2003) adapted USAS to make it possible to tag every word
of a historical corpus, thus allowing meaning-based searches on the texts. The analysis
referred to the Historical Thesaurus of English compiled at the University of Glasgow,
which contains almost 800,000 words from Old English to the present day, arranged
into fine-grained hierarchies primarily based on the second edition of the Oxford
English Dictionary and its supplements, and the Thesaurus of Old English. Given the
hierarchical structure of the thesaurus, the semantic analysis tool allows for concept-based searches on the texts.
Pragmatically and sociolinguistically annotated corpora At the beginning of the
history of corpus linguistics, the annotation of language-internal phenomena like
lemmatization, part of speech, or syntax, received a great amount of attention.
However, language use is best understood when analysed together with its context,
as a discursive and social practice. Sociolinguistic research is interested in such
contextual information, which covers social categories like gender and class, but
also the knowledge possessed by the participants of the communicative event and
situational aspects such as the relationships between the participants and the purpose
of their communication (Biber, 2001). Recording the macro-social components of
language, as well as the situational aspects of the individual communicative events,
is very important to explain the role of language in society, and corpus data constitute
crucially important evidence sources for this type of investigation.
Sociolinguistic research is the background to the Corpus of Early English Correspondence, a family of historical corpora compiled with the aim of testing sociolinguistic theories on historical data. In addition to morphological and syntactic
annotation, these corpora are linked to a database containing information about letter
writers, which allows the users to search sociolinguistic information about writers
and recipients like age, gender, and family roles, and thus study the relation between
language use and its context.
One way to capture pragmatic and social characteristics of language is through
the specific type of annotation employed in the Sociopragmatic Corpus (Archer and
Culpeper, 2003), a section of the Corpus of English Dialogues 1560–1760 (Kytö and
Walker, 2006) covering the years 1640–1760. This corpus contains more than 240,000
words from trial proceedings and drama, annotated with characteristics of the speakers and the addressees. Here is an example from Culpeper and Archer (2008):
<u speaker="s" spid="s4tfranc001" spsex="m" sprole1="v" spstatus="1"
spage="9" addressee="s" adid="s4franc003" adsex="m" adrole="w"
adstatus="4" adage="8">Look upon this Book; Is this the Book?</u>
The example shows that a male speaker (indicated by spsex="m"), identified by
the code s4franc001, acts here as a prosecutor, belongs to the social status ‘gentry’, and
is classified as an older adult. His addressee is a male witness, identified by the code
s4franc003, of social status commoner, and an adult. All this information is encoded
in terms of attributes of the tag <u>, which encloses a speaker's conversational turn
directed to a specific addressee, in an item of direct speech.
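Queries over such attributes are easy to express once the corpus is parsed; the sketch below counts utterances by speaker sex and status over the single utterance shown above (wrapped in a root element so that it parses), and is meant only as an illustration of the kind of sociolinguistic search this annotation supports:

import xml.etree.ElementTree as ET
from collections import Counter

dialogue = """<dialogue>
<u speaker="s" spid="s4tfranc001" spsex="m" sprole1="v" spstatus="1"
spage="9" addressee="s" adid="s4franc003" adsex="m" adrole="w"
adstatus="4" adage="8">Look upon this Book; Is this the Book?</u>
</dialogue>"""

# count utterances by the speaker's sex and social status
counts = Counter()
for u in ET.fromstring(dialogue).iter('u'):
    counts[(u.get('spsex'), u.get('spstatus'))] += 1
print(counts)  # Counter({('m', '1'): 1})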
Another way to perform pragmatic annotation is by marking discursive elements
in language data. This is the approach chosen by the PROIEL project and the tectogrammatical annotation of the Index Thomisticus Treebank, as we will see now.
The PROIEL Project (Pragmatic Resources in Old Indo-European Languages) has
developed a parallel corpus of the Greek text of the New Testament and its translations
into the old Indo-European languages Latin, Gothic, Armenian, and Old Church
Slavonic. Specifically, the Greek gospels have been annotated for information structure
and discourse structure, in addition to the morphological, syntactic, and semantic
annotations (Haug et al., 2009). This kind of annotation records information status
and anaphoric distance, covering givenness tags based on the context used by the
hearer to establish the reference, situational information, encyclopaedic knowledge,
and tags to express information new to the context, as well as anaphoric links between
discourse referents.
The annotation scheme chosen by the Index Thomisticus Treebank is the tectogrammatical annotation of the Prague Dependency Treebank (Passarotti, 2010, 2014),
which refers to the Functional Generative Description framework (Sgall et al., 1986).
This level of annotation builds on the so-called ‘analytical’ (i.e. syntactic) layer, where
every token is a node in the dependency tree. However, the tectogrammatical annotation resolves ellipsis by reconstructing elided nodes, and represents the dependency relations between the elements which have semantic meaning, thus excluding
nodes like conjunctions, prepositions, and auxiliaries. The dependency relations are
represented in terms of semantic roles thanks to so-called ‘functors’, such as actor.
The pragmatic content of the annotation involves anaphoric references, as well as the
information structure of sentences, distinguishing between topic and focus.
4.3.3 Annotation schemes and standards
We have seen the major levels of linguistic annotation and discussed their application
to historical corpora. In this section we will concentrate on some recommended
procedures to conduct a historical corpus annotation project, and stress the infrastructural implications of corpus annotation.
In order for an annotation to be consistent throughout the corpus, it is essential that
it follows some predefined parameters. An annotation scheme defines the architecture
of an annotation in terms of the tags that are allowed in it, and how they should be
used. Good annotation schemes should allow us to describe (rather than explain) the
phenomena observed in the corpus, and should be based on theory-neutral, widely
agreed principles, as far as this is possible.
An example of an annotation scheme is Bamman et al. (2008), where the authors
describe all the tags employed in the annotation of the Latin Dependency Treebank.
This is also an interesting example of a collaborative approach to defining annotation
guidelines, because these guidelines are shared with another Latin treebank, the Index
Thomisticus Treebank (Passarotti, 2007b). Moreover, both treebanks follow the overall
theoretical framework of the Prague Dependency Treebank (Böhmová et al., 2003). In
addition, another Latin treebank, the PROIEL project Latin treebank, is compatible
with both the Index Thomisticus Treebank and the Latin Dependency Treebank, since
automatic conversion processes are available from one format to the other, and this
increases the range of opportunities for linguistic analyses that span the data from
all three treebanks.
A similar example of a shared approach to annotation is given by the Penn Corpora
of Historical English, which include the Penn–Helsinki Parsed Corpus of Middle
English (Kroch and Taylor, 2000), the Penn–Helsinki Parsed Corpus of Early Modern English (Kroch and Delfs, 2004), and the Penn Parsed Corpus of Modern British
English (Kroch and Diertani, 2010). Following the same schema designed for the Penn
Corpora of Historical English, a whole constellation of corpora have been built over
the years: the York–Helsinki Parsed Corpus of Old English Poetry (Pintzuk and Plug,
2002), the York–Toronto–Helsinki Parsed Corpus of Old English Prose (Taylor et al.,
2003), the York–Helsinki Parsed Corpus of Early English Correspondence (Taylor
et al., 2006), the Tycho Brahe Corpus of Historical Portuguese (Galves and Britto,
2002), Corpus MCVF (parsed corpus), Modéliser le changement: les voies du français
(Martineau and Morin, 2010), and the Icelandic Parsed Historical Corpus (Wallenberg
et al., 2011a).
Covering a larger set of languages, Universal Dependencies (http://universaldependencies.org/) is a project aimed at
developing treebank annotation for many languages (including historical ones),

with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing
research from a language typology perspective. [. . .] The general philosophy is to provide a
universal inventory of categories and guidelines to facilitate consistent annotation of similar
constructions across languages, while allowing language-specific extensions when necessary. (http://universaldependencies.org/introduction.html)
Unfortunately, collaborations such as the ones mentioned above are not as frequent
as we would wish. In historical corpus research, as well as in corpus linguistics in
general, there are several schemes for corpus annotation, and no prevailing one. This
has to do with historical reasons, as especially the older projects often originated
within different theoretical frameworks to address specific needs and goals, and
therefore developed their own (often peculiar) approaches to annotation; see, for
example, the original annotation used for the Index Thomisticus (Busa, 1980). While
this is partially justified by the fact that different languages require different annotation
schemes, and that each level of annotation has its own features, it becomes increasingly
important to aim at a more harmonized state, especially given the growth in the
number of annotated historical corpora.
Although no annotation scheme should be considered as a standard a priori, since
the beginning of corpus linguistics it has gradually become clear that standards
commonly agreed through practice and consensus are necessary. Such standards make
corpora processable by a variety of software systems, thus facilitating the comparison,
sharing, and linking of annotated corpora, avoiding duplication of effort, while at the
same time enhancing the evidence basis for historical linguistic analyses. To this end,
TEI (http://www.tei-c.org) has published the Guidelines for Electronic Text Encoding and Interchange, which
document 'a markup language for representing the structural, display, and conceptual
features of texts'. (Unlike annotation, which typically adds linguistic information to the text, markup is usually concerned with marking information relative to the structure and context of the texts, such as author names or speakers in a drama, for example.) TEI has modules for different text types (drama, dictionaries, letters,
poems, and so on), and its annotation guidelines cover a range of palaeographic,
linguistic, and historical features. For an overview of TEI for historical texts, see
Piotrowski (2012, 60–7).
Here we will look at one example of a historical text annotated following TEI
conventions, the Bodleian First Folio (http://firstfolio.bodleian.ox.ac.uk/). The following is an excerpt from the beginning
of Shakespeare’s A Midsummer Night’s Dream.
<stage rend="italic center" type="entrance">Enter Theseus,
Hippolita, with others.</stage>
<cb n="1"/>
<sp who="#F-mnd-duk">
<speaker rend="italic center">Theseus.</speaker>
<l n="1"><c rend="decoratedCapital">N</c>Ow faire Hippolita, our nuptiall houre</l>
<l n="2">Drawes on apace: foure happy daies bring in</l>
<l n="3">Another Moon: but oh, me thinkes, how slow</l>
<l n="4">This old Moon wanes; She lingers my desires</l>
<l n="5">Like to a Step-dame, or a Dowager,</l>
<l n="6">Long withering out a yong mans reuennew.</l>
</sp>
. . .
The element ‘stage’ contains stage directions, ‘cb’ marks the beginning of a column
of text, ‘sp’ marks the speech text, ‘speaker’ gives the name of the speaker in the
dramatic text, and ‘l’ indicates the verse line. For a complete explanation of the tags
and attributes, see TEI Consortium (2014).
TEI is a very positive initiative which addresses the need for standardization in
the markup and annotation of texts in the humanities and social sciences; it is very
widespread in the field of digital humanities. The Medieval Nordic Text Archive aims
to preserve, disseminate and publish medieval texts in digital form, and to develop the
standards required for this. The archive includes texts in the Nordic languages and in
Latin (http://www.menota.org/), and its texts are encoded in TEI.
Generally speaking, TEI is not very widely used for historical corpora, where there
is a stronger emphasis on linguistic annotation rather than on palaeographic and
historical markup. Moreover, most programs for automatic annotation (the NLP tools
introduced in section 4.3) strip down all forms of markup contained in the texts,
as it is not relevant to the automatic processing they perform. However, in the case
of historical texts, the information contained in these tags can be crucial to the
interpretation of the text and should be considered by the language processing tools.
A related difficulty is the fact that historical texts typically contain a number of non-linear elements, such as alternative readings or corrected and erroneous text, which
are heavily dependent on the specific edition of the text. A challenge for the future
will certainly be to have the NLP community interact more with the TEI community
and make it possible to apply NLP to complex TEI documents while preserving their
tagging structure for further analysis.
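One possible way to bridge the two worlds, sketched below under our own simplifying assumptions, is standoff extraction: pull the plain text out of a TEI fragment for the NLP tools while recording, for each character span, the element it came from, so that the tools' output can later be mapped back onto the markup:

import xml.etree.ElementTree as ET

def extract_with_spans(elem, path=''):
    # return (spans, text): character offsets of each element's text content
    spans, text = [], ''
    here = path + '/' + elem.tag
    if elem.text:
        spans.append((len(text), len(text) + len(elem.text), here))
        text += elem.text
    for child in elem:
        child_spans, child_text = extract_with_spans(child, here)
        spans.extend((s + len(text), e + len(text), p)
                     for s, e, p in child_spans)
        text += child_text
        if child.tail:
            text += child.tail
    return spans, text

sp = ET.fromstring('<sp><speaker>Theseus.</speaker>'
                   '<l n="1">Now faire Hippolita</l></sp>')
spans, text = extract_with_spans(sp)
print(text)   # 'Theseus.Now faire Hippolita'
print(spans)  # [(0, 8, '/sp/speaker'), (8, 27, '/sp/l')]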
4.4 Case study: a large-scale Latin corpus
We have seen how annotation makes it possible for researchers to search historical
corpora for simple and complex linguistic entities. As the size of the corpora increases,
automatic annotation becomes more and more of a necessity. This is especially true
when we consider the increasing amount of texts that are being digitized as part
of digital humanities projects, and that constitute very valuable sources of data for
historical linguistics research. The case study illustrated in this section, an interesting
application of historical NLP tools to Latin, shows an example of a very fruitful
interchange between these disciplines.
LatinISE (McGillivray and Kilgarriff, 2013) is a Latin corpus containing 13
million word tokens, available through the corpus query tool Sketch Engine
(Kilgarriff et al., 2004). Similarly to corpora compiled for modern languages like
ukWac (Ferraresi et al., 2008), the texts making up LatinISE were collected from web
pages. However, the process of data extraction was controlled by selecting three
specific online digital libraries: LacusCurtius (http://penelope.uchicago.edu/Thayer/I/Roman/home.html), IntraText (http://www.intratext.com), and Musisque Deoque (http://www.mqdq.it).
These websites contain Latin texts covering a wide range of chronological eras,
from the archaic age to the beginning of the current century, all editorially curated,
which means that the quality of the raw material is superior to that of general web
resources. Another important observation concerns the metadata that the texts were
provided with. As we discussed in section 4.1, this is an essential property of historical
corpora, since it allows for further corpus-based studies that analyse the language in
its historical context. In the case of LatinISE, the metadata were inherited from the
original online libraries and include information on the names of authors, titles, books,
sections, paragraphs, and line boundaries for poetry.
After removing HTML tags and irrelevant content from the web pages, the corpus
compiler converted them into the verticalized format required by Sketch Engine,
where each line contains only one token or punctuation mark. In addition to being
provided with rich metadata, LatinISE is also lemmatized and part-of-speech-tagged.
The lemmatization relies on the morphological analyser of the PROIEL Project,
developed by Dag Haug’s team (http://www.hf.uio.no/ifikk/english/research/projects/PROIEL/), complemented with the analyser Quick Latin (http://www.quicklatin.com/). As
an example, consider the following phrase:
(4) sumant exordia fasces
    take:sbjv.prs.3pl beginning:acc.n.pl fasces:nom.m.pl
    'let the fasces open the year'
This sentence was automatically analysed as follows:
> sumant
sumo<verb><3><pl><present><subjunctive><active>
> exordia
exordium<noun><n><pl><acc>
exordium<noun><n><pl><nom>
exordium<noun><n><pl><voc>
> fasces
no result for fasces
For each word form, the morphological analyser generated all possible analyses,
which included an empty result for fasces. These multiple analyses needed to be
disambiguated so as to assign the most likely lemma and part of speech to each token
in context. This disambiguation was achieved with a machine-learning approach,
by relying on existing Latin treebanks: the Index Thomisticus Treebank, the Latin
Dependency Treebank, and the PROIEL Project’s Latin treebank. At the time of the
creation of LatinISE, these corpora contained a total of 242,000 lemmatized and
morphosyntactically annotated words; this set was used as the training set for
TreeTagger (Schmid, 1995), a statistical part-of-speech tagger developed by Helmut
Schmid at the University of Stuttgart. McGillivray and Kilgarriff (2013) describe how
TreeTagger was run on the analyses of the morphological analyser to obtain the most
likely part of speech and lemma for each word form in the corpus. In Example (4), the
corresponding corpus occurrences are:
sumant   V   sumo
exordia  N   exordium
fasces   N   fascis
Every line contains the word form, followed by the part-of-speech tag (‘N’ for ‘noun’
and ‘V’ for ‘verb’) and the lemma.
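The disambiguation step can be pictured with a deliberately simplified sketch: among the analyses proposed by the morphological analyser, keep the one whose part of speech agrees with the tagger's choice. The data structures below are invented for illustration and echo the rosa example discussed earlier; they are not the actual pipeline of McGillivray and Kilgarriff (2013):

# analyser output: for each form, the candidate (lemma, part-of-speech) pairs
analyses = {
    'rosa': [('rosa', 'N'), ('rodo', 'V')],  # noun 'rose' vs participle of 'gnaw'
    'sumant': [('sumo', 'V')],
}

def disambiguate(form, tagger_pos):
    # keep the analyser's candidates that agree with the tagger's choice
    candidates = [lemma for lemma, pos in analyses.get(form, ())
                  if pos == tagger_pos]
    return candidates[0] if candidates else None  # None: no analysis, as for fasces

print(disambiguate('rosa', 'N'))    # 'rosa'
print(disambiguate('sumant', 'V'))  # 'sumo'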
LatinISE is currently in its first version and an evaluation of the automatic lemmatization and part-of-speech tagging is the necessary next step to assess the usability of
the corpus, especially on the texts from those eras whose language differs significantly
from that of the training set. With its ongoing development, this corpus testifies to
the challenges of applying NLP tools to historical language data, and of dealing with
texts from very different time periods. At the same time, a large diachronic annotated
corpus is what is needed to conduct a study of language change. Of course, some
may discount the period when Latin was not spoken by native speakers; we believe
that this corpus is nevertheless a valuable resource for Latin (diachronic) studies.
Following principle 9 (section 2.2.9) and principle 10 (section 2.2.10), quantitative
evidence is the only type of evidence for detecting trends, and this evidence comes
primarily from corpora. A corpus like LatinISE, which was annotated automatically,
can be improved by successively refining the training set for the automatic annotation.
Hence, it is a resource that can serve the community both by being the empirical basis
for quantitative analyses and by being subject to further incremental developments
leading to better and better language resources.
4.5 Challenges of historical corpus annotation
So far we have stressed the merits of corpus annotation and have seen how annotated historical corpora can serve the scholarly community. However, some scholars
have criticized annotation, and in this section we will dedicate some space to their
arguments, and to more general considerations about annotated corpora in historical
linguistics research.
Sinclair (2004, 191) called corpus annotation ‘a perilous activity’, which negatively
affected the text’s ‘integrity’ and caused researchers to miss ‘anything the tags [are] not
sensitive to’. Hunston (2002, 93) evokes a similar danger, in which researchers may
tend to forget that the categories used to search the corpora partially shape their
research questions:
the categories used to annotate a corpus are typically determined before any corpus analysis is
carried out, which in turn tends to limit, not the kind of question that can be asked, but the kind
of question that usually is asked.
It is indeed true that if we choose an annotation scheme that is too firmly bound to
a specific linguistic paradigm, we risk only finding results supporting that paradigm.
Moreover, depending on the annotation available for a corpus, certain research questions may not be answered by that corpus. For example, imagine that our annotation
scheme contained a tag for ‘noun’ and our annotation guidelines specified when an
element was to be annotated as a noun; then, we would be constrained by these
choices when retrieving nouns from the corpus. If our research aim is to define the
characteristics of nouns, then our results would be heavily influenced by the corpus
guidelines. One partial solution to this is to be very precise in specifying the corpus
compilation principles and the assumptions made during the annotation phase, so
that any research results can be interpreted in light of this, and results from differently
annotated corpora can be compared.
In any case, the dependence of annotation on the schema is an unavoidable
consequence of the practice of annotation itself. We can think of annotation as a
pragmatic (in the common, non-linguistic sense of the word) solution to the problem
of representing linguistic categories and their properties. Annotation tags are convenience representations of theoretical entities and should not be confused with the
linguistic entities themselves.
Annotated corpora are examples of the symbolic modelling of language introduced
in section 1.2.2. They impose discrete categorizations on linguistic elements. This
symbolic representation is compatible with both categorical and non-categorical views
of language, precisely because it is a model and is not the linguistic reality directly.
In other words, a corpus annotation that contains the categories of noun and verb
can coexist with a view whereby such categories sit along a probability distribution.
Equally, such annotation is compatible with a view according to which words possess
part of speech as discrete classes.
Archer (2012) discusses some of the objections to corpus annotation and explores
the question of whether annotation can be seen simply as a useless exercise that does
not add anything to the data that is not already contained in them. In line with Archer's
(2012) view, we believe that corpus annotation is an essential step in the research
process, and that, in spite of its limits, it contributes to a transparent way to empirically
draw conclusions from language data. Historical corpus linguistics can certainly hope
to gain an independent status from corpus linguistics for modern languages by developing more and more sophisticated tools for annotating historical texts—following
past and current research directions—and by emphasizing the unique features of
historical texts.
Annotated corpora are not fixed and immutable objects, and the issue of maintenance is critical in corpus building. Corpora need continuous updates for a range
of reasons, as new linguistic theories emerge and as we discover new properties
of language, or simply as more people contribute to the annotation by various
means, including crowdsourcing. In the case of historical corpora, this is particularly
important. Due to the lack of native speakers and the philological complexities that
affect many historical texts, it is advisable to support a more flexible type of annotation,
which allows for multiple interpretations of the texts by different annotators.
This model of annotation is particularly appreciated by classicists and philologists,
who are interested in displaying the different variants of the original text as a consequence of the transmission of the text over time. Along these lines, Bamman and
Crane (2009) propose a model of annotation that takes into account the scholarly
tradition developed on the texts and gives the annotators scholarly credit for their
work. Bamman and Crane (2009) applied this model to the portion of the Ancient
Greek Dependency Treebank containing Aeschylus’ plays. This case displays an example of a highly debated text, both in terms of its philological transmission and
its syntactic interpretation (which are linked, of course). In this respect, this model
of scholarly annotation corresponds to the traditional practice of compiling critical
editions and will, it is hoped, encourage philologists to engage with it alongside corpus
and computational linguists.
Another way in which corpora should be updated is in conjunction with the
research process itself. Historical corpora are often used to study particular linguistic
phenomena. Once the researcher has extracted the patterns of interest from the
corpus, he or she may carry out further analyses. For example, in the case study on
early modern English third-person verbal endings described in section 7.3, we collected
all instances of third-person ending of verbs from the Penn–Helsinki Corpus of Early
Modern English. Then, we added lemma information on each verbal form, as we
wanted to measure the effect of the lemma frequency on the type of morphological
ending realized. The lemma information was not available in the original corpus, so
this work enriched the corpus material, which we made available for reuse by other
scholars.
One way to maintain data sets like the one we built for that case study, which
are important outputs of the research process, is to make it possible to incorporate
additional annotation into the user’s personal working copy of the corpora, as allowed
by the Penn–Helsinki Early English Treebanks. Additionally or alternatively, the
analysis can be made publicly accessible by publishing it in a repository, as we chose
to do. This way, other researchers can make use of this work in combination with
the original corpus data, provided that some linking mechanism is in place. In this
specific case, we published the list of verb form types and their associated lemmas,
thus effectively providing a linking facility, which is in line with the requirement of
reproducibility highlighted in section 2.3.1. All these approaches point towards a view
of annotated corpora as the results of collaborative efforts based on which research
can make progress in an incremental way. We believe that, thanks to this collaborative
attitude, historical linguistics research can achieve access to larger data sets that allow
us to reach more ambitious goals, well beyond what is possible in the context of small-scale studies. In the next chapter we will expand on this further.
5 (Re)using resources for historical languages
5.1 Historical languages and language resources
In Chapter 4 we have seen that annotated corpora are essential to quantitative
historical linguistics research. Of course, they are not the only source we can rely on.
Indirect sources of language data like dictionaries and lexicons have been and still
are of great importance. Unlike corpora, where words are organized in their context
of occurrence, traditional language resources store general information about lexical
items out of context, and in some cases link this information back to their occurrences
in the texts (section 2.1.3).
In this chapter we will support a view according to which such links between
lexical entries and their occurrences in context (i.e. in corpora) should be made more
systematic and explicit; we will therefore argue that the gap between corpora and other
language resources can be closed thanks to a corpus-driven approach paired with a
quantitative practice, and show the benefits of this perspective for research in historical
linguistics. We will also turn our attention beyond language resources, towards the
wider landscape of historical and cultural heritage resources, and make a case for
synergies that can benefit research on historical languages. Finally, we will make a
case for building language resources in a way that makes them easy to maintain and
compatible with other resources, and reusing existing resources when that is possible,
thus increasing the level of transparency and replicability that are among the most
important elements of our methodology (sections 1.1 and 4.1.1).
5.1.1 Corpora and language resources
Traditional language resources like dictionaries are very useful in historical linguistics
research. However, even when they are based on corpora, if they are qualitative
in nature it is not possible to draw quantitative arguments from them, apart from
basic type frequencies extracted from the resource itself. Conversely, corpus-driven
language resources like computational lexicons offer more potential for integration
with corpora and therefore allow the researchers to include a quantitative dimension
to their analysis, as we will show in this section.
Let us start with an example of a psycholinguistic phenomenon that is relevant to
historical linguistics: local syntactic ambiguity. Consider Example (1), from a Latin
sentence from Ovid, Metamorphoses 1.736:
(1) et Stygias iubet hoc audire paludes
    and Stygian:acc.f.pl command:ind.prs.3sg this:acc.n.sg listen:inf.prs water:acc.f.pl
    ‘He commands the Stygian waters to listen to this.’
Example (1) contains an instance of the general pattern [V1 ARG V2], where V1 is the verb iubet, ARG is the pronoun hoc, and V2 is the verb audire. According to the valency properties of the two verbs, ARG could be an argument of both V1 and V2.
Example (1) is a case of local syntactic ambiguity, which is resolved once the
sentence is read out in full. This is in line with the online nature of oral language
comprehension, whereby the hearer perceives one word at a time and incrementally interprets the partial input, even before the sentence is complete (Schlesewsky
and Bornkessel, 2004; Van Gompel and Pickering, 2007, 289; Levy, 2008, 1129).
McGillivray and Vatri (2015) investigated this phenomenon in Latin and Ancient
Greek, taking the opportunity to apply some principles from psycholinguistics to
historical languages, for which experiments on native speakers are, of course, not
possible. Before it is read in full, Example (1) may be taken to mean ‘he commands
the Stygian waters this’, indicating an order given to the waters; however, after reading
audire, it becomes clear that this verb governs hoc and therefore the sentence unambiguously means ‘he commands the Stygian waters to listen to this’.
In order to classify Example (1) as ambiguous, we need to know that both iubeo
and audio can govern hoc. In other words, we need to answer the question: are iubeo
‘to command’ and audio ‘to listen’ transitive verbs? Traditional language resources
like dictionaries and lexicons can help to answer this question, as they contain vast
amounts of information about lexical items, including verbs’ transitivity. The Latin-to-English dictionary by Lewis and Short (1879)1 records that sense 1α of iubeo can
occur ‘with an object clause’, as in (2) from Terence’s Eunuchus 3, 2, 16, where istos
foras exire ‘that they come out’ is the object clause of the imperative iubete ‘order’:
(2) iubete istos foras exire
    order:imp.prs.2pl that:acc.m.pl out come-out:inf.prs
    ‘order them to come out’
On the other hand, the first sense of the entry for the verb audio in Lewis and
Short (1879) records aliquid ‘something’ (i.e. accusative direct object) as the first of
1 Accessed from the Perseus Project’s page http://www.perseus.tufts.edu.
the possible argument structure configurations for this verb. Therefore, from the
information contained in the dictionary, we know that the two verbs in Example (1)
are transitive, and thus we can hypothesize that the accusative hoc in Example (1) can
be the argument of either iubet (‘He commands this to the Stygian waters’)2 or audire
(‘He commands the Stygian waters to listen to this’).
The argument structure information contained in a dictionary can certainly help to
consider the different possible syntactic interpretations of a sentence like Example (1).
However, if we want to be able to identify all locally ambiguous sentences from a
corpus of texts without manually checking each instance, we need to combine the
corpus data with a machine-readable resource. Such a resource can be automatically
queried by a computer algorithm in order to detect those sentences where two verbs
occur with a noun phrase that is compatible with the valency properties of both verbs,
making it possible for both verbs to govern that phrase. This is the approach followed
by McGillivray and Vatri (2015), who relied on corpus-driven computational valency
lexicons for Latin and Ancient Greek verbs. In the next section we will cover the
difference between corpus-based and corpus-driven lexicons, and briefly illustrate the
valency lexicon in question.
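In outline, such an algorithm needs nothing more than a lexicon lookup and a compatibility check. The following sketch (ours, not the implementation used by McGillivray and Vatri, 2015) illustrates the idea in Python on a toy valency lexicon; the lexicon content and the case labels are hypothetical simplifications.

# A minimal sketch of local-ambiguity detection, assuming a toy
# machine-readable valency lexicon.
VALENCY = {
    "iubeo": {"acc"},   # iubeo can govern an accusative argument
    "audio": {"acc"},   # so can audio
}

def is_locally_ambiguous(v1, arg_case, v2):
    """Return True if an argument standing between v1 and v2 is
    case-compatible with the valency of both verbs."""
    return (arg_case in VALENCY.get(v1, set())
            and arg_case in VALENCY.get(v2, set()))

# Example (1): et Stygias iubet hoc audire paludes.
# The accusative pronoun hoc stands between iubet and audire.
print(is_locally_ambiguous("iubeo", "acc", "audio"))  # prints True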
5.1.2 Corpus-based and corpus-driven lexicons
As noted in McGillivray (2013, 32–6), traditional historical dictionaries are qualitative
resources. They are compiled based on large collections of examples usually taken
from the canon of texts of a historical language. In this sense they may be called
‘corpus-supported’ resources in a loose sense, if we broaden the term ‘corpus’ to
cover any collection of texts, independently of their format, and the selection criteria
and annotation features of modern corpus linguistics. In other words, the texts
constitute the evidence source on which the historical lexicographer relies to prepare
the summary contained in a dictionary entry. That this is the case is evident from the number of examples included to support most statements about grammatical and
lexical-semantic properties in a dictionary. However, the process leading from the
whole collection of texts to the selected examples that appear in a lexical entry is the
result of the subjective judgement of the dictionary’s compilers, and cannot always
be reliably reproduced. A similar argument holds for other historical dictionaries and
thesauri like the Oxford English Dictionary3 and the Historical Thesaurus of English.4
Such a qualitative approach makes the dictionaries supported by a complete corpus
good resources for answering qualitative questions such as ‘Is verb X found with a
dative object in historical language Y?’ (assuming that verb X is included among the
examples presented in the dictionary), but not quantitative questions like ‘Has the
2 This interpretation is only acceptable if we consider the online processing of the sentence up to the
word hoc (et Stygias iubet hoc).
3 http://www.oed.com.
4 http://historicalthesaurus.arts.gla.ac.uk.
proportion of animate senses of noun X over inanimate senses increased over time?’.
The reason for this is rooted in the original purpose of printed dictionaries, which
suited an era of information scarcity. They aimed to ‘provide information in a manner
which is accessible to the reader . . . The reader should . . . regard the Dictionary as a
convenient guide to the history and meaning of the words of the English language,
rather than as a comprehensive and exhaustive listing of every possible nuance’
(Jackson, 2002, 60).
With the potential offered today by digitized text collections and computational
tools, we can raise our ambitions to a more systematic account of the behaviour of
words in texts; then, this information can be queried by programs as well as humans,
as we will see in the next sections.
Historical valency lexicons In the field of computational linguistics there have been
several successful attempts at building lexical resources from corpora for modern
languages in a radically different way from traditional dictionaries. One example is the
Italian historical dictionary TLIO (Tesoro della Lingua Italiana delle Origini),5 which
is directly associated with a corpus of texts.
If we focus on valency lexicons, we find examples like PDT-Vallex (Hajič et al.,
2003), FrameNet (Baker et al., 1998), and PropBank (Kingsbury and Palmer, 2002),
to name just a few. All these lexicons have in common the fact that they are based on
syntactically annotated corpora. This makes it possible to maintain an explicit relation
between the corpus and the lexicon: once the corpus has been annotated (for example
by marking all arguments and their dependency from verbs), human compilers create
the lexicon by summarizing the corpus occurrences into the lexical entries (for
example by describing argument patterns found for each verb) and recording the link
between the entries and the corpus.
Moving from a corpus-based to a corpus-driven approach, computational lexicons
like Valex (Korhonen et al., 2006) for English and LexSchem (Messiant et al., 2008)
for French systematically describe the valency behaviour of all verbs in the corpora
they are linked to. These lexicons are automatically extracted from annotated corpora
and therefore display frequency information about each valency pattern, which can
be traced back to the original corpus occurrences it was derived from. For example,
it is possible to know how many times a verb occurs with a subject and direct object,
and retrieve all corpus instances of this pattern. Attempts to apply this approach to
Latin data have resulted, for example, in the lexicon described by Bamman and Crane
(2008), which was automatically extracted from a Latin corpus consisting of 3.5 million words from the Perseus Digital Library and automatically parsed (i.e. syntactically
analysed).
5 http://tlio.ovi.cnr.it/TLIO.
Figure . Lexical entry for the verb impono from the lexicon for the Latin Dependency
Treebank. The pattern is called ‘scc2_voice_morph’ because it shows the voice and the morphological features of the arguments.
McGillivray (2013, 31–60) describes a corpus-driven lexicon automatically derived
from the Latin Dependency Treebank (Bamman and Crane, 2007) and the Index
Thomisticus Treebank (Passarotti, 2007b). Figure 5.1 shows the lexicon entry for the
Latin verb impono. Each entry in the lexicon corresponds to a verb occurrence in the
corpora, identified by an ID number for the verb (second column) and the unique
sentence number from the corpus (last column); in addition, the lexicon entry displays
the author of the text in which that occurrence is found (first column), the verb
lemma (third column), and the argument pattern corresponding to that verb token
(fourth column). For example, the pattern ‘A_Obj[acc],Sb[nom]’ in the first row
indicates that the verb impono in sentence 845 occurs in the active voice, with an
accusative direct object and a nominative subject. Applying the same database queries
developed to create the Latin lexicons to data from the Ancient Greek Dependency
Treebank (Bamman and Crane, 2009), which follows the same annotation guidelines
and format as the two Latin treebanks previously mentioned, McGillivray and Vatri
(2015) describe a corpus-driven valency lexicon for Ancient Greek, which they used
to study the phenomenon of local syntactic ambiguity.
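To give a concrete idea of the kind of query such a lexicon supports, the following Python sketch counts how often each combination of lemma and argument pattern occurs, assuming a hypothetical tab-separated export with the five columns just described; the file name and format are our assumptions, not the actual distribution format of the lexicon.

import csv
from collections import Counter

# Hypothetical tab-separated export of the lexicon: one verb
# occurrence per row, with the columns author, verb ID, lemma,
# argument pattern, and sentence ID.
pattern_counts = Counter()
with open("latin_valency_lexicon.tsv", newline="", encoding="utf-8") as f:
    for author, verb_id, lemma, pattern, sentence_id in csv.reader(f, delimiter="\t"):
        pattern_counts[(lemma, pattern)] += 1

# How often does impono occur in the active voice with an
# accusative direct object and a nominative subject?
print(pattern_counts[("impono", "A_Obj[acc],Sb[nom]")])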
The advantages of automatically built lexicons like the Latin and Greek ones
described above are numerous. First of all, as we have seen, they contain frequency
information which is directly linked to the corpus, thus allowing for corpus-based
quantitative studies, as prescribed by principle 8 (section 2.2.8). Second, they are easy
to maintain because, as the corpus grows in size, the automatic processes for obtaining
the lexicon can be executed again without starting a new process from scratch. This is
exemplified by the Ancient Greek lexicon described in McGillivray and Vatri (2015),
built using the same approach developed for the Latin lexicon, as we have seen. Third,
the creation of these lexicons is independent of the corpus annotation phase, which
minimizes the risks of biased results. In traditional studies (as we have seen from
the survey reported on in Chapter 1), the phase of data/text collection and the phase
of data analysis are often performed jointly, in the context of a specific study and
with a particular set of theoretical hypotheses in mind. By resorting to corpus-driven
resources, the phases are kept separate, because the text collection happens when the corpus compilers build the corpus; then the persons responsible
for the language resource extract the corpus data to create the lexicon via automatic
techniques. Only at this stage does the researcher pull the relevant data from the
language resource to address a specific research question. For example, McGillivray
(2013, 127–78) describes a study on the argument structure of Latin verbs prefixed
with spatial preverbs. The study relies on the corpus-driven valency lexicon described
in McGillivray (2013, 31–60). Hence, the decision about what counts as a verbal argument
and what is an adjunct was made by the corpus annotators and therefore was not
influenced by the specific purpose of the study on preverbs. This guarantees a higher
level of consistency (as noted in section 6.1), and facilitates the reproducibility of the
study, as recommended in our best practices (section 2.3.1).
Other historical lexicons Valency lexicons are very useful for studies that require
information on verbs’ syntactic arguments. For other purposes, different types of
lexicons are available for historical languages. One such type of resource is the set of lexicons developed in the context of the IMPACT (Improving Access to Text) project,6
which aims at developing a framework for digitizing historical printed texts written in
European languages.
One common issue with performing OCR on historical texts is that it requires a
large lexicon containing all possible spellings and inflections of words over time, as
the OCR algorithm uses the lexicon to assign the most likely transcription to each
word. Another challenge with searching historical texts concerns retrieval: ideally,
users should find occurrences of old spellings or inflections of words by searching for
the modern variants. For example, the user may search for ‘water’ and be presented
with corpus occurrences of ‘weter’, ‘waterr’, ‘watre’, and so on. Moreover, lists of proper names (so-called ‘named-entities’, typically for locations, persons, and organizations) drastically improve the accuracy of OCR systems for historical texts. To address all
6 http://www.digitisation.eu/.
these needs, the researchers of the IMPACT project have developed computational
morphological lexicons which display both spelling variants and inflected forms for
modern lemmas, as well as named-entities lexicons, for Bulgarian, Czech, Dutch,
English, French, German, Polish, Slovene, and Spanish (Depuydt and de Does, 2009).
The morphological lexicons were created from corpora, collections of quotations
contained in historical dictionaries, and/or modern dictionaries provided with historical variants. These morphological lexicons usually contain frequency information as
well. The named-entities lexicons were created by training named-entity recognition
algorithms on manually curated sets tagged with various types of named-entities labels
(the so-called ‘gold standards’) and then running the named-entity recognizers on
new, unannotated data.
Let us consider one of the historical lexicons developed as part of the IMPACT
project, the lexicon for German.7 This lexicon was extracted from a corpus of 3.5 million words from the Early New High German (1350–1650) and New High German (since 1650) periods. Each entry in the lexicon has the following structure: historical
word form, followed by the corresponding modern lemma and its attestations in the
corpora. The lexicon was created with Lextractor, a web-based tool with a graphical
user interface designed for lexicographers. This tool contains a modern morphological
lexicon, a lemmatizer, and an algorithm that uses rules to generate historical forms
from modern lemmas. Therefore, the tool is able to suggest the linguistic interpretation for some of the historical word forms, in terms of their modern lemmas, part-of-speech information, and their possible attestations in corpora. The lexicographer
has the option of accepting or rejecting the automatic suggestions, and difficult
cases are handled collaboratively (Gotscharek et al., 2009). For example, the following rules are among those for generating modern forms from historical forms in
German:
1. th → t
2. ei → ai
3. ey → ei
4. l → ll
In addition, the morphological lexicon for modern German maps the inflected form
teile to the noun teil ‘part’ (plural) and to the verb teilen ‘to share’ (first person singular
present indicative). When presented with the historical form theile, Lextractor can
suggest the lemma teile by combining the first rule listed above with the modern
morphological information; Lextractor can also suggest the lemma taille by applying
the second and fourth rules and the modern morphological lexicon entry for taille
‘waist’. At this point the lexicographer can confirm or reject the automatic suggestions;
7 http://www.digitisation.eu/tools-resources/language-resources/historical-and-named-entities-lexica-of-german/.
moreover, he or she can classify the historical form as one or more of the following
cases:
• historic form without modern equivalent;
• historic abbreviation;
• pattern matcher failed;
• named-entity;
• missing in modern lexicon.
Next, Lextractor provides the lexicographer with a list of candidate corpus
attestations of the word form, with their context in the form of concordances, as well
as the frequencies of all forms of the lemma being analysed and their time stamps. The
user can then select the correct occurrences.
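A minimal sketch of the candidate-generation step might look as follows; the rewrite rules are the four listed above, while the miniature modern lexicon and the brute-force strategy of trying every subset of rules are our simplifications of what Lextractor actually does.

from itertools import combinations

# The historical-to-modern rewrite rules listed above.
RULES = [("th", "t"), ("ei", "ai"), ("ey", "ei"), ("l", "ll")]

# A tiny stand-in for the modern morphological lexicon.
MODERN_LEXICON = {"teile", "taille"}

def modern_candidates(historical_form):
    """Apply every subset of the rewrite rules, in order, and keep
    the resulting forms that exist in the modern lexicon."""
    found = set()
    for n in range(len(RULES) + 1):
        for subset in combinations(RULES, n):
            form = historical_form
            for old, new in subset:
                form = form.replace(old, new)
            if form in MODERN_LEXICON:
                found.add(form)
    return found

# For theile the sketch proposes both teile and taille;
# the lexicographer then vets the suggestions.
print(modern_candidates("theile"))  # {'teile', 'taille'}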
5.2 Beyond language resources
Historical sociolinguists have emphasized the relationship between language and its
social context for a long time (Romaine, 1982; Nevalainen, 2003; McColl Millar, 2012).
As we said in section 4.3.2, recording the macro-social components of language,
as well as the situational aspects of the individual communicative events, is very
important to explain the history of language in society (both in terms of the history of individual languages and of language change), and corpus data constitute important
evidence sources for this type of investigation.
As we noted in section 4.3.3, the field of digital humanities has been concerned with
contributing to humanities research by addressing research questions of humanistic
disciplines with the support of digital tools. One important project in this area is
the TEI, which establishes a standard for annotating a wide range of textual and
contextual information for a large number of text types and formats. Unfortunately,
the academic communities of digital humanities and historical linguistics have not
always shared approaches and tools, and TEI markup is still not usually employed
in corpus annotation. However, this tendency is gradually changing. In recent years,
the collaboration between historical linguists and scholars from other historical areas
of the humanities has received a new impulse thanks to their shared interests in the
analysis of cultural heritage data. The LaTeCH (Language Technology for Cultural
Heritage) workshops testify to the increased popularity of this area of research (see, for example, Zervanou, 2014). We argue that this collaboration presents a number
of benefits for all research fields involved, as we explain here.
On the one hand, historical linguistics can gain more insight into how language
changed over time by explicitly placing language data into their historical context.
One way to achieve that is by adding information on social features of the texts,
and the work done by (historical) sociolinguists is a good model for such efforts.
Metadata on where the language data were composed (or uttered) are certainly an
essential piece of knowledge that needs to be recorded; in addition, annotating social
features of the authors/speakers and the location allows the researchers to investigate
how language and such social factors interact, thus adding an additional level of
depth to the analysis. One example of this approach is the project for creating the
British Telecom Correspondence Corpus at Coventry University (Morton, 2014),
which annotated business letters written over the years 1853–1982 with TEI-compliant
XML. The metadata elements recorded for this corpus include: date, author’s name,
occupation, gender, and location; recipient’s name and location; general topic of the
letter, whether the letter was part of a chain or not, format (handwritten, printed, etc.),
and company/department, in addition to text-internal annotation marking quotes,
letter openings, paragraphs, letter closings and salutations, as well as a pragmatic
annotation of the letter’s function (application, complaint, query, etc.).
On the other hand, from the point of view of historical research, texts and archives are among the various sources from which we can derive new interpretations of historical facts or discover new relations between events. Detailed linguistic analyses grounded in language data, particularly texts, can certainly support and enrich
this work. For instance, social history and marginalized groups are best investigated
by a corpus-based register and lexical analysis of the language of certain official
documents, as exemplified by the study of prostitution based on judicial records
from the seventeenth century described in McEnery and Baker (2014). The authors
analysed nearly one billion words from the seventeenth-century section of the Early
English Books Online corpus.8 The texts underwent variant spelling annotation,
lemmatization, part-of-speech tagging, and semantic tagging. After processing the
corpus data, historians and linguists in the project team carried out the collection
of relevant linguistic data in an iterative fashion. These data concerned the change
in meaning and discourse features of a set of lexical items recognized as pertinent
to the topic through literature review and corpus data inspection. This phase was
followed by the corpus work, which investigated semantic and pragmatic change
through collocation analysis. The analysis also concerned place names associated with
the nouns of interest (synonyms of prostitute). This research shed new light on
certain aspects of language change, and offered insights into the society and culture
of that historical period, which would have been more limited without access to large
corpora, linguistic knowledge, and historical expertise.
As the experience of McEnery and Baker (2014) shows, as more and more historical texts become available, the traditional approach involving close reading of the texts becomes less and less feasible, making room for the so-called ‘distant reading’ approach while coexisting with it. This is where the experience of corpus and computational linguistics can make a substantial contribution to historical research,
thanks to the vast set of tools for language processing and examination that these
8
http://www.textcreationpartnership.org/tcp-eebo/.
disciplines have developed. Such tools allow the researchers to scale up their analyses
and give a faithful representation of the language used in the texts, as well as their
content. In addition, large corpora make it possible to investigate rare language usages
that are simply not found in corpora of a size manageable by hand. Some examples
of this line of thinking are the research outlined in Toth (2013), the HiCor (History
and Corpus Linguistics) research network funded by the Oxford Research Centre
in the Humanities,9 the Network for Digital Methods in the Arts and Humanities (NeDiMAH),10 and the Collaborative European Digital Archive Infrastructure
(CENDARI).11
When applying corpus methods to historical archives and documents, however,
we need to keep in mind an important difference between corpora and archives
(and digital resources in general). This difference concerns the well-known issue of
‘representativity’, which is far from being resolved, especially in historical contexts,
where the corpus compilers often can only include texts or fragments that have
survived historical accidents, and cannot aim at so-called ‘balanced’ corpora (see
discussion in Chapters 2 and 4). Archives, in particular, usually group together records
relating to certain events, thus making it difficult to identify individual text types in
them. Attention should be also paid to ensuring that documents on less prominent
individuals are included as well, so to best reflect linguistic variation.
A number of software tools are now available to support historians’ interpretative
work by using the traditional corpus linguistics tools such as concordances and
keyword-in-context, as well as language technology techniques, including morphological tagging, part-of-speech tagging, syntactic parsing, named-entity recognition,
semantic relation extraction, temporal and geographical information processing,
semantic similarity, and sentiment analysis. Just to mention a few examples, ALCIDE
(Analysis of Language and Content in a Digital Environment) was specifically developed for historians at the Fondazione Bruno Kessler in Trento, and combines data
visualization techniques with information extraction tools to make it possible to view
and select the relevant information from a document, including a semantic analysis
of the content (Menini, 2014).
Another example concerns the synergy between geographic systems and language
technology, specifically named-entity recognition. Geographic information systems
(GIS) help to investigate the role played by different places in social phenomena over
time by analysing their mentions (both overt and implicit) and their collocations in
historical documents; see, for example, Joulain et al. (2013). In section 5.3.3 we will give
an example of a resource created in the context of geographical historical data.
From this brief overview it will be clear, we hope, that our position supports
synergies and collaborations between historical linguistics and other historical
9 http://www.torch.ox.ac.uk/hicor/.
10 http://www.nedimah.eu/.
11 http://www.cendari.eu/.
disciplines, which requires historical linguistics to develop a stronger commitment
to non-language-related resources. This way, it will be possible to combine multidisciplinary expertise to cover more research ground and achieve further goals that
could not be achieved in the context of the individual disciplines. Such synergies and
exchanges do not only affect people; they also have an important implementation in
linking the data resources employed in research, as we will see in section 5.3.
5.3 Linking historical (language) data
In section 5.2 we argued that historical corpora should be more integrated with other
linguistic and non-linguistic resources in order to give a fuller account of language
change over time. One way to achieve that is to enrich the corpus annotation with
metadata information recording the historical context of the texts and social features
of authors, characters, and places (usually at the beginning of the corpus or in a
separate file), as well as pragmatic functions of the speech acts (typically with in-line
annotation). Traditionally, this has been the standard approach in corpus-based historical sociolinguistics, and has allowed researchers to study the interplay of linguistic
phenomena and external factors by extracting the data directly from the corpora.
Along these lines, the compilers of the Penn–Helsinki Parsed Corpus of Middle
English, second edition (PPCME2) created a series of files containing a range of
metadata information about each text of the corpus. For instance, Figure 5.2 shows the
page of the Parson’s Tale by Chaucer. In addition to indicating the details of the manuscript (name, date, edition, and sampled portion for the corpus), the page contains
the genre and dialect of the text, in addition to other information from the original
Helsinki corpus from which the PPCME2 was derived, such as the relationship to the
original text and its language, the sex, age, and social rank of the author.
Enriching the annotation with such information makes the size of the corpus files
much larger. This does not need to be a problem, especially given the low cost of
data storage nowadays. However, there is another, more serious disadvantage in this
approach. Maintaining this kind of annotation is time-consuming and not particularly
efficient, because it involves creating copies of information already available in other
sources. Let us consider the example of a study on the relationship between the
determiners a and an over time and the social rank of the author. The researcher would
need to run a search of the corpus and then associate each occurrence of a/an in each
text with the social rank of the author and the date of the text as given by the corpus
pages exemplified in Figure 5.2. Let us now imagine that a new discovery reveals that
the manuscript of the Parson’s Tale used by the compilers of PPCME2 was in fact
produced ten years earlier than was thought previously. In order for the linguistic
analysis on a/an to be updated, the corpus compilers would need to be informed and
they would have to correct the corpus page (both for the date of the text and the age of
the author); the data for the sociolinguistic study would then need to be re-extracted.
Helsinki Corpus information
File name: CMCTPROS
Text identifier: M3 NI FICT CTMEL
Text name: CT MELIBEE
Author: CHAUCER GEOFFREY
Period: M3
Date of original: 1350–1420
Date of manuscript: 1350–1420
Contemporaneity: X
Dialect: EML
Verse or prose: PROSE
Text type: FICTION
Relationship to foreign original: TRANSL
Foreign original: FRENCH
Relationship to spoken language: WRITTEN
Sex of author: MALE
Age of author: 40–60
Social rank of author: PROF HIGH
Audience description: X
Participant relationship: X
Interaction: X
Setting: X
Prototypical text category: NARR IMAG
Sample: SAMPLE X

Figure 5.2 Page containing information about the text of Chaucer’s Parson’s Tale from the Penn–Helsinki Parsed Corpus of Middle English, https://www.ling.upenn.edu/hist-corpora/PPCME2-RELEASE-3/info/cmctpars.m3.html (accessed 22 March 2015).
This process is prone to errors and requires a number of people to be aware of the
new discovery. An alternative solution would involve having the data stored in one
single place (a ‘knowledge base’), to which they would be linked from the corpus,
for example, a repository of all manuscripts (or all Middle English manuscripts). In
the scenario imagined above, such a repository would be the only resource requiring
a change. As the corpus would link to it, those responsible for the corpus would just
need to update the links to the repository in order to get the corrected metadata on
which to base a sociolinguistic analysis. Linked data is a growing area of research and
development in computing which offers the model for realizing this link, as we will
see in section 5.3.1.
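As a toy illustration of the difference between duplicating metadata and linking to it, the sketch below stores only a manuscript identifier with each corpus hit and resolves date and social rank from a single (hypothetical) repository at analysis time; correcting a date once in the repository then updates every analysis that links to it.

# Hypothetical central repository of manuscript metadata.
MANUSCRIPTS = {
    "cmctpars": {"date": (1350, 1420), "social_rank": "PROF HIGH"},
}

# Corpus hits carry only the link (the manuscript ID),
# never a copy of the metadata itself.
corpus_hits = [
    {"form": "a", "ms": "cmctpars"},
    {"form": "an", "ms": "cmctpars"},
]

for hit in corpus_hits:
    meta = MANUSCRIPTS[hit["ms"]]   # resolve the link at analysis time
    print(hit["form"], meta["date"], meta["social_rank"])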
5.3.1 Linked data
The term ‘Linked Data’ refers to a way of representing data so that it can be interlinked.
Bizer et al. (2008) define linked data as follows:
Linked Data is about employing the Resource Description Framework (RDF) and the Hypertext
Transfer Protocol (HTTP) to publish structured data on the Web and to connect data between
different data sources, effectively allowing data in one data source to be linked to data in another
data source.
In simple terms, the World Wide Web consists of a large number of pages interlinked via HTML links. These links express a very rudimentary form of relationship
between webpages: we know that one page is related to another, but we do not know
the nature of this relationship, at least not explicitly from the link itself.12 In contrast,
the approach of linked data assumes a ‘web of data’ whereby entities (and not just
webpages) are connected through semantic links; these links identify the two entities
being linked and express explicitly the type of link between them; moreover, this is
done in such a way to allow the information to be automatically read by computers.
In the RDF data model, links are expressed in the form of triples where a subject
is connected to an object via a predicate that indicates the nature of the relationship
between the two. Triples are an example of structured data (see section 4.3) that
can be automatically retrieved by computer algorithms. In order to illustrate RDF,
we will take an example from DBPedia, which is a large resource of linked data
derived from Wikipedia, representing one of the hubs of the emerging web of data.
DBPedia organizes a subset of the Wikipedia entries into an ontology of over 4 million
entities, covering persons, places, creative works, organizations, species, and diseases,
together with the links between them. The DBPedia entry for Geoffrey Chaucer (the
subject)13 lists a series of attributes pertaining to this writer (the predicates) and their
12 We could consider the context in which the link appears, for example the words surrounding it, and
perform a distributional semantics analysis on that. However, what we are concerned with here is the explicit
type of relationship between the two entities being linked.
13 http://dbpedia.org/page/Geoffrey_Chaucer.
respective values (the objects). For example, Chaucer is related to the date ‘1343-01-01’ through the predicate ‘Birth date’, to the place ‘Westminster_Abbey’ through the predicate ‘RestingPlace’, and to ‘Philippa_Roet’ via ‘spouse’. The two latter entities (‘Westminster_Abbey’ and ‘Philippa_Roet’) also have their own entries, thus creating an interlinked network of information. Using this knowledge base, it is possible to run searches that are not possible on Wikipedia, thus allowing for a much wider discoverability of the content of this resource. For example, one can search for all authors who were born in the fourteenth century and whose spouses died in the fourteenth century.
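Such a search can be phrased as a SPARQL query against the public DBPedia endpoint. The following sketch uses the Python SPARQLWrapper library; the class and property names (dbo:Writer, dbo:birthDate, dbo:spouse, dbo:deathDate) are our assumptions and may vary across DBPedia releases.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    SELECT ?author ?spouse WHERE {
        ?author a dbo:Writer ;
                dbo:birthDate ?born ;
                dbo:spouse ?spouse .
        ?spouse dbo:deathDate ?died .
        FILTER (?born >= "1300-01-01"^^xsd:date && ?born <= "1399-12-31"^^xsd:date)
        FILTER (?died >= "1300-01-01"^^xsd:date && ?died <= "1399-12-31"^^xsd:date)
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["author"]["value"], row["spouse"]["value"])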
Linked data collections may be open according to the Open Definition,14 which in
its concise version states:
Open means anyone can freely access, use, modify, and share for any purpose (subject, at most,
to requirements that preserve provenance and openness).
Linked open data are by definition easier to access by a wide audience, which offers
new avenues of research for a large number of scientific fields. Linguistics is certainly
one such field, and one unquestionable advantage of developing and using linked
open data in linguistics is that resources can be combined together to improve specific
linguistic processing tasks. For example, combining a dictionary with a part-of-speech
tagger makes it possible to perform dictionary-based part-of-speech tagging; another
example is the integration of dictionaries and corpora, which allows the lexicographer
to refer to corpus examples from lexical entries, and therefore place each example in its
corpus context. Linking language resources in this way makes them at the same time
integrated and interoperable. This means that the resources are not only provided with
links to allow exchange of information, but that the interpretation of this information
is consistent across the linked resources.
5.3.2 An example from the ALPINO Treebank
Let us take the example of a treebank (see section 4.3.2 for an illustration of treebanks).
The ALPINO Treebank is a syntactically annotated corpus of Dutch (Van der Beek
et al., 2002) with over 150,000 words from the newspaper part of the Eindhoven
corpus. Its original format is in XML (illustrated in section 4.2), as shown below for
the syntactic tree of the phrase In principe althans ‘In principle, at least’.
<?xml version="1.0" encoding="ISO-8859-1"?>
<top>
<node rel="top" cat="du" begin="0" end="3" hd="3">
<node rel="dp" cat="pp" begin="0" end="2" hd="1">
<node rel="hd" pos="prep" begin="0" end="1" hd="1"
root="in" word="In"/>
14 http://opendefinition.org/.
<node rel="obj1" cat="np" pos="noun" begin="1"
end="2" hd="2" root="principe" word="principe"/>
</node>
<node rel="dp" cat="advp" pos="adv" begin="2" end="3"
hd="3" root="althans" word="althans"/>
</node>
<sentence>In principe althans .</sentence>
</top>
The nodes of the dependency tree are tagged as <node> and the attributes cat, rel, and pos stand for categories/phrase types, dependency relations, and part-of-speech tags, respectively. For example, the word in is a preposition (pos=‘prep’).
Moreover, it is the first word of the sentence, so it begins at position 0 (begin=‘0’)
and ends at position 1 (end=‘1’), and the lexical head of its phrase is in position 1
(hd=‘1’). The node corresponding to in is part of a prepositional phrase, so its parent
node (which starts at position 0 and ends at position 2 because it includes principe as
well) has cat=‘pp’.
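Because the annotation is plain XML, it can be processed with standard tools. The sketch below assumes the snippet above has been saved to a file (the file name is ours) and extracts each word together with its part-of-speech tag and dependency relation:

import xml.etree.ElementTree as ET

# Parse the ALPINO fragment shown above (hypothetical file name).
tree = ET.parse("in_principe_althans.xml")
for node in tree.iter("node"):
    if node.get("word") is not None:   # leaf nodes carry the actual words
        print(node.get("word"), node.get("pos"), node.get("rel"))
# Expected output:
# In prep hd
# principe noun obj1
# althans adv dp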
Let us imagine that we want to make sure that the inventory of part-of-speech tags
is consistent with an external tagset. The Linked Data approach to this is to link the
corpus to another resource through RDF. One such resource is the General Ontology
for Linguistic Description (Farrar and Langendoen, 2003).15 Here, we will consider the
linguistic ontology LexInfo (Cimiano et al., 2011).16 The linking between the treebank
and LexInfo allows us to connect the treebank with another corpus that uses the
LexInfo tagset; moreover, if the tagset is updated, the part-of-speech information in
the treebank will not need to be changed. Let us have a closer look at ontologies
through the case of LexInfo in the next section.
The LexInfo ontology In computer science an ontology formally defines the entities
of a particular domain, together with their properties and relationships. OWL (Web
Ontology Language) is the standard language used to represent ontologies. OWL
defines classes and subclasses, which classify individuals into groups which share
common characteristics; an ontology in OWL also specifies the types of relationships
permitted between these individuals.
As far as language resources are concerned, LEMON (LExicon Model for ONtologies, McCrae et al., 2012) is an RDF model specifically designed for lexicons and machine-readable dictionaries. LexInfo is a model for relating linguistic information (such as
part of speech, subcategorization frames) to ontology elements (such as concepts,
relations, individuals), following the LEMON model. The following example shows
the portion of the LexInfo ontology relating to the category of adverbs.17
15 http://www.linguistics-ontology.org/.
16 http://www.lexinfo.net.
17 The line numbers were added by us.
1 <owl:Class rdf:about="http://www.lexinfo.net/ontology/2.0/lexinfo#Adverb">
2 <owl:equivalentClass>
3 <owl:Restriction>
4 <owl:onProperty rdf:resource="http://www.lexinfo.net/ontology/2.0/lexinfo#partOfSpeech"/>
5 <owl:someValuesFrom rdf:resource="http://www.lexinfo.net/ontology/2.0/lexinfo#AdverbPOS"/>
6 </owl:Restriction>
7 </owl:equivalentClass>
8 <rdfs:subClassOf rdf:resource="http://lemon-model.net/lemon#Word"/>
9 <rdfs:isDefinedBy rdf:resource="http://www.lexinfo.net/ontology/2.0/lexinfo"/>
10 </owl:Class>
Let us examine each element of this RDF snippet.
• <owl:Class defines the class ‘Adverb’;
• <owl:equivalentClass>: lines 2–7 indicate that the two class descriptions denote exactly the same set of individuals. In this case the individuals of the class ‘Adverb’
are exactly those individuals that are identified by the properties listed in lines
4–6, as explained in the next two points;
• <owl:onProperty and <owl:someValuesFrom refer to the fact that the
parts of speech of the individuals being described take values from a list of
adverbial parts of speech;
• <rdfs:subClassOf indicates that the class Adverb is a subclass of the larger
class ‘Word’ in the LEMON model;
• <rdfs:isDefinedBy indicates the resource defining the class of adverbs in
LexInfo.
The second part of the ontology mentioning adverbs is as follows:18
1 <owl:Thing rdf:about="http://www.lexinfo.net/ontology/2.0/lexinfo#adverb">
2 <rdf:type rdf:resource="http://www.lexinfo.net/ontology/2.0/lexinfo#AdverbPOS"/>
3 <rdf:type rdf:resource="http://www.lexinfo.net/ontology/2.0/lexinfo#PartOfSpeech"/>
4 <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#NamedIndividual"/>
18 The line numbers were added by us.
5 <rdfs:label>adverb</rdfs:label>
6 <rdfs:comment>Part of speech to refer to an heterogeneous group of words whose most frequent function is to specify the mode of action of the verb.
7 </rdfs:comment>
8 <dc:creator>Francopoulo, Gil</dc:creator>
9 <owl:versionInfo>1:0</owl:versionInfo>
10 <dcr:datcat rdf:resource="http://www.isocat.org/datcat/DC-1232"/>
11 <rdfs:isDefinedBy rdf:resource="http://www.lexinfo.net/ontology/2.0/lexinfo"/>
12 </owl:Thing>
Again, let us look at each line:
• line 1 indicates that ‘adverb’ is an individual of the class Adverb;
• lines 2–4 state that this individual is a member of three classes: ‘AdverbPOS’,
‘PartOfSpeech’, and ‘NamedIndividual’;
• line 5 contains the label for the individuals belonging to the class Adverb;
• lines 6–7 contain a comment explaining adverbs;
• line 8 states the name of the creator of the class: ‘dc’ refers to the Dublin Core
standard19 for describing metadata of web resources;
• line 9 contains the version information;
• <rdfs:isDefinedBy indicates the defining resource, as in the case of the class of
adverbs explained above.
Linking LexInfo and the ALPINO Treebank Now that we have seen an example part of speech from the LexInfo ontology, we can appreciate the advantages of linking all
the information on part of speech to the ALPINO Treebank, to ensure that the two
resources are in sync and that there is no unnecessary duplication of data.
John McCrae has transformed the ALPINO Treebank into RDF format and linked it
to LexInfo, as described in his blog.20 To ensure that the links were semantic, he created
an ontology in the OWL language to describe the categories used in the treebank. The
following example describes the part-of-speech ‘adverb’ in ALPINO:21
1 <owl:NamedIndividual rdf:about="http://lexinfo.net/corpora/alpino/categories#adv">
2 <rdf:type rdf:resource="http://lexinfo.net/corpora/alpino/
3 categories#PartOfSpeech"/>
19 http://dublincore.org/.
20 http://john.mccr.ae/blog/alpino.
21 The line numbers were added by us.
4 <rdfs:label xml:lang="en">Adverb</rdfs:label>
5 <dcr:datcat rdf:resource="http://www.isocat.org/datcat/DC-1232"/>
6 <owl:sameAs rdf:resource="&lexinfo;adverb"/>
7 </owl:NamedIndividual>
This portion of the ontology declares the named individual corresponding to adverbs, states that adverbs are members of the class ‘PartOfSpeech’, and gives its English-language label (‘Adverb’). Line 5 gives the details of the ISOcat22 element DC-1232, which corresponds to adverbs.
Defining this ontology made it possible to link the treebank to LexInfo. For example, in the linking between ALPINO and LexInfo, the second node of the phrase shown in section 5.3.2 (In principe althans), the adverb althans ‘at least’, is represented as follows:23
1 <node>
2 <rdf:Description rdf:about="#top/node/node_2">
3 <cat:rel xmlns:cat="http://lexinfo.net/corpora/alpino/categories#"
   rdf:resource="http://lexinfo.net/corpora/alpino/categories#dp"/>
4 <cat:cat xmlns:cat="http://lexinfo.net/corpora/alpino/categories#"
   rdf:resource="http://lexinfo.net/corpora/alpino/categories#advp"/>
5 <cat:pos xmlns:cat="http://lexinfo.net/corpora/alpino/categories#"
   rdf:resource="http://lexinfo.net/corpora/alpino/categories#adv"/>
6 <begin>2</begin>
7 <end>3</end>
8 <hd>3</hd>
9 <root>althans</root>
10 <word>althans</word>
11 </rdf:Description>
12 </node>
This is an example of RDF code expressed in XML. Line 2 states that what follows
is a description of node 2, which is assigned the unique ID #top/node/node_2.
22 ISOcat (http://www.isocat.org) is a central registry for all linguistic concepts. It contains so-called
‘data categories’, which describe these linguistic concepts.
23 The line numbers were added by us.
Lines 3–5 are statements about three predicates for node 2 (the subject): cat:rel,
cat:cat, and cat:pos.24 For example, line 5 expresses the fact that node 2 has
a property (or predicate) with name cat:pos and its value is the object identified by the string http://lexinfo.net/corpora/alpino/categories#adv. Lines 6–10 contain the treebank-specific information about node 2, namely its
position, its lexical head, root, and word form.
What is important to note here is that node 2 of the sentence in question is an
adverb, and it refers to the category ‘adv’ in the LexInfo ontology.
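The same statement can be reproduced as an explicit triple, for instance with the Python rdflib library; since the snippet above uses a relative identifier for node 2, the base URI in the sketch is our invention.

from rdflib import Graph, Namespace, URIRef

CAT = Namespace("http://lexinfo.net/corpora/alpino/categories#")
g = Graph()

# Subject (node 2, the adverb althans), predicate (cat:pos), object (cat:adv).
node2 = URIRef("http://example.org/alpino#top/node/node_2")   # hypothetical base URI
g.add((node2, CAT.pos, CAT.adv))

# Retrieve every node whose part of speech is 'adv'.
for subject in g.subjects(predicate=CAT.pos, object=CAT.adv):
    print(subject)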
5.3.3 Linked historical data
We have seen an example of an annotated corpus for a modern language linked with
a lexical resource. One of the motivations for developing linguistic linked data is
related to the field of NLP. By definition, linked data are machine-readable and can
therefore be used directly by computer programs, and this presents a huge potential for
improving NLP tools. For example, named-entity recognition software greatly benefits
from using knowledge bases like DBPedia, which contain large collections of named
entities (Hellmann et al., 2013, 2).
Compared to linguistic linked open data, the motivations for historical linked
data do not primarily include NLP development. However, there are other strong
motivations in favour of adopting the linked data model for historical language data.
First, effective searching across a range of resources is made easier (and sometimes
possible at all) by having resources linked together and interoperable. This means
that the information available in the linked resources needs to be compatible. For
example, if we have a corpus of sports reports annotated with domain information on
the particular sports covered by the texts, we may want to connect that to a repository
of sport players and historical events to study lexical development for every sport
over time. However, this would be impossible if the two resources (the corpus and
the repository) did not share the same domain definitions.
Some projects have already explored the option of linking two historical corpora
together. For example, the ElEPHãT project has linked the Early English Print portion
of the HathiTrust texts and Early English Books Online Text Creation Partnership
(EEBO-TCP), a smaller collection of texts from ProQuest’s Early English Books
Online database; both sets of texts were dated from 1473 (date of the first book printed
in English) to 1700, but were designed and built independently, which made it difficult
to align them. The aim of the project was to allow scholars to investigate the combined
data set and explore new research questions. For example, it is possible to search the
combined data set for all works by a given author, as well as run searches by publication
24 Note that all three predicates have the same prefix cat, which is defined as a namespace
http://lexinfo.net/corpora/alpino/categories. Namespaces avoid conflicts between tags with the same
name. In this case, we want to guarantee that even if other tags exist called rel, cat, and pos, the specific
tags used here are unique.
place, publication date or period, subject (such as political science), and genre (such
as biography).
Linking historical and geographical data We have seen examples of annotated corpora linked with lexical resources and with other corpora. Of course, corpora can
be linked to other types of resources, such as non-linguistic ones. In section 5.2 we
stressed the importance of combining language-external and encyclopedic resources
with corpora in historical linguistics research, and the flexibility of the linked data
model makes it well suited to capture this combination (Chiarcos et al., 2013).
When different resources are linked together we can run a much wider range of
searches on them through the so-called ‘federated search’. Finally, the links between
resources can be automatically updated as the resources change, thus streamlining
their maintenance.
Over the past few years, the field of digital humanities has witnessed a number
of projects aimed at building resources in the linked data model, and these projects
have the potential to greatly enrich the options available to historical linguists. In
the next sections we will see some examples of the linked data model applied to
historical language data and we will indicate how historical linguistics can benefit from
such examples. We will discuss a couple of projects which have applied the model of
linked data to the field of digital classics, and have created valuable resources that will
facilitate the research in the ancient world.
A resource for historical geography: Pleiades Pleiades25 is an open-access digital
gazetteer for ancient history. The goal of this project is to allow continuous updates to
the gazetteer and also to facilitate its use in conjunction with other projects in digital
classics by relying on ‘open, standards based interfaces’ (Elliott and Gillies, 2009).
Since Pleiades relies on a crowdsourcing approach, scholars, students and enthusiasts
have the option of contributing to this resource by suggesting geographic names
for locations in the ancient world, adding bibliographical information, or writing
documentation, while at the same time retaining the intellectual property of their
contributions. In the words of Elliott and Gillies (2009), ‘[i]n a real sense then Pleiades
is also like an encyclopedic reference work, but with the built-in assumption of ongoing revision and iterative publishing of versions’.
Each of the tens of thousands of geographical entities that make up the Pleiades
gazetteer is given a stable unique identifier (uniform resource identifier or URI
in the linked data terminology), which makes it possible for other resources to
unambiguously link to them. For example, the entry for the Adriatic Sea has the
unique identifier http://pleiades.stoa.org/places/1004, which also indicates the web
page where the entry is available (Downs et al.). Figure 5.3 displays the top part of the
page for Adriatic Sea, showing its identifier, its modern location (also highlighted in
25 http://pleiades.stoa.org.
i
i
i
i
i
i
OUP CORRECTED PROOF – FINAL, 11/7/2017, SPi
i
i

(Re)using resources for historical languages
Figure . Part of the entry for Adriatic Sea in Pleiades.
the map on the right-hand side of the page), as well as its names with the attested dates
when available, and its category. Moreover, the page shows the relationship between
the Adriatic Sea and various locations, also marked in the map. At the bottom of the
web page (not displayed in Figure 5.3) there is a section called ‘References’ which links
to the occurrences of the name Adriatic Sea (spelled as Hadriaticum or Adriaticum)
in three texts from the collection of classical Latin texts by the Packard Humanities Institute,26 and to relevant scholarly works.
can see the link to the entry for Adriatic Sea in the Ancient World Mapping Center,
which is a partner of the Pleiades project.27
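Because every place has a stable URI, programmatic reuse is straightforward. The sketch below retrieves the entry for the Adriatic Sea over HTTP, assuming the JSON serialization that Pleiades publishes alongside each entry; the exact URL pattern and field names are our assumptions about that format.

import requests

# Stable URI for the Adriatic Sea, with the (assumed) JSON serialization.
response = requests.get("https://pleiades.stoa.org/places/1004/json")
place = response.json()

print(place.get("title"))
for name in place.get("names", []):
    # Attested spellings such as Hadriaticum, with their romanized forms.
    print(name.get("attested"), name.get("romanized"))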
One methodologically interesting aspect of this project is the development of a
new approach to the representation of geographical entities, specifically designed
for historical data. As Elliott and Gillies (2011) explain, conventional GIS models
require geometrical objects and therefore are not well suited for sparse and ambiguous
historical data, where some locations are unknown or can only be located relative to other locations, and their properties change over time. Pleiades’ model involves
mapping the relationships between conceptual places/spaces, names, locations and
time periods by resorting to a variety of sources ranging from ancient texts to modern
scholarly works, ancient coins through their minting locations, and archaeological findings through their locations.
Pleiades is a very valuable resource for the study of antiquity and can be integrated
with other resources in numismatics, epigraphy, and papyrology. Furthermore, we
believe that such resources would be very important for the study of historical
26 http://latin.packhum.org/.
27 http://awmc.unc.edu/wordpress/about/.
languages. For example, in the context of a study of spatial expressions in the classical
languages, we would be able to reuse the work done within the Pleiades project on
linking linguistic patterns found in ancient texts to their corresponding geographical
locations. This would probably save a considerable amount of time and even allow for
investigations at a scale that would not have been imaginable within the scope of an
individual historical linguistics project.
Linking place references and historical geographical data The Pelagios project (Enable Linked Ancient Geodata in Open Systems) constitutes a natural evolution from
the experience of Pleiades (described in section 5.3.3). The aim of this project is to
annotate place references to entries in the Pleiades gazetteer using the format of
the Open Annotation RDF vocabulary. Pelagios covers not just the ancient Greco-Roman worlds, but also the early Byzantine, Christian, Maritime, Islamic, and Chinese
cultures through their geospatial documents. The project has built a resource for
exploring geographical locations up to 1492. This was achieved by referring to standardized lists of historical places such as the Pleiades gazetteer and the China Historical
GIS. Then, toponyms occurring in semi-automatically transcribed texts and images
were identified automatically and mapped to the gazetteers. This work was done by
combining scholars’ knowledge with crowdsourcing.
Linking people’s references and people’s names One common issue with historical material is to determine whether two documents refer to the same person. The same issue
concerns contemporary material, and is even harder because of ethical and privacy
questions. Being able to link names of people to their mentions in texts is very valuable
for historical linguistics investigations, as it allows us to discover new connections and provides another way to incorporate contextual information into the linguistic analysis, as we stressed throughout this section. The SNAP (Standards for Networking Ancient Prosopographies) project28 provides this type of linked resource for the ancient world. SNAP investigated linking various data collections concerning
the lives of groups of persons (prosopographies), persons’ names (onomastica) and
person-like entities using resources available for the Greco-Roman ancient world,
such as the Lexicon of Greek Personal Names, containing names of persons mentioned
in ancient Greek texts, Trismegistos, a database of names and persons from Egyptian
papyri, and Prosopographia Imperii Romani, containing names of elite persons from
the first three centuries of the Roman Empire.
5.4 Future directions
We have seen how language resources are critical elements in historical linguistics
research, sometimes as input and other times as output of the research process.
28 https://snapdrgn.net.
In this chapter we have proposed a number of improvements to how such resources
are designed, built, maintained, and managed. These improvements concern making
the resources available and usable by a large number of people, employing standards
in the formats and in the representation models of the data, as well as linking the
resources together.
More work is needed to create and promote standards, and we believe that historical
linguists can benefit from the experience of scholars working in other historical
disciplines, as well as computational linguists, who have made substantial progress in
this direction. In particular, going beyond the silos of language resources and corpora
will open up a whole new range of opportunities.
Another aspect that we think should take priority in the future is related to open
data: being an integral part of the research work, data and resources deserve more
attention than they have received in the past. Now that the cost of data storage has decreased significantly and computing power has reached relatively high levels, research replication is more than a theoretical option; however, it is achievable only if the data and the processes are well documented and easily accessible (section 2.3.1).
Adopting an open-data attitude does not only mean publishing large data sets: the data generated within the scope of a single study should also be released, at both the micro level and the macro level.
In section 2.3.3 we mentioned data repositories and data journals. In spite of these
nascent and promising initiatives, article and book publications still have a higher
status than data publications. In order to provide incentives for scholars to create
language resources, data publications should carry more weight in scholars' careers, and traditional publications should include persistent links to the data
set(s). In this respect, publishers can help to ensure that data supporting publications
are available to the scholarly community by enforcing data policies and requiring
statements on data availability on all their publications.
Finally, more work is needed to build software tools that make it possible (and
ideally easier) to build and link historical language resources in an effective way.
Kenter et al. (2012) describe an editing tool designed for historical texts which aligns the processes of corpus annotation and the creation of corpus-based lexicons, providing a single platform where the annotation can be revised, as well as a standardized annotation format. Tools such as this will facilitate the change
that we described in this chapter and that we believe would make the field of historical
linguistics evolve further.
6 The role of numbers in historical linguistics
6.1 The benefits of quantitative historical linguistics
So far we have mostly asserted that historical linguistics would benefit from an
increased use of quantitative corpus methods. The present chapter addresses more
directly how quantitative evidence and corpora are relevant to historical linguistic
research. The chapter also provides a rebuttal of some counterarguments to the use of
corpora and numbers. The chapter concludes with a case study that illustrates the use
of quantitative corpus methods in historical linguistics, by showing how such methods
can help to evaluate competing claims against each other, and thus help realize the aim
of principle 1 regarding consensus (section 2.2.1).
6.1.1 Reaching across to the majority
In section 1.4 we described the large gap, or chasm, that separates the early adopters
of new technologies from the majority. The majority of adopters of any technology are
likely to be motivated by pragmatic considerations such as ease of use and concrete
benefits, not the technology itself.
Regarding ease of use, it has probably never been easier to start using quantitative
models. The statistical tools needed to carry out advanced quantitative studies are easily obtainable, in many cases entirely free of charge. The statistical software package R,
considered the default statistical tool in many academic fields, is available for free for
all major computer platforms. The R platform is well suited for quantitative research in
any branch of linguistics, as attested both by the wide variety of published quantitative
studies and the variety of textbooks aimed at linguists using R, such as Baayen (2008),
Johnson (2008), and Gries (2009a, 2009b).
As a support or alternative to R, the adoption of general programming languages
such as Perl or Python for quantitative corpus studies is made easier by textbooks such as Weisser (2010) and Bird et al. (2009). Moreover, the skills and knowledge required to use such quantitative tools, and to interpret their outcomes, are
being disseminated more vigorously than ever. In addition to on-campus courses,
there are at the time of writing several options for studying quantitative data analysis
for free through open-access MOOCs (massive open online courses).
In summary, the technology is readily available, as are the instructional materials.
Although it is not effortless to master, the technology itself is easy to use once the
initial lack of familiarity has been overcome. Our aim is that the present book will fill
in some of the gaps between the technology and the research questions in historical
linguistics, thus easing the crossing of the chasm. After this consideration of the ease
with which the technology can be adopted, we will focus on the benefits of doing so.
6.1.2 The benefits of corpora
The benefits of using quantitative and corpus methods are of course connected, but
we will discuss separate aspects of them here. The main benefits arising from
using corpora, in the sense defined in section 2.1.3, are data transparency, data quality,
efficiency, information about frequency, and information about context.
The data transparency that arises from using shared corpora is a way to establish the
empirical basis of the consensus described in principle 1 (see section 2.2.1). A corpus
available to other researchers allows detailed replication, as well as criticism, of every
step in the data retrieval process, and hence a much stronger basis for argumentation.
Benefits of efficiency and quality are closely linked to the development and
dissemination of corpora. Any historical linguistic research that attempts to answer
how frequently a linguistic phenomenon occurred in the surviving material can do
one of the following (in descending order of preference):
(i) use an existing corpus;
(ii) build a new corpus;
(iii) use an ad hoc collection of texts or citations.
If there is an existing corpus, and if that corpus can be deemed reasonably representative of the language variety for the purposes of the study, then it is clearly beneficial
to make use of it (i). Reuse of an existing corpus saves time and effort, but the existing
corpus is also agnostic about the aims of whatever study it is being used for, as long as
it was designed as a general resource. As Gries and Newman (2014) point out, there
is a considerable risk of bias when the investigators of a study also collect all the data
directly from some source. This bias, together with the extra effort, increases the risk of
errors due to potential lack of quality assurance, as well as issues with representativity
and size, and speaks against (iii) as an option. However, we want to make clear
that when approached correctly, we consider an ad hoc text collection better than
nothing. Given the necessary resources, (ii) should be the preferred option if no
satisfactory corpus exists. Thus, aspects like size, representativity, quality assurance,
and being agnostic to any one particular research question (which also ensures greater
comparability of studies) ensure that a corpus has an edge over an ad hoc collection
of citations or other texts.
Quantitative evidence, as stated in principle 10 (see section 2.2.10), should primarily
be drawn from corpora. By definition, for any question about how often a linguistic
phenomenon occurred in the past, corpora will provide the best evidence. However,
the choice of corpus matters. The larger the corpus, the more precise the results will be,
other things like representativity and annotation being equal. However, with increased
size comes additional complexity and an increased workload, with a concomitant
increased risk of errors (whether manual or computational). This consideration also
gives corpora an edge over ad hoc text collections, since corpora often benefit from
a longer development period with more people involved (e.g. in the form of tests of
inter-annotator agreement).
In addition to the benefits arising from frequency information, corpora provide
information about context. This context might be a matter of frequency, e.g. how often
x occurs with y. Such a benefit goes beyond counting what we already know, because
it is also a means to identify which contexts x occurs in. Linguistic units tend to follow
a Zipf-like distribution with a long tail of infrequent occurrences (Baayen, 2008;
Köhler, 2012), which makes corpora ideal for discovering new, possibly infrequent
cases. Moreover, corpora also provide a principled means for connecting metadata
about texts and speakers to linguistic data. Thus, corpora provide both linguistic and
extra-linguistic context (see Chapter 5).
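To make the Zipfian shape concrete, the following R fragment (a minimal sketch; the file name and the tokens vector are purely illustrative) plots word frequencies against rank on doubly logarithmic axes, where a Zipf-like distribution appears roughly linear:

    # Illustrative sketch: rank/frequency profile of a whitespace-tokenized text
    tokens <- tolower(scan("some_text.txt", what = character()))  # hypothetical file
    freqs <- sort(table(tokens), decreasing = TRUE)               # frequency list
    plot(log(seq_along(freqs)), log(as.numeric(freqs)),
         xlab = "log rank", ylab = "log frequency")               # long tail of rare items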
6.1.3 The benefits of quantitative methods
In the previous section we argued that corpora are the best source of numerical
information about historical language use. However, once the frequencies have been
obtained, there are also some specific benefits from using statistical methods. By
statistical method we mean a formalized, mathematical procedure (but not necessarily
null-hypothesis tests) that will allow us to draw inferences about the data in some
principled manner. We normally take this to exclude the assessment of raw numbers
and raw relative frequencies for the purposes of drawing conclusions.
These benefits are partly pragmatic. For instance, when dealing with large samples,
or highly complex data where several variables interact, statistical methods can
provide important insights about the data that would otherwise not be available. For
instance, with a large set of data, we might have so much variation that it must be
treated statistically since the number of variable values, and combinations of them,
would otherwise become unmanageable. Furthermore, statistical methods can help
to rank the importance of a large number of variables, thus helping the researcher to
separate the wheat from the chaff.
However, there are also more principled benefits. Principle 9 (section 2.2.9) defined
trends as probabilistically modelled. Therefore, probabilistic, i.e. statistical, methods
are required to identify their characteristics and to tease apart the variables involved.
Furthermore, principle 8 (section 2.2.8) prescribed explicit quantities, arguing that
this leads to a more transparent and hence stronger argument. To this argument
we can also add the reproducibility argument, as well as the importance of using
quantitative evidence to avoid bias and confounding factors, as pointed out by Kroeber
and Chrétien (1937).
6.1.4 Numbers and the aims of historical linguistics
The benefits of quantitative corpus methods described in sections 6.1.2 and 6.1.3 are of
course not inherently beneficial regardless of context; they are only relevant relative to
one or more aims. Harrison (2003, 214) lists the following aims of historical linguistics:
(i) Identifying genealogical relatedness between languages.
(ii) Exploring the history of individual languages.
(iii) Developing a theory of linguistic change.
The first aim has been the main purview of the comparative method in historical
linguistics. Given the success of the comparative method in dealing with aim (i), we
think the main benefits of quantitative historical linguistics are found among the other
two aims, but ultimately this is an empirical question. Aim (ii) is perhaps the one
with the strongest history of using corpora, at least with respect to those languages for
which corpora exist. Aim (ii) offers rich opportunities for quantitative corpus methods
since it will inevitably involve finding patterns among highly variable data. Finally, aim
(iii) builds on (ii) and can also benefit from such methods, as we will show below.
A theory of language change, i.e. a series of laws and the predictions that follow from
them in the sense of Köhler (2012), must take variation into account. In fact, taking
variation seriously by means of quantitative models addresses the key problem with
the nineteenth-century neogrammarian sound laws, namely the assumption that they
were exceptionless (Campbell, 2013, 15). A probabilistic reinterpretation of such laws
can accommodate far more variation. This is the core of the claim in Kroch (1989) that
syntactic change proceeds at a regular rate of change, since the quantitative model he
uses is robust enough to cope with the variation in the data. However, Kroch's (1989)
study is also interesting, since it illustrates another point we have made, namely that
shared data and quantitative methods enable openness and precise communication as
well as criticism. Vulanović and Baayen (2007) use the same data as Kroch (1989) and
they show that the model proposed in Kroch (1989) does not fit the data well. Based
on a series of models that fit the data better, Vulanović and Baayen (2007) argue that
the rate of change varies depending on the syntactic environment.
The variation captured by such models of change need not be strictly linguistic.
Kretzschmar and Tamasi (2003) show how extralinguistic correlations of linguistic
change can be taken into account. Their argument is positioned against what they
see as a “Labovian” tradition that conceptualizes change within a closed linguistic
system. Similarly, Blythe and Croft (2012) use computer simulations, or agent-based
modelling, to argue that the most plausible scenario for dialect change in New Zealand
English involves social factors. Such simulation models, another instance of model
parallelization (see section 1.1), provide a means of testing hypotheses regarding the
linguistic mechanisms at work, since a number of different mechanisms such as lexical
diffusion or catastrophic syntactic change might be reconcilable with an observed
trend in the data (Denison, 2002).
We have argued that the aims of historical linguistics are not only compatible
with quantitative approaches, but direct beneficiaries of such approaches. A natural
follow-up question is what role the various linguistic theories have to play. As stated
earlier, our position regarding linguistic theories is agnostic. Ours is a framework for
conducting corpus-based quantitative investigations, not a linguistic theory. Specifically, this framework does not rest upon an explicitly probabilistic theory of language,
such as the one described in the chapters of Bod et al. (2003).
6.2 Tackling complexity with multivariate techniques
In this section we will argue that multivariate statistical techniques are in most cases
the ideal way to deal with the complexity of linguistic phenomena, and we introduce
multivariate techniques. We have seen that linguistic phenomena (like phenomena
in many other disciplines) are often correlated with a whole range of variables (see
principle 11 in section 2.2.11). In historical linguistics, time is often an important factor,
but other factors include text-related features like genre, register, or author, as well as specifically linguistic features: morphological, lexical, syntactic, semantic, contextual, and so on. Multivariate analysis is concerned with precisely this type
of scenario. It can account for the effect of multiple variables on a phenomenon of
interest, thus shedding light on the possible ways in which those variables are related
to each other in a systematic fashion.
As an example, let us consider the argument structure of Latin prefixed verbs like
ad-eo ‘go to, reach’, where the prefix (also known as preverb) ad ‘to’ is added to the
verb eo ‘go’. Latin preverbs have been associated with adpositions due to their common
origin from Indo-European adverbial particles, which were relatively free to occur in
various positions of the sentence (Meillet and Vendryes, 1963, 199–200, 573–8).
If we focus on verbs prefixed with spatial preverbs and on the realizations of their
spatial arguments, we observe four main ways in which these arguments can be
realized:
1. (CasePrev) as an NP, whose case is that governed by the preposition corresponding to the preverb; see the following example, where the preverb e- relates to the
preposition e/ex ‘from’, which governs the ablative case, and the prefixed verb
egressi occurs with the ablative castris:
(1) castris e-gressi
    camp:abl.n.pl from-go:ptcp.nom.m.pl
    ‘having marched out of the camp’ (Caes., B. G., II, 11, 1)
2. (CaseNonPrev) as an NP, whose case is not the one governed by the preposition
corresponding to the preverb; see the following example, where the preverb a- is related to the preposition a/ab ‘from’, which governs the ablative case, but the
prefixed verb avertitur occurs with the accusative fontes:
(2) fontes-que a-vertitur
    spring:acc.m.pl-and from-turn:ind.prs.3sg.pass
    ‘he turns away from the springs’ (Verg., Georg., 499)
3. (PrepPrev) as a PP, whose preposition corresponds to the preverb; see the
following example, where the preposition e ‘from’ introduces a prepositional
phrase expressing the spatial complement of the verb egressi, formed with the
preverb e-:
(3) e castris Helvetiorum e-gressi
    from camp:abl.n.pl Helvetian:gen.m.pl from-go:ptcp.nom.m.pl
    ‘having marched out of the Helvetians’ camp’ (Caes., B. G., I, 27)
4. (PrepNonPrev) as a PP, whose preposition does not correspond to the preverb:
(4) ab-i-n e conspectu meo?
    from-go:ind.fut.2sg-part. from sight:abl.m.sg my:abl.m.sg
    ‘will you be away from my sight?’ (Plaut., Amph., 518)
Some studies in Romance linguistics have argued that Latin preverbs underwent
lexicalization and a gradual loss of semantic transparency (Tekavčić, 1972, §948.3–
1345; Salvi and Vanelli, 1992, 206; Crocco Galèas and Iacobini, 1992, 172; Vicario, 1997,
129; Haverling, 2000, 458–60; Dufresne et al., 2003, 40). This lexicalization has been
connected with the gradual loss of the case system in Latin and the trend towards
more analytic constructions formed with prepositions (analogous to PrepPrev and
PrepNonPrev) in Romance languages (Iacobini and Masini, 2007). This phenomenon
has been investigated in various qualitative studies based on sets of examples, rather
than corpora. As an illustrative example we will consider Bennett (1914).
In the preface to the second volume of his work on the syntax of early Latin, Bennett (1914, iii–iv) presents his methodological approach:
My task in the preparation of this second volume has been much more difficult than I had
anticipated. Barring a few of the more recent monographs, I soon found that the treatises
on which I had hoped largely to depend, were extremely defective, not only lacking a large
proportion of the important material, but being based, in great measure, on conjectural readings
of the past generation. Not infrequently false interpretations added to the confusion. Under
these circumstances, it became necessary to make my own special collections to supplement
the obvious lacunae encountered at almost every turn. The expenditure of time and labor thus
caused have unquestionably been greater than if I had made independent collections of the
entire material from the beginning. Nevertheless I believe that substantial completeness has
been achieved in the material here presented. Wherever possible, I have given the exact number
of instances of the occurrence of a usage. When a usage is found ten or more times, I have
marked it “frequent”.
This approach makes it impossible to derive falsifiable hypotheses from the author’s
claims, since they lack a quantitative account of the phenomenon; see for example
Bennett (1914, 131–2):
Of the foregoing prepositional compounds governing the dative, those with ante, inter, ob,
prae, sub, and super are used with the dative almost exclusively. They rarely take the accusative
or prepositional phrases as alternative constructions. Of the other compounds, those with
com- show greater hospitality toward the admission of alternative constructions, especially
prepositional compounds; while those with ad and in exhibit the greatest tendency in this
direction.
A general tendency is exhibited in all the compounds to employ the dative rather in
figurative relations than in literal ones, though examples of the latter are not especially rare.
Literal relations are expressed more usually by the accusative or prepositional phrases; yet we
frequently find figurative relations also expressed by these same means.
The word “frequent” can be applied to a range of cases, and its meaning depends
on its context of use and on the other terms of comparison, making it inadequate
by today’s standards of quantitative research, which can rely on large amounts of
data, processing power, and computational approaches that were not available in
Bennett’s time.
In contrast with this methodology, McGillivray (2013, 127–210) (and previously Meini and McGillivray 2010 and McGillivray 2012) employs a quantitative
corpus-based approach to investigate this topic, which relies on statistical and computational tools available today. Here we will use this study as an illustration of the
approach we propose.
The data frame format The study reported on in McGillivray (2013, 127–210) relies
on two corpus-driven valency lexicons for Latin verbs (see section 5.1.1), which were
derived automatically from two Latin treebanks (see section 4.3.1). This is an example of the reuse of previously built language resources, as well as of the benefits of
corpus annotation. In addition, the corpus data were systematically collected and
analysed using a multivariate approach, as we will see now.
The main object of investigation of this study is the type of construction observed
for the realization of spatial arguments of prefixed verbs, and specifically the four
options listed on pages 157–8. To start from a simple illustrative example, let us imagine
Table . Example of a data set recording the
century of the texts in which prefixed verbs were
observed, and the proportion of their spatial arguments expressed as a PP out of all their spatial
arguments
Century
2nd cent. bc
1st cent. bc
3rd cent. ad
4th cent. ad
Proportion of PP
0.1
0.2
0.8
0.9
that we have collected data on prefixed verbs in our corpus, and that we have
decided to represent them as a simple four-by-two table: each observation corresponds
to the century of the texts where a prefixed verb is found, and for each century
we have recorded the proportion of occurrences of spatial arguments expressed as
prepositional phrases (constructions PrepPrev and PrepNonPrev, according to the
terminology on pages 157–8), out of all spatial arguments. For example, in the second century bc 10 per cent of all spatial arguments of prefixed verbs are prepositional phrases, as shown in the first row of Table 6.1.
We can visualize the data in Table 6.1 geometrically using a Cartesian space, as
shown in Figure 6.1. In Figure 6.1 we can see the four points corresponding to the
four observations recorded in Table 6.1. The horizontal axis (i.e. x axis) corresponds to
the time variable showing the century, with negative values for bc dates and positive
ones for ad dates. So, the further to the right a point lies, the later its century. The
vertical axis (y axis) corresponds to the proportion of prepositional constructions,
ranging from 0 (all constructions are bare-case constructions) to 1 (all constructions
are prepositional constructions). For example, the first row of Table 6.1 corresponds
to the point (–2,0.1).
This two-dimensional representation allows us to express the interaction between
the time dimension (along the x axis) and the syntactic dimension (along the y axis).
This makes it possible to look for any pattern in the set of points corresponding to the
observations. One way to achieve that is by using linear regression models.
Linear regression models While analysing Figure 6.1, we saw that every point in a
two-dimensional space is associated with a pair of coordinates (x, y), one along the
horizontal axis (abscissa) and one along the vertical axis (ordinate); we have thus
represented the four rows of Table 6.1 as four points in a two-dimensional space.
Now, we may ask if we can detect any regularity in the way the ordinates change as
the abscissas change; one way to do that is to find a straight line that is as close as
possible to all four points of Figure 6.1.

Figure 6.1 Geometric representation of Table 6.1 in a two-dimensional Cartesian space.

We observe that the line in Figure 6.2 is a good
approximation of the four points.
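For readers who wish to reproduce this representation, the four observations can be plotted in R with a few lines (a minimal sketch using base R; the variable names are ours):

    century <- c(-2, -1, 3, 4)        # 2nd and 1st cent. BC, 3rd and 4th cent. AD
    prop_pp <- c(0.1, 0.2, 0.8, 0.9)  # proportions of PP arguments from Table 6.1
    plot(century, prop_pp, xlab = "Century", ylab = "Prop. preposition")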
Compared with dealing with a set of points such as those in Figure 6.1, dealing with
a straight line has some advantages. For example, all the points of a line share the
property that their ordinates y can be obtained from their abscissas x by applying this
formula:
y = a + b ∗ x
a and b are unique to each line; a (intercept) is the ordinate of the intersection point
between the line and the y axis, and b (slope) measures the steepness of the line (in
subsequent case studies we use the term coefficient to refer to the slope b). In our case,
the line is defined by the equation:
y = 0.36 + 0.14 ∗ x
For example, the abscissa of the point corresponding to the second row of
Table 6.1 is –1 and its ordinate is 0.2. The corresponding point on the line is
(–1, 0.14 ∗ (–1) + 0.36) = (–1, 0.22).

Figure 6.2 Line that best fits the four points in Figure 6.1.

In other words, the linear representation allows us
to quantify the magnitude of change along the y axis in terms of changes per unit
along the x axis. Another advantage is that we can measure how well the line fits the
points by taking into consideration the sum of distances between each point and the
line itself. A line that describes the points well will be closer to each point than a line
that is a poor fit to the data (this measure of fit is sometimes referred to as R² or the
coefficient of determination).
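The intercept and slope given above can be recovered from the toy data with base R's lm(), which performs least-squares linear regression (a minimal sketch):

    century <- c(-2, -1, 3, 4)
    prop_pp <- c(0.1, 0.2, 0.8, 0.9)
    fit <- lm(prop_pp ~ century)   # fit the regression line of Figure 6.2
    coef(fit)                      # intercept ~ 0.36, slope ~ 0.14
    summary(fit)$r.squared         # coefficient of determination
    abline(fit)                    # draw the line on an existing scatter plot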
When we approximate the points of our data set with a line, also called
a regression line, we are fitting a linear regression model, specifically a two-dimensional
linear regression model, to the data. In general, linear regression analysis constitutes
a series of multivariate techniques based on the idea that it is often beneficial to
approximate a set of points in a multidimensional space with linear, and hence simpler, models. Using common statistical terminology, we will call the variable describing the phenomenon we want to model (in this case the proportion of PP constructions by century) the response; the response is potentially affected by a range of other
variables, which we will call predictors; in our case the only predictor is the century.
Generalization to higher dimensions The two-dimensional representation of the data
in Table 6.1 and Figure 6.1 is a very simple case. Counting the number of times
each construction occurs by century would not fully describe other factors that may
affect the presence of such constructions. Accounting for this complexity in a post
hoc way by interpreting new variables in light of the results, without testing these
new variables properly, is detrimental to historical linguistics research, because it
does not quantify and test the role played by those variables. Instead, we argue that
it is methodologically more appropriate to collect the values of all these variables
upfront, in the data collection phase of the research, and to represent them in an
appropriately multidimensional data format which allows for further analyses (see
discussion in section 1.3.3). This is achieved by extending the reasoning done above
for two-dimensional spaces to the case of spaces in higher dimensions.
For illustration purposes, Table 6.2 shows a subset of the table (technically called
data frame) used to study the relation between the different variables in the study
on Latin preverbs described in McGillivray (2013, 127–210). Each row in this table
represents an observed instance in the source corpus (i.e. an occurrence of a prefixed
verb with one or more spatial arguments) and each column represents the value of each
variable for each observation. The first column contains the ID of the prefixed verb
in the original corpora, the second column shows ‘prep’, a binary variable indicating
whether (1) or not (0) the prefixed verb occurred with a PP spatial argument in
that specific instance, the third column contains the era of the text, the fourth the
prefixed verb’s lemma, the fifth the frequency of that verb in the corpora, the sixth
the mood of the verbal form, the seventh its voice as a binary variable (1 stands for
active or deponent, and 0 for passive), and the eighth the lemma of the preposition
found in the argument structure of the verb, if present. Such a data frame format represents very clearly the multidimensional nature of the data: it allows us to record a range of measurements for every corpus instance (the lemma of the verb, the form of the preverb, the case of the verbal argument, and so on). As a generalization of the
two-dimensional case illustrated earlier in this section, we can imagine that the seven
variables in Table 6.2 correspond to as many dimensions, describing different features
of each observation.
Table 6.2 Subset of data frame used for study on Latin preverbs in McGillivray (2013, 127–210)

    id     prep  era        verb      freq_verb  mood  voice  prep_type
    24290  0     Classical  abeo      28         inf   1      NA
    32817  1     Classical  abigo     3          sbjv  1      ab
    32289  1     Late       abscedo   6          inf   1      ab
    11028  1     Late       abstraho  11         ind   0      ab
    12831  1     Late       abstraho  11         ind   1      ab
    17440  1     Late       abstraho  11         inf   1      ab
    17526  1     Late       abstraho  11         part  0      ab

Once we have the data in a data frame format, if we want to identify the relationship between the response and the predictors, we can resort to a range of multivariate
statistics techniques, such as regression models introduced above and described more
fully in McGillivray (2013, 162–6).
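As an illustration of the format, the first three rows of Table 6.2 can be represented in R as follows (a sketch; the object name is ours and the values are those of the table):

    preverbs_sub <- data.frame(
      id        = c(24290, 32817, 32289),
      prep      = c(0, 1, 1),
      era       = c("Classical", "Classical", "Late"),
      verb      = c("abeo", "abigo", "abscedo"),
      freq_verb = c(28, 3, 6),
      mood      = c("inf", "sbjv", "inf"),
      voice     = c(1, 1, 1),
      prep_type = c(NA, "ab", "ab")
    )
    str(preverbs_sub)  # one row per corpus instance, one column per variable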
Other models The regression model that best fits the data in Table 6.2 is a
mixed-effect logistic regression model, a special class of generalized linear models.
Generalized linear models generalize the logic behind linear regression models to
response data that are not normally distributed, usually via some transformation of
the response. Mixed-effect models involve a set of predictors (or fixed effects) and
a set of so-called random effects. The random effects are responsible for the group-level variation in the model and are particularly useful with diachronic linguistic data,
which tend to have uneven composition with respect to the set of authors of the texts.
In the case at hand, the data set has an uneven composition of authors, with some
authors being more heavily present than others, and setting author as a random effect
accounts for this fact. For a more thorough explanation of mixed-effects models, see
McGillivray (2013, 177; 189–90), as well as Baayen (2008), Tagliamonte and Baayen
(2012), and Baayen (2014). Further examples and a discussion of generalized linear
models and generalized mixed-effects models are given in sections 6.3 and 7.3.
Logistic regression models are used when the response variable is binary, as in
the case of the prepositional vs bare-case construction for Latin prefixed verbs.
Logistic models estimate the probability (from 0 to 1) of switching from one of the
two outcomes to the other, given the value of the predictors. This probability is not
estimated directly, since the values bounded by 0 and 1 cannot be easily handled via
a straight regression line. Instead, the so-called logit function is used to transform the
probabilities onto a scale that ranges from −∞ to ∞; when applied to a probability p
between 0 and 1, this function returns the logarithm of the odds1 of p:

logit(p) = log(p / (1 − p))
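In R the transformation can be written directly (a minimal sketch):

    logit <- function(p) log(p / (1 - p))  # log odds
    logit(0.5)          # 0: even odds
    logit(c(0.1, 0.9))  # approx. -2.197 and 2.197, symmetric around 0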
In the case of the preverb study, the best model for predicting the type of construction (response) is based on the following predictors: the lemma of the preverb, the era
of the text, and the case of the verb’s argument; its random effects are the semantic
class of the verb (motion, rest, or transport) and the author of the text. The model can
be expressed as follows:
(5) Response: probability of switching from a bare-case construction to a prepositional construction, modelled as depending on
    fixed effects: preverb + era + case
    random effects: author, with a random slope for class
1 We can think of odds as the ratio of an event to its corresponding non-events over a sufficiently long
time. For example, if we roll a fair die, the odds of getting 6 are 1 to 5.
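A model of the general shape of (5) could be fitted in R with the lme4 package (a hedged sketch, not the original study's code; the data frame preverbs and its column names are illustrative):

    library(lme4)
    # binary response 'prep'; fixed effects preverb, era, and case;
    # random intercept for author with a random slope for class
    m <- glmer(prep ~ preverb + era + case + (1 + class | author),
               data = preverbs, family = binomial)
    summary(m)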
A multivariate exploratory technique: CA Multivariate techniques have applications
beyond the hypothesis-testing setting outlined above. For the purposes of data exploration, we can use other types of multivariate techniques to find associations
between different variables in our data. The class of multivariate techniques includes so-called dimensionality reduction models, which all aim at reducing the
variation in the data in a systematic manner that lends itself to interpretation. Linguistic applications of such techniques are discussed in Baayen (2008, 118–37) and
Jenset and McGillivray (2012).
One such multivariate technique that is highly useful for corpus data is correspondence analysis (CA) and its generalization to more than two variables called MCA or
multiple correspondence analysis (Greenacre, 2007). Exploratory techniques follow
the principle formulated by Benzécri (1973): ‘the model should follow the data. The
data should not follow the model.’ CA aims at finding the essential structure of the
data by reducing the original multidimensional space to a lower-dimensional space
(typically consisting of two or three dimensions) that is easier to interpret. CA is
discussed more extensively in McGillivray (2013, 168–9), and here we will give an
example of its use to capture the multidimensional nature of the data set and research
questions relative to the Latin preverbs study.
We will focus on the following variables, as they potentially interact with the
syntactic constructions that are the object of the study:
• ‘construction’, with values from 1 to 4, corresponding to ‘CasePrev’, ‘CaseNonPrev’, ‘PrepPrev’, and ‘PrepNonPrev’;
• ‘era’, a broad chronological classification of the authors of the data set: ‘early’
(Plautus), ‘classical’ (Caesar, Cicero, Ovid, Petronius, Propertius, Sallust, Vergil),
and ‘late’ (Jerome and Thomas);
• ‘case’, the case required by the preposition corresponding to the preverb (ablative
or accusative);
• ‘sp’, a representation of the lexical–semantic properties of the verbal arguments, based on the arguments’ lexical fillers;
• ‘class’, the semantic class of the verb, with values ‘motion’, ‘rest’, and ‘transport’.
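An MCA on these variables can be run, for example, with the FactoMineR package (a sketch under the assumption that the variables above are columns of a data frame preverbs; the names are illustrative):

    library(FactoMineR)
    vars <- c("construction", "era", "case", "sp", "class")
    mca <- MCA(preverbs[, vars], graph = FALSE)       # multiple correspondence analysis
    round(mca$eig[1:2, "percentage of variance"], 1)  # explained inertia of first two axes
    plot(mca)                                         # two-dimensional map as in Figure 6.3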
Figure 6.3 shows the result of the analysis. It is a two-dimensional representation
of the original data set capturing the essential structure of the data. We can assess
how accurate such approximations are by considering how much of the variability in
the data is expressed by the analysis (so-called percentage of explained inertia). The
representation in Figure 6.3 accounts for 53.0 per cent of the variability in the data
and highlights associations between the variables. Thanks to the bidimensionality of
the plot, we can detect complex relations where more than two variables interact: for example, constructions 4 (PrepNonPrev) and 3 (PrepPrev) are associated with the late-era authors (Jerome and Thomas), while construction 1 (CasePrev) tends to interact with the classical authors and motion verbs.
Figure 6.3 Plot from MCA on the variables ‘construction’, ‘era’, ‘preverb’, ‘sp’, and ‘class’. The first axis accounts for 34.6 per cent of the explained inertia, the second axis for 18.5 per cent.
This is a simple example of the power of multivariate statistical techniques in
capturing the multidimensional nature of the data in a systematic and quantitative
way, which is a good basis for the subsequent theoretical interpretation.
6.3 The rise of existential there in Middle English
As a case study of how quantitative corpus methods can shed new light on historical
linguistic questions, we will discuss a syntactic change that took place in Middle
English. Middle English was the language of England during a period roughly
extending from 1100 to 1500, or from just after the Norman Conquest to just after
the introduction of the printing press in England.
The grammatical change in question is the evolution of existential (sometimes
called “expletive” or “dummy”) there in English. The contemporary examples from
Davies (2008) in Examples (6) and (7) illustrate the difference between existential there
and the locative adverbial use of the morpheme:
(6) There is a house to the north. (existential)
(7) So it’s been accumulating there, now, for 60-some years. (locative)
During the Middle English period, existential there gradually became more frequent in
existential constructions, that is, constructions that serve to present new information
about the existence of some entity, as in Example (6). The rise of existential there
meant the demise of a corresponding existential construction without there, i.e. with
what we informally can call a null, or empty, variant. The two competing Middle
English existential constructions are exemplified below with sentences taken from
Chaucer’s Canterbury Tales (Benson, 1987), respectively the General Prologue and The
Monk’s Tale:
(8) With hym ther was a Plowman (GP: 529)
    “With him there was a ploughman”

(9) [Ø] Was nevere wight, sith that this world bigan, That slow so manye monstres (MkT: 2111)
    “There was never man since the world began who killed so many monsters”
As the two examples from Chaucer attest, the two variants could be used interchangeably even by the same author. As with Middle English in general, we also find
considerable variation in how there was spelt, including ther, þer, and ðere.2 To simplify
matters, we will use the modern spelling for there when referring to the existential
pronoun. We will refer to the absence of the pronoun, i.e. the null variant, as Ø.
Although existential constructions with there are found in Old English (Breivik,
1990), it was during the Middle English period that constructions with there (hereafter there1, to distinguish it from the locative adverbial use of the morpheme there
exemplified in Example (7)) became the dominant variant. Simultaneously, the null
variant gradually fell out of use, but with considerable synchronic variation, as the
examples from Chaucer illustrate.
The reasons behind this constructional change are less clear, however. Breivik
(1990) essentially takes a pragmatic, functional-typological view, whereby syntax and
pragmatics interacted to make there1 obligatory due to the loss of verb-second (V2)
word order in English, so that “the increasing use of there1 in earlier English is part of
a series of parallel syntactic changes, acting in a coordinated manner and pushing
the language from one typological category to another” (Breivik, 1990, 247). The
increasing use of there1 is attributed to pragmatic, information-related factors that
together make the reanalysed locative adverb there1 gradually more obligatory in
contexts where new information is being introduced. One such pragmatic factor is
what Breivik (1990, 140–50) calls the visual impact constraint: if the post-verbal noun
phrase being introduced by the construction refers to something abstract or non-concrete, there1 serves as an added, pragmatic introductory signal. Williams (2000),
on the other hand, takes the view that this change was at least partially connected
with the loss, or lack of productivity, in verb-initial (V1) word order. A third angle
is presented by Jenset (2014), who finds evidence for sociolinguistic factors being
involved, following Croft (2000) and Blythe and Croft (2012).
2 Jenset () identifies twenty-seven spelling variants of there based on data from the PPCME
treebank (Kroch and Taylor, ).
Identifying causes for linguistic change is of course challenging, as we have discussed earlier. In addition to the problems facing anyone wishing to establish a causal
effect, the different linguistic paradigms will approach the question of causality in
very different ways. Ringe and Eska (2013), who explicitly position themselves in
a generative paradigm, stress the importance of errors when children learn their
first language, as well as contact-induced second-language learner errors, as the
primary sources of linguistic change. In the cognitive-functional paradigm, Croft
(2000) dismisses language learning and instead focuses on sociolinguistic context and
usage preferences as selection mechanisms in the evolutionary sense of the word.
These are but two examples, but they illustrate well the thorny problems facing
anyone attempting to establish causes for linguistic change, not to mention establish
a consensus (see principle 1, section 2.2.1). We believe that a focus on the empirical
consequences associated with the claims and proposed models can help to bridge this
gap, and evaluate the competing claims and models across paradigms.
A crucial step in quantitative historical linguistics involves translating the competing claims and models into questions that can be answered with statistical techniques.
This step allows us to assess the quantitative consequences of the competing claims,
and hopefully reach a consensus model. Crucially, quantitative arguments are not
themselves sufficient, as we argued in principles 11 (section 2.2.11) and 12 (section 2.2.12). If the linguistic phenomenon is multivariate, then multivariate techniques
are required, and those techniques must be applied according to best practices. In the
case of the Middle English existential construction, both Breivik (1990) and Williams
(2000) rely on empirical and quantitative arguments, based on extensive collections
of data. We need some form of quantitative evaluation that can weigh the different
options against each other.
Clearly, two of the proposed explanations directly or indirectly imply a correlation
between different existential subjects (there1 and Ø) and changes in (surface) word-order probabilities. Breivik’s argument implies a correlation with the loss of V2 word
order, whereas Williams’s argument implies a correlation with the loss of V1. However,
as we established in principle 3 (section 2.2.3), any claim about the linguistic past
that is not physically or logically impossible has a non-zero probability of being true.
Principles 4 (section 2.2.4) and 5 (section 2.2.5) established that, since a claim based
on strong evidence has more merit than one based on weak evidence, we can recast
the claims in terms of the relative strengths of the correlations.
1. The strongest correlation is found between there1 and the loss of V1.
2. The strongest correlation is found between there1 and the loss of V2.
3. No real correlation is found between there1 and any word-order pattern.
The question can then be rephrased as follows: is the loss of V1 or V2 more
important, or is some other variable of greater importance? Since the rate of V1 and V2
word-order patterns in main clauses can be retrieved from a corpus, we can investigate
directly the question of correlation between (surface) word-order patterns and rates
of there1.
At this point, we might reach for a statistical null-hypothesis test to measure the
correlation (after retrieving the necessary information from a corpus). However, there
are also other, competing claims and hypotheses to take into account. Jenset (2014)
argues that sociolinguistic status might have been involved in the use and non-use of
there1 in Old English, and suggests that this would also be the case for Middle English.
In other words, a realistic quantitative assessment must handle not only the correlation
between the realization of the existential subject and the word-order patterns, but also
simultaneously account for sociolinguistic factors.
Furthermore, the evaluation of the claims and models must take into account
pragmatic, or information-theoretic, factors. Breivik’s (1990) suggestion that there1 gradually becomes obligatory in contexts where new information is introduced is expanded upon in Breivik (1997), which argues that the development of there1 can be seen as a form of grammaticalization, whereby the function of there1 as a signal of new information
becomes increasingly tied to a fixed grammatical context. Since the grammatical
context is observable in a corpus, we can predict that the increased importance of such
a signal function should manifest itself in an increasing statistical correlation with the
surrounding grammatical context. Below we will focus on the grammatical element
that follows there1 . Based on this prediction, we would say that such a correlation
with the context would strengthen the semantic-pragmatic claims made by Breivik
(1990).
Another aspect of the discourse-pragmatic argument made by Breivik (1990) is the
tendency for complex elements to occur later in the clause, also known as “Behaghel’s
Law” (Köhler, 1999). This suggests that the complexity of the sentence is a possible
factor. Jenset (2013) investigated the gradual evolution of the two uses of there in
early English, and found that syntactic complexity, as measured by a composite index
weighing the number of NPs and finite verbs against the total number of elements,
was a significant predictor in distinguishing there1 from there2 .
This leaves us with a number of competing claims and hypotheses regarding
the competing use of Ø and there1 in Middle English, based on sociolinguistics,
pragmatics (context, complexity), and the effect of word-order patterns. All have
varying claims to explanatory power regarding the phenomenon we are studying, and
the next section discusses the data used to assess these claims and hypotheses.
6.3.1 Data
We used syntactically annotated Middle English data as the basis for the statistical
investigation, i.e. model serialization (section 1.1). The data were drawn from the
PPCME2 treebank (Kroch and Taylor, 2000), which comprises around 1.1 million
words of prose covering the period from around 1150 ce to 1500 ce. We used a bespoke
Python script to extract information about sentences with there1 or Ø, as well as
(surface) word-order patterns in main clauses. The data and the code for this study
are available on the GitHub repository https://github.com/gjenset.
Advantages of treebanks A methodological remark about the data source is in order.
We introduced treebanks in section 4.3.2; the detailed syntactic annotation that
treebanks provide is the only level that makes studies like this one feasible at a large
scale, with the advantages of reproducibility that we stressed in section 4.1.1.
Since the PPCME2 annotates existential uses of there differently from locative uses,
by means of assigning an ‘EX’ tag to the former and an ‘ADV’ tag to the latter, we
could single out the cases of there1. Furthermore, since empty pronouns are annotated with an ‘*exp*’ tag, we could also identify the Ø variants. According to the corpus documentation, the *exp* tag is ambiguous since it is also used for subjects in impersonal constructions. However, we could exclude these cases by checking the context of the *exp* tag, thus allowing us to extract only the existential sentences.
We could also rely on the treebank annotation when extracting only existential
subjects from main clauses, thus avoiding the added complexity of dealing with both
main and subordinate clauses. Furthermore, two of the hypotheses discussed above
involve probabilities of (surface) word-order pattern. The topic of word-order patterns
in early English is a complex area; see e.g. Heggelund (2015). We take a fairly atheoretical approach (Horobin and Smith, 2002, 99–103) where any verb-initial main
clause is considered V1 if the first element is a finite verb (including verb clitics with the
negative particle ne). By means of the corpus annotation, we could exclude imperative
sentences and questions from our data.
Some of the choices we have made will be contested by other linguists. By using
a Python script to extract the data, we have a concrete documentation of those
choices, which can be shared with, and critiqued by, other scholars in a transparent
manner.
Data description We found a total of 1688 main clauses with an existential subject,
807 (48 per cent) of which could be analysed as having an empty pronoun.
For each sentence, we recorded the name of the corpus file it was found in, its
period (from the Helsinki corpus metadata, incorporated into the PPCME2), the
unique identifier of the sentence, the dialect of the text in question as recorded in
the corpus documentation, and the manuscript date. Whenever the manuscript date
was uncertain according to the documentation, a reasonable date was chosen based on
the information in the corpus metadata. As we will see, that simplification is probably
warranted and not a major problem for the analysis, since the techniques being used can in any case handle some noise in the observed data.
We also collected more data about the sentences, to properly test the competing
claims and hypotheses mentioned above. This includes whether or not the subject
of the main clause is an empty expletive, the corpus tag of the next element in the
main clause, and the conditional probability of finding that next element after an
empty or non-empty existential subject (respectively). This probability was calculated as the frequency of an item x following either Ø or there1, divided by the number of occurrences of x in the entire corpus. Also included are two columns providing the
relative frequencies of main clauses displaying the V1 and V2 word order in the given
corpus file. Finally, we included a sentence-level variable recording the maximum
depth of syntactic embedding for each sentence as an approximation to its syntactic
complexity.
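One literal reading of this context measure can be sketched in R as follows (hypothetical data structures: d has one row per existential sentence, with columns next_tag and subject; corpus_tags is the vector of all tags in the corpus):

    # Sketch of the association measure described above
    p_next <- function(tag, subj, d, corpus_tags) {
      n_after <- sum(d$next_tag == tag & d$subject == subj)  # x after this subject type
      n_after / sum(corpus_tags == tag)                      # normalized by corpus frequency of x
    }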
6.3.2 Exploration
Faraway (2005, 2) emphasizes graphical analyses as a means to properly understand
the data, and in this section we present a number of exploratory plots of the data.
Realization of subject The plot in Figure 6.4 visualizes the distribution of the existential subjects over time: it shows the shifting probabilities of the there1 and Ø realizations of the existential subject.

Figure 6.4 Graph showing the shift in relative frequencies of existential there and empty existential subjects during the Middle English period.

The resulting S-curve is expected based
on previous research on existential there (Breivik, 1990, 226; Jenset, 2010, 273). This
distribution of the realization of existential subject in Middle English shows how there1
rapidly overtakes the null variant after the end of the thirteenth century. Consequently,
a potentially explanatory variable whose explanation assumes a correlation with the
rise of there1 would need to display a roughly similar distribution. However, we can
also note that Blythe and Croft (2012), using simulation modelling based on diachronic
linguistic data, conclude that such an S-curve is also driven by sociolinguistic factors,
specifically different social prestige associated with the linguistic variants in question.
If this is correct, we might also expect to see some effects in our data related to variables
that can express different social prestige, specifically genre and dialect.
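The S-shaped trend itself can be approximated with a simple logistic regression of subject realization on manuscript date (a sketch only; d is a hypothetical data frame with columns subject and ms_date, and this two-variable model deliberately ignores the other factors discussed below):

    d$there <- as.numeric(d$subject == "there")             # 1 = there, 0 = null
    m <- glm(there ~ ms_date, data = d, family = binomial)  # logistic regression
    dates <- data.frame(ms_date = 1150:1500)
    plot(dates$ms_date, predict(m, dates, type = "response"),
         type = "l", xlab = "MS date", ylab = "P(there)")   # fitted S-curve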
Word-order patterns In Figure 6.5 we see the main trends for the two word-order
patterns. As expected, the V1 pattern is much less frequent than the V2 pattern.
The V2 pattern is receding more than the V1 pattern, but both appear to stabilize,
with visible variation in frequency, in the second half of the Middle English period.
Figure 6.5 Distribution of V1 and V2 word-order patterns. The lines are smoothed, locally adjusted regression lines that outline the main trends for the two patterns.
A caveat is that at least some of this fluctuation may reflect uncertainty in manuscript
dating, as mentioned above. The lines in the plot are locally smoothed non-parametric
regression lines (Venables and Ripley, 2002, 230), which are useful for exploration
and identification of the predominant behaviour in the data. Based on those lines,
we would be inclined to think that the loss of V2 word order is a more promising
explanatory variable than the loss of V1, since the former appears to have a clearly
observable (negative) correlation with the rise of existential there.
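Smoothed trend lines of this kind can be produced in base R with lowess() (a sketch; files is a hypothetical data frame with one row per corpus file and illustrative column names):

    plot(files$ms_date, files$v2_rate, xlab = "MS date", ylab = "Proportion V2")
    lines(lowess(files$ms_date, files$v2_rate))  # locally smoothed trend line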
Conditional probability of right context In Figure 6.6, we can see the probability of
the grammatical elements found after the existential subjects, plotted as a box-and-whiskers plot.
The median (black line) for the elements following there1 is higher than for Ø,
suggesting that there1 is conventionally bound to the context following the morpheme.
Despite a number of outliers, the Ø subject appears with more variable contexts.
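Box-and-whiskers plots like those in Figure 6.6 and in Figure 6.7 below are one-liners in base R (a sketch; d is the hypothetical data frame introduced above, here with columns next_prob and depth):

    boxplot(next_prob ~ subject, data = d,
            ylab = "Probability of next constituent")  # cf. Figure 6.6
    boxplot(depth ~ subject, data = d,
            ylab = "Maximum degree of embedding")      # cf. Figure 6.7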
Figure 6.6 Box-and-whiskers plot of conditional probabilities of elements following existential there and empty existential subjects.

Figure 6.7 Box-and-whiskers plot of the maximum degree of embedded (phrase-structure) elements for sentences with there and empty existential subjects.

Maximum sentence depth As mentioned above, we decided to use the maximum embedding depth (based on the phrase-structure annotation of the treebank) as a
proxy for linguistic complexity. A higher number indicates more levels of embedding
in the sentence. Figure 6.7 shows the maximum sentence depth by existential subject
in a box-and-whiskers plot. Sentences with there1 have a lower degree of embedding,
suggesting a slightly simpler clause structure on average.
Figure 6.8 shows the maximum degree of embedding for all the sentences in the sample over time. Although there appears to be a very slight tendency towards less embedding in later sentences, the overall impression is one of stability.
Figure 6.8. Maximum degree of embedding for all sentences in the sample over time, with added non-parametric regression line. [Axes: MS date (1200–1500) against maximum degree of embedding (0–20).]
Genre Genre is a possible confounding factor in the analysis, since the choice of existential subject could be motivated by stylistic factors. Although there1 is near-obligatory in the present-day English existential construction, it can still be omitted under some circumstances, as exemplified in Example (10) from Davies (2008).
(10) Behind the door was a large room lit by strips of blue phosphor laid across the ceiling.
Such examples are associated with particular styles and genres, although Coopmans
(1989) argues they can be accounted for by syntactic means.
Figure 6.9 shows the distribution of existential subjects by genre. Some genres
are clearly more frequent than others, notably religious treatises, history, homilies,
romances, travelogues, and sermons. Three of these (religious treatises, homilies, and
travelogues) have a majority of null-existential subjects.
Dialect Dialect is interesting because of its role as a possible proxy for typical
sociolinguistic variables such as regional or local identity. Although any kind of ethnic
identity is too complex to be reducible to a direct correspondence with language
(Fought, 2002), it is still reasonable to assume that language can express regional in
addition to national identity (Chambers, 2002, 362–4).
Figure 6.10 shows the existential subject by dialect, and we can see that there1 is
the most frequent variant in most dialects, except in East Midlands and Northern
Middle English. The similarity between East Midlands and Northern material is not
unexpected, since these Middle English dialects both came out of the Old English Anglian dialect area (Corrie, 2006; Irvine, 2006).
Figure 6.9. Bar plot of counts of existential subjects by genre. [Y-axis: counts, 0–250; paired bars for There and Ø across the twelve genres.]
Figure 6.10. Bar plot of counts of existential subjects by dialect. [Y-axis: counts, 0–400; paired bars for There and Ø across the five dialects.]
.. The choice of statistical technique
In section 2.2.12 we argued for the importance of choosing the right statistical test
or model to evaluate competing claims and hypotheses. Since more than one of the
hypotheses could be correct (section 2.2.11), we must evaluate them against each
other. Simply counting the raw number of observations in different categories and
interpreting them as frequent or infrequent without a further frame of reference
(as done in e.g. Bybee, 2003) will not do. Stefanowitsch (2005) discusses this “raw
frequency fallacy” and points out that the correct approach is to compare the observed
counts with the expected numbers, using some plausible statistical model. However,
traditional null-hypotheses are not really suited for our purpose here, since they are
designed for testing a hypothesis against a null-hypothesis, not for comparing multiple
hypotheses (Gelman and Loken, 2014).
To illustrate this point, consider Pearson’s chi-square test. The chi-square test, which
compares the difference between observed counts and expected counts based on the
chi-squared distribution, is a statistical test that many linguists will be familiar with,
either through use or exposure in the literature. The test is fairly simple and can be
found in introductory books to corpus linguistics, such as Gries (2009b). However, the
test itself is less than ideal for the purposes we are considering here, as we will show.3
The perils of chi-square For the purposes of illustration, we consider the evolution
of there1 in light of dialects, one of the possible variables correlated with there1 . As
discussed above, we must take into account the possibility that dialect variation is a
piece of the puzzle of there1 . Table 6.3 provides an overview of there1 and Ø by dialect,
and it is clear that the Midlands dialects account for the bulk of the occurrences,
3 Much of the same criticism pertains to another popular test, Fisher’s exact test, but we will only consider
Pearson’s chi-square here.
Table . Frequencies of there1 and Ø
according to dialect in Middle English
Dialect
East Midlands
Kentish
Northern
Southern
West Midlands
there1
Ø
389
51
24
114
303
465
25
64
46
207
with more instances of there1 than Ø in West Midlands, and the reversed situation in
East Midlands. The Northern dialect is the only other dialect that has more instances
of Ø than there1 . However, we need to take into account both the total number of
observations by dialect and the total number of observations for the two existential
variants, since they are obviously attested to very different degrees.
We can take the different numbers of observations into account by comparing the
observed number in each cell to its theoretical expected frequency based on the row
and column sums. This expected frequency based on rows and column sums is how
we can take the different numbers of observations for the different categories into
account. By taking the difference between each pair of observed and expected values, squaring it (which avoids negative numbers), and dividing by the expected value, we are left with a proportional deviation from the expected. Adding all these squared proportional deviations together, we are left with the chi-square score. This
chi-square score is essentially a measure of how much the observed values in the table
overall differ from their expected counterparts. A larger chi-square score would in
general signal a larger deviation from expected values than a smaller one. If we take
the number of categories, i.e. rows and columns, into account (for reasons of simplicity
of exposition we gloss over a deeper discussion of degrees of freedom), we can
compare the chi-square score to the chi-square distribution, which is usually a good
model for such differences of expected and observed counts if there are no systematic
correlations between rows and columns. Finally, a p-value signals the degree to which
the observed data are likely given the chi-square distribution. The last point is worth
underlining, since it is often misconstrued (Cohen, 1994). The p-value is not the
probability of the null hypothesis, or the probability of being incorrect regarding some
hypothesis. The p-value is the probability of observing the data in the table (or some
more extreme version of it), if we assume the chi-square distribution to be a good
model. If the p-value is small (conventionally below 0.05), we can assume that the
chi-square distribution is not a good model, and that there are indeed correlations
between rows and columns.
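To make the mechanics concrete, the counts in Table 6.3 can be arranged as a matrix and passed to R's built-in chisq.test() function, which performs exactly the observed-versus-expected comparison described above. This is our illustrative sketch rather than the original analysis code; the object name dialect.tab is invented:

    # Counts from Table 6.3: rows are dialects, columns are there1 vs the null subject
    dialect.tab <- matrix(c(389, 465,
                            51,  25,
                            24,  64,
                            114, 46,
                            303, 207),
                          ncol = 2, byrow = TRUE,
                          dimnames = list(c("East Midlands", "Kentish", "Northern",
                                            "Southern", "West Midlands"),
                                          c("there1", "null")))
    chisq.test(dialect.tab)  # X-squared = 77.72, df = 4, p-value well below 0.001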
From the discussion above it should be clear that the chi-square test is not a
particularly intuitive way of reasoning about data. Since the p-value gives us the
probability of the observed data given the chi-square distribution, a precise hypothesis
is needed to make any sense of the results. Furthermore, as we discussed in section
3.7.5, the chi-square test is vulnerable to an inflation effect, which was noted as early as Berkson (1938). Mosteller (1968, 1), in a discussion of chi-square testing in corpus
linguistics that foreshadowed the discussion reported on in section 3.7.5, wrote that
“I fear that the first act of most social scientists upon seeing a contingency table is
to compute chi-square for it.” The reason for Mosteller’s fear is that having much
data will inflate the final chi-square value used to compute the p-value. The reason
for this inflation effect is that the chi-square deviation from the expected values is
roughly proportional to the size of the sample (Mosteller, 1968, 2). In short, the p-value
tells us whether we have the required minimum sample size to detect a correlation
between rows and columns, but as the sample size grows larger, the p-value becomes
increasingly less informative.
For the data in Table 6.3, the result of a chi-square test in R is statistically significant
(χ² = 77.72, df = 4, p < 0.001), with a p-value so small that it is indistinguishable
from zero for all practical purposes. However, since we have 1,688 observations in
total in the table, we should not be surprised at the small p-value. In section 3.7.5 we
discussed how Gries (2005) showed that the use of effect size measures to some extent
can mitigate the inflation effect that comes with corpus data. A convenient effect size
measure for chi-square tests is φ for two-by-two tables, and Cramér's V for larger tables.
The details in the calculations differ, but both measures attempt to counter the inflation
effect by dividing the chi-square sum for the table by the total number of observations
in the table (while also taking the number of rows and columns into account: see Gries
2005 for details).
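Cramér's V is simple enough to compute by hand. The following sketch, again our own illustration reusing the dialect.tab matrix from the snippet above, divides the chi-square statistic by the sample size (and the relevant table dimension), as just described:

    # Cramér's V = sqrt(chi-square / (n * (min(rows, columns) - 1)))
    chi2 <- unname(chisq.test(dialect.tab)$statistic)  # 77.72
    n <- sum(dialect.tab)                              # 1,688 observations
    V <- sqrt(chi2 / (n * (min(dim(dialect.tab)) - 1)))
    V    # approximately 0.22
    V^2  # approximately 0.05, the share of variation explained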
The Cramér's V effect size measure for Table 6.3 is 0.22, which is a small effect for a table of this size (i.e. with five rows and two columns), with this many observations. Another way of thinking of this effect size is to consider it a measure of how much of the variation in the table can be explained by the correlation between rows and columns. We can approximate this by taking the square of the effect size, which in this case amounts to about 0.05 with rounding. In other words, the correlation between rows (dialects) and columns (there1 vs Ø) explains about 5 per cent of the variation in the data. Such modest explanatory power is hardly impressive, and it shows
clearly that dialect differences alone are incapable of explaining all the variation in
the data. This is not to say that dialect differences play no role at all, but rather
that whatever influence they exert must be weighed up against other potentially
explanatory variables. However, bringing in more variables is associated with other
challenges.
The perils of multiple testing Above we showed that a simple chi-square test of a table
of data with counts of there1 and Ø was incapable of fully describing the variation
between the two variants. As we will see, attempting to repeat the same procedure for
more variables is not the correct way to solve the problem.
Table . Frequencies of there1 and Ø
according to genre in Middle English
Genre
Bible
Biography, Life of Saint
Fiction
Handbook
History
Homily
Philosophy
Religious Treatise
Romance
Rule
Sermon
Travelogue
there1
Ø
30
9
36
8
176
21
10
212
157
33
117
72
7
29
11
10
73
220
0
273
25
24
33
102
In Table 6.4 we have collected the counts of there1 and Ø by genre. A Pearson
chi-square test again informs us that we have enough data to detect an association
between rows and columns (χ² = 409.86, df = 11, p < 0.001), with a p-value practically indistinguishable from zero. Also, to obtain the strength of the association, we again calculate Cramér's V, which is 0.49. This is a medium effect size for a table of this
size, and it would theoretically explain about 24 per cent of the variation in the table.
At first glance this is promising, since we have established that genre appears more
important than dialect. However, so far we have only considered two tables, both of
which contain nominal, or count, data. As we showed above, we collected various
types of data, including proportions of word-order patterns and counts of maximum
syntactic embedding.
Following the approach here, we would end up with a whole series of statistical test
results, not all of them coming from a Pearson chi-square test, with different effect
size measures, that would have to be compared against each other. To make matters
worse, such multiple testing on the same data makes it easier to find significant p-values by sheer chance, which would need to be taken into account by some sort of
correction. Moreover, since the testing is done by slicing and dicing the data with
each table unconnected with the others, we would have no means of working out any
deeper connections or correlations among the variables found in different tables. And,
finally, as pointed out above, the logic of each null-hypothesis test is strictly speaking
invalidated, since we intend to compare multiple alternative hypotheses (Gelman and
Loken, 2014). Hence, chi-square testing of the type outlined here is seldom the best
choice for historical corpus data.
Regression modelling The problems discussed above concerning Pearson’s chi-square
test (or similar tests such as Fisher’s exact test) point towards one conclusion: when
historical linguistic data are viewed in their proper, multivariate context (see principle
11 in section 2.2.11), the appropriate technique is one or more of the multivariate
techniques discussed in section 6.2 (given the appropriate data). The exact choice
of technique depends on exactly how the question is conceptualized. For instance,
Jenset (2010) modelled the difference between there1 and there2 in historical varieties
of English by means of a mixed-effects binary logistic regression model. McGillivray
(2013, 190–3) employs the same technique to model the binary choice between the
bare-case constructions and the prepositional constructions for the realization of the
argument structure of Latin prefixed verbs. In both cases the model was used to estimate the probability of switching from one variant to the other given some combination
of variables. However, this is merely one way of looking at the data. McGillivray
(2013, 202–10) used multivariate exploratory techniques like CA to explore a range
of variables affecting the argument realizations of Latin prefixed verbs; Jenset (2013)
uses a similar approach to better understand the difference between there1 and there2
from a distributional semantics perspective.
For the present case study, we find it useful to conceptualize the problem as a binary
choice between there1 and Ø, and the influence that the variables discussed above have
on this choice. For this reason our main technique in the analysis below will be a binary
logistic regression model; see section 6.2 for details.
.. Quantitative modelling
In order to test the hypotheses in question, we chose to use a binary logistic regression
model, a type of regression model that is useful and well tested in linguistics (Bayley
2002; Baayen 2008; Johnson 2008).
We initially considered both ordinary logistic regression models and mixed-effects
models (see section 6.2). However, during the model evaluation phase, we found
that the ordinary logistic regression model provided the best fit to the data. The
question of proper model evaluation or criticism follows from principle 12 (see section
2.2.12, regarding adherence to best practices in applied statistics). Model criticism as
best practice is recognized both by statisticians (Faraway, 2005, 53–75) and linguists
(Baayen, 2008, 188–93; Hilpert and Gries, 2016).
Ultimately, some trial and error is involved in hitting upon a model that fits the
data well. Baayen (2008, 236) points out that linguistic hypotheses are often somewhat
under-specified; however, even quite specific hypotheses will often require some trial
and error in finding a good model, which shows that we are not dealing with some
kind of mechanical process without room for the judgement of the researcher.
The model The model has three elements: a response (i.e. a dependent variable), a
set of predictors (i.e. fixed effects or independent variables), and an error component
modelled by the binomial distribution. The model that fit the data best has the
structure outlined below, and we created it with the glm() function in R.
(11)
Response: probability of switching from there1 to Ø modelled as depending on
the following
fixed effects: probability of context + max sentence depth (log scale) + proportion
of V1 main clauses + proportion of V2 main clauses + dialect + genre + MS date
Some of the fixed effects were rescaled for different reasons. The variable recording maximum embedding depth was log-transformed, to better match the glm()
function’s expectations of normally distributed data. MS date (manuscript date)
was rescaled so that each change of unit in the statistical model reflects fifty years
(instead of one year), to make the result more interpretable. Similarly, the variables
for probability of context and proportion of V1 and V2 clauses (all on a 0 to 1 scale)
were rescaled so that each unit change in the model reflects a 0.1 change on the original scale. Again, this
was done to make the model easier to interpret.
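To make the specification in (11) concrete, a model of this form can be fitted with the glm() function roughly as follows. This is a hedged sketch rather than the authors' original code: the data frame existentials and its column names are invented stand-ins, chosen to echo the coefficient labels in Table 6.5 below:

    # Hypothetical data frame 'existentials' with one row per clause;
    # 'subject' codes the realization of the existential subject (there1 vs null).
    m <- glm(subject ~ nextProb01 + logMaxDepth + v1prop01 + v2prop01 +
               dialect + genre + msDate50,
             data = existentials,
             family = binomial)  # binomial error component with a logit link
    summary(m)                   # coefficients on the log odds scale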
Model fit As mentioned above, an important step in quantitative modelling is
evaluating how well the model fits the data. This makes intuitive sense: a model that
does not fit the data is obviously not a sound basis for drawing conclusions about
claims regarding the data. Baayen (2008, 204) mentions two ways in which we can
assess the fit of a logistic regression model. One is (pseudo) R2 , a measure of fit loosely
based on the proper R2 (or coefficient of determination) which ranges from 0 to 1 and
indicates the degree of variation explained by the model. Logistic regression models
are based on a different procedure, but we can interpret a pseudo R2 index such as
Nagelkerke’s R2 (Nagelkerke, 1991) as the degree of improvement over an alternative
model that only predicts the most frequent outcome. Another measure, Harrell's C,
calculates the correlation between the response values and the values predicted by the
regression model. A C value of 0.5 signals random guessing, whereas a value of 1 means
perfect prediction. A value of 0.8 is often taken as a minimum for a model with some
predictive power (Baayen, 2008, 204).
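Both indices can be obtained by refitting the model with lrm() from the rms package, which reports Nagelkerke's R2 and the C index as part of its standard output. The sketch below is again our own illustration, reusing the hypothetical existentials data frame from the previous snippet:

    library(rms)  # provides lrm(), which reports fit indices alongside the coefficients
    m.lrm <- lrm(subject ~ nextProb01 + logMaxDepth + v1prop01 + v2prop01 +
                   dialect + genre + msDate50,
                 data = existentials)
    m.lrm$stats[c("R2", "C")]  # Nagelkerke's R2 and the C index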
However, these measures are not uncontroversial indicators of good models (Long
and Freese, 2001, 83–7). A powerful alternative to such numerical indicators of model
fit is to look at diagnostic plots of the residuals (Faraway, 2005, 58). The residuals are
essentially the distances between the actual observations and the straight line which
is the basis for the regression model. Visualizing these residuals makes it possible to
spot problems with the model, once we have some familiarity with residuals plots.
We consider such familiarization a good investment of time, and Jenset (2010, 106–9)
discusses such plots in more detail in the context of linguistics, based on the more
technical exposition in Faraway (2005).
Turning to the model at hand, we find that Nagelkerke’s R2 for the model is 0.4,
whereas Harrel’s C index for the model works out to 0.82. Both measures were
Figure 6.11. Binned residuals plot of the logistic regression model, indicating acceptable fit to the data. [Axes: expected values (0.2–0.8) against average residual (−0.4 to 0.4).]
calculated by refitting the model. The C index above 0.8 suggests that the model has a
genuine capability to correctly identify the response (Baayen, 2008, 204).
Not wishing to rely on such numbers alone, we also checked the model structure
for any signs of a bad fit to the data by means of a plot (Faraway, 2005, 53–8; Gelman
and Hill, 2007, 97–101). Figure 6.11 is a binned residuals plot of the model. Again, the
impression is a positive one: the predicted values span the whole range of possible
values from 0 to 1 (there1 and Ø, respectively), and most of the black dots (i.e. the
observations in the data set) are inside the grey confidence interval lines, without too
much of a clear pattern to them. The joint impression from these three ways to assess
the model fit is that the model is good. From this step, we can go on to interpret the
output of the model.
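For readers who wish to produce such a plot themselves, one option is the binnedplot() function from the arm package, the companion package to Gelman and Hill (2007). This sketch shows one way to do it, reusing the model object m from the earlier snippet; it is not necessarily how Figure 6.11 itself was produced:

    library(arm)  # provides binnedplot() for binned residual plots
    binnedplot(fitted(m), residuals(m, type = "response"),
               xlab = "Expected values", ylab = "Average residual")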
Results Table 6.5 summarizes the fixed effects (or predictors). The first column of the
table lists the name of the fixed effect variable, whereas the second column gives the
coefficient, i.e. the size of the effect that changing the predictor has on the response.
Table . Coefficients for the binary logistic regression
model showing the log odds ratio for switching from
there1 to Ø. Positive coefficients favour Ø, negative there1
(Intercept)
nextProb01
logMaxDepth
v1prop01
v2prop01
dialectKentish
dialectNorthern
dialectSouthern
dialectWest Midlands
genreBiography, Life of Saint
genreFiction
genreHandbook
genreHistory
genreHomily
genrePhilosophy
genreReligious Treatise
genreRomance
genreRule
genreSermon
genreTravelogue
msDate50
Coef β
SE(β)
12.35
−0.88
0.72
0.05
0.24
−1.43
1.50
0.51
0.17
1.40
0.66
2.73
1.08
1.99
−12.81
1.63
0.60
1.24
1.01
2.42
−0.53
2.14
5.8 <.0001
0.26 −3.4 <.001
0.21
3.5 <.001
0.53
0.1 >0.9
0.19
1.2 >0.2
0.30 −4.8 <.0001
0.28
5.4 <.0001
0.31
1.6 >0.1
0.23
0.8 >0.4
0.64
2.2 <.05
0.57
1.1 >0.3
0.69
3.9 <.0001
0.51
2.1 <.05
0.58
3.4 <.001
277.22
0.0 >1
0.46
3.5 <.001
0.66
0.9 >0.4
0.56
2.2 <.05
0.52
2.0 >0.1
0.49
4.9 <.0001
0.07 −7.3 <.0001
z
p
The next column gives the standard error of the coefficient, a measure of how much
variation we can expect around the estimate. The penultimate column is the test
statistic used for calculating the p-value in the last column. It is good practice to
provide all this information when making use of regression modelling, since it allows
a detailed look into the model.
To summarize the regression output in Table 6.5:
• The intercept is difficult to interpret in this case since it corresponds to a case
where the probability of V1 and V2 word order, and the probability of the item
following the existential subject are all zero.
• The coefficient for nextProb is –0.88 on the logit scale. The variable is statistically
significant and indicates the effect upon the response of a 0.1 increase in the
probability of the grammatical element following the existential subject. The
negative sign indicates that increasing the conditional probability of the context
decreases the probability of a null realization of the existential subject. Dividing
the coefficient by four (Gelman and Hill, 2007, 82) gives an estimate of the
maximum effect size on a probability scale. Dividing –0.88 by four gives –0.22,
in other words: for every 10 per cent increase in the probability of the subject context, there is a 22 per cent decrease in the probability of a null subject.
• The coefficient for maximum syntactic depth (log scale) is 0.72 on the logit scale, which when divided by four corresponds to an 18 per cent increase in the probability of the null subject for every unit increase in maximum syntactic depth (log scale).
• The coefficient for v1prop is 0.05. However, the last column in the table tells us that this effect is not statistically significant.
• The coefficient for v2prop is higher than for v1prop, 0.24, but again the effect is not statistically significant.
• The coefficient for dialect=Kentish is −1.43 on the logit scale. This is the log odds ratio for switching from there1 to Ø when changing from the reference level (East Midlands) to Kentish. The negative sign tells us that Kentish, or South Eastern English, is associated with there1. Dividing by four estimates the effect to be a decrease in the probability of Ø of about 36 per cent.
• The coefficient for dialect=Northern is 1.50 on the logit scale. Again we have a statistically significant difference from the reference level East Midlands. The sign here is positive, so a Northern dialect is associated with the null subject (an increase of around 38 per cent).
• The Southern and West Midlands dialects are not statistically significantly different from the East Midlands reference level.
• The coefficient for genre=Biography, Life of Saint is 1.40 on the logit scale, which translates into a 35 per cent increase in the probability of Ø in this genre. However, we cannot ignore the relatively large standard error compared to the coefficient (0.64) and the correspondingly high p-value. This result is less clear-cut than some of the other results.
• The coefficient for genre=Fiction is not statistically significantly different from the reference level (Bible).
• The coefficient for genre=Handbook is 2.73 on the logit scale, indicating a large increase (68 per cent) in the probability of the null variant.
• The coefficient for genre=History is 1.08 on the logit scale, which translates into a 27 per cent increase in the probability of the null subject. As with genre=Biography, Life of Saint, we can note the relatively large uncertainty about the estimate as expressed by the standard error (0.51) relative to the coefficient.
• The coefficient for genre=Homily is 1.99 on the logit scale, i.e. a 50 per cent increase in the probability of Ø.
• genre=Philosophy is not statistically significant.
• The coefficient for genre=Religious Treatise is 1.63, or a 41 per cent increase in the probability of Ø.
• genre=Romance is not statistically significant.
• The coefficient for genre=Rule is 1.24 on the logit scale, translating into a 31 per cent increase in the probability of the null subject. Again we note a relatively large uncertainty about this estimate.
• genre=Sermon is not statistically significant.
• The coefficient for genre=Travelogue is 2.42 on the logit scale, or an increase in
the probability of Ø of about 60 per cent.
• The coefficient for MS date is −0.53 on the logit scale, indicating that every fifty-year increase of MS date corresponds to an average 13 per cent drop in the probability of the null subject.
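The divide-by-four heuristic used throughout the list above is purely arithmetic and can be checked mechanically; a minimal sketch with three of the coefficients from Table 6.5:

    # Gelman and Hill's rule of thumb: the maximum change in probability per
    # unit change of a predictor is roughly the logit coefficient divided by 4
    coefs <- c(nextProb01 = -0.88, dialectKentish = -1.43, msDate50 = -0.53)
    round(coefs / 4, 2)  # -0.22, -0.36, -0.13, matching the percentages above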
We can now return to the competing claims and hypotheses formulated above and review them in light of the statistical model. Interestingly, neither of the word-order-related hypotheses implied by Williams (2000) or Breivik (1990) proved significant when compared to the other variables. Although we cannot categorically exclude an
effect of major word-order changes upon the realization of the existential subject,
the results above make such hypotheses less plausible. Instead, we find that there1 is
associated with a closer link to the surrounding grammatical context, as would be
expected under a process driven by pragmatic factors (Breivik, 1990). This interpretation is strengthened by the result seen for the proxy measure of syntactic complexity,
namely maximum sentence depth. A more complex sentence appears to favour the
null subject, which could again support a pragmatics (or information theory) based
view of the rise of there1 . Nevertheless, this cannot be the full explanation. We also
found that dialect plays a role, with a continuum running from North to South.
Compared to the central areas of England, the Northern dialect is more likely to prefer
Ø, whereas the South East is more likely to prefer there1 . Such a result is compatible
with the possible sociolinguistic explanation based on Blythe and Croft (2012) and
discussed in Jenset (2014). Similarly, genre appears to play a role even when we control
for dialect and MS date, suggesting that this may also partially be a stylistic choice.
These results are expected based on the assumption laid out in section 2.2.11
that language is best explained in multivariate terms. Although it is clear that one
single factor cannot explain all the variation, we have shown that some explanations
are substantially less likely than others. The quantitative model we have presented
above compares favourably with an approach based on multiple null-hypothesis tests.
Importantly, the model shows that the changes in the Middle English existential
construction cannot be reduced to a purely syntactic process. Clearly, an explanatory
model must somehow account for the sociolinguistic differences among dialects and
genres that we have identified. As such, the results seen here broadly support the
sociolinguistics-informed approach to historical linguistics in Croft (2000) and Blythe
and Croft (2012).
.. Summary
This case study has illustrated the benefits of quantitative methods in historical
linguistics discussed at the beginning of this chapter. Specifically, we have shown
how multivariate techniques such as logistic regression can deal with complexity in
the form of different linguistic and sociolinguistic variables. Although we have argued that linguistic features ought in any case to be modelled with more than one explanatory variable (see section 2.2.11), we have illustrated this point with a case study where
multiple competing and partially overlapping proposed explanations exist. As we have
seen, quantitative, multivariate techniques are well suited for assessing such competing
claims and hypotheses against each other.
By performing this comparison, we have also identified potentially explanatory
factors, which in turn point to how an explanatory linguistic model of the change
in question must be framed. Specifically, the connection between existential subject
realization and the linguistic context of the subject suggests that context at a relatively
fine-grained level needs to be taken into account. But the linguistic model also needs
to account for the sociolinguistic effects identified in the statistical model above.
In demonstrating this, we have shown how the principles of quantitative historical linguistics can contribute to some of the major goals of historical linguistics listed above (section 6.1.4): by working out some of the details of the histories of individual languages and, through pinpointing which variables have explanatory value, by covering some of the ground needed for modelling linguistic change.
7 A new methodology for quantitative historical linguistics
7.1 The methodological framework
The literature survey we described in Chapter 3 has shown that the landscape
of historical linguistics research is characterized by a high degree of methodological variability. However, we can safely say that we observed a general
under-representation of corpus-based and/or quantitative approaches. Moreover,
there is no agreed standard on what is considered high-quality quantitative research
in historical linguistics.
In this book we have proposed a new overarching framework for quantitative
historical linguistics, and we have argued that this is a good framework for conducting
historical linguistics research within the scope defined in section 2.1.1 for three main
reasons: it allows us to answer questions that qualitative approaches cannot answer;
it provides stronger evidence (and therefore stronger explanations) than qualitative
approaches; it allows a higher degree of integration between historical linguistics and
other related fields, and a higher level of understanding between scholars, thus making
the field move forward more effectively.
Our framework encourages corpus-driven approaches and the systematic adoption
of multivariate statistical methods as the most appropriate ways to deal with the
multifaceted nature of historical languages. Moreover, we argue for a clearer boundary
between data-driven exploratory studies (whose results can be used to formulate hypotheses), and studies that attempt to answer questions by testing specific hypotheses.
Ultimately, quantitative historical linguistics makes progress by confirming or rejecting these hypotheses in a reproducible way, and by defining models of historical
linguistic phenomena.
In line with the view expressed by Meyer and Schroeder (2015) and McGillivray
(2013), we believe that quantitative corpus-driven methods are not a technical issue
only; on the contrary, they have the potential to profoundly change the research
practices and the research questions of historical linguistics.
7.2 Core steps of the research process
In this chapter we will operationalize our framework in concrete terms, thus making
it easier for other scholars to judge it. In the rest of the chapter we will illustrate
the proposed methodology through a historical study on English corpus data. Before turning to the case study, though, we would like to summarize the principles
(section 2.2) and best practices (section 2.3) of our framework. We will do that by
representing the research process as a circle, loosely inspired by McGillivray (2013,
127), and involving the steps outlined below. Note that these steps should not be taken
in a strictly prescriptive way, as they are meant as methodological guidelines which
will need to be adapted on a case-by-case basis.
1. Study preparation
(a) select a phenomenon that can be operationalized so that it falls within the
scope of quantitative historical linguistics;
(b) formulate operational definitions for the phenomenon (e.g. word-order
change) and the variables under consideration (e.g. semantic features,
morphological features, authors, time periods);
2. Data collection
(a) collect the data set(s) by drawing on relevant annotated corpus data, if available to the historical linguistics community; alternatively, build a new annotated corpus available to the community and draw the data set(s) from it;
(b) combine corpus data with external resources including non-linguistic ones,
if relevant;
(c) if the corpus annotation does not contain all variables relevant to the analysis,
annotate the data set with those variables;
(d) if possible, re-encode the variables back into the corpus or link them to the
corpus for replicability purposes;
(e) document the data and the process, and make the data set available to the
community;
3. Quantitative modelling
(a) establish the explanandum (according to the terminology by Goldthorpe,
2001), i.e. the statistical pattern that needs explanation;
(b) explore the corpus data by making use of replicable visualization techniques
and descriptive statistics;
(c) optionally, formulate a hypothesis from the data exploration or from existing claims, and conduct replicable hypothesis testing through quantitative
analyses on the data set using suitable quantitative techniques (preferring
multivariate methods over univariate ones);
(d) optionally, identify response and predictors, and define one or more suitable
statistical models for the response based on the predictors; assess the models
with diagnostic tools and compare them;
(e) report all relevant details of the results of the analyses;
4. Interpretation and publication of the results
(a) formulate a probabilistic explanation of the phenomenon based on explanatory factors as found in the analysis;
(b) publish the results and make the data and code available to the community.
We would like to point out that the point regarding formulating an explanation in
the last step does not formally coincide with identifying causal relationships, as we
stressed in section 6.3. A full discussion of causality and explanations in linguistics is
outside the scope of this book; instead we refer the reader to Goldthorpe (2001) and a
linguistic view of this position discussed in Jenset (2010, 47–71), as well as the chapters
in Penke and Rosenbach (2007a) and in Campbell (2013, 322–45).
7.3 Case study: verb morphology in early modern English
In this section we present a longer case study to illustrate our framework for
quantitative historical linguistics. In section 6.3 we illustrated how empirical methods can be used to evaluate competing claims about the evolution of existential
there in historical English. That case study was a scenario where we were able to
find a satisfactory model that could corroborate some claims and that made other
claims less likely. In the present case study we tackle a more complex case where
a satisfactory model is more difficult to identify, but where quantitative methods
can nevertheless inform us about the details of the diachronic development. We
also deal directly with the effects of frequency in mechanisms of change. The data and the code for this case study are available on the GitHub repository https://github.
com/gjenset.
The topic of the case study is the diachronic change that took place in English
inflectional verb morphology during the early modern period, roughly the time period
from 1500 to 1700 ce. In this period the third person singular form -(e)s, originally a
Northern form from Middle English, spread to the rest of England, where -(e)th was
the dominant form (Nevalainen, 2006, 184). Nevalainen notes that the -(e)th form
was used in early Bible translations, but that Shakespeare on the whole preferred -(e)s.
However, that did not prevent Shakespeare from using both forms, sometimes next to
each other, as in this example from The Merry Wives of Windsor:
(1)
Ford: Has Page any brains? hath he any eyes? hath he any thinking? Sure, they
sleep; he hath no use of them. (3.2.1338–9)
Here we see both -(e)s and -(e)th, has and hath, used in the same passage with the
same verb. Thus there is clearly considerable variation that might potentially inform us
about the process that led to the general adoption of -(e)s in the third person singular.
Moreover, we have corpus data that we can use to investigate this phenomenon. In
other words, we are in a position to operationalize the phenomenon in such a way
that it falls within the scope of quantitative historical linguistics, as indicated by step
1a in the list (section 7.2).
Nevalainen (2006) outlines the following overview of the diachronic process,
based on data from the Helsinki Corpus. She recognizes these stages, following the
periodization of the Helsinki Corpus:
1. 1500–70: -(e)th dominates at the national level, -(e)s is a regional Northern
variant.
2. 1570–1640: the use of -(e)s becomes dominant in informal writing such as letters,
and becomes a substantial minority variant in official documents in England. The
exceptions to this are the verbs do and have which tend to retain -(e)th.
3. 1570–1640: simultaneously, there was an increase of -(e)th in Older Scots,
the Germanic language of Scotland; this increase appears to be genre-specific
(Nevalainen, 2006, 191).
By the end of the seventeenth century, -(e)s was the dominant form in English,
except for the most conservative genres. Nevalainen (2006) lists a number of possible
reasons for this development:
1. Immigration to the London area from the north brought the -(e)s form to the
south.
2. The -(e)th form gradually became associated with more formal registers.
3. Female writers picked up the -(e)s form, which is perhaps connected to the role
of women in linguistic innovation (Nevalainen, 2006, 188).
4. Some phonological contexts favoured -(e)s, especially verbs ending in stops such
as /t/ and /d/, since lasts was easier to pronounce than lasteth, for instance; at
the same time, the extra syllable added by -eth could be exploited metrically in
poetry.
5. -(e)s spread by lexical diffusion, via word-specific restrictions.
This explanation still leaves many questions unanswered. In section 1.1 we noted
that exploration of the history of individual languages and the establishment of
general processes of linguistic change are two of the aims of historical linguistics.
Although all the explanatory variables discussed by Nevalainen (2006) are plausible,
we cannot immediately establish the relative importance among them, or how they
might interact. This makes it difficult to go from the description of the specific case
(the history of third person singular inflections in English) to a more generalized
description of change.
One such generalization is the observation that infrequent words tend to lead
the way in analogical change (Lieberman et al., 2007). Another is the observation
that frequent words tend to be replaced at a slower rate than less frequent ones
(Pagel et al., 2007). Hay et al. (2015), in a diachronic corpus study of word-frequency
effects in diachronic sound change in New Zealand English, find an interaction
between the speaker’s year of birth and lexical frequencies. Their study shows that low-frequency words lead the change, in a manner that would be expected under analogical
change but not under regular phonological change. Hay et al. (2015) explain this by
pointing out that very frequent words are well represented in memory, which helps
explain their resistance to change. Conversely, an infrequent word is less likely to be
stored in memory, and this holds even more if the word is difficult to understand. Such
impaired perception may affect words that are close to an advancing change, which
affects memory storage. Together, this leads to a greater susceptibility for change for
the low-frequency words. From this brief literature review we can already identify
the operational definition of the phenomenon we are going to study (the alternation
between -(e)s and -(e)th forms in early modern English) and the variables under
consideration (step 1b in section 7.2).
These generalizations may have some application to the case of -(e)s and -(e)th
in early modern English. If the shift from -(e)th to -(e)s is a case of analogical
change driven by perception forces, we would expect lexical frequency to play a
role. Specifically, we would expect a higher word frequency to correlate with a lower
probability of -(e)s, and conversely that a lower word frequency would correlate with
a higher probability of -(e)th. We would also expect this frequency effect to increase as
the change nears its completion (Hay et al., 2015). The inclusion of frequency sets our
study apart from Gries and Hilpert (2010), who use a different corpus for a comparable
time period, but similar statistical techniques.
Based on these considerations, we can approach the Early Modern corpus data with
the following claims, which correspond to the hypotheses that we want to test (step 3c
in section 7.2):
• If genre plays a role, we expect a statistically significant difference between genres
in the use of -(e)s and -(e)th.
• If gender plays a role, we expect a statistically significant difference between male
and female writers in the use of -(e)s and -(e)th.
• If phonological context is an important cause of change, we expect final stops to
favour -(e)s, and final vowels and fricatives to favour -(e)th.
• If lexical diffusion is a leading cause of change, we expect individual differences
between verbs, especially for do and have.
• If lexical frequency is an important variable, we expect a statistically significant
effect towards the end of the period, as the change was approaching completion
and the perceptual pressure on remaining -(e)th variants increased.
7.3.1 Data
For the data collection phase (step 2 in section 7.2) we relied on an annotated corpus.
We extracted the data for this case study from the 1.7 million word PPCEME treebank
(Kroch and Delfs, 2004), using a Python script. Since the corpus does not have
annotation for third person present tense singular verb morphology, we identified
all present tense verbs with the tags ‘DOP’ (do), ‘HVP’ (have), and ‘VBP’ (other
verbs), and processed the results further to identify the third person cases, excluding
forms of be, for which the alternation between -(e)s and -(e)th is not applicable. We
lemmatized the set of present tense verbs and used them to calculate the lemma
frequency in the present tense. This frequency was counted for the three sub-periods
of the corpus (E1: 1500–69, E2: 1570–1639, E3: 1640–1710), to avoid having future
increases in verb frequency influence past observations. The lemmatization lexicon
we built is an example of an additional resource that integrates the corpus annotation with extra variables (verb lemma, in this case) needed for the study, as suggested by step 2d in section 7.2.
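To give a flavour of the frequency step, the sub-period lemma counts can be computed in R roughly as follows once the extracted observations are in a data frame. This is our own sketch, not the original extraction script (which was written in Python); the data frame d is a hypothetical stand-in:

    # 'd' holds one row per third person singular verb token, with columns
    # 'lemma' and 'period'. Counting within each sub-period ensures that later
    # increases in a verb's frequency cannot influence earlier observations.
    counts <- as.data.frame(table(lemma = d$lemma, period = d$period))
    names(counts)[3] <- "subPeriodCount"
    d <- merge(d, counts, by = c("lemma", "period"))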
Following the recommendations for representing and analysing multidimensional
data outlined in section 6.2, we collected the data into a data frame format for analysis
with R. Tables 7.1 and 7.2 exemplify an excerpt of the data. The full data set comprises
10,430 observations of 1,654 verb forms for 737 lemmas, and the list of variables (step
2c in section 7.2) is:
Table 7.1. Part of the metadata extracted from the PPCEME documentation

filename        period   id                       genre      year   female
alhatton-e3-h   e3       ALHATTON-E3-H,2,241.7    LET PRIV   1699   T
alhatton-e3-h   e3       ALHATTON-E3-H,2,242.27   LET PRIV   1699   T
alhatton-e3-h   e3       ALHATTON-E3-H,2,245.42   LET PRIV   1699   T
alhatton-e3-h   e3       ALHATTON-E3-H,2,245.44   LET PRIV   1699   T
anhatton-e3-h   e3       ANHATTON-E3-H,2,211.6    LET PRIV   1690   T

Table 7.2. Part of the data extracted from PPCEME

verbForm   verbTag   lemma    suffix3sg   subPeriodCount   context
has        HVP       have     s           2,532            vowel
sayes      VBP       say      s           393              vowel
designes   VBP       design   s           13               stop
sayes      VBP       say      s           393              vowel
plays      VBP       play     s           6                vowel

• filename: a factor variable with the PPCEME file identifiers as levels;
• period: a factor variable with the PPCEME sub-corpus period identifiers e1, e2, and e3 as levels;
• id: a factor variable with the identifier for the individual syntactic tree;
• year: a numerical variable with the year of the text as given in the PPCEME
documentation;
• female: a logical variable with the levels TRUE (female) and FALSE (male),
corresponding to the gender of the author as given in the PPCEME
documentation;
• author: a factor variable with the name of the author (if known) from the
PPCEME documentation;
• verbForm: a factor variable with the verb forms observed in the corpus as levels;
• verbTag: a factor variable with the corpus tags of the verbs as levels (‘DOP’, ‘HVP’,
and ‘VBP’);
• lemma: a factor variable with the lemmas of the verb forms as levels (we manually
derived these from the verb forms, and lemmatized them to the modern form,
due to the large early modern English variation in spelling);
• suffix3sg: a factor variable with two levels indicating whether the verb ends in
-(e)s (s) or -(e)th (th);
• subPeriodCount: a numerical variable with counts of the lemma frequency in the
corpus sub-period (E1, E2, or E3);
• context: a factor variable indicating the phonological context of the third person
suffix, based on the modern lemma, with the levels ‘fricative_other’, ‘liquid’,
‘sibilant’, ‘stop’, and ‘vowel’.1
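Assembled this way, the data set is an ordinary R data frame whose shape can be verified directly; a minimal sketch, assuming the hypothetical data frame d from the earlier snippet:

    str(d)               # type and factor levels of each variable listed above
    nrow(d)              # 10,430 observations
    nlevels(d$verbForm)  # 1,654 verb forms
    nlevels(d$lemma)     # 737 lemmas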
7.3.2 Exploration
As part of the data exploration phase (step 3b in section 7.2), we consider the
distribution of -(e)s and -(e)th, plotted as changing probabilities over time in Figure 7.1.
The distribution corresponds to what we would expect based on Nevalainen (2006),
with a very low overall initial probability of -(e)s, and with a gradual decline in -(e)th
throughout the seventeenth century. Comparing this with the similar plot in section
6.3 we see that there is less of a pronounced S-shaped curve in Figure 7.1, and the
increase seems more gradual.
1 Basing the phonological context on a contemporary rendering of the lemma is of course problematic. During the early modern period, English underwent a number of phonological changes, the most noteworthy of which is perhaps the changes to the Middle English long vowels known as the Great Vowel Shift. However, a detailed reconstruction of the phonological context at the time of attestation in the corpus is beyond the scope of this case study. Also, the Great Vowel Shift itself is still a matter of discussion, as McMahon () demonstrates. Hence, we have opted for the pragmatic solution of normalizing the phonological context along with the spelling. This is partly justified by simply referring to vowels in general, including diphthongs, without a further distinction between long and short vowels.
Figure 7.1. Plot showing the shifting probabilities over time between -(e)s and -(e)th in the context of third person singular present tense verbs. [Axes: year (1550–1700) against probability (0.0–1.0), with curves for th and s.]
Turning to the lexical frequencies plotted in Figure 7.2, two things are immediately clear. As we would expect based on Figure 7.1, there is a greater concentration of observations for -(e)th in early parts of the corpus, and a greater concentration of -(e)s in the later stages. Next, we notice that the trend, represented by the non-parametric
regression line in the plot,2 is relatively stable over time for -(e)s. However, for -(e)th we
see that over time there is an increasing tendency towards higher lemma frequencies.
This observation is compatible with more than one interpretation, but at least it
suggests that lexical frequencies may be involved somehow.
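The trend lines in Figure 7.2 below are of this locally smoothed kind, produced with lowess() (see footnote 2). A hedged sketch of how the -(e)th panel could be drawn in base R, again assuming the hypothetical data frame d:

    # Lemma frequency over time for the -(e)th tokens, with a lowess trend line
    th <- subset(d, suffix3sg == "th")
    plot(th$year, log10(th$subPeriodCount),
         xlab = "Year", ylab = "Verb frequency (base 10 log scale)")
    lines(lowess(th$year, log10(th$subPeriodCount)))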
2 A non-parametric regression line uses local adjustments to fit the line to the data, which typically results in a regression line that is not straight. This makes a non-parametric, or smoothed, regression line difficult to analyse in the same manner as a traditional regression line. However, it is a useful tool for describing the behaviour of the data. The lines in the plots were created with the lowess() function in R.
Figure 7.2. Plots of the trends of lemma frequency over time for verb forms occurring with -(e)s (left panel) and -(e)th (right panel). The black lines are non-parametric regression lines outlining the trend over time. [Axes, both panels: year (1500–1700) against verb frequency (base 10 log scale, 0.0–3.5).]
To explore some of the variables further, we performed a multiple correspondence analysis (see section 6.2), which reduces the variation among the variables suffix, corpus sub-period, gender, and phonological context to a compact, two-dimensional sub-space that can be easily visualized. We can see the plot in Figure 7.3. Only the first (horizontal) dimension has a high enough explanatory power, judging from the percentage of explained inertia. This implies that we can read the plot from left to right, with similar categories close to each other. As we now would expect, we see that
the earliest period (E1) is associated with -(e)th, and the latest period (E3) with -(e)s.
Period E2 takes up an intermediate position. We can also see that female writers are
associated with later periods and -(e)s. However, this might be due both to higher use of -(e)s by female writers and to the relative lack of female writers in the earliest
period, as illustrated by the numbers in Table 7.3. In other words, at this point we
cannot decide if the use of -(e)s is directly associated with female writers, or if both
are associated with the later time periods.
Finally, we can see from Figure 7.3 that the phonological context accords with our
predictions, with vowels displaying some association with -(e)th, while stops show
some association with -(e)s. The remaining contexts do not display any particular
tendencies to one or the other, judging from the plot.
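For reference, a multiple correspondence analysis of this kind can be run with MCA() from the FactoMineR package. The sketch below is our own illustration rather than the original code, and assumes the four variables are available in the hypothetical data frame d:

    library(FactoMineR)  # provides MCA() for multiple correspondence analysis
    vars <- c("suffix3sg", "period", "female", "context")
    mca <- MCA(data.frame(lapply(d[, vars], as.factor)), graph = FALSE)
    round(mca$eig[1:2, "percentage of variance"], 1)  # explained inertia per dimension
    plot(mca)  # biplot of the category points, as in Figure 7.3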
From this preliminary exploration, we turn to a more formalized hypothesis testing
phase using statistical modelling.
Figure 7.3. MCA plot of suffix, corpus sub-period, gender, and phonological context. Only the first (horizontal) dimension is interpretable. [Dimension 1 explains 71.5% of the inertia, dimension 2 only 1.8%; category points include suffix3sg:s/th, period:e1/e2/e3, female:TRUE/FALSE, and the five context levels.]
Table . Frequencies of verb tokens in the sample as taken
from texts produced by female and male writers, broken down
by corpus sub-period
Author
E1
E2
Male
Female
3205 (31)
97 (<1)
3591 (34)
485 (5)
E3
2794 (27)
258 (3)
7.3.3 The models
As with the case study in section 6.3, we have opted for a modelling approach based
on binary logistic regression, specifically a mixed-effects model. The advantage of this
technique for our case study is that we can model the direct effect that each variable
has on the choice of -(e)s or -(e)th in the form of log odds ratios (which we convert to
probabilities for ease of interpretation). This is in line with step 3d in section 7.2.
During the model fitting phase we tested a large number of models with different
forms, as per step 3d in section 7.2. However, we were not able to find a single model
that could successfully capture the variation between -(e)s and -(e)th for the entire
corpus. To illustrate an example of a badly fitting model, we discuss one of these
unsuccessful models below.
A bad model The model in (2) is representative of our attempts to fit a single model
to the data. The response is the probability of switching from -(e)th to -(e)s, and the
fixed effects correspond to the claims we wish to test. In this model we used genre as
a random effect, since we can assume some genre effects in this case. We also tested
some models where the verb lemma was incorporated as a random effect; however,
this did not improve the model fit in any substantial way.
(2) Response: probability of switching from -(e)th to -(e)s modelled as depending on
the following
fixed effects: lexical frequency (log base 10 scale) + period + gender + verb tag +
phonological context
random effect: genre
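In R, a mixed-effects specification of this kind is typically fitted with glmer() from the lme4 package. The following hedged sketch shows roughly what model (2) could look like, using the variable names from the Data section and the hypothetical data frame d; it is not the authors' original code:

    library(lme4)  # provides glmer() for mixed-effects regression models
    m2 <- glmer(suffix3sg ~ log10(subPeriodCount) + period + female +
                  verbTag + context + (1 | genre),
                data = d,
                family = binomial)  # random intercept for genre
    summary(m2)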
In section 6.3 we noted that it is not sufficient to rely exclusively on numerical
measures of model fit such as Harrell's C or Nagelkerke's R2. The reason is that such
measures may be quite high even when the model is not a good fit to the data. A much
better test of model fit is the extent to which the model residuals (the differences
between the ideal straight line of the model and the observed values) are well behaved.
Since the usefulness of the model depends on certain assumptions regarding these
residuals, it is a crucial step to check them.
For the model in (2), these indices work out as follows: Harrell's C is 0.96 and
Nagelkerke’s R2 is 1 compared to a mixed-effects model that only predicts the most
frequent outcome. This looks promising, but interpreting these measures only makes
sense if we have a proper fit to the data. Unfortunately, the binned residuals plot in
Figure 7.4 reveals a disastrously poor fit to the data.
The binned plot in Figure 7.4 is a good example of a bad fit to the data. In a
well-behaved binned residuals plot we would expect most points to fall within the
confidence intervals indicated by the grey lines. In this case, we notice that most points
in fact fall outside these lines. There are also clear signs that the model is mis-specified
because of the V-shaped pattern among the points. This tells us that the model is not
an equally good fit in the whole data set, a very important assumption in regression
modelling. Ideally, the black dots should be symmetrically distributed around the
horizontal dotted line, without any clear signs of patterns. Such lack of any patterning
is only an ideal situation, but Figure 7.4 is clearly too far from this ideal.
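Binned residual plots like those in Figures 7.4 to 7.6 are implemented in the arm package that accompanies Gelman and Hill (2007). A minimal sketch, again assuming the fitted model m2:

    ## Binned residuals: average residual per bin of fitted probability.
    library(arm)
    binnedplot(x = fitted(m2),                       # estimated P(-(e)s)
               y = residuals(m2, type = "response"), # observed minus fitted
               xlab = "Estimated probability of '-s'",
               ylab = "Average residual")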
Figure 7.4. Binned residuals plot (average residual against estimated probability of '-s') for the mixed-effects logistic regression model described in (2). The model is a very poor fit to the data, as expressed by the fact that most of the predicted points are outside the grey confidence interval lines, and there are clear up-and-down patterns in the points.

When one model is not enough Our solution in this case was to fit three models,
one per sub-corpus period (E1, E2, and E3). For the earliest period, we were simply
not able to find a model that was a good fit to the data. The model was essentially
always predicting -(e)th due to the extremely low number of occurrences of -(e)s in
this period. We cannot exclude that a satisfactory model can be achieved using other
variables, but our variables were not sufficient to distinguish the two variants in the
earliest period. For the two later periods, we were able to find satisfactory models;
however, these models are different, as we will see now.
For the E2 period we arrived at the following model:
(3) Response: probability of switching from -(e)th to -(e)s modelled as depending on
the following
fixed effects: lexical frequency (log base 10 scale) + gender + verb tag + phonological context
random effect: genre
Figure 7.5. Binned residuals plot (average residual against estimated probability of '-s') for the mixed-effects logistic regression model described in (3). The model is an acceptable fit to the data, with points being fairly symmetrically distributed around the middle. There is a skew towards predicting 0, i.e. -(e)th.
The binned residuals plot in Figure 7.5 is an improvement over the one in 7.4. The
points are more symmetrically distributed around the middle, and more points are
inside the grey lines. We can see that a large number of points are clustered together
around the 0 point (left side). This means that there is a large number of cases being
predicted as 0, i.e. -(e)th. However, we note that the model also predicts -(e)s, and
a Nagelkerke’s R2 of 1 shows a real improvement over simply predicting the most
frequent outcome. Similarly, a value of Harrell’s C of 0.96 is excellent.
For the E3 period we used the model set out in (4):
(4) Response: probability of switching from -(e)th to -(e)s modelled as depending on
the following
fixed effects: lexical frequency (log base 10 scale) + gender + verb tag
random effect: genre
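Fitting one model per sub-period amounts to subsetting the data before each call. Under the same assumed column names as before, the two successful models can be sketched as follows (model (4) omits the phonological context, for the reason given in the next paragraph).

    ## Separate models for E2 and E3 (sketch; no satisfactory E1 model was found).
    m_e2 <- glmer(switch ~ log10(freq) + female + verbTag + context + (1 | genre),
                  data = subset(tokens, period == "e2"), family = binomial)
    m_e3 <- glmer(switch ~ log10(freq) + female + verbTag + (1 | genre),
                  data = subset(tokens, period == "e3"), family = binomial)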
For this model we obtained the best fit by removing the variable for the phonological
context.

Figure 7.6. Binned residuals plot (average residual against estimated probability of '-s') for the mixed-effects logistic regression model described in (4). The model is a marginal fit to the data, with points being fairly symmetrically distributed around the middle. There is a skew towards predicting 1, i.e. -(e)s.

In the binned residuals plot shown in Figure 7.6 we see that there are quite a
few points falling outside the grey lines, but there is not too much structure among the
points. As with the binned plot in Figure 7.5, we see that there is a skew, but here the
tendency is for the points to be clustered near 1, i.e. around -(e)s. However, such a skew
is not necessarily a large problem. For the purposes of the current study, we accept the
model based on this plot. The numerical measures for evaluating the model are still
good, with a Nagelkerke’s R2 of 1 and Harrell’s C of 0.93.
Results We next give the relevant details from this statistical testing (step 3e in section
7.2), and turn our attention to the summary outputs of the models in (3) and (4),
displayed in Tables 7.4 and 7.5, respectively.
For the model covering period E2, Table 7.4 shows that the only two non-significant
variables are the lexical frequency count (transformed to a logarithmic scale for
improved fit) and the category designating the lemma have; this means that have is
indistinguishable from do with respect to the model’s response. The intercept is not of
much interest here, since it represents the average outcome effect when the frequency
count is zero. Turning to the remaining coefficients, it is worth noting that they are all
positive, i.e. they all point towards a higher probability of -(e)s.
Table . Summary of fixed effects from the mixedeffects logistic regression model for E2 described
in (3)
(Intercept)
log10(subPeriodCount)
femaleTRUE
contextliquid
contextsibilant
contextstop
contextvowel
verbTagHVP
verbTagVBP
Coef β
SE(β)
z
p
-8.74
-0.10
1.94
4.38
2.68
4.68
3.85
-0.08
3.28
1.07
0.10
0.32
0.75
0.78
0.74
0.75
0.33
0.34
-8.2
-1.0
6.1
5.8
3.4
6.3
5.1
-0.2
9.6
<.0001
>0.3
<.0001
<.0001
<.001
<.0001
<.0001
>0.8
<.0001
Table . Summary of predictors from the mixedeffects logistic regression model for E3 described
in (4)
(Intercept)
log10(subPeriodCount)
femaleTRUE
verbTagHVP
verbTagVBP
Coef β
SE(β)
z
2.77
-0.76
2.64
-0.15
2.05
0.70
0.14
0.46
0.21
0.26
3.9
-5.5
5.7
-0.7
7.7
p
<.0001
<.0001
<.0001
>0.5
<.0001
We can summarize the results as follows, using the divide-by-four rule to transform
the log odds ratios to probabilities (Gelman and Hill, 2007, 82), as worked through in the sketch after this list:
• Female writers are associated with a 50 per cent increase in the probability of -(e)s
compared to men.
• Phonological context=liquid is associated with a 100 per cent increase in the
probability of -(e)s compared to the reference category of non-sibilant fricatives.
• Phonological context=sibilant is associated with a 67 per cent increase in the
probability of -(e)s compared to the reference category of non-sibilant fricatives.
• Phonological context=stop is associated with a 120 per cent increase in the
probability of -(e)s compared to the reference category of non-sibilant fricatives.
• Phonological context=vowel is associated with a 96 per cent increase in the
probability of -(e)s compared to the reference category of non-sibilant fricatives.
• Verbs other than do and have are associated with an 82 per cent increase in the
probability of -(e)s.
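The divide-by-four rule states that a logistic regression coefficient divided by four gives an upper bound on the change in predicted probability associated with a one-unit change in that predictor (Gelman and Hill, 2007, 82). The percentages above can therefore be checked directly against the coefficients in Table 7.4:

    ## Divide-by-four rule applied to the significant E2 coefficients.
    coefs <- c(femaleTRUE = 1.94, contextliquid = 4.38, contextsibilant = 2.68,
               contextstop = 4.68, contextvowel = 3.85, verbTagVBP = 3.28)
    coefs / 4  # 0.485, 1.095, 0.67, 1.17, 0.9625, 0.82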
In other words, both the phonological context and gender are highly associated with
the use of -(e)s. For the phonological context we see that there is in fact a preference
for -(e)s in most contexts, but the degree of preference varies.
Next we turn to the summary of the model in (4), displayed in Table 7.5. In this
model we had to remove the phonological context variable to achieve an acceptable
fit, leaving us with the lexical frequency variable, gender, and the verb categories
derived from the corpus annotation. As with the previous model, we will not attempt
to interpret the intercept since it is not particularly meaningful here. Also, as in the
previous model, we note that the distinction between the verb tag reference category
do and the verb have is not statistically significant. However, the lexical frequency
variable is significant in this model.
Below, we quickly summarize the coefficients:
• A 1 per cent increase in lemma frequency decreases the probability of -(e)s by 20
per cent.
• Female writers are associated with a 66 per cent increase in the probability of -(e)s
compared to men.
• Verbs other than do and have are associated with a 51 per cent increase in the
probability of -(e)s.
Finally, we look at the random effect, genre, in the two models. Here we see an
interesting difference between the models in (3) and (4). For the E2 model in (3),
the standard deviation of the random effect is 2.8. Mixed-effects models assume that
the random effects are normally distributed, and we can make use of this to calculate
something resembling a confidence interval for the random effect, i.e. a range within
which we would expect 95 per cent of all values to fall. For the E2 model, this interval
barely reaches above zero, which means that all genres tend towards -(e)th. In other
words, the variation in third person singular suffix cannot really be attributed to genre
differences for this period. Conversely, for the E3 model the standard deviation of the
random effect is 2.1, but the confidence interval in this case spans from 0.19 to 0.99
on a probability scale. In other words, we find that the variation in tendency among
genres spans virtually the whole range of probabilities from -(e)th to -(e)s. In short,
to the extent that our models are reliable, it is in the E3 period that genre differences
regarding -(e)th and -(e)s can be identified.
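These intervals can be reproduced from the reported figures alone with plogis, which maps from the logit to the probability scale. The use of plus or minus two standard deviations is our assumption, chosen because it recovers the reported 0.19 to 0.99 range; from a fitted model, the standard deviation itself comes from lme4::VarCorr.

    ## Random-effect ranges for genre, converted to the probability scale.
    plogis(-8.74 + c(-2, 2) * 2.8)  # E2, model (3): c. 0.0000006 to 0.04
    plogis( 2.77 + c(-2, 2) * 2.1)  # E3, model (4): c. 0.19 to 0.999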
Discussion
We are now in a position to evaluate the claims presented initially, reproduced
here in enumerated form for convenience:
(i) If genre plays a role, we expect a statistically significant difference between
genres in the use of -(e)s and -(e)th.
(ii) If gender plays a role, we expect a statistically significant difference between
male and female writers in the use of -(e)s and -(e)th.
(iii) If phonological context is an important cause of change, we expect final stops
to favour -(e)s, and final vowels and fricatives to favour -(e)th.
(iv) If lexical diffusion is a leading cause of change, we expect individual differences
between verbs, especially for do and have.
(v) If lexical frequency is an important variable, we expect a statistically significant
effect towards the end of the period, as the change was approaching completion
and the perceptual pressure on remaining -(e)th variants increased.
Following step 4a in section 7.2, our results appear to refute claim (i) regarding the
importance of genre. Recall that the model for E2 found virtually no variation between
genres. Instead, the variation in the use of -(e)s and -(e)th in the E2 period could better
be described by the predictors. It was only in the next period, E3, that our model
could identify large, systematic differences between genres in the use of -(e)s and
-(e)th. Since the last period was also the period when -(e)s was increasingly becoming
the norm while -(e)th became relegated to highly formal writing, it appears that the
genre differences observed in E3 are a result of the change taking place, rather than an
active cause for it.
Our initial explorations left unanswered the question of whether women used -(e)s
more than men, or whether we simply find more women writers in the period when
-(e)s was becoming the norm. However, the two models for E2 and E3 both agreed that
female writers employed -(e)s to a larger degree than men. We can thus conclude that
claim (ii) has been strengthened.
Claim (iii) deals with the effect of phonological context, and here our models are
clear: the phonological context only plays a role in the E2 period. Furthermore, we note
that rather than seeing a clear, unequivocal preference for -(e)s in some contexts, we
found that there is a continuum of degrees of preference for -(e)s. We can schematically
represent this as follows:
(5) -(e)s > stop > liquid > vowel > sibilant > other fricatives > -(e)th
Nevertheless, this does agree with the claim in (iii) that stops should show a preference
for -(e)s while vowels and sibilants show more of a preference for -(e)th. However,
this only holds for the E2 period. The model for the E3 period omitted the context
variable in order to obtain an acceptable fit to the data. Based on this, it appears that
the phonological context acted as an important factor in the early stages of the change,
whereas other variables were involved in concluding the process.
Regarding claim (iv), we found a reliable difference between do and have on the
one hand, and other verbs on the other hand. Unfortunately, no models using the
verb lemma proved an acceptable fit to the data. This leaves some uncertainty, since
do and have are also high frequency verbs. However, we can state with confidence that
verbs other than do and have present a clear preference for -(e)s as a whole in both E2
and E3. If most of the variation within the VBP category was tied to individual verbs,
we would expect sufficient variation within this category for no significant difference
to manifest itself compared to do and have. Instead, since the VBP category as a whole
shows this difference, we suspect a frequency effect is involved. Hence, we tentatively
note a decreased confidence in claim (iv).
Finally, claim (v) involved lexical frequencies and predicted that this variable would
grow in importance towards the endpoint of the change, when the number of low-frequency verbs preferring -(e)th was decreasing. First, we can note indirect support
for this position based on our discussion of claim (iv). Second, we find support for
it in the fact that in the E2 model the lexical frequency variable was not significant.
However, in the E3 model, i.e. towards the endpoint of the transition, the lexical
frequency variable was significant. For the E3 period, increasing the lemma frequency
implied a lower probability of -(e)s, which is what we would expect if the change was
caused by low-frequency analogical change, driven by perception factors similar
to the ones outlined by Hay et al. (2015).
Thus, just as with the case study on existential there in Chapter 6, we have shown
that empirical corpus methods are well suited to evaluating the merits of competing
claims regarding historical linguistic phenomena. However, in addition we have
demonstrated that careful use of multivariate statistical models can be a rich source of
information for reasoning about the details of a process of diachronic change. Such
models are no better than the data they make use of. By employing syntactically
annotated data, essentially a form of model parallelization (section 1.1), we could
extract a data set that was both large and rich in detail. Nevertheless, it is conceivable
that the models might improve with more data. For instance, Nevalainen (2006,
193) mentions vowel contraction, and adding this feature might result in even more
informative models. Similarly, Nevalainen (2006, 193) notes that -(e)th was retained
in some Southern dialects. The lack of dialect information in the metadata for the
PPCEME corpus prevented us from including dialect information in this case study,
but this is clearly a promising variable for further exploration regarding the change
from -(e)th to -(e)s. Nevertheless, we showed that our models were informative enough
to evaluate the claims listed above. Our results on the role of author gender are aligned
with Gries and Hilpert (2010), but differ in other respects.
Although these claims to some extent deal with the description of the historical
evolution of a single language (English), we also demonstrated how the empirical
approach advocated here could be employed to understand more general, abstract
processes of historical linguistic change.
Finally, we described the research process advocated by our proposed methodological framework, which involves a cycle composed of clearly defined phases,
transparent processes, and publicly available data. Moreover, the research relies on
existing resources like annotated corpora, and creates new ones, such as the lemmatization lexicon which we built to compensate for the lack of lemma annotation in the
PPCEME corpus. We strongly believe that this will enable further studies to confirm
or refute our findings, thus advancing the field.
Concluding remarks
In this book we have outlined a comprehensive approach to doing quantitative
historical linguistics, how we think it can best be achieved, and why it matters.
In Chapter 1 we presented the main reasons why historical linguistics
ought to make more use of corpora. Historical linguistics is by necessity data-centric;
however, the high uncertainty that comes with time depth ensures that there is
room for a large variety of claims about the past. This necessitates a high degree
of precision in communication when these claims are presented and evaluated. We
have shown that when claims about historical linguistics are expressed in terms of
frequencies and probabilities, the historical linguist is forced to come up with more
precise formulations of those claims. This can only be a good thing, since a precise
claim is easier both to defend and refute. Furthermore, quantitative claims are highly
transparent. True, a qualitative claim regarding the existence or non-existence of a
construction or grammatical phenomenon in the past is precise in exactly the same
manner, but note that nothing is lost when such binary claims are expressed as
endpoints of a probability scale. However, in practice, linguistic argumentation is to a
large extent about establishing category relationships and relations between categories
(Beavers and Sells, 2014). As many studies have argued (see discussion in previous
chapters and references there, including Manning, 2003; Bresnan et al., 2007; and
Zuidema and de Boer, 2014), such relations are in most cases better expressed in
probabilistic terms to better account for the large degree of variability in language.
We have made a point of remaining agnostic about the question of whether or not
language is inherently probabilistic, or if probabilities are simply useful in describing
the details of an underlying, extremely detailed categorical system. We consider this
an unresolved empirical question, which ought not to stand in the way of corpus
methods to achieve what we consider the most important aspects, namely: increased
precision, a standard for resolving claims in linguistics (see also Geeraerts, 2006), and
reproducibility. We have outlined a framework for achieving this, using case studies to
illustrate what we consider good practice in using quantitative techniques in historical
linguistics, including interpretation, discussion, and presentation of results.
However, in addition to these benefits to historical linguistics as a field, we also
see wider benefits. Historical linguistics does of course not exist in a vacuum, and
we consider the increased adoption of empirical corpus linguistics, probabilistic and
computational methods, and an increased level of attention towards data sharing
and reproducibility a step in the direction of improved professional understanding between historical linguistics and adjacent fields (for a proposal in the same
spirit as ours and applied to the case of Latin linguistics, see McGillivray, 2013). In
short, statistical techniques and a probabilistic conceptualization of the questions
and claims can act as a bridge for cross-disciplinary communication. For instance,
experimental psycholinguistic research already relies on statistical modelling as its
methodological core. Psycholinguistic models of language processing can inform
historical and diachronic research, as seen in Hay et al. (2015) and the previous
section. The uniformitarian principle implies that such psycholinguistic models of
understanding are relevant; empirical corpus methods (especially as part of efforts
in model parallelization; see Zuidema and de Boer, 2014) make the link between the
two concrete and testable. However, the usefulness of statistical methods as a means
of communication extends beyond the various linguistic subfields. There is a rich
literature (e.g. McMahon and McMahon, 2005; Campbell, 2013; Pereltsvaig and Lewis,
2015) attesting to the communication problems that have arisen when researchers
with backgrounds in fields other than linguistics have introduced new methods, most
notably Bayesian phylogenetic trees, to study historical linguistic phenomena from a
new perspective. Such communication problems across fields should in our view be
resolved, not ignored or condemned, and we consider quantitative corpus methods
a contribution to this end. Although our main focus with this book is historical
linguistics, we consider such increased possibilities for improved cross-disciplinary
communication a positive side effect. We can only hope that the effect size will be a
considerable one, both within and beyond historical linguistics.
References
Abeillé, A. (2003). Treebanks: Building and Using Parsed Corpora. Dordrecht: Kluwer.
Adger, D. (2015). Syntax. Wiley Interdisciplinary Reviews: Cognitive Science 6(2), 131–47.
Allen, C. L. (1995). Case Marking and Reanalysis: Grammatical Relations from Old to Early
Modern English. Oxford: Oxford University Press.
Andersen, H. (1999). Actualization and the (uni)directionality of change. In H. Andersen (ed.),
Actualization: Linguistic Change in Progress. Papers from a workshop held at the 14th International Conference on Historical Linguistics, Vancouver, B.C., Current Issues in Linguistic
Theory, pp. 225–48. New York: John Benjamins.
Andersen, H. and B. Hepburn (2015). Scientific method. In E. N. Zalta (ed.), The Stanford
Encyclopedia of Philosophy (winter 2015 edn.).
Archer, D. (2012). Corpus annotation: A welcome addition or an interpretation too far? In
J. Tyrkkö, M. Kipiö, T. Nevalainen, and M. Rissanen (eds.), Outposts of Historical Corpus
Linguistics: From the Helsinki Corpus to a Proliferation of Resources. Studies in Variation,
Contacts and Change in English eSeries.
Archer, D. and J. Culpeper (2003). Sociopragmatic annotation: New directions and possibilities
in historical corpus linguistics. In G. N. Leech, P. Rayson, A. McEnery, and A. Wilson (eds.),
Corpus Linguistics by the Lune: A Festschrift for Geoffrey Leech, pp. 37–58. Frankfurt am Main:
Peter Lang.
Archer, D., T. McEnery, P. Rayson, and A. Hardie (2003). Developing an automated semantic
analysis system for early modern English. In Corpus Linguistics 2003 Conference, Lancaster
University, pp. 22–31.
Atkinson, Q. D. and R. D. Gray (2006). How old is the Indo-European language family?
Illumination or more moths to the flame. In P. Forster and C. Renfrew (eds.), Phylogenetic
Methods and the Prehistory of Languages, pp. 91–109. Cambridge: McDonald Institute for
Archaeological Research.
Attardi, G. (2006). Experiments with a multilanguage non-projective dependency parser.
In Proceedings of the Tenth Conference on Computational Natural Language Learning
(CoNLL-X). New York City, pp. 166–70. Association for Computational Linguistics.
Baayen, R. H. (2001). Word Frequency Distributions. Dordrecht: Kluwer Academic.
Baayen, R. H. (2003). Probabilistic approaches to morphology. In R. Bod, J. Hay, and S. Jannedy
(eds.), Probabilistic Linguistics, pp. 229–87. Cambridge, MA: MIT Press.
Baayen, R. H. (2008). Analyzing Linguistic Data. Cambridge: Cambridge University Press.
Baayen, R. H. (2014). Multivariate statistics. In R. J. Podesva and D. Sharma (eds.), Research
Methods in Linguistics, pp. 337–72. Cambridge: Cambridge University Press.
Baker, C., C. Fillmore, and J. Lowe (1998). The Berkeley FrameNet project. In Proceedings of
COLING-ACL 1998. Montreal.
Bamman, D. and G. Crane (2006). The design and use of a Latin dependency treebank. In
J. Hajič and J. Nivre (eds.), Proceedings of the Fifth International Workshop on Treebanks and
Linguistic Theories (TLT 2006). Prague, pp. 67–78. ÚFAL MFF UK.
Bamman, D. and G. Crane (2007). The Latin Dependency Treebank in a cultural heritage digital
library. In Proceedings of the Workshop on Language Technology for Cultural Heritage Data
(LaTeCH 2007). Prague, pp. 33–40.
Bamman, D. and G. Crane (2008). Building a dynamic lexicon from a digital library. In
Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2008).
Pittsburgh.
Bamman, D. and G. Crane (2011). The Ancient Greek and Latin Dependency Treebanks. In
C. Sporleder, A. Bosch, and K. Zervanou (eds.), Language Technology for Cultural Heritage: Theory and Applications of Natural Language Processing, pp. 79–98. Berlin/Heidelberg:
Springer.
Bamman, D., M. Passarotti, R. Busa, and G. Crane (2008). The annotation guidelines of the
Latin Dependency Treebank and Index Thomisticus Treebank. The treatment of some specific
syntactic constructions in Latin. In Proceedings of the Sixth International Conference on
Language Resources and Evaluation (LREC 2008). Marrakech.
Bamman, D., F. Mambrini, and G. Crane (2009). An ownership model of annotation: The
Ancient Greek Dependency Treebank. In Proceedings of the Eighth International
Workshop on Treebanks and Linguistic Theories (TLT 2009), Milan, pp. 5–15.
UniCATT.
Barðdal, J., T. Smitherman, V. Bjarnadóttir, S. Danesi, G. B. Jenset, and B. McGillivray (2012).
Reconstructing constructional semantics: The dative subject construction in Old Norse–
Icelandic, Latin, Ancient Greek, Old Russian and Old Lithuanian. Studies in Language 36(3),
511–47.
Baron, A. and P. Rayson (2009). Automatic standardization of texts containing spelling variation:
How much training data do you need? In Proceedings of Corpus Linguistics 2009.
Baroni, M. (2013). Composition in distributional semantics. Language and Linguistics Compass 7(10), 511–22.
Baroni, M. and A. Kilgarriff (2006). Linguistically-processed web corpora for multiple languages. In Proceedings of EACL 2006, Trento, Italy, pp. 87–90.
Baroni, M. and R. Zamparelli (2010). Nouns are vectors, adjectives are matrices: Representing
adjective–noun constructions in semantic space. In Proceedings of the 2010 Conference on
Empirical Methods in Natural Language Processing, pp. 1183–93. Association for Computational Linguistics.
Bayley, R. (2002). The quantitative paradigm. In J. K. Chambers, P. Trudgill, and N. Schilling-Estes (eds.), The Handbook of Language Variation and Change, pp. 117–41. Malden,
MA: Blackwell.
Beavers, J. and P. Sells (2014). Constructing and supporting a linguistic analysis. In R. J. Podesva
and D. Sharma (eds.), Research Methods in Linguistics, pp. 397–421. Cambridge: Cambridge
University Press.
Bech, K. and G. Walkden (2016). English is (still) a West Germanic language. Nordic Journal of
Linguistics 39(01), 65–100.
Bender, E. M. and J. Good (2010). A grand challenge for linguistics: Scaling up and integrating
models. White paper contributed to the National Science Foundation’s SBE 2020: Future
Research in the Social, Behavioral and Economic Sciences initiative.
Bennett, C. E. (1914). Syntax of Early Latin, vol. II—The Cases. Boston: Allyn & Bacon.
Benson, L. D. (ed.) (1987). The Riverside Chaucer (3rd edn). Oxford: Oxford University Press.
Bentein, K. (2012). The periphrastic perfect in Ancient Greek: A diachronic mental space
analysis. Transactions of the Philological Society 110(2), 171–211.
Benzécri, J.-P. (1973). L’Analyse des Données, vol. 1. Paris: Dunod.
Bergsland, K. and H. Vogt (1962). On the validity of glottochronology. Current Anthropology 3(2), 115–53.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association 33(203), 526–36.
Biber, D. and S. Conrad (2001). Register variation: A corpus approach. In D. Schiffrin, D. Tannen,
and H. E. Hamilton (eds.), The Handbook of Discourse Analysis, pp. 175–96. Oxford: Blackwell.
Biber, D., E. Finegan, and D. Atkinson (1994). ARCHER and its challenges: Compiling and
exploring A Representative Corpus of Historical English Registers. In U. Fries,
G. Tottie, and P. Schneider (eds.), Creating and Using English Language Corpora: Papers from the 14th
International Conference on English Language Research on Computerized Corpora, Zurich
1993, pp. 1–13. Amsterdam: Rodopi.
Bird, S., E. Klein, and E. Loper (2009). Natural Language Processing with Python. Sebastopol,
CA: O’Reilly.
Bizer, C., T. Heath, K. U. Idehen, and T. Berners-Lee (2008). Linked data on the web.
In Proceedings of the 17th International World Wide Web Conference (WWW2008), Beijing.
Bloomfield, L. (1933). Language. New York: Holt.
Blythe, R. A. and W. Croft (2012). S-curves and the mechanisms of propagation in language
change. Language 88(2), 269–304.
Bod, R. (2003). Introduction to elementary probability theory and formal stochastic language
theory. In R. Bod, J. Hay, and S. Jannedy (eds.), Probabilistic Linguistics, pp. 11–38. Cambridge,
MA: MIT Press.
Bod, R. (2014). A New History of the Humanities: The Search for Principles and Patterns from
Antiquity to the Present. Oxford: Oxford University Press.
Bod, R., J. Hay, and S. Jannedy (eds.) (2003). Probabilistic Linguistics. Cambridge, MA: MIT
Press.
Böhmová, A., J. Hajič, E. Hajičová, and B. Hladká (2003). The Prague Dependency Treebank:
A three-level annotation scenario. In A. Abeillé (ed.), Treebanks: Building and Using Parsed
Corpora, pp. 103–28. Dordrecht: Kluwer Academic.
Borin, L. and M. Forsberg (2008). Something old, something new: A computational morphological description of Old Swedish. In K. Ribarov and C. Sporleder (eds.), Proceedings of the
LREC 2008 Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008),
pp. 9–16.
Boschetti, F. (2010). A corpus-based approach to philological issues. Ph.D. thesis, University of
Trento, Trento, Italy.
Breivik, L. E. (1990). Existential There: A Synchronic and Diachronic Study (2nd edn). Oslo:
Novus Press.
Breivik, L. E. (1997). There in space and time. In H. Ramisch and K. Wynne (eds.), Language in
Time and Space: Studies in Honour of Wolfgang Viereck on the Occasion of his 60th birthday,
pp. 32–45. Stuttgart: Franz Steiner Verlag.
Bresnan, J., A. Cueni, T. Nikitina, and R. H. Baayen (2007). Predicting the dative alternation.
In G. Bouma, I. Kraemer, and J. Zwarts (eds.), Cognitive Foundations of Interpretation,
pp. 69–94. Amsterdam: Royal Netherlands Academy of Arts and Sciences.
Bülow, A. E. and J. Ahmon (2011). Preparing Collections for Digitization. London: Facet.
Busa, R. (1980). The annals of humanities computing: The Index Thomisticus. Computers and
the Humanities 14(2), 83–90.
Bybee, J. (2003). Mechanisms of change in grammaticization: The role of frequency. In
B. D. Joseph and R. D. Janda (eds.), The Handbook of Historical Linguistics, pp. 602–23.
Malden, MA: Blackwell.
Callison-Burch, C. (2009). Fast, cheap, and creative: Evaluating translation quality using
Amazon’s Mechanical Turk. In Proceedings of the 2009 Conference on Empirical Methods
in Natural Language Processing, Singapore, pp. 286–95. Association for Computational
Linguistics.
Campbell, L. (2013). Historical Linguistics: An Introduction (3rd edn). Edinburgh: Edinburgh
University Press.
Candela, L., D. Castelli, P. Manghi, and A. Tani (2015). Data journals: A survey. Journal of the
Association for Information Science and Technology 66.
Carnie, A. (2012). Syntax: A Generative Introduction (3rd (electronic) edn). Malden,
MA: Blackwell.
Carrier, R. C. (2012). Proving History: Bayes’s Theorem and the Quest for the Historical Jesus.
Amherst, NY: Prometheus.
Chambers, J. K. (2002). Patterns of variation including change. In J. K. Chambers, P. Trudgill,
and N. Schilling-Estes (eds.), The Handbook of Language Variation and Change, pp. 349–72.
Malden, MA: Blackwell.
Chiarcos, C., J. McCrae, P. Cimiano, and C. Fellbaum (2013). Towards open data for linguistics:
Linguistic linked data. In A. Oltramari, P. Vossen, L. Qin, and E. Hovy (eds.), New Trends
of Research in Ontologies and Lexical Resources. Heidelberg/New York/Dordrecht/London:
Springer.
Chomsky, N. (1957). Syntactic Structures. The Hague: Mouton.
Chrétien, C. D. (1962). The mathematical models of glottochronology. Language 38(1), 11–37.
Cimiano, P., P. Buitelaar, and M. Sintek (2011). LexInfo: A declarative model for the lexicon–
ontology interface. Journal of Web Semantics: Science, Services and Agents on the World Wide
Web 9, 29–51.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological
Measurement 20(1), 37–46.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd edn). Hillsdale, NJ:
Lawrence Erlbaum.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist 49(12), 997–1003.
Coopmans, P. (1989). Where stylistic and syntactic processes meet: Locative inversion in
English. Language 65(4), 728–51.
Corrie, M. (2006). Middle English—dialects and diversity. In L. Mugglestone (ed.), The Oxford
History of English, pp. 86–119. Oxford: Oxford University Press.
Crane, G. (1991). Generating and parsing classical Greek. Literary and Linguistic Computing 6(4), 243–245.
Crocco Galèas, G. and C. Iacobini (1992). Parasintesi e doppio stadio derivativo nella formazione
verbale del latino. Archivio Glottologico Italiano 77, 167–99.
Croft, W. (2000). Explaining Language Change: An Evolutionary Approach. London: Longman.
Croft, W. and D. A. Cruse (2004). Cognitive Linguistics. Cambridge: Cambridge University Press.
Cruse, A. (2011). Meaning in Language: An introduction to Semantics and Pragmatics (3rd edn).
Oxford: Oxford University Press.
Culpeper, J. and D. Archer (2008). Requests and directness in early modern English trial
proceedings and play-texts, 1640–1760. In A. H. Jucker and I. Taavitsainen (eds.), Speech Acts
in the History of English, pp. 45–84. Amsterdam/Philadelphia: John Benjamins.
Czeitschner, U., T. Declerk, and C. Resch (2013). Porting elements of the Austrian Baroque
Corpus onto the linguistics linked open data format. In P. Osenova, K. Simov, G. Georgiev,
and P. Nakov (eds.), Proceedings of the Joint NLP&LOD and SWAIE Workshops, RANLP,
Hissar, Bulgaria, pp. 12–16.
Davies, M. (2008). The Corpus of Contemporary American English (COCA): 410+ million
words, 1990–present.
Davies, M. (2010). The Corpus of Historical American English: 400 million words, 1810–2009.
Davies, M. (2011). Google Books (American English) Corpus (155 billion words, 1810–2009).
Available online at http://googlebooks.byu.edu/.
de Marneffe, M.-C. and C. Potts (2014). Developing linguistic theories using annotated corpora.
In N. Ide and J. Pustejovsky (eds.), The Handbook of Linguistic Annotation. Berlin: Springer.
Declerck, T., U. Czeitschner, K. Moerth, C. Resch, and G. Budin (2011). A text technology infrastructure for annotating corpora in the eHumanities. In S. Gradmann, F. Borri, C. Meghini,
and H. Schuldt (eds.), Proceedings of the International Conference on Theory and Practice of
Digital Libraries (TPDL–2011), pp. 457–60.
Deignan, A. (2005). Metaphor and Corpus Linguistics. Amsterdam: John Benjamins.
Denison, D. (2002). Log(ist)ic and simplistic S-curves. In R. Hickey (ed.), Motives for Language
Change, pp. 54–70. Cambridge: Cambridge University Press.
Depuydt, K. and J. de Does (2009). Computational tools and lexica to improve access to text.
In E. Beijk and L. Colman (eds.), Fons Verborum. Feestbundel voor prof. dr. A.M.F.J. (Fons)
Moerdijk, aangeboden door vrienden en collega’s bij zijn afscheid van het INL, pp. 187–99.
Leiden/Amsterdam: Instituut voor Nederlandse Lexicologie.
Dilthey, W. (1991). Selected Works, vol. I. Princeton, NJ: Princeton University Press.
Downs, M. E., B. Z. Lund, R. Talbert, M. J. McDaniel, J. Becker, N. Jovanovic, S. Gillies, and
T. Elliott. Places: 1004 ((H)Adriaticum/Superum Mare). Pleiades.
Doyle, P. (2005). Replicating corpus-based linguistics: Investigating lexical networks in text. In
Proceedings of the Corpus Linguistics Conference. University of Birmingham, UK.
Dufresne, M., F. Dupuis, and M. Tremblay (2003). Preverbs and particles in Old French.
Yearbook of Morphology, 33–59.
Dunn, M., A. Terrill, G. Reesink, R. A. Foley, and S. C. Levinson (2005). Structural phylogenetics
and the reconstruction of ancient language history. Science 309(5743), 2072–5.
Ellegård, A. (1953). The Auxiliary Do: The Establishment and Regulation of its Use in English.
Stockholm: Almquist & Wiksell.
Ellegård, A. (1959). Statistical measurement of linguistic relationship. Language 35(2), 131–56.
Elliott, T. and S. Gillies (2009). Data and code for ancient geography: Shared effort across
projects and disciplines. In Digital Humanities 2009 Conference Abstracts, pp. 4–6.
Elliott, T. and S. Gillies (2011). Pleiades: An un-GIS for ancient geography. In Digital Humanities
2011, Conference Abstracts, Stanford, pp. 311–12. Stanford University.
Emonds, J. E. and J. T. Faarlund (2014). English: The Language of the Vikings. Olomouc Modern
Language Monographs. Palackỳ University.
Evert, S. (2006). How random is a corpus? The library metaphor. Zeitschrift für Anglistik und
Amerikanistik 54(2), 177–90.
Faraway, J. J. (2005). Linear Models with R. Boca Raton, FL: Chapman & Hall/CRC.
Farrar, S. and D. T. Langendoen (2003). A linguistic ontology for the semantic web. GLOT
International 7(3), 97–100.
Faudree, P. and M. P. Hansen (2014). Language, society, and history towards a unified approach?
In The Cambridge Handbook of Linguistic Anthropology, pp. 227–49. Cambridge: Cambridge
University Press.
Fellbaum, C. (1998). Wordnet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
Ferraresi, A., E. Zanchetta, M. Baroni, and S. Bernardini (2008). Introducing and
evaluating ukWaC, a very large web-derived corpus of English. In S. Evert, A. Kilgarriff, and
S. Sharoff (eds.), Proceedings of the 4th LREC Web as Corpus Workshop (WAC-4)—Can We
Beat Google?, Marrakech, Morocco. European Language Resources Association.
Fillmore, C. J. (1992). ‘Corpus linguistics’ or ‘computer-aided armchair linguistics’. In J. Svartvik
(ed.), Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Stockholm, 4–8
August 1991, pp. 35–60. Berlin: Mouton de Gruyter.
Fischer, O. (1996). Syntax. In N. Blake (ed.), The Cambridge History of the English Language,
vol. II: 1066–1476, pp. 207–408. Cambridge: Cambridge University Press.
Fischer, O. (2004). Grammar change versus language change: Is there a difference? In C. Kay,
S. Horobin, and J. J. Smith (eds.), New Perspectives on English Historical Linguistics. Selected papers from 12 ICEHL. Glasgow, 21–26 August 2002, pp. 31–63. Philadelphia, PA:
John Benjamins.
Fischer, O. (2007). Morphosyntactic Change: Functional and Formal Perspectives (electronic
edn). Oxford: Oxford University Press.
Fodor, I. (1961). The validity of glottochronology on the basis of the Slavonic languages. Studia
Slavica 7(4), 295–346.
Forster, P. and C. Renfrew (eds.) (2006). Phylogenetic Methods and the Prehistory of Languages.
Cambridge: McDonald Institute for Archeological Research.
Fought, C. (2002). Ethnicity. In J. K. Chambers, P. Trudgill, and N. Schilling-Estes (eds.), The
Handbook of Language Variation and Change, pp. 444–72. Malden, MA: Blackwell.
Freitas, A., E. Curry, and S. O'Riain (2012). A distributional approach for terminological semantic
search on the linked data web. pp. 384–91.
Gale, W. and G. Sampson (1995). Good-Turing frequency estimation without tears. Journal of Quantitative
Linguistics 2(3), 217–37.
Galves, C. and H. Britto (2002). The Tycho Brahe Corpus of Historical Portuguese. Technical
report, Department of Linguistics, University of Campinas. Online publication, 1st.
García García, L. (2000). A case study in historical linguistic research. In Perspectives on the
Genitive in English: Synchronic, Diachronic, Contrastive and Research, vol. 1, pp. 118–29.
Universidad de Sevilla.
Geeraerts, D. (2006). Methodology in cognitive linguistics. In G. Kristiansen, M. Achard,
R. Dirven, and F. J. Ruiz de Mendoza Ibáñez (eds.), Cognitive Linguistics: Current Applications
and Future Perspectives, pp. 21–50. Berlin: Mouton de Gruyter.
Gelderen, E. v. (2014). Generative syntax and language change. In C. Bowern and B. Evans (eds.),
The Routledge Handbook of Historical Linguistics, pp. 326–42. Abingdon, UK: Routledge.
Gelman, A. (2012). Statistics in a world where nothing is random. Blog post. Accessed
13/09/2015 from http://andrewgelman.com/2012/12/17/statistics-in-a-world-where-nothing-is-random/.
Gelman, A. and J. Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical
Models. Cambridge: Cambridge University Press.
Gelman, A. and E. Loken (2014). The statistical crisis in science: Data-dependent analysis—a
“garden of forking paths”—explains why many statistically significant comparisons don’t hold
up. American Scientist 102(6), 460.
Gibson, E. and E. Fedorenko (2013). The need for quantitative methods in syntax and semantics
research. Language and Cognitive Processes 28(1–2), 88–124.
Gilliland, A. J. (2008). Setting the stage. In M. Baca (ed.), Introduction to Metadata (2nd edn).
Los Angeles: Getty.
Goldthorpe, J. H. (2001). Causation, statistics, and sociology. European Sociological Review 17(1),
1–20.
Gorrell, J. H. (1895). Indirect discourse in Anglo-Saxon. PMLA 10(3), 342–485.
Gotscharek, A., A. Neumann, U. Reffle, C. Ringlstetter, and K. U. Schulz (2009). Constructing a
lexicon from a historical corpus. In Proceedings of the Conference of the American Association
for Corpus Linguistics (AACL09), Edmonton.
Gould, S. J. (1985). The Flamingo’s Smile: Reflections in Natural History. New York: Norton.
Greenacre, M. (2007). Correspondence Analysis in Practice (2nd edn). Boca Raton, FL: Chapman
& Hall/CRC.
Gries, S. T. (2005). Null-hypothesis significance testing of word frequencies: A follow-up on
Kilgarriff. Corpus Linguistics and Linguistic Theory 1(2), 277–94.
Gries, S. T. (2006a). Introduction. In S. T. Gries and A. Stefanowitsch (eds.), Corpora in Cognitive
Linguistics: Corpus-Based Approaches to Syntax and Lexis, pp. 1–17. Berlin & New York:
Mouton de Gruyter.
Gries, S. T. (2006b). Some proposals towards a more rigorous corpus linguistics. Zeitschrift für
Anglistik und Amerikanistik 54(2), 191–202.
Gries, S. T. (2009a). Quantitative Corpus Linguistics with R: A Practical Introduction. New York:
Routledge.
Gries, S. T. (2009b). Statistics for Linguistics with R: A Practical Introduction. Berlin: Mouton de
Gruyter.
Gries, S. T. (2011). Methodological and interdisciplinary stance in corpus linguistics. In
G. Barnbrook, V. Viana, and S. Zyngier (eds.), Perspectives on Corpus Linguistics: Connections
and Controversies, pp. 81–98. Amsterdam: John Benjamins.
Gries, S. T. (2015). The most under-used statistical method in corpus linguistics: Multi-level
(and mixed-effects) models. Corpora 10(1), 95–125.
Gries, S. T. and A. L. Berez (2015). Linguistic annotation in/for corpus linguistics. In
N. Ide and J. Pustejovsky (eds.), Handbook of Linguistic Annotation. Berlin/New York: Springer.
Gries, S. T. and M. Hilpert (2010). Modeling diachronic change in the third person singular:
a multifactorial, verb- and author-specific exploratory approach. English Language and
Linguistics 14(3), 293–320.
Gries, S. T. and J. Newman (2014). Creating and using corpora. In R. J. Podesva and D. Sharma
(eds.), Research Methods in Linguistics, pp. 257–87. Cambridge: Cambridge University Press.
Grondelaers, S., D. Geeraerts, and D. Speelman (2007). A case for a cognitive corpus linguistics.
In M. Gonzalez-Marquez, I. Mittelberg, S. Coulson, and M. J. Spivey (eds.), Methods in
Cognitive Linguistics, pp. 149–69. Amsterdam: John Benjamins.
Guiraud, P. (1959). Problèmes et méthodes de la statistique linguistique. Dordrecht: Reidel.
Hajič, J., J. Panevová, Z. Urešová, A. Bémová, V. Kolárová-Reznícková, and P. Pajas (2003).
PDT-VALLEX: Creating a large coverage valency lexicon for treebank annotation. In J. Nivre
and E. Hinrichs (eds.), Proceedings of the Second Workshop on Treebanks and Linguistic
Theories (TLT 2003), Växjö. vol. 9, pp. 57–68. Växjö University Press.
Halpin, H., V. Robu, and H. Shepherd (2007). The complex dynamics of collaborative tagging.
In Proceedings of the International Conference on World Wide Web. ACM Press.
Harris, R. A. (1993). The Linguistics Wars. New York: Oxford University Press.
Harrison, S. (2003). On the limits of the comparative method. In B. D. Joseph and R. D. Janda
(eds.), The Handbook of Historical Linguistics, pp. 213–243. Malden, MA: Blackwell.
Haug, D., M. Jøhndal, H. Eckhoff, E. Welo, M. Hertzenberg, and A. Müth (2009). Computational and linguistic issues in designing a syntactically annotated parallel corpus of
Indo-European languages. Traitement automatique des langues 50, 17–45.
Haug, D. T. T. and M. L. Jøhndal (2008). Creating a parallel treebank of the old Indo-European
Bible translations. In Proceedings of Language Technologies for Cultural Heritage Workshop
(LREC 2008), Marrakech, pp. 27–34.
Haverling, G. (2000). On SCO verbs, prefixes and semantic functions. Number 64 in Studia
Graeca et Latina Gothoburgensia. Göteborg: Acta Universitatis Gothoburgensis.
Hay, J. and P. Foulkes (2016). The evolution of medial /t/ over real and remembered time.
Language 92(2), 298–330.
Hay, J. B., J. B. Pierrehumbert, A. J. Walker, and P. LaShell (2015). Tracking word frequency
effects through 130 years of sound change. Cognition 139, 83–91.
Heggelund, Ø. (2015). On the use of data in historical linguistics: Word order in early English
subordinate clauses. English Language and Linguistics 19(01), 83–106.
Hellmann, S., J. Lehmann, S. Auer, and M. Brümmer (2013). Integrating NLP using linked data.
In H. Alani, L. Kagal, A. Fokoue, P. Groth, C. Biemann, J. Parreira, L. Aroyo, N. Noy, C. Welty,
and K. Janowicz (eds.), The Semantic Web—ISWC 2013, vol. 8219, Lecture Notes in Computer
Science, pp. 98–113. Springer Berlin Heidelberg.
Hey, T., S. Tansley, and K. Tolle (eds.) (2009). The Fourth Paradigm: Data-Intensive Scientific
Discovery. Redmond, WA: Microsoft Research.
Hilpert, M. and S. T. Gries (2009). Assessing frequency changes in multistage diachronic
corpora: Applications for historical corpus linguistics and the study of language acquisition.
Literary and Linguistic Computing 24(4), 385–401.
Hilpert, M. and S. T. Gries (2016). Quantitative approaches to diachronic corpus linguistics. In
M. Kytö and P. Pahta (eds.), The Cambridge Handbook of English Historical Linguistics, pp.
36–53. Cambridge: Cambridge University Press.
Hinton, P. R. (2004). Statistics Explained. London: Routledge.
Hockett, C. F. (1958). A Course in Modern Linguistics. Oxford: Macmillan.
Horobin, S. and J. Smith (2002). An Introduction to Middle English. Edinburgh: Edinburgh
University Press.
Huang, L., Y. Peng, H. Wang, and Z. Wu (2002). PCFG parsing for restricted Classical
Chinese texts. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing,
Stroudsburg, PA, pp. 1–6. Association for Computational Linguistics.
Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Hymes, D. H. (1960). Lexicostatistics so far. Current Anthropology 1(1), 3–44.
Hájek, A. (2012). Interpretations of probability. In E. N. Zalta (ed.), The Stanford Encyclopedia
of Philosophy (winter 2012 edn). Stanford.
Iacobini, C. and F. Masini (2007). Verb-particle constructions and prefixed verbs in Italian:
typology, diachrony and semantics. In G. Booij, B. Fradin, A. Ralli, and S. Scalise (eds.),
On-line Proceedings of the Fifth Mediterranean Morphology Meeting (MMM5), pp. 157–84.
Università degli Studi di Bologna.
Ide, N. and C. Macleod (2001). The American National Corpus: A standardized resource of
American English. In Proceedings of Corpus Linguistics 2001, Lancaster.
Irvine, S. (2006). Beginnings and transitions: Old English. In L. Mugglestone (ed.), The Oxford
History of English, pp. 32–60. Oxford: Oxford University Press.
Jackson, H. (2002). Lexicography: An Introduction. London: Routledge.
Jenset, G. B. (2010). A corpus-based study on the evolution of there: Statistical analysis and
cognitive interpretation. Ph.D. thesis, University of Bergen.
Jenset, G. B. (2013). Mapping meaning with distributional methods: A diachronic corpus-based
study of existential there. Journal of Historical Linguistics 3(2), 272–306.
Jenset, G. B. (2014). In search of the S (curve) in there. In K. E. Haugland, K. A. Rusten, and
K. McCafferty (eds.), ‘Ye whom the charms of grammar please’: Studies in English Language
History in Honour of Leiv Egil Breivik, pp. 27–54. Oxford: Peter Lang.
Jenset, G. B. and B. McGillivray (2012). Multivariate analyses of affix productivity in translated
English. In M. Oakes and M. Ji (eds.), Quantitative Methods in Corpus-Based Translation
Studies, pp. 301–23. Amsterdam: John Benjamins.
Johnson, K. (2008). Quantitative Methods in Linguistics. Oxford: Blackwell.
Joseph, B. D. and R. D. Janda (eds.) (2003). The Handbook of Historical Linguistics. Oxford:
Blackwell.
Joulain, A., I. Gregory, and A. Hardie (2013). The spatial patterns in historical texts: Combining
corpus linguistics and geographical information systems to explore places in Victorian
newspapers. In Exploring Historical Sources: Abstracts of Presentations.
Kenter, T., T. Erjavec, M. Z. Dulmin, and D. Fišer (2012). Lexicon construction and corpus
annotation of historical language with the CoBaLT editor. In Proceedings of the 6th EACL
Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities,
pp. 1–6. Association for Computational Linguistics.
Kestemont, M., W. Daelemans, and G. De Pauw (2010). Weigh your words—memory-based
lemmatization for Middle Dutch. Literary and Linguistic Computing 25(3), 287–301.
Kilgarriff, A. (2005). Language is never, ever, ever, random. Corpus Linguistics and Linguistic
Theory 1(2), 263–76.
Kilgarriff, A., P. Rychly, P. Smrz, and D. Tugwell (2004). The Sketch Engine. In G. Williams
and S. Vessier (eds.), Proceedings of the Eleventh Euralex International Congress, Lorient,
pp. 105–16. Université de Bretagne-Sud.
Kingsbury, P. and M. Palmer (2002). From Treebank to Propbank. In Proceedings of the Third
International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas,
Canary Islands.
Koch, U. (1993). The enhancement of a dependency parser for Latin. Technical Report
AI-1993-03, Artificial Intelligence Programs, University of Georgia.
Köhler, R. (1999). Syntactic structures: Properties and interrelations. Journal of Quantitative
Linguistics 6(1), 46–57.
Köhler, R. (2012). Quantitative Syntax Analysis, vol. 65. Walter de Gruyter.
Kolachina, S. and P. Kolachina (2012). Parsing any domain English text to CoNLL
dependencies. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani,
A. Moreno, J. Odijk, and S. Piperidis (eds.), Proceedings of the Eighth International Conference
on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language
Resources Association.
Korhonen, A., Y. Krymolowski, and T. Briscoe (2006). A large subcategorization lexicon for
natural language processing applications. In Proceedings of the Fifth International Conference
on Language Resources and Evaluation (LREC 2006). Genoa.
Kretzschmar, W. A. and S. Tamasi (2003). Distributional foundations for a theory of language
change. World Englishes 22(4), 377–401.
Kroch, A. (1989). Reflexes of grammar in patterns of language change. Language Variation and
Change 1, 199–244.
Kroch, A., B. Santorini, and L. Delfs (2004). Penn–Helsinki Parsed Corpus of Early Modern
English. http://www.ling.upenn.edu/hist-corpora/PPCEME-RELEASE-3/index.html.
Kroch, A. and A. Taylor (2000). The Penn–Helsinki Parsed Corpus of Middle English
(PPCME2). Technical report, Department of Linguistics, University of Pennsylvania.
Kroch, A., B. Santorini, and L. Delfs (2004). The Penn–Helsinki Parsed Corpus of Early Modern
English (PPCEME). Technical report, Department of Linguistics, University of Pennsylvania.
Kroch, A., B. Santorini, and A. Diertani (2010). The Penn–Helsinki Parsed Corpus of Modern British English (PPCMBE). Technical report, Department of Linguistics, University of
Pennsylvania.
Kroeber, A. L. and C. D. Chrétien (1937). Quantitative classification of Indo-European languages. Language 13(2), 83–103.
Kytö, M. and T. Walker (2006). Guide to A Corpus of English Dialogues 1560–1760. Studia
Anglistica Upsaliensia 130.
Labov, W. (1972). Some principles of linguistic methodology. Language in Society 1(1),
97–120.
Lau, J. H., A. Clark, and S. Lappin (2015). Unsupervised prediction of acceptability judgements.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and
the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers),
Beijing, China, pp. 1618–28. Association for Computational Linguistics.
Leech, G. (1997). Introducing corpus annotation. In Corpus annotation: Linguistic Information
from Computer Text Corpora (3rd edn). London: Longman.
Lenci, A. (2008). Distributional semantics in linguistic and cognitive research: A foreword.
Italian Journal of Linguistics 20, 1–31.
Lenci, A., B. McGillivray, S. Montemagni, and V. Pirrelli (2008). Unsupervised acquisition
of verb subcategorization frames from shallow-parsed corpora. In Proceedings of the 6th
Language Resources and Evaluation Conference (LREC 2008). Marrakech, pp. 3000–6.
European Language Resources Association.
Lenci, A., S. Montemagni, and V. Pirrelli (2005). Testo e computer. Elementi di linguistica
computazionale. Roma: Carocci.
Leonelli, S. (2016). Researching Life in the Digital Age: A Philosophical Study of Data-Centric
Biology. Chicago, IL: Chicago University Press.
Levin, B. (1993). English Verb Classes and Alternations: A Preliminary Investigation. Chicago:
University of Chicago Press.
Levy, R. (2008). Expectation-based syntactic comprehension. Cognition 106(3), 1126–77.
Lewis, C. T. and C. Short (1879). A Latin Dictionary, Founded on Andrews’ edition of Freund’s
Latin dictionary revised, enlarged, and in great part rewritten by Charlton T. Lewis, Ph.D.
and Charles Short, LL.D. Oxford: Clarendon, http://www.lib.uchicago.edu/efts/PERSEUS/
Reference/lewisandshort.html.
Lieberman, E., J.-B. Michel, J. Jackson, T. Tang, and M. A. Nowak (2007). Quantifying the
evolutionary dynamics of language. Nature 449(7163), 713–16.
Lightfoot, D. (1989). The child’s trigger experience: Degree-0 learnability. Behavioral and Brain
Sciences 12(02), 321–34.
Lightfoot, D. (2006). How New Languages Emerge. Cambridge: Cambridge University Press.
Lightfoot, D. W. (2013). Types of explanation in history. Language 89(4), e18–e38.
Long, J. S. and J. Freese (2001). Regression Models for Categorical Dependent Variables Using
STATA. College Station, TX: Stata Press.
Lüdeling, A., H. Hirschmann, and A. Zeldes (2011). Variationism and underuse statistics in the
analysis of the development of relative clauses in German. In Y. Kawaguchi, M. Minegishi, and
W. Viereck (eds.), Corpus-Based Analysis and Diachronic Linguistics, pp. 37–58. Amsterdam:
John Benjamins.
McCarthy, D. (2001). Lexical acquisition at the syntax–semantics interface: Diathesis alternations, subcategorization frames and selectional preferences. Ph.D. thesis, University of
Sussex.
McColl Millar, R. (2012). English Historical Sociolinguistics. Edinburgh: Edinburgh University
Press.
McCrae, J., E. Montiel-Ponsoda, and P. Cimiano (2012). Integrating WordNet and Wiktionary
with lemon. In C. Chiarcos, S. Nordhoff, and S. Hellmann (eds.), Linked Data in Linguistics,
pp. 25–34. Heidelberg/New York/Dordrecht/London: Springer.
McEnery, T. and H. Baker (2014). The corpus as historian: Using corpora to investigate the past.
In Exploring Historical Sources: Abstracts of Presentations.
McEnery, T. and A. Hardie (2012). Corpus Linguistics: Method, Theory and Practice. Cambridge: Cambridge University Press.
McEnery, T. and A. Wilson (2001). Corpus Linguistics. An Introduction. Edinburgh: Edinburgh
University Press.
McGillivray, B. (2012). Latin preverbs and verb argument structure: New insights from new methods. In J. Barðdal, M. Cennamo, and E. van Gelderen (eds.), Argument Structure: The Naples/Capri Papers. Amsterdam: John Benjamins.
McGillivray, B. (2013). Methods in Latin Computational Linguistics. Leiden: Brill.
McGillivray, B. and A. Kilgarriff (2013). Tools for historical corpus research, and a corpus of
Latin. In P. Bennett, M. Durrell, S. Scheible, and R. J. Whitt (eds.), New Methods in Historical
Corpus Linguistics, vol. 3, Corpus Linguistics and Interdisciplinary Perspectives on Language. Tübingen: Narr.
McGillivray, B., M. Passarotti, and P. Ruffolo (2009). The Index Thomisticus Treebank Project:
Annotation, Parsing and Valency Lexicon. TAL 50(2), 103–27.
McGillivray, B. and A. Vatri (2015). Computational valency lexica for Latin and Greek in use:
A case study of syntactic ambiguity. Journal of Latin Linguistics 14, 101–26.
McMahon, A. (2006). Restructuring Renaissance English. In L. Mugglestone (ed.), The Oxford History of English, pp. 147–77. Oxford: Oxford University Press.
McMahon, A. and R. McMahon (2005). Language Classification by Numbers. Oxford: Oxford
University Press.
Mair, C. (2004). Corpus linguistics and grammaticalization theory: Statistics, frequencies, and
beyond. In H. Lindquist and C. Mair (eds.), Corpus Approaches to Grammaticalization in
English, pp. 121–50. Amsterdam: John Benjamins.
Manning, C. D. (2003). Probabilistic syntax. In R. Bod, J. Hay, and S. Jannedy (eds.), Probabilistic
Linguistics, pp. 289–342. Cambridge, MA: MIT Press.
Martineau, F., P. Hirschbühler, A. Kroch, and Y. C. Morin (2010). Corpus MCVF, Modéliser le changement: les voies du français. Technical report, Département de français, University of Ottawa. CD-ROM.
Mason, H. and D. Patil (2015). Data Driven: Creating a Data Culture. Sebastopol, CA: O’Reilly.
Mayer-Schönberger, V. and K. Cukier (2013). Big Data: A Revolution That Will Transform How We Live, Work and Think. Boston: Houghton Mifflin Harcourt.
Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry 1(2), 108–41.
Meillet, A. and J. Vendryes (1963). Traité de grammaire comparée des langues classiques. Paris:
Librairie Ancienne Honoré Champion.
Meini, L. and B. McGillivray (2010). Between semantics and syntax: Spatial verbs and prepositions in Latin. In Proceedings of the Space in Language Conference, 8–10 October 2009, Pisa.
ETS.
Menini, S. (2014). Computational analysis of historical texts. In Exploring Historical Sources:
Abstracts of Presentations.
Messiant, C., A. Korhonen, and T. Poibeau (2008). LexSchem: A large subcategorization lexicon for French verbs. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco.
Meyer, E. T. and R. Schroeder (2015). Knowledge Machines: Digital Transformations of the
Sciences and Humanities. Cambridge, MA: MIT Press.
Miller, G. A., R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography 3(4),
235–44.
Moore, G. A. (1991). Crossing the Chasm: Marketing and Selling High-Tech Products to
Mainstream Customers. New York: HarperBusiness.
Morton, R. (2014). Using TEI mark-up and pragmatic classification in the construction and
analysis of the British Telecom Correspondence Corpus. In Exploring Historical Sources:
Abstracts of Presentations.
Mosteller, F. (1968). Association and estimation in contingency tables. Journal of the American Statistical Association 63(321), 1–28.
Munro, R., S. Bethard, V. Kuperman, V. Tzuyin Lai, R. Melnick, C. Potts, T. Schnoebelen, and
H. Tily (2010). Crowdsourcing and language studies: The new generation of linguistic data.
In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data
with Amazon’s Mechanical Turk, Los Angeles, pp. 122–30. Association for Computational
Linguistics.
Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination.
Biometrika 78(3), 691–2.
Nevalainen, T. and H. Raumolin-Brunberg (2003). Historical Sociolinguistics: Language Change in Tudor and Stuart England. London: Longman.
Nevalainen, T. (2006). Mapping change in Tudor English. In L. Mugglestone (ed.), The Oxford
History of English, pp. 178–211. Oxford: Oxford University Press.
Pagel, M., Q. D. Atkinson, and A. Meade (2007). Frequency of word-use predicts rates of lexical
evolution throughout Indo-European history. Nature 449(7163), 717–20.
Passarotti, M. (2007a). LEMLAT. Uno strumento per la lemmatizzazione morfologica automatica del latino. In F. Citti and T. Del Vecchio (eds.), From Manuscript to Digital Text: Problems
of Interpretation and Markup. Proceedings of the Colloquium (Bologna, June 12th 2003). Roma,
pp. 107–28.
Passarotti, M. (2007b). Verso il Lessico Tomistico Biculturale. La treebank dell’Index Thomisticus. In R. Petrilli and D. Femia (eds.), Il filo del discorso. Intrecci testuali, articolazioni
linguistiche, composizioni logiche. Atti del XIII Congresso Nazionale della Società di Filosofia
del Linguaggio. Viterbo, pp. 187–205.
Passarotti, M. (2010). Leaving behind the less-resourced status: The case of Latin through the
experience of the Index Thomisticus Treebank. In Proceedings of the 7th SaLTMiL Workshop
on Creation and Use of Basic Lexical Resources for Less-Resourced Languages, LREC 2010.
La Valletta, Malta, 23 May 2010, pp. 27–32.
Passarotti, M. (2014). From syntax to semantics: First steps towards tectogrammatical annotation of Latin. In Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), pp. 100–9. Association for Computational
Linguistics.
Passarotti, M. and F. Dell’Orletta (2010). Improvements in parsing the Index Thomisticus
Treebank. Revision, combination and a feature model for medieval Latin. In Proceedings of
the Seventh International Conference on Language Resources and Evaluation (LREC 2010). May
19–21, 2010, La Valletta, Malta, pp. 1964–71. European Language Resources Association.
Passarotti, M., B. McGillivray, and D. Bamman (2015). A treebank-based study on Latin word
order. In Proceedings of the 16th International Colloquium on Latin Linguistics, Uppsala.
Passarotti, M. and P. Ruffolo (2009). Parsing the Index Thomisticus Treebank: Some preliminary
results. In P. Anreiter and M. Kienpointner (eds.), Proceedings of the 15th International
Colloquium on Latin Linguistics, Innsbrucker Beiträge zur Sprachwissenschaft, Innsbruck.
Penke, M. and A. Rosenbach (eds.) (2007a). What Counts as Evidence in Linguistics. Amsterdam:
John Benjamins.
Penke, M. and A. Rosenbach (2007b). What counts as evidence in linguistics? An introduction.
In M. Penke and A. Rosenbach (eds.), What Counts as Evidence in Linguistics, pp. 1–49.
Amsterdam: John Benjamins.
Pereira, F. C. (2000). Formal grammar and information theory: Together again? Philosophical Transactions of the Royal Society 358, 1239–53.
Pereltsvaig, A. and M. W. Lewis (2015). The Indo-European Controversy (electronic edn).
Cambridge: Cambridge University Press.
Pintzuk, S. (2003). Variationist approaches to syntactic change. In B. D. Joseph and R. D. Janda
(eds.), The Handbook of Historical Linguistics, pp. 509–28. Malden, MA: Blackwell.
Pintzuk, S. and L. Plug (2002). The York–Helsinki Parsed Corpus of Old English Poetry.
Technical report, Department of Linguistics, University of York. Oxford Text Archive.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts. Synthesis Lectures on
Human Language Technologies. Morgan & Claypool.
Podesva, R. J. and D. Sharma (eds.) (2014). Research Methods in Linguistics. Cambridge:
Cambridge University Press.
Popper, K. (1959). The Logic of Scientific Discovery (2002 edn). London: Routledge.
Pullum, G. (2009). Computational linguistics and generative linguistics: The triumph of hope
over experience. In T. Baldwin and V. Kordoni (eds.), Proceedings of the EACL 2009 Workshop
on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or
Vacuous?, Athens, Greece, pp. 12–21. Association for Computational Linguistics.
Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus
Linguistics 13, 519–49.
Rayson, P., D. Archer, A. Baron, J. Culpeper, and N. Smith (2007). Tagging the bard: Evaluating
the accuracy of a modern POS tagger in early modern English corpora. In Corpus Linguistics
Conference (CL2007), Birmingham. University of Birmingham.
Redman, T. C. (2008). Data Driven: Profiting from Your Most Important Business Asset.
New York. Harvard Business Review Press.
Resch, C., T. Declerck, B. Krautgartner, and U. Czeitschner (2014). ABaC:us revisited. Extracting and linking lexical data from a historical corpus of sacred literature. In C. Brierley,
M. Sawalha, and E. Atwell (eds.), Proceedings of the 2nd Workshop on Language Resources
and Evaluation for Religious Texts (LRE-REL 2), pp. 36–41.
Ringe, D. and J. F. Eska (2013). Historical Linguistics: Toward a Twenty-First Century Reintegration (electronic edn). Cambridge: Cambridge University Press.
Risen, J. and T. Gilovich (2007). Informal logical fallacies. In R. J. Sternberg, H. L. Roediger III, and D. F. Halpern (eds.), Critical Thinking in Psychology, pp. 110–30. Cambridge: Cambridge University Press.
Romaine, S. (1982). Socio-Historical Linguistics: Its Status and Methodology. New York: Cambridge University Press.
Ross, A. S. (1950). Philological probability problems. Journal of the Royal Statistical Society. Series
B (Methodological) 12(1), 19–59.
Rovai, F. (2012). Between feminine singular and neuter plural: Re-analysis patterns. Transactions
of the Philological Society 110(1), 94–121.
Rusten, K. A. (2014). Null referential subjects from Old to early modern English.
In K. E. Haugland, K. McCafferty, and K. A. Rusten (eds.), ‘Ye whom the charms of grammar
please’: Studies in English Language History in Honour of Leiv Egil Breivik, Oxford, pp. 249–70.
Peter Lang.
Rusten, K. A. (2015). A quantitative study of empty referential subjects in Old English prose and
poetry. Transactions of the Philological Society 113(1), 53–75.
Rydén, M. (1980). Syntactic variation in a historical perspective. In S. Jacobson (ed.), Papers from the Scandinavian Symposium on Syntactic Variation, Stockholm, 18–19 May 1979, pp. 37–45. Stockholm: Almqvist & Wiksell.
Sabou, M., K. Bontcheva, L. Derczynski, and A. Scharl (2014). Corpus annotation through
crowdsourcing: Towards best practice guidelines. In N. Calzolari, K. Choukri, T. Declerck,
H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis (eds.), Proceedings
of the Ninth International Conference on Language Resources and Evaluation (LREC’14),
Reykjavik, Iceland. European Language Resources Association.
Salvi, G. and L. Vanelli (1992). Grammatica essenziale di riferimento della lingua italiana. Istituto
Geografico De Agostini. Le Monnier.
Sampson, G. R. (2001). Empirical Linguistics. London/New York: Continuum.
Sampson, G. R. (2003). Statistical linguistics. In W. J. Frawley (ed.), International Encyclopedia
of Linguistics (2nd edn). New York: Oxford University Press.
Sampson, G. R. (2005). Quantifying the shift towards empirical methods. International Journal
of Corpus Linguistics 10, 10–36.
Sampson, G. R. (2013). The empirical trend: Ten years on. International Journal of Corpus
Linguistics 18(2), 281–9.
Sanchez-Marco, C., G. Boleda, and L. Padró (2011). Extending the tool, or how to annotate
historical language varieties. In Proceedings of the 5th ACL-HLT Workshop on Language
Technology for Cultural Heritage, Social Sciences, and Humanities, Portland, OR, pp. 1–9.
Association for Computational Linguistics.
Sandra, D. and S. Rice (1995). Network analyses of prepositional meaning: Mirroring whose
mind—the linguist’s or the language user’s? Cognitive Linguistics 6(1), 89–130.
Schlesewsky, M. and I. Bornkessel (2004). On incremental interpretation: Degrees of meaning
accessed during sentence comprehension. Lingua 114(9–10), 1213–34.
Schmid, H. (1994). TreeTagger: A language independent part-of-speech tagger. Available at http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.
Schmid, H. (1995). Improvements in part-of-speech tagging with an application to German.
In Proceedings of the ACL SIGDAT Workshop, pp. 47–50.
Schneider, G. (2008). Hybrid long-distance functional dependency parsing. Ph.D. thesis, Universität Zürich, Zurich, Switzerland.
Schneider, G. (2012). Adapting a parser to historical English. In M. Rissanen, J. Tyrkkö, T. Nevalainen, and M. Kilpiö (eds.), Proceedings of the Helsinki Corpus Festival, Studies in Variation, Contacts and Change in English. Helsinki: Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki.
Schulte im Walde, S. (2004). GermaNet synsets as selectional preferences in semantic verb
clustering. Journal for Computational Linguistics and Language Technology 19(1/2), 69–79.
Schulte im Walde, S. (2007). The induction of verb frames and verb classes from corpora. In Corpus Linguistics: An International Handbook. Berlin: Mouton de Gruyter.
Schütze, H. (1998). Automatic word sense discrimination. Computational Linguistics 24(1),
97–123.
Sgall, P., E. Hajicová, and J. Panevová (1986). The Meaning of the Sentence in its Semantic and
Pragmatic Aspects. Dordrecht, NL: Reidel.
Sinclair, J. (2004). Trust the Text: Language, Corpus and Discourse. London: Routledge.
Snow, R., B. O’Connor, D. Jurafsky, and A. Y. Ng (2008). Cheap and fast—but is it good?
Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, Hawaii, pp. 254–63.
Association for Computational Linguistics.
Souvay, G. and J.-M. Pierrel (2009). LGeRM: Lemmatisation des mots en moyen français.
Traitement automatique des langues 50(2), 149–72.
Stefanowitsch, A. (2005). New York, Dayton (Ohio), and the raw frequency fallacy. Corpus Linguistics and Linguistic Theory 1(2), 295–301.
Swadesh, M. (1952). Lexico-statistic dating of prehistoric ethnic contacts: With special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society 96(4), 452–63.
Swadesh, M. (1953). Archeological and linguistic chronology of Indo-European groups.
American Anthropologist 55(3), 349–52.
Tagliamonte, S. A. and R. H. Baayen (2012). Models, forests, and trees of York English: Was/were variation as a case study for statistical practice. Language Variation and Change 24(2), 135–78.
Talmy, L. (2007). Foreword. In M. Gonzalez-Marquez, I. Mittelberg, S. Coulson, and M. J. Spivey
(eds.), Methods in Cognitive Linguistics, pp. xi–xxi. Amsterdam: John Benjamins.
Taylor, A., A. Nurmi, A. Warner, S. Pintzuk, and T. Nevalainen (2006). The York–Helsinki
Parsed Corpus of Early English Correspondence (PCEEC). Technical report, Department
of Linguistics, University of York. Oxford Text Archive.
Taylor, A., A. Warner, S. Pintzuk, and F. Beths (2003). The York–Toronto–Helsinki Parsed Corpus of Old English Prose (YCOE). Technical report, Department of Linguistics, University of
York. Oxford Text Archive.
TEI Consortium (2014). Guidelines for electronic text encoding and interchange. Technical
Report WP2.11, TEI Consortium, http://www.tei-c.org/Guidelines/P5/.
Tekavčić, P. (1972). Grammatica storica dell’italiano, vol. II Morfosintassi; III Lessico. Bologna:
Il Mulino.
Tesnière, L. (1959). Éléments de syntaxe structurale. Paris: Klincksieck.
Tognini-Bonelli, E. (2001). Corpus Linguistics at Work. Amsterdam/Philadelphia: John
Benjamins.
Toth, G. M. (2013). Knowledge and thinking in Renaissance Florence: A computer-assisted
analysis of the diaries and commonplace books of Giovanni Rucellai and his contemporaries.
Ph.D. thesis, University of Oxford.
Tukey, J. W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.
Van der Beek, L., G. Bouma, R. Malouf, and G. van Noord (2002). The Alpino Dependency Treebank. In M. Theune, A. Nijholt, and H. Hondorp (eds.), Proceedings of the Twelfth Meeting of Computational Linguistics in the Netherlands (CLIN 2001), pp. 8–22. Amsterdam: Rodopi.
Van Gompel, R. P. G. and M. J. Pickering (2007). Syntactic parsing. In G. Gaskell (ed.), The
Oxford Handbook of Psycholinguistics, pp. 284–307. Oxford: Oxford University Press.
Venables, W. N. and B. D. Ripley (2002). Modern Applied Statistics with S (4th edn). New York:
Springer.
Vicario, F. (1997). I verbi analitici in friulano. Milano: Franco Angeli.
Vines, T., A. Albert, R. Andrew, F. Débarre, D. Bock, M. Franklin, K. Gilbert, J.-S. Moore,
S. Renaut, and D. Rennison (2014). The availability of research data declines rapidly with
article age. Current Biology 24(1), 94–7.
Vulanović, R. and R. H. Baayen (2007). Fitting the development of periphrastic do in all sentence
types. In P. Grzybek and R. Köhler (eds.), Exact Methods in the Study of Language and Text:
Dedicated to Gabriel Altmann on the Occasion of his 75th Birthday, pp. 679–88. Berlin: Mouton
de Gruyter.
Wallenberg, J., A. K. Ingason, E. F. Sigurðsson, and E. Rögnvaldsson (2011a). Icelandic Parsed
Historical Corpus (IcePaHC). Technical report, Department of Linguistics, University of
Iceland. Online publication.
Wallenberg, J., A. K. Ingason, E. F. Sigurðsson, and E. Rögnvaldsson (2011b). Icelandic Parsed
Historical Corpus (IcePaHC). Version 0.9.
Weisser, M. (2010). Essential Programming for Linguistics. Edinburgh: Edinburgh University
Press.
Williams, A. (2000). Null subjects in Middle English existentials. In S. Pintzuk, G. Tsoulas, and A. Warner (eds.), Diachronic Syntax: Models and Mechanisms, pp. 285–310. Oxford: Oxford University Press.
Zaenen, A., J. Carletta, G. Garretson, J. Bresnan, A. Koontz-Garboden, T. Nikitina,
M. C. O’Connor, and T. Wasow (2004). Animacy encoding in English: Why and how. In
B. Webber and D. K. Byron (eds.), Proceedings of the ACL2004 Workshop on Discourse
Annotation, Volume 17, Barcelona, pp. 118–25. Association for Computational Linguistics.
Zervanou, K. and C. Vertan (eds.) (2014). Proceedings of the 8th Workshop on Language
Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH). Gothenburg,
Sweden: Association for Computational Linguistics.
Zuidema, W. and B. de Boer (2014). Modeling in the language sciences. In R. J. Podesva
and D. Sharma (eds.), Research Methods in Linguistics, pp. 428–45. Cambridge: Cambridge
University Press.
Index
ALPINO Treebank 143, 144
Ancient Greek Dependency Treebank 101,
117, 118, 120, 129, 134, 143, 147, 148, 159
annotation 1, 5, 6, 10, 11, 19, 24, 42, 45, 55, 56,
58, 59, 61, 62, 63, 73, 75, 76, 99, 100, 101,
102, 103, 108, 109, 110, 111, 112, 113, 114,
115, 116, 117, 118, 119, 120, 121, 122, 123,
124, 125, 127, 128, 129, 132, 134, 135, 137,
138, 140, 151, 152, 155, 159, 170, 173, 189,
193, 203, 205
Bayesian phylogenetic trees 207
Bayesian statistics 41
categorical 3, 4, 40, 43, 44, 50, 89, 91, 128,
186, 206
causation 27
chasm model 19, 20, 22, 23, 24, 33, 66, 71,
153, 154
chi-square 28, 64, 94, 95, 177, 178, 179,
180, 181
computational linguistics 3, 16, 89, 103, 112,
115, 133, 138
corpus-based 25, 28, 29, 31, 32, 33, 37, 58, 59,
86, 89, 92, 126, 132, 133, 135, 138, 140, 152,
157, 159, 188
corpus-driven 2, 12, 19, 38, 58, 59, 60, 61,
63, 64, 90, 117, 120, 131, 132, 133, 134, 135,
159, 188
corpus linguistics 2, 6, 19, 21, 37, 50, 58, 72, 77,
78, 80, 82, 92, 95, 96, 97, 99, 101, 105, 107,
121, 123, 124, 128, 132, 139, 177, 179, 206
correlation 13, 27, 52, 62, 67, 77, 156, 168, 169,
172, 173, 178, 179, 180, 182
correspondence analysis 2, 31, 165, 195
data-driven 3, 18, 23, 43, 58, 59, 60, 61, 62, 188
data exploration 43, 59, 60, 63, 165, 189, 194
dependency
annotation 109, 115, 116
grammar 61, 115, 116
tree 115, 116, 122, 144
treebank 11, 42, 109, 116, 122, 123, 126,
129, 134
diachronic linguistics 30, 31, 88, 164, 172
digital humanities 58, 101, 103, 125, 137, 149
empirical 2, 3, 18, 19, 25, 26, 27, 29, 30, 39, 42,
45, 46, 47, 61, 63, 65, 66, 80, 81, 86, 87,
90, 91, 92, 93, 97, 98, 117, 127, 128, 154,
156, 168, 190, 205, 206, 207
English
early modern 43, 86, 108, 113, 123, 129,
192, 194
middle 48, 49, 86, 87, 123, 140, 142, 166,
167, 168, 169, 172, 175, 186, 190, 194
old 11, 48, 49, 67, 86, 87, 121, 123, 167,
169, 176
evidence 1, 2, 3, 6, 8, 9, 10, 11, 12, 13, 14, 16, 18,
25, 26, 28, 31, 36, 37, 38, 39, 40, 43, 44, 45,
46, 47, 48, 49, 50, 51, 52, 55, 56, 57, 58, 59,
60, 63, 64, 67, 68, 70, 71, 80, 82, 84, 88,
89, 90, 92, 93, 97, 98, 99, 121, 124, 127, 132,
137, 153, 155, 156, 167, 168, 188
exploratory data analysis see data
exploration
FrameNet 54, 133
frequency 2, 5, 6, 12, 14, 28, 52, 77, 80, 81, 88,
117, 120, 121, 129, 133, 134, 136, 154, 155,
163, 172, 176, 178, 190, 192, 193, 194, 198,
199, 200, 201, 203, 204, 205
distribution 59, 60, 96
expected 178
raw 12, 92
relative 17, 29, 41, 68, 72, 84, 155, 171
glottochronology 69, 70, 71, 81, 82
gold standard 117, 136
Greek 19, 54, 112, 120, 122
ancient 9, 10, 12, 100, 101, 115, 116, 117, 119,
129, 131, 132, 134, 135, 151
classical 14, 111
historical linguistics 1, 2, 3, 7, 8, 12, 16, 17, 18,
19, 20, 22, 23, 24, 25, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 44, 45, 46, 47, 48,
50, 51, 52, 53, 54, 55, 56, 58, 61, 63, 64, 65,
66, 67, 68, 69, 70, 71, 72, 73, 77, 78, 79,
80, 81, 82, 83, 85, 86, 87, 88, 90, 91, 92, 93,
94, 95, 96, 97, 98, 99, 103, 118, 124, 125,
127, 129, 130, 131, 137, 139, 140, 149, 151,
152, 153, 154, 156, 157, 163, 166, 181, 186,
187, 188, 189, 190, 191, 205, 206, 207
quantitative 1, 7, 24, 36, 37, 38, 44, 45, 46,
50, 51, 52, 53, 94, 99, 101, 130, 153, 156,
168, 187, 188, 189, 190, 191, 206
hypothesis testing 2, 10, 12, 34, 42, 43, 52, 53,
65, 93, 94, 95, 96, 97, 155, 165, 169, 180,
186, 189, 196
Index Thomisticus Treebank 102, 114, 116,
122, 123, 126, 134
language change 6, 7, 8, 33, 34, 37, 38, 43, 44,
61, 63, 64, 66, 70, 71, 83, 87, 88, 89, 90,
93, 98, 119, 127, 137, 138, 140, 142, 156,
157, 166, 167, 168, 186, 187, 190, 191, 192,
204, 205
language resource 53, 54, 55, 56, 57, 58,
60, 100, 127, 130, 131, 135, 143, 144, 151,
152, 159
Latin 9, 10, 11, 12, 13, 14, 42, 54, 62, 63, 64, 67,
90, 100, 103, 105, 109, 110, 111, 112, 114,
115, 116, 117, 118, 122, 123, 125, 126, 127, 131,
132, 133, 134, 135, 150, 157, 158, 159, 163,
164, 165, 181, 206
Dependency Treebank 109, 116, 123,
126, 134
LatinISE 125, 126, 127
lemma 11, 14, 99, 100, 109, 110, 114, 120, 126,
127, 129, 134, 136, 137, 163, 164, 193, 194,
195, 198, 201, 203, 204, 205
lemmatization 99, 102, 112, 114, 115, 121, 122,
127, 136, 138, 193, 194, 205
lexicon 11, 12, 53, 54, 55, 113, 114, 120, 130, 131,
132, 133, 134, 135, 136, 137, 144, 152, 159,
193, 205
linguistic innovation 43, 44, 191
linguistic spread 43, 44, 191
linked data 58, 142, 143, 144, 148, 149
markup 105, 124, 125, 137
metadata 11, 53, 57, 58, 59, 60, 99, 100, 103,
104, 105, 106, 107, 110, 126, 137, 138, 140,
142, 146, 155, 170, 193, 205
model parallelization 6, 91, 157, 205, 207
morphology 36, 37, 43, 44, 81, 82, 190, 193
multivariate
analysis 13, 44, 157
model 205
techniques 12, 31, 34, 52, 65, 157, 162,
163, 164, 165, 166, 168, 181, 186, 187,
188, 189
natural language processing 16, 17, 102, 112,
117, 118, 125, 127, 148
null hypothesis 178
parsing 102, 117, 118, 119, 133, 139
part-of-speech tagging 112, 114, 127, 138,
139, 143
Penn-Helsinki Parsed Corpus of Early
Modern English 108, 169, 192
Penn-Helsinki Parsed Corpus of Middle
English 123, 140, 141
phonology 36, 37, 81
phrase-structure
annotation 108, 173
tree 5, 6, 115
pragmatics 10, 27, 89, 100, 110, 112, 119, 121,
122, 138, 140, 167, 169, 186
Prague Dependency Treebank 122, 123
probabilistic 3, 4, 5, 8, 9, 37, 39, 40, 44, 50, 60, 64, 65, 78, 85, 86, 89, 91, 92, 93, 155, 156, 157, 190, 206, 207
PROIEL Treebank 100, 126
qualitative analysis 14, 59, 91
quantitative analysis 10, 11, 12, 13, 14, 52, 59, 63, 80, 127, 189
regression
linear 34, 44, 76, 160, 162, 164
logistic 164, 181, 182, 183, 184, 186, 197, 199, 200, 201, 202
mixed-effects model 96, 164, 181, 197, 198, 199, 200, 201
model 2, 52, 76, 160, 162, 164, 181, 182, 183, 184, 198, 199, 200, 201, 202
multilevel model 96, 164, 181, 197, 198, 199, 200, 201
reproducibility 51, 53, 54, 55, 56, 129, 135, 156, 170, 206
Resource Description Framework 142, 144, 145, 147, 151
selectional preferences 89
semantics 14, 36, 37, 61, 63, 88, 89, 90, 181
sociolinguistics 58, 71, 119, 121, 137, 140, 142, 167, 168, 169, 172, 175, 186, 187
subcategorization 5, 89, 90, 144
syntactic tree 143, 193
syntax 27, 36, 37, 48, 87, 92, 97, 117, 118, 121, 158, 167
Text Encoding Initiative 120, 124, 125, 137, 138
theory 1, 3, 6, 10, 16, 17, 40, 44, 55, 58, 59, 61, 63, 64, 65, 66, 85, 117, 121, 122, 128, 156, 157, 170
token 87, 107, 108, 111, 112, 120, 122, 125, 126, 134
tokenization 111, 112
tree
dependency 115, 116, 122, 143, 144
phrase-structure 108, 115, 116
treebank 91, 114, 115, 116, 117, 118, 120, 123, 135, 143, 144, 170, 173
TreeTagger 114, 126, 128
trend 11, 12, 42, 43, 50, 51, 127, 155, 157, 158, 172, 194
usage 7, 16, 25, 43, 44, 54, 84, 86, 88, 92, 139, 159, 168
-based 25, 60, 89
valency lexicon 11, 54, 132, 133, 134, 135, 159
visualization 56, 139, 160, 182, 189, 195
WordNet 119
XML 105, 106, 107, 109, 138, 143, 147
OXFORD STUDIES IN DIACHRONIC AND HISTORICAL LINGUISTICS
general editors
Adam Ledgeway and Ian Roberts, University of Cambridge
advisory editors
Cynthia Allen, Australian National University; Ricardo Bermúdez-Otero, University of
Manchester; Theresa Biberauer, University of Cambridge; Charlotte Galves, University of
Campinas; Geoff Horrocks, University of Cambridge; Paul Kiparsky, Stanford University;
Anthony Kroch, University of Pennsylvania; David Lightfoot, Georgetown University; Giuseppe
Longobardi, University of York; George Walkden, University of Konstanz; David Willis,
University of Cambridge
published
1
From Latin to Romance
Morphosyntactic Typology and Change
Adam Ledgeway
2
Parameter Theory and Linguistic Change
Edited by Charlotte Galves, Sonia Cyrino, Ruth Lopes, Filomena Sandalo, and Juanito Avelar
3
Case in Semitic
Roles, Relations, and Reconstruction
Rebecca Hasselbach
4
The Boundaries of Pure Morphology
Diachronic and Synchronic Perspectives
Edited by Silvio Cruschina, Martin Maiden, and John Charles Smith
5
The History of Negation in the Languages of Europe and the Mediterranean
Volume I: Case Studies
Edited by David Willis, Christopher Lucas, and Anne Breitbarth
6
Constructionalization and Constructional Changes
Elizabeth Traugott and Graeme Trousdale
7
Word Order in Old Italian
Cecilia Poletto
8
Diachrony and Dialects
Grammatical Change in the Dialects of Italy
Edited by Paola Benincà, Adam Ledgeway, and Nigel Vincent
9
Discourse and Pragmatic Markers from Latin to the Romance Languages
Edited by Chiara Ghezzi and Piera Molinelli
10
Vowel Length from Latin to Romance
Michele Loporcaro
11
The Evolution of Functional Left Peripheries in Hungarian Syntax
Edited by Katalin É. Kiss
12
Syntactic Reconstruction and Proto-Germanic
George Walkden
13
The History of Low German Negation
Anne Breitbarth
14
Arabic Indefinites, Interrogatives, and Negators
A Linguistic History of Western Dialects
David Wilmsen
15
Syntax over Time
Lexical, Morphological, and Information-Structural Interactions
Edited by Theresa Biberauer and George Walkden
16
Syllable and Segment in Latin
Ranjan Sen
17
Participles in Rigvedic Sanskrit
The Syntax and Semantics of Adjectival Verb Forms
John J. Lowe
18
Verb Movement and Clause Structure in Old Romanian
Virginia Hill and Gabriela Alboiu
19
The Syntax of Old Romanian
Edited by Gabriela Pană Dindelegan
20
Grammaticalization and the Rise of Configurationality in Indo-Aryan
Uta Reinöhl
21
The Rise and Fall of Ergativity in Aramaic
Cycles of Alignment Change
Eleanor Coghill
22
Portuguese Relative Clauses in Synchrony and Diachrony
Adriana Cardoso
23
Micro-change and Macro-change in Diachronic Syntax
Edited by Eric Mathieu and Robert Truswell
24
The Development of Latin Clause Structure
A Study of the Extended Verb Phrase
Lieven Danckaert
25
Transitive Nouns and Adjectives
Evidence from Early Indo-Aryan
John J. Lowe
26
Quantitative Historical Linguistics
A Corpus Framework
Gard B. Jenset and Barbara McGillivray
In preparation
Negation and Nonveridicality in the History of Greek
Katerina Chatzopoulou
Morphological Borrowing
Francesco Gardani
Nominal Expressions and Language Change
From Early Latin to Modern Romance
Giuliana Giusti
The Historical Dialectology of Arabic: Linguistic and Sociolinguistic Approaches
Edited by Clive Holes
A Study in Grammatical Change
The Modern Greek Weak Subject Pronoun τος and
its Implications for Language Change and Structure
Brian D. Joseph
Gender from Latin to Romance
Michele Loporcaro
Reconstructing Pre-Islamic Arabic Dialects
Alexander Magidow
Word Order Change
Edited by Anna Maria Martins and Adriana Cardoso
Grammaticalization from a Typological Perspective
Heiko Narrog and Bernd Heine
Word Order and Parameter Change in Romanian
Alexandru Nicolae
The History of Negation in the Languages of Europe and the Mediterranean
Volume II: Patterns and Processes
Edited by David Willis, Christopher Lucas, and Anne Breitbarth
Verb Second in Medieval Romance
Sam Wolfe
Palatal Sound Change in the Romance Languages
Diachronic and Synchronic Perspectives
André Zampaulo