OUP CORRECTED PROOF – FINAL, 11/7/2017, SPi

Quantitative Historical Linguistics

OXFORD STUDIES IN DIACHRONIC AND HISTORICAL LINGUISTICS

General editors: Adam Ledgeway and Ian Roberts, University of Cambridge

Advisory editors: Cynthia Allen, Australian National University; Ricardo Bermúdez-Otero, University of Manchester; Theresa Biberauer, University of Cambridge; Charlotte Galves, University of Campinas; Geoff Horrocks, University of Cambridge; Paul Kiparsky, Stanford University; Anthony Kroch, University of Pennsylvania; David Lightfoot, Georgetown University; Giuseppe Longobardi, University of York; George Walkden, University of Konstanz; David Willis, University of Cambridge

Recently published in the series:
19 The Syntax of Old Romanian
   Edited by Gabriela Pană Dindelegan
20 Grammaticalization and the Rise of Configurationality in Indo-Aryan
   Uta Reinöhl
21 The Rise and Fall of Ergativity in Aramaic: Cycles of Alignment Change
   Eleanor Coghill
22 Portuguese Relative Clauses in Synchrony and Diachrony
   Adriana Cardoso
23 Micro-change and Macro-change in Diachronic Syntax
   Edited by Eric Mathieu and Robert Truswell
24 The Development of Latin Clause Structure: A Study of the Extended Verb Phrase
   Lieven Danckaert
25 Transitive Nouns and Adjectives: Evidence from Early Indo-Aryan
   John J. Lowe
26 Quantitative Historical Linguistics: A Corpus Framework
   Gard B. Jenset and Barbara McGillivray

For a complete list of titles published and in preparation for the series, see pp. 230–2.

Quantitative Historical Linguistics
A Corpus Framework

Gard B. Jenset and Barbara McGillivray

Great Clarendon Street, Oxford, OX2 6DP, United Kingdom

Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries.

© Gard B. Jenset and Barbara McGillivray 2017

The moral rights of the authors have been asserted

First Edition published in 2017
Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.

You must not circulate this work in any other form and you must impose this same condition on any acquirer.

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America

British Library Cataloguing in Publication Data
Data available

Library of Congress Control Number: 2017933972

ISBN 978–0–19–871817–8

Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY

Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.

Contents

Series preface
List of figures and tables

1 Methodological challenges in historical linguistics
  1.1 Aims of this book
  1.2 Context and motivation
    1.2.1 Empirical methods
    1.2.2 Models in historical linguistics
    1.2.3 A new pace
  1.3 Main claims
    1.3.1 The example-based approach
    1.3.2 The importance of corpus annotation
    1.3.3 Problems with certain quantitative analyses
    1.3.4 Problems with the research process
    1.3.5 Conceptual difficulties
  1.4 Can quantitative historical linguistics cross the chasm?
    1.4.1 Who uses new technology?
    1.4.2 One size does not fit all: the chasm
    1.4.3 Perils of the chasm
  1.5 A historical linguistics meta study
    1.5.1 An empirical baseline
    1.5.2 Quantitative historical research in 2012

2 Foundations of the framework
  2.1 A new framework
    2.1.1 Scope
    2.1.2 Basic assumptions
    2.1.3 Definitions
  2.2 Principles
    2.2.1 Principle 1: Consensus
    2.2.2 Principle 2: Conclusions
    2.2.3 Principle 3: Almost any claim is possible
    2.2.4 Principle 4: Some claims are stronger than others
    2.2.5 Principle 5: Strong claims require strong evidence
    2.2.6 Principle 6: Possibly does not entail probably
    2.2.7 Principle 7: The weakest link
    2.2.8 Principle 8: Spell out quantities
    2.2.9 Principle 9: Trends should be modelled probabilistically
    2.2.10 Principle 10: Corpora are the prime source of quantitative evidence
    2.2.11 Principle 11: The crud factor
    2.2.12 Principle 12: Mind your stats
  2.3 Best practices and research infrastructure
    2.3.1 Divide and conquer: reproducible research
    2.3.2 Language resource standards and collaboration
    2.3.3 Reproducibility in historical linguistics research
    2.3.4 Historical linguistics and other disciplines
  2.4 Data-driven historical linguistics
    2.4.1 Corpus-based, corpus-driven, and data-driven approaches
    2.4.2 Data-driven approaches outside linguistics
    2.4.3 Data and theory
    2.4.4 Combining data and linguistic approaches

3 Corpora and quantitative methods in historical linguistics
  3.1 Introduction
  3.2 Early experiments
  3.3 A bad case of glottochronology
  3.4 The advent of electronic corpora
  3.5 Return of the numbers
  3.6 What’s in a number anyway?
  3.7 The case against numbers in historical linguistics
    3.7.1 Argumentation from convenience
    3.7.2 Argumentation from redundancy
    3.7.3 Argumentation from limitation of scope
    3.7.4 Argumentation from principle
    3.7.5 The pseudoscience argument
  3.8 Summary

4 Historical corpus annotation
  4.1 Content, structure, and context in historical texts
    4.1.1 The value of annotation
    4.1.2 Annotation and historical corpora
    4.1.3 Ways to annotate a historical corpus
  4.2 Annotation in practice
  4.3 Adding linguistic annotation to texts
    4.3.1 Annotation formats
    4.3.2 Levels of linguistic annotation
    4.3.3 Annotation schemes and standards
  4.4 Case study: a large-scale Latin corpus
  4.5 Challenges of historical corpus annotation

5 (Re)using resources for historical languages
  5.1 Historical languages and language resources
    5.1.1 Corpora and language resources
    5.1.2 Corpus-based and corpus-driven lexicons
  5.2 Beyond language resources
  5.3 Linking historical (language) data
    5.3.1 Linked data
    5.3.2 An example from the ALPINO Treebank
    5.3.3 Linked historical data
  5.4 Future directions

6 The role of numbers in historical linguistics
  6.1 The benefits of quantitative historical linguistics
    6.1.1 Reaching across to the majority
    6.1.2 The benefits of corpora
    6.1.3 The benefits of quantitative methods
    6.1.4 Numbers and the aims of historical linguistics
  6.2 Tackling complexity with multivariate techniques
  6.3 The rise of existential there in Middle English
    6.3.1 Data
    6.3.2 Exploration
    6.3.3 The choice of statistical technique
    6.3.4 Quantitative modelling
    6.3.5 Summary

7 A new methodology for quantitative historical linguistics
  7.1 The methodological framework
  7.2 Core steps of the research process
  7.3 Case study: verb morphology in early modern English
    7.3.1 Data
    7.3.2 Exploration
    7.3.3 The models
    7.3.4 Discussion
  7.4 Concluding remarks

References
Index

Series preface

Modern diachronic linguistics has important contacts with other subdisciplines, notably first-language acquisition, learnability theory, computational linguistics, sociolinguistics, and the traditional philological study of texts. It is now recognized in the wider field that diachronic linguistics can make a novel contribution to linguistic theory, to historical linguistics, and arguably to cognitive science more widely.
This series provides a forum for work in both diachronic and historical linguistics, including work on change in grammar, sound, and meaning within and across languages; synchronic studies of languages in the past; and descriptive histories of one or more languages. It is intended to reflect and encourage the links between these subjects and fields such as those mentioned above.

The goal of the series is to publish high-quality monographs and collections of papers in diachronic linguistics generally, i.e. studies focusing on change in linguistic structure, and/or change in grammars, which are also intended to make a contribution to linguistic theory, by developing and adopting a current theoretical model, by raising wider questions concerning the nature of language change, or by developing theoretical connections with other areas of linguistics and cognitive science as listed above. There is no bias towards a particular language or language family, or towards a particular theoretical framework; work in all theoretical frameworks, and work based on the descriptive tradition of language typology, as well as quantitatively based work using theoretical ideas, also feature in the series.
Adam Ledgeway and Ian Roberts
University of Cambridge

List of figures and tables

Figures

1.1 Technology adoption life cycle modelled as a normal distribution (Moore, 1991)
1.2 Proportions of empirical studies appearing in Language (1960–2011)
1.3 MCA plot of the journals considered for the meta study and their attributes
1.4 The number of observations for various quantitative techniques in the selected studies, for LVC and other journals
2.1 Main elements of our framework for quantitative historical linguistics
3.1 Illustration of Moore’s law with selected corpora plotted on a base 10 logarithmic scale
3.2 Sizes of some selected corpora plotted on a base 10 logarithmic scale, over time
3.3 Log-linear regression model showing the relationship between the growth in computing power and the growth in corpus size for some selected corpora
3.4 Relative frequencies of linguistics terms per 1,000 instances of the word linguistics in the twentieth century, taken from the BYU Google Corpus
4.1 Phrase-structure tree (left) and dependency tree (right) for Example (2)
4.2 The dependency tree of Example (3) from the Latin Dependency Treebank
5.1 Lexical entry for the verb impono from the lexicon for the Latin Dependency Treebank
5.2 Page containing information about the text of Chaucer’s Parson’s Tale from the Penn–Helsinki Parsed Corpus of Middle English
5.3 Part of the entry for Adriatic Sea in Pleiades
6.1 Geometric representation of Table 6.1 in a two-dimensional Cartesian space
6.2 Line that best fits the four points in Figure 6.1
6.3 Plot from MCA on the variables ‘construction’, ‘era’, ‘preverb’, ‘sp’, and ‘class’
6.4 Graph showing the shift in relative frequencies of existential there and empty existential subjects during the Middle English period
6.5 Distribution of V1 and V2 word-order patterns
6.6 Box-and-whiskers plot of conditional probabilities of elements following existential there and empty existential subjects
6.7 Box-and-whiskers plot of the maximum degree of embedded (phrase-structure) elements for sentences with there and empty existential subjects
6.8 Maximum degree of embedding for all sentences in the sample over time, with added non-parametric regression line
6.9 Bar plot of counts of existential subjects by genre
6.10 Bar plot of counts of existential subjects by dialect
6.11 Binned residuals plot of the logistic regression model, indicating acceptable fit to the data
7.1 Plot showing the shifting probabilities over time between -(e)s and -(e)th in the context of third person singular present tense verbs
7.2 Plots of the trends of lemma frequency over time for verb forms occurring with -(e)s and -(e)th
7.3 MCA plot of suffix, corpus sub-period, gender, and phonological context
7.4 Binned residuals plot for the mixed-effects logistic regression model described in Example (2)
7.5 Binned residuals plot for the mixed-effects logistic regression model described in Example (3)
7.6 Binned residuals plot for the mixed-effects logistic regression model described in Example (4)

Tables

1.1 Classification of sample papers according to whether they are corpus-based/quantitative
1.2 Classification of papers from Language (2012) according to whether they are corpus-based/quantitative
1.3 Confidence intervals (95%) for the percentage of quantitative papers in Language (2012) and the historical sample
1.4 Classification of sampled papers according to whether they are corpus-based/quantitative (excluding LVC)
4.1 The first four lines of Virgil’s Aeneid in tabular format, where each row corresponds to a line
4.2 Example of bibliographical information on a hypothetical collection of texts in tabular format
4.3 Example of metadata and linguistic information encoded for the first three word tokens of Virgil’s Aeneid
6.1 Example of a data set recording the century of the texts in which prefixed verbs were observed, and the proportion of their spatial arguments expressed as a PP out of all their spatial arguments
6.2 Subset of data frame used for study on Latin preverbs in McGillivray (2013)
6.3 Frequencies of there1 and Ø according to dialect in Middle English
6.4 Frequencies of there1 and Ø according to genre in Middle English
6.5 Coefficients for the binary logistic regression model showing the log odds ratio for switching from there1 to Ø
7.1 Part of the metadata extracted from the PPCEME documentation
7.2 Part of the data extracted from PPCEME
7.3 Frequencies of verb tokens in the sample from texts produced by female and male writers, broken down by corpus sub-period
7.4 Summary of fixed effects from the mixed-effects logistic regression model for E2 described in Example (3)
7.5 Summary of predictors from the mixed-effects logistic regression model for E3 described in Example (4)

1 Methodological challenges in historical linguistics

1.1 Aims of this book

The principal aims of this book are to introduce the framework for quantitative historical linguistics, and to provide some examples of how this framework can be applied in research. Ours is a framework and not a ‘theory’ in any of the senses commonly used in historical linguistics. For example, we do not take a position in favour of a particular formalism for corpus annotation, nor do we offer answers to metaphysical questions such as ‘what is language?’ or ‘how is it learned?’.
What we are interested in is how corpus data can be employed to gather evidence that we can analyse quantitatively to model various phenomena in historical linguistics. To this end, we set out principles for the research process as a whole. Ultimately, the aim of quantitative historical linguistics is to make it easier to settle disputes in historical linguistics by means of quantitative corpus evidence, and thereby to advance the field as a whole. The more concrete desirable outcome is the increased use of quantitative corpus methods in historical linguistics through the adoption of a systematic methodological framework.

Because the present book is about methodology in historical linguistics, it does not primarily explain specific, individual methods, but discusses the relationship between corpus data, aims, methods, and ways of doing research in historical linguistics. More specifically, given some desirable outcomes, we discuss the necessary steps that should be taken to achieve those outcomes (Andersen and Hepburn, 2015). There are three focal points in this discussion:

(i) Why should historical linguistics adopt quantitative corpus methods to a larger extent?
(ii) What are the obstacles to a more widespread adoption of such methods?
(iii) How ought such methods to be used in historical linguistics?

The first two points are addressed in the present chapter and in the next one, and set out the context for the original contribution of this publication; the last point is the focus of our framework, and is dealt with throughout the book.
1.2 Context and motivation

From what we have said so far it should be clear that this book is not an introduction to corpus linguistics, nor is it an introduction to quantitative techniques. There are already some very good introductions to corpus linguistics in print, such as McEnery and Wilson (2001), Gries (2009b), and McEnery and Hardie (2012). There are also good books introducing quantitative techniques to linguists, including Baayen (2008) and Johnson (2008). Our position is that core corpus linguistics concepts such as collocations, concordances, and frequency lists can be taught without necessarily referring to historical data and still be transposed to historical linguistics. Likewise, statistical techniques such as null-hypothesis testing, regression modelling, or correspondence analysis (CA) can be explained and illustrated just as well with synchronic data as with historical data. So if corpus linguistics and quantitative techniques can be taught without specific reference to historical linguistics, is there a need for a quantitative corpus methodology for historical linguistics? We believe there is, as we explain here.

Historical linguistics is a highly data-centric endeavour, as Labov (1972, 100) observed when he described historical linguistics as making the best use of ‘bad data’, i.e. imperfect pieces of evidence riddled with gaps. We also agree with Rydén (1980, 38) that the ‘study of the past [. . .] must be basically empirical’, and with Fischer (2004, 57) that ‘[t]he historical linguist has only one firm knowledge base and that is the historical documents’. Moreover, we subscribe to what Penke and Rosenbach (2007b, 1) write: ‘nowadays most linguists will probably agree that linguistics is indeed an empirical science’; the thorny questions are instead what kind of evidence ought to be used, and how it ought to be used.
In spite of this high-level awareness of historical linguistics as data-focused, quantitative corpus methods are still underused, and often misused, in historical linguistics, and an overarching methodological structure within which to place such methods is missing, as we illustrate in sections 1.3 and 1.5. We believe that the question of what it means for historical linguistics to be empirical (in the corpus-driven quantitative sense that we define in our framework) is much less clear, as Penke and Rosenbach (2007b) acknowledge is also the case for linguistics in general. Given the additional challenges posed by the special nature of historical language data, the concern with methodological development should certainly be no less pressing in historical linguistics than in other linguistic disciplines. Therefore, the most pressing gap to fill is not a book introducing corpus methods or statistical techniques to historical linguists, but a book that tackles what it means to be empirical in historical linguistics research, and how to go about doing it. That is precisely what we want to achieve with the present book.

1.2.1 Empirical methods

The term ‘empirical’ is of use to us to the extent that practices covered by it will improve the precision of professional linguistic communication regarding data, evidence, hypotheses, and quantitative models in historical linguistics. Penke and Rosenbach (2007b, 3–9) show how the term ‘empirical’ is used to mean very different things in linguistics, including testing (i.e. attempting to falsify) hypotheses, rational enquiry by means of counter-evidence, and data-driven approaches that may rely on qualitative or quantitative evidence.
We agree with Penke and Rosenbach (2007b, 4–5) that a strictly Popperian falsificationist definition of empirical research (with the requirement that it collects data that can falsify a hypothesis or theory) is problematic, since it quickly runs into grey areas of the kind ‘exactly how many counter-examples does it take to falsify the hypothesis?’. Instead, we argue that a distinction conceptualized as a probabilistic continuum, where individual pieces of evidence can increase or reduce support for a given hypothesis, is more useful. Such a probabilistic approach is transparent to the extent that the data forming the basis for the continuum are objectively verifiable. For the same reason, we consider approaches based exclusively on intuitions about acceptability or grammaticality to be less useful, since what constitutes sound empirical proof of grammaticality is subject to individual judgements that vary greatly. For the purposes of the present book, then, being ‘empirical’ in historical linguistics is a matter of transparency and objective verifiability. This is related to the point made by Geeraerts (2006), who argues that empirical methods are needed to decide between competing conclusions in linguistics.

The ideal of transparency and objectivity can in principle be approached by means of either a categorical argument or a probabilistic one. In their discussion of how to set up a linguistic argument, Beavers and Sells (2014) point out that at the end of the day linguistic argumentation is about classification: is item x an instance of morpheme/phoneme/construction/etc. y or of some other morpheme/phoneme/construction/etc. z? This is a prime example of categorical argumentation. In categorical terms, an item x cannot partially belong to a class, or belong to it to some degree. This contrasts with a probabilistic approach, where arguments based on probabilities derived from e.g.
corpus frequencies can be used to establish a graded classification scheme whereby x is an instance of y with a given probability. Probabilistic approaches have become increasingly popular, especially in the computational linguistics community; for example, Lau et al. (2015) describe unsupervised language models for predicting human acceptability judgements in a probabilistic way, and argue: ‘it is reasonable to suggest that humans represent linguistic knowledge as a probabilistic, rather than as a binary system. Probability distributions provide a natural explanation of the gradience that characterises acceptability judgements. Gradience is intrinsic to probability distributions, and to the acceptability scores that we derive from these distributions’ (Lau et al., 2015, 1619).

The opposition between categorical and probabilistic approaches corresponds in many ways to the distinction between a classical category structure based on necessary and sufficient features, and a category structure based on degrees of resemblance to a prototype, as discussed in Croft and Cruse (2004, 76–81). To some extent, a qualitative approach is necessary, especially when dealing with the clear, central cases. For instance, the morphemes in or at can clearly function as prepositions. However, marginal cases such as concerning, regarding, following, and given are more difficult to place. Should they be considered prepositions in some cases, or not? Or only to some degree? A probabilistic approach might answer the question differently, by stating that some morphemes occur more often than others in certain grammatical contexts, allowing us to establish a probabilistic membership of the class. It should be clear that such a probabilistic approach to description and classification does not (and is not intended to) completely do away with qualitative linguistic judgements.
For instance, how do we decide what counts as a grammatical context? A strictly probabilistic approach might run the risk of descending into an infinite regress of probability estimates that rely on other probability estimates, without any clear starting point for practical investigation of the phenomena of interest. Therefore, we are content to take as axiomatic certain statements about language and the conceptual framework for analysing language. At first glance, this might seem like a half-way solution at best; at worst it may suggest that quantitative methods are a form of freeloading. Or, as Campbell (2013, 484) phrases it, quantitative methods appear ‘to involve methods that depend on the results of the prior application of linguistic methods, made to masquerade as numbers and algorithms’. However, this view is far too negative, and grossly overstates the differences between quantitative and qualitative models in linguistics, as we will explain in the next section.

1.2.2 Models in historical linguistics

A model is a representation, and any kind of linguistics is about creating models. Zuidema and de Boer (2014) argue that, although all kinds of linguistics involve modelling of some kind, the nature of the models differs. The model might be a representation of a genealogical relationship between languages, or it might represent a particular part of a grammatical system. Zuidema and de Boer (2014) discuss four main types: symbolic models, statistical models, memory-based models, and connectionist models. We will only discuss the first two here, since they are of particular interest in the context of our framework. The key differences between symbolic models and statistical models lie in how they deal with variation and complexity.
In a symbolic model, such as the phrase-structure tree in Example (1), no reference is typically made to how many times the individual parts occur together in a corpus. The model operates with discrete, qualitative categories (such as S, VP, and NP), and provides the rules to connect the categories in specific ways.

(1) [S [NP Trees] [VP [V are] [NP symbolic models]]]

As Zuidema and de Boer (2014) point out, such symbolic models tend to be vulnerable to linguistic variation and performance factors. A statistical model, on the other hand, is crucially reliant on quantitative information about how often combinations of words, categories, or features are found. Since statistical models by default assume a certain amount of variation in the data, they are very well equipped to deal with variation, and they are uniquely able to disentangle very complex patterns of probabilistic dependence between categories or features. This is particularly suited to the case of corpus data, which always contain a frequency or quantitative dimension. However, a purely statistical model may struggle with other types of complexity; Zuidema and de Boer (2014) mention long-distance syntactic dependencies as one example. As they point out, when the two types of models are combined, they can complement each other by allowing a probabilistic analysis that builds on the symbolic model. Manning (2003) discusses one way to build the statistical modelling on the symbolic model.
For instance, rather than adhering to a hard distinction between different argument patterns for verbs, Manning (2003, 303) gives the example of representing the different subcategorization patterns for the English verb retire as probabilities like this:

(2) P(NP[obj] | V = retire) = 0.52
    P(PP[from] | V = retire) = 0.05

The annotation expresses that with the verb retire there is a probability of 0.52 of encountering an NP functioning as an object, and a probability of 0.05 of encountering a PP headed by the preposition from; this way, we do not have to choose only one option for the argument patterns of this verb. This annotation keeps the same symbolic (or qualitative) categories as the one above (NP, V, PP), but uses probabilities to encode the relations between them.

Alternatively, the statistical modelling may take the form of a statistical analysis of frequency information derived from a collection of symbolic models, without the intention of feeding the probabilities back into the grammatical model, as in Example (2). A typical instance of this approach is the statistical analysis of annotated corpora that are enriched with part-of-speech information or syntactic annotation, in order to draw conclusions about usage, grammar, or language change.

Clearly, the scepticism expressed by Campbell (2013, 484) about quantitative models being qualitative models ‘masquerading as numbers’ is not warranted. On the contrary, investigating the same phenomenon by means of different types of models (what Zuidema and de Boer (2014) call ‘model parallelization’) can lead to rich new insights that combine the best qualities of both types of models. Thus, there is no real opposition between qualitative (or symbolic) models and quantitative models. The real question is how to achieve this in practice, as we discuss next.
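Conditional probabilities of the kind shown in Example (2) are simply relative frequencies computed over annotated corpus data. The following sketch illustrates the estimation; the verb–dependent pairs are invented toy data (they are not Manning's counts, nor drawn from any real corpus):

```python
from collections import Counter

# Hypothetical observations extracted from an annotated corpus:
# (verb lemma, observed argument pattern) pairs. Invented for illustration.
observations = [
    ("retire", "NP_obj"), ("retire", "PP_from"), ("retire", "none"),
    ("retire", "NP_obj"), ("retire", "none"), ("retire", "NP_obj"),
    ("leave", "NP_obj"), ("leave", "none"),
]

def subcat_probabilities(pairs, verb):
    """Relative-frequency estimate of P(pattern | verb)."""
    counts = Counter(dep for v, dep in pairs if v == verb)
    total = sum(counts.values())
    return {dep: n / total for dep, n in counts.items()}

probs = subcat_probabilities(observations, "retire")
# For the toy data above: P(NP_obj | retire) = 3/6 = 0.5
```

In a real study the pairs would of course be extracted from a syntactically annotated corpus such as a treebank, rather than listed by hand.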
1.2.3 A new pace

Although there certainly are concrete challenges in building corpora and adopting specific quantitative methods, we believe that the main obstacle is not concrete. In a discussion of the French eighteenth-century scholar Pierre Louis Maupertuis (who formulated a theory stating that material particles from the seminal fluids of both the mother and the father were responsible for forming the foetus), Gould (1985, 151) makes the following observation:

We often think, naïvely, that missing data are the primary impediments to intellectual progress—just find the right facts and all problems will dissipate. But barriers are often deeper and more abstract in thought. We must have access to the right metaphor, not only to the requisite information. Revolutionary thinkers are not, primarily, gatherers of facts, but weavers of new intellectual structures. Ultimately, Maupertuis failed because his age had not yet developed a dominant metaphor of our own time—coded instructions as the precursor to material complexity.

This quote very effectively stresses the important role of metaphors in preparing the ground for true innovations in a field. Returning to historical linguistics, we believe that the availability of historical corpora and statistical techniques alone is insufficient to achieve the methodological shift that we propose here. What is required is just as much a conceptual change of pace, whereby linguistic problems are reformulated as complex interplays of factors that can be addressed quantitatively by means of corpus data. Such a reconceptualization has a knock-on effect on what we consider as data and evidence, as well as on the status of theoretical concepts. This is why the present treatment goes beyond a collection of best practices for doing historical corpus linguistics, although such advice is also discussed both in the present and in the subsequent chapters.
Some might argue that the change we are discussing here is unnecessary, since historical linguistics is already making use of corpora and quantitative techniques. After all, Labov (1972) commended historical linguists for what he considered their superior methodological rigour compared to synchronic linguists. It might also be argued that the change is already well under way, and that corpus methods and quantitative techniques are becoming more important in the historical linguist’s toolbox. Hilpert and Gries (2009) state that large corpora are increasingly being used in historical linguistics; and with growing corpus size comes the need for statistical techniques to handle large and complex data.

The first question is an empirical question about the present: to what extent are historical linguists already using quantitative techniques and corpus methods, and are they using them more or less than some relevant level of comparison? This is a question we return to in more detail in Chapter 3, along with a discussion of how quantitative methods have been used in historical linguistics previously. The second argument, that the change we are advocating is already well under way, is more subtle, since it is in fact a prediction. It assumes that we can observe some changes and that those changes will continue until their natural completion. However, as with any prediction, the result is only as good as the assumptions it builds on. In this case, the assumption that the adoption of a specific set of technologies (corpus methods and quantitative techniques) will continue at the present rate may not be warranted. In section 1.4 we discuss some of the dynamics involved in the adoption of new technologies, which we will argue also apply in the case of quantitative historical linguistics.
Of course, the conceptual difficulties should not completely overshadow the practical obstacles involved in doing quantitative historical linguistics. However, the distinction can sometimes be hard to draw. This is the reason for our efforts in compiling a proper methodology which constitutes a framework within which to discuss these matters. Specifically, sections 2.1, 2.2, and 2.3 set out a series of definitions, principles, and best practices for quantitative historical linguistics. With the fundamentals we set out acting as a common ground, the impetus for solving the practical obstacles becomes all the stronger. In summary, there is a real need for a methodological treatment of quantitative corpus methodology in historical linguistics that sketches the place of such methods in the broader historical linguistics landscape, and that offers a link between the more conceptual level and the concrete computational and quantitative techniques taught in general courses for linguists. The present book takes on this challenge by first acknowledging the conceptual hurdles represented by the required shift in thinking as much as in doing. In the spirit of Gould (1985), we take seriously the need for appropriate metaphors to help make concrete the changes involved.

In addition to the metaphor already mentioned above, namely seeing the spread of quantitative corpus methods in historical linguistics as analogous to a technology adoption process (further discussed in section 1.4), we see the following as some of the governing metaphors of the approach we propose here.
We do not claim uniqueness or novelty in conceptualizing language via the metaphors below, but we do consider them to be central to our approach:

• language change phenomena as outcomes modelled by a set of predictors;
• language data as multidimensional;
• historical linguistics as a humanistic field that not only analyses the particular, but also looks for patterns and extends these to include probabilistic patterns.

This chapter will add more meat to the bones of the suggested methodological approach. However, before that is discussed, the next section will elaborate on some of the main claims involved in our argument.

Main claims

This section highlights the methodological gaps in historical linguistics and how our proposal addresses them.

The example-based approach

As shown by the evidence we have collected, which we will illustrate in section 1.5.2, historical linguistics generally does not make full use of corpora. This is not to say that research in historical linguistics disregards primary sources of evidence, nor that exceptions to this statement are not becoming more numerous. However, historical linguistics is still far from considering corpora the default or preferred source of evidence. Not using corpora is justified in a limited number of circumstances. In some cases, for example, the only evidence sources for a historical language are so limited that it is not possible to build a corpus; examples include languages not attested in written form (like Proto-Indo-European), or languages for which we only have access to an extremely limited number of fragments. Apart from such particular instances, corpora should be built for historical languages and diachronic phenomena, when they are not already available, and should be used as an integral part of the research process.
In the literature review reported on in section 1.5.2 we will observe that the proportion of historical linguistics research articles employing corpus data is lower than in general linguistics. When texts or corpora are the source of evidence, we often enter the realm of example-based approaches. Example-based approaches do not aim at an exhaustive account of the data and can be suitable for showing whether or not a particular construction or form is attested, which is in line with a qualitative view of language. However, if we want to quantify how much a particular form or construction is used, we need to resort to a larger pool of data that have been collected systematically. As we discuss in Chapter 6, such a quantitative approach may (but need not) be coupled with a deeper view of language as inherently probabilistic. In any case, the example-based approach is not appropriate as a basis for probabilistic conclusions about language, and comes with a full range of problems, which we discuss in this section. Let us consider Rovai (2012) as a methodological case study; this article is very clear and detailed, and we will use it as an illustration of the example-based methodology. The paper analyses Latin gender doublets, i.e. those nouns that occur as both neuter and masculine nouns. To support his statements, the author lists illustrative examples (97–100), such as:

Corium ‘skin’ is currently attested as a thematic neuter at all stages of the languages, but in Plautus’ plays (e.g. Poen. 139: ∼197 bc) and in Varro’s Menippeae (Men. 135: 80–60 bc) there also occurs the masculine gender

The examples are taken from a canonical body of texts, whose critical editions are listed in the bibliography. However, it is not clear how the author selected the examples provided.
This is all the more important when the examples reported in the research publication are not meant for illustration purposes only, but are the object of the analysis itself. We do not know whether the author omitted occurrences that contradict the hypothesis, which brings with it the risk of so-called ‘confirmation bias’ (see Risen and Gilovich, 2007, 110–30 and Kroeber and Chrétien, 1937). Generally, the lack of transparency about the selection criteria for the examples presented has negative implications for the replicability of the studies. If another researcher were to go through the same texts, due to the lack of clear selection criteria, he or she would probably choose a different set of examples, and potentially reach different conclusions. When the examples constitute the main basis of the argumentation, and no more details about the rest of the evidence are given, the research conclusions themselves may rest on unstable ground. Another problem with the example-based approach is that it limits the range of questions that can be addressed in the research task. By not explicitly stating the total number of words or instances from which the examples were drawn, this approach cannot give a good sense of the quantitative weight of the phenomena illustrated, and cannot support quantitative generalizations beyond the examples given. The role of examples is limited to providing evidence that a linguistic phenomenon is or is not attested. Questions relating to the variation in the data, like ‘How many times is corium attested as a thematic neuter?’ or ‘How many times does corium occur as a masculine noun in Plautus and Varro?’, cannot be answered by an example-based methodology. Another case of the example-based approach is Bentein (2012).
The main object of study here is the function of the periphrastic perfect in Ancient Greek according to period and to the discourse primitives theorized in mental spaces theory. His evidence base consists of 784 examples taken from previous studies. Although the author says that ‘[t]aken together, these studies comprise a large part of Ancient Greek literature, both prose and poetry’ (175), it is not clear which texts he did not analyse, nor from how many instances he selected his examples, which makes it impossible to place the data into their correct quantitative context. Further issues arise from the example-based approach. As we will see more extensively in Chapter 6, the example-based approach does not allow for a quantitative analysis, or, when it does, it typically has too few data points to obtain statistically significant results and large enough effects. This is accompanied by a lack of formal hypothesis testing, as we will argue further in section 1.3.3. Moreover, analyses from example-based studies are not easily reproducible. Negative evidence for an argument is as critical as positive evidence: which factors were considered, which ones were found to be important, and which ones were not? Also, this approach allows the researcher to perform the analysis of the published examples on an ad hoc basis, according to criteria that vary depending on the specific examples being analysed. One example may be used to show the relevance of a particular feature (say, animacy for word order) and another to demonstrate another feature (say, the case of the object), but we are not given a full overview of all the relevant features for all examples. This is what we call the practice of ‘post hoc analysis’, and we will explain it further in section 1.3.3.
The importance of corpus annotation

García García (2000, 121) says: ‘[a]n exhaustive analysis of any linguistic issue in a corpus language should be based, ideally, on a study of all the available texts in that language or that period of the language [. . .] This is obviously a task that exceeds the possibilities of any individual. Therefore, any feasible study must necessarily be based on a limited and therefore incomplete corpus.’ This claim is justified only if we assume that the data need to be collected and analysed manually, which, as we discuss in this section, is not the only way. When corpora are available and when the phenomena studied fall into the scope discussed in section 2.1.1, corpora should constitute the source of linguistic data, and larger corpora should be preferred to smaller ones, all other things being equal. Fortunately, it is not necessary to analyse all corpus data manually if the corpus has been annotated. Let us consider the example of a study on word order in Latin. Word-order change is a complex phenomenon in which morphological, syntactic, semantic, and pragmatic factors play a role. Let us assume that our study focuses on morphosyntactic aspects of word-order change. For this purpose, a morphosyntactically annotated corpus (treebank) is the ideal evidence source (for an illustration of treebanks, see section 4.3.1). The example-based approach would imply analysing a set of texts to identify, for example, the different word-order patterns used (SVO, OVS, etc.). Instead, Passarotti et al. (2015) systematically used the data from the Latin Dependency Treebank and the Index Thomisticus Treebank (via the Latin Dependency Treebank Valency Lexicon and the Index Thomisticus Treebank Valency Lexicon; see McGillivray 2013, 31–60) to automatically retrieve such patterns, together with the metadata about the authors of the texts in which each pattern was observed.
After the phase of data extraction from the corpus sources, the authors carried out a quantitative analysis of the distribution of every word-order pattern by author, identifying a trend with both a diachronic component and a genre component. Passarotti et al. (2015) kept the phase of data collection and the phase of data analysis completely separate, as the data were collected from corpora that had been annotated by independent research projects. This has the advantage of eliminating the bias that could arise when a researcher aims at proving a particular theoretical statement and may unconsciously select examples that support it. Because the authors conducted a systematic analysis of all available corpus data from the treebanks, there was no option to analyse only specific examples. Also, the presence of the annotation meant that they could use a much larger evidence base than they could have if they had had to manually analyse every instance. If we search a corpus that has not been annotated, our search may have low precision, because we may find a large number of irrelevant instances. Imagine that we are interested in the uses of the English determiner that. If we search a corpus for the string ‘that’, we will find a high number of occurrences of conjunctions. If we have a corpus annotated by part of speech, however, we can limit our searches to include only determiners whose lemma is ‘that’ and avoid a very time-consuming manual post-selection. Another risk in using an unannotated corpus concerns low recall. Imagine that we want to identify relative clauses not introduced by relative pronouns, as in the train they took was delayed. A corpus annotated for clause type would make it easy for us to obtain those instances; conversely, if the corpus does not have this kind of annotation, it is very difficult to find the relevant patterns.
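The precision point can be made concrete with a toy example. The sketch below is a hypothetical illustration, not drawn from any actual corpus; the tags loosely follow the Penn Treebank convention, in which DT marks determiners and IN marks subordinating conjunctions. A plain string search retrieves every occurrence of that, while a search over the part-of-speech annotation keeps only the determiner uses:

```python
# A tiny hand-annotated "corpus": (token, part-of-speech) pairs.
# DT = determiner, IN = subordinating conjunction (Penn Treebank tags).
corpus = [
    ("I", "PRP"), ("know", "VBP"), ("that", "IN"),
    ("that", "DT"), ("book", "NN"), ("is", "VBZ"), ("good", "JJ"),
    ("that", "DT"), ("man", "NN"), ("said", "VBD"),
    ("that", "IN"), ("we", "PRP"), ("left", "VBD"),
]

# String search: high recall for the word form, low precision
# for the determiner (conjunction uses are included).
string_hits = [(tok, pos) for tok, pos in corpus if tok.lower() == "that"]

# Annotation-aware search: only determiner uses of 'that'.
determiner_hits = [(tok, pos) for tok, pos in string_hits if pos == "DT"]

print(len(string_hits))      # 4 occurrences of the string 'that'
print(len(determiner_hits))  # 2 of which are determiners
```

On a real annotated corpus the same filter would run over millions of tokens, which is precisely what makes the manual post-selection step unnecessary.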
Another advantage of using annotated corpora has to do with the research methodology, particularly the distinction between the annotation phase and the analysis phase, and the relationship between annotation and the linguistic theoretical framework, as we discuss more extensively in section 2.4.3. Let us consider the case of verb phrases in Old English. The York–Helsinki Parsed Corpus of Old English (Pintzuk and Plug, 2002) annotates a number of linguistic features, but does not annotate verb phrases (VPs) specifically, for a number of reasons, including the fact that the boundaries of VPs in Old English are still disputed. If we were interested in using this corpus to further the research on VPs in Old English, we could use the annotation of the corpus to investigate the elements that define VPs, so as to obtain a corpus-based distributional definition of VPs. This way, the corpus analysis could support the definition of VPs themselves, thus leading to empirically informed theoretical statements.

Problems with certain quantitative analyses

We have talked about the problems with manually collecting and analysing examples for a study in historical linguistics. Even when a corpus is used, there can be methodological problems in the subsequent phase, the analysis of the data. This book addresses this point in detail in Chapter 6. Here we will summarize two main aspects: the use of raw frequency counts and the practice of what we call ‘post hoc analysis’.
Letting numbers speak for themselves

In the literature review we present in section 1.5 we will see that, in the cases where quantitative evidence is used in the historical linguistics publications we examined, there is large variability in the statistical techniques used, ranging from simple interpretation of raw frequency counts or percentages, to null-hypothesis testing and multivariate statistical techniques. This highlights a lack of standardization and best practices concerning which techniques are best suited to the particular phenomenon at hand, and we will cover this in more detail in Chapter 6. Here, we will focus on the problems caused by the practice of using raw frequencies and ‘letting the numbers speak for themselves’. Let us consider the example of Bentein (2012, 186–7). After introducing the previous literature and his theoretical framework, and after analysing a series of examples, the author introduces some quantitative data in the form of frequency counts of Ancient Greek periphrastic perfect forms, broken down by author and person/number features. He uses the raw frequency counts to argue for ‘a general increase of the periphrastic perfect’, which ‘must have been—at least partially—morpho-phonologically motivated’. The frequency data are presented as follows: ‘almost all examples occur with the 3sg/pl’. It is not at all clear how such a diachronic trend was detected, since the frequency counts presented do not even follow a monotonic distribution; moreover, the author gives no indication of the size of those counts relative to the overall amount of data available for each author, making it impossible to assess the raw frequencies in any meaningful way. As for the predominance of the third person singular or plural, the statement seems to be based purely on the raw frequencies as well. In other words, letting the raw frequencies ‘speak for themselves’ is problematic, as we further explain below.
Let us take the example of McGillivray (2013, 57), who collected the frequencies of the Latin word-order patterns VO and OV in two corpus-driven lexicons, one based on classical Latin authors and one based on St Thomas’s and St Jerome’s texts. OV has a higher frequency than VO in the classical data set (152 vs 52 occurrences), and VO is more frequent than OV in the later-age data set (107 vs 38). A simple inspection of the raw frequencies would lead us to conclude that OV is preferred by the classical authors and VO by the later authors. However, we may not have enough data to exclude that the differences are due to chance. The rational way to answer this question is by performing a statistical significance test. The author used Pearson’s chi-square test (illustrated in section 6.3.3) and found a significant result (p < 0.01, χ²(1) = 77.79), which points to a difference between the two groups of authors in their choice of word-order pattern; more precisely, the probability of finding those frequencies under the assumption that the two variables (author group and word-order pattern) are independent is less than 1 per cent. After we have established whether the association between the two variables is statistically significant, it is important to consider the size of the detected difference, the effect size, since high frequencies tend to magnify small deviations (Mosteller, 1968). Effect sizes provide a standardized measure of how large the detected difference is. In the case of the Latin word-order patterns mentioned above, the author found a large effect size (φ = 0.474), which justifies the conclusion that the two groups of authors do indeed have very different preferences for word-order patterns. Adger (2015, 133) presents a special case of the argument that numbers should speak for themselves (we are grateful to Kristian Rusten for bringing this publication to our attention). He starts by quoting Cohen’s (1988) rule of thumb that a large effect size is one that can be identified with the naked eye.
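Before returning to Adger’s argument, readers who wish to follow the arithmetic behind the word-order figures from McGillivray (2013, 57) can reproduce it in a few lines. The sketch below (pure Python, no statistical libraries assumed) computes the Yates-corrected chi-square statistic for the 2×2 table, which matches the reported value of 77.79, and the φ effect size from the uncorrected statistic (small rounding differences from the published φ = 0.474 may remain):

```python
import math

# Observed frequencies: rows = author group (classical, later age),
# columns = word-order pattern (OV, VO).
obs = [[152, 52],
       [38, 107]]

row = [sum(r) for r in obs]            # row totals
col = [sum(c) for c in zip(*obs)]      # column totals
n = sum(row)                           # grand total

# Expected counts under independence: E_ij = row_i * col_j / n
exp = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]

# Pearson's chi-square with Yates's continuity correction (2x2 table)
chi2_yates = sum((abs(obs[i][j] - exp[i][j]) - 0.5) ** 2 / exp[i][j]
                 for i in range(2) for j in range(2))

# phi effect size, conventionally computed from the uncorrected statistic
chi2_raw = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))
phi = math.sqrt(chi2_raw / n)

print(round(chi2_yates, 2))  # 77.79, as reported
print(round(phi, 2))
```

The same computation is available off the shelf (for instance via scipy.stats.chi2_contingency, which applies the Yates correction to 2×2 tables by default); the point of the manual version is only to show that nothing beyond the four observed counts is needed.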
However, he then commits a logical fallacy when he conflates the estimation of the size of an effect with the problem of establishing whether or not we are faced with a meaningful difference or correlation. To speak of a ‘large effect’ implies that we have enough data to speak of an effect to begin with. Establishing this is precisely the purpose of statistical testing, and only after this step is it meaningful to speak of effect size. As Adger (2015, 133) puts it: ‘most syntacticians feel justified in not subjecting [data] to statistical testing’. We cannot help but conclude that such confidence is misplaced.

Dealing with linguistic complexity

The second problem affecting quantitative analyses that we will examine here is the tendency towards what we call ‘post hoc analysis’, which is related to the example-based approach covered in section 1.3.1. Post hoc analysis consists in collecting occurrence counts of a phenomenon in a set of texts or a corpus, and then focusing the analysis on specific examples drawn from this evidence base, highlighting the role played by certain variables, which are analysed in a non-systematic way and are introduced after the data collection. This approach attempts to account for the multidimensional nature of the phenomenon at hand, but it does so without employing techniques from multivariate statistical analysis (see section 6.2 for more details). This is an instance of the search for particular elements, as opposed to recurrent, general patterns (see discussion in section 1.3.5). For instance, we may say that in a particular example the choice of word-order pattern seems to be related to a particular grammatical case of the object, and explain why this is the case based on our theoretical statements.
Then, we may argue for the role played by semantics by illustrating it with an example showing a particular semantic class of the subject. For example, in Bentein (2012, 187–8) the author introduces the following post hoc variables to analyse periphrastic perfects in fifth-century classical Greek: passive voice, object-oriented and resultative nature, and telicity of the verbs. As part of the discursive description of the examples, the author adds some quantitative details in a footnote on page 188. Such details further specify the statement ‘especially in Sophocles and Euripides one can find relatively more subject-oriented resultatives than in the historians’ (Bentein, 2012, 187–8). The author provides frequencies and percentages of active vs medio-passive forms in poetry and in prose, but tests neither the statistical significance of such effects nor their size. Next, the author introduces the placement of temporal/locational adverbials in the verbal group. However, the role of this variable is not measured, and only a few examples are given. This is a missed opportunity to add a quantitative dimension to the analysis. Similarly, the author argues for a diachronic shift from resultative perfect to anterior perfect. However, the argumentation rests on underspecified quantitative statements like ‘the active transitive perfect (with an anterior meaning) is indeed rather uncommon in fifth-century writers’ (Bentein, 2012, 189). Phrases like ‘various examples’, ‘several examples’, and ‘many cases’ (Bentein, 2012, 190) indicate attempts to argue for the quantitative relevance of the phenomenon described, but the lack of precise measures undermines the efficacy of the arguments. In general, the argumentation develops throughout the article by adding more variables to the picture (such as the telicity of the predicates and the agentivity of the clauses) in a post hoc fashion, keeping them outside the scope of the frequency-based analysis.
The practice of post hoc analysis may be coupled with an argumentation strategy that relies heavily on anecdotal evidence. In this respect, a very instructive example is again given in Bentein (2012, 192), where four examples are considered sufficient to show a diachronic development of the periphrastic perfect towards an increased degree of agentivity. Let us consider another case of post hoc analysis, this time in the context of the presence vs absence of Latin gender doublets over time. Rovai (2012, 120) performs a quantitative analysis by counting the occurrences of the feminine and neuter forms in a given set of texts. The quantitative data are thus frequency counts according to one variable (gender). After presenting the count data, the article contains a detailed analysis of each of the sixteen lemmas, specifying the declension class, stem, and number features of the forms found in the texts (Rovai, 2012, 102–3). This is a well-motivated step, because counting the number of occurrences of each gender form is obviously not sufficient for a good analysis of the phenomenon at hand, and more factors need to be taken into consideration. It is also a step that we can consider part of a qualitative analysis, because it goes into the detail of each instance. This analysis is followed by a summary of the data according to the time variable, showing the cases where the feminine forms are more ancient than the neuter ones (eight out of sixteen) and those where the feminine forms do not occur after the archaic age (with five exceptions), while the neuter forms are attested in the later centuries. From these observations the author draws the conclusion that the feminine forms ‘seem to be the last occurrences of unproductive remnants already in early times’ (Rovai, 2012, 103).
To support this claim, he provides the example of the fossilized form ramentā and the form caementa, mainly occurring in the conservative context of public law texts. The type of texts in which the feminine forms are attested and their fossilized nature are thus used as arguments for the claim that such forms are more ancient than the neuter ones. In this case, the main analysis focused on the gender of the forms and the age of the texts; later on, however, text type and formulaic features are considered as well, but with respect to only two of the sixteen nouns (rament- and caement-). It is natural to ask: how many times does each of the sixteen nouns occur in fossilized forms or in legal texts? Including such variables in the original analysis would make the approach systematic and appropriate to the multidimensional nature of the phenomenon studied. Another variable considered in a post hoc fashion in the article is related to lexical connectionism (Rovai, 2012, 106). Limited to a subset of the nouns analysed, this is used as an argument supporting the hypothesis that ancient feminine forms were later reanalysed as thematic neuter forms. According to this argument, some feminine nouns shared the same semantic field as some second-declension neuter nouns, and therefore occurred in the same contexts. To support this, the author provides two examples. However, it is not clear how to quantify the role played by lexical connectionism in the phenomenon under investigation. How many counter-examples can be found that contrast with the two examples provided? What is the relevance of these two examples in the context of all occurrences of the nouns considered? As the author says in Rovai (2012, 107–11), lexical connectionism cannot account for the development of ten of the sixteen nouns analysed. For this reason, the author analyses constructions that are ambiguous between the personal passive and the impersonal interpretation (e.g.
dicitur ‘it is said’), and uses the fact that the latter gradually became more common over time to argue that the original first-declension feminine forms (such as menda ‘error’) were reanalysed as second-declension neuter forms (mendum ‘error’). However, no measure of the strength of this argument is given: we are not told how many times these ambiguous constructions occur out of all occurrences of the nouns considered, how many instances are available to support the claim, or how this account compares quantitatively to the other factors considered for explaining the reanalysis.

Problems with the research process

We have seen some of the problems affecting the data collection and analysis phases. Here we want to focus on the research process as a whole; in section 2.3 we will summarize the main claims of our proposal in this respect.

Traditionally, access to and automatic processing of large amounts of text have been difficult, due to technological limitations, as we illustrate later in this section. These constraints had an impact on how research was carried out, leading researchers in historical linguistics to focus on relatively small data sets and to publish only the final results of their investigations, typically in the form of articles or monographs. As we have noted, the fact that the analyses themselves were not published meant that they would not be easily reproducible. In spite of the technological advances of the past decades, the focus on the final results of the analysis and the lack of documentation of the intermediate phases of the research process are still the norm, both in the sciences and in the humanities.
Following an increasingly popular line of thought (Candela et al., 2015), in our proposed framework we argue that more emphasis should be placed on documenting, publishing, and sharing all phases of the research process, from data collection to interpretation. In section 2.3 we will outline our suggestions in this area.

New technologies

The dramatic increase in digitization projects in the late 1990s made it possible to encode documents in digital formats, and growing computing power and storage capacity have allowed computers to store more and more data at increasingly lower costs. A number of projects aimed at digitizing historical material have made large amounts of data available to the academic community, such as the Internet Archive (https://archive.org/index.php), Europeana (Bülow and Ahmon, 2011; http://www.europeana.eu/portal/), and Project Gutenberg (https://www.gutenberg.org), just to mention a few. This has meant that archives and libraries can make their collections more accessible and better preserve them. In parallel, the development of disciplines like computational linguistics and its applied field of natural language processing has made it possible to analyse large amounts of text automatically. Let us imagine that we were interested in studying the usage of a as a preposition (meaning ‘in’, as in We go there twice a week) in English in the seventeenth century. We would not be able to read all texts written in the seventeenth century and note all usages of a as a preposition. In the pre-digital era, we would probably have selected a sample of the texts, checked existing theories, and possibly formulated a hypothesis and checked it against the selected texts. This way, we would be less likely to find patterns that contradict our intuition, and if we did, we would only be able to collect a very limited number of examples, and we would have no idea of how common the evidence contradicting our intuition is.
With the wealth of digitized texts we have at our disposal nowadays (especially for English), we are able to draw on a much broader evidence base, and this triggers new research questions that were not conceivable before. Such increasingly large text collections cannot be tackled with the so-called ‘close-reading’ approach. On the other hand, simply searching for a in a raw-text collection leads to a high number of spurious results, including all cases where a is used as a determiner. Even if we searched for certain patterns (such as instances preceding ‘day’, ‘week’, or ‘year’), we would only capture a subset of the relevant occurrences. Instead, if we are able to automatically analyse all texts of interest by part of speech with appropriate natural language processing (NLP) tools, we can identify the cases where a is a preposition. This way, we would be in a position to answer questions like ‘how has the relative frequency of a as a preposition and a as a determiner changed?’ or ‘which factors might have driven this change?’. As we have suggested in the example above, the new possibilities offered by digital technologies have had profound implications for research practice and methodologies. In addition to historical linguistics, numerous other areas of human knowledge have witnessed an explosion in the size of the data sets available. Ranging from market analysis to traffic data, the phenomenon of ‘big data’ (generally referring to data sets characterized by large volume, variety, and velocity) has become a reality that organizations cannot afford to ignore (Mayer-Schönberger and Cukier, 2013).
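To make the relative-frequency question about a concrete: once the texts have been tagged by part of speech, answering it reduces to counting tagged tokens per period. The sketch below uses invented counts purely for illustration; the period labels and figures are hypothetical and do not come from any actual corpus:

```python
# Hypothetical counts of the word 'a' by POS tag and fifty-year period,
# as they might be extracted from a POS-tagged historical corpus.
counts = {
    "1600-1649": {"preposition": 120, "determiner": 14500},
    "1650-1699": {"preposition": 85,  "determiner": 16200},
}

# Relative frequency of prepositional 'a' among all uses of 'a', per period.
shares = {}
for period, c in sorted(counts.items()):
    total = c["preposition"] + c["determiner"]
    shares[period] = c["preposition"] / total
    print(f"{period}: {shares[period]:.2%} of 'a' tokens are prepositional")
```

Normalizing by the total number of a tokens per period, rather than comparing raw counts, is exactly the step that the ‘letting the numbers speak for themselves’ practice discussed in section 1.3.3 omits.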
In this book we argue that historical linguistics has not taken full advantage of this technological and cultural change, and we suggest a framework which supports the transition of this field to a new state that is more in harmony with the current scientific landscape. This transition does not only consist of a set of new techniques applied to traditional research questions, or of the ability to carry out traditional analyses on a larger scale; we believe that it allows a whole set of new questions to be answered. In their abstract, Bender and Good (2010, 1) summarize the need for linguistics to scale up its approach as follows:

The preeminent grand challenge facing the field of linguistics is the integration of theories and analyses from different levels of linguistic structure and aspects of language use to develop comprehensive models of language. Addressing this challenge will require massive scaling-up in the size of data sets used to develop and test hypotheses in our field as well as new computational methods, i.e., the deployment of cyberinfrastructure on a grand scale, including new standards, tools and computational models, as well as requisite culture change. Dealing with this challenge will allow us to break the barrier of only looking at pieces of languages to actually being able to build comprehensive models of all languages. This will enable us to answer questions that current paradigms cannot adequately address, not only transforming Linguistics but also impacting all fields that have a stake in linguistic analysis.

This extract applies to the whole field of linguistics, and the authors identify the main challenges ahead of linguistics today as consisting of data sharing, collaboration, and interdisciplinarity, as well as standards and the scaling up of the data sets used for formulating and testing hypotheses on language (with the help of NLP tools for automatic analysis).
They also underline the need to overcome such challenges so that higher goals can be achieved. We fully support this view, and in the present book we combine it with further points pertaining specifically to historical linguistics, in the context of a general methodological framework.

1.3.5 Conceptual difficulties

Why has historical linguistics not yet fully embraced the methodological shift we outline in this book? There are many reasons for this. Inadequate technical means and skills, and insufficient computing power and storage capacity, are certainly concrete obstacles that have stood in the way of a complete transition of historical linguistics into the empirical, data-driven quantitative science we argue for in this book, as we saw in section 1.3.4. Here we want to briefly discuss other, more serious obstacles, which concern the place of historical linguistics, and of the humanities in general, in the scientific landscape. Bod (2014) offers a comprehensive overview of the history of the humanities, while at the same time taking the opportunity to discuss the defining elements of the humanities and their relationship with the sciences. The humanities have been defined as 'the disciplines that investigate the expressions of the human mind' (Dilthey, 1991); however, this definition is not unproblematic: it would apply to mathematics as well, for example. In fact, Bod chooses a more pragmatic one, according to which the humanities are 'the disciplines that are taught and studied at humanities faculties' (Bod, 2014, 2). From Bod's (2014) overview it is clear that a radical dichotomy between the humanities and the sciences is not supported by historical evidence.
In fact, he finds a unifying feature shared by scientific and humanistic disciplines in the development of methodological principles and the search for patterns (Bod, 2014, 355), which in the case of the humanities focus on humanistic material (texts, language, art, music, and so on). The nature of such patterns varies across disciplines, with examples of local and approximate patterns found both in the humanities and, for example, in biology. According to Bod (2014, 300):

linguistics is the humanistic field that is ideally suited to the pattern-seeking nomothetic method, which has indeed become common currency [. . .] Despite its general pattern-seeking character, present-day linguistics displays a striking lack of unity [. . .] In one cluster we see the approaches that champion a rule-based, discrete method, whereas in the other cluster an example-based, gradient method is advocated.

This perspective is in contrast with the view according to which the humanities are not concerned with finding general patterns, but only with analysing particular human artefacts, whether texts, manuscripts' transmission histories, or works of art. Instead of stressing a strict opposition between scientific and humanistic disciplines, then, it is helpful to appreciate the differences that exist within the sciences themselves and opt for a more nuanced approach. In this book we propose a methodological framework that encompasses a large portion of the practice of historical linguistics (for the scope of our framework, see section 2.1.1), and is concerned with empirical, corpus-driven quantitative approaches. In our framework, historical linguistics research looks for patterns and tests hypotheses in historical language data, mainly historical corpora, and builds models of historical language phenomena.
1.4 Can quantitative historical linguistics cross the chasm?

A fundamental assumption of this book is that historical linguists already work with technology. The Greek root tekhnē can refer to any acquired or specialized skill. Even in the more conventional sense of technology as some invented means by which we achieve something (books, too, are a technology), historical linguistics is a technological field, or at the very least not an atechnological one. It is therefore anachronistic to set up an artificial opposition between historical linguistics on the one hand and technology on the other. For the purposes of this discussion, we will give 'technology' the broadest possible scope, pointing out that historical linguists already use 'technologies'. Under this conceptualization of technology, a symbolic analytical framework (such as X-bar annotation) counts as a 'technology' just as much as a software platform like R does. This broad sense of technology can then be distinguished from the very advanced, and possibly more recent, high-tech type of technologies, such as cutting-edge lab equipment or statistical and computational software and algorithms. It is probably safe to say that historical linguistics is not typically or commonly associated with high-tech approaches, an impression that will be further discussed in later chapters. Above, we indicated that a more high-tech approach could benefit historical linguistics. Since such an approach is already in use in other branches of linguistics, it is clearly technically possible to adopt it, and there are examples of historical linguists who have already made use of state-of-the-art techniques from computational and corpus linguistics, and applied statistics. What we are more concerned with here is the possibility of making these approaches mainstream.
The present section deals with the problem of disseminating such a methodology beyond the small group of linguists who have already adopted it, and of making it available to a much larger share of historical linguists. To do this, we will base our discussion on a much-touted model of technology adoption in the world of business: the problem of crossing the chasm (Moore, 1991). The technology adoption life cycle we have in mind is based on Moore (1991), and views technology adoption as a process of diffusion. The market is viewed as consisting of relatively distinct groups who will adopt a new technology or product for very different reasons. Crucially, the different market segments act as reference points for each other, so that a product or a technology can seemingly be transmitted from one group to the next. As we will see, this highly idealized model can bring some real insights regarding the adoption of quantitative corpus methods in historical linguistics, just as it can inform marketers about how to push the latest high-tech gadgets to consumers. The key insight lies not in the details of the model as such, but in the way it throws light on people's motivations for deciding to make use of a specific technology. It is this pivotal insight that, we think, merits the model's application to the problem of how to advocate a more widespread adoption of quantitative corpus methods in historical linguistics, and of what the obstacles are.

1.4.1 Who uses new technology?

To better understand why people adopt a technology, the model operates with five groups of highly idealized technology users. These groups can in turn be gathered into two broad types, namely the early adopters and the mainstream adopters. Early adopters will typically have very different motivations for picking up a new technology compared to mainstream users.
From this simple observation follows the conclusion that a technology that appeals to early users may fall flat when presented to the mainstream. The gap in expectations and requirements that separates the early adopters from the majority of potential users of the technology is what constitutes the metaphorical chasm. But before tackling how to cross the chasm, we will look into what defines the different groups of users. We have adapted the business-oriented examples from Moore (1991) and situated them in a linguistic context where needed.

The innovators
The innovators are the technology enthusiasts. These are people who are interested in new technology for its own sake, and they will eagerly pick something up simply because the new technology appeals to them. They are typically not deterred by cost, and since they have a high level of technological competence, they are not put off by prototypes or a lack of formalized user support. If a new technology, such as a piece of software, requires modification or configuration to function, they will be able to do this themselves, or find out how to do it via technology discussion forums on the web. In a more linguistic context, innovators are linguists who introduce new technologies from other fields, or even create their own. This idealized user type might remind us of the caricature of the quantitative corpus linguist in Fillmore (1992), who is mostly concerned with corpora, tools, and corpus frequencies for their own sake.

The visionaries
The next group of users, the visionaries, are also technologically savvy, but unlike the innovators they are not primarily interested in the technology for its own sake. The visionaries have a strategic interest in the technology and are primarily interested in the subject matter, i.e. historical linguistics. To the visionaries, the exact properties of the new technology are subordinate to what it can help them achieve in linguistics.
Such achievements could be anything from answering a linguistic question hitherto considered too hard to be adequately answered, to gaining an advantage in the academic job market by mastering a new, trendy technology. The visionaries and the innovators make up the early adopters, and the visionaries will have the innovators as a reference group. If the innovators can demonstrate that, in principle, a new technology can tackle new questions, or answer old questions on a new scale, then the visionaries are happy to start making use of that technology in order to gain an advantage.

The early majority
With the visionaries we leave behind the early adopter groups and enter mainstream territory. Here we find the early majority. This group constitutes a large share of the overall users, and its defining characteristic, according to Moore (1991), is pragmatism. Its members will adopt a new technology when it is both convenient and beneficial for them to do so. They are more interested in incremental improvements than in huge leaps forward, and will avoid the risks associated with new technology by finding out how others, typically the visionaries, have fared with it (Moore, 1991, 31). This means that the early majority are much slower to adopt new technologies than the early adopters, but they are more likely to stick with a new technology once it has caught on. Moore (1991, 31) points out that the early majority is difficult to characterize, but we can think of them as linguists who have adopted corpus linguistics methods or quantitative tools as a purely pragmatic measure after seeing that the visionaries have successfully used the same tools to answer questions in a new way, and only after those tools have reached a sufficient level of maturity and user-friendliness.
The late majority
The next large segment of technology users are the conservatives. The conservative users are not concerned with the latest high-tech tools; indeed, they might be wary of them (Moore, 1991, 34). Conservatives are highly focused on ease of use, and will stick with their chosen technology for as long as possible. They are reluctant to exchange it for another technology, and will do so only when the new technology has become a virtual standard, is easy to use, and covers all their needs in the area it is meant to cover. A hypothetical example might be a linguist who adopts a new technology because it has become so widely adopted that it is a near requirement. Incentives could be negative, as in the loss of support for an older technology when the new one is introduced as the standard; or they could be positive, e.g. some journals favouring articles that make use of the technology in question.

The sceptics
The sceptics, or 'laggards', as Moore (1991, 39) also calls them, make up the small tail end of the technology adoption cycle. The sceptics, as the label implies, do not adopt new technology and will instead stick to their tried-and-trusted methods, no matter what the cost in lost productivity or lack of perceived coolness. The linguistic example in this case might be the caricature of the 'armchair' linguist in Fillmore (1992), who accesses the relevant data purely through introspection and intuition. As Moore (1991, 40) points out, there are important lessons to be learned from this group, since they are more prone to seeing the flaws in any new technology, and are sensitive to the hyperbole that inevitably accompanies a new technology making its way into an academic field.
Thus, while we fundamentally disagree with the sceptics about the value of new technology in historical linguistics, we are also highly interested in their arguments, since we can learn much from them about the discrepancy between how a new technology is marketed and its actual capabilities. We remain committed to the idea that introducing new technologies can benefit historical linguistics, but only to the extent that such technologies are fairly evaluated on their actual merits, not on hyperbole.

1.4.2 One size does not fit all: the chasm

As the characterization of the types of technology users above should make clear, these are idealizations that do not necessarily fit any one person, and one person might fit several idealized groups to some degree. However, each idealized user type captures very different motivations for taking on a new technology, and the broad differentiation between early adopters and the mainstream captures the fact that some of these motivations are more closely aligned than others. The key insight the types confer is that motivations for adopting new technology differ. Essentially, one size does not fit all. This means that although a new technology might be outright attractive to the innovators, the visionaries might fail to see how it can be used in a meaningful way to answer the linguistic questions they care about. In that case, the technology in question is likely to remain a niche phenomenon. Alternatively, the technology itself might appeal to the innovators and at the same time offer the visionaries the strategic advantage they seek in answering linguistic questions. In this case, the technology will have fully engaged the early adopters. However, the technology might still not permeate the mainstream market, because it fails to cross the metaphorical chasm. The idealized segments of user types are not continuous; there are gaps separating them.
However, one gap stands out as larger than the others: the chasm that separates the early users from the mainstream users. This is illustrated in Figure 1.1, which shows the relatively larger gap between early adopters and the mainstream as a noticeable discontinuity. As the figure also makes clear, the chasm separates the relatively small number of early adopters from the bulk of users, who are found in the mainstream part of the model. Thus, the chasm not only separates qualitatively different users from each other; it also represents a quantitative difference, separating a minority of users from the vast majority. We consider the chasm model a useful basis for analysing the status of quantitative approaches to historical linguistics for two reasons. First, it covers technology in the broad sense (even new analytic frameworks). Thus it provides not only a way to understand the point that this book is trying to make, but also a tool for understanding the current situation. Second, the model provides some insights into what can be done to change the situation, provided that our argument in favour of increasing historical linguistics' reliance on high-tech approaches is accepted.

Figure 1.1. Technology adoption life cycle modelled as a normal distribution, based on Moore (1991, 13). [Curve labels: innovators, early adopters, early majority, late majority, sceptics; the 'chasm' separates the early adopters from the majority.]

Although the chasm model can be used in many ways, we will focus on its key component, namely the insight about the chasm that divides the early adopters from the majority of users. To understand why some technologies never go mainstream, we must consider what prevents them from crossing the chasm, as we will see in the next section.
1.4.3 Perils of the chasm

There are a number of reasons why a technology might never reach the mainstream segment of users. For instance, it might never reach the chasm at all, because it fails to catch on among the innovators and visionaries. According to Moore (1991), this is likely to happen if the vision behind the technology is marketed before the technology itself is actually viable. For example, the vision of large-scale, data-driven corpus approaches to the study of language crucially depends on specific types of computer technology. Without a suitably mature version of this technology, the vision may be appealing, but the practical problems would prevent it from really catching on. As we shall see later, we might find parallels to this in historical linguistics. Quantitative corpus methods are at the very least a technology that has been embraced by early adopters (innovators and visionaries) in historical linguistics. We argue that they have not yet fully entered the mainstream of users in historical linguistics, and we will substantiate this claim further in section 1.5. However, as we consider some of the potential reasons for this failure to cross the chasm, it is useful to look into some general pitfalls for new technologies crossing the chasm, adapted from Moore (1991, 41–3):

(i) lack of respect for established expertise and experience;
(ii) forgetting the linguistics and focusing only on the technology;
(iii) lack of concern for established ways of working;
(iv) practical problems such as missing standards, lack of training opportunities, or educational practices.

Point (i) prevents a new technology from crossing the chasm because it alienates the majority of mainstream users.
For all the high-tech buzz about disruptive technology, it is clear that technologies that are able to adapt to existing practices have an advantage when it comes to crossing the chasm. The majority of users are pragmatic, and the disruptive, innovative aspects of a new technology are simply not what appeals to them. This brings us to point (ii), which captures the insight that, for the majority of users, high-tech approaches must present a better option for doing historical linguistics. Without that perspective, we would hardly expect any attempt to push a new technology to the majority of mainstream users to succeed. Point (iii) captures the fact that historical linguists, like any users of technology, are interested in tools that work. Established tools, such as the qualitative methods of historical comparative linguistics, clearly work. Thus, the chasm model suggests that innovative technology ought to work best where the established methods have their weakest points. Finally, point (iv) addresses all the practical or financial problems associated with a new technology, such as acquiring the technology itself, learning new skills (and transferring them to students), finding ways to integrate existing technology with the new, establishing standards (e.g. for annotation), and establishing best practices (e.g. for peer review). None of these points need be fatal for a new technology attempting to enter the mainstream, but in combination they would seriously impede its chances of reaching out. In the case of high-tech approaches to historical linguistics, we can easily find examples of all four problems, which taken together would prevent full adoption of the technology advocated here. As the following sections and chapters will make clear, our aim is to provide a roadmap for how these potential problems can be avoided.
Specifically, we seek to address points (i) to (iii) by presenting quantitative historical linguistics in an accessible and relatively jargon-free manner, with the aim of highlighting how this particular approach can in many ways exist alongside established ways of working. We also aim to illustrate how the approach we advocate will in some cases yield better or perhaps more interesting results, which we believe makes the investment in the technology well worth it from a historical linguistics point of view. The final point, dealing with practical problems, lies to some extent outside the scope of the book. There are a large number of books and courses teaching these skills aimed at linguists, and the software we advocate is free. However, we do tackle the problem of standards and some terminology, thus hoping also to ease some of the practical problems associated with the new technology. A crucial step towards achieving these aims is to have a clear understanding of the situation in historical linguistics, which is the task of section 1.5.

1.5 A historical linguistics meta study

In this section we focus on the level of adoption of quantitative corpus methods in historical linguistics compared to linguistics in general. We also report on a quantitative study we have carried out on a selection of publications from the existing literature in historical linguistics.

1.5.1 An empirical baseline

Before looking into the current use of corpora and quantitative methods in historical linguistics, it is worth considering just how quantitative we should expect historical linguistics to be. A reasonable benchmark is the field of linguistics overall.
After all, linguists working on contemporary languages have a wider spectrum of methods available to them, methods that are out of reach for most historical linguists: native speaker intuitions, surveys, recordings, interviews, controlled experiments, and so on. Given that these methods are all considered acceptable in mainstream linguistics (see Podesva and Sharma 2014 for an overview), and given that the primary source of data for most historical linguists is textual, our position is that historical linguistics should not be using corpora and quantitative methods to a lesser degree than linguistics overall. For this benchmark we have relied on data from Sampson (2005, 2013). These two studies analyse research articles (excluding editorials and reviews) published in the journal Language between 1960 and 2011. Sampson wanted to know the extent to which mainstream linguistics relied on empirical, usage-based data, and to this end he sampled the volumes of what is arguably the leading linguistics journal at regular intervals. As a baseline he chose the journal's 1950 volume, so as to reflect the period prior to the increased reliance on intuition-based methods in the 1960s. Sampson devised a three-way classification system to label articles as 'empirical', 'intuition-based', or 'neutral'. The last category was designed to cover papers that did not readily fit into the first two categories, such as methodological papers or papers dealing with the history of linguistics. To classify articles he used a number of rules of thumb, including an admittedly arbitrary threshold of two usage-based or corpus-based examples to classify a paper as evidence-based. However, he also employed positive criteria for labelling papers as intuition-based, notably the presence of grammaticality judgements.
Thus, while the criteria for being evidence-based might seem liberal, the presence of additional criteria ensures a reasonable classification accuracy. For full details of the sampling, procedure, and criteria, see Sampson (2005). Although the data in Sampson (2005) indicated a trend towards an increasing number of evidence-based papers, his main conclusion was cautious, suggesting that linguistics still had some way to go before empirical scientific methods were fully accepted in the field. The proportion of evidence-based papers (calculated from the total number of non-neutral articles) was growing only slowly and showing some signs of dipping. Picking up the thread from the previous study, Sampson (2013) continued the exercise and found that what had appeared as a downward trend around 2000 was simply due to fluctuations. The addition of more data confirmed a continued upward trend since the nadir in the 1970s. Figure 1.2, based on data from Sampson (2013), illustrates this trend. Since 2005 the proportion of evidence-based studies has exceeded the 1950 baseline, represented by the horizontal line in the plot.

Figure 1.2. Proportions of empirical studies appearing in the journal Language between 1960 and 2011. The horizontal dotted line represents the baseline of the 1950 volume. After figure 1 in Sampson (2013).

As Figure 1.2 shows, empirical methods (according to Sampson's criteria) have made a remarkable comeback. Already in the 1980s approximately half the research articles published in Language were based on empirical evidence (in Sampson's sense of the word), with a rapid increase setting off in the 1990s. This is perhaps not surprising, since it coincides with the availability of electronic corpora around the
same time, as discussed in section 3.5, and Sampson (2005) also makes the link to the availability of corpora explicit. It is clear that searchable electronic corpora foster empirical research. However, it is all too easy to mistake correlation for causation, and corpora are only one piece of the puzzle, as shown by the fact that what is published in Language is still (as we believe it should be) a mixture of empirical papers (in the sense of Sampson 2005) and other studies. Corpora do not determine what kind of research is published. Thus, we cannot simply assume that a similar situation holds in historical linguistics. To complement the picture, we therefore surveyed the field of historical linguistics.

1.5.2 Quantitative historical research in 2012

Our meta study differs from those in Sampson (2005) and Sampson (2013) in that we surveyed several journals published in one particular year, as opposed to a single journal over several decades. We found this to be a reasonable approach, since our aim was to present a snapshot of the field of historical linguistics as it currently appears. For the literature survey we carefully read a selection of research articles published in 2012, taken from six journals. These six journals are clearly a small sample of all that is published within historical linguistics in a given year, but they should nevertheless provide some insight into the breadth of research currently being published. To make the effort feasible, we applied a number of exclusion criteria, and focused on the cases that met all of the following criteria:
1. research journals (excluding monographs, yearbooks, and edited books);
2. journals published in English;
3. journals focusing specifically on historical linguistics and/or language change;
4. journals with a general coverage, excluding those focusing on specific languages or subfields (like historical pragmatics or syntax);
5. linguistics journals (excluding interdisciplinary ones).

Applying these criteria resulted in the following final list of journals:

• Diachronica
• Folia Linguistica Historica (FLH)
• Journal of Historical Linguistics (JHL)
• Language Dynamics and Change (LDC)
• Language Variation and Change (LVC)
• Transactions of the Philological Society

From these journals we selected only the full-length research papers, thus excluding book reviews, editorials, and squibs. This left us with sixty-nine papers, a number which was pruned down to sixty-seven after removing two papers that were deemed out of scope. We then read and classified the final set of papers. The data and the code for this study are available on the GitHub repository https://github.com/gjenset. For each paper a number of variables were recorded, including the journal, the type of techniques employed in the analysis, whether or not corpora were used, and whether the paper could be classified as quantitative, qualitative, or neutral. For the neutral classification we employed the criteria from Sampson (2005). Five papers with a mainly methodological or overview focus were included in the neutral category, leaving us with sixty-two papers for the quantitative vs qualitative categories. Our classification differs from that in Sampson (2005) in that it relies on more variables, notably recording the use of corpus data, but also of other data sources such as word lists (e.g. in phylogenetic studies).
Furthermore, we decided to distinguish between the source of the data (such as corpora vs quoted examples) and the use to which the data were put (e.g. whether they were treated quantitatively or qualitatively). This was done to obtain a classification that was both more fine-grained and easier to operationalize for historical linguistics than the criteria from Sampson (2005), since none of the papers relied on native speaker intuitions. Whether or not a paper was corpus-based was judged from the discussion of the data in the paper. We relied on the accepted definition of a corpus as a machine-readable collection of naturalistic language data aiming at representativity (with obvious allowances made for historical data, with their gaps and genre bias). This excluded sources of data such as the World Atlas of Language Structures, as well as word lists. Furthermore, we required the corpus to be published, or at least in principle accessible to others, which excluded private, purpose-built collections made for a specific study; however, we accepted as corpus-based those studies relying on a subset of data from a corpus that would otherwise fulfil these criteria. The distinction between quantitative and qualitative studies was made by assessing whether or not the conclusion, as presented by the article's author(s), relied on quantitative evidence. Essentially, we considered whether the author(s) argued along qualitative or quantitative lines by looking for phrases that would imply a quantitative proof of the article's point, such as 'x is frequent/infrequent/statistically correlated with y'. Qualitative papers were thus mostly defined as non-quantitative ones, but we also applied positive criteria: we judged arguments based on the presence or absence of a feature or phenomenon to be indicative of a qualitative line of argumentation.
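The phrase-based part of this operationalization can be illustrated with a toy heuristic. The snippet below is our own sketch, not the classification code used for the study, and the phrase list is an illustrative guess at the kind of cues described above.

```python
import re

# Toy heuristic inspired by the classification criteria above: flag a
# conclusion as arguing quantitatively if it contains phrases implying
# quantitative evidence. The phrase list is our own illustrative guess.
QUANT_PATTERNS = [
    r"\bfrequent\b", r"\binfrequent\b", r"\bstatistically\b",
    r"\bcorrelated with\b", r"\bper cent\b", r"\bsignificant(ly)?\b",
]

def argues_quantitatively(conclusion: str) -> bool:
    """Return True if the conclusion contains a quantitative cue phrase."""
    text = conclusion.lower()
    return any(re.search(pat, text) for pat in QUANT_PATTERNS)

print(argues_quantitatively(
    "The construction is statistically correlated with genre."))  # True
print(argues_quantitatively(
    "The construction is attested in Old English prose."))  # False
```

In the study itself the judgement was of course made by human readers; the sketch only makes explicit the kind of surface cue the criteria rely on.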
Phylogenetic studies, while not typically based on frequency data, were counted as quantitative, since their underlying assumptions are based on computing distances between features or clusters of features. Applying these criteria we found that thirty-seven papers (60 per cent) were qualitative, while the remaining twenty-five (40 per cent) were quantitative. Table 1.1 lists the number of papers grouped according to whether or not they are corpus-based and whether they are qualitative or quantitative. A Pearson chi-square test of independence reveals a statistically significant, medium-to-strong association between corpus use and the use of quantitative methods (χ²(1) = 12.68, p = 0.0004, φ = 0.49).

Table 1.1. Classification of sample papers according to whether or not they are corpus-based, and whether or not they are quantitative (percentages in parentheses)

                    Qualitative    Quantitative    Total
Not corpus-based    33 (53)        11 (18)         44 (71)
Corpus-based         4 (6)         14 (23)         18 (29)
Total               37 (60)        25 (40)         62 (100)

Table 1.2. Classification of papers from Language 2012 according to whether or not they are corpus-based, and whether or not they are quantitative

                    Qualitative    Quantitative
Not corpus-based    3              6
Corpus-based        0              6

Perhaps unsurprisingly, corpus-based studies tend to favour a quantitative approach, although four qualitative corpus studies were also identified, which illustrates that there is no simple one-to-one relationship between corpus data and quantitative methods. Of the quantitative studies, a little over half (fourteen out of twenty-five) were corpus-based. Comparing this to the benchmark from Sampson (2013), it seems that the leading linguistics journal Language has gone further than historical linguistics in adopting quantitative methods.
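The reported test statistic can be reproduced from the counts in Table 1.1. The snippet below is our own sketch, not the code from the study's repository: Yates' continuity correction reproduces the reported χ² of 12.68, while the effect size φ is computed from the uncorrected statistic, which reproduces the reported 0.49.

```python
import math

# Counts from Table 1.1: rows = not corpus-based / corpus-based,
# columns = qualitative / quantitative.
observed = [[33, 11],
            [4, 14]]

row_totals = [sum(row) for row in observed]        # [44, 18]
col_totals = [sum(col) for col in zip(*observed)]  # [37, 25]
n = sum(row_totals)                                # 62

def chi_square(obs, yates=False):
    """Pearson chi-square statistic for the 2x2 table above,
    optionally with Yates' continuity correction."""
    stat = 0.0
    for i, row in enumerate(obs):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count
            d = abs(o - e) - (0.5 if yates else 0.0)
            stat += d * d / e
    return stat

chi2_corr = chi_square(observed, yates=True)  # approx. 12.68, as reported
chi2_raw = chi_square(observed)               # uncorrected statistic

# p-value for 1 degree of freedom via the chi-square survival function.
p = math.erfc(math.sqrt(chi2_corr / 2))       # approx. 0.0004

# Effect size: the phi coefficient (approx. 0.49).
phi = math.sqrt(chi2_raw / n)
```

The same numbers can be obtained with standard statistical software (e.g. R's chisq.test, whose default for 2×2 tables also applies the continuity correction).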
Recall that around 80 per cent of papers in the most recent samples studied by Sampson were classified as empirical, whereas we only found 40 per cent. Some caution is required in the interpretation, since the criteria used by Sampson differ subtly from ours due to Sampson’s focus on the use of native speaker intuitions and authentic examples as the minimum criteria for what he terms empirical. To investigate how well Sampson’s classification corresponds to our own, we classified the 2012 volume of Language according to our own criteria. This classification, based on fifteen research articles, yielded a similar result to Sampson’s conclusion for recent articles in his sample: we deemed twelve out of fifteen (i.e. 80 per cent) to be quantitative, with six out of fifteen (i.e. 40 per cent) being corpus-based. As Table 1.2 shows, there were no qualitative corpus-based articles. This indicates that, although the minimum criteria employed by Sampson differ from ours, both sets of criteria in fact point towards the same conclusion.

A sample of fifteen articles is obviously tiny, and the relative frequencies above are offered as an easy means of comparison with the sampled historical linguistics articles, and not as some generalized prediction about linguistics overall. However, we take the numbers in Table 1.2 to indicate that our classification is at least comparable to that made by Sampson, even if our exact criteria differ. Although the sample of fifteen articles from Language 2012 is too small for a direct comparison with the historical and diachronic linguistics sample via statistical methods, it certainly strengthens the case for our main claim that historical linguistics journal articles are using quantitative methods less than the articles in Language.
The claim can be further strengthened if we take into consideration the likely variation, or error margins, around these estimates. Based on the comparisons above, it seems fair to compare Sampson’s estimate of 80 per cent empirical articles in Language with the 40 per cent quantitative papers identified in our historical sample, since our own classification of the 2012 volume of Language showed that the two correspond. It is clear that 80 per cent is a higher percentage than 40 per cent, but how much should we really read into this difference? One way to better understand the difference between the two numbers is to think of them as estimates from an underlying distribution, where we must account for some measurement error. Put differently: our estimates might be incorrect, and the two samples might in fact be exaggerating the differences. We can calculate the range or interval around each of these percentages using the normal distribution as a model. The range of variation we calculate is a 95 per cent confidence interval, which is taken to indicate that 95 per cent of the observations from the underlying population (i.e. articles from the journals) would fall into this range, if our sample is representative. The intervals are listed in Table 1.3. If the error margin around our percentages was excessive, we would expect to see the 95 per cent confidence intervals overlapping, i.e. we would expect to see the upper range of variation for the historical sample reaching into the range surrounding the estimate from Language. As the numbers in Table 1.3 show, this is not the case, however. Even if we have underestimated the percentage of true quantitative papers in the historical sample, and overestimated the percentage of true quantitative papers in the Language 2012 sample, we see that the two samples are still likely to be different. 
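The normal-approximation (Wald) intervals described here are easy to reproduce. The sketch below is illustrative, not the book's code; it assumes the proportions 25/62 for the historical sample and 12/15 for our reclassified Language 2012 sample, and notes that the upper Language bound slightly exceeds 100 per cent and is truncated in reporting:

```python
import math

def wald_ci_95(p_hat: float, n: int) -> tuple[float, float]:
    """95% normal-approximation (Wald) interval for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - 1.96 * se, p_hat + 1.96 * se

hist_lo, hist_hi = wald_ci_95(0.40, 62)   # historical sample: 25/62
lang_lo, lang_hi = wald_ci_95(0.80, 15)   # Language 2012: 12/15

# The intervals do not overlap: the historical upper bound (~52%)
# stays below the Language lower bound (~60%). The Language upper
# bound exceeds 1.0 slightly and is capped at 100% when reported.
```

In percentage terms this yields roughly [28, 52] for the historical sample and [60, 100] for Language.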
The likely theoretical maximum percentage of quantitative papers in the historical sample is 52 per cent, whereas the theoretical minimum for Language is 60 per cent, so Language is clearly different even under this worst-case scenario.

Table 1.3 95 per cent confidence intervals for the percentage of quantitative papers in Language 2012 and the historical sample. Note that the confidence intervals do not overlap

                      Proportion of quantitative papers    95 per cent confidence interval
Language              80                                   [60, 100]
Historical sample     40                                   [28, 52]

Using the same logic, we can test this formally using the prop.test() function in R. The p-value returned by the test is extremely small (χ²(1) = 58.6, p < 0.001), which shows that a sample of sixty-two (the size of the historical sample) is sufficiently large to establish that the percentage of quantitative papers (40 per cent) is statistically different from the percentage reported by Sampson (2013) and found in our Language 2012 sample (80 per cent).

However, there is another aspect to the question of how similar the two samples are, namely the percentage of corpus-based papers, which is the same (40 per cent). Looking only at the proportion of corpus-based papers, we might assume that the situation in historical linguistics journals is very similar to the one in Language. However, we also need to consider how dispersed the corpus-based papers are in the historical sample before attempting to draw further conclusions. Our aggregated results may hide different research traditions and methodological conventions within the field of historical and diachronic linguistics. If this were the case, then we would expect to see some differentiation among types of studies depending on the journal. In fact, this is what we find if we group the classifications by journal.
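The prop.test() comparison above can be mirrored in plain Python. The sketch below is an illustrative stand-in for R's one-sample proportion test with its default Yates continuity correction, applied to the 25 quantitative papers out of 62 against the 80 per cent benchmark; it reproduces the reported χ²(1) = 58.6:

```python
import math

def prop_test_1sample(x: int, n: int, p0: float) -> tuple[float, float]:
    """One-sample test of a proportion against p0, with Yates
    continuity correction, mirroring R's prop.test() default."""
    chi2 = (abs(x - n * p0) - 0.5) ** 2 / (n * p0 * (1 - p0))
    p = math.erfc(math.sqrt(chi2 / 2))   # upper tail for df = 1
    return chi2, p

# 25 quantitative papers out of 62, tested against the 80% benchmark.
chi2, p = prop_test_1sample(25, 62, 0.80)
```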
We carried out an exploratory multiple correspondence analysis (MCA) to look for the links between journals, evidence source type (corpus-based or not), and the quantitative–qualitative distinction. MCA is an exploratory multivariate technique that seeks to compress the variation in a large set of data into a smaller number of dimensions that can be visualized in a two-dimensional plot (Greenacre, 2007). The MCA (shown in Figure 1.3) found that the first dimension (represented by the horizontal axis) explained virtually all the variation in the data, accounting for 90.9 per cent of the total variation. This means that the plot can simply be read from left to right (or right to left), as a continuum where the leftmost journal is maximally different from the rightmost journal. We can interpret how the data relate to this first dimension by looking at the projection of the points on the horizontal axis in Figure 1.3. We can see that the journals can be grouped along a continuum from non-corpus-based and qualitative, to corpus-based and quantitative. On the qualitative/non-corpus-based extreme we find Transactions of the Philological Society, followed by Language Dynamics and Change, Diachronica, Folia Linguistica Historica, and Journal of Historical Linguistics. The other, i.e. quantitative, end of the continuum is represented by Language Variation and Change. The results are hardly surprising for someone familiar with the scope of these journals, and it is obvious that this picture represents some form of mutual self-selection between journals and scholars: journals attract submissions that are in line with their explicitly stated profile. However, the conclusion that Language and the historical linguistics data set as a group are similar in their use of corpora is clearly not warranted.
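MCA is correspondence analysis applied to an indicator (one-hot) matrix. The following self-contained NumPy sketch, using invented data rather than our sample, shows how the per-dimension shares of variation (such as the 90.9 per cent quoted above) arise from a singular value decomposition of the standardized residuals; it is a minimal illustration, not the analysis behind Figure 1.3:

```python
import numpy as np

# Invented indicator matrix: six papers, two binary variables
# (corpus-based yes/no, quantitative yes/no), one column per category.
Z = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=float)

P = Z / Z.sum()                      # correspondence matrix
r = P.sum(axis=1)                    # row masses
c = P.sum(axis=0)                    # column masses
S = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

inertia = sv ** 2                    # principal inertia per dimension
explained = inertia / inertia.sum()  # share of variation per dimension
coords = np.diag(c ** -0.5) @ Vt.T @ np.diag(sv)  # category coordinates
```

The first entries of `explained` give the proportion of variation captured by each plotted dimension; the first two columns of `coords` give the category positions in a two-dimensional map.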
Instead, what we see in Figure 1.3 is that Language Variation and Change is (not surprisingly) different from the other historical linguistics journals, and that it is Language Variation and Change that is primarily associated with both corpora and quantitative methods.

Figure 1.3 MCA plot of the journals considered for the meta study and their attributes. Dim 1 is the dimension with the most explanatory value; Dim 2 is the dimension with the second most explanatory value. (In the plot, Dim 1 is labelled as accounting for 93.3 per cent.)

Table 1.4 Classification of sampled papers according to whether or not they are corpus-based, and whether or not they are quantitative, with LVC left out (percentages of the total in parentheses)

                     Qualitative    Quantitative
Not corpus-based     33 (66)        6 (12)
Corpus-based          4 (8)         7 (14)

We can still observe a continuum among the remaining historical linguistics journals, reflecting the degree to which we find quantitative and corpus-based articles in their 2012 publications. However, once Language Variation and Change is excluded, the numbers change substantially, as Table 1.4 shows. Once the data from Language Variation and Change are set aside, we find that the corpus-based studies account for only 22 per cent of the articles, and the quantitative methods account for only 26 per cent. Without Language Variation and Change, the sample is down to fifty articles, but the test of equal proportions (prop.test() in R) tells us that we still have enough data to distinguish the historical sample from Language.
Having established that the historical sample seems to use quantitative methods and corpus methods less than the state-of-the-art journal in general linguistics (Language) does, we can turn to how this is related to the life-cycle model of adopting new technology that we introduced in section 1.4. It is worthwhile reiterating the theoretical proportions accounted for by the different adopter groups in the technology adoption model, with the chasm situation between early adopters and the early majority:

• Early adopters: 16 per cent
• Early majority: 34 per cent (cumulative percentage: 50 per cent)
• Late majority: 34 per cent (cumulative percentage: 84 per cent)
• Sceptics: 16 per cent (cumulative percentage: 100 per cent)

If we make the working assumption that the published articles more or less correspond to the research technologies adopted by their authors, we can compare the observed proportion of quantitative and corpus-based articles with the theoretical proportions predicted by the technology adoption model. Of course, this assumption cannot be taken literally, since mastery of quantitative corpus research techniques does not preclude using qualitative methods. However, we consider this a useful approximation, since the sampled journals can select their articles from a larger set of submissions. Based on this, at least for our purposes, we can assimilate journal authors to users of research technology. Comparing the technology adoption model to the data from Language collected by Sampson (2013), we see that the proportion of articles employing quantitative methods in that journal (around 80 per cent) is close to what we would see with full adoption of such technologies by the late majority. In our historical linguistics and language change sample we found that 40 per cent of the studies were quantitative, which would suggest that those methods extend to the early majority.
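The mapping from an observed adoption percentage to an adopter group can be made explicit. The small sketch below is ours, not from the adoption literature; the labels and cumulative boundaries are taken from the list above:

```python
import bisect

# Cumulative upper boundaries of the adopter groups listed above.
BOUNDARIES = [16, 50, 84, 100]
LABELS = ["early adopters", "early majority", "late majority", "sceptics"]

def adopter_segment(percent_adopted: float) -> str:
    """Return the adopter group an adoption percentage reaches into."""
    return LABELS[bisect.bisect_left(BOUNDARIES, percent_adopted)]

# 40% quantitative papers (historical sample) reaches into the early
# majority; 80% (Language) reaches the late majority.
```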
However, as we saw above, this is a little too optimistic due to the effect of papers from Language Variation and Change. If that journal is excluded, the proportion of quantitative papers drops to 26 per cent, which, although still among the early majority, suggests a less widespread adoption. If we look more specifically at the intersection of corpus data and quantitative methods in the sample of historical and diachronic change articles, we see that 23 per cent are both corpus-based and quantitative (Table 1.1). However, if we again exclude Language Variation and Change, we see that the percentage drops to 14 (Table 1.4), which is in the early adopter range, according to the technology adoption model.

This position is corroborated by taking into account the actual quantitative techniques employed by the quantitative articles in our historical sample. Figure 1.4 shows the number of times different quantitative techniques were encountered.

Figure 1.4 The number of observations for various quantitative techniques in the selected studies, for LVC and other journals. Some studies employed more than one of the techniques. (Techniques shown: linear models, frequencies/percentages, null-hypothesis tests, trees, PCA.)

Multivariate techniques such as linear regression models (including Varbrul) are clearly the largest single group of techniques. However, Language Variation and Change is again intimately involved in the details. The majority of the uses of linear models is found in that journal, as is the majority of uses of null-hypothesis tests. The numbers in Figure 1.4 are small, but sufficient to give us the impression of Language Variation and Change as a methodological outlier among the quantitative papers in the sample.
Thus, we can conclude that, based on our sample, articles from the journals specializing in historical linguistics and language change that we considered (published in 2012) use quantitative methods to a lesser degree than a relevant comparison journal in general linguistics (Language). Furthermore, if we exclude Language Variation and Change, which is biased towards quantitative methods, we see that the percentage of quantitative papers and corpus-based papers drops even further. If we consider historical papers that use both quantitative methods and corpora (excluding Language Variation and Change), the percentage is low enough to be compared to the early adopter section of the curve from the technology adoption model presented in section 1.4. Having described the state of adoption of quantitative corpus methods in historical linguistics, we are ready to carve out a niche for this technology in historical linguistics, and we will do this in Chapter 2.

Foundations of the framework

2.1 A new framework

In this chapter we outline the foundations of the new methodological framework we propose. This framework is not meant to replace all existing ways of doing historical linguistics. Instead, we present a carefully scoped framework for doing certain parts of historical linguistics that we think would benefit from this approach. Other areas of historical linguistics might not require the kind of innovation we propose, or they might require innovations of a different kind. However, we strongly believe that the approach outlined in this book is the right choice for what we define in the scope of quantitative historical linguistics. We think many, if not most, historical linguists would agree with us that corpora and frequencies are potentially very informative in answering questions in historical linguistics.
Our aim is to take this intuition one step further by proposing principles and guidelines for best practices, essentially an agreement as to what constitutes quantitative historical linguistics. The next section addresses the question of scope for the framework.

2.1.1 Scope

We submit that the principles of quantitative historical linguistics pertain to any branch or part of historical linguistics. These principles are not only meant as guides to carrying out quantitative research, but also establish a hierarchy of claims about evidence which also encompasses non-quantitative data. In this respect, quantitative historical linguistics is just as much a framework for evaluating research as for doing research. The basic assumptions and principles laid out in sections 2.1.2 and 2.2 establish a basis for evaluating and comparing research in historical linguistics, whether quantitative or qualitative. Our main focus is nevertheless the methodological implications of these assumptions and principles for how to do historical linguistics research. The principles and guidelines of quantitative historical linguistics can be applied within any conventional area of historical linguistics, such as phonology, morphology, syntax, and semantics. In line with Gries (2006b), we argue that corpora serve as the best source of quantitative evidence in linguistics, and by extension also in historical linguistics. This might at first glance seem to exclude e.g. historical phonology from quantitative historical linguistics; however, this is a practical consideration based on available corpus resources, not an inherent feature of quantitative historical linguistics.
In fact, historical corpora can be illuminating when it comes to questions of sound change, as demonstrated by recent studies using the Origins of New Zealand English corpus (Hay et al., 2015; Hay and Foulkes, 2016). Nevertheless, we stress that in some areas, such as phonology, quantitative historical linguistics is to a large extent complementary to traditional historical comparative linguistics (see section 2.1.2). In the following chapters we give examples and case studies from morphology, syntax, and semantics, with some discussion of phonology.

Our focus in this book is predominantly on corpus linguistics, since corpora constitute the best source of quantitative evidence. However, quantitative does not automatically entail corpus-based. For instance, historical phylogenetic modelling attempts to establish relationships, classification, and chronology of languages based on historical data by probabilistic means (Forster and Renfrew, 2006; Campbell, 2013, 473–4). Phylogenetic models may employ typological traits, as in Dunn et al. (2005), lexical data, as in Atkinson and Gray (2006), or corpus data (Pagel et al., 2007). Quantitative historical linguistics is deliberately agnostic regarding the use of specific statistical techniques, since such techniques must reflect the specific research question. The caveat here is that the choice of statistical technique should reflect best practices in applied statistics and be sufficiently advanced to tackle the full complexity of the research problems (see section 2.2.12). Thus, although we consider corpora the preferred and recommended source of quantitative evidence, quantitative historical linguistics does not necessarily equate to corpus linguistics. In addition to the source of data, quantitative historical linguistics relies on a number of other principles and basic assumptions, which we turn to next.
2.1.2 Basic assumptions

The scope of our framework builds on a number of premises and relies on different levels of analysis. As in other historical disciplines, different skills are needed for different stages of the problem-solving process. Historians must judge sources in light of the physical documents, their literary genre, and the context of the source, which might require very different sets of skills, as discussed in chapter 2 of Carrier (2012). Similarly, quantitative historical linguistics must make a number of assumptions, some of which rely on other scholarly disciplines. Thus, the approach is not all-inclusive, but rests on and interacts with other pursuits of knowledge, by means of the following assumptions. We are indebted to Carrier (2012) for inspiration, but have reworked the material to match the case of historical linguistics.

The historical linguistic reality is lost
Whether we study the history of particular languages, the relationship between languages and language families over time, or how language change proceeds in general, we all face the same inescapable problem: whatever reality we wish to describe, understand, or approximate is irrecoverably lost. It cannot be directly accessed and hence we can only study it indirectly. Because of this inaccessibility, our models of the past historical linguistic reality will always be imperfect. However, they may still be useful. A key question for the present book is to show what we think constitutes a useful model, and in which circumstances it is useful.

Philological and text-critical research is fundamental
No corpus is better than the quality of what goes into it. Consequently, sound groundwork in terms of philological, paleographical, and text-critical research must be assumed. Put differently, the proposed approach cannot replace these pursuits of knowledge.
Instead, it complements them and relies on them to critically study the physical manuscripts and the philological and stemmatological context of the text contained in them. Based on such research, critical editions can be created, and these critical editions can subsequently form the basis for corpora.

Grammars and dictionaries are indispensable
Another level in the research process is the creation of grammars and dictionaries that make it possible to annotate historical corpora. Of course, such research is not only a means to create corpora, but it illustrates the degree to which quantitative historical linguistics rests on other approaches to historical linguistics. We would again like to emphasize that the present approach is in many respects complementary to existing approaches, although, as we explain in Chapter 5, it is desirable to create corpus-driven dictionaries. Reaching back to the extended notion of technology that we introduced in section 1.4, we see no reason to replace existing approaches where they work well. As the levels of analysis outlined here illustrate, several approaches can and must coexist.

Qualitative models
We agree with Gries (2006b) that corpora provide one type of evidence only: quantitative evidence. It follows from this that quantitative claims or hypotheses are best addressed by corpus evidence. However, not all hypotheses are quantitative, as we illustrate in section 2.4.3. Qualitative approaches in historical linguistics have more than proved their worth in establishing genealogical relationships between languages, especially through the study of regular sound correspondences. Although such qualitative correspondences might be a simplification (i.e. imperfect models), they might nevertheless be useful and successful. Similarly, the simplifications and generalizations involved in establishing paradigmatic grammatical patterns might be useful without being a one-to-one correspondence to the lost historical linguistic reality.
Where we do see the limits of qualitative approaches is in distributional claims, especially as they relate to claims or hypotheses about syntagmatic patterns. In the following sections we will elaborate the terminology and basic tenets of our framework.

2.1.3 Definitions

In the present section we define the core terminology based on which we will formulate the principles of our framework.

Evidence
By evidence we mean facts or properties that can be observed, independently accessed, or verified by other researchers. Such facts can be pre-theoretical or based on some hypotheses. A pre-theoretical fact could be the observation that in English the word the is among the most frequent ones, alongside words such as you. We can observe facts in light of a hypothesis by assuming grammatical classes of articles and pronouns that group words together. Based on this hypothesis we can gather facts that constitute evidence that the classes article and pronoun are among the most frequent ones in English. It follows from this definition that empirical evidence is a pleonasm, since all evidence conforming to it must be empirical.

The definition above explicitly excludes the intuitions of the researcher as evidence in historical linguistics. Such intuitions are problematic as evidence even for languages where native speakers can judge them; for extinct languages and language varieties we consider such intuitions inadmissible as evidence. This position does not imply that intuitions are without value. For instance, intuitions are undoubtedly valuable in formulating research questions and hypotheses, and when collecting and evaluating data, as we stress in section 2.4.3. Thus, we think intuitions can and should play a role in the research process, but we do not consider them as evidence. We can distinguish between different types of evidence, namely quantitative evidence and distributional evidence.
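The frequency fact about the can be made precise and independently verifiable, which is what the quantitative notion of evidence requires. A toy sketch (the sample sentence is invented; a real study would count over a corpus):

```python
from collections import Counter

# Tiny made-up sample; a real study would tokenize a corpus.
text = "the cat saw the dog and the dog saw you"
freq = Counter(text.split())

# Instead of the underspecified claim that 'the' is 'frequent', we can
# report an exact, reproducible count: 3 occurrences in 10 tokens.
print(freq.most_common(3))
```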
Quantitative evidence is based on numerical or probabilistic observation or inference. The quantification must be precise enough to be independently verifiable. As a consequence, quantifying the observations by means of e.g. the words many or few will not suffice, since these terms are underspecified. Distributional evidence, in the classic linguistics sense, is empirical in that it can be independently verified that certain linguistic units (be they phonemes, morphemes, or other units) do or do not (tend to) occur in certain contexts. To the extent that such distributional patterns can be reduced to hard, binary rules (e.g. x does/does not occur in context y), distributional evidence is qualitative. However, we also keep the option open that such distributional evidence may be recast in probabilistic terms.

Finally, we need to consider criteria for strong and weak evidence, since independent verifiability is a necessary but not sufficient criterion for evidence. We can establish the following hierarchy of evidence:

(i) More is better: a larger sample will yield better evidence than a small one, other things being equal.
(ii) Clean is better than noisy: clean, accurate, and well-curated data will yield better evidence than noisy data, i.e. data with (more) errors.
(iii) Direct evidence is better than evidence by proxy: it is better to measure or observe directly what is being studied, rather than through some proxy or stand-in for the object of study.
(iv) Evidence that rests on fewer assumptions (be they linguistic, philological, or mathematical) is preferable, other things being equal.

It is obvious from the list above that some of the statements in the hierarchy will conflict. For instance, the ‘more is better’ requirement (i) will almost always conflict to some degree with the requirement for precise, well-curated data (ii).
This implies that the end result will always be some kind of compromise, which entails that perfect, incontrovertible evidence is a goal that can be approximated, but never fully reached. We believe this is an important consideration, since no numerical method can salvage bad data. Instead, the realization that all data sets are imperfect to some degree breeds humility and ushers in the need to explicitly argue for the strength of the evidence, independently of the strength of the claims being made on the evidence. The next section deals with claims.

Claim
We follow Carrier (2012) in considering anything that is not evidence a claim. A claim can be small or large in its scope, and it may rest directly on evidence, or it may rest on other claims. A claim must always rest on evidence, directly or indirectly, to be valid. The following are examples of different types of scientific claims of variable complexity and scope:

• Classification: x is an instance of class y.
• Hypothesis: we assume x to be responsible for an observed change y.
• Model interpretation: based on the model w, x is related to y by mechanism z.
• Conclusion: we conclude that x was responsible for bringing about z.

All claims are subject to a number of constraints discussed further in section 2.2. However, we want to stress the distinction between evidence and claims, as it is fundamental to the subsequent principles. In particular, we consider linguistic frameworks (sometimes called ‘linguistic theories’) to be series of claims which cannot be admitted as evidence for other claims. It also implies that such frameworks are subject to the same standards of evaluation as other claims (see section 2.2).

Truth and probability
Following chapter 2 of Carrier (2012) we consider a claim, be it a classification or a hypothesis, to be a question of truth. However, the truth value of such a claim, e.g. x belongs to class y, can be stated in categorical or probabilistic terms.
We choose to think of the truth value of claims about the past in probabilistic terms, since there is always a risk that we are mistaken, even in the most well-established claims. To be sure, the probability may be vanishingly small, but it may still exist. Furthermore, such probabilities about the truth value of claims can be interpreted in at least two ways:

(i) As facts about the world (i.e. physical probabilities).
(ii) As beliefs about the world (i.e. epistemic probabilities).

Carrier (2012) makes the distinction above in the context of an explicitly Bayesian statistical framework. Bayesian statistics is a branch of statistics that considers probabilities as subjective and as degrees of belief (Bod, 2003, 12). So-called frequentist statistics will tend to conceptualize probabilities as long-term relative frequencies of events in the world. The distinction is discussed in more depth in Hájek (2012). For our purposes it is sufficient to say that when we talk about the probability of some claim being true, we are talking about the epistemic probability, i.e. how likely we are to be correct when we claim that x belongs to class y. Although the difference between (i) and (ii) is sometimes overstated, there is a real difference between claiming that ‘8 out of 10 times in the past verb x belonged to conjugation class y’, versus claiming that ‘if we assign verb x to conjugation class y, the probability that we are making the correct classification is 0.8’. The latter statement is explicitly made contingent on our knowledge and our argumentation in a manner that is different from and better than the former case.

Historical corpus
In this book, we are concerned with historical corpora and define them as any set of machine-readable texts collected systematically from earlier stages of extant languages or from extinct languages.
We follow Gries (2009b, 7–9) in defining a corpus as a collection of texts with some or all of these characteristics:

(i) Machine-readable: the corpus is stored electronically and can be searched by computer programs.
(ii) Natural language: the corpus consists of authentic instances of language use for communicative purposes, not texts created for making a corpus.
(iii) Representative: representativity is taken to refer to the language variety being investigated.
(iv) Balanced: the corpus should ideally reflect the physical probabilities of use within some language variety.

These characteristics are ideals, even for corpora based on extant languages. To create a balanced and representative corpus of extinct language varieties is in most cases not a realistic aim. Therefore we do not take these to be necessary and sufficient features for what constitutes a (historical) corpus. In fact, we agree with Gries and Newman (2014), who consider the notion of a ‘corpus’ to be a prototype-based category, with some corpora being more prototypical than others. However, the definition above is clearly also too broad, since it extends to other types of text collections that are not normally considered corpora in the strict sense (Gries, 2009b, 9). For instance, a text archive containing the writings of a single author would fulfil criterion (i) and criterion (ii), but could not lay claim to representativity beyond the author in question. Gries (2009b, 9) argues that in practice the distinction between corpora and text archives can be diffuse, and for our purposes we take the representativity criterion in (iii) to be sufficient to rule out many text archives from the definition. A more pressing exclusion is perhaps that of collections of examples. As Gries (2009b, 9) points out, any such collection is prone to errors and omissions, and it is doubtful what it can be taken to be representative of.
For this reason, we follow Gries (2009b) in excluding example collections from the definition of what constitutes a corpus. This exclusion also applies to collections based on examples from historical corpora or quotations, since such text fragments are by definition hand-picked and presented outside the communicative context they can be said to be representative of. Thus, our definition of a corpus only includes machine-readable, natural, representative (within limits) texts that have been systematically sampled for the purpose of the corpus. This is not to say that example collections cannot be useful, but we exclude them for the purpose of terminological clarity. Finally, we exclude word lists (or sememe lists of cognates based on the Swadesh lists, see section 3.3), since they fall short of the requirement that texts be collected for natural, communicative purposes. The notion of ‘historical corpus’ can also be problematic, since it is not clear exactly how historical a corpus needs to be in order to count as ‘historical’. We are inclined to take a pragmatic approach to this question and consider as historical, in the broad sense, any corpus that either covers an extinct language (or language variety), or that covers a sufficient time span of an extant language variety that it can be used diachronically, i.e. to detect trends (see also section 2.2). We would also stress that annotation of corpora and analysis of data are two separate and independent steps of the research process. The annotation step could for instance involve enrichment from other linked external resources, not necessarily corpora. The relationship between data, corpora, and annotation is discussed further in Chapter 4.

Linguistic annotation scheme
By linguistic annotation scheme we mean the set of guidelines that instruct annotators on how to annotate linguistic phenomena occurring in a corpus according to a specific format.
Such schemes rely on certain theoretical assumptions and usually contain a set of categories (tags) that are to be applied to the corpus text. An example of a linguistic annotation scheme is the set of guidelines for the annotation of the Latin Dependency Treebank and the Index Thomisticus Treebank (Bamman et al., 2008). Section 4.3.3 gives a full description of annotation schemes. In our framework we do not impose constraints on the particular schemes to be used, as long as they are explicit and allow the annotators to interpret the text consistently and map it to the predefined categories.

Hypothesis
By hypothesis we mean a claim that can be tested empirically, i.e. through statistical hypothesis testing on corpus data. Hypotheses can come from previous research, logical arguments, or intuition, and, as long as they can be tested empirically, they have a place in our framework. An example of a hypothesis is the statement ‘there is a statistically significant difference in the relative distribution of the -(e)th and -(e)s endings of early modern English verbs by gender of the speaker in corpus X’. This formulation is a technical one, and there is usually some work involved in going from an under-specified hypothesis, such as ‘the verbal endings -(e)th and -(e)s in early modern English vary by gender’, to an operationalized one as in the example above. For a fuller explanation of hypothesis testing and concrete examples, see section 6.3.4. Generating hypotheses is one of the main steps in the research process, and it helps focus the efforts in the analysis. Instead of considering all possible variables that might remotely affect the phenomenon under study (the so-called strategy of ‘boiling the ocean’), we can concentrate our attention on those factors that are promising, based on what we know of the phenomenon.
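An operationalized hypothesis of this kind can be tested directly. As a minimal sketch, the code below runs Pearson's chi-square test on a 2×2 contingency table in plain Python; the verb-ending counts are invented for illustration and do not come from any real corpus.

```python
# Illustrative only: the counts below are hypothetical, not from a real corpus.
# Operationalized hypothesis: the distribution of -(e)th vs. -(e)s endings
# differs by speaker gender in corpus X.

def chi_square_2x2(table):
    """Pearson's chi-square statistic for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: speaker gender; columns: counts of -(e)th and -(e)s (invented).
table = [[40, 60],   # female speakers
         [70, 30]]   # male speakers

stat = chi_square_2x2(table)
# Critical value for df = 1 at alpha = 0.05 is 3.841.
print(f"chi-square = {stat:.2f}, significant: {stat > 3.841}")
```

With these made-up counts the statistic (about 18.2) exceeds the 3.841 critical value for one degree of freedom, so the null hypothesis of no association would be rejected; with real corpus data one would also heed the multivariate caveats discussed later in this chapter.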
If the hypothesis is generated from data exploration, it can be defined as data-driven, although the process of exploring the data will itself have relied on some theoretical assumptions, as we explain in section 2.4.3.

Model
As we explained in section 1.2.2, by ‘model’ we mean a representation of a linguistic phenomenon, be it statistical or symbolic. Not all models, however, are allowed in our framework: only those that derive from hypotheses tested quantitatively against corpus data or from statistical analysis of corpus data. An example of such a model is given in Jenset (2013), where the use of the morpheme there in Early English is modelled as a function of the presence of the verb be followed by an NP, and the complexity of the sentence. In section 7.3.3 we provide a full description of a model for historical English verb morphology.

Trend
We define a trend as a directional change in the probability of some linguistic phenomenon x over time that is detectable and verifiable by means of statistical procedures (Andersen, 1999). In other words, a trend cannot be established by impressions or intuitions. Furthermore, it can only be counted as a trend if reliable and appropriate statistical evidence can be presented to back it up. By ‘trend’, we mean the combination of innovation and spread of a linguistic phenomenon. For a linguistic change to happen, a speaker (or a group of speakers) needs to create a new form (innovation), and for this to be more than a nonce formation, the use of the new form needs to spread and be adopted more broadly. For example, the use of ‘like’ to introduce quoted speech must have been an innovation at first, and was then adopted by a broader set of speakers until it became established in current spoken English. We believe that linguistic innovation is best dealt with probabilistically, although this does not mean that our framework is incompatible with categorical views of language innovation.
When a new linguistic form is used for the first time (or, in the terminology of Andersen (1999), when it is ‘actualized’ in a speaker’s usage), it will differ from the old form in some respect, for example in a semantic sense or in a morphological realization. According to a categorical view of language, this difference will be displayed as an opposition between the ‘old’ and the ‘new’ category; for example, the English affix -ism may be used as a noun as well (see ‘ism’ in the sentence We will talk about the isms of the 20th century). The innovative usage consists of the nominal use, and the opposition is between the two part-of-speech categories. According to a non-categorical view of language, the innovative form could instead be characterized by a ‘fuzzy’ nature, describable as more noun-like than affix-like. We argue that both the categorical view and the non-categorical view are compatible with a probabilistic modelling of linguistic innovation. In the categorical view, we can describe the innovative use of ‘ism’ in terms of a low probability of the affix category and a higher probability of the noun category. In the non-categorical view, we can describe this innovation as change along a continuum, so that, for example, the innovative form ‘ism’ is found in contexts more similar to those of ‘theory’ (e.g. following a determiner) than those of ‘-ian’ (e.g. as a morpheme following ‘Darwin’). The spread of new linguistic behaviours among speakers through genres, linguistic environments, and social contexts, on the other hand, is a time-dependent phenomenon. The innovative form and the old form will coexist for a period of time, thus realizing synchronic variation, and there will be a more or less rapid adoption of the new form by the language communities.
This can be described as a shift in probabilities, and it is clear that language spread should be dealt with in probabilistic terms. Quantitative multivariate analysis of corpus data allows us to measure the evidence for the spread of a linguistic phenomenon, and the effect of different variables on it. This way, it is in principle possible to model the way an innovation is increasingly used by a community; in section 7.3 we will provide a concrete example of this in a study on English verb morphology.

2.2 Principles

Figure 2.1 shows the diagram of the research process in our framework, and is based on the entities defined in section 2.1.3 and the principles illustrated in that section. As shown in Figure 2.1, the aim of quantitative historical linguistics is to arrive at models of language that are quantitatively driven from evidence. Such a definition of ‘model’ includes statistical models and their linguistic interpretation. Section 7.2 will outline the steps of this process in a linear way, and we will describe these steps in more detail throughout this book. In the present section we describe the basic principles of quantitative historical linguistics, which are valid within the scope defined above. The principles are inspired by, and in some cases adapted from, chapter 2 of Carrier (2012), a work advocating the use of statistical methods in history.

[Figure 2.1 here: a diagram connecting historical linguistic reality, primary sources (documents etc.), secondary sources (grammars, dictionaries, etc.), intuition, examples, hypotheses, linguistic annotation schemes*, annotated corpora*, quantitative distributional evidence*, and models*.]

Figure 2.1 Main elements of our framework for quantitative historical linguistics. Boxes are entities, arrows are actions or processes; asterisks mark terms for which we use our definitions (see section 2.1.3). The dashed line from models to the (lost) historical linguistic reality implies an approximation.

However much history and historical linguistics have in common, the differences are nevertheless sufficiently great to warrant a reframing of the issues to fit into the context of historical linguistics. The adoption of those principles allows for improved communication between scholars regarding claims and evidence, which in turn will make it easier to resolve contentious claims by means of empirical evidence. However, such a resolution is only possible to the extent that historical linguists agree with and adhere to the principles presented below. For this reason, the first principle deals with the question of consensus in the historical linguistics community.

2.2.1 Principle 1: Consensus

To achieve the aim of quantitative historical linguistics research, it is necessary to reach consensus among those scholars who accept the premises of quantitative historical linguistics.

The basic premise for all the following principles is that the aim, indeed the duty, of historical linguists is to seek consensus. However, consensus is only valuable to the extent that it reflects an empirical evidence base. We therefore limit the consensus to those scholars who accept the basic premises of empirical argumentation, as grounded in the concepts of evidence and claims (section 2.1.3). Since we consider these principles fundamental to empirical reasoning about historical linguistics, no consensus would be possible without them, even in theory. This means that any effort to create consensus without a common ground of fundamental principles is probably going to be futile. The requirement of seeking consensus might seem overly optimistic, and even a negative constraint on the development of the field.
However, all serious scholars already abide by the consensus principle to some extent, by submitting their research articles to scientific peer review. This particular type of consensus does not necessarily extend beyond the peer reviewers, the editors, and the scope of the journals, but the principle remains the same: all research is ultimately an attempt to influence others by making claims grounded in some form of evidence. Without the requirement to seek consensus, any claim could in principle be made and defended by resorting to some private standard of evidence and argumentation. In contrast, the consensus requirement provides an impetus to follow the principles of quantitative historical linguistics as closely as possible, since this will help to persuade other scholars of the validity of the claims being made. However, the principle cannot be understood as an injunction to achieve consensus, only to seek it, since consensus by definition must involve more than one researcher. A hypothetical objection to the principle might be that it constrains creativity and the development of the field. However, we view the matter differently. We agree with the argument made in chapter 2 of Carrier (2012) that when we have no direct access to historical realities, our best approximation must be the consensus among the experts in the field, in this case historical linguists. Naturally, experts may be mistaken, but on the whole we must assume that their beliefs and claims are accurate, given the current state of knowledge in the field. This final refinement of the point is crucial, since the consensus by definition must rest on what has been discovered and argued up until the present. Hence, new claims will always be in a position to challenge the consensus. But to challenge the consensus is to seek its amendment.
When facing a new, possibly controversial, claim that goes against the current consensus, the experts in the field must evaluate the claim according to the empirical principles. If the claim is solid enough, the consensus will be amended. Similarly, a claim might have gaps that require fixing before other historical linguists will accept it. After such modifications, the claim might be strong enough to alter the reigning consensus. We consider those claims that are too weak to persuade other experts in the field to be of no interest. If a creative, controversial claim cannot persuade those who are experts in the field, then it is questionable whether it can move the field forward. Thus, we do not consider a plurality of claims regarding historical linguistics to be an aim in itself, but only a means of providing suggestions for altering the current consensus.

2.2.2 Principle 2: Conclusions

All conclusions in quantitative historical linguistics must follow logically from shared assumptions and evidence available to the historical linguistics community.

Following the definition of evidence in section 2.1.3, a piece of research is empirical if it relies on empirical evidence that is observable and verifiable independently of the researcher and her attitudes and beliefs. Above all, intuitions (even those stemming from long, in-depth study of the material under scrutiny) are inadmissible as evidence. This principle is supported by the previous one: beliefs and intuitions are not independently verifiable, hence they do not form a good basis on which to build consensus. This is not to say that intuitions do not belong in historical linguistics research; quite the contrary. Intuitions can be a very valuable starting point for insightful research.
However, intuitions can never be more than a starting point, or guidance, for creating hypotheses and deriving empirically testable predictions from them.

2.2.3 Principle 3: Almost any claim is possible

Every claim has a non-zero probability of being true, unless it is logically or physically impossible.

We consider this insight from Carrier (2012) to be a key principle when evaluating claims regarding historical linguistics. Carrier (2012) points out that almost any claim about the past has some probability of truth to it, with the exception of claims that are logically impossible (such as ‘Julius Caesar was murdered on the Ides of March and lived happily ever after’) or physically impossible (such as ‘Julius Caesar moved his army from Gaul into Italy on a route that took them via the Moon’). We consider this statement equally applicable to historical linguistics as to history. Another way of phrasing the principle is that identifying sufficient conditions is not enough to establish a strong claim. A very similar point is made by Beavers and Sells (2014), who argue that since linguistic data can support many conclusions, it is not enough to find data that support the claims we wish to make. It is also necessary to consider all the other claims those same data might support, that is, to ask what evidence there is against our chosen interpretation of the data. The take-home message in both cases is that the set of all possible claims (i.e. physically and logically possible) contains both profitable and misleading claims, and both types can be supported by historical linguistic data, albeit to different degrees. It follows from this principle that a claim that ‘fits the data’ in historical linguistics is near worthless unless it is further substantiated. Such a claim could be a very strong one, or it might have an associated probability so small that it would be indistinguishable from zero for all practical purposes.
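The contrast between merely ‘fitting the data’ and being probable can be made concrete with Bayes’ theorem, in line with the epistemic reading of probabilities introduced above. In the sketch below all numbers are invented for illustration: two competing claims both assign the observed data a non-zero likelihood, i.e. both ‘fit’, yet their epistemic probabilities after seeing the data differ sharply once priors and likelihoods are spelled out.

```python
# Hypothetical numbers for illustration: two competing claims that both
# "fit the data", i.e. both assign the observation a non-zero likelihood.

def posterior(priors, likelihoods):
    """Posterior probabilities via Bayes' theorem for competing claims."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

priors = [0.9, 0.1]        # prior degrees of belief in claim A and claim B
likelihoods = [0.5, 0.6]   # P(observed data | claim), both non-zero

post_a, post_b = posterior(priors, likelihoods)
print(f"P(A|data) = {post_a:.3f}, P(B|data) = {post_b:.3f}")
```

Claim B here even fits the data slightly better than claim A, yet it remains improbable (posterior of roughly 0.12) because the rest of our knowledge weighs against it; fitting the data alone settles nothing.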
The subsequent section discusses the problem of ranking claims that all have a non-zero probability of being true.

2.2.4 Principle 4: Some claims are stronger than others

There is a hierarchy of claims from weakest to strongest.

It follows from principle 3 that all possible claims in historical linguistics have some probability of being true, ranging from completely implausible to extremely well attested and likely. In other words, there exists a hierarchy of claims in which some claims stand above others. For instance, the claim by Emonds and Faarlund (2014) that Old English simply died out and was replaced by a variant of Norse (making modern English genealogically North Germanic) has very little support in the data and is hence an extremely weak claim, as demonstrated by Bech and Walkden (2016). Since the claim that Middle English evolved from Old English (albeit from dialects other than the dominant West Saxon variety of Old English) is based on much stronger evidence, it takes precedence over the replacement argument. In essence, not all claims are created equal, and even if some kind of historical linguistic data can be made to fit a claim, this is in itself unsurprising and constitutes insufficient grounds for accepting that claim. The key question then becomes what distinguishes a weak claim from a strong one. The following principle digs further into the problem of how to rank claims.

2.2.5 Principle 5: Strong claims require strong evidence

The strength of any claim is always proportional to the strength of the evidence supporting it.

Section 2.1.3 dealt with how we can judge the strength of evidence. Here we spell out the relationship this has to claims and their strength. Carrier (2012) argues, correctly in our view, that evidence based only on a small number of examples is very weak.
Furthermore, when a claim is a generalization, its supporting evidence must consist of more than one example. That is, the evidence for any generalization that goes beyond the observed piece of data must consist of more than one observation. Such arguments follow from the principle that the strength of a claim is proportional to the evidence backing it up. Since no claim is stronger than the evidence supporting it, the nature of the supporting evidence is key. Other things being equal, more evidence implies stronger support for a claim, as we stated in section 2.1.3. However, the principle is not only about finding strong evidence. The opposite also applies: if your evidence is weak, your claims ought to reflect this fact. In some cases a weak claim is all that a body of evidence can support. In this situation, we feel that the adage ‘better an approximate answer to an exact question, than an exact answer to an approximate question’ applies. That is, if the combination of a research question and some data only allows a weak or tentative conclusion, then this should be explicitly acknowledged, without attempts to overstate the results. In historical linguistics this means that in some cases certain generalizations might be impossible. As the statistician John Tukey phrased it, ‘the combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data’ (Tukey, 1977). This applies to historical linguistics as much as to statistics. The example from section 2.2.4 about the typological status of English within the Germanic language family is also relevant here. Since the evidence provided by Emonds and Faarlund (2014) is narrowly focused on one area (syntax) and is also very sparse, the evidence is clearly not proportional to the claims being made, as demonstrated by Bech and Walkden (2016).
2.2.6 Principle 6: Possibly does not entail probably

The inference from ‘possibly’ to ‘probably’ is not logically valid.

In section 2.2.3 we argued that merely fitting the data is not sufficient for accepting a claim. A special case that deserves its own principle is the logical fallacy that Carrier (2012) describes as ‘possibly, therefore probably’. The principle can be made clearer when recast in terms of probabilities, where the notation P(x) means ‘probability of x’ for some claim x:

• If P(x) > 0, x is possible.
• If P(x) is close to 1, x is probable.
• If P(x) = 0.01, x is possible but not probable.
• If P(x) = 0.99, x is both possible and probable.

Put differently, all probable claims are possible, but not all possible claims are probable. The example-based approach described in section 1.3.1 should only be associated with claims about events being possible or not; in order to state anything about their probability, quantitative data and systematic analysis are required. We turn again to the claim that Old English died out and that Middle English descended from Norse. We certainly agree with Emonds and Faarlund (2014) that this is a possible scenario. The process of a language falling out of use and being substituted by another, possibly with some substrate influence from the language falling out of use, is clearly possible. However, since all logically and physically possible claims have a non-zero probability of being true, it is trivial to state that Old English might have died out and been replaced by a variant of Norse. The possible does not automatically entail the probable, because probable claims are only a subset of all possible claims. Thus the argument ‘this might have been the case, therefore it was probably the case’ is logically invalid without further supporting evidence. It also follows from section 2.2.3 that the set of possible claims is extremely large, since it is constrained only by the physically and logically impossible.
This in turn raises the question: in the absence of stronger corroborating evidence, why privilege one particular possible claim out of the much larger set of other possible claims? To present a possible claim as probable without sufficient evidence, whether through arbitrariness or sheer wishful thinking on the part of the researcher, does not support that claim. In particular, such an inference cannot adequately support a conclusion, as discussed in the next section.

2.2.7 Principle 7: The weakest link

A conclusion is only as strong as the weakest premise it builds on.

This principle entails that any conclusion will be evaluated by its weakest point, not its strongest. This may sound counter-intuitive, because surely we want the strongest evidence to inform our claims. The reason can be traced back to the principle that any claim that is physically or logically possible has a non-zero probability of being true (section 2.2.3). The great number of possible interpretations of evidence from the linguistic past thus enables us to find individual strong arguments in favour of a conclusion. However, the conclusion might nevertheless be undermined by a number of weak premises.

2.2.8 Principle 8: Spell out quantities

Implicitly quantitative claims are still quantitative and require quantitative evidence.

One of the key aims of the principles outlined here is to enable a fair evaluation of claims about historical linguistics in terms of quantities and frequencies. However, such an evaluation is only possible when the quantification is spelled out. Terms such as those in the following list are ambiguous and should be avoided when presenting evidence in historical corpus linguistics: few, little, rare, scarce, uncommon, infrequent, some, common, frequent, normal, recurrent, numerous, many, much.
The list is obviously not exhaustive, but it illustrates words that represent quantities and frequencies in a subjective and coarse-grained manner. They are subjective because what counts as few or many depends on the circumstances and on the person doing the evaluation. They are coarse-grained because it is difficult to compare the quantities they designate. Is an ‘uncommon’ phenomenon as scarce as one that is ‘infrequent’? Or is it perhaps more common? Or less? Such quantification is hard to evaluate and verify independently, and hence violates the fundamental requirement that the evidence for a claim must be objectively accessible to all researchers in the field. This is not to say that such words cannot be used, but they render an argument less powerful by making it opaque.

2.2.9 Principle 9: Trends should be modelled probabilistically

Quantitative historical linguistics can rely on different types of evidence, but only quantitative evidence can serve as evidence for trends.

In section 2.1.3, we defined trends in explicitly probabilistic terms. The approach defined here is deliberately agnostic about whether language is inherently based on probabilities, or on categorical rules, or on some combination of the two. However, a trend should be modelled as a probabilistic, quantitative entity, since it denotes a directed shift in variation over time. Sample sizes may vary at different points along a timeline, which makes statistical tools the correct choice for identifying and evaluating trends. Linked to this point is the question of adequate statistical power. Thus, having three points connected by a straight line pointing upwards does not qualify as a trend unless this line can be shown to be both statistically significant and a good fit to the data.
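The point about three points on a line can be quantified. The sketch below, with invented data, fits an ordinary least-squares line to three observations and computes the t statistic for the slope; with n = 3 there is only one residual degree of freedom, so the two-tailed 5% critical value is a daunting 12.706, and even a clean-looking upward slope fails to reach significance.

```python
import math

# Invented data: three yearly frequencies that "point upwards".
xs = [1.0, 2.0, 3.0]          # time points
ys = [2.0, 3.5, 4.0]          # observed relative frequencies

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
intercept = my - slope * mx

# Residual variance and the standard error of the slope.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
se_slope = math.sqrt(sse / (n - 2) / sxx)
t_stat = slope / se_slope

# Two-tailed critical t value for df = n - 2 = 1 at alpha = 0.05 is 12.706.
print(f"slope = {slope:.2f}, t = {t_stat:.2f}, significant: {t_stat > 12.706}")
```

Here the fitted slope is a respectable 1.0 per time step, yet t ≈ 3.46 falls far short of 12.706: three points simply cannot carry the weight of a trend claim, which is exactly the statistical-power problem noted above.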
Like any claim in historical linguistics, a proposed trend is subject to the principle in section 2.2.3 that any claim has a greater than zero probability of being true, provided that the claim is not logically or physically impossible. Any claim about a possible trend is consequently liable to a number of errors: the trend might not be a trend at all, but merely random variation; the claimed trend might represent wishful thinking or biased attention on the part of the researcher (as pointed out by Kroeber and Chrétien 1937, 97); the data for the claimed trend might give the appearance of a trend owing to inadequate sampling procedures; and so on. In other words, the requirement that a trend be verified by statistical means is an insurance against overstating the case beyond what the data can back up.

2.2.10 Principle 10: Corpora are the prime source of quantitative evidence

Corpora are the optimal sources of quantitative evidence in quantitative historical linguistics.

Above we defined corpora as sources of quantitative data (section 2.1.2). We also defined quantitative variation (including variation implicitly stated by means of words like much or few) as subject to quantitative evidence (principles 8 and 9). However, we reserve a separate principle for the statement that quantitative evidence in historical linguistics should come from corpora. This is not to say that quantitative evidence cannot come from other sources; there are clearly other possible sources of quantitative evidence (see section 2.1.3). However, when available, corpora should always be the preferred source of quantitative evidence, for a number of reasons:

(i) Corpora (as defined in section 2.1.3) have a better claim to being representative than other text collections, other things being equal.
(ii) Publicly available corpora allow reproducibility, to the extent that they are available to the research community.
Thus, following principle 4 (section 2.2.4), we consider a claim based on quantitative evidence from corpora stronger than a claim that is not based on corpus evidence, as long as the two claims are equally capable of accounting for the relevant facts.

2.2.11 Principle 11: The crud factor

Language is multivariate and should be studied as such.

For the purpose of historical linguistic research, we consider language, and language use, to be an inherently multivariate object of study. Bayley (2002, 118) explains a similar ‘principle of multiple causes’ as the need to include multiple potentially explanatory factors in an analysis, since it is likely that a single factor can explain only some of the observed variation in the data. In other words, it is essential to be open to a potentially large number of explanatory variables for any linguistic phenomenon. This principle does not imply that multivariateness is inherent in language itself, only in language as an object of study. From this principle it follows that a large number of potential explanatory variables should be considered. This is consonant with principle 3 (section 2.2.3), since finding a single variable that is correlated with the phenomenon being studied is trivial. The real aim in quantitative historical linguistics is to find one or more variables that are more strongly correlated with the phenomenon being studied, compared to other potential variables. In essence, this guards against spuriously positive results, since we aim to build counter-arguments into the quantitative model. Doing so protects against what Meehl (1990, 123–7) calls the ‘crud factor’, or ‘soft correlation noise’: many factors involved in language will be correlated with each other at some level, and stacking them up against each other helps separate the wheat from the chaff.
2.2.12 Principle 12: Mind your stats

Quantitative analyses of language data must adhere to best practices in applied statistics.

From principle 11 it follows that statistical methods are required to distinguish the more important correlations from the less important ones. Bayley (2002, 118) describes this as the ‘principle of quantitative modeling’, which implies calculating likelihoods for linguistic forms given context features. This means that multivariate statistical methods, such as regression models or dimensionality reduction techniques, are typically required. For instance, a single multivariate regression model with all relevant variables is superior to a series of individual null-hypothesis tests, since the latter do not take the simultaneous effect of all the other variables into account and are vulnerable to false positive results through testing the same data several times over. Testing the same data over and over again with a null-hypothesis test such as Pearson’s chi-square is a little like having several attempts at hitting the bull’s eye in darts: more attempts make it more likely to get a statistically significant result, but the approach artificially inflates the strength of the claim. Furthermore, as Gelman and Loken (2014) make clear, null-hypothesis tests are often under-specified, a point also raised by Baayen (2008, 236), which means that in practice they can often be supported or refuted by the data in more than one way. Moreover, comparing null-hypothesis tests is conceptually difficult. Although the p-values may look comparable, they actually represent a series of alternative hypotheses, each of which has been compared against a null-hypothesis (Gelman and Loken, 2014). This is not to say that we proscribe the use of simple null-hypothesis tests in quantitative historical linguistics; we merely consider them to provide weaker evidence than multivariate techniques in those cases where a multivariate approach is possible and gainful.
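The darts analogy can be quantified. Under the simplifying assumption of k independent tests, each run at significance level α = 0.05 on data where the null hypothesis is in fact true, the probability of at least one spurious ‘significant’ result is 1 − (1 − α)^k:

```python
# The darts analogy made concrete: with k independent null-hypothesis tests
# at significance level alpha, the chance of at least one spurious
# "significant" result is 1 - (1 - alpha) ** k.

def family_wise_error_rate(k, alpha=0.05):
    """Probability of at least one false positive in k independent tests."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20):
    print(f"{k:2d} tests -> P(at least one false positive) = "
          f"{family_wise_error_rate(k):.3f}")
```

Five separate tests already carry roughly a 23% chance of a false positive, and twenty tests roughly 64%; a single multivariate model over the same data avoids this inflation.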
Similarly, the direct interpretation of raw counts, or what Stefanowitsch (2005) calls 'the raw frequency fallacy', constitutes the weakest form of quantitative evidence, since such numbers are void of context. Without a frame of reference, it is impossible to judge objectively (see the requirement that all evidence be accessible to all linguists) whether an integer is large or small. Also, the direct interpretation of proportions or percentages needs to be done with care. Proportions can be misleading since they can inflate small quantities unless accompanied by the actual number of observed instances. Furthermore, a proportion constitutes a point estimate, i.e. a single number that in reality comes with an error margin attached. When we perform a formalized null-hypothesis test and the test compares observed and expected frequencies, we account for such error margins. If we interpret proportion data directly, they should be accompanied by a confidence interval, as we exemplify in section 1.5.

2.3 Best practices and research infrastructure

In section 1.3.4 we highlighted some of the problems with common practices in the historical linguistics research process. In this section we will outline our proposed solutions, which are meant to accompany the principles outlined above and create the context for an infrastructure that facilitates and optimizes research in quantitative historical linguistics.

2.3.1 Divide and conquer: reproducible research

As we will see in more detail in Chapter 5, documentation and sharing of processes and data are at the core of our framework. Transparency in the research process facilitates reproducibility of the research results, as well as their generalization to other data sets, thus advancing the field itself.
Moreover, if the process is transparent, it is easier to credit all the people who participated in it, including those responsible for gathering and cleaning the data, and building language resources like corpora and lexicons, an aspect that is still undervalued in the historical linguistics community. Replicability is also aligned with principle 1 (section 2.2.11), which stresses the importance of consensus in quantitative historical linguistics. Transparency (and therefore replicability and reproducibility) is achieved by documenting the data and phases of the research process and by making them available. In addition to being transparent about the research methodology used, corpora, data sets, metadata, and computer code1 should be made publicly available whenever possible and appropriate. In the case of historical data, questions of privacy are rarely a problem, so compared to other fields of study historical linguistics is in a fortunate position in this respect. Once we have taken all steps to ensure transparent and reproducible results, and have made the data openly available, the research practice can move beyond the scope of an individual study to that of a larger, collaborative effort. Each study may still concentrate on just one aspect of the process (design of a resource or generalization of previous results, for example), while keeping a view to documenting and making the tools and data sets available to the community. Efforts in this direction have already had some success, for example in the case of the Perseus Digital Library2 and the

1 Generally speaking, using code/scriptable tools like Python and open formats like CSV instead of tools with graphical user interfaces and proprietary formats like Excel is essential for reproducibility.
2 http://www.perseus.tufts.edu/hopper/.
Open Greek and Latin Project.3 The Open Science Framework (https://osf.io/) offers a platform for managing research projects in an open way, facilitating reproducibility and data sharing; see the page https://osf.io/ek6pt/ for one such project dealing with Latin usage in the Croatian author Paulus Ritter Vitezović. We believe that such an approach will allow the field of historical linguistics to move forward in a less fragmented way than it has so far.

2.3.2 Language resource standards and collaboration

Since the time of the comparative philologists, historical linguists have often resorted to gathering their own data. Although this is sometimes warranted (and even the only available option), historical linguistics as a scientific endeavour would benefit from a greater reliance on reuse of existing resources, and on the creation of publicly available standardized corpora and resources whenever reuse is not an option. Electronic resources like lexical databases (WordNet, FrameNet, valency lexicons) provide valuable information complementary to corpora. Such resources are still not widely used in historical linguistics, partly for epistemological reasons and partly for technical reasons, as argued by Geeraerts (2006). Our framework provides historical linguists with the methodological scaffolding to incorporate computational resources into their research practice. As we will argue more extensively in Chapter 5, the design, creation, and maintenance of language resources should be a crucial component of the work of historical linguists, and in order to maximize their reuse and compatibility, language resources should be developed in the spirit of linked open data (Freitas and Curry, 2012), when possible. Reusing resources means that conclusions and results can more easily be replicated and tested by other researchers, which is a crucial point of our framework (see section 2.3.1).
Moreover, if a study on a specific linguistic phenomenon is carried out on a resource built in an ad hoc fashion, there is always the lingering doubt that the results were influenced by the choice of data. Conversely, if the results are obtained from a pre-existing resource or corpus, they are less likely to have been influenced by factors directly related to the research in question. A greater reliance on reuse also gives an impetus to creating corpora for less-resourced languages (McGillivray, 2013). For all its benefits, gathering and annotating data is costly in terms of time and resources. The labour-intensive tasks involved in creating language resources often require technical expertise which is not normally part of standard linguistics training; therefore, the development of language resources is an interdisciplinary team effort and is at the core of the collaborative approach to research that we propose. If a group creates a resource that is well documented, has a standard format, and is compatible and interoperable with other resources (for example a historical lexicon of named entities), this makes it possible for another group to build on this work and either enrich the resource itself, or use its data (alone or in combination with data from other sources) for further analyses. If such analyses are well documented, they will be more likely to be reproduced by others, who can check the validity of the results, generalize them further to other data sets, or add more components to the analysis. However, researchers do not currently have sufficient incentives to spend time on building a corpus or other language resources.

3 http://www.dh.uni-leipzig.de/wo/projects/open-greek-and-latin-project/.
We believe that the publication of such resources ought to carry substantial weight in terms of academic merit, as much as the publication of studies carried out on them.

2.3.3 Reproducibility in historical linguistics research

In sections 1.3.1 and 1.3.3 we considered the major weaknesses concerning certain research practices in historical linguistics. In this section we will broaden the perspective to cover the issues of transparency, replicability, and reproducibility and their impact on the field of historical linguistics in general. Section 1.3.1 dealt with the negative effects of the lack of transparency in the evidence sources employed in historical linguistics research. As a matter of fact, the issue of transparency concerns all phases of the research process, from data collection to annotation and analysis. Making all phases of the research process more transparent has a number of benefits. First, it makes it possible to replicate the research results obtained by a study in the context of other studies dealing with the same data, method, and research questions. This increases the chances of detecting omissions and correcting errors. Second, transparency forms the basis for generalizing the research results, thus advancing the field itself: this generalization can involve applying the same method to a different data set or extending the approach. For example, a researcher can test alternative approaches based on the data from a reproducible piece of research. Third, transparency ensures that the work involved in building a data set (for example a historical language resource) is visible, and therefore acknowledged and credited appropriately.
Considering the emphasis on publishing research articles that report on analyses of particular phenomena or the formulation of theories, this level of transparency on the data behind the analysis would encourage more researchers to dedicate their time to building language resources, which play an essential role in advancing the field. The issue of lack of transparency is, of course, not unique to (historical) linguistics, and has very negative consequences that in some fields like medicine span well beyond the academic community to directly impact people's lives.4 Although it does not affect human lives, this issue is current in linguistics research as well, as demonstrated, for example, by the recent special issue of the journal Language Resources and Evaluation dedicated to the replicability and reproducibility of language technology experiments. Transparency (and therefore replicability and reproducibility) is achieved in two main steps: by describing the data and phases of the research process, and by making such data and processes available, which we will discuss in more detail in the next sections.

Documentation

As we saw in section 1.3.1, research papers in historical linguistics dedicate a lot of space and attention to the theoretical framework(s) and the final results of the research, as well as to linguistic examples, either as illustration of the phenomenon studied or as the evidence base of the analysis. However, little attention is usually dedicated to the following aspects, in spite of the crucial role they play in the research process: how the data were collected, how the hypotheses were formulated and tested, which variables were measured (if any), how the analysis was carried out.

4 For an example of how current this issue is in medicine and psychology, see https://www.nature.com/news/first-results-from-psychology-s-largest-reproducibility-test-.
For example, Bentein (2012) presents the details of his data collection criteria in the footnotes, and describes the corpus used in four lines (Bentein, 2012, 175). As there are no agreed standards on how to build and annotate a corpus, how to carry out the analysis, and how to report the results, we argue that the following guidelines would significantly increase the level of transparency in historical linguistics research.

• Include references to the resources (including corpora) used, with exact locations and URL links.
• Specify the size of the corpus or linguistic sample(s) used.
• Describe how the corpus/sample was collected by detailing the inclusion/exclusion criteria.
• Detail the annotation schema used, even when the researcher performed the annotation as a by-product of the subsequent analysis.
• Add information about the analysis methods employed and their motivation, as well as the statistical techniques, programming languages, and software used (with version number).
• Give details of the different analyses performed (including the ones that did not lead to the desired results), to eliminate the risk of 'cherry-picking' results that conform to the researcher's expectations.
• Add all relevant information to allow the reader to interpret and reproduce the data visualizations.

Sharing and publishing research objects

Being transparent about the research methodology used is very important, but may not ensure full replicability of the results when the work is complex. Therefore, it is important that the corpora, data sets, metadata, and computer code on which the research was based are made publicly available whenever this is possible and appropriate from an ethical point of view. Evidence from a study on 516 biology articles published between 1991 and 2011 reported on in Vines et al.
(2014) has shown that informally stored data associated with published works disappear at a rate of around 17 per cent per year. Even though we do not have evidence of this kind for (historical) linguistics research, it would not be surprising if a similar pattern were found. Access to the data and the process behind a research work is essential, and should be ensured in a systematic way and through platforms that researchers can use. There are a number of repositories for language resources, including corpora, lexica, terminologies, and multimedia resources. One (non-free) catalogue of such resources is available through the European Language Resources Association (ELRA).5 Another example is CLARIN (Common Language Resources and Technology Infrastructure),6 a large repository of language resources. Examples of research data repositories which are not specific to linguistics but are widely used in the sciences are Figshare7 and Dryad.8 Figshare allows researchers to upload figures, data sets, media, papers, posters, and presentations; data deposited in Figshare receive a digital object identifier (DOI), which makes them citable. The most commonly used repository designed to track versions of computer code, attribute it to its authors, and share it is GitHub.9 Specific to the humanities, Humanities Commons10 is a platform for sharing data and work in progress and constitutes a positive example of this sharing attitude. An interesting publishing model that is gaining popularity among the scientific community is concerned with so-called 'data journals'. Such peer-reviewed publications collect descriptions of data sets rather than traditional article publications reporting on theoretical considerations or the results of particular studies. Such citable 'data descriptors' or 'data papers' receive persistent identifiers and give publication credit to the authors.
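To see why such repositories matter, note how quickly the decay rate reported by Vines et al. (2014) compounds. The following back-of-the-envelope calculation treats the 17 per cent figure as a constant annual rate, which is our simplifying assumption:

```python
# Compounding the ~17% annual disappearance rate for informally stored data
# reported by Vines et al. (2014). The constant-rate assumption is ours.
rate = 0.17

def fraction_surviving(years, rate=rate):
    """Fraction of data sets still available after the given number of years."""
    return (1 - rate) ** years

for years in (1, 5, 10, 20):
    print(f"after {years:2d} years: {fraction_surviving(years):.1%} of data sets remain")
```

Under this assumption, after a decade only roughly 15 per cent of informally stored data sets would remain available.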
The methodological importance of such data publications consists in allowing other researchers to use the data described and benefit from them, and ensuring that scientists who collect and share data in a reusable fashion receive credit for that. Examples of open access data journals in the scientific domain are Scientific Data11 and Gigascience.12 One notable example in the humanities is the Research Data Journal for the Humanities and Social Sciences13 published by Brill in collaboration with Data Archiving and Networked Services.

5 http://catalog.elra.info/.
6 http://clarin.eu/.
7 http://figshare.com/.
8 http://datadryad.org/.
9 https://github.com/.
10 https://hcommons.org/.
11 http://www.nature.com/sdata/.
12 http://www.gigasciencejournal.com/.
13 http://dansdatajournal.nl/.

2.3.4 Historical linguistics and other disciplines

In spite of their clear connections, historical linguistics and historical disciplines like history and archaeology have largely followed separate paths (Faudree and Hansen, 2014). To take a concrete example, with the exception of corpora created for historical sociolinguistics, early historical corpora contained very limited metadata about the texts themselves, and focused primarily on annotating linguistic features. In section 5.2 we argue for a stronger interaction between historical linguistics and history, and make a case for a stronger connection between historical language resources and other resources (like collections of information on people or places). This strengthened link has the potential to enrich historical linguistics research by accounting for the sociohistorical context of the language data in a direct way. Linked data provides a valid solution to this question because it allows us to connect linguistic corpora with general resources on various aspects of the historical context of the texts.
This enables a more historically accurate investigation of the language and facilitates interdisciplinary efforts, which would benefit historical linguistics research. In Chapter 4, we also make a case for cooperation between historical linguistics and digital humanities. In particular, the Text Encoding Initiative has established standards for annotating a range of information on texts and their contexts. This type of annotation would make the traditional corpus annotation more exhaustive and therefore allow corpus analyses to consider a wider range of properties of texts and their context; this, in turn, would make the linguistic results more comprehensive.

2.4 Data-driven historical linguistics

In section 1.3 we stressed the importance of using corpora as the evidence base for research in historical linguistics. We dedicate this section to defining 'corpus-driven' and 'data-driven' in the context of the methodological framework we propose, and to explaining how this approach interacts with linguistic theory.

2.4.1 Corpus-based, corpus-driven, and data-driven approaches

Once we have established the necessity of using corpus data as evidence sources for the historical linguistics investigation, we need to clarify how this evidence (which by definition relates to individual instances of language use, or parole in Saussurian terms) relates to more general statements about language as a system (or langue, to follow Saussurian terminology). Are corpus data going to support general claims about language, or will they determine them? Will the investigation start from corpus data, or from theoretical statements, or a combination of the two?
According to a terminology that is well established in corpus linguistics (Tognini-Bonelli, 2001, 65), corpus-based approaches involve starting from a research question and testing it against (annotated) corpus data, often by analysing selected examples. Theoretical hypotheses play a prominent role in this approach, and corpus data are used to support (or more rarely refute) them; therefore, we could categorize such approaches as 'confirmatory'. On the other hand, 'corpus-driven' approaches (Tognini-Bonelli, 2001, 85) rely on unannotated corpus data with the aim of revising existing assumptions about language that pre-dated electronic corpora; in fact, annotated corpora are seen as sources of skewed data because they reflect such pre-existing assumptions. In corpus-driven approaches the corpus evidence is the starting point of all analyses and needs to be reflected in the theoretical statements, which makes the primary focus of such approaches exploratory. The researcher draws generalizations from the observation of individual instances of unannotated corpus data to theoretical statements about the language system. In other words, corpus-driven approaches aim 'to derive linguistic categories systematically from the recurrent patterns and the frequency distributions that emerge from language in context' (Tognini-Bonelli, 2001, 87). Rayson (2008) proposes the use of the term 'data-driven' as a compromise between the 'corpus-based' and the 'corpus-driven' approaches contrasted above. His starting point is the automatic annotation of two corpora by part-of-speech and semantic fields; then, he conducts a quantitative analysis of the keywords extracted from the two corpora.
At this point, in his model the researcher's contribution consists in examining qualitatively 'concordance examples of the significant words, POS and semantic domains' (Rayson, 2008, 528) to formulate research questions. This way, the research questions arise from the qualitative analysis of quantitatively processed data from automatically annotated corpora, rather than from theoretical hypotheses, as in corpus-based approaches. In this book we will employ the term 'corpus-driven' in a sense that is different from the ones outlined above. We will accept the confirmatory view according to which corpus analyses can test hypotheses from previous theories (and we discuss the term 'theory' in section 2.4.3), but we also allow for exploratory views in which such hypotheses emerge directly from corpus data. Moreover, unlike in the traditional definition of 'corpus-driven', we consider annotated corpora as legitimate sources of evidence. Finally, we do not consider it acceptable to analyse selected examples from corpora in order to test theoretical statements, as done in large part in corpus-based research. In our definition, 'corpus-driven' will refer to those approaches whereby evidence from (annotated) corpus data is collected systematically, usually with automatic means. This evidence (whose size is typically relatively large) undergoes a systematic and exhaustive quantitative analysis. Such analysis aims at testing theoretical hypotheses (in confirmatory studies) or formulating new ones (in exploratory studies). 'Data-driven' refers to the same procedure as 'corpus-driven' defined above, but applies to other types of data in addition to corpus data, for example metadata on authors, genres, geography, or data from language resources and other resources.
Because historical corpora necessarily contain some form of metadata, any 'corpus-driven' methodology is also 'data-driven'. Doing linguistics research in this data-driven quantitative way accounts for the variability in language use and lends itself to a usage-based and probabilistic view of language, whereby frequency distributions are built into the language models (Penke and Rosenbach, 2007b, 20). However, as explained by de Marneffe and Potts (2014, 8), as we discussed in section 1.2.1, and as we will see in Chapter 6, corpus research is compatible with non-probabilistic approaches as well, because the statistical evidence collected from corpora may reflect the interaction of various discrete phenomena.

2.4.2 Data-driven approaches outside linguistics

The emphasis on data-driven approaches is common to a general movement affecting a range of disciplines, particularly in the sciences and in business. The terms 'data-intensive' (Hey et al., 2009) and 'data-centric' science (Leonelli, 2016) have acquired specific senses in today's scientific context. They refer to approaches characterized by large-scale networks of scientists, a focus on open data sharing and data publishing, and a drive towards large collections of data employed as evidence sources in research. Following a similar trend, the business world has witnessed an exponential growth in the demand for data scientists and a general shift towards data-centred attitudes in organizations in recent years. Data-driven approaches are increasingly employed in designing business strategy by relying on large-scale analyses of data on users' behaviour and preferences, as well as data from internal systems, including workflow and sales databases (Redman, 2008). Mason and Patil (2015, 10) define the 'data scientific method' for organizations as follows:

1. Start with data.
2. Develop intuitions about the data and the questions it can answer.
3. Formulate your question.
4. Leverage your current data to better understand if it is the right question to ask. If not, iterate until you have a testable hypothesis.
5. Create a framework where you can run tests/experiments.
6. Analyse the results to draw insights about the question.

This list of steps highlights the importance of data exploration in the initial phases (steps 1 and 2) and largely overlaps with the exploratory approaches we referred to in section 2.4.1. Examples from the scientific world and the business world highlight a general trend in society, which may be explained by the cultural innovations driven by new technologies, as we discussed in section 1.3.4. Since linguistics research does not happen in isolation, but interacts with this changing external context, we believe that keeping in mind this broader perspective can help us to understand and frame data-driven approaches in historical linguistics as well. However, one important difference between business applications and research in historical linguistics is the ultimate aim of the investigation. Historical linguistics intended as an empirical enterprise aims to model and ultimately explain language phenomena in the past. This theoretical aim has implications for the data-driven research process we propose, as we will see in the next section.

2.4.3 Data and theory

Given the explanatory aim of historical linguistics, the corpus-driven framework we propose must be compatible with the creation of a historical linguistics theory, intended as a system of laws for historical languages and language change which allows us to explain and predict phenomena affecting these languages.
As a matter of fact, the term 'theory' is used quite generously in linguistics to refer to annotation schemes like HPSG, X-BAR, LFG, dependency grammar, construction grammar, approaches like distributional semantics, or other formalisms; we agree with Köhler (2012, 21) when he states:

there is no linguistic theory as of yet. The philosophy of science defines the term 'theory' as a system of interrelated, universally valid laws and hypotheses [. . . ] which enables to derive explanations of phenomena within a given scientific field.

Data and theory interact in complex ways in corpus-driven historical linguistics. In this section we will examine individually what we consider the main aspects of this interaction in the context of the data-driven approach to historical linguistics research we propose. We will present these different aspects one by one for reasons of clarity, although we recognize that they often occur together and interact.

Theory in data representation

In spite of their intended objectivity, whenever some data are collected as part of a research study, by necessity they reflect a specific way of understanding, representing, and encoding the recorded entities or events. Therefore, they are tied to a particular historical moment and theoretical views. As Tognini-Bonelli (2001, 85) summarizes very effectively, '[t]here is no such thing as pure induction' and even corpus-driven approaches (in the sense she defines) acknowledge this. Let us imagine that we have taken records of daily measurements of the temperature. In order to make sense of the pairs of numbers and characters collected, we would need to read them as temperatures (e.g. in centigrade degrees) and day–month–year triples, if that is the way we decided to represent dates. When it comes to linguistics research, the notations chosen for representing and collecting the corpus data play an important role in any subsequent analysis.
In the case of annotated corpus data, the annotation is always performed with reference to a specific notational framework and, therefore, annotated data will reflect that viewpoint, which we may call 'theoretical' (de Marneffe and Potts, 2014, 17). If the annotation includes part of speech, for instance, it has to rely on definitions of the different part-of-speech categories and how these labels apply to the language or variety in question. If the corpus is a treebank, we may want to choose a phrase-structure model or a dependency-based model for representing syntactic structures, and this choice will depend on our preferences, the features of the language annotated, as well as other considerations. In a similar vein, a corpus that has not been annotated will still need to be interpreted according to a specific theoretical perspective in order to form the basis for any subsequent linguistic analysis.

Theoretical assumptions

In addition to the way we represent the entities that we want to analyse and their context, whenever we carry out a data analysis we rely on a set of assumptions, which we may call 'theoretical', too. Let us go back to the example of daily temperatures. When we collect and then interpret the data, we need to keep in mind that they are limited to a specific range, so that if we spot a measurement of −400, for instance, we can quickly identify it as an error. In this case, any data-driven analysis would only make sense if we have access to the domain knowledge concerning temperatures on the earth. When we annotate and then analyse a corpus, in addition to the notational framework chosen, we rely on a set of assumptions on which there is general consensus among linguists: for example that nouns in French are inflected for number and gender, or that verbs in Latin can display different endings depending on their person, number, tense, voice, etc.
When we analyse verb data in a treebank, for instance, we assume that verbs do not occur with their arguments in a random way, but that they display specific syntactic and lexical–semantic preferences according to their argument structure. This kind of domain knowledge also supports the design and interpretation of exploratory analyses. The choice of which variables we decide to study will need to make sense according to this domain knowledge or in the context of specific hypotheses we want to test. To take a slightly absurd example, we may collect data relative to a number of events happening by a beach and we may find a strong correlation between the number of shark attacks on a day and the amount of ice cream sold on the same day. However, our domain knowledge tells us that, rather than concluding that buying more ice cream increases the chances of being attacked by a shark, we could hypothesize that both variables are correlated with (and possibly caused by) the number of visitors to the beach on that day. As Köhler (2012, 15) states, even exploratory approaches need some theoretical grounding:

It is impossible to 'find' units, categories, relations, or even explanations by data inspection—statistical or not. Even if there are only a few variables, there are principally infinitely many formulae, categories, or other models which would fit in with the observed data.

For instance, in McGillivray (2013, 127–78), the author studied the change of the argument structure of Latin verbs prefixed with spatial preverbs, and particularly the opposition between prepositional constructions and bare-case constructions.
This phenomenon involves the interplay of a range of variables, including morphological, lexical, and semantic features such as the type of verbal prefix, the particular spatial relation expressed in conjunction with the verb, the semantics of the verb, and the case of the verbal argument. McGillivray (2013, 169–72) employed an exploratory approach to deal with the complexity of the phenomenon and to measure the contribution of various variables to it. She resorted to exploratory data analysis (Tukey, 1977) (specifically CA, see section 6.2), which aims at letting the model 'emerge' from the data. However, the author chose the set of variables based on a combination of findings from previous research and linguistic domain knowledge.

Data and theoretical hypotheses

We have seen that exploratory approaches to historical linguistics analysis need access to domain knowledge and need to be theoretically grounded. But, of course, theory plays a crucial role in confirmatory approaches as well, which are essential to the progress of any empirical research. When approaching corpus data with a theoretical hypothesis, it is important to avoid the risk of confirmation bias, which would lead us to only find positive evidence of the claims we intend to make. To address this issue, McEnery and Hardie (2012, 15) define the principle of total accountability, according to which we should always aim at using the entire corpus, or at least random samples when the corpus is too large. This way we can satisfy the criterion of falsifiability, identified by Popper (1959) as the defining feature of the scientific method. If we follow the principle of total accountability, we are very likely to employ quantitative analysis techniques, as manual analysis is often inadequate to deal with the size and complexity of the data.
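As a minimal sketch of what total accountability looks like in practice, the snippet below counts every occurrence of a construction in an annotated corpus rather than citing hand-picked examples. The CSV layout, column names, and all token values are invented for illustration:

```python
import csv
import io
from collections import Counter

# Hypothetical annotated corpus in CSV form; the columns and the Latin
# tokens are invented for illustration only.
annotated_corpus = io.StringIO(
    "token,lemma,pos,construction\n"
    "adiit,adeo,VERB,bare-case\n"
    "advenit,advenio,VERB,prepositional\n"
    "adiit,adeo,VERB,prepositional\n"
    "accessit,accedo,VERB,bare-case\n"
    "advolat,advolo,VERB,bare-case\n"
)

# Count every matching instance in the corpus, not a convenient selection.
counts = Counter(row["construction"] for row in csv.DictReader(annotated_corpus))
for construction, n in counts.most_common():
    print(construction, n)
```

The same pattern scales from a five-row toy file to a full corpus: the analysis is exhaustive by construction, which is what makes the resulting frequencies usable as quantitative evidence.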
By relying on the systematic evidence from a corpus, corpus-driven approaches can address the question of whether a phenomenon is attested by finding occurrences of certain patterns or constructions. However, finding a few examples of such patterns in a corpus does not in itself guarantee that these are not annotation errors, typos, or other anomalies; this is particularly true of historical texts, whose spelling and other features often do not follow standards and cannot easily be categorized because they are captured at the moment at which they undergo diachronic change. A systematic quantitative account of corpus evidence based on all available data, together with a theoretical model and the effort to make our results consistently replicable (Doyle, 2005), can help to avoid spurious conclusions in these situations and increase their validity in an empirical context (McEnery and Hardie, 2012, 16). Corpus data may also support statements about possible but unseen phenomena, by relying on seen events and statistical estimation, coupled with domain knowledge (Stefanowitsch, 2005; de Marneffe and Potts, 2014, 11). Corpora, paired with a theoretical model, can also predict that a phenomenon is impossible (or has a negligible probability of occurring). For example, Pereira (2000) used a statistical model trained on a newspaper corpus to predict that colourless green ideas sleep furiously is about 200,000 times more probable than furiously sleep ideas green colourless, thus addressing Chomsky’s (1957) challenge. As Pereira’s study illustrates, corpora can also address probabilistic hypotheses about language, as well as binary ones. This is explored in more depth in Chapter 6.
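The flavour of Pereira’s demonstration can be conveyed with a toy class-based bigram model. Everything below (the word classes, the tiny ‘training corpus’ of part-of-speech sequences, and the add-one smoothing) is invented for illustration and is far simpler than Pereira’s actual model, but it shows the key point: a smoothed model assigns non-zero probability even to unattested sentences, and ranks the grammatical ordering far above the scrambled one:

```python
import math
from collections import Counter

# Hypothetical word-to-class mapping for the five words in question.
CLASS = {"colorless": "ADJ", "green": "ADJ", "ideas": "N",
         "sleep": "V", "furiously": "ADV"}

# Tiny invented "training corpus" of part-of-speech sequences.
train = [["ADJ", "N", "V", "ADV"],
         ["ADJ", "ADJ", "N", "V"],
         ["N", "V", "ADV"],
         ["ADJ", "N", "V"]]

bigrams = Counter()
unigrams = Counter()
for seq in train:
    tags = ["<s>"] + seq + ["</s>"]
    unigrams.update(tags[:-1])          # count each context tag
    bigrams.update(zip(tags, tags[1:])) # count tag-to-tag transitions

V = len(set(CLASS.values())) + 2  # tag inventory incl. <s> and </s>

def log_prob(sentence):
    """Add-one-smoothed log probability of a sentence's class sequence."""
    tags = ["<s>"] + [CLASS[w] for w in sentence.split()] + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
               for a, b in zip(tags, tags[1:]))

good = log_prob("colorless green ideas sleep furiously")
bad = log_prob("furiously sleep ideas green colorless")
print(good > bad)            # True: the grammatical order scores higher
print(math.exp(good - bad))  # roughly 720 times more probable here
```

The ratio of about 720 is an artefact of these toy counts; the point is only that the comparison is well defined for sentences never seen in training.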
As an example of such a probabilistic hypothesis from historical linguistics, in the context of the diachronic change in the argument structure of Latin prefixed verbs, McGillivray (2013, 127–78) formulated various hypotheses, including the following, concerning one of the constructions that pertain to these verbs, namely the bare-case construction:

Construction 1 [ . . . ] is significantly more frequent in the archaic age and in works by poets than in the later ages and in prose writers.

This hypothesis operationalizes a generalizing statement in terms that can be addressed by corpus data. McGillivray (2013) tested the hypothesis with a statistical significance test (the chi-square test; see section 6.3.3 for details), and obtained a confirmation of the hypothesis, together with a measure of the size of the detected effect. The process involved all available corpus data and therefore fulfilled the principle of total accountability, which is fundamental to corpus-driven approaches. In this way, the results contributed new quantitative evidence which can support more general theoretical models.

.. Combining data and linguistic approaches

Following Köhler (2012, 2), we define a linguistic theory as a series of connected claims from which predictions about historical languages can be made. As we have seen, our framework includes theoretical hypotheses, properly tested against corpus data. From a series of such contingent statements corresponding to tested theoretical hypotheses, we can proceed towards formulating theoretical models of the historical linguistic phenomenon at hand. By this term we mean those generalized explanations of observed phenomena that some linguists call ‘theories’. Our framework does not impose restrictions on which particular models can be derived from this process, nor on the ontological setup that allows this generalization step from contingent claims to theoretical models.
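For readers who want to see the mechanics of the kind of significance test discussed above, the sketch below runs Pearson’s chi-square test on a 2×2 contingency table of construction counts by period. The counts are invented for illustration and are not McGillivray’s (2013) data; only the procedure (test statistic, critical value, effect size) mirrors the analysis described:

```python
# Rows: archaic vs later period; columns: bare-case vs prepositional.
# All counts are hypothetical.
table = [[70, 30],
         [45, 55]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson's chi-square: sum of (observed - expected)^2 / expected,
# where expected[i][j] = row_total[i] * col_total[j] / n.
chi2 = sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i in range(2) for j in range(2))

# Critical value for df = 1 at alpha = 0.05 is 3.841.
print(chi2 > 3.841)  # True here: period and construction are not independent

# Effect size (phi coefficient) for a 2x2 table.
phi = (chi2 / n) ** 0.5
print(round(phi, 2))  # 0.25 with these invented counts
```

With these counts the statistic is about 12.79, well above the critical value, and the phi coefficient of 0.25 indicates a modest effect, illustrating why a significance test should be reported together with an effect-size measure.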
Our main concern is with the way such a process is performed. In the rest of this book we provide more details of this process; in particular, Chapter 6 gives some concrete examples of how it can be realized in practice. Thus, our framework is not meant to replace other approaches to historical linguistics rooted in, for example, generative theory or traditional comparative linguistics. We consider work on X-bar theory, grammaticalization, and language history equally compatible with our framework. A linguistic description can be characterized as a hypothesis (Carnie, 2012, §3.1). As mentioned above, the key characteristics of a good hypothesis are that it is falsifiable and that it has predictive power, and, as stressed by Beavers and Sells (2014), it needs to be tested against alternative hypotheses. The reason for this is captured in our principle number 3 (section 2.2.3): since almost any claim is possible, merely fitting a hypothesis to the data is insufficient. Instead, hypotheses must be compared and tested against data. This ought to be uncontroversial, and both Carnie (2012) and Beavers and Sells (2014) are rooted in generative theory, demonstrating that such hypothesis testing is not restricted to probabilistic approaches to linguistics. We go beyond Carnie (2012) and Beavers and Sells (2014) in insisting that such hypothesis testing and comparison in historical linguistics ought to be done quantitatively, using corpus data, whenever possible. Furthermore, we argue that multivariate techniques for quantitative modelling are superior to others, due to the complex nature of language.
We also see this focus on quantitative techniques and corpus data as a means to compare results across linguistic frameworks, hence our emphasis on an empirically based consensus (see section 2.2.1) informed by appropriate statistical techniques (see section 2.2.12). In short, the present framework extends commonly accepted guidelines for constructing linguistic arguments. The framework takes explicit issue with hypothesis testing in historical linguistics by means of intuitions and qualitative judgements about frequencies (as well as quantitative arguments that do not follow state-of-the-art standards). It is precisely this methodological focus that makes our framework compatible with different paradigms in historical linguistics.

Corpora and quantitative methods in historical linguistics

. Introduction

Historical linguistics and quantitative research have enjoyed a long and tangled coexistence over the years. It must be stressed that any attempt to paint a picture of a gradual, one-directional diachronic shift from qualitative to quantitative methods in historical linguistics is an oversimplification, and not even a particularly useful one. Instead, we would like to return to the image of the chasm separating the early innovators and visionaries from the majority or mainstream, discussed in Chapter 1. Looking back at the history of quantitative and corpus methods in historical linguistics through the lens of the chasm model, we can compare the degree to which quantitative corpus methods are used within the groups defined in the model. For instance, the early adopters would correspond to roughly 16 per cent of the potential users. A technology adopted by the early majority would bring the total up to about 50 per cent, whereas including the late majority too would mean that the technology has reached more than 80 per cent of potential users.
This is essentially an empirical question (contingent on the validity of the chasm model). As this chapter will show, in the case of historical linguistics, quantitative corpus technologies have not transitioned much beyond the early stages of the adoption curve. However, we also want to better understand why these methods have failed to spread from the ranks of early innovators to the majority of linguists practising historical linguistics. It is indisputable that the early models of linguistic change associated with the development of the comparative method, such as the family-tree model and wave theory, largely fall under the rubric of qualitative methodology (Campbell, 2013, 187–90). The comparative method remains a vital approach to historical linguistics, and Campbell (2013, 471) argues that what he calls ‘mathematical solutions’ to historical linguistic problems are neither necessary nor sufficient, implying that historical linguistics can do without quantitative methods. The chapter on quantitative methods only appeared as a full chapter in the third edition of Campbell’s book, suggesting perhaps a growing need to address these methods in historical linguistics, albeit largely to refute them. Yet the casualness of the refutation also points to the status of the qualitative approaches, and particularly the comparative method, as the hegemon of historical linguistics. That state of affairs undoubtedly stems at least partially from both the success and the age of the comparative method (McMahon and McMahon, 2005, 5–14).
However, as Campbell’s treatment of quantitative approaches to historical linguistics illustrates, a certain antagonism can also be traced back to early attempts at statistical approaches to historical linguistics, leading to the perhaps surprising conclusion that the full acceptance of quantitative methods in historical linguistics is hampered not only by the novelty of the methods, but also by a somewhat painful previous exposure.

. Early experiments

The success of the comparative method notwithstanding, it is possible to find examples of researchers proposing ‘mathematical solutions’ to problems in historical linguistics at least as far back as the nineteenth century. Köhler (2012, 12) claims that ‘in linguistics, the history of quantitative research is only 60 years old’, a claim that appears to be founded on his view that quantitative linguistics in the modern sense began with the work of George K. Zipf in the late 1940s, although Köhler does point out that studies based on ‘statistical counting’ can be found as far back as the nineteenth century (Köhler, 2012, 13). Gorrell (1895), to take but one example, certainly took a quantitative approach in his study of indirect discourse in Old English, with tables displaying counts of constructions every few pages. However, it is paramount to avoid simplistic generalizations, and the overall picture of historical linguistics a century or so ago is above all one of variation. McGillivray (2013, 144–7) discusses a 1914 study of Latin preverbs by Bennett, in which the author chose to classify occurrences above ten as ‘frequent’ without providing the reader with access to the actual numbers of occurrences, leaving the reader to guess exactly what evidence underpins such unquantified yet implicitly numerical distinctions as ‘many’ vs ‘most’. Statistics beyond word counts also enjoys some seniority within historical linguistics.
Kroeber and Chrétien (1937) calculated correlation coefficients for linguistic features in order to arrive at a statistically based classification of the Indo-European languages. Their work was criticized by Ross (1950), who (although generally sympathetic to the approach) took issue with their calculations, favouring instead the chi-square statistic, a test with its own inherent problems when employed in linguistics. However, both Kroeber and Chrétien and Ross were somewhat pessimistic in their conclusions, prompting a rebuttal from Ellegård (1959). While taking care to emphasize that statistical measures of linguistic similarity refer to similarity with respect to some specifically chosen traits or features (rather than a global, a-theoretical similarity), Ellegård proposed an alternative approach to interlanguage correlations. Ellegård’s conclusion is mainly methodological, but rounds off with the insight that the application of statistical methods in historical linguistics is not merely a methodological choice. Instead, he outlines a dynamic relationship where quantitative methods spur on theoretical developments, and where statistical methods ‘will require a linguistic taxonomy, will help to establish it, and can be used for bringing taxonomic and developmental studies into fruitful contact’ (Ellegård, 1959, 156). In hindsight it is clear, however, that the impact of statistical methods on historical linguistic theorizing remained limited. That limited impact is perhaps best judged by the insistent tone of the publications advocating their use, which (following the logic from historical studies that what is frequently prescribed by law generally reflects common actual behaviour) quickly raises the image of a besieged minority. At the same time, the arguments are often pithy and convey a message which in many cases remains relevant today.
Take the point made by Kroeber and Chrétien (1937, 97), who suggested that the linguist working only with intuition easily becomes biased when the linguist

observes a certain affiliation which is real enough, but perhaps secondary; thereafter he notes mentally every corroborative item, but unconsciously overlooks or weighs more lightly items which point in other directions.

The quote, which is a polite way of saying that non-quantitative studies are prone to bias by over-emphasizing rare or unexpected phenomena, has held up well and is in tune with more recent critiques of qualitative methods, such as that raised by Sandra and Rice (1995). The simple psychological fact that the human mind is not well equipped to deal objectively with relative frequencies in an intuitive way remains a key objection to non-quantitative work in historical linguistics. However, the fact that similar critiques are still being made decades after Kroeber and Chrétien says something about their impact, or more precisely the lack of it. Occasional arguments for the virtues of a fully quantitative linguistics can be found around the middle of the twentieth century, but their relative rareness, as well as their timbre, is testament to a lack of impact. Consider the acerbic yet slightly despondent tone in the following observation by Guiraud (1959, 15), published over twenty years after Kroeber and Chrétien:

La linguistique est la science statistique type; les statisticiens le savent bien; la plupart des linguistes l’ignorent encore. (‘Linguistics is the typical statistical science; the statisticians know this well; most linguists are still ignorant of it.’)

If the quote from Guiraud suggests ignorance of (and hence lack of involvement in) quantitative research on the part of the linguists, the following passage from Ellegård (1959, 151–2) has the air of a well-rehearsed response to familiar criticism: ‘Even intuitive judgments must be based on evidence.
Now if that evidence turns out to be insufficient statistically, it will be insufficient also for an intuitive judgment.’ The comment is poignant and the tone one of calm reason; however, the implications went largely unheeded by historical linguistics as a discipline, suggesting that Guiraud and Ellegård were early adopters of a technology that did not quite catch on. There are several likely reasons for this. First and foremost is probably the undoubted success of the comparative method mentioned earlier, which (following the old adage that if it ain’t broke, don’t fix it) must have made the rather tedious mathematical calculations seem subject to diminishing returns. Second, the lack of electronic corpora, desktop computers, and statistical software meant that quantitative work was slow and almost impossible to perform at the large scale where it really comes into its own. Third, the advent of generative linguistics, vividly chronicled in Harris (1993), heralded a period in which numerical approaches to linguistics were generally no longer in vogue, or were even regarded with some hostility (Pullum, 2009). And finally, there were the stains left behind by a specific, much revered (and later much reviled) method: glottochronology.

. A bad case of glottochronology

In the 1950s linguistics was both changing and expanding with a mature and optimistic sense of security, enjoying ‘measured dissent, pluralism, and exploration’ (Harris, 1993, 37). Such exploration was also taking place in historical linguistics, where Morris Swadesh launched the term glottochronology in the early 1950s; see e.g. Swadesh (1952) and Swadesh (1953). Glottochronology was proposed by Swadesh as one approach within lexicostatistics more generally.
The distinction is worth making, since lexicostatistics is generally taken to mean statistical treatments of lexical material for the purposes of studying historical linguistics. McMahon and McMahon (2005, 33) offer the following definition of lexicostatistics: ‘the use of standard meaning lists to assess degrees of relatedness among languages’. Campbell (2013, 448), like McMahon and McMahon (2005, 33–4), notes that ‘glottochronology’ and ‘lexicostatistics’ are frequently used interchangeably, but Campbell goes on to claim that ‘in more recent times scholars have called for the two to be distinguished’. However, the attempts at making the distinction are as old as the confusion itself. To Hockett (1958, 529) the two terms appear to have been synonyms, whereas Hymes (1960, 4) argues for a distinction:

Glottochronology is the study of rate of change in language, and the use of the rate for historical inference, especially for the estimation of time depths, and the use of such time depths to provide a pattern of internal relationships within a language family. Lexicostatistics is the study of vocabulary statistically for historical inference. . . . Lexicostatistics and glottochronology are thus best conceived as intersecting fields.

Hymes goes on to point out that lexicostatistics could in fact refer to any numerical study of lexical material, synchronic or diachronic, but that the term has received a ‘specialized association with historical studies’. At its core, glottochronology operates with three basic elements: word lists, or strictly speaking lists of sememes or ‘meaning lists’ (McMahon and McMahon, 2005, 34), of ‘basic’ vocabulary for the languages to be compared; the number of cognate items within the list; and the retention rate over time (Hymes, 1960, 3). Two lists predominate in the literature, one containing 100 items, the other 200 (Campbell, 2013, 448–51).
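These three elements combine in Swadesh’s standard time-depth formula, which estimates the number of millennia t since two languages diverged from the proportion c of shared cognates on the list and the assumed per-millennium retention rate r: t = log c / (2 log r). The sketch below uses illustrative figures, not measurements from any actual language pair:

```python
import math

def time_depth(c, r):
    """Estimated millennia since divergence: t = ln(c) / (2 * ln(r)).

    c: proportion of shared cognates between the two languages.
    r: assumed retention rate per millennium (e.g. 0.86 for the
       100-item list). Both inputs here are illustrative only.
    """
    return math.log(c) / (2 * math.log(r))

# Two hypothetical languages sharing 74% cognates, scored against
# the 100-item list's assumed 86% retention rate:
t = time_depth(0.74, 0.86)
print(round(t, 2))  # roughly 1.0, i.e. about a millennium since the split
```

The formula makes the method’s central vulnerability visible: the estimate depends directly on the assumed constant r, which is precisely the assumption the critics attacked.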
As subsequent criticism would show, all three variables turned out to have their particular trapdoors, including the problem of defining culturally neutral and replicable versions of the lists themselves, the problem of what to count as ‘basic’ (which again had an impact on the cognates), and the assumption of a constant rate of change. The constant retention rate over 1,000 years, argued to be 86 per cent for the 100-word list and 81 per cent for the 200-word list, was boldly presented as a real, mathematical fact (a physical probability; see section 2.1.3) with evidence ‘sufficient to eliminate the possibility of chance’ (Swadesh, 1952, 455). Glottochronology was met with an initial rush of enthusiasm (Hymes, 1960, 32), and made it into the introductory-course university curriculum in linguistics (Hockett, 1958, 526–35). However, some methodological problems were pointed out both by Swadesh himself and by others. Ellegård (1959, 155) criticized the lexicostatistical method used by Swadesh (1953), commenting that the latter seemed ‘somewhat rash in assuming a uniform rate of development’. Hockett questioned the assumption of a ‘basic vocabulary’, but nevertheless rounded off his introduction of the approach to undergraduate students rather optimistically by stating that ‘no development in historical linguistics in many decades has showed such great promise’ (1958, 534). Many, however, went further, possibly out of confusion, enthusiasm, or both. Hymes (1960, 4) notes that some academics leaped from the method’s treatment of a narrowly circumscribed basic vocabulary to endorsing it for tackling the problem of language change at large. In this we can recognize the pitfall pointed out by Moore (1991) regarding any new technology, namely the risk of overselling the ‘vision’ of the new technology before it is sufficiently mature to back up that vision with concrete results.
The detailed contemporary critiques summarized in Hymes (1960) cover the now-familiar criticisms of glottochronology: problems with basic lists, problems with judging sameness, problems with cultural bias, problems with synonyms, problems with borrowings, problems with taboo words, the problematic assumption of a constant rate of change, as well as specific mathematical problems.1 Although critical, Hymes (1960, 15) nevertheless argued that more research should go into the method, and he approved of its continued application. However, problems were mounting. In addition to the problems listed above, different linguists were reporting different results for glottochronological studies of the same languages, as discussed in e.g. Bergsland and Vogt (1962), suggesting that the method was introducing more vagueness rather than more objective replicability. The central tenet of glottochronology, a universal constant rate of change in basic vocabulary, did not hold empirical water, as shown by Fodor (1961) and Bergsland and Vogt (1962). In one of the studies conducted by Bergsland and Vogt (1962), the authors found that lexical replacement rates in the basic vocabulary list were either far higher (Icelandic) or far lower (Riksmål/Norwegian Bokmål) than those predicted by the model. Fodor (1961), on the other hand, found split dates for Slavic languages that were at odds not only with the comparative method, but also with well-attested historical facts. Add to this further criticism of the mathematics involved (Chrétien, 1962), and the result was a predictable and considerable dampening of the initial enthusiasm.

1 The core criticism against glottochronology is concisely presented in McMahon and McMahon (, –) and Campbell (, –).
In 1964 Lunt, in an editorial quoted in McMahon and McMahon (2005, 45), declared glottochronology an ‘idle delusion’ and bluntly denied the usefulness of continuing the project. As the 1960s and 1970s went on, glottochronology, and quantitative methods in linguistics more generally, largely fell out of favour, and well-known exceptions such as William Labov’s quantitative studies of sociolinguistic variation implied some opposition to the orthodoxy (Sampson, 2003; Lüdeling et al., 2011). Glottochronology did not reduce the interest in quantitative methods in historical linguistics on its own. As we have seen, structuralist approaches to linguistics were sceptical towards statistical evidence, and that scepticism was inherited and refined by mainstream transformational-generative grammar in subsequent decades (Sampson, 2003; Lüdeling et al., 2011; Gelderen, 2014). Cognitive–functional approaches (Sandra and Rice, 1995) also displayed a lack of attention to statistical methods (Deignan, 2005), which suggests a general tenor of linguistic research that went beyond historical linguistics. However, judging from the fact that glottochronology is still being discussed in the context of quantitative approaches to historical linguistics (McMahon and McMahon, 2005; Campbell, 2013; Pereltsvaig and Lewis, 2015), it seems clear that the negative perception of quantitative methods stemming from the failure of glottochronology has endured beyond the method itself. According to Moore (1991), such negative impressions could be a contributing factor when a technology fails to cross the chasm. As we pointed out in section 2.2.6, we cannot move logically from this possibility to concluding that this is in fact the case; we need to take a much richer context into account. In the next section, we turn from quantitative methods to the use of corpora.
. The advent of electronic corpora

From a methodological point of view, it is interesting that some of the early publications referred to here, notably Kroeber and Chrétien (1937), Ross (1950), and Ellegård (1959), while predominantly concerned with statistical methods, kept returning to the question of data. Statistical methods in themselves will not yield answers without appropriate quantitative data, which means that methodological advances in one area are contingent on the other. Ross (1950) built upon the data from Kroeber and Chrétien (1937) and added more data, whereas Ellegård (1959) returned to the question of ‘relative frequencies’ from ‘random samples’ several times when discussing the methodological shortcomings of statistical methods that only deal with binary features. Without reading too much into this, it is clear that even if they did not phrase it in those terms, these authors were intimately aware of the problems caused by the lack of (and to some extent solved by the presence of) today’s electronic annotated corpora. That is not to say that text corpora were something new in the mid-twentieth century. Käding published his 11-million-word corpus concordance in 1897 (McEnery and Wilson, 2001, 12), and the first half of the twentieth century saw a string of studies relying on corpus linguistic methods, with the early 1950s witnessing both Firth’s work on collocations (Gries, 2006a, 3) and Fries’s corpus study of spoken American English (Gries, 2011, 81–2). The 1960s saw the introduction of the so-called first generation of machine-readable corpora, whose characteristics today are the defining hallmarks of corpora: electronically stored, searchable, possibly annotated, and with an aim at representativeness.
In the field of historical language studies, the Index Thomisticus corpus co-evolved with technological developments, from punched cards to magnetic storage and finally online publication, over its thirty-year construction phase (Busa, 1980). Pioneering work on corpus linguistics continued from the 1950s to the 1980s (McEnery and Wilson, 2001, 20). However, with notable exceptions such as the Index Thomisticus corpus and the Helsinki Corpus of English Texts, these efforts were mainly directed at contemporary languages. Today it is perhaps easy to underestimate the financial and technical difficulties facing early corpus builders. As Baayen (2003, 229–30) points out, early computers were few and expensive, which provided both a positive incentive for a formal approach to language and a negative incentive against statistical investigation of large corpora. The case of the Index Thomisticus corpus is an interesting illustration of the difficulties: it took some thirty years to complete (including adaptation to changing technologies along the way), and it was reliant on large-scale funding from the IBM Corporation (Busa, 1980). In the face of such financial and technical obstacles, it is perhaps not surprising that historical linguistics (with its data being considerably less interesting from a commercial point of view) lagged behind in corpus creation, at least in the modern sense of general, representative, machine-readable corpora. We have already mentioned Käding’s large late nineteenth-century corpus; however, the usefulness of such a corpus would be severely limited by the available (manual) search technology. Thus, the pragmatic pressures imposed by the technology for creating and searching relatively large corpora, alongside the financial costs, would naturally favour smaller collections of purpose-built corpora that could be collected and searched manually.
For all their merits, such corpora are nevertheless limited in their usefulness. Organized as lists of sentences, they are difficult to search, except by manually reading each sentence. Organized as collections of index cards, they can take up a lot of space and are not easily distributed. The size limitation imposed by storage and searching points naturally in the direction of relatively small, purpose-built specialized corpora, rather than large general ones. Although such specialization can be valuable, it might also limit the potential for reuse. The lack of shared, reusable resources would also mean that, in the worst-case scenario, each corpus would have to be created afresh for each new project. This view of the situation is perhaps slightly too sombre. As the studies in Kroeber and Chrétien (1937), Ross (1950), and Ellegård (1959) attest, collections of data could be shared and expanded gradually. Some specialized early corpora have enjoyed considerable longevity, perhaps most notably the data on the history of the English periphrastic do from Ellegård (1953), which have been reanalysed by Kroch (1989) and Vulanović and Baayen (2007), among others. Nevertheless, the central critiques of such early corpora remain: their specialized nature leads to a proliferation of isolated resources, rather than general ones suited to at least a majority of research questions. Furthermore, idiosyncrasies in sampling and annotation might make comparing or merging data sets difficult, a difficulty which would be compounded by a lack of standardized annotation.
Although quantitative work based on a corpus methodology was being carried out in historical linguistics prior to the emergence of electronic historical corpora, reduced costs and improved computing power (together with the availability of lessons learned from the efforts to build corpora of contemporary language) meant that by the 1990s the scene was set for mainstream electronic historical corpora.

. Return of the numbers

By the end of the 1980s, the stage was set for a growing interest in corpora. The two decades that had passed since the release of the Brown corpus in 1967 had seen a gradual growth in corpus size, as well as a growth in corpus use, including in commercial projects like the Cobuild Dictionary. The evolution of a scientific community which refined and promoted the building and use of corpora was undoubtedly vital. So was another development taking place: the growth of computing power. In computer science, the power of computing hardware (measured by the number of transistors that can fit into an integrated circuit) has been argued to follow what is commonly known as Moore’s law, a prediction made in the late 1960s that computing power would double every two years, that is, grow at an exponential rate. Especially since the computing industry has to some extent calibrated its development efforts to match the law, the law itself is perhaps less interesting than the result, namely a massive growth in computing power at a greatly reduced cost. Figure 3.1 illustrates Moore’s law as a regression line showing the growth in computing power over time on a logarithmic scale, with some corpora added to the plot according to the year of their release.

[Figure 3.1. Illustration of Moore’s law with selected corpora plotted on a base 10 logarithmic scale. Corpora marked with an asterisk (*) are historical.]

The data and the code for the figures in this chapter are available on the GitHub repository https://github.com/gjenset. Unsurprisingly, we see a cluster of corpora from the year 2000 onwards. It would be grossly simplistic to claim that computing power alone powered this growth. Corpora are created for a number of reasons, and typically require established research projects (which in turn require a certain intellectual climate), long-term funding, an ecosystem of tools and standards, and so on. However, keep in mind the observations from Baayen (2003) about how the technological bottleneck of early computing provided an incentive towards formal, non-corpus-based approaches to linguistics. Clearly, at the very least, we can hypothesize an interplay between intellectual development and new technological possibilities (see also section 1.3.4). The historiographical problem of deciding the exact causality of this development is obviously outside the scope of this book. It is also secondary to what we consider far more important: the growth in computing power, coupled with easier access and lower prices, obviously removed an important bottleneck that was present in the 1950s and 1960s. It is instructive to consider the growth in corpus size, which has also followed an exponential curve during the same period.

[Figure 3.2. Sizes of some selected corpora plotted on a base 10 logarithmic scale, over time. Corpora marked with an asterisk (*) are historical.]
Figure 3.2 illustrates this by means of a bubble plot. The vertical axis shows the size of the plotted corpora on a logarithmic scale, whereas the bubbles (each representing a corpus) are scaled to be proportional to the corpus size. As the plot shows, there has been a rapid growth in the potential for building large corpora since the 1990s. A caveat is in order here, since the potential for building large corpora does not prevent small corpora from being built. Take for example the syntactically annotated historical corpora in the lower-right corner of the plot. These corpora have remained small for a number of reasons unrelated to computing power: when dealing with historical texts, there is only a finite set of data on which to base the corpus. Furthermore, the annotation step requires manual coding, since machine-learning algorithms for adding annotation to corpora cannot normally be applied with good results to historical texts without some manually annotated historical data as training material. However, if we remove these historical, syntactically annotated corpora, and fit a log-linear regression model (see section 6.2 for an introduction to linear regression models) relating computing power (on a base 10 logarithmic scale) to corpus size (also on a base 10 logarithmic scale), we find a significant relationship between the two.2 According to this model, every 1 per cent increase in computing power corresponds to a 44 per cent increase in corpus size.
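As a sketch of what fitting such a log-linear model involves, the following uses invented data (not the corpora of Figure 3.3; the actual data are in the GitHub repository mentioned above), generated around a known slope, and shows the elasticity-style reading of the fitted coefficient:

```python
import numpy as np

# Invented data: base 10 logarithms of computing power (x) and corpus
# size (y), generated around a known slope of 1.5 with some noise.
rng = np.random.default_rng(0)
log_power = np.linspace(4, 10, 12)
log_size = 1.5 * log_power + 0.5 + rng.normal(0, 0.2, size=12)

# Fit the log-linear model: log10(size) = intercept + slope * log10(power).
slope, intercept = np.polyfit(log_power, log_size, deg=1)

# On this scale the slope acts like an elasticity: a 1 per cent increase
# in computing power corresponds to roughly this many per cent more corpus size.
pct = (1.01 ** slope - 1) * 100
```

Because both variables are logged, the slope can be read as a relative (percentage) relationship rather than an absolute one, which is what licenses statements of the '1 per cent increase in x corresponds to an increase in y' kind.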
The model is illustrated in Figure 3.3.

Figure 3.3 Log-linear regression model showing the relationship between the growth in computing power and the growth in corpus size for some selected corpora. Corpora marked with an asterisk (*) are historical.

As mentioned earlier, it is impossible to claim that the increases in computing power directly caused corpora to grow, since the creation of corpora depends on much more than computing power alone. However, cheaper and faster computers with more storage capacity meant that some obstacles to corpus creation were removed, or at least gradually became less important.

2 F(1,8) = ., p << ., adjusted R2 = .

As we have seen with some of the historical, syntactically annotated corpora mentioned earlier, corpora may remain small for a number of reasons; however, the creation of part-of-speech-annotated diachronic corpora such as CoHA (Davies, 2010), with its 400 million words, illustrates that attempts to draw a sharp distinction between small historical corpora and large contemporary ones must necessarily fail. Thus, electronic corpora that are too large to be read manually in their entirety are firmly established in historical linguistics. The obvious question, then, is: how should historical linguists respond to this development? The existence of corpora creates a potential for use; it does not indicate that they are being used, or used by more than a small group of die-hard corpus linguists. However, there are signs that interest in corpus use is growing. The Brigham Young corpora created by Mark Davies, freely searchable through a web interface, report 170,000 unique users every month.3 If we look at the term 'corpus linguistics' in the Brigham Young Google corpus itself (Davies, 2011), we find that the phrase has been increasing in frequency since the 1980s. Figure 3.4 shows the occurrences of 'corpus linguistics' scaled to reflect the number of instances per 1,000 instances of the word 'linguistics'. Figure 3.4 also shows the use of the phrase 'historical linguistics', which is still more common than corpus linguistics.
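The scaling used in Figure 3.4, instances per 1,000 occurrences of 'linguistics', is a simple normalization that makes counts comparable across decades of very different sizes. A minimal sketch with invented counts (not the actual figures behind the plot):

```python
def per_thousand(phrase_count, base_count):
    """Occurrences of a phrase per 1,000 occurrences of a base word."""
    return 1000 * phrase_count / base_count

# Invented counts for a single decade: 24 hits of 'corpus linguistics'
# against 8,000 hits of 'linguistics' gives a rate of 3.0 per 1,000.
rate = per_thousand(24, 8000)   # -> 3.0
```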
Also shown are 'mathematical linguistics' and 'quantitative linguistics', which we have merged due to their low numbers. These phrases show a different behaviour, with a peak around the 1960s (chiefly due to 'mathematical linguistics'), after which they gradually decline. It is of course impossible to tell from such graphs how the frequencies relate to the use of corpora and quantitative methods, or to the interest in historical linguistics for that matter. Does a peak mean that many researchers use such tools and methods, or are they merely being very vocal in their denunciations? We have carried out a detailed quantitative study on this question, described in section 1.5. Of course, we cannot say that the changing relative frequencies in Figure 3.4 represent physical probabilities in the sense that they directly represent interest or activity as described by these terms. However, we can draw some conclusions based on the corpus data, within the scope of the Google corpus. First, there is obviously increasing attention to corpus linguistics: it is being talked, or more accurately written, about to a larger extent. Second, there is no clear correlation between corpus linguistics and quantitative linguistics. If anything, the correlation appears to be negative. This raises the question of whether 'quantitative linguistics' is simply superfluous. If corpus linguistics is taken to be inherently quantitative, this could certainly be the case. However, corpora may also serve as sources of examples, and the use of corpora may not automatically entail the use of quantitative argumentation. Finally, the corpus we used does not tell us how these tendencies play out within historical linguistics.

3 http://corpus.byu.edu/faq.asp.

Figure 3.4 Relative frequencies of linguistics terms per 1,000 instances of the word linguistics in the twentieth century, taken from the BYU Google Corpus.

In short, both intellectual and technological developments have facilitated a growth in corpus size and availability. Availability is of course a prerequisite for use, but not a determining factor. More importantly, corpus linguistics is to some extent a linguistic subdiscipline in itself, with its own journals, books, and conferences. The increased attention to corpus linguistics in general does not entail a similar focus within historical linguistics, which is our main concern in this book. Consequently, we are interested in investigating how the potential represented by the increased availability of corpus material plays out in practice within historical linguistics, and to what extent corpus material is used in quantitative (or probabilistic) argumentation.

. What's in a number anyway?

It is a truism that not everything that can be counted matters, and that not everything that matters can be counted. However, this is a quip, not an argument. The fact that not everything can be quantified does not impinge on the usefulness of quantifying some things, in some situations, and for some purposes. Specifically, the question we must face in historical linguistics is whether quantification is more useful than its absence, i.e. a purely qualitative line of argumentation. A qualitative argument is, strictly speaking, a binary argument. That is, a qualitative argument may support the assertion that some phenomenon is either present or not. Any argument that deals with degrees is by definition quantitative, even if numbers are not explicitly mentioned in the argumentation.
Modifiers such as much, little, hardly, seldom, frequently, and infrequently are clearly about degrees and bear witness to an ordinal quantification. In the case of these modifiers the quantification is veiled by the language involved, but the underlying ordinal (and hence quantitative) nature of the relationship or entity described is obvious. In short, we recognize as genuinely qualitative only those arguments that deal with the presence or absence of some feature or phenomenon (e.g. morpheme x occurs in language variety y), although this may need some further specification in some cases, as discussed below. All other arguments are quantitative in one way or another, including ordinal observations or arguments expressed in ordinary language rather than numbers, and they are hence subject to the kind of methodological scrutiny required by a quantitative study. The definition of quantitative and qualitative studies outlined above might seem excessively strict. However, our point is not that all quantitative arguments always need to be expressed in numbers, merely that such arguments must be recognized as fundamentally quantitative, as we stated in principle 8 (see section 2.2.8). The use of ordinary language to express quantitative facts on an ordinal scale may be fully justified, depending on the context. For instance, we might state that in present-day English corpus texts the determiner the is much more frequent than the noun book. This well-established fact is uncontroversial and, if part of an argument, would not normally need to be directly backed up by probabilities (although we could provide such probabilities to back up the claim if challenged). However, in a more controversial argument, the probabilities ought to be made explicit, because such a move makes the argument more transparent and open to criticism, which in turn makes it a much stronger argument, provided that it prevails against the criticism.
Carrier (2012) makes the point that such explicit quantification of an argument forces the opposing side to do the same and quantify their argument, lest they be left with a much weaker one. Thus, we contend that any claim in historical linguistics that does not simply involve a binary choice (present/absent, true/false), but somehow resorts to degrees, is inherently quantitative. Furthermore, our position is that any quantitative claim that is not completely uncontroversial ought to be explicitly quantified, if at all possible. If this is not possible, then the person making the claim must accept that the lack of quantification leaves him or her with an epistemologically weaker argument. This follows from principle 2 (outlined in section 2.2.2), that all conclusions must be based on publicly observable facts, and from principle 4 (section 2.2.4), that some claims are stronger than others. If the claim supports a certain level of certainty, then making it explicit by means of numbers makes the claim more accessible to public scrutiny and criticism, i.e. stronger. Conversely, if a claim is left expressed on an ordinal scale in everyday language, then the claim is less open to public scrutiny, since what counts as very frequent in ordinary language may vary depending on the context and on personal beliefs. The latter are by definition not accessible to objective inspection and hence carry no weight in an argument regarding the empirical facts of historical linguistics. We agree with the point made in Gries (2006b, 198) that the only data provided by corpora are quantitative, and that the logical consequence of this is that corpus data ought to be subject to quantitative analysis. That corpora can be employed as repositories of examples is not a counterargument to Gries's claim. Irrespective of how they are used, corpora are full of quantitative data.
Of course, we have no objection to the practice of picking illustrative examples from corpora as a means of showcasing the phenomenon under investigation. Quite the contrary, we believe that such examples are better taken from corpora than from dictionaries or textbooks (or made up), whenever this is possible, but only as illustrations of a phenomenon (or a source of hypotheses; see Chapter 1), and not as the evidential basis for the research itself. One swallow does not a summer make, and one corpus example (or a handful of such examples) does not constitute data in any meaningful sense, unless the aim is to demonstrate that the construction or phenomenon under investigation occurs in the corpus, or that a particular constellation of phenomena or features occurs together. We consider this point sufficiently important to repeat it: a qualitative, example-based approach to corpus linguistics allows the historical linguist to state that the phenomenon being investigated appears in the corpus material, period. Of course, there is always the risk that such an occurrence represents some kind of error, as we stressed in section 2.4.3. However, even if we discount errors, the occurrence of the feature or phenomenon represents a modest level of evidence. Since language is varied and subject to a number of different types of influences, it is important to know whether some feature or phenomenon is common or rare (either in general or in some specific context). Qualitative evidence (examples) cannot inform us about this rarity. Furthermore, qualitative evidence has nothing to say about non-occurrence, since a particular feature or phenomenon can be absent from a corpus for a number of very different reasons: sampling errors in the corpus construction, sparsity in the written records, skewed representation in the extant written records (typically towards male-dominated elite language characteristic of registers associated with writing), or combinations of all three.
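One way to make the contrast concrete is to compare observed counts with expected ones: frequency information lets us estimate how many occurrences we would expect, so that even zero hits become interpretable. A minimal sketch under a simple Poisson assumption (the rate and corpus size below are invented for illustration, not taken from any study):

```python
import math

def expected_count(rate_per_million, corpus_size):
    """Expected number of occurrences given a per-million-word rate."""
    return rate_per_million * corpus_size / 1_000_000

def prob_zero(expected):
    """Probability of observing zero tokens under a simple Poisson model."""
    return math.exp(-expected)

# Hypothetical case: a feature attested 5 times per million words in
# comparable material, searched for in a 2-million-word corpus.
mu = expected_count(5, 2_000_000)   # we would expect about 10 occurrences
p0 = prob_zero(mu)                  # so zero hits would be very surprising
```

If the expected count were close to zero, an observation of zero occurrences would be unremarkable; if it is high, as here, absence calls for an explanation.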
Frequency information, on the other hand, can be used to estimate the expected number of occurrences, with which the observed number (e.g. zero observations) can be compared. If expected and observed numbers converge, then an observation of zero occurrences would be entirely undramatic. Conversely, if the observed and expected frequencies are sufficiently divergent, a case could be made for some linguistic explanation (provided we have taken into account the other reasons for missing data mentioned above). This is what Stefanowitsch (2005, 296) refers to as the 'expected-frequency epiphany', which allows us to convert raw counts into linguistically meaningful scientific facts. A strictly qualitative approach is unable to make the observed–expected distinction in a principled manner, leading inevitably either to imprecise or faulty estimates of frequency information, or to the abandonment of frequencies as a source of information altogether. The resulting loss of information regarding the frequency-governed aspects of language, such as word frequencies (Baayen, 2001; Köhler, 2012), constrains the linguist to focus on the occurrence (or non-occurrence) of phenomena and features. We consider this detrimental to the enterprise of historical linguistics. Instead, subsequent chapters will show that quantitative information is an important aspect of historical linguistic research, and furthermore that certain standards must be adhered to in order to fully exploit this source of information. The next section will discuss the core arguments levelled against quantitative and corpus methods, and show that they do not hold for historical linguistics.

. The case against numbers in historical linguistics

This section will counter the core arguments against the use of corpora and quantitative methods in historical linguistics.
Not all arguments are specific to historical linguistics, so the refutation of the arguments will also be more general in some cases. Before dealing with the substantive arguments against quantitative and corpus methods in general, we must dispense with the straw man of glottochronology. Section 3.3 provided an overview of glottochronology, and discussed its role as the whipping boy of quantitative research in historical linguistics since at least the 1970s. In hindsight, it is safe to say that the initial enthusiasm for glottochronology as the future of historical linguistics was premature. Glottochronology was a structure built on shaky methodological and mathematical foundations, whose flaws served to marginalize it in little over a decade after it was first proposed. However, seen from a certain perspective, the case of glottochronology is a good example of normal scientific principles operating as they ought to: novel proposals and experimentation, followed by empirical testing and methodological criticism, leading (in this case) to rejection of the suggested approach. Nevertheless, we need to keep in mind exactly what was being rejected. It would be a logical fallacy to assume that the rejection of one specific approach constitutes a wholesale refutation of the usefulness of quantitative methods in historical linguistics. Glottochronology focused on lexical items, whereas the success of the historical comparative method is founded on considering lexical items in the light of phonology and morphology, as is evident from such works as Campbell (2013, 232–3) and Pereltsvaig and Lewis (2015).
Thus, if we compare the success of the traditional method in its most successful domains (phonology and morphology) with the misadventure of glottochronology applied to exclusively lexical material (an area where the traditional comparative method also encounters difficulties), we are hardly comparing like with like. That glottochronology was unsuccessful, and perhaps started with poor odds due to its focus on lexical material, is another matter, since the method needed to be tested in any case before being judged. However, we are less concerned about glottochronology per se than about the fact that criticism of glottochronology in particular runs the risk of deteriorating into an unfounded, blanket criticism of all forms of quantitative methods in historical linguistics. As mentioned above, this is a formal, logical fallacy (technically an illicit minor) of the following form:

1. Glottochronology is a flawed research method.
2. Glottochronology is a quantitative research method.
3. Therefore, all quantitative research methods are invalid.

As should be evident from presenting the argument in this form, the syllogism is invalid, since glottochronology is only one among many possible quantitative research methods in historical linguistics, and there is no logically necessary connection between the failure of one such method and the viability of others. However, we find evidence of such unfortunate tendencies towards logically invalid blanket critique of quantitative methods in Campbell (2013). That specific, lexically based quantitative methods in historical linguistics can be criticized without throwing quantitative methods in general out with the bathwater is thoroughly demonstrated in Pereltsvaig and Lewis (2015).
The authors explicitly state that they take no issue with quantitative methods from computational and corpus linguistics applied to historical problems, despite delivering a sharp criticism of phylogenetic methods in historical comparative linguistics. Thus, we wish to embrace a position in which criticism of individual methods (be they qualitative or quantitative) is possible without wholesale rejection of an entire mode of reasoning (see also the discussion of quantitative and qualitative methods in section 3.6). It is perhaps no coincidence that the general criticism of quantitative methods in historical linguistics resonates with criticism put forth by previous critics of such methods in linguistics generally. Campbell's position can thus be seen as a special case of arguments against quantitative methods in linguistics more generally. While proponents of quantitative methods in linguistics may agree in many ways, the arguments against such methods can be made from a more diverse range of positions. However, some organizing principles can be detected, and some of the arguments against quantitative methods will be discussed below, categorized as follows:

(i) quantitative methods are potentially useful, but not very convenient or practical;
(ii) quantitative methods are potentially useful but redundant, since the same results can be achieved by other means;
(iii) quantitative methods are useful, but should be limited to certain types of linguistic problems;
(iv) quantitative methods cannot as a matter of principle contribute to the goals of linguistics.

The arguments have been ordered from the weakest in epistemological impact ('quantitative methods are not convenient') to the strongest ('quantitative methods are unable to contribute to linguistics as a matter of principle').
Below we will reject all these arguments and show that quantitative methods have an important role to play in linguistics and in historical linguistics.

.. Argumentation from convenience

Bloomfield (1933, 37) argued that a detailed statistical study of language use would be very informative, particularly for studies of language change. However, having made this point, he immediately dismissed it as unnecessary. His argument was that since language is a convention-bound activity, all that the linguist really needs to do is describe the norms that govern this convention-bound activity, i.e. the grammatical rules. Bloomfield's motivation comes across as at least partly pragmatic, since he refers to the simplicity of the latter method compared with the former (Bloomfield, 1933, 37). He took the same view of language change, noting that an inventory counting every use of every linguistic form would be welcome. However, after having noted that this is 'beyond our powers', he reassured the reader that such a record is not necessary, since the changes can in any case be deduced by comparing linguistic (structural) systems diachronically (Bloomfield, 1933, 394–5). Given the resources available to Bloomfield, his views were not entirely unreasonable, although, as section 3.2 showed, some of his contemporaries were experimenting with statistical approaches. His argument has nonetheless been superseded by the technological developments since 1933. The availability of large, easily accessible corpora, fast and cheap computers, and advanced statistical software packages means that what was a major (perhaps an insurmountable) undertaking at the time when Bloomfield was writing has become not only achievable, but in some cases outright trivial. Of course, Bloomfield the structuralist was not merely discounting the method of counting for practical reasons, a point to which we return below.
Nevertheless, it is undeniable that some of his arguments come across as remarkably pragmatic and convenience-based. More recently, Mair (2004) has presented a modified version of this argument, arguing that when 'superficial' statistical analyses (Mair discusses raw counts and proportions) are inadequate, the linguist should turn to what he calls 'philological methods' (i.e. looking at examples in context) as a next step (Mair, 2004, 134). The obvious alternative, turning from a 'superficial' to a more advanced statistical analysis, is surprisingly absent. Since Mair is clearly open to using quantitative evidence to start with, the reluctance to recommend more advanced quantitative methods cannot stem from some principled dismissal of their value, and we can only assume that he does not consider them a practical or cost-effective alternative. Thus, Mair's advice to consider only qualitative methods when simple quantitative methods are insufficient appears to be a variant of the argument from convenience. The argument from convenience has little or no persuasive force at present, although some exceptions need to be made for historical languages where so little data is available that no quantitative investigation is meaningful. We would nevertheless argue that in such cases the argument is not one of convenience, but rather of assessing whether statistical methods can correctly and meaningfully—not conveniently—be applied. The availability of historical corpora is increasing, and the tools to create corpora are also becoming more sophisticated. Although computational tools for creating linguistic electronic resources for historical language varieties tend to lag behind the tools for (large) contemporary languages, the situation is undoubtedly improving (Piotrowski, 2012; McGillivray, 2013).
In conclusion, there is ample reason to dismiss the convenience argument as a relic of the past.

.. Argumentation from redundancy

An argument independent of technological infrastructure is the view that quantitative methods in linguistics are potentially useful, but in practice redundant, since they inevitably end up discovering nothing more than what has already been established by means of traditional methods. As with the previous argument, contemporary advocates can be found, but the roots of the argument can be traced back in time. Structuralism espoused a static view of synchronic language (or rather langue, as opposed to the usage captured by the term parole) as a natural object with a set of rules to be discovered and described in qualitative terms, e.g. by means of algebraic relationships and set-theoretical notions of class membership (Harris, 1993, 16–24; Köhler, 2012, 13). In such a system, the numerical distribution of an item (phonemes, words, syntactic units, etc.) may be recognized as potentially informative, as Bloomfield (1933) hinted; however, it is equally clear that quantification was not crucial. The task for the (synchronic) American structuralist in Bloomfield's view was to observe the 'speech habits of a community without resorting to statistics', and to record and report the results as objectively and conscientiously as possible (Bloomfield, 1933, 37–8). Chapter 22 of Bloomfield's book, which deals with fluctuation in language forms over time and thus (inevitably) with changing relative frequencies, is remarkable in how it sidesteps the entire issue of quantification by arguing that a number of correlates (or proxies) for statistical quantification can be used instead.
The message is again clear: numbers would be nice if we could get our hands on them, but there is no need to worry about that, since we are already capable of observing and studying all the relevant factors without quantification: algebra (a qualitative branch of mathematics), not probability theory, is the only mathematical tool needed, if any. The conclusion that quantitative, or probabilistic, methods are redundant is continuously present between the lines. The same argument is advanced by Campbell, although when presented in its purest form the author attributes the viewpoint to some unspecified and unnamed historical linguists. The most forthright articulation is found in the following passage from Campbell (2013, 471):

Some say that if a solution to a particular problem cannot be reach [sic] by tried-and-true historical linguistic methods, then they cannot trust a proposed mathematical solution, but at the same time they ask: if a solution is provided by standard linguistic methods, then what is the need for the mathematical solution in the first place?

Note that the argument here has shifted subtly from the position occupied by Bloomfield. For Campbell (or the unnamed historical linguists to whom he attributes the view), the primary problem with quantitative methods (and thus the cause of their redundancy) is not their difficulty, or the supremacy of observations of a social communicative structure or norm. Instead, the suggested problem is the quantitative method's reliability, or rather the lack thereof.
The same redundancy is implied when Campbell (2013, 485) again lets others speak for him: 'To most traditional linguists, the scholars who have invested in quantitative approaches to historical linguistic questions have appeared to progress by gradually reinventing the wheel.' Campbell's slightly sheepish rhetorical shuffle of letting other linguists (unnamed and uncited) speak for him whenever he is unleashing a full broadside against quantitative methods can be interpreted in at least two ways, neither of which excludes the other. The most obvious interpretation is of course that he misrepresents quantitative methods. He appears to briefly consider the possibility that his rendering is 'not fair' before retreating to his original position (Campbell, 2013, 486). The second interpretation is that Campbell recognizes that his arguments are not particularly strong and consequently resorts to reporting what comes across as departmental lunchroom hearsay in order to put quantitative methods in a less than flattering light. The argument from redundancy can be rebutted by following a line of reasoning described by Gibson and Fedorenko (2013) in their argument for quantitative methods in synchronic linguistics. Were the criticism against quantitative methods presented above true, i.e. were quantitative methods in historical linguistics redundant, we would expect the following: first, no study should exist in the published historical linguistics literature which disproves conclusions of traditional, qualitative studies by means of quantitative methods. Second, the avoidance of quantitative methods should not have harmed or impeded the progress of linguistics in general, or historical linguistics in particular.
On the other hand, if we do find examples of conclusions from qualitative studies being disproved by quantitative methods, or find that progress in linguistics and historical linguistics has been impeded through a lack of quantitative methods, the only possible conclusion must be that quantitative methods are not redundant. As it happens, studies that employ quantitative methods to correct linguistic conclusions based upon qualitative methods do exist. A prime example from synchronic linguistics is Bresnan et al. (2007), who studied the so-called English dative alternation (i.e. the existence of parallel syntactic structures like 'she gave him the book' and 'she gave the book to him'). Previous studies relying on qualitative methods, such as Levin (1993) and many others, had concluded that certain verbs could only occur with one of the two constructions. The consensus was that the dative alternation was too difficult to characterize properly in terms of a grammatical system (Bresnan et al., 2007). However, using quantitative methods, Bresnan et al. (2007) demonstrated that properties such as animacy and discourse accessibility, alongside formal and semantic features, could predict the dative alternation. Furthermore, drawing on usage data rather than intuitions, Bresnan et al. (2007) found well-formed examples of verbs occurring with constructional variants that had been proclaimed ungrammatical by previous qualitative studies. Another synchronic example, similar in many ways, is the study by Grondelaers et al. (2007), which deals with the use and non-use of the Dutch existential er (similar to English existential there) in postverbal position. According to standard Dutch grammars, no rules could be formulated for the presence or absence of this morpheme in postverbal position (Grondelaers et al., 2007, 152).
However, using corpus data and quantitative methods, they found regional differences between Belgian and Netherlandic Dutch, but also systematic variation with respect to register (for Belgian Dutch), as well as the topicality and concreteness of preverbal adjuncts. As with the study by Bresnan et al. (2007), what appeared chaotic to the naked eye looking for deterministic rules was in fact systematic variation governed by probabilistic rules. Turning to historical linguistics, we can find similar examples. One such example is the case of referential null subjects in Old English and Middle English. Rusten (2014) and Rusten (2015) report on corpus-based quantitative studies of referential null subjects from Old English to early modern English using data from syntactically annotated corpora. Referential null subjects (or ‘pro-drop’) are an established typological parameter for language classification. This is also a feature which has gradually been lost in Germanic languages. Historical studies of this phenomenon in historical varieties of English have mainly relied on qualitative methods (Rusten, 2014, 250). Despite few systematic studies beyond the Old English period, the phenomenon has been proclaimed ‘quite frequent’ in later periods of English by previous studies (Rusten, 2014, 250). However, Rusten’s detailed longitudinal study of a sample of texts from Old, Middle, and Early Modern English revealed that referential null subjects in those texts were extremely infrequent in all periods, which raises important questions about the status of the pro-drop rule in earlier stages of English, a rule which previous qualitative studies had assumed rested on much firmer empirical foundations. Without leaving the topic of English historical syntax, we can consider another case, namely the question of word-order change.
Old English is typically considered to have had some form of verb-second constraint, even if this constraint was less rigid than in other Germanic languages (Allen, 1995, 32–6; Fischer, 1996). This verb-second constraint was lost somewhere in the transition to Middle English; however, the exact details remain disputed. One attempt at explaining this change is the degree-zero learner hypothesis (Lightfoot, 1989, 2006), which rests on the assumption of an abrupt decline in verb-final word-order patterns in subordinate clauses. However, as Heggelund (2015) demonstrates, the data from Old English and Middle English corpora in fact disprove this hypothesis. The two examples above are by no means trivial. They deal with fundamental questions regarding the typological status and trajectory of change in earlier stages of English, a well-researched language. In both cases, a lack of proper quantitative investigation has hampered theoretical development in the respective areas of historical linguistic research, since hypotheses that ought to have been culled (or modified) by empirical means have remained to clutter the picture. In both cases, the questions are undoubtedly important (both relate to the historical–typological status of earlier stages of English) and cannot be brushed aside as peripheral or as instances of merely clarifying residual details. The examples above clearly demonstrate that quantitative methods are not redundant, neither in historical linguistics nor in linguistics in general. Quite the contrary: quantitative methods serve as necessary corrective steps for hypotheses formed on the basis of qualitative methods.
When generalizations are made based on qualitative methods alone, the results are at risk of missing important variation in the data, whether due to cognitive bias, small samples (which may contain too few types or tokens to reveal the full variation), or a failure to detect complex sets of connections between properties that are more readily disentangled by computers. Furthermore, relying exclusively on qualitative methods is potentially harmful to linguistics, since theorizing and development may be led astray by incorrect results, as Gibson and Fedorenko (2013) and Sampson (2005) argue. Far from being redundant, quantitative methods are a necessary part of the historical linguist’s toolkit, alongside qualitative methods. Rather than replacing qualitative methods altogether, quantitative methods should, in our opinion, replace them specifically for testing hypotheses that may or may not have been arrived at via qualitative means. As we stressed in section 2.4.3, we believe that quantitative methods should be a natural tool of choice for historical linguistics. Which particular problems those tools are best used for is a question for the next section.

.. Argumentation from limitation of scope

A superficially more relevant criticism of quantitative methods is that they are potentially useful, but that their scope is limited to certain types of problems or certain areas of research. Such views have been expressed by e.g. Sampson (2001) and Talmy (2007). Sampson (2001, 181) argues that while syntactic phenomena such as collocations and word order can benefit from quantitative studies, semantics (or ‘word meaning’) cannot. Talmy (2007, xiii), while implicitly admitting that other methods can also be used to study meaning, argues that qualitative methods (specifically introspection) constitute the superior tool in this case.
In historical linguistics, Campbell (2013, 491) writes that quantitative methods ‘hold out promise’ of understanding the role played by frequency of usage in lexical change, while remaining pessimistic about their application in other areas of historical linguistics. All three positions can be considered instances of the limitation of scope argument. If we assume that quantitative methods have a limited scope in linguistics, there must clearly be some areas in which their use is justified. Identifying collocations and other examples involving frequency of use would seem a good candidate for an area of linguistic research where the application of quantitative methods is uncontroversial. Conversely, the authors cited above could be interpreted as agreeing that semantics is the most difficult problem for quantitative methods to tackle. Consequently, the degree to which quantitative methods really can be applied to semantics must be considered a fair test of the limitation of scope argument. The limitation of scope argument is particularly pertinent in historical linguistics, since in most—if not all—cases no native speakers are available to give qualitative judgements, let alone perform any kind of introspective analysis. At this point we will emphatically take issue with the assertion by Fischer (2007) (see §1.1.3) that the linguist’s intuition, no matter how widely read she is in the language in question, has any value or status as evidence whatsoever. The postulate that the intuitions of linguists with something that putatively approaches native speaker competence in an extant language can be admitted as evidence is not compatible with our definition of evidence, which states that historical linguistic evidence must be open and accessible to everyone (section 2.1.3). Of course, this view should in no way be taken to indicate that we dismiss the value of philological knowledge or primary sources for extant languages.
However, we maintain that the intuitions arising from such knowledge are starting points or hypotheses, rather than facts with the ability to carry a logically valid argument (we exclude arguments from authority here, for obvious reasons). Thus, the putative benefits of native speaker judgements or native speaker intuitions over quantitative evidence have no bearing on the limitation of scope argument as far as historical linguistics is concerned. The limitation of scope argument would seem to predict that quantitative methods are either unsuccessful when applied to problems in semantics, or alternatively, that their successes (e.g. in terms of practical applications) are less interesting than the results arrived at without quantitative methods. However, the argument itself can be formulated generally, and as in previous sections we will provide counterexamples from both synchronic and diachronic linguistics. Campbell (2013, 222) writes that, traditionally, work in diachronic semantics has mainly been concerned with lexical semantics, and the types of change that lexemes undergo. However, even if lexical semantics has been the traditional focus in work on semantic change, we must not lose sight of the fact that other branches of semantics exist, such as grammatical semantics, formal semantics, and linguistic pragmatics (Cruse, 2011, 17–18). We will consider studies within any of these branches of semantics as being able to support our argument, which is that quantitative methods have a useful role to play in semantics. We must first rid ourselves of any notion that semantics is by definition outside the scope of quantitative methods. Such an argument falters already at the impossible effort of drawing a clear and consistent line between semantics on the one hand and other areas of linguistic scholarship on the other.
Defining semantics in terms of what can and cannot be investigated quantitatively runs the risk of circularity, and smacks of the ‘no true Scotsman’ logical fallacy: semantics would be redefined in terms that disqualify any branch of the field as soon as it benefits from quantitative research. Thus, we are prepared to admit studies from any branch of semantics as evidence against the argument from limitation of scope as it applies to semantics. If we look at the published literature, it quickly becomes apparent that an entire subfield, distributional semantics, is concerned with working out how semantics can be studied by quantitative corpus methods. Since early seminal studies such as Schütze (1998), the field has expanded and, in the words of Baroni (2013, 511), ‘Distributional Semantic Models, which automatically induce word meaning representations from naturally occurring textual data, are a success story of computational linguistics’. Distributional semantics is not only a case of practical applications: Lenci (2008) argues that it has theoretical import as well, especially for usage-based and functional theoretical fields of study. In synchronic linguistics, quantitative corpus methods have been applied to semantic problems like the selectional preferences of verbs (the semantic restrictions verbs place on their arguments) and the semantics of verb classes, as in e.g. Schulte im Walde (2004 and 2007) and Lenci et al. (2008). These examples of grammatical semantics are further complemented by attempts to integrate formal semantics with quantitative methods and distributional semantics (Baroni and Zamparelli, 2010). Thus, there is a rich literature describing the use of quantitative corpus methods in semantics. Such efforts have proved particularly useful in grammatical semantics when dealing with verb subcategorization or selectional preferences. We have already mentioned Bresnan et al.
(2007), whose study of the English dative alternation used a quantitative, corpus-based approach to correct previous theorizing based on native speakers’ intuitions or anecdotal evidence. The study is, however, also relevant to grammatical semantics. As the discussion in Manning (2003) makes clear, selectional preferences and argument structures are better handled in probabilistic terms than as discrete and categorical rules. In the fields of historical linguistics and linguistic change, we also find examples of quantitative methods employed fruitfully for semantic purposes. Barðdal et al. (2012) use dimensionality reduction models based on the occurrence and non-occurrence of verb classes as an aid to reconstructing the semantics of the dative subject construction in Indo-European. McGillivray (2013, 78–87) shows how the framework for using quantitative methods to study verb subcategorization can be extended from contemporary languages to a non-extant language such as Latin. Meanwhile, Jenset (2013) similarly used dimensionality reduction methods on corpus data to more precisely describe the semantics of existential there in early English, in a corpus-driven approach to lexical meaning defined as patterns of co-occurrence (Cruse, 2011, 215–22). In summary, the limitation of scope argument, which states that some areas such as semantics constitute a no-go area for quantitative methods in linguistics, can only be sustained if semantics is narrowed down to exclude quantitative methods by definition. As we stressed above, this would lead to a circular argument (‘quantitative methods are not applicable to semantics because we define semantics as that which cannot be studied quantitatively’). We believe that the erection of such barriers between semantics and other areas of linguistic research would require much stronger arguments than those we have argued against here.
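The mechanics behind the distributional semantic models discussed above can be made concrete with a minimal sketch. The toy sentences, window size, and function names below are our own illustrative assumptions, not drawn from any of the cited studies; real models add weighting schemes and dimensionality reduction on top of raw co-occurrence counts:

```python
from collections import defaultdict
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words occurring within +/- window positions."""
    vectors = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors

def cosine(v1, v2):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(count * v2.get(word, 0) for word, count in v1.items())
    norm1 = sqrt(sum(c * c for c in v1.values()))
    norm2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

sentences = [
    ["the", "cat", "purred"],
    ["the", "dog", "barked"],
    ["the", "cat", "slept"],
    ["the", "dog", "slept"],
]
vectors = cooccurrence_vectors(sentences)
# 'cat' and 'dog' occur in similar contexts, so their vectors are close.
print(cosine(vectors["cat"], vectors["dog"]))     # ~0.83
print(cosine(vectors["cat"], vectors["purred"]))  # ~0.58
```

On this toy corpus, cat and dog come out as more similar to each other than either is to its co-occurring verbs, which is the basic sense in which such models ‘induce word meaning representations’ from distributional data alone.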
In Chapter 1 we presented the defining principles of our approach, and stressed that only data that are open to public scrutiny can be admitted as evidence in historical linguistics. The corollary of this principle is that empirical evidence trumps intuition. It does not follow that the empirical evidence in question needs to be quantitative, but nor does the principle preclude quantitative evidence. As we have argued, any non-circular (i.e. non-trivial) definition of semantics must take into account subfields where quantitative evidence has made a substantial difference. The conclusion is that anyone wishing to defend the argument from limitation must retreat into a trivial, straw-man-like position arguing that quantitative evidence is not applicable to a particular semantic question or some highly specific subfield within semantics. In any case, the argument from limitation in its strong form cannot be upheld. That leaves us with arguments against quantitative evidence based on deeper principles.

.. Argumentation from principle

The strongest possible refutation of quantitative methods is based on the axiom that probability and quantification are in principle uninformative when it comes to language. Such positions can be arrived at via several distinct paths. However, we argue that this position, while possible, is less desirable than the alternative of accepting quantitative methods in historical linguistics. The two positions we will consider are the following: (i) linguistics (including historical linguistics) is inherently qualitative; (ii) the primary object of linguistic study is competence, not performance. The first argument, that linguistics and also historical linguistics are inherently qualitative, has been around for some time.
It can be found embedded in the quote from Bloomfield (1933, 37–8) (see section 3.7.1), which states that the convention-bound nature of language makes statistics redundant (see also section 3.7.2). That this was a view shared by other structuralists is shown by the following quote from Joos, cited in Manning (2003, 290): ‘[linguistics] does not even make any compromise with continuity as statistics does . . . All continuities, all possibilities of infinitesimal gradation, are shoved outside of linguistics in one direction or the other.’ As Manning (2003, 290) points out, this view implies casting language as a set of discrete (or categorical), qualitative rules. However, Manning (2003), along with previously cited studies such as Bresnan et al. (2007) and Grondelaers et al. (2007), demonstrates the problems of preserving such a set of rigidly categorical rules in the face of linguistic data. Thus, in exchange for clear, algebraic, qualitative rules, we risk losing the details of gradience and variability found in language. As we have attempted to illustrate above, such a move would entail a dire loss of empirical descriptive power. If language is probabilistic, as e.g. Manning (2003) suggests, then discrete rules will necessarily be only an imperfect approximation. Even if one is sceptical of the idea of language as inherently probabilistic, the difficulty involved in measuring language (whether language use or grammatical competence) with perfect precision entails that a probabilistic model is nevertheless the best choice, since it will be better able to deal with imperfect measurements than a categorical model. In other words, the view that languages consist of qualitative, categorical rules at a deeper level does not contradict modelling language probabilistically.
In fact, as Zuidema and de Boer (2014) argue, creating linguistic models by both qualitative means (rules) and probabilistic means (quantitative techniques) offers greater insight into the underlying reality we try to capture. This is what Zuidema and de Boer (2014) call ‘model parallelization’ (see section 1.2.2). One way of doing such model parallelization is building a statistical model based on qualitative analyses, such as those found in treebanks, an example of what Zuidema and de Boer (2014) call ‘model serialization’. Thus, the argument that language, and linguistics, are inherently qualitative is simply not an argument against quantitative modelling, since many things that are inherently qualitative can be understood via statistical models. This general point is well illustrated by an example from public health research given in Gelman and Hill (2007, 86–101), who present a statistical model of which wells are used by local communities in Bangladesh. Of course, the use of a particular well is qualitative (specifically, binary: you use it or not). Gelman and Hill show that this qualitative choice is conditioned by both the level of pollution in the well and the distance to other non-polluted wells. Similarly, even if language were to be proved inherently qualitative, such a finding would have no relevance for the usefulness of understanding language by means of statistical models. The second argument, that the real object of linguistic enquiry is not performance or language use, but linguistic competence, is related to but still subtly different from the first argument. It is similar in the sense that it potentially rejects the use of probabilities derived from usage in linguistics. In particular, a concern here is whether or not corpora are capable of providing negative evidence to establish the limits of linguistic competence (Ringe and Eska, 2013, chapter 1).
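The well-choice example from Gelman and Hill can be imitated in miniature: a logistic regression models the probability of a binary, qualitative outcome as a function of a continuous predictor. The data and the plain gradient-ascent fitting routine below are our own toy illustration (Gelman and Hill fit far richer models):

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=4000):
    """Fit P(y=1 | x) = sigmoid(a + b*x) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(a + b * x)  # residual on the probability scale
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

# Toy data: outcome is 1 ('use the well') at short distances, 0 further away.
distances = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
outcomes = [1, 1, 1, 0, 0, 0]
a, b = fit_logistic(distances, outcomes)
# b comes out negative: the probability of use falls with distance, even
# though each individual observation is strictly qualitative (use/non-use).
```

The fitted model assigns each observation a probability rather than a category, which is exactly the sense in which a qualitative, binary choice can still be modelled statistically.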
However, the argument is also subtly different, since it implicitly acknowledges that even if studying language use by means of frequencies does not fall under the purview of linguistics, such an approach can nevertheless still be carried out to serve other uses. This position has most famously been articulated by Noam Chomsky in several publications, arguing that linguistics should concern itself with an internal language, or a language capability, rather than the external patterns of language use. Chomsky has compared the latter activity to studying bees by videotaping bees flying⁴ or collecting butterflies.⁵ Manning (2003, 296) shares Chomsky’s concern for the importance of going beyond description in order to identify explanations (as do we). However, as Manning (2003, 296) also points out, any explanatory hypothesis that is ‘disconnected from verifiable linguistic data’ also ought to give rise to some concern. Incidentally, Manning also criticizes corpus linguistics for being overly concerned with ‘surface facts of language’, despite (or perhaps because of) its empirical approach. Manning (2003) goes on to argue for a model where syntax is seen as inherently probabilistic, an approach that shares many points of contact with that outlined in Köhler (2012). We sympathize with these probabilistic approaches, but do not consider them necessary for refuting position (ii) as far as historical linguistics and the purposes of the present volume are concerned. A key contested point regarding linguistic competence as the primary object of linguistic study is the issue of negative evidence (see Carnie, 2012, §3.2; Ringe and Eska, 2013, chapter 1). However, the primary concern of statistical modelling is not merely the observed corpus counts, but the difference between the observed data and the counts we would expect under different circumstances. Attending only to the former is what Stefanowitsch (2005) calls the ‘raw frequency fallacy’.
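One concrete way of reasoning beyond raw observed counts is the Good–Turing insight that the number of hapax legomena (frequency-1 types) estimates the probability mass of types not seen at all. The sketch below recovers only this headline quantity; the full procedure in Gale and Sampson (1995) additionally smooths the frequencies of frequencies, and the toy token list is our own:

```python
from collections import Counter

def unseen_mass(tokens):
    """Good-Turing estimate of the total probability of unseen types:
    the share of tokens accounted for by frequency-1 types (hapaxes)."""
    freqs = Counter(tokens)
    n1 = sum(1 for count in freqs.values() if count == 1)
    return n1 / len(tokens)

tokens = "the cat sat on the mat".split()
# Four of the five types are hapaxes, so 4 of the 6 tokens' worth of
# probability mass is reserved for types this tiny 'corpus' never shows.
print(unseen_mass(tokens))  # 0.666...
```

The point for the negative-evidence debate is that a corpus licenses statements not only about what it contains, but, via estimates like this, about how much it is likely to be missing.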
There are well-known techniques for estimating the number of unseen items in a corpus (Gale and Sampson, 1995). Pereira (2000) uses statistical techniques and corpus data to estimate the considerable difference in probability between a grammatical and an ungrammatical sentence, the key point being that both sentences have an observed frequency of zero in the corpus. Thus, to the extent that the argument against corpora relies on the status of negative evidence, it should be clear that negative evidence is also something that can be approached fruitfully from a corpus-based, quantitative angle. On a more practical level, it is also worth noting that there is a noteworthy history of corpus use within the generative tradition itself. Gelderen (2014) argues that generative linguistics has traditionally been sceptical of corpus data, due in part to the framework’s distinction between surface and underlying structures, as well as the norm of basing theoretical statements on native speaker intuitions. However, according to Gelderen (2014), interest in phenomena such as word-order variation, argument structure, and information structure has accelerated interest in and acceptance of corpora within generative linguistics. Moreover, the research tradition of variationist diachronic generative linguistics routinely relies on quantitative techniques, as evident from e.g. Kroch (1989) and Pintzuk (2003). This particular tradition of research has spurred the creation of syntactically annotated historical corpora, including Taylor et al. (2003), Kroch and Taylor (2000), Kroch and Delfs (2004), and Wallenberg et al. (2011b), to name a few. Thus, the initial reluctance toward corpora in generative linguistics, broadly construed, noted by Gelderen (2014) might be waning.

⁴ See e.g. http://languagelog.ldc.upenn.edu/myl/PinkerChomskyMIT.html.
⁵ As cited in Manning (, ).
In fact, Lightfoot (2013, e28) states that ‘[i]f we can identify meaningful properties of I-languages, then that imposes limits on possible diachronic changes, and those limits, alongside the contingent environmental factors, explain why changes take the form they have.’ This quote appears to give priority to I-language, i.e. native speaker competence; however, we do not see how the phrase ‘contingent environmental factors’ can have any other interpretation than a probabilistic one. And, in historical linguistics, probabilistic evidence is strongly connected with corpus data, as specified in principle 10 (section 2.2.10). Further, for historical language varieties, such probabilities can only be reliably obtained from corpora. It follows that if the limits imposed by I-language are played out alongside contingent factors, then measuring the contingent factors through corpus data quickly becomes an empirical hypothesis test of the extent and manner in which I-language and contingent factors compete in explaining diachronic change. To sum up, neither point (i) nor point (ii) presents a strong case against corpora and quantitative evidence. As we saw, even if language were to be conclusively proven to consist of rigidly discrete, qualitative rules, this would not invalidate the usefulness of probabilistic models of such rules. Similar applications of probabilistic models to discrete, or qualitative, phenomena can be found in other scientific disciplines. This implies that, against such a broad scientific consensus, a much stronger argument than the one presented in (i) would be required. Furthermore, the objection in (ii) was found wanting in so far as it concerns the importance of studying linguistic competence as a means to discover the limits of grammaticality. We gave examples from the research literature showing that quantitative methods and corpora can be employed for estimating the probability of grammatical and ungrammatical sentences.
It is also worth noting that, as Gale and Sampson (1995) attest, these are not statistical procedures specific to linguistics, but general statistical techniques applied to linguistic data. As with (i), this implies that a much stronger argument (engaging with the statistical details of those techniques) than what is presented in (ii) would be needed to refute corpora and statistical techniques on a principled basis. Finally, we note that in practical terms, there does not seem to be a clear distinction between corpus use and non-corpus use drawn along lines corresponding to generative and non-generative linguists. Thus, even the arguments from principle do not amount to a sound basis for rejecting corpora and quantitative methods in historical linguistics.

.. The pseudoscience argument

In section 3.7 we argued that the arguments against using quantitative corpus methods in historical linguistics do not stand up to scrutiny. But we must still deal with the challenging question of how well linguistics lends itself to the techniques of statistics. To put it bluntly, are linguists engaging in pseudoscience if they take a quantitative approach? The exact argument we are challenging here might take different forms, but its strands can be paraphrased as follows: quantitative historical linguistics is pseudoscience if all it involves is using p-values and other trappings of experimental or quantitative sciences as a mere rhetorical device. The reason we bring this discussion to the forefront is that there is some truth to it. If statistical tests are applied erroneously, if their assumptions are not matched by the data, or if their results are misinterpreted, then no amount of decimal precision can salvage the results of such testing.
An informed and thoughtful critique of the application of null-hypothesis testing in linguistics is presented in Kilgarriff (2005), who makes the legitimate point that null-hypothesis tests (e.g. Pearson’s chi-square test, Fisher’s exact test, and other similar statistical procedures) are all based on a common underlying logic: summarize the observed data into a test statistic, and compare that test statistic to an expected value arising from a theoretical scenario in which the data show no systematic patterns of association. If the difference between the two values is large enough to be labelled ‘significant’ (according to common conventions), we can safely assume that our observed distribution of counts is unlikely to have arisen under the randomness scenario. As Kilgarriff (2005) stresses, language is far from random, and given a sufficiently large sample, any null-hypothesis test will return a verdict of ‘statistically significant’. This latter property, spuriously significant results arising from a large sample size, is well known to statisticians and has been discussed as far back as the 1930s (Berkson, 1938; Mosteller, 1968; Cohen, 1994). Nevertheless, solutions to this problem can be found in sensible practices. Cohen (1994), writing about psychology, advocates reporting effect sizes and confidence intervals, rather than focusing narrowly on specific p-values. Similarly, Gries (2005), in a response to Kilgarriff, shows that applying effect size measures to corpus data will to a large extent solve the problem of spuriously positive results stemming from this inflation effect. However, a seemingly more problematic requirement is that of random sampling.
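Before turning to sampling, the inflation effect just described, and the effect-size remedy Gries proposes, can be demonstrated in a few lines. For a 2x2 table, the effect size Cramér's V is simply the square root of chi-square over N; the counts below are invented for illustration:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def cramers_v(a, b, c, d):
    """Effect size for a 2x2 table: sqrt(chi-square / N)."""
    return (chi_square_2x2(a, b, c, d) / (a + b + c + d)) ** 0.5

small = (60, 40, 50, 50)
print(chi_square_2x2(*small))   # ~2.02, below the 3.84 criterion at df=1
big = tuple(10 * x for x in small)
print(chi_square_2x2(*big))     # ~20.2, now highly 'significant'
print(cramers_v(*small), cramers_v(*big))  # identical weak effect, ~0.10
```

Multiplying every cell by ten multiplies the test statistic by ten while leaving the (weak) association exactly as it was, which is why reporting effect sizes guards against sample-size-driven ‘significance’.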
As taught in most introductory statistics courses, random sampling (where every unit of interest to the study has the same probability of being included in the sample) allows us to generalize from a sample to a whole population (Hinton, 2004, 51–7). In historical linguistics, the sample might correspond to a corpus built on extant texts from a language variety, and the population might correspond to the linguistic system (whether it is conceptualized as a social or a psychological one). However, viewed from a traditional sampling perspective, the composition of corpora is problematic, as Evert (2006) argues in a thought-provoking article. Like Kilgarriff (2005), Evert observes that language is not used at random. Quite the contrary: language is imbued with structure, which makes the assumption of random sampling required by null-hypothesis tests such as Pearson’s chi-square problematic (Hinton, 2004, 258), a fact glossed over in some published presentations of the chi-square test aimed at linguists. If the requirement of random sampling is unrealistic even for present-day language varieties, where increasingly vast collections of texts are being built, the situation would seem hopeless for historical linguistics, where we are left with whatever text material survived by being passed down in time through copying and preservation, often for specific reasons (such as author prestige or the topic of the text) rather than random selection. In his discussion of statistical testing for corpus linguistics, Evert (2006) proposes to adopt what he calls ‘the library metaphor’. Essentially, his point is that, although the words occurring in a particular order in any part of a corpus do not constitute a random sample, the decision to include a particular stretch of text in a corpus can be viewed in analogy with randomly picking a book off the shelves of a vast library.
As long as the ‘book’, i.e. the stretch of text, was not included in the corpus because of the particular words it contains (or the order in which they combine), the assumptions of random sampling are not too seriously violated, in Evert’s view. Although we find the argument in Evert (2006) intriguing, it should be clear that the metaphor works better for modern languages, where very large corpora can be built and where there is much written material to choose from, possibly from a wide range of genres. As mentioned earlier, historical corpus linguists face a convenience sample, i.e. a non-random sample that represents the data we could find. That is, the data might not only be few in number, but also neither random nor arbitrary. Consider for example the language represented in a hypothetical surviving canonical religious text, as opposed to the language represented in a lost, heretical religious text. The mechanisms ensuring the inclusion and exclusion of those two hypothetical language samples in a historical corpus are anything but random. However, this does not spell the end of statistical testing for quantitative corpus linguists. There are a number of compelling reasons to reject the strict random sampling requirement outright. First, the requirement is primarily motivated by the use of a statistical null-hypothesis paradigm that we consider outdated in some respects. This debate is not restricted to linguistics, as Cohen (1994) bears witness to. Although the arguments from Gries (2005) and Evert (2006) to some extent counter the objections against such a paradigm in linguistics raised by Kilgarriff (2005), it does not follow that such testing is the best way to proceed. As statisticians know well, the classical null-hypothesis tests were developed to find relevant differences between groups in small samples of experimental data.
This is simply not descriptive of the situation in corpus linguistics, since corpus data are not experimentally collected data. Instead, a more detailed understanding of both the data and the effects of small and skewed samples ought to guide our conclusions. By drawing on research on word-frequency distributions it is possible to draw inferences about the amount of missing, i.e. unseen, data. Such knowledge informs us about the degree of adequacy (which is always a matter of degree) of a given corpus for a given study. Jenset (2010, 167–72) does precisely that and concludes that the historical corpus at hand is adequate for the study being performed. This might seem questionable, but in fact the procedure rests on a principle that is entirely uncontroversial in historical linguistics, namely the uniformitarian principle that languages in the past behaved like today's languages, from which it follows that this principle also applies to word-frequency distributions. If this is not persuasive, we offer the point argued by Gelman (2012). Gelman, a professor of statistics and political science (and an active blogger on matters of statistics), argues that in the absence of random sampling, it is still possible to carry out meaningful statistical testing, since random sampling is neither an end in itself nor an absolute prerequisite. Instead, random sampling is merely one possible manner in which variation can be dealt with in the context of data analysis. Better statistical procedures are available, and ought to replace the classical null-hypothesis tests. Multilevel/mixed-effects models, as advocated by Harald Baayen (e.g. Baayen, 2008; Baayen, 2014) and Stefan Gries (e.g. Gries, 2015) to mention two prominent linguists, allow us to adjust for known biases in the corpus.
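The kind of unseen-data inference mentioned above can be sketched with a simple Good-Turing-style calculation (our illustration, not the actual procedure in Jenset 2010): the proportion of hapax legomena, words occurring exactly once, approximates the probability mass of word types missing from the corpus.

```python
from collections import Counter

def unseen_mass_estimate(tokens):
    """Good-Turing-style estimate of the probability mass of unseen
    word types: the number of hapax legomena (types seen exactly
    once) divided by the total token count."""
    freqs = Counter(tokens)
    hapaxes = sum(1 for count in freqs.values() if count == 1)
    return hapaxes / len(tokens)

# A toy 'corpus' of tokens, invented for illustration.
tokens = "arma virumque cano troiae qui primus ab oris arma qui".split()
print(unseen_mass_estimate(tokens))  # 6 hapaxes among 10 tokens -> 0.6
```

A high estimate suggests the sample has barely scratched the vocabulary of the underlying population; a low one suggests the corpus covers it reasonably well.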
This approach, which we take inspiration from, involves the desirable move away from simplistic null-hypothesis testing in the search for p-values towards a more meaningful (and potentially much more linguistically informed) modelling approach where numerous sources of variation are pitted against each other in a single statistical model in order to better capture a glimpse of the true complexities of language.

Summary

In this chapter, we have argued that quantitative methods, corpus methods, and quantitative corpus methods have a long history in historical linguistics. However, both technological and non-technological factors have prevented them from taking a more prominent role in the mainstream of historical linguistics work, even though the aims of historical linguistics are perfectly in line with a quantitative approach (see section 1.1). As we suggested in section 1.1, adoption of new technology is not merely a question of availability, and the results discussed in the present chapter seem to corroborate this view. In the preceding sections we have argued that a whole range of arguments against quantitative methods in historical linguistics do not hold water. The argument that such methods are inconvenient has little merit. We also showed the view that such methods simply reproduce results arrived at by qualitative means to be erroneous, since quantitative methods have an important role in checking whether the empirical facts square with arguments based on qualitative research. Furthermore, we showed that there is no principled reason to limit the topics to which quantitative methods in historical corpus linguistics (as defined in Chapter 1) can be applied, since such methods are not inherently limited to a single research topic like syntax.
We also argued that defining linguistics as inherently non-quantitative creates a tautology, and that historical linguists have a long tradition (irrespective of self-identification with a specific linguistic paradigm) of making claims based on quantitative data. We also demonstrated that arguments to the effect that quantitative methods do not belong in historical linguistics because of deficiencies, gaps, or peculiarities in the data or sampling do not hold. By dispensing with summary statistics or simple null-hypothesis testing as the only basis of quantitative argumentation, and instead adopting more sophisticated statistical modelling that better informs us about the strengths and weaknesses of the analysis, historical linguistics can better benefit from quantitative information. In short, the preceding sections have argued that not only are quantitative techniques a valid part of the historical linguist's toolbox; they are indispensable and can co-exist with other tools, including symbolic models. We have shown that quantitative corpus linguistics fills an important role in historical linguistics, and that such techniques are neither redundant nor limited to certain types of research questions. The subsequent chapters deal in more detail with how historical linguistics can best profit from the opportunities that corpora and quantitative evidence represent.

Historical corpus annotation

Content, structure, and context in historical texts

Empirical science needs evidence, and the primary type of evidence directly analysed in historical linguistics consists of written sources: text collections, fragments, word lists from language families, etc. (Joseph and Janda, 2003; Campbell, 2013). In order to make full use of written sources for historical linguistic investigation, we need to look at their content, as well as their structure and context.
For example, if we are interested in the evolution of a particular grammatical class (say, pronouns) in a given language, we need to be able to identify all instances of the pronoun class in the texts under consideration. This involves going beyond seeing the text as a continuous flow of characters and being able to dissect it in order to separate pronouns from the other words. Moreover, the location of pronoun occurrences in the text can be an important feature to be considered: do they appear in the title, in the appendix, or maybe in a footnote? To answer these questions, we must identify the internal structure of the text, and make it explicit in order to use it as a factor in the linguistic analysis. For instance, we need to be able to separate the instances of pronouns occurring in the title from those occurring in the appendix, if that is needed to test our hypothesis. Further, the information about when, how, where, and by whom the text was written is essential to place its language in the correct historical context and to model its contribution to diachronic change. Who is its author? What is the title of the work? We can answer similar questions by incorporating context information into the analysis. If the text reports recorded utterances, for example, knowing the demographic characteristics of the speakers may also help to explain the interconnections between social change and linguistic change. In addition, information on the contexts of use of the text and on its relationships to other written sources can be equally critical. Was the text published as a book? Did it have many readers? Where was it distributed? How many printed copies were produced? All this information can measure the volume and quality of the audience and the popularity of the text, thus contributing to a better interpretation of its language and shedding light on its relevance to the change in the general language.

Following a common practice in corpus linguistics and information science, we will use the technical term metadata to refer to the description of written sources from the point of view of their content, context, and structure. Metadata can be defined as 'the sum total of what one can say about any information object at any level of aggregation' (Gilliland, 2008, 2). As far as historical linguistics is concerned, archival and manuscript metadata, as well as bibliographical metadata, are among the most important types of contextual and structural metadata about texts, and include information on author, title, publisher, year of publication, etc. Historical texts present an additional challenge compared with most contemporary texts, because in some cases this information may be lost for ever. In this chapter we will discuss the relationship between data and metadata in historical linguistics, with particular attention to the way we can collect information about the texts (when it is available) and the language represented in historical corpora, and how this can be achieved through corpus annotation. The links from corpora to external resources such as demographic databases will be covered in Chapter 5.

The value of annotation

According to principle 10 discussed in section 2.2.10, corpora are the prime source of quantitative evidence and are therefore an essential element of quantitative historical linguistics. However, linguistic analyses that can be performed on so-called 'raw text corpora' are limited. To take just one example, if we want to investigate the registers used in a play, we may want to extract the lists of words uttered by each character.
In order to do that, we need to exploit metadata about the names of the characters and their utterances. The more information of this kind is encoded in a corpus, the more advanced analyses are possible on it. Annotation is the process of marking the implicit information about a text in an explicit and computationally retrievable and interpretable way (McEnery and Wilson, 2001, 32). Leech (1997, 2) defines annotation as 'the practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written language data'. In this chapter we will extend this definition to cover not only linguistic information, but also information about the context of the texts and their structure. Annotated corpora are very valuable because annotation adds structure, thus making language data into information (Lenci et al., 2005, 64–5). Annotation enriches corpora, and therefore makes them more powerful resources for linguistics research. Linguistic information in an annotated corpus can be more easily retrieved, as the range of searches that we can perform on a corpus largely depends on its annotation. For example, a lemmatized and morphologically annotated corpus contains the morphological analysis of all its words in terms of their lemmas and their morphological features. This allows the user not only to search for the single forms in a text (e.g. all passages containing the form gave), but also for all occurrences of a given lemma (e.g. all passages where any form of the verb give is present, or where stroke is a singular noun). Principle 11 (section 2.2.11) stresses the importance of multivariate approaches to studying language in general, and historical languages in particular. The combination of linguistic data and metadata through corpus annotation makes it possible to address this multidimensional perspective.
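A lemma-based query of the kind just described can be sketched on a toy lemmatized fragment (our invented example, not a real corpus format):

```python
# Toy lemmatized corpus: (token, lemma, part_of_speech) triples.
corpus = [
    ("She",   "she",  "pronoun"),
    ("gave",  "give", "verb"),
    ("him",   "he",   "pronoun"),
    ("a",     "a",    "determiner"),
    ("book",  "book", "noun"),
    ("and",   "and",  "conjunction"),
    ("gives", "give", "verb"),
    ("books", "book", "noun"),
]

def find_by_lemma(corpus, lemma):
    """Return all surface forms whose lemma matches, whatever the form."""
    return [token for token, lem, pos in corpus if lem == lemma]

print(find_by_lemma(corpus, "give"))  # ['gave', 'gives']
```

A search on the raw text for the form give would miss gave entirely; the lemma layer is what makes the query possible.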
Furthermore, annotated corpora allow us to gather and analyse quantitative information about language data, which makes it possible to enrich symbolic modelling of language with statistical modelling, as we discussed in sections 1.1 and 1.2.2. Finally, from a research infrastructure perspective, annotated corpora can be used over and over again, and in different contexts from those from which the annotation originated. Thus, they can form the basis for further analyses, and in this sense they constitute reference resources. In section 5.1.2 we will illustrate some examples of language resources built on existing annotated corpora. Using corpora (and transparent research processes) means that we can conduct replicable analyses, which we recommend in section 2.3 as one of the best practices.

Annotation and historical corpora

Nowadays modern language corpora are often collected from the web for the purposes of synchronic research. Examples of such web corpora are: the COW (COrpora from the Web) collection1 containing large web corpora for Dutch, English, French, German, Spanish, and Swedish with between 10 and 20 billion tokens; the UKWaC British English web corpus (Ferraresi et al., 2008); the ItWaC Italian web corpus (Baroni and Kilgarriff, 2006); and the Brigham Young Corpora.2 Unlike synchronic corpora, historical corpora present unique challenges in a number of different respects. First, the long history of philological research on some historical languages like Latin or Ancient Greek means that compilers and annotators of historical corpora often need to consider a variety of different critical editions and commentaries, and a large body of scholarly literature produced both on the transmission of the texts and on their interpretation. The high philological and literary interest in historical texts means that the creation of a corpus from such texts has to take into account scholars' interest in the texts themselves, in addition to the language.
As a consequence, in the initial phases of historical corpus collection, we need to give special attention to the specific choice of source data. Sometimes the corpus compilers choose the first known editions, as in the case of the Austrian Baroque Corpus (Czeitschner et al., 2013); in other cases the corpus compilers have to resort to the only texts that history has preserved, and those texts may be fragmentary and may only represent a portion of an author's work. Other times, pragmatic factors related to copyright issues and text format play a decisive role, as in the case of the PROIEL Treebank (Haug et al., 2009).

1 http://corporafromtheweb.org/.
2 http://corpus.byu.edu/.

The special status of historical texts has important consequences for their annotation. Linguistic annotation is always an interpretative process. However, annotation is particularly subject to debate in the case of historical languages, for which no native speakers are available and for which the texts we have today are often the result of a complex series of manuscript transmissions and text alterations. For a model of annotation based on the annotator's ownership, see Bamman and Crane (2009), who present this model on the portion of the Ancient Greek Treebank containing the complete works by Aeschylus. The example of Bamman and Crane (2009) presents a promising adaptation of the methods of corpus linguistics to the special case of historical languages with a long philological tradition. This is certainly a positive example, which researchers in historical corpus linguistics have started to follow, and demonstrates that this discipline has progressed beyond a straightforward application of modern methods to historical texts towards a more independent position.
Computational philology has already developed original approaches to the creation of digital editions of historical texts and their analysis, which make best use of the potential of digital humanities. An example is the Homer Multitext project,3 which aims at making texts, images, and tools available to readers interested in studying the textual tradition of the Iliad and the Odyssey in their historical context. In spite of the importance of these issues, most historical corpora lack a philologically satisfactory account of the texts. In section 2.1.1 we assumed that the historical corpora we work with in quantitative historical linguistics rely on adequate editions. We believe that in the future it will be beneficial to combine the wealth of scholarship already generated by computational philologists and digital humanists with the experience of historical corpus linguists, thus letting corpus annotation extend its scope beyond strictly linguistic information, in the direction suggested by Boschetti (2010). Only such a collaborative model will allow historical corpus linguistics to reach a deeper level of analysis and to fully account for the variety of complex phenomena that characterize historical texts.

Ways to annotate a historical corpus

When the first corpora were created, the corpus compilers employed manual annotation extensively. This approach to annotation is supposed to guarantee optimal quality, as it completely relies on humans. However, numerous studies have highlighted how manual annotation is prone to inconsistencies and errors that are very difficult to detect, precisely because they tend to be unsystematic (see, for example, McCarthy, 2001, 2; 19–21).

3 http://www.homermultitext.org.

One way to aim at the best possible quality of annotation is to design clearly defined annotation guidelines and rely on an independent team of annotators.
The degree of agreement between annotators is very important, and can be measured in various ways (see, for example, Cohen 1960). Given its higher costs in terms of time and human resources, manual annotation is preferred when dealing with small-to-medium-sized corpora, or when automatic annotation tools have not yet been developed. For these reasons, manual annotation is very popular for historical corpora, and the large majority of them have been annotated this way. One such example is the Oxford Corpus of Old Japanese.4 Once complete, this corpus will collect all extant texts in Japanese from the Old Japanese period, and it is currently being annotated at the orthographic, phonological, morphological, syntactic, semantic, and lexical levels, with additional annotation of literary, biographical, historical, geographical, and other information. Because manual annotation is expensive, time-consuming, prone to inconsistency errors, and only really feasible on small corpora, an increasing number of projects have recently started to explore the option of automatic annotation. Research in the field of NLP is devoted precisely to developing better and better tools for analysing language data in an automatic way. Automatic annotation programs can cover a variety of levels: for example, lemmatizers are employed for lemmatizing corpora, part-of-speech taggers for part-of-speech annotation, and parsers for syntactic annotation. NLP tools are able to annotate vast amounts of data at low costs and are therefore particularly useful (and sometimes essential) when annotating large corpora. An exciting emerging field has been gaining interest in the NLP community concerning the development or adaptation of NLP tools to historical language data. Such data present special challenges to NLP, as we will illustrate in more detail in section 4.3.
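Returning to the inter-annotator agreement mentioned at the start of this section, the chance-corrected measure of Cohen (1960), kappa, can be sketched as follows (the annotators and labels are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement)
    divided by (1 - chance agreement) for two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label at random,
    # given each annotator's own label distribution.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators tagging the same ten tokens.
ann_a = ["noun", "verb", "noun", "noun", "verb", "adj", "noun", "verb", "adj", "noun"]
ann_b = ["noun", "verb", "noun", "verb", "verb", "adj", "noun", "verb", "adj", "adj"]
print(round(cohens_kappa(ann_a, ann_b), 3))  # 0.701
```

Raw agreement here is 0.8, but kappa discounts the agreement the two annotators would reach by chance alone, which is why it is preferred for reporting annotation quality.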
Although manual annotation is not error-free, it relies on the long tradition of close reading and on the assumption that it is always possible to check the texts analysed. On the other hand, using automatic annotation means accepting the fact that errors are an unavoidable part of the data that the researcher will analyse. Some projects have chosen a compromise between the two approaches by combining automatic and manual annotation in so-called semi-automatic annotation. Semi-automatic annotation consists in human correction of the output of automatic analysers, and therefore combines the speed and consistency of automatic annotation with the quality of manual annotation. Among others, the developers of the Index Thomisticus Treebank (Passarotti, 2007b) and the PROIEL project (Haug and Jøndal, 2008) have used this approach to build their corpora. An alternative way to combine the advantages of automatic and manual annotation while at the same time covering large amounts of texts is via crowdsourcing. This has also been made easier thanks to the availability of crowdsourcing platforms like Amazon Mechanical Turk (Sabou et al., 2014). The crowdsourcing approach is in line with the increasing popularity of user-generated metadata, which involves scholarly and non-scholarly content. In non-academic contexts, user-generated metadata are employed for tagging web content such as photos or videos, but also texts. So-called folksonomies are an example of metadata generated collaboratively and consist in tagging objects based on an open or closed set of categories. Typically, users are asked to associate an object with one or more terms that describe it in some way. Halpin et al. (2007) have shown that users tend to agree on a shared set of tags, even when they are not provided with a controlled vocabulary to choose from.

4 http://vsarpj.orinst.ox.ac.uk/corpus/.
Crowdsourcing has increasingly gained popularity in the computational linguistics community (Snow et al. 2008; Callison-Burch 2009; Munro et al. 2010). The Digital Humanities community has also engaged with a range of crowdsourced projects, some of which deal with historical language material. For example, Europeana 1914–19185 is a large project aimed at collecting historical material about the First World War from libraries, public archives, and family collections. On a more linguistically related note, the Papyrological Editor6 is an open collaborative editing environment for papyrological texts, as well as translations, commentaries, bibliography, and images. Another good example is GitClassics, whose projects aim at involving a large community of scholars and enthusiasts in a 'collaborative effort to edit, translate, and publish new Latin texts using GitHub'.7 Crowdsourced annotation has the advantage of allowing large-scale annotation at a low cost, and is certainly a very promising idea. In the case of historical corpora, the task of annotation has traditionally been assigned to small groups of highly qualified persons. Pursuing a crowdsourcing approach in this context would require adapting the general-purpose model to the case of a limited scholarly community. Shared infrastructures and built-in mechanisms for checking the quality of the data are the next challenge to face in order to make this approach sustainable and optimal for historical corpora.

Annotation in practice

So far we have highlighted how corpus annotation allows us to incorporate various types of information into the analysis of the texts for the purposes of historical linguistics research. This section will deal with the question of how we can achieve this in practice.

5 http://europeana1914-1918.eu/en.
6 http://papyri.info/.
7 Quote from the GitClassics website, http://gitclassics.github.io/. GitHub (https://github.com/) is a service for hosting web-based repositories, and is very popular among computer scientists as a platform for storing and sharing code as well as data sets.

Let us consider the following text, taken from the beginning of the Aeneid by Virgil:

(1) Arma virumque cano, Troiae qui primus ab oris
    Italiam, fato profugus, Laviniaque venit
    litora, multum ille et terris iactatus et alto
    vi superum saevae memorem Iunonis ob iram;

This is an example of raw text, a typical case of so-called 'unstructured data'. We know that Example (1) is part of a poetic text, and that it consists of four lines. In other words, the text has an internal structure which is shown in the way it appears visually. Humans can untangle the complex configuration of elements in a text in a relatively easy way. For example, books usually display chapter titles in a particular font, which is different from the rest of the text, and readers can easily detect chapters thanks to such widespread conventions. On the other hand, if we cannot or do not want to read the full text, but are interested in analysing certain patterns, we need to be able to find those patterns reliably. For example, to identify the lines in a poem, a computer program needs to know where the line boundaries are placed. In Example (1), the fact that lines end with a line break gives an indication of their boundaries, but even this needs to be explicitly encoded in order to be retrievable by a computer program. If we want to retrieve this type of structural detail from the text, we need to represent it explicitly. In other words, we need to add structure to the data in the form of metadata. This can be achieved in various ways.
One way to add this type of metadata is to use a table format, where each row (or record) represents a line, and the columns (or fields) contain the unique identifier and the content of the line. A table like Table 4.1 can be included in a very simple database where each row corresponds to a record (a line of text) and every column corresponds to a field. The fields in Table 4.1 are 'Line_identifier', 'Line_text', and 'Author'. For every record, i.e. a line in the text, 'Line_identifier' is a unique code for it.

Table 4.1 The first four lines of Virgil's Aeneid in a tabular format, where each row corresponds to a line

Line_identifier  Line_text                                        Author
1                Arma virumque cano, Troiae qui primus ab oris    Publius Vergilius Maro
2                Italiam, fato profugus, Laviniaque venit         Publius Vergilius Maro
3                litora, multum ille et terris iactatus et alto   Publius Vergilius Maro
4                vi superum saevae memorem Iunonis ob iram;       Publius Vergilius Maro

Table 4.1 is a very simple example of structured data where we can imagine that we have defined in advance the set of fields (identifier and text), as well as their data types (numeric, string, date, etc.) and any constraints on their content. For example, we do not expect the field 'Line_identifier' to contain anything other than a number between 1 and 9996, which is the number of lines in the Aeneid. In addition to structural information, we may want to collect bibliographical information about the texts.

Table 4.2 Example of bibliographical information on a hypothetical collection of texts in tabular format, to be considered for illustration purposes only

Work_identifier  Author                  Title            Genre  Location
IX001            Publius Vergilius Maro  Aeneid           epic   Ancient Rome
IX003            Giacomo Leopardi        Canti            poem   Italy
IX002            Honoré de Balzac        Eugénie Grandet  novel  France
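The tabular metadata just described can be held in a small relational database; here is a minimal sketch using Python's built-in sqlite3 module, with table and field names following the hypothetical examples in the text:

```python
import sqlite3

# An in-memory database mirroring the hypothetical line and work tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lines (Line_identifier INTEGER, Line_text TEXT, Author TEXT)")
conn.execute("CREATE TABLE works (Work_identifier TEXT, Author TEXT, Title TEXT, Genre TEXT)")

conn.executemany("INSERT INTO lines VALUES (?, ?, ?)", [
    (1, "Arma virumque cano, Troiae qui primus ab oris", "Publius Vergilius Maro"),
    (2, "Italiam, fato profugus, Laviniaque venit", "Publius Vergilius Maro"),
])
conn.execute("INSERT INTO works VALUES ('IX001', 'Publius Vergilius Maro', 'Aeneid', 'epic')")

# Link the two tables on the shared 'Author' field, so bibliographical
# metadata need not be repeated on every line record.
row = conn.execute("""
    SELECT works.Title, lines.Line_text
    FROM lines JOIN works ON lines.Author = works.Author
    WHERE lines.Line_identifier = 1
""").fetchone()
print(row)  # ('Aeneid', 'Arma virumque cano, Troiae qui primus ab oris')
```

Storing the work-level metadata once and joining on demand is exactly the economy that motivates the relational design.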
For example, Table 4.2 contains a portion of a hypothetical structured data set with bibliographical information, presented here for illustration purposes only. Like Table 4.1, Table 4.2 contains structured data. The values for 'Author' range over the closed list of all possible authors of the books contained in our imaginary collection, with the option of adding new ones to the list as we make new acquisitions. Depending on the size of the collection, we can draw this list from a potentially very large set. One way to keep this list manageable is to allow only one variant for a given author name, or a limited number of variants; for example, if we decide to use only the original names of authors, we would need to map the English name 'Virgil' to its Latin equivalent 'Publius Vergilius Maro'. Similar arguments hold for the other fields, which we should keep within some closed boundary in order to ensure consistency of the data, whenever that is possible. If we wanted to combine structural and contextual information about our text, we could link Table 4.1 and Table 4.2 using the fact that they share the field 'Author'. Instead of repeating the bibliographical information for every line of the text, having two separate but linked tables is an efficient way of storing metadata in a so-called relational database. An alternative way to display the data in Table 4.1 and Table 4.2 in a combined way is via a markup language like XML (Extensible Markup Language). XML is also the preferred format in corpus linguistics, and we will describe it in more detail in section 4.3.1. For now, we will just use two simple examples to show how XML is particularly suited to expressing deep hierarchical structures in documents:

<body>
 <text number="1">
  This is an example text.
 </text>
 <text number="2">
  This is another example text.
 </text>
</body>

The text between the signs '<' and '>' is a tag; in the example above we see two types of tags, body and text. The first opening tag is <body>, and its closing tag is </body>. Within the scope of the <body> tag, we see two instances of the tag <text>, which indicates that <text> is nested inside <body>. The text in double quotes contained in the tag <text> is an attribute and in this case is used to indicate the number of the text embedded within the main body. Let us see an example of XML for Virgil's text discussed above.

<collection>
 <work Work_identifier="IX001" author="PubliusVergiliusMaro" title="Aeneid" genre="epic" country="AncientRome">
  <book identifier="1">
   <line identifier="1">Arma virumque cano, Troiae qui primus ab oris</line>
   <line identifier="2">Italiam, fato profugus, Laviniaque venit</line>
   <line identifier="3">litora, multum ille et terris iactatus et alto</line>
   <line identifier="4">vi superum saevae memorem Iunonis ob iram;</line>
  </book>
 </work>
</collection>

After the opening tag <work>, the tag <book> has an attribute identifier with value 1, which refers to the fact that this is the first book of the work. Then, every line of the poem is enclosed between the opening tag <line> and the closing tag </line>. This example shows how it is possible to annotate structural information in the text. In the next section we will focus on linguistic information.

Adding linguistic annotation to texts

So far, we have focused on how to represent structural and contextual metadata. Of course, it is essential to represent the content of the text as well, as we will see in this section. Humans are very skilled at identifying implicit layers of linguistic information in language.
For example, native speakers of English can easily recognize that the word book in the sentence They are going to book the flight tonight is a verb, and it is a noun in She read that book in one day, although their ability to make the distinction explicit may depend on the level of their grammatical training. When the texts are analysed by a computer, we need to make such information explicit in order to interpret the text and retrieve its elements. For instance, in Example (1), discussed on page 104, arma is the accusative of the plural noun arma 'weapons'; que is an enclitic which means 'and' and is attached to the end of the word virum, which is the accusative of the noun vir 'man'. Because this type of morphological information is at the level of individual words (more precisely, tokens in corpus linguistics terms), rather than phrases or larger segments, one way to encode it is to define each row as the minimal analytical unit, i.e. the token, and add new fields called 'lemma', 'part of speech', 'case', and 'number', as in Table 4.3.

Table 4.3 Example of metadata and linguistic information encoded for the first three word tokens of Virgil's Aeneid

Work ID  Title   Token ID  Token form  Lemma  Part of speech  Case        Number
IX001    Aeneid  T00101    Arma        arma   noun            accusative  plural
IX001    Aeneid  T00102    virum       vir    noun            accusative  singular
IX001    Aeneid  T00103    que         que    conjunction     –           –

Once we have the information for the whole text, we can run searches on any combination of the fields; for instance, we can retrieve all occurrences of the singular accusative of vir. Alternatively, if we choose to use XML, we can embed every token in the XML presented on pages 105–6 in a new tag <token>, and add the attributes tokenID, lemma, part of speech, case, and number to it, as shown below.
<collection>
  <work work_identifier="IX001" title="Aeneid">
    <book identifier="1">
      <line identifier="1">
        <token tokenID="T00101" lemma="arma" part-of-speech="noun"
               case="acc" number="plural">Arma</token>
        <token tokenID="T00102" lemma="vir" part-of-speech="noun"
               case="acc" number="singular">virum</token>
        <token tokenID="T00103" lemma="que" part-of-speech="conjunction"
               case="-" number="-">que</token>
        ...
      </line>
    </book>
  </work>
</collection>

We could also decide to encode other types of linguistic information, such as the English translation of every word, their syntactic relations, or their synonyms. In any case, this added information contributes to making such elements searchable; for example, we can retrieve all instances of the lemma vir in the text by simply limiting the search to the lemma attribute of the tag <token>.

Annotation formats

There are different ways to include annotation in a corpus. In the so-called embedded format, the annotation is included in the original text and is displayed in the form of tags. For example, the example below indicates that reading is a participle form, as the tag 'PARTICIPLE' is attached to the form reading, separated by a forward slash:

reading/PARTICIPLE

When the units being annotated span more than one token, we need some way of grouping together their elements; this is sometimes achieved by bracketing or nesting tags, as in phrase-structure syntactic annotation. The example below shows a parse tree from the Early Modern English Treebank (Kroch et al., 2004).
( (IP-MAT (NP-SBJ (D The) (N Chancelor))
          (VBD saide)
          (CP-THT (C that)
                  (IP-SUB (PP (P after) (NP (ADJ long) (N debating)))
                          (NP-SBJ (PRO they))
                          (VBD departyd)
                          (PP (P for) (NP (D that) (N tyme)))
                          (, ,)
                          (IP-PPL (CONJ nedyr)
                                  (IP-PPL (VAG falling)
                                          (PP (P to) (NP (Q any) (N poynt))))
                                  (CONJP (CONJ nor)
                                         (ADJP (ADJ lyke)
                                               (IP-INF (TO to) (VB com)
                                                       (PP (P to) (NP (Q any)))))))))
          (. .))
  (ID AMBASS-E1-P2,3.2,25.20))

The phrase structure of the sentence is represented with embedded bracketing corresponding to syntactic constituents, and the leaf nodes consist of tags followed by word forms. 'IP-MAT' signals the whole sentence, 'NP-SBJ' the subject noun phrase, consisting of a determiner node ('D') and a noun node ('N'); 'VBD' is the past-tense verb and 'CP-THT' is a complementizer phrase introduced by the conjunction that, while the last node contains the ID of the sentence.8 Another way to represent embedded structures in corpus annotation is by using the XML format, introduced in section 4.3.
An example of embedded dependency annotation in XML format is given below, taken from the Latin Dependency Treebank (Bamman and Crane, 2006):

<sentence id="7" document_id="Perseus:text:1999.02.0066"
          subdoc="book=1:poem=1" span="ergo0:puellam0">
  <primary>alexlessie</primary>
  <primary>sneil01</primary>
  <secondary>millermo</secondary>
  <word id="1" form="ergo" lemma="ergo1" postag="d--------" head="3" rel="AuxY"/>
  <word id="2" form="velocem" lemma="velox1" postag="a-s---fa-" head="5" rel="ATR"/>
  <word id="3" form="potuit" lemma="possum1" postag="v3sria---" head="0" rel="PRED"/>
  <word id="4" form="domuisse" lemma="domo1" postag="v--rna---" head="3" rel="OBJ"/>
  <word id="5" form="puellam" lemma="puella1" postag="n-s---fa-" head="4" rel="OBJ"/>
</sentence>

The tags <sentence> and </sentence> indicate respectively the beginning and end of the sentence being annotated; the attributes of the <sentence> tag record various properties of the sentence: its unique identifier (id), the identifier of the text (document_id), the portion of the text containing the sentence (subdoc), and the first and last words of the chunk of text (span). Inside the tag <sentence>, we find the names of the primary and secondary annotators, followed by the words making up the sentence. The tag <word> encodes every word of the sentence. Inside the tag, the attribute id uniquely identifies the word in the corpus, form represents the word form, lemma its lemma, and postag contains a series of codes for morphological features such as part-of-speech tag, gender, mood, case, and number. In this example the type of syntactic annotation is relational, as opposed to the constituency-based annotation of the phrase-structure example from the Early Modern English Treebank. The attribute head contains the ID of the dependency head of each word, while the attribute rel indicates the syntactic dependency relation between the word and its head.
For example, the first word of the sentence above is ergo, a sentence adverbial ('AuxY') depending on the third word of the sentence, i.e. potuit. Its lemma is ergo1, as it is the first (and only) homograph of the lemma ergo in the Lewis–Short Latin dictionary (Lewis and Short, 1879).

8 For a full description of the tags, see https://www.ling.upenn.edu/hist-corpora/annotation/index.html.

In addition to linguistic information, as we noted in section 4.1, it is important to record contextual information about a text; this is sometimes included as part of the corpus annotation itself, as in the Helsinki Corpus. McEnery and Wilson (2001, 39–40) list a document header from this corpus, where, for example, the tag <A BEAUMONT ELIZABETH> indicates an author's name, and <X FEMALE> her gender. Such metadata can then be used by corpus programs to restrict the search criteria on texts' attributes and their linguistic content.

So far, we have examined examples of embedded annotation. Standalone annotation keeps the annotation in a separate document, which is linked to the original text. The American National Corpus (Ide and Macleod, 2001) has followed this approach (Gries and Berez, 2015). For example, each word of the sentence We then read is assigned an identifier:

<w id="1">We</w>
<w id="2">then</w>
<w id="3">read</w>
<w id="4">.</w>

Each word is then associated with its part-of-speech tag in the standalone annotation by means of these identifiers:

<word id="1">PRONOUN</word>
<word id="2">ADVERB</word>
<word id="3">VERB</word>
<word id="4">PUNCTUATION</word>

Standalone annotation makes it possible to have multiple formats or levels of annotation for the same text.
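The linking of standalone annotation to its base text can be sketched in a few lines of Python. The element names follow the simplified example above, and the merge logic is an illustration of the general identifier-based join, not a description of the actual ANC tooling.

```python
import xml.etree.ElementTree as ET

# Simplified base text and standalone annotation, repeating the example above.
base = ET.fromstring(
    '<s><w id="1">We</w><w id="2">then</w><w id="3">read</w><w id="4">.</w></s>'
)
annotation = ET.fromstring(
    '<ann><word id="1">PRONOUN</word><word id="2">ADVERB</word>'
    '<word id="3">VERB</word><word id="4">PUNCTUATION</word></ann>'
)

# Build a lookup from identifier to part-of-speech tag ...
pos_by_id = {w.get("id"): w.text for w in annotation.iter("word")}

# ... and join it with the word forms in the base document.
tagged = [(w.text, pos_by_id[w.get("id")]) for w in base.iter("w")]
print(tagged)
# [('We', 'PRONOUN'), ('then', 'ADVERB'), ('read', 'VERB'), ('.', 'PUNCTUATION')]
```

Because the two documents are linked only by identifiers, the same base text can be joined with any number of further annotation layers in the same way.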
Although standalone annotation is recommended by the standard for corpus annotation (the Corpus Encoding Standard), most corpora have embedded annotation; therefore, in the rest of this chapter we will refer to this type of annotation.

Levels of linguistic annotation

Linguistic annotation is typically performed incrementally, by adding successive layers to the original text, starting from the most basic ones, such as lemma or part of speech, and moving to the most advanced ones, with semantic and pragmatic information. In this section we will cover these main levels of annotation, with particular attention to the peculiarities of historical corpora.

The challenges of text pre-processing
When building a historical corpus, researchers usually acquire texts held in non-electronic formats. Optical character recognition (OCR) and direct transcription are the most popular ways to convert the texts into a digital format. Manual transcription is an option when automatic methods are not able to reach an acceptable level of accuracy. Automatic and manual transcription are not mutually exclusive, as the results of an automatic process can be further refined by manual intervention. This approach was the one chosen by the Impact Centre of Competence in Digitization,9 a collaborative network of libraries, industry partners, and researchers working towards the goal of digitizing historical printed texts from Europe's cultural heritage material. Concerning OCR, Impact has developed OCR software whose results are further improved thanks to the involvement of volunteers through a crowdsourcing interface. Historical texts also present challenges regarding their characters, which typically span a much larger set than those of modern texts.
In the processing of historical texts, the lack of a common framework for encoding texts long meant that customized processing tools were created which could not be shared across different systems. Over the past decades, the Unicode character encoding has gradually become the universal standard, and it now provides a code space of more than one million characters. New characters still need to be added to the Unicode repertoire, especially to deal with historical scripts, and this is achieved via the Script Encoding Initiative.10 As Piotrowski (2012, 53–60) points out, the wide coverage of Unicode facilitates the sharing of tools and texts across different projects. For an overview of the issues concerning the digitization of historical texts and historical character encoding, see Piotrowski (2012). In the next sections, we focus on the levels of linguistic annotation that can be performed on historical corpora, stressing their features and challenges.

Tokenization
The first step in automatically processing the language in a corpus usually consists of tokenization, which segments a running text into linguistic units such as words, punctuation marks, or numbers. Once we have identified such units (called tokens), we can perform further levels of annotation. The task of word segmentation is more complex for East Asian languages like Chinese, Japanese, Korean, and Thai, which do not use white spaces to separate words. This is relevant also to those historical languages that were written in scriptio continua, such as classical Greek and classical Latin, for which word separation is sometimes disputed by different philological interpretations. Even in languages like English, Italian, or French, where white spaces separate tokens in many cases, we find several exceptions.
For example, the English sequence I'm, the French l'oiseau 'the bird', and the Italian l'anguilla 'the eel' each comprise two tokens; on the other hand, the English name New York should count as one single token. Moreover, compounds may lack internal spaces, as in the German compound Computerlinguistik 'computational linguistics'. Another challenge in tokenization is posed by the different possible uses of hyphens, for example to split a word at the end of a line for typesetting or to join the elements of a single complex unit like forty-two. What counts as a token, therefore, depends on the language, the context of use, and further processing. For languages that use a Latin-based, Cyrillic-based, or Greek-based writing system, tokenization is often performed by a combination of rules that rely on white spaces and punctuation marks as delimiters of token boundaries. In addition to applying these general rules, we need to take into account language-specific exceptions drawn from lists of acronyms and abbreviations. For example, such lists for English should contain Dr. and Mrs., because in these cases the dot should be considered part of the token. One challenge with abbreviations is that the same string may be a full word in certain contexts and an abbreviation in others, like in. for inches. Sentence segmentation is another crucial task related to tokenization and can present challenges for historical texts which do not employ punctuation marks consistently. For an overview of such challenges in Latin and Armenian, and the respective solutions adopted to build the PROIEL Project corpus, see Haug et al. (2009, 24–6).

9 http://www.digitisation.eu/.
10 http://linguistics.berkeley.edu/sei/.
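The combination of rules just described, white-space splitting plus punctuation detachment with an abbreviation list as the exception mechanism, can be sketched in Python. The abbreviation list and the regular expression are purely illustrative; a real tokenizer would use a much larger, language-specific inventory.

```python
import re

# Illustrative abbreviation list; the dot is kept as part of these tokens.
ABBREVIATIONS = {"Dr.", "Mrs.", "in."}

def tokenize(text):
    """Split on white space, then detach clitics and punctuation,
    unless the whole chunk is a known abbreviation."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            # a word, a clitic introduced by an apostrophe, or a punctuation mark
            tokens.extend(re.findall(r"\w+|'\w+|[^\w\s]", chunk))
    return tokens

print(tokenize("I'm reading Dr. Smith's book."))
# ['I', "'m", 'reading', 'Dr.', 'Smith', "'s", 'book', '.']
```

Note how the same dot character is treated differently depending on context: it stays attached in Dr. but becomes a separate token at the end of the sentence.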
Morphological annotation
From a historical perspective, researchers have expended most of their effort on written texts, and therefore the morphological, syntactic, semantic, and pragmatic levels have received most of the attention, compared to other levels of annotation such as phonetic/phonemic and prosodic annotation. In this section we will describe morphological annotation in more detail. Morphological annotation is the first layer of annotation that is normally added to raw corpora. It usually involves the processing of spelling variants, lemmatization, part-of-speech tagging, and the annotation of other morphological features, such as number, gender, animacy, and case for nouns and adjectives, degree for adjectives, and mood, voice, aspect, tense, and person for verbs. In this section we will examine the main challenges posed by the morphological annotation of historical texts, and how current or past projects have tackled them.

Tackling spelling variation
One major challenge of historical texts relates to the amount of spelling variation they typically contain. First, many historical corpora cover large time spans, during which spelling standards were often lacking and spelling conventions changed. Second, data capture errors and philological issues sometimes make spelling uncertain. For these reasons, a single approach to spelling is often not viable for historical texts. The field of NLP has developed tools that generally assume consistent spelling and consequently work well for modern languages, which normally display a much smaller degree of spelling variation than their historical counterparts. When applying NLP tools to historical texts, a common strategy is to normalize historical spellings to their modern equivalents.
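In its simplest form, such normalization can be sketched as a rule-based mapping in Python: an explicit variant lexicon is tried first, followed by pattern rules. Both resources below are invented for illustration, and context-free replacement of this kind can misfire when a historical variant is itself a valid modern word.

```python
import re

# Hypothetical variant lexicon; real tools rely on large curated resources.
VARIANT_MAP = {"goode": "good"}
# A single illustrative pattern rule: -yng / -ynge endings become -ing.
PATTERN_RULES = [(re.compile(r"ynge?$"), "ing")]

def normalize(token):
    """Return a modernized spelling for a historical token."""
    if token in VARIANT_MAP:
        return VARIANT_MAP[token]
    for pattern, replacement in PATTERN_RULES:
        if pattern.search(token):
            return pattern.sub(replacement, token)
    return token  # already modern, or unknown

print([normalize(t) for t in ["lovynge", "walkyng", "goode", "day"]])
# ['loving', 'walking', 'good', 'day']
```

A production system would combine many such rules with frequency information or an edit-distance measure to rank candidate modern forms.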
Normalization can be acceptable for certain applications, such as retrieving information from historical documents, where the user wants to find the relevant content by searching for a limited number of terms. Normalization requires a set of mappings between the historical variants and the modern ones (if available) or rules that prescribe how to infer one from the other (for example, -yng or -ynge endings in place of the modern -ing ending of verbs in English). This approach was adopted by the designers of VARD,11 a spelling analysis tool for early modern English. When such lexicons or rules are not available, we can adopt several different approaches to identify the relationship between spelling variants. One such example is the so-called edit distance, which measures the 'distance' between two strings as the number of deletions, insertions, and replacements of characters needed to transform one into the other. We can employ similar methods when correcting OCR errors, a common challenge of digitized historical documents. For an overview of this topic, see Piotrowski (2012, 69–83). As an example of the challenges of historical spelling for English, Archer et al. (2003) present a historical adaptation of USAS, the Semantic Analysis System developed by UCREL, the University Centre for Computer Corpus Research on Language at Lancaster University. Because USAS was designed for present-day English, when it was applied to early modern English texts it failed to part-of-speech-tag a number of items. The issues concerned spelling, because some historical variants were not present in the lexicon used by USAS. A straightforward modification of the lexicon to include historical variants would have led to incorrect results; for example, one historical spelling of the verb be is bee, which is also a noun in present-day English.
Therefore, the authors decided to keep the present-day lexicon separate from the early modern one, and to create the historical lexicon manually by analysing the items that had not been tagged by the part-of-speech tagger. The system then assigned the correct tags to such items based on a set of rules; for example, it would analyse bee as a form of the verb be if it was preceded by a modal verb.

Tagging by part of speech
Part-of-speech tagging is a crucial step in annotating corpora. As with other levels of annotation, automatic part-of-speech taggers exist alongside manual systems; however, compared with part-of-speech tagging for modern languages, part-of-speech tagging for historical languages presents some specific challenges, as Piotrowski (2012, 86–96) explains. Here we summarize some of the main solutions devised to perform part-of-speech tagging for historical languages. Machine-learning algorithms for part-of-speech tagging have become increasingly popular in recent years. A typical so-called supervised machine-learning system for linguistic annotation relies on an annotated corpus used as a training set; the model learns the patterns observed in the training set and subsequently uses these patterns to annotate a new corpus. Following this approach, Passarotti and Dell'Orletta (2010) trained a part-of-speech tagger on the morphologically annotated data from the Latin Index Thomisticus Treebank (61,024 tokens), and automatically disambiguated the lemmas of the Index Thomisticus. Scholars have adopted different solutions with the aim of improving the accuracy of part-of-speech taggers for historical languages. Some, such as Rayson et al. (2007), have used part-of-speech taggers for modern language varieties to analyse historical varieties by modernizing their spelling.

11 http://ucrel.lancs.ac.uk/vard/about/.
Another approach consists in using a part-of-speech tagger for the modern variety of the historical language being studied and expanding its lexicon with historical forms, as Sanchez-Marco et al. (2011) did for Spanish. An alternative method is to first use a modern-language tagger and then incrementally correct its output on historical data. This was the approach followed by Resch et al. (2014), who used the modern German version of TreeTagger (Schmid, 1995) to tag the Austrian Baroque Corpus, a corpus of printed German-language texts dating from the Baroque era (particularly from 1650 to 1750). Given the high number of incorrectly tagged and lemmatized items, they manually corrected a portion of the output of the tagger and then retrained TreeTagger on this additional training set. This procedure was sufficient to improve the performance of TreeTagger significantly. Bamman and Crane (2008) use a similar approach and report on experiments on part-of-speech tagging of classical Latin with TreeTagger (Schmid, 1994), trained on a treebank for classical Latin.

Lemmatization and morphological annotation
Lemmatization associates every word form with its lemma, together with its homograph number where needed. We can perform lemmatization both on inflected forms and on spelling variants; for example, if we want to use a list of lemmas from British English, we can lemmatize the American variant color as colour. Lemmatization is closely related to morphological analysis and part-of-speech tagging. In fact, if we know the part of speech of a given form in a given context, we can often assign the correct lemma to it. For example, the Latin form rosa can be an inflected form of the noun rosa 'rose', but also the feminine past participle of the verb rodo 'gnaw', and its correct lemma will depend on the context. For this reason, lemmatization is often coupled with part-of-speech tagging in corpus annotation.
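The interaction between part-of-speech tagging and lemmatization can be sketched with a toy lexicon keyed by form and part of speech, following the rosa example. The entries and tag labels are invented for illustration and do not reflect any particular lemmatizer.

```python
# A toy lemmatization lexicon keyed by (word form, part of speech).
LEXICON = {
    ("rosa", "noun"): "rosa",      # rosa 'rose'
    ("rosa", "verb"): "rodo",      # feminine past participle of rodo 'gnaw'
    ("virum", "noun"): "vir",
}

def lemmatize(form, pos):
    """Pick the lemma licensed by the part of speech assigned in context;
    fall back to the form itself when the pair is unknown."""
    return LEXICON.get((form, pos), form)

print(lemmatize("rosa", "noun"), lemmatize("rosa", "verb"))
# rosa rodo
```

The same surface form thus receives different lemmas depending on the part-of-speech tag chosen for it in context, which is why the two annotation steps are usually performed together.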
Just like other levels of linguistic annotation, lemmatization can be performed either manually or automatically, through tools called lemmatizers. Examples of manually lemmatized historical corpora are treebanks, which we will introduce later in this section. While possible, attempts at the automatic lemmatization of historical corpora have overall been rare. One method for automatic lemmatization is based on a set of rules that prescribe how to analyse a given word form depending on the category it falls into. Examples of rule-based systems are LGeRM (Souvay and Pierrel, 2009), which identifies the dictionary entry of a given form in Middle French, and the morphological model built by Borin and Forsberg (2008) for Old Swedish. Along similar lines, several software systems are available for performing automatic lemmatization and morphological analysis of Latin and Ancient Greek. For example, CHLT-LEMLAT (Passarotti, 2007a)12 is a lemmatizer and morphological analyser for Latin created at the Institute of Computational Linguistics (ILC-CNR) in Pisa. Another morphological analyser for Latin and Ancient Greek is Morpheus (Crane, 1991),13 which contains rules for generating inflected forms automatically and allows users to search the digital library by word forms and lemmas. Kestemont et al. (2010) propose a machine-learning approach to the lemmatization of Middle Dutch.

Syntactic annotation
Syntactic annotation consists of assigning each element of the sentences in a corpus its syntactic role. Given the complexity of the task of syntactic annotation, historical corpora with this type of annotation are quite small, and attempts at automatic annotation have been rare. In this section we will give a brief overview of the research in this area.
Manual syntactic annotation and treebanks
Syntactically annotated corpora are usually called treebanks, because syntactically annotated sentences can be represented as trees. For an overview of existing treebanks for modern and historical languages and some methodological points, see Abeillé (2003). Here we will focus on methodological issues specific to historical treebanks. There are two main kinds of syntactic annotation: constituency annotation and dependency annotation. In constituency annotation, phrases are identified and marked so that it is clear which one each element belongs to. Constituency annotation makes use of bracketing to represent the syntactic embedding of constituents and is the style followed by the early treebanks; we presented an example of this kind of annotation in section 4.3.1. Examples of constituency-based historical treebanks are the Penn Corpora of Historical English (Kroch and Taylor, 2000; Kroch and Delfs, 2004; Kroch and Diertani, 2010). Dependency annotation, on the other hand, is based on the theoretical assumptions of Dependency Grammar (Tesnière, 1959), which represents the syntactic structure of a sentence through the dependency relations between its words. In a Dependency Grammar annotation, each lexical element corresponds to a node in the syntactic tree of the sentence; in order to tag its syntactic role in the sentence, we assign each node a label (such as 'predicate', 'object', or 'attribute') and link it to the node that governs it. Figure 4.1 shows the phrase-structure tree and the dependency tree for Example (2):

(2) She ate the apple.

In the dependency tree of Figure 4.1 we can see the nodes corresponding to the words of the sentence, and the edges representing the dependencies between the words ('Pred' for predicate, 'Sb' for subject, 'Obj' for object, and 'Det' for determiner).

12 http://webilc.ilc.cnr.it/ ruffolo/lemlat/index.html.
13 http://www.Perseus.tufts.edu/hopper/morph.jsp.
Figure 4.1 Phrase-structure tree (left) and dependency tree (right) for Example (2).

In the constituent tree we can see the terminal nodes corresponding to the words of the sentence, and the non-terminal symbols corresponding to constituents (e.g. noun phrases ('NP') and verb phrases ('VP') in Figure 4.1) or parts of speech (pronouns ('PR'), verbs ('V'), determiners ('DET'), and nouns ('N') in Figure 4.1). Dependency annotation has become increasingly popular among treebank creators. One common model of annotation is that of the Prague Dependency Treebank (Böhmová et al., 2003), developed under the Dependency Grammar theoretical framework of Functional Generative Description (Sgall et al., 1986). This treebank contains part of the Czech National Corpus annotated at three levels: morphological, so-called 'analytical' (with dependency trees of all sentences), and semantic, so-called 'tectogrammatical'. Dependency annotation is generally considered to be very suitable for morphologically rich languages with free word order, such as Czech and Latin. Examples of historical treebanks that follow this framework are the Ancient Greek Dependency Treebank (Bamman and Crane, 2011), the PROIEL Treebank (Haug and Jøndal, 2008), the Latin Dependency Treebank (Bamman and Crane, 2007), and the Index Thomisticus Treebank (Passarotti, 2007b). Let us consider an example. Figure 4.2, from McGillivray (2013, 45), shows the dependency tree of the Latin sentence in Example (3), where movet and pervenit are coordinated predicates governing, respectively, the direct object castra, and the adverbial diebus and the indirect object fines introduced by the preposition ad.
(3) Re frumentaria         provisa                    castra
    provisions:abl.f.sg    provide:ptcp.pf.abl.f.sg   camp:acc.n.pl
    movet                  diebus         -que   circiter    XV        ad
    move:ind.prs.3sg       day:abl.m.pl   and    about:adv   fifteen   to
    fines                  Belgarum           pervenit
    border:acc.m.pl        Belgian:gen.m.pl   arrive:ind.pf.3sg
    'After providing his provisions, he moved his camp, and in about fifteen days reached the borders of the Belgae'

Figure 4.2 The dependency tree of Example (3) from the Latin Dependency Treebank.

As we have seen from Example (3), the high level of complexity of the annotation in treebanks makes them very valuable resources for linguistic analysis, allowing for complex searches involving syntactic functions. Treebanks can also help linguists test their theories, as they can provide examples and counter-examples for illustrating linguistic phenomena in qualitative research. As a matter of fact, empirical linguistic analysis was the prevalent motivation behind the creation of the early treebanks (and corpora in general). Treebanks can also constitute the basis for corpus-driven analyses as defined in section 2.4. This latter use is the one that makes the most of the potential of treebanks, because they offer the kind of systematic information and frequency data that is needed in this type of linguistic analysis. Moreover, there is significant educational potential in the use of treebanks, as demonstrated by the Visual Interactive Syntax Learning project at the University of Southern Denmark,14 which contains syntactically annotated sentences and games for modern and historical languages (Latin and Ancient Greek).
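The complex searches involving syntactic functions mentioned above can be sketched in a few lines of Python over word records that mirror the attributes (id, form, head, rel) of the Latin Dependency Treebank example given earlier in this chapter; the query itself is illustrative.

```python
# Word records repeating the dependency annotation of the earlier XML example.
words = [
    {"id": 1, "form": "ergo",     "head": 3, "rel": "AuxY"},
    {"id": 2, "form": "velocem",  "head": 5, "rel": "ATR"},
    {"id": 3, "form": "potuit",   "head": 0, "rel": "PRED"},
    {"id": 4, "form": "domuisse", "head": 3, "rel": "OBJ"},
    {"id": 5, "form": "puellam",  "head": 4, "rel": "OBJ"},
]
by_id = {w["id"]: w for w in words}

# Retrieve every object together with the form of its governing head.
objects = [(w["form"], by_id[w["head"]]["form"])
           for w in words if w["rel"] == "OBJ"]
print(objects)
# [('domuisse', 'potuit'), ('puellam', 'domuisse')]
```

Searches of this kind, which filter on syntactic functions rather than on surface strings, are exactly what unannotated corpora cannot support.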
Finally, treebanks have recently been used as gold-standard resources for historical NLP, as they can be used to train automatic syntactic analysers, or parsers, as explained in the next section.

14 http://beta.visl.sdu.dk/.

Automatic annotation of syntax: parsing
Parsing consists in automatically annotating a corpus from a syntactic point of view. Parsing is a very important field of research in NLP, and has a variety of practical and commercial applications, ranging from machine translation to natural language understanding and lexicography. Parsing can be achieved in two main ways: rule-based and statistical. Rule-based parsers exploit manually constructed rules to parse a sentence; statistical parsers, on the other hand, are based on machine-learning techniques and are trained on treebanks, from which they learn patterns of linguistic regularity that can then be applied when analysing new unannotated texts. As with any other automatic method, parsing involves a number of errors, which we must take into consideration when using parsed data directly. Depending on the end use of the annotated corpus, this margin of error may constitute a problem, as traditionally historical linguists and philologists have aimed at an almost perfect level of analysis and often require the same accuracy to carry out further analyses based on annotated corpora. When historical corpora are so small that it is possible to check the annotation manually, semi-automatic annotation is often the preferred solution. As illustrated in Piotrowski (2012, 98–100), parsing experiments for historical languages have highlighted interesting challenges and have often originated from adaptations of parsers developed for modern languages. For example, comparing classical Chinese and modern Chinese, Huang et al. (2002) report an accuracy of 82.3 per cent for a parser trained on a 1,000-word treebank.
The challenges involved in segmenting the text are less serious for classical Chinese, which has a higher proportion of single-character words than modern Chinese; on the other hand, part-of-speech ambiguity is more extreme in classical Chinese and therefore makes part-of-speech tagging more difficult. Given the historical importance of Latin in Western culture, it is not surprising that significant efforts have been devoted to parsing this language. Koch (1993) describes a first attempt at parsing Latin. McGillivray et al. (2009), Passarotti and Ruffolo (2009), and Passarotti and Dell'Orletta (2010) report on more recent experiments on parsing Latin corpora using machine learning. For example, following the approach described earlier, which consists in adapting parsers developed for modern languages to historical languages, Passarotti and Dell'Orletta (2010) applied the DeSR parser (Attardi, 2006) to medieval Latin, and designed some features specific to this language. English is the other language for which considerable research has been done on parsing historical texts. Given that modern English is the language with the highest number of language processing tools, it is not surprising that such tools have also been tested on its historical varieties. One such tool is the Pro3Gres parser (Schneider, 2008), a hybrid dependency parser for modern English. Pro3Gres is based on a combination of hand-written rules and statistical disambiguation, and can be adapted to historical language varieties. Schneider (2012) evaluated Pro3Gres on the historical corpus ARCHER (A Representative Corpus of Historical English Registers, Biber and Atkinson 1994), constructed by Douglas Biber and Edward Finegan in the 1990s and consisting of British and American English texts written between 1650 and 1999.
A preprocessing step performed before parsing normalized the text with the spelling-normalization tool VARD2 (Baron, 2009). Schneider's evaluation results for the unadapted parser range from 70 per cent on seventeenth-century texts to 80 per cent on early twentieth-century texts. If we compare these results with state-of-the-art parsers for modern English, we can see that the difference is not as great as one might expect. For example, Kolachina and Kolachina (2012) evaluated a number of dependency and phrase-structure parsers for English and found accuracies between 70 per cent and 90 per cent.15

Semantic, pragmatic, and sociolinguistic annotation
Semantic annotation often builds on syntactic annotation and involves interpreting a variety of different linguistic phenomena. These include indicating the semantic fields of a text (like sport or medicine), but also tagging named entities such as names of people or places, indicating whether an entity is animate or inanimate, whether it is an event or an abstract entity, and so on. Sense tagging is another important way to semantically annotate a corpus and consists in associating every word with its correct sense in context, based on an external ontology such as WordNet (Miller et al., 1990). WordNet is a lexical–semantic database for the English lexicon. Lexical items are assigned to sets of synonyms (synsets) representing lexical concepts, which are linked through semantic and lexical relations like hyponymy, hyperonymy, and meronymy. An example of a synchronic semantically annotated corpus of English is SemCor (Fellbaum, 1998). Semantic annotation of historical corpora also covers the automatic detection of named entities such as people, organizations, locations, and time expressions, which are of particular relevance to historical research (Toth, 2013). This section will focus on the semantic annotation of historical corpora and provide some examples.
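As a minimal illustration of ontology-based sense tagging, the sketch below uses an invented stand-in for WordNet synsets and a crude overlap between context words and per-sense cue words (in the spirit of Lesk-style disambiguation). All entries are illustrative; a real system would consult WordNet itself.

```python
# A toy stand-in for a lexical ontology: each word maps to candidate senses,
# each with a set of cue words. Entries are invented for illustration.
SYNSETS = {
    "bank": [
        {"sense": "bank.n.01", "gloss": "financial institution",
         "cues": {"money", "loan"}},
        {"sense": "bank.n.02", "gloss": "sloping land beside water",
         "cues": {"river", "water"}},
    ]
}

def tag_sense(word, context_words):
    """Pick the candidate sense whose cue words overlap most with the context."""
    candidates = SYNSETS.get(word, [])
    best = max(candidates,
               key=lambda s: len(s["cues"] & set(context_words)),
               default=None)
    return best["sense"] if best else None

print(tag_sense("bank", ["the", "river", "bank"]))
# bank.n.02
```

For historical corpora the same mechanism faces the additional problem that senses themselves change over time, so a modern ontology may not list the sense a historical token actually carries.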
Semantically annotated historical corpora

Annotating a historical corpus at the semantic level is challenging for a variety of reasons, including the complexity of the task, the high degree of linguistic interpretation required, the scarcity of annotation standards, and the diachronic change of meaning. Some historical corpora have successfully attempted this kind of annotation and have approached it from different points of view. The PROIEL corpus, introduced in section 4.3.2, contains a semantic annotation for its Ancient Greek portion (Haug et al., 2009, 40–3), in addition to morphological and syntactic annotation. The semantic annotation in PROIEL has the form of type-level animacy tagging, and follows the framework developed by Zaenen et al. (2004). Every Greek noun lemma is associated with one category taken from the following set: HUMAN, ORG (for organizations), ANIMAL, VEH (for vehicles), CONC (for concrete entities), PLACE, NONCONC (for non-concrete, inanimate entities), and TIME. These tags provide a ‘flat’ annotation, because they are not organized in any hierarchy. The treebank annotators tagged nouns in the corpus; then, thanks to anaphoric links, the tags were transferred from nouns to pronouns. Since the annotation is generally done at the level of the lemma rather than at the token level, it represents the animacy values of the majority of the corpus tokens, rather than a strictly context-specific identification of animacy. Moreover, this corpus-driven approach means that every lemma is annotated based on the collection of its tokens and not on its general meaning.

15 The authors first converted the parses of a constituency parser into dependency structures. Then, they measured labelled attachment score (LAS), unlabelled attachment score (UAS), and label accuracy (LA).
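The lemma-level tagging and the noun-to-pronoun propagation just described can be sketched as follows. The lemma table, token list, and anaphoric links are invented for illustration; the tagset is the PROIEL one given above:

```python
# A minimal sketch of lemma-level animacy tagging with propagation from
# nouns to pronouns via anaphoric links, as in the PROIEL corpus.

ANIMACY = {
    "anthropos": "HUMAN",    # 'human being'
    "kardia": "NONCONC",     # 'heart': no corpus occurrence is a physical heart
}

def tag_animacy(tokens, anaphora):
    """tokens: (form, lemma, pos) triples; anaphora maps a pronoun's
    token index to its antecedent's token index."""
    tags = {i: ANIMACY[lemma]
            for i, (form, lemma, pos) in enumerate(tokens)
            if pos == "noun" and lemma in ANIMACY}
    for pron, noun in anaphora.items():
        if noun in tags:
            tags[pron] = tags[noun]   # pronouns inherit the antecedent's tag
    return tags

tokens = [("anthropoi", "anthropos", "noun"),
          ("autoi", "autos", "pron"),     # anaphoric to token 0
          ("kardia", "kardia", "noun")]
print(tag_animacy(tokens, {1: 0}))
# {0: 'HUMAN', 2: 'NONCONC', 1: 'HUMAN'}
```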
Therefore the noun kardia ‘heart’, for example, is labelled as NONCONC because none of its corpus occurrences refer to physical hearts. Another type of semantic annotation of historical texts is that of Declerck et al. (2011), who report on the semantic annotation of the Viennese Danse Macabre Corpus, consisting of a digital collection of printed German texts from 1650 to 1750. The aim of the annotation is to identify different conceptualizations of the theme of death; hence the annotation specifically concerns this domain, and uses a tagset which conforms to the Text Encoding Initiative (TEI). Below we give an example of the annotation, taken from Declerck et al. (2011):

<rs type="death" subtype="figure">Mors, Tod, Todt</rs>
<rs type="death" subtype="figureAlternative">General Haut und Bein, Menschenfeind</rs>

This example shows two instances of the tag <rs>, which is used for general-purpose names or strings; in this case the two tags annotate two personifications of violent death. This annotation allows for semantically informed searches on the corpus; for example, we can retrieve the personifications of death as a figure. A different approach to semantic annotation of historical corpora focuses on the historical context of the texts. The Hansard Corpus, which contains 1.6 billion words from 7.6 million speeches in the British Parliament from the period 1803–2005, is semantically tagged, which allows for powerful meaning-based searches. Users can create ‘virtual corpora’ by speaker, time period, house of parliament, and party in power, and make comparisons across these corpora. Semantic annotation can also be performed automatically with the support of computational tools. For instance, Archer et al. (2003) present a tool for semantic annotation of English historical corpora based on USAS (see section 4.3.2 for an introduction to USAS), which was designed and initially implemented for present-day English.
USAS assigns semantic labels based on a thesaurus consisting of over 45,000 words and almost 19,000 multi-word expressions. It works on a set of rules that rank the most likely analyses of a word based on some context-specific disambiguation rules and a frequency lexicon which records all semantic analyses of a word in order of frequency. Archer et al. (2003) adapted USAS to make it possible to tag every word of a historical corpus, thus allowing meaning-based searches on the texts. The analysis referred to the Historical Thesaurus of English compiled at the University of Glasgow, which contains almost 800,000 words from Old English to the present day, arranged into fine-grained hierarchies primarily based on the second edition of the Oxford English Dictionary and its supplements, and the Thesaurus of Old English. Given the hierarchical structure of the thesaurus, the semantic analysis tool allows for concept-based searches on the texts.

Pragmatically and sociolinguistically annotated corpora

At the beginning of the history of corpus linguistics, the annotation of language-internal phenomena like lemmatization, part of speech, or syntax received a great amount of attention. However, language use is best understood when analysed together with its context, as a discursive and social practice. Sociolinguistic research is interested in such contextual information, which covers social categories like gender and class, but also the knowledge possessed by the participants of the communicative event and situational aspects such as the relationships between the participants and the purpose of their communication (Biber, 2001).
Recording the macro-social components of language, as well as the situational aspects of the individual communicative events, is very important to explain the role of language in society, and corpus data constitute crucially important sources of evidence for this type of investigation. Sociolinguistic research is the background to the Corpus of Early English Correspondence, a family of historical corpora compiled with the aim of testing sociolinguistic theories on historical data. In addition to morphological and syntactic annotation, these corpora are linked to a database containing information about letter writers, which allows users to search sociolinguistic information about writers and recipients, like age, gender, and family roles, and thus study the relation between language use and its context. One way to capture pragmatic and social characteristics of language is through the specific type of annotation employed in the Sociopragmatic Corpus (Archer and Culpeper, 2003), a section of the Corpus of English Dialogues 1560–1760 (Kytö and Walker, 2006) covering the years 1640–1760. This corpus contains more than 240,000 words from trial proceedings and drama, annotated with characteristics of the speakers and the addressees. Here is an example from Culpeper and Archer (2008):

<u speaker="s" spid="s4tfranc001" spsex="m" sprole1="v" spstatus="1" spage="9" addressee="s" adid="s4franc003" adsex="m" adrole="w" adstatus="4" adage="8">Look upon this Book; Is this the Book?</u>

The example shows that a male speaker (indicated by spsex="m"), identified by the code s4franc001, acts here as a prosecutor, belongs to the social status ‘gentry’, and is classified as an older adult. His addressee is a male witness, identified by the code s4franc003, of social status commoner, and an adult.
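Annotation of this kind can be queried with standard XML tools. A minimal sketch, with the element modelled on the example above (we assume the addressee’s sex is encoded as adsex):

```python
import xml.etree.ElementTree as ET

# A sketch of filtering sociopragmatically annotated utterances by
# speaker attributes; the <u> element mirrors the example above.
u = ET.fromstring(
    '<u speaker="s" spid="s4tfranc001" spsex="m" sprole1="v" spstatus="1"'
    ' spage="9" addressee="s" adid="s4franc003" adsex="m" adrole="w"'
    ' adstatus="4" adage="8">Look upon this Book; Is this the Book?</u>'
)

# Keep the turn only if the speaker is male and of gentry status ("1").
if u.get("spsex") == "m" and u.get("spstatus") == "1":
    print(u.get("spid"), "->", u.text)
# s4tfranc001 -> Look upon this Book; Is this the Book?
```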
All this information is encoded in terms of attributes of the tag <u>, which encloses a speaker’s conversational turn directed to a specific addressee, in an item of direct speech. Another way to perform pragmatic annotation is by marking discursive elements in language data. This is the approach chosen by the PROIEL project and the tectogrammatical annotation of the Index Thomisticus Treebank, as we will now see. The PROIEL Project (Pragmatic Resources in Old Indo-European Languages) has developed a parallel corpus of the Greek text of the New Testament and its translations into the old Indo-European languages Latin, Gothic, Armenian, and Old Church Slavonic. Specifically, the Greek gospels have been annotated for information structure and discourse structure, in addition to the morphological, syntactic, and semantic annotations (Haug et al., 2009). This kind of annotation records information status and anaphoric distance: givenness tags based on the context used by the hearer to establish the reference, situational information, and encyclopaedic knowledge; tags to express information new to the context; and anaphoric links between discourse referents. The annotation scheme chosen by the Index Thomisticus Treebank is the tectogrammatical annotation of the Prague Dependency Treebank (Passarotti, 2010, 2014), which refers to the Functional Generative Description framework (Sgall et al., 1986). This level of annotation builds on the so-called ‘analytical’ (i.e. syntactic) layer, where every token is a node in the dependency tree. However, the tectogrammatical annotation resolves ellipsis by reconstructing elided nodes, and represents the dependency relations between the elements which have semantic meaning, thus excluding nodes like conjunctions, prepositions, and auxiliaries.
The dependency relations are represented in terms of semantic roles thanks to so-called ‘functors’, such as actor. The pragmatic content of the annotation involves anaphoric references, as well as the information structure of sentences, distinguishing between topic and focus.

.. Annotation schemes and standards

We have seen the major levels of linguistic annotation and discussed their application to historical corpora. In this section we will concentrate on some recommended procedures for conducting a historical corpus annotation project, and stress the infrastructural implications of corpus annotation. In order for an annotation to be consistent throughout the corpus, it is essential that it follows some predefined parameters. An annotation scheme defines the architecture of an annotation in terms of the tags that are allowed in it, and how they should be used. Good annotation schemes should allow us to describe (rather than explain) the phenomena observed in the corpus, and should be based on theory-neutral, widely agreed principles, as far as this is possible. An example of an annotation scheme is Bamman et al. (2008), where the authors describe all the tags employed in the annotation of the Latin Dependency Treebank. This is also an interesting example of a collaborative approach to defining annotation guidelines, because these guidelines are shared with another Latin treebank, the Index Thomisticus Treebank (Passarotti, 2007b). Moreover, both treebanks follow the overall theoretical framework of the Prague Dependency Treebank (Böhmová et al., 2003).
In addition, another Latin treebank, the PROIEL project Latin treebank, is compatible with both the Index Thomisticus Treebank and the Latin Dependency Treebank, since automatic conversion processes are available from one format to the other; this increases the range of opportunities for linguistic analyses that span the data from all three treebanks. A similar example of a shared approach to annotation is given by the Penn Corpora of Historical English, which include the Penn–Helsinki Parsed Corpus of Middle English (Kroch and Taylor, 2000), the Penn–Helsinki Parsed Corpus of Early Modern English (Kroch and Delfs, 2004), and the Penn Parsed Corpus of Modern British English (Kroch and Diertani, 2010). Following the same schema designed for the Penn Corpora of Historical English, a whole constellation of corpora has been built over the years: the York–Helsinki Parsed Corpus of Old English Poetry (Pintzuk and Plug, 2002), the York–Toronto–Helsinki Parsed Corpus of Old English Prose (Taylor et al., 2003), the York–Helsinki Parsed Corpus of Early English Correspondence (Taylor et al., 2006), the Tycho Brahe Corpus of Historical Portuguese (Galves and Britto, 2002), Corpus MCVF (parsed corpus), Modéliser le changement: les voies du français (Martineau and Morin, 2010), and the Icelandic Parsed Historical Corpus (Wallenberg et al., 2011a). Covering a larger set of languages, Universal Dependencies16 is a project aimed at developing treebank annotation for many languages (including historical ones), with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. [. . .]
The general philosophy is to provide a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary.17 Unfortunately, collaborations such as the ones mentioned above are not as frequent as we would wish. In historical corpus research, as in corpus linguistics in general, there are several schemes for corpus annotation, and no prevailing one. This is due to historical reasons: the older projects especially often originated within different theoretical frameworks to address specific needs and goals, and therefore developed their own (often peculiar) approaches to annotation; see, for example, the original annotation used for the Index Thomisticus (Busa, 1980). While this is partially justified by the fact that different languages require different annotation schemes, and that each level of annotation has its own features, it becomes increasingly important to aim at a more harmonized state, especially given the growth in the number of annotated historical corpora.

16 http://universaldependencies.org/.
17 http://universaldependencies.org/introduction.html.

Although no annotation scheme should be considered a standard a priori, since the beginning of corpus linguistics it has gradually become clear that standards commonly agreed through practice and consensus are necessary. Such standards make corpora processable by a variety of software systems, thus facilitating the comparison, sharing, and linking of annotated corpora and avoiding duplication of effort, while at the same time enhancing the evidence base for historical linguistic analyses.
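One concrete instance of such a shared standard is the tab-separated CoNLL-U format used by the Universal Dependencies treebanks mentioned above. A minimal sketch of reading one token line (the Latin line is constructed here for illustration, not taken from an actual UD treebank):

```python
# A sketch of parsing a CoNLL-U token line: ten tab-separated fields
# (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).
line = ("1\tsumant\tsumo\tVERB\t_\t"
        "Mood=Sub|Number=Plur|Person=3|Tense=Pres\t0\troot\t_\t_")

fields = line.split("\t")
token = {
    "id": int(fields[0]),
    "form": fields[1],
    "lemma": fields[2],
    "upos": fields[3],
    "feats": dict(kv.split("=") for kv in fields[5].split("|")),
    "head": int(fields[6]),
    "deprel": fields[7],
}
print(token["lemma"], token["feats"]["Mood"])  # sumo Sub
```

Because every UD treebank uses this layout, a reader like this works unchanged across languages, which is precisely the benefit of agreed standards.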
To this end, TEI18 has published the Guidelines for Electronic Text Encoding and Interchange, which document ‘a markup language for representing the structural, display, and conceptual features of texts’.19 TEI has modules for different text types (drama, dictionaries, letters, poems, and so on), and its annotation guidelines cover a range of palaeographic, linguistic, and historical features. For an overview of TEI for historical texts, see Piotrowski (2012, 60–7). Here we will look at one example of a historical text annotated following TEI conventions, the Bodleian First Folio.20 The following is an excerpt from the beginning of Shakespeare’s A Midsummer Night’s Dream.

<stage rend="italic center" type="entrance">Enter Theseus, Hippolita, with others.</stage>
<cb n="1"/>
<sp who="#F-mnd-duk">
<speaker rend="italic center">Theseus.</speaker>
<l n="1"><c rend="decoratedCapital">N</c>Ow faire Hippolita, our nuptiall houre</l>
<l n="2">Drawes on apace: foure happy daies bring in</l>
<l n="3">Another Moon: but oh, me thinkes, how slow</l>
<l n="4">This old Moon wanes; She lingers my desires</l>
<l n="5">Like to a Step-dame, or a Dowager,</l>
<l n="6">Long withering out a yong mans reuennew.</l>
</sp>
. . .

The element ‘stage’ contains stage directions, ‘cb’ marks the beginning of a column of text, ‘sp’ marks the speech text, ‘speaker’ gives the name of the speaker in the dramatic text, and ‘l’ indicates the verse line. For a complete explanation of the tags and attributes, see TEI Consortium (2014).

18 http://www.tei-c.org.
19 Unlike annotation, which typically adds linguistic information to the text, markup is usually concerned with marking information relative to the structure and context of the texts, such as author names or speakers in a drama, for example.
20 http://firstfolio.bodleian.ox.ac.uk/.
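Because TEI documents are XML, such excerpts can be processed with ordinary XML tooling. A sketch extracting the speaker and verse lines from a shortened version of the excerpt above (the <div> wrapper is added here only to make the fragment well-formed):

```python
import xml.etree.ElementTree as ET

excerpt = """<div>
<sp who="#F-mnd-duk">
<speaker rend="italic center">Theseus.</speaker>
<l n="1"><c rend="decoratedCapital">N</c>Ow faire Hippolita, our nuptiall houre</l>
<l n="2">Drawes on apace: foure happy daies bring in</l>
</sp>
</div>"""

sp = ET.fromstring(excerpt).find("sp")
speaker = sp.find("speaker").text
# itertext() flattens nested elements such as the decorated capital <c>.
lines = ["".join(l.itertext()) for l in sp.findall("l")]
print(speaker)   # Theseus.
print(lines[0])  # NOw faire Hippolita, our nuptiall houre
```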
TEI is a very positive initiative which addresses the need for standardization in the markup and annotation of texts in the humanities and social sciences; it is very widespread in the field of digital humanities. The Medieval Nordic Text Archive aims to preserve, disseminate, and publish medieval texts in digital form, and to develop the standards required for this. The archive includes texts in the Nordic languages and in Latin (http://www.menota.org/), and its texts are encoded in TEI. Generally speaking, TEI is not very widely used for historical corpora, where there is a stronger emphasis on linguistic annotation rather than on palaeographic and historical markup. Moreover, most programs for automatic annotation (the NLP tools introduced in section 4.3) strip out all forms of markup contained in the texts, as it is not relevant to the automatic processing they perform. However, in the case of historical texts, the information contained in these tags can be crucial to the interpretation of the text and should be considered by the language processing tools. A related difficulty is the fact that historical texts typically contain a number of non-linear elements, such as alternative readings or corrected and erroneous text, which are heavily dependent on the specific edition of the text. A challenge for the future will certainly be to have the NLP community interact more with the TEI community and make it possible to apply NLP to complex TEI documents while preserving their tagging structure for further analysis.

. Case study: a large-scale Latin corpus

We have seen how annotation makes it possible for researchers to search historical corpora for simple and complex linguistic entities. As the size of the corpora increases, automatic annotation becomes more and more of a necessity.
This is especially true when we consider the increasing amount of texts that are being digitized as part of digital humanities projects, and that constitute very valuable sources of data for historical linguistics research. The case study illustrated in this section, an interesting application of historical NLP tools to Latin, shows an example of a very fruitful interchange between these disciplines. LatinISE (McGillivray and Kilgarriff, 2013) is a Latin corpus containing 13 million word tokens, available through the corpus query tool Sketch Engine (Kilgarriff et al., 2004). Similarly to corpora compiled for modern languages like ukWac (Ferraresi et al., 2008), the texts making up LatinISE were collected from web pages. However, the process of data extraction was controlled by selecting three specific online digital libraries: LacusCurtius,21 IntraText,22 and Musisque Deoque.23 These websites contain Latin texts covering a wide range of chronological eras, from the archaic age to the beginning of the current century, all editorially curated, which means that the quality of the raw material is superior to that of general web resources. Another important observation concerns the metadata that the texts were provided with. As we discussed in section 4.1, this is an essential property of historical corpora, since it allows for further corpus-based studies that analyse the language in its historical context. In the case of LatinISE, the metadata were inherited from the original online libraries and include information on the names of authors, titles, books, sections, paragraphs, and line boundaries for poetry.

21 http://penelope.uchicago.edu/Thayer/I/Roman/home.html.
22 http://www.intratext.com.
23 http://www.mqdq.it.
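When texts are converted for a corpus query tool, metadata of this kind typically survive as structural attributes in the one-token-per-line (‘verticalized’) layout that tools such as Sketch Engine require. A sketch, with invented metadata values and token analyses:

```python
# A sketch of the verticalized corpus format: structural metadata as
# attributes on a <doc> element, then one token per line (here with
# part-of-speech and lemma columns of the kind added later in the
# annotation pipeline). Author, title, and analyses are illustrative.

def verticalize(meta, tokens):
    lines = ['<doc author="{author}" title="{title}">'.format(**meta)]
    lines += ["\t".join(tok) for tok in tokens]
    lines.append("</doc>")
    return "\n".join(lines)

doc = verticalize({"author": "Ovidius", "title": "Fasti"},
                  [("sumant", "V", "sumo"),
                   ("exordia", "N", "exordium"),
                   ("fasces", "N", "fascis")])
print(doc)
```

Keeping the metadata inside the verticalized file is what later allows queries to be restricted by author, work, or era.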
After removing HTML tags and irrelevant content from the web pages, the corpus compiler converted them into the verticalized format required by Sketch Engine, where each line contains only one token or punctuation mark. In addition to being provided with rich metadata, LatinISE is also lemmatized and part-of-speech-tagged. The lemmatization relies on the morphological analyser of the PROIEL Project, developed by Dag Haug’s team,24 complemented with the analyser Quick Latin.25 As an example, consider the following phrase:

(4) sumant exordia fasces
    take:sbjv.prs.3pl beginning:acc.n.pl fasces:nom.m.pl
    ‘let the fasces open the year’

This sentence was automatically analysed as follows:

> sumant
sumo<verb><3><pl><present><subjunctive><active>
> exordia
exordium<noun><n><pl><acc>
exordium<noun><n><pl><nom>
exordium<noun><n><pl><voc>
> fasces
no result for fasces

For each word form, the morphological analyser generated all possible analyses, which included an empty result for fasces. These multiple analyses needed to be disambiguated so as to assign the most likely lemma and part of speech to each token in context. This disambiguation was achieved with a machine-learning approach, by relying on existing Latin treebanks: the Index Thomisticus Treebank, the Latin Dependency Treebank, and the PROIEL Project’s Latin treebank. At the time of the creation of LatinISE, these corpora contained a total of 242,000 lemmatized and morphosyntactically annotated words; this set was used for training TreeTagger (Schmid, 1995), a statistical part-of-speech tagger developed by Helmut Schmid at the University of Stuttgart.

24 http://www.hf.uio.no/ifikk/english/research/projects/PROIEL/.
25 http://www.quicklatin.com/.

McGillivray and Kilgarriff (2013) describe how
TreeTagger was run on the analyses of the morphological analyser to obtain the most likely part of speech and lemma for each word form in the corpus. In Example (4), the corresponding corpus occurrences are:

sumant   V   sumo
exordia  N   exordium
fasces   N   fascis

Every line contains the word form, followed by the part-of-speech tag (‘N’ for ‘noun’ and ‘V’ for ‘verb’) and the lemma. LatinISE is currently in its first version, and an evaluation of the automatic lemmatization and part-of-speech tagging is the necessary next step to assess the usability of the corpus, especially on texts from those eras whose language differs significantly from that of the training set. With its ongoing development, this corpus testifies to the challenges of applying NLP tools to historical language data, and of dealing with texts from very different time periods. At the same time, a large diachronic annotated corpus is what is needed to conduct a study of language change. Of course, some may discount the period when Latin was not spoken by native speakers; we believe that this corpus is nevertheless a valuable resource for Latin (diachronic) studies. Following principle 9 (section 2.2.9) and principle 10 (section 2.2.10), quantitative evidence is the only type of evidence for detecting trends, and this evidence comes primarily from corpora. A corpus like LatinISE, which was annotated automatically, can be improved by successively refining the training set for the automatic annotation. Hence, it is a resource that can serve the community both by being the empirical basis for quantitative analyses and by being subject to further incremental developments leading to better and better language resources.

. Challenges of historical corpus annotation

So far we have stressed the merits of corpus annotation and have seen how annotated historical corpora can serve the scholarly community.
However, some scholars have criticized annotation, and in this section we will dedicate some space to their arguments, and to more general considerations about annotated corpora in historical linguistics research. Sinclair (2004, 191) called corpus annotation ‘a perilous activity’, which negatively affected the text’s ‘integrity’ and caused researchers to miss ‘anything the tags [are] not sensitive to’. Hunston (2002, 93) evokes a similar danger, in which researchers may tend to forget that the categories used to search the corpora partially shape their research questions:

the categories used to annotate a corpus are typically determined before any corpus analysis is carried out, which in turn tends to limit, not the kind of question that can be asked, but the kind of question that usually is asked.

It is indeed true that if we choose an annotation scheme that is too firmly bound to a specific linguistic paradigm, we risk only finding results supporting that paradigm. Moreover, depending on the annotation available for a corpus, certain research questions may not be answerable with that corpus. For example, imagine that our annotation scheme contained a tag for ‘noun’ and our annotation guidelines specified when an element was to be annotated as a noun; then we would be constrained by these choices when retrieving nouns from the corpus. If our research aim is to define the characteristics of nouns, then our results would be heavily influenced by the corpus guidelines. One partial solution is to be very precise in specifying the corpus compilation principles and the assumptions made during the annotation phase, so that any research results can be interpreted in light of them, and results from differently annotated corpora can be compared. In any case, the dependence of annotation on the schema is an unavoidable consequence of the practice of annotation itself.
We can think of annotation as a pragmatic (in the common, non-linguistic sense of the word) solution to the problem of representing linguistic categories and their properties. Annotation tags are convenient representations of theoretical entities and should not be confused with the linguistic entities themselves. Annotated corpora are examples of the symbolic modelling of language introduced in section 1.2.2. They impose discrete categorizations on linguistic elements. This symbolic representation is compatible with both categorical and non-categorical views of language, precisely because it is a model and not the linguistic reality itself. In other words, a corpus annotation that contains the categories of noun and verb can coexist with a view whereby such categories sit along a probability distribution. Equally, such annotation is compatible with a view according to which words possess parts of speech as discrete classes. Archer (2012) discusses some of the objections to corpus annotation and explores the question of whether annotation can be seen simply as a useless exercise that does not add anything to the data that is not already contained in them. In line with Archer’s (2012) view, we believe that corpus annotation is an essential step in the research process, and that, in spite of its limits, it contributes to a transparent way of drawing empirical conclusions from language data. Historical corpus linguistics can certainly hope to gain a status independent of corpus linguistics for modern languages by developing more and more sophisticated tools for annotating historical texts, following past and current research directions, and by emphasizing the unique features of historical texts. Annotated corpora are not fixed and immutable objects, and the issue of maintenance is critical in corpus building.
Corpora need continuous updates for a range of reasons: as new linguistic theories emerge, as we discover new properties of language, or simply as more people contribute to the annotation by various means, including crowdsourcing. In the case of historical corpora, this is particularly important. Due to the lack of native speakers and the philological complexities that affect many historical texts, it is advisable to support a more flexible type of annotation, which allows for multiple interpretations of the texts by different annotators. This model of annotation is particularly appreciated by classicists and philologists, who are interested in displaying the different variants of the original text resulting from its transmission over time. Along these lines, Bamman and Crane (2009) propose a model of annotation that takes into account the scholarly tradition developed on the texts and gives the annotators scholarly credit for their work. Bamman and Crane (2009) applied this model to the portion of the Ancient Greek Dependency Treebank containing Aeschylus’ plays. This is an example of a highly debated text, both in terms of its philological transmission and its syntactic interpretation (which are linked, of course). In this respect, this model of scholarly annotation corresponds to the traditional practice of compiling critical editions and will, it is hoped, encourage philologists to engage with it alongside corpus and computational linguists. Another way in which corpora should be updated is in conjunction with the research process itself. Historical corpora are often used to study particular linguistic phenomena. Once the researcher has extracted the patterns of interest from the corpus, he or she may carry out further analyses.
For example, in the case study on the early modern English third-person verbal ending described in section 7.3, we collected all instances of third-person endings of verbs from the Penn–Helsinki Parsed Corpus of Early Modern English. Then, we added lemma information to each verbal form, as we wanted to measure the effect of lemma frequency on the type of morphological ending realized. The lemma information was not available in the original corpus, so this work enriched the corpus material, which we made available for reuse by other scholars. One way to maintain data sets like the one we built for that case study, which are important outputs of the research process, is to make it possible to incorporate additional annotation into the user’s personal working copy of the corpora, as allowed by the Penn–Helsinki Early English Treebanks. Additionally or alternatively, the analysis can be made publicly accessible by publishing it in a repository, as we chose to do. This way, other researchers can make use of this work in combination with the original corpus data, provided that some linking mechanism is in place. In this specific case, we published the list of verb form types and their associated lemmas, thus effectively providing a linking facility, which is in line with the requirement of reproducibility highlighted in section 2.3.1. All these approaches point towards a view of annotated corpora as the results of collaborative efforts on the basis of which research can make progress in an incremental way. We believe that, thanks to this collaborative attitude, historical linguistics research can achieve access to larger data sets that allow us to reach more ambitious goals, well beyond what is possible in the context of small-scale studies. In the next chapter we will expand on this further.

(Re)using resources for historical languages
. Historical languages and language resources

In Chapter 4 we saw that annotated corpora are essential to quantitative historical linguistics research. Of course, they are not the only source we can rely on. Indirect sources of language data like dictionaries and lexicons have been, and still are, of great importance. Unlike corpora, where words are organized in their context of occurrence, traditional language resources store general information about lexical items out of context, and in some cases link this information back to their occurrences in the texts (section 2.1.3). In this chapter we will support a view according to which such links between lexical entries and their occurrences in context (i.e. in corpora) should be made more systematic and explicit; we will therefore argue that the gap between corpora and other language resources can be closed thanks to a corpus-driven approach paired with a quantitative practice, and show the benefits of this perspective for research in historical linguistics. We will also turn our attention beyond language resources, towards the wider landscape of historical and cultural heritage resources, and make a case for synergies that can benefit research on historical languages. Finally, we will make a case for building language resources in a way that makes them easy to maintain and compatible with other resources, and for reusing existing resources when that is possible, thus increasing the levels of transparency and replicability that are among the most important elements of our methodology (sections 1.1 and 4.1.1).

.. Corpora and language resources

Traditional language resources like dictionaries are very useful in historical linguistics research. However, even when they are based on corpora, if they are qualitative in nature it is not possible to draw quantitative arguments from them, apart from basic type frequencies extracted from the resource itself.
Conversely, corpus-driven language resources like computational lexicons offer more potential for integration with corpora and therefore allow researchers to add a quantitative dimension to their analysis, as we will show in this section. Let us start with an example of a psycholinguistic phenomenon that is relevant to historical linguistics: local syntactic ambiguity. Consider Example (1), a Latin sentence from Ovid, Metamorphoses 1.736:

(1) et Stygias iubet hoc audire paludes
    and Stygian:acc.f.pl command:ind.prs.3sg this:acc.n.sg listen:inf.prs water:acc.f.pl
    'He commands the Stygian waters to listen to this.'

Example (1) contains an instance of the general pattern [V1 ARG V2], where V1 is the verb iubet, ARG is the pronoun hoc, and V2 is the verb audire. According to the valency properties of the two verbs, ARG could be an argument of both V1 and V2. Example (1) is a case of local syntactic ambiguity, which is resolved once the sentence is read out in full. This is in line with the online nature of oral language comprehension, whereby the hearer perceives one word at a time and incrementally interprets the partial input, even before the sentence is complete (Schlesewsky and Bornkessel, 2004; Van Gompel and Pickering, 2007, 289; Levy, 2008, 1129). McGillivray and Vatri (2015) investigated this phenomenon in Latin and Ancient Greek, taking the opportunity to apply some principles from psycholinguistics to historical languages, for which experiments on native speakers are, of course, not possible.
Before it is read in full, Example (1) may be taken to mean 'he commands the Stygian waters this', indicating an order given to the waters; however, after reading audire, it becomes clear that this verb governs hoc and therefore the sentence unambiguously means 'he commands the Stygian waters to listen to this'. In order to classify Example (1) as ambiguous, we need to know that both iubeo and audio can govern hoc. In other words, we need to answer the question: are iubeo 'to command' and audio 'to listen' transitive verbs? Traditional language resources like dictionaries and lexicons can help to answer this question, as they contain vast amounts of information about lexical items, including verbs' transitivity. The Latin-to-English dictionary by Lewis and Short (1879)1 records that sense 1α of iubeo can occur 'with an object clause', as in (2) from Terence's Eunuchus 3, 2, 16, where istos foras exire 'that they come out' is the object clause of the imperative iubete 'order':

(2) iubete istos foras exire
    order:imp.prs.2pl that:acc.m.pl out come_out:inf.prs
    'order them to come out'

On the other hand, the first sense of the entry for the verb audio in Lewis and Short (1879) records aliquid 'something' (i.e. an accusative direct object) as the first of the possible argument structure configurations for this verb. Therefore, from the information contained in the dictionary, we know that the two verbs in Example (1) are transitive, and thus we can hypothesize that the accusative hoc in Example (1) can be the argument of either iubet ('He commands this to the Stygian waters')2 or audire ('He commands the Stygian waters to listen to this').

1 Accessed from the Perseus Project's page http://www.perseus.tufts.edu.
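To make the classification step concrete, the compatibility check just described — can both verbs govern the accusative pronoun? — can be sketched in a few lines of code. The lexicon format and token representation below are our own illustrative assumptions, not the actual data model used by McGillivray and Vatri (2015).

```python
# Sketch: flag [V1 ARG V2] sequences where ARG's case is compatible
# with the valency of both verbs. Lexicon and token shapes are
# invented for illustration.

# Valency lexicon: lemma -> set of argument cases the verb can govern
VALENCY = {
    "iubeo": {"acc"},   # 'to command' takes an accusative object
    "audio": {"acc"},   # 'to listen' takes an accusative object
}

def locally_ambiguous(tokens):
    """Return (V1, ARG, V2) triples where ARG is compatible with the
    valency of both surrounding verbs. Each token is a tuple
    (form, lemma, pos, case)."""
    hits = []
    for i in range(len(tokens) - 2):
        v1, arg, v2 = tokens[i], tokens[i + 1], tokens[i + 2]
        if (v1[2] == "VERB" and v2[2] == "VERB" and arg[2] == "PRON"
                and arg[3] in VALENCY.get(v1[1], set())
                and arg[3] in VALENCY.get(v2[1], set())):
            hits.append((v1[0], arg[0], v2[0]))
    return hits

# Ovid, Met. 1.736: '... iubet hoc audire ...'
sent = [
    ("iubet", "iubeo", "VERB", None),
    ("hoc", "hic", "PRON", "acc"),
    ("audire", "audio", "VERB", None),
]
print(locally_ambiguous(sent))  # [('iubet', 'hoc', 'audire')]
```

A real system would of course work over full dependency trees rather than adjacent tokens, but the core test — membership of the argument's case in both verbs' valency sets — is the same.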
The argument structure information contained in a dictionary can certainly help us to consider the different possible syntactic interpretations of a sentence like Example (1). However, if we want to be able to identify all locally ambiguous sentences in a corpus of texts without manually checking each instance, we need to combine the corpus data with a machine-readable resource. Such a resource can be automatically queried by a computer algorithm in order to detect those sentences where two verbs occur with a noun phrase that is compatible with the valency properties of both verbs, making it possible for both verbs to govern that phrase. This is the approach followed by McGillivray and Vatri (2015), who relied on corpus-driven computational valency lexicons for Latin and Ancient Greek verbs. In the next section we will cover the difference between corpus-based and corpus-driven lexicons, and briefly illustrate the valency lexicon in question.

5.1.2 Corpus-based and corpus-driven lexicons

As noted in McGillivray (2013, 32–6), traditional historical dictionaries are qualitative resources. They are compiled on the basis of large collections of examples, usually taken from the canon of texts of a historical language. In this sense they may be called 'corpus-supported' resources in a loose sense, if we broaden the term 'corpus' to cover any collection of texts, independently of its format and of the selection criteria and annotation features of modern corpus linguistics. In other words, the texts constitute the evidence source on which the historical lexicographer relies to prepare the summary contained in a dictionary entry. That this is the case is evident from the number of examples included to support most statements about grammatical and lexical-semantic properties in a dictionary.
However, the process leading from the whole collection of texts to the selected examples that appear in a lexical entry is the result of the subjective judgement of the dictionary's compilers, and cannot always be reliably reproduced. A similar argument holds for other historical dictionaries and thesauri like the Oxford English Dictionary3 and the Historical Thesaurus of English.4 Such a qualitative approach makes dictionaries supported by a complete corpus good resources for answering qualitative questions such as 'Is verb X found with a dative object in historical language Y?' (assuming that verb X is included among the examples presented in the dictionary), but not quantitative questions like 'Has the proportion of animate senses of noun X over inanimate senses increased over time?'. The reason for this is rooted in the original purpose of printed dictionaries, which suited an era of information scarcity. They aimed to 'provide information in a manner which is accessible to the reader . . . The reader should . . . regard the Dictionary as a convenient guide to the history and meaning of the words of the English language, rather than as a comprehensive and exhaustive listing of every possible nuance' (Jackson, 2002, 60). With the potential offered today by digitized text collections and computational tools, we can raise our ambitions towards a more systematic account of the behaviour of words in texts; this information can then be queried by programs as well as humans, as we will see in the next sections.

2 This interpretation is only acceptable if we consider the online processing of the sentence up to the word hoc (et Stygias iubet hoc).
3 http://www.oed.com.
4 http://historicalthesaurus.arts.gla.ac.uk.
5.1.3 Historical valency lexicons

In the field of computational linguistics there have been several successful attempts at building lexical resources from corpora in a radically different way from traditional dictionaries. One example is the Italian historical dictionary TLIO (Tesoro della Lingua Italiana delle Origini),5 which is directly associated with a corpus of texts. If we focus on valency lexicons, we find examples like PDT-Vallex (Hajič et al., 2003), FrameNet (Baker et al., 1998), and PropBank (Kingsbury and Palmer, 2002), to name just a few. All these lexicons have in common the fact that they are based on syntactically annotated corpora. This makes it possible to maintain an explicit relation between the corpus and the lexicon: once the corpus has been annotated (for example by marking all arguments and their dependency on verbs), human compilers create the lexicon by summarizing the corpus occurrences into lexical entries (for example by describing the argument patterns found for each verb) and recording the link between the entries and the corpus. Moving from a corpus-based to a corpus-driven approach, computational lexicons like Valex (Korhonen et al., 2006) for English and LexSchem (Messiant et al., 2008) for French systematically describe the valency behaviour of all verbs in the corpora they are linked to. These lexicons are automatically extracted from annotated corpora and therefore display frequency information about each valency pattern, which can be traced back to the original corpus occurrences it was derived from. For example, it is possible to know how many times a verb occurs with a subject and direct object, and retrieve all corpus instances of this pattern.
Attempts to apply this approach to Latin data have resulted, for example, in the lexicon described by Bamman and Crane (2008), which was automatically extracted from a Latin corpus of 3.5 million words from the Perseus Digital Library that had been automatically parsed (i.e. syntactically analysed). McGillivray (2013, 31–60) describes a corpus-driven lexicon automatically derived from the Latin Dependency Treebank (Bamman and Crane, 2007) and the Index Thomisticus Treebank (Passarotti, 2007b). Figure 5.1 shows the lexicon entry for the Latin verb impono. Each entry in the lexicon corresponds to a verb occurrence in the corpora, identified by an ID number for the verb (second column) and the unique sentence number from the corpus (last column); in addition, the lexicon entry displays the author of the text in which that occurrence is found (first column), the verb lemma (third column), and the argument pattern corresponding to that verb token (fourth column). For example, the pattern 'A_Obj[acc],Sb[nom]' in the first row indicates that the verb impono in sentence 845 occurs in the active voice, with an accusative direct object and a nominative subject. Applying the same database queries developed to create the Latin lexicons to data from the Ancient Greek Dependency Treebank (Bamman and Crane, 2009), which follows the same annotation guidelines and format as the two Latin treebanks previously mentioned, McGillivray and Vatri (2015) describe a corpus-driven valency lexicon for Ancient Greek, which they used to study the phenomenon of local syntactic ambiguity.

Figure 5.1 Lexical entry for the verb impono from the lexicon for the Latin Dependency Treebank. The pattern is called 'scc2_voice_morph' because it shows the voice and the morphological features of the arguments.

5 http://tlio.ovi.cnr.it/TLIO.
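Pattern strings of the kind shown in Figure 5.1 are easy to process automatically, which is what makes such lexicons machine-queryable. The following sketch parses a string in the 'A_Obj[acc],Sb[nom]' shape into its voice and argument slots; the actual storage format of the published lexicon may differ, so treat this as an illustration of the idea rather than its implementation.

```python
# Illustrative parser for argument-pattern strings such as
# 'A_Obj[acc],Sb[nom]': a voice prefix, then comma-separated
# syntactic slots with morphological case in brackets.
import re

def parse_pattern(pattern):
    voice, _, args = pattern.partition("_")
    slots = []
    for part in args.split(","):
        m = re.fullmatch(r"(\w+)\[(\w+)\]", part)
        if m:
            slots.append((m.group(1), m.group(2)))
    return {"voice": voice, "args": slots}

entry = parse_pattern("A_Obj[acc],Sb[nom]")
print(entry["voice"])  # A  (active voice)
print(entry["args"])   # [('Obj', 'acc'), ('Sb', 'nom')]
```

Once patterns are parsed into structured form, counting how often a verb occurs with, say, a subject and direct object reduces to filtering and counting rows.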
The advantages of automatically built lexicons like the Latin and Greek ones described above are numerous. First of all, as we have seen, they contain frequency information which is directly linked to the corpus, thus allowing for corpus-based quantitative studies, as prescribed by principle 8 (section 2.2.8). Second, they are easy to maintain because, as the corpus grows in size, the automatic processes for obtaining the lexicon can be executed again without starting a new process from scratch. This is exemplified by the Ancient Greek lexicon described in McGillivray and Vatri (2015), built using the same approach developed for the Latin lexicon, as we have seen. Third, the creation of these lexicons is independent of the corpus annotation phase, which minimizes the risk of biased results. In traditional studies (as we have seen from the survey reported on in Chapter 1), the phase of data/text collection and the phase of data analysis are often performed jointly, in the context of a specific study and with a particular set of theoretical hypotheses in mind. By resorting to corpus-driven resources, the phases are kept separate: the text collection phase happens at the point at which the corpus compilers build the corpus; then the persons responsible for the language resource extract the corpus data to create the lexicon via automatic techniques. Only at this stage does the researcher pull the relevant data from the language resource to address a specific research question. For example, McGillivray (2013, 127–78) describes a study on the argument structure of Latin verbs prefixed with spatial preverbs. The study relies on the corpus-driven valency lexicon described in McGillivray (2013, 31–60).
Hence, the decision of what counts as a verbal argument and what counts as an adjunct was made by the corpus annotators and therefore was not influenced by the specific purpose of the study on preverbs. This guarantees a higher level of consistency (as noted in section 6.1), and facilitates the reproducibility of the study, as recommended in our best practices (section 2.3.1).

5.1.4 Other historical lexicons

Valency lexicons are very useful for studies that require information on verbs' syntactic arguments. For other purposes, different types of lexicons are available for historical languages. One such type of resource is the set of lexicons developed in the context of the IMPACT (Improving Access to Text) project,6 which aims at developing a framework for digitizing historical printed texts written in European languages. One common issue with performing OCR on historical texts is that it requires a large lexicon containing all possible spellings and inflections of words over time, as the OCR algorithm uses the lexicon to assign the most likely transcription to each word. Another challenge with searching historical texts concerns retrieval: ideally, users should find occurrences of old spellings or inflections of words by searching for the modern variants. For example, the user may search for 'water' and be presented with corpus occurrences of 'weter', 'waterr', 'watre', and so on. Moreover, lists of proper names (so-called 'named entities', typically for locations, persons, and organizations) drastically improve the accuracy of OCR systems for historical texts.

6 http://www.digitisation.eu/.

To address all
these needs, the researchers of the IMPACT project have developed computational morphological lexicons which record both spelling variants and inflected forms for modern lemmas, as well as named-entity lexicons, for Bulgarian, Czech, Dutch, English, French, German, Polish, Slovene, and Spanish (Depuydt and de Does, 2009). The morphological lexicons were created from corpora, collections of quotations contained in historical dictionaries, and/or modern dictionaries provided with historical variants. These morphological lexicons usually contain frequency information as well. The named-entity lexicons were created by training named-entity recognition algorithms on manually curated sets tagged with various types of named-entity labels (so-called 'gold standards') and then running the named-entity recognizers on new, unannotated data. Let us consider one of the historical lexicons developed as part of the IMPACT project, the lexicon for German.7 This lexicon was extracted from a corpus of 3.5 million words from the Early New High German (1350–1650) and New High German (since 1650) periods. Each entry in the lexicon has the following structure: historical word form, followed by the corresponding modern lemma and its attestations in the corpora. The lexicon was created with Lextractor, a web-based tool with a graphical user interface designed for lexicographers. This tool contains a modern morphological lexicon, a lemmatizer, and an algorithm that uses rules to generate historical forms from modern lemmas. Therefore, the tool is able to suggest the linguistic interpretation of some of the historical word forms, in terms of their modern lemmas, part-of-speech information, and their possible attestations in corpora.
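The suggestion mechanism can be sketched as a small rule-application loop: apply subsets of historical-to-modern rewrite rules to a historical form and keep any result found in a modern morphological lexicon. The rules and lexicon entries below follow the German examples discussed by Gotscharek et al. (2009) but are simplified for illustration; this is not the Lextractor implementation.

```python
# Minimal sketch of rule-based lemma suggestion: rewrite a historical
# form with subsets of rules, then look the results up in a (toy)
# modern morphological lexicon.
from itertools import combinations

RULES = [("th", "t"), ("ei", "ai"), ("ey", "ei"), ("l", "ll")]

# modern form -> list of (lemma, analysis); entries are illustrative
MODERN_LEXICON = {
    "teile": [("teil", "noun, plural"), ("teilen", "verb, 1sg pres ind")],
    "taille": [("taille", "noun")],
}

def candidates(historical_form):
    """Return modern-lexicon matches reachable by applying some
    subset of the rewrite rules (in order) to the historical form."""
    found = {}
    for r in range(len(RULES) + 1):
        for subset in combinations(RULES, r):
            form = historical_form
            for old, new in subset:
                form = form.replace(old, new)
            if form in MODERN_LEXICON:
                found[form] = MODERN_LEXICON[form]
    return found

print(candidates("theile"))  # suggests both 'teile' and 'taille'
```

A lexicographer would then confirm or reject each suggestion, exactly as in the workflow described in the text.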
The lexicographer has the option of accepting or rejecting the automatic suggestions, and difficult cases are handled collaboratively (Gotscharek et al., 2009). For example, the following rules are among those for generating modern forms from historical forms in German:

1. th → t
2. ei → ai
3. ey → ei
4. l → ll

In addition, the morphological lexicon for modern German maps the inflected form teile to the noun teil 'part' (plural) and to the verb teilen 'to share' (first person singular present indicative). When presented with the historical form theile, Lextractor can suggest the lemmas teil and teilen (via the form teile) by combining the first rule listed above with the modern morphological information; Lextractor can also suggest the lemma taille by applying the second and fourth rules and the modern morphological lexicon entry for taille 'waist'. At this point the lexicographer can confirm or reject the automatic suggestions; moreover, he or she can classify the historical form as one or more of the following cases:

• historic form without modern equivalent;
• historic abbreviation;
• pattern matcher failed;
• named entity;
• missing in modern lexicon.

Next, Lextractor provides the lexicographer with a list of candidate corpus attestations of the word form, with their context in the form of concordances, as well as the frequencies of all forms of the lemma being analysed and their time stamps. The user can then select the correct occurrences.

7 http://www.digitisation.eu/tools-resources/language-resources/historical-and-named-entities-lexica-of-german/.

5.2 Beyond language resources

Historical sociolinguists have long emphasized the relationship between language and its social context (Romaine, 1982; Nevalainen, 2003; McColl Millar, 2012).
As we said in section 4.3.2, recording the macro-social components of language, as well as the situational aspects of individual communicative events, is very important for explaining the history of language in society (both in terms of the history of individual languages and in terms of language change), and corpus data constitute important evidence sources for this type of investigation. As we noted in section 4.3.3, the field of digital humanities has been concerned with contributing to humanities research by addressing research questions of humanistic disciplines with the support of digital tools. One important project in this area is the TEI, which establishes a standard for annotating a wide range of textual and contextual information for a large number of text types and formats. Unfortunately, the academic communities of digital humanities and historical linguistics have not always shared approaches and tools, and TEI markup is still not usually employed in corpus annotation. However, this tendency is gradually changing. In recent years, the collaboration between historical linguists and scholars from other historical areas of the humanities has received a new impulse thanks to their shared interests in the analysis of cultural heritage data. The LaTeCH (Language Technology for Cultural Heritage) workshops testify to the increased popularity of this area of research (see, for example, Zervanou, 2014). We argue that this collaboration presents a number of benefits for all the research fields involved, as we explain here. On the one hand, historical linguistics can gain more insight into how language changed over time by explicitly placing language data in their historical context. One way to achieve that is by adding information on social features of the texts, and the work done by (historical) sociolinguists is a good model for such efforts.
Metadata on where the language data were composed (or uttered) are certainly an essential piece of knowledge that needs to be recorded; in addition, annotating social features of the authors/speakers and the location allows researchers to investigate how language and such social factors interact, thus adding a further level of depth to the analysis. One example of this approach is the project for creating the British Telecom Correspondence Corpus at Coventry University (Morton, 2014), which annotated business letters written over the years 1853–1982 with TEI-compliant XML. The metadata elements recorded for this corpus include: date, author's name, occupation, gender, and location; recipient's name and location; general topic of the letter, whether the letter was part of a chain or not, format (handwritten, printed, etc.), and company/department, in addition to text-internal annotation marking quotes, letter openings, paragraphs, letter closings and salutations, as well as a pragmatic annotation of the letter's function (application, complaint, query, etc.). On the other hand, from the point of view of historical research, texts and archives are among the sources from which we can come to new interpretations of historical facts or discover new relations between events. Detailed linguistic analyses grounded in language data, particularly texts, can certainly support and enrich this work. For instance, the social history of marginalized groups can be investigated by a corpus-based register and lexical analysis of the language of official documents, as exemplified by the study of prostitution based on judicial records from the seventeenth century described in McEnery and Baker (2014).
The authors analysed nearly one billion words from the seventeenth-century section of the Early English Books Online corpus.8 The texts underwent variant spelling annotation, lemmatization, part-of-speech tagging, and semantic tagging. After processing the corpus data, historians and linguists in the project team carried out the collection of relevant linguistic data in an iterative fashion. These data concerned the change in meaning and discourse features of a set of lexical items recognized as pertinent to the topic through literature review and corpus data inspection. This phase was followed by the corpus work, which investigated semantic and pragmatic change through collocation analysis. The analysis also covered place names associated with the nouns of interest (synonyms of prostitute). This research shed new light on certain aspects of language change, and offered insights into the society and culture of that historical period which would have been more limited without access to large corpora, linguistic knowledge, and historical expertise. As the experience of McEnery and Baker (2014) shows, while more and more historical texts become available, the traditional approach involving close reading of the texts becomes less and less feasible, leaving space for the so-called 'distant reading' approach and coexisting with it. This is where the experience of corpus and computational linguistics can make a substantial contribution to historical research, thanks to the vast set of tools for language processing and examination that these disciplines have developed. Such tools allow researchers to scale up their analyses and give a faithful representation of the language used in the texts, as well as of their content.

8 http://www.textcreationpartnership.org/tcp-eebo/.
In addition, large corpora make it possible to investigate rare language usages that are simply not found in corpora of a size manageable by hand. Some examples of this line of thinking are the research outlined in Toth (2013), the HiCor (History and Corpus Linguistics) research network funded by the Oxford Research Centre in the Humanities,9 the Network for Digital Methods in the Arts and Humanities (NeDiMAH),10 and the Collaborative European Digital Archive Infrastructure (CENDARI).11 When applying corpus methods to historical archives and documents, however, we need to keep in mind an important difference between corpora and archives (and digital resources in general). This difference concerns the well-known issue of 'representativity', which is far from resolved, especially in historical contexts, where the corpus compilers can often only include texts or fragments that have survived historical accidents, and cannot aim at so-called 'balanced' corpora (see discussion in Chapters 2 and 4). Archives, in particular, usually group together records relating to certain events, thus making it difficult to identify individual text types in them. Attention should also be paid to ensuring that documents on less prominent individuals are included as well, so as to best reflect linguistic variation. A number of software tools are now available to support historians' interpretative work by using traditional corpus linguistics tools such as concordances and keyword-in-context, as well as language technology techniques, including morphological tagging, part-of-speech tagging, syntactic parsing, named-entity recognition, semantic relation extraction, temporal and geographical information processing, semantic similarity, and sentiment analysis.
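The keyword-in-context (KWIC) display mentioned above is conceptually very simple; the following toy concordancer shows the core windowing logic. Real corpus tools add tokenization, lemma-based search, and metadata filtering on top of this.

```python
# A toy keyword-in-context (KWIC) concordancer.

def kwic(tokens, keyword, window=3):
    """Return (left, match, right) context triples for each hit,
    with up to `window` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

text = "he commands the Stygian waters to listen to this".split()
for left, match, right in kwic(text, "to"):
    print(f"{left:>30} | {match} | {right}")
```

The aligned output, one line per occurrence, is exactly the concordance view historians and corpus linguists use for close inspection of hits.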
Just to mention a few examples, ALCIDE (Analysis of Language and Content in a Digital Environment) was developed specifically for historians at the Fondazione Bruno Kessler in Trento, and combines data visualization techniques with information extraction tools to make it possible to view and select the relevant information from a document, including a semantic analysis of the content (Menini, 2014). Another example concerns the synergy between geographic systems and language technology, specifically named-entity recognition. Geographic information systems (GIS) help to investigate the role played by different places in social phenomena over time by analysing their mentions (both overt and implicit) and their collocations in historical documents; see, for example, Joulain et al. (2013). In section 5.3.3 we will give an example of a resource created in the context of geographical historical data. From this brief overview it will be clear, we hope, that our position supports synergies and collaborations between historical linguistics and other historical disciplines, which requires historical linguistics to develop a stronger commitment to non-language-related resources. This way, it will be possible to combine multidisciplinary expertise to cover more research ground and achieve goals that could not be reached in the context of the individual disciplines. Such synergies and exchanges do not only affect people; they also have an important implementation in linking the data resources employed in research, as we will see in section 5.3.

9 http://www.torch.ox.ac.uk/hicor/.
10 http://www.nedimah.eu/.
11 http://www.cendari.eu/.

5.3 Linking historical (language) data

In section 5.2 we argued that historical corpora should be more integrated with other linguistic and non-linguistic resources in order to give a fuller account of language change over time.
One way to achieve that is to enrich the corpus annotation with metadata recording the historical context of the texts and the social features of authors, characters, and places (usually at the beginning of the corpus or in a separate file), as well as the pragmatic functions of speech acts (typically with in-line annotation). Traditionally, this has been the standard approach in corpus-based historical sociolinguistics, and it has allowed researchers to study the interplay of linguistic phenomena and external factors by extracting the data directly from the corpora. Along these lines, the compilers of the Penn–Helsinki Parsed Corpus of Middle English, second edition (PPCME2), created a series of files containing a range of metadata about each text of the corpus. For instance, Figure 5.2 shows the page for Chaucer's Parson's Tale. In addition to indicating the details of the manuscript (name, date, edition, and the portion sampled for the corpus), the page records the genre and dialect of the text, along with other information from the original Helsinki corpus from which the PPCME2 was derived, such as the relationship to the original text and its language, and the sex, age, and social rank of the author. Enriching the annotation with such information makes the corpus files much larger. This need not be a problem, especially given the low cost of data storage nowadays. However, there is another, more serious disadvantage in this approach. Maintaining this kind of annotation is time-consuming and not particularly efficient, because it involves creating copies of information already available in other sources. Let us consider the example of a study on the relationship between the determiners a and an over time and the social rank of the author.
The researcher would need to run a search of the corpus and then associate each occurrence of a/an in each text with the social rank of the author and the date of the text as given by the corpus pages exemplified in Figure 5.2. Let us now imagine that a new discovery reveals that the manuscript of the Parson's Tale used by the compilers of PPCME2 was in fact produced ten years earlier than previously thought. In order for the linguistic analysis of a/an to be updated, the corpus compilers would need to be informed and would have to correct the corpus page (both the date of the text and the age of the author); the data for the sociolinguistic study would then need to be re-extracted.

Helsinki Corpus information
File name: CMCTPROS
Text identifier: M3 NI FICT CTMEL
Text name: CT MELIBEE
Author: CHAUCER GEOFFREY
Period: M3
Date of original: 1350–1420
Date of manuscript: 1350–1420
Contemporaneity: X
Dialect: EML
Verse or prose: PROSE
Text type: FICTION
Relationship to foreign original: TRANSL
Foreign original: FRENCH
Relationship to spoken language: WRITTEN
Sex of author: MALE
Age of author: 40–60
Social rank of author: PROF HIGH
Audience description: X
Participant relationship: X
Interaction: X
Setting: X
Prototypical text category: NARR IMAG
Sample: SAMPLE X

Figure 5.2 Page containing information about the text of Chaucer's Parson's Tale from the Penn–Helsinki Parsed Corpus of Middle English, https://www.ling.upenn.edu/hist-corpora/PPCME2-RELEASE-3/info/cmctpars.m3.html (accessed 22 March 2015).

This process is prone to errors and requires a number of people to be aware of the new discovery.
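The contrast with a link-based arrangement can be made concrete: if the metadata live in one place, keyed by a text identifier, and the corpus stores only the key, then the analysis joins the two at query time and a correction propagates automatically. The identifiers, dates, and field names below are illustrative, not actual PPCME2 data.

```python
# Sketch: metadata kept once in a 'knowledge base' keyed by text ID;
# corpus occurrences store only the key. All values are illustrative.

MANUSCRIPTS = {
    "CMCTPARS": {"author_rank": "PROF HIGH", "date": 1400},
}

# corpus hits for the a/an study: (text_id, determiner form)
HITS = [("CMCTPARS", "a"), ("CMCTPARS", "an"), ("CMCTPARS", "a")]

def tabulate(hits):
    """Join each occurrence with the *current* metadata at query time."""
    return [(form, MANUSCRIPTS[tid]["date"], MANUSCRIPTS[tid]["author_rank"])
            for tid, form in hits]

print(tabulate(HITS)[0])   # ('a', 1400, 'PROF HIGH')

# A new discovery redates the manuscript ten years earlier:
# one update, in one place...
MANUSCRIPTS["CMCTPARS"]["date"] = 1390
# ...and re-running the analysis picks it up, with no copies to fix.
print(tabulate(HITS)[0])   # ('a', 1390, 'PROF HIGH')
```

The copy-based approach would instead bake the date into every extracted row, which is precisely what makes it error-prone to update.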
An alternative solution would involve storing the data in one single place (a 'knowledge base') to which the corpus would link: for example, a repository of all manuscripts (or of all Middle English manuscripts). In the scenario imagined above, such a repository would be the only resource requiring a change. As the corpus would link to it, those responsible for the corpus would just need to update the links to the repository in order to get the corrected metadata on which to base a sociolinguistic analysis. Linked data, a growing area of research and development in computing, offers the model for realizing this link, as we will see in section 5.3.1.

5.3.1 Linked data

The term 'Linked Data' refers to a way of representing data so that it can be interlinked. Bizer et al. (2008) define linked data as follows:

Linked Data is about employing the Resource Description Framework (RDF) and the Hypertext Transfer Protocol (HTTP) to publish structured data on the Web and to connect data between different data sources, effectively allowing data in one data source to be linked to data in another data source.

In simple terms, the World Wide Web consists of a large number of pages interlinked via HTML links. These links express a very rudimentary form of relationship between webpages: we know that one page is related to another, but we do not know the nature of this relationship, at least not explicitly from the link itself.12 In contrast, the approach of linked data assumes a 'web of data' in which entities (and not just webpages) are connected through semantic links; these links identify the two entities being linked and express explicitly the type of link between them; moreover, this is done in such a way as to allow the information to be read automatically by computers. In the RDF data model, links are expressed in the form of triples, in which a subject is connected to an object via a predicate that indicates the nature of the relationship between the two.
12 We could consider the context in which the link appears, for example the words surrounding it, and perform a distributional semantics analysis on that. However, what we are concerned with here is the explicit type of relationship between the two entities being linked.
13 http://dbpedia.org/page/Geoffrey_Chaucer.

Triples are an example of structured data (see section 4.3) that can be automatically retrieved by computer algorithms. In order to illustrate RDF, we will take an example from DBPedia, a large resource of linked data derived from Wikipedia, which represents one of the hubs of the emerging web of data. DBPedia organizes a subset of the Wikipedia entries into an ontology of over 4 million entities, covering persons, places, creative works, organizations, species, and diseases, together with the links between them. The DBPedia entry for Geoffrey Chaucer (the subject)13 lists a series of attributes pertaining to this writer (the predicates) and their respective values (the objects). For example, Chaucer is related to the date '1343-01-01' through the predicate 'Birth date', to the place 'Westminster_Abbey' through the predicate 'RestingPlace', and to 'Philippa_Roet' via 'spouse'. The latter two entities ('Westminster_Abbey' and 'Philippa_Roet') also have their own entries, thus creating an interlinked network of information. Using this knowledge base, it is possible to run searches that are not possible on Wikipedia, thus allowing for a much wider discoverability of the content of this resource. For example, one could search for all authors who were born in the fourteenth century and whose spouses died in the fourteenth century.
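To make the triple model concrete, here is a minimal sketch in Python: statements are (subject, predicate, object) tuples, and a pattern with wildcards retrieves matching statements much as a query over a knowledge base would. The predicate names and the spouse's death date are simplified inventions for illustration; only the entity names follow the DBPedia example above.

```python
# A tiny triple store: each statement is a (subject, predicate, object)
# tuple, as in RDF. Predicate names and the death date are invented.
triples = [
    ("Geoffrey_Chaucer", "birthDate", "1343-01-01"),
    ("Geoffrey_Chaucer", "restingPlace", "Westminster_Abbey"),
    ("Geoffrey_Chaucer", "spouse", "Philippa_Roet"),
    ("Philippa_Roet", "deathDate", "1387-01-01"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])]

# Follow a link across entities: find Chaucer's spouse, then her death date.
spouse = match(s="Geoffrey_Chaucer", p="spouse")[0][2]
print(match(s=spouse, p="deathDate"))
```

Chaining lookups in this way is what makes queries like 'authors whose spouses died in the fourteenth century' expressible, whereas plain HTML links carry no such typed information.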
Linked data collections may be open according to the Open Definition,14 which in its concise version states:

Open means anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness).

14 http://opendefinition.org/.

Linked open data are by definition easier to access by a wide audience, which offers new avenues of research for a large number of scientific fields. Linguistics is certainly one such field, and one unquestionable advantage of developing and using linked open data in linguistics is that resources can be combined to improve specific linguistic processing tasks. For example, combining a dictionary with a part-of-speech tagger makes it possible to perform dictionary-based part-of-speech tagging; another example is the integration of dictionaries and corpora, which allows the lexicographer to refer to corpus examples from lexical entries, and therefore to place each example in its corpus context. Linking language resources in this way makes them at the same time integrated and interoperable. This means that the resources are not only provided with links to allow the exchange of information, but that the interpretation of this information is consistent across the linked resources.

5.3.2 An example from the ALPINO Treebank

Let us take the example of a treebank (see section 4.3.2 for an illustration of treebanks). The ALPINO Treebank is a syntactically annotated corpus of Dutch (Van der Beek et al., 2002) with over 150,000 words from the newspaper part of the Eindhoven corpus. Its original format is XML (illustrated in section 4.2), as shown below for the syntactic tree of the phrase In principe althans 'In principle, at least'.

<?xml version="1.0" encoding="ISO-8859-1"?>
<top>
  <node rel="top" cat="du" begin="0" end="3" hd="3">
    <node rel="dp" cat="pp" begin="0" end="2" hd="1">
      <node rel="hd" pos="prep" begin="0" end="1" hd="1" root="in" word="In"/>
      <node rel="obj1" cat="np" pos="noun" begin="1" end="2" hd="2" root="principe" word="principe"/>
    </node>
    <node rel="dp" cat="advp" pos="adv" begin="2" end="3" hd="3" root="althans" word="althans"/>
  </node>
  <sentence>In principe althans .</sentence>
</top>

The nodes of the dependency tree are tagged as <node>, and the attributes cat, rel, and pos stand for categories/phrase types, dependency relations, and part-of-speech tags, respectively. For example, the word in is a preposition (pos="prep"). Moreover, it is the first word of the sentence, so it begins at position 0 (begin="0") and ends at position 1 (end="1"), and the lexical head of its phrase is in position 1 (hd="1"). The node corresponding to in is part of a prepositional phrase, so its parent node (which starts at position 0 and ends at position 2, because it includes principe as well) has cat="pp". Let us imagine that we want to make sure that the inventory of part-of-speech tags is consistent with an external tagset. The Linked Data approach to this is to link the corpus to another resource through RDF. One such resource is the General Ontology for Linguistic Description (Farrar and Langendoen, 2003).15 Here, we will consider the linguistic ontology LexInfo (Cimiano et al., 2011).16 Linking the treebank and LexInfo allows us to connect the treebank with another corpus that uses the LexInfo tagset; moreover, if the tagset is updated, the part-of-speech information in the treebank will not need to be changed. Let us have a closer look at ontologies through the case of LexInfo in the next section.

The LexInfo ontology

In computer science an ontology formally defines the entities of a particular domain, together with their properties and relationships. OWL (Web Ontology Language) is the standard language used to represent ontologies.
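Since the ALPINO annotation discussed above is plain XML, its attributes can be read with standard tooling before any linking takes place. A minimal sketch using Python's standard library (our illustration, not part of the treebank's own distribution):

```python
import xml.etree.ElementTree as ET

# The ALPINO fragment from the text, reproduced as a string.
xml_snippet = """<top>
  <node rel="top" cat="du" begin="0" end="3" hd="3">
    <node rel="dp" cat="pp" begin="0" end="2" hd="1">
      <node rel="hd" pos="prep" begin="0" end="1" hd="1" root="in" word="In"/>
      <node rel="obj1" cat="np" pos="noun" begin="1" end="2" hd="2"
            root="principe" word="principe"/>
    </node>
    <node rel="dp" cat="advp" pos="adv" begin="2" end="3" hd="3"
          root="althans" word="althans"/>
  </node>
  <sentence>In principe althans .</sentence>
</top>"""

root = ET.fromstring(xml_snippet)

# Collect (word, pos) for every terminal node that carries a word form.
tagged = [
    (n.get("word"), n.get("pos"))
    for n in root.iter("node")
    if n.get("word") is not None
]
print(tagged)  # [('In', 'prep'), ('principe', 'noun'), ('althans', 'adv')]
```

It is exactly this kind of extracted tag inventory that one would want to keep consistent with an external tagset such as LexInfo.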
OWL defines classes and subclasses, which classify individuals into groups sharing common characteristics; an ontology in OWL also specifies the types of relationships permitted between these individuals. As far as language resources are concerned, LEMON (LExicon Model for ONtologies, McCrae et al., 2012) is an RDF model specifically designed for lexicons and machine-readable dictionaries. LexInfo is a model for relating linguistic information (such as part of speech or subcategorization frames) to ontology elements (such as concepts, relations, individuals), following the LEMON model. The following example shows the portion of the LexInfo ontology relative to the category of adverbs.17

15 http://www.linguistics-ontology.org/.
16 http://www.lexinfo.net.
17 The line numbers were added by us.

1 <owl:Class rdf:about="http://www.lexinfo.net/ontology/2.0/lexinfo#Adverb">
2   <owl:equivalentClass>
3     <owl:Restriction>
4       <owl:onProperty rdf:resource="http://www.lexinfo.net/ontology/2.0/lexinfo#partOfSpeech"/>
5       <owl:someValuesFrom rdf:resource="http://www.lexinfo.net/ontology/2.0/lexinfo#AdverbPOS"/>
6     </owl:Restriction>
7   </owl:equivalentClass>
8   <rdfs:subClassOf rdf:resource="http://lemon-model.net/lemon#Word"/>
9   <rdfs:isDefinedBy rdf:resource="http://www.lexinfo.net/ontology/2.0/lexinfo"/>
10 </owl:Class>

Let us examine each element of this RDF snippet.
• <owl:Class defines the class 'Adverb';
• <owl:equivalentClass>: lines 2–7 indicate that the two class descriptions have exactly the same set of individuals.
In this case the individuals of the class 'Adverb' are exactly those individuals that are identified by the properties listed in lines 4–6, as explained in the next two points;
• <owl:onProperty and <owl:someValuesFrom refer to the fact that the parts of speech of the individuals being described take values from a list of adverbial parts of speech;
• <rdfs:subClassOf indicates that the class Adverb is a subclass of the larger class 'Word' in the LEMON model;
• <rdfs:isDefinedBy indicates the resource defining the class of adverbs in LexInfo.

The second part of the ontology mentioning adverbs is as follows:18

18 The line numbers were added by us.

1 <owl:Thing rdf:about="http://www.lexinfo.net/ontology/2.0/lexinfo#adverb">
2   <rdf:type rdf:resource="http://www.lexinfo.net/ontology/2.0/lexinfo#AdverbPOS"/>
3   <rdf:type rdf:resource="http://www.lexinfo.net/ontology/2.0/lexinfo#PartOfSpeech"/>
4   <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#NamedIndividual"/>
5   <rdfs:label>adverb</rdfs:label>
6   <rdfs:comment>Part of speech to refer to an heterogeneous group of words whose most frequent function is to specify the mode of action of the verb.
7   </rdfs:comment>
8   <dc:creator>Francopoulo, Gil</dc:creator>
9   <owl:versionInfo>1:0</owl:versionInfo>
10  <dcr:datcat rdf:resource="http://www.isocat.org/datcat/DC-1232"/>
11  <rdfs:isDefinedBy rdf:resource="http://www.lexinfo.net/ontology/2.0/lexinfo"/>
12 </owl:Thing>

Again, let us look at each line:
• line 1 indicates that 'adverb' is an individual of the class Adverb;
• lines 2–4 state that this individual is a member of three classes: 'AdverbPOS', 'PartOfSpeech', and 'NamedIndividual';
• line 5 contains the label for the individuals belonging to the class Adverb;
• lines 6–7 contain a comment explaining adverbs;
• line 8 states the name of the creator of the class: 'dc' refers to the Dublin Core standard19 for describing metadata of web resources;
• line 9 contains the version information;
• <rdfs:isDefinedBy indicates the defining resource, as in the case of the class of adverbs explained above.

19 http://dublincore.org/.
20 http://john.mccr.ae/blog/alpino.
21 The line numbers were added by us.

Linking LexInfo and the ALPINO Treebank

Now that we have seen an example part of speech from the LexInfo ontology, we can appreciate the advantages of linking all the part-of-speech information in the ALPINO Treebank to it, to ensure that the two resources are in sync and that there is no unnecessary duplication of data. John McCrae has transformed the ALPINO Treebank into RDF format and linked it to LexInfo, as described in his blog.20 To ensure that the links were semantic, he created an ontology in the OWL language to describe the categories used in the treebank. The following example describes the part of speech 'adverb' in ALPINO:21

1 <owl:NamedIndividual rdf:about="http://lexinfo.net/corpora/alpino/categories#adv">
2   <rdf:type rdf:resource="http://lexinfo.net/corpora/alpino/
3     categories#PartOfSpeech"/>
4   <rdfs:label xml:lang="en">Adverb</rdfs:label>
5   <dcr:datcat rdf:resource="http://www.isocat.org/datcat/DC-1232"/>
6   <owl:sameAs rdf:resource="&lexinfo;adverb"/>
7 </owl:NamedIndividual>

This portion of the ontology declares the named individual corresponding to adverbs, states that it is a member of the class 'PartOfSpeech', and gives its English language label ('Adverb'). Line 5 gives the details of the ISOcat22 element DC-1232, which corresponds to adverbs. Defining this ontology made it possible to link the treebank to LexInfo. For example, in the linking between ALPINO and LexInfo, the second node of the phrase shown on pages 143–4 (In principe althans), the adverb althans 'at least', is represented as follows:23

22 ISOcat (http://www.isocat.org) is a central registry for all linguistic concepts. It contains so-called 'data categories', which describe these linguistic concepts.
23 The line numbers were added by us.

1 <node>
2   <rdf:Description rdf:about="#top/node/node_2">
3     <cat:rel xmlns:cat="http://lexinfo.net/corpora/alpino/categories#" rdf:resource="http://lexinfo.net/corpora/alpino/categories#dp"/>
4     <cat:cat xmlns:cat="http://lexinfo.net/corpora/alpino/categories#" rdf:resource="http://lexinfo.net/corpora/alpino/categories#advp"/>
5     <cat:pos xmlns:cat="http://lexinfo.net/corpora/alpino/categories#" rdf:resource="http://lexinfo.net/corpora/alpino/categories#adv"/>
6     <begin>2</begin>
7     <end>3</end>
8     <hd>3</hd>
9     <root>althans</root>
10    <word>althans</word>
11  </rdf:Description>
12 </node>

This is an example of RDF code expressed in XML. Line 2 states that what follows is a description of node 2, which is assigned the unique ID #top/node/node_2.
Lines 3–5 are statements about three predicates for node 2 (the subject): cat:rel, cat:cat, and cat:pos.24 For example, line 5 expresses the fact that node 2 has a property (or predicate) with the name cat:pos, whose value is the object identified by the string http://lexinfo.net/corpora/alpino/categories#adv. Lines 6–10 contain the treebank-specific information about node 2, namely its position, its lexical head, root, and word form. What is important to note here is that node 2 of the sentence in question is an adverb, and that it refers to the category 'adv' in the LexInfo ontology.

5.3.3 Linked historical data

We have seen an example of an annotated corpus for a modern language linked with a lexical resource. One of the motivations for developing linguistic linked data is related to the field of NLP. By definition, linked data are machine-readable and can therefore be used directly by computer programs, and this presents a huge potential for improving NLP tools. For example, named-entity recognition software greatly benefits from using knowledge bases like DBPedia, which contain large collections of named entities (Hellmann et al., 2013, 2). Compared to linguistic linked open data, the motivations for historical linked data do not primarily include NLP development. However, there are other strong motivations in favour of adopting the linked data model for historical language data. First, effective searching across a range of resources is made easier (and sometimes made possible at all) by having resources linked together and interoperable. This means that the information available in the linked resources needs to be compatible.
For example, if we have a corpus of sports reports annotated with domain information on the particular sports covered by the texts, we may want to connect it to a repository of sports players and historical events in order to study the lexical development of every sport over time. However, this would be impossible if the two resources (the corpus and the repository) did not share the same domain definitions. Some projects have already explored the option of linking two historical corpora together. For example, the ElEPHãT project has linked the Early English Print portion of the HathiTrust texts and the Early English Books Online Text Creation Partnership (EEBO-TCP), a smaller collection of texts from ProQuest's Early English Books Online database; both sets of texts are dated from 1473 (the date of the first book printed in English) to 1700, but they were designed and built independently, which made it difficult to align them. The aim of the project was to allow scholars to investigate the combined data set and explore new research questions. For example, it is possible to search the combined data set for all works by a given author, as well as run searches by publication place, publication date or period, subject (such as political science), and genre (such as biography).

24 Note that all three predicates have the same prefix cat, which is defined as a namespace http://lexinfo.net/corpora/alpino/categories. Namespaces avoid conflicts between tags with the same name. In this case, we want to guarantee that even if other tags called rel, cat, and pos exist, the specific tags used here are unique.

Linking historical and geographical data

We have seen examples of annotated corpora linked with lexical resources and with other corpora. Of course, corpora can be linked to other types of resources, such as non-linguistic ones.
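The requirement of shared domain definitions mentioned above can be made concrete with a small sketch: a corpus and an external repository can only be joined if they agree on the identifiers used for domains. All names, identifiers, and values below are invented for illustration:

```python
# Hypothetical sketch: a join between a corpus and a repository is only
# possible because both use the same (invented) domain identifiers.
corpus = [
    {"token": "wicket", "year": 1895, "domain": "ex:Cricket"},
    {"token": "scrum", "year": 1905, "domain": "ex:Rugby"},
]
repository = {
    "ex:Cricket": {"label": "cricket", "events": ["first recorded match"]},
    "ex:Rugby": {"label": "rugby", "events": ["first codified rules"]},
}

# The shared identifier acts as the link between the two resources.
joined = [
    (record["token"], repository[record["domain"]]["label"])
    for record in corpus
]
print(joined)  # [('wicket', 'cricket'), ('scrum', 'rugby')]
```

If the repository used different domain identifiers, every lookup would fail, which is precisely why interoperable, shared vocabularies matter.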
In section 5.2 we stressed the importance of combining language-external and encyclopedic resources with corpora in historical linguistics research, and the flexibility of the linked data model makes it well suited to capturing this combination (Chiarcos et al., 2013). When different resources are linked together, we can run a much wider range of searches on them through so-called 'federated search'. Finally, the links between resources can be automatically updated as the resources change, thus streamlining their maintenance. Over the past few years, the field of digital humanities has witnessed a number of projects aimed at building resources in the linked data model, and these projects have the potential to greatly enrich the options available to historical linguists. In the next sections we will see some examples of the linked data model applied to historical language data and we will indicate how historical linguistics can benefit from such examples. We will discuss a couple of projects which have applied the model of linked data to the field of digital classics, and have created valuable resources that will facilitate research on the ancient world.

A resource for historical geography: Pleiades

Pleiades25 is an open-access digital gazetteer for ancient history. The goal of this project is to allow continuous updates to the gazetteer and also to facilitate its use in conjunction with other projects in digital classics by relying on 'open, standards based interfaces' (Elliott and Gillies, 2009). Since Pleiades relies on a crowdsourcing approach, scholars, students, and enthusiasts have the option of contributing to this resource by suggesting geographic names for locations in the ancient world, adding bibliographical information, or writing documentation, while at the same time retaining the intellectual property of their contributions.
In the words of Elliott and Gillies (2009), '[i]n a real sense then Pleiades is also like an encyclopedic reference work, but with the built-in assumption of ongoing revision and iterative publishing of versions'. Each of the tens of thousands of geographical entities that make up the Pleiades gazetteer is given a stable unique identifier (uniform resource identifier, or URI, in linked data terminology), which makes it possible for other resources to link to them unambiguously. For example, the entry for the Adriatic Sea has the unique identifier http://pleiades.stoa.org/places/1004, which also indicates the web page where the entry is available (Downs et al.). Figure 5.3 displays the top part of the page for the Adriatic Sea, showing its identifier, its modern location (also highlighted in the map on the right-hand side of the page), as well as its names with the attested dates when available, and its category. Moreover, the page shows the relationship between the Adriatic Sea and various locations, also marked on the map. At the bottom of the web page (not displayed in Figure 5.3) there is a section called 'References', which links to the occurrences of the name Adriatic Sea (spelled as Hadriaticum or Adriaticum) in three Latin literary texts in the collection of classical Latin texts by the Packard Humanities Institute,26 and to relevant scholarly works. On the right-hand side of Figure 5.3 you can see the link to the entry for the Adriatic Sea in the Ancient World Mapping Center, which is a partner of the Pleiades project.27 One methodologically interesting aspect of this project is the development of a new approach to the representation of geographical entities, specifically designed for historical data.

Figure 5.3 Part of the entry for Adriatic Sea in Pleiades.

25 http://pleiades.stoa.org.
As Elliott and Gillies (2011) explain, conventional GIS models require geometrical objects and are therefore not well suited for sparse and ambiguous historical data, where some locations are unknown or can only be located relative to other locations, and where properties change over time. Pleiades' model involves mapping the relationships between conceptual places/spaces, names, locations, and time periods by resorting to a variety of sources, ranging from ancient texts to modern scholarly works, ancient coins (through their minting locations), and archaeological findings (through their locations). Pleiades is a very valuable resource for the study of antiquity and can be integrated with other resources in numismatics, epigraphy, and papyrology. Furthermore, we believe that such resources would be very important for the study of historical languages. For example, in the context of a study of spatial expressions in the classical languages, we would be able to reuse the work done within the Pleiades project on linking linguistic patterns found in ancient texts to their corresponding geographical locations. This would probably save a considerable amount of time and even allow for investigations at a scale that would not have been imaginable within the scope of an individual historical linguistics project.

26 http://latin.packhum.org/.
27 http://awmc.unc.edu/wordpress/about/.

Linking place references and historical geographical data

The Pelagios project (Enable Linked Ancient Geodata in Open Systems) constitutes a natural evolution from the experience of Pleiades (described in section 5.3.3). The aim of this project is to annotate place references to entries in the Pleiades gazetteer using the format of the Open Annotation RDF vocabulary.
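As a rough sketch of what such an annotation amounts to, each place reference pairs a passage in a text with a gazetteer entry. The target URI and the record layout below are simplified inventions; only the Pleiades URI is real:

```python
# Sketch in the spirit of the Open Annotation model: the annotation's
# body is a gazetteer entry and its target is the passage where the
# toponym occurs. The target URI and field names are invented.
annotations = [
    {
        "body": "http://pleiades.stoa.org/places/1004",  # Adriatic Sea
        "target": "http://example.org/texts/some-text#passage-1",
    },
]

# Because annotations share gazetteer URIs, independently produced data
# sets can be aggregated: e.g. every passage referring to the Adriatic Sea.
adriatic = [a["target"] for a in annotations
            if a["body"] == "http://pleiades.stoa.org/places/1004"]
print(len(adriatic))  # 1
```

The stable URI is doing the work here: any project that annotates against the same gazetteer contributes to the same web of place references.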
Pelagios covers not just the ancient Greco-Roman world, but also early Byzantine, Christian, maritime, Islamic, and Chinese cultures through their geospatial documents. The project has built a resource for exploring geographical locations up to 1492. This was achieved by referring to standardized lists of historical places such as the Pleiades gazetteer and the China Historical GIS. Then, toponyms occurring in semi-automatically transcribed texts and images were identified automatically and mapped to the gazetteers. This work was done by combining scholars' knowledge with crowdsourcing.

Linking people's references and people's names

One common issue with historical material is to determine whether two documents refer to the same person. The same issue concerns contemporary material, where it is even harder because of ethical and privacy questions. Being able to link names of people to their mentions in texts is very valuable for historical linguistics investigations, as it allows us to discover new connections and provides another way to incorporate contextual information into the linguistic analysis, as we have stressed throughout this section. The SNAP (Standards for Networking Ancient Prosopographies) project28 provides this type of linked resource for the ancient world. SNAP investigated linking various data collections concerning the lives of groups of persons (prosopographies), persons' names (onomastica), and person-like entities, using resources available for the Greco-Roman ancient world, such as the Lexicon of Greek Personal Names, containing names of persons mentioned in ancient Greek texts, Trismegistos, a database of names and persons from Egyptian papyri, and Prosopographia Imperii Romani, containing names of elite persons from the first three centuries of the Roman Empire.
28 https://snapdrgn.net.

5.4 Future directions

We have seen how language resources are critical elements in historical linguistics research, sometimes as input and other times as output of the research process. In this chapter we have proposed a number of improvements to how such resources are designed, built, maintained, and managed. These improvements concern making the resources available and usable by a large number of people, employing standards in the formats and representation models of the data, and linking the resources together. More work is needed to create and promote standards, and we believe that historical linguists can benefit from the experience of scholars working in other historical disciplines, as well as of computational linguists, who have made substantial progress in this direction. In particular, going beyond the silos of language resources and corpora will open up a whole new range of opportunities. Another aspect that we think should take priority in the future is related to open data: being an integral part of the research work, data and resources deserve more attention than they have received in the past. Now that the cost of data storage has decreased significantly and computing power has reached relatively high levels, research replication is more than a theoretical option, but it is achievable only if the data and the processes are well documented and easily accessible (section 2.3.1). Adopting an open data attitude does not only concern the publication of large data sets: the data generated within the scope of a single study should also be released, at both the micro and the macro level. In section 2.3.3 we mentioned data repositories and data journals. In spite of these nascent and promising initiatives, article and book publications still have a higher status than data publications.
In order to provide incentives for scholars to create language resources, data publications should acquire a higher position among scholars' career options, and traditional publications should include persistent links to the data set(s). In this respect, publishers can help to ensure that data supporting publications are available to the scholarly community by enforcing data policies and requiring statements on data availability in all their publications. Finally, more work is needed to build software tools that make it possible (and ideally easy) to build and link historical language resources in an effective way. Kenter et al. (2012) describe an editing tool designed for historical texts which aligns the processes of corpus annotation and the creation of corpus-based lexicons, including a single platform where the annotation can be revised, as well as a standardized annotation format. Tools such as this are going to facilitate the change that we have described in this chapter and that we believe would make the field of historical linguistics evolve further.

The role of numbers in historical linguistics

6.1 The benefits of quantitative historical linguistics

So far we have mostly asserted that historical linguistics would benefit from an increased use of quantitative corpus methods. The present chapter addresses more directly how quantitative evidence and corpora are relevant to historical linguistic research. The chapter also provides a rebuttal of some counterarguments to the use of corpora and numbers. The chapter concludes with a case study that illustrates the use of quantitative corpus methods in historical linguistics, by showing how such methods can help to evaluate competing claims against each other, and thus help realize the aim of principle 1 regarding consensus (section 2.2.1).
6.1.1 Reaching across to the majority

In section 1.4 we described the large gap, or chasm, that separates the early adopters of new technologies from the majority. The majority of adopters of any technology are likely to be motivated by pragmatic considerations such as ease of use and concrete benefits, not by the technology itself. Regarding ease of use, it has probably never been easier to start using quantitative models. The statistical tools needed to carry out advanced quantitative studies are easily obtainable, in many cases entirely free of charge. The statistical software package R, considered the default statistical tool in many academic fields, is available for free for all major computer platforms. The R platform is well suited for quantitative research in any branch of linguistics, as attested both by the wide variety of published quantitative studies and by the variety of textbooks aimed at linguists using R, such as Baayen (2008), Johnson (2008), and Gries (2009a, 2009b). As a supplement or alternative to R, the adoption of general programming languages such as Perl or Python for quantitative corpus studies is made easier by textbooks such as Weisser (2010) and Bird et al. (2009). Moreover, the skills and knowledge required to use such quantitative tools, and to interpret their outcomes, are being disseminated more vigorously than ever. In addition to on-campus courses, there are at the time of writing several options for studying quantitative data analysis for free through open-access MOOCs (massive open online courses). In summary, the technology is readily available, as are the instructional materials.
Although it is not effortless to master, the technology itself is easy to use once the initial lack of familiarity has been overcome. Our aim is that the present book will fill in some of the gaps between the technology and the research questions in historical linguistics, thus easing the crossing of the chasm. After this consideration of the ease with which the technology can be adopted, we will focus on the benefits of doing so.

6.1.2 The benefits of corpora

The benefits of using quantitative and corpus methods are of course connected, but we will discuss separate aspects of them here. The main benefits arising from using corpora, in the sense defined in section 2.1.3, are data transparency, data quality, efficiency, information about frequency, and information about context. The data transparency that arises from using shared corpora is a way to establish the empirical basis of the consensus described in principle 1 (see section 2.2.1). A corpus available to other researchers allows detailed replication, as well as criticism, of every step in the data retrieval process, and hence provides a much stronger basis for argumentation. Benefits of efficiency and quality are closely linked to the development and dissemination of corpora. Any historical linguistic research that attempts to answer how frequently a linguistic phenomenon occurred in the surviving material can do one of the following (in descending order of preference): (i) use an existing corpus; (ii) build a new corpus; (iii) use an ad hoc collection of texts or citations. If there is an existing corpus, and if that corpus can be deemed reasonably representative of the language variety for the purposes of the study, then it is clearly beneficial to make use of it (i). Reuse of an existing corpus saves time and effort, and the existing corpus is also agnostic about the aims of whatever study it is being used for, as long as it was designed as a general resource.
As Gries and Newman (2014) point out, there is a considerable risk of bias when the investigators of a study also collect all the data directly from some source. This bias, together with the extra effort, increases the risk of errors due to a potential lack of quality assurance, as well as issues with representativity and size, and speaks against (iii) as an option. However, we want to make clear that, when approached correctly, an ad hoc text collection is better than nothing. Given the necessary resources, (ii) should be the preferred option if no satisfactory corpus exists. Thus, aspects like size, representativity, quality assurance, and agnosticism towards any one particular research question (which also ensures greater comparability of studies) ensure that a corpus has an edge over an ad hoc collection of citations or other texts. Quantitative evidence, as stated in principle 10 (see section 2.2.10), should primarily be drawn from corpora. By definition, for any question about how often a linguistic phenomenon occurred in the past, corpora will provide the best evidence. However, the choice of corpus matters. The larger the corpus, the more precise the results will be, other things like representativity and annotation being equal. However, with increased size come additional complexity and an increased workload, with a concomitant increased risk of errors (whether manual or computational). This consideration also gives corpora an edge over ad hoc text collections, since corpora often benefit from a longer development period with more people involved (e.g. in the form of tests of inter-annotator agreement). In addition to the benefits arising from frequency information, corpora provide information about context. This context might be a matter of frequency, e.g. how often x occurs with y.
Such a benefit goes beyond counting what we already know, because it is also a means to identify which contexts x occurs in. Linguistic units tend to follow a Zipf-like distribution with a long tail of infrequent occurrences (Baayen, 2008; Köhler, 2012), which makes corpora ideal for discovering new, possibly infrequent cases. Moreover, corpora also provide a principled means for connecting metadata about texts and speakers to linguistic data. Thus, corpora provide both linguistic and extra-linguistic context (see Chapter 5).
.. The benefits of quantitative methods
In the previous section we argued that corpora are the best source of numerical information about historical language use. However, once the frequencies have been obtained, there are also some specific benefits to be gained from using statistical methods. By statistical method we mean a formalized, mathematical procedure (not necessarily a null-hypothesis test) that allows us to draw inferences about the data in some principled manner. We normally take this to exclude the assessment of raw numbers and raw relative frequencies for the purposes of drawing conclusions. These benefits are partly pragmatic. For instance, when dealing with large samples, or highly complex data where several variables interact, statistical methods can provide important insights about the data that would otherwise not be available. With a large set of data, we might have so much variation that it must be treated statistically, since the number of variable values, and combinations of them, would otherwise become unmanageable. Furthermore, statistical methods can help to rank the importance of a large number of variables, thus helping the researcher to separate the wheat from the chaff. However, there are also more principled benefits. Principle 9 (section 2.2.9) defined trends as probabilistically modelled. Therefore, probabilistic, i.e.
statistical, methods are required to identify their characteristics and to tease apart the variables involved. Furthermore, principle 8 (section 2.2.8) prescribed explicit quantities, arguing that these lead to a more transparent and hence stronger argument. To this argument we can also add the reproducibility argument, as well as the importance of using quantitative evidence to avoid bias and confounding factors, as pointed out by Kroeber and Chrétien (1937).
.. Numbers and the aims of historical linguistics
The benefits of quantitative corpus methods described in sections 6.1.2 and 6.1.3 are of course not beneficial in themselves, regardless of context; they are only relevant relative to one or more aims. Harrison (2003, 214) lists the following aims of historical linguistics:
(i) Identifying genealogical relatedness between languages.
(ii) Exploring the history of individual languages.
(iii) Developing a theory of linguistic change.
The first aim has been the main purview of the comparative method in historical linguistics. Given the success of the comparative method in dealing with aim (i), we think the main benefits of quantitative historical linguistics are found among the other two aims, but ultimately this is an empirical question. Aim (ii) is perhaps the one with the strongest history of using corpora, at least for those languages for which corpora exist. It offers rich opportunities for quantitative corpus methods, since it will inevitably involve finding patterns among highly variable data. Finally, aim (iii) builds on (ii) and can also benefit from such methods, as we will show below. A theory of language change, i.e. a series of laws and the predictions that follow from them in the sense of Köhler (2012), must take variation into account.
In fact, taking variation seriously by means of quantitative models addresses the key problem with the nineteenth-century neogrammarian sound laws, namely the assumption that they were exceptionless (Campbell, 2013, 15). A probabilistic reinterpretation of such laws can accommodate far more variation. This is the core of the claim in Kroch (1989) that syntactic change proceeds at a constant rate across contexts, since the quantitative model he uses is robust enough to cope with the variation in the data. However, Kroch's (1989) study is also interesting because it illustrates another point we have made, namely that shared data and quantitative methods enable openness and precise communication, as well as criticism. Vulanović and Baayen (2007) use the same data as Kroch (1989) and show that the model proposed in Kroch (1989) does not fit the data well. Based on a series of models that fit the data better, Vulanović and Baayen (2007) argue that the rate of change varies depending on the syntactic environment. The variation captured by such models of change need not be strictly linguistic. Kretzschmar and Tamasi (2003) show how extralinguistic correlates of linguistic change can be taken into account. Their argument is positioned against what they see as a "Labovian" tradition that conceptualizes change within a closed linguistic system. Similarly, Blythe and Croft (2012) use computer simulations, or agent-based modelling, to argue that the most plausible scenario for dialect change in New Zealand English involves social factors.
Such simulation models, another instance of model parallelization (see section 1.1), provide a means of testing hypotheses regarding the linguistic mechanisms at work, since a number of different mechanisms, such as lexical diffusion or catastrophic syntactic change, might be reconcilable with an observed trend in the data (Denison, 2002). We have argued that the aims of historical linguistics are not only compatible with quantitative approaches, but direct beneficiaries of such approaches. A natural follow-up question is what role the various linguistic theories have to play. As stated earlier, our position regarding linguistic theories is agnostic. Ours is a framework for conducting corpus-based quantitative investigations, not a linguistic theory. Specifically, this framework does not rest upon an explicitly probabilistic theory of language, such as the one described in the chapters of Bod et al. (2003).
. Tackling complexity with multivariate techniques
In this section we argue that multivariate statistical techniques are in most cases the ideal way to deal with the complexity of linguistic phenomena, and we introduce such techniques. We have seen that linguistic phenomena (like phenomena in many other disciplines) are often correlated with a whole range of variables (see principle 11 in section 2.2.11). In historical linguistics, time is often an important factor, but other factors include text-related features like genre, register, or author, as well as specifically linguistic features, ranging from morphological to lexical, syntactic, semantic, and contextual features. Multivariate analysis is concerned with precisely this type of scenario. It can account for the effect of multiple variables on a phenomenon of interest, thus shedding light on the possible ways in which those variables are related to each other in a systematic fashion.
As an example, let us consider the argument structure of Latin prefixed verbs like ad-eo ‘go to, reach’, where the prefix (also known as preverb) ad ‘to’ is added to the verb eo ‘go’. Latin preverbs have been associated with adpositions because of their common origin in Indo-European adverbial particles, which were relatively free to occur in various positions in the sentence (Meillet and Vendryes, 1963, 199–200, 573–8). If we focus on verbs prefixed with spatial preverbs and on the realizations of their spatial arguments, we observe four main ways in which these arguments can be realized:
1. (CasePrev) as an NP, whose case is that governed by the preposition corresponding to the preverb; see the following example, where the preverb e- relates to the preposition e/ex ‘from’, which governs the ablative case, and the prefixed verb egressi occurs with the ablative castris:
(1) castris e-gressi
camp:abl.n.pl from-go:ptcp.nom.m.pl
‘having marched out of the camp’ (Caes., B. G., II, 11, 1)
2. (CaseNonPrev) as an NP, whose case is not the one governed by the preposition corresponding to the preverb; see the following example, where the preverb a- is related to the preposition a/ab ‘from’, which governs the ablative case, but the prefixed verb avertitur occurs with the accusative fontes:
(2) fontes-que a-vertitur
spring:acc.m.pl-and from-turn:ind.prs.3sg.pass
‘he turns away from the springs’ (Verg., Georg., 499)
3. (PrepPrev) as a PP, whose preposition corresponds to the preverb; see the following example, where the preposition e ‘from’ introduces a prepositional phrase expressing the spatial complement of the verb egressi, formed with the preverb e-:
(3) e castris Helvetiorum e-gressi (Caes., B. G., I, 27)
from camp:abl.n.pl Helvetian:gen.m.pl from-go:ptcp.nom.m.pl
‘having marched out of the Helvetians’ camp’
4.
(PrepNonPrev) as a PP, whose preposition does not correspond to the preverb:
(4) ab-i-n e conspectu meo? (Plaut., Amph., 518)
from-go:ind.fut.2sg-part. from sight:abl.m.sg my:abl.m.sg
‘will you be away from my sight?’
Some studies in Romance linguistics have argued that Latin preverbs underwent lexicalization and a gradual loss of semantic transparency (Tekavčić, 1972, §948.3–1345; Salvi and Vanelli, 1992, 206; Crocco Galèas and Iacobini, 1992, 172; Vicario, 1997, 129; Haverling, 2000, 458–60; Dufresne et al., 2003, 40). This lexicalization has been connected with the gradual loss of the case system in Latin and the trend towards more analytic constructions formed with prepositions (analogous to PrepPrev and PrepNonPrev) in the Romance languages (Iacobini and Masini, 2007). This phenomenon has been investigated in various qualitative studies based on sets of examples, rather than corpora. As an illustrative example we will consider Bennett (1914). In the preface to his second volume on the syntax of early Latin, Bennett (1914) presents his methodological approach (Bennett, 1914, iii–iv):
My task in the preparation of this second volume has been much more difficult than I had anticipated. Barring a few of the more recent monographs, I soon found that the treatises on which I had hoped largely to depend, were extremely defective, not only lacking a large proportion of the important material, but being based, in great measure, on conjectural readings of the past generation. Not infrequently false interpretations added to the confusion. Under these circumstances, it became necessary to make my own special collections to supplement the obvious lacunae encountered at almost every turn. The expenditure of time and labor thus caused have unquestionably been greater than if I had made independent collections of the entire material from the beginning.
Nevertheless I believe that substantial completeness has been achieved in the material here presented. Wherever possible, I have given the exact number of instances of the occurrence of a usage. When a usage is found ten or more times, I have marked it "frequent".
This approach makes it impossible to derive falsifiable hypotheses from the author's claims, since they lack a quantitative account of the phenomenon; see for example Bennett (1914, 131–2):
Of the foregoing prepositional compounds governing the dative, those with ante, inter, ob, prae, sub, and super are used with the dative almost exclusively. They rarely take the accusative or prepositional phrases as alternative constructions. Of the other compounds, those with com- show greater hospitality toward the admission of alternative constructions, especially prepositional compounds; while those with ad and in exhibit the greatest tendency in this direction. A general tendency is exhibited in all the compounds to employ the dative rather in figurative relations than in literal ones, though examples of the latter are not especially rare. Literal relations are expressed more usually by the accusative or prepositional phrases; yet we frequently find figurative relations also expressed by these same means.
The word "frequent" can be applied to a range of cases, and its meaning depends on its context of use and on the other terms of comparison, making it inadequate by today's standards of quantitative research, which can rely on large amounts of data, processing power, and computational approaches that were not available in Bennett's time. In contrast with this methodology, McGillivray (2013, 127–210) (and previously Meini and McGillivray 2010 and McGillivray 2012) employs a quantitative corpus-based approach to investigate this topic, which relies on statistical and computational tools available today. Here we will use this study as an illustration of the approach we propose.
The data frame format
The study reported in McGillivray (2013, 127–210) relies on two corpus-driven valency lexicons for Latin verbs (see section 5.1.1), which were derived automatically from two Latin treebanks (see section 4.3.1). This shows an example of the reuse of previously built language resources, as well as the benefits of corpus annotation. In addition, the corpus data were systematically collected and analysed using a multivariate approach, as we will see now. The main object of investigation of this study is the type of construction observed for the realization of spatial arguments of prefixed verbs, and specifically the four options listed on pages 157–8. To start from a simple illustrative example, let us imagine that we have collected data on prefixed verbs in our corpus, and that we have decided to represent them as a simple four-by-two table: each observation corresponds to the century of the texts where a prefixed verb is found, and for each century we have recorded the proportion of occurrences of spatial arguments expressed as prepositional phrases (constructions PrepPrev and PrepNonPrev, according to the terminology on pages 157–8), out of all spatial arguments.

Table 6.1 Example of a data set recording the century of the texts in which prefixed verbs were observed, and the proportion of their spatial arguments expressed as a PP out of all their spatial arguments

Century        Proportion of PP
2nd cent. bc   0.1
1st cent. bc   0.2
3rd cent. ad   0.8
4th cent. ad   0.9

For example, in the second century bc 10 per cent of all spatial arguments of prefixed verbs are prepositional phrases, as shown in the first row of Table 6.1. We can visualize the data in Table 6.1 geometrically using a Cartesian space, as shown in Figure 6.1. In this figure we can see the four points corresponding to the four observations recorded in Table 6.1. The horizontal axis (i.e.
x axis) corresponds to the time variable showing the century, with negative values for bc dates and positive ones for ad dates. So, the further to the right a point lies, the later its century. The vertical axis (y axis) corresponds to the proportion of prepositional constructions, ranging from 0 (all constructions are bare-case constructions) to 1 (all constructions are prepositional constructions). For example, the first row of Table 6.1 corresponds to the point (–2, 0.1). This two-dimensional representation allows us to express the interaction between the time dimension (along the x axis) and the syntactic dimension (along the y axis). This makes it possible to look for patterns in the set of points corresponding to the observations. One way to achieve that is by using linear regression models.
Linear regression models
While analysing Figure 6.1, we saw that every point in a two-dimensional space is associated with a pair of coordinates (x, y), one along the horizontal axis (abscissa) and one along the vertical axis (ordinate); we have thus represented the four rows of Table 6.1 as four points in a two-dimensional space. Now, we may ask if we can detect any regularity in the way the ordinates change as the abscissas change; one way to do that is to find a straight line that is as close as possible to all four points of Figure 6.1.

[Figure 6.1 Geometric representation of Table 6.1 in a two-dimensional Cartesian space.]

We observe that the line in Figure 6.2 is a good approximation of the four points. Compared with dealing with a set of points such as those in Figure 6.1, dealing with a straight line has some advantages.
For example, all the points of a line share the property that their ordinates y can be obtained from their abscissas x by applying the formula:

y = a + b ∗ x

a and b are unique to each line; a (the intercept) is the ordinate of the intersection point between the line and the y axis, and b (the slope) measures the steepness of the line (in subsequent case studies we use the term coefficient to refer to the slope b). In our case, the line is defined by the equation:

y = 0.36 + 0.14 ∗ x

For example, the abscissa of the point corresponding to the second row of Table 6.1 is –1 and its ordinate is 0.2. The corresponding point on the line is (–1, 0.14 ∗ (–1) + 0.36) = (–1, 0.22). In other words, the linear representation allows us to quantify the magnitude of change along the y axis in terms of changes per unit along the x axis.

[Figure 6.2 Line that best fits the four points in Figure 6.1.]

Another advantage is that we can measure how well the line fits the points by taking into consideration the sum of the distances between each point and the line itself. A line that describes the points well will be closer to each point than a line that is a poor fit to the data (this measure of fit is sometimes referred to as R2, or the coefficient of determination). When we approximate the points of our data set with a line, also called a regression line, we are fitting a linear regression model, specifically a two-dimensional linear regression model, to the data. In general, linear regression analysis constitutes a series of multivariate techniques based on the idea that it is often beneficial to approximate a set of points in a multidimensional space with linear, and hence simpler, models.
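The intercept and slope quoted above can be recovered by ordinary least squares from the four data points of Table 6.1. A minimal sketch in Python with NumPy (the numbers are those of Table 6.1):

```python
import numpy as np

# The four observations from Table 6.1: century (negative = BC) and
# proportion of spatial arguments realized as PPs
x = np.array([-2.0, -1.0, 3.0, 4.0])
y = np.array([0.1, 0.2, 0.8, 0.9])

# Ordinary least-squares fit of y = a + b * x;
# np.polyfit returns coefficients from the highest degree down
b, a = np.polyfit(x, y, 1)
print(f"y = {a:.2f} + {b:.2f} * x")  # y = 0.36 + 0.14 * x

# Fitted value for the second row of Table 6.1 (1st cent. BC, x = -1)
print(round(a + b * (-1), 2))  # 0.22
```

Running the fit confirms that the table values are consistent with the line y = 0.36 + 0.14 ∗ x reported in the text.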
Using common statistical terminology, we will call the variable describing the phenomenon we want to model (in this case the proportion of PP constructions by century) the response; the response is potentially affected by a range of other variables, which we will call predictors; in our case the only predictor is the century.
Generalization to higher dimensions
The two-dimensional representation of the data in Table 6.1 and Figure 6.1 is a very simple case. Counting the number of times each construction occurs by century would not fully describe other factors that may affect the presence of such constructions. Accounting for this complexity in a post hoc way, by interpreting new variables in light of the results without testing these new variables properly, is detrimental to historical linguistics research, because it does not quantify and test the role played by those variables. Instead, we argue that it is methodologically more appropriate to collect the values of all these variables upfront, in the data collection phase of the research, and to represent them in an appropriately multidimensional data format which allows for further analyses (see discussion in section 1.3.3). This is achieved by extending the reasoning done above for two-dimensional spaces to the case of spaces in higher dimensions. For illustration purposes, Table 6.2 shows a subset of the table (technically called a data frame) used to study the relation between the different variables in the study on Latin preverbs described in McGillivray (2013, 127–210). Each row in this table represents an observed instance in the source corpus (i.e. an occurrence of a prefixed verb with one or more spatial arguments) and each column represents the value of each variable for each observation.
The first column contains the ID of the prefixed verb in the original corpora, the second column shows 'prep', a binary variable indicating whether (1) or not (0) the prefixed verb occurred with a PP spatial argument in that specific instance, the third column contains the era of the text, the fourth the prefixed verb's lemma, the fifth the frequency of that verb in the corpora, the sixth the mood of the verbal form, the seventh its voice as a binary variable (1 stands for active or deponent, and 0 for passive), and the eighth the lemma of the preposition found in the argument structure of the verb, if present.

Table 6.2 Subset of the data frame used for the study on Latin preverbs in McGillivray (2013, 127–210)

id     prep  era        verb      freq_verb  mood  voice  prep_type
24290  0     Classical  abeo      28         inf   1      NA
32817  1     Classical  abigo     3          sbjv  1      ab
32289  1     Late       abscedo   6          inf   1      ab
11028  1     Late       abstraho  11         ind   0      ab
12831  1     Late       abstraho  11         ind   1      ab
17440  1     Late       abstraho  11         inf   1      ab
17526  1     Late       abstraho  11         part  0      ab

The data frame format represents very clearly the multidimensional nature of the data. It allows us to record a range of measurements for every corpus instance: the lemma of the verb, the form of the preverb, the case of the verbal argument, and so on. As a generalization of the two-dimensional case illustrated earlier in this section, we can imagine that the seven variables in Table 6.2 correspond to as many dimensions, describing different features of each observation. Once we have the data in a data frame format, if we want to identify the relationship between the response and the predictors, we can resort to a range of multivariate statistical techniques, such as the regression models introduced above and described more fully in McGillivray (2013, 162–6).
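For readers who work in Python, the data frame in Table 6.2 can be sketched with pandas as follows (the values are copied from the table; the aggregation at the end is our own illustrative addition, not part of the original study):

```python
import pandas as pd

# Re-creation of the Table 6.2 subset (values as printed in the table)
df = pd.DataFrame({
    "id": [24290, 32817, 32289, 11028, 12831, 17440, 17526],
    "prep": [0, 1, 1, 1, 1, 1, 1],     # 1 = PP spatial argument observed
    "era": ["Classical", "Classical", "Late", "Late", "Late", "Late", "Late"],
    "verb": ["abeo", "abigo", "abscedo", "abstraho",
             "abstraho", "abstraho", "abstraho"],
    "freq_verb": [28, 3, 6, 11, 11, 11, 11],
    "mood": ["inf", "sbjv", "inf", "ind", "ind", "inf", "part"],
    "voice": [1, 1, 1, 0, 1, 1, 0],    # 1 = active/deponent, 0 = passive
    "prep_type": [None, "ab", "ab", "ab", "ab", "ab", "ab"],
})

# Each row is one corpus observation, each column one variable;
# the tabular format makes grouped summaries immediate:
print(df.groupby("era")["prep"].mean())  # proportion of PP arguments by era
```

On this seven-row subset the grouped means are 0.5 for the Classical rows and 1.0 for the Late rows; with the full data frame the same one-liner yields the era-by-era proportions that feed the regression models discussed next.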
Other models
The regression model that best fits the data in Table 6.2 is a mixed-effects logistic regression model, a special class of generalized linear model. Generalized linear models generalize the logic behind linear regression models to response data that are not normally distributed, usually via some transformation of the response. Mixed-effects models involve a set of predictors (or fixed effects) and a set of so-called random effects. The random effects are responsible for the group-level variation in the model and are particularly useful with diachronic linguistic data, which tend to have an uneven composition with respect to the set of authors of the texts. In the case at hand, the data set has an uneven composition of authors, with some authors being more heavily represented than others, and setting author as a random effect accounts for this fact. For a more thorough explanation of mixed-effects models, see McGillivray (2013, 177; 189–90), as well as Baayen (2008), Tagliamonte and Baayen (2012), and Baayen (2014). Further examples and a discussion of generalized linear models and generalized mixed-effects models are given in sections 6.3 and 7.3. Logistic regression models are used when the response variable is binary, as in the case of the prepositional vs bare-case construction for Latin prefixed verbs. Logistic models estimate the probability (from 0 to 1) of switching from one of the two outcomes to the other, given the values of the predictors. This probability is not estimated directly, since values bounded by 0 and 1 cannot be easily handled via a straight regression line.
Instead, the so-called logit function is used to transform the probabilities onto a scale that ranges from −∞ to ∞; when applied to a probability p between 0 and 1, this function returns the logarithm of the odds1 of p:

logit(p) = log(p / (1 − p))

In the case of the preverb study, the best model for predicting the type of construction (response) is based on the following predictors: the lemma of the preverb, the era of the text, and the case of the verb's argument; its random effects are the semantic class of the verb (motion, rest, or transport) and the author of the text. The model can be expressed as follows:
(5) Response: probability of switching from a bare-case construction to a prepositional construction, modelled as depending on
fixed effects: preverb + era + case
random effect: genre, with a random slope for class
1 We can think of odds as the ratio of an event to its corresponding non-events over a sufficiently long time. For example, if we roll a fair die, the odds of getting are to .
A multivariate exploratory technique: CA
Multivariate techniques have applications beyond the hypothesis-testing setting outlined above. For the purposes of data exploration, we can use other types of multivariate techniques to find associations between different variables in our data. The class of multivariate techniques includes so-called dimensionality reduction models, which all aim at reducing the variation in the data in a systematic manner that lends itself to interpretation. Linguistic applications of such techniques are discussed in Baayen (2008, 118–37) and Jenset and McGillivray (2012). One such multivariate technique that is highly useful for corpus data is correspondence analysis (CA), together with its generalization to more than two variables, called MCA or multiple correspondence analysis (Greenacre, 2007).
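The logit transformation and its inverse (the logistic function) can be sketched in a few lines of Python; the function names are our own:

```python
import math

def logit(p):
    """Map a probability in (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse logit (logistic function): map any real number back to (0, 1)."""
    return 1 / (1 + math.exp(-z))

print(logit(0.5))  # 0.0: a probability of one half means even odds
print(inv_logit(logit(0.9)))  # recovers 0.9 (up to floating-point error)
```

A logistic regression model fits a straight line on the logit scale; inv_logit then converts the fitted values back into probabilities between 0 and 1.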
Exploratory techniques follow the principle formulated by Benzécri (1973): 'the model should follow the data. The data should not follow the model.' CA aims at finding the essential structure of the data by reducing the original multidimensional space to a lower-dimensional space (typically consisting of two or three dimensions) that is easier to interpret. CA is discussed more extensively in McGillivray (2013, 168–9); here we will give an example of its use to capture the multidimensional nature of the data set and research questions relative to the Latin preverbs study. We will focus on the following variables, as they potentially interact with the syntactic constructions that are the object of the study:
• 'construction', with values from 1 to 4, corresponding to 'CasePrev', 'CaseNonPrev', 'PrepPrev', and 'PrepNoPrev';
• 'era', a broad chronological classification of the authors of the data set: 'early' (Plautus), 'classical' (Caesar, Cicero, Ovid, Petronius, Propertius, Sallust, Vergil), and 'late' (Jerome and Thomas);
• 'case', the case required by the preposition corresponding to the preverb (ablative or accusative);
• 'sp', a representation of the lexical–semantic properties of the verbal arguments based on the arguments' lexical fillers;
• 'class', the semantic class of the verb, with values 'motion', 'rest', and 'transport'.
Figure 6.3 shows the result of the analysis. It is a two-dimensional representation of the original data set capturing the essential structure of the data. We can assess how accurate such approximations are by considering how much of the variability in the data is expressed by the analysis (the so-called percentage of explained inertia). The representation in Figure 6.3 accounts for 53.0 per cent of the variability in the data and highlights associations between the variables.
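The computation underlying CA can be sketched compactly. The following Python example performs a classical two-variable correspondence analysis, via the singular value decomposition of standardized residuals, on an invented era-by-construction contingency table (the counts are purely illustrative and not the data from the Latin preverb study):

```python
import numpy as np

# Invented era-by-construction contingency table (illustrative counts only)
N = np.array([[40.0, 10.0, 5.0],    # early
              [25.0, 20.0, 15.0],   # classical
              [10.0, 15.0, 35.0]])  # late

P = N / N.sum()                          # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)      # row and column masses
# Standardized residuals: departure from the independence model outer(r, c)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

# Share of total inertia (variability) captured by each axis
explained = sing**2 / (sing**2).sum()
print(np.round(explained, 3))

# Principal coordinates of the rows (eras) on the first two axes
row_coords = (U * sing) / np.sqrt(r)[:, None]
print(np.round(row_coords[:, :2], 2))
```

In practice one would use a dedicated implementation (e.g. the ca or FactoMineR packages in R), but the sketch shows where the 'percentage of explained inertia' comes from: each axis's squared singular value as a share of the total.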
Thanks to the bidimensionality of the plot, we can detect complex relations where more than two variables interact: for example, constructions 4 (PrepNoPrev) and 3 (PrepPrev) are associated with the late-era authors (Jerome and Thomas), while construction 1 (CasePrev) tends to interact with the classical authors and motion verbs.

[Figure 6.3 Plot from MCA on the variables 'construction', 'era', 'preverb', 'sp', and 'class'. The first axis accounts for 34.6 per cent of the explained inertia, the second axis for 18.5 per cent.]

This is a simple example of the power of multivariate statistical techniques in capturing the multidimensional nature of the data in a systematic and quantitative way, which is a good basis for the subsequent theoretical interpretation.
. The rise of existential there in Middle English
As a case study of how quantitative corpus methods can shed new light on historical linguistic questions, we will discuss a syntactic change that took place in Middle English. Middle English was the language of England during a period roughly extending from 1100 to 1500, or from just after the Norman Conquest to just after the introduction of the printing press in England. The grammatical change in question is the evolution of existential (sometimes called "expletive" or "dummy") there in English. The contemporary examples from Davies (2008) in Examples (6) and (7) illustrate the difference between existential there and the locative adverbial use of the morpheme:
(6) There is a house to the north. (existential)
(7) So it's been accumulating there, now, for 60-some years.
(locative)
During the Middle English period, existential there gradually became more frequent in existential constructions, that is, constructions that serve to present new information about the existence of some entity, as in Example (6). The rise of existential there meant the demise of a corresponding existential construction without there, i.e. with what we can informally call a null, or empty, variant. The two competing Middle English existential constructions are exemplified below with sentences taken from Chaucer's Canterbury Tales (Benson, 1987), respectively the General Prologue and The Monk's Tale:
(8) With hym ther was a Plowman (GP: 529)
"With him there was a ploughman"
(9) [Ø] Was nevere wight, sith that this world bigan, That slow so manye monstres (MkT: 2111)
"There was never man since the world began who killed so many monsters"
As the two examples from Chaucer attest, the two variants could be used interchangeably even by the same author. As with Middle English in general, we also find considerable variation in how there was spelt, including ther, þer, and ðere.2 To simplify matters, we will use the modern spelling for there when referring to the existential pronoun. We will refer to the absence of the pronoun, i.e. the null variant, as Ø. Although existential constructions with there are found in Old English (Breivik, 1990), it was during the Middle English period that constructions with there (hereafter there1, to distinguish this use from the locative adverbial use of the morpheme there exemplified in Example (7)) became the dominant variant. Simultaneously, the null variant gradually fell out of use, but with considerable synchronic variation, as the examples from Chaucer illustrate. The reasons behind this constructional change are less clear, however.
Breivik (1990) essentially takes a pragmatic, functional-typological view, whereby syntax and pragmatics interacted to make there1 obligatory due to the loss of verb-second (V2) word order in English, so that "the increasing use of there1 in earlier English is part of a series of parallel syntactic changes, acting in a coordinated manner and pushing the language from one typological category to another" (Breivik, 1990, 247). The increasing use of there1 is attributed to pragmatic, information-related factors that together make the reanalysed locative adverb there1 gradually more obligatory in contexts where new information is being introduced. One such pragmatic factor is what Breivik (1990, 140–50) calls the visual impact constraint: if the post-verbal noun phrase being introduced by the construction refers to something abstract or non-concrete, there1 serves as an added, pragmatic introductory signal. Williams (2000), on the other hand, takes the view that this change was at least partially connected with the loss of, or lack of productivity in, verb-initial (V1) word order. A third angle is presented by Jenset (2014), who finds evidence for sociolinguistic factors being involved, following Croft (2000) and Blythe and Croft (2012).
2 Jenset () identifies twenty-seven spelling variants of there based on data from the PPCME treebank (Kroch and Taylor, ).
Identifying causes for linguistic change is of course challenging, as we have discussed earlier. In addition to the problems facing anyone wishing to establish a causal effect, the different linguistic paradigms approach the question of causality in very different ways.
Ringe and Eska (2013), who explicitly position themselves in a generative paradigm, stress the importance of errors when children learn their first language, as well as contact-induced second-language learner errors, as the primary sources of linguistic change. In the cognitive-functional paradigm, Croft (2000) dismisses language learning and instead focuses on sociolinguistic context and usage preferences as selection mechanisms in the evolutionary sense of the word. These are but two examples, but they illustrate well the thorny problems facing anyone attempting to establish causes for linguistic change, not to mention establish a consensus (see principle 1, section 2.2.1). We believe that a focus on the empirical consequences associated with the claims and proposed models can help to bridge this gap, and evaluate the competing claims and models across paradigms. A crucial step in quantitative historical linguistics involves translating the competing claims and models into questions that can be answered with statistical techniques. This step allows us to assess the quantitative consequences of the competing claims, and hopefully reach a consensus model. Crucially, quantitative arguments are not themselves sufficient, as we argued in principles 11 (section 2.2.11) and 12 (section 2.2.12). If the linguistic phenomenon is multivariate, then multivariate techniques are required, and those techniques must be applied according to best practices. In the case of the Middle English existential construction, both Breivik (1990) and Williams (2000) rely on empirical and quantitative arguments, based on extensive collections of data. We need some form of quantitative evaluation that can weigh the different options against each other. Clearly, two of the proposed explanations directly or indirectly imply a correlation between different existential subjects (there1 and Ø) and changes in (surface) word-order probabilities.
Breivik's argument implies a correlation with the loss of V2 word order, whereas Williams's argument implies a correlation with the loss of V1. However, as we established in principle 3 (section 2.2.3), any claim about the linguistic past that is not physically or logically impossible has a non-zero probability of being true. Principles 4 (section 2.2.4) and 5 (section 2.2.5) established that, since a claim based on strong evidence has more merit than one based on weak evidence, we can recast the claims in terms of the relative strengths of the correlations:

1. The strongest correlation is found between there1 and the loss of V1.
2. The strongest correlation is found between there1 and the loss of V2.
3. No real correlation is found between there1 and any word-order pattern.

The question can then be rephrased as follows: is the loss of V1 or V2 more important, or is some other variable of greater importance? Since the rate of V1 and V2 word-order patterns in main clauses can be retrieved from a corpus, we can directly investigate the question of correlation between (surface) word-order patterns and rates of there1. At this point, we might reach for a statistical null-hypothesis test to measure the correlation (after retrieving the necessary information from a corpus). However, there are also other, competing claims and hypotheses to take into account. Jenset (2014) argues that sociolinguistic status might have been involved in the use and non-use of there1 in Old English, and suggests that this would also be the case for Middle English. In other words, a realistic quantitative assessment must not only handle the correlation between the realization of the existential subject and the word-order patterns, but also simultaneously account for sociolinguistic factors.
Furthermore, the evaluation of the claims and models must take into account pragmatic, or information-theoretic, factors. Breivik's (1990) suggestion that there1 gradually becomes obligatory in contexts where new information is introduced is expanded upon in Breivik (1997), who argues that the development of there1 can be seen as a form of grammaticalization, whereby the function of there1 as a signal of new information becomes increasingly tied to a fixed grammatical context. Since the grammatical context is observable in a corpus, we can predict that the increased importance of such a signal function should manifest itself in an increasing statistical correlation with the surrounding grammatical context. Below we will focus on the grammatical element that follows there1. Based on this prediction, we would say that such a correlation with the context would strengthen the semantic-pragmatic claims made by Breivik (1990). Another aspect of the discourse-pragmatic argument made by Breivik (1990) is the tendency for complex elements to occur later in the clause, also known as "Behaghel's Law" (Köhler, 1999). This suggests that the complexity of the sentence is a possible factor. Jenset (2013) investigated the gradual evolution of the two uses of there in early English, and found that syntactic complexity, as measured by a composite index weighing the number of NPs and finite verbs against the total number of elements, was a significant predictor in distinguishing there1 from there2. This leaves us with a number of competing claims and hypotheses regarding the competing use of Ø and there1 in Middle English, based on sociolinguistics, pragmatics (context, complexity), and the effect of word-order patterns. All have varying claims to explanatory power regarding the phenomenon we are studying, and the next section discusses the data used to assess these claims and hypotheses.

Data

We used syntactically annotated Middle English data as the basis for the statistical investigation, i.e.
model serialization (section 1.1). The data were drawn from the PPCME2 treebank (Kroch and Taylor, 2000), which comprises around 1.1 million words of prose covering the period from around 1150 CE to 1500 CE. We used a bespoke Python script to extract information about sentences with there1 or Ø, as well as (surface) word-order patterns in main clauses. The data and the code for this study are available on the GitHub repository https://github.com/gjenset.

Advantages of treebanks

A methodological remark about the data source is in order. We introduced treebanks in section 4.3.2; the detailed syntactic annotation that treebanks provide is the only level that makes studies like this one feasible at a large scale, with the advantages of reproducibility that we stressed in section 4.1.1. Since the PPCME2 annotates existential uses of there differently from locative uses, by assigning an 'EX' tag to the former and an 'ADV' tag to the latter, we could single out the cases of there1. Furthermore, since empty pronouns are annotated with an '*exp*' tag, we could also identify the Ø variants. According to the corpus documentation, the *exp* tag is ambiguous, since it is also used for subjects in impersonal constructions. However, we could exclude these cases by checking the context of the *exp* tag, thus allowing us to extract only the existential sentences. We could also rely on the treebank annotation when extracting only existential subjects from main clauses, thus avoiding the added complexity of dealing with both main and subordinate clauses. Furthermore, two of the hypotheses discussed above involve probabilities of (surface) word-order patterns. The topic of word-order patterns in early English is a complex area; see e.g. Heggelund (2015).
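By way of illustration (this is not the authors' actual script, which is available in the repository mentioned above), a minimal classifier over Penn-style bracketed parses of the kind the PPCME2 provides might look as follows; the example parses are simplified inventions:

```python
import re

# Hypothetical minimal classifier for PPCME2-style bracketed parses.
# A token tagged EX marks existential 'there'; the *exp* trace marks an
# empty expletive subject. A real extraction script would additionally
# restrict the search to main clauses and filter out the impersonal
# constructions that also carry *exp*, as described above.
def classify_existential(parse: str):
    """Return 'there1', 'null', or None for a bracketed parse string."""
    if re.search(r"\(EX\s+\S+\)", parse):
        return "there1"
    if "*exp*" in parse:
        return "null"
    return None

examples = [
    "(IP-MAT (PP (P With) (NP (PRO hym))) (EX ther) (VBD was) (NP (D a) (N Plowman)))",
    "(IP-MAT (NP-SBJ *exp*) (VBD Was) (ADVP (ADV nevere)) (NP (N wight)))",
    "(IP-MAT (NP-SBJ (PRO He)) (VBD rood))",
]
```

Applied to the invented parses above, the first is classified as there1, the second as a null existential, and the third as non-existential.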
We take a fairly atheoretical approach (Horobin and Smith, 2002, 99–103) whereby any main clause is considered V1 if its first element is a finite verb (including verb clitics with the negative particle ne). By means of the corpus annotation, we could exclude imperative sentences and questions from our data. Some of the choices we have made will be contested by other linguists. By using a Python script to extract the data, we have concrete documentation of those choices, which can be shared with, and critiqued by, other scholars in a transparent manner.

Data description

We found a total of 1688 main clauses with an existential subject, 807 (48 per cent) of which could be analysed as having an empty pronoun. For each sentence, we recorded the name of the corpus file it was found in, its period (from the Helsinki corpus metadata, incorporated into the PPCME2), the unique identifier of the sentence, the dialect of the text in question as recorded in the corpus documentation, and the manuscript date. Whenever the manuscript date was uncertain according to the documentation, a reasonable date was chosen based on the information in the corpus metadata. As we will see, that simplification is probably warranted and not a major problem for the analysis, since the techniques being used can in any case handle some noise in the observed data. We also collected further data about the sentences, to properly test the competing claims and hypotheses mentioned above. This includes whether or not the subject of the main clause is an empty expletive, the corpus tag of the next element in the main clause, and the conditional probability of finding that next element after an empty or non-empty existential subject (respectively).
This probability was calculated as the number of occurrences of an item x immediately following either Ø or there1, divided by the number of occurrences of x in the entire corpus. Also included are two columns providing the relative frequencies of main clauses displaying the V1 and V2 word order in the given corpus file. Finally, we included a sentence-level variable recording the maximum depth of syntactic embedding for each sentence as an approximation to its syntactic complexity.

Exploration

Faraway (2005, 2) emphasizes graphical analysis as a means to properly understand the data, and in this section we present a number of exploratory plots to that end.

Realization of subject

The plot in Figure 6.4 visualizes the distribution of the existential subjects over time.

[Figure 6.4. Graph showing the shift in relative frequencies of existential there and empty existential subjects during the Middle English period (proportions by MS date, 1150–1450).]

The plot shows the shifting probabilities of the there1 and Ø realizations of the existential subject over time. The resulting S-curve is expected based on previous research on existential there (Breivik, 1990, 226; Jenset, 2010, 273). This distribution of the realization of the existential subject in Middle English shows how there1 rapidly overtakes the null variant after the end of the thirteenth century. Consequently, a potentially explanatory variable whose explanation assumes a correlation with the rise of there1 would need to display a roughly similar distribution.
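The context-probability measure can be sketched as follows (an illustrative toy reimplementation, not the authors' script; the tags and counts are invented):

```python
from collections import Counter

# Illustrative sketch of the context-probability measure described above:
# count(x immediately after a given existential subject) divided by
# count(x in the whole corpus). Sentences are reduced to
# (subject_type, next_tag) pairs; all tags and numbers are invented.
def context_probability(pairs, corpus_tag_counts):
    """Map (subject_type, tag) -> conditional probability of tag."""
    after = Counter(pairs)
    return {
        (subj, tag): after[(subj, tag)] / corpus_tag_counts[tag]
        for (subj, tag) in after
    }

pairs = [("there1", "BED"), ("there1", "BED"), ("null", "BED"), ("null", "VBD")]
corpus_tag_counts = {"BED": 10, "VBD": 20}
probs = context_probability(pairs, corpus_tag_counts)
```

With the invented counts above, for instance, the tag BED occurs after there1 twice out of its ten corpus occurrences, giving a value of 0.2.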
However, we can also note that Blythe and Croft (2012), using simulation modelling based on diachronic linguistic data, conclude that such an S-curve is also driven by sociolinguistic factors, specifically the different social prestige associated with the linguistic variants in question. If this is correct, we might also expect to see some effects in our data related to variables that can express different social prestige, specifically genre and dialect.

Word-order patterns

In Figure 6.5 we see the main trends for the two word-order patterns. As expected, the V1 pattern is much less frequent than the V2 pattern. The V2 pattern is receding more than the V1 pattern, but both appear to stabilize, with visible variation in frequency, in the second half of the Middle English period.

[Figure 6.5. Distribution of V1 and V2 word-order patterns (proportion of each pattern in declarative main clauses, by MS date). The lines are smoothed, locally adjusted regression lines that outline the main trends for the two patterns.]

A caveat is that at least some of this fluctuation may reflect uncertainty in manuscript dating, as mentioned above. The lines in the plot are locally smoothed non-parametric regression lines (Venables and Ripley, 2002, 230), which are useful for exploration and identification of the predominant behaviour in the data. Based on those lines, we would be inclined to think that the loss of V2 word order is a more promising explanatory variable than the loss of V1, since the former appears to have a clearly observable (negative) correlation with the rise of existential there.

Conditional probability of right context

In Figure 6.6, we can see the probability of the grammatical elements found after the existential subjects, plotted as a box-and-whiskers plot.
The mean (black line) for the elements following there1 is higher than for Ø, suggesting that there1 is conventionally bound to the context following the morpheme. Despite a number of outliers, the Ø subject appears with more variable contexts.

[Figure 6.6. Box-and-whiskers plot of conditional probabilities of elements following existential there and empty existential subjects (probability of next constituent).]

Maximum sentence depth

As mentioned above, we decided to use the maximum embedding depth (based on the phrase-structure annotation of the treebank) as a proxy for linguistic complexity. A higher number indicates more levels of embedding in the sentence. Figure 6.7 shows the maximum sentence depth by existential subject in a box-and-whiskers plot.

[Figure 6.7. Box-and-whiskers plot of the maximum degree of embedded (phrase-structure) elements for sentences with there and empty existential subjects.]

Sentences with there1 have a lower degree of embedding, suggesting a slightly simpler clause structure on average. Figure 6.8 shows the maximum degree of embedding for all the sentences in the sample over time. Although there appears to be a very slight tendency towards less embedding in later sentences, the overall impression is one of stability.

Genre

Genre is a possible confounding factor in the analysis, since the choice of existential subject could be motivated by stylistic factors. Although there1 is near-obligatory in the present-day English existential construction, it can still be omitted under some circumstances, as exemplified in Example (10) from Davies (2008).

(10) Behind the door was a large room lit by strips of blue phosphor laid across the ceiling.
[Figure 6.8. Maximum degree of embedding for all sentences in the sample over time (by MS date), with added non-parametric regression line.]

Such examples are associated with particular styles and genres, although Coopmans (1989) argues that they can be accounted for by syntactic means. Figure 6.9 shows the distribution of existential subjects by genre. Some genres are clearly more frequent than others, notably religious treatises, history, homilies, romances, travelogues, and sermons. Three of these (religious treatises, homilies, and travelogues) have a majority of null existential subjects.

[Figure 6.9. Bar plot of counts of existential subjects by genre.]

Dialect

Dialect is interesting because of its role as a possible proxy for typical sociolinguistic variables such as regional or local identity. Although any kind of ethnic identity is too complex to be reducible to a direct correspondence with language (Fought, 2002), it is still reasonable to assume that language can express regional in addition to national identity (Chambers, 2002, 362–4). Figure 6.10 shows the existential subject by dialect, and we can see that there1 is the most frequent variant in most dialects, except in East Midlands and Northern Middle English. The similarity between the East Midlands and Northern material is not unexpected, since these Middle English dialects both came out of the Old English Anglian dialect area (Corrie, 2006; Irvine, 2006).
The choice of statistical technique

In section 2.2.12 we argued for the importance of choosing the right statistical test or model to evaluate competing claims and hypotheses. Since more than one of the hypotheses could be correct (section 2.2.11), we must evaluate them against each other. Simply counting the raw number of observations in different categories and interpreting them as frequent or infrequent without a further frame of reference (as done in e.g. Bybee, 2003) will not do. Stefanowitsch (2005) discusses this "raw frequency fallacy" and points out that the correct approach is to compare the observed counts with the expected numbers, using some plausible statistical model. However, traditional null-hypothesis tests are not really suited for our purpose here, since they are designed for testing a hypothesis against a null hypothesis, not for comparing multiple hypotheses (Gelman and Loken, 2014).

[Figure 6.10. Bar plot of counts of existential subjects by dialect.]

To illustrate this point, consider Pearson's chi-square test. The chi-square test, which compares the difference between observed counts and expected counts based on the chi-squared distribution, is a statistical test that many linguists will be familiar with, either through use or through exposure in the literature. The test is fairly simple and can be found in introductory books on corpus linguistics, such as Gries (2009b). However, the test itself is less than ideal for the purposes we are considering here, as we will show.³

The perils of chi-square

For the purposes of illustration, we consider the evolution of there1 in light of dialects, one of the possible variables correlated with there1.
As discussed above, we must take into account the possibility that dialect variation is a piece of the puzzle of there1. Table 6.3 provides an overview of there1 and Ø by dialect, and it is clear that the Midlands dialects account for the bulk of the occurrences, with more instances of there1 than Ø in West Midlands, and the reverse situation in East Midlands. The Northern dialect is the only other dialect with more instances of Ø than there1.

Table 6.3. Frequencies of there1 and Ø according to dialect in Middle English

  Dialect         there1     Ø
  East Midlands      389   465
  Kentish             51    25
  Northern            24    64
  Southern           114    46
  West Midlands      303   207

However, we need to take into account both the total number of observations by dialect and the total number of observations for the two existential variants, since they are obviously attested to very different degrees. We can do so by comparing the observed number in each cell to its theoretical expected frequency, calculated from the row and column sums. By taking the difference between each pair of observed and expected values, squaring it (which avoids negative numbers), and dividing by the expected value, we obtain a proportional deviation from the expected count for each cell. Adding all these squared proportional deviations together, we are left with the chi-square score. This chi-square score is essentially a measure of how much the observed values in the table as a whole differ from their expected counterparts.

³ Much of the same criticism pertains to another popular test, Fisher's exact test, but we will only consider Pearson's chi-square here.
A larger chi-square score will in general signal a larger deviation from the expected values than a smaller one. If we take the number of categories, i.e. rows and columns, into account (for simplicity of exposition we gloss over a deeper discussion of degrees of freedom), we can compare the chi-square score to the chi-square distribution, which is usually a good model for such differences between expected and observed counts if there are no systematic correlations between rows and columns. Finally, a p-value signals the degree to which the observed data are likely given the chi-square distribution. The last point is worth underlining, since it is often misconstrued (Cohen, 1994). The p-value is not the probability of the null hypothesis, or the probability of being incorrect regarding some hypothesis. The p-value is the probability of observing the data in the table (or some more extreme version of it), if we assume the chi-square distribution to be a good model. If the p-value is small (conventionally below 0.05), we can conclude that the chi-square distribution is not a good model, and that there are indeed correlations between rows and columns. From the discussion above it should be clear that the chi-square test is not a particularly intuitive way of reasoning about data. Since the p-value gives us the probability of the observed data given the chi-square distribution, a precise hypothesis is needed to make any sense of the results. Furthermore, as we discussed in section 3.7.5, the chi-square test is vulnerable to an inflation effect, which was noted as early as Berkson (1938).
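The calculation just outlined can be reproduced directly from the counts in Table 6.3. The following is an illustrative sketch in Python (not the authors' R workflow), with scipy's `chi2_contingency` used as a cross-check on the hand-rolled sum:

```python
from scipy.stats import chi2_contingency

# Observed counts from Table 6.3: rows are the dialects (East Midlands,
# Kentish, Northern, Southern, West Midlands); columns are there1 and Ø.
observed = [
    [389, 465],
    [51, 25],
    [24, 64],
    [114, 46],
    [303, 207],
]

# Hand-rolled chi-square: sum over cells of (observed - expected)^2 / expected,
# with each expected count derived from the row and column sums.
n = sum(sum(row) for row in observed)
row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
chi_sq = sum(
    (observed[i][j] - row_sums[i] * col_sums[j] / n) ** 2
    / (row_sums[i] * col_sums[j] / n)
    for i in range(len(observed))
    for j in range(len(observed[0]))
)

# scipy agrees, and also supplies the degrees of freedom and the p-value.
chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi_sq, 2), dof)  # chi-square ≈ 77.7 with df = 4
```

This reproduces the test statistic reported for Table 6.3 in the text, with a vanishingly small p-value.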
Mosteller (1968, 1), in a discussion of chi-square testing that foreshadowed the discussion in corpus linguistics reported on in section 3.7.5, wrote: "I fear that the first act of most social scientists upon seeing a contingency table is to compute chi-square for it." The reason for Mosteller's fear is that having a lot of data will inflate the final chi-square value used to compute the p-value. The reason for this inflation effect is that the chi-square deviation from the expected values is roughly proportional to the size of the sample (Mosteller, 1968, 2). In short, the p-value tells us whether we have the required minimum sample size to detect a correlation between rows and columns, but as the sample size grows larger, the p-value becomes increasingly less informative. For the data in Table 6.3, the result of a chi-square test in R is statistically significant (χ² = 77.72, df = 4, p < 0.001), with a p-value so small that it is indistinguishable from zero for all practical purposes. However, since we have 1688 observations in total in the table, we should not be surprised at the small p-value. In section 3.7.5 we discussed how Gries (2005) showed that the use of effect size measures can to some extent mitigate the inflation effect that comes with corpus data. A convenient effect size measure for chi-square tests is φ for two-by-two tables, and Cramér's V for larger tables. The details of the calculations differ, but both measures attempt to counter the inflation effect by dividing the chi-square sum for the table by the total number of observations in the table (while also taking the number of rows and columns into account: see Gries 2005 for details). The Cramér's V effect size for Table 6.3 is 0.22, which is a small effect for a table of this size (i.e. with five rows and two columns), with this many observations.
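Cramér's V can be computed directly from the chi-square score, the sample size, and the table shape. A stdlib-only sketch, using the χ² = 77.72 reported above for Table 6.3 (small rounding differences from the 0.22 quoted in the text can arise):

```python
import math

def cramers_v(chi_sq, n, n_rows, n_cols):
    """Cramér's V: chi-square normalized by sample size and table shape."""
    return math.sqrt(chi_sq / (n * (min(n_rows, n_cols) - 1)))

# Table 6.3: 5 dialects x 2 variants, 1688 observations, chi-square 77.72.
v = cramers_v(77.72, 1688, 5, 2)
print(round(v, 2))  # a small effect, ≈ 0.21–0.22
```

Squaring V gives roughly 0.05, which is the "proportion of variation explained" reading discussed next.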
Another way of thinking of this effect size is to consider it a measure of how much of the variation in the table can be explained by the correlation between rows and columns. We can approximate this by taking the square of the effect size, which in this case amounts to about 0.05 with rounding. In other words, the correlation between rows (dialects) and columns (there1 vs Ø) explains about 5 per cent of the variation in the data. Such a negligible explanatory power is hardly impressive, and it shows clearly that dialect differences alone are incapable of explaining all the variation in the data. This is not to say that dialect differences play no role at all, but rather that whatever influence they exert must be weighed against other potentially explanatory variables. However, bringing in more variables is associated with other challenges.

The perils of multiple testing

Above we showed that a simple chi-square test of a table with counts of there1 and Ø was incapable of fully describing the variation between the two variants. As we will see, attempting to repeat the same procedure for more variables is not the correct way to solve the problem.

Table 6.4. Frequencies of there1 and Ø according to genre in Middle English

  Genre                      there1     Ø
  Bible                          30     7
  Biography, Life of Saint        9    29
  Fiction                        36    11
  Handbook                        8    10
  History                       176    73
  Homily                         21   220
  Philosophy                     10     0
  Religious Treatise            212   273
  Romance                       157    25
  Rule                           33    24
  Sermon                        117    33
  Travelogue                     72   102

In Table 6.4 we have collected the counts of there1 and Ø by genre. A Pearson chi-square test again informs us that we have enough data to detect an association between rows and columns (χ² = 409.86, df = 11, p < 0.001), with a p-value practically indistinguishable from zero. To obtain the strength of the association, we again calculate Cramér's V, which is 0.49.
This is a medium effect size for a table of this size, and it would theoretically explain about 24 per cent of the variation in the table. At first glance this is promising, since we have established that genre appears more important than dialect. However, so far we have only considered two tables, both of which contain nominal, or count, data. As we showed above, we collected various types of data, including proportions of word-order patterns and counts of maximum syntactic embedding. Following the approach taken here, we would end up with a whole series of statistical test results, not all of them coming from a Pearson chi-square test, with different effect size measures, that would have to be compared against each other. To make matters worse, such multiple testing on the same data makes it easier to find significant p-values by sheer chance, which would need to be taken into account by some sort of correction. Moreover, since the testing is done by slicing and dicing the data, with each table unconnected with the others, we would have no means of working out any deeper connections or correlations among the variables found in different tables. And, finally, as pointed out above, the logic of each null-hypothesis test is strictly speaking invalidated, since we intend to compare multiple alternative hypotheses (Gelman and Loken, 2014). Hence, chi-square testing of the type outlined here is seldom the best choice for historical corpus data.

Regression modelling

The problems discussed above concerning Pearson's chi-square test (or similar tests such as Fisher's exact test) point towards one conclusion: when historical linguistic data are viewed in their proper, multivariate context (see principle 11 in section 2.2.11), the appropriate technique is one or more of the multivariate techniques discussed in section 6.2 (given the appropriate data).
The exact choice of technique depends on exactly how the question is conceptualized. For instance, Jenset (2010) modelled the difference between there1 and there2 in historical varieties of English by means of a mixed-effects binary logistic regression model. McGillivray (2013, 190–3) employs the same technique to model the binary choice between the bare-case constructions and the prepositional constructions for the realization of the argument structure of Latin prefixed verbs. In both cases the model was used to estimate the probability of switching from one variant to the other given some combination of variables. However, this is merely one way of looking at the data. McGillivray (2013, 202–10) used multivariate exploratory techniques like CA to explore a range of variables affecting the argument realizations of Latin prefixed verbs; Jenset (2013) uses a similar approach to better understand the difference between there1 and there2 from a distributional semantics perspective. For the present case study, we find it useful to conceptualize the problem as a binary choice between there1 and Ø, and to consider the influence that the variables discussed above have on this choice. For this reason our main technique in the analysis below will be a binary logistic regression model; see section 6.2 for details.

Quantitative modelling

In order to test the hypotheses in question, we chose to use a binary logistic regression model, a type of regression model that is useful and well tested in linguistics (Bayley, 2002; Baayen, 2008; Johnson, 2008). We initially considered both ordinary logistic regression models and mixed-effects models (see section 6.2). However, during the model evaluation phase, we found that the ordinary logistic regression model provided the best fit to the data. The question of proper model evaluation, or criticism, follows from principle 12 (see section 2.2.12, regarding adherence to best practices in applied statistics).
Model criticism as best practice is recognized both by statisticians (Faraway, 2005, 53–75) and by linguists (Baayen, 2008, 188–93; Hilpert and Gries, 2016). Ultimately, some trial and error is involved in hitting upon a model that fits the data well. Baayen (2008, 236) points out that linguistic hypotheses are often somewhat under-specified; however, even quite specific hypotheses will often require some trial and error in finding a good model, which shows that we are not dealing with some kind of mechanical process without room for the judgement of the researcher.

The model

The model has three elements: a response (i.e. a dependent variable), a set of predictors (i.e. fixed effects or independent variables), and an error component modelled by the binomial distribution. The model that fit the data best has the structure outlined below, and we created it with the glm() function in R.

(11) Response: probability of switching from there1 to Ø, modelled as depending on the following fixed effects:
probability of context + max sentence depth (log scale) + proportion of V1 main clauses + proportion of V2 main clauses + dialect + genre + MS date

Some of the fixed effects were rescaled for different reasons. The variable recording maximum embedding depth was log-transformed, to better match the glm() function's expectations of normally distributed data. MS date (manuscript date) was rescaled so that each change of unit in the statistical model reflects fifty years (instead of one year), to make the results more interpretable. Similarly, the variables for probability of context and proportion of V1 and V2 clauses (all on a 0 to 1 scale) were rescaled so that each change of unit in the model reflects a 0.1 increment. Again, this was done to make the model easier to interpret.
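R's glm() fits a binomial model of the kind in (11) by iteratively reweighted least squares (IRLS). To make that fitting procedure concrete, here is a self-contained toy sketch in Python on synthetic data (a single invented predictor standing in for, say, the proportion of V2 clauses); it is an illustration of the technique, not the authors' model:

```python
import numpy as np

# Synthetic data: one continuous predictor plus an intercept, with a
# binary outcome generated from known "true" logistic coefficients.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 500)
X = np.column_stack([np.ones_like(x), x])     # intercept + predictor
true_beta = np.array([-2.0, 4.0])
p_true = 1 / (1 + np.exp(-X @ true_beta))
y = rng.binomial(1, p_true)

# IRLS: repeatedly solve a weighted least-squares problem until the
# coefficient estimates settle (this is what glm() does internally).
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta
    mu = 1 / (1 + np.exp(-eta))               # predicted probabilities
    W = mu * (1 - mu)                         # working weights
    z = eta + (y - mu) / W                    # working response
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

print(beta.round(1))  # should land near the true coefficients (-2, 4)
```

The fitted coefficients are log odds ratios, which is also how the coefficients of the model in (11) are reported below.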
Model fit

As mentioned above, an important step in quantitative modelling is evaluating how well the model fits the data. This makes intuitive sense: a model that does not fit the data is obviously not a sound basis for drawing conclusions about claims regarding the data. Baayen (2008, 204) mentions two ways in which we can assess the fit of a logistic regression model. One is (pseudo) R2, a measure of fit loosely based on the proper R2 (or coefficient of determination), which ranges from 0 to 1 and indicates the degree of variation explained by the model. Logistic regression models are based on a different procedure, but we can interpret a pseudo R2 index such as Nagelkerke's R2 (Nagelkerke, 1991) as the degree of improvement over an alternative model that only predicts the most frequent outcome. Another measure, Harrell's C, calculates the correlation between the response values and the values predicted by the regression model. A C value of 0.5 signals random guessing, whereas a value of 1 means perfect prediction. A value of 0.8 is often taken as a minimum for a model with some predictive power (Baayen, 2008, 204). However, these measures are not uncontroversial indicators of good models (Long and Freese, 2001, 83–7). A powerful alternative to such numerical indicators of model fit is to look at diagnostic plots of the residuals (Faraway, 2005, 58). The residuals are essentially the distances between the actual observations and the straight line which is the basis for the regression model. Visualizing these residuals makes it possible to spot problems with the model, once we have some familiarity with residual plots. We consider such familiarization a good investment of time, and Jenset (2010, 106–9) discusses such plots in more detail in the context of linguistics, based on the more technical exposition in Faraway (2005). Turning to the model at hand, we find that Nagelkerke's R2 for the model is 0.4, whereas Harrell's C index for the model works out to 0.82.
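Harrell's C has a simple counting interpretation: among all pairs of observations with different outcomes, it is the share of pairs in which the model assigned the higher predicted probability to the positive case. A minimal sketch of such a computation (in Python rather than R; the toy outcome and probability vectors are invented for illustration, not taken from the model above):

```python
from itertools import combinations

def harrells_c(y_true, y_prob):
    """Concordance index: proportion of usable pairs (pairs with different
    outcomes) in which the positive case received the higher predicted
    probability; tied predictions count as half-concordant."""
    concordant = ties = usable = 0
    for (yi, pi), (yj, pj) in combinations(zip(y_true, y_prob), 2):
        if yi == yj:
            continue  # pairs with the same outcome carry no information
        usable += 1
        pos, neg = (pi, pj) if yi == 1 else (pj, pi)
        if pos > neg:
            concordant += 1
        elif pos == neg:
            ties += 1
    return (concordant + 0.5 * ties) / usable

# Toy data: four outcomes (0/1) and the model's predicted probabilities.
c = harrells_c([0, 0, 1, 1], [0.2, 0.4, 0.3, 0.9])
```

With these toy values three of the four usable pairs are ranked correctly, giving C = 0.75; a perfectly ranking model gives 1, and coin-flipping hovers around 0.5.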
Both measures were calculated by refitting the model. The C index above 0.8 suggests that the model has a genuine capability to correctly identify the response (Baayen, 2008, 204). Not wishing to rely on such numbers alone, we also checked the model structure for any signs of a bad fit to the data by means of a plot (Faraway, 2005, 53–8; Gelman and Hill, 2007, 97–101). Figure 6.11 is a binned residuals plot of the model. Again, the impression is a positive one: the predicted values span the whole range of possible values from 0 to 1 (there1 and Ø, respectively), and most of the black dots (i.e. the observations in the data set) are inside the grey confidence interval lines, without too much of a clear pattern to them. The joint impression from these three ways to assess the model fit is that the model is good. From this step, we can go on to interpret the output of the model.

Figure 6.11. Binned residuals plot of the logistic regression model, indicating acceptable fit to the data. (Axes: expected values against average residual.)

Results

Table 6.5 summarizes the fixed effects (or predictors). The first column of the table lists the name of the fixed effect variable, whereas the second column gives the coefficient, i.e. the size of the effect that changing the predictor has on the response.

Table 6.5. Coefficients for the binary logistic regression model showing the log odds ratio for switching from there1 to Ø.
Positive coefficients favour Ø, negative coefficients favour there1.

Predictor                        Coef (β)   SE(β)      z     p
(Intercept)                        12.35     2.14     5.8    <.0001
nextProb01                         −0.88     0.26    −3.4    <.001
logMaxDepth                         0.72     0.21     3.5    <.001
v1prop01                            0.05     0.53     0.1    >0.9
v2prop01                            0.24     0.19     1.2    >0.2
dialectKentish                     −1.43     0.30    −4.8    <.0001
dialectNorthern                     1.50     0.28     5.4    <.0001
dialectSouthern                     0.51     0.31     1.6    >0.1
dialectWest Midlands                0.17     0.23     0.8    >0.4
genreBiography, Life of Saint       1.40     0.64     2.2    <.05
genreFiction                        0.66     0.57     1.1    >0.3
genreHandbook                       2.73     0.69     3.9    <.0001
genreHistory                        1.08     0.51     2.1    <.05
genreHomily                         1.99     0.58     3.4    <.001
genrePhilosophy                   −12.81   277.22     0.0    >0.9
genreReligious Treatise             1.63     0.46     3.5    <.001
genreRomance                        0.60     0.66     0.9    >0.4
genreRule                           1.24     0.56     2.2    <.05
genreSermon                         1.01     0.52     2.0    >0.1
genreTravelogue                     2.42     0.49     4.9    <.0001
msDate50                           −0.53     0.07    −7.3    <.0001

The next column gives the standard error of the coefficient, a measure of how much variation we can expect around the estimate. The penultimate column is the test statistic used for calculating the p-value in the last column. It is good practice to provide all this information when making use of regression modelling, since it allows a detailed look into the model. To summarize the regression output in Table 6.5:

• The intercept is difficult to interpret in this case since it corresponds to a case where the probability of V1 and V2 word order, and the probability of the item following the existential subject, are all zero.
• The coefficient for nextProb is −0.88 on the logit scale. The variable is statistically significant and indicates the effect upon the response of a 0.1 increase in the probability of the grammatical element following the existential subject. The negative sign indicates that increasing the conditional probability of the context decreases the probability of a null realization of the existential subject. Dividing the coefficient by four (Gelman and Hill, 2007, 82) gives an estimate of the maximum effect size on a probability scale.
Dividing −0.88 by four gives −0.22; in other words, for every 10 per cent increase in the probability of the subject context, there is a 22 per cent decrease in the probability of a null subject.
• The coefficient for maximum syntactic depth (log scale) is 0.72 on the logit scale, which when divided by four corresponds to an 18 per cent increase in the probability of Ø for every 1 per cent increase in maximum syntactic depth.
• The coefficient for v1prop is 0.05. However, the last column in the table tells us that this effect is not statistically significant.
• The coefficient for v2prop is higher than for v1prop, at 0.24, but again the effect is not statistically significant.
• The coefficient for dialect=Kentish is −1.43 on the logit scale. This is the log odds ratio for switching from there1 to Ø when changing from the reference level (East Midlands) to Kentish. The negative sign tells us that Kentish, or South Eastern English, is associated with there1. Dividing by four estimates the effect to be a decrease in the probability of Ø of about 36 per cent.
• The coefficient for dialect=Northern is 1.50 on the logit scale. Again we have a statistically significant difference from the reference level East Midlands. The sign here is positive, so a Northern dialect is associated with the null subject (an increase of around 38 per cent).
• The Southern and West Midlands dialects are not statistically significantly different from the East Midlands reference level.
• The coefficient for genre=Biography, Life of Saint is 1.40 on the logit scale, which translates into a 35 per cent increase in the probability of Ø in this genre. However, we cannot ignore the relatively large standard error compared to the coefficient (0.64) and the correspondingly weaker significance level. This result is less clear-cut than some of the other results.
• The coefficient for genre=Fiction is not statistically significantly different from the reference level (Bible).
• The coefficient for genre=Handbook is 2.73 on the logit scale, indicating a large increase (68 per cent) in the probability of the null variant.
• The coefficient for genre=History is 1.08 on the logit scale, which translates into a 27 per cent increase in the probability of the null subject. As with genre=Biography, Life of Saint, we can note the relatively large uncertainty about the estimate, as expressed by the standard error (0.51) relative to the coefficient.
• The coefficient for genre=Homily is 1.99 on the logit scale, i.e. a 50 per cent increase in the probability of Ø.
• genre=Philosophy is not statistically significant.
• The coefficient for genre=Religious Treatise is 1.63, or a 41 per cent increase in the probability of Ø.
• genre=Romance is not statistically significant.
• The coefficient for genre=Rule is 1.24 on the logit scale, translating into a 31 per cent increase in the probability of the null subject. Again we note a relatively large uncertainty about this estimate.
• genre=Sermon is not statistically significant.
• The coefficient for genre=Travelogue is 2.42 on the logit scale, or an increase in the probability of Ø of about 60 per cent.
• The coefficient for MS date is −0.53 on the logit scale, indicating that every fifty-year increase of MS date corresponds to an average 13 per cent drop in the probability of the null subject.

We can now return to the competing claims and hypotheses formulated above and review them in light of the statistical model. Interestingly, neither of the word-order-related hypotheses implied by Williams (2000) or Breivik (1990) proved significant when compared to the other variables.
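The divide-by-four shortcut used throughout the list above (Gelman and Hill, 2007, 82) rests on the fact that the logistic curve is steepest at probability 0.5, where its slope is β/4; β/4 is therefore the maximum change in probability per unit change in the predictor. A quick check of this in Python (our own illustration; only the coefficient −0.88 is taken from the model):

```python
import math

def inv_logit(x):
    """Logistic function: map a value on the logit (log-odds) scale
    to a probability."""
    return 1 / (1 + math.exp(-x))

beta = -0.88  # coefficient for nextProb01 on the logit scale

# Exact probability change for a one-unit increase of the predictor,
# starting from p = 0.5 (logit 0), versus the divide-by-four shortcut.
exact = inv_logit(0 + beta) - inv_logit(0)
approx = beta / 4
```

Here approx is −0.22 (the 22 per cent quoted above), while the exact change starting from p = 0.5 is about −0.21; the shortcut never understates the effect, which is why the text calls it a maximum effect size.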
Although we cannot categorically exclude an effect of major word-order changes upon the realization of the existential subject, the results above make such hypotheses less plausible. Instead, we find that there1 is associated with a closer link to the surrounding grammatical context, as would be expected under a process driven by pragmatic factors (Breivik, 1990). This interpretation is strengthened by the result seen for the proxy measure of syntactic complexity, namely maximum sentence depth. A more complex sentence appears to favour the null subject, which could again support a pragmatics (or information theory) based view of the rise of there1 . Nevertheless, this cannot be the full explanation. We also found that dialect plays a role, with a continuum running from North to South. Compared to the central areas of England, the Northern dialect is more likely to prefer Ø, whereas the South East is more likely to prefer there1 . Such a result is compatible with the possible sociolinguistic explanation based on Blythe and Croft (2012) and discussed in Jenset (2014). Similarly, genre appears to play a role even when we control for dialect and MS date, suggesting that this may also partially be a stylistic choice. These results are expected based on the assumption laid out in section 2.2.11 that language is best explained in multivariate terms. Although it is clear that one single factor cannot explain all the variation, we have shown that some explanations are substantially less likely than others. The quantitative model we have presented above compares favourably with an approach based on multiple null-hypothesis tests. Importantly, the model shows that the changes in the Middle English existential construction cannot be reduced to a purely syntactic process. Clearly, an explanatory model must somehow account for the sociolinguistic differences among dialects and genres that we have identified. 
As such, the results seen here broadly support the sociolinguistics-informed approach to historical linguistics in Croft (2000) and Blythe and Croft (2012).

Summary

This case study has illustrated the benefits of quantitative methods in historical linguistics discussed at the beginning of this chapter. Specifically, we have shown how multivariate techniques such as logistic regression can deal with complexity in the form of different linguistic and sociolinguistic variables. Although we have stated that linguistic features ought in any case to be modelled with more than one explanatory variable (see section 2.2.11), we have illustrated this point with a case study where multiple competing and partially overlapping proposed explanations exist. As we have seen, quantitative, multivariate techniques are well suited for assessing such competing claims and hypotheses against each other. By performing this comparison, we have also identified potentially explanatory factors, which in turn point to how an explanatory linguistic model of the change in question must be framed. Specifically, the connection between existential subject realization and the linguistic context of the subject suggests that context at a relatively fine-grained level needs to be taken into account. But the linguistic model also needs to account for the sociolinguistic effects identified in the statistical model above. In demonstrating this, we have shown how the principles of quantitative historical linguistics can contribute to some of the major goals of historical linguistics listed above (section 6.1.4): working out some of the details of the histories of individual languages and, by pinpointing which variables have explanatory value, covering some of the ground needed for modelling linguistic change.
A new methodology for quantitative historical linguistics

7.1 The methodological framework

The literature survey we described in Chapter 3 has shown that the landscape of historical linguistics research is characterized by a high degree of methodological variability. However, we can safely say that we observed a general under-representation of corpus-based and/or quantitative approaches. Moreover, there is no agreed standard on what is considered high-quality quantitative research in historical linguistics. In this book we have proposed a new overarching framework for quantitative historical linguistics, and we have argued that this is a good framework for conducting historical linguistics research within the scope defined in section 2.1.1 for three main reasons: it allows us to answer questions that qualitative approaches cannot answer; it provides stronger evidence (and therefore stronger explanations) than qualitative approaches; and it allows a higher degree of integration between historical linguistics and other related fields, and a higher level of understanding between scholars, thus making the field move forward more effectively. Our framework encourages corpus-driven approaches and the systematic adoption of multivariate statistical methods as the most appropriate ways to deal with the multifaceted nature of historical languages. Moreover, we argue for a clearer boundary between data-driven exploratory studies (whose results can be used to formulate hypotheses) and studies that attempt to answer questions by testing specific hypotheses. Ultimately, quantitative historical linguistics makes progress by confirming or rejecting these hypotheses in a reproducible way, and by defining models of historical linguistic phenomena.
In line with the view expressed by Meyer and Schroeder (2015) and McGillivray (2013), we believe that quantitative corpus-driven methods are not merely a technical issue; on the contrary, they have the potential to profoundly change the research practices and the research questions of historical linguistics.

Quantitative Historical Linguistics. First edition. Gard B. Jenset and Barbara McGillivray. © Gard B. Jenset and Barbara McGillivray 2017. First published 2017 by Oxford University Press.

7.2 Core steps of the research process

In this chapter we will operationalize our framework in concrete terms, thus making it easier for other scholars to judge it. In the rest of the chapter we will illustrate the proposed methodology through a historical study on English corpus data. Before turning to the case study, though, we would like to summarize the principles (section 2.2) and best practices (section 2.3) of our framework. We will do that by representing the research process as a circle, loosely inspired by McGillivray (2013, 127), and involving the steps outlined below. Note that these steps should not be taken in a strictly prescriptive way, as they are meant as methodological guidelines which will need to be adapted on a case-by-case basis.

1. Study preparation
   (a) select a phenomenon that can be operationalized so that it falls within the scope of quantitative historical linguistics;
   (b) formulate operational definitions for the phenomenon (e.g. word-order change) and the variables under consideration (e.g. semantic features, morphological features, authors, time periods);
2.
Data collection
   (a) collect the data set(s) by drawing on relevant annotated corpus data, if available to the historical linguistics community; alternatively, build a new annotated corpus available to the community and draw the data set(s) from it;
   (b) combine corpus data with external resources, including non-linguistic ones, if relevant;
   (c) if the corpus annotation does not contain all variables relevant to the analysis, annotate the data set with those variables;
   (d) if possible, re-encode the variables back into the corpus or link them to the corpus for replicability purposes;
   (e) document the data and the process, and make the data set available to the community;
3. Quantitative modelling
   (a) establish the explanandum (in the terminology of Goldthorpe, 2001), i.e. the statistical pattern that needs explanation;
   (b) explore the corpus data by making use of replicable visualization techniques and descriptive statistics;
   (c) optionally, formulate a hypothesis from the data exploration or from existing claims, and conduct replicable hypothesis testing through quantitative analyses on the data set using suitable quantitative techniques (preferring multivariate methods over univariate ones);
   (d) optionally, identify response and predictors, and define one or more suitable statistical models for the response based on the predictors; assess the models with diagnostic tools and compare them;
   (e) report all relevant details of the results of the analyses;
4. Interpretation and publication of the results
   (a) formulate a probabilistic explanation of the phenomenon based on explanatory factors as found in the analysis;
   (b) publish the results and make the data and code available to the community.

We would like to point out that formulating an explanation in the last step does not formally coincide with identifying causal relationships, as we stressed in section 6.3.
A full discussion of causality and explanations in linguistics is outside the scope of this book; instead we refer the reader to Goldthorpe (2001) and a linguistic view of this position discussed in Jenset (2010, 47–71), as well as the chapters in Penke and Rosenbach (2007a) and in Campbell (2013, 322–45).

7.3 Case study: verb morphology in early modern English

In this section we present a longer case study to illustrate our framework for quantitative historical linguistics. In section 6.3 we illustrated how empirical methods can be used to evaluate competing claims about the evolution of existential there in historical English. That case study was a scenario where we were able to find a satisfactory model that could corroborate some claims and make other claims less likely. In the present case study we tackle a more complex case where a satisfactory model is more difficult to identify, but where quantitative methods can nevertheless inform us about the details of the diachronic development. We also deal directly with the effects of frequency in mechanisms of change. The data and the code are available on the GitHub repository https://github.com/gjenset.

The topic of the case study is the diachronic change that took place in English inflectional verb morphology during the early modern period, roughly the time period from 1500 to 1700 CE. In this period the third person singular form -(e)s, originally a Northern form from Middle English, spread to the rest of England, where -(e)th was the dominant form (Nevalainen, 2006, 184). Nevalainen notes that the -(e)th form was used in early Bible translations, but that Shakespeare on the whole preferred -(e)s. However, that did not prevent Shakespeare from using both forms, sometimes next to each other, as in this example from The Merry Wives of Windsor:

(1) Ford: Has Page any brains? hath he any eyes? hath he any thinking? Sure, they sleep; he hath no use of them.
(3.2.1338–9)

Here we see both -(e)s and -(e)th, has and hath, used in the same passage with the same verb. Thus there is clearly considerable variation that might potentially inform us about the process that led to the general adoption of -(e)s in the third person singular. Moreover, we have corpus data that we can use to investigate this phenomenon. In other words, we are in a position to operationalize the phenomenon in such a way that it falls within the scope of quantitative historical linguistics, as indicated by step 1a in the list (section 7.2).

Nevalainen (2006) outlines the following overview of the diachronic process, based on data from the Helsinki Corpus. She recognizes these stages, following the periodization of the Helsinki Corpus:

1. 1500–70: -(e)th dominates at the national level, while -(e)s is a regional Northern variant.
2. 1570–1640: the use of -(e)s becomes dominant in informal writing such as letters, and becomes a substantial minority variant in official documents in England. The exceptions to this are the verbs do and have, which tend to retain -(e)th.
3. 1570–1640: simultaneously, there was an increase of -(e)th in Older Scots, the Germanic language of Scotland; this increase appears to be genre-specific (Nevalainen, 2006, 191).

By the end of the seventeenth century, -(e)s was the dominant form in English, except for the most conservative genres. Nevalainen (2006) lists a number of possible reasons for this development:

1. Immigration to the London area from the north brought the -(e)s form to the south.
2. The -(e)th form gradually became associated with more formal registers.
3. Female writers picked up the -(e)s form, which is perhaps connected to the role of women in linguistic innovation (Nevalainen, 2006, 188).
4.
Some phonological contexts favoured -(e)s, especially verbs ending in stops such as /t/ and /d/, since lasts was easier to pronounce than lasteth, for instance; at the same time, the extra syllable added by -eth could be exploited metrically in poetry.
5. -(e)s spread by lexical diffusion, via word-specific restrictions.

These explanations still leave many questions unanswered. In section 1.1 we noted that exploration of the history of individual languages and the establishment of general processes of linguistic change are two of the aims of historical linguistics. Although all the explanatory variables discussed by Nevalainen (2006) are plausible, we cannot immediately establish the relative importance among them, or how they might interact. This makes it difficult to go from the description of the specific case (the history of third person singular inflections in English) to a more generalized description of change. One such generalization is the observation that infrequent words tend to lead the way in analogical change (Lieberman et al., 2007). Another is the observation that frequent words tend to be replaced at a slower rate than less frequent ones (Pagel et al., 2007). Hay et al. (2015), in a corpus study of word-frequency effects in diachronic sound change in New Zealand English, find an interaction between the speaker's year of birth and lexical frequencies. Their study shows that low-frequency words lead the change, in a manner that would be expected under analogical change but not under regular phonological change. Hay et al. (2015) explain this by pointing out that very frequent words are well represented in memory, which helps explain their resistance to change. Conversely, an infrequent word is less likely to be stored in memory, and this holds even more if the word is difficult to understand.
Such impaired perception may affect words that are close to an advancing change, which affects memory storage. Together, this leads to a greater susceptibility to change for the low-frequency words. From this brief literature review we can already identify the operational definition of the phenomenon we are going to study (the alternation between -(e)s and -(e)th forms in early modern English) and the variables under consideration (step 1b in section 7.2).

These generalizations may have some application to the case of -(e)s and -(e)th in early modern English. If the shift from -(e)th to -(e)s is a case of analogical change driven by perception forces, we would expect lexical frequency to play a role. Specifically, we would expect a higher word frequency to correlate with a lower probability of -(e)s, and conversely a lower word frequency to correlate with a higher probability of -(e)s. We would also expect this frequency effect to increase as the change nears its completion (Hay et al., 2015). The inclusion of frequency sets our study apart from Gries and Hilpert (2010), who use a different corpus for a comparable time period, but similar statistical techniques.

Based on these considerations, we can approach the early modern corpus data with the following claims, which correspond to the hypotheses that we want to test (step 3c in section 7.2):

• If genre plays a role, we expect a statistically significant difference between genres in the use of -(e)s and -(e)th.
• If gender plays a role, we expect a statistically significant difference between male and female writers in the use of -(e)s and -(e)th.
• If phonological context is an important cause of change, we expect final stops to favour -(e)s, and final vowels and fricatives to favour -(e)th.
• If lexical diffusion is a leading cause of change, we expect individual differences between verbs, especially for do and have.
• If lexical frequency is an important variable, we expect a statistically significant effect towards the end of the period, as the change was approaching completion and the perceptual pressure on the remaining -(e)th variants increased.

7.3.1 Data

For the data collection phase (step 2 in section 7.2) we relied on an annotated corpus. We extracted the data for this case study from the 1.7-million-word PPCEME treebank (Kroch and Delfs, 2004), using a Python script. Since the corpus does not have annotation for third person present tense singular verb morphology, we identified all present tense verbs with the tags 'DOP' (do), 'HVP' (have), and 'VBP' (other verbs), and processed the results further to identify the third person cases, excluding forms of be, for which the alternation between -(e)s and -(e)th is not applicable. We lemmatized the set of present tense verbs and used the lemmas to calculate lemma frequency in the present tense. This frequency was counted separately for the three sub-periods of the corpus (E1: 1500–69, E2: 1570–1639, E3: 1640–1710), to avoid having future increases in verb frequency influence past observations. The lemmatization lexicon we built is an example of an additional resource that integrated the corpus annotation with extra variables (verb lemma, in this case) needed for the study, as suggested by step 2d in section 7.2. Following the recommendations for representing and analysing multidimensional data outlined in section 6.2, we collected the data into a data frame format for analysis with R. Tables 7.1 and 7.2 exemplify an excerpt of the data.
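The extraction and per-period frequency counting just described can be sketched as follows (a hedged illustration: the tags 'DOP', 'HVP', and 'VBP' are those cited from the PPCEME, but the toy token list, the tiny lemmatization lexicon, and the 'BEP' tag used here to stand in for present-tense be are our own assumptions, not the authors' actual script):

```python
from collections import Counter

# Toy stand-ins for corpus tokens: (verb form, corpus tag, sub-period).
# In the real study these would come from parsing the PPCEME treebank files.
tokens = [
    ("hath", "HVP", "e1"), ("doth", "DOP", "e1"), ("sayes", "VBP", "e2"),
    ("has", "HVP", "e3"), ("is", "BEP", "e2"), ("plays", "VBP", "e3"),
]

# Tiny hypothetical lemmatization lexicon: variant spellings -> modern lemma.
lexicon = {"hath": "have", "has": "have", "doth": "do",
           "sayes": "say", "plays": "play"}

VERB_TAGS = {"DOP", "HVP", "VBP"}  # do, have, and other present-tense verbs

# Keep only the relevant present-tense verbs; forms of 'be' (tagged
# differently) fall out of the selection.
verbs = [(form, tag, period) for form, tag, period in tokens
         if tag in VERB_TAGS]

# Count lemma frequency separately for each corpus sub-period, so that a
# verb's later rise in frequency cannot influence earlier observations.
sub_period_count = Counter((period, lexicon[form])
                           for form, _, period in verbs)
```

Counting per sub-period rather than over the whole corpus is what the text motivates: it keeps E1 observations from being credited with frequency a lemma only gained in E3.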
Table 7.1. Part of the metadata extracted from the PPCEME documentation

filename       period  id                       genre     year  female
alhatton-e3-h  e3      ALHATTON-E3-H,2,241.7    LET PRIV  1699  T
alhatton-e3-h  e3      ALHATTON-E3-H,2,242.27   LET PRIV  1699  T
alhatton-e3-h  e3      ALHATTON-E3-H,2,245.42   LET PRIV  1699  T
alhatton-e3-h  e3      ALHATTON-E3-H,2,245.44   LET PRIV  1699  T
anhatton-e3-h  e3      ANHATTON-E3-H,2,211.6    LET PRIV  1690  T

Table 7.2. Part of the data extracted from PPCEME

verbForm  verbTag  lemma   suffix3sg  subPeriodCount  context
has       HVP      have    s          2,532           vowel
sayes     VBP      say     s          393             vowel
designes  VBP      design  s          13              stop
sayes     VBP      say     s          393             vowel
plays     VBP      play    s          6               vowel

The full data set comprises 10,430 observations of 1,654 verb forms for 737 lemmas, and the list of variables (step 2c in section 7.2) is:

• filename: a factor variable with the PPCEME file identifiers as levels;
• period: a factor variable with the PPCEME sub-corpus period identifiers e1, e2, and e3 as levels;
• id: a factor variable with the identifier for the individual syntactic tree;
• year: a numerical variable with the year of the text as given in the PPCEME documentation;
• female: a logical variable with the levels TRUE (female) and FALSE (male), corresponding to the gender of the author as given in the PPCEME documentation;
• author: a factor variable with the name of the author (if known) from the PPCEME documentation;
• verbForm: a factor variable with the verb forms observed in the corpus as levels;
• verbTag: a factor variable with the corpus tags of the verbs as levels ('DOP', 'HVP', and 'VBP');
• lemma: a factor variable with the lemmas of the verb forms as levels (we manually derived these from the verb forms, and lemmatized them to the modern form, due to the large early modern English variation in spelling);
• suffix3sg: a factor variable with two levels indicating whether the verb ends in -(e)s (s) or -(e)th (th);
• subPeriodCount: a numerical variable with
counts of the lemma frequency in the corpus sub-period (E1, E2, or E3);
• context: a factor variable indicating the phonological context of the third person suffix, based on the modern lemma, with the levels 'fricative_other', 'liquid', 'sibilant', 'stop', and 'vowel'.¹

¹ Basing the phonological context on a contemporary rendering of the lemma is of course problematic. During the early modern period, English underwent a number of phonological changes, the most noteworthy of which is perhaps the changes to the Middle English long vowels known as the Great Vowel Shift. However, a detailed reconstruction of the phonological context at the time of attestation in the corpus is beyond the scope of this case study. Also, the Great Vowel Shift itself is still a matter of discussion, as McMahon () demonstrates. Hence, we have opted for the pragmatic solution of normalizing the phonological context along with the spelling. This is partly justified by the fact that we refer to vowels in general, including diphthongs, without a further distinction between long and short vowels.

7.3.2 Exploration

As part of the data exploration phase (step 3b in section 7.2), we consider the distribution of -(e)s and -(e)th, plotted as changing probabilities over time in Figure 7.1. The distribution corresponds to what we would expect based on Nevalainen (2006), with a very low overall initial probability of -(e)s, and with a gradual decline in -(e)th throughout the seventeenth century. Comparing this with the similar plot in section 6.3, we see that there is less of a pronounced S-shaped curve in Figure 7.1, and the increase seems more gradual. Turning to the lexical frequencies plotted in Figure 7.2, two things are immediately clear. As we would expect based on Figure 7.1, there is a greater concentration of observations for -(e)th in early parts of the corpus, and a greater concentration of -(e)s in the later stages. Next, we notice that the trend, represented by the non-parametric
Figure 7.1. Plot showing the shifting probabilities over time between -(e)s and -(e)th in the context of third person singular present tense verbs. [Y-axis: probability of each suffix; x-axis: year, 1550–1700.]

regression line in the plot,2 is relatively stable over time for -(e)s. However, for -(e)th we see that over time there is an increasing tendency towards higher lemma frequencies. This observation is compatible with more than one interpretation, but at the very least it suggests that lexical frequencies may be involved somehow. To explore some of the variables further, we performed a multiple correspondence analysis (see section 6.2), which reduces the variation among the variables suffix, corpus sub-period, gender, and phonological context to a compact, two-dimensional sub-space that can be easily visualized. We can see the plot in Figure 7.3. Only the first (horizontal) dimension has a high enough explanatory power, judging from the percentage of explained inertia. This implies that we can read the plot from left to right, with similar categories close to each other. As we would now expect, we see that

2 A non-parametric regression line uses local adjustments to fit the line to the data, which typically results in a regression line that is not straight. This makes a non-parametric, or smoothed, regression line difficult to analyse in the same manner as a traditional regression line. However, it is a useful tool for describing the behaviour of the data. The lines in the plots were created with the lowess() function in R.

Figure 7.2.
Plots of the trends of lemma frequency over time for verb forms occurring with -(e)s (left panel) and -(e)th (right panel). The black lines are non-parametric regression lines outlining the trend over time. [Y-axes: verb frequency (base 10 log scale); x-axes: year.]

the earliest period (E1) is associated with -(e)th, and the latest period (E3) with -(e)s. Period E2 takes up an intermediate position. We can also see that female writers are associated with later periods and with -(e)s. However, this might be due both to a higher use of -(e)s by female writers and to the relative lack of female writers in the earliest period, as illustrated by the numbers in Table 7.3. In other words, at this point we cannot decide if the use of -(e)s is directly associated with female writers, or if both are associated with the later time periods. Finally, we can see from Figure 7.3 that the phonological context accords with our predictions, with vowels displaying some association with -(e)th, while stops show some association with -(e)s. The remaining contexts do not display any particular tendency towards one or the other, judging from the plot. From this preliminary exploration, we turn to a more formalized hypothesis testing phase using statistical modelling.

Figure 7.3. MCA plot of suffix, corpus sub-period, gender, and phonological context. Only the first (horizontal) dimension is interpretable. [Dimension 1 accounts for 71.5% of the inertia, dimension 2 for 1.8%; categories plotted: period:e1, period:e2, period:e3, female:TRUE, female:FALSE, suffix3sg:s, suffix3sg:th, and the context levels.]

Table 7.3. Frequencies of verb tokens in the sample as taken from texts produced by female and male writers, broken down by corpus sub-period

Author  E1         E2         E3
Male    3205 (31)  3591 (34)  2794 (27)
Female  97 (<1)    485 (5)    258 (3)
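The raw counts in Table 7.3 also let us verify the bracketed percentages, which correspond (up to rounding) to shares of the full data set of 10,430 tokens. A quick sketch, in Python rather than the R used for the analyses in this chapter, with the dictionary layout being our own illustration:

```python
# Recompute the percentages in Table 7.3 from the raw token counts.
counts = {
    ("male", "E1"): 3205, ("female", "E1"): 97,
    ("male", "E2"): 3591, ("female", "E2"): 485,
    ("male", "E3"): 2794, ("female", "E3"): 258,
}

total = sum(counts.values())          # size of the full data set
assert total == 10430                 # matches the figure reported above

# Percentage of all tokens in each gender/period cell, as in Table 7.3.
percentages = {cell: round(100 * n / total) for cell, n in counts.items()}
print(percentages[("male", "E1")])    # 31, i.e. the '3205 (31)' cell
```

The check on the grand total is a useful sanity test that no observations were lost between the extraction step and the table.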
The models

As with the case study in section 6.3, we have opted for a modelling approach based on binary logistic regression, specifically a mixed-effects model. The advantage of this technique for our case study is that we can model the direct effect that each variable has on the choice of -(e)s or -(e)th in the form of log odds ratios (which we convert to probabilities for ease of interpretation). This is in line with step 3d in section 7.2.

During the model fitting phase we tested a large number of models with different forms, as per step 3d in section 7.2. However, we were not able to find a single model that could successfully capture the variation between -(e)s and -(e)th for the entire corpus. To illustrate, we discuss one of these unsuccessful models below.

A bad model

The model in (2) is representative of our attempts to fit a single model to the data. The response is the probability of switching from -(e)th to -(e)s, and the fixed effects correspond to the claims we wish to test. In this model we used genre as a random effect, since we can assume some genre effects in this case. We also tested some models where the verb lemma was incorporated as a random effect; however, this did not improve the model fit in any substantial way.

(2) Response: probability of switching from -(e)th to -(e)s, modelled as depending on the following
    fixed effects: lexical frequency (log base 10 scale) + period + gender + verb tag + phonological context
    random effect: genre

In section 6.3 we noted that it is not sufficient to rely exclusively on numerical measures of model fit such as Harrell’s C or Nagelkerke’s R². The reason is that such measures may be quite high even when the model is not a good fit to the data.
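Both measures can be computed directly from the observed outcomes and the fitted probabilities. The sketch below is in Python rather than the R used for the analyses, and `obs` and `probs` are invented toy data; it shows the calculations for the binary case:

```python
import math

def harrells_c(obs, probs):
    """Concordance index: among all (1, 0) outcome pairs, the proportion
    in which the model assigns the higher probability to the actual 1
    (ties count as one half)."""
    ones = [p for o, p in zip(obs, probs) if o == 1]
    zeros = [p for o, p in zip(obs, probs) if o == 0]
    conc = sum(1.0 if p1 > p0 else 0.5 if p1 == p0 else 0.0
               for p1 in ones for p0 in zeros)
    return conc / (len(ones) * len(zeros))

def nagelkerke_r2(obs, probs):
    """Cox-Snell R^2 rescaled to a 0-1 range (Nagelkerke's correction),
    relative to an intercept-only null model."""
    n = len(obs)
    p_null = sum(obs) / n
    ll_null = sum(o * math.log(p_null) + (1 - o) * math.log(1 - p_null)
                  for o in obs)
    ll_model = sum(o * math.log(p) + (1 - o) * math.log(1 - p)
                   for o, p in zip(obs, probs))
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    return cox_snell / (1 - math.exp(2 * ll_null / n))

# Toy data: 1 = -(e)s, 0 = -(e)th, with fitted probabilities of -(e)s.
obs = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.7, 0.2, 0.3, 0.4, 0.6, 0.1]
print(harrells_c(obs, probs))   # 1.0: these toy probabilities rank perfectly
```

As the toy example shows, a perfect C can coexist with probabilities that are far from 0 or 1, which is one reason why such summary measures need to be checked against residual plots.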
A much better test of model fit is the extent to which the model residuals (the differences between the model’s predicted values and the observed values) are well behaved. Since the usefulness of the model depends on certain assumptions regarding these residuals, checking them is a crucial step. For the model in (2), these indices work out as follows: Harrell’s C is 0.96 and Nagelkerke’s R² is 1 compared to a mixed-effects model that only predicts the most frequent outcome. This looks promising, but interpreting these measures only makes sense if we have a proper fit to the data. Unfortunately, the binned residuals plot in Figure 7.4 reveals a disastrously poor fit. In a well-behaved binned residuals plot we would expect most points to fall within the confidence intervals indicated by the grey lines. In this case, we notice that most points in fact fall outside these lines. There are also clear signs that the model is mis-specified, because of the V-shaped pattern among the points. This tells us that the model is not an equally good fit across the whole data set, which violates a very important assumption in regression modelling. Ideally, the black dots should be symmetrically distributed around the horizontal dotted line, without any clear signs of patterning. Such complete lack of patterning is only an ideal situation, but Figure 7.4 is clearly too far from this ideal.

When one model is not enough

Our solution in this case was to fit three models, one per sub-corpus period (E1, E2, and E3). For the earliest period, we were simply

Figure 7.4. Binned residuals plot for the mixed-effects logistic regression model described in (2).
The model is a very poor fit to the data, as expressed by the fact that most of the predicted points are outside the grey confidence interval lines, and there are clear up-and-down patterns in the points.

not able to find a model that was a good fit to the data. The model essentially always predicted -(e)th, due to the extremely low number of occurrences of -(e)s in this period. We cannot exclude that a satisfactory model could be achieved using other variables, but our variables were not sufficient to distinguish the two variants in the earliest period. For the two later periods we were able to find satisfactory models; however, these models differ, as we will now see. For the E2 period we arrived at the following model:

(3) Response: probability of switching from -(e)th to -(e)s, modelled as depending on the following
    fixed effects: lexical frequency (log base 10 scale) + gender + verb tag + phonological context
    random effect: genre

Figure 7.5. Binned residuals plot for the mixed-effects logistic regression model described in (3). The model is an acceptable fit to the data, with points being fairly symmetrically distributed around the middle. There is a skew towards predicting 0, i.e. -(e)th.

The binned residuals plot in Figure 7.5 is an improvement over the one in Figure 7.4. The points are more symmetrically distributed around the middle, and more points are inside the grey lines. We can see that a large number of points are clustered together around the 0 point (left side). This means that a large number of cases are predicted as 0, i.e. -(e)th. However, we note that the model also predicts -(e)s, and a Nagelkerke’s R² of 1 shows a real improvement over simply predicting the most frequent outcome. Similarly, a value of Harrell’s C of 0.96 is excellent.
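The binned residuals plots used in this section follow a simple recipe, in the spirit of Gelman and Hill (2007): sort the observations by fitted probability, split them into bins, and compare each bin's average residual with a band of plus or minus two standard errors. A sketch in Python (the analyses themselves were run in R, and the data below are toy values):

```python
import math

def binned_residuals(obs, probs, n_bins=4):
    """Sort cases by fitted probability, split them into bins, and return
    (mean fitted probability, mean residual, +/- 2 SE band half-width)
    per bin, i.e. the quantities shown in a binned residuals plot."""
    pairs = sorted(zip(probs, obs))
    size = math.ceil(len(pairs) / n_bins)
    out = []
    for i in range(0, len(pairs), size):
        chunk = pairs[i:i + size]
        resid = [o - p for p, o in chunk]
        mean_p = sum(p for p, _ in chunk) / len(chunk)
        mean_r = sum(resid) / len(resid)
        sd = (sum((r - mean_r) ** 2 for r in resid) / len(resid)) ** 0.5
        out.append((mean_p, mean_r, 2 * sd / math.sqrt(len(chunk))))
    return out

# Toy example: for a well-fitting model, the mean residuals stay small.
obs = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]
probs = [0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95]
for mean_p, mean_r, band in binned_residuals(obs, probs):
    print(round(mean_p, 2), round(mean_r, 2), round(band, 2))
```

A point falls outside the grey lines of the plot when the bin's mean residual exceeds its band half-width, which is exactly the diagnostic applied to Figures 7.4 to 7.6.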
For the E3 period we used the model set out in (4):

(4) Response: probability of switching from -(e)th to -(e)s, modelled as depending on the following
    fixed effects: lexical frequency (log base 10 scale) + gender + verb tag
    random effect: genre

For this model we obtained the best fit by removing the variable for the phonological context. In the binned residuals plot shown in Figure 7.6 we see that quite a few points fall outside the grey lines, but there is not too much structure among the points. As with the binned plot in Figure 7.5, we see that there is a skew, but here the tendency is for the points to be clustered near 1, i.e. around -(e)s. However, such a skew is not necessarily a large problem. For the purposes of the current study, we accept the model on the basis of this plot. The numerical measures for evaluating the model are still good, with a Nagelkerke’s R² of 1 and a Harrell’s C of 0.93.

Figure 7.6. Binned residuals plot for the mixed-effects logistic regression model described in (4). The model is a marginal fit to the data, with points being fairly symmetrically distributed around the middle. There is a skew towards predicting 1, i.e. -(e)s.

Results

We next give the relevant details from this statistical testing (step 3e in section 7.2), and turn our attention to the summary outputs of the models in (3) and (4), displayed in Tables 7.4 and 7.5, respectively. For the model covering period E2, Table 7.4 shows that the only two non-significant variables are the lexical frequency count (transformed to a logarithmic scale for improved fit) and the category designating the lemma have; this means that have is indistinguishable from do with respect to the model’s response.
The intercept is not of much interest here, since it represents the average outcome effect when the frequency count is zero. Turning to the remaining coefficients, it is worth noting that they are all positive, i.e. they all point towards a higher probability of -(e)s.

Table 7.4. Summary of fixed effects from the mixed-effects logistic regression model for E2 described in (3)

                       Coef β  SE(β)  z     p
(Intercept)            -8.74   1.07   -8.2  <.0001
log10(subPeriodCount)  -0.10   0.10   -1.0  >0.3
femaleTRUE              1.94   0.32    6.1  <.0001
contextliquid           4.38   0.75    5.8  <.0001
contextsibilant         2.68   0.78    3.4  <.001
contextstop             4.68   0.74    6.3  <.0001
contextvowel            3.85   0.75    5.1  <.0001
verbTagHVP             -0.08   0.33   -0.2  >0.8
verbTagVBP              3.28   0.34    9.6  <.0001

Table 7.5. Summary of predictors from the mixed-effects logistic regression model for E3 described in (4)

                       Coef β  SE(β)  z     p
(Intercept)             2.77   0.70    3.9  <.0001
log10(subPeriodCount)  -0.76   0.14   -5.5  <.0001
femaleTRUE              2.64   0.46    5.7  <.0001
verbTagHVP             -0.15   0.21   -0.7  >0.5
verbTagVBP              2.05   0.26    7.7  <.0001

We can summarize the results as follows, using the divide-by-four rule to transform the log odds ratios to probabilities (Gelman and Hill, 2007, 82):

• Female writers are associated with a 50 per cent increase in the probability of -(e)s compared to men.
• Phonological context=liquid is associated with a 100 per cent increase in the probability of -(e)s compared to the reference category of non-sibilant fricatives.
• Phonological context=sibilant is associated with a 67 per cent increase in the probability of -(e)s compared to the reference category of non-sibilant fricatives.
• Phonological context=stop is associated with a 120 per cent increase in the probability of -(e)s compared to the reference category of non-sibilant fricatives.
• Phonological context=vowel is associated with a 96 per cent increase in the probability of -(e)s compared to the reference category of non-sibilant fricatives.
• Verbs other than do and have are associated with an 82 per cent increase in the probability of -(e)s.

In other words, both the phonological context and gender are highly associated with the use of -(e)s. For the phonological context we see that there is in fact a preference for -(e)s in most contexts, but the degree of preference varies. Next we turn to the summary of the model in (4), displayed in Table 7.5. In this model we had to remove the phonological context variable to achieve an acceptable fit, leaving us with the lexical frequency variable, gender, and the verb categories derived from the corpus annotation. As with the previous model, we will not attempt to interpret the intercept, since it is not particularly meaningful here. Also as in the previous model, we note that the distinction between the verb tag reference category do and the verb have is not statistically significant. However, the lexical frequency variable is significant in this model. Below, we briefly summarize the coefficients:

• A tenfold increase in lemma frequency (one unit on the log10 scale) decreases the probability of -(e)s by roughly 20 per cent.
• Female writers are associated with a 66 per cent increase in the probability of -(e)s compared to men.
• Verbs other than do and have are associated with a 51 per cent increase in the probability of -(e)s.

Finally, we look at the random effect, genre, in the two models. Here we see an interesting difference between the models in (3) and (4). For the E2 model in (3), the standard deviation of the random effect is 2.8. Mixed-effects models assume that the random effects are normally distributed, and we can make use of this to calculate something resembling a confidence interval for the random effect, i.e. a range within which we would expect 95 per cent of all values to fall.
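The arithmetic behind these summaries is straightforward: the percentages come from the divide-by-four rule applied to the coefficients in Table 7.4 (the running text rounds some of them, e.g. 48.5 to '50 per cent', to the nearest ten), and the genre range is the usual normal 95 per cent interval mapped through the inverse logit. A sketch in Python (the coefficients are from Table 7.4; `mu` is a hypothetical placeholder for the relevant intercept):

```python
import math

# Divide-by-four rule: a logistic regression coefficient divided by four
# approximates the maximum change in probability per unit change of the
# predictor (Gelman and Hill 2007). Coefficients from the E2 model, Table 7.4.
coefs = {
    "femaleTRUE": 1.94, "contextliquid": 4.38, "contextsibilant": 2.68,
    "contextstop": 4.68, "contextvowel": 3.85, "verbTagVBP": 3.28,
}
effects = {k: round(100 * b / 4) for k, b in coefs.items()}
print(effects["contextvowel"])   # 96, i.e. the '96 per cent' figure above

# Ordering the context coefficients reproduces the continuum of preference
# for -(e)s discussed later: stop > liquid > vowel > sibilant.
contexts = sorted((k for k in coefs if k.startswith("context")),
                  key=lambda k: -coefs[k])

# A 95 per cent range for a normally distributed random effect with standard
# deviation sd, mapped onto the probability scale with the inverse logit.
def invlogit(x):
    return 1 / (1 + math.exp(-x))

mu, sd = 0.0, 2.1   # sd as reported for the E3 model; mu is illustrative
interval = (invlogit(mu - 1.96 * sd), invlogit(mu + 1.96 * sd))
```

The width of `interval` on the probability scale is what distinguishes the two models' random effects in the discussion that follows.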
For the E2 model, this interval barely reaches above zero, which means that all genres tend towards -(e)th. In other words, the variation in the third person singular suffix cannot really be attributed to genre differences for this period. Conversely, for the E3 model the standard deviation of the random effect is 2.1, but the confidence interval in this case spans from 0.19 to 0.99 on a probability scale. In other words, we find that the variation in tendency among genres spans virtually the whole range of probabilities from -(e)th to -(e)s. In short, to the extent that our models are reliable, it is in the E3 period that genre differences regarding -(e)th and -(e)s can be identified.

Discussion

We are now in a position to evaluate the claims presented initially, reproduced here in enumerated form for convenience:

(i) If genre plays a role, we expect a statistically significant difference between genres in the use of -(e)s and -(e)th.
(ii) If gender plays a role, we expect a statistically significant difference between male and female writers in the use of -(e)s and -(e)th.
(iii) If phonological context is an important cause of change, we expect final stops to favour -(e)s, and final vowels and fricatives to favour -(e)th.
(iv) If lexical diffusion is a leading cause of change, we expect individual differences between verbs, especially for do and have.
(v) If lexical frequency is an important variable, we expect a statistically significant effect towards the end of the period, as the change was approaching completion and the perceptual pressure on the remaining -(e)th variants increased.

Following step 4a in section 7.2, our results appear to refute claim (i) regarding the importance of genre. Recall that the model for E2 found virtually no variation between genres. Instead, the variation in the use of -(e)s and -(e)th in the E2 period could better be described by the predictors.
It was only in the next period, E3, that our model could identify large, systematic differences between genres in the use of -(e)s and -(e)th. Since the last period was also the period when -(e)s was increasingly becoming the norm, while -(e)th was relegated to highly formal writing, it appears that the genre differences observed in E3 are a result of the change taking place, rather than an active cause of it. Our initial explorations left unanswered the question of whether women used -(e)s more than men, or whether we simply find more women writers in the period when -(e)s was becoming the norm. However, the two models for E2 and E3 both agreed that female writers employed -(e)s to a larger degree than men. We can thus conclude that claim (ii) has been strengthened. Claim (iii) deals with the effect of phonological context, and here our models are clear: the phonological context only plays a role in the E2 period. Furthermore, rather than seeing a clear, unequivocal preference for -(e)s in some contexts, we found a continuum of degrees of preference for -(e)s. We can schematically represent this as follows:

(5) -(e)s > stop > liquid > vowel > sibilant > other fricatives > -(e)th

Nevertheless, this does agree with the claim in (iii) that stops should show a preference for -(e)s while vowels and sibilants show more of a preference for -(e)th. However, this only holds for the E2 period. The model for the E3 period omitted the context variable in order to obtain an acceptable fit to the data. On this basis, it appears that the phonological context acted as an important factor in the early stages of the change, whereas other variables were involved in concluding the process. Regarding claim (iv), we found a reliable difference between do and have on the one hand, and other verbs on the other. Unfortunately, no models using the verb lemma proved an acceptable fit to the data.
This leaves some uncertainty, since do and have are also high-frequency verbs. However, we can state with confidence that verbs other than do and have present a clear preference for -(e)s as a whole in both E2 and E3. If most of the variation within the VBP category were tied to individual verbs, we would expect sufficient variation within this category for no significant difference to manifest itself compared to do and have. Instead, since the VBP category as a whole shows this difference, we suspect a frequency effect is involved. Hence, we tentatively note a decreased confidence in claim (iv). Finally, claim (v) involved lexical frequencies and predicted that this variable would grow in importance towards the endpoint of the change, when the number of low-frequency verbs preferring -(e)th was decreasing. First, we can note indirect support for this position based on our discussion of claim (iv). Second, we find support for it in the fact that in the E2 model the lexical frequency variable was not significant, whereas in the E3 model, i.e. towards the endpoint of the transition, it was. For the E3 period, increasing the lemma frequency implied a lower probability of -(e)s, which is what we would expect if the change was driven by low-frequency analogical change, shaped by perception factors similar to the ones outlined by Hay et al. (2015). Thus, just as with the case study on existential there in Chapter 6, we have shown that empirical corpus methods are well suited to evaluating the merits of competing claims regarding historical linguistic phenomena. In addition, however, we have demonstrated that careful use of multivariate statistical models can be a rich source of information for reasoning about the details of a process of diachronic change. Such models are no better than the data they make use of.
By employing syntactically annotated data, essentially a form of model parallelization (section 1.1), we could extract a data set that was both large and rich in detail. Nevertheless, it is conceivable that the models might improve with more data. For instance, Nevalainen (2006, 193) mentions vowel contraction, and adding this feature might result in even more informative models. Similarly, Nevalainen (2006, 193) notes that -(e)th was retained in some Southern dialects. The lack of dialect information in the metadata for the PPCEME corpus prevented us from including this variable in the case study, but it is clearly a promising one for further exploration regarding the change from -(e)th to -(e)s. Nevertheless, we showed that our models were informative enough to evaluate the claims listed above. Our results on the role of author gender are aligned with Gries and Hilpert (2010), but differ in other respects. Although these claims to some extent deal with the description of the historical evolution of a single language (English), we also demonstrated how the empirical approach advocated here could be employed to understand more general, abstract processes of historical linguistic change. Finally, we described the research process advocated by our proposed methodological framework, which involves a cycle composed of clearly defined phases, transparent processes, and publicly available data. Moreover, the research relies on existing resources, such as annotated corpora, and creates new ones, such as the lemmatization lexicon which we built to compensate for the lack of lemma annotation in the PPCEME corpus. We strongly believe that this will enable further studies to confirm or refute our findings, thus advancing the field.

Concluding remarks

In this book we have outlined a comprehensive approach for doing quantitative historical linguistics.
We have also described how we think this can best be achieved, and why it matters. In Chapter 1 we outlined the main reasons why historical linguistics ought to make more use of corpora. Historical linguistics is by necessity data-centric; however, the high uncertainty that comes with time depth ensures that there is room for a large variety of claims about the past. This necessitates a high degree of precision in communication when these claims are presented and evaluated. We have shown that when claims about historical linguistics are expressed in terms of frequencies and probabilities, the historical linguist is forced to come up with more precise formulations of those claims. This can only be a good thing, since a precise claim is easier both to defend and to refute. Furthermore, quantitative claims are highly transparent. True, a qualitative claim regarding the existence or non-existence of a construction or grammatical phenomenon in the past is precise in exactly the same manner, but note that nothing is lost when such binary claims are expressed as endpoints of a probability scale. In practice, however, linguistic argumentation is to a large extent about establishing categories and the relations between them (Beavers and Sells, 2014). As many studies have argued (see the discussion in previous chapters and the references there, including Manning, 2003; Bresnan et al., 2007; and Zuidema and de Boer, 2014), such relations are in most cases better expressed in probabilistic terms, to better account for the large degree of variability in language. We have made a point of remaining agnostic about the question of whether or not language is inherently probabilistic, or whether probabilities are simply useful in describing the details of an underlying, extremely detailed categorical system.
We consider this an unresolved empirical question, one which ought not to stand in the way of using corpus methods to achieve what we consider the most important goals, namely increased precision, a standard for resolving claims in linguistics (see also Geeraerts, 2006), and reproducibility. We have outlined a framework for achieving this, using case studies to illustrate what we consider good practice in using quantitative techniques in historical linguistics, including the interpretation, discussion, and presentation of results. However, in addition to these benefits to historical linguistics as a field, we also see wider benefits. Historical linguistics does not, of course, exist in a vacuum, and we consider the increased adoption of empirical corpus linguistics, probabilistic and computational methods, and an increased level of attention towards data sharing and reproducibility a step in the direction of improved professional understanding between historical linguistics and adjacent fields (for a proposal in the same spirit as ours, applied to the case of Latin linguistics, see McGillivray, 2013). In short, statistical techniques and a probabilistic conceptualization of the questions and claims can act as a bridge for cross-disciplinary communication. For instance, experimental psycholinguistic research already relies on statistical modelling as its methodological core. Psycholinguistic models of language processing can inform historical and diachronic research, as seen in Hay et al. (2015) and in the previous section. The uniformitarian principle implies that such psycholinguistic models of understanding are relevant; empirical corpus methods (especially as part of efforts in model parallelization; see Zuidema and de Boer, 2014) make the link between the two concrete and testable.
However, the usefulness of statistical methods as a means of communication extends beyond the various linguistic subfields. There is a rich literature (e.g. McMahon and McMahon, 2005; Campbell, 2013; Pereltsvaig and Lewis, 2015) attesting to the communication problems that have arisen when researchers with backgrounds in fields other than linguistics have introduced new methods, most notably Bayesian phylogenetic trees, to study historical linguistic phenomena from a new perspective. Such communication problems across fields should in our view be resolved, not ignored or condemned, and we consider quantitative corpus methods a contribution to this end. Although our main focus in this book is historical linguistics, we consider such increased possibilities for improved cross-disciplinary communication a positive side effect. We can only hope that the effect size will be a considerable one, both within and beyond historical linguistics.

References

Abeillé, A. (2003). Treebanks: Building and Using Parsed Corpora. Dordrecht: Kluwer.
Adger, D. (2015). Syntax. Wiley Interdisciplinary Reviews: Cognitive Science 6(2), 131–47.
Allen, C. L. (1995). Case Marking and Reanalysis: Grammatical Relations from Old to Early Modern English. Oxford: Oxford University Press.
Andersen, H. (1999). Actualization and the (uni)directionality of change. In H. Andersen (ed.), Actualization: Linguistic Change in Progress. Papers from a workshop held at the 14th International Conference on Historical Linguistics, Vancouver, B.C. (Current Issues in Linguistic Theory), pp. 225–48. New York: John Benjamins.
Andersen, H. and B. Hepburn (2015). Scientific method. In E. N. Zalta (ed.), The Stanford Encyclopedia of Philosophy (winter 2015 edn.).
Archer, D. (2012). Corpus annotation: A welcome addition or an interpretation too far? In J. Tyrkkö, M. Kilpiö, T. Nevalainen, and M.
Rissanen (eds.), Outposts of Historical Corpus Linguistics: From the Helsinki Corpus to a Proliferation of Resources. Studies in Variation, Contacts and Change in English eSeries.
Archer, D. and J. Culpeper (2003). Sociopragmatic annotation: New directions and possibilities in historical corpus linguistics. In G. N. Leech, P. Rayson, A. McEnery, and A. Wilson (eds.), Corpus Linguistics by the Lune: A Festschrift for Geoffrey Leech, pp. 37–58. Frankfurt am Main: Peter Lang.
Archer, D., T. McEnery, P. Rayson, and A. Hardie (2003). Developing an automated semantic analysis system for early modern English. In Corpus Linguistics 2003 Conference, Lancaster University, pp. 22–31.
Atkinson, Q. D. and R. D. Gray (2006). How old is the Indo-European language family? Illumination or more moths to the flame. In P. Forster and C. Renfrew (eds.), Phylogenetic Methods and the Prehistory of Languages, pp. 91–109. Cambridge: McDonald Institute for Archaeological Research.
Attardi, G. (2006). Experiments with a multilanguage non-projective dependency parser. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), New York City, pp. 166–70. Association for Computational Linguistics.
Baayen, R. H. (2001). Word Frequency Distributions. Dordrecht: Kluwer Academic.
Baayen, R. H. (2003). Probabilistic approaches to morphology. In R. Bod, J. Hay, and S. Jannedy (eds.), Probabilistic Linguistics, pp. 229–87. Cambridge, MA: MIT Press.
Baayen, R. H. (2008). Analyzing Linguistic Data. Cambridge: Cambridge University Press.
Baayen, R. H. (2014). Multivariate statistics. In R. J. Podesva and D. Sharma (eds.), Research Methods in Linguistics, pp. 337–72. Cambridge: Cambridge University Press.
Baker, C., C. Fillmore, and J. Lowe (1998). The Berkeley FrameNet project. In Proceedings of COLING-ACL 1998, Montreal.
Bamman, D. and G. Crane (2006). The design and use of a Latin dependency treebank. In J. Hajič and J.
Nivre (eds.), Proceedings of the Fifth International Workshop on Treebanks and Linguistic Theories (TLT 2006), Prague, pp. 67–78. ÚFAL MFF UK.
Bamman, D. and G. Crane (2007). The Latin Dependency Treebank in a cultural heritage digital library. In Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), Prague, pp. 33–40.
Bamman, D. and G. Crane (2008). Building a dynamic lexicon from a digital library. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2008), Pittsburgh.
Bamman, D. and G. Crane (2011). The Ancient Greek and Latin Dependency Treebanks. In C. Sporleder, A. Bosch, and K. Zervanou (eds.), Language Technology for Cultural Heritage: Theory and Applications of Natural Language Processing, pp. 79–98. Berlin/Heidelberg: Springer.
Bamman, D., M. Passarotti, R. Busa, and G. Crane (2008). The annotation guidelines of the Latin Dependency Treebank and Index Thomisticus Treebank: The treatment of some specific syntactic constructions in Latin. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech.
Bamman, D., F. Mambrini, and G. Crane (2009). An Ownership Model of Annotation: The Ancient Greek Dependency Treebank. In TLT 2009: Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories, Milan, Italy, pp. 5–15. UniCATT.
Barðdal, J., T. Smitherman, V. Bjarnadóttir, S. Danesi, G. B. Jenset, and B. McGillivray (2012). Reconstructing constructional semantics: The dative subject construction in Old Norse–Icelandic, Latin, Ancient Greek, Old Russian and Old Lithuanian. Studies in Language 36(3), 511–47.
Baron, A. and P. Rayson (2009). Automatic standardization of texts containing spelling variation: How much training data do you need? In Proceedings of Corpus Linguistics 2009.
Baroni, M. (2013). Composition in distributional semantics.
Language and Linguistics Compass 7(10), 511–22. Baroni, M. and A. Kilgarriff (2006). Linguistically-processed web corpora for multiple languages. In Proceedings of EACL 2006, Trento, Italy, pp. 87–90. Baroni, M. and R. Zamparelli (2010). Nouns are vectors, adjectives are matrices: Representing adjective–noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1183–93. Association for Computational Linguistics. Bayley, R. (2002). The quantitative paradigm. In J. K. Chambers, P. Trudgill, and N. Schilling-Estes (eds.), The Handbook of Language Variation and Change, pp. 117–41. Malden, MA: Blackwell. Beavers, J. and P. Sells (2014). Constructing and supporting a linguistic analysis. In R. J. Podesva and D. Sharma (eds.), Research Methods in Linguistics, pp. 397–421. Cambridge: Cambridge University Press. Bech, K. and G. Walkden (2016). English is (still) a West Germanic language. Nordic Journal of Linguistics 39(1), 65–100. Bender, E. M. and J. Good (2010). A grand challenge for linguistics: Scaling up and integrating models. White paper contributed to the National Science Foundation’s SBE 2020: Future Research in the Social, Behavioral and Economic Sciences initiative. Bennett, C. E. (1914). Syntax of Early Latin, vol. II—The Cases. Boston: Allyn & Bacon. Benson, L. D. (ed.) (1987). The Riverside Chaucer (3rd edn). Oxford: Oxford University Press. Bentein, K. (2012). The periphrastic perfect in Ancient Greek: A diachronic mental space analysis. Transactions of the Philological Society 110(2), 171–211. Benzécri, J.-P. (1973). L’Analyse des Données, vol. 1. Paris: Dunod. Bergsland, K. and H. Vogt (1962). On the validity of glottochronology. Current Anthropology 3(2), 115–53. Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test.
Journal of the American Statistical Association 33(203), 526–36. Biber, D. and S. Conrad (2001). Register variation: A corpus approach. In D. Schiffrin, D. Tannen, and H. E. Hamilton (eds.), The Handbook of Discourse Analysis, pp. 175–96. Oxford: Blackwell. Biber, D., E. Finegan, and D. Atkinson (1994). ARCHER and its challenges: Compiling and exploring A Representative Corpus of Historical English Registers. In U. Fries, G. Tottie, and P. Schneider (eds.), Creating and Using English Language Corpora: Papers from the 14th International Conference on English Language Research on Computerized Corpora, Zurich 1993, pp. 1–13. Amsterdam: Rodopi. Bird, S., E. Klein, and E. Loper (2009). Natural Language Processing with Python. Sebastopol, CA: O’Reilly. Bizer, C., T. Heath, K. U. Idehen, and T. Berners-Lee (2008). Linked data on the web. In Proceedings of the 17th International World Wide Web Conference (WWW2008), Beijing. Bloomfield, L. (1933). Language. New York: Holt. Blythe, R. A. and W. Croft (2012). S-curves and the mechanisms of propagation in language change. Language 88(2), 269–304. Bod, R. (2003). Introduction to elementary probability theory and formal stochastic language theory. In R. Bod, J. Hay, and S. Jannedy (eds.), Probabilistic Linguistics, pp. 11–38. Cambridge, MA: MIT Press. Bod, R. (2014). A New History of the Humanities: The Search for Principles and Patterns from Antiquity to the Present. Oxford: Oxford University Press. Bod, R., J. Hay, and S. Jannedy (eds.) (2003). Probabilistic Linguistics. Cambridge, MA: MIT Press. Böhmová, A., J. Hajič, E. Hajičová, and B. Hladká (2003). The Prague Dependency Treebank: A three-level annotation scenario. In A. Abeillé (ed.), Treebanks: Building and Using Parsed Corpora, pp. 103–28. Dordrecht: Kluwer Academic. Borin, L. and M. Forsberg (2008). Something old, something new: A computational morphological description of Old Swedish. In K. Ribarov and C.
Sporleder (eds.), Proceedings of the LREC 2008 Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), pp. 9–16. Boschetti, F. (2010). A corpus-based approach to philological issues. Ph.D. thesis, University of Trento, Trento, Italy. Breivik, L. E. (1990). Existential There: A Synchronic and Diachronic Study (2nd edn). Oslo: Novus Press. Breivik, L. E. (1997). There in space and time. In H. Ramisch and K. Wynne (eds.), Language in Time and Space: Studies in Honour of Wolfgang Viereck on the Occasion of his 60th Birthday, pp. 32–45. Stuttgart: Franz Steiner Verlag. Bresnan, J., A. Cueni, T. Nikitina, and R. H. Baayen (2007). Predicting the dative alternation. In G. Bouma, I. Kraemer, and J. Zwarts (eds.), Cognitive Foundations of Interpretation, pp. 69–94. Amsterdam: Royal Netherlands Academy of Arts and Sciences. Bülow, A. E. and J. Ahmon (2011). Preparing Collections for Digitization. London: Facet. Busa, R. (1980). The annals of humanities computing: The Index Thomisticus. Computers and the Humanities 14(2), 83–90. Bybee, J. (2003). Mechanisms of change in grammaticization: The role of frequency. In B. D. Joseph and R. D. Janda (eds.), The Handbook of Historical Linguistics, pp. 602–23. Malden, MA: Blackwell. Callison-Burch, C. (2009). Fast, cheap, and creative: Evaluating translation quality using Amazon’s Mechanical Turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 286–95. Association for Computational Linguistics. Campbell, L. (2013). Historical Linguistics: An Introduction (3rd edn). Edinburgh: Edinburgh University Press. Candela, L., D. Castelli, P. Manghi, and A. Tani (2015). Data journals: A survey. Journal of the Association for Information Science and Technology 66. Carnie, A. (2012). Syntax: A Generative Introduction (3rd (electronic) edn). Malden, MA: Blackwell. Carrier, R. C. (2012).
Proving History: Bayes’s Theorem and the Quest for the Historical Jesus. Amherst, NY: Prometheus. Chambers, J. K. (2002). Patterns of variation including change. In J. K. Chambers, P. Trudgill, and N. Schilling-Estes (eds.), The Handbook of Language Variation and Change, pp. 349–72. Malden, MA: Blackwell. Chiarcos, C., J. McCrae, P. Cimiano, and C. Fellbaum (2013). Towards open data for linguistics: Linguistic linked data. In A. Oltramari, P. Vossen, L. Qin, and E. Hovy (eds.), New Trends of Research in Ontologies and Lexical Resources. Heidelberg/New York/Dordrecht/London: Springer. Chomsky, N. (1957). Syntactic Structures. The Hague: Mouton. Chrétien, C. D. (1962). The mathematical models of glottochronology. Language 38(1), 11–37. Cimiano, P., P. Buitelaar, and M. Sintek (2011). LexInfo: A declarative model for the lexicon–ontology interface. Journal of Web Semantics: Science, Services and Agents on the World Wide Web 9, 29–51. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1), 37–46. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd edn). Hillsdale, NJ: Lawrence Erlbaum. Cohen, J. (1994). The earth is round (p < .05). American Psychologist 49(12), 997–1003. Coopmans, P. (1989). Where stylistic and syntactic processes meet: Locative inversion in English. Language 65(4), 728–51. Corrie, M. (2006). Middle English—dialects and diversity. In L. Mugglestone (ed.), The Oxford History of English, pp. 86–119. Oxford: Oxford University Press. Crane, G. (1991). Generating and parsing classical Greek. Literary and Linguistic Computing 6(4), 243–5. Crocco Galèas, G. and C. Iacobini (1992). Parasintesi e doppio stadio derivativo nella formazione verbale del latino. Archivio Glottologico Italiano 77, 167–99. Croft, W. (2000). Explaining Language Change: An Evolutionary Approach. London: Longman. Croft, W. and D. A. Cruse (2004). Cognitive Linguistics.
Cambridge: Cambridge University Press. Cruse, A. (2011). Meaning in Language: An Introduction to Semantics and Pragmatics (3rd edn). Oxford: Oxford University Press. Culpeper, J. and D. Archer (2008). Requests and directness in early modern English trial proceedings and play-texts, 1640–1760. In A. H. Jucker and I. Taavitsainen (eds.), Speech Acts in the History of English, pp. 45–84. Amsterdam/Philadelphia: John Benjamins. Czeitschner, U., T. Declerck, and C. Resch (2013). Porting elements of the Austrian Baroque Corpus onto the linguistics linked open data format. In P. Osenova, K. Simov, G. Georgiev, and P. Nakov (eds.), Proceedings of the Joint NLP&LOD and SWAIE Workshops, RANLP, Hissar, Bulgaria, pp. 12–16. Davies, M. (2008). The Corpus of Contemporary American English (COCA): 410+ million words, 1990–present. Davies, M. (2010). The Corpus of Historical American English: 400 million words, 1810–2009. Davies, M. (2011). Google Books (American English) Corpus (155 billion words, 1810–2009). Available online at http://googlebooks.byu.edu/. de Marneffe, M.-C. and C. Potts (2014). Developing linguistic theories using annotated corpora. In N. Ide and J. Pustejovsky (eds.), The Handbook of Linguistic Annotation. Berlin: Springer. Declerck, T., U. Czeitschner, K. Moerth, C. Resch, and G. Budin (2011). A text technology infrastructure for annotating corpora in the eHumanities. In S. Gradmann, F. Borri, C. Meghini, and H. Schuldt (eds.), Proceedings of the International Conference on Theory and Practice of Digital Libraries (TPDL 2011), pp. 457–60. Deignan, A. (2005). Metaphor and Corpus Linguistics. Amsterdam: John Benjamins. Denison, D. (2002). Log(ist)ic and simplistic S-curves. In R. Hickey (ed.), Motives for Language Change, pp. 54–70. Cambridge: Cambridge University Press. Depuydt, K. and J. de Does (2009). Computational tools and lexica to improve access to text. In E. Beijk and L.
Colman (eds.), Fons Verborum. Feestbundel voor prof. dr. A.M.F.J. (Fons) Moerdijk, aangeboden door vrienden en collega’s bij zijn afscheid van het INL, pp. 187–99. Leiden/Amsterdam: Instituut voor Nederlandse Lexicologie. Dilthey, W. (1991). Selected Works, vol. I. Princeton, NJ: Princeton University Press. Downs, M. E., B. Z. Lund, R. Talbert, M. J. McDaniel, J. Becker, N. Jovanovic, S. Gillies, and T. Elliott. Places: 1004 ((H)Adriaticum/Superum Mare). Pleiades. Doyle, P. (2005). Replicating corpus-based linguistics: Investigating lexical networks in text. In Proceedings of the Corpus Linguistics Conference. University of Birmingham, UK. Dufresne, M., F. Dupuis, and M. Tremblay (2003). Preverbs and particles in Old French. Yearbook of Morphology, 33–59. Dunn, M., A. Terrill, G. Reesink, R. A. Foley, and S. C. Levinson (2005). Structural phylogenetics and the reconstruction of ancient language history. Science 309(5743), 2072–5. Ellegård, A. (1953). The Auxiliary Do: The Establishment and Regulation of its Use in English. Stockholm: Almquist & Wiksell. Ellegård, A. (1959). Statistical measurement of linguistic relationship. Language 35(2), 131–56. Elliott, T. and S. Gillies (2009). Data and code for ancient geography: Shared effort across projects and disciplines. In Digital Humanities 2009 Conference Abstracts, pp. 4–6. Elliott, T. and S. Gillies (2011). Pleiades: An un-GIS for ancient geography. In Digital Humanities 2011, Conference Abstracts, Stanford, pp. 311–12. Stanford University. Emonds, J. E. and J. T. Faarlund (2014). English: The Language of the Vikings. Olomouc Modern Language Monographs. Palacký University. Evert, S. (2006). How random is a corpus? The library metaphor. Zeitschrift für Anglistik und Amerikanistik 54(2), 177–90. Faraway, J. J. (2005). Linear Models with R. Boca Raton, FL: Chapman & Hall/CRC. Farrar, S. and D. T. Langendoen (2003).
A linguistic ontology for the semantic web. GLOT International 7(3), 97–100. Faudree, P. and M. P. Hansen (2014). Language, society, and history: Towards a unified approach? In The Cambridge Handbook of Linguistic Anthropology, pp. 227–49. Cambridge: Cambridge University Press. Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press. Ferraresi, A., E. Zanchetta, M. Baroni, and S. Bernardini (2008). Introducing and evaluating ukWaC, a very large web-derived corpus of English. In S. Evert, A. Kilgarriff, and S. Sharoff (eds.), Proceedings of the 4th LREC Web as Corpus Workshop (WAC-4)—Can We Beat Google?, Marrakech, Morocco. European Language Resources Association. Fillmore, C. J. (1992). ‘Corpus linguistics’ or ‘computer-aided armchair linguistics’. In J. Svartvik (ed.), Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Stockholm, 4–8 August 1991, pp. 35–60. Berlin: Mouton de Gruyter. Fischer, O. (1996). Syntax. In N. Blake (ed.), The Cambridge History of the English Language, vol. II: 1066–1476, pp. 207–408. Cambridge: Cambridge University Press. Fischer, O. (2004). Grammar change versus language change: Is there a difference? In C. Kay, S. Horobin, and J. J. Smith (eds.), New Perspectives on English Historical Linguistics: Selected Papers from 12 ICEHL, Glasgow, 21–26 August 2002, pp. 31–63. Philadelphia, PA: John Benjamins. Fischer, O. (2007). Morphosyntactic Change: Functional and Formal Perspectives (electronic edn). Oxford: Oxford University Press. Fodor, I. (1961). The validity of glottochronology on the basis of the Slavonic languages. Studia Slavica 7(4), 295–346. Forster, P. and C. Renfrew (eds.) (2006). Phylogenetic Methods and the Prehistory of Languages. Cambridge: McDonald Institute for Archeological Research. Fought, C. (2002). Ethnicity. In J. K. Chambers, P. Trudgill, and N. Schilling-Estes (eds.), The Handbook of Language Variation and Change, pp. 444–72. Malden, MA: Blackwell.
Freitas, A., E. Curry, and S. O’Riain (2012). A distributional approach for terminological semantic search on the linked data web, pp. 384–91. Gale, W. and G. Sampson (1995). Good–Turing frequency estimation without tears. Journal of Quantitative Linguistics 2(3), 217–37. Galves, C. and H. Britto (2002). The Tycho Brahe Corpus of Historical Portuguese. Technical report, Department of Linguistics, University of Campinas. Online publication, 1st. García García, L. (2000). A case study in historical linguistic research. In Perspectives on the Genitive in English: Synchronic, Diachronic, Contrastive and Research, vol. 1, pp. 118–29. Universidad de Sevilla. Geeraerts, D. (2006). Methodology in cognitive linguistics. In G. Kristiansen, M. Achard, R. Dirven, and F. J. Ruiz de Mendoza Ibáñez (eds.), Cognitive Linguistics: Current Applications and Future Perspectives, pp. 21–50. Berlin: Mouton de Gruyter. Gelderen, E. v. (2014). Generative syntax and language change. In C. Bowern and B. Evans (eds.), The Routledge Handbook of Historical Linguistics, pp. 326–42. Abingdon, UK: Routledge. Gelman, A. (2012). Statistics in a world where nothing is random. Blog post. Accessed 13/09/2015 from http://andrewgelman.com/2012/12/17/statistics-in-a-world-where-nothing-is-random/. Gelman, A. and J. Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press. Gelman, A. and E. Loken (2014). The statistical crisis in science: Data-dependent analysis—a “garden of forking paths”—explains why many statistically significant comparisons don’t hold up. American Scientist 102(6), 460. Gibson, E. and E. Fedorenko (2013). The need for quantitative methods in syntax and semantics research. Language and Cognitive Processes 28(1–2), 88–124. Gilliland, A. J. (2008). Setting the stage. In M. Baca (ed.), Introduction to Metadata (2nd edn). Los Angeles: Getty. Goldthorpe, J. H. (2001).
Causation, statistics, and sociology. European Sociological Review 17(1), 1–20. Gorrell, J. H. (1895). Indirect discourse in Anglo-Saxon. PMLA 10(3), 342–485. Gotscharek, A., A. Neumann, U. Reffle, C. Ringlstetter, and K. U. Schulz (2009). Constructing a lexicon from a historical corpus. In Proceedings of the Conference of the American Association for Corpus Linguistics (AACL09), Edmonton. Gould, S. J. (1985). The Flamingo’s Smile: Reflections in Natural History. New York: Norton. Greenacre, M. (2007). Correspondence Analysis in Practice (2nd edn). Boca Raton, FL: Chapman & Hall/CRC. Gries, S. T. (2005). Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory 1(2), 277–94. Gries, S. T. (2006a). Introduction. In S. T. Gries and A. Stefanowitsch (eds.), Corpora in Cognitive Linguistics: Corpus-Based Approaches to Syntax and Lexis, pp. 1–17. Berlin/New York: Mouton de Gruyter. Gries, S. T. (2006b). Some proposals towards a more rigorous corpus linguistics. Zeitschrift für Anglistik und Amerikanistik 54(2), 191–202. Gries, S. T. (2009a). Quantitative Corpus Linguistics with R: A Practical Introduction. New York: Routledge. Gries, S. T. (2009b). Statistics for Linguistics with R: A Practical Introduction. Berlin: Mouton de Gruyter. Gries, S. T. (2011). Methodological and interdisciplinary stance in corpus linguistics. In G. Barnbrook, V. Viana, and S. Zyngier (eds.), Perspectives on Corpus Linguistics: Connections and Controversies, pp. 81–98. Amsterdam: John Benjamins. Gries, S. T. (2015). The most under-used statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora 10(1), 95–125. Gries, S. T. and A. L. Berez (2015). Linguistic annotation in/for corpus linguistics. In N. Ide and J. Pustejovsky (eds.), Handbook of Linguistic Annotation. Berlin/New York: Springer. Gries, S. T. and M. Hilpert (2010).
Modeling diachronic change in the third person singular: A multifactorial, verb- and author-specific exploratory approach. English Language and Linguistics 14(3), 293–320. Gries, S. T. and J. Newman (2014). Creating and using corpora. In R. J. Podesva and D. Sharma (eds.), Research Methods in Linguistics, pp. 257–87. Cambridge: Cambridge University Press. Grondelaers, S., D. Geeraerts, and D. Speelman (2007). A case for a cognitive corpus linguistics. In M. Gonzalez-Marquez, I. Mittelberg, S. Coulson, and M. J. Spivey (eds.), Methods in Cognitive Linguistics, pp. 149–69. Amsterdam: John Benjamins. Guiraud, P. (1959). Problèmes et méthodes de la statistique linguistique. Dordrecht: Reidel. Hajič, J., J. Panevová, Z. Urešová, A. Bémová, V. Kolárová-Reznícková, and P. Pajas (2003). PDT-VALLEX: Creating a large coverage valency lexicon for treebank annotation. In J. Nivre and E. Hinrichs (eds.), Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT 2003), Växjö, vol. 9, pp. 57–68. Växjö University Press. Halpin, H., V. Robu, and H. Shepherd (2007). The complex dynamics of collaborative tagging. In Proceedings of the International Conference on World Wide Web. ACM Press. Harris, R. A. (1993). The Linguistics Wars. New York: Oxford University Press. Harrison, S. (2003). On the limits of the comparative method. In B. D. Joseph and R. D. Janda (eds.), The Handbook of Historical Linguistics, pp. 213–43. Malden, MA: Blackwell. Haug, D., M. Jøhndal, H. Eckhoff, E. Welo, M. Hertzenberg, and A. Müth (2009). Computational and linguistic issues in designing a syntactically annotated parallel corpus of Indo-European languages. Traitement automatique des langues 50, 17–45. Haug, D. T. T. and M. L. Jøhndal (2008). Creating a parallel treebank of the old Indo-European Bible translations. In Proceedings of the Language Technologies for Cultural Heritage Workshop (LREC 2008), Marrakech, pp. 27–34.
Haverling, G. (2000). On Sco-verbs, Prefixes and Semantic Functions. Number 64 in Studia Graeca et Latina Gothoburgensia. Göteborg: Acta Universitatis Gothoburgensis. Hay, J. and P. Foulkes (2016). The evolution of medial /t/ over real and remembered time. Language 92(2), 298–330. Hay, J. B., J. B. Pierrehumbert, A. J. Walker, and P. LaShell (2015). Tracking word frequency effects through 130 years of sound change. Cognition 139, 83–91. Heggelund, Ø. (2015). On the use of data in historical linguistics: Word order in early English subordinate clauses. English Language and Linguistics 19(1), 83–106. Hellmann, S., J. Lehmann, S. Auer, and M. Brümmer (2013). Integrating NLP using linked data. In H. Alani, L. Kagal, A. Fokoue, P. Groth, C. Biemann, J. Parreira, L. Aroyo, N. Noy, C. Welty, and K. Janowicz (eds.), The Semantic Web—ISWC 2013, vol. 8219, Lecture Notes in Computer Science, pp. 98–113. Berlin/Heidelberg: Springer. Hey, T., S. Tansley, and K. Tolle (eds.) (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research. Hilpert, M. and S. T. Gries (2009). Assessing frequency changes in multistage diachronic corpora: Applications for historical corpus linguistics and the study of language acquisition. Literary and Linguistic Computing 24(4), 385–401. Hilpert, M. and S. T. Gries (2016). Quantitative approaches to diachronic corpus linguistics. In M. Kytö and P. Pahta (eds.), The Cambridge Handbook of English Historical Linguistics, pp. 36–53. Cambridge: Cambridge University Press. Hinton, P. R. (2004). Statistics Explained. London: Routledge. Hockett, C. F. (1958). A Course in Modern Linguistics. Oxford: Macmillan. Horobin, S. and J. Smith (2002). An Introduction to Middle English. Edinburgh: Edinburgh University Press. Huang, L., Y. Peng, H. Wang, and Z. Wu (2002). PCFG parsing for restricted Classical Chinese texts.
In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, Stroudsburg, PA, USA, pp. 1–6. Association for Computational Linguistics. Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press. Hymes, D. H. (1960). Lexicostatistics so far. Current Anthropology 1(1), 3–44. Hájek, A. (2012). Interpretations of probability. In E. N. Zalta (ed.), The Stanford Encyclopedia of Philosophy (winter 2012 edn). Stanford. Iacobini, C. and F. Masini (2007). Verb-particle constructions and prefixed verbs in Italian: Typology, diachrony and semantics. In G. Booij, B. Fradin, A. Ralli, and S. Scalise (eds.), On-line Proceedings of the Fifth Mediterranean Morphology Meeting (MMM5), pp. 157–84. Università degli Studi di Bologna. Ide, N. and C. Macleod (2001). The American National Corpus: A standardized resource of American English. In Proceedings of Corpus Linguistics 2001, Lancaster. Irvine, S. (2006). Beginnings and transitions: Old English. In L. Mugglestone (ed.), The Oxford History of English, pp. 32–60. Oxford: Oxford University Press. Jackson, H. (2002). Lexicography: An Introduction. London: Routledge. Jenset, G. B. (2010). A corpus-based study on the evolution of there: Statistical analysis and cognitive interpretation. Ph.D. thesis, University of Bergen. Jenset, G. B. (2013). Mapping meaning with distributional methods: A diachronic corpus-based study of existential there. Journal of Historical Linguistics 3(2), 272–306. Jenset, G. B. (2014). In search of the S (curve) in there. In K. E. Haugland, K. A. Rusten, and K. McCafferty (eds.), ‘Ye whom the charms of grammar please’: Studies in English Language History in Honour of Leiv Egil Breivik, pp. 27–54. Oxford: Peter Lang. Jenset, G. B. and B. McGillivray (2012). Multivariate analyses of affix productivity in translated English. In M. Oakes and M. Ji (eds.), Quantitative Methods in Corpus-Based Translation Studies, pp. 301–23. Amsterdam: John Benjamins. Johnson, K. (2008).
Quantitative Methods in Linguistics. Oxford: Blackwell. Joseph, B. D. and R. D. Janda (eds.) (2003). The Handbook of Historical Linguistics. Oxford: Blackwell. Joulain, A., I. Gregory, and A. Hardie (2013). The spatial patterns in historical texts: Combining corpus linguistics and geographical information systems to explore places in Victorian newspapers. In Exploring Historical Sources: Abstracts of Presentations. Kenter, T., T. Erjavec, M. Z. Dulmin, and D. Fišer (2012). Lexicon construction and corpus annotation of historical language with the CoBaLT editor. In Proceedings of the 6th EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 1–6. Association for Computational Linguistics. Kestemont, M., W. Daelemans, and G. De Pauw (2010). Weigh your words—memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing 25(3), 287–301. Kilgarriff, A. (2005). Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory 1(2), 263–76. Kilgarriff, A., P. Rychly, P. Smrz, and D. Tugwell (2004). The Sketch Engine. In G. Williams and S. Vessier (eds.), Proceedings of the Eleventh Euralex International Congress, Lorient, pp. 105–16. Université de Bretagne-Sud. Kingsbury, P. and M. Palmer (2002). From TreeBank to PropBank. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Canary Islands. Koch, U. (1993). The enhancement of a dependency parser for Latin. Technical Report AI-1993-03, Artificial Intelligence Programs, University of Georgia. Köhler, R. (1999). Syntactic structures: Properties and interrelations. Journal of Quantitative Linguistics 6(1), 46–57. Köhler, R. (2012). Quantitative Syntax Analysis, vol. 65. Berlin: Walter de Gruyter. Kolachina, S. and P. Kolachina (2012). Parsing any domain English text to CoNLL dependencies. In N. Calzolari, K. Choukri, T.
Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis (eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association. Korhonen, A., Y. Krymolowski, and T. Briscoe (2006). A large subcategorization lexicon for natural language processing applications. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006). Genoa. Kretzschmar, W. A. and S. Tamasi (2003). Distributional foundations for a theory of language change. World Englishes 22(4), 377–401. Kroch, A. (1989). Reflexes of grammar in patterns of language change. Language Variation and Change 1, 199–244. Kroch, A., B. Santorini, and L. Delfs (2004). Penn–Helsinki Parsed Corpus of Early Modern English. http://www.ling.upenn.edu/hist-corpora/PPCEME-RELEASE-3/index.html. Kroch, A. and A. Taylor (2000). The Penn–Helsinki Parsed Corpus of Middle English (PPCME2). Technical report, Department of Linguistics, University of Pennsylvania. Kroch, A., B. Santorini, and L. Delfs (2004). The Penn–Helsinki Parsed Corpus of Early Modern English (PPCEME). Technical report, Department of Linguistics, University of Pennsylvania. Kroch, A., B. Santorini, and A. Diertani (2010). The Penn–Helsinki Parsed Corpus of Modern British English (PPCMBE). Technical report, Department of Linguistics, University of Pennsylvania. Kroeber, A. L. and C. D. Chrétien (1937). Quantitative classification of Indo-European languages. Language 13(2), 83–103. Kytö, M. and T. Walker (2006). Guide to A Corpus of English Dialogues 1560–1760. Studia Anglistica Upsaliensia 130. Labov, W. (1972). Some principles of linguistic methodology. Language in Society 1(1), 97–120. Lau, J. H., A. Clark, and S. Lappin (2015). Unsupervised prediction of acceptability judgements.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 1618–28. Association for Computational Linguistics. Leech, G. (1997). Introducing corpus annotation. In Corpus Annotation: Linguistic Information from Computer Text Corpora (3rd edn). London: Longman. Lenci, A. (2008). Distributional semantics in linguistic and cognitive research: A foreword. Italian Journal of Linguistics 20, 1–31. Lenci, A., B. McGillivray, S. Montemagni, and V. Pirrelli (2008). Unsupervised acquisition of verb subcategorization frames from shallow-parsed corpora. In Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008). Marrakech, pp. 3000–6. European Language Resources Association. Lenci, A., S. Montemagni, and V. Pirrelli (2005). Testo e computer. Elementi di linguistica computazionale. Roma: Carocci. Leonelli, S. (2016). Researching Life in the Digital Age: A Philosophical Study of Data-Centric Biology. Chicago, IL: University of Chicago Press. Levin, B. (1993). English Verb Classes and Alternations: A Preliminary Investigation. Chicago: University of Chicago Press. Levy, R. (2008). Expectation-based syntactic comprehension. Cognition 106(3), 1126–77. Lewis, C. T. and C. Short (1879). A Latin Dictionary, Founded on Andrews’ edition of Freund’s Latin dictionary revised, enlarged, and in great part rewritten by Charlton T. Lewis, Ph.D. and Charles Short, LL.D. Oxford: Clarendon, http://www.lib.uchicago.edu/efts/PERSEUS/Reference/lewisandshort.html. Lieberman, E., J.-B. Michel, J. Jackson, T. Tang, and M. A. Nowak (2007). Quantifying the evolutionary dynamics of language. Nature 449(7163), 713–16. Lightfoot, D. (1989). The child’s trigger experience: Degree-0 learnability. Behavioral and Brain Sciences 12(2), 321–34. Lightfoot, D. (2006).
How New Languages Emerge. Cambridge: Cambridge University Press. Lightfoot, D. W. (2013). Types of explanation in history. Language 89(4), e18–e38. Long, J. S. and J. Freese (2001). Regression Models for Categorical Dependent Variables Using Stata. College Station, TX: Stata Press. Lüdeling, A., H. Hirschmann, and A. Zeldes (2011). Variationism and underuse statistics in the analysis of the development of relative clauses in German. In Y. Kawaguchi, M. Minegishi, and W. Viereck (eds.), Corpus-Based Analysis and Diachronic Linguistics, pp. 37–58. Amsterdam: John Benjamins. McCarthy, D. (2001). Lexical acquisition at the syntax–semantics interface: Diathesis alternations, subcategorization frames and selectional preferences. Ph.D. thesis, University of Sussex. McColl Millar, R. (2012). English Historical Sociolinguistics. Edinburgh: Edinburgh University Press. McCrae, J., E. Montiel-Ponsoda, and P. Cimiano (2012). Integrating WordNet and Wiktionary with lemon. In C. Chiarcos, S. Nordhoff, and S. Hellmann (eds.), Linked Data in Linguistics, pp. 25–34. Heidelberg/New York/Dordrecht/London: Springer. McEnery, T. and H. Baker (2014). The corpus as historian: Using corpora to investigate the past. In Exploring Historical Sources: Abstracts of Presentations. McEnery, T. and A. Hardie (2012). Corpus Linguistics: Method, Theory and Practice. Cambridge: Cambridge University Press. McEnery, T. and A. Wilson (2001). Corpus Linguistics: An Introduction. Edinburgh: Edinburgh University Press. McGillivray, B. (2012). Latin preverbs and verb argument structure: New insights from new methods. In J. Barðdal, M. Cennamo, and E. van Gelderen (eds.), Argument Structure: The Naples/Capri Papers. Amsterdam: John Benjamins. McGillivray, B. (2013). Methods in Latin Computational Linguistics. Leiden: Brill. McGillivray, B. and A. Kilgarriff (2013). Tools for historical corpus research, and a corpus of Latin. In P. Bennett, M. Durrell, S. Scheible, and R. J.
Whitt (eds.), New Methods in Historical Corpus Linguistics, vol. 3, Corpus Linguistics and Interdisciplinary Perspectives on Language. Tübingen: Narr. McGillivray, B., M. Passarotti, and P. Ruffolo (2009). The Index Thomisticus Treebank project: Annotation, parsing and valency lexicon. TAL 50(2), 103–27. McGillivray, B. and A. Vatri (2015). Computational valency lexica for Latin and Greek in use: A case study of syntactic ambiguity. Journal of Latin Linguistics 14, 101–26. McMahon, A. (2006). Restructuring Renaissance English. In L. Mugglestone (ed.), The Oxford History of English, pp. 147–77. Oxford: Oxford University Press. McMahon, A. and R. McMahon (2005). Language Classification by Numbers. Oxford: Oxford University Press. Mair, C. (2004). Corpus linguistics and grammaticalization theory: Statistics, frequencies, and beyond. In H. Lindquist and C. Mair (eds.), Corpus Approaches to Grammaticalization in English, pp. 121–50. Amsterdam: John Benjamins. Manning, C. D. (2003). Probabilistic syntax. In R. Bod, J. Hay, and S. Jannedy (eds.), Probabilistic Linguistics, pp. 289–342. Cambridge, MA: MIT Press. Martineau, F., P. Hirschbühler, A. Kroch, and Y. C. Morin (2010). Corpus MCVF, Modéliser le changement: les voies du français. Technical report, Département de français, University of Ottawa. CD-ROM. Mason, H. and D. Patil (2015). Data Driven: Creating a Data Culture. Sebastopol, CA: O’Reilly. Mayer-Schönberger, V. and K. Cukier (2013). Big Data: A Revolution That Will Transform How We Live, Work and Think. Boston: Houghton Mifflin Harcourt. Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry 1(2), 108–41. Meillet, A. and J. Vendryes (1963). Traité de grammaire comparée des langues classiques. Paris: Librairie Ancienne Honoré Champion. Meini, L. and B. McGillivray (2010).
Between semantics and syntax: Spatial verbs and prepositions in Latin. In Proceedings of the Space in Language Conference, 8–10 October 2009, Pisa. ETS. Menini, S. (2014). Computational analysis of historical texts. In Exploring Historical Sources: Abstracts of Presentations. Messiant, C., A. Korhonen, and T. Poibeau (2008). LexSchem: A large subcategorization lexicon for French verbs. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco. Meyer, E. T. and R. Schroeder (2015). Knowledge Machines: Digital Transformations of the Sciences and Humanities. Cambridge, MA: MIT Press. Miller, G. A., R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography 3(4), 235–44. Moore, G. A. (1991). Crossing the Chasm: Marketing and Selling High-Tech Products to Mainstream Customers. New York: HarperBusiness. Morton, R. (2014). Using TEI mark-up and pragmatic classification in the construction and analysis of the British Telecom Correspondence Corpus. In Exploring Historical Sources: Abstracts of Presentations. Mosteller, F. (1968). Association and estimation in contingency tables. Journal of the American Statistical Association 63(321), 1–28. Munro, R., S. Bethard, V. Kuperman, V. Tzuyin Lai, R. Melnick, C. Potts, T. Schnoebelen, and H. Tily (2010). Crowdsourcing and language studies: The new generation of linguistic data. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, Los Angeles, pp. 122–30. Association for Computational Linguistics. Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika 78(3), 691–2. Nevalainen, T. (2003). Socio-Historical Linguistics: Language Change in Tudor and Stuart England. London: Longman. Nevalainen, T. (2006).
Mapping change in Tudor English. In L. Mugglestone (ed.), The Oxford History of English, pp. 178–211. Oxford: Oxford University Press. Pagel, M., Q. D. Atkinson, and A. Meade (2007). Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449(7163), 717–20. Passarotti, M. (2007a). LEMLAT. Uno strumento per la lemmatizzazione morfologica automatica del latino. In F. Citti and T. Del Vecchio (eds.), From Manuscript to Digital Text: Problems of Interpretation and Markup. Proceedings of the Colloquium (Bologna, June 12th 2003). Roma, pp. 107–28. Passarotti, M. (2007b). Verso il Lessico Tomistico Biculturale. La treebank dell’Index Thomisticus. In R. Petrilli and D. Femia (eds.), Il filo del discorso. Intrecci testuali, articolazioni linguistiche, composizioni logiche. Atti del XIII Congresso Nazionale della Società di Filosofia del Linguaggio. Viterbo, pp. 187–205. Passarotti, M. (2010). Leaving behind the less-resourced status: The case of Latin through the experience of the Index Thomisticus Treebank. In Proceedings of the 7th SaLTMiL Workshop on Creation and Use of Basic Lexical Resources for Less-Resourced Languages, LREC 2010. La Valletta, Malta, 23 May 2010, pp. 27–32. Passarotti, M. (2014). From syntax to semantics: First steps towards tectogrammatical annotation of Latin. In Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), pp. 100–9. Association for Computational Linguistics. Passarotti, M. and F. Dell’Orletta (2010). Improvements in parsing the Index Thomisticus Treebank. Revision, combination and a feature model for medieval Latin. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010). May 19–21, 2010, La Valletta, Malta, pp. 1964–71. European Language Resources Association. Passarotti, M., B. McGillivray, and D. Bamman (2015). A treebank-based study on Latin word order. 
In Proceedings of the 16th International Colloquium on Latin Linguistics, Uppsala. Passarotti, M. and P. Ruffolo (2009). Parsing the Index Thomisticus Treebank: Some preliminary results. In P. Anreiter and M. Kienpointner (eds.), Proceedings of the 15th International Colloquium on Latin Linguistics, Innsbrucker Beiträge zur Sprachwissenschaft, Innsbruck. Penke, M. and A. Rosenbach (eds.) (2007a). What Counts as Evidence in Linguistics. Amsterdam: John Benjamins. Penke, M. and A. Rosenbach (2007b). What counts as evidence in linguistics? An introduction. In M. Penke and A. Rosenbach (eds.), What Counts as Evidence in Linguistics, pp. 1–49. Amsterdam: John Benjamins. Pereira, F. C. (2000). Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society 358, 1239–53. Pereltsvaig, A. and M. W. Lewis (2015). The Indo-European Controversy (electronic edn). Cambridge: Cambridge University Press. Pintzuk, S. (2003). Variationist approaches to syntactic change. In B. D. Joseph and R. D. Janda (eds.), The Handbook of Historical Linguistics, pp. 509–28. Malden, MA: Blackwell. Pintzuk, S. and L. Plug (2002). The York–Helsinki Parsed Corpus of Old English Poetry. Technical report, Department of Linguistics, University of York. Oxford Text Archive. Piotrowski, M. (2012). Natural Language Processing for Historical Texts. Synthesis Lectures on Human Language Technologies. Morgan & Claypool. Podesva, R. J. and D. Sharma (eds.) (2014). Research Methods in Linguistics. Cambridge: Cambridge University Press. Popper, K. (1959). The Logic of Scientific Discovery (2002 edn). London: Routledge. Pullum, G. (2009). Computational linguistics and generative linguistics: The triumph of hope over experience. In T. Baldwin and V.
Kordoni (eds.), Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?, Athens, Greece, pp. 12–21. Association for Computational Linguistics. Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus Linguistics 13, 519–49. Rayson, P., D. Archer, A. Baron, J. Culpeper, and N. Smith (2007). Tagging the Bard: Evaluating the accuracy of a modern POS tagger in early modern English corpora. In Corpus Linguistics Conference (CL2007), Birmingham: University of Birmingham. Redman, T. C. (2008). Data Driven: Profiting from Your Most Important Business Asset. New York: Harvard Business Review Press. Resch, C., T. Declerck, B. Krautgartner, and U. Czeitschner (2014). ABaC:us revisited. Extracting and linking lexical data from a historical corpus of sacred literature. In C. Brierley, M. Sawalha, and E. Atwell (eds.), Proceedings of the 2nd Workshop on Language Resources and Evaluation for Religious Texts (LRE-REL 2), pp. 36–41. Ringe, D. and J. F. Eska (2013). Historical Linguistics: Toward a Twenty-First Century Reintegration (electronic edn). Cambridge: Cambridge University Press. Risen, J. and T. Gilovich (2007). Informal logical fallacies. In R. J. Sternberg, H. L. Roediger III, and D. F. Halpern (eds.), Critical Thinking in Psychology, pp. 110–30. Cambridge: Cambridge University Press. Romaine, S. (1982). Socio-Historical Linguistics: Its Status and Methodology. New York: Cambridge University Press. Ross, A. S. (1950). Philological probability problems. Journal of the Royal Statistical Society. Series B (Methodological) 12(1), 19–59. Rovai, F. (2012). Between feminine singular and neuter plural: Re-analysis patterns. Transactions of the Philological Society 110(1), 94–121. Rusten, K. A. (2014). Null referential subjects from Old to early modern English. In K. E. Haugland, K. McCafferty, and K. A.
Rusten (eds.), ‘Ye whom the charms of grammar please’: Studies in English Language History in Honour of Leiv Egil Breivik, Oxford, pp. 249–70. Peter Lang. Rusten, K. A. (2015). A quantitative study of empty referential subjects in Old English prose and poetry. Transactions of the Philological Society 113(1), 53–75. Rydén, M. (1980). Syntactic variation in a historical perspective. In S. Jacobson (ed.), Papers from the Scandinavian symposium on syntactic variation. Stockholm, 18–19 May 1979, Stockholm, pp. 37–45. Almqvist & Wiksell. Sabou, M., K. Bontcheva, L. Derczynski, and A. Scharl (2014). Corpus annotation through crowdsourcing: Towards best practice guidelines. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland. European Language Resources Association. Salvi, G. and L. Vanelli (1992). Grammatica essenziale di riferimento della lingua italiana. Istituto Geografico De Agostini. Le Monnier. Sampson, G. R. (2001). Empirical Linguistics. London/New York: Continuum. Sampson, G. R. (2003). Statistical linguistics. In W. J. Frawley (ed.), International Encyclopedia of Linguistics (2nd edn). New York: Oxford University Press. Sampson, G. R. (2005). Quantifying the shift towards empirical methods. International Journal of Corpus Linguistics 10, 10–36. Sampson, G. R. (2013). The empirical trend: Ten years on. International Journal of Corpus Linguistics 18(2), 281–9. Sanchez-Marco, C., G. Boleda, and L. Padró (2011). Extending the tool, or how to annotate historical language varieties. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Portland, OR, pp. 1–9. Association for Computational Linguistics. Sandra, D. and S. Rice (1995).
Network analyses of prepositional meaning: Mirroring whose mind—the linguist’s or the language user’s? Cognitive Linguistics 6(1), 89–130. Schlesewsky, M. and I. Bornkessel (2004). On incremental interpretation: Degrees of meaning accessed during sentence comprehension. Lingua 114(9–10), 1213–34. Schmid, H. (1994). TreeTagger: A language independent part-of-speech tagger. Available at http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/. Schmid, H. (1995). Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT Workshop, pp. 47–50. Schneider, G. (2008). Hybrid long-distance functional dependency parsing. Ph.D. thesis, Universität Zürich, Zurich, Switzerland. Schneider, G. (2012). Adapting a parser to historical English. In M. Rissanen, J. Tyrkkö, T. Nevalainen, and M. Kilpiö (eds.), Proceedings of the Helsinki Corpus Festival, Studies in Variation, Contacts and Change in English, Amsterdam/Philadelphia. Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki. Schulte im Walde, S. (2004). GermaNet synsets as selectional preferences in semantic verb clustering. Journal for Computational Linguistics and Language Technology 19(1/2), 69–79. Schulte im Walde, S. (2007). Corpus Linguistics. An International Handbook (chapter on the induction of verb frames and verb classes from corpora). Berlin: Mouton de Gruyter. Schütze, H. (1998). Automatic word sense discrimination. Computational Linguistics 24(1), 97–123. Sgall, P., E. Hajičová, and J. Panevová (1986). The Meaning of the Sentence in its Semantic and Pragmatic Aspects. Dordrecht, NL: Reidel. Sinclair, J. (2004). Trust the Text: Language, Corpus and Discourse. London: Routledge. Snow, R., B. O’Connor, D. Jurafsky, and A. Y. Ng (2008). Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks.
In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, Hawaii, pp. 254–63. Association for Computational Linguistics. Souvay, G. and J.-M. Pierrel (2009). LGeRM: Lemmatisation des mots en moyen français. Traitement automatique des langues 50(2), 149–72. Stefanowitsch, A. (2005). New York, Dayton (Ohio), and the raw frequency fallacy. Corpus Linguistics and Linguistic Theory 1(2), 295–301. Swadesh, M. (1952). Lexico-statistic dating of prehistoric ethnic contacts: With special reference to North American Indians and Eskimos. Proceedings of the American Philosophical Society 96(4), 452–63. Swadesh, M. (1953). Archeological and linguistic chronology of Indo-European groups. American Anthropologist 55(3), 349–52. Tagliamonte, S. A. and R. H. Baayen (2012). Models, forests, and trees of York English: Was/were variation as a case study for statistical practice. Language Variation and Change 24(2), 135–78. Talmy, L. (2007). Foreword. In M. Gonzalez-Marquez, I. Mittelberg, S. Coulson, and M. J. Spivey (eds.), Methods in Cognitive Linguistics, pp. xi–xxi. Amsterdam: John Benjamins. Taylor, A., A. Nurmi, A. Warner, S. Pintzuk, and T. Nevalainen (2006). The York–Helsinki Parsed Corpus of Early English Correspondence (PCEEC). Technical report, Department of Linguistics, University of York. Oxford Text Archive. Taylor, A., A. Warner, S. Pintzuk, and F. Beths (2003). The York–Toronto–Helsinki Parsed Corpus of Old English Prose (YCOE). Technical report, Department of Linguistics, University of York. Oxford Text Archive. TEI Consortium (2014). Guidelines for electronic text encoding and interchange. Technical Report WP2.11, TEI Consortium, http://www.tei-c.org/Guidelines/P5/. Tekavčić, P. (1972). Grammatica storica dell’italiano, vol. II Morfosintassi; III Lessico. Bologna: Il Mulino. Tesnière, L. (1959). Éléments de syntaxe structurale. Paris: Klincksieck. Tognini-Bonelli, E. (2001). Corpus Linguistics at Work.
Amsterdam/Philadelphia: John Benjamins. Toth, G. M. (2013). Knowledge and thinking in Renaissance Florence: A computer-assisted analysis of the diaries and commonplace books of Giovanni Rucellai and his contemporaries. Ph.D. thesis, University of Oxford. Tukey, J. W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley. Van der Beek, L., G. Bouma, R. Malouf, and G. van Noord (2002). The Alpino Dependency Treebank. In M. Theune, A. Nijholt, and H. Hondorp (eds.), Proceedings of the Twelfth Meeting of Computational Linguistics in the Netherlands (CLIN 2001), pp. 8–22. Amsterdam: Rodopi. Van Gompel, R. P. G. and M. J. Pickering (2007). Syntactic parsing. In G. Gaskell (ed.), The Oxford Handbook of Psycholinguistics, pp. 284–307. Oxford: Oxford University Press. Venables, W. N. and B. D. Ripley (2002). Modern Applied Statistics with S (4th edn). New York: Springer. Vicario, F. (1997). I verbi analitici in friulano. Milano: Franco Angeli. Vines, T., A. Albert, R. Andrew, F. Débarre, D. Bock, M. Franklin, K. Gilbert, J.-S. Moore, S. Renaut, and D. Rennison (2014). The availability of research data declines rapidly with article age. Current Biology 24(1), 94–7. Vulanović, R. and R. H. Baayen (2007). Fitting the development of periphrastic do in all sentence types. In P. Grzybek and R. Köhler (eds.), Exact Methods in the Study of Language and Text: Dedicated to Gabriel Altmann on the Occasion of his 75th Birthday, pp. 679–88. Berlin: Mouton de Gruyter. Wallenberg, J., A. K. Ingason, E. F. Sigurðsson, and E. Rögnvaldsson (2011a). Icelandic Parsed Historical Corpus (IcePaHC). Technical report, Department of Linguistics, University of Iceland. Online publication. Wallenberg, J., A. K. Ingason, E. F. Sigurðsson, and E. Rögnvaldsson (2011b). Icelandic Parsed Historical Corpus (IcePaHC). Version 0.9. Weisser, M. (2010). Essential Programming for Linguistics. Edinburgh: Edinburgh University Press.
Williams, A. (2000). Null subjects in Middle English existentials. In S. Pintzuk, G. Tsoulas, and A. Warner (eds.), Diachronic Syntax: Models and Mechanisms, pp. 285–310. Oxford: Oxford University Press. Zaenen, A., J. Carletta, G. Garretson, J. Bresnan, A. Koontz-Garboden, T. Nikitina, M. C. O’Connor, and T. Wasow (2004). Animacy encoding in English: Why and how. In B. Webber and D. K. Byron (eds.), Proceedings of the ACL2004 Workshop on Discourse Annotation, Volume 17, Barcelona, pp. 118–25. Association for Computational Linguistics. Zervanou, K. and C. Vertan (eds.) (2014). Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH). Gothenburg, Sweden: Association for Computational Linguistics. Zuidema, W. and B. de Boer (2014). Modeling in the language sciences. In R. J. Podesva and D. Sharma (eds.), Research Methods in Linguistics, pp. 428–45. Cambridge: Cambridge University Press. Index ALPINO Treebank 143, 144 Ancient Greek Dependency Treebank 101, 117, 118, 120, 129, 134, 143, 147, 148, 159 annotation 1, 5, 6, 10, 11, 19, 24, 42, 45, 55, 56, 58, 59, 61, 62, 63, 73, 75, 76, 99, 100, 101, 102, 103, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 127, 128, 129, 132, 134, 135, 137, 138, 140, 151, 152, 155, 159, 170, 173, 189, 193, 203, 205 Bayesian phylogenetic trees 207 Bayesian statistics 41 categorical 3, 4, 40, 43, 44, 50, 89, 91, 128, 186, 206 causation 27 chasm model 19, 20, 22, 23, 24, 33, 66, 71, 153, 154 chi-square 28, 64, 94, 95, 177, 178, 179, 180, 181 computational linguistics 3, 16, 89, 103, 112, 115, 133, 138 corpus-based 25, 28, 29, 31, 32, 33, 37, 58, 59, 86, 89, 92, 126, 132, 133, 135, 138, 140, 152, 157, 159, 188 corpus-driven 2, 12, 19, 38, 58, 59, 60, 61, 63, 64, 90, 117, 120, 131, 132, 133, 134, 135, 159, 188 corpus
linguistics 2, 6, 19, 21, 37, 50, 58, 72, 77, 78, 80, 82, 92, 95, 96, 97, 99, 101, 105, 107, 121, 123, 124, 128, 132, 139, 177, 179, 206 correlation 13, 27, 52, 62, 67, 77, 156, 168, 169, 172, 173, 178, 179, 180, 182 correspondence analysis 2, 31, 165, 195 data-driven 3, 18, 23, 43, 58, 59, 60, 61, 62, 188 data exploration 43, 59, 60, 63, 165, 189, 194 dependency annotation 109, 115, 116 grammar 61, 115, 116 tree 115, 116, 122, 144 treebank 11, 42, 109, 116, 122, 123, 126, 129, 134 diachronic linguistics 30, 31, 88, 164, 172 digital humanities 58, 101, 103, 125, 137, 149 empirical 2, 3, 18, 19, 25, 26, 27, 29, 30, 39, 42, 45, 46, 47, 61, 63, 65, 66, 80, 81, 86, 87, 90, 91, 92, 93, 97, 98, 117, 127, 128, 154, 156, 168, 190, 205, 206, 207 English early modern 43, 86, 108, 113, 123, 129, 192, 194 middle 48, 49, 86, 87, 123, 140, 142, 166, 167, 168, 169, 172, 175, 186, 190, 194 old 11, 48, 49, 67, 86, 87, 121, 123, 167, 169, 176 evidence 1, 2, 3, 6, 8, 9, 10, 11, 12, 13, 14, 16, 18, 25, 26, 28, 31, 36, 37, 38, 39, 40, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 55, 56, 57, 58, 59, 60, 63, 64, 67, 68, 70, 71, 80, 82, 84, 88, 89, 90, 92, 93, 97, 98, 99, 121, 124, 127, 132, 137, 153, 155, 156, 167, 168, 188 exploratory data analysis see data exploration FrameNet 54, 133 frequency 2, 5, 6, 12, 14, 28, 52, 77, 80, 81, 88, 117, 120, 121, 129, 133, 134, 136, 154, 155, 163, 172, 176, 178, 190, 192, 193, 194, 198, 199, 200, 201, 203, 204, 205 distribution 59, 60, 96 expected 178 raw 12, 92 relative 17, 29, 41, 68, 72, 84, 155, 171 glottochronology 69, 70, 71, 81, 82 gold standard 117, 136 Greek 19, 54, 112, 120, 122 ancient 9, 10, 12, 100, 101, 115, 116, 117, 119, 129, 131, 132, 134, 135, 151 classical 14, 111 historical linguistics 1, 2, 3, 7, 8, 12, 16, 17, 18, 19, 20, 22, 23, 24, 25, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 44, 45, 46, 47, 48, 50, 51, 52, 53, 54, 55, 56, 58, 61, 63, 64, 65, 66, 67, 68,
69, 70, 71, 72, 73, 77, 78, 79, 80, 81, 82, 83, 85, 86, 87, 88, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 103, 118, 124, 125, 127, 129, 130, 131, 137, 139, 140, 149, 151, 152, 153, 154, 156, 157, 163, 166, 181, 186, 187, 188, 189, 190, 191, 205, 206, 207 quantitative 1, 7, 24, 36, 37, 38, 44, 45, 46, 50, 51, 52, 53, 94, 99, 101, 130, 153, 156, 168, 187, 188, 189, 190, 191, 206 hypothesis testing 2, 10, 12, 34, 42, 43, 52, 53, 65, 93, 94, 95, 96, 97, 155, 165, 169, 180, 186, 189, 196 Index Thomisticus Treebank 102, 114, 116, 122, 123, 126, 134 language change 6, 7, 8, 33, 34, 37, 38, 43, 44, 61, 63, 64, 66, 70, 71, 83, 87, 88, 89, 90, 93, 98, 119, 127, 137, 138, 140, 142, 156, 157, 166, 167, 168, 186, 187, 190, 191, 192, 204, 205 language resource 53, 54, 55, 56, 57, 58, 60, 100, 127, 130, 131, 135, 143, 144, 151, 152, 159 Latin 9, 10, 11, 12, 13, 14, 42, 54, 62, 63, 64, 67, 90, 100, 103, 105, 109, 110, 111, 112, 114, 115, 116, 117, 118, 122, 123, 125, 126, 127, 131, 132, 133, 134, 135, 150, 157, 158, 159, 163, 164, 165, 181, 206 Dependency Treebank 109, 116, 123, 126, 134 LatinISE 125, 126, 127 lemma 11, 14, 99, 100, 109, 110, 114, 120, 126, 127, 129, 134, 136, 137, 163, 164, 193, 194, 195, 198, 201, 203, 204, 205 lemmatization 99, 102, 112, 114, 115, 121, 122, 127, 136, 138, 193, 194, 205 lexicon 11, 12, 53, 54, 55, 113, 114, 120, 130, 131, 132, 133, 134, 135, 136, 137, 144, 152, 159, 193, 205 linguistic innovation 43, 44, 191 linguistic spread 43, 44, 191 linked data 58, 142, 143, 144, 148, 149 markup 105, 124, 125, 137 metadata 11, 53, 57, 58, 59, 60, 99, 100, 103, 104, 105, 106, 107, 110, 126, 137, 138, 140, 142, 146, 155, 170, 193, 205 model parallelization 6, 91, 157, 205, 207 morphology 36, 37, 43, 44, 81, 82, 190, 193 multivariate analysis 13, 44, 157 model 205 techniques 12, 31, 34, 52, 65, 157, 162, 163, 164, 165, 166, 168, 181, 186, 187, 188, 189 natural language processing 16, 17, 102, 112, 117, 118, 125, 127, 148 null hypothesis 178 parsing 102, 117, 
118, 119, 133, 139 part-of-speech tagging 112, 114, 127, 138, 139, 143 Penn-Helsinki Parsed Corpus of Early Modern English 108, 169, 192 Penn-Helsinki Parsed Corpus of Middle English 123, 140, 141 phonology 36, 37, 81 phrase-structure annotation 108, 173 tree 5, 6, 115 pragmatics 10, 27, 89, 100, 110, 112, 119, 121, 122, 138, 140, 167, 169, 186 Prague Dependency Treebank 122, 123 probabilistic 3, 4, 5, 8, 9, 37, 39, 40, 44, 50, 60, 64, 65, 78, 85, 86, 89, 91, 92, 93, 155, 156, 157, 190, 206, 207 PROIEL Treebank 100, 126 dependency 115, 116, 122, 143, 144 phrase-structure 108, 115, 116 syntax 27, 36, 37, 48, 87, 92, 97, 117, 118, 121, 158, 167 qualitative analysis 14, 59, 91 quantitative analysis 10, 11, 12, 13, 14, 52, 59, 63, 80, 127, 189 Text Encoding Initiative 120, 124, 125, 137, 138 theory 1, 3, 6, 10, 16, 17, 40, 44, 55, 58, 59, 61, 63, 64, 65, 66, 85, 117, 121, 122, 128, 156, 157, 170 token 87, 107, 108, 111, 112, 120, 122, 125, 126, 134 tokenization 111, 112 treebank 91, 114, 115, 116, 117, 118, 120, 123, 135, 143, 144, 170, 173 TreeTagger 114, 126, 128 trend 11, 12, 42, 43, 50, 51, 127, 155, 157, 158, 172, 194 regression linear 34, 44, 76, 160, 162, 164 logistic 164, 181, 182, 183, 184, 186, 197, 199, 200, 201, 202 mixed-effects model 96, 164, 181, 197, 198, 199, 200, 201 model 2, 52, 76, 160, 162, 164, 181, 182, 183, 184, 198, 199, 200, 201, 202 multilevel model 96, 164, 181, 197, 198, 199, 200, 201 reproducibility 51, 53, 54, 55, 56, 129, 135, 156, 170, 206 Resource Description Framework 142, 144, 145, 147, 151 selectional preferences 89 semantics 14, 36, 37, 61, 63, 88, 89, 90, 181 sociolinguistics 58, 71, 119, 121, 137, 140, 142, 167, 168, 169, 172, 175, 186, 187 subcategorization 5, 89, 90, 144 syntactic tree 143, 193 usage 7, 16, 25, 43, 44, 54, 84, 86, 88, 92, 139, 159, 168 -based 25, 60, 89 valency lexicon 11, 54, 132, 133, 134, 135, 159 visualization 56, 139, 160, 182, 189, 195
WordNet 119 XML 105, 106, 107, 109, 138, 143, 147 OXFORD STUDIES IN DIACHRONIC AND HISTORICAL LINGUISTICS general editors Adam Ledgeway and Ian Roberts, University of Cambridge advisory editors Cynthia Allen, Australian National University; Ricardo Bermúdez-Otero, University of Manchester; Theresa Biberauer, University of Cambridge; Charlotte Galves, University of Campinas; Geoff Horrocks, University of Cambridge; Paul Kiparsky, Stanford University; Anthony Kroch, University of Pennsylvania; David Lightfoot, Georgetown University; Giuseppe Longobardi, University of York; George Walkden, University of Konstanz; David Willis, University of Cambridge published 1 From Latin to Romance Morphosyntactic Typology and Change Adam Ledgeway 2 Parameter Theory and Linguistic Change Edited by Charlotte Galves, Sonia Cyrino, Ruth Lopes, Filomena Sandalo, and Juanito Avelar 3 Case in Semitic Roles, Relations, and Reconstruction Rebecca Hasselbach 4 The Boundaries of Pure Morphology Diachronic and Synchronic Perspectives Edited by Silvio Cruschina, Martin Maiden, and John Charles Smith 5 The History of Negation in the Languages of Europe and the Mediterranean Volume I: Case Studies Edited by David Willis, Christopher Lucas, and Anne Breitbarth 6 Constructionalization and Constructional Changes Elizabeth Traugott and Graeme Trousdale 7 Word Order in Old Italian Cecilia Poletto 8 Diachrony and Dialects Grammatical Change in the Dialects of Italy Edited by Paola Benincà, Adam Ledgeway, and Nigel Vincent 9 Discourse and Pragmatic Markers from Latin to the Romance Languages Edited by Chiara Ghezzi and Piera Molinelli 10 Vowel Length from Latin to Romance Michele Loporcaro 11 The Evolution of Functional Left Peripheries in Hungarian Syntax Edited by Katalin É.
Kiss 12 Syntactic Reconstruction and Proto-Germanic George Walkden 13 The History of Low German Negation Anne Breitbarth 14 Arabic Indefinites, Interrogatives, and Negators A Linguistic History of Western Dialects David Wilmsen 15 Syntax over Time Lexical, Morphological, and Information-Structural Interactions Edited by Theresa Biberauer and George Walkden 16 Syllable and Segment in Latin Ranjan Sen 17 Participles in Rigvedic Sanskrit The Syntax and Semantics of Adjectival Verb Forms John J. Lowe 18 Verb Movement and Clause Structure in Old Romanian Virginia Hill and Gabriela Alboiu 19 The Syntax of Old Romanian Edited by Gabriela Pană Dindelegan 20 Grammaticalization and the Rise of Configurationality in Indo-Aryan Uta Reinöhl 21 The Rise and Fall of Ergativity in Aramaic Cycles of Alignment Change Eleanor Coghill 22 Portuguese Relative Clauses in Synchrony and Diachrony Adriana Cardoso 23 Micro-change and Macro-change in Diachronic Syntax Edited by Eric Mathieu and Robert Truswell 24 The Development of Latin Clause Structure A Study of the Extended Verb Phrase Lieven Danckaert 25 Transitive Nouns and Adjectives Evidence from Early Indo-Aryan John J. Lowe 26 Quantitative Historical Linguistics A Corpus Framework Gard B. Jenset and Barbara McGillivray In preparation Negation and Nonveridicality in the History of Greek Katerina Chatzopoulou Morphological Borrowing Francesco Gardani Nominal Expressions and Language Change From Early Latin to Modern Romance Giuliana Giusti The Historical Dialectology of Arabic: Linguistic and Sociolinguistic Approaches Edited by Clive Holes A Study in Grammatical Change The Modern Greek Weak Subject Pronoun τος and its Implications for Language Change and Structure Brian D.
Joseph Gender from Latin to Romance Michele Loporcaro Reconstructing Pre-Islamic Arabic Dialects Alexander Magidow Word Order Change Edited by Anna Maria Martins and Adriana Cardoso Grammaticalization from a Typological Perspective Heiko Narrog and Bernd Heine Word Order and Parameter Change in Romanian Alexandru Nicolae The History of Negation in the Languages of Europe and the Mediterranean Volume II: Patterns and Processes Edited by David Willis, Christopher Lucas, and Anne Breitbarth Verb Second in Medieval Romance Sam Wolfe Palatal Sound Change in the Romance Languages Diachronic and Synchronic Perspectives André Zampaulo