Springer Graduate Texts in Philosophy

Stephan Kornmesser · Alexander Max Bauer · Mark Alfano · Aurélien Allard · Lucien Baumgartner · Florian Cova · Paul Engelhardt · Eugen Fischer · Henrike Meyer · Kevin Reuter · Justin Sytsma · Kyle Thompson · Marc Wyszynski

Experimental Philosophy for Beginners
A Gentle Introduction to Methods and Tools

Springer Graduate Texts in Philosophy

The Springer Graduate Texts in Philosophy series offers self-contained textbooks aimed at the graduate level, covering all areas of philosophy from classical philosophy to contemporary topics in the field. The texts will, in general, include teaching aids (such as exercises and summaries) and range from graduate-level introductions to advanced topics. The publications in this series offer volumes with a broad overview of theory in core topics of the field as well as volumes with comprehensive approaches to a single key topic. Thus, the series offers publications both for general introductory courses and for courses focused on a sub-discipline within philosophy.

The series publishes:
• Works from all of the philosophical traditions
• Sourcebooks, lecture notes for advanced-level courses, and textbooks covering specialized topics
• Interdisciplinary introductions, where philosophy overlaps with other scientific or practical areas

We aim to make a first decision within 1 month of submission. In case of a positive first decision, the work will be provisionally contracted: the final decision about publication will depend upon the result of the anonymous peer review of the complete manuscript. We aim to have the complete work peer-reviewed within 3 months of submission.
Proposals should include:
• A short synopsis of the work or the introduction chapter
• The proposed Table of Contents
• CV of the lead author(s)
• List of courses for possible course adoption

The series discourages the submission of manuscripts that are below 65,000 words in length.

Stephan Kornmesser, Department of Philosophy, University of Oldenburg, Oldenburg, Lower Saxony, Germany
Alexander Max Bauer, Department of Philosophy, University of Oldenburg, Oldenburg, Lower Saxony, Germany
Mark Alfano, Philosophy Department, Macquarie University, Sydney, New South Wales, Australia
Aurélien Allard, Department of Philosophy, École Normale Supérieure Paris, Paris, France
Lucien Baumgartner, Department of Philosophy, University of Zurich, Zurich, Switzerland
Florian Cova, Département de Philosophie, Université de Genève, Geneva, Switzerland
Paul Engelhardt, School of Psychology, University of East Anglia, Norwich, UK
Eugen Fischer, School of Politics, Philosophy, Language and Communication Studies, University of East Anglia, Norwich, UK
Henrike Meyer, Berlin, Germany
Kevin Reuter, Department of Philosophy, University of Zurich, Zurich, Switzerland
Justin Sytsma, Philosophy Programme, Victoria University of Wellington, Wellington, New Zealand
Kyle Thompson, Harvey Mudd College, Claremont, CA, USA
Marc Wyszynski, Faculty of Mathematics and Computer Science, University of Bremen, Bremen, Germany

ISSN 2627-6046  ISSN 2627-6054 (electronic)
Springer Graduate Texts in Philosophy
ISBN 978-3-031-58048-2  ISBN 978-3-031-58049-9 (eBook)
https://doi.org/10.1007/978-3-031-58049-9

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024

This work is subject to
copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

If disposing of this product, please recycle the paper.

Contents

1 Introduction: Setting Out for New Shores
   Stephan Kornmesser, Alexander Max Bauer, Mark Alfano, Aurélien Allard, Lucien Baumgartner, Florian Cova, Paul Engelhardt, Eugen Fischer, Henrike Meyer, Kevin Reuter, Justin Sytsma, Kyle Thompson, and Marc Wyszynski
   1.1 The Book's Overall Idea
   1.2 How to Use This Book
   1.3 Experimental Philosophy: General Motivations
   1.4 Replicability in Experimental Philosophy
   1.5 Ethics Boards
   1.6 This Book's Chapters
      1.6.1 Chapter 2: Quantitative Vignette Studies: χ2 Tests—Empirically Reconsidering the Constative–Performative Distinction
      1.6.2 Chapter 3: Quantitative Vignette Studies: t-Tests—Case Studies on Judgments About Unfelt Pains
      1.6.3 Chapter 4: Quantitative Vignette Studies: Correlations, Regressions, and Structural Equation Modeling—An Application to Experimental Philosophy of Free Will
      1.6.4 Chapter 5: Interactive and Incentivized Online Experiments: Non-Cooperation in Give-Some and Take-Some Dilemmas
      1.6.5 Chapter 6: Corpus Analysis: Building and Using Corpora—A Case Study on the Use of "Conspiracy Theory"
      1.6.6 Chapter 7: Corpus Analysis: Lexical Dispersion, Semantic Time Series, and Semantic Network Analysis—An R Studio Pipeline
      1.6.7 Chapter 8: Psycholinguistic Experiments: A Case Study on Default Inferences in Philosophical Arguments—Analysing the Argument from Illusion
      1.6.8 Chapter 9: Qualitative Interview Studies: Constructing an Interview Study Based on a Paradigm Example in "Ought Implies Can"
   References

2 Quantitative Vignette Studies: χ2 Tests—Empirically Reconsidering the Constative–Performative Distinction
   Alexander Max Bauer, Stephan Kornmesser, and Henrike Meyer
   2.1 Introduction
   2.2 Philosophical Background
      2.2.1 Constative and Performative Utterances
      2.2.2 Criteria for Distinguishing Between Constative and Performative Utterances
      2.2.3 Categories of Performative Utterances
   2.3 Research Questions
      2.3.1 Aim 1: Do Austin's Criteria for the Classification of Constative and Performative Utterances Work for Speakers of a Speech Community?
      2.3.2 Aim 2: Can the Event Character of Performative Utterances Be Used as a Criterion for Distinguishing Between Constative and Performative Utterances?
      2.3.3 Aim 3: Do Performative Utterances Have a Truth Value?
   2.4 Study Design
      2.4.1 Operationalizations of the Criteria for the Classification of Constative and Performative Utterances
      2.4.2 Vignettes
      2.4.3 Statistical Background
   2.5 Creating the Survey
      2.5.1 Setting up LimeSurvey and Creating a New Project
      2.5.2 Preparation of the Questionnaire
   2.6 Running the Survey
   2.7 Analysis
      2.7.1 Data Export
      2.7.2 Getting Started With Stata
      2.7.3 Data Import
      2.7.4 Data Cleaning
      2.7.5 Analysis of Responses
   2.8 Interpretation
      2.8.1 Interpreting the Data for Aim 1: Do Austin's Criteria for Classifying Constative and Performative Utterances Work for Speakers of a Language Community?
      2.8.2 Interpreting the Data for Aim 2: Can the Event Character of Performative Utterances Be Used as a Criterion for Distinguishing Between Constative and Performative Utterances?
      2.8.3 Interpreting the Data for Aim 3: Do Performative Utterances Have a Truth Value?
      2.8.4 Outlook and Possible Objections
   2.9 Further Literature
   References

3 Quantitative Vignette Studies: t-Tests—Case Studies on Judgments About Unfelt Pains
   Justin Sytsma
   3.1 Introduction
   3.2 Philosophical Background: The Standard View of Pain in Philosophy
   3.3 Illustrating One-Sample t-Tests
      3.3.1 The Empirical Research Process
      3.3.2 Background
      3.3.3 Research Question
      3.3.4 Research Design
      3.3.5 Constructing the Instrument
      3.3.6 Conducting the Study
      3.3.7 Analysis
      3.3.8 Interpretation
   3.4 Further Studies and Further Tests
      3.4.1 Independent Samples t-Tests
      3.4.2 Sample Size, Effect Size, and Power
      3.4.3 Binary Answer Choices and Dichotomization
      3.4.4 Paired Samples t-Tests
   3.5 Correcting for Multiple Comparisons
      3.5.1 Why Should We Apply a Correction?
      3.5.2 The Bonferroni Method and the Holm Method
      3.5.3 Illustrating the Holm Method
   References

4 Quantitative Vignette Studies: Correlations, Regressions, and Structural Equation Modeling—An Application to Experimental Philosophy of Free Will
   Florian Cova and Aurélien Allard
   4.1 Introduction
   4.2 Philosophical Background: Experimental Approaches to the Free Will Problem
      4.2.1 What Is Free Will?
      4.2.2 The Natural Compatibilists vs. Natural Incompatibilists Debate
      4.2.3 Bypassing: An Error Theory for Incompatibilist Intuitions
      4.2.4 Intrusion: An Error Theory for Compatibilist Intuitions
      4.2.5 Limitations
   4.3 Research Questions
   4.4 Experimental Design
   4.5 Implementing the Research Design and Conducting the Study
   4.6 Analysis I: Preparing Our Dataset
      4.6.1 Downloading and Opening the Datafile
      4.6.2 Data Preparation
      4.6.3 Exclusions
   4.7 Analysis II: A First Look at the Data
      4.7.1 Demographics
      4.7.2 Moral Responsibility Ratings Across Conditions
      4.7.3 Computing Bypassing and Intrusion Scores
      4.7.4 Transforming the New Measure of Intrusion into a Binary Variable
   4.8 Analysis III: Using Correlations to Answer (Q1)
      4.8.1 Correlations: An Introduction
      4.8.2 An Introduction to Correlations in R
      4.8.3 From Correlations to Regressions
   4.9 Analysis IV: Using Regressions to Answer (Q1)
      4.9.1 What Is Linear Regression?
      4.9.2 Linear Regression Models in R
      4.9.3 What Is Multiple Linear Regression?
      4.9.4 Multiple Regression in R
   4.10 Analysis V: Using Mediation Analysis and Structural Equation Modeling to Answer (Q2)
      4.10.1 Testing the Effect of Condition on Attributions of Free Will and Moral Responsibility
      4.10.2 Mediation Analysis with a Single Mediator
      4.10.3 Multiple Mediation and Structural Equation Modeling
      4.10.4 Using the Lavaan Package for Multiple Mediation Analysis
      4.10.5 Using Structural Equation Modeling to Estimate Latent Variables
   4.11 Conclusion
   References

5 Interactive and Incentivized Online Experiments: Non-Cooperation in Give-Some and Take-Some Dilemmas
   Marc Wyszynski
   5.1 Introduction
   5.2 Philosophical Background: Conducting an Interactive and Incentivized Study in Experimental Philosophy
   5.3 Development of the Research Question to Be Investigated and Hypotheses
   5.4 Development of the Experimental Design for Testing the Hypothesis
      5.4.1 Experimental Paradigms
      5.4.2 Constructing the Design
   5.5 Implementation of the Instrument
      5.5.1 Procedure
      5.5.2 Programming Experiments Using Unipark (Tivian EFS)
      5.5.3 Ethical Standards and Ethics Committee Application
      5.5.4 Cognitive Interviews and Pretests
      5.5.5 How to Go into the Field (Online)
   5.6 Analysis of the Test Results
      5.6.1 Analysis Using JASP
      5.6.2 Analysis Using RStudio
   5.7 The Answer to the Research Question
   5.8 Final Remarks
   References

6 Corpus Analysis: Building and Using Corpora—A Case Study on the Use of "Conspiracy Theory"
   Kevin Reuter and Lucien Baumgartner
   6.1 Introduction
   6.2 Exploration, Examples, and Exercises
   6.3 Corpus Analysis and Philosophy: Why, Who, and How?
      6.3.1 Why Do Corpus Analysis?
      6.3.2 A Somewhat Different Literature Review
      6.3.3 Common Patterns and Doing It Yourself
   6.4 A Corpus Analysis of "Conspiracy Theory" with Pre-built Corpora
      6.4.1 The Evaluative Nature of "Conspiracy Theory" and "Conspiracy Theorist"
      6.4.2 How Has the Meaning of "Conspiracy Theory" Developed over Time?
      6.4.3 Other Approaches and Limitations of Analyses with Pre-built Corpora
   6.5 Building Your Own Corpus and Analyzing "Conspiracy Theory"
      6.5.1 Prerequisites
      6.5.2 Why APIs Are Your New Best Friends
      6.5.3 Corpus Annotation
      6.5.4 How Crazy Are Conspiracy Theories?
      6.5.5 Discussion
   6.6 Conclusion
   Appendix
   References

7 Corpus Analysis: Lexical Dispersion, Semantic Time Series, and Semantic Network Analysis—An R Studio Pipeline
   Mark Alfano
   7.1 Introduction and Philosophical Background
   7.2 Question Development and Hypothesis Formation
   7.3 Corpus Selection and Building
   7.4 Corpus Analysis at the Level of the Word and Document
   7.5 Use of Pre-built and Custom Dictionaries
   7.6 Semantic Network Analysis and Visualization
   7.7 Conclusion
   References

8 Psycholinguistic Experiments: A Case Study on Default Inferences in Philosophical Arguments—Analysing the Argument from Illusion
   Eugen Fischer and Paul E. Engelhardt
   8.1 Introduction
   8.2 Philosophical Background
      8.2.1 Evidential and Explanatory Experimental Philosophy
      8.2.2 Psycholinguistic Inspirations for Experimental Philosophy
      8.2.3 Experimental Argument Analysis
      8.2.4 Worked Example: Analysing the Argument from Illusion
   8.3 Developing the Research Question and Hypotheses
      8.3.1 From Philosophical to Empirical Questions
      8.3.2 The Phenomenon of Empirical Interest
      8.3.3 Formulating Research Questions and Hypotheses
   8.4 Developing the Approach
      8.4.1 Cancellation Paradigm
      8.4.2 Different Implementations
      8.4.3 Study Design: Worked Example
   8.5 Methods: Implementing the Approach
      8.5.1 Basics of Item Development
      8.5.2 Drafting Items
      8.5.3 Norming Studies
   8.6 Procedure: Conducting the Experiment
      8.6.1 Tasks
      8.6.2 Apparatus
   8.7 Analysing the Results
      8.7.1 Preparatory Steps
      8.7.2 Conducting Inferential Analyses: ANOVAs
      8.7.3 Worked Example
   8.8 Interpreting the Results
      8.8.1 Assessing the Predictions
      8.8.2 Interpreting Results Beyond Predictions
      8.8.3 Limitations
   8.9 Writing Up
   Suggested Further Reading
   References

9 Qualitative Interview Studies: Constructing an Interview Study Based on a Paradigm Example in "Ought Implies Can"
   Kyle Thompson
   9.1 Introduction
   9.2 Why Use Qualitative Interviews in Experimental Philosophy?
   9.3 Philosophical and Methodological Background: "Ought Implies Can"
      9.3.1 A Paradigm Example in "Ought Implies Can"
      9.3.2 Your Research Question and the Versatility of Using Interviews
      9.3.3 Success in Using Interviews on "Ought Implies Can"
   9.4 Experimental Design: Designing Your Methods and Interview Questions According to the Golden and Papyrus Rules
      9.4.1 Phase One of Three: Deciding on, and Jotting Down, the Overarching Structure of Your Study
      9.4.2 Phase Two of Three: Deciding What Subjects Will Interact with
      9.4.3 Phase Three of Three: Writing Your Questions and Considering the Freedom to Improvise
   9.5 Implementation of the Experimental Design: Making the Materials and Gathering the Tools for Your Study
      9.5.1 In Person? On Zoom? Over the Phone? Using the View-As Rule to Choose
      9.5.2 Formatting Questionnaires: An Example from an In-Person, Paper-Based Study on "Ought Implies Can"
471 9.5.3 Making the Interview Script, Researcher Script, Flyer, Informed Consent Form, and Other Important Documents................................................ 473 xii Contents 9.5.4 Anticipating Other Needs and Adopting Your Style as a Researcher................................................ 478 9.6 “It’s Recording”: Conducting the Study Itself............................ 479 9.6.1 From Coordinating with Participants to Carrying Out the Interviews.................................................... 480 9.7 Utterance Crunching: Transcribing Audio and Analyzing Data in an Interview Study..................................................... 482 9.7.1 Transcribing............................................................ 483 9.7.2 Analysis, with or Without Coding................................ 489 9.8 What It Means: Interpreting Your Results and Incorporating Them into Your Work................................... 495 9.8.1 Interpreting the Results and Finding Their Limitations... . . 496 9.8.2 Planting Your Data into Your Work.............................. 498 9.9 Wrapping Up: The Challenge and Value of the Interview Study... . . 500 References.................................................................................. 502 Index.............................................................................................. 505 Chapter 1 Introduction: Setting Out for New Shores Stephan Kornmesser, Alexander Max Bauer, Mark Alfano, Aurélien Allard, Lucien Baumgartner, Florian Cova, Paul Engelhardt, Eugen Fischer, Henrike Meyer, Kevin Reuter, Justin Sytsma, Kyle Thompson, and Marc Wyszynski “Experimental philosophy is philosophy with a little something extra” (Sytsma et al., 2023, 9). This “little something extra” is the fact that experimental philosophers conduct their own experimental studies to provide empirical insights to address S. Kornmesser () · A. M. 
Bauer Department of Philosophy, University of Oldenburg, Oldenburg, Lower Saxony, Germany e-mail: [email protected]; [email protected] M. Alfano Philosophy Department, Macquarie University, Sydney, New South Wales, Australia e-mail: [email protected] A. Allard Department of Philosophy, École Normale Supérieure Paris, Paris, France L. Baumgartner · K. Reuter Department of Philosophy, University of Zurich, Zurich, Switzerland e-mail: [email protected]; [email protected] F. Cova Département de Philosophie, Université de Genève, Geneva, Switzerland e-mail: fl[email protected] P. Engelhardt School of Psychology, University of East Anglia, Norwich, UK e-mail: [email protected] E. Fischer School of Politics, Philosophy, Language and Communication Studies, University of East Anglia, Norwich, UK e-mail: e.fi[email protected] H. Meyer Berlin, Germany e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 S. Kornmesser et al., Experimental Philosophy for Beginners, Springer Graduate Texts in Philosophy, https://doi.org/10.1007/978-3-031-58049-9_1 1 2 S. Kornmesser et al. philosophical issues. They use qualitative and quantitative research methods such as interactive experiments, reaction time studies, corpus analysis, vignette studies, interviews, and so forth. However, the development of experimental philosophy has led to a quandary: On the one hand, experimental philosophy (x-phi) is an extensively expanding area of philosophy, as a growing number of experimental philosophers are concerned with a multitude of philosophical issues, producing an ever-growing number of new findings. On the other hand, professional philosophers and philosophy students are usually not familiar with the relevant experimental methods, as those methods are— in most cases—not part of a philosophical curriculum. 
Access is also made difficult because there are very few textbooks that are concerned with teaching methods specifically in the context of x-phi. This is where our volume comes into play: Experimental Philosophy for Beginners is a textbook on x-phi that aims to guide students of philosophy as well as professional philosophers through the steps of developing their own experimental investigations of philosophical issues, be it, for example, in bachelor’s or master’s theses, dissertations, or other research projects. Before we demonstrate how to do x-phi in Chaps. 2, 3, 4, 5, 6, 7, 8, and 9, we would like to make some general remarks about this book and experimental philosophy: In Sect. 1.1, we introduce this textbook’s overall didactic idea and in Sect. 1.2 we explain how best to use the individual chapters. In Sect. 1.3, we discuss some general motivations for x-phi and distinguish between different strands and programs of x-phi. Section 1.4 deals with the replicability of experiments in xphi which, as we report, is itself explored experimentally within x-phi. Section 1.5 addresses the topic of ethics boards, which in most countries need to approve an experiment before it can be conducted. Finally, in Sect. 1.6, we introduce the individual chapters of this book. 1.1 The Book’s Overall Idea To some extent, the current volume breaks new didactical ground with an approach that may seem quite atypical for philosophy textbooks. Instead of explaining in general terms how experimental philosophy works and then leaving it to the readers J. Sytsma Philosophy Programme, Victoria University of Wellington, Wellington, New Zealand e-mail: [email protected] K. Thompson Harvey Mudd College, Claremont, CA, USA e-mail: [email protected] M. 
Wyszynski Faculty of Mathematics and Computer Science, University of Bremen, Bremen, Germany e-mail: [email protected] 1 Introduction: Setting Out for New Shores 3 to apply what they have just read, we aim to explain methods and tools inductively through examples. Each chapter provides a philosophical study (or a set of related studies) using a particular research method. That is, each chapter demonstrates how to use a specific method and walks the reader through the entire research process, starting with the development of a research question and the design of an appropriate experiment, continuing with the implementation of the experiment and, finally, showing how to analyze and interpret the obtained data. In doing so, we provide readers with the means to conduct their own experimental studies by transferring our examples to their research projects. Hence, we prefer a bottom-up over a topdown process of learning. Or, in more Kuhnian terms, we aim to introduce readers to experimental paradigms by means of exemplars. Throughout the book, we always keep in mind that it is aimed at readers who may already know a lot about philosophy and may even be familiar with several findings from x-phi but have yet to come into contact with the practice of empirical research. Therefore, we try to follow two overarching principles in each and every chapter: 1. We presuppose as little as possible. 2. We explain everything as simply as possible. Each chapter presents a different method to investigate a philosophical issue. We start with the most basic experimental paradigms and statistical methods and then discuss increasingly sophisticated methods. The nine chapters of our volume cover the following methods: 1. Quantitative Vignette Studies: Chi squared Tests 2. Quantitative Vignette Studies: t-Tests 3. Quantitative Vignette Studies: Correlations, Regressions, and Structural Equation Modeling 4. Interactive and Incentivized Experiments 5. 
Corpus Analysis: Building and Using Corpora
6. Corpus Analysis: Semantic Network Analysis
7. Psycholinguistic Experiments
8. Qualitative Interview Studies

The most frequently used method in x-phi is probably that of vignette studies, which examine judgments about hypothetical cases, verbally described by “vignettes.” Chapters 2, 3 and 4 deal with vignette studies and statistical analyses of increasing complexity. The fifth chapter then introduces the method of interactive and incentivized experiments commonly used in economics. Chaps. 6 and 7 demonstrate how to use the linguistic methods of corpus analysis and semantic network analysis for philosophical ends. Chap. 8 discusses the design of behavioral experiments, quite generally, and demonstrates, more specifically, the use of methods from psycholinguistics, namely eye tracking. Finally, Chap. 9 introduces how to use qualitative interview studies to investigate philosophical issues. Along with those methods, we introduce several tools to implement experimental designs and analyze data. It goes without saying that we cannot provide full and basic introductions to these tools. However, we aim to present the essential steps for using them in the studies presented here. For creating questionnaires, we provide insight into using LimeSurvey (Chap. 2) and Qualtrics (Chap. 4). For creating interactive online experiments, we show how to use Unipark (Chap. 5). For analyzing data, we provide applications of Stata (Chap. 2), R (Chaps. 3, 4, 5, 6, and 7), JASP (Chap. 5), and SPSS (Chap. 8). We also explain how to report findings and write up studies for publication (Chap. 8).

1.2 How to Use This Book

This book introduces how to do x-phi by working through the whole research process of several experimental studies using different methods.
Therefore, most of the chapters have approximately the same structure, following a typical research process when, for example, running a survey:

• Initially, we introduce the philosophical background of a particular area.
• From this background, we develop research questions that we want to investigate.
• We then think of an experimental design for investigating our questions.
• Next, we implement this design (e.g., by developing a questionnaire).
• Then, the experiment is conducted (e.g., by having participants answer a questionnaire).
• The collected data then has to be analyzed.
• Finally, we interpret the analyzed data to answer the research questions posed at the beginning.

However, there is no need to work your way through the entire book, chapter by chapter. Each chapter can stand alone. In consequence, providing the whole research process in each chapter leads to some repetition between some chapters. Therefore, if you are familiar with some aspects—for example, because it was explained in a preceding chapter or because you know it already—you can always skip the respective sections without losing the thread. For example, if you already know how to create an online questionnaire with LimeSurvey (as explained in Sect. 1.5 of Chap. 2), you can easily skip this section and proceed with the chapter’s subsequent sections. For most of the chapters, we suggest further readings and provide supplementary online materials such as questionnaires, data, or code to analyze data. These materials might help you transfer what you learnt in a chapter to your own project or practice analyzing the provided data using the provided code. You can find the links to the supplementary online materials in the respective chapters. Complete newcomers to empirical methods might find it useful to use this book in conjunction with a gentle introductory textbook in quantitative research design and analysis (e.g., Bourne, 2017; Field and Hole, 2003).
Complete newcomers to x-phi might find it useful to have a look at the sister volume to this textbook, The Compact Compendium to Experimental Philosophy (Bauer and Kornmesser, 2024), to get an impression of the wide variety of philosophical uses to which empirical studies can be put (see also Alexander, 2012; Sytsma and Livengood, 2015). However, a brief preliminary orientation over different kinds of philosophical uses of empirical methods might give newcomers a first idea of what to look out for—and what not to expect. We now offer such a first orientation in Sect. 1.3.

1.3 Experimental Philosophy: General Motivations

Experimental philosophy is typically, and aptly, characterized as an intellectual movement that seeks to recruit and adapt empirical methods to address philosophical questions and problems. While the general ambition to render philosophical problems more empirically tractable—and more tractable, tout court—may be quite widely shared in x-phi (Fischer and Sytsma, 2024), x-phi is neither a school of thought nor a research program, let alone a unified approach to a distinct set of preset research questions. Rather, it contributes to a wide range of research programs, while initiating some new ones. Probably only three assumptions are shared by all experimental philosophers: (1) At various levels—ranging from the construction of specific philosophical theories to fundamental methodological commitments—philosophical thought often advances or relies on empirical claims or assumptions. (2) Empirical claims and assumptions need to be assessed in the light of empirical evidence. (3) Where nobody else has generated the relevant empirical evidence, philosophers should themselves start to address the philosophically relevant gap in empirical research, if at all feasible.
By now, experimental philosophers have addressed such gaps in practically all subdisciplines of philosophy—with results that have sometimes transformed the pertinent philosophical debates. The three widely shared assumptions imply little beyond a general commitment to methodological naturalism. Naturalism as a methodological stance is independent of naturalism as a metaphysical position. Methodological naturalism simply recommends that we address philosophical problems by drawing—also—on empirical findings from the sciences. This methodological stance, however, can be developed in at least two markedly different directions. One is more traditional, while the other has gathered momentum only with the advent of x-phi. Traditional methodological naturalism seeks to address philosophical problems about natural phenomena like, e.g., vision or intentional action, by drawing on empirical findings about those phenomena (say, findings from the psychology or neuroscience of vision, or of action control). This naturalism translates into the recommendation that philosophical theories of natural phenomena should be built on the best available scientific theories of those phenomena (for a review, see Kornblith, 2016). Such empirically informed philosophizing, sometimes called “empirical philosophy” (Prinz, 2008), has become widespread practice in the philosophy of mind (Knobe, 2015; Yan and Liao, 2023), and is no longer uncommon in several other areas of philosophy, including areas of value theory like aesthetics (for a review, see Robson and Currie, 2022) or ethics (for a review, see Christen et al., 2014). However, natural and social scientists do not always do philosophers the favor of empirically investigating precisely the phenomena of philosophical interest, or the aspects of most philosophical interest.
Accordingly, experimental philosophy has arisen within the framework of traditional naturalism from philosophers’ efforts to fill philosophically regrettable gaps in empirical research through empirical studies of their own. A less traditional form of methodological naturalism (moving in the second of the two directions mentioned above) has been characterized as “meta-philosophical naturalism” (Fischer and Sytsma, 2024). It seeks to address philosophical problems by studying how people think and talk about the phenomena of interest. For example, traditional methodological naturalists working on philosophical questions about vision or free will consider empirical findings from the psychology or neuroscience of vision or action control. By contrast, meta-philosophical naturalists interested in—often different—philosophical questions about the same phenomena build on the psychologies of language, judgment, and reasoning and empirically study what intuitive judgments people make about philosophically relevant cases pertaining to, say, vision or free will, what people believe about these phenomena, how people describe them, and how people reason about them. There is not much new about the idea, as such, that many important philosophical problems can be rendered more tractable by examining how people think and talk about the phenomena of interest. Twentieth-century analytic philosophy has developed this idea in a variety of ways (Daly, 2010). However, the mainstream of analytic philosophy has employed armchair methods to tap into the way in which people think and talk. Meta-philosophical naturalism, by contrast, recommends that philosophers should empirically investigate how people think and talk, to address philosophical problems. Many experimental philosophers followed this recommendation in engaging with the analytic tradition. 
Among the approaches from analytic philosophy that rely on claims about people’s judgments and beliefs, the most prominent at the last turn of the century (and to this day) involve the “method of cases”: In thought experiments, philosophers consider hypothetical cases (like Gettier or trolley cases), elicit intuitive judgments about them, and deploy these judgments as defeasible evidence for or against philosophical theories (for a review, see Machery, 2017, 11–44). This method is often embedded in philosophical theory construction that works back and forth between intuitions elicited by different thought experiments and various background beliefs, until “reflective equilibrium” and a coherent set of judgments and beliefs have been achieved. While used most explicitly in moral and political philosophy (e.g., Foot, 1967; Rawls, 1971), this approach is sufficiently common to have been characterized as analytic philosophy’s “standard justificatory procedure” (Bealer, 1996, 4). It is frequently used to assess modal and moral claims, and proposed analyses of concepts. The strand of x-phi to engage most directly with this approach is evidential experimental philosophy, which seeks to assess the evidential value of philosophically relevant intuitive case judgments (“philosophical intuitions”) and beliefs. The more conservative “positive program” of evidential x-phi largely accepts the “standard justificatory procedure” and seeks to place it on a more secure, empirical, footing: The positive program empirically elicits philosophical intuitions from wider samples to establish, e.g., which of competing conceptions of a phenomenon (say, of free will; Nahmias et al., 2006; see Chap.
3 in this volume) can lay claim to being the “common sense” or “folk conception,” or whether philosophers’ intuitions are indeed representative of a wider community and reflective of a “folk concept” (say, the folk concept of knowledge; Starmans and Friedman, 2020). The more revisionist “negative program” focuses on methodological assumptions underlying the “standard justificatory procedure,” and specifically the method of cases, and seeks to determine whether they are fit for purpose. As used in conceptual analysis, for example, the method of cases assumes that case judgments are reflective of the content of the concept at issue. Accordingly, the negative program examines to what extent case judgments are subject to biases, presentation effects (order, framing), demographic effects—and, more generally, sensitive to “epistemically otiose” factors that undermine their value as evidence (for a review, see Machery, 2017, 45–90). The second main strand of x-phi is also responsive to intuition-driven philosophizing: Explanatory experimental philosophy seeks to explain philosophical intuitions. There is considerable overlap with evidential x-phi, as many of the debunking and other explanations developed are at the service of epistemological evaluation (e.g., Fischer and Sytsma, 2021, on philosophical zombies; Gerken et al., 2020, on knowledge attributions; or Nadelhoffer et al., 2020, on free will). However, much x-phi work pursues explanatory aims for their own sake and follows the same pattern as research in cognitive science, where researchers seek to identify new effects and explain them in terms of cognitive processes, and typically aim to explain a maximum number of effects by reference to a minimum number of processes (for a review, see Knobe, 2016). This work can be regarded as giving rise to a distinct strand of experimental philosophy that is continuous with cognitive science. 
This work, too, has been shaped by philosophical motivations and has led to philosophically highly pertinent results: Explanatory x-phi was initiated largely by successful attempts to garner empirical support for the philosophically important and provocative hypothesis that supposedly descriptive attributions, e.g., of intentions, are influenced by the moral valence of the outcome (Knobe, 2003), sparking efforts to explain this effect. And one main upshot of explanatory x-phi has been that case judgments are strongly influenced by cognitive factors distinct from the content of the specific concept employed (say, intention)—which hence cannot be studied with simple implementations of the method of cases that do not avail themselves of more advanced techniques of statistical analysis. Over the last 10 years, x-phi has considerably expanded in scope, beyond its initial focus on philosophical intuitions, to examine (folk) conceptions (covered in Chaps. 2, 3, and 4 in this volume), decisions (Chap. 5) and inferences (Chaps. 8 and 9), as well as the semantics and pragmatics of philosophically interesting terms, and their change over time (Chaps. 6 and 7), among other things. Repeatedly, fresh findings have led to new motivations for familiar philosophical projects, and for x-phi. Take, for example, conceptual analysis. The structure of concepts of interest constrains the philosophical aims that can be pursued by their analysis: Classical (definable) concepts allow conceptual analysis to establish a priori necessities. The analysis of natural kind terms may help us establish a posteriori necessities. The analysis of concepts with prototype structure, by contrast, only reveals what defeasible default inferences people will automatically make from them. But only empirical investigation can reveal the structure of any given concept that is already linguistically realized in a natural language—and the aims the analysis of the concept can sensibly pursue.
The empirical study of concepts has thus established the key role of empirical investigation in conceptual analysis. It has also led experimental philosophers to adopt the study and epistemological assessment of default inferences as a new aim of such analysis (e.g., Machery, 2017, 208–244). Appreciation of structural differences between concepts, in turn, motivates extending the range of empirical investigations well beyond empirical implementations of the method of cases: We need to study several different properties of a concept’s potential component features, such as their necessity, universality, typicality, diagnosticity, centrality, and salience, with the empirical methods psycholinguists have developed for the purpose (see, e.g., Beisbart and Reuter, 2021). In brief, while three rather uncontroversial assumptions are enough to take up x-phi, philosophical motivations multiply, once one has started.

1.4 Replicability in Experimental Philosophy

As we saw in Sect. 1.3, experimental philosophers produce their own empirical findings by conducting their own experiments. However, assume that an experiment shows a correlation between two variables with a certain statistical significance. Does this mean that one would get approximately the same significance if one replicates (that is, roughly, repeats) the experiment? As experience from other sciences shows, this might not always be the case: Since 2011, social and biomedical sciences have faced a replicability crisis. International teams of researchers have tried to reproduce the results of experiments published in the most prestigious journals of their respective fields. While replication success varies by field, results have often been dismal, ranging from 25% in social psychology and 46% in cancer biology to a much higher 87% in personality psychology (Errington et al., 2021; Open Science Collaboration, 2015; Soto, 2019).
Two major reasons for these low replicability rates have been proposed: publication bias and p-hacking. Publication bias corresponds to the reluctance to publish negative results, that is, results that do not indicate a difference between groups or experimental conditions (Franco et al., 2014). Negative results are generally seen as less interesting than positive results: in psychology, for instance, researchers often hope to show that they can successfully influence behaviors by manipulating the participants’ environment. This selection for the most interesting results leads to an exaggeration of the effect sizes studied by scientists. P-hacking corresponds to the exploitation of flexibility in data analysis in order to increase the chance of obtaining a significant result, even in the absence of any underlying effect (Simmons et al., 2011). Considering the replication crisis in social and biomedical sciences, the question of the replicability of experiments in x-phi arises. In 2018, a team of experimental philosophers published the Experimental Philosophy Replicability Project (Cova et al., 2021). Out of the 40 experiments that were selected, 78% successfully replicated. This situates x-phi as one of the most replicable fields, alongside personality psychology. The high replicability rate of x-phi is particularly surprising given the fact that its methods are highly similar to the ones used in psychology. Several reasons can explain this discrepancy. First, it is possible that experimental philosophers are selecting experiments that are more obvious, or produce higher effect sizes than the ones used in psychology.
Second, the x-phi project made a distinction between three classes of studies: content-based experiments, which manipulate the direct stimuli that participants attend to, context-based experiments, which manipulate the context in which the experiment is taking place, and demographic effects, which evaluate whether some effects vary across groups, such as between men and women (Cova et al., 2021). Experiments that directly manipulate the content of the experiments generally tend to produce bigger effect sizes, both in experimental philosophy and in psychology (Cova et al., 2021). And contrary to psychologists, experimental philosophers have focused their attention on content-based experiments, which constitute about 78% of all selected experiments, compared to 34% in psychology (Cova et al., 2021). In other words, the good fortune of philosophers seems to be partly dependent on the fact that they are studying more obvious effects, by directly manipulating the stimuli that participants attend to. It thus remains crucial to determine how to improve the replicability rate, should philosophers start studying more subtle or unlikely effects. Over the last few years, psychology has led the way in promoting reforms aimed at improving replicability. Recent reforms include the promotion of replications, open materials and open data, and preregistration (Pennington, 2023). Open materials and open data consist in storing key aspects of studies in a public online database, so that other researchers can check whether errors in analysis and reporting have been committed. Preregistration consists in putting online a frozen study design, alongside the planned statistical analysis, before researchers start collecting data. The goal is to increase transparency about what was planned beforehand, so that researchers will have less temptation to massage their data to manufacture positive results.
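The mechanics of p-hacking are easy to demonstrate with a small simulation. The sketch below is ours, not from this book, and is written in stdlib Python rather than the statistics packages the later chapters use; the crude unit-variance z-test and all numbers are illustrative assumptions. It compares a researcher who fixes the sample size in advance with one who keeps adding participants and peeking at the result. There is no true effect in either case, so every "significant" result is a false positive.

```python
import random

random.seed(42)

Z_CUTOFF = 1.96  # two-sided 5% cutoff for a z-test with known sd = 1

def is_significant(sample):
    # Crude z-test of "mean = 0", assuming unit variance, to keep the
    # sketch dependency-free.
    n = len(sample)
    mean = sum(sample) / n
    return abs(mean) * n ** 0.5 > Z_CUTOFF

def honest_study():
    # Sample size fixed at 20 in advance; test exactly once.
    return is_significant([random.gauss(0, 1) for _ in range(20)])

def hacked_study():
    # Optional stopping: peek after every 10 extra participants and stop
    # as soon as the test comes out "significant".
    sample = [random.gauss(0, 1) for _ in range(20)]
    while len(sample) <= 100:
        if is_significant(sample):
            return True
        sample += [random.gauss(0, 1) for _ in range(10)]
    return False

runs = 2000
honest_rate = sum(honest_study() for _ in range(runs)) / runs
hacked_rate = sum(hacked_study() for _ in range(runs)) / runs
print(f"false positives with fixed n: {honest_rate:.1%}")
print(f"false positives with optional stopping: {hacked_rate:.1%}")
```

The honest researcher ends up near the nominal 5% false-positive rate, while repeated peeking inflates it severalfold, which is exactly the flexibility preregistration is meant to remove.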
It is unclear whether these reforms have contributed to improving the replicability rate of psychology. Replications have remained vanishingly rare among published articles (Clarke et al., 2024). Putting open materials and open data online has become the norm in psychology, but these practices do not seem to have a major impact on correcting errors (Berberi and Roche, 2022). Preregistration has seen a strong uptick, but the most careful examination of its impact did not find that preregistered tests were more likely to be null results than non-preregistered tests (van den Akker et al., 2023). Knowing how to improve the reliability of scientific research remains an open question.

1.5 Ethics Boards

Unlike traditional philosophers, experimental philosophers often collect and analyze data from human subjects, which means they need to get approval from an ethics board before administering surveys, conducting interviews, or carrying out staged experiments. An ethics board is essentially a group of individuals trained to review scientific research according to current ethical standards. A lot of ethical questions arise when creating and conducting an experimental study: Will the researcher present human subjects with a questionnaire? Will they be working with vulnerable populations? Will they be conducting interviews? Will they temporarily deceive their subjects? What will they do with identifiable information from subjects? What will they do with audio recordings from their interviews after they’ve been transcribed? Will they ask subjects potentially distressing questions? How will they compensate subjects? How will they ensure subjects’ anonymity? To get approval from an ethics board, the experimental philosopher will need to address these kinds of considerations and more.
While institutional approval is surely important, the more significant consequence of such questions is that the experimental philosopher ensures their work is genuinely ethical and worthy of pursuit. By regarding their subjects as full persons rather than as data points, they enrich the whole experimental process, which is made possible by the very subjects who offer their wonderful judgments, responses, and thoughts to the experimental philosopher. With a nifty study in mind, the experimental philosopher is ready to submit the needed paperwork to get approval from their local ethics board. Indeed, there is not one grand unified ethics board that is identical across all countries. As a consequence, experimental philosophers will want to fully imbibe the specific guidelines for their home country, and then contact their local ethics board for any questions they have. In fact, one of the best ways to find out what the approval process is for a given country is to email your local institution’s ethics board and ask away. The wise researcher saves themselves so much time and stress by seeking advice from professionals trained to evaluate the ethicality of an experiment—and the last thing a researcher wants to do is redesign their study because they were not aware of some important protocol. The board may require researchers to take an online class about the history of ethics (or lack thereof) in science, submit specific paperwork, outline the proposed project with further details, and more. While the process is surely a tedious one, it is also hugely important, and it offers the added benefit of forcing the researcher to fully develop every last detail of the experiment. In sum, in developing a proposal for review from an ethics board, one will likely end up adopting the perspective of their subjects, which will make the project a better experience for them and improve the quality of the data. 
1.6 This Book’s Chapters

After this brief overview of some general issues of experimental philosophy, let’s now take a closer look at what we can expect in the coming chapters.

1.6.1 Chapter 2: Quantitative Vignette Studies: χ² Tests—Empirically Reconsidering the Constative–Performative Distinction

In Chap. 2, Alexander Max Bauer, Stephan Kornmesser, and Henrike Meyer present a case study of an empirical investigation of John L. Austin’s distinction between two kinds of utterances: constative and performative utterances. A constative utterance expresses a statement and is true or false. A performative utterance is the doing of an action and is itself neither true nor false. For example, one can perform the action of promising by saying, “I promise to do the dishes.” Austin develops several tests and criteria to distinguish between constative and performative utterances that are based on his own linguistic intuitions. However, there is no experimental investigation of whether these intuitions hold for speakers of a linguistic community. The authors explain how to develop research questions concerning Austin’s tests to distinguish between constative and performative utterances and how to design and conduct a survey to answer their questions. To do so, they show how to use the free online tool LimeSurvey to develop a questionnaire for an online survey. They present the whole process of recruiting participants and running the survey. In their survey, participants are asked questions concerned with certain constative and performative utterances to determine whether Austin’s criteria work as intended. The authors show how to collect answers from participants and how to analyze them quantitatively using the χ² test, probably one of the most beginner-friendly statistical tests.
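To give a first taste of the kind of calculation involved, here is a minimal sketch of a χ² test of independence on a 2×2 table of response counts. All counts and example utterances are invented for illustration, and the chapter itself works in Stata; this sketch uses plain Python and the shortcut formula for 2×2 tables, so nothing needs to be installed.

```python
# Invented counts: how many participants judged each utterance type to be
# "true or false" (applicable) versus "neither".
#
#                     applicable   neither
# constative:    a = 78,  b = 22   ("The cat is on the mat.")
# performative:  c = 31,  d = 69   ("I promise to do the dishes.")
a, b, c, d = 78, 22, 31, 69
n = a + b + c + d

# Shortcut formula for 2x2 tables, algebraically equivalent to summing
# (observed - expected)^2 / expected over all four cells.
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

CRITICAL_05 = 3.841  # chi-squared cutoff at p = .05 with 1 degree of freedom
print(f"chi2 = {chi2:.2f}; significant at .05: {chi2 > CRITICAL_05}")
```

With these made-up counts the statistic far exceeds the critical value, so one would reject the hypothesis that judgment and utterance type are independent; the same comparison is what Stata (or any other package) reports as a p-value.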
The authors explain how to use the statistical software Stata to do the calculations of χ² tests. Finally, they interpret their results and suggest answers to the research questions developed at the beginning of the chapter.

1.6.2 Chapter 3: Quantitative Vignette Studies: t-Tests—Case Studies on Judgments About Unfelt Pains

Chapter 3 has two main goals. The first is to introduce the reader to a powerful family of statistical tests known as t-tests. To do this, Justin Sytsma walks the reader through three main case studies exploring whether laypeople believe there can be unfelt pains. Each of these case studies introduces one main type of t-test—one-sample t-tests, independent samples t-tests, and paired samples t-tests. Along with a range of studies on other aspects of the ordinary conception of pain, Justin Sytsma contends that this work makes it plausible that the ordinary view is rather different from what many philosophers have supposed. Making the case for this is the chapter’s second goal. The standard view of pain among philosophers is that pains are a certain type of mental state—known as a phenomenally conscious mental state. Thus, philosophers tend to hold that when you cut your finger, for example, the sharp pain you feel in your finger is not actually a property of your finger but is instead a property of your mind. This view has a number of corollaries, including that it equates pain with feeling pain, ruling out the possibility of unfelt pains. Perhaps surprisingly, this standard view doesn’t primarily derive from research in biology or related sciences; rather, philosophers have asserted that it is the ordinary or commonsense view of pain. Is this claim about the ordinary view accurate, however? At first glance, there is reason for doubt: when we hurt ourselves, the pains we feel seem to have bodily locations, and we often talk as if this is the case.
While such initial impressions could be misleading, they emphasize the need for systematic empirical investigation of the ordinary view. Sytsma provides an overview of this research, focusing on judgments about unfelt pains, and argues that it speaks against the common philosophical assumption, indicating that people tend to think of pains as having bodily locations.

1.6.3 Chapter 4: Quantitative Vignette Studies: Correlations, Regressions, and Structural Equation Modeling—An Application to Experimental Philosophy of Free Will

In Chap. 4, Florian Cova and Aurélien Allard introduce the reader to correlations, the use of regression analyses, and structural equation modeling. They approach these notions through the example of experimental philosophy of free will, a subset of experimental philosophy that is concerned with laypeople’s intuitions about free will and moral responsibility. In recent years, experimental philosophers have sought to determine whether people are “natural compatibilists” or “natural incompatibilists”—that is, whether or not they spontaneously take free will and moral responsibility to be incompatible with determinism. To this end, they have presented participants with vignettes describing deterministic universes and asked them whether agents living in these universes can act of their own free will and be morally responsible for their actions. However, this approach has been criticized on the grounds that it is not clear how participants interpret such vignettes and that their interpretation of the experimental materials might shape their intuitions about free will. To understand what exactly drives people’s answers to such vignettes, later studies have made extensive use of correlational methods, including mediation analysis and structural equation modeling.
Replicating classic studies in experimental philosophy of free will, this chapter provides the reader with a perfect opportunity to learn about these methods and how they can be used in experimental philosophy.

1.6.4 Chapter 5: Interactive and Incentivized Online Experiments: Non-Cooperation in Give-Some and Take-Some Dilemmas

The scientific experiment is an excellent tool for identifying relationships between cause and effect. Researchers in almost all areas of empirical science use experiments for their empirical investigations, including natural sciences such as biology, chemistry, and physics, and social sciences such as psychology, sociology, and behavioral economics. The basic principle of an experiment is to observe a measurable entity under various controlled conditions. In Chap. 5, using a particular example of a research problem in experimental philosophy, Marc Wyszynski illustrates one way to design, conduct, and analyze interactive experiments in behavioral science. Starting from the initial idea to conduct a scientific study, several steps of the research process are described by way of example, including conducting literature research, identifying the research gap, developing the research question and hypotheses, designing the study, programming and conducting the experiment, analyzing the data, and interpreting the results. The chapter provides guidelines for programming an experiment investigating the impact of framing on rule-breaking in social dilemmas using the online survey tool Unipark, conducting the experiment using Amazon Mechanical Turk, and performing the analysis using the statistical analysis applications JASP and R. Readers will be given a first impression of ways to empirically investigate research problems and an easy-to-try-at-home recipe for conducting simple statistical tests and analyses.
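Such simple statistical tests, for instance the independent-samples t-test introduced in Chap. 3, boil down to a few lines of arithmetic. The following sketch is ours, written in stdlib Python rather than the JASP or R workflows the chapter demonstrates, and the contribution data are invented, loosely themed on a two-condition framing comparison.

```python
from statistics import mean, variance

# Invented contribution levels (0-10 tokens) under two framings of the
# same social dilemma; the actual study in Chap. 5 has its own design.
give_some = [7, 6, 8, 5, 7, 6, 7, 8, 5, 6]
take_some = [4, 5, 3, 6, 4, 5, 4, 3, 5, 4]

def independent_t(x, y):
    """Student's independent-samples t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    # statistics.variance is the sample variance (divides by n - 1).
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    se = (pooled * (1 / nx + 1 / ny)) ** 0.5
    return (mean(x) - mean(y)) / se

t = independent_t(give_some, take_some)
df = len(give_some) + len(take_some) - 2
print(f"t({df}) = {t:.2f}")  # compare against ~2.10, the .05 cutoff for df = 18
```

A statistics package would report the same t and df along with an exact p-value; seeing the pooled-variance formula spelled out makes it easier to understand what those packages are doing.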
1.6.5 Chapter 6: Corpus Analysis: Building and Using Corpora—A Case Study on the Use of “Conspiracy Theory”

Corpus analysis has become an increasingly popular approach for philosophers to tackle complex philosophical issues. With the advancement of new technology and algorithms, conducting corpus analytical research has become more efficient and effortless, and numerous tools are now readily available. In Chap. 6, Kevin Reuter and Lucien Baumgartner make corpus analysis accessible to interested readers by providing theoretical background and practical exercises for beginners. The first section provides hands-on examples and exercises designed to help newcomers to corpus analysis get started, making it easy for readers with little or no experience to conduct their first corpus searches. They discuss several studies demonstrating how philosophers have used corpus linguistic methods during the last 15 years, focusing particularly on the task of formulating suitable hypotheses for corpus analysis, which is often the most challenging aspect of this research. In the following two sections, Reuter and Baumgartner examine the case of “conspiracy theory” by running various corpus studies. While the philosophical literature usually argues for a neutral definition of the concept of conspiracy theory, empirical results suggest the term “conspiracy theory” is a negatively evaluative term. Section 1.4 of their chapter showcases straightforward corpus studies using pre-built corpora, such as the Corpus of Contemporary American English (COCA). The results of these studies not only indicate that “conspiracy theory” is a highly negative term but also reveal that the meaning of “conspiracy theory” has changed quite dramatically during the last 20 years. The authors also highlight some of the problems and challenges researchers face when using pre-built corpora, compared to custom corpora collected to investigate a specific research question.
In the second part of this chapter, therefore, Reuter and Baumgartner demonstrate how to build a corpus from scratch using the statistical software R. This tutorial covers different aspects, such as collecting text data, text annotation, and data analysis. The authors also give detailed instructions on API (Application Programming Interface) handling and working with big data sets, and they explain sentiment analysis for beginners. Overall, this chapter makes corpus analysis more accessible and demonstrates its huge potential for philosophical research.

1.6.6 Chapter 7: Corpus Analysis: Lexical Dispersion, Semantic Time Series, and Semantic Network Analysis—An R Studio Pipeline

In Chap. 7, Mark Alfano introduces lexical dispersion analysis, time series analysis, and semantic network analysis. The case study in this chapter uses State of the Union addresses delivered yearly by American Presidents from George Washington to Donald Trump. Lexical dispersion refers to the distribution of words, stems, and n-grams across corpora. Time series analysis is useful when you have a corpus that was produced over the course of hours, days, years, or decades. It enables us to track the prevalence of various words, stems, and n-grams—along with the concepts they express—over time. Time series analysis reveals trends in these speeches across more than two centuries. Semantic network analysis is useful when you have a sufficiently large and diverse corpus to ask which words, stems, and n-grams—along with the concepts they express—tend to co-occur in the same sentence, paragraph, or other piece of text. This can help to shed light on the meaning of terms. These analytical approaches can also be combined to examine how patterns of association change over time.

1.6.7 Chapter 8: Psycholinguistic Experiments: A Case Study on Default Inferences in Philosophical Arguments—Analysing the Argument from Illusion

In Chap. 8, Eugen Fischer and Paul E.
Engelhardt explain the key conceptual steps involved in designing, conducting, and reporting behavioral experiments, quite generally, and walk the reader through a worked example that illustrates, more specifically, how experimental methods from psycholinguistics can be adapted for philosophical purposes. Philosophy employs various forms of verbal reasoning, ranging from thought experiments involving verbally described cases to abductive and deductive arguments. Couched in natural languages like English or German, verbal reasoning is, at the most fundamental level, driven by automatic comprehension inferences. These ubiquitous inferences are triggered whenever we hear or read words in speech or text and are anticipated by speakers and writers whenever they speak or write. These inferences shape our intuitions about verbally described cases, influence our reasoning about such cases, and influence how we move from premises to conclusions in verbal reasoning more generally. The experimental study of these ubiquitous inferences is key to explaining and assessing philosophically relevant intuitions and arguments—from any area of philosophy. Psycholinguists use the cancellation paradigm to study automatic inferences in language comprehension. This chapter explains how the experimental paradigm can be adapted to study the role of automatic comprehension inferences in verbal reasoning and discusses how the paradigm can be implemented with questionnaire-based measures, a variety of eye-tracking measures, and, ideally, with a combination of the two. The chapter presents a step-by-step account of a new eye-tracking study that examines two questions: Do verbs trigger basic automatic inferences—stereotypical inferences—even when prefaced by sentence contexts that defeat these inferences? If so, do such contextually inappropriate stereotypical inferences influence further judgment and reasoning?
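The core comparison behind such eye-tracking studies can be previewed with a small sketch: a paired comparison of hypothetical per-participant mean fixation times in neutral versus inference-defeating contexts, using a paired t statistic computed in plain Python. All values are invented; the chapter's actual measures and analyses are considerably more involved.

```python
import math

# Hypothetical per-participant mean fixation times (ms) on the critical
# region, in neutral vs. inference-defeating sentence contexts.
# All values are invented for illustration.
neutral   = [412, 388, 430, 401, 395, 420, 377, 409]
defeating = [455, 402, 471, 433, 418, 452, 399, 441]

diffs = [d - n for d, n in zip(defeating, neutral)]
n = len(diffs)
mean = sum(diffs) / n
var = sum((x - mean) ** 2 for x in diffs) / (n - 1)
t = mean / math.sqrt(var / n)  # paired t statistic, df = n - 1

print(f"mean difference = {mean:.1f} ms, t({n - 1}) = {t:.2f}")
```

Longer fixation times in the defeating condition would suggest that the stereotypical inference is triggered and then has to be cancelled; the t statistic would normally be compared against the t distribution with n − 1 degrees of freedom.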
The chapter shows how findings (about appearance verbs) can be used to analyze and assess an influential philosophical argument (the "argument from illusion" that is central to philosophical debates about the nature of perception). Using this worked example, the chapter explains each key step of the research process, e.g., how to combine fixation time measurements with plausibility ratings to study automatic comprehension inferences, how to collect data with an eye tracker, and how to conduct statistical analyses of variance (ANOVAs). Since the study of automatic inferences in verbal reasoning is relevant for all areas of philosophy and we would like to facilitate wide uptake, the chapter takes care to also explain questionnaire-based methods that require no access to laboratory facilities.

1.6.8 Chapter 9: Qualitative Interview Studies: Constructing an Interview Study Based on a Paradigm Example in "Ought Implies Can"

In Chap. 9, Kyle Thompson explores how experimental philosophers can use qualitative interviews to better understand how folks think about a range of fascinating philosophical ideas. Along the way, the chapter highlights and defends the unique power of qualitative methods for philosophy. Typically, experimental philosophers opt for brief quantitative questionnaires over qualitative methods, but such questionnaires require participants to funnel their rich and complex judgments through a few preselected response options. While there are marked advantages to using quantitative questionnaires—convenience, for example—they can come up short in revealing what folks think about topics that are, by their very nature, philosophically messy. By justifying the use of interviews in experimental philosophy, Chap. 9 explores the possibility that Socrates was right to engage in back-and-forth dialogue when trying to get to the bottom of philosophical questions.
While the chapter centers on its author's original study on "ought implies can," the recommendations in each section are designed for readers who will be constructing diverse interview studies on many philosophical topics. From picking a research question to transcribing qualitative interview data, Chap. 9 is a comprehensive guide to crafting and conducting your own qualitative interview study in experimental philosophy. To make things concrete, the chapter consistently references materials from the "ought implies can" study, which are made available in a supplemental document. These materials show the reader what transcripts look like, how an interview script can be structured, how qualitative data analysis works, and more. Readers of Chap. 9 will find it easy to take the design choices from the "ought implies can" study and modify them to fit their unique projects. And, if they are not sure where to start with their own projects, the chapter offers numerous suggestions for every stage of the research process. Even readers who are skeptical of using qualitative methods will find something in the pages of Chap. 9, since many of the research and design issues tackled therein are thought-provoking and relevant to all philosophical explorations.
Chapter 2
Quantitative Vignette Studies: χ2 Tests—Empirically Reconsidering the Constative–Performative Distinction

Alexander Max Bauer, Stephan Kornmesser, and Henrike Meyer

Abstract This chapter shows how to derive and empirically address several research questions concerning John Langshaw Austin's distinction between constative and performative utterances. To do so, we show how to conduct a vignette study using an online questionnaire (utilizing the freely available software LimeSurvey) and how to use χ2 tests to analyze the resulting data (utilizing the statistics program Stata). Lastly, we discuss how the results can be interpreted to answer the research questions posed at the beginning.
Keywords Chi-squared test · Effect size · Statistical significance · LimeSurvey · Stata · Constative · Performative · John L. Austin

2.1 Introduction

This chapter presents a case study of an empirical investigation of Austin's distinction between constative and performative utterances.1 Our goal is to show how one can experimentally address certain philosophical questions on this topic. To achieve this, we introduce an experimental study using a specific statistical method, the χ2 test. We show the whole process of designing, conducting, and evaluating the experiment, using it as an illustrative example of how philosophical questions can be tackled with empirical methods. In doing so, we aim to introduce all steps of the process as simply and accessibly as possible.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-58049-9_2.

1 The case study is published as Kornmesser and Bauer (2023).

Before we get to the experiment, we first lay a theoretical foundation: In the second section, we provide some philosophical background on performative and constative utterances, also exploring some subsequent philosophical debates. From the philosophical background, we derive several research questions in the third section that we want to address experimentally. In the fourth section, we design the experiment. Here, we make decisions about how to operationalize our research questions so that they are testable.
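Readers who want a concrete preview of what a χ2 test computes can work through a minimal sketch in pure Python. The chapter's own analysis uses Stata, and the counts below are invented purely for illustration: a 2 × 2 table crossing utterance type with subjects' yes/no answers to whether the utterance can be true or false.

```python
# Minimal chi-squared test of independence for a 2x2 table, in pure
# Python. The counts are invented for illustration only.
# Rows: utterance type; columns: answered "can be true/false?" yes / no.
observed = [[78, 22],   # constative
            [35, 65]]   # performative

row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
total = sum(row)

# Sum of (observed - expected)^2 / expected over all four cells.
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row[i] * col[j] / total
        chi2 += (observed[i][j] - expected) ** 2 / expected

# Critical value for df = 1 at alpha = .05 is 3.841.
print(f"chi2 = {chi2:.2f}, significant: {chi2 > 3.841}")
```

Statistical packages such as Stata, JASP, or R report the same statistic together with an exact p-value; comparing against the critical value 3.841 (df = 1, α = .05) is the hand-calculation shortcut.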
We also provide some statistical background, including the χ2 test, so that we can analyze the data we collect. In the fifth section, we create an online questionnaire that serves as the basis for our experiment. The sixth section covers the process of conducting the experiment, also showing how we recruit subjects. In the seventh section, we show how to use the χ2 test with the help of the statistical software Stata to evaluate the data we collected. Finally, in the eighth section, we interpret the results of our experiment and answer the research questions posed at the beginning.

2.2 Philosophical Background

In this section, we introduce the philosophical background of the study we'll be discussing in this chapter. In Sect. 2.2.1, we outline the fundamental distinction between constative and performative utterances, as introduced by the British philosopher John Langshaw Austin. Subsequently, in Sect. 2.2.2, we delve into the criteria established by Austin to differentiate between the two kinds of utterances. We do so because in our study, we want to test experimentally whether the criteria proposed by Austin work. In Sect. 2.2.3, we examine different types of performative utterances. We need these different types because in our experiment, we use one utterance of each type.

2.2.1 Constative and Performative Utterances

In 1962, the British philosopher John Langshaw Austin introduced a new research program in the philosophy of language with his posthumously published book How to Do Things with Words. This book, consisting of 12 lectures reconstructed by Austin's students, presents Austin's ideas on how we use language to perform actions.
According to Austin (1962, 3), the philosophy of language commonly commits what he calls the descriptive fallacy: it tacitly rests on the flawed assumption that "to say something, at least in all cases worth considering, i.e., all cases considered, is always and simply to state something. This assumption is no doubt unconscious, no doubt is wrong, but it is wholly natural in philosophy apparently" (Austin, 1962, 12). According to this idea of language, sentences are only used to describe, state, or assert facts. An example of such a sentence is:

(1) The earth revolves around the sun.

Sentence (1) expresses the statement that the earth revolves around the sun.2 Statements of this kind have a truth value, i.e., they are true or false. In the case of (1), the statement is true. However, Austin (1962, 12) believed that there are cases "in which to say something is to do something; or in which by saying or in saying something we are doing something." Austin had in mind sentences like the following:

(2) I promise to do the dishes.

When you utter sentence (2), you are not saying something that is true or false, but you are doing something—you are performing the action of making a promise. Responding to this utterance with "This is true" or "This is false" would be inappropriate, as actions cannot be adequately described in this way. For instance, the nonlinguistic action of sitting down on a chair cannot be labeled "true" or "false." The action may succeed (you subsequently sit on the chair) or fail (you sit down next to it and end up on the floor), but the action itself cannot be true or false. In the case of utterance (2), one is also performing an action with that utterance—the action of promising (to do the dishes). And as with the action of sitting down on a chair, it does not make sense to consider this action true or false. However, like any other action, the promise can fail.
For example, if you promise something over which you have no control, the act of making a promise fails (for example, promising to make it rain tomorrow). In Austin's (1962, 14) words, the action has gone wrong (it is unhappy). Austin introduces the new technical terms "constative" and "performative" to distinguish utterances like (1), which he calls constative utterances, from utterances like (2), which he calls performative utterances. Austin (1962, 4ff.) gives further examples of performative utterances. To address his original criticism that the philosophy of language has ignored this apparently perfectly common use of language before, he casually concedes that "indeed, the examples now to be given [are] disappointing" (1962, 5).

2 Usually, one distinguishes between utterances, sentences, and statements (propositions). A statement (proposition) is the meaning of a sentence or utterance. Sentences and utterances can be used to express propositions. In this chapter, we don't make a strict distinction between utterances and sentences, but between propositions on the one hand and sentences or utterances on the other.

(a) "I do (sc. take this woman to be my lawful wedded wife)"—as uttered in the course of the marriage ceremony.
(b) "I name this ship the Queen Elizabeth"—as uttered when smashing the bottle against the stem.
(c) "I give and bequeath my watch to my brother"—as occurring in a will.
(d) "I bet you sixpence it will rain tomorrow."

Austin summarizes that performative utterances of types (a) to (d) are to be distinguished from constative utterances as follows:

In these examples it seems clear that to utter the sentence (in, of course, the appropriate circumstances) is not to describe my doing of what I should be said in so uttering to be doing or to state that I am doing it: it is to do it. None of the utterances cited is either true or false: I assert this as obvious and do not argue it.
(Austin, 1962, 6)

Austin's philosophical interest is focused on how to perform actions with performative utterances. However, before we delve further into that, it is essential to distinguish between performative and constative utterances as precisely as possible.

2.2.2 Criteria for Distinguishing Between Constative and Performative Utterances

Austin suggests several criteria to distinguish between performative and constative utterances. In our study, we aim to find out whether some of the proposed criteria work for the speakers of a speech community. To do so, we introduce Austin's criteria systematically throughout this section. As we saw in Sect. 2.2.1, Austin establishes the following two criteria to differentiate performative from constative utterances.

Criterion truth
Constative utterances have a truth value; they are true or false. Performative utterances have no truth value; they are not true or false.

Criterion action
Performative utterances are actions; they can be happy or unhappy. Constative utterances are not actions; they cannot be happy or unhappy.

Austin develops further criteria to distinguish constative from performative utterances. For example, he tries to categorize constative and performative utterances based on their grammatical properties but rejects this idea, considering it unsuitable (cf. Austin, 1962, 53–66). In his sixth lecture, Austin (1962, 79f.) develops four further criteria, which he does not subsequently reject. Two of these four criteria are presented below.3 We dub the first criterion doubt.

Criterion doubt
It cannot be doubted whether the action performed by a performative utterance was actually performed. However, it can be doubted whether the state of affairs described by a constative utterance actually exists.

Let's take a look at performative utterance (3) and constative utterance (4) to explain this criterion.

(3) I apologize for hurting you.
(4) I regret that I hurt you.

To reiterate, the act of apologizing is performed by uttering (3). Uttering (3) means to apologize. (3) is classified as a performative utterance, as it is neither true nor false. Conversely, (4) describes a state of affairs and can be true or false depending on whether the described state of affairs exists or not (i.e., whether I regret that I hurt you or not). Thus, (4) is a constative utterance. On the criterion doubt: if you utter (3), then it cannot be doubted that you are apologizing, because to utter (3) is to perform the act of apologizing. In other words, when you utter (3), it cannot be meaningfully asked whether you are really apologizing, because you have obviously performed the act of apologizing by uttering (3). However, if you utter (4), it can be meaningfully asked whether you really regret it. Utterance (4) is a description of a state of affairs, and this description can be true or false. Consequently, the statement expressed by (4) can be doubted.

Now, let's get to the second criterion. We call it intentionality.

Criterion intentionality
In contrast to constative utterances, the action performed with a performative utterance is always intentional.

For example, it makes no sense to say that the mental state described by (4) is intentional. You regret having hurt someone, or you don't, but you cannot regret it intentionally. The situation is similar for the constative utterance (1). It makes no sense to suppose that the earth intentionally revolves around the sun. On the other hand, it is different with the performative utterances (2) and (3). When you utter (2), you thereby intentionally perform the act of making a promise, and when you utter (3), you intentionally apologize. You cannot perform these acts unintentionally or by accident, because you uttered (2) with the intention of making a promise or you uttered (3) with the intention of apologizing.

3 In what follows, only two of the four criteria presented in Austin's sixth lecture will be introduced. Of the other two criteria, one creates a contradiction with Austin's previous remarks, since it presupposes that the actions performed with performative utterances can only be accomplished by saying something. As Austin (1962, 8) himself points out, this is not the case. For example, the act of betting can be performed not only by saying something like "I bet that . . . " but also without speaking (for example, in online betting or, as Austin notes, in horse racing at the totalizer). The other criterion not further discussed here is redundant, since it implies the criteria truth and action. It is based on the fact that performative utterances cannot be false but can only be unhappy.

To sum up, in this section, we presented four criteria that Austin uses to differentiate between performative and constative utterances: truth, action, doubt, and intentionality. These criteria will be the cornerstones of the research questions that we will develop in the third section. However, before we can formulate these questions, we will have to add some more philosophical background.

2.2.3 Categories of Performative Utterances

Austin distinguishes five different categories of performative utterances according to the type of action performed with a performative utterance: verdictive, exercitive, commissive, behabitive, and expositive utterances. These categories are crucial to our study, as we will use one utterance from each category in the experiment. In the following, we introduce the five types of performative utterances one after another.
Verdictive utterances
According to Austin (1962, 152), verdictive utterances "consist in the delivering of a finding, official or unofficial, upon evidence or reasons as to value or fact, so far as these are distinguishable." For example, a judge may perform the act of acquitting someone with the following utterance:

(5) I hereby acquit you.

Examples of nonofficial judgments would be educational measures taken by parents toward their children. Further examples of verdictive utterances are performative utterances with the verbs "diagnose," "date" (to a point in time), "estimate," and "convict."

Exercitive utterances
According to Austin (1962, 154), an exercitive utterance is "the giving of a decision in favour of or against a certain course of action, or advocacy of it. It is a decision that something is to be so, as distinct from a judgment that it is so." For example, in some states, the head of state can pardon someone with the following utterance:

(6) I pardon you.

Contrary to (5), (6) is a decision that does not need to be based on evidence (as acquitting is), but is "a decision that something is to be so." Other examples of exercitive utterances are performative utterances with the verbs "degrade," "command," "proclaim," "dismiss," and "recommend."

Commissive utterances
With a commissive utterance, the person speaking commits to a future behaviour (Austin, 1962, 156ff.). For example, with the following utterance, one commits oneself to inviting the addressed person to the cinema:

(7) I promise to take you to the movies.
Other examples of commissive utterances are performative utterances with the verbs "declare for," "swear," "agree," "bet," and "guarantee."

Behabitive utterances
According to Austin (1962, 159), behabitive utterances "include the notion of reaction to other people's behaviour and fortunes and of attitudes and expressions of attitudes to someone else's past conduct or imminent conduct." A typical example of such a behabitive utterance is:

(8) I apologize for the accident.

Other examples of behabitive utterances are performative utterances with the verbs "thank," "congratulate," "sympathize," and "welcome."

Expositive utterances
An expositive utterance can be used to perform "acts of exposition involving the expounding of views, the conducting of arguments, and the clarifying of usages and of references" (Austin, 1962, 160). Assertions are examples of expositive utterances:

(9) I claim that the earth revolves around the sun.

Other examples of expositive utterances are performative utterances with the verbs "affirm," "describe," "deny," and "report." Expositive utterances concern the speaker's attitude toward the communicated proposition (i.e., the content of the communicated phrase) in terms of truthfulness. Affirming, stating, or reporting propositions, for example, implies that the speaker believes the proposition to be true.

In summary, Austin aims to organize performative utterances and classify them into various categories based on the actions performed. He identifies five categories of performative utterances, namely verdictives, exercitives, commissives, behabitives, and expositives.4 In our study, we will take one utterance of each of the five types to represent all types of performative utterances as identified by Austin. We have now covered all the philosophical background needed to understand the study presented in Sects. 2.3, 2.4, 2.5, 2.6, and 2.7.
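For readers who like to keep track of the taxonomy programmatically, the five categories and the example verbs just listed can be collected in a small lookup table. This is a hypothetical sketch for orientation only: classifying bare verbs is a simplification, since Austin's categories apply to utterances in context, not to verbs in isolation.

```python
# Austin's five categories of performative utterances, keyed by the
# example verbs listed in this section (simplified for illustration).
categories = {
    "verdictive": {"acquit", "diagnose", "date", "estimate", "convict"},
    "exercitive": {"pardon", "degrade", "command", "proclaim", "dismiss",
                   "recommend"},
    "commissive": {"promise", "declare for", "swear", "agree", "bet",
                   "guarantee"},
    "behabitive": {"apologize", "thank", "congratulate", "sympathize",
                   "welcome"},
    "expositive": {"claim", "affirm", "describe", "deny", "report"},
}

def classify(verb):
    """Return the Austinian category for a listed performative verb."""
    for category, verbs in categories.items():
        if verb in verbs:
            return category
    return None

print(classify("promise"), classify("apologize"))
```

A table like this is also a convenient starting point when constructing experimental stimuli, since it makes it easy to check that each of the five categories is represented by exactly one utterance.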
Now that we know all we need to know about constative and performative utterances for our study, we can move on to developing our research questions in the next section.

4 Austin's categorization has been criticized by Searle (1989), among others, and replaced by another classification into five categories. In this chapter, however, we use Austin's categories, since we are examining his account of constative and performative utterances.

2.3 Research Questions

In the previous section, we introduced the philosophical background of Austin's distinction between constative and performative utterances so that you can understand the research questions pursued in this study. From an x-phi perspective, Austin's distinction raises various questions and issues that we can address through empirical studies. In Sects. 2.3.1, 2.3.2, and 2.3.3, we formulate three research questions that we aim to answer. We explain the reasons behind each aim and research question and, in Sect. 2.4, show how empirical studies can help answer them.

2.3.1 Aim 1: Do Austin's Criteria for the Classification of Constative and Performative Utterances Work for Speakers of a Speech Community?

Austin's criteria for differentiating between constative and performative utterances are based on linguistic intuition. For example, it is an intuitive judgment that utterance (2) cannot be classified as true or false (criterion truth) or that one cannot meaningfully ask about this utterance whether the action is really performed (criterion doubt). However, up to now, these criteria have been based solely on Austin's own intuitions. This raises the question of whether Austin's criteria correspond to the linguistic intuitions of the speakers in a speech community. To address this, we will present randomly selected subjects with constative and performative utterances and ask them to apply Austin's criteria to determine if they arrive at the same classification as Austin.
For example, for the criterion truth, we present subjects with a constative or performative utterance and ask if it can be classified as true or false. For a constative utterance, a positive answer corresponds to Austin’s classification, and a negative answer does not. This is the overall idea of how we design the study. In Sect. 2.4, we provide a detailed account of our study design.

2.3.2 Aim 2: Can the Event Character of Performative Utterances Be Used as a Criterion for Distinguishing Between Constative and Performative Utterances?

The second aim of our study is to examine the viability of using the event character of performative utterances, a concept introduced by Émile Benveniste (1974), as a criterion for distinguishing between constative and performative utterances, which Austin does not consider in How to Do Things with Words. According to Benveniste (1974), performative utterances are single, self-contained events that are not repeatable. That is, each repetition of the utterance constitutes a new event.

2 Quantitative Vignette Studies: χ2 Tests—Empirically Reconsidering. . . 27

On the other hand, constative utterances typically refer to states of affairs lasting in time. We call this property of performatives proposed by Benveniste the event character of performative utterances. Thus, the proposed criterion for distinguishing between constative and performative utterances is:

Criterion time: Contrary to constative utterances, performative utterances are events that are limited in time and that are unrepeatable.

The second aim of this study is to assess, using the subjects’ linguistic intuitions, whether criterion time can effectively differentiate between constative and performative utterances.

2.3.3 Aim 3: Do Performative Utterances Have a Truth Value?

As explained in Sects. 2.2.1 and 2.2.2, Austin (1962, 4ff.) claims that performatives have no truth values.
However, this viewpoint has been widely challenged in the philosophical discourse on performative utterances subsequent to Austin (1962). Philosophers such as Lemmon (1962), Quine (1981), Heal (1974), Bach (1965), Graham (1977), and Searle (1989) contend that performative utterances, despite being actions as described by Austin, still possess truth values and can be considered propositions (see also Soames, 2003; Hornsby, 2006; Tsohatzidis, 2017). The distinctiveness of performative utterances lies in the fact that their truth conditions are satisfied simply through the act of uttering them. For example, consider sentence (2) discussed in Sect. 2.2.1:

(2) I promise to do the dishes.

While all of the philosophers mentioned above acknowledge that sentence (2) is performative, they reject Austin’s assertion that it has no truth value. Instead, they argue that sentence (2) is “automatically” true because when it is uttered, it constitutes the promise to do the dishes. Therefore, it describes this state of affairs correctly. In other words, a performative utterance verifies itself (Quine, 1981). As a result, all performative utterances have the truth value of true but never false.

The third aim of our study is to see if subjects agree with Austin’s view that performative utterances have no truth value or if they side with his critics. To do this, we will evaluate the linguistic intuitions of the subjects to find out how they judge the truth value of performative utterances.

In this section, we developed our research questions. The next step is to design an experiment able to shed some light on those questions, which we will describe in Sect. 2.4.

2.4 Study Design

In this section, we outline our study design to answer the research questions from Sect. 2.3. We start by operationalizing the criteria for classifying constative and performative utterances in Sect. 2.4.1, making them suitable for empirical testing. In Sect.
2.4.2, we introduce the utterances that we will present to the subjects in the experiment. These utterances are not presented in isolation but are embedded in contexts (the resulting combinations of utterance and context are called vignettes). Finally, in Sect. 2.4.3, we provide the statistical background necessary to analyze the data that will be collected.

2.4.1 Operationalizations of the Criteria for the Classification of Constative and Performative Utterances

To answer the research questions from Sect. 2.3, we need to operationalize the criteria for classifying constative and performative utterances, which were presented in Sect. 2.2.2 (criteria truth, action, doubt, and intentionality) and Sect. 2.3.2 (criterion time). This means that we need to put the criteria into a form that allows us to conclude from the subjects’ reactions how they classify the utterances. Suppose we want to see if subjects distinguish between constative and performative utterances based on Austin’s criterion truth. To test this, we present subjects with constative and performative utterances. If criterion truth works for classifying utterances, then subjects should see the constative utterance as having a truth value and the performative utterance as not having one. But how can we find out if subjects see things this way? Just asking them, “Does the utterance have a truth value?” does not seem to be a good idea. Most subjects are not philosophers, so they might not understand what a truth value is. Thus, asking this question would likely lead to answers that don’t provide helpful information for us. Instead, we need to ask a question that is intuitively understandable by a competent speaker so that the answer can give us insight into our research question. The question we ask in our study is:

Question truth: Can the utterances be true or false?

So, to evaluate if the criterion truth works as intended by Austin, we ask subjects whether a given utterance can be deemed true or false.
If the criterion truth aligns with Austin’s view, subjects should respond “Yes” for a constative utterance and “No” for a performative utterance. We also operationalize the criteria action, doubt, intentionality, and time through corresponding questions. Let’s have a look at them one by one.

Question act: Is there an action performed with the utterance (in addition to the action of speaking)?

If the criterion action aligns with Austin’s views, subjects should answer “No” for a constative utterance and “Yes” for a performative utterance. To avoid confusion, we include the parenthetical note “(in addition to the action of speaking).” Although both constative and performative utterances may involve producing sounds through speaking, this action is irrelevant to the criterion and should, thus, be ignored.

Question doubt: Does it make sense to ask, “Do they really?”

If Austin’s conception of criterion doubt is correct, subjects should respond with “Yes” for constative utterances and “No” for performative utterances. For example, in the case of utterance (2), it would be nonsensical to ask if the speaker really promised to do the dishes, as the promise is obviously made with the utterance. Conversely, for utterance (1), it would not be nonsensical to question the claim that the earth revolves around the sun, as not everyone may possess a common understanding of astronomy and may, therefore, doubt the veracity of this utterance.

Question intentionality: Does it make sense to insert the adverb “intentionally”?

If we apply Austin’s criterion of intentionality, the response for performative utterances should be “Yes,” while for constative utterances, it should be “No.” This is because performative utterances are actions that can be intentionally performed, while constative utterances describe facts that are not performed intentionally.
For example, if we take the constative utterance “The earth revolves around the sun,” it does not make sense to say that it was done intentionally, so the response should be “No.” However, in the case of promises made through performative utterances, it is possible to perform them intentionally. It is worth noting that the adverb “intentionally” is not inserted into the performative utterance itself but rather into a report about the utterance. To create this report, the corresponding utterance is put in the past tense, turning it into a constative utterance, and then supplemented with “intentionally.” In the example of utterance (2), this looks as follows:

(2) I promise to do the dishes.
(2.1) I promised to do the dishes.
(2.2) I intentionally promised to do the dishes.

Since (2.2) makes sense, the expected answer to question intentionality is “Yes,” which, according to Austin’s criterion, yields that (2) is a performative utterance.

Question time: Does it make sense to ask, “Are they still doing it?”

According to criterion time, subjects should answer “Yes” for a constative utterance and “No” for a performative utterance. Take the following examples:

(3) I apologize for hurting you.
(4) I regret that I hurt you.

Regarding utterance (3), it would be nonsensical to inquire whether the speaker is still apologizing because the act of apologizing was completed by uttering (3). Even if the speaker were to apologize again for the same incident, it would be a new and distinct act. However, for utterance (4), it is reasonable to ask whether the speaker still regrets what happened since utterance (4) describes a state that may or may not still exist.

2.4.2 Vignettes

In Sect. 2.4.1, we operationalized Austin’s criteria for distinguishing between constative and performative utterances. We developed questions that we can ask subjects to assess whether these criteria work as intended by Austin.
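The predicted answer pattern just described can be summarized compactly. The following sketch (in Python, with names of our own invention; the authors provide no code) encodes, for each operationalized question, the reply expected under Austin’s criteria for each group of utterances:

```python
# Hypothetical encoding (our names, not the authors') of the five
# operationalized questions and the replies Austin's criteria predict.
EXPECTED_ANSWERS = {
    "truth":          {"constative": "Yes", "performative": "No"},
    "act":            {"constative": "No",  "performative": "Yes"},
    "doubt":          {"constative": "Yes", "performative": "No"},
    "intentionality": {"constative": "No",  "performative": "Yes"},
    "time":           {"constative": "Yes", "performative": "No"},
}

def matches_austin(criterion, group, reply):
    # True if a subject's reply agrees with Austin's predicted classification.
    return EXPECTED_ANSWERS[criterion][group] == reply
```

Note that the two columns are mirror images of each other: every question is designed to elicit opposite answers for the two groups.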
We will present each subject with either a constative or a performative utterance and ask them the five questions truth, act, doubt, intentionality, and time. To ensure that the questions make sense, we embed the utterances in a context. For example, it matters who pronounces an acquittal because a pronouncement by someone who does not have the power to acquit does not really perform the act of acquittal. The combination of an utterance and its context is called a vignette. We randomly present each subject with either a performative or a constative vignette and ask them to answer the five questions based on the utterance presented. We then compare the frequencies of “Yes” and “No” answers between the two groups to determine if the questions truth, act, doubt, intentionality, and time yield different answers for performative and constative utterances. If so, this provides evidence that the criteria corresponding to these questions effectively distinguish performative from constative utterances. So as not to be limited to a single context, we use five different performative utterances drawn from the five categories of performatives described in Sect. 2.2.3. We choose one exemplary performative utterance from each category and construct a parallel constative utterance that matches it as closely as possible. As a result, we end up with ten vignettes in total, as summarized in Table 2.1. We have kept the differences between constative and performative utterances minimal within a given category to avoid confounding extraneous variables, as can be seen in Table 2.1. This ensures that any observed differences in response behavior between the two types of utterances are due solely to the constative–performative distinction and not to other factors such as variations in the contextual information of the utterance.
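The assignment procedure can be sketched in a few lines. This is a hypothetical illustration under assumed names, not the authors’ actual survey software, and it assumes simple random assignment to one of the ten vignettes:

```python
import random

# Hypothetical sketch of the assignment procedure: each subject is randomly
# assigned exactly one of the ten vignettes (5 contexts x 2 groups), and the
# five questions are asked in randomized order to guard against ordering effects.
CONTEXTS = ["behabitive", "verdictive", "exercitive", "commissive", "expositive"]
GROUPS = ["performative", "constative"]
QUESTIONS = ["truth", "act", "doubt", "intentionality", "time"]

def assign_subject(rng):
    # Between-subjects design: one vignette per subject only.
    vignette = (rng.choice(CONTEXTS), rng.choice(GROUPS))
    order = QUESTIONS[:]
    rng.shuffle(order)
    return vignette, order

rng = random.Random(0)  # seeded only to make the sketch reproducible
vignette, order = assign_subject(rng)
```

In a real survey platform the random assignment is usually handled by the software itself; the sketch only makes the logic of the design explicit.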
To prevent subjects from making direct comparisons between performative and constative utterances or between different contexts, we present each subject with only one of the ten vignettes. This eliminates the possibility of subjects drawing conclusions based on perceived differences between the various utterances and thus influencing their response behavior. We refer to this as a between-subjects design, where the different variations are presented to different individuals, and the factor of interest varies between subjects. In contrast, if we were to present subjects with both a performative and a constative utterance, we would refer to this as a within-subjects design, as the factor of interest would be varied for the same individual.

Table 2.1 Wordings of the ten vignettes

Context: Behabitive
  Performative: Petra and Hans had a bicycle crash yesterday. Hans had taken Petra’s right of way and collided with her. The next day he visits her and says, “I apologize for the accident.”
  Constative: Petra and Hans had a bicycle crash yesterday. Hans had taken Petra’s right of way and collided with her. The next day he visits her and says, “I regret the accident.”

Context: Verdictive
  Performative: A judge says to the defendant as she pronounces the verdict, “I hereby acquit you.”
  Constative: A judge says to the defendant after the verdict is pronounced, “I believe you are innocent.”

Context: Exercitive
  Performative: The president says to the inmate, “I hereby pardon you.”
  Constative: The president says to the inmate, “I believe you should be pardoned.”

Context: Commissive
  Performative: Petra says to Hans, “I promise to take you to the movies.”
  Constative: Petra says to Hans, “I have the intention of taking you to the movies.”

Context: Expositive
  Performative: Petra says to Hans, “I claim that Leonardo DiCaprio is the most attractive man.”
  Constative: Petra says to Hans, “I think Leonardo DiCaprio is the most attractive man.”

Afterward, we ask subjects to answer the five yes-or-no questions presented in Sect. 2.4.1 in randomized order to ensure that the results are not distorted by an ordering effect. We will examine each of the five questions individually as to whether the response behaviors of the performative and constative groups differ in each context. Each context will, in turn, be examined individually. Thus, we will make pairwise comparisons between groups for the five questions presented in each of the five contexts, making a total of 5 × 5 = 25 different pairwise comparisons.

2.4.3 Statistical Background

In this section, we present the statistical foundation for analyzing the data. First, we take a look at the data basis in Sect. 2.4.3.1. Then, in Sect. 2.4.3.2, we demonstrate how to organize the data in a fourfold table. In Sect. 2.4.3.3, we take a look at the hypotheses that can be tested using the data. To do this, we compare the observed frequencies with the frequencies expected under certain conditions. These expected frequencies are calculated in Sect. 2.4.3.4. In Sect. 2.4.3.5, we then compare the observed and expected frequencies using the χ2 test. To make the test’s results comparable between studies, we turn to normalizations in Sects. 2.4.3.6 and 2.4.3.7.

2.4.3.1 Data Basis
In this section, we explain how to analyze the data that we will obtain. The questions in Sect. 2.4.1 are designed to be answered with either “Yes” or “No.” As a result, the data to be analyzed are dichotomous, with exactly two values for each question: “Yes” or “No.” Different types of data can have different scale levels. Take, for example, people’s eye color. We can count how many people have blue, brown, or green eyes. However, we can hardly put these characteristics into a meaningful hierarchy. That is different, for example, with school grades, as they can be ranked: a “B” is, in many school systems, better than a “D” (the scale is ordinal). Other data, like temperature, additionally have meaningful distances, so that the difference between the temperatures on two days can be calculated. For example, if the temperature on the hottest day this year was 30.2 °C and on the hottest day last year was 28.5 °C, the temperature this year was 1.7 °C higher than last year (the scale is cardinal). In our case, the answers to our questions yield dichotomous data, taking one of only two values, “Yes” or “No.” Here, as was the case for eye colors above, the scale is nominal. To analyze nominal data, we use frequencies. In this case, we compare the response frequencies of “Yes” and “No” answers between subjects presented with either a performative or a constative utterance. To explain the data analysis in more detail, we will assume hypothetical data in this section. The actual data will only be available after the experiment has been conducted; we will look at the actual data in Sect. 2.7. The data will later be available in tabular form, as in Table 2.2. Here, as an example, we assume that 50 subjects were asked one of our questions, with 26 in the group presented with the constative utterance and 24 in the group presented with the performative utterance.
In our example, 21 out of 26 subjects answered “Yes” to the question for the constative utterance, while 6 out of 24 answered “Yes” for the performative utterance. Table 2.2 shows which subjects (in the “Subject” column) were presented with which utterance (in the “Group” column) and how they answered the question (in the “Reply” column).

Table 2.2 Hypothetical data (the table lists, for each of the 50 subjects, the group and the reply: 26 subjects saw the constative utterance, of whom 21 answered “Yes” and 5 “No,” and 24 subjects saw the performative utterance, of whom 6 answered “Yes” and 18 “No”)

2.4.3.2 Summary of Observed Data in a Fourfold Table

We can present the underlying frequency distribution in a fourfold contingency table. In Table 2.3, you can see the four cells a, b, c, and d at the center. These cells contain the bivariate frequency distributions, i.e., the frequencies of each combination of variables. Bivariate frequencies depend on two variables, which in this case are the combinations of “Group” and “Reply.” That is, how many subjects in the “Constative” group answered “Yes”? How many subjects in the “Performative” group answered “Yes”? How many subjects in the “Constative” group answered “No”? How many subjects in the “Performative” group answered “No” (see Table 2.4)?

Table 2.3 Structure of the fourfold table

                      Column variable
Row variable          Characteristic 1   Characteristic 2   Row total
Characteristic 1      a                  b                  a + b
Characteristic 2      c                  d                  c + d
Column total          a + c              b + d              a + b + c + d = n

Table 2.4 Fourfold table with observed frequencies

         Group
Reply    Performative   Constative   Total
No       18             5            23
Yes      6              21           27
Total    24             26           50

The univariate frequency distributions of the row and column variables, i.e., the frequencies that depend on only one variable in each case, are in the margin column and the margin row of the table. They are the sums of the frequencies in each row or column. In other words, they tell us how many subjects in total answered “Yes” or “No” and how many subjects were in the “Performative” or “Constative” group. Using the hypothetical data presented in Table 2.2, we can now get the observed frequencies (fo) for the four cells a to d of Table 2.3. To do this, we simply count the number of subjects who answered “Yes” or “No” in each group. By doing so, we can determine the following values from Table 2.2:

fo(a) = 18 (“Performative” and “No”)
fo(b) = 5 (“Constative” and “No”)
fo(c) = 6 (“Performative” and “Yes”)
fo(d) = 21 (“Constative” and “Yes”)

With these values, we can fill in the fourfold table, creating a compact and clear presentation of our data, as shown in Table 2.4. This is a clear improvement over Table 2.2, where the information was far less organized and much harder to comprehend. The only missing piece of information is the association of the group and reply with the subject number, but since the survey is anonymized and no further data is available in this example, we don’t lose valuable information.
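The counting step from the raw data to the fourfold table can also be done programmatically. A minimal sketch, assuming the raw data are available as (group, reply) pairs (the records below are synthetic but match the hypothetical frequencies):

```python
from collections import Counter

# Rebuild the observed cell frequencies f_o of Table 2.4 from raw
# (group, reply) records: 26 constative (21 "Yes", 5 "No") and
# 24 performative (6 "Yes", 18 "No") subjects.
records = ([("Constative", "Yes")] * 21 + [("Constative", "No")] * 5
           + [("Performative", "Yes")] * 6 + [("Performative", "No")] * 18)

fo = Counter(records)              # bivariate observed frequencies
a = fo[("Performative", "No")]    # cell a
b = fo[("Constative", "No")]      # cell b
c = fo[("Performative", "Yes")]   # cell c
d = fo[("Constative", "Yes")]     # cell d
n = a + b + c + d                 # total number of subjects
```

With real survey exports, the same counting is typically one line in a statistics package; the sketch only makes explicit what that line does.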
Now, we can easily see the number of subjects in each group, “Performative” and “Constative,” by looking at the column totals. The row totals show the total number of subjects who answered “Yes” or “No.” We can also see the total number of subjects interviewed, which is 50 (n = a + b + c + d = 50), and what the observed bivariate frequencies are (cells a to d with fo(a) to fo(d)).

2.4.3.3 Forming Hypotheses

In this chapter, we focus on whether the response frequencies to a question vary depending on whether subjects receive a performative or a constative utterance. In other words, we investigate the relationship between the group and the response frequency through a hypothesis test. To do so, we construct a specific hypothesis and test it against the data. As a common rule, we don’t test the actual working hypothesis that we assume (based on the theory or previous studies) but rather its negation, which we call the null hypothesis. If the data indicate that the null hypothesis is not true, we can infer with a certain probability that the original working hypothesis might be correct. However, if the null hypothesis can’t be rejected, we can’t assume that the original hypothesis is correct. The questions we developed in Sect. 2.4.1 are designed to elicit opposite answers for constative and performative utterances, following Austin’s criteria. So, if a question is answered “Yes” for a performative utterance, it should be answered “No” for a constative utterance and vice versa. Accordingly, our working hypothesis assumes a correlation between the group and the answer. To determine if such a correlation exists, we compare the collected frequencies with the frequencies we would expect if there were no correlation (which is, in this case, our null hypothesis). To understand what the data would look like if there were no correlation, we examine the probabilities of the answers.
If there is a correlation between the group and the answer, the probability of an answer depends on the group. For example, the probability of a “Yes” answer might be higher in the “Constative” group than in the “Performative” group. If there’s no correlation, the probability of an answer is independent of the group. In this case, the overall probability of a “Yes” answer (i.e., P(Yes)) is the same as the probability of a “Yes” answer in the “Constative” group (i.e., P(Yes|Constative)). In this case, it holds that:

P(Yes|Constative) = P(Yes)  (2.1)

As a null hypothesis, we can, thus, assume that there is no correlation between the groups and the answers given. The conditional probabilities of the answers (i.e., the probabilities that “Yes” or “No” is answered in the groups “Constative” or “Performative”) would then be equal to the individual probabilities of this answer (i.e., the probabilities that “Yes” or “No” is answered in general). Formally, the null hypothesis (H0) could then look like this:

H0: (P(No|Performative) = P(No)) ∧ (P(No|Constative) = P(No)) ∧ (P(Yes|Performative) = P(Yes)) ∧ (P(Yes|Constative) = P(Yes))

As an alternative hypothesis (H1), we assume that this independence fails in at least one of the cells:

H1: (P(No|Performative) ≠ P(No)) ∨ (P(No|Constative) ≠ P(No)) ∨ (P(Yes|Performative) ≠ P(Yes)) ∨ (P(Yes|Constative) ≠ P(Yes))

2.4.3.4 Expected Frequencies in the Case of Independence

To test this, in addition to the observed frequencies from Table 2.2 in Sect. 2.4.3.1, which we counted as fo in Sect. 2.4.3.2, we also need the frequencies that would be expected in the case of independence. To obtain these expected frequencies (fe), we use the marginal frequencies from the fourfold table.
The row sum tells us how often the answers “Yes” and “No” were given in total, from which we can derive their individual probability: across both groups, 23 out of a total of 50 hypothetical subjects answered “No” to the hypothetical question. Hence, we can say that P(No) = 23/50. To clarify, let’s consider cell a as an example. We need to determine how many subjects in the “Performative” group would need to answer “No” for the probability to be equal to 23/50. This ensures that P(No|Performative) = P(No) holds. This can be calculated for each cell as in Eq. (2.2):

fe = (column total × row total) / total  (2.2)

If this calculation is performed for all four cells a to d, we obtain the following four expected frequencies:

fe(a) = 24 × 23 / 50 = 11.04
fe(b) = 26 × 23 / 50 = 11.96
fe(c) = 24 × 27 / 50 = 12.96
fe(d) = 26 × 27 / 50 = 14.04

Table 2.5 Fourfold indifference table with expected frequencies

         Group
Reply    Performative   Constative   Total
No       11.04          11.96        23.00
Yes      12.96          14.04        27.00
Total    24.00          26.00        50.00

Table 2.6 Fourfold table with the calculation of residuals

         Group
Reply    Performative          Constative
No       18 − 11.04 = 6.96     5 − 11.96 = −6.96
Yes      6 − 12.96 = −6.96     21 − 14.04 = 6.96

Thus, if the given answers are independent of the group (i.e., the conditional probabilities do not differ from the individual probabilities according to the null hypothesis), then, for example, 11.04 (fe(a)) subjects in the group “Performative” and 11.96 (fe(b)) subjects in the group “Constative” should answer “No” to the question. The values differ since the “Performative” and “Constative” groups are different in size, with 24 and 26 subjects, respectively (as can be seen in the column sums). 11.04 of 24 and 11.96 of 26 correspond (rounded) to 23 of 50, i.e., the total number of “No” answers (18 + 5 = 23) divided by the total number of subjects (24 + 26 = 50).
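Eq. (2.2) is easy to check numerically. A short sketch for the four cells of Table 2.4:

```python
# Expected frequencies under independence, Eq. (2.2):
# f_e = column total * row total / grand total
a, b, c, d = 18, 5, 6, 21           # observed cells from Table 2.4
n = a + b + c + d                   # 50
col_perf, col_const = a + c, b + d  # column totals: 24, 26
row_no, row_yes = a + b, c + d      # row totals: 23, 27

fe_a = col_perf * row_no / n    # 24 * 23 / 50 = 11.04
fe_b = col_const * row_no / n   # 26 * 23 / 50 = 11.96
fe_c = col_perf * row_yes / n   # 24 * 27 / 50 = 12.96
fe_d = col_const * row_yes / n  # 26 * 27 / 50 = 14.04
```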
Here, we can already check a basic prerequisite for the following analysis: for the χ2 test presented below, it is often required that the expected frequency in no cell be less than 5, in order to avoid the test statistic becoming inaccurate.5 We can now enter the expected frequencies into a fourfold table, which results in a so-called indifference table, as shown in Table 2.5.

5 If expected values are below 5, Fisher’s exact test can be used instead.

2.4.3.5 Comparison of Observed and Expected Frequencies (χ2 Test)

We can now examine the deviations between the observed and expected frequencies. To do this, we calculate the difference between the observed and expected frequencies for each of the four cells, a to d. These differences are known as residuals; they are calculated in Table 2.6. We use the residuals to determine the Chi-square (χ2) statistic. To calculate it, we square the residual of each cell, which results in only positive values (as squared deviations, they cannot cancel each other out). We then divide each squared deviation by the corresponding expected frequency and add up the quotients of all four cells to obtain the final result. The entire process for the four cells z can be summarized using Eq. (2.3):

χ2 = Σ (z = 1 to 4) (fo(z) − fe(z))2 / fe(z)  (2.3)

For our example data, this looks as follows:

χ2 = (18 − 11.04)2 / 11.04 + (5 − 11.96)2 / 11.96 + (6 − 12.96)2 / 12.96 + (21 − 14.04)2 / 14.04 ≈ 15.626

Thus, for our example data, the χ2 value is approximately 15.626. The value of a χ2 statistic increases with the deviation between observed and expected frequencies. We would obtain a χ2 value of 0 if observed and expected frequencies were identical. As a rule of thumb, we can say that the larger the χ2 value, the further we move away from the null hypothesis.
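Eq. (2.3) can likewise be verified in a few lines; a sketch using the observed and expected cell values from above:

```python
# Chi-square statistic, Eq. (2.3): for each cell, square the residual
# (observed minus expected), divide by the expected frequency, and sum.
fo = [18, 5, 6, 21]                # observed cells a to d (Table 2.4)
fe = [11.04, 11.96, 12.96, 14.04]  # expected cells a to d (Table 2.5)

chi2 = sum((o - e) ** 2 / e for o, e in zip(fo, fe))  # approx. 15.626
```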
However, in order to interpret this value correctly, it is necessary to take a look at the χ2 distribution, a probability distribution derived from the normal distribution. Its shape depends on the degrees of freedom (df). To put it simply, degrees of freedom indicate the number of values in a fourfold table that can vary freely or, conversely, the number of cell values needed to determine all other cell values given the column and row totals. In our case, only one value is needed because, given the column and row totals, we only require a single cell value to calculate the remaining three cells. For a cross-tabulation with k columns and l rows, we can calculate the degrees of freedom using Eq. (2.4):

df = (k − 1) × (l − 1)  (2.4)

For the present case, this looks as follows:

df = (2 − 1) × (2 − 1) = 1

The χ2 distribution shows us a probability density function for the case where there is no relationship between the group and the given response (i.e., when we assume the null hypothesis; see Sect. 2.4.3.3). For the case of df = 1, this function looks as represented in Fig. 2.1. The x-axis represents the possible values χ2 can take. Since the χ2 statistic is, by definition, a sum of squared terms, negative values cannot occur, so the x-axis starts at 0 and increases continuously to the right. The y-axis shows the probability density. The area below the function represented by the blue line corresponds to a probability of 1. Using this, we can calculate the probability that a certain χ2 value lies within a specific interval on the x-axis. To this end, we set two intervals based on where the null hypothesis should be accepted, i.e., where the region of acceptance should be, and from which value it should be rejected, i.e., where the region of rejection should begin. The point at which these regions are parted depends on the chosen test level, α, also known as the significance level.
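For df = 1 there is even a closed form that lets you check tabulated values without a table. This shortcut is ours, not the chapter’s: if Z is standard normal, then Z2 follows a χ2 distribution with one degree of freedom, so the upper-tail probability is P(χ2 > x) = erfc(√(x/2)):

```python
import math

# Closed-form upper tail of the chi-square distribution for df = 1,
# via the complementary error function (a standard identity; the chapter
# itself uses tabulated critical values instead).
def chi2_sf_df1(x):
    return math.erfc(math.sqrt(x / 2.0))

# 3.841 is the conventional df = 1 critical value for a 5% test level:
alpha_at_critical = chi2_sf_df1(3.841)  # approx. 0.05
# The example statistic lies far beyond it:
p_value = chi2_sf_df1(15.626)           # far below 0.05
```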
The significance level indicates the probability of making an α (or “Type 1”) error, that is, rejecting the null hypothesis when it is actually true. This probability can be chosen freely, but various conventions have been established in different areas of research. For example, the error probability is often stipulated as α = 0.05.

Fig. 2.1 Chi-square distribution for df = 1 (own representation)

If, for example, α is set at 5%, then, with a probability of 95%, the χ2 statistic takes on a value that corresponds at most to a so-called critical value if the null hypothesis is true (i.e., if the group and the given answer are actually independent of each other). For a χ2 distribution with one degree of freedom, this critical value corresponds to χ2 = 3.841. Thus, if the null hypothesis is true, then the χ2 statistic will take a value between 0 and 3.841 with a probability of 95%. Only with a probability of 5% is the value greater than 3.841. If the χ2 statistic is greater than the critical value, we can reject the null hypothesis with an error probability of α = 5%. To make things easier, the critical values for different combinations of test levels and degrees of freedom are documented in tables.6 Our above-calculated χ2 value of 15.626, hence, is clearly in the region of rejection, so we can reject the null hypothesis with the corresponding error probability and accept the alternative hypothesis: there appears to be a relationship between the group (“Performative” or “Constative”) and the answer given to the question (“Yes” or “No”).

Box 2.1 By the way, the shape of the probability density function depends on the number of degrees of freedom. For example, if we had a study design with five degrees of freedom instead of just one, it would look like Fig. 2.2.
At a significance level of α = 0.05, the critical value, in this case, would be 11.07; this is where the area colored in orange begins, which represents 5% of the total area.

6 See, for example, https://en.wikibooks.org/wiki/Engineering_Tables/Chi-Squared_Distibution.

Fig. 2.2 Chi-square distribution for df = 5 (own representation)

2.4.3.6 Calculating the Φ Coefficient for the Effect Size

Let’s discuss some limits of the χ2 statistic. The value that χ2 would take if there were perfect dependence between the group and the answer given to the question, i.e., the maximum value of χ2, depends on the underlying number of cases (n = 50 in the example data used here). The value of χ2 may also vary with the size of the contingency table, which could change the number of degrees of freedom and the shape of the probability density function. Therefore, comparing χ2 values between different studies with different numbers of cases is difficult. To make meaningful comparisons, we need correlation measures, such as the Φ coefficient or the percentage difference, that indicate the strength of the effect found; that is why one typically reports the strength of a found effect on a normalized scale. If we assume a symmetrical relationship between the variables, where both variables can influence each other, we can use the Φ coefficient as a symmetrical measure of association between two binary variables. The Φ coefficient was introduced by Udny Yule in 1912 and normalizes the χ2 value by placing it in a ratio to the number of cases (n = 50 in our example). The Φ coefficient is used exclusively for fourfold tables and can be calculated with Eq. (2.5) if the χ2 value and n are known. In the next section, we will discuss the percentage difference as another correlation measure.
| Ф |= χ2 n (2.5) In this form, |Ф| ranges from 0 to 1, where 0 indicates perfect independence and 1 indicates perfect correlation, so the Ф coefficient has a constant maximum and is 2 Quantitative Vignette Studies: χ2 Tests—Empirically Reconsidering. . . Table 2.7 Fourfold table with frequencies for perfect correlation Reply No Yes Group Performative 26 0 26 41 Constative 0 24 24 26 24 50 not affected by the underlying number of cases or dimensions of the contingency table. Let’s illustrate this with an example. A perfect correlation would exist, for instance, if in the “Performative” group, all subjects answered “No,” and in the “Constative” group, all subjects answered “Yes.” In this case, if the groups consist of 26 and 24 subjects, respectively, the resulting table would look like Table 2.7. In this scenario, χ2 has a value of 50, which is equivalent to the number of cases (n). However, if we imagine another case with n = 100 where 50 subjects from the “Performative” group had answered “No” and 50 from the “Constative” group had answered “Yes,” χ2 would have a value of 100. That is, we get two different χ2 values for two perfect correlations. In general terms, we get different χ2 values for the same dependency between the group variable and the answer variable if we conduct the experiment with a different total number of participants. However, when we calculate |Ф| in both cases instead of χ2 only, we find that it is 1, indicating a perfect correlation. Hence, we get the same Ф value for different χ2 values. / . |Ф| = 50 = 50 / 100 =1 100 For our example data above with a χ2 of 15.626 and n = 50, we obtain a |Ф| of / . | Ф |= 15.626 ≈ 0.559 50 There is another way to define Ф, besides this—as the absolute value bars might suggest—simplified form. If we use the definition from Eq. (2.6), then the coefficient can range from −1 to 1. In either case, we can say that a stronger effect is indicated as the coefficient moves further away from 0. 
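The normalization in Eq. (2.5) is easy to verify numerically. A minimal sketch (the function name is ours):

```python
import math

def phi_from_chi2(chi2, n):
    """|Phi| according to Eq. (2.5): the chi-square value in ratio to n."""
    return math.sqrt(chi2 / n)

# Two perfect correlations with different sample sizes (n = 50 and n = 100)
# produce different chi-square values but the same |Phi|:
print(phi_from_chi2(50, 50))    # 1.0
print(phi_from_chi2(100, 100))  # 1.0

# Our example data (chi-square = 15.626, n = 50):
print(round(phi_from_chi2(15.626, 50), 3))  # 0.559
```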
However, the sign of the coefficient cannot be further interpreted since we are analyzing nominal scaled data (as opposed to, e.g., ordinal scaled data). Nevertheless, the sign can reveal some information about the arrangement of frequencies in our fourfold table: if the observed frequencies are high on the diagonal of cells a and d, we obtain a positive Ф. Conversely, if the observed frequencies are high on the diagonal of the cells b and c, we obtain a negative Ф.

Ф = (a × d − b × c) / √((a + c) × (b + d) × (a + b) × (c + d))  (2.6)

Table 2.8 Possible interpretation of the Ф coefficient

Ф Coefficient            Interpretation
0.00 ≤ |Ф| < 0.10        Practically no correlation
0.10 ≤ |Ф| < 0.30        Low correlation
0.30 ≤ |Ф| < 0.50        Medium correlation
0.50 ≤ |Ф| ≤ 1.00        High correlation

For our hypothetical data, this would look like the following, leading to the same result as above:

Ф = (18 × 21 − 5 × 6) / √((18 + 6) × (5 + 21) × (18 + 5) × (6 + 21)) ≈ 0.559

There are different conventions to interpret the value of the Ф coefficient. With Cohen’s (1988, 223–226) classification being adopted in Table 2.8, we present one of the more common ones. With |Ф| = 0.559 for our hypothetical data, we can, thus, conclude that there is a high correlation.

2.4.3.7 Calculating the Percentage Difference for the Effect Size

Against the theoretical background presented above, it is possible to assume an asymmetrical relationship between the variables, where one variable is considered the independent variable and the other the dependent variable. In our case, the group is the independent variable and the answer the dependent variable, since the answer given depends on which group one belongs to. As an asymmetrical measure of the strength of a correlation, the percentage difference can be used instead of the symmetrical Ф coefficient. To get the percentage difference, we calculate conditional relative frequencies from the given absolute frequencies.
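Eq. (2.6) can be sketched in a few lines of stdlib Python; the second call illustrates how swapping the diagonals flips the sign without changing the strength (the function name is ours):

```python
import math

def phi_fourfold(a, b, c, d):
    """Signed Phi coefficient for a fourfold table, Eq. (2.6)."""
    return (a * d - b * c) / math.sqrt(
        (a + c) * (b + d) * (a + b) * (c + d)
    )

# Cell frequencies from our hypothetical data: a=18, b=5, c=6, d=21.
print(round(phi_fourfold(18, 5, 6, 21), 3))   # 0.559
# Frequencies concentrated on the secondary diagonal (cells b and c):
print(round(phi_fourfold(5, 18, 21, 6), 3))   # -0.559
```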
Assuming that the column variable is the independent variable and the row variable is the dependent variable, the conditional frequency for a cell z (pz) is calculated as the quotient of the frequency in cell z and the corresponding column total, as shown in Eq. (2.7).

pz = z / column total  (2.7)

For instance, for cell a from Table 2.4, that would be:

pa = a / (a + c) = 18 / (18 + 6) = 0.75

This way, we obtain the conditional frequencies in Table 2.9 for all four cells a to d.

Table 2.9 Fourfold table with (rounded) conditional values

              Group
Reply     Performative    Constative
No        0.750           0.192
Yes       0.250           0.808
Total     1               1

Often, instead of reporting conditional frequencies, conditional percentages (pz%) are used. These are obtained for a cell z with Eq. (2.8), where we multiply the conditional proportions by 100.

pz% = pz × 100  (2.8)

For cell a, this would be accordingly:

pa% = 0.750 × 100 = 75.0%

This way, we obtain the conditional percentages in Table 2.10 for cells a to d.

Table 2.10 Fourfold table with conditional percentages

              Group
Reply     Performative    Constative
No        75.0            19.2
Yes       25.0            80.8
Total     100.0           100.0

We can now calculate the percentage difference (Δ%) using the conditional proportions. Assuming that the independent variable is the column variable and the dependent variable is the row variable, we can express the percentage difference as shown in Eq. (2.9).

Δ% = (a / (a + c) − b / (b + d)) × 100  (2.9)

Hence, the percentage difference for our sample data is as follows:

Δ% = (18 / (18 + 6) − 5 / (5 + 21)) × 100 ≈ 55.8

For our hypothetical data, 75% of subjects in the “Performative” group answered “No” to the question, while only 19.2% of subjects in the “Constative” group did so. The difference between the two groups equals 55.8 percentage points. Therefore, we can conclude that the group has an effect of 55.8 percentage points on the answer to our question. We use percentage points because the percentages are based on different reference values, namely the column totals, due to the column percentages.

The percentage difference can range from −100 percentage points to 100 percentage points. As for the Ф coefficient, the sign of the percentage difference can hardly be interpreted meaningfully for variables on a nominal scale, and it depends on the arrangement of the frequencies. Positive differences are obtained if the frequencies are most pronounced along the main diagonal (i.e., in cells a and d), while negative differences are obtained if they are most pronounced along the secondary diagonal (i.e., in cells b and c). If we take Table 2.7 as a basis, where there is a perfect correlation and frequencies are located along the main diagonal, we obtain a percentage difference of 100. However, if we reverse the values in the table so that the frequencies are located in cells b and c instead of cells a and d, we obtain a percentage difference of −100. Thus, we can conclude that a perfect correlation exists at ±100 percentage points, while there is no correlation at 0 percentage points.

Table 2.11 Possible interpretation of the percentage difference

Percentage difference     Interpretation
0 ≤ |Δ%| < 10             Practically no correlation
10 ≤ |Δ%| < 30            Low correlation
30 ≤ |Δ%| < 50            Medium correlation
50 ≤ |Δ%| ≤ 100           High correlation

For interpretation, we can use a rule of thumb similar to that for the Ф coefficient by simply multiplying the threshold values from Table 2.8 by 100, which results in the values in Table 2.11. As was the case for the Ф coefficient, 55.8 percentage points indicate a high correlation for our hypothetical case.

So far, we have seen how to operationalize the criteria for classifying constative and performative utterances and how to create corresponding vignettes. Now, it is also apparent how to evaluate the data statistically. In the next section, we will demonstrate how to finally get the data.
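Eq. (2.9) is equally easy to verify numerically. A minimal sketch (the function name is ours):

```python
def percentage_difference(a, b, c, d):
    """Percentage difference (Delta%) for a fourfold table, Eq. (2.9):
    the first-row column proportions, compared across the two groups."""
    return (a / (a + c) - b / (b + d)) * 100

# Our hypothetical data: a=18, b=5, c=6, d=21.
print(round(percentage_difference(18, 5, 6, 21), 1))  # 55.8
# The perfect correlation from Table 2.7 yields the maximum:
print(percentage_difference(26, 0, 0, 24))            # 100.0
```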
2.5 Creating the Survey

In this section, we will learn how to create a questionnaire for our experiment using the free survey tool LimeSurvey. We will take a look at the basic functions of LimeSurvey so that we can create a first questionnaire of our own. We will begin by showing how to set up LimeSurvey and create a new project in Sect. 2.5.1. Then, in Sect. 2.5.2, we will focus on the actual creation of the questionnaire. If you are already familiar with LimeSurvey, you might want to skip this section.

2.5.1 Setting up LimeSurvey and Creating a New Project

To implement the study, we use LimeSurvey, which offers two options. We can either use the LimeSurvey cloud service, which provides us with a preinstalled instance of LimeSurvey (either for free or for a fee, depending on the functionality), or we can install the Community Edition of LimeSurvey on our own Web server. Installing LimeSurvey on a Web server is relatively straightforward, so we outline the installation process below. System requirements and detailed installation instructions7 as well as answers to frequently asked installation questions8 can be found on the LimeSurvey website.

First, we need to download the latest stable version of LimeSurvey, which is available on the LimeSurvey website in the /downloads subdirectory.9 Next, we decompress the package and, using an FTP program, transfer the files to a directory, such as /limesurvey, on our own Web server (in binary mode and preserving the directory structure). After that, we set some directory permissions: /limesurvey/tmp, /limesurvey/upload, and /limesurvey/application/config need full read-write access, while all other directories require only read access. Once we have successfully transferred all data, we can start the installation program by accessing the subdirectory /limesurvey/admin/ on our own website.
For example: http://website.com/limesurvey/admin/

The installation script will then guide us through the final steps of the installation. After a successful installation, we can find the login screen at:

http://website.com/limesurvey/admin/

If we did not specify our own username and password during installation, we can log in here with the username “admin” and the password “password.” In this case, we will be prompted to set a new password after logging in for the first time.

7 https://manual.limesurvey.org/Installation_-_LimeSurvey_CE/de.
8 https://manual.limesurvey.org/Installation_FAQ/de.
9 https://community.limesurvey.org/downloads/.

2.5.2 Preparation of the Questionnaire

After finishing the setup described in Sect. 2.5.1, we can now create a questionnaire with LimeSurvey. Later, we will use this questionnaire to obtain the data necessary to answer the research questions posed in Sect. 2.3. Now, let us first take a general look at how to structure our study.

As described in Sect. 2.4, we use a between-subjects design, meaning that each subject will be asked the questions developed in Sect. 2.4.1 regarding either a constative or a performative utterance. For instance, a subject may be presented with the following exercitive utterance and will not see the other nine utterances.

Utterance exercitive/performative
I hereby pardon you.

As explained in Sect. 2.4.2, we cannot present subjects with only the isolated utterance because, for example, the act of pardoning someone can only be successful if the utterance is made by someone who is authorized to do so. Otherwise, the utterance is invalid, and no pardon takes place. Thus, the utterance exercitive/performative must be presented in a context that eliminates obvious infelicities. After introducing context and utterance, we ask subjects the five yes-or-no questions truth, act, doubt, intentionality, and time in randomized order.
Another group of subjects is asked the very same questions about a constative version of the utterance exercitive/constative so that we can compare the answers to the performative utterance with those to the constative utterance (see Sect. 2.4.3). As discussed in Sect. 2.4.2, our study examines five different performative utterances and compares them with five different constative utterances. Each subject is presented with exactly one utterance, resulting in a total of ten subject groups. In each group, the five questions from Sect. 2.3 are asked with regard to the presented utterance. Therefore, the questionnaire’s core consists of ten groups, with each group comprising one utterance and five questions. However, no subject is presented with the entire core of the questionnaire but only with one of the core’s groups. We want to ensure that approximately the same number of subjects works through each of these groups. To obtain an even allocation, each subject is randomly assigned to one group. Therefore, the questionnaire needs to include a randomization mechanism that allocates subjects randomly to groups.

After completing the core section, the next step is to proceed to the control section. The purpose of this section is to minimize the possibility of subjects blindly selecting answers just to complete the survey quickly and receive payment. In the control section, we evaluate whether subjects have carefully read and worked through the questions and tasks. It includes questions that can only be answered if subjects have paid attention during the core section. If subjects fail to pass the control section, the survey is terminated for them, and we will not use their input for our analysis, regarding it as invalid.

Before the core section, subjects will receive a welcome text and instructions for the questionnaire. These aim to greet the subjects and inform them that their input and data will be used anonymously for a scientific experiment.
Additionally, it explains precisely what the subjects will have to do in the following survey (i.e., in the core section). Here, we do not disclose to the subjects what is being investigated in detail, nor the difference between constative and performative utterances. The experiment aims to elicit intuitive answers to the questions asked about the utterances; thus, subjects should not be pre-directed in any particular direction.

After the core section, there is the control section, followed by demographic questions that collect specific demographic data, such as age, gender, and occupation. Gathering this data may reveal possible influences of these variables on the obtained data. Lastly, the survey concludes with a closing section, where subjects are informed that the survey has ended, and they are thanked for their participation. We summarize the welcome, explanation, demographic questions, and closing sections in one section below.

To ensure that subjects move through the questionnaire’s pages and questions in the correct order, a logic system is built into the questionnaire. This system specifies the conditions under which subjects move from one page to another. Although the logic system is not a separate or self-contained section of the questionnaire, we will explain it in detail below. Additionally, we will explain the general settings that apply to the entire questionnaire. For instance, the general settings include the layout of the pages presented to subjects and the feature of presenting questions on separate pages rather than displaying all five questions at once.

To summarize, the questionnaire comprises the following components:
(a) Core
(b) Control section
(c) Welcome, instructions, demographic questions, and closing section
(d) Logic system and general settings of the questionnaire

We will now take a closer look at how to create a questionnaire for our survey.
Remember, if you want to delve deeper into LimeSurvey, there are plenty of tutorials to be found online. Also, don’t forget to take a look at LimeSurvey’s official documentation. Additionally, you can download our complete survey from this book’s supplementary information: https://doi.org/10.1007/978-3-031-58049-9_2.

It is best to begin constructing a questionnaire with the core and then develop the simpler or standardized parts around it. Therefore, we will begin by outlining the core (Sect. 2.5.2.1) and its control section (Sect. 2.5.2.2). After that, we will be familiar with all the necessary functions of LimeSurvey to add the welcome, instructions, demographic questions, and closing section (Sect. 2.5.2.3), as well as the logic system and general settings of the questionnaire (Sect. 2.5.2.4).

2.5.2.1 The Core

We will now create the core of the questionnaire, which comprises ten vignettes, each containing one utterance (five constative and five performative). For each vignette, we will create a group in LimeSurvey (Sect. 2.5.2.1.1) and then add our five questions to each group (Sect. 2.5.2.1.2).

2.5.2.1.1 Adding Groups

As mentioned above, the core of the questionnaire consists of ten groups, each of which contains an utterance (either performative or constative), an utterance context, and five questions about the utterance. To create a group, we select Structure (located at the top left of Fig. 2.3) and then select Add group (as illustrated in Fig. 2.3).

Fig. 2.3 Adding a group

Now we enter an informative title to help us keep track of the many groups we need to create. We create the group behabitive/performative, which will include a vignette with a behabitive performative utterance and its context. Under Description, we enter the context with the utterance presented in this group (see Sect. 2.4.2, Table 2.1) and click Save in the top right corner. Next, we’ll add our five questions.
2.5.2.1.2 Adding Questions

Above, we have established the first group with a vignette. In order to continue, we must add the five questions that will be posed to subjects in this group. On the left of Fig. 2.4, we can see the newly created Behabitive/Performative group, which does not yet contain any questions. To add a question, we click on Add question, located at the top left. We enter a Code for the question, which will be important for our questionnaire’s logic system. Below the code, we enter the first of our five questions (question truth) in the text field Question. On the right side, we specify the Question type. We select “Mask questions” from the dropdown menu at the bottom and then select Yes/No (Type Y). This indicates that the question will be a yes-no question. Both Yes and No will be presented as answer options to click on in the questionnaire. In the lower right corner, we select that answering the question is Mandatory, ensuring that subjects can only move on to the next question once they have selected one of the answer options (i.e., Yes or No), and are thus “forced” to choose an answer.10

10 Whether one should force one’s participants in this way is a matter of debate.

Fig. 2.4 Adding a question
Fig. 2.5 Question preview for the behabitive/performative question

Finally, we click on Save or Save and add question. If we select Preview question, we can see how the question will look to subjects (Fig. 2.5). Since the survey is not yet active (more on this in Sect. 2.6), a note appears at the top to indicate this (see Fig. 2.5). To preview an entire question group, we can select the group under Structure on the left and click Preview question group.

Fig. 2.6 Adding further questions

The next step is to add all five questions (truth, act, doubt, intentionality, and time) with reference to the given context. In Fig. 2.6, for the Behabitive/Performative group, all five questions have been inserted and numbered consecutively. There are still many question groups to be inserted. Thus, we may want to “collapse” them so that we only see the groups’ names and not the individual questions. To do this, we click on the small arrow next to the number indicating the number of questions in the group (located on the left in Fig. 2.6).

In Fig. 2.6, the second group has already been created, containing the constative counterpart to the behabitive performative utterance. The four questions truth, act, doubt, and intentionality have been added to this group, and the question time is being prepared in the text box. Code, Question type, and Mandatory have been set accordingly. If we now click on Save, the second of our ten core groups is completed. We repeat this process until all ten groups of our questionnaire’s core have been created.

2.5.2.2 The Control Section

The control section aims to identify subjects who did not complete the survey carefully. To achieve this, we ask subjects a control question after they finish answering the five questions on the performative or constative utterance. The control question should not be too difficult and should be easy to answer if the survey is done conscientiously. If the control question is too difficult or requires more effort to answer than the survey’s core, it may exclude more subjects than those who did not complete the survey carefully. On the other hand, it should be unlikely for the subject to guess the correct answer if they were not attentive. For the Behabitive/Performative group in this survey, we ask the subjects the following control question:

Control question behabitive/performative
Which of the following questions about Hans’ utterance were you not asked in this survey?

In our control section, the subjects are presented with a choice of five questions.
They were previously asked four of these questions (i.e., truth, act, doubt, and time). One of the questions, the question grammar, was not asked.

Question grammar
Does it make sense to ask whether Hans’ utterance is grammatically correct?

If subjects were not attentive, it is hardly possible to determine from the content of the five options which answer is the correct one, meaning that subjects who have not worked through the survey carefully can only answer the control question correctly by chance. This way, a majority of subjects who did not work carefully are filtered out, and their answers are not used for analysis. These subjects are then taken directly to the end of the survey.

Each subject works on precisely one of the ten core groups, and a separate control question is formulated for each core group. The answer options of the control questions must correspond to the questions asked in the respective core group. For example, the control question behabitive/performative (as shown above) cannot be asked about a core group in which no Hans is mentioned. Therefore, we created a separate control group for each of the ten core groups, which only contains one question, the control question (as shown in Fig. 2.7).

Fig. 2.7 Creating the control question for the behabitive/performative group

However, the control question differs from the yes-no questions in the core groups as it requires selecting one answer option from the five options provided. This makes it a multiple-choice question. Hence, we select Multiple choice (Type: M) as Question type. Additionally, we set Mandatory to On. Thereafter, we access the Subquestions tab at the bottom of the page (as shown in Fig. 2.8). We now create five subquestions. Each subquestion will be an answer option to the actual control question. For each additional subquestion, we select Add subquestion.

Fig. 2.8 Adding subquestions to the control question

In Fig. 2.8, we can see the four subquestions that we have created so far. The fourth subquestion, circled in red, is the question about grammar and, hence, the correct answer to our control question. Next, we will add the question time. Note that we omitted the question intentionality and replaced it with the question grammar.

2.5.2.3 Welcome, Instructions, and Demographic Questions

After creating the core and control questions, we now have a basic grasp of LimeSurvey’s interface for groups and questions. This means that the technical requirements for the welcome, explanation, demographic, and closing sections are straightforward and won’t be discussed further. To see the structure of these parts, remember that the complete survey can be downloaded from this book’s supplementary information.

In the welcome group, we welcome subjects, express our gratitude for their participation, and provide important formalities. These formalities include the expected duration of the survey and the assurance that it is anonymous and responses cannot be traced back to individuals. We enter this information into the group as text, similarly to how we enter the vignettes for the questions. Unlike the question groups, we don’t pose a question in the welcome group. Subjects read the text and proceed to the next group, which is the instructions group. The welcome text for our survey reads as follows:

Welcome to our survey!
If you work in a concentrated manner, you will probably not need more than 5 to 10 minutes for this survey. It is important that you read the instructions and tasks carefully. Also, please complete the survey without closing your browser in between. We will evaluate your answers and the answers of all other subjects in this survey. All data will be stored in anonymous form so that no information can be assigned to a single person. The results of the survey will be published. Thank you for your participation!
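The rationale behind this filter can be illustrated with a quick simulation: a subject who answers the five-option control question purely at random passes only about one time in five. This is a sketch of ours, not part of the survey; the option labels mirror the question names from the text.

```python
import random

# A careless subject picks one of the five answer options at random;
# only one option (the grammar question) is correct.
random.seed(1)
options = ["truth", "act", "doubt", "time", "grammar"]
trials = 100_000
passed = sum(random.choice(options) == "grammar" for _ in range(trials))
print(round(passed / trials, 1))  # about 0.2: four in five careless subjects are filtered out
```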
In the instructions group, we present the following explanation to subjects:

In this survey, you will be presented with a verbal utterance that someone says in a certain situation. You will then be asked to answer a total of five yes-or-no questions about this utterance. Please answer the questions according to your own personal assessment. There are no right or wrong answers.

We intentionally do not provide instructions to subjects about performative and constative utterances in order to avoid influencing their responses, since our survey aims to explore subjects’ intuitive assessments and to investigate whether they distinguish between these two types of utterances, as proposed by Austin’s theory. While it may be helpful to include an example in the explanation for more complex studies, we avoid giving any hints about the content of the actual survey.

Standard practice is to collect sociodemographic information from subjects to examine whether their response behavior is influenced by factors such as age, gender, or education level. In the section demographic questions (also known as sociodemographic questions), we limit ourselves to some basic variables and ask the following three questions:

Demographic question 1
With which gender do you identify?

Demographic question 2
How old are you?

Demographic question 3
What is your highest educational degree?

Demographic questions 1 and 3 are multiple-choice questions, just like our control question. We can create answer options under Subquestions in the Question editor. For demographic question 1, the gender options (which we base on Germany’s legal definition) are male, female, and diverse. For demographic question 3, various educational qualifications are specified in as much detail as possible.
Demographic question 2 uses Short free text (S) as Question type, with a text field appearing for subjects to enter their age as a number.11 Once the last designated question is answered (in this survey, the last demographic question), subjects are automatically redirected to the panel provider (as described in Sect. 2.6).

11 It’s important to note that, set up like this, LimeSurvey does not recognize whether the text input is a number. If, in this state, at least one character (such as the letter “g”) is entered, LimeSurvey considers it valid, and the subject can continue.

2.5.2.4 Logic System and General Settings

After we have created all questions and explanations in LimeSurvey, we define the general settings and the logic system of the survey. The general settings concern the layout (e.g., the font design) of the survey and display options for the questions (e.g., whether all questions of a group are to be displayed at once or one after the other). The logic of the survey determines which questions are presented to subjects and in which order.

First, we’ll discuss the logic system of the survey: The survey should be designed so that all subjects are shown the introduction. In addition, each subject shall be shown exactly one question group with the associated control question (according to our between-subjects design, see Sect. 2.4.2), where the questions of the respective question group shall be presented in random order. LimeSurvey goes through all question groups from top to bottom according to the created structure. So how can we ensure that each subject is assigned exactly one question group and that each question group is answered by (at least approximately) the same number of subjects? One possibility to solve this problem is to incorporate a random number generator that assigns each subject to exactly one question group after the introductory part.
For this purpose, we insert the additional question group Group Assignment after the introduction section and add a pseudo question, which takes over the role of random assignment to the groups with the actual questions (as shown in Fig. 2.9). Instead of a real (colloquial) question, we enter the following command in the Question editor and select the question type Equation:

{if(is_empty(randnumber.NAOK),rand(1,10),randnumber.NAOK)}

This command generates a random number between 1 and 10 for each subject. To keep this question group hidden from subjects, we select the function Always hide this question under Display (located at the bottom of the Question editor). The number range of 1–10 has been chosen since there are ten question groups in total.

Fig. 2.9 Creating a group and a pseudo question to assign a random number to subjects

Next, we assign a number from 1 to 10 as a Condition to each of the ten question groups. The control question group associated with a question group is also assigned the same number. For instance, in Fig. 2.10, the second question group on behabitive/constative is assigned the number 2 as a condition. The following input is used for this:

randnumber == "2"

Fig. 2.10 Creating a condition for a group

LimeSurvey now proceeds as follows: All question groups are gone through in order from top to bottom according to the survey’s structure; groups to which a condition has been assigned are only displayed if the condition is met. In the hidden Group Assignment group, each subject is randomly assigned a number from 1 to 10. Let’s say the number 6 is drawn. After the Group Assignment group, LimeSurvey goes through each group in turn, skipping those that have a number other than 6 as a condition. Only the group with the number 6 is displayed. In the structure shown, this is the exercitive/constative group.
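The behavior of this assignment mechanism can be sketched outside LimeSurvey. The simulation below mimics rand(1,10): each simulated subject draws one number, and with enough subjects the ten groups end up roughly equally filled. The code is illustrative only, not part of LimeSurvey.

```python
import random
from collections import Counter

# Each simulated subject draws a number from 1 to 10 exactly once,
# just as the hidden Equation question does, and would see only the
# question group whose condition matches that number.
random.seed(0)
assignments = Counter(random.randint(1, 10) for _ in range(10_000))

print(len(assignments))  # 10: every group receives subjects
# The counts cluster around 1,000 per group, i.e., the allocation
# is approximately (though not exactly) even:
print(min(assignments.values()), max(assignments.values()))
```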
Once the subjects have processed the five questions of this group, LimeSurvey moves to the group exercitive/constative control question, since the number 6 was also assigned to this group and it is next in order. The control question groups use their own logic, which we’ll discuss below. As a reminder, the control question asks which question subjects were not asked previously. There are five questions to choose from, four of which were asked, while the following one was not:

Question grammar
Does it make sense to ask whether Hans’ utterance is grammatically correct?

Thus, if subjects select the question grammar from the five answer options of the control question, they’ve answered the control question correctly, and LimeSurvey will forward them to the demographic questions. If they select one of the four other options, they’ve answered the control question incorrectly because the four other questions were actually asked beforehand. In this case, LimeSurvey will stop the survey immediately and redirect them to an appropriate page of the panel provider (see Sect. 2.6).

Fig. 2.11 Creating a quota
Fig. 2.12 Adding a response option to a quota

How can you implement these functions in LimeSurvey? We use survey quotas for this purpose. Under Settings, we select Quotas and set a quota for each of the ten control questions’ four wrong answers (i.e., 40 quotas for 40 wrong answer choices). In the following example, we create a quota for one of the four wrong answer options for the control question behaperfcont. We select Add new quota, and a screen appears to create the quota (see Fig. 2.11). We name the initial quota behaperfcont1 with a Limit set to 0. This indicates that the limit is surpassed as soon as this incorrect answer option is selected (i.e., zero subjects are allowed to select this answer).
In this situation, the subject will be redirected to a URL provided by our panel provider, where they will be informed that the survey was terminated because the control question was answered incorrectly (see Sect. 2.6). Once the quota has been added, an answer option can be included in this quota under Quotas (see Fig. 2.12). Once the option Add answer is selected, a survey question menu will appear, where we need to choose the control question behaperfcont. The following menu will display the five answer options for behaperfcont. From there, we need to select one of the four incorrect answers for the behaperfcont1 quota. Figure 2.13 displays the finished behaperfcont1 quota. To summarize, the behaperfcont1 quota works as follows: When a subject selects the incorrect answer "Can the utterance be true or false?" for the control question behaperfcont (which asked which question about Hans' utterance was not asked in the survey), the subject is directed to a specified URL, and the survey is terminated for them due to the limit being set to 0.

Fig. 2.13 The first quota
Fig. 2.14 General survey settings

However, if another false answer is selected, nothing happens yet. For this mechanism to work as it should, we have to create a separate quota for each incorrect answer. We don't create a quota for the correct answer, though. Consequently, if the correct answer is chosen, LimeSurvey automatically moves to the next question group, which has either no condition or a condition matching the number randomly assigned in the Group Assignment group (in this example, the number 6). Only the exercitive/constative question group and its associated control question group have the number 6 as a condition. Hence, LimeSurvey will skip the other question groups and move on to the demographic questions group, since it has no conditions set.
As a last setting in the logic system, we need to make sure that the questions within a question group are presented in random order. To do so, for each individual question of a group, we go to the Question editor, where we select Logic at the bottom. Next, we enter the name of the question group the question belongs to under Randomization group name; that is, we enter the same name for all questions of the same group. Now, the questions of the group will be presented in random order. Note that it does not matter what name we enter; we only have to make sure that the same name is assigned to questions of the same group and only to those. Lastly, we take a look at the General settings (see Fig. 2.14). Here, we make sure that the survey is presented to the subjects question by question (under Format) and select the desired template for the survey's presentation. In this case, we use a slightly customized version of the vanilla template that can be found in the online appendix of this book.

Fig. 2.15 Presentation and navigation settings

In the presentation settings (Presentation), we find additional options (Fig. 2.15). On the left side, we can choose to hide the question index from subjects (Show question index/allow jumping), display only the group description (Show group name and/or group description), and hide both the question number and code (Show question number and/or code). On the right side, we can choose to hide the total number of questions (Show "There are X questions in this survey") and, due to the panel integration (see Sect. 2.6), omit the welcome screen for subjects (Show welcome screen) and automatically redirect them to the panel provider's end URL upon survey completion (Automatically load end URL when survey complete).

2.6 Running the Survey

There are various ways to run such a survey.
One way would be to follow Knobe's (2003) approach and go to a bustling park in New York City, asking passersby to answer a few questions. However, this method may not always be feasible, as parks may not always be nearby or bustling, and it can be challenging to reach a large number of subjects. LimeSurvey, on the other hand, is well suited for conducting studies via the Internet, which provides the unbeatable advantage of being always available and bustling, unlike a New York City park. The simplest way to recruit subjects via the Internet is to provide an access link, which can be circulated with a request for participation through relevant channels. Alternatively, we can use the services of a panel provider. Such providers typically have large pools of subjects who are willing to participate in surveys, thus facilitating the recruitment of potential subjects. This approach can be particularly useful if one wants more control over the sample's characteristics or a large number of subjects. When using a panel provider, we must take two things into consideration: first, it is necessary to identify which panel members accessed the survey, and second, it is necessary to provide feedback to the panel provider about which subjects successfully completed the survey and which were excluded due to an incorrectly answered control question. Luckily, LimeSurvey can automate both of these tasks for us. Subjects who access the survey through the panel provider are led to the survey via a personalized link. This link contains a parameter that uniquely identifies the subject, so it is important to ensure that this parameter is captured.
At the end of the survey, subjects are redirected to the panel provider via an individual link that contains the very same parameter used for identification, as well as information on whether they completed the survey successfully or were excluded due to an incorrectly answered control question. To implement this, we create a new group and question in LimeSurvey. The question type should be either a short or a long text. We name the question tic because this is the name of the identification parameter used by our panel provider. Then, we set it to always be hidden in the question settings under Display. Subjects must land directly on this hidden question in the survey. To achieve this, we deactivate the automatically generated welcome page in LimeSurvey (to do this, we select Presentation under Survey settings and then, as shown in Sect. 2.5.2.4, choose the Off option under Show welcome screen) and create a new question group that contains only this question and is placed as the first group before all others. Next, the Panel integration settings can be accessed by selecting Settings via the Survey menu in LimeSurvey, and a new parameter can be created by selecting Add parameter. The input mask shown in Fig. 2.16 will appear. We name the new parameter tic as well and select the previously created question as the target question. We can then make a link available to the provider to direct the subjects to the survey. To do this, we simply select Preview survey in the upper bar (as shown in Fig. 2.17) and copy the URL of the preview that subsequently opens from the address bar. The copied URL will then look something like this:

http://website.com/limesurvey/index.php/179792?newtest=Y

Fig. 2.16 Input mask for creating a new parameter for panel integration
Fig. 2.17 Activating the survey preview

Then we simply add the parameter tic to the URL.
To do this, ?newtest=Y is replaced by ?tic=abc, resulting in:

http://website.com/limesurvey/index.php/179792?tic=abc

By doing this, the string after the equals sign (i.e., "abc") is saved in the survey. Our panel provider will create an individual string for each person participating in the survey, which will serve as a unique identifier for every subject in the data. Now it is necessary to ensure that the provider is informed of which subjects have successfully completed the survey and which ones have been excluded due to incorrectly answered control questions. For this purpose, the panel provider offers specific links for redirection, which can now be used accordingly. Consider the following example:

https://panelprovider.com/survey/1234?s=1&re_tic=

The string "1234" identifies the survey from the provider's perspective. The parameter s= indicates the status: in this case, s=1 represents the successful completion of the survey, while s=2 indicates exclusion due to incorrect control question responses. The parameter re_tic= should contain the identification number of the person being redirected via this link. To redirect subjects who successfully complete the survey to our panel provider and inform the provider that they did so, we can now go to the survey's presentation settings (as shown in Sect. 2.5.2.4) and select Automatically load end URL when survey complete. Now, subjects are automatically redirected to the panel provider after the last demographic question. They will be thanked for completing the survey and informed that they will receive the agreed payment. Recall that the parameter s=1 indicates successful completion. Hence, we add it to our URL. By also adding PASSTHRU:tic to the parameter re_tic=, the value saved for the respective participant is automatically inserted when the URL is loaded.
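The structure of these return links can be illustrated with a short Python sketch. This is our own illustration, not part of the LimeSurvey setup: LimeSurvey performs the {PASSTHRU:tic} substitution server-side, and the base URL below is the hypothetical example from above.

```python
from urllib.parse import urlencode

BASE = "https://panelprovider.com/survey/1234"  # example URL from the text


def redirect_url(status, tic):
    """Build the provider's return link: s=1 on completion, s=2 on screen-out."""
    return BASE + "?" + urlencode({"s": status, "re_tic": tic})


# LimeSurvey inserts the subject's identifier via {PASSTHRU:tic};
# here we pass it in ourselves for illustration.
completed = redirect_url(1, "abc")  # subject "abc" finished the survey
failed = redirect_url(2, "abc")     # subject "abc" failed the control question
```

For subject "abc", this yields .../1234?s=1&re_tic=abc on completion and .../1234?s=2&re_tic=abc on a failed control question, mirroring the two links discussed below.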
The resulting end URL that we insert at this point might then look something like this:

https://panelprovider.com/survey/1234?s=1&re_tic={PASSTHRU:tic}

To ensure that subjects who answer the control question incorrectly are automatically redirected to the panel provider, and to inform the provider that they failed the control question, we navigate to Quotas in the Settings section of the survey menu. Here, we created a number of quotas in Sect. 2.5.2.2, above. Now, as shown in Fig. 2.18, we can specify the redirect URL, in our example as:

https://panelprovider.com/survey/1234?s=2&re_tic={PASSTHRU:tic}

Remember that, in this case, the parameter s=2 indicates the failure to answer the control question correctly. Additionally, we make sure that Terminate survey is selected as the Quota action and that the Autoload URL box is checked before saving the quota. Now, the answers of subjects who answered the control question incorrectly will be discarded, and they will be redirected to the panel provider, where they will see the information that they did not complete the survey properly and will not receive any payment.

Fig. 2.18 Adding the URL to a quota
Fig. 2.19 Activating the survey
Fig. 2.20 Making the last settings before activating the survey

Once these preparations have been made, we can run the survey. To do this, we must first activate the survey by clicking on Activate this survey (see Fig. 2.19). The screen shown in Fig. 2.20 will appear, where we can make some final settings before starting the survey. If we choose Save & activate survey, the survey will be activated. Once a survey is activated, many settings cannot be changed anymore, so it is crucial to ensure that everything is set up correctly beforehand. Once the survey is activated, we can inform the panel provider. Usually, they will invite a small batch of subjects to test the survey first.
This gives us the opportunity to use the tools presented in Sect. 2.7 to take a first look at the data and search for any anomalies that could indicate potential errors. If nothing seems amiss, we can inform the panel provider that more subjects can be invited until the desired number of subjects is reached.

Fig. 2.21 Summary of the number of subjects
Fig. 2.22 Navigating to responses and statistics

In LimeSurvey's survey list (see Fig. 2.21), we can see a compact overview of how many subjects have partially or fully completed the survey. For a more detailed overview (without having to download the data as shown in Sect. 2.7.1), we can also click on Responses & statistics under Responses while the survey is running (see Fig. 2.22). Through the return links set up above, the panel provider also has an overview of how many subjects have successfully completed the survey or failed the control question. Once the target number of subjects is reached, an identification number check is usually done, for example, by providing the panel provider with a list of those subjects (remember the parameter tic=) that have successfully completed the survey or have failed the control questions. If no irregularities catch the eye, we are done with running the survey. Now, we can turn to analyzing the data we've collected.

2.7 Analysis

In the following, we will first take a look at how to export the data from LimeSurvey (Sect. 2.7.1). Then, we will get familiar with the software Stata, which will be used for further data analysis (Sect. 2.7.2). We will walk through the process of importing data into Stata (Sect. 2.7.3), and we will generate some bar charts and cross-tabulations with χ2 values, relative frequencies, and Cramér's V (Sects. 2.7.4 and 2.7.5).

Fig. 2.23 Exporting the data

2.7.1 Data Export

After completing the survey, we can export the collected data from LimeSurvey.
To do this, we select Responses under Settings in the Survey menu. In the top bar, we can then choose Export responses under Export. The page with settings for the data export appears, as shown in Fig. 2.23. Once we have customized the settings according to our preferences, we can select Export in the upper right corner to obtain a file with the data for download. If we download this file and open it, for example, with a text editor, it may look very messy, as in Fig. 2.24. To bring order to the chaos, we can import it into a program like Stata.

2.7.2 Getting Started With Stata

Although the statistical tests presented in Sect. 2.4.3 can be calculated by hand, we can automate the process and save some time using statistical software. We will demonstrate this process using Stata (version 14.2). After installing Stata on our computer and opening the program, we will see the main window (Fig. 2.25). This window comprises different sub-windows, including the results window in the middle, where the program provides feedback and displays important analyses. Below this, we can find the command window, where we can enter commands. To the right of the output area, we have the variables window at the top and the properties window at the bottom. Here, all the variables in the dataset and their corresponding properties will be displayed. Next to it on the right, there is the history window, storing previously entered commands.

Fig. 2.24 The data in an editor
Fig. 2.25 Stata's main window
Fig. 2.26 Stata's do-file editor

All the commands we present can be entered in the command window and executed by pressing the enter key. This is fine for experimentation if you just want to play around a bit. However, for more complex analyses that need documentation, we suggest using do-files.
These files contain a series of commands that can be executed individually or as a whole at the click of a button. To create such a do-file, we can either click the button for the do-file editor at the top or enter the following command in the command window and press the enter key:

1   doedit

In addition to the main window, a window with the do-file editor will open, as shown in Fig. 2.26.

2.7.3 Data Import

Now let's get down to business. First, we need to tell Stata which directory we want to work in. For example, the LimeSurvey data downloaded in Sect. 2.7.1 may be located on the desktop in a folder called Chapter 1. Therefore, we start the do-file with a command that leads to this directory. To do this, we use the cd (change directory) command. We set the directory we want to switch to in quotes. In this case, we can type:

2   cd "~/Desktop/Chapter 1/"

It's worth noting that we can always call up the help text for a command using the help command. For instance, if we type

3   help cd

the window shown in Fig. 2.27 opens, providing a description of the command cd as well as information on syntax and examples.

Fig. 2.27 Window with help for the command cd

Next, we need to import the downloaded data. For this, we use the import delimited command. We put the path to the LimeSurvey data in quotes. Say we saved it as 1_data.csv in said folder. In this case, we type the following on a new line:

4   import delimited "~/Desktop/Chapter 1/1_data.csv"

These are the first two commands in our do-file. We can save the file by clicking the Save button, allowing us to access the saved commands at any time or easily share them with others.

Fig. 2.28 Stata's main window after importing the data set

There are two main ways to execute these first two commands.
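For readers who want to inspect the exported CSV outside Stata, the import step can be mimicked with Python's standard library. This is a sketch under the assumption of a comma-delimited export; the column names in the toy data are hypothetical.

```python
import csv
from io import StringIO

# A tiny stand-in for the exported 1_data.csv; the columns are invented.
csv_text = "lastpage,behaperfcont4\n67,Y\n12,\n67,Y\n"

# Roughly analogous to Stata's `import delimited`: parse the delimited
# file into one dict per observation.
rows = list(csv.DictReader(StringIO(csv_text)))
n_obs = len(rows)      # like Stata's obs count
n_vars = len(rows[0])  # like Stata's vars count
```

With a real export, `StringIO(csv_text)` would be replaced by an opened file handle, and `n_obs`/`n_vars` should match the obs and vars counts Stata reports after importing.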
We can click the Do button in the top right, which executes all the commands in the do-file line by line (so it's important to write the commands in the correct order). Alternatively, we can select one or more lines and then click Do, which executes only the selected commands (again, line by line). If these first two commands are executed successfully, the main window changes (see Fig. 2.28): a summary of the commands is displayed in the results window, and the variables contained in the data set are listed in the variables window. The imported data set can now be saved in Stata format. To do this, we use the save command, followed by the file name we choose. After a comma, we can use the replace option to overwrite a file with the same name if it already exists in our working directory:

5   save "1_data", replace

This saves the data as a dta file that will be stored as 1_data.dta. In the future, if we want to work with this data, we don't need to import the CSV file as we did in Sect. 2.7.1. We can simply tell Stata which data set to use with the use command, for example:

6   use "1_data", clear

The clear option indicates that data in memory can be replaced, even if it wasn't saved previously.

2.7.4 Data Cleaning

Once the data has been successfully imported, we can see (just below the import command in the results window; see Fig. 2.28) that 123 variables (vars) with 886 observations (obs) have been imported. Each observation represents a subject who took part in our survey. The relatively high number of observations is due to the inclusion of subjects who answered the control questions incorrectly or withdrew from the survey. However, our analysis should focus on subjects who answered the control questions correctly and completed the survey. Therefore, we must find a way to exclude those not meeting these criteria. Let's first focus on the control questions. As shown in Sect.
2.5.2.2, there is a control question for each of the ten groups, which was slightly modified to suit the wording of the respective group's vignette. For example, for the first group, the control question was:

Control question behabitive/performative: Which of the following questions about Hans' utterance were you not asked in this survey?

Subjects had to select one of the following five options:

• Does Hans perform an action with the utterance (in addition to the action of speaking)?
• Can the utterance be true or false?
• Does it make sense to ask whether Hans really apologized for the accident (i.e., can one doubt that he apologized for the accident)?
• Does it make sense to ask whether Hans' utterance is grammatically correct?
• Does it make sense to ask a week later if Hans is still apologizing for the accident?

The correct answer is the fourth option: Subjects were not asked whether it makes sense to ask if the respective statement is grammatically correct. The subjects' answers to this question are stored in the dataset in the variables behaperfcont1 to behaperfcont5. Clicking on Data Browser in the main window opens a new window that displays the data in tabular form (see Fig. 2.29). The same can be achieved with the command:

7   browse

Fig. 2.29 Stata's data window

So, each answer option of the control question is listed as a separate variable named behaperfcont1 to behaperfcont5. As we can see by going through the data, such a variable only contains a value if a person has selected the corresponding answer to the control question. In this case, a Y (Yes) appears. For example, Fig. 2.29 shows that the two subjects in rows 95 and 102 chose answer 4 to the control question, as a Y appears where their rows intersect with the column behaperfcont4. This way of listing the answers can be confusing and hard to analyze. To make things easier, in a first step, we summarize the answers to the control questions in a single variable per group. This will be illustrated for the first group. First, we convert the variable to a numerical variable that uses a 1 instead of the letter Y. This can be achieved, for example, with the following command:

8   encode behaperfcont1, generate(behaperfcont_num_1)

Using the encode command, we tell Stata to convert the following string variable (in this case, behaperfcont1) to a numeric variable. To store the result, we create a new variable using the generate option and specify its name as behaperfcont_num_1. In behaperfcont_num_1, every Y is replaced with a 1. To avoid writing this out five times for the five different variables, we can construct a simple loop. The following loop instructs Stata to execute the command in braces for the values 1 to 5, which the parameter i can take on:

9    forvalues i = 1/5 {
10       encode behaperfcont`i', generate(behaperfcont_num_`i')
11   }

The loop executes the command

12   encode behaperfcont`i', generate(behaperfcont_num_`i')

five times, replacing the `i' in behaperfcont`i' and behaperfcont_num_`i' with the digits 1 to 5. This generates five numeric variables, which can now be combined into a single variable. To do this, we generate a new variable named qualityfail_behaperf, initially containing only missing values (symbolized by the . after the equals sign):

13   generate qualityfail_behaperf = .

A subject answered the control question incorrectly if they selected one of the wrong answers. Stata should, therefore, replace the . with a 1 if behaperfcont_num_1, behaperfcont_num_2, behaperfcont_num_3, or behaperfcont_num_5 equals 1, since answer options 1, 2, 3, and 5 are the wrong answers to the control question. This can be achieved using the replace command.
The command

14   replace qualityfail_behaperf = 1 if behaperfcont_num_1 == 1

tells Stata to write a 1 to the variable qualityfail_behaperf if the condition listed after if is true (i.e., if the variable behaperfcont_num_1 contains a 1). This is repeated for the cases where behaperfcont_num_2, behaperfcont_num_3, and behaperfcont_num_5 contain a 1:

15   replace qualityfail_behaperf = 1 if behaperfcont_num_2 == 1
16   replace qualityfail_behaperf = 1 if behaperfcont_num_3 == 1
17   replace qualityfail_behaperf = 1 if behaperfcont_num_5 == 1

For the sake of completeness, we also assign the value 0 to the variable if the correct answer was selected (i.e., if there is a 1 in behaperfcont_num_4):

18   replace qualityfail_behaperf = 0 if behaperfcont_num_4 == 1

We repeat this process for all ten groups. Once this is done, we can create a general variable that indicates how many subjects answered the control question incorrectly overall. To this end, we generate a new variable, this time with the name qualityfail_global:

19   generate qualityfail_global = .

This variable should take on a value of 1 whenever a subject in any of the ten groups has given an incorrect answer. Hence, we write:

20   replace qualityfail_global = 1 if qualityfail_behaperf == 1
21   replace qualityfail_global = 1 if qualityfail_behacons == 1
22   replace qualityfail_global = 1 if qualityfail_verdperf == 1
23   replace qualityfail_global = 1 if qualityfail_verdcons == 1
24   replace qualityfail_global = 1 if qualityfail_exerperf == 1
25   replace qualityfail_global = 1 if qualityfail_exercons == 1
26   replace qualityfail_global = 1 if qualityfail_commperf == 1
27   replace qualityfail_global = 1 if qualityfail_commcons == 1
28   replace qualityfail_global = 1 if qualityfail_expoperf == 1
29   replace qualityfail_global = 1 if qualityfail_expocons == 1
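The logic of these replace commands can be cross-checked with a small Python sketch. This is our own illustration, not part of the chapter's Stata workflow; the toy data and the helper name qualityfail are invented, and the answer options are mutually exclusive, as in the survey.

```python
# Each subject is a dict with one entry per answer-option column:
# "Y" if that option was chosen, empty otherwise (names follow the chapter).
subjects = [
    {"behaperfcont1": "", "behaperfcont2": "", "behaperfcont3": "",
     "behaperfcont4": "Y", "behaperfcont5": ""},   # chose the correct answer
    {"behaperfcont1": "Y", "behaperfcont2": "", "behaperfcont3": "",
     "behaperfcont4": "", "behaperfcont5": ""},    # chose a wrong answer
]

WRONG = ["behaperfcont1", "behaperfcont2", "behaperfcont3", "behaperfcont5"]
CORRECT = "behaperfcont4"


def qualityfail(subject):
    """1 = failed the control question, 0 = passed, None = no answer."""
    if any(subject[v] == "Y" for v in WRONG):
        return 1
    if subject[CORRECT] == "Y":
        return 0
    return None  # corresponds to Stata's missing value "."


flags = [qualityfail(s) for s in subjects]
```

For the two toy subjects, flags comes out as [0, 1]: the first passed, the second failed.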
In order to include only those subjects in the following analysis who did not answer the control question incorrectly, we use the command drop to remove those who have a value of 1 in the variable qualityfail_global. This is done using:

30   drop if qualityfail_global == 1

Now we have removed the data of all subjects who answered the control question incorrectly from the analysis. In addition, we only want to consider those who have also reached the last page of the survey. LimeSurvey has automatically created a variable named lastpage, which records for each subject which page they last accessed. Using the command tab, a summary of the values of this variable can be accessed by entering:

31   tab lastpage

Stata will then display a table showing that the last page is numbered 67 and was reached by a total of 617 subjects. To further limit the analysis to these subjects, we use the keep command to keep only those whose last page was number 67. This is achieved using:

32   keep if lastpage == 67

2.7.5 Analysis of Responses

Now that we have limited our analysis to those subjects who answered the control question correctly and completed the survey, we focus on our five constative–performative pairs. Below, we go through the first pair's analysis as an example. A look at the data reveals the same problem as in Sect. 2.7.4: it is stored as string variables. Here again, a simple loop can be created that executes the encode command for all ten relevant variables, five from the performative group and five from the constative group:

33   forvalues i = 1/5 {
34       encode behaperf`i', generate(behaperf_num_`i')
35       encode behacons`i', generate(behacons_num_`i')
36   }

This generates ten numerical variables for the questions asked, which can be used more easily later. An N is replaced by a 1 and a Y by a 2. To easily distinguish between the performative and constative groups, we also create the variable treatment:

37   generate treatment = .

Here, we can refer to the variable randnumber.
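The two filtering steps (drop and keep) can likewise be mirrored in plain Python. This is a sketch with invented toy data, not the chapter's workflow; the dictionary keys follow the variable names above.

```python
# Toy data: one dict per subject; the values are invented for illustration.
subjects = [
    {"qualityfail_global": 1, "lastpage": 67},     # failed a control question
    {"qualityfail_global": None, "lastpage": 30},  # dropped out early
    {"qualityfail_global": None, "lastpage": 67},  # passed and finished
]

# Stata: drop if qualityfail_global == 1
kept = [s for s in subjects if s["qualityfail_global"] != 1]
# Stata: keep if lastpage == 67
kept = [s for s in kept if s["lastpage"] == 67]
```

Only the third toy subject survives both filters, matching what drop and keep do to the Stata dataset.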
When creating the survey in LimeSurvey, this variable was used to randomly assign subjects to one of the ten groups. Each group, therefore, has a number between 1 and 10, and the number stored in the variable randnumber reveals which group a subject was assigned to. In our setup, the performative group received the random number 1 and the constative group received the random number 2. Accordingly, we can write:

38   replace treatment = 0 if randnumber == 1
39   replace treatment = 1 if randnumber == 2

To avoid confusion, we give the variable's values names. For this purpose, we first define a label set using label define, which we give the name treatment_lb. Then we assign the label Performative to the value 0 and Constative to the value 1:

40   label define treatment_lb 0 "Performative" 1 "Constative"

Now we can assign the label set treatment_lb to the variable treatment using the command label values:

41   label values treatment treatment_lb

With these labels, Stata displays the self-explanatory names Performative and Constative instead of the cryptic 0 or 1 when working with the variable treatment. Now we will look closer at the frequencies of the Yes and No answers to the five questions in these two groups. To avoid writing this out for each question individually, we can again use a loop iterating over the values 1 to 5. Inside the loop, the preserve and restore commands are used, which reset the dataset to its original state at the end of an iteration; thus, all changes made inside the loop are temporary. In the following, we will present the components of this loop piece by piece before the entire loop is shown at the end. We first limit the dataset (temporarily) to the subjects of the first two groups using the keep command.
In the second step, we use keep to retain only the three relevant variables, namely treatment, behaperf_num_`i', and behacons_num_`i':

42   keep if randnumber == 1 | randnumber == 2
43   keep treatment behaperf_num_`i' behacons_num_`i'

Next, we generate a new variable called answer to combine the responses of subjects from both groups:

44   generate answer = .

This variable should take a value of 0 if a subject answered No in either the performative or the constative group, and it should take a value of 1 if a subject answered Yes in either group. To achieve this, we can write:

45   replace answer = 0 if behaperf_num_`i' == 1
46   replace answer = 1 if behaperf_num_`i' == 2
47   replace answer = 0 if behacons_num_`i' == 1
48   replace answer = 1 if behacons_num_`i' == 2

To improve readability, we create a label set:

49   label define answer_lb 0 "No" 1 "Yes"

We then assign this label set to the variable answer:

50   label values answer answer_lb

Now that the responses for both groups have been named and sorted, we can analyze them using a simple bar chart created with the histogram command. We can use the percent option to display relative frequencies instead of absolute frequencies and the discrete option to indicate that the data is discrete. We want to display the responses by group and, hence, use by(treatment). We set a label for the x-axis using xlabel(0 "No" 1 "Yes") and a range for the y-axis using yscale(range(0 100)). The complete histogram command then looks like this:

51   histogram answer, percent discrete by(treatment) xlabel(0 "No" 1 "Yes") yscale(range(0 100))

If we take i=1, for example, the above command generates the bar chart shown in Fig. 2.30, which opens in a separate window.
Stata generates the x-axis label from the variable name "answer." Since Stata was instructed to display relative frequencies using percent, the y-axis is automatically labeled "Percent." Additionally, in the lower left corner, Stata generates the note "Graphs by treatment." The axis labels can be easily adjusted by specifying two new titles with xtitle and ytitle. To do this, we expand the previous command as follows:

52   histogram answer, percent discrete by(treatment) xtitle(Answer)
53       xlabel(0 "No" 1 "Yes") ytitle(Percent)
54       yscale(range(0 100))

Fig. 2.30 Stata's bar chart output

To remove Stata's note in the lower left corner, we add a comma and include note("") after specifying the treatment variable:

55   histogram answer, percent discrete by(treatment, note(""))
56       xtitle(Answer) xlabel(0 "No" 1 "Yes") ytitle(Percent)
57       yscale(range(0 100))

This results in the graph shown in Fig. 2.31. Using graph export, we can now save the graphs generated by our loop to our working directory, specifying a name and format, for example, as beha_`i'.pdf:

58   graph export beha_`i'.pdf, replace

Running the loop will then generate and save five graphs, from beha_1.pdf to beha_5.pdf.

Fig. 2.31 Stata's bar chart output with adjustments
Fig. 2.32 Stata's cross-tabulation output
Fig. 2.33 Stata's cross-tabulation output with χ2 value, relative frequencies, and Cramér's V

Now we can calculate the χ2 value, the Φ coefficients, and the relative frequencies. To do this, we can create a cross-tabulation of the two variables answer and treatment using the tabulate command:

59   tabulate answer treatment

If we take i=1, for example, the result (appearing in the results window) looks like Fig. 2.32. To obtain the χ2 value, we add chi2, and for the relative frequencies, we add cell. Since, in this case, the Φ coefficient is identical to Cramér's V (Cramér, 1946), we also add the option V.
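As a rough cross-check on Stata's output, Pearson's χ2 and Cramér's V for a single 2×2 table can also be computed by hand. The sketch below is our own illustration in Python, using the behabitive "Truth" counts reported in Table 2.12 (Performative: 16 No, 47 Yes; Constative: 6 No, 50 Yes); the variable names are ours.

```python
import math

# Counts for the behabitive "Truth" question from Table 2.12.
table = [[16, 6],    # No answers  (Performative, Constative)
         [47, 50]]   # Yes answers (Performative, Constative)

n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

# Pearson's chi-squared: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / n.
chi2 = sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
           / (row_tot[i] * col_tot[j] / n)
           for i in range(2) for j in range(2))

# For a 2x2 table, Cramér's V equals the phi coefficient: sqrt(chi2 / n).
v = math.sqrt(chi2 / n)
```

Here χ2 comes out at about 4.24 with V ≈ 0.19; with one degree of freedom, a χ2 of 4.24 corresponds to the p ≈ 0.039 reported for this question in Table 2.12.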
The expanded command looks like this:

tab answer treatment, chi2 cell V

If we take i=1, the result looks like Fig. 2.33.

Fig. 2.34 Bar charts for all five performative–constative pairs (columns: Behabitive, Verdictive, Exercitive, Commissive, Expositive; rows: Truth, Act, Doubt, Intentionality, Time)

Running the loop, we obtain five bar charts and five cross-tabulations with χ2 value, relative frequencies, and Cramér’s V, comparing the answers to questions 1–5 for the first performative–constative pair. Overall, the loop looks like this:

forvalues i = 1/5 {
    preserve
    keep if randnumber == 1 | randnumber == 2
    keep treatment behaperf_num_`i' behacons_num_`i'
    gen answer = .
    replace answer = 0 if behaperf_num_`i' == 1
    replace answer = 1 if behaperf_num_`i' == 2
    replace answer = 0 if behacons_num_`i' == 1
    replace answer = 1 if behacons_num_`i' == 2
    label define answer_lb 0 "No" 1 "Yes"
    label values answer answer_lb
    histogram answer, percent discrete by(treatment) xlabel(0 "No" 1 "Yes") yscale(range(0 100))
    graph export beha_`i'.pdf, replace
    tab answer treatment, chi2 cell V
    restore
}

To generate the graphs shown in Fig. 2.34 and compile the data reported in Table 2.12, we can perform the same process for each of the five performative–constative pairs. This involves modifying the values of randnumber and the variables referenced by answer for each new pair.

Table 2.12 Cross-tabulations for all five performative–constative pairs

Behabitive (Perf. n = 63, Const. n = 56)
Question        Answer  Perf.        Const.       V      p
Truth           No      16 (13.45)   6 (5.04)     0.189  0.039*
                Yes     47 (39.50)   50 (42.02)
Act             No      28 (23.53)   27 (22.69)   0.038  0.681
                Yes     35 (29.41)   29 (24.37)
Doubt           No      35 (29.41)   18 (15.13)   0.235  0.010**
                Yes     28 (23.53)   38 (31.93)
Intentionality  No      50 (42.02)   51 (42.86)   0.163  0.075
                Yes     13 (10.92)   5 (4.20)
Time            No      32 (26.89)   13 (10.92)   0.284  0.002**
                Yes     31 (26.05)   43 (36.13)

Verdictive (Perf. n = 62, Const. n = 62)
Truth           No      36 (29.03)   10 (8.06)    0.434  ≤ 0.001***
                Yes     26 (20.97)   52 (41.94)
Act             No      25 (20.16)   38 (30.65)   0.210  0.020*
                Yes     37 (29.84)   24 (19.35)
Doubt           No      47 (37.90)   25 (20.16)   0.360  ≤ 0.001***
                Yes     15 (12.10)   37 (29.84)
Intentionality  No      43 (34.68)   49 (39.52)   0.111  0.218
                Yes     19 (15.32)   13 (10.48)
Time            No      42 (33.87)   21 (16.94)   0.339  ≤ 0.001***
                Yes     20 (16.13)   41 (33.06)

Exercitive (Perf. n = 62, Const. n = 57)
Truth           No      23 (19.33)   10 (8.40)    0.218  0.017*
                Yes     39 (32.77)   47 (39.50)
Act             No      26 (21.85)   48 (40.34)   0.436  ≤ 0.001***
                Yes     36 (30.25)   9 (7.56)
Doubt           No      32 (26.89)   15 (12.61)   0.259  0.005**
                Yes     30 (25.21)   42 (35.29)
Intentionality  No      39 (32.77)   53 (44.54)   0.359  ≤ 0.001***
                Yes     23 (19.33)   4 (3.36)
Time            No      38 (31.93)   11 (9.24)    0.426  ≤ 0.001***
                Yes     24 (20.17)   46 (38.66)

Commissive (Perf. n = 57, Const. n = 67)
Truth           No      13 (10.48)   11 (8.87)    0.081  0.369
                Yes     44 (35.48)   56 (45.16)
Act             No      35 (28.23)   52 (41.94)   0.177  0.049*
                Yes     22 (17.74)   15 (12.10)
Doubt           No      32 (25.81)   23 (18.55)   0.219  0.015*
                Yes     25 (20.16)   44 (35.48)
Intentionality  No      29 (23.39)   59 (47.58)   0.408  ≤ 0.001***
                Yes     28 (22.58)   8 (6.45)
Time            No      18 (14.52)   5 (4.03)     0.309  ≤ 0.001***
                Yes     39 (31.45)   62 (50.00)

Expositive (Perf. n = 71, Const. n = 60)
Truth           No      14 (10.69)   8 (6.11)     0.085  0.330
                Yes     57 (43.51)   52 (39.69)
Act             No      56 (42.75)   50 (38.17)   0.057  0.517
                Yes     15 (11.45)   10 (7.63)
Doubt           No      30 (22.90)   20 (15.27)   0.092  0.295
                Yes     41 (31.30)   40 (30.53)
Intentionality  No      20 (15.27)   45 (34.35)   0.467  ≤ 0.001***
                Yes     51 (38.93)   15 (11.45)
Time            No      5 (3.82)     8 (6.11)     0.105  0.230
                Yes     66 (50.38)   52 (39.69)

This table shows the answers (“yes” and “no”) that subjects gave for Questions 1–5. For each question, χ2 tests and Cramér’s V are reported for Behabitives, Verdictives, Exercitives, Commissives, and Expositives separately. The ten vignettes were presented between subjects. Questions were presented on different screens in randomized order. n gives the number of participants for each statement. Asterisks denote significance levels: ∗ p ≤ 0.05, ∗∗ p ≤ 0.01, ∗∗∗ p ≤ 0.001

2.8 Interpretation

In Sect. 2.7, we analyzed and processed our data. In this section, we discuss to what extent the data can help to answer the research questions posed in Sect. 2.3.
To do so, we interpret the data and the results of our analysis in the context of each research question. The fact that the data can be used to answer the research questions is a result of the study design presented in Sect. 2.4. The study was designed so that the subjects’ answers can provide insight into our questions. In Sects. 2.8.1, 2.8.2, and 2.8.3, we discuss the data in relation to each research question.

2.8.1 Interpreting the Data for Aim 1: Do Austin’s Criteria for Classifying Constative and Performative Utterances Work for Speakers of a Language Community?

In order to answer this question, we need to identify conditions for evaluating subjects’ responses. These conditions determine whether the subjects’ responses should be considered positive or negative cases regarding the research question. A positive case, we stipulate, is given when the criterion leads to the assumed classification—for example, when a performative utterance presented to the subjects is classified as performative based on the tested criterion. A negative case is given when the criterion does not lead to the assumed classification. Determining the conditions for a positive or negative case requires some specification. However, this specification is not entirely arbitrary but is based on (a) the statistical conditions of significance level and effect size (see Sect. 2.4.3) and (b) substantive conditions.

(a) First, let’s look at the statistical conditions. We specify that a positive case should have a significance level of at least p < 0.05. Additionally, we require at least a small effect size of Ф > 0.1 (see Sect. 2.4.3). Filtering Table 2.12 to consider only positive cases (i.e., cases that meet these statistical conditions), we get the pattern of positive cases presented in Table 2.13.
Table 2.13 Positive cases with a significance level p < 0.05 and an effect size Ф > 0.1

Question         Behabitive  Verdictive  Exercitive  Commissive  Expositive
Truth            Positive    Positive    Positive    –           –
Act              –           Positive    Positive    Positive    –
Doubt            Positive    Positive    Positive    Positive    –
Intentionality   –           –           Positive    Positive    Positive
Time             Positive    Positive    Positive    Positive    –

(b) Second, we will examine the substantive conditions for a positive case. While it is necessary for a positive case to meet the statistical conditions of significance level and effect size, these conditions alone are not sufficient. Consider, for example, the responses to question truth for the behabitive utterance (row 1, column 1). The χ2 test yields a significant result (p < 0.05) with a small effect size (Ф = 0.189). However, it would be misleading to classify this as a positive case. To see why, we must examine the corresponding bar chart in Fig. 2.34 for question truth for behabitives (row 1, column 1). When looking at the yes-no distribution of responses to the question

Question truth: Can the utterances be true or false?

it becomes apparent that for both the performative and the constative utterance, the majority of subjects answered Yes. This does not correspond to the prediction of Austin’s theory, since a performative utterance, according to Austin, does not have a truth value. Thus, according to Austin, subjects should have answered No in the case of the performative utterance. So, while the constative and performative utterances generate a statistically significant difference in response behavior, this cannot be classified as a positive case, as Austin’s criterion has obviously not led to the correct classification of performative and constative utterances.
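The statistical side of this check can be reproduced outside Stata. The following Python sketch (illustrative only, not part of the study's Stata pipeline) recomputes the χ2 value, its p-value, and Cramér's V for the behabitive truth comparison from the counts reported in Table 2.12, recovering the values discussed above (p ≈ 0.039, Ф ≈ 0.189):

```python
import math

def chi2_2x2(table):
    """Pearson's chi-square test for a 2x2 contingency table.

    table: [[a, b], [c, d]], rows = answer (No/Yes),
    columns = condition (performative/constative).
    With 1 degree of freedom the upper-tail probability is
    erfc(sqrt(chi2/2)), and Cramér's V coincides with the
    phi coefficient sqrt(chi2/n).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_sums = (a + b, c + d)
    col_sums = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (observed - expected) ** 2 / expected
    p = math.erfc(math.sqrt(chi2 / 2))  # p-value, 1 df
    v = math.sqrt(chi2 / n)             # Cramér's V (= phi for 2x2)
    return chi2, p, v

# Behabitive, question truth: No/Yes answers for the performative
# (n = 63) and constative (n = 56) conditions, as in Table 2.12.
chi2, p, v = chi2_2x2([[16, 6], [47, 50]])
# chi2 ≈ 4.24, p ≈ 0.039, v ≈ 0.189
```

That the test comes out significant while the case is still negative illustrates exactly the point made under (b): the statistics alone cannot tell us whether the direction of the response pattern matches Austin's prediction.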
In general, positive cases must satisfy the further necessary condition that the distribution of yes-no responses corresponds to the correct classification of constative and performative utterances in accordance with Austin’s theory. In summary, the following definition of positive and negative cases can be formulated:

Definition (positive/negative case): The response pattern of subjects regarding a criterion for distinguishing between performative and constative utterances is a positive case if and only if a statistically significant difference with at least a small effect size is present, and the overall response pattern leads to a classification of constative and performative utterances that corresponds to Austin’s theory. Otherwise, it is a negative case.

Applying this definition results in a decrease in the number of positive cases compared to Table 2.13, because the substantive condition discussed under (b) must also be met. The remaining positive cases are shown in Table 2.14. When looking at the results in Table 2.14, we notice that the fundamental distinguishing criteria expressed by question truth and question act seem to hardly work at all. This means that for these two criteria, the intuitions of native speakers—at least as far as could be seen in this study—do not lead to a classification of constative and performative utterances in Austin’s sense (each has less than 50% positive cases). Thus, the two original criteria that Austin first mentioned do not lead to a consistent distinction between the two classes of utterances among subjects. Similarly, the criterion intentionality only generates one positive case, for the expositive and the associated constative utterance.
Table 2.14 Positive cases according to the definition positive/negative case

Question         Behabitive  Verdictive  Exercitive  Commissive  Expositive
Truth            –           Positive    –           –           –
Act              –           Positive    Positive    –           –
Doubt            Positive    Positive    Positive    Positive    –
Intentionality   –           –           –           –           Positive
Time             Positive    Positive    Positive    –           –

However, the criteria time (three positive cases) and doubt (four positive cases) show much better results and can be used to distinguish between performative and constative utterances. Why do these two criteria work better than the other three? We will offer an explanation for this in the interpretation of Aim 2 below. At the same time, the question arises as to why the two comparatively well-functioning criteria of doubt and time show no positive cases for the expositive and the associated constative utterance. One possible explanation may be that expositive utterances are a deviant class of performative utterances. We will further examine this point in Sect. 2.8.4.1.

2.8.2 Interpreting the Data for Aim 2: Can the Event Character of Performative Utterances Be Used as a Criterion for Distinguishing Between Constative and Performative Utterances?

Table 2.14 shows that the criterion time works in the study for distinguishing behabitive, verdictive, and exercitive utterances from their respective constative utterances. When the expositive utterances are initially ignored, for reasons addressed in Sect. 2.8.4.1, the criterion of time works for most of the examined cases. Compared to the criterion of doubt, it has one less positive case but shows higher significance and effect size for behabitive and exercitive utterances. The same applies to the commissive and its associated constative utterance. However, only the constative utterance, not the performative utterance, exhibits the predicted yes-no distribution for this class. Therefore, this class is considered a negative case.
Following the answer to Aim 2, the question arises as to why the criteria doubt and time lead to better results regarding the distinction between constative and performative utterances than the other criteria. This question cannot be answered in the context of this study because the study aims only to verify whether the respective criteria work, not why they work or do not work. However, we can propose a hypothesis to address the above question—a possible, albeit untested, explanation that appears plausible based on the theoretical background and the available data. The hypothesis cannot be confirmed in the context of this study but can serve as a starting point for further empirical investigations. The proposed hypothesis is that, unlike the criteria truth value, action, and intentionality, the criteria doubt and time take advantage of the event character of performative utterances to distinguish them from constative utterances. A performative utterance is a temporally limited event. This does not mean that uttering the corresponding words is temporally limited, but that the action performed with them is temporally limited, as it is completed directly with the utterance. In the case of events such as apologizing, acquitting someone, pardoning someone, and so on, it makes no sense to ask whether these events really happen, because the respective event occurs with the performative utterance. “With” is not to be understood in the temporal sense that a separate event occurs simultaneously with the utterance, but in the sense that the performative utterance is the event, and this event ends with the utterance. In the same manner, it makes no sense to ask whether someone is still apologizing, acquitting, or pardoning someone, as the respective events are temporally completed with the utterance.
This suggests, in contrast to Austin, that the essential characteristics of performative utterances do not lie in their being actions and lacking truth value, but in their being temporally limited events.

2.8.3 Interpreting the Data for Aim 3: Do Performative Utterances Have a Truth Value?

As we discussed in Sect. 2.3.3, there is a pronounced controversy in the philosophical debate following How to Do Things with Words about whether performative utterances have a truth value. According to Austin, performative utterances do not have a truth value since they are actions. In contrast, Lemmon (1962), Quine (1981), Heal (1974), Bach (1965), Graham (1977), and Searle (1989) argue that performative utterances, although actions, still have a truth value and are, therefore, statements (see also Soames, 2003; Hornsby, 2006; Tsohatzidis, 2017). The peculiarity of performative utterances lies only in the fact that their truth conditions are already satisfied by uttering them (see Sect. 2.3.3). As we discussed in Sect. 2.8.1, the truth value criterion creates only one positive case for the examined pairs of performative utterances and associated constative utterances. Do the data also provide insight into whether the subjects generally assign a truth value to performative utterances? The question of whether performative utterances have a truth value must be distinguished from the question of whether the truth value criterion is suitable for differentiating between constative and performative utterances. To answer the latter question, the answers to question truth for a performative utterance are compared with the answers for the associated constative utterance. For the former question, only the answer patterns regarding question truth for the presented performative utterances are relevant; the answers for the constative utterances are not considered. From the first line of Fig. 2.34, it becomes clear that, with the exception of the verdictives, the subjects mostly indicate that performative utterances can be true or false and, therefore, have a truth value. But can this answer behavior be quantified? Using the one-sample test of proportion described in Chap. 3 (Sect. 2.3), the answer patterns for the performative utterances in row 1 can be evaluated, and it turns out that there is even a significant difference between the yes and no answers to question truth for the behabitive, exercitive, commissive, and expositive performative utterances. Overall, the answer patterns for performative utterances to question truth not only fail to confirm Austin’s thesis that performative utterances do not have a truth value but even contradict it. The subjects’ linguistic intuitions suggest that they assign a truth value to performative utterances. Therefore, the data support the arguments against Austin’s conception of performative utterances as actions without a truth value. The subjects seem to assign a truth value to performative utterances and thus to conceptualize them as statements.

2.8.4 Outlook and Possible Objections

In this section, we will provide an outlook on how the generated data can be used to develop a new hypothesis concerning the distinction between constative and performative utterances (Sect. 2.8.4.1). In Sect. 2.8.4.2, we will anticipate possible criticisms of the study presented in this chapter and make attempts to refute them.

2.8.4.1 An Example of How to Generate a New Hypothesis from the Data

In this section, we will show, by way of example, how to develop a new hypothesis from the data we collected in our study. We cannot use the data to confirm or reject the new hypothesis because the experimental design is not fit to test it.
However, the given data can give rise to new assumptions and make new hypotheses plausible that might be tested in future experiments. To understand the new hypothesis, we have to delve a little deeper into Austin’s distinction between constative and performative utterances and his own rejection of it. In Sect. 2.8.4.1.1, we introduce Austin’s distinction between implicit and explicit performative utterances, and in Sect. 2.8.4.1.2, we explain one of Austin’s main arguments for rejecting the distinction between constative and performative utterances. We will call it the constative expositive argument (CEA). Finally, in Sect. 2.8.4.1.3, we show how the data obtained in our experiments can contribute to formulating a hypothesis concerning the CEA.

2.8.4.1.1 Implicit and Explicit Performative Utterances

Within performative utterances, Austin makes a distinction between implicit and explicit performative utterances, which applies to each of the five categories introduced in Sect. 2.2.3. In other words, for each of the five categories, there are both implicit and explicit performative utterances. Examples of this distinction are given in utterances (10) and (11).

(10) I warn you about the bull in the field.
(11) There is a bull in the field.

When you make an effort to climb over the pasture fence, and a person standing next to you utters sentence (10), then with this utterance, the action of warning you is performed. It is made explicit by the verb “warn” that (10) is a warning. That is, the utterance itself clarifies what kind of action is performed by uttering it. Utterance (11) can be a constative utterance that is true or false. However, it can also be a warning, which is a performative utterance. If you try to climb over the pasture fence and someone else utters (11), then you might take it as a constative utterance, agree with it, and go on climbing over the fence.
But it’s more likely that you won’t. This is because (11) can also function as a warning. However, this performative aspect of the utterance isn’t explicitly, but only implicitly, stated. Thus, if (11) is used as a warning, it becomes an implicit performative utterance. In a different context, (11) could also function as a recommendation or a threat. The action that is performed with (11), or whether an action is performed at all, depends on the context of the conversation rather than on the utterance itself. To summarize, we use the term explicit performative utterance when a linguistic expression (typically the verb of an utterance) indicates what action is being performed by the utterance. On the other hand, an utterance is considered implicit performative when it does not contain such an expression and could, thus, also be constative. This distinction applies to all categories of performative utterances, including commissive and expositive utterances. In Sect. 2.8.4.1.2, we present the constative expositive argument, which relies on the distinction between implicit and explicit performative utterances.

2.8.4.1.2 The Constative Expositive Argument: Giving Up the Distinction Between Constative and Performative Utterances

In the following, we explore one of Austin’s strongest arguments for abandoning the distinction between constative and performative utterances. In the course of the first seven lectures of How to Do Things with Words, Austin grows increasingly skeptical about maintaining a clear distinction between constative and performative utterances. Finally, by the end of the seventh lecture, he decides to discard the distinction altogether. We can’t delve into all of Austin’s arguments for abandoning the distinction here, but we’ll present probably the strongest one, which we call the constative expositive argument (CEA). The CEA asserts that even supposedly
prototypical constative utterances such as statements, assertions, and reports are actually performative utterances. This becomes immediately apparent in the case of expositive utterances. Let’s recall utterance (9) from above.

(9) I claim that the earth revolves around the sun.

Utterance (9) is neither true nor false, but an action is performed with it—namely, the action of claiming something. To claim something is basically a linguistic action, like promising or apologizing. When you compare (9) with utterance (1),

(1) The earth revolves around the sun,

then (1), unlike (9), seems to be a constative utterance that is true or false and with which no action is performed. Is it really true, though, that we do not perform any action with (1)? According to Austin (1962, 132–135), we actually do something with (1). For instance, we can assert that the earth revolves around the sun, state it, or claim it using (1). Even though the explicit performative verb is missing, an action such as claiming, stating, informing, and so on is performed through utterance (1). Therefore, it is plausible to understand utterance (1) as an implicit performative utterance. If this holds for utterance (1), then it holds for all (purportedly) constative utterances: something is claimed, stated, described, and so on with every putatively constative utterance. As a result, they are linguistic actions and, therefore, performative utterances (expositives). Hence, there actually are no constative utterances at all, and accordingly, no distinction between constative and performative utterances can be made. All utterances are performative (cf. Black, 1963, 225).12 The CEA’s empirical evaluation isn’t the focus of our study. However, when we evaluate our data in Sect. 2.8.4.1.3, we’ll see that there are hints that raise doubt about the CEA.
2.8.4.1.3 We Can Formulate a Hypothesis on the CEA from the Data

Austin’s conclusion of the CEA includes that there is no distinction between constative and performative utterances because all utterances are used to perform a linguistic action in the sense of performative utterances. To be more precise, the CEA suggests that all constative utterances are actually implicitly expositive utterances, that is, a certain type of performative utterance. This would, of course, be a serious consequence. Are there really no constative utterances? Is there really no (albeit unconscious) distinction in language use between utterances that express something true or false and utterances with which one performs an action? In the following, we will consider how the generated data can be used to provide an indication of whether the consequences of the CEA correspond to the linguistic intuitions of speakers in a language community. Can linguistic intuitions provide clues as to whether constative utterances actually behave like expositive utterances and, therefore, whether there is no distinction between constative and performative utterances? Or is it perhaps the other way around: do expositive utterances behave like constative utterances? As discussed in Sect. 2.8.4.1.2, Austin’s constative expositive argument (CEA) may be his strongest argument against the distinction between constative and performative utterances. According to the CEA, constative utterances are implicit expositive, that is, performative, utterances.

12 The CEA’s conclusion can also be derived from Austin’s speech act theory. According to Austin (1962, 98f.), the performance of a locutionary act is eo ipso the performance of an illocutionary act. This means that, according to Austin, you never simply say something (true or false) with an utterance (locutionary act), but you always do something with it (illocutionary act).
They perform a speech act of stating, asserting, describing, and so on. Thus, every constative utterance is technically a performative utterance, and the distinction between constative and performative utterances collapses. Can the data support or weaken the CEA? To answer this, we need to consider Fig. 2.34 and Table 2.14. Except for question intentionality, there are no significant differences between the response patterns of the subjects for the expositive and the associated constative utterance. The reason for this is easy to see in Fig. 2.34: the yes-no distributions for the expositive and the associated constative utterance are nearly identical. In other words, for the questions truth, act, doubt, and time, it makes no difference to the response patterns whether the subjects are presented with the expositive or the constative utterance. It could now be argued that these results support the CEA—there is obviously no difference between constative and expositive utterances. Constative utterances behave like expositive (i.e., performative) utterances. However, the opposite seems to be true. The response patterns of the subjects do not show the yes-no distributions of performative but rather constative utterances. For example, for the question truth, most subjects agree for both the constative and expositive utterance that the utterance can be true or false—for the expositive utterance, a majority answering No would have been expected according to Austin. On the other hand, for question act, the subjects mostly answer No for both the constative and expositive utterance, contrary to Austin’s classification. The same applies to the response patterns for question doubt and question time. In all these cases, the subjects mostly give the answer expected for constative utterances for both the constative and performative utterances. 
Therefore, the data do not support the CEA (constative utterances are expositive, that is, performative utterances) but the opposite direction (expositive utterances are constative utterances). The subjects interpret utterances such as “I claim that Leonardo DiCaprio is the most attractive man” as statements that have a truth value, that do not perform an action, that can be doubted, and that are not temporally limited events. With this interpretation, it is understandable why the most successful criteria in this study, doubt and time, do not work for expositive utterances. The reason is that the subjects perceive expositive utterances not as performative but as constative utterances.

2.8.4.2 Possible Objections

When conducting an empirical investigation of a philosophical topic, it is customary to encounter objections. These objections may be fundamental (it is not possible, or it does not make sense, to empirically explore this topic), or they may be inherent in the investigation itself (it is fundamentally possible and reasonable to explore this topic empirically, but it should have been done differently at certain points). The former type of objection generally does not arise, or arises less frequently, when addressing a topic that has already been empirically explored and for which accepted examples exist. It is part of the research process to engage in a (sometimes controversial) discourse about possible objections. It can be advantageous to anticipate objections and thus weaken or refute them in advance. In the following section, we will outline some possible objections to the study presented here.

2.8.4.2.1 The Adverb “Hereby” Should Not Have Been Used

The adverb “hereby” was used in two of the performative statements that were presented to the subjects in the study (the verdictive and the exercitive). According to Austin, this adverb is an indicator of performative utterances.
An objection could be raised that the data are confounded by the use of “hereby” in the verdictive and exercitive utterances: Why do the subjects indicate that an action has been performed with the statement? Because the adverb “hereby” suggests it (not because of the statement itself without “hereby”). So why was “hereby” used if it can lead to objections? The reason is that, according to Austin, “hereby” is used in official actions and actions governed by legal regulations. To obtain the most authentic performative utterances possible, we inserted it into the verdictive and exercitive utterances (and only these). Nevertheless, a systematic investigation into the effects of “hereby” in utterances is desirable.

2.8.4.2.2 The Subjects Should Have Been Instructed on How to Understand the Questions

A second possible objection is that subjects’ answers to the test questions do not reflect what is actually supposed to be measured, as the subjects understand the questions differently than intended. For example, for the performative utterance “I apologize for the accident,” the question “Can the utterances be true or false?” (question truth) or the question “Does it make sense to ask if Hans really apologized for the accident?” (question doubt) could also be understood as asking whether the speaker apologized sincerely or not. One could, therefore, demand that the subjects be trained or instructed on how to understand the questions before conducting the survey. However, the aim of the study is to investigate the intuitions of native speakers regarding Austin’s tests for distinguishing constative and performative utterances. If the subjects had been trained or sensitized to distinguish between constative and performative utterances beforehand, it would be like introducing the distinction and then testing whether it exists in the linguistic representation of the subjects.
This would no longer capture the direct linguistic intuitions (but only judgments that reflect prior training).

2.8.4.2.3 The Study Is Not Concerned with Austin’s Theory of Speech Acts and Is, Therefore, Philosophically Irrelevant

A third possible objection is that Austin himself abandoned his distinction between constative and performative utterances and replaced it with his theory of speech acts, rendering empirical investigation of this distinction irrelevant. This objection can be countered by noting, first, that the investigation of the pragmatic categories of constative and performative utterances has inherent philosophical value, as demonstrated by the longstanding debate over the question of whether performative utterances are statements or not that followed Austin’s How to Do Things with Words. Second, the distinction between constative and performative utterances and the investigation of how language can be used to perform actions are fundamental to the development of Austin’s theory of speech acts and should, therefore, precede any future empirical engagement with this theory.

2.9 Further Literature

LimeSurvey. (n.d.). LimeSurvey manual. https://manual.limesurvey.org/LimeSurvey_Manual
StataCorp. (2021). Stata user’s guide. Release 17. Stata Press. https://www.stata.com/manuals/u.pdf

References

Austin, J. L. (1962). How to do things with words. The William James lectures delivered at Harvard University in 1955. Oxford University Press.
Bach, K. (1965). Performatives are statements too. Philosophical Studies, 28(4), 229–236.
Benveniste, É. (1974). Probleme der allgemeinen Sprachwissenschaft. List Verlag.
Black, M. (1963). Austin on performatives. Philosophy, 38(145), 217–226.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). L. Erlbaum Associates.
Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press.
Graham, K. (1977). J. L. Austin.
A critique of ordinary language philosophy. The Harvester Press.
Heal, J. (1974). Explicit performative utterances and statements. The Philosophical Quarterly, 24(95), 106–121.
Hornsby, J. (2006). Speech acts and performatives. In E. Lepore & B. C. Smith (Eds.), The Oxford handbook of philosophy of language (pp. 893–909). Oxford University Press.
Knobe, J. (2003). Intentional action and side effects in ordinary language. Analysis, 63(3), 190–194.
Kornmesser, S., & Bauer, A. M. (2023). Austin in the lab. Empirically reconsidering the constative–performative distinction. Topics in Linguistics, 24(2), 1–14.
Lemmon, E. J. (1962). On sentences verifiable by their use. Analysis, 22(4), 86–89.
Quine, W. V. O. (1981). Theories and things. The Belknap Press of Harvard University Press.
Searle, J. R. (1989). How performatives work. Linguistics and Philosophy, 12, 535–558.
Soames, S. (2003). Philosophical analysis in the twentieth century. Vol. 2. The age of meaning. Princeton University Press.
Tsohatzidis, S. L. (2017). Performativity and the "True/False Fetish". In S. L. Tsohatzidis (Ed.), Interpreting J. L. Austin (pp. 96–118). Cambridge University Press.
Yule, G. U. (1912). On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 75(6), 579–652.

88 A. M. Bauer et al.

Chapter 3
Quantitative Vignette Studies: t-Tests—Case Studies on Judgments About Unfelt Pains

Justin Sytsma

Abstract What is pain? Perhaps surprisingly, the standard answer to this question among philosophers does not derive from research in biology or other sciences but from claims about common sense and thought experiments intended to draw out our intuitions about the nature of pain. This raises a number of issues, among them the question of whether philosophers' claims about the commonsense conception of pain are accurate.
In this chapter, I'll explore some of the empirical research that has been done on this question in recent years, focusing on the claim that common sense tells us that there can be no unfelt pains. In doing so, I'll walk through several sets of studies, introducing the empirical research process and illustrating the use of one type of statistical tool—t-tests.

Keywords t-tests · Statistical inference · Empirical research process · Null hypothesis significance testing · R · Pain · Feeling pain · Standard view · Bodily view

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-58049-9_3.

J. Sytsma (✉) Philosophy Programme, Victoria University of Wellington, Wellington, New Zealand. e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. S. Kornmesser et al., Experimental Philosophy for Beginners, Springer Graduate Texts in Philosophy, https://doi.org/10.1007/978-3-031-58049-9_3

3.1 Introduction

The present chapter has two main goals. The first is to introduce you to a powerful family of statistical tests—t-tests. To do this, I'll walk you through three main case studies exploring the question of whether lay people tend to believe that there can be unfelt pains. Each of these will introduce one main type of t-test—one-sample t-tests, independent samples t-tests, and paired samples t-tests. Coupled with further studies bearing on the ordinary conception of pain, I hope that these studies will make it plausible to you that the ordinary view is rather different than many philosophers have supposed. This is the second goal of this chapter. Here is how I will proceed. In the next section, I lay out the standard philosophical view about pains, the assumptions that have been made about the ordinary conception of pain, and the challenge that has been raised by some experimental philosophers.
I begin to explore this challenge in more depth in Sect. 3.3, laying out the primary case study for this chapter—Study 3 by Sytsma (2010a)—and illustrating the use of one-sample t-tests. To do this, I detail the empirical research process I follow (Sect. 3.3.1), present the background motivation for the study (Sect. 3.3.2), and then detail the research question driving it (Sect. 3.3.3), the research design (Sect. 3.3.4), how I constructed the instrument (Sect. 3.3.5) and conducted the study (Sect. 3.3.6), and finally how I analyzed (Sect. 3.3.7) and interpreted (Sect. 3.3.8) the results. In Sect. 3.4, I then consider some potential worries about this study, detailing a selection of further studies that address them, including the second and third main case studies, which illustrate independent samples t-tests (Sect. 3.4.1) and paired samples t-tests (Sect. 3.4.4), respectively. Finally, in Sect. 3.5, I consider a worry that often arises in using t-tests: when we run multiple tests, we might need to apply a correction.

3.2 Philosophical Background: The Standard View of Pain in Philosophy

The standard view among philosophers is that pains belong to the mind, not the body. When you cut your finger, for example, this view contends that the sharp pain you feel in your finger is in fact a property of your mind, not your finger. Put another way, the standard view holds that there is no distinction to be drawn between pain and feeling pain. More carefully, the standard view is that pains are properties of conscious mental states. Consciousness is a notoriously tricky notion, with the term being used to pick out a number of different phenomena (Block, 1995).
Philosophers are most often interested in a specific sense of consciousness, however—what is typically termed "phenomenal consciousness." Researchers are not in complete agreement about how to understand this notion, but the standard idea is that phenomenally conscious mental states are those states for which there is "something it is like" (Nagel, 1974) to be in them, where this is meant to pick out a diverse range of states that are thought to have a distinctive "feel." These feels are associated with the mental states, typically being taken to be properties of them, and understood in this way they are known as phenomenal qualities (or qualia for short). Phenomenally conscious mental states and their associated qualia are typically drawn out by listing examples, detailing ordinary perceptual, bodily, or emotional experiences, and assuming that the distinctive qualities we're aware of in these episodes are phenomenal qualities. For instance, Michael Tye (2021) opens his Stanford Encyclopedia of Philosophy entry on "Qualia" in just this way:

I run my fingers over sandpaper, smell a skunk, feel a sharp pain in my finger, seem to see bright purple, become extremely angry. In each of these cases, I am the subject of a mental state with a very distinctive subjective character. There is something it is like for me to undergo each state, some phenomenology that it has.

Focusing on pains, the idea is that they are properties of mental states and that what makes them mental states of the type that they are is exactly those felt properties—the way the states feel to the person who has them. As such, according to the standard view in philosophy, there is no appearance–reality distinction to be drawn for pains. In other words, to have the appearance of pain—for someone to experience pain—is for that person to have a pain. And, likewise, for someone to have a pain is for them to experience pain.
Thus, the standard view endorses the following two conditionals: If a person has a pain, then she feels that pain. If a person feels a pain, then she has that pain. These conditionals have a number of implications. Most important for present purposes is that the first conditional excludes the possibility of there being unfelt pains: Since having a pain implies feeling pain, there could be no pain that is not felt. The idea that there can be no unfelt pains has a long and venerable history in philosophy. More than two hundred years ago, Thomas Reid (1785, 1.1.12) asserted that “pain, when it is not felt, has no existence.” In the previous century, Saul Kripke (1980, 152) expressed the underlying view succinctly when he wrote that pain “is picked out by the property of pain itself, by its immediate phenomenological quality.” And many contemporary philosophers of pain have continued this tradition, often clearly noting that they take the standard view to follow from our commonsense conception of pain. For instance, Murat Aydede (2005a, X) asserts that “it is part of the commonsense conception” that pains “can’t exist without someone’s feeling them.” Indeed, he holds that “there is an air of paradox when someone talks about unfelt pains,” noting that “one is naturally tempted to say that if a pain is not being felt by its owner then it does not exist” (2005b, 4). Similarly, Christopher Hill (2009, 169–170) expresses the standard view for a range of bodily sensations, holding that the way we talk about them “presupposes that the appearance of a bodily sensation is linked indissolubly to the sensation itself,” and asserting that “this is true, in particular, of our thought and talk about pain.” But should we accept the standard view of pain, including that there can be no unfelt pains? As we’ve just seen, this view is typically defended not by explicit philosophical arguments or empirical data, but by appeal to how we are said to commonly think and talk about pain. 
While this is often left implicit, the idea seems to be that the commonsense view should have default status—that we should assume the commonsense view in our philosophical discussions until or unless we have good reason to abandon it. Relatedly, the standard view is sometimes supported by appeals to intuition, laying out a hypothetical scenario involving pain and claiming that what we want to say about it coincides with the standard view. For example, in his Stanford Encyclopedia of Philosophy entry on "pain," Aydede (2009) offers the following thought experiment in support of the claim that there can be no unfelt pains, contrasting the standard view with an alternative that treats pains as properties of body parts:

Suppose that we do in fact attribute a physical condition, call it PC, when we attribute pain to a body part, and that PC is the perceptual object of such experiences. So, for instance, John's current excruciating experience (call this E) is caused by and represents a physical condition in his thigh. From this it would follow that (a) John would not have any pain if he had E, but no PC in his thigh (as in the case of, for instance, phantom limb pains and centrally generated chronic pains such as sciatica), and, conversely, (b) John would have pain if he had PC but no E (as would be the case, for instance, if he had taken absolutely effective painkillers or his thigh had been anesthetized). But these statements are intuitively incorrect. They appear to clash with our ordinary or dominant concept of pain, which seems to track the experience rather than the physical condition.
Similarly, Hill (2009, 171) appeals to intuitions about a thought experiment to support the standard view over the alternative picture:

If we were fully committed to the picture, we would be prepared to consider it epistemically possible that an injured soldier actually has a severe pain, despite his professions to the contrary, but that there is something wrong with the mechanisms in his brain that support attention, and that this is preventing the pain from penetrating the threshold of consciousness. When I have asked informants to assess the likelihood of this scenario, however, they have all been inclined to dismiss it as absurd.

In such passages, Aydede and Hill do not merely make claims about their own intuitions about the scenarios they lay out, but instead treat their intuitions as being general and, therefore, take them to tell us about the commonsense conception of pain. But are such claims about pain intuitions and the commonsense conception of pain accurate? Is it true, for instance, that common sense rules out the possibility of unfelt pain? Across an extended series of papers, Kevin Reuter and I have argued that it is not, supporting this contention with a diverse array of empirical studies that suggest that lay people (i.e., nonphilosophers) do not tend to have the intuitions that advocates of the standard view claim. Indeed, we find evidence that people often treat pains as properties of body parts and, doing so, happily countenance the possibility of having unfelt pains.1 This includes background work in experimental philosophy of mind2 arguing that lay people do not tend to share the philosophical concept of phenomenal consciousness in the first place (and so don't treat pains as phenomenal qualities),3 work using tools from corpus linguistics that indicates that the way people ordinarily talk about pains involves an appearance–reality distinction,4 and work using the types of questionnaire methods that I'll focus on in this chapter. The latter includes three papers using questionnaire methods to test whether people believe that unfelt pains are possible—Sytsma (2010a), Sytsma and Reuter (2017), as well as Reuter and Sytsma (2020)—which I'll focus on in the present chapter. This focus represents a divergence from the typical chapter in this volume, which illustrates a type of method or analysis in experimental philosophy by walking readers through a single case study. While there is much to be said for this approach, it is not feasible for the present chapter. The reason is that t-tests are not a single type of test, but a family of tests, with the different members of this family being applied in different circumstances. Thus, to give a reasonable introduction to t-tests and when each type applies, I'll need to walk you through multiple studies. There are three main types of t-tests that you are likely to encounter in the literature or want to apply in your own research—one-sample t-tests, independent samples t-tests, and paired samples t-tests. Illustrating these three types of t-tests forms the heart of the present chapter, and I'll present a primary case study concerning the possibility of unfelt pains for each. I begin in the next section by detailing Study 3 by Sytsma (2010a), using this to illustrate the basic research process and introducing the first use of t-tests in statistical analysis—one-sample t-tests.

1 Earlier work in this area focused on the shortcomings of the standard view, drawing out that a prominent thread in ordinary thinking about pains treated them as properties of body parts (e.g., Reuter et al., 2014, 2019; Kim et al., 2016; Reuter, 2017). More recent work in this area has focused more on simply understanding ordinary thinking about pain, often emphasizing that this is complicated and showing that people sometimes also treat pains as mental states. Disagreement remains on the extent of bodily versus mental aspects in people's thinking about pain and how they relate (e.g., Borg et al., 2020; Liu, 2020; Liu, 2023; Salomons et al., 2021; Coninx et al., 2023; Goldberg et al., Forthcoming).
2 See Sytsma (2014), Sytsma and Buckwalter (2016, Part II.C), and Phelan (2023) for introductions to this area of experimental philosophy.
3 For a short overview, see Machery and Sytsma (2011); for more extended reviews, see Sytsma (2010b, 2016) and Gonnerman (2018). For a few recent studies dealing with experimental philosophy of consciousness, see Díaz (2021), Fischer and Sytsma (2021), and Gregory et al. (2022).
4 Corpus linguistics collects and analyses preexisting "real world" data on the use of words (McEnery and Wilson, 2001; McCarthy and O'Keefe, 2010). Philosophers have increasingly called on such methods, ranging from simple web searches, to more balanced corpora, to sophisticated computational approaches. See Chap. 6 of this volume for an extended illustration, and Bluhm (2016), Sytsma et al. (2019), Caton (2020), Ulatowski et al. (2020), as well as Fischer and Sytsma (2023) for further examples and discussion. These tools are employed in assessing the standard view of pain in Reuter (2011) as well as Sytsma and Reuter (2017). See Sytsma and Fischer (2023) for a recent study applying corpus methods to a related issue in experimental philosophy of consciousness.

3.3 Illustrating One-Sample t-Tests

This section provides the most detailed case study in the chapter. The goal is to use the third study by Sytsma (2010a) to illustrate the first, and simplest, type of t-test that we'll discuss. To do this, I'll first introduce the general research process in Sect. 3.3.1. In Sect.
3.3.2, I'll discuss the philosophical background for our target study, including the first two studies by Sytsma (2010a), and connect this to the general background provided in the previous section. The remaining sections will then walk us through the study: Sect. 3.3.3 details the research question motivating our target study, Sect. 3.3.4 the research design, Sect. 3.3.5 the instrument used, Sect. 3.3.6 conducting the study, Sect. 3.3.7 the analysis of the results, and Sect. 3.3.8 the interpretation of the findings.

3.3.1 The Empirical Research Process

In a previous text with Jonathan Livengood (Sytsma and Livengood, 2015), we detailed a four-stage process for conducting empirical research in experimental philosophy:

1. Formulate a research question
2. Develop a plan to address your research question
3. Conduct the study laid out in your plan
4. Analyze the results of the study you conducted

There are multiple components to each of these stages, including that the plan stage involves determining the design for your study and constructing an instrument corresponding to this design. The resulting process corresponds closely with the plan for the present text, with the individual chapters aiming to illustrate how to develop a research question (Stage 1), construct a corresponding study design and instrument (Stage 2), conduct a study (Stage 3), and analyze and interpret the results of that study (Stage 4). This process is quite general and can be used for most empirical research, not just work in philosophy. The philosophical focus of the research, however, will shape how the process is applied. This is most clear with regard to the first stage. While discussions of developing a research question in the typical text on experimental methodology will start with formulating a hypothesis, experimental philosophers should start a step earlier: philosophy generally begins with formulating arguments, and x-phi is no exception to this rule.
In my opinion, the first step in developing a solid research question in experimental philosophy is to formulate a philosophical argument with an empirical premise, a premise that—with suitable clarification and specification—can be tested in your study (or studies). This clarification and specification converts your premise into a testable hypothesis. With this in hand, it is important to think through the general strategy you will employ in testing your hypothesis and analyzing the results. This involves thinking about the goals of your study and the type of claim your hypothesis makes. There are three basic types of claims you might make in your hypothesis—an estimation claim, a comparison claim, or a relation claim. Estimation claims are about putting a number on a feature of a population that you're interested in. For instance, we might want to estimate the percentage of people who agree with the claim that there can be no unfelt pains. Very often, for philosophical purposes, we're not so much interested in the exact number, though, but instead more concerned with how it sits relative to another number. The claim that common sense rules out the occurrence of unfelt pains, for example, plausibly entails that such a belief should be common—that it should be the majority belief—but not that exactly 72% (or 86%, or 92%, or whatever) of people will hold this belief. The claim that a majority of people hold a given belief is a comparison claim. In this case, it compares one number that we'll try to assess in our research to a fixed point (50%). In other cases, however, we'll want to compare two numbers that we'll try to assess. For instance, we might predict that the proportion of philosophers who believe that there cannot be unfelt pains is greater than the proportion of lay people who believe this.
Finally, relation claims are about how multiple features are associated or how one changes relative to the other. For example, rather than simply comparing the proportion of philosophers who deny unfelt pains to the proportion of lay people, we might want to assess the level of training in philosophy and belief in the possibility of unfelt pains. We might predict, for instance, that there will be an inverse relationship— that belief in unfelt pains will go down as training in philosophy goes up. I will focus on comparison claims in this chapter, as the statistical tools it introduces concern certain types of comparison claims: t-tests are applicable when we’re comparing a number we’ve assessed using a continuous measure, or a suitable approximation of it, either to a fixed point (one-sample t-tests) or to another such number (independent samples t-tests, paired samples t-tests, partially paired samples t-tests), and if some other assumptions hold. I’ll return to this below. For now, the key thing is that the type of claim your hypothesis makes will matter not just for the plan you develop for testing your hypothesis, but how you analyze your results after conducting the study. The next step in the research process concerns developing a plan to test your hypothesis, starting with formulating a design for your study. Deciding on a design involves a number of factors, including the specifics of the hypothesis you’re looking to test and the type of claim it makes (estimation, comparison, or relation), among others. In turn, the design you arrive at will specify a number of important details about your study, including the type of study it is (i.e., a true experiment versus a quasi-experiment or a descriptive study) and the variables you will be manipulating and measuring in your study. The type of study you conduct and the types of variables in it are connected. 
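The three t-test settings just listed can be made concrete with a small sketch. The chapter's own analyses are carried out in R; the following is a minimal illustration in Python using scipy, and all of the data here are made up for the example (the group names and numbers are hypothetical, not results from any of the studies discussed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One-sample t-test: compare the mean of one set of ratings
# (say, on a 7-point scale) to a fixed point, here the midpoint 4.
ratings = np.array([5, 6, 4, 7, 5, 6, 5, 4, 6, 5])
one_sample = stats.ttest_1samp(ratings, popmean=4)

# Independent samples t-test: compare two separate groups of
# participants (e.g., lay people vs. philosophers), simulated here.
group_a = rng.normal(loc=5.2, scale=1.0, size=30)
group_b = rng.normal(loc=4.5, scale=1.0, size=30)
independent = stats.ttest_ind(group_a, group_b)

# Paired samples t-test: the same participants measured twice
# (e.g., each participant rates two vignettes), simulated here.
first_rating = rng.normal(loc=5.0, scale=1.0, size=30)
second_rating = first_rating + rng.normal(loc=0.4, scale=0.5, size=30)
paired = stats.ttest_rel(first_rating, second_rating)

print(one_sample.pvalue, independent.pvalue, paired.pvalue)
```

In each case the test returns a t-statistic and a p-value; which of the three functions applies depends on whether you are comparing a sample mean to a fixed point, comparing two separate groups, or comparing two measurements of the same participants.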
There are two basic types of variables—the things that are varied in your study (known as independent variables or predictor variables) and the things that are measured (dependent variables or response variables). Every study will have at least one response variable. In the studies we'll be looking at, these correspond with the test questions that we ask participants. Not every study will have predictor variables, however: in some studies there is just one condition—nothing is varied and every participant gets treated the same. These are descriptive studies. The case study we'll focus on in this section is an example of such a study. In other studies, something is varied, either by nature (quasi-experiments) or by the researcher (true experiments). Studies looking at demographic differences in philosophical intuitions, such as Machery et al. (2015), are a classic example of the former. The study by Sytsma and Snater (2023a) discussed in Sect. 3.5 is a good example of the latter. While a general theoretical exploration of these differences and how they inform study design is beyond the scope of the present text—this requires quite lengthy texts on their own to cover—the process will be illustrated through the array of case studies presented, including those given in this chapter.5 For studies involving the types of questionnaire methods we'll be focusing on here—i.e., studies where you're asking participants to answer one or more questions—the next step will be to construct the instrument you will use. Again, questionnaire design is a large topic and one that we will largely illustrate via examples.6 A few preparatory remarks will help with understanding the process, however. Most often in experimental philosophy, questionnaires center on presenting participants with a short framing text, or vignette, followed by one or more questions about that text, often employing a fixed scale as we'll illustrate below.
But these instruments almost always comprise more than this, generally also including a consent form that introduces the researchers and project, instructions that guide participants in completing the questionnaire, and various demographic questions (e.g., asking for the participant's age, gender, and so on). You might also want to include check questions to test that participants are putting in sufficient effort (attention checks) and/or understand what is going on in the questionnaire (comprehension checks), as illustrated in Sect. 3.4. To construct an effective questionnaire, I recommend thinking of it as a conversation between the researchers and the participants—a conversation that is shaped, in part, by each of the components of the questionnaire and how you phrase them. The key point here is to be on the lookout for how your questionnaire might lead the conversation astray, potentially generating responses from participants that don't actually reflect the judgments you wanted to measure. This is not an easy task, however, and several potential pitfalls will be illustrated in the case studies presented below.

5 For a more extended introduction to study design directed at experimental philosophers, see Chaps. 7 and 8 of Sytsma and Livengood (2015). For an excellent introduction to research methods and design in psychology, see Goodwin and Goodwin (2016). For a more advanced treatment, see Shadish et al. (2001).
6 See Chap. 11 of Sytsma and Livengood (2015) for a more extended discussion. See Sudman et al. (1996) and Schuman and Presser (1996) for excellent book-length treatments.

Once you've designed your study and constructed the instrument, the next step is to carry it out. This involves getting ethics approval, piloting and refining the instrument, and determining how you'll recruit participants. Piloting is basically
taking your study out for a test run: conducting a preliminary version on a small number of participants. Often this might involve using a modified version of your instrument, typically including additional open-ended questions that ask participants to explain their answers to the central questions you're interested in. The goal here is to identify problems with your study design and conversational pitfalls in your instrument before committing full resources to the study. Pilot testing is an important part of good research practice, but it is often overlooked by new practitioners. If your pilot study reveals issues with your design or instrument, you'll want to make modifications and then pilot again before finally running your full study. Once you've run the study, you'll need to analyze and interpret the results. I'll illustrate this in the examples below, but first, it is important to say a little bit about why we perform a statistical analysis in the first place. It might seem that all that is needed at this stage is to describe the basic details of your results, perhaps simply noting how participants responded to the questions you asked. But statistical analysis goes beyond giving such basic details: it involves drawing inferences on the basis of your data and offering a justification for those inferences. In conducting empirical research, our goal is to use observation to answer questions about the world. But many of the questions we want to answer cannot be adequately addressed through simple observation, either because we cannot exhaustively observe what we're interested in or because we're interested in something that cannot be straightforwardly observed (or both). For instance, suppose we want to know something relatively straightforward, such as whether a majority of people believe that unfelt pains are impossible.
People's beliefs are not something that we can directly observe (as of yet) and figuring out how to assess belief with regard to an abstract question like the possibility of unfelt pains is no easy task. Setting this aside, however, we're still left with the issue that we aren't in a position to survey all people. Heck, we generally won't be in a position to interact with more than an extremely small fraction of the population we're interested in! So how do we answer our original question, moving from the relatively small number of participants we received responses from to conclusions about the wider population they're part of? We employ statistical inference. The goal of statistical inference is to make an educated guess about things that we have not yet observed on the basis of things that we have observed. More technically, we infer something about features of a population (what we call "parameters") from observations of corresponding features of a sample drawn from that population (what we call "statistics"). In doing so, we reason that since most samples drawn from a population will have features that are similar to the features of the population they're drawn from, and since similarity is symmetric, we should expect the population to have similar features to the sample. Of course, the population of interest is unlikely to have exactly the same features as any given sample. As such, we wouldn't be justified in simply asserting that the population has the same features as the sample. What we could reasonably say, however, is that the features of the population (the parameters) are probably and approximately the same as the features of the sample (the statistics). In other words, we infer parameters from statistics, while recognizing the hedge that this is only probably and approximately the case.
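The inference from sample statistics to population parameters can be sketched with a small simulation. This is purely illustrative, and every number in it is invented for the example: we build a "population" in which 60% of people hold some belief, draw a sample, and see that the sample proportion estimates the population proportion probably and approximately:

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical population of 100,000 people, 60% of whom hold a belief.
population = rng.random(100_000) < 0.60

# In practice we can only observe a small sample drawn from the population.
sample = rng.choice(population, size=200, replace=False)

# The sample statistic (the proportion holding the belief) estimates
# the population parameter (the true proportion, here 0.60)...
p_hat = sample.mean()

# ...probably and approximately: a rough 95% interval for the parameter,
# via the normal approximation to the sampling distribution.
se = (p_hat * (1 - p_hat) / len(sample)) ** 0.5
interval = (p_hat - 1.96 * se, p_hat + 1.96 * se)

print(p_hat, interval)
```

The sample proportion will rarely equal 0.60 exactly, but the interval makes the "probably and approximately" explicit: it quantifies how far from the statistic the parameter is likely to be.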
A key part of our statistical analysis—including using tools like t-tests—is then to flesh out this "probably and approximately." Let's see how this all works by considering a preliminary example drawn from Sytsma (2010a), which provides important background for the case study that we'll walk through in the remainder of this section.

3.3.2 Background

My work on the commonsense conception of pain grew out of more general work in experimental philosophy of mind investigating whether nonphilosophers tend to have a concept that is suitably similar to the philosophical concept of phenomenal consciousness introduced above. I note that it is common for philosophers of mind to make assumptions about folk psychology—assumptions about our ordinary, pre-theoretical thinking about the mind—in discussions of phenomenal consciousness (Sytsma, 2010a). This includes that both realists (e.g., Chalmers, 1995) and skeptics (e.g., Dennett, 1991) about phenomenal consciousness take the concept to be a part of folk psychology, assuming that the existence of qualia is pre-theoretically obvious. Claims about folk psychology are empirical claims, however, and it is quite possible for scholarly training to skew our perspective on ordinary, pre-theoretical thinking. Focusing on whether lay people tend to have a concept of phenomenal consciousness, I noted that empirical work was beginning to be done on the question, and I surveyed conflicting findings from Knobe and Prinz (2008), who argue in favor of the claim, and Sytsma and Machery (2010), who argue against it.7 Sytsma and Machery (2010) present evidence that lay people, in contrast with philosophers, treat two prototypical examples of supposed phenomenally conscious mental states—seeing red and feeling pain—quite differently.
In our main study, we gave participants either a description of a normal human or a simple non-humanoid robot performing behaviorally analogous tasks expected to elicit attributions of one or the other of these mental states for the human, then asked the participants whether the entity (human or robot) had the mental state at issue (saw red, felt pain). We found that while philosophers tended to treat both states similarly, denying that the robot either saw red or felt pain, lay people tended to treat them differently, denying that the robot felt pain but affirming that it saw red.8 Based on these results, we argued that if lay people were employing the concept of phenomenal consciousness in responding to these questions, then they should have treated the two states similarly, just as the philosophers did. But they did not. We took this to suggest that the lay participants were not generally employing the concept of phenomenal consciousness.

7 A number of criticisms have been raised against each of these works. For responses to Knobe and Prinz (2008), see Sytsma and Machery (2009), Huebner (2010), Arico (2010), Strickland and Suben (2012), and Phelan et al. (2013). The most prominent criticism of Sytsma and Machery (2010) has been the ambiguity objection (Sytsma, 2016), which has been raised by Huebner (2010), Peressini (2013), Fiala et al. (2012), and Chalmers (2018). For recent responses, see Sytsma and Ozdemir (2019), Ozdemir (2022), and Sytsma (n.d.). Objections to our explanation of our results have been raised by Talbot (2012), Buckwalter and Phelan (2013), as well as the studies discussed below suggesting that participants' responses reflect that they tend to hold a naïve view of colors and pains.
Building on the arguments given in Sytsma (2009, 2010c), in my (2010a) I further explored one explanation for the pattern of findings for lay people in Sytsma and Machery (2010). I hypothesized that this pattern reflects that lay people tend to hold a naïve view of both colors and pains: rather than treating colors or pains as qualities of mental states, they conceive of them as qualities of objects outside of the mind/brain. Focusing on pain, the idea is that people tend to deny that the simple robot in our study feels pain because they conceive of pains as being instantiated in injured body parts, but doubt that the robot has the right sort of body parts to support pains. On this view, while soft and fleshy body parts can instantiate pains, hard and metallic body parts cannot. This hypothesis not only explains the pattern of results from Sytsma and Machery (2010), but also tells against the claim that common sense supports the standard view of pain among philosophers. In the first study of Sytsma (2010a), I asked participants a set of questions about how they understand colors. The results were consistent with lay people tending to hold a naïve view, with a majority of participants answering that colors are properties of external objects, denying that they are mental or mind-dependent, and denying that spectrum inversion is possible (as we would expect if they treated colors as belonging to the objects seen rather than to perceivers). Study 2 extended these findings to pains, in addition to colors. Results were comparable, with a majority of participants seemingly embracing a naïve view for both colors and pains, treating these as properties of things outside the mind/brain and denying that they are mental or mind-dependent.
Further, this study included a question about unfelt pain: "Do you think that there is still pain in a badly injured leg even when the person is not aware of it?" I hypothesized that if people tend to hold a naïve view, taking pains to be properties of injured body parts, then the presence (or absence) of the pain would not depend on whether the person felt that pain. And, indeed, a majority of participants answered this question affirmatively, suggesting that they hold that unfelt pains are possible. The next two studies explored this finding further. Let's walk through the process of designing, conducting, and analyzing Study 3 from this chapter in more detail.

[8] In general, we shouldn't accept a conclusion based on just one set of results. Rather, our credence in the results should be tempered and should rise as they are replicated—as similar results are found in subsequent studies, especially studies by other researchers. This includes both exact and approximate replications, which attempt to repeat a study as closely as possible, and conceptual replications, which test the same hypothesis in another way (see Cova et al., 2019). The key result from Sytsma and Machery (2010) has been replicated a number of times, including in Sytsma and Machery (2012), Sytsma (2012), Sytsma (2013), Sytsma and Ozdemir (2019), Cova et al. (2019), and Ozdemir (2022).

J. Sytsma

3.3.3 Research Question

Recall the research process laid out above. The first step is to formulate a research question. I indicated that for research in experimental philosophy, it is best to start by formulating an argument with an empirical premise. We've just surveyed the broader dialectic that Sytsma (2010a) fits into. Focusing on just Study 3, however, we can lay out a rather straightforward argument. We've seen that according to the standard view of pain in philosophy, there can be no unfelt pains, and this is often supported by appeal to the (supposed) dictates of common sense.
If common sense allows for unfelt pains, however, then the standard view would not enjoy this support. The key empirical premise here is that common sense allows for unfelt pain. This premise requires some clarification and specification before we can test it, however. Most importantly, we need to determine what such claims about common sense amount to. As detailed in Sect. 3.1, claims in the literature that common sense precludes unfelt pains are often laid out in terms of our intuitions about hypothetical cases involving someone being injured but not feeling pain, with the suggestion that people will generally have the intuition that in such cases there is no pain. This suggests a general strategy for testing: Give participants a vignette describing such a case and then ask them whether the injured person had a pain despite their not feeling it. Since we're not varying the vignette or the question, this would be a descriptive study: it doesn't have any predictor variables and has just one response variable (the question we ask about whether the case involves an unfelt pain). Our predictions for this study concern the response variable: The standard view predicts that the majority of participants will give a negative response to this question; in contrast, if people tend to hold a naïve view of pain, then we would expect the opposite—that the majority will give an affirmative response. Each of these predictions makes a comparison claim: in making the prediction that the majority will give an affirmative response, for example, we're predicting that the proportion of affirmative answers will be greater than 50%.

3.3.4 Research Design

Having developed a research question, the next step is to put together a research plan, starting with deciding on a study design. The hypothesis formed in the first stage will help guide the design of your study.
In the present case, my hypothesis concerns how people will tend to respond to a question about a simple scenario: Will people tend to judge that an injured person has a pain even if they don't feel it? As such, there is no need to compare responses between samples drawn from different populations (as in a quasi-experiment) or between participants assigned to different conditions (as in a true experiment). As we just noted, a descriptive study is sufficient here. This means that there is no need to worry about predictor variables (independent variables) for the present study, and given that it involves just one question of interest, the study can be restricted to a single response variable (dependent variable). In other words, for this study, I simply want to give participants a description of a scenario about an injured person who doesn't feel pain and ask them a question about whether that person has a pain. As such, I just need to develop a single instrument that each participant will receive.

3.3.5 Constructing the Instrument

As noted above, a full instrument will typically include an introduction, instructions to participants, and demographic questions, in addition to the philosophical probe that we're most concerned with—the vignette and questions that test our hypothesis. The full instrument I developed is available in the supplemental materials as 2_s_2010a_study_3.pdf.[9] Here, I'll focus on the philosophical probe. For this study, I chose to describe a common scenario—an injured person being distracted and not noticing a pain. My aim was to describe this rather directly, keeping the text to a minimum. This has the potential benefit of maintaining participants' attention (which can wane with longer probes or questionnaires) and avoiding extra verbiage that might bias their responses one way or another.
After the scenario, I then wanted to ask participants whether the subject of the story still had the pain or whether there was no pain during this period. At a first pass, the response options here would appear to be binary: either there is a pain or there isn't. But I wanted to allow participants to register that they weren't sure about the answer, as well as the degree of certainty in a response. As such, I asked participants to answer using the partially anchored 7-point scale shown below (Fig. 3.1). Here the end points are anchored with text descriptions, as is the midpoint of the scale. Of course, a number of alternative decisions could have been made in designing this study, including the vignette used, the framing of the question, and the response options. I will return to some of these choices below in detailing subsequent studies.

[9] Standardly, this should include questions about the participants' age and gender. (If I were to run this study now, I would include an option for "nonbinary" in the gender question, and I encourage you to do the same in your studies.) Given the concern with common sense, I also wanted to check whether participants had training in philosophy or in areas where they might have been taught scientific accounts of nociception. Finally, given concerns about the relevant population for the claims about the commonsense conception of pain, discussed below, I asked about participants' native language.

Fig. 3.1 Vignette and scale for Study 3 by Sytsma (2010a, 124)

3.3.6 Conducting the Study

The next step in the research process is to conduct the study. As noted above, this involves getting ethics approval and, especially for more involved studies, piloting them. Given the simple nature of the present study, I didn't formally pilot it, but rather asked friends and colleagues for their input. In conducting a study, a key decision point is figuring out how you'll recruit participants.
This involves determining what the relevant population is and figuring out how to obtain a sample from that population. Often, the relevant population will be somewhat unclear. When philosophers claim that common sense tells us that there can be no unfelt pains, it is plausible that this is thought to be something common to all people with a concept of pain. But we might also suspect that at least some aspects of common sense will be culturally variable. Further, we've seen that claims about the commonsense understanding of pain are often coupled with claims about how we talk about pain. And, of course, pain language will vary between groups of language users. As such, in all but the last of the studies I'll detail in this chapter, the target population is narrower, being composed of English speakers in North America. A question, therefore, remains about how widely these results generalize, although some cross-cultural work on ordinary conceptions of pain has been done (e.g., Kim et al., 2016; Sytsma and Reuter, 2017). Even for a narrower population like English speakers in North America, however, it is impossible to exhaustively survey members of the population, and we'll need to content ourselves with testing just a small sample of the larger population. Ideally, we would sample randomly from the population, such that each individual was equally likely to be chosen for our sample. In practice, however, researchers are seldom, if ever, in a position to solicit a truly random sample: we simply don't have equal access to each member of a population. Instead, we do what we can, aiming to use a recruitment method that we hope will produce a reasonably unbiased sample with regard to our research question. Typically, this will involve convenience sampling: we sample from the people we have access to—those people who are convenient for the researcher.
While this is nonideal, concerns can be at least partially alleviated by using different recruitment methods, as I illustrate below. For the present study, I used a participant pool that was convenient to me—students in introductory classes at the university I was attending. To do this, I talked with the instructors for two courses I had not previously surveyed and got permission to administer the questionnaire at the start of one of their classes. The instrument was printed out on paper and handed out to students.[10] Using classroom samples like this means that there will be some variability in how many participants complete your questionnaire, based on how many students attend class that day and choose to complete the survey. In the present case, this generated 54 responses (excluding one person who took the survey in both classes). Ideally, however, we would first estimate the sample size that we need. We'll return to this process in Sect. 3.4.2 after we've finished with this first case study. One disadvantage of in-class studies, as opposed to studies conducted online, is that after collecting the completed questionnaires, you'll need to enter the responses into a digital form. I did this by hand, looking through each questionnaire and entering the responses for each in a row of a spreadsheet. (A reduced version of this spreadsheet is available in 2_s_2010a_study_3.csv, which removes unnecessary demographic details to further protect anonymity.) Entering data by hand can be a slow process, especially for large studies, and it potentially creates an extra source of human error. This is one reason that many now prefer to use online samples. Classroom samples will also tend to be less varied than the larger population in a number of ways, including that they will tend to be younger (the average age of my sample was 19.6 years) and more likely to have education in relevant areas.
Indeed, I found that five participants had more than minimal training in philosophy or psychology.[11] The responses of these participants were removed based on criteria specified prior to running the study. Ideally, such criteria, as well as other important details of your studies (e.g., predictions and plans for statistical analysis), will be registered before conducting your study. Such pre-registration is now considered best practice and is becoming increasingly common. This can be done using websites like https://osf.io/, which also provides a repository for materials and data.

[10] It is more common today to use online samples, as will be illustrated in the studies described in subsequent sections. To do this, you'll need to create a Web-based version of the instrument. Most often, this is done using survey software such as Qualtrics or LimeSurvey, as described in Chap. 2.

[11] Participants were counted as having more than minimal training in philosophy or psychology if they indicated that they had completed some graduate work in philosophy or psychology, had completed an undergraduate degree with a major in philosophy or psychology, or were completing an undergraduate degree with a major in philosophy or psychology.

3.3.7 Analysis

Basic data for Study 3 by Sytsma (2010a) is available in the supplemental materials as a comma-delimited spreadsheet—2_s_2010a_study_3.csv—with responses to the main test question shown in the column labeled "RESPONSE." To explore this data, I'll use the free statistical software package R, which you can download from http://cran.r-project.org. We'll walk through the very basics of what you need to start using R to run t-tests here in this chapter. That said, it is important to note that R is a full-featured programming environment, such that it would be impossible to give a detailed overview of how to use it here. Fortunately, there are many excellent resources for using R.
For a brief introduction, you might start with Chap. 10 of my text—Sytsma and Livengood (2015)—which also includes pointers to other references for learning to use R. To help with learning R, the code used for each analysis in this chapter is available in the supplemental materials. For the present study, this is provided in 2_s_2010a_study_3.txt. This file can be read using any standard text editor, and I'll reproduce it piece by piece below as we walk through the analysis. Once you've installed R using the link above, simply open the txt file and you can follow along with the analysis by either copying and pasting the relevant lines from the file or by typing the text in at the command prompt in the R console window (the red ">"), as shown in Fig. 3.2.

Fig. 3.2 R console window with a standard first program when learning a new programming language (printing "Hello World!")

3.3.7.1 Getting Started and Loading the Data

Looking at the file, the first thing you'll notice is that it starts with the following two lines of text:

    #install.packages("lsr")
    library(lsr)

Together these lines of text will be used to install the lsr package and then load it from the corresponding library. A package is basically a collection of functions (and usually other stuff such as data sets) that someone has built and documented for use in R. Generally, you can find information about a given package on the CRAN website noted above.[12] A library is simply where the package is stored once it is installed. We'll return to the lsr package below when we use the cohensD() function from it.[13] Functions will be distinguished by name followed by a pair of parentheses and are used to tell R to do some specific thing, such as to install a package (i.e., the install.packages() function) or load an installed package from a library (i.e., the library() function).
Most of the time, however, we need to give R more guidance about what we want it to do. For example, we'll need to tell it which package we want it to install or load. We do this with arguments, which are specific bits of text that go inside the parentheses, such as "lsr" in install.packages("lsr") and lsr in library(lsr). In each of these cases, just a single argument is supplied to the function, but as we'll see below, we often want to tell a function multiple things. In such cases, we'll do this by supplying multiple arguments, which will be separated by commas within the parentheses (such as cohensD(D3$RESPONSE, mu=4), which we'll discuss below). So, each of the two lines of text at the start of our file calls a function, supplying R with a single argument for that function. If you simply paste these two lines into R, however, you'll probably get the following error:

    Error in library(lsr) : there is no package called 'lsr'

This is because you need to install the package before you can load it into R . . . and the line of code that does this is commented out. Basically, the # mark tells R that the text following it is just a comment, not something it needs to pay attention to. So when you enter the first line of text from the file—#install.packages("lsr")—R will just ignore the function call and won't install the lsr package. I've added the # mark here because you'll only need to install the package once, so after the first time you run the code, you'll want R to ignore this line.

[12] For the lsr package, this is available at https://cran.r-project.org/web/packages/lsr/.

[13] Details about many functions in R can be found through the RDocumentation website, including for the cohensD() function: https://www.rdocumentation.org/packages/lsr/versions/0.5.2/topics/cohensD.
Try entering the first two lines of text again, but this time exclude the # mark from the first line:

    install.packages("lsr")
    library(lsr)

The first line will pop up a window asking you to select a "mirror," that is, a place to download the package from. You can simply leave it on the default location and select "OK" or else choose a location that is close to you. R should now install the lsr package for you. Once that is done, you should be able to enter the library(lsr) command without error. This will load the package so we can use it later. The next thing you'll see in the script are three lines starting with the # mark:

    #########################################
    # Load Data for Sytsma (2010a), Study 3 #
    #########################################

This is another comment for the user—it tells the person reading the code what is going on and tells R to ignore it. Here it tells us that the next line in the code will load data from the study we'll be looking at. Specifically, it uses the read.table() function to load the comma-delimited spreadsheet noted above—2_s_2010a_study_3.csv—from my desktop:

    D3 = read.table("C:/USERS/jmsyt/Desktop/2_s_2010a_study_3.csv", header=TRUE, sep=",")

To use this yourself, you'll need to edit the path to point to where the file is on your own computer (i.e., change the bit that reads C:/USERS/jmsyt/Desktop/ to indicate where the file is on your computer, which can be found by right-clicking the file and selecting "Properties" on a typical Windows PC or "Get Info" on a typical Apple computer). Once you run this line, R will read the table from the spreadsheet and copy it into the variable D3. After loading the data, if you simply enter the variable name into R—if you type D3 and hit enter—it should display the data from the table, as shown in Fig. 3.3.

3.3.7.2 Basic Visualization and Statistics

Having read in this data and stored it in D3, we can now do any number of things with it in R.
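If you would rather not edit the file path by hand, one alternative (a sketch of my own, not part of the original analysis script) is to let R open a file-selection dialog. Here read.csv() is just read.table() with header=TRUE and sep="," set by default, and file.choose() prompts you to pick the file interactively:

```r
# Not part of the original script: an alternative way to load the data
# without typing out the file path. file.choose() opens a dialog for
# selecting 2_s_2010a_study_3.csv, and read.csv() supplies header=TRUE
# and sep="," automatically.
D3 <- read.csv(file.choose())
```

Either way, the result is the same table stored in the variable D3.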
We’ll begin by using it to provide some basic details about the data, starting with looking at how many participants selected each response option (1 to 7) for the probe question. I began by generating a histogram using the hist() function in R. This gives a visual display of the number of participants selecting each answer choice. To do this, I supplied two arguments to the function: 10 hist(D3$RESPONSE, breaks=BreakValues) The first argument tells R that we want to use the values for RESPONSE from D3 (the $ symbol telling R to use that column from the table), while the second provides 3 Quantitative Vignette Studies: t-Tests—Case Studies on Judgments About. . . 107 Fig. 3.3 Displaying data by Sytsma (2010a), Study 3, in R a list of values for how to divide up the lines on the histogram, which was provided using the c() function (the concatenate function) in the previous line. This tells R to generate a histogram for the responses that centers the bars on the whole numbers from 1 to 7. To make this still more informative, I then added a dotted line to the histogram centered on the mean—the average response to the question—using the following command, which uses the lines() function, with calls to the c() and mean() functions in the arguments: 11 12 lines(x=c(mean(D3$RESPONSE),mean(D3$RESPONSE)), type="l", col="red", lty="dashed") y=c(0,18), Finally, I used the nrow() function, which counts th e number of rows in our table that meet a certain criteria, to give an exact count for each response option: 13 14 15 16 17 18 19 nrow( D3[ D3$RESPONSE == 1, ] ) nrow( D3[ D3$RESPONSE == 2, ] ) nrow( D3[ D3$RESPONSE == 3, ] ) nrow( D3[ D3$RESPONSE == 4, ] ) nrow( D3[ D3$RESPONSE == 5, ] ) nrow( D3[ D3$RESPONSE == 6, ] ) nrow( D3[ D3$RESPONSE == 7, ] ) Specifically, I had R count the number of rows in the table stored in D3 where RESPONSE was equal to 1, the number of rows where RESPONSE was equal to 2, and so on. 
The results have been added to the txt file after each command using the comment mark (#), as seen in Fig. 3.4, which shows the console output and the histogram for this block of code. My next step was to calculate a few basic statistics for the responses to the probe question, including the mean and the standard deviation, which tells us about how the responses tend to spread out around that average, as well as for the basic demographic questions I included in the sample data file:

    mean(D3$RESPONSE) # 2.57
    sd(D3$RESPONSE)   # 1.67
    nrow(D3)          # N=49
    mean(D3$AGE)      # 19.6
    min(D3$AGE)       # 18
    max(D3$AGE)       # 43
    mean(D3$GENDER)   # 61.2% women

Fig. 3.4 R output for initial script for analysis of Study 3 by Sytsma (2010a)

From the histogram, we can readily see that a large majority of the responses to the test question were below the midpoint of 4, with 41 of 49 participants (83.7%) giving a response of 1, 2, or 3 on the 7-point scale. In other words, most of the participants leaned toward answering that the injured person in the vignette was "clearly in pain, but not feeling it." Not surprisingly, the mean response (M = 2.57) is also below the midpoint. This is in line with the prediction I made. As noted above, however, my concern is not specifically with this set of 49 people. Rather, the goal is to say something about the population of interest, and I'm simply using the responses of these participants as an imperfect guide to what the larger population is like in this regard. As such, I need to do some statistical inference.

3.3.7.3 Null Hypothesis Significance Testing and the Basics of t-Tests

There are different ways of doing statistical inference.
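As an aside, the per-option counts above can also be produced in a single step. The following is an illustrative sketch (not from the original analysis script), assuming the data have been loaded into D3 as above: table() tallies every response option at once, and taking the mean of a logical vector gives the proportion of responses below the midpoint.

```r
# Illustrative alternative to the seven nrow() calls above (not part of
# the original script); assumes D3 has been loaded as described earlier.
table(D3$RESPONSE)      # counts for each response option, 1 through 7
mean(D3$RESPONSE < 4)   # proportion below the midpoint; 41/49, about 0.837, per the counts reported above
```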
The most common is null hypothesis significance testing (NHST).[14] In NHST, we specify a null hypothesis that corresponds with finding no effect, then test how likely it is that we would have gotten data at least as extreme as the responses observed if the null hypothesis were true (and if other test assumptions hold). If this is suitably unlikely, then we reject the null hypothesis and take the results to be in line with our hypothesis. In the present study, my hypothesis was that a majority of the population holds that unfelt pains are possible and so would respond that the subject of the vignette had a pain. One way of specifying this hypothesis (but not the only way) is to predict that the mean response would be below the midpoint on the scale. The relevant null hypothesis for this prediction is, then, that the mean response is at or above the midpoint on the scale. Clearly, the mean response for my sample was below the midpoint. But this might happen just by dumb luck even if a majority of the people in the population believe that unfelt pains are impossible. We use a statistical test to put bounds on how likely this is. More carefully, we calculate a conditional probability for getting data at least as extreme (relative to the null hypothesis) as the data we actually got if the null hypothesis is true (and if various test assumptions hold). This conditional probability is called a p-value, and if it is sufficiently small, we reject the null hypothesis. What counts as sufficiently small depends on the significance level specified, with 0.05 being conventional.[15] The statistical tests I'll be focusing on here are t-tests. They are among the simplest and most commonly used statistical tests. Despite the way they are sometimes discussed, t-tests are actually a family of related procedures, including one-sample t-tests, independent samples t-tests, and paired samples t-tests.
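To make the mechanics concrete, here is a small sketch (my own illustration, not part of the original analysis script) that computes the one-sample t statistic directly from the summary statistics reported above (M = 2.57, SD = 1.67, N = 49), rather than from the raw data. The t statistic is the distance between the sample mean and the comparison point, measured in estimated standard errors, and pt() then gives the lower-tail probability from the t-distribution:

```r
# Sketch: a one-sample t statistic computed "by hand" from the reported
# summary statistics, to show what a one-sample t-test does under the hood.
m  <- 2.571429  # sample mean, as reported above
s  <- 1.67      # sample standard deviation, as reported above
n  <- 49        # sample size
mu <- 4         # the fixed comparison point (the scale midpoint)

se     <- s / sqrt(n)             # estimated standard error of the mean
t_stat <- (m - mu) / se           # roughly -5.99, matching the reported t-value
p_one  <- pt(t_stat, df = n - 1)  # one-sided (lower-tail) p-value, well below 0.001
```

The built-in t.test() function, used below, performs exactly this calculation from the raw responses.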
What unites all of these tests is that the reference distribution for the comparison is a t-distribution. We needn't worry too much about what this means here, but the t-distribution is closely related to the normal distribution (a standard bell curve). Indeed, the difference between the distributions becomes negligible as the degrees of freedom increase, where this is related to the sample size.[16] What is most important, for present purposes, is that this generates a key assumption that is at play when we use t-tests—that the feature we're interested in is normally distributed in the population. This means that if we plotted the histogram for the entire population, as we did above for our sample, the resulting histogram would approximate a bell curve. Of course, this isn't something that we'll typically know about our population.

[14] One increasingly popular alternative is the use of Bayesian statistics, although this remains rare in experimental philosophy. A discussion of Bayesian statistics is well beyond the scope of the present chapter, but see Sect. 3.3 of Sytsma and Livengood (2015) for a Bayesian alternative to using t-tests, and see Albert (2009) for a more general introduction to Bayesian statistics in R.

[15] Sometimes a significance level of 0.01 is used instead. See Benjamin et al. (2018) for an argument that we should lower this still further to 0.005.

[16] You can demonstrate this for yourself in R using the dnorm() and dt() functions to plot a normal curve and t-distributions, respectively. For example, the following code will compare the normal distribution to the t-distribution with degrees of freedom of 1, 5, 10, and 20:

    curve(dt(x, df=20), from=-5, to=5, col="green", main="Distribution Comparison", ylab="Density")
    curve(dt(x, df=10), from=-5, to=5, col="orange", add=TRUE)
    curve(dt(x, df=5), from=-5, to=5, col="red", add=TRUE)
    curve(dt(x, df=1), from=-5, to=5, col="purple", add=TRUE)
    curve(dnorm(x), from=-5, to=5, col="black", add=TRUE)
Nonetheless, the data from our sample can give us an indication of whether the assumption is warranted, following the same logic as above (i.e., that features of the population are probably and approximately the same as features of the sample). The histogram for the present study, however, should give us some pause with regard to this assumption, since it doesn't obviously approximate a bell curve. A second reason for pause is that t-tests assume that our data are on an interval scale, such that they can be taken to approximate a continuous distribution. Interval scale means that the distance between a response of 2 and 3 on our scale, for example, is the same as between a response of 3 and 4, as opposed to these responses simply being rank ordered, as in finishing places in a race. The assumption of interval data is controversial for scales like those most commonly used in experimental philosophy, including the scale used in the present study. Nonetheless, while there are often reasons to doubt each of these two key assumptions (normal distribution and interval data), in practice t-tests are rather robust. As we proceed, though, I'll briefly detail alternative tests that do not make these assumptions. As we'll see, they lead to comparable conclusions in these case studies, and in my experience, this is very often the case (hence the robustness).

3.3.7.4 Directionality and Conducting a One-Sample t-Test

Running t-tests is very easy in R. In fact, we can use the same function—the t.test() function[17]—for each of the three types of tests that we'll be focusing on in this chapter. (For the fourth type of t-test I noted above—partially paired samples t-tests, discussed in Box 3.1—we'll need to use a different function, but such tests are quite rare and aren't likely to be something you'll need to worry about.)
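As a quick aside before we turn to t.test(): one standard example of the kind of alternative test just mentioned is the one-sample Wilcoxon signed-rank test, which drops both the normality and the interval-scale assumptions, comparing the responses to the midpoint using only their rank order. This is a sketch of my own for illustration (the chapter details its preferred alternatives later), assuming D3 has been loaded as above:

```r
# Sketch of a nonparametric alternative to the one-sample t-test (not part
# of the original analysis script); assumes D3 has been loaded as above.
# wilcox.test() is in base R's stats package, so no extra package is needed.
wilcox.test(D3$RESPONSE, mu = 4, alternative = "less")
```

With tied responses on a 7-point scale, R will compute an approximate p-value and warn you that an exact one isn't possible; that warning is harmless here.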
Recall from above that we said that t-tests are applicable when we're comparing a number we've assessed using a (suitable approximation of a) continuous measure to either a fixed point or to another such number, and if other assumptions hold, like those we just discussed. As such, we'll need to tell the t.test() function the two things we're comparing—within the parentheses, we'll need to supply one argument pointing R to the data for the first number and a second argument for what that number is being compared to. If we're comparing to a fixed point, then the appropriate t-test will be a one-sample t-test, and we'll simply provide the number for that point as the second argument in the t.test() function. If we're comparing to another number we've assessed, however, then we'll want to run one of the remaining tests, which apply to comparisons between samples and vary based on the relationship between those samples—whether they are two different samples (independent samples), the same samples (paired samples), or a mixture of the two (partially paired samples). What type of comparison are we interested in for the present study? Recall from our previous discussion that the null hypothesis for Study 3 by Sytsma (2010a) is that the mean response to the test question will be at or above the midpoint on the 7-point scale (i.e., compared to 4). Of course, the midpoint on the scale is a fixed point, not a second number that we've assessed by collecting additional data. As such, we will want to use a one-sample t-test.

[17] Once again, details for functions in R can be found through the RDocumentation website, including for the t.test() function: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/t.test.
This is done in the first line of code in the next section of our txt document:

27 #####################
28 # Statistical Tests #
29 #####################
30
31 t.test(D3$RESPONSE, mu=4, alternative="less") # t(48)=-5.99, p<.001

Here the mu=4 argument is telling R that the relevant comparison for our response data (D3$RESPONSE) is to the fixed point of 4.¹⁸ What about the third argument in our function call (alternative="less")? This specifies the direction of the test. Specifically, we could have been predicting any of three types of relationship here—either that the mean response would be less than 4, that it would be greater than 4, or that it would simply be different from 4, such that the test is concerned with both sides (either less than or greater than 4). These are indicated by supplying the argument alternative="less", alternative="greater", or alternative="two.sided", respectively (note that if you leave out this argument, it will default to a two-sided test). I chose to do a one-sided test in this case because I had a directional hypothesis—I predicted that the mean would be less than 4—and a corresponding null hypothesis: the null hypothesis does not simply state that the population mean is the midpoint, such that we could reject the null hypothesis if the sample mean was suitably above or below 4, but that the mean is at or above the midpoint. This means that we can only reject the null hypothesis if the sample mean is suitably below the midpoint. As such, we can specify that the alternative hypothesis is that the mean is less than mu=4 by adding the argument alternative="less" to the function call. In my experience, when you use a one-sample t-test, it is quite likely that you’ll have a directional hypothesis. But it might be that you simply predict that the mean is different from a specified value, or that you opt for a two-tailed test to make the test more conservative (it is easier to get a significant result using a one-tailed test than a two-tailed test).
Indeed, it is common for researchers to report two-tailed tests even when they had a directional hypothesis. If this is desired, however, my preference would instead be to use a one-tailed test with a more stringent choice of significance level. When we run our t-test, as specified above, R will provide an output that gives us a good bit of information about our statistical test:

One Sample t-test

data: D3$RESPONSE
t = -5.9851, df = 48, p-value = 1.321e-07
alternative hypothesis: true mean is less than 4
95 percent confidence interval:
 -Inf 2.971765
sample estimates:
mean of x
 2.571429

Perhaps most importantly, this output gives us the p-value, which is quite small indeed—1.321e-07, or 0.000000132—and is obviously well below the conventional cutoff of 0.05. This means that we can say that the mean is significantly below the midpoint at the specified significance level (see also Chap. 2), although the latter specification of the significance level is often left implicit. Other key information for reporting the test appears on the same line—the t-value (t = -5.9851) and the degrees of freedom (df = 48). Here is how I reported this (Sytsma, 2010a, 124): t(48) = −5.985, p < 0.001 (one-tailed). As illustrated here, for p-values below 0.001 (like 0.000000132), we typically just specify p < 0.001. In addition, it is important to specify whether you performed a one-tailed or two-tailed test, and I would also include the type of test in the text—e.g., that we performed a one-sample t-test comparing the mean response to the neutral point of 4. Other useful information in the output is the 95% confidence interval. Testing a null hypothesis is intimately related to determining confidence intervals.

18 The Greek letter μ (or mu) is standardly used in statistics to represent the population mean.

112 J. Sytsma
Specifically, a null hypothesis is rejected at a specified significance level—standardly denoted α—when the value or range that the null hypothesis specifies is outside of the (1−α) × 100% confidence interval calculated for the sample. Using the 0.05 significance level, this is the 95% confidence interval given in the output.¹⁹ Since we used a one-tailed test, the lower bound is given as negative infinity in this output, indicating that no value we found would fall below the lower bound. As such, it is the upper bound of 2.971765 that matters for us here. Since 4 falls above this upper bound (2.971765 < 4), the null hypothesis can be rejected (at the 0.05 significance level). Often 95% confidence intervals will be included in bar graphs showing study results, as is shown in Fig. 3.6. For this, we would typically show the confidence interval for each side of the mean, however, which can be generated using a two-tailed test. This is easily done by modifying the above function call:

32 t.test(D3$RESPONSE, mu=4)

As expected, this produces a slightly different output, since we’ve removed the directionality of the test, including giving a larger p-value and a confidence interval that is finite on both ends:

One Sample t-test

data: D3$RESPONSE
t = -5.9851, df = 48, p-value = 2.643e-07
alternative hypothesis: true mean is not equal to 4
95 percent confidence interval:
 2.091511 3.051346
sample estimates:
mean of x
 2.571429

This new information can be reported by noting that the test showed a 95% confidence interval of [2.09, 3.05].

19 (1−0.05) × 100% = 95%

Fig. 3.5 Vignette and scale for Study 4 by Sytsma (2010a, 125)

3.3.7.5 Statistical Significance Versus Effect Size

As noted above, the result of our test is significant, with the p-value (1.321e-07) being less than the standard cutoff of 0.05.
Simply referring to a result as “significant,” however, while common, is also potentially misleading. In ordinary language, describing something as significant often means that it is important, sometimes with a connotation that it is large (e.g., a sales manager for a company reporting that they expect significant additional sales in the next year). But this is not what we mean when we say that the result of a statistical test is significant. Here we need to distinguish between statistical significance and effect size. When we report a t-test and conclude that the result is significant, it is the statistical significance we are reporting, and this simply means that we can reject the null hypothesis at the specified level. A result could be statistically significant, however, while the difference between the numbers we’re comparing is very small (say, for example, a sample mean of 3.9 compared to the midpoint of 4). This is because statistical significance depends on sample size, such that if we had a large enough sample size in our study, even a very slight divergence of the sample mean from the comparison point (or between means, as we’ll see below) could be statistically significant. What we really want, in addition to the p-value, is a standardized indication of the size of the difference in our comparison. This is what effect size provides. For a study like the present one, simply reporting the sample mean will give some indication of the effect size. Here, the mean response was 2.57 on a 7-point scale, placing it roughly 1.43 points below the midpoint. While this is likely adequate in the present situation, better practice is to also report a measure of the effect size that makes comparison easier. For t-tests, the most common measure of effect size is Cohen’s d. Cohen (1988) provides helpful rules of thumb for interpretation: d = 0.2 is a small effect, d = 0.5 a medium effect, and d = 0.8 a large effect.
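To make the measure concrete: for a one-sample comparison, Cohen’s d is simply the distance between the sample mean and the comparison point, measured in standard deviations. A minimal sketch on made-up numbers (the vector x is hypothetical, not the study data):

```r
# Hypothetical 7-point scale responses (not the study data).
x <- c(1, 2, 2, 3, 3, 4, 5)

# One-sample Cohen's d: |sample mean - comparison point| / sample sd.
d <- abs(mean(x) - 4) / sd(x)
round(d, 2)  # about 0.85, a large effect by Cohen's rule of thumb
```

This manual calculation should agree with what the cohensD() function from the lsr package returns for a one-sample comparison, so it is mainly useful as a sanity check on the packaged function.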
Here you can think of a small effect as one that will be difficult to discern with the naked eye, a medium effect as one that can probably be discerned, and a large effect as one that can definitely be discerned. A handy example provided by Cohen is that the difference in height between 15-year-old and 16-year-old girls in the United States is a small effect, while the difference in height between 13-year-old and 18-year-old girls is a large effect. Cohen’s d for the present study is readily calculated in R, here using a function from the lsr package that we installed and loaded at the beginning of the exercise:

33 cohensD(D3$RESPONSE, mu=4)

This outputs a value of 0.86, which indicates that the comparison between our sample mean and the midpoint of the scale shows a large effect size according to Cohen’s rule of thumb.

3.3.7.6 A Nonparametric Alternative to a One-Sample t-Test

As discussed above, t-tests involve a number of assumptions—importantly including that the feature we’re interested in is normally distributed in the population and that responses have an interval scale—and often these assumptions are somewhat dubious for studies in experimental philosophy. This does not mean that you should avoid t-tests in your work, however. As noted, t-tests are generally quite robust to violations of these assumptions. Nonetheless, it is important to be mindful of such issues. One easy way to do this is to also run a comparable statistical test that doesn’t make the same assumptions, such as a nonparametric test. Nonparametric tests do not make assumptions about the underlying distribution of the feature in the population or that your data has an interval scale. There is never a free lunch, however, and these advantages of nonparametric tests have a cost: they come at the expense of some statistical power, meaning that you’re less likely to get a significant result.
One common nonparametric alternative to t-tests is to use the Wilcoxon procedure, which is also easily run in R. For our present study this can be done with the following function call:

34 wilcox.test(D3$RESPONSE, mu=4, alternative="less")

This gives the following output (along with warnings that the p-value is an estimate, which we need not worry about here):

Wilcoxon signed rank test with continuity correction

data: D3$RESPONSE
V = 178, p-value = 9.672e-06
alternative hypothesis: true location is less than 4

As expected, the p-value is slightly higher than what we saw above for the corresponding t-test, but is overall fairly comparable: the result is highly significant on either measure. For thoroughness, you could report the Wilcoxon test alongside the t-test and Cohen’s d, if desired, such as: t(48) = −5.99, p < 0.001 (one-tailed), d = 0.86; V = 178, p < 0.001 (one-tailed).

3.3.8 Interpretation

It is a commonly noted point that data is one thing, conclusions another, and that to draw a conclusion from a set of data involves interpretation. This is sometimes taken to suggest a divide between the objective scientific study and the subjective, opinionated interpretation of it. But, as the above walkthrough hopefully draws out, there is no sharp divide like this in the research process. Indeed, the interests of the researcher are invariably present in forming a research question, in formulating and implementing a plan to address that question, and in analyzing the results. This is not a bad thing. Just focusing on the statistical analysis in the last section, given the data we collected, there are any number of tests we might have run. For instance, we could have checked whether there was a correlation between the responses of our participants to the test question and their age.
This is certainly easy enough to do in R:

35
36 cor.test(D3$RESPONSE, D3$AGE) # r=.063, p=.67

It is important to note, however, that our research question didn’t specifically concern age, and we made no predictions about the relation between responses and age. And absent such a prediction, there is no null hypothesis, such that the NHST framework makes no sense. This does not mean that there is anything wrong with running the correlation test, but it does bear on how we should think about the result: we should treat this as merely exploratory, such that if we had found a potentially interesting relation we’d want to confirm it with a new study directed at testing the relation, making the relevant prediction in advance of looking at the data. The basic reason is that if we look at enough comparisons for a given dataset, some of them are likely to be statistically significant just by chance. We’ll elaborate on this point in Sect. 3.5 when we consider the question of correcting for multiple comparisons. Nonetheless, while an element of interpretation is found throughout the research process, including in the statistical analysis performed, once you’ve completed an analysis you’ll want to describe what it means—you’ll want to interpret those findings—typically focusing on drawing out the philosophically relevant conclusions. For the present study, recall that this project grew out of a wider set of work testing the common assertion in philosophy of mind that the concept of phenomenal consciousness is part of folk psychology, including the finding by Sytsma and Machery (2010) indicating that, in contrast to philosophers, lay people treat two prototypical examples of supposedly phenomenally conscious mental states quite differently, happily ascribing seeing red to a simple robot while denying that the robot feels pain. Sytsma (2010a) explored one explanation for this finding, hypothesizing that lay people tend to hold a naïve view of both colors and pains.
If people tend to hold such a view of pains, however, we would expect them to allow that there could be unfelt pains, in direct contradiction to the common justification offered for the standard view in philosophy. The results of my second study gave an initial indication that people are open to the possibility of unfelt pains. My third study, then, attempted to test this more directly, generating my research question. This same research question guided the design and analysis of the study, and it in turn guides the interpretation of the results. What we find is that the results from Study 3 by Sytsma (2010a) are in line with the prediction that lay Americans tend to allow for the possibility of unfelt pains. This in turn suggests against the standard view about the ordinary concept of pain in philosophy, while offering some support for the alternative hypothesis that the ordinary concept corresponds with a naïve view of pain. The evidence here is most direct for the explicit prediction, while the conclusion with regard to the opposed philosophical views is more tentative. Thus, as noted above, we shouldn’t put too much weight on just a single study, or even a pair of studies, especially when it comes to rejecting or endorsing broad philosophical accounts. One reason is that this study involved a number of decision points, including the vignette used, the question posed, and how participants were recruited. To make a truly compelling case for the claim that “common sense” allows for unfelt pains, further studies are needed.

3.4 Further Studies and Further Tests

In this section, I’ll detail a few of the many subsequent studies that have been conducted on the question of unfelt pain. I’ll focus on two things in this discussion. First, we’ll consider how follow-up studies can address potential concerns with previous studies. This will include varying the vignettes, answer choices, and recruitment strategy.
Second, we’ll illustrate two further types of t-tests—independent samples t-tests and paired samples t-tests—although we’ll run through these in somewhat less detail, now that we’re more familiar with the basics of statistical analysis in R. We begin in Sect. 3.4.1 with the fourth study by Sytsma (2010a), which will be used to introduce independent samples t-tests. In Sect. 3.4.2, we’ll use this study to discuss the relationship between sample size, effect size, and power, which is crucial for designing and interpreting effective empirical studies. In Sect. 3.4.3, we then turn to the first study by Sytsma and Reuter (2017), which uses a different type of vignette to assess judgments about unfelt pains and shifts from scale responses to binary answer choices; we’ll use this as an opportunity to see how we can compare between these types of responses. Finally, in Sect. 3.4.4, we look at a study by Reuter and Sytsma (2020), which will be used to introduce paired samples t-tests.

3.4.1 Independent Samples t-Tests

One issue that was raised in presenting the results of the study detailed in the previous section is that the vignette describes the situation in terms of an injured person being distracted from a pain. It is plausible, though, that you can only be distracted from something that exists, such that participants might have inferred from this wording that the pain was ongoing despite being unfelt. To address this criticism, in Study 4 by Sytsma (2010a), I revised the vignette to adjust the description, as shown in Fig. 3.5, while keeping other details the same (including the question asked and recruitment method). Data for this study is available in 2_s_2010a_study_4.csv. Let’s begin by graphing the means for these two studies side by side, along with the histogram for Study 4, as shown in Fig. 3.6. The code for producing these graphs and subsequent analyses can be found in 2_s_2010a_study_4.txt.
Fig. 3.6 Bar graph for Studies 3 and 4 by Sytsma (2010a) on the left, with 95% confidence intervals; histogram for Study 4 on the right

After loading the data for each study (into D3 and D4), the code first produces a bar graph showing the means with 95% confidence intervals (as described in Sect. 3.3.7.5):

37 ################
38 # Plot Studies #
39 ################
40 # Run t-tests for confidence intervals
41 TD3 <- t.test(D3$RESPONSE, mu=4)
42 TD4 <- t.test(D4$RESPONSE, mu=4)
43 # Barplot with study means
44 barplot(height=c(mean(D3$RESPONSE), mean(D4$RESPONSE)), width=0.5,
       ylim=c(1,7), xpd=FALSE, main="Unfelt Pain",
       names.arg=c("Study 3","Study 4"), ylab="Mean Response")
45 box(bty="l")
46 # Plot confidence intervals for Study 3
47 lines(x=c(0.35,0.35), y=c(TD3$conf.int[1],TD3$conf.int[2]), type="l")
48 lines(x=c(0.3,0.4), y=c(TD3$conf.int[1],TD3$conf.int[1]), type="l")
49 lines(x=c(0.3,0.4), y=c(TD3$conf.int[2],TD3$conf.int[2]), type="l")
50 # Plot confidence intervals for Study 4
51 lines(x=c(0.95,0.95), y=c(TD4$conf.int[1],TD4$conf.int[2]), type="l")
52 lines(x=c(0.9,1), y=c(TD4$conf.int[1],TD4$conf.int[1]), type="l")
53 lines(x=c(0.9,1), y=c(TD4$conf.int[2],TD4$conf.int[2]), type="l")

Based on the confidence intervals, we can tell that, as in Study 3, the mean for Study 4 is significantly below the midpoint: the upper bar for the 95% confidence interval for each study in Fig. 3.6 is well below the midpoint. This tells us that the value specified by our null hypothesis (that the mean is greater than or equal to 4) falls outside the range in which we expect the true value of the mean to fall (again given our test assumptions), which means that the null hypothesis can be rejected at the 0.05 significance level (where the 95% confidence interval corresponds to the 0.05 level).
This inference can easily be confirmed by applying the same analysis we used before—starting with a one-sample t-test, then checking the effect size, and finally confirming that the finding still holds when using a nonparametric test:

54 #################################
55 # Statistical Tests for Study 4 #
56 #################################
57 t.test(D4$RESPONSE, mu=4, alternative="less")
58 cohensD(D4$RESPONSE, mu=4)
59 wilcox.test(D4$RESPONSE, mu=4, alternative="less")

These tests for Study 4 produce a comparable result to what we saw above for Study 3: t(40) = −3.33, p < 0.001 (one-tailed), d = 0.52; V = 204, p = 0.0039 (one-tailed). And, again, the results are in line with the hypothesis that a majority of the population holds that unfelt pains are possible. The significant result for Study 4 suggests against the criticism of Study 3: it does not appear that the use of the phrase “distracted from” was a primary driver of the low mean for that study. Nonetheless, it is possible that this phrase does make some difference, even if participants still tend to judge that this is a case of unfelt pain when the phrase is replaced. One way to test this is to directly compare the mean responses between the two studies. If “distracted from” played a role in lowering mean responses, then we would predict that the mean for Study 4 (M = 3.02) would be higher than for Study 3 (M = 2.57). Just by looking at the means, we can tell that Study 4 is higher, of course (3.02 > 2.57). The real question, though, is whether this difference is unlikely to simply be due to chance variation between the samples. To check this, we need to run a statistical test.²⁰ Here what we want to do is to compare results from two different samples.
Since we want to compare means between samples, and since these are different samples—the samples are independent of one another—what we want to use is an independent samples t-test. We can conduct an independent samples t-test using the same t.test() function as before, but now we’ll need to include an argument for the relevant table for each study (D4 as well as D3) and an argument to indicate that the responses are from different participants—that the samples are not paired. As before, it is arguably appropriate to use a one-tailed test, since the criticism makes a directional prediction. Here is the function call:

60 t.test(D3$RESPONSE, D4$RESPONSE, paired=FALSE, alternative="less")

And here is the output it produces:

Welch Two Sample t-test

data: D3$RESPONSE and D4$RESPONSE
t = -1.1981, df = 80.958, p-value = 0.1172
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
 -Inf 0.1760992
sample estimates:
mean of x mean of y
 2.571429 3.024390

As expected, we find that the difference between the means is not significant at the 0.05 level. As such, we cannot reject the null hypothesis that the change to the vignette does not increase ratings. You might have noticed that the test description in the output is different from our previous tests: not only did the number of samples noted change, as expected, but it is now described as a Welch’s t-test. Recall from above that the label “t-test” describes a family of procedures. In fact, this family is even larger than I have indicated so far, including that there are multiple types of t-tests that we could run to compare between independent samples. The t.test() function in R defaults to a Welch’s t-test for these.

20 However, note that looking at Fig. 3.6 and using the same logic as above, we can infer from the fact that the confidence interval for Study 4 includes the mean from Study 3 that this difference won’t be significant at the 0.05 level.
By contrast, the one-sample t-tests we ran above were Student’s t-tests. We could have used a Student’s t-test for the present comparison as well; we would simply need to set the var.equal argument in the function call to TRUE:

61 t.test(D3$RESPONSE, D4$RESPONSE, paired=FALSE, alternative="less", var.equal=TRUE)

This produces the following output:

Two Sample t-test

data: D3$RESPONSE and D4$RESPONSE
t = -1.2107, df = 88, p-value = 0.1146
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
 -Inf 0.1689956
sample estimates:
mean of x mean of y
 2.571429 3.024390

As we can see, the p-value is comparable to what we found using Welch’s t-test. Notice that the degrees of freedom differ between the two outputs. For the Student’s t-test, it is a whole number (88, which is the combined sample size for the two studies minus two; also see Chap. 2). For the Welch’s t-test, however, it is 80.958. This reflects that the Student’s t-test assumes that the values we’re sampling from have equal variance—basically, that the values have the same spread around the average—while the Welch’s t-test does not make this assumption and applies an adjustment. We could perform a further test to see if our data supports the assumption of equal variance.²¹ But I wouldn’t bother: there is a good reason that R performs a Welch’s t-test by default; basically, whether the assumption of equal variance holds or not, a Welch’s t-test will perform at least as well as a Student’s t-test (see Delacre et al., 2017). As with the one-sample t-tests we performed above, we can also calculate the effect size for our comparison between Study 3 and Study 4:

62 cohensD(D3$RESPONSE, D4$RESPONSE)

This gives a Cohen’s d of 0.26, which is a small effect size.²² And, as before, we can address concerns based on the normality of our data and whether it is interval

21 This is illustrated in 2_s_2010a_study_4.txt using Levene’s test.
Comparing Studies 3 and 4, this test gives a p-value of 0.51, meaning that we cannot reject the null hypothesis that the populations have equal variance.

by performing a nonparametric alternative to the independent samples t-test, using the same wilcox.test() function as above:

63 wilcox.test(D3$RESPONSE, D4$RESPONSE, paired=FALSE, alternative="less")

With independent samples, the function now performs a Wilcoxon rank sum test, as opposed to the signed rank test we saw before. As before, the nonparametric test gives a comparable p-value to that found for the t-test: t(80.958) = −1.20, p = 0.12, d = 0.26; W = 863.5, p = 0.12.

22 It might seem strange to calculate an effect size for a comparison that was not significant. Recall, however, that these tell us two different things: roughly, statistical significance gives us a measure of how likely a difference is to be due to chance, and depends on the sample size and the significance level we select, while effect size tells us how big that difference is. Thus, it can sometimes be useful to know what the effect size is if we were to accept a result, even if that result is nonsignificant at a given significance level.

3.4.2 Sample Size, Effect Size, and Power

We’ve just seen that, comparing the means between Studies 3 and 4 by Sytsma (2010a), we get a p-value above 0.05, which means that we cannot reject the null hypothesis at this significance level. Does this mean that we can conclude that the wording of the vignette does not matter? No. Being unable to reject the null hypothesis is not the same as accepting the null hypothesis. One issue here is that, as we noted above, p-values are tied to sample sizes. To illustrate, let’s assume that we expect that the effect size we reported is roughly accurate and that the wording of the vignette does make a small difference, with the true effect size corresponding with a Cohen’s d of 0.2.
With this assumption in place, we can run some further tests to tell us how large our sample sizes would need to be for our study to be likely to detect an effect of that size (i.e., to get a significant result). For instance, we could calculate how many participants we would need to sample to have an 80% chance of detecting an effect with a Cohen’s d of 0.2 using a t-test with a 0.05 significance level (and taking the test assumptions to hold). The chance of detecting the effect is known as the power of the test, and it is usually expressed as a probability. For an 80% chance, then, we’d want a power of 0.8, and for a 90% chance we’d want a power of 0.9. We can calculate the sample size we would need for our study to have a given power using the pwr package in R.²³ Using the same sample size for each group, this can be calculated with the pwr.t.test() function. To do this we need to supply our expected effect size (d = 0.2), the significance level (the default is 0.05), whether the test we’re interested in is one- or two-tailed, and the power we want. Typical suggestions are to aim for a power between 0.8 and 0.9. The code document shows function calls for each of these options:

64 pwr.t.test(d=0.2, power=0.8, alternative="greater") # n=309.8065
65 pwr.t.test(d=0.2, power=0.9, alternative="greater") # n=428.8705

As we can see, these output a range of roughly 310–429 participants. Note that this gives us the numbers we would need per condition. This means, for example, that to have a 90% chance of getting a significant result at the 0.05 level for an independent samples t-test checking that the mean response for the probe used in Study 4 is greater than the mean response for the probe used in Study 3, we would need roughly 858 participants (429 per condition)! Obviously, the studies by Sytsma (2010a) were a far cry short of this.

23 Remember to install (install.packages("pwr")) and load (library(pwr)) this package if you want to run the code for yourself!
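If you would rather avoid installing an extra package, base R’s stats library includes power.t.test(), which solves the same relationship among effect size, sample size, significance level, and power. Here delta is the difference in means and sd the common standard deviation, so delta = 0.2 with sd = 1 corresponds to d = 0.2; this sketch should closely reproduce the first pwr call above:

```r
# Base R alternative to pwr::pwr.t.test(): supply all but one of the
# quantities and power.t.test() solves for the missing one (here, n).
pw <- power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.8,
                   type = "two.sample", alternative = "one.sided")
pw$n  # sample size needed per group; should be roughly 310
```

Asking for power = 0.9 instead should likewise land near the 429-per-group figure reported above.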
Basically, the smaller the effect, the more difficult it will be to detect. Given this, we should be very cautious about concluding that there is no effect here—that whether the vignette uses “distracted from” does not matter—as noted above. Rather, we simply did not detect an effect (we did not get a significant result), but this is hardly surprising given the sample sizes and if we expect that the true effect might be rather small. In fact, we can use the same pwr package to test the power these studies actually had to detect an effect of size d = 0.2. This calculates what is known as the post hoc power of the test:

66 pwr.t2n.test(d=0.2, n1=41, n2=49, alternative="greater")

In contrast to the previous test, here we use the pwr.t2n.test() function, since the studies had different sample sizes, and we specify those sample sizes instead of the power: if we supply two out of three of effect size, sample sizes, and power, these functions will calculate the third. The result is that the pair of studies had just a 24% chance of detecting an effect of this size. Does this mean that the studies were underpowered? Not necessarily! This would depend on the effect size that we expect. To illustrate, let’s suppose that what would be philosophically important here—what would vindicate the objection—is if there was a large effect size, say one of at least the size that we found above for comparing the mean in Study 3 to the midpoint (d = 0.86):

67 pwr.t2n.test(d=0.86, n1=41, n2=49, alternative="greater")

For detecting an effect of this size, we find that the two studies have a post hoc power of over 99%, meaning that we would be very likely to detect such an effect.

3.4.3 Binary Answer Choices and Dichotomization

While Study 4 by Sytsma (2010a) helps alleviate one concern with the vignette from Study 3, it is still roughly the same vignette.
Perhaps there is something else about this story that tends to elicit responses from people that don’t truly reflect their views about unfelt pains? Or perhaps the way the question was asked does this? Or perhaps the university students surveyed aren’t representative of the wider population? Confidence in this finding is bolstered somewhat by the results of the second study by Sytsma (2010a), discussed above, which asked about unfelt pains more directly, but it still behooves us to explore the hypothesis from further angles. One way to do this is to run additional studies that vary the vignettes used and the questions asked, as well as the recruitment method employed. Sytsma and Reuter (2017) report on three studies concerning unfelt pains that adapt the vignette used in Study 5 by Sytsma (2010a). This study concerned the possibility of shared pains—that two people might feel one and the same pain if they were to share a body part, such as both being attached to the same hand. The standard view holds that the ordinary concept of pain precludes the possibility of shared pains in such cases (since pains are mental states, it doesn’t matter if a body part is shared), while the alternative predicts that people will tend to allow for this possibility (if pains are bodily states, then sharing a body part could lead to feeling the same pain). In the test condition of this study, participants were given a vignette describing two conjoined twins who share the lower part of their body. The twins run through a park, forcefully kick a rock hidden in the grass, and give behavioral indications of pain. Participants were then asked whether the twins felt one and the same pain or two different pains, answering on a 7-point scale. The mean response (M = 3.29) was significantly below the midpoint, indicating that participants tended to think that the twins felt one and the same pain.
In the first study by Sytsma and Reuter (2017), participants were given two different probes soliciting pain judgments, with the probes given on separate pages and the order of the two pages randomized. The first probe replicated Study 5 by Sytsma (2010a) that we just discussed: it used the same vignette about conjoined twins, but adjusted the question to use a binary answer choice instead of a scale (participants answered by selecting either "one and the same pain" or "two different pains"). In addition, we added a comprehension check question and used a different method for recruiting participants—we recruited them online using a push strategy rather than soliciting responses from students in class. A "push strategy" involves recruiting participants who were not directly looking to participate in research by offering an alternative incentive.24 We found that 68.5% (217/317) of participants who passed the comprehension check answered that the twins felt one and the same pain. Thus, we find a comparable result despite changing the question type and recruitment method, providing further evidence against the standard view and in favor of the alternative naïve view.

24 In this case, participants were recruited through advertising for a free personality test on Google Ads, with the personality test being administered after the target questions. One notable benefit of using such a push strategy, in comparison to standard online recruitment methods in experimental philosophy (such as paid services like Amazon Mechanical Turk or Prolific Academic), is that participants are more likely to be "experimentally naïve"—less likely to guess what the study is really about—and less likely to be motivated to provide the responses that they think the researchers are looking for (Haug, 2018). Samples collected using the recruitment strategy employed here have been previously compared against samples collected with other methods in replication studies.
And the present strategy has been consistently found to generate a diverse sample in terms of geography, socioeconomic status, religiosity, political orientation, age, and education. Studies using this strategy have been previously reported in publications including, e.g., Livengood et al. (2010), Feltz and Cokely (2011), Murray et al. (2013), Machery et al. (2015), Livengood and Rose (2016), Livengood and Sytsma (2020), Fischer et al. (2021), Sytsma et al. (2012, 2015), and Sytsma (2010d, 2021, 2022), among many others.

124 J. Sytsma

More importantly for present purposes, the second probe concerned unfelt pains. Unlike Study 3 by Sytsma (2010a), however, this probe uses a vignette describing a pair of conjoined twins where just one takes a painkiller:

Johnny and Tommy are conjoined twins that are joined at the torso. While they are distinct people, each with their own beliefs and desires, they share the lower half of their body. One day they accidentally dropped a heavy weight on their left foot. Johnny and Tommy both grimaced and shouted out "Ouch!" They were then rushed to the hospital for treatment. Unfortunately, the nurse who checked them in was unfamiliar with conjoined twins. As a result, Johnny was given a pill for the pain while Tommy was left untreated. Ten minutes later, the doctor arrived to examine them. When she pushed on the injured foot, Tommy grimaced and shouted out "Ouch!" while Johnny merely shrugged his shoulders and said it didn't hurt at all.

After reading the vignette, participants were asked to select which of two claims best reflected their view about it, with the choices presented in random order:

There was a pain in Johnny and Tommy's injured foot when the doctor pushed on it: While Tommy felt the pain in their foot, the painkiller prevented Johnny from feeling that pain.
There was not a pain in Johnny and Tommy's injured foot when the doctor pushed on it: While the foot caused Tommy to feel pain, the painkiller prevented Johnny from feeling pain.

As with the shared pain probe, participants were also given a comprehension check question.25 The order of the two probes was randomized, but responses did not vary noticeably based on which probe participants saw first. We found that 83.7% (251/300) of participants who passed the comprehension check selected the first answer for the present probe, indicating that they thought of this as a case of unfelt pain. Results were comparable in our second study, which tweaked the answer choices for the unfelt pain question to emphasize that we meant the pain claims literally:

There actually is a pain in the injured foot: while Tommy feels the pain in the foot, the painkiller prevents Johnny from feeling that pain.

There is not actually a pain in the injured foot: while the foot causes Tommy to have the feeling of there being a pain in the foot, the painkiller prevents the foot from causing Johnny to have such a feeling.

This time 84.0% (110/131) selected the first option. Our third study further tweaked the answer choices, with the first option now including that "the pain is literally in Johnny and Tommy's injured foot." Again, a significant majority of participants selected the unfelt pain option (65.7%, 109/166). In some ways, the use of binary response choices better matches the main hypotheses at issue for our research on unfelt pains: this work aims to test the contention that common sense denies that there can be unfelt pains, but this contention doesn't make any clear claims about the relative strength of belief or the extent to which people will be unsure about the issue.

25 Participants were asked, "Did Johnny and Tommy drop a heavy weight on their left foot?" and answered by selecting either "yes" or "no."
At the same time, using binary response choices like those given above forces participants to make a choice—indeed, these are sometimes called "forced-choice questions"—and we might, therefore, worry that when participants aren't sure they will just answer randomly (see Sytsma and Livengood, 2015, Section 9.2, for discussion). Using a scale like the one employed in the studies discussed in the previous section resolves this issue by giving participants the ability to register their level of belief or to indicate uncertainty (by selecting the midpoint on the scale). The flip side is that the inclusion of such options might cause participants to be more cautious, leading to weaker responses than they would otherwise give. Fortunately, when in doubt, you can always replicate your studies using different types of response choices and compare.

One way to compare between binary and scale responses is to dichotomize the scale results. For instance, for Studies 3 and 4 by Sytsma (2010a), we can split the participants into groups based on whether they gave a response indicating that they judged the case to be one of unfelt pain (answering 1, 2, or 3 on the 7-point scale) or not (answering 4 or higher). Here is the code to dichotomize the results from Study 3, as shown in the 2_s_2010a_study_3.txt document:

68 #########################
69 # Dichotomize Responses #
70 #########################
71 # Percentage unfelt pain: 83.7%
72 nrow(D3[D3$RESPONSE==1|D3$RESPONSE==2|D3$RESPONSE==3, ])/nrow(D3)

Running this line, we find that 83.7% (41/49) of participants in Study 3 judged there to be an unfelt pain. Calling on the statistical test discussed in Chap. 2 (the χ2 test), we can then test whether this proportion is significantly greater than 50% using the prop.test() function in R:

73 prop.test(x=41, n=49, p=0.5, alternative="greater")

Not surprisingly, we find that this proportion is significantly greater than 50%: χ2 = 20.9, p < 0.001 (one-tailed).
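If you want to check this χ2 arithmetic by hand, note that prop.test() for a single proportion is just a chi-squared test with Yates' continuity correction. A minimal Python sketch (the helper name prop_test_one is mine, chosen for illustration):

```python
from scipy import stats

def prop_test_one(x, n, p=0.5):
    """One-proportion chi-squared test with Yates' continuity
    correction, mirroring R's prop.test(x, n, p).
    Returns the chi-squared statistic and a one-tailed p-value
    (valid when the observed proportion lies in the predicted
    direction, as it does here)."""
    e1, e2 = n * p, n * (1 - p)  # expected counts for the two cells
    chi2 = ((abs(x - e1) - 0.5) ** 2 / e1
            + (abs((n - x) - e2) - 0.5) ** 2 / e2)
    p_two = stats.chi2.sf(chi2, df=1)
    return chi2, p_two / 2

chi2, p = prop_test_one(41, 49)  # Study 3: 41 of 49 unfelt-pain responses
print(round(chi2, 1))            # 20.9, matching the text
```

Applying the same helper to Study 4's counts (30 of 41) reproduces the χ2 = 7.90, p = 0.0025 result reported in the next paragraph.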
Doing the same thing for Study 4, we find that 73.2% (30/41) of participants judged there to be an unfelt pain, which is again significantly greater than 50%: χ2 = 7.90, p = 0.0025 (one-tailed). Finally, we could compare either of these proportions to another, such as a proportion that was assessed directly using binary answer choices. To illustrate, let's compare the proportion of positive responses in Study 3 by Sytsma (2010a) to the proportion from Study 1 by Sytsma and Reuter (2017). We can do this using the same prop.test() function, although we'll now use the concatenate function c() to specify both the x value (positive count) and the n value (total count) for each proportion, and we'll use a two-tailed test since we don't have reason to predict that either proportion would be greater than the other:

74 prop.test(x=c(41,251), n=c(49,300))

Given that these proportions are remarkably similar (0.8367347 vs 0.8366667), it is no surprise that the difference is not statistically significant. Indeed, we get a p-value of 1.

3.4.4 Paired Samples t-Tests

Finally, Reuter and Sytsma (2020) detail a large series of further studies testing whether common sense countenances unfelt pains. This includes a study replicating Study 4 by Sytsma (2010a) using the online push strategy from the previous study and binary response choices:

Which of the following descriptions of this type of situation seems most appropriate to you?

The injured person still had the pain and was just not feeling it during that period.

The injured person had no pain during that period.

We found that 90.3% (28/31) of participants selected the first option, which is somewhat higher than the proportion from Study 4 noted above, although the difference is not significant (χ2 = 2.31, p = 0.12). Our chapter also included four studies involving an injured patient taking a painkiller, as in the thought experiment from Aydede noted in Sect. 3.2.
Unlike the painkiller studies just discussed by Sytsma and Reuter (2017), this time the vignettes did not involve conjoined twins, and we used both scales and binary response options. In each case, we again found that a significant majority of participants judged that the patient had a pain even though they didn't feel it while the painkiller was in effect. Another set of studies instead used vignettes describing a severely injured soldier who professes not to feel any pain, as in the thought experiment from Hill discussed in Sect. 3.1, with similar results. Thus, while Hill (2009, 171) states that when he "asked informants to assess the likelihood of this scenario [. . .] they have all been inclined to dismiss it as absurd," our results were quite different: in each of nine studies varying both the vignettes and the questions, we found that a significant majority of participants responded that the injured soldier had a pain despite not feeling it.

Finally, Reuter and I report a series of seven studies that asked participants more direct questions about the possibility of unfelt pains. Perhaps most strikingly, in our 15th and 16th studies, we asked the following four questions, with participants either answering by selecting "yes" or "no" (Study 15) or using a 7-point scale anchored at 1 with "clearly no," at 4 with "not sure," and at 7 with "clearly yes" (Study 16):

1. Is it possible for a person to have a pain that they don't feel for a period of time?
2. Have you ever had a pain that you didn't feel for a period of time?
3. Is it possible for a person to have a pain that doesn't hurt for a period of time?
4. Have you ever had a pain that didn't hurt for a period of time?

A significant majority of participants gave an affirmative answer to each question, giving a further indication that lay people tend to hold that unfelt pains are possible, and further that they tend to hold that they are actual.
Focusing on Study 16, we can show this by running a similar analysis to what we saw before, using one-sample t-tests to compare the means to the midpoint for each of the four questions. This is illustrated in 2_rs_2020_study_16.txt using the data provided in 2_rs_2020_study_16.csv. (We can also dichotomize and compare the proportion to 50% or to the binary responses from Study 15, as we did in Sect. 3.4.3, and as is illustrated in the code document.) Analyzing each question separately in this way, however, raises the potential issue of correcting for multiple comparisons, which I return to in the next section.

Our purpose in analyzing Study 16 was, once again, to test whether lay people tend to deny the possibility of unfelt pains, as the standard view contends. And for this purpose, the one-sample t-tests just noted do the trick. But we could have asked other questions here, motivated by different research interests. One possibility is comparing between the two sets of questions. I'll do that here for purposes of introducing a third type of t-test—paired samples t-tests.

One interesting facet of Study 16 by Reuter and Sytsma (2020) is that we didn't just ask participants about the possibility of unfelt pains, but also about their actuality. This is done using two different wordings, with Questions 1 and 2 forming a pair and Questions 3 and 4 forming a pair (the order of these pairs was counterbalanced in the studies). We might wonder whether responses differed within these pairs. Indeed, since more things are possible than are actual, we would predict that people would be more likely to affirm the possibility questions than the actuality questions (although, again, this isn't a prediction we specifically made in the actual paper). Making this prediction for illustrative purposes, we can then test it using a paired samples t-test.
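Before running the real test in R, the logic of a paired comparison can be illustrated with a small Python sketch on invented data (these numbers are hypothetical, purely for illustration, and are not from Study 16): because the same participants answer both questions, consistent within-participant differences yield a much more sensitive test than treating the two sets of responses as independent samples.

```python
from scipy import stats

# Hypothetical 7-point scale responses from eight participants,
# each answering both a possibility and an actuality question.
possibility = [7, 6, 7, 5, 6, 7, 4, 6]
actuality   = [6, 5, 7, 4, 5, 6, 3, 5]

# Paired test: uses the per-participant differences
paired = stats.ttest_rel(possibility, actuality, alternative="greater")

# For contrast, (wrongly) treating the samples as independent
independent = stats.ttest_ind(possibility, actuality,
                              equal_var=False, alternative="greater")

print(paired.pvalue, independent.pvalue)
```

With these made-up responses, the paired test comes out clearly significant while the independent samples test on the very same numbers does not: the pairing strips out the large between-participant variation, leaving only the small, consistent within-participant differences.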
A paired samples test is called for in this case because we're comparing mean responses, but these responses were given by the same participants—each participant in Study 16 answered all four of the questions—such that an independent samples t-test wouldn't be appropriate: these samples are not independent. As before, we can run the paired samples comparison using the t.test() function in R. All we need to do is change the paired argument from FALSE to TRUE in our function call, as illustrated here for the first pair of questions:

75 t.test(D$RESPONSE_1, D$RESPONSE_2, paired=TRUE, alternative="greater")

This produces the following output:

Paired t-test
data: D$RESPONSE_1 and D$RESPONSE_2
t = 3.26, df = 61, p-value = 0.0009122
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.3854137 Inf
sample estimates:
mean of the differences
0.7903226

The output indicates that we've conducted a paired samples t-test and that the difference is significant (p-value = 0.0009122). This means that, as expected, people were more likely to judge that unfelt pains are possible than to judge that they have actually had an unfelt pain. As before, we can also calculate the effect size using cohensD() and run a nonparametric comparison using wilcox.test():

76 cohensD(D$RESPONSE_1, D$RESPONSE_2)
77 wilcox.test(D$RESPONSE_1, D$RESPONSE_2, paired=TRUE, alternative="greater")

The combined result is t(61) = 3.26, p < 0.001, d = 0.46; V = 401.5, p = 0.0012.26 And we get comparable results comparing Questions 3 and 4: t(61) = 2.79, p = 0.0035, d = 0.27; V = 223.5, p = 0.0039.

3.5 Correcting for Multiple Comparisons

To close this chapter, it is important to consider a possible complication that I noted in the previous section: when conducting multiple statistical comparisons on the same set of data, we need to be mindful of how this might affect the way we should interpret our p-values.
This concern isn't specific to t-tests, but I most often see this type of issue arise in the x-phi literature for sequences of t-tests, such that it makes sense to address it in this context. First, in Sect. 3.5.1, I'll discuss why we sometimes need to correct for multiple comparisons and provide an example where I don't think such a correction is needed. In Sect. 3.5.2, I'll discuss different types of correction, focusing on two methods—the Bonferroni method and the Holm method. Finally, in Sect. 3.5.3, I'll present a case where I do think a correction is needed and show you how to apply the Holm method to this case in R.

26 As before, we could also dichotomize these questions and compare proportions, although for paired data like this we will now want to run a McNemar's test, as illustrated in the code document.

3.5.1 Why Should We Apply a Correction?

For Study 16 by Reuter and Sytsma (2020), participants were each asked four questions about unfelt pains, and in our analysis, we began by comparing the mean response for each to the midpoint using a sequence of four one-sample t-tests, as discussed in the previous section. I noted above that this potentially raises a complication, however. Simply put, the worry is that if you test enough variables, it is quite likely that some will be significant just by chance even if the null hypothesis is true. To illustrate, imagine that you're worried that a mint is producing coins that are biased toward coming up heads. Say that you test this by flipping a single coin ten times. If it were to come up heads all ten times, this would be some evidence for your worry. After all, such an outcome is quite unlikely if the coin is fair (roughly a probability of 0.00098). But what if you were to test 1000 coins this way? Now it would be more likely than not that at least one of the coins would come up heads ten
times in a row (roughly a 0.62 probability) even if all the coins are fair.27 As such, finding that one of the coins came up heads ten times would hardly be evidence that the mint is producing biased coins. Running t-tests on lots of variables without correcting for multiple comparisons faces a corresponding problem. In this analogy, each t-test corresponds with testing a different coin: the more tests we run, the more likely it is that we'll get one or more significant results just by chance even if the null hypothesis is true. Of course, this is possible even if we run just one test. This is why we report p-values, since they give us a sense of the likelihood of getting a result at least this extreme by chance. But the point is that running multiple tests will affect how we should think about those chances when considered as a group.

For the one-sample t-tests in our analysis of Study 16, however, it is unclear that this is a serious worry. And, indeed, I'm inclined to think that it is not. The reason is that we were predicting a pattern of results across the four questions. In more detail, we varied two things across the four questions—modality and phrasing. With regard to modality, Questions 1 and 3 asked about the possibility of unfelt pains, while Questions 2 and 4 asked about their actuality for the participant. And while we expected that this would make a difference, for the reason detailed above, we, nonetheless, predicted that participants would still tend to affirm the actuality questions. With regard to phrasing, Questions 1 and 2 were phrased in terms of a pain that wasn't felt, while Questions 3 and 4 were phrased in terms of a pain that didn't hurt. We didn't expect that the specific phrasing would make an important difference. As such, we predicted that participants would tend to affirm each of the four questions, and if they tended to deny any of the four, this would provide some evidence against the general hypothesis.
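The coin probabilities used in this argument are easy to verify directly; a quick Python check of the arithmetic (not part of any study analysis):

```python
# Probability of a fair coin coming up heads ten times in a row
p_ten_heads = 0.5 ** 10  # (1/2)^10 = 1/1024, roughly 0.00098

# Probability that at least one of 1000 fair coins does so:
# one minus the probability that none of them do
p_at_least_one = 1 - (1 - p_ten_heads) ** 1000  # roughly 0.62

print(round(p_ten_heads, 5), round(p_at_least_one, 4))
```

Both results match the figures quoted in the text: a single coin has about a 0.00098 chance of showing ten straight heads, yet across 1000 fair coins the chance of seeing this at least once rises to about 0.62.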
Corrections for multiple comparisons, however, essentially make it tougher to get significant results at a given significance level. Given our prediction of the pattern of results across the questions, such a correction doesn't seem warranted here.

3.5.2 The Bonferroni Method and the Holm Method

Exactly when one should apply a correction for multiple comparisons is a difficult question, and there is much disagreement on this score. Nevertheless, sometimes it is essential that we apply such a correction, as I'll illustrate in the next section. Further, when in doubt, I would recommend that you go ahead and apply a correction, as this will make your tests more stringent, and hence render the results more convincing.

27 The probability that a single fair coin will come up heads on a single toss is 1/2. Let's write this P(H). The probability that it would come up heads ten times in a row is then P(H) multiplied by itself ten times, which is (1/2)^10 or 1/1024. Let's call this P(10H). The probability that this would occur at least once in 1000 attempts is one minus the probability that it doesn't occur in any of the 1000 tests; the latter probability is (1–P(10H))^1000 or (1023/1024)^1000, which is roughly 0.3764. As such, 1–(1–P(10H))^1000 is roughly 0.6236.

Many different types of corrections can be applied. Perhaps the most common is the Bonferroni method. The main positive of this method is that it is quite easy to use: we simply multiply the p-values of a sequence of tests by the number of tests performed (or, equivalently, keep the p-values the same but adjust the significance level by dividing it by the number of tests). Let's say that we decided that a correction for multiple comparisons is appropriate for the one-sample t-tests performed for Study 16. We performed four tests, getting p-values of 5.9e−13 (i.e., 0.00000000000059), 0.0019, 7.5e−8, and 0.0023, respectively.
To apply the Bonferroni correction, we simply multiply each of these p-values by four (the number of tests performed), giving corrected values of 2.0e−12, 0.0074, 3.0e−7, and 0.0093. (Note that I've applied the correction to the full p-values given by R to minimize rounding errors.) As we can see, the results remain significant at the 0.05 level. The Bonferroni correction can also be applied using the p.adjust() function in R with the method argument set to "bonferroni", as demonstrated in the code file.

While the Bonferroni method has the benefit of being simple, I wouldn't personally recommend this correction. The reason is the same as we saw above with regard to using a Student's t-test for comparing the means from independent samples: there is another method that is always at least as powerful. In the case of the Bonferroni method, there is an extension—what is known as the Holm method or the Holm–Bonferroni method—that makes the same assumptions as the Bonferroni method and is always at least as powerful as it (Holm, 1979). And while the Holm method is somewhat more complicated, it is equally easy to apply in R: all you need to do is to switch the value for the method argument in the p.adjust() function to "holm". As shown in the code file, applying this method to the sequence of tests for Study 16 gives lower p-values than the Bonferroni method for three of the four questions: 2.0e−12, 0.0037, 2.3e−7, and 0.0037.

A second complication is that while I compared Questions 1 and 2 and Questions 3 and 4 separately above to illustrate the use of paired samples t-tests, this does not tell us the overall impact across the two sets of questions as the modality is varied. And if we also wanted to test the impact of the phrasing, we would then need to run two further t-tests using this method, now comparing Questions 1 and 3 and Questions 2 and 4.
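Both corrections are also simple enough to implement by hand. The following Python sketch reproduces the logic of p.adjust() for the four p-values quoted above (since these are the rounded values from the text, the last decimals differ slightly from the results R gives on the unrounded p-values; the function names are mine):

```python
def bonferroni(pvals):
    """Bonferroni: multiply every p-value by the number of tests,
    capping at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down: multiply the i-th smallest p-value by
    (m - i), then enforce monotonicity so adjusted p-values never
    decrease as the raw p-values increase."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# Rounded p-values for the four one-sample t-tests from Study 16
pvals = [5.9e-13, 0.0019, 7.5e-8, 0.0023]
print(bonferroni(pvals))
print(holm(pvals))
```

Note that the Holm adjustment is never larger than the Bonferroni one for any test, which is why it is always at least as powerful; here all four adjusted p-values remain below 0.05 under both methods.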
A better solution is to recognize that we are crossing two variables (or "factors") in our study, each taking on one of two values (or "levels"): as noted above, we're varying the phrasing (feel, hurt) and the modality (possible, actual), with one question corresponding with each combination of values for these two variables. Recognizing this, we could test the impact of each variable across the questions, as well as their interaction, using an ANOVA (specifically, a two-way within-participants ANOVA). Tests like this are discussed further in Chap. 4, so I won't try to explain them here, but two ways of performing the ANOVA in R are shown in the code document.28 In line with our paired samples t-tests, we find that there is a significant main effect for modality (p = 0.0012). Further, we do not find a significant main effect for phrasing (p = 0.27) or for the interaction of these two factors (p = 0.17).

28 For this, we need to restructure the data for Study 16: we need to add columns for each of our two variables, and we need to convert it to "long form"—adding a column with a participant id

3.5.3 Illustrating the Holm Method

To conclude, I want to briefly discuss a final study where a correction for multiple comparisons is clearly called for. While this study does not involve judgments about unfelt pains, it does provide evidence for the underlying hypothesis about the commonsense conception of pain that motivated my predictions in the studies we discussed above. Sytsma and Snater (2023a) conducted a global study in which participants answered a large number of test questions.
Drawing on studies by Ozdemir (2022) as well as Fischer and Sytsma (2021), we gave participants a vignette describing future scientists creating either a physical duplicate of a person or an android duplicate of a person, and then asked them whether they agreed or disagreed with each of 25 statements ascribing a mental capacity to the resulting duplicate. Participants responded using a 7-point scale anchored at 1 with "Disagree strongly" and at 7 with "Agree strongly." As part of the analysis detailed in the supplemental materials (Sytsma and Snater, 2023b), we compared responses between the two conditions using a series of independent samples t-tests, as illustrated in 2_ss_2023_study_1.txt using the data provided in 2_ss_2023_study_1.csv. Without applying a correction for multiple comparisons, we found ten significant differences at the 0.05 level. But unlike in Study 16 by Reuter and Sytsma (2020), we were not predicting a specific pattern of results across these tests and did not have specific predictions for all of these comparisons. Further, many of the significant results had negligible effect sizes. Given the large number of tests, it is quite likely that some of these significant results owed to chance—more than we should accept at the 0.05 significance level selected—and hence it was important for us to apply a correction for multiple comparisons. We did this using the Holm method introduced above. Applying the correction, we found that only three of the results remained significant. This included the question we asked about feeling pain, which we had a specific prediction for. Recall that I hypothesized that the reason lay people in Sytsma and Machery's (2010) first study tended to ascribe seeing red to a simple robot, but not feeling pain, is that people tend to hold a naïve view of both types of qualities.
With regard to pains, I speculated that people tend to conceive of pains as being qualities of injured body parts, but that the entity needs the right sort of body parts to instantiate pains: they need soft and fleshy body parts, not hard and metallic ones. And, indeed, this hypothesis was directly tested by Sytsma (2012), where I found that giving the simple robot from Sytsma and Machery (2010) grasping arms made of bioengineered materials, instead of the original hard and metallic ones, notably increased ascriptions of feeling pain. Based on this, Sytsma and Snater (2023a) predicted that participants would be significantly more likely to judge that the physical duplicate felt pain than that the android duplicate felt pain. And, indeed, this prediction was borne out, with participants being significantly more likely to agree with the statement "the duplicate would feel pain when she is injured" in the physical duplicate condition compared to the android condition, even after correcting for multiple comparisons.

28 (continued) . . . and repeating the data set so that each row shows the response for just one question. While this conversion can be done in R, to make things easier I've instead created a second spreadsheet with the converted data: 2_rs_2020_study_16_anova.csv.

Box 3.1: Partially Paired Samples t-Tests

In the main text, we discuss three main types of t-tests: one-sample t-tests, independent samples t-tests, and paired samples t-tests. These are distinguished by the number of conditions we are comparing (one for one-sample t-tests, two for independent samples t-tests and paired samples t-tests) and whether the same participants make up the samples (no for independent samples t-tests, yes for paired samples t-tests). It is possible, however, that the answer could be yes and no: some of our participants could be the same across the two samples while others could be different. This is an unusual situation, and not one you're likely to run across.
Indeed, while I include the test here for completeness, I've only run a partially paired samples t-test one time in my own work. Specifically, a partially paired samples t-test was relevant to the analysis of Study 2 by Sytsma et al. (2022). One goal of this study was to test whether participants' judgments about a statement would differ if it was presented alone versus being presented alongside three other statements. In order to test this, we ran the study with both a within-participants condition (each participant giving judgments about all four statements) and between-participants conditions (each participant giving a judgment about just one of the four statements). Comparing judgments for each statement between conditions could be done using independent samples t-tests, as we saw in the main text. Doing so, we found no significant differences for any of the four statements. A second prediction about this study concerned a comparison between two different statements. This could be done separately for the within-participants condition (using a paired samples t-test) and for the between-participants conditions (using an independent samples t-test). Given that the results were not significantly different between the types of conditions, though, there is reason to combine these conditions: it would allow us to conduct just one test that would have greater statistical power. This couldn't be done using the standard t-tests we've reviewed, however, since combining the data would mean that some participants were paired (having given judgments about both statements) and others were not (having given judgments about just one statement). Fortunately, this is exactly the (rare) type of situation in which a partially paired samples t-test is appropriate.
Unfortunately, this type of test is uncommon enough that I was unable to find a package in R that implements it. Instead, I adapted the code provided by Henriksen (2018) for the t.test.partial() function.

References

Albert, J. (2009). Bayesian computation with R (2nd ed.). Springer.
Arico, A. (2010). Folk psychology, consciousness, and context effects. Review of Philosophy and Psychology, 1(3), 371–393.
Aydede, M. (2005a). Preface. In M. Aydede (Ed.), Pain. New papers on its nature and the methodology of its study (pp. IX–XVII). The MIT Press.
Aydede, M. (2005b). Introduction. A critical and quasi-historical essay on theories of pain. In M. Aydede (Ed.), Pain. New papers on its nature and the methodology of its study (pp. 1–58). The MIT Press.
Aydede, M. (2009). Pain. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy (Spring 2013 Edition). http://plato.stanford.edu/archives/spr2013/entries/pain/.
Benjamin, D., Berger, J., Johannesson, M., et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10.
Block, N. (1995). On a confusion about a function of consciousness. Behavioral and Brain Sciences, 18(2), 227–247.
Bluhm, R. (2016). Corpus analysis in philosophy. In M. Hinton (Ed.), Evidence, experiment and argument in linguistics and the philosophy of language (pp. 91–109). Peter Lang.
Borg, E., Harrison, R., Stazicker, J., & Salomons, T. (2020). Is the folk concept of pain polyeidic? Mind & Language, 35(1), 29–47.
Buckwalter, W., & Phelan, M. (2013). Function and feeling machines. A defense of the philosophical conception of subjective experience. Philosophical Studies, 166(2), 349–361.
Caton, J. (2020). Using linguistic corpora as a philosophical tool. Metaphilosophy, 51(1), 51–70.
Chalmers, D. (1995). Facing up to the problem of consciousness. Journal of Consciousness Studies, 2, 200–219.
Chalmers, D. (2018). The meta-problem of consciousness. Journal of Consciousness Studies, 25(9–10), 6–61.
Cohen, J. (1988).
Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
Coninx, S., Willemsen, P., & Reuter, K. (2023). Pain linguistics. A case for pluralism. The Philosophical Quarterly, 74, 145–168.
Cova, F., Strickland, B., Abatista, A., Allard, A., Andow, J., Attie, M., Beebe, J., Berniūnas, R., Boudesseul, J., Colombo, M., Cushman, F., Diaz, R., van Dongen, N., Dranseika, V., Earp, B. D., Torres, A. G., Hannikainen, I., Hernández-Conde, J. V., Hu, W., Jaquet, F., Khalifa, K., Kim, H., Kneer, M., Knobe, J., Kurthy, M., Lantian, A., Liao, S.-y., Machery, E., Moerenhout, T., Mott, C., Phelan, M., Phillips, J., Rambharose, N., Reuter, K., Romero, F., Sousa, P., Sprenger, J., Thalabard, E., Tobia, K., Viciana, H., Wilkenfeld, D., & Zhou, X. (2019). Estimating the reproducibility of experimental philosophy. Review of Philosophy and Psychology, 12, 9–44.
Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch’s t-test instead of Student’s t-test. International Review of Social Psychology, 30(1), 92–101.
Dennett, D. (1991). Consciousness explained. Little, Brown and Company.
Díaz, R. (2021). Do people think consciousness poses a hard problem? Empirical evidence on the meta-problem of consciousness. Journal of Consciousness Studies, 28(3–4), 55–75.
Feltz, A., & Cokely, E. (2011). Individual differences in theory-of-mind judgments. Order effects and side effects. Philosophical Psychology, 24(3), 343–355.
Fiala, B., Arico, A., & Nichols, S. (2012). You, Robot. In E. Machery & E. O’Neill (Eds.), Current controversies in experimental philosophy (pp. 31–47). Routledge.
Fischer, E., & Sytsma, J. (2021). Zombie intuitions. Cognition, 215, 104807.
Fischer, E., & Sytsma, J. (2023). Projects and methods of experimental philosophy. In A. M. Bauer & S. Kornmesser (Eds.), The compact compendium of experimental philosophy (pp. 39–69). de Gruyter.
Fischer, E., Engelhardt, P., & Sytsma, J. (2021).
Inappropriate stereotypical inferences? An adversarial collaboration in experimental ordinary language philosophy. Synthese, 198(11), 10127–10168.
Goldberg, B., Reuter, K., & Sytsma, J. (Forthcoming). The history of the concept of pain. How the experts came to be out of touch with the folk. In K. Hens & A. De Block (Eds.), Advances in experimental philosophy of medicine (pp. 13–32). Bloomsbury.
Gonnerman, C. (2018). Consciousness and experimental philosophy. In R. Gennaro (Ed.), The Routledge handbook of consciousness (pp. 463–477). Routledge.
Goodwin, K., & Goodwin, C. J. (2016). Research in psychology. Methods and design (8th ed.). Wiley.
Gregory, D., Hendrickx, M., & Turner, C. (2022). Who knows what Mary knew? An experimental study. Philosophical Psychology, 35(4), 522–545.
Haug, M. (2018). Fast, cheap, and unethical? The interplay of morality and methodology in crowdsourced survey research. Review of Philosophy and Psychology, 9(2), 363–379.
Henriksen, A. A. (2018). T-test for partially paired data. https://aksela.wordpress.com/2018/09/08/t-test-for-partially-paired-data/.
Hill, C. (2009). Consciousness. Cambridge University Press.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.
Huebner, B. (2010). Commonsense concepts of phenomenal consciousness. Does anyone care about functional zombies? Phenomenology and the Cognitive Sciences, 9(1), 133–155.
Kim, H.-e., Poth, N., Reuter, K., & Sytsma, J. (2016). Where is your pain? A cross-cultural comparison of the concept of pain in Americans and South Koreans. Studia Philosophica Estonica, 9(1), 136–169.
Knobe, J., & Prinz, J. (2008). Intuitions about consciousness. Experimental studies. Phenomenology and the Cognitive Sciences, 7, 67–83.
Kripke, S. (1980). Naming and necessity. Harvard University Press.
Liu, M. (2020). The intuitive invalidity of the pain-in-mouth argument. Analysis, 80(3), 463–474.
Liu, M. (2023).
The polysemy view of pain. Mind & Language, 38(1), 198–217.
Livengood, J., & Rose, D. (2016). Experimental philosophy and causal attribution. In J. Sytsma & W. Buckwalter (Eds.), A companion to experimental philosophy (pp. 434–449). Wiley Blackwell.
Livengood, J., & Sytsma, J. (2020). Actual causation and compositionality. Philosophy of Science, 87(1), 43–69.
Livengood, J., Sytsma, J., Feltz, A., Scheines, R., & Machery, E. (2010). Philosophical temperament. Philosophical Psychology, 23(3), 313–330.
Machery, E., & Sytsma, J. (2011). Robot pains and corporate feelings. The Philosophers’ Magazine, 52, 78–82.
Machery, E., Sytsma, J., & Deutsch, M. (2015). Speaker’s reference and cross-cultural semantics. In A. Bianchi (Ed.), On reference (pp. 62–76). Oxford University Press.
McCarthy, M., & O’Keeffe, A. (2010). The Routledge handbook of corpus linguistics. Routledge.
McEnery, T., & Wilson, A. (2001). Corpus linguistics. An introduction. Edinburgh University Press.
Murray, D., Sytsma, J., & Livengood, J. (2013). God knows. But does God believe? Philosophical Studies, 166, 83–107.
Nagel, T. (1974). What is it like to be a bat? The Philosophical Review, 83, 435–450.
Ozdemir, E. (2022). Empirical evidence against phenomenal theses (Ph.D. dissertation). Victoria University of Wellington.
Peressini, A. (2013). Blurring two conceptions of subjective experience. Folk versus philosophical phenomenality. Philosophical Psychology, 27(6), 862–889.
Phelan, M. (2023). Experimental philosophy of mind. In A. M. Bauer & S. Kornmesser (Eds.), The compact compendium of experimental philosophy (pp. 263–288). de Gruyter.
Phelan, M., Arico, A., & Nichols, S. (2013). Thinking things and feeling things. On an alleged discontinuity in the folk metaphysics of mind. Phenomenology and the Cognitive Sciences, 12, 703–725.
Reid, T. (1785). Essays on the intellectual powers of man.
Pennsylvania State University.
Reuter, K. (2011). Distinguishing the appearance from the reality of pain. Journal of Consciousness Studies, 18(9–10), 94–109.
Reuter, K. (2017). The developmental challenge to the paradox of pain. Erkenntnis, 82(2), 265–283.
Reuter, K., & Sytsma, J. (2020). Unfelt pain. Synthese, 197, 1777–1801.
Reuter, K., Phillips, D., & Sytsma, J. (2014). Hallucinating pain. In J. Sytsma (Ed.), Advances in experimental philosophy of mind (pp. 75–100). Bloomsbury.
Reuter, K., Sienhold, M., & Sytsma, J. (2019). Putting pain in its proper place. Analysis, 79(1), 72–82.
Salomons, T., Harrison, R., Hansen, N., Stazicker, J., Sorensen, A., Thomas, P., & Borg, E. (2021). Is pain “all in your mind”? Examining the general public’s views of pain. Review of Philosophy and Psychology, 13, 1–16.
Schuman, H., & Presser, S. (1996). Questions and answers in attitude surveys: Experiments on question form, wording, and context. Sage.
Shadish, W. R., Cook, T. D., & Campbell, D. (2001). Experimental and quasi-experimental designs for generalized causal inference (2nd ed.). Cengage Learning.
Strickland, B., & Suben, A. (2012). Experimenter philosophy: The problem of experimenter bias in experimental philosophy. Review of Philosophy and Psychology, 3, 457–467.
Sudman, S., Bradburn, N., & Schwarz, N. (1996). Thinking about answers: The application of cognitive processes to survey methodology. Jossey-Bass.
Sytsma, J. (2009). Phenomenological obviousness and the new science of consciousness. Philosophy of Science, 76(5), 958–969.
Sytsma, J. (2010a). Dennett’s theory of the folk theory of consciousness. Journal of Consciousness Studies, 17(3–4), 107–130.
Sytsma, J. (2010b). Folk psychology and phenomenal consciousness. Philosophy Compass, 5(8), 700–711.
Sytsma, J. (2010c). Phenomenal consciousness as scientific phenomenon? A critical investigation of the new science of consciousness (Ph.D. dissertation). University of Pittsburgh.
Sytsma, J. (2010d).
The proper province of philosophy: Conceptual analysis and empirical investigation. Review of Philosophy and Psychology, 1(3), 427–445.
Sytsma, J. (2012). Revisiting the valence account. Philosophical Topics, 40(2), 179–198.
Sytsma, J. (2013). The robots of the dawn of experimental philosophy of mind. In E. Machery & E. O’Neill (Eds.), Current controversies in experimental philosophy (pp. 48–64). Routledge.
Sytsma, J. (2014). Advances in experimental philosophy of mind. Bloomsbury.
Sytsma, J. (2016). Attributions of consciousness. In J. Sytsma & W. Buckwalter (Eds.), A companion to experimental philosophy (pp. 257–278). Blackwell.
Sytsma, J. (2021). Causation, responsibility, and typicality. Review of Philosophy and Psychology, 12, 699–712.
Sytsma, J. (2022). Crossed wires: Blaming artifacts for bad outcomes. The Journal of Philosophy, 119(9), 489–516.
Sytsma, J. (n.d.). Experiencers and the ambiguity objection. http://philsci-archive.pitt.edu/15481/.
Sytsma, J., & Buckwalter, W. (2016). A companion to experimental philosophy. Blackwell.
Sytsma, J., & Fischer, E. (Forthcoming). “Experience”, ordinary and philosophical: A corpus study. Synthese.
Sytsma, J., & Livengood, J. (2015). The theory and practice of experimental philosophy. Broadview.
Sytsma, J., & Machery, E. (2009). How to study folk intuitions about phenomenal consciousness. Philosophical Psychology, 22, 21–35.
Sytsma, J., & Machery, E. (2010). Two conceptions of subjective experience. Philosophical Studies, 151(2), 299–327.
Sytsma, J., & Machery, E. (2012). On the relevance of folk intuitions: A reply to Talbot. Consciousness and Cognition, 21(2), 654–660.
Sytsma, J., & Ozdemir, E. (2019). No problem: Evidence that the concept of phenomenal consciousness is not widespread. Journal of Consciousness Studies, 26(9–10), 241–256.
Sytsma, J., & Reuter, K. (2017). Experimental philosophy of pain. Journal of Indian Council of Philosophical Research, 34(3), 611–628.
Sytsma, J., & Snater, M. (2023a).
Consciousness, phenomenal consciousness, and free will. In P. Henne & S. Murray (Eds.), Advances in experimental philosophy of action (pp. 13–32). Bloomsbury.
Sytsma, J., & Snater, M. (2023b). Consciousness, phenomenal consciousness, and free will: Supplemental materials. http://philsci-archive.pitt.edu/19556/.
Sytsma, J., Livengood, J., & Rose, D. (2012). Two types of typicality: Rethinking the role of statistical typicality in ordinary causal attributions. Studies in History and Philosophy of Biological and Biomedical Sciences, 43, 814–820.
Sytsma, J., Livengood, J., Sato, R., & Oguchi, M. (2015). Reference in the land of the rising sun: A cross-cultural study on the reference of proper names. Review of Philosophy and Psychology, 6(2), 213–230.
Sytsma, J., Bluhm, R., Willemsen, P., & Reuter, K. (2019). Causal attributions and corpus analysis. In E. Fischer & M. Curtis (Eds.), Methodological advances in experimental philosophy (pp. 209–238). Bloomsbury.
Sytsma, J., Bishop, R., & Schwenkler, J. (2022). Has the side-effect effect been cancelled? No, not yet. Synthese, 200, 395.
Talbot, B. (2012). The irrelevance of folk intuitions to the “hard problem” of consciousness. Consciousness and Cognition, 21(2), 644–650.
Tye, M. (2021). Qualia. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy (Fall 2021 Edition). https://plato.stanford.edu/archives/fall2021/entries/qualia/.
Ulatowski, J., Weijers, D., & Sytsma, J. (2020). Corpus methods in philosophy. The Brains Blog. https://philosophyofbrains.com/2020/12/15/cognitive-science-ofphilosophy-symposiumcorpus-analysis.aspx.

Chapter 4 Quantitative Vignette Studies: Correlations, Regressions, and Structural Equation Modeling—An Application to Experimental Philosophy of Free Will

Florian Cova and Aurélien Allard

Abstract Do people have the intuition that free will and moral responsibility are compatible with determinism?
To find out, experimental philosophers have presented participants with vignettes describing agents in deterministic universes and probed their intuitions about the free will and moral responsibility of these agents. However, these debates soon became embroiled in the difficulty of assessing whether participants correctly understood the vignettes and the concepts of determinism and free will, and the extent to which different comprehension errors plagued participants’ answers. In this chapter, we clarify those debates by using the results of a brand-new study (N = 772) on participants’ intuitions about free will. We introduce readers to correlations, regressions, multiple regressions, mediation analyses, and multiple mediation analyses through the use of structural equation modeling to show how different techniques may give different interpretations of folk intuitions about free will. We explain how to run these different analyses in R and provide detailed R code.

Keywords Correlations · Experimental philosophy · Free will · Moral responsibility · R · Qualtrics · Linear regression · Structural equation modelling

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-58049-9_4.

F. Cova, Département de Philosophie, Université de Genève, Geneva, Switzerland, e-mail: [email protected]
A. Allard, Department of Philosophy, École Normale Supérieure Paris, Paris, France

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. S. Kornmesser et al., Experimental Philosophy for Beginners, Springer Graduate Texts in Philosophy, https://doi.org/10.1007/978-3-031-58049-9_4

4.1 Introduction

In this chapter, our goal is to introduce the reader to the use of correlations, regression analyses, and structural equation modeling to test scientific hypotheses.
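To give a first flavor of what such analyses look like, the snippet below runs a Pearson correlation and a simple linear regression in base R. The variable names and the simulated 7-point ratings are invented for illustration; they are not from the study analyzed later in this chapter.

```r
# Simulated illustration: 7-point judgments of free will and moral
# responsibility for 100 hypothetical participants (not real study data).
set.seed(42)
free_will  <- sample(1:7, 100, replace = TRUE)
moral_resp <- pmin(7, pmax(1, free_will + sample(-1:1, 100, replace = TRUE)))

# Pearson correlation between the two sets of judgments
ct <- cor.test(free_will, moral_resp)
ct$estimate        # correlation coefficient r

# Simple linear regression predicting responsibility from free will
fit <- lm(moral_resp ~ free_will)
coef(fit)          # intercept and slope
```

Both `cor.test()` and `lm()` are part of base R, so this runs in any R interface without installing additional packages.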
We will approach these notions through the example of experimental philosophy of free will, the part of experimental philosophy that is concerned with laypeople’s intuitions about free will and moral responsibility. In recent years, experimental philosophers have sought to determine whether people are “natural compatibilists” or “natural incompatibilists”—that is, whether they spontaneously take free will and moral responsibility to be compatible or incompatible with determinism. To this end, they have presented participants with vignettes describing deterministic universes and asked them whether agents living in these universes can act of their own free will and be morally responsible for their actions. However, this approach has been criticized on the grounds that it is not clear how participants interpret such vignettes and that their interpretation of the experimental materials might shape their intuitions about free will. To understand what exactly drives people’s answers to such vignettes, later studies have made extensive use of correlational methods, including mediation analysis and structural equation modeling. Replicating classic studies in experimental philosophy of free will will provide us with a perfect opportunity to describe these methods and how they can be used in experimental philosophy.

Box 4.1 All analyses presented in this chapter will be performed using the programming language R. The basic R interface (or “R-Gui”) can be downloaded quickly and for free at https://cran.r-project.org/ and can be installed on Windows, macOS, and Linux. Fancier and more user-friendly interfaces, such as RStudio, can also be used, but the code we will present is designed to work on any R interface.

4.2 Philosophical Background: Experimental Approaches to the Free Will Problem

4.2.1 What Is Free Will?
In contemporary philosophical literature, free will is often defined as the kind of control over one’s behavior that one must have in order to be morally responsible for it.1 That some degree of control is necessary for moral responsibility is intuitive: after all, we cannot be morally responsible for stabbing someone if someone else put the knife in our hand and then moved our arm against our will. In the same way, we let fiction protagonists off the hook when we learn that their actions were the effect of mind control or hypnosis. But what exactly is the type of control necessary for moral responsibility, where does it begin, and where does it end? Let’s imagine that I stab someone, not because someone else made me do it, but in a fit of rage: surely, I was not in full control, but is that enough to consider me innocent? Such cases are much more controversial, and they highlight the practical necessity of defining what kind of control characterizes free will.

1 Not everyone agrees with this definition, and some people choose to define “free will” without reference to moral responsibility. For example, so-called semicompatibilists define “free will” as the “ability to do otherwise” but argue that this ability is not required for moral responsibility and that we can be morally responsible for our actions, even without free will (see, for example, Fischer, 2002).

In philosophy, debates about the nature of free will are often developed as part of debates about the existence of free will: do we have free will? Can we really be morally responsible for (some of) our actions? It stands to reason that answering these questions about the existence of free will first requires having a good grasp of what free will is. However, people sometimes rush to answer the question about existence without answering the question about nature first. For example, in a widely popular book aimed at general audiences, neuroscientist Sam Harris argued that science was incompatible with the existence of free will and, consequently, that free will is an illusion. However, he defined free will in the following way:

The popular conception of free will seems to rest on two assumptions: (1) that each of us could have behaved differently than we did in the past, and (2) that we are the conscious source of most of our thoughts and actions in the present. As we are about to see, however, both of these assumptions are false. (Harris, 2012, 6)

But is this definition adequate? This is far from obvious. For example, let’s take the idea that free will (and thus moral responsibility) requires the ability to behave differently than we did. This might seem intuitive, and some philosophers do define “free will” in this way (see, for example, van Inwagen, 1983, 2008), but many other philosophers reject the idea that moral responsibility, and hence free will, requires the ability to do otherwise. The reason why so many philosophers have come to reject this idea can be found in a type of thought experiment called “Frankfurt-style cases.” In a seminal paper (Frankfurt, 1969), philosopher Harry Frankfurt asked us to imagine cases in which an agent (i) performs an action, (ii) could not have done otherwise than perform this action, but (iii) the reason why the agent performs the action has nothing to do with the reason why the agent could not have done otherwise. For example, on this template, Miller and Feltz (2011) designed the following case:

Mr. Green wants Mr. Jones, the security guard, to steal Mrs. Green’s car at 12:00 am on October 7. However, Mr. Green doesn’t entirely trust Mr. Jones to do the job, so he has taken some extraordinary measures. Mr. Green has consulted neuroscientists who have implanted a device in Mr. Jones’s brain without Mr. Jones’s knowledge. This device has isolated the “decision-making” neurons in Mr. Jones’s brain and is programmed to send, at exactly 12:00 am, impulses that will certainly cause Mr. Jones to decide to steal the car just then. However, as it happens, at exactly 12:00 am, Mr. Jones decides on his own to steal the car and does it. Since Mr. Jones decides on his own to steal the car, the impulses from the device were ineffectual because the decision-making neurons were activated by the decision-making process of Mr. Jones himself. However, if Mr. Jones had not, just then, decided on his own to steal the car, the device would have activated his decision-making neurons, and Mr. Jones would have decided to steal the car anyway.

In this case, the presence of the device in Mr. Jones’ brain makes it impossible for Mr. Jones not to steal the car. However, the reason why Mr. Jones actually steals the car has nothing to do with the device in his brain: he does it on his own. According to Frankfurt, in such a case, we should conclude that Mr. Jones is morally responsible for stealing the car, and thus did it of his own free will, even though he could not have done otherwise.2 But do people agree with Frankfurt’s intuitions? Miller and Feltz (2011) gave the case to a group of participants and asked them the following comprehension questions:

(a) Was it possible for Mr. Jones to avoid deciding to steal the car at 12:00 am on October 7?
(b) At 12:00 am on October 7, could Mr. Jones have done anything other than decide to steal the car just then?
Participants could answer “yes,” “no,” or “not sure.” To make sure that participants understood the vignette and interpreted it in the intended way (i.e., as excluding alternative possibilities), Miller and Feltz excluded those who did not answer “no” to at least one of the two questions. Participants were then asked to rate the following statements on a 7-point scale (from 1 = “Strongly disagree” to 7 = “Strongly agree”):

(i) Mr. Jones is morally responsible for deciding to steal the car.
(ii) Mr. Jones is blameworthy for deciding to steal the car.
(iii) Mr. Jones deserves to be blamed for deciding to steal the car.
(iv) Mr. Jones is at fault for deciding to steal the car.

They found that people tended to agree with these statements (means were, respectively, 5.59, 5.86, 5.18, and 5.82). This suggests that most participants did not see the presence of alternative possibilities as a requirement for moral responsibility. Using the same vignettes, Cova and Kitano (2014) found similar results when they asked participants about free will rather than moral responsibility. A cross-cultural study later replicated these findings across 21 countries (Hannikainen et al., 2019). As predicted by Frankfurt, and against Harris’ contention, it seems that most people tend to think that it is possible to act freely and be morally responsible for one’s action without having the possibility to act otherwise.

2 Frankfurt develops the following argument for this conclusion: in the absence of the device, we would undoubtedly conclude that Mr. Jones is morally responsible for stealing the car. However, the device is never activated and never plays any role: it would, thus, be strange to consider that its mere presence is enough to modify Mr. Jones’ moral responsibility. For more on this point, see Cova (2014).
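For readers following along in R, an agreement pattern like the one just reported can be tested with a one-sample t-test against the midpoint of the 7-point scale. The ratings below are simulated to sit near the reported mean for statement (i); they are not the actual Miller and Feltz (2011) data.

```r
# Hypothetical illustration: do ratings of "Mr. Jones is morally
# responsible" exceed the neutral midpoint (4) of the 7-point scale?
# The data are simulated, not the actual Miller and Feltz (2011) ratings.
set.seed(7)
ratings <- pmin(7, pmax(1, round(rnorm(80, mean = 5.6, sd = 1.2))))

# One-sided, one-sample t-test against the scale midpoint
res <- t.test(ratings, mu = 4, alternative = "greater")
mean(ratings)  # sample mean, well above the midpoint
res$p.value    # small p-value: agreement is reliably above the midpoint
```

Testing against the scale midpoint is one common way to cash out the informal claim that participants “tended to agree” with a statement.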
Thus, it is better to get a grasp of the nature of free will before debating its existence, as some seemingly obvious assumptions about what is required for free will might turn out to be flawed. But how are we to determine the nature of free will? Sam Harris advises us to follow the “popular conception” of free will as a guide. However, the expression “popular conception” is ambiguous between two kinds of psychological entities that might sometimes be at odds with one another:

• Explicit theories about free will, that is, the kind of theories and definitions people might offer when explicitly asked to define free will or explain what free will is.
• Intuitions about moral responsibility, that is, the kind of automatic, prereflexive attributions of moral responsibility that we might form about situations we face, for example, about vignettes such as the one we just presented.

As it turns out, our explicit theories are not always in line with our intuitions. For example, we might explicitly endorse an account of free will according to which free will requires the ability to do otherwise, but change our mind when we realize that our intuitions about Frankfurt-style cases (such as the one about Mr. Jones) conflict with this account. This is actually what a lot of philosophers did after Frankfurt introduced his cases. In general, philosophers consider that intuitions should be treated as (defeasible) evidence on the basis of which our explicit accounts of free will should be discussed. Thus, to determine the nature of free will, we should first get a better grasp of our intuitions about moral responsibility and the underlying principles that shape them.3

4.2.2 The Natural Compatibilists vs. Natural Incompatibilists Debate

Let’s thus assume that our intuitions about moral responsibility should be the primary source of evidence to determine the nature of free will. What conclusion should we reach on their basis?
In the past 15 years, experimental philosophers have investigated our intuitions about free will and moral responsibility. In most cases, these investigations have been carried out in the context of potential threats to free will and have tried to determine whether, according to people’s intuitions, free will is incompatible with certain apparent dangers. Such apparent dangers include the absence of alternative possibilities, which we discussed earlier, but also cases of direct manipulation (Sripada, 2012; Cova, 2023), neuroscientific prediction (Nahmias et al., 2014), and fatalism (Feltz and Millan, 2015; Andow and Cova, 2016). However, the majority of studies have investigated people’s intuitions about the compatibility (or incompatibility) of free will with determinism.

3 This does not mean that there is no point in investigating people’s explicit theories about free will, as Monroe and Malle (2010) did. First, this is still psychologically informative. Second, since it is reasonable to think that people’s explicit theories are not completely divorced from their intuitions, they may help us better understand and delineate people’s intuitions.

Determinism can be defined in the following way: a system S is deterministic when, for any time t and any posterior time t + x, the correct description of the state of S at t, together with the laws regulating this system, entails the description of the state of S at t + x. To put it more simply, a system is deterministic if it is ruled by laws that make it so that, given the state of the system at t, there is only one possible state of this system at t + x. In a deterministic system, what happens in the past makes what happens in the future necessary.
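The definition above can be made concrete with a few lines of R. The toy update rule below is invented for illustration: it plays the role of the “laws” of a miniature universe, and re-running the system from the same initial state always reproduces the same history.

```r
# Toy illustration of the definition above: a deterministic system is one
# whose state at t, plus its update law, fixes the state at t + x.
# The update rule here is arbitrary (invented for illustration).
law <- function(state) (3 * state + 1) %% 17

run_system <- function(initial_state, steps) {
  states <- numeric(steps + 1)
  states[1] <- initial_state
  for (t in 1:steps) states[t + 1] <- law(states[t])
  states
}

# Re-running from the exact same initial conditions yields the exact same
# history: given the past, there is only one possible future.
identical(run_system(5, 10), run_system(5, 10))  # TRUE
```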
A lot of the contemporary debate about the existence of free will has revolved around the compatibility of free will with determinism: supposing that we live in a deterministic universe, in which the laws of physics together with the past state of the universe leave only one possible future open, can we have free will and be morally responsible for our actions? Philosophers who answer “yes” to this question are typically called compatibilists, while those who answer “no” are typically called incompatibilists. Both compatibilists and incompatibilists typically appeal to our pre-theoretical intuitions to argue for their positions. For example, one traditional argument for incompatibilism (i.e., the claim that free will is incompatible with determinism) starts from the claim that the following principle is intuitive:

Principle of Alternate Possibilities (PAP): An agent A is morally responsible for an outcome O only if A could have done otherwise, that is, could have avoided O.

However, if we accept this principle, then we can apparently conclude that determinism is incompatible with moral responsibility, and hence with free will. Indeed, in a deterministic universe, given the past, every agent had to act the way they did, and thus could not have done otherwise. Defenders of compatibilism can in turn rely on the intuitions elicited by Frankfurt-style cases (such as the case of Mr. Jones): indeed, in such cases, we have the intuition that agents are morally responsible for their actions, even if they could not have done otherwise (at least in any relevant way). Thus, our intuitions about such cases should lead us to reject the PAP and conclude that the fact that determinism prevents us from having the ability to act otherwise does not make it a threat to free will and moral responsibility. So, do people have the intuition that agents living in deterministic universes cannot be free and morally responsible for their actions?
That is the question that has been at the heart of many experimental philosophy studies about free will. Partisans of natural incompatibilism claim that nonphilosophers have the intuition that free will cannot exist in a deterministic universe, while partisans of natural compatibilism claim that nonphilosophers do not perceive free will and determinism as clashing with each other (see Feltz et al., 2009, for this terminology). How are we to decide between these two possibilities? We cannot simply ask people whether they think that “free will” and “determinism” are compatible, as “determinism” is a technical term and most people seem to interpret it as meaning the opposite of free will (see Nahmias et al., 2005). A better solution is then not to use the word “determinism” but to present participants with descriptions of deterministic universes. Thus, in a first series of studies on folk intuitions about free will, Eddy Nahmias and his colleagues (2005, 2006) presented participants with vignettes such as the following one:

Supercomputer. Imagine that in the next century we discover all the laws of nature, and we build a supercomputer which can deduce from these laws of nature and from the current state of everything in the world exactly what will be happening in the world at any future time. It can look at everything about the way the world is and predict everything about how it will be with 100% accuracy. Suppose that such a supercomputer existed, and it looks at the state of the universe at a certain time on March 25, 2150 AD, 20 years before Jeremy Hall is born. The computer then deduces from this information and the laws of nature that Jeremy will definitely rob Fidelity Bank at 6:00 pm on January 26, 2195. As always, the supercomputer’s prediction is correct; Jeremy robs Fidelity Bank at 6:00 pm on January 26, 2195.
Imagine such a supercomputer actually did exist and actually could predict the future, including Jeremy’s robbing the bank (and assume Jeremy does not know about the prediction): Do you think that, when Jeremy robs the bank, he acts of his own free will?

In this case, 76% of participants answered that Jeremy robbed the bank of his own free will, and 83% answered that Jeremy was morally responsible for robbing the bank. Nahmias and colleagues also developed other vignettes. For example, they asked participants to imagine the following case:

Rollback. Imagine there is a universe that is re-created over and over again, starting from the exact same initial conditions and with all the same laws of nature. In this universe the same conditions and the same laws of nature produce the exact same outcomes, so that every single time the universe is re-created, everything must happen the exact same way. For instance, in this universe a person named Jill decides to steal a necklace at a particular time, and every time the universe is re-created, Jill decides to steal the necklace at that time.

In this case, 66% of participants answered that Jill stole the necklace of her own free will, while 77% answered that she was morally responsible. Thus, these first results seem to suggest that between two-thirds and three-quarters of participants have compatibilist intuitions: they attribute free will and moral responsibility to agents living in deterministic universes. So, we might be tempted to conclude that people are indeed natural compatibilists. However, later studies came to paint a more complex picture of people’s intuitions about moral responsibility. In 2007, Shaun Nichols and Joshua Knobe conducted a series of studies in which they sought to compare participants’ intuitions about abstract and concrete cases. They first presented their participants with a description of two universes: a deterministic universe (Universe A) and an indeterministic universe (Universe B).
144 F. Cova and A. Allard

Their description went like this:

Imagine a universe (Universe A) in which everything that happens is completely caused by whatever happened before it. This is true from the very beginning of the universe, so what happened in the beginning of the universe caused what happened next, and so on right up until the present. For example, one day John decided to have French Fries at lunch. Like everything else, this decision was completely caused by what happened before it. So, if everything in this universe was exactly the same up until John made his decision, then it had to happen that John would decide to have French Fries.

Now imagine a universe (Universe B) in which almost everything that happens is completely caused by whatever happened before it. The one exception is human decision making. For example, one day Mary decided to have French Fries at lunch. Since a person’s decision in this universe is not completely caused by what happened before it, even if everything in the universe was exactly the same up until Mary made her decision, it did not have to happen that Mary would decide to have French Fries. She could have decided to have something different.

The key difference, then, is that in Universe A every decision is completely caused by what happened before the decision: given the past, each decision has to happen the way that it does. By contrast, in Universe B, decisions are not completely caused by the past, and each human decision does not have to happen the way that it does.

Then, participants were divided into two groups. Half of participants were presented with the concrete case:

In Universe A, a man named Bill has become attracted to his secretary, and he decides that the only way to be with her is to kill his wife and 3 children. He knows that it is impossible to escape from his house in the event of a fire. Before he leaves on a business trip, he sets up a device in his basement that burns down the house and kills his family.
Is Bill fully morally responsible for killing his wife and children? (Yes/No)

In this case, 72% answered that Bill was fully morally responsible for killing his wife and children, despite living in a deterministic universe. This is in line with the results of Nahmias and colleagues we presented earlier. However, the other half of participants was presented with the abstract case:

In Universe A, is it possible for a person to be fully morally responsible for their actions? (Yes/No)

In this case, only 14% of participants answered that it is possible for an agent to be fully morally responsible for their action in Universe A. This is dramatically different from the 72% who gave a compatibilist answer to the concrete case. Thus, it seems that whether participants give a compatibilist or incompatibilist answer depends on how the question is asked (in an abstract or concrete way). How is that possible? Nichols and Knobe have put forward their own account: the Performance Error Model, according to which people’s conception of free will is incompatibilist (as shown by their answers to the abstract case) but can be biased by emotional responses (which would explain their answers to concrete cases, including the ones used by Nahmias and colleagues). To test their hypothesis, according to which affect biases participants toward compatibilist answers, Nichols and Knobe conducted a second series of experiments, in which they contrasted low-affect and high-affect cases. Participants were presented with the same description of Universes A and B as before, then presented either with the low-affect case:

As he has done many times in the past, Mark arranges to cheat on his taxes. Is it possible that Mark is fully morally responsible for cheating on his taxes?

Or with the high-affect case:

As he has done many times in the past, Bill stalks and rapes a stranger. Is it possible that Bill is fully morally responsible for raping the stranger?

For each case, half of participants were told that the agent lived in Universe A, while the other half was told that the agent lived in Universe B. The percentage of participants giving a compatibilist (“Yes”) answer is presented in Table 4.1.

Table 4.1 Percentage of compatibilist answers (= agent fully morally responsible) for each condition (low affect vs. high affect) and in each universe (A = deterministic, B = indeterministic) in Nichols and Knobe (2007)

              Universe A   Universe B
Low affect        23%          89%
High affect       64%          95%

As one can see, the results are in line with Nichols and Knobe’s predictions: when presented with a deterministic universe (Universe A), participants tended to give mostly incompatibilist answers in the low-affect case, but mostly compatibilist answers in the high-affect case, as if more intense affective reactions drove participants to give more compatibilist answers. However, later research has failed to replicate this effect. Indeed, while the difference between the abstract and concrete conditions is robust and has been successfully replicated (Cova et al., 2021), this is not the case for the difference between the low-affect and high-affect cases. A later meta-analysis (Feltz and Cova, 2014) found that, across all published and unpublished studies, the difference between low-affect and high-affect cases was very small and could not explain the huge difference between abstract and concrete cases. Moreover, there are reasons to doubt Nichols and Knobe’s Performance Error Model. The first is that it cannot really account for the huge rate of compatibilist answers in Nahmias and colleagues’ experiments. True, Nahmias and colleagues only used concrete cases, but those cannot really be considered high-affect cases: robbing a bank (in the Supercomputer case) or stealing a necklace (in the Rollback case) is nowhere near raping someone.
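As an aside, differences in proportions like those reported in Table 4.1 can be checked with a simple two-proportion z-test. Here is a minimal sketch; the cell sizes (40 participants per condition) are hypothetical, since the actual counts are not reported here:

```python
from math import sqrt, erf

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided z-test for the difference between two independent proportions."""
    x1, x2 = p1 * n1, p2 * n2          # implied "Yes" counts
    p_pool = (x1 + x2) / (n1 + n2)     # pooled proportion under H0: p1 == p2
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return z, p_value

# Universe A: low-affect (23%) vs. high-affect (64%) compatibilist answers,
# assuming (hypothetically) 40 participants per cell:
z, p = two_proportion_z(0.23, 40, 0.64, 40)
print(round(z, 2), p < 0.05)
```

With these hypothetical cell sizes, the 23% vs. 64% gap in Universe A would be statistically significant, while the 89% vs. 95% gap in Universe B would not.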
Additionally, Nahmias and colleagues ran versions of the Supercomputer case in which the agent performed actions that were morally good (saving a child) or neutral (going jogging): in both cases, most participants answered that the agent acted of his own free will, which is at odds with the idea that compatibilist answers are driven by affective biases. Finally, Cova et al. (2012) gave Nahmias and colleagues’ Supercomputer case and Nichols and Knobe’s high-affect case to patients suffering from a behavioral variant of frontotemporal dementia, a condition associated with lower emotional responses. They found that, despite their emotional impairment, these patients still gave mostly compatibilist answers, which suggests that affect is not the explanation for the difference between abstract and concrete cases.

4.2.3 Bypassing: An Error Theory for Incompatibilist Intuitions

How, then, are we to explain the difference between the abstract and concrete cases? For Murray and Nahmias, the explanation lies in how participants interpret determinism in the context of the vignettes they are presented with. Their key idea is that, sometimes, people take the vignettes (and their description of determinism) as implying bypassing (Nahmias and Murray, 2011; Murray and Nahmias, 2014). What is “bypassing”? It is the idea that one’s actions are caused by forces that bypass one’s self: thus, we are not the true authors of our actions. Bypassing can take the form of epiphenomenalism, the claim that our conscious mental states (such as our beliefs, desires, and deliberations) are causally impotent and do not play a role in the generation of our actions. Rather, our thoughts are a causally irrelevant byproduct of physical processes occurring inside our brains. But bypassing can also take the form of fatalism, the claim that we had to act the way we did, no matter how hard we tried not to act this way.
In both cases, bypassing entails that agents do not really play an active part in their actions, but that they are caused to act by forces beyond their control. All philosophers (compatibilists and incompatibilists alike) would agree that, if bypassing were true, then we would not have free will and would not be responsible for our actions. However, it is important to understand that determinism, properly understood, does not entail bypassing. First, against the epiphenomenalism interpretation, determinism is compatible with the fact that our mental states play a causal role in the generation of our actions. In a deterministic universe, whatever happens is caused by what happens before in such a way that no other outcome was possible. This is compatible with my actions being caused (in a deterministic way) by my beliefs and desires, with these beliefs and desires being themselves caused by my past history. Second, against the fatalism interpretation, determinism does not tell us that what happened had to happen no matter what. Rather, determinism is the claim that, given the past, things had to happen this way. This means that, in a deterministic universe, things could have turned out very differently if the past had been slightly different: had I left my home 5 minutes later on this particular day, I would not have met on the bus the person I ended up marrying, and my life would have turned out very differently. This is very different from fatalism, the claim that I was fated to marry this person, and would have ended up marrying her, no matter what, even if I tried to avoid all relationships. This also means that, in a deterministic world, my decisions play an important causal role: had I made a different decision, history would have turned out differently. Murray and Nahmias’ hypothesis is the following: people (rightly) consider bypassing to be incompatible with free will and moral responsibility.
But, in certain circumstances, they interpret the vignettes they are presented with as describing bypassing rather than mere determinism. This is particularly the case in Nichols and Knobe’s abstract description of Universe A, in which participants are told that things have to happen—an expression that lends itself to a bypassing reading. Such misunderstandings are, however, less probable in the case of concrete vignettes, such as the case of Bill, as it is made clear that agents act the way they do because they want to. Thus, what Nichols and Knobe interpreted as incompatibilist intuitions in the abstract case were, in fact, the mere expression of the reasonable belief that bypassing prevents moral responsibility. To put their hypothesis to the test, Murray and Nahmias (2014) designed a measure of bypassing. Participants were presented with one of two descriptions of a deterministic universe: either Nichols and Knobe’s description of Universe A, or the Rollback situation. For each description, participants were presented either with a concrete or an abstract scenario.

Table 4.2 Bypassing statements for abstract and concrete conditions in Murray and Nahmias (2014)

Decisions
  Abstract: In Universe A, a person’s decisions have no effect on what they end up doing
  Concrete: Bill’s decision to steal the necklace has no effect on what he ends up doing
Wants
  Abstract: In Universe A, what a person wants has no effect on what they end up doing
  Concrete: What Bill wants has no effect on what he ends up doing
Believes
  Abstract: In Universe A, what a person believes has no effect on what they end up doing
  Concrete: What Bill believes has no effect on what he ends up doing
No control
  Abstract: In Universe A, a person has no control over what they do
  Concrete: Bill has no control over what he does
Participants were then asked questions about the agent’s free will and moral responsibility, before being asked to rate their agreement with the “bypassing statements” presented in Table 4.2. Murray and Nahmias observed two things. First, they found that, the more participants agreed with the bypassing statements, the less they considered the agent they read about as having acted freely and being morally responsible. Second, they found that agreement with the bypassing statements was higher when participants were presented with Nichols and Knobe’s Universe A than when they were presented with Nahmias and colleagues’ Rollback case. This suggests that the difference between Nahmias and colleagues’ and Nichols and Knobe’s results can be explained by the fact that participants were more prone to see determinism as implying bypassing when presented with Nichols and Knobe’s description of determinism. To test this hypothesis, Murray and Nahmias conducted a mediation analysis: they sought to determine what remained of the effect of vignette (Nichols and Knobe’s Universe A vs. Nahmias and colleagues’ Rollback) on participants’ judgments of free will and moral responsibility once the role of bypassing was taken into account. The results of their analysis suggest that, after taking into account the effect of bypassing, the effect of vignette on participants’ judgments of free will and moral responsibility became nonsignificant, suggesting that most of this effect was explained by the conflation of determinism with bypassing. (This analysis is detailed in Fig. 4.1. We will explain mediation analysis in more detail in Sect. 4.10.)

Fig. 4.1 Mediation analysis in Murray and Nahmias (2014)

In the same way, Murray and Nahmias found that participants’ agreement with bypassing statements was higher in abstract compared to concrete versions of the vignettes.
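Before moving on, the logic of such a mediation analysis can be illustrated numerically: regress the mediator (bypassing) on the manipulated variable (vignette), regress the outcome on both, and see whether the direct effect of the vignette shrinks once the mediator is controlled for. The data below are simulated for illustration only, not Murray and Nahmias’s:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# 0 = Rollback vignette, 1 = Nichols and Knobe's Universe A (simulated)
vignette = rng.integers(0, 2, n)

# Simulate a world in which the vignette affects responsibility judgments
# ONLY through bypassing (full mediation):
bypassing = 2.0 * vignette + rng.normal(0, 1, n)
responsibility = -1.5 * bypassing + rng.normal(0, 1, n)

def ols_slopes(y, *predictors):
    """Least-squares slopes of y on the given predictors (plus an intercept)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]  # drop the intercept

(total,) = ols_slopes(responsibility, vignette)              # total effect c
(a,) = ols_slopes(bypassing, vignette)                       # path a: vignette -> bypassing
b, direct = ols_slopes(responsibility, bypassing, vignette)  # paths b and c'

print(f"total: {total:.2f}, direct: {direct:.2f}, indirect (a*b): {a * b:.2f}")
```

In this simulated example the direct effect is close to zero and the indirect effect (the product a*b) accounts for essentially all of the total effect, which is the pattern Murray and Nahmias report for their real data.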
Again, this suggests that the difference between abstract and concrete cases might simply be explained by the fact that abstract vignettes lead to more confusion of determinism with bypassing than concrete vignettes. However, in this case, Murray and Nahmias did not run a mediation analysis. We will get back to this later. More generally, Murray and Nahmias conclude that:

most people judge that agents in deterministic scenarios lack moral responsibility, free will, and blameworthiness only when they conflate determinism with bypassing—that is, when they interpret the description of determinism in a scenario to mean that agents’ beliefs, desires, and decisions have no effect on what they end up doing and that agents have no control over what they do. When they do not confuse determinism with bypassing, most people do not offer incompatibilist intuitions. (Murray and Nahmias, 2014, 449)

4.2.4 Intrusion: An Error Theory for Compatibilist Intuitions

Should we then conclude that folk intuitions about free will strongly lean toward compatibilism? Not necessarily. Indeed, we saw in the previous section that the answers Nichols and Knobe took to express incompatibilist intuitions did not, in fact, reflect a commitment to incompatibilism, but a confusion of determinism with bypassing. This leaves open the possibility that what we take to be compatibilist answers does not, in fact, reflect a commitment to compatibilism, but something else entirely. A first wave of error theories for seemingly compatibilist answers focused on the possibility that people might attribute free will and moral responsibility in a blind or automatic way, so that it would be inadequate to interpret them as reflecting a compatibilist conception of free will and moral responsibility. For example, Mandelbaum and Ripley (2012) put forward their “Norm Broken, Agent Responsible” (NBAR) theory, according to which people ascribe moral responsibility to agents as soon as they perceive that a norm is broken.
And Feltz and Millan (2015) argued that people will ascribe “free-will-no-matter-what.” However, both accounts are implausible: it is very easy to design cases in which people will not attribute free will and moral responsibility to agents who violate norms. For example, people do not attribute moral responsibility to agents who commit murder under the effect of brain manipulation (Cova, Forthcoming-a, Forthcoming-b) or magic (Andow and Cova, 2016). More recently, Nadelhoffer et al. (2020, 2023) have put forward an alternate error theory for (seemingly) compatibilist intuitions: the Intrusive Metaphysics account. This account is premised on the idea that people have an indeterministic conception of human agency, i.e., that they consider human actions to escape determinism.4 Because people have an indeterministic conception of human agency, they would be unable to abide by the stipulations of the thought experiments described in the vignettes they are presented with, and would reintroduce indeterministic assumptions about the agents described in these vignettes. In other words, their metaphysical beliefs about human agency intrude into their construal of the situation described by the vignette, so that they do not really see agents as determined. Thus, even if they answer that the agents are free and morally responsible for their actions, their answers do not reflect a genuine commitment to compatibilism.

4 Note that it is not clear that this is the case. A popular argument for the claim that people have an indeterministic view of human agency is that, when faced with the description of Nichols and Knobe’s Universes A and B and asked to indicate which universe is most like ours, most people choose Universe B (Sarkissian et al., 2010). However, we have seen that people interpret Universe A as involving bypassing, which means that what we can conclude from these studies is only that people reject bypassing about human agency. Moreover, other approaches suggest that people have a deterministic view of human agency (Nichols, 2006).

To put this hypothesis to the test, Nadelhoffer and colleagues (2023) presented participants with either Nichols and Knobe’s abstract case or Nahmias and colleagues’ Supercomputer vignette. After answering the usual questions about free will and moral responsibility, some participants were asked to rate their agreement with a series of items supposed to measure Intrusion:

1. In Universe A, what people decide to do could have been different even if everything leading up to the decision had been exactly the same.
2. In Universe A, there was a slight chance that John could have chosen not to have French Fries at the time.
3. In Universe A, it was open for John to choose not to have French Fries at the exact moment he decided to have them.
4. In Universe A, John could have decided not to have French Fries even though his decision to have them was completely caused.

Other participants were asked to rate their agreement with a series of items supposed to measure participants’ conflation of determinism with Fatalism:

1. In Universe A, there is no sense in which events could have unfolded differently than they did.
2. In Universe A, John would have ended up having French Fries no matter what he tried to do.
3. In Universe A, John will eat French Fries no matter what.
4. In Universe A, John’s eating French Fries had to happen, even if what happened in the past had been different.

And a last group of participants were asked to rate their agreement with a series of items supposed to measure participants’ conflation of determinism with Epiphenomenalism, which were very close to the items used by Murray and Nahmias (2014). For each vignette, Nadelhoffer and colleagues then computed the correlation between attributions of free will and moral responsibility, and their three measures of Intrusion, Fatalism, and Epiphenomenalism. The results are presented in Table 4.3.

Table 4.3 Pearson correlations between free will (FW) and moral responsibility (MR) ratings and Intrusion, Fatalism, and Epiphenomenalism scores for each of the two vignettes in Nadelhoffer and colleagues (2023)

                         Intrusion   Fatalism   Epiphenomenalism
Abstract case       FW    0.81***      0.00         −0.16
                    MR    0.70***      0.11         −0.22*
Supercomputer case  FW    0.60***     −0.14         −0.53***
                    MR    0.54***     −0.01         −0.49***

As one can see, they found strong correlations between Intrusion scores and free will and moral responsibility ratings: this means that, the more participants attributed free will and moral responsibility, the higher their score on the Intrusion measure. In fact, the association between free will attributions and Intrusion scores was either in the same ballpark as or stronger than the association between free will and participants’ Epiphenomenalism scores. This suggests that, if we discard incompatibilist intuitions on the basis of their association with Epiphenomenalism (or Bypassing), we should also discard compatibilist intuitions on the basis of their association with Intrusion scores. (We will explain how correlations can be used to measure such associations in Sect. 4.8.) In fact, Nadelhoffer and colleagues’ conclusion is even stronger: given the strong correlations between free will and moral responsibility ratings and Intrusion ratings, they conclude that:

these results indicate that a serious comprehension problem lies at the root of experimental studies of everyday thinking about free will [...] If participants think that determinism entails epiphenomenalism or fatalism, or that determinism is compatible with capacities that require indeterminism, then they have clearly misunderstood the nature and implications of determinism as philosophers understand it.
If that’s right, then the extant data, insofar as it relies on these measures and scenarios, do not speak to whether or not commonsense thinking about free will reflects compatibilist or incompatibilist commitments. (Nadelhoffer et al., 2023)

4.2.5 Limitations

So, should we endorse Nadelhoffer and colleagues’ pessimistic conclusion and discard compatibilist intuitions as the result of a comprehension error? Not necessarily, as one might doubt the validity of their Intrusion measure. For example, let’s take their fourth Intrusion item:

(4) In Universe A, John could have decided not to have French Fries even though his decision to have them was completely caused.

As we saw earlier, there is one sense in which this statement is clearly false: given that Universe A is deterministic, then, given the past, it had to happen that John decides to have French Fries and, in this sense, John could not have decided not to have French Fries. This is why accepting the Principle of Alternate Possibilities (see Sect. 4.2.2) seemed to commit us to incompatibilism. However, in taking item (4) to be necessarily false in a deterministic universe, we adopt an unconditional interpretation of “could have,” according to which, when we say that “John could have decided not to have French Fries,” we mean that it was possible that John would have decided not to have French Fries, even if everything up to the moment of his decision was identical.
But certain philosophers (e.g., Ayer, 1963; Austin, 1956) have argued that, when it comes to human decisions, expressions such as “could have” can also be given a conditional interpretation, according to which “John could have decided not to have French Fries” means something like “if John’s desires and preferences had been different, John would have ended up deciding not to have French Fries.” Under such an interpretation, it can be said that, even in a deterministic universe, John could have decided not to have French Fries, as it is true that, if John had preferred to take vegetables rather than French Fries, he would have ended up deciding to take vegetables rather than French Fries. The same problem plagues the interpretation of items (2) and (3):

(2) In Universe A, there was a slight chance that John could have chosen not to have French Fries at the time.

(3) In Universe A, it was open for John to choose not to have French Fries at the exact moment he decided to have them.

Item (2) also contains the expression “John could have chosen” and is thus subject to the same ambiguities as item (4). And, in item (3), it seems that the expression “it was open for John” can also be given a conditional reading. What does this mean? First, it means that participants can understand that John lives in a deterministic universe and still agree with items (2), (3), and (4), without their agreement being the sign of a comprehension error. Second, if participants take a conditional reading of items (2) to (4), we should expect them to agree with these items, as disagreement with these items would indicate that participants consider that, even if John’s desires had been different, he would still have decided to have French Fries—which sounds a lot like bypassing. Thus, if participants endorse a conditional interpretation of those items, then it is disagreement with these items that signals a comprehension error.
Of course, we don’t know that participants in Nadelhoffer and colleagues’ studies endorsed a conditional reading, but we cannot exclude it either. Nadelhoffer and colleagues tried to design their items to avoid such a reading, but it is hard to predict how participants will interpret items, and even items that are phrased in a philosophically straightforward manner can give rise to unexpected interpretations from nonphilosophers (see Cova, Forthcoming-b). Thus, it would be great if we could find a measure of Intrusion that does not rely on ambiguous expressions such as “could have” or “is open to.” One solution might be found in a thought experiment William James used to explain determinism to his audience in a lecture called “The Dilemma of Determinism”:

imagine that I first walk through Divinity Avenue, and then imagine that the powers governing the universe annihilate ten minutes of time with all that it contained, and set me back at the door of this hall just as I was before the choice was made. Imagine then that, everything else being the same, I now make a different choice and traverse Oxford Street. You, as passive spectators, look on and see the two alternative universes,—one of them with me walking through Divinity Avenue in it, the other with the same me walking through Oxford Street. Now, if you are determinists you believe one of these universes to have been from eternity impossible: you believe it to have been impossible because of the intrinsic irrationality or accidentality somewhere involved in it. (James, 1907, 155)

On this model, one way to probe whether participants understand determinism or insert indeterministic agency into their construal of the vignette would be to ask them to imagine that the same decision is replayed a certain number of times, and to investigate whether they understand that, since we are in a deterministic universe, the same decision should be made every single time.
For example, we could ask:

Imagine that, in Universe A, a time-traveler observed John when he chose to have French Fries, then decided to go back in time 10 minutes before John made his decision to see whether John would always choose to have French Fries. Imagine that he did that a hundred times and never interfered with John’s life and decision-making process. According to you, on the 100 times this time-traveler observed John making his decision, how many times did John choose to have French Fries: [pick a number between 0 and 100].

4.3 Research Questions

Given this theoretical background, we will now explore the two following research questions:

(Q1) Which factors best explain participants’ judgments about moral responsibility in Nichols and Knobe’s abstract and concrete cases? Is it a confusion with bypassing (as measured through Murray and Nahmias’ Bypassing items)? Or is it the intrusion of metaphysical, indeterministic assumptions about human agency (as measured by Nadelhoffer and colleagues’ Intrusion items and our new measure of indeterminism)?

(Q2) What explains the difference between the abstract and concrete conditions? Is it a greater tendency to conflate determinism with bypassing in the abstract condition? Or is it a greater tendency to make indeterministic assumptions in the concrete condition? Or both?

4.4 Experimental Design

Our first research question (Q1) asks about the relationship between several measured variables (free will and moral responsibility attributions, Bypassing measures, measures of participants’ assumptions about human agency). As such, it calls for a correlational research design, in which these different variables are measured simultaneously across participants and situations (e.g., different vignettes) to assess how they covary. Our second research question (Q2) asks to which extent the effect of a manipulated variable (abstract vs.
concrete condition) on a measured variable (free will and moral responsibility attributions) can be explained by its relationship with a third, mediating variable (Bypassing scores or measures of participants’ assumptions about human agency). As such, it calls for an experimental research design, in which one variable (abstract vs. concrete condition) is manipulated and the others measured. Fortunately, both research questions can be answered within a single research design.

We decided to adopt a research design similar to the design of Nichols and Knobe’s studies. This means that participants were first presented with the same description of Universes A and B that we presented in Sect. 4.2.2. At the end of the description, participants were asked:

(Universe) Which of these universes do you think is most like ours?
• Universe A.
• Universe B.

After that, participants were randomly assigned to one of four conditions, depending on the type of vignette (abstract vs. concrete) and the Universe (A vs. B) they were assigned to. This manipulation is what makes our research design experimental. Participants in the Universe A—abstract condition received the following question:

(AA_Responsibility) To which extent do you agree with the following claim: “In Universe A, it is possible for people to be morally responsible for their actions.” (on a 7-point scale from −3 = “Strongly disagree” to 3 = “Strongly agree”)

Participants in the Universe B—abstract condition received the same question with “In Universe B” instead of “In Universe A,” while participants in the Universe A—concrete condition received the following question:

(AC_Responsibility) In Universe A, a man named Bill has become attracted to his secretary and he decides that the only way to be with her is to kill his wife and 3 children. He knows that it is impossible to escape from his house in the event of a fire. Before he leaves on a business trip, he sets up a device in his basement that burns down the house and kills his family. To which extent do you agree with the following claim: “Bill is morally responsible for killing his family.” (on a 7-point scale from −3 = “Strongly disagree” to 3 = “Strongly agree”)

Participants in the Universe B—concrete condition received the same question with “In Universe B” instead of “In Universe A.” After answering and justifying their answer to the responsibility question, participants were asked to rate their agreement with three other statements about free will and moral responsibility (on a 7-point scale from −3 = “Strongly disagree” to 3 = “Strongly agree”). Statements are presented in Table 4.4.

Table 4.4 Free will and moral responsibility statements in our study. Formulations vary between the abstract and concrete conditions

(blame)
  Abstract: In Universe A/B, people deserve to be blamed for the bad things they do
  Concrete: Bill deserves to be blamed for killing his family
(free will)
  Abstract: In Universe A/B, it is possible for people to have free will
  Concrete: Bill has free will
(up)
  Abstract: In Universe A/B, people’s decisions are “up to them”
  Concrete: Bill’s decision to kill his family was “up to him”

Then, participants were asked to rate their agreement with six other statements, four of which were inspired by Murray and Nahmias’ (2014) Bypassing items.5 Statements are presented in Table 4.5. After that, participants were asked to rate their agreement with four Intrusion items drawn from Nadelhoffer et al.
(2023):

• (Intrusion_1) In Universe A/B, what people decide to do could have been different even if everything leading up to the decision had been exactly the same.
• (Intrusion_2) In Universe A/B, there was a slight chance that John could have chosen not to have French Fries at the time.
• (Intrusion_3) In Universe A/B, it was open for John to choose not to have French Fries at the exact moment he decided to have them.
• (Intrusion_4) In Universe A/B, John could have decided not to have French Fries.

5 The two other items were added to measure participants' beliefs about the agent's "deep self." We won't be using these items in this chapter.

4 Quantitative Vignette Studies: Correlations, Regressions, and Structural. . . 155

Table 4.5 Bypassing statements in our study. Formulations vary between the abstract and concrete conditions

(Decisions)
  Abstract: In Universe A/B, people's decisions have no effect on what they end up doing
  Concrete: Bill's decision to kill his family had no effect on what he ended up doing
(Want)
  Abstract: In Universe A/B, what people want has no effect on what they end up doing
  Concrete: What Bill wanted had no effect on what he ended up doing
(Beliefs)
  Abstract: In Universe A/B, what people believe has no effect on what they end up doing
  Concrete: What Bill believed had no effect on what he ended up doing
(Control)
  Abstract: In Universe A/B, people have no control over what they do
  Concrete: Bill had no control over what he did

Just after, participants were presented with our own measure of intrusion:

(Loop) Imagine that, in Universe A/B, a time-traveler observed John when he chose to have French Fries then decided to go back in time 10 minutes before John made his decision to see whether John would always choose to have French Fries. Imagine that he did that a hundred times and never interfered with John's life and decision-making process.
According to you, on the 100 times this time-traveler observed John making his decision, how many times did John choose to have French Fries [pick a number between 0 and 100]?

To check that participants carefully answered the survey, we then included two attention checks:

(Check1) According to the scenario, in which Universe is it the case that "everything that happens is completely caused by what happened before it"?
• Universe A
• Universe B
• Universe C

(Check2) Apart from the first question, the questions you were asked were mostly about:
• Universe A
• Universe B
• Universe C

Finally, participants were asked several series of questions about themselves, including their age and gender.6

6 In order to prepare a dataset usable for pedagogical purposes, we included a lot of measures we do not detail here because they did not provide useful results. Mostly, we hoped to find interesting correlations between moral responsibility judgments and personality traits (such as extraversion, cognitive style, or desire to punish) in order to introduce the reader to correlations, but not much came from it. Thus, for the sake of simplicity, we won't be speaking about them. However, in a scientific article, it is important to report all the measures that were included in your study. The full questionnaire can be found in the 3_questionnaire.qsf and 3_questionnaire.pdf files.

4.5 Implementing the Research Design and Conducting the Study

Running this kind of study requires two major steps: (i) finding a platform to implement and host the online survey and (ii) recruiting participants. We decided to host our survey on Qualtrics,7 an online platform that allows researchers to design and host online surveys. The advantage of Qualtrics is that it is very user-friendly and intuitive to use (as it involves almost no coding skills). It is also widely used by researchers in social psychology.
The downside is that it is not freely accessible and quite expensive, to the point that you generally need your institution to provide you with an account. Still, you can familiarize yourself with its use by asking for a free, limited trial, which should be enough to reproduce the current study.

A survey in Qualtrics (called a "Project") is composed of "Blocks" that are themselves composed of different "Questions." You can add as many "Blocks" as you want to your project, and as many "Questions" as you want within a Block. There are several types of questions. For example, Fig. 4.2 shows a "Multiple choice" question, in which you can set the number of options, their labels, and their orientation (horizontal or vertical). You can also use a "Text entry" question if you want participants to answer in their own words (see Fig. 4.3). When participants have to rate several statements on the same scale, you don't have to create a separate "Multiple choice" question for each statement. Rather, you can use a "Matrix table" question (see Fig. 4.4). You can have the statements presented in a random order by selecting the "Statement randomization" option. For questions that allow for a wide range of values as an answer (such as our Loop question), "Multiple choice" questions might not be convenient. Rather, you can use a "Slider" question (as shown in Fig. 4.5). Finally, as shown in Fig. 4.6, instructions and vignettes can be presented in "Text/Graphic" questions (which are thus not questions strictly speaking).

Fig. 4.2 An example of "Multiple choice" question in Qualtrics
Fig. 4.3 An example of "Text entry" question in Qualtrics
Fig. 4.4 An example of "Matrix table" question in Qualtrics
Fig. 4.5 An example of "Slider" question in Qualtrics

7 www.qualtrics.com/.
Given that we are using the same vignette for all four experimental conditions, the simplest way to implement our design in Qualtrics is to create a common Block (here called "Intro") that contains this vignette, then to create one Block for each experimental condition (as the formulation of the questions differs across experimental conditions). We called these Blocks "AA," "AC," "BA," and "BC," according to the parameters varying across conditions (the first letter, A or B, refers to the universe participants are asked about, while the second letter, A or C, refers to whether the questions are formulated in an abstract or concrete way).

Fig. 4.6 An example of "Text/Graphic" question in Qualtrics

But how do we make it so that participants only see one of the four conditions? In Qualtrics, the order in which Blocks are presented can be set and modified using the "Survey flow" tool, which allows for randomization. Thus, as shown in Fig. 4.7, the simplest way is to first present the "Intro" Block and then use the "Randomizer" tool to randomly present participants with one of the four experimental Blocks. Still, it would be nice to keep track of the universe (A vs. B) and the condition (abstract vs. concrete) participants were randomly assigned to. This can be done in the Survey flow by using the "Set embedded data" tool, creating two variables (UNIVERSE and COND), and setting their values according to the Block participants were assigned to (these variables and their values will then appear in the final datafile and will prove very useful, as we will see). This can be done by creating four "Groups" in the Survey flow (one per condition), having each group contain the corresponding experimental Block and the respective values for the two variables, and having participants be randomly assigned to one of the four groups (see Fig. 4.8 for an example).
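The allocation logic implemented by this Survey flow (random assignment to one of four groups, each of which records UNIVERSE and COND as embedded data) can be sketched in R as a quick sanity check. The sketch below is purely illustrative: it is not Qualtrics code, the number of participants is made up, and the group labels simply mirror the Block names used above.

```r
# Illustrative R sketch of the Survey flow logic (not Qualtrics code):
# each participant is randomly assigned to one of four groups, and the
# two embedded-data variables are derived from the group label.
set.seed(42)                            # for reproducibility
n <- 10                                 # hypothetical number of participants
group <- sample(c("AA", "AC", "BA", "BC"), n, replace = TRUE)
UNIVERSE <- substr(group, 1, 1)         # "A" or "B"
COND <- ifelse(substr(group, 2, 2) == "A", "ABSTRACT", "CONCRETE")
data.frame(group, UNIVERSE, COND)       # one row per simulated participant
```

In Qualtrics itself, of course, this bookkeeping is done for you by the Randomizer and the "Set embedded data" elements; the sketch only shows why the UNIVERSE and COND columns later suffice to recover each participant's condition.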
Fig. 4.7 Randomizing Blocks in Qualtrics' Survey flow
Fig. 4.8 Setting embedded data to keep track of experimental conditions in Qualtrics' Survey flow

Finally, one useful feature of Qualtrics is that it lets you save your survey (or "Project") in a special format (.qsf) that allows Qualtrics users to "import" your survey to their own account. Thus, if you want a detailed view of the survey we used, you can download the Chap.3.qsf file from supplementary materials and open it in your own account. (Alternatively, if you don't have a Qualtrics account, you can just download the Chap.3.pdf file.)

Once the survey is implemented in Qualtrics, you can "Publish" it and obtain an anonymous link to share with participants. The last step is then to recruit said participants. We decided to recruit 800 participants (200 per condition) through Prolific Academic,8 a website that allows researchers to pay participants to take part in online studies. Prolific Academic also allows researchers to filter participants according to several demographic factors, so we asked only for participants coming from the United States and asked for a balanced ratio between men and women. Participants were paid £1.00 for their participation. In total, 802 participants completed our survey.

8 https://www.prolific.co.

4.6 Analysis I: Preparing Our Dataset

4.6.1 Downloading and Opening the Datafile

Once participants have completed the survey, the results can be downloaded as a .csv file. The first thing to do is then to go and download the file called 3_data.csv from the supplementary materials. A CSV file is a standard format to store data from surveys. It is organized in rows and columns, where each row corresponds to one participant (or one trial, when there are several trials per participant) and each column to one type of data (for example, one column per question participants answered).
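To make this structure concrete, here is what the first lines of such a file might look like. The column names and values below are invented for illustration (the actual file contains many more columns); note the empty cell for the participant who was not in the AA condition and therefore never saw that question.

```
StartDate,UNIVERSE,COND,AA_Responsibility,Check1,Age,Gender
2023-03-01,A,ABSTRACT,-2,Universe A,34,A woman
2023-03-01,B,CONCRETE,,Universe A,51,A man
```

The first line holds the column names; every subsequent line is one participant's data, with values separated by commas (hence the name "comma-separated values").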
CSV files can be opened in Excel if you need to have a look at them, but that won't be necessary here, as we will rely exclusively on R to manipulate and analyze our data. The 3_data.csv file is not exactly the one you would obtain from Qualtrics after running your experiment. It has been cleaned in several ways. First, to compensate participants for their participation, we had to ask them for their Prolific ID number. This information was removed for anonymization, and experimenters should be careful not to share such information. Second, you must ensure that all data that are supposed to be numerical are indeed numerical. For example, if the label for a "Multiple choice" question option in Qualtrics was "3: Strongly agree" (as in Fig. 4.2), it might appear this way in the datafile, preventing R from treating participants' answers as a numerical variable. This can be avoided by directly recoding values in Qualtrics (by selecting the option "Recode values"), but if it has not been done, be sure to replace such values with numerical values (here, "3") in your datafile. Third, for R to properly read your file, only the first row should contain information about the data (such as the column names): all other rows should contain the data themselves. However, depending on the options you select in Qualtrics while downloading your survey results, you might end up with additional rows containing the questions' internal IDs and/or the questions' texts. Be sure to delete these rows before analyzing your data (however, always keep an unmodified copy of your original datafile, in case you introduce errors while cleaning up the data). If you look inside the file (using Excel, for example), you will see that each column corresponds to a variable and each row to a particular participant.
Thus, participants' answers to each question you included in your Qualtrics survey will have their dedicated column, and the name of this column will be the name you gave to the question in the original survey (see Fig. 4.9). This is why it is important to carefully label each question, to make sure that you can identify the corresponding column.

Fig. 4.9 A look at the raw data

Once the file is downloaded, launch R. To find the CSV file that contains the data, R needs you to indicate in which folder the CSV file is stored. This is done by setting R's working directory. On Windows, this is done by selecting "File," then "Change directory," and the corresponding folder. On macOS, this is done by selecting "Tools," then "Change the working directory," and the corresponding folder. Now, we need to open our CSV file in R and store its content in a variable. This is done by typing the following command and pressing enter:

raw_data <- read.csv("3_data.csv", header = TRUE)

Here, we are using the read.csv() function, which allows us to open CSV files and store their content in a variable. We gave the name raw_data to this variable, but you could have chosen any other name. You will note that the function takes two arguments (separated by a comma). The first is the name of the file you want to open (3_data.csv) and the second is an argument (header) that allows us to indicate whether the first row in our CSV contains the names of the columns (rather than data). As this is the case, we have to indicate header = TRUE. In the opposite case, we would have indicated header = FALSE.
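If you want to see for yourself what the header argument changes, you can experiment on a tiny throwaway file. The file and its column names below are invented for the demonstration; a temporary file is used so that no working-directory setup is needed.

```r
# Write a small two-column CSV to a temporary file, then read it back
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,Age", "1,34", "2,51"), tmp)

with_header <- read.csv(tmp, header = TRUE)
names(with_header)            # the first row supplies the column names
nrow(with_header)             # 2 rows of data

without_header <- read.csv(tmp, header = FALSE)
nrow(without_header)          # 3 rows: the header line is treated as data
```

With header = FALSE, R invents column names (V1, V2, ...) and the former header line ends up as a row of data, which is exactly the problem described above when extra descriptive rows are left in a Qualtrics export.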
R might display the following error message:

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file '3_data.csv': No such file or directory
Execution halted

If this is the case, it just means that you did not correctly set the working directory and that the folder you indicated is not the one containing the CSV file. If no error message is displayed, then this means the content of the CSV file has been successfully stored in the raw_data variable. To check whether this is the case, you can simply enter the name of the variable:

raw_data

This will display the full content of the raw_data variable in your R window. This might be a bit too much (since there are 802 rows and 104 columns). A more convenient way to check the content of your variable is:

head(raw_data)

The head() function allows you to only see the first six rows of your raw_data variable (see Fig. 4.10).

Fig. 4.10 Using the head() function to have a look at the first rows of the datafile

Finally, it should be noted that raw_data is what is called a "data frame." In R, a data frame is an object composed of several rows and columns. Since we have one row per participant, the number of participants who filled in the survey can be checked simply by computing the number of rows in the data frame, using the nrow() function:

nrow(raw_data)

Moreover, each column constitutes a "vector," that is, a sequence of elements that share the same data type. Each vector in a data frame can be singled out by typing the name of the data frame followed by the $ symbol and the name of the corresponding column.
For example, if you want to display the content of the Gender column, you just have to enter:

raw_data$Gender

4.6.2 Data Preparation

Now, one problem with our data is that participants' answers to the same question (for example, the moral responsibility question) are stored in different columns, depending on the condition they were assigned to. For example, participants' answers to the moral responsibility question in the AA condition are stored in the following vector:

raw_data$AA_Responsibility

You will see that a certain number of entries are occupied not by numbers but by the letters "NA" (short for "not available"). This means that the corresponding cell was empty in the original datafile, simply because the corresponding participant was not assigned to the AA condition, and thus did not answer the question. You can display the four vectors corresponding to participants' answers to the moral responsibility question in each of the four conditions in the following way:

data.frame(raw_data$AA_Responsibility, raw_data$AC_Responsibility, raw_data$BA_Responsibility, raw_data$BC_Responsibility)

You can see that there is only one number (and three "NA") per row, as should be expected, since participants, being assigned to one of four conditions, only answered one of the four questions. Now, for analysis purposes, it would be great to "fuse" these four columns into one containing all answers to the responsibility question.
One way, among many, to do that using the basic R functions is to use the rowSums() function, which creates a new vector by summing other vectors:

rowSums(data.frame(raw_data$AA_Responsibility, raw_data$AC_Responsibility, raw_data$BA_Responsibility, raw_data$BC_Responsibility), na.rm = TRUE)

The na.rm = TRUE expression at the end tells the function to ignore the "NA" while computing the sum (if you remove the expression, you will end up with a vector full of "NA"). We can now store this vector in a new column of our data frame:

raw_data$Responsibility <- rowSums(data.frame(raw_data$AA_Responsibility, raw_data$AC_Responsibility, raw_data$BA_Responsibility, raw_data$BC_Responsibility), na.rm = TRUE)

One possible worry might be that, since we have fused all four responsibility vectors together, we can no longer distinguish answers given in the different conditions. Fortunately, information about the kind of universe (A or B) and the kind of case (abstract or concrete) participants were presented with is stored in the following columns, which correspond to the variables we created in Qualtrics by using the "Set embedded data" tool:

raw_data$UNIVERSE
raw_data$COND

Now, we can ask R to select a subset of a certain vector, depending on the content of another vector, using brackets []. For example, if we type:

raw_data$Responsibility[raw_data$UNIVERSE=="A"]

R will produce a vector composed by selecting the content of the raw_data$Responsibility vector for rows in which the expression between brackets is true. Since the expression between brackets is the affirmation that raw_data$UNIVERSE is equal to "A", this means that the resulting vector only contains the values of raw_data$Responsibility for participants who were presented with Universe A.
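These bracket-selection mechanics can be verified on a small toy example. The two vectors below are made up for the demonstration and are much shorter than the real data, but they behave exactly like a data frame column and its matching condition column.

```r
# A toy vector of ratings and a matching vector of conditions
scores <- c(3, -1, 2, 0)
universe <- c("A", "B", "A", "B")

# Keep only the scores of "participants" in Universe A:
scores[universe == "A"]       # returns 3 and 2
```

Because the two vectors are aligned position by position, the logical expression in brackets acts as a row filter, which is precisely what happens with raw_data$Responsibility and raw_data$UNIVERSE above.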
Note that we use two = signs instead of one when "affirming" that raw_data$UNIVERSE is equal to "A", because using only one = sign would mean that we are trying to assign a new value to raw_data$UNIVERSE. Tests about the content of a variable require using two = signs. The expressions we just put into brackets can be made more complex by combining basic affirmations using logical operators. For example, the logical operator for conjunction is &, and we can use it to compose more complex statements in the following way:

raw_data$Responsibility[raw_data$UNIVERSE=="A" & raw_data$COND=="ABSTRACT"]

Here, the expression between brackets is true only if participants were presented with Universe A and if they were presented with an abstract case. This means that the resulting vector only contains participants' moral responsibility judgments for the AA condition. Now that this worry is out of the way, we just need to create combined vectors for all our measures, in the following way:

raw_data$Blame <- rowSums(data.frame(raw_data$AA_Blame, raw_data$AC_Blame, raw_data$BA_Blame, raw_data$BC_Blame), na.rm = TRUE)
raw_data$FreeWill <- rowSums(data.frame(raw_data$AA_FreeWill, raw_data$AC_FreeWill, raw_data$BA_FreeWill, raw_data$BC_FreeWill), na.rm = TRUE)
raw_data$Up <- rowSums(data.frame(raw_data$AA_Up, raw_data$AC_Up, raw_data$BA_Up, raw_data$BC_Up), na.rm = TRUE)
raw_data$Decisions <- rowSums(data.frame(raw_data$AA_Decisions, raw_data$AC_Decisions, raw_data$BA_Decisions, raw_data$BC_Decisions), na.rm = TRUE)
raw_data$Want <- rowSums(data.frame(raw_data$AA_Want, raw_data$AC_Want, raw_data$BA_Want, raw_data$BC_Want), na.rm = TRUE)
raw_data$Beliefs <- rowSums(data.frame(raw_data$AA_Beliefs, raw_data$AC_Beliefs, raw_data$BA_Beliefs, raw_data$BC_Beliefs), na.rm = TRUE)
raw_data$Control <- rowSums(data.frame(raw_data$AA_Control, raw_data$AC_Control, raw_data$BA_Control, raw_data$BC_Control), na.rm = TRUE)
raw_data$Intrusion1 <- rowSums(data.frame(raw_data$AA_Intrusion1, raw_data$AC_Intrusion1, raw_data$BA_Intrusion1, raw_data$BC_Intrusion1), na.rm = TRUE)
raw_data$Intrusion2 <- rowSums(data.frame(raw_data$AA_Intrusion2, raw_data$AC_Intrusion2, raw_data$BA_Intrusion2, raw_data$BC_Intrusion2), na.rm = TRUE)
raw_data$Intrusion3 <- rowSums(data.frame(raw_data$AA_Intrusion3, raw_data$AC_Intrusion3, raw_data$BA_Intrusion3, raw_data$BC_Intrusion3), na.rm = TRUE)
raw_data$Intrusion4 <- rowSums(data.frame(raw_data$AA_Intrusion4, raw_data$AC_Intrusion4, raw_data$BA_Intrusion4, raw_data$BC_Intrusion4), na.rm = TRUE)
raw_data$Loop <- rowSums(data.frame(raw_data$AA_Loop, raw_data$AC_Loop, raw_data$BA_Loop, raw_data$BC_Loop), na.rm = TRUE)

4.6.3 Exclusions

Now, before analyzing our data, we need to exclude participants who did not pay sufficient attention to the survey. As mentioned earlier, we had two attention checks toward the end of our study:

(Check1) According to the scenario, in which Universe is it the case that "everything that happens is completely caused by what happened before it"?
• Universe A
• Universe B
• Universe C

In this case, the right answer is "Universe A." Thus, we only want to keep participants who answered "Universe A" to this question. Given that the corresponding column in the data frame is called "Check1," this is done in the following way:

after_exclusion1 <- raw_data[raw_data$Check1=="Universe A",]

The brackets [] after raw_data contain an expression which can be read in the following way: "the answer in column Check1 of the raw_data data frame is Universe A." Putting this expression into brackets just after raw_data allows us to select the rows in raw_data for which it is true (and to set aside all rows for which it is false).
Simply put, we are only keeping the rows in raw_data for which the column "Check1" contains "Universe A." This creates a new (probably smaller) data frame that we call after_exclusion1. We can check the number of rows in after_exclusion1 using the nrow() function:

nrow(after_exclusion1)

The result is 781. So, we already excluded 802 – 781 = 21 participants. Now, let's proceed to our second attention check:

(Check2) Apart from the first question, the questions you were asked were mostly about:
• Universe A
• Universe B
• Universe C

This one is trickier, as the right answer depends on the condition participants were presented with. Participants in the AA and AC conditions should have answered "Universe A," while participants in the BA and BC conditions should have answered "Universe B." This can be formulated in the following way:

after_exclusion2 <- after_exclusion1[(after_exclusion1$UNIVERSE=="A" & after_exclusion1$Check2=="Universe A") | (after_exclusion1$UNIVERSE=="B" & after_exclusion1$Check2=="Universe B"),]

This is a lot to take in, so let's analyze it. As for the first attention check, the expression in brackets just after the first occurrence of after_exclusion1 allows us to select the rows in after_exclusion1 for which this expression is true. However, this expression is actually composed of two expressions (in parentheses) separated by the | symbol. In R, | is the logical operator for inclusive disjunction: the expression in brackets is true if at least one of the two expressions in parentheses is true. Now, let's turn to the expressions within the parentheses: they both include the logical operator for conjunction, &. This means that the expressions in parentheses are themselves composed of two smaller expressions and are only true if both of these smaller expressions are true.
For example, the expression within the first set of parentheses is true only when the content of the UNIVERSE column is "A" and the content of the Check2 column is "Universe A." And the expression within the second set of parentheses is true only when the content of the UNIVERSE column is "B" and the content of the Check2 column is "Universe B." Thus, what we are doing is selecting the rows in after_exclusion1 for which either participants were assigned to the "Universe A" condition and answered "Universe A," or participants were assigned to the "Universe B" condition and answered "Universe B." The data frame resulting from this selection is stored under the variable after_exclusion2. And now we can check how many participants are left using the nrow() function:

nrow(after_exclusion2)

It is 772. Thus, in total, we excluded 802 – 772 = 30 participants. Finally, we can give a shorter name to our final data frame, which will come in handy for further analysis. We can simply call it dat, in the following way:

dat <- after_exclusion2

We can even create a new CSV file to store our clean dataset:

write.csv(dat, file = "3_data_clean.csv")

Thus, if you want to analyze the data without having to go through all the data preparation and exclusion procedures again, all you have to do is:

dat <- read.csv("3_data_clean.csv", header = TRUE)

4.7 Analysis II: A First Look at the Data

Before testing our hypotheses, we can take a first look at the data. For example, we can explore the demographic properties of our sample. Also, since our variables of interest are moral responsibility judgments, bypassing judgments, and measures of intrusion, we can look at how they vary across conditions.

4.7.1 Demographics

It is usual, in any study, to describe the nature of your sample by giving at least the gender and age of your participants. Information about participants' gender is stored in the Gender column.
You just have to use the table() function to display the number of participants who gave each answer:

table(dat$Gender)

You should obtain the following results:

  A man  A woman  Other
    387      373     12

This means that, after exclusion, we have 387 participants who identified as "a man," 373 who identified as "a woman," and 12 who identified as "other." To transform these raw numbers into percentages, you just have to divide the output of the table() function by the total number of participants (772), using one of the two following methods:

table(dat$Gender)/772
table(dat$Gender)/length(dat$Gender)

Let's now turn to participants' age. Data about their age is stored in the Age column, which you can display by entering:

dat$Age

Since dat$Age is a numerical vector, you can compute its mean and standard deviation, using the mean() and sd() functions, respectively:

mean(dat$Age)
sd(dat$Age)

Thus, we obtain the following information: M_age = 39.03, SD_age = 14.37. Finally, we can determine how many participants there are in each condition by applying the table() function to the UNIVERSE and COND columns:

table(dat$UNIVERSE, dat$COND)

The result should look like this:

    ABSTRACT  CONCRETE
  A      199       183
  B      196       194

This means that we have 199 participants in the AA condition, 183 in the AC condition, 196 in the BA condition, and 194 in the BC condition.

4.7.2 Moral Responsibility Ratings Across Conditions

Let's now try to compute means and standard deviations for moral responsibility judgments across our four conditions.
One way to do it would be to compute them independently for each condition, using brackets [] to specify the condition:

mean(dat$Responsibility[dat$UNIVERSE=="A" & dat$COND=="ABSTRACT"])
sd(dat$Responsibility[dat$UNIVERSE=="A" & dat$COND=="ABSTRACT"])

However, there is a more practical way to do it, by using the tapply() function:

tapply(dat$Responsibility, list(dat$UNIVERSE, dat$COND), mean)
tapply(dat$Responsibility, list(dat$UNIVERSE, dat$COND), sd)

The tapply() function allows you to directly compute a certain function (such as mean or sd) for each level of a certain factor, specified in the "list" part of the function. When several factors are specified, as is the case here (since we have two factors, UNIVERSE and COND), the function is applied for each condition resulting from the intersection of these factors. For example, for the first line, you should obtain the following results:

      ABSTRACT  CONCRETE
  A -0.7135678  1.530055
  B  2.4081633  2.850515

The upper left number shows the average responsibility ratings for the AA (Universe A—abstract) condition. Based on this information, you should be able to reconstitute Table 4.6. You can then try to produce the same tables for the other measures of free will and moral responsibility (dat$Blame, dat$FreeWill, and dat$Up).

Table 4.6 Mean (and standard deviations) for participants' moral responsibility ratings for each universe (A or B) and each case (abstract or concrete)

              Abstract       Concrete
  Universe A  −0.71 (2.07)   1.53 (2.26)
  Universe B   2.41 (0.92)   2.85 (0.75)
Table 4.7 Mean (and standard deviations) for participants' moral responsibility ratings for each universe (A or B) and each case (abstract or concrete)

              Abstract             Concrete
  Universe A  −0.71 (2.07), 30.2%  1.53 (2.26), 71.6%
  Universe B   2.41 (0.92), 97.4%  2.85 (0.75), 97.4%

Percentages indicate the % of participants who gave an answer superior to the midpoint (0)

For those who find it more informative to know how many participants gave an answer superior to the midpoint (0) in each condition, you can use the table() function:

table(dat$Responsibility[dat$UNIVERSE=="A" & dat$COND=="ABSTRACT"] > 0)

You can directly obtain a proportion by using the as.numeric() function, which will turn all FALSE into 0 and all TRUE into 1:

mean(as.numeric(dat$Responsibility[dat$UNIVERSE=="A" & dat$COND=="ABSTRACT"] > 0))

Doing this for each condition, you can complete Table 4.6 to obtain Table 4.7. As one can see, our results are in line with the ones obtained by Nichols and Knobe (2007). When presented with a deterministic universe (Universe A), very few participants (30.2%) attributed moral responsibility to agents in the abstract case, but a majority (71.6%) attributed moral responsibility to agents in the concrete case. You can check by yourself whether the same happens for the other measures of free will and moral responsibility (dat$Blame, dat$FreeWill, and dat$Up).

4.7.3 Computing Bypassing and Intrusion Scores

Thus, moral responsibility ratings behave as expected. But do Bypassing and Intrusion scores behave in the same way? We can't check that yet, because Bypassing and Intrusion scores are composite: they need to be computed from several items. For example, in the questionnaire we used, bypassing was measured through the four following items: (Decisions), (Want), (Beliefs), and (Control).
One way to compute the total Bypassing score is simply to take the sum of all items and divide it by the total number of items, in the following way:

dat$Bypassing <- (dat$Decisions + dat$Want + dat$Beliefs + dat$Control)/4

However, there is a more complex method, which involves first constructing a data frame gathering all items (each column corresponding to one item):

Bypassing_frame <- data.frame(dat$Decisions, dat$Want, dat$Beliefs, dat$Control)

And then using the function rowMeans(), which averages all the columns of a given data frame to produce a single vector:

dat$Bypassing <- rowMeans(Bypassing_frame, na.rm = TRUE)

Both methods give the same results. So, why choose the more complex one? Because its intermediary step (creating a data frame gathering all items) allows us to compute the internal consistency of our measure of bypassing—that is, the extent to which the different items composing our measure of bypassing cohere with, and correlate with, each other. A widespread measure of internal consistency is Cronbach's alpha. To compute Cronbach's alpha for our measure of bypassing, we will first need to install the psych package for R. R packages are collections of functions and data sets developed by the community to extend R's base functionalities. To install the psych package, you can use the following command:

install.packages("psych")

Once the package is installed, you need to "call" it to use the functions it contains. To call the psych package, just use the following command:

library("psych")

The psych package contains the alpha() function, which, when applied to a data frame, computes Cronbach's alpha for the set of items composing this data frame.
Thus, we can apply it to Bypassing_frame:

alpha(Bypassing_frame)

Doing so will return a lot of information, among which the following:

 raw_alpha std.alpha G6(smc) average_r  S/N    ase mean  sd median_r
      0.92      0.92     0.9      0.74   11 0.0048 -1.1 1.9     0.74

The leftmost number tells us that the "raw" alpha for our measure of bypassing is α = 0.92, which is very satisfying (the "standardized" alpha should be used for measures that combine items using different scales, which is not the case here). The alpha() function also gives us additional information. For example, the section "Reliability if an item is dropped" tells us what the Cronbach's alpha for our measure would be if we abandoned one item. For example, we can see that our alpha would drop to 0.87 if we excluded the second item. Overall, the results suggest that we cannot increase our alpha by dropping one of the items:

4 Quantitative Vignette Studies: Correlations, Regressions, and Structural. . .

Reliability if an item is dropped:
              raw_alpha std.alpha G6(smc) average_r  S/N alpha se  var.r med.r
dat.Decisions      0.91      0.91    0.88      0.77 10.3   0.0057 0.0048  0.74
dat.Want           0.87      0.87    0.83      0.70  6.9   0.0079 0.0030  0.73
dat.Beliefs        0.88      0.88    0.84      0.71  7.2   0.0077 0.0040  0.74
dat.Control        0.91      0.91    0.88      0.78 10.4   0.0055 0.0045  0.75

Table 4.8 Mean (and standard deviation) of participants' Bypassing scores for each universe (A or B) and each case (abstract or concrete)

            Universe A          Universe B
Abstract     0.76 (1.83) 64.8%  −1.73 (1.17)  7.7%
Concrete    −0.94 (1.91) 25.7%  −2.44 (0.97)  3.1%

Percentages indicate the % of participants who gave an answer above the midpoint (0)

Table 4.9 Mean (and standard deviation) of participants' Intrusion scores for each universe (A or B) and each case (abstract or concrete)

            Universe A          Universe B
Abstract    −1.89 (1.48) 11.1%   2.28 (0.85) 97.4%
Concrete    −1.41 (1.87) 21.3%   2.57 (0.76) 99.0%

Percentages indicate the % of participants who gave an answer above the midpoint (0)

Thus, we have computed our composite Bypassing score, and our analyses suggest that it is a measure with good internal coherence. It is now time to compute the means and standard deviations for each condition, as well as the percentage of participants giving an answer above the midpoint. Results are presented in Table 4.8. As can be seen, for Universe A, Bypassing scores are higher in the abstract condition than in the concrete condition. Now, you just have to do the same for Intrusion scores. The items that you need to aggregate to compose the Intrusion score are called dat$Intrusion1, dat$Intrusion2, dat$Intrusion3, and dat$Intrusion4. Try to apply the procedure we applied to the Bypassing items to these items to (i) compute a total Intrusion score that you will simply call "Intrusion," (ii) compute the Cronbach's alpha for this score, and (iii) obtain the results presented in Table 4.9. In case you run into difficulties, the full code can be found in the 3_analysis.txt file.

4.7.4 Transforming the New Measure of Intrusion into a Binary Variable

Finally, there is only one matter left: our new measure of intrusion.
In its current state, this measure (dat$Loop) is a continuous variable ranging from 0 to 100. We could just leave it that way and compute its mean and standard deviation. However, it is not clear that this is the best way to exploit it. After all, a person who answers that John will decide to have French Fries on 98 occasions out of 100 (rather than on 100 occasions out of 100) is already rejecting determinism: is there a point in distinguishing this participant from one who answers that John will decide to have French Fries 50 times out of 100? Rather than situating participants on a continuum ranging from 0 to 100, we might simply distinguish participants who answer 100 from all the others. This means that we need to turn our continuous variable into a binary one (answered 100 vs. did not answer 100). This can be done by creating a new variable that we will call BLoop (for Binary Loop) in the following way:

dat$BLoop <- dat$Loop==100

This will give us a new vector containing TRUE for participants who answered 100 to the original Loop question, and FALSE for participants who did not answer 100. The percentage of TRUE values for each condition can be computed using the as.numeric() function:

mean(as.numeric(dat$BLoop[dat$UNIVERSE=="A" & dat$COND=="ABSTRACT"]))
mean(as.numeric(dat$BLoop[dat$UNIVERSE=="A" & dat$COND=="CONCRETE"]))
mean(as.numeric(dat$BLoop[dat$UNIVERSE=="B" & dat$COND=="ABSTRACT"]))
mean(as.numeric(dat$BLoop[dat$UNIVERSE=="B" & dat$COND=="CONCRETE"]))

The results are presented in Table 4.10.

Table 4.10 Upper line of each cell: mean (and standard deviation) of participants' answers to the original Loop question for each universe (A or B) and each case (abstract or concrete). Bottom line of each cell: percentage of participants who gave an answer equal to 100

            Universe A          Universe B
Abstract    93.4 (18.9) 80.9%   71.2 (24.9) 23.0%
Concrete    90.2 (21.9) 73.8%   70.6 (24.4) 21.1%
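The four mean() calls above can also be collapsed into a single call to aggregate(). Here is a sketch on a simulated stand-in for the data frame (the column names match the chapter's, but the values are made up, so the proportions below are not those of Table 4.10):

```r
# Simulated stand-in for the chapter's data frame (values are made up)
dat_toy <- data.frame(
  UNIVERSE = c("A", "A", "A", "B", "B", "B"),
  COND     = c("ABSTRACT", "ABSTRACT", "CONCRETE", "ABSTRACT", "CONCRETE", "CONCRETE"),
  Loop     = c(100, 90, 100, 100, 50, 100)
)
dat_toy$BLoop <- dat_toy$Loop == 100

# Proportion of participants answering exactly 100, per cell of the design
aggregate(cbind(p100 = as.numeric(BLoop)) ~ UNIVERSE + COND, data = dat_toy, FUN = mean)
```

On the real data frame, replacing dat_toy with dat would reproduce the bottom line of Table 4.10 in one step.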
A look at it suggests that a majority of participants understood that Universe A implied determinism, and that intrusion was not much more frequent in the concrete condition than in the abstract condition.

4.8 Analysis III: Using Correlations to Answer (Q1)

Now that we have a certain idea of what's going on with our data, it's time to start answering our research questions. Our first research question was:

1. Which factors best explain participants' judgments about moral responsibility in Nichols and Knobe's abstract and concrete cases? Is it confusion with bypassing (as measured through Murray and Nahmias' bypassing items)? Or is it the intrusion of metaphysical, indeterministic assumptions about human agency (as measured by Nadelhoffer and colleagues' Intrusion items and our new measure of Intrusion)?

One way to answer this question is to look at correlations between participants' judgments about moral responsibility and their answers to the various Bypassing and Intrusion measures.

4.8.1 Correlations: An Introduction

What is a correlation? A correlation is a way of summarizing the relationship between two variables. Highly correlated variables will tend to go together, being high or low at the same time. To really understand what correlations are, we have to go back to the mathematical definition. The definition can look scary at first, but having a basic representation of what it does is necessary for understanding the value of correlations.

corr(X, Y) = cov(X, Y) / (σ_X σ_Y)    (4.1)

The definition of correlation has two parts: the numerator (on top), which is the covariance between our two variables X and Y, and the denominator (on the bottom), which is the product of the standard deviation of variable X with the standard deviation of variable Y.
For simplicity's sake, we can neglect the denominator: it is a normalizing factor that ensures that the correlation is bounded between −1 and 1, and that the correlation between two variables doesn't change if you choose to multiply one of the variables by a constant. In turn, the formula for the covariance is:

cov(X, Y) = (1 / (n − 1)) Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ)    (4.2)

where n is the sample size, (X_i − X̄) is the difference between the value of X for the ith participant and the average value of X in our sample, and (Y_i − Ȳ) is the difference between the value of Y for the ith participant and the average value of Y in our sample. The Σ_{i=1}^{n} sign tells us that we have to sum up the values (X_i − X̄)(Y_i − Ȳ) for all our n participants. If we look at the covariance, we can see that it grows bigger if the two variables tend to be both higher or both lower than their respective means at the same time. This means that the covariance measures whether the two variables tend to be coordinated, showing deviations above or below their means at the same time. Let's say that you are trying to measure the correlation between romantic partners' heights among French heterosexual couples. The mean height for men is 180 cm, and the mean height for women is 165 cm (Roser et al., 2013). You recruit 100 heterosexual couples. If you take one couple and see that the man is 186 cm tall and the woman is 170 cm tall, then this couple makes a positive contribution to the overall covariance (because both values are above the respective average heights of men and women). Repeat the process for the other 99 couples in the sample, take the average, and you will have the covariance between the heights of men and women in romantic relationships. Divide by the normalizing factor (the denominator of our first equation) and you will get the correlation. A correlation can be hard to interpret unless you have some experience with it.
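Equations (4.1) and (4.2) can be verified directly against R's built-in cov() and cor() functions. A minimal sketch, with made-up heights echoing the example above:

```r
# Heights in cm for four hypothetical couples (toy data)
men   <- c(186, 175, 190, 169)
women <- c(170, 160, 172, 158)
n <- length(men)

# Covariance computed by hand, following Eq. (4.2)
cov_hand <- sum((men - mean(men)) * (women - mean(women))) / (n - 1)

# Correlation computed by hand, following Eq. (4.1)
r_hand <- cov_hand / (sd(men) * sd(women))

all.equal(cov_hand, cov(men, women))   # TRUE
all.equal(r_hand, cor(men, women))     # TRUE
```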
Here are a few ways of getting intuitions about what a specific correlation value means. A correlation is bounded between −1 and 1; −1 indicates a perfect negative correlation (high values of one variable are always linked with low values of the other), and 1 indicates a perfect positive correlation. 0 indicates a lack of correlation: there is no clear tendency for high values of one variable to be linked with low or high values of the other variable. This only gives you an idea about three values (−1, 0, and 1), though. So here are a few additional ways of getting an idea about what correlations mean. The easiest way of getting a feel for correlations is to see visual examples and to acquire a few benchmarks. We have plotted two examples of correlations: a correlation of 0.7 and a correlation of 0.3. The correlation of 0.7 corresponds to the correlation you get if you ask people questions about their personality 7 years apart (Ashton et al., 2019). This is an example of an extremely strong correlation, indicating the stability of personality across adulthood (the personality of children and teenagers is less stable). You will rarely see such strong correlations in the real world. The second correlation, of 0.3, concerns the correlation between how agreeable people are (compassionate, conflict-avoidant) and how humble and honest they are (Roberts and DelVecchio, 2000). The two traits are positively related, but there's a lot of noise. A correlation of 0.3 is still fairly strong (Fig. 4.11). Here are a few other examples of correlations taken from personality psychology: the correlation between agreeableness (being compassionate) and agreeing with the sentence "I am grateful to a wide variety of people" is around 0.4.
The relationship between agreeableness and romantic satisfaction (reporting feeling close to one's partner) is much lower, around 0.2, which is incidentally also the correlation between the heights of men and women in heterosexual couples (Stulp et al., 2013). Finally, the correlation between reporting negative emotions (regularly feeling blue, depressed, and anxious) and reporting a good relationship with one's parents is negative, around −0.2. There is a clear link between the two variables, but not a strong one; of course, you can still have a good relationship with your parents even when you tend to be anxious. Third, we can think of correlations in terms of a prediction of what will happen to the other variable if you change the first variable by one standardized unit. This is the "regression interpretation" (we'll talk more about regressions in Sect. 4.9). According to the regression interpretation, if you normalize your two variables (i.e., you subtract the mean and divide by the standard deviation, so that both variables have a mean of 0 and a standard deviation of 1), then the correlation gives you the best prediction of what will happen to one variable if the other variable increases by one unit. For instance, if you have a correlation of 0.3 between two variables, the regression interpretation predicts that increasing one variable by one standardized unit will increase the other variable by 0.3 standardized units.

Fig. 4.11 Two examples of correlations. Upper half: a correlation of 0.7 (value taken from Ashton et al., 2019). Lower half: a correlation of 0.3 (value taken from Roberts and DelVecchio, 2000). Data points are simulated based on the correct correlation values

Fig. 4.12 A correlation of 0.97 between two variables that is caused by a single outlier

Correlations are just a summary measure of the association between two variables.
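The regression interpretation can be checked numerically: after z-scoring both variables, the slope of the regression of one on the other equals their Pearson correlation. A sketch on simulated data (the 0.3 is an arbitrary choice):

```r
# Simulated data: two noisily related variables
set.seed(1)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

zx <- as.numeric(scale(x))   # z-scores: mean 0, sd 1
zy <- as.numeric(scale(y))

slope <- unname(coef(lm(zy ~ zx))["zx"])
c(slope = slope, r = cor(x, y))   # the two numbers coincide
```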
While there is nothing wrong with correlations per se, they can be misleading in some cases. First, correlations can be particularly misleading if there are one or several outliers. The following graph (inspired by Anscombe, 1973) illustrates that a very strong positive correlation can be driven by a single case that is extremely high (abnormally high!) on both variables (Fig. 4.12). The influence of outliers is not specific to correlations. For instance, it is well known that the mean is not necessarily a good representative of the whole sample, especially in cases of strong inequality. Outliers tend to be a bigger issue for variables that have unbounded values. For instance, the correlation between wealth and well-being could be seriously distorted if you have one billionaire in your sample, and if this billionaire happened to be particularly happy (or sad!). In psychology, response time is another variable where outliers can have a huge effect, for instance, if some participants take an hour to answer a simple question. The risk of outliers is less pronounced for bounded variables, such as Likert scales. Since experimental philosophers mostly use bounded variables, outliers tend not to be such a huge problem. Correlations can also be misleading if the link between two variables is not linear. A correlation of 0 could lead you to think that two variables are independent, but this need not be the case. For instance, if you have a variable that is symmetric around zero, then your two variables could be related in a clear and deterministic way while still having a correlation of zero (see Fig. 4.13; the correlation is zero because the relationship is negative when x is below zero and positive when x is above zero, so that the two trends on average cancel out).

Fig. 4.13 A correlation close to zero between two variables, even though y is simply x squared
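The nonlinearity point is easy to reproduce: below, y is a deterministic function of x, yet the Pearson correlation is essentially zero (a minimal sketch with made-up values):

```r
x <- seq(-3, 3, by = 0.1)   # values symmetric around zero
y <- x^2                    # perfectly determined by x
cor(x, y)                   # essentially 0: the two halves cancel out
```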
This means that you should always plot your data to check the relationship between variables. The use of statistics can be described in two ways. One is purely descriptive: summarizing our data. The other is inferential: making inferences about the possible values of our variables in the whole population. For instance, let's say you recruit 20 people and find that agreeableness and gratitude have a correlation of 0.2 in your sample. That's pretty cool, but it doesn't tell us what we really want to know: what is the correlation between the two variables in the population? If we took a sample of 1,000,000 participants, would we still get approximately the same value, or a very different one? To estimate the real value of the correlation in the population, people make use of inferential statistics. We will delve into two of the most important inferential statistics used in science: the p-value and the confidence interval. To see why we need inferential statistics, we have to realize that the correlation we obtain in our sample will never be exactly the true value: due to random variation between samples, our sample correlation is going to be slightly higher or slightly lower than the true value (and, in some rare cases, much higher or lower). The precision of our estimate mostly depends on one major factor: how big our sample is. The standard deviation of the correlation (a summary of its variability) is approximately proportional to 1/√n, where n is your sample size. In other words, if you want to halve the variability of your estimate, you have to multiply your sample size by four. This means that adding new participants has diminishing returns. If you have a sample size of 5 (not recommended!), then adding 15 participants to get to 20 is going to halve the standard deviation of your estimate. But if you want to halve your standard deviation again, you will have to recruit 60 new participants to get to 80.
And to halve your standard deviation once more, you will need 240 new participants. Getting very high precision on your estimate means going to astronomical numbers. 95% confidence intervals are constructed in such a way that they will include the correct population value about 95% of the time. In other words, if you take a hundred samples and calculate the confidence interval for the correlation each time, your confidence intervals will on average include the true value 95 times and fail to include it 5 times. If you read r = 0.2 [0.1, 0.3], then your best estimate of the population value is 0.2, and intervals constructed this way will contain the true population value 95% of the time. The second most commonly used inferential statistic is the p-value. While the p-value and the confidence interval are closely related from a mathematical point of view, their philosophy and use in science tend to be very different. The confidence interval gives you information about the range where the true value is likely to fall. The p-value, on the other hand, is used to tell you whether the data we have obtained are compatible or incompatible with an absence of association between our two variables, that is, with a correlation of zero in the population. The p-value gives you the probability of observing a correlation at least as extreme as the one you observed, if there were in reality no correlation in the population. For instance, if you have a correlation of 0.3 in your sample, and you obtain a p-value of 0.05, then the p-value says that, if there were in reality a correlation of 0 in the population, we would observe correlations as extreme as ours (in the range between 0.3 and 1, or between −0.3 and −1) only 5% of the time. That is, if there were a zero correlation in the population, we would get a correlation between −0.3 and 0.3 95% of the time, and our high correlation would have been a rare event.
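Returning to the 1/√n rule stated above, a quick simulation (entirely made-up data) shows the standard deviation of the sample correlation roughly halving when the sample size is quadrupled:

```r
# Simulation sketch: the sampling variability of r shrinks roughly as 1/sqrt(n)
set.seed(42)
sim_r <- function(n, reps = 2000) {
  replicate(reps, {
    x <- rnorm(n)
    y <- 0.3 * x + rnorm(n)   # population correlation fixed by construction
    cor(x, y)
  })
}

sd(sim_r(20))   # variability of r with n = 20
sd(sim_r(80))   # with n = 80: about half as large
```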
A p-value below 0.05 indicates that the 95% confidence interval does not include 0. A p-value below 0.01 indicates that the 99% confidence interval does not include 0 either. Our typical tests for computing confidence intervals and p-values assume that our data are normally distributed (that is, follow a bell-shaped distribution). This looks like a very strong assumption, since data in general are not normally distributed. Normal distributions take values between minus infinity and plus infinity, so I think it's safe to say that no variable ever studied by biologists, psychologists, or philosophers has ever been exactly normally distributed. Strangely enough, however, tests assuming normality tend to work pretty well in the real world, and statisticians generally consider that violating the assumption of normality is not the biggest issue when you analyze your data (Gelman et al., 2020). This invitation to violate assumptions might sound dangerous. However, to convince you that it is safe, we calculated 95% confidence intervals for Likert-scale-distributed variables using simulations, varying the sample size, the real correlation between variables, and the variability of our data. In our simulations, the confidence intervals captured the correct population value about 95% of the time, despite strong violations of normality. This confirmed the statisticians' recommendations.9 In case you are curious, the full R code for our simulations can be found in the supplementary materials, in the file named 3_simulations.R.

4.8.2 An Introduction to Correlations in R

In R, correlations between continuous variables can be explored using the function cor.test(). This function allows you to compute the Pearson correlation test between two variables.
For that, you need your two variables to be stored in two vectors of the same length, such that the value in the nth position of the first vector and the value in the nth position of the second vector correspond to the same participant. Fortunately, this is the case in our dataset. Thus, we can compute the correlation between participants' responsibility ratings and Bypassing scores in the following way:

cor.test(dat$Responsibility,dat$Bypassing)

You should obtain:

Pearson's product-moment correlation

data: dat$Responsibility and dat$Bypassing
t = -29.778, df = 770, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7627791 -0.6970150
sample estimates:
       cor
-0.7315945

The strength of the correlation between our two variables is indicated by the number at the bottom, that is, the correlation coefficient r = −0.73. To determine whether the correlation is statistically significant, you need to look at the p-value on the second line. Here it is below 2.2 × 10^−16, which counts as significant by any standard. To report this kind of result, you would write something like: We observed a significant correlation between participants' moral responsibility judgments and Bypassing scores: r(770) = −0.73, t = −29.78, p < 0.001. However, we might have moved too quickly here. Before performing our correlation test, we should have checked whether the assumptions on which this test relies are fulfilled. The two most important assumptions for a correlation test to be meaningful are the following:

9 However, the confidence intervals failed for very high correlations, above 0.9, by being too conservative: they included the true correlation more than 95% of the time. Being too conservative is not a huge issue, though.

Linearity: The relationship between the two variables is linear.
Absence of outliers: Neither variable presents outliers.

Linearity.
One way to assess linearity and absence of outliers is to visually inspect the relationship between our two variables. To do that, we can use the plot() function:

plot(dat$Responsibility~dat$Bypassing)

Unfortunately, not much can be drawn from this plot. Why? Because participants' answers overlap: since answers to the responsibility question can only take seven values, it is very likely that many participants will give the same answer to this question. And, as soon as their Bypassing scores are close enough, their answers on the plot will overlap, displaying only one data point where there should be many. To make the plot more useful, we have to add random noise to the position of our data points, so that they do not all appear in the same place. This is done using the jitter() function:

plot(jitter(dat$Responsibility)~jitter(dat$Bypassing))

Additionally, you can add names to the x-axis, the y-axis, and the whole graph by using the xlab, ylab, and main arguments of the plot() function:

plot(jitter(dat$Responsibility)~jitter(dat$Bypassing),xlab="Bypassing scores",ylab="Responsibility judgments",main="Moral responsibility in function of bypassing")

And you can save the graph as an image on your computer by clicking on it and then going to "File" and "Save as." Looking at the result (see Fig. 4.14), one can indeed see a linear trend: most of the data points are situated on a diagonal going from the upper left to the bottom right of the plot, while areas far from the diagonal tend to be less populated. This is, of course, far from perfect, but it rarely gets better when dealing with human behavior and judgments. Absence of outliers. Absence of outliers can be tested by checking whether certain data points lie further than a certain distance from the mean.
For example, we can decide that outliers are points that are more than three standard deviations away from the mean, and test whether dat$Bypassing contains outliers using the following formulas:

table(dat$Bypassing > mean(dat$Bypassing,na.rm=TRUE) + 3*sd(dat$Bypassing,na.rm=TRUE))
table(dat$Bypassing < mean(dat$Bypassing,na.rm=TRUE) - 3*sd(dat$Bypassing,na.rm=TRUE))

Here, we are looking at how many times it is true that participants' Bypassing scores are higher than the mean plus three standard deviations (first line), or lower than the mean minus three standard deviations (second line). R returns only FALSE, so there is no outlier. This is not very surprising, as answers on scales, which limit the range of results, rarely contain outliers (outliers are more frequent in unconstrained data, such as reaction times).

Fig. 4.14 Moral responsibility judgments as a function of Bypassing scores

We now have to do the same for responsibility judgments. You can try it on your own (the answer is in the TXT file). Then, we can compute the correlation between responsibility judgments and Intrusion scores in the following way:

cor.test(dat$Responsibility,dat$Intrusion)

You should obtain:

Pearson's product-moment correlation

data: dat$Responsibility and dat$Intrusion
t = 23.752, df = 770, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6075984 0.6892165
sample estimates:
      cor
0.6502799

Table 4.11 Pearson correlations between Bypassing/Intrusion scores and the various measures of moral responsibility
            Bypassing  Intrusion
Resp.       −0.73***    0.65***
Blame       −0.75***    0.60***
Free will   −0.73***    0.79***
Up          −0.77***    0.75***

*** indicates p < 0.001

which can be reported in the following way: We observed a significant correlation between participants' moral responsibility judgments and Intrusion scores: r(770) = 0.65, t = 23.75, p < 0.001. Now, you can try to compute the correlations between Bypassing/Intrusion scores and the three other measures of free will and moral responsibility (dat$Blame, dat$FreeWill, and dat$Up) to obtain the same results as in Table 4.11. As one can see, correlations with Bypassing scores are slightly higher for responsibility and blame judgments, while correlations with Intrusion scores are highest for free will judgments. Correlation coefficients can be compared with each other. For example, the correlation between responsibility judgments and Bypassing scores and the correlation between responsibility judgments and Intrusion scores can be compared using a Williams' test (a test valid only for correlations that share one common variable, here responsibility judgments). For that, we need to store our two correlations of interest, as well as the correlation between Bypassing scores and Intrusion scores, in variables:

rRB <- cor.test(dat$Responsibility,dat$Bypassing)
rRI <- cor.test(dat$Responsibility,dat$Intrusion)
rBI <- cor.test(dat$Bypassing,dat$Intrusion)

Now that our correlation tests have been stored in variables, we can obtain the correlation coefficient corresponding to each test by adding $estimate after the name of the variable, for example:

rRI$estimate

To perform our Williams' test, we will use the r.test() function that is part of the psych package:

library(psych)
r.test(n = length(dat$Bypassing), r12 = rRB$estimate, r13 = rRI$estimate, r23 = rBI$estimate)

The function takes four arguments:
• n, the number of data points (participants).
We could simply type in 772, but we can also enter length(dat$Bypassing), which will return the number of observations in dat$Bypassing.
• r12, the correlation coefficient for the first of the two correlations we want to compare (here, the correlation between responsibility judgments and Bypassing scores).
• r13, the correlation coefficient for the second of the two correlations we want to compare (here, the correlation between responsibility judgments and Intrusion scores). This correlation must share one variable with the first correlation (here, responsibility judgments).
• r23, the correlation coefficient for the correlation between the two variables that are not common to the two correlations we want to compare (here, the correlation between Bypassing and Intrusion scores).

You should obtain the following result:

Correlation tests
Call:[1] "r.test(n = 772 , r12 = -0.731594454607713 , r23 = -0.639496604754744 , r13 = 0.650279930558832 )"
Test of difference between two correlated correlations
t value -32.87 with probability < 1.2e-148

The p-value (1.2e-148) is below standard thresholds (e.g., p < 0.05). This means that the correlation between responsibility judgments and Bypassing scores is significantly stronger (in absolute value) than the correlation between responsibility judgments and Intrusion scores (Table 4.12).

Table 4.12 Pearson correlations between Bypassing/Intrusion/Loop scores and the various measures of moral responsibility

            Bypassing  Intrusion  Loop       BLoop
Resp.       −0.73***    0.65***   −0.30***   −0.40***
Blame       −0.75***    0.60***   −0.28***   −0.37***
Free will   −0.73***    0.79***   −0.38***   −0.50***
Up          −0.77***    0.75***   −0.34***   −0.46***

*** indicates p < 0.001

Finally, we would like to know the relationship between our new measure of Intrusion (BLoop) and moral responsibility judgments.
However, one problem is that our new measure of Intrusion is binary (TRUE or FALSE), and thus does not fit one of the requirements of Pearson correlations: that variables be continuous. Ideally, we should use other approaches, such as regression, but those won't be introduced until the next section. Still, in a lot of cases, using a Pearson correlation can give us a good idea of the relationship between the two variables. We can thus use the as.numeric() function to convert BLoop to a numerical variable (0 = FALSE, 1 = TRUE) and compute its Pearson correlation with moral responsibility judgments:

cor.test(dat$Responsibility,as.numeric(dat$BLoop))

4.8.3 From Correlations to Regressions

Correlations are useful, but they are quite limited. They can only summarize the relationship between two variables, and will fail if we try to tackle broader issues. One of the main limitations of correlations is that they can suffer from confounds: two variables could be associated because they are both caused by a third variable. For instance, getting good grades at school could be linked with health status because both are caused by parental wealth. In our case, we could imagine that bypassing is the only variable affecting people's judgments about free will, but that bypassing also causes higher beliefs in determinism (BLoop) or lower intrusion of indeterminist intuitions (Intrusion). In this case, we might have a positive correlation between intrusion and belief in free will that is exclusively caused by judgments about bypassing. In such cases, the first step is to investigate whether such worries are warranted. They won't be if our different predictors (Bypassing, Intrusion, and BLoop) are not strongly associated with each other.

Table 4.13 Pearson correlations between the main predictor variables

            Bypassing  Intrusion  BLoop
Bypassing   –          –          –
Intrusion   −0.64***   –          –
BLoop        0.38***   −0.61***   –
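The third-variable worry can be made concrete with a small simulation (entirely made-up variables, not the chapter's): z causes both x and y, so x and y correlate even though neither causes the other, and a regression that holds z constant makes the spurious association vanish.

```r
# Confounding sketch: z drives both x and y (all variables made up)
set.seed(7)
z <- rnorm(500)         # the confounder
x <- z + rnorm(500)     # x is caused by z
y <- z + rnorm(500)     # y is caused by z, never by x

cor(x, y)                  # clearly positive, despite no causal link x -> y
coef(lm(y ~ x + z))["x"]   # near zero once z is held constant
```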
Thus, we should look at the correlations between these different variables, using the cor.test() function. The results are presented in Table 4.13. As can be seen there, all our main predictor variables are significantly correlated with each other. This in itself is not problematic, but some of the correlations are quite high (> 0.60). This suggests that some of the correlations we observed between moral responsibility and (for example) Intrusion scores might be due to the fact that both moral responsibility scores and Intrusion scores are correlated with Bypassing and BLoop scores, rather than to a direct causal relationship between Intrusion scores and moral responsibility judgments. Additionally, it should be noted that the correlation between Intrusion and Bypassing scores (r = −0.64) is slightly stronger, in absolute value, than the relationship between Intrusion and BLoop scores (r = −0.61). This is surprising, given that Intrusion and Bypassing scores are supposed to measure different constructs, while Intrusion and BLoop scores are supposed to measure the same construct. These results suggest that our criticism of the Intrusion measure might be on the right track, and that certain participants interpret the Intrusion measure in a way that is reminiscent of bypassing (i.e., as being about the conditional ability to do otherwise), rather than as bearing on the unconditional ability to do otherwise. Given these results, we need to find a way of controlling for this confound, to determine the exact causal effect of Loop, Intrusion, and Bypassing. To estimate causality, the gold standard is the use of experiments in which each variable is manipulated independently, allowing researchers to estimate their respective contributions to the final outcome. However, as we saw in Sect.
4.2, the problem faced by experimental philosophers is that it is hard to independently manipulate intuitions about bypassing and intuitions about the indeterminacy of human agency: it seems that attempting to manipulate one of these variables also has an effect on the others.

4 Quantitative Vignette Studies: Correlations, Regressions, and Structural. . .

In such less-than-ideal situations, linear regression is one way to estimate the causal effect of one variable while keeping other variables constant. It is not the gold standard for estimating causality (see Hernán and Robins, 2023, for more modern methods), but it is the first step of most analyses that you will meet in scientific journals. As such, it is essential to learn how to use it. This is why the next section is dedicated to the use of regression methods.

4.9 Analysis IV: Using Regressions to Answer (Q1)

Linear regression comes in two flavors: simple linear regression, where we predict the outcome variable with a single predictor, and multiple regression, where we use several predictors. The usefulness of regression really shines through in the case of multiple regression. However, to take things step by step, we will first consider linear regression with a single predictor.

4.9.1 What Is Linear Regression?

Linear regression is one of the most versatile tools used in statistics. Under certain assumptions, it allows you to estimate the impact of one variable on another variable. In our case, it will allow us to estimate what happens to responsibility judgments when the BLoop answer goes from 0 to 1. In a simple model, we have a single predictor, x (BLoop in our case), and a single predicted variable, y. We make the assumption that there is a simple relationship between the two variables: the predicted variable is equal to a constant, b0, to which we add the value of x multiplied by a coefficient b1, plus a random error term.
In mathematical terms, y = b0 + b1 × x + ε (where b0 is the intercept, b1 the slope coefficient, and ε is a random error term that varies for each data point). Our prediction is that there is a simple relationship between x and y: if the predictor variable is a cause of y, then increasing the predictor x by one unit will increase the predicted variable y by b1 units. In other words, we are assuming that there is a linear relationship between x and y: if you move x by one unit, then we would expect y to move by b1 units; if we move x by two units, we would expect y to move by 2 × b1 units. In a simple regression, we are trying to estimate two parameters: one parameter for the intercept, which is b0, and another parameter for the slope, which is b1. The intercept represents the value of the predicted variable when the predictor variable takes the value 0. In the case of an experiment with only two experimental conditions, the control condition will generally be coded as 0, and the experimental condition as 1. The interpretation of the intercept is then very simple: it is the value of the dependent variable in the "default" or "control" condition. The slope coefficient (b1) then represents the difference between the two conditions. If the predictor variable can take more than two values, the intercept still represents the value of the predicted variable when the predictor variable takes the value 0, but we have to be careful that the zero value is itself meaningful. For instance, the traditional Likert scale format allows answers on a scale from 1 to 7, so 0 is not a possible answer, but your linear regression will still try to evaluate what would happen if your Likert scale were set to 0! Be careful about trying to interpret the value of the intercept when the predictor variable has no meaningful zero.
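The interpretation of intercept and slope in the two-condition (0/1) case can be checked with a quick simulation (our own sketch, not the chapter's data):

```r
# Simulated two-condition experiment: control coded 0, experimental coded 1.
set.seed(2)
x <- rep(c(0, 1), each = 500)
y <- 2 + 1.5 * x + rnorm(1000)  # true intercept 2, true slope 1.5

fit <- lm(y ~ x)
coef(fit)

# With a single 0/1 predictor, the intercept is exactly the control-group
# mean, and the slope is exactly the difference between the two group means:
mean(y[x == 0])
mean(y[x == 1]) - mean(y[x == 0])
```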
If the zero value of the predictor variable has no clear interpretation, then it can be useful to center the predictor variable by subtracting the mean value of x from the value of x for each data point, so that the intercept represents the value of the predicted variable when the predictor variable takes its mean value. There are two kinds of regression coefficients: raw and standardized regression coefficients. Raw coefficients correspond to what we have just seen (what happens to our predicted variable y when we move the predictor variable x by one unit). A standardized coefficient has the same interpretation, but in terms of standardized units, where we have subtracted the mean of both the predictor and predicted variables and divided each of them by their respective standard deviations. In a linear regression, we make several assumptions that allow us to estimate the coefficients and their confidence intervals. The most important assumption is that there is a linear relationship between the predictor variable and the predicted variable. But it is easy to imagine cases where this linearity assumption would be terribly wrong. A simple case would be one with a ceiling effect, or with diminishing returns. Let's imagine that we are trying to model how much blame we are going to put on someone, based on how many people they have killed. We can assume that there will be a huge difference between killers and non-killers, but very little difference between the blame people attribute to someone who has killed 15 people and the blame they attribute to someone who has killed 16 people. In this case, the relationship is not linear because adding one more murdered person has a very different impact on blame when one goes from zero to one (a huge impact!) and when one goes from 15 to 16 (a very small impact).
Our linear regression would predict that both cases would be the same, and that there would be the same gap between an innocent person and a murderer as between two serial killers. That is unpalatable. To check that the relationship between the predictor and predicted variables is roughly linear, you should always plot your data. A second set of demanding assumptions in linear regression concerns the residuals. A residual is the gap between the prediction made by our model and the actual value. For instance, in the case of our BLoop variable, if our linear model says that y = 2.5 + 1.7x (where 2.5 corresponds to the intercept, or b0, and 1.7 corresponds to the slope, or b1), then we are predicting that people will have the value 2.5 if they did not accept the assumption of a deterministic universe (and thus had a value of 0) and 2.5 + 1.7 = 4.2 if they accepted the assumption of a deterministic universe (and thus had a value of 1). In the case of a participant who obtained a score of 0 on BLoop but gave a score of 7 in judgments of free will, the residual would be 7 − 2.5 = 4.5. Computing the difference between all participants' answers and the predictions of our model gives us the distribution of residuals. Our linear model assumes that the residuals are independent, have the same variance at all values of x, and are normally distributed. The independence of residuals will probably hold in any sample you collect online (but could fail if, for instance, you collect data from several members of a family).
The assumption that your residuals are normally distributed will almost certainly not hold in your sample, but, as in the case of correlations, linear regressions are quite robust to violations of normality.10 Equality of variance at all values of x means that there is no value of x at which the residuals are tightly packed while being spread out more widely at another value of x. This assumption is more complicated than the assumptions of independence and normality: it can be violated, and violating it can have bad consequences. We ran some simulations to check what could happen if the assumption of homogeneity of variance was strongly violated with Likert-scale items. We found that, in this case, 95% confidence intervals captured the true value only 89% of the time. This means that the error rate is roughly doubled! However, the violation of homogeneity of variance in our simulations was quite drastic, so you are unlikely to encounter this situation in real life. Overall, keep in mind that the assumptions of statistical tests can be violated and that this can increase the error rate.

4.9.2 Linear Regression Models in R

Let's now apply linear regression to our data. Earlier, we tested the relationship between responsibility judgments and Bypassing scores using a correlation test in the following way:

cor.test(dat$Responsibility, dat$Bypassing)

But we can also investigate this relationship using a linear regression model. To build the model and store it in a variable, we use the lm() function:

lmRB <- lm(dat$Responsibility ~ dat$Bypassing)

As you can see, the regression model is specified within the parentheses using the symbol ~, which can be read as "as a function of." The variable to the left of this sign (dat$Responsibility) is the dependent variable, the one we seek to explain, while the variable to the right of it (dat$Bypassing) is our predictor, the one we use to explain the dependent variable (Fig. 4.15).
10 We checked again with simulations based on Likert-scale items and found that 95% confidence intervals captured the true value 94% of the time. Not too bad!

Fig. 4.15 Moral responsibility judgments as a function of Bypassing scores. The red line corresponds to a regression model with moral responsibility judgments as dependent variable and Bypassing scores as predictor

If you type:

lmRB

R will return the values for the intercept (b0) and the coefficient (b1) of your model:

Call:
lm(formula = dat$Responsibility ~ dat$Bypassing)

Coefficients:
  (Intercept)  dat$Bypassing
       0.6316        -0.8101

You can display the line corresponding to this model using the plot() and then abline() functions:

plot(jitter(dat$Responsibility) ~ jitter(dat$Bypassing), xlab = "Bypassing scores", ylab = "Responsibility judgments", main = "Moral responsibility in function of bypassing")
abline(lmRB, col = "red")

Finally, we can test our model by applying the summary() function to it:

summary(lmRB)

This should provide the following information:

Call:
lm(formula = dat$Responsibility ~ dat$Bypassing)

Residuals:
    Min      1Q  Median      3Q     Max
-6.0618 -0.4543 -0.0618  0.7482  4.7986

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.63161    0.06029   10.48   <2e-16 ***
dat$Bypassing -0.81007    0.02720  -29.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.463 on 770 degrees of freedom
Multiple R-squared:  0.5352,    Adjusted R-squared:  0.5346
F-statistic: 886.7 on 1 and 770 DF,  p-value: < 2.2e-16

Let's first focus on the "Coefficients" part of these results. This details the results for the different components of your model: the intercept and the predictor (dat$Bypassing). For each of them, it presents an estimate, a standard error for this estimate, a t-value, and a p-value for the corresponding test.
For example, the p-value for the intercept is < 2e-16, which means that our intercept differs significantly from 0. But, more importantly for our purpose, the p-value for dat$Bypassing is < 2e-16, which means that dat$Bypassing significantly predicts participants' responsibility judgments. We can report these results in the following way: A linear regression analysis with responsibility judgments as dependent variable and Bypassing scores as predictor revealed that Bypassing scores were a significant predictor of responsibility judgments: B = −0.81, SE = 0.027, t = −29.78, p < 0.001. The last three lines of the output describe the properties of the whole model. The last line indicates that the model significantly differs from chance and thus predicts the dependent variable (dat$Responsibility) better than chance. The second line provides you with the multiple R2 of your model, which can be interpreted as the share of variance in the dependent variable (dat$Responsibility) that can be explained by your predictor (dat$Bypassing). Interestingly, this R2 should correspond to the result you would obtain by squaring the coefficient of the correlation between dat$Responsibility and dat$Bypassing:

rRB$estimate^2

Overall, this means that we can predict more than half of the variance in participants' responsibility judgments based on Bypassing scores alone. This is quite high. Now, you can do the same for the relationship between responsibility judgments and Intrusion scores, using:

lmRI <- lm(dat$Responsibility ~ dat$Intrusion)
summary(lmRI)

And, finally, we can explore the relationship between responsibility judgments and our binary variable BLoop by testing a third model with BLoop as predictor:

lmRL <- lm(dat$Responsibility ~ dat$BLoop)
summary(lmRL)

As you will see, the results indicate that BLoop is a significant predictor of responsibility judgments:

Call:
lm(formula = dat$Responsibility ~ dat$BLoop)

Residuals:
    Min      1Q  Median      3Q     Max
-5.3590 -0.6361  0.6410  1.3639  2.3639

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    2.35897    0.09949   23.71   <2e-16 ***
dat$BLoopTRUE -1.72285    0.14143  -12.18   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.965 on 770 degrees of freedom
Multiple R-squared:  0.1616,    Adjusted R-squared:  0.1605
F-statistic: 148.4 on 1 and 770 DF,  p-value: < 2.2e-16

Because dat$BLoop is not numerical but composed of FALSE and TRUE, the lm() function automatically turns these logical values into numerical values, so that FALSE = 0 and TRUE = 1. Thus, the negative regression estimate (−1.72285) indicates that participants with a TRUE value (i.e., participants who answered "100" to the original question) tended to attribute less moral responsibility. Now, you should be able to reproduce Table 4.14, which summarizes the results of the regression models using Bypassing, Intrusion, or BLoop scores as their respective predictors. All three predictors significantly predict participants' responsibility judgments.
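Several quantities in these summary outputs can be recomputed by hand, which is a good way to understand them; here is a self-contained sketch with simulated data (not the chapter's):

```r
# Simulated data with a single predictor.
set.seed(5)
x <- rnorm(150)
y <- -0.8 * x + rnorm(150)

fit <- lm(y ~ x)

# The "Residuals" section of summary() describes the observed values
# minus the fitted values:
res <- residuals(fit)
all.equal(res, y - fitted(fit), check.attributes = FALSE)
summary(res)  # quartiles (plus mean) of the residual distribution

# With one predictor, multiple R-squared equals the squared Pearson r:
summary(fit)$r.squared
cor(x, y)^2
```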
Now, can we compare these predictors with each other to identify the "best" predictor of responsibility judgments?

Table 4.14 Results for three linear regression models with responsibility judgments as dependent variable and Bypassing scores, Intrusion scores, or BLoop scores as single predictor

Predictor   B (Estimate)   SE      t        p
Bypassing   −0.81          0.027   −29.78   < 0.001
Intrusion    0.57          0.024    23.75   < 0.001
BLoop       −1.72          0.141   −12.18   < 0.001

By comparing the estimates, it seems BLoop is the "best" predictor, with an estimate of −1.72, compared to −0.81 and 0.57. However, that would be the wrong conclusion: estimates (or regression coefficients) cannot be compared in this way. A regression coefficient indicates the average increase in the dependent variable (dat$Responsibility) for a one-point increase in the predictor variable. It might thus make sense to compare regression coefficients for two predictors that are measured on the same scale (such as Bypassing and Intrusion, which are measured on the same 7-point scales), but it stops making sense when the predictors are measured in completely different ways (such as Bypassing, which is measured on a 7-point scale, and BLoop, which is a binary variable). If we want to compare regression coefficients, we need to compute standardized regression coefficients (also called beta coefficients). Standardized regression coefficients are the estimates obtained through a regression analysis in which the underlying data have been standardized so that the variances of the predictor and predicted variables are equal to 1. To put it otherwise, standardized regression coefficients are obtained by transforming all variables so that they are measured on comparable scales. This means that standardized coefficients indicate by how many standard deviations the predicted variable will change for each increase of one standard deviation in the predictor variable.
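Both centering and standardization can be done by hand in base R; a minimal sketch with simulated data (variable names are ours; for a single predictor, the standardized slope equals the Pearson correlation):

```r
# Simulated predictor on a 1-7 scale (no meaningful zero) and an outcome.
set.seed(6)
x <- sample(1:7, 300, replace = TRUE)
y <- -0.8 * x + rnorm(300)

# Centering: the intercept becomes the predicted value of y at the mean
# of x (which, in a simple regression, is simply mean(y)).
xc <- x - mean(x)
coef(lm(y ~ xc))["(Intercept)"]
mean(y)

# Standardized (beta) coefficient, two equivalent ways:
beta1 <- coef(lm(scale(y) ~ scale(x)))[2]    # standardize, then fit
beta2 <- coef(lm(y ~ x))[2] * sd(x) / sd(y)  # rescale the raw slope
c(beta1, beta2)
```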
To compute standardized coefficients for our regression analyses, we will need the lm.beta package:

install.packages("lm.beta")

And, to compute standardized coefficients for a model, we just need to apply the lm.beta() function to this model (here, the BLoop model):

library(lm.beta)
lm.beta(lmRL)

You will obtain the beta (standardized) coefficients for the model:

Call:
lm(formula = dat$Responsibility ~ dat$BLoop)

Standardized Coefficients::
  (Intercept)  dat$BLoopTRUE
    0.0000000     -0.4019725

And you can thus complete Table 4.14 to obtain Table 4.15. As we can see, the "best" predictor is Bypassing scores.

Table 4.15 Results and standardized regression coefficients (β) for three linear regression models with responsibility judgments as dependent variable and Bypassing scores, Intrusion scores, or BLoop scores as single predictor

Predictor   B (Estimate)   β       SE      t        p
Bypassing   −0.81          −0.73   0.027   −29.78   < 0.001
Intrusion    0.57           0.65   0.024    23.75   < 0.001
BLoop       −1.72          −0.40   0.141   −12.18   < 0.001

However, once again, we analyzed the relationship between moral responsibility judgments and each predictor (Bypassing, Intrusion, and BLoop) separately. We thus still face the problem we raised at the end of Sect. 4.8: we fail to take into account the fact that these different predictors might be related to each other. This is why we now need to turn to multiple regression analysis.

4.9.3 What Is Multiple Linear Regression?

In the previous section, we examined the relationship between participants' responsibility judgments and three potential predictors (Bypassing, Intrusion, and BLoop scores) separately. However, it is possible to get a better picture of the weight of these respective predictors by entering all of them into a single model. This is called multiple linear regression. There are two main cases of multiple regression: linear regressions with and without interactions between independent variables.
Without interactions, we assume that each variable has its impact independently of all the others. With interactions, on the contrary, the impact of one variable depends on the value of the other variable, and both variables act in combination. Let's first take a model with two predictors (X1 and X2), without any interaction. Our mathematical model now looks like:

Y = b0 + b1 × X1 + b2 × X2 + ε.    (4.3)

Here, b1 corresponds to the coefficient for our first predictor and b2 to the coefficient for our second predictor. b0 is still the intercept and corresponds to the value of the predicted variable when both predictors take the value 0. ε is still an error term that reflects the fact that Y (the predicted variable) is unlikely to be fully determined by our two predictor variables. The interpretation of linear regressions without interactions is very similar to the univariate case. We are trying to estimate what would happen to the predicted variable if we moved only one predictor variable by one unit and kept all other variables constant. Let's say, for instance, that you are trying to evaluate the impact of parental income and personal education on personal income in adulthood. In this case, the coefficient for education would represent what would happen to offspring income if we moved education by one unit and kept parental income constant. The same can be said for parental income. Something has to be added, however: it is not always possible to move one variable by one unit without moving the other. For instance, in our example, it is implausible that increasing parental income would have no impact on offspring education. In this case, increasing parental income would have two kinds of impact on offspring income: one indirect impact via a better education and a direct impact that is not mediated by education (see the causal diagram in Fig. 4.16; the path including arrows a and b represents the indirect effect, and the path including arrow c represents the direct effect).

Fig. 4.16 Parental income has two different kinds of effect on offspring income: one direct effect (arrow c) and an indirect effect mediated by offspring education (arrows a and b)

A note on what we mean by direct effect: the effect will, of course, be mediated by further mechanisms, but those are not included in our model. In other words, the direct effect corresponds to all the mediated effects that we are not interested in for the purpose of our study, plus any real unmediated direct effect. For instance, we can imagine that rich parents are more likely to lend their children money to start up their own business, and that this explains part of the relationship between parental and offspring income. In this case, the impact of parental income on offspring income would still be mediated by a mechanism (i.e., money transfers), but this mechanism is not included in our model (we do not have a variable directly measuring how much money parents give their children to help them set up their own business). Thus, in this case, the direct effect of parental income would encompass money transfers from parents to offspring along with all the other means that are not related to providing a better education. Still, even though what we call the "direct" effect will typically be mediated via other mediators, it is still "direct" in a statistical sense (as long as we do not enter the relevant variables into our model). The distinction between direct and indirect effects is crucial for understanding the meaning of regression coefficients. If we have a path diagram like the one shown in Fig.
4.16, then the linear regression gives the estimates of the arrows pointing directly from the predictors to the predicted variable (i.e., direct effects), but eliminates indirect arrows. In other words, looking up coefficients in a regression table would be misleading if we thought that the coefficient for our parental income variable was the total causal effect of parental income on offspring income. Instead, it is just the direct effect, not including any indirect effect via other variables included in the model (in our case, offspring education). In other words, if a variable has a causal effect that is mediated by other variables that are included in the model, then the regression coefficients are going to underestimate the total causal effect of that variable. To compute both the direct and the total causal effects of our variable, we will have to use mediation models and structural equation models (see below). Now we can discuss the meaning of interactions in regression. In the case of interactions, we are assuming that the impact of our two variables is multiplicative: the higher one variable is, the bigger the impact of the other will be. From a mathematical point of view:

Y = b0 + b1 × X1 + b2 × X2 + b3 × X1 × X2 + ε

(where b3 corresponds to the interaction term and gets multiplied by both X1 and X2). Imagine that X1 has the value zero. In this case, adding one unit to the other predictor, X2, will only have an effect through the main effect b2, because the interaction term (b3 × X1 × X2) will always be 0. On the other hand, if X1 has the value 1, then adding one unit to the other predictor, X2, will increase the dependent variable by both the main effect (b2) and the interaction term (b3), because the equation then becomes Y = b0 + b1 × 1 + b2 × X2 + b3 × 1 × X2 + ε.
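This arithmetic can be verified by simulating data with a known interaction term and checking that lm() recovers it (our own sketch; in an R formula, * expands to the main effects plus the interaction):

```r
# Simulate Y = b0 + b1*X1 + b2*X2 + b3*X1*X2 + error, with b3 = 0.8.
set.seed(7)
n  <- 5000
x1 <- rbinom(n, 1, 0.5)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + 0.8 * x1 * x2 + rnorm(n)

fit <- lm(y ~ x1 * x2)  # equivalent to y ~ x1 + x2 + x1:x2
coef(fit)  # estimates close to the true values 1, 0.5, 0.3, and 0.8
```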
If the first variable, X1, has the value 2, then adding one unit to the other predictor, X2, will increase the dependent variable by the main effect (b2) plus twice the interaction term. The interaction term is easy to interpret if our two variables are categorical and only take the values 0 and 1. For instance, let's say that we have two experimental conditions and two genders, male and female, with male coded as 0 and female as 1. In this case, the coefficient of the experimental condition corresponds to the impact of the experiment for male participants only. The gender coefficient corresponds to the impact of gender in the control condition only. The interaction coefficient corresponds to the additional impact of being a female participant in the experimental condition.

4.9.4 Multiple Regression in R

To run a multiple regression analysis in R, the syntax is the same as the one we used for the linear regression models of Sect. 4.9.2. The only difference is that several predictors must be specified to the right of the ~ symbol. In a model without interactions, predictors are separated by a + symbol (but, if you want your model to include the interaction between several predictors, these predictors must be separated by the * symbol). For example, if we want to run a regression model in which moral responsibility ratings are predicted by Bypassing, Intrusion, and BLoop, we must specify:

lmR3 <- lm(dat$Responsibility ~ dat$Bypassing + dat$Intrusion + dat$BLoop)
summary(lmR3)
lm.beta(lmR3)

The summary() function will return the following output:

Call:
lm(formula = dat$Responsibility ~ dat$Bypassing + dat$Intrusion + dat$BLoop)

Residuals:
    Min      1Q  Median      3Q     Max
-5.9083 -0.5968 -0.1613  0.7154  4.8387

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.80785    0.08871   9.107   <2e-16 ***
dat$Bypassing -0.59198    0.03321 -17.826   <2e-16 ***
dat$Intrusion  0.25963    0.03072   8.451   <2e-16 ***
dat$BLoopTRUE -0.09169    0.12433  -0.738    0.461
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.373 on 768 degrees of freedom
Multiple R-squared:  0.5918,    Adjusted R-squared:  0.5902
F-statistic: 371.2 on 3 and 768 DF,  p-value: < 2.2e-16

As you can see, the coefficients section gives you one line per predictor, with the regression coefficient and the corresponding p-value. The results are reported in Table 4.16. If you compare these results to the ones in Table 4.15, which we obtained by independently assessing the impact of each predictor, you will notice two things. The first is that our regression coefficients tend to be lower (in absolute value); for example, the standardized regression coefficient for Bypassing is now −0.53 while it was −0.73. This is because, as we explained in Sect. 4.9.3, the multiple regression analysis returns an estimate of the direct effect of Bypassing on moral responsibility ratings, while discounting the indirect effect due to the fact that Bypassing correlates with other constructs that also predict moral responsibility ratings (here, Intrusion and BLoop). The second thing you will notice is that, in Table 4.16, BLoop is no longer a significant predictor of moral responsibility ratings (p = 0.461). This means that, after taking into account the impact of the other variables on moral responsibility ratings, we can no longer confirm that BLoop actually predicts participants' moral responsibility ratings. Overall, our results can be interpreted in the following way: both Bypassing and Intrusion scores predict participants' moral responsibility ratings, though Bypassing score is a better predictor than Intrusion rating (β = −0.53 vs.
β = 0.30), suggesting that confusion with Bypassing explains a greater part of the variance in participants' answers than Intrusion. Once Bypassing and Intrusion are taken into account, BLoop no longer significantly predicts participants' moral responsibility judgments, suggesting that whether participants perceive the described universe as deterministic does not influence their moral responsibility ratings per se. Rather, participants judge agents in deterministic universes to be less morally responsible because they perceive these agents as being more "bypassed" and as having less "ability to do otherwise"—or whatever Intrusion scores measure.

Table 4.16 Results and standardized regression coefficients (β) for one multiple linear regression model with responsibility judgments as dependent variable and Bypassing scores, Intrusion scores, and BLoop scores as predictors

Predictor   B (Estimate)   β       SE      t        p
Bypassing   −0.59          −0.53   0.033   −17.83   < 0.001
Intrusion    0.26           0.30   0.031     8.45   < 0.001
BLoop       −0.09          −0.02   0.124    −0.74   0.461

4.10 Analysis V: Using Mediation Analysis and Structural Equation Modeling to Answer (Q2)

Now that we have answered (Q1) and found out that people's attributions of moral responsibility are mainly driven by confusion with bypassing, and that considerations about the indeterminacy of human agency seem to play a minor role, we can turn to (Q2): why do people attribute less moral responsibility in the abstract condition, compared to the concrete condition? Is it because they are more prone to confuse determinism with bypassing in the abstract case? Or because they are more likely to import indeterministic considerations about human agency in the concrete condition?
To answer these questions, we will use a method called "mediation analysis."

4.10.1 Testing the Effect of Condition on Attributions of Free Will and Moral Responsibility

But first, we need to ensure that there is indeed an effect of condition on attributions of free will and moral responsibility, as there would be no point in trying to explain a phenomenon whose existence has not been demonstrated. Since we have four experimental groups and two factors (universe and abstract/concrete formulation), and our two factors are crossed (i.e., each level of one factor occurs in combination with each level of the other factor), the best way to test for the impact of our two factors is to use analysis of variance (or ANOVA, see Chap. 3). In R, this can be done using the aov() function:

aov(dat$Responsibility ~ dat$UNIVERSE*dat$COND)

In this function, you define a model for your ANOVA by first entering the name of the dependent variable (here, dat$Responsibility) and then the names of your factors (here, dat$UNIVERSE and dat$COND). The dependent variable and the factors are separated by the ~ sign, which means something like "as a function of." The different factors are separated by * because we are also interested in testing the effect of their interaction. If you want to ignore the interaction effects, you can just replace * by +. You should obtain the following information about your model:

Call:
   aov(formula = dat$Responsibility ~ dat$UNIVERSE * dat$COND)

Terms:
                dat$UNIVERSE  dat$COND dat$UNIVERSE:dat$COND Residuals
Sum of Squares      991.7313  342.5734              156.3930 2054.2700
Deg. of Freedom            1         1                     1       768

Residual standard error: 1.635491
Estimated effects may be unbalanced

This only gives you general information about your model, but no indication of the effect of each individual factor.
As for linear regression models, you will need to feed your model to the summary() function to get information about statistical significance:

summary(aov(dat$Responsibility ~ dat$UNIVERSE*dat$COND))

You should get the following output:

                       Df Sum Sq Mean Sq F value   Pr(>F)
dat$UNIVERSE            1  991.7   991.7  370.76  < 2e-16 ***
dat$COND                1  342.6   342.6  128.07  < 2e-16 ***
dat$UNIVERSE:dat$COND   1  156.4   156.4   58.47 6.18e-14 ***
Residuals             768 2054.3     2.7
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, we can see that both factors have a significant effect on attributions of moral responsibility, but that there is also a significant interaction effect (as shown by the line that begins with dat$UNIVERSE:dat$COND). This tells us that the impact of formulation (abstract vs. concrete) varies depending on the universe (A or B) and vice versa. To get a better grasp of the nature of this interaction, we can use a multiple regression model with interaction. Indeed, since both our factors (universe and formulation) have two levels, they can be entered in a regression model using the following formula:

summary(lm(dat$Responsibility ~ dat$UNIVERSE*dat$COND))

which gives us a very similar output, with a significant interaction effect:

Call:
lm(formula = dat$Responsibility ~ dat$UNIVERSE * dat$COND)

Residuals:
    Min      1Q  Median      3Q     Max
-5.8505 -0.4082  0.1495  0.7136  3.7136

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                     -0.7136     0.1159  -6.155 1.21e-09 ***
dat$UNIVERSEB                    3.1217     0.1646  18.967  < 2e-16 ***
dat$CONDCONCRETE                 2.2436     0.1675  13.394  < 2e-16 ***
dat$UNIVERSEB:dat$CONDCONCRETE  -1.8013     0.2356  -7.646 6.18e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1

Residual standard error: 1.635 on 768 degrees of freedom
Multiple R-squared:  0.4205, Adjusted R-squared:  0.4182 
F-statistic: 185.8 on 3 and 768 DF,  p-value: < 2.2e-16

As you can see, the regression coefficient for the interaction is negative, suggesting that the effect of one factor is lower when the other factor is higher. Since R automatically recoded our factors such that Universe A = 0 and Universe B = 1, and Abstract = 0 and Concrete = 1, we can interpret this as meaning that the difference between the abstract and concrete conditions is smaller in Universe B (i.e., when the value for universe is higher). However, in this case, the nature of this interaction can be very easily deduced by looking at the results in Table 4.9 (which we reproduce in Table 4.17): the difference in moral responsibility scores between the abstract and concrete formulations is much higher for Universe A (1.53 + 0.71 = 2.24) than for Universe B (2.85 − 2.41 = 0.44). In fact, it is five times higher. You can now do the same thing for free will ratings. Try it by yourself (but the code is available in the 3_analysis.txt file if you do not succeed). You will find a very similar pattern. For this reason, and also because much of the past scientific discussion has focused on the difference between abstract and concrete instructions in the context of deterministic universes, we will focus here on the effect of instructions in Universe A.

Table 4.17 Mean (and standard deviation) of participants' moral responsibility ratings for each universe (A or B) and each case (abstract or concrete). Percentages indicate the proportion of participants who gave an answer above the midpoint (0)

           Universe A           Universe B
Abstract   −0.71 (2.07) 30.2%   2.41 (0.92) 97.4%
Concrete    1.53 (2.26) 71.6%   2.85 (0.75) 97.4%

Thus, we first need to create a new data frame that excludes all
participants who were asked about Universe B and only includes those who were asked about Universe A. This can be done in the following way:

93 datUA <- dat[dat$UNIVERSE=="A",]

You can check the number of participants (rows) in your new data frame (= 382):

94 nrow(datUA)

And you can check whether the effect of instruction (abstract vs. concrete) is still present:

95 summary(aov(datUA$Responsibility ~ datUA$COND))

Or:

96 summary(lm(datUA$Responsibility ~ datUA$COND))

You will see that it is:

             Df Sum Sq Mean Sq F value Pr(>F)    
datUA$COND    1  479.9   479.9   102.4 <2e-16 ***
Residuals   380 1780.3     4.7                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.10.2 Mediation Analysis with a Single Mediator

To determine whether the impact of instruction (abstract vs. concrete) on moral responsibility ratings is entirely or partly due to the impact of instruction on a third variable that also has an effect on moral responsibility ratings (for example, Bypassing scores), we need to run a mediation analysis. The basic notions of mediation analysis are illustrated in Fig. 4.17.

Running a mediation analysis is only appropriate when you want to put to the test a certain hypothesis about causality. This hypothesis must have the following form: the causal effect of a certain factor (or independent variable) on a certain dependent variable is (partly or totally) explained by the fact that our independent variable has an effect on a third variable, called the mediator variable, which in turn has an effect on the dependent variable.

Fig. 4.17 Basic notions for mediation analysis
Fig. 4.18 Representation of a specific mediation hypothesis: confusion with bypassing mediates the effect of instruction (abstract vs. concrete) on moral responsibility ratings
In our case, such a hypothesis would be that the causal effect of instruction (= independent variable) on moral responsibility ratings (= dependent variable) is partly or totally due to the fact that instruction increases participants' confusion with bypassing (= mediator variable), which in turn decreases participants' moral responsibility ratings. This specific hypothesis is illustrated in Fig. 4.18.

How do we put this specific hypothesis to the test? The first step is to show that there is indeed an effect of the independent variable (instruction) on the dependent variable (moral responsibility). We already showed that this was the case using ANOVA. However, it is customary to use linear regression analysis in mediation analyses, as this will give us a precious piece of information: the estimate of the total effect of the independent variable. Thus, we begin by assessing the effect of the independent variable on the dependent variable in the following way:

97 summary(lm(datUA$Responsibility ~ datUA$COND))

You should obtain the following output:

Call:
lm(formula = datUA$Responsibility ~ datUA$COND)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5301 -1.5301  0.7136  1.4699  3.7136 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -0.7136     0.1534  -4.651 4.57e-06 ***
datUA$CONDCONCRETE   2.2436     0.2217  10.121  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.164 on 380 degrees of freedom
Multiple R-squared:  0.2123, Adjusted R-squared:  0.2103 
F-statistic: 102.4 on 1 and 380 DF,  p-value: < 2.2e-16

Fig. 4.19 Representation of a specific mediation hypothesis, updated with the estimate of the total effect of the independent variable on the dependent variable

Unsurprisingly, we have a significant effect of our independent variable (datUA$COND) on our dependent variable (datUA$Responsibility).
But this output brings us one key piece of information: the regression coefficient gives us the total effect of our independent variable on our dependent variable (here, B = 2.2436). We can thus update our model as shown in Fig. 4.19.

The second step is to test whether the independent variable (instruction) indeed has a significant effect on the mediator variable (confusion with bypassing). Indeed, if the independent variable has no effect on the mediator variable, then our mediation hypothesis is probably false. Here, this can be done using the following regression model:

98 summary(lm(datUA$Bypassing ~ datUA$COND))

You will obtain the following output:

Call:
lm(formula = datUA$Bypassing ~ datUA$COND)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7588 -1.5504 -0.0642  1.4774  3.9358 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          0.7588     0.1324   5.731 2.04e-08 ***
datUA$CONDCONCRETE  -1.6946     0.1913  -8.858  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.868 on 380 degrees of freedom
Multiple R-squared:  0.1711, Adjusted R-squared:  0.169 
F-statistic: 78.46 on 1 and 380 DF,  p-value: < 2.2e-16

As hypothesized, we find a significant effect of the independent variable (datUA$COND) on the mediator (datUA$Bypassing). We can now update our mediation model with the estimate of this effect (B = −1.6946), as shown in Fig. 4.20.

Fig. 4.20 Representation of a specific mediation hypothesis, updated with the estimate of the effect of the independent variable on the mediator variable (upper left)

The third step requires us to estimate both the effect of the mediator variable (confusion with bypassing) on the dependent variable (moral responsibility attributions) and the direct effect of the independent variable (instructions) on the dependent variable. The direct effect is what is left of the total effect of the independent variable while controlling for the effect of the mediator variable.
As we saw earlier, we can measure the direct effect by using multiple regression analysis, with both the independent and the mediator variables as predictors in the model:

99 summary(lm(datUA$Responsibility ~ datUA$COND + datUA$Bypassing))

The output should be:

Call:
lm(formula = datUA$Responsibility ~ datUA$COND + datUA$Bypassing)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.9666 -0.9611  0.0334  1.0773  5.2733 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -0.18548    0.12804  -1.449    0.148    
datUA$CONDCONCRETE  1.06427    0.19494   5.459 8.65e-08 ***
datUA$Bypassing    -0.69595    0.04759 -14.624  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.733 on 379 degrees of freedom
Multiple R-squared:  0.4965, Adjusted R-squared:  0.4938 
F-statistic: 186.8 on 2 and 379 DF,  p-value: < 2.2e-16

As you can see, there is still a significant effect of datUA$COND, even when controlling for datUA$Bypassing. This means that, if there is a mediation, it will only be partial: there will still be a significant direct effect of the independent variable. The regression coefficients provided by the output give us the effect of the mediator on the dependent variable (B = −0.69595) and the direct effect of the independent variable (B = 1.06427).

Fig. 4.21 Final regression model

The indirect effect (i.e., the share of the total effect explained by the effect of the independent variable on the mediator variable) can be computed in two ways. You can either subtract the direct effect from the total effect (2.2436 − 1.06427 = 1.17933) or multiply the effect of the independent variable on the mediator by the effect of the mediator on the dependent variable (−1.6946 × −0.69595 = 1.179357). The updated model is presented in Fig. 4.21.

For our fourth and final step, we want to know whether the indirect effect (the one that goes through the mediator) is significant.
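As a quick sanity check, the two computations above can be reproduced in a couple of lines of R. The numbers are simply the regression estimates reported above; the small discrepancy between the two results is rounding error in the reported coefficients.

```r
total  <-  2.2436   # effect of COND on Responsibility (simple regression)
direct <-  1.06427  # effect of COND controlling for Bypassing
a      <- -1.6946   # effect of COND on Bypassing
b      <- -0.69595  # effect of Bypassing on Responsibility (controlling for COND)

indirect_diff <- total - direct  # difference method
indirect_prod <- a * b           # product method

indirect_diff  # 1.17933
indirect_prod  # 1.1793569
```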
We observe a difference between the total effect (B = 2.24) and the direct effect (B = 1.06), which suggests that our mediator plays a role in the total effect of the independent variable on the dependent variable, but we would like to know whether this indirect effect is statistically significant. For that, we need to use a mediation test. We will use the Sobel test, as it is a widespread mediation test. For that, we will first need to install and load the bda package:

100 install.packages("bda")
101 library(bda)

We then have to use the function mediation.test(). The function takes three arguments, in the following way: mediation.test(mv,iv,dv), where mv is the mediator variable, iv the independent variable, and dv the dependent variable. This gives us:

102 mediation.test(datUA$Bypassing,datUA$COND,datUA$Responsibility)

And the output is:

                Sobel       Aroian      Goodman
z.value  7.576363e+00 7.563436e+00 7.589355e+00
p.value  3.553778e-14 3.925575e-14 3.215015e-14

This gives us the results of three different mediation tests (Sobel, Aroian, and Goodman). For all three tests, the p-value is under the threshold of 0.05, suggesting a significant mediation (indirect) effect. We can thus conclude that Bypassing scores partially mediated the effect of instructions on attributions of moral responsibility (p < 0.001).

Fig. 4.22 An alternate mediation model, with moral responsibility attributions as mediator variable and Bypassing scores as dependent variable

But what does that mean? Did we prove our causal hypothesis? Not really: what we proved was that the statistical model corresponding to our causal hypothesis fits the data. But, in general, it is possible to show that there are several models corresponding to different, incompatible causal hypotheses that also fit the data (Fiedler et al., 2018). For example, we could test the hypothesis according to which the instructions directly impact moral responsibility ratings, which in turn impact Bypassing scores.
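For the curious, the Sobel statistic can also be computed by hand from the regression output of the previous steps: it divides the indirect effect a × b by its approximate standard error, z = ab / sqrt(b²·SEa² + a²·SEb²). A minimal sketch using the estimates and standard errors reported above:

```r
a    <- -1.6946   # COND -> Bypassing
se_a <-  0.1913   # its standard error
b    <- -0.69595  # Bypassing -> Responsibility (controlling for COND)
se_b <-  0.04759  # its standard error

# Sobel test: indirect effect divided by its approximate standard error
z <- (a * b) / sqrt(b^2 * se_a^2 + a^2 * se_b^2)
p <- 2 * pnorm(-abs(z))  # two-tailed p-value

z  # approximately 7.58, matching the Sobel z-value returned by mediation.test()
```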
In this model, instructions remain the independent variable, but Bypassing scores become the dependent variable and moral responsibility ratings become the mediator. Using the same procedure, we put this model to the test and obtained the results presented in Fig. 4.22 (the full code for this procedure is available in the analysis file). As can be seen, this model also fits the data: we have a significant mediation (p < 0.001). However, the causal hypothesis behind this model is very different from the one behind the first model we put to the test. Thus, mediation tests can be used to test whether a certain model (and the causal hypothesis it describes) fits the data, but they cannot be used to directly prove that a causal hypothesis is the right one.

In past years, experimental philosophers (e.g., Rose and Nichols, 2013; Björnsson, 2014, 2016) have tried to escape this limitation by using programs and algorithms (such as TETRAD and its Greedy Search algorithm) that compare different causal models to find which one best fits the data, or that search by themselves for the model that best fits the data. However, we would warn against this approach (which we will not develop in detail here): indeed, it has been shown that, because such methods rest on the assumption that there is no measurement error (or that measurement error is known), they can often fail to detect the right causal model as soon as a reasonable amount of measurement error is present (Scheines and Ramsey, 2016). Similarly, other researchers have argued that comparing a mediation model (e.g., instructions → bypassing → moral responsibility) and its “reverse” model (e.g., instructions → moral responsibility → bypassing) to see which one best fits the data might help determine which one describes the true causal model.
This procedure is called “reverse mediation testing.” However, here again, simulations have shown that, in the presence of measurement error, this procedure often fails to identify the true causal model (Lemmer and Gollwitzer, 2017).

But what is measurement error? It is simply the fact that variations in our measures (for example, our measures of Bypassing) are not always due to variations in the phenomenon we seek to measure (participants' perception that the universe they read about implies that agents' mental states play no causal role) but to other phenomena (participants being inattentive or, more problematically, interpreting the measures in an unintended way). Given the many debates in experimental philosophy of free will (some of which we presented in Sect. 4.2) about the ambiguity of certain measures, and the fact that certain measures do not seem to measure what they were initially intended to measure (see, for example, the case of the “Throughpass” measure in Cova, Forthcoming-a), it would be overly optimistic to think that empirical studies of folk intuitions about free will do not include any measurement error.

Thus, it is important to keep in mind that our measures are not identical to the phenomena they are intended to measure and that there can be some discrepancy between the two. It is easy to forget this simple fact because, when reporting analyses, researchers often write as if the measures and the measured phenomena were one and the same thing. For example, when we represented our mediation model in Figs. 4.17, 4.18, 4.19, 4.20, 4.21, and 4.22, we did not explicitly distinguish between the two. A more accurate representation of the model would be the one shown in Fig. 4.23. In this model, we distinguish observable variables, which are directly measured or manipulated (and are typically represented by rectangles), from latent variables, which are not directly measured (and are typically represented by circles).
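How measurement error can distort mediation results is easy to demonstrate by simulation (this toy example is ours, not the chapter's dataset): below, the true model is a full mediation, as the condition affects the outcome only through the mediator, yet because we only observe a noisy version of the mediator, the regression still shows a sizeable "direct" effect of condition.

```r
set.seed(42)
n <- 5000
cond         <- rep(0:1, each = n / 2)        # experimental condition (0/1)
mediator     <- -1.7 * cond + rnorm(n)        # true mediator carries the WHOLE effect
outcome      <- -0.7 * mediator + rnorm(n)    # no direct path from cond to outcome
mediator_obs <- mediator + rnorm(n, sd = 1.5) # we only observe a noisy measure

# Controlling for the noisy measure leaves a spurious "direct" effect of cond:
fit <- lm(outcome ~ cond + mediator_obs)
coef(fit)["cond"]  # clearly non-zero, although the true direct effect is exactly 0
```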
Can we at least conclude from our results that if perception of bypassing mediates the effect of instructions (abstract vs. concrete) on participants' perception of moral responsibility, then this mediation is only partial and not total, given that we still found a statistically significant direct effect of instructions? Here also, we should proceed with caution, as Cova (Forthcoming-a) showed, through simulations, that the presence of measurement error can lead the procedure we used to conclude that there is a partial mediation when there is actually a full mediation. Rather, we should stay humble: our results show that a causal hypothesis according to which the effect of instructions on attributions of moral responsibility is mediated by perceptions of bypassing is consistent with our data.

4.10.3 Multiple Mediation and Structural Equation Modeling

However, perceptions of bypassing are not the only possible mediator of the effect of instructions on attributions of moral responsibility: Intrusion scores and BLoop scores are candidate mediators as well. Using the same procedure as for Bypassing scores, we can show that there is also a significant indirect effect through Intrusion (see Fig. 4.24) but not through BLoop (see Fig. 4.25; the full code is available in the analysis file).

Fig. 4.23 Distinguishing observed variables (rectangles) from latent variables (circles) in our mediation model. MR = moral responsibility
Fig. 4.24 Results for a mediation model with Intrusion scores as mediator

Thus, we found a significant indirect effect for two different mediation models: one with Bypassing scores as mediator and one with Intrusion scores as mediator. This raises two questions: Fig.
4.25 Results for a mediation model with BLoop as mediator

• Given that both candidate mediators (Bypassing and Intrusion scores) are highly correlated with each other, could it be that we observe a significant effect through one of them (e.g., through Intrusion scores) only because it is correlated with the “true” mediator (e.g., with Bypassing scores)? This worry parallels the one we had about regression analysis, which led us to use multiple regression analysis.
• Could it be that each candidate mediator explains a different part of the effect of instructions on attributions of moral responsibility, and that taking both into account would lead us to explain a greater part of the total effect?

To answer these questions, we need to test mediation models that include several mediators. This is called “multiple mediation analysis.” However, the mediation tests we described in the previous subsection are not tailored for multiple mediators. We thus need another, better suited method, such as structural equation modeling through the lavaan package.

Structural equation modeling (SEM) is a powerful statistical tool to estimate causal effects between variables. SEM can be understood as the combination of two sets of statistical models: a causal model and a measurement model. The causal model estimates causal relationships between variables (like mediation analysis). The measurement model gives estimates of latent variables, that is, variables that have not been directly measured. In a measurement model, we estimate whether several observed variables can be seen as being caused by a single latent variable (that we did not measure). For instance, in the case of an IQ test, psychologists often assume that a single latent variable, IQ, is causing subjects' results on several different tasks.

Why would you ever want to estimate latent variables? The main reason is simple: the latent variables are often the only variables that we're interested in.
We're not interested in participants' answers to our survey; we are interested in their underlying beliefs about moral responsibility. And, if we have used several proxy measurements for our latent variables, then, under certain assumptions, the measurement model in SEM allows us to eliminate the error in each proxy and get a better estimate of the fundamental latent variable behind them.

In SEM, you are not trying to figure out what the latent variables behind your observed measures are; instead, you are telling your model that there is a latent variable that is causing a set of observed variables, and then your model is going to do its best to reconstruct what this latent variable should look like.11 SEM reconstructs the latent variables based on a set of assumptions. The most important assumption to understand is that SEM assumes that any correlation between observed variables is solely due to the shared influence of the latent variable that is causing them. In more technical terms, the observed variables are assumed to be independent once we condition on the value of our latent variable. Let's say, for instance, that we have four variables that are correlated, and that we are telling our SEM model to build a latent variable out of these four variables. Our SEM model will then assume that the correlations between our variables are due solely to the existence of our latent variable and will try to guess the hidden value of the latent variable that is compatible with the correlations between our different observed variables. Since SEM estimates these hidden variables by using correlations, it needs several variables before it can estimate the latent variable that is causing them (it actually needs at least three observed variables, and having at least four is strongly recommended).
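The conditional-independence assumption can be illustrated with a small simulation (again a toy example of ours, not the chapter's data): four indicators are generated from one latent variable plus independent noise, so they correlate with each other, but once the latent variable is controlled for, the residual correlations vanish.

```r
set.seed(7)
n <- 10000
latent <- rnorm(n)  # the unobserved latent variable

# Four observed indicators: each = latent influence + independent noise
x1 <- latent + rnorm(n); x2 <- latent + rnorm(n)
x3 <- latent + rnorm(n); x4 <- latent + rnorm(n)

cor(x1, x2)  # around 0.5: the indicators correlate through the latent variable

# Conditioning on the latent variable removes the correlation:
r1 <- resid(lm(x1 ~ latent))
r2 <- resid(lm(x2 ~ latent))
cor(r1, r2)  # approximately 0
```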
Once it has estimated the latent variables, SEM allows us to specify any kind of causal relationship between both observed and latent variables. Once again, as with the mediation model, SEM can't directly prove causality. Instead, it assumes causal relationships: you tell the program that these causal relationships exist, and then the program gives you back the causal coefficients that generate the best fit with the data.

The importance of assumptions also applies to the measurement model. You are telling your SEM model that a single latent variable is causing your three or four observed variables, and then SEM tries to estimate it. A lack of fit could tell you that a single latent variable has only a weak impact on your variables; however, a completely wrong model could still have a pretty good fit. A simple case where the latent variable assumption may go wrong is the case where there is a direct causal relationship between your observed variables. For instance, let's say that you are trying to measure socioeconomic status (SES) and that you're telling SEM to build a latent variable based on several observed variables, which include education and income. In this case, there will be a strong correlation between education and income, partly because education causes income, but probably not because both are caused by SES. Rather, SES is just the sum of education and income.

11 Other approaches, like exploratory factor analysis, are more appropriate if you are trying to figure out what the possible latent variables behind your measures are.

To understand SEM, one last thing to keep in mind is that most versions of SEM that you will see assume that the causal impacts being estimated are linear, exactly as in linear regression.
So let's imagine that you are reading a path diagram and see an arrow from parental income to offspring income with a coefficient of 0.5, and that no other path leads from parental income to offspring income. This means that, if you raise the value of parental income by one euro (and if all the assumptions in your model are true!), then this will raise the offspring's income by 0.5 euro.12

We will stress once again the importance of assumptions. SEM can make you look like God or a divine prophet: you can specify causal relationships of any degree of complexity between any number of variables, and you can invent latent variables behind any set of observed variables you like. This wild fantasy does not have to be true, however. If it is true, SEM is going to give you correct estimates of the strength of the causal relationships that you predicted. However, if you incorrectly told SEM that a latent variable exists, or that there is no causal relationship between two variables that are actually causing each other, then your model will probably spit out nonsensical answers. You have to be careful about the assumptions you make and be sure that they make sense and can withstand scrutiny.

4.10.4 Using the Lavaan Package for Multiple Mediation Analysis

Now, we just need to learn how to run structural equation modeling in R. The reference package for this type of analysis is lavaan. So let's first install and load the package:

103 install.packages("lavaan")
104 library(lavaan)

To use lavaan, the most important point is probably to understand how to describe the models we want it to test. A good heuristic is often to sketch the model as a visual representation (called a “path diagram”) before entering it. So, let's begin with a very simple model with one dependent variable (attributions of moral responsibility) and one predictor (instructions). You can visualize it in Fig. 4.26. Let's call it SEM1.
This model can be described in lavaan in the following way:

105 SEM1 <- '
106 Responsibility ~ a*COND
107 '

12 This is a case where the linearity assumption is probably false: raising parental income by one euro is probably going to have a much bigger impact if the parents are poor than if they are rich.

Fig. 4.26 Path diagram for Model SEM1

The arrow in the model is represented by the symbol ~, with the item on the left being the variable the arrow points at and the item on the right being the variable the arrow points from (remember that ~ can be interpreted as “as a function of”). The a represents the coefficient corresponding to the relationship between the two variables. We can now put this model to the test using the sem() and summary() functions. The sem() function allows you to indicate to which dataset the model should be applied:

108 fitSEM1 <- sem(SEM1, data=datUA)

And now, we just have to feed the newly created fitSEM1 object to the summary() function. We also add standardized = TRUE as an argument so that the function returns standardized values for regression coefficients:

109 summary(fitSEM1, standardized=TRUE)

You should obtain the following output:

lavaan 0.6-12 ended normally after 1 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         2

  Number of observations                           382

Model Test User Model:

  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  Responsibility ~
    COND       (a)    2.244    0.221   10.148    0.000

Fig.
4.27 Path diagram for Model SEM2

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .Responsibility    4.660    0.337   13.820    0.000

As shown by the following lines:

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  Responsibility ~
    COND       (a)    2.244    0.221   10.148    0.000

We can see that the relationship between instructions and attributions of responsibility is statistically significant (p < 0.001). This is in line with the results we obtained using linear regression in Sect. 4.10.2. The regression coefficient is the same (B = 2.24).

Let's now move to a more complex model with two predictors, corresponding to a multiple regression analysis (see Fig. 4.27). Let's call this model SEM2. SEM2 can be described in lavaan in the following way, by adding a second line to describe the newly added arrow:

110 SEM2 <- '
111 Responsibility ~ a*COND
112 Responsibility ~ c*Bypassing
113 '

To test this model, we use the same method as for SEM1:

114 fitSEM2 <- sem(SEM2, data=datUA)
115 summary(fitSEM2, standardized=TRUE)

Within the output, you will find regression coefficients similar to the ones we found for Fig. 4.21:

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Responsibility ~
    COND       (a)    1.064    0.194    5.481    0.000    1.064    0.219
    Bypassing  (c)   -0.696    0.047  -14.682    0.000   -0.696   -0.585

Note that the output now also indicates standardized estimates for the regression coefficients. The Std.all version allows for a comparison between the different coefficients.

Fig. 4.28 Path diagram for Model SEM3

It is now time to implement a third, more complex model, SEM3, corresponding to the simple mediation model we tested in Sect. 4.10.2 (see Fig. 4.28). For that, we need to add one more arrow (from instructions to Bypassing scores). This can be done simply by adding a line with Bypassing ~ b*COND. However, to test for mediation, we also want lavaan to provide an estimate for the indirect effect going through Bypassing.
If you remember, the indirect effect can be computed by multiplying together the regression coefficient corresponding to the path from the independent variable to the mediator (here, b) and the regression coefficient corresponding to the path from the mediator to the dependent variable (here, c). Thus, to ask lavaan to compute the value of the indirect effect through Bypassing, we need to add the following line: indirect_through_Bypassing := b*c. The final description of the model is thus:

116 SEM3 <- '
117 Responsibility ~ a*COND
118 Bypassing ~ b*COND
119 Responsibility ~ c*Bypassing
120
121 indirect_through_Bypassing := b*c
122 '
123
124 fitSEM3 <- sem(SEM3, data=datUA)
125 summary(fitSEM3, standardized=TRUE)

The Regressions section of the output will provide you with regression coefficients, standardized regression coefficients, and p-values for all main effects (arrows):

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Responsibility ~
    COND       (a)    1.064    0.194    5.481    0.000    1.064    0.219
  Bypassing ~
    COND       (b)   -1.695    0.191   -8.881    0.000   -1.695   -0.414
  Responsibility ~
    Bypassing  (c)   -0.696    0.047  -14.682    0.000   -0.696   -0.585

While the same information for the indirect effect will be found in the Defined Parameters section:

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    indrct_thrgh_B    1.179    0.155    7.599    0.000    1.179    0.242

As you can see, the p-value is under the threshold for significance. We can thus conclude that the indirect effect is significant. So far, we have only reproduced the results we already obtained in Sect. 4.10.2 using simple mediation analysis. It is now time to explore new possibilities by testing a new model, SEM4, which includes two mediators: Bypassing and Intrusion scores (see Fig. 4.29).
This can be done by adding two lines, one for each newly added arrow (from instructions to Intrusion and from Intrusion to moral responsibility), and one additional line to ask lavaan to estimate the indirect effect through Intrusion scores:

126 SEM4 <- '
127 Responsibility ~ a*COND
128 Bypassing ~ b*COND
129 Responsibility ~ c*Bypassing
130 Intrusion ~ d*COND
131 Responsibility ~ e*Intrusion
132
133 indirect_through_Bypassing := b*c
134 indirect_through_Intrusion := d*e
135 '
136
137 fitSEM4 <- sem(SEM4, data=datUA)
138 summary(fitSEM4, standardized=TRUE)

Fig. 4.29 Path diagram for Model SEM4

You will see that there is still a significant direct effect of instructions on attributions of moral responsibility:

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Responsibility ~
    COND       (a)    1.109    0.189    5.853    0.000    1.109    0.237

And you will obtain the following estimates for the two indirect effects:

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    indrct_thrgh_B    0.999    0.137    7.308    0.000    0.999    0.213
    indrct_thrgh_I    0.136    0.055    2.479    0.013    0.136    0.029

As you can see, both indirect effects are significant (p < 0.05). However, a comparison of the standardized coefficients (rightmost column) shows that the standardized coefficient for the indirect effect through Bypassing scores (β = 0.213) is much higher than the standardized coefficient for the indirect effect through Intrusion scores (β = 0.029): approximately seven times higher. This shows that our data are consistent with a model in which both perceptions of bypassing and intrusion explain part of the effect of instructions on attributions of moral responsibility, but in which perceptions of bypassing explain a far greater part of this effect than perceptions of intrusion.
4.10.5 Using Structural Equation Modeling to Estimate Latent Variables

So, we finally succeeded in estimating the relative role of several mediators in the impact of instructions (abstract vs. concrete) on attributions of moral responsibility. We thus answered our second and last research question (Q2). However, we haven't yet taken advantage of the main feature of structural equation modeling: its ability to provide estimates of latent variables and thus to reduce (but not eliminate) mistakes due to measurement error. In fact, this is one of the abilities that gives the lavaan package its name, as "lavaan" stands for "latent variable analysis." As we explained, SEM can provide an estimate of a latent variable that is (by definition) not directly measured and for which we do not already know the measurement error. It does so based on the estimated covariance of the various observed measures supposed to capture this same latent variable. lavaan will analyze to what extent these different measures covary (perfect measures should always covary, and bad measures will not) to estimate the measurement error for each measure and reach an estimate of the latent variable. Going back to our model, we can replace it with a fifth model (SEM5), in which perceptions of Bypassing and Intrusion are treated as latent variables that "cause" the various observed variables corresponding to the different items used to measure these constructs (see Fig. 4.30). In lavaan, latent variables can be defined based on their indicators (the measured variables supposed to capture them), using the symbol =~.
For example, we can create a latent variable called LatentBypassing from the four items supposed to measure Bypassing in the following way:

LatentBypassing =~ Decisions + Want + Beliefs + Control

And the latent variable for Intrusion can be defined in the following way:

LatentIntrusion =~ Intrusion1 + Intrusion2 + Intrusion3 + Intrusion4

Meaning the full model can be defined and tested in the following way:

SEM5 <- '
LatentBypassing =~ Decisions + Want + Beliefs + Control
LatentIntrusion =~ Intrusion1 + Intrusion2 + Intrusion3 + Intrusion4

Responsibility ~ a*COND
LatentBypassing ~ b*COND
Responsibility ~ c*LatentBypassing
LatentIntrusion ~ d*COND
Responsibility ~ e*LatentIntrusion

indirect_through_Bypassing := b*c
indirect_through_Intrusion := d*e
'

Fig. 4.30 Path diagram for Model SEM5

fitSEM5 <- sem(SEM5, data=datUA)
summary(fitSEM5, standardized=TRUE)

You will see that the final results are very similar to the ones we obtained when using observed rather than latent variables:

Regressions:
                     Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  Responsibility ~
    COND        (a)     1.048    0.197    5.329    0.000    1.048    0.223
  LatentBypassing ~
    COND        (b)    -1.539    0.185   -8.304    0.000   -0.855   -0.427
  Responsibility ~
    LtntBypssn  (c)    -0.680    0.063  -10.874    0.000   -1.225   -0.522
  LatentIntrusion ~
    COND        (d)     0.427    0.153    2.793    0.005    0.294    0.147
  Responsibility ~
    LtntIntrsn  (e)     0.348    0.064    5.415    0.000    0.506    0.216

Defined Parameters:
                     Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    indrct_thrgh_B      1.047    0.147    7.131    0.000    1.047    0.223
    indrct_thrgh_I      0.149    0.059    2.513    0.012    0.149    0.032

This is because the various measures we have for each latent variable are highly correlated, to the point that there was very little unreliability to correct for in the first place.
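The core intuition here, that indicators sharing a common cause covary and that this covariation suffices to recover the latent factor, can be illustrated with base R's factanal() function. This is only a conceptual sketch on simulated data (lavaan estimates the measurement model jointly with the structural paths, which factanal() does not); the item names simply reuse the Bypassing labels from the chapter.

```r
# Conceptual sketch: four noisy indicators of one simulated latent factor
# covary, and a one-factor model recovers their shared component.
set.seed(1)
n <- 300
latent <- rnorm(n)                                   # the unobserved variable
items <- sapply(1:4, function(i) 0.8 * latent + rnorm(n, sd = 0.5))
colnames(items) <- c("Decisions", "Want", "Beliefs", "Control")
fit <- factanal(items, factors = 1)
round(fit$loadings[, 1], 2)  # all four items load strongly (up to sign) on the factor
```

Items that did not share the latent cause would show weak loadings and large uniquenesses, which is exactly the "bad measures will not covary" case described above.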
Finally, we can represent the results of our analysis as shown in Fig. 4.31.

Fig. 4.31 Test results for model SEM5

Overall, our results suggest that the data are coherent with a model in which participants' perceptions of both Bypassing and Intrusion mediate the effect of instructions on attributions of moral responsibility, but in which Bypassing plays a much more prominent role compared to Intrusion.13

4.11 Conclusion

When participants give different answers to different vignettes, it is not always clear why they responded differently. Simply comparing answers between vignettes (using the methods introduced in previous chapters, such as χ2 tests, t-tests, or ANOVAs) won't be enough to figure out what exactly is going on. One way to make progress is thus to add items measuring participants' interpretations of the vignette and then to investigate to what extent these different interpretations correlate with and predict participants' answers. Then, mediation methods, including structural equation modeling, can help test hypotheses about the mechanisms driving the difference in participants' answers between two vignettes. In this chapter, we focused on experimental philosophy of free will, as it is the domain of experimental philosophy in which mediation methods were used for the first time (see Nahmias and Murray, 2011; Rose and Nichols, 2013). As we saw, experimental philosophers observed that participants answered differently to different presentations of determinism, and sought to explain why. For each hypothesis, they created a corresponding measure (such as measures of Bypassing and Intrusion), then used correlational methods to determine which of these measures best predicted participants' judgments about free will and moral responsibility, and which mediated the impact of vignettes on these judgments.
13 Some have argued that the success of the Bypassing measure in predicting participants' judgments about free will and moral responsibility might be due to the fact that the "No control" item already captures part of these notions (rather than Bypassing itself) (see, for example, Björnsson, 2014). This is a legitimate concern. To see how much it affects our results, you can simply run a sixth SEM model, in which you exclude the "No control" item from the definition of the LatentBypassing variable (LatentBypassing =~ Decisions + Want + Beliefs). You will see that the results now return an indirect effect of β = 0.194 for Bypassing and of β = 0.04 for Intrusion. Thus, even if excluding the "No control" item lessened the gap between Bypassing and Intrusion, it was still the case that Bypassing made a much greater contribution to the indirect effect of instructions on attributions of moral responsibility.

In Sects. 4.8 and 4.9, we used correlation tests and regression analysis to determine which of three measures (Bypassing, Intrusion, and BLoop) predicted participants' judgments about the moral responsibility of agents in Nichols and Knobe's vignettes. We found that participants' attributions of moral responsibility were significantly predicted by Bypassing judgments and Intrusion judgments, but that BLoop judgments were not significant predictors once we controlled for Intrusion and Bypassing judgments. In Sect. 4.10, using mediation analyses and structural equation modeling, we found that the difference between abstract and concrete presentations of determinism was mediated mainly by participants' confusion of determinism with bypassing and, to a much smaller extent, by whatever is measured by Nadelhoffer and colleagues' Intrusion measure. Here again, BLoop did not play any significant role.
Given that we introduced BLoop as a measure of participants' perception of determinism, this suggests that perception of determinism did not have a major impact on participants' attributions of moral responsibility. However, this raises the question of what exactly Intrusion scores are measuring. Nadelhoffer et al. (2023) introduced Intrusion scores as a measure of the agent's unconditional ability to do otherwise. As such, they are basically supposed to measure the same construct as BLoop, and we should expect both measures to be strongly correlated. By contrast, they are supposed to measure something different from Bypassing scores. However, as can be seen in Table 4.13, Intrusion scores correlated with both Bypassing and BLoop to the same extent, which was unexpected. One possibility is that, depending on how participants interpret them, Intrusion scores measure either the unconditional ability to do otherwise (which is why they correlate with BLoop) or the conditional ability to do otherwise (which is why they correlate with Bypassing), or both. Finally, you might wonder: are participants natural compatibilists or incompatibilists? One way to answer this is to take participants who were presented with a deterministic universe (Universe A) and to exclude participants who failed the Bypassing, Intrusion, and/or BLoop questions. Since a majority of participants confused determinism with bypassing in the abstract version, it might be better to focus on the concrete version. We leave it to you as a final exercise to work out how to do that (you already have all the necessary tools), but the results are presented in Table 4.18.
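If you want to check your own solution to this exercise against Table 4.18, the inclusion criteria stated in the table note translate into a single subset() call. The sketch below runs on a small made-up data frame standing in for datUA; only the column names follow the chapter's conventions, and the real analysis would additionally restrict the data to the concrete condition.

```r
# Toy stand-in for datUA: the real analysis would start from the actual
# data set rather than these invented values.
dat <- data.frame(
  Bypassing = c(-2, -1, 0.5, -3, -1),
  Intrusion = c(-1,  2, -0.5, -2, -1),
  BLoop     = c(100, 100, 100, 50, 100)
)
# "Success" at all three measures: Bypassing < 0, Intrusion < 0, BLoop == 100
passed_all <- subset(dat, Bypassing < 0 & Intrusion < 0 & BLoop == 100)
nrow(passed_all)  # 2 of the 5 toy rows meet all three criteria
```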
Table 4.18 Means (and standard deviations) for participants' moral responsibility, blame, free will, and up scores in Universe A, concrete condition

                             N   Moral responsibility   Blame               Free will            Up
All participants           183   1.53 (2.26) 71.6%      1.63 (2.11) 74.3%    0.19 (2.45) 49.2%   0.70 (2.46) 57.3%
Success at Bypassing       129   2.25 (1.64) 86.0%      2.40 (1.42) 90.7%    1.01 (2.18) 65.1%   1.66 (1.97) 76.0%
Success at Intrusion       134   1.13 (2.44) 64.2%      1.24 (2.28) 67.9%   −0.49 (2.40) 37.3%   0.07 (2.51) 45.5%
Success at BLoop           135   1.28 (2.40) 66.7%      1.37 (2.31) 70.4%   −0.15 (2.50) 41.5%   0.31 (2.55) 48.9%
Success at all 3 measures   67   2.06 (1.82) 82.1%      2.06 (1.81) 85.1%    0.33 (2.33) 50.7%   1.05 (2.23) 61.1%

We present results for all participants, for participants who obtained Bypassing scores < 0, for participants who obtained Intrusion scores < 0, for participants who answered 100 to BLoop, and for participants who fulfilled all three conditions. Percentages indicate the proportion of participants who gave an answer above the midpoint (0).

References

Andow, J., & Cova, F. (2016). Why compatibilist intuitions are not mistaken. A reply to Feltz and Millan. Philosophical Psychology, 29(4), 550–566.
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21.
Ashton, M. C., Lee, K., & Visser, B. A. (2019). Where's the H? Relations between BFI-2 and HEXACO-60 scales. Personality and Individual Differences, 137, 71–75.
Austin, J. L. (1956). Ifs and cans. Proceedings of the British Academy, 42, 109–132.
Ayer, A. J. (1963). Philosophical essays. Springer.
Björnsson, G. (2014). Incompatibilism and "bypassed" agency. In A. R. Mele (Ed.), Surrounding free will (pp. 95–112). Oxford University Press.
Björnsson, G. (2016). Outsourcing the deep self. Deep self discordance does not explain away intuitions in manipulation arguments. Philosophical Psychology, 29(5), 637–653.
Cova, F. (2014).
Frankfurt-style cases user manual. Why Frankfurt-style enabling cases do not necessitate tech support. Ethical Theory and Moral Practice, 17(3), 505–521.
Cova, F. (2023). A defense of natural compatibilism. In J. Campbell, K. Mickelson, & V. Alan White (Eds.), Blackwell companion to free will. Blackwell.
Cova, F. (Forthcoming-a). "It was all a cruel Angel's thesis from the start." Folk intuitions about Zygote cases do not support the Zygote argument. In T. Nadelhoffer & A. Monroe (Eds.), Advances in experimental philosophy of free will and responsibility. Bloomsbury.
Cova, F. (Forthcoming-b). Calibrating measures of folk objectivism. In J. Knobe & S. Nichols (Eds.), Oxford studies in experimental philosophy (Vol. 5). Oxford University Press.
Cova, F., & Kitano, Y. (2014). Experimental philosophy and the compatibility of free will and determinism. A survey. Annals of the Japan Association for Philosophy of Science, 22, 17–37.
Cova, F., Bertoux, M., Bourgeois-Gironde, S., & Dubois, B. (2012). Judgments about moral responsibility and determinism in patients with behavioural variant of frontotemporal dementia. Still compatibilists. Consciousness and Cognition, 21(2), 851–864.
Cova, F., Strickland, B., Abatista, A., Allard, A., Andow, J., Attie, M., Beebe, J., Berniūnas, R., Boudesseul, J., Colombo, M., Cushman, F., Diaz, R., van Dongen, N., Dranseika, V., Earp, B. D., Torres, A. G., Hannikainen, I., Hernàndez-Conde, J. V., Hu, W., Jaquet, F., Khalifa, K., Kim, H., Kneer, M., Knobe, J., Kurthy, M., Lantian, A., Liao, S.-y., Machery, E., Moerenhout, T., Mott, C., Phelan, M., Phillips, J., Rambharose, N., Reuter, K., Romero, F., Sousa, P., Sprenger, J., Thalabard, E., Tobia, K., Viciana, H., Wilkenfeld, D., & Zhou, X. (2021). Estimating the reproducibility of experimental philosophy. Review of Philosophy and Psychology, 12(1), 9–44.
Feltz, A., & Cova, F. (2014). Moral responsibility and free will. A meta-analysis. Consciousness and Cognition, 30, 234–246.
Feltz, A., & Millan, M. (2015). An error theory for compatibilist intuitions. Philosophical Psychology, 28(4), 529–555.
Feltz, A., Cokely, E. T., & Nadelhoffer, T. (2009). Natural compatibilism versus natural incompatibilism. Back to the drawing board. Mind & Language, 24(1), 1–23.
Fiedler, K., Harris, C., & Schott, M. (2018). Unwarranted inferences from statistical mediation tests. An analysis of articles published in 2015. Journal of Experimental Social Psychology, 75, 95–102.
Fischer, J. M. (2002). Frankfurt-style examples, responsibility and semi-compatibilism. In R. Kane (Ed.), The Oxford handbook of free will (pp. 281–308). Oxford University Press.
Frankfurt, H. G. (1969). Moral responsibility and the principle of alternative possibilities. Journal of Philosophy, 66(23), 829–839.
Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. Cambridge University Press.
Hannikainen, I. R., Machery, E., Rose, D., Stich, S., Olivola, C. Y., Sousa, P., Cova, F., Buchtel, E., Alai, M., Angelucci, A., Berniūnas, R., Chatterjee, A., Cheon, H., Cho, I.-R., Cohnitz, D., Dranseika, V., Eraña Lagos, Á., Ghadakpour, L., Grinberg, M., Hashimoto, T., Horowitz, A., Hristova, E., Jraissati, Y., Kadreva, V., Karasawa, K., Kim, H., Kim, Y., Lee, M., Mauro, C., Mizumoto, M., Moruzzi, S., Ornelas, J., Osimani, B., Romero, C., Rosas López, A., Sangoi, M., Sereni, A., Songhorian, S., Struchiner, N., Tripodi, V., Usui, N., Vázquez del Mercado, A., Vosgerichian, H. A., Zhang, X., & Zhu, J. (2019). For whom does determinism undermine moral responsibility? Surveying the conditions for free will across cultures. Frontiers in Psychology, 10, 475560.
Harris, S. (2012). Free will. Free Press.
Hernán, M. A., & Robins, J. M. (2023). Causal inference. CRC Press.
James, W. (1907). The will to believe. And other essays in popular philosophy. Longmans, Green, and Company.
Lemmer, G., & Gollwitzer, M. (2017).
The "true" indirect effect won't (always) stand up: When and why reverse mediation testing fails. Journal of Experimental Social Psychology, 69, 144–149.
Mandelbaum, E., & Ripley, D. (2012). Explaining the abstract/concrete paradoxes in moral psychology: The NBAR hypothesis. Review of Philosophy and Psychology, 3(3), 351–368.
Miller, J. S., & Feltz, A. (2011). Frankfurt and the folk. An experimental investigation of Frankfurt-style cases. Consciousness and Cognition, 20(2), 401–414.
Monroe, A. E., & Malle, B. F. (2010). From uncaused will to conscious choice. The need to study, not speculate about people's folk concept of free will. Review of Philosophy and Psychology, 1(2), 211–224.
Murray, D., & Nahmias, E. (2014). Explaining away incompatibilist intuitions. Philosophy and Phenomenological Research, 88(2), 434–467.
Nadelhoffer, T., Rose, D., Buckwalter, W., & Nichols, S. (2020). Natural compatibilism, indeterminism, and intrusive metaphysics. Cognitive Science, 44(8), e12873.
Nadelhoffer, T., Murray, S., & Murry, E. (2023). Intuitions about free will and the failure to comprehend determinism. Erkenntnis, 88, 2515–2536.
Nahmias, E., & Murray, D. (2011). Experimental philosophy on free will. An error theory for incompatibilist intuitions. In J. H. Aguilar, A. A. Buckareff, & K. Frankish (Eds.), New waves in philosophy of action (pp. 189–216). Palgrave Macmillan.
Nahmias, E., Morris, S., Nadelhoffer, T., & Turner, J. (2005). Surveying freedom. Folk intuitions about free will and moral responsibility. Philosophical Psychology, 18(5), 561–584.
Nahmias, E., Morris, S. G., Nadelhoffer, T., & Turner, J. (2006). Is incompatibilism intuitive? Philosophy and Phenomenological Research, 73(1), 28–53.
Nahmias, E., Shepard, J., & Reuter, S. (2014). It's OK if "my brain made me do it." People's intuitions about free will and neuroscientific prediction. Cognition, 133(2), 502–516.
Nichols, S. (2006). Folk intuitions on free will. Journal of Cognition and Culture, 6(1–2), 57–86.
Nichols, S., & Knobe, J. (2007). Moral responsibility and determinism. The cognitive science of folk intuitions. Noûs, 41(4), 663–685.
Roberts, B. W., & DelVecchio, W. F. (2000). The rank-order consistency of personality traits from childhood to old age. A quantitative review of longitudinal studies. Psychological Bulletin, 126, 3–25.
Rose, D., & Nichols, S. (2013). The lesson of bypassing. Review of Philosophy and Psychology, 4(4), 599–619.
Roser, M., Appel, C., & Ritchie, H. (2013). Human height. Our World in Data.
Sarkissian, H., Chatterjee, A., De Brigard, F., Knobe, J., Nichols, S., & Sirker, S. (2010). Is belief in free will a cultural universal? Mind & Language, 25(3), 346–358.
Scheines, R., & Ramsey, J. (2016). Measurement error and causal discovery. CEUR Workshop Proceedings, 1792, 1–7.
Sripada, C. S. (2012). What makes a manipulated agent unfree? Philosophy and Phenomenological Research, 85(3), 563–593.
Stulp, G., Buunk, A. P., Pollet, T. V., Nettle, D., & Verhulst, S. (2013). Are human mating preferences with respect to height reflected in actual pairings? PLoS One, 8(1), e54186.
van Inwagen, P. (1983). An essay on free will. Oxford University Press.
van Inwagen, P. (2008). How to think about the problem of free will. The Journal of Ethics, 12(3), 327–341.

R Packages

Behrendt, S. (2023). lm.beta: Add standardized regression coefficients to linear-model-objects. https://cran.r-project.org/web/packages/lm.beta/.
Revelle, W. (2023). psych: Procedures for psychological, psychometric, and personality research. https://cran.r-project.org/web/packages/psych/index.html.
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. https://cran.r-project.org/web/packages/lavaan/index.html.
Wang, B. (2021). bda: Binned data analysis. R package version 15.2.5. https://cran.r-project.org/web/packages/bda/index.html.
Chapter 5
Interactive and Incentivized Online Experiments: Noncooperation in Give-Some and Take-Some Dilemmas

Marc Wyszynski

Abstract Using a particular example of a research problem in experimental philosophy, this chapter illustrates one way to design, conduct, and analyze interactive experiments in behavioral science. Several steps of the research process are described, including developing the research question and hypotheses, designing the study, programming and conducting the experiment, analyzing the data, and interpreting the results. The chapter provides step-by-step instructions for programming the experiment using the online survey tool Unipark, conducting the experiment using Amazon Mechanical Turk, and performing the analysis using the statistical analysis applications JASP and R.

Keywords Interactive experiment · Behavioral science · Research design · Data analysis · JASP · R · Social dilemma · Rule-breaking

5.1 Introduction

The scientific experiment is an excellent tool for identifying relationships between cause and effect. Researchers in almost all areas of empirical science use experiments for their empirical investigations, including natural sciences such as biology, chemistry, and physics and social sciences such as psychology, sociology, and behavioral economics. The basic principle of an experiment is to observe a measurable entity, often called the "dependent variable," under various controlled conditions, often referred to as "independent variable(s)."

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-58049-9_5.

M. Wyszynski (✉) Faculty of Mathematics and Computer Science, University of Bremen, Bremen, Germany e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
S. Kornmesser et al., Experimental Philosophy for Beginners, Springer Graduate Texts in Philosophy, https://doi.org/10.1007/978-3-031-58049-9_5

226 M. Wyszynski

The dependent variable is the variable to be measured in the experiment. Examples of dependent variables are the size of a plant, the temperature of a particular substance, the risk-taking behavior of an animal, a person's ability to detect an audio or visual signal, particular behaviors such as choices between two or more options, or monetary contributions to a shared good or withdrawals from it. The independent variables are experimental treatments strictly controlled by the experimenter. That is, all observations are conducted under exactly the same conditions except for the treatment condition itself. Examples of independent variables related to the dependent variables mentioned above are the amount of fertilizer used, solar radiation intensity, forage supply, intensity of audio or visual stimuli, positive or negative decision outcomes such as winning or losing a gamble, or different initial assets such as available endowments to place a bet on a gamble. Note that in some scientific articles, other terms are used synonymously. For instance, the dependent variable is sometimes called "response variable," "regressand," "predicted variable," or "explained variable," and the independent variable "manipulated variable," "regressor," "covariate," "predictor variable," or "explanatory variable." The term the authors choose usually depends on the context. There are multiple ways to conduct experiments examining the relationship between a wide variety of dependent and independent variables. However, most of them are not part of this chapter, which focuses exclusively on interactive and incentivized experiments as commonly used in the behavioral sciences.
Interactive experiments are typically used to investigate a dependent variable in a group context, e.g., a particular behavior of participants that might affect other individuals in a particular way and/or depend on what others do. They include at least two participants interacting with each other in some way. For instance, participants are involved in a game where they can decide to cooperate or to play against each other. Experiments are incentivized when participants receive a reward (often money or, for students, credit points) for taking part. The amount of the reward often depends on performance during the experiment. For instance, a group of participants plays a game with the goal of maximizing their individual payoff. Currently, it is common practice in the social sciences to incentivize participants in experiments. Furthermore, incentivizing participants is a publication requirement in some scientific journals. However, experimental researchers should keep in mind that the assumption that incentives generally improve participants' performance in experiments is controversial (see, e.g., Gneezy and Rustichini, 2000; Pokorny, 2008). This chapter gives a step-wise overview of how to conduct interactive and incentivized experiments, using a particular example of empirical research investigating a philosophical problem. I first introduce the example study by describing its philosophical relevance and scientific background. I then show how to develop a research question and a concrete hypothesis derived from theory and previous research. The next step describes one way the hypothesis can be tested using a scientific experiment. I provide an example of an experimental design and discuss different options for where (e.g., online platforms or the lab) and how (e.g., programming and general procedure) the experiment can be conducted.
Based on the example given, I finally demonstrate methods of analyzing the data and provide interpretations of the findings.

5 Interactive and Incentivized Experiments 227

5.2 Philosophical Background: Conducting an Interactive and Incentivized Study in Experimental Philosophy

Experimental philosophers use empirical methods to investigate a variety of different philosophical problems (Plakias, 2015). Some of them are concerned with the existence and extent of moral diversity (Plakias, 2015). The example I provide in this section relates to a problem of diversity in moral judgments. Following Haidt (2001, 817), I define moral judgments as "evaluations (good versus bad) of the actions or character of a person that are made with respect to a set of virtues held to be obligatory by a culture or subculture." In particular, the example I use to demonstrate how to design, conduct, and analyze an interactive experiment concerns social dilemmas in which people may behave "morally good" or "morally bad." Social dilemmas are situations in which "individual rationality leads to collective irrationality. That is, individually reasonable behavior leads to a situation in which everyone is worse off than they might have been otherwise" (Kollock, 1998, 183). Social dilemma-like situations can be found in many aspects of everyday social life; they are simply everywhere (Weber et al., 2004, 281). A typical example is collecting donations from residents to build a new and urgently needed hospital in their region. Individuals who do not donate will still benefit as long as other individuals donate enough. However, if no one donates, the entire community will not benefit from a new hospital because of insufficient funds for building one. Another example is extensive water consumption during a very dry summer even though the government has asked citizens to conserve it. Individuals are better off when everyone conserves water.
However, acting in self-interest, i.e., continuing high water consumption by all or a high number of individuals, may deplete the water reserves of the community (e.g., a city, a town, or a county). Therefore, in both examples, everyone is better off when donating/conserving resources than if everyone acted in self-interest. In the literature on social dilemmas, behaving in the interest of the collective is often referred to as cooperative behavior, and acting in self-interest is usually understood as noncooperative behavior, which is also called "free-riding" (see, e.g., Andreoni, 1995; Cubitt et al., 2011b; Kollock, 1998). Previous research about moral judgments on behavior in social dilemmas shows that free-riding, i.e., noncooperative behavior, is strongly seen as "morally reprehensible" (Cubitt et al., 2011b) compared to cooperative behavior. That is, when comparing cooperative and noncooperative behavior in terms of morality, people judge noncooperative behavior, i.e., giving nothing or just a very small amount or continuing extensive water consumption, as "morally bad." On the other hand, they may see cooperative behavior in social dilemmas, i.e., making donations to the hospital project or conserving water, as "morally not bad" or "morally good." As shown by previous studies, some individuals still behave noncooperatively in social dilemma situations, although this is seen as morally reprehensible (Cubitt et al., 2011b), here interpreted as morally bad. However, the extent to which people behave morally bad seems to depend on the particular type of social dilemma (Cubitt et al., 2011b; Wyszynski and Bauer, 2023). Scientists from different fields such as psychology, sociology, and economics investigating behavior in social dilemmas often distinguish between two dilemma types: give-some dilemmas and take-some dilemmas (see, e.g., Bonacich, 1972; Dawes et al., 1974; Dawes, 1980).
Give-some dilemmas describe situations where the individual decides whether or not to cooperate by giving a resource to a common good, such as a donation for a project to build a new hospital. Take-some dilemmas describe situations where an individual decides whether or not to cooperate by not taking a particular resource from a common good, such as water consumption during a drought. Previous research suggests that people tend to free-ride more often in give-some than in take-some dilemmas when the individual outcome is emphasized in the description of the respective problem, and that the effect reverses when the focus is set to group benefits (van Lange et al., 2013). One problem is that the vague interpretation of morally good and bad behavior in social dilemmas does not clearly define how much an individual needs to give, or is allowed to take, in order to behave in a morally good manner. There is also no clear lower donation limit serving as a threshold between morally good and morally bad behavior. However, people evaluating behavior in social dilemmas might more easily distinguish between "good" and "bad" in social dilemma situations that I call "rule-regulated social dilemmas." In this special case of a social dilemma, the decision to cooperate is not left to the individual's discretion. Instead, individuals need to follow the rules or laws that apply to everyone (in the collective), clearly regulating what individuals have to do. An example of such a dilemma is tax compliance versus tax evasion: Paying taxes is mandatory. The amount of taxes each citizen has to pay is regulated by law (i.e., a binding rule). Paying taxes and following the law is cooperative (let us say "good behavior"), and evading taxes is noncooperative (let us say "bad behavior"). The rule establishes a clear cut between good (cooperative) and bad (noncooperative) behavior.
Some previous studies indicate that a high proportion of people tend to break the rules in take-some dilemma scenarios (see, e.g., Castillo et al., 2011; Janssen et al., 2012, 2013). Little is known about rule-breaking in give-some dilemmas; therefore, we do not know much about how the proportion of people breaking the rule differs between the two types of social dilemmas. The example study described in this chapter focuses on moral behavior in terms of breaking the rule (i.e., noncooperation or behaving "morally bad") or following the rule (i.e., cooperation or behaving "morally good") in social dilemmas depending on the specific dilemma type (give-some dilemma vs. take-some dilemma). In particular, the idea of the study is to examine whether the number of people behaving morally bad in rule-regulated social dilemmas (i.e., breaking the rule; dependent variable) depends on the type of the social dilemma (i.e., independent variable).

5.3 Development of the Research Question to Be Investigated and Hypotheses

Our research idea could initially be formulated as a vague question. However, based on what we already know, we formulate the question as precisely as possible: Does rule-breaking in so-called rule-regulated social dilemmas depend on the dilemma type? To find answers to this question, we systematically search for previous research on that topic and carefully evaluate it. Our extensive literature research and the evaluation of theory and previous findings did not provide satisfying answers to the question. That is, we identified the following specific research gap: Previous studies have not investigated whether the number of people behaving "morally bad" in the particular situations that we call rule-regulated social dilemmas is higher in give-some or in take-some dilemmas.
We defined the goal of our research project accordingly: We want to compare the number of people behaving “morally bad” (i.e., breaking the rule) in rule-regulated give-some dilemmas with that in rule-regulated take-some dilemmas. The research question is the verbalized goal of the research project, typically formulated as one specific question. It should precisely reflect what we intend to find out. For the current example, we formulate the research question as follows:

Research Question: Does the number of people behaving “morally bad” in terms of breaking the rule in rule-regulated social dilemmas depend on whether the dilemma is a give-some or a take-some dilemma?

Although research questions express exactly what we want to investigate, they cannot be tested empirically as they stand. Hence, we need to derive at least one specific hypothesis from theory and previous work. A hypothesis summarizes what we expect to find in one clearly formulated statement that can be tested using a quantitative empirical method. Hypotheses can be formulated as directed or nondirected statements. While nondirected hypotheses merely indicate that a particular finding (e.g., an effect) is expected, directed hypotheses also include the direction of what is expected. Examples are “plants A and B differ in size” for a nondirected hypothesis and “plant A is larger than plant B” for a directed one. Directed hypotheses are usually preferable to nondirected ones since they express the expectations more clearly and are, therefore, more powerful and suitable for testing. For the given example study, we expect to observe a relationship between “morally bad” behavior in terms of rule-breaking and the type of social dilemma scenario, i.e., give-some vs. take-some dilemma. In particular, as we have found in our literature search, people tend to behave “morally bad” more often in take-some than in give-some dilemmas when no rule is included and when the collective outcome is emphasized. Moreover, implementing a rule prescribing that individuals have to behave cooperatively draws attention to the group benefit rather than the individual outcome. Therefore, we postulate the following hypothesis:

(H1) The number of people breaking the rule is higher in take-some than in give-some rule-regulated social dilemmas.

Summary: (1) Formulate your research idea as a question. This question can initially be vague; however, it should be made as precise as possible. (2) Carefully evaluate the literature, theory, and empirical studies to check whether they provide satisfying answers to your question. If not: (3) Reformulate the question, taking into account what you have learned during your literature search. The question should now be even more specific and precise, and it serves as your research question. (4) Derive one or more clear and testable hypotheses from theory and previous empirical findings.

5.4 Development of the Experimental Design for Testing the Hypothesis

Scientific experiments are basically instruments for collecting the data needed to test one or more hypotheses. Therefore, the experimental design should strictly relate to the hypotheses. For the current example study, we design an experiment to test the hypothesis stated in the previous section, i.e., H1. First, we define the dependent and independent variables. We can easily derive them from our hypothesis. The dependent variable is “frequency of rule-breaking,” since we want to measure how many of our participants break a rule in social dilemma situations. The independent variable is the variable we manipulate, i.e., “dilemma type.” That is, we want to find out whether we measure different values for the dependent variable, i.e., the frequency of rule-breaking, when we systematically vary the independent variable, i.e., the dilemma type (give-some vs. take-some dilemma).
Next, we need to design an experiment that allows us to measure the dependent variable as a function of the independent variable. One way to find an appropriate experimental paradigm for testing hypotheses is, again, to evaluate previous research. Maybe there are specific paradigms that have been used before to investigate similar research problems. In the case of the current example study, two paradigms have been used repeatedly to investigate cooperation in social dilemmas: the commons dilemma (“tragedy of the commons”; Hardin, 1968) for take-some dilemmas and the public goods game for give-some dilemmas (for details, see Balliet et al., 2009; Dawes, 1980; Dawes et al., 1974; Kollock, 1998).

5.4.1 Experimental Paradigms

Both games are multiplayer games, i.e., multiple players (n ≥ 2) are involved. A commons dilemma game (i.e., a take-some dilemma) and a public goods game (i.e., a give-some dilemma) usually include n players sharing a common good. Players are equipped with a particular endowment of a resource, let us say a certain number of tokens, and they need to decide how many of them they want to contribute to the shared common good. The tokens that finally end up in the common good are then distributed equally among all players. Players receive a payoff according to the number of tokens remaining in their endowment plus the number of tokens received from the common good. The most important difference between the two games is the following: Players of the commons dilemma game initially have no or only a small endowment, and they decide whether to take tokens from the common good and add them to their endowment or to leave them in the common good. In contrast, players of the public goods game are equipped with a comparably high number of tokens, and they decide whether to keep the tokens for themselves or to give them to the common good.
During the last 50 years, researchers have examined cooperation in social dilemmas with many variants of these games. Variations often concern group size (e.g., small groups of n = 3 or large groups of n = 100), the number of iterations (one-shot or repeated games), and endowment asymmetry (players have different endowments); other examples are punishment (of noncooperators) and transparency (contributions are hidden or visible to other players). Group size has been varied on a large scale in previous research. Although some findings on the effect of group size in social dilemmas are mixed (for a review, see, e.g., Brewer and Kramer, 1986), more recent research suggests that cooperation is stable or even increases in larger (e.g., n = 100) as compared to smaller (e.g., n = 3) groups (Barcelo and Capraro, 2015; Capraro and Barcelo, 2015; Isaac et al., 1994). Whether or not iterations are used to investigate cooperation in social dilemmas strongly depends on the focus of the study and the situational context. That is, researchers interested in long-term interactions and changes in cooperation over time would rather use repeated games. For studies with a focus on one-moment situations, however, such as the current example study, one-shot games appear to be more appropriate (for more details, see, e.g., Cubitt et al., 2011a). Endowment asymmetry has been shown to influence cooperation in social dilemmas. In particular, people equipped with higher endowments give more tokens in absolute but fewer in relative terms to the common good in give-some dilemmas, and they take more from the common good in take-some dilemmas, as compared to individuals with lower endowments (see, e.g., Hauser et al., 2019; Nockur et al., 2021; van Dijk and Wilke, 1995, 2000; Wade-Benzoni et al., 1996).
Some studies further suggest that the effect of endowments only occurs when the instructions given to respondents include information about the asymmetric endowment structure (see, e.g., van Dijk and Grodzka, 1992). Moreover, before the tokens of the common good are distributed to the players, they are often multiplied by a specific value (to increase the “marginal per capita return,” MPCR, of the shared good). The size of the MPCR manipulates the incentives for cooperation. In particular, incentives for cooperation increase with increasing MPCR (e.g., Isaac et al., 1994; van den Berg et al., 2020; Zelmer, 2003). Let n be the number of players of a standard one-shot resources game or public goods game, where each player i contributes c_i (i.e., gives tokens to the common good or leaves tokens in the common good, respectively) from their initial endowment E_i, and let ω be a multiplier; then the individual payoff function of the games is:

    P_i = E_i − c_i + (ω/n) · Σ_{j=1}^{n} c_j    (5.1)

That is, the payoff P_i for a player i is the sum of the following two components: First, the player contributes a particular amount to the common good, which is therefore subtracted from their endowment (E_i − c_i). Second, the sum of the contributions of all players (Σ_{j=1}^{n} c_j) is multiplied by the multiplier ω and divided by the number of players involved in the game (n). Other paradigms that have been used to investigate cooperation in social dilemmas are so-called two-person games. In contrast to the n-person games described above, two-person games are experimental paradigms involving exactly two persons (Dawes, 1980; Liebrand, 1983) instead of a group of at least three persons. One famous example is the prisoner’s dilemma game (used by, e.g., Sun et al., 2021; see Kuhn, 2019, for a comprehensive explanation and details on the prisoner’s dilemma). However, the study I use as an example here focuses on n-person dilemmas.
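The payoff function (5.1) is easy to check numerically. The following Python sketch is my own illustration (the function name and example values are not from the study); it computes each player's payoff and shows the familiar free-rider incentive: with ω/n < 1, the player who contributes least earns most.

```python
def payoffs(endowments, contributions, omega):
    """Eq. (5.1): P_i = E_i - c_i + (omega / n) * sum of all contributions."""
    n = len(endowments)
    pool_share = omega * sum(contributions) / n  # each player's share of the pool
    return [e - c + pool_share for e, c in zip(endowments, contributions)]

# Three players with 10 tokens each and a multiplier of 1.5 (so omega/n = 0.5):
print(payoffs([10, 10, 10], [10, 0, 5], omega=1.5))  # [7.5, 17.5, 12.5]
```

The full contributor ends up with 7.5 tokens, the free-rider with 17.5: every token contributed returns only ω/n = 0.5 tokens to the contributor, which is what makes noncooperation individually attractive.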
5.4.2 Constructing the Design

In the previous section of this chapter, we identified appropriate experimental paradigms for investigating our research problem: the commons dilemma game, which we use as our take-some game, and the public goods game, which we use as our give-some game. We use one-shot games and a large group size, as the research problem investigated here specifically addresses social dilemma problems involving one-shot decisions and many individuals (see the hospital example and the water conservation example above). Next, we need to match the games with our hypothesis H1. That is, the paradigms must be modified so that they are adequate tests of H1 of the current example study. In particular, we need to incorporate a rule in both games prescribing how many tokens participants are allowed to take from the common good or how many tokens they are obliged to give to the common good, respectively. Furthermore, to create a more realistic environment, we also want to include a rationale that adds meaning to the rule. Otherwise, the people involved in the games may not understand why the rule exists and might, therefore, be less likely to follow it. In the following, I describe the experimental setup, including the materials used for the experiment and the design, in more detail.

5.4.2.1 Participants

First, let us define the sample we want to use for our study. Many different types of samples are possible. For instance, one sample type that provides highly reliable data is a representative sample. Representative samples need to meet specific requirements to ensure that the sample indeed reflects the composition of the given population. Compared to other sample types, the recruiting procedure for representative samples is rather complicated and expensive. Hence, researchers often use alternatives such as online or student samples.
Conducting experiments using online and student samples is a common approach in psychology and the social sciences. In the current example study, we decided to use an online sample. This decision was made because our study requires a relatively large sample size (remember that we use a large group size in the games; see above), and we have no further requirements on participants. Therefore, an online sample is an appropriate choice. Note that the literature comparing the data quality of online and student samples (see, e.g., Kees et al., 2017; Roulin, 2015) is not discussed here. To determine the exact group size, we first need to calculate how many individuals we need to reach an adequate statistical power with an appropriate statistical hypothesis test. The statistical power is the probability of not making a type II error, i.e., of not getting a false negative result. That is, given that there is a relationship between our dependent and independent variables, the statistical power is the probability that we find a significant relationship using our particular sample size and statistical test. The required statistical power can usually be found in the submission guidelines of scientific journals. Note that specific minimum values said to indicate an adequate statistical power are often set more or less arbitrarily. One probability value that is claimed to be a sufficient minimum statistical power is 0.8. For calculating the sample size for our experiment, we need the following information and parameters: a statistical test (here we choose the χ² test, see Chap. 2), the statistical power (1 − β = 0.8, see Chap. 3), the alpha error probability (significance level, α = 0.05, see Chap. 2), the degrees of freedom (df = 1, see Chap. 2), and the minimum effect size our test should be able to detect (w = 0.2, see Chap. 2). The effect size is indicated by a particular parameter (e.g., w, η², or “Cohen’s d”) and is often categorized as “small,” “medium,” or “large.” However, similar to the minimum requirement for the statistical power, the particular thresholds are set arbitrarily (see Correll et al., 2020, for a detailed discussion of this problem and suggestions on how to solve it). Researchers usually use software to compute the statistical power, such as G*Power (Faul et al., 2009) or the R package pwr (Champely et al., 2020). Analyzing the statistical power of a χ² test using the pwr package takes just a few lines of program code (see Sect. 5.6.2 of this chapter for further information, including an installation guide for R and RStudio):

    ## power analysis
    # install package
    install.packages("pwr")
    # load/attach package
    library(pwr)
    # power analysis of the chi-squared test
    pwr.chisq.test(w = 0.2, N = NULL, df = 1, sig.level = 0.05, power = 0.80)

Note that exactly one of the parameters w, N, power, and sig.level must be passed as NULL; that parameter is then determined from the others (see the documentation of the pwr package, https://cran.r-project.org/web/packages/pwr/pwr.pdf, for details). The output of the χ² power calculation is a summary of the relevant parameters, including the required sample size. Instructions for computing a statistical power analysis using G*Power, along with a link to download the latest version of the application, can be found online (https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower). The power analysis based on the parameters above revealed a required sample size of at least 197 participants. To further increase statistical power, leave some room for exploratory analyses, and compensate for possible data collection issues, we set the sample size to n = 200, i.e., 100 participants per group. We specified the experiment parameters based on this sample size.

5.4.2.2 Materials

Everything we put together to build our experimental setup can be summarized as the materials. Materials include the games we use, modifications and implementations (e.g., the rule), as well as specific parameters and values.
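Returning to the power analysis above: the pwr result can be cross-checked without R. For df = 1, the noncentral χ² variable is the square of a shifted standard normal, so the power has a closed form; the stdlib-only Python sketch below (function names are my own) searches for the smallest n reaching 80% power and reproduces the requirement of 197 participants.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def chisq_power_df1(n, w=0.2):
    """Power of a chi-squared test with df = 1, effect size w, alpha = 0.05.
    Exact for df = 1, where the noncentral chi-square is (Z + sqrt(lambda))^2."""
    z_crit = 1.959964  # sqrt of the critical value chi2(0.95, df = 1) = 3.8415
    root_lam = math.sqrt(n * w * w)  # sqrt of the noncentrality n * w^2
    return norm_cdf(root_lam - z_crit) + norm_cdf(-root_lam - z_crit)

def required_n(target_power=0.80, w=0.2):
    """Smallest sample size whose power reaches the target."""
    n = 1
    while chisq_power_df1(n, w) < target_power:
        n += 1
    return n

print(required_n())  # 197, matching the pwr/G*Power result
```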
We incorporate the rule into the setup of a one-shot public goods game (give-some game) and a one-shot commons dilemma game (take-some game). The rule is as follows: In the give-some game, it prescribes the minimum number of tokens that a participant has to give. In the take-some game, it prescribes the maximum number of tokens a participant is allowed to take. We design each game for 100 players (i.e., participants). In both games, the common good is a joint account (JA) that is finally shared by all participants of the game, and the individual good is a personal account (PA) equipped with a particular number of tokens (the endowment) for each participant. Next, we need to equip the PAs with a specific number of tokens. We select amounts similar in size to those used in previous studies. Here we choose an endowment of 1,600 tokens in the give-some game and 400 tokens in the take-some game. The initial account balance of the JA is 0 tokens in the give-some game and 120,000 tokens in the take-some game. We do not use a multiplier in the current study. That is, the tokens in the JA will not be multiplied by a specific factor. Increasing the MPCR would encourage participants to cooperate, i.e., to follow the rule, because cooperation would be more attractive from an economic point of view. In the current experiment, however, participants are expected to cooperate because cooperation is prescribed by the rule, while noncooperation, i.e., breaking the rule, remains economically preferable. Moreover, to give more internal validity to the rule of the games, we implement a tool that binds the participants to the rule.
In particular, before making their contributions or withdrawals, respectively, participants have to agree to a statement that they will follow the rule. The rule includes a specific reference value of 1,000 tokens that we call the “Need” (for similar applications of such need thresholds, see Bauer et al., 2022; Weiß et al., 2017; Wyszynski, 2020). The Need has two components: First, it specifies the number of tokens participants are allowed to keep in their final PAs. In the give-some game, they are committed to give the number of tokens on their PA that exceeds the Need to the JA, and in the take-some game, they are only allowed to take as many tokens from the JA as they lack to reach the Need on their PA. Second, the Need indicates the minimum number of tokens each player finally needs on their PA to receive an additional payment from the JA. That is, the tokens in the JA will be distributed equally among all participants who finish the game with ≥ 1,000 tokens in their PA. Participants finishing the game with < 1,000 tokens in their PA, however, receive nothing from the JA. In a nutshell, the rule indicates that PAs must not contain more and should not contain less than 1,000 tokens. The Need makes the outcomes of both games objectively equivalent: Participants have to give the same number of tokens in the give-some game as they are allowed to take in the take-some game (i.e., 600 tokens). Provided that all players follow the rule and finally have 1,000 tokens in their PAs, the JA will contain 60,000 tokens in both games. Thus, a portion of 600 tokens from the JA will be added to the 1,000 tokens remaining in the PA of each participant. In that case, all players would finally receive an outcome of 1,600 tokens. Let N be the Need, E_i the endowment of player i, g_i the tokens given, t_i the tokens taken, and n_{γ=1} the number of players for whom γ = 1; then the individual payoff functions for the rule-regulated give-some game (5.2) and take-some game (5.3) are:

    P_i = (E_i − g_i) + γ · (Σ_{j=1}^{n} g_j) / n_{γ=1}    (5.2)

where γ = 1 if (E_i − g_i) ≥ N, and γ = 0 if (E_i − g_i) < N.

    P_i = (E_i + t_i) + γ · (120,000 − Σ_{j=1}^{n} t_j) / n_{γ=1}    (5.3)

where γ = 1 if (E_i + t_i) ≥ N, and γ = 0 if (E_i + t_i) < N.

5.4.2.3 Design

The experimental design describes how the different independent variables are combined and offered to the participants. In a so-called full factorial design, all levels of the independent variables are combined with each other into individual experimental treatments (an example of a full factorial design including two factors with two levels each is provided in Table 5.1).

Table 5.1 Example of a full factorial design including two independent variables (factors) with two levels each (2 × 2 design)

    Treatment   Factor I   Factor II
    1           Level 1    Level 1
    2           Level 1    Level 2
    3           Level 2    Level 1
    4           Level 2    Level 2

However, a full factorial design might be impracticable for experiments including many independent variables with many levels. A high number of different factorial combinations (i.e., treatments) requires a high number of individual measurements. That is, the researchers have to recruit a high number of participants, and/or each participant needs to perform a high number of experimental trials. In addition to full factorial designs, there are many other ways to combine the levels of the independent variables into a particular experimental study plan. Alternatives providing solutions with fewer combinations are so-called fractional and orthogonal designs. Such designs are worth mentioning since they are commonly used in empirical studies (e.g., in conjoint analysis). However, I do not discuss fractional and orthogonal designs further here, since this chapter is intended to demonstrate the conduct of an empirical study as simply as possible. Therefore, I chose an example study that employs a simple (full) factorial design. In this study, there is only one factor (game) with two levels (level 1: give-some; level 2: take-some).
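The payoff functions (5.2) and (5.3) can be sketched in code. The snippet below is a minimal Python illustration (not part of the study materials; the function names and the list-based representation are my own choices). It checks the benchmark from the text: if all 100 players follow the rule, everyone ends up with 1,600 tokens in both games.

```python
NEED = 1_000  # the "Need" threshold N

def give_some_payoffs(endowments, given):
    """Eq. (5.2): P_i = (E_i - g_i) + gamma_i * sum(g) / n_{gamma=1}."""
    finals = [e - g for e, g in zip(endowments, given)]
    eligible = [f >= NEED for f in finals]  # gamma_i per player
    n_eligible = sum(eligible)              # n_{gamma=1}
    share = sum(given) / n_eligible if n_eligible else 0.0
    return [f + (share if ok else 0.0) for f, ok in zip(finals, eligible)]

def take_some_payoffs(endowments, taken, joint_account=120_000):
    """Eq. (5.3): P_i = (E_i + t_i) + gamma_i * (120,000 - sum(t)) / n_{gamma=1}."""
    finals = [e + t for e, t in zip(endowments, taken)]
    eligible = [f >= NEED for f in finals]
    n_eligible = sum(eligible)
    share = (joint_account - sum(taken)) / n_eligible if n_eligible else 0.0
    return [f + (share if ok else 0.0) for f, ok in zip(finals, eligible)]

# Benchmark from the text: full compliance yields 1,600 tokens each in both games.
print(give_some_payoffs([1_600] * 100, [600] * 100)[0])  # 1600.0
print(take_some_payoffs([400] * 100, [600] * 100)[0])    # 1600.0
```

Note that a give-some player who gives too little still finishes above the Need and so still receives a share; this mirrors the point in the text that breaking the rule remains economically preferable.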
Since we have selected the most suitable way of combining the variable levels into treatments, we now need to determine how they should be offered to the participants. Basically, there are two ways to do this: (1) Each participant receives every treatment; this is called a within-subjects design. We then compare the responses participants gave to each treatment. (2) Each participant receives only one treatment; this is called a between-subjects design. Here we compare the responses made by participants of one group with those made by the participants of the other group. Moreover, it is possible to combine within-subjects and between-subjects designs in a so-called mixed design. For instance, one could vary the levels of one variable within subjects and the levels of another variable between subjects. In the current example study, there is one independent variable that we want to measure with a one-shot game (i.e., a game that is not repeated). Thus, we use a between-subjects design. That is, we quasi-randomly divide our 200 participants into two groups of (almost) equal size and assign each group to one of the two games. That is, we measure rule-breaking in two independent groups using a full factorial between-subjects design. Summary: (1) A common way to find suitable experimental paradigms for testing hypotheses is an extensive literature search. In many cases, previously used paradigms can be combined and modified, and new variables can be added. (2) For the current example, we apply the public goods (give-some) game and the commons dilemma (take-some) game. To test our hypothesis, we incorporate a rule in the games and measure how often participants break this rule (dependent variable) depending on the game type (give-some vs. take-some; independent variable). (3) We use an online sample, which is only one of several possible sample types. We determine the sample size using a power analysis.
(4) Experiments in psychology and the social sciences can have many different designs. Researchers need to select a suitable design for their study. Here, we chose a full factorial between-subjects design with one dependent variable and one independent variable with two levels. (5) Authors should describe the procedure, sample, materials, and design as precisely as possible to make the study replicable for other researchers.

5.5 Implementation of the Instrument

In the following, I first describe the procedure of the games, and then I give step-by-step instructions on how to get them onto the computer screens using a suitable tool. Many software solutions have been developed and used to conduct interactive experiments in the past years. In the previous chapters, you learned how to use survey tools to design questionnaires (i.e., LimeSurvey in Chap. 2 and Qualtrics in Chap. 4). Extending the knowledge you have gained in those chapters, I show how such tools can be used to conduct interactive (online survey) experiments. In particular, I use the example study to explain how to program an experiment using Unipark, the academic program of the TIVIAN Enterprise Feedback Suite (EFS). Unipark is a complex and multifaceted survey tool used by many universities. However, Unipark is not free. Therefore, I also briefly discuss an alternative open-source application called “oTree” (Chen et al., 2016).

5.5.1 Procedure

Participants are randomly assigned to one of the two games. Each game starts by displaying the experiment instructions. The instructions vary depending on the game. We explicitly inform participants that they belong to a community of 100 members sharing one joint account, and that each individual holds a personal account equipped with 1,600 tokens (give-some game) or 400 tokens (take-some game).
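The random assignment into two groups of (almost) equal size can be sketched as follows. This is a plain Python illustration of the idea, not the mechanism the survey software uses internally; the function name and the seeded shuffle are my own choices.

```python
import random

def balanced_assignment(n_participants, treatments, seed=None):
    """Assign participants to treatments in (almost) equal proportions
    by shuffling a balanced list of treatment labels."""
    rng = random.Random(seed)
    # repeat the treatment list until it covers all participants, then trim
    labels = (treatments * (n_participants // len(treatments) + 1))[:n_participants]
    rng.shuffle(labels)  # shuffling preserves the group counts
    return labels

groups = balanced_assignment(200, ["give-some", "take-some"], seed=42)
print(groups.count("give-some"), groups.count("take-some"))  # 100 100
```

Because the label list is balanced before shuffling, group sizes can differ by at most one even when the number of participants is odd.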
We further inform participants about the rule that regulates how many tokens they must give to the joint account (give-some game) or how many tokens they are allowed to take from the joint account (take-some game). We explicitly state that the rule ensures an equal distribution of tokens among all community members. We then ask participants to agree to the rule. They must actively select “I agree” in a drop-down selection menu to continue the experiment. The next page includes the input field, where participants are asked to indicate the number of tokens they wish to give/take. They can enter either 0 or any positive number of up to four digits. To prevent the joint account from running dry in the take-some game, we add a limit on the number of tokens that participants are allowed to give/take (i.e., 1,200 tokens). However, since we are interested in the number of participants who try to take many more tokens, we record all first inputs irrespective of whether they exceed the limit of 1,200 tokens. We direct those participants who enter a number of tokens higher than 1,200 (we call this an “invalid number” of tokens) to another page, where they are asked to enter a number between 0 and 1,200. On this page, they can only enter numbers within this range. Finally, a screen summarizing the number of tokens participants have given/taken and the final balance of their personal account is shown. We then inform them about the processing time for calculating the payments and provide contact information to allow participants to ask further questions.

5.5.2 Programming Experiments Using Unipark (Tivian EFS)

In the following, I provide a manual describing the programming of the rule-regulated give-some game and take-some game. To properly understand this manual, you need to know some basics of the TIVIAN EFS. These include:

1. Handling the EFS user interface.
2. Setting up new projects and questionnaires (including project properties and layouts) in EFS.
3. Handling the EFS questionnaire editor (e.g., adding, copying, pasting, structuring, and arranging elements).
4. Exporting result data in comma-separated value (CSV) format.
5. Some basic HTML skills are also useful (e.g., for customizing the questionnaire design).

Details about the topics listed above are described in the official TIVIAN EFS Online Documentation (https://qbdocs.atlassian.net/wiki/spaces/DOC/overview).

5.5.2.1 General Setup

To create a new survey, we select the “New project” button at the upper left of the main menu screen (see Fig. 5.1). In the following “Create project” menu (see Fig. 5.2), we enter the custom project title (1), the project type (2) should be “Anonymous survey,” and the privacy assistant (3) should be active. We select the recommended layout (4), the folder (5) remains the default one, and we choose the survey language (6). We finally create the project by clicking the “create” button at the lower right.

[Fig. 5.1: Tivian “New project” button]
[Fig. 5.2: Tivian “Create project” menu]

In any scientific experiment involving human individuals, it is essential to inform participants about the purpose of the study and the use of their data, and to obtain their consent. Tivian provides a template for the General Data Protection Regulation (GDPR) settings and for setting up the consent form. The template can be found in the questionnaire editor. The GDPR settings are shown in the first lines of the editor (see Fig. 5.3).

[Fig. 5.3: Tivian GDPR settings]
[Fig. 5.4: Tivian “new question” menu]

We open the page “Consent form” to edit the settings. First, we select the survey language. Then, in the second input field, we enter a welcome message introducing the experiment, such as “Dear Participant, you are being invited to participate in a research study titled XYZ.
This study is being done by [name] from [institution].” Probably the most important part is the “Purpose statement.” Here we need to enter the consent form, which includes answers to the following questions: Why are we doing this research study? Who can participate in this research study? What will my task be, and how much time will it take? Will being in this study help me in any way? What are the risks of being involved in this study? How will my personal information be protected? Will I be given any money or other compensation for being in this research study? What happens if I agree but change my mind later? Whom can I ask if I have questions? Next, we select “Create Consent form” in the “Privacy information” window. We do not need a separate privacy notice, since it is already included in our consent form and we collect no personal data in the current study.

5.5.2.2 Introduction Page

After setting up the project, layout, data privacy information, and informed consent form, we open the questionnaire editor and create a new page called “Introduction” (or any other name you prefer). Jump directly to the page after creation. In the page menu, we create a new question. The “new question” menu appears (see Fig. 5.4). Enter a question title, e.g., “introduction.” Choose the question type “Text and multimedia” and then “text and image.” Click “save and edit” at the lower right corner of the “new question” menu. Now you have successfully created the first page of the experiment. You can edit the first page according to your preferences. The page should include an easy-to-understand description of the experiment that answers the first intuitive questions participants may have, e.g.: What is the study intended to investigate? How long does my participation take? What do I get when I decide to participate (incentives), and when and how will I get it?
What happens with my personal data? Whom can I contact if I have questions? The following text is an example of an experiment’s introductory first page:

Dear participant,
Welcome to our study, and many thanks for your participation. This study involves an experiment to investigate individual behavior in communities. Participation takes about 5 minutes. You will receive $0.10 for your participation. Additionally, you will receive a payoff of approximately $3.20 (min. $0.00; max. $4.40). Note that the amount of your final payoff depends on your behavior as well as on the behavior of other participants. Participation is voluntary and not mandatory. Your data will be saved anonymously and cannot be traced back to you. Please read the instructions on the following pages carefully.
For further information:
Contact: [Experimenter’s email]
Info: [Website of the research group]

5.5.2.3 Setup of Randomized Group Assignment

Recall that we apply a between-subjects design. That is, we need to quasi-randomly divide the sample into two groups of equal size. This can be done by performing the following steps: (1) Open the introduction page in the questionnaire editor and select “triggers” in the question menu bar on the left side. (2) Create a new trigger, enter a title (e.g., “treatment”), and select “random trigger.” Click the “create” button in the lower right corner. (3) In the following window (Fig. 5.5a), you may enter a description (optional) and make custom settings. For the “execution position” option, select “after submitting page, before filter”; for “condition,” select “always”; activate the selection box for “also execute trigger in preview mode” (needed to test the trigger), and leave the box for the “execute this trigger multiple times in one survey session” option unselected.
Click “save.” (4) Open the “detail configuration” tab and click on the “[+]” symbol before the point “user-defined variables.” Select the check box “create new user-defined variable.” Click “save” at the lower right corner of the “user-defined variables” menu. (5) Open the “user-defined variables” menu again, enter a name for the newly created variable (e.g., randomization), and choose “integer” as the variable type (Fig. 5.5b). Save again. (6) Back in the “detail configuration” menu, hide the “user-defined variables” option by clicking on the “[−]” symbol. Then, make the following settings (Fig. 5.5c): select the newly created variable in “save random number in variable”; minimum and maximum random values are “1” and “2”; select the box for the “approximate uniform distribution within given range” option (needed to get groups of almost equal size). Save changes. Return to the main menu of the questionnaire editor.

242 M. Wyszynski

Fig. 5.5 Creating a randomization trigger in Tivian. Trigger option (a), user-defined variables (b), and detail configuration (c)

Now, we have created a trigger function that randomly writes either “1” or “2” into the newly created “randomization” variable (c_0001) each time a participant leaves the introduction page to continue the experiment.

5.5.2.4 Programming the Give-Some Game

We assign each game to an individual filter variable. I start with the step-by-step instructions for the give-some game. Then I provide a copy-paste method for programming the take-some game based on the give-some game. For programming the give-some game, turn to the questionnaire editor and execute the following steps: (1) Create a new filter, enter the title “give-some game,” and select the “insert a blank standard page into the new branch” and the “jump to filter definition after creation” boxes.
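The randomization trigger set up in Sect. 5.5.2.3 can be sketched in a few lines of Python. This is a hypothetical illustration of the logic, not Tivian's code; it assumes that the "approximate uniform distribution within given range" option amounts to assigning each new participant the group value drawn least often so far, which keeps the two groups within one participant of each other.

```python
# Minimal sketch (our own illustration, not Tivian's implementation) of a
# random trigger writing 1 or 2 while keeping group sizes approximately equal.
import random

class RandomTrigger:
    def __init__(self, low=1, high=2):
        self.counts = {value: 0 for value in range(low, high + 1)}

    def assign(self):
        # Pick among the least-used values so far, breaking ties randomly.
        least = min(self.counts.values())
        candidates = [v for v, c in self.counts.items() if c == least]
        value = random.choice(candidates)
        self.counts[value] += 1
        return value

trigger = RandomTrigger()
assignments = [trigger.assign() for _ in range(100)]
# The two groups differ in size by at most one participant.
```

A purely uniform draw would only balance the groups in expectation; the least-used rule mimics why the chapter recommends enabling the approximate-uniform option.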
Press “ok.” We automatically get to the filter definition menu, where we find the “general options” in the first block of rows and the filter definitions in the second block. In the second block of rows, we find the columns “conjunction,” “negation,” “bracket,” “variables,” “condition,” “code,” and “bracket.” Select the “randomization” variable in the “variables” column, select “equal” in the “condition” column, and enter “1” in the text field of the “code” column. Click “save” and return to the questionnaire editor (Fig. 5.6).

Fig. 5.6 Tivian filter setup: general options

(2) Open the standard page that was automatically added to the filter. Choose “properties” at the page menu on the left side. Enter a title (e.g., give-some_intro) and click “save” in the lower right corner. Choose “questions” in the page menu. Add a new “text and image” question and enter a title (e.g., instructions_gs). Insert detailed instructions for the give-some game. For example:

In this experiment, you belong to a community consisting of 100 members. Each member of the community has a personal account containing 1,600 tokens. Additionally, there is a joint account called “community account” shared by all members of your community. The balance of the community account is initially 0 tokens. Each member must give a part of their tokens to the community account. After all members have made a contribution, the tokens in the community account will be shared equally among all members who have at least 1,000 tokens on their personal accounts. Additionally, each member will keep the tokens remaining in their personal account. To ensure a fair distribution of the tokens among all members of your community, each member is only allowed to keep 1,000 tokens and should, therefore, contribute the rest of the tokens to the community account.
Thus, all members would finally receive the same number of tokens: the tokens remaining on the personal account plus the share from the joint account. For instance, provided that all members follow the rules, each member will get 1,000 tokens from their personal account plus 600 tokens from the joint account. Participants will finally receive $0.002 per token. For instance, the initial endowment of 1,600 tokens is worth $3.20.

(3) Now, we add the rule to the experiment: Return to the main menu of the questionnaire editor. Add a new page to the “give-some game” filter and enter a title (e.g., rule). Activate the selection box to jump to the new page and click “save.” Add a new “text and image” question with the title “agreement” (or something else). Open the question and enter instructions for the participants, such as “before making your decision on the next page, we ask you to agree to the following commitment”: Return to the page menu. Add another new question. This time, choose “single response” in the “new question” menu and then “drop-down list.” Enter a question title (e.g., response_agree) and click “save and edit.” Enter the question text, e.g., “I agree to keep only 1,000 tokens on my personal account. I commit to give the rest of my tokens to the community account.” Make answering the question mandatory by selecting “yes” under the point “mandatory (DAC).” Scroll down to the block “answers.” Add “I agree” as a new answer option. Save and return to the main menu of the questionnaire editor.

(4a) Next, we add the page including the input field where participants are asked to enter the number of tokens they want to give to the joint account. Add a new page to the “give-some game” filter and enter a title (e.g., contribution_page). Open the new page, add a new question, and enter a question title (e.g., contribution).
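The payoff rule quoted in the instructions above can be made concrete with a short sketch. This is our own illustration, not code used by the survey platform; the function name and the handling of non-compliant members are assumptions based on the quoted instructions (the pot is split equally among members whose personal account keeps at least 1,000 tokens, at $0.002 per token).

```python
# Hypothetical sketch of the give-some game's payoff rule.
def give_some_payoffs(contributions, endowment=1600, threshold=1000, rate=0.002):
    remaining = [endowment - c for c in contributions]   # personal accounts
    pot = sum(contributions)                             # community account
    eligible = [r >= threshold for r in remaining]       # who receives a share
    share = pot / sum(eligible) if any(eligible) else 0
    tokens = [r + share if e else r for r, e in zip(remaining, eligible)]
    return [round(t * rate, 2) for t in tokens]          # dollars per member

# If all 100 members follow the rule and contribute 600 tokens, each keeps
# 1,000 tokens and receives a 600-token share: 1,600 tokens, i.e., $3.20.
compliant = give_some_payoffs([600] * 100)

# A member who contributes nothing still holds at least 1,000 tokens and
# collects a share on top, ending near the stated maximum of $4.40.
free_rider = give_some_payoffs([0] + [600] * 99)
```

The sketch also makes the social dilemma visible: the free rider earns more than the compliant members, which is exactly why the commitment rule is introduced in the next step.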
Select “fields for text input” and choose “text field (single row).” Click “save and edit.” Enter a question text (e.g., you have 1,600 tokens, how many of them will you give to the community account?) and instructions for filling it in (e.g., note that you need at least 1,000 tokens to receive your portion from the community account!). Make answering the question mandatory and change “input format” to “integer.” You may further wish to label the input field with some text. If so, you can enter any text in the “text around the input field” field. We then restrict the “input length (max.)” to “4” digits. Let us assume that the variable name of the “contribution” question is v_12. Remember it or write it down (important for the following steps). Then save the question and return to the page menu. Note that you can add further questions, for instance, for adding text summarizing the instructions.

(4b) To subtract the number of tokens the participant entered into the input field from their personal account, we need to configure another trigger. Choose “triggers” in the page menu of the contribution page. Add a new trigger, enter a title (e.g., personal_account), and select “recoding trigger” in the “new trigger” menu. In the trigger settings, select “after submitting page, before filter” for the “execution position” option; for “condition,” select “always”; activate the selection box for “also execute trigger in preview mode” (needed to test the trigger) and leave the box for the “execute this trigger multiple times in one survey session” option unselected. Click “save.” Open the “detail configuration” tab and click on the “[+]” symbol before the point “user-defined variables.” Select the check box “create new user-defined variable.” Click “save” at the lower right corner of the “user-defined variables” menu. Then, open the “user-defined variables” menu again, enter a title for the newly created variable (e.g., remaining_tokens), and choose “integer” as the variable type.
Save again. Back in the “detail configuration” menu, hide the “user-defined variables” option by clicking on the “[−]” symbol. Jump to the “defined recodings” block of rows and select the newly created variable (e.g., c_0002) in “variable to be recoded.” For calculating the new value (value column), we use so-called wildcards. Wildcards are placeholders for previously defined variables. They can be used in text fields by putting the respective variable name between two hashes (e.g., #v_12#). To save the number of tokens remaining in the participant’s personal account after making their contribution in the new “remaining_tokens” variable, we enter the following into the text field of the “value” column: “1600 − #v_12#.” That is, whatever the participant has entered into the input field (#v_12#) is subtracted from their initial endowment of 1,600 tokens. Save the changes and return to the main menu of the questionnaire editor (Fig. 5.7).

Fig. 5.7 Tivian, detail configuration of recoding trigger

(5) To incorporate the limit of tokens participants are allowed to enter in the contribution field, we add another filter to the “give-some game” filter (title: e.g., invalid_number). Add the new filter (proceed as described above). Choose the variable name of the input field (the one you wrote down in the previous step) in the “variables” column of the filter definition menu. Select “greater” in the “condition” column and enter “1200” into the text field of the “code” column. Click “save.” A new row appears in the filter definition row block. Choose “OR” in the conjunction column, the variable name of the input field variable in the “variables” column, “less” in the “condition” column, and enter “0” into the text field of the “code” column. Save the filter, return to the questionnaire editor, and open the new page created in this filter. Add a title to the page (e.g., contribution_invalid_number).
Add a “text and image” question with any title (e.g., text_contribution_2). Enter some text that asks participants to enter a number between 0 and 1,200 tokens in the input field. Save the question and return to the page menu. Add another new question, enter a title (e.g., contribution_2), select “fields for text input,” and choose “text field (single row).” Click “save and edit.” Let us assume that the variable name of the “contribution_2” question is v_13. Note that you need to repeat the previous step (4b) and create a new trigger for the “contribution_invalid_number” page with another user-defined variable for the new trigger (c_0003; e.g., remaining_tokens_2).

(6) After setting up the new trigger, turn back to the “contribution_invalid_number” page menu and choose “plausibility checks.” Add a new plausibility check, enter a title (e.g., validity_check), and select “range check.” Click “proceed.” Remain in the “standard editor” tab and make the following settings: Choose “no” for “user may ignore this check”; “client (JavaScript) AND server (PHP)” for “systems to use for check”; “yes” for “execute check if one or more items are hidden”; enter an error message (optional; e.g., please enter a number between 0 and 1,200). In the second block of rows, the plausibility check definition needs to be specified. Select the variable name of the “contribution_2” question (v_13) in the first drop-down selection menu of the “condition” column and enter the range of valid values in the second (0) and third (1200) input fields (Fig. 5.8). Save the plausibility check and turn back to the questionnaire editor.

Fig. 5.8 Tivian, setting up a new plausibility check

(7) We insert two versions of the summary screen presenting the results. The first version (7a) is shown to participants who entered a number exceeding the limit on the contribution page. The second version (7b) is shown to those who entered a valid number of tokens with their first input.

(7a) We add the first version of the summary screen directly to the “invalid_number” filter: Add a new page, enter a title (e.g., result_info_a), and jump to the page. Add a new “text and image” question and enter a title (e.g., text_results_a). Use wildcards to enter a text that summarizes the results (e.g., you contributed #v_13# of your 1,600 tokens to the community account. #c_0003# tokens remain on your personal account). Save and turn back to the questionnaire editor.

(7b) For the second version of the summary screen, we add another filter to the “give-some game” filter: Add the filter and enter a title (e.g., results). Add a new page, enter a title (e.g., result_info_b), and jump to the page. Add a new “text and image” question and enter a title (e.g., text_results_b). Use wildcards to enter a text that summarizes the results (e.g., you contributed #v_12# of your 1,600 tokens to the community account. #c_0002# tokens remain on your personal account). Save and turn back to the questionnaire editor.

5.5.2.5 Programming the Take-Some Game

For programming the take-some game, it is possible to repeat everything we have done for the give-some game. Alternatively, we can copy the “give-some game” filter and modify the duplicate according to the requirements of the take-some game. Click on the copy icon to the right of the “give-some game” filter, enter a new title (e.g., take-some game), and confirm copying by clicking “copy.” The element structure of the take-some game is the same as for the give-some game. Using this method, we just need to adjust filters, pages, variables, triggers, plausibility checks, and questions according to the specifications of the take-some game.
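Before moving on, the logic of steps (4b) to (7b) of the give-some game can be condensed into one sketch. This is our own illustration using the chapter's variable names; Tivian evaluates these settings internally, so none of the functions below are platform code. The filter routes out-of-range first entries to a second, range-checked input field, the recoding trigger computes the remaining tokens, and wildcards such as #v_12# are replaced by stored values on the summary screen.

```python
import re

def invalid_number(v):
    # The "invalid_number" filter: fires if v > 1200 OR v < 0.
    return v > 1200 or v < 0

def validity_check(v, low=0, high=1200):
    # The "validity_check" range check on the second input field.
    return low <= v <= high

def fill_wildcards(text, variables):
    # Replace #name# wildcards with previously recorded variable values.
    return re.sub(r"#(\w+)#", lambda m: str(variables[m.group(1)]), text)

variables = {"v_12": 600}
variables["c_0002"] = 1600 - variables["v_12"]   # recoding trigger (step 4b)

summary = fill_wildcards(
    "You contributed #v_12# of your 1,600 tokens to the community account. "
    "#c_0002# tokens remain on your personal account.", variables)
```

With v_12 = 600, the summary reads "You contributed 600 of your 1,600 tokens to the community account. 1000 tokens remain on your personal account," while an entry of, say, 1500 would trip invalid_number and route the participant to the second input field.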
The following step-by-step instruction describes how to do that:

(1) Open the “take-some game” filter and change the value displayed in the “code” column (should be “1”) to “2.” Click “save” and return to the questionnaire editor.

(2) Open the “give-some_intro” page of the new “take-some game” filter and choose “properties” at the page menu on the left side. Change the title (e.g., to take-some_intro) and click “save” in the lower right corner. Choose “questions” in the page menu. Open the “instructions_gs” question and change the title (e.g., to instructions_ts). Insert the detailed instructions for the take-some game. For example, the text is identical to the give-some game instructions, except for some changes that are italicized in the following example:

In this experiment, you belong to a community consisting of 100 members. Each member of the community has a personal account containing 400 tokens. Additionally, there is a joint account called “community account” shared by all members of your community. The balance of the community account is initially 120,000 tokens. Each member is allowed to take a certain number of tokens from the community account. After all members have made a withdrawal, the remaining tokens in the community account will be shared equally among all members who have at least 1,000 tokens on their personal accounts. Additionally, each member will keep the tokens remaining in their personal account. To ensure a fair distribution of the tokens among all members of your community, each member should only take as many tokens as needed to have 1,000 tokens on their personal account. Thus, all members would finally receive the same number of tokens: the tokens remaining on the personal account plus the share from the joint account.
For instance, provided that all members follow the rules, each member will get 1,000 tokens from their personal account plus 600 tokens from the joint account. Participants will finally receive $0.002 per token. For instance, the initial endowment of 400 tokens is worth $0.80.

(3) Return to the main menu of the questionnaire editor. Open the “rule” page in the “take-some game” filter and then open the question attached to this page. Edit the text of the question to match the take-some game (e.g., I commit to take only as many tokens as I need to have 1,000 tokens on my personal account). Save and return to the main menu of the questionnaire editor.

(4a) Next, we edit the page that includes the input field where participants are asked to enter the number of tokens they want to take from the joint account. Open the “contribution_page” of the “take-some game” filter and edit the title (e.g., withdrawal_page). Open the question attached to the withdrawal page and change the title (e.g., withdrawal). Edit the question text (e.g., you have 400 tokens, how many tokens will you take from the community account?). Let us assume that the variable name of the “withdrawal” question is v_22. Remember it or write it down (important for the following steps). Then save the question and return to the page menu.

(4b) Choose “triggers” in the page menu of the withdrawal page. Open the trigger, edit the title (e.g., personal_account), and select “recoding trigger” in the “new trigger” menu. Open the “detail configuration” tab and jump to the “defined recodings” block of rows. To save the number of tokens remaining in the participant’s personal account after making their withdrawal in the new “remaining_tokens” variable, we enter the following string into the text field of the “value” column: “#v_22# + 400”; that is, whatever the participant has entered into the input field (v_22) is added to their initial endowment of 400 tokens.
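The take-some payoff rule mirrors the give-some sketch shown earlier. Again, this is a hypothetical illustration; the function name and the treatment of the leftover pot are our assumptions based on the instructions quoted above, not platform code.

```python
# Hypothetical sketch of the take-some game's payoff rule: withdrawals are
# added to the 400-token endowment (the recoding "#v_22# + 400"), and the
# leftover community account is split among members holding >= 1,000 tokens.
def take_some_payoffs(withdrawals, endowment=400, pot=120_000,
                      threshold=1000, rate=0.002):
    personal = [endowment + w for w in withdrawals]   # recoding trigger
    leftover = pot - sum(withdrawals)                 # community account
    eligible = [p >= threshold for p in personal]
    share = leftover / sum(eligible) if any(eligible) else 0
    tokens = [p + share if e else p for p, e in zip(personal, eligible)]
    return [round(t * rate, 2) for t in tokens]       # dollars per member

# If all 100 members follow the rule and take exactly 600 tokens, each holds
# 1,000 tokens, the pot retains 60,000 tokens, and each share is 600 tokens:
# 1,600 tokens per member, i.e., $3.20, just as in the give-some game.
payoffs = take_some_payoffs([600] * 100)
```

The two games are thus payoff-equivalent under full compliance, which is what makes the give/take framing the interesting between-subject manipulation.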
Save the changes and return to the main menu of the questionnaire editor.

(5) Next, we edit the “invalid_number” filter controlling the limit of tokens participants are allowed to enter in the withdrawal field. Open the “invalid_number” filter attached to the “take-some game” filter. Edit the variable name of the input field (enter the one you wrote down in the previous step) in the “variables” column of the filter definition menu. Save the filter, return to the questionnaire editor, open the “contribution_invalid_number” page in this filter, and edit the title (e.g., withdrawal_invalid_number). Open the “text and image” question and edit the title (e.g., text_withdrawal_2). Turn back to the page menu and open the “contribution_2” question. Edit the title (e.g., withdrawal_2) and the text (copy-paste from the “withdrawal” page). Click “save and edit.” Let us assume that the variable name of the “withdrawal_2” question is v_23. Note that you need to repeat the previous step (4b) and edit the trigger for the “withdrawal_invalid_number” page by adjusting the “value” column in the “detail configuration” menu of the trigger.

(6) Turn back to the “withdrawal_invalid_number” page menu and choose “plausibility checks.” Open the plausibility check and select the variable name of the “withdrawal_2” question (v_23) in the first drop-down selection menu of the “condition” column in the plausibility check definition. Save the plausibility check and turn back to the questionnaire editor.

(7) Edit both versions of the summary screen presenting the results by changing the variable names and wildcards in the text according to the corresponding names of the take-some game variables.

5.5.2.6 End Page

The end page is the final page of the experiment. Typically, there is a “text and image” question attached to that page.
The text may include thanks to the participants for taking part, an individual code the participants need in order to get their payoff from a third-party platform (e.g., Amazon MTurk), contact information of the experimenters, and further information about the experiment.

5.5.2.7 Overview of Variables and Structure

Next, I provide a list of relevant variables (Table 5.2) and the structure of the filters, pages (including titles), and questions that I used in the step-by-step instructions.

Table 5.2 List of relevant experiment variables for the give-some game and the take-some game

Give-some game:
Variable | Type | Location | Purpose | Values
c_0001 | user-defined variable | page “introduction”; random trigger | randomly assign participants to games | 1 or 2
v_12 | question variable | question “contribution” | record the number of tokens to give indicated by the participant | 0 to 9999
c_0002 | user-defined variable | page “contribution_page”; recoding trigger | calculate the final personal account balance | 1600 − v_12
v_13 | question variable | question “contribution_2” | record the number of tokens to give indicated by the participant | 0 to 1200
c_0003 | user-defined variable | page “contribution_invalid_number”; recoding trigger | calculate the final personal account balance | 1600 − v_13

Take-some game:
Variable | Type | Location | Purpose | Values
v_22 | question variable | question “withdrawal” | record the number of tokens to take indicated by the participant | 0 to 9999
c_0004 | user-defined variable | page “withdrawal_page”; recoding trigger | calculate the final personal account balance | v_22 + 400
v_23 | question variable | question “withdrawal_2” | record the number of tokens to take indicated by the participant | 0 to 1200
c_0005 | user-defined variable | page “withdrawal_invalid_number”; recoding trigger | calculate the final personal account balance | v_23 + 400

1. Page: “introduction”
   → Trigger: “treatment”; random trigger, user-defined variable (c_0001 = 1 or c_0001 = 2)
2. Filter: “give-some game”; condition: c_0001 = 1
   a. Page: “give-some_intro”
      i. Text and image question: “instructions_gs”
   b. Page: “rule”
      i. Text and image question: “agreement”
      ii. Single response, drop-down list: “response_agree”
   c. Page: “contribution_page”
      i. Fields for text input, text field (single row): “contribution” (v_12)
         → Trigger: “personal_account”; recoding trigger, user-defined variable “remaining_tokens” (c_0002)
   d. Filter: “invalid_number”; condition: v_12 < 0 or v_12 > 1200
      i. Page: “contribution_invalid_number”
         A. Text and image question: “text_contribution_2”
         B. Fields for text input, text field (single row): “contribution_2” (v_13)
            → Trigger: “personal_account”; recoding trigger, user-defined variable “remaining_tokens_2” (c_0003)
            → Plausibility check: “validity_check”; definition: v_13 ≥ 0 and v_13 ≤ 1200
      ii. Page: “result_info_a”
         A. Text and image question: “text_results_a”
   e. Filter: “results”; condition: v_12 ≥ 0 and v_12 ≤ 1200
      i. Page: “result_info_b”
         A. Text and image question: “text_results_b”
3. Filter: “take-some game”; condition: c_0001 = 2
   a. Page: “take-some_intro”
      i. Text and image question: “instructions_ts”
   b. Page: “rule”
      i. Text and image question: “agreement”
      ii. Single response, drop-down list: “response_agree”
   c. Page: “withdrawal_page”
      i. Fields for text input, text field (single row): “withdrawal” (v_22)
         → Trigger: “personal_account”; recoding trigger, user-defined variable “remaining_tokens” (c_0004)
   d. Filter: “invalid_number”; condition: v_22 < 0 or v_22 > 1200
      i. Page: “withdrawal_invalid_number”
         A. Text and image question: “text_withdrawal_2”
         B. Fields for text input, text field (single row): “withdrawal_2” (v_23)
            → Trigger: “personal_account”; recoding trigger, user-defined variable “remaining_tokens_2” (c_0005)
            → Plausibility check: “validity_check”; definition: v_23 ≥ 0 and v_23 ≤ 1200
      ii. Page: “result_info_a”
         A. Text and image question: “text_results_a”
   e. Filter: “results”; condition: v_22 ≥ 0 and v_22 ≤ 1200
      i. Page: “result_info_b”
         A. Text and image question: “text_results_b”
4. End page
   a. Text and image question

5.5.2.8 Adding Demographic Questions and Personality Scales

We can add any additional pages and questions to the experiment. For instance, to collect demographic information about the participants, we can simply add a new page following the introduction page with various questions about, e.g., participants’ age, sex, economic status, and other items. Note that placing the demographic questions at the beginning of a questionnaire has been shown to reduce the dropout rate of participants during participation. However, some researchers prefer demographic questions at the end of a questionnaire. Another instrument often used in such experiments is the inclusion of personality scales or psychometric measures (e.g., intelligence, thinking styles, and risk-taking propensity). Like the demographic questions, personality scales can easily be incorporated into the experiment by adding new pages using Tivian’s questionnaire editor.

5.5.2.9 Adding Comprehension Checks and Attention Tests

Especially in monetarily incentivized online experiments, some individuals may try to receive the reward without participating seriously. That is, they make their choices randomly or go through questionnaires without reading the questions. The responses of such participants are unreliable, and they can render whole datasets useless. One way to minimize the number of unreliable responses is to include comprehension checks and attention tests in the experiment. Comprehension checks and attention tests are typically additional, easy-to-answer questions that are either general (attention test) or refer to the experiment instructions (comprehension check).
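Responses that fail such checks are usually excluded before the analysis. The following is a hypothetical sketch of that screening step; the field names ("attention_item", "comprehension_correct") are invented for illustration and are not Tivian export columns.

```python
# Hypothetical post-hoc screening: keep only participants who passed both
# the attention test and the comprehension check.
def is_reliable(row):
    return row["attention_item"] == 2 and row["comprehension_correct"]

data = [
    {"id": 1, "attention_item": 2, "comprehension_correct": True},
    {"id": 2, "attention_item": 5, "comprehension_correct": True},   # failed attention test
    {"id": 3, "attention_item": 2, "comprehension_correct": False},  # failed comprehension check
]
clean = [row for row in data if is_reliable(row)]   # keeps only participant 1
```

Preregistering such exclusion criteria before data collection avoids the suspicion that participants were dropped selectively.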
One example of an attention check in an experiment using Likert scales is asking the participant to indicate a particular response on the Likert scale (e.g., please indicate scale value 2). Comprehension checks are often more like quizzes. For instance, participants might be given some statements describing the experiment with only one of them being correct, and they are asked to choose the correct one. In Tivian EFS, comprehension checks and attention tests can be incorporated into the experiments by adding additional questions or separate pages that include the checks.

5.5.3 Ethical Standards and Ethics Committee Application

Researchers must adhere to ethical standards when designing experiments with humans and animals. One set of ethical principles for human experimentation that is often used as a standard for ethics in human research is the Declaration of Helsinki. However, the Declaration of Helsinki was formulated by the World Medical Association and therefore addresses ethical issues in the context of clinical or medical research. Even though many of its principles can be applied to psychology and social science research, some do not transfer to other research fields. Without going into detail, researchers need to obtain information about the applicable ethical standards before conducting an experiment. Such information can be found, for instance, on the official Web sites of scientific associations such as the American Psychological Association (APA).6 Peer-reviewed scientific journals often require that experiments be officially approved by an independent ethics committee. Universities and other research institutes often have their own ethics committees. Researchers need to send an application to that committee that includes a detailed description of the experiment they plan to conduct (e.g., details about sample, materials, design, procedure, and facilities). The committee decides whether or not the experiment meets common ethical standards.
In case of a positive decision, researchers receive an official approval letter from the committee.

Box 5.1: oTree

The open-source platform oTree (https://www.otree.org/) is a well-documented alternative for conducting Web-based interactive experiments. However, it requires some experience in programming HTML and Python. Moreover, without going into detail, users need to learn specific terms and structures, and how to handle the user interface, before they start programming an experiment. Once the programming of the experiment is finished, users have to set up the server hosting the application. Setting up a server using Linux or Windows might be another big challenge for users with little or no experience. One crucial advantage compared to Unipark is that participants can interact directly. Some experimental paradigms require direct interactions, particularly those where participants respond directly to the actions of other participants (e.g., experiments involving punishment). For the example given in the current chapter, a direct interaction between the participants is not absolutely necessary, but in other cases it is indispensable. In conclusion, although oTree is a useful tool for conducting interactive experiments with several advantages over other platforms, it might be rather difficult to use for beginners without any experience in programming experiments.

6 https://www.apa.org/ethics/code.

5.5.4 Cognitive Interviews and Pretests

A cognitive interview is a useful tool for improving the quality of questionnaires, experiment instructions, choice architecture, and study procedures. It can reveal how respondents interpret and understand the instructions of experiments, questionnaires, or individual survey items. Cognitive interviews require relatively small sample sizes; even a single cognitive interview can already help to improve the study. Various techniques can be applied in cognitive interviews.
Standard techniques include probing, confidence ratings, paraphrasing, and thinking aloud. Probing is one of the most popular techniques: the experimenter asks additional questions about how the participant understood particular instructions, items, or single words. The interviewer typically supervises the experimental session and asks the questions immediately after the participant makes their response. For instance, the interviewer may ask the participant how they understood the role of the rule, or why they indicated that particular number of tokens in the input field of the current example study. Confidence ratings are usually used to test questions about facts (e.g., how many cigarettes do you smoke a day?); the interviewer asks the participant how confident they are in their response. Confidence ratings play no role in the context of the current example study. Paraphrasing, by contrast, is a very useful technique for testing the comprehensibility of an experiment: after reading the instructions, participants are asked to explain the experiment in their own words. The researchers can then compare the participant’s explanation with the actual meaning of the instructions. Using the thinking-aloud technique, the interviewer asks the participant to verbalize all thoughts that come up during the decision process until they make a response. The verbalized thoughts can give insights into what participants think during their participation, and the researchers can reformulate instructions, questions, and items if these lead participants astray. The interviewer carefully records all responses of the participants. The protocols are then thoroughly evaluated (preferably by more than one person), and the experimental setup is modified accordingly. For detailed information about cognitive interview techniques and instructions, see, e.g., Memon and Higham (1999) as well as Prüfer and Rexroth (2005).
Pretests can further increase the quality of an experiment. They are a powerful tool for finding and fixing errors. Conducting a pretest of an experiment or survey before it goes into the field is a common procedure. For a pretest, the study is distributed to some participants (usually a smaller sample than the final one) to test whether all study elements work as they should. Unipark provides a specific pretest mode that allows participants to comment on particular pages and questions when they detect any errors or inconsistencies. To enable pretest comments in the survey, turn to the project properties in the Tivian survey menu, open the survey options, and then the “internal organization” tab. Here you will find the option “Pretest comments in the survey.” Enable it using the drop-down selection menu and save the changes (Fig. 5.9).

Fig. 5.9 Enabling pretest comments

Furthermore, the pretest allows you to check all filters, triggers, plausibility checks, and variables, and to fix them if any problems occur. In addition, the data collected with the pretest can be used for a first analysis, e.g., to check the structure of the data set and to test analysis programs.

Summary: (1) The procedure of the study describes in detail what happens to the participants and what they have to do during the experiment. (2) One of many ways of programming an interactive online experiment is using Unipark; in this section, I gave instructions to implement rule-regulated take-some and give-some games in Unipark. (3) Comprehension checks and attention tests are useful tools to test participants’ reliability. (4) Experiments should meet specific ethical standards for protecting subjects and safeguarding good research practice. (5) Cognitive interviews and pretests often increase the quality of an experiment.

5.5.5 How to Go into the Field (Online)

After all the issues revealed by cognitive interviews and pretests have been fixed, the study is ready for the field.
One of the most common ways to distribute the experiment to participants is to invite them to a behavioral science laboratory, where they are guided through the experiment by an experimenter who introduces and observes the experimental process. Another way, which has become more popular within the last decade, is the use of online experiments: participants are invited via online platforms and participate using their own devices (or other devices, e.g., in the office, university, or workplace). Both ways have pros and cons, which I do not discuss here. In this chapter, I give instructions for conducting the study online.

5.5.5.1 Quality Standards of Online Experiments

The methodological requirements of online experiments are identical to those of conventional experimental techniques. In addition, online experiments are required to meet specific quality standards. For instance, the Working Group of German Market and Social Research Institutes (Arbeitskreis Deutscher Markt- und Sozialforschungsinstitute e.V., 2018) provides a list of online research standards that maintain certain quality requirements of online surveys and experiments. Among them are uncomplicated access to the study, participation independent of Internet device and connection, straightforward communication, voluntariness, and data protection. That is, respondents must be informed about the purpose of the survey and must participate voluntarily. Personal data and metadata collected must be treated carefully and must not be accessible to third parties. Other quality standards refer to technical requirements, such as an adequate field time (e.g., the experimenter should provide sufficient time for participation), field control measures to identify and prevent multiple responses and mere clicking through the questionnaire, the ability to drop out at any point of the study, and the possibility to resume participation after closing the survey tool window.
Complying with these standards should guarantee reliable data collection.

5.5.5.2 Distributing the Experiment to Participants

Online studies are typically distributed to the participants by providing them with a particular hyperlink that forwards them to the study. Undergraduate and graduate students often use email lists to send the link out, or social networks to publish their studies in certain groups. Additionally, they sometimes ask their participants to forward the link to other people (e.g., friends, family members, and their network) to benefit from a snowball effect. Another way to distribute the hyperlink to potential participants is to use so-called crowdsourcing networks. Companies and other providers use crowdsourcing networks to offer small jobs such as writing short texts or comments, translation jobs, transcription jobs, photo editing jobs, participation in online surveys, and many others. The network divides the work into smaller pieces (jobs) and distributes them among its members, who get paid for that particular job. Most of these networks charge the companies that commission the jobs a fee, which is usually a proportion of the wage the participants receive for the job. One crowdsourcing platform that has repeatedly been used for scientific online experiments is Amazon Mechanical Turk (MTurk). However, several alternatives have arisen in the meantime, such as Prolific (made particularly for scientific studies) and Clickworker (an alternative in German and English). Researchers using a crowdsourcing network to distribute the link to their study need an account at the particular network they wish to use. Creating an account is usually easy to manage. In the following, I provide step-by-step instructions for using the MTurk crowdsourcing network to recruit participants for an online study such as a survey or an experiment. (1) Signing up to MTurk: Note that you need a regular Amazon account or an AWS account.
Visit the Amazon Mechanical Turk Web page (https://www.mturk.com/) and click the button labeled "Get started with Amazon Mechanical Turk." MTurk distinguishes between "Workers" and "Requesters": a Worker is an MTurk customer who is looking to work on crowdsourcing tasks, and a Requester is an MTurk customer who offers and pays for these tasks. On the following page, click the button labeled "Sign in as a Requester." Sign in using your Amazon account or your AWS account and follow the instructions given by MTurk. (2) You are now signed in as a Requester. At the top of the screen, you find the "Create" button. Click on it and turn to the "New Project" page. You can now choose a template for your project. There are two options for online surveys: using an external link or using the internal MTurk survey tool. We use the "Survey Link" option for the current example study, since we have already designed the study in Unipark: choose "Survey Link" in the left menu and click on the "Create Project" button in the lower right corner. (3) Next, we enter the project properties: enter a title (e.g., academic experiment on decision making), a short description of your study (which should include reward information such as the range of possible bonus payments), and some keywords that will help MTurkers find the study (members of MTurk are often called "Workers" or "MTurkers"). (4) We then turn to the next point, "Setting up your survey" (Fig. 5.10), where you need to enter some specifications for the project. "Reward per response": Enter the participation fee. MTurk will automatically pay this amount to the participants after you have approved their participation, which is called a HIT (human intelligence task) in MTurk. Bonus payments need to be added to each HIT individually (see below).
Note that you need to include an additional input field in the experiment (e.g., in Unipark) where participants are asked to enter their Worker ID (an anonymous ID for each MTurker) if you want to pay bonuses based on performance in the experiment. For the current example study, we will pay a participation fee of $0.10 (reward) and $0.02 for each point earned in the study (bonus). "Number of respondents": Enter the number of participants you need. "Time allotted per Worker": Enter the maximum time a Worker has to complete the experiment. There is no need to limit the time strictly; as recommended by MTurk: be generous! "Survey expires in": Enter the maximum time the experiment will be available for participation, given that the number of respondents stays lower than the number you have indicated above. "Auto-approve and pay Workers in": Especially if you wish to pay bonuses, you should enter a rather high number; note that you need to check the HITs for potential cheaters (e.g., code sharing) within this period. (5) The last setup point refers to "worker requirements": Choose the qualification requirements of the Workers you wish to request for participation. Note that you should consider qualification requirements carefully to avoid a strong selection bias. Furthermore, using qualifications allows us to exclude MTurkers who participated in a previous study or treatment (if you decide to conduct different treatments successively). Save and turn to the "design layout" page. (6) On the "design layout" page (Fig. 5.11), you should enter important information about your study, as well as any instructions and requirements for participation, in a short form.
First, we edit the "Survey Link Instructions": Enter everything participants need to know before participation, including the expected reward (if it is not certain, enter the range), the estimated maximum participation time, and information about the survey code (e.g., where participants will find it and what they need to do with it). The survey code is a code that participants receive after finishing the experiment. They need to enter the code on the page of the survey task at MTurk; the Requester can then check the codes before approving the HITs. You need to put the code on the final page of your experiment. But note that MTurkers sometimes share the code with others. Therefore, you should carefully check the HITs for code sharing. To prevent code sharing, you could, for instance, change the code every now and then as long as the study is open for participation. Alternatively, you can add another trigger in Unipark that generates a unique survey code for each individual participant (a random number that you put between other characters using wildcards). This option is undoubtedly the most efficient one. The second important piece of information, displayed separately from the instructions, is the "Survey link": Here, we enter the link that forwards the participants to our experiment. For surveys and experiments created with Unipark, you find the link in the survey menu of your Unipark project. Moreover, make sure that you have set the survey status to "active" in Unipark so that the experiment is open for participation. Finally, MTurk requests a survey code in the field labeled "provide the survey code here": You should leave this field as it is. (7) Turn to the "Preview and Finish" page, check the preview, and confirm creating the project by clicking on the "Finish" button in the lower right corner of the display. This action automatically saves the current project and directs you to the list of your projects.
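The unique-code idea can be sketched outside Unipark as well. The following R snippet is an illustration only: the function name make_codes, the "XP-"/"-7Q" wrapper characters, and the number range are my own placeholders, not Unipark's trigger syntax. It pre-generates a pool of codes, each a random number put between other characters:

```r
# Illustrative sketch only: generate n unique survey codes, each a random
# number wrapped between fixed characters ("make_codes" and the "XP-"/"-7Q"
# wrappers are placeholders, not Unipark's trigger syntax).
make_codes <- function(n, prefix = "XP-", suffix = "-7Q") {
  stopifnot(n <= 900000)            # the range below holds 900,000 numbers
  nums <- sample(100000:999999, n)  # sampling without replacement -> unique
  paste0(prefix, nums, suffix)
}

set.seed(1)               # reproducible example
codes <- make_codes(200)  # one code per expected participant
length(unique(codes))     # 200: no duplicates that could be shared or reused
```

Because every code is unique, a survey code that appears twice in the MTurk results immediately reveals code sharing.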
(8) The list of your projects (Fig. 5.12) provides a menu for each project to publish a batch and to edit, copy, or delete the project. Click "Publish Batch" to make the experiment available for participation. You will see another preview. Recheck it and click on the "Next" button to get to the "Confirm and Publish Batch" page. (9) The "Confirm and Publish Batch" page summarizes the properties and reward information. You can enter a name for the batch and a description (for administration purposes only, not visible to MTurkers). You then need to choose the payment method you prefer (usually a credit card; MTurk accepts all major credit and debit cards). Note that you should only publish the batch when funding is guaranteed or you are willing to run the batch at your own cost. Click the "Purchase & Publish" button to confirm that the information displayed on this page is correct and to publish the batch. Your experiment is now "in the field," and MTurkers are able to see your batch and participate. (10) To find information about the current status of your batch and to review the HITs, choose "Manage" in the menu at the top of the screen. The new batch can be found in the list of "batches in progress." (11) Finally, we evaluate the completed batch. You can review the HITs by clicking on the "Review Results" link displayed at the top right corner of the batch in the "Manage Batches" menu. The "Review Results" page provides a list of MTurkers who entered a survey code into the corresponding field of the project (Fig. 5.13). The codes must match those from the experiment data set. Approve all valid HITs and carefully evaluate potentially invalid HITs. You should only reject HITs that do not match any data point in the data set (e.g., those with incorrect or reused survey codes).
Start a new batch to re-run the rejected HITs. (12) For the current example study, we need to transfer a bonus to participants (given that they earned one). We can transfer the bonus payments by opening the "Review Results" page. There we click on the "Filter Results" button. A pop-up window appears, and we activate the box for "Status Filter." Then we select "Approved" using the drop-down selection menu and confirm our choice. Now you see a list including all valid HITs. Sort them according to the Worker ID and click on the first one to open the Worker details. Then click the "Bonus Worker" button and enter the bonus payment you have calculated for that participant using the payoff functions displayed above. Confirm the bonus by clicking on the "Pay Bonus Now" button. Repeat the procedure for the remaining Workers included in the list of approved HITs.

Summary: (1) Online experiments have become a popular tool in the last decades. (2) Although the methodological requirements of online studies are identical to those of conventional experiments, there are some additional specific quality standards that need to be considered when conducting research online. Some of them are discussed in this section. (3) There are several ways to distribute online experiments to participants. One of the most popular platforms is Amazon MTurk. The section provides detailed instructions on how to use MTurk to conduct an online study.

5.6 Analysis of the Test Results

In this section, I demonstrate how to analyze the data using two different statistical analysis programs: JASP and RStudio (JASP Team, 2022; RStudio Team, 2020). Note that both programs use R (R Core Team, 2018) for doing the data analyses. R is a free software environment and programming language providing a long list of statistical analysis methods that can be used for detailed statistical analysis, data modeling, and illustration.
The data I use for the following step-by-step instructions can be found at https://www.osf.io/6spv5.

5.6.1 Analysis Using JASP

JASP is an easy-to-use open-source application. The program has a well-structured graphical user interface, and it offers a toolbox including various useful statistical analysis methods. The JASP project Web page (https://jasp-stats.org/) provides many learning materials, including manuals, books, papers, and video tutorials uploaded to the official JASP statistics YouTube channel (https://www.youtube.com/c/JASPStatistics).

5.6.1.1 Download and Installation

The latest versions of the program files are available for download on the JASP Web page. Turn to the download page and choose your operating system; JASP supports Windows, macOS, and Linux. Alternatively, there is a link to a JASP Web application (for usage in the browser), which I will not discuss further here. In the following, I describe the download and installation process and the use of the JASP version provided for MS Windows. You can find installation guides for the other versions on the JASP Web page. To install JASP on your Windows device, download the installation program (MSI file extension) from the JASP Web page, open the installer, read and accept the terms of the license agreement, proceed with the setup, and follow the instructions provided by the setup wizard.

5.6.1.2 JASP Graphical User Interface

JASP has a neat graphical user interface (Fig. 5.14). You can find a "≡" icon in the upper left corner to open the main menu. The statistical analysis tools are shown in the toolbar at the top of the window and can easily be selected by mouse click. The data set opened for analysis is shown below the toolbar.
5.6.1.3 Open Datasets (in CSV Format)

JASP supports data sets with the following file extensions: JASP, CSV, TXT, TSV, SAV (SPSS), ODS (OpenOffice), DTA (Stata), POR (SPSS portable), SAS7BDAT (SAS), SAS7BCAT (SAS), and XPT (SAS). The example data set that I use for the current instructions (you can download it from https://www.osf.io/6spv5) has the CSV file format; note that this data set is a fictional one. To open it in JASP, start the program, click on the "≡" symbol in the upper left corner, and choose "open." Browse to the directory where you have saved the file and open it. The program now shows the data frame organized in rows and columns. Next, we check whether the columns have the correct scale formatting. This is important because particular statistical tests require particular scaling formats; if the data has an incorrect scale format, JASP may prevent the data from being selected for particular tests. The columns are labeled with the column title and a symbol indicating the scale format. The "ruler" symbol indicates that a column includes values scaled on a ratio or interval level of measurement; the "bar chart" symbol is shown for ordinal data; and the "three circles" symbol indicates nominal data. The scaling can easily be changed by clicking on the scale symbol and selecting the correct one. For example, the subject number is an increasing number, usually from 1 to n. The program often automatically classifies it as interval-scaled data, which is not the correct scale. To change it to the nominal level of measurement, we click on the "ruler" symbol of the "SubID" column and select "nominal."

5.6.1.4 Prepare Data

Next, we need to prepare the data to test our hypothesis: "people break the rule more often in take-some than in give-some rule-regulated social dilemmas." As you may have noticed, none of the columns indicate whether or not the subjects have broken the rule.
Hence, we need to compute a new column indicating rule-breaking behavior. The column "remaining" can serve as an indicator of rule-breaking, since it includes the number of tokens remaining in each participant's personal account (the final account balance). Recall that participants with more than 1,000 tokens remaining in their personal accounts broke the rule. Therefore, we compute a new variable that is FALSE for participants who did not break the rule and TRUE for participants who broke the rule by performing the following steps: (1) Click on the "+" symbol at the right end of the column label bar, enter a column name (e.g., rule-breaking), choose "define column through drag and drop formulas," select "nominal," and click the "create column" button. (2) You can now find a text field at the top of the window. There is a list of the variables (to the left of the text field), a list of mathematical operators (above the center of the text field), and a list of advanced miscellaneous mathematical operators (to the right of the text field). To compute the new variable, select the "remaining" variable in the variables list; it should now appear in the text field. Then choose the ">" operator, click on the "…" placeholder that appears after it, and enter "1000." The text field now shows the string "remaining > 1000." Click on "compute column" below the text field to complete the variable recoding (Fig. 5.15). The cells of the new column with the label "rule-breaking" now show "TRUE" for participants who broke the rule and "FALSE" for participants who did not break the rule. Finally, I describe how to change particular labels for the values of the variables. I use the "GiveOrTake" variable as an example: click on the variable name shown in the data frame.
Then enter "give-some game" for value "1" and "take-some game" for value "2" in the corresponding text fields.

5.6.1.5 Descriptive Statistics

To calculate descriptive statistics, click on the button labeled "Descriptives"; you find it next to the main menu symbol in the upper toolbar. First, we retrieve some participant information. In particular, we want to know the sample composition regarding the age and sex of our subjects. We simply select the age variable shown in the left text field of the "Descriptive Statistics" window and move it via drag and drop to the text field on the right labeled "Variables." We repeat that procedure with the sex variable. You can find several options below the two text fields. First, we activate the "frequency tables" box to get information about the sex variable, which contains nominally scaled data. Moreover, there is an option to create plots from the data of the selected variables, and you can add or remove several descriptives under the option "Statistics." Let us choose some descriptive statistics of interest: valid and missing, median, mean, minimum, maximum, and standard deviation. The window on the right side of your screen should now display the tables showing the selected descriptive statistics for the variables "age" and "sex" (Table 5.3) and the frequency table for the variable "sex" (Table 5.4):

Table 5.3 Descriptive statistics for the variables "age" and "sex"

                    Age      Sex
Valid               200      200
Missing               0        0
Median           40.500
Mean             41.945
Std. deviation   14.642
Minimum          18.000
Maximum          70.000

Table 5.4 Frequency table for the variable "sex"

Sex       Frequency   Percent   Valid percent   Cumulative percent
Female           93    46.500          46.500               46.500
Male            107    53.500          53.500              100.000
Missing           0     0.000
Total           200   100.000

We can easily create plots to illustrate our data. For instance, distribution plots can be created by unfolding the options for "Plots" and enabling the distribution plots box (Fig. 5.16). JASP prints the plots in the output window on the right.

5.6.1.6 Conducting the Hypothesis Test (χ² Test)

To test the hypothesis of the study that serves as an example here, we use a χ² test (see Chap. 2). We test whether the frequency of participants breaking the rule is higher in the one group (i.e., take-some game) than in the other group (i.e., give-some game). To conduct a χ² test, click on the "Frequencies" icon in the main toolbar at the top of the JASP window and choose "Contingency Tables" in the classical sub-menu. We first select the grouping variable by moving the "GiveOrTake" variable via drag and drop into the "Rows" text field. Then we put the dependent variable, i.e., "rule-breaking," into the "Columns" field. We leave the other fields ("Counts" and "Layers") empty. Several test options for contingency tables can be found below the variable fields (Fig. 5.17). I describe some of them in further detail. (1) Statistics: We can choose four different tests: the χ² test; the continuity-corrected χ² test (which uses Yates's correction for continuity; Yates, 1934); the likelihood ratio test (which calculates the ratio between the maximum likelihoods of the null hypothesis and the alternative hypothesis; note that it should only be used for expected cell counts > 5); and, finally, we can add the Vovk-Sellke maximum p-ratio to each test (Sellke et al., 2001; Vovk, 1993). (2) Odds ratio: For our 2 × 2 contingency table, we can include the odds ratios in the analysis. Furthermore, for Fisher's exact test, we can choose whether to conduct a one-tailed or a two-tailed test. The three options are well illustrated in JASP. According to our hypothesis, we expect rule-breaking to be higher in the take-some game than in the give-some game.
That is, we choose the one-tailed test, i.e., the option "group 1 > group 2." (3) Nominal and ordinal: JASP offers different correlation coefficients that users can include in their analysis according to their needs. (4) Additional options affecting the appearance of the table can be set under the sub-menus "Cells" and "Options." To demonstrate some of the features, we select the following analysis options: "χ² test," "odds ratio," "confidence interval 95.0%," "group 1 > group 2" for the alternative hypothesis (Fisher's exact test), "descending row order" (so that "take-some game" appears in the first row and "give-some game" in the second row), and "descending column order" (so that TRUE is the first and FALSE the second column), and we add row percentages to the table. I use this particular order of columns and rows to make the contingency table easy to replicate; alternatively, the sorting features can be selected in any combination. JASP prints the results immediately in the window on the right side of the screen (Fig. 5.18). They show the contingency tables, including the frequencies of rule-breaking (TRUE) for each game. We see that 50% of the participants broke the rule in the take-some game and 32% in the give-some game. The next table shows the results of the χ² test, including the χ² value of 6.697, the degrees of freedom (df = 1), and the p-value (p = 0.010). We can simply calculate the effect size using the formula w = √(χ²/N) = √(6.697/200) = 0.183. The third table shows the odds ratio (2.125), which can be seen as another indicator of the effect size, and Fisher's exact test (2.117), including the 95% confidence intervals and the p-value for Fisher's exact test (p = 0.007).

5.6.2 Analysis Using RStudio

RStudio is basically an editor and a user-friendly interface for the R programming language.
In contrast to JASP, there is no graphical user interface (GUI) that allows the user to choose statistical tests by selecting them from a menu. Instead, it is necessary to enter and run R programming code. I show the code for the same statistical analysis described in the previous section on JASP.

5.6.2.1 Download and Installation

RStudio requires a preinstalled version of R. That is, before installing RStudio, we need to download and install a version of R (R 3.3.0 or newer). The latest version of R can be found on the Web page of the R project (https://www.r-project.org/). Visit the Web page and click on "CRAN" under the headword "downloads" in the link list on the left side of the page. Choose a mirror (you can use any link; a mirror located near you might be slightly faster), select a version of R compatible with the operating system you use (versions are available for Linux, macOS, and Windows), click on the "install R for the first time" hyperlink, and proceed with the download on the following page. Install R by executing the downloaded file: follow the installation instructions, choose an installation directory, do not change the start options, and set the start menu folder options as well as the additional tasks according to your preferences. After you have successfully installed R, turn to the RStudio Web page (https://posit.co/) and choose "Download RStudio." Then scroll down and click on "Download RStudio Desktop." There are different versions of RStudio; for our research purposes, we select the free version. Finally, we choose the download files compatible with our operating system. Install RStudio by executing the downloaded file. Follow the setup wizard's instructions, and choose a target directory and a folder for your start menu.
Open the application and create a new R script by selecting "File," "New File," and then "New R Script" in the main menu bar at the top of the window. Save the script. We use this new script for the complete sample analysis using R. The script, including the R code and commands used for the current example analysis, is also publicly available in the Open Science Framework (OSF) at https://www.osf.io/6spv5. RStudio now shows four different windows: the script editor in the upper left; the global environment window showing variables, data frames, matrices, and more in the upper right; a window displaying the console, terminal, and jobs in the lower left; and a window showing files, plots, packages, help, and viewer in the lower right.

5.6.2.2 Open Datasets (in CSV Format)

To load a CSV data set, we use the following elements: (1) We create a new empty variable called "sampledata." (2) Into sampledata, we put the data saved in the CSV file by loading it with the R command read.csv. The read.csv command creates a data frame from a file formatted as a table. Enter ?read.csv into the console in RStudio to get a detailed list of instructions, including usage, arguments, and examples. For the current example, we want to load a data file with the name sampledata.csv located in a folder with the name datasets on our C drive. The file is a semicolon-separated CSV file labeled with column names but no row names. To load that particular data file into the R environment, we need the read.csv arguments file, i.e., the directory and name of the sampledata.csv file; row.names, i.e., we use row.names = NULL to force row numbering; and sep, i.e., the character that separates the columns.
We enter the following R code into the R script (note that we put a # at the first position of a line to add comments to the R code; R ignores text written after a # when running the code):

## load file to data frame
sampledata <- read.csv("C:/datasets/sampledata.csv", row.names = NULL, sep = ";")

To run the code in RStudio, click on the "Run" symbol located at the upper right of the code editor window. RStudio always runs the code that is either selected or written in the line at which the cursor is currently positioned. After executing the code above, the new data set "sampledata" appears in the "Environment" window on the upper right side of the RStudio user interface. It can be previewed by clicking on it or by running the following code: View(sampledata).

5.6.2.3 Prepare Data

Next, we need to prepare the data for testing our hypothesis "people break the rule more often in take-some than in give-some rule-regulated social dilemmas." As described in the previous section on the analysis using JASP, we need to compute a new column indicating rule-breaking behavior. The column "remaining" can serve as an indicator of rule-breaking, since it includes the final personal account balance of each participant. Recall that participants who have more than 1,000 tokens remaining in their personal accounts broke the rule. Therefore, we compute a new variable that is FALSE for participants who did not break the rule and TRUE for participants who broke the rule by performing the following steps: (1) We create a new column in our data set that we name, e.g., rule-breaking; (2) then we calculate the values of the new column.
For these two steps, we enter the following R code:

# create a new column
sampledata$`rule-breaking` <- NA
# assign values to the column
sampledata$`rule-breaking` <- FALSE
sampledata$`rule-breaking`[sampledata$remaining > 1000] <- TRUE

Finally, I describe how to add labels for the values of the "GiveOrTake" variable: (1) We transform the variable into a factor variable, i.e., a variable containing nominally or ordinally scaled factorial data. (2) We add labels to the factor levels. These two steps can be done using the R command factor.

## changing/adding labels to factor variable
sampledata$GiveOrTake <- factor(sampledata$GiveOrTake,
                                labels = c("give-some game", "take-some game"))

5.6.2.4 Descriptive Statistics

For the descriptive statistics, we use the package Hmisc (Harrell Jr, 2021), which includes a variety of functions for data illustration, analysis, and more. One of them is the command describe, which summarizes basic statistics of a particular variable. To install and load the Hmisc package, and to get the same descriptive statistics shown in Tables 5.3 and 5.4, we enter the following code:

## descriptives
# install package
install.packages("Hmisc")
# load/attach package
library(Hmisc)
# descriptives of age and sex
describe(sampledata$age)
describe(sampledata$sex)

5.6.2.5 Conducting the Hypothesis Test (χ² Test)

We use a χ² test to test whether the frequency of participants breaking the rule is higher in the one group (i.e., take-some game) than in the other group (i.e., give-some game). As in the analysis using JASP (see the sections above), we calculate a 2 × 2 contingency table showing the frequencies and the proportions by rows and put the table into a variable called contab.
We then perform the χ² test (without applying Yates’ continuity correction) and Fisher’s exact test of the odds ratio (with the alternative hypothesis stating that the rule-breaking frequency is greater in group one than in group two). To execute these steps, enter the following R commands into the script and run the code:

##2 × 2 contingency tables and tests
# contingency table (frequencies)
contab <- table(sampledata$GiveOrTake, sampledata$'rule-breaking')
#print table to console
print(contab)
#Table proportions by rows
prop.table(contab, margin = 1)
#χ² test
chisq.test(contab, correct = FALSE)
#Fisher's test and odds ratio
fisher.test(contab, alternative = "greater")

Not surprisingly, the results are identical to those revealed by the analysis using JASP. RStudio prints the results in the console window immediately after running the code. First, two contingency tables are shown. One includes the frequencies of rule-breaking (the number of cases for which the “rule-breaking” variable is TRUE), and the other table shows the proportions of rule-breaking for each game. The results indicate that 50 participants (50%) broke the rule in the take-some game, and 32 participants (32%) broke the rule in the give-some game. The next lines of the output show the results of Pearson’s χ² test, including the χ² value of 6.697, the degrees of freedom (df = 1), and the p-value (p = 0.009658; rounded to p = 0.01 in the output provided by JASP; for details, see Sect. 5.6.1). And finally, the output shows the odds ratio of Fisher’s exact test (2.116846; rounded to 2.117 in the output provided by JASP; for details, see Sect. 5.6.1), including the 95% confidence interval and the p-value (p = 0.007145).
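To make the test statistic concrete, the Pearson χ² value reported above can be recomputed by hand from the counts (32 of 100 rule-breakers in the give-some game, 50 of 100 in the take-some game). The chapter’s analysis uses R; purely as an illustration of the underlying arithmetic (variable names are our own), here is a minimal Python sketch of the computation without the continuity correction:

```python
# Observed counts: rows = [give-some, take-some], columns = [broke rule, did not]
observed = [[32, 68], [50, 50]]

n = sum(sum(row) for row in observed)              # 200 participants in total
row_totals = [sum(row) for row in observed]        # 100 per game
col_totals = [sum(col) for col in zip(*observed)]  # 82 broke the rule, 118 did not

# Pearson's chi-squared: sum over cells of (observed - expected)^2 / expected,
# where expected = row_total * column_total / n under independence.
chi2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)
print(round(chi2, 3))  # 6.697, the value that chisq.test(contab, correct = FALSE) reports
```

With one degree of freedom, this statistic corresponds to the p-value of about 0.0097 reported above.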
5.7 The Answer to the Research Question

We conducted an online experiment to find an answer to the research question stated at the beginning of the current chapter, i.e., does the number of people behaving “morally bad” in terms of breaking the rule in rule-regulated social dilemmas depend on whether the dilemma is a give-some or a take-some dilemma? To empirically investigate the research question, we derived a testable hypothesis from previous research showing that people tend to behave more cooperatively, i.e., to break the rule less often, in give-some than in take-some social dilemma situations. Furthermore, we designed an empirical experiment to test it. Participants in the experiment were either assigned to a give-some game, where a rule regulated the number of tokens they had to give to a common good, or to a take-some game, where a rule regulated the number of tokens they were allowed to take from a common good. In particular, we tested the hypothesis that people break a rule more often in a take-some than in a give-some rule-regulated social dilemma by analyzing the empirical data collected in the experiment. The analysis revealed that participants break the rule more often in the take-some game than in the give-some game. That is, the results support our hypothesis. To answer the research question: we found that “morally bad” behavior in terms of rule-breaking in social dilemma situations significantly depends on whether the social dilemma is a give-some dilemma or a take-some dilemma. The (fictional) findings of the current example study suggest that people tend to break a rule more often in take-some dilemmas than in give-some dilemmas.

5.8 Final Remarks

This chapter describes several steps of the process of finding an answer to a research question about a philosophical problem using an interactive and incentivized experiment.
The steps described in this guideline are just examples of ways to empirically investigate research problems. That is, there are many other paradigms, tools, applications, and analysis programs that can be used for that purpose. The step-by-step instructions in this chapter are only intended to provide a first insight into how to conduct interactive and incentivized experiments. The tutorial on programming interactive experiments using Unipark is a comparatively easy way to implement them on an online platform, and the instructions for the statistical analysis with two different free programs are an easy-to-try-at-home recipe for conducting simple statistical tests and analyses.

References

Andreoni, J. (1995). Warm-glow versus cold-prickle. The effects of positive and negative framing on cooperation in experiments. The Quarterly Journal of Economics, 110(1), 1–21.
Arbeitskreis Deutscher Markt- und Sozialforschungsinstitute e.V. (2018). Standards zur Qualitätssicherung für Online-Befragungen.
Balliet, D., Parks, C., & Joireman, J. (2009). Social value orientation and cooperation in social dilemmas. A meta-analysis. Group Processes & Intergroup Relations, 12(4), 533–547.
Barcelo, H., & Capraro, V. (2015). Group size effect on cooperation in one-shot social dilemmas. Scientific Reports, 5(1), 7937.
Bauer, A. M., Meyer, F., Romann, J., Siebel, M., & Traub, S. (2022). Need, equity, and accountability. Evidence on third-party distributive decisions from a vignette study. Social Choice and Welfare, 59(4), 769–814.
Bonacich, P. (1972). Norms and cohesion as adaptive responses to potential conflict. An experimental study. Sociometry, 35(4), 357–375.
Brewer, M. B., & Kramer, R. M. (1986). Choice behavior in social dilemmas. Effects of social identity, group size, and decision framing. Journal of Personality and Social Psychology, 50(3), 543–549.
Capraro, V., & Barcelo, H. (2015). Group size effect on cooperation in one-shot social dilemmas II. Curvilinear effect.
PLoS ONE, 10(7), e0131419.
Castillo, D., Bousquet, F., Janssen, M. A., Worrapimphong, K., & Cardenas, J. C. (2011). Context matters to explain field experiments. Results from Colombian and Thai fishing villages. Ecological Economics, 70(9), 1609–1620.
Champely, S., Ekstrom, C., Dalgaard, P., Gill, J., Weibelzahl, S., Anandkumar, A., Ford, C., Volcic, R., & Rosario, H. (2020). pwr. Basic functions for power analysis. R package version 1.3-0.
Chen, D. L., Schonger, M., & Wickens, C. (2016). oTree. An open-source platform for laboratory, online, and field experiments. Journal of Behavioral and Experimental Finance, 9, 88–97.
Correll, J., Mellinger, C., McClelland, G. H., & Judd, C. M. (2020). Avoid Cohen’s “small”, “medium”, and “large” for power analysis. Trends in Cognitive Sciences, 24(3), 200–207.
Cubitt, R. P., Drouvelis, M., & Gächter, S. (2011a). Framing and free riding. Emotional responses and punishment in social dilemma games. Experimental Economics, 14(2), 254–272.
Cubitt, R. P., Drouvelis, M., Gächter, S., & Kabalin, R. (2011b). Moral judgments in social dilemmas. How bad is free riding? Journal of Public Economics, 95(3), 253–264.
Dawes, R. M. (1980). Social dilemmas. Annual Review of Psychology, 31(1), 169–193.
Dawes, R. M., Delay, J., & Chaplin, W. (1974). The decision to pollute. Environment and Planning A, 6(1), 3–10.
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1. Tests for correlation and regression analyses. Behavior Research Methods, 41(4), 1149–1160.
Gneezy, U., & Rustichini, A. (2000). Pay enough or don’t pay at all. The Quarterly Journal of Economics, 115(3), 791–810.
Haidt, J. (2001). The emotional dog and its rational tail. A social intuitionist approach to moral judgment. Psychological Review, 108(4), 814–834.
Hardin, G. (1968). The tragedy of the commons. Science, 162(3859), 1243–1248.
Harrell Jr, F. E. (2021). Hmisc. Harrell miscellaneous. R package version 4.7.0.
Hauser, O. P., Kraft-Todd, G. T., Rand, D. G., Nowak, M. A., & Norton, M. I. (2019). Invisible inequality leads to punishing the poor and rewarding the rich. Behavioural Public Policy, 5(3), 1–21.
Isaac, R. M., Walker, J. M., & Williams, A. W. (1994). Group size and the voluntary provision of public goods. Experimental evidence utilizing large groups. Journal of Public Economics, 54(1), 1–36.
Janssen, M. A., Bousquet, F., Cardenas, J.-C., Castillo, D., & Worrapimphong, K. (2012). Field experiments on irrigation dilemmas. Agricultural Systems, 109, 65–75.
Janssen, M. A., Bousquet, F., Cardenas, J.-C., Castillo, D., & Worrapimphong, K. (2013). Breaking the elected rules in a field experiment on forestry resources. Ecological Economics, 90, 132–139.
JASP Team (2024). JASP (Version 0.18.3). https://jasp-stats.org/.
Kees, J., Berry, C., Burton, S., & Sheehan, K. (2017). An analysis of data quality. Professional panels, student subject pools, and Amazon’s Mechanical Turk. Journal of Advertising, 46(1), 141–155.
Kollock, P. (1998). Social dilemmas. The anatomy of cooperation. Annual Review of Sociology, 24(1), 183–214.
Kuhn, S. (2019). Prisoner’s dilemma. In Zalta, E. N. (Ed.), The Stanford encyclopedia of philosophy. https://plato.stanford.edu/entries/prisoner-dilemma/.
Liebrand, W. B. (1983). A classification of social dilemma games. Simulation & Games, 14(2), 123–138.
Memon, A., & Higham, P. A. (1999). A review of the cognitive interview. Psychology, Crime and Law, 5(1–2), 177–196.
Nockur, L., Pfattheicher, S., & Keller, J. (2021). Different punishment systems in a public goods game with asymmetric endowments. Journal of Experimental Social Psychology, 93, 104096.
Plakias, A. (2015). Experimental philosophy. In Oxford Handbooks Online. https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199935314.001.0001/oxfordhb-9780199935314-e-17.
Pokorny, K. (2008). Pay—but do not pay too much. An experimental study on the impact of incentives.
Journal of Economic Behavior & Organization, 66(2), 251–264.
Prüfer, P., & Rexroth, M. (2005). Kognitive Interviews. Zentrum für Umfragen, Methoden und Analysen, 26. https://nbn-resolving.org/urn:nbn:de:0168-ssoar-201470.
R Core Team (2018). R. A language and environment for statistical computing. R Foundation for Statistical Computing.
Roulin, N. (2015). Don’t throw the baby out with the bathwater. Comparing data quality of crowdsourcing, online panels, and student samples. Industrial and Organizational Psychology, 8(2), 190–196.
RStudio Team (2020). RStudio. Integrated development environment for R. RStudio.
Sellke, T., Bayarri, M., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1), 62–71.
Sun, Q., Guo, H., Wang, J., Zhang, J., Jiang, C., & Liu, Y. (2021). Differences in cooperation between social dilemmas of gain and loss. Judgment and Decision Making, 16(6), 1506–1524.
van den Berg, P., Dewitte, P., Aertgeerts, I., & Wenseleers, T. (2020). How the incentive to contribute affects contributions in the one-shot public goods game. Scientific Reports, 10(1), 18732.
van Dijk, E., & Grodzka, M. (1992). The influence of endowments asymmetry and information level on the contribution to a public step good. Journal of Economic Psychology, 13(2), 329–342.
van Dijk, E., & Wilke, H. (1995). Coordination rules in asymmetric social dilemmas. A comparison between public good dilemmas and resource dilemmas. Journal of Experimental Social Psychology, 31(1), 1–27.
van Dijk, E., & Wilke, H. (2000). Decision-induced focusing in social dilemmas. Give-some, keep-some, take-some, and leave-some dilemmas. Journal of Personality and Social Psychology, 78(1), 92–104.
van Lange, P. A., Joireman, J., Parks, C. D., & van Dijk, E. (2013). The psychology of social dilemmas. A review. Organizational Behavior and Human Decision Processes, 120(2), 125–141.
Vovk, V. G. (1993).
A logic of probability, with application to the foundations of statistics. Journal of the Royal Statistical Society: Series B (Methodological), 55(2), 317–341.
Wade-Benzoni, K. A., Tenbrunsel, A. E., & Bazerman, M. H. (1996). Egocentric interpretations of fairness in asymmetric, environmental social dilemmas. Explaining harvesting behavior and the role of communication. Organizational Behavior and Human Decision Processes, 67(2), 111–126.
Weber, J. M., Kopelman, S., & Messick, D. M. (2004). A conceptual review of decision making in social dilemmas. Applying a logic of appropriateness. Personality and Social Psychology Review, 8(3), 281–307.
Weiß, A. R., Bauer, A. M., & Traub, S. (2017). Needs as reference points. When marginal gains to the poor do not matter. FOR 2104 Working Paper 2017-13, Helmut-Schmidt-Universität, Universität der Bundeswehr Hamburg.
Wyszynski, M. (2020). Der Einfluss von Framing, Medienpriming, Vorurteilen und Debiasing auf die Bewertung eines Betrugs an der Solidargemeinschaft. PhD Thesis, Jacobs University Bremen.
Wyszynski, M., & Bauer, A. M. (2023). Give what is required and take only what you need! The effect of framing on rule-breaking in social dilemmas. Judgment and Decision Making, 18, e17.
Yates, F. (1934). Contingency tables involving small numbers and the χ² test. Supplement to the Journal of the Royal Statistical Society, 1(2), 217–235.
Zelmer, J. (2003). Linear public goods experiments. A meta-analysis. Experimental Economics, 6(3), 299–310.

Chapter 6 Corpus Analysis: Building and Using Corpora—A Case Study on the Use of “Conspiracy Theory”

Kevin Reuter and Lucien Baumgartner

Abstract Corpus analysis allows researchers to inform, illuminate, and investigate many problems. This chapter provides easy access to some of the central tools commonly used in corpus linguistics.
After a short exploration of pre-built corpora and a brief literature review surveying corpus-analytic studies in philosophy, we illustrate these tools by running several corpus analyses on the term “conspiracy theory.” These analyses show that “conspiracy theory” is a strongly evaluative term. The reader of this chapter can follow each of the steps of the corpus analyses using the online material that is freely available.

Keywords Corpus analysis · Conspiracy theory · Annotation · Pre-built corpora · R · API · Hypotheses · Statistics

6.1 Introduction

Corpus analysis allows philosophers to tackle philosophical problems in a way that was near impossible 15 years ago. Until recently, we simply did not have the computing power and algorithms to effectively analyze large amounts of text in a reasonable amount of time. Nowadays, using corpus-analytic methods yourself (at least some of them) is easy and just a few clicks away, as we demonstrate in this chapter. Of course, corpus analysis does not replace the creativity and hard philosophical work that goes into developing research hypotheses and interpreting the data accurately. In order to make this chapter as accessible and intuitive as possible, we start in Sect. 6.2 with some examples and small exercises that we encourage newcomers to corpus analysis to do while going through these pages. Consequently, Sect. 6.2 of this chapter is aimed at beginners with little or no experience in analyzing words and phrases in corpora.

K. Reuter (✉) · L. Baumgartner
Department of Philosophy, University of Zurich, Zurich, Switzerland
e-mail: [email protected]; [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
S. Kornmesser et al., Experimental Philosophy for Beginners, Springer Graduate Texts in Philosophy, https://doi.org/10.1007/978-3-031-58049-9_6
However, we hope that even those more familiar with corpus analysis will find some aspects helpful too. In Sect. 6.3, we take a step back from the data, provide a brief introduction to corpus analysis, and present some studies to illustrate what philosophers have used corpus-linguistic methods for. We pay particular attention to how a specific philosophical research question was transformed into a hypothesis suitable to be investigated with corpus analysis, because we believe that developing suitable corpus hypotheses might be the most difficult part when aiming to do corpus analysis. In Sects. 6.4 and 6.5, we do a corpus analysis of the composite term “conspiracy theory.” Thus, readers who are mostly interested in how the term “conspiracy theory” is used might want to jump straight ahead to Sect. 6.4. We start with some analyses using pre-built corpora, the results of which indicate that:

• “Conspiracy theory” is a strongly evaluative term.
• The use of “conspiracy theory” has undergone substantial changes during the last 15 years.
• Conspiracy theories are promoted and spread like false theories, and not discussed and tested like scientific theories.

In Sect. 6.5, we show how to build a corpus from scratch, collect texts, compile and annotate the corpus, and run some analyses on the self-built corpus. While our analysis on pre-built corpora (Sects. 6.2 and 6.4) uses freely accessible online corpora,1 we provide links to all material necessary for researchers to follow each of the steps in Sect. 6.5.

6.2 Exploration, Examples, and Exercises

The main purpose of this section is to make beginners to corpus analysis familiar with a few basic tools for searching pre-built corpora via a Web-based interface.2 Perhaps the most frequently used corpus is the Corpus of Contemporary American English, known as COCA (Davies, 2008). To access COCA, please go to https://www.english-corpora.org/coca/. Figure 6.1 depicts the starting page.
You can now enter any word or phrase into the text field on the left. When you enter a word and click on “Find matching strings,” you are likely to be asked to register first. If you haven’t registered yet, please do so. It will only take you a minute or two.

1 https://www.english-corpora.org.
2 As this section is meant to be purely exploratory, we refrain from a scientific approach. Thus, we do not state possible hypotheses that may either drive a corpus analysis or that can be inferred from the results.

Fig. 6.1 The starting page of COCA. The left-hand side allows users to put in terms or phrases and select various search functions

Fig. 6.2 List of hits for “thinking.” The right-hand side displays the context in which the term is used. On the left-hand side, meta-information about each use of the term is provided, such as year and source of the text

Let’s say we are interested in the term “thinking” (something not too unfamiliar to philosophers). Type in <thinking> (without the brackets) and hit enter or “Find matching strings.” The next page displays the frequency for the term of interest. There are 189,904 hits for “thinking.” In this case, we are not primarily interested in the frequency of that term, but rather in how it is used. If you now click on “Context” at the top, COCA provides you with the context of all 189,904 instances (see Fig. 6.2). If you would like to know the wider context of one of the uses, you can click on the source, e.g., “forums.blurb.com” for the second hit, to get the expanded context. The context option is very useful for exploratory purposes to figure out the various uses and syntactical structures with which terms or phrases are used. Fig. 6.3 List of nouns that appear most frequently after “thinking about.
” For example, COCA contains 220 hits for “thinking about things”

In most cases, researchers investigate more specific questions. For example, we might be interested in finding out what people say they most commonly think about. Go back to the search function, and instead of entering merely <thinking>, we now type in <thinking about *>. The asterisk is a placeholder (also known as a “wildcard”) for any term. The results are rather disappointing, because the most common terms after “thinking about” are terms such as “the,” “it,” “what,” and “how.” Thus, we need to be more specific about our search. Luckily, many corpora not only contain large amounts of text; the text also contains information about the part of speech each term belongs to, like noun, adjective, adverb, etc. The process of assigning part-of-speech information is also known as PoS-tagging. To specify that we are only interested in nouns, we now simply enter <thinking about NOUN>. The outcome of this search is much more interesting (Fig. 6.3). People write that they or others think a lot about sex, food, life, suicide, work, music, etc.

Box 6.1: Exercise 1
• How many hits are there for the phrase “experimental philosophy”?3

3 The answers to the questions from the exercises can be found in the Appendix (Sect. 6.6).

Fig. 6.4 Development of the frequency of the use of the term “normative” from the 1820s till now

Box 6.2: Exercise 2
• What things do people seem to be “talking about” the most?
• What are the most frequent adjectives appearing before “thinking”? (hint: try ADJ)

Next, we look at three further functions that come in very handy for many purposes. We start with the “Chart” function, which is situated right next to the “List” function. Let’s explore some more technical terminology of philosophers. For instance, researchers often state that they are interested in the normative aspects of x.
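As an aside, the wildcard search described above can be mimicked on any plain text with a simple regular expression. Purely as an illustration, a minimal Python sketch (the toy sentences are invented, not drawn from COCA) that collects whatever word follows “thinking about”:

```python
import re
from collections import Counter

# A toy "corpus" of invented sentences, standing in for COCA's one billion words.
corpus = """I keep thinking about food these days. She was thinking about life.
He was thinking about the weekend, and we were all thinking about food."""

# <thinking about *> : the wildcard matches any single following word.
hits = re.findall(r"\bthinking about (\w+)", corpus.lower())
print(Counter(hits).most_common())  # [('food', 2), ('life', 1), ('the', 1)]
```

Note that restricting the wildcard to nouns, as with <thinking about NOUN>, is only possible because COCA’s texts are PoS-tagged; a plain regular expression has no access to that information, which is why the untagged search also returns words like “the.”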
You might wonder, though, whether the term “normative” is used outside of academia, and whether it was commonly used in the past. To answer these questions, click on “Chart,” type <normative> into the text field, and hit enter. You can now see in which sections or genres (blog, Web, TV, spoken, fiction, magazine, newspaper, and academic) the term is frequently used, and its development over the last few decades. In this case, the term seems to occur only rarely outside of academia, and it gained in popularity during the last two decades. The Corpus of Historical American English (Davies, 2010), also known as COHA, gives you an insight into the use of terms over the last two centuries. Just go to https://www.english-corpora.org/coha/, select “Chart,” and enter <normative>. The results displayed (see also Fig. 6.4) confirm the trend seen on COCA. Before the 1980s, the term was hardly ever used.4

4 A further great tool for investigating the development of terms over time is Google’s Ngram Viewer: https://books.google.com/ngrams. Just search for “descriptive” and “normative,” and observe how the use of “normative” overtook “descriptive” around the turn of the millennium.

Box 6.3: Exercise 3
• How has the term “robot” developed over the last 20 years?
• How have the terms “conservative” and “liberal” developed over the last 200 years?

Moving back to COCA, another great tool is the “Collocates” function. In order to see the “Collocates” button, click on the “+” sign next to “Browse.” Collocates are words that occur more frequently with a target term than would be expected by chance. Just put in the term “vague.” Then click on “Find Collocates.” You can now see (see also Fig. 6.5) the most frequently occurring terms in the vicinity of the target term “vague,” sorted by different parts of speech.
If you are interested in narrowing down or expanding the window in which your collocates search is done, the numbers underneath the entry field allow you to do so. The default is four words to the left and four words to the right of the target term. There are two numbers to the left of each term. The first tells you the frequency with which the term appears within the specified span of the target term. The second number is the mutual information (MI) score. Some terms are, of course, much more frequent than others, so it is no surprise that they occur more frequently as collocates. The mutual information score takes this into account and thus indicates the relative collocation of the two terms. For instance, although the adjective “broad” occurs more often together with “vague,” “ambiguous” is more strongly tied to the term given its lower overall use and thus has a higher MI score (6.92).

Fig. 6.5 Lists of the most frequent collocates of the word “vague,” separated into different parts of speech. For example, the left-most column lists the most frequent nouns that occur together with the term “vague”

Fig. 6.6 Comparison of the term “reasonable” with “rational.” If a term appears high up on the list, it means that it frequently occurs with one but not the other of the two terms

Box 6.4: Exercise 4
• Which five adjectives occur most often together with “happiness”?

We end this exploratory section on COCA by examining the “Compare” function (like the “Collocates” function, you need to click on the “+” sign to make it available). “Compare” allows you to compare the collocates of two expressions, thereby investigating both differences and similarities between them.
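The intuition behind the MI score can be made concrete with a small computation. The sketch below takes the MI score to be (roughly) the base-2 logarithm of observed co-occurrence over the co-occurrence expected by chance within the search window; this is an approximation of how the english-corpora.org interface computes it, and all frequencies below are invented for illustration, not taken from COCA:

```python
import math

def mi_score(freq_node, freq_collocate, freq_together, corpus_size, span=8):
    # Chance co-occurrence: how often the two words would land in the same
    # window (default: 4 words left + 4 right) if they were independent.
    expected = freq_node * freq_collocate * span / corpus_size
    return math.log2(freq_together / expected)

# Invented frequencies: "broad" is far more common overall than "ambiguous",
# so even with more raw co-occurrences next to "vague" its MI score is lower.
print(round(mi_score(20_000, 50_000, 120, 1_000_000_000), 2))   # "ambiguous": 3.91
print(round(mi_score(20_000, 500_000, 300, 1_000_000_000), 2))  # "broad": 1.91
```

The higher score for the rarer collocate mirrors the “ambiguous” vs. “broad” pattern described above.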
For example, enter the terms “rational” and “reasonable” into the placeholders and click on “Compare words.” Although the two terms can be used interchangeably in some contexts, the results reveal important differences between them, even in everyday talk (see Fig. 6.6). The standard search by ratio lists the collocates with the greatest ratio, i.e., words that are frequently used with one of the terms, but hardly with the other. As Fig. 6.6 demonstrates, “decision-making,” “self-interest,” and “intuition” are commonly used with “rational” but not “reasonable.” In contrast, people talk about reasonable fees and reasonable doubts, but not rational fees and rational doubts.

Box 6.5: Exercise 5
• Compare the term “blaming” with “praising.” Which differences are specifically noteworthy?
• Which nouns have the greatest ratio when comparing what people are “afraid of” and what they are “frightened of”?

6.3 Corpus Analysis and Philosophy: Why, Who, and How?

6.3.1 Why Do Corpus Analysis?

In many fields of the Arts and Humanities, as well as the Social Sciences, researchers are active in using corpus-linguistic tools. Not so much in philosophy, so far. Judging by the PhilPapers Corpus Analysis page,5 around 60 corpus-analytic papers have been published by philosophers (as we write in Summer 2022). However, most of them have been published during the last 5 years, so the corpus train is gaining steam. It seems that corpus methods are particularly well suited to philosophers. Most philosophers are interested in identifying the meaning and structure of concepts, either as the primary target of investigation or at least as a starting point for theory construction. By doing corpus analysis, they can gain valuable insights into how terms that express these concepts are used.
From data about a term’s usage, further inferences about the content and structure of concepts can be made.6 And so, we have something of a conundrum: on the one hand, corpora are (i) highly accessible, (ii) big, (iii) (relatively) unbiased, and (iv) suited for philosophical investigation. On the other hand, corpus data is hardly used to investigate philosophical research questions. Compare this with standard experimental philosophy: experimental data needs to be generated, is often limited, and often biased, but still, experimental data is frequently used to investigate philosophical research questions. Let us say a bit more about the positive characteristics just mentioned. First, corpora are highly accessible. The previous section was intended to give you a glimpse of what you can do with pre-built corpora that are freely available for anybody to use on the Internet. We hope you agree that it could not be much easier to find out how frequent terms are, how their use developed over the decades and centuries, in which contexts they occur, which terms they co-occur with, etc. We did an exploratory tour of COCA7 in the last section using its Web interface. If you go to the parent link8 of COCA, you see that many more corpora are freely available through that site. The NOW9 and iWeb10 corpora are the largest ones, but come with some restrictions. There are also more specialized corpora like the Coronavirus Corpus11 or the Corpus of US Supreme Court Opinions.12 Some of them can be highly useful if you have research questions that are suited to be investigated in specialized corpora. As the name of the Web site indicates, www.english-corpora.org only

5 https://philpapers.org/browse/experimental-philosophy-corpus-analysis.
6 This is not to say that we do not need to be cautious about moving from claims about the use of a term to claims about its meaning.
7 https://www.english-corpora.org/coca.
8 https://www.english-corpora.org.
9 https://www.english-corpora.org/now/.
10 https://www.english-corpora.org/iweb/.
11 https://www.english-corpora.org/corona/.
12 https://www.english-corpora.org/scotus/.

features English-language corpora. However, there exist many corpora containing texts from other languages that are also available through Web sites. For instance, a large German-language corpus is available through COSMAS II.13 And the Childes14 database features large amounts of conversations with children. For a list of corpora featuring a wide variety of languages, you might want to go to this Wikipedia page: https://en.wikipedia.org/wiki/List_of_text_corpora. The second and third characteristics we mentioned are that corpora are big and (relatively) unbiased. The two aspects, of course, are not independent of each other. The larger the corpus, the less biased it will be (ceteris paribus) with regard to over- and underrepresenting specific uses of terms, contexts in which they are used, and topics of discussion.15 Just how large are some of these corpora? COCA contains around one billion words from around 485,000 texts.16 That sounds like a lot, and it surely is. In the end, however, the overall number of words is less decisive than the frequency of specific words and phrases. For example, there are 6182 hits for “irrational” in COCA, which seems plenty if you are interested in finding out how the term is used. If, however, you would like to study how the phrase “irrational choice” is used, then COCA yields only four hits, too few for a comprehensive analysis. In that case, you will probably need to look at other corpora, or build your own, something we discuss in Sect. 6.5 of this chapter. Given these positive characteristics of corpora, why then do most experimental philosophers generate their own data through tiresome and costly experiments?
The answer, we believe, is, at least partly, that it is often not easy to translate a philosophical research question into a hypothesis that can be investigated by doing corpus analysis. In the next subsection, we therefore review the literature on existing philosophical corpus analyses by stating both the research question with which the philosopher(s) started and the hypothesis that lent itself to doing corpus analysis.

6.3.2 A Somewhat Different Literature Review

Although the philosophical literature using corpus analysis is still relatively small, it is too big to be fully covered here.17 As stated above, we would like to encourage more philosophers to use corpus analysis for their own research. Therefore, we review 12 studies and state in a rather brief manner (a) the research question, (b) the corpus hypothesis the researchers worked with, (c) the corpus used, and (d) the results. We selected those 12 studies for two reasons. First, these studies show the large variety of different corpora that can be and have been used in the last 15 years.

13 https://cosmas2.ids-mannheim.de/cosmas2-web/.
14 https://childes.talkbank.org.
15 Of course, researchers are often specifically interested in biased corpora. For instance, Willemsen et al. (2023) contrast laypeople’s use of thick terms with those of judges. In such a case, the Corpus of US Supreme Court Opinions might be exactly what you are after.
16 For more details, see www.english-corpora.org/coca/help/coca2020_overview.pdf.
17 Chartrand (2022) provides a more comprehensive and rather critical review of the current state of corpus analysis in philosophy. Bluhm (2016) sketches various paths in which philosophy can benefit from corpus analysis. See also the blog entry of Ulatowski et al. (2020) on the merits of and challenges for corpus analysis in philosophy.
Second, the research questions with which the authors started are anchored in very different philosophical fields and, hence, demonstrate the usefulness and applicability of corpus analysis for many different areas. Most of the papers below investigated more than one hypothesis. For simplicity, however, we stick to a single hypothesis per paper covered.

Study 1: Knobe and Prinz (2008)
• Philosophical Research Question: Do people ascribe phenomenally conscious states to group agents?
• Corpus Hypothesis: People do not use phrases like “Microsoft feels pain” and “Microsoft feels happy.”
• Corpus Used: World Wide Web
• Summary of Results: Very few uses of phrases of the kind described above.18

Study 2: Reuter (2011)
• Philosophical Research Question: Are pains the same as feelings of pain?
• Corpus Hypothesis: People use “feeling a pain” primarily with low-intensity pains and “having a pain” primarily with high-intensity pains.
• Corpus Used: World Wide Web
• Summary of Results: People distinguish “feeling a pain” from “having a pain,” using “feeling of pain” primarily for mild, small, and no pains, and “having a pain” primarily for severe pains.

Study 3: Fischer et al. (2015)
• Philosophical Research Question: Does appearance language primarily serve to indicate a doxastic or an experiential attitude?
• Corpus Hypothesis: Terms like “seem” and “appear” are strongly associated, i.e., distributionally similar, with epistemic terms like “know.”
• Corpus Used: Wikiwoods corpus
• Summary of Results: The doxastic verbs “believe,” “think,” and “find” are among the nearest neighbors of “appear,” “look,” and “seem.”

Study 4: Andow (2015)
• Philosophical Research Question: How central was and is the notion of intuition in philosophy?
• Corpus Hypothesis: The term “intuition” is more frequently used in recent decades.
• Corpus Used: NGram and JSTOR

18 For Study 1, as with the other studies, the corpus hypotheses have all been largely confirmed.
• Summary of Results: Intuition-talk in philosophy steadily increased between the 1950s and 2000.

Study 5: Wright et al. (2016)
• Philosophical Research Question: Are moral concerns ontogenetically driven by feelings and values (good/bad), or by rules and standards (right/wrong)?
• Corpus Hypothesis: Young children primarily use the terms “good” and “bad,” but not “right” and “wrong,” to make moral evaluations.
• Corpus Used: Childes
• Summary of Results: While “good” was used for moral evaluation 14.4% of the time, “right” was used 0.1% of the time by children.

Study 6: Nichols and Pinillos (2018)
• Philosophical Research Question: Is the ordinary notion of knowledge infallibilist?
• Corpus Hypothesis: Children are not exposed to fallibilist uses of “know” (used as a propositional attitude).
• Corpus Used: Childes
• Summary of Results: Of the 802 items, coders identified no cases in which a knowledge attribution was coupled with an expression of fallibility.

Study 7: Alfano (2018)
• Philosophical Research Question: How strongly are the conceptions of drive (Trieb), instinct (Instinkt), and virtue (Tugend) related in Nietzsche’s thinking?
• Corpus Hypothesis: The terms “drive,” “instinct,” and “virtue” co-occur strongly in text sections of the Nietzsche corpus.
• Corpus Used: Self-built corpus assembled from www.nietzschesource.org
• Summary of Results: The probability of one of these terms occurring in a passage is more than doubled if at least one of the other terms occurs in the same passage.

Study 8: Sytsma et al. (2019)
• Philosophical Research Question: Is causation a descriptive concept or similar to the normative notion of responsibility?
• Corpus Hypothesis: Nouns appearing after “caused the” are primarily negative, indicating a normative use of “caused” and similar to nouns occurring after “responsible for the.”
• Corpus Used: COCA
• Summary of Results: 17 of the 20 most frequent nouns occurring after “caused the” are negative terms.

Study 9: Mizrahi (2020)
• Philosophical Research Question: What is the role of case studies in philosophy of science over the years?
• Corpus Hypothesis: The terms “case study” and “case studies” are prevalent in philosophy of science articles.
• Corpus Used: JSTOR database
• Summary of Results: There is an upward trend in appeals to case studies in many philosophy of science journals.

Study 10: Tobia (2020)
• Philosophical Research Question: Does corpus data capture the ordinary meaning of the term “vehicle”?
• Corpus Hypothesis: Corpus data fails to deliver useful information about nonprototypical members of the set of vehicles.
• Corpus Used: COCA and NOW
• Summary of Results: Corpus data provides so little information on vehicles that bicycles, airplanes, and golf carts cannot be inferred to be among the set’s members.

Study 11: Hansen et al. (2021)
• Philosophical Research Question: Is “I know” more frequently used to make non-assurances or assurances?19
• Corpus Hypothesis: Occurrences of “I know” that are non-assurances are more frequent than those that are assurances.
• Corpus Used: COCA
• Summary of Results: A random sample from COCA revealed that 62% of uses of “I know” are non-assurances.

Study 12: Reuter, Baumgartner, and Willemsen (2024)
• Philosophical Research Question: How can we differentiate evaluative from value-associated adjectives?
• Corpus Hypothesis: The modifier “truly” precedes evaluative adjectives more frequently than value-associated adjectives.
• Corpus Used: COCA and Reddit
• Summary of Results: Evaluative adjectives are more frequently used with the intensifier “truly” than descriptive and value-associated adjectives.

6.3.3 Common Patterns and Doing It Yourself

This quick “tour” of 12 studies reveals two interesting points.
First, the use of corpus analysis is not restricted to philosophy of language, as one might initially think given its popularity in linguistics and its focus on language use. Instead, we find researchers using corpus analyses to illuminate questions in (a) history of philosophy (Alfano), (b) metaphilosophy (Andow), (c) epistemology (Nichols and Pinillos, Hansen et al.), (d) philosophy of mind (Knobe and Prinz, Reuter), (e) philosophy of language (Fischer et al.), (f) philosophy of science (Mizrahi), (g) metaphysics (Sytsma et al.), (h) legal philosophy (Tobia), and (i) moral philosophy (Wright et al., Reuter et al.). Now, in a very real sense, many of these 12 studies touch on issues in the philosophy of language. Importantly, though, the central questions raised by the researchers of these papers are not primarily philosophy of language questions, but firmly rooted in a wide variety of philosophical areas.

Second, while there are many differences to be found between the 12 studies covered, there are also important commonalities. Very roughly, we suggest that those studies investigate terms or phrases in one of the following three ways:
• They investigate how frequently a term or phrase occurs (in some context or at some time): Knobe and Prinz, Reuter, Andow, Hansen et al., Mizrahi, Reuter et al.
• They investigate which other terms or phrases occur with the target term: Fischer et al., Alfano, Sytsma et al., Tobia
• They investigate the contexts in which certain terms or phrases occur: Wright et al., Nichols and Pinillos

Obviously, we do not claim that all corpus analyses fall within those three categories.

19 If a person states “I know that COCA contains over a billion words,” she gives an assurance that COCA is such-and-so. In contrast, we also use “know” merely to “share a reaction to a piece of purported news” (Baz, 2012, 38–39), e.g., Person A: “Corpus analysis is great!” Person B: “I know!”
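These three patterns can be made concrete with a minimal R sketch (R is the language used for the analyses later in this chapter); the four-sentence mini-corpus below is invented purely for demonstration:

```r
# A tiny invented corpus (one sentence per element), for illustration only
sentences <- c(
  "that is a baseless conspiracy theory",
  "she debunked the conspiracy theory",
  "an interesting theory of meaning",
  "he spread another wild conspiracy theory"
)

# (1) Frequency: how often does the target phrase occur?
sum(grepl("conspiracy theory", sentences))

# (2) Co-occurrence: which words directly precede the target phrase?
table(regmatches(sentences,
                 regexpr("\\w+(?= conspiracy theory)", sentences, perl = TRUE)))

# (3) Context: retrieve the full passages containing the phrase
sentences[grepl("conspiracy theory", sentences)]
```

On this toy data, step (1) counts three matching sentences and step (2) tabulates the preceding words “baseless,” “the,” and “wild”; with a real corpus, the same three operations are simply run over millions of texts.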
But many studies, and very likely most corpus studies in philosophy, do. Once we see these recurring patterns of approaches that researchers have used, it becomes easier to see how to translate a research question into a corpus-based hypothesis that can be investigated through corpus analysis. While there is no blueprint or recipe for such a translation, thinking about what the frequency of a certain phrase (in a certain context) and/or its co-occurrence with other terms might tell you gets you a long way toward your corpus-based hypothesis.

Some of these translations are easier to see, others a bit more difficult. The translation is easy (relatively speaking) if what you start with is already a question about the use of a term. For example, Knobe and Prinz (Study 1) more or less have their corpus question on the table, given their interest in whether we ascribe phenomenal consciousness to group agents. Similarly, Andow (Study 4), in asking about the prevalence of the term “intuition” in philosophical theorizing across the last 100 years, simply needs to observe the use of the term over time. In other cases, an important side question needs to be answered first: What are the relevant contexts for which the frequency of a term or its co-occurrence with other terms matters? Sytsma et al. (Study 8), for instance, needed to check the frequency with which bad outcomes are specified after the phrases “caused the” and “responsible for the.” Once they had figured out which contextual information would tell them when a certain phrase was used normatively, testing the corpus hypothesis was as easy as pie: You collect data, analyze the frequency or co-occurrence, and verify or falsify your hypothesis.

Importantly, some studies cannot (yet) dispense with human thinkers at an important stage of the process. Both in Studies 5 and 6, as well as to a lesser extent in Studies 3 and 10, the researchers needed to find out (or code) whether certain uses of a phrase belong to a certain category. Wright et al. (Study 5) needed people to read the larger context in which a term like “good” or “right” was used in order to determine whether its use was moral. Nichols and Pinillos (Study 6) needed to code whether the use of “know” was indeed infallibilist. Similar tasks were needed in Fischer et al. (Study 3) and Tobia (Study 10). These, at times, tedious human tasks should not discourage anyone from running their own corpus analysis.20 Thus, don’t despair if the hypothesis is too complex to be analyzed fully computationally (without human involvement). Corpus data might still provide you with a new window into your research question, even if you need to look at the data qualitatively yourself.

Before we move on to our corpus analyses of “conspiracy theory,” we should highlight that whereas corpus analysis is empirical but often nonexperimental—in the sense of using data but not generating new data—some of the same scientific standards hold for both experimental and corpus studies. First, a thorough corpus analysis should include proper control conditions wherever possible to make sure that the effects that were found cannot easily be accounted for by other factors (see also Chartrand, 2022, for a discussion of the need for and prevalence of control conditions in philosophical corpus studies). While including control conditions can be very laborious, some controls can be easily incorporated into your corpus design. We show in the next section, using the example of “conspiracy theory,” how control conditions can and should be part of one’s corpus study. Second, corpus data should not speak for itself; rather, statistical analyses of the data should be done wherever possible. This, of course, makes many corpus analyses more tedious and complex, just like any well-designed experimental study. In many cases, however, the statistical tests are similar to those used in vignette studies.
Third, just as with experimental studies, ethical standards need to be adhered to (only make anonymized data publicly available, etc.). We repeatedly touch on some of these issues in the next two sections. Now on to conspiracy theories.

6.4 A Corpus Analysis of “Conspiracy Theory” with Pre-built Corpora

The term “conspiracy theory” was hardly used before the second half of the twentieth century. It gained popularity in the 1960s and 1970s in the wake of the Kennedy assassination (deHaven-Smith, 2013; but see Butter, 2020; McKenzie-McHarg, 2018). Since the 1980s, we have seen a sharp rise in its use (see, e.g., the profiles in COHA and NGram). Nowadays, terms like “conspiracy theory” and “fake news” are part and parcel of the stock of concepts we use on a very regular basis. Unsurprisingly, philosophers are increasingly interested in understanding what we mean when we say that something is a conspiracy theory.

20 Often what starts as a human classification task can be automatized later, through classifiers, etc. (supervised and unsupervised machine learning).

The dominant view in analytic philosophy is that conspiracy theories are theories about conspiracies (see, e.g., Basham and Dentith, 2016; Cassam, 2019; Coady, 2003; Cohnitz, 2018; Feldman, 2011; Harris, 2018; Keeley, 1999; Pigden, 2007; Räikkä, 2018).21 That sounds a bit like a truism, but, of course, while the meaning of a composite term is often made up of the meanings of its parts, that is not always the case: The rainbow press is not the press about rainbows, and cloud computing has nothing to do with raindrops in the air. That said, without evidence to the contrary, we might simply assume the standard view to be correct, according to which “conspiracy theory” refers to a theory that features a conspiracy.
Napolitano and Reuter (2021) have presented empirical evidence that challenges the dominant viewpoint in the field of philosophy. Their research findings demonstrate a double dissociation between the concepts of “conspiracy” and “conspiracy theory.” Specifically, their experiments reveal that individuals are willing to designate a claim or explanation as a conspiracy theory despite the absence of any actual conspiracy (Study 4). Moreover, even in cases where a conspiracy does exist within an explanation, laypeople are less likely to refer to the explanation as a conspiracy theory (Study 5). These results indicate a clear separation between the two concepts and suggest that the dominant view in philosophy regarding the labeling of conspiracy theories may not accurately reflect how individuals interpret and categorize such claims or explanations.

The central aim of Napolitano and Reuter’s paper, however, is to investigate whether the term “conspiracy theory” has an inherently negative evaluative nature, i.e., whether the term only refers to bad, false, or unjustified theories. That research question can be roughly put as follows:

Research Question: Is “conspiracy theory” an evaluative term?

How can we investigate this research question with the means of corpus linguistics? Napolitano and Reuter conjecture that if “conspiracy theory” is indeed an evaluative term, people will frequently use evaluative adjectives before the term “conspiracy theory.” When looking at the way people use “conspiracy theory,” we should then find frequent uses of “stupid conspiracy theory” and “wild conspiracy theory,” and less frequently phrases like “interesting conspiracy theory” or “complex conspiracy theory.” Thus, we can formulate a more specific corpus-based hypothesis in the following way:

Corpus Hypothesis 1: The most frequent adjectives before “conspiracy theory” are predominantly negative.
Of course, without a suitable control condition, we cannot make reliable inferences about the evaluative dimension of “conspiracy theory.” It might, for example, be the case that many other terms are preceded by many negative adjectives without themselves being evaluative. Napolitano and Reuter use the term “theory” as a control condition. Based on their corpus analysis,22 they note that among the 50 most frequent adjectives preceding “conspiracy theory,” 25 were negative, compared to merely 6 negative adjectives in the top 50 for “theory.” While the results suggest “conspiracy theory” to be an evaluative term and thus support Corpus Hypothesis 1, one might wonder: (i) Aren’t there further control conditions (e.g., how does the term “conspiracy” function?) that need to be checked to draw more reliable conclusions about the evaluative nature of the term “conspiracy theory”? Do the results only hold for “conspiracy theory,” or also for agents labeled “conspiracy theorists”? (ii) Has the use of the term “conspiracy theory” changed during the last years and decades? (iii) Can we investigate the evaluative nature of “conspiracy theory” using a different corpus strategy?

21 This is, of course, not to say that all these theorists agree on what the proper definition of conspiracy theory is. In fact, they disagree quite a lot. For example, some theorists include a conflict criterion or some other additional element; some define conspiracy theories to be epistemically evaluative. They do agree, however, that a theory cannot be a conspiracy theory if it does not refer to a conspiracy.

In the remainder of Sect. 6.4, we investigate these questions with the help of pre-built corpora, such as COCA and NOW. In Sect. 6.4.1, we provide evidence that “conspiracy theory” is indeed a strongly evaluative term, thereby tackling the questions stated in (i). In Sect. 6.4.2, we provide an affirmative answer to question (ii), and in Sect.
6.4.3, we discuss results from a recent paper (Reuter and Baumgartner 2024) that suggest that theories labeled “conspiracy theories” are treated like misinformation and falsehoods but not like genuine theories. We also show that corpus analyses based on pre-built corpora are limited in important respects. More specifically, we argue that the results we present:
• are often based on too few data points,
• do not provide a comprehensive picture of a term’s usage,
• are not necessarily representative of ordinary usage, and
• are too strongly reliant on people’s intuitions.

A discussion of these limitations will serve as our motivation for building our own corpus and for digging deeper into more complex corpus-analytic tools in Sect. 6.5.

6.4.1 The Evaluative Nature of “Conspiracy Theory” and “Conspiracy Theorist”

Napolitano and Reuter’s corpus study shows that many of the most frequent adjectives before “conspiracy theory” are evaluative terms. In contrast, the term “theory” is preceded by a different set of mostly neutral or positive adjectives. These results suggest “conspiracy theory” to be a strongly evaluative term. However, one might object that an important control condition is missing, namely the term “conspiracy.” If “conspiracy” were as evaluative as “conspiracy theory,” then we could not conclude that the composite term “conspiracy theory” is used more evaluatively than its individual parts. Instead, it would then seem that the evaluativity of the term “conspiracy theory” is likely derived from “conspiracy.” Some of the more recent literature has not only asked what counts as a conspiracy theory but also who counts as a conspiracy theorist (Klein et al. 2018; Tsapos 2024). We therefore decided to include in our corpus study the term “conspiracy theorist.”

In order to find the most frequent adjectives before “conspiracy theory,” we go to the COCA Web site23 and enter <ADJ conspiracy theory> into the search field and hit enter.

22 Their corpus analysis was based on a corpus featuring 68,640 texts from the social media Web site Reddit.

Table 6.1 A list of the 15 most frequent adjectives in front of “conspiracy theory,” “conspiracy theories,” “conspiracy theorist,” “conspiracy,” and “theories” on COCA. Derogatory terms are highlighted in orange and negative epistemic terms in yellow

Conspiracy theory    Conspiracy theories    Conspiracy theorist   Conspiracy             Theories
Term       N         Term          N        Term          N       Term           N       Term           N
good       14        wild          24       right-wing    9       criminal       230     scientific     326
crazy      13        crazy         13       crazy         8       right-wing     193     other          290
new        13        various       13       anti-muslim   4       grand          92      new            239
latest     8         paranoid      11       favorite      3       vast           87      different      194
debunked   7         right-wing    11       crazed        2       big            72      various        158
rich       7         bizarre       10       lunar         2       alleged        68      economic       152
baseless   6         new           9        long-time     2       international  67      current        140
bizarre    6         elaborate     8        paranoid      2       jewish         61      implicit       124
big        6         other         8        real          2       communist      54      competing      110
wild       6         outlandish    7        full-on       2       global         49      alternative    91
grand      5         specific      7        resident      1       larger         47      psychological  86
other      5         baseless      6        unhelpful     1       crazy          46      legal          83
paranoid   5         anti-semitic  4        ultimate      1       federal        44      political      67
racist     5         colorful      4        total         1       liberal        43      existing       63
silly      5         good          4        strong        1       massive        42      modern         63

Total Top 15  111    Total Top 15  139      Total Top 15  41      Total Top 15   1195    Total Top 15   2186
All uses      330    All uses      428      All uses      83      All uses       3892    All uses       7065
The results can be found in Table 6.1 (two leftmost columns) but are somewhat disappointing. While the list of the 15 most frequent terms indeed contains several evaluative terms, the overall number of hits is fairly low (N = 330, and N = 111 for the top 15), certainly too low to make robust claims about the use of “conspiracy theory.” Perhaps we can boost the number by searching for <ADJ conspiracy theories> (plural), but the numbers are not much higher (see also Table 6.1). Our initial worry was that perhaps the term “conspiracy” is already predominantly negative. Searching for <ADJ conspiracy> and <ADJ theories> certainly yields higher numbers overall. The results are also displayed in Table 6.1. Finally, we have also listed the results for <ADJ conspiracy theorist>. As the results in the middle of the table show, there are too few hits to draw any reliable conclusions about its evaluative use.

Let’s set aside the problem of low numbers for the moment. The results show that the term “conspiracy” is preceded mostly (at least when we glance at the top 15) by descriptive terms. “Crazy conspiracy” seems to be the only exception of a clearly negative evaluative adjective. Once we look more closely at the results for “crazy conspiracy,” however, we find that of the 46 hits, most uses are about “crazy conspiracy theory,” “crazy conspiracy theories,” and “crazy conspiracy theorist.” In other words, “crazy” is only one of the top hits for “conspiracy” because it is used with “conspiracy theory/ies/ist/ists.”24 In summary, our analysis does provide additional evidence that “conspiracy theory/ies” is often used in a negative evaluative way and that its negative use can hardly be explained by a negative use of “conspiracy” or “theory.”

One way to get a greater number of results for <ADJ conspiracy theory> is to use different pre-built corpora like iWEB25 and NOW,26 each of which contains between 14 and 15 billion words compared to COCA’s one billion words (e.g., NOW contains 7839 hits for <ADJ conspiracy theory> compared to COCA’s 330). The downside of using these other corpora is that they are not well-balanced corpora. NOW, for instance, is a collection of texts from news on the Web and thus certainly not representative of laypeople’s use of language. Consequently, in order to get a higher number of uses that allows for a more quantitative analysis, and also to have data representative of ordinary usage, we would need to build our own corpus (see Sect. 6.5).

There are (at least) three further difficulties with our corpus analysis in Sect. 6.4.1. First, we only examined the top 15 hits for our searches. Second, we relied on an intuitive classification of the adjectives into derogatory terms, negative epistemic evaluative terms, and descriptive terms. Third, we did not do any proper statistical analysis but made inferences merely by looking at the data and counting the number of evaluative adjectives. In principle, nothing prevents us from classifying all adjectives preceding our target structure. It is simply fairly tedious work. The intuitive classification we used might be more of a problem, though.

23 https://www.english-corpora.org/coca/.
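For counts like these, a test for differences in proportions can be run directly in R. The following sketch uses the figures from Napolitano and Reuter reported earlier in this section (25 negative adjectives among the top 50 for “conspiracy theory” vs. 6 among the top 50 for “theory”); it illustrates the kind of test one might run, not the exact analysis of the original paper:

```r
# 2x2 contingency table: negative vs. other adjectives among the top 50
counts <- matrix(c(25, 25,    # "conspiracy theory": negative / other
                    6, 44),   # "theory":            negative / other
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("conspiracy theory", "theory"),
                                 c("negative", "other")))

# Fisher's exact test is well suited to small count data like this
fisher.test(counts)
```

The resulting p-value is far below 0.05, indicating that the difference in the share of negative adjectives between the two conditions is very unlikely to be due to chance.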
One way to tackle researcher bias would be to ask several independent coders to categorize the adjectives into derogatory, epistemically evaluative, and descriptive terms. In terms of running a statistical analysis, we could, for example, test for differences in the occurrence of evaluative terms among the most frequent terms.27 Fisher’s exact test reveals a significant difference between the test condition (conspiracy theory) and the control condition (conspiracy): χ² = 7.194; p = 0.015. Of course, we could take more adjectives into account and also weigh the frequency with which they occur. The overall fairly low numbers, however, do not lend themselves to a robust statistical analysis. We get back to this issue once we compile our own corpus, for which we have a much greater number of target adjectives.

24 Unfortunately, COCA does not allow you to exclude phrases that are followed by certain terms. In order to do this, you need to check by hand, or, even better, run your own computer code. 25 https://www.english-corpora.org/iweb/. 26 https://www.english-corpora.org/now/. 27 For more on the assumptions for running statistical tests, on effect sizes, on p-values, etc., please see Chap. 3 in this book.

6.4.2 How Has the Meaning of “Conspiracy Theory” Developed over Time?

Many philosophers have argued that “conspiracy theory” is a descriptive term referring to a theory featuring a conspiracy of some sort. The results of the experimental studies in Napolitano and Reuter as well as the corpus analysis of Sect. 6.4.1 suggest instead that “conspiracy theory” is a negative evaluative term. How can we explain this contrast? Have those philosophers simply been out of touch with reality? Perhaps a different explanation is available. While the term “conspiracy theory” was hardly used by the folk before 2010, nowadays, it is a highly popular term among laypeople.
It is thus not unlikely that the term has changed its meaning during the last 10–15 years.28 From this observation, we can state our second corpus hypothesis:

Corpus Hypothesis 2: The most frequent adjectives occurring before “conspiracy theory” were less negative 10–12 years ago than they are now.

The corpus COCA features too few hits to do an analysis of the term “conspiracy theory” over time. The corpus NOW (Davies 2016) with its 15 billion words, however, does allow us to compare the adjectives preceding “conspiracy theory” in the time span 2010–2016 with those in the time span 2019–2021.29 In order to investigate the hypothesis that the use of “conspiracy theory” has changed during the last 10–15 years, we go to https://www.english-corpora.org/now/, enter <ADJ conspiracy theory> into the search field, click on sections, select the years between 2010 and 2016 in the left column by holding down the Shift key, and then hit enter. The left-hand side of Table 6.2 lists the most frequent adjectives occurring before “conspiracy theory.” In order to get the data from the last three years, we simply select 2019–2021 and hit enter again. The results are listed on the right-hand side of Table 6.2.

The results clearly suggest quite a dramatic change in the term’s use during the last 10 years, providing positive evidence for Corpus Hypothesis 2. While the most frequent adjectives preceding “conspiracy theory” in the years 2010–2016 were either positive or neutral (“good,” “popular,” “new,” “online,” “big,” “particular,” “elaborate,” “grand,” and “latest”), these terms are not used as often (relatively speaking) in the last few years. Instead, we see a rise of mostly negative epistemic terms: among the six most common adjectives, we find “baseless,” “debunked,” “unfounded,” and “false.”

28 Terms like “naughty,” “gay,” “nice,” “silly,” and “discriminatory” have all changed their meanings over decades and centuries (see, e.g., Willemsen et al. 2024).
Perhaps “conspiracy theory” went through a similar change, albeit on a much smaller time scale.

29 We only selected the last three years (2019, 2020, 2021, excluding 2022) as a snapshot of its current use. We then selected the years 2010–2016 as the contrasting time span. The less frequent use of the term “conspiracy theory” in the early 2010s made it necessary to extend the time frame up to the year 2016.

Table 6.2 List of the 20 most frequent terms occurring before “conspiracy theory” for the years 2010–2016 (left-hand side) and 2019–2021 (right-hand side) of the NOW corpus. We color-coded positive and neutral terms in cyan, derogatory terms in orange, and negative epistemic terms in yellow. Terms left uncolored are primarily descriptive, albeit often carrying a negative connotation

2010–2016              2019–2021
Term         N         Term          N
good         43        baseless      475
popular      28        debunked      368
bizarre      27        far-right     317
new          26        right-wing    204
online       21        unfounded     184
big          15        false         180
particular   14        anti-semitic  131
elaborate    13        new           130
paranoid     13        pro-trump     121
racist       12        bizarre       115
grand        10        racist        113
latest       10        popular       108
ridiculous   10        discredited   97
right-wing   10        wild          95
false        9         online        93
crazy        9         crazy         68
baseless     9         dangerous     66
great        8         bogus         54
political    8         good          51
weird        8         latest        51

Perhaps surprisingly, the data does not reveal an observable difference in the use of derogatory adjectives. Of course, the top 20 most frequent terms provide an incomplete picture of all the adjectives. However, the increase in negative epistemic terms as well as the decrease in positive and neutral terms is remarkable. Although we do not have any data to support this view,30 we might want to speculate that, going back in time even further, say to the beginning of the century or even to the 80s and 90s of the twentieth century, we would find an even more neutral picture of the use of “conspiracy theory” (see also McKenzie-McHarg, 2018).
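To put rough numbers on this shift, one can compare the share of negative epistemic hits among the top-20 totals of Table 6.2 with a two-sample proportion test in R (which adjectives count as negative epistemic here follows our own classification above):

```r
# Negative epistemic hits among the top-20 adjectives in Table 6.2
# 2010-2016: false (9) + baseless (9) = 18 of 303 top-20 hits
# 2019-2021: baseless (475) + debunked (368) + unfounded (184) + false (180) +
#            discredited (97) + bogus (54) = 1358 of 3021 top-20 hits
prop.test(x = c(18, 1358), n = c(303, 3021))
```

The share of negative epistemic hits rises from roughly 6% to roughly 45%, a difference the test flags as highly significant.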
If these two lists do reflect the meaning of “conspiracy theory” at different times, then the purely descriptive definitions of philosophers might indeed reflect the dominant meaning of the term in previous times. In other words, the evaluative meaning of “conspiracy theory” that was recorded by Napolitano and Reuter (2021) might be a more recent phenomenon. We need to be careful not to overinterpret our data, though. The NOW corpus reflects the language use in media and hence differs in important respects from everyday use. The rise in negative epistemic terms might reflect a bias in the media sector. For instance, journalists might increasingly feel the need to highlight that current conspiracy theories are baseless, debunked, false, unfounded, and discredited.31

30 Although the data on COCA is very sparse, there seem to be only four negative uses of the term “conspiracy theory” among the 166 hits from the 1990s. 31 It is also not implausible to think that the nature of conspiracy theories has changed over the last decades, i.e., whereas conspiracy theories of the past were more often based on facts and well argued for, nowadays, conspiracy theories are more often epistemically deficient (we thank Giulia Napolitano for this suggestion).
Table 6.3 A list of the ten most frequent verbs in front of “conspiracy theories” and “falsehoods” (table taken from Reuter and Baumgartner, 2024)

Conspiracy theories        Falsehoods
Term        N              Term          N
Promote     885            Spread        748
Spread      611            Perpetuate    370
Push        490            Peddle        304
Believe     441            Contain       142
Peddle      389            Promote       136
Embrace     243            Publish       121
Share       151            Propagate     116
Amplify     141            Tell          113
Debunk      136            Use           100
Espouse     135            Debunk        92

6.4.3 Other Approaches and Limitations of Analyses with Pre-built Corpora

Conducting a corpus analysis of the adjectives that appear before the term “conspiracy theory” is, of course, not the only method of investigating its usage, particularly in regard to its evaluative status. Another approach involves examining the verbs that precede the term to determine the actions people associate with it. Reuter and Baumgartner (2024) employ this method by analyzing the verbs that appear before “conspiracy theory” as well as before control terms like “theory,” “string theory,” and “music theory.” Their study draws on the publicly accessible NOW corpus as well as on a self-built Reddit corpus. Their first study finds that conspiracy theories seem to lack the epistemic and scientific standing that characterizes genuine theories. Whereas verbs like “teaching,” “studying,” and “testing” frequently appear before the control theories, these verbs hardly occur before “conspiracy theory.” Their second study demonstrates that conspiracy theories are frequently spread in a manner similar to the dissemination of falsehoods and misinformation. Table 6.3 shows the most frequent verbs preceding “conspiracy theories” and “falsehoods.” These findings provide further support for the argument that conspiracy theories are not widely perceived as genuine theories.

Before we conclude our investigation of the term “conspiracy theory” using pre-built corpora, we would like to take a brief critical perspective.
In this section, we have learned about the way we talk about conspiracy theories by running corpus analyses with pre-built corpora. However, we noticed before that we need to be cautious with our corpus data. Here are five limitations and problems we faced:

• The corpora were at times too small to deliver enough data points to make reliable inferences.
• Focusing on the most common terms preceding the target term does not reveal a complete picture of the use of the target term.
• We often relied on our own intuitions in regard to whether certain terms belong to a certain category.
• Statistical analyses are often not easy to do on data from pre-built corpora.
• The corpora we used are often not representative of the ordinary use among laypeople.

296 K. Reuter and L. Baumgartner

In the next section, we run some corpus analyses on "conspiracy theory" using a corpus that we build ourselves, and thereby show how at least some of these limitations can be tackled.

6.5 Building Your Own Corpus and Analyzing "Conspiracy Theory"

Given the drawbacks and limitations we faced in Sect. 6.4, we now show how to build your own corpus and how to run simple analyses with the data in R (also see Chap. 3). Before embarking on this endeavor, however, we highly encourage you to take a look at the multitude of preexisting corpora which have wide applications in computational corpus linguistics. Among the best known are the Brown Corpus, the Gutenberg Corpus, the Reuters Corpus, and the Penn Treebank—just to name a few.32 These corpora are pre-annotated but require additional software for analysis, and most of these resources are structured for very specific tasks.

6.5.1 Prerequisites

This section assumes that the reader is familiar with the basics of the programming language R. For this chapter, we use R version 4.1.0 and RStudio version 1.4.1106.
The full code script and data for this chapter are available in our OSF repository.33 We highly recommend working through the example script in parallel with reading the code explanations. If you are relatively new to R, working with the provided script and data is essential. The line numbering in the code chunks follows the line numbering in the original script, which allows for seamless navigation between the two resources. We also provide pre-compiled data objects for each step in the data collection and analysis, which can be loaded directly into the workspace in case you want to skip one of the steps. The code blocks' headers also detail the estimated runtime, which is significant in places. Note that there are two scripts and two output folders: /script.R and /output/ contain the code and data to process the full data, while /script_subsample.R and /output_subsample/ only use a subsample of the data. This chapter details the code in /script.R, but /script_subsample.R is identical (including the line numbering), except for different file paths and a few additional lines for the subsampling.

32 The corpora are freely available, for instance, at http://www.nltk.org/nltk_data/. Many more can be found in the catalog of the Linguistic Data Consortium (https://catalog.ldc.upenn.edu/).
33 https://osf.io/abkm3/?view_only=189759a7dd2e407581bed5641e273c52.

Box 6.6
If you intend to run the whole script on your local machine, we highly recommend processing only the subsample—the total runtime with the full data is more than a day. Please install all the packages specified in the script. If you skip certain code blocks, make sure to still load all packages mentioned earlier in the script, as they might be needed later on as well.
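The package installation mentioned in the box can itself be scripted. The following sketch is our illustrative addition (not part of the chapter's script); the package names are the ones loaded over the course of this chapter, and the script's exact list may differ:

```r
# Packages loaded in this chapter's code blocks
pkgs <- c("rstudioapi", "jsonlite", "tidyverse", "utc", "plyr",
          "spacyr", "stringi", "pbmcapply", "vader", "quanteda")

# Determine which of them are not yet installed on this machine
missing <- setdiff(pkgs, rownames(installed.packages()))
missing

# install.packages(missing) # uncomment to install the missing ones
```

Installing only the missing packages avoids re-downloading packages you already have when re-running the setup.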
Lastly, please note that this script is optimized for Unix systems; on Windows, the parallelization used does not work, which leads to a significant increase in runtime.34

Before we start, please set the working directory to the downloaded folder for every new session (l. 15–18), since the code in this chapter loads external objects:

15 setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) # set working directory
16 getwd() # print working directory
17
18 rm(list = ls()) # clear workspace

On l. 15, we set the working directory to the folder in which the code file is located. The location can be printed by invoking l. 16. On l. 18, we clear the workspace to ensure there is no leftover data from previous sessions. Now, we are all set up!

34 That said, the code will still run on a Windows machine. Due to the lack of the fork() functionality on Windows, parallelized vectorized R functions (mc*apply()) will simply default to single-threaded computing. Check out the parallelsugar package (https://github.com/nathanvan/parallelsugar) if you are interested in a parallel computing application on Windows with a similar syntax to the one used in this chapter.

6.5.2 Why APIs Are Your New Best Friends

For our current purposes, we decided to collect Reddit comments containing mentions of "conspiracy theory." Reddit is probably the biggest online forum, with thousands of submissions and comments every day, and it is well known to host very controversial opinions, among them also conspiracy theories. As such, Reddit comments provide us with a presumably very large pool of natural language data on conspiracy theories, which can be mined for scientific purposes. This data can easily be accessed via a so-called application programming interface (API). APIs allow external entities (e.g., a software application on your mobile phone) to send queries to the API provider, which then sends an automated response back.
In our case, we want to send a query for comments containing the phrase "conspiracy theory"; the Reddit API will then send us these comments back as a response.35 There are plenty of APIs available, including for services like Twitter, Google Maps, Google Translate, Skyscanner Flight Search, etc. Some of these require prior registration and subsequent authentication, while others can be used (mostly) anonymously. In sum, APIs allow access to large amounts of structured data, which are just a query away, without all the hassle related to more extensive Web scraping. APIs are highly regulated and come with their own query language. The Pushshift Reddit API (Baumgartner et al., 2020) can easily be accessed via your browser. Here is a sample query:

https://api.pushshift.io/reddit/search/comment/?q=%22conspiracy%20theory%22

With this query, we tell the API that we want to search Reddit data, specifically comments. The actual query parameter is q=, for which we provide the phrase we are looking for, namely "conspiracy theory" (the quotes are translated to %22, the spaces to %20, which is called percent encoding). If we type this into our browser, we get a response back, which looks analogous to this:

{
  "data": [
    {
      "all_awardings": [],
      "archived": false,
      "associated_award": null,
      "author": "TaroProfessional6141",
      "author_flair_background_color": null,
      "author_flair_css_class": null,
      "author_flair_richtext": [],
      "author_flair_template_id": null,
      "author_flair_text": null,
      "author_flair_text_color": null,
      "author_flair_type": "text",
      "author_fullname": "t2_crckblfb",
      "author_patreon_flair": false,
      "author_premium": false,
      "body": "Their idea of research is simply to search for anyone who confirms their biases and/or offers them an alternative reality conspiracy theory.
They find this and act like their super Google typing skills are proof of their innate genius LMFAO!",
      "body_sha1": "54259c88fec0d7c428e38a1ff34719ed0643a679",
      "can_gild": true,
      "collapsed": false,
      "collapsed_because_crowd_control": null,
      "collapsed_reason": null,
      "collapsed_reason_code": null,
      "comment_type": null,
      "controversiality": 0,
      "created_utc": 1647512281,
      "distinguished": null,
      "gilded": 0,
      "gildings": {},
      "id": "i102192",
      "is_submitter": false,
      "link_id": "t3_tfiisq",
      "locked": false,
      "no_follow": true,
      "parent_id": "t1_i0y1ixy",
      "permalink": "/r/ParlerWatch/comments/tfiisq/this_dude_on_my_friends_list_that_just_got_removed/i102192/",
      "retrieved_utc": 1647512295,
      "score": 1,
      "score_hidden": false,
      "send_replies": true,
      "stickied": false,
      "subreddit": "ParlerWatch",
      "subreddit_id": "t5_3dw7go",
      "subreddit_name_prefixed": "r/ParlerWatch",
      "subreddit_type": "public",
      "top_awarded_type": null,
      "total_awards_received": 0,
      "treatment_tags": [],
      "unrepliable_reason": null
    }, ...
  ]
}

The data we get back comes in the JSON format, which is—simply put—the enhanced cousin of your standard CSV or Excel file. The JSON response contains lots of information with which the regular Reddit user might be familiar, but for us, the most interesting variable is body, i.e., the text of the comment. Now, the goal is to collect all these comments in an orderly fashion. Currently, the API limits your queries to 25 comments per response—not quite the number we have in mind. In order to get more responses, we will have to send the same query multiple times. If you try this in your browser, you will get different responses back every time. But how can we ensure that we do not get the same responses back among our several thousand queries?

35 At the time of writing, the Reddit API service was free and open access. However, since July 1, 2023, the API has been a paid service.
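Incidentally, the percent encoding used in the query above does not have to be done by hand. As a small illustrative aside from our side (not part of the chapter's script), base R's URLencode() can generate it:

```r
# Percent-encode the quoted search phrase: '"' becomes %22, ' ' becomes %20
q <- URLencode('"conspiracy theory"', reserved = TRUE)
q  # "%22conspiracy%20theory%22"

# Paste it onto the endpoint to reproduce the sample query from above
paste0("https://api.pushshift.io/reddit/search/comment/?q=", q)
```

Setting reserved = TRUE is what forces the encoding of characters such as quotes and spaces; with the default, they would pass through unchanged.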
The Reddit API allows you to specify a time frame within which the query will be computed. Hence, we will select a start date and work our way back in time to the desired end date. In theory, this seems like an easy task. But how do we send automated queries and store each response? For this, we use the programming software R. Here is what formulating a query and saving the response look like in R:

23 library(jsonlite)
24 library(tidyverse)
25 response <- fromJSON("https://api.pushshift.io/reddit/search/comment/?q=%22conspiracy%20theory%22")
26 response <- as_tibble(response$data)

To read in the response from the query within R, we can use the fromJSON() function from the jsonlite package (Ooms et al., 2020; loaded on l. 23) and assign the output to the object response (l. 25). Then, we use the as_tibble() function to coerce the data element of the response into a special kind of data frame, called a tibble (l. 26).36 In sum, two lines of code already get us the desired data in a form that can easily be manipulated further down the line.

Box 6.7
The jsonlite package (Ooms et al., 2020; loaded on l. 23) is used to import, handle, and write data in JSON format. For more information, visit https://cran.r-project.org/web/packages/jsonlite/index.html. The as_tibble() function is loaded via the tidyverse package (Wickham, 2021; loaded on l. 24), which is an opinionated collection of R packages for data science (https://www.tidyverse.org/). Please make sure to load these packages, even if you skip this step.

Now, we want to send multiple queries to get all data for a specific time period, which can be achieved by looping the query process. In our case, we would like to get Reddit comments from 01.10.2021 to 31.12.2021. Since we do not know how many iterations of the query are necessary to get all the data, we use a while loop, which continues the query process as long as we do not have all the data, and stops as soon as we do.
Please note that this is an open-ended process, and due to the amount of data available, it will take multiple hours to complete. We thus recommend using the pre-compiled data objects in our OSF repository in order to continue without delay.37 Here is the example code for "conspiracy theory":

32 library(utc)
33 container <- list()
34 query.root <- "https://api.pushshift.io/reddit/search/comment/?q=%22conspiracy%20theory%22&before="
35 time.index <- as.numeric(toUTC(as.Date("31-12-2021", '%d-%m-%Y')))
36 time.end <- as.numeric(toUTC(as.Date("30-09-2021", '%d-%m-%Y')))
37 while(time.index>time.end){ # outer loop
38   query <- paste0(query.root, time.index)
39   print(time.index)
40   TRYING <- T
41   while(TRYING){ # inner loop
42     response <- try(fromJSON(query))
43     TRYING <- "try-error" %in% class(response)
44     #Sys.sleep(1) # uncomment this line if you get 429 and 403 errors
45   }
46   df <- as_tibble(response$data)
47   container[[as.character(time.index)]] <- df
48   time.index <- min(df$created_utc)
49 }
50 save(container, file = "./output/api-calls/conspiracy_theories.RDS")

First, we specify a list, the container object on l. 33, which serves to collect each response.38 On l. 34, you will find a slightly changed API query string, which is assigned to query.root. It contains a new parameter (the suffix &before=), which restricts the query to the newest comments prior to the specified time point. Said time point will be specified anew for every iteration, as we will see later. The start date for the query is assigned to time.index (l. 35), the end date to time.end (l. 36). The dates are provided in UTC epoch date format, an unambiguous machine-readable date format.

36 For more on tibbles, visit https://tibble.tidyverse.org/.
37 This also applies to the code in /script_subsample.R.
Box 6.8
Epoch time is the number of seconds that have elapsed since the Unix epoch (excluding leap seconds), which is 00:00:00 UTC on January 1, 1970. To get our start date, December 31, 2021, in epoch time, we can use an online converter such as https://www.epochconverter.com/. On l. 35–36 in the code, we instead use the toUTC() function from the utc package (Segura, 2019; loaded on l. 32) to convert the respective date object to UTC format and subsequently coerce it to numeric, which gives us the epoch time.

In the code above, there are two while loops, an inner (l. 41–45) and an outer (l. 37–49) one. The outer one gets evaluated first and ensures that our time period is respected: as long as time.index is still bigger than time.end, the queries continue. For each iteration of the outer while loop, we formulate a new query by appending time.index to query.root (l. 38).39 For the actual query, we then enter the inner while loop on l. 41. The reason why we need this inner loop is that the query relies on your Internet connection, which might not be stable at all times. If the connection breaks during a query, the outer while loop would break because of a no-connection error. To prevent this, we add some simple error handling: the inner while loop will try over and over to perform the query if the connection throws an error.40 If there is no error, the loop stops and we proceed to coercing the response to a tibble (l. 46), which is subsequently stored in our container (l. 47). Finally, we extract the timestamp of the last comment and assign it as the new time.index (l. 48). By doing so, we always take the timestamp of the last comment as the new time point for the next query on l. 38, and thus continually go back in time.

38 Alternatively, one could also save each response locally (as an RDS file, for instance).
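As a quick sanity check on the epoch timestamps used in the loop, the conversion can also be done in base R, without the utc package (this snippet is our illustrative addition, not part of the chapter's script):

```r
# Convert calendar dates to epoch time (seconds since 1970-01-01 00:00:00 UTC)
start <- as.numeric(as.POSIXct("2021-12-31", tz = "UTC"))
end   <- as.numeric(as.POSIXct("2021-09-30", tz = "UTC"))
start  # 1640908800

# And back again, from epoch time to a calendar date
format(as.POSIXct(start, origin = "1970-01-01", tz = "UTC"), "%Y-%m-%d")  # "2021-12-31"
```

Specifying tz = "UTC" matters here: without it, the conversion would use your local time zone and shift the timestamps by a few hours.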
Once the timestamp of the last comment is earlier than 01.10.2021 (i.e., no later than our time.end), the outer while loop stops, and the data collection is completed. Lastly, container can be saved to your working directory, e.g., as an RDS file (l. 50). Congrats, you just collected your first corpus data! For our project, we repeat the data collection process for "conspiracy theories," "conspiracy theorist(s)," "conspiracy" (without the mention of "theory"), and "theory," in order to have comparison sets and control conditions.41 Next, we analyze which adjectives are most often attributed to conspiracy theories.

6.5.3 Corpus Annotation

We want to investigate whether "conspiracy theory" is a descriptive term or whether it communicates a negative evaluation. In the following, we walk you through these steps:

1. Compiling the API responses into a single corpus
2. Syntactic annotation of the data
3. Extracting target constructions (adjectives preceding "conspiracy theory," "theory," etc.)
4. Sentiment annotation of the extracted adjectives

39 Our start date, December 31, 2021, is 1640908800 in epoch format. Hence, the query would read as https://api.pushshift.io/reddit/search/comment/?q=%22conspiracy%20theory%22&before=1640908800.
40 If you make too many queries, the server will throw the error codes 429 or 403. In that case, simply uncomment the code on l. 44, which introduces a pause of 1 second between each query.
41 The corresponding code is detailed on l. 55–76 in the script.

Fig. 6.7 The data is stored in multiple files (a); each file consists of multiple responses (b); and each response, in turn, consists of multiple comments (c).
The task is to collect the comments for each file (1) and then compile them into a single corpus (2)

6.5.3.1 Compiling the Corpus

At this point in the process, the API data is saved as separate RDS files in the folder "./output/api-calls/" (one file for each target phrase, e.g., "./output/api-calls/%22conspiracy%20theory%22.RDS"). We recommend pooling the data into a single corpus, as this will make the data processing easier (even though decentralized pipelines might be faster). To better understand this process, the data structure is illustrated in Fig. 6.7: the current data structure is depicted in white, and the desired output is depicted in grey. Each file is essentially a list containing all API responses as its elements; each element, in turn, contains the desired comments. So, how do we reduce these data chunks into a single corpus? This is a two-step process:

1. We need to open each file, which is essentially a list of single API responses. We then collapse the data for each file into a data frame.
2. Once each file corresponds to a single data frame, we pool these data frames into a single object and finalize the corpus.

Here is how we go about the first step: First, we assign a list of all the files we want to combine to files (l. 83). We then use the lapply() function—a so-called wrapper—to apply a custom function to all elements in files (l. 84–91). The custom function first loads the first element in files (l. 86), i.e., "./output/api-calls/conspiracy.RDS", which adds the object container to the workspace, viz., the list object containing the API responses from above. The API data has a very complex and slightly heterogeneous structure, which cannot be directly compiled into a single data frame. The code on l. 87–90 takes care of these issues, for which we require functionalities from the plyr package (Wickham, 2020; loaded on l. 81).
This treatment gets repeated for every element in files, compiling all the API responses for an element into a single data frame. The result of this process is a list of data frames, assigned to dfl, whose elements correspond to the elements in files.

81 library(plyr)
82 rm(list = ls())
83 files <- list.files("./output/api-calls/", pattern = "*\\.RDS", full.names = T)
84 dfl <- lapply(files, function(x){
85   print(paste0("currently loading: ", x))
86   load(x)
87   container <- lapply(container, function(y){
88     select(y, !where(is.data.frame))
89   })
90   do.call(rbind.fill, container)
91 })

In the second step, we collapse dfl, the list of data frames, into a single data frame (l. 92). Then, we annotate each comment (i.e., each row) with the target phrase it contains according to the API, and assign it to the variable df$target_phrase (l. 93–94).42 Finally, we coerce the comments' timestamps to numeric values (l. 95), make sure that there are no comments prior to October 1, 2021 (l. 96), arrange the data by date and ID (l. 97), and save the data frame (l. 101).

92  df <- as_tibble(do.call(rbind.fill, dfl))
93  target_phrases <- c("conspiracy", "theory", "conspiracy theories", "conspiracy theorist", "conspiracy theory") # follows the order in <files>
94  df$target_phrase <- rep(target_phrases, sapply(dfl, nrow))
95  df$created_utc <- as.numeric(df$created_utc)
96  df <- filter(df, created_utc>=1633046401)
97  df <- arrange(df, created_utc, id)
101 save(df, file = paste0("./output/api-calls/combined/combined_api_responses.RDS"), compress = "gzip")
102 detach("package:plyr", unload=TRUE)

Now that we have a single corpus object, we can annotate the data and isolate the phenomena we are interested in.

42 Note that the order in target_phrases (l. 93) follows the order in files (l. 83).

Box 6.9
Text data is inherently multidimensional: it has a syntax, encodes semantic information, is written on different topics, contains coreferences, etc. But
these dimensions are often latent features rather than explicitly available variables. This means that the data has to be prepared in such a way that it can be used for computational analysis. This is usually done by adding relevant metadata to the data, a process which is often referred to as data annotation. This labeling process is typically performed by means of pretrained NLP classifiers. In certain cases, it might be beneficial or even necessary to train your own classifiers for your specific purposes, but generally, it is advisable to rely on the standard annotators for the task at hand.

6.5.3.2 Syntactic Annotation

In order to extract our target adjectives, we first have to annotate the syntactic structure of the comments. Syntactic annotation is often referred to as part-of-speech tagging (henceforth PoS tagging) and makes use of pretrained syntactic dependency parsers. These parsers are used to decompose and convert text strings (i.e., our Reddit comments) into a structural representation, in this case syntactic dependency trees. There are several dependency parsers available for R, such as udpipe (Wijffels et al., 2022), spacyr (Benoit and Matsuo, 2020), and openNLP (Hornik, 2019). In this chapter, we use spacyr, which needs a separate installation. Please follow the installation instructions in the package vignette.43 Please note that the PoS tagging takes several hours with the full data and several minutes with the subsample.
This is how the syntactic annotation works:

107 library(spacyr)
108 library(stringi)
109 library(pbmcapply)
110 rm(list = ls())
111 #spacy_install() # only run if you do not have spacyr language models installed yet
112 spacy_initialize()
113 load("./output/api-calls/combined/combined_api_responses.RDS")
114 ## Please subset the corpus if you want to run the whole annotation:
115 #df <- sample_n(group_by(df, target_phrase), 50)
116 alist <- pbmclapply(1:nrow(df), function(x){
117   annot <- spacy_parse(df$body[x])
118   return(paste0(tolower(annot$token), "__", annot$pos, collapse = " "))
119 }, mc.cores = 3)

On l. 113, we load the data, which is stored in the object df (for "data frame"). The functions on l. 114–115 parse each comment into PoS tags, which are assigned to annot.

43 https://cran.r-project.org/web/packages/spacyr/readme/README.html.

If we take the sentence "I hate these stupid conspiracy theories," its annotation will look like this:

> spacy_parse("I hate these stupid conspiracy theories.")
  doc_id sentence_id token_id      token      lemma   pos entity
1  text1           1        1          I     -PRON-  PRON
2  text1           1        2       hate       hate  VERB
3  text1           1        3      these      these   DET
4  text1           1        4     stupid     stupid   ADJ
5  text1           1        5 conspiracy conspiracy  NOUN
6  text1           1        6   theories     theory  NOUN
7  text1           1        7          .          . PUNCT

As we can see, the data structure of annot differs from the one in our corpus (df): df has one comment per row in the data frame, whereas annot contains one token (i.e., one syntactic component of a comment) per row. In other words, the latter is a long version of the former, which means that their respective formats are incompatible. Hence, we need to reaggregate the token-level PoS tags back to comment level, in order to ensure that we can join the PoS tags back to our corpus (l.
115): First, we paste together the lowercase tokens (annot$token) with their respective PoS tags (annot$pos), using a double underscore as delimiter, and subsequently collapse those compounds at comment level, introducing whitespace between each of the token–PoS-tag compounds. Finally, we return the annotated comments, which are automatically collected in a list by the wrapper function, and assign said list to the new object alist. The end result will look analogous to this:

i__PRON hate__VERB these__DET stupid__ADJ conspiracy__NOUN theories__NOUN .__PUNCT

Box 6.10
The syntactic annotation is wrapped by the pbmclapply() function (l. 116–119) from the pbmcapply package (Kuang et al., 2019; loaded on l. 109). The pbmclapply() function is a wrapper that tracks the progress of mclapply(), which is a parallelized version of the lapply() function we used earlier. The mclapply() function allows us to compute several tasks at the same time and thus significantly reduces computation time, while its wrapper (i.e., pbmclapply()) tracks and visualizes the progress.

6.5.3.3 Extracting Target Structures

Next, we need to extract all the adjectives preceding our target phrases. For this, we can use so-called regular expressions, or regex. Regex is used to describe character strings by means of syntactic rules.44 In our case, we are looking for one or more lowercase letters ([a-z]+) followed by the suffix __ADJ (viz., adjectives), which precede our target phrases.45 As we only want to extract the adjective rather than the whole target structure, we use a so-called positive lookahead expression (i.e., (?=...)), which only looks at what comes after the preceding expression, without actually evaluating or extracting it.46 Here is an abstract example for "conspiracy theory":

[a-z]+(?=__ADJ\\sconspiracy__NOUN\\stheory__NOUN)

On l. 118, below, we coerce alist to the vector astring.
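To see the lookahead at work, here is a minimal check in base R (our illustrative addition; the chapter's script uses stri_extract() from stringi instead):

```r
# Apply the lookahead pattern to the annotated example sentence from above;
# perl = TRUE enables lookahead support in base R's regex engine
x <- "i__PRON hate__VERB these__DET stupid__ADJ conspiracy__NOUN theories__NOUN .__PUNCT"
pattern <- "[a-z]+(?=__ADJ\\sconspiracy__NOUN\\stheories__NOUN)"
regmatches(x, regexpr(pattern, x, perl = TRUE))  # "stupid"
```

Note that only the adjective itself is extracted; the part inside the lookahead ("__ADJ conspiracy__NOUN theories__NOUN") is checked but not included in the match.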
Then, we define all regex expressions based on the target phrases, analogous to the above (l. 119). The stri_extract() function (from the stringi package;47 Gagolewski and Tartanus, 2019; loaded on l. 108) extracts exactly what is specified in the regex from the annotated token strings in astring and assigns it to the new object adj (l. 120). After checking that adj and our corpus df have the same length, we add adj and astring to our corpus as df$adj and df$body_pos, respectively (l. 121–124).

118 astring <- unlist(alist)
119 regex <- paste0("[a-z]+(?=__ADJ\\s", gsub("\\s", "__NOUN\\\\s", df$target_phrase), "__NOUN)")
120 adj <- stri_extract(str = astring, regex = regex)
121 if(length(adj) == nrow(df)){
122   df$adj <- adj
123   df$body_pos <- astring
124 }
125 save(df, file = "./output/annotation/syntactically_annotated_corpus.RDS", compress = "gzip")

Our corpus now has two new variables: df$adj, the adjectives preceding our target phrases, and df$body_pos, the PoS-tagged comments.

6.5.3.4 Sentiment Annotation

Sentiment annotation is usually dictionary-based or uses pretrained classifiers (or a combination of both). For this project, we use the VADER (Valence Aware Dictionary for sEntiment Reasoning) dictionary (Hutto and Eric, 2014; Roehrick, 2020).48 VADER annotates tokens based on their embedding and calculates the sentiment value of the whole text string based on the word scores.

44 For more resources on regex, consult https://www.regular-expressions.info/.
45 Our target phrases look as follows: conspiracy__NOUN\\stheories__NOUN (where \\s stands for whitespace).
46 For more on lookaround expressions, see https://www.regular-expressions.info/lookaround.html.
47 For extensive documentation of the stringi package, see https://stringi.gagolewski.com/.
48 More information on the VADER implementation in R can be found at https://cran.r-project.org/web/packages/vader/index.html.
The word scores are retrieved from a dictionary that contains pairs of tokens and decontextualized sentiment scores. VADER then weighs these lexical scores by taking into account negations as well as the wider context of tokens (to a limited degree). For example, the word score for "crazy" in "This is crazy!" is −1.4, while the one in "This is not crazy!" is 1.036. This means that we do not have to treat negations separately, as VADER takes care of them. Furthermore, VADER is specifically attuned to sentiments expressed in social media (Hutto and Eric, 2014), which means that we can expect adequate results for Reddit data.49

Sentiment annotation is very similar to syntactic annotation. Before starting, we remove all objects from the workspace (l. 132) and load the data (l. 133), viz., df. Then, we remove all comments which do not have an adjective preceding our target phrases (i.e., they have a missing value (NA) instead of a character string) and create our corpus (l. 134). Hence, we can remove the now obsolete df (l. 135).

130 library(vader)
131 library(quanteda)
132 rm(list=ls())
133 load("./output/annotation/syntactically_annotated_corpus.RDS")
134 corpus <- filter(df, !is.na(adj))
135 rm(df)

We also need to ensure that our comments only contain the target phrase they are supposed to contain. More specifically, the API queries for "conspiracy" might include results for the compounds "conspiracy theories," "conspiracy theorist," or "conspiracy theory." Since we are interested in the differences between these target phrases, we need to filter out these overlapping observations (l. 136).

136 corpus <- filter(corpus, !(target_phrase == "conspiracy" & grepl("[a-z]+__ADJ\\sconspiracy__NOUN\\s(?=theories|theorist|theory)", body_pos, perl = T)))

Now, we can begin the actual sentiment analysis.
We decided to perform the sentiment annotation on sentence level, rather than taking into account whole comments, because the length of comments varies a lot.50 Accordingly, we break the comments down into sentences51 (l. 137) and only retain the sentences which actually contain our target structures (l. 138–142). Next, we use the vader_df() function to annotate the sentiment of the remaining sentences (l. 143) and collapse the sentiment list into a data frame (l. 144).

137 text <- tokenize_sentence(corpus$body, remove_url = TRUE)
138 regex <- paste0(corpus$adj, " ", corpus$target_phrase)
139 sentences <- pbmclapply(1:length(text), function(x) text[[x]][stri_detect(tolower(text[[x]]), regex = regex[x])], mc.cores = 3)
140 corpus <- corpus[lengths(sentences)==1,]
141 sentences <- sentences[lengths(sentences)==1]
142 sentences <- unlist(sentences)
143 sentiment <- pbmclapply(sentences, vader_df, mc.cores = 3)
144 sentiment <- do.call(rbind, sentiment)

The process of extracting the sentiment scores of the adjectives preceding our target phrases, as well as of the target phrases themselves, is analogous to the extraction of syntactic target structures in Sect. 6.5.3.3. The sentiment annotation data contains the sentences (sentiment$text) and the sentiment scores of all tokens in the corresponding sentence (sentiment$word_scores). Thus, if we tokenize the sentences (l. 145) and split the sentiment scores (l. 146), this yields vectors of equal length.

49 For recent applications of sentiment analysis to philosophical corpus studies, see Baumgartner, 2022; Willemsen et al., 2023; Meylan and Reuter, ms; Reuter et al., ms.
50 Note that for other applications, you might want to choose another annotation level, e.g., tokens, n-grams, or complete texts.
51 For this, we use the functionality from the quanteda package (Benoit et al., 2021).
The idea is to paste the two vectors together and extract the target structures using regex. On l. 147–156, we specify the regex (regex_adj and regex_tphrase) and extract the corresponding token–sentiment-score compounds. The result is the data frame token_sentiment, which contains the sentiment scores of the adjectives (token_sentiment$adj_sent) as well as of the target phrases (token_sentiment$tphrase_sent), and a variable which checks that the token vector and the sentiment score vector have equal length (token_sentiment$check). We join this annotation data back to the corpus (l. 157) and filter out all instances where the aforementioned sanity check fails (l. 158).

145 toks <- strsplit(sentiment$text, "\\s+")
146 sents <- strsplit(sentiment$word_scores, "\\,\\s|\\{|\\}")
147 regex_adj <- paste0(corpus$adj, "__(-)?[0-9.]+(?=\\s", gsub("\\s", "__(-)?[0-9.]+\\\\s", corpus$target_phrase), "__(-)?[0-9.]+)")
148 regex_tphrase <- paste0(gsub("\\s", "__(-)?[0-9.]+\\\\s", corpus$target_phrase), "__(-)?[0-9.]+")
149 token_sentiment <- pbmclapply(1:length(toks), function(x){
150   body_sent_annot <- paste0(paste0(sub("[[:punct:]]+", "", toks[[x]]), "__", sents[[x]][-1]), collapse = " ")
151   adj_sent <- stri_extract(tolower(body_sent_annot), regex = regex_adj[x])
152   tphrase_sent <- stri_extract(tolower(body_sent_annot), regex = regex_tphrase[x])
153   check <- length(toks[[x]]) == length(sents[[x]][-1])
154   return(tibble(adj_sent, tphrase_sent, body_sent_annot, check))
155 }, mc.cores = 3)
156 token_sentiment <- do.call(rbind, token_sentiment)
157 corpus <- cbind(corpus, token_sentiment)
158 corpus <- filter(corpus, check)

The newly integrated sentiment scores are still concatenated with their corresponding tokens, e.g., crazy__-1.4. Hence, on l. 159–164, the scores are extracted and coerced to numeric values.
We also code the polarity of the adjective sentiment and calculate the sum of the word scores of the adjectives and the target phrases for each observation. With this, the annotation step is completed.

159  corpus <- mutate(corpus,
160    sentiment = as.numeric(stri_extract(adj_sent,
         regex = "(-)?[0-9.]+")),
161    sentiment_target = as.numeric(stri_extract(tphrase_sent,
         regex = "(-)?[0-9.]+")),
162    polarity = ifelse(sentiment >= 0, "positive", "negative"),
163    sent_aggr = sentiment + sentiment_target
164  )
165  save(corpus, file = "./output/annotation/complete_corpus.RDS",
       compress = "gzip")

6.5.4 How Crazy Are Conspiracy Theories?

At the end of Sect. 6.4, we raised several problems with using data from pre-built corpora in order to investigate the putative evaluative character of “conspiracy theory.” First, our pre-built corpora are either too small (e.g., COCA) or too specialized (e.g., NOW) to reliably reveal the ordinary use of “conspiracy theory.” Second, the coding of preceding adjectives into various categories by experts or non-experts is either limited (because not all adjectives are coded) or very resource-consuming, as well as subject to biases. Third, and connected to the other two points, the statistical analysis we did was rather unsatisfactory and not up to the standards of experimental science. With the collection of a large amount of data from Reddit and the sentiment annotation of the sentences containing “[adj] + [target term],” we are now in a position to correct for these limitations. Obviously, the improvement is mostly gradual, and it is almost always possible to get more data from better sources and use better annotation procedures. Nonetheless, we hope the following section will convince even skeptical readers of the enormous scientific and philosophical potential of corpus analysis that is done with self-built corpora.
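Before turning to the analysis, the core trick of the annotation step, pasting each token to its VADER word score and then pulling the scores back out with a regex (l. 150–152 and l. 159–161), can be replayed on a toy string. The sentence and scores below are invented for illustration; stri_extract() is the same stringi function used in the pipeline:

```r
library(stringi)  # stri_extract(), as used in the pipeline above

# toy annotated sentence: token__score compounds, as built on l. 150
body_sent_annot <- "a__0 crazy__-1.4 conspiracy__-1.2 theory__0"

# pull out the adjective-score compound, then the bare numeric score
adj_sent <- stri_extract(body_sent_annot, regex = "crazy__(-)?[0-9.]+")
adj_sent                                                   # "crazy__-1.4"
as.numeric(stri_extract(adj_sent, regex = "(-)?[0-9.]+"))  # -1.4
```

The same two-step extraction, first the compound, then the number, is what l. 151–152 and l. 160–161 perform across the whole corpus.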
6.5.4.1 Hypotheses

The hypotheses we used in the last section will no longer do, given our different approach in this section. As we are now dealing with sentiment scores, we investigate the following hypotheses:

Hypothesis 1 The term “conspiracy theory” has significantly lower adjectival sentiment scores than “theory.”

Hypothesis 2 The terms “conspiracy theory,” “conspiracy theories,” and “conspiracy theorist” each have significantly lower adjectival sentiment scores than “conspiracy.”

Hypothesis 3 The average adjectival sentiment scores for the terms “conspiracy theory,” “conspiracy theories,” and “conspiracy theorist” are each significantly below the midpoint of 0; the one for “theory” is significantly above the midpoint.

6.5.4.2 Descriptive Statistics

The first step in any analysis is to familiarize yourself with the data. Most often, what is meant by that is simply to look at different measures of the data distribution. The ones most readers might be familiar with include the average (or mean), median, standard error of the mean, number of observations, as well as skewness and kurtosis. In our case, we are mostly interested in the sentiment distribution of our different target phrases. After grouping the corpus by target phrase (l. 176), we can compute the different measures as follows (l.
177):52

172  library(e1071)
173  rm(list = ls()) # clear workspace
174  load("./output/annotation/complete_corpus.RDS")
175  corpus <- filter(corpus, !sentiment==0)
176  dfx <- group_by(corpus, target_phrase)
177  dfx <- dplyr::summarise(dfx, mean = mean(sentiment, na.rm = T),
       se = sd(sentiment)/sqrt(length(sentiment)),
       median = median(sentiment, na.rm = T),
       skewness = skewness(sentiment),
       kurtosis = kurtosis(sentiment), n = n())

> dfx
# A tibble: 5 x 7
  target_phrase          mean      se median skewness kurtosis     n
  <chr>                 <dbl>   <dbl>  <dbl>    <dbl>    <dbl> <int>
1 conspiracy          -0.0558 0.0168    -0.7   0.0584   -1.36  15032
2 conspiracy theories -1.26   0.0198    -1.5   1.15      0.937  5400
3 conspiracy theorist -1.03   0.0326    -1.4   0.894     0.478  1949
4 conspiracy theory   -0.504  0.0258    -1.3   0.414    -1.11   5049
5 theory               0.943  0.00780    1.7  -0.815    -0.181 44624

52 The functions skewness() and kurtosis() are provided by the e1071 package (Meyer et al., 2021; loaded on l. 172).

Fig. 6.8 Boxplot of the sentiment distribution for the different target phrases

The sentiment averages cluster relatively closely around the midpoint (x̄: −1.26 to 0.94). All target phrases except for “theory” have a right skew (skewness > 0), which means the distribution is skewed toward negative adjectival sentiment scores. This is what we expect according to Hypothesis 3. The target phrases “conspiracy,” “conspiracy theory,” and “theory” have negative kurtosis, which means that the distribution is flatter and more evenly dispersed; “conspiracy theories” and “conspiracy theorist,” on the other hand, have positive kurtosis, which means that their distribution is less dispersed than under the assumption of a normal distribution. These findings can be illustrated by a boxplot in Fig. 6.8. For the boxplot, we can use the ggplot2 package (Pedersen et al., 2020).53 On l. 182, we specify the data and the axes. Then, we add a horizontal dotted line (l. 183), the boxplot elements (l.
184), specify the axes’ names (l. 185), apply a predefined theme to the plot (l. 186), and finally change the alignment and orientation of the labels on the x-axis (l. 187).

181  library(ggplot2)
182  p <- ggplot(data = group_by(corpus, target_phrase),
       aes(y = sentiment, x = gsub("\\s", "\n", target_phrase))) +
183    geom_hline(aes(yintercept = 0), lty = "dotted") +
184    geom_boxplot(outlier.alpha = 0) +
185    labs(y = "Sentiment scores", x = "") +
186    theme_bw() +
187    theme(axis.text.x = element_text(angle = 45, hjust=1))
188  p
189  ggsave(p, file = "./output/plots/boxplot.png", width = 4,
       height = 3, dpi = 300)

53 For more resources on ggplot2, consult https://ggplot2.tidyverse.org/index.html.

It is also important to check whether the text data makes sense, qualitatively speaking. In our case, it is advised to review the adjectives most associated with each target phrase. For this, we first group the data by target phrase and adjective (l. 192), and then calculate the number of observations per pair of target phrase and adjective, as well as their average sentiment (l. 193). Then, we arrange the data by the number of observations (l. 194), retain the top 50 adjectives per target phrase (l. 195), and write out the list as CSV (l. 196). Table 6.4 shows the top 20 adjectives per target phrase.

192  topw <- group_by(corpus, target_phrase, adj)
193  topw <- dplyr::summarise(topw, n = n(),
       avg_sentiment = mean(sentiment))
194  topw <- arrange(topw, desc(n))
195  topw <- slice(topw, 1:50)
196  write_csv(topw, file = "./output/topw.csv")

6.5.4.3 Statistical Tests

The hypothesis testing boils down to a simple analysis of variance (ANOVA; also see Chap. 8). ANOVA has two basic assumptions: it assumes that the data follows a normal distribution and that the group variances are homogeneous. Based on the descriptive statistics calculated above, we expect that both assumptions are violated.
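Formal checks of these two assumptions take only a few lines. A minimal, self-contained sketch on invented data (this assumes the nortest and car packages are installed; the toy data merely stands in for the annotated corpus):

```r
library(nortest)  # provides ad.test(); assumed to be installed
library(car)      # provides leveneTest(); assumed to be installed

set.seed(1)
# invented stand-in for the sentiment data: two skewed groups
toy <- data.frame(
  sentiment = c(rexp(500) - 1, 0.5 - rexp(500)),
  group     = factor(rep(c("theory", "conspiracy theory"), each = 500))
)

ad.test(toy$sentiment)                     # Anderson-Darling: normality
leveneTest(sentiment ~ group, data = toy)  # Levene: equal variances
```

A small p-value in ad.test() speaks against normality; a small p-value in leveneTest() speaks against homogeneous variances.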
We also performed an Anderson-Darling test as well as Levene’s test. The Anderson-Darling test is a goodness-of-fit test which is typically used to determine whether the data follows a normal distribution, especially if the data contains a high number of observations. Levene’s test checks for the homogeneity of variances (homoscedasticity). Both tests are significant at the 0.001 alpha level. Since both assumptions for the ANOVA are violated, we should instead opt for a nonparametric alternative. In this case, we use Wilcoxon rank sum tests (using Bonferroni correction; also see Chap. 3).

Hypothesis 1 states that “conspiracy theory” has significantly lower sentiment values than “theory.” To test this, we need to subset the data and specify a one-sided two-sample Wilcoxon test as follows:

212  h1_data <- filter(corpus, target_phrase %in%
       c("conspiracy theory", "theory"))
213  h1_data$target_phrase <- as.factor(h1_data$target_phrase)
214  > levels(h1_data$target_phrase)
     [1] "conspiracy theory" "theory"
215  > wilcox.test(sentiment ~ target_phrase, data = h1_data,
       alternative = "less")

     Wilcoxon rank sum test with continuity correction

     data: sentiment by target_phrase
     W = 66408276, p-value < 2.2e-16
     alternative hypothesis: true location shift is less than 0

On l. 212, we subset the data, including only “conspiracy theory” and “theory,” and assign the subset to h1_data. The group variable, h1_data$target_phrase, then gets coerced to a factor (l. 213), with the group levels ordered as shown on l. 214. The Wilcoxon test is calculated on l. 215, where we specify the formula (y ~ x), the data (h1_data), and the alternative hypothesis, i.e., that the mean for “conspiracy theory” is less than the mean for “theory.” The alternative hypothesis follows the order of the factor levels, which is why it is important to check whether the order is correct (l. 213–214). The output of the test shows that the null hypothesis has to be rejected at the 0.001 alpha level (p-value < 2.2 × 10−16) in favor of the alternative hypothesis. In other words, Hypothesis 1 cannot be rejected.

Table 6.4 Comparison of the 20 most frequent terms occurring before the five target expressions

Conspiracy theory    Conspiracy theories   Conspiracy theorist   Conspiracy         Theory
Term        Number   Term        Number    Term        Number    Term       Number  Term        Number
Crazy          818   Crazy         1036    Crazy          756    Grand        3261  Interesting   8137
Good           367   Insane         630    Paranoid       124    Crazy        1418  Good          6616
Insane         295   Stupid         455    Insane         120    Criminal      954  Critical      2829
Stupid         270   Ridiculous     413    Racist          63    Huge          942  Popular       2055
Weird          240   Weird          260    Huge            57    Weird         568  Great         2031
Ridiculous     237   Dumb           174    Stupid          49    Best          557  Nice          1818
Favorite       229   Dangerous      161    Weird           43    Good          520  Legal         1368
Dumb           190   Idiotic        137    Lunatic         38    Great         432  Best          1357
Racist         179   Racist         135    True            38    Stupid        414  Cool          1280
Nice           147   Bizarre         88    Dangerous       35    Evil          373  Crazy         1241
Popular        105   Silly           88    Dumb            33    Seditious     335  Solid         1011
Silly           96   Paranoid        83    Good            30    Insane        299  Bad            939
Huge            71   Dumbass         82    Retarded        20    Dumb          266  Favorite       713
Interesting     70   Fake            60    Ignorant        19    True          223  Original       674
Grand           66   Popular         59    Dead            17    Racist        197  Stupid         656
Idiotic         60   Favorite        52    Silly           17    Ridiculous    177  Dumb           589
Paranoid        60   Good            51    Mad             16    Paranoid      163  Better         517
Bizarre         59   Dumbest         48    Avid            15    Favorite      161  Weird          510
Dumbest         58   Moronic         47    Moronic         15    Fake          157  Unified        447
Dangerous       57   Interesting     45    Prominent       12    Greatest      131  Ridiculous     357

Hypothesis 2 states that the terms “conspiracy theory,” “conspiracy theories,” and “conspiracy theorist” each have significantly lower adjectival sentiment scores than “conspiracy.” Hence, we have to subset the data by dropping the data for “theory” (l. 220), since it is not part of the hypothesis.
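The dependence of the one-sided test on the factor-level order can be verified on invented data (base R only; the scores and group names below are made up):

```r
# alternative = "less" tests whether the FIRST factor level tends to
# produce lower values than the second, so the level order matters.
set.seed(42)
toy <- data.frame(
  score = c(rnorm(100, mean = -1), rnorm(100, mean = 1)),
  group = factor(rep(c("negative term", "neutral term"), each = 100))
)

levels(toy$group)  # "negative term" comes first (alphabetical order)
wilcox.test(score ~ group, data = toy, alternative = "less")  # tiny p

# reversing the reference level flips the direction being tested
toy$group <- relevel(toy$group, ref = "neutral term")
wilcox.test(score ~ group, data = toy, alternative = "less")  # large p
```

The same formula and the same alternative give opposite conclusions once the levels are reordered, which is why checking levels() before testing is essential.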
Hypothesis 2 is one-sided, as it states a direction for the group differences, i.e., “conspiracy” has a significantly higher average adjectival sentiment score than the other terms. Thus, we perform a one-sided pairwise Wilcoxon test, which will test, for each group pair Y–X, whether X is lower than Y. For this, it is important that “conspiracy” be the first level (l. 221–222). The test is detailed on l. 223:

218  library(rstatix)
219  library(coin)
220  h2_data <- filter(corpus, !target_phrase == "theory")
221  h2_data$target_phrase <- as.factor(h2_data$target_phrase)
222  > levels(h2_data$target_phrase)
     [1] "conspiracy" "conspiracy theories" "conspiracy theorist"
         "conspiracy theory"
223  > pairwise.wilcox.test(x = h2_data$sentiment,
       g = h2_data$target_phrase, alternative = "less",
       p.adjust.method = "bonf")

     Pairwise comparisons using Wilcoxon rank sum test
     with continuity correction

     data: h2_data$sentiment and h2_data$target_phrase

                         conspiracy conspiracy theories conspiracy theorist
     conspiracy theories <2e-16
     conspiracy theorist <2e-16     1
     conspiracy theory   <2e-16     1                   1

     P value adjustment method: bonferroni

We are only interested in the leftmost column, where each row is compared to the column head (conspiracy) as to whether the row has significantly lower sentiment scores than the column head. The other pairs can be ignored for our purposes. As we can see, “conspiracy theory,” “conspiracy theories,” and “conspiracy theorist” each indeed have significantly lower adjectival sentiment scores than “conspiracy,” at the 0.001 alpha level (p-value < 2 × 10−16). Thus, Hypothesis 2 cannot be rejected. It is also advised to report the effect sizes (also see Chap. 2), i.e., the magnitude of group differences. This is because a big sample size often leads to significant group differences even while the effect sizes remain very small.
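A common effect size for rank tests is r = Z/√N. Before reaching for a package function, the idea can be replayed by hand on invented data (base R; Z is recovered from the two-sided p-value of the normal approximation, which is one standard way to obtain it):

```r
set.seed(7)
x <- rnorm(200, mean = -0.5)   # invented group 1
y <- rnorm(300, mean =  0.5)   # invented group 2

w <- wilcox.test(x, y)                  # two-sided, normal approximation
z <- abs(qnorm(w$p.value / 2))          # convert the p-value back to |Z|
r <- z / sqrt(length(x) + length(y))    # effect size r = Z / sqrt(N)
r   # with this separation, r typically lands in the "moderate" range
```

The usual rules of thumb read r around 0.1 as small, 0.3 as moderate, and 0.5 as large.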
Hence, for a comprehensive assessment of the differences, we should compute the effect sizes using the wilcox_effsize() function (from the rstatix package,54 Kassambara, 2021, loaded on l. 218):

224  > wilcox_effsize(data = h2_data,
       formula = sentiment ~ target_phrase,
       alternative = "less", p.adjust.method = "bonf")[1:3,]

     # A tibble: 3 x 7
       .y.       group1     group2              effsize    n1    n2 magnitude
       <chr>     <chr>      <chr>                 <dbl> <int> <int> <ord>
     1 sentiment conspiracy conspiracy theories  0.248  15032  5400 small
     2 sentiment conspiracy conspiracy theorist  0.125  15032  1949 small
     3 sentiment conspiracy conspiracy theory    0.0923 15032  5049 small

The effects are small (0.092–0.248). Hypothesis 3 states that the average adjectival sentiment scores for the terms “conspiracy theory,” “conspiracy theories,” and “conspiracy theorist” are each significantly below the midpoint (i.e., smaller than 0); the average value for “theory” is significantly above the midpoint. For this, we specify a one-sided test for each target phrase with the null hypothesis that the average sentiment is equal to 0:

227  wilcox.test(corpus$sentiment[corpus$target_phrase ==
       "conspiracy theory"], alternative = "less", mu = 0)
228  wilcox.test(corpus$sentiment[corpus$target_phrase ==
       "conspiracy theories"], alternative = "less", mu = 0)
229  wilcox.test(corpus$sentiment[corpus$target_phrase ==
       "conspiracy theorist"], alternative = "less", mu = 0)
230  wilcox.test(corpus$sentiment[corpus$target_phrase ==
       "theory"], alternative = "greater", mu = 0)

Every test is significant at the 0.001 alpha level. Thus, Hypothesis 3 cannot be rejected. To sum up, we cannot reject any of the three hypotheses.

54 The coin package (Hothorn et al., 2021), loaded on l. 219, is a dependency of the rstatix package and has to be loaded separately.
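The one-sample, one-sided form of the test used for Hypothesis 3 can be replayed on invented data (base R; the scores are simulated, not from the corpus):

```r
set.seed(3)
scores <- rnorm(50, mean = -0.8)  # invented stand-in for sentiment scores

# H0: the location is 0; H1: it lies below the midpoint
wilcox.test(scores, mu = 0, alternative = "less")
```

With the simulated scores centered well below 0, the test should come out clearly significant, mirroring the pattern for the three conspiracy-related phrases above.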
6.5.5 Discussion

In this section, we guided you through the various steps of a computational corpus analysis of the term “conspiracy theory.” We started building our own corpus by collecting data from Reddit via the Pushshift Reddit API. After compiling the corpus, we annotated it syntactically, extracted the target features, and annotated it with sentiment scores from VADER. The statistical analysis we ran on the data delivered the expected results. Not only does “conspiracy theory” seem to be a negative evaluative term; its negative evaluative function also cannot be explained through any of its component terms. Our conclusion can no longer be dismissed with the objection that our corpus is not representative of ordinary talk or that our database is too small. Furthermore, our analysis is less susceptible to biases in the coding of adjectives, as it relies on automated annotation by a sentiment analysis algorithm. Consequently, while our conclusion is similar to the conclusion we arrived at by the end of Sect. 6.4, we can now more confidently argue in favor of the thesis that “conspiracy theory” is indeed an evaluative term. That said, other objections might be forthcoming that need further collection of data or at least certain adaptations in the design of our corpus analysis.

6.6 Conclusion

Philosophers have begun applying corpus-analytic methods to a wide range of philosophical topics. The results of those studies have already had a significant impact on current debates in fields such as epistemology, philosophy of mind, and moral philosophy. At the same time, however, most philosophers have so far shied away from doing corpus analysis themselves. Arguably, the two main reasons for this are simply unawareness of how easy it is to access large corpora and a perceived difficulty in developing corpus-based hypotheses.
In this chapter, we have introduced easy-to-access corpora, highlighted similarities in the approaches researchers have used to illuminate philosophical problems via corpus-analytic tools, and run various corpus analyses for the term “conspiracy theory” on pre-built corpora as well as on our own corpus, which we compiled from Reddit comments. We hope many philosophers will follow in making the most of the existing tools as well as contributing to the development of this exciting field. There is no better time to start your own corpus study.

Appendix

This appendix lists the correct answers to the exercises from Sect. 6.2. Please note that COCA is a dynamic corpus and is expanded on a regular basis. Thus, the results you get might diverge from the results at the time this chapter was written (Summer 2022). Although the numbers might be different, the outcome should not be vastly different.

• Exercise 1: 39
• Exercise 2: The most frequent nouns appearing after “talking about” (insert <talking about NOUN>) are “people,” “things,” and “sex.” The most frequent adjectives appearing before “thinking” (insert <ADJ thinking>) are “critical,” “wishful,” and “creative.”
• Exercise 3: The chart function on COCA reveals that since the beginning of the millennium, the term “robot” has been used more frequently, but its use has not increased much within the last 20 years. While the term “conservative” saw a steady rise in its use until the 1980s, the frequency of the term “liberal” has decreased over the last 200 years.
• Exercise 4: The most common adjectival collocates of the term “happiness” are “true,” “human,” “personal,” “happy,” and “eternal.”
• Exercise 5: One interesting qualitative difference between “praising” and “blaming” is that “praising” seems to be more frequently used in a religious setting, whereas “blaming” is primarily used in worldly settings.
• Exercise 6: People seem to be saying that they are afraid of flying and heights, but not that they are frightened of flying and heights (don’t forget to put in NOUN in the collocates box).

References

Alfano, M. (2018). Digital humanities for history of philosophy. A case study on Nietzsche. In L. Levenberg, T. Neilson, & D. Rheams (Eds.), Research methods for the digital humanities (pp. 85–101). Palgrave Macmillan.
Andow, J. (2015). How “intuition” exploded. Metaphilosophy, 46, 189–212.
Basham, L., & Dentith, M. R. X. (2016). Social science’s conspiracy theory panic. Now they want to cure everyone. Social Epistemology Review and Reply Collective, 5, 12–19.
Baumgartner, L. (2022). Why are reckless socks not (more of) a thing? Towards an empirical classification of evaluative concepts. Proceedings of the Annual Meeting of the Cognitive Science Society, 44, 2869–2875.
Baumgartner, J., Zannettou, S., Keegan, B., Squire, M., Blackburn, J., & Io, P. (2020). The pushshift Reddit dataset. arXiv Preprint. https://arxiv.org/abs/2001.08435v1.
Baz, A. (2012). When words are called for. A defense of ordinary language philosophy. Harvard University Press.
Benoit, K., & Matsuo, A. (2020). spacyr. Wrapper to the “spaCy” “NLP” library. R package version 1.2.1.
Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., Matsuo, A., Lowe, W., & Müller, C. (2021). quanteda. Quantitative analysis of textual data. R package version 3.0.0.
Bluhm, R. (2016). Corpus analysis in philosophy. In M. Hinton (Ed.), Evidence, experiment and argument in linguistics and the philosophy of language (pp. 91–109). Peter Lang.
Butter, M. (2020). There’s a conspiracy theory that the CIA invented the term “conspiracy theory”—here’s why. The Conversation, March 16.
Cassam, Q. (2019). Conspiracy theories. Polity Press.
Chartrand, L. (2022). Modeling and corpus methods in experimental philosophy. Philosophy Compass.
Coady, D. (2003).
Conspiracy theories and official stories. International Journal of Applied Philosophy, 17, 197–209.
Cohnitz, D. (2018). On the rationality of conspiracy theories. Croatian Journal of Philosophy, 18, 351–365.
Davies, M. (2008). The Corpus of Contemporary American English (COCA). https://www.english-corpora.org/coca/.
Davies, M. (2010). The Corpus of Historical American English (COHA). https://www.english-corpora.org/coha/.
Davies, M. (2016). Corpus of News on the Web (NOW). https://www.english-corpora.org/now/.
deHaven-Smith, L. (2013). Conspiracy theory in America. University of Texas Press.
Feldman, S. (2011). Counterfact conspiracy theories. International Journal of Applied Philosophy, 25, 15–24.
Fischer, E., Engelhardt, P. E., & Herbelot, A. (2015). Intuitions and illusions. From explanation and experiment to assessment. In E. Fischer & J. Collins (Eds.), Experimental philosophy, rationalism, and naturalism (pp. 267–300). Routledge.
Gagolewski, M., & Tartanus, B. (2019). stringi. Fast and portable character string processing facilities. R package version 1.6.2.
Hansen, N., Porter, J. D., & Francis, K. (2021). A corpus study of “know.” On the verification of philosophers’ frequency claims about language. Episteme, 18, 242–268.
Harris, K. (2018). What’s epistemically wrong with conspiracy theorising? Royal Institute of Philosophy Supplements, 84, 235–257.
Hornik, K. (2019). openNLP. Apache OpenNLP tools interface. R package version 0.2-7.
Hothorn, T., Winell, H., Hornik, K., van de Wiel, M. A., & Zeileis, A. (2021). coin. Conditional inference procedures in a permutation test framework. R package version 1.4-2.
Hutto, C. J., & Gilbert, E. (2014). VADER. A parsimonious rule-based model for sentiment analysis of social media text. Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media, 8, 216–225.
Kassambara, A. (2021). rstatix. Pipe-friendly framework for basic statistical tests. R package version 0.7.0.
Keeley, B. L. (1999).
Of conspiracy theories. Journal of Philosophy, 96, 109–126.
Klein, C., Clutton, P., & Polito, V. (2018). Topic modeling reveals distinct interests within an online conspiracy forum. Frontiers in Psychology.
Knobe, J., & Prinz, J. (2008). Intuitions about consciousness. Experimental studies. Phenomenology and the Cognitive Sciences, 7, 67–83.
Kuang, K., Kong, Q., & Napolitano, F. (2019). pbmcapply. Tracking the progress of Mc*pply with progress bar. R package version 1.5.0.
McKenzie-McHarg, A. (2018). Conspiracy theory. The nineteenth-century prehistory of a twentieth-century concept. In J. E. Uscinski (Ed.), Conspiracy theories and the people who believe them (pp. 62–81). Oxford University Press.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F., Chang, C.-C., & Lin, C.-C. (2021). e1071. Misc functions of the department of statistics, Probability theory group (Formerly: E1071), TU Wien. R package version 1.7-7.
Meylan, A., & Reuter, K. (ms). On being ignorant.
Mizrahi, M. (2020). The case study method in philosophy of science. An empirical study. Perspectives on Science, 28, 63–88.
Napolitano, M. G., & Reuter, K. (2021). What is a conspiracy theory? Erkenntnis, 88, 2035–2062.
Nichols, S., & Ángel Pinillos, N. (2018). Skepticism and the acquisition of “knowledge.” Mind & Language, 33, 397–414.
Ooms, J., Lang, D. T., & Hilaiel, N. (2020). jsonlite. A simple and robust JSON parser and generator for R. R package version 1.7.2.
Pedersen, T. L., Wickham, H., Chang, W., Henry, L., Takahashi, K., Wilke, C., Woo, K., Yutani, H., & Dunnington, D. (2020). ggplot2. Create elegant data visualisations using the grammar of graphics. R package version 3.3.3.
Pigden, C. (2007). Conspiracy theories and the conventional wisdom. Episteme, 4, 219–232.
Räikkä, J. (2018). Conspiracies and conspiracy theories. Argumenta, 3, 205–216.
Reuter, K. (2011). Distinguishing the appearance from the reality of pain.
Journal of Consciousness Studies, 18, 94–109.
Reuter, K., & Baumgartner, L. (Forthcoming). Conspiracy theories are not theories: Time to rename conspiracy theories. In M. G. Isaac, S. Koch, & K. Scharp (Eds.), New perspectives on conceptual engineering (Volume 3: Applied conceptual engineering). Springer.
Reuter, K., Baumgartner, L., & Messerli, M. (ms). A corpus study on the normativity of rationality.
Reuter, K., Baumgartner, L., & Willemsen, P. (2024). Tracing thick and thin concepts through corpora. Language and Cognition, 16(2), 263–282.
Roehrick, K. (2020). vader. Valence aware dictionary and sEntiment reasoner (VADER). R package version 0.2.1.
Segura, A. J. (2019). utc. Coordinated universal time transformations. R package version 0.1.5.
Sytsma, J., Bluhm, R., Willemsen, P., & Reuter, K. (2019). Causal attributions and corpus analysis. In E. Fischer & M. Curtis (Eds.), Methodological advances in experimental philosophy (pp. 209–238).
Tobia, K. P. (2020). Testing ordinary meaning. Harvard Law Review, 134, 726–806.
Tsapos, M. (2024). Who is a conspiracy theorist? In The philosophy of conspiracy theories (pp. 50–59). Routledge.
Ulatowski, J., Weijers, D., Sytsma, J., & Allen, C. (2020). Cognitive science of philosophy symposium. Corpus analysis. The Brains Blog. https://philosophyofbrains.com/2020/12/15/cognitive-science-of-philosophy-symposium-corpus-analysis.aspx.
Wickham, H. (2020). plyr. Tools for splitting, applying and combining data. R package version 1.8.6.
Wickham, H. (2021). tidyverse. Easily install and load the “Tidyverse.” R package version 1.3.1.
Wijffels, J., Straka, M., & Straková, J. (2022). udpipe. Tokenization, parts of speech tagging, lemmatization and dependency parsing with the “UDPipe” “NLP” toolkit. R package version 0.8.9.
Willemsen, P., Baumgartner, L., Frohofer, S., & Reuter, K. (2023). Examining evaluativity in legal discourse. A comparative corpus-linguistic study of thick concepts. In S. Magen & K.
Prochownik (Eds.), Advances in experimental philosophy of law (pp. 192–214). Bloomsbury Academic.
Willemsen, P., Sommer Degn, S., Garcia Olier, J., & Reuter, K. (2024). “Wrongful discrimination”—a tautological claim? An empirical study of the evaluative dimension of discrimination vocabulary. In Proceedings of the Annual Meeting of the Cognitive Science Society.
Wright, J. C., Sedlock, T., West, J., Saulpaugh, K., & Hopkins, M. (2016). Located in the thin of it. Young children’s use of thin moral concepts. Journal of Moral Education, 45, 308–323.

Chapter 7

Corpus Analysis: Lexical Dispersion, Semantic Time Series, and Semantic Network Analysis—An R Studio Pipeline

Mark Alfano

Abstract This chapter introduces lexical dispersion analysis, time series analysis, and semantic network analysis. The case study in this chapter uses State of the Union addresses delivered yearly by American Presidents from George Washington to Donald Trump. Lexical dispersion refers to the embedding of words, stems, and n-grams across corpora. Time series analysis is useful when you have a corpus that was produced over the course of hours, days, years, or decades. It enables us to track the prevalence of various words, stems, and n-grams—along with the concepts they express—over time. Time series analysis reveals trends in these speeches across more than two centuries. Semantic network analysis is useful when you have a sufficiently large and diverse corpus to ask which words, stems, and n-grams—along with the concepts they express—tend to co-occur in the same sentence, paragraph, or other piece of text. This can help to shed light on the meaning of terms. These analytical approaches can also be combined to examine how patterns of association change over time.
Keywords Corpus · Lexical dispersion · Semantic network · R · Markdown · Dictionary

7.1 Introduction and Philosophical Background

In this chapter, we walk through a pipeline for doing corpus linguistics in the service of philosophical and other humanities research. The ultimate goal is to introduce you to semantic analysis of lexical dispersion, time series, and networks of collocations. In the previous chapter by Kevin Reuter and Lucien Baumgartner, you saw the power of both temporal and collocation analysis. The basic underlying idea behind collocation analysis was pithily articulated by Firth (1957), who said, “You shall know a word by the company that it keeps.” If two words are frequently collocated in a corpus, then presumably the author(s) of the corpus think or presuppose that there is an inferential, conceptual, or empirical connection between the objects or concepts that they express. In the previous chapter, you also saw evidence from collocation analysis that suggests that people think of conspiracy theories in a negative light and are concerned about efforts to spread or propagate them. And you saw evidence that how people have talked about (and presumably thought about) conspiracy theories has changed over the years. This chapter builds directly on the sorts of analyses you’ve just learned about using an integrated R, R Studio, and R Markdown pipeline. In particular, this chapter supplements the methods and analyses introduced in Chap. 6 and is, therefore, shorter and does not contain a full study. We hope that the additional methods introduced here will, nevertheless, be useful to many readers.

M. Alfano (✉) Philosophy Department, Macquarie University, Sydney, New South Wales, Australia; e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. S. Kornmesser et al., Experimental Philosophy for Beginners, Springer Graduate Texts in Philosophy, https://doi.org/10.1007/978-3-031-58049-9_7
We will be analyzing State of the Union addresses by American Presidents from George Washington to Donald Trump. These addresses are already digitized, along with some helpful metadata, enabling us to track terms within and across addresses, as well as patterns of collocation within addresses. Similar techniques can be applied to any other corpus, so the approach exemplified here is generalizable. These techniques can be used to address philosophical questions about non-philosophical texts, as well as philosophical questions about philosophical texts. In order to provide an easy introduction with few prerequisites, the hypotheses explored in the current chapter are relatively superficial and exploratory, but if you have more sophisticated questions and more specific hypotheses, the same methods can be used to shed light on them. This chapter proceeds step by step, in the same order in which you will perform the various tasks needed for these sorts of analyses. Let’s get started.

7.2 Question Development and Hypothesis Formation

Corpus linguistics enables us to shed light on philosophical questions by examining patterns in natural language use, as well as to use philosophical theories to evaluate changes and patterns in the real world. To do this, we need to start off by precisifying our questions and hypotheses. While it is possible to do purely exploratory research, in general, this sort of inquiry goes best when it is hypothesis-driven. One type of question worth asking is about patterns in the language use of a particular philosopher or group of philosophers. For example, we might wonder whether Hume’s use of “sympathy” is consistent across his corpus or whether Nietzsche’s use of “Macht” (power) is consistent across his corpus or even within a particular work.
Likewise, we might wonder how the usage of "free will" has evolved over the last century of published Anglophone philosophy.1

1 The qualifier "Anglophone" is necessary here because it is much easier to use these methods within a given language than it is to make cross-linguistic comparisons.

7 Corpus Analysis: Lexical Dispersion, Semantic Time Series, and Semantic. . .

To address these questions, we would need to formulate hypotheses in terms of frequencies and collocations of words, stems, and n-grams. Frequency refers to the percentage of a corpus that is composed of a given word. Some words (often called stop words) occur extremely frequently; these include the verb to be, pronouns, and articles. Content words occur relatively infrequently; typically, a given content word makes up less than 1% of a corpus. But relative frequencies can be informative. If you hypothesize that Hume became more interested in sympathy over the course of his philosophical career, you could test this hypothesis by examining whether the frequency of "sympathy" and its cognates increases from book to book. Frequencies can also be aggregated. If you hypothesize that Nietzsche became more interested in emotions and less interested in deontic concepts like rights and obligations over the course of his philosophical career (he did), you could test this by examining the frequencies of various words that refer to emotions and deontic concepts from book to book (Alfano, 2019). Combining words into groups in this way involves the creation of dictionaries, which I cover in more detail below. Bare frequencies can be revealing, but collocations add context. Collocation refers to the frequency with which a pair of words occur near one another—in the same sentence, in the same paragraph, on the same page, in the same book, or whatever window you prefer.
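Before we turn to collocations in detail, here is a minimal sketch of how the frequency comparisons just described (e.g., tracking "sympathy" from book to book) could be run in quanteda. The two toy documents are invented stand-ins for real books, not actual Hume text:

```r
library(quanteda)

# Two invented snippets standing in for successive books
corp <- corpus(c(book1 = "Sympathy moves us, and sympathy binds society together.",
                 book2 = "Reason alone cannot move us to act."))

dfmat <- dfm(tokens(corp, remove_punct = TRUE))

# Relative frequency of "sympathy" in each document (share of all tokens)
dfm_weight(dfmat, scheme = "prop")[, "sympathy"]
```

With real texts loaded as a corpus, the same two lines would show whether the relative frequency of a term rises or falls from document to document.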
Just as the friends one keeps reveal something about one's character, so the words that tend to co-occur with a word reveal something about its meaning in the corpus in question. This approach has been used by judges to help with the interpretation of legal documents, for instance, in the Utah State Supreme Court case Richards v. Cox.2 It is also possible to use collocations when analyzing philosophical texts. For instance, if we hypothesize that Beauvoir associates freedom with politics but not with psychology, we could test this hypothesis by seeing whether, in her writings, words that are clearly political in meaning tend to be collocated with "liberté" more than words that are clearly psychological in meaning. Beyond hypotheses about the history of philosophy, we might form hypotheses about overall lay usage or usage by specific communities. For instance, Frimer et al. (2019) have created a dictionary of words associated with moral foundations theory, which makes it possible to create profiles of various communities based on the text they produce. Likewise, Alfano et al. (2022) have created a dictionary of words associated with the seven elements of the morality-as-cooperation hypothesis, which they applied to the world's largest corpus of ethnographies to compare moral signatures around the globe and argue for a modest form of moral universalism. In a more contemporary and socially relevant project for the Australian Department of Justice and Community Safety, Roose et al. (2022) created custom dictionaries for hegemonic masculinity, subordinated masculinity, misogyny, and xenophobia to compare various masculinist groups operating in Australia and around the world. I show you how to conduct this sort of analysis in this chapter.

2 See Lee and Phillips (2019) for an explanation of this sort of "data-driven originalism." For an assessment of the ethical pitfalls of this approach to statutory and constitutional interpretation, see Alfano et al. (2021).
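To make the notion of a collocation window concrete, here is a small sketch using quanteda's feature co-occurrence matrix on an invented sentence pair; the window of two tokens on either side is an arbitrary choice for illustration:

```r
library(quanteda)

toks <- tokens(tolower("Freedom demands political action. Freedom resists quiet despair."),
               remove_punct = TRUE)

# Count how often each pair of words co-occurs within two tokens of each other
fcmat <- fcm(toks, context = "window", window = 2)
fcmat["freedom", ]
```

Widening the window (or setting context = "document") counts looser co-occurrences, which corresponds to choosing the sentence, paragraph, or whole-text windows mentioned above.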
7.3 Corpus Selection and Building

To develop good questions and form testable hypotheses, it's useful to know both what corpora already exist and what corpora could be built by, for instance, optical character recognition (OCR) or scraping of websites (for Web scraping, see Chap. 6). Let's consider some readily available corpora. There are specialized philosophical corpora such as the Nietzsche Source.3 There are also more generic collections of philosophical writing, such as those at www.philpapers.org and the Stanford Encyclopedia of Philosophy.4 Other corpora that you might want to explore include those produced by laypeople. For instance, it's possible to scrape websites such as Wikipedia and Reddit. In addition, there are nicely curated corpora of English known as the Corpus of Historical American English (COHA) and the Corpus of Contemporary American English (COCA), which were covered in the previous chapter. Corpus selection and construction should be tailored to the questions and hypotheses associated with the project at hand. For instance, research on folk conceptions of lust and romantic love might use profiles from online dating apps, such as the OKCupid dataset (Kirkegaard and Bjerrekær, 2014). By contrast, researchers interested in children's development and acquisition of concepts may prefer to consult the CHILDES database (MacWhinney, 2000), as Nichols and Pinillos (2018) did. Researchers interested in cross-cultural comparisons may wish to consult the Human Relations Area Files (HRAF, 1967), as Alfano et al. (2022) did. And researchers interested in social epistemology may use the application programming interface (API) of Twitter, as Sullivan et al. (2020a, b) did.

7.4 Corpus Analysis at the Level of the Word and Document

Let's conduct an analysis that begins with word prevalence, moves to time series and lexical dispersion, and terminates with semantic networks.
We will be exploring the corpus of State of the Union (SOTU) addresses by American presidents from George Washington to Donald Trump. This is a convenient corpus because it comes pre-built in quanteda.corpora, which you will install as part of this tutorial.5 The quanteda.corpora package supplements the quanteda natural language processing package. It contains not only SOTUs but also other corpora, such as amicus briefs to the US Supreme Court, US Presidential inaugural speeches, manifestos of various political parties in the United Kingdom, Irish budget documents, several years' worth of news coverage of immigration in the United Kingdom, and so on. The SOTU corpus is also valuable in other ways. It contains speeches from 1790 to the present. The speeches were delivered yearly. They are in different media (most were spoken, but some were written). They vary in length from approximately 1000 to 35,000 words. And they have already been annotated with some useful metadata, such as the political party of the president giving the address.

3 http://www.nietzschesource.org/.
4 https://plato.stanford.edu/.
5 Other corpora in this package include amicus briefs to the US Supreme Court, inaugural speeches by American presidents, and UK political party manifestos, among several others.

If you don't already have them installed, please install R, R Studio, and R Markdown. To install the first two, use the following links (and see Chap. 3):

• R: https://www.r-project.org/
• R Studio: https://www.rstudio.com/products/rstudio/download/

To install R Markdown, open R and enter the following into the command line:

install.packages("rmarkdown")

Throughout this chapter, lines of code will be represented in a monospace font to make it clear what's code and what's prose. Make sure your versions of R, R Studio, and R Markdown are up to date and that you have put all relevant files in your working directory.
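As a quick sanity check on your setup (a sketch using base R only), you can confirm which version of R you are running and where R is looking for files:

```r
R.version.string  # the installed version of R
getwd()           # the current working directory
list.files()      # the files R can see there
```

If a file you need later in this chapter does not show up in list.files(), it is not in your working directory.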
Next, you need to install the packages to be used in this tutorial by entering and running the following code in the command line of R Studio:

install.packages("devtools")
library(devtools)
devtools::install_github("quanteda/quanteda.corpora")
devtools::install_github("kbenoit/LIWCalike")
install.packages(c("readtext", "quanteda", "quanteda.textplots",
                   "quanteda.corpora", "tidyr", "ggplot2",
                   "network", "igraph", "quanteda.textstats"))

Documentation of these packages can be found at https://cran.r-project.org/ or simply by googling their names. We will not be using all of the functions in any of these packages in this tutorial. As functions come up, they will be explained. Next, we get started by loading the required packages. To do this, run the first main chunk of code in the R Markdown script provided:

require(readtext)
require(LIWCalike)
require(quanteda)
require(quanteda.textplots)
require(quanteda.textstats)
require(quanteda.corpora)
require(tidyr)
require(ggplot2)
require(network)
require(igraph)

We can now have a peek at the first words of the first speeches in the State of the Union address corpus:

head(data_corpus_sotu)

As you'll see, Washington started all of his speeches with the same phrase, "Fellow-Citizens of the Senate and House of Representatives."

> head(data_corpus_sotu)
Corpus consisting of 6 documents and 6 docvars.
Washington-1790 :
Fellow-Citizens of the Senate and House of Representatives: ...
Washington-1790b :
Fellow-Citizens of the Senate and House of Representatives: ...
Washington-1791 :
Fellow-Citizens of the Senate and House of Representatives: ...

To make sure we don't have accidental duplicates, let's look a bit further at the first two speeches (Fig. 7.1):

texts(data_corpus_sotu)[1]
texts(data_corpus_sotu)[2]

As you'll see, these are in fact different speeches, so we can feel assured in moving forward with the analysis.
Fig. 7.1 Initial words from Washington's first two SOTUs

Let's get a quick summary of the corpus (Fig. 7.2):

summary(data_corpus_sotu)

Fig. 7.2 Summary of the SOTU corpus

Washington's first speech was just 24 sentences long and contained only 1167 words (tokens)! The metadata, such as party ID, is nice to have, but the format is slightly annoying because the Date column includes not just the year but also the month and day. Let's split that column into three so that we can plot some yearly comparisons. We also have to coerce the new Year column to be numeric because otherwise R thinks it's a string variable.

tokenInfo <- summary(data_corpus_sotu)
tokenInfo <- separate(tokenInfo, Date, c('Year', 'Month', 'Day'))
tokenInfo$Year <- as.numeric(as.character(tokenInfo$Year))

The separate function splits a column (in this case, the Date column) into two or more columns. Given the way this column is formatted in our dataset, we want to split it into three columns representing the year, month, and day on which the SOTU was delivered. The analysis in this chapter focuses only on the year, since SOTUs are delivered annually and the month and day do not differ much from year to year. We now have an object named tokenInfo that summarizes the first hundred speeches in the corpus. If you click on it in the upper-right panel in R Studio, you can have a look at the details (Fig. 7.3).

Fig. 7.3 Summary of the SOTU corpus

Fig. 7.4 Comparison of the first 100 SOTU word counts by party

We can then visualize this summary (Fig. 7.4):
ggplot(data = tokenInfo, aes(x = Year, y = Tokens, group = 1, color = party)) +
  geom_line() + geom_point() +
  scale_x_continuous(labels = c(seq(1789, 2017, 12)),
                     breaks = seq(1789, 2017, 12)) +
  theme_bw()

This code produces a figure using the ggplot function, which is one of the most versatile functions in R for creating a wide variety of visualizations. For a full introduction to data visualization in R, see Healy (2019). To use this function, you need to tell it what data to plot (data = tokenInfo) and specify an aesthetic (aes). This typically involves telling R what the independent variable is (x = Year), what the dependent variable is (y = Tokens), and any other relevant information (e.g., color = party). You can play around with different versions of this plot by, for instance, removing + geom_line(), which will eliminate the line connecting the dots, or removing + geom_point(), which will leave the line but remove the dots. There is rarely one uniquely correct way to visualize your data. Instead, different visualizations reveal and emphasize different aspects of the data. For this reason, it's typically useful to try half a dozen different versions of a visualization before settling on one that seems to be most informative without being confusing or misleading. As you can see, the word count of addresses steadily rose over the years, and the Democrats of the nineteenth century were more long-winded than their Republican opponents. To move beyond this summary and exploratory analysis, we need to convert the corpus into a document-feature matrix (dfm), remove punctuation and English stopwords, and trim rare words to make it computationally easier to process (Fig. 7.5):
dfmat_sotu <- dfm(corpus_subset(data_corpus_sotu),
                  remove_punct = TRUE,
                  remove = stopwords("english"))
dfmat_sotu <- dfm_trim(dfmat_sotu, min_termfreq = 5, min_docfreq = 3)
head(dfmat_sotu)

A dfm is essentially just a list of all the words in a corpus, along with their frequency per document. Using the dfm function, we can create one of these from a corpus or a subset of a corpus (corpus_subset—in this case, the subset is the whole corpus, but we could take just some years or just speeches by members of a certain party). When you build a dfm, you have the option to remove punctuation and stopwords, which we are doing in this case. Depending on the question you want to address, it might be better to keep punctuation, or stopwords, or both. Stopwords are short, very common words such as articles, prepositions, and pronouns. They are also sometimes called function words. As Pennebaker (2011) and others have shown, you can learn a lot about someone from their choice of function words. For instance, people who disproportionately use first-person singular pronouns ("I," "me," "my") tend to be depressed and low status, whereas people who disproportionately use first-person plural pronouns ("we," "us," "our") tend to be empowered and high status. So, removing stopwords is not always advisable, but for the sorts of analyses we are doing here, it generally is. That's because we are about to ask what the most common words in the dfm are, and you can rest assured they would all be stopwords if we didn't remove them.

Fig. 7.5 Head of the dfmat of the SOTU corpus

Fig. 7.6 Summary of the SOTU corpus

Dfm objects get to be very large very quickly because they contain a column for every unique token (word) in the corpus. Many tokens only occur a handful of times and can be discarded with little loss of information.
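To see what trimming does before applying it to the full corpus, here is a toy sketch; the two one-line documents are invented:

```r
library(quanteda)

dfmat_toy <- dfm(tokens(c(d1 = "apple apple banana", d2 = "apple cherry")))

# Keep only features occurring at least twice in total and appearing in at
# least two documents; here only "apple" meets both thresholds
dfm_trim(dfmat_toy, min_termfreq = 2, min_docfreq = 2)
```

The same two thresholds, scaled up, are what the dfm_trim call on the SOTU corpus applies.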
In this instance, I decided to throw out, using the dfm_trim function, all tokens that don't occur at least five times across all SOTUs, as well as all tokens that don't appear in at least three SOTUs. Using the head function to peek at the first few rows of the dfm object, we can see that Washington uses "great" four times each in his first and second speeches but not again until 1794, when he uses it once. Let's have a look at the most common tokens in this corpus (Fig. 7.6). You can change the number to get a shorter or longer list:

topfeatures(dfmat_sotu, 20)

As you'll see, the most common word across all speeches is "government," which occurs 7444 times. In addition, "states" and "united" are quite common. If we wanted to, we could fuse all instances of the bigram "United States" in the corpus object into "UnitedStates" before creating the dfmat to disambiguate uses. Another way to picture the top features of the dfmat is with a word cloud (Fig. 7.7):

set.seed(100)
textplot_wordcloud(dfmat_sotu, min_count = 6,
                   random_order = FALSE,
                   rotation = .25,
                   color = RColorBrewer::brewer.pal(8, "Dark2"))

This outputs a figure that could be added to a paper or a slide deck. The textplot_wordcloud function takes a dfm object and a seed number (in this case 100; though the figure looks pretty similar with other seeds) as inputs and produces a word cloud as output. The size and centrality of a term in the word cloud represent its frequency in the dfm. You can play around with the rotation and color to get a variety of visualizations of the same data. The color palette in this case is the Dark2 palette from Color Brewer (see https://colorbrewer2.org for a range of further options).

Fig. 7.7 Word cloud of all SOTU addresses
Fig. 7.8 Top features of Washington's first two SOTU addresses

This is somewhat informative, but it collapses across all speeches by all presidents in all parties. There are probably some interesting differences between them that get lost in the cloud. To get a sense of that, we can ask for the top ten features of each speech (from most frequent to least frequent), rather than the top twenty features of the total combined corpus (Fig. 7.8):

topfeatures(dfmat_sotu, 10, decreasing = TRUE,
            groups = docnames(dfmat_sotu))

Note, for instance, that the top words in Washington's first SOTU were "public," "may," and "great," whereas the top words in his second SOTU were "us," "made," and "may." If you explore the top features of other addresses, you'll see that the top words in Franklin Delano Roosevelt's 1935 address were "work," "government," and "relief." By contrast, the top words in Donald Trump's 2017 speech were "American," "America," and "country." He really did put America first! Next, we can compare the speeches not just based on the most common words in them but based on everything they contain other than punctuation and stopwords. One way to do this is through hierarchical clustering:

tstat_dist <- textstat_dist(dfm_weight(dfmat_sotu, scheme = "prop"))
sotu_cluster <- hclust(as.dist(tstat_dist))
sotu_cluster$labels <- docnames(dfmat_sotu)
plot(sotu_cluster, xlab = "", sub = "",
     main = "Euclidean Distance on Normalized Token Frequency")

The dfm_weight function weights the dfm object by frequency, and the weighted version is then fed into the textstat_dist function, which computes the distance between features based on their weights. The resulting matrix of distances can then be clustered using the hclust function and plotted as a tree diagram. The figure produced by this code is far too wide to be printed in a textbook, but a large image file is in the supplementary materials online.
It reveals that the clusters are largely temporal rather than based on party. Almost every speech from the Great Depression onward falls within the same cluster. Trump's speeches are most similar to those by Dwight Eisenhower, Harry Truman, and George W. Bush. The pre-Depression speeches also largely cluster by year, rather than party, though there are some exceptions. Have a look at the figure to see for yourself. If we restrict ourselves just to speeches from 1980 onward, we can see some interesting ideological patterns:

pres_dfm <- tokens(corpus_subset(data_corpus_sotu, Date > "1980-01-01"),
                   remove_punct = TRUE) %>%
  tokens_wordstem("en") %>%
  tokens_remove(stopwords("en")) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 5, min_docfreq = 3)

# hierarchical clustering - get distances on normalized dfm
pres_dist_mat <- dfm_weight(pres_dfm, scheme = "prop") %>%
  textstat_dist(method = "euclidean") %>%
  as.dist()

# hierarchical clustering of the distance object
pres_cluster <- hclust(pres_dist_mat)

# label with document names
pres_cluster$labels <- docnames(pres_dfm)

# plot as a dendrogram
plot(pres_cluster, xlab = "", sub = "",
     main = "Euclidean Distance on Normalized Token Frequency")

Fig. 7.9 Hierarchical clustering of SOTUs from 1980 to the present

As Fig. 7.9 shows, within a political epoch (essentially, the end of the Cold War to the present), ideological differences do emerge in the language used by Presidents in their annual addresses. In this plot, Trump is most similar to George W. Bush, and then to Reagan and the elder Bush. By contrast, Clinton and Obama used language in similar ways.
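The tokens_wordstem step in the pipeline above collapses inflected forms of a word onto a common stem before the dfm is built; a quick illustration on an invented phrase:

```r
library(quanteda)

toks <- tokens("governing governments governed")
# All three forms are reduced to a shared stem, so the dfm counts them together
tokens_wordstem(toks, language = "en")
```

Stemming is useful for clustering because it prevents minor inflectional differences from registering as distinct features.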
We can also plot the early SOTU speeches by fiddling with the Date parameter:

pres_dfm <- tokens(corpus_subset(data_corpus_sotu, Date < "1830-01-01"),
                   remove_punct = TRUE) %>%
  tokens_wordstem("en") %>%
  tokens_remove(stopwords("en")) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 5, min_docfreq = 3)

# hierarchical clustering - get distances on normalized dfm
pres_dist_mat <- dfm_weight(pres_dfm, scheme = "prop") %>%
  textstat_dist(method = "euclidean") %>%
  as.dist()

# hierarchical clustering of the distance object
pres_cluster <- hclust(pres_dist_mat)

# label with document names
pres_cluster$labels <- docnames(pres_dfm)

# plot as a dendrogram
plot(pres_cluster, xlab = "", sub = "",
     main = "Euclidean Distance on Normalized Token Frequency")

Fig. 7.10 Hierarchical clustering of early SOTUs

As Fig. 7.10 shows, the two main clusters here are anchored by Jefferson, on the one hand, and Washington and Monroe, on the other. With these general impressions in hand, we can look in more fine-grained detail at what makes each speech distinctive by plotting comparison word clouds. These do not represent overall frequency. Instead, they show what is most distinctive (relatively high in frequency compared to the other documents in the comparison). Let's have a look at Washington's eight SOTUs (eight is the largest number of documents that quanteda can currently compare in this way; see Fig. 7.11):

dfmat_sotu_washington <- dfmat_sotu[1:8,]
textplot_wordcloud(dfmat_sotu_washington, max_size = 2,
                   labelsize = .5, comparison = TRUE, labeloffset = .05)

Fig. 7.11 Comparison word clouds of Washington's eight SOTUs

Here we subset our dfm object by taking only its first eight rows (since Washington was the first president and delivered eight SOTUs). Then we produce a word cloud comparing them using the comparison = TRUE argument.
You can play around with the size of the terms in the word cloud and the size of the labels of the speeches—as well as their placement—by fiddling with max_size, labelsize, and labeloffset. As Fig. 7.11 illustrates, there are distinctive terms in these eight speeches. For instance, in 1794, Washington emphasizes the state of Pennsylvania and the terms "insurrection" and "militia." This is clearly in reference to the ongoing Whiskey Rebellion in Pennsylvania and elsewhere in Appalachia, which Washington was preparing to respond to with military force. We can also compare the last four addresses of Barack Obama's presidency with the four delivered by Trump:

dfmat_sotu_obamatrump <- dfmat_sotu[234:241,]
textplot_wordcloud(dfmat_sotu_obamatrump, max_size = 2,
                   labelsize = .5, comparison = TRUE, labeloffset = .05)

As Fig. 7.12 illustrates, Obama focused much more on topics like the economy and education, whereas Trump focused on immigration, terrorism, war, and crime. Thus far, we have been doing purely descriptive and exploratory analyses. We might also try testing a few (simple-minded) hypotheses. For instance: plausibly, the name of a country is more likely to show up in a SOTU when the United States is currently in a hot conflict with that country (since the United States doesn't officially declare war anymore, we have to go with the looser notion of hot conflict). If this were true, we would expect "Iran" and its cognates such as "Iranian" and "Iranians" to show up more frequently in Carter's speeches, and "Iraq" and its cognates to show up most frequently in the speeches by the Bushes.
We can test these hypotheses using lexical dispersion plots:

corp <- corpus_subset(data_corpus_sotu, Date > "1980-01-01")
# compare multiple documents
textplot_xray(kwic(corp, pattern = "iran*"), scale = "absolute")

corp <- corpus_subset(data_corpus_sotu, Date > "1980-01-01")
# compare multiple documents
textplot_xray(kwic(corp, pattern = "iraq*"), scale = "absolute")

A lexical dispersion plot shows each instance of a token (either a word, a stem, or an n-gram) in a corpus. We build these plots using the kwic function, which stands for "keyword in context," restricting the SOTU dates to everything post-1980. The white bar associated with each speech indicates the length of the speech, showing that Carter's were much longer than those of his successors. Each vertical line in a given corpus indicates the presence, at that point in the corpus, of the keyword in question.

Fig. 7.12 Comparison word clouds of Obama vs. Trump

As Figs. 7.13 and 7.14 show, words beginning with "Iran" (e.g., "Iran," "Iranian," "Iranians") tend to be used only toward the end of SOTUs, with a few exceptions such as Jimmy Carter's 1980 address. Our hypotheses receive only moderate support from this analysis. (Note that years in which the relevant term is not used are simply left out of the plot, so, for instance, most of the Reagan speeches did not mention Iran or Iraq.) Carter mentioned Iran multiple times in his lengthy speeches, but there are mentions of Iran by both George W. Bush and Barack Obama that our hypothesis cannot account for. Presumably, these were about the potential for conflict related to Iran's nuclear program, which was briefly put on hold thanks to a deal struck by the Obama administration in 2015 and subsequently terminated by the Trump administration a few years later.
Fig. 7.13 Lexical dispersion of "Iran" and cognates in SOTUs from 1980 to the present

Bush Sr. did mention Iraq in his SOTUs of 1991 and 1992, but only a few times, contrary to our hypothesis. By contrast, Bush Jr. conspicuously does not mention Iraq in his 2001 speech, which occurred months before the terrorist attacks of 9/11, and then peppers his later addresses with mentions of Iraq.

7.5 Use of Pre-built and Custom Dictionaries

Thus far, we have been focusing on single words in documents, but, of course, multiple words can express the same or similar concepts. Sometimes, it is helpful to use dictionaries to cluster words together in order to examine broader patterns. We can do this using liwcalike. The liwcalike function imitates and expands the popular Linguistic Inquiry and Word Count (LIWC) application.6 LIWC was originally developed by James Pennebaker, a psychologist at the University of Texas at Austin, and was most recently updated in 2022.7 It works by counting the number of words belonging to various categories in a text. For instance, LIWC combs through a text to examine how many first-person singular pronouns (e.g., "I," "me," "my"), how many first-person plural pronouns (e.g., "we," "us," "our"), and how many articles (e.g., "a," "an," "the") are used per 100 words. Beyond these so-called function words, LIWC has dictionaries for words that indicate complexity of thinking (e.g., "nevertheless," "whereas," "but"), asking questions (e.g., "who," "what," "why"), and a range of psychological processes that includes positive and negative emotions, discrete emotions such as anxiety, anger, and sadness, and drives for affiliation, achievement, power, reward, and risk.

6 A standalone version of LIWC is available at https://www.liwc.app/.

Fig. 7.14 Lexical dispersion of "Iraq" and cognates in SOTUs from 1980 to the present
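The counting mechanism just described can be mimicked in quanteda with a small dictionary. The two categories and their word lists below are invented for illustration and are not actual LIWC categories:

```r
library(quanteda)

# Tiny made-up dictionary: each category is just a list of words
dict <- dictionary(list(anxiety = c("worried", "afraid", "nervous"),
                        positive = c("happy", "hope", "good")))

toks <- tokens("We are worried, but we still have hope.")

# Count how many tokens in the text fall under each category
dfm_lookup(dfm(toks), dictionary = dict)
```

This is, in miniature, what liwcalike does across all of its categories at once, with the counts normalized per 100 words.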
Over the last few decades, Pennebaker and his collaborators have shown that this seemingly simple method can reveal quite a bit about individuals, their relationships, and the groups to which they belong. His main findings are summarized for a lay audience in Pennebaker (2011). For instance, Pennebaker and his colleagues have done consulting work for American police and Homeland Security to try to predict, using LIWC, the likelihood that various groups will engage in violence to advance their causes. Independent researchers have also used LIWC to study the language used by Islamist terrorist groups (Vergani and Bliuc, 2018; Smith, 2013) and masculinist movements in Australia and other parts of the Anglophone world (Roose et al., 2022).

7 See Pennebaker (2011) and Pennebaker et al. (2015) for more details.

The quanteda package does not come pre-loaded with all of the dictionaries developed by Pennebaker's team, but they and the other dictionaries used in this chapter are available at the website associated with this book. For the following, we first download the file liwcdict.dic and put it in a folder that we set as our current working directory. This can be done by clicking on the "Session" menu in R Studio and selecting "Set Working Directory" and then "Choose Directory." Now, let's first run the base LIWC dictionaries on the SOTU corpus:

liwcdic <- dictionary(file = "liwcdict.dic")
LIWCanalysis <- liwcalike(data_corpus_sotu, liwcdic)
head(LIWCanalysis)

This code loads the LIWC dictionary, then runs liwcalike using that dictionary on the SOTU corpus. If you click on the object LIWCanalysis, you can see the output (Fig. 7.15). The rows are the SOTU addresses, and the columns represent various categories operationalized by LIWC dictionaries. For instance, we see that 9.7% of Washington's first SOTU is pronouns.
There are also outputs for word count (WC), mean sentence length, and so on. There are dozens of dictionaries in base LIWC, so obviously we don't have the space to explore all of them in this chapter. Let's first clean up the liwcalike output by separating the Presidents from the years of their speeches, giving special attention to Martin Van Buren given his double surname, and coerce some of the output columns to numeric so that they can be plotted (for some reason, liwcalike outputs its numeric results as strings):

Fig. 7.15 Sample liwcalike output for the SOTU corpus

LIWCanalysis <- separate(LIWCanalysis, docname, c('president', 'year'))
LIWCanalysis$president <- replace(LIWCanalysis$president,
                                  LIWCanalysis$president == "van", "vanBuren")
LIWCanalysis$year <- replace(LIWCanalysis$year,
                             LIWCanalysis$year == "Buren", 1838)

LIWCanalysis$year <- as.numeric(as.character(LIWCanalysis$year))
LIWCanalysis$Sixltr <- as.numeric(as.character(LIWCanalysis$Sixltr))
LIWCanalysis$Posemo <- as.numeric(as.character(LIWCanalysis$Posemo))
LIWCanalysis$Anx <- as.numeric(as.character(LIWCanalysis$Anx))
LIWCanalysis$Anger <- as.numeric(as.character(LIWCanalysis$Anger))
LIWCanalysis$Sad <- as.numeric(as.character(LIWCanalysis$Sad))

Now we are in a position to examine general and specific features of the corpus. For instance, we might wonder whether Presidents have tended to use longer words and longer sentences as the years go by. Given improvements in general education in the United States during this period, that is not implausible. However, early on, the Presidents didn't need to address the general populace, since women and slaves (among others) could not vote. This might suggest that they had a more elite audience, and that as the franchise expanded and SOTUs were broadcast first on the radio and then on TV, they would feel pressure to use shorter words and sentences.
We can pit these hypotheses against each other by examining the time series for long words and sentence length:

ggplot(data = LIWCanalysis, aes(x = year, y = Sixltr)) +
  geom_line() + geom_point() +
  theme(axis.text.x = element_text(angle = 90))

ggplot(data = LIWCanalysis, aes(x = year, y = WPS)) +
  geom_line() + geom_point() +
  theme(axis.text.x = element_text(angle = 90))

As Figs. 7.16 and 7.17 illustrate, both word length and sentence length have tended to drop in SOTUs. The pattern is clearer for sentence length, and there are certainly some notable exceptions when it comes to word length.

Fig. 7.16 Percentage of words six letters or longer in SOTUs over time

Fig. 7.17 Mean sentence length of SOTUs over time

Let's test another hypothesis, this time about emotion and war. LIWC has dictionaries for anxiety, anger, and sadness. One might expect that a president would be especially likely to express these emotions during large-scale armed conflict. Let's investigate these hypotheses:

ggplot(data = LIWCanalysis, aes(x = year, y = Anx)) +
  geom_line() + geom_point() +
  theme(axis.text.x = element_text(angle = 90))

ggplot(data = LIWCanalysis, aes(x = year, y = Anger)) +
  geom_line() + geom_point() +
  theme(axis.text.x = element_text(angle = 90))

ggplot(data = LIWCanalysis, aes(x = year, y = Sad)) +
  geom_line() + geom_point() +
  theme(axis.text.x = element_text(angle = 90))

Fig. 7.18 Anxiety in SOTUs across time

As Figs. 7.18, 7.19, and 7.20 show, there is some support for these hypotheses. Anxiety spiked especially after the 9/11 terror attacks. Anger spiked especially around the War of 1812, World War II, and the response to the 9/11 terror attacks. Sadness shows more variance but does spike around World War II.
One useful functionality of LIWC is the ability to create and share custom dictionaries for categories of interest. A popular framework for understanding and comparing the moral values of individuals and groups is morality-as-cooperation, which has been operationalized with a dictionary by Alfano et al. (2022).[8] This theory posits, on the basis of evolutionary game theory, that there are seven main elements of morality: family values, reciprocity, group solidarity, heroism, deference to authority, fair distribution of resources, and property rights.

[8] Morality-as-cooperation is a competitor to moral foundations theory (Haidt, 2013), which has also been operationalized with a dictionary (Frimer et al., 2019).

[Fig. 7.19 Anger in SOTUs across time]
[Fig. 7.20 Sadness in SOTUs across time]

Let's return to the State of the Union addresses. This time, we will analyze them not with the LIWC2015 dictionaries but with the morality-as-cooperation dictionaries. Download the file macdvirtue.dic and put it into your working directory. Then run the following code in R Studio:

```r
macdvirtue <- dictionary(file = "macdvirtue.dic")
LIWCanalysis_mac <- liwcalike(data_corpus_sotu, macdvirtue)
```

Instead of running the base LIWC dictionaries, we are now running the dictionaries for morality-as-cooperation. There are seven dictionaries in macdvirtue.dic, one operationalizing each of the seven moral values. The file can be opened in R Studio (by clicking on the object in the top-right panel) or in a simple text editor. There, you can see which words are associated with each construct. The development of this dictionary is documented in detail by Alfano et al. (2022).
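For orientation, a LIWC-format .dic file is plain text: a header block between two % lines assigns a number to each category, and the remaining lines map words (with * as an optional wildcard) to category numbers. The entries below are purely illustrative examples of the format, not the actual contents of macdvirtue.dic:

```
%
1	Family
2	Reciprocity
%
brother	1
kin	1
repay	2
trade*	2
```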
Next, we clean up the output as before:

```r
LIWCanalysis_mac <- separate(LIWCanalysis_mac, docname, c('president', 'year'))
LIWCanalysis_mac$president <- replace(LIWCanalysis_mac$president,
                                      LIWCanalysis_mac$president == "van",
                                      "vanBuren")
LIWCanalysis_mac$year <- replace(LIWCanalysis_mac$year,
                                 LIWCanalysis_mac$year == "Buren", 1838)

LIWCanalysis_mac$year        <- as.numeric(as.character(LIWCanalysis_mac$year))
LIWCanalysis_mac$Sixltr      <- as.numeric(as.character(LIWCanalysis_mac$Sixltr))
LIWCanalysis_mac$Family      <- as.numeric(as.character(LIWCanalysis_mac$Family))
LIWCanalysis_mac$Group       <- as.numeric(as.character(LIWCanalysis_mac$Group))
LIWCanalysis_mac$Reciprocity <- as.numeric(as.character(LIWCanalysis_mac$Reciprocity))
LIWCanalysis_mac$Heroism     <- as.numeric(as.character(LIWCanalysis_mac$Heroism))
LIWCanalysis_mac$Deference   <- as.numeric(as.character(LIWCanalysis_mac$Deference))
LIWCanalysis_mac$Fairness    <- as.numeric(as.character(LIWCanalysis_mac$Fairness))
LIWCanalysis_mac$Property    <- as.numeric(as.character(LIWCanalysis_mac$Property))
```

Then we can plot time series, as before, but this time for the seven moral values posited by morality-as-cooperation:

```r
ggplot(data = LIWCanalysis_mac, aes(x = year, y = Family)) + geom_line() +
  geom_point() + theme(axis.text.x = element_text(angle = 90))

ggplot(data = LIWCanalysis_mac, aes(x = year, y = Reciprocity)) + geom_line() +
  geom_point() + theme(axis.text.x = element_text(angle = 90))

ggplot(data = LIWCanalysis_mac, aes(x = year, y = Group)) + geom_line() +
  geom_point() + theme(axis.text.x = element_text(angle = 90))
```
```r
ggplot(data = LIWCanalysis_mac, aes(x = year, y = Heroism)) + geom_line() +
  geom_point() + theme(axis.text.x = element_text(angle = 90))

ggplot(data = LIWCanalysis_mac, aes(x = year, y = Deference)) + geom_line() +
  geom_point() + theme(axis.text.x = element_text(angle = 90))

ggplot(data = LIWCanalysis_mac, aes(x = year, y = Fairness)) + geom_line() +
  geom_point() + theme(axis.text.x = element_text(angle = 90))

ggplot(data = LIWCanalysis_mac, aes(x = year, y = Property)) + geom_line() +
  geom_point() + theme(axis.text.x = element_text(angle = 90))
```

As Figs. 7.21, 7.22, 7.23, 7.24, 7.25, 7.26, and 7.27 show, emphases on different moral values have shifted over time in SOTUs. In particular, we note the steep rise of family values as the Republican party was radicalized by white Christian evangelicals in the latter portion of the twentieth century. We also see an increase and then a decrease in group solidarity in the post-World War II period associated with the Great Society.

[Fig. 7.21 Family values in SOTUs across time]
[Fig. 7.22 Reciprocity in SOTUs across time]
[Fig. 7.23 Group solidarity in SOTUs across time]
[Fig. 7.24 Heroism in SOTUs across time]
[Fig. 7.25 Deference in SOTUs across time]
[Fig. 7.26 Fairness in SOTUs across time]
[Fig. 7.27 Property rights in SOTUs across time]

7.6 Semantic Network Analysis and Visualization

The igraph package makes it possible to perform semantic network analysis and visualization of our corpora. Essentially, this sort of analysis involves measuring patterns of pairwise association between constructs, i.e., collocations. In this context, constructs are the words, stems, n-grams, and aggregated categories measured by the liwcalike function.
As we saw above, it is possible to use this function to generate a frequency matrix in which the rows represent pieces of text (sentences, paragraphs, whole documents) and the columns represent constructs. Using the magic of matrix multiplication, it is possible to transform such a frequency matrix into an adjacency matrix, which represents the frequency of co-occurrence for each pair of constructs. This adjacency matrix can then be analyzed and visualized using igraph. To do this, we subset out just the columns of interest (family, reciprocity, group solidarity, heroism, deference, fairness, property), transform that subset into a matrix object, generate its transpose (switch rows for columns and columns for rows or, equivalently, flip it across its diagonal), and multiply the transpose by the original matrix. Note that matrix multiplication is not commutative, so if you multiply the matrix by its transpose instead, you will get a different output that doesn't represent what we are interested in here. The result of this multiplication represents the pattern of co-occurrences of the seven moral values posited by MAC within speeches. We then transform this adjacency matrix into a network object and weight the nodes by their strength and the edges by their weight:

```r
LIWCanalysis_mac_simple <- LIWCanalysis_mac[, 8:14]

M <- data.matrix(LIWCanalysis_mac_simple)
T <- t(M)
LIWCanalysis_mac_adj <- (T %*% M)

mac_network <- graph_from_adjacency_matrix(LIWCanalysis_mac_adj,
                                           mode = "undirected",
                                           weighted = TRUE, diag = FALSE)
mac_network <- simplify(mac_network, remove.multiple = F, remove.loops = T)

str <- strength(mac_network)
V(mac_network)$size <- str * 1.5
E(mac_network)$width <- E(mac_network)$weight

head(mac_network)
```

As you can see, the resulting mac_network represents patterns of co-occurrence numerically. The diagonal is blanked because every value trivially co-occurs with itself.
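To see why the transpose-times-original product yields construct-by-construct co-occurrences, consider a tiny worked example with made-up numbers: three documents scored on two hypothetical constructs, A and B:

```r
# Rows are documents, columns are constructs (toy scores, not real data)
M <- matrix(c(1, 0,
              2, 1,
              0, 3),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("A", "B")))

adj <- t(M) %*% M
# Each diagonal entry sums one construct's squared scores across documents
# (A: 1^2 + 2^2 + 0^2 = 5; B: 0^2 + 1^2 + 3^2 = 10); the off-diagonal entry
# sums the products of the two constructs' scores per document
# (1*0 + 2*1 + 0*3 = 2), i.e., how strongly A and B co-occur.
```

Constructs that tend to score highly in the same documents thus get large off-diagonal entries, which become heavy edges once the matrix is passed to graph_from_adjacency_matrix.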
The matrix is also symmetric across the diagonal, which is guaranteed for undirected networks such as co-occurrence networks. In directed networks, such as retweet networks, this is not guaranteed and is generally false.

[Fig. 7.28 Semantic network of the values expressed in SOTUs, Fruchterman-Reingold layout]

Next, we can plot this network as a visualization, where node strength is represented by size and edge weight by width. We'll start with a Fruchterman-Reingold layout:

```r
l <- layout_with_fr(mac_network)
plot(mac_network, layout = l, vertex.color = "pink",
     vertex.frame.color = "pink", vertex.label.color = "black")
```

In Fig. 7.28, we see the different moral values represented visually as a network, which (when there aren't too many nodes) can be easier to read than the purely numerical adjacency matrix. The Fruchterman-Reingold layout reduces the dimensionality to 2 in order to make it possible to print the object. The layout is, in essence, a physical system in which each node is attracted to the center, each node repels every other node, and each node is also attracted differentially to other nodes, depending on how strong the connection between them is. As Fig. 7.28 shows, the most central moral value in SOTUs is heroism, followed by group solidarity. We can also plot the network in a circular layout:

```r
l <- layout_in_circle(mac_network)
plot(mac_network, layout = l, vertex.color = "pink",
     vertex.frame.color = "pink", vertex.label.color = "black")
```

Circular layouts make edge weights more legible but may disguise node centrality. Which layout you should choose is largely an aesthetic question. Figure 7.29 shows quite clearly that heroism is the strongest moral value in SOTUs and that it is tightly connected to group solidarity.

[Fig. 7.29 Semantic network of the values expressed in SOTUs, circular layout]
It is perhaps less helpful in showing that group solidarity is the second-most central node in the network. If we had hypotheses about different eras, different political parties, or different presidents, we could generate subsets of the corpus for them and compare the resulting networks. We could also examine collocation not at the level of the whole SOTU address but at the level of the paragraph or the sentence by using the corpus_reshape function. I leave these options as exercises for the reader.

7.7 Conclusion

In this chapter, we walked through some of the main steps involved in corpus analysis for philosophical and other humanities research using R, R Studio, and R Markdown. We began with question development and hypothesis formation, followed by corpus selection and building. While we presented these in linear order, researchers often find that no readily available corpus directly addresses their questions and hypotheses. In such a case, it may be suitable to revise one's questions so that they become more tractable. Of course, when doing so, it is essential to be clear to one's audience about which aspects of the project are exploratory and which are confirmatory. Much mischief has been done over the last few decades by HARKing and related questionable research practices.

After a combination of questions, hypotheses, and congruent corpora has been established, the main task is to use R to attempt to answer the questions and test the hypotheses. We have primarily approached this process using the quanteda, ggplot2, and igraph packages in R. In many cases, the resources already available through these and related packages will be adequate. However, in other cases it may be necessary to use or even construct custom dictionaries for use with the liwcalike function. We showed how to use pre-built custom dictionaries associated with the morality-as-cooperation hypothesis.
Building and validating your own custom dictionary can be a challenging and sometimes onerous project, so we do not recommend it in all cases. Finally, we showed how to take the outputs of the liwcalike function and visualize them as semantic networks.

References

Alfano, M. (2019). Nietzsche's moral psychology. Cambridge University Press.
Alfano, M., Sullivan, E., & Fard, A. (2021). Ethical pitfalls for natural language processing in psychology. In M. Dehghani & R. Boyd (Eds.), Handbook of language analysis in psychology (pp. 511–530). Guilford Press.
Alfano, M., Cheong, M., & Curry, O. (2022). Moral universals: Machine-reading analysis of 256 societies. https://doi.org/10.21203/rs.3.rs-1841350/v1
Firth, J. R. (1957). A synopsis of linguistic theory. In Studies in linguistic analysis. Blackwell.
Frimer, J. A., Boghrati, R., Haidt, J., Graham, J., & Dehgani, M. (2019). Moral foundations dictionary for linguistic analyses 2.0. Unpublished manuscript.
Haidt, J. (2013). The righteous mind: Why good people are divided by politics and religion. Vintage.
Healy, K. (2019). Data visualization: A practical introduction. Princeton University Press.
HRAF. (1967). The HRAF quality control sample universe. Cross-Cultural Research, 2(2), 81–88.
Kirkegaard, E., & Bjerrekær, J. (2014). The OKCupid dataset: A very large public dataset of dating site users. Open Differential Psychology. https://openpsych.net/forum/showthread.php?tid=279
Lee, T., & Phillips, J. (2019). Data-driven originalism. University of Pennsylvania Law Review, 167(2), 261–233.
MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk (3rd ed.). Lawrence Erlbaum.
Nichols, S., & Pinillos, N. (2018). Skepticism and the acquisition of "knowledge." Mind & Language, 33(4), 397–414.
Pennebaker, J. (2011). The secret life of pronouns: What our words say about us. Bloomsbury Press.
Pennebaker, J., Boyd, R., Jordan, K., & Blackburn, K. (2015). The development and psychometric properties of LIWC2015. https://doi.org/10.15781/T29G6Z
Roose, J., Flood, M., Grieg, A., Alfano, M., & Copland, S. (2022). Masculinity and violent extremism. Palgrave.
Smith, A. (2013). The relationship between rhetoric and terrorist violence. Routledge.
Sullivan, E., Sondag, M., Rutter, I., Meulemans, W., Cunningham, S., Speckmann, B., & Alfano, M. (2020a). Vulnerability in social epistemic networks. International Journal of Philosophical Studies, 28, 731.
Sullivan, E., Sondag, M., Rutter, I., Meulemans, W., Cunningham, S., Speckmann, B., & Alfano, M. (2020b). Can real social epistemic networks deliver the wisdom of crowds? In T. Lombrozo, J. Knobe, & S. Nichols (Eds.), Oxford studies in experimental philosophy (Vol. 3, pp. 29–63). Oxford University Press.
Vergani, M., & Bliuc, A.-M. (2018). The language of new terrorism: Differences in psychological dimensions of communication in Dabiq and Inspire. Journal of Language and Social Psychology, 37(5), 523–540.

Chapter 8
Psycholinguistic Experiments: A Case Study on Default Inferences in Philosophical Arguments—Analysing the Argument from Illusion

Eugen Fischer and Paul E. Engelhardt

Abstract Experimental methods from psycholinguistics allow experimental philosophers to study important automatic inferences, with a view to explaining and assessing philosophically relevant intuitions and arguments. Philosophical thought is shaped by verbal reasoning in natural language. Such reasoning is driven by automatic comprehension inferences. These inferences shape, e.g., intuitions about verbally described cases in philosophical thought experiments; more generally, they shape moves from premises to conclusions in philosophical arguments. These inferences can be examined with questionnaire-based and eye-tracking methods from psycholinguistics. We explain how these methods can be adapted for use in experimental philosophy.
We demonstrate their application by presenting a new eye-tracking study that helps assess the influential philosophical "argument from illusion." The study examines whether stereotypical inferences from polysemous words (viz., appearance verbs) are automatically triggered even when prefaced by contexts that defeat the inferences. We use this worked example to explain, step by step, the key conceptual steps involved in designing behavioural experiments. Going beyond the worked example, we also explain methods that require no laboratory facilities.

Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-58049-9_8.

E. Fischer (✉), School of Politics, Philosophy, Language and Communication Studies, University of East Anglia, Norwich, UK. E-mail: [email protected]
P. E. Engelhardt, School of Psychology, University of East Anglia, Norwich, UK. E-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. S. Kornmesser et al., Experimental Philosophy for Beginners, Springer Graduate Texts in Philosophy, https://doi.org/10.1007/978-3-031-58049-9_8

Keywords: Argument analysis · Verbal reasoning · Stereotypical inferences · Polysemy processing · Eye tracking · Reading time measurements · SPSS · ANOVAs

8.1 Introduction

Philosophical thought is shaped by verbal reasoning: Philosophical argument takes several forms, ranging from thought experiments to abductive and deductive arguments, but most of these arguments are couched in natural languages like English or German. At the most fundamental level, verbal reasoning is driven by automatic comprehension inferences. These ubiquitous inferences are triggered whenever we hear or read words in speech or text, and they are anticipated by speakers and writers whenever they speak or write. These inferences shape our intuitions about verbally described cases.
They influence our reasoning about such cases. They influence, more generally, how we move from premises to conclusions in verbal reasoning. Experimental methods from psycholinguistics allow experimental philosophers to study these important inferences, with a view to explaining and assessing philosophically relevant intuitions and arguments from several areas of philosophy. This chapter presents questionnaire-based and eye-tracking methods from psycholinguistics, explains how they can be adapted for philosophical use, and demonstrates their application through a step-by-step account of a new eye-tracking study, explaining each key step of the research process. The study examines (1) whether verbs trigger basic automatic inferences (stereotypical inferences) even when prefaced by contexts that defeat them and (2) whether such contextually inappropriate stereotypical inferences influence further judgement and reasoning. The chapter demonstrates how such findings (namely, about appearance verbs) can be used to analyse and assess an influential philosophical argument (viz., the "argument from illusion" that is central to philosophical debates about the nature of perception). We will use this worked example to explain the key conceptual steps involved in designing behavioural experiments quite generally and, more specifically, how to combine fixation time measurements with plausibility ratings to study automatic comprehension inferences, how to collect data with an eye tracker, and how to conduct statistical analyses of variance (ANOVAs). We will also explain questionnaire-based methods that require no laboratory facilities.

8.2 Philosophical Background

Our chapter explores and explains how experimental methods from psycholinguistics can be used to explain and assess philosophically relevant intuitions and arguments, in experimental philosophy's familiar "sources project" and in an
exciting recent extension of it, the research programme of "experimental argument analysis." To bring out the philosophical relevance of psycholinguistic methods and findings, we now present the sources project (Sect. 8.2.1), explain how findings from psycholinguistics can support it (Sect. 8.2.2), and motivate its extension to the study of philosophical arguments (Sect. 8.2.3). The next sections then walk you, step by step, through a study that employs psycholinguistic methods to examine an influential philosophical argument (presented in Sect. 8.2.4).

8.2.1 Evidential and Explanatory Experimental Philosophy

While experimental philosophy ranges considerably more widely, its most prominent strands examine philosophically relevant intuitions (for reviews, see Fischer and Sytsma, 2024; Sytsma and Livengood, 2015, Chap. 1). In several areas of philosophy, theory construction involves the "method of cases": In thought experiments, philosophers consider hypothetical cases like Gettier cases (Gettier, 1963), trolley cases (Foot, 1967), or Frankfurt cases (Frankfurt, 1969), elicit intuitive judgements about them, and deploy these judgements as defeasible evidence for or against philosophical theories. These theories are typically required to be consistent with the intuitive case judgements and are often meant to explain or justify them. Philosophical theory construction guided by the method of cases often proceeds by working back and forth between intuitions elicited by different thought experiments and various background beliefs until "reflective equilibrium," a coherent set of judgements and beliefs, has been achieved. This approach is sufficiently common to have been characterised as analytic philosophy's "standard justificatory procedure" (Bealer, 1996, 4).
The main strands of experimental philosophy are responses to such intuition-driven philosophising: Evidential experimental philosophy seeks to assess the evidential value of philosophically relevant intuitive judgements ("philosophical intuitions"), while explanatory experimental philosophy seeks to explain such intuitions, which many experimental philosophers conceptualise as judgements about cases generated by automatic cognitive processes. Cognitive processes are "automatic" when they require minimal attention or other working memory resources (as evidenced by the fact that performance at tasks relying on them is not impaired by multitasking and distraction); automatic processes often take place below the level of conscious awareness (Evans and Stanovich, 2013; cf. Bargh et al., 2012). As thus roughly characterised, these two strands of experimental philosophy overlap: Several contributions to evidential experimental philosophy involve explanations of the intuitions at issue, e.g., on free will (Nichols and Knobe, 2007; Nadelhoffer et al., 2020), consciousness attributions (Arico et al., 2011; Fischer and Sytsma, 2021), or knowledge attributions (Alexander et al., 2015; Gerken et al., 2020). In this work, sometimes characterised as experimental philosophy's "sources project" (Pust, 2000), psychological explanation is at the service of epistemological evaluation: Debunking explanations of intuitive judgements or beliefs seek to show that they are due to "epistemically defective" processes, that is, to processes that are defective as means of generating knowledge (e.g., Nichols, 2014; Vavova, 2018; White, 2010). These processes may be defective because they are not sensitive to truth, i.e., because they would generate the same judgement regardless of whether the judgement was true or false.
This may be the case because the processes are decisively influenced by irrelevant factors that have nothing to do with the truth of the judgement. Accordingly, much work in evidential experimental philosophy (negative experimental philosophy) has empirically examined, if often without explanatory ambitions, to what extent philosophically relevant case judgements are influenced by supposedly irrelevant factors, including presentation effects (order of presentation of cases or choice of equivalent case descriptions) and demographic factors (gender and age of participants, their socioeconomic status, etc.) (for a review, see Machery, 2017). Cognitive processes may also be defective because they are vulnerable to cognitive biases: because, under certain specifiable conditions, they lead, in predictable ways, to judgements that violate normative rules (i.e., rules that determine or constrain what is right and wrong to judge, believe, say, or do). Cognitive biases may be the result of trade-offs between accuracy and economy: Our cognitive resources (like working memory capacity) are limited, and cognitive processes may have evolved to deliver judgements that are as accurate as possible, in as large a part of their "intended" domains of application as possible, while using as few cognitive resources as possible. The prominent "heuristics and biases" research programme in cognitive psychology has examined how automatic cognitive processes implement rules of thumb (heuristics) that deliver reasonably accurate judgements under most circumstances but give rise to cognitive biases (Kahneman, 2011; Tversky and Kahneman, 1974). This research inspired an ambitious body of work in experimental philosophy, which has sought to develop explanations of philosophically relevant intuitions that trace them back to automatic cognitive processes vulnerable to cognitive biases.
For example, work on intuitive knowledge attributions examined whether such attributions are due to processes vulnerable to egocentrism bias (Alexander et al., 2015) or focal bias (Gerken et al., 2020) (for a review, see Pinillos, 2012). Where successful, this strategy has a great advantage over other debunking explanations: Other debunking explanations may just tell us that certain intuitions (say, about who proper names refer to) cannot serve as evidence, since they are generated by a process that is not sensitive to truth. In consequence, the fact that we have the intuitions so explained, as and when we have them, does not speak for their truth and has no evidential value. We have to conclude that intuitions of that kind should not be used as evidence for or against philosophical theories. This contrasts with debunking explanations that trace intuitions back to generally reliable cognitive processes subject to cognitive biases. Such explanations are more fine-grained and informative: They allow us to assess under what conditions the fact that we have these intuitions speaks for their truth, and under what conditions this fact speaks against their truth. This allows us to expose specific fallacies in judgement.

8.2.2 Psycholinguistic Inspirations for Experimental Philosophy

To conduct thought experiments, philosophers typically give verbal descriptions of the cases to be considered. These so-called vignettes typically are short paragraphs of more or less plain English or German (etc.) prose. The automatic processes that shape intuitive judgements about these cases therefore include language comprehension processes that go on all the time when we read texts or hear utterances.
One particularly fundamental and important process is the process of stereotypical enrichment that facilitates communication in the face of the “articulation bottleneck”: Normal speech conveys information at a slow rate of under 100 bits per second (Levinson, 2000, 28). Pre-articulation processes in speech production are three to four times faster (Wheeldon and Levelt, 1995), as are parsing processes and comprehension inferences (Mehler et al., 1993). Articulation in speech thus considerably slows down communication. The process of stereotypical enrichment implements a straightforward mitigation strategy: It draws on information about the world that listeners or readers possess all along and deploys this information to infer what else will also be true of the situation talked about, in addition to the things that are explicitly said. The relevant information is stored in “concepts,” in cognitive science’s sense of the term (see Machery, 2009): bodies of information stored in long-term memory, which are deployed in the exercise of higher cognitive competencies, including language comprehension, perceptual categorisation, and inductive learning, and are retrieved by default, i.e., they are retrieved rapidly (e.g., in response to a verbal stimulus), by an automatic process, either in every context in which the stimulus (say, word) is encountered or even when the stimulus is encountered outside all context (Machery, 2015; cf. Fischer, 2020). Concepts include stereotypes: Unlike definable concepts associated with necessary and sufficient conditions, these implicit knowledge structures encode statistical information about regularities that are observed in the physical or discourse environment (McRae and Jones, 2013). 
Stereotypes include prototypes associated with nouns (like "bird" or "chair") that encode information about which properties are typical of category members (say, birds) and which properties are diagnostic of them, i.e., make it more likely that something belongs to the category (Rosch and Mervis, 1975; Hampton, 2006). Stereotypes further include "situation schemas" associated with verbs (like "arrest" or "serve") that encode typical features of events or actions, acting agents, "patients" acted on, and typical relations between them (Rumelhart, 1978; McRae et al., 1997; cf. Kim et al., 2016). Note that the knowledge encoded by these conceptual structures is empirical. In single-word priming experiments, participants are presented with a "prime" word and then a probe word or letter string and have to, e.g., read out the word or decide whether the string forms a word. Shorter response times (e.g., for the pair "bank"–money than for "bank"–honey) provide evidence that the prime activates the probe concept, i.e., makes it more accessible and more likely to be used by cognitive processes (Lucas, 2000). Such experiments have shown that single words (like "tomato") activate the associated stereotype with its several component features (like red), and do so rapidly (within 250 ms) (for a review, see Engelhardt and Ferreira, 2016). Nouns and verbs associated with stereotypes thus rapidly make these bodies of information more accessible and more likely to be used in cognitive processes that, in reading comprehension, range from word recognition to the construction of situation models, i.e., mental representations of the situation described by the text, which provide the basis for further judgements and reasoning about that situation (Kintsch, 1988; Zwaan, 2016). Stereotypes thus are the prefab building blocks of information from which we construct our interpretations of texts and utterances, to mitigate the articulation bottleneck.
The process of stereotypical enrichment draws on these building blocks to fill in detail about the situations talked about: When you read or hear "the secretary opened the door," you will immediately infer that the secretary is female and that the venue is an office. Speakers anticipate these inferences: They typically do not bother to make stereotypical features explicit, as and when they are relevant; by contrast, speakers make deviations from the stereotype explicit ("the male secretary opened the car door"). In a neo-Gricean framework (Horn and Ward, 2004; Levinson, 2000; Recanati, 2003), this practice has been codified with a pragmatic rule: The I-heuristic tells language users to deploy stereotypes, to devise or facilitate interpretations that are positive and highly specific (Levinson, 2000, 114–115). It consists of two maxims, one for speakers and one for hearers:

I-speaker: Skip mention of stereotypical features but make deviations from stereotypes explicit.

I-hearer: Absent such explicit indications to the contrary, assume that the situation talked about conforms to the relevant stereotypes, and deploy the most specific stereotypes relevant to fill in detail in line with this knowledge about situations of the kind at issue.

The process allows us to convey much information with few words in most situations: Whenever we are dealing with a situation that is a typical instantiation of certain kinds of actions or events, we can rely on familiar words to trigger inferences that will flesh out the picture with reasonable accuracy. These stereotypical inferences are, however, at the bottom of the pragmatic pecking order and get defeated by inferences supported by other pragmatic rules or heuristics (Levinson, 2000, 157–158).
Stereotypical enrichment and related pragmatic processes have been shown to guide the interpretation of case descriptions in philosophical thought experiments, including descriptions of Gettier cases (Saint-Germier, 2021) and scenarios involving philosophical zombies (Fischer and Sytsma, 2021): They guide the construction of the situation model that represents the situation described and provides the basis for the reader’s further judgements about that situation. These findings motivate bringing psycholinguistics to the table of evidential experimental philosophy and, specifically, its sources project. As we have seen (in Sect. 8.2.1), one particularly promising strategy employed by this project seeks to develop and test psychological explanations of philosophically relevant case judgements by tracing them back to generally reliable cognitive processes that are affected by specific cognitive biases. The process of stereotypical enrichment is generally reliable: Stereotypes are built up from the observation of statistical regularities in a person’s physical and discourse environment (McRae and Jones, 2013). This means that, unless people are the victims of systematic indoctrination, say through ideological discourse or war propaganda, their stereotypes will reflect reasonably accurately the features of category members as encountered in their environment. (The city kid’s stereotypical association of tomatoes and redness accurately reflects the co-occurrence frequencies in urban supermarkets, restaurants, and kitchens. Cf. Jussim et al., 2016, for social categories.) Prolonged exposure to a different sample (unripe green tomatoes encountered in summer work on the farm) then leads to gradual modification of the stereotype. While generally reliable, the process is subject to cognitive biases like the linguistic salience bias, which we discuss below (Sect. 8.3.2).
We thus arrive at a psycholinguistically informed implementation of experimental philosophy’s sources project: In the first instance, this implementation of the project examines how the process of stereotypical enrichment shapes the construction of situation models when thinkers read or hear verbal case descriptions in philosophical thought experiments, and how the process thereby influences intuitive judgements about the verbally described cases; the specific aim then is to facilitate positive and negative assessments of the evidentiary value of these judgements. Negative assessments are facilitated by developing debunking explanations of specific intuitions, showing that these judgements are due to comprehension biases like the linguistic salience bias, which beset automatic comprehension inferences, including stereotypical inferences (Fischer and Engelhardt, 2016; Fischer and Sytsma, 2021). This approach is illustrated by a study on intuitions about philosophical zombies: David Chalmers’s (1996, 93–108) influential “zombie argument” against physicalism and the supervenience of phenomenal consciousness on the physical proceeds from the assumption that we can conceive of “philosophical zombies”: beings that have bodies like ours and behave like us, but lack conscious experience. This assumption is supported by a thought experiment. Fischer and Sytsma (2021) empirically elicited the stereotype associated with the noun “zombie,” showed that the noun satisfies the conditions engendering linguistic salience bias, hypothesised that this bias would influence intuitions about philosophical zombies, and provided evidence for this hypothesis through a questionnaire-based experiment that documented a medium-sized framing effect: When the vignette described imaginary beings as “zombies,” almost twice as many participants passed Chalmers’s (2002) positive conceivability test for philosophical zombies as when an otherwise identical vignette used the word “duplicate” instead.
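A framing effect of this kind is typically analysed as a 2×2 contingency table (framing condition × pass/fail). The counts below are invented for illustration and are not Fischer and Sytsma’s data; the sketch only shows the form of the analysis:

```python
import math

# Hypothetical counts of participants passing/failing the conceivability
# test under the two framings (invented numbers, not the published data).
table = {"zombie": (39, 11), "duplicate": (20, 30)}  # (pass, fail)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2x2 table
    [[a, b], [c, d]], plus the phi coefficient as an effect size."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    phi = math.sqrt(chi2 / n)  # for 2x2 tables, phi ~ 0.3-0.5 reads as medium
    return chi2, phi

chi2, phi = chi_square_2x2(*table["zombie"], *table["duplicate"])
print(f"chi2 = {chi2:.2f}, phi = {phi:.2f}")
```

With these illustrative counts, almost twice as many “zombie”-condition participants pass as “duplicate”-condition participants, and phi falls in the conventional medium range, matching the qualitative pattern reported in the text.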
Fischer and Sytsma (2021) concluded that the impression that philosophical zombies are conceivable is partially due to linguistic salience bias affecting the stereotypical enrichment of verbal descriptions of zombie scenarios. This psycholinguistic explanation of the intuitions contributes to debunking them by providing an “undermining defeater” (Pollock, 1986): It helps show that the fact that some people have conceivability intuitions concerning philosophical zombies does not provide justification for accepting them. To this day, evidential experimental philosophy focuses on the assessment of intuitive judgements that serve as evidence for philosophical theories or as premises in philosophical arguments (like the zombie argument). In this sub-section, we built up to an approach that puts psycholinguistic insights to the service of this project and studies how automatic language comprehension processes, and the biases affecting them, shape judgements about verbally described cases that are made in philosophical thought experiments and serve as premises in philosophical arguments. We now turn to an exciting extension of this project that examines how automatic comprehension processes and their biases shape moves from premises to conclusions in philosophical arguments.

8.2.3 Experimental Argument Analysis

While experimental philosophy still engages with the philosophical tradition mainly by examining the value of intuitions as evidence for or against philosophical theories, philosophers have begun to use experimental methods also to examine and assess philosophical arguments, often with a view to exposing otherwise hard-to-detect fallacies in them. Relevant research includes work on fallacies in reasoning with conditionals (Pfeifer, 2012; Pfeifer and Tulkki, 2017; cf. Skovgaard-Olsen et al., 2016) and with metaphors (e.g., Ervas et al., 2015, 2018).
The present authors have cheekily appropriated the label “experimental argument analysis” for a quite specific research programme that naturally extends psycholinguistically informed evidential experimental philosophy of the sort just outlined: Experimental argument analysis (EAA) examines the roots of verbal reasoning in automatic language comprehension (for a review, see Fischer and Herbelot, 2023). EAA studies how automatic default inferences triggered by words shape verbal reasoning, for better or worse. It focuses on the effects for the worse, in order to explain and expose fallacies in verbal reasoning—for starters, in philosophical arguments. The key idea is to expose and explain otherwise unnoticed fallacies by gaining insight into cognitive biases that systematically affect language comprehension and production. Philosophical argument analysis is often regarded as the epitome of an armchair activity. So why should experiments be needed to assess philosophical arguments? The interpretation of philosophical texts is governed by widely accepted principles of charity. These principles tell us to credit authors with linguistic competence and rationality. This requirement creates a tension with the attribution of fallacies to authors (Adler, 1994; Lewinski, 2012). Medium-strength principles of charity resolve the tension by allowing interpreters to attribute fallacies to authors only if the attribution is backed up by an empirically supported explanation of when and why even competent thinkers commit fallacies of the relevant kind (Thagard and Nisbett, 1983). Experiments are required to provide empirical support for such explanations—in particular, where these explanations invoke automatic inferences, which operate beyond thinkers’ conscious awareness.
Extant contributions to EAA have developed and experimentally examined a psycholinguistic explanation of otherwise puzzling fallacies of equivocation: They suggest that the process of stereotypical enrichment is affected by a cognitive bias that influences the processing of words with several distinct, but related, senses (like the verb “see,” for which the Princeton WordNet database distinguishes 24 different senses, including “to understand, realise”).1 Such polysemous words account for at least 40% of English words (Byrd et al., 1987). Reasoning with such words is affected by the “linguistic salience bias” (Fischer and Engelhardt, 2020; Fischer and Sytsma, 2021), which we shall discuss below (Sect. 8.3.2). The bias affects reasoning with unbalanced polysemes, which have a dominant sense that is (roughly speaking) far more frequent than all subordinate uses. Due to this bias, subordinate uses (e.g., “Jack saw Jane’s point”) trigger stereotypical inferences that are supported only by the word’s dominant sense (e.g., the visual sense) and are contextually inappropriate (e.g., that the thing “seen” is in front of Jack)—and influence further judgement and reasoning all the same. Fallacies of equivocation resulting from this bias have been suggested to be involved in influential philosophical arguments: Five studies to date have examined inappropriate stereotypical inferences from polysemous perception verbs, to expose fallacies in arguments “from hallucination” (Fischer and Engelhardt, 2017a, b, 2019, 2020; Fischer et al., 2022); three studies have examined inappropriate stereotypical inferences from polysemous appearance verbs, with a view to exposing fallacies in arguments “from illusion” (Fischer and Engelhardt, 2016; Fischer et al., 2021a, b).
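The rough idea of an “unbalanced” polyseme can be made concrete with a toy dominance check. The per-sense corpus counts and the dominance threshold below are invented for illustration; real work would estimate sense frequencies from annotated corpora or distributional analyses:

```python
# Toy check for "unbalanced" polysemes: a word counts as unbalanced when
# its most frequent sense is far more frequent than all subordinate senses
# taken together. Counts and threshold are illustrative assumptions.
SENSE_COUNTS = {
    "see": {"visual": 9000, "understand": 700, "meet": 250, "ensure": 50},
    "run": {"move_fast": 4000, "operate": 3500, "flow": 1500},
}

def is_unbalanced(word, ratio=3.0):
    """True if the dominant sense is at least `ratio` times as frequent as
    all subordinate senses combined (the cut-off is a made-up heuristic)."""
    counts = sorted(SENSE_COUNTS[word].values(), reverse=True)
    dominant, rest = counts[0], sum(counts[1:])
    return dominant >= ratio * rest

print(is_unbalanced("see"))  # dominant visual sense dwarfs the rest
print(is_unbalanced("run"))  # senses are more evenly distributed
```

On these made-up counts, “see” comes out unbalanced while “run” does not, which is the contrast the bias account turns on: only markedly dominant senses are predicted to intrude on subordinate uses.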
A study involving professional academic philosophers revealed that they are no less susceptible to linguistic salience bias than laypeople (Fischer et al., 2022), providing initial justification for deploying such findings in the analysis of philosophical arguments. The choice of target has been motivated by the idea that the arguments from illusion and from hallucination are instrumental for developing a characteristically philosophical problem: Historically influential and still prominent philosophical problems, including sceptical problems and the problems of free will (O’Connor and Franklin, 2021), mental causation (Robb and Heil, 2019), and perception (Crane and French, 2021), are generally regarded as arising from persuasive arguments that lead to conclusions that appear to rule out familiar facts, as recognised by common sense. The arguments from illusion and from hallucination, for instance, develop the “problem of perception”: They proceed from the uncontroversial assumptions that illusions and hallucinations occur (or are at least possible). In a first step, the arguments conclude that in the cases considered—i.e., illusions or hallucinations—viewers are aware, or directly aware, of subjective and immaterial objects (perceptions or sense data) in their minds, rather than of physical objects in their environment. In a second step, the arguments generalise to all cases of visual perception. The arguments thus challenge what philosophers regard as the commonsense view of vision, which grants viewers direct access to physical objects. The arguments raise the problem “that if illusions and hallucinations are possible, then perception, as we ordinarily understand it, is impossible” (Crane and French, 2021, § 2; cf. Hume, 1772). They motivate the question at the centre of philosophical debates about the nature of perception: How is perception, as we ordinarily understand it, even possible?

1 http://wordnetweb.princeton.edu/.
(Robinson, 1994; Smith, 2002). If fallacies prevent the underlying arguments from getting off the ground, however, this question is ill-motivated and needs to be rejected rather than answered. Austin (1962, 5) famously sought to “dissolve” the problem of perception by exposing “seductive (mainly verbal) fallacies” in the arguments that raise it. While EAA has further philosophical applications, the first contributions to the research programme sought to provide this Austinian approach with empirical content and support.

8.2.4 Worked Example: Analysing the Argument from Illusion

As a worked example, this chapter presents and discusses a study that contributes to assessing a reconstruction of the argument from illusion, which was first proposed by Fischer and colleagues (Fischer et al., 2021a; cf. Fischer, 2014). The argument proceeds from mainly familiar cases of “non-veridical perception” where some object looks a different size, shape, or colour than it is. For example:

(1) When subjects view a round coin sideways, the coin appears elliptical to them.

Seminal statements of the argument (e.g., Hume, 1975, 152) infer directly that, in these cases, an ‘image’ (sense-datum) rather than a physical object must be “present to the mind.” Early twentieth-century statements (e.g., Ayer, 1940, 4; Broad, 1923, 240; Price, 1932, 27–30; Russell, 1912, 1–3) break up this decisive “sense-datum inference” (Smith, 2002, 25) into two parts. They infer first a negative, then a positive conclusion:

(2) When subjects view a round coin sideways, they are not (directly) aware of the round coin.

(3) When subjects view a round coin sideways, they are (directly) aware of something.

(4) By (2) and (3), subjects are then (directly) aware of something other than the round coin (namely, a “sense-datum”).

The sense-datum is then credited with the shape, size, and colour that the coin merely looks (there and then).
Current versions of the argument invoke, instead, a controversial “phenomenal principle” to license the positive conclusion that, in situations of non-veridical perception, viewers are aware of things that have the shape, size, or colour the physical object merely looks (Robinson, 1994, 57–58; Smith, 2002, 25–27). However, the phenomenal principle invoked arguably relies on the initial move to the negative conclusion in the early twentieth-century version (Fischer et al., 2021a, 1057–1060). The at first sight jaw-dropping leap from (1) to (2) is and remains central to the argument. Fischer and colleagues (2021a, b) explain this move as turning on a fallacy of equivocation. In their dominant sense, the appearance verbs commonly used in the first premise (mostly “appear” and “seem,” occasionally “look”) are used to attribute beliefs: As used intransitively in “Joe looks dirty,” they do not serve to predicate the property expressed by their complement (dirtiness) of their subjects’ referents (Joe) but are used to attribute to the often implicit patient an experiential, epistemic, or doxastic attitude towards a content (Joe is dirty) (Brogaard, 2014). A distributional semantic analysis of the words’ intransitive use in a parsed Wikipedia snapshot suggests that “seem” and “appear,” and to a lesser extent “look,” are most frequently used to attribute doxastic attitudes (belief), are less frequently used to attribute epistemic attitudes (knowledge), and are yet less frequently employed to attribute experiential attitudes (Fischer et al., 2015). Proponents of the arguments from illusion and hallucination, however, mean to use appearance and perception verbs in a subordinate phenomenal sense in which they just serve to describe viewers’ subjective experience, without implying anything about the beliefs they form (Ayer, 1956, 90; Jackson, 1977, 33–49; Maund, 1986; cf. Chisholm, 1957, 44–48).
Proponents also acknowledge that, in the familiar cases at issue, viewers confidently judge that things actually have a shape, size, or colour that is different from the one they look under the circumstances (e.g., Ayer, 1956, 88; Broad, 1923, 236f., 241; cf. Price, 1932, 27). (1) is therefore true only on the intended “phenomenal” interpretation of the appearance verb. Fischer and colleagues (2021a, 1054–1057) suggest that, due to the linguistic salience bias mentioned above (Sect. 8.2.3), subordinate phenomenal uses of appearance verbs in the argument trigger, all the same, inferences to belief attributions that are supported only by their dominant sense. This dominant sense of “X appears F (to S)” serves to attribute beliefs to patients S and is associated with a situation schema that includes the stereotypical patient feature S believes that X is F. The stereotypical inferences supported by this schema lead from (1) to the implicit belief attribution:

(1-B) The viewer believes that the object viewed is elliptical.

(1-B) and uncontroversial background assumptions entail (2) (Fischer, 2014): (1) entails that the coin viewed is round. Together with (1-B), this entails that the viewer has a wrong belief about the coin and does not know that it is round, or that there is a round coin. Together with standard definitions of “to be aware of” (“to have cognizance, know, have knowledge as obtained by observation or information,” Oxford English Dictionary),2 this conclusion entails that (2) the viewer is not aware of the round coin. Standard definitions of “direct awareness” do not cancel implications of knowledge, but rather impose the stricter requirement that the relevant knowledge be acquired without conscious inference. Hence, the ignorant viewer is not “directly aware” of the round coin, either.

2 https://www.oed.com/view/Entry/13892.
Fischer and colleagues (2021a, b) develop this a priori reconstruction into an empirical explanation of the move from (1-B) to (2) by reference to the representativeness heuristic (Kahneman and Frederick, 2005; Tversky and Kahneman, 1982). The details need not concern us now—if any account along these lines is correct, the argument from illusion does not get off the ground, as it relies on a fallacy of equivocation in its opening move: (1) is true only if the appearance verb is used in the subordinate phenomenal sense (which implies nothing, either way, about the viewer’s beliefs), while the move to (2) relies precisely on an inference to a belief attribution licensed only by the dominant sense. To be true, (1) has to be interpreted as using the verb in one sense; to permit the move to (2), it has to be interpreted as using the verb in another sense. This example illustrates, in outline, how an empirical explanation that invokes automatic comprehension inferences and a bias affecting them can contribute to the critical assessment of a philosophical argument (and, potentially, the “dissolution” of a philosophical problem). We now develop the empirical explanation in more depth (Sect. 8.3) and then review how experimental methods from psycholinguistics can be used to examine this explanation and, more generally, to study how automatic comprehension inferences shape case judgements and verbal reasoning (Sect. 8.4). In doing so, we change expository strategy: Now that you are clear on the relevance of psycholinguistic methods for experimental philosophy, we walk you, step by step, through the key stages of developing an empirical study that employs these methods to contribute to experimental philosophy.

8.3 Developing the Research Question and Hypotheses

8.3.1 From Philosophical to Empirical Questions

An experiment is an attempt to answer a question about an empirical phenomenon.
Before you begin to design an experiment, it is crucial that you are clear on what phenomenon you are interested in and what question about it you want to answer. Typically, this will be a question to which no fully satisfactory answer has yet been given in the extant literature: As you review the literature to find answers, you realise you need to do some work yourself. An experiment addresses a question by putting a proposed answer to it to the test. To develop a provisional answer to your question, you will build on extant literature. Your provisional answer is the hypothesis your experiment will examine. Typically, studies in experimental philosophy directly address a question about a certain empirical phenomenon of interest and thereby indirectly address philosophical questions. These philosophical questions may, but need not, be about that empirical phenomenon. Much of experimental philosophy is animated by the idea that philosophical questions about phenomena can be addressed by empirically examining how people think about these phenomena; this key idea promises to make (some) philosophical questions (more) empirically tractable (Fischer and Sytsma, 2024). Thus, evidential experimental philosophy seeks to contribute to philosophical debates about phenomena of interest by empirically examining people’s intuitions about them. Similarly, the extension of this research strand through experimental argument analysis seeks to contribute to philosophical debates about phenomena of interest by empirically examining how people reason about them. In experimental philosophy, we therefore need to clearly distinguish between the phenomenon of philosophical interest and the phenomenon of empirical interest, which may be different. If our empirical studies are to help us answer our philosophical questions, we need a clear understanding of how these two phenomena hang together.
Indeed, the greatest advances in experimental philosophy are made by discerning new connections of this kind. For example, we have just considered extant contributions to experimental argument analysis. They seek to contribute to philosophical debates about the nature of sense perception. The phenomenon of philosophical interest thus is sense perception. However, those contributions then ask a key meta-philosophical question about a philosophical debate that revolves around this phenomenon: Is the problem of perception well motivated? To address this question, they ask: Are the arguments from illusion and from hallucination sound? That is, are these arguments’ premises all true and their inferences all valid? This is a question about people’s reasoning about the phenomenon of philosophical interest. The question about the validity of inferences is a normative, not an empirical, question, but it provides the crucial starting point for the move from philosophical to empirical questions. While aspiring or pretending to deductive validity, philosophical arguments, including the arguments from illusion and hallucination, are couched in natural language and typically rely on inferences that are supported not by the rules of logic but by the information carried by words other than the logical operators. (Where such arguments actually are logically valid, as current versions of the argument from illusion are, the validity is purchased at the cost of introducing controversial assumptions whose acceptance relies on such further or “extra-logical” inferences.) The question about validity is therefore to be asked as a question not about logical validity but about the presence or absence of fallacies in informal (rather than logical) reasoning: Do the arguments from illusion and hallucination involve fallacies? In order to address this question, we need to ask: What inferences do thinkers make in moving from the premises to the conclusions of these arguments?
And do these inferences include fallacious inferences? Given the medium-strength principles of charity discussed above (Sect. 8.2.3), we may attribute fallacious inferences to competent thinkers only if we can explain why they make them. So we need to ask: Why do competent thinkers commit these fallacies? This is a set of empirical questions about verbal reasoning. Drawing on empirical findings from psycholinguistics, experimental argument analysis has proposed a first set of answers (Sects. 8.2.3 and 8.2.4): In moving from the premises to the conclusions, people make automatic comprehension inferences that are supported by stereotypes associated with the nouns and verbs used. The automatic inferences triggered by polysemous appearance and perception verbs result in fallacies of equivocation. Competent thinkers commit these fallacies due to a linguistic salience bias that has them make inferences supported by the dominant sense of a polysemous word also from subordinate uses of the word. This forges a new connection between the topic of sense perception, on the one hand, and the topic of polysemy processing and its hiccups, on the other: In order to find out whether a central and long-standing philosophical problem about sense perception is well motivated, we need to assess the verbal arguments that develop this problem. This can be done by examining how polysemous words are processed in language comprehension, and how cognitive biases affect automatic inferences from these words. Exposing and explaining verbal fallacies in the targeted arguments help “dissolve” the targeted philosophical “problem of perception” at the heart of philosophical debates about the nature of sense perception. The philosophical interest in sense perception thus leads to an empirical interest in polysemy processing and the linguistic salience bias that affects it. This is our phenomenon of empirical interest.
The philosophical relevance of this phenomenon extends well beyond efforts to “dissolve” a specific problem about perception. Philosophers frequently employ familiar words in subordinate senses, or even introduce new or special senses for familiar words. The introduction of such new senses frequently occurs in “conceptual engineering,” which seeks to improve concepts and other representational devices, for purposes ranging from improving philosophical or scientific theorising to changing socio-political attitudes in pursuit of social justice (Cappelen, 2018; Cappelen and Plunkett, 2020). For example, philosophers of perception introduced a new “phenomenal” sense of appearance verbs to facilitate “pure” descriptions of subjective experience that lack implications about people’s beliefs or environments (Sect. 8.2.3). The phenomenon of linguistic salience bias raises the question of whether, or to what extent, thinkers are able to consistently use polysemous words correctly in the new senses intended by their “engineers,” in verbal reasoning, without inadvertently slipping back into habits of inference shaped by the words’ established and dominant use (Fischer, 2020). If philosophers are unable to prevent such slippage, we can expect several philosophical arguments to be marred by fallacies of equivocation (and, possibly, some further philosophical problems to be “dissolved” by their exposure). If intended users of reengineered concepts are unable to do so, we can expect many conceptual engineering efforts to result in more confusion than clarity, within and beyond philosophy. Yet more interestingly, answers to the question of when, or under what conditions, linguistic salience bias asserts itself will allow us to predict when these problems arise and help us work around them. Linguistic salience bias is well worth understanding.
Once you have identified your phenomenon of empirical interest, and are clear on how its study will help you address philosophical questions about the phenomenon of philosophical interest, you need to review the empirical literature on the phenomenon of empirical interest, to define your research question, and to motivate a potential answer to it—the hypothesis your experiment will assess. Your research question will typically be a question that has not yet been satisfactorily answered by the extant empirical literature and has a bearing on your intended philosophical application. Accordingly, we now briefly review the literature on polysemy processing and the linguistic salience bias (Sect. 8.3.2). We then identify as yet unanswered questions about this bias on which the proposed debunking account of the argument from illusion hangs (Sect. 8.3.3).

8.3.2 The Phenomenon of Empirical Interest

Psycholinguistic research so far provides a mixed picture of the mental representation of polysemes, with some evidence of differences between different word classes, and perhaps even within word classes (Lopukhina et al., 2018). However, many polysemes activate a unitary representation of semantic information that is then deployed to interpret utterances which use the word in different senses (MacGregor et al., 2015; Pylkkänen et al., 2006). In this case, there are not independent representations for each sense, but one representation from which different information is extracted and deployed for interpretation when the word is used in different senses. This unitary representation consists in overlapping clusters of semantic features (Brocher et al., 2016; Klepousniotou et al., 2012). These are parts of internally structured stereotypes (situation schemas) or overlapping stereotypes (Elman, 2009; cf. Fischer and Sytsma, 2021).
The verbal stimulus activates different components of these unitary representations with different strengths (Brocher et al., 2018): The more often the language user encounters the word in one sense rather than another, the more strongly the features associated with that sense are activated when the user encounters the word. Another factor influencing strength of activation is prototypicality: Features deemed to make for particularly good examples of the relevant category (say, “seeming” events) are activated more rapidly and strongly (Hampton, 2006). Strength of activation thus depends on the “linguistic salience” of the sense at issue: Unlike the contextual salience involved in familiar salience biases (for a classical review, see Taylor and Fiske, 1978), this is not a contextual magnitude, but a function of relative exposure frequency over time (how often the word is encountered in this sense rather than another), modulated by prototypicality (i.e., how good an example of the relevant category the word, in that sense, is deemed to stand for) (cf. Giora, 2003). Where polysemous words activate such unitary representations, individual uses of them are interpreted with the retention/suppression strategy (Giora, 2003): The contextually irrelevant features are suppressed (cf. Faust and Gernsbacher, 1996) and the relevant features are retained for interpretation. Fischer and colleagues (2021a, b) suggest this applies to the appearance verbs “look,” “appear,” and “seem”: In their intransitive use, all three verbs share the same sense (“give a certain impression or have a certain outward aspect,” WordNet 3.1),3 which is far more frequent than any other sense associated with an intransitive use (ibid.).
This suggests that, in conjunction with the relevant syntactic cues (Goldberg, 2003), all three verbs rapidly activate the same associated situation schema (“appearance schema”) that is built up from prototypical situations where viewers look at something, it looks thus-and-so to them, and they believe (indeed, know) it is thus-and-so. This situation schema then constitutes the relevant “unitary representation.” As noted above (Sect. 8.2.3), distributional semantic analysis suggests that “seem” and “appear,” and to a lesser extent “look,” are most frequently used to attribute doxastic attitudes (belief), less frequently used to attribute epistemic attitudes (knowledge), and yet less frequently to attribute experiential attitudes (Fischer et al., 2015). We infer that doxastic, epistemic, and experiential patient features are integrated with decreasing strength into the “appearance schema,” and that this situation schema is deployed to interpret different uses of appearance verbs with the retention/suppression strategy. To spell this out: The entire schema is initially activated, but different schema components are retained, depending upon which sense is relevant in the given context. If the word is used in a purely doxastic sense (“The investment risks look small”), the experiential components (S looks at X, X visually looks F to S) are suppressed and the doxastic component (S thinks X is F) retained. (Where the patient S to whom the risks look small remains implicit, this is taken to be the speaker, so that the utterance is interpreted as “I think the investment risks are small.”) Conversely, where the word is used in the phenomenal sense, as in the argument from illusion (“The round coin looks elliptical”), the doxastic and epistemic schema components are suppressed, and the experiential component is retained (“The viewer looks at the coin and it visually looks elliptical to her”).
General principles of activation suggest that disregarding initially activated but contextually irrelevant component features is not generally a problem: In the absence of contextual support, default inferences tend to decay (Oden and Spira, 1983) and do not influence further cognition. Moreover, conflicts of default inferences with contextual information or background beliefs can lead to suppression, which is effortful (Faust and Gernsbacher, 1996) but can, even so, be completed within one second (Fischer and Engelhardt, 2017a). Suppression runs into difficulties, however, in the face of marked imbalances in linguistic salience: Suppose features irrelevant for the subordinate sense are associated with the clearly dominant sense. First, due to this sense’s high linguistic salience, these irrelevant features will initially be strongly activated by the verbal stimulus (Brocher et al., 2018). (For example, suppose the doxastic (belief) features irrelevant for the phenomenal use of appearance verbs are associated with the clearly dominant sense, since they are most strongly integrated into the situation schema that gets activated by that sense.) Second, frequently co-instantiated features exchange lateral co-activation (Dresang et al., 2021; Hare et al., 2009; McRae et al., 2005). This, of course, also applies to features associated with the dominant sense. Where only some, but not all, of them are relevant for a subordinate use (like the phenomenal use of an appearance verb), the contextually relevant features will continue to pass on activation to the contextually irrelevant features. Therefore, strong initial activation of contextually irrelevant features is followed by their continued cross-activation. This prevents their selective suppression. Merely partially suppressed features continue to support stereotypical inferences.
This difficulty about selective suppression creates a linguistic salience bias (Fischer and Engelhardt, 2019, 2020; Fischer and Sytsma, 2021). When (i) one sense of a polysemous word is much more salient than all others, (ii) the retention/suppression strategy is used to interpret utterances with a subordinate use, and (iii) some, but not all, of the features associated with the dominant sense are relevant for the interpretation of the subordinate use, then

1. contextually inappropriate stereotypical inferences supported by the dominant sense will be triggered by the subordinate use as well, and
2. these automatic inferences will influence further judgement and reasoning.

In a nutshell, thinkers are then swept along by defeasible inferences, even when these are defeated by the context.

8.3.3 Formulating Research Questions and Hypotheses

Our research objective is to support the proposed explanation of the argument from illusion that identifies a fallacy of equivocation in its opening step (Sect. 8.2.4). As you recall, the account suggests that phenomenal uses of appearance verbs, in the argument’s initial premise, trigger stereotypical inferences to belief attributions (“doxastic inferences”) that are supported by the associated situation schema but would have needed to be suppressed. (To repeat, the phenomenal sense is meant not to imply beliefs, and the chosen situations of non-veridical perception are typically so familiar that nobody would believe the thing has the property it appears to have, under those circumstances.) The account further suggests that, due to linguistic salience bias, suppression remains incomplete, the contextually inappropriate doxastic inference goes through, and the conclusion is presupposed in further reasoning.
To support this account, we need to show that phenomenal uses of appearance verbs meet the three conditions—(i) to (iii) above—for linguistic salience bias, and that doxastic inferences are triggered by these uses (as per (1) above) and go on to influence further judgement and reasoning (as per (2) above), in contexts that are relevantly similar to those provided by the initial premises of arguments from illusion. A number of previous studies provided evidence that conditions (i) and (iii) are met, while (ii) remains speculative (Fischer et al., 2015; Fischer and Engelhardt, 2016; Fischer et al., 2021a, b). These studies also provided evidence that appearance verbs trigger stereotypical inferences to belief attributions, and that these influence further judgement (see below, Sect. 8.4). However, all these studies fall short in one important way: In typical statements of the argument from illusion (e.g., “When viewed sideways, the coin appears elliptical”), text that appears before the verb (the pre-verbal context, e.g., “When viewed sideways”) strongly suggests that familiar conditions of non-veridical perception obtain, in which the object of sight is unlikely to look its true shape, size, or colour, and in which no competent viewer will think it does. This defeats, or at any rate strongly challenges, any doxastic inference the reader might make from the verb, even before the verb occurs. One might well expect such pre-verbal contexts to facilitate complete suppression of the doxastic inference, or even to pre-empt the inference. But extant studies do not provide any such pre-verbal context. They thus fail to show that cognitively influential belief inferences are made from appearance verbs in contexts that are relevantly similar to those provided by arguments from illusion. Indeed, a second shortcoming prevents extant studies from attaining our stated research objective.
These studies all ask participants to read texts in which the appearance verb is not preceded by any informative context at all—items like (1) and (2), and (1′) and (2′) below (Sect. 8.4.1). The fact that the verb is used in a phenomenal sense, or at any rate in a sense that does not imply any belief, only becomes apparent from text following the verb (“post-verbal context”). In ways we discuss below (Sect. 8.4), these studies then provide evidence that the verb triggers a doxastic inference and that readers do not subsequently manage to suppress the conclusion of this inference and reinterpret the use of the verb as phenomenal—even though the post-verbal context clearly invites such reinterpretation. But one might well object that the lack of informative pre-verbal context actually supports the initial doxastic inference: In the absence of an informative context, stereotypical enrichment will ensure that we fill in detail in line with prototypical viewing conditions, where things look their true colour, size, or shape—and the viewer rightly thinks they do. These studies may hence suggest conditions that are the exact opposite of those relevant in arguments from illusion. These two shortcomings of extant studies prevent us from attaining our stated research objective. They thus motivate the research questions to be addressed by the present study:

RQ1: Do appearance verbs trigger doxastic inferences that influence further cognition, also where the pre-verbal context does not support such inferences?

RQ2: Do appearance verbs trigger doxastic inferences that influence further cognition, even where pre-verbal contexts defeat such inferences?

Once you have thus obtained your research questions, you will scour the literature for evidence supporting competing answers, to motivate hypotheses. Since
this is a textbook in methodology for experimental philosophy, rather than in psycholinguistics, we will be brief, and merely note that the linguistic salience bias hypothesis developed above (Sect. 8.3.2) entails that, under the conditions indicated, default inferences supported by the dominant sense will influence further cognition, regardless of context—subsequent or prior. This suggests the affirmative answers to our research questions that the proposed reconstruction of the argument from illusion relies on:

H1: All three appearance verbs (i) trigger doxastic inferences that (ii) influence further cognition, where the pre-verbal context neither supports nor defeats them.

H2: These persistent inferences occur even after pre-verbal contexts that defeat them.

8.4 Developing the Approach

Hypotheses typically do not allow us to just “read off” how evidence can be garnered for or against them. The most fundamental task facing the experimentalist is that of developing an experimental approach that allows us to generate pertinent evidence. This involves two key questions: What evidence is relevant? And how can this evidence be obtained? To answer these interconnected questions, we need to get clear on what observable or measurable difference to the world it would make if our hypothesis (say, H1) was true, rather than false. For example, what difference to the observable behaviour of speakers of English would it make if the appearance verbs “look,” “appear,” and “seem” triggered doxastic inferences (rather than if they did not)? To what stimuli would such people respond differently, and in what ways, if the hypothesis was true, rather than false? Experiments bring about changes in the world and observe the effects these changes have. They do so in a rigorous manner that is designed to exclude, as much as possible, the influence of other factors that could influence what happens when those changes are brought about.
This is done to ensure as much as possible that what happens is indeed an effect of the changes brought about. In a behavioural experiment, participants will be exposed to different kinds of stimuli (say, to different linguistic utterances) or will be exposed to stimuli under different conditions (e.g., there may be a lot of background noise or little background noise). The experimentalist thus systematically varies stimuli in certain dimensions (say, with some stimulus utterances using the appearance verb “look,” where others use “appear” or “seem,” or with some utterances being presented in a high-noise environment and others in a low-noise environment). These dimensions are known as the “variables” that are manipulated (e.g., “verb” or “background noise” could be relevant variables). If the hypothesis is true, the manipulation of these variables should affect people’s responses in specific ways (e.g., people will take longer to respond when one verb is used, rather than another, or the pupils of their eyes will dilate more when stimuli they have to understand are presented amidst a lot of background noise, rather than less background noise). The experimentalist will measure the relevant features of the responses (say, response times, or pupil diameter). If the hypothesis is true, these features should be influenced by, or depend on, the variables manipulated. The variables measured are therefore known as “dependent variables” (DVs), while the variables manipulated by the experimenter are called “independent variables” (IVs). To be able to assess a hypothesis, an experiment has to manipulate independent variables that have a certain measurable effect on the dependent variables, if the hypothesis is true—and a different effect, or no effect at all, if the hypothesis is false.
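To make the IV/DV distinction concrete, here is a minimal sketch in Python of how an experiment’s data might be laid out. The trial records, variable names, and helper function are our own invention for illustration; the numbers are made up, not real data:

```python
# Sketch: trials of a hypothetical experiment. IVs (manipulated): "verb" and
# "noise"; DV (measured): "response_time_ms". All values are illustrative.
trials = [
    {"verb": "look", "noise": "high", "response_time_ms": 812},
    {"verb": "look", "noise": "low",  "response_time_ms": 701},
    {"verb": "seem", "noise": "high", "response_time_ms": 845},
    {"verb": "seem", "noise": "low",  "response_time_ms": 730},
]

def mean_dv(trials, **iv_levels):
    """Mean of the DV over all trials matching the given IV levels."""
    matching = [t["response_time_ms"] for t in trials
                if all(t[iv] == level for iv, level in iv_levels.items())]
    return sum(matching) / len(matching)

# Comparing DV means across levels of one IV:
mean_dv(trials, noise="high")  # 828.5
mean_dv(trials, noise="low")   # 715.5
```

If the hypothesis that background noise increases processing effort is true, the DV mean should differ between the two noise levels, as in these invented numbers.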
We noted that the first question we face in developing an experimental approach to assess a hypothesis is to what stimuli would people respond differently, in what ways, if the hypothesis was true, rather than false? We now see this key question needs to be answered by identifying independent variables whose manipulation will, according to the hypothesis, influence certain dependent variables that can be measured. For many types of hypotheses—say, about automatic comprehension inferences—that key question has already been faced by many other researchers, and answers are available in the shape of established experimental paradigms. A “paradigm,” in the relevant sense, is an experimental approach that allows us to address certain questions. It is defined by a key idea about how to generate experimental evidence for or against hypotheses of interest: about which IVs to manipulate, when giving people certain kinds of stimuli and tasks to address, and what effects to measure. Paradigms may evolve. In particular, an improved understanding of the effects of the manipulation often leads to the identification of further DVs, which can be measured with different methods. In this case, a paradigm comes to be “implemented with different methodologies.” For example, we will review how the psycholinguistic “cancellation paradigm” for examining automatic comprehension inferences can be implemented with methodologies including pupillometry, reading time measurements (using different techniques), and simple questionnaire-based measures. This leads to the use of different DVs, including pupil diameter and different kinds of reading times. Wherever experimentalists do not need to (re-)invent the wheel, the first task facing them is to identify a suitable paradigm to examine their hypothesis. A trawl of the empirical literature on the phenomenon of interest will typically reveal relevant paradigms. 
Searchable databases of psychology research publications like APA PsycNet, Scopus, Web of Science, and Google Scholar are useful for the purpose. The extant work will tell you which independent variables are potentially relevant, and which dependent variables they may influence. Paradigms are underpinned by empirical research that suggests links between independent and dependent variables. These links may be more or less well understood: At the shallow end of the understanding spectrum are links that have merely been reliably observed, but for which we still lack an explanation of why they causally obtain. The better the available explanations of this kind are, the better our understanding of the relevant links becomes. As a default, you will prefer to use paradigms that are underpinned by a more thorough understanding of the relevant links. Paradigms are not quite prêt-à-porter, but need tailoring to specific research needs. After identifying a suitable paradigm, the experimentalist therefore faces the second task of ascertaining how best to implement it. The bulk of this key section walks you through these steps in a way that simultaneously provides an overview of methodologies you can use to study automatic comprehension inferences: We identify and discuss a paradigm that permits us to assess hypotheses about such inferences (Sect. 8.4.1); we then review different implementations that introduce different DVs (Sect. 8.4.2); and, finally, we explain how we adapted one implementation to assess our hypotheses H1 and H2 (Sect. 8.4.3).

8.4.1 Cancellation Paradigm

Psycholinguists have developed the so-called cancellation paradigm to examine hypotheses about what inferences automatically occur when we read texts or hear people speak.
The key idea behind the cancellation paradigm is to make hypotheses about automatic comprehension inferences empirically tractable by engineering clashes between the conclusion of the hypothesised inference and subsequent text: The key independent variable (IV) that gets manipulated is the presence vs. absence of a conflict between hypothesised inferences and subsequent text. The most straightforward way of manipulating this variable is by constructing texts with the expression of interest, where the subsequent text is either inconsistent or consistent with the inference the expression is hypothesised to trigger. Suppose, for example, we want to examine whether the appearance verb “seem” triggers inferences to attributions of beliefs, namely, from sentences like “The dress seemed blue to Hannah” to conclusions like Hannah believed the dress was blue or, more generally, from expressions of the form “X seems blue to S” to conclusions of the form S believes that X is blue. The cancellation paradigm then has us present participants with short texts (“items”) like this:

(1) The dress seemed blue to Hannah. She thought it was green. (s-inconsistent)
(2) The dress seemed blue to Hannah. She thought it was blue. (s-consistent)

In item (1), the second sentence is inconsistent with the hypothesised stereotypical inference from the verb in the first sentence: The hypothesised inference has us conclude that Hannah thought the dress was blue. The second sentence tells us she thought it was green. So, the conclusion of the hypothesised inference is wrong. The second sentence thus “cancels” the inference (hence the name “cancellation paradigm”). In item (2), by contrast, the second sentence is consistent with the hypothesised stereotypical inference.
(In this chapter, we say the item is “stereotype-consistent” or “s-consistent”; similarly, we call items like (1) “s-inconsistent.”) If the verb in the first sentence of these items triggers a stereotypical inference to the conclusion that Hannah believed the dress was blue, the second sentence in the s-inconsistent item (1) will clash with this expectation and create the impression of a conflict that will be absent from s-consistent items like (2). Alternatively, or in addition, the absence vs. presence of a conflict can be manipulated by exchanging the expression of interest for a contrast expression that fits the sentence context equally well in all other respects, but does not trigger the inferences of interest:

(1′) The dress seemed blue. Hannah thought it was green.
(2′) The dress was blue. Hannah thought it was green.

Appearance verbs are often used without making explicit the “patient” to whom things appear thus-and-so. We typically have little trouble inferring them from the context. If “seemed” triggers the doxastic inference and is interpreted as “seemed to Hannah” (the only person the text refers to), it will trigger an inference that will clash with the s-inconsistent sequel “Hannah thought it was green.” This will result in a perceived conflict in (1′) that will be absent from (2′). Where the two manipulations, of expression of interest and sequel, can be combined, findings may strengthen the case for the intended interpretation of perceived conflicts, namely, that they are indeed due to inferences triggered by the expression of interest (here, “seems”).
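The crossing of the two manipulations can be sketched in a few lines of Python. The helper function and its name are our own illustration; the wording follows the sample items above:

```python
# Sketch: crossing the two cancellation-paradigm manipulations, verb choice
# (expression of interest vs. neutral contrast) and sequel consistency,
# to generate the item set. Helper and dictionary names are ours.
def make_item(verb_phrase, sequel_adjective):
    return f"The dress {verb_phrase} blue. Hannah thought it was {sequel_adjective}."

verbs = {"interest": "seemed", "contrast": "was"}
sequels = {"inconsistent": "green", "consistent": "blue"}

# One item per cell of the 2 x 2 design:
items = {(v, s): make_item(verbs[v], sequels[s])
         for v in verbs for s in sequels}

items[("interest", "inconsistent")]
# 'The dress seemed blue. Hannah thought it was green.'
```

In practice, researchers construct many such item sets with different nouns and adjectives, so that effects cannot be traced to one particular sentence.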
This illustrates how conceptually the same variable (consistency of sequel with inferences from prior verb) can be manipulated in different ways, by manipulating different parts of the items (here, the verb in the first sentence or the adjective in the second sentence), thus yielding what are technically distinct IVs (here, verb and consistency of sequel with stereotypical inference from expression of interest). The cancellation paradigm is built on the assumption that the presence vs. absence of a perceived conflict is bound to have measurable effects that will provide us with suitable DVs. For example, perceived conflicts give rise to signature electrophysiological responses, studied by research on event-related brain potentials (ERPs; Kutas and Federmeier, 2011). Different kinds of expectation violations have been found to give rise to different characteristic responses (positive vs. negative deflections from standard wave forms, which peak at different times after presentation of the stimulus that triggers them): Violations of expectations based on syntactic rules tend to give rise to positive deviations from the standard wave form that peak 600 ms after stimulus onset (P600), whereas violations of expectations based on semantic or world knowledge tend to prompt negative deviations that peak 400 ms after stimulus onset (N400). Despite the great cost of ERP experiments, the cancellation paradigm has therefore often been implemented with ERP measures, and these implementations have sometimes been regarded as affording the “gold standard.” However, it is poorly understood why the different kinds of conflict should give rise to these different responses. Indeed, there is considerable debate about how these measures should be interpreted (Frenzel et al., 2011; Lau et al., 2008). Until these debates have been resolved, it may therefore be prudent for experimental philosophers to prefer alternative implementations of the cancellation paradigm. 
Other implementations of the paradigm build on the idea that perceived conflicts create comprehension difficulties. Regardless of how these difficulties are resolved, they take effort to overcome. For example, if the hypothesis that the verb “seem” triggers doxastic inferences is true, readers will have to expend more effort to understand s-inconsistent items like (1) than s-consistent items like (2). Cognitive effort is reflected in a variety of eye-tracking measures that can be used as DVs. In this chapter, we therefore focus on implementations of the cancellation paradigm with eye-tracking methods (for another example-rich introduction to philosophical uses of such methods, see Fischer and Engelhardt, 2019). We first review eye-tracking measures that can tap into the relevant cognitive processes as they unfold and thus provide “online” or process measures. We then turn to “offline” or outcome measures that assess outcomes of those cognitive processes. The latter facilitate convenient questionnaire-based implementations of the paradigm that require no laboratory facilities.

8.4.2 Different Implementations

8.4.2.1 Pupil Size

The pupils of our eyes respond sensitively not only to changes in lighting but also to how much cognitive effort is required by the task we are engaged in. Responses to task demands are highly correlated with neural activity in the locus coeruleus, a key node within neural circuitry that controls the muscles of the iris (Samuels and Szabadi, 2008) and mediates the functional integration of the whole attentional brain system (Corbetta et al., 2008). Pupil diameter reliably increases with the “intensity” of attention or cognitive effort devoted to a task (Kahneman, 1973). These pupil responses are spontaneous and impossible to suppress at will (Loewenfeld, 1993).
They are triggered also by subliminally presented stimuli the subject is not aware of (Bijleveld et al., 2009) and regularly start well before any conscious task response. They thus provide a window into preconscious stages of processing (Laeng et al., 2012; Sirois and Brisson, 2014). In language comprehension, difficulties which require cognitive effort to overcome arise from several sources. While psycholinguists have only recently begun to take up pupillometry on a larger scale, pupil responses have been found sensitive to the length and syntactic complexity of sentences (Piquado et al., 2010) and to differences in the intelligibility of speech due to interfering noise (Zekveld and Kramer, 2014), where dilations peak at medium levels of interference, suggesting fewer resources are allocated when the task becomes too difficult. The level of difficulty also depends upon the predictability of new text in the light of old: Comprehension is easier where subsequent concepts are activated by previous words, either through associative priming (based on co-occurrence of words, as when “cat” activates the frequently co-occurring concept mouse) or semantic priming (based on activation of schemas and semantic knowledge, more generally). Accordingly, pupil responses have been found responsive to “surprisal,” that is, the predictability of the next word in a sentence, given the previous words (Levy, 2013). One common approach uses recurrent neural networks to estimate this predictability based on co-occurrence frequencies of words in large linguistic corpora. Low predictability and conflict with expectations result in wider pupil dilations (Frank and Thompson, 2012). Similarly, pupil dilations have been found in response to violations of expectations from “scripts” (social event schemas) (Raisig et al., 2012) and from event schemas (Fischer and Engelhardt, 2020).
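The notion of surprisal has a simple formal core: the surprisal of a word is the negative log probability of that word given its context. A toy sketch in Python, with bigram counts invented for illustration (real estimates come from large corpora or language models, not from hand-picked counts like these):

```python
import math

# Toy bigram statistics (invented): how often each word follows "the",
# and how often "the" occurs as a context at all.
bigram_counts = {("the", "cat"): 40, ("the", "dress"): 10}
context_counts = {"the": 50}

def surprisal(context, word):
    """Surprisal in bits: -log2 P(word | context), from toy counts."""
    p = bigram_counts[(context, word)] / context_counts[context]
    return -math.log2(p)

surprisal("the", "cat")    # ~0.32 bits: a predictable continuation
surprisal("the", "dress")  # ~2.32 bits: less predictable
```

On the findings cited above, the less predictable continuation (higher surprisal) would be expected to produce the wider pupil dilation.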
More generally, where new text clashes with conclusions from prior inferences, suppression is required and costs cognitive effort (Faust and Gernsbacher, 1996), resulting in larger pupil dilations. Indeed, pupil responses may index conflict monitoring yet more reliably than cognitive effort (Ptok et al., 2021; van Steenbergen and Band, 2013). This further strengthens the case for implementing the cancellation paradigm with pupillometry, when seeking to examine stereotypical inferences. Larger pupil dilations in response to stereotype-inconsistent items than in response to otherwise similar (equally long and complex) stereotype-consistent items can therefore provide evidence of stereotypical inferences. While pupil responses begin instantaneously, the moment a difficulty is encountered, they take time to unfold: Human pupils need a full second to expand to their maximal size (Engelhardt et al., 2010). (Accordingly, pupil diameter is sometimes called a “pseudo-online measure.”) Since they do not respond at uniform speed to all kinds of difficulties, pupil diameter may, in this period, be influenced also by difficulties preceding or following the difficulty of interest. To minimise such influence, one can create the difficulty at the end of items and compare mean pupil sizes in the second after the item was presented, with mean sizes in the previous time window (Fischer and Engelhardt, 2020). Evidence of stereotypical inferences is then provided by larger increases in pupil diameter after stereotype-inconsistent items than stereotype-consistent counterparts. The pupil is far more responsive to luminance variations than to changes in cognitive load (Beatty and Lucero-Wagoner, 2000). 
Since the presentation of reading items on ordinary-sized computer screens involves luminance differences as eyes move from the beginning of the sentence (when the visual field extends beyond the screen) to the centre of the screen, only a few pupillometric investigations into language processing employ reading tasks (e.g., Frank and Thompson, 2012; Raisig et al., 2012). The luminance issue can partially be overcome with “rapid serial visual presentation” (RSVP), a technique in which each word in a sentence is presented sequentially in the centre of the screen (Potter, 2018). Obviously, this is different from normal reading situations, in which the full text remains visible and can be re-read. There is also the issue of how fast the words in RSVP should be presented. Most studies therefore employ auditory stimuli and present participants with recordings of a speaker uttering the items, while participants look at a fixation cross in the middle of an otherwise dark screen in front of them. Pupil diameter is then measured at the sampling rate of the eye tracker used, which can be up to 1000 Hz (every millisecond) or more (cf. Sect. 8.6, below). The use of pupillometry in experimental philosophy is illustrated by two studies on inappropriate stereotypical inferences from perception verbs (Fischer and Engelhardt, 2017a, 2020).

8.4.2.2 Fixation Times

This takes us to our next process measure: fixation times in reading. Contrary to a common folk conception, reading is not a sequential process in which each word in a sentence is read, one after the other, as they appear in the text, at roughly the same pace. Instead, the eye moves in stops (fixations on words) and starts (saccades).
Readers tend to fixate most, but not all, words, as their eyes move forward (skipping the words easiest to predict from the context) and backwards (“regressions,” i.e., right-to-left eye movements, occur most often at points of difficulty). Information is extracted from the text when the eyes do not move. Fixation times on words reflect the processing of information from those words. Higher fixation times reflect increased processing difficulties. Since our eyes skip forwards and backwards in reading, the same word or region of interest may be fixated several times. Accordingly, we can take different measures. (Definitions of reading time measures differ across research labs and software packages; we use the definitions from Clifton et al., 2007.) Dependent measures employed in the study of automatic inferences include:

– First fixation duration: Length of the first fixation on a region of interest. Since long words may require several fixations, this is to be distinguished from the
– First-pass reading time: The sum of all fixations in a region of text, from first entering that region until leaving that region either in a forward or backward direction
– Regression path duration: The time from first entering a region until moving past that region forward in the text (unlike first-pass reading time, this also includes time spent on fixations following regressions out of the region)
– Second-pass reading time or summed re-reading time: The sum of all fixations in a region following the initial first-pass reading
– Total reading time: The sum of all fixations in a region

First fixation duration and first-pass reading times are often called “early reading times”; second-pass and total reading times are called “late reading times.” The interpretation of these measures is still a matter of some debate (Liversedge and Findlay, 2000; Staub, 2015). The following suggestions, however, capture reasonably widely accepted rules of thumb. Different reading time measures are indicative of different kinds of difficulty, at different stages of processing.
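Two of the measures just defined can be made precise in a few lines of code. This is a sketch over invented fixation data with helper names of our own, not the algorithm of any particular analysis package:

```python
# Sketch: computing first-pass and total reading time from a fixation record.
# Each fixation is (region, duration_ms), in temporal order; regions are
# numbered left to right. Data and names are our own illustration.
def first_pass_time(fixations, region):
    """Sum of fixations in the region from first entry until it is left."""
    total, entered = 0, False
    for r, dur in fixations:
        if r == region:
            entered = True
            total += dur
        elif entered:       # region has been left: first pass is over
            break
    return total

def total_reading_time(fixations, region):
    """Sum of all fixations in the region, including re-reading."""
    return sum(dur for r, dur in fixations if r == region)

# A reader fixates region 2, moves on, then regresses back to it:
fixations = [(1, 210), (2, 250), (3, 200), (2, 180), (4, 230)]
first_pass_time(fixations, 2)     # 250 (first pass only)
total_reading_time(fixations, 2)  # 430 (includes the re-reading fixation)
```

The difference between the two values (here, 180 ms of re-reading) corresponds to the summed re-reading time that the cancellation paradigm relies on as a late measure.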
In reading, we need to recognise words, integrate them into local interpretations of a few adjacent words, and then integrate such local interpretations into more comprehensive interpretations of entire sentences or even paragraphs. The first two steps are mandatory, and the third is optional: According to the “good enough processing” approach that informs much eye-tracking research on reading (Ferreira et al., 2002; Ferreira and Patson, 2007) and is consistent with broader trends in cognitive science (Ferreira and Lowder, 2016), readers immediately recognise words and construct local interpretations; if the task at hand demands it, and only then, they subsequently integrate these local interpretations into more comprehensive interpretations. The difficulty of word recognition depends mainly on the word’s frequency, length, and predictability in the (local) context (Clifton et al., 2016; Rayner, 1998). These “big three” variables largely determine first-pass reading times. A backward eye movement (regression) upon first fixation may indicate difficulty in integrating the word into a local interpretation. The regression path duration (the sum of [1] all fixations on a word or in a region before moving to the right, plus [2] all fixations made following a regression to an earlier region) then reflects the effort required to overcome this difficulty. By contrast, difficulties in integrating the local interpretation of a sentence region into a more comprehensive interpretation will show up only in increased second-pass reading times for the region, and a higher number of saccades from it to other text (Clifton et al., 2007b; Rayner et al., 2004). Difficulties arising from one sentence region may “spill over” and lead to longer total reading times for the next (n + 1) region (Rayner et al., 2004).
The cancellation paradigm provides evidence of stereotypical inferences by engineering clashes of their conclusions with sequels. Consider again our stereotype-inconsistent sample item:

(1) The dress seemed blue to Hannah. She thought it was green.

Here, the clash arises from an inference that (as per the hypothesis) is triggered by the verb “seemed” and leads to the conclusion that Hannah believed the dress was blue. The clash arises only once the local interpretation of the text including the verb and its complements in the first sentence has been constructed. The conclusion that Hannah believes the dress is blue then clashes with information conveyed by the second sentence, where the clash becomes apparent only on the final word (“She thought it was green”). The clash thus creates a difficulty at the stage of integrating local interpretations of the first and the second sentence, respectively, into a comprehensive interpretation of the overall text. This should be indicated by summed re-reading times for the (one-word) conflict region where the conflict becomes apparent (“green”): These re-reading times should be higher than summed re-reading times for the analogous region in the stereotype-consistent counterpart (2) (“blue”). However, the clash may prompt re-reading also of the source region from where the difficulty seems to arise (“seemed blue”) and lead to higher summed re-reading times for this region or its key part (the source adjective “blue”) (Fischer et al., 2021b). Evidence of stereotypical inferences can be provided by higher re-reading times for either the conflict region or the source region in stereotype-inconsistent items than in stereotype-consistent counterparts. Due to spill-over effects, further evidence may be provided by higher second-pass reading times for the immediately next region in the sentence (e.g., “to Hannah”).
Researchers should also be aware that increased reading times are observed at the very last word in a sentence, a phenomenon known as “sentence-final wrap-up,” when readers pause briefly at the end of a sentence before moving on to the next. This can show up either in first-pass reading times or in late reading times. This chapter presents a worked example illustrating the use of fixation time measurements (below, from Sect. 8.4.3).

8.4.2.3 Self-Paced Reading Times

Pupillometry and fixation time measurements require an eye tracker, involve bringing participants into a lab, and demand a lot of time and effort, requiring 20 to 30 min per participant and the presence of an experimenter throughout. While the quality of the data is often worth the price, it may be a good idea to first test new hypotheses with less costly implementations of the cancellation paradigm that can be delivered online. In order to examine reading times without an eye tracker (i.e., “behaviourally”), psycholinguists often use “self-paced reading” paradigms (Jegerski and VanPatten, 2014). In this paradigm, participants read text on a screen, where only one word at a time is visible. In a typical set-up, participants see a series of underlines across the screen. These underlines correspond in length and spacing to the words in the text to be read—typically a sentence. When the participant presses the spacebar, the first word in the sentence becomes visible and its underline disappears. The second time the participant presses the spacebar, the second word in the sentence appears and the first word disappears. Each subsequent spacebar tap reveals a new word, as the prior word disappears. The time between each press of the spacebar is recorded and is indicative of the reading time (Wagers et al., 2009). In this way, self-paced reading delivers a reading time for each word of the sentence.
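The timing logic of this “moving window” set-up can be sketched in a few lines. The function name and the sample keypress timestamps below are illustrative, not taken from any particular stimulus-presentation package: each spacebar press both reveals word n and removes word n − 1, so the interval between press n and press n + 1 is the reading time for word n.

```python
def self_paced_reading_times(press_times_ms, words):
    """Derive per-word reading times from spacebar-press timestamps.

    press_times_ms[0] reveals words[0]; each later press replaces the
    current word with the next one; a final press after the last word ends
    the trial, so len(press_times_ms) == len(words) + 1.
    Returns a list of (word, reading_time_ms) pairs.
    """
    assert len(press_times_ms) == len(words) + 1
    return [(word, press_times_ms[i + 1] - press_times_ms[i])
            for i, word in enumerate(words)]

words = ["The", "dress", "seemed", "blue"]
presses = [0, 310, 655, 1103, 1501]  # ms since trial onset (hypothetical)
rts = self_paced_reading_times(presses, words)
# rts: [("The", 310), ("dress", 345), ("seemed", 448), ("blue", 398)]
```

Note that this yields exactly one reading time per word; unlike eye tracking, there is no way to separate first-pass from re-reading time, which is the limitation discussed next.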
However, participants cannot go backwards and re-read earlier text. While more convenient, self-paced reading therefore does not provide as many or as fine-grained dependent measures as fixation time measurements.

8.4.2.4 Plausibility Ratings

We now turn from process to outcome measures. We first explain why these simple measures are important (rather than merely convenient) and then discuss plausibility ratings as a first, straightforward outcome measure. In the experimental-philosophical study of inferences, outcome measures should complement eye-tracking measures. Whereas psycholinguists are primarily interested in how texts and utterances are processed, philosophers are mainly interested in identifying automatic inferences that influence further cognition: we are interested in how these inferences influence judgements about verbally described cases in philosophical thought experiments, or verbal reasoning in philosophical arguments. A particular attraction of eye-tracking measures is that they tap into comprehension processes as they unfold and can pick up automatic inferences as they get triggered in reading or listening. But they do not tell us whether any of the inferences picked up go on to influence further cognition: As we have seen (in Sect. 8.3.2), people are generally good at suppressing the conclusions of automatic default inferences where they are defeated by contextual information or clash with background beliefs; such conclusions can then be completely suppressed within one second (Fischer and Engelhardt, 2017a). Where this happens, they do not influence further cognition. Mere process measures can only provide evidence that automatic inferences are triggered, but not evidence that they influence further judgement and reasoning. To spell out this important limitation: the cancellation paradigm tests for hypothesised inferences by engineering clashes with subsequent text that engender comprehension difficulties.
Different process measures reflect the effort readers or listeners must make to overcome these difficulties. In the cases that interest us, the comprehension difficulty can be overcome by suppressing the conclusion of the inference. For example, in items like (1) “The dress seemed blue to Hannah. She thought it was green,” the problem can be overcome by suppressing the doxastic inference from “seemed” and interpreting the word as merely characterising how the dress visually appeared to Hannah, perhaps under some odd lighting conditions, without implying that Hannah held any specific belief about the dress. This interpretation—in line with what philosophers call the “phenomenal sense” of these verbs—allows readers to coherently integrate the information from all parts of the item and resolve the comprehension difficulty. Such interpretations are in principle available in all cases in which the linguistic salience bias hypothesis predicts inappropriate inferences from subordinate uses of polysemous words. In these cases, which are of particular interest to us, the effort made to overcome comprehension difficulties is best interpreted as effort to suppress the inappropriate stereotypical inference. The relevant eye-tracking measure can then be interpreted as indicating the extent of suppression effort. Larger pupil dilations and higher late reading times thus demonstrate that there was a need to suppress the hypothesised inference, and thereby provide evidence that the inference was triggered. If these suppression efforts are completely successful, however, the inference for which they provide evidence will not influence further cognition. To what extent suppression is successful depends not only upon the amount of effort expended but also on the extent of the obstacles hindering it—e.g., the strength of linguistic salience bias. We therefore cannot infer the success of suppression from the extent of suppression effort.
Extent and success of suppression effort are two distinct magnitudes and need to be measured independently. The eye-tracking measures we reviewed reflect the amount of effort required to overcome comprehension difficulties. Specifically, summed re-reading times are indicative of the effort required to overcome integration difficulties. Where integration difficulties can be resolved by suppressing an automatic inference (as in experiments that use the cancellation paradigm to examine contextually inappropriate stereotypical inferences), this measure can be interpreted as measuring the extent of suppression effort (which will be reflected less neatly also in other measures, including pupil dilations and total reading times). To assess the success of suppression, the measure of choice is provided by plausibility judgements that are “unspeeded,” i.e., elicited without putting participants under time pressure. To illustrate, consider items that place a viewer in conditions that speak neither for nor against the conclusion that the viewer believes that things are as they look. Such conditions need to be made explicit, since stereotypical enrichment will ensure that items that specify no viewing conditions at all (like 1 and 2 above) will be spontaneously interpreted as invoking stereotypical viewing conditions under which things look their true colour, size, or shape. As we shall presently see (in Sect. 8.5), suitably “neutral” viewing conditions are specified by:

(1*) The lighting in the room was odd. The hostess’s dress looked blue to Hannah. She thought it was green.
(2*) The lighting in the room was odd. The hostess’s dress looked blue to Hannah. She thought it was blue.

Under these conditions, it is every bit as likely that the dress will be blue as that it will be another colour.
Since viewers typically have insight into odd lighting conditions, these conditions render it equally likely that Hannah thinks the dress is blue or another colour, a plausible candidate being green. Suppose eye-tracking measures provide evidence that “looked blue to Hannah” triggers an inference to Hannah believed the dress was blue: say, we observe higher late reading times for the source region “looked blue” in (1*) than in (2*), which we readily interpret as reflecting suppression effort. Then plausibility ratings let us gauge whether the success of suppressing this inference has been complete—or only partial, as the linguistic salience bias hypothesis would predict: If success has been complete, readers win through to a purely phenomenal interpretation of the appearance verb in (1*); on such an interpretation, (1*) is as likely to be true as (2*). By contrast, if suppression remains partial and the inference’s conclusion influences further cognition, the resulting persistent impression of a tension in (1*) will make it seem less likely to be true than (2*), which contains no such tension. To sum up, experimental philosophers interested in inferences that influence further judgement and reasoning should complement process measures like eye-tracking measures with unspeeded plausibility ratings. Ideally, this should be done within the same study that also gathers eye-tracking data, by getting participants to first read or listen to an item (taking eye-tracking measures as they do so) and then asking them to rate its plausibility—to rate how likely the description provided by the item is to be true or accurate, given the information it provides. This is most conveniently done using a 5-point or 7-point Likert scale. Eye-tracking measures can provide evidence of automatic inferences. Plausibility judgements then let us assess whether these inferences are suppressed or influence further judgement and reasoning. We present a worked example in this chapter.
Further examples are provided by extant studies on inappropriate stereotypical inferences from appearance and perception verbs (Fischer et al., 2021b; Fischer and Engelhardt, 2019).

8.4.2.5 Questionnaire-Based Studies: Plausibility Ratings as Sole Measure

Plausibility ratings can, however, also be used on their own. This has the great advantage that the experiment can be based purely on a questionnaire that can be built with software tools like Qualtrics or Survey Monkey and delivered through Internet platforms like Prolific or MTurk that can be used to recruit participants meeting certain criteria. While this delivery format requires payment of participant fees and service charges, it may be cheaper than running an eye-tracking study in the lab. Implementing the cancellation paradigm only with plausibility ratings can therefore be a convenient way of examining upfront whether a hypothesis about automatic comprehension inferences merits closer investigation with lab-based studies. Indeed, well-designed plausibility rating studies provide important evidence and have been published in leading journals. Since the questionnaire-based format will be of particular interest to many readers, and the worked example discussed in detail in the next sections will be a lab-based study combining eye-tracking with plausibility ratings, this sub-section will add details about study design which we did not provide for the other measures discussed. Where we use plausibility ratings on their own, rather than to complement process measures, some care is required, though. Process measures provide evidence that the hypothesised inferences are made. In the above example, this allows us to interpret, e.g., lower plausibility ratings for s-inconsistent items like (1*) than for s-consistent items like (2*) as due to persistent stereotypical inferences whose clash with the sequel lowers the plausibility of the overall item (as and when they do clash).
Where we use plausibility ratings on their own, we lack this extra evidence. The case for the intended interpretation of plausibility differences can then be strengthened by manipulating not only the sequel but also the expression of interest—which we can exchange for a contrast expression that is unlikely to trigger the conflict-generating inference. To expand on a previous example, we can use sets of items like these (taken from Fischer et al., 2021b, where plausibility ratings were, however, combined with fixation time measurements):

(1') The dress seemed blue. Hannah thought it was green. (s-inconsistent)
(2') The dress was blue. Hannah thought it was green.
(3') The dress seemed blue. Hannah thought it was navy. (s-consistent; navy is a kind of blue)
(4') The dress was blue. Hannah thought it was navy.

The hypothesis that stereotypical inferences to belief attributions are made from appearance verbs then predicts that s-inconsistent appearance items (like 1') will be deemed less plausible than s-consistent appearance items (like 3'). To obtain evidence that these plausibility differences (if observed) are due to inferences from the verb, we can further compare appearance items with “is”-items, whose verb (the common copula) does not trigger doxastic inferences. The hypothesis then predicts that s-inconsistent appearance items (like 1') will be deemed less plausible than their “is”-counterparts (like 2'). The empirical argument is then further strengthened by observing—as we did (Fischer et al., 2021b)—that s-consistent appearance items (like 3') are deemed equally plausible as their “is”-counterparts (like 4'). This pattern of findings is best explained by the posited stereotypical inferences: The clash with the stereotypical inference from “seem” etc.
reduces the plausibility of s-inconsistent appearance items below that of counterparts with “is” (where no clash occurs, because the verb does not trigger a pertinent inference), just as it reduces the plausibility of s-inconsistent appearance items below that of s-consistent appearance items (where no clash takes place because the sequel is consistent with the inference triggered). Complex patterns of responses can hence provide evidence of stereotypical inferences, even where plausibility ratings are the only DV measured. Further efforts can be made to exclude the possibility that other factors influence plausibility assessments. When people cannot base plausibility assessments on factual knowledge (as with comparatively uninformative items about fictitious protagonists), they base them on metacognitive cues, in particular on the level of fluency or subjective ease they experience in processing the sentence(s) (for a review, see Alter and Oppenheimer, 2009). Difficulty or “dysfluency” leads to lower subjective plausibility. One factor influencing the level of fluency is the presence vs. absence of perceived conflicts in the text—i.e., the very variable manipulated by the cancellation paradigm. But there are further factors, including familiarity and pronounceability of individual words (Oppenheimer, 2006), syntactic complexity of the sentence (Lowrey, 1998), and priming by earlier words (Kelley and Lindsay, 1993). To exclude or minimise the influence of these factors, items should be as similar to each other as possible: Ideally, items whose ratings get compared with each other should differ only in one word. For example, Fischer et al.
(2021b) made two comparisons of plausibility ratings: They compared the ratings of s-inconsistent items employing appearance verbs (like 1') and contrast verbs (like 2'), which are the same except for the verb used in the first sentence (“seemed,” “appeared,” “looked,” “was”). And they compared s-inconsistent items using appearance verbs (like 1') with s-consistent counterparts (like 3'), which differ only in the adjective used in the second sentence (e.g., “green” vs. “navy”). The next question to bear in mind in designing these studies is the number of items. Even with the precautions outlined, plausibility ratings will depend upon idiosyncrasies of the scenarios described by the items. Studies therefore have to employ a sufficiently large number of items, so that any effects of such idiosyncrasies will cancel each other out. Successful studies have used a minimum of six items of each sort (e.g., six items like (1') with the verb “seemed” and an s-inconsistent sequel, which belong to the same “condition”). Candidate items should be “normed” to exclude items whose ratings are clearly influenced by further factors than those manipulated in the experiment: Before the main study, a “norming study” with a smaller number of participants (ca. 20 will typically be enough) should test the proposed items (plus several “spares”), by eliciting plausibility ratings for them. Items whose ratings are influenced by irrelevant factors will receive clearly different mean ratings than most other items in their condition. Items whose mean ratings differ dramatically from the mean for their condition should therefore be discarded as not fit for purpose (“exclusion norming”). To assess this difference, one commonly considers by how many standard deviations (SDs) an item’s mean rating deviates from the mean for its condition. It is best practice to exclude items that deviate by more than 2.5 SDs; another common exclusion criterion is a deviation of more than 3 SDs.
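To make the exclusion criterion concrete, here is a minimal sketch with hypothetical 7-point Likert ratings; the function name and data are ours, not from the studies cited. (The demo uses ten items per condition: with very small item sets, no single item can mathematically deviate from the condition mean by more than 2.5 SDs, since the SD is computed over the items themselves.)

```python
from statistics import mean, stdev

def exclusion_norming(item_ratings, cutoff=2.5):
    """Exclusion norming for one condition.

    item_ratings: dict mapping item name -> list of plausibility ratings.
    Items whose mean rating deviates from the condition mean (the mean of
    the item means) by more than `cutoff` standard deviations of the item
    means are discarded. Returns (items_to_keep, items_to_discard).
    """
    item_means = {item: mean(rs) for item, rs in item_ratings.items()}
    condition_mean = mean(item_means.values())
    sd = stdev(item_means.values())
    keep, discard = [], []
    for item, m in item_means.items():
        (discard if abs(m - condition_mean) > cutoff * sd else keep).append(item)
    return keep, discard

# Hypothetical norming data: nine well-behaved items, one outlier whose
# ratings are depressed by some irrelevant factor.
ratings = {
    "item01": [5, 5, 4, 5], "item02": [5, 5, 5, 5], "item03": [5, 6, 5, 5],
    "item04": [4, 5, 5, 5], "item05": [5, 5, 6, 5], "item06": [5, 5, 5, 5],
    "item07": [4, 5, 5, 4], "item08": [6, 5, 5, 5], "item09": [5, 5, 5, 5],
    "item10": [1, 1, 1, 1],
}
keep, discard = exclusion_norming(ratings)
# discard == ["item10"]; the other nine items survive norming
```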
Further norming work can include group discussion, where participants explain and justify their ratings. These explanations help identify extraneous factors that can influence people’s choices and need to be eliminated. (For example, one participant in a UK-based norming study explained their high rating for “The bathroom seemed large. Georgina thought it was small” by pointing out that “Georgina” is a posh name and that an upper-class person might well find even a large bathroom small. We therefore changed the name.) To sum up, we can investigate whether automatic inferences from words influence further judgement and reasoning by implementing the cancellation paradigm with plausibility ratings only. Such studies should involve at least two manipulations: In line with the basic idea of the paradigm, they should manipulate the consistency of the sequel with the hypothesised inference from a prior word. In addition, they should manipulate the word used in the relevant position, and use both a word of interest (hypothesised to trigger the inference) and a contrast word (not triggering the inference). There should be at least six items per condition. The items whose ratings get compared should be as similar to each other as possible (across conditions) and should have benefitted from prior norming. An example illustrating the sole use of plausibility ratings to examine automatic comprehension inferences in experimental philosophy is provided by a recent study on inappropriate stereotypical inferences from perception verbs (Fischer et al., 2022).

8.4.2.6 Plausibility Rankings

Forced-choice plausibility rankings provide a related outcome measure that is particularly convenient for preliminary examination of working hypotheses about automatic comprehension inferences which influence further judgement and reasoning (“cognitively effective inferences”).
With this approach, participants are presented with “minimal pairs,” that is, two texts that differ only in one critical word (like 1' and 2' above, or 3' and 4'). Participants are then asked to indicate which of the two texts strikes them as more plausible, even in the absence of a clear-cut preference (whence “forced-choice”). With this approach, the most important comparison is that between texts with a word of interest and a contrast word, in items with stereotype-inconsistent sequels. To repeat our previous example:

(1') The dress seemed blue. Hannah thought it was green. (s-inconsistent)
(2') The dress was blue. Hannah thought it was green.

If the word of interest (“seemed”) triggers the hypothesised inference, and the inference remains unsuppressed, the resulting clash with the sequel will reduce the plausibility of the text. As contrast word we use a word that does not trigger any inference that would clash with this sequel and that otherwise fits the context equally well. If the contrast word (here, “was”) meets this job specification, the hypothesised inference from the word of interest will reduce the plausibility of the s-inconsistent item (like 1') below that of its counterpart with the contrast word (like 2'). This will lead participants to prefer the text with the contrast word. By contrast, if the expression of interest does not trigger a persistent inference, and the contrast word otherwise fits the context equally well (as it should, as per its job specification), the two texts will strike participants as equally plausible, and their preferences will be random. To gauge whether the contrast word meets the job specification, we can use stereotype-consistent pairings, like:

(3') The dress seemed blue. Hannah thought it was navy. (s-consistent)
(4') The dress was blue. Hannah thought it was navy.
If the contrast word (“was”) fits the contexts equally well, except for not triggering the stereotypical inferences of interest, then these texts will strike participants as equally plausible in items which are consistent with this inference. Choices between stereotype-consistent texts (like 3' and 4') will then be random. Participants are presented with several (at least six) items pairing the same expression of interest with the same contrast verb. The DV this approach has us measure is consistency of preference: how consistently texts with the contrast word are preferred over texts with the word of interest, across all relevant items. The task presents participants with a binary choice. In Chap. 2 of this volume, you learned that such choices yield non-parametric data, which can only be analysed in certain limited ways. However, the present variable, consistency of preference, admits parametric tests (see Chap. 3), as it is continuous, has a 0 point (50/50 preference, e.g., three of six items), and equal intervals on the scale represent equal differences in the property measured (e.g., a 5/6 preference for “was” over “seemed” sentences lies the same distance from the 0 point as a 1/6 preference, i.e., a 5/6 preference in the other direction). Evidence for stereotypical inferences of interest is then provided by clear preferences, in s-inconsistent items, for texts with the contrast word (“was”) over texts with the word of interest (“seemed”). Such preferences are indicated by a significant deviation of preferences from the 0 point. (For a detailed explanation, see Fischer and Engelhardt, 2016.) The approach can not only be used to assess whether words of interest trigger cognitively effective stereotypical inferences. It can also be used to examine the relative strength of these inferences from distinct but related words. For example, the doxastic (belief-attributing) use seems to be slightly less dominant for “look” than for the other appearance verbs.
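The consistency-of-preference DV and its test against the 50/50 point can be sketched as follows. The data, function names, and the use of a hand-computed one-sample t statistic are illustrative (in practice a statistics package would supply the p-value); the sketch only shows how per-participant scores and the deviation from the 0 point are computed.

```python
from statistics import mean, stdev
from math import sqrt

def preference_scores(choices_per_participant):
    """Consistency of preference per participant: the proportion of
    s-inconsistent items on which the participant preferred the text with
    the contrast word ("contrast") over the word of interest ("interest")."""
    return [c.count("contrast") / len(c) for c in choices_per_participant]

def one_sample_t(scores, mu=0.5):
    """t statistic for H0: mean score == mu (here, the 50/50 point);
    df = len(scores) - 1."""
    return (mean(scores) - mu) / (stdev(scores) / sqrt(len(scores)))

# Hypothetical choices of four participants over six s-inconsistent pairs:
choices = [
    ["contrast"] * 5 + ["interest"],      # 5/6 preference for "was"
    ["contrast"] * 6,                     # 6/6
    ["contrast"] * 4 + ["interest"] * 2,  # 4/6
    ["contrast"] * 5 + ["interest"],      # 5/6
]
scores = preference_scores(choices)
t = one_sample_t(scores)  # positive t: consistent preference for "was"
```

A clearly positive t (with the associated p-value below the significance threshold) would indicate the consistent preference for the contrast word that the hypothesised inference predicts.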
People should therefore find it easier to suppress doxastic inferences from “look” than from “appear” or “seem.” Hypotheses like this can be tested by asking which of two s-inconsistent texts is more plausible, e.g., the “seemed” text (1') or a counterpart with “looked.” If doxastic inferences are easier to suppress from one verb (say, “looked”) than from the other (“seemed”), participants will consistently prefer texts using that verb (“looked”). Finally, forced-choice plausibility rankings afford an excellent tool to examine how susceptible stereotypical inferences are to defeat by other pragmatic inferences, in particular those in line with Grice’s (1989) Maxim of Manner (for an explanation, see Fischer et al., 2021a, 1040–1043). To successfully implement this potentially useful approach, we need to carefully control for all other factors that can influence subjective plausibility. This is even more important for ranking than for rating tasks, since even minuscule differences in plausibility can translate into consistent preferences across items. In discussing plausibility ratings, we reviewed several relevant factors. All but one of them can be excluded by using minimal pairs that differ only in the expression of interest (e.g., “seem” vs. “look” or “is”). The odd one out concerns use frequencies: We cannot help that the word of interest may be used more or less frequently than a contrast word we need to pair it with. For example, the verb “is” (in 4') is used far more frequently than any appearance verb, and “look” is used more frequently than “seem.” To exclude the possibility that this difference in frequency, rather than any difference in stereotypical inferences, accounts for preferences, we can identify and exclude participants whose plausibility rankings are systematically influenced by word frequency.
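Such a frequency-based participant check can be sketched as follows. The idea, detailed next, is to use control pairs in which word frequency and stereotype consistency either point the same way (“frequency-congruent”) or pull in opposite directions (“frequency-reversed”); a participant influenced by frequency will make fewer stereotype-consistent choices on the reversed pairs. The function name, data, and the strict `max_drop=0` threshold are our illustrative simplifications of this qualitative criterion.

```python
def frequency_check(congruent_choices, reversed_choices, max_drop=0):
    """Flag participants whose rankings track word frequency.

    Each argument is a list of booleans, one per control pair: True if the
    participant chose the stereotype-consistent text of that pair. A
    participant passes if they make no more than `max_drop` fewer
    stereotype-consistent choices on frequency-reversed pairs than on
    frequency-congruent pairs.
    """
    return sum(congruent_choices) - sum(reversed_choices) <= max_drop

# Hypothetical participants, 15 control pairs of each kind:
# one who tracks stereotype consistency regardless of frequency...
passes = frequency_check([True] * 14 + [False], [True] * 14 + [False])
# ...and one whose consistent choices collapse when frequency is reversed.
fails = frequency_check([True] * 15, [True] * 5 + [False] * 10)
```

Findings from the critical items can then be re-analysed after excluding participants for whom the check returns `False`.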
To identify such participants, we constructed for our own plausibility ranking studies 30 minimal pairs whose critical verbs differed in frequency. To develop these “frequency-control items,” we used word frequency information for British English (Leech et al., 2001; the most commonly used source for American English is Kucera and Francis, 1967). Each of the 15 verbs occurred once in a “frequency-congruent” pair, where the text employing the more frequent verb was also more consistent with its associated stereotype, and once in a “frequency-reversed” pair, where word frequency and stereotype consistency work in opposite directions; e.g., the more frequent verb “obey” stereotypically implies submission to formal authority:

[frequency-congruent] The colonel told the captain not to change his company’s position until further notice. The captain thought this reckless but obeyed/complied.
[frequency-reversed] Jane asked the campers on her land to move somewhere else by tomorrow afternoon. They weren’t happy but complied/obeyed.

If participants make judgements predominantly in line with stereotype consistency, and do not make fewer such judgements about frequency-reversed than frequency-congruent items, their plausibility judgements are unlikely to be influenced by frequency. We therefore include these items with our critical items in plausibility ranking studies and check whether findings remain significant after excluding participants who fail this test.

8.4.2.7 Summary: Implementations of the Cancellation Paradigm

To examine automatic comprehension inferences, the psycholinguistic cancellation paradigm can be implemented with a variety of methodologies, translating into a variety of DVs. These include process measures, which tap into the relevant cognitive processes as they unfold, and outcome measures, which assess the outcomes of these processes. Process measures include pupil dilations, late fixation times, and self-paced reading times.
Outcome measures include plausibility ratings and plausibility rankings. Process measures help us assess whether expressions of interest trigger certain inferences. Outcome measures help us assess whether such inferences are swiftly suppressed or influence further judgement and reasoning. Experimentalists interested in cognitively effective inferences should combine both kinds of measures. However, outcome measures that are comparatively easy to administer can also be used on their own, to provide first evidence of inferences of interest and to address a variety of related research questions. Researchers interested in cognitively effective comprehension inferences can collect first evidence with plausibility rankings and follow up any promising findings with plausibility ratings, which can then be combined with self-paced reading time measurements or—where an eye tracker is available—pupillometry or fixation time measurements. Proceeding in this manner not only manages the risk inherent in all experimental work but also has the potential to afford cross-methodological (or “conceptual”) replication of results. This is important in response to the ongoing replication crisis in psychology, where new studies have surprisingly often failed to reproduce the findings from earlier studies (Klein et al., 2014; Nieuwland et al., 2018)—even though experimental philosophy (Cova et al., 2018) has been less affected by this problem than prominent areas of psychology.

8.4.3 Study Design: Worked Example

The first task facing an experimentalist is the choice of an experimental paradigm. Now that we have reviewed several implementations of our chosen paradigm, the cancellation paradigm, we turn to the second task: We need to select an implementation and adapt it so that, from the hypotheses we want to examine, we can derive specific predictions we can test with the chosen implementation (Sect.
8.4.3.1). Then we need to think through how the data should be analysed—and ensure that our experiment will yield the right kind of data (Sect. 8.4.3.2).

8.4.3.1 Deriving Predictions

To address our second overall task, we consider our hypotheses (developed above, in Sect. 8.3.3): H1 and H2 suggest that appearance verbs trigger stereotypical inferences from premises of the form “Object X appears F to subject S” to conclusions of the form S believes that X is F. Previous studies provided evidence of such doxastic inferences at any rate from “appear” and “seem,” and a mixed picture for “look.” These studies used items that—like (1) and (2) above—did not preface the appearance statement (e.g., “The dress seemed blue”) with any pre-verbal context (Fischer and Engelhardt, 2016; Fischer et al., 2021a, b). Our key research question is whether such inferences are triggered, and influence further cognition, even when such context is provided—as in the philosophical argument from illusion. As you will recall, the cautious hypothesis H1 suggests that appearance verbs trigger cognitively effective doxastic inferences also when prefaced by pre-verbal contexts, as long as these are at least “neutral,” i.e., neither support nor challenge the doxastic inference (as in sample items 1* and 2* above, Sect. 8.4.2.4). The more ambitious hypothesis H2 suggests that such inferences occur even when the verb follows a “negative” pre-verbal context that challenges the inference. This is illustrated by the short text (or “scenario”): “The car was still far away. It looked small to Claire.” As a norming study established (see Sect. 8.5.3), readers think it unlikely that the car is small, simply because it looks small from far away. From far away, things look smaller than they are. This is a familiar phenomenon. Its familiarity speaks against the suggestion that the viewer believes that the car is as small as it looks.
The information provided by the time the reader reaches the word “looked” hence undermines the case for a doxastic inference from “looked small” to S believes it is small. Even so, H2 maintains, readers will automatically infer from the phrase “looked small to Claire” that Claire believed the thing she saw was small. Both hypotheses H1 and H2 have two parts: They suggest (i) that appearance verbs trigger inferences to belief attributions and (ii) that these inferences influence further judgement and reasoning. To be able to test a hypothesis, we need to derive empirically testable predictions from it. The choice of an experimental paradigm helps translate hypotheses into predictions: It identifies key independent variables to be manipulated in the presentation of stimuli and dependent variables that should be affected by the manipulation in a predictable way, if the hypothesis is true. To think through how the manipulation of the IVs would affect the DVs, according to the hypothesis, it helps to devise some toy items and think through how participants should respond to them, according to the hypothesis, if they are manipulated in line with the paradigm. The cancellation paradigm has us manipulate the consistency of the sequel with the posited inference from the expression of interest. We can do so by adding, to texts like the above, sequels that attribute to the viewer beliefs that are either consistent or inconsistent with the posited doxastic inference. This yields items like:

(3*) The car was still far away. It looked small to Claire. She believed the car was large. (s-inconsistent)
(4*) The car was still far away. It looked small to Claire. She believed the car was small. (s-consistent)

Above, we have reviewed a number of dependent variables that are affected by this manipulation in predictable ways, if our hypotheses are true. Once we have chosen a specific methodology and DV, we can derive predictions from our hypotheses. So which DV are we to use?
As we have seen above (in Sect. 8.4.2.4), we need to employ process measures to examine the first part—(i) above—of our hypotheses, about which automatic inferences are triggered, and outcome measures to examine their second claim—(ii) above—that these inferences influence further judgement and reasoning.

8 Psycholinguistic Experiments: A Case Study on Default Inferences. . . 391

Which process measure to use depends first on the modality in which we are interested: If we are interested in oral discourse and the comprehension inferences listeners make when hearing utterances, we can use pupillometry. If we are interested in written discourse and the inferences readers make from text, reading time measurements are preferable. A second factor to consider is the length and complexity of stimuli. The eye-tracking measures we reviewed are indicative of comprehension difficulties. But the longer and more complex our items get, the more potential sources of difficulty there are. In three-sentence items like (1*) to (4*), difficulties integrating information in the final sentence (“She believed the car was large”) could be due to inferences from the second sentence (“It [i.e., the car] looked small to Claire”) or from the first sentence (“The car was still far away”)—perhaps participants think viewers would not yet be able to tell what kind of object (car or lorry?) they are dealing with and refrain from forming beliefs about its size (large car or small lorry?). Fixation time measurements are sufficiently fine-grained to allow us not only to document integration difficulties but also to provide evidence concerning their source, which may be indicated by higher summed re-reading times for the source region (see above, Sect. 8.4.2.2). We therefore decided to implement the cancellation paradigm with fixation time measurements.
Conflicts between inferences from words and subsequent text lead to higher summed re-reading times for conflict regions, where the conflict becomes apparent, or for source regions, where the conflict-generating inference originates (Sect. 8.4.2.2). When we use items like (1*) to (4*), H1 and H2 thus imply predictions about summed re-reading times in those two regions. The conflict region boils down to the adjective at which the conflict becomes manifest (e.g., “large” in sentence 3 of 3*). However, the three verbs of interest (“look,” “appear,” and “seem”) differ in word length (four to seven characters) and frequency (appear = 307, look = 1,151, seem = 624, in Leech et al. 2001), which influence reading times: These factors strongly influence first-pass reading times, but may taint re-reading times (see Sect. 8.4.2.2). Re-reading times on the source adjective alone therefore provide us with a cleaner measure of integration difficulties engendered by “clashing” inferences. This suggests making predictions about the conflict adjective (e.g., “large” in sentence 3 of item 3*) and the source adjective (e.g., “small” in sentence 2 of item 3*). On the other hand, we do not yet understand well enough when people re-read the source, rather than the conflict region, and therefore have to make do with a disjunction. The first part of our hypotheses translates into:

Prediction H1-RT For items with neutral pre-verbal contexts (like 1* and 2*), second-pass reading times for the source adjective or the conflict adjective will be higher where items have s-inconsistent sequels than where they have s-consistent sequels (e.g., 1* > 2*).

Prediction H2-RT For items with negative pre-verbal contexts (like 3* and 4*), second-pass reading times for the source adjective or the conflict adjective will be higher where items have s-inconsistent sequels than where they have s-consistent sequels (e.g., 3* > 4*).

Due to spill-over effects that increase reading times in the region after the one primarily affected by integration difficulties (Sect. 8.4.2.2), further evidence of the hypothesised doxastic inferences can be provided by higher re-reading times for the object regions following the source regions.

The second part of our hypotheses claims that the inferences thus picked up will influence further judgement and reasoning. Relevant judgements include plausibility judgements. These claims can therefore be assessed with plausibility ratings:

Prediction H1-PL For items with neutral pre-verbal contexts (like 1* and 2*), plausibility ratings will be higher where items have s-consistent sequels than where they have s-inconsistent sequels (e.g., 2* > 1*).

Prediction H2-PL For items with negative pre-verbal contexts (like 3* and 4*), plausibility ratings will be higher where items have s-consistent sequels than where they have s-inconsistent sequels (e.g., 4* > 3*).

Our hypotheses make predictions about different items: H1 makes predictions about items with neutral pre-verbal contexts and H2 about items with negative pre-verbal contexts. This raises the question of whether we should examine H1 and H2 in the same experiment or in different experiments. This depends upon whether we can learn more from comparing responses to these two different kinds of items. H2 maintains that the cognitively effective inferences of interest occur where pre-verbal contexts are negative and challenge them. One might well expect these contexts to attenuate the inferences of interest and reduce their influence on further cognition—even if those contexts do not manage to prevent the inferences. It would therefore be good if we were able to assess not only whether the inferences of interest are pre-empted by negative pre-verbal contexts or (as per H2) occur even so, but also whether the inferences influence subsequent cognition less when preceded by negative contexts.
We can assess this by comparing plausibility ratings in response to items employing different—negative vs. neutral—contexts. A strong version of H2 (let’s call it “H2*”), which claims that inferences influence further cognition no less where triggering verbs are preceded by negative context, would imply:

Prediction H2*-PL For items with s-inconsistent sequels (like 1* and 3*), plausibility ratings will be the same when items have neutral pre-verbal contexts (like 1*) and when they have negative pre-verbal contexts (like 3*).

This prediction would, of course, be borne out also if participants simply ignored pre-verbal contexts in their ratings. This possibility is excluded, and the argument for H2* clinched, if the nature of the pre-verbal context (i.e., neutral or negative) makes a difference to ratings for items with s-consistent sequels (like 2* and 4*). Since comparisons between responses to items with different pre-verbal contexts are instructive and help us gauge the appropriate strength of H2, we decided to test our hypotheses in the same study.

This does, however, incur some risk: By combining these hypotheses in the same study, we are combining otherwise distinct sets of data for analysis, namely, data about responses to distinct classes of items. Below (in Sect. 8.7.2), we will see that this obliges us to undertake an “omnibus analysis” of the more comprehensive data set. The results of this analysis determine what further, more detailed, analyses we are allowed to make. If uncongenial, the results may preclude analyses we could have made had we considered smaller data sets separately.

8.4.3.2 Specifying the Design

The first key question of study design is, Which IVs should we manipulate, and which DVs should be measured? This question is largely answered in deriving the predictions. The second key question is, How should the data be analysed?
This second question is largely answered by what is technically known as the specification of the “study design.” Perhaps confusingly, this takes the shape of specifying how many variables are to be manipulated, in how many ways. We now develop this specification in a way that brings out how this step is guided by the second key question of how the data should be analysed.

The key IVs can be “read off” the predictions: In line with the key idea of the cancellation paradigm, these predictions state what will happen when we manipulate the consistency of the sequel with the hypothesised doxastic inference from the previous appearance verb. All predictions from H1 and H2 state what will happen when these sequels are s-inconsistent, rather than s-consistent. The first IV manipulated in this study thus is stereotype consistency, with two “levels” (s-consistent and s-inconsistent).

When deciding to test both hypotheses in the same study, we decided to include items with two different pre-verbal contexts (“neutral” and “negative”). This does not by itself add another IV to the study; this happens only once we decide to treat these two kinds of contexts as at least potentially making a difference. We did so when formulating a strong version of H2, H2*, and deriving from it a prediction about how plausibility ratings for items with neutral and with negative pre-verbal contexts would be related to each other. We thus added “nature of pre-verbal context” as a second IV, again with two levels: “neutral” and “negative.” These differ in whether or not they challenge doxastic inferences by suggesting that (familiar and transparent) viewing conditions make for non-veridical perceptions, namely, whether or not things, under these conditions, look the size, shape, or colour they are. We therefore call this variable “veridicality” (which is less of a mouthful than “nature of pre-verbal context” and better reflects the point of the manipulation).
We thus provisionally arrive at a “2 × 2” (pronounced: “two by two”) design, which means we are using two variables (the number of terms separated by a “×”-sign), each with two levels (as indicated by the two “2”s). The multiplication sign is used because multiplying the numbers of levels yields the number of conditions, where a condition is one possible combination of values for the IVs in your design (e.g., neutral-consistent).

The design finally also specifies how the IVs are to be manipulated: whether, for example, one and the same individual will read and rate both items with s-consistent and with s-inconsistent sequels (“within-subject manipulation”) or whether one group of participants will read and rate items with s-consistent sequels while another group receives items with s-inconsistent sequels (“between-subjects manipulation”). Some IVs may be manipulated within-subject and others between-subjects (“mixed design”). In general, within-subject manipulations are preferred, as they tend to cancel out individual differences (i.e., differences between participants in terms, e.g., of verbal intelligence) and are therefore more powerful (i.e., require fewer participants). However, in some cases, between-subjects designs are necessary, because participants cannot be exposed to multiple conditions without giving away and defeating the point of the study, or because the variable of interest is some inter-individual factor (e.g., level of philosophical training; cf. Fischer et al., 2022; Horvath and Wiegmann, 2016). These designs are less powerful (because of individual differences) and tend to require larger sample sizes (so that individual differences between participants have a better chance of “cancelling out” across the whole sample). Our IVs can easily be manipulated within subjects, so we opted for that design. At this point, we are thus looking at a “2 × 2 within-subject” design.
Each IV implies differences in the “materials” (items) that participants will respond to. But not every change in the materials implies there is a manipulation: For example, our hypotheses are about the three appearance verbs “look,” “appear,” and “seem.” We should therefore employ items with different verbs (viz., in sentence 2 in items like 1* to 4*). But our predictions do not predict any differences between them. In analysing our data to assess our hypotheses, we therefore do not have to look for any such differences and can treat items with “look,” “appear,” and “seem” the same, regardless of the specific verb they use. Even so, however, we may be curious whether the choice of verb does make a difference. Previous studies produced overall less evidence of cognitively effective doxastic inferences from “look” than from the other two verbs, and evidence that inferences from “appear” and “seem” are stronger than from “look” (Fischer and Engelhardt, 2016; Fischer et al., 2021a, b). We therefore do wish to be able to make comparisons between items with these different verbs—e.g., compare reading times and plausibility ratings—so we can provide further evidence as to whether each verb triggers doxastic inferences and whether all triggered inferences go on to influence subsequent judgement. We therefore treat verb as a third IV, with three levels (“look,” “appear,” and “seem”). We thus obtain a 2 (veridicality) × 3 (verb) × 2 (s-consistency) fully within-subject design.

The full design of the study, however, involves more than the “design,” in this technical sense: We need to specify not only which IVs we want to manipulate, and how, but also what we want to measure, and where. The derivation of the predictions already identified the DVs to be measured: plausibility ratings and second-pass reading times. For the latter, it also identified key regions of interest where measurements need to be taken.
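In code, the “design” in this technical sense is just the crossing of the IVs’ levels. The following Python sketch (illustrative only; the variable names are our own) enumerates the resulting conditions:

```python
from itertools import product

# Illustrative sketch: crossing the levels of the three IVs enumerates the
# conditions of the 2 (veridicality) x 3 (verb) x 2 (s-consistency) design.
ivs = {
    "veridicality": ["neutral", "negative"],
    "verb": ["look", "appear", "seem"],
    "consistency": ["s-consistent", "s-inconsistent"],
}

# One condition = one combination of levels, e.g. neutral / look / s-consistent.
conditions = [dict(zip(ivs, combo)) for combo in product(*ivs.values())]
print(len(conditions))  # 2 * 3 * 2 = 12
```

This is why the “×” notation multiplies out to the number of conditions: 2 × 3 × 2 = 12.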
However, we also need to think through in advance which measurements might help us interpret our findings. For example, we have seen that comprehension difficulties that become manifest at the end of complex items may be engendered by conflicts rooted in different parts of the previous text, e.g., in the pre-verbal context or the verb. Summed re-reading times provide relevant evidence. While our predictions only mention higher re-reading times for source and conflict adjectives, we therefore propose to measure them also in the pre-verbal context. Given spill-over effects, we will also measure on the source object (“to Claire,” in sentence 2 of 3*). We therefore intend to measure reading times at five regions of interest (i.e., regions where measurements are taken) (Table 8.1).

Table 8.1 Regions of interest

The car was still far away [1]. It looked [2] small [3] to Claire [4]. She believed the car was large [5].

[1] Pre-verbal context  [2] Source verb  [3] Source adjective  [4] Source object  [5] Conflict adjective

8.5 Methods: Implementing the Approach

8.5.1 Basics of Item Development

Once we are clear on the design of our study, we can develop the materials for it: the items we will ask participants in our study to read and rate. These include critical items that are critical for the intended analyses and filler items that are designed to encourage participants to make full use of the rating scales and are intended to “mask the task,” i.e., to prevent participants from guessing what the research questions and predicted responses are, so that they do not change their answers to conform with the experimenter’s expectations (an important form of response bias). In addition, filler items can be used to enforce exclusions (like the “frequency control items” we devised to exclude participants whose plausibility assessments are unduly influenced by word frequencies; see Sect. 8.4.2.6).
Fillers should have the same “surface structure” as critical items (e.g., contain the same number of sentences). Sometimes it is possible to use the critical items of another experiment as fillers and thus run two experiments simultaneously. In particular, studies that are delivered online (rather than in the lab), and may therefore be taken by participants in distracting environments, should include attention checks that allow the researcher to exclude data from participants who fail to pay enough attention to the stimuli. These items should have the same format as critical items or common fillers but prescribe counterintuitive responses (e.g., in an agreement rating task using a scale from “−3” to “+3”: “Please select +3 as the response to the present question. Washington DC is the capital of Germany”).

The first question to address in material development is the number of items (a) in total and (b) per condition. We broached this question already in discussing plausibility rating studies (Sect. 8.4.2.5): With all the care in the world, we cannot exclude that idiosyncrasies of specific items influence responses. The first precaution to take is to use a variety of items in each condition, so that any influences from idiosyncrasies will “cancel out.” In eye-tracking experiments, it is best practice to use six items per condition, and inappropriate to use fewer than three. Studies should ideally employ twice as many fillers as critical items and need to use at least as many fillers as critical items. Our design calls for 12 within-subject conditions. Hence, it would be ideal to have 72 critical items. However, we need at least the same number of fillers of a similar form, which would bring the total to 144 rather long and complex (three-sentence) items. Eye-tracking experiments should not exceed the normal adult attention span of 20 min, as participants may cease to engage with the task if it takes longer.
This may speak for reducing the number of items below the ideal suggested by the design. In our study, we therefore decided to halve the number of critical items. Which conditions we should cut down then depends upon our research questions and hypotheses: Given the mixed picture previous studies provided concerning differences between verbs, we could not afford to cut the number of items in the three verb conditions. Given the centrality of the consistency manipulation to the predictions derived from our key hypotheses H1 and H2, we could not afford to cut numbers in the two consistency conditions, either. By contrast, comparisons between the two veridicality conditions are required only to gauge the strength of H2, and the ancillary hypothesis H2* employed for the purpose concerns only the success, rather than the extent, of suppression efforts and accordingly delivers predictions only for plausibility ratings, and not for process measures. We therefore decided to reduce the number of items in the two veridicality conditions to three. We used as many fillers as critical items.

8.5.2 Drafting Items

The development of the requisite number of critical items is then governed by the principle: variety within conditions, similarity between conditions. To prevent idiosyncrasies of items from influencing responses, we need to vary items within conditions. The philosophical argument motivating our study proceeds from assumptions about how objects appear in terms of three basic perceptual properties: colour, size, and shape. To ensure variety, we therefore constructed scenarios in which agents view different kinds of objects under different conditions, and these appear a certain colour, size, or shape to them. We aimed to obtain roughly the same number of items with colour, size, and shape.
To ensure that differences between responses given in different conditions are due to the manipulation of our IVs, rather than other factors, variability within conditions needs to go with similarity between conditions. Items need to be as similar as possible to their “counterparts” in the other conditions. This can be achieved by implementing manipulations through rotation or by changing as little text as possible. It is further supported by controlling changed text for length and frequency—the two “ubiquitous” factors most likely to affect plausibility judgements (Sect. 8.4.2.5) and reading times, especially first-pass reading times (Sect. 8.4.2.2). We consider these moves in turn.

Our design calls for the manipulation of three IVs, and the previous section has built up to the plan of manipulating them in different sentences of three-sentence items like (1*) to (4*) (repeated for convenience):

(1*) The lighting in the room was odd. The hostess’s dress looked blue to Greta. She thought it was green.

(2*) The lighting in the room was odd. The hostess’s dress looked blue to Hannah. She thought it was blue.

(3*) The car was still far away. It looked small to Sarah. She believed the car was large.

(4*) The car was still far away. It looked small to Claire. She believed the car was small.

We manipulate veridicality in the first sentence, verb in the second (see below), and stereotype consistency in the third. To ensure the greatest possible similarity of items across different verb conditions, we manipulate the verb by rotating the three verbs of interest (“look,” “appear,” and “seem”) through the same text frames, in a so-called Latin Square design: We prepared three item lists that were otherwise identical, except that List 3 used “seemed” in items where List 2 used “appeared” and List 1 used “looked,” etc.
We then presented each list to one third of our participants, so that an equal number of participants would see, e.g., (1*) with the verb “looked,” “appeared,” and “seemed,” respectively. This rotation ensures that each text frame is used with each verb in exactly the same number of trials.

To ensure the greatest possible similarity across different consistency conditions, where no such rotation is possible, we manipulate the IV by changing a single word, namely, the “conflict adjective” at the end, which is either identical with the source adjective in the second sentence (“blue,” “small”), in s-consistent items, or stands for an incompatible colour, and so forth (“green,” “large”), in s-inconsistent items. The repetition of the adjective in the s-consistent condition is potentially problematic, since the word’s first occurrence will prime the concept and facilitate processing of the second occurrence (starting with word recognition). Higher reading times for the final adjective in s-inconsistent than s-consistent items may hence be due to facilitation of reading in consistent items, rather than perceived conflicts in inconsistent items. However, priming affects the most relevant measure, summed re-reading times, considerably less than first-pass reading times, and grows weak after five intermittent words (Rayner, 1998). We therefore ensured that source and conflict adjective were always separated by six words (although it is not ideal that some of these were brief prepositions, which may get skipped in reading).

To make our items as similar as possible between conditions in terms of word length and frequency, we need to control for these properties in each region of interest. Three steps help exclude the influence of length and frequency on the reading time differences of interest:

(a) Where possible, use the same word in the region of interest, across different conditions (as just discussed).
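The Latin Square rotation can be sketched as follows. The frames below are illustrative stand-ins (the third is invented), and the list-building code is our own sketch, not the software used in the study:

```python
# Illustrative sketch of the Latin Square rotation: three lists in which
# the verbs rotate through the same (partly hypothetical) text frames.
verbs = ["looked", "appeared", "seemed"]
frames = [
    "The lighting in the room was odd. The hostess's dress {} blue to Greta.",
    "The car was still far away. It {} small to Claire.",
    "The fog was thick. The tower {} round to a passer-by.",  # invented frame
]

lists = []
for rotation in range(len(verbs)):
    item_list = [frame.format(verbs[(i + rotation) % len(verbs)])
                 for i, frame in enumerate(frames)]
    lists.append(item_list)

# Across the three lists, every frame occurs exactly once with every verb,
# so presenting each list to one third of participants balances verb and frame.
```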
(b) Where this is not possible (e.g., because the relevant manipulation is effected precisely by changing the word in the given region of interest), ascertain the mean word lengths and frequencies per condition, and conduct t-tests to identify significant differences (see below, Sect. 8.5.3). Word length is assessed by character counts. Word frequencies are assessed by looking up published “frequency norms” (like Leech et al., 2001, for British English, and Kucera and Francis, 1967, for American English). Wherever possible, we need to ensure that the means of both length and frequency across the items in each condition are not significantly different.

(c) Where this is not possible, you can apply a correction when analysing results. Typically, a correction will involve applying either a constant adjustment to all items in a condition or a type of regression analysis in which the adjustment is applied to each individual item. In general, these kinds of corrections are not ideal and can lead to problems in data interpretation. Researchers should strive for (a) and (b).

For the source object, there is a further constraint: Gender stereotypes might influence participants’ judgements about what a viewer is likely to believe—e.g., some readers might think it more likely that viewers form wrong beliefs about the colour of a dress when they are men (perhaps more attentive to the shape of the wearer than the colour of the dress). For each scenario, viewer names in s-consistent and s-inconsistent items were therefore matched not only for length but also for gender.
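Step (b) can be sketched with standard-library Python. The character counts below are invented for illustration; in practice one would use a statistics package (e.g., scipy.stats.ttest_ind) to obtain the p-value as well:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical character counts for one region of interest in two conditions.
s_consistent = [5, 4, 6, 5, 4, 6]
s_inconsistent = [5, 5, 6, 4, 5, 6]

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (step (b)'s length check).
    In practice, use scipy.stats.ttest_ind, which also returns the p-value."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

t = pooled_t(s_consistent, s_inconsistent)
# A |t| well below the critical value indicates no significant length difference.
```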
Finally, we face a specific challenge when implementing the cancellation paradigm with plausibility ratings, to document contextually inappropriate comprehension inferences that influence further judgement: If comparisons between plausibility ratings for s-inconsistent and s-consistent items are to provide evidence of such inferences, we must ensure that any prior contextual cues for suppressing a hypothesised inference in s-inconsistent items are not so strong that they invite suppression also in s-consistent items. Specifically, our negative veridicality items should specify viewing conditions under which it is mildly implausible that the object of sight should have the colour, etc., it looks (implausible enough to support complete suppression of the doxastic inference in the s-inconsistent condition), but not highly implausible (motivating sustained suppression efforts also in the s-consistent condition). A norming study assessed the success of our efforts—and a more fundamental question.

8.5.3 Norming Studies

Material development often involves norming studies to ensure that items are fit for purpose. Above (in Sect. 8.4.2.5), we considered norming studies designed to identify and eliminate extraneous item features influencing responses. Another common function is to ensure that items actually meet the brief of the condition to which they are assigned. In the present study, this question is a live one for the veridicality conditions: Will participants think that, under the viewing conditions specified by our items, objects have the size, shape, or colour they look? Or will readers think that the specification of the conditions does not allow them to decide the question (as required by the brief of the neutral veridicality condition)? Or will readers think that, under the conditions specified, objects will not have the property they look (as required by the negative veridicality condition)?

To assign items to the neutral and the negative veridicality condition, we therefore conducted a questionnaire-based norming study. We recruited participants via the online platform Prolific (for more on recruitment, see Chap. 2) and built a questionnaire with the software Qualtrics (for more on building a questionnaire with Qualtrics, see Chap. 4). We recruited 200 participants, who participated for remuneration. To match the intended student sample for the main study, we restricted participation to UK residents, 18–27 years old, studying for or possessing an undergraduate degree; 151 passed the attention checks (63% female, 1% non-disclosed). All of these participants were native speakers of English and resident in the United Kingdom. Participants were given “look” versions of the scenarios (sentences 1 and 2) of all our draft items, since extant studies suggest that this verb is more readily interpreted as characterising just the visual looks of objects:

The lighting in the room was odd. The hostess’s dress looked blue to Hannah.

Participants then rated, on a 7-point scale labelled from “−3” to “+3,” how confident they were that certain statements were true in this situation. The key statement was that the object viewed had the property it appeared, for example:

The hostess’s dress is blue.

To assign our 26 draft items to different veridicality conditions, we considered whether participants were confident that the statement was true (so that the specified viewing conditions are veridical), confident that the statement was false (so that the specified viewing conditions are non-veridical, and the item belongs to the negative veridicality condition), or totally unsure whether the statement is true (so that the viewing conditions support no inference either way, and the item belongs to the neutral veridicality condition).
We considered this by determining whether mean responses for key statements were significantly above, significantly below, or not significantly different from the mid-point. To determine this, we mapped ratings onto a scale from “1” to “7” (simply by adding 4 to all values in the Excel sheet of results), calculated the mean response for each item, and conducted one-sample t-tests (see Chap. 3) against a test value of 4 (the mid-point). Nine scenarios generated mean responses significantly above the mid-point and had to be discarded, and nine scenarios with mean responses significantly below the mid-point were assigned to the negative veridicality condition. Eight scenarios with mean responses not significantly different from the neutral mid-point were assigned to the neutral veridicality condition.

As explained above, we also wanted our negative veridicality items to specify viewing conditions under which it is mildly, but not highly, implausible that the object of sight should have the property it looks. We therefore considered the effect size of the comparison with the mid-point, taking medium and large effect sizes to be indicative of the high implausibility we need to avoid and small (but not negligible) effect sizes to indicate the mild implausibility we want. To determine the effect size, we calculated Cohen’s d (for Cohen’s d, see Chap. 3) for each comparison. For neutral items, this effect was suitably negligible (as indicated by d’s < .2). For negative items, the effect was mostly small (as indicated by d’s between .2 and .5): with the exception of one item (d = .192), all such items had d’s ranging from .217 to .490.

Each scenario yields two items, by adding an s-consistent and an s-inconsistent sequel. To obtain the sought 36 critical items, we thus need 18 scenarios, namely 9 negative and 9 neutral scenarios. Our norming study provided us with nine negative, but only eight neutral scenarios.
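This norming analysis can be sketched with standard-library Python. The ratings below are invented for illustration (a real analysis would use all responses per item and, e.g., scipy.stats.ttest_1samp for the p-value):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical ratings for one draft scenario on the original -3..+3 scale.
raw = [0, -1, 0, 0, 1, -1, 0, -1]
shifted = [r + 4 for r in raw]   # map onto the 1..7 scale by adding 4
midpoint = 4

m, s, n = mean(shifted), stdev(shifted), len(shifted)
t = (m - midpoint) / (s / sqrt(n))   # one-sample t statistic vs. the mid-point
cohens_d = (m - midpoint) / s        # standardised effect size vs. the mid-point

# A mean significantly below the mid-point with a small |d| (between .2 and .5)
# would mark the scenario as mildly implausible (negative veridicality).
```

With only eight invented ratings the t statistic here would not reach significance; the point of the sketch is the sequence of steps (shift, mean, one-sample t, Cohen’s d), not the numbers.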
They were reasonably varied, but not balanced, in terms of the visual properties involved: eight items involved colour, five size, and four shape. For a fully publishable study, we would therefore have needed to go through another cycle of item generation and norming. For the present pilot study, we deemed these materials good enough. To make up the shortfall of one scenario, we included in the main study also the scenario with the lowest mean above the mid-point. However, the resulting two items were not assigned to a veridicality condition and were excluded from all analyses (functioning basically like fillers).

In fixation time studies, items should be presented in a single line on the computer screen, in a font that is big enough to allow for easy reading. For multi-sentence items, you should check whether draft items fit in this way on the screen of the eye tracker before submitting them to norming work. Due to an oversight, this check was conducted in the present study only after completion of the norming study. In consequence, some scenarios had to be slightly edited for length. This procedure is defensible only for pilot studies. In the event, it was justified by the large effects we found the veridicality manipulation to produce. We then repeated the examination of length per region of interest for the shortened items. Region lengths were roughly balanced across conditions (see Table 8.2). We used frequency information for British English (Leech et al., 2001) to determine the mean frequencies of source and conflict adjectives in the relevant conditions (see Table 8.3). Names for source objects were drawn from lists of common English names, and further investigation (e.g., frequency counts in corpora) seemed unlikely to provide more useful information. There was too much variation in the “surviving” pre-verbal contexts for a comparison of word frequencies to have a point.
This constrains the interpretability of any reading time findings for this region (which, however, do not figure in our predictions). The length differences, comparing neutral vs. negative, were not significant (all p’s > .05). Frequency differences for source and conflict adjectives between neutral and negative conditions were also not significant (all p’s > .05, arguably due to the small sample size and large SDs).

Table 8.2 Mean numbers of characters in region of interest, per condition

                          Pre-verbal context   Source adjective   Source object   Conflict adjective
Neutral   s-consistent    28.78                7.00               7.22            6.33
          s-inconsistent  28.78                7.00               7.22            6.78
Negative  s-consistent    32.25                5.13               5.75            5.25
          s-inconsistent  32.25                5.13               6.25            5.25

Table 8.3 Mean frequencies of adjectives in different conditions

                    Neutral                           Negative
                    s-consistent   s-inconsistent    s-consistent   s-inconsistent
Source adjective    57.78          57.78             222.75         222.75
Conflict adjective  57.78          66.33             222.75         155.25

However, the frequency differences may affect first-pass reading times in particular. Specifically, they predict higher reading times in the s-inconsistent than in the s-consistent negative veridicality condition. Word frequency trends might thus look like a potential confound for our study, which predicts differences in this direction from hypotheses that do not invoke word frequency. However, our predictions concern re-reading times, which are less affected by word length and frequency than first-pass reading times. Hence, especially where the predicted effects turn out to be even larger for re-reading times than for first-pass reading times, those effects continue to provide evidence for the hypotheses. We therefore went ahead with our study, without going through another round of item development (which would have involved another norming study to assign items to veridicality conditions).
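The norming-study classification described above (one-sample t-tests against the scale mid-point, plus a Cohen’s d screen) can be sketched in a few lines. This is an illustrative reimplementation with hypothetical ratings, not the Excel/SPSS workflow we actually used; in particular, the critical value of t is a rough approximation, since the exact value depends on the sample size.

```python
import math
from statistics import mean, stdev

def classify_scenario(ratings, midpoint=4.0, t_crit=2.0):
    """Classify a normed scenario by comparing its mean rating to the mid-point.

    ratings: hypothetical list of 1-7 ratings for one scenario.
    t_crit:  approximate two-tailed critical t for alpha = .05 (assumption;
             the exact critical value depends on the number of raters).
    Returns (label, d): 'discard' if significantly above the mid-point,
    'negative' if significantly below, 'neutral' otherwise.
    """
    m, sd, n = mean(ratings), stdev(ratings), len(ratings)
    t = (m - midpoint) / (sd / math.sqrt(n))  # one-sample t statistic
    d = (m - midpoint) / sd                   # Cohen's d for a one-sample test
    if t > t_crit:
        return "discard", d
    if t < -t_crit:
        return "negative", d
    return "neutral", d
```

Note that significance and effect size can dissociate: one of our negative items was significantly below the mid-point and yet had d = .192, just under the conventional small-effect cutoff of .2.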
8.6 Procedure: Conducting the Experiment

8.6.1 Tasks

We now review, in roughly chronological order, the tasks involved in collecting the data. As with any other experiment with human participants, you will need to seek—and receive—approval for your study from your institution’s Institutional Review Board (IRB) or Ethics Committee, prior to collecting data. The relevant form will typically ask you to outline the aims and methodology of the study and identify possible risks to participants. With eye tracking, the physical risks are minimal: There is (1) the possibility of fatigue if the experiment is too long and (2) the exposure to the infrared light that most eye trackers emit. The amount of infrared is very low, much less than a participant would experience on a sunny day, and so, if the participant is not light sensitive, the potential risk of infrared is minimal. Other risks that may need mentioning are risks arising from exposure to ethically sensitive materials. At this point, a researcher may also consider “pre-registering” their study on a platform such as the Open Science Framework. The information required for pre-registration includes the hypotheses, predictions, sample size, planned analyses, etc. The rationale for pre-registration is that it leads to better scientific practices, and the benefit to the individual researcher is that it permits greater flexibility in publication (as some journals require or encourage pre-registration). Before starting to collect data, you will need to determine how many participants you need to test, and what attributes they need to have. As a traditional rule of thumb, fully within-subjects experiments require a minimum of 25 to 30 participants, while experiments with a between-subjects design require double that number. Researchers increasingly recommend higher numbers (e.g., 50 for a simple within-subjects design).
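To make such numbers concrete: under a simple normal approximation, the required sample size follows directly from the significance level, the expected effect size, and the desired power. The sketch below is our own, for orientation only; for real studies, use a dedicated tool such as G*Power.

```python
import math

def approx_n_within(d, z_alpha=1.960, z_power=0.842):
    """Approximate n for a within-subjects (paired) comparison.

    Normal-approximation formula n ~ ((z_alpha + z_power) / d)**2, with
    z_alpha = 1.960 for two-tailed p < .05 and z_power = 0.842 for 80%
    power. This slightly underestimates the n that an exact t-based
    power analysis (as performed by G*Power) would return.
    """
    return math.ceil(((z_alpha + z_power) / d) ** 2)
```

For a medium expected effect (d = .5), this gives n = 32, in the same region as the rules of thumb above; for a small effect (d = .2), it gives n = 197, which illustrates why small expected effects demand much larger samples.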
More specifically, how many participants you need depends upon three factors: what significance level you want to reach (customarily p < .05), what effect size you expect the experimental manipulations to have (e.g., based on related findings from previous studies), and how probable it should be that you find an effect if there is one (“statistical power”; typically set at 80%). A power analysis can be used to calculate the required sample size from these three inputs. Where a researcher is working on an entirely novel topic, no expected effect size can be determined from prior research, and only the effect size hoped for can be provided. For a helpful tutorial, we refer the reader to Brysbaert (2019). The most commonly used programme for performing power calculations is G*Power; many online power calculators are also widely available and free to use. For studies on language comprehension, it is best practice to recruit native speakers as participants, although high levels of proficiency in a non-native language (as evidenced by successful pursuit of university studies in that language) may suffice. Most lab-based psycholinguistic studies recruit psychology students from an institutional participant pool. For our study, we recruited 48 native speakers of English from our institutional participant pool (aged 18–21, approximately three quarters female). The next task is to programme the experiment, which will interface with the eye tracker. For most eye trackers, there is bespoke software that accompanies the hardware. For example, we use SR Research Experiment Builder software to execute the experiment on the SR Research EyeLink 1000 eye-tracking system. However, the EyeLink 1000 is also compatible with more generic experiment programming software (e.g., E-Prime or MATLAB). Figure 8.1 shows the main graphical user interface of SR Research Experiment Builder.
The software offers a flexible and intuitive user interface in which the researcher can simply drag and drop different presentation options, as well as response inputs from the participant in the study. For our study, individual trials required two presentation screens and two responses from participants. The first display screen presents the sentence to be read. Once participants have read the sentence, they press the spacebar, which is the first keyboard response in the sequence. The second display screen prompts participants to rate the plausibility of what they have just read. They do so using the number keys (1–5) on the keyboard. The next trial then begins automatically. A key part of the experimental setup in Experiment Builder is the “datasource.” Figure 8.2 below shows the datasource from our experiment. This is the key information that will be used both in the presentation of the sentences and in the data analysis. (Experiment Builder interfaces with Excel, and so the datasource can be created in Excel and then copied and pasted into Experiment Builder.) The columns in the datasource contain information about item number, the different conditions, the key regions of interest, and the sentences to be presented to participants. Cond1 in the example shows which verb was presented in the second sentence, Cond2 shows whether the final sentence was consistent or inconsistent, and Cond3 tells us which type of context was provided by the first sentence. In this particular study, we examined five key regions of each sentence (see columns R1–R5). Finally, the sentence and plausibility prompt are in the final two columns. Each of these variables will appear in the interest area report, which is generated after data collection by SR Research Data Viewer Software. Figure 8.3 shows the main graphical user interface for SR Research Data Viewer Software.
In this example, we have uploaded one participant’s data (subject 4); the left side of the interface shows a list of the trials, and just below it the fixations made on each trial. The main part of the interface overlays what the participant was looking at, and the blue circles show individual fixations. The vertical red lines show how the sentence was divided into different regions of interest, which was based on prior studies and our empirical hypotheses. The main analyses are obtained from the “Analysis” menu at the top of the interface. When a user clicks on “Analysis,” one of the options is “Interest Area Report” (see Fig. 8.4). This is the main analysis: it makes Data Viewer provide an output (in Excel) which contains a summary of fixation data for each interest area for each trial. The researcher can choose from the list of available measures on the right-hand side and move the selected ones to the left-hand side. IA_DWELL_TIME is the total reading time (sum of all fixations) on the interest area. A user can also select common and important measures, such as IA_FIRST_RUN_DWELL_TIME (i.e., first-pass reading time). (The second-pass reading times crucial for our study then need to be calculated manually, by subtracting the first-pass from the total reading time; there is no variable to select for this measure.) The full details of how to programme the experiment are beyond the scope of this chapter, but we would encourage novice eye-tracking researchers to make full use of the manuals that accompany the chosen software (and/or hardware).

Fig. 8.1 User interface of SR Experiment Builder
Fig. 8.2 Datasource from SR Experiment Builder
Fig. 8.3 Main user interface for SR Research Data Viewer
Fig. 8.4 Interest area output report from SR Research Data Viewer
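The manual subtraction just mentioned is simple enough to sketch. The column names below follow the Data Viewer interest area report; the surrounding code and the example values are hypothetical:

```python
def second_pass_time(row):
    """Second-pass (re-reading) time for one interest area on one trial.

    Computed as total reading time minus first-pass reading time, using
    the variable names from the Data Viewer interest area report.
    """
    return row["IA_DWELL_TIME"] - row["IA_FIRST_RUN_DWELL_TIME"]

# Hypothetical trial: 612 ms total dwell time, 305 ms first-pass time.
trial = {"IA_DWELL_TIME": 612, "IA_FIRST_RUN_DWELL_TIME": 305}
reread = second_pass_time(trial)  # 307 ms of re-reading
```

In practice the same subtraction is applied to every row of the exported report (e.g., with a spreadsheet formula), one row per interest area per trial.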
Eye-tracking manufacturers operate support centres that provide swift assistance when required. Once you have secured ethical approval, programmed the experiment and ensured that it is working properly, and determined the number of participants needed, you are ready to run your experiment. Our study illustrates the usual protocol for experimental sessions: Upon entering the lab, participants were asked to read an information sheet outlining the general nature of the study and indicating any risks, or possible adverse effects on participants. Once they had signed a form to give their consent to participate in the study, they were asked to complete a simple demographic form. Following this, they sat at the eye tracker, facing the screen that displays the stimuli. It is good practice to ensure that participants are as comfortable as possible, adjusting the height of the chair and/or the height of the chinrest, as required. Chinrests are used to reduce head movements, which helps ensure that the calibration (see below) stays valid throughout the experiment. An adjustable-height table is the ideal solution for ensuring participant comfort and avoiding neck or back strain, especially in longer sessions. It is best for the lab not to have a window, to reduce ambient infrared light in the room, and to have a dimmer switch, so that pupil size can be adjusted via the luminance in the room. In our study, the experimenter instructed participants verbally about the task. (This can also be done in writing, on the initial screen.) Participants had the opportunity to ask questions about the task. We complemented the experimenter’s verbal instructions with practice trials that preceded the critical trials.
Following the practice trials, the experimenter had a quick chat with the participant about the pattern of responses expected for the practice trials, and explained again the relevant notion of “plausibility.” This further improved participants’ understanding of the task. Initial instructions and practice trials were separated by the calibration and validation procedure. Calibration is the procedure that allows the eye-tracking software to map fixation location onto the computer screen for each participant individually. The participant is asked to fixate dots on the screen, and the eye tracker records the eye position for these locations. This is followed by a validation procedure that allows the software to determine whether calibration was successful. The EyeLink software automatically informs the researcher whether the calibration and validation were acceptable (it handily returns a result of “good,” “neutral,” or “bad”). If the result was “good,” the experiment can proceed. If it was “neutral” or “bad,” the calibration and validation procedure needs to be repeated. Initial calibration, prior to the first trials, can be followed by re-calibration at any point in the experiment, should the eye tracker lose track of the eye. We used a 9-point calibration and validation procedure: Participants fixated nine dots on the screen (in a 3 × 3 grid), for both calibration and validation. A few practical issues for eye tracking may emerge at the calibration stage. The most obvious is glasses. Occasionally, it is not possible to avoid reflection, which prevents calibration. There are a few things that researchers can do proactively to minimise problems with glasses. The first is to contact participants ahead of their session and instruct them to wear contact lenses, if they are an option. The second is to determine whether participants can read the text without glasses.
This will depend upon the severity of correction that the participant needs, but also on the size of the font. Glasses will adversely affect data collection in approximately 5–10% of participants. A second issue is mascara, which absorbs infrared light and can interfere with the eye tracker’s ability to detect the pupil. Where this presents a problem, it typically only requires the participant to remove the mascara. We have always found participants willing to do this, and we keep eye makeup remover and a mirror in the lab. Finally, in some cases, participants simply cannot be calibrated. This could be due to either the shape or the angle of their eye, in which case the pupil becomes obscured behind the eyelid (particularly when they look downwards to the left or right). Another problem is that some participants do not have sufficient voluntary control of their eye movements, which means that calibration cannot be performed. We estimate that about 1% of invited participants will fall into these two “problem categories.” Once calibration has been successful, trials start. Each trial begins with a drift-correction dot, which the participant must fixate in order for the experimenter to start the trial. Next, the sentence or text is presented on the computer screen. Items are presented in a single line, in a font sufficiently large to allow for easy reading. Participants read the text silently to themselves, and once finished, advance to the next part of the trial. In most studies, this is either a comprehension question or a plausibility rating task concerning the item just read. In our study, participants received a plausibility rating prompt following the sentence (“Please rate the plausibility of what you just read”). Participants pressed number keys “1” to “5” on the keyboard to indicate their plausibility rating, on a 5-point scale.
“1” had initially been explained as “very implausible,” “5” as “very plausible,” and “3” as “neutral.” Following the rating, the drift-correction dot reappeared, and the next trial began. Trials can be separated into blocks, and this may be particularly important for long experiments in which participants can take a break periodically. Trials can be fully randomised or pseudo-randomised within blocks in the event that the experimenter does not want critical trials appearing consecutively (i.e., to ensure that at least one filler is presented between critical trials). Randomisation ensures that any given item will be presented to different participants at different points in the experiment and thus guards against order effects, i.e., the risk that the order in which items are presented influences participants’ responses to them. In our study, items were presented in a single block, and the order of presentation changed from one participant to the next in a random fashion (i.e., without constraints on the ordering of critical and filler items).

8.6.2 Apparatus

The following list indicates information that needs to be reported in publication (see Sect. 8.9).

• The make and model of eye tracker.
• The sample rate of the eye tracker in Hz (e.g., 120 Hz, 1,000 Hz).
• The accuracy and precision of the eye tracker.
• Whether one or both eyes were tracked. If only one eye was tracked, which one?
• What was the eye-tracking set-up (desktop mount, remote tracking, head-mounted)? Was a chin/head rest used?
• The make and model of the monitor.
• The size and resolution of the monitor.
• Viewing distance from the eyes to the monitor, which permits calculation of visual angle.
• The software that was used for stimulus presentation.
• The stimulus size(s).

For the present study, eye movements were recorded with an SR Research Ltd. EyeLink 1000 (desktop) eye tracker, which records the position of the reader’s eye every millisecond (1,000 Hz).
Head movements were minimised with a chin rest. Eye movements were recorded from the right eye. The sentences were presented in 16 pt Times New Roman black font on a white background. The monitor was a 19-inch (48.26 cm) CRT, and the distance from the eyes to the monitor was 70 cm. Experiment Builder software was used to program the experiment.

8.7 Analysing the Results

We now turn from data collection to data analysis. We will explain how to analyse data with the help of two commonly used software packages: how to use SR Research Data Viewer to extract the data from the data file saved by the eye tracker (Sect. 8.7.1) and how to use SPSS to conduct inferential analyses (Sect. 8.7.2). Then we will demonstrate how to deploy such analyses, by spelling out the analysis of the plausibility and eye-tracking data from the study that serves as our worked example (Sect. 8.7.3). For an introduction to the relevant statistical techniques, the reader is referred to textbooks. Field and Hole (2003, 109–191) offer an accessible introduction with reference to SPSS.

8.7.1 Preparatory Steps

The eye tracker collects quantitative measurements (eye fixation durations and plausibility ratings) for different experimental conditions, and SPSS is used to calculate the means for each condition as well as to conduct subsequent statistical analyses. The data produced by Data Viewer can be extracted either as a plain text file or already in Microsoft Excel format (XLS). The Data Viewer manual explains the procedures for creating data reports. In short, click on “Analysis” and then “Interest Area Report.” It is then possible to select which available variables you want (e.g., first-pass reading times), and then click “OK.” The reports from Data Viewer are then saved as an Excel file. The pivot table function in Excel is the easiest way to calculate the means, which can then be copied directly into SPSS.
This generates “data frames” which perspicuously represent data in a table. The basic data frame (for SPSS) should be in an N × K format in which each of the N rows corresponds to one participant’s data and the K columns represent the different within-subject conditions. Where a between-subjects design is used, a column may also represent the “categorical classifier” that indicates in which of the between-subjects conditions the participant was placed. Let’s consider a toy example. The study that serves as our “proper” worked example used a quite complex 2 × 2 × 3 (veridicality × consistency × verb) design. To explain the first steps, let’s use a simpler design. Recall that our hypotheses do not posit any differences between appearance verbs. Had we not been interested in such potential differences, we could have examined our hypotheses just as well with a simpler 2 × 2 (veridicality × consistency) design, with two levels for each IV: Pre-verbal contexts are either non-veridical or neutral, and sequels are either s-consistent or s-inconsistent. Thus, in this simpler design, we have four experimental conditions, and both variables are manipulated within subject. The data frame then has four columns of data, while the number of rows corresponds to the number of participants tested (e.g., the top panel in Fig. 8.5 shows data for three participants). If one of the manipulated variables (say, veridicality) is between subjects, then the data frame has three columns (see the bottom panel in Fig. 8.5).

Fig. 8.5 Example data frames from a within-subject (top left) and between-subject (bottom left) design. Right side shows an example histogram with one clear outlier
One of those columns (NegNeu) indicates which veridicality condition (non-veridical = 1 or neutral = 2) the participant was assigned to, and the other two columns display data from the s-consistent and s-inconsistent conditions that are manipulated within subject (see bottom panel of Fig. 8.5). In this illustration, the numbers in each cell of the data frame represent the mean re-reading time on a single word in a sentence. It is a mean because each participant reads multiple items for each condition: the numbers represent fixation durations in milliseconds, averaged across all trials in which the participant read that word (say, the source adjective) in that condition. From these data frames, SPSS generates the descriptive statistics (i.e., means and standard deviations) for each condition. This allows you to take the final two preparatory steps: First, you need to check the data for outliers, i.e., “oddball” subjects or items that behave very differently from the vast majority. To detect outliers, you should save the standardised values (Z-scores) as variables and determine whether the mean for any individual participant (averaged across all relevant items) or for any item (averaged across all participants) is more than 2.5 to 3 standard deviations from the overall sample’s mean for the condition. Z-scores are “standardised,” which means that they have a mean of 0 and a standard deviation of 1. You can also examine a histogram plot of each variable (the standardised scores) and visually inspect whether there are any large gaps at the upper or lower end of the distribution (see right panel of Fig. 8.5). Outliers can be dealt with in various ways (Stevens, 2002b): Outlying datapoints can be eliminated from the dataset, replaced with the mean for the condition, or “trimmed” by moving the datapoint inwards so that rank order is preserved.
“Trimming” involves changing the datapoint so that there is no gap in the distribution, but the changed datapoint is still the highest value of the distribution. In the example above, “trimming” would move the reading time outlier (i.e., the one > 1,000 ms) to be adjacent to the next highest datapoint, i.e., change the value to 810 ms (as the second highest reading time in the above histogram is 809 ms). Overall, the issue of what to do with outliers (once identified) is not altogether settled. If you do change the data, it is best practice to be as transparent as possible; researchers should therefore report the threshold they used to define outliers, the number/percentage of affected datapoints, and how the data were ultimately handled. One may even consider reporting the statistical analyses both with and without the outliers. This is because outlier exclusion/replacement/trimming will most often tighten distributions and make it more likely to find statistically significant differences. As a final preparatory step, you need to ascertain that the data for each condition display a normal distribution (i.e., meet the criteria for parametric statistical tests). If the data are not normally distributed, then transformations should be applied. Normal distribution is determined by using the “Explore” function in SPSS (Analyze → Descriptives → Explore). The results of the Explore function will produce a table containing the skew statistic, i.e., a numerical value indicating how far the distribution is from a normal distribution, as well as the standard error of the skew. The skew statistic should be less than twice its standard error. For example, if the skew is .852 and the standard error is .343, then because .852 is larger than .686 (.343 multiplied by two), a transformation should be applied.
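Both preparatory checks can be illustrated in a few lines of code. The functions below are our own illustrative sketch, run on hypothetical data; note that the standard error of the skew uses the common approximation sqrt(6/n), whereas SPSS computes a slightly more exact value, so results can differ marginally near the cutoff:

```python
import math
from statistics import mean, stdev

def flag_outliers(values, threshold=2.5):
    """Return indices of datapoints whose |z-score| exceeds the threshold.

    Z-scores standardise each value as (x - mean) / SD; a threshold of
    2.5 to 3 is the usual range for outlier screening.
    """
    m, sd = mean(values), stdev(values)
    return [i for i, x in enumerate(values) if abs((x - m) / sd) > threshold]

def needs_transform(values):
    """Rule of thumb: transform if the skew exceeds twice its standard error."""
    n, m, sd = len(values), mean(values), stdev(values)
    skew = n / ((n - 1) * (n - 2)) * sum(((x - m) / sd) ** 3 for x in values)
    return abs(skew) > 2 * math.sqrt(6 / n)

def log_transform(values):
    """Logarithm transform, the usual choice for positively skewed reading times."""
    return [math.log(x) for x in values]
```

With positively skewed reading times, `needs_transform` will typically fire, and the log transform is then applied to all levels of the repeated-measures variable.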
We typically apply the transformations suggested by Tabachnick and Fidell (1989), which are square root, logarithm, or inverse, depending upon the severity of the skew. The most common transformation for reaction times (or reading times), which tend to show positive skew, is the logarithm transformation. Finally, if you have within-subject variables (i.e., repeated measures), then the same transformation should be applied to all levels of that variable.

8.7.2 Conducting Inferential Analyses: ANOVAs

We are now ready to conduct inferential analyses that tell us whether the means we observe in different conditions are significantly different—roughly speaking, whether the differences we observe between conditions are real, rather than random. This is indicated by the p-value, which tells us how probable it would be to obtain the observed results by chance if in fact there were no differences between the conditions (in which case the experimental manipulation of the relevant independent variable(s) made no difference to responses). The simplest such analysis, which allows us to compare two means, is the t-test (see Chap. 3). For more complex experiments, which manipulate a variable with more than two levels, or manipulate several variables, we need to conduct a more complex analysis: an ANOVA, or factorial analysis of variance. This analysis assesses a main effect for each independent variable and the interaction between variables. For the 2 × 2 design in our toy example, the results of the ANOVA will assess the main effects of consistency and veridicality. It will tell us whether, e.g., the difference between means for a dependent variable (say, plausibility) observed in all s-consistent conditions, on the one hand, and all s-inconsistent conditions, on the other hand, is real or random. The analysis will also tell us whether the two variables interact with one another.
This means that the relationship between one IV (e.g., consistency) and the DV (plausibility) differs depending on the level of the other IV (veridicality). The analysis thus lets us assess, for example, whether the consistency manipulation (s-consistent vs. s-inconsistent) affects plausibility ratings differently when pre-verbal contexts are non-veridical than when they have neutral veridicality. Only if this “omnibus” analysis of the entire dataset shows a significant interaction or main effects are more targeted analyses legitimate. In particular, where two variables interact (or a variable has a main effect), we may compare the means of the conditions involved. To return to our toy example, suppose our ANOVA revealed an interaction between veridicality and consistency. Then we could compare the means of the four conditions involved: (1) non-veridical-consistent vs. non-veridical-inconsistent, (2) neutral-consistent vs. neutral-inconsistent, (3) non-veridical-consistent vs. neutral-consistent, and (4) non-veridical-inconsistent vs. neutral-inconsistent. These comparisons are done with t-tests that can be thought of as locating the “sources” of the interaction. Where follow-up comparisons involve several statistical tests (here, four t-tests), correction for multiple comparisons is necessary (see Chap. 3). Eye-tracking studies demand several analyses of variance, namely one on each eye movement measure (e.g., first-pass reading time, summed re-reading time, etc.) for each region of interest (e.g., source adjective, source object, etc.). An ANOVA will also be conducted on the results of the plausibility rating (or comprehension) task, if the study utilised such an outcome measure. If the design is within subject, the ANOVA is done in SPSS via the functions General Linear Model → Repeated Measures. If there is a significant interaction, then the follow-up t-tests will be conducted via Compare Means → Paired Samples.
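To build intuition for what the interaction test assesses: in a fully within-subjects 2 × 2 design, the interaction amounts to a one-sample test on a per-participant difference of differences. The sketch below is a conceptual illustration with hypothetical re-reading times, not a substitute for the SPSS repeated-measures ANOVA:

```python
def interaction_scores(data):
    """Per-participant interaction contrast for a 2 x 2 within-subjects design.

    `data` maps condition labels to lists of per-participant mean re-reading
    times, with participants in the same order in every list. The contrast
        (nonver_incon - nonver_con) - (neutral_incon - neutral_con)
    is positive when the consistency effect is larger in the non-veridical
    than in the neutral veridicality condition.
    """
    return [
        (ni - nc) - (ui - uc)
        for nc, ni, uc, ui in zip(
            data["nonveridical_consistent"],
            data["nonveridical_inconsistent"],
            data["neutral_consistent"],
            data["neutral_inconsistent"],
        )
    ]

# Hypothetical data for three participants (mean re-reading times in ms).
example = {
    "nonveridical_consistent": [300, 310, 320],
    "nonveridical_inconsistent": [380, 400, 390],
    "neutral_consistent": [305, 315, 318],
    "neutral_inconsistent": [315, 330, 328],
}
contrasts = interaction_scores(example)  # [70, 75, 60]
```

Testing these contrast scores against zero with a one-sample t-test answers the same question as the ANOVA’s interaction term; the four follow-up comparisons listed above are then paired-samples t-tests with a correction for multiple comparisons.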
Finally, ANOVAs need to be conducted on both subjects and items: We need to exclude the possibility that the effects of the experimental manipulations we observe in our data are just driven by a minority of atypical participants or a minority of atypical items. Outlier analysis (above) identifies and excludes the clearest oddballs. Even so, our analyses need to ensure that most participants and most items show the observed differences. If we observe significant differences, for example, in mean re-reading times, between two conditions, we would report the means and infer that re-reading times were significant