Statistics 1
J.S. Abdey
ST104a
2019

Undergraduate study in Economics, Management, Finance and the Social Sciences

This subject guide is for a 100 course offered as part of the University of London undergraduate study in Economics, Management, Finance and the Social Sciences. This is equivalent to Level 4 within the Framework for Higher Education Qualifications in England, Wales and Northern Ireland (FHEQ). For more information about the University of London, see: london.ac.uk

This guide was prepared for the University of London by: James S. Abdey, BA (Hons), MSc, PGCertHE, PhD, Department of Statistics, London School of Economics and Political Science.

This is one of a series of subject guides published by the University. We regret that, due to pressure of work, the author is unable to enter into any correspondence relating to, or arising from, the guide. If you have any comments on this subject guide, favourable or unfavourable, please use the form at the back of this guide.

In this and other publications you may see references to the ‘University of London International Programmes’, which was the name of the University’s flexible and distance learning arm until 2018. It is now known simply as the ‘University of London’, which better reflects the academic award our students are working towards. The change in name will be incorporated into our materials as they are revised.

University of London
Publications Office
Stewart House
32 Russell Square
London WC1B 5DN
United Kingdom
london.ac.uk

Published by: University of London
© University of London 2019

The University of London asserts copyright over all material in this subject guide except where otherwise indicated. All rights reserved. No part of this work may be reproduced in any form, or by any means, without permission in writing from the publisher. We make every effort to respect copyright. If you think we have inadvertently used your copyright material, please let us know.
Contents

1 Introduction
  1.1 General introduction to the subject area
  1.2 Aims of the course
  1.3 Learning outcomes
  1.4 Syllabus
  1.5 Essential reading list and other learning resources
    1.5.1 Further reading
  1.6 How to study statistics
    1.6.1 Mathematical background
    1.6.2 Calculators
  1.7 How to use the subject guide
    1.7.1 Using the subject guide and the textbook
    1.7.2 Online study resources
    1.7.3 Virtual learning environment (VLE)
    1.7.4 Making use of the Online Library
    1.7.5 Structure of the subject guide
    1.7.6 Time management
    1.7.7 Recommendations for working through the chapters
  1.8 Examination advice

2 Mathematical revision – simple algebra and coordinate geometry
  2.1 Aims
  2.2 Learning outcomes
  2.3 Recommended reading
  2.4 Introduction
  2.5 Arithmetic operations
  2.6 Squares and square roots
  2.7 Fractions and percentages
  2.8 Some further notation
    2.8.1 Absolute value
    2.8.2 Inequalities
  2.9 Summation operator, Σ
  2.10 Graphs
  2.11 The graph of a linear function
  2.12 Summary
  2.13 Key terms and concepts
  2.14 Learning activities
  2.15 A reminder of your learning outcomes
  2.16 Sample examination questions

3 The nature of statistics
  3.1 Aims
  3.2 Learning outcomes
  3.3 Essential reading
  3.4 Further reading
  3.5 Introduction
  3.6 Coverage of the course
  3.7 Terminology
    3.7.1 Population
    3.7.2 Bias
    3.7.3 Parameters
    3.7.4 Sampling
  3.8 Summary
  3.9 Key terms and concepts
  3.10 Learning activities
  3.11 A reminder of your learning outcomes
  3.12 Self assessment

4 Data visualisation and descriptive statistics
  4.1 Aims
  4.2 Learning outcomes
  4.3 Essential reading
  4.4 Further reading
  4.5 Introduction
  4.6 Types of variable
    4.6.1 Categorical variables
  4.7 Data visualisation
    4.7.1 Presentational traps
    4.7.2 Dot plot
    4.7.3 Histogram
    4.7.4 Stem-and-leaf diagram
  4.8 Measures of location
    4.8.1 Mean
    4.8.2 Median
    4.8.3 Mode
  4.9 Measures of spread
    4.9.1 Range
    4.9.2 Boxplot
    4.9.3 Variance and standard deviation
  4.10 Summary
  4.11 Test your understanding
  4.12 Key terms and concepts
  4.13 Learning activities
  4.14 A reminder of your learning outcomes
  4.15 Sample examination questions

5 Probability theory
  5.1 Aims
  5.2 Learning outcomes
  5.3 Essential reading
  5.4 Further reading
  5.5 Introduction
  5.6 The concept of probability
  5.7 Relative frequency
  5.8 ‘Randomness’
  5.9 Properties of probability
    5.9.1 Notation
    5.9.2 Venn diagrams
    5.9.3 The additive law
    5.9.4 The multiplicative law
  5.10 Conditional probability and Bayes’ formula
    5.10.1 Bayes’ formula
    5.10.2 Total probability formula
    5.10.3 Independent events (revisited)
  5.11 Probability trees
  5.12 Summary
  5.13 Key terms and concepts
  5.14 Learning activities
  5.15 A reminder of your learning outcomes
  5.16 Sample examination questions

6 The normal distribution and ideas of sampling
  6.1 Aims
  6.2 Learning outcomes
  6.3 Essential reading
  6.4 Further reading
  6.5 Introduction
  6.6 Random variables
  6.7 Population mean and variance
    6.7.1 Population mean
    6.7.2 Population variance
  6.8 The normal distribution
    6.8.1 Relevance of the normal distribution
    6.8.2 Consequences of the central limit theorem
    6.8.3 Characteristics of the normal distribution
    6.8.4 Standard normal tables
    6.8.5 The general normal distribution
  6.9 Sampling distributions
  6.10 Sampling distribution of X̄
  6.11 Summary
  6.12 Key terms and concepts
  6.13 Learning activities
  6.14 A reminder of your learning outcomes
  6.15 Sample examination questions

7 Point and interval estimation
  7.1 Aims
  7.2 Learning outcomes
  7.3 Essential reading
  7.4 Further reading
  7.5 Introduction
  7.6 Principle of confidence intervals
  7.7 General formulae for normally-distributed statistics
    7.7.1 Standard error known
    7.7.2 Standard error unknown
    7.7.3 (Student’s) t distribution
  7.8 Confidence interval for a single mean (σ known)
  7.9 Confidence interval for a single mean (σ unknown)
  7.10 Confidence interval for a single proportion
  7.11 Sample size determination
  7.12 Difference between two population proportions
  7.13 Difference between two population means
    7.13.1 Unpaired samples – variances known
    7.13.2 Unpaired samples – variances unknown and unequal
    7.13.3 Unpaired samples – variances unknown and equal
    7.13.4 Paired (dependent) samples
  7.14 Summary
  7.15 Key terms and concepts
  7.16 Learning activities
  7.17 A reminder of your learning outcomes
  7.18 Sample examination questions

8 Hypothesis testing
  8.1 Aims
  8.2 Learning outcomes
  8.3 Essential reading
  8.4 Further reading
  8.5 Introduction
  8.6 Statistical tests
  8.7 Types of error
  8.8 Tests for normal populations
  8.9 Significance levels
    8.9.1 Order of conducting tests
  8.10 One-tailed and two-tailed tests
  8.11 P-values
  8.12 Hypothesis test for a single mean (σ known)
  8.13 Hypothesis test for a single mean (σ unknown)
  8.14 Hypothesis test for a single proportion
  8.15 Difference between two population proportions
  8.16 Difference between two population means
    8.16.1 Unpaired samples – variances known
    8.16.2 Unpaired samples – variances unknown and unequal
    8.16.3 Unpaired samples – variances unknown and equal
    8.16.4 Paired (dependent) samples
  8.17 Summary
  8.18 Key terms and concepts
  8.19 Learning activities
  8.20 A reminder of your learning outcomes
  8.21 Sample examination questions

9 Contingency tables and the chi-squared test
  9.1 Aims
  9.2 Learning outcomes
  9.3 Essential reading
  9.4 Further reading
  9.5 Introduction
  9.6 Correlation and association
  9.7 Tests for association
    9.7.1 Contingency tables
    9.7.2 Expected frequencies
    9.7.3 Test statistic
    9.7.4 The χ² distribution
    9.7.5 Degrees of freedom
    9.7.6 Performing the test
    9.7.7 Extending to an appropriate test of proportions
  9.8 Goodness-of-fit tests
    9.8.1 Observed and expected frequencies
    9.8.2 The goodness-of-fit test
  9.9 Summary
  9.10 Key terms and concepts
  9.11 Learning activities
  9.12 A reminder of your learning outcomes
  9.13 Sample examination questions

10 Sampling design
  10.1 Aims
  10.2 Learning outcomes
  10.3 Essential reading
  10.4 Further reading
  10.5 Introduction
  10.6 Motivation for sampling
  10.7 Types of sample
    10.7.1 Non-probability sampling
    10.7.2 Probability sampling
  10.8 Types of error
  10.9 Pilot and post-enumeration surveys
  10.10 Non-response and response bias
  10.11 Method of contact
  10.12 Summary
  10.13 Key terms and concepts
  10.14 Learning activities
  10.15 A reminder of your learning outcomes
  10.16 Sample examination questions

11 Some ideas underlying causation – the use of control groups and time order
  11.1 Aims
  11.2 Learning outcomes
  11.3 Essential reading
  11.4 Further reading
  11.5 Introduction
  11.6 Observational studies and designed experiments
  11.7 Use of the control group
    11.7.1 Observational study
    11.7.2 Experimental study
  11.8 Time order
    11.8.1 Longitudinal surveys
    11.8.2 Panel surveys
  11.9 Case study: Smoking and lung cancer
  11.10 Summary
  11.11 Key terms and concepts
  11.12 Learning activities
  11.13 A reminder of your learning outcomes
  11.14 Sample examination questions

12 Correlation and linear regression
  12.1 Aims
  12.2 Learning outcomes
  12.3 Essential reading
  12.4 Further reading
  12.5 Introduction
  12.6 Scatter diagrams
  12.7 Causal and non-causal relationships
  12.8 Correlation coefficient
    12.8.1 Spearman rank correlation
  12.9 Linear regression
    12.9.1 The simple linear regression model
    12.9.2 Parameter estimation
    12.9.3 Prediction
    12.9.4 Points to watch about linear regression
  12.10 Points to note about correlation and regression
  12.11 Summary
  12.12 Key terms and concepts
  12.13 Learning activities
  12.14 A reminder of your learning outcomes
  12.15 Sample examination questions

A Solutions to Learning activities
  A.1 Chapter 2 – Mathematical revision – simple algebra and coordinate geometry
  A.2 Chapter 3 – The nature of statistics
  A.3 Chapter 4 – Data visualisation and descriptive statistics
  A.4 Chapter 5 – Probability theory
  A.5 Chapter 6 – The normal distribution and ideas of sampling
  A.6 Chapter 7 – Point and interval estimation
  A.7 Chapter 8 – Hypothesis testing
  A.8 Chapter 9 – Contingency tables and the chi-squared test
  A.9 Chapter 10 – Sampling design
  A.10 Chapter 11 – Some ideas underlying causation – the use of control groups and time order
  A.11 Chapter 12 – Correlation and linear regression

B Solutions to Sample examination questions
  B.1 Chapter 2 – Mathematical revision – simple algebra and coordinate geometry
  B.2 Chapter 4 – Data visualisation and descriptive statistics
  B.3 Chapter 5 – Probability theory
  B.4 Chapter 6 – The normal distribution and ideas of sampling
  B.5 Chapter 7 – Point and interval estimation
  B.6 Chapter 8 – Hypothesis testing
  B.7 Chapter 9 – Contingency tables and the chi-squared test
  B.8 Chapter 10 – Sampling design
  B.9 Chapter 11 – Some ideas underlying causation – the use of control groups and time order
  B.10 Chapter 12 – Correlation and linear regression

C Examination formula sheet

Chapter 1
Introduction

1.1 General introduction to the subject area

Welcome to the world of statistics! This is a discipline with unparalleled applicability, used in a wide range of areas such as finance, business, management, economics and other fields in the social sciences. ST104a Statistics 1 provides you with the opportunity to grasp the fundamentals of the subject and will equip you with the vital quantitative skills and powers of analysis which are highly sought-after by employers in many sectors.

The material in this half-course is necessary preparation for other courses you may study later on as part of your degree. You may choose to take ST104b Statistics 2 so that you can study the concepts introduced here in greater depth. Natural continuations of this half-course and ST104b Statistics 2 are the advanced half-courses ST2133 Advanced statistics: distribution theory and ST2134 Advanced statistics: statistical inference. Two applied statistics courses for which this half-course is a pre-requisite are ST2187 Business analytics, applied modelling and prediction and ST3188 Statistical methods for market research. You may wish to develop your knowledge of economic statistics by taking EC2020 Elements of econometrics, or to build on your interests in social research by taking SC2145 Social research methods. There are also links with MN3141 Principles of marketing.

So the reach of statistics is considerable. It rightly forms a core component of our programmes, since all of the courses mentioned above require an understanding of the concepts and techniques introduced in ST104a Statistics 1. The analytical skills which you will develop on this course will, therefore, stand you in very good stead both for your future studies and beyond into the real world of work.
1.2 Aims of the course

The emphasis of this half-course is on the application of statistical methods in management, economics and the social sciences. Attention will focus on the interpretation of tables and results, as well as the appropriate way to approach statistical problems. Treatment is at an elementary mathematical level. Ideas of probability, inference and multivariate analysis are introduced and are further developed in the half-course ST104b Statistics 2.

1.3 Learning outcomes

At the end of this half-course, and having completed the essential reading and activities, you should:

- be familiar with the key ideas of statistics that are accessible to a student with a moderate mathematical competence
- be able to routinely apply a variety of methods for explaining, summarising and presenting data and interpreting results clearly using appropriate diagrams, titles and labels when required
- be able to summarise the ideas of randomness and variability, and the way in which these link to probability theory to allow the systematic and logical collection of statistical techniques of great practical importance in many applied areas
- have a grounding in probability theory and some grasp of the most common statistical methods
- be able to perform inference to test the significance of common measures such as means and proportions and conduct chi-squared tests of contingency tables
- be able to use simple linear regression and correlation analysis and know when it is appropriate to do so.

1.4 Syllabus

This course introduces some of the basic ideas of theoretical statistics, emphasising the applications of these methods and the interpretation of tables and results.

Basic background: Elementary summation signs, elementary probability, Venn and tree diagrams.

Data collection: Elements of survey design, the stages of a survey, ideas of randomness, observation and experiment.
Data presentation and analysis: Descriptive statistics, measures of location and dispersion, pictorial and graphical representation.

The normal distribution: Estimation of mean, proportion, standard deviation, confidence intervals and hypothesis testing. Ideas of testing for differences between means and proportions. The use of Student’s t.

Goodness-of-fit tests: The chi-squared distribution and contingency tables.

Regression and correlation: An introduction to the ideas of regression and correlation, least squares, estimation of a, b and r, scatter diagrams.

1.5 Essential reading list and other learning resources

Numerous titles are available covering the topics frequently covered in foundation statistics courses such as ST104a Statistics 1. Due to the inevitable heterogeneity among students taking this half-course, some may find one author’s style easier to understand than another’s. That said, the recommended textbook for this course is:

- Newbold, P., W.L. Carlson and B.M. Thorne, Statistics for Business and Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060].

This textbook contains many additional examples, complete with solutions to selected even-numbered exercises. To ensure you fully understand each topic, you should attempt these, and check your answers against the solutions provided. In addition, the textbook introduces all the topics in an alternative style to this subject guide, so if you have difficulty following a specific point, you may wish to consult the textbook for further clarification. A non-exhaustive list of other suggested textbooks is outlined below under ‘Further reading’.

Statistical tables will be provided in the examination for ST104a Statistics 1, hence you should also consult:

- Lindley, D.V. and W.F. Scott, New Cambridge Statistical Tables (Cambridge: Cambridge University Press, 1995) second edition [ISBN 9780521484855].
These statistical tables form the basis of the statistical tables which are provided for you to use in the examination. It is essential that you familiarise yourself with these in advance rather than those printed in your textbook, as the method of navigating through statistical tables is not uniform across all publications. In order to prepare yourself, you should concentrate in particular on using Tables 4, 5, 7, 8, 9 and 10 of the New Cambridge Statistical Tables. These relate to the (standard) normal, Student’s t and chi-squared distributions. Detailed reading references in this subject guide refer to the editions of the set textbook listed above. Newer editions may have been published by the time you study this half-course. You can use a more recent edition of the textbook – use the detailed chapter and section headings and the index to identify relevant readings. Also, check the virtual learning environment (VLE) regularly for updated guidance on readings. 1.5.1 Further reading As previously mentioned, numerous alternative textbooks are available to purchase, although references in this subject guide will be to the eighth edition of Newbold et al. If you do decide to purchase a different/additional textbook, do ensure it covers all the syllabus topics (listed earlier in this introductory chapter). A word of caution at this point: statistical notation can differ between textbooks so be prepared for this. Note that as long as you read the Essential reading you are then free to read around the subject area in any textbook, paper or online resource. You will need to support your 3 1 1 1. Introduction learning by reading as widely as possible and by thinking about how these principles apply in the real world. To help you read extensively, you have free access to the VLE and University of London Online Library (see below). Other useful textbooks for this course include: Aczel, A.D. 
Complete Business Statistics (London: McGraw–Hill Higher Education, 2009) seventh edition [ISBN 9780071287531].

Anderson, D.R., D.J. Sweeney, T.A. Williams, J. Freeman and E. Shoesmith Statistics for Business and Economics (South-Western Cengage Learning, 2010) eleventh edition [ISBN 9780324783247].

Lind, D.A., W.G. Marchal and S.A. Wathen Statistical Techniques in Business and Economics (Boston: McGraw–Hill Higher Education, 2009) fourteenth edition [ISBN 9780073401768].

Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN 9780471615170].

Chapters 9 and 10 of the subject guide concentrate on the concepts of surveys and experimentation. You will find these topics particularly useful if you are studying MN3141 Principles of marketing.

If you feel you need to refresh the basic mathematics you will need for this half-course, which is covered in Chapter 2, a recommendation is:

Anthony, M. and N. Biggs Mathematics for Economics and Finance (Cambridge: Cambridge University Press, 1996) [ISBN 9780521559133] Chapters 1, 2 and 7.

1.6 How to study statistics

1.6.1 Mathematical background

To study and understand statistics you will need some familiarity with abstract mathematical concepts combined with the common sense to see how to use these ideas in real-life applications. The concepts needed for probability and statistical inference are impossible to absorb by just reading about them in a book – although you may find you need to do this more than once! You need to read, then think, then try some problems, then read and think some more. This process should be repeated until you find the problems easy to do. Reading without practising with the examples and activities set in this subject guide will be of little help.

You will also need to be able to use high-school arithmetic and understand some basic algebraic ideas. These are very important.
Starting with them should help you feel comfortable with this half-course from the outset and these ideas are introduced to you in Chapter 2.

1.6.2 Calculators

A calculator may be used when answering questions on the examination paper for ST104a Statistics 1 and it must comply in all respects with the specification given in the Regulations. You should also refer to the Admission Notice you will receive when entering the examination and the 'notice on permitted materials'. The most important thing is that you should accustom yourself to using your chosen calculator and feel comfortable with it. Specifically, calculators must:

be hand-held, compact and portable
be quiet in operation
have no external wires
be non-programmable
not be capable of receiving, storing or displaying user-supplied non-numerical data.

Typically, an internet search engine image search for the term 'basic calculator' will show examples of permitted calculators.

1.7 How to use the subject guide

1.7.1 Using the subject guide and the textbook

This subject guide has been structured such that it is tailored to the specific requirements of the examinable material. It is 'written to the course', unlike textbooks which typically include additional material which is not examinable. Therefore, the subject guide should act as your principal resource and you should refer to the specific sections in Newbold et al as indicated. A key benefit of the textbook is that it contains numerous further examples and exercises which can be used to check and consolidate your understanding. As previously mentioned, solutions to selected even-numbered exercises are provided in the textbook, enabling you to check your answers to these questions.

1.7.2 Online study resources

In addition to the subject guide and the Essential reading, it is vitally important that you take advantage of the study resources which are available online for this half-course, including the VLE and Online Library.
You can access the VLE, Online Library and your University of London email account via the Student Portal: http://my.londoninternational.ac.uk

You should have received your login details for the Student Portal with your official offer, which was emailed to the address that you gave on your application form. You have probably already logged in to the Student Portal in order to register. As soon as you registered, you will automatically have been granted access to the VLE, Online Library and your fully functioning University of London email account. If you have forgotten these login details, please click on the 'Forgotten your password' link on the login page.

The Student Portal forms an important part of your study experience with the University of London and should, therefore, be accessed regularly.

1.7.3 Virtual learning environment (VLE)

The VLE, which complements this subject guide, has been designed to enhance your learning experience, providing additional support and a sense of community. In addition to making materials more accessible, the VLE provides an open space for you to discuss interests and to seek support from other students, working collaboratively to solve problems and discuss subject material. In a few cases, such discussions are driven and moderated by an academic who offers a form of feedback on all discussions. In other cases, video material, such as audio-visual tutorials, is available. These will typically focus on taking you through difficult concepts in the subject guide. For quantitative courses, such as Mathematics and Statistics, fully worked-through solutions of practice examination questions are available. For some qualitative courses, academic interviews and debates will provide you with advice on approaching the subject and examination questions, and will show you how to build an argument effectively.
Past examination papers and Examiners' commentaries from the past three years are available to download, which provide advice on how each examination question might best be answered. Self-testing activities allow you to test your knowledge and recall of the academic content of various courses. Finally, a section of the VLE has been dedicated to providing you with expert advice on preparing for examinations and developing digital literacy skills.

1.7.4 Making use of the Online Library

The Online Library contains a huge array of journal articles and other resources to help you read widely and extensively. To access the majority of resources via the Online Library you will either need to use your University of London Student Portal login details, or you will be required to register and use an Athens login: http://tinyurl.com/ollathens

The easiest way to locate relevant content and journal articles in the Online Library is to use the Summon search engine. If you are having trouble finding an article listed on the reading list, try removing any punctuation from the title, such as single quotation marks, question marks and colons. For further advice, please see the online help pages: www.external.shl.lon.ac.uk/summon/about.php

1.7.5 Structure of the subject guide

Statistics is fundamentally a cumulative discipline – that is, the following chapters are not a series of self-contained topics; rather, they build on each other sequentially. As such, you are strongly advised to follow the subject guide in chapter order, understanding the material in the early chapters before embarking on later material. There is little point in rushing past material which you have only partially understood in order to reach the later chapters.

1.7.6 Time management

About one-third of your private study time should be spent reading and the other two-thirds doing problems. (Note the emphasis on practising problems!)
If you are intending to study ST104a Statistics 1 over the course of one academic year, then it is recommended that you devote a minimum of seven hours per week to your studies. To help your time management, the chapters and topics of this half-course have been converted into approximate weeks to devote to each subject if you were, for example, spending 20 weeks on ST104a Statistics 1. What you should gain from the following breakdown is an indication of the relative amounts of time to be spent on each topic. Bear in mind, however, that some of you may not need to spend as much time on Chapter 2 if the concepts and techniques of basic arithmetic and algebra (in particular the use of summation signs, the equation of a straight line, and the idea of a uniform distribution) are familiar to you.

Chapter 2 – 2 weeks
Chapter 3 – 1 week
Chapter 4 – 2 weeks
Chapter 5 – 1 week
Chapter 6 – 1 week
Chapter 7 – 1 week
Chapter 8 – 1 week
Chapter 9 – 1 week
Chapter 10 – 2 weeks
Chapter 11 – 1 week
Chapter 12 – 2 weeks
Revision(!) – 5 weeks

1.7.7 Recommendations for working through the chapters

The following procedure is recommended for each chapter.

1. Carefully read the aims of the chapter and the learning outcomes.

2. Read the introduction, noting carefully the sections of Newbold et al to which you are referred.

3. Now work through each section of the chapter making sure you understand the examples given. In parallel, watch the accompanying video tutorials for each section.

4. Review the intended learning outcomes carefully, almost as a checklist. Do you think you have achieved your targets?

5. Attempt the learning activities which appear near the end of each chapter.

6. Attempt the chapter's self-test quizzes on the VLE.

7. Attempt the sample examination questions given at the end of the chapter. You can review the video solutions, but do so after attempting the questions yourself!

8.
When you have finished the material in the chapter, try the suggested questions from Newbold et al. You can treat these as additional activities. This time, though, you will have to think a little about which part of the new material you have learnt is appropriate to each question.

9. If you have problems at this point, go back to the subject guide and work through the area you find difficult again. Do not worry – you will improve your understanding to the point where you can work confidently through the problems.

10. Once you have completed your work on the subject guide, you will be ready for examination revision. Start with work on the Sample examination paper which you will find at the back of the subject guide.

The last few steps are most important. It is easy to think that you have understood the text after reading it, but working through problems is the crucial test of understanding. Problem-solving should take most of your study time (refer to the 'Time management' section above).

Note that worked examples and activities have been provided to cover each substantive topic in the subject guide. The Newbold et al examples are added for further consolidation of each chapter topic and also to help you work out exactly what the questions are about! One of the problems candidates sometimes have in an examination is that they waste time trying to understand to which part of the syllabus particular questions relate. These final questions, together with the further explanations on the VLE, aim to help with this before you tackle the sample examination questions at the end of each chapter.

Try to be disciplined about this: do not look up the answers until you have done your best. Promise? Statistical ideas may seem unfamiliar at first, but your attempts at the questions, however dissatisfied you feel with them, will help you understand the material far better than reading and re-reading the prepared answers – honestly!
So, to conclude, perseverance with problem solving is your passport to a strong examination performance. Attempting (ideally successfully!) all the cited exercises is of paramount importance.

1.8 Examination advice

Important: the information and advice given in the following section are based on the examination structure used at the time this subject guide was written. Note that subject guides may be used for several years. Because of this, it is strongly advised that you check both the current Regulations for relevant information about the examination, and the VLE where you should be advised of any forthcoming changes. You should also carefully check the rubric/instructions on the paper you actually sit and follow those instructions.

This half-course is assessed by a two-hour, unseen, written examination. No books may be taken into the examination, but you will be provided with the necessary extracts from the New Cambridge Statistical Tables and a formula sheet (which can be found in past examination papers on the VLE, as well as at the end of the subject guide). These will be supplied as part of the examination question paper for ST104a Statistics 1 rather than being provided separately. A calculator may be used when answering questions on this paper and it must comply in all respects with the specification given in the Regulations.

Section A is a series of short compulsory questions worth 50 per cent of the total marks. In Section B, you should attempt two out of a choice of three longer questions, each of which is worth 25 per cent of the total marks. As Section A will seek to assess a broad cross-section of the syllabus, you are strongly advised to study the whole syllabus. A sample examination paper is provided at the end of this subject guide.

Candidates should be aware that graph paper will be supplied as part of the examination question paper for this half-course, so you will not need to request it from the invigilator.
Candidates will need to detach this graph paper from the back of the question paper and tie any sheets used into the examination answer booklet. Candidates should, therefore, practise using this type of graph paper as part of their preparation for the examination. Examiners will expect this graph paper to be used where indicated and they will be looking for great precision in the answers to these questions.

Strategies for success, your academic and study skills handbook, also provides additional guidance on examination technique.

Remember, it is important to check the VLE for:

up-to-date information on examination and assessment arrangements for this half-course
where available, past examination papers and Examiners' commentaries for this half-course which give advice on how each question might best be answered.

Chapter 2
Mathematical revision – simple algebra and coordinate geometry

2.1 Aims

This chapter outlines the essential mathematical building blocks which you will need to work with in this half-course. Most of them will be revision to you, but some new material is introduced. Particular aims are:

to be familiar with the basic rules of arithmetic operations
to learn further notation concerned with absolute value, 'greater than' (>) and 'less than' (<) signs
to use summation signs to calculate simple statistical measures
to learn how to plot a straight line graph and identify its slope and intercept.

2.2 Learning outcomes

After completing this chapter, and having completed the recommended reading and activities, you should be able to:

manipulate arithmetic and algebraic expressions using the simple rules
recall and use common signs: square, square root, 'greater than', 'less than' and absolute value
demonstrate use of the summation sign and explain the use of the 'i', or index, of x
draw the straight line for a linear function.

2.3 Recommended reading

Anthony, M. and N.
Biggs Mathematics for Economics and Finance (Cambridge: Cambridge University Press, 1996) [ISBN 9780521559133] Chapters 1, 2 and 7.

You may find Anthony and Biggs, or any other appropriate non-specialist mathematics textbook, helpful as background for the mathematics you will need for this half-course. If you are studying for MT105a Mathematics 1, that half-course will support your studies in ST104a Statistics 1. However, you will not need all of the material in MT105a Mathematics 1 here.

In addition there is essential 'watching' of this chapter's accompanying video tutorials accessible via the ST104a Statistics 1 area at https://emfss.elearning.london.ac.uk

2.4 Introduction

This opening chapter introduces some basic concepts and mathematical tools on which the rest of the half-course is built. Before proceeding to the rest of the subject guide, it is essential that you have a solid understanding of these fundamental concepts and tools. You should be confident users of the basic mathematical operations: addition, subtraction, multiplication and division, and be able to use these operations on a basic electronic calculator. The content of this chapter is expected to be a 'refresher' of the elementary algebraic and arithmetic rules from schooldays. Some material featured in this chapter may be new to you, such as summation signs and graphs of linear functions. If so, you should master these new ideas before progressing. Finally, remember that although it is unlikely that an examination question would test you on the topics in this chapter alone, the material covered here may well be an important part of the answer!

2.5 Arithmetic operations

We begin with elementary arithmetic operations which will be used when working with figures in ST104a Statistics 1.
Students often let themselves down by understanding the statistical concepts but failing to manage a problem because they are unable to deal with the required arithmetic. Although this is not primarily an arithmetic paper, many calculations will be used, so it is vital to ensure that you are comfortable with the examples and activities presented here.

The acronym to remember is 'BODMAS', which tells us the correct order (that is, the priority) in which mathematical operations are performed:

Brackets
Order (i.e. powers, square roots etc.)
Division
Multiplication
Addition
Subtraction.

You should also know that:

the sum of a and b means a + b
the difference between a and b means either a − b or b − a
the product of a and b means a × b = a · b
the quotient of a and b means a divided by b, i.e. a/b.

Example 2.1 What is $(35 \div 7 + 2) - (4^2 - 8 \times 3)$?

'BODMAS' tells us to work out brackets first. Here there are two sets of brackets, so let us do them one at a time.

First bracket: $35 \div 7 + 2$:
do division first: $35 \div 7 + 2 = 5 + 2$
then perform the addition: $5 + 2 = 7$.

Second bracket: $4^2 - 8 \times 3$:
do order first: $4^2 - 8 \times 3 = 16 - 8 \times 3$
next do multiplication: $16 - 8 \times 3 = 16 - 24$
then perform the subtraction: $16 - 24 = -8$.

Now the problem has been simplified, we complete the calculation with the final subtraction: $7 - (-8) = 7 + 8 = 15$. Note the two negatives become positive!

2.6 Squares and square roots

The power is the number of times a quantity is to be multiplied by itself. For example, $3^4 = 3 \times 3 \times 3 \times 3 = 81$. Any number raised to the power 2 is called 'squared', hence $x^2$ is '$x$ squared', which is simply $x \times x$.

Remember that squared values, such as $x^2$, are always non-negative. This is important, for example, when we compute the quantity $s^2$ in Chapter 4 which involves squared terms, so a negative answer should ring alarm bells telling us a mistake has been made!
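Both the BODMAS ordering in Example 2.1 and the non-negativity of squares can be checked quickly on a computer. The following minimal Python sketch is purely illustrative (Python is not part of this course, and any calculator permitted in the examination will apply the same conventions):

```python
# BODMAS check of Example 2.1: (35 / 7 + 2) - (4^2 - 8 * 3).
# Python's operator precedence follows the same conventions:
# powers before multiplication/division before addition/subtraction.
first_bracket = 35 / 7 + 2       # division first: 5 + 2 = 7
second_bracket = 4**2 - 8 * 3    # order (power) first: 16 - 24 = -8
result = first_bracket - second_bracket

print(result)   # 7 - (-8) = 15.0

# A squared value is always non-negative, so a negative 'sum of squares'
# signals an arithmetic slip somewhere:
assert (-3)**2 == 9 and (-3)**2 >= 0
```

Note that without the explicit brackets, `35 / 7 + 2 - 4**2 - 8 * 3` would be a different calculation entirely, which is exactly the point of the BODMAS rules.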
It might be helpful to think of the square root of $x$ (denoted $\sqrt{x}$) as the reverse of the square, such that $\sqrt{x} \times \sqrt{x} = x$. Note that positive real numbers have two square roots: $\pm\sqrt{81} = \pm 9$, although the positive square root will always be used in ST104a Statistics 1. In practice, the main problems you will encounter involve taking square roots of numbers with decimal places. Be careful that you understand that 0.9 is the square root of 0.81, and that 0.3 is the square root of 0.09 (and not 0.9!). Of course, in the examination you can perform such calculations on your calculator, but it always helps to have an idea of what the answer should be as a feasibility check of your answer!

2.7 Fractions and percentages

A fraction is part of a whole and can be expressed as either:

common fractions: for example, 1/2 or 3/8
decimal fractions: for example, 0.5 or 0.375.

In the common fraction, the top number is the numerator and the bottom number is the denominator. In practice, decimal fractions are more commonly used.

When multiplying fractions together, just multiply all the numerators together to obtain the new numerator, and do the same with the denominators. For example:

$$\frac{4}{9} \times \frac{1}{3} \times \frac{2}{5} = \frac{4 \times 1 \times 2}{9 \times 3 \times 5} = \frac{8}{135}.$$

Percentages give an alternative way of representing fractions by relating a particular quantity to the whole in parts per hundred. For example, 60% is 60 parts per 100, which, as a common fraction, is simply 60/100.

2.8 Some further notation

2.8.1 Absolute value

One useful sign in statistics is | | which denotes the absolute value. This is the numerical value of a real number regardless of its sign (positive or negative). The absolute value of $x$, sometimes referred to as the modulus of $x$, or 'mod $x$', is $|x|$. So $|7.1| = |-7.1| = 7.1$.

Statisticians sometimes want to indicate that they only want to use the positive value of a number. For example, let the distance between town X and town Y be 5 miles.
Suppose someone walks from X to Y – a distance of 5 miles. A mathematician would write this as +5 miles. Later, after shopping, the person returns to X and the mathematician would record him as walking −5 miles (taking into account the direction of travel). Hence this way the mathematician can show the person ended up where they started. We, however, may be more interested in the fact that the person has had some exercise that day! So, we need notation to indicate this. The absolute value enables us to take only the positive values of our variables. The distance, $d$, from Y to X may well be expressed mathematically as −5 miles, but you will probably be interested in the absolute amount, so $|-d| = d$.

2.8.2 Inequalities

An inequality is a mathematical statement that one quantity is greater or less than another:

$x > y$ means '$x$ is greater than $y$'
$x \geq y$ means '$x$ is greater than or equal to $y$'
$x < y$ means '$x$ is less than $y$'
$x \leq y$ means '$x$ is less than or equal to $y$'
$x \approx y$ means '$x$ is approximately equal to $y$'.

2.9 Summation operator, $\sum$

The summation operator, $\sum$, is likely to be new to many of you. It is widely used in statistics and you will come across it frequently in ST104a Statistics 1, so make sure you are comfortable using it before proceeding further!

Statistics is all about data analysis, so to use statistical methods we need data. Individual observations are typically represented using a subscript notation. For example, the heights of $n$ people¹ would be called $x_1, x_2, \ldots, x_n$, where the subscript denotes the order in which the heights are observed ($x_1$ represents the height of the first person etc.). Hence $x_i$ represents the height of the $i$th individual and, in order to list them all, the subscript $i$ must take all integer values from 1 to $n$. So the whole set of observations is $\{x_i : i = 1, \ldots, n\}$ which can be read as 'a set of observations $x_i$ such that $i$ goes from 1 to $n$'.
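If you happen to work with a computer, the subscript notation above maps naturally onto a list or array. A brief, purely illustrative Python sketch (the heights used are made up for the example):

```python
# The guide's observations x_1, ..., x_n correspond to a list.
# Note: Python indexes from 0, so the guide's x_i is x[i - 1] in code.
x = [178, 165, 190, 171, 183]   # heights (cm) of n = 5 people; made-up data
n = len(x)

x_1 = x[0]        # the first observation, x_1
x_n = x[n - 1]    # the last observation, x_n

print(n, x_1, x_n)   # 5 178 183
```

The off-by-one shift between mathematical subscripts (starting at 1) and zero-based indexing is a common source of errors, which is why the comment makes it explicit.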
Summation operator, $\sum$

The sum of a set of $n$ observations, that is $x_1 + x_2 + \cdots + x_n$, may be written as:

$$\sum_{i=1}^{n} x_i \qquad (2.1)$$

by introducing the summation operator, $\sum$ (the Greek capital letter sigma), which can be read as 'the sum of'. Therefore, $\sum_{i=1}^{n} x_i$ is read 'sigma $x_i$, for $i$ equals 1 to $n$'.

So the summation is said to be over $i$, where $i$ is the index of summation and the range of $i$, in (2.1), is from 1 to $n$. So the lower bound of the range is the value of $i$ written underneath $\sum$, and the upper bound is written above it. Note the lower bound can be any integer (positive, negative or zero), such that the summation is over all values of the index of summation in step increments of size one from the lower bound to the upper bound.

As stated above, $\sum$ appears frequently in statistics. For example, in Chapter 4 you will meet descriptive statistics including the arithmetic mean of observations which is defined as:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

As seen in this example, rather than write out $\sum_{i=1}^{n} x_i$ in full, when all the $x_i$s are summed we sometimes write short-cuts, such as $\sum_1^n x_i$, or (when the range of summation is obvious) just $\sum x_i$.

Note that the resulting sum does not involve $i$ in any form. Hence the sum is unaffected by the choice of letter used for the index of summation, so, for example, the following summations are all equal:

$$\sum_{i=1}^{n} x_i = \sum_{j=1}^{n} x_j = \sum_{k=1}^{n} x_k.$$

Sometimes the way that $x_i$ depends on $i$ is known. For example, if $x_i = i$, we have:

$$\sum_{i=1}^{3} x_i = \sum_{i=1}^{3} i = 1 + 2 + 3 = 6.$$

However, do not always assume $x_i = i$!

Example 2.2 If $\{x_i : i = 1, \ldots, n\}$ is a set of observations, we might observe $x_1 = 4$, $x_2 = 5$, $x_3 = 1$, $x_4 = -2$ and $x_5 = 9$. Therefore:

$$\sum_{i=1}^{4} x_i^2 = 4^2 + 5^2 + 1^2 + (-2)^2 = 46$$

and:

$$\sum_{i=4}^{5} x_i(x_i - 2) = \sum_{i=4}^{5} (x_i^2 - 2x_i) = ((-2)^2 - 2 \times -2) + (9^2 - 2 \times 9) = 71$$

remembering to use BODMAS in the second example.

¹ Throughout this half-course, $n$ will denote a sample size.
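For readers who like to verify such calculations with a computer, the sums in Example 2.2 and the arithmetic mean translate directly into code. A minimal Python sketch, illustrative only (the course itself requires nothing beyond a basic calculator):

```python
# The observations x_1, ..., x_5 from Example 2.2.
x = [4, 5, 1, -2, 9]

# (a) sum of x_i^2 for i = 1 to 4 (remember x_i is x[i - 1] in code)
a = sum(x[i - 1]**2 for i in range(1, 5))

# (b) sum of x_i * (x_i - 2) for i = 4 to 5
b = sum(x[i - 1] * (x[i - 1] - 2) for i in range(4, 6))

# The arithmetic mean from Chapter 4: (1/n) * (sum of all x_i).
mean = sum(x) / len(x)

print(a, b, mean)   # 46 71 3.4
```

Notice that `range(1, 5)` runs over i = 1, 2, 3, 4: the upper argument is exclusive, mirroring the care needed with the upper bound written above $\sum$.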
Activity 2.1 A dataset contains the observations 1, 1, 1, 2, 4, 8, 9 (so here, $n = 7$). Find:

(a) $\sum_{i=1}^{n} 2x_i$
(b) $\sum_{i=1}^{n} x_i^2$
(c) $\sum_{i=1}^{n} (x_i - 2)$
(d) $\sum_{i=1}^{n} (x_i - 2)^2$
(e) $\left(\sum_{i=1}^{n} x_i\right)^2$
(f) $\sum_{i=1}^{n} 2$.

Solution

We have that $x_1 = 1$, $x_2 = 1$, $x_3 = 1$, $x_4 = 2$, $x_5 = 4$, $x_6 = 8$ and $x_7 = 9$.

(a) We have:
$$\sum_{i=1}^{n} 2x_i = 2x_1 + 2x_2 + \cdots + 2x_7 = 2 \times (x_1 + x_2 + \cdots + x_7) = 2 \times (1 + 1 + 1 + 2 + 4 + 8 + 9) = 52.$$

(b) We have:
$$\sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \cdots + x_7^2 = 1^2 + 1^2 + 1^2 + 2^2 + 4^2 + 8^2 + 9^2 = 168.$$

(c) We have:
$$\sum_{i=1}^{n} (x_i - 2) = (x_1 - 2) + (x_2 - 2) + \cdots + (x_7 - 2) = (1 - 2) + (1 - 2) + (1 - 2) + (2 - 2) + (4 - 2) + (8 - 2) + (9 - 2) = 12.$$

(d) We have, extending (c):
$$\sum_{i=1}^{n} (x_i - 2)^2 = (1 - 2)^2 + (1 - 2)^2 + (1 - 2)^2 + (2 - 2)^2 + (4 - 2)^2 + (8 - 2)^2 + (9 - 2)^2 = 92.$$

(e) We have:
$$\left(\sum_{i=1}^{n} x_i\right)^2 = (x_1 + x_2 + \cdots + x_7)^2 = (1 + 1 + 1 + 2 + 4 + 8 + 9)^2 = (26)^2 = 676.$$

Note that:
$$\left(\sum_{i=1}^{n} x_i\right)^2 \neq \sum_{i=1}^{n} x_i^2.$$

(f) We have:
$$\sum_{i=1}^{n} 2 = 2 + 2 + 2 + 2 + 2 + 2 + 2 = 14.$$

Activity 2.2 Suppose that $x_1 = 7$, $x_2 = 3$, $x_3 = 1$, $x_4 = 0$, $x_5 = -6$, and $y_1 = -3$, $y_2 = 5$, $y_3 = -8$, $y_4 = 9$, $y_5 = 1$. Calculate the following quantities:

(a) $\sum_{i=2}^{4} 2y_i$
(b) $\sum_{i=1}^{3} 4(x_i - 1)$
(c) $y_1^2 + \sum_{i=3}^{5} (x_i^2 + 2y_i^2)$.

Solution

(a) We have:
$$\sum_{i=2}^{4} 2y_i = 2 \times (y_2 + y_3 + y_4) = 2 \times (5 - 8 + 9) = 12.$$

(b) We have:
$$\sum_{i=1}^{3} 4(x_i - 1) = 4 \times \sum_{i=1}^{3} (x_i - 1) = 4 \times ((x_1 - 1) + (x_2 - 1) + (x_3 - 1)) = 4 \times ((7 - 1) + (3 - 1) + (1 - 1)) = 4 \times (6 + 2 + 0) = 32.$$

(c) We have:
$$y_1^2 + \sum_{i=3}^{5} (x_i^2 + 2y_i^2) = (-3)^2 + (1^2 + 2 \times (-8)^2) + (0^2 + 2 \times 9^2) + ((-6)^2 + 2 \times 1^2) = 9 + 129 + 162 + 38 = 338.$$

Activity 2.3 Suppose that $x_1 = -0.5$, $x_2 = 2.5$, $x_3 = -2.8$, $x_4 = 0.4$, $x_5 = 6.1$, and $y_1 = -0.5$, $y_2 = 4.0$, $y_3 = 4.6$, $y_4 = -2.0$, $y_5 = 0$.
Calculate the following quantities:

(a) $\sum_{i=3}^{5} x_i^2$
(b) $\sum_{i=1}^{2} \frac{1}{x_i y_i}$
(c) $y_4^3 + \sum_{i=4}^{5} \frac{y_i^2}{x_i}$.

Solution

(a) We have:
$$\sum_{i=3}^{5} x_i^2 = x_3^2 + x_4^2 + x_5^2 = (-2.8)^2 + (0.4)^2 + (6.1)^2 = 7.84 + 0.16 + 37.21 = 45.21.$$

(b) We have:
$$\sum_{i=1}^{2} \frac{1}{x_i y_i} = \frac{1}{x_1 y_1} + \frac{1}{x_2 y_2} = \frac{1}{(-0.5) \times (-0.5)} + \frac{1}{2.5 \times 4.0} = 4 + 0.1 = 4.1.$$

(c) We have:
$$y_4^3 + \sum_{i=4}^{5} \frac{y_i^2}{x_i} = y_4^3 + \frac{y_4^2}{x_4} + \frac{y_5^2}{x_5} = (-2.0)^3 + \frac{(-2.0)^2}{0.4} + \frac{0^2}{6.1} = -8 + 10 + 0 = 2.$$

Activity 2.4 Explain why:
$$\sum_{i=1}^{n} x_i^2 \neq \left(\sum_{i=1}^{n} x_i\right)^2 \quad \text{and} \quad \sum_{i=1}^{n} x_i y_i \neq \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i.$$

Solution

Writing out the full summations, we obtain:
$$\sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \cdots + x_n^2 \neq (x_1 + x_2 + \cdots + x_n)^2 = \left(\sum_{i=1}^{n} x_i\right)^2.$$

Therefore, the 'sum of squares' is, in general, not equal to the 'square of the sum' because the expansion of the quadratic gives:
$$(x_1 + x_2 + \cdots + x_n)^2 = x_1^2 + x_2^2 + \cdots + x_n^2 + 2x_1 x_2 + 2x_1 x_3 + \cdots$$
so, unless all the cross-product terms sum to zero, the two expressions are not the same. Hence, in general, the two expressions are different.

Similarly:
$$\sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n \neq (x_1 + x_2 + \cdots + x_n)(y_1 + y_2 + \cdots + y_n) = \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i.$$

Therefore, the 'sum of the products' is, in general, not equal to the 'product of the sums' because the expansion of the products gives:
$$(x_1 + x_2 + \cdots + x_n)(y_1 + y_2 + \cdots + y_n) = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n + x_1 y_2 + x_1 y_3 + \cdots.$$

Note the first case is a special case of the second if we set $y_i = x_i$.

2.10 Graphs

In Chapter 4 you will spend some time learning how to present material in graphical form, and also in the representation of the normal distribution in Chapter 6. You should make sure you have understood the following material. If you are taking MT105a Mathematics 1, you will need and use these ideas. If you are not, you are encouraged to read up on and practise examples in Anthony and Biggs.
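Before moving on to graphs, the inequalities discussed in Activity 2.4 are easy to confirm numerically. A small illustrative Python sketch (the $y$ values chosen here are arbitrary, invented purely for the demonstration):

```python
# Checking that 'sum of squares' != 'square of the sum' and
# 'sum of products' != 'product of the sums', using the data of
# Activity 2.1 for x and arbitrary made-up values for y.
x = [1, 1, 1, 2, 4, 8, 9]
y = [-3, 5, -8, 9, 1, 2, 0]

sum_of_squares = sum(xi**2 for xi in x)   # 168
square_of_sum = sum(x)**2                 # 26^2 = 676
assert sum_of_squares != square_of_sum

sum_of_products = sum(xi * yi for xi, yi in zip(x, y))
product_of_sums = sum(x) * sum(y)
assert sum_of_products != product_of_sums

print(sum_of_squares, square_of_sum, sum_of_products, product_of_sums)
```

One numerical counterexample is enough to show the general identities fail, although (as the solution explains) the two sides can coincide in special cases where the cross-product terms sum to zero.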
When a variable $y$ depends on another variable $x$, we can represent the relationship mathematically using functions. In general we write this as $y = f(x)$, where $f$ is the rule which allows us to determine the value of $y$ when we input the value of $x$. Graphs are diagrammatic representations of such relationships, using coordinates and axes. The graph of a function $y = f(x)$ is the set of all points in the plane of the form $(x, f(x))$. Sketches of graphs can be very useful. To sketch a graph, we begin with the x-axis and y-axis as shown in Figure 2.1.

Figure 2.1: Graph axes.

We then plot all points of the form $(x, f(x))$. Therefore, at $x$ units from the origin (the point where the axes cross), we plot a point whose height above the x-axis (that is, whose y-coordinate) is $f(x)$, as shown in Figure 2.2. Joining all points together of the form $(x, f(x))$ results in a curve (or sometimes a straight line), which is called the graph of $f(x)$. A typical curve might look like that shown in Figure 2.3.

Figure 2.2: Example of a plotted coordinate.

Figure 2.3: The graph of a generic function, $y = f(x)$.

Figure 2.4: A sketch of the linear function $y = 3 + 2x$.

Figure 2.5: A sketch of the linear function $y = 2 - x$.

However, you should not imagine that the correct way to sketch a graph is to plot a few points of the form $(x, f(x))$ and join them up – this approach rarely works well in practice and more sophisticated techniques are needed. There are two function types which you need to know about for this half-course:

linear functions (i.e. the graph of a straight line, see below)
normal functions (which we shall meet frequently in later chapters).

2.11 The graph of a linear function

Linear functions are those of the form $f(x) = a + bx$ and their graphs are straight lines which are characterised by a gradient (or slope), $b$, and a y-intercept (where $x = 0$) at the point $(0, a)$.
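The role of the intercept $a$ and slope $b$ can be illustrated by evaluating a linear function at a few values of $x$. A minimal Python sketch for the line $y = 3 + 2x$ shown in Figure 2.4 (illustrative only; sketching by hand is what the course asks of you):

```python
# Evaluating the linear function f(x) = a + bx for the line y = 3 + 2x
# (intercept a = 3, slope b = 2).
def f(x, a=3, b=2):
    """Return the value of the linear function y = a + bx."""
    return a + b * x

y_intercept = f(0)      # the point (0, a): the line crosses the y-axis at 3
x_intercept = -3 / 2    # solving 0 = 3 + 2x gives x = -1.5

print(y_intercept, f(x_intercept), f(2))   # 3 0.0 7
```

The slope $b = 2$ means each unit increase in $x$ raises $y$ by 2; indeed $f(2) - f(1) = 7 - 5 = 2$.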
A sketch of the function $y = 3 + 2x$ is provided in Figure 2.4, and the function $y = 2 - x$ is shown in Figure 2.5.

2.12 Summary

Much of this material should be familiar to you, but some may be new. Although it is only a language or set of rules to help you deal with statistics, without it you will not be able to make sense of the following chapters. Before you continue, make sure you have completed all the learning activities below, understood what you have done and, if necessary, worked through additional examples in the basic textbooks.

2.13 Key terms and concepts

Absolute value
BODMAS
Fractions
Inequalities
Linear function
Percentages
Power
Square root
Summation operator

2.14 Learning activities

Solutions can be found in Appendix A.

1. Work out the following:
(a) $(2 + 4) \times (3 + 7)$
(b) $1/3$ of $12 - 4 \div 2$
(c) $(1 + 4)/5 \times (100 - 98)$.
If you find these difficult at all, go to Anthony and Biggs, or your old school textbook, and work on some more examples before you do anything else.

2. Work out the following (use a calculator where necessary):
(a) $\sqrt{16}$
(b) $(0.07)^2$
(c) $\sqrt{0.49}$.

3. (a) What is 98% of 200?
(b) Express 17/25 as a percentage.
(c) What is 25% of 98/144?

4. Provide the absolute values for:
(a) $|-8|$
(b) $|15 - 9|$.

5. (a) For which of 2, 3, 7 and 9 is $x > 3$?
(b) For which of 2, 3, 7 and 9 is $x < 3$?
(c) For which of 2, 3, 7 and 9 is $x \leq 3$?
(d) For which of 2, 3, 7 and 9 is $x^2 \geq 49$?

6. Suppose $x_1 = 4$, $x_2 = 1$ and $x_3 = 2$. For these figures, compute $\sum_{i=1}^{2} x_i^3$.

7. If $n = 4$, $x_1 = 2$, $x_2 = 3$, $x_3 = 5$ and $x_4 = 7$, compute:
(a) $\sum_{i=1}^{3} x_i$
(b) $\frac{1}{n} \sum_{i=1}^{4} x_i^2$.

8. Given $x_1 = 3$, $x_2 = 1$, $x_3 = 4$, $x_4 = 6$ and $x_5 = 8$, find:
(a) $\sum_{i=1}^{5} x_i$
(b) $\sum_{i=3}^{4} x_i^2$.
Given also that $p_1 = 1/4$, $p_2 = 1/8$, $p_3 = 1/8$, $p_4 = 1/3$ and $p_5 = 1/6$, find:
(c) $\sum_{i=1}^{5} p_i x_i$
(d) $\sum_{i=3}^{5} p_i x_i^2$.
If you find these difficult, go back to an elementary textbook and do some more work on this.
It is most important that you deal with this before you embark on the topic of descriptive statistics, such as means, in Chapter 4.

9. Sketch the following linear functions:
(a) y = 3 + x
(b) y = −2 + 3x.

You are going to need equations like these for all the material on regression in Chapter 12.

2.15 A reminder of your learning outcomes

After completing this chapter, and having completed the recommended reading and activities, you should be able to:

• manipulate arithmetic and algebraic expressions using the simple rules
• recall and use common signs: square, square root, ‘greater than’, ‘less than’ and absolute value
• demonstrate use of the summation sign and explain the use of the ‘i’, or index, of x
• draw the straight line for a linear function.

2.16 Sample examination questions

Solutions can be found in Appendix B.

1. Suppose that $x_1 = -0.2$, $x_2 = 2.5$, $x_3 = -3.7$, $x_4 = 0.8$, $x_5 = 7.4$, and $y_1 = -0.2$, $y_2 = 8.0$, $y_3 = 3.9$, $y_4 = -2.0$, $y_5 = 0$. Calculate the following quantities:
(a) $\sum_{i=3}^{5} x_i^2$
(b) $\sum_{i=1}^{2} \frac{1}{x_i y_i}$
(c) $y_4^3 + \sum_{i=4}^{5} \frac{y_i^2}{x_i}$.

2. Suppose that $y_1 = -2$, $y_2 = -5$, $y_3 = 1$, $y_4 = 16$, $y_5 = 10$, and $z_1 = 8$, $z_2 = -5$, $z_3 = 6$, $z_4 = 4$, $z_5 = 10$. Calculate the following quantities:
(a) $\sum_{i=1}^{3} z_i^2$
(b) $\sum_{i=4}^{5} \sqrt{y_i z_i}$
(c) $z_4^2 + \sum_{i=1}^{3} \frac{1}{y_i}$.

Chapter 3
The nature of statistics

3.1 Aims

This chapter gives a general introduction to some of the statistical ideas which you will be learning about in this course. It should also enable you to link different parts of the syllabus and see their relevance to each other and also to the other courses you are studying. You should aim to:

• have an overview of the ideas to be introduced in later chapters
• connect the ideas of sampling to real-life problems
• be able to decide to take topics in a different sequence from the subject guide where this is more appropriate for your overall study plan.
3.2 Learning outcomes

After completing this chapter, and having completed the essential reading and activities, you should be able to:

• explain the difference between a population, a sample and a census
• discuss the idea of systematic bias
• describe simple random sampling and justify its use
• explain the principle of random sampling
• identify what the parameter(s) of a function is (are)

and you should then be ready to start studying statistics!

It is rare for examination questions to focus solely on the topics covered in this chapter. However, a good understanding of the ideas covered is essential when answering the sample examination questions given in Chapters 10 and 11 of the subject guide.

3.3 Essential reading

Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060] Chapter 1 and Section 6.1.

3.4 Further reading

Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN 9780471615170] Chapter 1.

3.5 Introduction

This chapter sets the framework for the subject guide. Read it carefully, because the ideas introduced are fundamental to this subject and, although superficially easy, give rise to profound difficulties both of principle and of methodology. It is important that you think about these ideas as well as being thoroughly familiar with the techniques explained in Chapter 2 before you begin work on later chapters.

3.6 Coverage of the course

ST104a Statistics 1 is divided into six main sections:

1. Data visualisation and descriptive statistics (Chapter 4). This covers: Ways of summarising data; measures of location and dispersion; graphical presentation.

2. Probability theory (Chapter 5). This covers: Probability; probability trees.

3. The normal distribution and ideas of sampling (Chapter 6).
This covers: Definitions of random variables; expectations and variances; normal distribution; central limit theorem.

4. Decision analysis (Chapters 7, 8 and 9). This covers: Estimation and confidence intervals for means, proportions and differences; Student’s t distribution; hypothesis testing; Type I and Type II errors; contingency tables.

5. Sampling design (Chapter 10). This covers: Key features of different data sampling techniques.

6. Modelling for decision-making (Chapters 11 and 12). This covers: Measures of correlation; spurious correlation; model selection; model testing.

In the presence of a real-world problem, the aim is to develop a quantitative framework for decision-making under uncertainty. The first step is to conduct an experiment and collect data on which to base the decisions (Chapter 10). The second step is to explore the data (Chapter 4) and the third step is to assign good probability models (Chapters 5 and 6). The final step is the task of statistical inference (or else decision analysis) and this uses the material of Chapters 7, 8 and 9, as well as Chapters 11 and 12.

3.7 Terminology

Before progressing to Chapter 4, it is worth spending time familiarising yourself with some important terminology and ideas which you will meet later on in the half-course (and also in ST104b Statistics 2, for those studying that as well). In brief, statistics is a mathematical science concerned with analysing and interpreting data. Data collection is not a zero-cost exercise – costs, in addition to effort, include both time and money and, in the real world, time and budget resources are finite! (If we did perform a total enumeration of the population, then this is called a census.) Who might collect data? In principle anyone – governments, corporations, individuals etc. As is so often the case in statistics, some words have technical meanings which overlap with their common everyday use but are not the same.
It is important that you are aware of these so that you can read and use them correctly.

3.7.1 Population

Population is one such word. Most people think of the word as being about the number of people living in a country. Statisticians, however, use it more generally to refer to the collection of items¹ which we would like to study. So, for example:

• if we were studying children aged two to five years old we would refer to them as the study ‘population’ and not consider older or younger children, or their parents
• if we were making an analysis of toy retailers in a country, we might look at the ‘population’ of toy shops.

Often, it is not feasible to collect full information on the entire population due to the size of many populations² (and the time and budget constraints described above). Therefore, we collect a sample drawn from the population (see below). It can sometimes be difficult to decide which population should be sampled. For instance, if we wished to sample n listeners of a radio station specialising in music, should the population be of listeners of that radio station in general, or of listeners to that station’s classical music programme, or perhaps just regular listeners, or any one of many other possible populations which you can construct for yourself? In practice, the population is often chosen by finding one which is easy to sample from, and which may not be the population of first choice – that is, our survey population may differ from our target population (of course, in an ideal world the two would be the same).

¹ Note ‘items’ need not be restricted to people – researchers may wish to study a variety of different populations: fish, countries, dictators, businesses, alumni, criminals, politicians etc.
² Indeed, the population size may well be finite, but unknown. For example, how many fish live in the sea?
In medical trials (which are an important statistical application), the population may be those patients who arrive for treatment at the hospital carrying out the trial, and this may be very different from one hospital to another. If you look at any collection of official statistics (which are most important for public policymaking) you will be struck by the great attention which is given to defining the population which is surveyed or sampled, and to the definition of terms. For instance, it has proved difficult to get consensus on the meaning of ‘unemployed’ in recent years, but a statistician must be prepared to investigate the population of unemployed persons. Think about this carefully; it is particularly important to those of you studying sociology or marketing.

3.7.2 Bias

Besides the common-sense meaning of bias, in statistics there is a more technical meaning.³ Intuitively, bias is something we should aim to minimise (ideally we would wish to eliminate it entirely). A full discussion of bias will be given later in the subject guide, but for now remember that the term systematic bias refers to an estimation methodology which results in systematic errors, i.e. consistently over-estimating or under-estimating some population parameter (discussed shortly). It would be difficult to justify a methodology which was biased, hence sample survey methodology places great emphasis on unbiased surveys and unbiased estimators.

3.7.3 Parameters

At this point it is useful to think a little about parameters. We can define a parameter (or set of parameters) as a measure (or set of measures) which completely describes a function. For example, in the linear equation y = a + bx, the parameters⁴ are a and b which completely determine the position of a straight line in a graph (where a denotes the y-intercept and b denotes the slope). Hence as a, b, or both vary, then we obtain different lines.
Consider another application: you are asked to draw a circle. How do you do this – what things do you need to know in order to draw it? A little thought makes us realise that, of course, you need to know the centre of the circle. You also need to know how large it is – we need to know its radius. If we only knew the centre we could draw several different, but concentric, circles; if we knew the radius, but not the centre, then we could again draw several circles each of the same size, but with different positions in space. There is, however, only one circle which has a given centre and radius. This very interesting idea of finding a minimum number of properties (attributes) to define a shape was first described by the Greek mathematical philosophers who gave them their names.

In Chapter 6 you will meet the normal distribution. One of the reasons statisticians like to use the normal distribution is that, despite its complicated functional form, it is completely determined by two parameters: its mean, μ, and variance, σ².⁵ If we know these two parameter values, we can draw a unique normal curve. As you will discover, this is extremely useful for working out probabilities and confidence intervals, and for testing hypotheses (part of statistical inference – covered in Chapters 7 and 8).

³ It is covered in more technical detail in ST104b Statistics 2 when exploring the properties of estimators.
⁴ Mathematical textbooks will typically use a and b as the parameters of a line; however, in statistics we typically use α and β for the y-intercept and slope, respectively. The linear function is then written as y = α + βx. Of interest later on will be the estimation of these parameters when their values are unknown – this is achieved (for a line) in regression (Chapter 12).
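To see why μ and σ² are all that is needed, note that the normal density formula contains no other unknowns: fix the two parameters and every point on the curve is determined. A sketch using the standard formula (σ² written as `sigma2`):

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2): completely determined by the two parameters."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Same x, different parameter values give different (unique) curves:
print(normal_pdf(0, 0, 1))  # the standard normal evaluated at its mean
print(normal_pdf(0, 1, 4))  # a shifted, flatter curve evaluated at the same point
```

Changing μ slides the curve along the x-axis and changing σ² stretches it, just as the centre and radius determine a circle.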
3.7.4 Sampling

As discussed above, a census is not usually a feasible option; however, we may wish to know the value of some attribute of a particular population (specifically, the value of a population parameter). In the absence of population data, we resort to drawing a sample from the population of interest, and use the sample observations (our data) to estimate a certain parameter when its true value is unknown (due to the absence of population-level data).⁶

Of course, although sampling avoids the costs associated with a census, it is possible to make an argument against the practice. One could argue that since some members of the population will be excluded from the sample, then the process is inherently undemocratic. However, although the omission of some population members will likely give rise to sampling error (the difference between the estimate of a parameter and its true value), sampling might well be more accurate, since more time can be spent verifying the sample information collected.

The key aim of sampling is to select a representative sample from the population. Chapter 10 will explore various sampling techniques in depth. A brief summary of the main themes is outlined below.

Random sampling

In order to collect a random sample, the researcher needs a sampling frame – a list of all population members. For example, if you want to investigate UK secondary schools, the sampling frame would be a list of all the secondary schools in the UK. The key advantages of random sampling are:

• it avoids systematic bias
• it allows an assessment of the size of the sampling error.

Access to a sampling frame allows a known, non-zero probability of selection to be attached to each population unit. This is the main feature of a random sample, hence the collection of random sampling techniques is known as probability sampling. A simple random sample is a special case where each population unit has an equal, known, non-zero probability of selection.
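As an illustration, drawing a simple random sample from a sampling frame takes one line in Python. The frame below is hypothetical (invented unit names, not from the text); `random.sample` selects without replacement, giving each unit an equal chance:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical sampling frame: a list of all N population units
# (e.g. every secondary school in the country, as in the text's example).
sampling_frame = [f"school_{i:04d}" for i in range(1, 501)]  # N = 500

n = 10
sample = random.sample(sampling_frame, n)  # simple random sample of size n

print(sample)
```

Note the method only works because the complete frame is available: without a list of all N units, no selection probabilities can be attached.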
The two most common kinds of non-simple random samples are:

• stratified sampling, where the precision of estimates should be improved if the stratification factors are appropriate (i.e. we divide the sampling frame by stratification factors such as gender or age group, and take a simple random sample from each group)

• cluster sampling, where precision may not be better, but time and money may be saved on interview surveys involving travel to the site of the interview by only selecting individuals from some clusters.

Randomisation (the process by which a random sample is selected) is popular since it avoids the biases caused by the prejudices of the person taking the sample. Of course, just by chance, a random sample may have the appearance of extreme bias – for example, drawing from a population of men and women produces a sample containing men or women only. Such difficulties are usually resolved with some suitable form of restricted randomisation. One might for instance, in this example, use a stratified random sample (by gender) and select half the sample from all men and the other half from all women.

Random sampling is carried out, for many populations, by choosing at random without replacement⁷ a subset of the very large number of population units, N.⁸ If the sample (of size n) is a sufficiently small proportion of the population (i.e. n/N is sufficiently small), then there is little-to-no practical difference between the inferences possible using sampling with replacement and sampling without replacement.

⁵ Concepts formally introduced in Chapter 4.
⁶ This is what we term statistical inference, i.e. inferring ‘things’ (such as parameter values) about a population based on sample data.

3.8 Summary

This chapter has laid out the coverage of ST104a Statistics 1 detailing the six main sections of the syllabus.
Many frequently-cited terms in statistics have been introduced, providing an introductory overview of the issue of sampling – specifically, probability sampling techniques. These will be explored in much greater depth in Chapter 10. Also, the concept of parameters was introduced, which is extremely important since statistical inference (covered in Chapters 7, 8, 9 and 12) is concerned with the estimation and testing of unknown parameters.

3.9 Key terms and concepts

Census, Cluster sampling, Parameter, Population, Sample, Sampling error, Sampling frame, Simple random sample, Stratified sampling, Systematic bias

⁷ Sampling without replacement means that once a population unit is selected it is permanently removed from the remaining population, which ensures no population unit is repeatedly observed.
⁸ That is, N denotes the population size.

3.10 Learning activities

Solutions can be found in Appendix A.

1. What parameters could you find for a circle?

2. Explain the difference between a simple random sample and a random sample. Name two kinds of non-simple random samples.

3.11 A reminder of your learning outcomes

After completing this chapter, and having completed the essential reading and activities, you should be able to:

• explain the difference between a population, a sample and a census
• discuss the idea of systematic bias
• describe simple random sampling and justify its use
• explain the principle of random sampling
• identify what the parameter(s) of a function is (are).

3.12 Self assessment

This chapter is an introduction to ideas you will meet again later. Check that you understand the content of the syllabus. If you want to study the ideas of sampling and causation in greater detail now before studying the theoretical statistical concepts in the next chapters, you could go straight to Chapters 10 and 11, before returning to Chapter 4.
Chapter 4
Data visualisation and descriptive statistics

4.1 Aims

This chapter contains two separate but related themes, both to do with the understanding of data. These are:

• to find graphical representations for data which allow us to see their most important characteristics
• to calculate simple numbers, such as the mean or standard deviation, which will summarise those characteristics.

4.2 Learning outcomes

After completing this chapter, and having completed the essential reading and activities, you should be able to:

• calculate the following: arithmetic mean, median, mode, standard deviation, variance, quartiles, range and interquartile range
• explain the use and limitations of the above quantities
• draw and interpret histograms, stem-and-leaf diagrams, boxplots and cumulative frequency distributions
• incorporate labels and titles correctly in your diagrams and state the units which you have used.

In summary, you should be able to use appropriate measures and diagrams in order to explain and clarify data which you have collected or which are presented to you.

4.3 Essential reading

Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060] Sections 1.3, 1.5 and 2.1–2.3.

In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials accessible via the ST104a Statistics 1 area at https://emfss.elearning.london.ac.uk

4.4 Further reading

Aczel, A.D. Complete Business Statistics (London: McGraw–Hill Higher Education, 2009) seventh edition [ISBN 9780071287531] Chapter 1.

Anderson, D.R., D.J. Sweeney, T.A. Williams, J. Freeman and E. Shoesmith Statistics for Business and Economics (South-Western Cengage Learning, 2010) eleventh edition [ISBN 9780324783247] Chapters 1–3.

Lind, D.A., W.G. Marchal and S.A.
Wathen Statistical Techniques in Business and Economics (Boston: McGraw–Hill Higher Education, 2009) fourteenth edition [ISBN 978007110004] Chapters 3 and 4.

Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN 9780471615170] Chapter 2.

4.5 Introduction

Both themes considered in this chapter (data visualisation and descriptive statistics) could be applied to population data, but in most cases (namely here) they are applied to a sample. The notation would change slightly if a population was being represented. Most visual representations are very tedious to construct in practice without the aid of a computer. However, you will understand much more if you try a few by hand (as is commonly asked in examinations). You should also be aware that spreadsheets do not always use correct terminology when discussing and labelling graphs. It is important, once again, to go over this material slowly and make sure you have mastered the basic statistical definitions introduced here before you proceed to more theoretical ideas. Make sure that, once you have completed the chapter and its activities, you make time to practise the recommended questions from Newbold et al.

4.6 Types of variable

Many of the questions for which people use statistics to help them understand and make decisions involve types of variables which can be measured. Obvious examples include height, weight, temperature, lifespan, rate of inflation, and so on. When we are dealing with such a variable – for which there is a generally recognised method of determining its value – we say that it is a measurable variable. The numbers which we then obtain come ready-equipped with an ordered relation, i.e. we can always tell if two measurements are equal (to the available accuracy) or if one is greater or less than the other. Data¹ are obtained on any desired variable.
For most of this half-course, we will be dealing with variables which can be partitioned into two types.

¹ Note that the word ‘data’ is plural, but is very often used as if it was singular. You will probably see both forms written in textbooks.

Types of variable

1. Discrete data: outcomes you can count. Examples include the number of passengers on a flight and telephone calls received each day in a call centre. Observed values for these will be 0, 1, 2, . . . (i.e. non-negative integers).

2. Continuous data: outcomes you can measure. Examples include height, weight and time, all of which can be measured to several decimal places.

Of course, before we do any sort of data analysis, we need to collect data. Chapter 10 will discuss a range of different techniques which can be employed to obtain a sample. For now, we just consider some examples of situations where data might be collected, such as a:

• pre-election opinion poll asks 1,000 people about their voting intentions
• market research survey asks homemakers how many hours of television they watch per week
• census² interviewer asks each householder how many of their children are receiving full-time education.

4.6.1 Categorical variables

A polling organisation might be asked to determine whether, say, the political preferences of voters were in some way linked to their graduate status: for example, do supporters of Party X tend to be graduates? Other market research organisations might be hired to determine whether or not users were satisfied with the service which they obtained from a commercial organisation (a restaurant, say) or a department of local or central government (housing departments being one important example). This means that we are concerned, from time to time, with categorical variables in addition to measurable variables. So we can count the frequencies with which an item belongs to particular categories. Examples include:

(a) The number of graduates who support Party X.
(b) The total number of graduates (in a sample).

(c) The number of Party X supporters who are graduates.

(d) The total number of Party X supporters (in a sample).

(e) The number of diners at a restaurant who were dissatisfied/indifferent/satisfied with the service.

In cases (b) and (d) we are doing simple counts, within a sample, of a single category, while in cases (a) and (c) we are looking at some kind of cross-tabulation between variables in two categories: graduate status vs. political preferences in (a), and political preferences vs. graduate status in (c) – they are not the same!

² Recall that a census is the total enumeration of a population, hence this would not be a sample!

There is no obvious and generally recognised way of putting political preferences or graduate status in order (in the way that we can certainly say that 1 < 2). It is similarly impossible to rank (as the technical term has it) many other categories of interest: in combatting discrimination against people, for instance, organisations might want to look at the effects of gender, religion, nationality, sexual orientation, disability etc. but the whole point of combatting discrimination is that different levels of each category cannot be ranked. In case (e), by contrast, there is a clear ranking: the restaurant would be pleased if there were lots of people who expressed themselves satisfied rather than dissatisfied. Such considerations lead us to distinguish between two main types of variable, the second of which is itself subdivided.

Classification of variables

• Measurable variables, where there is a generally recognised method of measuring the value of the variable of interest.
• Categorical variables, where no such method exists (or, often enough, is even possible), but among which:
  • some examples of categorical variables can be put in some sensible order (case (e)), and hence are called ordinal (categorical) variables
  • some examples of categorical variables cannot be put in any sensible order, but are only known by their names, and hence are called nominal (categorical) variables.

4.7 Data visualisation

Datasets consist of potentially vast amounts of data. Hedge funds, for example, have access to very large databases of historical price information on a range of financial assets, such as so-called ‘tick data’ – very high-frequency intra-day data. Of course, the human brain cannot easily make sense of such large quantities of numbers when presented with them on a screen. However, the human brain can cope with visual representations of data. By producing various plots, we can instantly ‘eyeball’ the data to get a bird’s-eye view of the dataset. So, at a glance, we can quickly get a feel for the data and determine whether there are any interesting features, relationships etc. which could then be examined in greater depth.³

To summarise, plots are a great medium for communicating the salient features of a dataset to a wide audience. The main representations we use in ST104a Statistics 1 are histograms, stem-and-leaf diagrams and boxplots.⁴ We also use scatterplots for two measurable variables (covered in Chapter 12).

³ For those studying ST104b Statistics 2, in modelling we often make distributional assumptions, and a suitable variable plot allows us to easily check the feasibility of a particular distribution.
⁴ There are many other representations available from software packages, in particular pie charts and
standard bar charts which are appropriate when dealing with categorical data, although these will not be considered further in this half-course.

4.7.1 Presentational traps

Before we see our first graphical representation you should be aware, when reading articles in newspapers, magazines and even academic journals, that it is easy to mislead the reader by careless or poorly-defined diagrams. Hence presenting data effectively with diagrams requires careful planning.

A good diagram:
• provides a clear summary of the data
• is a fair and honest representation
• highlights underlying patterns
• allows the extraction of a lot of information quickly.

A bad diagram:
• confuses the viewer
• misleads (either accidentally or intentionally).

Advertisers and politicians are notorious for ‘spinning’ data for their own objectives!

4.7.2 Dot plot

The simplicity of a dot plot makes it an ideal starting point to think about the concept of a distribution. For small datasets, this type of plot is very effective for seeing the data’s underlying distribution. We use the following procedure.

1. Obtain the range of the dataset (the values spanned by the data), and draw a horizontal line to accommodate this range.

2. Place dots (hence the name ‘dot plot’!) corresponding to the values above the line, resulting in the empirical distribution.

Example 4.1 Hourly wage rates (in £) for clerical assistants:

8.20 7.60 7.50 7.90 7.50 7.80 7.70 7.60 7.60 7.60 7.70 8.10 8.10
7.80 7.80 7.70 7.90 7.60 7.80 7.70 7.90 8.00 8.10 8.20 7.50 8.00

[Dot plot of the hourly wage rates, with one dot per observation above each value.]

Instantly, some interesting features emerge from the dot plot which are not immediately obvious from the raw data. For example, most clerical assistants earn less than £8 per hour and nobody (in the sample) earns more than £8.20.
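For those who like to check by computer, a text version of a dot plot takes only a few lines. A sketch using the Example 4.1 wage rates as transcribed above:

```python
from collections import Counter

# Hourly wage rates (in £) from Example 4.1, as transcribed above
rates = [8.20, 7.60, 7.50, 7.90, 7.50, 7.80, 7.70, 7.60, 7.60, 7.60,
         7.70, 8.10, 8.10, 7.80, 7.80, 7.70, 7.90, 7.60, 7.80, 7.70,
         7.90, 8.00, 8.10, 8.20, 7.50, 8.00]

counts = Counter(rates)
for value in sorted(counts):
    # one dot per observation above each distinct value
    print(f"{value:.2f}  " + "•" * counts[value])
```

Printing the dots horizontally rather than vertically does not change the empirical distribution the plot reveals.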
Activity 4.1 A survey has been conducted on the wage rates paid to clerical assistants in twenty companies in your locality in order to make proposals for a review of wages in your own firm. The results of this survey are as follows:

Hourly rate in £
8.50 8.20 7.80 8.00 8.10 7.90 7.90 8.40 8.40 8.00
8.10 7.90 7.90 8.00 8.00 8.10 7.80 8.20 7.90 8.30

Construct a dot plot of these data.

Solution

A good idea is to first list the data in increasing order:

7.80 7.80 7.90 7.90 7.90 7.90 7.90 8.00 8.00 8.00
8.00 8.10 8.10 8.10 8.20 8.20 8.30 8.40 8.40 8.50

[Dot plot of the wage rates, with the axis labelled at £8.00 and £8.50.]

4.7.3 Histogram

Histograms are excellent diagrams to use when we want to depict the frequency distribution of (discrete or continuous) variables. To construct histograms, data are first organised into a table which arranges the data into class intervals (also called bins) – subdivisions of the total range of values which the variable takes. To each class interval, the corresponding frequency is determined, that is the number of observations of the variable which falls in each class interval.

Example 4.2 Weekly production output⁵ of a factory over a 50-week period. (You can choose what the manufactured good is!)

360 385 458 362 383 381 453 365 368 393
451 365 365 392 444 372 371 395 445 364
354 410 382 396 360 437 379 396 359 420
372 398 358 456 363 402 354 441 350 406
450 469 467 460 454 438 358 362 404 459

In principle the bins do not have to be of equal width, but for simplicity we use equal bin widths whenever possible (the reason why will become clear below). A word of warning: recall our objective is to represent the distribution of the data. As such, too many bins will dilute the distribution, while too few will concentrate it (using technical jargon, will degenerate the distribution). Either way, the pattern of the distribution will be lost – defeating the purpose of the histogram.
As a guide, six or seven bins should be sufficient, but remember to exercise common sense!

Class interval    Interval width    Frequency    Frequency density    Cumulative frequency
[300, 360)              60               6              0.100                    6
[360, 380)              20              14              0.700                   20
[380, 400)              20              10              0.500                   30
[400, 420)              20               4              0.200                   34
[420, 460)              40              13              0.325                   47
[460, 500)              40               3              0.075                   50

The table above includes two additional columns: (i) ‘Frequency density’ – obtained by calculating ‘frequency divided by interval width’ (for example, 6/60 = 0.100), and (ii) ‘Cumulative frequency’ – obtained by simply determining the running total of the class frequencies (for example, 6 + 14 = 20). Note the final column is not required for a histogram per se, although the computation of cumulative frequencies may be useful when determining medians and quartiles (to be discussed later in this chapter).

To construct the histogram, adjacent bars are drawn over the respective class intervals such that the area of each bar is proportional to the interval frequency. This explains why equal bin widths are desirable since this reduces the problem to making the heights proportional to the interval frequency. However, you may be told to use a particular number of bins or bin widths, such that the bins will not all be of equal width. In such cases, you will need to compute the frequency density as outlined above. The histogram for the above example is shown in Figure 4.1. Key points to note are the following.

• All bars are centred on the midpoints of each class interval.
• Informative labels are shown on the histogram, i.e. title and axis labels. The idea is the reader can work out what the graphic is showing from the labels!
• Because area represents frequency, it follows that the dimension of bar heights is number per unit class interval, hence the y-axis should be labelled ‘Frequency density’ rather than ‘Frequency’.

⁵ This is a discrete variable since the output will take integer values, i.e. something which we can count.
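The frequency density table can be reproduced programmatically. A sketch using the 50 weekly production values from Example 4.2 and the same class boundaries:

```python
# Weekly production output (Example 4.2), 50 observations
output = [360, 385, 458, 362, 383, 381, 453, 365, 368, 393,
          451, 365, 365, 392, 444, 372, 371, 395, 445, 364,
          354, 410, 382, 396, 360, 437, 379, 396, 359, 420,
          372, 398, 358, 456, 363, 402, 354, 441, 350, 406,
          450, 469, 467, 460, 454, 438, 358, 362, 404, 459]

edges = [300, 360, 380, 400, 420, 460, 500]  # class interval boundaries, [lower, upper)

cumulative = 0
for lower, upper in zip(edges, edges[1:]):
    freq = sum(lower <= x < upper for x in output)   # count observations in the bin
    cumulative += freq                               # running total of frequencies
    print(f"[{lower}, {upper}): width={upper - lower}  freq={freq}  "
          f"density={freq / (upper - lower):.3f}  cumulative={cumulative}")
```

Each printed row matches a row of the table above, confirming the ‘frequency divided by interval width’ calculation.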
Data visualisation and descriptive statistics

[Figure 4.1: Frequency density histogram of weekly production output. Frequency density (0.0 to 0.8) against weekly production output (300 to 500).]

Keen students only: experiment with different numbers of bins and see how sensitive the histogram shape is to this choice!

Activity 4.2 The following two sets of data represent the lengths (in minutes) of students' attention spans during a one-hour class. Construct histograms for each of these datasets and use these to comment on comparisons between the two distributions.

Statistics class
01  47  21  38  55  43  40  07  48  27
16  14  28  36  06  28  36  49  22  23
27  23  31  29  42  25  32  22  12  21
26  15  24  32  58  25  31  26  11  23
22  19  14  34  35  26  25  45  42  13
04  51  37  14  28  11  25  19  07  46
54  43  43  31  60  60  18  28

Economics class
60  58  29  50  60  60  42  39  60  21
60  48  42  30  21  31  51  25  04  41
53  60  24  32  26  37  26  48  41  59
60  27  47  30  03  11  41  38  08  28
37  60  60

Solution

The exact shape of the histograms will depend on the class intervals you have used. A sensible approach is to choose round numbers. Below the following have been used:

Statistics
Class interval   Interval width   Frequency   Frequency density
(0, 10]                10              3             0.3
(10, 20]               10              8             0.8
(20, 30]               10             20             2.0
(30, 40]               10             10             1.0
(40, 50]               10              7             0.7
(50, 60]               10              2             0.2

Economics
Class interval   Interval width   Frequency   Frequency density
(0, 10]                10              5             0.5
(10, 20]               10              5             0.5
(20, 30]               10             14             1.4
(30, 40]               10              8             0.8
(40, 50]               10             11             1.1
(50, 60]               10             17             1.7

[Histograms: 'Statistics students' attention spans' and 'Economics students' attention spans' – frequency density against attention span (minutes), 10 to 60.]

The graphs show that for the statistics class attention spans are approximately symmetrical.
Attention spans for the economics class are more variable and higher than those of the statistics class, and they are not symmetrical due to a group of students who maintain interest throughout the class. Note we have more data on economics students – this is perfectly acceptable. Sample sizes of different groups may very well be different in practice.

4.7.4 Stem-and-leaf diagram

A stem-and-leaf diagram uses the raw data. As the name suggests, it is formed using a 'stem' and corresponding 'leaves'. The choice of the stem involves determining a major component of an observed value, such as the '10s' unit if the order of magnitude of the observations were 15, 25, 35 etc., or if data are of the order of magnitude 1.5, 2.5, 3.5 etc. then the integer part would be appropriate. The remainder of the observed value plays the role of the 'leaf'. Applied to the weekly production dataset, we obtain the stem-and-leaf diagram shown below in Example 4.3. Note the following points.

- These stems are equivalent to using the class intervals: [350, 360), [360, 370), [370, 380), ..., [460, 470).
- Leaves are vertically aligned, hence rotating the stem-and-leaf diagram 90 degrees anti-clockwise reproduces the shape of the data's distribution, just as would be revealed with a histogram.
- The leaves are placed in order of magnitude within the stems, so it is a good idea to sort the raw data into ascending order first of all.
- Unlike the histogram, the actual data values are preserved. This is advantageous if we want to calculate various descriptive (or summary) statistics later on.

Example 4.3 Continuing with Example 4.2, the stem-and-leaf diagram is:

Stem-and-leaf diagram of weekly production

Stem (Tens)   Leaves (Units)
    35        044889
    36        0022345558
    37        1229
    38        1235
    39        235668
    40        246
    41        0
    42        0
    43        78
    44        145
    45        0134689
    46        079

Note the informative title and labels for the stems and leaves.
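The construction just described can also be sketched in a few lines of Python (an illustrative sketch, not part of the subject guide): sort the data, then split each value into its 'tens' stem and 'units' leaf.

```python
from collections import defaultdict

def stem_and_leaf(data, stem_unit=10):
    """Group sorted observations into stems (tens) and ordered leaves (units)."""
    stems = defaultdict(list)
    for x in sorted(data):                 # sorting first puts leaves in order
        stems[x // stem_unit].append(x % stem_unit)
    return dict(sorted(stems.items()))

# The 50 weekly production values from Example 4.2.
output = [360, 385, 458, 362, 383, 381, 453, 365, 368, 393,
          451, 365, 365, 392, 444, 372, 371, 395, 445, 364,
          354, 410, 382, 396, 360, 437, 379, 396, 359, 420,
          372, 398, 358, 456, 363, 402, 354, 441, 350, 406,
          450, 469, 467, 460, 454, 438, 358, 362, 404, 459]

for stem, leaves in stem_and_leaf(output).items():
    print(stem, '|', ''.join(str(leaf) for leaf in leaves))
```

Running this reproduces the rows of the diagram in Example 4.3 (e.g. `36 | 0022345558`).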
So far we have considered how to summarise a dataset visually. This methodology is appropriate to get a visual feel for the distribution of the dataset. In practice, we would also like to summarise things numerically. There are two key properties of a dataset which will be of particular interest.

Key properties of a dataset

- Measures of location – a central point about which the data tend (also known as measures of central tendency).
- Measures of spread – a measure of the variability of the data, i.e. how spread out the data are about the central point (also known as measures of dispersion).

Activity 4.3 Sales of footwear in a store were recorded for 52 weeks and these are shown below.

Weekly sales of footwear
30  61  70  79  60  74  68  77  67  94
47  77  63  87  40  73  69  66  51  64
54  69  56  67  68  66  56  59  60  62
58  46  62  78  57  54  83  90  47  53
66  98  71  49  70  93  76  58  68  73
80  62

Construct a stem-and-leaf diagram of these data and comment on its shape.

Solution

Stem-and-leaf diagram of weekly shoe sales:

Stems in 10s   Leaves in 1s
     3         0
     4         06779
     5         1344667889
     6         001222346667788899
     7         00133467789
     8         037
     9         0348

We can see from the stem-and-leaf diagram that the data look approximately symmetric, centred at about 66.

4.8 Measures of location

Mean, median and mode are the three principal measures of location. In general, these will not all give the same numerical value for a given dataset/distribution.⁶ These three measures (and, later, measures of spread) will now be introduced using the following small sample dataset:

32  28  67  39  19  48  32  44  37  24.     (4.1)

4.8.1 Mean

The preferred measure of location/central tendency, which is simply the 'average' of the data.

⁶ These three measures can be the same in special cases, such as the normal distribution (introduced in Chapter 6) which is symmetric about the mean (and so mean = median) and achieves a maximum at this point, i.e. mean = median = mode.
It will be frequently applied in various statistical inference techniques in later chapters.

Sample mean

Using the summation operator, Σ, which remember is just a form of 'notational shorthand', we define the sample mean, x̄, as:⁷

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}.

Example 4.4 For the dataset in (4.1) above:

\bar{x} = \frac{32 + 28 + \cdots + 24}{10} = \frac{370}{10} = 37.

Of course, it is possible to encounter datasets in frequency form, that is each data value is given with the corresponding frequency of observations for that value, fk, for k = 1, ..., K, where there are K different variable values. In such a situation, use the formula:

\bar{x} = \frac{\sum_{k=1}^{K} f_k x_k}{\sum_{k=1}^{K} f_k}.     (4.2)

Note that this preserves the idea of 'adding up all the observations and dividing by the total number of observations'. This is an example of a weighted mean, where the weights are the relative frequencies.

If the data are given in grouped-frequency form, such as that shown in the table in Example 4.2, then the individual data values are unknown⁸ – all we know is the interval in which each observation lies. The sensible solution is to use the midpoint of the interval as a 'proxy' for each observation recorded as belonging within that class interval. Hence you still use the grouped-frequency mean formula (4.2), but each xk value will be substituted with the appropriate class interval midpoint.

⁷ The notation x̄ will be used to denote an observed sample mean, while µ will denote its population counterpart, i.e. the population mean.
⁸ Of course, we do have the raw data for the weekly production example and so we could work out the exact sample mean, but here suppose we did not have access to the raw data, instead we were just given the table of class frequencies in Example 4.2.

Example 4.5 Using the weekly production data in Example 4.2, the interval midpoints are: 330, 370, 390, 410, 440 and 480, respectively.
These will act as the data values for the respective classes. The mean is then calculated as:

\bar{x} = \frac{\sum_{k=1}^{K} f_k x_k}{\sum_{k=1}^{K} f_k} = \frac{(6 \times 330) + \cdots + (3 \times 480)}{6 + \cdots + 3} = 397.2.

Compared to the true mean of the raw data (which is 399.72), we see that using the midpoints as proxies gives a mean very close to the true sample mean value. Note the mean is not rounded up or down since it is an arithmetic result.

A drawback with the mean is its sensitivity to outliers (extreme observations). For example, suppose we record the net worth of 10 randomly chosen people. If Jeff Bezos (one of the world's richest people), say, was included, his substantial net worth would drag the mean upward considerably! By increasing the sample size n, the effect of his inclusion, although diluted, would still be non-negligible, assuming we were not just sampling from the population of billionaires!

4.8.2 Median

The (sample) median, m, is the middle value of the ordered dataset, where observations are arranged in (ascending) order. By definition, 50 per cent of the observations are greater than the median, and 50 per cent are less than the median.

Median

Arrange the n numbers in ascending order,⁹ say x(1), x(2), ..., x(n),¹⁰ then:

- if n is odd, there is an explicit middle value, so the median, m, is x((n+1)/2)
- if n is even, there is no explicit middle value, so take the average of the values either side of the 'midpoint', hence m = (x(n/2) + x(n/2+1))/2.

Example 4.6 For the dataset in (4.1), the ordered observations are:

19  24  28  32  32  37  39  44  48  67.

Here n = 10, i.e. there is an even number of observations, so we compute the average of the fifth and sixth ordered observations, that is:

m = \frac{x_{(5)} + x_{(6)}}{2} = \frac{32 + 37}{2} = 34.5.

⁹ If you have constructed a stem-and-leaf diagram, you would have already done this!
¹⁰ These are known as 'order statistics'. For example, x(1) is the first order statistic, i.e. the smallest observed value, and x(n) is the nth order statistic, i.e.
the largest observed value.

If we only had data in grouped-frequency form (as in Example 4.2), then we can make use of the cumulative frequencies. Since n = 50, the median is the 25.5th ordered observation which must lie in the [380, 400) interval because once we exhaust the ordered data up to the [360, 380) interval we have only covered 20 observations, while once the [380, 400) interval is exhausted we have accounted for 30 observations, meaning the median lies in this interval. Assuming the raw data are not accessible, we could use the midpoint (i.e. 390) as denoting the median. Alternatively, we could use an interpolation method which uses the following 'general' formula for grouped data, once you have identified the class which includes the median (such as [380, 400) above):

m = \text{endpoint of previous class} + \frac{\text{class width} \times \text{number of remaining observations}}{\text{class frequency}}.

Example 4.7 Returning to the weekly production example, the median would be:

380 + \frac{20 \times (25.5 - 20)}{10} = 391.

For comparison, using the raw data, x(25) = 392 and x(26) = 393 (quickly obtained from the stem-and-leaf diagram), giving a median of 392.5.

Although an advantage of the median is that it is not influenced by outliers (Jeff Bezos' net worth would be x(n) and so would not affect the median), in practice it is of limited use in formal statistical inference.

For symmetric data, the mean and median are always equal. Therefore, this is a simple way to verify whether a dataset is symmetric. Asymmetric distributions are skewed, where skewness measures the departure from symmetry. Although you will not be expected to compute the coefficient of skewness (its numerical value), you need to be familiar with the two types of skewness.

Skewness

When mean > median, this indicates a positively-skewed distribution (also referred to as 'right-skewed').
When mean < median, this indicates a negatively-skewed distribution (also referred to as 'left-skewed').

Graphically, skewness can be determined by identifying where the long 'tail' of the distribution lies. If the long tail is heading toward +∞ (positive infinity) on the x-axis (i.e. on the right-hand side), then this indicates a positively-skewed (right-skewed) distribution. Similarly, if heading toward −∞ (negative infinity) on the x-axis (i.e. on the left-hand side) then this indicates a negatively-skewed (left-skewed) distribution, as illustrated in Figure 4.2.

Example 4.8 The hourly wage rates used in Example 4.1 are skewed to the right, due to the influence of the 'extreme' values 8.00, 8.10, 8.10 and 8.20 – the effect of these (similar to Jeff Bezos' effect mentioned above) is to 'drag' or 'pull' the mean upward, hence mean > median.

[Figure 4.2: Different types of skewed distributions – a positively-skewed distribution (left panel) and a negatively-skewed distribution (right panel).]

Example 4.9 For the weekly production dataset in Example 4.2, we have calculated the mean and median to be 399.72 and 392.50, respectively. Since the mean is greater than the median, the data form a positively-skewed distribution, as confirmed by the histogram in Figure 4.1.

4.8.3 Mode

Our final measure of location is the mode.

Mode

The (sample) mode is the most frequently-occurring value in a (sample) dataset.

It is perfectly possible to encounter a multimodal distribution where several data values are tied in terms of their frequency of occurrence.

Example 4.10 The modal value of the dataset in (4.1) is 32, since it occurs twice while the other values only occur once each.

Example 4.11 For the weekly production dataset, looking at the stem-and-leaf diagram in Example 4.3, we can quickly see that 365 is the modal value (the three consecutive 5s opposite the second stem stand out).
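The three measures of location for dataset (4.1) can be cross-checked numerically. A short illustrative sketch (not part of the subject guide) using Python's standard `statistics` module:

```python
from statistics import mean, median, mode

data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]   # dataset (4.1)

print(mean(data))    # mean = 37   (Example 4.4)
print(median(data))  # median = 34.5, the average of the two middle order statistics (Example 4.6)
print(mode(data))    # mode = 32, the only value occurring twice (Example 4.10)
```

Note that here mean > median, consistent with the mild positive skew induced by the outlying value 67.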
If just given grouped data, then instead of reporting a modal value we can determine the modal class, which is [360, 380) with 14 observations. (The fact that this includes 365 here is a coincidence – the modal class and modal value are not the same thing.)

Activity 4.4 Calculate the mean, median and mode of the following sample values: 5, 8, 10, 7, 7, 4 and 12.

Solution

We have Σxᵢ = 53 and n = 7, hence x̄ = 53/7 = 7.57.

For the median, arrange the observations into ascending order, giving: 4, 5, 7, 7, 8, 10 and 12. Given there is an odd number of observations, since n = 7, then the median is:

x_{((n+1)/2)} = x_{(4)} = 7.

The mode is 7 since this occurs twice, with the other values only occurring once each.

Activity 4.5 Consider again the wage rate data in Activity 4.1. Calculate the mean, median and mode for these data. Explain which measure should be used in deciding on a rate to be used in your company.

Solution

The mean is:

\frac{2 \times 7.80 + 5 \times 7.90 + \cdots + 1 \times 8.50}{20} = \frac{161.4}{20} = £8.07.

The median is given by:

\frac{x_{(10)} + x_{(11)}}{2} = \frac{8.00 + 8.00}{2} = £8.00.

The mode is £7.90 (5 repetitions). The frequency distribution is positively skewed (i.e. has a long tail on the right) and so we have (as is often the case with positively-skewed data) that: mode < median < mean. The mode is probably a poor measure of location in this case (the mode is often a poor measure when we are dealing with measurable variables), so the median or mean is preferred. Depending on the company's negotiating strategy, it might choose a figure slightly higher than the mean, say £8.20, since it is higher than the average and exceeded by fewer than one-third of the surveyed rates.

Activity 4.6 Hard! In a sample of n = 6 objects the mean of the data is 15 and the median is 11. Another observation is then added to the sample and this takes the value x₇ = 12.

(a) Calculate the mean of the seven observations.
(b) What can you conclude about the median of the seven observations?

Solution

The 'new' mean is (6 × 15 + 12)/7 = 102/7 = 14.6. Note the easy way to calculate the new sum of data, using the old mean.

There is not enough information in the question to calculate the new median exactly, but it must be somewhere between 11 and 12 (inclusive), in part because for the original data:

x_{(1)}, x_{(2)}, x_{(3)} \leq 11 \quad \text{and} \quad x_{(4)}, x_{(5)}, x_{(6)} \geq 11.

x₇ = 12 fits somewhere in the second group, increasing the median. If x(3) < x₇ < x(4) then the new median will be x₇ = 12, while if x(3) < x(4) < x₇ then the new median will be x(4), where 11 ≤ x(4) ≤ 12. Other cases can also be worked out if we have enough information, or estimated by a bit more algebra.

Activity 4.7 An airline has a baggage weight allowance of 20 kg per passenger on its flights. It collected the following data on the weights of its passengers' baggage. The data were taken from a sample of 100 passengers on a Madrid–London flight. There were about 300 passengers on the flight and the 100 passengers were chosen at random.

Weight of baggage (kg)   Number of passengers
0 ≤ x < 5                        21
5 ≤ x < 10                        2
10 ≤ x < 15                      10
15 ≤ x < 20                      27
20 ≤ x < 25                      26
25 ≤ x < 30                      11
30 ≤ x < 35                       3
35 ≤ x                            0

(a) Display the data using an appropriate diagram.

(b) Calculate the mean and median of the data.

(c) The company uses the data to claim that '40% of airline passengers travel with baggage over the weight allowance'. Explain whether or not you think this claim is valid. (Think about how the data were collected!)

Solution

(a) A histogram is probably as good as anything here (though in some cases like this there is room for debate about what is best). We have:
Class interval   Interval width   Frequency   Frequency density
[0, 5)                 5              21             4.2
[5, 10)                5               2             0.4
[10, 15)               5              10             2.0
[15, 20)               5              27             5.4
[20, 25)               5              26             5.2
[25, 30)               5              11             2.2
[30, 35)               5               3             0.6

[Histogram of baggage weights of airline passengers: frequency density against weight of baggage (in kg).]

(b) The sample mean is:

\frac{(21 \times 2.5) + (2 \times 7.5) + \cdots + (3 \times 32.5)}{21 + 2 + \cdots + 3} = \frac{1650}{100} = 16.5.

The median is:

15 + \frac{5 \times (50.5 - 33)}{27} = 18.24.

(c) The claim is not valid. The data were taken from one route and only one flight on that route. We cannot extrapolate to make conclusions about all flights. There needs to be more variety in the sampling. For example, include flights to many destinations, both domestic and international flights etc.

4.9 Measures of spread

The dispersion (or spread) of a dataset is very important when drawing conclusions from it. Hence it is essential to have a useful measure of this property, and several candidates exist – these are reviewed below. As expected, there are advantages and disadvantages to each.

4.9.1 Range

Our first measure of spread is the range.

Range

The range is simply the largest value minus the smallest value, that is:

\text{range} = x_{(n)} - x_{(1)}.

Example 4.12 For the dataset in (4.1), the range is:

x_{(n)} - x_{(1)} = 67 - 19 = 48.

Clearly, the range is very sensitive to extreme observations since (when they occur) they are going to be the smallest and/or largest observations (x(1) and/or x(n), respectively), and so this measure is of limited appeal. If we were confident that no outliers were present (or decided to remove any outliers), then the range would better represent the true spread of the data.

More formally, we could consider the idea of the interquartile range (IQR) instead – that is, the upper (third) quartile, Q3, minus the lower (first) quartile, Q1.
The upper quartile has 75 per cent of observations below it (and 25 per cent above it) while the lower quartile has 25 per cent of observations below it (and 75 per cent above). Unsurprisingly the median, given our earlier definition, is the middle (second) quartile. By discarding the top 25 per cent and bottom 25 per cent of observations, respectively, we restrict attention solely to the central 50 per cent of observations.

Interquartile range

The IQR is defined as:

\text{IQR} = Q_3 - Q_1

where Q3 and Q1 are the third (upper) and first (lower) quartiles, respectively.

Example 4.13 Continuing with the dataset in (4.1), computation of the quartiles can be problematic since, for example, for the lower quartile we require the value such that 2.5 observations are below it and 7.5 values are above it. A suggested approach¹¹ (motivated by the median calculation when n is even) is to use:

Q_1 = \frac{x_{(2)} + x_{(3)}}{2} = \frac{24 + 28}{2} = 26.

Similarly:

Q_3 = \frac{x_{(7)} + x_{(8)}}{2} = \frac{39 + 44}{2} = 41.5.

Hence the IQR is 41.5 − 26 = 15.5. Contrast this with the range of 48 which is much larger due to the effects of x(1) and x(n).

Activity 4.8 Consider again the sales of footwear data in Activity 4.3.

Weekly sales of footwear
30  61  70  79  60  74  68  77  67  94
47  77  63  87  40  73  69  66  51  64
54  69  56  67  68  66  56  59  60  62
58  46  62  78  57  54  83  90  47  53
66  98  71  49  70  93  76  58  68  73
80  62

(a) Use the stem-and-leaf diagram to find the median and the quartiles of the data.

(b) Calculate the interquartile range.

Solution

(a) There are 52 values so the median is halfway between the 26th and 27th ordered values. It is easy to read off from the diagram that both the 26th and 27th ordered values equal 66, so the median equals 66.

(b) Again, we can read off from the diagram that the lower and upper quartiles are about 57 and 74, respectively. On this basis, the interquartile range is 74 − 57 = 17.
However, different textbooks give slightly different definitions of the quartiles so you might have obtained a slightly different answer (but not too different!) depending on the definition you have used. A sensible method is:

Q_1 = \frac{x_{(13)} + x_{(14)}}{2} = \frac{57 + 58}{2} = 57.5

and:

Q_3 = \frac{x_{(39)} + x_{(40)}}{2} = \frac{73 + 74}{2} = 73.5.

Using this method, the interquartile range is 73.5 − 57.5 = 16.

4.9.2 Boxplot

At this point, it is useful to introduce another graphical method – the boxplot.¹² In a boxplot, the middle horizontal line is the median and the upper and lower ends of the box are the upper and lower quartiles, respectively. The 'whiskers' are drawn from the quartiles to the observations furthest from the median, but not by more than one-and-a-half times the IQR (i.e. excluding outliers). The whiskers are terminated by horizontal lines. Any extreme points beyond the whiskers are plotted individually. An example of a (generic) boxplot is shown in Figure 4.3.

¹¹ There are many different methodologies for computing quartiles, and conventions vary from country to country (and from textbook to textbook!). Any reasonable approach is perfectly acceptable in the examination. For example, interpolation methods, as demonstrated previously for the case of the median, are valid.
¹² Some people call such plots box-and-whisker plots. No prizes for guessing why!

[Figure 4.3: An example of a boxplot (not to scale). The box spans Q1 to Q3 (50% of cases have values within the box), with Q2 = median marked inside. The whiskers end at the largest and smallest observed values that are not outliers. Values more than 1.5 box-lengths from Q1 or Q3 are marked 'o' (outliers); values more than 3 box-lengths from Q1 or Q3 are marked 'x' (extreme outliers).]

If you are presented with a boxplot, then it is easy to obtain all of the following: the median, quartiles, IQR, range and skewness.
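Everything a boxplot displays can be pulled directly from the order statistics. The following sketch (illustrative only, and using the averaging convention of Example 4.13 – other quartile conventions give slightly different values) computes the five-number summary of dataset (4.1):

```python
data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]   # dataset (4.1)
s = sorted(data)                                   # order statistics x(1), ..., x(n)

minimum, maximum = s[0], s[-1]
median = (s[4] + s[5]) / 2    # average of x(5) and x(6), since n = 10 is even
q1 = (s[1] + s[2]) / 2        # average of x(2) and x(3), as in Example 4.13
q3 = (s[6] + s[7]) / 2        # average of x(7) and x(8)

print(minimum, q1, median, q3, maximum)            # 19 26.0 34.5 41.5 67
print('IQR =', q3 - q1, 'range =', maximum - minimum)
```

These are exactly the quantities a boxplot of dataset (4.1) would mark: the box from Q1 = 26 to Q3 = 41.5 with the median 34.5 inside, and whiskers reaching towards 19 and 67.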
Recall that skewness (the departure from symmetry) is characterised by a long tail, attributable to outliers, which are readily apparent from a boxplot.

Example 4.14 From the boxplot shown in Figure 4.4, it can be seen that the median, Q2, is around 74, Q1 is about 63, and Q3 is approximately 77. The many outliers provide a useful indicator that this is a negatively-skewed distribution as the long tail covers lower values of the variable. Note also that Q3 − Q2 < Q2 − Q1, which tends to indicate negative skewness.

[Figure 4.4: A boxplot showing a negatively-skewed distribution.]

4.9.3 Variance and standard deviation

The variance and standard deviation are much better and more useful statistics for representing the spread of a dataset. You need to be familiar with their definitions and methods of calculation for a sample of data values x₁, x₂, ..., xₙ.

Begin by computing the so-called 'corrected sum of squares', Sxx, the sum of the squared deviations of each data value from the (sample) mean, where:

S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2.     (4.3)

Recall from earlier that x̄ = Σᵢ xᵢ/n. The proof for why the expressions in (4.3) are equivalent is beyond the scope of this half-course.

Sample variance

The sample variance, s², is defined as:

s^2 = \frac{S_{xx}}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right).

Note the divisor used to compute s² is n − 1, not n. Do not worry about why,¹³ just remember to divide by n − 1 when computing a sample variance.¹⁴ To obtain the sample standard deviation, s, we just take the (positive) square root of the sample variance, s².

Sample standard deviation

The sample standard deviation, s, is:

s = \sqrt{s^2} = \sqrt{\frac{S_{xx}}{n-1}}.

¹³ This is covered in ST104b Statistics 2.
¹⁴ In contrast for population data, the population variance is σ² = Σᵢ₌₁ᴺ (xᵢ − µ)²/N, i.e. we use the N divisor here, where N denotes the population size while n denotes the sample size.
Also, note the use of µ (the population mean) instead of x̄ (the sample mean).

Example 4.15 Using the dataset in (4.1), x̄ = 37, so:

S_{xx} = (32 - 37)^2 + (28 - 37)^2 + \cdots + (24 - 37)^2 = 25 + 81 + \cdots + 169 = 1698.

Hence s = √(1698/(10 − 1)) = 13.74.

Note that, given Σxᵢ² = 15388, we could have calculated Sxx using the other expression:

S_{xx} = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = 15388 - 10 \times (37)^2 = 1698.

So this alternative method is much quicker to calculate Sxx.

Sample variance for grouped-frequency data

For grouped-frequency data with K classes, to compute the sample variance we use the formula:

s^2 = \frac{\sum_{k=1}^{K} f_k (x_k - \bar{x})^2}{\sum_{k=1}^{K} f_k} = \frac{\sum_{k=1}^{K} f_k x_k^2}{\sum_{k=1}^{K} f_k} - \left(\frac{\sum_{k=1}^{K} f_k x_k}{\sum_{k=1}^{K} f_k}\right)^2.

Recall that the last bracketed squared term is simply the mean formula for grouped data. Note that for grouped-frequency data we can ignore the 'divide by n − 1' rule, since we would expect n to be very large in such cases, such that n − 1 ≈ n and so dividing by n or n − 1 makes negligible difference in practice, noting that Σₖ fₖ = n.

Example 4.16 Let us now consider an extended example bringing together many of the issues considered in this chapter. We will draw a histogram, here with equal class widths, and introduce the cumulative frequency diagram. A stockbroker is interested in the level of trading activity on a particular stock exchange. She has collected the following data, which are weekly average volumes (in millions), over a 29-week period. This is an example of time series data. Note that this variable is treated as if it was discrete, but because the numbers are so large the variable can be treated as continuous.

172.5  161.9  172.3  181.3  169.1  155.0  148.6  159.8  161.6  153.8
154.6  151.6  132.4  144.0  133.6  149.0  135.8  139.9  164.4  175.6
163.5  172.6  168.3  155.3  143.4  140.6  125.1  171.3  167.0
So to construct a histogram we first decide on the choice of class intervals, which is a subjective decision. The objective is to convey information in a useful way. In this case the data lie between (roughly) 120 and 190 million shares/week, so class intervals of width 10 million will give seven classes. With almost 30 observations this choice is probably adequate; more observations might support more classes (a class interval of 5 million, say); fewer observations would, perhaps, need a larger interval of 20 million. Therefore, the class intervals are defined like this:

120 ≤ volume < 130,  130 ≤ volume < 140  etc.

or, alternatively, [120, 130), [130, 140) etc. We now proceed to determine the frequency density (and cumulative frequencies, for later).

Class interval   Interval width   Frequency   Frequency density   Cumulative frequency
[120, 130)             10              1             0.1                   1
[130, 140)             10              4             0.4                   5
[140, 150)             10              5             0.5                  10
[150, 160)             10              6             0.6                  16
[160, 170)             10              7             0.7                  23
[170, 180)             10              5             0.5                  28
[180, 190)             10              1             0.1                  29

Hence the frequency density histogram is as shown in Figure 4.5.

A development from the histogram is the cumulative frequency diagram. Recall that a cumulative frequency gives us the total frequency of observations which fall below the upper endpoint of a class interval. Therefore, it is the upper endpoint which we use in the diagram. Figure 4.6 displays this for the trading data example. A variation on this is the cumulative relative frequency diagram which accumulates the relative frequencies (typically expressed as percentages), rather than the frequencies, where the relative frequency of the kth interval is calculated as:

\text{relative frequency}_k = \frac{f_k}{n}.

Class interval   Frequency   Relative frequency (%)   Cumulative relative frequency (%)
[120, 130)           1               3.45                       3.45
[130, 140)           4              13.79                      17.24
[140, 150)           5              17.24                      34.48
[150, 160)           6              20.69                      55.17
[160, 170)           7              24.14                      79.31
[170, 180)           5              17.24                      96.55
[180, 190)           1               3.45                     100.00

So, for example, in the above table since n = 29 the relative frequency for the first interval is 1/29 = 0.0345, or 3.45%. Plotting the cumulative relative frequencies yields the cumulative relative frequency diagram, as shown in Figure 4.7.

Note all the diagrams use the grouped frequencies, rather than the original raw data. For comparison, Figure 4.8 uses the raw ungrouped data, providing a more 'precise' cumulative relative frequency diagram.

Finally, we use the grouped data to compute particular descriptive statistics – specifically the mean, variance and standard deviation. There are K = 7 classes, so we perform the appropriate intermediate calculations by groups.

Class interval   Midpoint, xk   Frequency, fk    fk xk     fk xk²
[120, 130)           125              1            125      15,625
[130, 140)           135              4            540      72,900
[140, 150)           145              5            725     105,125
[150, 160)           155              6            930     144,150
[160, 170)           165              7          1,155     190,575
[170, 180)           175              5            875     153,125
[180, 190)           185              1            185      34,225
Total                                29          4,535     715,725

Hence the grouped mean is:

\bar{x} = \frac{\sum_{k=1}^{7} f_k x_k}{\sum_{k=1}^{7} f_k} = \frac{4535}{29} = 156.4

and the grouped variance is:

s^2 = \frac{\sum_{k=1}^{7} f_k x_k^2}{\sum_{k=1}^{7} f_k} - \left(\frac{\sum_{k=1}^{7} f_k x_k}{\sum_{k=1}^{7} f_k}\right)^2 = \frac{715725}{29} - (156.4)^2 = 219.2

giving a standard deviation of √219.2 = 14.8.

For comparison, the ungrouped mean, variance and standard deviation are 156.0, 217.0 and 14.7, respectively (compute these yourself to verify!). Note the units for the mean and standard deviation are 'millions of shares/week', while the units for the variance are the square of those for the standard deviation, i.e. '(millions of shares/week)²', so this is an obvious reason why we often work with the standard deviation, rather than the variance, due to the original (and more meaningful) units.

Activity 4.9 Calculate the range, variance and standard deviation of the following sample values: 5, 8, 10, 7, 7, 4 and 12.
Solution

The range is:

x_{(n)} - x_{(1)} = 12 - 4 = 8

where x(1) is the minimum value in the dataset, and x(n) is the maximum value.

We have Σxᵢ = 53 and Σxᵢ² = 447, hence x̄ = 53/7 = 7.57 and the (sample) variance is:

s^2 = \frac{1}{n-1}\left(\sum x_i^2 - n\bar{x}^2\right) = \frac{1}{6} \times (447 - 7 \times (7.57)^2) = 7.64.

Hence the (sample) standard deviation is:

s = \sqrt{s^2} = \sqrt{7.64} = 2.76.

[Figure 4.5: Frequency density histogram of trading volume data. Frequency density against weekly average shares traded (millions).]
[Figure 4.6: Cumulative frequency of trading volume data, plotted at the upper endpoints of the class intervals.]
[Figure 4.7: Cumulative relative frequency (%) of trading volume data.]
[Figure 4.8: Cumulative relative frequency (%) of trading volume (ungrouped) data.]

Activity 4.10 Compare the means and standard deviations of the sample datasets shown in Activity 4.2 and comment. (You may wish to use the following summary statistics.)

                  Statistics class   Economics class
Sum                     1,395             2,225
Sum of squares         46,713           100,387

Solution

For the statistics students:

\bar{x} = \frac{1}{50}\sum_{i=1}^{50} x_i = \frac{1395}{50} = 27.9 \quad \text{and} \quad s = \sqrt{\frac{1}{49}\left(46713 - \frac{(1395)^2}{50}\right)} = 12.6.

So the statistics students' mean and standard deviation attention spans are 27.9 minutes and 12.6 minutes, respectively.
For the economics students:

x̄ = (1/60) Σ_{i=1}^{60} xi = 2225/60 = 37.1  and  s = √[(1/59) × (100387 − (2225)²/60)] = 17.4.

So the economics students’ mean and standard deviation attention spans are 37.1 minutes and 17.4 minutes, respectively. These statistics represent the main features of the distributions shown in the previous question. The economics students have the higher mean and variability due to the group which maintains interest throughout the class.

Activity 4.11 Calculate the mean and standard deviation of the following groups of numbers (treat these as samples).

(a) 5, 3, 4, 7, 10, 7 and 1.
(b) 12, 10, 11, 14, 17, 14 and 8.
(c) 12, 8, 10, 16, 22, 16 and 4.
(d) 25, 9, 16, 49, 100, 49 and 1.

Comment on any relationships between the statistics for the groups.

Solution
The sample means and standard deviations are:

        Mean    Standard deviation (s.d.)
(a)     5.29           2.98
(b)    12.29           2.98
(c)    12.57           5.97
(d)    35.57          33.93

Comments are the following. Dataset (b) is (a) + 7, hence mean(b) = mean(a) + 7 and s.d.(b) = s.d.(a). Dataset (c) is 2(a) + 2, hence mean(c) = 2 × mean(a) + 2 and s.d.(c) = 2 × s.d.(a). Dataset (d) is the square of group (a). These results are examples of transformations of random variables.

Activity 4.12 Consider again the wage rate data in Activity 4.1. Calculate the range, interquartile range and standard deviation of these data.

Solution
The range is £8.50 − £7.80 = £0.70. For the quartiles we have:

Q1 = (x(5) + x(6))/2 = (7.90 + 7.90)/2 = £7.90

and:

Q3 = (x(15) + x(16))/2 = (8.20 + 8.20)/2 = £8.20.

Therefore, the interquartile range is £8.20 − £7.90 = £0.30. To compute the (sample) variance, compute the following:

Σ_{i=1}^{20} xi² = 1303.30

Sxx = Σ_{i=1}^{20} xi² − (1/n) (Σ_{i=1}^{20} xi)² = 1303.30 − (1/20) × (161.4)² = 0.802

hence:

s² = Sxx/(n − 1) = 0.802/19 = 0.042211.

Hence the (sample) standard deviation is √0.042211 = £0.21. (Note two decimal places are sufficient.)

Tip: Working out the quartiles is easier if the sample size is a multiple of 4.
If we have 4k items, listed in increasing order as x(1), . . . , x(4k), then:

Q1 = (x(k) + x(k+1))/2,  Q2 = (x(2k) + x(2k+1))/2  and  Q3 = (x(3k) + x(3k+1))/2.

Activity 4.13 The table below summarises the distribution of salaries for a sample of 50 employees.

Salary (£000s)   Number of employees
[5, 10)                   8
[10, 15)                 14
[15, 20)                 21
[20, 25)                  7

(a) Draw a histogram for these data.
(b) Calculate the mean, standard deviation and median.
(c) What is the modal class for this sample?

Solution
(a) Calculations for the salary data and the histogram are:

Salary (£000s)   Midpoint xi (£000s)   Interval width   Frequency fi   Frequency density   fi xi     fi xi²
[5, 10)                  7.5                  5                8              1.6            60.0       450.00
[10, 15)                12.5                  5               14              2.8           175.0     2,187.50
[15, 20)                17.5                  5               21              4.2           367.5     6,431.25
[20, 25)                22.5                  5                7              1.4           157.5     3,543.75
Total                                                         50                            760.0    12,612.50

Histogram of salaries (frequency density against salaries, in £000s).

(b) The mean is 760/50 = 15.2. The sample standard deviation is:

s = √[(1/49) × (12612.50 − (760)²/50)] = 4.65.

For the median, we use the approach for grouped-frequency data, i.e. we have:

15 + 5 × (25.5 − 22)/21 = 15.83.

(c) The modal class is [15, 20) (where the modal class is, by extension of the mode, the class with the greatest number of items in it).

Activity 4.14 In a sample of 64 observations, the numbers of households containing 1, 2, 3, 4 and 5 persons were 8, 30, 11, 7 and 8, respectively. Calculate the mean and standard deviation of household size.

Solution
It might help to display the data in a table:

Household size, xi    1    2    3    4    5
Frequency, fi         8   30   11    7    8

The summary statistics are:

Σ fk = 8 + 30 + 11 + 7 + 8 = 64

Σ fk xk = 1 × 8 + 2 × 30 + · · · + 5 × 8 = 169

and:

Σ fk xk² = 8 × 1² + 30 × 2² + · · · + 8 × 5² = 539.

So:

x̄ = (Σ_{k=1}^{K} fk xk) / (Σ_{k=1}^{K} fk) = 169/64 = 2.64

and:

s² = (Σ_{k=1}^{K} fk xk²) / (Σ_{k=1}^{K} fk) − [(Σ_{k=1}^{K} fk xk) / (Σ_{k=1}^{K} fk)]² = 539/64 − (169/64)² = 1.45.
Therefore, the standard deviation is s = √1.45 = 1.20.

Activity 4.15 James set a class test for his students. The test was marked out of 10, and the students’ marks are summarised below. (Note that no student obtained full marks, and every student scored at least 3.)

Mark (out of 10)      3   4   5   6    7   8   9
Number of students    1   1   6   5   11   3   1

(a) Display the data using an appropriate diagram.
(b) Calculate the mean and median marks.
(c) Calculate the standard deviation, the quartiles and the interquartile range.

Solution
(a) A histogram would be acceptable for this dataset. We have:

Class interval   Interval width   Frequency   Frequency density
[3, 4)                 1               1              1
[4, 5)                 1               1              1
[5, 6)                 1               6              6
[6, 7)                 1               5              5
[7, 8)                 1              11             11
[8, 9)                 1               3              3
[9, 10)                1               1              1

Histogram of marks (frequency density against marks).

(b) x̄ = 177/28 = 6.32. There are 28 students. The 14th lowest score was 7 and the 15th lowest was also 7. Therefore, the median mark equals 7.

(c) We compute the sample standard deviation using:

s = √[(1/27) × ((1 × 3² + 1 × 4² + · · · + 1 × 9²) − (177)²/28)] = √1.78 = 1.33.

The lower quartile equals 5 and the upper quartile equals 7, so the interquartile range = 7 − 5 = 2. Note that, perhaps surprisingly, the upper quartile equals the middle quartile, which is, of course, the median. This sort of thing can happen with discrete data!

4.10 Summary

This chapter, although in practice concerned with what you can do with data after you have collected them, serves as a useful introduction to the whole half-course. It highlights some of the problems with handling data and, furthermore, has introduced many of the fundamental concepts, such as the mean, variance, and discrete and continuous data.

4.11 Test your understanding

In this exercise you will practise aspects of data visualisation and descriptive statistics.
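The frequency-table arithmetic used in Activities 4.14 and 4.15 is easy to mechanise. A minimal Python sketch (illustrative only; the helper is our own, not part of the course):

```python
def freq_stats(values, freqs):
    """Mean and standard deviation for data given as a frequency table.
    Uses the grouped formula s^2 = sum(f x^2)/sum(f) - (sum(f x)/sum(f))^2,
    i.e. division by n as in Activity 4.14, not by n - 1."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, values)) / n
    var = sum(f * x**2 for f, x in zip(freqs, values)) / n - mean**2
    return mean, var ** 0.5

# Household sizes from Activity 4.14: sizes 1..5 with frequencies 8, 30, 11, 7, 8.
mean, sd = freq_stats([1, 2, 3, 4, 5], [8, 30, 11, 7, 8])
```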
In statistical analysis, there are usually simply too many numbers to make sense of just by staring at them. Data visualisation and descriptive statistics attempt to summarise key features of the data to make them understandable and easy to communicate. The main function of diagrams is to bring out interesting features of a dataset visually by displaying its distribution, i.e. summarising the whole sample distribution of a variable. Descriptive statistics allow us to summarise one feature of the sample distribution in a single number. Here you will work with measures of central tendency, measures of dispersion and skewness. At a time of economic growth but political uncertainty, a random sample of 40 economists (from the population of all economists) produces the following forecasts for the growth rate of an economy in the next year: 1.3 1.8 2.2 3.4 4.5 3.8 2.7 1.9 3.9 5.2 4.1 3.1 3.6 2.9 3.1 2.6 2.3 2.8 1.9 2.5 2.4 3.7 4.3 3.3 3.3 2.2 2.5 3.1 8.2 3.4 3.4 4.1 4.2 5.4 4.4 5.1 4.7 4.6 3.3 5.2 (a) Construct a stem-and-leaf diagram for these data. (b) Draw a suitable histogram for these data. (c) Comment on the shape of the sample distribution. (d) Determine the median of the data using the stem-and-leaf diagram in (a). (e) Produce a boxplot for these data. (f) Using the following summary statistics, calculate the sample mean and sample standard deviation of the growth rate forecasts: Sum of data = X xi = 140.4 and Sum of squares of data = i X x2i = 557.26. i (g) Comment on the relative values of the mean and median. (h) What percentage of the data fall within one sample standard deviation of the sample mean? And what percentage fall within two sample standard deviations of the sample mean? Solution (a) A stem-and-leaf diagram for the data is: 67 4 4. 
Data visualisation and descriptive statistics Stem-and-leaf diagram of economy growth forecasts Stem (%) 1 2 3 4 5 6 7 8 4 Leaves (0.1%) 3899 2234556789 1113334446789 11234567 1224 2 (b) It would be sensible to use the stems from (a) as the classes (bins) for the histogram. Class interval [1.0, 2.0) [2.0, 3.0) [3.0, 4.0) [4.0, 5.0) [5.0, 6.0) [6.0, 7.0) [7.0, 8.0) [8.0, 9.0) Interval width 1 1 1 1 1 1 1 1 Frequency 4 10 13 8 4 0 0 1 Frequency density 4 10 13 8 4 0 0 1 10 8 6 4 0 2 Frequency density 12 Histogram of economy growth forecasts 0 2 4 6 8 Economy growth forecasts (%) 68 10 4.11. Test your understanding (c) The stem-and-leaf diagram and histogram show that the data are positively-skewed (skewed to the right), due to the outlier forecast of 8.2%. Note if you are ever asked to comment on the shape of a distribution, consider the following points: • Is the distribution (roughly) symmetric? • Is the distribution bimodal? • Is the distribution skewed (an elongated tail in one direction)? If so, what is the direction of the skewness? • Are there any outliers? 4 (d) There are n = 40 observations, so the median is the average of the 20th and 21st ordered observations. Using the stem-and-leaf diagram in (a), we see that x(20) = 3.3 and x(21) = 3.4. Therefore, the median is (3.3 + 3.4)/2 = 3.35%. (e) Since Q2 is the median, which is 3.35, we now need the first and third quartiles, Q1 and Q3 , respectively. There are several methods for determining the quartiles, and any reasonable approach would be acceptable in an examination. For simplicity, here we will use the following since n is divisible by 4: Q1 = x(n/4) = x(10) = 2.5% and Q3 = x(3n/4) = x(30) = 4.2%. Hence the interquartile range (IQR) is Q3 − Q1 = 4.2 − 2.5 = 1.7%. Therefore, the whisker limits must satisfy: max(x(1) , Q1 − 1.5 × IQR) and min(x(n) , Q3 + 1.5 × IQR) which is: max(1.3, −0.05) = 1.30 and min(8.2, 6.75) = 6.75. 
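The median, quartile and whisker arithmetic in (d) and (e) can be cross-checked numerically. The Python sketch below is illustrative only; it hard-codes the n-divisible-by-4 quartile convention used above, which is one of several acceptable conventions:

```python
forecasts = sorted([
    1.3, 1.8, 2.2, 3.4, 4.5, 3.8, 2.7, 1.9, 3.9, 5.2,
    4.1, 3.1, 3.6, 2.9, 3.1, 2.6, 2.3, 2.8, 1.9, 2.5,
    2.4, 3.7, 4.3, 3.3, 3.3, 2.2, 2.5, 3.1, 8.2, 3.4,
    3.4, 4.1, 4.2, 5.4, 4.4, 5.1, 4.7, 4.6, 3.3, 5.2,
])
n = len(forecasts)                                         # 40, divisible by 4
median = (forecasts[n // 2 - 1] + forecasts[n // 2]) / 2   # average of x(20), x(21)
q1 = forecasts[n // 4 - 1]                                 # x(10), the rule used in (e)
q3 = forecasts[3 * n // 4 - 1]                             # x(30)
iqr = q3 - q1
lower = max(forecasts[0], q1 - 1.5 * iqr)                  # whisker limits
upper = min(forecasts[-1], q3 + 1.5 * iqr)
outliers = [x for x in forecasts if x < lower or x > upper]
```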
So there is just a single observation which lies outside the interval [1.30, 6.75], which is x(40) = 8.2% and hence this is plotted individually in the boxplot. Since this is less than Q3 + 3 × IQR = 4.2 + 3 × 1.7 = 9.3%, then this observation is an outlier, rather than an extreme outlier. The boxplot is (a vertical orientation is also fine): 1 2 3 4 5 6 7 8 9 Economy growth forecasts (%) (f) We have sample data, not population data, hence the (sample) mean is denoted by x̄ and the (sample) standard deviation is denoted by s. We have: n 1X 140.4 x̄ = xi = = 3.51% n i=1 40 69 4. Data visualisation and descriptive statistics and: 1 s2 = n−1 n X i=1 ! x2i − nx̄2 = 1 557.26 − 40 × (3.51)2 = 1.6527. 39 Therefore, the standard deviation is s = 4 √ 1.6527 = 1.29%. (g) In (c) it was concluded that the stem-and-leaf diagram and histogram of the data were positively-skewed, and this is consistent with the mean being larger than the median. It is possible to quantify skewness, although this is beyond the scope of ST104a Statistics 1. (h) We calculate: x̄ − s = 3.51 − 1.29 = 2.22 and x̄ + s = 3.51 + 1.29 = 4.80 also: x̄ − 2 × s = 3.51 − 2 × 1.29 = 0.93 and x̄ + 2 × s = 3.51 + 2 × 1.29 = 6.09. Now we use the stem-and-leaf diagram to see that 29 observations are between 2.22 and 4.80 (i.e. the interval [2.22, 4.80]), and 39 observations are between 0.93 and 6.09 (i.e. the interval [0.93, 6.09]). So the proportion (or percentage) of the data in each interval, respectively, is: 29 = 0.725 = 72.5% and 40 39 = 0.975 = 97.5%. 40 Some general points to note are the following. • Many ‘bell-shaped’ distributions we meet – that is, distributions which look a bit like the normal distribution – have the property that 68% of the data lie within approximately one standard deviation of the mean, and 95% of the data lie within approximately two standard deviations of the mean. The percentages in (h) are fairly similar to these. 
• The exercise illustrates the importance of (at least) one more decimal place than in the original data. If we had 3.5% and 1.3% for the mean and standard deviation, respectively, the ‘boundaries’ for the interval with one standard deviation would have been 3.5 ± 1.3 ⇒ [2.2, 4.8]. Since 2.2 is a data value which appears twice, we would have had to worry about to which side of the ‘boundary’ to allocate these. (This type of issue can still happen with the extra decimal place, but much less frequently.) • When constructing a histogram, it is possible to ‘lose’ a pattern in the data – for example, an approximate bell shape – through two common errors: • too few class intervals (which is the same as too wide class intervals) • too many class intervals (which is the same as too narrow class intervals). For example, with too many class intervals, you mainly get 0, 1 or 2 items per class, so any (true) peak is hidden by the subdivisions which you have used. 70 4.12. Key terms and concepts • The best number of (equal-sized) class intervals depends on the sample size. For large samples, many class intervals will not lose the pattern, while for small samples they will. However, with the datasets which tend to crop up in ST104a Statistics 1, something like 6, 7, 8 or 9 class intervals are likely to work well. So don’t forget this! 4.12 Key terms and concepts Boxplot Continuous variable Discrete variable Dot plot Interquartile range Measurable variable Mode Ordinal Range Skewness Stem-and-leaf diagram 4.13 Categorical variable Cumulative frequency Distribution Histogram Mean Median Nominal Outliers Relative frequency Standard deviation Variance 4 Learning activities Solutions can be found in Appendix A. 1. Identify and describe the variables in each of the following examples. (a) Voting intentions in a poll of 1,000 people. (b) The number of hours students watch television per week. (c) The number of children per household receiving full-time education. 2. 
Think about why and when you would use each of the following: (a) histogram (b) stem-and-leaf diagram. When would you not do so? 3. Find the mean of the number of hours of television watched per week by 10 students: Number of hours watched: 2, 2, 2, 5, 5, 10, 10, 10, 10, 10. 4. Say whether the following statement is true or false and briefly give your reason(s). ‘The mean of a dataset is always greater than the median.’ 71 4. Data visualisation and descriptive statistics 5. If n = 4, x1 = 1, x2 = 4, x3 = 5 and x4 = 6, find: 4 1X xi . 3 i=2 Why might you use this figure to estimate the mean? 6. For x1 = 4, x2 = 1 and x3 = 2, calculate: (a) the median (b) the mean. 4 7. If x1 = 4, x2 = 2, x3 = 2, x4 = 5 and x5 = 6, calculate: (a) the mode (b) the mean. 8. Calculate the mean, median and mode of the prices of spreadsheets (using the data below). Software Brand A Brand B Brand C Brand D Brand E Brand F Brand G Brand H Brand I Brand J Brand K Brand L Brand M Price (in £) 52 64 65 82 105 107 110 115 155 195 195 239 265 Total Number of observations 1749 13 You should be able to show that the arithmetic mean is 134.54, the median is 110, and the mode is 195. 9. Work out s2 for a sample of nine observations of the number of minutes students took to complete a statistics problem. Minutes to complete problem: 2, 4, 5, 6, 6, 7, 8, 11, 20. 10. State whether the following statement is true or false and briefly give your reason(s). ‘Three quarters of the observations in a dataset are less than the lower quartile.’ 72 4.14. A reminder of your learning outcomes 11. Calculate the range, variance, standard deviation and interquartile range of the spreadsheet prices shown below. Check against the answers given after the data. 
Name Brand A Brand B Brand C Brand D Brand E Brand F Brand G Brand H Brand I Brand J Brand K Brand L Brand M Price 52 64 65 82 105 107 110 115 155 195 195 239 265 Price − Mean −82.54 −70.54 −69.54 −52.54 −29.54 −27.54 −24.54 −19.54 20.46 60.46 60.46 104.46 130.46 (Price − Mean)2 6,812.60 4,975.67 4,835.60 2760.29 875.52 758.37 602.14 381.75 418.67 3,655.60 3,655.60 10,912.21 17,020.21 0.00 57,661.23 Total Variance = 4805.10 Std. Dev. = 69.32 4 IQR = 113 12. The data below show the number of daily telephone calls received by an office supplies company over a period of 25 working days. 219 476 258 121 291 541 58 7 13 418 177 175 455 312 164 314 336 77 183 133 78 138 244 36 48 (a) Construct a stem-and-leaf diagram for these data and use this to find the median of the data. (b) Find the first and third quartiles of the data. (c) Would you expect the mean to be similar to the median? Explain. (d) Comment on your figures. 4.14 A reminder of your learning outcomes After completing this chapter, and having completed the essential reading and activities, you should be able to: calculate the following: arithmetic mean, median, mode, standard deviation, variance, quartiles, range and interquartile range explain the use and limitations of the above quantities 73 4. Data visualisation and descriptive statistics draw and interpret histograms, stem-and-leaf diagrams, boxplots and cumulative frequency distributions incorporate labels and titles correctly in your diagrams and state the units which you have used. In summary, you should be able to use appropriate measures and diagrams in order to explain and clarify data which you have collected or which are presented to you. 4 4.15 Sample examination questions Solutions can be found in Appendix B. 1. Classify each one of the following variables as either measurable (continuous) or categorical. If a variable is categorical, further classify it as nominal or ordinal. Justify your answer. (a) Gross domestic product (GDP) of a country. 
(b) Five possible responses to a customer satisfaction survey ranging from ‘very satisfied’ to ‘very dissatisfied’. (c) A person’s name. 2. The data below contain measurements of the low-density lipoproteins, also known as the ‘bad’ cholesterol, in the blood of 30 patients. Data are measured in milligrams per deciliters (mg/dL). 95 99 103 111 115 127 96 101 104 112 117 129 96 101 104 113 121 131 98 102 107 113 123 135 99 102 107 114 124 143 (a) Construct an appropriate histogram of the data. (b) Find the mean (given that the sum of the data is 3342), the median P 2 and the standard deviation (given that the sum of the squared data, xi , is 377076). (c) Comment on the data given the shape of the histogram. 3. The average daily intakes of calories, measured in kcals, for a random sample of 12 athletes were: 1808, 1936, 1957, 2004, 2009, 2101, 2147, 2154, 2200, 2231, 2500, 3061. (a) Construct a boxplot of the data. (The boxplot does not need to be exactly to scale, but values of box properties and whiskers should be clearly labelled.) 74 4.15. Sample examination questions (b) Based on the shape of the boxplot you have drawn, describe the distribution of the data. (c) Name two other types of graphical displays which would be suitable to represent the data. Briefly explain your choices. 4 75 4. Data visualisation and descriptive statistics 4 76 Chapter 5 Probability theory 5.1 Aims This chapter introduces the fundamental concepts of probability. In other courses, particularly in ST104b Statistics 2 and ST2187 Business analytics, applied modelling and prediction, you may make full use of probability in both theory and in decision trees, and highlight the ways in which such information can be used. We will look at probability at quite a superficial level in this half-course. Even so, you may find that, although the concepts of probability introduced are simple, their application in particular circumstances may be very difficult. Try not to lose heart. 
Aim to: be capable of dealing with the basic concepts and follow through on examples and activities relate the new ideas of probability to the examples given appreciate its uses and applications. 5.2 Learning outcomes After completing this chapter, and having completed the essential reading and activities, you should be able to: apply the ideas and notation involved in set theory to simple examples recall the basic axioms of probability and apply them distinguish between the ideas of conditional probability and independence draw and use appropriate Venn diagrams draw and use appropriate probability trees. 5.3 Essential reading Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060] Sections 3.1–3.3 and 3.5. 77 5 5. Probability theory In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials accessible via the ST104a Statistics 1 area at https://emfss.elearning.london.ac.uk 5.4 Further reading Aczel, A.D. Complete Business Statistics (London: McGraw–Hill Higher Education, 2009) seventh edition [ISBN 9780071287531] Sections 2.1–2.6. Anderson, D.R., D.J. Sweeney, T.A. Williams, J. Freeman and E. Shoesmith Statistics for Business and Economics (South-Western Cengage Learning, 2010) eleventh edition [ISBN 9780324783247] Sections 4.1–4.4. Lind, D.A., W.G. Marchal and S.A. Wathen Statistical Techniques in Business and Economics (Boston: McGraw–Hill Higher Education, 2009) fourteenth edition [ISBN 978007110004] The first half of Chapter 5. 5 Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN 9780471615170] Sections 3.1–3.3. 5.5 Introduction Chance is what makes life worth living – if everything was known in advance, imagine the disappointment! 
If decision-makers had perfect information about the future as well as the present and the past, there would be no need to consider the concept of probability. However, it is usually the case that uncertainty cannot be eliminated and hence its presence should be recognised and used in the process of making decisions. 5.6 The concept of probability Probability forms the bedrock on which statistical methods are based. This chapter will be devoted to understanding this important concept, its properties and applications. Probability can be viewed as a quantifiable measure of one’s degree of belief in a particular event or set1 of interest. To motivate the use of the terms ‘event’ and ‘set’, we begin by introducing the concept of an experiment. An experiment can take many forms, but to prevent us from becoming over-excited, let us consider two basic examples: i. the toss of a (fair) coin ii. the roll of a (fair) die. Sample space The sample space, S, is the set of all possible outcomes of an experiment. 1 These terms can be used interchangeably. 78 5.6. The concept of probability Example 5.1 For our two examples we have the following. i. Coin toss: S = {H, T }, where H and T denote ‘heads’ and ‘tails’, respectively, and are called the elements or members of the sample space. ii. Die score: S = {1, 2, 3, 4, 5, 6}. So the coin toss sample space has two elementary outcomes, H and T , while the score on a die has six elementary outcomes. These individual elementary outcomes are themselves events, but we may wish to consider slightly more interesting events of interest.2 For example, for the die score, we may be interested in the event of obtaining an even score, or a score greater than 4 etc. Hence we proceed to define an event. Event 5 An event is a collection of elementary outcomes from the sample space S of an experiment, which is a subset of S. Typically, we denote events by letters for ease of reference. 
For example, A = ‘an even score’, and B = ‘a score greater than 4’. Hence A = {2, 4, 6} and B = {5, 6}, respectively. The universal convention is that we define probability to lie on a scale from 0 to 1, inclusive.3 Hence the probability of any event A, say, is denoted P (A) and is a real number somewhere on the unit interval, i.e. P (A) ∈ [0, 1], where ‘∈’ means ‘is a member of’. Note the following. If A is an impossible event, then P (A) = 0. If A is a certain event, then P (A) = 1. For events A and B, if P (A) > P (B), then A is more likely to occur than B. Therefore, we have a probability scale from 0 to 1 on which we are able to rank events, as evident from the P (A) > P (B) result above. However, we need to consider how best to quantify these probabilities. Let us begin with the experiments where each elementary outcome is equally likely, hence our (fair) coin toss and (fair) die score fulfil this condition (conveniently). Determining event probabilites for equally likely elementary outcomes For an experiment with N equally likely elementary outcomes, let n be the number of these elementary outcomes which are favourable to the event of interest, A. Therefore: P (A) = 2 3 n . N Admittedly, for a single coin toss, we cannot go beyond a choice of heads or tails. Multiplying by 100% expresses a probability as a percentage. 79 5. Probability theory Example 5.2 For our two examples we have the following. i. Coin toss: If A is the event ‘heads’, then N = 2 (H and T ), n = 1 (H), so for a fair 4 coin, P (A) = 1/2 = 0.5.5 ii. Die score: If A is the event ‘an even score’, then N = 6 (1, 2, 3, 4, 5 and 6), n = 3 (2, 4 and 6), so for a fair die, P (A) = 3/6 = 1/2 = 0.5. Similarly, if B is the event ‘a score greater than 4’, then N = 6 (as before), n = 2 (5 and 6), hence P (B) = 2/6 = 1/3. Activity 5.1 Calculate the probability that, when two fair dice are rolled, the sum of the upturned faces will be: 5 (a) an odd number (b) less than 9 (c) exactly 12 (d) exactly 4. 
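Under equally likely elementary outcomes, probability is just counting: n favourable outcomes out of N possible. For small sample spaces the counting in Example 5.2 can be brute-forced, as in this sketch (illustrative only, not examinable; the `prob` helper is our own):

```python
from fractions import Fraction

S = range(1, 7)                      # sample space for one fair die

def prob(event):
    """P(A) = n/N for equally likely elementary outcomes."""
    favourable = [s for s in S if event(s)]
    return Fraction(len(favourable), len(S))

p_even = prob(lambda s: s % 2 == 0)  # event A: an even score
p_gt4 = prob(lambda s: s > 4)        # event B: a score greater than 4
```

Working with exact fractions avoids rounding until probabilities are finally reported as decimals.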
Solution The following table shows all the possibilities [as: (Score on first die, Score on second die)]: (1, 1) (2, 1) (3, 1) (4, 1) (5, 1) (6, 1) (1, 2) (2, 2) (3, 2) (4, 2) (5, 2) (6, 2) (1, 3) (2, 3) (3, 3) (4, 3) (5, 3) (6, 3) (1, 4) (2, 4) (3, 4) (4, 4) (5, 4) (6, 4) (1, 5) (2, 5) (3, 5) (4, 5) (5, 5) (6, 5) (1, 6) (2, 6) (3, 6) (4, 6) (5, 6) (6, 6) The total number of possible outcomes, N , is 36. Note all outcomes are equally likely. (a) The number of favourable points = 18, so P (odd total) = 18/36 = 1/2 = 0.5. (b) The number of favourable points = 26, so P (less than 9) = 26/36 = 13/18 = 0.7222. (c) The number of favourable points = 1, so P (exactly 12) = 1/36 = 0.0278. (d) The number of favourable points = 3, so P (exactly 4) = 3/36 = 1/12 = 0.0833. 4 Remember we are assuming equally likely elementary outcomes here, so a fair coin is required. If we had a biased coin, then this approach would fail to accurately quantify probabilities. 5 Probabilities can be reported as proper fractions or in decimal form. If in decimal form (which is preferred), report answers to a maximum of four decimal places in the examination. 80 5.6. The concept of probability Activity 5.2 A simple gambling game is played as follows. The gambler throws three fair coins together with one fair die and wins £1 if the number of heads obtained in the throw of the coins is greater than or equal to the score on the die. Calculate the probability that the gambler wins. Solution The total number of equally likely outcomes is: N = 2 × 2 × 2 × 6 = 48. Here it is convenient to organise the calculations by numbers in descending order – in this case, the value on the die. Clearly, we are only interested in cases for which the highest die value is 3, 2 or 1. 5 Die shows 3: the only winning outcome is HHH ⇒ 1 case. Die shows 2: the winning outcomes are HHH, HHT, HT H, T HH ⇒ 4 cases. Die shows 1: the winning outcomes are HHH, HHT, HT H, T HH, HT T, T HT, T T H ⇒ 7 cases. 
We can simply add the numbers of cases, so there are n = 12 favourable outcomes. Hence the required probability is 12/48 = 1/4 = 0.25. Activity 5.3 Three fair dice are thrown and you are told that at least one of the upturned faces shows a 4. Using this fact, determine the (conditional) probabilities of: (a) exactly one 4 (b) exactly two 4s. Solution We first identify N , the total number of equally likely outcomes when at least one die shows a 4, i.e. we restrict our attention to the ‘smaller’ sample space, rather than the ‘full’ sample space of 63 = 216 outcomes. If the first die shows a 4, then the other two can be anything, so the ‘pattern’ is 4XY . There are six choices for each of X and Y , and these choices are independent of each other. Hence there are 6 × 6 = 36 outcomes in which the first die shows a 4. If the first die does not show a 4, then there are 11 possible combinations of values for the other two dice. To see why, one method is to list the possible pairs of values: left-hand number equal to 4 : left-hand number not equal to 4 : 46, 45, 44, 43, 42, 41 64, 54, 34, 24, 14 81 5. Probability theory where, of course, the outcome ‘44’ can only be shown in one of the two lists. This ‘listing’ approach works because the number of possibilities for the two dice is small enough such that we can list all of those in which we are interested (unlike with three dice). Another, slightly different, method is to calculate the numbers in each row (without listing them all). For the pattern 4X there are six choices of X, hence six cases, and for the pattern [not 4]4 there are five choices for [not 4] and hence five cases, giving eleven possibilities altogether. As mentioned above, anything except short lists should be avoided, so it is useful to know lots of alternative ways for working out the number of relevant outcomes. Combining these remarks, we have: N = 36 + (5 × 11) = 91. 5 We now identify the number of favourable outcomes, n, for two or more 4s. 
If the first die shows a 4, then there has to be at least one 4 on the other two, so there are 11 ways for the other two dice. If the first die does not show a 4, then there are 5 choices for what it might show, while both the other two dice must show 4s. Clearly, the two 4s can only happen in 1 way. So:

n = 11 + (5 × 1) = 16

and exactly one of these cases, (4, 4, 4), represents more than two 4s, so that fifteen represent exactly two 4s. Hence the required probabilities are as follows.

(a) We have:

P(exactly one 4) = (91 − 16)/91 = 75/91 = 0.8242.

(b) We have:

P(exactly two 4s) = 15/91 = 0.1648.

5.7 Relative frequency

So far we have only considered equally likely experimental outcomes. Clearly, to apply probabilistic concepts more widely, we require a more general interpretation – the relative frequency interpretation.

Relative frequency approach to probability
Suppose the event A associated with some experiment either does or does not occur. Also, suppose we conduct this experiment independently a total of F times. Suppose that, following these repeated experiments, A occurs f times. The ‘frequentist’ approach to probability would regard:

P(A) = f/F  as F → ∞.

Example 5.3 For a coin toss with event A = {H}, if the coin is fair we would expect that repeatedly tossing the coin F times would result in approximately f = F/2 heads, hence P(A) = (F/2)/F = 1/2. Of course, this approach is not confined to fair coins!

Intuitively, this is an appealing interpretation and is extremely useful when we come to its use in statistical inference later on. However, do be aware that we are not advocating that you perform all these experiments!
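A computer, however, will happily run these imaginary repetitions for us. The sketch below (illustrative only; names are our own) estimates P(A) for the event ‘an even die score’ by simulation: the relative frequency f/F settles near the exact value 0.5 as F grows.

```python
import random

def relative_frequency(event, experiment, trials):
    """Estimate P(A) as f/F: the proportion of F independent trials in which A occurs."""
    f = sum(1 for _ in range(trials) if event(experiment()))
    return f / trials

random.seed(42)  # fixed seed so the run is reproducible
roll = lambda: random.randint(1, 6)              # one fair die roll
estimate = relative_frequency(lambda s: s % 2 == 0, roll, 100_000)
# estimate lies close to 0.5; larger values of trials tighten the approximation
```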
Rather they are imaginary experiments, but the concept gives a meaning to numerical probability.6 5.8 ‘Randomness’ Statistical inference is concerned with the drawing of conclusions from data which are subject to randomness, perhaps due to the sampling procedure, perhaps due to observational errors, perhaps for some other reason. Let us stop and think why, when we repeat an experiment under apparently identical conditions, we get different results. The answer is that although the conditions may be as identical as we are able to control them to be, there will inevitably be a large number of uncontrollable (and frequently unknown) variables which we do not measure and which have a cumulative effect on the result of the sample or experiment. For example, weather conditions may affect the outcomes of biological or other ‘field’ experiments. The cumulative effect, therefore, is to cause variations in our results. It is this variation which we term randomness and, although we never fully know the true generating mechanism for our data, we can take the random component into account via the concept of probability, which is, of course, why probability plays such an important role in data analysis. 5.9 Properties of probability We begin this section by presenting three simple, self-evident truths known as axioms which list the basic properties we require of event probabilities. 6 There is another ‘school’ of probability thought, known as the ‘Bayesian’ school. We will touch on this only briefly via Bayes’ formula later in this chapter. In short, the Bayesian view is that probability is a degree of belief in an event’s occurrence based on the observer’s knowledge of the situation. 83 5 5. Probability theory Axioms of probability 1. For any event A, P (A) ≥ 0. 2. For the sample space S, P (S) = 1. 3. If {Ai }, for i = 1, . . . , n, are mutually exclusive events, then the probability of their ‘union’ is the sum of their respective probabilities, that is: ! 
P(A1 ∪ A2 ∪ · · · ∪ An) = P(A1) + P(A2) + · · · + P(An).

The first two axioms should not be surprising. The third may appear a little more daunting. Events are labelled mutually exclusive when they cannot both occur simultaneously.

Example 5.4 When rolling a die once, the event A = ‘get an even score’ and the event B = ‘get an odd score’ are mutually exclusive because they cannot occur simultaneously.

Extending this, a collection of events is pairwise mutually exclusive if no two events can occur simultaneously. For instance, the three events A, B and C are pairwise mutually exclusive if A and B cannot occur together and B and C cannot occur together and A and C cannot occur together. Another way of putting this is that a collection of events is pairwise mutually exclusive if at most one of them can occur.

Related to this is the concept of a collection of events being collectively exhaustive. This means at least one of them must happen, i.e. all possible experimental outcomes are included among the collection of events.

5.9.1 Notation

Axiom 3 above introduced a new symbol. For the remainder of this chapter, various symbols connecting sets will be used as a form of notational shorthand. It is important to be familiar with these symbols, hence two sets of descriptors are provided – one basic and the other more technical.

Symbol   Basic descriptor   Technical descriptor   Example
∪        or                 union                  A ∪ B = ‘A union B’
∩        and                intersect              A ∩ B = ‘A intersect B’
c        not                complement of          Ac = ‘complement of A’
|        given              conditional on         A | B = ‘A conditional on B’

Also, do make sure that you distinguish between a set and the probability of a set. This distinction is important. A set, remember, is a collection of elementary outcomes from S, whereas a probability (from Axiom 1) is a number on the unit interval, [0, 1]. For example, A = ‘an even die score’, while P(A) = 0.5, for a fair die.
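The distinction between a set and its probability can be explored directly in code. A minimal sketch in Python, using the fair-die example from this section (the function name `P` is my own shorthand for ‘probability under equally likely outcomes’):

```python
# Sketch of the set/probability distinction, using one roll of a fair die.
S = {1, 2, 3, 4, 5, 6}            # sample space
A = {2, 4, 6}                     # the event 'an even die score' -- a set

def P(event):
    # probability of an event when all outcomes in S are equally likely
    return len(event) / len(S)

A_complement = S - A              # 'not A', i.e. the complement of A within S
print(P(A))                       # the probability of the set A -> 0.5
print(A | A_complement == S)      # A union A^c recovers S -> True
print(A & A_complement)           # A intersect A^c is empty -> set()
```

Note how `A` is a collection of outcomes, while `P(A)` is a number in [0, 1], exactly as the text emphasises.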
Figure 5.1: Venn diagram for pre-defined sets A, B and C recording the total score when a die is rolled twice.

5.9.2 Venn diagrams

The previous coin and die examples have been rather simple (for illustrative purposes). Of course, it is highly likely (in fact with a probability of 1!) that you will encounter more challenging sets and sample spaces. Fear not, there is a helpful geometric technique which can often be used – we represent the sample space elements using a Venn diagram.

Imagine we roll a die twice and record the total score. The sample space will be:

S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.

Suppose we are interested in the following three events:

A = an even total, that is A = {2, 4, 6, 8, 10, 12}
B = a total strictly less than 8, that is B = {2, 3, 4, 5, 6, 7}
C = a total greater than 4, but less than 10, that is C = {5, 6, 7, 8, 9}.

Having defined these events, it is possible to place every element in the sample space S into a Venn diagram, as shown in Figure 5.1. The box represents S, so every possible outcome of the experiment (the total score when a die is rolled twice) appears within the box. Three (overlapping) circles are drawn representing the events A, B and C. Each element of S is then placed into the appropriate area. For example, the area where the three circles all intersect represents the event A ∩ B ∩ C into which we place the element ‘6’, since this is the only member of S which satisfies all three events A, B and C.

Example 5.5 Using Figure 5.1, we can determine the following sets:

A ∩ B = {2, 4, 6}           A ∩ C = {6, 8}
A ∩ B ∩ C = {6}             (A ∪ B ∪ C)c = {11}
A ∩ B ∩ Cc = {2, 4}         Ac ∩ B = {3, 5, 7}
(A ∪ C)c ∩ B = {3}          A | C = {6, 8}.

5.9.3 The additive law

We now introduce our first probability ‘law’ – the additive law.

The additive law

Let A and B be any two events. The additive law states that:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
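The additive law can be verified by brute-force enumeration. A short Python sketch, using the two-dice total and the events A (even total) and B (total strictly less than 8) from the Venn diagram example (the helper `P` assumes equally likely (die 1, die 2) pairs):

```python
from itertools import product

S = set(product(range(1, 7), repeat=2))     # 36 equally likely (die1, die2) pairs

def P(event):
    return len(event) / len(S)

A = {o for o in S if sum(o) % 2 == 0}       # even total
B = {o for o in S if sum(o) < 8}            # total strictly less than 8

# additive law: P(A or B) = P(A) + P(B) - P(A and B)
lhs = P(A | B)
rhs = P(A) + P(B) - P(A & B)
print(abs(lhs - rhs) < 1e-12)               # -> True
```

Note the enumeration is over the 36 ordered pairs, not the 11 possible totals, because the totals themselves are not equally likely.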
So P(A ∪ B) is the probability that at least one of A and B occurs, and P(A ∩ B) is the probability that both A and B occur.

Example 5.6 We can think about this using a Venn diagram. The total area of the Venn diagram in Figure 5.2 is assumed to be 1, so area represents probability. Event A is composed of all points in the left-hand circle, and event B is composed of all points in the right-hand circle. Hence:

P(A) = area x + area z
P(B) = area y + area z
P(A ∩ B) = area z
P(A ∪ B) = area x + area y + area z.

So:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
         = (area x + area z) + (area y + area z) − (area z)
         = area x + area y + area z.

Therefore, to compute P(A ∪ B) we need to subtract P(A ∩ B) otherwise that region would have been counted twice.

Figure 5.2: Venn diagram illustrating the additive law, with the regions of A only, B only and A ∩ B labelled x, y and z, respectively.

Example 5.7 Consider an industrial situation in which a machine component can be defective in two ways such that:

P(defective in first way) = P(D1) = 0.01
P(defective in second way) = P(D2) = 0.05
P(defective in both ways) = P(D1 ∩ D2) = 0.001.

It follows that the probability that the component is defective is:

P(D1 ∪ D2) = P(D1) + P(D2) − P(D1 ∩ D2) = 0.01 + 0.05 − 0.001 = 0.059.

Figure 5.3: Venn diagram illustrating two mutually exclusive events.

Additive law – special case 1

If A and B are mutually exclusive events, i.e. they cannot occur simultaneously, then P(A ∩ B) = 0. Hence:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = P(A) + P(B) − 0 = P(A) + P(B).

Such events can be depicted by two non-overlapping sets in a Venn diagram as shown in Figure 5.3. Now revisit Axiom 3, to see this result generalised for n mutually exclusive events.

Additive law – special case 2

The probability of an event A not occurring, i.e. the complement, Ac, is:

P(Ac) = 1 − P(A).

Activity 5.4 Use a Venn diagram, or otherwise, to ‘prove’ the following.

(a) P(Ac) = 1 − P(A).
(b) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

(c) If A implies B, then P(A) ≤ P(B).

(d) P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

Solution

These results can be seen very easily from a Venn diagram (although a mathematician would not accept such a graphical approach as a ‘proof’, but we will not worry about that here). The following is the Venn diagram for events A, B and C. The lower case letters, that is a, b, . . . , g, stand for the areas (which represent the probabilities) of the events for which the region surrounding each such letter stands.

[Venn diagram of three overlapping circles A, B and C, with the seven regions labelled a, b, c, d, e, f and g.]

(a) With the same notation from the diagram, the area for Ac is everything outside the area for A, so P(Ac) = 1 − P(A).

(b) With the notation from the diagram:

P(A ∪ B) = a + e + f + b + (d + g)
         = (a + d + g + e) + (d + b + f + g) − (d + g)
         = P(A) + P(B) − P(A ∩ B).

(c) If A implies B, then this means that the region for A is completely inside the region for B, i.e. that the intersection of A and B is the whole of A. Hence the area of the shape for A is no larger than the area of the shape for B, i.e. P(A) ≤ P(B).

(d) It is clear from the diagram that:

area(A ∪ B ∪ C) = a + b + c + d + e + f + g
P(A) = a + d + e + g        P(A ∩ B) = d + g
P(B) = b + d + f + g        P(A ∩ C) = e + g
P(C) = c + e + f + g        P(B ∩ C) = f + g
P(A ∩ B ∩ C) = g.

So P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C) equals:

(a + d + e + g) + (b + d + f + g) + (c + e + f + g) − ((d + g) + (e + g) + (f + g)) + g
= a + b + c + d + e + f + g
= area(A ∪ B ∪ C)
= P(A ∪ B ∪ C).

Activity 5.5 In a class of 35 students, 20 have taken Economics, 15 have taken History, and 10 have taken Politics. Of these, 9 have taken both Economics and History, 5 have taken both Economics and Politics, and 6 have taken both History and Politics. 3 students have taken all three subjects.
What is the probability that a student chosen at random from the class has taken at least one of these subjects?

Solution

This is an application of the previous question with:

A = Economics, B = History and C = Politics.

So:

P(at least one subject) = (20 + 15 + 10 − 9 − 5 − 6 + 3)/35 = 28/35 = 0.8.

Activity 5.6 Hard! How many people do you need at a party so that the probability of at least two people having the same birthday is greater than 0.5? (Ignore leap years!)

Solution

We have:

P(1 & 2 have different birthdays) = 364/365
P(1 & 2 have the same birthday) = 1 − 364/365
P(1, 2 & 3 have different birthdays) = (364/365) × (363/365)
P(at least two of 1, 2 & 3 have the same birthday) = 1 − (364/365) × (363/365)
...
P(first n people have different birthdays) = (364/365) × (363/365) × · · · × ((366 − n)/365)
P(at least two of the first n have the same birthday) = 1 − (364/365) × (363/365) × · · · × ((366 − n)/365).

We want the smallest n such that:

1 − (364/365) × (363/365) × · · · × ((366 − n)/365) > 0.5.

The relevant value is n = 23, so that if there are 23 people at a party then there is a 50% chance (in fact, a little more) that at least two of them have the same birthday!

[Figure: the probability of at least two people sharing a birthday, plotted against the number of people (0 to 80); the curve rises steeply with n and crosses 0.5 at n = 23.]

5.9.4 The multiplicative law

This rule is concerned with the probability of two events occurring at the same time – specifically when the two events have the special property of independence. An informal definition of independence is that two events are said to be independent if one event has no influence on the other.

The multiplicative law (for independent events)

Formally, events A and B are independent if the probability of their intersect is the product of their individual probabilities, that is:

P(A ∩ B) = P(A) P(B).

Example 5.8 Consider rolling two fair dice.
The score on one die has no influence on the score on the other die. Hence the respective scores are independent events, and so:

P(two sixes) = 1/6 × 1/6 = 1/36.

Note the multiplicative (or product) law does not hold for dependent events, which is the subject of conditional probability, discussed shortly. Also, take a moment to ensure that you are comfortable with the terms ‘mutually exclusive’ and ‘independent’. These are not the same thing, so do not get these terms confused!

Activity 5.7 Two fair dice are rolled.

(a) What is the probability of two sixes?

(b) What is the probability of at least one six?

Solution

(a) By independence:

P(two sixes) = 1/6 × 1/6 = 1/36.

(b) We have:

P(at least one 6) = P(6 on first die and/or 6 on second die)
= P(6 on first die) + P(6 on second die) − P(6 on both dice)
= 1/6 + 1/6 − 1/36
= 11/36
= 0.3056.

Activity 5.8 A student has an important job interview in the morning. To ensure he wakes up in time, he sets two alarm clocks which ring with probabilities 0.97 and 0.99, respectively. What is the probability that at least one of the alarm clocks will wake him up?

Solution

We are told that the alarm clocks are independent. Therefore:

P(at least one rings) = 1 − P(neither rings) = 1 − (0.03 × 0.01) = 0.9997.

Activity 5.9 A fair coin is tossed 10 times.

(a) What is the probability that the 10th toss results in a head?

(b) If the first 9 tosses give 4 heads and 5 tails, what is the probability that the 10th toss results in a head?

(c) If the first 9 tosses give 9 heads and no tails, what is the probability that the 10th toss results in a head?

Solution

The answer is 0.5 in each case! The coin has no ‘memory’. This is because the successive throws of a fair coin are independent.

Activity 5.10 The successful operation of three separate switches is needed to control a machine.
If the probability of failure of each switch is 0.1 and the failure of any switch is independent of any other switch, what is the probability that the machine will break down?

Solution

The probability that the machine will not break down equals the probability that all three switches work. This equals 0.9 × 0.9 × 0.9 = 0.729, by independence. Hence the probability it will break down is 1 − 0.729 = 0.271.

Activity 5.11 Of all the candles produced by a company, 0.01% do not have wicks (the core piece of string). A retailer buys 10,000 candles from the company.

(a) What is the probability that all the candles have wicks?

(b) What is the probability that at least one candle will not have a wick?

Solution

Let X be the number of candles without a wick.

(a) We have: P(X = 0) = (0.9999)^10000 = 0.3679.

(b) We have: P(X ≥ 1) = 1 − P(X = 0) = 1 − 0.3679 = 0.6321.

Activity 5.12 A family has four children.

(a) What is the probability of at least one boy among the children?

(b) What is the probability that all the children are of the same sex?

Solution

For our sample space we can take (in order of age):

GGBB GGGB GBGB BBBG
GGGG GGBG GBGG BGGB
GBBG BBGB BGBB BBBB
BGGG BGBG GBBB BBGG

So we have N = 2^4 = 16 possible outcomes.

(a) For ‘at least one boy’ the number of favourable points is 15 (all excluding GGGG), so:

P(at least one boy) = 15/16 = 0.9375.

An alternative (and easier) method is to note that:

P(all girls) = 1/2 × 1/2 × 1/2 × 1/2 = 1/16

so:

P(at least one boy) = 1 − P(all girls) = 1 − 1/16 = 15/16 = 0.9375.

(b) For ‘all of the same sex’ the number of favourable points is 2, so:

P(all of the same sex) = 2/16 = 1/8 = 0.125.

Activity 5.13 Calculate the probabilities of the following events.

(a) Five independent tosses of a fair coin yield the sequence HHTTH.

(b) Five independent tosses of a fair coin yield Hs on the first and last tosses.

(c) A pair of fair dice is thrown and the faces are equal.
(d) Five cards are drawn at random from a deck of cards and all are the same suit.

(e) Four independent tosses of a fair coin result in at least two heads. What is the probability that all four tosses are heads?

Solution

(a) By independence, we have (0.5)^5 = 0.03125.

(b) By independence (noting we only care about the first and last outcomes), we have (0.5)^2 = 0.25.

(c) Of the 36 equally likely outcomes, 6 have equal faces, hence the probability is 6/36 = 1/6 = 0.1667.

(d) Noting this is sampling without replacement, and that the first card drawn can be any card, the probability is:

1 × 12/51 × 11/50 × 10/49 × 9/48 = 0.00198.

(e) The conditional sample space (at least two heads) has 11 points, all equally likely, and one is the event of interest, so the probability is 1/11 = 0.0909.

Activity 5.14 A company is concerned about interruptions to email. It was noticed that problems occurred on 15% of workdays. To see how bad the situation is, calculate the probabilities of an interruption during a five-day working week:

(a) on Monday and again on Tuesday

(b) for the first time on Thursday

(c) every day

(d) at least once during the week.

Solution

(a) This is the probability of the event occurring on Monday and on Tuesday. Each has probability 0.15 and the outcome is independent of the day of the week, so we calculate:

(0.15)^2 = 0.0225.

(b) We require the probability of the event not occurring on the first three days of the week, but occurring on Thursday. So the result will be:

(1 − 0.15)^3 × 0.15 = 0.092.

(c) Similar to (a), but for each weekday. We need to find:

(0.15)^5 = 0.000076.

(d) The probability of no occurrence in the week will be that of no occurrence on any of the five days, and this is:

(0.85)^5 = 0.444.

Therefore the probability of at least one occurrence will be:

1 − 0.444 = 0.556.
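Once independence across days is assumed, the four email-interruption answers are one-line computations. A quick Python check (variable names are my own; note that (0.15)^2 = 0.0225):

```python
p = 0.15                              # P(interruption on any given workday)
q = 1 - p                             # P(no interruption on a given day)

monday_and_tuesday = p * p            # (a) interruption on Monday and Tuesday
first_time_thursday = q**3 * p        # (b) Mon-Wed clear, then Thursday
every_day = p**5                      # (c) interruption on all five days
at_least_once = 1 - q**5              # (d) complement of 'never in the week'

print(round(monday_and_tuesday, 4))   # 0.0225
print(round(first_time_thursday, 3))  # 0.092
print(round(at_least_once, 3))        # 0.556
```

Part (d) illustrates the common trick of computing ‘at least one’ via the complement ‘none’.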
5.10 Conditional probability and Bayes’ formula

We have just introduced the concept of independent events – one event has no influence on another. Clearly, there are going to be many situations where independence does not in fact hold, i.e. the occurrence of one event has a ‘knock-on’ effect on the probability of another event occurring.

Example 5.9 For a single roll of a fair die, let A be ‘roll a 6’, and B be ‘an even number’. The following probabilities are obvious:

P(A) = 1/6,  P(B) = 1/2,  P(A | B) = 1/3.

So, we see that the probability of a 6 changes from 1/6 to 1/3 once we are given the information that ‘an even number’ has occurred. Similarly, the probability of an even number changes from 1/2 to 1, conditional on a 6 occurring, i.e. P(B | A) = 1.

Example 5.10 In order to understand and develop formulae for conditional probability, consider the following simple example, representing the classification by gender and subject (where A, B, C and D are defined below) of 144 college students.

Subject          Female   Male   Total
A: Mathematics        4     14      18
B: Economics         17     41      58
C: Science            4     25      29
D: Arts              28     11      39
Total                53     91     144

Let F = ‘Female’ and M = ‘Male’, then P(A) = 18/144, P(F) = 53/144 and P(A ∩ F) = 4/144. Note that P(A ∩ F) ≠ P(A) P(F), hence A and F are not independent events. From the table we have the following probabilities:

P(A | F) = 4/53 ≠ P(A)
P(F | A) = 4/18 ≠ P(F).

The correct relationship of these conditional probabilities to the original unconditional probabilities is:

P(A | F) = P(A ∩ F)/P(F) = (4/144)/(53/144) = 4/53.

Similarly, noting that P(A ∩ F) = P(F ∩ A), we have:

P(F | A) = P(A ∩ F)/P(A) = (4/144)/(18/144) = 4/18.

Activity 5.15 Check that some of the other conditional probabilities obtained from the table in Example 5.10 satisfy this formula, such as P(C | F) = 4/53.
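Conditional probabilities from a two-way table like the one in Example 5.10 reduce to ratios of counts. A sketch in Python (the dictionary layout, keyed by (subject, gender), is my own):

```python
# Counts from the subject-by-gender table (F = female, M = male).
counts = {('A', 'F'): 4,  ('A', 'M'): 14,
          ('B', 'F'): 17, ('B', 'M'): 41,
          ('C', 'F'): 4,  ('C', 'M'): 25,
          ('D', 'F'): 28, ('D', 'M'): 11}

N = sum(counts.values())                                          # 144 students
P_F = sum(v for (subj, g), v in counts.items() if g == 'F') / N   # 53/144

# P(A | F) = P(A and F) / P(F) -- which collapses to a ratio of counts, 4/53
P_A_given_F = (counts[('A', 'F')] / N) / P_F
print(N, round(P_A_given_F, 4))
```

The same pattern verifies the activity’s P(C | F) = 4/53, since the Science/Female and Mathematics/Female cells both contain 4 students.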
Note also that another important relationship involving conditional probability is the ‘total probability formula’ (discussed in greater depth shortly). This expresses an unconditional probability in terms of other, conditional probabilities.

Example 5.11 Continuing with Example 5.10, we have:

P(A) = 18/144 = (4/53) × (53/144) + (14/91) × (91/144) = P(A | F) P(F) + P(A | M) P(M).

Activity 5.16 Two fair coins, a 10p and a 50p, are tossed.

(a) Find the probability that:
i. both coins show heads
ii. different faces show up
iii. at least one head shows.

(b) You are told that the 10p shows heads. What is the probability that both coins show heads?

(c) You are told that at least one of the two coins shows heads. What is the probability that both coins show heads?

Solution

Let the sample space be S = {HH, HT, TH, TT} where, for example, ‘HT’ means ‘heads on the 10p coin and tails on the 50p coin’. Each element is equally likely, so each has a probability equal to 0.25.

(a) i. We have P(HH) = 0.25.
ii. We have P(different faces) = P(HT or TH) = P(HT) + P(TH) = 0.5.
iii. We have P(at least 1 head) = P(HT or TH or HH) = P(HT) + P(TH) + P(HH) = 0.75.

(b) To find P(both show heads given that the 10p shows heads) we define:

A = both show heads = {HH} and B = 10p shows heads = {HT, HH}

and use conditional probability, i.e. the result that P(A | B) = P(A ∩ B)/P(B). To do this we note that in this case the event A ∩ B is the same as the event A (which is important to spot, when it happens). Hence:

P(HH | 10p shows heads) = P(A ∩ B)/P(B) = P(A)/P(B) = P(HH)/P(HT or HH) = 0.25/0.5 = 0.5.

(c) Here we want to find P(A | C) = P(A ∩ C)/P(C) where:

C = at least one head = {HH, HT, TH}.

Hence P(HH | at least one head) equals:

P(A ∩ C)/P(C) = P(A)/P(C) = P(HH)/P(HH or HT or TH) = 0.25/0.75 = 0.33.

Activity 5.17 A fair coin is thrown six times in succession.
(a) Find the probability that exactly three throws come up heads.

(b) Assuming that there are exactly three heads, what is the (conditional) probability that they are on consecutive throws?

Solution

There are two ‘fair’ choices for the outcome of each throw, hence there are two equally likely outcomes (H and T, using obvious notation) of each. Moreover, we can assume that one throw’s outcome does not affect the others. That is, the outcomes of the different throws are independent, which (always!) means we multiply the probabilities. Combining these two thoughts, there are:

N = 2 × 2 × 2 × 2 × 2 × 2 = 2^6 = 64

items (‘outcomes’) in the full, six-throw sample space, each with probability:

1/2 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = (1/2)^6 = 1/64.

(a) It is (just about) possible to list accurately the 20 cases in which there are exactly three heads (and hence exactly three tails), identifying the four cases (first group) in which the three Hs are adjacent.

3 Hs together:        HHHTTT  THHHTT  TTHHHT  TTTHHH
2 consecutive Hs:     HHTHTT  HHTTHT  HHTTTH  HTHHTT  HTTHHT  HTTTHH
                      THHTHT  THHTTH  THTHHT  THTTHH  TTHHTH  TTHTHH
All Hs separate:      HTHTHT  HTHTTH  HTTHTH  THTHTH

To make sure that you count everything, but just once, it is essential to organise the listed items according to some pattern. Hence:

P(exactly 3 Hs) = n/N = 20/64 = 5/16 = 0.3125.

(b) If there are exactly three heads on consecutive throws, then it is possible to specify which of these different outcomes we are dealing with by identifying how many (if any) Ts there are before the sequence of Hs. (The remaining Ts must be after the sequence of Hs.) The possible values (out of 3) are 0, 1, 2 and 3, so there are four cases. Hence:

P(3 consecutive Hs | exactly 3 Hs) = n/N = 4/20 = 1/5 = 0.2.

Activity 5.18 Let K be the event of drawing a ‘king’ from a well-shuffled deck of playing cards. Let D be the event of drawing a ‘diamond’ from the pack.
Determine:

(a) P(K)
(b) P(D)
(c) P(Kc)
(d) P(K ∩ D)
(e) P(K ∪ D)
(f) P(K | D)
(g) P(D | K)
(h) P(D ∪ Kc)
(i) P(Dc ∩ K)
(j) P((Dc ∩ K) | (D ∪ K)).

Are the events D and K independent, mutually exclusive, neither or both?

Solution

(a) We have: P(K) = 4/52 = 1/13.
(b) We have: P(D) = 13/52 = 1/4.
(c) We have: P(Kc) = 1 − 1/13 = 12/13.
(d) We have: P(K ∩ D) = 1/52.
(e) We have: P(K ∪ D) = 16/52 = 4/13.
(f) We have: P(K | D) = 1/13.
(g) We have: P(D | K) = 1/4.
(h) We have: P(D ∪ Kc) = 49/52.
(i) We have: P(Dc ∩ K) = 3/52.
(j) We have: P(Dc ∩ K | D ∪ K) = 3/16.

The events are independent since P(K ∩ D) = P(K) P(D), but they are not mutually exclusive (consider the King of Diamonds).

Activity 5.19 At a local school, 90% of the students took test A, and 15% of the students took both test A and test B. Based on the information provided, which of the following calculations are not possible, and why? What can you say based on the data?

(a) P(B | A).
(b) P(A | B).
(c) P(A ∪ B).

If you knew that everyone who took test B also took test A, how would that change your answers?

Solution

(a) Possible. P(B | A) = P(A ∩ B)/P(A) = 0.15/0.90 = 0.167.

(b) Not possible because P(B) is unknown and we would need it to calculate P(A | B) = P(A ∩ B)/P(B) = 0.15/P(B).

(c) Not possible because P(A ∪ B) = P(A) + P(B) − P(A ∩ B) and, as discussed in (b), we do not know P(B).

If we knew that everyone who took test B also took test A, then P(A | B) = 1 and hence A ∪ B = A. Therefore P(A ∪ B) = P(A) = 0.90.

Activity 5.20 Given two events, A and B, state why each of the following is not possible. Use formulae or equations to illustrate your answer.

(a) P(A) = −0.46.
(b) P(A) = 0.26 and P(Ac) = 0.62.
(c) P(A ∩ B) = 0.92 and P(A ∪ B) = 0.42.
(d) P(A ∩ B) = P(A) P(B) and P(B) > P(B | A).

Solution

(a) Not possible because a probability cannot be negative since 0 ≤ P(A) ≤ 1.
(b) Not possible because P(A) + P(Ac) = 1. Here 0.26 + 0.62 = 0.88 ≠ 1.

(c) Not possible because A ∩ B is a subset of A ∪ B, hence we have P(A ∩ B) ≤ P(A ∪ B). Here 0.92 > 0.42.

(d) Not possible because two events cannot both be independent (as implied by the condition P(A ∩ B) = P(A) P(B)) and dependent (as implied by the condition P(B) > P(B | A)) at the same time.

5.10.1 Bayes’ formula

We now formally define conditional probabilities.

Conditional probability

For any two events A and B, we define conditional probabilities as:

P(A | B) = P(A ∩ B)/P(B) and P(B | A) = P(A ∩ B)/P(A).  (5.1)

In words: ‘the probability of one event, given a second event, is equal to the probability of both, divided by the probability of the second (conditioning) event.’

This is the simplest form of Bayes’ formula, and this can be expressed in other ways. Rearranging (5.1), we obtain:

P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)

from which we can derive Bayes’ formula.

Bayes’ formula

The simplest form of Bayes’ formula is:

P(A | B) = P(B | A) P(A)/P(B).

Activity 5.21 Events A, B and C have the following probabilities:

P(A) = 0.4,  P(B) = 0.5,  P(C) = 0.1
P(A | B) = 0.2,  P(B | C) = 0,  P(C | A) = 0.25.

(a) Are the following statements ‘True’ or ‘False’?
i. C implies A.
ii. B and C are mutually exclusive.
iii. A ∪ B ∪ C = S.

(b) What is P(B | A)?

(c) What is P(B | Ac)?

Solution

(a) i. The necessary condition for C implies A is that P(A | C) = 1. Using Bayes’ formula, we see that:

P(A | C) = P(C | A) P(A)/P(C) = (0.25 × 0.4)/0.1 = 1

so the statement is true.

ii. P(B | C) = 0, and using Bayes’ formula it follows that P(B ∩ C) = 0, so we see that (ignoring continuous distributions) the two events are mutually exclusive, so the statement is true.

iii. Since C implies A, it follows that:

P(A ∪ B ∪ C) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.4 + 0.5 − P(A | B) P(B) = 0.4 + 0.5 − 0.2 × 0.5 = 0.8.
Since P(S) = 1, it follows that A ∪ B ∪ C ≠ S, so the statement is false.

(b) Applying Bayes’ formula, we have:

P(B | A) = P(B ∩ A)/P(A) = P(A | B) P(B)/P(A) = (0.2 × 0.5)/0.4 = 0.25.

(c) Applying Bayes’ formula, we have:

P(B | Ac) = P(Ac | B) P(B)/P(Ac) = (0.8 × 0.5)/0.6 = 2/3 = 0.6667.

5.10.2 Total probability formula

The simplest case of the total probability formula involves calculating the probability of an event A from information about its two conditional probabilities with respect to some other event B and its complement, Bc, together with knowledge of P(B). Note that:

B and Bc are mutually exclusive
B and Bc are collectively exhaustive.

Fulfilment of these criteria (being mutually exclusive and collectively exhaustive) allows us to view B and Bc as a partition of the sample space.

The (simplest form of the) total probability formula

The total probability formula is:

P(A) = P(A | B) P(B) + P(A | Bc) P(Bc).

In words: ‘the probability of an event is equal to its conditional probability on a second event times the probability of the second event, plus its probability conditional on the second event not occurring times the probability of that non-occurrence.’

There is a more general form of the total probability formula. Let B1, B2, . . . , Bn partition the sample space S into n pairwise mutually exclusive (at most one of them can occur) and collectively exhaustive (at least one of them must occur) events. For example, for n = 4, see Figure 5.4.

Figure 5.4: An example of a partitioned sample space.

Figure 5.5 superimposes an event A. Extending the simple form of the total probability formula, we obtain:

P(A) = P(A | B1) P(B1) + P(A | B2) P(B2) + · · · + P(A | Bn) P(Bn).

Figure 5.5: The event A within a partitioned sample space.
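The total probability formula and Bayes’ formula chain together naturally in code. A Python sketch with a two-event partition (the numerical values here are made up purely for illustration):

```python
# A partition B1, B2 of the sample space, and some event A.
priors = [0.6, 0.4]            # P(B1), P(B2) -- hypothetical values
likelihoods = [0.05, 0.03]     # P(A | B1), P(A | B2) -- hypothetical values

# Total probability formula: P(A) = sum over i of P(A | Bi) P(Bi).
P_A = sum(l * p for l, p in zip(likelihoods, priors))

# Bayes' formula for each Bi: P(Bi | A) = P(A | Bi) P(Bi) / P(A).
posteriors = [l * p / P_A for l, p in zip(likelihoods, priors)]

print(round(P_A, 3))                 # 0.042
print(round(sum(posteriors), 6))     # posteriors over a partition sum to 1 -> 1.0
```

The denominator of Bayes’ formula is exactly the total probability of A, which is why the posterior probabilities always sum to 1 over the partition.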
Recall that P(B | A) = P(A ∩ B)/P(A) = P(A | B) P(B)/P(A), so assuming we have the partition B and Bc, then:

P(B | A) = P(A | B) P(B) / (P(A | B) P(B) + P(A | Bc) P(Bc)).

A more general partition gives us a more complete form of Bayes’ formula.

General form of Bayes’ formula

For a general partition of the sample space S into B1, B2, . . . , Bn, and for some event A, then:

P(Bk | A) = P(A | Bk) P(Bk) / (P(A | B1) P(B1) + P(A | B2) P(B2) + · · · + P(A | Bn) P(Bn)).

Activity 5.22 In an audit Bill analyses 60% of the audit items and George analyses 40%. Bill’s error rate is 5% and George’s error rate is 3%. Suppose an item is sampled at random.

(a) What is the probability that it is in error (i.e. audited incorrectly)?

(b) If the chosen item is incorrect what is the probability that Bill is to blame?

Solution

Let B = Bill audits item, G = George audits item, and E = incorrect audit. Hence P(B) = 0.6, P(G) = 0.4, P(E | B) = 0.05 and P(E | G) = 0.03.

(a) Using the total probability formula:

P(E) = P(E | B) P(B) + P(E | G) P(G) = (0.05 × 0.6) + (0.03 × 0.4) = 0.042.

(b) Using the conditional probability formula:

P(B | E) = P(E | B) P(B)/P(E) = (0.05 × 0.6)/0.042 = 0.71.

Activity 5.23 Two fair coins are tossed. You are told that ‘at least one is a head’. What is the probability that both are heads?

Solution

At first sight it may appear that this should be equal to 1/2, but this is wrong! The correct solution may be achieved in several ways; here we shall use a conditional probability approach. Let HH denote the event ‘both coins show heads’, and let A denote the event ‘at least one coin shows a head’. Then:

P(HH | A) = P(A ∩ HH)/P(A) = P(HH)/P(A) = (1/4)/(3/4) = 1/3.

Note that A ∩ HH = HH, i.e. ‘at least one head’ and ‘both show heads’ only corresponds to ‘both show heads’, HH.

Activity 5.24 The probability of a horse winning a race is 0.3 if it is dry and 0.5 if it is wet. The weather forecast gives the chance of rain as 40%.
(a) Find the probability that the horse wins.

(b) If you are told that the horse lost the race, what is the probability that the weather was dry on the day of the race? (Assuming you cannot remember!)

Solution

Let A = ‘horse wins’, Ac = ‘horse loses’, D = ‘dry’ and Dc = ‘wet’. We have:

P(A | D) = 0.3, P(A | Dc) = 0.5, P(D) = 0.6 and P(Dc) = 0.4.

(a) We have:

P(A) = P(A | D) P(D) + P(A | Dc) P(Dc) = 0.3 × 0.6 + 0.5 × 0.4 = 0.38.

Hence the horse has a probability of 0.38 of winning the race.

(b) From (a), we can determine that there is a 62% chance that the horse loses the race, i.e. P(Ac) = 0.62. Hence:

P(D | Ac) = P(Ac | D) P(D)/P(Ac) = (0.7 × 0.6)/0.62 = 0.6774.

That is, the probability it is dry, given that the horse loses, is 0.6774, or about 68%.

Activity 5.25 A restaurant manager classifies customers as well-dressed, casually-dressed or poorly-dressed and finds that 50%, 40% and 10%, respectively, fall into these categories. The manager found that wine was ordered by 70% of the well-dressed, by 50% of the casually-dressed and by 30% of the poorly-dressed.

(a) What is the probability that a randomly chosen customer orders wine?

(b) If wine is ordered, what is the probability that the person ordering is well-dressed?

(c) If wine is not ordered, what is the probability that the person ordering is poorly-dressed?

Solution

The following notation is used:

• W = well-dressed
• C = casually-dressed
• P = poorly-dressed
• O = wine ordered.

(a) We have:

P(O) = P(O | W) P(W) + P(O | C) P(C) + P(O | P) P(P) = (0.7 × 0.5) + (0.5 × 0.4) + (0.3 × 0.1) = 0.58.

(b) Using Bayes’ theorem:

P(W | O) = P(O | W) P(W)/P(O) = (0.7 × 0.5)/0.58 ≈ 0.60.

(c) We require:

P(P | Oc) = P(Oc | P) P(P)/P(Oc) = (0.7 × 0.1)/0.42 ≈ 0.17.

Activity 5.26 In a large lecture, 60% of students are female and 40% are male.
Records show that 15% of female students and 20% of male students are registered as part-time students.

(a) If a student is chosen at random from the lecture, what is the probability that the student studies part-time?

(b) If a randomly chosen student studies part-time, what is the probability that the student is male?

Solution

Let PT = ‘part-time’, F = ‘female’ and M = ‘male’.

(a) We have:

P(PT) = P(PT | F) P(F) + P(PT | M) P(M) = (0.15 × 0.6) + (0.2 × 0.4) = 0.17.

(b) We have:

P(M | PT) = P(PT | M) P(M)/P(PT) = (0.2 × 0.4)/0.17 = 0.4706.

Activity 5.27 20% of men show early signs of losing their hair. 2% of men carry a gene that is related to hair loss. 80% of men who carry the gene experience early hair loss.

(a) What is the probability that a man carries the gene and experiences early hair loss?

(b) What is the probability that a man carries the gene, given that he experiences early hair loss?

Solution

Using obvious notation, P(H) = 0.2, P(G) = 0.02 and P(H | G) = 0.8.

(a) We require:

P(G ∩ H) = P(G) P(H | G) = 0.02 × 0.8 = 0.016.

(b) We require:

P(G | H) = P(G ∩ H)/P(H) = 0.016/0.2 = 0.08.

Activity 5.28 James is a salesman for a company and sells two products, A and B. He visits three different customers each day. For each customer, the probability that James sells product A is 1/3 and the probability is 1/4 that he sells product B. The sale of product A is independent of the sale of product B during any visit, and the results of the three visits are mutually independent. Calculate the probability that James will:

(a) sell both products, A and B, on the first visit

(b) sell only one product during the first visit

(c) make no sales of product A during the day

(d) make at least one sale of product B during the day.

Solution

Let A = ‘product A sold’ and B = ‘product B sold’.

(a) We have:

P(A ∩ B) = P(A) P(B) = 1/3 × 1/4 = 1/12.

(b) We have:

P(A ∩ Bc) + P(Ac ∩ B) = (1/3 × 3/4) + (2/3 × 1/4) = 5/12.

(c) Since P(Ac) = 2/3, then:

P(no A sales all day) = (2/3)^3 = 8/27.

(d) We have:

P(at least 1 B sale) = 1 − P(no B sales) = 1 − (3/4)^3 = 37/64.

Activity 5.29 Tower Construction Company (‘Tower’) is determining whether it should submit a bid for a new shopping centre. In the past, Tower’s main competitor, Skyrise Construction Company (‘Skyrise’), has submitted bids 80% of the time. If Skyrise does not bid on a job, the probability that Tower will get the job is 0.6. If Skyrise does submit a bid, the probability that Tower gets the job is 0.35.

(a) What is the probability that Tower will get the job?

(b) If Tower gets the job, what is the probability that Skyrise made a bid?

(c) If Tower did not get the job, what is the probability that Skyrise did not make a bid?

Solution

(a) We have:

P(Tower gets job) = (0.2 × 0.6) + (0.8 × 0.35) = 0.4.

(b) We have:

P(Skyrise bid | Tower gets job) = P(Tower gets job | Skyrise bid) P(Skyrise bid)/P(Tower gets job) = (0.35 × 0.8)/0.4 = 0.7.

(c) We have:

P(Skyrise did not bid | Tower did not get job) = P(Tower did not get job | Skyrise did not bid) P(Skyrise did not bid)/P(Tower did not get job) = (0.4 × 0.2)/0.6 = 2/15 ≈ 0.13.

Activity 5.30 Hard! There are 3 identical urns. The first urn contains 7 red balls and 3 white balls. The second urn contains 5 red balls and 5 white balls. The third urn contains 4 red balls and 8 white balls. One of the urns is selected at random (i.e. each urn has a 1-in-3 chance of being selected). Balls are then drawn from the selected urn without replacement. The first ball is red. The second ball is white. The third ball is red. The fourth ball is red. At each stage of sampling, calculate the probabilities of the selected urn being Urn I, Urn II or Urn III.
Solution

The following table shows the mix of balls remaining after each draw, together with the probability of each observed draw weighted by the current urn probabilities:

                 Mix of balls                P(Ball) × P(Urn)
          Urn I    Urn II   Urn III    Urn I                     Urn II                    Urn III                   Sum
Start     7R, 3W   5R, 5W   4R, 8W     —                         —                         —                         —
Round 1   6R, 3W   4R, 5W   3R, 8W     7/10 × 1/3 = 0.2333       5/10 × 1/3 = 0.1667       4/12 × 1/3 = 0.1111       0.5111
Round 2   6R, 2W   4R, 4W   3R, 7W     3/9 × 0.4565 = 0.1522     5/9 × 0.3262 = 0.1812     8/11 × 0.2174 = 0.1581    0.4915
Round 3   5R, 2W   3R, 4W   2R, 7W     6/8 × 0.3097 = 0.2323     4/8 × 0.3687 = 0.1844     3/10 × 0.3217 = 0.0965    0.5132
Round 4   4R, 2W   2R, 4W   1R, 7W     5/7 × 0.4526 = 0.3232     3/7 × 0.3593 = 0.1540     2/9 × 0.1880 = 0.0418     0.5190

The revised urn probabilities at each stage are:

          P(Urn I)                  P(Urn II)                 P(Urn III)
Start     0.3333                    0.3333                    0.3333
Round 1   0.2333/0.5111 = 0.4565    0.1667/0.5111 = 0.3262    0.1111/0.5111 = 0.2174
Round 2   0.1522/0.4915 = 0.3097    0.1812/0.4915 = 0.3687    0.1581/0.4915 = 0.3217
Round 3   0.2323/0.5132 = 0.4526    0.1844/0.5132 = 0.3593    0.0965/0.5132 = 0.1880
Round 4   0.3232/0.5190 = 0.6227    0.1540/0.5190 = 0.2967    0.0418/0.5190 = 0.0805

Note that the 'Sum' column is calculated using the total probability formula, and the urn probabilities are computed using Bayes' formula.

5.10.3  Independent events (revisited)

The terms 'dependent' and 'independent' reflect the fact that the probability of an event is changed when another event is known to occur only if there is some dependence between the events. If there is such a dependence, then P(A | B) ≠ P(A). It follows from this that two events, A and B, are independent if and only if:

P(A | B) = P(A).

Recall from the multiplicative law in Section 5.9 that under independence P(A ∩ B) = P(A) P(B). Substituting this into the conditional probability formula gives the required result:

P(A | B) = P(A ∩ B) / P(B) = P(A) P(B) / P(B) = P(A)

provided P(B) > 0. Hence if A and B are independent, knowledge of B, i.e. '| B', is of no value to us in determining the probability of A occurring.

5.11  Probability trees

Probability problems can often be represented in several ways.
We have already seen how Venn diagrams provide a convenient way to visualise the members of various combinations of events. Here we introduce a probability tree, also referred to as a tree diagram. This is best explained by way of an example.

Example 5.12  The London Special Electronics (LSE) company is investigating a fault in its manufacturing plant. The tests done so far prove that the fault is at one of three locations: A, B or C. Taking all the tests and other evidence into account, the company assesses the chances of the fault being at each site as:

Suspect site    Probability of it being site of fault
A               0.5
B               0.2
C               0.3

The next test they will do is expected to improve the identification of the correct site, but (like most tests) it is not entirely accurate.

• If the fault is at A, then there is a 70 per cent chance of a correct identification; i.e. given that A is the site of the fault then the probability that the test says that it is at A is 0.7, and in this case the probabilities of either possible error are equal.

• If the fault is at B, then there is a 60 per cent chance of a correct identification, and in this case too the probabilities of either possible error are equal.

• If the fault is at C, then there is an 80 per cent chance of a correct identification, and in this case too the probabilities of either possible error are equal.

Draw a probability tree for this problem, and use it to answer the following:

(a) What is the probability that the new test will (rightly or wrongly) identify A as the site of the fault?

(b) If the new test does identify A as the site of the fault, find:

  i. the company's revised probability that C is the site of the fault
  ii. the company's revised probability that B is not the site of the fault.

Let A, B and C stand for the events: 'fault is at A', 'fault is at B' and 'fault is at C', respectively.
Also, let a, b and c stand for the events: 'the test says the fault is at A', 'the test says the fault is at B' and 'the test says the fault is at C', respectively. The probability tree is shown in Figure 5.6.

(a) The probability P(a) is the sum of the three values against 'branches' which include the event a, that is:

0.35 + 0.04 + 0.03 = 0.42.

(b) i. The conditional probability P(C | a) is the value for the C ∩ a branch divided by P(a), that is:

0.03/(0.35 + 0.04 + 0.03) = 1/14 = 0.071.

ii. The conditional probability P(Bc | a) is the sum of the values for the A ∩ a and C ∩ a branches divided by P(a), that is:

(0.35 + 0.03)/0.42 = 0.905.

Alternatively:

P(Bc | a) = 1 − value of (B ∩ a) branch / P(a) = 1 − 0.04/0.42 = 0.905.

The branch probabilities of the tree are summarised below (true site of fault, followed by what the test says, with the final probability of each branch):

True site    Test says    Final probability
A (0.5)      a (0.7)      p = 0.5 × 0.7 = 0.35
A (0.5)      b (0.15)     p = 0.5 × 0.15 = 0.075
A (0.5)      c (0.15)     p = 0.5 × 0.15 = 0.075
B (0.2)      a (0.2)      p = 0.2 × 0.2 = 0.04
B (0.2)      b (0.6)      p = 0.2 × 0.6 = 0.12
B (0.2)      c (0.2)      p = 0.2 × 0.2 = 0.04
C (0.3)      a (0.1)      p = 0.3 × 0.1 = 0.03
C (0.3)      b (0.1)      p = 0.3 × 0.1 = 0.03
C (0.3)      c (0.8)      p = 0.3 × 0.8 = 0.24

Figure 5.6: Probability tree for Example 5.12.

5.12  Summary

This chapter has introduced the idea of probability, and defined the key terms. You have also seen how Venn diagrams can be used to illustrate probability, and used the three axioms. This should prepare you for the following chapter.

5.13  Key terms and concepts

Additive law, Axioms, Bayes' formula, Collectively exhaustive, Conditional, Element, Equally likely, Event, Exhaustive, Experiment, Independence, Member, Multiplicative law, Mutually exclusive, Partition, Probability, Probability tree, Relative frequency, Sample space, Set, Subset, Tree diagram, Venn diagram.

5.14  Learning activities

Solutions can be found in Appendix A.

1. When throwing a die, suppose: S = {1, 2, 3, 4, 5, 6}, E = {3, 4} and F = {4, 5, 6}. Determine:

(a) Fc
(b) Ec ∩ Fc
(c) (E ∪ F)c
(d) Ec ∩ F.

2.
Draw the appropriate Venn diagram to show each of the following in connection with Question 1:

(a) E ∪ F = {3, 4, 5, 6}
(b) E ∩ F = {4}
(c) Ec = {1, 2, 5, 6}.

3. Consider the following information.

                      Delivery time
Supplier     Early   On time   Late   Total
Jones        20      20        10     50
Smith        10      90        50     150
Robinson     0       10        90     100
Total        30      120       150    300

What are the probabilities associated with a delivery chosen at random for each of the following?

(a) A delivery is early.
(b) A delivery is from Smith.
(c) A delivery is from Jones and late.

4. There are three sites a company may move to: A, B and C. We are told that P(A) (the probability of a move to A) is 1/2, and P(B) = 1/3. What is P(C)?

5. Two events A and B are independent with P(A) = 1/3 and P(B) = 1/4. What is P(A ∩ B)?

6. Say whether the following statement is true or false and briefly give your reason(s). 'If two events are independent, then they must be mutually exclusive.'

7. If X can take values of 1, 2 and 4 with P(X = 1) = 0.3, P(X = 2) = 0.5 and P(X = 4) = 0.2, calculate:

(a) P(X² < 4)
(b) P(X > 2 | X is an even number).

8. Write down and illustrate the use in probability of:

(a) the additive law
(b) the multiplicative rule.

9. A coffee machine may be defective because it dispenses the wrong amount of coffee, event C, and/or it dispenses the wrong amount of sugar, event S. The probabilities of these defects are:

P(C) = 0.05,  P(S) = 0.04  and  P(C ∩ S) = 0.01.

Determine the proportion of cups of coffee with:

(a) at least one defect
(b) no defects.

10. A company gets 60% of its supplies from manufacturer A, and the remainder from manufacturer Z. The quality of the parts delivered is given below:

Manufacturer    % Good parts    % Bad parts
A               97              3
Z               93              7

(a) The probabilities of receiving good or bad parts can be represented by a probability tree. Show, for example, that the probability that a randomly chosen part comes from A and is bad is 0.018.
(b) Show that the sum of the probabilities of all outcomes is 1.

(c) The way the probability tree is used depends on the information required. For example, show that the probability tree can be used to show that the probability of receiving a bad part is 0.028 + 0.018 = 0.046.

11. A company has a security system comprising four electronic devices (A, B, C and D) which operate independently. Each device has a probability of 0.1 of failure. The four electronic devices are arranged such that the whole system operates if at least one of A or B functions and at least one of C or D functions. Show that the probability that the whole system functions properly is 0.9801. (Use set theory and the laws of probability, or a probability tree.)

12. A student can enter a course either as a beginner (73% of all students) or as a transferring student (27% of all students). It is found that 62% of beginners eventually graduate, and that 78% of transferring students eventually graduate.

(a) Find the probability that a randomly chosen student:

  i. is a beginner who will eventually graduate
  ii. will eventually graduate
  iii. is either a beginner or will eventually graduate, or both.

(b) Are the events 'eventually graduates' and 'enters as a transferring student' statistically independent?

(c) If a student eventually graduates, what is the probability that the student entered as a transferring student?

(d) If two entering students are chosen at random, what is the probability that not only do they enter in the same way but that they also both graduate or both fail?
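The total probability and Bayes'-formula calculations that run through this chapter's activities all follow one pattern, which can be sketched in Python. This is an illustrative sketch, not part of the course materials; it is checked here against the horse-race example (P(win | dry) = 0.3, P(win | wet) = 0.5, P(dry) = 0.6, P(wet) = 0.4).

```python
# Sketch of the total probability formula and Bayes' formula.

def total_probability(cond_probs, priors):
    """P(A) = sum over the partition events B_i of P(A | B_i) P(B_i)."""
    return sum(p_a_given_b * p_b for p_a_given_b, p_b in zip(cond_probs, priors))

def bayes(p_a_given_b, p_b, p_a):
    """P(B | A) = P(A | B) P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

# P(win) by the total probability formula.
p_win = total_probability([0.3, 0.5], [0.6, 0.4])
print(p_win)                        # 0.38, as in the worked solution

# P(dry | lose) = P(lose | dry) P(dry) / P(lose) by Bayes' formula.
p_lose = 1 - p_win                  # 0.62
p_dry_given_lose = bayes(1 - 0.3, 0.6, p_lose)
print(round(p_dry_given_lose, 4))   # 0.6774
```

The same two functions cover the restaurant, part-time student and bidding activities: supply the conditional probabilities and the prior probabilities of the partition events.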
5.15  A reminder of your learning outcomes

After completing this chapter, and having completed the essential reading and activities, you should be able to:

• apply the ideas and notation involved in set theory to simple examples
• recall the basic axioms of probability and apply them
• distinguish between the ideas of conditional probability and independence
• draw and use appropriate Venn diagrams
• draw and use appropriate probability trees.

5.16  Sample examination questions

Solutions can be found in Appendix B.

1. If P(B) = 0.05, P(A | B) = 0.70 and P(A | Bc) = 0.30, find P(B | A).

2. An engine encounters a standard environment with a probability of 0.95, and a severe environment with a probability of 0.05. In a standard environment the probability of failure is 0.02, whereas in the severe environment this probability is 0.5.

(a) What is the probability of failure?
(b) Given that failure has occurred, what is the probability that the environment encountered was severe?

3. Suppose there are two boxes; the first box contains three green balls and one red ball, whereas the second box contains two green balls and two red balls. Suppose a box is chosen at random and then a ball is drawn randomly from this box.

(a) What is the probability that the ball drawn is green?
(b) If the ball drawn was green, what is the probability that the first box was chosen?

Chapter 6
The normal distribution and ideas of sampling

6.1  Aims

Now is the time where the ideas of measurement (introduced in Chapter 4) and probability theory (Chapter 5) come together to form the ideas of sampling. Your aims should be to:

• understand the concept of a random variable and its distribution
• work with the normal distribution
• see the connection between the mathematical ideas of sampling introduced here and their uses (introduced in Chapter 3, and covered in detail in Chapter 10).
6.2  Learning outcomes

After completing this chapter, and having completed the essential reading and activities, you should be able to:

• summarise the meaning of E(X) and Var(X)
• compute areas under the curve for a normal distribution
• state and apply the central limit theorem
• explain the relationship between sample size and the standard error of the sample mean.

6.3  Essential reading

Newbold, P., W.L. Carlson and B.M. Thorne, Statistics for Business and Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060] Sections 4.1, 4.3, 5.1–5.3, 6.1 and 6.2.

In addition there is essential 'watching' of this chapter's accompanying video tutorials accessible via the ST104a Statistics 1 area at https://emfss.elearning.london.ac.uk

6.4  Further reading

Wonnacott, T.H. and R.J. Wonnacott, Introductory Statistics for Business and Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN 9780471615170] Chapters 4 and 6.

6.5  Introduction

If all of a government's political and economic decisions were based on figures from entire populations, think how costly and time-consuming that would be. You could argue that one would never in fact tackle the problem itself. If we needed to know exactly how many homeless families there are in a particular country before implementing policy, or the exact number of companies which employ people on a part-time basis, and for how many hours, then work might scarcely get off the ground! Think, for example, of the population of your country and the cost of a complete census, even if only limited questions are asked about residents.

Fortunately for everyone, statisticians have come up with a way of estimating such figures using samples from the population.
They can also tell us how accurate these estimates are likely to be for a particular sample size, and this is why survey sampling, rather than taking a full census, is the preferred tool of governments, social researchers and market research companies. The rest of this chapter develops the ideas you need in order to use random sampling successfully.

6.6  Random variables

Formally, a random variable is a 'mapping' of the elementary outcomes in the sample space to real numbers. This allows us to attach probabilities to the experimental outcomes. Hence the concept of a random variable is that of a measurement which takes a particular value for each possible trial (experiment). Frequently, this will be a numerical value.

For example, suppose we sample five people and measure their heights, such that 'height' is the random variable and the five (observed) values of this random variable are the realised measurements for the heights of these five people.

As a further example, suppose a fair die is thrown four times and we observe two 6s, a 3 and a 1. The random variable is the 'score on the die', and for these four trials it takes the values 6, 6, 3 and 1. (In this case, since we do not know the true order in which the values occurred, we could also say that the results were 1, 6, 3 and 6 or 1, 3, 6 and 6 etc.)

An example of an experiment with non-numerical outcomes would be a coin toss, for which recall S = {H, T}. We can use a random variable, X, to convert the sample space elements to real numbers, such as:

X = 1 for heads, and X = 0 for tails.

The values of any measurable variables will typically vary from sample to sample, hence the name 'random variable'. So each experimental random variable has a collection of possible outcomes, and a numerical value associated with each outcome. We have already encountered the term 'sample space' which is the set of all possible (numerical) values of the random variable.
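The 'mapping' idea can be made concrete in a short Python sketch. The dictionary encoding below is purely illustrative: it represents the coin-toss random variable X(H) = 1, X(T) = 0 and shows how probabilities on outcomes transfer to probabilities on numerical values.

```python
# A random variable maps elementary outcomes to real numbers.
# Coin toss: S = {H, T}, with X(H) = 1 and X(T) = 0.
X = {"H": 1, "T": 0}

# Probabilities attach to the outcomes; the mapping carries them
# over to the numerical values of the random variable.
outcome_probs = {"H": 0.5, "T": 0.5}

value_probs = {}
for outcome, value in X.items():
    value_probs[value] = value_probs.get(value, 0) + outcome_probs[outcome]

print(value_probs)  # {1: 0.5, 0: 0.5}
```

The same pattern works for any finite sample space: if several outcomes map to the same number, their probabilities accumulate on that value.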
Random variables come in two 'types': discrete and continuous. Discrete random variables are those associated with count data, such as the score on a die, while continuous random variables refer to measurable data, such as height, weight and time.

Example 6.1  Examples of discrete random variables include:

Experiment                    Random variable         Sample space
Die is thrown                 Value on top face       {1, 2, 3, 4, 5, 6}
Coin is tossed five times     Number of heads         {0, 1, 2, 3, 4, 5}
Twenty people sampled         Number with blue eyes   {0, 1, 2, . . . , 19, 20}
Machine operates for a day    Number of breakdowns    {0, 1, 2, . . .}

Example 6.2  Possible examples of continuous random variables include:

• in economics, measuring values of inputs or outputs, workforce productivity or consumption
• in sociology, measuring the proportion of people in a population with a particular preference
• in engineering, measuring the electrical resistance of materials
• in physics, measuring temperature, electrical conductivity of materials.

Note the repeated appearance of measuring in the above examples. Remember, continuous random variables deal with measured data, while discrete random variables deal with count data.

We typically use a capital letter to denote the random variable. For example, X = 'score of a die'. The letter X is often adopted, but any other letter could be used, such as Y, Z etc. In contrast, a lower case letter denotes a particular value of the random variable. For example, if the die results in a 3, then this is x = 3.

A natural question to ask is 'what is the probability of any of these values?'. That is, we are interested in the probability distribution of the random variable. Special probability distributions are the subject matter of ST104b Statistics 2; however, there is one very special distribution which will concern us in this half-course – the normal distribution. Before the normal distribution is introduced though, it is necessary to introduce two important measures.
6.7  Population mean and variance

'Mean' and 'variance' were introduced in Chapter 4, although importantly these referred to the sample versions, i.e. x̄ and s², which were calculated based on sample data. Here, our attention is on the population (or theoretical) mean and variance. These play a key role in sampling theory which will occupy us for the next few chapters. In order to work with these measures, we first need to introduce the expectation operator, denoted 'E'.

6.7.1  Population mean

Certain important properties of distributions arise if we consider probability-weighted averages of random variables, and of functions of random variables.¹ For example, we might want to know the average (or typical value) of a random variable.

It would be wrong to simply take the average of all the values taken by the random variable, as this would mean that very unlikely values (those with small probabilities of occurrence) would receive the same weighting as very likely values (those with large probabilities of occurrence). The obvious approach is to use a probability-weighted average of the sample space values.

Expected value of a discrete random variable

If x1, x2, . . . , xN are the possible values of the random variable X, with corresponding probabilities p1, p2, . . . , pN, then:

E(X) = µX = µ = Σ xi pi = x1 p1 + x2 p2 + · · · + xN pN

summing over i = 1, . . . , N.

Note that this population mean can be written as E(X) (in words 'the expectation of the random variable X'), or µX (in words 'the population mean of X'). Also, note the distinction between the sample mean, x̄ (introduced in Chapter 4), based on observed sample values, and the population mean, µX, based on the theoretical probability distribution.

Example 6.3  If the 'random variable' X happens to be a constant, then x1 = x2 = · · · = xN = k and, since p1 + p2 + · · · + pN = 1, trivially E(X) = k.
Example 6.4  Let X represent the value shown when a fair die is thrown once. Its probability distribution (that is, how the sample space probability of 1 is distributed across the values the random variable takes) is:

X = x        1     2     3     4     5     6     Total
P(X = x)     1/6   1/6   1/6   1/6   1/6   1/6   1

¹ A function, f(X), of a random variable X is, of course, also a random variable, say Y = f(X).

Note that the sum of the individual probabilities is 1 (a consequence of the second probability axiom). So the population mean is calculated to be:

E(X) = Σ xi pi = 1 × 1/6 + 2 × 1/6 + · · · + 6 × 1/6 = 3.5.

Example 6.5  Suppose the random variable X takes the values 3, 6 and 8 with probabilities 0.2, 0.7 and 0.1, respectively. Hence its probability distribution is:

X = x        3     6     8     Total
P(X = x)     0.2   0.7   0.1   1

The population mean is calculated to be:

E(X) = Σ xi pi = (3 × 0.2) + (6 × 0.7) + (8 × 0.1) = 5.6.

6.7.2  Population variance

The concept of a probability-weighted average (or expected value) can be extended to functions of the random variable. For example, if X takes the values x1, x2, . . . , xN with corresponding probabilities p1, p2, . . . , pN, then:

E(1/X) = Σ (1/xi) pi    (provided xi ≠ 0)

and:

E(X²) = Σ xi² pi.

One very important average associated with a distribution is the expected value of the square of the deviation of the random variable from its mean, µ, known as the population variance, σ².

Variance of a discrete random variable

If x1, x2, . . . , xN are the possible values of the random variable X, with corresponding probabilities p1, p2, . . . , pN, then:

σ² = E((X − µ)²) = Σ (xi − µ)² pi.

This can be seen to be a measure – not the only one, but the most widely-used by far – of the spread of the distribution and is known as the population variance of the random variable.
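The mean and variance definitions above can be checked numerically. A minimal sketch, using exact fractions for the fair-die distribution of Example 6.4:

```python
# E(X) = sum of x_i p_i, and Var(X) = E((X - mu)^2), for a fair die.
from fractions import Fraction

values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6          # each face equally likely

# Probability-weighted average (the population mean).
mu = sum(x * p for x, p in zip(values, probs))

# Probability-weighted average of the squared deviations.
var = sum((x - mu) ** 2 * p for x, p in zip(values, probs))

print(mu)    # 7/2, i.e. 3.5
print(var)   # 35/12, i.e. 70/24 = 2.92 (to 2 decimal places)
```

Working in `Fraction`s mirrors the textbook's advice to keep calculations over a common denominator rather than rounding to decimals too early.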
The (positive) square root of the variance is the standard deviation and, given the variance is typically denoted by σ², the standard deviation is denoted by σ.

Example 6.6  Returning to the fair die example, we now compute the variance of X as follows, noting that µ = 3.5.

X = x                 1       2      3      4      5      6       Total
P(X = x)              1/6     1/6    1/6    1/6    1/6    1/6     1
(x − µ)²              25/4    9/4    1/4    1/4    9/4    25/4
(x − µ)² P(X = x)     25/24   9/24   1/24   1/24   9/24   25/24   70/24 = 2.92

Hence σ² = E((X − µ)²) = 2.92 and σ = √2.92 = 1.71.

The tabular format in Example 6.6 has some advantages. It helps to have (and to calculate) the 'Total' column since, for example, if a probability, P(X = x), has been miscalculated or miscopied then the row total will not be 1 (recall the second probability axiom). Therefore, this would highlight an error which, with a little work, could be identified. It is often useful to do a group of calculations as fractions over the same denominator (as in the final row of the table in Example 6.6), rather than to cancel or to work with them as decimals, because important patterns can be more obvious, and calculations easier.

It can be shown (in ST104b Statistics 2) that the population variance can be expressed in an alternative form:

Var(X) = E((X − µ)²) = E(X²) − (E(X))².

In words, 'the (population) variance is equal to the mean of the square minus the square of the mean'.

Example 6.7  Using the same fair die example from Example 6.6:

X = x           1     2     3     4      5      6      Total
P(X = x)        1/6   1/6   1/6   1/6    1/6    1/6    1
x P(X = x)      1/6   2/6   3/6   4/6    5/6    6/6    21/6 = 3.5 = µ
x² P(X = x)     1/6   4/6   9/6   16/6   25/6   36/6   91/6

Hence µ = E(X) = 3.5, E(X²) = 91/6, so the variance is 91/6 − (3.5)² = 2.92, as before. However, this method is usually quicker to use.

Activity 6.1  A fair die is thrown once. Determine the probability distribution of the value of the upturned face, X, and find its mean and variance.
Solution

Each face has an equal chance of 1/6 of turning up, so we get the following table:

X = x           1     2     3     4      5      6      Total
P(X = x)        1/6   1/6   1/6   1/6    1/6    1/6    1
x P(X = x)      1/6   2/6   3/6   4/6    5/6    6/6    21/6 = 3.5
x² P(X = x)     1/6   4/6   9/6   16/6   25/6   36/6   91/6

Hence the mean is:

E(X) = Σ x P(X = x) = 1 × 1/6 + · · · + 6 × 1/6 = 3.5.

Since:

E(X²) = Σ x² P(X = x) = 1² × 1/6 + · · · + 6² × 1/6 = 91/6

the variance is:

Var(X) = E(X²) − µ² = 91/6 − (3.5)² = 2.92.

Activity 6.2  Two fair dice are thrown.

(a) Determine the probability distribution of the sum of the two dice, X, and find its mean and variance.

(b) Determine the probability distribution of the absolute difference of the two dice, Y, and find its mean and variance.

Solution

(a) The pattern is made clearer by using the same denominator (i.e. 36) below.

X = x           2      3      4       5       6       7
P(X = x)        1/36   2/36   3/36    4/36    5/36    6/36
x P(X = x)      2/36   6/36   12/36   20/36   30/36   42/36
x² P(X = x)     4/36   18/36  48/36   100/36  180/36  294/36

X = x           8       9       10      11      12      Total
P(X = x)        5/36    4/36    3/36    2/36    1/36    1
x P(X = x)      40/36   36/36   30/36   22/36   12/36   252/36
x² P(X = x)     320/36  324/36  300/36  242/36  144/36  1974/36

This yields a mean of E(X) = 252/36 = 7, and a variance of E(X²) − µ² = (1974/36) − 7² = 5.83. (Although not required, a plot of the distribution – the probabilities rising to a peak at x = 7 and falling symmetrically – is omitted here.)

(b) Again, for clarity, we use the same denominator.

Y = y           0      1       2       3       4       5       Total
P(Y = y)        6/36   10/36   8/36    6/36    4/36    2/36    1
y P(Y = y)      0      10/36   16/36   18/36   16/36   10/36   70/36
y² P(Y = y)     0      10/36   32/36   54/36   64/36   50/36   210/36

This yields a mean of E(Y) = 70/36 = 1.94, while the variance is:

E(Y²) − µ² = 210/36 − (1.94)² = 2.05.

(Again, although not required, a plot of the distribution is omitted here.)
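The two-dice distribution in part (a) can also be built by direct enumeration of the 36 equally likely outcomes, rather than by tabulating fractions by hand. A sketch:

```python
# Enumerate the 36 equally likely outcomes of two fair dice,
# build the distribution of the sum, then compute its mean and
# variance via Var(X) = E(X^2) - (E(X))^2.
from collections import Counter
from itertools import product

counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
probs = {x: c / 36 for x, c in counts.items()}

mean = sum(x * p for x, p in probs.items())
var = sum(x ** 2 * p for x, p in probs.items()) - mean ** 2

print(mean)           # 7.0
print(round(var, 2))  # 5.83
```

The absolute-difference distribution of part (b) follows by replacing `a + b` with `abs(a - b)` in the enumeration.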
Activity 6.3  The probability function P(X = x) = 0.02x is defined for x = 8, 9, 10, 11 and 12. What are the mean and variance of this probability distribution?

Solution

The probability distribution is:

X = x        8      9      10     11     12
P(X = x)     0.16   0.18   0.20   0.22   0.24

The mean is:

E(X) = Σ x P(X = x) = (8 × 0.16) + (9 × 0.18) + · · · + (12 × 0.24) = 10.2.

To determine the variance, we first find E(X²), given by:

E(X²) = Σ x² P(X = x) = (8² × 0.16) + (9² × 0.18) + · · · + (12² × 0.24) = 106.

Hence:

Var(X) = 106 − (10.2)² = 1.96.

Activity 6.4  In a prize draw, the probabilities of winning various amounts of money are:

Prize (in £)          0      1      50     100    500
Probability of win    0.35   0.50   0.11   0.03   0.01

What is the expected value and standard deviation of the prize?

Solution

The expected value is:

E(X) = Σ x P(X = x) = (500 × 0.01) + (100 × 0.03) + · · · + (0 × 0.35) = £14.00.

To calculate the standard deviation, we first need E(X²), given by:

E(X²) = Σ x² P(X = x) = ((500)² × 0.01) + ((100)² × 0.03) + · · · + (0² × 0.35) = 3075.50.

So the variance is:

Var(X) = 3075.50 − (14)² = 2879.50

and hence the standard deviation is √2879.50 = £53.66.

6.8  The normal distribution

The normal distribution is the most important distribution in statistical theory and it is essential to much statistical theory and reasoning. It is, in a sense, the 'parent' distribution of all the sampling distributions which we shall meet later. In order to get some feel for the normal distribution, let us consider the exercise of constructing a histogram of people's heights (assumed to be normally distributed). Suppose we start with 100 people and construct a histogram using sufficient class intervals such that the histogram gives a good representation of the data's distribution. This will be a fairly 'ragged' diagram, but useful nonetheless.
Now suppose we increase our sample size to 500 and construct an appropriate histogram for these observed heights, but using more class intervals now that we have more data. This diagram will be smoother than the first, peaked in the centre, and roughly symmetric about the centre. The normal distribution is emerging! If we continue this exercise to sample sizes of 5,000 or even 50,000, then we will eventually arrive at a very smooth bell-shaped curve similar to that shown in Figure 6.1. Hence we can view the normal distribution as the smooth limit of the basic histogram as the sample size becomes very large.

Such a diagram represents the distribution of the population. It is conventional to adjust the vertical scale so that the total area under the curve is 1, and so it is easy to view the area under the curve as probability (recall the second probability axiom). The mathematical form for this curve is well-known and can be used to compute areas, and hence probabilities – in due course we shall make use of statistical tables for this purpose.

Figure 6.1: Density function of the (standard) normal distribution.

6.8.1  Relevance of the normal distribution

The normal distribution is relevant to the application of statistics for many reasons such as:

• Many naturally-occurring phenomena can be modelled as following a normal distribution. Examples include heights of people, diameters of bolts, weights of animals etc.²

• A very important point is that averages of sampled variables (discussed later), indeed any functions of sampled variables, also have probability distributions. It can be demonstrated, theoretically and empirically, that provided the sample size is reasonably large, the distribution of the sample mean, X̄, will be (approximately) normal regardless of the distribution of the original variable.
This is known as the central limit theorem (CLT), which we will return to later.

• The normal distribution is often used as the distribution of the error term in standard statistical and econometric models such as linear regression. This assumption can be, and should be, checked. This is considered further in ST104b Statistics 2.

6.8.2  Consequences of the central limit theorem

The consequences of the CLT are twofold.

• A number of statistical methods which we use have a robustness property, i.e. it does not matter for their validity just what the true population distribution of the variable being sampled is.

• We are justified in assuming normality for statistics which are sample means, or linear transformations of them.

The CLT was introduced above 'provided the sample size is reasonably large'. In practice, 30 or more observations are usually sufficient (and can be used as a rule-of-thumb), although the distribution of X̄ may be normal for n much less than 30. This depends on the population distribution of the original variable. If this population distribution is in fact normal, then all sample means computed from it will be normal. However, if the population distribution is very non-normal, then a sample size of (at least) 30 would be needed to justify normality.

² Note the use of the word modelled. This is due to the distributional assumption of normality. Since a normal random variable X is defined over the entire real line, i.e. −∞ < x < ∞, we know a person cannot have a negative height, even though the normal distribution has positive, non-zero probability over negative values. Also, nobody is of infinite height (the world's tallest man ever, Robert Wadlow, was 272 cm), so clearly there is a finite upper bound to height, rather than ∞. Therefore, height does not follow a true normal distribution, but it is a good enough approximation for modelling purposes.
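The CLT can be illustrated by simulation. The sketch below (with an arbitrary fixed seed for reproducibility) draws many samples of n = 30 die rolls – a decidedly non-normal population – and checks that the sample means cluster around µ = 3.5 with standard error σ/√n, where σ² = 35/12 for a fair die.

```python
# CLT illustration: sample means of n = 30 fair-die rolls.
import math
import random
import statistics

random.seed(42)  # arbitrary seed, for reproducibility only

n, reps = 30, 5000
sample_means = [statistics.fmean(random.randint(1, 6) for _ in range(n))
                for _ in range(reps)]

# The mean of the sample means should be near the population mean 3.5,
# and their spread near the standard error sigma / sqrt(n) ~= 0.31.
print(round(statistics.fmean(sample_means), 2))   # close to 3.5
print(round(statistics.stdev(sample_means), 2))   # close to 0.31
```

A histogram of `sample_means` would show the bell shape emerging, even though each individual die roll is uniformly distributed over {1, . . . , 6}.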
6.8.3 Characteristics of the normal distribution

The equation which describes the normal distribution takes the general form:

f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).

The shape of this function is the bell-shaped curve, such as in Figure 6.1. Do not panic – you will not be working with this function directly! That said, do be aware that it involves two parameters: the mean, µ, and the variance, σ². Since the normal distribution is symmetric about µ, the distribution is centred at µ. As a consequence of this symmetry, the mean is equal to the median. Also, since the distribution peaks at µ, it is also equal to the mode. In principle, the mean can take any real value, i.e. −∞ < µ < ∞.

The variance is σ², hence the larger σ², the larger the spread of the distribution. Note that variances cannot be negative, hence σ² > 0.

If the random variable X has a normal distribution with parameters µ and σ², we denote this as X ∼ N(µ, σ²), where we read '∼' as 'is distributed as'. Given the infinitely many possible values for µ and σ², and given that a normal distribution is uniquely defined by these two parameters, there is an infinite number of normal distributions due to the infinite combinations of values of µ and σ².

The most important normal distribution is the special case when µ = 0 and σ² = 1. We call this the standard normal distribution, denoted by Z, i.e. Z ∼ N(0, 1). Tabulated probabilities which appear in statistical tables are for the standard normal distribution.

6.8.4 Standard normal tables

We now discuss the determination of normal probabilities using standard statistical tables. Extracts from the New Cambridge Statistical Tables will be provided in the examination. Here we focus on Table 4.
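For checking your table look-ups when studying (though not a substitute for Table 4 in the examination), both the density f(x) and the cumulative probability Φ(z) can be computed with the standard library alone – a minimal sketch, using the identity Φ(z) = (1 + erf(z/√2))/2:

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Normal density: f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def Phi(z):
    """Standard normal CDF, P(Z <= z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(Phi(1.2), 4))        # 0.8849, matching Table 4
print(round(normal_pdf(0.0), 4)) # 0.3989, the peak of the standard normal density
```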
Standard normal probabilities

Table 4 of the New Cambridge Statistical Tables lists 'lower-tail' probabilities, which can be represented as:

P(Z ≤ z) = Φ(z)   for z ≥ 0

using the conventional Z notation for a standard normal random variable.3 Note the cumulative probability4 for the Z distribution, P(Z ≤ z), is often denoted Φ(z). We now consider some examples of working out probabilities from Z ∼ N(0, 1).

3 Although Z is the conventional letter used to denote a standard normal random variable, Table 4 uses (somewhat annoyingly) 'x' to denote 'z'.
4 A cumulative probability is the probability of being less than or equal to some particular value.

Example 6.8 If Z ∼ N(0, 1), what is P(Z > 1.2)?

When computing probabilities, it is useful to draw a quick sketch to visualise the specific area of probability which we are after. So for P(Z > 1.2) we require the upper-tail probability shaded in red in Figure 6.2. Since Table 4 gives us lower-tail probabilities, if we look up the value 1.2 in the table we will get P(Z ≤ 1.2) = 0.8849. The total area under a normal curve is 1, so the required probability is simply:

1 − Φ(1.2) = 1 − 0.8849 = 0.1151.

Figure 6.2: Standard normal distribution with shaded area depicting P(Z > 1.2).

Example 6.9 If Z ∼ N(0, 1), what is P(−1.24 < Z < 1.86)?

Again, begin by producing a sketch. The probability we require is the sum of the blue and red areas in Figure 6.3. Using Table 4, which note only covers z ≥ 0, we proceed as follows. The red area is given by:

P(0 ≤ Z ≤ 1.86) = P(Z ≤ 1.86) − P(Z ≤ 0) = Φ(1.86) − Φ(0) = 0.9686 − 0.5 = 0.4686.

The blue area is given by:

P(−1.24 ≤ Z ≤ 0) = P(Z ≤ 0) − P(Z ≤ −1.24) = Φ(0) − Φ(−1.24) = Φ(0) − (1 − Φ(1.24)) = 0.5 − (1 − 0.8925) = 0.3925.

Note P(Z ≤ −1.24) = P(Z ≥ 1.24) = 1 − Φ(1.24), by symmetry of Z about µ = 0.
So although Table 4 does not give probabilities for negative z values, we can exploit the symmetry of the (standard) normal distribution. Hence:

P(−1.24 < Z < 1.86) = 0.4686 + 0.3925 = 0.8611.

Alternatively:

P(−1.24 < Z < 1.86) = Φ(1.86) − Φ(−1.24) = Φ(1.86) − (1 − Φ(1.24)) = 0.9686 − (1 − 0.8925) = 0.8611.

Figure 6.3: Standard normal distribution depicting P(−1.24 < Z < 1.86) as the shaded areas.

Activity 6.5 If Z ∼ N(0, 1), determine:

(a) P(0 < Z < 1.2)
(b) P(−0.68 < Z < 0)
(c) P(−0.46 < Z < 2.21)
(d) P(0.81 < Z < 1.94)
(e) P(Z < −0.6)
(f) P(Z > −1.28)
(g) P(Z > 2.05).

Solution

All of these probabilities can be obtained from Table 4 of the New Cambridge Statistical Tables, which note provides P(Z ≤ z) for z ≥ 0.

(a) We have:

P(0 < Z < 1.2) = P(Z < 1.2) − P(Z < 0) = 0.8849 − 0.5 = 0.3849.

(b) We have:

P(−0.68 < Z < 0) = P(Z < 0) − (1 − P(Z < 0.68)) = 0.5 − (1 − 0.7517) = 0.2517.

(c) We have:

P(−0.46 < Z < 2.21) = P(Z < 2.21) − (1 − P(Z < 0.46)) = 0.98645 − (1 − 0.6772) = 0.66365.

(d) We have:

P(0.81 < Z < 1.94) = P(Z < 1.94) − P(Z < 0.81) = 0.9738 − 0.7910 = 0.1828.

(e) We have:

P(Z < −0.6) = 1 − P(Z < 0.6) = 1 − 0.7257 = 0.2743.

(f) We have:

P(Z > −1.28) = P(Z < 1.28) = 0.8997.

(g) We have:

P(Z > 2.05) = 1 − P(Z < 2.05) = 1 − 0.97982 = 0.02018.

6.8.5 The general normal distribution

We have already discussed that there exists an infinite number of different normal distributions due to the infinite pairs of parameter values, since −∞ < µ < ∞ and σ² > 0. The good news is that Table 4 of the New Cambridge Statistical Tables can be used to determine probabilities for any normal random variable X, such that X ∼ N(µ, σ²). To do so, we need a little bit of magic, known as standardisation. This is a special transformation which converts X ∼ N(µ, σ²) into Z ∼ N(0, 1).
The transformation formula for standardisation

If X ∼ N(µ, σ²), then the transformation:

Z = (X − µ)/σ

creates a standard normal random variable, i.e. Z ∼ N(0, 1). So to standardise X we subtract its mean and divide by its standard deviation.

Example 6.10 Suppose X ∼ N(5, 4). What is P(5.8 < X < 7.0)? We have:

P(5.8 < X < 7.0) = P((5.8 − µ)/σ < (X − µ)/σ < (7.0 − µ)/σ)
= P((5.8 − 5)/√4 < (X − 5)/√4 < (7.0 − 5)/√4)
= P(0.4 < Z < 1)
= P(Z ≤ 1) − P(Z ≤ 0.4)
= 0.8413 − 0.6554   (from Table 4)
= 0.1859.

Activity 6.6 Suppose X ∼ N(10, 4).

(a) Find:
 i. P(X > 13.4)
 ii. P(8 < X < 9).
(b) Find the value a such that P(10 − a < X < 10 + a) = 0.95.
(c) Find the value b such that P(10 − b < X < 10 + b) = 0.99.
(d) How far above the mean of the standard normal distribution must we go such that only 1% of the probability remains in the right-hand tail?
(e) How far below the mean of the standard normal distribution must we go such that only 5% of the probability remains in the left-hand tail?

Solution

Since X ∼ N(10, 4), we use the transformation Z = (X − µ)/σ with the values µ = 10 and σ = √4 = 2.

(a) i. We have:

P(X > 13.4) = P(Z > (13.4 − 10)/2) = P(Z > 1.70) = 1 − P(Z ≤ 1.70) = 1 − 0.9554 = 0.0446.

 ii. We have:

P(8 < X < 9) = P((8 − 10)/2 < Z < (9 − 10)/2) = P(−1 < Z < −0.5) = P(Z < −0.5) − P(Z < −1) = (1 − P(Z ≤ 0.5)) − (1 − P(Z ≤ 1)) = (1 − 0.6915) − (1 − 0.8413) = 0.1498.

(b) We want to find the value a such that P(10 − a < X < 10 + a) = 0.95, that is:

0.95 = P(((10 − a) − 10)/2 < Z < ((10 + a) − 10)/2)
= P(−a/2 < Z < a/2)
= P(Z < a/2) − P(Z < −a/2)
= 1 − 2 × P(Z < −a/2).

This is the same as 2 × P(Z > a/2) = 0.05, i.e. P(Z > a/2) = 0.025, or P(Z ≤ a/2) = 0.975. Therefore, from Table 4 of the New Cambridge Statistical Tables, a/2 = 1.96, and so a = 3.92.

(c) We want to find the value b such that P(10 − b < X < 10 + b) = 0.99. Similar reasoning shows that P(Z ≤ b/2) = 0.995. Therefore, from Table 4, b/2 = 2.58 (approximately), so that b = 5.16.
(d) We want k such that P(Z > k) = 0.01, or P(Z ≤ k) = 0.99. From Table 4, k = 2.33 (approximately).

(e) We want x such that P(Z < x) = 0.05. This means that x < 0 and P(Z > |x|) = 0.05, or P(Z ≤ |x|) = 0.95, so, from Table 4, |x| = 1.645 (by interpolating between Φ(1.64) and Φ(1.65)) and hence x = −1.645.

Activity 6.7 Your company requires a special type of light bulb which is available from only two suppliers. Supplier A's light bulbs have a mean lifetime of 2,000 hours with a standard deviation of 180 hours. Supplier B's light bulbs have a mean lifetime of 1,850 hours with a standard deviation of 100 hours. The distribution of the lifetimes of each type of light bulb is normal. Your company requires that the lifetime of a light bulb be not less than 1,500 hours. All other things being equal, which type of bulb should you buy, and why?

Solution

Let A and B be the random variables representing the lifetimes (in hours) of light bulbs from supplier A and supplier B, respectively. We are told that:

A ∼ N(2000, (180)²)   and   B ∼ N(1850, (100)²).

Since the relevant criterion is that light bulbs last at least 1,500 hours, the company should choose the supplier whose light bulbs have a greater probability of doing so. We find that:

P(A > 1500) = P(Z > (1500 − 2000)/180) = P(Z > −2.78) = P(Z < 2.78) = 0.99728

and:

P(B > 1500) = P(Z > (1500 − 1850)/100) = P(Z > −3.50) = P(Z < 3.50) = 0.9998.

Therefore, the company should buy light bulbs from supplier B, since they have a greater probability of lasting the required time. Note it is good practice to define notation and any units of measurement and to state the distributions of the random variables. Note also that here it is not essential to compute the probability values in order to determine what the company should do, since −2.78 > −3.50 implies that P(Z > −2.78) < P(Z > −3.50).

Activity 6.8 A normal distribution has a mean of 40.
If 10% of the distribution falls between the values of 50 and 60, what is the standard deviation of the distribution?

Solution

Let X ∼ N(40, σ²). We seek σ, and know that:

P(50 ≤ X ≤ 60) = P((50 − 40)/σ ≤ Z ≤ (60 − 40)/σ) = P(10/σ ≤ Z ≤ 20/σ) = 0.10.

Hence we know that one z-value (i.e. 20/σ) is twice the other (i.e. 10/σ), and their corresponding probabilities differ by 0.10. We also know that the z-values are positive since both represent numbers larger than the mean of 40. Now we need to use Table 4 of the New Cambridge Statistical Tables to find two such z-values. Looking at Table 4 we find the values to be, approximately, 1.25 and 2.50. Therefore, σ = 8.

Activity 6.9 The life, in hours, of a light bulb is normally distributed with a mean of 200 hours. If a consumer requires at least 90% of the light bulbs to have lives exceeding 150 hours, what is the largest value that the standard deviation can have?

Solution

Let X be the random variable representing the lifetime of a light bulb (in hours), so that for some value σ we have X ∼ N(200, σ²). We want P(X > 150) = 0.90, such that:

P(X > 150) = P(Z > (150 − 200)/σ) = P(Z > −50/σ) = 0.90.

Note that this is the same as P(Z < 50/σ) = 0.90, so 50/σ = 1.28, giving σ = 39.06.

Activity 6.10 The random variable X has a normal distribution with mean µ and variance σ², i.e. X ∼ N(µ, σ²). It is known that:

P(X ≤ 66) = 0.0359   and   P(X ≥ 81) = 0.1151.

(a) Produce a clearly-labelled sketch to represent these probabilities on a normal curve.
(b) Show that the value of σ = 5.
(c) Calculate P(69 ≤ X ≤ 83).

Solution

(a) The sketch below shows the probabilities, with P(X ≤ 66) shaded blue and P(X ≥ 81) shaded red.

(b) We have X ∼ N(µ, σ²), where µ and σ² are unknown. Using Table 4 of the New Cambridge Statistical Tables, we find that P(Z ≤ −1.80) = 0.0359 and P(Z ≥ 1.20) = 0.1151.
Therefore:

P(X ≤ 66) = P(Z ≤ (66 − µ)/σ) = P(Z ≤ −1.80) = 0.0359

and:

P(X ≥ 81) = P(Z ≥ (81 − µ)/σ) = P(Z ≥ 1.20) = 0.1151.

Hence:

(66 − µ)/σ = −1.80   and   (81 − µ)/σ = 1.20.

Rearranging, we obtain a pair of simultaneous equations which can be solved to find µ and σ. Specifically:

66 − µ = −1.80σ   and   81 − µ = 1.20σ.

Subtracting the first from the second gives 15 = 3σ, and hence σ = 5. For completeness, 81 − µ = 6, so µ = 75. Therefore, X ∼ N(75, 5²).

(c) Given X ∼ N(75, 5²), we have:

P(69 ≤ X ≤ 83) = P((69 − 75)/5 ≤ Z ≤ (83 − 75)/5) = P(−1.20 ≤ Z ≤ 1.60) = Φ(1.60) − (1 − Φ(1.20)) = 0.9452 − (1 − 0.8849) = 0.8301.

Activity 6.11 The number of newspapers sold daily at a kiosk is normally distributed with a mean of 350 and a standard deviation of 30.

(a) Find the probability that fewer than 300 newspapers are sold on a particular day.
(b) How many newspapers should the newsagent stock each day such that the probability of running out on any particular day is 10%?

Solution

Let X be the number of newspapers sold, hence X ∼ N(350, 900).

(a) We have:

P(X < 300) = P(Z < (300 − 350)/30) = P(Z < −1.67) = 1 − P(Z < 1.67) = 1 − 0.9525 = 0.0475.

(b) Let s be the required stock, then we require P(X > s) = 0.10. Hence:

P(Z > (s − 350)/30) = 0.10   ⇒   (s − 350)/30 ≥ 1.28   ⇒   s ≥ 350 + 1.28 × 30 = 388.4.

Rounding up, the required stock is 389.

Activity 6.12 Consider the following set of data. Does it appear to approximately follow a normal distribution? Justify your answer.

45 48 49 47 46 39 31 54 46 46 35 48
37 52 62 40 36 51 55 55 38 61 59 52
54 52 45 50 50 43 56 51 48 58 48 45

Solution

To see whether this set of data approximates a normal distribution, we need to analyse it. Using a calculator we calculate the mean to be µ = 48.1 and the standard deviation to be σ = 7.3 (assuming population data). For ± one standard deviation, i.e. µ ± σ, the interval is (40.8, 55.4), which contains 24 observations, representing 24/36 = 67% of the data.
For ± two standard deviations, i.e. µ ± 2σ, the interval is (33.5, 62.7), which contains 35 observations, representing 35/36 = 97% of the data. These percentages match very closely to what we expect for a normal distribution. We could also construct a histogram of the data, as shown below. This appears to confirm that the data do indeed approximate a normal distribution.

6.9 Sampling distributions

In Chapter 3, 'Statistics' was introduced as a discipline for data analysis. In this chapter we have encountered the normal probability distribution. Probability distributions typically have associated parameters, such as the (theoretical) mean, µ, and the (theoretical) variance, σ², for the normal distribution. By convention, Greek letters are used to denote population parameters, whose values in practice are typically unknown. The next few chapters of this half-course will cover statistical inference, whereby we infer unknown population parameters based on sample data.

As an example, suppose we wanted to investigate the average height of the UK population. As previously discussed, it is reasonable to assume that height is a normally-distributed random variable with a mean µ and a variance σ². What are the exact values of these parameters? To know these values precisely would require data on the heights of the entire UK population – all 60-plus million people!

Population sizes, denoted by N, are typically very large and clearly no-one has the time, money or patience to undertake such a marathon data collection exercise. Instead we opt to collect a sample (some subset of the population) of size n.5 Having collected our sample, we then estimate the unknown population parameters based on the known (i.e. observed) sample data. Specifically, we estimate population quantities based on their respective sample counterparts. A 'statistic' (singular noun) is simply a known function of data.
A sample statistic is calculated from sample data. At this point, be aware of the following distinction between an estimator and an estimate.

Estimator versus estimate

An estimator is a statistic (which is a random variable) describing how to obtain an estimate (which is a real number) of a population parameter.

Example 6.11 The sample mean X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ is the estimator of the population mean, µ. For example, if we had drawn at random from the population the sample data 4, 8, 2 and 6, then the estimate of µ would be x̄ = 20/4 = 5. Notice the notation – the estimator is written as X̄ and is a random variable, while the estimate is written as x̄ as it is computed for a specific sample, hence is fixed for that particular sample. Had the random sample been instead 2, 9, 1 and 4 (from the same population), then the estimator of µ would still be X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ, but the estimate would now be x̄ = 16/4 = 4.

So we see that sample statistics vary from sample to sample due to the random nature of the sample. Hence estimators are random variables with corresponding probability distributions, called sampling distributions.

5 If n = N, and we sample without replacement, then we have obtained a complete enumeration of the population, i.e. a census.

Sampling distribution

A sampling distribution is the probability distribution of an estimator.

Before we proceed further, let us take a moment to review some population quantities and their respective sample counterparts.

Population quantity                   Sample counterpart
Probability distribution              Histogram
(Population) mean, µ                  (Sample) mean, x̄
(Population) variance, σ²             (Sample) variance, s²
(Population) standard deviation, σ    (Sample) standard deviation, s
(Population) proportion, π            (Sample) proportion, p

The precision (or quality) of point estimates such as x̄, s² and p will depend on the sample size n, and in principle on the population size N, if finite.
In practice, if N is very large relative to n, then we can use approximations which are more than adequate for practical purposes, but would only be completely accurate if the population truly was infinite. In what follows we assume N is large enough to be treated as infinite.

Activity 6.13 Consider the population below with N = 4 elements:

A   B   C   D
3   6   9   12

(a) Calculate the population mean and variance.
(b) Write down the sampling distribution of the sample mean for samples of size n = 2 drawn without replacement where order does not matter.
(c) Using the result in (b), calculate the mean of the sampling distribution.
(d) Using the result in (b), calculate the variance of the sampling distribution.

Solution

(a) The population mean is:

µ = Σᵢ₌₁ᴺ xᵢ/N = (3 + 6 + 9 + 12)/4 = 7.5.

The population variance is:

σ² = Σᵢ₌₁ᴺ xᵢ²/N − µ² = (9 + 36 + 81 + 144)/4 − (7.5)² = 11.25.

(b) The sampling distribution of the sample mean for samples of size n = 2 is:

Sample   Values   X̄ = x̄   P(X̄ = x̄)
AB       3, 6     4.5       1/6
AC       3, 9     6         1/6
AD       3, 12    7.5       1/6
BC       6, 9     7.5       1/6
BD       6, 12    9         1/6
CD       9, 12    10.5      1/6

(c) The mean of the sampling distribution is:

E(X̄) = (4.5 + 6 + 7.5 + 7.5 + 9 + 10.5)/6 = 45/6 = 7.5.

(d) The variance of the sampling distribution is:

Var(X̄) = ((4.5)² + (6)² + (7.5)² + (7.5)² + (9)² + (10.5)²)/6 − (7.5)² = 3.75.

Activity 6.14 A random variable, X, can take the values 1, 2 and 3, each with equal probability. List all possible samples of size two which may be chosen when order matters, without replacement, from this population, and hence construct the sampling distribution of the sample mean, X̄.

Solution

Each possible sample has an equal probability of occurrence of 1/6.
The sampling distribution of X̄ is:

Possible samples        X̄ = x̄   P(X̄ = x̄)
(1, 2), (2, 1)          1.5       1/3
(1, 3), (3, 1)          2         1/3
(2, 3), (3, 2)          2.5       1/3

Activity 6.15 Discuss the differences or similarities between a sampling distribution of size 5 and a single (simple) random sample of size 5.

Solution

If 5 members are selected from a population such that every possible set of 5 population members has the same probability of being selected, then the sample is a simple random sample. In a sampling distribution of size 5, every possible sample of size 5 from the population is averaged and the result is the sampling distribution. The similarity is the inclusive nature of both the simple random sample as well as the sampling distribution.

Activity 6.16 A perfectly-machined regular tetrahedral (pyramid-shaped) die has four faces labelled 1 to 4. It is tossed twice onto a level surface and after each toss the number on the face which is downward is recorded. If the recorded values are x₁ and x₂, when order matters, then the observed sample mean is x̄ = (x₁ + x₂)/2. Write out the sampling distribution of the sample mean as a random quantity over repeated double tosses.

Solution

Each possible sample has an equal probability of occurrence of 1/16. The sampling distribution of X̄ is:

Possible samples                    X̄ = x̄   P(X̄ = x̄)
(1, 1)                              1         1/16
(1, 2), (2, 1)                      1.5       1/8
(1, 3), (3, 1), (2, 2)              2         3/16
(1, 4), (4, 1), (2, 3), (3, 2)      2.5       1/4
(2, 4), (4, 2), (3, 3)              3         3/16
(3, 4), (4, 3)                      3.5       1/8
(4, 4)                              4         1/16

6.10 Sampling distribution of X̄

Let us consider a large population (large enough to be treated as if it was infinite) which is normally distributed, i.e. N(µ, σ²). Suppose we take an initial random sample of size n and obtain the sample mean, x̄₁. Next we take a second, independent sample (from the same initial population) and compute x̄₂.
Continue taking new, independent samples, resulting in a series of sample means: x̄₁, x̄₂, x̄₃, . . .. These values will, of course, vary from sample to sample, hence if we constructed a histogram of these values we would see the empirical sampling distribution of X̄. Fortunately, however, we are often able to determine the exact, theoretical form of sampling distributions without having to resort to such empirical simulations.

Sampling distribution of X̄ for normal populations

When taking a random sample of size n from a N(µ, σ²) population:

X̄ ∼ N(µ, σ²/n).

The central limit theorem concerns the sampling distribution of X̄ when sampling from non-normal distributions.

Central limit theorem

When taking a random sample of size n from a non-normal population with finite mean µ and finite variance σ², then:

X̄ ∼ N(µ, σ²/n)

approximately, as n → ∞.

So the difference between sampling from normal and non-normal populations is that X̄ is exactly normally distributed in the former case, but only approximately in the latter case. The approximation is reasonable when n is at least 30, as a rule-of-thumb. Since this is an asymptotic approximation (i.e. as n → ∞), the larger n is, the better the normal approximation.

We can use standardisation to compute probabilities involving X̄, but we must remember to divide by σ/√n rather than σ, since the variance of X̄ is σ²/n. Note that σ/√n is known as the standard error of X̄.

Standard error

The standard deviation of an estimator is called its standard error. For example, X̄ ∼ N(µ, σ²/n) (either exactly, or approximately by the CLT), so Var(X̄) = σ²/n, hence the standard deviation of X̄, i.e. the standard error, is:

√Var(X̄) = √(σ²/n) = σ/√n.

Note the standard error decreases as n increases.

Example 6.12 A random sample of 16 values is taken from a normal distribution with µ = 25 and σ = 3. What is P(X̄ > 26)?
Since the population is normally distributed, the sampling distribution of X̄ is exactly N(µ, σ²/n) = N(25, 9/16). If Z is the standard normal random variable, then:

P(X̄ > 26) = P(Z > (26 − 25)/(3/√16)) = P(Z > 1.33) = 1 − Φ(1.33) = 1 − 0.9082 = 0.0918

using Table 4 of the New Cambridge Statistical Tables.

Activity 6.17 The weights of a large group of animals have mean 8.2 kg and standard deviation 2.2 kg. What is the probability that a random selection of 80 animals from the group will have a mean weight between 8.3 kg and 8.4 kg? State any assumptions you make.

Solution

We are not told that the population is normal, but n is 'large' so we can apply the central limit theorem. The sampling distribution of X̄ is, approximately:

X̄ ∼ N(µ, σ²/n) = N(8.2, (2.2)²/80).

Hence, using Table 4:

P(8.3 ≤ X̄ ≤ 8.4) = P((8.3 − 8.2)/(2.2/√80) ≤ Z ≤ (8.4 − 8.2)/(2.2/√80)) = P(0.41 ≤ Z ≤ 0.81) = P(Z ≤ 0.81) − P(Z ≤ 0.41) = 0.7910 − 0.6591 = 0.1319.

Activity 6.18 A random sample of 25 audits is to be taken from a company's total audits, and the average value of these audits is to be calculated.

(a) Explain what is meant by the sampling distribution of this average and discuss its relationship to the population mean.
(b) Is it reasonable to assume that this sampling distribution is normally distributed?
(c) If the population of all audits has a mean of £54 and a standard deviation of £10, find the probability that:
 i. the sample mean will be greater than £60
 ii. the sample mean will be within 5% of the population mean.

Solution

(a) The sample average (mean) is formed from 25 observations which are subject to sampling variability, hence the average is also subject to this variability. Its sampling distribution describes its probability properties. If a large number of such averages were independently sampled, then their histogram would be the (approximate) sampling distribution.
(b) It is reasonable to assume that this sampling distribution is normal due to the central limit theorem (CLT) – although the sample size is rather small.

(c) If n = 25, µ = 54 and σ = 10, then the CLT says that, approximately:

X̄ ∼ N(µ, σ²/n) = N(54, 100/25).

 i. We have:

P(X̄ > 60) = P(Z > (60 − 54)/√(100/25)) = P(Z > 3) = 0.00135.

 ii. We are asked for:

P(0.95 × 54 < X̄ < 1.05 × 54) = P((−0.05 × 54)/2 < Z < (0.05 × 54)/2) = P(−1.35 < Z < 1.35) = P(Z < 1.35) − (1 − P(Z < 1.35)) = 0.9115 − (1 − 0.9115) = 0.8230.

Activity 6.19 The distribution of salaries of lecturers in a university is positively skewed, with most lecturers earning near the minimum of the pay scale. What would a sampling distribution of the sample mean of size 2 look like? How about size 5? How about size 50?

Solution

Here we are given a non-normal (skewed) distribution. In order that the sampling distribution of the sample mean is (approximately) normal, a 'large' sample size is required, according to the central limit theorem. Sampling distributions based on 'small' sample sizes tend to resemble the population from which they are drawn. So, for samples of size 2 and 5, the sampling distributions would probably be positively skewed. However, the sample of size 50 would be approximately normal. Regardless, in all cases the sampling distribution would have a mean equal to the population mean, µ, and a variance of σ²/n.

6.11 Summary

This chapter covered the key points relating to the normal distribution and the central limit theorem. You should now be ready to embark on Chapters 7, 8 and 9, and work on the ideas of statistical estimation and inference. Do not worry if you found some later sections of this chapter difficult to understand. Work at the learning activities and the sample examination questions below.
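As a study aid, the small-population sampling distributions in Activities 6.13–6.16 can be checked by direct enumeration of all possible samples; a minimal sketch for Activity 6.13 (Python):

```python
from itertools import combinations

population = [3, 6, 9, 12]  # the N = 4 population of Activity 6.13

# All samples of size n = 2 without replacement, order irrelevant.
samples = list(combinations(population, 2))
means = [sum(s) / 2 for s in samples]

# Mean and variance of the sampling distribution of X-bar.
e_xbar = sum(means) / len(means)
var_xbar = sum(m ** 2 for m in means) / len(means) - e_xbar ** 2

print(e_xbar, var_xbar)  # 7.5 and 3.75, agreeing with the activity's solution
```

Replacing `combinations` with `itertools.product` enumerates ordered samples with replacement, as in the tetrahedral die of Activity 6.16.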
6.12 Key terms and concepts

Central limit theorem
Expectation operator
Normal distribution
Population mean
Population variance
Probability distribution
Random variable
Sampling distribution
Standard error
Standardisation
Statistical inference
Statistical tables

6.13 Learning activities

Solutions can be found in Appendix A.

1. Using Table 4 of the New Cambridge Statistical Tables, calculate:
(a) P(Z ≤ 1)
(b) P(Z ≥ 1)
where Z ∼ N(0, 1).

2. Check that (approximately):
(a) 68% of normal random variables fall within 1 standard deviation of the mean.
(b) 95% of normal random variables fall within 2 standard deviations of the mean.
(c) 99% of normal random variables fall within 3 standard deviations of the mean.
Draw the areas concerned on a normal distribution.

3. Given a normal distribution with a mean of 20 and a variance of 4, what proportion of the distribution would be:
(a) above 22
(b) between 14 and 16?

4. In an examination, the scores of students who attend schools of type A are normally distributed with a mean of 50 and a standard deviation of 5. The scores of students who attend schools of type B are normally distributed with a mean of 55 and a standard deviation of 6. Which type of school would have a higher proportion of students with marks below 45?

5. The manufacturer of a new brand of lithium battery claims that the mean life of a battery is 3,800 hours with a standard deviation of 250 hours. Assume the lifespans of lithium batteries follow a normal distribution.
(a) What percentage of batteries will last for more than 3,500 hours?
(b) What percentage of batteries will last for more than 4,000 hours?
(c) If 700 batteries are supplied, how many should last between 3,500 and 4,000 hours?

6. The following six observations give the time taken, in seconds, to complete a 100-metre sprint by all six individuals competing in a race. Note this is population data.
Individual   A    B    C    D    E    F
Time         15   14   10   12   20   15

(a) Find the population mean, µ, and the population standard deviation, σ, of the sprint times.
(b) Calculate the sample mean for each possible sample of:
 i. two individuals
 ii. three individuals
 iii. four individuals.
(c) Work out the mean for each set of sample means (it must come to µ) and compare the standard deviations of the sample means about µ.

This may take some time, but, after you have done it, you should have a clearer idea about sampling distributions!

6.14 A reminder of your learning outcomes

After completing this chapter, and having completed the essential reading and activities, you should be able to:

summarise the meaning of E(X) and Var(X)
compute areas under the curve for a normal distribution
state and apply the central limit theorem
explain the relationship between sample size and the standard error of the sample mean.

6.15 Sample examination questions

Solutions can be found in Appendix B.

1. The random variable X takes the values −1, 1 and 2 according to the following probability distribution:

x        −1     1    2
pX(x)    0.20   k    4k

(a) Determine the constant k and, hence, write down the probability distribution of X.
(b) Find E(X) (the expected value of X).
(c) Find Var(X) (the variance of X).

2. The scores on a verbal reasoning test are normally distributed with a population mean of µ = 100 and a population standard deviation of σ = 10.
(a) What is the probability that a randomly chosen person scores at least 105?
(b) A simple random sample of size n = 20 is selected. What is the probability that the sample mean will be between 97 and 104? (You may use the nearest values provided in the statistical tables.)

3. The weights of a large population of animals have a mean of 7.3 kg and a standard deviation of 1.9 kg.
(a) Assuming that the weights are normally distributed, what is the probability that a random selection of 40 animals from this population will have a mean weight between 7.0 kg and 7.4 kg?
(b) A researcher stated that the probability you calculated is approximately correct even if the distribution of the weights is not normal. Do you agree? Justify your answer.

Chapter 7
Point and interval estimation

7.1 Aims

This chapter develops the ideas of sampling introduced in Chapter 6 and looks at the actual estimation of population parameters using both normal and Student's t distributions. Your aims are to:

estimate values of common parameters such as means and proportions
decide which is the appropriate distribution to use – normal or Student's t.

7.2 Learning outcomes

After completing this chapter, and having completed the essential reading and activities, you should be able to:

calculate sample means, standard deviations and proportions, and demonstrate their use as estimates
construct a confidence interval for a mean, a proportion, the difference between means and the difference between proportions
know when to use Student's t distribution.

You do not need to:

demonstrate the central limit theorem
know about the concepts relating to a 'good' estimator.

7.3 Essential reading

Newbold, P., W.L. Carlson and B.M. Thorne, Statistics for Business and Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060] Sections 7.1–7.3 and 8.1–8.4.

In addition there is essential 'watching' of this chapter's accompanying video tutorials accessible via the ST104a Statistics 1 area at https://emfss.elearning.london.ac.uk

7.4 Further reading

Wonnacott, T.H. and R.J. Wonnacott, Introductory Statistics for Business and Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN 9780471615170] Chapter 7 and Sections 8.1–8.3 and 8.5.
7.5 Introduction

This chapter is concerned with evidence-based decision-making. It is about making a decision with respect to a population. The population is made up of a set of individual items known as elements. This could be, for example, a set of individuals or companies which constitute the market for your product, or the items being manufactured from a production line. The sort of information needed for a decision may be a mean value (for example, ‘how many items does an individual purchase per year, on average?’) or a proportion (‘what proportion of items manufactured has a fault?’). The associated decision may range from setting up extra capacity to cope with expected demand, to stopping the production line for re-adjustment.

In most cases it is impossible to gather information about the whole population (due to time and budget constraints), so instead one collects information about a sample drawn from the population and infers the required information about the population. In ST104a Statistics 1, we will look at the most commonly-used estimators. If you take ST104b Statistics 2, you will learn what the properties of a ‘good’ estimator are and look at measures to be estimated other than means and proportions.

In order to carry out this type of exercise, one obvious decision needs to be made. How large should the sample be? The answer to this question is ‘it depends’! Specifically, it depends on how variable the population is, on how accurate the input to the decision needs to be, and on how costly the data collection is.

In this chapter you will learn to construct confidence intervals. You will also look at the idea of sample size determination. How does it affect accuracy when you estimate population means and proportions? How do you use this information?

Note that inferring information about a parent (or theoretical) population using observations from a sample is the primary concern of the subject of statistics.
7.6 Principle of confidence intervals

A (point) estimate is our ‘best guess’ of an unknown population parameter’s value based on sample data. Due to the random nature of sample data, we do not expect to estimate the parameter exactly (unless we are very lucky). Hence there is some uncertainty in our estimation, so it would be advisable to communicate the level of uncertainty associated with the estimate.

Standard errors (recall the standard error is the square root of the variance of an estimator, i.e. its standard deviation) act as measures of estimation imprecision, and these are used in the construction of confidence intervals. Informally, you can think of a confidence interval as representing our ‘best guess plus or minus a bit’, where the magnitude of this ‘bit’ is dependent on the level of precision. More formally, an x% confidence interval covers the unknown parameter with x% probability over repeated samples. This method of expressing the accuracy of an estimate is easily understood and requires no statistical sophistication for interpretation.

So, it is important to distinguish (point) estimation – using sample data to obtain a numerical estimate of an unknown population parameter – from a confidence interval – an interval estimate of the parameter whose width indicates how reliable the point estimate is. (You may wish to review Section 6.9 to ensure that you are clear about the distinction between statistics (such as point and interval estimators) and parameters (population characteristics).)

Clearly, a very wide confidence interval would show that our estimate was not very reliable, as we could not be sure that its value was close to the true parameter value, whereas a narrow confidence interval would correspond to a more reliable estimate.
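The ‘covers the parameter with x% probability over repeated samples’ interpretation can be checked empirically. The following sketch (illustrative only, and not part of the syllabus) simulates repeated samples from a normal population with a known standard error and counts how often the interval x̄ ± 1.96 × σ/√n covers the true mean; the population values µ = 10, σ = 2 and the sample size are arbitrary choices for illustration.

```python
import random
from statistics import mean

random.seed(1)
mu, sigma, n, reps = 10.0, 2.0, 25, 10000   # arbitrary illustrative values
se = sigma / n ** 0.5                        # standard error of the sample mean
covered = 0
for _ in range(reps):
    xbar = mean(random.gauss(mu, sigma) for _ in range(n))
    # does the 95% interval around this sample's mean cover the true mu?
    if xbar - 1.96 * se < mu < xbar + 1.96 * se:
        covered += 1
print(covered / reps)   # close to 0.95
```

The observed coverage fluctuates around 0.95 from run to run, which is exactly the ‘over repeated samples’ idea: the interval is random, the parameter is fixed.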
The degree of confidence that we have in our confidence interval can be expressed numerically; an ideal situation is a narrow interval with a high coverage probability. With these points in mind we now show how such intervals can be computed from data in some basic situations.

7.7 General formulae for normally-distributed statistics

Chapter 6 introduced the sampling distribution of X̄. We will now draw on this sampling theory.

7.7.1 Standard error known

Let θ̂ be an unbiased1 estimator of θ such that its sampling distribution is:

  θ̂ ∼ N(θ, Var(θ̂)) = N(θ, (S.E.(θ̂))²)

recalling that the standard error of an estimator, denoted here as S.E.(θ̂), is the square root of its variance, denoted here as Var(θ̂). Assume this standard error is known. We require a confidence interval defined by a pair of values such that the probability of the interval covering θ, the coverage probability, is high.

Since θ̂ is normally distributed it follows that, upon standardising θ̂, we have:

  Z = (θ̂ − θ)/S.E.(θ̂) ∼ N(0, 1).

Hence, assuming we desire a 95% coverage probability, we have:

  P(−1.96 < (θ̂ − θ)/S.E.(θ̂) < 1.96) = 0.95

using Table 4 of the New Cambridge Statistical Tables. Note we can also write 1.96 as z0.025, that is the z-value which cuts off 2.5% probability in the upper tail of the standard normal distribution.

Since S.E.(θ̂) > 0 (a standard error must be strictly positive):

  0.95 = P(−1.96 < (θ̂ − θ)/S.E.(θ̂) < 1.96)
       = P(−1.96 × S.E.(θ̂) < θ̂ − θ < 1.96 × S.E.(θ̂))
       = P(−1.96 × S.E.(θ̂) < θ − θ̂ < 1.96 × S.E.(θ̂))
       = P(θ̂ − 1.96 × S.E.(θ̂) < θ < θ̂ + 1.96 × S.E.(θ̂))

where we multiply by −1 to go from the second to the third line, and hence the inequality signs are reversed.

1 An unbiased estimator means the estimate is correct on average, over repeated samples. This concept is covered in greater depth in ST104b Statistics 2.
Endpoints of a 95% confidence interval (standard error known)

A 95% confidence interval for θ has endpoints θ̂ ± 1.96 × S.E.(θ̂). Therefore, the reported confidence interval would be:

  (θ̂ − 1.96 × S.E.(θ̂), θ̂ + 1.96 × S.E.(θ̂)).

This is a simple, but very important, result which can be applied to give confidence intervals in many different situations such as the estimation of a mean, proportion, difference between means and difference between proportions.

The above derivation was for a 95% confidence interval, i.e. with a 95% coverage probability, which is a generally accepted confidence requirement. Of course, it is possible to have different levels of confidence, say 90% or 99%. Fortunately, we can use the same argument as above; however, a different multiplier coefficient drawn from the standard normal distribution is required (i.e. not z0.025 = 1.96). Using Table 5 of the New Cambridge Statistical Tables, it is easy to obtain such coefficients. For convenience, key values are given below, where zα denotes the z-value which cuts off 100α% probability in the upper tail of the standard normal distribution.

For 90% confidence, use the multiplier: z0.05 = 1.6449
For 95% confidence, use the multiplier: z0.025 = 1.9600
For 99% confidence, use the multiplier: z0.005 = 2.5758.

Check these values from Table 5 in the New Cambridge Statistical Tables, and make sure you know how to work out other ones, for example for 80% or 98% confidence.2

Unfortunately, the method used so far in this section is limited by the assumption that the standard error of θ̂ is known. This means, in effect, that we need to know the population variance and, in most practical cases, the assumption that we know either the population variance or the sampling standard error will not be justified. In such cases it will be necessary to estimate the standard error from the data.
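As a quick sketch of this result, the multipliers above can be wrapped in a small function; the function name and the example estimate and standard error below are hypothetical, chosen purely for illustration.

```python
# z-multipliers quoted in the text (from Table 5 of the statistical tables)
Z_MULTIPLIER = {0.90: 1.6449, 0.95: 1.9600, 0.99: 2.5758}

def confidence_interval(estimate, std_error, level=0.95):
    """Endpoints estimate +/- z * S.E.(estimate), for a known standard error."""
    z = Z_MULTIPLIER[level]
    return estimate - z * std_error, estimate + z * std_error

# hypothetical example: theta_hat = 50 with S.E. = 2
print(confidence_interval(50, 2))        # (46.08, 53.92)
print(confidence_interval(50, 2, 0.99))  # wider: (44.8484, 55.1516)
```

Note how the 99% interval is wider than the 95% one: a higher coverage probability demands a larger multiplier.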
This requires a modification both of the approach and, in particular, of the general formula for the endpoints of a confidence interval.

7.7.2 Standard error unknown

Here we consider cases where the standard error of θ̂ is estimated, hereby denoted E.S.E.(θ̂), which stands for the estimated standard error of θ̂. One might think that:

  (θ̂ − θ)/E.S.E.(θ̂) ∼ N(0, 1)

but because of the additional sampling variability of the estimated standard error, this new, transformed function of the data will have a more dispersed distribution than the standard normal – the Student’s t distribution on ν degrees of freedom (discussed below). Hence in fact:

  (θ̂ − θ)/E.S.E.(θ̂) ∼ tν.

7.7.3 (Student’s) t distribution

‘Student’ was the pen name of William S. Gosset (1876–1937) who is credited with developing this distribution.3 You should be familiar with its generic shape, which resembles the standard normal distribution (i.e. bell-shaped and symmetric about zero) but with ‘fatter’ tails. We get different members of the t distribution family for different degrees of freedom,4 ν. Graphical examples of the t distribution for various degrees of freedom are given in Figure 7.1, but note that as ν → ∞ (in words, ‘as the degrees of freedom tend to infinity’), the t distribution approaches the standard normal distribution – that is, tν → N(0, 1) as ν → ∞.

For our purposes, we will use this distribution whenever we are performing inference for population means when population variances are unknown, and hence are estimated from the data. The correct degrees of freedom will depend on the degrees of freedom used to estimate the variance (more on this later).

2 These would be z0.10 = 1.2816 and z0.01 = 2.3263, respectively.
3 Curious students may wonder why Gosset was not vain enough to call it Gosset’s t distribution – after all, who wouldn’t want to achieve such statistical immortality? In fact the reason for the pen name was the refusal of Guinness (his employer when he discovered this distribution) to allow its researchers to publish papers, fearing leaks of trade secrets, forcing Gosset to publish anonymously as ‘Student’. Poor Gosset.

4 We will not discuss the naming of ‘degrees of freedom’ in ST104a Statistics 1.

[Figure 7.1: Student’s t distribution – various degrees of freedom. Density functions of t1, t3, t30 and t∞, the last coinciding with the standard normal.]

Assuming a 95% coverage probability, using Table 10 of the New Cambridge Statistical Tables, for a given ν we can find t0.025, ν such that:

  P(−t0.025, ν < (θ̂ − θ)/E.S.E.(θ̂) < t0.025, ν) = 0.95

where t0.025, ν cuts off 2.5% probability in the upper tail of the t distribution with ν degrees of freedom. On rearranging the inequality within the brackets we get:

  P(θ̂ − t0.025, ν × E.S.E.(θ̂) < θ < θ̂ + t0.025, ν × E.S.E.(θ̂)) = 0.95.

Endpoints of a 95% confidence interval (standard error unknown)

A 95% confidence interval for θ has endpoints θ̂ ± t0.025, ν × E.S.E.(θ̂), leading to a reported confidence interval of the form:

  (θ̂ − t0.025, ν × E.S.E.(θ̂), θ̂ + t0.025, ν × E.S.E.(θ̂))

where t0.025, ν is the t-value which cuts off 2.5% probability in the upper tail of the t distribution with ν degrees of freedom, obtained from either Table 9 or Table 10 in the New Cambridge Statistical Tables.

This general result can be used in a large number of situations to compute confidence intervals. In order to use these results we need to know how to calculate or estimate, respectively, the standard errors for the various cases using the sample data. We now proceed to demonstrate these techniques for some important scenarios.
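The ‘fatter tails’ of tν relative to N(0, 1) can be seen in a small simulation (illustrative only, with arbitrary settings): for samples of size 5 from a normal population, the studentised statistic (x̄ − µ)/(s/√n) exceeds ±1.96 well over 5% of the time, so the standard normal cut-off would be the wrong multiplier once the standard error is estimated.

```python
import random
from statistics import mean, stdev

random.seed(42)
n, reps, mu, sigma = 5, 20000, 0.0, 1.0   # arbitrary illustrative settings
exceed = 0
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    t = (mean(x) - mu) / (stdev(x) / n ** 0.5)   # estimated standard error in use
    if abs(t) > 1.96:                            # the N(0,1) cut-off
        exceed += 1
print(exceed / reps)   # noticeably above 0.05: fatter tails than N(0,1)
```

This is precisely why the t0.025, ν value (here t0.025, 4 = 2.776 from the tables) rather than 1.96 must be used for small samples.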
Important note: In the following applications, whenever the t distribution is used and we have large sample size(s) (hence large degrees of freedom), it is acceptable to use standard normal values as approximations, due to the tendency of the t distribution to the standard normal distribution as the degrees of freedom approach infinity. What constitutes a ‘large’ sample size is rather subjective; however, as a rule-of-thumb treat anything over 30 as ‘large’ in ST104a Statistics 1.

7.8 Confidence interval for a single mean (σ known)

Given observed sample values x1, x2, . . . , xn, the point estimate of µ is x̄ = Σ xi/n (summing over i = 1, . . . , n). Assuming the (population) variance, σ², is known, the standard error of x̄ is σ/√n.

Confidence interval endpoints of a single mean (σ known)

In such instances, we use the standard normal distribution when constructing a confidence interval with endpoints:

  x̄ ± zα/2 × σ/√n  ⇒  (x̄ − zα/2 × σ/√n, x̄ + zα/2 × σ/√n)

where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the standard normal distribution, to ensure a 100(1 − α)% confidence interval. For example, for α = 0.05, we have a 100(1 − 0.05)% = 95% confidence interval, and we require the z-value which cuts off α/2 = 0.025, i.e. 2.5% probability in the upper tail of the standard normal distribution, i.e. 1.96, which can be obtained from Table 5 in the New Cambridge Statistical Tables.

Example 7.1 Measurements of the diameter of a random sample of 200 ball bearings produced by a machine gave a sample mean of x̄ = 0.824. The population standard deviation, σ, is 0.042. Construct a 95% confidence interval and a 99% confidence interval for the population mean value of the diameter of the ball bearings.

We are told that σ = 0.042.
So a 95% confidence interval, where α = 0.05, has endpoints:

  x̄ ± 1.96 × σ/√n = 0.824 ± 1.96 × 0.042/√200 = 0.824 ± 0.006

where z0.025 = 1.96 is the z-value which cuts off 2.5% probability in the upper tail of the standard normal distribution. In other words, the interval is (0.818, 0.830) which covers the population mean with a probability of 95%.

To compute a 99% confidence interval (where α = 0.01), since σ is known we require zα/2 = z0.005, that is the z-value which cuts off 0.5% probability in the upper tail of the standard normal distribution. Using Table 5 of the New Cambridge Statistical Tables, the z-value is 2.5758. So a 99% confidence interval has endpoints:

  x̄ ± 2.5758 × σ/√n = 0.824 ± 2.5758 × 0.042/√200 = 0.824 ± 0.008.

In other words, the interval is (0.816, 0.832). Note the higher level of confidence has resulted in a wider confidence interval.

7.9 Confidence interval for a single mean (σ unknown)

In practice it is unusual for σ² to be known – why would we know σ², but not µ? However, we can estimate σ² with the sample variance s², using the estimator:

  S² = (1/(n − 1)) × Σ (Xi − X̄)²  (summing over i = 1, . . . , n).

This is because S² is an unbiased estimator5 of σ². The degrees of freedom associated with this estimator are n − 1, where n is the sample size.

Confidence interval endpoints of a single mean (σ unknown)

In such instances, we use the t distribution when constructing a 100(1 − α)% confidence interval with endpoints:

  x̄ ± tα/2, n−1 × s/√n  ⇒  (x̄ − tα/2, n−1 × s/√n, x̄ + tα/2, n−1 × s/√n)

where tα/2, n−1 is the t-value which cuts off 100α/2% probability in the upper tail of the t distribution with n − 1 degrees of freedom, obtained from either Table 9 or Table 10 in the New Cambridge Statistical Tables.

Example 7.2 A company producing designer label jeans carries out a sampling exercise in order to estimate the average price which retailers are charging for the jeans.
A random sample of 12 retailers gives the following sample of prices (in £): 34.50, 36.00, 43.50, 29.50, 44.50, 47.50, 45.50, 53.00, 40.00, 38.00, 33.00, 54.50.

We seek a 95% confidence interval for the mean retailer’s price of the jeans. Clearly, n = 12 and it is not hard to check that x̄ = 41.625 and s = 7.84. So the estimated standard error of the sample mean is:

  s/√n = 7.84/√12 = 2.2632

on n − 1 = 11 degrees of freedom. Hence a 95% confidence interval for µ has endpoints 41.625 ± 2.201 × 2.2632, i.e. the confidence interval is:

  (£36.64, £46.61).

Two important points are the following.

Make sure you see where 2.201 comes from in Table 10 – it is t0.025, 11, i.e. the t-value above which lies 2.5% probability for a Student’s t distribution with 11 degrees of freedom.

Make sure you report confidence intervals in the form (£36.64, £46.61); that is, you must compute the actual endpoints and report these as an interval, as that is what a confidence interval is! Note the lower endpoint should be given first.

Activity 7.1 The reaction times, in seconds, for eight police officers were found to be: 0.28, 0.23, 0.21, 0.26, 0.29, 0.21, 0.25, 0.22. Determine a 90% confidence interval for the mean reaction time of all police officers.

Solution

Here n = 8 is small and σ² is unknown, so the formula for a 90% confidence interval for µ is:

  x̄ ± t0.05, 7 × s/√n.

It is easy to calculate x̄ = 0.2437 and s = 0.0311, so a 90% confidence interval has endpoints:

  0.2437 ± 1.895 × 0.0311/√8.

Hence the confidence interval is (0.2229, 0.2645).

Activity 7.2 A business requires an inexpensive check on the value of stock in its warehouse. In order to do this, a random sample of 50 items is taken and valued.
The average value of these is computed to be £320.41 with a (sample) standard deviation of £40.60. It is known that there are 9,875 items in the total stock.

(a) Estimate the total value of the stock to the nearest £10,000.

(b) Calculate a 95% confidence interval for the mean value of all items and hence determine a 95% confidence interval for the total value of the stock.

(c) You are told this confidence interval is too wide for decision purposes and you are asked to assess how many more items would need to be sampled to obtain an interval with the same degree of confidence, but with half the width.

Solution

(a) The total value of the stock is 9875µ, where µ is the mean value of an item of stock. We know that X̄ is the obvious estimator of µ, so 9875X̄ is the obvious estimator of 9875µ. Therefore, an estimate of the total value of the stock is:

  9875 × 320.41 = £3,160,000 (to the nearest £10,000).

(b) In this question σ² is unknown but n = 50 is large, so we can approximate tn−1 with the standard normal distribution. Hence for a 95% confidence interval for µ we use t0.025, 49 ≈ z0.025 = 1.96, giving:

  x̄ ± z0.025 × s/√n = 320.41 ± 1.96 × 40.6/√50 = 320.41 ± 11.25  ⇒  (£309.16, £331.66).

Note that because n is large we have used the standard normal approximation. It is more accurate to use a t distribution on 49 degrees of freedom. Using Table 10 of the New Cambridge Statistical Tables, the nearest available value is t0.025, 50 = 2.009. This gives a 95% confidence interval of:

  x̄ ± t0.025, 49 × s/√n = 320.41 ± 2.009 × 40.6/√50 = 320.41 ± 11.54  ⇒  (£308.87, £331.95)

so not much of a difference.

To obtain a 95% confidence interval for the total value of the stock, 9875µ, multiply the interval endpoints by 9875. This gives (to the nearest £10,000):

  (£3,050,000, £3,280,000).

(c) Increasing the sample size by a factor of k reduces the width of the confidence interval by a factor of √k.
Hence increasing the sample size by a factor of 4 will reduce the width of the confidence interval by a factor of 2 (= √4). Therefore, we need to increase the sample size from 50 to 4 × 50 = 200, i.e. collect another 150 observations.

7.10 Confidence interval for a single proportion

We often want to estimate a (population) proportion, for example, in surveys of people’s social, political or consumer attitudes.6 Take an opinion poll in which we investigate voters’ support for a particular political party. We can ask the question ‘Do you support party X?’, to which the n respondents each answer either ‘yes’ or ‘no’. If r people reply ‘yes’, then the sample proportion who support the party is r/n. This is our point estimate of the population proportion of support in the population, π.

6 In Chapter 4, categorical variables were introduced. Political party affiliation is an example of a categorical variable.

Sampling distribution of the sample proportion estimator, P

The sampling distribution of the sample proportion, P (which is the estimator of the population proportion, π),7 is:

  P ∼ N(π, π(1 − π)/n)   (7.1)

approximately, as n → ∞.

Hence Var(P) = π(1 − π)/n, so the standard error is S.E.(P) = √(π(1 − π)/n). Since this depends on π, precisely what we are trying to estimate, the true standard error is unknown, so must itself be estimated. As π is unknown, the best we can do is replace it with our point estimate of it, p = r/n, hence the estimated standard error is:

  E.S.E.(p) = √(p(1 − p)/n) = √((r/n)(1 − r/n)/n).

Confidence interval endpoints of a single proportion

A 100(1 − α)% confidence interval for a single proportion has endpoints:

  p ± zα/2 × √(p(1 − p)/n)  ⇒  (p − zα/2 × √(p(1 − p)/n), p + zα/2 × √(p(1 − p)/n))

where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the standard normal distribution, obtained from Table 5 in the New Cambridge Statistical Tables.
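These endpoints translate directly into a short sketch; the function name and the survey numbers below (r ‘yes’ responses out of n) are hypothetical, chosen only to illustrate the formula.

```python
from math import sqrt

def proportion_ci(r, n, z=1.96):
    """100(1-alpha)% CI for a population proportion; z is the normal multiplier."""
    p = r / n                              # point estimate p = r/n
    half = z * sqrt(p * (1 - p) / n)       # z * E.S.E.(p)
    return p - half, p + half

# hypothetical survey: 40 'yes' responses out of 100
print(proportion_ci(40, 100))   # about (0.304, 0.496)
```

Passing a different multiplier (e.g. z = 2.5758 for 99% confidence) widens the interval, exactly as for the mean.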
Note that although we are estimating a variance, for proportions we do not use Student’s t distribution, for two reasons.

The standard error has not been estimated by a corrected sum of squares calculation, unlike S² = Sxx/(n − 1).

The sample size n has to be large for the normal approximation to hold, and so the standard normal distribution is appropriate in this case due to the central limit theorem.

Example 7.3 A survey is conducted by a bank to estimate the proportion of its customers who would be interested in using a proposed new mobile telephone banking service. If we denote the population proportion of customers who are interested in this proposal by π, and it is found that 68 out of 150 customers are in favour, then we would estimate π to be p = 68/150 = 0.453. Hence a 95% confidence interval for π has endpoints:

  0.453 ± 1.96 × √(0.453 × (1 − 0.453)/150)  ⇒  (0.37, 0.53).

Note that for a 95% confidence interval, α = 0.05 and so we use z0.025 = 1.96 in the computation of the confidence interval.

7 This result is a consequence of the central limit theorem applied to the proportion of successes for a ‘binomial’ distribution. Full details can be found in ST104b Statistics 2.

Activity 7.3 A random sample of 100 voters contained 35 supporters of a particular party. Compute a 95% confidence interval for the proportion of this party’s supporters in the population.

Solution

In this question we are estimating a proportion with n = 100. Let π be the proportion of the party’s supporters in the population. With p = 35/100 = 0.35 and n = 100, a 95% confidence interval for π is calculated as:

  p ± z0.025 × √(p(1 − p)/n)  ⇒  0.35 ± 1.96 × √(0.35 × 0.65/100)  ⇒  (0.257, 0.443).

Activity 7.4 A claimant of Extra Sensory Perception (ESP) agrees to be tested for this ability. Blindfolded, he claims to be able to identify more randomly chosen cards than would be expected by pure guesswork.
An experiment is conducted in which 200 playing cards are drawn at random, and with replacement, from a deck of cards, and the claimant is asked to name their suits (hearts, diamonds, spades or clubs). Of the 200 cards he identifies 60 correctly. Compute a 95% confidence interval for his true probability of identifying a suit correctly. Is this evidence of ESP?

Solution

We have the sample proportion p = 60/200 = 0.3 and an estimated standard error of:

  √(0.3 × 0.7/200) = 0.0324.

A 95% confidence interval for the true probability of the correct identification of a suit is:

  p ± z0.025 × √(p(1 − p)/n)  ⇒  0.3 ± 1.96 × 0.0324  ⇒  (0.236, 0.364).

As 0.25 (pure guesswork) is within this interval, the claimant’s performance is not convincing!

Activity 7.5 400 college students, chosen at random, are interviewed and it is found that 240 use the refectory.

(a) Use these data to compute a:
 i. 95% confidence interval
 ii. 99% confidence interval
for the proportion of students who use the refectory.

(b) The college has 12,000 students. The college catering officer claims that the refectory is used by at least 9,000 students and that the survey has yielded a low figure due to sampling variability. Is this claim reasonable?

Solution

(a) i. We have p = 240/400 = 0.6 and the corresponding estimated standard error is:

  √(0.6 × 0.4/400) = 0.0245.

Therefore, a 95% confidence interval for the population proportion is:

  0.6 ± 1.96 × 0.0245  ⇒  (0.5520, 0.6480).

 ii. Similarly, a 99% confidence interval for the population proportion is:

  0.6 ± 2.576 × 0.0245  ⇒  (0.5369, 0.6631).

(b) A 99% confidence interval for the total number of students using the refectory is obtained by multiplying the interval endpoints by the total number of students, i.e. 12,000. This gives (6442, 7959). The catering officer’s claim of 9,000 is incompatible with these data as it falls well above the 99% confidence interval upper endpoint.
We conclude that the actual number of users is well below 9,000.

7.11 Sample size determination

The question ‘how large a sample do I need to select?’ is a common one when sampling. The answer to this depends on the quality of inference which the researcher requires from the data. In the estimation context, this can be expressed in terms of the accuracy of estimation.

If the researcher requires that there should be a 95% chance of the estimation error being no larger than e units (we refer to e as the tolerance on the sampling error), then this is equivalent to having a 95% confidence interval of width 2 × e. Note that here e represents the half-width of the confidence interval, since the point estimate is, by construction, at the centre of the confidence interval (in all cases seen in ST104a Statistics 1).

For known variance, a 95% confidence interval is based on the estimation error being no greater than 1.96 standard errors. Hence we can conveniently use the relation:

  1.96 × S.E.(θ̂) ≤ e

and solve this simple equation for the required sample size (note the standard error is a function of n).

Sample size determination for a single mean

To estimate µ to within e units with 100(1 − α)% confidence, we require a minimum sample size of:

  n ≥ (zα/2)² σ²/e²   (7.2)

where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the standard normal distribution, obtained from Table 5 in the New Cambridge Statistical Tables.

Sample size determination for a single proportion

To estimate π to within e units with 100(1 − α)% confidence, we require a minimum sample size of:

  n ≥ (zα/2)² p(1 − p)/e²   (7.3)

where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the standard normal distribution, obtained from Table 5 in the New Cambridge Statistical Tables, and p is the assumed value of π (obtained from a pilot study, or otherwise an assumption about the value of π).
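Formulae (7.2) and (7.3) translate directly into code. In this sketch the function names and the illustrative inputs are hypothetical; math.ceil enforces the rounding-up to an integer that the formulae require.

```python
from math import ceil

def n_for_mean(sigma, e, z):
    """Minimum n to estimate mu to within e units, via formula (7.2)."""
    return ceil((z * sigma / e) ** 2)

def n_for_proportion(p, e, z):
    """Minimum n to estimate pi to within e units, via formula (7.3)."""
    return ceil(z ** 2 * p * (1 - p) / e ** 2)

# hypothetical inputs: sigma = 10, tolerance e = 2, 95% confidence
print(n_for_mean(10, 2, 1.96))            # 97
# hypothetical pilot estimate p = 0.5, tolerance e = 0.05, 95% confidence
print(n_for_proportion(0.5, 0.05, 1.96))  # 385
```

Note that p = 0.5 maximises p(1 − p), so using it gives the most conservative (largest) sample size when no pilot estimate of π is available.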
Example 7.4 A random sample of 50 households is taken from a population of 10,000 households in an area of a city. The sample mean and standard deviation of weekly expenditure on beverages are £18 and £4, respectively. How many more observations are required to estimate µ to within 1 unit with 99% confidence?

Here we have e = 1, and we can assume σ = 4 since the initial sample with n = 50 is ‘large’ and hence we would expect that s ≈ σ. For 99% confidence, we use z0.005 = 2.5758. Therefore, using (7.2), we have:

  n ≥ (2.5758)² × 4²/1² = 106.16.

Remember that n must be an integer, so the smallest n satisfying this is 107.8 Therefore, 57 more observations are required.

8 Note that we round up, since rounding down would lead to less precision than we have specified.

Example 7.5 The reaction time of a patient to a certain stimulus is known to have a standard deviation of 0.05 seconds. How large a sample of measurements must a psychologist take in order to be 95% confident and 99% confident, respectively, that the error in the estimate of the mean reaction time will not exceed 0.01 seconds?

For 95% confidence, we use z0.025 = 1.96 using Table 5 of the New Cambridge Statistical Tables. So, using (7.2), n is to be chosen such that:

  n ≥ (1.96)² × (0.05)²/(0.01)².

Hence we find that n ≥ 96.04. Since n must be an integer, 97 observations are required to achieve an error of 0.01 or less with 95% probability.

For 99% confidence, we use z0.005 = 2.5758. So, using (7.2), n is to be chosen such that:

  n ≥ (2.5758)² × (0.05)²/(0.01)².

Hence we find that n ≥ 165.87. Since n must be an integer, 166 observations are required to achieve an error of 0.01 or less with 99% probability.

Note that a higher level of confidence requires a larger sample size, as more information (sample data) is needed to achieve a higher level of confidence for a given tolerance, e.

Example 7.6 A pilot study estimates a proportion to be 0.4.
If we wish to be 90% confident of estimating the population proportion with an error no greater than 0.03, how large a sample is required?

Here e = 0.03, and we have an initial estimate of p = 0.4. For 90% confidence, we use z0.05 = 1.645 using Table 5 of the New Cambridge Statistical Tables. Therefore, using (7.3), we have:

  n ≥ (1.645)² × 0.4 × (1 − 0.4)/(0.03)² = 721.61.

So, rounding up, we require a sample size of 722.

Activity 7.6 A simple random sample of 100 workers had weekly salaries with a mean of £315 and a standard deviation of £20.

(a) Calculate a 90% confidence interval for the mean weekly salary of all workers in the factory.

(b) How many more workers should be sampled if it is required that the estimate is to be within £3 of the true average (again, with 90% confidence)? Note this means a tolerance of £3 – equivalent to a confidence interval width of £6.

Solution

(a) We have n = 100, x̄ = 315 and s = 20. The estimated standard error is:

  s/√n = 20/√100 = 2.

Hence a 90% confidence interval for the true mean is:

  x̄ ± t0.05, n−1 × s/√n  ⇒  315 ± 1.645 × 2  ⇒  (311.71, 318.29)

where we use the approximation t0.05, 99 ≈ z0.05 = 1.645 since n is large.

(b) For the tolerance to be 3, we require:

  1.645 × 20/√n ≤ 3.

Solving this gives n ≥ 120.27, so we round up to get n = 121. Hence we need to take a further sample of 21 workers.

7.12 Difference between two population proportions

The correct approach for the comparison of two population proportions, π1 and π2, is, obviously, via the difference between the population proportions. The sample proportions P1 and P2 are, for large sample sizes n1 and n2, respectively, approximately normally distributed by the central limit theorem as, from (7.1), we have:

  P1 ∼ N(π1, π1(1 − π1)/n1)   and   P2 ∼ N(π2, π2(1 − π2)/n2)

approximately, as n1 → ∞ and n2 → ∞.
When random samples are drawn from two independent populations, then these distributions are statistically independent, and it can be shown that their difference is also (approximately) normally distributed such that:

  P1 − P2 ∼ N(π1 − π2, π1(1 − π1)/n1 + π2(1 − π2)/n2).   (7.4)

We see that Var(P1 − P2), and hence S.E.(P1 − P2) = √Var(P1 − P2), depends on the unknown parameters π1 and π2. Therefore, the variance and standard error of the difference in sample proportions are also unknown. So we must resort to the estimated standard error:

  E.S.E.(P1 − P2) = √(P1(1 − P1)/n1 + P2(1 − P2)/n2).

Confidence interval endpoints for the difference between two proportions

With point estimates of π1 and π2 of p1 = r1/n1 and p2 = r2/n2, respectively, a 100(1 − α)% confidence interval for the difference between two proportions has endpoints:

  p1 − p2 ± zα/2 × √(p1(1 − p1)/n1 + p2(1 − p2)/n2)   (7.5)

where zα/2 cuts off 100α/2% probability in the upper tail of the standard normal distribution, obtained from Table 5 in the New Cambridge Statistical Tables.

Example 7.7 We use (7.5) to construct 95% and 90% confidence intervals, respectively, for the difference between the population proportions of the general public who are aware of a particular commercial product before and after an advertising campaign. Two surveys were conducted and the results of the two random samples were:

                    Sample size   Number aware
  Before campaign        150            68
  After campaign         120            65

To avoid working with negative differences whenever possible (to keep notation simpler by working with positive values), let p1 = r1/n1 = 65/120 = 0.5417 and p2 = r2/n2 = 68/150 = 0.4533. Hence our point estimate of the difference is:

  ‘After’ − ‘Before’ = p1 − p2 = 0.5417 − 0.4533 = 0.0884.
So a 95% confidence interval for the difference in population proportions, π1 − π2, has endpoints:

0.0884 ± 1.96 × √(0.5417 × (1 − 0.5417)/120 + 0.4533 × (1 − 0.4533)/150)

which can be expressed as (−0.031, 0.208). Similarly, a 90% confidence interval has endpoints:

0.0884 ± 1.645 × √(0.5417 × (1 − 0.5417)/120 + 0.4533 × (1 − 0.4533)/150)

which can be expressed as (−0.012, 0.189).

Note that both confidence intervals include zero (since they both have a negative lower bound and positive upper bound). This suggests that there is no significant difference in public awareness. This idea has close parallels with hypothesis testing, introduced in Chapter 8.

Activity 7.7 In a market research study to compare two chocolate bar wrappings, 30 out of 50 children preferred a gold wrapping to a silver wrapping, whereas 25 out of 40 adults preferred the silver wrapping.

(a) Compute a 95% confidence interval for the difference in the true proportions in favour of gold wrapping between the two groups.

(b) It is decided to take further samples of children and adults so that we finally sample the same number of adults and children such that the final estimator of the difference between proportions is within 0.06 of the true difference (with 95% confidence). How many more adults and children need to be sampled?

Solution

(a) For 'Children − Adults', the estimate of the difference in population proportions is 30/50 − 15/40 = 0.225 (noting that since 25 out of 40 adults preferred silver, 15 out of 40 preferred gold). The estimated standard error of this estimate is:

√(p1(1 − p1)/n1 + p2(1 − p2)/n2) = √(0.6 × 0.4/50 + 0.375 × 0.625/40) = 0.1032.

A 95% confidence interval is:

0.225 ± 1.96 × 0.1032  ⇒  (0.023, 0.427).

(b) Let n be the required common sample size. We require:

1.96 × √((0.6 × 0.4 + 0.375 × 0.625)/n) ≤ 0.06.

Solving gives n = 507. Therefore, we must sample 457 more children and 467 more adults.

Activity 7.8 Two market research companies each take random samples to assess the public's attitude toward a new government policy.
If n represents the sample size and r the number of people against, the results of these independent surveys are as follows:

       Company 1   Company 2
n         400         900
r         160         324

If π is the population proportion against the policy, compute a 95% confidence interval to comment on the likely representativeness of the two companies' results.

Solution

Here we are estimating the difference between two proportions (if the two companies' results are compatible, then there should be no difference between the proportions). The formula for a 95% confidence interval of the difference is:

p1 − p2 ± z0.025 × √(p1(1 − p1)/n1 + p2(1 − p2)/n2).

The estimates of the proportions are p1 = 160/400 = 0.4 and p2 = 324/900 = 0.36, so a 95% confidence interval is:

0.4 − 0.36 ± 1.96 × √(0.4 × 0.6/400 + 0.36 × 0.64/900)  ⇒  (−0.017, 0.097).

Since zero is in this interval it is plausible that the two proportions are the same, so the companies' surveys are compatible, indicating likely representativeness.

Activity 7.9 A sample of 954 adults in early 1987 found that 23% of them held shares.

(a) Given a UK adult population of 41 million, and assuming a proper random sample was taken, compute a 95% confidence interval estimate for the number of shareholders in the UK in 1987 (following liberalisation of financial markets in the UK).

(b) A 'similar' survey the previous year (prior to financial liberalisation) had found a total of 7 million shareholders. Assuming 'similar' means the same sample size, find a 95% confidence interval estimate of the increase in shareholders between the two years.

Solution

(a) Let π be the proportion of shareholders in the population in 1987. Start by estimating π. We are estimating a proportion and n is large, so a 95% confidence interval for π is:

p ± z0.025 × √(p(1 − p)/n)  ⇒  0.23 ± 1.96 × √(0.23 × 0.77/954)  ⇒  (0.203, 0.257).

Therefore, a 95% confidence interval for the number (rather than the proportion) of shareholders in the UK in 1987 is obtained by multiplying the above interval endpoints by 41 million, resulting in:

9,430,000 ± 1,107,000  ⇒  (8,323,000, 10,537,000).

Therefore, we estimate there were about 9.43 million shareholders in the UK in 1987, with a margin of error of approximately 1.1 million.

(b) Let us start by finding a 95% confidence interval for the difference in the two proportions. We use the formula:

p1 − p2 ± z0.025 × √(p1(1 − p1)/n1 + p2(1 − p2)/n2).

The estimates of the proportions π1 and π2 are 0.23 and 7/41 = 0.171, respectively. We know n1 = 954, and although n2 is unknown we can assume it is approximately equal to 954 (as the previous year's survey was 'similar'), so a 95% confidence interval is:

0.23 − 0.171 ± 1.96 × √(0.23 × 0.77/954 + 0.171 × 0.829/954) = 0.059 ± 0.036

giving (0.023, 0.094). Multiply the interval endpoints by 41 million and we get a confidence interval of:

2,419,000 ± 1,476,000  ⇒  (943,000, 3,895,000).

We estimate that the number of shareholders has increased by about 2.4 million during the period covered by the surveys. There is quite a large margin of error of approximately 1.5 million, especially when compared with a point estimate (i.e. interval midpoint) of 2.4 million. However, it seems financial liberalisation increased the number of shareholders.

7.13 Difference between two population means

In this section we are primarily interested in the difference between two population means, i.e. µ1 − µ2. There are four cases to be considered, depending on whether the variances are known or unknown (in which case we can assume they are either equal or unequal), and the case of paired datasets.
7.13.1 Unpaired samples – variances known

Suppose we have random samples of size n1 and n2 from two normally-distributed populations, N(µ1, σ1²) and N(µ2, σ2²), respectively, where the variances σ1² and σ2² are known. The sampling distributions of X̄1 and X̄2 are, respectively:

X̄1 ∼ N(µ1, σ1²/n1)  and  X̄2 ∼ N(µ2, σ2²/n2).

A natural estimator of µ1 − µ2 is X̄1 − X̄2. Random samples drawn from two different populations are statistically independent, hence X̄1 and X̄2 are independent. The sampling distribution of their difference is:

X̄1 − X̄2 ∼ N(µ1 − µ2, σ1²/n1 + σ2²/n2).  (7.6)

Confidence interval endpoints for the difference between two means

If the population variances σ1² and σ2² are known, a 100(1 − α)% confidence interval for µ1 − µ2 has endpoints:

x̄1 − x̄2 ± zα/2 × √(σ1²/n1 + σ2²/n2)  (7.7)

where zα/2 cuts off 100α/2% probability in the upper tail of the standard normal distribution, obtained from Table 5 in the New Cambridge Statistical Tables.

7.13.2 Unpaired samples – variances unknown and unequal

We have the same set-up as above, with the same sampling distribution for X̄1 − X̄2 in (7.6), but now the population variances σ1² and σ2² are unknown. Assuming large sample sizes, say greater than 30, we can replace these unknown parameters with the respective sample variances s1² and s2² and continue to use standard normal values for the confidence coefficient. The justification is that since the sample sizes are large, we would expect fairly accurate estimates of the population variances, such that s1² ≈ σ1² and s2² ≈ σ2².
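This large-sample substitution can be sketched as follows. The function name and the numbers are ours, purely for illustration; z is the appropriate value from Table 5.

```python
from math import sqrt

def means_diff_ci_large(x1, s1, n1, x2, s2, n2, z):
    """Large-sample CI for mu_1 - mu_2 with unknown, unequal variances:
    (x1 - x2) +/- z * sqrt(s1^2/n1 + s2^2/n2)."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)
    return x1 - x2 - z * se, x1 - x2 + z * se

# Hypothetical figures: x1 = 100 (s1 = 10, n1 = 50),
# x2 = 90 (s2 = 12, n2 = 40), 95% confidence (z = 1.96).
lo, hi = means_diff_ci_large(100, 10, 50, 90, 12, 40, 1.96)
print(round(lo, 2), round(hi, 2))  # 5.36 14.64
```

Because both sample sizes exceed 30, the standard normal value is used even though the variances are estimated.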
Confidence interval endpoints for the difference between two means

If the population variances σ1² and σ2² are unknown, provided the sample sizes n1 and n2 are large (greater than 30), a 100(1 − α)% confidence interval for µ1 − µ2 has endpoints:

x̄1 − x̄2 ± zα/2 × √(s1²/n1 + s2²/n2)

where zα/2 cuts off 100α/2% probability in the upper tail of the standard normal distribution, obtained from Table 5 in the New Cambridge Statistical Tables.

Activity 7.10 In order to assess the impact of an advertising campaign, a restaurateur monitors her daily revenue before and after the campaign. The table below shows some sample statistics of daily sales calculated over a period of 60 days prior to the campaign, and 45 days after the campaign. Determine a 95% confidence interval for the increase in average daily sales due to the campaign. Is there strong evidence that the advertising campaign has increased sales?

                   Number of days   Mean daily sales   Standard deviation
Before campaign          60               £503                £21
After campaign           45               £559                £29

Solution

As both sample sizes are 'large' there is no need to use a pooled estimator of the variance as we would expect s1² ≈ σ1² and s2² ≈ σ2². A 95% confidence interval for the difference in means is:

x̄1 − x̄2 ± z0.025 × √(s1²/n1 + s2²/n2)  ⇒  559 − 503 ± 1.96 × √((21)²/60 + (29)²/45)  ⇒  (46.0, 66.0).

Zero is well below the lower endpoint, so there is evidence that the advertising campaign has increased sales.

7.13.3 Unpaired samples – variances unknown and equal

In some circumstances we may be able to justify the assumption that the two populations being sampled have equal variability. In that case, suppose we have random samples of sizes n1 and n2 from two normally-distributed populations, N(µ1, σ²) and N(µ2, σ²), respectively, i.e. the populations have a common variance, σ², which is unknown. The sampling distributions of X̄1 and X̄2 are:

X̄1 ∼ N(µ1, σ²/n1)  and  X̄2 ∼ N(µ2, σ²/n2).
(Note: if the population variances were known, then we would know their true values, so no assumptions would be necessary and we would use (7.7).)

A natural estimator of µ1 − µ2 is still X̄1 − X̄2. Since we have unpaired (independent) samples, X̄1 and X̄2 are statistically independent, hence the sampling distribution of their difference is:

X̄1 − X̄2 ∼ N(µ1 − µ2, σ²(1/n1 + 1/n2)).

The problem is that this common variance σ² is unknown, so it needs to be estimated. Should we use S1² or S2² as an estimator of σ²? Well, we use both by pooling the two sample variances, since both contain useful information about σ².

Pooled variance estimator

The pooled variance estimator of σ², using S1² and S2² which are sample variances from random samples of sizes n1 and n2, respectively, is:

Sp² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2)  (7.8)

on n1 + n2 − 2 degrees of freedom, where the subscript 'p' denotes 'pooled'.

Note that Sp² is the weighted average of the sample variances S1² and S2², where the weights are (n1 − 1)/(n1 + n2 − 2) and (n2 − 1)/(n1 + n2 − 2), respectively. If n1 = n2 then we attach equal weights to the sample variances. Intuitively, this should make sense. As the sample size increases, a sample variance provides a more accurate estimate of σ². If n1 ≠ n2, the sample variance calculated from the larger sample is more accurate, hence it is given greater weight in the pooled variance estimator. Of course, if n1 = n2, then the variances are equally accurate and so are given equal weights.
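The weighted average in (7.8) is short enough to write out directly. A minimal sketch, with illustrative numbers of our own choosing:

```python
def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Pooled variance estimator (7.8): a weighted average of the two
    sample variances, on n1 + n2 - 2 degrees of freedom."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

# Hypothetical sample variances: s1^2 = 4 (n1 = 10), s2^2 = 9 (n2 = 16).
# The larger sample's variance receives the greater weight (15/24 vs. 9/24).
sp_sq = pooled_variance(4, 10, 9, 16)
print(sp_sq)  # 7.125
```

Note the pooled value lies between the two sample variances, pulled towards the variance from the larger sample.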
Confidence interval endpoints for the difference between two means

If the population variances σ1² and σ2² are unknown but assumed equal, a 100(1 − α)% confidence interval for µ1 − µ2 has endpoints:

x̄1 − x̄2 ± tα/2, n1+n2−2 × √(sp²(1/n1 + 1/n2))  (7.9)

where sp² is the estimate of the common variance σ² from the pooled variance estimator (7.8), and where tα/2, n1+n2−2 is the t-value which cuts off 100α/2% probability in the upper tail of the Student's t distribution with n1 + n2 − 2 degrees of freedom, obtained from either Table 9 or Table 10 in the New Cambridge Statistical Tables.

An obvious question is how to decide whether to assume the unknown variances are equal or unequal. To assist with making this decision, consider the following points. If σ1² = σ2², then we would expect approximately equal sample variances, i.e. s1² ≈ s2², since both sample variances would be estimating the same (common) variance. If the sample variances are very different, then this would suggest σ1² ≠ σ2². If we are sampling from two similar populations (for example, two companies in the same industry) then an assumption of equal variability in these similar populations would be reasonable.

Example 7.8 Two companies supplying a similar service are compared for their reaction times (in days) to complaints. Random samples of recent complaints to these companies gave the following:

                   Company A   Company B
Sample size            12          10
Sample mean            8.5         4.8
Sample std. dev.       3.6         2.1

We construct a 95% confidence interval for the difference between mean reaction times and use this interval to decide if one company reacts faster than the other, on average. Because the markets are similar, it is reasonable to assume (though it is only an assumption!) that the two population variances are equal.
Under this assumption, using (7.8), we have:

sp² = ((12 − 1) × (3.6)² + (10 − 1) × (2.1)²) / (12 + 10 − 2) = 9.1125

on n1 + n2 − 2 = 12 + 10 − 2 = 20 degrees of freedom. (Notes: remember that for sufficiently 'large' degrees of freedom a standard normal approximation can be used, provided a justification is given. Also, a formal hypothesis test of H0: σ1² = σ2² could be performed, although this is covered in ST104b Statistics 2.)

So the estimated standard error of the difference between sample means is:

√(sp²(1/n1 + 1/n2)) = √(9.1125 × (1/12 + 1/10)) = √(9.1125 × 0.1833) = 1.2925.

Hence a 95% confidence interval for µA − µB, using (7.9), is:

8.5 − 4.8 ± 2.086 × 1.2925  ⇒  (1.01, 6.39)

where t0.025, 20 = 2.086 (we have estimated the common variance, so we use the Student's t distribution, here with 20 degrees of freedom) using Table 10 of the New Cambridge Statistical Tables. As zero does not lie in this interval, we conclude that the data suggest that Company B is faster than Company A in reacting to complaints on average. (Note the variable here is reaction time, so a lower sample mean suggests a faster average reaction time.)

Activity 7.11 A survey is conducted on time spent recording (in hours per track) for 25 music industry recording studios, classified as successful or unsuccessful according to their recent chart performances. The relevant statistics resulting from this study are:

                            Successful studios   Unsuccessful studios
Sample size                        12                     13
Sample mean                        9.2                    6.8
Sample standard deviation          1.4                    1.9

(a) Compute a 98% confidence interval for the difference between population mean recording times between successful and unsuccessful studios.

(b) On the basis of this confidence interval, do you consider this to be sufficient evidence of a true difference in mean recording times between the different types of studios? Justify your answer.
Solution

(a) Both samples are small (combined sample size < 30), so it will be necessary to use a pooled variance by assuming a common value of σ² and estimating it using the pooled variance estimator:

Sp² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2).

Also, since the sample sizes are small and the true variance is unknown, the Student's t distribution will be used with ν = (n1 − 1) + (n2 − 1) = n1 + n2 − 2 degrees of freedom. (Simply add the degrees of freedom from the two separate samples.)

Let the subscripts '1' and '2' refer to the successful and unsuccessful studios, respectively. Because the studios are in the same market (which is largely why we are bothering to compare them in the first place!) it is reasonable to assume the same variance for the recording times at studios for the two groups. Under this assumption:

sp² = ((12 − 1)(1.4)² + (13 − 1)(1.9)²) / (12 + 13 − 2) = 64.88/23 = 2.8209.

This variance is our best estimate of σ², and applies (by assumption) to both populations. If the true population means µ1 and µ2 are estimated by x̄1 and x̄2 (the sample means), respectively, then the central limit theorem allows us to use the following approximations:

Var(X̄1) ≈ Sp²/n1  and  Var(X̄2) ≈ Sp²/n2

with point estimates of 2.8209/12 and 2.8209/13, provided that we use the t distribution. The point estimates are 9.2 for µ1 and 6.8 for µ2. We estimate the difference between the means by:

x̄1 − x̄2 = 9.2 − 6.8 = 2.4.

Since the random samples are independent, we add the variances, hence:

sp²/n1 + sp²/n2 = 2.8209/12 + 2.8209/13 = 2.8209 × (1/12 + 1/13)

on 23 degrees of freedom. Therefore, a 98% confidence interval for the difference between the means is, noting here t0.01, 23 = 2.500 for 98% confidence:

2.4 ± 2.500 × √(2.8209 × (1/12 + 1/13))  ⇒  2.4 ± 2.500 × 0.6724

resulting in the confidence interval (0.7190, 4.0810).
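The interval in part (a) can be checked numerically by combining (7.8) and (7.9) in one helper (an illustrative sketch; the function name is ours, and the t-value is supplied from Table 10 rather than computed).

```python
from math import sqrt

def pooled_t_ci(x1, s1, n1, x2, s2, n2, t):
    """CI (7.9) for mu_1 - mu_2 under the equal-variance assumption,
    with the pooled variance (7.8); t is t(alpha/2, n1 + n2 - 2)
    taken from tables."""
    sp_sq = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = sqrt(sp_sq * (1 / n1 + 1 / n2))
    return x1 - x2 - t * se, x1 - x2 + t * se

# Activity 7.11: 98% confidence, t(0.01, 23) = 2.500 from Table 10.
lo, hi = pooled_t_ci(9.2, 1.4, 12, 6.8, 1.9, 13, 2.500)
print(round(lo, 4), round(hi, 4))  # 0.7191 4.0809
```

The tiny discrepancy against the hand calculation (0.7190, 4.0810) comes from the rounded intermediate values used in the text.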
(b) Since zero is not in this interval, it follows (with 98% confidence) that the true difference is greater than zero, so that more time (on average) is spent recording at the successful studios.

7.13.4 Paired (dependent) samples

Paired-sample methods are used in special cases when the two samples are not statistically independent. For our purposes, such paired data are likely to involve observations on the same individuals in two different states – specifically 'before' and 'after' some intervening event. A paired-sample experimental design is advantageous since it allows researchers to determine whether or not significant changes have occurred as a result of the intervening event free of bias from other factors, since these have been controlled for by observing the same individuals.

A necessary, but not sufficient, condition for the presence of paired sample data is that n1 = n2 in order to have pairs of data values. Common sense needs to be exercised to determine whether or not we have paired data. An example of such a dataset would be observations of the same individuals at two different points in time, typically 'before' and 'after' some event, such as statistical 'IQ' before and after taking ST104a Statistics 1.

This scenario is easy to analyse as the paired data can simply be reduced to a one-sample analysis by working with differenced data. That is, suppose two samples generated sample values x1, x2, . . . , xn and y1, y2, . . . , yn, respectively (note the same number of observations, n, in each sample). Compute the differences, that is:

d1 = x1 − y1,  d2 = x2 − y2,  . . . ,  dn = xn − yn.

Let x̄d denote the sample mean of the differenced data. It follows that x̄d = x̄ − ȳ, so by using the differences to compute a confidence interval for µd (the population mean of the differences), we obtain the required confidence interval for µX − µY. Therefore, the technique subsequently follows that in Section 7.9.
Example 7.9 The table below shows the before and after weights (in pounds) of 8 adults after a diet. Construct a 95% confidence interval for the mean loss of weight due to the diet. Are you convinced that the diet reduces weight on average?

Before   127   130   114   139   150   147   167   153
After    122   120   116   132   144   138   155   152

The differences (calculated as 'Before − After') are: 5, 10, −2, 7, 6, 9, 12, 1. Hence n = 8, x̄d = 6 and sd = 4.66 on n − 1 = 7 degrees of freedom, where x̄d and sd denote the sample mean and standard deviation of the differenced data, respectively. The estimated standard error of the sample mean is sd/√n = 4.66/√8 = 1.65. Using the t distribution on 7 degrees of freedom, for 95% confidence we use t0.025, 7 = 2.365. So a 95% confidence interval for the mean difference in weight before and after the diet is x̄d ± t0.025, n−1 × sd/√n, that is:

6 ± 2.365 × 1.65  ⇒  (2.1, 9.9).

Since zero is not included in this confidence interval, we conclude that the diet does appear to reduce weight on average, i.e. the average weight loss appears to be positive.

(Notes: experimental studies are discussed in Chapter 11. Also, regarding the statistical 'IQ' example above, one hopes, of course, that taking ST104a Statistics 1 would result in a mean increase, or at the very least no decrease, in statistical 'IQ'!)

Activity 7.12 Two advertising companies each give quotations for nine different campaigns. Their quotations (in £000s) are shown in the following table. Compute a 95% confidence interval for the true difference between average quotations. Can you deduce from this interval if one company is more expensive on average than the other?

Campaign     1    2    3    4    5    6    7    8    9
Company A   39   24   36   42   45   30   38   32   39
Company B   46   26   32   39   51   34   37   41   44

Solution

The data are paired observations for n = 9 campaigns. First compute the differences (calculated as 'B − A'):

7,  2,  −4,  −3,  6,  4,  −1,  9,  5.

The sample mean of these differences is 2.78 and the sample standard deviation is 4.58. (Make sure you know how to do these calculations!)
Therefore, a 95% confidence interval for the true difference, µd, is:

x̄d ± t0.025, n−1 × sd/√n  ⇒  2.78 ± 2.306 × 4.58/√9  ⇒  (−0.74, 6.30).

Since zero is in this confidence interval we cannot conclude that one company is more expensive than the other on average. It is possible that the differences in quotations are due to random variation.

7.14 Summary

The concepts of estimation are obviously extremely important when the true values of parameters are unknown. When constructing confidence intervals, make sure that you understand when Student's t distribution is required rather than the standard normal distribution. Remember:

we use the standard normal distribution when we know the variance, either as given by the researcher or as a population figure

if we have to estimate the variance from a sample, we will need to use Student's t distribution, whatever the size of the sample

if the sample size is large, then the standard normal distribution well-approximates Student's t

we use the standard normal distribution whenever we are dealing with proportions (as a consequence of the central limit theorem).

7.15 Key terms and concepts

Confidence interval
Coverage probability
Degrees of freedom
Differences
Endpoints
Half-width
Paired sample
Pooled variance
Standard error
Student's t distribution
Tolerance

7.16 Learning activities

Solutions can be found in Appendix A.

1. Would you say the following statement is true or false? 'When calculated from the same dataset, a 90% confidence interval is wider than a 95% confidence interval.' Explain your answer.

2. National figures for a blood test result have been collected and the population standard deviation is 1.2. You select a random sample of 100 observations and find a sample mean of 25 units. Construct a 95% confidence interval for the population mean.

3. Consider this variation of Example 7.1.
‘Measurements of the diameter of a random sample of 200 ball bearings produced by a machine gave a sample mean x̄ = 0.824. The sample standard deviation, s, is 0.042. Construct a 95% confidence interval and a 99% confidence interval for the population mean value of the diameter of the ball bearings.’

(Note that although this time you have been told that you only have a point estimate of the population standard deviation, you can justify using the standard normal distribution since the sample size is very large, and so your confidence interval will be exactly the same as in Example 7.1!)

4. Look at Table 10 of the New Cambridge Statistical Tables. Note that different probability tails are given for ν = 1, 2 etc. (ν is the same as n − 1 in this case). Now consider a 95% confidence interval for µ when n = 21 (i.e. for 20 degrees of freedom). You can see the t-value is t0.025, 20 = 2.086. However, as ν → ∞, the t-value converges to 1.96 – exactly the same as z0.025 for the standard normal distribution. Although t-values are given for quite large degrees of freedom, we generally assume that the standard normal distribution can be used instead of Student's t if the degrees of freedom are greater than 30 (some textbooks say 40, others say 50).

5. (a) Write down the formula for the standard error of the sample proportion when a random sample is selected from a very large population and the population proportion is equal to π. Give the formula for pooled proportions when comparing the two samples. (Note that you need to check this on your examination formula sheet for ST104a Statistics 1.)

(b) An independent assessment is made of the services provided by two holiday companies. Of a random sample of 300 customers who booked through company A, 220 said they were satisfied with the service. Of a random sample of 250 of company B's customers, 200 said they were satisfied.
For both companies the total customer base was very large. Construct a 95% confidence interval for the difference between the proportions of satisfied customers between the two companies. Basing your conclusion on this confidence interval, do you believe that one company generates more customer satisfaction than the other?

6. A random sample of 200 students is observed. 30 of them say they 'really enjoy' studying Statistics.

(a) Calculate the proportion of students in the sample who say they 'really enjoy' studying Statistics and then construct a 95% confidence interval for the population proportion of students who 'really enjoy' studying Statistics.

You now take a further random sample, in another institution. This time there are 100 students, and 18 say they 'really enjoy' studying Statistics.

(b) Construct a 95% confidence interval for the population proportion of students who 'really enjoy' studying Statistics based on this second sample.

(c) Construct a 95% confidence interval for the difference between the two proportions.

7.17 A reminder of your learning outcomes

After completing this chapter, and having completed the essential reading and activities, you should be able to:

calculate sample means, standard deviations and proportions, and demonstrate their use as estimates

construct a confidence interval for a mean, a proportion, the difference between means and the difference between proportions

know when to use Student's t distribution.

You do not need to:

demonstrate the central limit theorem

know about the concepts relating to a 'good' estimator.

7.18 Sample examination questions

Solutions can be found in Appendix B.

1. You are told that a 99% confidence interval for a single population proportion is (0.3676, 0.5324).

(a) What was the sample proportion that led to this confidence interval?

(b) What was the size of the sample used?

2.
In a random sample of 120 large retailers, 85 used linear regression as a method of forecasting sales. In an independent random sample of 163 small retailers, 78 used linear regression as a method of forecasting sales.

(a) Find a 98% confidence interval for the difference between the two population proportions.

(b) What do you conclude from your confidence interval?

(c) Repeat (a) but for a 94% confidence interval.

3. A random sample of 21 female students is chosen from students at higher education establishments in a particular area of a country, and it is found that their mean height is 165 centimetres with a sample variance of 81.

(a) Assuming that the distribution of the heights of the students may be regarded as normally distributed, calculate a 98% confidence interval for the mean height of female students.

(b) You are asked to obtain a 98% confidence interval for the mean height of width 3 centimetres. What sample size would be needed in order to achieve that degree of accuracy?

(c) Suppose that a sample of 15 had been obtained from a single student hostel, in which there were a large number of female students (still with a distribution of heights which was well-approximated by the normal distribution). The mean height is found to be 160 centimetres. Calculate a 99% confidence interval for the mean height of female students in the hostel. How do their heights compare with the group in part (a)?

Chapter 8
Hypothesis testing

8.1 Aims

In Chapters 6 and 7 you were introduced to the idea of the probability that a parameter could lie within a range of values, and in particular the confidence interval (generally 90%, 95% or 99% confidence) for a parameter. In this chapter, we are going to look at the idea of using statistics to see whether or not we should reject statements about these parameters – the concept of hypothesis testing. The arithmetic and underlying ideas you need are similar to those you met in the last chapter.
You should aim to consolidate your familiarity with the ideas of randomness and work on the different aspects of hypothesis testing as presented here.

8.2 Learning outcomes

After completing this chapter, and having completed the essential reading and activities, you should be able to:

set up the null and alternative hypotheses for a problem and state whether the latter is one-sided or two-sided, leading to a one-tailed or two-tailed test, respectively

define and apply the terminology of hypothesis testing

define and explain the types of errors which can be made in hypothesis testing

perform hypothesis tests on means and proportions.

8.3 Essential reading

Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and Economics. (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060] Sections 9.1–9.4 and 10.1–10.3.

In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials accessible via the ST104a Statistics 1 area at https://emfss.elearning.london.ac.uk

8.4 Further reading

Aczel, A.D. Complete Business Statistics. (London: McGraw–Hill Higher Education, 2009) seventh edition [ISBN 9780071287531] Chapters 7, 8 and 14 (taking care to omit topics not in the learning outcomes).

Anderson, D.R., D.J. Sweeney, T.A. Williams, J. Freeman and E. Shoesmith Statistics for Business and Economics. (South-Western Cengage Learning, 2010) eleventh edition [ISBN 9780324783247] Chapter 9 and Sections 10.2, 12.2 and 14.9.

Lind, D.A., W.G. Marchal and S.A. Wathen Statistical Techniques in Business and Economics. (Boston: McGraw–Hill Higher Education, 2009) fourteenth edition [ISBN 978007110004] Chapters 9, 10, 11 and the latter half of 16.

Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and Economics. (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN 9780471615170] Chapter 9.

8.5 Introduction

In this chapter we introduce a branch of statistical inference known as hypothesis testing.
Typically, we choose between two statements about the value of a parameter based on evidence obtained from sample data which, since sample data are subject to randomness, can be regarded as realisations of some random variable of interest.

Our objective is to choose between these two opposite statements about the population, where these statements are known as hypotheses. By convention these are denoted by H0 and H1. Qualitative examples of such hypotheses are:

Null hypothesis, H0:
A modified process does not produce a higher yield than the standard process.
True economic output has not increased over a year.
A person is not gifted with Extra Sensory Perception.
The average level of lead in the blood of people in a particular environment is as high as 3.0.

Alternative hypothesis, H1:
A modified process does produce a higher yield than the standard process.
True economic output has increased over a year.
A person is gifted with Extra Sensory Perception.
The average level of lead in the blood of people in a particular environment is lower than 3.0.

From these examples we see that we use H0 to represent ‘no difference’, ‘no improvement’, ‘no effect’ etc., and this is known as the null hypothesis, while H1 is the alternative hypothesis.

Many statistical procedures can be represented as statements about the values of population parameters such as the mean, µ, or variance, σ². The first step in any hypothesis testing problem is to ‘translate’ the real problem into its technical counterpart. In ST104a Statistics 1, the hypotheses will only contain parameters. For example, the previous hypotheses can all be ‘translated’ into technical forms similar to:

‘We observe x1, x2, . . . , xn from N(µ, σ²) and we wish to test:

H0 : µ = µ0  vs.  H1 : µ > µ0

where µ0 is some specified value such as 0 or 3.’

Performing the ‘translation’ correctly can itself be complicated, and so this requires careful thought.
Specifically, it is the form of H1 which needs consideration. The null hypothesis, H0, will always denote the parameter value with equality, i.e. ‘=’,¹ such as:

H0 : µ = µ0.

In contrast the alternative hypothesis, H1, will take one of three forms, i.e. using ‘≠’, ‘<’ or ‘>’, that is:

H1 : µ ≠ µ0 or H1 : µ < µ0 or H1 : µ > µ0.

Note that only one of these forms will be used per test. To determine which form to use will require careful consideration of the wording of the problem.

The form H1 : µ ≠ µ0 is an example of a two-tailed (or two-sided) test and we use this form with problems worded such as ‘test the hypothesis that µ is zero’. Here, there is no indication of the value of µ if it is not zero – do we assume µ > 0 or µ < 0 in such cases? We cannot be sure, so we take the safe option of a two-sided test, i.e. H1 : µ ≠ 0. In contrast, had the problem been phrased as ‘test whether µ is greater than zero’, then unambiguously we would opt for H1 : µ > µ0. This is an example of an upper-tailed (or one-sided) test. Unsurprisingly, ‘test whether µ is less than zero’ leads to H1 : µ < µ0 which is a lower-tailed (also one-sided) test.

Later, when testing for differences between two population means or two population proportions, you need to look out for comparative phrases indicating if one population value should exceed the other (for example, testing whether group A is on average faster than group B). Practising problems will make you proficient in correctly specifying your hypotheses.

8.6 Statistical tests

Consider the following two situations.

1. Suppose that it has been suggested that a particular environment is such that the average lead content of blood is as high as 3 units. In order to assess the truth of this, 50 blood samples selected at random are analysed and it is found that the average of these is 2.8 and the sample standard deviation is 0.20.² Therefore, the estimated standard error is s/√n = 0.20/√50 = 0.0283, so that the sample mean is more than 7 (estimated) standard errors below the hypothetical mean – which is highly unlikely if the hypothesis is true! The data are extremely inconsistent with the hypothetical mean of 3, and so we reject this value on the evidence of the data.

2. A coin is tossed 100 times in order to decide whether or not it is a fair coin. We observe 98 heads and 2 tails. Common sense tells us that this is (very!) strong evidence that the coin is biased toward heads. More formally, we reject the hypothesis that the coin is fair as the chance of obtaining 98 heads with a fair coin is extremely low – the observation, 98 heads, is inconsistent with the coin being fair since we would have expected (approximately) 50 heads and the observation significantly departed from this expectation.

In these examples it would be perfectly reasonable to make common sense judgements since the data were so extreme, but these ‘common sense judgements’ are using exactly the same underlying logic as do hypothesis tests which involve formal test statistics. By fully understanding this logic, we are able to make optimal decisions in less extreme and more complicated cases.

¹ Such an H0 is called a simple null hypothesis. It is possible to have a composite null hypothesis, such as H0 : µ ≥ 0, which allows for more than one parameter value, although in ST104a Statistics 1 we will only focus on simple forms of H0.

² Even when the population mean of the blood lead level is 3, it is unlikely we will observe the sample mean to be exactly 3. However, we do expect the sample mean to be close to 3, while being willing to tolerate a certain amount of deviation from that value, due to chance. The important question here is ‘Is this deviation from 3 due to chance alone, or is it too large to be explained purely by chance?’. The deviation of 2.8 from 3 is far too large, when expressed in standard error units.
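Both informal judgements above can be checked numerically. The following sketch uses only the Python standard library; the variable names are our own, not notation from the guide. It reproduces the standard error calculation for the blood-lead example and the binomial tail probability for the coin example.

```python
import math

# Blood-lead example: how many (estimated) standard errors does the
# sample mean of 2.8 lie below the hypothesised mean of 3?
n, xbar, s, mu0 = 50, 2.8, 0.20, 3.0
se = s / math.sqrt(n)             # estimated standard error, about 0.0283
z = (xbar - mu0) / se             # about -7.07 standard errors
print(round(se, 4), round(z, 2))

# Coin example: probability of 98 or more heads in 100 tosses of a fair
# coin, summed from the binomial distribution -- vanishingly small.
p_tail = sum(math.comb(100, k) for k in range(98, 101)) * 0.5**100
print(p_tail)
```

Both numbers confirm the informal reasoning: deviations this large are effectively impossible if the respective null hypotheses were true.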
A further advantage of the formal approach is that it enables us to express numerically the degree of confidence in our inference procedure. This can be extremely important in communicating our conclusions via reports, presentations etc.

Choosing between competing hypotheses requires us to conduct a statistical test. The test procedures themselves involve straightforward calculations using the sample data. At this point it is worth emphasising that in ‘classical testing’ we always assume the null hypothesis is true by performing the test conditional on H0 being true. That is, H0 is our ‘working hypothesis’ which we hold to be true until we obtain significant evidence against it.

A test statistic is the formal mechanism used to evaluate the support given to H0 by sample data. Different test statistics are used for testing different forms of hypotheses. For example, tests of a single mean, single proportion, differences between means or proportions all require different test statistics. Since H0 refers to the tested value of a population parameter, we use a suitable estimator to estimate it based on the sample data. For example, X̄ would be used when testing µ.

Test statistics are constructed from the sampling distribution of the estimator (introduced in Chapter 6). Hence test statistics are themselves random variables, as they are functions of estimators (which are random variables). It follows, therefore, that test statistics also follow probability distributions as they are random variables. Since we assume H0 is true, the test statistic’s distribution is conditional on H0.

Using the given sample data, we evaluate the test statistic to obtain a test statistic value (analogous to obtaining a point estimate using an estimator). If we find this test statistic value is sufficiently extreme – as determined by the test statistic distribution – then we will have sufficiently strong evidence to justify rejecting H0.
Of course, we need to specify what constitutes a ‘sufficiently extreme’ test statistic value. In fact we define a critical region which spans test statistic values which are considered sufficiently extreme. It follows that our decision rule is to reject H0 if the test statistic value falls in this critical region. If it does not, then we fail to reject H0, which is retained as our working hypothesis. For any test, the critical region is defined by one or two critical values, depending on whether we are performing a one-tailed or two-tailed test, respectively. These critical values are obtained from the test statistic’s distribution under H0.

8.7 Types of error

In any hypothesis test there are two types of inferential decision error which could be committed. Clearly, we would like to reduce the probabilities of these errors as much as possible. These two types of error are called Type I error and Type II error.

Type I and Type II errors

• Type I error: rejecting H0 when it is true, which is a false positive. We denote the probability of this type of error by α.
• Type II error: failing to reject H0 when it is false, which is a false negative. We denote the probability of this type of error by β.

Both errors are undesirable and, depending on the context of the hypothesis test, it could be argued that either one is worse than the other.³ However, on balance, a Type I error is usually considered to be more problematic. The possible decision space can be presented as:

Decision           H0 true             H1 true
Not reject H0      Correct decision    Type II error
Reject H0          Type I error        Correct decision

For example, if H0 was being ‘innocent’ and H1 was being ‘guilty’, a Type I error would be finding an innocent person guilty (bad for him/her), while a Type II error would be finding a guilty person innocent (bad for the victim/society/justice, but admittedly good for him/her!).
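The meaning of α can be illustrated by simulation: if we repeatedly test a true H0 at the 5% significance level, we should commit a Type I error in roughly 5% of samples. A minimal sketch follows; the population, sample size, number of trials and random seed are all illustrative assumptions of ours.

```python
import random
import statistics

# Simulation: test a TRUE H0 (mu = 0) many times at the 5% significance
# level and record how often it is (wrongly) rejected.
random.seed(42)
z_crit = 1.96                      # two-tailed 5% critical values: +/- 1.96
n, trials, rejections = 30, 20_000, 0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]   # H0 true by construction
    z = statistics.mean(sample) * n ** 0.5   # (x_bar - 0)/(sigma/sqrt(n)), sigma = 1
    if abs(z) > z_crit:
        rejections += 1            # a Type I error: rejecting a true H0
type_i_rate = rejections / trials
print(round(type_i_rate, 3))       # close to alpha = 0.05 by construction
```

The observed rejection rate settles near 0.05, the chosen probability of a Type I error.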
The complement of a Type II error, that is 1 − β, is called the power of the test – the probability that the test will reject a false null hypothesis. You will not be expected to calculate either β or (1 − β) in ST104a Statistics 1. However, you should understand the implications of the different types of error and be able to complete the chart in the second learning activity at the end of this chapter.

Example 8.1 In the example about lead levels discussed earlier, we would make a Type I error if we decided the lead level was less than 3 when it was actually as high as 3, and we would make a Type II error if we believed the level to be 3 when it was actually lower than 3.

These errors are taken into consideration in the formulation of test procedures – the standard tests presented below are all devised to minimise the probabilities of making such errors.

³ Which is better, a medical test incorrectly concluding a healthy person has an illness, or incorrectly concluding that an ill person is healthy?

Activity 8.1

(a) Briefly define Type I and Type II errors in the context of the statistical test of a hypothesis.
(b) What is the general effect on the probabilities of each of these errors happening if the sample size is increased?

Solution

(a) A Type I error is rejecting H0 when it is true. A Type II error is failing to reject a false H0. We can express these as conditional probabilities as follows:

α = P(Type I error) = P(Reject H0 | H0 is true)

and:

β = P(Type II error) = P(Not reject H0 | H1 is true).

(b) Increasing the sample size decreases the probabilities of making both types of error.

8.8 Tests for normal populations

Let us suppose that, having obtained sample data from a normal distribution, we use the estimator θ̂ to estimate an unknown parameter θ. Assume θ̂ has the sampling distribution:

θ̂ ∼ N(θ, V)

where V = Var(θ̂) is assumed to be known. Hence the standard error, S.E.(θ̂), is also known.
The hypothesis test will proceed using the following steps.

1. Define the hypotheses. In this case suppose we test:

H0 : θ = θ0 vs. H1 : θ > θ0.

Hence large values of θ̂ would imply H1, while small values would tend to favour H0. Note in this case the alternative hypothesis is one-sided and this has an effect on the format of the test, i.e. here we perform an upper-tailed test.

2. State the test statistic and its distribution, then compute its value. Regardless of the type of parameter we are testing, we choose a test statistic whose distribution, conditional on H0, has a tabulated distribution (i.e. one for which we have statistical tables) such as the standard normal, Student’s t, or chi-squared distributions (the chi-squared distribution is introduced in Chapter 9). In this case, due to the known standard error, standardisation of θ̂ readily yields:

(θ̂ − θ0) / S.E.(θ̂) ∼ N(0, 1).

Next compute the test statistic value using the sample data, where θ̂ will be our point estimate of θ.

3. Define the critical region for a given significance level, α.⁴ Obtain critical value(s) from statistical tables to define the critical region (also referred to as the rejection region). For this example, let α = 0.05, where α is our chosen probability of a Type I error. Since we are conducting an upper-tailed test, we require zα = z0.05 = 1.645, i.e. the z-value, for the standard normal distribution, above which lies α = 0.05 = 5% probability.

4. Decide whether or not to reject H0. Now we decide whether or not to reject H0 (our working hypothesis which, up to now, is assumed to be true). If the test statistic value lies in the critical region then we will reject H0, otherwise we do not reject H0. A rejected H0 means the test is statistically significant at the specific significance level used. In the examination you will typically be required to test at two significance levels sequentially – more on this later.

5. Draw conclusions.
It is always important to draw conclusions in the context of the original problem. It is not sufficient to conclude ‘the test is significant at the 5% significance level’. This is just a technical step which guides us to make a better decision about the original real-world problem, and final conclusions should be drawn in terms of that problem.

An important aspect of this test is the simple form of the test statistic, namely:

(point estimator − hypothesised value) / (estimated) standard error.

Although this is not the format of all test statistics, it does cover a wide class of situations, such as the:

• test of a single population mean, H0 : µ = µ0
• test of a single population proportion, H0 : π = π0
• test of the equality of two means, H0 : µ1 = µ2
• test of the equality of two proportions, H0 : π1 = π2.

8.9 Significance levels

Step 3 of the previous test procedure referred to a significance level. We control for the probability of committing a Type I error by setting the probability of this to be α. Hence the significance level, or size, of a test is just the probability of committing a Type I error. If we perform a test at the 5% significance level, say, then we are actually basing our decision on a procedure which gives us a 5% probability of making a Type I error when H0 is true.

In the development of the general normal test above, we used a significance level of 5%. This is one of the most widely-used significance levels, although 1% and 10% are also sometimes used. The value of quoting the significance level used for the test is that it gives an acceptable method of expressing the user’s confidence in the decision which was made. A test which is significant at the 1% significance level expresses a strong conviction that H0 is untrue, whereas a test which is significant only at the 10% significance level expresses merely some doubt in H0.

⁴ See the next section for a discussion of significance levels.
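The five-step procedure for the upper-tailed case can be sketched as a short function; the helper below is our own illustration, not notation from the subject guide, and uses the standard library’s normal quantile function in place of statistical tables.

```python
from statistics import NormalDist

def upper_tailed_z_test(estimate, theta0, std_error, alpha=0.05):
    """Test H0: theta = theta0 vs. H1: theta > theta0, assuming the
    standardised estimator is N(0, 1) under H0. Illustrative sketch."""
    z = (estimate - theta0) / std_error        # step 2: test statistic value
    z_crit = NormalDist().inv_cdf(1 - alpha)   # step 3: critical value z_alpha
    return z, z_crit, z > z_crit               # step 4: reject H0 if z > z_alpha

# With alpha = 0.05 the critical value is 1.645, as in the text; the
# estimate, theta0 and standard error here are made-up inputs.
z, z_crit, reject = upper_tailed_z_test(estimate=3.1, theta0=3.0, std_error=0.05)
print(round(z, 1), round(z_crit, 3), reject)
```

Step 5, drawing conclusions in context, remains a matter of interpretation rather than computation.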
The significance levels – 10%, 5% and 1%, corresponding to α values of 0.10, 0.05 and 0.01, respectively – provide a means of deciding which values of the test statistic are extreme. Consider, for example, an upper-tailed test of:

H0 : µ = 2 vs. H1 : µ > 2.

We calculate the test statistic value and if it falls deep enough into the right-hand tail of the test statistic’s distribution (the critical region) then we reject H0. Where do we draw the line which signals the start of this critical region? That is, what critical value should we use? An accepted convention is to use a critical value which ‘cuts off’ a tail of size α. So, if the test statistic follows a standard normal distribution, for an upper-tailed test, using Table 5 of the New Cambridge Statistical Tables, we denote this cut-off point as zα, which for:

α = 0.10 ⇒ z0.10 = 1.2816
α = 0.05 ⇒ z0.05 = 1.6449
α = 0.01 ⇒ z0.01 = 2.3263

such that if the test statistic value exceeds the critical value, then the ‘test is significant at the 100α% significance level’.

Clearly, significance at the 1% significance level (2.3263 < test statistic value) implies significance at the 5% significance level (since 1.6449 < 2.3263) which, in turn, implies significance at the 10% significance level (since 1.2816 < 1.6449 < 2.3263). When a test is significant, we should always state the strongest level of significance possible, as this provides the best measure of how inconsistent the data are with respect to the null hypothesis.

If the test statistic value falls between 1.2816 and 1.6449, then the test is significant at the 10% significance level, but not at the 5% significance level. This type of result arises from rather inconclusive data and so it would be better to sample more data in the hope that a firmer conclusion can be drawn one way or the other. Such an outcome is saying that the null hypothesis is to be treated with suspicion, but that we cannot confidently reject it.
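The nesting of significance levels can be verified by computing the three cut-offs from the standard normal quantile function and reporting the strongest level at which a given upper-tailed test statistic value is significant. The helper below is an illustrative sketch of ours.

```python
from statistics import NormalDist

# Upper-tail cut-offs; these reproduce the tabulated 1.2816, 1.6449, 2.3263.
cutoffs = {0.10: NormalDist().inv_cdf(0.90),
           0.05: NormalDist().inv_cdf(0.95),
           0.01: NormalDist().inv_cdf(0.99)}

def strongest_level(z):
    """Smallest alpha at which an upper-tailed z value is significant,
    or None if not significant even at the 10% level. Illustrative helper."""
    for alpha in (0.01, 0.05, 0.10):      # try the strongest level first
        if z > cutoffs[alpha]:
            return alpha
    return None

print(strongest_level(2.5), strongest_level(1.5), strongest_level(1.0))
```

A value of 2.5 is significant even at the 1% level, 1.5 only at the 10% level, and 1.0 at none of the conventional levels.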
These points and ideas will be reinforced in the subsequent examples.

8.9.1 Order of conducting tests

The best practice for answering hypothesis test questions in the examination is to test at two significance levels. However, the order in which you apply the significance levels is taken into account by the examiners. Noting the discussion above, a test which is significant at the 5% significance level, say, must also be significant at the 10% significance level, so you would be penalised if you used 10% as your second significance level having just rejected at the 5% significance level, as there is no ‘value added’ from use of the 10% significance level. In such a situation it would be more appropriate to test at the 1% significance level next.

Unless an examination question states otherwise, a sensible strategy to follow is to initially test at the 5% significance level and then test either at the 1% or 10% significance level, depending on whether or not you have rejected at the 5% significance level. A ‘decision tree’ depiction of the procedure to follow is presented in Figure 8.1, and can be summarised as:

• Start at the 5% significance level.
• If H0 is rejected, choose the 1% level next: rejecting H0 again means the test result is ‘highly significant’; not rejecting means it is ‘moderately significant’.
• If H0 is not rejected, choose the 10% level next: rejecting H0 means the test result is ‘weakly significant’; not rejecting means it is ‘not significant’.

Figure 8.1: Significance level decision tree.

As indicated in Figure 8.1, it is possible to state whether a test result is highly significant, moderately significant, weakly significant or not significant, and hence we can convey a measure of the ‘strength’ of any statistical significance.

Activity 8.2

(a) Explain what is meant by the statement: ‘The test is significant at the 5% significance level’.
(b) How should you interpret a test which is significant at the 10% significance level, but not at the 5% significance level?
Solution

(a) ‘The test is significant at the 5% significance level’ means there is a less than 5% chance of getting data as extreme as actually observed if the null hypothesis was true. This implies that the data are inconsistent with the null hypothesis, which we reject, i.e. the test is ‘moderately significant’.

(b) A test which is significant at the 10% significance level, but not at the 5% significance level, is often interpreted as meaning there is some doubt about the null hypothesis, but not enough to reject it with confidence, i.e. the test is ‘weakly significant’.

8.10 One-tailed and two-tailed tests

The tests which have been described so far have been one-tailed (specifically, upper-tailed). This is because the alternative hypotheses used were one-sided. Recall the discussion in Section 8.5 which identified three different forms of H1. For example, if testing a population mean H0 : µ = µ0, these are:

i. H1 : µ ≠ µ0 or ii. H1 : µ < µ0 or iii. H1 : µ > µ0.

If H1 is of the form i., then both tails of the test statistic’s distribution will form the critical region, known as a two-tailed test. The critical region will still be of total area α, so it is logical to split this α equally between the two tails. So if the test statistic follows a standard normal distribution, for a two-tailed test, using Table 5 of the New Cambridge Statistical Tables and noting symmetry of the standard normal distribution about zero:

α = 0.10 ⇒ zα/2 = z0.05 = 1.6449, giving critical values of ±1.6449
α = 0.05 ⇒ zα/2 = z0.025 = 1.9600, giving critical values of ±1.9600
α = 0.01 ⇒ zα/2 = z0.005 = 2.5758, giving critical values of ±2.5758

such that if the test statistic value exceeds either critical value (i.e. greater than the upper critical value, or less than the lower critical value), then the ‘test is significant at the 100α% significance level’.

If H1 is of the form ii., then this is a lower-tailed test.
Hence the critical region is in the left-hand tail of the test statistic’s distribution. So if the test statistic follows a standard normal distribution, for a lower-tailed test, using Table 5 of the New Cambridge Statistical Tables and noting symmetry of the standard normal distribution about zero:

α = 0.10 ⇒ z0.90 = −1.2816
α = 0.05 ⇒ z0.95 = −1.6449
α = 0.01 ⇒ z0.99 = −2.3263

such that if the test statistic value is more extreme than the critical value (i.e. falls below the critical value), then the ‘test is significant at the 100α% significance level’.

If H1 is of the form iii., then we have an upper-tailed test, which has already been discussed.

8.11 P-values

Before proceeding to examples of hypothesis test situations, we will discuss an alternative approach to hypothesis testing based on the p-value.

So far we have spoken about defining a critical region and rejecting the null hypothesis, H0, whenever the test statistic value falls in this critical region. An alternative approach is to compute the probability of obtaining the observed test statistic value or a more extreme value conditional on H0 being true, using the probability distribution of the test statistic. This probability is the p-value. So, given we reject H0 if the test statistic value is sufficiently extreme, we would also reject H0 if we obtained a sufficiently small p-value.

To summarise, in the p-value world of testing, we use the following simple decision rule.

Hypothesis testing – p-value decision rule

• Reject H0 if p-value < α, where α is the significance level.
• Do not reject H0 if p-value ≥ α.

It remains to show how to calculate the p-value. Again, this will depend on the form of H1 and the test statistic distribution.

Calculating p-values

Let the test statistic value be x, and the test statistic (a random variable as it is constructed from an estimator) be X.
Assuming a test statistic distribution which is symmetric about zero (for the two-tailed test), in general these can be summarised as follows.

Form of H1         p-value calculation
H1 : θ ≠ θ0        2 × P(X ≥ |x|)
H1 : θ < θ0        P(X ≤ x)
H1 : θ > θ0        P(X ≥ x)

In each case use the New Cambridge Statistical Tables corresponding to the distribution of test statistic X.

Example 8.2 If X ∼ N(0, 1), and we obtain the test statistic value x = −1.82, then for a lower-tailed test the p-value would be:

P(X ≤ −1.82) = 1 − Φ(1.82) = 1 − 0.9656 = 0.0344

using Table 4 of the New Cambridge Statistical Tables. Hence this is significant at the 5% significance level (since 0.0344 < 0.05), but not at the 1% significance level (since 0.0344 > 0.01), indicating a moderately significant result.

The next few sections detail various hypothesis testing situations analogous to the confidence interval situations presented in Chapter 7.

8.12 Hypothesis test for a single mean (σ known)

We first consider the test of a single population mean when the population standard deviation, σ, is known. To test H0 : µ = µ0 when sampling from N(µ, σ²) we use the following test statistic.

z-test of hypothesis for a single mean (σ known)

In this case, the test statistic is:

Z = (X̄ − µ0) / (σ/√n) ∼ N(0, 1).    (8.1)

Hence critical values are obtained from the standard normal distribution, i.e. using Table 5 of the New Cambridge Statistical Tables.

Example 8.3 The mean lifetime of 100 components in a sample is 1,570 hours and their standard deviation is known to be 120 hours. µ is the mean lifetime of all the components produced. Is it likely the sample comes from a population whose mean is 1,600 hours?

We perform a two-tailed test since we are testing whether or not µ is 1,600.⁵ Hence we test:

H0 : µ = 1600 vs. H1 : µ ≠ 1600.

Since σ is known, we use (8.1) to calculate the test statistic value, which is:

z = (x̄ − µ0) / (σ/√n) = (1570 − 1600) / (120/√100) = −2.5.

We first test at the 5% significance level. Since this is a two-tailed test, the critical values are ±1.96. Since the test statistic value is in the critical region (−2.5 < −1.96), we reject H0. If we now refer to the significance level decision tree (Figure 8.1), we next test at the 1% significance level with critical values of ±2.5758. Since −2.5758 < −2.5 we are (just) unable to reject H0 and conclude that the test result is ‘moderately significant’. Hence at the 5% significance level there is evidence to suggest that the mean lifetime of components, µ, is not equal to 1,600.

If we wanted to compute the p-value, then from Section 8.11, for this two-tailed test the p-value would be:

p-value = 2 × P(Z ≥ |−2.5|) = 2 × P(Z ≥ 2.5) = 2 × 0.0062 = 0.0124.

Since 0.01 < 0.0124 < 0.05, we see that using the p-value testing method we still conclude (as we must) that the test is significant at the 5% significance level, but not at the 1% significance level.

⁵ Common sense might lead us to perform a lower-tailed test since 1570 < 1600, suggesting that if µ is not 1,600, then it is likely to be less than 1,600. However, since the question is phrased as a two-tailed test, a justification for performing a lower-tailed test would be required, should you decide to opt for a lower-tailed test. Indeed, in principle the alternative hypothesis should be determined before data are collected, to avoid the data biasing our choice of alternative hypothesis.

8.13 Hypothesis test for a single mean (σ unknown)

Here the only difference with the previous example is that σ² is unknown, in which case it is estimated with the sample variance estimator, S². To test H0 : µ = µ0 when sampling from N(µ, σ²) we use the following test statistic.

t test of hypothesis for a single mean (σ unknown)

In this case, the test statistic is:

T = (X̄ − µ0) / (S/√n) ∼ tn−1.    (8.2)

Hence critical values are obtained from the Student’s t distribution with n − 1 degrees of freedom, i.e. using Table 10 of the New Cambridge Statistical Tables.
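Before the worked examples, the mechanics of (8.2) can be sketched numerically. The sample values below are a made-up illustration of ours, not data from the guide.

```python
import math
import statistics

sample = [11.8, 12.4, 12.1, 12.9, 11.6, 12.3]   # illustrative data
mu0 = 12.0                                       # hypothesised mean
n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)        # sample standard deviation, n - 1 divisor
t = (xbar - mu0) / (s / math.sqrt(n))
df = n - 1
print(df, round(t, 2))   # compare t with t_5 critical values from the tables
```

Here t ≈ 0.97, well below the upper 5% point of the t distribution on 5 degrees of freedom (2.015), so a sample like this would not be significant even at the 10% level.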
Example 8.4 A study on the impact of comprehensive planning on financial performance reported that the average annual return on investment for American banks was 10.2%, and suggested that banks who exercised comprehensive planning would do better than this. A random sample of 26 such banks gave the following returns on investment. Do these data support the conjecture?

10.00, 10.23, 10.74, 11.90, 10.72, 10.31, 9.90, 10.09, 10.31, 9.96, 10.34, 10.30, 10.50, 11.54, 10.81, 10.15, 9.04, 11.55, 10.81, 8.69, 10.76, 10.92, 11.26, 11.21, 10.20, 10.76.

We test:

H0 : µ = 10.2 vs. H1 : µ > 10.2.

Note that the alternative hypothesis is one-sided as this is the region of interest (we hypothesise that banks exercising comprehensive planning perform better than 10.2%). This implies that an upper-tailed test is needed.

The summary statistics are n = 26, x̄ = 10.5 and s = 0.714. Hence, using (8.2), the test statistic value is:

t = (x̄ − µ0) / (s/√n) = (10.5 − 10.2) / (0.714/√26) = 2.14.

We compare this test statistic value with the upper tail of the Student’s t distribution with 26 − 1 = 25 degrees of freedom. If we choose α = 0.05,⁶ then Table 10 of the New Cambridge Statistical Tables can be used to find the (upper-tail) critical value of 1.708. For α = 0.01, the critical value is 2.485. Since 1.708 < 2.14 < 2.485, we conclude that the test is significant at the 5% significance level, but not at the 1% significance level, indicating that the result is moderately significant.

Using the p-value approach, the p-value is calculated to be:

p-value = P(T > 2.14) ≈ 0.021

where T ∼ t25, using Table 9 of the New Cambridge Statistical Tables, with 24 as the nearest available degrees of freedom. We see that 0.01 < 0.021 < 0.05 so, of course, we reach the same conclusion. Note that we can view the p-value as the smallest α value such that we would reject the null hypothesis.⁷

We conclude that there is moderate evidence against the null hypothesis and that comprehensive planning does improve the average annual return on investment.

Activity 8.3 A sample of seven batteries is taken at random from a large batch of (nominally 12 volt) batteries. These are tested and their true voltages are shown below:

12.9  11.6  13.5  13.9  12.1  11.9  13.0.

(a) Test whether the mean voltage of the whole batch is 12 volts using two appropriate significance levels.
(b) Test whether the mean batch voltage is less than 12 volts using two appropriate significance levels.
(c) Which test do you think is the more appropriate? Briefly explain why.

Solution

(a) We are to test H0 : µ = 12 vs. H1 : µ ≠ 12. The key points here are that n is small and that σ² is unknown. We can use the t test, and this is valid provided the data are normally distributed. The test statistic value is:

t = (x̄ − 12) / (s/√7) = (12.7 − 12) / (0.858/√7) = 2.16.

This is compared to a Student’s t distribution on n − 1 = 6 degrees of freedom. The critical values corresponding to a 5% significance level two-tailed test are ±t0.025, 6 = ±2.447, using Table 10 of the New Cambridge Statistical Tables. Hence we cannot reject the null hypothesis at the 5% significance level since 2.16 < 2.447. Moving to the 10% significance level, the critical values are ±t0.05, 6 = ±1.943, so we can reject H0 and conclude that the test is weakly significant as there is weak evidence that µ ≠ 12.

(b) We are to test H0 : µ = 12 vs. H1 : µ < 12. There is no need to do a formal statistical test.
As the sample mean is 12.7, which is greater than 12, the test statistic value will be positive while the lower-tail critical value(s) will be negative, and hence there is no evidence whatsoever in favour of the alternative hypothesis.

(c) In (a) you are asked to conduct a two-tailed test; in (b) it is a one-tailed test. Which is more appropriate will depend on the purpose of the experiment, and your suspicions before you conduct it.

• If you suspected before collecting the data that the mean voltage was less than 12 volts, the test in (b) would be appropriate.
• If you had no prior reason to believe that the mean was less than 12 volts, you would use the test in (a).
• General rule: decide on whether it is a one-tailed or two-tailed test before collecting the data!

⁶ Recall that if the question does not specify a significance level, then α = 0.05, representing a 5% significance level, is a good starting point.

⁷ You are not expected to calculate p-values from the Student’s t distribution in the examination, although the examiners will not penalise those who use this approach provided the p-value is compared against two appropriate significance levels.

Activity 8.4 A salesman claims that he can sell at least 100 DVD players per day. Over a period of 11 days his resulting sales were as shown below. Do the data support his claim?

94  105  100  90  96  101  92  97  93  100  91.

Solution

If µ is the mean number of sales per day, then we wish to test:

H0 : µ = 100 vs. H1 : µ < 100.

The main statistics are: n = 11, sample mean x̄ = 96.273, sample standard deviation s = 4.777 and estimated standard error s/√n = 4.777/√11 = 1.440.

So the test statistic value is:

t = (96.273 − 100) / 1.440 = −2.59.

Comparing this with t10 (since there are 11 − 1 = 10 degrees of freedom), we see that it is significant at the 5% significance level (for this one-tailed test, the critical value is t0.95, 10 = −1.812 and −2.59 < −1.812).
However, it is not significant at the 1% significance level (for which the critical value is t0.99, 10 = −2.764 and −2.764 < −2.59). Hence we reject the salesman’s claim, with only moderate evidence against it.

Activity 8.5 A random sample of 18 observations from the population N(µ, σ²) yields the sample mean x̄ = 8.83 and the sample variance s² = 1.59. At the 1% significance level, test the following hypotheses by obtaining critical values:

(a) H0 : µ = 8 vs. H1 : µ > 8.
(b) H0 : µ = 8 vs. H1 : µ < 8.
(c) H0 : µ = 8 vs. H1 : µ ≠ 8.

Repeat the above exercise with the additional assumption that σ² = 1.59. Compare the results with those derived without this assumption and comment.

Solution

When σ² is unknown, we use the test statistic T = √n(X̄ − 8)/S. Under H0, T ∼ t17. With α = 0.01, we reject H0 if:

(a) t > t0.01, 17 = 2.567, against H1 : µ > 8.
(b) t < −t0.01, 17 = −2.567, against H1 : µ < 8.
(c) |t| > t0.005, 17 = 2.898, against H1 : µ ≠ 8.

For the given sample, t = 2.79. Hence we reject H0 against the alternative H1 : µ > 8, but we will not reject H0 against the two other alternative hypotheses.

When σ² is known, we use the test statistic Z = √n(X̄ − 8)/σ. Under H0, Z ∼ N(0, 1). With α = 0.01, we reject H0 if:

(a) z > z0.01 = 2.326, against H1 : µ > 8.
(b) z < −z0.01 = −2.326, against H1 : µ < 8.
(c) |z| > z0.005 = 2.576, against H1 : µ ≠ 8.

For the given sample, z = 2.79. Hence we reject H0 against the alternatives H1 : µ > 8 and H1 : µ ≠ 8, but we will not reject H0 against H1 : µ < 8.

With σ² known, we should be able to perform inference better simply because we have more information about the population. More precisely, for the given significance level, we require less extreme values to reject H0. Put another way, the p-value of the test is reduced when σ² is given. Therefore, the risk of failing to reject a false H0 is also reduced.
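The calculations in Activity 8.5 can be verified numerically. The sketch below uses the standard library; the t critical values compared against are those quoted in the solution.

```python
import math
from statistics import NormalDist

# Activity 8.5 inputs: n = 18, sample mean 8.83, sample variance 1.59,
# testing H0: mu = 8 at the 1% significance level.
n, xbar, s2, mu0 = 18, 8.83, 1.59, 8.0
t = math.sqrt(n) * (xbar - mu0) / math.sqrt(s2)
print(round(t, 2))                 # about 2.79, as in the solution

# sigma^2 unknown: compare with t critical values on 17 degrees of freedom.
print(t > 2.567)                   # (a) upper-tailed at 1%: reject H0
print(abs(t) > 2.898)              # (c) two-tailed at 1%: do not reject H0

# sigma^2 known: the same value is compared with normal critical values.
z_crit_upper = NormalDist().inv_cdf(0.99)    # about 2.326
z_crit_two = NormalDist().inv_cdf(0.995)     # about 2.576
print(t > z_crit_upper, abs(t) > z_crit_two)
```

The final line shows both (a) and (c) rejecting when σ² is known, illustrating the comment that less extreme values suffice to reject H0 when more is known about the population.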
Activity 8.6 To instil customer loyalty, airlines, hotels, rental car companies and credit card companies (among others) have initiated marketing programmes which reward their loyal customers. In the United States alone, several million people are members of the frequent-flyer programmes of the airline industry. A large fast-food restaurant chain wished to explore the profitability of such a programme. They randomly selected 12 of their 1,200 restaurants nationwide and instituted a loyalty programme which rewarded customers with a $5.00 gift certificate after every 10 meals purchased at full price. They ran the trial programme for three months. The restaurants not in the sample had an average increase in profits of $1,047.34 over the previous three months, whereas the restaurants in the sample had the following changes in profit:
$2,232.90 $6,552.70 $3,381.30 $545.47 $3,440.70 $4,798.70 $2,965.00 $1,591.40 $2,376.20 $1,809.10 $2,610.70 −$2,191.00
Note that the last number is negative, representing a decrease in profits. Specify the appropriate null and alternative hypotheses for determining whether the mean profit change for restaurants with loyalty programmes is significantly greater (in a statistical sense, which you should make clear) than $1,047.34, and conduct an appropriate test.

Solution
It is useful to discuss the issues about this question first.
• We want to know whether a loyalty programme such as that at the 12 selected restaurants would result in an increase in mean profits greater than that observed (during the three-month test) at the other sites within the chain.
• So we can model the profits across the chain as $1,047.34 + x, where $x is the supposed effect of the promotion, and if the true mean value of x is µ, then we wish to test: H0 : µ = 0 vs.
H1 : µ > 0, which is a one-tailed test since, clearly, there are (preliminary) grounds for thinking that there is an increase due to the loyalty programme.
• We know nothing about the variability of profits across the rest of the chain, so we will have to use the sample data, i.e. to calculate the sample variance and to employ the t distribution on ν = 12 − 1 = 11 degrees of freedom.
• Although we shall want the variance of the data 'sample value − 1047.34', this will be the same as the variance of the sample data, since for any random variable X and constant k we have Var(X + k) = Var(X), because in calculating the variance every value (xi − x̄) is 'replaced' by (xi + k) − (x̄ + k), which is in fact the same value.
• So we need to calculate x̄, Σxi², Sxx and s².
The total change in profits for restaurants in the programme is Σxi = 30,113.17. Since n = 12, the mean change in profit for restaurants in the programme is 30,113.17/12 = 2,509.431 = 1,047.34 + 1,462.091, hence use x̄ = 1,462.091.
The raw sum of squares is Σxi² = 126,379,568.8. So the corrected sum of squares (which, as noted above, is unaffected by subtracting the constant 1,047.34) is:
Sxx = Σxi² − n × (30,113.17/12)² = 126,379,568.8 − 12 × (2,509.431)² = 50,812,651.51.
Hence:
s² = Sxx/(n − 1) = 50,812,651.51/11 = 4,619,331.956.
Therefore, the estimated standard error is:
s/√n = √(4,619,331.956/12) = √384,944.3296 = 620.439.
So the test statistic value is:
t = (x̄ − µ0)/(s/√n) = (1,462.091 − 0)/620.439 = 2.357
on 11 degrees of freedom. The relevant critical values for t11 in this one-tailed test are:
t0.05, 11 = 1.796 and t0.01, 11 = 2.718.
So we see that the test is significant at the 5% significance level, but not at the 1% significance level (1.796 < 2.357 < 2.718), so we reject H0 and conclude that the test is moderately significant, with moderate evidence that the loyalty programme has an effect.

8.14 Hypothesis test for a single proportion
We now consider the hypothesis test for a single proportion.
Recall from Section 7.10 that the standard error for a proportion is S.E.(P) = √(π(1 − π)/n). When testing H0 : π = π0, under H0 the standard error is √(π0(1 − π0)/n), leading to the following test statistic, achieved by standardising the sample proportion estimator, P, given in (7.1).

z test of hypothesis for a single proportion
In this case, the test statistic is:
Z = (P − π0)/√(π0(1 − π0)/n) ∼ N(0, 1) (approximately).  (8.3)
Hence critical values are obtained from the standard normal distribution, i.e. using Table 5 of the New Cambridge Statistical Tables.

Example 8.5 To illustrate this, let us reconsider Example 7.3 in which a survey was conducted by a bank to estimate the proportion of its customers who would be interested in using a proposed new mobile telephone banking service. If we denote the population proportion of customers who are interested in this proposal by π, and it is found that 68 out of a random sample of 150 customers are in favour, then we would estimate π with p = 68/150 = 0.453. Let us suppose that other surveys have shown that 40% of the public are interested in mobile telephone banking and it is proposed to test whether or not the above survey agrees with this figure, i.e. we conduct a two-tailed test. Hence:
H0 : π = 0.4 vs. H1 : π ≠ 0.4.
The test statistic value, using (8.3), is:
z = (p − π0)/√(π0(1 − π0)/n) = (0.453 − 0.4)/√(0.4 × 0.6/150) = 1.325.
Let α = 0.05, then we compare this test statistic value with the critical values ±1.96 (using Table 5 of the New Cambridge Statistical Tables), so it can be seen that the test is not significant at the 5% significance level as we do not reject H0 since 1.325 < 1.96. The significance level decision tree (see Figure 8.1) now requires us to test at the 10% significance level, which gives critical values of ±1.6449, so again we do not reject H0 since 1.325 < 1.6449, i.e. the test result is not statistically significant.
We conclude that these data are consistent with the null hypothesis and that the level of interest shown among the bank's customers for the proposed new mobile telephone banking service may well be 40%.

Activity 8.7 In a particular city it is known, from past surveys, that 25% of households regularly use a washing powder named 'Snolite'. After an advertising campaign, a survey of 300 randomly selected households showed that 100 had recently purchased 'Snolite'. Is there evidence that the campaign has been successful? Use a 1% significance level.

Solution
Here we are testing a proportion:
H0 : π = 0.25 vs. H1 : π > 0.25.
Note that this is a one-sided test – we have reason to believe that the sales campaign has increased sales, and we believe this before collecting any data. As this is a test for a single proportion and n is large, we compute the test statistic value to be:
z = (p − π0)/√(π0(1 − π0)/n) = (100/300 − 0.25)/√0.000625 = 3.33.
Compare this to the critical value of z0.01 = 2.326 (from the bottom row of Table 10 of the New Cambridge Statistical Tables) and you will see that it is significant at the 1% significance level since 2.326 < 3.33. That is, there is very strong evidence that more than 25% of households use 'Snolite'. It appears that the campaign has been successful.

Activity 8.8 Consumer Reports evaluated and rated 46 brands of toothpaste. One attribute examined in the study was whether or not a toothpaste brand carries an American Dental Association (ADA) seal verifying effective decay prevention. The data for the 46 brands (coded as 1 = ADA seal, 0 = no ADA seal) are reproduced below.
0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 1 1 1 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 1 1 1 0 0 1
(a) State the null and alternative hypotheses for testing whether the true proportion of toothpaste brands with the ADA seal verifying effective decay prevention is less than 0.5.
(b) Assuming that the sample was random, from a much larger population, conduct a test between the two hypotheses.

Solution
(a) Appropriate hypotheses are H0 : π = 0.50 vs. H1 : π < 0.50.
(b) The data show that there are r = 21 ADA seals among n = 46 items in the sample. Let π be the true proportion. If H0 is true then the sample proportion P ∼ N(0.50, 0.005435), approximately, since 46 is 'large' and hence we can use the normal approximation with µ = 0.50 and σ² = π(1 − π)/n = (0.50)²/46 = 0.005435.
The test statistic value for this test would then be:
z = (21/46 − 0.5)/√0.005435 = −0.04348/0.07372 = −0.59.
This is clearly not going to be significant at any plausible significance level (in fact the p-value is P(Z < −0.59) = 0.2776). Therefore, we cannot reject the null hypothesis that half the brands have the ADA seal.

Activity 8.9 A government agency carries out a large-scale random survey of public attitudes towards job security. 78 of the 300 workers surveyed indicated they were worried about losing their job. A newspaper report claims 30% of workers fear losing their job. Is such a high percentage claim justified? State and carry out an appropriate hypothesis test at two appropriate significance levels and explain your results.

Solution
The null and alternative hypotheses are:
H0 : π = 0.30 vs. H1 : π < 0.30.
The test statistic under H0 is:
Z = (P − π0)/√(π0(1 − π0)/n) = (P − 0.30)/√(0.30 × (1 − 0.30)/n) ∼ N(0, 1).
The sample proportion is p = 78/300 = 0.26 and the test statistic value is −1.51. For α = 0.05, the one-tailed (lower-tail) critical value is z0.95 = −1.645. Since −1.645 < −1.51, we do not reject H0 at the 5% significance level. Moving to α = 0.10, the critical value is now z0.90 = −1.282. Since −1.51 < −1.282, we reject H0 at the 10% significance level.
The test is weakly significant and we conclude that there is weak evidence that fewer than 30% of workers were worried about losing their jobs, suggesting there is only a marginal reason to doubt the newspaper report's claim.

Activity 8.10 If you live in California, the decision to purchase earthquake insurance is a critical one. An article in the Annals of the Association of American Geographers (June 1992) investigated many factors which California residents consider when purchasing earthquake insurance. The survey revealed that only 133 of 337 randomly selected households in Los Angeles County were protected by earthquake insurance.
(a) What are the appropriate null and alternative hypotheses to test the research hypothesis that fewer than 40% of households in Los Angeles County were protected by earthquake insurance?
(b) Do the data provide sufficient evidence to support the research hypothesis? (Use α = 0.10.)
(c) Calculate and interpret the p-value for the test.

Solution
(a) Appropriate hypotheses are H0 : π = 0.40 vs. H1 : π < 0.40.
(b) The estimate of the proportion covered by insurance is 133/337 = 0.395. As this is a test of a single proportion and n is large, the test statistic value is:
z = (p − π0)/√(π0(1 − π0)/n) = (0.395 − 0.40)/0.02669 = −0.20.
Compare this to the critical value of z0.90 = −1.282 (from the bottom row of Table 10 of the New Cambridge Statistical Tables) and you will see that it is not significant at even the 10% significance level since −1.282 < −0.20. Hence the test is insignificant, i.e. there is no evidence that fewer than 40% of households in Los Angeles County were protected by earthquake insurance.
(c) To compute the p-value for this lower-tailed test, we determine:
P(Z < −0.20) = 0.4207
using Table 4 of the New Cambridge Statistical Tables. Hence we would not reject H0 for any α < 0.4207, remembering that the p-value is the smallest significance level such that we reject the null hypothesis.
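The z statistic and p-value in Activity 8.10 can also be checked numerically. The sketch below (illustrative only; the examination expects Table 4 rather than software) uses the identity Φ(z) = (1 + erf(z/√2))/2 for the standard normal CDF:

```python
import math

def phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, r = 337, 133                      # households surveyed / insured (Activity 8.10)
p = r / n                            # sample proportion, approx. 0.395
pi0 = 0.40                           # hypothesised proportion under H0

se = math.sqrt(pi0 * (1 - pi0) / n)  # standard error under H0, approx. 0.0267
z = (p - pi0) / se                   # test statistic, approx. -0.20
p_value = phi(z)                     # lower-tail p-value, approx. 0.42
```

This agrees with the tabulated value P(Z < −0.20) = 0.4207 to the accuracy of the tables.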
Activity 8.11 A museum conducts a survey of its visitors in order to assess the popularity of a device which is used to provide information on the museum exhibits. The device will be withdrawn if fewer than 30% of all of the museum's visitors make use of it. Of a random sample of 80 visitors, 20 chose to use the device.
(a) Carry out a test at the 5% significance level to see if the device should be withdrawn and state your conclusions.
(b) Determine the p-value of the test.

Solution
(a) Let π be the population proportion of visitors who would use the device. We test H0 : π = 0.30 vs. H1 : π < 0.30. The sample proportion is p = 20/80 = 0.25. The standard error of the sample proportion is √(0.3 × 0.7/80) = 0.0512. The test statistic value is:
z = (0.25 − 0.30)/0.0512 = −0.976.
For a one-sided (lower-tailed) test at the 5% significance level, the critical value is z0.95 = −1.645, so the test is not significant – and not even at the 10% significance level (critical value of z0.90 = −1.282). On the basis of the data, there is no reason to withdraw the device.
The critical region for the above test is: reject H0 if the sample proportion is less than 0.3 − 1.645 × 0.0512, i.e. if the sample proportion, p, is less than 0.2157.
(b) The p-value of the test is the probability of the test statistic value or a more extreme value, conditional on H0 being true. Hence the p-value is:
P(Z ≤ −0.976) = 0.1645.
So, for any α < 0.1645 we would fail to reject H0.

8.15 Difference between two population proportions
As with confidence intervals, the correct approach to the comparison of two population proportions, π1 and π2, is to consider the difference between them, i.e. π1 − π2. When testing for equal proportions (i.e. a zero difference), the null hypothesis is H0 : π1 − π2 = 0, or equivalently H0 : π1 = π2.
We derive the test statistic by standardising the sampling distribution of the difference in two independent sample proportions, P1 − P2, given in (7.4), leading to the proposed test statistic:
Z = (P1 − P2 − (π1 − π2))/√(π1(1 − π1)/n1 + π2(1 − π2)/n2) ∼ N(0, 1).
However, when evaluating this test statistic, which values do we use for π1 and π2? In the test of a single proportion, we had H0 : π = π0, where π0 is the tested value. When comparing two proportions, under H0 no value is given for π1 and π2, only that they are equal, that is π1 = π2 = π, where π is the common proportion whose value, of course, is still unknown! Hence we need to estimate π from the sample data using the pooled proportion estimator.

Pooled proportion estimator
If R1 and R2 represent the number of 'favourable' responses from two independent random samples with sample sizes of n1 and n2, respectively, then the pooled proportion estimator is:
P = (R1 + R2)/(n1 + n2).  (8.4)
This leads to the following revised test statistic.

z test for the difference between two proportions
In this case, the test statistic is:
Z = (P1 − P2 − (π1 − π2))/√(P(1 − P)(1/n1 + 1/n2)) ∼ N(0, 1) (approximately).  (8.5)
Hence critical values are obtained from the standard normal distribution, i.e. using Table 5 of the New Cambridge Statistical Tables.

Example 8.6 To illustrate this, let us reconsider Example 7.7 to test for a difference between the population proportions of the general public who are aware of a particular commercial product before and after an advertising campaign. Two surveys were conducted and the results of the two random samples were:

                    Sample size   Number aware
Before campaign         150            68
After campaign          120            65

If π1 and π2 are the true population proportions before and after the campaign, respectively, then we wish to test:
H0 : π1 = π2 vs. H1 : π1 < π2.
Note that we use a one-sided alternative on the assumption that the campaign would not decrease awareness!⁸ After checking that this way round the value we get is positive (avoiding negative differences wherever possible to minimise arithmetic errors), we estimate the difference in the population proportions as:
'after' − 'before' = p2 − p1 = 65/120 − 68/150 = 0.0884.
On the assumption that H0 is true, we estimate the common proportion, π, using (8.4), as:
(68 + 65)/(150 + 120) = 0.4926.
So our test statistic value, using (8.5), is:
z = 0.0884/√(0.4926 × (1 − 0.4926) × (1/150 + 1/120)) = 1.44.
For a z test at the 5% significance level, we compare this test statistic value with the upper-tail critical value of 1.6449, for this upper-tailed test. Since 1.44 < 1.6449, we fail to reject H0 and conclude that the test is not significant at the 5% significance level. However, it is significant at the 10% significance level (because 1.2816 < 1.44), and so we maintain belief in the null hypothesis with some reservation as the data have thrown some doubt on it. It is possible that the campaign has increased awareness but, on the basis of these random samples, we would not be fully convinced – the test is at best 'weakly significant'. If we had to take an important decision, we would almost certainly want more information before doing so! Incidentally, the p-value for this problem is simply P(Z > 1.44) = 0.0749, using Table 4 of the New Cambridge Statistical Tables. So, clearly, we would be unable to reject H0 for any significance level α < 0.0749, agreeing with our earlier findings.

Activity 8.12 A market research company has conducted a survey of adults in two large towns, either side of an international border, in order to judge attitudes towards a controversial internationally-broadcast celebrity television programme.
The following table shows some of the information obtained by the survey:

                                          Town A   Town Z
Sample size                                 80       80
Sample number approving of the programme    44       41

(a) Conduct a formal hypothesis test, at two appropriate significance levels, of the claim that the population proportions approving of the programme in the two towns are equal.
(b) Would your conclusion be the same if, in both towns, the sample sizes had been 1,500 (with the same sample proportions of approvals and the same pooled sample proportion)?

Solution
(a) We test H0 : πA = πZ vs. H1 : πA ≠ πZ. Note there is no a priori reason to suppose which town would have a greater proportion approving of the programme in the event that the proportions are not equal, hence we conduct a two-tailed test. The sample proportions are pA = 0.55 and pZ = 0.5125, and the estimate of the common proportion under H0 is:
p = (44 + 41)/(80 + 80) = 0.53125.
We use the test statistic:
Z = (PA − PZ)/√(P(1 − P)(1/nA + 1/nZ)) ∼ N(0, 1)
approximately under H0. The test statistic value is:
(0.55 − 0.5125)/√(0.53125 × (1 − 0.53125) × (1/80 + 1/80)) = 0.48.
At the 5% significance level, the critical values are ±z0.025 = ±1.96. Since 0.48 < 1.96 we do not reject H0. Following the significance level decision tree, we now test at the 10% significance level, for which the critical values are ±z0.05 = ±1.645. Since 0.48 < 1.645, again we do not reject H0. Therefore, the test is not significant as there is no evidence of a difference in the population proportions approving of the programme. For reference, the p-value for this two-tailed test is:
2 × P(Z ≥ 0.48) = 2 × 0.3156 = 0.6312
which is greater than any reasonable significance level, α.
(b) If nA = nZ = 1,500, with the same values for pA, pZ and p, the test statistic value becomes:
(0.55 − 0.5125)/√(0.53125 × (1 − 0.53125) × (1/1500 + 1/1500)) = 2.06.
Since 1.96 < 2.06, we reject H0 at the 5% significance level. Now testing at the 1% significance level we have critical values of ±2.576. Since 2.06 < 2.576, we do not reject H0 and conclude that the test is moderately significant, indicating moderate evidence of a difference in the population proportions approving of the programme. The larger sample sizes have increased the power of the test, i.e. our ability to reject H0: the difference in the sample proportions is attributed to chance (that is, sampling error) when nA = nZ = 80, while the same difference is statistically significant when nA = nZ = 1,500. For reference, the p-value for this two-tailed test is:
2 × P(Z ≥ 2.06) = 2 × 0.0197 = 0.0394.
Since 0.01 < 0.0394 < 0.05, the test is moderately significant, as before.

8 An example of the importance of using common sense in determining the alternative hypothesis!

Activity 8.13 A random sample of 250 households in a particular community was taken and in 50 of these households the lead level in the water supply was found to be above an acceptable level. A sample was also taken from a second community, which adds anti-corrosives to its water supply, and, of these, only 16 out of 320 households were found to have unacceptably high lead levels. Is this conclusive evidence that the addition of anti-corrosives reduces lead levels?

Solution
Here we are testing the difference between two proportions with a one-sided test. Also, n1 and n2 are both large. Let:
• π1 = the population proportion of households with unacceptable levels of lead in the community without anti-corrosives
• π2 = the population proportion of households with unacceptable levels of lead in the community with anti-corrosives.
We wish to test H0 : π1 = π2 vs. H1 : π1 > π2. The test statistic value is:
z = (p1 − p2)/√(p(1 − p)(1/n1 + 1/n2)) = (50/250 − 16/320)/√(0.116 × (1 − 0.116) × (1/250 + 1/320)) = 5.55
where p is the pooled sample proportion, i.e. p = (50 + 16)/(250 + 320) = 0.116.
This test statistic value is, at 5.55, obviously very extreme and hence is highly significant (since z0.01 = 2.326 < 5.55), so there is strong evidence that anti-corrosives reduce lead levels.

8.16 Difference between two population means
In this section we are primarily interested in the difference between two population means, µ1 − µ2. As in Chapter 7, there are four cases to consider.

8.16.1 Unpaired samples – variances known
Suppose we have two random samples of size n1 and n2, respectively, drawn from N(µ1, σ1²) and N(µ2, σ2²), where σ1² and σ2² are known. Testing for the equality of means gives the null hypothesis H0 : µ1 = µ2 or, in terms of their difference, H0 : µ1 − µ2 = 0. Standardising the sampling distribution of X̄1 − X̄2, shown in (7.6), gives the test statistic.

z test for the difference between two means (variances known)
In this case, the test statistic is:
Z = (X̄1 − X̄2 − (µ1 − µ2))/√(σ1²/n1 + σ2²/n2) ∼ N(0, 1).  (8.6)
Hence critical values are obtained from the standard normal distribution, i.e. using Table 5 of the New Cambridge Statistical Tables. Note if testing for the equality of means, then µ1 − µ2 = 0 under H0. Hence, in (8.6), we set the term (µ1 − µ2) = 0.

8.16.2 Unpaired samples – variances unknown and unequal
We have the same set-up as above, with the same sampling distribution for X̄1 − X̄2, but now the population variances σ1² and σ2² are unknown. Assuming large sample sizes we can replace these unknown parameters with the sample variance estimators S1² and S2² to obtain the test statistic.

z test for the difference between two means (variances unknown)
If the population variances σ1² and σ2² are unknown, provided the sample sizes n1 and n2 are large (greater than 30), then:
Z = (X̄1 − X̄2 − (µ1 − µ2))/√(S1²/n1 + S2²/n2) ∼ N(0, 1) (approximately, for large samples).  (8.7)
Hence critical values are obtained from the standard normal distribution, i.e.
using Table 5 of the New Cambridge Statistical Tables. Note if testing for the equality of means, then µ1 − µ2 = 0 under H0. Hence, in (8.7), we set the term (µ1 − µ2) = 0.

Activity 8.14 Two companies supplying a television repair service are compared by their repair times (in days). Random samples of recent repair times for these companies gave the following statistics:

                  Company A   Company B
Sample size           44          52
Sample mean          11.9        10.8
Sample variance       7.3         6.2

(a) Is there evidence that the companies differ in their true mean repair times? Give an appropriate significance test to support your conclusions.
(b) Determine the p-value of the test.
(c) What difference would it have made if the sample sizes had each been smaller by 35 (i.e. sample sizes of 9 and 17, respectively)?

Solution
(a) H0 : µA = µB vs. H1 : µA ≠ µB, where we use a two-tailed test since 'by eye' the evidence we are given only barely suggests the direction of the difference – if any. The test statistic value is:
z = (11.9 − 10.8)/√(7.3/44 + 6.2/52) = 2.06.
For a two-tailed test, this is significant at the 5% significance level (z0.025 = 1.96 < 2.06), but not at the 1% significance level (2.06 < 2.576 = z0.005). We reject H0 and conclude that company A is slower in repair times than company B, with a moderately significant result.
(b) The p-value for this two-tailed test is 2 × P(Z > 2.06) = 0.0394.
(c) For small samples, we should use a pooled estimate of the population standard deviation:
s = √(((9 − 1) × 7.3 + (17 − 1) × 6.2)/((9 − 1) + (17 − 1))) = 2.5626
on 24 degrees of freedom. Hence the test statistic value in this case is:
t = (11.9 − 10.8)/(2.5626 × √(1/9 + 1/17)) = 1.04.
This should be compared with t24 and is clearly not significant, even at the 10% significance level, since 1.04 < 1.711 = t0.05, 24. With the smaller samples we fail to detect the difference.
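Both test statistic values in this activity can be verified numerically. This sketch (illustrative only, outside the calculator-based scope of the course) computes the large-sample z statistic from (a) and the pooled-variance t statistic from (c):

```python
import math

# Summary statistics from Activity 8.14 (company A, company B)
mean_a, var_a = 11.9, 7.3
mean_b, var_b = 10.8, 6.2

# (a) Large samples (n_a = 44, n_b = 52): z test with sample variances
n_a, n_b = 44, 52
z = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)   # approx. 2.06

# (c) Small samples (n_a = 9, n_b = 17): pooled-variance t test on 24 df
m_a, m_b = 9, 17
s2_pooled = ((m_a - 1) * var_a + (m_b - 1) * var_b) / (m_a + m_b - 2)
t = (mean_a - mean_b) / (math.sqrt(s2_pooled) * math.sqrt(1 / m_a + 1 / m_b))
# approx. 1.04
```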
Comparing the two test statistic calculations shows that the different results flow from differences in the estimated standard errors, hence ultimately (and unsurprisingly) from the differences in the sample sizes used in the two situations.

8.16.3 Unpaired samples – variances unknown and equal
Although still unknown, if we assume the population variances are equal to some common variance, i.e. σ1² = σ2² = σ², then we only have one (common) unknown variance to estimate. As with confidence intervals, we utilise the pooled variance estimator, given in (7.8).

t test for the difference between two means (variances unknown)
If the population variances σ1² and σ2² are unknown but assumed equal, then:
T = (X̄1 − X̄2 − (µ1 − µ2))/√(Sp²(1/n1 + 1/n2)) ∼ t(n1 + n2 − 2)  (8.8)
where Sp² is the pooled variance estimator, given in (7.8). Hence critical values are obtained from the Student's t distribution with n1 + n2 − 2 degrees of freedom, i.e. using Table 10 of the New Cambridge Statistical Tables.⁹ Note if testing for the equality of means, then µ1 − µ2 = 0 under H0. Hence, in (8.8), we set the term (µ1 − µ2) = 0.
Look back at Section 7.13 for hints about how to decide whether or not σ1² = σ2².

9 Recall for sufficiently 'large' degrees of freedom, a standard normal approximation can be used, provided a justification is given.

Example 8.7 To illustrate this, let us reconsider Example 7.8 where two companies supplying a similar service are compared for their mean reaction times (in days) to complaints. Random samples of recent complaints to these companies gave the following descriptive statistics:

                  Company A   Company B
Sample size           12          10
Sample mean           8.5         4.8
Sample std. dev.      3.6         2.1

We want to test for a true difference in mean reaction times, that is:
H0 : µA = µB vs. H1 : µA ≠ µB.
Because the markets are 'similar', it is reasonable to assume (though it is only an assumption!) that the two population variances are equal.
Under this assumption, using (7.8), we have:
sp² = ((12 − 1) × (3.6)² + (10 − 1) × (2.1)²)/(12 + 10 − 2) = 9.1125.
Using (8.8), the test statistic value is:
t = (8.5 − 4.8)/√(9.1125 × (1/12 + 1/10)) = 2.87.
There are 12 + 10 − 2 = 20 degrees of freedom, hence we obtain critical values from the t20 distribution using Table 10 of the New Cambridge Statistical Tables. For a two-tailed test at the 5% significance level, we obtain critical values of ±2.086, hence the test is significant at the 5% significance level since 2.086 < 2.87. Moving to the 1% significance level, as per the significance level decision tree in Figure 8.1, the critical values are ±2.845, so again we (just) reject H0 since 2.845 < 2.87. On this basis, we have a highly significant result when we reject H0, and we conclude that the mean reaction times are different. Indeed, we have strong evidence that Company B reacts faster than Company A, on average.

Activity 8.15 A study was conducted to determine the amount of hours spent on Facebook by university and high school students. For this reason, a questionnaire was administered to a random sample of 14 university and 11 high school students and the hours per day spent on Facebook were recorded.

                    University students   High school students
Sample size                 14                    11
Sample mean                 3.3                   2.9
Sample variance             0.9                   1.3

(a) Use an appropriate hypothesis test to determine whether the mean hours per day spent on Facebook is different between university and high school students. Test at two appropriate significance levels, stating clearly the hypotheses, the test statistic and its distribution under the null hypothesis. Comment on your findings.
(b) State clearly any assumptions you made in (a).
(c) Adjust the procedure above to determine whether the mean hours spent per day on Facebook for university students is higher than that of high school students.
Solution
(a) Let µ1 denote the mean hours per day spent on Facebook for university students and let µ2 denote the mean hours per day spent on Facebook for high school students. We test:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2.
Since the sample sizes are small, we assume σ1² = σ2² and use the test statistic:
T = (X̄1 − X̄2)/√(Sp²(1/n1 + 1/n2)) ∼ t(n1 + n2 − 2)
under H0. The estimate of the common variance is:
((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2) = (13 × 0.9 + 10 × 1.3)/(14 + 11 − 2) = 1.0739.
The test statistic value is:
(3.3 − 2.9)/√(1.0739 × (1/14 + 1/11)) = 0.9580.
For α = 0.05, the critical values are ±t0.025, 23 = ±2.069 (the t23 distribution is used). Since 0.9580 < 2.069, we do not reject H0. For α = 0.10, the critical values are ±t0.05, 23 = ±1.714. Since 0.9580 < 1.714, again we do not reject H0. We conclude that there is no evidence of a difference in the mean hours spent on Facebook between university and high school students. The test is insignificant.
(b) It is assumed that the random samples are independent and that σ1² = σ2² (due to the small sample sizes).
(c) Now we test:
H0 : µ1 = µ2 vs. H1 : µ1 > µ2.
All that changes are the critical values. For α = 0.05, we have a critical value of t0.05, 23 = 1.714. Since 0.9580 < 1.714, we do not reject H0. For α = 0.10, we have a critical value of t0.10, 23 = 1.319. Since 0.9580 < 1.319, we again do not reject H0. The conclusion is the same as in (a).

Activity 8.16 The following data show the times of flow through an orifice of two types of sand. Is there sufficient evidence that one type flows faster than the other?
Type I: 27.2 26.8 27.4 27.1 26.5
Type II: 29.6 30.0 28.4 30.2

Solution
If µ1 and µ2 are the respective population means, then we wish to test:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2.
The main statistics are:

Type I: n = 5, sample mean = 27.000, sample standard deviation = 0.354, estimated standard error of the mean = 0.354/√5 = 0.16.
Type II: n = 4, sample mean = 29.550, sample standard deviation = 0.806, estimated standard error of the mean = 0.806/√4 = 0.40.

The pooled sample variance is:
sp² = ((5 − 1)(0.354)² + (4 − 1)(0.806)²)/((5 − 1) + (4 − 1)) = 2.450172/7 = 0.3500.
The pooled standard deviation is sp = √0.3500 = 0.592 on 7 degrees of freedom. The estimated standard error is:
sp × √(1/n1 + 1/n2) = 0.592 × √(1/5 + 1/4) = 0.397.
The test statistic value is:
t = (27 − 29.55)/0.397 = −6.43.
Comparing this with the t distribution on 7 degrees of freedom, we see that the test is significant even at the 0.1% significance level (since −6.43 < −4.785 = t0.999, 7) and has a p-value of 0.000178. Conclusion: we are convinced that Type I sand flows faster!

Activity 8.17 The data in the following table show the numbers of daily parking offences in two areas of a city. The day identifications are unknown and the recordings were not necessarily made on the same days. Is there evidence that the areas experience different numbers of offences?
Area A: 38 38 29 45 42 33 27 32 32 34
Area B: 32 38 22 30 34 28 32 34 24

Solution
It is clear that the data for the two areas cannot be regarded as paired, because there is no detail about the days to which the figures refer (and because there are different numbers of values for the two areas anyway). The sample sizes are 10 and 9 for A and B, respectively, so we need to decide whether we can use the combination of a pooled estimate of the variance and a t distribution test on the appropriate number of degrees of freedom. So, start by calculating the sample means and standard deviations (or variances):

                       A       B
n                     10       9
Mean                  35      30.44
Standard deviation    5.68    5.08

Because the events to which the data refer are of the same type, it is reasonable to assume that the population variances of the numbers of offences are the same (especially since the sample standard deviations turn out to be quite similar).
(It can be shown that, if the ratio of the sample variances is between roughly 1/3 and 3, then there is no reason to suppose that the population variances are different.)

We test H0: µA − µB = 0 vs. H1: µA − µB ≠ 0. The pooled variance is:

sp² = ((10 − 1) × (5.68)² + (9 − 1) × (5.08)²) / ((10 − 1) + (9 − 1)) = 29.2243.

Hence the estimated standard error is:

√(29.2243 × (1/10 + 1/9)) = 5.406 × 0.4595 = 2.484

on 17 degrees of freedom. The test statistic value is:

t = (35 − 30.44)/2.484 = 4.56/2.484 = 1.836.

For a two-tailed test at the 5% significance level the critical values from t17 are ±t0.025, 17 = ±2.110, so the evidence is not significant at the 5% significance level since 1.836 < 2.110. For a two-tailed test at the 10% significance level the critical values from t17 are ±t0.05, 17 = ±1.740, so the evidence is weakly significant since 1.740 < 1.836.

We cannot conclude with confidence that there is a difference in the number of parking offences between the two areas, but there is a suspicion that there may be such a difference. It would be preferable to examine more data in order to come to a firmer conclusion.

8.16.4 Paired (dependent) samples

Recall from Section 7.13 that for paired (dependent) samples we work with differenced data to reduce matters to a one-sample analysis. As before, we compute the differenced data as:

d1 = x1 − y1,  d2 = x2 − y2,  ...,  dn = xn − yn

reducing the two-sample problem to a one-sample problem. The test is then analogous to the hypothesis test of a single mean with σ unknown.

t test for the difference in means in paired samples

Using the sample mean and sample standard deviation of the differenced data:

T = (X̄d − µd) / (Sd/√n) ∼ t(n−1).   (8.9)

Hence critical values are obtained from the Student's t distribution with n − 1 degrees of freedom, i.e. using Table 10 of the New Cambridge Statistical Tables.
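Statistic (8.9) is easy to compute from the differenced data alone. A minimal sketch (not part of the original guide; the differenced data below are hypothetical, chosen purely for illustration):

```python
import math

# Hypothetical differenced data d_i = x_i - y_i from n paired observations
d = [1.2, -0.5, 0.8, 2.1, 0.4, 1.6, -0.2, 0.9]
n = len(d)

# Sample mean and sample standard deviation of the differences
dbar = sum(d) / n
sd = math.sqrt(sum((di - dbar) ** 2 for di in d) / (n - 1))

# Test statistic of (8.9) under H0: mu_d = 0, compared with t on n - 1 df
t = dbar / (sd / math.sqrt(n))
print(round(t, 3))
```

The resulting value would be compared with the appropriate critical value from the t distribution on n − 1 degrees of freedom, exactly as in the examples that follow.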
Example 8.8 The table below shows the before and after weights (in pounds) of 8 adults after a diet. Test whether there is evidence that the diet is effective, on average.

Before:  127  130  114  139  150  147  167  153
After:   122  120  116  132  144  138  155  152

We want to test:

H0: µ1 = µ2   vs.   H1: µ1 > µ2

which is equivalent to:

H0: µd = 0   vs.   H1: µd > 0

where we choose a one-tailed test because we are looking for an average reduction (if there is any change, we would expect weight loss from a diet!) and we define µd = µ1 − µ2 since we anticipate that this way round the values will (more likely) be positive. The differences (calculated as 'before − after') are:

5   10   −2   7   6   9   12   1.

Hence n = 8, x̄d = 6 and sd = 4.66 on n − 1 = 7 degrees of freedom. Using (8.9), the test statistic value is:

t = (6 − 0)/(4.66/√8) = 3.64.

At the 5% significance level, the upper-tail critical value is 1.895 (using Table 10 of the New Cambridge Statistical Tables), hence we reject H0 since 1.895 < 3.64. Using Figure 8.1, we proceed to test at the 1% significance level, which uses a critical value of 2.998. Hence we, again, reject H0 since 2.998 < 3.64 and we conclude that the test is highly significant. These data provide strong evidence that the diet reduces weight, on average.

Activity 8.18 Two different methods of determination of the percentage fat content in meat are available. Both methods are used on portions of the same meat sample. Is there any evidence to suggest that one method gives a higher reading than the other?

Meat sample:  1     2     3     4     5     6     7     8
Method I:     23.1  23.2  26.5  26.6  27.1  48.3  40.5  25.0
Method II:    22.7  23.6  27.1  27.4  27.4  46.8  40.4  24.9

Meat sample:  9     10    11    12    13    14    15    16
Method I:     38.4  23.5  22.2  24.7  45.1  27.6  25.0  36.7
Method II:    38.1  23.8  22.5  24.4  43.5  27.0  24.9  35.2

Solution

Since the same meat is being put through two different tests, we are clearly dealing with paired samples. We wish to test:

H0: µ1 = µ2   vs.
H1: µ1 ≠ µ2

but for paired samples we do our calculations using the differenced data, so we might reformulate this presentation of the hypotheses in the form:

H0: µ1 − µ2 = 0   vs.   H1: µ1 − µ2 ≠ 0.

If we list the differences from the original data table then we get:

0.4  −0.4  −0.6  −0.8  −0.3  1.5  0.1  0.1  0.3  −0.3  −0.3  0.3  1.6  0.6  0.1  1.5

The main statistics are:

                             Differences
n                            16
Sample mean                  0.238
Sample standard deviation    0.745
E.S.E. of mean               0.745/√16 = 0.186

So the test statistic value is:

t = (0.238 − 0)/0.186 = 1.28.

Looking at Table 10 of the New Cambridge Statistical Tables for the t15 distribution, we find that the critical values for a two-tailed test at the 5% significance level are ±t0.025, 15 = ±2.131, so this result is not significant at the 5% significance level since 1.28 < 2.131. Indeed, the critical values for a test at the 20% significance level are ±t0.10, 15 = ±1.341, so the result would still not be significant (which, of course, is in any case too high a significance level to be used). The test shows, in fact, that the p-value of the test is over 0.2, although the limited information in Table 10 means that we cannot work it out exactly. (To do so, we would need much more detailed statistical tables, or a computer.)

8.17 Summary

The idea of testing hypotheses is a central part of statistics, and underpins the development of theories and the checking of ideas in management and the social sciences. It is important that you make sure you understand the material introduced in this chapter and Chapters 6 and 7 before you move on to look at the chi-squared distribution in Chapter 9.

8.18 Key terms and concepts

Alternative hypothesis       Common proportion
Critical region              Critical value
Lower-tailed test            Null hypothesis
P-value                      Power
Significance level           Test statistic (value)
Two-tailed test              Type I error
Type II error                Upper-tailed test

8.19 Learning activities

Solutions can be found in Appendix A.

1.
Think about each of the following statements. For each, state the null and alternative hypotheses and say whether they will need one-tailed or two-tailed tests.

(a) The general mean level of family income in a population is known to be 10,000 'ulam' per year. You take a random sample in an urban area and find the mean family income is 6,000 ulam per year in that area. Do the families in the chosen area have a lower mean income than the population as a whole?

(b) You are looking at data from two schools on the heights and weights of children by age. Are the mean weights of children aged 10–11 the same in the two schools?

(c) You are looking at reading scores for children before and after a new teaching programme. Have their scores improved on average?

2. Complete the following chart:

                        Result from your test
Real situation          Not reject H0                 Reject H0
H0 true                 Correct decision:
                        probability (1 − α)
H1 true                 Type II error                 Probability (1 − β),
                                                      called the power of the test

3. Say whether the following statement is true or false and briefly give your reasons. 'The power of a test is the probability that the correct hypothesis is chosen.'

4. Explain your attitude to a null hypothesis if a test of the hypothesis is significant:

(a) at the 1% significance level

(b) at the 10% significance level, but not at the 5% significance level.

5. The manufacturer of a patent medicine claimed that it was 90% effective in relieving an allergy for a period of 8 hours. In a random sample of 200 people suffering from the allergy, the medicine provided relief for 160 people. Determine whether the manufacturer's claim is legitimate. (Be careful: your parameter here will be π.) Is your test one-tailed or two-tailed?

6. You have been asked to compare the percentages of people in two groups with n1 = 16 and n2 = 24 who are in favour of a new plan. You decide to make a pooled estimate of the proportion and make a test. Which test would you use?

7.
Look again at Question 5 in the equivalent section of Chapter 7. You were asked in part (b) whether you thought one company gave on average more satisfaction than the other. Now give the null and alternative hypotheses for such a one-tailed test. Show how you would decide whether or not to reject the null hypothesis.

8.20 A reminder of your learning outcomes

After completing this chapter, and having completed the essential reading and activities, you should be able to:

set up the null and alternative hypotheses for a problem and state whether the latter is one-sided or two-sided, leading to a one-tailed or two-tailed test, respectively

define and apply the terminology of hypothesis testing

define and explain the types of errors which can be made in hypothesis testing

perform hypothesis tests on means and proportions.

8.21 Sample examination questions

Solutions can be found in Appendix B.

1. An experiment is conducted to determine whether intensive tutoring (covering a great deal of material in a fixed amount of time) is more effective than standard tutoring (covering less material in the same amount of time) for a particular course. Two randomly chosen groups of students were tutored separately and their examination marks on the course were recorded. The data are summarised in the table below:

                             Intensive tutoring    Standard tutoring
Sample size                  22                    25
Mean examination mark        65.33                 61.58
Sample standard deviation    6.61                  5.37

(a) Use an appropriate hypothesis test to determine whether there is a difference between the mean examination marks of the two tutoring groups. State clearly the hypotheses, the test statistic and its distribution under the null hypothesis, and carry out the test at two appropriate significance levels. Comment on your findings.

(b) State clearly any assumptions you made in part (a).

2. (a) A pharmaceutical company is conducting an experiment to test whether a new type of pain reliever is effective.
In this context, a treatment is considered effective if it is successful with a probability of more than 0.5. The pain reliever was given to 30 patients and it reduced the pain for 20 of them. You are asked to use an appropriate hypothesis test to determine whether the pain reliever is effective. State the test hypotheses, and specify your test statistic and its distribution under the null hypothesis. Comment on your findings.

(b) A second experiment followed where a placebo pill was given to another group of 40 patients. A placebo pill contains no medication and is prescribed so that the patient will expect to get well. In some situations, this expectation is enough for the patient to recover. This effect, also known as the placebo effect, occurred to some extent in the second experiment, where the pain was reduced for 21 of the patients. You are asked to consider an appropriate hypothesis test to incorporate this new evidence with the previous data and reassess the effectiveness of the pain reliever.

3. A paired-difference experiment involved n = 121 adults. For each adult a characteristic was measured under two distinct conditions and the difference in characteristic values was recorded. The sample mean of the differences was 1.195, whereas their sample standard deviation was 10.2. The researchers reported a t statistic value of 1.289 when testing whether the means of the two conditions are the same.

(a) Show how the researchers obtained the t statistic value of 1.289.

(b) Calculate the p-value of the test and use the p-value to test the hypothesis of equal means. Use a 5% significance level.

Chapter 9
Contingency tables and the chi-squared test

9.1 Aims

This chapter introduces you to a further application of hypothesis testing using a new distribution – the chi-squared (χ²) distribution. This distribution enables you to work with categorical variables.
Your aims are to:

learn a new application

link this with different applications in the business and social statistical fields.

9.2 Learning outcomes

After completing this chapter, and having completed the Essential reading and activities, you should be able to:

set up the null and alternative hypotheses appropriate for a contingency table

compute the degrees of freedom, expected frequencies and appropriate critical values of chi-squared for a contingency table

summarise the limitations of a chi-squared test

extend from chi-squared to an appropriate test of proportions, if necessary

work with a one-row or one-column contingency table as above.

9.3 Essential reading

Newbold, P., W.L. Carlson and B.M. Thorne, Statistics for Business and Economics (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060] Sections 14.1 and 14.3.

In addition, there is essential 'watching' of this chapter's accompanying video tutorials, accessible via the ST104a Statistics 1 area at http://my.londoninternational.ac.uk

9.4 Further reading

Wonnacott, T.H. and R.J. Wonnacott, Introductory Statistics for Business and Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN 9780471615170] Chapter 17.

9.5 Introduction

In Chapter 8 we focused on testing the value of a particular population parameter of interest, such as a mean, µ, or a proportion, π. Being able to perform such statistical tests is of particular use when making policy decisions. Here we shall look at two additional testing procedures: one which deals with testing whether an association exists between two categorical variables (introduced in Chapter 4), and a second which considers the shape of the distribution from which the sample data were drawn. Both tests can easily be applied to business, management and social science fields.
9.6 Correlation and association

If we are asking whether two measurable variables (see Section 4.6) are related to each other, the technical way of describing the case when they are is to say that the two variables are correlated. It is possible to measure the strength of a correlation based on sample data, and also to study to what extent changes in one variable 'explain' changes in another – a topic known as regression. These topics will be discussed in Chapter 12.

However, we also often wish to answer questions such as 'Do vegetarians tend to support Party X?' or 'Do Party X supporters tend to be vegetarians?'. If either question has an affirmative answer, we say that there is an association between the two categories (or 'attributes' or 'factors'). We will need frequency data in order to answer such questions but, in this course, we will not be concerned with the strength of any such association; rather, we restrict ourselves to testing for its existence and commenting on the nature of any association which we discover.

Finally, note that if one variable is measurable and the other categorical, we would still be testing for the existence of association instead of computing correlation. In this case we would 'demote' the measurable variable to categorical status by creating levels. For example, age could be converted to a categorical variable by creating age groups.¹

¹ Note that age group would be an ordinal variable, since we could rank the age groups in order, from youngest to oldest.

9.7 Tests for association

This type of test tests the null hypothesis that two factors (or attributes) are not associated, against the alternative hypothesis that they are associated. Each data unit we sample has one level (or 'type' or 'variety') of each factor.

Example 9.1 Suppose that we are sampling people, and that one factor of interest is hair colour (black, blonde, brown etc.)
while another factor of interest is eye colour (blue, brown, green etc.). In this example, each sampled person has one level of each factor. We wish to test whether or not these factors are associated. Hence:

H0: There is no association between hair colour and eye colour²
H1: There is an association between hair colour and eye colour.

So, under H0, the distribution of eye colour is the same for blondes as it is for brunettes etc., whereas if H1 is true it may be attributable to blonde-haired people having a (significantly) higher proportion of blue eyes, say. The association might also depend on the sex of the person, and that would be a third factor which was associated with (i.e. interacted with) both of the others.³

The main way of analysing these questions is by using a contingency table, discussed in the next section.

9.7.1 Contingency tables

In a contingency table, also known as a cross-tabulation, the data are in the form of frequencies (counts), where the observations are organised in cross-tabulated categories. We sample a certain number of units (such as people) and classify them according to the two factors of interest.

Example 9.2 In three areas of a city, a record has been kept of the numbers of burglaries, robberies and car thefts which take place in a year. The total number of offences was 150, and they were divided into the various categories as shown in the following contingency table:

Area     Burglary    Robbery    Car theft    Total
A        30          19         6            55
B        12          23         14           49
C        8           18         20           46
Total    50          60         40           150

The cell frequencies are known as observed frequencies and show how the data are spread across the different combinations of factor levels (known as contingencies). The first step in any analysis is to complete the row and column totals (as already done in this table).
² When conducting tests for association, the null hypothesis can be expressed either as 'There is no association between categorical variables X and Y', or as 'Categorical variables X and Y are independent'. The corresponding alternative hypothesis would then replace 'no association' or 'independent' with 'association' or 'not independent (dependent)', respectively.

³ We will not consider interactions in these testing procedures in ST104a Statistics 1.

9.7.2 Expected frequencies

We proceed by computing a corresponding set of expected frequencies, conditional on the null hypothesis of no association between the factors, i.e. that the factors are independent.

Now suppose that you are only given the row and column totals for the frequencies. If the factors were assumed to be independent, consider how you would calculate the expected frequencies. Recall from Chapter 5 that if A and B are two independent events, then P(A ∩ B) = P(A) P(B). We now apply this idea.

Example 9.3 For the data in Example 9.2, if a record was selected at random from the 150 records:

P(a crime being a burglary) = 50/150
P(a crime being in area A) = 55/150.

Hence, under H0, we have:

P(a crime being a burglary in area A) = (50/150) × (55/150)

and so the expected number of burglaries in area A is:

150 × (50/150) × (55/150).

So the expected frequency is obtained by multiplying the product of the 'marginal' probabilities by n, the total number of observations. This can be generalised as follows.

Expected frequencies in contingency tables

The expected frequency, Eij, for the cell in row i and column j of a contingency table with r rows and c columns, is:

Eij = (row i total × column j total) / (total number of observations)

where i = 1, ..., r and j = 1, ..., c.
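The Eij formula is simple to apply by machine as well as by calculator. A short sketch (not part of the original guide) in Python, using the Example 9.2 crime data:

```python
# Observed frequencies from Example 9.2
# (rows: areas A, B, C; columns: burglary, robbery, car theft)
observed = [[30, 19, 6],
            [12, 23, 14],
            [8, 18, 20]]

row_totals = [sum(row) for row in observed]        # 55, 49, 46
col_totals = [sum(col) for col in zip(*observed)]  # 50, 60, 40
n = sum(row_totals)                                # 150

# E_ij = (row i total x column j total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

print([[round(e, 2) for e in row] for row in expected])
# → [[18.33, 22.0, 14.67], [16.33, 19.6, 13.07], [15.33, 18.4, 12.27]]
```

Note that each row of expected frequencies sums to the corresponding row total, a useful arithmetic check when working by hand.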
Example 9.4 The completed expected frequency table for the data in Example 9.2 is (rounding to two decimal places, which is recommended in the examination):

Area     Burglary    Robbery    Car theft    Total
A        18.33       22.00      14.67        55
B        16.33       19.60      13.07        49
C        15.33       18.40      12.27        46
Total    50          60         40           150

Make sure you can replicate these expected frequencies using your own calculator.

9.7.3 Test statistic

To motivate our choice of test statistic, if H0 is true then we would expect to observe small differences between the observed and expected frequencies, while large differences would suggest that H1 is true. This is because the expected frequencies have been calculated conditional on the null hypothesis of independence. Hence, if H0 is actually true, then what we actually observe (the observed frequencies) should be (approximately) equal to what we expect to observe (the expected frequencies).

χ² test of association

Let the contingency table have r rows and c columns. Formally, the test statistic⁴ used for tests of association is:

Σ(i=1 to r) Σ(j=1 to c) (Oij − Eij)²/Eij ∼ χ²(r−1)(c−1).   (9.1)

Critical values are found using Table 8 of the New Cambridge Statistical Tables. Notice the 'double summation' here just means summing over all rows and columns. This test statistic follows an (approximate) chi-squared distribution with (r − 1)(c − 1) degrees of freedom, where r and c denote the number of rows and columns, respectively, in the contingency table.

9.7.4 The χ² distribution

A chi-squared variable is only defined over positive values. The precise shape of the distribution depends on the degrees of freedom, which is a parameter of the distribution. Figure 9.1 illustrates the chi-squared distribution for a selection of degrees of freedom. Although the shape of the distribution changes quite significantly for different degrees of freedom, note that the distribution is always positively skewed.
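The densities in Figure 9.1 can be reproduced with a few lines of Python (a sketch, not part of the original guide), since the χ² density with k degrees of freedom is f(x) = x^(k/2 − 1) e^(−x/2) / (2^(k/2) Γ(k/2)) for x > 0:

```python
import math

def chi2_pdf(x, k):
    """Density of the chi-squared distribution with k degrees of freedom (x > 0)."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

# The density integrates to 1 over (0, infinity); a crude Riemann-sum check for k = 4
xs = [i / 100 for i in range(1, 5001)]          # grid on (0, 50]
area = sum(chi2_pdf(x, 4) for x in xs) * 0.01
print(round(area, 2))  # → 1.0
```

Evaluating `chi2_pdf` over a grid for k = 1, 4, 8 and 10 and plotting the results reproduces the positively skewed curves shown in Figure 9.1.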
[Figure 9.1: Examples of the chi-squared distribution with a selection of degrees of freedom (1, 4, 8 and 10), plotting the density fX(x) against x.]

⁴ This test statistic was first used by the statistician Karl Pearson, hence it is often referred to as 'Pearson's chi-squared statistic'. We will not worry in this course about how this test statistic is derived.

9.7.5 Degrees of freedom

The general expression for the number of degrees of freedom in tests for association is:

(number of cells) − (number of times the data are used to calculate the Eij's).

For an r × c contingency table, we begin with rc cells. We lose one degree of freedom for needing to use the total number of observations to compute the expected frequencies. However, we also use the row and column totals in these calculations, but we only need r − 1 row totals and c − 1 column totals, as the final one in each case can be deduced using the total number of observations. Hence we only lose r − 1 degrees of freedom for the row totals and, similarly, we only lose c − 1 degrees of freedom for the column totals. Hence the overall degrees of freedom are:

ν = rc − (r − 1) − (c − 1) − 1 = (r − 1)(c − 1).

9.7.6 Performing the test

As usual, we choose a significance level, α, at which to conduct the test. However, are we performing a one-tailed test or a two-tailed test? To determine this, we need to consider what sort of test statistic value would be considered extreme under H0. As seen in Figure 9.1, the chi-squared distribution only takes positive values. The squared term in the numerator of the test statistic in (9.1) ensures the test statistic value will be positive (the Eij's in the denominator are clearly positive too).
If H0 is true, then the observed and expected frequencies should be quite similar, since the expected frequencies are computed conditional on the null hypothesis of independence. This means that |Oij − Eij| should be quite small for all cells. In contrast, if H0 is not true, then we would expect comparatively large values of |Oij − Eij| due to large differences between the two sets of frequencies. Therefore, upon squaring |Oij − Eij|, sufficiently large test statistic values suggest rejection of H0. Hence χ² tests of association are always upper-tailed tests.

Example 9.5 Using the data in Example 9.2, we proceed with the hypothesis test. Note it is advisable to present your calculations as an extended contingency table as shown below, where the three rows in each cell correspond to the observed frequency, the expected frequency and the test statistic contribution.

Area                        Burglary    Robbery    Car theft    Total
A     O1·                   30          19         6            55
      E1·                   18.33       22.00      14.67        55
      (O1· − E1·)²/E1·      7.48        0.41       5.15
B     O2·                   12          23         14           49
      E2·                   16.33       19.60      13.07        49
      (O2· − E2·)²/E2·      1.13        0.59       0.06
C     O3·                   8           18         20           46
      E3·                   15.33       18.40      12.27        46
      (O3· − E3·)²/E3·      3.48        0.01       4.82
Total                       50          60         40           150

Using (9.1), we obtain a test statistic value of:

Σ(i=1 to 3) Σ(j=1 to 3) (Oij − Eij)²/Eij = 7.48 + 0.41 + ··· + 4.82 = 23.13.

Since r = c = 3, we have (r − 1)(c − 1) = (3 − 1)(3 − 1) = 4 degrees of freedom. For α = 0.05, using Table 8 of the New Cambridge Statistical Tables, we obtain an upper-tail critical value of 9.488. Hence we reject H0 since 9.488 < 23.13. Moving to the 1% significance level, the critical value is now 13.28 so, again, we reject H0 since 13.28 < 23.13. Therefore, the test is highly significant and we conclude that there is strong evidence of an association between the factors.
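The whole of Example 9.5 can be verified in a few lines of Python (a sketch, not part of the original guide). Working with unrounded expected frequencies gives a total of 23.15; the guide's 23.13 reflects rounding each cell contribution to two decimal places before summing:

```python
# Observed frequencies from Example 9.2
# (rows: areas A, B, C; columns: burglary, robbery, car theft)
observed = [[30, 19, 6],
            [12, 23, 14],
            [8, 18, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Chi-squared statistic of (9.1): sum over all cells of (O - E)^2 / E,
# with E_ij = (row i total x column j total) / n
stat = sum((observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i in range(3) for j in range(3))

df = (3 - 1) * (3 - 1)
print(round(stat, 2), df)  # → 23.15 4
```

Since the statistic comfortably exceeds the upper-tail critical value of 9.488 for χ²₄ at the 5% significance level, the code confirms the rejection of H0 reached in Example 9.5.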
Looking again at the contingency table, comparing observed and expected frequencies, the interpretation of this association becomes clear – burglary is the main problem in area A whereas car theft is a problem in area C. (We can deduce this by looking at the cells with large test statistic contributions, which are a consequence of large differences between observed and expected frequencies.) Incidentally, the p-value for this upper-tailed test is (using a computer) P(X > 23.13) = 0.000119, where X ∼ χ²₄, emphasising the extreme significance of this test statistic value.

The conclusions in Example 9.5 are fairly obvious, given the small dimensions of the contingency table. However, for data involving more factors, and more factor levels, this type of analysis can be very insightful. Cells which make a large contribution to the test statistic value (i.e. which have large values of (Oij − Eij)²/Eij) should be studied carefully when determining the nature of an association. This is because, in cases where H0 has been rejected, rather than simply concluding that there is an association between two categorical variables, it is helpful to describe the nature of the association.

Activity 9.1 An analyst of the retail trade uses as analytical tools the concepts of 'Footfall' (the daily number of customers per unit sales area of a shop) and 'Ticket price' (the average sale price of an item in the shop's offer). Shops are classified as offering low, medium or high price items and, during any sales period, as having low, medium or high footfall. During the January sales, the analyst studies a sample of shops and obtains the following frequency data for the nine possible combined classifications:

                   Low price    Medium price    High price
Low footfall       22           43              16
Medium footfall    37           126             25
High footfall      45           75              23

Conduct a suitable test for association between Ticket price classification and Footfall level, and report on your findings.
Solution

We test:

H0: No association between ticket price and footfall level   vs.   H1: Association between ticket price and footfall level.

We next compute the expected frequency for each cell using:

(row i total × column j total) / (total number of observations)

which gives:

                              Ticket price
Footfall                      Low      Medium    High     Total
Low     O1·                   22       43        16       81
        E1·                   20.45    47.97     12.58
        (O1· − E1·)²/E1·      0.12     0.52      0.93
Medium  O2·                   37       126       25       188
        E2·                   47.46    111.34    29.20
        (O2· − E2·)²/E2·      2.30     1.93      0.61
High    O3·                   45       75        23       143
        E3·                   36.10    84.69     22.21
        (O3· − E3·)²/E3·      2.20     1.11      0.03
Total                         104      244       64       412

The test statistic value is:

Σ(i=1 to 3) Σ(j=1 to 3) (Oij − Eij)²/Eij = 0.12 + 0.52 + ··· + 0.03 = 9.73.

Since r = c = 3, we have (r − 1)(c − 1) = (3 − 1)(3 − 1) = 4 degrees of freedom. For α = 0.05, using Table 8 of the New Cambridge Statistical Tables, we obtain an upper-tail critical value of 9.488. Hence we reject H0 since 9.488 < 9.73. Moving to the 1% significance level, the critical value is now 13.277, so we do not reject H0 since 9.73 < 13.277. Therefore, the test is moderately significant and we conclude that there is moderate evidence of an association between ticket price and footfall.

Activity 9.2 Give an example of a 2 × 2 contingency table in which there is:

(a) no association

(b) strong association.

Be sure to show why each contingency table exhibits the property.

Solution

(a) Any 2 × 2 contingency table with equal observed frequencies would suffice. For example (where k is some constant):

           Level 1    Level 2
Level 1    k          k
Level 2    k          k

Clearly, Oij = Eij = k for all cells, leading to a test statistic value of zero for the chi-squared test.

(b) A possible example of a 2 × 2 contingency table showing strong association is:

           Level 1    Level 2
Level 1    100        200
Level 2    200        100

All expected frequencies are 300 × 300/600 = 150, and (Oij − Eij)²/Eij = 16.67 for all cells.
Therefore, the test statistic value for the chi-squared test is 4 × 16.67 = 66.67, which is highly significant given that the critical value for the 0.1% significance level on (2 − 1)(2 − 1) = 1 degree of freedom is 10.83.

Activity 9.3 The table below shows the relationship between gender and party identification in a US state.

          Party identification
Gender    Democrat    Independent    Republican    Total
Male      108         72             198           378
Female    216         59             142           417
Total     324         131            340           795

Test for an association between gender and party identification at two appropriate significance levels and comment on your results.

Solution

We test:

H0: There is no association between gender and party identification   vs.   H1: There is an association between gender and party identification.

We construct the following contingency table:

                              Party identification
                              Democrat    Independent    Republican    Total
Male    O1j                   108         72             198           378
        E1j                   154.05      62.29          161.66        378
        (O1j − E1j)²/E1j      13.77       1.51           8.17
Female  O2j                   216         59             142           417
        E2j                   169.95      68.71          178.34        417
        (O2j − E2j)²/E2j      12.48       1.37           7.40
Total                         324         131            340           795

The χ² test statistic value is Σi,j (Oij − Eij)²/Eij = 44.71. The number of degrees of freedom is (2 − 1)(3 − 1) = 2, so we compare the test statistic value with the χ²₂ distribution in Table 8 of the New Cambridge Statistical Tables. The test is significant at the 5% significance level (the critical value is 5.991, and 5.991 < 44.71). Clearly, it is also significant at the 1% significance level (the critical value is 9.210, and 9.210 < 44.71), so we have found (very!) strong evidence of an association between gender and party identification.

Comparing the observed and expected frequencies, we see that males are more likely to identify as Republican supporters, while females are more likely to identify as Democrat supporters.

Activity 9.4 The following table shows the numbers of car accidents in an urban area over a period of time. These are classified by severity and by type of vehicle.
Carry out a test for association on these data and draw conclusions.

              Severity of accident
              Minor    Medium    Major
Saloon        29       39        16
Van           15       24        12
Sports car    7        20        12

Solution

We test:

H0: There is no association between type of vehicle and severity of accident   vs.   H1: There is an association between type of vehicle and severity of accident.

We construct the following contingency table:

                             Severity of accident
                             Minor    Medium    Major    Total
Saloon  O1j                  29       39        16       84
        E1j                  24.6     40.1      19.3     84
        (O1j − E1j)²/E1j     0.78     0.03      0.57
Van     O2j                  15       24        12       51
        E2j                  15.0     24.3      11.7     51
        (O2j − E2j)²/E2j     0.00     0.00      0.00
Sports  O3j                  7        20        12       39
        E3j                  11.4     18.6      9.0      39
        (O3j − E3j)²/E3j     1.72     0.10      1.03
Total                        51       83        40       174

The χ² test statistic value is Σi,j (Oij − Eij)²/Eij = 4.24. The number of degrees of freedom is (3 − 1)(3 − 1) = 4, so we compare the test statistic value with the χ²₄ distribution in Table 8 of the New Cambridge Statistical Tables. The test is not significant at the 5% significance level (the critical value is 9.488, and 4.24 < 9.488). It is also not significant at the 10% significance level (the critical value is 7.779, and 4.24 < 7.779), so we have found no evidence whatsoever of an association since the test is insignificant.

Activity 9.5 The classification of examination candidates by four examining bodies is shown below. Is there evidence that the examining bodies are classifying differently? If so, explain what the differences are.

         Examining bodies
         A      B      C      D
Pass     233    737    358    176
Refer    16     68     29     20
Fail     73     167    136    64

Solution

We test:

H0: There is no association between examining bodies and grade classifications   vs.   H1: There is an association between examining bodies and grade classifications.

We construct the following contingency table:
                          Examining bodies
                      A       B       C       D     Total
Pass    O_1j        233     737     358     176     1504
        E_1j        233.2   703.8   378.7   188.3   1504
        (O−E)²/E      0.00    1.56    1.13    0.80
Refer   O_2j         16      68      29      20      133
        E_2j         20.6    62.2    33.5    16.6    133
        (O−E)²/E      1.04    0.53    0.60    0.67
Fail    O_3j         73     167     136      64      440
        E_3j         68.2   205.9   110.8    55.1    440
        (O−E)²/E      0.34    7.35    5.73    1.47
Total               322     972     523     260     2077

The χ² test statistic value is Σ_{i,j} (O_ij − E_ij)²/E_ij = 21.21. The number of degrees of freedom is (4 − 1)(3 − 1) = 6, so we compare the test statistic with the χ² distribution with 6 degrees of freedom in Table 8 of the New Cambridge Statistical Tables. The test is significant at the 5% significance level (the critical value is 12.59, and 12.59 < 21.21); indeed, it is also significant at the 1% significance level (the critical value is 16.81, and 16.81 < 21.21). So we can conclude that the test is highly significant and hence there is an association between examining bodies and grade classifications.

Looking at the table for large individual contributions to the test statistic value and comparing observed and expected frequencies, it appears that B fails fewer people than expected under independence and C fails more. Note that none of the other individual contributions to the χ² test statistic value is a large fraction of the 21.21 total.

Activity 9.6 In a random sample of 200 children in a school it was found that 171 had been inoculated against the common cold before the winter. The table below shows the numbers observed to have suffered from colds over the winter season.

            Inoculated   Not inoculated
No cold         40              5
Cold           131             24

Test for an association between colds and inoculation using the standard test for a contingency table, and draw conclusions.

Solution

We test:

H0: There is no association between being inoculated and cold prevention
vs.
H1: There is an association between being inoculated and cold prevention.
We construct the following contingency table:

                 Inoculated   Not inoculated   Total
No cold  O_1j       40              5             45
         E_1j       38.475          6.525         45
         (O−E)²/E    0.060          0.356
Cold     O_2j      131             24            155
         E_2j      132.525         22.475        155
         (O−E)²/E    0.018          0.103
Total              171             29            200

The number of degrees of freedom is (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1. The χ² test statistic value is 0.060 + 0.356 + 0.018 + 0.103 = 0.537. For comparison, the critical value of the χ² distribution with 1 degree of freedom for α = 0.3 (i.e. a 30% significance level!) is 1.074. Since 0.537 < 1.074, the test statistic value is clearly not significant. There is no evidence that inoculation prevents colds.

9.7.7 Extending to an appropriate test of proportions

If we have a 2 × 2 contingency table, then there are two equivalent ways of testing the same hypothesis: (i) a χ² test as outlined above, and (ii) a test of the difference between two proportions as detailed in Chapter 8. See the fourth Learning activity at the end of this chapter for an example to practise.

9.8 Goodness-of-fit tests

In addition to tests of association, the chi-squared distribution is often used more generally in so-called 'goodness-of-fit' tests. We may, for example, wish to answer questions such as 'Is it reasonable to assume the data follow a particular distribution?'. This justifies the name 'goodness-of-fit' tests, since we are testing whether or not a particular probability distribution provides an adequate fit to the observed data. The null hypothesis asserts that a specific hypothetical population distribution is the true one; the alternative hypothesis is that this specific distribution is not the true one.

There is a special case which we shall consider in ST104a Statistics 1 when you are only dealing with one row or one column. This is when we wish to test whether the sample data are drawn from a (discrete) uniform distribution, i.e. whether each characteristic is equally likely.

Example 9.6 Is a given die fair?
If the die is fair, then the values of the faces (1, 2, 3, 4, 5 and 6) are all equally likely, so we have the following hypotheses:

H0: Score is uniformly distributed vs. H1: Score is not uniformly distributed.

9.8.1 Observed and expected frequencies

As with tests of association, the goodness-of-fit test involves both observed and expected frequencies. In all goodness-of-fit tests, the sample data must be expressed in the form of observed frequencies associated with certain classifications of the data. Assume k classifications, hence the observed frequencies can be denoted by O_i, for i = 1, . . . , k.

Example 9.7 Extending Example 9.6, for a die the obvious classifications would be the six faces. If the die is thrown n times, then our observed frequency data would be the number of times each face appeared. Here k = 6.

Recall that in hypothesis testing we always assume that the null hypothesis, H0, is true. In order to conduct a goodness-of-fit test, expected frequencies are computed conditional on the probability distribution expressed in H0. The test statistic will then involve a comparison of the observed and expected frequencies. If H0 is true, then we would expect small differences between these two sets of frequencies, while large differences would indicate support for H1. We now consider how to compute expected frequencies for discrete uniform probability distributions.

Expected frequencies in goodness-of-fit tests

For discrete uniform probability distributions, expected frequencies are computed as:

    E_i = n × 1/k    for i = 1, 2, . . . , k

where n denotes the sample size and 1/k is the uniform (i.e. equal, same) probability for each characteristic.

Expected frequencies should not be rounded, just as we do not round sample means, say. Note that the final expected frequency (for the kth category) can easily be computed using the formula:

    E_k = n − Σ_{i=1}^{k−1} E_i.

This is because we have a constraint that the sums of the observed and expected frequencies must be equal,⁵ that is:

    Σ_{i=1}^{k} O_i = Σ_{i=1}^{k} E_i

which results in a loss of one degree of freedom (discussed below).

⁵ Recall the motivation for computing expected frequencies in the first place – assuming H0 is true, we want to know how a sample of size n is expected to be distributed across the k categories.

9.8.2 The goodness-of-fit test

Having computed expected frequencies, we now need to formally test H0 and so we need a test statistic.

Goodness-of-fit test statistic

For a discrete uniform distribution with k categories, observed frequencies O_i and expected frequencies E_i, the test statistic is:

    Σ_{i=1}^{k} (O_i − E_i)²/E_i ∼ χ²_{k−1}    approximately under H0.    (9.2)

Note that this test statistic does not have a true χ² distribution with k − 1 degrees of freedom under H0; rather, this is only an approximating distribution. An important point to note is that this approximation is only good enough provided all the expected frequencies are at least 5.

As seen in (9.2), there are k − 1 degrees of freedom when testing a discrete uniform distribution. k is the number of categories, and we lose one degree of freedom due to the constraint that:

    Σ_i O_i = Σ_i E_i.

As with the test of association, goodness-of-fit tests are upper-tailed tests since, under H0, we would expect to see small differences between the observed and expected frequencies, as the expected frequencies are computed conditional on H0. Hence large test statistic values are considered extreme under H0, since these arise from large differences between the observed and expected frequencies.

Example 9.8 A confectionery company is trying out different wrappers for a chocolate bar – its original, A, and two new ones, B and C. It puts the bars on display in a supermarket and looks to see how many of each wrapper type have been sold in the first hour, with the following results.
Wrapper type           A    B    C   Total
Observed frequencies   8   10   15     33

Is there a difference between wrapper types in the consumer choices made? To answer this we need to test:

H0: There is no difference in preference for the wrapper types
vs.
H1: There is a difference in preference for the wrapper types.

How do we work out the expected frequencies? Well, for equal preferences, with three choices (k = 3), the expected frequencies will be:

    E_i = 33 × 1/3 = 11    for i = 1, 2 and 3.

Applying (9.2), our test statistic value is:

    Σ_{i=1}^{3} (O_i − E_i)²/E_i = (8 − 11)²/11 + (10 − 11)²/11 + (15 − 11)²/11 = 2.364.

The degrees of freedom will be k − 1 = 3 − 1 = 2. At the 5% significance level, the upper-tail critical value is 5.991, using Table 8 of the New Cambridge Statistical Tables, so we do not reject H0 since 2.364 < 5.991. If we now consider the 10% significance level,⁶ the critical value is 4.605, so again we do not reject H0 since 2.364 < 4.605. Hence the test is not significant. Therefore, there is no evidence of a strict preference for a particular wrapper type based on the choices observed in the supermarket during the first hour.⁷

Activity 9.7 Many people believe that when a horse races, it has a better chance of winning if its starting line-up position is closer to the rail on the inside of the track. The starting position of 1 is closest to the inside rail, followed by position 2, and so on. The table below lists the numbers of wins of horses in the different starting positions. Do the data support the claim that the probabilities of winning in the different starting positions are not all the same?

Starting position   1   2   3   4   5   6   7   8
Number of wins     29  19  18  25  17  10  15  11

Solution

We test whether the data follow a discrete uniform distribution over the 8 categories. Let p_i = P(X = i), for i = 1, . . . , 8. We test the null hypothesis H0: p_i = 1/8, for i = 1, . . . , 8. Note n = 29 + 19 + 18 + 25 + 17 + 10 + 15 + 11 = 144.
The expected frequencies are E_i = 144/8 = 18, for all i = 1, . . . , 8.

Starting position    1     2     3     4     5     6     7     8   Total
O_i                 29    19    18    25    17    10    15    11    144
E_i                 18    18    18    18    18    18    18    18    144
O_i − E_i           11     1     0     7    −1    −8    −3    −7      0
(O_i − E_i)²/E_i   6.72  0.06  0    2.72  0.06  3.56  0.50  2.72  16.34

Under H0, Σ_i (O_i − E_i)²/E_i ∼ χ² with 7 degrees of freedom. At the 5% significance level, the critical value is 14.067. Since 14.067 < 16.34, we reject the null hypothesis. Turning to the 1% significance level, the critical value is 18.475. Since 16.34 < 18.475, we cannot reject the null hypothesis, hence we conclude that the test is moderately significant. There is moderate, but not strong, evidence to support the claim that the chances of winning in the different starting positions are not all the same.

⁶ Remember that if the test is not significant at the 5% significance level, we should then test at the 10% significance level. See Figure 8.1.
⁷ This is the same as testing for the suitability of the discrete uniform distribution, i.e. that each wrapper is equally popular.

9.9 Summary

You should regard this chapter as an opportunity to revise your work on hypothesis testing in Chapter 8 and also revisit your work on testing proportions. The only new material here is a way of testing the significance of countable figures as opposed to their attributes.

9.10 Key terms and concepts

Association                     Categorical variable
Chi-squared distribution        Contingency table
Cross-tabulation                Discrete uniform distribution
Expected frequency              Goodness-of-fit test
Observed frequency

9.11 Learning activities

Solutions can be found in Appendix A.

1. A survey has been made of levels of satisfaction with housing by people living in different types of accommodation. Levels of satisfaction are categorised as being high, medium, low and very dissatisfied. Levels of housing type are public housing apartment, public housing house, private apartment, private detached house, private semi-detached house, miscellaneous (includes boat, caravan etc.).
State:

(a) the null and alternative hypotheses
(b) the degrees of freedom
(c) the 5% and 1% critical values for a χ² test.

(Remember that you will reject the null hypothesis if your calculated test statistic value is greater than the critical value obtained from Table 8 of the New Cambridge Statistical Tables.)

2. Draw the χ² curve and put in the rejection region for 5% and 1% significance levels with 6 degrees of freedom. Make sure you understand which calculated values of χ² will lead you to reject H0.

3. You have carried out a χ² test on a 3 × 4 contingency table to study whether there is evidence of an association between advertising type and level of sales of a product. You have four types of advertising (A, B, C and D) and three levels of sales (low, medium and high). Your calculated χ² value is 13.5. Giving degrees of freedom and an appropriate significance level, set out your hypotheses. What would you say about the result?

4. In a survey conducted to decide where to locate a factory, samples from five towns were examined to see the numbers of skilled and unskilled workers. The data were as follows.

Area                            A     B     C     D     E
Number of skilled workers      80    58   114    55    83
Number of unskilled workers   184   147   276   196   229

(a) Does the population proportion of skilled workers vary with the area?
(b) Test: H0: π_D = π_others vs. H1: π_D < π_others.

Think about your results and what you would explain to your management team who had seen the chi-squared results and want, for other reasons, to site their factory at area D.

5. Look at the following table taken from a study of sex differences in perception. One of the tests was of scores in verbal reasoning.

          Male   Female
High       50      90
Low       150     210
Totals    200     300

Do these figures show a difference in verbal reasoning by gender? Try to do this:

(a) using a chi-squared test
(b) using a test of differences between proportions.

6.
The table below shows a contingency table for a sample of 1,104 randomly-selected adults from three types of environment (City, Town and Rural) who have been classified into two groups by level of exercise (high and low). Test the hypothesis that there is no association between level of exercise and type of environment and draw conclusions.

               Level of exercise
Environment    High    Low
City            221    256
Town            230    118
Rural           159    120

7. Two surveys have collected information on adult and teenage cigarette use. The results of the first survey are given in the first table below (sample size 4,000) and the results of the second survey, carried out two years later on a new sample of 1,000 households, are underneath it.

(a) Without doing any further calculations, comment on any association for the first survey.
(b) Calculate the chi-squared statistic for the second survey and test for association.
(c) Write a short report explaining what the first survey table shows about the nature of the association, if any, between adult and teenage cigarette use in a household, and (by comparing the two tables) discuss whether or not the extent of any association changed over the years.

Survey 1 (χ² = 4.61)               Adult cigarette use
Teenage cigarette use        Yes               No              Total
Yes                       198 (8.4%)       170 (10.4%)           368
No                      2,164 (91.6%)    1,468 (89.6%)         3,632
Total                   2,362 (100.0%)   1,638 (100.0%)        4,000

Survey 2                           Adult cigarette use
Teenage cigarette use        Yes               No              Total
Yes                        48 (8.1%)        45 (11.0%)            93
No                        544 (91.9%)      363 (89.0%)           907
Total                     592 (100.0%)     408 (100.0%)        1,000

8. You have been given the number of births in a country for each of the four seasons of the year and are asked whether births vary over the year. What would you need to know in order to carry out a chi-squared test of the hypothesis that births are spread uniformly between the four seasons? Outline the steps in your work.
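As a check on activities like these, the χ² statistic can be computed directly. Below is a minimal pure-Python sketch of the calculation, using the Activity 9.3 data (gender and party identification) from earlier in the chapter; a library routine such as scipy.stats.chi2_contingency performs an equivalent computation.

```python
# Chi-squared test of association, checked against Activity 9.3.
observed = [
    [108, 72, 198],   # Male:   Democrat, Independent, Republican
    [216, 59, 142],   # Female: Democrat, Independent, Republican
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequencies under H0 (no association): E_ij = row total x column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Test statistic: the sum of (O - E)^2 / E over every cell.
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1)

print(round(chi2, 2), df)   # 44.71 on 2 degrees of freedom, as in the solution
```

The same observed-versus-expected machinery handles the goodness-of-fit tests of Section 9.8; only the source of the expected frequencies changes.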
9.12 A reminder of your learning outcomes

After completing this chapter, and having completed the Essential reading and activities, you should be able to:

set up the null and alternative hypotheses appropriate for a contingency table

compute the degrees of freedom, expected frequencies and appropriate critical values of chi-squared for a contingency table

summarise the limitations of a chi-squared test

be able to extend from chi-squared to an appropriate test of proportions, if necessary

be able to work with a one-row or one-column contingency table as above.

9.13 Sample examination questions

Solutions can be found in Appendix B.

1. State whether the following statements are true or false. Give a brief explanation.

(a) A negative chi-squared value shows that there is little association between the variables tested.
(b) Similar observed and expected frequencies indicate strong evidence of an association in an r × c contingency table.

2. An experiment was conducted to examine whether age, in particular being over 30 or not, has any effect on preferences for a digital or an analogue watch. Specifically, 129 randomly-selected people were asked which watch they prefer and their responses are summarised in the table below:

                  30 years old or younger   Over 30 years old
Analogue watch              10                     31
Undecided                   17                     22
Digital watch               37                     12

(a) Based on the data in the table, and without conducting any significance test, would you say there is an association between age and watch preference? Provide a brief justification for your answer.
(b) Calculate the χ² statistic for the hypothesis of independence between age and watch preference, and test that hypothesis. What do you conclude?

3. Set out the null and alternative hypotheses, degrees of freedom, expected frequencies, and 10%, 5% and 1% critical values for the following problem. The following figures give live births by season in a town.
Season        Spring   Summer   Autumn   Winter
Live births     100      200      250      180

The numbers of days per season in this country are 93 (Spring), 80 (Summer), 100 (Autumn) and 92 (Winter). Is there any evidence that births vary over the year?

Hint: You would expect, if the births are uniformly distributed over the year, that the number of births would be proportional to the number of days per season. So work out your expected frequencies by taking the number of days per season divided by the number of days in the year and multiplying by the total number of births over the year.

Chapter 10 Sampling design

10.1 Aims

If you are using this subject guide in the order in which it is presented, you can now take a break from learning any more new statistical ideas for a while. Now is your chance to consolidate these ideas by looking at their applications in business and the social sciences. You should aim to consider the following types of question.

What are your choices when you design a survey?

How reliable are results of surveys reported in the media?

Are random samples the best we can do in every circumstance?

10.2 Learning outcomes

After completing this chapter, and having completed the Essential reading and activities, you should be able to:

define random, simple random and quota sampling and describe the implications of using them

explain the reasons for stratifying and clustering samples

describe the factors which contribute to errors in surveys, including:
• inaccurate and poorly-judged sampling frames
• sampling error
• non-sampling error (non-response, biased response, interviewer error)

discuss the various methods of contact which may be used in a survey and the related implications:
• interviewer
• postal
• other self-administered
• telephone.

10.3 Essential reading

Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and Economics (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060] Section 6.1 and Chapter 17.
In addition there is essential 'watching' of this chapter's accompanying video tutorials, accessible via the ST104a Statistics 1 area at http://my.londoninternational.ac.uk

10.4 Further reading

Office for National Statistics Social Trends (Basingstoke: Palgrave Macmillan, 2009) [ISBN 9780230220508].

10.5 Introduction

This chapter describes the main stages of a survey and the sources of error at each stage. Remember that a first treatment and background to this subject matter is given earlier in Chapter 3. It makes sense to review that chapter now before continuing. This part of the course is the foundation for your future work in applied social science, business and management. There is not much point in learning the various techniques we have introduced you to in the rest of the subject guide unless you understand the sources and limitations of the data you are using. This is important to academics and policymakers alike!

10.6 Motivation for sampling

In Chapter 3, we introduced the term target population as representing the collection of units (people, objects etc.) in which we are interested. In the absence of time and budget constraints we conduct a census, that is a total enumeration of the population. Examples include the population census in the UK and in other countries around the world. Its advantage is that there is no sampling error because all population units are observed, and so there is no estimation of population parameters – we can, in principle, determine the true values exactly.

Of course, in the real world, we do need to take into account time and budget constraints. Due to the large size, N, of most populations, an obvious disadvantage with a census is cost. So a census is not feasible in practice because there is only a limited amount of information which is cost-effective to collect. Also, non-sampling errors may occur.
For example, if we have to resort to using cheaper (hence less reliable) interviewers, then they may erroneously record data, misunderstand a respondent etc.

So we select a sample, that is a certain number of population members are selected and studied. The selected members are known as elementary sampling units. Sample surveys (hereafter 'surveys') are how new data are collected on a population and tend to be based on samples rather than a census. However, in practice, surveys are usually conducted under circumstances which cannot be fully controlled. Since the sample size, n, can be very large, the design and implementation of surveys requires considerable planning and teamwork.

Example 10.1 Examples of sample surveys include:

demography: study of births, deaths, families etc.

government: study of crime, education, employment, housing, health etc.

market research: study of consumer preferences, attitudes etc.

political science: study of voting intention.

Selected respondents may be contacted using a variety of methods such as face-to-face interviews, telephone, mail or email questionnaires. Sampling errors will occur (since not all population units are observed, so we resort to estimating population parameter values, which means there will be some uncertainty attached to our point estimates), but because of the smaller numbers involved (since typically n ≪ N)¹ resources can be used to ensure high-quality interviews or to check completed questionnaires. Therefore, non-sampling errors should be smaller and consequently researchers can ask more questions.

So a census is difficult to administer, time-consuming, expensive and does not guarantee completely accurate results due to non-sampling errors. In contrast, a sample is easier, faster and cheaper, although it introduces sampling errors while non-sampling errors can still occur. Unsurprisingly, we would like sampling (and non-sampling) errors to be minimised.

¹ The symbol ≪ means 'much less than'.
Before proceeding, it is worth spending a moment thinking about the classification of data into primary data and secondary data:

primary data are new data collected by researchers for a particular purpose

secondary data refer to existing data which have already been collected by others or for another purpose.

Of course, if secondary data exist, then there would be no need to collect new data! In ST104a Statistics 1, the focus will be on the collection of primary data using various sampling techniques.

Activity 10.1 Under what conditions would (a) a sample be preferable to a census, and (b) a census be preferable to a sample?

Solution

A sample is preferable to a census in the following cases:

• Cost constraints prohibit measuring all target respondents.
• The time available to conduct the study is relatively short.
• The target population size is large.
• The variance of the key characteristics of the study is small.
• The cost of sampling error is low, i.e. there are no 'significant' respondents who, if they were omitted, would significantly bias the results.
• The cost of non-sampling error is high. For example, if interviewers were to make an error in recording responses, the costs would be exacerbated in a census.
• The nature of measurement is destructive, i.e. damage or some change to those measured is incurred.
• Attention must be given to individual cases.

In a 'mirror image', a census is preferable to a sample in the following cases:

• Cost constraints are minimal, i.e. when there is a large budget.
• The time available to conduct the study is relatively long.
• The target population size is small.
• The variance of the key characteristics of the study is large.
• The cost of sampling error is high.
• The cost of non-sampling error is low.
• The nature of measurement is non-destructive.
• Attention need not be given to individual cases.
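The sampling error mentioned above, namely that different random samples from the same population give different estimates, can be illustrated with a short simulation. This is only a sketch with a made-up population; the population values, the sample size of 50 and the 200 repetitions are all arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical population of 10,000 values (e.g. incomes, in thousands).
population = [random.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)

# Draw 200 independent random samples of size n = 50 and record the
# sample mean of each: the estimates vary around the true mean.
sample_means = [statistics.mean(random.sample(population, 50))
                for _ in range(200)]

# The spread of these estimates is the sampling error; increasing the
# sample size n would shrink it.
print(f"true mean: {true_mean:.1f}")
print(f"range of estimates: {min(sample_means):.1f} to {max(sample_means):.1f}")
```

Running this with a larger n, say 500, narrows the range of estimates noticeably, which is one side of the cost-versus-precision trade-off discussed in this chapter.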
10.7 Types of sample

We seek a sample which is representative of the whole population to ensure our estimated parameter values are good approximations of the true parameter values. A particular difficulty is how to select a sample which yields a 'fair' representation. Fortunately, there are many different sampling techniques (reviewed below) from which to choose. As expected, these all have relative advantages and disadvantages, hence each will be suitable in particular situations. Knowledge of these techniques is frequently examined in this course, so it is strongly advised that you become very familiar with this material.

Of course, consideration also needs to be given to the choice of sample size, n. In practice we face a trade-off – n should be large enough to give a fair representation of the population, yet be sufficiently small to be practical and cost-effective. Section 6.11 considered this problem in the context of setting a tolerance on the maximum permitted sampling error. Qualitative factors to take into consideration when determining n are study objectives, budget and time constraints, type of data analysis planned and non-sampling errors.

We proceed by partitioning sampling techniques into two groups:

non-probability (non-random) sampling

probability (random) sampling.

10.7.1 Non-probability sampling

Non-probability samples are characterised by the following properties.

Some population units have no chance (i.e. a zero probability) of being selected.

Units which can be selected have an unknown (non-zero) probability of selection.

Sampling errors cannot be quantified.

Non-probability samples are used in the absence of a sampling frame.² We shall consider three types of non-probability sampling.

Convenience sampling – individuals are selected from the members of the population which happen to be in the vicinity. For example, those people in a shopping mall at the time of the survey, say a school day.
This technique is convenient, hence the name! Of course, such an approach is unlikely to yield a representative sample. (In the mall example, children and working professionals are unlikely to be in the mall at the time.)

Judgemental sampling – this uses expert opinion to judge which individuals to choose. An example would be a research firm selecting people to participate in a focus group³ by telephoning people and identifying who matches the target profile through a series of questions developed by an expert.

Quota sampling – this attempts to obtain a representative sample by specifying quotas on certain specified characteristics, such as age, gender, social class and any other variables relevant to the investigation being undertaken. The (approximate) distribution of such characteristics in the population is required in order to replicate it in the sample.

Of these, quota sampling is the most important type, which we will consider now in greater depth.

Activity 10.2 What is the difference between judgemental and convenience sampling? Give examples of where these techniques may be applied successfully.

² Sampling frames (lists) were discussed at length in Chapter 3.
³ Focus groups are used for qualitative research by asking a group of individuals about their opinions toward a new product or advertising campaign, consumer preferences etc.
The success of a judgemental sample lies in how knowledgeable the researcher is of the target population and how well they justify the selection of specific elements. Given these requirements, judgemental sampling is usually limited to relatively small target populations and qualitative studies. Convenience sampling could be used in the piloting of a questionnaire. For example, imagine a postal questionnaire which is to be administered to retired women. A group of retired women who may be easy to contact and willing to discuss problems with the questionnaire and are elements of the target population would make a good convenience sample. Quota sampling Quota sampling is useful in the absence of a sampling frame, since non-probability sampling techniques do not require a sampling frame. Another reason for conducting a quota sample, instead of a random sample, might be speed. We may be in a hurry and not want to spend time organising interviewers for a random sample – it is much quicker to set target numbers (quotas) to interview. As with any form of sampling, the objective is to obtain a sample which is representative of the population. Quota sampling is no exception and to do so the interviewer: seeks out units which satisfy some control characteristics (known as quota controls), such as age, gender and social class 10 requires the distribution of these characteristics in the population to be (approximately) known in order to replicate it in the sample. Quota sampling is cheap, but it may be systematically biased by the choice of interviewee made by the interviewer, and their willingness to reply. For instance, interviewers might avoid choosing anyone who looks threatening, or too busy, or too strange! Quota sampling also does not allow us to measure the sampling error – a consequence of it being a non-probability sampling technique. 
Each of the three techniques above (convenience, judgemental and quota) has an appeal; however, in all cases we have no real guarantee that we have achieved an adequately representative sample – we might do by chance, of course, but this would be highly unlikely! For example, were the people we interviewed only those working in local offices? Were the young adults all students?

Basically, since we do not know the probability that an individual will be selected for the survey, the basic rules of inference which we have been learning (confidence intervals and hypothesis tests) do not apply. Specifically, standard errors (a key ingredient in inferential procedures) are not measurable. However, in the absence of a sampling frame we will have to resort to non-probability sampling techniques. That said, non-random samples are also frequently used by market research organisations or companies when speed (if not accuracy) is important. They are rarely used by governments.

Example 10.2 You would likely use a quota sample (the main non-probability sampling technique considered in ST104a Statistics 1) in the following situations.

When speed is important. Clearly, an interviewer with a target to reach a certain number (quota) of people on a given day is likely to be quicker than one who is required to contact a specific person or household (as determined by a random sample). Typical quota controls for the interviewer to meet are:

• age group
• gender
• socio-economic group, or social class.

Note that the more controls the interviewer is given, the longer it will take to complete the required number of interviews (and hence the longer it will take to complete your study).

No available sampling frame covering the target population. If you think obtaining a sampling frame is likely to be very complicated, then a sensible targeting of the population by taking a quota sample might be helpful.
You might, for example, wish to contact drivers of coaches and buses over a set of routes. There are a lot of bus companies involved, and some of them will not let you have their list of employees for data protection reasons, say. One of the things you could do in these circumstances is to carry out a quota sample at different times of the day. There are often random alternatives, though, using sampling frames you may not have thought of. In the case above, you might be able to make a list of scheduled journeys on the routes you wish to study and take a random sample of routes, interviewing the relevant drivers as they complete their journeys.

When you need to reduce cost. Clearly, time-saving is an important element in cost-saving.

When accuracy is not important. You may not need to have an answer to your question to the high – and known – level of accuracy which is possible using a random sample; rather you merely require a rough idea about a subject. Perhaps you only need to know if people, on the whole, like your new flavour of ice cream in order to judge whether or not there is likely to be sufficient consumer demand to justify full-scale production and distribution. In this case, asking a representative group of people (a quota) would be perfectly adequate for your needs.

Although there may be several reasons to justify the use of a quota sample, you should be aware of the problem caused by the omission of non-respondents. Because you only count the individuals who reply (unlike random sampling, where your estimate has to allow for bias through non-response), the omission of non-respondents (non-response is discussed later in this chapter) can lead to serious errors, as your results would be misleading. For this reason, members of the British Market Research Association have now agreed to list non-response as it occurs in their quota samples, and this is regarded as good practice. One possible remedy to this problem is to introduce more detailed quota controls.
For example, we might ask for age group, gender, employment status, marital status and/or the age group of the respondent's children. However, this can take away a lot of the cost advantages of using a quota sample, rather than a random sample. Imagine the time you would take locating the last female for your sample aged 35–44, married with teenage children and a full-time job! There is the additional expense of paying interviewers more for a smaller number of interviews (on the basis of the time they spend on the job). If this is not done, the temptation to cheat, and hence make results completely invalid, will be strong.

Activity 10.3 What are the main potential disadvantages of quota sampling with respect to probability sampling?

Solution Quota sampling may yield biased estimates. The size of the bias may be hard to assess because the sampling is not probability based. Empirically it has been shown to give inferior results.

10.7.2 Probability sampling

Probability sampling means that every population unit has a known (not necessarily equal), non-zero probability of being selected in the sample. In all cases selection is performed through some form of randomisation. For example, a pseudo-random number generator can be used to generate a sequence of 'random' numbers. Relative to non-probability sampling methods, probability sampling can be expensive and time-consuming, and also requires a sampling frame. We aim to minimise both the (random) sampling error and the systematic sampling bias. Since the probability of selection is known, standard errors can be computed, which allows confidence intervals to be determined and hypothesis tests to be performed. We shall consider five types of probability sampling.

Simple random sampling (SRS) – a special case where each population unit has a known, equal and non-zero probability of selection.
Of the various probability samples, its simplicity is desirable and it produces unbiased estimates, although more accurate methods (i.e. those with smaller standard errors) exist.

Systematic random sampling – a 1-in-x systematic random sample is obtained by randomly selecting one of the first x units in the sampling frame and then selecting every subsequent xth unit. Though easy to implement, it is important to consider how the sampling frame is compiled. For example, the data may exhibit a periodic property, such as sales in a shop. All Monday sales are likely to be similar, as are all Saturday sales, but Saturdays are traditionally much busier shopping days. So sales from a 1-in-7 systematic sample will have a large variation between days, but a small variation within days. This would lead us to underestimate the true variation in sales. A 1-in-8 systematic sample would be better. Can you say why?5

Stratified random sampling – this sampling technique achieves a higher level of accuracy (smaller standard errors) by exploiting natural groupings within the population. Such groupings are called strata6 and these are characterised as having population units which are similar within strata, but different between strata. More formally, elements within a stratum are homogeneous while the strata are collectively heterogeneous. A simple random sample is taken from each stratum, hence ensuring a representative overall sample, since a broad cross-section of the population will have been selected – provided suitable strata were created. Hence great care needs to be taken when choosing the appropriate stratification factors, which should be relevant to the purpose of the sample survey, such as age group or gender. Imagine we were investigating student satisfaction levels at a university.
If we took a simple random sample of students we could, by chance, select just (or mainly) first-year students, whose opinions may very well differ from those in other year groups. So in this case a sensible stratification factor would be 'year of study'. By taking a simple random sample from each stratum you ensure that you will not end up with an extreme (non-representative) sample, and you also avoid the possibility of one particular group not being represented at all in the sample. Of course, in order to perform stratified random sampling, we would need to be able to allocate each population unit to a stratum. For students this should be straightforward, since the university's database of student records would no doubt include year of study too.

Cluster sampling – this is used to reduce costs (time and money). Here the population is divided into clusters, ideally such that each cluster is as variable as the overall population (i.e. heterogeneity within clusters and homogeneity between clusters). Next, some of the clusters are selected by SRS. This is where the cost savings are expected (in face-to-face interviews), since typically the clusters may be constructed on a geographical basis, allowing interviewers to restrict themselves to visiting households in certain areas rather than having to travel long distances to meet the respondents required by a simple random sample, who potentially cover a large area. A one-stage cluster sample would mean that every unit in the chosen clusters is surveyed. It may be that the cluster sizes are large, meaning that it is not feasible to survey every unit within a cluster. A two-stage cluster sample involves taking a simple random sample from each of the clusters which were selected by SRS during the first stage.

5 A 1-in-7 systematic sample would lead to only one day of the week being represented in the sample, such as Mondays. A 1-in-8 systematic sample would cover different days of the week. For example, if the first day chosen was a Monday, the next day selected would be 8 days later, i.e. the Tuesday of the following week, followed by a Wednesday etc.

6 'Strata' is the plural form, 'stratum' is the singular form.

When a subsample is taken from a selected cluster, this is known as a multistage design. Individual respondents will be identified in a random manner – but crucially within an area. You will reduce costs (the interviewer will be able to complete a higher number of interviews in a given time, using less petrol and reducing shoe leather costs), but will probably have to sacrifice a degree of accuracy, that is, the method is in general less efficient compared to, say, stratified random sampling. However, cluster sampling is very useful when a sampling frame is not immediately available for the entire population. For example, individual universities will have databases of their own students, but a national student database does not exist (admittedly it may be possible to create one, but it would take a lot of effort, time and money). Clustering is clearly useful in an interviewer-administered survey. It is less important as a design feature for telephone or postal interviews, unless you are particularly interested in the cluster itself. To the extent that individuals in a cluster are similar (i.e. exhibit intra-class correlation), they will be less representative of other clusters. In other words, the variance (and hence the standard error) of your sample estimate will be greater. A further reason for cluster sampling, in addition to cost savings, may arise from your need as a researcher to look at the clusters themselves for their own sake. An interest in income and educational levels for a group living in one area, or a study of the children at a particular school and their reaction to a new television programme, will require you to look at individuals in a cluster.
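The selection steps of the designs described above can be sketched in a few lines of Python. This is a minimal illustration only: the sampling frame of 100 units, the two strata and the five clusters are all invented for the example, standing in for, say, year groups and geographical areas.

```python
import random

random.seed(42)
frame = list(range(1, 101))  # invented sampling frame of 100 population units

# Simple random sampling: every unit has an equal chance of selection.
srs = random.sample(frame, 10)

# 1-in-10 systematic sampling: a random start among the first 10 units,
# then every subsequent 10th unit.
start = random.randrange(10)
systematic = frame[start::10]

# Stratified random sampling: an SRS taken from every stratum (two
# invented strata here, e.g. first-years vs. second-years).
strata = {"year1": frame[:50], "year2": frame[50:]}
stratified = [u for units in strata.values() for u in random.sample(units, 5)]

# Two-stage cluster sampling: an SRS of clusters (stage 1), then an SRS
# of units within each selected cluster (stage 2). Clusters are invented
# blocks of 20 units, standing in for geographical areas.
clusters = [frame[i:i + 20] for i in range(0, 100, 20)]
chosen = random.sample(clusters, 2)
two_stage = [u for c in chosen for u in random.sample(c, 5)]

print(len(srs), len(systematic), len(stratified), len(two_stage))
```

All four sketches return a sample of 10 units, but note how they spread differently across the frame: the stratified sample is guaranteed 5 units from each stratum, while the two-stage cluster sample is concentrated in just 2 of the 5 clusters.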
Multistage sampling – this refers to the case when sample selection occurs at two or more successive stages (the two-stage cluster sample above is an example). Multistage sampling is frequently used in large surveys. During the first stage, large compound units (primary units) are sampled. During the second stage, smaller units (secondary units) are sampled from the primary units. From here, additional sampling stages of this type may be performed, as required, until we finally sample the basic units. As we have already seen, this technique is often used in cluster sampling as we initially sample main clusters, then clusters within clusters etc. We can also use multistage sampling with a mixture of techniques. A large government survey will likely incorporate elements of both stratification and clustering at different stages. A typical multistage sample in the UK might involve the following:

• divide the areas of the country into strata by industrial region

• sample clusters (local areas) from each industrial region

• from each local area choose some areas for which you have lists (such as electoral registers) and take a simple random sample from the chosen lists (clusters).

Note that, from a technical perspective, stratified sampling can be thought of as an extreme form of two-stage cluster sampling where at the first stage all clusters in the population are selected. One-stage cluster sampling is at the opposite end of this spectrum. We can summarise as follows.

Similarities and differences between stratified and cluster sampling

Stratified sampling: all strata chosen, some units selected in each stratum.

One-stage cluster sampling: some clusters chosen, all units selected in each sampled cluster.

Two-stage cluster sampling: some clusters chosen, some units selected in each sampled cluster.

Example 10.3 You have been asked to make a sample survey of each of the following. Would you use quota sampling or random sampling?
Explain your choice.

i. Airline pilots, for their company, about their use of holiday entitlement in order to bring in a new work scheme.

In this case, as the survey is for the company (and hence there are no confidentiality issues), it is quite easy to use the company's list of personnel. A quota sample would not be very easy in these circumstances – you would have to send your interviewers to a venue where most pilots would be likely to meet, or you would risk a very unrepresentative sample. So, in this case, a random sample would be easy and efficient to use. You would be able to collect accurate information and use your statistical techniques on it. The subject matter, too, means that it is likely the pilots would take the survey more seriously if they were contacted through the company's list.

ii. Possible tourists, about their holiday destinations and the likely length of time and money they expect to spend on holiday in the next year, for a holiday company planning its holiday schedule and brochure for next year.

The situation for the tourist survey is different from that for airline pilots. There will be not just one, but several, lists of tourists from different holiday companies, and data confidentiality might well mean you could not buy lists which do not belong to your company. You might use the register of voters or a list of households, but then you would not necessarily target those thinking about holidays in the near future. So a random sample sounds like an expensive option if this is to be a general study for a tourist company assessing its future offers and illustrations for its holiday brochure. Here, a quota sample makes more sense: interviewers can quickly find the right respondents for the company's needs and get a general picture of holidaymakers' preferences.

iii. Household expenditure for the government's assessment of the effect of different types of taxes.

The government survey will require accuracy of information in an important policy area.
A random sample will have to be used, based on the national lists of addresses or voters.

Activity 10.4 The simplest probability-based sampling method is simple random sampling. Give two reasons why it may be desirable to use a sampling design which is more sophisticated than simple random sampling.

Solution Possible reasons include the following.

• We can estimate population characteristics with greater precision.

• It may be more cost effective.

• It may not be possible to perform simple random sampling.

Activity 10.5 A corporation wants to estimate the total number of worker-hours lost for a given month because of accidents among its employees. Each employee is classified into one of three categories – labourer, technician and administrator. Which sampling method do you think would be preferable here – simple random sampling, stratified sampling or cluster sampling? Give arguments to explain your choice.

Solution We can expect each of the three groups to be homogeneous with respect to accident rates – for example, labourers probably have the highest rates. Therefore, stratified sampling is preferable to simple random sampling. Cluster sampling does not make sense here, with only three homogeneous groups.

Activity 10.6 Describe how stratified random sampling is performed and explain how it differs from quota sampling.

Solution Description of stratified random sampling: the population is divided into strata (natural groupings within the population) and a simple random sample is taken from each stratum. Stratified random sampling differs from quota sampling in the following ways.

• Stratified random sampling is probability sampling, whereas quota sampling is non-probability sampling.

• In stratified random sampling a sampling frame is required, whereas in quota sampling pre-chosen frequencies in each category are sought.

Activity 10.7 Provide a similarity and a difference between one-stage cluster sampling and two-stage cluster sampling.
Solution Similarity: in both one-stage and two-stage cluster sampling, only some clusters are sampled (at random).

Difference: in one-stage cluster sampling every unit within each selected cluster is sampled, while in two-stage cluster sampling a random sample is taken from each selected cluster.

10.8 Types of error

We can distinguish between two types of error in sampling design7 as follows.

Sampling error – this occurs as a result of us selecting a sample, rather than performing a census (where a total enumeration of the population is undertaken). It is attributable to random variation due to the sampling scheme used. For probability sampling, we can estimate the statistical properties of the sampling error, i.e. we can compute (estimated) standard errors, which allow confidence intervals to be determined and hypothesis tests to be performed.

Non-sampling error – this occurs as a result of the (inevitable) failures of the sampling scheme. In practice it is very difficult to quantify this sort of error; it typically requires separate investigation. We distinguish between two sorts of non-sampling error.

• Selection bias – this may be due to (i.) the sampling frame not being equal to the target population, (ii.) the sampling frame not being strictly adhered to, or (iii.) non-response bias.

• Response bias – the actual measurements might be wrong due to, for example, ambiguous question wording, misunderstanding of a word in a questionnaire by less-educated people, or the sensitivity of the information which is sought. Interviewer bias is another aspect of this, where the interaction between the interviewer and interviewee influences the response given in some way, either intentionally or unintentionally, such as through leading questions, the dislike of a particular social group by the interviewer, the interviewer's manner or lack of training, or perhaps the loss of a batch of questionnaires from one local post office. These could all occur in an unplanned way and bias your survey badly.
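The key property of sampling error – that, unlike non-sampling error, its size is predictable for probability sampling – can be illustrated with a small simulation. The population below is invented so that we know its true characteristics; the spread of the means of repeated simple random samples should then be close to the theoretical standard error, the population standard deviation divided by the square root of the sample size.

```python
import random
import statistics

# Invented population of 10,000 values -- we only know its true mean and
# standard deviation because we constructed the population ourselves.
random.seed(0)
population = [random.gauss(50, 10) for _ in range(10_000)]
sigma = statistics.pstdev(population)

# Sampling error: every SRS of n = 100 yields a slightly different mean.
n = 100
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(500)]

# For probability sampling this variation is predictable: the standard
# deviation of the sample means should be close to sigma / sqrt(n).
observed_se = statistics.stdev(sample_means)
predicted_se = sigma / n ** 0.5
print(round(observed_se, 2), round(predicted_se, 2))
```

No comparable calculation exists for non-sampling error, which is precisely why selection bias and response bias have to be investigated separately.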
10.9 Pilot and post-enumeration surveys

Both kinds of error can be controlled or allowed for more effectively by a pilot survey. A pilot survey is used:

• to find the standard error which can be attached to different kinds of questions and hence to underpin the chosen sampling design

• to sort out non-sampling questions:

• do people understand the questionnaires?

• are our interviewers working well?

• are there particular organisational problems associated with this enquiry?

7 These should not be confused with Type I errors and Type II errors, which only concern hypothesis testing and are covered in Section 8.7.

Sometimes, particularly for government surveys, a post-enumeration survey is used to check the above. Here a subsample of those interviewed are re-interviewed to make sure that they have understood the questions and replied correctly. This is particularly useful when technical questions are being asked.

10.10 Non-response and response bias

Bias caused by non-response and response is worth a special entry. It can cause problems at every stage of a survey, both in random samples and quota samples, regardless of how the surveys are administered. The first problem can be in the sampling frame. Is an obvious group missing?

Example 10.4 The following are examples of coverage bias in the sampling frame.

If the sampling frame is of householders, those who have just moved in will be missing.

If the sampling frame is of those aged 18 or over, and the under-20s are careless about registration, then younger people will be missing from the sample.

In the field, non-response (data not provided by a sampling unit which we wish to sample) is one of the major problems of sample surveys, as the non-respondents, in general, cannot be treated like the rest of the population. As such, it is most important to try to get a picture of any shared characteristics of those refusing to answer or people who are not available at the time of the interview.
Classifications of non-response

We can classify non-response as follows.

Item non-response occurs when a sampled member fails to respond to a question in the questionnaire.

Unit non-response occurs when no information is collected from a sample member.

Non-response may be due to any of the following factors:

• not-at-home due to work commitments, or on holiday

• refusals due to the subject matter, or the sponsorship of the survey

• incapacity to respond due to illness, or language difficulties

• not found due to vacant houses, incorrect addresses, or having moved on

• lost schedules due to information being lost or destroyed after it had been collected.

How should we deal with non-response? Well, note that increasing the sample size will not solve the problem – the only outcome would be that we have more data on the types of individuals who are willing to respond! Instead, we might look at improving our survey procedures, such as data collection and interviewer training. Non-respondents could be followed up using call-backs, or an alternative contact method to the original survey, in an attempt to subsample the non-respondents. A proxy interview (where a unit from your sample is substituted with an available unit) may be another possibility. (Note that non-response also occurs in quota sampling but is not generally recorded – see the earlier discussion.) However, an obvious remedy is to provide an incentive (for example, cash or entry into a prize draw) to complete the survey – this exploits the notion that human behaviour can be influenced in response to the right incentives!

Response error is very problematic because it is not so easy to detect. A seemingly clear reply may be based on a misunderstanding of the question asked or a wish to deceive. A good example from the UK is the reply to the question about the consumption of alcohol in the Family Expenditure Survey.
Over the years, there has been up to a 50% understatement of alcohol consumption compared with the overall known figures for sales from HM Revenue & Customs!

Sources of response error

The main sources of response error include:

• the role of the interviewer, due to the characteristics and/or opinions of the interviewer, asking leading questions and the incorrect recording of responses

• the role of the respondent, who may lack knowledge, forget information or be reluctant to give the correct answer due to the sensitivity of the subject matter.

Control of response errors typically involves improving the recruitment, training and supervision of interviewers, re-interviewing, consistency checks and increasing the number of interviewers. In relation to all these problems, pilot work is very important. It may also be possible to carry out a check on the interviewers and methods used after the survey (post-enumeration surveys).

Activity 10.8 In the context of sampling, explain the difference between item non-response and unit non-response.

Solution

• Item non-response occurs when a sampled member fails to respond to a question in the questionnaire.

• Unit non-response occurs when no information is collected from a sample member.

Activity 10.9 Non-response in surveys is considered to be problematic.

(a) Give two possible reasons why non-response may occur.

(b) Why is non-response problematic for the person or organisation conducting the research? Give two reasons.

(c) How can non-response be reduced in telephone surveys and mail surveys, respectively?

Solution

(a) There may be people whom the data collection procedure does not reach, who refuse to respond, or who are unable to respond. As an example, in telephone surveys, people may not be at home or may not have time to do the survey. In a mail survey, the survey form may not reach the addressee.

(b) The non-respondents may be different from respondents, leading to biased parameter estimates.
Additionally, the sample size is effectively reduced, leading to a loss of precision in estimation.

(c) i. Numerous calls, concentrating on evenings and weekends. Interviewers with flexible schedules who can make appointments with interviewees at a time convenient for them. Cooperation can be enlisted by sending an advance letter, ensuring respondents feel their help is important and having well-trained interviewers.

ii. Clear layout of the survey form, easy response tasks, sending reminders and calling respondents by telephone.

Activity 10.10 Give an example of response bias. Is response bias a form of sampling error or a form of non-sampling error? Briefly explain why.

Solution Interviewer bias, where the interaction between the interviewer and interviewee influences the response given in some way, is an example of response bias. It is a form of non-sampling error. Non-sampling error is a result of the inevitable failures of the sampling scheme, whereas sampling error arises from random variation due to the sampling scheme used.

Activity 10.11 Briefly discuss the advantages and disadvantages of paying respondents for an interview.

Solution Advantages include a higher response rate and hence less non-response bias. Disadvantages are that it costs money which we may not have, or which could be used in other ways to improve the survey. Also, there are ethical considerations: we take respondents' time, so they should get something back!

Activity 10.12 A research group has designed a survey and finds the costs are greater than the available budget. Two possible methods of saving money are a sample size reduction or spending less on interviewers (for example, by providing less interviewer training or taking on less-experienced interviewers). Discuss the advantages and disadvantages of these two approaches.

Solution A smaller sample size leads to less precision in terms of estimation. Reducing interviewer training leads to response bias.
We can statistically assess the reduction in precision, but the response bias is much harder to quantify, making it hard to draw conclusions from the survey. Therefore, the group should probably opt for a smaller sample size, unless this would give results which are too imprecise for their purposes.

Activity 10.13 Discuss the role of the interviewer in a survey and the importance of training an interviewer.

Solution The role of the interviewer is very important: to gain cooperation, and to train and motivate respondents. Training interviewers leads to less interviewer bias, more uniformity in responses, and a higher response rate.

Activity 10.14 Readers of the magazine Popular Science were asked to phone in (on a premium rate number) their responses to the following question: 'Should the United States build more fossil fuel-generating plants or the new so-called safe nuclear generators to meet the energy crisis?'. Of the total call-ins, 86% chose the nuclear option. Discuss the way the poll was conducted, the question wording, and whether or not you think the results are a good estimate of the prevailing mood in the country.

Solution Both selection bias and response bias may have occurred. Due to the phone-in method, respondents may well be very different from the general population (selection bias). Response bias is possible due to the question wording – for example, 'so-called safe nuclear generators' calls safety into question. It is probably hard to say much about the prevailing mood in the country based on this survey.

10.11 Method of contact

A further point you should think about when assessing how to carry out a survey is the method of contact. The most common methods of contact are face-to-face interviews, telephone interviews and online/postal/mail (so-called 'self-completion') interviews. In most countries you can assume the following.

An interviewer-administered face-to-face questionnaire will be the most expensive to carry out.

Telephone surveys depend very much on whether your target population is on the telephone (and how good the telephone system is).

Self-completion questionnaires can have a low response rate.

We now explore some8 of the advantages and disadvantages of various contact methods.

Face-to-face interviews:

• Advantages – good for personal questions; allow for probing issues in greater depth; permit difficult concepts to be explained; can show samples (such as new product designs).

• Disadvantages – (very) expensive; not always easy to obtain detailed information on the spot.

Telephone interviews:

• Advantages – easy to achieve a large number of interviews; easy to check the quality of interviewers (through a central switchboard, perhaps).

• Disadvantages – not everyone has a telephone, so the sample can be biased; cannot usually show samples; although telephone directories exist for landline numbers, what about mobile telephone numbers? Also, young people are more likely to use mobile telephones rather than landline telephones, so are more likely to be excluded.

Self-completion interviews:

• Advantages – most people can be contacted this way (there will be little non-response due to not-at-home reasons); allow time for people to look up details such as income, tax returns etc.

• Disadvantages – high non-response rate, as it requires effort to complete the questionnaire; answers to some questions may influence answers to earlier questions, since the whole questionnaire is revealed to the respondent, which is important where the order of a questionnaire matters; you have no control over who answers the questionnaire.

Example 10.5 Examples of occasions when you might use a particular contact method are as follows.

Face-to-face interviewer – a survey of shopping patterns. Here you need to be able to contact a sample of the whole population.
You can assume that a large proportion would not bother to complete a postal questionnaire – after all, the subject matter is not very important and it takes time to fill in a form! Using a telephone would exclude those (for example, the poor and the elderly) who either do not have access to a telephone, or are unwilling to talk to strangers by telephone.

8 This is not necessarily an exhaustive list. Can you add any more?

Telephone interviewer – a survey of businessmen and their attitudes to a new item of office equipment. All of them will have a telephone, and the questions should be simple to ask. Here, booking time for a telephone interview at work (once it has been agreed with the management) should be much more effective than waiting for a form to be filled in, or sending interviewers to disrupt office routine.

Postal/mail questionnaire – a survey of teachers about their pay and conditions. Here, on-the-spot interviews will not elicit the level of detail needed. Most people do not remember their exact pay and taxation, particularly if the figures are needed for earlier years. We would expect a high response rate and good-quality data – the recipients are motivated to reply, since they may be hoping for a pay rise! Also, the recipients come from a group of people who find it relatively easy to fill in forms without needing the help, or prompting, of an interviewer.

Remember that it is always possible to combine methods. The Family Expenditure Survey in the UK, for example, combines the approach of using an interviewer three times over a fortnight (to raise response and explain details), while the respondent household is required to fill in a self-completion diary (showing expenditure, which could not be obtained by an interview alone). Similarly, telephone interviews may be combined with a mail-shot subsampled individual survey.
In the case of the telephone survey of business people described above, a description of the new equipment could be sent in advance, or as they are telephoned. Remember also that email surveys are already widespread and becoming increasingly popular, although they are only appropriate when the population to be studied uses email heavily and is likely to reply to your questions, such as employees in your office. An obvious advantage is that this method is very cheap to administer.

Activity 10.15 What criteria would you use in deciding which method of contact to use in a survey of individuals?

Solution Key criteria to use when deciding on the appropriate form of contact relate to the extent to which the respondent can interact with the interviewer. Examples of criteria are:

• flexibility of contact method

• diversity of questions which can be asked

• ability to use physical stimuli

• control of the data collection environment

• quantity of data required

• response rate

• perceived respondent anonymity

• obtaining sensitive information

• potential for interviewer bias

• potential to probe respondents

• speed

• cost.

Activity 10.16 Discuss the relative advantages and disadvantages of telephone interviewing and face-to-face interviewing.

Solution A number of advantages and disadvantages could be given, clearly presented.

Telephone interviewing advantages:

• Less expensive than personal interviews.

• Random digit dialling samples of the general population.

• Shorter data collection period than personal interviews.

• Interviewer administration (versus mail).

• Better control and supervision of interviewers (versus personal).

• Better response rate than mail for list samples.

Telephone interviewing disadvantages:

• Biased against households without telephones, or with unlisted numbers.

• Non-response.

• Questionnaire constraints.

• Difficult to administer questionnaires on sensitive or complex topics.
• Personal interviewing advantages:
  • Generally yields highest cooperation and lowest refusal rates.
  • Allows for longer, more complex interviews.
  • High response quality.
  • Takes advantage of interviewer presence.
  • Multi-method data collection.

• Personal interviewing disadvantages:
  • Most costly mode of administration.
  • Longer data collection period.
  • Interviewer concerns.

Activity 10.17 Discuss the feasibility of each of the types of survey contact methods (personal interview, postal survey, email and telephone survey) for a random sample of university students about their undergraduate experiences and attitudes at the end of the academic year.

Solution

• Personal interview:
  • Allows for probing details about experiences and attitudes in greater detail.
  • Likely to have a high response rate, although costly.

• Postal survey:
  • High non-response rate.
  • Effort to return survey form.

• Email:
  • Easiest method of contact for university students, who regularly check email.
  • Likely to suffer from non-response.

• Telephone survey:
  • Most students would use mobile phones rather than landlines, hence telephone numbers not available in directories.
  • Likely to suffer from non-response.

Activity 10.18 Retirement and Investment Services would like to conduct a survey on online users’ demands for additional internet retirement services. Outline your suggested sampling and contact method and explain how the results might be affected by your methodology.

Solution

Online sampling can be used, via web intercepts (pop-ups) or client lists (email addresses). It is cheaper and potentially very quick, but the users might tend to be young people unless a client list is used. Internet penetration is still not as high as landline telephone penetration. Respondents who are not online may have a different profile. There is no recognised sampling frame for the online population.
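The point that respondents who are not online may have a different profile can be illustrated with a small simulation. The population below is entirely invented, with younger people assumed more likely to be online (an assumption made purely for illustration, not an empirical claim); a perfectly executed simple random sample drawn from an online-only frame then still misses the offline, mostly older, part of the population.

```python
import random

random.seed(42)  # reproducible illustration

# Invented population of (age, is_online) pairs. The online
# probabilities are assumptions chosen to make the point visible.
population = []
for _ in range(100_000):
    age = random.randint(18, 90)
    p_online = 0.95 if age < 40 else 0.70 if age < 65 else 0.30
    population.append((age, random.random() < p_online))

# The sampling frame covers online people only
# (e.g. a web-intercept pop-up or an email client list).
frame = [age for age, online in population if online]

# A simple random sample, correctly drawn -- but from a biased frame.
sample = random.sample(frame, 1_000)

mean_pop = sum(age for age, _ in population) / len(population)
mean_sample = sum(sample) / len(sample)
print(f"population mean age:    {mean_pop:.1f}")
print(f"online-sample mean age: {mean_sample:.1f}")
```

The sampling mechanism here is genuinely random, yet the estimate is biased downwards: the selection bias lives in the frame, not in the sampling, which is exactly why the choice of sampling frame matters so much.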
10.12 Summary

This chapter has described the main stages of a survey and the sources of error at each stage. The various techniques in the rest of the subject guide are of little use unless you understand the sources and limitations of the data you are using. The contents of this chapter should have helped you to understand how the statistical techniques you have learned about so far can be used in practice.

Activity 10.19 How do probability sampling techniques differ from non-probability sampling techniques? Which factors should be considered when choosing between probability and non-probability sampling?

Solution

Probability techniques differ from non-probability techniques in the following ways.

• Probability techniques select sampling units by chance as opposed to the judgement of the researcher.
• The number of elements to be included in the sample can be pre-specified.
• It is possible to specify the probability of selecting any particular sample of a given size.
• Estimates are statistically projectable to the population.

The factors which should be considered are as follows.

• The nature of the research, i.e. is it exploratory or conclusive?
• Degree of error tolerance, i.e. the precision sought by decision-makers.
• Relative magnitude of sampling versus non-sampling errors, i.e. the need to eliminate selection bias.
• Variability in the population, i.e. how divergent are target respondents in the issues being measured?
• Statistical considerations, i.e. what forms of analysis will help create the meanings sought by the decision-maker?
• Operational considerations, i.e. are sampling frames available and accurate?

Activity 10.20 You work for a market research company and your manager has asked you to carry out a random sample survey for a mobile phone company to identify whether a recently launched mobile phone is attractive to people over 40 years old. Limited time and money resources are available at your disposal.
You are being asked to prepare a brief summary containing the items below.

i. Choose an appropriate probability sampling scheme. Provide a brief justification for your answer.

ii. Describe the sampling frame and the method of contact you will use. Briefly explain the reasons for your choices.

iii. Provide an example in which response bias may occur. State an action that you would take to address this issue.

iv. State the main research question of the survey. Identify the variables associated with this question.

Solution

i. Cluster sampling is appropriate here due to the cost issue. Multistage sampling is also an option. Although the question mentions limited time, discussion of quota sampling (for speed) would be incorrect as the question stresses a ‘probability sampling scheme’.

ii. The question requires:
• a description of a sampling frame
• a justification of its choice
• mentioning a (sensible) contact method
• stating an advantage of the contact method mentioned above.

Note that there is no unique answer. A suggested answer is given below. The sampling frame could be an email list from the records of the mobile phone company, which should be easy to obtain. Assuming that the new model is a smartphone, the method of contact can be via email. Alternatively, if the new model is not a smartphone, some basic questions can be sent by text. Contact by post would be too expensive compared with the other two methods.

iii. The question requires an example of response bias and an action suggested to address this issue. Those least comfortable with the phone are unlikely to reply, so the questionnaire should be designed appropriately. Also, busy people may not want to spend time on such market research. A reward of, say, ten free text messages could be offered as an incentive so that the cost remains low.

iv. A suggested answer is ‘Are attitudes to the new phone more favourable among younger owners?’.
In terms of variables one could mention ‘age’ and a ‘measure of favourableness toward the new phone’.

Activity 10.21 A company producing handheld electronic devices (tablets, mobile phones etc.) wants to understand how people of different ages rate its products. For this reason, the company’s management has decided to use a survey of its customers and has asked you to devise an appropriate random sampling scheme. Outline the key components of your sampling scheme.

Solution

Note that there is no unique answer. A possible set of ‘ingredients’ of an answer is given below.

• Propose stratified sampling since customers of all ages are to be surveyed.
• The sampling frame could be the company’s customer database.
• Take a simple random sample from each stratum.
• Stratification factors should include age group.
• Other stratification factors could be gender, country of residence etc.
• Contact method: mail, telephone or email (likely to have all details on the database).
• Minimise non-response through a suitable incentive, such as a discount off the next purchase.

Activity 10.22 You have been asked to design a nationwide survey in your country to find out about internet usage among children less than 10 years old. Provide a probability sampling scheme and a sampling frame that you would like to use. Identify a potential source of selection bias which may occur and discuss how this issue can be addressed.

Solution

Note that there is no unique answer.

Sampling frame: note that the target group is ‘children less than ten years old’, hence one might take the view that we only need to look at children who are at school or nursery school – say aged from 4 or 5 to 10. If this is the case, a sampling frame of schools and nurseries, with sampling from their lists, may be used. Another option is to use doctors’ lists (if possible).

Sampling scheme: a probability/random sampling scheme is required, which should be accompanied by a justification.
For example, if one goes with clustering (area of the country/type of school such as junior, infants, preschool etc.) or stratified sampling (stratification factors: gender, age group), why would these schemes be advantageous?

Source of selection bias: selection bias could arise from the omission of those who are not at school or preschool (in most countries, school is compulsory only from five or six years old) and those who are home schooled. Note that this should not be confused with response bias – for example, children responding differently if their teachers ask the questions!

Way to address it: reset the target population group to match what the sampling frame is actually providing.

10.13 Key terms and concepts

Census
Cluster sampling
Convenience sampling
Incentive
Judgement sampling
Method of contact
Multistage design
Non-probability sampling
Non-response
Non-sampling error
Pilot survey
Post-enumeration survey
Primary data
Probability sampling
Quota sampling
Representative
Response error
Sample
Sample survey
Sampling error
Sampling frame
Secondary data
Selection bias
Simple random sampling
Stratified random sampling
Systematic random sampling
Target population

10.14 Learning activities

Solutions can be found in Appendix A.

1. Think of at least three lists you could use in your country as a basis for sampling. Remember, each list must:
• be generally available
• be up-to-date
• provide a reasonable target group for the people you might wish to sample.

2. Find out about one of the government surveys carried out in your own country. This should help you understand the problems involved in designing a useful survey and help you with illustrations for your examination questions. (Remember that your understanding of the general points raised here should be illustrated by examples. The examiners are very happy if you give examples from your own country or area of interest.
They are not looking for points memorised from textbooks!)

3. Discuss the statistical problems you might expect to have in each of the following situations.
(a) Conducting a census of the population.
(b) Setting up a study of single-parent families.
(c) Establishing future demand for post-compulsory education.

4. (a) Define a quota sample.
(b) What are the main reasons for using a quota sample, and what are the alternatives?
(c) What are the main sources of error in a quota sample, and how would you deal with them?

5. Think of three quota controls you might use to make a quota sample of shoppers in order to ask them about the average amount of money they spend on shopping per week. Two controls should be easy for your interviewer to identify. How can you help them with the third control?

6. Your company has five sites and a mixture of managerial, clerical and factory workers. You want to know what kind of help they would like with their travel-to-work arrangements. One of the trade unions involved has asked for free parking for all its members, another has suggested subsidised rail season tickets. You, as a statistician, have been asked to design a sample survey of employees to help them decide what to do. You decide to make a random sample. There is no problem with the sampling frame and you have been asked to give statistically reliable advice. You decide to design a stratified sample.
(a) State the strata you will use.
(b) Explain how the strata will help you. (Do not be afraid to use two sets of strata in these circumstances!)

7. You have been asked to design a random sample in order to study the way school children learn in your country. Explain the clusters you might choose, and why.

8. Which method of contact might you use for your questionnaire in each of the following circumstances?
(a) A random sample of school children about their favourite lessons.
(b) A random sample of households about their expenditure on non-essential items.
(c) A quota sample of shoppers about shopping expenditure.
(d) A random sample of bank employees about how good their computing facilities are.
(e) A random sample of the general population about whether they liked yesterday’s television programmes.

9. (a) Outline the main stages of a random survey. Where do the main dangers of errors lie?
(b) Why might you carry out a quota survey rather than a random survey?
(c) ‘The designing of questionnaires and the training of interviewers is a waste of money.’ Discuss.
(d) When would you carry out a telephone survey rather than using a face-to-face interview?
(e) You have been asked to survey the interest of a population in a new type of technology. How might you stratify your sample? Explain.

10. You are carrying out a random sample survey of leisure patterns for a holiday company, and have to decide whether to use interviews at people’s homes and workplaces, postal (mail) questionnaires, or a telephone survey. Explain which method of contact you would use, and why.

10.15 A reminder of your learning outcomes

After completing this chapter, and having completed the Essential reading and activities, you should be able to:

• define random, simple random and quota sampling and describe the implications of using them
• explain the reasons for stratifying and clustering samples
• describe the factors which contribute to errors in surveys, including:
  • inaccurate and poorly-judged frames
  • sampling error
  • non-sampling error (non-response, biased response, interviewer error)
• discuss the various methods of contact which may be used in a survey and the related implications:
  • interviewer
  • postal
  • other self-administered
  • telephone.

10.16 Sample examination questions

Solutions can be found in Appendix B.

1. (a) State one advantage and one disadvantage of quota sampling.
(b) State one advantage and one disadvantage of telephone interviews.

2. You work for a market research company and your manager has asked you to carry out a random sample survey for a laptop company to identify whether a new laptop model is attractive to consumers. The main concern is to produce results of high accuracy. You are being asked to prepare a brief summary containing the items below.
(a) Choose an appropriate probability sampling scheme. Provide a brief justification for your answer.
(b) Describe the sampling frame and the method of contact you will use. Briefly explain the reasons for your choices.
(c) Provide an example in which selection bias may occur. State an action which you would take to address this issue.
(d) State the main research question of the survey. Identify the variables associated with this question.

3. You have been asked to design a stratified random sample survey from the employees of a certain large company to examine whether job satisfaction of employees varies between different job types.
(a) Discuss how you will choose your sampling frame. Also discuss the limitations of your choice.
(b) Propose two relevant stratification factors. Justify your answers.
(c) Provide two actions to reduce response bias and explain why you think they would be successful.
(d) Briefly discuss the statistical methodology you would use to analyse the collected data.

Chapter 11 Some ideas underlying causation – the use of control groups and time order

11.1 Aims

As with Chapter 10, this chapter does not introduce any new statistical calculations. However, it addresses the key ideas of causation in non-experimental science. It also prepares you for the final ideas – namely correlation and regression – you need to know in order to complete ST104a Statistics 1.
11.2 Learning outcomes

After completing this chapter, and having completed the Essential reading and activities, you should be able to:

• distinguish between an experiment in the natural sciences and the observations possible in social science, business or management studies
• determine sensible controls for a given experiment or trial
• set up a panel study
• explain the merits and limitations of a longitudinal/panel survey compared with a cross-sectional survey.

11.3 Essential reading

Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and Economics (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060] Section 13.2.

In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials accessible via the ST104a Statistics 1 area at http://my.londoninternational.ac.uk

11.4 Further reading

Shipman, M. The Limitations of Social Research (London: Longman, 1997) fourth edition [ISBN 9780582311039] Part 2.

11.5 Introduction

So far, we have looked at ways of collecting and describing data. Chapters 3 and 10 introduced you to the basic ideas of sampling from populations. The main ways of describing and presenting data were covered in Chapter 4. Chapters 6 to 8 dealt with the ways we can assess the relevance or significance of these figures, while Chapter 9 introduced the idea of association between categorical variables. Chapter 12 will complete the process, so far as ST104a Statistics 1 is concerned, by covering correlation and regression.

Before we look at the ideas of correlation and regression more formally, it is important to take stock of the limitations of assuming causation when working in social research. Anyone who studied science at school will be familiar with the idea of an experiment. Subjects are measured (observations are made), a treatment is administered, then further observations are made.
Providing that we can be sure that nothing else has happened between observations, apart from the treatment (scientists write ‘other things being equal’, or ceteris paribus in Latin), the assumption is made that the treatment has caused the change between the first and second set of observations. If we are dealing with such a controlled experiment, the meaning of our statistical measures and work is very clear. However, in social science, business and management fields, things are rarely that simple. We are generally faced with figures which show changes in variables, but the treatment given is not easy to assess.

Example 11.1 Consider determining the effectiveness of an advertising campaign for your new chocolate bars with improved wrappers. Your company measures the sales of the old chocolate bars, product A, in two areas, X and Y, before introducing the new chocolate bars, product B, and running a four-month advertising campaign. Imagine your marketing manager’s joy when sales of product B are much higher than for product A in area X. Clearly, the changes have worked! However, oh dear, product B is achieving a lower level of sales than product A in area Y. How can this be? Well, on closer investigation, we find that our main rival has withdrawn their product from area X and concentrated their sales focus on area Y (where they have realised that a larger number of their target population lives).

So your success with product B in area X is not necessarily related to your advertising campaign; rather, it may be entirely due to your rival’s actions. Clearly, whatever measures you use, there is a problem with the measurement of your results. ‘Other things’ are no longer equal, since the behaviour of the rival changed while you were conducting your experiment.

So it is important to control for confounding factors, that is, factors which are correlated with the observed variables (such as the rival’s actions in Example 11.1).
Failure to properly control for such factors may lead us to treat a false positive as a genuine causal relationship.

11.6 Observational studies and designed experiments

In medical, industrial and agricultural applications of statistics, it is possible to control the levels of the important factors which affect the results. Bias from the factors which cannot be controlled is dealt with by randomisation. These investigations are designed experiments.

In economic and other social science applications of statistics, one usually just observes a sample of the available population, without controlling for any of the factors which may influence the measures observed. Such studies are observational studies. Much of the methodology of statistics is devoted to disentangling the effects of different factors in such observational studies, but it is better to have a designed experiment, whenever possible.

Of course, the social sciences do not really lend themselves to designed experiments. For example, we do not have multiple versions of an economy to which we can apply different mixtures of fiscal and monetary policies. Instead, we have one economy to which we apply a particular mix and observe what happens. In ST104a Statistics 1, we will only look at correlation and regression as ways of studying the connections between variables, but analysis of variance (covered in ST104b Statistics 2) is also very relevant in this context.

Even if it is possible to experiment, there may be ethical objections. The ethical questions raised by medical trials are still a lively area of debate. There can be strong objections from patients (or their relatives) to their treatment by an established method (or a placebo) when a new treatment is available. After the trial is complete, if the new treatment is shown to be worse than the established one, there may be complaints from those who were given the new treatment.
Some of these difficulties can be resolved by the use of sequential methods, which try to discontinue a trial at the earliest opportunity when one treatment has been shown to give better results than others.

When we come to study regression in Chapter 12, it will be important to remember its limitations. It is easy to be too optimistic about its use. There is a big difference between a randomised experiment and an observational study, even if regression is used to allow for some of the distorting factors in the latter. The long-standing difficulties in proving that smoking causes lung cancer or, more recently, in demonstrating the positive effects of seat belts in cars, or of investigating the possible increase in leukaemia cases in children of workers at nuclear power plants, should be enough to show how the problems arise. It is difficult to allow for all the distorting factors; even to say whether a factor should be eliminated is hard.

So how do statisticians try to measure causal relationships in the social sciences? They use two main tools:

• the control group
• time order.

We now proceed to consider both of these.

Activity 11.1 Explain the difference between an experimental design and an observational study.

Solution

In an observational study, correlation is not the same as causation because of the possible presence of unobserved confounding variables. This problem can be avoided using an experimental design, where randomisation can be used. This creates groups of units which are similar on average. For example, if we perform a clinical trial to find out the effect of a treatment on a disease, and a certain gene is an unobserved confounder, then randomisation ensures the two groups have similar genetic makeup (on average; note that the groups need to be large enough to ensure this). Association between type of treatment and recovery then becomes the same as causation.
In an observational study, we have no control over the genetic makeup of the groups, so we cannot conclude causality if we find an association.

11.7 Use of the control group

We are often limited in the social sciences because we are unable to perform experiments either for ethical or practical reasons. Imagine, for example, that you need to assess the likely effect on tooth decay of adding fluoride to the water supply in a town. There is no possibility of being allowed to experiment, as you are afraid that fluoride might have harmful side effects. However, you know that fluoride occurs naturally in some communities. What can you do, as a statistician?

11.7.1 Observational study

In an observational study, data are collected on units (not necessarily people) without any intervention. Researchers do their best not to influence the observations in any way. A sample survey is a good example of such a study, where data are collected in the form of questionnaire responses. As discussed in Chapter 10, every effort is made to ensure response bias is minimised (if not completely eliminated).

Example 11.2 To assess the likely effect on tooth decay of adding fluoride to the water supply, you can look at the data for your non-fluoridated water population and compare it with one of the communities with naturally-occurring fluoride in their water and measure tooth decay in both populations – but be careful! A lot of other things may be different. Are the following the same for both communities?

• Number of dentists per person.
• Number of sweet shops per person.
• Eating habits.
• Age distribution.

Think of other relevant attributes which may differ between the two. If you can match in this way (i.e. find two communities which share the same characteristics and only differ in the fluoride concentration of their water supply), your results may have some significance.
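The matching idea in Example 11.2 can be sketched in code. The communities and attributes below are entirely invented, and the distance function is a deliberately crude similarity score; real matching studies would choose and weight the covariates much more carefully.

```python
# A sketch of matching: pair each fluoridated community with the
# non-fluoridated community most similar on the other attributes,
# so that fluoride is the only systematic difference left.
# All data are invented for illustration.

communities = [
    # (name, fluoridated, dentists per 1,000, sweet shops per 1,000, median age)
    ("A", True,  0.5, 1.2, 38),
    ("B", False, 0.5, 1.1, 37),
    ("C", True,  0.9, 0.4, 52),
    ("D", False, 0.2, 2.0, 29),
]

def distance(x, y):
    """Crude dissimilarity score on the non-fluoride attributes."""
    return abs(x[2] - y[2]) + abs(x[3] - y[3]) + abs(x[4] - y[4]) / 10

fluoridated = [c for c in communities if c[1]]
controls = [c for c in communities if not c[1]]

for f in fluoridated:
    best = min(controls, key=lambda c: distance(f, c))
    print(f"{f[0]} is best matched with {best[0]} "
          f"(distance {distance(f, best):.2f})")
```

Even a good match only controls for the attributes we thought to measure; unmeasured differences between the communities can still confound the comparison, which is why matching is weaker than randomisation.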
So, to credibly establish a causal link in an observational study, all other relevant factors need to be adequately controlled for, such that any change between observation periods can be explained by only one variable.

11.7.2 Experimental study

In an experiment, an intervention or treatment is administered to some or all of the experimental units (usually people). Allocation of the treatment (or perhaps a combination of treatments) is determined by using a form of randomisation.1 Some time after the treatments are given, the outcomes are recorded by the researchers. Data analysis involves comparing outcomes across all treatments.

Occasionally it is possible and permissible to carry out experiments in business situations. There are still difficulties caused by the sheer number of variables which will need to be taken into consideration, but at least it is possible to distinguish those who had the treatment (the experimental group) from those who did not (the control group).

In order to avoid bias in assessing the effects of the treatments, blinding and double blinding are recommended. In blind experimentation the experimental units are unaware of whether they receive a treatment or a placebo, while in a double-blind set-up the people administering the treatment and placebo also do not know which is which.

The sample size in each treatment group must be large enough to ensure that medically (or socially) important differences can be detected. Sample size determination (as previously discussed for observational studies in Section 7.11) to ensure adequate power is a routine part of experimental design. On the whole, such methods are associated with medical statistics more often than with other applied areas, but they are also used in marketing and test marketing when possible.

Activity 11.2 What is randomisation in the context of experimental design?

Solution

Experimental units are randomly allocated to treatment groups.
In this way, confounders are (probabilistically) controlled for, and causal conclusions can be drawn from an experiment (with high probability). In observational studies, we can never be sure if all confounders have been controlled, and causal conclusions are harder to draw.

Activity 11.3 Explain what is meant by each of the following and why they are considered desirable in an experiment:
(a) placebo
(b) double blinding.

1 For example, treatment A is administered if a fair coin toss comes up ‘heads’, otherwise treatment B is administered.

Solution

(a) Many statistical studies involve testing the effectiveness of drugs. A placebo looks identical to the actual drug but contains no active ingredient and so has no real physical effect. Any patient wants to be helped by the medication administered to them. If the patient thinks s/he is receiving a drug to remedy their condition, they tend to improve even if it turns out that the drug is a placebo. This is known as the placebo effect. Experimental design should control for the placebo effect.

(b) Double blinding means that neither the subjects receiving the treatment, nor the researchers know to which group, treatment or control, subjects have been assigned. This avoids bias in assessing the effects of the treatments. If a researcher knows that a subject is in the control group, they do not expect a treatment effect, and their measurement of a response might be understated. If a researcher knows that a subject is in the treatment group, they might overstate a response simply because they expect it.

Activity 11.4 Explain the following terms:
(a) blind trials
(b) control and treatment groups
(c) measuring causation.

Solution

(a) Blind trials are a feature of experiments. In a blind experiment, the subjects do not know if they are getting the treatment or the placebo.
In a double-blind experiment, neither the person handing out the treatments nor the person evaluating the results knows which subjects are in which group. This is to neutralise psychological effects.

(b) A control group (which either gets no treatment or gets the old, established treatment) must be used for comparison. The treatment group, of course, receives the treatment. For example, if you are testing a treatment for a sore throat, you must have a group which gets no treatment because sore throats naturally get better over time. You need to show that the treatment group gets better faster than the control group.

(c) Ideally you want the control and treatment groups to be as equal as possible, so that the only difference between the groups is the treatment itself. At the end of an experiment, if you then notice a difference between the two groups – for example, in terms of the recovery time from sore throats – then this can be attributed to the treatment, and hence we would have shown a causal link in that the treatment caused a change in recovery time.

11.8 Time order

Another way we might attempt to disentangle causal relationships is to look at the order in which events occurred. Clearly, if we eat more, we gain weight, other things being equal – for example, if we do not exercise more to offset the extra consumption of calories! This underpins work on time series data which you will meet if you study EC2020 Elements of econometrics. For now, you should know a little about longitudinal surveys and panel surveys, where the same individuals are resurveyed over time.

11.8.1 Longitudinal surveys

Policymakers use longitudinal surveys over a long period of time to look at the development of childhood diseases, educational development and unemployment. There are many long-term studies in these areas. Some longitudinal medical studies of rare diseases have been carried out at an international level over long periods.
One such, very well-known, study which is readily available is the UK National Child Development Survey. This began with a sample of about 5,000 children born in 1958. It is still going on! It was initially set up to look at the connections between childhood health and development and nutrition by social groups. The figures produced in the first few years were so useful that it was extended to study educational development and work experience. There are several books which describe the survey at its different stages.

You should note the advantages and disadvantages of using such methods.

The advantages are that you:

• can actually measure individual change (not just averages)
• do not depend on people’s memories about what they did four years ago, say.

The disadvantages are:

• on the one hand, drop-out – if the subject material is trivial, people may not agree to be continually resurveyed
• on the other hand, conditioning – if the researcher manages to persuade participants to continue to cooperate, they may have altered their perceptions (the participants may have become too involved in the study).

Despite the disadvantages, such studies are widely regarded as being the best way of studying change over time.

Activity 11.5 Compare and contrast cross-sectional and longitudinal designs.

Solution

A cross-sectional design involves the collection of data from a sample at a single point in time. For example, if a researcher observes the buying behaviour of shoppers selected in a street interview scenario and then analyses the data gathered, then such a study will be called a cross-sectional study. On the other hand, a longitudinal design involves collecting data periodically from a fixed sample of respondents. Therefore, a longitudinal study provides a series of ‘snapshots’, which, when viewed together, portray a vivid illustration of the situation and the changes which are taking place.
Compared to a cross-sectional design, a longitudinal design is more effective in detecting change, can be more accurate due to the rapport which can be built up with respondents, and provides a larger amount of data. A cross-sectional design can have advantages in creating a ‘one-off’ sample which is more representative of a target population, compared to a longitudinal sample where respondents may drop out and particular types of respondent become under-represented. The disadvantage of building up a rapport with respondents in a longitudinal design is the bias which may be introduced through over-familiarity with the study and the interviewers/researchers working on it.

11.8.2 Panel surveys

In business and management research, we generally use slightly less ambitious surveys called panels. They also involve contacting the same individuals over a period, but are generally different from longitudinal surveys in the following ways:

• they are more likely to be chosen by quota rather than random methods
• individuals are interviewed every two to four weeks (rather than every few years)
• individuals are unlikely to be panel members for longer than two years at a time.

We can use results from such surveys to look at brand loyalty and brand switching. They are particularly useful for assessing the effectiveness of advertising. For now, make sure you understand how a longitudinal or panel study is set up.

Activity 11.6 Discuss the advantages and disadvantages of panels.

Solution

A panel consists of a sample of respondents, generally households, who have agreed to provide information at specified intervals over an extended period. Panels are maintained by syndicated firms, and panel members are compensated for their participation with gifts, coupons, information or cash.

Advantages:

• Over a period of time, panel data can reveal changes in market share and/or changes in consumer attitudes and behaviour.
• Relatively large amounts of data can be gathered.
• The researcher can collect more detailed data regarding the primary variables of interest.
• The data collected are more accurate because of the support given to respondents and the development of a relationship between panel members and the sponsoring organisation.

Disadvantages:

• Panels may not necessarily be any more representative than a single cross-sectional survey. This may arise due to respondents refusing to cooperate, ‘mortality’ (i.e. respondents dropping out of the panel due to lack of interest), and the nature of payments or tangible rewards, which may attract a certain type of respondent.
• New panel members are often biased in their initial responses, and seasoned panel members may provide biased responses through being tired or bored with the process.

11.9 Case study: Smoking and lung cancer

In order to clarify the issues raised in this chapter, we will consider a case study. It describes how control groups were used in order to see what was connected with lung cancer. You might be interested to know that initially the researchers expected that pollution in the environment would show the strongest link.

They compared patients in hospital diagnosed as having lung cancer with a control group of patients who did not have lung cancer. (The controls were likely to have had accidents or non-chest diseases such as appendicitis or, in the case of women, to be pregnant.) Each lung cancer patient was matched with a control by:

• age
• sex
• occupation
• home area
• being in hospital at roughly the same time.

Using the measures which you will meet in Chapter 12, it was found that the main difference between the two groups was in smoking behaviour – those with lung cancer were more likely to be heavy smokers than the other group. Although this study was carried out some time ago, similar studies have confirmed the results. However, what did the study tell us? Consider again Section 11.5. Descriptive statistics alone cannot distinguish between the following ideas.
• Smoking causes lung cancer: smoking ⇒ lung cancer.
• Smoking is a symptom of developing lung cancer: lung cancer ⇒ smoking.
• A personality factor leads to both smoking and lung cancer: personality factor ⇒ smoking + lung cancer.

It is important that you understand this. Although animal experimentation (ethics aside) may help to resolve the conundrum to some extent, we are really at a stalemate without an experiment on the people in whom we are interested! At this point, the original researchers carried out a longitudinal study of 40,000 doctors in the UK. Their smoking histories were collected and subsequent death rates from lung cancer checked. The results confirmed the initial findings. However, although these results make the causal connection ‘heavy smoking ⇒ lung cancer’ much more likely, it is still possible that the heavy smokers may have some other characteristic (for example, genetic) which leads them to contract lung cancer. Ethically it is not possible to choose individuals to participate in an experiment and ask them to smoke heavily, so a pure experiment could not be made.

11.10 Summary

This chapter’s focus on causation gives you an opportunity to think of the implications of the work you have covered in the earlier chapters and prepares the way for the material in Chapter 12.

11.11 Key terms and concepts

Blinding                 Causation
Conditioning             Confounding factor
Control group            Drop out
Experiment               Intervention
Longitudinal survey      Observational study
Outcome                  Panel survey
Randomisation            Time order
Treatment

11.12 Learning activities

Solutions can be found in Appendix A.

1. Your government is assessing whether it should change the speed limit on its motorways or main roads. Several countries in your immediate area have lowered their limit recently by 10 miles per hour.
What control factors might you use in order to examine the likely effect on road accidents of a change in your country?

2. (a) List the advantages and disadvantages of making a panel study.
(b) Work out how you might set up a panel of children in order to see whether they like the television programmes your company is making for under-10s in the after-school/before-homework time slot.

11.13 A reminder of your learning outcomes

After completing this chapter, and having completed the Essential reading and activities, you should be able to:

• distinguish between an experiment in the natural sciences and the observations possible in social science, business or management studies
• determine sensible controls for a given experiment or trial
• set up a panel study
• explain the merits and limitations of a longitudinal/panel survey compared with a cross-sectional survey.

11.14 Sample examination questions

Solutions can be found in Appendix B.

1. What are the strengths and weaknesses of a longitudinal survey? Describe how you would design such a survey if you were aiming to study the changes in people’s use of health services over a 20-year period. State the target group, survey design, and frequency of contact. You should give examples of a few of the questions you might ask.

2. Explain how panel surveys are different from longitudinal surveys.

Chapter 12 Correlation and linear regression

12.1 Aims

This final chapter of the subject guide takes us back to the formula for a straight line, which you worked on in Chapter 2. Here, we are going to use basic mathematics to understand the relationship between two measurable variables.
If you are taking, or likely to take, ST104b Statistics 2, and later EC2020 Elements of econometrics, you should ensure that you cover all the examples here very carefully, as they prepare you for work on those courses. If you do not intend to take your statistics any further, you need, at the very least as a social scientist, to be able to assess whether relationships shown by data are what they appear! Be careful not to confuse the techniques used here – on measurable variables – with those you worked with in Chapter 9 on categorical variables!

12.2 Learning outcomes

After completing this chapter, and having completed the Essential reading and activities, you should be able to:

• draw and label a scatter diagram
• calculate r
• explain the meaning of a particular value and the general limitations of r
• calculate a and b for the line of best fit in a scatter diagram
• explain the relationship between b and r
• summarise the problems caused by extrapolation.

12.3 Essential reading

Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060] Sections 1.6, 2.4, 11.1–11.3 and 14.6.

In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials accessible via the ST104a Statistics 1 area at http://my.londoninternational.ac.uk

12.4 Further reading

Aczel, A.D. Complete Business Statistics (London: McGraw–Hill Higher Education, 2009) seventh edition [ISBN 9780071287531] Chapters 10 and 11.

Anderson, D.R., D.J. Sweeney, T.A. Williams, J. Freeman and E. Shoesmith Statistics for Business and Economics (South–Western Cengage Learning, 2010) eleventh edition [ISBN 9780324783247] Chapter 14 and Section 15.1.

Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN 9780471615170] Chapters 12–14.
12.5 Introduction

In Chapter 9, you were introduced to the idea of testing for association between different attributes of a variable using the chi-squared distribution. We did this by looking at the number of individuals falling into a category, or experiencing a particular contingency. Correlation and regression enable us to see the connection between the actual dimensions of two or more measurable variables. The work we will do in this chapter will only involve looking at two variables at a time, but you should be aware that statisticians use these theories and similar formulae to look at the relationship between many variables – so-called multivariate analysis. When we use these terms we are concerned with using models for prediction and decision making. So, how do we model the relationship between two variables? We are going to look at:

• correlation – which measures the strength of a linear relationship
• regression – which is a way of representing that linear relationship.

It is important you understand what these two terms have in common, but also the differences between them.

Activity 12.1 Explain the main types of variables which statistical methods are used to study. What are the differences between ‘association’ and ‘correlation’ as statistical terms?

Solution

There are two main types of variable: measurable variables (such as distances, weights, temperatures, frequencies and sums of money), and categorical variables, which cannot be measured. Categorical variables can be either ordinal (when the values can be ranked, as, for example, a scale which measures dissatisfaction versus satisfaction) or nominal, for which no ranking is possible (such as nationalities). Association represents the relationship between categorical variables (nominal or ordinal scales) whereas correlation represents (and assesses numerically the strength of) the linear relationship between measurable variables.
12.6 Scatter diagrams

Let us assume that we have some data in paired form: (xi, yi), for i = 1, 2, . . . , n.

Example 12.1 An example of paired data is the following, which represents the number of people unemployed and the corresponding monthly reported crime figures for twelve areas of a city.

Unemployed, x:          2,614  1,160  1,055  1,199  2,157  2,305
Number of offences, y:  6,200  4,610  5,336  5,411  5,808  6,004

Unemployed, x:          1,687  1,287  1,869  2,283  1,162  1,201
Number of offences, y:  5,420  5,588  5,719  6,336  5,103  5,268

This dataset will be used repeatedly throughout this chapter.

When dealing with paired data, the first action is to construct a scatterplot (or scatter diagram) of the data, and visually inspect it for any apparent relationship between the two variables.1 Figure 12.1 shows such a scatterplot for these data.

Figure 12.1 gives an impression of a positive, linear relationship, i.e. it can be seen that x (unemployment) and y (number of offences) increase together, roughly in a straight line, but subject to a certain amount of scatter. So, the relationship between x and y is not exactly linear – the points do not lie exactly on a straight line. Data showing a general ‘upward shape’ like this are said to be positively correlated, and we shall see how to quantify this correlation.

Other possible scatter patterns are shown in Figure 12.2. The left-hand plot shows data which have a negative correlation, i.e. y decreases as x increases, and vice versa. The right-hand plot shows uncorrelated data, i.e. no clear relationship between x and y. Note that correlation assesses the strength of the linear relationship between two variables. Hence uncorrelated data, in general, just means an absence of linearity. It is perfectly possible that uncorrelated data are related, just not linearly – for example, x and y may exhibit a quadratic relationship (considered later).
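Before any plotting or calculation, it helps to have the Example 12.1 data entered as paired values. The sketch below (plain Python; code is not part of this course) computes the sample means and the ‘corrected sums’ Sxx, Syy and Sxy which are defined later in this chapter, and its results agree with the values quoted in Example 12.4.

```python
# Paired data from Example 12.1: unemployment (x) and offences (y) for 12 city areas.
x = [2614, 1160, 1055, 1199, 2157, 2305, 1687, 1287, 1869, 2283, 1162, 1201]
y = [6200, 4610, 5336, 5411, 5808, 6004, 5420, 5588, 5719, 6336, 5103, 5268]

n = len(x)
x_bar = sum(x) / n   # sample mean of x (approx. 1665)
y_bar = sum(y) / n   # sample mean of y (approx. 5567)

# Corrected sums of squares and cross-products (defined in Section 12.8).
Sxx = sum(xi**2 for xi in x) - n * x_bar**2
Syy = sum(yi**2 for yi in y) - n * y_bar**2
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar

print(round(x_bar), round(y_bar))          # 1665 5567
print(round(Sxx), round(Syy), round(Sxy))  # 3431759 2584497 2563066
```

These three quantities are all that is needed for the correlation and regression formulae in this chapter. A scatterplot like Figure 12.1 could then be drawn with, for example, matplotlib's `plt.scatter(x, y)`.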
1 Notice that, in contrast to the categorical variables considered in Chapter 9, for the remainder of this course we will deal with measurable variables only.

Figure 12.1: Scatterplot of ‘Number of offences’ against ‘Unemployment’.

Example 12.2 Below is a list of variables, along with their expected correlation.

Variables                                    Expected correlation
Height and weight                            Positive
Rainfall and sunshine hours                  Negative
Ice cream sales and sun cream sales          Positive
Hours of study and examination mark          Positive
Car’s petrol consumption and goals scored    Zero

12.7 Causal and non-causal relationships

When two variables are correlated, an interesting question which arises is whether the correlation indicates a causal relationship. In Example 12.2, it is natural to assume that hours of study and examination marks are positively correlated, due to more study resulting in better examination performance, i.e. a higher mark. In this case, the relationship is causal.

Figure 12.2: Scatterplots – negatively correlated variables (left) and uncorrelated variables (right).

For the ice cream and sun cream sales, however, the relationship is not causal. It is not the selling of ice cream which causes sun cream sales to rise, nor vice versa. Rather, both sales respond to warm weather and so these sales are seasonal, with both rising and falling together in response to other variables such as temperature and sunshine hours. It should be clear from this that care needs to be taken in interpreting correlated relationships. Remember correlation does not imply causality!

Example 12.3 Let us consider a few more examples. In each case we observe strong correlations.

i.
‘Average salary of school teachers’ and ‘Consumption of alcohol measured by country’.

ii. ‘Stork population in Bavaria’ and ‘Human birth rate’.

iii. ‘Size of student population’ and ‘Number of juvenile offenders by local area in a country’.

Would you seriously think there was a causal connection in these cases? Let us look at them in a little more detail.

i. It would be frivolous to deduce that respect for teachers causes alcoholism! It is much more likely that buying and consuming alcohol and high salaries are both signs of a flourishing economy.

ii. This is a bit more difficult – can the fairy stories be true? Again, there is probably a further variable at play here – are young families living in areas of natural beauty (to raise their children in pleasant surroundings) which encourage storks?

iii. The more young people there are, the more juvenile offenders, scholarship winners and students there are likely to be. Connecting these two figures is pretty meaningless.

12.8 Correlation coefficient

The strength of a linear relationship between two random variables is given by the correlation coefficient. For random variables X and Y, the population correlation coefficient, denoted ρ, is defined as:

ρ = E((X − E(X))(Y − E(Y))) / √(Var(X) Var(Y)) = (E(XY) − E(X) E(Y)) / √(Var(X) Var(Y))

and some textbooks may refer to this as the ‘product-moment correlation coefficient’. Technically, ρ can only be determined if we have perfect knowledge of the ‘bivariate density function’ of X and Y.2 In practice, we will wish to estimate ρ using the sample correlation coefficient, from a set of random sample observations of X and Y, i.e. using sample paired data (xi, yi), for i = 1, 2, . . . , n.
Sample correlation coefficient

The sample correlation coefficient, denoted r, is:

r = Sxy / √(Sxx Syy) = (Σ xi yi − n x̄ ȳ) / √((Σ xi² − n x̄²)(Σ yi² − n ȳ²))     (12.1)

where all sums run over i = 1, . . . , n, and:

• the corrected sum of squares of the x-data is Sxx = Σ xi² − n x̄²
• the corrected sum of squares of the y-data is Syy = Σ yi² − n ȳ²
• the corrected sum of cross-products is Sxy = Σ xi yi − n x̄ ȳ.

It is quite common in examination questions to be given certain summary statistics (for example, Σ xi, Σ xi², Σ yi, Σ yi² and Σ xi yi) to save you time from computing all such quantities directly using the raw data. Hence it may be easier for you to remember the expressions for Sxx, Syy and Sxy (the ‘corrected sums of squares’ for x and y, and the corresponding cross-products, respectively), and how they combine to calculate r.

2 A bivariate density function shows the dependence of two variables (here X and Y) and is covered in ST104b Statistics 2.

Example 12.4 For the dataset in Example 12.1, we have n = 12, x̄ = 1665 and ȳ = 5567, and so:

Sxx = 3431759, Syy = 2584497 and Sxy = 2563066.

Using (12.1), we have:

r = 2563066 / √(3431759 × 2584497) = 0.8606.

This is a strong, positive correlation.3 We also note that the value of r agrees with the scatterplot shown in Figure 12.1. The sample correlation coefficient measures how closely the points in the scatterplot lie around a straight line, and the sign of r tells us the direction of this line, i.e. upward-sloping or downward-sloping, for positive and negative r, respectively. It does not tell us the gradient of this line – this is what we will determine in regression.

Properties of the correlation coefficient

Both types of correlation coefficient (ρ and r) have the following properties. They:

• are independent of the scale of measurement
• are independent of the origin of measurement
• are symmetric; that is, the correlation of X and Y is the same as the correlation of Y and X
• can only take values between ±1, i.e.
−1 ≤ ρ ≤ 1 and −1 ≤ r ≤ 1, or alternatively |ρ| ≤ 1 and |r| ≤ 1, i.e. correlation coefficients always have an absolute value less than or equal to 1.

Having defined the correlation coefficient, it is important to remember the following when interpreting r (or ρ).

• An r (ρ) near the top of this range (i.e. near 1) indicates a strong, positive linear relationship between X and Y.
• An r (ρ) near the bottom of this range (i.e. near −1) indicates a strong, negative linear relationship between X and Y.
• Zero correlation indicates that X and Y are not linearly related, i.e. the variables are uncorrelated.

3 If you were following ST104b Statistics 2 and were to test the statistical significance of this r, using the kind of techniques and ideas in Chapter 8, you would find that the result is highly significant at the 1% significance level. Note that you are not expected to test for the significance of r in ST104a Statistics 1.

Example 12.5 To reiterate, correlation assesses the strength of linear relationships between variables only. ρ = 0 does not imply that X and Y are independent, since the variables could have a non-linear relationship. For example, if:

y = x(1 − x) for 0 ≤ x ≤ 1

then the correlation is zero (as the variables are not linearly related), but, clearly, there is a well-defined relationship between the two variables, so they are certainly not independent. Figure 12.3 demonstrates this point for simulated sample data, where we see a clear relationship between x and y, but it is clearly not a linear relationship.4 Data of this kind would have a sample correlation near zero (here, r = 0.148), which emphasises the linear nature of r.

Figure 12.3: Scatterplot showing a non-linear relationship.

Activity 12.2 Define the term ‘sample correlation coefficient’, r, based on data (x1, y1), . . . , (xn, yn).
Describe some properties of r in terms of how its value differs when the data show different patterns in a scatter diagram.

Solution

The sample correlation coefficient is:

r = Sxy / √(Sxx Syy)

where:

Sxx = Σ (xi − x̄)² = Σ xi² − n x̄²
Syy = Σ (yi − ȳ)² = Σ yi² − n ȳ²
Sxy = Σ (xi − x̄)(yi − ȳ) = Σ xi yi − n x̄ ȳ

with all sums running over i = 1, . . . , n. Examples of scatter diagrams include:

• linear up: r near 1
• linear down: r near −1
• random cloud: r close to 0.

4 In fact, these data are scattered around a parabola with (approximate) equation y = 2(x − 15)(85 − x).

Activity 12.3 The following table shows the number of salespeople employed by a company and the corresponding values of sales (in £000s).

Salespeople (x):  210  209  219  225  232  221  220  233  200  215  205  227
Sales (y):        206  200  204  215  222  216  210  218  201  212  204  212

Here are some useful summary statistics:

Σ xi = 2616, Σ yi = 2520, Σ xi yi = 550,069, Σ xi² = 571,500 and Σ yi² = 529,746.

Compute the sample correlation coefficient for these data.

Solution

We have:

x̄ = 2616/12 = 218 and ȳ = 2520/12 = 210.

Therefore:

r = (Σ xi yi − n x̄ ȳ) / √((Σ xi² − n x̄²)(Σ yi² − n ȳ²))
  = (550,069 − 12 × 218 × 210) / √((571,500 − 12 × (218)²)(529,746 − 12 × (210)²))
  = 0.8716.

This indicates a very strong, positive and linear relationship between salespeople and sales. The more salespeople employed, the higher the sales.

Activity 12.4 State whether the following statements are true or false, explaining your answers.

(a) The sample correlation coefficient between x and y is the same as the sample correlation coefficient between y and x.
(b) If the slope is negative in a regression equation y = a + bx, then the correlation coefficient between x and y would be negative too.
(c) If two variables have a correlation coefficient of minus 1 they are not related.
(d) A large correlation coefficient means the regression line will have a steep slope.

Solution

(a) True. The sample correlation coefficient is:

r = Sxy / √(Sxx Syy).

Switching x and y does not change r because Sxy = Syx.

(b) True. The sign of the slope and the sign of the correlation coefficient between x and y are the same. A negative slope indicates a negative correlation between the variables. Using sample data, the sign is determined by the term:

Sxy = Σ xi yi − n x̄ ȳ

which is the numerator of both b and r.

(c) False. A correlation coefficient of minus 1 represents a perfect negative correlation between two variables. A correlation coefficient of zero would indicate that the two variables are not linearly related.

(d) False. A large correlation coefficient means a strong linear relationship between two variables. The magnitude of the correlation coefficient says nothing about how steep the regression line is.

12.8.1 Spearman rank correlation

Just as we saw in Chapter 4 that we can use the median and interquartile range as measures of location and dispersion, respectively, instead of the mean and standard deviation (or variance), it is possible to calculate the Spearman rank correlation, rs, instead of r. To compute rs we rank the xi and yi values in ascending order and use the r formula above with the ranks of the xi s and yi s, rather than their actual values. Of course, it may be that we only have the ranks, in which case we would have to use this method.

Spearman rank correlation

If there are no tied rankings of xi and yi, the Spearman rank correlation is:

rs = 1 − (6 Σ di²) / (n(n² − 1))     (12.2)

where the di s are the differences in the ranks between each xi and yi.

As with other order statistics, such as the median and quartiles, it is helpful to use the Spearman rank correlation if you are worried about the effect of extreme observations (outliers) in your sample.
The limits for rs are the same as for r, that is |rs| ≤ 1.

Example 12.6 An aptitude test has been designed to examine a prospective salesperson’s ability to sell. Ten current staff sit the test. Instead of putting achieved scores in the computer, a research assistant ranks the individuals in ascending order in terms of the test as well as productivity. The ranked data are:

Staff member                 A  B  C  D  E  F   G  H  I   J
Rank order in test           2  3  5  1  4  9  10  6  7   8
Rank order in productivity   1  2  3  4  5  6   7  8  9  10

Does it look as if the test is a good predictor of sales ability? We first compute the differences in the ranks:

Staff member   A  B  C   D   E  F  G   H   I   J
di             1  1  2  −3  −1  3  3  −2  −2  −2
di²            1  1  4   9   1  9  9   4   4   4

and so:

Σ di² = 46.

Hence, using (12.2), we have:

rs = 1 − (6 × 46) / (10 × ((10)² − 1)) = 0.7212

which is quite strong, indicating that the test is a reasonably good predictor of sales ability.

Activity 12.5 Seven students in a class received the following examination and project marks in a subject:

Examination mark   50  80  70  40  30  75  95
Project mark       75  60  55  40  50  80  65

You want to know if students who excel in examinations in the subject also had relatively high project marks.

(a) Calculate the Spearman rank correlation.
(b) Based on your answer to part (a), do you think students who score well in examinations are also likely to have the highest project marks? Briefly justify your answer.

Solution

(a) We have (ranking marks from highest to lowest):

Examination mark           50  80  70  40  30  75  95
Project mark               75  60  55  40  50  80  65
Rank of examination mark    5   2   4   6   7   3   1
Rank of project mark        2   4   5   7   6   1   3
Difference in ranks, di     3  −2  −1  −1   1   2  −2
di²                         9   4   1   1   1   4   4

The Spearman rank correlation is:

rs = 1 − (6 Σ di²) / (n(n² − 1)) = 1 − (6 × 24) / (7 × 48) = 0.5714.

(b) There is some correlation, but it is not strong. It looks as if there is a positive connection between examination marks and project marks.
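The rank-difference calculation in Example 12.6 is easy to mechanise. The following sketch (plain Python; code is not part of this course) applies formula (12.2) to the ranked aptitude-test data:

```python
# Ranked data from Example 12.6 (no tied ranks), staff members A-J.
test_rank = [2, 3, 5, 1, 4, 9, 10, 6, 7, 8]    # rank order in test
prod_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # rank order in productivity

n = len(test_rank)
d = [t - p for t, p in zip(test_rank, prod_rank)]  # rank differences d_i
sum_d2 = sum(di**2 for di in d)                    # sum of squared differences

# Spearman rank correlation, formula (12.2), valid when there are no ties.
r_s = 1 - (6 * sum_d2) / (n * (n**2 - 1))

print(sum_d2)          # 46
print(round(r_s, 4))   # 0.7212
```

With tied values, one would instead average the tied ranks and apply the ordinary r formula to the ranks directly.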
12.9 Linear regression

In ST104a Statistics 1, we can only hope to review the fundamentals of what is a very large (and important) topic in statistical analysis – several textbooks and courses focus exclusively on regression analysis. Here, we shall concentrate on the simple linear regression model (also known as the bivariate regression model). This will allow calculations to be performed on a hand-held calculator, unlike multiple linear regression (with multiple independent variables), which typically requires the help of statistical computer packages due to the complexity and sheer number of calculations involved.

In the simple linear regression model, we have two variables, Y and X.

• Y is the dependent (or response) variable – that which we are trying to explain.
• X is the independent (or explanatory) variable – the variable we think influences Y.

Multiple linear regression is just a natural extension of this set-up, but with more than one explanatory variable. There can be a number of reasons for wanting to establish a mathematical relationship between a response variable and an explanatory variable, for example:

• to find and interpret unknown parameters in a known relationship
• to understand the reason for such a relationship – is it causal?
• to predict or forecast Y for specific values of the explanatory variable.

12.9.1 The simple linear regression model

We assume that there is a true (population) linear relationship between a response variable, y, and an explanatory variable, x, of the approximate form:

y = α + βx

where α and β are fixed, but unknown, population parameters. Our objective is to estimate α and β using (paired) sample data (xi, yi), for i = 1, . . . , n. Note the use of the word approximate. Particularly in the social sciences, we would not expect a perfect linear relationship between the two variables.
Therefore, we modify this basic model to:

y = α + βx + ε

where ε is some random perturbation from the initial ‘approximate’ line. In other words, each y observation almost lies on the line, but ‘jumps’ off the line according to the random variable ε. This perturbation is often referred to as the error term. For each pair of observations (xi, yi), for i = 1, . . . , n, we can write this as:

yi = α + βxi + εi.

The random deviations ε1, ε2, . . . , εn corresponding to the n data points are assumed to be independent and identically normally distributed, with zero mean and constant (but unknown) variance, σ². That is:

εi ∼ N(0, σ²) for i = 1, . . . , n.

This completes the model specification.

Specification of the simple linear regression model

To summarise, we list the assumptions of the simple linear regression model.

• A linear relationship between the variables of the form y = α + βx + ε.
• The existence of three model parameters: the linear equation parameters α and β, and the error term variance, σ².
• Var(εi) = σ² for all i = 1, . . . , n, i.e. the error term variance is constant and does not depend on the explanatory variable.
• The εi s are independent and N(0, σ²) distributed for all i = 1, . . . , n.

You may feel that some of these assumptions are particularly strong and restrictive. For example, why should the error term variance be constant across all observations? Indeed, your scepticism serves you well! In a more comprehensive discussion of linear regression, such as in EC2020 Elements of econometrics, model assumptions would be properly tested to assess their validity. Given the limited scope of linear regression in ST104a Statistics 1, sadly we are too time-constrained to consider such tests in detail. However, do be aware that with any form of modelling, a thorough critique of model assumptions is essential. Analysis based on false assumptions leads to invalid results – a bad thing.
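One way to internalise these assumptions is to simulate data which satisfies them. The sketch below (plain Python; code is not part of this course) generates observations from a hypothetical model y = 5 + 2x + ε with εi ∼ N(0, 1); the parameter values and sample size are chosen purely for illustration.

```python
import random

random.seed(42)  # make the simulation reproducible

alpha, beta, sigma = 5.0, 2.0, 1.0   # hypothetical population parameters
n = 200

xs = [i / 10 for i in range(n)]                   # fixed explanatory values
eps = [random.gauss(0, sigma) for _ in range(n)]  # i.i.d. N(0, sigma^2) errors
ys = [alpha + beta * x + e for x, e in zip(xs, eps)]

# Under the model assumptions, the errors should average out near zero.
mean_eps = sum(eps) / n
print(len(ys), round(mean_eps, 2))
```

Plotting ys against xs would show points scattered around the true line y = 5 + 2x, exactly the situation the estimation procedure of the next subsection is designed for.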
Activity 12.6 Write down the simple linear regression model, explaining each term in the model.

Solution

The simple linear regression model is:

y = α + βx + ε

where:

• y is the response (or dependent) variable
• x is the explanatory (or independent) variable
• α is the y-intercept
• β is the slope of the line
• ε is a random error term.

Activity 12.7 List the assumptions which we make when using the simple linear regression model to explain changes in y by changes in x (i.e. the ‘regression of y on x’).

Solution

We have the following four assumptions.

• The equation of the form y = α + βx + ε describes the true population relationship.
• There are three unknown population parameters in the model, namely the ‘linear’ parameters α and β, and the error term variance, σ².
• σ² is the same for i = 1, . . . , n, so in particular it does not depend on the value of x we observe.
• εi ∼ N(0, σ²) for i = 1, . . . , n, and the different εi s are independent of each other.

12.9.2 Parameter estimation

As mentioned above, our principal objective is to estimate α and β, that is, the y-intercept and slope of the true line. To fit a line to some data, we need a criterion for establishing which straight line is in some sense ‘best’. The criterion used is to minimise the sum of the squared distances between the observed values of yi and the values predicted by the model.5

The estimated least squares regression line is written as:

ŷ = a + bx

where a and b denote our estimates of α and β, respectively. The derivation of the formulae for a and b is not required for this course,6 although you do need to know how to calculate point estimates of α and β.

Simple linear regression line estimates

We estimate α and β with a and b, where:

b = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² = (Σ xi yi − n x̄ ȳ) / (Σ xi² − n x̄²) = Sxy / Sxx

with all sums running over i = 1, . . . , n, and:

a = ȳ − b x̄.

Note you need to compute b first, since this is needed to calculate a.
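Applying these estimators to the Example 12.1 crime data gives a quick numerical check of the formulae. The sketch below (plain Python; code is not part of this course) computes b = Sxy/Sxx and then a = ȳ − b x̄ directly from the raw data:

```python
# Paired data from Example 12.1: unemployment (x) and offences (y).
x = [2614, 1160, 1055, 1199, 2157, 2305, 1687, 1287, 1869, 2283, 1162, 1201]
y = [6200, 4610, 5336, 5411, 5808, 6004, 5420, 5588, 5719, 6336, 5103, 5268]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
Sxx = sum(xi**2 for xi in x) - n * x_bar**2

b = Sxy / Sxx          # slope estimate
a = y_bar - b * x_bar  # intercept estimate (needs b first)

print(round(b, 4), round(a))  # 0.7469 4323
```

So for these data the fitted line is approximately ŷ = 4323 + 0.747x; the positive slope is consistent with the positive correlation (r = 0.8606) found in Example 12.4.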
Activity 12.8
(a) Explain the meaning of the term ‘least squares estimation’.
(b) Explain, briefly, the difference between correlation and regression.

Solution
(a) ‘Least squares estimation’ refers to moving a line around until the sum of the (vertical) squared deviations from the data is at a minimum.
(b) The main point is that we take xi fixed in regression, but correlation is for two random variables. Also, the estimate of β is related to correlation. For example, positive β occurs with positive correlation.

Activity 12.9 The table below shows the cost of fire damage for ten fires together with the corresponding distances of the fires to the nearest fire station.

Distance in miles (x): 4.9  4.5  6.3  3.2  5.0  5.7  4.0  4.3  2.5  5.2
Cost in £000s (y):    31.1 31.1 43.1 22.1 36.2 35.8 25.9 28.0 22.9 33.5

Here are some useful summary statistics:

Σxi = 45.6, Σyi = 309.7, Σxi yi = 1475.1, Σxi² = 219.46 and Σyi² = 9973.99.

Fit a straight line to these data.

Solution
We have:
x̄ = 45.6/10 = 4.56 and ȳ = 309.7/10 = 30.97.
The slope estimate is:
b = (Σxi yi − nx̄ȳ) / (Σxi² − nx̄²) = (1475.1 − 10 × 4.56 × 30.97) / (219.46 − 10 × (4.56)²) = 5.46.
The intercept estimate is:
a = ȳ − bx̄ = 30.97 − 5.46 × 4.56 = 6.07.
Hence the estimated regression line is ŷ = 6.07 + 5.46x.

Activity 12.10 Consider the following dataset (which should be treated as sample data):

Observation: 1   2   3   4   5   6   7   8   9   10
xi:          1.1 3.9 2.8 3.2 2.9 4.4 3.4 4.9 2.3 3.8
yi:          6.4 17.0 12.8 14.4 13.1 18.7 15.1 20.6 11.0 16.6

Summary statistics for the dataset are:

Σxi = 32.7, Σyi = 145.7, Σxi yi = 516.19, Σxi² = 117.57 and Σyi² = 2271.39.

Calculate the point estimates of α and β for the regression of y on x based on the above sample data.
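Examination questions typically supply only the summary statistics, as in Activity 12.9. A sketch of the computation working from those statistics alone:

```python
# Summary statistics as given in Activity 12.9 (fire damage data)
n = 10
sum_x, sum_y, sum_xy, sum_x2 = 45.6, 309.7, 1475.1, 219.46

xbar, ybar = sum_x / n, sum_y / n      # 4.56 and 30.97
b = (sum_xy - n * xbar * ybar) / (sum_x2 - n * xbar ** 2)
a = ybar - b * xbar
# b ≈ 5.455 and a ≈ 6.093; rounding b to 5.46 before computing a, as in the
# worked solution, gives a = 6.07 instead
```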
Solution
We have:
x̄ = 32.7/10 = 3.27 and ȳ = 145.7/10 = 14.57.
The slope estimate is:
b = (Σxi yi − nx̄ȳ) / (Σxi² − nx̄²) = (516.19 − 10 × 3.27 × 14.57) / (117.57 − 10 × (3.27)²) = 3.7356.
The intercept estimate is:
a = ȳ − bx̄ = 14.57 − 3.7356 × 3.27 = 2.3546.
Hence the estimated regression line is ŷ = 2.3546 + 3.7356x.

Activity 12.11 In a study of household expenditure a population was divided into five income groups with the mean income, x, and the mean expenditure, y, on essential items recorded (in Euros per month). The results are in the following table.

x: 1000 2000 3000 4000 5000
y:  871 1300 1760 2326 2950

Here are some useful summary statistics:

Σxi = 15,000, Σyi = 9207, Σxi² = 55,000,000, Σxi yi = 32,805,000 and Σyi² = 19,659,017.

(a) Fit a straight line to the data.
(b) How would you use the fit to predict the percentage of income which households spend on essential items? Comment on your answer.

Solution
(a) We have:
x̄ = 15000/5 = 3000 and ȳ = 9207/5 = 1841.4.
The slope estimate is:
b = (Σxi yi − nx̄ȳ) / (Σxi² − nx̄²) = (32805000 − 5 × 3000 × 1841.4) / (55000000 − 5 × (3000)²) = 0.5184.
The intercept estimate is:
a = ȳ − bx̄ = 1841.4 − 0.5184 × 3000 = 286.2.
Hence the estimated regression line is ŷ = 286.2 + 0.5184x.

(b) The percentage of income which households spend on essential items is:
(y/x) × 100%
which is approximated by:
((a + bx)/x) × 100%.
This percentage is decreasing with increasing x so that poorer households spend a larger proportion of their income on essentials.

Activity 12.12 The following table shows the number of computers (in 000s), x, produced by a company each month and the corresponding monthly costs (in £000s), y, for running its computer maintenance department.
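Part (b) of Activity 12.11 can be checked numerically. A quick sketch using the estimates obtained in part (a):

```python
a, b = 286.2, 0.5184   # estimates from Activity 12.11 part (a)

# Share of income spent on essentials, (a + b*x)/x * 100%, at two incomes
share = {x: (a + b * x) / x * 100 for x in (1000, 5000)}
# share[1000] ≈ 80.5% but share[5000] ≈ 57.6%: the proportion falls as
# income rises, as claimed in part (b)
```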
Number of computers (in 000s), x:   7.2 8.1 6.4 7.7 8.2 6.8 7.3 7.8 7.9 8.1
Maintenance costs (in £000s), y:    100 116  98 112 115 103 106 107 112 111

The following statistics can be calculated from the data.

Σxi = 75.5, Σyi = 1080, Σxi yi = 8184.9, Σxi² = 573.33 and Σyi² = 116,988.

(a) Calculate the correlation coefficient for computers and maintenance costs.
(b) Find the best-fitting straight line relating y and x.
(c) Plot the points on a scatter diagram and draw the line of best fit.
(d) Comment on your results. How would you check on the strength of the relationship you have found?

Solution
(a) We have:
x̄ = 75.5/10 = 7.55 and ȳ = 1080/10 = 108.
The sample correlation coefficient is:
r = (Σxi yi − nx̄ȳ) / √[(Σxi² − nx̄²)(Σyi² − nȳ²)] = (8184.9 − 10 × 7.55 × 108) / √[(573.33 − 10 × (7.55)²)(116988 − 10 × (108)²)] = 0.9111.

(b) The slope estimate is:
b = (Σxi yi − nx̄ȳ) / (Σxi² − nx̄²) = (8184.9 − 10 × 7.55 × 108) / (573.33 − 10 × (7.55)²) = 9.349.
The intercept estimate is:
a = ȳ − bx̄ = 108 − 9.349 × 7.55 = 37.415.
Hence the estimated regression line is ŷ = 37.415 + 9.349x.

(c) [Scatter diagram of maintenance costs (in £000s) against number of computers (in 000s), with the line of best fit drawn through the points.]
Note to plot the line you need to compute any two points on the line. For example, for x = 7 and x = 8 the y-coordinates are determined, respectively, as:
37.415 + 9.349 × 7 ≈ 103 and 37.415 + 9.349 × 8 ≈ 112
giving points (7, 103) and (8, 112). The line of best fit should be drawn passing through these two points.

(d) The sample correlation coefficient is close to 1, indicating a very strong positive correlation between maintenance costs and the number of computers. For each additional 1,000 computers, maintenance costs increase by £9,349.
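The sample correlation coefficient formula used in part (a) can be sketched as follows, applied to the Activity 12.12 data (the function name is our own):

```python
from math import sqrt

def correlation(x, y):
    """Sample correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum(a * c for a, c in zip(x, y)) - n * xbar * ybar
    sxx = sum(a ** 2 for a in x) - n * xbar ** 2
    syy = sum(c ** 2 for c in y) - n * ybar ** 2
    return sxy / sqrt(sxx * syy)

computers = [7.2, 8.1, 6.4, 7.7, 8.2, 6.8, 7.3, 7.8, 7.9, 8.1]
costs = [100, 116, 98, 112, 115, 103, 106, 107, 112, 111]
r = correlation(computers, costs)
# r ≈ 0.911, matching part (a) of the solution
```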
12.9.3 Prediction

Having estimated the regression line, an important application of it is for prediction. That is, for a given value of the explanatory variable, we can use it in the estimated regression line to obtain a prediction of y.

Predicting y for a given value of x

For a given value of the explanatory variable, x0, the predicted value of the dependent variable, ŷ, is:

ŷ = a + bx0.

Remember to attach the appropriate units to the prediction (i.e. the units of measurement of the original y data). Also, ensure the value you use for x0 is correct – if the original x data are in 000s, then a prediction of y when the explanatory variable is 10,000, say, would mean x0 = 10, and not 10,000!

Example 12.7 A study was made by a retailer to determine the relationship between weekly advertising expenditure and sales (both in thousands of pounds). Find the equation of a regression line to predict weekly sales from advertising. Predict weekly sales when advertising costs are £35,000. The data are:

Advertising costs (in £000s):  40  20  25  20  30  50  40  20  50  40  25  50
Sales (in £000s):             385 400 395 365 475 440 490 420 560 525 480 510

Summary statistics, representing sales as yi and advertising costs as xi, for i = 1, . . . , 12, are:

Σxi = 410, Σxi² = 15650, Σyi = 5445, Σyi² = 2512925 and Σxi yi = 191325.

So, using the estimation formulae above, the parameter estimates are:
b = (191325 − 12 × (410/12) × (5445/12)) / (15650 − 12 × (410/12)²) = 3.221
and:
a = 5445/12 − 3.221 × 410/12 = 343.7.
Hence the estimated regression line is:
ŷ = 343.7 + 3.221x.
The predicted sales for £35,000 worth of advertising is:
ŷ = 343.7 + 3.221 × 35 = 456.4
which is £456,400. Note that since the advertising costs were given in £000s, we used x0 = 35, and then converted the predicted sales into pounds.
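The prediction rule ŷ = a + bx0, together with the warning about units, can be sketched as:

```python
# Example 12.7 estimates: sales (in £000s) regressed on advertising (in £000s)
a, b = 343.7, 3.221

# Advertising of £35,000 means x0 = 35 (the x data are in £000s), not 35000
x0 = 35
y_hat = a + b * x0
# y_hat ≈ 456.4 (in £000s), i.e. predicted weekly sales of about £456,400
```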
Activity 12.13 The examination marks of thirteen students in two statistics papers (a foundation paper, Paper I, and a more advanced paper, Paper II) were as follows:

Paper I (x):  65 73 42 52 84 60 70 79 60 83 57 77 54
Paper II (y): 78 88 60 73 92 77 84 89 70 99 73 88 70

Useful summary statistics for these data are:

Σxi = 856, Σxi² = 58,402, Σyi = 1041, Σyi² = 84,801 and Σxi yi = 70,203.

(a) Calculate the sample correlation coefficient and comment on its value.
(b) Determine the line of best fit of y on x.
(c) Plot the points on a scatter diagram and draw the line of best fit.
(d) If you were told a student achieved a mark of 68 in Paper I, what would you predict for the student’s mark in Paper II? Would you trust this prediction?

Solution
(a) We have:
x̄ = 856/13 = 65.846 and ȳ = 1041/13 = 80.077.
Therefore:
r = (Σxi yi − nx̄ȳ) / √[(Σxi² − nx̄²)(Σyi² − nȳ²)] = (70203 − 13 × 65.846 × 80.077) / √[(58402 − 13 × (65.846)²)(84801 − 13 × (80.077)²)] = 0.9671.
This indicates a very strong, positive and linear relationship between Paper I and Paper II examination marks. Assuming these papers are sat in order, the higher the examination mark in Paper I, the higher the examination mark in Paper II.

(b) The slope estimate is:
b = (Σxi yi − nx̄ȳ) / (Σxi² − nx̄²) = (70203 − 13 × 65.846 × 80.077) / (58402 − 13 × (65.846)²) = 0.8132.
The intercept estimate is:
a = ȳ − bx̄ = 80.077 − 0.8132 × 65.846 = 26.5310.
Hence the estimated regression line is ŷ = 26.5310 + 0.8132x.

(c) [Scatter diagram of Paper II examination mark against Paper I examination mark, with the line of best fit drawn through the points.]
Note to plot the line you need to compute any two points on the line. For example, for x = 50 and x = 70 the y-coordinates are determined, respectively, as:
26.5310 + 0.8132 × 50 ≈ 67 and 26.5310 + 0.8132 × 70 ≈ 83
giving points (50, 67) and (70, 83).
The line of best fit should be drawn passing through these two points.

(d) We have:
ŷ = 26.5310 + 0.8132 × 68 = 81.8286.
Since examination marks are integers, we round this to 82. The Paper I mark of 68 is within the range of our x data, hence this is interpolation so we trust the accuracy of the prediction.

Activity 12.14 The following table, published in USA Today, lists divorce rates and mobility rates for different regions of the USA. The divorce rate is measured as the annual number of divorces per 1,000 population, and the mobility rate is the percentage of people living in a different household from five years before.

Region               Mobility rate (x-variable)   Divorce rate (y-variable)
New England          41                           4.0
Middle Atlantic      37                           3.4
East North Central   44                           5.1
West North Central   46                           4.6
South Atlantic       47                           5.6
East South Central   44                           6.0
West South Central   50                           6.5
Mountain             57                           7.6
Pacific              56                           5.9

Summary statistics for the dataset are:

Σxi = 422, Σxi² = 20132, Σyi = 48.7, Σyi² = 276.91 and Σxi yi = 2341.6.

(a) Calculate the sample correlation coefficient and comment on its value.
(b) Calculate the regression equation.
(c) Plot the points on a scatter diagram and draw the line of best fit.
(d) Compute the expected divorce rate if the mobility rate is 40.
(e) Why might it be reasonable to use the divorce rate as the y variable?

Solution
(a) The sample correlation coefficient is:
r = (Σxi yi − nx̄ȳ) / √[(Σxi² − nx̄²)(Σyi² − nȳ²)] = (2341.6 − 9 × 46.89 × 5.41) / √[(20132 − 9 × (46.89)²)(276.91 − 9 × (5.41)²)] = 0.8552
which suggests a fairly strong, positive and linear relationship between divorce rate and mobility rate (as deduced from the scatter diagram).

(b) The slope estimate is:
b = (Σxi yi − nx̄ȳ) / (Σxi² − nx̄²) = 0.1685.
The intercept estimate is:
a = ȳ − bx̄ = −2.4893.
Hence the estimated regression line is ŷ = −2.4893 + 0.1685x.
(c) [Scatter diagram of divorce rate (per 1,000 population) against mobility rate, with the line of best fit drawn through the points.]
Note to plot the line you need to compute any two points on the line. For example, for x = 45 and x = 50 the y-coordinates are determined, respectively, as:
−2.4893 + 0.1685 × 45 ≈ 5 and −2.4893 + 0.1685 × 50 ≈ 6
giving points (45, 5) and (50, 6). The line of best fit should be drawn passing through these two points.

(d) For x = 40, the expected divorce rate is −2.4893 + 0.1685 × 40 = 4.25 per 1,000 population.

(e) The use of the divorce rate as the dependent variable is reasonable due to the likely disruptive effect moving home may have on relationships (any reasonable argument would be acceptable here).

Activity 12.15 An area manager in a department store wants to study the relationship between the number of workers on duty and the value of merchandise lost to shoplifters. To do so, she assigned a different number of clerks for each of 10 weeks. The results were:

Week:                    1   2   3   4   5   6   7   8   9  10
Number of workers (xi):  9  11  12  13  15  18  16  14  12  10
Loss (yi):             420 350 360 300 225 200 230 280 315 410

Here are some useful summary statistics:

Σxi = 130, Σyi = 3090, Σxi yi = 38,305, Σxi² = 1760 and Σyi² = 1,007,750.

(a) Which is the independent variable and which is the dependent variable?
(b) Compute the correlation coefficient between the number of workers and the loss.
(c) Calculate the least squares regression line.
(d) Plot the points on a scatter diagram and draw the line of best fit.
(e) Interpret the regression coefficients of the fitted line.
(f) Predict the loss when the number of workers is 17. Do you trust this value?

Solution
(a) The independent variable should be the number of workers, xi, and the dependent variable should be the loss, yi.
One might expect that more workers would mean less shoplifting, as shoplifters would be more likely to get caught, hence they are deterred from stealing merchandise.

(b) We have:
x̄ = 130/10 = 13 and ȳ = 3090/10 = 309.
The sample correlation coefficient is:
r = (Σxi yi − nx̄ȳ) / √[(Σxi² − nx̄²)(Σyi² − nȳ²)] = (38305 − 10 × 13 × 309) / √[(1760 − 10 × (13)²)(1007750 − 10 × (309)²)] = −0.9688.

(c) The slope estimate is:
b = (Σxi yi − nx̄ȳ) / (Σxi² − nx̄²) = (38305 − 10 × 13 × 309) / (1760 − 10 × (13)²) = −26.64.
The intercept estimate is:
a = ȳ − bx̄ = 309 − (−26.64) × 13 = 655.32.
Hence the estimated regression line is ŷ = 655.32 − 26.64x.

(d) [Scatter diagram of value of merchandise lost to shoplifters against number of workers on duty, with the line of best fit drawn through the points.]
Note to plot the line you need to compute any two points on the line. For example, for x = 10 and x = 16 the y-coordinates are determined, respectively, as:
655.32 − 26.64 × 10 ≈ 389 and 655.32 − 26.64 × 16 ≈ 229
giving points (10, 389) and (16, 229). The line of best fit should be drawn passing through these two points.

(e) For every extra worker, the value of merchandise lost to shoplifters decreases by £26.64. If there were no workers on duty (unlikely), then £655.32 would be lost. (Since x = 0 is not realistic here, the intercept has little practical meaning.)

(f) When x = 17, the predicted loss is ŷ = 655.32 − 26.64 × 17 = £202.44. Although 17 is within the range of the x data, it is only just! So, while strictly speaking this is interpolation, being near the limit of the x data (the maximum value of x is 18) means we should treat the prediction with caution!

Activity 12.16 The following data were recorded during an investigation into the effect of fertiliser in g/m², x, on crop yields in kg/ha, y.
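The caution in part (f) — checking whether x0 lies within the range of the x data — can be automated. A sketch (the helper function and its labels are our own, not from the guide):

```python
def predict_checked(a, b, x0, x_data):
    """Predict y-hat = a + b*x0 and flag extrapolation outside the x range."""
    y_hat = a + b * x0
    inside = min(x_data) <= x0 <= max(x_data)
    return y_hat, "interpolation" if inside else "extrapolation"

workers = [9, 11, 12, 13, 15, 18, 16, 14, 12, 10]
y_hat, status = predict_checked(655.32, -26.64, 17, workers)
# y_hat ≈ 202.44 and status = "interpolation", though x0 = 17 sits close to
# the maximum of 18, so the prediction still deserves caution
```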
Crop yields (kg/ha):   160 168 176 179 183 186 189 186 184
Fertiliser (g/m²):       0   1   2   3   4   5   6   7   8

Here are some useful summary statistics:

Σxi = 36, Σyi = 1611, Σxi yi = 6627, Σxi² = 204 and Σyi² = 289,099.

(a) Plot the data and comment on the appropriateness of using the simple linear regression model.
(b) Calculate a least squares regression line for the data.
(c) Predict the crop yield for 3.5 g/m² of fertiliser.
(d) Would you feel confident predicting a crop yield for 10 g/m² of fertiliser? Explain briefly why or why not.

Solution
(a) [Scatter diagram of crop yields (kg/ha) against fertiliser (g/m²).]
Fitting a linear model seems reasonable, although there is a strong hint of non-linearity.

(b) We have:
x̄ = 36/9 = 4 and ȳ = 1611/9 = 179.
The slope estimate is:
b = (Σxi yi − nx̄ȳ) / (Σxi² − nx̄²) = (6627 − 9 × 4 × 179) / (204 − 9 × 4²) = 3.05.
The intercept estimate is:
a = ȳ − bx̄ = 179 − 3.05 × 4 = 166.8.
Hence the estimated regression line is ŷ = 166.8 + 3.05x.

(c) When x = 3.5, the predicted crop yield is:
ŷ = 166.8 + 3.05 × 3.5 = 177.475 kg/ha.

(d) No, we would not feel confident since this would be clear extrapolation. The x data values do not exceed 8, and the scatter diagram suggests that fertiliser values above 7 g/m² may actually have a negative effect on crop yield.

12.9.4 Points to watch about linear regression

Non-linear relationships

We have only seen here how to use a straight line to model the best fit. So, we could be missing some quite important non-linear relationships, particularly if we were working in the natural sciences.

Which is the dependent variable?

Note it is essential to correctly establish which is the dependent variable, y. In Example 12.7, you would have a different line if you had taken advertising costs as y and sales as x!
So remember to exercise your common sense – we would expect sales to react to advertising campaigns rather than vice versa. Extrapolation In Example 12.7, we used our estimated line of best fit to predict the value of y for a given value of x, i.e. advertising expenditure of £35,000. Such prediction is only reliable if you are dealing with figures which lie within the dataset, known as interpolation. If you use the estimated regression line for prediction using x0 which lies outside the range of the available x data, then this is known as extrapolation, for which any predictions should be viewed with extreme caution. For Example 12.7, it may not be immediately obvious that the relationship between advertising expenditure and sales could change, but a moment’s thought should convince you that, were you to quadruple advertising expenditure, you would be unlikely to get a nearly 13-fold increase in sales! Basic economics would suggest diminishing marginal returns to advertising expenditure. Sometimes it is very easy to see that the relationship must change. For instance, consider Example 12.8, which shows an anthropologist’s figures on years of education of a mother and the number of children she has, based on a Pacific island. 12 Example 12.8 Figures from an anthropologist show a negative linear relationship between the number of years of education, x, of a mother and the number of live births she has, y. The estimated regression line is: yb = 8 − 0.6x based on figures of women with between 5 and 8 years of education who had 0 to 8 live births. This looks sensible. We predict yb = 8 − 0.6 × 5 = 5 live births for those with 5 years of education, and yb = 8 − 0.6 × 10 = 2 live births for those with 10 years of education. 304 12.10. Points to note about correlation and regression This is all very convincing, but say a woman on the island went to university and completed a doctorate, and so had 15 years of education. 
She clearly cannot have ŷ = 8 − 0.6 × 15 = −1 children! Also, if someone missed school entirely, is she really likely to have ŷ = 8 − 0.6 × 0 = 8 children? We have no way of knowing. The relationship shown by the existing figures will probably not hold beyond the x data range of 5 to 8 years of education.

12.10 Points to note about correlation and regression

As already mentioned, examiners frequently give you the following summary statistics:

Σxi, Σxi², Σyi, Σyi² and Σxi yi (each sum running over i = 1, . . . , n)

in order to save you time. If you do not know how to take advantage of these, you will waste valuable time which you really need for the rest of the question. Note that if you use your calculator, show no working and get the answer wrong, you are unlikely to get any credit.

This part of the syllabus leads directly into ST104b Statistics 2 and EC2020 Elements of econometrics, so it is important that you understand it if you are taking these courses.

What is the relationship between correlation and regression?

As we have seen, all the calculations we do for correlation and regression involve similar summation measures, so they must be connected in some way. This is indeed the case. For example, a large r² (the square of the sample correlation coefficient) means that the standard error of b will be low (in other words, if there is a strong connection between x and y, the points will be close to the line of best fit). A small r² means that the points are scattered (resulting in a large standard error of b). While you are not expected to determine confidence intervals nor perform hypothesis tests for the parameters α, β and ρ (these are covered in ST104b Statistics 2), it is important that you are aware of these ideas.
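The connection between correlation and regression can be made explicit: since b = Sxy/Sxx and r = Sxy/√(Sxx Syy), it follows that b = r√(Syy/Sxx). A quick numerical check with made-up data:

```python
from math import sqrt

# Made-up data, purely to illustrate the identity b = r * sqrt(Syy / Sxx)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum(a * c for a, c in zip(x, y)) - n * xbar * ybar
sxx = sum(a ** 2 for a in x) - n * xbar ** 2
syy = sum(c ** 2 for c in y) - n * ybar ** 2

b = sxy / sxx               # slope estimate
r = sxy / sqrt(sxx * syy)   # sample correlation coefficient
assert abs(b - r * sqrt(syy / sxx)) < 1e-9
# here r ≈ 0.77 (fairly strong) while b = 0.6 (a moderate slope): a large
# |r| does not imply a steep regression line
```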
On the other hand, be sure you are clear that a large |r| does not necessarily mean that the slope of the regression line is steep – a strong correlation can exist if the points are all close to a line with a moderate gradient.

12.11 Summary

This brings us to the end of our introduction to dealing with the estimation of two variables (bivariate statistics). This chapter has been chiefly concerned with two important features when modelling the (linear) relationship between two variables: (i.) correlation – measuring the strength of a (linear) relationship, and (ii.) linear regression – a way of representing that linear relationship.

12.12 Key terms and concepts

Causality • Correlation • Correlation coefficient • Dependent (response) variable • Error term • Extrapolation • Independent (explanatory) variable • Non-linear relationship • Prediction • Regression • Scatter diagram (plot) • Simple linear regression • Spearman rank correlation • Uncorrelated

12.13 Learning activities

Solutions can be found in Appendix A.

1. State whether the following statements are true or false and explain why.
(a) ‘The correlation between X and Y is the same as the correlation between Y and X.’
(b) ‘If the slope is negative in a regression equation, then the correlation coefficient between X and Y would be negative too.’
(c) ‘If two variables have a correlation coefficient of minus 1 they are not related.’
(d) ‘A strong correlation coefficient means the regression line will have a steep slope.’

2. Sketch a scatter diagram for each of the following situations:
(a) r = 0
(b) r is strong and positive
(c) r is strong and negative
(d) r is quite strong, but negative.

3. Think of an example where you feel the correlation is clearly spurious (that is, there is a correlation, but no causal connection) and explain how it might arise. Now think of a ‘clear’ correlation and the circumstances in which you might accept causality.

4.
Work out a and b in Example 12.7 using advertising costs as the dependent variable. Now predict advertising costs when sales are £460,000. Make sure you understand how and why your results are different from Example 12.7!

5. Try to think of a likely linear relationship between x and y which would probably work over some of the data, but then break down like that in the anthropologist case in Example 12.8. This should make sure you understand the difference between interpolation and extrapolation.

6. Sketch a scatter diagram with the line of best fit for the following a, b and r.
(a) a = 2, b = 1/2 and r = 0.9
(b) a = −3, b = −2 and r = −0.3.

12.14 A reminder of your learning outcomes

After completing this chapter, and having completed the Essential reading and activities, you should be able to:
• draw and label a scatter diagram
• calculate r
• explain the meaning of a particular value and the general limitations of r
• calculate a and b for the line of best fit in a scatter diagram
• explain the relationship between b and r
• summarise the problems caused by extrapolation.

12.15 Sample examination questions

Solutions can be found in Appendix B.

1. Seven students in a class received the following examination and project marks in a subject:

Examination mark: 50 80 70 40 30 75 95
Project mark:     75 60 55 40 50 80 65

You want to know if students who excel in examinations in the subject also had relatively high project marks.
(a) Calculate the Spearman rank correlation.
(b) Based on your answer to part (a), do you think students who score well in examinations are also likely to have the highest project marks? Briefly explain your answer.

2. An area manager in a department store wants to study the relationship between the number of workers on duty, x, and the value of merchandise lost to shoplifters, y, in $. To do so, the manager assigned a different number of workers for each of 10 weeks. The results were as follows:
Week:  #1  #2  #3  #4  #5  #6  #7  #8  #9 #10
x:      9  11  12  13  15  18  16  14  12  10
y:    420 350 360 300 225 200 230 280 315 410

The summary statistics for these data are:

Sum of x data: 130
Sum of the squares of x data: 1760
Sum of y data: 3090
Sum of the squares of y data: 1007750
Sum of the products of x and y data: 38305

(a) Draw a scatter diagram of these data on the graph paper provided. Carefully label the diagram.
(b) Calculate the sample correlation coefficient. Interpret its value.
(c) Calculate and report the least squares line of y on x. Draw the line on the scatter diagram.
(d) Based on the regression model above, what will be the predicted loss from shoplifting when there are 17 workers on duty? Would you trust this value? Justify your answer.

Appendix A
Solutions to Learning activities

A.1 Chapter 2 – Mathematical revision – simple algebra and coordinate geometry

1. (a) Rules: Brackets take precedence so this is: (6) × (10) = 60.
(b) Here the ‘of’ and ÷ take precedence and this is: 1/3 of 12 (= 4) minus 4 ÷ 2 (= 2), i.e. 4 minus 2, i.e. +2.
(c) Brackets take precedence so we have: (5/5) × 2, which is 1 × 2, which is 2.
You should find these really simple. If you are in any doubt at all, go over the rules again given in Section 2.5 of the subject guide. Work through these examples again and (even better) go to your old school arithmetic textbook and try some more!

2. (a) 16 is 4 × 4 and −4 × −4, so the square root of 16 is ±4.
(b) (0.07)² = 0.07 × 0.07 = 0.0049. Be careful with the decimal points!
(c) 0.49 is 0.7 × 0.7 and −0.7 × −0.7, so the square root of 0.49 is ±0.7.
Be sure that you understand the rules for placing decimal points. When you work on estimation or tests of hypotheses of proportions you will need this.

3. (a) The answer is (98/100) × 200 which is 98 × 2, or 196.
(b) To get a percentage, multiply the given fraction by 100, i.e. we obtain:
(17/25) × 100 = 17 × 4 = 68%.
So 17/25 can be written as 68%.
(c) 25% of 98/144 is another way of writing (25/100) × (98/144) which is (1/4) × (49/72) or 49/288 which is 0.1701 to four decimal places.

4. (a) This is, simply put, 8.
(b) This is the same as |6|, which is 6.

5. (a) 7 and 9 are greater than 3.
(b) 2 is less than 3.
(c) 2 and 3 are less than or equal to 3.
(d) If x² ≥ 49, then x ≥ 7. This is the case when x is 7 or 9.

6. Substituting in the values, we obtain:
Σ(i=1 to 2) xi³ = x1³ + x2³ = 4³ + 1³ = 64 + 1 = 65.

7. (a) Substituting in the values, we obtain:
Σ(i=1 to 3) xi = x1 + x2 + x3 = 2 + 3 + 5 = 10.
(b) Substituting in the values, we obtain:
(1/n) Σ(i=1 to 4) xi² = (1/4)(x1² + x2² + x3² + x4²) = (1/4)(2² + 3² + 5² + 7²) = (1/4)(4 + 9 + 25 + 49) = 21.75.

8. (a) We have:
Σ(i=1 to 5) xi = x1 + x2 + x3 + x4 + x5 = 3 + 1 + 4 + 6 + 8 = 22.
(b) We have:
Σ(i=3 to 4) xi² = x3² + x4² = 4² + 6² = 16 + 36 = 52.
(c) We have:
Σ(i=1 to 5) pi xi = p1 x1 + p2 x2 + p3 x3 + p4 x4 + p5 x5 = (1/4) × 3 + (1/8) × 1 + (1/8) × 4 + (1/3) × 6 + (1/6) × 8 = 3/4 + 1/8 + 1/2 + 2 + 4/3 = 4 17/24
or 4.7083, if you prefer to work in decimals.
(Note: do not forget the ‘times’ sign takes precedence over the ‘plus’. Check over Question 1 if in any doubt!)
(d) We have:
Σ(i=3 to 5) pi xi² = p3 x3² + p4 x4² + p5 x5² = (1/8) × 4² + (1/3) × 6² + (1/6) × 8² = (1/8) × 16 + (1/3) × 36 + (1/6) × 64 = 2 + 12 + 10 2/3 = 24 2/3
or 24.6667, if you prefer to work in decimals.
Check this carefully if you could not do it yourself first time. Note that i = 3 and i = 5 mark the beginning and end, respectively, of the summation, as did i = 3 and i = 4 in (b).

9. (a) Here the straight line graph has a slope of 1 and the line cuts the y-axis at y = 3 (when x is zero, y is 3).
[Sketch: a line of slope 1 crossing the y-axis at 3 and the x-axis at −3.]
(b) Here the straight line still has a positive slope of 3 (y increases by 3 when x increases by 1) but the line crosses the y-axis at a minus value (−2).
[Sketch: a line of slope 3 crossing the y-axis at −2 and the x-axis at 2/3.]
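The summation answers above can be verified numerically. A quick sketch using the values from Question 8:

```python
# Values from Question 8: observations x_1..x_5 and weights p_1..p_5
x = {1: 3, 2: 1, 3: 4, 4: 6, 5: 8}
p = {1: 1/4, 2: 1/8, 3: 1/8, 4: 1/3, 5: 1/6}

assert sum(x[i] for i in range(1, 6)) == 22                              # 8(a)
assert sum(x[i] ** 2 for i in range(3, 5)) == 52                         # 8(b)
assert abs(sum(p[i] * x[i] for i in range(1, 6)) - 4.7083) < 1e-3        # 8(c)
assert abs(sum(p[i] * x[i] ** 2 for i in range(3, 6)) - 24.6667) < 1e-3  # 8(d)
```

Note how the `range` endpoints mirror the summation limits: `range(3, 5)` runs over i = 3, 4, just as i = 3 and i = 4 mark the limits in 8(b).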
A.2 Chapter 3 – The nature of statistics

1. To properly define a circle we need to specify its location and size. Hence two suitable parameters would be the co-ordinates of the circle’s centre point (defines its location) and its radius (defines its size).

2. In a simple random sample for each member of the population there is a known, equal and non-zero probability of selection, while in a non-simple random sample for each member of the population there is a known and non-zero probability of selection (that is, these probabilities are not equal). Stratified sampling and cluster sampling are two examples of non-simple random sampling. There will be more on this topic in Chapter 10.

A.3 Chapter 4 – Data visualisation and descriptive statistics

1. (a) Imagine there are only three possible parties people can vote for: A, B and C. Hence there will be three possible variables with nA, nB and nC being the numbers voting for each (where nA + nB + nC = 1000), or pA, pB and pC being the proportions voting for each (where pA + pB + pC = 1). Whichever way you look at it, these are discrete variables.
(b) The number of hours students watch television per week could be anything from 0 to 168! If you only measure whole numbers, you would regard these as integers, though with such large numbers the distribution will approximate to a continuous one. It is more likely that you will be given more detailed figures, such as 1.33 hours, for example, or 23.798 hours. In this case, hours are clearly continuous variables.
(c) Only whole numbers are possible which are unlikely to be large, so the variable is discrete.

2. (a) We use a histogram when we want to compare the numbers of observations or frequencies in categories which form part of a continuum, such as numbers in the income groups:
◦ less than £200 per week
◦ £200 and less than £300 per week
◦ £300 and less than £500 per week
and so on. There must be no overlaps between the categories.
The areas under the histogram are important to watch.
(b) A stem-and-leaf diagram is very useful to work out before we draw a histogram. It is a way of organising data and understanding their structure.
You could not use:
• a histogram if you are looking at named groups with no particular relationship with each other
• a stem-and-leaf diagram if you are looking, for example, at proportions in categories.

3. Adding the ten observations and dividing by 10 gives us:
(2 + 2 + 2 + 5 + 5 + 10 + 10 + 10 + 10 + 10)/10 = 66/10 = 6.6.
Using the grouped formula gives us, again:
((2 × 3) + (5 × 2) + (10 × 5))/10 = (6 + 10 + 50)/10 = 6.6.

4. The statement is false. The mean is only greater than the median when the dataset is positively skewed. If the distribution is symmetric, then the mean and median are equal. If the distribution is negatively skewed, then the mean is less than the median.

5. We have:
(1/3) Σ(i=2 to 4) xi = (x2 + x3 + x4)/3 = (4 + 5 + 6)/3 = 15/3 = 5.
The first observation, x1, could be an outlier due to its value, 1, being ‘far’ from the other observed values. Given the small sample size of n = 4, including this potentially extreme observation might give a misleading estimate of the true population mean.

6. (a) Since n = 3, the median is the midpoint of the ordered dataset, which is:
x(1) = 1, x(2) = 2 and x(3) = 4
hence the median is x(2) = 2.
(b) The mean is:
(1/n) Σ(i=1 to n) xi = (4 + 1 + 2)/3 = 7/3.

7. (a) The mode is the value which occurs most frequently. Here the mode is 2.
(b) The mean is:
(1/n) Σ(i=1 to n) xi = (x1 + x2 + x3 + x4 + x5)/5 = (4 + 2 + 2 + 5 + 6)/5 = 19/5 = 3.8.

8. There are 13 observations so the mean is the sum of them (1749 as given) divided by 13. This comes to £134.54. The numbers have been arranged in order of size, so the median is the ((13 + 1)/2)th observation, that is the 7th observation. This is £110.
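Question 3’s two routes to the mean — summing the raw observations versus the grouped-frequency formula — can be checked in a few lines:

```python
# Question 3: raw mean versus grouped-frequency mean
data = [2, 2, 2, 5, 5, 10, 10, 10, 10, 10]
raw_mean = sum(data) / len(data)

freq = {2: 3, 5: 2, 10: 5}   # distinct value: frequency
grouped_mean = sum(v * f for v, f in freq.items()) / sum(freq.values())
# both give 6.6, as in the solution
```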
The mode of the data, as given, is £195, which occurs twice (the other figures occur only once). However, if we round to the nearest £10 (counting the 5s downwards) the prices become (in £s):

50, 60, 60, 80, 100, 110, 110, 110, 150, 190, 190, 240, 260

and the mode is then £110, which occurs three times (£60 and £190 occur twice – less often).

9. Using the first formula, we need to know the mean, which is:

(2 + 4 + 5 + 6 + 6 + 7 + 8 + 11 + 20)/9 = 69/9 = 7.67.

Therefore:

s² = (1/8) Σ_{i=1}^{9} (xi − 7.67)²
   = (1/8) [(2 − 7.67)² + (4 − 7.67)² + (5 − 7.67)² + (6 − 7.67)² + (6 − 7.67)² + (7 − 7.67)² + (8 − 7.67)² + (11 − 7.67)² + (20 − 7.67)²]
   = (1/8)(32.15 + 13.47 + 7.13 + 2.79 + 2.79 + 0.45 + 0.11 + 11.09 + 152.03)
   = 222.01/8
   = 27.75.

You can see this is very hard work, and it is quite remarkable that, having rounded figures to two decimal places before squaring the (xi − x̄) terms, we still get the same answer as that using the short-cut formula. The short-cut formula gives us:

s² = (1/(n − 1)) (Σ_{i=1}^{n} xi² − n x̄²) = (1/8)(751 − 9 × (7.67)²) = 27.75.

10. The statement is false. By definition, the lower quartile separates the bottom 25% of observations in a dataset from the top 75% of observations.

11. The range of the data is 265 − 52 = 213, i.e. a price difference of 213 between the most expensive and cheapest brands. The variance, standard deviation and interquartile range are provided in the table, so check through your working and make sure you can determine these.

12. (a) Stem-and-leaf diagram of daily telephone calls received by an office supplies company:

Stem = 100s | Leaf = 10s & 1s
0 | 07 13 36 48 58 77 78
1 | 21 33 38 64 75 77 83
2 | 19 44 58 91
3 | 12 14 36
4 | 18 55 76
5 | 41

Since n = 25, the median is x((n+1)/2) = x(13) = 177.
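The 25 observations can be reconstructed from the stem-and-leaf diagram above (stem = 100s, leaf = 10s & 1s) and the median checked programmatically; a quick Python sketch:

```python
# Rebuild the data from the stem-and-leaf diagram and check the median.
leaves = {0: [7, 13, 36, 48, 58, 77, 78],
          1: [21, 33, 38, 64, 75, 77, 83],
          2: [19, 44, 58, 91],
          3: [12, 14, 36],
          4: [18, 55, 76],
          5: [41]}

data = sorted(100 * stem + leaf for stem, lvs in leaves.items() for leaf in lvs)

n = len(data)                       # 25 observations
median = data[(n + 1) // 2 - 1]     # x_((n+1)/2) = x_(13)
print(n, median)                    # 25 177
```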
(b) For the quartiles, noting that different quartile calculation methods are acceptable, possible (interpolated) solutions are:

Q1 = x(n/4) = x(6.25) = x(6) + (x(7) − x(6))/4 = 77 + (78 − 77)/4 = 77.25

and:

Q3 = x(3n/4) = x(18.75) = x(18) + 3(x(19) − x(18))/4 = 291 + 3(312 − 291)/4 = 306.75.

(c) The stem-and-leaf diagram clearly shows a positively-skewed distribution, hence we would expect the mean to be greater than the median. (In fact the mean of the data is 210.88.)

(d) As discussed in (c), the data are clearly skewed to the right. The third quartile indicates that about 75% of the time no more than about 300 telephone calls are received. The company's management could use this information when considering staffing levels.

A.4 Chapter 5 – Probability theory

1. We have: S = {1, 2, 3, 4, 5, 6}, E = {3, 4} and F = {4, 5, 6}.

(a) F^c means the elements not in F, which are 1, 2 and 3.

(b) E^c ∩ F^c means the elements not in E and not in F. The elements not in E are 1, 2, 5 and 6, and the elements not in F are 1, 2 and 3. So only 1 and 2 are in neither E nor F.

(c) (E ∪ F)^c means the elements which are neither in E nor in F. The elements in E or F (or both) are 3, 4, 5 and 6. So, once again, the answer is 1 and 2.

(d) E^c ∩ F means the elements not in E and in F. The elements of E^c are 1, 2, 5 and 6 and the elements of F are 4, 5 and 6, so the answer is 5 and 6.

2. The Venn diagram shows E and F as two overlapping circles inside S.

(a) E ∪ F is shown by the total area enclosed by both circles.

(b) E ∩ F is the intersection of the two circles.

(c) E^c is the area completely outside the circle which encloses 3 and 4.

3. (a) Of the total number of equally likely outcomes, 300, there are 30 which are early. Hence the required probability is 30/300 = 0.1.

(b) Again, of the total number of equally likely outcomes, 300, there are 150 from Smith. Hence the required probability is 150/300 = 0.5.

(c) Now, of the 300, there are only 10 which are late and from Jones. Hence the probability is 10/300 = 1/30.

4. We have that P(A) + P(B) + P(C) = 1.
Therefore:

P(C) = 1 − (1/2 + 1/3) = 1/6.

5. Since A and B are independent:

P(A ∩ B) = P(A) P(B) = (1/3) × (1/4) = 1/12.

6. False. If two events are independent, then the occurrence (or non-occurrence) of one event has no impact on whether or not the other event occurs. Mutually exclusive events, by contrast, cannot both occur at the same time. A simple counterexample to show the statement is false would be the events A = 'a coin landing heads up' and B = 'rolling a 6 on a die'. The outcome of the coin toss has no effect on the die outcome, and vice versa, so the events are independent. However, it is possible to get both heads and a 6, so the events are not mutually exclusive.

7. (a) P(X² < 4) = P(X = 1) = 0.3, since X² < 4 is only satisfied when X = 1.

(b) We require the following conditional probability:

P(X > 2 | X is an even number) = P(X = 4)/(P(X = 2) + P(X = 4)) = 0.2/(0.5 + 0.2) = 0.2857.

8. (a) The additive law, for any two events A and B, is:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

For example, consider a Venn diagram in which area x is the part of A outside B, area y is the part of B outside A, and area z is the intersection A ∩ B. We have P(A) = area x + area z, P(B) = area y + area z, P(A ∩ B) = area z and P(A ∪ B) = area x + area y + area z. Therefore:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = (area x + area z) + (area y + area z) − (area z) = area x + area y + area z.

Hence, to compute P(A ∪ B), we need to subtract P(A ∩ B), otherwise that region would be counted twice.

(b) The multiplicative law, for any two independent events A and B, is:

P(A ∩ B) = P(A) P(B).

There is no clear way of representing this on a Venn diagram. However, it is a very important result whenever events are independent, such as when randomly sampling objects with replacement.

9. (a) From the given probabilities we obtain:

P(C ∪ S) = P(C) + P(S) − P(C ∩ S) = 0.05 + 0.04 − 0.01 = 0.08.

(b) Having no defects is the complement of the event of having at least one defect, hence:

P(no defects) = 1 − P(C ∪ S) = 1 − 0.08 = 0.92.

10.
Note that the percentages of good and bad parts total 100.

(a) The probability that a randomly chosen part comes from A is 0.6 (60%), and the probability that one of A's parts is bad is 0.03 (3%), so the probability that a randomly chosen part comes from A and is bad is 0.6 × 0.03 = 0.018.

(b) Add all the outcomes, which gives:

(0.6 × 0.97) + (0.6 × 0.03) + (0.4 × 0.93) + (0.4 × 0.07) = 0.582 + 0.018 + 0.372 + 0.028 = 1.

(c) The probability of receiving a bad part is the probability of receiving a bad part from either A or B, which is:

(0.6 × 0.03) + (0.4 × 0.07) = 0.018 + 0.028 = 0.046.

11. Here we have:

Device   P(not fail)   P(fail)
A        0.9           0.1
B        0.9           0.1
C        0.9           0.1
D        0.9           0.1

The system fails if both A and B fail (or more), or both C and D fail (or more). To work out the probability that the system works properly, first work out the probability that it will fail, P(F); the probability it will work is then P(F^c). The system fails if any of the following occur:

• A, B, C and D all fail: (0.1)⁴ = 0.0001
• exactly three devices fail (ABC, ABD, ACD or BCD): 4 × (0.1)³ × 0.9 = 0.0036
• A and B fail while C and D are okay: 0.1 × 0.1 × 0.9 × 0.9 = 0.0081
• C and D fail while A and B are okay: 0.1 × 0.1 × 0.9 × 0.9 = 0.0081.

So the total probability that the system will fail is 0.0001 + 0.0036 + 0.0081 + 0.0081 = 0.0199, which makes the probability that it will run smoothly 1 − 0.0199 = 0.9801.

12. Let B = 'a beginner student', T = 'a transferring student' and G = 'will graduate'. Hence P(B) = 0.73, P(T) = 0.27, P(G | B) = 0.62 and P(G | T) = 0.78.

(a) We need P(B ∩ G), which can be derived from P(G | B) = P(B ∩ G)/P(B), giving:

P(B ∩ G) = P(G | B) P(B) = 0.62 × 0.73 = 0.4526.

(b) Using the total probability formula:

P(G) = P(G | B) P(B) + P(G | T) P(T) = 0.62 × 0.73 + 0.78 × 0.27 = 0.4526 + 0.2106 = 0.6632.

(c) The required probability is:

P(B ∪ G) = P(B) + P(G) − P(B ∩ G) = 0.73 + 0.6632 − 0.4526 = 0.9406.
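The probabilities in parts (a) to (c) of Question 12 can be checked numerically; a minimal Python sketch:

```python
# Check parts (a)-(c) of Question 12 numerically.
p_B, p_T = 0.73, 0.27                 # beginner / transferring students
p_G_given_B, p_G_given_T = 0.62, 0.78

p_B_and_G = p_G_given_B * p_B                 # (a) multiplicative law
p_G = p_B_and_G + p_G_given_T * p_T           # (b) total probability formula
p_B_or_G = p_B + p_G - p_B_and_G              # (c) additive law

print(round(p_B_and_G, 4), round(p_G, 4), round(p_B_or_G, 4))
# 0.4526 0.6632 0.9406
```

The same arithmetic extends directly to the conditional and complement probabilities in the remaining parts of the question.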
(d) We need to check whether P(G ∩ T) = P(G) P(T). Since:

P(G ∩ T) = P(G | T) P(T) = 0.78 × 0.27 = 0.2106 ≠ P(G) P(T) = 0.6632 × 0.27 = 0.1791

the events are not statistically independent.

(e) This is a conditional probability, given by:

P(T | G) = P(T ∩ G)/P(G) = P(G | T) P(T)/P(G) = 0.2106/0.6632 = 0.3176.

(f) We already have P(G ∩ T) = 0.2106 and P(G ∩ B) = 0.4526. We also require:

P(G^c ∩ T) = P(G^c | T) P(T) = (1 − 0.78) × 0.27 = 0.0594

and:

P(G^c ∩ B) = P(G^c | B) P(B) = (1 − 0.62) × 0.73 = 0.2774.

Two students being chosen at random can be considered independent events, hence the required probability is:

(0.2106)² + (0.4526)² + (0.0594)² + (0.2774)² = 0.3297.

A.5 Chapter 6 – The normal distribution and ideas of sampling

1. Z ∼ N(0, 1), i.e. µ = 0 and σ² = 1.

(a) Looking at Table 4 of the New Cambridge Statistical Tables, we see the area under the curve for which Z ≤ 1 is 0.8413, i.e. P(Z ≤ 1) = 0.8413. (A sketch of the standard normal curve with the area to the left of 1 shaded helps here.)

(b) Since the total area under the curve is 1, the area for which Z ≥ 1 is P(Z ≥ 1) = 1 − 0.8413 = 0.1587.

2. (a) If we are looking for 0.68 as the area under the curve within one standard deviation either side of the mean, we need the area between the mean and x to be 0.68/2 = 0.34, and hence the whole area to the left of x to be 0.34 + 0.5 = 0.84. Looking at Φ(x) = 0.84, we see that x is between 0.99 (where Φ(x) = 0.8389) and 1.00 (where Φ(x) = 0.8413), i.e. x ≈ 1, so the statement is approximately correct.

(b) Similarly, look up Φ(x) = 0.95/2 + 0.5 = 0.475 + 0.5 = 0.975, which gives an x of exactly 1.96.

(c) Similarly, look up Φ(x) = 0.99/2 + 0.5 = 0.495 + 0.5 = 0.995, which gives x between 2.57 and 2.58. Therefore, the third example is the most approximate!

3. Let X ∼ N(20, 4).

(a) Standardising, we have:

P(X > 22) = P((X − 20)/√4 > (22 − 20)/√4) = P(Z > 1) = 1 − Φ(1) = 1 − 0.8413 = 0.1587.

(b) Standardising, we have:

P(14 < X < 16) = P((14 − 20)/√4 < (X − 20)/√4 < (16 − 20)/√4) = P(−3 < Z < −2) = (1 − Φ(2)) − (1 − Φ(3)) = (1 − 0.97725) − (1 − 0.99865) = 0.0214.

4. Let A ∼ N(50, 5²) and B ∼ N(55, 6²). We have:

P(A < 45) = P((A − 50)/5 < (45 − 50)/5) = P(Z < −1)

and:

P(B < 45) = P((B − 55)/6 < (45 − 55)/6) = P(Z < −1.67).

Clearly, P(Z < −1.67) < P(Z < −1) (if you do not see this immediately, shade these regions on a sketch of the standard normal distribution), and hence schools of type A would have a higher proportion of students with marks below 45.

5. Let X ∼ N(3800, (250)²).

(a) Standardising, we have (noting the symmetry of the standard normal distribution about zero):

P(X > 3500) = P((X − 3800)/250 > (3500 − 3800)/250) = P(Z > −1.2) = Φ(1.2) = 0.8849.

Hence, as a percentage, this is 88.49%.

(b) Standardising, we have:

P(X > 4000) = P((X − 3800)/250 > (4000 − 3800)/250) = P(Z > 0.8) = 1 − Φ(0.8) = 1 − 0.7881 = 0.2119.

Hence, as a percentage, this is 21.19%.

(c) Using (a) and (b), we have:

P(3500 < X < 4000) = P(X > 3500) − P(X > 4000) = 0.8849 − 0.2119 = 0.6730.

Therefore, we would expect 700 × 0.6730 ≈ 471 batteries to last between 3,500 and 4,000 hours.

6. (a) The population mean is:

µ = Σ_{i=1}^{N} xi / N = (15 + 14 + 10 + 12 + 20 + 15)/6 = 86/6 = 14.33.

The population standard deviation is:

σ = √((1/N) Σ_{i=1}^{N} (xi − µ)²) = √(57.33/6) = √9.56 = 3.09.

(b) i. Case of two individuals

Taking samples of two elements, the 15 possible samples and their means are:

Sample number   Individuals   Times    x̄i
1               A, B          15, 14   14.5
2               A, C          15, 10   12.5
3               A, D          15, 12   13.5
4               A, E          15, 20   17.5
5               A, F          15, 15   15.0
6               B, C          14, 10   12.0
7               B, D          14, 12   13.0
8               B, E          14, 20   17.0
9               B, F          14, 15   14.5
10              C, D          10, 12   11.0
11              C, E          10, 20   15.0
12              C, F          10, 15   12.5
13              D, E          12, 20   16.0
14              D, F          12, 15   13.5
15              E, F          20, 15   17.5

ii. Case of three individuals

Taking samples of three elements, the 20 possible samples and their means are:

Sample number   Individuals   Times        x̄i
1               A, B, C       15, 14, 10   13.00
2               A, B, D       15, 14, 12   13.67
3               A, B, E       15, 14, 20   16.33
4               A, B, F       15, 14, 15   14.67
5               A, C, D       15, 10, 12   12.33
6               A, C, E       15, 10, 20   15.00
7               A, C, F       15, 10, 15   13.33
8               A, D, E       15, 12, 20   15.67
9               A, D, F       15, 12, 15   14.00
10              A, E, F       15, 20, 15   16.67
11              B, C, D       14, 10, 12   12.00
12              B, C, E       14, 10, 20   14.67
13              B, C, F       14, 10, 15   13.00
14              B, D, E       14, 12, 20   15.33
15              B, D, F       14, 12, 15   13.67
16              B, E, F       14, 20, 15   16.33
17              C, D, E       10, 12, 20   14.00
18              C, D, F       10, 12, 15   12.33
19              C, E, F       10, 20, 15   15.00
20              D, E, F       12, 20, 15   15.67

iii. Case of four individuals

Taking samples of four elements, the 15 possible samples and their means are:

Sample number   Individuals   Times            x̄i
1               C, D, E, F    10, 12, 20, 15   14.25
2               B, D, E, F    14, 12, 20, 15   15.25
3               B, C, E, F    14, 10, 20, 15   14.75
4               B, C, D, F    14, 10, 12, 15   12.75
5               B, C, D, E    14, 10, 12, 20   14.00
6               A, D, E, F    15, 12, 20, 15   15.50
7               A, C, E, F    15, 10, 20, 15   15.00
8               A, C, D, F    15, 10, 12, 15   13.00
9               A, C, D, E    15, 10, 12, 20   14.25
10              A, B, E, F    15, 14, 20, 15   16.00
11              A, B, D, F    15, 14, 12, 15   14.00
12              A, B, D, E    15, 14, 12, 20   15.25
13              A, B, C, F    15, 14, 10, 15   13.50
14              A, B, C, E    15, 14, 10, 20   14.75
15              A, B, C, D    15, 14, 10, 12   12.75

(c) For the case of two individuals, the mean of all the sample means is:

Σ_{i=1}^{15} x̄i / 15 = 215/15 = 14.33.

For the case of three individuals, the mean of all the sample means is:

Σ_{i=1}^{20} x̄i / 20 = 286.67/20 = 14.33.

For the case of four individuals, the mean of all the sample means is:

Σ_{i=1}^{15} x̄i / 15 = 215/15 = 14.33.

The standard deviations of X̄ are 1.9551, 1.3824 and 0.9776 for samples of size 2, 3 and 4, respectively. This confirms that the accuracy of our population mean estimator improves as we increase our sample size, because we increase the amount of information about the population in the sample.

A.6 Chapter 7 – Point and interval estimation

1. False. When calculated from the same dataset, the only difference is the change in confidence level. There is a trade-off whereby higher confidence (which is good, other things equal) leads to a larger multiplier coefficient (a z-value or t-value), hence a greater margin of error and a wider confidence interval (which is bad, other things equal). Therefore a 95% confidence interval would be wider, since 95% confidence is greater than 90% confidence.

2. We are given a population variance of σ² = (1.2)², a sample mean of x̄ = 25, and a sample size of n = 100. A 95% confidence interval for the mean is:

x̄ ± 1.96 × σ/√n ⇒ 25 ± 1.96 × 1.2/10 ⇒ (24.7648, 25.2352).

3. We were given x̄ = 0.824 and n = 200, but this time we are told that the standard deviation is derived from the sample, i.e. s = 0.042. We should use the t distribution but, looking at Table 10 of the New Cambridge Statistical Tables, the largest n we see is 121 (i.e. the degrees of freedom are 120), for which the appropriate 95% value is 1.98 (compared with 1.96 for an infinite-sized sample) and the 99% value is 2.617 (compared with 2.576). As 200 is beyond the values given in the tables, in practice we use the values given on the bottom line, which gives the same answers as those given in Example 7.1.

4. Do this exactly as stated. Proceed carefully.
Look at Table 10 of the New Cambridge Statistical Tables as suggested!

5. (a) The true standard error of the sample proportion, P, is:

S.E.(P) = √(π(1 − π)/n).

If π, the population proportion, is unknown, then we use the estimated standard error in inference calculations, i.e. we use:

E.S.E.(P) = √(p(1 − p)/n).

For two samples drawn from the same population (again with population proportion π), the pooled sample proportion is:

(n1 p1 + n2 p2)/(n1 + n2)

where the sample sizes n1 and n2 give rise to sample proportions p1 and p2, respectively.

(b) The sample proportions are pA = 220/300 = 0.733 and pB = 200/250 = 0.8. For a 95% confidence interval, we use the z-value of 1.96. A 95% confidence interval is:

0.8 − 0.733 ± 1.96 × √(0.733 × (1 − 0.733)/300 + 0.8 × (1 − 0.8)/250) ⇒ (−0.0035, 0.1375).

The confidence interval (just) contains zero, suggesting no evidence of a difference in the true population proportions. Therefore, there is no reason to believe that one company gives greater satisfaction than the other. However, given the close proximity of the lower endpoint to zero, this is a marginal case.

6. (a) We are given n = 200 and p = 30/200 = 0.15, so the estimated standard error is:

E.S.E.(p) = √(0.15 × 0.85/200) = √0.0006375 = 0.0252.

A 95% confidence interval is:

0.15 ± 1.96 × 0.0252 ⇒ (0.1006, 0.1994).

Expressed as percentages, we have a 95% confidence interval of between 10.06% and 19.94%, to two decimal places.

(b) In the second institution, n = 100 and p = 18/100 = 0.18, so the estimated standard error is:

E.S.E.(p) = √(0.18 × 0.82/100) = √0.001476 = 0.0384

which is larger than that in the first sample, owing to the relatively smaller sample size.

This gives us a 95% confidence interval of:

0.18 ± 1.96 × 0.0384 ⇒ (0.1047, 0.2553).

Expressed as percentages, we have a 95% confidence interval of between 10.47% and 25.53%, to two decimal places.
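The two intervals for Question 6 can be reproduced directly from the raw counts; a short Python sketch (keeping full precision in the standard error shifts the first interval's endpoints by about 0.0001 relative to the rounded working above):

```python
# 95% confidence interval for a population proportion,
# using the estimated standard error sqrt(p(1-p)/n) and z = 1.96.
from math import sqrt

def prop_ci(successes, n, z=1.96):
    p = successes / n
    ese = sqrt(p * (1 - p) / n)   # estimated standard error
    return p - z * ese, p + z * ese

lo1, hi1 = prop_ci(30, 200)   # first institution: approx. (0.1005, 0.1995)
lo2, hi2 = prop_ci(18, 100)   # second institution: approx. (0.1047, 0.2553)
print((round(lo1, 4), round(hi1, 4)), (round(lo2, 4), round(hi2, 4)))
```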
(c) The estimated standard error of the difference between the two sample proportions is:

E.S.E.(pA − pB) = √(0.15 × 0.85/200 + 0.18 × 0.82/100) = √(0.0006375 + 0.001476) = 0.0460

which gives a 95% confidence interval of:

0.18 − 0.15 ± 1.96 × 0.0460 ⇒ (−0.0602, 0.1202)

or between −6.02% and 12.02% – not a very useful confidence interval! Such a wide interval arises because of the relatively small sample sizes, which give a large standard error. Since the interval contains zero, despite the different sample proportions we cannot conclude that the proportions of students enjoying Statistics in these two institutions really differ.

A.7 Chapter 8 – Hypothesis testing

1. Look at the wording in each of (a) to (c) carefully. You can get clues as to whether you are dealing with a one-tailed test (where H1 will use < or >) or a two-tailed test (where H1 will involve ≠) according to the use of words like 'increase', 'higher', 'greater' and 'diminished', which all imply a one-tailed test, or the use of words like 'equal', 'changed' and 'different from', which all imply a two-tailed test.

(a) Here you want to know whether the mean income in a particular area (µU) is less than the general population mean (µ = 10000). So you need a one-tailed test, where:

H0 : µU = 10000 (i.e. µU is equal to µ)
H1 : µU < 10000 (i.e. µU is less than µ).

(b) Here we are comparing the mean weights in two schools, A and B – let us say the sample means are x̄A and x̄B, with population means µA and µB, respectively. We want to know if they are different, so we need a two-tailed test:

H0 : µA = µB (i.e. the means are the same)
H1 : µA ≠ µB (i.e. the means are different).

(c) We look at the mean reading scores for children before, x̄B, and after, x̄A, a teaching programme, with population means µB and µA, respectively:

H0 : µA = µB (i.e. µB is no different from µA)
H1 : µA > µB (i.e. the mean score after the programme is greater than before).

2.
Your completed chart should look like this:

                  Result from your test
Real situation    Not reject H0                    Reject H0
H0 true           Correct decision.                Type I error.
                  Probability 1 − α, called the    Probability α, called the
                  confidence level of the test     significance level of the test
H1 true           Type II error.                   Correct decision.
                  Probability β                    Probability 1 − β, called the
                                                   power of the test

3. False. The power of a test is the probability of rejecting the null hypothesis when the alternative hypothesis is true. Although this outcome results in the correct hypothesis being chosen, the statement is false because it implies that not rejecting a true null hypothesis (also a correct decision) constitutes the power of a test, which is incorrect.

4. (a) A test which is significant at the 1% significance level is highly significant. This indicates that it is very unlikely that we would have obtained the sample we did if the null hypothesis were actually true. Therefore, we would be very confident in rejecting the null hypothesis.

(b) A test which is significant at the 10% significance level, but not at the 5% significance level, is weakly significant. This indicates mild (at best) support for rejecting the null hypothesis. The test outcome is less 'conclusive' than in (a).

5. Here the manufacturer is claiming that 90% of the population will be relieved over 8 hours, i.e. π = 0.9. There is a random sample of n = 200, of whom 160 gained relief, giving a sample proportion of p = 160/200 = 0.8. Given the manufacturer's claim, we would be concerned only if we found significant evidence that fewer than 90% were relieved (more than 90% would not be a problem!). So we have a one-tailed test with:

H0 : π = 0.9 vs. H1 : π < 0.9.

We use the population value π to work out the standard error, and so the test statistic value is:

z = (0.8 − 0.9)/√((0.9 × 0.1)/200) = −0.1/0.0212 = −4.717.

This goes beyond Table 5 of the New Cambridge Statistical Tables (the most extreme value tabulated is −4.4172), so this result is highly significant – the p-value is nearly zero. Alternatively, you could look at Table 10 of the New Cambridge Statistical Tables (which provides percentage points of the t distribution) and take the bottom line (where ν = ∞), since t∞ = N(0, 1).

Note that the 5% critical value is −1.645 for a one-tailed test, and the 1% and 0.1% critical values are −2.326 and −3.090, respectively. This confirms that the result is highly significant. So we reject H0: the manufacturer's claim is not legitimate. Based on the observed sample, we have (very) strong evidence that the proportion of the population given relief is significantly less than the 90% claimed.

6. We would test H0 : π1 = π2 vs. H1 : π1 ≠ π2. Under H0, the test statistic would be:

(P1 − P2)/√(P(1 − P)(1/16 + 1/24)) ∼ N(0, 1)

where P is the pooled proportion estimator:

P = (n1 P1 + n2 P2)/(n1 + n2) = (16 P1 + 24 P2)/40.

Since the test statistic follows (approximately) the standard normal distribution, there are no degrees of freedom to consider. For a two-sided test, the 5% critical values would be ±1.96.

7. There is no a priori reason, i.e. before obtaining the sample data, to expect one particular holiday company to have the greater proportion of satisfied customers in the event the companies did differ, hence we test H0 : πA = πB vs. H1 : πA ≠ πB. The test statistic value is:

(pB − pA)/√(p(1 − p)(1/n1 + 1/n2)) = (0.8 − 0.733)/√(0.764 × (1 − 0.764) × (1/300 + 1/250)) = 1.843

since the pooled sample proportion is p = (220 + 200)/(300 + 250) = 0.764. At the 5% significance level, the critical values are ±1.96. Therefore, since 1.843 < 1.96, we do not reject the null hypothesis and conclude that there is no evidence of a difference in the proportions of customer satisfaction between the two companies.
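The two-proportion z-test just carried out can be reproduced in a few lines of Python, with a two-sided p-value attached via the standard normal CDF Φ(x) = (1 + erf(x/√2))/2. Avoiding the intermediate rounding used in the worked solution gives z ≈ 1.83 rather than 1.843, with the same conclusion:

```python
# Two-proportion z-test for Question 7 (pooled standard error).
from math import erf, sqrt

nA, xA = 300, 220        # company A: sample size, satisfied customers
nB, xB = 250, 200        # company B
pA, pB = xA / nA, xB / nB
p = (xA + xB) / (nA + nB)                      # pooled sample proportion

z = (pB - pA) / sqrt(p * (1 - p) * (1 / nA + 1 / nB))
phi = lambda x: (1 + erf(x / sqrt(2))) / 2     # standard normal CDF
p_value = 2 * (1 - phi(abs(z)))                # two-sided p-value

print(round(z, 3), round(p_value, 3))          # z about 1.83, p-value about 0.067
```

Since the p-value exceeds 0.05, we do not reject H0 at the 5% significance level, agreeing with the critical-value comparison above.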
Note that the link between a 95% confidence interval and hypothesis testing holds when considering a two-sided test at the 5% significance level. A two-sided test was conducted here, with critical values ±1.96, and we failed to reject the null hypothesis; this is consistent with zero being included in the corresponding 95% confidence interval calculated earlier.

A.8 Chapter 9 – Contingency tables and the chi-squared test

1. (a) The null and alternative hypotheses are:

H0 : There is no association between the kind of accommodation people have and their level of satisfaction with it.
H1 : There is an association between the kind of accommodation people have and their level of satisfaction with it.

(b) The degrees of freedom are (number of rows − 1) × (number of columns − 1). Let us take the levels of satisfaction to be rows and the housing types to be columns. Your table would look like this:

                                          Housing type
Level of           Public      Public   Private     Private          Private semi-
satisfaction       apartment   house    apartment   detached house   detached house   Misc.
High
Medium
Low
Very dissatisfied

There are 4 rows and 6 columns, and therefore (4 − 1) × (6 − 1) = 3 × 5 = 15 degrees of freedom.

(c) Now look up the 5% and 1% critical values for a χ² distribution with 15 degrees of freedom, using Table 8 of the New Cambridge Statistical Tables. The 5% critical value is 25.00 and the 1% critical value is 30.58.

2. Look at Table 8 of the New Cambridge Statistical Tables again, which shows a sketch of a χ² curve. Copy it and, for 6 degrees of freedom, mark the 5% and 1% critical values on it. Note that the rejection region is in the right tail of the distribution, so you reject the appropriate null hypothesis at the required significance level if your calculated test statistic value is greater than these critical values.

3. We test:

H0 : No association between advertising type and level of sales.
H1 : Association between advertising type and level of sales.
The degrees of freedom are (r − 1)(c − 1) = (3 − 1)(4 − 1) = 6, and the test statistic value is 13.5. Using a 5% significance level, the critical value is 12.59. Since 13.5 > 12.59, we reject the null hypothesis and conclude that there is an association between advertising type and level of sales. Moving to the 1% significance level, the critical value is 16.81; since 13.5 < 16.81 we do not reject the null hypothesis at this level, and conclude that the test is moderately significant.

4. (a) We test:

H0 : There is no association between proportion of skilled workers and area.
H1 : There is an association between proportion of skilled workers and area.

We get a contingency table which should look something like this:

Area                  Skilled workers   Unskilled workers   Total
A     O               80                184                 264
      E               72.41             191.59
      (O − E)²/E      0.80              0.30
B     O               58                147                 205
      E               56.22             148.78
      (O − E)²/E      0.06              0.02
C     O               114               276                 390
      E               106.96            283.04
      (O − E)²/E      0.46              0.18
D     O               55                196                 251
      E               68.84             182.16
      (O − E)²/E      2.78              1.05
E     O               83                229                 312
      E               85.57             226.43
      (O − E)²/E      0.08              0.03
Total                 390               1032                1422

The χ² test statistic value is Σ (O − E)²/E = 5.76. The number of degrees of freedom is (5 − 1)(2 − 1) = 4, so we compare the test statistic value with the χ²₄ distribution, using Table 8 of the New Cambridge Statistical Tables. The test is not significant at the 5% significance level (the critical value is 9.488). It is also not significant at the 10% significance level (the critical value is 7.779), so we have found no evidence of an association. That is, the proportion of skilled workers does not appear to vary with area.

(b) We have a one-tailed z test: H0 : πD = πothers vs. H1 : πD < πothers. We have pD = 55/251 = 0.2191, pothers = 335/1171 = 0.2861, and a pooled sample proportion of (55 + 335)/(251 + 1171) = 0.2742. So the test statistic value is:

z = (0.2861 − 0.2191)/√(0.2742 × (1 − 0.2742) × (1/251 + 1/1171)) = 0.0670/0.0310 = 2.1613.

The 5% critical value is 1.645 and the 1% critical value is 2.326.
So our calculated test statistic value of 2.1613 lies between the two, meaning we reject the null hypothesis at the 5% significance level only. We can tell the management team that there is moderate evidence of a lower proportion of skilled workers in area D compared with the others. However, the result is not significant at the 1% significance level, so it would be sensible to check whether there are other factors of interest to them before deciding against the area.

5. (a) We test:

H0 : There is no association between verbal reasoning ability and sex.
H1 : There is an association between verbal reasoning ability and sex.

The complete table, with expected values for each cell in brackets, looks like this:

        Male        Female      Totals
High    50 (56)     90 (84)     140
Low     150 (144)   210 (216)   360
Totals  200         300         500

Looking at this table, we can see that there are fewer men in the higher reasoning category than we would expect from the overall figures, and slightly more women. However, are the differences statistically significant? We use the formula:

Σ_{i,j} (Oij − Eij)²/Eij

and get a test statistic value of:

χ² = (50 − 56)²/56 + (150 − 144)²/144 + (90 − 84)²/84 + (210 − 216)²/216 = 0.6429 + 0.2500 + 0.4286 + 0.1667 = 1.488.

Looking at Table 8 of the New Cambridge Statistical Tables for 1 degree of freedom (since (r − 1) × (c − 1) = (2 − 1) × (2 − 1) = 1), the 5% critical value is 3.841. Since 1.488 < 3.841 we do not reject H0. In fact, 1.488 is not significant even at the 20% significance level, so we cannot reject the null hypothesis, and we conclude that there is no evidence from these figures of an association between sex and verbal reasoning ability.

(b) So what about testing the difference between proportions? Will this show anything different? Well, it should not, otherwise we would have a problem! Try it anyway. Let us look at the proportion with the high verbal reasoning score. Here the test is two-tailed.
Is there a difference between the sexes? We test:

H0 : πF = πM vs. H1 : πF ≠ πM.

We have pM = 50/200 = 0.25, pF = 90/300 = 0.30 and a pooled sample proportion of p = (50 + 90)/(200 + 300) = 0.28. The test statistic value is:

z = (0.30 − 0.25)/√(0.28 × (1 − 0.28) × (1/200 + 1/300)) = 0.05/√0.00168 = 0.05/0.0410 = 1.220.

Using Table 5 of the New Cambridge Statistical Tables, we see that 1.220 is not significant at the 5% significance level (1.220 < 1.96), nor at the 10% significance level (1.220 < 1.645). So we do not reject H0: we have insufficient evidence to say that there is a difference in the proportions of men and women with a high level of verbal reasoning.

Notice that the null hypothesis used in (b) (that there is no difference in the proportions with a high level of reasoning ability) leads to the same conclusion as that in (a) (that there is no association between levels of verbal reasoning and sex). This is no accident: for a 2 × 2 table the two tests are equivalent, since the square of the two-sided z statistic equals the chi-squared statistic (here 1.220² ≈ 1.488).

6. We test:

H0 : No association between level of exercise and type of environment.
H1 : Association between level of exercise and type of environment.

Determining row and column totals, we obtain the expected frequencies shown in the following table, along with the calculated χ² contribution for each cell.

                          Level of exercise
Environment               High      Low       Total
City    Observed          221       256       477
        Expected          263.56    213.44
        (O − E)²/E        6.87      8.49
Town    Observed          230       118       348
        Expected          192.28    155.72
        (O − E)²/E        7.40      9.14
Rural   Observed          159       120       279
        Expected          154.16    124.84
        (O − E)²/E        0.15      0.19
Total                     610       494       1104

The test statistic value is:

6.87 + 8.49 + 7.40 + 9.14 + 0.15 + 0.19 = 32.24.
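The expected frequencies and the χ² statistic for this table can be reproduced in a few lines of Python, computing each expected cell as (row total × column total)/n:

```python
# Chi-squared statistic for the exercise-by-environment table.
observed = [[221, 256],   # City:  high, low exercise
            [230, 118],   # Town
            [159, 120]]   # Rural

row_totals = [sum(row) for row in observed]          # 477, 348, 279
col_totals = [sum(col) for col in zip(*observed)]    # 610, 494
n = sum(row_totals)                                  # 1104

chi_sq = sum((observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
             / (row_totals[i] * col_totals[j] / n)
             for i in range(3) for j in range(2))
print(round(chi_sq, 2))   # 32.24
```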
There are (r − 1)(c − 1) = (3 − 1)(2 − 1) = 2 degrees of freedom, and the 5% critical value is 5.991. Since 32.24 > 5.991, we reject H0. Testing at the 1% significance level, the critical value is 9.210; again we reject H0 and conclude that the test is highly significant. There is strong evidence of an association between level of exercise and type of environment. Looking at the large contributors to the test statistic value, we see that those living in cities tend not to exercise much, while those in towns tend to exercise more.

7. (a) We examine the percentages. Doing so, we see that the proportion of teenagers who do not smoke is roughly equal (around 90%) regardless of whether or not there is adult cigarette use in the household. Eyeballing the figures, there is no clear association between teenage and adult cigarette use, although there is a large sample size of 4,000, so small differences in percentages may be statistically significant when formally tested. However, any association would be subtle, given the similar column percentages across rows. Noting the χ² value of 4.61, we would reject the null hypothesis of no association at the 5% significance level, because the critical value with (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1 degree of freedom is 3.841, but fail to reject the null hypothesis at the 1% significance level (with a critical value of 6.635). Hence the association is significant, but not highly significant.

(b) We test:

H0 : No association between adult and teenage cigarette use.
H1 : Association between adult and teenage cigarette use.

The expected frequencies are shown in the following table, along with the calculated χ² contribution for each cell.

                            Adult cigarette use
Teenage cigarette use       Yes       No        Total
Yes    Observed             48        45        93
       Expected             55.06     37.94
       (O − E)²/E           0.91      1.31
No     Observed             544       363       907
       Expected             536.94    370.06
       (O − E)²/E           0.09      0.13
Total                       592       408       1000

The test statistic value is:

0.91 + 1.31 + 0.09 + 0.13 = 2.44.
There is (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1 degree of freedom, and the 5% critical value is 3.841. Since 2.44 < 3.841, we do not reject H0. Testing at the 10% level, the critical value is 2.706. Again, we do not reject H0 and conclude that the test is insignificant. There is no evidence of an association between adult and teenage cigarette use.

(c) At best, there is only a slight association between teenage and adult cigarette use evident in the first survey table. Teenage cigarette use is marginally higher in households without adult cigarette use (10.4%) than households with adult cigarette use (8.4%). We have seen that a test for association was significant for the first survey table, but insignificant for the second survey table. Therefore, it seems that the association has disappeared over the two years. Note the different sample sizes used – although still a ‘large’ sample size of 1,000 households in the second survey, it is harder to detect association than in the first survey with 4,000 households. So despite there being very similar percentages when comparing across the two tables, the smaller sample size in the second survey has caused the test of association to become insignificant.

8. We are told that we are supplied with the observed frequencies, i.e. the number of births in a country for each of the four seasons. To test whether births are evenly distributed between the four seasons, we would need to calculate the expected number of births for each season. We would need to know the exact number of days in each season to be able to calculate the expected frequencies, and hence perform a goodness-of-fit test of an even distribution. Remember the seasons are of slightly different lengths reflecting the variable number of days in each month. We would then test:

H0 : There is a uniform distribution of live births throughout the year.
H1 : There is a non-uniform distribution of live births throughout the year.
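Once the number of days in each season is known, the mechanics of the test are straightforward. The sketch below is purely illustrative: the day counts assume a non-leap year with winter running December–February, and the observed birth counts are invented, since the exercise supplies no data.

```python
# Goodness-of-fit test with expected frequencies proportional to
# season length. All numbers below are illustrative assumptions.
days = [90, 92, 92, 91]                   # winter, spring, summer, autumn
observed = [24000, 25500, 25000, 24500]   # hypothetical birth counts

total = sum(observed)
expected = [total * d / sum(days) for d in days]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

reject_at_5pct = statistic > 7.815        # chi-squared critical value, 3 df
```

The expected frequencies sum to the total number of observed births by construction, which is a useful sanity check on any hand calculation.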
Denoting the observed and expected frequencies for season i as Oi and Ei, respectively, the test statistic is:

Σᵢ (Oi − Ei)²/Ei ∼ χ₃²,  summing over the four seasons, i = 1, 2, 3, 4.

At a 5% significance level, the critical value would be 7.815 and so we would reject the null hypothesis if the test statistic value exceeded 7.815. We would then test at a suitable second significance level to determine the strength of the test result.

A.9 Chapter 10 – Sampling design

1. Generally available lists include registers of electors, postal addresses used by the postal services, and lists of schools and companies. Whether they are up-to-date or not depends on the purpose for which the list was intended. For example, looking at the kinds of lists mentioned above you would expect there to be some government regulation about registering electors – every year, or perhaps before each national election. What is the policy in your country for the register of electors? For postal addresses, the post office will register new buildings by postcode on a continuous basis. For lists of schools or companies you need to check who would be responsible for them and why and when they would be updated. Is there a subscription to pay annually, for example, in order to be a member of a school or company association, or is there a government regulation which means that schools or companies must be registered?

The question of target group also affects the usefulness of your sampling frame. If you want to know about foreign nationals resident in your country, then the register of electors is hardly appropriate, nor will it help you if you are interested primarily in contacting people under voting age. The list of new buildings will help you get in touch with people living in particular areas so you may find you stratify by the socio-economic characteristics of the place rather than the people.
Schools and companies are fine as a contact for people for many purposes, but the fact that they have been approached through a particular organisation may affect responses to particular questions. Would you tell interviewers who contacted you in school time and had got hold of you through your school that you hate school, or even about what you do after school? Would you reply to a postal questionnaire sent via your company if you were asked how often you took unnecessary sick leave?

The more you think about all this, the more difficult it can seem! So think of three lists and then list the ways you could find them useful as a basis for sampling and the problems you might have in particular circumstances.

2. I am afraid I really cannot help you here! You must do this yourself. Surveys in your home country should tell you details of the:

• sampling frame
• respondents
• coverage of the questionnaires
• response rates
• general subjects covered.

You could also get hold of one of these through your library and see the careful explanation of how the survey works and the purpose(s) for which it is used.

3. (a) If the population is very large, then this increases the chances of making non-sampling errors which reduces the quality of the collected data. For example, large amounts of data collected may result in more data-entry errors leading to inaccurate data and hence inaccurate conclusions drawn from any statistical analysis of the data.

(b) An adequate sampling frame of single parent families is unlikely to be available for use. This would make random sampling either impossible, or very time-consuming if a sampling frame had to be created.

(c) Those who would potentially undertake post-compulsory education in the future will be ‘young’ today. Data protection laws may prevent us from accessing lists of school-age children.
In addition, young children may not realise or appreciate the merits of post-compulsory education so their current opinions on the subject may differ from their future decision about whether to continue into further education and higher education.

4. (a) Quota sampling may be viewed as two-stage restricted judgemental sampling. The first stage consists of developing control categories, or quota controls, of population elements. In the second stage, sample elements are selected based on convenience or judgement until the set quotas are met. An example might be using gender as our control characteristic. Suppose 48% of the population is male, with 52% being female, then we would want the sample composition to reflect this.

(b) Quota sampling is a form of non-probability (non-random) sampling which is appropriate in the absence of a sampling frame. An advantage over other forms of non-probability sampling is that we can control for certain characteristics. Alternative forms of non-probability sampling include convenience sampling and judgemental sampling.

(c) Quota sampling suffers from selection bias and hence there is no guarantee of selecting a sample which is representative of the population. Despite controlling for population characteristics, ultimately it is up to the interviewer to select people, which inevitably results in selection bias. Refining the different levels of the control characteristics and training interviewers may reduce, if not eliminate, such bias. Also, non-response bias may result, although this is not unique to quota sampling. Suitable incentives, callbacks etc. may help to improve response rates.

5. The two quota controls which are relatively easy for your interviewer to identify are gender and age group. They are also useful controls to use as they help you gain a representative picture of shopping patterns.
We would expect people to buy different things according to whether they are female (particular perfumes, for example) or male (special car accessories) or by age group (out-of-season holidays for older people, or pop concert tickets for younger people) to give trivial examples. If we are wrong, and older people are into pop concerts or lots of females want to make their cars sportier, then we would find this out if we have made sure we have sufficient numbers in each of the controls.

The question of a third control is more tricky. We would like to know about people’s preferences if they spend more or less money when they shop, or according to their income group, or depending on how many people they are shopping for (just themselves, or their family and their elderly neighbours). Of course, the interviewer could ask people how much they spent on shopping last week, or what their family income is, or how many people they are shopping for. People might reply to the last of these questions, but may well be unhappy about the other two – so you get a lot of refusals – and lose your sample! Even if they do reply, the interviewer will then have to discard some of the interviews they have started. If everyone interviewed so far is shopping for themselves, for example, and the quota has been filled, the interviewer will have to ignore the person they have stopped and go to look for someone else to ask whether they have shopped for others!

If the aim is to interview people who have bought a lot of things at that time, then the interviewer could wait for people to leave the store concerned, and stop people who have bought a lot that day, a medium amount, or a small amount, judging by how full their shopping trolley is! Or do the same if people are accompanied by children or not on their shopping expedition.
An alternative is to interview at more than one shopping area and deliberately go to some shops in rich areas and some in poor areas, knowing that most of the people you interview will be in an income category which matches their surroundings.

6. (a) Useful strata here might be:

• the site – people might have different travel arrangements depending on where they work
• their work type – they may have different hours of work (which would affect their travel possibilities), they may have different wage levels (which might affect how they travel – do they need car parks?) and what they can afford (would a season ticket loan help?).

You might also like to consider gender – safety issues, secure parking etc.

(b) These strata will help you to get good overall estimates (because you reduce the overall standard error by using relevant stratification). They will also enable you to look at particular groups and assess their needs (lower paid women, people on a site with no good public transport, shift workers etc.).

7. Here an obvious cluster will be the school (some textbooks may indeed refer to it as the primary sampling unit). This is because a main influence on how children learn will be the school they are at – you could give yourself an enormous amount of unnecessary extra work and expense if you choose a sample of 100 children, say, from 100 different schools – you would in fact have to find out about each of their schools individually. So it makes sense to sample clusters of schools themselves. For a similar reason, it would be a good idea to cluster children according to the study group (class) to which they belong. The only problem with this is that, if you have too few clusters, it may be difficult to separate the effect of outside influences (the effect of a particular teacher, for example) as opposed to the actual teaching methods used.

8.
There are no absolute right answers to these – you may disagree with the suggestions, but if you do make sure you can explain why you think what you do! In addition, the explanation must use the kinds of arguments given in this chapter.

(a) This slightly depends on the age of the children concerned. The younger the child, the more difficult it will be to elicit a clear written reply. Telephone is probably not an option (annoying to parents). On the whole, interviews at school make most sense – though we hope the children’s replies would not be too influenced by their environment.

(b) The items are non-essential – we need clear records – probably a diary kept at home (if people will agree to do this) would be best. Random digit dialling might work – but you would need to catch people while they still remembered their shopping. Interviewing outside the store is not an option – we have been told this is a random survey so we will need a list of addresses rather than a quota approach to people as they shop.

(c) A quota sample is specified, so a face-to-face interview contact method would be appropriate. Suitable quota controls need to be provided to the interviewer to carry out the exercise in, say, a shopping centre.

(d) Email might be a good idea – quick and easy – but, of course, if their computer facilities are really poor, we will only get replies from employees with state-of-the-art facilities! If the company is supporting this, you could use any method – an interview might obtain detailed and useful information. If we only want to know the proportions of employees who have experienced particular problems with their computers, a postal survey would work – and the response rate should be good as they are doing it for their employer!

(e) Here time is important – mail will be too slow. Even a random face-to-face survey might take too long – you need to catch the interviewee at home. So the choice is the following.
• If random, use random digit dialling and telephone people at home the day following the TV programmes – everyone will remember them and you can make your questions short and to the point!
• Alternatively, you may prefer to use a quota survey with the possibility of showing pictures to illustrate your questions and perhaps ask more detailed questions. You can also do this straight after the TV programmes are broadcast.

9. (a) The main points you should make are the following.

• Deciding on your target group. Be careful – is there something particular about them (i.e. children who will find writing difficult, employees who might not feel comfortable to be interviewed about work at their workplace) which may affect your next stage?
• The sampling frame – it should be complete (and appropriate to the target), up-to-date, not have duplicates, and be centrally and cheaply available.
• Sampling design – your stratification and/or clustering factors – you want to maximise accuracy for cost with respect to the variables in which you are interested.
• Contact method – will you interview face-to-face, use telephone, postal survey, or email, or what? This may be limited by your sampling frame. You cannot ring people up, for example, if you only have their addresses.
• Analysis – everything you have studied in the other chapters!
• Errors can happen at each stage – your sampling frame may not adequately reflect your target, the clusters in your design may cause more variability than you expected, people may refuse to answer, not be at home for interviews, lose your mail questionnaire, misunderstand your questions or lie, hence your analysis may be faulty.
(b) You might carry out a quota survey because:

• you have no convenient sampling frame
• you are in a hurry
• you are not too concerned at this point about measuring accuracy – you just want to get a rough idea about the problem you are studying
• money is a little tight and a random sample would cost too much.

(c) Hint – you should look at all stages of a survey. How many of them involve interviewers making complicated surveys? How important is that? Is it a waste of money? However, the kind of information – think about it – you can get is really impossible to get without a trained interviewer. Go into this argument in depth and think of examples.

(d) The following circumstances are worth thinking about.

• Am I in a hurry? Do I need a random sample (i.e. if quota will not do on this occasion)?
• Will I be able to ask questions without needing to use visual aids?

(e) You need to think about what might affect interest in this new technology so you can segment the possible market. The information you have may be limited.

• You could look at the figures for the amount of money spent on electrical goods by sales area, and stratify by area – high, medium and low spend. This would mean that you would be likely to be able to contact people with different degrees of interest in your product. This would form a good basis for a random design.
• If you are doing this study for a company which had its customer sales figures, you could stratify by high, medium and low spend on appropriate items and sample from your customer base.
• If you are intending to make a quota sample, you could still stratify the areas or shops where you send your interviewers by using information about the area or customer database (if you have it) and taking interview points which represent high, medium and low likely expenditure on appropriate products.

10. Arguments could be made for and against any of the suggested contact methods.
However, on balance, a postal (mail) questionnaire would probably be the most suitable. Respondents could then return the survey form at their own convenience and would not be rushed to respond as would be the case with an interview or telephone survey. Also, this would be the least expensive and least time-consuming method to employ. An obvious disadvantage is the likely high non-response rate. A suitable incentive could be used to help reduce non-response.

A.10 Chapter 11 – Some ideas underlying causation – the use of control group and time order

1. Possible control factors might be the:

• number of vehicles per capita of population (because the effect of speed limits will be different the more cars there are being driven)
• number of vehicles per mile (or kilometre) of made-up road (for the same reason)
• age distribution of the driving population (younger drivers are more likely to be involved in high-speed accidents; the very old have a high incidence of low-speed, low-fatality accidents)
• ‘mix’ of road types (on the whole, motorways tend to have fewer (but more serious) accidents compared with minor roads).

There are a few controls you might like to use. You would not use them all at once – the data would become too detailed for sensible statistical analysis and some of the figures you want might not be available – which is the easier to get of the first two controls suggested for your country? You may think of more – check if the figures exist for your country!

2. (a) The advantages given in most textbooks are the following.

• The fact that asking people questions about their experience and behaviour at the time (rather than retrospectively) eliminates ‘memory effect’.
• There is an enormous advantage to be gained from seeing how people change their behaviour over time.
(If you make repeated observations using independent samples each time, you have no way of telling whether the 20% of people who bought ‘Yum Yum’ sweets this month are the same people as the 20% who did so last month. In a panel, we can distinguish those people who are buying ‘Yum Yums’ for the first time and repeat buyers.)

The disadvantages given in most textbooks are the following.

• The problem of drop-outs – the more often you ask people questions for a survey, the more likely they are to eventually refuse to participate any more. In other words, you risk a high non-response. The main way of dealing with this – retaining respondents’ interest by use of rewards or information about your findings – may lead to the other difficulty.
• Conditioning – which means that the respondents begin to behave differently from the general population from which they were originally sampled because they have been made more aware of the subject of the study.

Think of examples of how all this works – perhaps in the case of a panel of sweet eaters (taken every two weeks) – or old people using mobility aids, taken monthly.

(b) The things to look for here are the following.

◦ How you will contact the children – it is easy to do through school, if you are sure that will not bias results. You could do this by choosing schools randomly, or choose a few representative schools (quota). The alternative is through youth clubs – this is likely to be a quota sample (but that leaves out less sociable children, or those that live in out-of-the-way places). Alternatively, you could sample families (randomly using the register of electors or a post list) and ask the parents to let you talk to their children (this would probably be the most expensive). The main thing is that with this age of child you will probably have to interview them rather than ask them to fill in questionnaires.
◦ Once you have decided that, you may decide to contact the children weekly for three months while you are running your pilot programme. You will have to think about incentives to keep them replying (if it is through school or a youth club, you may offer, for example, videos to the school, or television show tickets for the youth club, or the same things to the families). This exercise is likely to cause bias – so you should be careful to ask children not involved in the panel about the programmes and see if their general answers are different.

The things you need to decide for any panel or cohort study then are:

• how to target the subjects of your study
• how often to contact them
• which sampling method to use
• how to balance the need for a good response rate with the need to get unbiased results.

A.11 Chapter 12 – Correlation and linear regression

1. (a) True. Correlation measures the strength of any linear relationship between two (measurable) random variables, not causality. Therefore, whether we consider X and Y, or Y and X, does not affect the strength of the linear relationship and so the correlation is the same.

(b) True. A negative slope in a regression equation means the line of best fit is downward-sloping. Therefore, this indicates a negative linear relationship between X and Y, hence they must be negatively correlated, so the correlation coefficient would be negative too.

(c) False. A correlation coefficient of −1 indicates a perfect negative linear relationship between the variables. That is, increases in one variable are perfectly reflected as decreases in the other variable.

(d) False. A strong correlation coefficient just means a strong linear relationship between the variables. For example, on a scatter diagram this would mean the points lie close to a straight line. It says nothing about the gradient of the line.
This may be steep, and hence a high slope b, but not necessarily so as the points could be tightly clustered around a line with a low slope.

2. (a) When r = 0, this means that there is no one clear line. [Scatter diagram omitted: a random cloud of points with no pattern.] You can see that any straight line would be as good (or bad) as any other if you were to try to draw a line of best fit!

(b) If r is strong and positive, the line will slope upwards from left to right and the points in the scatter diagram will be close. [Typical scatter diagram omitted.] Note that diagrams with different slopes could give the same r. The angle of slope of the line is immaterial – it is the closeness of the observations which matters.

(c) If r is strong and negative, the slope is in the other direction, but again the points must be close.

(d) If r is quite strong and negative, the slope may be the same, but the points will be more spread out.

3. Here you should think about two things which have risen or fallen over time together, but have no obvious connection. Examples might be the number of examination successes and the number of films shown per year in a town. A ‘clear’ correlation might be recovery rates from a particular illness in relation to the amount of medicine given. You might accept this correlation as perhaps being causal ‘other things equal’ if everyone who got the disease were given medicine as soon as it was diagnosed (control) and also if recovery began only after the medicine was administered (time effect).

4. We have:

b = (12 × 191325 − 410 × 5445)/(12 × 2512925 − (5445)²) = 0.1251

and:

a = (410 − 0.1251 × 5445)/12 = −22.5975.

So the regression equation, if we decide that advertising costs depend on sales, is:

ŷ = 0.1251x − 22.5975.

We are assuming that as sales rise, the company concerned spends more on advertising. When sales are £460,000 we get predicted advertising costs of:

ŷ = 0.1251 × 460 − 22.5975 = 34.9485 i.e.
£34,948.50. Note that the xs and the ys were given in thousands, so be careful over the units of measurement!

5. An example might be an equation linking national income as an independent variable to the percentage of employed persons as the dependent variable. Clearly it is impossible to employ more than 100% of the population!

6. (a) [Scatter diagram with fitted line omitted.] Note how close the observations are to the line of best fit.

(b) [Scatter diagram with fitted line omitted.] Note how scattered the points are about the steep (gradient 2:1) negative sloping line.

Appendix B
Solutions to Sample examination questions

B.1 Chapter 2 – Mathematical revision – simple algebra and coordinate geometry

1. (a) We have:

Σᵢ₌₃⁵ xᵢ² = (−3.7)² + (0.8)² + (7.4)² = 13.69 + 0.64 + 54.76 = 69.09.

(b) We have:

Σᵢ₌₁² 1/(xᵢyᵢ) = 1/((−0.2) × (−0.2)) + 1/(2.5 × 8.0) = 25 + 0.05 = 25.05.

(c) We have:

y₄³ + Σᵢ₌₄⁵ yᵢ²/xᵢ = (−2.0)³ + ((−2.0)²/0.8 + 0²/7.4) = −8 + 5 = −3.

2. (a) We have:

Σᵢ₌₁³ zᵢ² = 8² + (−5)² + 6² = 64 + 25 + 36 = 125.

(b) We have:

Σᵢ₌₄⁵ √(yᵢzᵢ) = √(16 × 4) + √(10 × 10) = 8 + 10 = 18.

(c) We have:

z₄² + Σᵢ₌₁³ 1/yᵢ = 4² + (−1/2 − 1/5 + 1) = 16 + 0.3 = 16.3.

B.2 Chapter 4 – Data visualisation and descriptive statistics

1. A general tip for identifying continuous and categorical variables is to think of the possible values they can take. If these are finite and represent specific entities the variable is categorical. Otherwise, if these consist of numbers corresponding to measurements, the data are continuous and the variable is measurable. Such variables may also have measurement units or can be measured to various decimal places.

(a) Measurable, because GDP can be measured in $bn or $tn to several decimal places.

(b) Each satisfaction level corresponds to a category. The level of satisfaction is in a ranked order – for example, in terms of the list items provided. Therefore, this is a categorical ordinal variable.

(c) Each name (James, Jane etc.) is a category.
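The arithmetic in Section B.1 above can be checked with a few lines of code. The data values here are read off the worked solutions (e.g. x₃ = −3.7, y₄ = −2.0); they are not stated separately in this guide.

```python
# Values implied by the worked solutions in B.1 above.
x = {3: -3.7, 4: 0.8, 5: 7.4}
y = {4: -2.0, 5: 0.0}

# Question 1(a): sum of x_i^2 for i = 3, ..., 5.
sum_sq = sum(x[i] ** 2 for i in range(3, 6))               # 69.09

# Question 1(c): y_4^3 plus the sum of y_i^2 / x_i for i = 4, 5.
mixed = y[4] ** 3 + sum(y[i] ** 2 / x[i] for i in (4, 5))  # -3.0
```

Writing the indexed values as a dictionary keeps the code close to the subscript notation used in the exercises.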
Also, there is no natural ordering between the names – for example, we cannot really say that ‘James is higher than Jane’. Therefore, this is a categorical nominal variable. 2. (a) We have: Class interval [90, 100) [100, 110) [110, 120) [120, 130) [130, 140) [140, 150) Interval width 10 10 10 10 10 10 Frequency 6 9 7 5 2 1 Frequency density 0.6 0.9 0.7 0.5 0.2 0.1 Histogram of low-density lipoprotein data 0.9 Frequency density 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 90 100 110 120 130 140 150 Low-density lipoproteins (in mg/dL) (b) We have x̄ = 3342/30 = 111.4 mg/dL, median = 109 mg/dL, and the standard 344 B.3. Chapter 5 – Probability theory deviation is: r s= 1 × (377076 − 30 × (111.4)2 ) = 12.83 mg/dL. 29 B (c) The data exhibit positive skewness, as shown by the mean being greater than the median. 3. (a) We have: 1800 2000 2200 2400 2600 2800 3000 Calories (in kcals) Average daily intakes of calories for 12 athletes In order to identify the box, the quartiles are needed which are 1992.25 (anything between 1957 and 2009 is acceptable), 2124, 2207.75 (anything between 2200 and 2231 is also acceptable as long as it is consistent with Q1 ), hence giving an interquartile range of 215.5 (or anything else consistent with the values of Q1 and Q3 ). Hence the outlier limits are from 1669 to 2531. Therefore, the value of 3061 is an outlier. Note that no label of the x-axis is necessary and that the plot can be transposed. (b) Based on the shape of the boxplot above, we can see that the distribution of the data is positively skewed. (c) A histogram, stem-and-leaf diagram or a dot plot are other types of suitable graphical displays. The reason is that the variable income is measurable and these graphs are suitable for displaying the distribution of such variables. B.3 Chapter 5 – Probability theory 1. The solution of this exercise requires the following steps. Note, however, that these steps can be performed in a different order. 
We have that:

P(Bᶜ) = 1 − P(B) = 1 − 0.05 = 0.95

and:

P(A ∩ B) = P(A | B) P(B) = 0.70 × 0.05 = 0.035.

Applying the total probability formula, we have:

P(A) = P(A | B) P(B) + P(A | Bᶜ) P(Bᶜ) = 0.70 × 0.05 + 0.30 × 0.95 = 0.32.

Applying Bayes’ formula, we have:

P(B | A) = P(A ∩ B)/P(A) = 0.035/0.32 = 0.1094.

2. (a) Applying the total probability formula, we have:

P(F) = P(F | N) P(N) + P(F | S) P(S) = 0.02 × 0.95 + 0.5 × 0.05 = 0.044.

(b) Applying Bayes’ formula, we have:

P(S | F) = P(F | S) P(S)/P(F) = 0.025/0.044 = 25/44 = 0.5682.

3. (a) Let B1 and B2 denote boxes 1 and 2, respectively. Let G denote a green ball and R denote a red ball. Applying the total probability formula, we have:

P(G) = P(G | B1) P(B1) + P(G | B2) P(B2) = 3/4 × 1/2 + 1/2 × 1/2 = 5/8.

(b) Applying Bayes’ formula, we have:

P(B1 | G) = P(G | B1) P(B1)/P(G) = (3/4 × 1/2)/(5/8) = 3/5 = 0.6.

B.4 Chapter 6 – The normal distribution and ideas of sampling

1. (a) We must have Σᵢ p(xᵢ) = 0.20 + 5k = 1, hence k = 0.16. Therefore, the probability distribution is:

x        −1      1      2
pX(x)   0.20   0.16   0.64

(b) We have:

E(X) = Σᵢ xᵢ p(xᵢ) = (−1) × 0.20 + 1 × 0.16 + 2 × 0.64 = 1.24.

(c) We have:

E(X²) = Σᵢ xᵢ² p(xᵢ) = (−1)² × 0.20 + 1² × 0.16 + 2² × 0.64 = 2.92

hence:

Var(X) = E(X²) − (E(X))² = 2.92 − (1.24)² = 1.3824.

An alternative method to find the variance is through the formula Σᵢ (xᵢ − µ)² p(xᵢ), where µ = E(X) was found in part (b).

2. The first part just requires knowledge of the fact that X is a normal random variable with mean µ = 100 and variance σ² = 100. However, for the second part of the question it is important to note that X̄, the sample mean, is also a normal random variable with mean µ and variance σ²/n. Direct application of this fact then yields that:
X̄ ∼ N(µ, σ²/n) = N(100, 5).

For both parts, the basic property of normal random variables for this question is that if X ∼ N(µ, σ²), then:

Z = (X − µ)/σ ∼ N(0, 1).

Note also that:

• P(Z < a) = P(Z ≤ a) = Φ(a)
• P(Z > a) = P(Z ≥ a) = 1 − P(Z ≤ a) = 1 − P(Z < a) = 1 − Φ(a)
• P(a < Z < b) = P(a ≤ Z < b) = P(a < Z ≤ b) = P(a ≤ Z ≤ b) = Φ(b) − Φ(a).

The above is all you need to find the requested probabilities.

(a) We have X ∼ N(100, 100), hence:

P(X ≥ 105) = P(Z ≥ (105 − 100)/10) = P(Z ≥ 0.5) = 1 − Φ(0.5) = 1 − 0.6915 = 0.3085.

(b) We have X̄ ∼ N(100, 5), hence:

P(97 ≤ X̄ ≤ 104) = P((97 − 100)/√5 ≤ Z ≤ (104 − 100)/√5) = P(−1.34 ≤ Z ≤ 1.79) = Φ(1.79) − (1 − Φ(1.34)) = 0.9633 − (1 − 0.9099) = 0.8732.

3. (a) Since X̄ ∼ N(7.3, (1.9)²/40), we have:

P(7.0 ≤ X̄ ≤ 7.4) = P((7.0 − 7.3)/(1.9/√40) ≤ Z ≤ (7.4 − 7.3)/(1.9/√40)) = P(−1.00 ≤ Z ≤ 0.33) = Φ(0.33) − (1 − Φ(1.00)) = 0.6293 − (1 − 0.8413) = 0.4706

using Table 4 of the New Cambridge Statistical Tables.

(b) The researcher’s statement is justified since we can apply the central limit theorem given the large sample size n.

B.5 Chapter 7 – Point and interval estimation

1. (a) The sample proportion, p, must be in the centre of the interval (0.3676, 0.5324). Adding the two endpoints and dividing by 2 gives:

p = (0.3676 + 0.5324)/2 = 0.45.

(b) The (estimated) standard error when estimating a single proportion is:

√(p(1 − p)/n) = √(0.45 × 0.55)/√n = 0.4975/√n.

Since this is a 100(1 − α)% = 99% confidence interval, then α = 0.01, so the confidence coefficient is zα/2 = z0.005 = 2.576. Therefore, to determine n we need to solve:

2.576 × 0.4975/√n = 0.5324 − 0.45 = 0.0824.

The correct sample size is n = 242. Note: In questions regarding sample size determination remember to round up when the solution is not an integer.

2. (a) Let p1, n1 refer to the proportion of large retailers using regression and to the total number of large retailers, respectively.
Similarly, denote by p2 and n2 the corresponding quantities for small retailers. We have p1 = 85/120 = 0.7083, n1 = 120, p2 = 78/163 = 0.4785 and n2 = 163. The estimated standard error is:

E.S.E.(p1 − p2) = √(p1(1 − p1)/n1 + p2(1 − p2)/n2) = 0.0570.

The correct z-value is z_{0.01} = 2.326, leading to lower and upper bounds of 0.0971 and 0.3625, respectively. Presented as an interval this is (0.0971, 0.3625).

(b) We are 98% confident that the difference between the two population proportions is between 0.0971 and 0.3625. The interval excludes zero, suggesting there is a difference between the true proportions at the 2% significance level.

(c) For a 94% confidence interval, the correct confidence coefficient is z_{0.03} = 1.88. The sample proportions and standard error are unaffected, hence the new interval is (0.1226, 0.3370).

3. (a) The confidence interval formula is:

x̄ ± t_{α/2, n−1} × s/√n.

The degrees of freedom are 21 − 1 = 20, and the correct t-value from tables is t_{0.01, 20} = 2.528. The computed confidence interval is:

165 ± 2.528 × 9/√21  ⇒  (160.04, 169.96).

(b) We seek n such that 2.528 × 9/√n ≤ 1.5 (for a confidence interval of width 3). Solving gives n = 231 (remembering to round up to the nearest integer).

(c) The confidence interval formula remains as:

x̄ ± t_{α/2, n−1} × s/√n.

The correct t-value is t_{0.005, 14} = 2.977. The computed interval is:

160 ± 2.977 × 9/√15  ⇒  (153.08, 166.92).

Note the confidence intervals are not directly comparable since the one in part (a) is for 98% confidence, while the one in part (c) is for 99% confidence. Other things equal, a 99% confidence interval is wider than a 98% one. Also, the sample sizes are different and, other things equal, a smaller sample size leads to a wider confidence interval. Although there is some overlap of the computed confidence intervals (suggesting that the mean heights are potentially the same), a formal hypothesis test should be performed.
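The two-proportion interval in part 2(a) above is easy to verify numerically. The following is a quick check (not part of the original solution) using only the Python standard library; the inputs are the counts given in the question, and z_{0.01} is taken to four decimal places:

```python
import math

# Sample proportions from part 2(a): regression use among retailers
p1, n1 = 85 / 120, 120   # large retailers
p2, n2 = 78 / 163, 163   # small retailers

# Estimated standard error of the difference between two proportions
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z = 2.3263  # z_{0.01}, for 98% confidence
lower = (p1 - p2) - z * se
upper = (p1 - p2) + z * se

print(f"{se:.4f}")                    # 0.0570
print(f"({lower:.4f}, {upper:.4f})")  # (0.0971, 0.3625)
```

Carrying full precision throughout reproduces the interval quoted in the solution; rounding the standard error or z-value first can shift the last decimal place of the endpoints.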
B.6 Chapter 8 – Hypothesis testing

1. (a) Let µ1 denote the mean examination mark for the intensive tutoring group and µ2 denote the mean examination mark for the standard tutoring group. The wording ‘whether there is a difference between the mean examination marks’ implies a two-sided test, hence the hypotheses can be written as:

H0: µ1 = µ2  vs.  H1: µ1 ≠ µ2.

The test statistic formulae, depending on whether or not a pooled variance is used, are provided on the formula sheet:

(x̄1 − x̄2)/√(s_p²(1/n1 + 1/n2))  or  (x̄1 − x̄2)/√(s1²/n1 + s2²/n2).

If equal variances are assumed, the test statistic value is 2.1449. If equal variances are not assumed, the test statistic value is 2.1164. The variances are unknown but the sample size is large enough, so the standard normal distribution can be used due to the central limit theorem. The t40 distribution is also correct and will be used in what follows. The critical values at the 5% significance level are ±2.021, hence we reject the null hypothesis. If we take a (smaller) α of 1%, the critical values are ±2.704, so we do not reject H0. We conclude that there is moderate evidence of a difference between the two tutoring groups.

(b) The assumptions for part (a) relate to the following.

• Assumption about whether the variances are equal, i.e. whether σ1² = σ2² or σ1² ≠ σ2².
• Assumption about whether n1 + n2 is ‘large’ so that the normality assumption is satisfied via the central limit theorem.
• Assumption about the samples being independent.
• Assumption about whether a normal or t distribution is used.

2. (a) Let π_T denote the true probability for the new treatment to work. We test H0: π_T = 0.5 vs. H1: π_T > 0.5. Under H0, the test statistic is:

(P − 0.5)/√(0.5 × (1 − 0.5)/n) ∼ N(0, 1)

approximately, due to the central limit theorem, since here n = 30 is (just about) large enough. Hence the test statistic value is:

(20/30 − 0.5)/0.0913 = 1.83.

For α = 0.05, the critical value is 1.645. Since 1.83 > 1.645, we reject H0 at the 5% significance level. There is moderate evidence that the treatment is effective.

(b) Let π_P denote the true probability for the patient to recover with the placebo. Now we test H0: π_T = π_P vs. H1: π_T > π_P. For reference, the test statistic is:

(P_T − P_P)/√(P(1 − P)(1/n1 + 1/n2)) ∼ N(0, 1)

approximately, due to the central limit theorem. The standard error, using the pooled proportion P = (20 + 21)/(30 + 40) = 41/70, is:

S.E.(P_T − P_P) = √(41/70 × 29/70 × (1/40 + 1/30)) = 0.119

and so the test statistic value is:

(20/30 − 21/40)/0.119 = 1.191.

For α = 0.05, the critical value is 1.645. Since 1.191 < 1.645, we do not reject H0 at the 5% significance level. There is no evidence of higher effectiveness than the placebo effect.

3. (a) The test statistic value is:

x̄_d/(s_d/√n) = 1.195/(10.2/√121) = 1.289.

(b) In this case one can use the t_{n−1} = t120 distribution, since the standard deviation is unknown and estimated by s_d. Nevertheless the assumption of a standard normal distribution is also justified by the central limit theorem given the large sample size n. Combining the above, and noting that this is a two-tailed test, we get that the p-value is:

2 × P(T ≥ 1.289) = 2 × 0.10 = 0.20

where T ∼ t120, using Table 10 of the New Cambridge Statistical Tables. Since the p-value is 0.20 > 0.05, the test is not significant at the 5% significance level.

B.7 Chapter 9 – Contingency tables and the chi-squared test

1. (a) False. Chi-squared values cannot be negative.

(b) False. The expected frequencies are determined under the null hypothesis of no association, hence similar observed and expected frequencies provide little to no evidence of an association.

2. (a) There are some differences between younger and older people regarding watch preference. More specifically, 16% of younger people prefer an analogue watch compared to 48% of people over 30.
Hence there seems to be an association between age and watch preference, although this needs to be investigated further.

(b) Set out the null hypothesis that there is no association between age and watch preference against the alternative, that there is an association. Be careful to get these the correct way round!

H0: No association between age and watch preference.
H1: Association between age and watch preference.

Work out the expected values to obtain the table below:

                  30 years old or younger   Over 30 years old
Analogue watch            20.34                   20.66
Undecided                 19.35                   19.65
Digital watch             24.31                   24.69

The test statistic formula is:

Σ_i Σ_j (O_ij − E_ij)²/E_ij

which gives a test statistic value of 24.146. This is a 2 × 3 contingency table so the degrees of freedom are (2 − 1) × (3 − 1) = 2. For α = 0.05, the critical value is 5.991, hence we reject H0 since 5.991 < 24.146. For α = 0.01, the critical value is 9.210, hence we reject H0 since 9.210 < 24.146. We conclude that there is strong evidence of an association between age and watch preference.

3. We perform a goodness-of-fit test with:

H0: There is a uniform distribution of live births throughout the year.
H1: There is a non-uniform distribution of live births throughout the year.

How do we calculate the expected values? You should not expect equal numbers of births in each season unless the seasons had exactly the same number of days. The clue lies in the fact that you are given the number of days per season, so logically you would expect the number of births (out of the 730 given) to be:

Season    Expected live births
Spring    730 × 93/365 = 186
Summer    730 × 80/365 = 160
Autumn    730 × 100/365 = 200
Winter    730 × 92/365 = 184

The test statistic is:

χ² = Σ_i (O_i − E_i)²/E_i

giving a test statistic value of:

χ² = (100 − 186)²/186 + (200 − 160)²/160 + (250 − 200)²/200 + (180 − 184)²/184
   = 39.76 + 10.00 + 12.50 + 0.09
   = 62.35.
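This goodness-of-fit computation is easy to check numerically. A minimal sketch in Python (not part of the original solution), using the observed births and season lengths from the question:

```python
# Observed live births per season and days per season (from the question)
observed = [100, 200, 250, 180]   # spring, summer, autumn, winter
days = [93, 80, 100, 92]

# Expected counts are proportional to season length: 730 * days/365
total = sum(observed)             # 730 births in total
expected = [total * d / 365 for d in days]

# Chi-squared goodness-of-fit statistic: sum of (O - E)^2 / E
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print([round(e) for e in expected])   # [186, 160, 200, 184]
print(round(chi2, 2))                 # 62.35
```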
The convention is to take the degrees of freedom as the number of categories minus 1 so, as there are four seasons given, we have 3 degrees of freedom. Looking at Table 8 of the New Cambridge Statistical Tables, we see that for 3 degrees of freedom this result is highly significant (the 5% critical value is 7.815 and the 1% critical value is 11.34), so we reject H0 and conclude that there is very strong evidence that births vary over the year.

If you look again at the observed and expected values you see:

Season    Observed   Expected
Spring       100        186
Summer       200        160
Autumn       250        200
Winter       180        184

Only observed winter births are anywhere near the expected value – both summer and autumn show higher births than expected, while spring is much lower. You might like to test whether the proportion of births in spring is lower than in other seasons. You will get a highly significant result!

B.8 Chapter 10 – Sampling design

1. Note that for these types of questions there is no single correct answer. Below are some suggested ‘good’ answers for each part.

(a) – Possible advantages: useful in the absence of a sampling frame; speed; cost.
    – Possible disadvantages: standard errors are not measurable; systematic bias due to the interviewer; no guarantee of representativeness.

(b) – Possible advantages: easy to achieve a large number of interviews; easy to check on the quality of interviewers.
    – Possible disadvantages: not everyone has a telephone so the sample can be biased; cannot usually show samples; although telephone directories exist for landline numbers, what about mobile telephone numbers? Also, young people are more likely to use mobile telephones rather than landline telephones, so are more likely to be excluded.

2. One of the main things to avoid in this part is writing ‘essays’ without any structure. This question asks for specific things and each one of them requires only one or two lines of response. If you are unsure of what these things are, do not write lengthy answers.
This is a waste of your valuable examination time. If you can identify what is being asked, keep in mind that the answer should not be long. Note also that in some cases there is no single right answer to the question. Some suggested answers are given below.

(a) We are asked for accuracy and random (probability) sampling, so a reasonable option is the use of stratified random sampling, which is known to produce results of high accuracy. An example of a sampling scheme could be ‘a stratified sample of those customers who bought this laptop recently’.

(b) The question requires:
◦ a description of a sampling frame
◦ a justification of its choice
◦ mention of a (sensible) contact method
◦ a statement of an advantage of the contact method mentioned above.

A suggested answer is given below. Use a list provided by retailers to identify those who bought this laptop model recently. The list could include the postal address, telephone number or email address. Stratification can be made by area of country or by gender of buyer. Finally, give an explanation as to which you would prefer – for example, email is fast if all have it, but there may be a lot of non-response.

(c) The question requires an example of selection bias and an action suggested to address this issue. For example, retailers’ records may be incomplete. Offer incentives to make sure they keep accurate records.

(d) A suggested answer for the question is ‘How does preference for the laptop model compare for men and women?’. In terms of variables one could mention ‘gender’ and ‘buying preference’.

3. (a) An indicative answer here would be to use an email list. A limitation with this choice is that the list may not contain all the employees.

(b) Examples of stratification factors are income level, gender, age group etc. In order for stratified sampling to be effective, the members within each stratum should be homogeneous.

(c) Example actions are incentives and face-to-face interviews.
Note here that response bias occurs when the respondents give consistently false answers, such as claiming a younger age.

(d) Examples here are appropriate graphs (boxplots, histograms etc.), confidence intervals and hypothesis tests of job satisfaction measurement across different job types.

B.9 Chapter 11 – Some ideas underlying causation – the use of control groups and time order

1. Longitudinal surveys have the following strengths and weaknesses:

• Strengths: can detect change over time, and are likely to get more accurate results than with a single cross-sectional survey.
• Weaknesses: due to attrition there may be a lack of representativeness over time, and we may experience response bias.

To study people’s use of health services over a 20-year period, we would begin in year 1 by selecting the study participants. These should be representative of the population in terms of several characteristics such as gender, age group, general well-being, diet type, fitness level etc. To allow for likely attrition over 20 years, a large number of participants would be required. Appropriate incentives should be provided to induce people to remain in the study over time. The participants would be interviewed face-to-face on an annual basis, or perhaps every four or five years, depending on the budget resources available. In order to detect change, data on the same variables would be needed at each interview stage. Examples of questions to include might be the following.

• ‘How many times have you visited a doctor in the past 12 months?’
• ‘How long did you have to wait from making an appointment to actually seeing a doctor?’
• ‘Overall, how satisfied are you with the level of health service provision in your area?’

2. Panel surveys are generally different from longitudinal surveys in the following ways:

• they are more likely to be chosen by quota rather than random methods
• individuals are interviewed every two to four weeks (rather than every few years)
• individuals are unlikely to be panel members for longer than two years at a time.

B.10 Chapter 12 – Correlation and linear regression

1. (a) We have:

Examination mark            50   80   70   40   30   75   95
Project mark                75   60   55   40   50   80   65
Rank of examination mark     5    2    4    6    7    3    1
Rank of project mark         2    4    5    7    6    1    3
Difference in ranks, d_i     3   −2   −1   −1    1    2   −2
d_i²                         9    4    1    1    1    4    4

The Spearman rank correlation is:

r_s = 1 − 6 Σ d_i² / (n(n² − 1)) = 1 − (6 × 24)/(7 × 48) = 0.5714.

(b) There is some correlation, but it is not strong. It looks as if there is a positive connection between examination marks and project marks.

2. (a) (Scatter diagram: ‘Stolen merchandise vs. number of workers’, plotting the value of merchandise in $s lost to shoplifters against the number of workers on duty.)

(b) The summary statistics can be substituted into the formula for the sample correlation coefficient (make sure you know which one it is!) to obtain the value r = −0.9688. An interpretation of this value is the following – the data suggest that the higher the number of workers, the lower the loss from shoplifters. The fact that the value is very close to −1 suggests that this is a strong, negative, linear association.

(c) The regression line can be written by the equation ŷ = a + bx or y = a + bx + ε. The formula for b is:

b = (Σ x_i y_i − n x̄ ȳ)/(Σ x_i² − n x̄²)

and by substituting the summary statistics we get b = −26.64. The formula for a is a = ȳ − b x̄, and we get a = 655.36. Hence the regression line can be written as:

ŷ = 655.36 − 26.64x  or  y = 655.36 − 26.64x + ε

which is drawn on the scatter diagram above.

(d) In this case one can note in the scatter diagram that the points seem to be ‘scattered’ around a straight line. Hence a linear regression model does seem to be a good model here.
According to the model, the expected loss from shoplifting for 17 workers on duty is:

655.36 − 26.64 × 17 ≈ $202.48.

Appendix C Examination formula sheet

Expected value of a discrete random variable:

µ = E(X) = Σ_{i=1}^{N} p_i x_i

Standard deviation of a discrete random variable:

σ = √σ² = √(Σ_{i=1}^{N} p_i (x_i − µ)²)

The transformation formula:

Z = (X − µ)/σ

Finding Z for the sampling distribution of the sample mean:

Z = (X̄ − µ)/(σ/√n)

Finding Z for the sampling distribution of the sample proportion:

Z = (P − π)/√(π(1 − π)/n)

Confidence interval endpoints for a single mean (σ known):

x̄ ± z_{α/2} × σ/√n

Confidence interval endpoints for a single mean (σ unknown):

x̄ ± t_{α/2, n−1} × s/√n

Confidence interval endpoints for a single proportion:

p ± z_{α/2} × √(p(1 − p)/n)

Sample size determination for a mean:

n ≥ (z_{α/2})² σ²/e²

Sample size determination for a proportion:

n ≥ (z_{α/2})² p(1 − p)/e²

z test of hypothesis for a single mean (σ known):

Z = (X̄ − µ0)/(σ/√n)

t test of hypothesis for a single mean (σ unknown):

T = (X̄ − µ0)/(S/√n)
z test of hypothesis for a single proportion:

Z ≅ (P − π0)/√(π0(1 − π0)/n)

z test for the difference between two means (variances known):

Z = (X̄1 − X̄2 − (µ1 − µ2))/√(σ1²/n1 + σ2²/n2)

t test for the difference between two means (variances unknown):

T = (X̄1 − X̄2 − (µ1 − µ2))/√(S_p²(1/n1 + 1/n2))

Pooled variance estimator:

S_p² = ((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2)

Confidence interval endpoints for the difference between two means:

x̄1 − x̄2 ± t_{α/2, n1+n2−2} × √(s_p²(1/n1 + 1/n2))

t test for the difference in means in paired samples:

T = (X̄_d − µ_d)/(S_d/√n)

Confidence interval endpoints for the difference in means in paired samples:

x̄_d ± t_{α/2, n−1} × s_d/√n

z test for the difference between two proportions:

Z = (P1 − P2 − (π1 − π2))/√(P(1 − P)(1/n1 + 1/n2))

Pooled proportion estimator:

P = (R1 + R2)/(n1 + n2)

Confidence interval endpoints for the difference between two proportions:

p1 − p2 ± z_{α/2} × √(p1(1 − p1)/n1 + p2(1 − p2)/n2)

χ² statistic for test of association:

Σ_{i=1}^{r} Σ_{j=1}^{c} (O_ij − E_ij)²/E_ij

Sample correlation coefficient:

r = (Σ x_i y_i − n x̄ ȳ)/√((Σ x_i² − n x̄²)(Σ y_i² − n ȳ²))

Spearman rank correlation:

r_s = 1 − 6 Σ d_i²/(n(n² − 1))

Simple linear regression line estimates:

b = (Σ x_i y_i − n x̄ ȳ)/(Σ x_i² − n x̄²)
a = ȳ − b x̄
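Several of the estimators on this formula sheet translate directly into code. The sketch below (an illustration, not examination material) implements the simple linear regression estimates and the Spearman rank correlation; the rank helper assumes no tied observations, with rank 1 given to the largest value, as in the Chapter 12 solution. Applying it to the Chapter 12 examination/project marks reproduces r_s = 0.5714:

```python
def regression_line(x, y):
    """Least squares estimates: b = (Σxy − n·x̄·ȳ)/(Σx² − n·x̄²), a = ȳ − b·x̄."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / (
        sum(xi ** 2 for xi in x) - n * xbar ** 2
    )
    return ybar - b * xbar, b   # (a, b)

def spearman(x, y):
    """r_s = 1 − 6·Σd² / (n(n² − 1)); assumes no ties, rank 1 = largest value."""
    def ranks(v):
        order = sorted(v, reverse=True)
        return [order.index(vi) + 1 for vi in v]
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    n = len(x)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Chapter 12, Question 1 data: examination and project marks
exam = [50, 80, 70, 40, 30, 75, 95]
project = [75, 60, 55, 40, 50, 80, 65]

print(round(spearman(exam, project), 4))   # 0.5714
a, b = regression_line(exam, project)      # illustrative fit on the same data
print(round(a, 2), round(b, 2))
```

Fed the raw shoplifting data from Chapter 12, Question 2, the same `regression_line` function would reproduce b = −26.64 and a = 655.36 (only the summary statistics are given in the question, so that check is left to the reader).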