Subido por Johnny Henry Calluchi Arbildo

study-guide-2019 compress

Anuncio
Statistics 1
J.S. Abdey
ST104a
2019
Undergraduate study in
Economics, Management,
Finance and the Social Sciences
This subject guide is for a 100 course offered as part of the University of London
undergraduate study in Economics, Management, Finance and the Social
Sciences. This is equivalent to Level 4 within the Framework for Higher Education
Qualifications in England, Wales and Northern Ireland (FHEQ).
For more information about the University of London, see: london.ac.uk
This guide was prepared for the University of London by:
James S. Abdey, BA (Hons), MSc, PGCertHE, PhD, Department of Statistics, London School of
Economics and Political Science.
This is one of a series of subject guides published by the University. We regret that due to
pressure of work the author is unable to enter into any correspondence relating to, or arising
from, the guide. If you have any comments on this subject guide, favourable or unfavourable,
please use the form at the back of this guide.
In this and other publications you may see references to the ‘University of London
International Programmes’, which was the name of the University’s flexible and distance
learning arm until 2018. It is now known simply as the ‘University of London’, which
better reflects the academic award our students are working towards. The change in name
will be incorporated into our materials as they are revised.
University of London
Publications Office
Stewart House
32 Russell Square
London WC1B 5DN
United Kingdom
london.ac.uk
Published by: University of London
© University of London 2019
The University of London asserts copyright over all material in this subject guide except where
otherwise indicated. All rights reserved. No part of this work may be reproduced in any form,
or by any means, without permission in writing from the publisher. We make every effort to
respect copyright. If you think we have inadvertently used your copyright material, please let
us know.
Contents
Contents
1 Introduction
1
1.1
General introduction to the subject area . . . . . . . . . . . . . . . . . .
1
1.2
Aims of the course . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1
1.3
Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2
1.4
Syllabus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2
1.5
Essential reading list and other learning resources . . . . . . . . . . . . .
3
1.5.1
Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3
How to study statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4
1.6.1
Mathematical background . . . . . . . . . . . . . . . . . . . . . .
4
1.6.2
Calculators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5
1.6
1.7
1.8
How to use the subject guide
. . . . . . . . . . . . . . . . . . . . . . . .
5
1.7.1
Using the subject guide and the textbook
. . . . . . . . . . . . .
5
1.7.2
Online study resources . . . . . . . . . . . . . . . . . . . . . . . .
5
1.7.3
Virtual learning environment (VLE) . . . . . . . . . . . . . . . .
6
1.7.4
Making use of the Online Library . . . . . . . . . . . . . . . . . .
6
1.7.5
Structure of the subject guide . . . . . . . . . . . . . . . . . . . .
7
1.7.6
Time management . . . . . . . . . . . . . . . . . . . . . . . . . .
7
1.7.7
Recommendations for working through the chapters . . . . . . . .
7
Examination advice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
2 Mathematical revision – simple algebra and coordinate geometry
11
2.1
Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11
2.2
Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11
2.3
Recommended reading . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11
2.4
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
12
2.5
Arithmetic operations
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
12
2.6
Squares and square roots . . . . . . . . . . . . . . . . . . . . . . . . . . .
13
2.7
Fractions and percentages . . . . . . . . . . . . . . . . . . . . . . . . . .
14
2.8
Some further notation . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14
2.8.1
14
Absolute value . . . . . . . . . . . . . . . . . . . . . . . . . . . .
i
Contents
2.8.2
Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
P
Summation operator,
. . . . . . . . . . . . . . . . . . . . . . . . . . .
14
2.10 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
20
2.11 The graph of a linear function . . . . . . . . . . . . . . . . . . . . . . . .
22
2.12 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
2.13 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
2.14 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
2.15 A reminder of your learning outcomes . . . . . . . . . . . . . . . . . . . .
24
2.16 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . .
25
2.9
3 The nature of statistics
27
3.1
Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
27
3.2
Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
27
3.3
Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
27
3.4
Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
28
3.5
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
28
3.6
Coverage of the course . . . . . . . . . . . . . . . . . . . . . . . . . . . .
28
3.7
Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
29
3.7.1
Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
29
3.7.2
Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
30
3.7.3
Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
30
3.7.4
Sampling
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
31
3.8
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
32
3.9
Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . .
32
3.10 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
33
3.11 A reminder of your learning outcomes . . . . . . . . . . . . . . . . . . . .
33
3.12 Self assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
33
4 Data visualisation and descriptive statistics
ii
15
35
4.1
Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
35
4.2
Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
35
4.3
Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
35
4.4
Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
36
4.5
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
36
4.6
Types of variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
36
4.6.1
37
Categorical variables . . . . . . . . . . . . . . . . . . . . . . . . .
Contents
4.7
Data visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
38
4.7.1
Presentational traps . . . . . . . . . . . . . . . . . . . . . . . . .
39
4.7.2
Dot plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
39
4.7.3
Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
40
4.7.4
Stem-and-leaf diagram . . . . . . . . . . . . . . . . . . . . . . . .
44
Measures of location . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
45
4.8.1
Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
46
4.8.2
Median
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
47
4.8.3
Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
49
Measures of spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
52
4.9.1
Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
53
4.9.2
Boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
54
4.9.3
Variance and standard deviation . . . . . . . . . . . . . . . . . . .
55
4.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
66
4.11 Test your understanding . . . . . . . . . . . . . . . . . . . . . . . . . . .
67
4.12 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
4.13 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
4.14 A reminder of your learning outcomes . . . . . . . . . . . . . . . . . . . .
73
4.15 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . .
74
4.8
4.9
5 Probability theory
77
5.1
Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
77
5.2
Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
77
5.3
Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
77
5.4
Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
78
5.5
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
78
5.6
The concept of probability . . . . . . . . . . . . . . . . . . . . . . . . . .
78
5.7
Relative frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
82
5.8
‘Randomness’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
83
5.9
Properties of probability . . . . . . . . . . . . . . . . . . . . . . . . . . .
83
5.9.1
Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
84
5.9.2
Venn diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . .
85
5.9.3
The additive law . . . . . . . . . . . . . . . . . . . . . . . . . . .
86
5.9.4
The multiplicative law . . . . . . . . . . . . . . . . . . . . . . . .
90
5.10 Conditional probability and Bayes’ formula . . . . . . . . . . . . . . . . .
95
5.10.1 Bayes’ formula . . . . . . . . . . . . . . . . . . . . . . . . . . . .
100
iii
Contents
5.10.2 Total probability formula . . . . . . . . . . . . . . . . . . . . . . .
102
5.10.3 Independent events (revisited) . . . . . . . . . . . . . . . . . . . .
109
5.11 Probability trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
109
5.12 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
111
5.13 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . .
111
5.14 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
112
5.15 A reminder of your learning outcomes . . . . . . . . . . . . . . . . . . . .
114
5.16 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . .
114
6 The normal distribution and ideas of sampling
6.1
Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
117
6.2
Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
117
6.3
Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
117
6.4
Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
118
6.5
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
118
6.6
Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
118
6.7
Population mean and variance . . . . . . . . . . . . . . . . . . . . . . . .
120
6.7.1
Population mean . . . . . . . . . . . . . . . . . . . . . . . . . . .
120
6.7.2
Population variance . . . . . . . . . . . . . . . . . . . . . . . . . .
121
The normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . .
125
6.8.1
Relevance of the normal distribution . . . . . . . . . . . . . . . .
127
6.8.2
Consequences of the central limit theorem . . . . . . . . . . . . .
127
6.8.3
Characteristics of the normal distribution . . . . . . . . . . . . . .
128
6.8.4
Standard normal tables . . . . . . . . . . . . . . . . . . . . . . . .
128
6.8.5
The general normal distribution . . . . . . . . . . . . . . . . . . .
132
Sampling distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . .
138
6.10 Sampling distribution of X̄ . . . . . . . . . . . . . . . . . . . . . . . . . .
141
6.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
144
6.12 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . .
144
6.13 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
145
6.14 A reminder of your learning outcomes . . . . . . . . . . . . . . . . . . . .
146
6.15 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . .
146
6.8
6.9
7 Point and interval estimation
iv
117
149
7.1
Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
149
7.2
Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
149
Contents
7.3
Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
149
7.4
Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
150
7.5
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
150
7.6
Principle of confidence intervals . . . . . . . . . . . . . . . . . . . . . . .
150
7.7
General formulae for normally-distributed statistics . . . . . . . . . . . .
151
7.7.1
Standard error known . . . . . . . . . . . . . . . . . . . . . . . .
151
7.7.2
Standard error unknown . . . . . . . . . . . . . . . . . . . . . . .
153
7.7.3
(Student’s) t distribution . . . . . . . . . . . . . . . . . . . . . . .
153
7.8
Confidence interval for a single mean (σ known) . . . . . . . . . . . . . .
155
7.9
Confidence interval for a single mean (σ unknown) . . . . . . . . . . . . .
156
7.10 Confidence interval for a single proportion . . . . . . . . . . . . . . . . .
158
7.11 Sample size determination . . . . . . . . . . . . . . . . . . . . . . . . . .
161
7.12 Difference between two population proportions . . . . . . . . . . . . . . .
164
7.13 Difference between two population means . . . . . . . . . . . . . . . . . .
168
7.13.1 Unpaired samples – variances known . . . . . . . . . . . . . . . .
168
7.13.2 Unpaired samples – variances unknown and unequal . . . . . . . .
169
7.13.3 Unpaired samples – variances unknown and equal . . . . . . . . .
170
7.13.4 Paired (dependent) samples . . . . . . . . . . . . . . . . . . . . .
173
7.14 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
175
7.15 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . .
176
7.16 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
176
7.17 A reminder of your learning outcomes . . . . . . . . . . . . . . . . . . . .
177
7.18 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . .
178
8 Hypothesis testing
179
8.1
Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
179
8.2
Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
179
8.3
Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
179
8.4
Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
180
8.5
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
180
8.6
Statistical tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
181
8.7
Types of error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
183
8.8
Tests for normal populations . . . . . . . . . . . . . . . . . . . . . . . . .
184
8.9
Significance levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
185
8.9.1
. . . . . . . . . . . . . . . . . . . . . .
186
8.10 One-tailed and two-tailed tests . . . . . . . . . . . . . . . . . . . . . . . .
188
Order of conducting tests
v
Contents
8.11 P -values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
188
8.12 Hypothesis test for a single mean (σ known) . . . . . . . . . . . . . . . .
190
8.13 Hypothesis test for a single mean (σ unknown) . . . . . . . . . . . . . . .
191
8.14 Hypothesis test for a single proportion . . . . . . . . . . . . . . . . . . .
196
8.15 Difference between two population proportions . . . . . . . . . . . . . . .
200
8.16 Difference between two population means . . . . . . . . . . . . . . . . . .
204
8.16.1 Unpaired samples – variances known . . . . . . . . . . . . . . . .
204
8.16.2 Unpaired samples – variances unknown and unequal . . . . . . . .
205
8.16.3 Unpaired samples – variances unknown and equal . . . . . . . . .
206
8.16.4 Paired (dependent) samples . . . . . . . . . . . . . . . . . . . . .
211
8.17 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
213
8.18 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . .
213
8.19 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
213
8.20 A reminder of your learning outcomes . . . . . . . . . . . . . . . . . . . .
214
8.21 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . .
215
9 Contingency tables and the chi-squared test
9.1
Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
217
9.2
Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
217
9.3
Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
217
9.4
Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
218
9.5
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
218
9.6
Correlation and association
. . . . . . . . . . . . . . . . . . . . . . . . .
218
9.7
Tests for association . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
218
9.7.1
Contingency tables . . . . . . . . . . . . . . . . . . . . . . . . . .
219
9.7.2
Expected frequencies . . . . . . . . . . . . . . . . . . . . . . . . .
220
9.7.3
Test statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
221
9.7.4
The χ2 distribution . . . . . . . . . . . . . . . . . . . . . . . . . .
221
9.7.5
Degrees of freedom . . . . . . . . . . . . . . . . . . . . . . . . . .
221
9.7.6
Performing the test . . . . . . . . . . . . . . . . . . . . . . . . . .
222
9.7.7
Extending to an appropriate test of proportions . . . . . . . . . .
229
Goodness-of-fit tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
229
9.8.1
Observed and expected frequencies . . . . . . . . . . . . . . . . .
230
9.8.2
The goodness-of-fit test . . . . . . . . . . . . . . . . . . . . . . . .
231
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
233
9.10 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . .
233
9.8
9.9
vi
217
Contents
9.11 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
233
9.12 A reminder of your learning outcomes . . . . . . . . . . . . . . . . . . . .
235
9.13 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . .
236
10 Sampling design
237
10.1 Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
237
10.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
237
10.3 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
238
10.4 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
238
10.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
238
10.6 Motivation for sampling . . . . . . . . . . . . . . . . . . . . . . . . . . .
238
10.7 Types of sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
240
10.7.1 Non-probability sampling
. . . . . . . . . . . . . . . . . . . . . .
241
10.7.2 Probability sampling . . . . . . . . . . . . . . . . . . . . . . . . .
244
10.8 Types of error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
249
10.9 Pilot and post-enumeration surveys . . . . . . . . . . . . . . . . . . . . .
249
10.10 Non-response and response bias . . . . . . . . . . . . . . . . . . . . . .
250
10.11 Method of contact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
253
10.12 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
257
10.13 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . .
260
10.14 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
261
10.15 A reminder of your learning outcomes . . . . . . . . . . . . . . . . . . .
263
10.16 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . .
263
11 Some ideas underlying causation – the use of control groups and time
order
265
11.1 Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
265
11.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
265
11.3 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
265
11.4 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
265
11.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
266
11.6 Observational studies and designed experiments . . . . . . . . . . . . . .
267
11.7 Use of the control group . . . . . . . . . . . . . . . . . . . . . . . . . . .
268
11.7.1 Observational study
. . . . . . . . . . . . . . . . . . . . . . . . .
268
11.7.2 Experimental study . . . . . . . . . . . . . . . . . . . . . . . . . .
269
11.8 Time order
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
271
11.8.1 Longitudinal surveys . . . . . . . . . . . . . . . . . . . . . . . . .
271
vii
Contents
11.8.2 Panel surveys . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
272
11.9 Case study: Smoking and lung cancer . . . . . . . . . . . . . . . . . . . .
273
11.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
274
11.11 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . .
274
11.12 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
274
11.13 A reminder of your learning outcomes . . . . . . . . . . . . . . . . . . .
275
11.14 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . .
275
12 Correlation and linear regression
277
12.1 Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
277
12.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
277
12.3 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
277
12.4 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
278
12.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
278
12.6 Scatter diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
279
12.7 Causal and non-causal relationships . . . . . . . . . . . . . . . . . . . . .
280
12.8 Correlation coefficient
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
282
12.8.1 Spearman rank correlation . . . . . . . . . . . . . . . . . . . . . .
286
12.9 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
288
12.9.1 The simple linear regression model . . . . . . . . . . . . . . . . .
289
12.9.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . .
291
12.9.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
295
12.9.4 Points to watch about linear regression . . . . . . . . . . . . . . .
304
12.10 Points to note about correlation and regression . . . . . . . . . . . . . .
305
12.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
305
12.12 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . .
306
12.13 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
306
12.14 A reminder of your learning outcomes . . . . . . . . . . . . . . . . . . .
307
12.15 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . .
307
A Solutions to Learning activities
309
A.1 Chapter 2 – Mathematical revision – simple algebra and coordinate geometry309
A.2 Chapter 3 – The nature of statistics . . . . . . . . . . . . . . . . . . . . .
312
A.3 Chapter 4 – Data visualisation and descriptive statistics
. . . . . . . . .
312
A.4 Chapter 5 – Probability theory . . . . . . . . . . . . . . . . . . . . . . .
315
A.5 Chapter 6 – The normal distribution and ideas of sampling . . . . . . . .
319
viii
Contents
A.6 Chapter 7 – Point and interval estimation . . . . . . . . . . . . . . . . .
323
A.7 Chapter 8 – Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . .
325
A.8 Chapter 9 – Contingency tables and the chi-squared test . . . . . . . . .
327
A.9 Chapter 10 – Sampling design . . . . . . . . . . . . . . . . . . . . . . . .
333
A.10 Chapter 11 – Some ideas underlying causation – the use of control group
and time order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
338
A.11 Chapter 12 – Correlation and linear regression . . . . . . . . . . . . . . .
340
B Solutions to Sample examination questions
343
B.1 Chapter 2 – Mathematical revision – simple algebra and coordinate geometry343
B.2 Chapter 4 – Data visualisation and descriptive statistics
. . . . . . . . .
344
B.3 Chapter 5 – Probability theory . . . . . . . . . . . . . . . . . . . . . . .
345
B.4 Chapter 6 – The normal distribution and ideas of sampling . . . . . . . .
346
B.5 Chapter 7 – Point and interval estimation . . . . . . . . . . . . . . . . .
348
B.6 Chapter 8 – Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . .
349
B.7 Chapter 9 – Contingency tables and the chi-squared test . . . . . . . . .
351
B.8 Chapter 10 – Sampling design . . . . . . . . . . . . . . . . . . . . . . . .
353
B.9 Chapter 11 – Some ideas underlying causation – the use of control group
and time order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
354
B.10 Chapter 12 – Correlation and linear regression . . . . . . . . . . . . . . .
355
C Examination formula sheet
357
ix
Contents
x
1
Chapter 1
Introduction
1.1
General introduction to the subject area
Welcome to the world of statistics! This is a discipline with unparalleled applicability
whose use can be found in a wide range of areas such as finance, business, management,
economics and other fields in the social sciences. ST104a Statistics 1 provides you
with the opportunity to grasp the fundamentals of the subject and will equip you with
the vital quantitative skills and powers of analysis which are highly sought-after by
employers in many sectors.
The material in this half-course is necessary as preparation for other courses you may
study later on as part of your degree.
You may choose to take ST104b Statistics 2 so that you can study the concepts
introduced here in greater depth. A natural continuation of this half-course and
ST104b Statistics 2 are the advanced half-courses ST2133 Advanced
statistics: distribution theory and ST2134 Advanced statistics: statistical
inference.
Two applied statistics courses for which this half-course is a pre-requisite are
ST2187 Business analytics, applied modelling and prediction and ST3188
Statistical methods for market research.
You may wish to develop your economic statistics by taking EC2020 Elements of
econometrics.
You may want to build on your interests in social research and take SC2145
Social research methods.
There are also links with MN3141 Principles of marketing.
So the reach of statistics is considerable. Hence it rightly forms a core component of our
programmes since all of the courses mentioned above require an understanding of the
concepts and techniques introduced in ST104a Statistics 1. The analytical skills
which you will develop on this course will, therefore, stand you in very good stead both
for your future studies and beyond into the real world of work.
1.2
Aims of the course
The emphasis of this half-course is on the application of statistical methods in
management, economics and the social sciences. Attention will focus on the
1
1
1. Introduction
interpretation of tables and results, as well as the appropriate way to approach
statistical problems. Treatment is at an elementary mathematical level. Ideas of
probability, inference and multivariate analysis are introduced and are further
developed in the half-course ST104b Statistics 2.
1.3
Learning outcomes
At the end of this half-course, and having completed the essential reading and
activities, you should:
be familiar with the key ideas of statistics that are accessible to a student with a
moderate mathematical competence
be able to routinely apply a variety of methods for explaining, summarising and
presenting data and interpreting results clearly using appropriate diagrams, titles
and labels when required
be able to summarise the ideas of randomness and variability, and the way in which
these link to probability theory to allow the systematic and logical collection of
statistical techniques of great practical importance in many applied areas
have a grounding in probability theory and some grasp of the most common
statistical methods
be able to perform inference to test the significance of common measures such as
means and proportions and conduct chi-squared tests of contingency tables
be able to use simple linear regression and correlation analysis and know when it is
appropriate to do so.
1.4
Syllabus
This course introduces some of the basic ideas of theoretical statistics, emphasising the
applications of these methods and the interpretation of tables and results.
Basic background : Elementary summation signs, elementary probability, Venn and
tree diagrams.
Data collection: Elements of survey design, the stages of a survey, ideas of
randomness, observation and experiment.
Data presentation and analysis: Descriptive statistics, measures of location and
dispersion, pictorial and graphical representation.
The normal distribution: Estimation of mean, proportion, standard deviation,
confidence intervals and hypothesis testing. Ideas of testing for differences between
means and proportions. The use of Student’s t.
Goodness-of-fit tests: The chi-squared distribution and contingency tables.
Regression and correlation: An introduction to the ideas of regression and
correlation, least squares, estimation of a, b and r, scatter diagrams.
2
1.5. Essential reading list and other learning resources
1.5
Essential reading list and other learning
resources
Numerous titles are available covering the topics frequently covered in foundation
statistics courses such as ST104a Statistics 1. Due to the inevitable heterogeneity
among students taking this half-course, some may find one author’s style easier to
understand than another’s.
That said, the recommended textbook for this course is:
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060].
This textbook contains many additional examples, complete with solutions to selected
even-numbered exercises. To ensure you fully understand each topic, you should
attempt these, and check your answers against the solutions provided. In addition, the
textbook introduces all the topics in an alternative style to this subject guide so if you
have difficulty following a specific point, you may wish to consult the textbook for
further clarification. A non-exhaustive list of other suggested textbooks is outlined
below under ‘Further reading’.
Statistical tables will be provided in the examination for ST104a Statistics 1, hence
you should also consult:
Lindley, D.V. and W.F. Scott New Cambridge Statistical Tables (Cambridge:
Cambridge University Press, 1995) second edition [ISBN 9780521484855].
These statistical tables form the basis of the statistical tables which are provided for
you to use in the examination. It is essential that you familiarise yourself with these in
advance rather than those printed in your textbook, as the method of navigating
through statistical tables is not uniform across all publications. In order to prepare
yourself, you should concentrate in particular on using Tables 4, 5, 7, 8, 9 and 10 of the
New Cambridge Statistical Tables. These relate to the (standard) normal, Student’s t
and chi-squared distributions.
Detailed reading references in this subject guide refer to the editions of the set textbook
listed above. Newer editions may have been published by the time you study this
half-course. You can use a more recent edition of the textbook – use the detailed
chapter and section headings and the index to identify relevant readings. Also, check
the virtual learning environment (VLE) regularly for updated guidance on readings.
1.5.1 Further reading
As previously mentioned, numerous alternative textbooks are available to purchase,
although references in this subject guide will be to the eighth edition of Newbold et al.
If you do decide to purchase a different/additional textbook, do ensure it covers all the
syllabus topics (listed earlier in this introductory chapter). A word of caution at this
point: statistical notation can differ between textbooks so be prepared for this.
Note that as long as you read the Essential reading you are then free to read around the
subject area in any textbook, paper or online resource. You will need to support your
learning by reading as widely as possible and by thinking about how these principles
apply in the real world. To help you read extensively, you have free access to the VLE
and University of London Online Library (see below). Other useful textbooks for this
course include:
Aczel, A.D. Complete Business Statistics (London: McGraw–Hill Higher
Education, 2009) seventh edition [ISBN 9780071287531].
Anderson, D.R., D.J. Sweeney, T.A. Williams, J. Freeman and E. Shoesmith
Statistics for Business and Economics (South-Western Cengage Learning, 2010)
eleventh edition [ISBN 9780324783247].
Lind, D.A., W.G. Marchal and S.A. Wathen Statistical Techniques in Business and
Economics (Boston: McGraw–Hill Higher Education, 2009) fourteenth edition
[ISBN 9780073401768].
Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and
Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN
9780471615170].
Chapters 9 and 10 of the subject guide concentrate on the concepts of surveys and
experimentation. You will find these topics particularly useful if you are studying
MN3141 Principles of marketing.
If you feel that you need to refresh the basic mathematics which you will need for this
half-course, and which is revised in Chapter 2, a recommendation is:
Anthony, M. and N. Biggs Mathematics for Economics and Finance (Cambridge:
Cambridge University Press, 1996) [ISBN 9780521559133] Chapters 1, 2 and 7.
1.6 How to study statistics
1.6.1 Mathematical background
To study and understand statistics you will need some familiarity with abstract
mathematical concepts combined with the common sense to see how to use these ideas
in real-life applications. The concepts needed for probability and statistical inference are
impossible to absorb by just reading them in a book – although you may find you need
to do this more than once! You need to read, then think, then try some problems, then
read and think some more. This process should be repeated until you find the problems
easy to do. Reading without practising with the examples and activities set in
this subject guide will be of little help.
You will also need to be able to use high-school arithmetic and understand some basic
algebraic ideas. These are very important. Starting with them should help you feel
comfortable with this half-course from the outset and these ideas are introduced to you
in Chapter 2.
1.6.2 Calculators
A calculator may be used when answering questions on the examination paper for
ST104a Statistics 1 and it must comply in all respects with the specification given in
the Regulations. You should also refer to the Admission Notice you will receive when
entering the examination and the ‘notice on permitted materials’.
The most important thing is that you should accustom yourself to using your chosen
calculator and feel comfortable with it. Specifically, calculators must:
be hand-held, compact and portable
be quiet in operation
have no external wires
be non-programmable
not be capable of receiving, storing or displaying user-supplied non-numerical data.
Typically, an internet search engine image search for the term ‘basic calculator’ will
show examples of permitted calculators.
1.7 How to use the subject guide
1.7.1 Using the subject guide and the textbook
This subject guide has been structured such that it is tailored to the specific
requirements of the examinable material. It is ‘written to the course’, unlike textbooks
which typically include additional material which is not examinable. Therefore, the
subject guide should act as your principal resource and you should refer to the specific
sections in Newbold et al as indicated.
A key benefit of the textbook is that it contains numerous further examples and
exercises which can be used to check and consolidate your understanding. As previously
mentioned, solutions to selected even-numbered exercises are provided in the textbook
enabling you to check your answers to these questions.
1.7.2 Online study resources
In addition to the subject guide and the Essential reading, it is vitally important that
you take advantage of the study resources which are available online for this half-course,
including the VLE and Online Library.
You can access the VLE, Online Library and your University of London email account
via the Student Portal: http://my.londoninternational.ac.uk
You should have received your login details for the Student Portal with your official
offer, which was emailed to the address that you gave on your application form. You
have probably already logged in to the Student Portal in order to register. As soon as
you registered, you will automatically have been granted access to the VLE, Online
Library and your fully functioning University of London email account.
If you have forgotten these login details, please click on the ‘Forgotten your password’
link on the login page.
The Student Portal forms an important part of your study experience with the
University of London and should, therefore, be accessed regularly.
1.7.3 Virtual learning environment (VLE)
The VLE, which complements this subject guide, has been designed to enhance your
learning experience, providing additional support and a sense of community. In addition
to making materials more accessible, the VLE provides an open space for you to discuss
interests and to seek support from other students, working collaboratively to solve
problems and discuss subject material. In a few cases, such discussions are driven and
moderated by an academic who offers a form of feedback on all discussions. In other
cases, video material, such as audio-visual tutorials, is available. These will typically
focus on taking you through difficult concepts in the subject guide. For quantitative
courses, such as Mathematics and Statistics, fully worked-through solutions of practice
examination questions are available. For some qualitative courses, academic interviews
and debates will provide you with advice on approaching the subject and examination
questions, and will show you how to build an argument effectively.
Past examination papers and Examiners’ commentaries from the past three years are
available to download which provide advice on how each examination question might
best be answered. Self-testing activities allow you to test your knowledge and recall of
the academic content of various courses. Finally, a section of the VLE has been
dedicated to providing you with expert advice on preparing for examinations and
developing digital literacy skills.
1.7.4 Making use of the Online Library
The Online Library contains a huge array of journal articles and other resources to help
you read widely and extensively.
To access the majority of resources via the Online Library you will either need to use
your University of London Student Portal login details, or you will be required to
register and use an Athens login: http://tinyurl.com/ollathens
The easiest way to locate relevant content and journal articles in the Online Library is
to use the Summon search engine.
If you are having trouble finding an article listed on the reading list, try removing any
punctuation from the title, such as single quotation marks, question marks and colons.
For further advice, please see the online help pages:
www.external.shl.lon.ac.uk/summon/about.php
1.7.5 Structure of the subject guide
Statistics is fundamentally a cumulative discipline – that is, the following chapters
are not a series of self-contained topics, rather they build on each other
sequentially. As such, you are strongly advised to follow the subject guide in chapter
order, understanding the material in the early chapters before embarking on later
material. There is little point in rushing past material which you have only partially
understood in order to reach the later chapters.
1.7.6 Time management
About one-third of your private study time should be spent reading and the other
two-thirds doing problems. (Note the emphasis on practising problems!) If you are
intending to study ST104a Statistics 1 over the course of one academic year, then it
is recommended that you devote a minimum of seven hours per week to your studies.
To help your time management, the chapters and topics of this half-course have been
converted into approximate weeks to devote to each subject if you were, for example,
spending 20 weeks on ST104a Statistics 1. What you should gain from the following
breakdown is an indication of the relative amounts of time to be spent on each topic.
Bear in mind, however, that some of you may not need to spend as much time on
Chapter 2 if the concepts and techniques of basic arithmetic and algebra (in particular
the use of summation signs, the equation of a straight line, and the idea of a uniform
distribution) are familiar to you.
Chapter 2 – 2 weeks
Chapter 3 – 1 week
Chapter 4 – 2 weeks
Chapter 5 – 1 week
Chapter 6 – 1 week
Chapter 7 – 1 week
Chapter 8 – 1 week
Chapter 9 – 1 week
Chapter 10 – 2 weeks
Chapter 11 – 1 week
Chapter 12 – 2 weeks
Revision(!) – 5 weeks

1.7.7 Recommendations for working through the chapters
The following procedure is recommended for each chapter.
1. Carefully read the aims of the chapter and the learning outcomes.
2. Read the introduction, noting carefully the sections of Newbold et al to which you
are referred.
3. Now work through each section of the chapter making sure you understand the
examples given. In parallel, watch the accompanying video tutorials for each
section.
4. Review the intended learning outcomes carefully, almost as a checklist. Do you
think you have achieved your targets?
5. Attempt the learning activities which appear near the end of each chapter.
6. Attempt the chapter’s self-test quizzes on the VLE.
7. Attempt the sample examination questions given at the end of the chapter. You
can review the video solutions, but do so after attempting the questions yourself!
8. When you have finished the material in the chapter, try the suggested questions
from Newbold et al. You can treat these as additional activities. This time, though,
you will have to think a little about which part of the new material you have learnt
is appropriate to each question.
9. If you have problems at this point, go back to the subject guide and work through
the area you find difficult again. Do not worry – you will improve your
understanding to the point where you can work confidently through the problems.
10. Once you have completed your work on the subject guide, you will be ready for
examination revision. Start with work on the Sample examination paper which you
will find at the back of the subject guide.
The last few steps are most important. It is easy to think that you have understood the
text after reading it, but working through problems is the crucial test of
understanding. Problem-solving should take most of your study time (refer to the
‘Time management’ section above). Note that worked examples and activities have been
provided to cover each substantive topic in the subject guide. The Newbold et al
examples are added for further consolidation of each chapter topic and also to help you
work out exactly what the questions are about! One of the problems candidates
sometimes have in an examination is that they waste time trying to understand to
which part of the syllabus particular questions relate. These final questions, together
with the further explanations on the VLE, aim to help with this before you tackle the
sample examination questions at the end of each chapter.
Try to be disciplined about this: do not look up the answers until you have done your
best. Promise? Statistical ideas may seem unfamiliar at first, but your attempts at the
questions, however dissatisfied you feel with them, will help you understand the
material far better than reading and re-reading the prepared answers – honestly!
So, to conclude, perseverance with problem solving is your passport to a strong
examination performance. Attempting (ideally successfully!) all the cited exercises is of
paramount importance.
1.8 Examination advice
Important: the information and advice given in the following section are based on the
examination structure used at the time this subject guide was written. Note that
subject guides may be used for several years. Because of this, it is strongly advised that
you check both the current Regulations for relevant information about the examination,
and the VLE where you should be advised of any forthcoming changes. You should also
carefully check the rubric/instructions on the paper you actually sit and follow those
instructions.
This half-course is assessed by a two-hour, unseen, written examination. No books may
be taken into the examination, but you will be provided with the necessary extracts
from the New Cambridge Statistical Tables and a formula sheet (which can be found in
past examination papers on the VLE, as well as at the end of the subject guide). These
will be supplied as part of the examination question paper for ST104a Statistics 1
rather than being provided separately. A calculator may be used when answering
questions on this paper and it must comply in all respects with the specification given
in the Regulations.
Section A is a series of short compulsory questions worth 50 per cent of the total marks.
In Section B, you should attempt two out of a choice of three longer questions, each of
which is worth 25 per cent of the total marks. As Section A will seek to assess a broad
cross-section of the syllabus, you are strongly advised to study the whole syllabus. A
sample examination paper is provided at the end of this subject guide.
Candidates should be aware that graph paper will be supplied as part of the
examination question paper for this half-course, so you will not need to request it from
the invigilator. Candidates will need to detach this graph paper from the back of the
question paper and tie any sheets used into the examination answer booklet.
Candidates should, therefore, practise using this type of graph paper as part of their
preparation for the examination.
Examiners will expect this graph paper to be used where indicated and they will be
looking for great precision in the answers to these questions.
Strategies for success, your academic and study skills handbook, also provides
additional guidance on examination technique.
Remember, it is important to check the VLE for:
up-to-date information on examination and assessment arrangements for this
half-course
where available, past examination papers and Examiners’ commentaries for this
half-course which give advice on how each question might best be answered.
Chapter 2
Mathematical revision – simple algebra and coordinate geometry
2.1 Aims
This chapter outlines the essential mathematical building blocks which you will need to
work with in this half-course. Most of them will be revision to you, but some new
material is introduced. Particular aims are:
to be familiar with the basic rules of arithmetic operations
to learn further notation concerned with absolute value, ‘greater than’ (>) and ‘less
than’ (<) signs
to use summation signs to calculate simple statistical measures
to learn how to plot a straight line graph and identify its slope and intercept.
2.2 Learning outcomes
After completing this chapter, and having completed the recommended reading and
activities, you should be able to:
manipulate arithmetic and algebraic expressions using the simple rules
recall and use common signs: square, square root, ‘greater than’, ‘less than’ and
absolute value
demonstrate use of the summation sign and explain the use of the ‘i’, or index, of x
draw the straight line for a linear function.
2.3 Recommended reading
Anthony, M. and N. Biggs Mathematics for Economics and Finance (Cambridge:
Cambridge University Press, 1996) [ISBN 9780521559133] Chapters 1, 2 and 7.
You may find Anthony and Biggs, or any other appropriate non-specialist mathematics
textbook, helpful as background for the mathematics you will need for this half-course.
If you are studying for MT105a Mathematics 1, that half-course will support your
studies in ST104a Statistics 1. However, you will not need all of the material in
MT105a Mathematics 1 here.
In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials
accessible via the ST104a Statistics 1 area at https://emfss.elearning.london.ac.uk
2.4 Introduction
This opening chapter introduces some basic concepts and mathematical tools on which
the rest of the half-course is built. Before proceeding to the rest of the subject guide, it
is essential that you have a solid understanding of these fundamental concepts and tools.
You should be confident users of the basic mathematical operations: addition,
subtraction, multiplication and division, and be able to use these operations on a basic
electronic calculator. The content of this chapter is expected to be a ‘refresher’ of the
elementary algebraic and arithmetic rules from schooldays. Some material featured in
this chapter may be new to you, such as summation signs and graphs of linear
functions. If so, you should master these new ideas before progressing.
Finally, remember that although it is unlikely that an examination question would test
you on the topics in this chapter alone, the material covered here may well be an
important part of the answer!
2.5 Arithmetic operations
We begin with elementary arithmetic operations which will be used when working with
figures in ST104a Statistics 1. Students often let themselves down by understanding
the statistical concepts, but fail to manage a problem because they are unable to deal
with the required arithmetic. Although this is not primarily an arithmetic paper, many
calculations will be used, so it is vital to ensure that you are comfortable with the
examples and activities presented here.
The acronym to remember is ‘BODMAS’, which tells us the correct order (that is, the
priority) in which mathematical operations are performed:
Brackets
Order (i.e. powers, square roots etc.)
Division
Multiplication
Addition
Subtraction.
You should also know that:
the sum of a and b means a + b
the difference between a and b means either a − b or b − a
the product of a and b means a × b = a · b
the quotient of a and b means a divided by b, i.e. a/b.
Example 2.1 What is (35 ÷ 7 + 2) − (4² − 8 × 3)?
‘BODMAS’ tells us to work out brackets first. Here there are two sets of brackets, so
let us do them one at a time.
First bracket: 35 ÷ 7 + 2
do division first: 35 ÷ 7 + 2 = 5 + 2
then perform the addition: 5 + 2 = 7.
Second bracket: 4² − 8 × 3
do order first: 4² − 8 × 3 = 16 − 8 × 3
next do multiplication: 16 − 8 × 3 = 16 − 24
then perform the subtraction: 16 − 24 = −8.
Now the problem has been simplified we complete the calculation with the final
subtraction: 7 − (−8) = 7 + 8 = 15. Note the two negatives become positive!
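As a check on such arithmetic, Python's own operator precedence mirrors BODMAS, so the example can be verified directly. (This code is an illustrative aside, not part of the original example.)

```python
# Python applies the same precedence as BODMAS: brackets, then powers ('order'),
# then division/multiplication, then addition/subtraction.
first_bracket = 35 / 7 + 2       # division before addition: 5 + 2 = 7
second_bracket = 4 ** 2 - 8 * 3  # power and multiplication first: 16 - 24 = -8
result = first_bracket - second_bracket
print(result)  # 15.0, i.e. 7 - (-8)
```

Removing the brackets would change the answer, which is exactly the point of the BODMAS rules.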
2.6 Squares and square roots
The power is the number of times a quantity is to be multiplied by itself. For example,
3⁴ = 3 × 3 × 3 × 3 = 81. Any number raised to the power 2 is called ‘squared’, hence x²
is ‘x squared’, which is simply x × x.
Remember that squared values, such as x², are always non-negative. This is important,
for example, when we compute the quantity s² in Chapter 4 which involves squared
terms, so a negative answer should ring alarm bells telling us a mistake has been made!
It might be helpful to think of the square root of x (denoted √x) as the reverse of the
square, such that √x × √x = x. Note that positive real numbers have two square roots:
±√81 = ±9, although the positive square root will always be used in ST104a
Statistics 1. In practice, the main problems you will encounter involve taking square
roots of numbers with decimal places. Be careful that you understand that 0.9 is the
square root of 0.81, and that 0.3 is the square root of 0.09 (and not 0.9!). Of course, in
the examination you can perform such calculations on your calculator, but it always
helps to have an idea of what the answer should be as a feasibility check of your answer!
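As a quick sanity check of the decimal examples above, a few lines of Python confirm them. (This is an illustrative aside, not part of the guide's examples.)

```python
import math

# the square root of 0.81 is 0.9 ...
print(math.sqrt(0.81))  # approximately 0.9
# ... and the square root of 0.09 is 0.3 (and not 0.9!)
print(math.sqrt(0.09))  # approximately 0.3
# squaring reverses the square root (up to floating-point rounding)
print(math.sqrt(0.09) ** 2)  # approximately 0.09
```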
2.7 Fractions and percentages
A fraction is part of a whole and can be expressed as either:
common fractions: for example, 1/2 or 3/8
decimal fractions: for example, 0.5 or 0.375.
In the common fraction, the top number is the numerator and the bottom number is
the denominator. In practice, decimal fractions are more commonly used.
When multiplying fractions together, just multiply all the numerators together to
obtain the new numerator, and do the same with the denominators. For example:
(4/9) × (1/3) × (2/5) = (4 × 1 × 2)/(9 × 3 × 5) = 8/135.
Percentages give an alternative way of representing fractions by relating a particular
quantity to the whole in parts per hundred. For example, 60% is 60 parts per 100,
which, as a common fraction, is simply 60/100.
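Python's standard-library `fractions` module reproduces this arithmetic exactly, which makes it a handy way to check common-fraction calculations. (An illustrative aside; the module is not mentioned in the guide itself.)

```python
from fractions import Fraction

# multiply the numerators together and the denominators together
product = Fraction(4, 9) * Fraction(1, 3) * Fraction(2, 5)
print(product)  # 8/135

# a percentage relates a quantity to the whole in parts per hundred
sixty_percent = Fraction(60, 100)
print(sixty_percent)         # 3/5 after cancelling
print(float(sixty_percent))  # 0.6 as a decimal fraction
```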
2.8 Some further notation
2.8.1 Absolute value
One useful sign in statistics is | | which denotes the absolute value. This is the
numerical value of a real number regardless of its sign (positive or negative). The
absolute value of x, sometimes referred to as the modulus of x, or ‘mod x’, is |x|. So
|7.1| = |−7.1| = 7.1.
Statisticians sometimes want to indicate that they only want to use the positive value of
a number. For example, let the distance between town X and town Y be 5 miles.
Suppose someone walks from X to Y – a distance of 5 miles. A mathematician would
write this as +5 miles. Later, after shopping, the person returns to X and the
mathematician would record him as walking −5 miles (taking into account the direction
of travel). Hence this way the mathematician can show the person ended up where they
started. We, however, may be more interested in the fact that the person has had some
exercise that day! So, we need notation to indicate this. The absolute value enables us
to take only the positive values of our variables. The distance, d, from Y to X may well
be expressed mathematically as −5 miles, but you will probably be interested in the
absolute amount, so |−d| = d.
2.8.2 Inequalities
An inequality is a mathematical statement that one quantity is greater or less than
another:
x > y means ‘x is greater than y’
x ≥ y means ‘x is greater than or equal to y’
x < y means ‘x is less than y’
x ≤ y means ‘x is less than or equal to y’
x ≈ y means ‘x is approximately equal to y’.

2.9 Summation operator, ∑
The summation operator, ∑, is likely to be new to many of you. It is widely used in
statistics and you will come across it frequently in ST104a Statistics 1, so make sure
you are comfortable using it before proceeding further!
Statistics is all about data analysis, so to use statistical methods we need data.
Individual observations are typically represented using a subscript notation. For
example, the heights of n people¹ would be called x₁, x₂, . . . , xₙ, where the subscript
denotes the order in which the heights are observed (x₁ represents the height of the first
person etc.). Hence xᵢ represents the height of the ith individual and, in order to list
them all, the subscript i must take all integer values from 1 to n. So the whole set of
observations is {xᵢ : i = 1, . . . , n} which can be read as ‘a set of observations xᵢ such
that i goes from 1 to n’.
Summation operator, ∑
The sum of a set of n observations, that is x₁ + x₂ + · · · + xₙ, may be written as:
∑ᵢ₌₁ⁿ xᵢ    (2.1)
by introducing the summation operator, ∑ (the Greek capital letter sigma), which
can be read as ‘the sum of’. Therefore, ∑ᵢ₌₁ⁿ xᵢ is read ‘sigma xᵢ, for i equals 1 to n’.
So the summation is said to be over i, where i is the index of summation and the
range of i, in (2.1), is from 1 to n. So the lower bound of the range is the value of i
written underneath ∑, and the upper bound is written above it. Note the lower bound
can be any integer (positive, negative or zero), such that the summation is over all
values of the index of summation in step increments of size one from the lower bound to
the upper bound.
As stated above, ∑ appears frequently in statistics. For example, in Chapter 4 you will
meet descriptive statistics including the arithmetic mean of observations which is
defined as:
x̄ = (1/n) ∑ᵢ₌₁ⁿ xᵢ.
As seen in this example, rather than write out ∑ᵢ₌₁ⁿ xᵢ in full, when all the xᵢs are
summed we sometimes write short-cuts, such as ∑₁ⁿ xᵢ, or (when the range of
summation is obvious) just ∑ xᵢ.
¹ Throughout this half-course, n will denote a sample size.
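The arithmetic mean defined above translates directly into code: sum the observations, then divide by n. A minimal Python sketch, using a made-up dataset purely for illustration:

```python
x = [1, 2, 3, 4, 5]  # a made-up sample of n = 5 observations
n = len(x)

# x-bar = (1/n) * (sum of x_i for i = 1 to n)
x_bar = sum(x) / n
print(x_bar)  # 3.0
```

Here Python's built-in `sum` plays the role of the summation operator over the whole list.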
Note that the resulting sum does not involve i in any form. Hence the sum is unaffected
by the choice of letter used for the index of summation, so, for example, the following
summations are all equal:
∑ᵢ₌₁ⁿ xᵢ = ∑ⱼ₌₁ⁿ xⱼ = ∑ₖ₌₁ⁿ xₖ.
Sometimes the way that xᵢ depends on i is known. For example, if xᵢ = i, we have:
∑ᵢ₌₁³ xᵢ = ∑ᵢ₌₁³ i = 1 + 2 + 3 = 6.
However, do not always assume xᵢ = i!
Example 2.2 If {xᵢ : i = 1, . . . , n} is a set of observations, we might observe
x₁ = 4, x₂ = 5, x₃ = 1, x₄ = −2 and x₅ = 9. Therefore:
∑ᵢ₌₁⁴ xᵢ² = 4² + 5² + 1² + (−2)² = 46
and:
∑ᵢ₌₄⁵ xᵢ(xᵢ − 2) = ∑ᵢ₌₄⁵ (xᵢ² − 2xᵢ) = ((−2)² − 2 × (−2)) + (9² − 2 × 9) = 71
remembering to use BODMAS in the second example.
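The two sums in Example 2.2 can be checked in Python; note that Python lists are 0-indexed, so the range i = 1, . . . , 4 corresponds to the slice `x[0:4]`. (An illustrative check, not part of the original example.)

```python
x = [4, 5, 1, -2, 9]  # x_1, ..., x_5 from Example 2.2

# sum of x_i squared, for i = 1 to 4
print(sum(xi ** 2 for xi in x[0:4]))        # 46

# sum of x_i * (x_i - 2), for i = 4 to 5 (elements x[3] and x[4])
print(sum(xi * (xi - 2) for xi in x[3:5]))  # 71
```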
Activity 2.1 A dataset contains the observations 1, 1, 1, 2, 4, 8, 9 (so here, n = 7).
Find:
(a) ∑ᵢ₌₁ⁿ 2xᵢ
(b) ∑ᵢ₌₁ⁿ xᵢ²
(c) ∑ᵢ₌₁ⁿ (xᵢ − 2)
(d) ∑ᵢ₌₁ⁿ (xᵢ − 2)²
(e) (∑ᵢ₌₁ⁿ xᵢ)²
(f) ∑ᵢ₌₁ⁿ 2.
Solution
We have that x₁ = 1, x₂ = 1, x₃ = 1, x₄ = 2, x₅ = 4, x₆ = 8 and x₇ = 9.
(a) We have:
∑ᵢ₌₁ⁿ 2xᵢ = 2x₁ + 2x₂ + 2x₃ + 2x₄ + 2x₅ + 2x₆ + 2x₇
= 2 × (x₁ + x₂ + x₃ + x₄ + x₅ + x₆ + x₇)
= 2 × (1 + 1 + 1 + 2 + 4 + 8 + 9)
= 52.
(b) We have:
∑ᵢ₌₁ⁿ xᵢ² = x₁² + x₂² + x₃² + x₄² + x₅² + x₆² + x₇²
= 1² + 1² + 1² + 2² + 4² + 8² + 9²
= 168.
(c) We have:
∑ᵢ₌₁ⁿ (xᵢ − 2) = (x₁ − 2) + (x₂ − 2) + (x₃ − 2) + (x₄ − 2) + (x₅ − 2) + (x₆ − 2) + (x₇ − 2)
= (1 − 2) + (1 − 2) + (1 − 2) + (2 − 2) + (4 − 2) + (8 − 2) + (9 − 2)
= 12.
(d) We have, extending (c):
∑ᵢ₌₁ⁿ (xᵢ − 2)² = (1 − 2)² + (1 − 2)² + (1 − 2)² + (2 − 2)² + (4 − 2)² + (8 − 2)² + (9 − 2)²
= 92.
(e) We have:
(∑ᵢ₌₁ⁿ xᵢ)² = (x₁ + x₂ + x₃ + x₄ + x₅ + x₆ + x₇)²
= (1 + 1 + 1 + 2 + 4 + 8 + 9)²
= (26)²
= 676.
Note that:
(∑ᵢ₌₁ⁿ xᵢ)² ≠ ∑ᵢ₌₁ⁿ xᵢ².
(f) We have:
∑ᵢ₌₁ⁿ 2 = 2 + 2 + 2 + 2 + 2 + 2 + 2 = 14.
Activity 2.2 Suppose that x₁ = 7, x₂ = 3, x₃ = 1, x₄ = 0, x₅ = −6, and y₁ = −3,
y₂ = 5, y₃ = −8, y₄ = 9, y₅ = 1. Calculate the following quantities:
(a) ∑ᵢ₌₂⁴ 2yᵢ
(b) ∑ᵢ₌₁³ 4(xᵢ − 1)
(c) y₁² + ∑ᵢ₌₃⁵ (xᵢ² + 2yᵢ²).
Solution
(a) We have:
∑ᵢ₌₂⁴ 2yᵢ = 2 × (y₂ + y₃ + y₄) = 2 × (5 − 8 + 9) = 12.
(b) We have:
∑ᵢ₌₁³ 4(xᵢ − 1) = 4 × ∑ᵢ₌₁³ (xᵢ − 1)
= 4 × ((x₁ − 1) + (x₂ − 1) + (x₃ − 1))
= 4 × ((7 − 1) + (3 − 1) + (1 − 1))
= 4 × (6 + 2 + 0)
= 32.
(c) We have:
y₁² + ∑ᵢ₌₃⁵ (xᵢ² + 2yᵢ²) = y₁² + (x₃² + 2y₃²) + (x₄² + 2y₄²) + (x₅² + 2y₅²)
= (−3)² + (1² + 2 × (−8)²) + (0² + 2 × 9²) + ((−6)² + 2 × 1²)
= 9 + 129 + 162 + 38
= 338.
Activity 2.3 Suppose that x₁ = −0.5, x₂ = 2.5, x₃ = −2.8, x₄ = 0.4, x₅ = 6.1, and
y₁ = −0.5, y₂ = 4.0, y₃ = 4.6, y₄ = −2.0, y₅ = 0. Calculate the following quantities:
(a) ∑ᵢ₌₃⁵ xᵢ²
(b) ∑ᵢ₌₁² 1/(xᵢyᵢ)
(c) y₄³ + ∑ᵢ₌₄⁵ yᵢ²/xᵢ.
Solution
(a) We have:
∑ᵢ₌₃⁵ xᵢ² = x₃² + x₄² + x₅² = (−2.8)² + (0.4)² + (6.1)² = 7.84 + 0.16 + 37.21 = 45.21.
(b) We have:
∑ᵢ₌₁² 1/(xᵢyᵢ) = 1/(x₁y₁) + 1/(x₂y₂) = 1/((−0.5) × (−0.5)) + 1/(2.5 × 4.0) = 4 + 0.1 = 4.1.
(c) We have:
y₄³ + ∑ᵢ₌₄⁵ yᵢ²/xᵢ = y₄³ + y₄²/x₄ + y₅²/x₅ = (−2.0)³ + (−2.0)²/0.4 + 0²/6.1 = −8 + 10 + 0 = 2.
Activity 2.4 Explain why:
∑ᵢ₌₁ⁿ xᵢ² ≠ (∑ᵢ₌₁ⁿ xᵢ)²  and  ∑ᵢ₌₁ⁿ xᵢyᵢ ≠ (∑ᵢ₌₁ⁿ xᵢ)(∑ᵢ₌₁ⁿ yᵢ).
Solution
Writing out the full summations, we obtain:
∑ᵢ₌₁ⁿ xᵢ² = x₁² + x₂² + · · · + xₙ² ≠ (x₁ + x₂ + · · · + xₙ)² = (∑ᵢ₌₁ⁿ xᵢ)².
Therefore, the ‘sum of squares’ is, in general, not equal to the ‘square of the sum’
because the expansion of the quadratic gives:
(x₁ + x₂ + · · · + xₙ)² = x₁² + x₂² + · · · + xₙ² + 2x₁x₂ + 2x₁x₃ + · · ·
so, unless all the cross-product terms sum to zero, the two expressions are not the
same. Hence, in general, the two expressions are different. Similarly:
∑ᵢ₌₁ⁿ xᵢyᵢ = x₁y₁ + x₂y₂ + · · · + xₙyₙ ≠ (x₁ + x₂ + · · · + xₙ)(y₁ + y₂ + · · · + yₙ) = (∑ᵢ₌₁ⁿ xᵢ)(∑ᵢ₌₁ⁿ yᵢ).
Therefore, the ‘sum of the products’ is, in general, not equal to the ‘product of the
sums’ because the expansion of the products gives:
(x₁ + x₂ + · · · + xₙ)(y₁ + y₂ + · · · + yₙ) = x₁y₁ + x₂y₂ + · · · + xₙyₙ + x₁y₂ + x₁y₃ + · · · .
Note the first case is a special case of the second if we set yᵢ = xᵢ.
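A quick numerical check makes this point concrete. Here we reuse the Activity 2.1 data for x and pair it with a second, made-up dataset y (both purely illustrative):

```python
x = [1, 1, 1, 2, 4, 8, 9]
y = [2, 0, 1, 3, 5, 1, 4]  # made-up second dataset for illustration

sum_of_squares = sum(xi ** 2 for xi in x)  # 168
square_of_sum = sum(x) ** 2                # 26 squared = 676
print(sum_of_squares, square_of_sum)       # not equal, as the expansion shows

sum_of_products = sum(xi * yi for xi, yi in zip(x, y))
product_of_sums = sum(x) * sum(y)
print(sum_of_products, product_of_sums)    # 73 and 416: again not equal
```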
2.10 Graphs
In Chapter 4 you will spend some time learning how to present material in graphical
form, and also in the representation of the normal distribution in Chapter 6. You should
make sure you have understood the following material. If you are taking MT105a
Mathematics 1, you will need and use these ideas. If you are not, you are encouraged
to read up on and practise examples in Anthony and Biggs.
When a variable y depends on another variable x, we can represent the relationship
mathematically using functions. In general we write this as y = f (x), where f is the
rule which allows us to determine the value of y when we input the value of x. Graphs
are diagrammatic representations of such relationships, using coordinates and axes. The
graph of a function y = f (x) is the set of all points in the plane of the form (x, f (x)).
Sketches of graphs can be very useful. To sketch a graph, we begin with the x-axis and
y-axis as shown in Figure 2.1.
Figure 2.1: Graph axes.
We then plot all points of the form (x, f (x)). Therefore, at x units from the origin (the
point where the axes cross), we plot a point whose height above the x-axis (that is,
whose y-coordinate) is f (x), as shown in Figure 2.2.
Joining all points together of the form (x, f (x)) results in a curve (or sometimes a
straight line), which is called the graph of f (x). A typical curve might look like that
shown in Figure 2.3.
Figure 2.2: Example of a plotted coordinate.
Figure 2.3: The graph of a generic function, y = f (x).
2. Mathematical revision – simple algebra and coordinate geometry
Figure 2.4: A sketch of the linear function y = 3 + 2x (y-intercept 3, x-intercept −1.5).
Figure 2.5: A sketch of the linear function y = 2 − x (y-intercept 2, x-intercept 2).
However, you should not imagine that the correct way to sketch a graph is to plot a few
points of the form (x, f (x)) and join them up – this approach rarely works well in
practice and more sophisticated techniques are needed. There are two function types
which you need to know about for this half-course:
linear functions (i.e. the graph of a straight line, see below)
normal functions (which we shall meet frequently in later chapters).
2.11 The graph of a linear function
Linear functions are those of the form f (x) = a + bx and their graphs are straight
lines which are characterised by a gradient (or slope), b, and a y-intercept (where x = 0)
at the point (0, a).
A sketch of the function y = 3 + 2x is provided in Figure 2.4, and the function
y = 2 − x is shown in Figure 2.5.
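The effect of the two parameters can be seen by tabulating a few points. The sketch below (illustrative, not part of the guide) uses the function from Figure 2.4, y = 3 + 2x, with a = 3 and b = 2:

```python
# A linear function f(x) = a + b*x with y-intercept a = 3 and slope b = 2.
def f(x, a=3, b=2):
    return a + b * x

# Tabulate some points (x, f(x)) on the line.
points = [(x, f(x)) for x in range(-2, 3)]
print(points)  # [(-2, -1), (-1, 1), (0, 3), (1, 5), (2, 7)]

# The y-intercept is f(0) = a; the x-intercept solves a + b*x = 0, i.e. x = -a/b.
print(f(0), -3 / 2)  # 3 -1.5
```

Note that each unit increase in x raises y by the slope b = 2, which is exactly what the gradient measures.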
2.12 Summary
Much of this material should be familiar to you, but some may be new. Although it is
only a language or set of rules to help you deal with statistics, without it you will not
be able to make sense of the following chapters.
Before you continue, make sure you have completed all the learning activities below,
understood what you have done and, if necessary, worked through additional examples
in the basic textbooks.
2.13 Key terms and concepts

Absolute value
BODMAS
Fractions
Inequalities
Linear function
Percentages
Power
Square root
Summation operator

2.14 Learning activities
Solutions can be found in Appendix A.
1. Work out the following:
(a) (2 + 4) × (3 + 7)
(b) 1/3 of 12 − 4 ÷ 2
(c) (1 + 4)/5 × (100 − 98).
If you find these difficult at all, go to Anthony and Biggs, or your old school
textbook, and work on some more examples before you do anything else.
2. Work out the following (use a calculator where necessary):
(a) $\sqrt{16}$
(b) $(0.07)^2$
(c) $\sqrt{0.49}$.
3. (a) What is 98% of 200?
(b) Express 17/25 as a percentage.
(c) What is 25% of 98/144?
4. Provide the absolute values for:
(a) |−8|
(b) |15 − 9|.
5. (a) For which of 2, 3, 7 and 9 is x > 3?
(b) For which of 2, 3, 7 and 9 is x < 3?
(c) For which of 2, 3, 7 and 9 is x ≤ 3?
(d) For which of 2, 3, 7 and 9 is $x^2 \geq 49$?
6. Suppose x1 = 4, x2 = 1 and x3 = 2. For these figures, compute $\sum_{i=1}^{2} x_i^3$.
7. If n = 4, x1 = 2, x2 = 3, x3 = 5 and x4 = 7, compute:
(a) $\sum_{i=1}^{3} x_i$
(b) $\frac{1}{n}\sum_{i=1}^{4} x_i^2$.
8. Given x1 = 3, x2 = 1, x3 = 4, x4 = 6 and x5 = 8, find:
(a) $\sum_{i=1}^{5} x_i$
(b) $\sum_{i=3}^{4} x_i^2$.
Given also that p1 = 1/4, p2 = 1/8, p3 = 1/8, p4 = 1/3 and p5 = 1/6, find:
(c) $\sum_{i=1}^{5} p_i x_i$
(d) $\sum_{i=3}^{5} p_i x_i^2$.
If you find these difficult, go back to an elementary textbook and do some more
work on this. It is most important that you deal with this before you embark on
the topic of descriptive statistics, such as means, in Chapter 4.
9. Sketch the following linear functions:
(a) y = 3 + x
(b) y = −2 + 3x.
You are going to need equations like these for all the material on regression in
Chapter 12.
2.15 A reminder of your learning outcomes
After completing this chapter, and having completed the recommended reading and
activities, you should be able to:
manipulate arithmetic and algebraic expressions using the simple rules
recall and use common signs: square, square root, ‘greater than’, ‘less than’ and
absolute value
demonstrate use of the summation sign and explain the use of the ‘i’, or index, of x
draw the straight line for a linear function.
2.16 Sample examination questions
Solutions can be found in Appendix B.
1. Suppose that x1 = −0.2, x2 = 2.5, x3 = −3.7, x4 = 0.8, x5 = 7.4, and y1 = −0.2, y2 = 8.0, y3 = 3.9, y4 = −2.0, y5 = 0. Calculate the following quantities:
(a) $\sum_{i=3}^{5} x_i^2$
(b) $\sum_{i=1}^{2} \frac{1}{x_i y_i}$
(c) $y_4^3 + \sum_{i=4}^{5} \frac{y_i^2}{x_i}$.
2. Suppose that y1 = −2, y2 = −5, y3 = 1, y4 = 16, y5 = 10, and z1 = 8, z2 = −5, z3 = 6, z4 = 4, z5 = 10. Calculate the following quantities:
(a) $\sum_{i=1}^{3} z_i^2$
(b) $\sum_{i=4}^{5} \sqrt{y_i z_i}$
(c) $z_4^2 + \sum_{i=1}^{3} \frac{1}{y_i}$.
Chapter 3
The nature of statistics
3.1 Aims
This chapter gives a general introduction to some of the statistical ideas which you will
be learning about in this course. It should also enable you to link different parts of the
syllabus and see their relevance to each other and also to the other courses you are
studying. You should aim to:
have an overview of the ideas to be introduced in later chapters
connect the ideas of sampling to real-life problems
be able to decide to take topics in a different sequence from the subject guide
where this is more appropriate for your overall study plan.
3.2 Learning outcomes
After completing this chapter, and having completed the essential reading and
activities, you should be able to:
explain the difference between a population, a sample and a census
discuss the idea of systematic bias
describe simple random sampling and justify its use
explain the principle of random sampling
identify what the parameter(s) of a function is (are)
and you should then be ready to start studying statistics!
It is rare for examination questions to focus solely on the topics covered in this chapter.
However, a good understanding of the ideas covered is essential when answering the
sample examination questions given in Chapters 10 and 11 of the subject guide.
3.3 Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060]
Chapter 1 and Section 6.1.
3.4 Further reading
Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and
Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN
9780471615170] Chapter 1.
3.5 Introduction
This chapter sets the framework for the subject guide. Read it carefully, because the
ideas introduced are fundamental to this subject and, although superficially easy, give
rise to profound difficulties both of principle and of methodology. It is important that
you think about these ideas as well as being thoroughly familiar with the
techniques explained in Chapter 2 before you begin work on later chapters.
3.6 Coverage of the course
ST104a Statistics 1 is divided into six main sections:
1. Data visualisation and descriptive statistics (Chapter 4).
This covers: Ways of summarising data; measures of location and dispersion;
graphical presentation.
2. Probability theory (Chapter 5).
This covers: Probability; probability trees.
3. The normal distribution and ideas of sampling (Chapter 6).
This covers: Definitions of random variables; expectations and variances;
normal distribution; central limit theorem.
4. Decision analysis (Chapters 7, 8 and 9).
This covers: Estimation and confidence intervals for means, proportions and
differences; Student’s t distribution; hypothesis testing; Type I and Type II
errors; contingency tables.
5. Sampling design (Chapter 10).
This covers: Key features of different data sampling techniques.
6. Modelling for decision-making (Chapters 11 and 12).
This covers: Measures of correlation; spurious correlation; model selection;
model testing.
In the presence of a real-world problem, the aim is to develop a quantitative framework
for decision-making under uncertainty. The first step is to conduct an experiment and
collect data on which to base the decisions (Chapter 10). The second step is to explore
the data (Chapter 4) and the third step is to assign good probability models (Chapters
5 and 6). The final step is the task of statistical inference (or else decision analysis) and
this uses the material of Chapters 7, 8 and 9, as well as Chapters 11 and 12.
3.7 Terminology
Before progressing to Chapter 4, it is worth spending time familiarising yourself with
some important terminology and ideas which you will meet later on in the half-course
(and also in ST104b Statistics 2, for those studying that as well).
In brief, statistics is a mathematical science concerned with analysing and interpreting
data. Data collection is not a zero-cost exercise – costs, in addition to effort, include
both time and money and, in the real world, time and budget resources are finite! (If we
did perform a total enumeration of the population, then this is called a census.) Who
might collect data? In principle anyone – governments, corporations, individuals etc.
As is so often the case in statistics, some words have technical meanings
which overlap with their common everyday use but are not the same. It is
important that you are aware of these so that you can read and use them
correctly.
3.7.1 Population
Population is one such word. Most people think of the word as being about the
number of people living in a country. Statisticians, however, use it more generally to
refer to the collection of items¹ which we would like to study. So, for example:
if we were studying children aged two to five years old we would refer to them as
the study ‘population’ and not consider older or younger children, or their parents
if we were making an analysis of toy retailers in a country, we might look at the
‘population’ of toy shops.
Often, it is not feasible to collect full information on the entire population due to the
size of many populations² (and the time and budget constraints described above).
Therefore, we collect a sample drawn from the population (see below).
It can sometimes be difficult to decide which population should be sampled. For
instance, if we wished to sample n listeners of a radio station specialising in music,
should the population be of listeners of that radio station in general, or of listeners to
that station’s classical music programme, or perhaps just regular listeners, or any one of
many other possible populations which you can construct for yourself? In practice, the
population is often chosen by finding one which is easy to sample from, and which may
not be the population of first choice – that is, our survey population may differ from
our target population (of course, in an ideal world the two would be the same).
¹ Note 'items' need not be restricted to people – researchers may wish to study a variety of different populations: fish, countries, dictators, businesses, alumni, criminals, politicians etc.
² Indeed, the population size may well be finite, but unknown. For example, how many fish live in the sea?
In medical trials (which are an important statistical application), the population may
be those patients who arrive for treatment at the hospital carrying out the trial, and
this may be very different from one hospital to another. If you look at any collection of
official statistics (which are most important for public policymaking) you will be struck
by the great attention which is given to defining the population which is surveyed or
sampled, and to the definition of terms. For instance, it has proved difficult to get
consensus on the meaning of ‘unemployed’ in recent years, but a statistician must be
prepared to investigate the population of unemployed persons. Think about this
carefully; it is particularly important to those of you studying sociology or marketing.
3.7.2 Bias
Besides the common-sense meaning of bias, in statistics there is a more technical
meaning.³ Intuitively, bias is something we should aim to minimise (ideally we would
wish to eliminate it entirely). A full discussion of bias will be given later in the subject
guide, but for now remember that the term systematic bias refers to an estimation
methodology which results in systematic errors, i.e. consistently over-estimating or
under-estimating some population parameter (discussed shortly). It would be difficult to
justify a methodology which was biased, hence sample survey methodology places great
emphasis on unbiased surveys and unbiased estimators.
3.7.3 Parameters
At this point it is useful to think a little about parameters. We can define a parameter
(or set of parameters) as a measure (or set of measures) which completely describes a
function.
For example, in the linear equation y = a + bx, the parameters⁴ are a and b which
completely determine the position of a straight line in a graph (where a denotes the
y-intercept and b denotes the slope). Hence as a, b, or both vary, then we obtain
different lines.
Consider another application: you are asked to draw a circle. How do you do this –
what things do you need to know in order to draw it? A little thought makes us realise
that, of course, you need to know the centre of the circle. You also need to know how
large it is – we need to know its radius. If we only knew the centre we could draw
several different, but concentric, circles; if we knew the radius, but not the centre, then
we could again draw several circles each of the same size, but with different positions in
space. There is, however, only one circle which has a given centre and radius. This very
interesting idea of finding a minimum number of properties (attributes) to define a
shape was first described by the Greek mathematical philosophers who gave them their
names.
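The centre-and-radius idea can be sketched in code. The function below is purely illustrative (it is not part of the guide): the two parameters, centre and radius, fully determine every point on the circle.

```python
import math

# A circle is completely determined by two parameters:
# its centre (cx, cy) and its radius r.
def point_on_circle(cx, cy, r, angle):
    """Return the point on the circle at the given angle (in radians)."""
    return (cx + r * math.cos(angle), cy + r * math.sin(angle))

# Same centre, different radii -> different (concentric) circles.
p1 = point_on_circle(0.0, 0.0, 1.0, 0.0)
p2 = point_on_circle(0.0, 0.0, 2.0, 0.0)
print(p1, p2)  # (1.0, 0.0) (2.0, 0.0)
```

Changing either parameter changes which circle is drawn, but fixing both pins down exactly one circle.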
In Chapter 6 you will meet the normal distribution. One of the reasons statisticians like to use the normal distribution is that, despite its complicated functional form, it is completely determined by two parameters: its mean, μ, and variance, σ².⁵ If we know these two parameter values, we can draw a unique normal curve. As you will discover, this is extremely useful for working out probabilities and confidence intervals, and for testing hypotheses (part of statistical inference – covered in Chapters 7 and 8).

³ It is covered in more technical detail in ST104b Statistics 2 when exploring the properties of estimators.
⁴ Mathematical textbooks will typically use a and b as the parameters of a line; however, in statistics we typically use α and β for the y-intercept and slope, respectively. The linear function is then written as y = α + βx. Of interest later on will be the estimation of these parameters when their values are unknown – this is achieved (for a line) in regression (Chapter 12).
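As an illustrative sketch (using Python's standard library, not material from the guide), knowing μ and σ alone is enough to compute any probability for a normal distribution:

```python
from statistics import NormalDist

# The two parameters mu and sigma completely determine a normal distribution,
# so any probability can be computed from them alone.
d = NormalDist(mu=0, sigma=1)  # the standard normal distribution

# P(Z <= 1.96) for the standard normal, a value used later when
# constructing 95% confidence intervals.
print(round(d.cdf(1.96), 4))  # 0.975
```

Two different parameter pairs give two different curves; a single pair gives exactly one.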
3.7.4 Sampling
As discussed above, a census is not usually a feasible option; however, we may wish to
know the value of some attribute of a particular population (specifically, the value of a
population parameter). In the absence of population data, we resort to drawing a
sample from the population of interest, and use the sample observations (our data) to
estimate a certain parameter when its true value is unknown (due to the absence of
population-level data).6
Of course, although sampling avoids the costs associated with a census, it is possible to
make an argument against the practice. One could argue that since some members of
the population will be excluded from the sample, then the process is inherently
undemocratic. However, although the omission of some population members will likely
give rise to sampling error (the difference between the estimate of a parameter and its
true value), sampling might well be more accurate, since more time can be spent
verifying the sample information collected. The key aim of sampling is to select a
representative sample from the population. Chapter 10 will explore various sampling
techniques in depth. A brief summary of the main themes is outlined below.
Random sampling
In order to collect a random sample, the researcher needs a sampling frame – a list of
all population members. For example, if you want to investigate UK secondary schools,
the sampling frame would be a list of all the secondary schools in the UK. The key
advantages of random sampling are:
it avoids systematic bias
it allows an assessment of the size of the sampling error.
Access to a sampling frame allows a known, non-zero probability of selection to be
attached to each population unit. This is the main feature of a random sample, hence
the collection of random sampling techniques is known as probability sampling. A
simple random sample is a special case where each population unit has an equal,
known, non-zero probability of selection. The two most common kinds of non-simple
random samples are:
stratified sampling, where the precision of estimates should be improved if the stratification factors are appropriate (i.e. we divide the sampling frame by stratification factors such as gender or age group, and take a simple random sample from each group)

⁵ Concepts formally introduced in Chapter 4.
⁶ This is what we term statistical inference, i.e. inferring 'things' (such as parameter values) about a population based on sample data.
cluster sampling, where precision may not be better, but time and money may
be saved on interview surveys involving travel to the site of the interview by only
selecting individuals from some clusters.
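A stratified design can be sketched in a few lines of code. The sampling frame and stratum labels below are hypothetical, chosen only to illustrate the mechanics:

```python
import random

random.seed(1)  # reproducible illustration

# Hypothetical sampling frame: (unit id, stratum) pairs.
frame = [(i, "male" if i % 2 == 0 else "female") for i in range(100)]

# Stratified sampling: take a simple random sample within each stratum.
def stratified_sample(frame, per_stratum):
    sample = []
    strata = {s for _, s in frame}
    for s in strata:
        units = [u for u in frame if u[1] == s]
        sample.extend(random.sample(units, per_stratum))
    return sample

sample = stratified_sample(frame, per_stratum=5)
print(len(sample))  # 10 (5 from each of the 2 strata)
```

By construction, each stratum is guaranteed equal representation in the sample, which is exactly the restricted randomisation described above.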
Randomisation (the process by which a random sample is selected) is popular since it
avoids the biases caused by the prejudices of the person taking the sample. Of course,
just by chance, a random sample may have the appearance of extreme bias – for
example, drawing from a population of men and women produces a sample containing
men or women only. Such difficulties are usually resolved with some suitable form of
restricted randomisation. One might for instance, in this example, use a stratified
random sample (by gender) and select half the sample from all men and the other half
from all women.
Random sampling is carried out, for many populations, by choosing at random without replacement⁷ a subset of the very large number of population units, N.⁸ If the sample (of size n) is a sufficiently small proportion of the population (i.e. n/N is sufficiently small), then there is little-to-no practical difference between the inferences possible using sampling with replacement and sampling without replacement.
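The distinction between the two schemes can be illustrated with Python's standard library (hypothetical population, not from the guide):

```python
import random

random.seed(2)  # reproducible illustration

population = list(range(1000))  # hypothetical population, N = 1000

# Sampling without replacement: no unit can appear twice.
without = random.sample(population, 10)

# Sampling with replacement: the same unit may be drawn more than once.
with_repl = [random.choice(population) for _ in range(10)]

print(len(set(without)))  # 10 -- the units are always distinct
# When n/N is small (here 10/1000), repeats under with-replacement sampling
# are unlikely, so the two schemes behave almost identically in practice.
```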
3.8 Summary
This chapter has laid out the coverage of ST104a Statistics 1, detailing the six main sections of the syllabus. Many frequently-cited terms in statistics have been introduced, providing an introductory overview of the issue of sampling – specifically, probability sampling techniques. These will be explored in much greater depth in Chapter 10. Also,
the concept of parameters was introduced which is extremely important since statistical
inference (covered in Chapters 7, 8, 9 and 12) is concerned with the estimation and
testing of unknown parameters.
3.9 Key terms and concepts
Census
Cluster sampling
Parameter
Population
Sample
Sampling error
Sampling frame
Simple random sample
Stratified sampling
Systematic bias

⁷ Sampling without replacement means that once a population unit is selected it is permanently removed from the remaining population, which ensures no population unit is repeatedly observed.
⁸ That is, N denotes the population size.
3.10 Learning activities
Solutions can be found in Appendix A.
1. What parameters could you find for a circle?
2. Explain the difference between a simple random sample and a random sample.
Name two kinds of non-simple random samples.
3.11 A reminder of your learning outcomes
After completing this chapter, and having completed the essential reading and
activities, you should be able to:
explain the difference between a population, a sample and a census
discuss the idea of systematic bias
describe simple random sampling and justify its use
explain the principle of random sampling
identify what the parameter(s) of a function is (are).
3.12 Self assessment
This chapter is an introduction to ideas you will meet again later. Check that you
understand the content of the syllabus.
If you want to study the ideas of sampling and causation in greater detail now before
studying the theoretical statistical concepts in the next chapters, you could go straight
to Chapters 10 and 11, before returning to Chapter 4.
Chapter 4
Data visualisation and descriptive
statistics
4.1 Aims
This chapter contains two separate but related themes, both to do with the
understanding of data. These are:
to find graphical representations for data which allow us to see their most
important characteristics
to calculate simple numbers, such as the mean or standard deviation, which will
summarise those characteristics.
4.2 Learning outcomes
After completing this chapter, and having completed the essential reading and
activities, you should be able to:
calculate the following: arithmetic mean, median, mode, standard deviation,
variance, quartiles, range and interquartile range
explain the use and limitations of the above quantities
draw and interpret histograms, stem-and-leaf diagrams, boxplots and cumulative
frequency distributions
incorporate labels and titles correctly in your diagrams and state the units which
you have used.
In summary, you should be able to use appropriate measures and diagrams in order to
explain and clarify data which you have collected or which are presented to you.
4.3 Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060]
Sections 1.3, 1.5 and 2.1–2.3.
In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials
accessible via the ST104a Statistics 1 area at https://emfss.elearning.london.ac.uk
4.4 Further reading
Aczel, A.D. Complete Business Statistics (London: McGraw–Hill Higher
Education, 2009) seventh edition [ISBN 9780071287531] Chapter 1.
Anderson, D.R., D.J. Sweeney, T.A. Williams, J. Freeman and E. Shoesmith
Statistics for Business and Economics (South-Western Cengage Learning, 2010)
eleventh edition [ISBN 9780324783247] Chapters 1–3.
Lind, D.A., W.G. Marchal and S.A. Wathen Statistical Techniques in Business and
Economics (Boston: McGraw–Hill Higher Education, 2009) fourteenth edition
[ISBN 978007110004] Chapters 3 and 4.
Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and
Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN
9780471615170] Chapter 2.
4.5 Introduction
Both themes considered in this chapter (data visualisation and descriptive statistics)
could be applied to population data, but in most cases (namely here) they are applied
to a sample. The notation would change slightly if a population was being represented.
Most visual representations are very tedious to construct in practice without the aid of
a computer. However, you will understand much more if you try a few by hand (as is
commonly asked in examinations). You should also be aware that spreadsheets do not
always use correct terminology when discussing and labelling graphs. It is important,
once again, to go over this material slowly and make sure you have mastered the basic
statistical definitions introduced here before you proceed to more theoretical ideas.
Make sure that, once you have completed the chapter and its activities, you make time
to practise the recommended questions from Newbold et al.
4.6 Types of variable
Many of the questions for which people use statistics to help them understand and
make decisions involve types of variables which can be measured. Obvious examples
include height, weight, temperature, lifespan, rate of inflation, and so on. When we are
dealing with such a variable – for which there is a generally recognised method of
determining its value – we say that it is a measurable variable. The numbers which
we then obtain come ready-equipped with an ordered relation, i.e. we can always tell if
two measurements are equal (to the available accuracy) or if one is greater or less than
the other.
Data¹ are obtained on any desired variable. For most of this half-course, we will be dealing with variables which can be partitioned into two types.

¹ Note that the word 'data' is plural, but is very often used as if it was singular. You will probably see both forms written in textbooks.
Types of variable
1. Discrete data: outcomes you can count. Examples include the number of
passengers on a flight and telephone calls received each day in a call centre.
Observed values for these will be 0, 1, 2, . . . (i.e. non-negative integers).
2. Continuous data: outcomes you can measure. Examples include height, weight
and time, all of which can be measured to several decimal places.
Of course, before we do any sort of data analysis, we need to collect data. Chapter 10
will discuss a range of different techniques which can be employed to obtain a sample.
For now, we just consider some examples of situations where data might be collected,
such as a:
pre-election opinion poll asks 1,000 people about their voting intentions
market research survey asks homemakers how many hours of television they watch
per week
census² interviewer asks each householder how many of their children are receiving
full-time education.
4.6.1 Categorical variables
A polling organisation might be asked to determine whether, say, the political
preferences of voters were in some way linked to their graduate status: for example, do
supporters of Party X tend to be graduates? Other market research organisations might
be hired to determine whether or not users were satisfied with the service which they
obtained from a commercial organisation (a restaurant, say) or a department of local or
central government (housing departments being one important example).
This means that we are concerned, from time to time, with categorical variables in
addition to measurable variables. So we can count the frequencies with which an item
belongs to particular categories. Examples include:
(a) The number of graduates who support Party X.
(b) The total number of graduates (in a sample).
(c) The number of Party X supporters who are graduates.
(d) The total number of Party X supporters (in a sample).
(e) The number of diners at a restaurant who were dissatisfied/indifferent/satisfied
with the service.
In cases (b) and (d) we are doing simple counts, within a sample, of a single category,
while in cases (a) and (c) we are looking at some kind of cross-tabulation between
² Recall that a census is the total enumeration of a population, hence this would not be a sample!
variables in two categories: graduate status vs. political preferences in (a), and political
preferences vs. graduate status in (c) – they are not the same!
There is no obvious and generally recognised way of putting political preferences or
graduate status in order (in the way that we can certainly say that 1 < 2). It is
similarly impossible to rank (as the technical term has it) many other categories of
interest: in combatting discrimination against people, for instance, organisations might
want to look at the effects of gender, religion, nationality, sexual orientation, disability
etc. but the whole point of combatting discrimination is that different levels of each
category cannot be ranked.
In case (e), by contrast, there is a clear ranking: the restaurant would be pleased if
there were lots of people who expressed themselves satisfied rather than dissatisfied.
Such considerations lead us to distinguish between two main types of variable, the
second of which is itself subdivided.
Classification of variables
Measurable variables, where there is a generally recognised method of
measuring the value of the variable of interest.
Categorical variables, where no such method exists (or, often enough, is even
possible), but among which:
• some examples of categorical variables can be put in some sensible order
(case (e)), and hence are called ordinal (categorical) variables
• some examples of categorical variables cannot be put in any sensible
order, but are only known by their names, and hence are called nominal
(categorical) variables.
4.7 Data visualisation
Datasets consist of potentially vast amounts of data. Hedge funds, for example, have
access to very large databases of historical price information on a range of financial
assets, such as so-called ‘tick data’ – very high-frequency intra-day data. Of course, the
human brain cannot easily make sense of such large quantities of numbers when
presented with them on a screen. However, the human brain can cope with visual
representations of data. By producing various plots, we can instantly ‘eyeball’ to get a
bird’s-eye view of the dataset. So, at a glance, we can quickly get a feel for the data and
determine whether there are any interesting features, relationships etc. which could
then be examined in greater depth.³ To summarise, plots are a great medium for
communicating the salient features of a dataset to a wide audience.
The main representations we use in ST104a Statistics 1 are histograms, stem-and-leaf diagrams and boxplots.⁴ We also use scatterplots for two measurable variables (covered in Chapter 12).

³ For those studying ST104b Statistics 2, in modelling we often make distributional assumptions, and a suitable variable plot allows us to easily check the feasibility of a particular distribution.
⁴ There are many other representations available from software packages, in particular pie charts and standard bar charts which are appropriate when dealing with categorical data, although these will not be considered further in this half-course.
4.7.1 Presentational traps
Before we see our first graphical representation you should be aware when reading
articles in newspapers, magazines and even within academic journals, that it is easy to
mislead the reader by careless or poorly-defined diagrams. Hence presenting data
effectively with diagrams requires careful planning.
A good diagram:
• provides a clear summary of the data
• is a fair and honest representation
• highlights underlying patterns
• allows the extraction of a lot of information quickly.
A bad diagram:
• confuses the viewer
• misleads (either accidentally or intentionally).
Advertisers and politicians are notorious for ‘spinning’ data for their own objectives!
4.7.2 Dot plot
The simplicity of a dot plot makes it an ideal starting point to think about the concept
of a distribution. For small datasets, this type of plot is very effective for seeing the
data’s underlying distribution. We use the following procedure.
1. Obtain the range of the dataset (the values spanned by the data), and draw a
horizontal line to accommodate this range.
2. Place dots (hence the name ‘dot plot’ !) corresponding to the values above the line,
resulting in the empirical distribution.
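The two steps can be sketched in code as a text-based dot plot (rotated 90 degrees, with dots stacked beside rather than above each value). The data are illustrative only, not the guide's survey values:

```python
from collections import Counter

# Hypothetical hourly wage rates (in £).
data = [7.5, 7.6, 7.6, 7.7, 7.8, 7.8, 7.8, 7.9, 8.0, 8.1]

# Count how often each distinct value occurs, then print one dot per occurrence:
# this is the empirical distribution of the data.
counts = Counter(data)
for value in sorted(counts):
    print(f"{value:4.1f} | " + "•" * counts[value])
```

Repeated values stack up into longer rows of dots, so the shape of the distribution is visible at a glance.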
Example 4.1 Hourly wage rates (in £) for clerical assistants:
8.20 7.60 7.50 7.90 7.50 7.80 7.70 7.60 7.60 7.60 7.70 8.10 8.10
7.80 7.80 7.70 7.90 7.60 7.80 7.70 7.90 8.00 8.10 8.20 7.50 8.00

[Dot plot of the hourly wage rates, with dots stacked above each value from £7.50 to £8.20.]
Instantly, some interesting features emerge from the dot plot which are not
immediately obvious from the raw data. For example, most clerical assistants earn
less than £8 per hour and nobody (in the sample) earns more than £8.20.
Activity 4.1 A survey has been conducted on the wage rates paid to clerical
assistants in twenty companies in your locality in order to make proposals for a
review of wages in your own firm. The results of this survey are as follows:
Hourly rate in £
8.50 8.20 7.80 8.00 8.10 7.90 7.90 8.40 8.40 8.00
8.10 7.90 7.90 8.00 8.00 8.10 7.80 8.20 7.90 8.30
Construct a dot plot of these data.
Solution
A good idea is to first list the data in increasing order:
7.80 7.80 7.90 7.90 7.90 7.90 7.90 8.00 8.00 8.00
8.00 8.10 8.10 8.10 8.20 8.20 8.30 8.40 8.40 8.50
The dot plot of wage rates is:
7.80: ••
7.90: •••••
8.00: ••••
8.10: •••
8.20: ••
8.30: •
8.40: ••
8.50: •

4.7.3 Histogram
Histograms are excellent diagrams to use when we want to depict the frequency
distribution of (discrete or continuous) variables. To construct histograms, data are first
organised into a table which arranges the data into class intervals (also called bins) –
subdivisions of the total range of values which the variable takes. To each class interval,
the corresponding frequency is determined, that is the number of observations of the
variable which falls in each class interval.
Example 4.2 Weekly production output5 of a factory over a 50-week period. (You
can choose what the manufactured good is!)
360  385  458  362  383  381  453  365  368  393
451  365  365  392  444  372  371  395  445  364
354  410  382  396  360  437  379  396  359  420
372  398  358  456  363  402  354  441  350  406
450  469  467  460  454  438  358  362  404  459
In principle the bins do not have to be of equal width, but for simplicity we use
equal bin widths whenever possible (the reason why will become clear below). A
word of warning: recall our objective is to represent the distribution of the data. As
such, too many bins will dilute the distribution, while too few will concentrate it
(using technical jargon, will degenerate the distribution). Either way, the pattern of
the distribution will be lost – defeating the purpose of the histogram. As a guide, six
or seven bins should be sufficient, but remember to exercise common sense!
Class interval   Interval width   Frequency   Frequency density   Cumulative frequency
[300, 360)             60              6            0.100                  6
[360, 380)             20             14            0.700                 20
[380, 400)             20             10            0.500                 30
[400, 420)             20              4            0.200                 34
[420, 460)             40             13            0.325                 47
[460, 500)             40              3            0.075                 50
The table above includes two additional columns: (i) ‘Frequency density’ – obtained
by calculating ‘frequency divided by interval width’ (for example, 6/60 = 0.100), and
(ii) ‘Cumulative frequency’ – obtained by simply determining the running total of
the class frequencies (for example, 6 + 14 = 20). Note the final column is not
required for a histogram per se, although the computation of cumulative frequencies
may be useful when determining medians and quartiles (to be discussed later in this
chapter).
To construct the histogram, adjacent bars are drawn over the respective class
intervals such that the area of each bar is proportional to the interval
frequency. This explains why equal bin widths are desirable since this reduces the
problem to making the heights proportional to the interval frequency.
However, you may be told to use a particular number of bins or bin widths, such
that the bins will not all be of equal width. In such cases, you will need to compute
the frequency density as outlined above. The histogram for the above example is
shown in Figure 4.1.
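As a quick illustration (a Python sketch, not part of the guide), the 'frequency density' and 'cumulative frequency' columns of the table can be computed directly from the bin edges and class frequencies:

```python
from itertools import accumulate

# Bin edges and class frequencies from the weekly production table
edges = [300, 360, 380, 400, 420, 460, 500]
freqs = [6, 14, 10, 4, 13, 3]

widths = [b - a for a, b in zip(edges, edges[1:])]
densities = [f / w for f, w in zip(freqs, widths)]   # frequency / interval width
cum_freqs = list(accumulate(freqs))                  # running total of frequencies

print(densities)   # [0.1, 0.7, 0.5, 0.2, 0.325, 0.075]
print(cum_freqs)   # [6, 20, 30, 34, 47, 50]
```

Because the bars' areas must be proportional to the frequencies, it is the `densities` list (not `freqs`) that supplies the bar heights when bin widths are unequal.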
Key points to note are the following.
All bars are centred on the midpoints of each class interval.
Informative labels are shown on the histogram, i.e. title and axis labels. The idea
is the reader can work out what the graphic is showing from the labels!
Because area represents frequency, it follows that the dimension of bar heights is
number per unit class interval, hence the y-axis should be labelled ‘Frequency
density’ rather than ‘Frequency’.
5 This is a discrete variable since the output will take integer values, i.e. something which we can count.
Figure 4.1: Frequency density histogram of weekly production output.
Keen students only: experiment with different numbers of bins and see how
sensitive the histogram shape is to this choice!
Activity 4.2 The following two sets of data represent the lengths (in minutes) of
students’ attention spans during a one-hour class. Construct histograms for each of
these datasets and use these to comment on comparisons between the two
distributions.
Statistics class

01  47  21  38  55  43  40  07  48  27
16  14  28  36  06  28  36  49  22  23
27  23  31  29  42  25  32  22  12  21
26  15  24  32  58  25  31  26  11  23
22  19  14  34  35  26  25  45  42  13

Economics class

04  51  37  14  28  11  25  19  07  46
54  43  43  31  60  60  18  28  60  58
29  50  60  60  42  39  60  21  60  48
42  30  21  31  51  25  04  41  53  60
24  32  26  37  26  48  41  59  60  27
47  30  03  11  41  38  08  28  37  60
60
4.7. Data visualisation
Solution
The exact shape of the histograms will depend on the class intervals you have used.
A sensible approach is to choose round numbers. Below the following have been used:
Statistics

Class interval   Interval width   Frequency   Frequency density
(0, 10]                10              3             0.3
(10, 20]               10              8             0.8
(20, 30]               10             20             2.0
(30, 40]               10             10             1.0
(40, 50]               10              7             0.7
(50, 60]               10              2             0.2

Economics

Class interval   Interval width   Frequency   Frequency density
(0, 10]                10              5             0.5
(10, 20]               10              5             0.5
(20, 30]               10             14             1.4
(30, 40]               10              8             0.8
(40, 50]               10             11             1.1
(50, 60]               10             17             1.7

[Histograms omitted: 'Statistics students' attention spans' and 'Economics students' attention spans', each plotting frequency density against attention span (minutes).]
The graphs show that for the statistics class attention spans are approximately
symmetrical. Attention spans for the economics class are more variable and higher
than those of the statistics class, and they are not symmetrical due to a group of
students who maintain interest throughout the class. Note we have more data on
economics students – this is perfectly acceptable. Sample sizes of different groups
may very well be different in practice.
4.7.4 Stem-and-leaf diagram
A stem-and-leaf diagram uses the raw data. As the name suggests, it is formed using a
‘stem’ and corresponding ‘leaves’. The choice of the stem involves determining a major
component of an observed value, such as the ‘10s’ unit if the order of magnitude of the
observations were 15, 25, 35 etc., or if data are of the order of magnitude 1.5, 2.5, 3.5
etc. then the integer part would be appropriate. The remainder of the observed value
plays the role of the ‘leaf’. Applied to the weekly production dataset, we obtain the
stem-and-leaf diagram shown below in Example 4.3. Note the following points.
These stems are equivalent to using the class intervals [350, 360), [360, 370), [370, 380), . . ., [460, 470).
Leaves are vertically aligned, hence rotating the stem-and-leaf diagram 90 degrees
anti-clockwise reproduces the shape of the data’s distribution, just as would be
revealed with a histogram.
The leaves are placed in order of magnitude within the stems, so it is a good idea
to sort the raw data into ascending order first of all.
Unlike the histogram, the actual data values are preserved. This is advantageous if
we want to calculate various descriptive (or summary) statistics later on.
Example 4.3 Continuing with Example 4.2, the stem-and-leaf diagram is:

Stem-and-leaf diagram of weekly production

Stem (Tens) | Leaves (Units)
         35 | 044889
         36 | 0022345558
         37 | 1229
         38 | 1235
         39 | 235668
         40 | 246
         41 | 0
         42 | 0
         43 | 78
         44 | 145
         45 | 0134689
         46 | 079
Note the informative title and labels for the stems and leaves.
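A short Python sketch (illustrative only) of the stem-and-leaf construction for the weekly production data: integer division by 10 gives the stem and the remainder gives the leaf, and sorting first puts the leaves in order of magnitude within each stem.

```python
# Weekly production output over a 50-week period (Example 4.2)
production = [360, 385, 458, 362, 383, 381, 453, 365, 368, 393,
              451, 365, 365, 392, 444, 372, 371, 395, 445, 364,
              354, 410, 382, 396, 360, 437, 379, 396, 359, 420,
              372, 398, 358, 456, 363, 402, 354, 441, 350, 406,
              450, 469, 467, 460, 454, 438, 358, 362, 404, 459]

stems = {}
for x in sorted(production):               # sort first so leaves end up ordered
    stems.setdefault(x // 10, []).append(x % 10)

for stem, leaves in sorted(stems.items()):
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

Unlike a histogram, the `stems` dictionary still holds every raw value, so descriptive statistics can be recovered exactly from it.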
So far we have considered how to summarise a dataset visually. This methodology is
appropriate to get a visual feel for the distribution of the dataset. In practice, we would
also like to summarise things numerically. There are two key properties of a dataset
which will be of particular interest.
Key properties of a dataset
Measures of location – a central point about which the data tend (also known
as measures of central tendency).
Measures of spread – a measure of the variability of the data, i.e. how spread
out the data are about the central point (also known as measures of dispersion).
Activity 4.3 Sales of footwear in a store were recorded for 52 weeks and these are
shown below.
Weekly sales of footwear
30  61  70  79  60  74  68  77  67  94  47  77  63
87  40  73  69  66  51  64  54  69  56  67  68  66
56  59  60  62  58  46  62  78  57  54  83  90  47
53  66  98  71  49  70  93  76  58  68  73  80  62
Construct a stem-and-leaf diagram of these data and comment on its shape.
Solution
Stem-and-leaf diagram of weekly shoe sales:

Stems in 10s | Leaves in 1s
           3 | 0
           4 | 06779
           5 | 1344667889
           6 | 001222346667788899
           7 | 00133467789
           8 | 037
           9 | 0348
We can see from the stem-and-leaf diagram that the data look approximately
symmetric, centred at about 66.
4.8 Measures of location
Mean, median and mode are the three principal measures of location. In general,
these will not all give the same numerical value for a given dataset/distribution.6 These
three measures (and, later, measures of spread) will now be introduced using the
6 These three measures can be the same in special cases, such as the normal distribution (introduced in Chapter 6) which is symmetric about the mean (and so mean = median) and achieves a maximum at this point, i.e. mean = median = mode.
following small sample dataset:

    32  28  67  39  19  48  32  44  37  24.                                (4.1)

4.8.1 Mean
The preferred measure of location/central tendency, which is simply the ‘average’ of the
data. It will be frequently applied in various statistical inference techniques in later
chapters.
Sample mean

Using the summation operator, ∑, which remember is just a form of 'notational shorthand', we define the sample mean, x̄, as:7

    x̄ = (1/n) ∑_{i=1}^n x_i = (x_1 + x_2 + · · · + x_n)/n.
Example 4.4 For the dataset in (4.1) above:

    x̄ = (32 + 28 + · · · + 24)/10 = 370/10 = 37.
Of course, it is possible to encounter datasets in frequency form, that is each data value
is given with the corresponding frequency of observations for that value, fk , for
k = 1, . . . , K, where there are K different variable values. In such a situation, use the
formula:
    x̄ = ∑_{k=1}^K f_k x_k / ∑_{k=1}^K f_k.                                (4.2)
Note that this preserves the idea of ‘adding up all the observations and dividing by the
total number of observations’. This is an example of a weighted mean, where the
weights are the relative frequencies.
If the data are given in grouped-frequency form, such as that shown in the table in
Example 4.2, then the individual data values are unknown8 – all we know is the interval
in which each observation lies. The sensible solution is to use the midpoint of the
interval as a ‘proxy’ for each observation recorded as belonging within that class
interval. Hence you still use the grouped-frequency mean formula (4.2), but each x_k value will be substituted with the appropriate class interval midpoint.
7 The notation x̄ will be used to denote an observed sample mean, while µ will denote its population counterpart, i.e. the population mean.
8 Of course, we do have the raw data for the weekly production example and so we could work out the exact sample mean, but here suppose we did not have access to the raw data, instead we were just given the table of class frequencies in Example 4.2.
Example 4.5 Using the weekly production data in Example 4.2, the interval
midpoints are: 330, 370, 390, 410, 440 and 480, respectively. These will act as the
data values for the respective classes. The mean is then calculated as:
    x̄ = ∑_{k=1}^K f_k x_k / ∑_{k=1}^K f_k = ((6 × 330) + · · · + (3 × 480))/(6 + · · · + 3) = 397.2.
Compared to the true mean of the raw data (which is 399.72), we see that using the
midpoints as proxies gives a mean very close to the true sample mean value. Note
the mean is not rounded up or down since it is an arithmetic result.
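Both calculations can be verified with a few lines of Python (an illustrative sketch; the numbers are those of (4.1) and Example 4.5):

```python
# Simple sample mean for the dataset in (4.1)
data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
xbar = sum(data) / len(data)
print(xbar)                      # 37.0

# Grouped-frequency mean for Example 4.5, using class midpoints as proxies
midpoints = [330, 370, 390, 410, 440, 480]
freqs = [6, 14, 10, 4, 13, 3]

grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)
print(grouped_mean)              # 397.2
```

The grouped result 397.2 sits close to the true sample mean of the raw data (399.72), illustrating how well midpoints work as proxies here.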
A drawback with the mean is its sensitivity to outliers (extreme observations). For
example, suppose we record the net worth of 10 randomly chosen people. If Jeff Bezos
(one of the world’s richest people), say, was included, his substantial net worth would
drag the mean upward considerably! By increasing the sample size n, the effect of his
inclusion, although diluted, would still be non-negligible, assuming we were not just
sampling from the population of billionaires!
4.8.2 Median
The (sample) median, m, is the middle value of the ordered dataset, where observations
are arranged in (ascending) order. By definition, 50 per cent of the observations are
greater than the median, and 50 per cent are less than the median.
Median
Arrange the n numbers in ascending order,9 say x(1) , x(2) , . . . , x(n) ,10 then:
if n is odd, there is an explicit middle value, so the median, m, is x((n+1)/2)
if n is even, there is no explicit middle value, so take the average of the values
either side of the ‘midpoint’, hence m = (x(n/2) + x(n/2+1) )/2.
Example 4.6 For the dataset in (4.1), the ordered observations are:

    19  24  28  32  32  37  39  44  48  67.

Here n = 10, i.e. there is an even number of observations, so we compute the average of the fifth and sixth ordered observations, that is:

    m = (x(5) + x(6))/2 = (32 + 37)/2 = 34.5.
9 If you have constructed a stem-and-leaf diagram, you would have already done this!
10 These are known as 'order statistics'. For example, x(1) is the first order statistic, i.e. the smallest observed value, and x(n) is the nth order statistic, i.e. the largest observed value.
If we only had data in grouped-frequency form (as in Example 4.2), then we can make
use of the cumulative frequencies. Since n = 50, the median is the 25.5th ordered
observation which must lie in the [380, 400) interval because once we exhaust the
ordered data up to the [360, 380) interval we have only covered 20 observations, while
once the [380, 400) interval is exhausted we have accounted for 30 observations, meaning
the median lies in this interval. Assuming the raw data are not accessible, we could use
the midpoint (i.e. 390) as denoting the median. Alternatively, we could use an
interpolation method which uses the following ‘general’ formula for grouped data, once
you have identified the class which includes the median (such as [380, 400) above):
    endpoint of previous class + (class width × number of remaining observations)/class frequency.
Example 4.7 Returning to the weekly production example, the median would be:

    380 + (20 × (25.5 − 20))/10 = 391.
For comparison, using the raw data, x(25) = 392 and x(26) = 393 (quickly obtained
from the stem-and-leaf diagram), giving a median of 392.5.
Although an advantage of the median is that it is not influenced by outliers (Jeff Bezos’
net worth would be x(n) and so would not affect the median), in practice it is of limited
use in formal statistical inference.
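The odd/even median rules, and the interpolation formula of Example 4.7, can be sketched as follows (illustrative Python; `median` is a hypothetical helper name, not from the guide):

```python
def median(values):
    """Median by the rules in the text: middle value if n is odd,
    average of the middle pair if n is even."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:                               # odd n: explicit middle value
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2       # even n: average the middle pair

data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
print(median(data))                              # 34.5

# Interpolated median for the grouped weekly production data: the 25.5th
# ordered observation falls in [380, 400), with 20 observations already
# covered by earlier classes and 10 observations in the median's class.
m = 380 + 20 * (25.5 - 20) / 10
print(m)                                         # 391.0
```

Note the interpolated value 391 is only an approximation; the raw data give 392.5.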
For symmetric data, the mean and median are always equal. Therefore, this is a simple
way to verify whether a dataset is symmetric. Asymmetric distributions are skewed,
where skewness measures the departure from symmetry. Although you will not be
expected to compute the coefficient of skewness (its numerical value), you need to be
familiar with the two types of skewness.
Skewness
When mean > median, this indicates a positively-skewed distribution (also referred to as 'right-skewed').

When mean < median, this indicates a negatively-skewed distribution (also referred to as 'left-skewed').
Graphically, skewness can be determined by identifying where the long ‘tail’ of the
distribution lies. If the long tail is heading toward +∞ (positive infinity) on the x-axis
(i.e. on the right-hand side), then this indicates a positively-skewed (right-skewed)
distribution. Similarly, if heading toward −∞ (negative infinity) on the x-axis (i.e. on
the left-hand side) then this indicates a negatively-skewed (left-skewed) distribution, as
illustrated in Figure 4.2.
Example 4.8 The hourly wage rates used in Example 4.1 are skewed to the right,
due to the influence of the ‘extreme’ values 8.00, 8.10, 8.10 and 8.20 – the effect of
these (similar to Jeff Bezos’ effect mentioned above) is to ‘drag’ or ‘pull’ the mean
upward, hence mean > median.
[Figure omitted: sketches of a positively-skewed distribution (long tail to the right) and a negatively-skewed distribution (long tail to the left).]
Figure 4.2: Different types of skewed distributions.
Example 4.9 For the weekly production dataset in Example 4.2, we have
calculated the mean and median to be 399.72 and 392.50, respectively. Since the
mean is greater than the median, the data form a positively-skewed distribution, as
confirmed by the histogram in Figure 4.1.
4.8.3 Mode
Our final measure of location is the mode.
Mode
The (sample) mode is the most frequently-occurring value in a (sample) dataset.
It is perfectly possible to encounter a multimodal distribution where several data values
are tied in terms of their frequency of occurrence.
Example 4.10 The modal value of the dataset in (4.1) is 32, since it occurs twice
while the other values only occur once each.
Example 4.11 For the weekly production dataset, looking at the stem-and-leaf
diagram in Example 4.3, we can quickly see that 365 is the modal value (the three
consecutive 5s opposite the second stem stand out). If just given grouped data, then
instead of reporting a modal value we can determine the modal class, which is
[360, 380) with 14 observations. (The fact that this includes 365 here is a coincidence
– the modal class and modal value are not the same thing.)
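A Python sketch (illustrative only) of finding the mode, or modes in the multimodal case, using `collections.Counter`:

```python
from collections import Counter

data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]   # the dataset in (4.1)
counts = Counter(data)

top_frequency = max(counts.values())
modes = [value for value, c in counts.items() if c == top_frequency]
print(modes)    # [32] — a multimodal dataset would return several values here
```

For grouped data, the same idea applied to class frequencies instead of raw values picks out the modal class.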
Activity 4.4 Calculate the mean, median and mode of the following sample values:
5, 8, 10, 7, 7, 4 and 12.

Solution

We have ∑ x_i = 53 and n = 7, hence x̄ = 53/7 = 7.57.

For the median, arrange the observations into ascending order, giving:

    4, 5, 7, 7, 8, 10 and 12.

Given there is an odd number of observations, since n = 7, then the median is:

    x((n+1)/2) = x(4) = 7.

The mode is 7 since this occurs twice, with the other values only occurring once each.
Activity 4.5 Consider again the wage rate data in Activity 4.1. Calculate the
mean, median and mode for these data. Explain which measure should be used in
deciding on a rate to be used in your company.
Solution

The mean is:

    (2 × 7.80 + 5 × 7.90 + · · · + 1 × 8.50)/20 = 161.4/20 = £8.07.

The median is given by:

    (x(10) + x(11))/2 = (8.00 + 8.00)/2 = £8.00.

The mode is £7.90 (5 repetitions).
The frequency distribution is positively skewed (i.e. has a long tail on the right) and
so we have (as is often the case with positively-skewed data) that:
mode < median < mean.
The mode is probably a poor measure of location in this case (the mode is often a
poor measure when we are dealing with measurable variables), so the median or
mean is preferred. Depending on the company’s negotiating strategy, it might choose
a figure slightly higher than the mean, say £8.20, since it is higher than the average
and beaten by fewer than one-third of rivals.
Activity 4.6 Hard!

In a sample of n = 6 objects the mean of the data is 15 and the median is 11. Another observation is then added to the sample, and this takes the value x7 = 12.
(a) Calculate the mean of the seven observations.
(b) What can you conclude about the median of the seven observations?
Solution
The ‘new’ mean is (6 × 15 + 12)/7 = 102/7 = 14.6. Note the easy way to calculate
the new sum of data, using the old mean.
There is not enough information in the question to calculate the new median
exactly, but it must be somewhere between 11 and 12 (inclusive), in part because for
the original data:
x(1) , x(2) , x(3) ≤ 11
and:
x(4) , x(5) , x(6) ≥ 11.
x7 = 12 fits somewhere in the second group, increasing the median. If
x(3) < x7 < x(4) then the new median will be x7 = 12, while if x(3) < x(4) < x7 then
the new median will be x(4) , where 11 ≤ x(4) ≤ 12. Other cases can also be worked
out if we have enough information, or estimated by a bit more algebra.
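Part (a)'s shortcut can be checked numerically (an illustrative sketch): the old data sum is recovered from the old mean, then the new observation is averaged in.

```python
# Recover the old sum from the old mean: 6 × 15 = 90, then add x7 = 12.
old_n, old_mean = 6, 15
new_obs = 12

new_mean = (old_n * old_mean + new_obs) / (old_n + 1)   # 102/7
print(round(new_mean, 1))    # 14.6
```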
Activity 4.7 An airline has a baggage weight allowance of 20 kg per passenger on
its flights. It collected the following data on the weights of its passengers’ baggage.
The data were taken from a sample of 100 passengers on a Madrid–London flight.
There were about 300 passengers on the flight and the 100 passengers were chosen at
random.
Weight of baggage (kg)   Number of passengers
0 ≤ x < 5                        21
5 ≤ x < 10                        2
10 ≤ x < 15                      10
15 ≤ x < 20                      27
20 ≤ x < 25                      26
25 ≤ x < 30                      11
30 ≤ x < 35                       3
35 ≤ x                            0
(a) Display the data using an appropriate diagram.
(b) Calculate the mean and median of the data.
(c) The company uses the data to claim that ‘40% of airline passengers travel with
baggage over the weight allowance’. Explain whether or not you think this claim
is valid. (Think about how the data were collected!)
Solution
(a) A histogram is probably as good as anything here (though in some cases like
this there is room for debate about what is best). We have:
Class interval   Interval width   Frequency   Frequency density
[0, 5)                 5              21            4.2
[5, 10)                5               2            0.4
[10, 15)               5              10            2.0
[15, 20)               5              27            5.4
[20, 25)               5              26            5.2
[25, 30)               5              11            2.2
[30, 35)               5               3            0.6

[Histogram of baggage weights of airline passengers omitted: frequency density plotted against weight of baggage (in kg).]
(b) The sample mean is:

    ((21 × 2.5) + (2 × 7.5) + · · · + (3 × 32.5))/(21 + 2 + · · · + 3) = 1650/100 = 16.5.

The median is:

    15 + (5 × (50.5 − 33))/27 = 18.24.
(c) The claim is not valid. The data were taken from one route and only one flight
on that route. We cannot extrapolate to make conclusions about all flights.
There needs to be more variety in the sampling. For example, include flights to
many destinations, both domestic and international flights etc.
4.9 Measures of spread
The dispersion (or spread) of a dataset is very important when drawing conclusions
from it. Hence it is essential to have a useful measure of this property, and several
candidates exist – these are reviewed below. As expected, there are advantages and
disadvantages to each.
4.9.1 Range
Our first measure of spread is the range.
Range
The range is simply the largest value minus the smallest value, that is:
4
range = x(n) − x(1) .
Example 4.12 For the dataset in (4.1), the range is:

    x(n) − x(1) = 67 − 19 = 48.
Clearly, the range is very sensitive to extreme observations since (when they occur) they
are going to be the smallest and/or largest observations (x(1) and/or x(n) , respectively),
and so this measure is of limited appeal. If we were confident that no outliers were
present (or decided to remove any outliers), then the range would better represent the
true spread of the data.
More formally, we could consider the idea of the interquartile range (IQR) instead –
that is, the upper (third) quartile, Q3 , minus the lower (first) quartile, Q1 . The upper
quartile has 75 per cent of observations below it (and 25 per cent above it) while the
lower quartile has 25 per cent of observations below (and 75 per cent above).
Unsurprisingly the median, given our earlier definition, is the middle (second) quartile.
By discarding the top 25 per cent and bottom 25 per cent of observations, respectively,
we restrict attention solely to the central 50 per cent of observations.
Interquartile range
The IQR is defined as:
IQR = Q3 − Q1
where Q3 and Q1 are the third (upper) and first (lower) quartiles, respectively.
Example 4.13 Continuing with the dataset in (4.1), computation of the quartiles can be problematic since, for example, for the lower quartile we require the value such that 2.5 observations are below it and 7.5 values are above it. A suggested approach11 (motivated by the median calculation when n is even) is to use:

    Q1 = (x(2) + x(3))/2 = (24 + 28)/2 = 26.

Similarly:

    Q3 = (x(7) + x(8))/2 = (39 + 44)/2 = 41.5.
Hence the IQR is 41.5 − 26 = 15.5. Contrast this with the range of 48 which is much
larger due to the effects of x(1) and x(n) .
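The suggested quartile approach can be sketched in Python (an illustration; note this is just one of several quartile conventions, as the text warns):

```python
# Dataset from (4.1), sorted so that data[i] is the (i+1)th order statistic
data = sorted([32, 28, 67, 39, 19, 48, 32, 44, 37, 24])

q1 = (data[1] + data[2]) / 2    # (x(2) + x(3))/2
q3 = (data[6] + data[7]) / 2    # (x(7) + x(8))/2
iqr = q3 - q1

print(q1, q3, iqr)              # 26.0 41.5 15.5
```

Compare `iqr` with the range of 48: trimming the top and bottom quarters removes the influence of x(1) = 19 and x(n) = 67.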
Activity 4.8 Consider again the sales of footwear data in Activity 4.3.
Weekly sales of footwear
30  61  70  79  60  74  68  77  67  94  47  77  63
87  40  73  69  66  51  64  54  69  56  67  68  66
56  59  60  62  58  46  62  78  57  54  83  90  47
53  66  98  71  49  70  93  76  58  68  73  80  62
(a) Use the stem-and-leaf diagram to find the median and the quartiles of the data.
(b) Calculate the interquartile range.
Solution
(a) There are 52 values so the median is halfway between the 26th and 27th ordered
values. It is easy to read off from the graph that both the 26th and 27th ordered
values equal 66, so the median equals 66.
(b) Again, we can read off from the graph that the lower and upper quartiles are
about 57 and 74, respectively. On this basis, the interquartile range is
74 − 57 = 17. However, different textbooks give slightly different definitions of
the quartiles so you might have obtained a slightly different answer (but not too
different!) depending on the definition you have used.
A sensible method is:

    Q1 = (x(13) + x(14))/2 = (57 + 58)/2 = 57.5

and:

    Q3 = (x(39) + x(40))/2 = (73 + 74)/2 = 73.5.

Using this method, the interquartile range is 73.5 − 57.5 = 16.
4.9.2 Boxplot
At this point, it is useful to introduce another graphical method – the boxplot.12 In a
boxplot, the middle horizontal line is the median and the upper and lower ends of the
box are the upper and lower quartiles, respectively. The ‘whiskers’ are drawn from the
quartiles to the observations furthest from the median, but not by more than one-and-a-half times the IQR (i.e. excluding outliers). The whiskers are terminated by horizontal lines. Any extreme points beyond the whiskers are plotted individually. An example of a (generic) boxplot is shown in Figure 4.3.

11 There are many different methodologies for computing quartiles, and conventions vary from country to country (and from textbook to textbook!). Any reasonable approach is perfectly acceptable in the examination. For example, interpolation methods, as demonstrated previously for the case of the median, are valid.
12 Some people call such plots box-and-whisker plots. No prizes for guessing why!

[Figure: a generic boxplot. The box spans Q1 to Q3, so 50% of cases have values within the box, with the median, Q2, marked inside. Whiskers extend to the largest and smallest observed values that are not outliers. Values more than 1.5 box-lengths from Q1 or Q3 are plotted as outliers ('o'), and values more than 3 box-lengths away as extreme outliers ('x').]

Figure 4.3: An example of a boxplot (not to scale).
If you are presented with a boxplot, then it is easy to obtain all of the following: the
median, quartiles, IQR, range and skewness. Recall that skewness (the departure from
symmetry) is characterised by a long tail, attributable to outliers, which are readily
apparent from a boxplot.
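The boxplot ingredients can be computed directly. The sketch below (illustrative Python, reusing the dataset in (4.1) and the quartile convention of Example 4.13) applies the 1.5-box-length rule to flag outliers:

```python
data = sorted([32, 28, 67, 39, 19, 48, 32, 44, 37, 24])

q1 = (data[1] + data[2]) / 2            # lower quartile
q2 = (data[4] + data[5]) / 2            # median
q3 = (data[6] + data[7]) / 2            # upper quartile
iqr = q3 - q1                           # one 'box-length'

lower_fence = q1 - 1.5 * iqr            # whiskers stop at the most extreme
upper_fence = q3 + 1.5 * iqr            # observations inside these fences
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3)       # 26.0 34.5 41.5
print(outliers)         # [67] — it lies beyond Q3 + 1.5 × IQR = 64.75
```

A single large outlier like 67 here is exactly the kind of point a boxplot plots individually beyond the whisker.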
Example 4.14 From the boxplot shown in Figure 4.4, it can be seen that the
median, Q2 , is around 74, Q1 is about 63, and Q3 is approximately 77. The many
outliers provide a useful indicator that this is a negatively-skewed distribution as the
long tail covers lower values of the variable. Note also that Q3 − Q2 < Q2 − Q1 ,
which tends to indicate negative skewness.
4.9.3 Variance and standard deviation
The variance and standard deviation are much better and more useful statistics for
representing the spread of a dataset. You need to be familiar with their definitions and
methods of calculation for a sample of data values x1 , x2 , . . . , xn .
Figure 4.4: A boxplot showing a negatively-skewed distribution.

Begin by computing the so-called 'corrected sum of squares', Sxx, the sum of the squared deviations of each data value from the (sample) mean, where:

    Sxx = ∑_{i=1}^n (x_i − x̄)² = ∑_{i=1}^n x_i² − n x̄².                    (4.3)

Recall from earlier that x̄ = ∑_{i=1}^n x_i / n. The proof for why the expressions in (4.3) are equivalent is beyond the scope of this half-course.
Sample variance

The sample variance, s², is defined as:

    s² = Sxx/(n − 1) = (1/(n − 1)) ∑_{i=1}^n (x_i − x̄)² = (1/(n − 1)) (∑_{i=1}^n x_i² − n x̄²).

Note the divisor used to compute s² is n − 1, not n. Do not worry about why,13 just remember to divide by n − 1 when computing a sample variance.14 To obtain the sample standard deviation, s, we just take the (positive) square root of the sample variance, s².
Sample standard deviation

The sample standard deviation, s, is:

    s = √s² = √(Sxx/(n − 1)).
13 This is covered in ST104b Statistics 2.
14 In contrast, for population data the population variance is σ² = ∑_{i=1}^N (x_i − µ)²/N, i.e. we use the N divisor here, where N denotes the population size while n denotes the sample size. Also, note the use of µ (the population mean) instead of x̄ (the sample mean).
Example 4.15 Using the dataset in (4.1), x̄ = 37, so:

    Sxx = (32 − 37)² + (28 − 37)² + · · · + (24 − 37)² = 25 + 81 + · · · + 169 = 1698.

Hence s = √(1698/(10 − 1)) = 13.74.

Note that, given ∑ x_i² = 15388, we could have calculated Sxx using the other expression:

    Sxx = ∑_{i=1}^n x_i² − n x̄² = 15388 − 10 × (37)² = 1698.

So this alternative method is much quicker to calculate Sxx.
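Both routes to Sxx, and the resulting sample standard deviation, can be verified with a short Python sketch (illustrative only):

```python
import math

data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
n = len(data)
xbar = sum(data) / n

s_xx = sum((x - xbar) ** 2 for x in data)             # definitional form
s_xx_alt = sum(x ** 2 for x in data) - n * xbar ** 2  # shortcut form

s2 = s_xx / (n - 1)       # sample variance, with the n − 1 divisor
s = math.sqrt(s2)         # sample standard deviation

print(s_xx, s_xx_alt, round(s, 2))    # 1698.0 1698.0 13.74
```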
Sample variance for grouped-frequency data

For grouped-frequency data with K classes, to compute the sample variance we use the formula:

    s² = ∑_{k=1}^K f_k (x_k − x̄)² / ∑_{k=1}^K f_k
       = ∑_{k=1}^K f_k x_k² / ∑_{k=1}^K f_k − ( ∑_{k=1}^K f_k x_k / ∑_{k=1}^K f_k )².

Recall that the last bracketed squared term is simply the mean formula for grouped data. Note that for grouped-frequency data we can ignore the 'divide by n − 1' rule, since we would expect n to be very large in such cases, such that n − 1 ≈ n and so dividing by n or n − 1 makes negligible difference in practice, noting that ∑_{k=1}^K f_k = n.
Example 4.16 Let us now consider an extended example bringing together many
of the issues considered in this chapter. We will draw a histogram, here with equal
class widths, and introduce the cumulative frequency diagram.
A stockbroker is interested in the level of trading activity on a particular stock
exchange. She has collected the following data, which are weekly average volumes (in
millions), over a 29-week period. This is an example of time series data. Note that this variable is discrete in nature, but because the numbers are so large it can be treated as continuous.
172.5  161.9  172.3  181.3  169.1  155.0  148.6  159.8  161.6  153.8
154.6  151.6  132.4  144.0  133.6  149.0  135.8  139.9  164.4  175.6
163.5  172.6  168.3  155.3  143.4  140.6  125.1  171.3  167.0
So to construct a histogram we first decide on the choice of class intervals, which is a
subjective decision. The objective is to convey information in a useful way. In this
case the data lie between (roughly) 120 and 190 million shares/week, so class
intervals of width 10 million will give seven classes. With almost 30 observations this
choice is probably adequate; more observations might support more classes (a class
interval of 5 million, say); fewer observations would, perhaps, need a larger interval
of 20 million.
Therefore, the class intervals are defined like this:
120 ≤ volume < 130,
4
130 ≤ volume < 140 etc.
or, alternatively, [120, 130), [130, 140) etc. We now proceed to determine the
frequency density (and cumulative frequencies, for later).
Class interval   Interval width   Frequency   Frequency density   Cumulative frequency
[120, 130)             10              1             0.1                   1
[130, 140)             10              4             0.4                   5
[140, 150)             10              5             0.5                  10
[150, 160)             10              6             0.6                  16
[160, 170)             10              7             0.7                  23
[170, 180)             10              5             0.5                  28
[180, 190)             10              1             0.1                  29
Hence the frequency density histogram is as shown in Figure 4.5.
A development from the histogram is the cumulative frequency diagram. Recall that
a cumulative frequency gives us the total frequency of observations which fall below
the upper endpoint of a class interval. Therefore, it is the upper endpoint which we
use in the diagram. Figure 4.6 displays this for the trading data example.
A variation on this is the cumulative relative frequency diagram which accumulates
the relative frequencies (typically expressed as percentages), rather than the
frequencies, where the relative frequency of the kth interval is calculated as:
    relative frequency_k = f_k / n.

Class interval   Frequency   Relative frequency (%)   % Cumulative relative frequency
[120, 130)            1              3.45                        3.45
[130, 140)            4             13.79                       17.24
[140, 150)            5             17.24                       34.48
[150, 160)            6             20.69                       55.17
[160, 170)            7             24.14                       79.31
[170, 180)            5             17.24                       96.55
[180, 190)            1              3.45                      100.00
So, for example, in the above table since n = 29 the relative frequency for the first
interval is 1/29 = 0.0345, or 3.45%. Plotting the cumulative relative frequencies
yields the cumulative relative frequency diagram, as shown in Figure 4.7.
4.9. Measures of spread
Note all the diagrams use the grouped frequencies, rather than the original raw data.
For comparison, Figure 4.8 uses the raw ungrouped data, providing a more ‘precise’
cumulative relative frequency diagram.
Finally, we use the grouped data to compute particular descriptive statistics –
specifically the mean, variance and standard deviation. There are K = 7 classes, so
we perform the appropriate intermediate calculations by groups.
Class interval   Midpoint, xk   Frequency, fk   fk xk    fk xk²
[120, 130)           125              1           125     15,625
[130, 140)           135              4           540     72,900
[140, 150)           145              5           725    105,125
[150, 160)           155              6           930    144,150
[160, 170)           165              7         1,155    190,575
[170, 180)           175              5           875    153,125
[180, 190)           185              1           185     34,225
Total                                29         4,535    715,725
Hence the grouped mean is:

x̄ = (Σ fk xk) / (Σ fk) = 4535/29 = 156.4

and the grouped variance is:

s² = (Σ fk xk²) / (Σ fk) − ((Σ fk xk) / (Σ fk))² = 715725/29 − (156.4)² = 219.2

where the sums run over the K = 7 classes.
giving a standard deviation of √219.2 = 14.8. For comparison, the ungrouped mean,
variance and standard deviation are 156.0, 217.0 and 14.7, respectively (compute
these yourself to verify!). Note the units for the mean and standard deviation are
‘millions of shares/week’, while the units for the variance are the square of these,
i.e. ‘(millions of shares/week)²’, which is one obvious reason why we often work
with the standard deviation, rather than the variance: it is in the original (and
more meaningful) units.
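The grouped calculations above can be reproduced in a few lines of Python. This is only an illustrative sketch; note that, like the guide, it squares the mean rounded to one decimal place, which is why it reproduces 219.2 rather than the slightly different value the unrounded mean would give.

```python
import math

# Midpoints and frequencies from the grouped trading volume table.
midpoints = [125, 135, 145, 155, 165, 175, 185]
freqs = [1, 4, 5, 6, 7, 5, 1]

n = sum(freqs)                                           # 29
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n  # 156.379...
# The guide rounds the mean to 156.4 before squaring; we do the same here.
var = sum(f * x**2 for f, x in zip(freqs, midpoints)) / n - round(mean, 1)**2
sd = math.sqrt(var)

print(round(mean, 1), round(var, 1), round(sd, 1))  # 156.4 219.2 14.8
```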
Activity 4.9 Calculate the range, variance and standard deviation of the following
sample values:

5   8   10   7   7   4   12
Solution
The range is:
x(n) − x(1) = 12 − 5 = 7
4. Data visualisation and descriptive statistics
Figure 4.5: Frequency density histogram of trading volume data. (Horizontal axis:
weekly average shares traded/day (millions), from 120 to 190; vertical axis:
frequency density.)
Figure 4.6: Cumulative frequency of trading volume data. (Horizontal axis: weekly
average shares traded/day (millions); vertical axis: cumulative frequency, 0 to 30.)
where x(1) is the minimum value in the dataset, and x(n) is the maximum value.
We have Σ xi = 53 and Σ xi² = 447, hence x̄ = 53/7 = 7.57 and the (sample)
variance is:

s² = (1/(n − 1)) × (Σ xi² − n x̄²) = (1/6) × (447 − 7 × (7.57)²) = 7.64.

Hence the (sample) standard deviation is:

s = √s² = √7.64 = 2.76.
Activity 4.10 Compare the means and standard deviations of the sample datasets
shown in Activity 4.2 and comment. (You may wish to use the following summary
statistics.)
Figure 4.7: Cumulative relative frequency of trading volume data. (Horizontal axis:
weekly average shares traded/day (millions); vertical axis: cumulative relative
frequency (%), 0 to 100.)
Figure 4.8: Cumulative relative frequency of trading volume (ungrouped) data.
(Horizontal axis: weekly average shares traded/day (millions); vertical axis:
cumulative relative frequency (%), 0 to 100.)
                   Statistics class   Economics class
Sum                      1,395             2,225
Sum of squares          46,713           100,387
Solution

For the statistics students:

x̄ = (1/50) Σ xi = 1395/50 = 27.9 and s = √[(1/49) × (46713 − (1395)²/50)] = 12.6.

So the statistics students’ mean and standard deviation attention spans are 27.9
minutes and 12.6 minutes, respectively. For the economics students:

x̄ = (1/60) Σ xi = 2225/60 = 37.1 and s = √[(1/59) × (100387 − (2225)²/60)] = 17.4.

So the economics students’ mean and standard deviation attention spans are 37.1
minutes and 17.4 minutes, respectively.
These statistics represent the main features of the distributions shown in the
previous question. The economics students have the higher mean and variability due
to the group which maintains interest throughout the class.
Activity 4.11 Calculate the mean and standard deviation of the following groups
of numbers (treat these as samples).
(a) 5, 3, 4, 7, 10, 7 and 1.
(b) 12, 10, 11, 14, 17, 14 and 8.
(c) 12, 8, 10, 16, 22, 16 and 4.
(d) 25, 9, 16, 49, 100, 49 and 1.
Comment on any relationships between the statistics for the groups.
Solution

The sample means and standard deviations are:

                             (a)     (b)     (c)     (d)
Mean                        5.29   12.29   12.57   35.57
Standard deviation (s.d.)   2.98    2.98    5.97   33.90

Comments are the following.

Dataset (b) is (a) + 7, hence mean(b) = mean(a) + 7 and s.d.(b) = s.d.(a).

Dataset (c) is 2 × (a) + 2, hence mean(c) = 2 × mean(a) + 2 and
s.d.(c) = 2 × s.d.(a).
Dataset (d) is the square of group (a).
These results are examples of transformations of random variables.
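These transformation rules are easy to verify numerically. The sketch below, which is purely illustrative, applies the shift and the linear rescaling to dataset (a) and confirms that the mean and standard deviation move exactly as claimed.

```python
import math

def mean_sd(xs):
    """Sample mean and standard deviation (divisor n - 1)."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m)**2 for x in xs) / (n - 1))
    return m, s

a = [5, 3, 4, 7, 10, 7, 1]
b = [x + 7 for x in a]       # a shift: the mean moves, the spread does not
c = [2 * x + 2 for x in a]   # scale and shift: mean -> 2*mean + 2, s.d. -> 2*s.d.

ma, sa = mean_sd(a)
mb, sb = mean_sd(b)
mc, sc = mean_sd(c)

print(round(mb - ma, 6))                    # 7.0
print(round(sb - sa, 6))                    # 0.0
print(round(mc - (2 * ma + 2), 6), round(sc - 2 * sa, 6))  # 0.0 0.0
```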
Activity 4.12 Consider again the wage rate data in Activity 4.1. Calculate the
range, interquartile range and standard deviation of these data.
Solution

The range is £8.50 − £7.80 = £0.70. For the quartiles we have:

Q1 = (x(5) + x(6))/2 = (7.90 + 7.90)/2 = £7.90

and:

Q3 = (x(15) + x(16))/2 = (8.20 + 8.20)/2 = £8.20.

Therefore, the interquartile range is £8.20 − £7.90 = £0.30. To compute the
(sample) variance, compute the following:

Σ xi² = 1303.30

Sxx = Σ xi² − (1/n)(Σ xi)² = 1303.30 − (1/20) × (161.4)² = 0.802

hence:

s² = Sxx/(n − 1) = 0.802/19 = 0.042211.

Hence the (sample) standard deviation is √0.042211 = £0.21. (Note two decimal
places are sufficient.)
Tip: Working out the quartiles is easier if the sample size is a multiple of 4. If we
have 4k items, listed in increasing order as x(1), . . . , x(4k), then:

Q1 = (x(k) + x(k+1))/2,   Q2 = (x(2k) + x(2k+1))/2   and   Q3 = (x(3k) + x(3k+1))/2.
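The tip above translates directly into code. The sketch below implements the averaging rule for a sorted sample of size 4k; the eight data values are a hypothetical sample (loosely modelled on the wage-rate data) used only to illustrate the rule.

```python
def quartiles_4k(sorted_xs):
    """Quartiles for a sorted sample whose size is a multiple of 4,
    using the averaging rule from the tip above."""
    n = len(sorted_xs)
    assert n % 4 == 0
    k = n // 4
    q = lambda j: (sorted_xs[j - 1] + sorted_xs[j]) / 2  # x(j) is 1-based
    return q(k), q(2 * k), q(3 * k)

# Hypothetical sample of size 8 (so k = 2) to illustrate the rule.
xs = sorted([7.8, 7.9, 7.9, 8.0, 8.1, 8.2, 8.2, 8.5])
q1, q2, q3 = quartiles_4k(xs)
print(q1, q2, q3)
```

For this sample the rule gives Q1 = 7.9, Q2 = 8.05 and Q3 = 8.2.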
Activity 4.13 The table below summarises the distribution of salaries for a sample
of 50 employees.
Salary (£000s)          [5, 10)   [10, 15)   [15, 20)   [20, 25)
Number of employees        8         14         21          7
(a) Draw a histogram for these data.
(b) Calculate the mean, standard deviation and median.
(c) What is the modal class for this sample?
Solution

(a) Calculations for the salary data and the histogram are:

Salary     Midpoint       Interval   Frequency   Frequency   fi xi     fi xi²
(£000s)    xi (£000s)      width        fi        density
[5, 10)        7.5            5           8          1.6      60.0      450.00
[10, 15)      12.5            5          14          2.8     175.0    2,187.50
[15, 20)      17.5            5          21          4.2     367.5    6,431.25
[20, 25)      22.5            5           7          1.4     157.5    3,543.75
Total                                    50                  760.0   12,612.50
Histogram of salaries. (Horizontal axis: salaries (in £000s), from 5 to 25; vertical
axis: frequency density.)
(b) The mean is 760/50 = 15.2 (in £000s).

The sample standard deviation is:

s = √[(1/49) × (12612.50 − (760)²/50)] = 4.65.

For the median, we use the approach for grouped-frequency data, i.e. we have:

15 + 5 × (25.5 − 22)/21 = 15.83.
(c) The modal class is 15–20 (where the modal class is, by extension of the mode,
the class with the greatest number of items in it).
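The grouped-median interpolation used in part (b) can be sketched as a small function. This is only an illustrative implementation of the rule applied above, using the target position (n + 1)/2 = 25.5 as in the solution.

```python
def grouped_median(edges, freqs):
    """Median by linear interpolation within the class containing
    position (n + 1)/2, as in the solution above."""
    n = sum(freqs)
    target = (n + 1) / 2  # 25.5 for n = 50
    cum = 0
    for i, f in enumerate(freqs):
        if cum + f >= target:
            width = edges[i + 1] - edges[i]
            return edges[i] + width * (target - cum) / f
        cum += f

m = grouped_median([5, 10, 15, 20, 25], [8, 14, 21, 7])
print(round(m, 2))  # 15.83
```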
Activity 4.14 In a sample of 64 observations, the numbers of households
containing 1, 2, 3, 4 and 5 persons were 8, 30, 11, 7 and 8, respectively. Calculate the
mean and standard deviation of household size.
Solution
It might help to display the data in a table:
Household size, xi    1    2    3    4    5
Frequency, fi         8   30   11    7    8

The summary statistics are:

Σ fk = 8 + 30 + 11 + 7 + 8 = 64

Σ fk xk = 1 × 8 + 2 × 30 + · · · + 5 × 8 = 169

and:

Σ fk xk² = 8 × 1² + 30 × 2² + · · · + 8 × 5² = 539.

So:

x̄ = (Σ fk xk) / (Σ fk) = 169/64 = 2.64

and:

s² = (Σ fk xk²) / (Σ fk) − ((Σ fk xk) / (Σ fk))² = 539/64 − (169/64)² = 1.45.

Therefore, the standard deviation is s = √1.45 = 1.20.
Activity 4.15 James set a class test for his students. The test was marked out of
10, and the students’ marks are summarised below. (Note that no student obtained
full marks, and every student scored at least 3.)
Mark (out of 10)      3   4   5   6    7   8   9
Number of students    1   1   6   5   11   3   1
(a) Display the data using an appropriate diagram.
(b) Calculate the mean and median marks.
(c) Calculate the standard deviation, the quartiles and the interquartile range.
Solution
(a) A histogram would be acceptable for this dataset. We have:
Class interval   Interval width   Frequency   Frequency density
[3, 4)                 1              1               1
[4, 5)                 1              1               1
[5, 6)                 1              6               6
[6, 7)                 1              5               5
[7, 8)                 1             11              11
[8, 9)                 1              3               3
[9, 10)                1              1               1
Histogram of Marks. (Horizontal axis: marks, from 3 to 10; vertical axis: frequency
density.)
(b) x̄ = 177/28 = 6.32. There are 28 students. The 14th lowest score was 7 and the
15th lowest was also 7. Therefore, the median mark also equals 7.
(c) We compute the sample standard deviation using:

s = √[(1/27) × ((1 × 3² + 1 × 4² + · · · + 1 × 9²) − (177)²/28)] = √1.78 = 1.33.
The lower quartile equals 5 and the upper quartile equals 7, so the interquartile
range = 7 − 5 = 2. Note that, perhaps surprisingly, the upper quartile equals
the middle quartile, which is, of course, the median. This sort of thing can
happen with discrete data!
4.10 Summary
This chapter, although in practice concerned with what you can do with data once you
have collected them, serves as a useful introduction to the whole half-course. It highlights
some of the problems with handling data and, furthermore, has introduced many of the
fundamental concepts, such as the mean, variance, discrete and continuous data etc.
4.11 Test your understanding
In this exercise you will practise aspects of data visualisation and descriptive statistics.
In statistical analysis, there are usually simply too many numbers to make sense of just
by staring at them. Data visualisation and descriptive statistics attempt to summarise
key features of the data to make them understandable and easy to communicate. The
main function of diagrams is to bring out interesting features of a dataset visually by
displaying its distribution, i.e. summarising the whole sample distribution of a variable.
Descriptive statistics allow us to summarise one feature of the sample distribution in a
single number. Here you will work with measures of central tendency, measures of
dispersion and skewness.
At a time of economic growth but political uncertainty, a random sample of 40
economists (from the population of all economists) produces the following forecasts for
the growth rate of an economy in the next year:
1.3   1.8   2.2   3.4   4.5   3.8   2.7   1.9   3.9   5.2
4.1   3.1   3.6   2.9   3.1   2.6   2.3   2.8   1.9   2.5
2.4   3.7   4.3   3.3   3.3   2.2   2.5   3.1   8.2   3.4
3.4   4.1   4.2   5.4   4.4   5.1   4.7   4.6   3.3   5.2
(a) Construct a stem-and-leaf diagram for these data.
(b) Draw a suitable histogram for these data.
(c) Comment on the shape of the sample distribution.
(d) Determine the median of the data using the stem-and-leaf diagram in (a).
(e) Produce a boxplot for these data.
(f) Using the following summary statistics, calculate the sample mean and sample
standard deviation of the growth rate forecasts:

Sum of data = Σ xi = 140.4 and Sum of squares of data = Σ xi² = 557.26.
(g) Comment on the relative values of the mean and median.
(h) What percentage of the data fall within one sample standard deviation of the
sample mean? And what percentage fall within two sample standard deviations of
the sample mean?
Solution
(a) A stem-and-leaf diagram for the data is:

Stem-and-leaf diagram of economy growth forecasts

Stem (%)   Leaves (0.1%)
   1       3 8 9 9
   2       2 2 3 4 5 5 6 7 8 9
   3       1 1 1 3 3 3 4 4 4 6 7 8 9
   4       1 1 2 3 4 5 6 7
   5       1 2 2 4
   6
   7
   8       2
(b) It would be sensible to use the stems from (a) as the classes (bins) for the
histogram.
Class interval   Interval width   Frequency   Frequency density
[1.0, 2.0)             1              4               4
[2.0, 3.0)             1             10              10
[3.0, 4.0)             1             13              13
[4.0, 5.0)             1              8               8
[5.0, 6.0)             1              4               4
[6.0, 7.0)             1              0               0
[7.0, 8.0)             1              0               0
[8.0, 9.0)             1              1               1
Histogram of economy growth forecasts. (Horizontal axis: economy growth forecasts
(%), from 0 to 10; vertical axis: frequency density.)
(c) The stem-and-leaf diagram and histogram show that the data are positively-skewed
(skewed to the right), due to the outlier forecast of 8.2%.
Note if you are ever asked to comment on the shape of a distribution, consider the
following points:
• Is the distribution (roughly) symmetric?
• Is the distribution bimodal?
• Is the distribution skewed (an elongated tail in one direction)? If so, what is
the direction of the skewness?
• Are there any outliers?
(d) There are n = 40 observations, so the median is the average of the 20th and 21st
ordered observations. Using the stem-and-leaf diagram in (a), we see that
x(20) = 3.3 and x(21) = 3.4. Therefore, the median is (3.3 + 3.4)/2 = 3.35%.
(e) Since Q2 is the median, which is 3.35, we now need the first and third quartiles, Q1
and Q3 , respectively. There are several methods for determining the quartiles, and
any reasonable approach would be acceptable in an examination. For simplicity,
here we will use the following since n is divisible by 4:
Q1 = x(n/4) = x(10) = 2.5% and Q3 = x(3n/4) = x(30) = 4.2%.
Hence the interquartile range (IQR) is Q3 − Q1 = 4.2 − 2.5 = 1.7%. Therefore, the
whisker limits must satisfy:
max(x(1), Q1 − 1.5 × IQR) and min(x(n), Q3 + 1.5 × IQR)

which is:

max(1.3, −0.05) = 1.30 and min(8.2, 6.75) = 6.75.
So there is just a single observation which lies outside the interval [1.30, 6.75],
which is x(40) = 8.2% and hence this is plotted individually in the boxplot. Since
this is less than Q3 + 3 × IQR = 4.2 + 3 × 1.7 = 9.3%, then this observation is an
outlier, rather than an extreme outlier. The boxplot is (a vertical orientation is also
fine):
Boxplot of economy growth forecasts. (Horizontal axis: economy growth forecasts (%),
from 1 to 9.)
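The whisker and outlier rules used in part (e) can be sketched in a few lines of Python. The quartiles and extreme observations below are taken from the working above; the code merely applies the 1.5 × IQR and 3 × IQR conventions to them.

```python
# Quartiles and extremes from the working above.
q1, q3 = 2.5, 4.2
iqr = q3 - q1  # 1.7

lower_fence = q1 - 1.5 * iqr  # -0.05
upper_fence = q3 + 1.5 * iqr  # about 6.75

x_min, x_max = 1.3, 8.2  # smallest and largest observations

lower_whisker = max(x_min, lower_fence)  # 1.3
upper_whisker = min(x_max, upper_fence)  # about 6.75

print(lower_whisker, upper_whisker)
print(x_max > upper_fence)        # True: 8.2 is plotted individually
print(x_max > q3 + 3 * iqr)       # False: an outlier, but not an extreme outlier
```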
(f) We have sample data, not population data, hence the (sample) mean is denoted by
x̄ and the (sample) standard deviation is denoted by s. We have:

x̄ = (1/n) Σ xi = 140.4/40 = 3.51%

and:

s² = (1/(n − 1)) × (Σ xi² − n x̄²) = (1/39) × (557.26 − 40 × (3.51)²) = 1.6527.

Therefore, the standard deviation is s = √1.6527 = 1.29%.
(g) In (c) it was concluded that the stem-and-leaf diagram and histogram of the data
were positively-skewed, and this is consistent with the mean being larger than the
median. It is possible to quantify skewness, although this is beyond the scope of
ST104a Statistics 1.
(h) We calculate:
x̄ − s = 3.51 − 1.29 = 2.22 and x̄ + s = 3.51 + 1.29 = 4.80
also:
x̄ − 2 × s = 3.51 − 2 × 1.29 = 0.93 and x̄ + 2 × s = 3.51 + 2 × 1.29 = 6.09.
Now we use the stem-and-leaf diagram to see that 29 observations are between 2.22
and 4.80 (i.e. the interval [2.22, 4.80]), and 39 observations are between 0.93 and
6.09 (i.e. the interval [0.93, 6.09]). So the proportion (or percentage) of the data in
each interval, respectively, is:
29/40 = 0.725 = 72.5% and 39/40 = 0.975 = 97.5%.
Some general points to note are the following.
• Many ‘bell-shaped’ distributions we meet – that is, distributions which look a bit
like the normal distribution – have the property that 68% of the data lie within
approximately one standard deviation of the mean, and 95% of the data lie within
approximately two standard deviations of the mean. The percentages in (h) are
fairly similar to these.
• The exercise illustrates the importance of (at least) one more decimal place than in
the original data. If we had 3.5% and 1.3% for the mean and standard deviation,
respectively, the ‘boundaries’ for the interval with one standard deviation would
have been 3.5 ± 1.3 ⇒ [2.2, 4.8]. Since 2.2 is a data value which appears twice, we
would have had to worry about to which side of the ‘boundary’ to allocate these.
(This type of issue can still happen with the extra decimal place, but much less
frequently.)
• When constructing a histogram, it is possible to ‘lose’ a pattern in the data – for
example, an approximate bell shape – through two common errors:
• too few class intervals (which is the same as too wide class intervals)
• too many class intervals (which is the same as too narrow class intervals).
For example, with too many class intervals, you mainly get 0, 1 or 2 items per
class, so any (true) peak is hidden by the subdivisions which you have used.
• The best number of (equal-sized) class intervals depends on the sample size. For
large samples, many class intervals will not lose the pattern, while for small
samples they will. However, with the datasets which tend to crop up in ST104a
Statistics 1, something like 6, 7, 8 or 9 class intervals are likely to work well. So
don’t forget this!
4.12 Key terms and concepts

Boxplot
Categorical variable
Continuous variable
Cumulative frequency
Discrete variable
Distribution
Dot plot
Histogram
Interquartile range
Mean
Measurable variable
Median
Mode
Nominal
Ordinal
Outliers
Range
Relative frequency
Skewness
Standard deviation
Stem-and-leaf diagram
Variance

4.13 Learning activities
Solutions can be found in Appendix A.
1. Identify and describe the variables in each of the following examples.
(a) Voting intentions in a poll of 1,000 people.
(b) The number of hours students watch television per week.
(c) The number of children per household receiving full-time education.
2. Think about why and when you would use each of the following:
(a) histogram
(b) stem-and-leaf diagram.
When would you not do so?
3. Find the mean of the number of hours of television watched per week by 10
students:
Number of hours watched: 2, 2, 2, 5, 5, 10, 10, 10, 10, 10.
4. Say whether the following statement is true or false and briefly give your reason(s).
‘The mean of a dataset is always greater than the median.’
5. If n = 4, x1 = 1, x2 = 4, x3 = 5 and x4 = 6, find:

(1/3) × (x2 + x3 + x4).

Why might you use this figure to estimate the mean?
6. For x1 = 4, x2 = 1 and x3 = 2, calculate:
(a) the median
(b) the mean.
7. If x1 = 4, x2 = 2, x3 = 2, x4 = 5 and x5 = 6, calculate:
(a) the mode
(b) the mean.
8. Calculate the mean, median and mode of the prices of spreadsheets (using the data
below).
Software   Price (in £)
Brand A         52
Brand B         64
Brand C         65
Brand D         82
Brand E        105
Brand F        107
Brand G        110
Brand H        115
Brand I        155
Brand J        195
Brand K        195
Brand L        239
Brand M        265

Total                    1,749
Number of observations      13
You should be able to show that the arithmetic mean is 134.54, the median is 110,
and the mode is 195.
9. Work out s2 for a sample of nine observations of the number of minutes students
took to complete a statistics problem.
Minutes to complete problem: 2, 4, 5, 6, 6, 7, 8, 11, 20.
10. State whether the following statement is true or false and briefly give your
reason(s). ‘Three quarters of the observations in a dataset are less than the lower
quartile.’
11. Calculate the range, variance, standard deviation and interquartile range of the
spreadsheet prices shown below. Check against the answers given after the data.
Name      Price   Price − Mean   (Price − Mean)²
Brand A     52       −82.54           6,812.60
Brand B     64       −70.54           4,975.67
Brand C     65       −69.54           4,835.60
Brand D     82       −52.54           2,760.29
Brand E    105       −29.54             872.52
Brand F    107       −27.54             758.37
Brand G    110       −24.54             602.14
Brand H    115       −19.54             381.75
Brand I    155        20.46             418.67
Brand J    195        60.46           3,655.60
Brand K    195        60.46           3,655.60
Brand L    239       104.46          10,912.21
Brand M    265       130.46          17,020.21
Total                  0.00          57,661.23

Variance = 4,805.10    Std. Dev. = 69.32    IQR = 113
12. The data below show the number of daily telephone calls received by an office
supplies company over a period of 25 working days.
219   541    58     7    13
476   418   177   175   455
258   312   164   314   336
121    77   183   133    78
291   138   244    36    48
(a) Construct a stem-and-leaf diagram for these data and use this to find the
median of the data.
(b) Find the first and third quartiles of the data.
(c) Would you expect the mean to be similar to the median? Explain.
(d) Comment on your figures.
4.14 A reminder of your learning outcomes
After completing this chapter, and having completed the essential reading and
activities, you should be able to:
calculate the following: arithmetic mean, median, mode, standard deviation,
variance, quartiles, range and interquartile range
explain the use and limitations of the above quantities
73
4. Data visualisation and descriptive statistics
draw and interpret histograms, stem-and-leaf diagrams, boxplots and cumulative
frequency distributions
incorporate labels and titles correctly in your diagrams and state the units which
you have used.
In summary, you should be able to use appropriate measures and diagrams in order to
explain and clarify data which you have collected or which are presented to you.
4.15 Sample examination questions
Solutions can be found in Appendix B.
1. Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as nominal or ordinal.
Justify your answer.
(a) Gross domestic product (GDP) of a country.
(b) Five possible responses to a customer satisfaction survey ranging from ‘very
satisfied’ to ‘very dissatisfied’.
(c) A person’s name.
2. The data below contain measurements of the low-density lipoproteins, also known
as the ‘bad’ cholesterol, in the blood of 30 patients. Data are measured in
milligrams per decilitre (mg/dL).
95    96    96    98    99
99   101   101   102   102
103  104   104   107   107
111  112   113   113   114
115  117   121   123   124
127  129   131   135   143
(a) Construct an appropriate histogram of the data.
(b) Find the mean (given that the sum of the data is 3342), the median and the
standard deviation (given that the sum of the squared data, Σ xi², is 377076).
(c) Comment on the data given the shape of the histogram.
3. The average daily intakes of calories, measured in kcals, for a random sample of 12
athletes were:
1808, 1936, 1957, 2004, 2009, 2101, 2147, 2154, 2200, 2231, 2500, 3061.
(a) Construct a boxplot of the data. (The boxplot does not need to be exactly to
scale, but values of box properties and whiskers should be clearly labelled.)
74
4.15. Sample examination questions
(b) Based on the shape of the boxplot you have drawn, describe the distribution of
the data.
(c) Name two other types of graphical displays which would be suitable to
represent the data. Briefly explain your choices.
Chapter 5
Probability theory
5.1 Aims

This chapter introduces the fundamental concepts of probability. Other courses,
particularly ST104b Statistics 2 and ST2187 Business analytics, applied
modelling and prediction, make fuller use of probability, both in theory and in
decision trees, and highlight the ways in which such information can be used.
We will look at probability at quite a superficial level in this half-course. Even so, you
may find that, although the concepts of probability introduced are simple, their
application in particular circumstances may be very difficult. Try not to lose heart. Aim
to:
be capable of dealing with the basic concepts and follow through on examples and
activities
relate the new ideas of probability to the examples given
appreciate its uses and applications.
5.2 Learning outcomes
After completing this chapter, and having completed the essential reading and
activities, you should be able to:
apply the ideas and notation involved in set theory to simple examples
recall the basic axioms of probability and apply them
distinguish between the ideas of conditional probability and independence
draw and use appropriate Venn diagrams
draw and use appropriate probability trees.
5.3 Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060]
Sections 3.1–3.3 and 3.5.
In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials
accessible via the ST104a Statistics 1 area at https://emfss.elearning.london.ac.uk
5.4 Further reading
Aczel, A.D. Complete Business Statistics (London: McGraw–Hill Higher
Education, 2009) seventh edition [ISBN 9780071287531] Sections 2.1–2.6.
Anderson, D.R., D.J. Sweeney, T.A. Williams, J. Freeman and E. Shoesmith
Statistics for Business and Economics (South-Western Cengage Learning, 2010)
eleventh edition [ISBN 9780324783247] Sections 4.1–4.4.
Lind, D.A., W.G. Marchal and S.A. Wathen Statistical Techniques in Business and
Economics (Boston: McGraw–Hill Higher Education, 2009) fourteenth edition
[ISBN 978007110004] The first half of Chapter 5.
Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and
Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN
9780471615170] Sections 3.1–3.3.
5.5 Introduction
Chance is what makes life worth living – if everything were known in advance, imagine
the disappointment! If decision-makers had perfect information about the future as well
as the present and the past, there would be no need to consider the concept of
probability. However, it is usually the case that uncertainty cannot be eliminated and
hence its presence should be recognised and used in the process of making decisions.
5.6 The concept of probability
Probability forms the bedrock on which statistical methods are based. This chapter will
be devoted to understanding this important concept, its properties and applications.
Probability can be viewed as a quantifiable measure of one’s degree of belief in a
particular event or set1 of interest. To motivate the use of the terms ‘event’ and ‘set’,
we begin by introducing the concept of an experiment. An experiment can take many
forms, but to prevent us from becoming over-excited, let us consider two basic examples:
i. the toss of a (fair) coin
ii. the roll of a (fair) die.
Sample space
The sample space, S, is the set of all possible outcomes of an experiment.
1
These terms can be used interchangeably.
78
5.6. The concept of probability
Example 5.1 For our two examples we have the following.
i. Coin toss: S = {H, T }, where H and T denote ‘heads’ and ‘tails’, respectively,
and are called the elements or members of the sample space.
ii. Die score: S = {1, 2, 3, 4, 5, 6}.
So the coin toss sample space has two elementary outcomes, H and T , while the score
on a die has six elementary outcomes. These individual elementary outcomes are
themselves events, but we may wish to consider slightly more interesting events of
interest.² For example, for the die score, we may be interested in the event of obtaining
an even score, or a score greater than 4 etc. Hence we proceed to define an event.
Event
An event is a collection of elementary outcomes from the sample space S of an
experiment, which is a subset of S.
Typically, we denote events by letters for ease of reference. For example, A = ‘an even
score’, and B = ‘a score greater than 4’. Hence A = {2, 4, 6} and B = {5, 6},
respectively.
The universal convention is that we define probability to lie on a scale from 0 to 1,
inclusive.³ Hence the probability of any event A, say, is denoted P (A) and is a real
number somewhere on the unit interval, i.e. P (A) ∈ [0, 1], where ‘∈’ means ‘is a
member of’. Note the following.
If A is an impossible event, then P (A) = 0.
If A is a certain event, then P (A) = 1.
For events A and B, if P (A) > P (B), then A is more likely to occur than B.
Therefore, we have a probability scale from 0 to 1 on which we are able to rank events,
as evident from the P (A) > P (B) result above. However, we need to consider how best
to quantify these probabilities. Let us begin with the experiments where each
elementary outcome is equally likely, hence our (fair) coin toss and (fair) die score fulfil
this condition (conveniently).
Determining event probabilities for equally likely elementary outcomes

For an experiment with N equally likely elementary outcomes, let n be the number of
these elementary outcomes which are favourable to the event of interest, A. Therefore:

P (A) = n/N.

² These terms can be used interchangeably. [See footnote 1 above.]
³ Multiplying by 100% expresses a probability as a percentage.
Example 5.2 For our two examples we have the following.
i. Coin toss: If A is the event ‘heads’, then N = 2 (H and T ), n = 1 (H), so for a
fair⁴ coin, P (A) = 1/2 = 0.5.⁵
ii. Die score: If A is the event ‘an even score’, then N = 6 (1, 2, 3, 4, 5 and 6),
n = 3 (2, 4 and 6), so for a fair die, P (A) = 3/6 = 1/2 = 0.5.
Similarly, if B is the event ‘a score greater than 4’, then N = 6 (as before),
n = 2 (5 and 6), hence P (B) = 2/6 = 1/3.
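The n/N rule maps naturally onto sets in code. This illustrative Python sketch builds the die-score sample space and the two events from Example 5.2, using exact fractions so no rounding is involved.

```python
from fractions import Fraction

# Die score sample space; P(A) = n/N for equally likely outcomes.
S = {1, 2, 3, 4, 5, 6}
A = {s for s in S if s % 2 == 0}   # 'an even score'
B = {s for s in S if s > 4}        # 'a score greater than 4'

p_A = Fraction(len(A), len(S))
p_B = Fraction(len(B), len(S))
print(p_A, p_B)  # 1/2 1/3
```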
Activity 5.1 Calculate the probability that, when two fair dice are rolled, the sum
of the upturned faces will be:
(a) an odd number
(b) less than 9
(c) exactly 12
(d) exactly 4.
Solution
The following table shows all the possibilities [as: (Score on first die, Score on
second die)]:

(1, 1)  (1, 2)  (1, 3)  (1, 4)  (1, 5)  (1, 6)
(2, 1)  (2, 2)  (2, 3)  (2, 4)  (2, 5)  (2, 6)
(3, 1)  (3, 2)  (3, 3)  (3, 4)  (3, 5)  (3, 6)
(4, 1)  (4, 2)  (4, 3)  (4, 4)  (4, 5)  (4, 6)
(5, 1)  (5, 2)  (5, 3)  (5, 4)  (5, 5)  (5, 6)
(6, 1)  (6, 2)  (6, 3)  (6, 4)  (6, 5)  (6, 6)
The total number of possible outcomes, N , is 36. Note all outcomes are equally likely.
(a) The number of favourable points = 18, so P (odd total) = 18/36 = 1/2 = 0.5.
(b) The number of favourable points = 26, so P (less than 9) = 26/36 = 13/18
= 0.7222.
(c) The number of favourable points = 1, so P (exactly 12) = 1/36 = 0.0278.
(d) The number of favourable points = 3, so P (exactly 4) = 3/36 = 1/12 = 0.0833.
⁴ Remember we are assuming equally likely elementary outcomes here, so a fair coin is
required. If we had a biased coin, then this approach would fail to accurately quantify
probabilities.
⁵ Probabilities can be reported as proper fractions or in decimal form. If in decimal form
(which is preferred), report answers to a maximum of four decimal places in the
examination.
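Counting favourable outcomes by hand, as in Activity 5.1, can be cross-checked by enumerating the 36 equally likely pairs. This sketch is only a verification aid, not part of the course material.

```python
from itertools import product

# Enumerate all 36 equally likely (die1, die2) outcomes.
outcomes = list(product(range(1, 7), repeat=2))
N = len(outcomes)  # 36

def p(event):
    """Probability as (favourable outcomes) / N."""
    return sum(1 for o in outcomes if event(o)) / N

p_odd = p(lambda o: (o[0] + o[1]) % 2 == 1)   # 18/36 = 0.5
p_lt9 = p(lambda o: o[0] + o[1] < 9)          # 26/36
p_12 = p(lambda o: o[0] + o[1] == 12)         # 1/36
p_4 = p(lambda o: o[0] + o[1] == 4)           # 3/36

print(p_odd, round(p_lt9, 4), round(p_12, 4), round(p_4, 4))
```

The four probabilities agree with the answers 0.5, 0.7222, 0.0278 and 0.0833 above.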
80
5.6. The concept of probability
Activity 5.2 A simple gambling game is played as follows. The gambler throws
three fair coins together with one fair die and wins £1 if the number of heads
obtained in the throw of the coins is greater than or equal to the score on the die.
Calculate the probability that the gambler wins.
Solution
The total number of equally likely outcomes is:
N = 2 × 2 × 2 × 6 = 48.
Here it is convenient to organise the calculations by numbers in descending order –
in this case, the value on the die. Clearly, we are only interested in cases for which
the highest die value is 3, 2 or 1.
Die shows 3:  the only winning outcome is HHH                            ⇒ 1 case.
Die shows 2:  the winning outcomes are HHH, HHT, HTH, THH                ⇒ 4 cases.
Die shows 1:  the winning outcomes are HHH, HHT, HTH, THH, HTT, THT, TTH ⇒ 7 cases.
We can simply add the numbers of cases, so there are n = 12 favourable outcomes.
Hence the required probability is 12/48 = 1/4 = 0.25.
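The case-by-case count in Activity 5.2 can also be confirmed by brute-force enumeration of all 48 outcomes, as in this illustrative sketch.

```python
from itertools import product

# All 48 equally likely (coin, coin, coin, die) outcomes.
wins = 0
total = 0
for coins in product('HT', repeat=3):
    for die in range(1, 7):
        total += 1
        if coins.count('H') >= die:   # gambler wins if #heads >= die score
            wins += 1

print(wins, total, wins / total)  # 12 48 0.25
```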
Activity 5.3 Three fair dice are thrown and you are told that at least one of the
upturned faces shows a 4. Using this fact, determine the (conditional) probabilities
of:
(a) exactly one 4
(b) exactly two 4s.
Solution
We first identify N , the total number of equally likely outcomes when at least one
die shows a 4, i.e. we restrict our attention to the ‘smaller’ sample space, rather than
the ‘full’ sample space of 6³ = 216 outcomes. If the first die shows a 4, then the
other two can be anything, so the ‘pattern’ is 4XY . There are six choices for each of
X and Y , and these choices are independent of each other. Hence there are
6 × 6 = 36 outcomes in which the first die shows a 4.
If the first die does not show a 4, then there are 11 possible combinations of values
for the other two dice. To see why, one method is to list the possible pairs of values:
left-hand number equal to 4:      46, 45, 44, 43, 42, 41
left-hand number not equal to 4:  64, 54, 34, 24, 14
where, of course, the outcome ‘44’ can only be shown in one of the two lists. This
‘listing’ approach works because the number of possibilities for the two dice is small
enough such that we can list all of those in which we are interested (unlike with
three dice).
Another, slightly different, method is to calculate the numbers in each row (without
listing them all). For the pattern 4X there are six choices of X, hence six cases, and
for the pattern [not 4]4 there are five choices for [not 4] and hence five cases, giving
eleven possibilities altogether. As mentioned above, anything except short lists
should be avoided, so it is useful to know lots of alternative ways for working out the
number of relevant outcomes.
Combining these remarks, we have:
N = 36 + (5 × 11) = 91.
We now identify the number of favourable outcomes, n, for two or more 4s.
If the first die shows a 4, then there has to be at least one 4 on the other two, so
there are 11 ways for the other two dice. If the first die does not show a 4, then there
are 5 choices for what it might show, while both the other two dice must show 4s.
Clearly, the two 4s can only happen in 1 way. So:
n = 11 + (5 × 1) = 16
and exactly one of these cases, (4, 4, 4), represents more than two 4s, so that fifteen
represent exactly two 4s.
Hence the required probabilities are as follows.
(a) We have:

P (exactly one 4) = (91 − 16)/91 = 75/91 = 0.8242.
(b) We have:

P (exactly two 4s) = 15/91 = 0.1648.
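The counts N = 91, 75 and 15 used above can be confirmed by enumerating the restricted sample space directly. A minimal Python sketch (illustrative only):

```python
from itertools import product

# All 6^3 = 216 outcomes for three dice, restricted to those with at least one 4.
restricted = [dice for dice in product(range(1, 7), repeat=3) if 4 in dice]

N = len(restricted)                                        # 91
one_four = sum(1 for d in restricted if d.count(4) == 1)   # 75
two_fours = sum(1 for d in restricted if d.count(4) == 2)  # 15

print(N, one_four / N, two_fours / N)
```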
5.7 Relative frequency
So far we have only considered equally likely experimental outcomes. Clearly, to apply
probabilistic concepts more widely, we require a more general interpretation – the
relative frequency interpretation.
Relative frequency approach to probability
Suppose the event A associated with some experiment either does or does not occur.
Also, suppose we conduct this experiment independently a total of F times. Suppose
that, following these repeated experiments, A occurs f times.
The ‘frequentist’ approach to probability would regard:
P (A) = f/F as F → ∞.
Example 5.3 For a coin toss with event A = {H}, if the coin is fair we would
expect that repeatedly tossing the coin F times would result in approximately
f = F/2 heads, hence P (A) = (F/2)/F = 1/2. Of course, this approach is not
confined to fair coins!
Intuitively, this is an appealing interpretation and is extremely useful when we come to
its use in statistical inference later on. However, do be aware that we are not advocating
that you perform all these experiments! Rather they are imaginary experiments, but the
concept gives a meaning to numerical probability.6
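The frequentist idea can be illustrated with a simulated, rather than physical, experiment. The Python sketch below (illustrative only; the seed value is arbitrary) tosses a fair coin F times for increasing F and prints the relative frequency of heads, which settles near 1/2:

```python
import random

random.seed(2019)  # arbitrary seed, for reproducibility

# Relative frequency f/F of heads over F independent fair-coin tosses.
for F in (100, 10_000, 1_000_000):
    f = sum(random.random() < 0.5 for _ in range(F))
    print(F, f / F)  # approaches 0.5 as F grows
```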
5.8 ‘Randomness’
Statistical inference is concerned with the drawing of conclusions from data which are
subject to randomness, perhaps due to the sampling procedure, perhaps due to
observational errors, perhaps for some other reason. Let us stop and think why, when
we repeat an experiment under apparently identical conditions, we get different results.
The answer is that although the conditions may be as identical as we are able to control
them to be, there will inevitably be a large number of uncontrollable (and frequently
unknown) variables which we do not measure and which have a cumulative effect on the
result of the sample or experiment. For example, weather conditions may affect the
outcomes of biological or other ‘field’ experiments.
The cumulative effect, therefore, is to cause variations in our results. It is this variation
which we term randomness and, although we never fully know the true generating
mechanism for our data, we can take the random component into account via the
concept of probability, which is, of course, why probability plays such an important role
in data analysis.
5.9 Properties of probability
We begin this section by presenting three simple, self-evident truths known as axioms
which list the basic properties we require of event probabilities.
6 There is another ‘school’ of probability thought, known as the ‘Bayesian’ school. We will touch on
this only briefly via Bayes’ formula later in this chapter. In short, the Bayesian view is that probability
is a degree of belief in an event’s occurrence based on the observer’s knowledge of the situation.
Axioms of probability
1. For any event A, P (A) ≥ 0.
2. For the sample space S, P (S) = 1.
3. If {Ai }, for i = 1, . . . , n, are mutually exclusive events, then the probability of
their ‘union’ is the sum of their respective probabilities, that is:
P (A1 ∪ A2 ∪ · · · ∪ An ) = P (A1 ) + P (A2 ) + · · · + P (An ).
The first two axioms should not be surprising. The third may appear a little more
daunting. Events are labelled mutually exclusive when they cannot both occur
simultaneously.
Example 5.4 When rolling a die once, the event A = ‘get an even score’ and the
event B = ‘get an odd score’ are mutually exclusive because they cannot occur
simultaneously.
Extending this, a collection of events is pairwise mutually exclusive if no two events can
occur simultaneously. For instance, the three events A, B and C are pairwise mutually
exclusive if A and B cannot occur together and B and C cannot occur together and A
and C cannot occur together. Another way of putting this is that a collection of events
is pairwise mutually exclusive if at most one of them can occur.
Related to this is the concept of a collection of events being collectively exhaustive.
This means at least one of them must happen, i.e. all possible experimental outcomes
are included among the collection of events.
5.9.1 Notation
Axiom 3 above introduced a new symbol. For the remainder of this chapter, various
symbols connecting sets will be used as a form of notational shorthand. It is important
to be familiar with these symbols, hence two sets of descriptors are provided – one basic
and the other more technical.
Symbol   Basic descriptor   Technical descriptor   Example
∪        or                 union                  A ∪ B = ‘A union B’
∩        and                intersect              A ∩ B = ‘A intersect B’
c        not                complement of          Ac = ‘complement of A’
|        given              conditional on         A | B = ‘A conditional on B’
Also, do make sure that you distinguish between a set and the probability of a set. This
distinction is important. A set, remember, is a collection of elementary outcomes from
S, whereas a probability (from Axiom 1) is a number on the unit interval, [0, 1]. For
example, A = ‘an even die score’, while P (A) = 0.5, for a fair die.
Figure 5.1: Venn diagram for pre-defined sets A, B and C recording the total score when
a die is rolled twice.
5.9.2 Venn diagrams
The previous coin and die examples have been rather simple (for illustrative purposes).
Of course, it is highly likely (in fact with a probability of 1!) that you will encounter
more challenging sets and sample spaces. Fear not, there is a helpful geometric
technique which can often be used – we represent the sample space elements using a
Venn diagram.
Imagine we roll a die twice and record the total score. The sample space will be:
S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.
Suppose we are interested in the following three events:
A = an even total, that is A = {2, 4, 6, 8, 10, 12}
B = a total strictly less than 8, that is B = {2, 3, 4, 5, 6, 7}
C = a total greater than 4, but less than 10, that is C = {5, 6, 7, 8, 9}.
Having defined these events, it is possible to place every element in the sample space S
into a Venn diagram, as shown in Figure 5.1.
The box represents S, so every possible outcome of the experiment (the total score
when a die is rolled twice) appears within the box. Three (overlapping) circles are
drawn representing the events A, B and C. Each element of S is then placed into the
appropriate area. For example, the area where the three circles all intersect represents
the event A ∩ B ∩ C into which we place the element ‘6’, since this is the only member
of S which satisfies all three events A, B and C.
Example 5.5 Using Figure 5.1, we can determine the following sets:
A ∩ B = {2, 4, 6}
A ∩ C = {6, 8}
A ∩ B ∩ C = {6}
(A ∪ B ∪ C)c = {11}
A ∩ B ∩ C c = {2, 4}
Ac ∩ B = {3, 5, 7}
(A ∪ C)c ∩ B = {3}
A | C = {6, 8}.
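Since events are just sets of outcomes, these results can be reproduced with ordinary set operations. A small Python check (illustrative only):

```python
# Sample space of total scores from two dice, and the events of Example 5.5.
S = set(range(2, 13))
A = {2, 4, 6, 8, 10, 12}   # even total
B = {2, 3, 4, 5, 6, 7}     # total strictly less than 8
C = {5, 6, 7, 8, 9}        # total greater than 4 but less than 10

print(A & B)            # A ∩ B = {2, 4, 6}
print(A & B & C)        # A ∩ B ∩ C = {6}
print(S - (A | B | C))  # (A ∪ B ∪ C)^c = {11}
print((S - A) & B)      # A^c ∩ B = {3, 5, 7}
```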
5.9.3 The additive law
We now introduce our first probability ‘law’ – the additive law.
The additive law
Let A and B be any two events. The additive law states that:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
So P (A ∪ B) is the probability that at least one of A and B occurs, and P (A ∩ B)
is the probability that both A and B occur.
Example 5.6 We can think about this using a Venn diagram. The total area of the
Venn diagram in Figure 5.2 is assumed to be 1, so area represents probability. Event
A is composed of all points in the left-hand circle, and event B is composed of all
points in the right-hand circle. Hence:
P (A) = area x + area z
P (B) = area y + area z
P (A ∩ B) = area z
P (A ∪ B) = area x + area y + area z.
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
= (area x + area z) + (area y + area z) − (area z)
= area x + area y + area z.
Therefore, to compute P (A ∪ B) we need to subtract P (A ∩ B) otherwise that
region would have been counted twice.
Example 5.7 Consider an industrial situation in which a machine component can
be defective in two ways such that:
P (defective in first way) = P (D1 ) = 0.01
P (defective in second way) = P (D2 ) = 0.05
P (defective in both ways) = P (D1 ∩ D2 ) = 0.001.
Figure 5.2: Venn diagram illustrating the additive law.
It follows that the probability that the component is defective is:
P (D1 ∪ D2 ) = P (D1 ) + P (D2 ) − P (D1 ∩ D2 ) = 0.01 + 0.05 − 0.001 = 0.059.
Figure 5.3: Venn diagram illustrating two mutually exclusive events.
Additive law – special case 1
If A and B are mutually exclusive events, i.e. they cannot occur simultaneously, then
P (A ∩ B) = 0. Hence:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
= P (A) + P (B) − 0
= P (A) + P (B).
Such events can be depicted by two non-overlapping sets in a Venn diagram as shown
in Figure 5.3. Now revisit Axiom 3, to see this result generalised for n mutually
exclusive events.
Additive law – special case 2
The probability of an event A not occurring, i.e. the complement, Ac , is:
P (Ac ) = 1 − P (A).
Activity 5.4 Use a Venn diagram, or otherwise, to ‘prove’ the following.
(a) P (Ac ) = 1 − P (A).
(b) P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
(c) If A implies B, then P (A) ≤ P (B).
(d) P (A ∪ B ∪ C) = P (A) + P (B) + P (C) − P (A ∩ B) − P (A ∩ C) − P (B ∩ C) + P (A ∩ B ∩ C).
Solution
These results can be seen very easily from a Venn diagram (although a
mathematician would not accept such a graphical approach as a ‘proof’, but we will
not worry about that here). The following is the Venn diagram for events A, B and
C. The lower case letters, that is a, b, . . . , g, stand for the areas (which represent the
probabilities) of the events for which the region surrounding each such letter stands.
[Venn diagram of A, B and C with the seven regions labelled a–g: a = A only, b = B only, c = C only, d = A ∩ B only, e = A ∩ C only, f = B ∩ C only, and g = A ∩ B ∩ C.]
(a) With the same notation from the diagram, the area for Ac is everything outside
the area for A, so P (Ac ) = 1 − P (A).
(b) With the notation from the diagram:
P (A ∪ B) = a + e + f + b + (d + g)
= (a + d + g + e) + (d + b + f + g) − (d + g)
= P (A) + P (B) − P (A ∩ B).
(c) If A implies B, then this means that the region for A is completely inside the
region for B, i.e. that the intersection of A and B is the whole of A. Hence the
area of the shape for A is no larger than the area of the shape for B, i.e.
P (A) ≤ P (B).
(d) It is clear from the diagram that:
area(A ∪ B ∪ C) = a + b + c + d + e + f + g
P (A) = a + d + e + g
P (A ∩ B) = d + g
P (B) = b + d + f + g
P (A ∩ C) = e + g
P (C) = c + e + f + g
P (B ∩ C) = f + g
P (A ∩ B ∩ C) = g.
So P (A) + P (B) + P (C) − P (A ∩ B) − P (A ∩ C) − P (B ∩ C) + P (A ∩ B ∩ C)
equals:
(a + d + e + g) + (b + d + f + g) + (c + e + f + g)
− ((f + g) + (e + g) + (d + g)) + g
=a+b+c+d+e+f +g
= area(A ∪ B ∪ C)
= P (A ∪ B ∪ C).
Activity 5.5 In a class of 35 students, 20 have taken Economics, 15 have taken
History, and 10 have taken Politics. Of these, 9 have taken both Economics and
History, 5 have taken both Economics and Politics, and 6 have taken both History
and Politics. 3 students have taken all three subjects. What is the probability that a
student chosen at random from the class has taken at least one of these subjects?
Solution
This is an application of the previous question with:

A = Economics, B = History and C = Politics.

So:

P (at least one subject) = (20 + 15 + 10 − 9 − 5 − 6 + 3)/35 = 28/35 = 0.8.
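The arithmetic here – the inclusion–exclusion formula from part (d) of the previous activity – can be checked directly on these counts. A minimal Python sketch (illustrative only):

```python
# Counts from the class of 35 students.
n = 35
nA, nB, nC = 20, 15, 10          # Economics, History, Politics
nAB, nAC, nBC, nABC = 9, 5, 6, 3

# Inclusion-exclusion for the number taking at least one subject.
union = nA + nB + nC - nAB - nAC - nBC + nABC

print(union, union / n)  # 28 students, probability 0.8
```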
Activity 5.6 Hard!
How many people do you need at a party so that the probability of at least two
people having the same birthday is greater than 0.5? (Ignore leap years!)
Solution
We have:

P (1 & 2 have different birthdays) = 364/365

P (1 & 2 have the same birthday) = 1 − 364/365

P (1, 2 & 3 have different birthdays) = (364/365) × (363/365)

P (at least two of 1, 2 & 3 have the same birthday) = 1 − (364/365) × (363/365)

..
.

P (first n people have different birthdays) = (364/365) × (363/365) × · · · × ((366 − n)/365)

P (at least two of the first n have the same birthday) = 1 − (364/365) × (363/365) × · · · × ((366 − n)/365).

We want the smallest n such that:

1 − (364/365) × (363/365) × · · · × ((366 − n)/365) > 0.5.
The relevant value is n = 23, so that if there are 23 people at a party then there is a
50% chance (in fact, a little more) that at least two of them have the same birthday!
The plot below shows how the probability changes as n increases.
[Figure: ‘Probability of at least 2 people with same birthday’ – probability (0 to 1) on the vertical axis against number of people (0 to 80) on the horizontal axis.]
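The smallest such n can be found by accumulating the product term by term. A short Python sketch (illustrative only):

```python
# Find the smallest n with P(at least two of n people share a birthday) > 0.5.
p_all_different = 1.0
n = 1
while 1 - p_all_different <= 0.5:
    n += 1
    p_all_different *= (366 - n) / 365  # nth person avoids the first n-1 birthdays

print(n, 1 - p_all_different)  # n = 23, probability just over 0.5
```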
5.9.4 The multiplicative law
This rule is concerned with the probability of two events occurring at the same time –
specifically when the two events have the special property of independence. An
informal definition of independence is that two events are said to be independent if one
event has no influence on the other.
The multiplicative law (for independent events)
Formally, events A and B are independent if the probability of their intersect is the
product of their individual probabilities, that is:
P (A ∩ B) = P (A) P (B).
Example 5.8 Consider rolling two fair dice. The score on one die has no influence
on the score on the other die. Hence the respective scores are independent events,
and so:
P (two sixes) = 1/6 × 1/6 = 1/36.
Note the multiplicative (or product) law does not hold for dependent events, which is
the subject of conditional probability, discussed shortly. Also, take a moment to
ensure that you are comfortable with the terms ‘mutually exclusive’ and ‘independent’.
These are not the same thing, so do not get these terms confused!
Activity 5.7 Two fair dice are rolled.
(a) What is the probability of two sixes?
(b) What is the probability of at least one six?
Solution
(a) By independence:
P (two sixes) = 1/6 × 1/6 = 1/36.
(b) We have:
P (at least one 6) = P (6 on first die and/or 6 on second die)
= P (6 on first die) + P (6 on second die) − P (6 on both dice)
= 1/6 + 1/6 − 1/36
= 11/36
= 0.3056.
Activity 5.8 A student has an important job interview in the morning. To ensure
he wakes up in time, he sets two alarm clocks which ring with probabilities 0.97 and
0.99, respectively. What is the probability that at least one of the alarm clocks will
wake him up?
Solution
We are told that the alarm clocks are independent. Therefore:
P (at least one rings) = 1 − P (neither rings) = 1 − (0.03 × 0.01) = 0.9997.
Activity 5.9 A fair coin is tossed 10 times.
(a) What is the probability that the 10th toss results in a head?
(b) If the first 9 tosses give 4 heads and 5 tails, what is the probability that the
10th toss results in a head?
(c) If the first 9 tosses give 9 heads and no tails, what is the probability that the
10th toss results in a head?
Solution
The answer is 0.5 in each case! The coin has no ‘memory’. This is because the
successive throws of a fair coin are independent.
Activity 5.10 The successful operation of three separate switches is needed to
control a machine. If the probability of failure of each switch is 0.1 and the failure of
any switch is independent of any other switch, what is the probability that the
machine will break down?
Solution
The probability that the machine will not break down equals the probability that all
three switches work. This equals 0.9 × 0.9 × 0.9 = 0.729, by independence. Hence the
probability it will break down is 1 − 0.729 = 0.271.
Activity 5.11 Of all the candles produced by a company, 0.01% do not have wicks
(the core piece of string). A retailer buys 10,000 candles from the company.
(a) What is the probability that all the candles have wicks?
(b) What is the probability that at least one candle will not have a wick?
Solution
Let X be the number of candles without a wick.
(a) We have:
P (X = 0) = (0.9999)¹⁰⁰⁰⁰ = 0.3679.
(b) We have:
P (X ≥ 1) = 1 − P (X = 0) = 1 − 0.3679 = 0.6321.
92
5.9. Properties of probability
Activity 5.12 A family has four children.
(a) What is the probability of at least one boy among the children?
(b) What is the probability that all the children are of the same sex?
Solution
For our sample space we can take (in order of age):

GGGG GGGB GGBG GBGG BGGG
GGBB GBGB GBBG BGGB BGBG
BBGG GBBB BGBB BBGB BBBG
BBBB

So we have N = 2⁴ = 16 possible outcomes.
(a) For ‘at least one boy’ the number of favourable points is 15 (all excluding
GGGG), so:

P (at least one boy) = 15/16 = 0.9375.

An alternative (and easier) method is to note that:

P (all girls) = 1/2 × 1/2 × 1/2 × 1/2 = 1/16

so:

P (at least one boy) = 1 − P (all girls) = 1 − 1/16 = 15/16 = 0.9375.
(b) For ‘all of the same sex’ the number of favourable points is 2, so:

P (all of the same sex) = 2/16 = 1/8 = 0.125.
Activity 5.13 Calculate the probabilities of the following events.
(a) Five independent tosses of a fair coin yield the sequence HHT T H.
(b) Five independent tosses of a fair coin yield Hs on the first and last tosses.
(c) A pair of fair dice is thrown and the faces are equal.
(d) Five cards are drawn at random from a deck of cards and all are the same suit.
(e) Four independent tosses of a fair coin result in at least two heads. What is the
probability that all four tosses are heads?
Solution
(a) By independence, we have (0.5)⁵ = 0.03125.
(b) By independence (noting we only care about the first and last outcomes), we
have (0.5)² = 0.25.
(c) Of the 36 equally likely outcomes, 6 have equal faces, hence the probability is
6/36 = 1/6 = 0.1667.
(d) Note this is sampling without replacement, and since we can have any card as
the first card drawn, the probability is:

1 × 12/51 × 11/50 × 10/49 × 9/48 = 0.00198.
(e) The sample space has 11 points, all equally likely, and one is the event of
interest, so the probability is 1/11 = 0.0909.
Activity 5.14 A company is concerned about interruptions to email. It was noticed
that problems occurred on 15% of workdays. To see how bad the situation is,
calculate the probabilities of an interruption during a five-day working week:
(a) on Monday and again on Tuesday
(b) for the first time on Thursday
(c) every day
(d) at least once during the week.
Solution
(a) This is the probability of the event occurring on Monday and on Tuesday. Each
has probability 0.15 and the outcome is independent of day of the week, so
calculate:
(0.15)² = 0.0225.
(b) We require the probability of the event not occurring on the first three days of
the week, but occurring on Thursday. So the result will be:
(1 − 0.15)³ × 0.15 = 0.092.
(c) Similar to (a), but for each weekday. We need to find:
(0.15)⁵ = 0.000076.
(d) The probability of no occurrence in the week will be that of no occurrence on
any of the five days, and this is:
(0.85)⁵ = 0.444.
Therefore the probability of at least one occurrence will be:
1 − 0.444 = 0.556.
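These four calculations can be reproduced in a few lines. A Python sketch (illustrative only):

```python
p = 0.15  # daily interruption probability, independent across days

print(p ** 2)            # (a) Monday and again on Tuesday
print((1 - p) ** 3 * p)  # (b) first time on Thursday
print(p ** 5)            # (c) every day
print(1 - (1 - p) ** 5)  # (d) at least once during the week
```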
5.10 Conditional probability and Bayes’ formula
We have just introduced the concept of independent events – one event has no influence
on another. Clearly, there are going to be many situations where independence does not
in fact hold, i.e. the occurrence of one event has a ‘knock-on’ effect on the probability of
another event occurring.
Example 5.9 For a single roll of a fair die, let A be ‘roll a 6’, and B be ‘an even
number’. The following probabilities are obvious:
P (A) = 1/6
P (B) = 1/2
P (A | B) = 1/3.
So, we see that the probability of a 6 changes from 1/6 to 1/3 once we are given the
information that ‘an even number’ has occurred. Similarly, the probability of an even
number changes from 1/2 to 1, conditional on a 6 occurring, i.e. P (B | A) = 1.
Example 5.10 In order to understand and develop formulae for conditional
probability, consider the following simple example, representing the classification by
gender and subject (where A, B, C and D are defined below) of 144 college students.
Subject           Female   Male   Total
A: Mathematics       4       14     18
B: Economics        17       41     58
C: Science           4       25     29
D: Arts             28       11     39
Total               53       91    144
Let F = ‘Female’ and M = ‘Male’, then P (A) = 18/144, P (F ) = 53/144 and
P (A ∩ F ) = 4/144. Note that P (A ∩ F ) ≠ P (A) P (F ), hence A and F are not
independent events.
From the table we have the following probabilities:
P (A | F ) = 4/53 ≠ P (A)
P (F | A) = 4/18 ≠ P (F ).
The correct relationship of these conditional probabilities to the original
unconditional probabilities is:
P (A | F ) = P (A ∩ F )/P (F ) = (4/144)/(53/144) = 4/53.
Similarly, noting that P (A ∩ F ) = P (F ∩ A), we have:
P (F | A) = P (A ∩ F )/P (A) = (4/144)/(18/144) = 4/18.
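These conditional probabilities can be computed directly from the table’s counts. A Python sketch (illustrative only; the dictionary layout is just one convenient choice):

```python
# (female, male) counts by subject, from the table in Example 5.10.
counts = {'A': (4, 14), 'B': (17, 41), 'C': (4, 25), 'D': (28, 11)}
total = 144

p_A = sum(counts['A']) / total                    # P(A) = 18/144
p_F = sum(f for f, m in counts.values()) / total  # P(F) = 53/144
p_AF = counts['A'][0] / total                     # P(A ∩ F) = 4/144

print(p_AF / p_F)  # P(A | F) = 4/53
print(p_AF / p_A)  # P(F | A) = 4/18
```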
Activity 5.15 Check that some of the other conditional probabilities obtained from
the table in Example 5.10 satisfy this formula, such as P (C | F ) = 4/53.
Note also another important relationship involving conditional probability is the ‘total
probability formula’ (discussed in greater depth shortly). This expresses an
unconditional probability in terms of other, conditional probabilities.
Example 5.11 Continuing with Example 5.10, we have:

P (A) = 18/144 = (4/53) × (53/144) + (14/91) × (91/144) = P (A | F ) P (F ) + P (A | M ) P (M ).
Activity 5.16 Two fair coins, a 10p and a 50p, are tossed.
(a) Find the probability that:
i. both coins show heads
ii. different faces show up
iii. at least one head shows.
(b) You are told that the 10p shows heads. What is the probability that both coins
show heads?
(c) You are told that at least one of the two coins shows heads. What is the
probability that both coins show heads?
Solution
Let the sample space be S = {HH, HT, T H, T T } where, for example, ‘HT ’ means
‘heads on the 10p coin and tails on the 50p coin’. Each element is equally likely, so
each has a probability equal to 0.25.
(a)
i. We have P (HH) = 0.25.
ii. We have P (different faces) = P (HT or T H) = P (HT ) + P (T H) = 0.5.
iii. We have P (at least 1 head) = P (HT or T H or HH) =
P (HT ) + P (T H) + P (HH) = 0.75.
(b) To find P (both show heads given that the 10p shows heads) we define:
A = both show heads = {HH} and B = 10p shows heads = {HT, HH}
and use conditional probability, i.e. the result that P (A | B) = P (A ∩ B)/P (B).
To do this we note that in this case the event A ∩ B is the same as the event A
(which is important to spot, when it happens). Hence:
P (HH | 10p shows heads) = P (A ∩ B)/P (B) = P (A)/P (B) = P (HH)/P (HT or HH) = 0.25/0.5 = 0.5.
(c) Here we want to find P (A | C) = P (A ∩ C)/P (C) where:
C = at least one head = {HH, HT, T H}.
Hence P (HH | at least one head) equals:
P (A ∩ C)/P (C) = P (A)/P (C) = P (HH)/P (HH or HT or TH) = 0.25/0.75 = 0.33.
Activity 5.17 A fair coin is thrown six times in succession.
(a) Find the probability that exactly three throws come up heads.
(b) Assuming that there are exactly three heads, what is the (conditional)
probability that they are on consecutive throws?
Solution
There are two ‘fair’ choices for the outcome of each throw, hence there are two
equally likely outcomes (H and T , using obvious notation) of each. Moreover, we can
assume that one throw’s outcome does not affect the others. That is, the outcomes
of the different throws are independent, which (always!) means we multiply the
probabilities. Combining these two thoughts, there are:
N = 2 × 2 × 2 × 2 × 2 × 2 = 2⁶ = 64

items (‘outcomes’) in the full, six-throw sample space, each with probability:

1/2 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = (1/2)⁶ = 1/64.
(a) It is (just about) possible to list accurately the 20 cases in which there are
exactly three heads (and hence exactly three tails), identifying the four cases
(first group) in which the three Hs are adjacent.

3 Hs together:    HHHTTT THHHTT TTHHHT TTTHHH

2 consecutive Hs: HHTHTT HHTTHT HHTTTH THHTHT THHTTH HTHHTT
                  TTHHTH HTTHHT THTHHT HTTTHH THTTHH TTHTHH

All Hs separate:  HTHTHT HTHTTH HTTHTH THTHTH
To make sure that you count everything, but just once, it is essential to organise
the listed items according to some pattern.
Hence:
P (exactly 3 Hs) = n/N = 20/64 = 5/16 = 0.3125.
(b) If there are exactly three heads on consecutive throws, then it is possible to
specify which of these different outcomes we are dealing with by identifying how
many (if any) T s there are before the sequence of Hs. (The remaining T s must
be after the sequence of Hs.) The possible values (out of 3) are 0, 1, 2 and 3, so
there are four cases.
Hence:
P (3 consecutive Hs | exactly 3 Hs) = n/N = 4/20 = 1/5 = 0.2.
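Rather than listing the 20 sequences by hand, both probabilities can be confirmed by enumerating all 64 sequences. A Python sketch (illustrative only):

```python
from itertools import product

# All 2^6 = 64 equally likely six-toss sequences.
seqs = [''.join(s) for s in product('HT', repeat=6)]

three_heads = [s for s in seqs if s.count('H') == 3]
print(len(three_heads) / len(seqs))  # (a) 20/64 = 0.3125

# Among those, sequences whose three heads are consecutive contain 'HHH'.
consecutive = [s for s in three_heads if 'HHH' in s]
print(len(consecutive) / len(three_heads))  # (b) 4/20 = 0.2
```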
Activity 5.18 Let K be the event of drawing a ‘king’ from a well-shuffled deck of
playing cards. Let D be the event of drawing a ‘diamond’ from the pack. Determine:
(a) P (K)
(b) P (D)
(c) P (K c )
(d) P (K ∩ D)
(e) P (K ∪ D)
(f) P (K | D)
(g) P (D | K)
(h) P (D ∪ K c )
(i) P (Dc ∩ K)
(j) P ((Dc ∩ K) | (D ∪ K)).
Are the events D and K independent, mutually exclusive, neither or both?
Solution
(a) We have:
P (K) = 4/52 = 1/13.
(b) We have:
P (D) = 13/52 = 1/4.
(c) We have:
P (K c ) = 1 − 1/13 = 12/13.
(d) We have:
P (K ∩ D) = 1/52.
(e) We have:
P (K ∪ D) = 16/52 = 4/13.
(f) We have:
P (K | D) = 1/13.
(g) We have:
P (D | K) = 1/4.
(h) We have:
P (D ∪ K c ) = 49/52.
(i) We have:
P (Dc ∩ K) = 3/52.
(j) We have:
P (Dc ∩ K | D ∪ K) = 3/16.
The events are independent since P (K ∩ D) = P (K) P (D), but they are not
mutually exclusive (consider the King of Diamonds).
Activity 5.19 At a local school, 90% of the students took test A, and 15% of the
students took both test A and test B. Based on the information provided, which of
the following calculations are not possible, and why? What can you say based on the
data?
(a) P (B | A).
(b) P (A | B).
(c) P (A ∪ B).
If you knew that everyone who took test B also took test A, how would that change
your answers?
Solution
(a) Possible:

P (B | A) = P (A ∩ B)/P (A) = 0.15/0.90 = 0.167.
(b) Not possible because P (B) is unknown and we would need to calculate:

P (A | B) = P (A ∩ B)/P (B) = 0.15/P (B).
(c) Not possible because P (A ∪ B) = P (A) + P (B) − P (A ∩ B) and, as discussed in
(b), we do not know P (B).
If we knew that everyone who took test B also took test A, then P (A | B) = 1 and
hence A ∪ B = A. Therefore P (A ∪ B) = P (A) = 0.90.
Activity 5.20 Given two events, A and B, state why each of the following is not
possible. Use formulae or equations to illustrate your answer.
(a) P (A) = −0.46.
(b) P (A) = 0.26 and P (Ac ) = 0.62.
(c) P (A ∩ B) = 0.92 and P (A ∪ B) = 0.42.
(d) P (A ∩ B) = P (A) P (B) and P (B) > P (B | A).
Solution
(a) Not possible because a probability cannot be negative since 0 ≤ P (A) ≤ 1.
(b) Not possible because P (A) + P (Ac ) = 1. Here 0.26 + 0.62 = 0.88 ≠ 1.
(c) Not possible because A ∩ B is a subset of A ∪ B, hence we have
P (A ∩ B) ≤ P (A ∪ B). Here 0.92 > 0.42.
(d) Not possible because two events cannot both be independent (as implied by the
condition P (A ∩ B) = P (A) P (B)) and dependent (as implied by the condition
P (B) > P (B | A)) at the same time.
5.10.1 Bayes’ formula
We formally define conditional probabilities.
Conditional probability
For any two events A and B, we define conditional probabilities as:
P (A | B) = P (A ∩ B)/P (B)   and   P (B | A) = P (A ∩ B)/P (A).     (5.1)
In words: ‘the probability of one event, given a second event, is equal to the
probability of both, divided by the probability of the second (conditioning) event.’
This is the simplest form of Bayes’ formula, and this can be expressed in other ways.
Rearranging (5.1), we obtain:
P (A ∩ B) = P (A | B) P (B) = P (B | A) P (A)
from which we can derive Bayes’ formula.
Bayes’ formula
The simplest form of Bayes’ formula is:
P (A | B) = P (B | A) P (A)/P (B).
Activity 5.21 Events A, B and C have the following probabilities:
P (A) = 0.4
P (B) = 0.5
P (C) = 0.1
P (A | B) = 0.2
P (B | C) = 0
P (C | A) = 0.25.
(a) Are the following statements ‘True’ or ‘False’ ?
i. C implies A.
ii. B and C are mutually exclusive.
iii. A ∪ B ∪ C = S.
(b) What is P (B | A)?
(c) What is P (B | Ac )?
Solution
(a)
i. The necessary condition for C implies A is that P (A | C) = 1. Using Bayes’
formula, we see that:
P (A | C) = P (C | A) P (A)/P (C) = (0.25 × 0.4)/0.1 = 1
so the statement is true.
ii. P (B | C) = 0, and using Bayes’ formula it follows that P (B ∩ C) = 0, so we
see that (ignoring continuous distributions) the two events are mutually
exclusive, so the statement is true.
iii. Since C implies A, it follows that:
P (A ∪ B ∪ C) = P (A ∪ B)
= P (A) + P (B) − P (A ∩ B)
= 0.4 + 0.5 − P (A | B) P (B)
= 0.4 + 0.5 − 0.2 × 0.5
= 0.8.
Since P (S) = 1, it follows that A ∪ B ∪ C ≠ S, so the statement is false.
(b) Applying Bayes’ formula, we have:
P (B | A) = P (B ∩ A)/P (A) = P (A | B) P (B)/P (A) = (0.2 × 0.5)/0.4 = 0.25.
(c) Applying Bayes’ formula, we have:
P (B | Ac ) = P (Ac | B) P (B)/P (Ac ) = (0.8 × 0.5)/0.6 = 2/3 = 0.6667.
5.10.2 Total probability formula
The simplest case of the total probability formula involves calculating the probability of
an event A from information about its two conditional probabilities with respect to some
other event B and its complement, B c , together with knowledge of P (B). Note that:
B and B c are mutually exclusive
B and B c are collectively exhaustive.
Fulfilment of these criteria (being mutually exclusive and collectively exhaustive) allows
us to view B and B c as a partition of the sample space.
The (simplest form of the) total probability formula
The total probability formula is:
P (A) = P (A | B) P (B) + P (A | B c ) P (B c ).
In words: ‘the probability of an event is equal to its conditional probability on a
second event times the probability of the second event, plus its probability conditional
on the second event not occurring times the probability of that non-occurrence.’
There is a more general form of the total probability formula. Let B1 , B2 , . . . , Bn
partition the sample space S into n pairwise mutually exclusive (at most one of them
can occur) and collectively exhaustive (at least one of them must occur) events. For
example, for n = 4, see Figure 5.4.
B3
S
B1
B2
B4
Figure 5.4: An example of a partitioned sample space.
Figure 5.5 superimposes an event A.
Extending the simple form of the total probability formula, we obtain:
P (A) = P (A | B1 ) P (B1 ) + P (A | B2 ) P (B2 ) + · · · + P (A | Bn ) P (Bn ).
Figure 5.5: The event A within a partitioned sample space.
Recall that P (B | A) = P (A ∩ B)/P (A) = P (A | B) P (B)/P (A), so assuming we have
the partition B and B c , then:
P (B | A) = P (A | B) P (B) / (P (A | B) P (B) + P (A | B c ) P (B c )).
A more general partition gives us a more complete form of Bayes’ formula.
General form of Bayes’ formula
For a general partition of the sample space S into B1 , B2 , . . . , Bn , and for some event
A, then:
P (Bk | A) = P (A | Bk ) P (Bk ) / (P (A | B1 ) P (B1 ) + · · · + P (A | Bn ) P (Bn )).
Activity 5.22 In an audit Bill analyses 60% of the audit items and George analyses
40%. Bill’s error rate is 5% and George’s error rate is 3%. Suppose an item is
sampled at random.
(a) What is the probability that it is in error (i.e. audited incorrectly)?
(b) If the chosen item is incorrect what is the probability that Bill is to blame?
Solution
Let B = Bill audits item, G = George audits item, and E = incorrect audit. Hence
P (B) = 0.6, P (G) = 0.4, P (E | B) = 0.05 and P (E | G) = 0.03.
(a) Using the total probability formula:
P (E) = P (E | B) P (B) + P (E | G) P (G) = (0.05 × 0.6) + (0.03 × 0.4) = 0.042.
(b) Using the conditional probability formula:

P(B | E) = P(E | B) P(B) / P(E) = (0.05 × 0.6)/0.042 = 0.71.
5. Probability theory
Activity 5.23 Two fair coins are tossed. You are told that ‘at least one is a head’.
What is the probability that both are heads?
Solution
At first sight it may appear that this should be equal to 1/2, but this is wrong! The
correct solution may be achieved in several ways; here we shall use a conditional
probability approach.
Let HH denote the event ‘both coins show heads’, and let A denote the event ‘at
least one coin shows a head’. Then:
P(HH | A) = P(A ∩ HH)/P(A) = P(HH)/P(A) = (1/4)/(3/4) = 1/3.
Note that A ∩ HH = HH, i.e. ‘at least one head’ and ‘both show heads’ only
corresponds to ‘both show heads’, HH.
Activity 5.24 The probability of a horse winning a race is 0.3 if it is dry and 0.5 if
it is wet. The weather forecast gives the chance of rain as 40%.
(a) Find the probability that the horse wins.
(b) If you are told that the horse lost the race, what is the probability that the
weather was dry on the day of the race? (Assuming you cannot remember!)
Solution
Let A = ‘horse wins’, A^c = ‘horse loses’, D = ‘dry’ and D^c = ‘wet’. We have:

P(A | D) = 0.3,  P(A | D^c) = 0.5,  P(D) = 0.6  and  P(D^c) = 0.4.
(a) We have:
P(A) = P(A | D) P(D) + P(A | D^c) P(D^c) = (0.3 × 0.6) + (0.5 × 0.4) = 0.38.
Hence the horse has a probability of 0.38 of winning the race.
(b) From (a), we can determine that there is a 62% chance that the horse loses the
race, i.e. P (Ac ) = 0.62. Hence:
P(D | A^c) = P(A^c | D) P(D) / P(A^c) = (0.7 × 0.6)/0.62 = 0.6774.
That is, the probability it is dry, given that the horse loses, is 0.6774, or about
68%.
Activity 5.25 A restaurant manager classifies customers as well-dressed,
casually-dressed or poorly-dressed and finds that 50%, 40% and 10%, respectively,
fall into these categories. The manager found that wine was ordered by 70% of the
well-dressed, by 50% of the casually-dressed and by 30% of the poorly-dressed.
104
5.10. Conditional probability and Bayes’ formula
(a) What is the probability that a randomly chosen customer orders wine?
(b) If wine is ordered, what is the probability that the person ordering is
well-dressed?
(c) If wine is not ordered, what is the probability that the person ordering is
poorly-dressed?
Solution
The following notation is used:
• W = well-dressed
• C = casually-dressed
• P = poorly-dressed
• O = wine ordered.
(a) We have:
P(O) = P(O | W) P(W) + P(O | C) P(C) + P(O | P) P(P) = (0.7 × 0.5) + (0.5 × 0.4) + (0.3 × 0.1) = 0.58.
(b) Using Bayes’ theorem:

P(W | O) = P(O | W) P(W) / P(O) = (0.7 × 0.5)/0.58 ≈ 0.60.

(c) We require:

P(P | O^c) = P(O^c | P) P(P) / P(O^c) = (0.7 × 0.1)/0.42 ≈ 0.17.
Activity 5.26 In a large lecture, 60% of students are female and 40% are male.
Records show that 15% of female students and 20% of male students are registered
as part-time students.
(a) If a student is chosen at random from the lecture, what is the probability that
the student studies part-time?
(b) If a randomly chosen student studies part-time, what is the probability that the
student is male?
Solution
Let P T = ‘part time’, F = ‘female’ and M = ‘male’.
(a) We have:
P (P T ) = P (P T | F ) P (F ) + P (P T | M ) P (M ) = (0.15 × 0.6) + (0.2 × 0.4) = 0.17.
(b) We have:
P(M | PT) = P(PT | M) P(M) / P(PT) = (0.2 × 0.4)/0.17 = 0.4706.
Activity 5.27 20% of men show early signs of losing their hair. 2% of men carry a
gene that is related to hair loss. 80% of men who carry the gene experience early
hair loss.
(a) What is the probability that a man carries the gene and experiences early hair
loss?
(b) What is the probability that a man carries the gene, given that he experiences
early hair loss?
Solution
Using obvious notation, P (H) = 0.2, P (G) = 0.02 and P (H | G) = 0.8.
(a) We require:
P (G ∩ H) = P (G) P (H | G) = 0.02 × 0.8 = 0.016.
(b) We require:
P(G | H) = P(G ∩ H)/P(H) = 0.016/0.2 = 0.08.
Activity 5.28 James is a salesman for a company and sells two products, A and B.
He visits three different customers each day. For each customer, the probability that
James sells product A is 1/3 and the probability is 1/4 that he sells product B. The
sale of product A is independent of the sale of product B during any visit, and the
results of the three visits are mutually independent. Calculate the probability that
James will:
(a) sell both products, A and B, on the first visit
(b) sell only one product during the first visit
(c) make no sales of product A during the day
(d) make at least one sale of product B during the day.
Solution
Let A = ‘product A sold’ and B = ‘product B sold’.
(a) We have:

P(A ∩ B) = P(A) P(B) = 1/3 × 1/4 = 1/12.
(b) We have:

P(A ∩ B^c) + P(A^c ∩ B) = (1/3 × 3/4) + (2/3 × 1/4) = 5/12.
(c) Since P(A^c) = 2/3, then:

P(no A sales all day) = (2/3)³ = 8/27.
(d) We have:

P(at least 1 B sale) = 1 − P(no B sales) = 1 − (3/4)³ = 37/64.
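The four calculations above can be checked exactly with Python's `fractions` module; a minimal sketch (the variable names are ours):

```python
from fractions import Fraction

p_A = Fraction(1, 3)  # P(product A sold on any one visit)
p_B = Fraction(1, 4)  # P(product B sold on any one visit)

both = p_A * p_B                              # (a) both sold on the first visit
only_one = p_A * (1 - p_B) + (1 - p_A) * p_B  # (b) exactly one sold on the first visit
no_A_all_day = (1 - p_A) ** 3                 # (c) no A sales across three visits
at_least_one_B = 1 - (1 - p_B) ** 3           # (d) at least one B sale across three visits

print(both, only_one, no_A_all_day, at_least_one_B)  # 1/12 5/12 8/27 37/64
```

Working in exact fractions avoids any rounding and makes the answers directly comparable with the hand calculations.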
Activity 5.29 Tower Construction Company (‘Tower’) is determining whether it
should submit a bid for a new shopping centre.
In the past, Tower’s main competitor, Skyrise Construction Company (‘Skyrise’),
has submitted bids 80% of the time. If Skyrise does not bid on a job, the probability
that Tower will get the job is 0.6.
If Skyrise does submit a bid, the probability that Tower gets the job is 0.35.
(a) What is the probability that Tower will get the job?
(b) If Tower gets the job, what is the probability that Skyrise made a bid?
(c) If Tower did not get the job, what is the probability that Skyrise did not make a
bid?
Solution
(a) We have:
P (Tower gets it) = (0.2 × 0.6) + (0.8 × 0.35) = 0.4.
(b) We have:

P(Skyrise bid | Tower gets job) = P(Tower gets job | Skyrise bid) P(Skyrise bid) / P(Tower gets job) = (0.35 × 0.8)/0.4 = 0.7.
(c) We have:

P(Skyrise did not bid | Tower did not get job) = P(Tower did not get job | Skyrise did not bid) P(Skyrise did not bid) / P(Tower did not get job) = (0.4 × 0.2)/0.6 = 2/15 ≈ 0.13.
Activity 5.30 Hard!
There are 3 identical urns.
The first urn contains 7 red balls and 3 white balls.
The second urn contains 5 red balls and 5 white balls.
The third urn contains 4 red balls and 8 white balls.
One of the urns is selected at random (i.e. each urn has a 1-in-3 chance of being
selected). Balls are then drawn from the selected urn without replacement.
The first ball is red.
The second ball is white.
The third ball is red.
The fourth ball is red.
At each stage of sampling, calculate the probabilities of the selected urn being Urn I,
Urn II or Urn III.
Solution
The following table shows the relevant calculations. In each round, ‘mix of balls’ is what remains in each urn after that round’s draw (conditional on that urn having been selected), and the P(Ball) × P(Urn) columns multiply the probability of the observed colour from each urn by that urn’s probability before the draw:

Round    Mix of balls: I / II / III     P(Ball) × P(Urn): I      II                       III                      Sum
Start    7R, 3W / 5R, 5W / 4R, 8W       —                        —                        —                        —
1 (R)    6R, 3W / 4R, 5W / 3R, 8W       7/10 × 1/3 = 0.2333      5/10 × 1/3 = 0.1667      4/12 × 1/3 = 0.1111      0.5111
2 (W)    6R, 2W / 4R, 4W / 3R, 7W       3/9 × 0.4565 = 0.1522    5/9 × 0.3262 = 0.1812    8/11 × 0.2174 = 0.1581   0.4915
3 (R)    5R, 2W / 3R, 4W / 2R, 7W       6/8 × 0.3097 = 0.2323    4/8 × 0.3687 = 0.1844    3/10 × 0.3217 = 0.0965   0.5132
4 (R)    4R, 2W / 2R, 4W / 1R, 7W       5/7 × 0.4526 = 0.3232    3/7 × 0.3593 = 0.1540    2/9 × 0.1880 = 0.0418    0.5190

The probabilities of the selected urn being Urn I, II or III at each stage are then:

Round    P(Urn I)                  P(Urn II)                 P(Urn III)
Start    0.3333                    0.3333                    0.3333
1        0.2333/0.5111 = 0.4565    0.1667/0.5111 = 0.3262    0.1111/0.5111 = 0.2174
2        0.1522/0.4915 = 0.3097    0.1812/0.4915 = 0.3687    0.1581/0.4915 = 0.3217
3        0.2323/0.5132 = 0.4526    0.1844/0.5132 = 0.3593    0.0965/0.5132 = 0.1880
4        0.3232/0.5190 = 0.6227    0.1540/0.5190 = 0.2967    0.0418/0.5190 = 0.0805
Note that the ‘Sum’ row is calculated using the total probability formula, and the
urn probabilities are computed using Bayes’ formula.
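The round-by-round updating in the tables can be automated. The following Python sketch (ours, not the guide's) multiplies each urn's current probability by the likelihood of the observed colour and renormalises, exactly as in the tables. Because it avoids the tables' intermediate rounding, its final posteriors differ from them only in about the fourth decimal place.

```python
urns = [[7, 3], [5, 5], [4, 8]]   # [red, white] counts in Urns I, II and III
posterior = [1/3, 1/3, 1/3]       # each urn equally likely at the start
draws = ['R', 'W', 'R', 'R']      # the observed sequence of colours

for colour in draws:
    idx = 0 if colour == 'R' else 1
    # Likelihood of the drawn colour from each urn, given its current contents.
    likelihoods = [urn[idx] / sum(urn) for urn in urns]
    unnormalised = [p * l for p, l in zip(posterior, likelihoods)]
    total = sum(unnormalised)                       # total probability formula
    posterior = [u / total for u in unnormalised]   # Bayes' formula
    for urn in urns:
        urn[idx] -= 1   # sampling is without replacement, whichever urn it was
    print([round(p, 4) for p in posterior])
```

Note that the drawn colour is removed from every urn's count: conditional on urn i having been selected, all draws came from urn i, so urn i's composition is what matters for the next likelihood.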
5.10.3 Independent events (revisited)
The terms ‘dependent’ and ‘independent’ reflect the fact that the probability of an
event is changed when another event is known to occur only if there is some dependence
between the events. If there is such a dependence, then P(A | B) ≠ P(A).
It follows from this that two events, A and B, are independent if and only if:
P (A | B) = P (A).
Recall from the multiplicative law in Section 5.9, that under independence
P (A ∩ B) = P (A) P (B). Substituting this into the conditional probability formula gives
the required result:
P(A | B) = P(A ∩ B)/P(B) = P(A) P(B)/P(B) = P(A)
provided P (B) > 0. Hence if A and B are independent, knowledge of B, i.e. ‘| B’, is of
no value to us in determining the probability of A occurring.
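As an illustration of this definition, the following sketch (our own example, using the two-fair-coins sample space) verifies P(A ∩ B) = P(A) P(B) for two events that are indeed independent:

```python
from fractions import Fraction

# Sample space for two fair coin tosses: four equally likely outcomes.
space = [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]
p = Fraction(1, 4)

A = {s for s in space if s[0] == 'H'}   # first coin shows heads
B = {s for s in space if s[1] == 'H'}   # second coin shows heads

p_A = len(A) * p
p_B = len(B) * p
p_AB = len(A & B) * p

print(p_AB == p_A * p_B)  # True: A and B are independent
```

Here P(A) = P(B) = 1/2 and P(A ∩ B) = 1/4, so knowledge that the second coin shows heads tells us nothing about the first.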
5.11 Probability trees
Probability problems can often be represented in several ways. We have already seen
how Venn diagrams provide a convenient way to visualise the members of various
combinations of events.
Here we introduce a probability tree, also referred to as a tree diagram. This is best
explained by way of an example.
Example 5.12 The London Special Electronics (LSE) company is investigating a
fault in its manufacturing plant. The tests done so far prove that the fault is at one
of three locations: A, B or C. Taking all the tests and other evidence into account,
the company assesses the chances of the fault being at each site as:
Suspect site                              A      B      C
Probability of it being site of fault     0.5    0.2    0.3
The next test they will do is expected to improve the identification of the correct
site, but (like most tests) it is not entirely accurate.
• If the fault is at A, then there is a 70 per cent chance of a correct identification;
i.e. given that A is the site of the fault then the probability that the test says
that it is at A is 0.7, and in this case the probabilities of either possible error
are equal.
• If the fault is at B, then there is a 60 per cent chance of a correct identification,
and in this case too the probabilities of either possible error are equal.
• If the fault is at C, then there is an 80 per cent chance of a correct identification,
and in this case too the probabilities of either possible error are equal.
Draw a probability tree for this problem, and use it to answer the following:
(a) What is the probability that the new test will (rightly or wrongly) identify A as
the site of the fault?
(b) If the new test does identify A as the site of the fault, find:
i. the company’s revised probability that C is the site of the fault
ii. the company’s revised probability that B is not the site of the fault.
Let A, B and C stand for the events: ‘fault is at A’, ‘fault is at B’ and ‘fault is at C’,
respectively. Also, let a, b and c stand for the events: ‘the test says the fault is at A’,
‘the test says the fault is at B’ and ‘the test says the fault is at C’, respectively. The
probability tree is shown in Figure 5.6.
(a) The probability P (a) is the sum of the three values against ‘branches’ which
include the event a, that is:
0.35 + 0.04 + 0.03 = 0.42.
(b)
i. The conditional probability P(C | a) is the value for the C ∩ a branch divided by P(a), that is:

0.03/(0.35 + 0.04 + 0.03) = 1/14 = 0.071.
ii. The conditional probability P(B^c | a) is the sum of the values for the A ∩ a and C ∩ a branches divided by P(a), that is:

(0.35 + 0.03)/0.42 = 0.905.
Alternatively:

P(B^c | a) = 1 − (value of (B ∩ a) branch)/P(a) = 1 − 0.04/0.42 = 0.905.
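The tree's branch probabilities can equally be enumerated in a few lines of Python; this sketch (ours, not part of the guide) reproduces the three answers:

```python
p_site = {'A': 0.5, 'B': 0.2, 'C': 0.3}             # prior fault-site probabilities
p_test = {'A': {'a': 0.7, 'b': 0.15, 'c': 0.15},    # P(test result | true site)
          'B': {'a': 0.2, 'b': 0.6, 'c': 0.2},
          'C': {'a': 0.1, 'b': 0.1, 'c': 0.8}}

# Joint probability of each (true site, test result) branch of the tree.
joint = {(site, result): p_site[site] * p
         for site, results in p_test.items()
         for result, p in results.items()}

p_a = sum(p for (site, result), p in joint.items() if result == 'a')
print(round(p_a, 2))                          # (a)    0.42
print(round(joint[('C', 'a')] / p_a, 3))      # (b)i.  0.071
print(round(1 - joint[('B', 'a')] / p_a, 3))  # (b)ii. 0.905
```

Each key of `joint` is one complete path through the tree, so summing over paths ending in `a` is the total probability formula, and dividing a single path by that sum is Bayes' formula.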
The tree has a first set of branches for the true site of the fault (A, B or C) and a second set for what the test says (a, b or c), with the following conditional and final (joint) probabilities:

True site   P(site)   Test says   P(test | site)   Final probability
A           0.5       a           0.7              0.5 × 0.7 = 0.35
A           0.5       b           0.15             0.5 × 0.15 = 0.075
A           0.5       c           0.15             0.5 × 0.15 = 0.075
B           0.2       a           0.2              0.2 × 0.2 = 0.04
B           0.2       b           0.6              0.2 × 0.6 = 0.12
B           0.2       c           0.2              0.2 × 0.2 = 0.04
C           0.3       a           0.1              0.3 × 0.1 = 0.03
C           0.3       b           0.1              0.3 × 0.1 = 0.03
C           0.3       c           0.8              0.3 × 0.8 = 0.24
Figure 5.6: Probability tree for Example 5.12.
5.12 Summary
This chapter has introduced the idea of probability, and defined the key terms. You
have also seen how Venn diagrams can be used to illustrate probability, and used the
three axioms. This should prepare you for the following chapter.
5.13 Key terms and concepts
Additive law
Axioms
Bayes’ formula
Collectively exhaustive
Conditional
Element
Equally likely
Event
Exhaustive
Experiment
Independence
Member
Multiplicative law
Mutually exclusive
Partition
Probability
Probability tree
Relative frequency
Sample space
Set
Subset
Tree diagram
Venn diagram
5.14 Learning activities
Solutions can be found in Appendix A.
1. When throwing a die, suppose:
S = {1, 2, 3, 4, 5, 6},
E = {3, 4} and
F = {4, 5, 6}.
Determine:
(a) F c
(b) E c ∩ F c
(c) (E ∪ F )c
(d) E c ∩ F .
2. Draw the appropriate Venn diagram to show each of the following in connection
with Question 1:
(a) E ∪ F = {3, 4, 5, 6}
(b) E ∩ F = {4}
(c) E c = {1, 2, 5, 6}.
3. Consider the following information.

                         Delivery time
   Supplier    Early   On time   Late   Total
   Jones       20      20        10     50
   Smith       10      90        50     150
   Robinson    0       10        90     100
   Total       30      120       150    300
What are the probabilities associated with a delivery chosen at random for each of
the following?
(a) A delivery is early.
(b) A delivery is from Smith.
(c) A delivery is from Jones and late.
4. There are three sites a company may move to: A, B and C. We are told that P (A)
(the probability of a move to A) is 1/2, and P (B) = 1/3. What is P (C)?
5. Two events A and B are independent with P (A) = 1/3 and P (B) = 1/4. What is
P (A ∩ B)?
6. Say whether the following statement is true or false and briefly give your reason(s).
‘If two events are independent, then they must be mutually exclusive.’
7. If X can take values of 1, 2 and 4 with P (X = 1) = 0.3, P (X = 2) = 0.5 and
P (X = 4) = 0.2, calculate:
(a) P(X² < 4)
(b) P (X > 2 | X is an even number).
8. Write down and illustrate the use in probability of:
(a) the additive law
(b) the multiplicative rule.
9. A coffee machine may be defective because it dispenses the wrong amount of coffee,
event C, and/or it dispenses the wrong amount of sugar, event S. The probabilities
of these defects are:
P (C) = 0.05,
P (S) = 0.04
and
5
P (C ∩ S) = 0.01.
Determine the proportion of cups of coffee with:
(a) at least one defect
(b) no defects.
10. A company gets 60% of its supplies from manufacturer A, and the remainder from
manufacturer Z. The quality of the parts delivered is given below:
Manufacturer    % Good parts    % Bad parts
A               97              3
Z               93              7
(a) The probabilities of receiving good or bad parts can be represented by a
probability tree. Show, for example, that the probability that a randomly
chosen part comes from A and is bad is 0.018.
(b) Show that the sum of the probabilities of all outcomes is 1.
(c) The way the probability tree is used depends on the information required. For
example, show that the probability tree can be used to show that the
probability of receiving a bad part is 0.028 + 0.018 = 0.046.
11. A company has a security system comprising of four electronic devices (A, B, C
and D) which operate independently. Each device has a probability of 0.1 of failure.
The four electronic devices are arranged such that the whole system operates if at
least one of A or B functions and at least one of C or D functions.
Show that the probability that the whole system functions properly is 0.9801.
(Use set theory and the laws of probability, or a probability tree.)
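The stated answer can also be checked with a short Python sketch of our own: the system works when at least one of A, B works and at least one of C, D works, so P(system works) = (1 − 0.1²)².

```python
from itertools import product

# Each of the four devices fails independently with probability 0.1.
p_fail = 0.1

# Direct calculation: a pair (A, B) works unless both fail, and likewise (C, D).
p_pair_works = 1 - p_fail ** 2
p_system = p_pair_works ** 2

# Brute-force check: enumerate all 16 work/fail patterns and add up the
# probabilities of those in which the whole system operates.
total = 0.0
for a, b, c, d in product([True, False], repeat=4):
    p = 1.0
    for works in (a, b, c, d):
        p *= (1 - p_fail) if works else p_fail
    if (a or b) and (c or d):
        total += p

print(round(p_system, 4), round(total, 4))  # 0.9801 0.9801
```

The enumeration agrees with the closed-form answer, which is reassuring whenever the set-theory argument feels uncertain.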
12. A student can enter a course either as a beginner (73% of all students) or as a
transferring student (27% of all students). It is found that 62% of beginners
eventually graduate, and that 78% of transferring students eventually graduate.
(a) Find the probability that a randomly chosen student:
i. is a beginner who will eventually graduate
ii. will eventually graduate
iii. is either a beginner or will eventually graduate, or both.
(b) Are the events ‘eventually graduates’ and ‘enters as a transferring student’
statistically independent?
(c) If a student eventually graduates, what is the probability that the student
entered as a transferring student?
(d) If two entering students are chosen at random, what is the probability that not
only do they enter in the same way but that they also both graduate or both
fail?
5.15 A reminder of your learning outcomes
After completing this chapter, and having completed the essential reading and
activities, you should be able to:
apply the ideas and notation involved in set theory to simple examples
recall the basic axioms of probability and apply them
distinguish between the ideas of conditional probability and independence
draw and use appropriate Venn diagrams
draw and use appropriate probability trees.
5.16 Sample examination questions
Solutions can be found in Appendix B.
1. If P(B) = 0.05, P(A | B) = 0.70 and P(A | B^c) = 0.30, find P(B | A).
2. An engine encounters a standard environment with a probability of 0.95, and a
severe environment with a probability of 0.05. In a normal environment the
probability of failure is 0.02, whereas in the severe environment this probability is
0.5.
(a) What is the probability of failure?
(b) Given that failure has occurred, what is the probability that the environment
encountered was severe?
3. Suppose there are two boxes; the first box contains three green balls and one red
ball, whereas the second box contains two green balls and two red balls. Suppose a
box is chosen at random and then a ball is drawn randomly from this box.
(a) What is the probability that the ball drawn is green?
(b) If the ball drawn was green, what is the probability that the first box was
chosen?
Chapter 6
The normal distribution and ideas of sampling
6.1 Aims
Now is the time where the ideas of measurement (introduced in Chapter 4) and
probability theory (Chapter 5) come together to form the ideas of sampling. Your aims
should be to:
understand the concept of a random variable and its distribution
work with the normal distribution
see the connection between the mathematical ideas of sampling introduced here
and their uses (introduced in Chapter 3, and covered in detail in Chapter 10).
6.2 Learning outcomes
After completing this chapter, and having completed the essential reading and
activities, you should be able to:
summarise the meaning of E(X) and Var(X)
compute areas under the curve for a normal distribution
state and apply the central limit theorem
explain the relationship between sample size and the standard error of the sample
mean.
6.3 Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060]
Sections 4.1, 4.3, 5.1–5.3, 6.1 and 6.2.
In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials
accessible via the ST104a Statistics 1 area at https://emfss.elearning.london.ac.uk
6.4 Further reading
Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and
Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN
9780471615170] Chapters 4 and 6.
6.5 Introduction
If all of a government’s political and economic decisions were based on figures from
entire populations, think how costly and time-consuming that would be. You could
argue that one would never in fact tackle the problem itself. If we needed to know
exactly how many homeless families there are in a particular country before
implementing policy, or the exact number of companies which employ people on a
part-time basis, and for how many hours, then work might scarcely get off the ground!
Think, for example, of the population of your country and the cost of a complete
census, even if only limited questions are asked about residents.
Fortunately for everyone, statisticians have come up with a way of estimating such
figures using samples from the population. They can also tell us how accurate these
estimates are likely to be for a particular sample size and this is why survey sampling,
rather than taking a full census, is the preferred tool of governments, social researchers
and market research companies. The rest of this chapter develops the ideas you need in
order to use random sampling successfully.
6.6 Random variables
Formally, a random variable is a ‘mapping’ of the elementary outcomes in the sample
space to real numbers. This allows us to attach probabilities to the experimental
outcomes. Hence the concept of a random variable is that of a measurement which takes
a particular value for each possible trial (experiment).
Frequently, this will be a numerical value. For example, suppose we sample five people
and measure their heights, such that ‘height’ is the random variable and the five
(observed) values of this random variable are the realised measurements for the heights
of these five people.
As a further example, suppose a fair die is thrown four times and we observe two 6s, a 3
and a 1. The random variable is the ‘score on the die’, and for these four trials it takes
the values 6, 6, 3 and 1. (In this case, since we do not know the true order in which the
values occurred, we could also say that the results were 1, 6, 3 and 6 or 1, 3, 6 and 6
etc.)
An example of an experiment with non-numerical outcomes would be a coin toss, for
which recall S = {H, T }. We can use a random variable, X, to convert the sample space
elements to real numbers, such as:
X = 1 for heads, and X = 0 for tails.
The values of any measurable variables will typically vary from sample to sample, hence
the name ‘random variable’.
So each experimental random variable has a collection of possible outcomes, and a
numerical value associated with each outcome. We have already encountered the term
‘sample space’ which is the set of all possible (numerical) values of the random variable.
Random variables come in two ‘types’: discrete and continuous. Discrete random
variables are those associated with count data, such as the score on a die, while
continuous random variables refer to measurable data, such as height, weight and time.
Example 6.1 Examples of discrete random variables include:
Experiment                    Random variable          Sample space
Die is thrown                 Value on top face        {1, 2, 3, 4, 5, 6}
Coin is tossed five times     Number of heads          {0, 1, 2, 3, 4, 5}
Twenty people sampled         Number with blue eyes    {0, 1, 2, . . . , 19, 20}
Machine operates for a day    Number of breakdowns     {0, 1, 2, . . .}
Example 6.2 Possible examples of continuous random variables include:
in economics, measuring values of inputs or outputs, workforce productivity or
consumption
in sociology, measuring the proportion of people in a population with a
particular preference
in engineering, measuring the electrical resistance of materials
in physics, measuring temperature, electrical conductivity of materials.
Note the repeated appearance of measuring in the above examples. Remember,
continuous random variables deal with measured data, while discrete random
variables deal with count data.
We typically use a capital letter to denote the random variable. For example, X =
‘score of a die’. The letter X is often adopted, but any other letter could be used, such
as Y , Z etc. In contrast, a lower case letter denotes a particular value of the random
variable. For example, if the die results in a 3, then this is x = 3.
A natural question to ask is ‘what is the probability of any of these values?’. That is, we
are interested in the probability distribution of the random variable. Special
probability distributions are the subject matter of ST104b Statistics 2; however,
there is one very special distribution which will concern us in this half-course – the
normal distribution. Before the normal distribution is introduced though, it is
necessary to introduce two important measures.
6.7 Population mean and variance
‘Mean’ and ‘variance’ were introduced in Chapter 4, although importantly these
referred to the sample versions, i.e. x̄ and s2 , which were calculated based on sample
data. Here, our attention is on the population (or theoretical) mean and variance. These
play a key role in sampling theory which will occupy us for the next few chapters. In
order to work with these measures, we first need to introduce the expectation
operator, denoted ‘E’.
6.7.1 Population mean
Certain important properties of distributions arise if we consider probability-weighted
averages of random variables, and of functions of random variables.¹ For example, we
might want to know the average (or typical value) of a random variable.
It would be wrong to simply take the average of all the values taken by the random
variable, as this would mean that very unlikely values (those with small probabilities of
occurrence) would receive the same weighting as very likely values (those with large
probabilities of occurrence). The obvious approach is to use a probability-weighted
average of the sample space values.
Expected value of a discrete random variable
If x1 , x2 , . . . , xN are the possible values of the random variable X, with corresponding
probabilities p1 , p2 , . . . , pN , then:
E(X) = µ_X = µ = Σ_{i=1}^{N} xi pi = x1 p1 + x2 p2 + · · · + xN pN.
Note that this population mean can be written as E(X) (in words ‘the expectation of
the random variable X’), or µX (in words ‘the population mean of X’). Also, note the
distinction between the sample mean, x̄, (introduced in Chapter 4) based on observed
sample values, and the population mean, µX , based on the theoretical probability
distribution.
Example 6.3 If the ‘random variable’ X happens to be a constant, then
x1 = x2 = · · · = xN = k, and since it is always the case that p1 + p2 + · · · + pN = 1,
trivially E(X) = k.
Example 6.4 Let X represent the value shown when a fair die is thrown once. Its
probability distribution (that is, how the sample space probability of 1 is
distributed across the values the random variable takes) is:
X = x       1     2     3     4     5     6     Total
P(X = x)    1/6   1/6   1/6   1/6   1/6   1/6   1
¹ A function, f(X), of a random variable X is, of course, also a random variable, say Y = f(X).
Note that the sum of the individual probabilities is 1 (a consequence of the second
probability axiom). So the population mean is calculated to be:
E(X) = Σ_{i=1}^{6} xi pi = 1 × 1/6 + 2 × 1/6 + · · · + 6 × 1/6 = 3.5.
Example 6.5 Suppose the random variable X takes the values 3, 6 and 8 with
probabilities 0.2, 0.7 and 0.1, respectively. Hence its probability distribution is:
X = x       3     6     8     Total
P(X = x)    0.2   0.7   0.1   1
The population mean is calculated to be:
E(X) = Σ_{i=1}^{3} xi pi = (3 × 0.2) + (6 × 0.7) + (8 × 0.1) = 5.6.
6.7.2 Population variance
The concept of a probability-weighted average (or expected value) can be extended to
functions of the random variable. For example, if X takes the values x1 , x2 , . . . , xN with
corresponding probabilities p1 , p2 , . . . , pN , then:
E(1/X) = Σ_{i=1}^{N} (1/xi) pi   provided xi ≠ 0

and:

E(X²) = Σ_{i=1}^{N} xi² pi.
One very important average associated with a distribution is the expected value of the
square of the deviation of the random variable from its mean, µ, known as the
population variance, σ 2 .
Variance of a discrete random variable
If x1 , x2 , . . . , xN are the possible values of the random variable X, with corresponding
probabilities p1 , p2 , . . . , pN , then:
σ² = E((X − µ)²) = Σ_{i=1}^{N} (xi − µ)² pi.
This can be seen to be a measure – not the only one, but the most widely-used by far –
of the spread of the distribution and is known as the population variance of the random
variable.
The (positive) square root of the variance is the standard deviation and, given the
variance is typically denoted by σ 2 , the standard deviation is denoted by σ.
Example 6.6 Returning to the fair die example, we now compute the variance of
X as follows, noting that µ = 3.5.
X = x               1       2      3      4      5      6      Total
P(X = x)            1/6     1/6    1/6    1/6    1/6    1/6    1
(x − µ)²            25/4    9/4    1/4    1/4    9/4    25/4
(x − µ)² P(X = x)   25/24   9/24   1/24   1/24   9/24   25/24  70/24 = 2.92

Hence σ² = E((X − µ)²) = 2.92 and σ = √2.92 = 1.71.
The tabular format in Example 6.6 has some advantages.
It helps to have (and to calculate) the ‘Total’ column since, for example, if a
probability, P (X = x), has been miscalculated or miscopied then the row total will
not be 1 (recall the second probability axiom). Therefore, this would highlight an
error so, with a little work, could be identified.
It is often useful to do a group of calculations as fractions over the same
denominator (as in the final row of the table in Example 6.6), rather than to cancel
or to work with them as decimals, because important patterns can be more
obvious, and calculations easier.
It can be shown (in ST104b Statistics 2) that the population variance can be
expressed in an alternative form:
Var(X) = E((X − µ)²) = E(X²) − (E(X))².
In words, ‘the (population) variance is equal to the mean of the square minus the square
of the mean’.
Example 6.7 Using the same fair die example from Example 6.6:
X = x          1      2      3      4      5      6      Total
P(X = x)       1/6    1/6    1/6    1/6    1/6    1/6    1
x P(X = x)     1/6    2/6    3/6    4/6    5/6    6/6    21/6 = 3.5 = µ
x² P(X = x)    1/6    4/6    9/6    16/6   25/6   36/6   91/6
Hence µ = E(X) = 3.5, E(X²) = 91/6, so the variance is 91/6 − (3.5)² = 2.92, as
before. However, this method is usually quicker to use.
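Both routes to the variance are easy to mechanise. A short Python sketch (the function names are ours, not the guide's) computes E(X) and Var(X) = E(X²) − (E(X))² for any discrete distribution given as value/probability pairs:

```python
def mean(dist):
    """E(X) = sum of x * P(X = x) over the value/probability pairs."""
    return sum(x * p for x, p in dist)

def variance(dist):
    """Var(X) = E(X^2) - (E(X))^2, the 'mean of the square minus the
    square of the mean'."""
    mu = mean(dist)
    return sum(x ** 2 * p for x, p in dist) - mu ** 2

die = [(x, 1/6) for x in range(1, 7)]   # fair die
print(round(mean(die), 2), round(variance(die), 2))  # 3.5 2.92
```

The same two functions apply unchanged to any of the probability distributions in this chapter's activities.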
Activity 6.1 A fair die is thrown once. Determine the probability distribution of
the value of the upturned face, X, and find its mean and variance.
Solution
Each face has an equal chance of 1/6 of turning up, so we get the following table:
X = x          1      2      3      4      5      6      Total
P(X = x)       1/6    1/6    1/6    1/6    1/6    1/6    1
x P(X = x)     1/6    2/6    3/6    4/6    5/6    6/6    21/6 = 3.5
x² P(X = x)    1/6    4/6    9/6    16/6   25/6   36/6   91/6
Hence the mean is:

E(X) = Σ x P(X = x) = 1 × 1/6 + · · · + 6 × 1/6 = 3.5.

Since:

E(X²) = Σ x² P(X = x) = 1² × 1/6 + · · · + 6² × 1/6 = 91/6

the variance is:

Var(X) = E(X²) − µ² = 91/6 − (3.5)² = 2.92.
Activity 6.2 Two fair dice are thrown.
(a) Determine the probability distribution of the sum of the two dice, X, and find
its mean and variance.
(b) Determine the probability distribution of the absolute difference of the two dice,
Y , and find its mean and variance.
Solution
(a) The pattern is made clearer by using the same denominator (i.e. 36) below.

X = x         2      3      4      5       6       7       8       9       10      11      12      Total
P(X = x)      1/36   2/36   3/36   4/36    5/36    6/36    5/36    4/36    3/36    2/36    1/36    1
x P(X = x)    2/36   6/36   12/36  20/36   30/36   42/36   40/36   36/36   30/36   22/36   12/36   252/36
x² P(X = x)   4/36   18/36  48/36  100/36  180/36  294/36  320/36  324/36  300/36  242/36  144/36  1974/36

This yields a mean of E(X) = 252/36 = 7, and a variance of E(X²) − µ² = (1974/36) − 7² = 5.83. Although not required, a plot of the distribution is:
[Figure: probability distribution of the sum of two dice — probability plotted against the value of the sum, 2 to 12.]
(b) Again, for clarity, we use the same denominator.

Y = y          0      1       2       3       4       5       Total
P(Y = y)       6/36   10/36   8/36    6/36    4/36    2/36    1
y P(Y = y)     0      10/36   16/36   18/36   16/36   10/36   70/36
y² P(Y = y)    0      10/36   32/36   54/36   64/36   50/36   210/36

This yields a mean of E(Y) = 70/36 = 1.94, while the variance is:

E(Y²) − µ² = 210/36 − (1.94)² = 2.05.

Again, although not required, a plot of the distribution is:

[Figure: probability distribution of the absolute difference of two dice — probability plotted against the absolute value of the difference, 0 to 5.]
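Both distributions can be verified by enumerating the 36 equally likely outcomes. A sketch in Python (ours, kept exact with the `fractions` module) for the sum of the two dice:

```python
from collections import Counter
from fractions import Fraction

p = Fraction(1, 36)
sum_dist = Counter()
for i in range(1, 7):
    for j in range(1, 7):
        sum_dist[i + j] += p     # build P(X = x) for the sum of the two dice

mean = sum(x * q for x, q in sum_dist.items())
var = sum(x ** 2 * q for x, q in sum_dist.items()) - mean ** 2
print(mean, var)  # 7 35/6
```

The exact variance is 35/6, which is the 5.83 (to two decimal places) obtained above; replacing `i + j` with `abs(i - j)` gives the distribution of part (b) in the same way.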
Activity 6.3 The probability function P (X = x) = 0.02x is defined for x = 8, 9,
10, 11 and 12. What are the mean and variance of this probability distribution?
Solution
The probability distribution is:

X = x        8     9    10    11    12
P (X = x)  0.16  0.18  0.20  0.22  0.24

The mean is:

E(X) = Σ x P (X = x) = (8 × 0.16) + (9 × 0.18) + · · · + (12 × 0.24) = 10.2.

To determine the variance, we first find E(X²), given by:

E(X²) = Σ x² P (X = x) = (8² × 0.16) + (9² × 0.18) + · · · + (12² × 0.24) = 106.

Hence:

Var(X) = 106 − (10.2)² = 1.96.
Activity 6.4 In a prize draw, the probabilities of winning various amounts of
money are:
Prize (in £)            0     1    50   100   500
Probability of win   0.35  0.50  0.11  0.03  0.01
What is the expected value and standard deviation of the prize?
Solution
The expected value is:

E(X) = Σ x P (X = x) = (500 × 0.01) + (100 × 0.03) + · · · + (0 × 0.35) = £14.00.

To calculate the standard deviation, we first need E(X²), given by:

E(X²) = Σ x² P (X = x) = ((500)² × 0.01) + ((100)² × 0.03) + · · · + (0² × 0.35) = 3075.50.

So the variance is:

Var(X) = 3075.50 − (14)² = 2879.50

and hence the standard deviation is √2879.50 = £53.66.
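The arithmetic can be checked directly in Python (an illustrative sketch):

```python
import math

# Prize distribution from the activity: value -> probability
prizes = {0: 0.35, 1: 0.50, 50: 0.11, 100: 0.03, 500: 0.01}

ev = sum(x * p for x, p in prizes.items())      # E(X)
ev2 = sum(x**2 * p for x, p in prizes.items())  # E(X^2)
sd = math.sqrt(ev2 - ev**2)                     # standard deviation

print(round(ev, 2), round(sd, 2))  # 14.0 53.66
```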
6.8 The normal distribution
The normal distribution is the most important distribution in statistics, and it is essential to much statistical theory and reasoning. It is, in a sense, the ‘parent’
distribution of all the sampling distributions which we shall meet later.
In order to get some feel for the normal distribution, let us consider the exercise of
constructing a histogram of people’s heights (assumed to be normally distributed).
Suppose we start with 100 people and construct a histogram using sufficient class
intervals such that the histogram gives a good representation of the data’s distribution.
This will be a fairly ‘ragged’ diagram, but useful nonetheless.
Now suppose we increase our sample size to 500 and construct an appropriate
histogram for these observed heights, but using more class intervals now that we have
more data. This diagram will be smoother than the first, peaked in the centre, and
roughly symmetric about the centre. The normal distribution is emerging! If we
continue this exercise to sample sizes of 5,000 or even 50,000, then we will eventually
arrive at a very smooth bell-shaped curve similar to that shown in Figure 6.1. Hence we
can view the normal distribution as the smooth limit of the basic histogram as the
sample size becomes very large.
Such a diagram represents the distribution of the population. It is conventional to
adjust the vertical scale so that the total area under the curve is 1, and so it is easy to
view the area under the curve as probability (recall the second probability axiom). The
mathematical form for this curve is well-known and can be used to compute areas, and
hence probabilities – in due course we shall make use of statistical tables for this
purpose.
Figure 6.1: Density function of the (standard) normal distribution.
6.8.1 Relevance of the normal distribution
The normal distribution is relevant to the application of statistics for many reasons such
as:
Many naturally-occurring phenomena can be modelled as following a normal
distribution. Examples include heights of people, diameters of bolts, weights of
animals etc.2
A very important point is that averages of sampled variables (discussed later),
indeed any functions of sampled variables, also have probability distributions. It
can be demonstrated, theoretically and empirically, that provided the sample size is
reasonably large, the distribution of the sample mean, X̄, will be (approximately)
normal regardless of the distribution of the original variable. This is known as the
central limit theorem (CLT) which we will return to later.
The normal distribution is often used as the distribution of the error term in
standard statistical and econometric models such as linear regression. This
assumption can be, and should be, checked. This is considered further in ST104b
Statistics 2.
6.8.2 Consequences of the central limit theorem
The consequences of the CLT are twofold.
A number of statistical methods which we use have a robustness property, i.e. it
does not matter for their validity just what the true population distribution of the
variable being sampled is.
We are justified in assuming normality for statistics which are sample means, or
linear transformations of them.
The CLT was stated above as holding ‘provided the sample size is reasonably large’. In
practice, 30 or more observations are usually sufficient (and can be used as a
rule-of-thumb), although the distribution of X̄ may be normal for n much less than 30.
This depends on the population distribution of the original variable. If this population
distribution is in fact normal, then all sample means computed from it will be normal.
However, if the population distribution is very non-normal, then a sample size of (at
least) 30 would be needed to justify normality.
² Note the use of the word modelled. This is due to the distributional assumption of normality. Since a normal random variable X is defined over the entire real line, i.e. −∞ < x < ∞, we know a person cannot have a negative height, even though the normal distribution has positive, non-zero probability over negative values. Also, nobody is of infinite height (the world’s tallest man ever, Robert Wadlow, was 272 cm), so clearly there is a finite upper bound to height, rather than ∞. Therefore, height does not follow a true normal distribution, but it is a good enough approximation for modelling purposes.
6.8.3 Characteristics of the normal distribution
The equation which describes the normal distribution takes the general form:

f (x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).
The shape of this function is the bell-shaped curve, such as in Figure 6.1. Do not panic
– you will not be working with this function directly! That said, do be aware that it
involves two parameters: the mean, µ, and the variance, σ 2 .
Since the normal distribution is symmetric about µ, the distribution is centred at
µ. As a consequence of this symmetry, the mean is equal to the median. Also, since
the distribution peaks at µ, it is also equal to the mode. In principle, the mean can
take any real value, i.e. −∞ < µ < ∞.
The variance is σ 2 , hence the larger σ 2 , the larger the spread of the distribution.
Note that variances cannot be negative, hence σ 2 > 0.
If the random variable X has a normal distribution with parameters µ and σ 2 , we
denote this as X ∼ N (µ, σ 2 ), where we read ‘∼’ as ‘is distributed as’. Given the
infinitely many possible values for µ and σ 2 , and given that a normal distribution is
uniquely defined by these two parameters, there is an infinite number of normal
distributions due to the infinite combinations of values of µ and σ 2 .
The most important normal distribution is the special case when µ = 0 and σ 2 = 1. We
call this the standard normal distribution, denoted by Z, i.e. Z ∼ N (0, 1).
Tabulated probabilities which appear in statistical tables are for the standard normal
distribution.
6.8.4 Standard normal tables
We now discuss the determination of normal probabilities using standard statistical
tables. Extracts from the New Cambridge Statistical Tables will be provided in the
examination. Here we focus on Table 4.
Standard normal probabilities
Table 4 of the New Cambridge Statistical Tables lists ‘lower-tail’ probabilities, which
can be represented as:
P (Z ≤ z) = Φ(z) for z ≥ 0
using the conventional Z notation for a standard normal random variable.3
Note the cumulative probability4 for the Z distribution, P (Z ≤ z), is often denoted
Φ(z). We now consider some examples of working out probabilities from Z ∼ N (0, 1).
³ Although Z is the conventional letter used to denote a standard normal random variable, Table 4 uses (somewhat annoyingly) ‘x’ to denote ‘z’.

⁴ A cumulative probability is the probability of being less than or equal to some particular value.
Example 6.8 If Z ∼ N (0, 1), what is P (Z > 1.2)?
When computing probabilities, it is useful to draw a quick sketch to visualise the
specific area of probability which we are after.
So for P (Z > 1.2) we require the upper-tail probability shaded in red in Figure 6.2.
Since Table 4 gives us lower-tail probabilities, if we look up the value 1.2 in the table
we will get P (Z ≤ 1.2) = 0.8849. The total area under a normal curve is 1, so the
required probability is simply 1 − Φ(1.2) = 1 − 0.8849 = 0.1151.
Figure 6.2: Standard normal distribution with shaded area depicting P (Z > 1.2).
Example 6.9 If Z ∼ N (0, 1), what is P (−1.24 < Z < 1.86)?
Again, begin by producing a sketch.
The probability we require is the sum of the blue and red areas in Figure 6.3. Using
Table 4, which note only covers z ≥ 0, we proceed as follows.
The red area is given by:
P (0 ≤ Z ≤ 1.86) = P (Z ≤ 1.86) − P (Z ≤ 0)
= Φ(1.86) − Φ(0)
= 0.9686 − 0.5
= 0.4686.
The blue area is given by:
P (−1.24 ≤ Z ≤ 0) = P (Z ≤ 0) − P (Z ≤ −1.24)
= Φ(0) − Φ(−1.24)
= Φ(0) − (1 − Φ(1.24))
= 0.5 − (1 − 0.8925)
= 0.3925.
Note P (Z ≤ −1.24) = P (Z ≥ 1.24) = 1 − Φ(1.24), by symmetry of Z about µ = 0.
So although Table 4 does not give probabilities for negative z values, we can exploit
the symmetry of the (standard) normal distribution.
Hence P (−1.24 < Z < 1.86) = 0.4686 + 0.3925 = 0.8611.
Alternatively:

P (−1.24 < Z < 1.86) = Φ(1.86) − Φ(−1.24)
                     = Φ(1.86) − (1 − Φ(1.24))
                     = 0.9686 − (1 − 0.8925)
                     = 0.8611.
Figure 6.3: Standard normal distribution depicting P (−1.24 < Z < 1.86) as the shaded areas.
Activity 6.5 If Z ∼ N (0, 1), determine:
(a) P (0 < Z < 1.2)
(b) P (−0.68 < Z < 0)
(c) P (−0.46 < Z < 2.21)
(d) P (0.81 < Z < 1.94)
(e) P (Z < −0.6)
(f) P (Z > −1.28)
(g) P (Z > 2.05).
Solution
All of these probabilities can be obtained from Table 4 of the New Cambridge
Statistical Tables, which note provides P (Z ≤ z) for z ≥ 0.
(a) We have:

P (0 < Z < 1.2) = P (Z < 1.2) − P (Z < 0) = 0.8849 − 0.5 = 0.3849.

(b) We have:

P (−0.68 < Z < 0) = P (Z < 0) − (1 − P (Z < 0.68)) = 0.5 − (1 − 0.7517) = 0.2517.

(c) We have:

P (−0.46 < Z < 2.21) = P (Z < 2.21) − (1 − P (Z < 0.46)) = 0.98645 − (1 − 0.6772) = 0.66365.

(d) We have:

P (0.81 < Z < 1.94) = P (Z < 1.94) − P (Z < 0.81) = 0.9738 − 0.7910 = 0.1828.

(e) We have:

P (Z < −0.6) = 1 − P (Z < 0.6) = 1 − 0.7257 = 0.2743.

(f) We have:

P (Z > −1.28) = P (Z < 1.28) = 0.8997.

(g) We have:

P (Z > 2.05) = 1 − P (Z < 2.05) = 1 − 0.97982 = 0.02018.
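If statistical tables are not to hand, Φ can be evaluated numerically: for the standard normal distribution, Φ(z) = (1 + erf(z/√2))/2, where erf is the error function. A Python sketch checking some of the values above (illustrative only):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal cdf, Phi(z) = P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(1 - Phi(1.2), 4))            # P(Z > 1.2)          -> 0.1151
print(round(Phi(1.86) - Phi(-1.24), 4))  # P(-1.24 < Z < 1.86) -> 0.8611
print(round(Phi(1.28), 4))               # P(Z > -1.28)        -> 0.8997
```

This reproduces the four-decimal-place entries of Table 4 exactly.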
6.8.5 The general normal distribution
We have already discussed that there exists an infinite number of different normal
distributions due to the infinite pairs of parameter values since −∞ < µ < ∞ and
σ 2 > 0. The good news is that Table 4 of the New Cambridge Statistical Tables can be
used to determine probabilities for any normal random variable X, such that
X ∼ N (µ, σ 2 ).
To do so, we need a little bit of magic, known as standardisation. This is a special
transformation which converts X ∼ N (µ, σ 2 ) into Z ∼ N (0, 1).
The transformation formula for standardisation

If X ∼ N (µ, σ²), then the transformation:

Z = (X − µ)/σ

creates a standard normal random variable, i.e. Z ∼ N (0, 1). So to standardise X we subtract its mean and divide by its standard deviation.
Example 6.10 Suppose X ∼ N (5, 4). What is P (5.8 < X < 7.0)?

We have:

P (5.8 < X < 7.0) = P ((5.8 − µ)/σ < (X − µ)/σ < (7.0 − µ)/σ)
                  = P ((5.8 − 5)/√4 < Z < (7.0 − 5)/√4)
                  = P (0.4 < Z < 1)
                  = P (Z ≤ 1) − P (Z ≤ 0.4)
                  = 0.8413 − 0.6554 (from Table 4)
                  = 0.1859.
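The standardisation step generalises to any normal distribution. A small Python sketch (illustrative; the erf-based Φ is restated so the snippet is self-contained):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), by standardising."""
    return Phi((x - mu) / sigma)

# Example 6.10: X ~ N(5, 4), so sigma = 2.
p = normal_cdf(7.0, 5, 2) - normal_cdf(5.8, 5, 2)
print(round(p, 4))  # 0.1859
```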
Activity 6.6 Suppose X ∼ N (10, 4).
(a) Find:
i. P (X > 13.4)
ii. P (8 < X < 9).
(b) Find the value a such that P (10 − a < X < 10 + a) = 0.95.
(c) Find the value b such that P (10 − b < X < 10 + b) = 0.99.
(d) How far above the mean of the standard normal distribution must we go such
that only 1% of the probability remains in the right-hand tail?
(e) How far below the mean of the standard normal distribution must we go such
that only 5% of the probability remains in the left-hand tail?
Solution

Since X ∼ N (10, 4), we use the transformation Z = (X − µ)/σ with the values µ = 10 and σ = √4 = 2.

(a) i. We have:

P (X > 13.4) = P (Z > (13.4 − 10)/2) = P (Z > 1.70) = 1 − P (Z ≤ 1.70) = 1 − 0.9554 = 0.0446.

ii. We have:

P (8 < X < 9) = P ((8 − 10)/2 < Z < (9 − 10)/2)
             = P (−1 < Z < −0.5)
             = P (Z < −0.5) − P (Z < −1)
             = (1 − P (Z ≤ 0.5)) − (1 − P (Z ≤ 1))
             = (1 − 0.6915) − (1 − 0.8413)
             = 0.1498.

(b) We want to find the value a such that P (10 − a < X < 10 + a) = 0.95, that is:

0.95 = P (((10 − a) − 10)/2 < Z < ((10 + a) − 10)/2)
     = P (−a/2 < Z < a/2)
     = P (Z < a/2) − P (Z < −a/2)
     = 1 − 2 × P (Z < −a/2).

This is the same as 2 × P (Z > a/2) = 0.05, i.e. P (Z > a/2) = 0.025, or P (Z ≤ a/2) = 0.975. Therefore, from Table 4 of the New Cambridge Statistical Tables, a/2 = 1.96, and so a = 3.92.

(c) We want to find the value b such that P (10 − b < X < 10 + b) = 0.99. Similar reasoning shows that P (Z ≤ b/2) = 0.995. Therefore, from Table 4, b/2 = 2.58 (approximately), so that b = 5.16.

(d) We want k such that P (Z > k) = 0.01, or P (Z ≤ k) = 0.99. From Table 4, k = 2.33 (approximately).

(e) We want x such that P (Z < x) = 0.05. This means that x < 0 and P (Z > |x|) = 0.05, or P (Z ≤ |x|) = 0.95, so, from Table 4, |x| = 1.645 (by interpolating between Φ(1.64) and Φ(1.65)) and hence x = −1.645.
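Parts (b) to (e) involve the inverse problem: given a probability, find the corresponding z-value. Without tables this can be done by bisection on Φ, since Φ is increasing. An illustrative Python sketch:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def z_quantile(p):
    """Find z with Phi(z) = p by bisection on [-10, 10]."""
    lo, hi = -10.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(z_quantile(0.975), 2))  # 1.96  (part (b): a = 2 x 1.96 = 3.92)
print(round(z_quantile(0.995), 2))  # 2.58  (part (c))
print(round(z_quantile(0.99), 2))   # 2.33  (part (d))
print(round(z_quantile(0.95), 3))   # 1.645 (part (e), with a sign change)
```

Sixty bisection steps shrink the bracketing interval far below table precision.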
Activity 6.7 Your company requires a special type of light bulb which is available
from only two suppliers. Supplier A’s light bulbs have a mean lifetime of 2,000 hours
with a standard deviation of 180 hours. Supplier B’s light bulbs have a mean lifetime
of 1,850 hours with a standard deviation of 100 hours. The distribution of the
lifetimes of each type of light bulb is normal. Your company requires that the
lifetime of a light bulb be not less than 1,500 hours. All other things being equal,
which type of bulb should you buy, and why?
Solution
Let A and B be the random variables representing the lifetimes (in hours) of light
bulbs from supplier A and supplier B, respectively. We are told that:
A ∼ N (2000, (180)2 ) and B ∼ N (1850, (100)2 ).
Since the relevant criterion is that light bulbs last at least 1,500 hours, the company
should choose the supplier whose light bulbs have a greater probability of doing so.
We find that:

P (A > 1500) = P (Z > (1500 − 2000)/180) = P (Z > −2.78) = P (Z < 2.78) = 0.99728

and:

P (B > 1500) = P (Z > (1500 − 1850)/100) = P (Z > −3.50) = P (Z < 3.50) = 0.9998.
Therefore, the company should buy light bulbs from supplier B, since they have a
greater probability of lasting the required time.
Note it is good practice to define notation and any units of measurement and to
state the distributions of the random variables. Note also that here it is not essential
to compute the probability values in order to determine what the company should
do, since −2.78 > −3.50 implies that P (Z > −2.78) < P (Z > −3.50).
Activity 6.8 A normal distribution has a mean of 40. If 10% of the distribution falls
between the values of 50 and 60, what is the standard deviation of the distribution?
Solution
Let X ∼ N (40, σ²). We seek σ, and know that:

P (50 ≤ X ≤ 60) = P ((50 − 40)/σ ≤ Z ≤ (60 − 40)/σ) = P (10/σ ≤ Z ≤ 20/σ) = 0.10.

Hence we know that one z-value (i.e. 20/σ) is twice the other (i.e. 10/σ), and their corresponding probabilities differ by 0.10. We also know that the z-values are positive since both represent numbers larger than the mean of 40. Now we need to use Table 4 of the New Cambridge Statistical Tables to find two such z-values. Looking at Table 4 we find the values to be, approximately, 1.25 and 2.50. Therefore, σ = 8.
Activity 6.9 The life, in hours, of a light bulb is normally distributed with a mean
of 200 hours. If a consumer requires at least 90% of the light bulbs to have lives
exceeding 150 hours, what is the largest value that the standard deviation can have?
Solution
Let X be the random variable representing the lifetime of a light bulb (in hours), so that for some value σ we have X ∼ N (200, σ²). We want P (X > 150) = 0.90, such that:

P (X > 150) = P (Z > (150 − 200)/σ) = P (Z > −50/σ) = 0.90.

Note that this is the same as P (Z < 50/σ) = 0.90, so 50/σ = 1.28, giving σ = 39.06.
Activity 6.10 The random variable X has a normal distribution with mean µ and
variance σ 2 , i.e. X ∼ N (µ, σ 2 ). It is known that:
P (X ≤ 66) = 0.0359 and P (X ≥ 81) = 0.1151.
(a) Produce a clearly-labelled sketch to represent these probabilities on a normal
curve.
(b) Show that the value of σ = 5.
(c) Calculate P (69 ≤ X ≤ 83).
Solution
(a) The sketch below shows the probabilities with P (X ≤ 66) shaded blue and
P (X ≥ 81) shaded red.
(b) We have X ∼ N (µ, σ²), where µ and σ² are unknown. Using Table 4 of the New Cambridge Statistical Tables, we find that P (Z ≤ −1.80) = 0.0359 and P (Z ≥ 1.20) = 0.1151. Therefore:

P (X ≤ 66) = P (Z ≤ (66 − µ)/σ) = P (Z ≤ −1.80) = 0.0359

and:

P (X ≥ 81) = P (Z ≥ (81 − µ)/σ) = P (Z ≥ 1.20) = 0.1151.

Hence:

(66 − µ)/σ = −1.80 and (81 − µ)/σ = 1.20.

Rearranging, we obtain a pair of simultaneous equations which can be solved to find µ and σ. Specifically:

66 − µ = −1.80σ and 81 − µ = 1.20σ.

Subtracting the first from the second gives 15 = 3σ, and hence σ = 5. For completeness, 81 − µ = 6, so µ = 75. Therefore, X ∼ N (75, 5²).
(c) Given X ∼ N (75, 5²), we have:

P (69 ≤ X ≤ 83) = P ((69 − 75)/5 ≤ Z ≤ (83 − 75)/5) = P (−1.20 ≤ Z ≤ 1.60)
               = Φ(1.60) − (1 − Φ(1.20))
               = 0.9452 − (1 − 0.8849)
               = 0.8301.
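The simultaneous equations can be checked numerically (values taken from the working above; illustrative only):

```python
# From part (b): 66 - mu = -1.80 * sigma and 81 - mu = 1.20 * sigma.
# Subtracting the first equation from the second: 15 = 3.00 * sigma.
sigma = (81 - 66) / (1.20 + 1.80)
mu = 81 - 1.20 * sigma

print(mu, sigma)  # 75.0 5.0
```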
Activity 6.11 The number of newspapers sold daily at a kiosk is normally
distributed with a mean of 350 and a standard deviation of 30.
(a) Find the probability that fewer than 300 newspapers are sold on a particular
day.
(b) How many newspapers should the newsagent stock each day such that the
probability of running out on any particular day is 10%?
Solution
Let X be the number of newspapers sold, hence X ∼ N (350, 900).
(a) We have:

P (X < 300) = P (Z < (300 − 350)/30) = P (Z < −1.67) = 1 − P (Z < 1.67) = 1 − 0.9525 = 0.0475.
(b) Let s be the required stock, then we require P (X > s) = 0.10. Hence:

P (Z > (s − 350)/30) = 0.10  ⇒  (s − 350)/30 ≥ 1.28  ⇒  s ≥ 350 + 1.28 × 30 = 388.4.

Rounding up, the required stock is 389.
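Part (b) is a quantile calculation; both parts can be reproduced numerically (an illustrative sketch – the exact answer to (a) differs slightly from the table-based 0.0475 because z = −50/30 is rounded to −1.67 when using Table 4):

```python
from math import ceil, erf, sqrt

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def z_quantile(p):
    """Inverse of Phi by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

mu, sigma = 350, 30

# (a) P(X < 300), without rounding z to two decimal places
print(round(Phi((300 - mu) / sigma), 4))  # 0.0478

# (b) stock s with P(X > s) = 0.10, i.e. s = mu + z_0.90 * sigma, rounded up
print(ceil(mu + z_quantile(0.90) * sigma))  # 389
```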
Activity 6.12 Consider the following set of data. Does it appear to approximately
follow a normal distribution? Justify your answer.
45 48 49 47 46 39 31 54 46 46 35 48
37 52 62 40 36 51 55 55 38 61 59 52
54 52 45 50 50 43 56 51 48 58 48 45
Solution
To see whether this set of data approximates a normal distribution, we need to
analyse it. Using a calculator we calculate the mean to be µ = 48.1 and the standard
deviation to be σ = 7.3 (assuming population data).
For ± one standard deviation, i.e. µ ± σ, the interval is (40.8, 55.4) which contains
24 observations, representing 24/36 = 67% of the data.
For ± two standard deviations, i.e. µ ± 2σ, the interval is (33.5, 62.7) which contains 35 observations, representing 35/36 = 97% of the data.
These percentages match very closely to what we expect for a normal distribution.
We could also construct a histogram of the data, as shown below. This appears to
confirm that the data do indeed approximate a normal distribution.
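The summary figures quoted above can be reproduced with Python's statistics module (treating the 36 values as population data, as the solution does; illustrative only):

```python
import statistics as st

data = [45, 48, 49, 47, 46, 39, 31, 54, 46, 46, 35, 48,
        37, 52, 62, 40, 36, 51, 55, 55, 38, 61, 59, 52,
        54, 52, 45, 50, 50, 43, 56, 51, 48, 58, 48, 45]

mu = st.fmean(data)
sigma = st.pstdev(data)  # population standard deviation

# Counts of observations within one and two standard deviations of the mean
within1 = sum(mu - sigma < x < mu + sigma for x in data)
within2 = sum(mu - 2 * sigma < x < mu + 2 * sigma for x in data)

print(round(mu, 1), round(sigma, 1))  # 48.1 7.3
print(within1, within2)               # 24 35
```

The counts 24/36 ≈ 67% and 35/36 ≈ 97% are close to the 68% and 95% expected under normality.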
6.9 Sampling distributions
In Chapter 3, ‘Statistics’ was introduced as a discipline for data analysis. In this chapter
we have encountered the normal probability distribution. Probability distributions
typically have associated parameters, such as the (theoretical) mean, µ, and the
(theoretical) variance, σ 2 , for the normal distribution. By convention, Greek letters are
used to denote population parameters, whose values in practice are typically unknown.
The next few chapters of this half-course will be the subject of statistical inference,
whereby we infer unknown population parameters based on sample data. As an
example, suppose we wanted to investigate the average height of the UK population. As
previously discussed, it is reasonable to assume that height is a normally-distributed
random variable with a mean µ and a variance σ 2 . What are the exact values of these
parameters? To know these values precisely would require data on the heights of the
entire UK population – all 60-plus million people!
Population sizes, denoted by N , are typically very large and clearly no-one has the
time, money or patience to undertake such a marathon data collection exercise. Instead
we opt to collect a sample (some subset of the population) of size n.5 Having collected
our sample, we then estimate the unknown population parameters based on the known
(i.e. observed) sample data. Specifically, we estimate population quantities based on
their respective sample counterparts.
A ‘statistic’ (singular noun) is simply a known function of data. A sample statistic is
calculated from sample data. At this point, be aware of the following distinction
between an estimator and an estimate.
Estimator versus estimate
An estimator is a statistic (which is a random variable) describing how to obtain
an estimate (which is a real number) of a population parameter.
Example 6.11 The sample mean X̄ = Σ Xᵢ /n (summing over i = 1, . . . , n) is the estimator of the population mean, µ. For example, if we had drawn at random from the population the sample data 4, 8, 2 and 6, then the estimate of µ would be x̄ = 20/4 = 5. Notice the notation – the estimator is written as X̄ and is a random variable, while the estimate is written as x̄ as it is computed for a specific sample, hence is fixed for that particular sample. Had the random sample been instead 2, 9, 1 and 4 (from the same population), then the estimator of µ would still be X̄ = Σ Xᵢ /n, but the estimate would now be x̄ = 16/4 = 4.
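In code, the estimator/estimate distinction is the distinction between a function and its value on particular data. An illustrative sketch:

```python
from statistics import fmean

# The estimator: a rule (here, the arithmetic mean) applicable to any sample.
estimator = fmean

# Two different samples give two different estimates from the same estimator.
print(estimator([4, 8, 2, 6]))  # 5.0
print(estimator([2, 9, 1, 4]))  # 4.0
```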
So we see that sample statistics vary from sample to sample due to the random nature
of the sample. Hence estimators are random variables with corresponding probability
distributions, called sampling distributions.
⁵ If n = N, and we sample without replacement, then we have obtained a complete enumeration of the population, i.e. a census.
Sampling distribution
A sampling distribution is the probability distribution of an estimator.
Before we proceed further, let us take a moment to review some population quantities
and their respective sample counterparts.
Population quantity                        Sample counterpart
Probability distribution                   Histogram
(Population) mean, µ                       (Sample) mean, x̄
(Population) variance, σ²                  (Sample) variance, s²
(Population) standard deviation, σ         (Sample) standard deviation, s
(Population) proportion, π                 (Sample) proportion, p
The precision (or quality) of point estimates such as x̄, s2 and p will depend on the
sample size n, and in principle on the population size N , if finite. In practice if N is
very large relative to n, then we can use approximations which are more than adequate
for practical purposes, but would only be completely accurate if the population truly
was infinite. In what follows we assume N is large enough to be treated as infinite.
Activity 6.13 Consider the population below with N = 4 elements:
A    B    C    D
3    6    9   12
(a) Calculate the population mean and variance.
(b) Write down the sampling distribution of the sample mean for samples of size
n = 2 drawn without replacement where order does not matter.
(c) Using the result in (b), calculate the mean of the sampling distribution.
(d) Using the result in (b), calculate the variance of the sampling distribution.
Solution
(a) The population mean is:

µ = Σ xᵢ /N = (3 + 6 + 9 + 12)/4 = 7.5.

The population variance is:

σ² = Σ xᵢ² /N − µ² = (9 + 36 + 81 + 144)/4 − (7.5)² = 11.25.
(b) The sampling distribution of the sample mean for samples of size n = 2 is:

Sample    Values    X̄ = x̄    P (X̄ = x̄)
AB        3, 6       4.5        1/6
AC        3, 9       6          1/6
AD        3, 12      7.5        1/6
BC        6, 9       7.5        1/6
BD        6, 12      9          1/6
CD        9, 12     10.5        1/6
(c) The mean of the sampling distribution is:

E(X̄) = (4.5 + 6 + 7.5 + 7.5 + 9 + 10.5)/6 = 45/6 = 7.5.
(d) The variance of the sampling distribution is:

Var(X̄) = ((4.5)² + (6)² + (7.5)² + (7.5)² + (9)² + (10.5)²)/6 − (7.5)² = 3.75.
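Parts (b) to (d) can be verified by enumerating all samples of size n = 2 drawn without replacement (order not mattering). An illustrative Python sketch:

```python
from itertools import combinations
from statistics import fmean, pvariance

population = [3, 6, 9, 12]

# All 6 equally likely samples of size 2, and their sample means
sample_means = [fmean(s) for s in combinations(population, 2)]

print(sorted(sample_means))     # [4.5, 6.0, 7.5, 7.5, 9.0, 10.5]
print(fmean(sample_means))      # 7.5   -> equals the population mean
print(pvariance(sample_means))  # 3.75  -> the variance of the sampling distribution
```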
Activity 6.14 A random variable, X, can take the values 1, 2 and 3, each with
equal probability. List all possible samples of size two which may be chosen when
order matters, without replacement, from this population, and hence construct the
sampling distribution of the sample mean, X̄.
Solution
Each possible sample has an equal probability of occurrence of 1/6. The sampling
distribution of X̄ is:
Possible samples
(1, 2)
(2, 1)
(1, 3)
(3, 1)
(2, 3)
(3, 2)
Sample mean, X̄ = x̄
1.5
2
2.5
Relative frequency, P (X̄ = x̄)
1/3
1/3
1/3
Activity 6.15 Discuss the differences or similarities between a sampling
distribution of size 5 and a single (simple) random sample of size 5.
Solution
If 5 members are selected from a population such that every possible set of 5
population members has the same probability of being selected, then the sample is a
simple random sample. In a sampling distribution of size 5, every possible sample of
size 5 from the population is averaged and the result is the sampling distribution.
The similarity is the inclusive nature of both the simple random sample as well as
the sampling distribution.
Activity 6.16 A perfectly-machined regular tetrahedral (pyramid-shaped) die has
four faces labelled 1 to 4. It is tossed twice onto a level surface and after each toss
the number on the face which is downward is recorded. If the recorded values are x1
and x2 , when order matters, then the observed sample mean is x̄ = (x1 + x2 )/2.
Write out the sampling distribution of the sample mean as a random quantity over
repeated double tosses.
Solution
Each possible sample has an equal probability of occurrence of 1/16. The sampling distribution of X̄ is:

Possible samples                  Sample mean, X̄ = x̄    Relative frequency, P (X̄ = x̄)
(1, 1)                                    1                        1/16
(1, 2), (2, 1)                            1.5                      1/8
(1, 3), (3, 1), (2, 2)                    2                        3/16
(1, 4), (4, 1), (2, 3), (3, 2)            2.5                      1/4
(2, 4), (4, 2), (3, 3)                    3                        3/16
(3, 4), (4, 3)                            3.5                      1/8
(4, 4)                                    4                        1/16
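The distribution can be confirmed by counting over the 16 equally likely ordered pairs (an illustrative sketch):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

tosses = product([1, 2, 3, 4], repeat=2)        # 16 equally likely ordered pairs
counts = Counter((a + b) / 2 for a, b in tosses)

# Sampling distribution of the sample mean, as exact probabilities
dist = {m: Fraction(c, 16) for m, c in sorted(counts.items())}
for mean, prob in dist.items():
    print(mean, prob)
```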
6.10 Sampling distribution of X̄
Let us consider a large population (large enough to be treated as if it was infinite)
which is normally distributed, i.e. N (µ, σ 2 ). Suppose we take an initial random sample
of size n and obtain the sample mean, x̄1 . Next we take a second, independent sample
(from the same initial population) and compute x̄2 . Continue taking new, independent
samples, resulting in a series of sample means: x̄1 , x̄2 , x̄3 , . . ..
These values will, of course, vary from sample to sample, hence if we constructed a
histogram of these values we would see the empirical sampling distribution of X̄.
Fortunately, however, we are often able to determine the exact, theoretical form of
sampling distributions without having to resort to such empirical simulations.
Sampling distribution of X̄ for normal populations
When taking a random sample of size n from a N (µ, σ 2 ) population:
X̄ ∼ N (µ, σ²/n).
The central limit theorem concerns the sampling distribution of X̄ when sampling from
non-normal distributions.
Central limit theorem
When taking a random sample of size n from a non-normal population with finite
mean µ and finite variance σ 2 , then:
X̄ ∼ N (µ, σ²/n)

approximately, as n → ∞.
So the difference between sampling from normal and non-normal populations is that X̄
is exactly normally distributed in the former case, but only approximately in the latter
case. The approximation is reasonable when n is at least 30, as a rule-of-thumb.
Since this is an asymptotic approximation (i.e. as n → ∞), the larger n is, the better the normal approximation.
We can use standardisation to compute probabilities involving X̄, but we must remember to divide by σ/√n rather than σ, since the variance of X̄ is σ²/n. Note that σ/√n is known as the standard error of X̄.
Standard error
The standard deviation of an estimator is called its standard error.
For example, X̄ ∼ N (µ, σ²/n) (either exactly, or approximately by the CLT), so Var(X̄) = σ²/n, hence the standard deviation of X̄, i.e. the standard error, is:

√Var(X̄) = √(σ²/n) = σ/√n.

Note the standard error decreases as n increases.
Example 6.12 A random sample of 16 values is taken from a normal distribution
with µ = 25 and σ = 3. What is P (X̄ > 26)?
Since the population is normally distributed, the sampling distribution of X̄ is
exactly N (µ, σ 2 /n) = N (25, 9/16). If Z is the standard normal random variable,
then:

P (X̄ > 26) = P (Z > (26 − 25)/(3/√16)) = P (Z > 1.33) = 1 − Φ(1.33) = 1 − 0.9082 = 0.0918

using Table 4 of the New Cambridge Statistical Tables.
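The CLT can also be illustrated by simulation: draw many samples from a decidedly non-normal population and inspect the behaviour of the sample means. The sketch below (illustrative only) uses an exponential population with mean 1 and variance 1, so with n = 30 the standard error should be close to 1/√30 ≈ 0.183:

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(1)  # reproducible simulation

n, reps = 30, 10_000
sample_means = [fmean(random.expovariate(1.0) for _ in range(n))
                for _ in range(reps)]

# CLT: X-bar is approximately N(mu, sigma^2/n) = N(1, 1/30)
print(round(fmean(sample_means), 2))   # close to 1.0
print(round(pstdev(sample_means), 2))  # close to 1/sqrt(30) = 0.18
```

A histogram of `sample_means` would also be visibly bell-shaped, despite the skewness of the underlying population.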
Activity 6.17 The weights of a large group of animals have mean 8.2 kg and
standard deviation 2.2 kg. What is the probability that a random selection of 80
animals from the group will have a mean weight between 8.3 kg and 8.4 kg? State
any assumptions you make.
Solution
We are not told that the population is normal, but n is ‘large’ so we can apply the central limit theorem. The sampling distribution of X̄ is, approximately:

X̄ ∼ N (µ, σ²/n) = N (8.2, (2.2)²/80).

Hence, using Table 4:

P (8.3 ≤ X̄ ≤ 8.4) = P ((8.3 − 8.2)/(2.2/√80) ≤ Z ≤ (8.4 − 8.2)/(2.2/√80))
                  = P (0.41 ≤ Z ≤ 0.81)
                  = P (Z ≤ 0.81) − P (Z ≤ 0.41)
                  = 0.7910 − 0.6591
                  = 0.1319.
Activity 6.18 A random sample of 25 audits is to be taken from a company’s total
audits, and the average value of these audits is to be calculated.
(a) Explain what is meant by the sampling distribution of this average and discuss
its relationship to the population mean.
(b) Is it reasonable to assume that this sampling distribution is normally
distributed?
(c) If the population of all audits has a mean of £54 and a standard deviation of
£10, find the probability that:
i. the sample mean will be greater than £60
ii. the sample mean will be within 5% of the population mean.
Solution
(a) The sample average (mean) is formed from 25 observations which are subject to
sampling variability, hence the average is also subject to this variability. Its
sampling distribution describes its probability properties. If a large number of
such averages were independently sampled, then their histogram would be the
(approximate) sampling distribution.
(b) It is reasonable to assume that this sampling distribution is normal due to the
central limit theorem (CLT) – although the sample size is rather small.
(c) If n = 25, µ = 54 and σ = 10, then the CLT says that, approximately:

X̄ ∼ N (µ, σ²/n) = N (54, 100/25).

i. We have:

P (X̄ > 60) = P (Z > (60 − 54)/√(100/25)) = P (Z > 3) = 0.00135.

ii. We are asked for:

P (0.95 × 54 < X̄ < 1.05 × 54) = P (−0.05 × 54/2 < Z < 0.05 × 54/2)
                              = P (−1.35 < Z < 1.35)
                              = P (Z < 1.35) − (1 − P (Z < 1.35))
                              = 0.9115 − (1 − 0.9115)
                              = 0.8230.
Activity 6.19 The distribution of salaries of lecturers in a university is positively
skewed, with most lecturers earning near the minimum of the pay scale. What would
a sampling distribution of the sample mean of size 2 look like? How about size 5?
How about size 50?
Solution
Here we are given a non-normal (skewed) distribution. In order that the sampling
distribution of the sample mean is (approximately) normal, a ‘large’ sample size is
required, according to the central limit theorem. Sampling distributions based on
‘small’ sample sizes tend to resemble the population from which they are drawn. So,
for samples of size 2 and 5, the sampling distributions would probably be positively
skewed. However, the sample of size 50 would be approximately normal. Regardless,
in all cases the sampling distribution would have a mean equal to the population
mean, µ, and a variance of σ²/n.
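As a rough check on this intuition, the sketch below simulates sample means from a hypothetical positively-skewed salary distribution (the minimum salary and scale are invented purely for illustration) and estimates the skewness of each sampling distribution:

```python
import random
from statistics import mean, pstdev

random.seed(42)

def draw_salary():
    # hypothetical positively-skewed salary distribution:
    # a minimum of 20,000 plus an exponential tail with mean 10,000
    return 20000 + random.expovariate(1 / 10000)

def skewness(xs):
    """Sample skewness coefficient: mean of standardised cubes."""
    m, s = mean(xs), pstdev(xs)
    return mean(((x - m) / s) ** 3 for x in xs)

skews = {}
for n in (2, 5, 50):
    sample_means = [mean(draw_salary() for _ in range(n)) for _ in range(5000)]
    skews[n] = skewness(sample_means)
    print(n, round(mean(sample_means)), round(skews[n], 2))
```

The skewness shrinks as n grows, while the mean of the sample means stays near the population mean in every case, just as the argument above predicts.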
6.11 Summary
This chapter covered the key points relating to the normal distribution and the central
limit theorem. You should now be ready to embark on Chapters 7, 8 and 9, and work on
the ideas of statistical estimation and inference. Do not worry if you found some later
sections of this chapter difficult to understand. Work at the learning activities and the
sample examination questions below.
6.12 Key terms and concepts

Central limit theorem
Expectation operator
Normal distribution
Population mean
Population variance
Probability distribution
Random variable
Sampling distribution
Standard error
Standardisation
Statistical inference
Statistical tables

6.13 Learning activities
Solutions can be found in Appendix A.
1. Using Table 4 of the New Cambridge Statistical Tables, calculate:
(a) P (Z ≤ 1)
(b) P (Z ≥ 1)
where Z ∼ N (0, 1).
2. Check that (approximately):
(a) 68% of normal random variables fall within 1 standard deviation of the mean.
(b) 95% of normal random variables fall within 2 standard deviations of the mean.
(c) 99% of normal random variables fall within 3 standard deviations of the mean.
Draw the areas concerned on a normal distribution.
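If you want to check these coverages numerically rather than from the tables, the standard normal CDF is available through the error function in Python's standard library:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

for k in (1, 2, 3):
    coverage = phi(k) - phi(-k)     # P(mu - k*sigma < X < mu + k*sigma)
    print(f"within {k} standard deviation(s): {coverage:.4f}")
```

The exact values are 0.6827, 0.9545 and 0.9973, consistent with the approximate 68%, 95% and 99% figures above.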
3. Given a normal distribution with a mean of 20 and a variance of 4, what
proportion of the distribution would be:
(a) above 22
(b) between 14 and 16?
4. In an examination, the scores of students who attend schools of type A are
normally distributed with a mean of 50 and a standard deviation of 5. The scores
of students who attend schools of type B are normally distributed with a mean of
55 and a standard deviation of 6. Which type of school would have a higher
proportion of students with marks below 45?
5. The manufacturer of a new brand of lithium battery claims that the mean life of a
battery is 3,800 hours with a standard deviation of 250 hours. Assume the lifespans
of lithium batteries follow a normal distribution.
(a) What percentage of batteries will last for more than 3,500 hours?
(b) What percentage of batteries will last for more than 4,000 hours?
(c) If 700 batteries are supplied, how many should last between 3,500 and 4,000
hours?
6. The following six observations give the time taken, in seconds, to complete a
100-metre sprint by all six individuals competing in a race. Note this is population
data.
Individual    A    B    C    D    E    F
Time         15   14   10   12   20   15
(a) Find the population mean, µ, and the population standard deviation, σ, of the
sprint times.
(b) Calculate the sample mean for each possible sample of:
i. two individuals
ii. three individuals
iii. four individuals.
(c) Work out the mean for each set of sample means (it must come to µ) and
compare the standard deviations of the sample means about µ.
This may take some time, but, after you have done it, you should have a
clearer idea about sampling distributions!
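Rather than working through every sample by hand, the enumeration can be sketched with `itertools.combinations`; only standard library functions are used:

```python
from itertools import combinations
from statistics import mean, pstdev

times = [15, 14, 10, 12, 20, 15]        # individuals A-F
mu, sigma = mean(times), pstdev(times)  # population mean and sd
print("population:", round(mu, 3), round(sigma, 3))

results = {}
for n in (2, 3, 4):
    # all possible samples of size n, without replacement
    xbars = [mean(c) for c in combinations(times, n)]
    results[n] = (mean(xbars), pstdev(xbars))
    print(n, "samples:", len(xbars),
          "mean of means:", round(results[n][0], 3),
          "sd of means:", round(results[n][1], 3))
```

The mean of each set of sample means equals µ exactly, and the standard deviation of the sample means about µ shrinks as the sample size grows.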
6.14 A reminder of your learning outcomes
After completing this chapter, and having completed the essential reading and
activities, you should be able to:
summarise the meaning of E(X) and Var(X)
compute areas under the curve for a normal distribution
state and apply the central limit theorem
explain the relationship between sample size and the standard error of the sample
mean.
6.15 Sample examination questions
Solutions can be found in Appendix B.
1. The random variable X takes the values −1, 1 and 2 according to the following
probability distribution:
x         −1      1     2
pX(x)    0.20     k    4k
(a) Determine the constant k and, hence, write down the probability distribution
of X.
(b) Find E(X) (the expected value of X).
(c) Find Var(X) (the variance of X).
2. The scores on a verbal reasoning test are normally distributed with a population
mean of µ = 100 and a population standard deviation of σ = 10.
(a) What is the probability that a randomly chosen person scores at least 105?
(b) A simple random sample of size n = 20 is selected. What is the probability
that the sample mean will be between 97 and 104? (You may use the nearest
values provided in the statistical tables.)
3. The weights of a large population of animals have a mean of 7.3 kg and a standard
deviation of 1.9 kg.
(a) Assuming that the weights are normally distributed, what is the probability
that a random selection of 40 animals from this population will have a mean
weight between 7.0 kg and 7.4 kg?
(b) A researcher stated that the probability you calculated is approximately
correct even if the distribution of the weights is not normal. Do you agree?
Justify your answer.
Chapter 7
Point and interval estimation
7.1 Aims
This chapter develops the ideas of sampling introduced in Chapter 6 and looks at the
actual estimation of population parameters using both normal and Student’s t
distributions. Your aims are to:
estimate values of common parameters such as means and proportions
decide which is the appropriate distribution to use – normal or Student’s t.
7.2 Learning outcomes
After completing this chapter, and having completed the essential reading and
activities, you should be able to:
calculate sample means, standard deviations and proportions, and demonstrate
their use as estimates
construct a confidence interval for a mean, a proportion, the difference between
means and the difference between proportions
know when to use Student’s t distribution.
You do not need to:
demonstrate the central limit theorem
know about the concepts relating to a ‘good’ estimator.
7.3 Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060]
Sections 7.1–7.3 and 8.1–8.4.
In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials
accessible via the ST104a Statistics 1 area at https://emfss.elearning.london.ac.uk
7.4 Further reading
Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and
Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN
9780471615170] Chapter 7 and Sections 8.1–8.3 and 8.5.
7.5 Introduction
This chapter is concerned with evidence-based decision-making. It is about making a
decision with respect to a population. The population is made up of a set of individual
items known as elements. This could be, for example, a set of individuals or companies
which constitute the market for your product, or the items being manufactured from a
production line.
The sort of information needed for a decision may be a mean value (for example, ‘how
many items does an individual purchase per year, on average?’) or a proportion (‘what
proportion of items manufactured has a fault?’). The associated decision may range
from setting up extra capacity to cope with expected demand, to stopping the
production line for re-adjustment.
In most cases it is impossible to gather information about the whole population (due to
time and budget constraints), so instead one collects information about a sample drawn
from the population and infers the required information about the population. In
ST104a Statistics 1, we will look at the most commonly-used estimators. If you take
ST104b Statistics 2, you will learn what the properties of a ‘good’ estimator are and
look at other measures to be estimated other than means and proportions.
In order to carry out this type of exercise, one obvious decision needs to be made. How large should the sample be? The answer to this question is 'it depends'! Specifically, it depends on how variable the population is, on how accurate the input to the decision needs to be, and on how costly the data collection is.
In this chapter you will learn to construct confidence intervals. You will also look at the
idea of sample size determination. How does it affect accuracy when you estimate
population means and proportions? How do you use this information?
Note that inferring information about a parent (or theoretical) population
using observations from a sample is the primary concern of the subject of
statistics.
7.6 Principle of confidence intervals
A (point) estimate is our ‘best guess’ of an unknown population parameter’s value
based on sample data. Due to the random nature of sample data, we do not expect to
estimate the parameter exactly (unless we are very lucky). Hence there is some
uncertainty in our estimation so it would be advisable to communicate the level of
uncertainty associated with the estimate.
Standard errors (recall the standard error is the square root of the variance of an estimator, i.e. its standard deviation) act as measures of estimation imprecision and
these are used in the construction of confidence intervals. Informally, you can think
of a confidence interval as representing our ‘best guess plus or minus a bit’, where the
magnitude of this ‘bit’ is dependent on the level of precision. More formally, an x%
confidence interval covers the unknown parameter with x% probability over repeated
samples. This method of expressing the accuracy of an estimate is easily understood
and requires no statistical sophistication for interpretation.
So, it is important to distinguish (point) estimation – using sample data to obtain a
numerical estimate of an unknown population parameter – from a confidence interval –
an interval estimate of the parameter whose width indicates how reliable the point
estimate is. (You may wish to review Section 6.9 to ensure that you are clear about the
distinction between statistics (such as point and interval estimators) and parameters
(population characteristics).)
Clearly, a very wide confidence interval would show that our estimate was not very
reliable, as we could not be sure that its value was close to the true parameter value,
whereas a narrow confidence interval would correspond to a more reliable estimate. The
degree of confidence that we have in our confidence interval can be expressed
numerically; an ideal situation is a narrow interval with a high coverage probability.
With these points in mind we now show how such intervals can be computed from data
in some basic situations.
7.7 General formulae for normally-distributed statistics
Chapter 6 introduced the sampling distribution of X̄. We will now draw on this
sampling theory.
7.7.1 Standard error known
Let θ̂ be an unbiased1 estimator of θ such that its sampling distribution is:

N(θ, Var(θ̂)) = N(θ, (S.E.(θ̂))²)

recalling that the standard error of an estimator, denoted here as S.E.(θ̂), is the square root of its variance, denoted here as Var(θ̂). Assume this standard error is known. We require a confidence interval defined by a pair of values such that the probability of the interval covering θ, the coverage probability, is high.

Since θ̂ is normally distributed it follows that, upon standardising θ̂, we have:

Z = (θ̂ − θ)/S.E.(θ̂) ∼ N(0, 1).
1 An unbiased estimator means the estimate is correct on average, over repeated samples. This concept is covered in greater depth in ST104b Statistics 2.
Hence, assuming we desire a 95% coverage probability, we have:

P(−1.96 < (θ̂ − θ)/S.E.(θ̂) < 1.96) = 0.95

using Table 4 of the New Cambridge Statistical Tables. Note we can also write 1.96 as z0.025, that is the z-value which cuts off 2.5% probability in the upper tail of the standard normal distribution.

Since S.E.(θ̂) > 0 (a standard error must be strictly positive):

0.95 = P(−1.96 < (θ̂ − θ)/S.E.(θ̂) < 1.96)
     = P(−1.96 × S.E.(θ̂) < θ̂ − θ < 1.96 × S.E.(θ̂))
     = P(−1.96 × S.E.(θ̂) < θ − θ̂ < 1.96 × S.E.(θ̂))
     = P(θ̂ − 1.96 × S.E.(θ̂) < θ < θ̂ + 1.96 × S.E.(θ̂))

where we multiply by −1 to go from the second to the third line, and hence the inequality sign is reversed.
Endpoints of a 95% confidence interval (standard error known)

A 95% confidence interval for θ has endpoints θ̂ ± 1.96 × S.E.(θ̂). Therefore, the reported confidence interval would be:

(θ̂ − 1.96 × S.E.(θ̂), θ̂ + 1.96 × S.E.(θ̂)).
This is a simple, but very important, result which can be applied to give confidence
intervals in many different situations such as the estimation of a mean, proportion,
difference between means and difference between proportions.
The above derivation was for a 95% confidence interval, i.e. with a 95% coverage
probability, which is a generally accepted confidence requirement. Of course, it is
possible to have different levels of confidence, say 90% or 99%. Fortunately, we can use
the same argument as above; however, a different multiplier coefficient drawn from the
standard normal distribution is required (i.e. not z0.025 = 1.96). Using Table 5 of the
New Cambridge Statistical Tables, it is easy to obtain such coefficients. For convenience,
key values are given below, where zα denotes the z-value which cuts off 100α%
probability in the upper tail of the standard normal distribution.
For 90% confidence, use the multiplier: z0.05 = 1.6449
For 95% confidence, use the multiplier: z0.025 = 1.9600
For 99% confidence, use the multiplier: z0.005 = 2.5758.
Check these values from Table 5 in the New Cambridge Statistical Tables, and make
sure you know how to work out other ones, for example 80% or 98% confidence.2
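These multipliers can also be recovered programmatically, for instance with `statistics.NormalDist` in Python 3.8+ (a convenient stand-in for Table 5, not a replacement for knowing how to read it):

```python
from statistics import NormalDist

inv = NormalDist().inv_cdf      # quantile function of N(0, 1)

for conf in (0.80, 0.90, 0.95, 0.98, 0.99):
    alpha = 1 - conf
    z = inv(1 - alpha / 2)      # z-value cutting off alpha/2 in the upper tail
    print(f"{conf:.0%} confidence: multiplier {z:.4f}")
```

This reproduces z0.05 = 1.6449, z0.025 = 1.9600 and z0.005 = 2.5758, along with the 80% and 98% values mentioned above.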
Unfortunately, the method used so far in this section is limited by the assumption that the standard error of θ̂ is known. This means, in effect, that we need to know the population variance and, in most practical cases, the assumption that we know either the population variance or the sampling standard error will not be justified.
In such cases it will be necessary to estimate the standard error from the data. This
requires a modification both of the approach and, in particular, of the general formula
for the endpoints of a confidence interval.
7.7.2 Standard error unknown
Here we consider cases where the standard error of θ̂ is estimated, hereby denoted E.S.E.(θ̂), which stands for the estimated standard error of θ̂. One might think that:

(θ̂ − θ)/E.S.E.(θ̂) ∼ N(0, 1)

but because of the additional sampling variability of the estimated standard error, this new, transformed function of the data will have a more dispersed distribution than the standard normal – the Student's t distribution on ν degrees of freedom (discussed below). Hence in fact:

(θ̂ − θ)/E.S.E.(θ̂) ∼ tν.
7.7.3 (Student's) t distribution
‘Student’ was the pen name of William S. Gosset (1876–1937) who is credited with
developing this distribution.3 You should be familiar with its generic shape, which
resembles the standard normal distribution (i.e. bell-shaped and symmetric about zero)
but with ‘fatter’ tails. We get different members of the t distribution family for different
degrees of freedom,4 ν. Graphical examples of the t distribution for various degrees
of freedom are given in Figure 7.1, but note that as ν → ∞ (in words, ‘as the degrees of
freedom tend to infinity’), the t distribution approaches the standard normal
distribution – that is, tν → N (0, 1) as ν → ∞ (in words, ‘the t distribution tends to the
standard normal distribution as the degrees of freedom tend to infinity’).
For our purposes, we will use this distribution whenever we are performing inference for
population means when population variances are unknown, and hence are estimated
from the data. The correct degrees of freedom will depend on the degrees of freedom
used to estimate the variance (more on this later).
2 These would be z0.10 = 1.2816 and z0.01 = 2.3263, respectively.
3 Curious students may wonder why Gosset was not vain enough to call it Gosset's t distribution – after all, who wouldn't want to achieve such statistical immortality? In fact the reason for the pen name was due to the refusal of Guinness (his employer when he discovered this distribution) to allow its researchers to publish papers fearing leaks of trade secrets, forcing Gosset to publish anonymously as 'Student'. Poor Gosset.
4 We will not discuss the naming of 'degrees of freedom' in ST104a Statistics 1.
[Figure 7.1 plots the density functions of the t1, t3 and t30 distributions together with t∞, i.e. the standard normal density: all are bell-shaped and symmetric about zero, with fatter tails for smaller ν.]

Figure 7.1: Student's t distribution – various degrees of freedom.
Assuming a 95% coverage probability, using Table 10 of the New Cambridge Statistical Tables, for a given ν we can find t0.025, ν such that:

P(−t0.025, ν < (θ̂ − θ)/E.S.E.(θ̂) < t0.025, ν) = 0.95

where t0.025, ν cuts off 2.5% probability in the upper tail of the t distribution with ν degrees of freedom. On rearranging the inequality within the brackets we get:

P(θ̂ − t0.025, ν × E.S.E.(θ̂) < θ < θ̂ + t0.025, ν × E.S.E.(θ̂)) = 0.95.
Endpoints of a 95% confidence interval (standard error unknown)

A 95% confidence interval for θ has endpoints θ̂ ± t0.025, ν × E.S.E.(θ̂), leading to a reported confidence interval of the form:

(θ̂ − t0.025, ν × E.S.E.(θ̂), θ̂ + t0.025, ν × E.S.E.(θ̂))

where t0.025, ν is the t-value which cuts off 2.5% probability in the upper tail of the t distribution with ν degrees of freedom, obtained from either Table 9 or Table 10 in the New Cambridge Statistical Tables.
This general result can be used in a large number of situations to compute confidence
intervals.
In order to use these results we need to know how to calculate or estimate, respectively,
the standard errors for the various cases using the sample data. We now proceed to
demonstrate these techniques for some important scenarios.
Important note: In the following applications, whenever the t distribution is
used and we have large sample size(s) (hence large degrees of freedom), it is
acceptable to use standard normal values as approximations due to the
tendency of the t distribution to the standard normal distribution as the
degrees of freedom approach infinity. What constitutes a ‘large’ sample size
is rather subjective; however, as a rule-of-thumb treat anything over 30 as
‘large’ in ST104a Statistics 1.
7.8 Confidence interval for a single mean (σ known)
Given observed sample values x1, x2, . . . , xn, the point estimate of µ is x̄ = Σ xi/n, where the sum runs over i = 1, . . . , n. Assuming the (population) variance, σ², is known, the standard error of x̄ is σ/√n.
Confidence interval endpoints of a single mean (σ known)

In such instances, we use the standard normal distribution when constructing a confidence interval with endpoints:

x̄ ± zα/2 × σ/√n  ⇒  (x̄ − zα/2 × σ/√n, x̄ + zα/2 × σ/√n)

where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the standard normal distribution to ensure a 100(1 − α)% confidence interval. For example, for α = 0.05, we have a 100(1 − 0.05)% = 95% confidence interval, and we require the z-value which cuts off α/2 = 0.025, i.e. 2.5% probability in the upper tail of the standard normal distribution, i.e. 1.96, which can be obtained from Table 5 in the New Cambridge Statistical Tables.
Example 7.1 Measurements of the diameter of a random sample of 200 ball
bearings produced by a machine gave a sample mean of x̄ = 0.824. The population
standard deviation, σ, is 0.042. Construct a 95% confidence interval and a 99%
confidence interval for the population mean value of the diameter of the ball
bearings.
We are told that σ = 0.042. So a 95% confidence interval, where α = 0.05, has endpoints:

x̄ ± 1.96 × σ/√n = 0.824 ± 1.96 × 0.042/√200 = 0.824 ± 0.006
where z0.025 = 1.96 is the z-value which cuts off 2.5% probability in the upper tail of
the standard normal distribution. In other words, the interval is (0.818, 0.830) which
covers the population mean with a probability of 95%.
To compute a 99% confidence interval (where α = 0.01), since σ is known we require
zα/2 = z0.005 , that is the z-value which cuts off 0.5% probability in the upper tail of
the standard normal distribution. Using Table 5 of the New Cambridge Statistical
Tables, the z-value is 2.5758. So a 99% confidence interval has endpoints:

x̄ ± 2.5758 × σ/√n = 0.824 ± 2.5758 × 0.042/√200 = 0.824 ± 0.008.
In other words, the interval is (0.816, 0.832). Note the higher level of confidence has
resulted in a wider confidence interval.
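The calculation in Example 7.1 can be sketched in Python, with `statistics.NormalDist` standing in for Table 5:

```python
from math import sqrt
from statistics import NormalDist

xbar, sigma, n = 0.824, 0.042, 200
se = sigma / sqrt(n)                    # known standard error of x-bar

def ci(conf):
    """Endpoints of a 100*conf% confidence interval, sigma known."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return (xbar - z * se, xbar + z * se)

for conf in (0.95, 0.99):
    lo, hi = ci(conf)
    print(f"{conf:.0%}: ({lo:.3f}, {hi:.3f})")
```

This reproduces (0.818, 0.830) and (0.816, 0.832), with the wider interval at the higher confidence level.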
7.9 Confidence interval for a single mean (σ unknown)
In practice it is unusual for σ² to be known – why would we know σ², but not µ? However, we can estimate σ² with the sample variance s², using the estimator:

S² = (1/(n − 1)) × Σ (Xi − X̄)², summing over i = 1, . . . , n.

This is because S² is an unbiased estimator5 of σ². The degrees of freedom associated with this estimator are n − 1, where n is the sample size.
Confidence interval endpoints of a single mean (σ unknown)

In such instances, we use the t distribution when constructing a 100(1 − α)% confidence interval with endpoints:

x̄ ± tα/2, n−1 × s/√n  ⇒  (x̄ − tα/2, n−1 × s/√n, x̄ + tα/2, n−1 × s/√n)

where tα/2, n−1 is the t-value which cuts off 100α/2% probability in the upper tail of the t distribution with n − 1 degrees of freedom, obtained from either Table 9 or Table 10 in the New Cambridge Statistical Tables.
Example 7.2 A company producing designer label jeans carries out a sampling
exercise in order to estimate the average price which retailers are charging for the
jeans. A random sample of 12 retailers gives the following sample of prices (in £):
34.50, 36.00, 43.50, 29.50, 44.50, 47.50, 45.50, 53.00, 40.00, 38.00, 33.00, 54.50.
We seek a 95% confidence interval for the mean retailer’s price of the jeans.
Clearly, n = 12 and it is not hard to check that x̄ = 41.625 and s = 7.84.
So the estimated standard error of the sample mean is:

s/√n = 7.84/√12 = 2.2632
5 Recall an unbiased estimator means the point estimate is correct on average, over repeated samples. This concept is covered in greater depth in ST104b Statistics 2.
on n − 1 = 11 degrees of freedom. Hence a 95% confidence interval for µ has
endpoints 41.625 ± 2.201 × 2.2632, i.e. the confidence interval is:
(£36.64, £46.61).
Two important points are the following.
Make sure you see where 2.201 comes from in Table 10 – it is t0.025, 11 , i.e. the
t-value above which lies 2.5% probability for a Student’s t distribution with 11
degrees of freedom.
Make sure you report confidence intervals in the form (£36.64, £46.61); that is
you must compute the actual endpoints and report these as an interval, as that
is what a confidence interval is! Note the lower endpoint should be given first.
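Example 7.2 can be reproduced as follows; note the t-value is still taken from Table 10, since the Python standard library has no t quantile function:

```python
from math import sqrt
from statistics import mean, stdev

prices = [34.50, 36.00, 43.50, 29.50, 44.50, 47.50,
          45.50, 53.00, 40.00, 38.00, 33.00, 54.50]
n = len(prices)
xbar = mean(prices)
s = stdev(prices)               # sample sd, using the n - 1 divisor
t = 2.201                       # t_{0.025, 11} from Table 10
half_width = t * s / sqrt(n)    # t-value times the estimated standard error
print(f"({xbar - half_width:.2f}, {xbar + half_width:.2f})")
```

Running this confirms x̄ = 41.625, s ≈ 7.84 and the interval (£36.64, £46.61), lower endpoint first.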
Activity 7.1 The reaction times, in seconds, for eight police officers were found to
be:
0.28, 0.23, 0.21, 0.26, 0.29, 0.21, 0.25, 0.22.
Determine a 90% confidence interval for the mean reaction time of all police officers.
Solution
Here n = 8 is small and σ² is unknown, so the formula for a 90% confidence interval for µ is:

x̄ ± t0.05, 7 × s/√n.

It is easy to calculate x̄ = 0.2437 and s = 0.0311, so a 90% confidence interval has endpoints:

0.2437 ± 1.895 × 0.0311/√8.
Hence the confidence interval is (0.2229, 0.2645).
Activity 7.2 A business requires an inexpensive check on the value of stock in its
warehouse. In order to do this, a random sample of 50 items is taken and valued.
The average value of these is computed to be £320.41 with a (sample) standard
deviation of £40.60. It is known that there are 9,875 items in the total stock.
(a) Estimate the total value of the stock to the nearest £10,000.
(b) Calculate a 95% confidence interval for the mean value of all items and hence
determine a 95% confidence interval for the total value of the stock.
(c) You are told this confidence interval is too wide for decision purposes and you
are asked to assess how many more items would need to be sampled to obtain
an interval with the same degree of confidence, but with half the width.
Solution
(a) The total value of the stock is 9875µ, where µ is the mean value of an item of
stock. We know that X̄ is the obvious estimator of µ, so 9875X̄ is the obvious
estimator of 9875µ. Therefore, an estimate of the total value of the stock is:
9875 × 320.41 = £3,160,000
(to the nearest £10,000).
(b) In this question σ² is unknown but n = 50 is large, so we can approximate tn−1 with the standard normal distribution. Hence for a 95% confidence interval for µ we use t0.025, 49 ≈ z0.025 = 1.96 giving:

x̄ ± z0.025 × s/√n = 320.41 ± 1.96 × 40.6/√50 = 320.41 ± 11.25  ⇒  (£309.16, £331.66).
Note that because n is large we have used the standard normal approximation. It is more accurate to use a t distribution on 49 degrees of freedom. Using Table 10 of the New Cambridge Statistical Tables, the nearest available value is t0.025, 50 = 2.009. This gives a 95% confidence interval of:

x̄ ± t0.025, 49 × s/√n = 320.41 ± 2.009 × 40.6/√50 = 320.41 ± 11.54  ⇒  (£308.87, £331.95)

so not much of a difference.
To obtain a 95% confidence interval for the total value of the stock, 9875µ,
multiply the interval endpoints by 9875. This gives (to the nearest £10,000):
(£3,050,000, £3,280,000).
(c) Increasing the sample size by a factor of k reduces the width of the confidence interval by a factor of √k. Hence increasing the sample size by a factor of 4 will reduce the width of the confidence interval by a factor of 2 (= √4). Therefore, we need to increase the sample size from 50 to 4 × 50 = 200, i.e. collect another 150 observations.
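The width-halving argument in part (c) can be illustrated numerically, reusing the figures from part (b):

```python
from math import sqrt

s, z = 40.6, 1.96               # sample sd and 95% multiplier from part (b)

def half_width(n):
    """Half-width of the 95% interval for a sample of size n."""
    return z * s / sqrt(n)

w50, w200 = half_width(50), half_width(200)
print(round(w50, 2), round(w200, 2))    # quadrupling n halves the half-width
```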
7.10 Confidence interval for a single proportion
We often want to estimate a (population) proportion, for example, in surveys of
people’s social, political or consumer attitudes.6 Take an opinion poll in which we
investigate voters’ support for a particular political party. We can ask the question ‘Do
you support party X?’, to which the n respondents each answer either ‘yes’ or ‘no’. If r
people reply ‘yes’, then the sample proportion who support the party is r/n. This is our
point estimate of the population proportion of support in the population, π.
6 In Chapter 4, categorical variables were introduced. Political party affiliation is an example of a categorical variable.
Sampling distribution of the sample proportion estimator, P

The sampling distribution of the sample proportion, P (which is the estimator of the population proportion, π)7 is:

P ∼ N(π, π(1 − π)/n)  (7.1)

approximately, as n → ∞.

Hence Var(P) = π(1 − π)/n, so the standard error is S.E.(P) = √(π(1 − π)/n). Since this depends on π, precisely what we are trying to estimate, the true standard error is unknown, so must itself be estimated. As π is unknown, the best we can do is replace it with our point estimate of it, p = r/n, hence the estimated standard error is:

E.S.E.(p) = √(p(1 − p)/n) = √((r/n)(1 − r/n)/n).
Confidence interval endpoints of a single proportion

A 100(1 − α)% confidence interval for a single proportion has endpoints:

p ± zα/2 × √(p(1 − p)/n)  ⇒  (p − zα/2 × √(p(1 − p)/n), p + zα/2 × √(p(1 − p)/n))

where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the standard normal distribution, obtained from Table 5 in the New Cambridge Statistical Tables.
Note that although we are estimating a variance, for proportions we do not use
Student’s t distribution for two reasons.
The standard error has not been estimated by a corrected sum of squares
calculation, unlike S 2 = Sxx /(n − 1).
The sample size n has to be large for the normal approximation to hold, and so the
standard normal distribution is appropriate in this case due to the central limit
theorem.
Example 7.3 A survey is conducted by a bank to estimate the proportion of its
customers who would be interested in using a proposed new mobile telephone
banking service. If we denote the population proportion of customers who are
interested in this proposal by π, and it is found that 68 out of 150 customers are in
favour, then we would estimate π to be p = 68/150 = 0.453. Hence a 95% confidence
7 This result is a consequence of the central limit theorem applied to the proportion of successes for a 'binomial' distribution. Full details can be found in ST104b Statistics 2.
interval for π has endpoints:

0.453 ± 1.96 × √(0.453 × (1 − 0.453)/150)  ⇒  (0.37, 0.53).
Note that for a 95% confidence interval, α = 0.05 and so we use z0.025 = 1.96 in the
computation of the confidence interval.
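A small helper wrapping this formula (the function name is our own, and the normal approximation is used throughout):

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(r, n, conf=0.95):
    """Normal-approximation confidence interval for a population proportion,
    given r 'successes' in a sample of size n."""
    p = r / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    ese = sqrt(p * (1 - p) / n)         # estimated standard error
    return p, (p - z * ese, p + z * ese)

p, (lo, hi) = proportion_ci(68, 150)    # the bank survey above
print(round(p, 3), (round(lo, 2), round(hi, 2)))
```

Applied to the bank survey, this returns p = 0.453 and the interval (0.37, 0.53) from Example 7.3.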
Activity 7.3 A random sample of 100 voters contained 35 supporters of a
particular party. Compute a 95% confidence interval for the proportion of this
party’s supporters in the population.
Solution
In this question we are estimating a proportion with n = 100. Let π be the
proportion of the party’s supporters in the population. With p = 35/100 = 0.35 and
n = 100, a 95% confidence interval for π is calculated as:

p ± z0.025 × √(p(1 − p)/n)  ⇒  0.35 ± 1.96 × √(0.35 × 0.65/100)  ⇒  (0.257, 0.443).
Activity 7.4 A claimant of Extra Sensory Perception (ESP) agrees to be tested for
this ability. Blindfolded, he claims to be able to identify more randomly chosen cards
than would be expected by pure guesswork.
An experiment is conducted in which 200 playing cards are drawn at random, and
with replacement, from a deck of cards, and the claimant is asked to name their
suits (hearts, diamonds, spades or clubs).
Of the 200 cards he identifies 60 correctly. Compute a 95% confidence interval for his
true probability of identifying a suit correctly. Is this evidence of ESP?
Solution

We have the sample proportion p = 60/200 = 0.3 and an estimated standard error of:

√(0.3 × 0.7/200) = 0.0324.

A 95% confidence interval for the true probability of the correct identification of a suit is:

p ± z0.025 × √(p(1 − p)/n)  ⇒  0.3 ± 1.96 × 0.0324  ⇒  (0.236, 0.364).

As 0.25 (pure guesswork) is within this interval, the claimant's performance is not convincing!
Activity 7.5 400 college students, chosen at random, are interviewed and it is
found that 240 use the refectory.
(a) Use these data to compute a:
i. 95% confidence interval
ii. 99% confidence interval
for the proportion of students who use the refectory.
(b) The college has 12,000 students. The college catering officer claims that the
refectory is used by at least 9,000 students and that the survey has yielded a
low figure due to sampling variability. Is this claim reasonable?
Solution

(a) i. We have p = 240/400 = 0.6 and the corresponding estimated standard error is:

√(0.6 × 0.4/400) = 0.0245.

Therefore, a 95% confidence interval for the population proportion is:

0.6 ± 1.96 × 0.0245  ⇒  (0.5520, 0.6480).

ii. Similarly, a 99% confidence interval for the population proportion is:

0.6 ± 2.576 × 0.0245  ⇒  (0.5369, 0.6631).

(b) A 99% confidence interval for the total number of students using the refectory is obtained by multiplying the interval endpoints by the total number of students, i.e. 12,000. This gives (6442, 7959).

The catering officer's claim of 9,000 is incompatible with these data as it falls well above the 99% confidence interval upper endpoint. We conclude that the actual number of users is well below 9,000.
7.11 Sample size determination
The question ‘how large a sample do I need to select?’ is a common one when sampling.
The answer to this depends on the quality of inference which the researcher requires
from the data. In the estimation context, this can be expressed in terms of the accuracy
of estimation.
If the researcher requires that there should be a 95% chance of the estimation error
being no larger than e units (we refer to e as the tolerance on the sampling error),
then this is equivalent to having a 95% confidence interval of width 2 × e. Note that
here e represents the half-width of the confidence interval since the point estimate is,
by construction, at the centre of the confidence interval (in all cases seen in ST104a
Statistics 1).
For known variance, a 95% confidence interval is based on the estimation error being no greater than 1.96 standard errors. Hence we can conveniently use the relation:

1.96 × S.E.(θ̂) ≤ e

and solve this simple equation for the required sample size (note the standard error is a function of n).
Sample size determination for a single mean
To estimate µ to within e units with 100(1 − α)% confidence, we require a minimum
sample size of:
n ≥ (zα/2)² σ² / e²   (7.2)
where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution, obtained from Table 5 in the New Cambridge Statistical
Tables.
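Formula (7.2), with rounding up to the next integer, can be sketched as follows (the function name is illustrative, not from the guide):

```python
import math

def sample_size_for_mean(sigma, e, z):
    """Minimum n so that z standard errors of the sample mean do not exceed e (7.2)."""
    return math.ceil((z * sigma / e) ** 2)  # always round up

# sigma = 4, tolerance e = 1, 99% confidence (z = 2.5758): n >= 106.16, so n = 107
n = sample_size_for_mean(4, 1, 2.5758)
```

These are the figures of Example 7.4 below; note that rounding up, never down, preserves the specified precision.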
Sample size determination for a single proportion
To estimate π to within e units with 100(1 − α)% confidence, we require a minimum
sample size of:
n ≥ (zα/2)² p (1 − p) / e²   (7.3)
where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution, obtained from Table 5 in the New Cambridge Statistical
Tables, and p is the assumed value of π (obtained from a pilot study, or otherwise from an assumption about the value of π).
Example 7.4 A random sample of 50 households is taken from a population of
10,000 households in an area of a city. The sample mean and standard deviation of
weekly expenditure on beverages are £18 and £4, respectively. How many more
observations are required to estimate µ to within 1 unit with 99% confidence?
Here we have e = 1, and we can assume σ = 4 since the initial sample with n = 50 is
‘large’ and hence we would expect that s ≈ σ. For 99% confidence, we use
z0.005 = 2.5758. Therefore, using (7.2), we have:
n ≥ (2.5758)² × 4² / 1² = 106.16.

Remember that n must be an integer, so the smallest n satisfying this is 107.⁸
Therefore, 57 more observations are required.
⁸ Note that we round up; had we rounded down, it would lead to less precision than we have specified.
Example 7.5 The reaction time of a patient to a certain stimulus is known to have
a standard deviation of 0.05 seconds. How large a sample of measurements must a
psychologist take in order to be 95% confident and 99% confident, respectively, that
the error in the estimate of the mean reaction time will not exceed 0.01 seconds?
For 95% confidence, we use z0.025 = 1.96 using Table 5 of the New Cambridge
Statistical Tables. So, using (7.2), n is to be chosen such that:
n ≥ (1.96)² × (0.05)² / (0.01)².
Hence we find that n ≥ 96.04. Since n must be an integer, 97 observations are
required to achieve an error of 0.01 or less with 95% probability.
For 99% confidence, we use z0.005 = 2.5758. So, using (7.2), n is to be chosen such
that:
n ≥ (2.5758)² × (0.05)² / (0.01)².
Hence we find that n ≥ 165.87. Since n must be an integer, 166 observations are
required to achieve an error of 0.01 or less with 99% probability.
Note that a higher level of confidence requires a larger sample size as more
information (sample data) is needed to achieve a higher level of confidence for a
given tolerance, e.
Example 7.6 A pilot study estimates a proportion to be 0.4. If we wish to be 90%
confident of estimating the population proportion with an error no greater than 0.03,
how large a sample is required?
Here e = 0.03, and we have an initial estimate of p = 0.4. For 90% confidence, we use
z0.05 = 1.645 using Table 5 of the New Cambridge Statistical Tables. Therefore, using
(7.3), we have:
n ≥ (1.645)² × 0.4 × (1 − 0.4) / (0.03)² = 721.61.
So, rounding up, we require a sample size of 722.
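Formula (7.3) can be checked the same way (again, the function name is illustrative):

```python
import math

def sample_size_for_proportion(p, e, z):
    """Minimum n to estimate a proportion to within e, given assumed value p (7.3)."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

# Example 7.6: pilot estimate p = 0.4, tolerance e = 0.03, 90% confidence (z = 1.645)
n = sample_size_for_proportion(0.4, 0.03, 1.645)  # n >= 721.61, so n = 722
```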
Activity 7.6 A simple random sample of 100 workers had weekly salaries with a
mean of £315 and a standard deviation of £20.
(a) Calculate a 90% confidence interval for the mean weekly salary of all workers in
the factory.
(b) How many more workers should be sampled if it is required that the estimate is
to be within £3 of the true average (again, with 90% confidence)?
Note this means a tolerance of £3 – equivalent to a confidence interval width of
£6.
Solution
(a) We have n = 100, x̄ = 315 and s = 20. The estimated standard error is:
s/√n = 20/√100 = 2.
Hence a 90% confidence interval for the true mean is:
x̄ ± t0.05, n−1 × s/√n  ⇒  315 ± 1.645 × 2  ⇒  (311.71, 318.29)
where we use the approximation t0.05, 99 ≈ z0.05 = 1.645 since n is large.
(b) For the tolerance to be 3, we require:
3 ≥ 1.645 × 20/√n.
Solving this gives n ≥ 120.27, so we round up to get n = 121. Hence we need to
take a further sample of 21 workers.
7.12 Difference between two population proportions
The correct approach for the comparison of two population proportions, π1 and π2 , is,
obviously, via the difference between the population proportions. The sample
proportions P1 and P2 are, for large sample sizes n1 and n2 , respectively, approximately
normally distributed by the central limit theorem as, from (7.1), we have:
P1 ∼ N(π1, π1(1 − π1)/n1)  and  P2 ∼ N(π2, π2(1 − π2)/n2)
approximately, as n1 → ∞ and n2 → ∞. When random samples are drawn from two
independent populations, then these distributions are statistically independent, and it
can be shown that their difference is also (approximately) normally distributed such
that:
P1 − P2 ∼ N(π1 − π2, π1(1 − π1)/n1 + π2(1 − π2)/n2).   (7.4)
We see that Var(P1 − P2), and hence S.E.(P1 − P2) = √Var(P1 − P2), depends on the unknown parameters π1 and π2. Therefore, the variance and standard error of the difference in sample proportions is also unknown. So we must resort to the estimated standard error:

E.S.E.(P1 − P2) = √(P1(1 − P1)/n1 + P2(1 − P2)/n2).
Confidence interval endpoints for the difference between two proportions
With point estimates of π1 and π2 of p1 = r1 /n1 and p2 = r2 /n2 , respectively,
a 100(1 − α)% confidence interval for the difference between two proportions has
endpoints:
p1 − p2 ± zα/2 × √(p1(1 − p1)/n1 + p2(1 − p2)/n2)   (7.5)
where zα/2 cuts off 100α/2% probability in the upper tail of the standard normal
distribution, obtained from Table 5 in the New Cambridge Statistical Tables.
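The endpoints in (7.5) can be computed with a short sketch (function name illustrative):

```python
import math

def diff_proportion_ci(r1, n1, r2, n2, z):
    """CI for pi1 - pi2 based on the estimated standard error in (7.5)."""
    p1, p2 = r1 / n1, r2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - z * se, (p1 - p2) + z * se

# Awareness 'after' (65 of 120) minus 'before' (68 of 150), 95% confidence
lo, hi = diff_proportion_ci(65, 120, 68, 150, 1.96)  # about (-0.031, 0.208)
```

These are the figures worked through in Example 7.7 below; since the interval contains zero, no real difference in awareness is indicated.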
Example 7.7 We use (7.5) to construct 95% and 90% confidence intervals,
respectively, for the difference between the population proportions of the general
public who are aware of a particular commercial product before and after an
advertising campaign. Two surveys were conducted and the results of the two
random samples were:
                   Sample size   Number aware
Before campaign        150             68
After campaign         120             65
To avoid working with negative differences whenever possible (to keep notation
simpler by working with positive values), let p1 = r1 /n1 = 65/120 = 0.5417 and
p2 = r2 /n2 = 68/150 = 0.4533. Hence our point estimate of the difference is:
‘After’ − ‘Before’ = p1 − p2 = 0.5417 − 0.4533 = 0.0884.
So a 95% confidence interval for the difference in population proportions, π1 − π2 ,
has endpoints:
0.0884 ± 1.96 × √(0.5417 × (1 − 0.5417)/120 + 0.4533 × (1 − 0.4533)/150)
which can be expressed as (−0.031, 0.208).
Similarly, a 90% confidence interval has endpoints:
0.0884 ± 1.645 × √(0.5417 × (1 − 0.5417)/120 + 0.4533 × (1 − 0.4533)/150)
which can be expressed as (−0.012, 0.189).
Note that both confidence intervals include zero (since they both have a negative
lower bound and positive upper bound). This suggests that there is no significant
difference in public awareness. This idea has close parallels with hypothesis testing,
introduced in Chapter 8.
Activity 7.7 In a market research study to compare two chocolate bar wrappings,
30 out of 50 children preferred a gold wrapping to a silver wrapping, whereas 25 out
of 40 adults preferred the silver wrapping.
(a) Compute a 95% confidence interval for the difference in the true proportions in
favour of gold wrapping between the two groups.
(b) It is decided to take further samples of children and adults so that we finally
sample the same number of adults and children such that the final estimator of
the difference between proportions is within 0.06 of the true difference (with
95% confidence). How many more adults and children need to be sampled?
Solution
(a) For ‘Children − Adults’, the estimate of the difference in population proportions is 30/50 − 15/40 = 0.225 (since 25 of the 40 adults preferred silver, 15 of them preferred gold). The estimated standard error of this estimate is:
√(p1(1 − p1)/n1 + p2(1 − p2)/n2) = √(0.6 × 0.4/50 + 0.375 × 0.625/40) = 0.1032.
A 95% confidence interval is:
0.225 ± 1.96 × 0.1032  ⇒  (0.023, 0.427).
(b) Let n be the required common sample size. We require:
1.96 × √((0.6 × 0.4 + 0.375 × 0.625)/n) ≤ 0.06.
Solving gives n = 507. Therefore, we must sample 457 more children and 467
more adults.
Activity 7.8 Two market research companies each take random samples to assess
the public’s attitude toward a new government policy. If n represents the sample size
and r the number of people against, the results of these independent surveys are as
follows:
             n     r
Company 1   400   160
Company 2   900   324
If π is the population proportion against the policy, compute a 95% confidence
interval to comment on the likely representativeness of the two companies’ results.
Solution
Here we are estimating the difference between two proportions (if the two
companies’ results are compatible, then there should be no difference between the
proportions). The formula for a 95% confidence interval of the difference is:
p1 − p2 ± z0.025 × √(p1(1 − p1)/n1 + p2(1 − p2)/n2).
The estimates of the proportions are p1 = 160/400 = 0.4 and p2 = 324/900 = 0.36,
so a 95% confidence interval is:
0.4 − 0.36 ± 1.96 × √(0.4 × 0.6/400 + 0.36 × 0.64/900)  ⇒  (−0.017, 0.097).
Since zero is in this interval it is plausible that the two proportions are the same, so the companies’ surveys are compatible, indicating likely representativeness.
Activity 7.9 A sample of 954 adults in early 1987 found that 23% of them held
shares.
(a) Given a UK adult population of 41 million, and assuming a proper random
sample was taken, compute a 95% confidence interval estimate for the number
of shareholders in the UK in 1987 (following liberalisation of financial markets
in the UK).
(b) A ‘similar’ survey the previous year (prior to financial liberalisation) had found
a total of 7 million shareholders. Assuming ‘similar’ means the same sample
size, find a 95% confidence interval estimate of the increase in shareholders
between the two years.
Solution
(a) Let π be the proportion of shareholders in the population in 1987. Start by
estimating π. We are estimating a proportion and n is large, so a 95%
confidence interval for π is:
p ± z0.025 × √(p(1 − p)/n)  ⇒  0.23 ± 1.96 × √(0.23 × 0.77/954)  ⇒  (0.203, 0.257).
Therefore, a 95% confidence interval for the number (rather than the
proportion) of shareholders in the UK in 1987 is obtained by multiplying the
above interval endpoints by 41 million, resulting in:
9,430,000 ± 1,107,000  ⇒  (8,323,000, 10,537,000)
Therefore, we estimate there were about 9.43 million shareholders in the UK in
1987, with a margin of error of approximately 1.1 million.
(b) Let us start by finding a 95% confidence interval for the difference in the two
proportions. We use the formula:
p1 − p2 ± z0.025 × √(p1(1 − p1)/n1 + p2(1 − p2)/n2).
The estimates of the proportions π1 and π2 are 0.23 and 7/41 = 0.171,
respectively. We know n1 = 954, and although n2 is unknown we can assume it
is approximately equal to 954 (as the previous year’s survey was ‘similar’), so a
95% confidence interval is:
0.23 − 0.171 ± 1.96 × √(0.23 × 0.77/954 + 0.171 × 0.829/954) = 0.059 ± 0.036
giving (0.023, 0.094). Multiply the interval endpoints by 41 million and we get a
confidence interval of:
2,419,000 ± 1,476,000  ⇒  (943,000, 3,895,000).
We estimate that the number of shareholders has increased by about 2.4 million
during the period covered by the surveys.
There is quite a large margin of error of approximately 1.5 million, especially
when compared with a point estimate (i.e. interval midpoint) of 2.4 million.
However, it seems financial liberalisation increased the number of shareholders.
7.13 Difference between two population means
In this section we are primarily interested in the difference between two population
means, i.e. µ1 − µ2 . There are four cases to be considered, depending on whether
variances are known or unknown (in which case we can assume they are either equal or
unequal), and the case of paired datasets.
7.13.1 Unpaired samples – variances known
Suppose we have random samples of size n1 and n2 from two normally-distributed populations, N(µ1, σ1²) and N(µ2, σ2²), respectively, where the variances σ1² and σ2² are known. The sampling distributions of X̄1 and X̄2 are, respectively:

X̄1 ∼ N(µ1, σ1²/n1)  and  X̄2 ∼ N(µ2, σ2²/n2).
A natural estimator of µ1 − µ2 is X̄1 − X̄2 . Random samples drawn from two different
populations are statistically independent, hence X̄1 and X̄2 are independent. The
sampling distribution of their difference is:
X̄1 − X̄2 ∼ N(µ1 − µ2, σ1²/n1 + σ2²/n2).   (7.6)
Confidence interval endpoints for the difference between two means
If the population variances σ1² and σ2² are known, a 100(1 − α)% confidence interval for µ1 − µ2 has endpoints:

x̄1 − x̄2 ± zα/2 × √(σ1²/n1 + σ2²/n2)   (7.7)
where zα/2 cuts off 100α/2% probability in the upper tail of the standard normal
distribution, obtained from Table 5 in the New Cambridge Statistical Tables.
7.13.2 Unpaired samples – variances unknown and unequal
We have the same set-up as above, with the same sampling distribution for X̄1 − X̄2 in (7.6), but now the population variances σ1² and σ2² are unknown. Assuming large sample sizes, say greater than 30, we can replace these unknown parameters with the respective sample variances s1² and s2² and continue to use standard normal values for the confidence coefficient. The justification is that, since the sample sizes are large, we would expect fairly accurate estimates of the population variances, such that s1² ≈ σ1² and s2² ≈ σ2².
Confidence interval endpoints for the difference between two means
If the population variances σ1² and σ2² are unknown, provided the sample sizes n1 and n2 are large (greater than 30), a 100(1 − α)% confidence interval for µ1 − µ2 has endpoints:

x̄1 − x̄2 ± zα/2 × √(s1²/n1 + s2²/n2)
where zα/2 cuts off 100α/2% probability in the upper tail of the standard normal
distribution, obtained from Table 5 in the New Cambridge Statistical Tables.
Activity 7.10 In order to assess the impact of an advertising campaign, a
restaurateur monitors her daily revenue before and after the campaign. The table
below shows some sample statistics of daily sales calculated over a period of 60 days
prior to the campaign, and 45 days after the campaign. Determine a 95% confidence
interval for the increase in average daily sales due to the campaign. Is there strong
evidence that the advertising campaign has increased sales?
                   Number of days   Mean daily sales   Standard deviation
Before campaign          60               £503                £21
After campaign           45               £559                £29
Solution
As both sample sizes are ‘large’ there is no need to use a pooled estimator of the variance as we would expect s1² ≈ σ1² and s2² ≈ σ2². A 95% confidence interval for the difference in means is:

x̄1 − x̄2 ± z0.025 × √(s1²/n1 + s2²/n2)  ⇒  559 − 503 ± 1.96 × √((29)²/45 + (21)²/60)  ⇒  (46.0, 66.0).
Zero is well below the lower endpoint, so there is evidence that the advertising
campaign has increased sales.
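The large-sample interval used in this solution can be sketched as follows (function name illustrative):

```python
import math

def diff_mean_ci_large(xbar1, s1, n1, xbar2, s2, n2, z):
    """Large-sample CI for mu1 - mu2 when the variances are unknown and unequal."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (xbar1 - xbar2) - z * se, (xbar1 - xbar2) + z * se

# Activity 7.10: 'after' (559, s = 29, n = 45) minus 'before' (503, s = 21, n = 60)
lo, hi = diff_mean_ci_large(559, 29, 45, 503, 21, 60, 1.96)  # about (46.0, 66.0)
```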
7.13.3 Unpaired samples – variances unknown and equal

In some circumstances we may be able to justify the assumption that the two populations being sampled are of equal variability. In which case suppose we have random samples of sizes n1 and n2 from two normally-distributed populations, N(µ1, σ²) and N(µ2, σ²), respectively, i.e. the populations have a common variance, σ², which is unknown.⁹ The sampling distributions of X̄1 and X̄2 are:

X̄1 ∼ N(µ1, σ²/n1)  and  X̄2 ∼ N(µ2, σ²/n2).
A natural estimator of µ1 − µ2 is still X̄1 − X̄2 . Since we have unpaired (independent)
samples, X̄1 and X̄2 are statistically independent, hence the sampling distribution of
their difference is:
X̄1 − X̄2 ∼ N(µ1 − µ2, σ²(1/n1 + 1/n2)).
The problem is that this common variance σ² is unknown, so it needs to be estimated. Should we use S1² or S2² as an estimator of σ²? Well, we use both by pooling the two sample variances, since both contain useful information about σ².
Pooled variance estimator
The pooled variance estimator of σ², using S1² and S2² which are sample variances from random samples of sizes n1 and n2, respectively, is:

Sp² = ((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2)   (7.8)

on n1 + n2 − 2 degrees of freedom, where the subscript ‘p’ denotes ‘pooled’.
Note that Sp² is the weighted average of the sample variances S1² and S2², where the weights are (n1 − 1)/(n1 + n2 − 2) and (n2 − 1)/(n1 + n2 − 2), respectively.
If n1 = n2 then we attach equal weights to the sample variances. Intuitively, this should
make sense. As the sample size increases, a sample variance provides a more accurate estimate of σ². If n1 ≠ n2, the sample variance calculated from the larger sample is more accurate, hence it is given greater weight in the pooled variance estimator. Of course, if n1 = n2, then the variances are equally accurate and so are given equal weights.

⁹ If the population variances were known, then we would know their true values so no assumptions would be necessary and we would use (7.7).
Confidence interval endpoints for the difference between two means
If the population variances σ12 and σ22 are unknown but assumed equal, a 100(1 − α)%
confidence interval for µ1 − µ2 has endpoints:
x̄1 − x̄2 ± tα/2, n1+n2−2 × √(sp²(1/n1 + 1/n2))   (7.9)
where s2p is the estimate of the common variance σ 2 from the pooled variance
estimator (7.8), and where tα/2, n1 +n2 −2 is the t-value which cuts off 100α/2%
probability in the upper tail of the Student’s t distribution with n1 +n2 −2 degrees of
freedom, obtained from either Table 9 or Table 10 in the New Cambridge Statistical
Tables.¹⁰
An obvious question is how to decide whether to assume the unknown variances are
equal or unequal. To assist with making this decision consider the following points.
If σ1² = σ2², then we would expect approximately equal sample variances, i.e. s1² ≈ s2², since both sample variances would be estimating the same (common) variance. If the sample variances are very different, then this would suggest σ1² ≠ σ2².¹¹
If we are sampling from two similar populations (for example, two companies in
the same industry) then an assumption of equal variability in these similar
populations would be reasonable.
Example 7.8 Two companies supplying a similar service are compared for their
reaction times (in days) to complaints. Random samples of recent complaints to
these companies gave the following:
            Sample size   Sample mean   Sample std. dev.
Company A       12            8.5             3.6
Company B       10            4.8             2.1
We construct a 95% confidence interval for the difference between mean reaction
times and use this interval to decide if one company reacts faster than the other, on
average.
Because the markets are similar, it is reasonable to assume (though it is only an
assumption!) that the two population variances are equal. Under this assumption,
using (7.8), we have:
sp² = ((12 − 1) × (3.6)² + (10 − 1) × (2.1)²)/(12 + 10 − 2) = 9.1125

on n1 + n2 − 2 = 12 + 10 − 2 = 20 degrees of freedom. So the estimated standard error of the difference between sample means is:

√(sp² × (1/n1 + 1/n2)) = √(9.1125 × (1/12 + 1/10)) = √(9.1125 × 0.1833) = 1.2925.

¹⁰ Remember that for sufficiently ‘large’ degrees of freedom a standard normal approximation can be used, provided a justification is given.
¹¹ A formal hypothesis test of H0: σ1² = σ2² could be performed, although this is covered in ST104b Statistics 2.
Hence a 95% confidence interval for µA − µB , using (7.9), is:
8.5 − 4.8 ± 2.086 × 1.2925  ⇒  (1.01, 6.39)
where t0.025, 20 = 2.086 (we have estimated the common variance, so we use the
Student’s t distribution, here with 20 degrees of freedom) using Table 10 of the New
Cambridge Statistical Tables.
As zero does not lie in this interval, we conclude that the data suggest that
Company B is faster than Company A in reacting to complaints on average. (Note
the variable here is reaction time, so a lower sample mean suggests a faster average
reaction time.)
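The pooled-variance interval of (7.8) and (7.9) can be sketched as follows; the t-value is still looked up from tables (function name illustrative):

```python
import math

def pooled_variance_ci(xbar1, s1, n1, xbar2, s2, n2, t):
    """CI for mu1 - mu2 assuming a common variance; t has n1 + n2 - 2 df."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)  # pooled variance (7.8)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (xbar1 - xbar2) - t * se, (xbar1 - xbar2) + t * se

# Example 7.8: Company A (8.5, 3.6, n = 12) vs. B (4.8, 2.1, n = 10), t(0.025, 20) = 2.086
lo, hi = pooled_variance_ci(8.5, 3.6, 12, 4.8, 2.1, 10, 2.086)  # about (1.0, 6.4)
```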
Activity 7.11 A survey is conducted on time spent recording (in hours per track)
for 25 music industry recording studios, classified as successful or unsuccessful
according to their recent chart performances. The relevant statistics resulting from
this study are:
                       Sample size   Sample mean   Sample standard deviation
Successful studios         12            9.2                  1.4
Unsuccessful studios       13            6.8                  1.9
(a) Compute a 98% confidence interval for the difference between population mean
recording times between successful and unsuccessful studios.
(b) On the basis of this confidence interval, do you consider this to be sufficient
evidence of a true difference in mean recording times between the different types
of studios? Justify your answer.
Solution
(a) Both samples are small (combined sample size < 30), so it will be necessary to use a pooled variance by assuming a common value of σ² and estimating it using the pooled variance estimator:

Sp² = ((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2).
Also, since the sample sizes are small and the true variance is unknown, the
Student’s t distribution will be used with ν = (n1 − 1) + (n2 − 1) = n1 + n2 − 2
degrees of freedom. (Simply add the degrees of freedom from the two separate
samples.)
Let the subscripts ‘1 ’ and ‘2 ’ refer to the successful and unsuccessful studios,
respectively. Because the studios are in the same market (which is largely why
we are bothering to compare them in the first place!) then it is reasonable to
assume the same variance for the recording times at studios for the two groups.
Under this assumption:
sp² = ((12 − 1)(1.4)² + (13 − 1)(1.9)²)/(12 + 13 − 2) = 64.88/23 = 2.8209.
This variance is our best estimate of σ², and applies (by assumption) to both populations. If the true population means µ1 and µ2 are estimated by x̄1 and x̄2 (the sample means), respectively, then the central limit theorem allows us to use the following approximations:

Var(X̄1) ≈ Sp²/n1  and  Var(X̄2) ≈ Sp²/n2

with point estimates of 2.8209/12 and 2.8209/13, provided that we use the t distribution.
The point estimates are 9.2 for µ1 and 6.8 for µ2 . We estimate the difference
between the means by:
x̄1 − x̄2 = 9.2 − 6.8 = 2.4.
Since the random samples are independent, we add the variances, hence:
sp²/n1 + sp²/n2 = 2.8209/12 + 2.8209/13 = 2.8209 × (1/12 + 1/13)
on 23 degrees of freedom.
Therefore, a 98% confidence interval for the difference between the means is,
noting here t0.01, 23 = 2.500 for 98% confidence:
2.4 ± 2.500 × √(2.8209 × (1/12 + 1/13))  ⇒  2.4 ± 2.500 × 0.6724
resulting in the confidence interval (0.7190, 4.0810).
(b) Since zero is not in this interval, it follows (with 98% confidence) that the true
difference is greater than zero, so that more time (on average) is spent recording
at the successful studios.
7.13.4 Paired (dependent) samples
Paired-sample methods are used in special cases when the two samples are not
statistically independent. For our purposes, such paired data are likely to involve
observations on the same individuals in two different states – specifically ‘before’ and
‘after’ some intervening event. A paired-sample experimental design is advantageous
since it allows researchers to determine whether or not significant changes have occurred
as a result of the intervening event free of bias from other factors, since these have been
controlled for by observing the same individuals.¹²
A necessary, but not sufficient, condition for the presence of paired sample data is that
n1 = n2 in order to have pairs of data values. Common sense needs to be exercised to
determine whether or not we have paired data. An example of such a dataset would be
observations of the same individuals at two different points in time, typically ‘before’
and ‘after’ some event, such as statistical ‘IQ’ before and after taking ST104a
Statistics 1.¹³
This scenario is easy to analyse as the paired data can simply be reduced to a
one-sample analysis by working with differenced data. That is, suppose two samples
generated sample values x1 , x2 , . . . , xn and y1 , y2 , . . . , yn , respectively (note the same
number of observations, n, in each sample). Compute the differences, that is:
d1 = x1 − y1,  d2 = x2 − y2,  . . . ,  dn = xn − yn.
Let x̄d denote the sample mean of the differenced data. It follows that x̄d = x̄ − ȳ, so
that by using the differences to compute a confidence interval for µd (the population
mean of the differences), then we get the required confidence interval for µX − µY .
Therefore, the technique subsequently follows that in Section 7.9.
Example 7.9 The table below shows the before and after weights (in pounds) of 8
adults after a diet. Construct a 95% confidence interval for the mean loss of weight
due to the diet. Are you convinced that the diet reduces weight on average?
Before   127   130   114   139   150   147   167   153
After    122   120   116   132   144   138   155   152
The differences (calculated as ‘Before − After’) are:
5, 10, −2, 7, 6, 9, 12, 1.
Hence n = 8, x̄d = 6 and sd = 4.66 on n − 1 = 7 degrees of freedom, where x̄d and sd denote the sample mean and standard deviation of the differenced data, respectively. The estimated standard error of the sample mean is sd/√n = 4.66/√8 = 1.65. Using the t distribution on 7 degrees of freedom, for 95% confidence we use t0.025, 7 = 2.365. So a 95% confidence interval for the mean difference in weight before and after the diet is x̄d ± t0.025, n−1 × sd/√n, that is:

6 ± 2.365 × 1.65  ⇒  (2.1, 9.9).
Since zero is not included in this confidence interval, we conclude that the diet does
appear to reduce weight on average, i.e. the average weight loss appears to be
positive.
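The differencing approach of Example 7.9 reduces to a one-sample calculation, which can be sketched as (function name illustrative):

```python
import math
import statistics

def paired_ci(before, after, t):
    """CI for the mean of the 'before - after' differences; t has n - 1 df."""
    d = [b - a for b, a in zip(before, after)]
    xbar_d = statistics.mean(d)                   # sample mean of differences
    se = statistics.stdev(d) / math.sqrt(len(d))  # estimated standard error
    return xbar_d - t * se, xbar_d + t * se

# Example 7.9: weights of the 8 adults, t(0.025, 7) = 2.365
before = [127, 130, 114, 139, 150, 147, 167, 153]
after = [122, 120, 116, 132, 144, 138, 155, 152]
lo, hi = paired_ci(before, after, 2.365)  # about (2.1, 9.9)
```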
¹² Experimental studies are discussed in Chapter 11.
¹³ One hopes, of course, that this would result in a mean increase, or at the very least no decrease, in statistical ‘IQ’!
Activity 7.12 Two advertising companies each give quotations for nine different
campaigns. Their quotations (in £000s) are shown in the following table. Compute a
95% confidence interval for the true difference between average quotations. Can you
deduce from this interval if one company is more expensive on average than the
other?
Campaign     1    2    3    4    5    6    7    8    9
Company A   39   24   36   42   45   30   38   32   39
Company B   46   26   32   39   51   34   37   41   44
Solution
The data are paired observations for n = 9 campaigns. First compute the differences (calculated as ‘B − A’):

7, 2, −4, −3, 6, 4, −1, 9, 5.
The sample mean of these differences is 2.78 and the sample standard deviation is
4.58. (Make sure you know how to do these calculations!) Therefore, a 95%
confidence interval for the true difference, µd , is:
x̄d ± t0.025, n−1 × sd/√n  ⇒  2.78 ± 2.306 × 4.58/√9  ⇒  (−0.74, 6.30).
Since zero is in this confidence interval we cannot conclude that one company is
more expensive than the other on average. It is possible that the differences in
quotations are due to random variation.
7.14 Summary
The concepts of estimation are obviously extremely important when the true values of
parameters are unknown. When constructing confidence intervals, make sure that you
understand when Student’s t distribution is required rather than the standard normal
distribution. Remember:
we use the standard normal distribution when we know the variance, either as
given by the researcher or as a population figure
if we have to estimate the variance from a sample, we will need to use Student’s t
distribution, whatever the size of the sample
if the sample size is large, then the standard normal distribution well-approximates
Student’s t
we use the standard normal distribution whenever we are dealing with proportions
(as a consequence of the central limit theorem).
7.15 Key terms and concepts

Confidence interval
Coverage probability
Degrees of freedom
Differences
Endpoints
Half-width
Paired sample
Pooled variance
Standard error
Student’s t distribution
Tolerance

7.16 Learning activities
Solutions can be found in Appendix A.
1. Would you say the following statement is true or false? ‘When calculated from the
same dataset, a 90% confidence interval is wider than a 95% confidence interval.’
Explain your answer.
2. National figures for a blood test result have been collected and the population
standard deviation is 1.2. You select a random sample of 100 observations and find
a sample mean of 25 units. Construct a 95% confidence interval for the population
mean.
3. Consider this variation of Example 7.1. ‘Measurements of the diameter of a random
sample of 200 ball bearings produced by a machine gave a sample mean x̄ = 0.824.
The sample standard deviation, s, is 0.042. Construct a 95% confidence interval
and a 99% confidence interval for the population mean value of the diameter of the
ball bearings.’
(Note that although this time you have been told that you only have a point
estimate of the population standard deviation, you can justify using the standard
normal distribution since the sample size is very large and so your confidence
interval will be exactly the same as in Example 7.1!)
4. Look at Table 10 of the New Cambridge Statistical Tables. Note that different
probability tails are given for ν = 1, 2 etc. (ν is the same as n − 1 in this case).
Now consider a 95% confidence interval for µ when n = 21 (i.e. for 20 degrees of
freedom). You can see the t-value is t0.025, 20 = 2.086. However, when ν → ∞, the
t-value converges to 1.96 – exactly the same as z0.025 for the standard normal
distribution. Although you can see that t-values are given for quite large degrees of
freedom, we generally assume that the standard normal distribution can be used
instead of Student’s t if the degrees of freedom are greater than 30 (some textbooks
say 40, others say 50).
5. (a) Write down the formula for the standard error of the sample proportion when
a random sample is selected from a very large population and the population
proportion is equal to π. Give the formula for pooled proportions when
comparing the two samples. (Note that you need to check this on your
examination formula sheet for ST104a Statistics 1.)
(b) An independent assessment is made of the services provided by two holiday
companies. Of a random sample of 300 customers who booked through
company A, 220 said they were satisfied with the service. Of a random sample
of 250 of company B’s customers, 200 said they were satisfied. For both
companies the total customer base was very large.
Construct a 95% confidence interval for the difference between the proportions
of satisfied customers between the two companies. Basing your conclusion on
this confidence interval, do you believe that one company generates more
customer satisfaction than the other?
6. A random sample of 200 students is observed. 30 of them say they ‘really enjoy’
studying Statistics.
(a) Calculate the proportion of students in the sample who say they ‘really enjoy’
studying Statistics and then construct a 95% confidence interval for the
population proportion of students who ‘really enjoy’ studying Statistics.
You now take a further random sample, in another institution. This time there are
100 students, and 18 say they ‘really enjoy’ studying Statistics.
(b) Construct a 95% confidence interval for the population proportion of students
who ‘really enjoy’ studying Statistics based on this second sample.
(c) Construct a 95% confidence interval for the difference between the two
proportions.
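After working through parts (a) and (b), you can check your intervals numerically. The sketch below is illustrative only (the helper `prop_ci` is our own name, not from the subject guide), using the normal approximation for a single proportion:

```python
import math

def prop_ci(x, n, z=1.96):
    """95% confidence interval for a population proportion (normal approximation)."""
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Part (a): 30 of 200 students 'really enjoy' studying Statistics.
lo_a, hi_a = prop_ci(30, 200)
# Part (b): 18 of 100 students in the second institution.
lo_b, hi_b = prop_ci(18, 100)
print(f"(a) ({lo_a:.4f}, {hi_a:.4f})   (b) ({lo_b:.4f}, {hi_b:.4f})")
```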
7.17
A reminder of your learning outcomes
After completing this chapter, and having completed the essential reading and
activities, you should be able to:
calculate sample means, standard deviations and proportions, and demonstrate
their use as estimates
construct a confidence interval for a mean, a proportion, the difference between
means and the difference between proportions
know when to use Student’s t distribution.
You do not need to:
demonstrate the central limit theorem
know about the concepts relating to a ‘good’ estimator.
7. Point and interval estimation
7.18
Sample examination questions
Solutions can be found in Appendix B.
1. You are told that a 99% confidence interval for a single population proportion is
(0.3676, 0.5324).
(a) What was the sample proportion that led to this confidence interval?
(b) What was the size of the sample used?
2. In a random sample of 120 large retailers, 85 used linear regression as a method of
forecasting sales. In an independent random sample of 163 small retailers, 78 used
linear regression as a method of forecasting sales.
(a) Find a 98% confidence interval for the difference between the two population
proportions.
(b) What do you conclude from your confidence interval?
(c) Repeat (a) but for a 94% confidence interval.
3. A random sample of 21 female students is chosen from students at higher education
establishments in a particular area of a country, and it is found that their mean
height is 165 centimetres with a sample variance of 81.
(a) Assuming that the distribution of the heights of the students may be regarded
as normally distributed, calculate a 98% confidence interval for the mean
height of female students.
(b) You are asked to obtain a 98% confidence interval of width 3 centimetres for the
mean height. What sample size would be needed in order to achieve that degree of
accuracy?
(c) Suppose that a sample of 15 had been obtained from a single student hostel, in
which there were a large number of female students (still with a distribution of
heights which was well-approximated by the normal distribution). The mean
height is found to be 160 centimetres. Calculate a 99% confidence interval for
the mean height of female students in the hostel. How do their heights
compare with the group in part (a)?
Chapter 8
Hypothesis testing
8.1
Aims
In Chapters 6 and 7 you were introduced to the idea of the probability that a parameter
could lie within a range of values, and in particular the confidence interval (generally
90%, 95% or 99% confidence) for a parameter.
In this chapter, we are going to look at the idea of using statistics to see whether or not
we should reject statements about these parameters – the concept of hypothesis testing.
The arithmetic and underlying ideas you need are similar to those you met in the last
chapter.
You should aim to consolidate your familiarity with the ideas of randomness and work
on the different aspects of hypothesis testing as presented here.
8.2
Learning outcomes
After completing this chapter, and having completed the essential reading and
activities, you should be able to:
set up the null and alternative hypotheses for a problem and state whether the
latter is one-sided or two-sided, leading to a one-tailed or two-tailed test,
respectively
define and apply the terminology of hypothesis testing
define and explain the types of errors which can be made in hypothesis testing
perform hypothesis tests on means and proportions.
8.3
Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060]
Sections 9.1–9.4 and 10.1–10.3.
In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials
accessible via the ST104a Statistics 1 area at https://emfss.elearning.london.ac.uk
8.4
Further reading
Aczel, A.D. Complete Business Statistics (London: McGraw–Hill Higher
Education, 2009) seventh edition [ISBN 9780071287531] Chapters 7, 8 and 14
(taking care to omit topics not in the learning outcomes).
Anderson, D.R., D.J. Sweeney, T.A. Williams, J. Freeman and E. Shoesmith
Statistics for Business and Economics (South-Western Cengage Learning, 2010)
eleventh edition [ISBN 9780324783247] Chapter 9 and Sections 10.2, 12.2 and 14.9.
Lind, D.A., W.G. Marchal and S.A. Wathen Statistical Techniques in Business and
Economics (Boston: McGraw–Hill Higher Education, 2009) fourteenth edition
[ISBN 978007110004] Chapters 9, 10, 11 and the latter half of 16.
Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and
Economics. (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN
9780471615170] Chapter 9.
8.5
Introduction
In this chapter we introduce a branch of statistical inference known as hypothesis
testing. Typically, we choose between two statements about the value of a parameter
based on evidence obtained from sample data which, due to sample data being subject
to randomness, can be regarded as realisations of some random variable of interest.
Our objective is to choose between these two opposite statements about the population,
where these statements are known as hypotheses. By convention these are denoted by
H0 and H1 . Qualitative examples of such hypotheses are:
Null hypothesis, H0                              Alternative hypothesis, H1
A modified process does not produce a            A modified process does produce a
higher yield than the standard process.          higher yield than the standard process.
True economic output has not                     True economic output has
increased over a year.                           increased over a year.
A person is not gifted with                      A person is gifted with
Extra Sensory Perception.                        Extra Sensory Perception.
The average level of lead in the blood           The average level of lead in the blood
of people in a particular environment            of people in a particular environment
is as high as 3.0.                               is lower than 3.0.
From these examples we see that we use H0 to represent ‘no difference’, ‘no
improvement’, ‘no effect’ etc., and this is known as the null hypothesis, while H1 is
the alternative hypothesis.
Many statistical procedures can be represented as statements about the values of
population parameters such as the mean, µ, or variance, σ 2 . The first step in any
hypothesis testing problem is to ‘translate’ the real problem into its technical
counterpart. In ST104a Statistics 1, the hypotheses will only contain parameters. For
example, the previous hypotheses can all be ‘translated’ into technical forms similar to:
‘We observe x1 , x2 , . . . , xn from N (µ, σ 2 ) and we wish to test:
H0 : µ = µ0
vs. H1 : µ > µ0
where µ0 is some specified value such as 0 or 3.’
Performing the ‘translation’ correctly can itself be complicated, and so this requires
careful thought. Specifically, it is the form of H1 which needs consideration. The null
hypothesis, H0, will always denote the parameter value with equality, i.e. ‘=’,¹ such as:
H0 : µ = µ0 .
In contrast the alternative hypothesis, H1, will take one of three forms, i.e. using ‘≠’,
‘<’ or ‘>’, that is:
H1 : µ ≠ µ0    or    H1 : µ < µ0    or    H1 : µ > µ0 .
Note that only one of these forms will be used per test. To determine which form to use
will require careful consideration of the wording of the problem.
The form H1 : µ ≠ µ0 is an example of a two-tailed (or two-sided ) test and we use this
form with problems worded such as ‘test the hypothesis that µ is zero’. Here, there is no
indication of the value of µ if it is not zero – do we assume µ > 0 or µ < 0 in such
cases? We cannot be sure, so we take the safe option of a two-sided test, i.e. H1 : µ ≠ 0.
In contrast, had the problem been phrased as ‘test whether µ is greater than zero’, then
unambiguously we would opt for H1 : µ > µ0 . This is an example of an upper-tailed (or
one-sided ) test. Unsurprisingly, ‘test whether µ is less than zero’ leads to H1 : µ < µ0
which is a lower-tailed (also one-sided ) test.
Later, when testing for differences between two population means or two population
proportions, you need to look out for comparative phrases indicating if one population
value should exceed the other (for example, testing whether group A is on average
faster than group B). Practising problems will make you proficient in correctly
specifying your hypotheses.
8.6
Statistical tests
Consider the following two situations.
1. Suppose that it has been suggested that a particular environment is such that the
average lead content of blood is as high as 3 units. In order to assess the truth of
this, 50 blood samples selected at random are analysed and it is found that the
average of these is 2.8 and the sample standard deviation is 0.20.² Therefore, the
¹ Such an H0 is called a simple null hypothesis. It is possible to have a composite null hypothesis, such
as H0 : µ ≥ 0 which allows for more than one parameter value, although in ST104a Statistics 1 we
will only focus on simple forms of H0.
² Even when the population mean of the blood lead level is 3, it is unlikely we will observe the sample
mean to be exactly 3. However, we do expect the sample mean to be close to 3, while being willing to
tolerate a certain amount of deviation from that value, due to chance. The important question here is
‘Is this deviation from 3 due to chance alone, or is it too large to be explained purely by chance?’. The
deviation of 2.8 from 3 is far too large, when expressed in standard error units.
estimated standard error is s/√n = 0.20/√50 = 0.0283, so that the sample mean is
more than 7 (estimated) standard errors below the hypothetical mean – which is
highly unlikely if the hypothesis is true! The data are extremely inconsistent with
the hypothetical mean of 3, and so we reject this value on the evidence of the data.
2. A coin is tossed 100 times in order to decide whether or not it is a fair coin. We
observe 98 heads and 2 tails. Common sense tells us that this is (very!) strong
evidence that the coin is biased toward heads. More formally, we reject the
hypothesis that the coin is fair as the chance of obtaining 98 heads with a fair coin
is extremely low – the observation, 98 heads, is inconsistent with the coin being fair
since we would have expected (approximately) 50 heads and the observation
significantly departed from this expectation.
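The informal judgement about the coin can be made exact with a binomial calculation. As a sketch (standard-library Python, not part of the subject guide), the probability of observing 98 or more heads from a fair coin is:

```python
import math

n, k = 100, 98   # 100 tosses, 98 heads observed

# P(X >= 98) for X ~ Binomial(100, 0.5): sum the exact binomial probabilities.
p_upper = sum(math.comb(n, j) for j in range(k, n + 1)) / 2**n
print(f"P(at least {k} heads from a fair coin) = {p_upper:.3g}")
```

The probability is vanishingly small, which is precisely why common sense rejects the hypothesis that the coin is fair.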
In these examples it would be perfectly reasonable to make common sense judgements
since the data were so extreme, but these ‘common sense judgements’ are using exactly
the same underlying logic as do hypothesis tests which involve formal test statistics. By
fully understanding this logic, we are able to make optimal decisions in less extreme and
more complicated cases. A further advantage of the formal approach is that it enables us
to express numerically the degree of confidence in our inference procedure. This can be
extremely important in communicating our conclusions via reports, presentations etc.
Choosing between competing hypotheses requires us to conduct a statistical test. The
test procedures themselves involve straightforward calculations using the sample data.
At this point it is worth emphasising that in ‘classical testing’ we always assume the
null hypothesis is true by performing the test conditional on H0 being true. That is, H0
is our ‘working hypothesis’ which we hold to be true until we obtain significant evidence
against it.
A test statistic is the formal mechanism used to evaluate the support given to H0 by
sample data. Different test statistics are used for testing different forms of hypotheses.
For example, tests of a single mean, single proportion, differences between means or
proportions all require different test statistics.
Since H0 refers to the tested value of a population parameter, we use a suitable
estimator to estimate it based on the sample data. For example, X̄ would be used when
testing µ. Test statistics are constructed from the sampling distribution of the estimator
(introduced in Chapter 6). Hence test statistics are themselves random variables, as
they are functions of estimators (which are random variables). It follows, therefore, that
test statistics also follow probability distributions as they are random variables.
Since we assume H0 is true, the test statistic’s distribution is conditional on H0 . Using
the given sample data, we evaluate the test statistic to obtain a test statistic value
(analogous to obtaining a point estimate using an estimator). If we find this test
statistic value is sufficiently extreme – as determined by the test statistic distribution –
then we will have sufficiently strong evidence to justify rejecting H0 .
Of course, we need to specify what constitutes a ‘sufficiently extreme’ test statistic
value. In fact we define a critical region which spans test statistic values which are
considered sufficiently extreme. It follows that our decision rule is to reject H0 if the
test statistic value falls in this critical region. If it does not, then we fail to reject H0 ,
which is retained as our working hypothesis. For any test, the critical region is defined
by one or two critical values, depending on whether we are performing a one-tailed or
two-tailed test, respectively. These critical values are obtained from the test statistic’s
distribution under H0 .
8.7
Types of error
In any hypothesis test there are two types of inferential decision error which could be
committed. Clearly, we would like to reduce the probabilities of these errors as much as
possible. These two types of error are called Type I error and Type II error.
Type I and Type II errors
Type I error: rejecting H0 when it is true, which is a false positive. We denote
the probability of this type of error by α.
Type II error: failing to reject H0 when it is false, which is a false negative.
We denote the probability of this type of error by β.
Both errors are undesirable and, depending on the context of the hypothesis test, it
could be argued that either one is worse than the other.³ However, on balance, a Type I
error is usually considered to be more problematic. The possible decision space can be
presented as:
Decision            H0 true              H1 true
Not reject H0       Correct decision     Type II error
Reject H0           Type I error         Correct decision
For example, if H0 was being ‘innocent’ and H1 was being ‘guilty’, a Type I error would
be finding an innocent person guilty (bad for him/her), while a Type II error would be
finding a guilty person innocent (bad for the victim/society/justice, but admittedly
good for him/her!).
The complement of a Type II error, that is 1 − β, is called the power of the test – the
probability that the test will reject a false null hypothesis. You will not be expected to
calculate either β or (1 − β) in ST104a Statistics 1. However, you should understand
the implications of the different types of error and be able to complete the chart in the
second learning activity at the end of this chapter.
Example 8.1 In the example about lead levels discussed earlier, we would make a
Type I error if we decided the lead level was less than 3 when it was actually as high
as 3, and we would make a Type II error if we believed the level to be 3 when it was
actually lower than 3. These errors are taken into consideration in the formulation of
test procedures – the standard tests presented below are all devised to minimise the
probabilities of making such errors.
³ Which is better, a medical test incorrectly concluding a healthy person has an illness, or incorrectly
concluding that an ill person is healthy?
Activity 8.1
(a) Briefly define Type I and Type II errors in the context of the statistical test of a
hypothesis.
(b) What is the general effect on the probabilities of each type of these errors
happening if the sample size is increased?
Solution
(a) A Type I error is rejecting H0 when it is true. A Type II error is failing to reject
a false H0 . We can express these as conditional probabilities as follows:
α = P (Type I error) = P (Reject H0 | H0 is true)
and:
β = P (Type II error) = P (Not reject H0 | H1 is true).
(b) Increasing the sample size decreases the probabilities of making both types of
error.
8.8
Tests for normal populations
Let us suppose that, having obtained sample data from a normal distribution, we use
the estimator θ̂ to estimate an unknown parameter θ. Assume θ̂ has the sampling
distribution:
θ̂ ∼ N (θ, V )
where V = Var(θ̂) is assumed to be known. Hence the standard error, S.E.(θ̂), is also
known. The hypothesis test will proceed using the following steps.
1. Define the hypotheses. In this case suppose we test:
H0 : θ = θ0
vs. H1 : θ > θ0 .
Hence large values of θ̂ would imply H1 , while small values would tend to favour
H0 . Note in this case the alternative hypothesis is one-sided and this has an effect
on the format of the test, i.e. here we perform an upper-tailed test.
2. State the test statistic and its distribution, then compute its value.
Regardless of the type of parameter we are testing, we choose a test statistic whose
distribution, conditional on H0 , has a tabulated distribution (i.e. one for which we
have statistical tables) such as the standard normal, Student’s t, or chi-squared
distributions (the chi-squared distribution is introduced in Chapter 9). In this case,
due to the known standard error, standardisation of θ̂ readily yields:
(θ̂ − θ0 ) / S.E.(θ̂) ∼ N (0, 1).
Next compute the test statistic value using the sample data, where θ̂ will be our
point estimate of θ.
3. Define the critical region for a given significance level, α.⁴ Obtain critical
value(s) from statistical tables to define the critical region (also referred to as the
rejection region). For this example, let α = 0.05, where α is our chosen probability
of a Type I error. Since we are conducting an upper-tailed test, we require
zα = z0.05 = 1.645, i.e. the z-value, for the standard normal distribution, above
which lies α = 0.05 = 5% probability.
4. Decide whether or not to reject H0 . Now we decide whether or not to reject
H0 (our working hypothesis which, up to now, is assumed to be true). If the test
statistic value lies in the critical region then we will reject H0 , otherwise we do not
reject H0 . A rejected H0 means the test is statistically significant at the specific
significance level used. In the examination you will typically be required to test at
two significance levels sequentially – more on this later.
5. Draw conclusions. It is always important to draw conclusions in the context of
the original problem. It is not sufficient to conclude ‘the test is significant at the
5% significance level’. This is just a technical step which guides us to make a better
decision about the original real-world problem, and final conclusions should be
drawn in terms of that problem.
An important aspect of this test is the simple form of the test statistic, namely:
(point estimator − hypothesised value) / ((estimated) standard error).
Although this is not the format of all test statistics, it does cover a wide class of
situations, such as the:
test of a single population mean, H0 : µ = µ0
test of a single population proportion, H0 : π = π0
test of the equality of two means, H0 : µ1 = µ2
test of the equality of two proportions, H0 : π1 = π2 .
8.9
Significance levels
Step 3 of the previous test procedure referred to a significance level. We control for
the probability of committing a Type I error by setting the probability of this to be α.
Hence the significance level, or size, of a test is just the probability of committing a
Type I error. If we perform a test at the 5% significance level, say, then we are actually
basing our decision on a procedure which gives us a 5% probability of making a Type I
error when H0 is true.
In the development of the general normal test above, we used a significance level of 5%.
This is one of the most widely-used significance levels, although 1% and 10% are also
⁴ See the next section for a discussion of significance levels.
sometimes used. The value of quoting the significance level used for the test is that it
gives an acceptable method of expressing the user’s confidence in the decision which was
made. A test which is significant at the 1% significance level expresses a strong
conviction that H0 is untrue, whereas a test which is significant only at the 10%
significance level expresses merely some doubt in H0 .
The significance levels – 10%, 5% and 1%, corresponding to α values of 0.10, 0.05 and
0.01, respectively – provide a means of deciding which values of the test statistic are
extreme. Consider, for example, an upper-tailed test of:
H0 : µ = 2 vs. H1 : µ > 2.
We calculate the test statistic value and if it falls deep enough into the right-hand tail
of the test statistic’s distribution (the critical region) then we reject H0 . Where do we
draw the line which signals the start of this critical region? That is, what critical value
should we use? An accepted convention is to use a critical value which ‘cuts off’ a tail of
size α. So, if the test statistic follows a standard normal distribution, for an
upper-tailed test, using Table 5 of the New Cambridge Statistical Tables, we denote this
cut-off point as zα which for:
α = 0.10 ⇒ z0.10 = 1.2816
α = 0.05 ⇒ z0.05 = 1.6449
α = 0.01 ⇒ z0.01 = 2.3263
such that if the test statistic value exceeds the critical value, then the ‘test is significant
at the 100α% significance level’.
Clearly, significance at the 1% significance level (2.3263 < test statistic value) implies
significance at the 5% significance level (since 1.6449 < 2.3263) which, in turn, implies
significance at the 10% significance level (since 1.2816 < 1.6449 < 2.3263). When a test
is significant we should always state the strongest level of significance possible, as this
provides the best measure of how inconsistent the data are with respect to the null
hypothesis.
If the test statistic value falls between 1.2816 and 1.6449, then the test is significant at
the 10% significance level, but not at the 5% significance level. This type of result arises
from rather inconclusive data and so it would be better to sample more data in the hope
that a firmer conclusion can be drawn one way or the other. Such an outcome is saying
that the null hypothesis is to be treated with suspicion, but that we cannot confidently
reject it. These points and ideas will be reinforced in the subsequent examples.
8.9.1
Order of conducting tests
The best practice for answering hypothesis test questions in the examination is to test
at two significance levels. However, the order in which you apply the significance levels
is taken into account by the examiners. Noting the discussion above, a test which is
significant at the 5% significance level, say, must also be significant at the 10%
significance level, so you would be penalised if you used 10% as your second significance
level having just rejected at the 5% significance level as there is no ‘value added’ from
use of the 10% significance level. In such a situation it would be more appropriate to
test at the 1% significance level next.
Unless an examination question states otherwise, a sensible strategy to follow is to
initially test at the 5% significance level and then test either at the 1% or 10%
significance level, depending on whether or not you have rejected at the 5% significance
level. A ‘decision tree’ depiction of the procedure to follow is presented in Figure 8.1.
Figure 8.1: Significance level decision tree. In outline:
  Start at the 5% level.
    Reject H0 → choose the 1% level:
      Reject H0     → test result is ‘highly significant’
      Not reject H0 → test result is ‘moderately significant’
    Not reject H0 → choose the 10% level:
      Reject H0     → test result is ‘weakly significant’
      Not reject H0 → test result is ‘not significant’
As indicated in Figure 8.1, it is possible to state whether a test result is highly
significant, moderately significant, weakly significant or not significant, and hence we
can convey a measure of the ‘strength’ of any statistical significance.
Activity 8.2
(a) Explain what is meant by the statement: ‘The test is significant at the 5%
significance level’.
(b) How should you interpret a test which is significant at the 10% significance
level, but not at the 5% significance level?
Solution
(a) ‘The test is significant at the 5% significance level’ means there is a less than
5% chance of getting data as extreme as actually observed if the null hypothesis
was true. This implies that the data are inconsistent with the null hypothesis,
which we reject, i.e. the test is ‘moderately significant’.
(b) A test which is significant at the 10% significance level, but not at the 5%
significance level is often interpreted as meaning there is some doubt about the
null hypothesis, but not enough to reject it with confidence, i.e. the test is
‘weakly significant’.
8.10
One-tailed and two-tailed tests
The tests which have been described so far have been one-tailed (specifically,
upper-tailed). This is because the alternative hypotheses used were one-sided. Recall
the discussion in Section 8.5 which identified three different forms of H1 . For example, if
testing a population mean H0 : µ = µ0 , these are:
i. H1 : µ ≠ µ0    or    ii. H1 : µ < µ0    or    iii. H1 : µ > µ0 .
If H1 is of the form i., then both tails of the test statistic’s distribution will form the
critical region, known as a two-tailed test. The critical region will still be of total area α,
so it is logical to split this α equally between the two tails. So if the test statistic follows
a standard normal distribution, for a two-tailed test, using Table 5 of the New
Cambridge Statistical Tables and noting symmetry of the standard normal distribution
about zero:
α = 0.10 ⇒ zα/2 = z0.05 = 1.6449, giving critical values of ± 1.6449
α = 0.05 ⇒ zα/2 = z0.025 = 1.9600, giving critical values of ± 1.9600
α = 0.01 ⇒ zα/2 = z0.005 = 2.5758, giving critical values of ± 2.5758
such that if the test statistic value exceeds either critical value (i.e. greater than the
upper critical value, or less than the lower critical value), then the ‘test is significant at
the 100α% significance level’.
If H1 is of the form ii., then this is a lower-tailed test. Hence the critical region is in the
left-hand tail of the test statistic’s distribution. So if the test statistic follows a standard
normal distribution, for a lower-tailed test, using Table 5 of the New Cambridge
Statistical Tables and noting symmetry of the standard normal distribution about zero:
α = 0.10 ⇒ z0.90 = −1.2816
α = 0.05 ⇒ z0.95 = −1.6449
α = 0.01 ⇒ z0.99 = −2.3263
such that if the test statistic value is more extreme than the critical value (i.e. falls
below it), then the ‘test is significant at the 100α% significance level’.
If H1 is of the form iii., then we have an upper-tailed test, which has already been
discussed.
8.11 P-values
Before proceeding to examples of hypothesis test situations, we will discuss an
alternative approach to hypothesis testing based on the p-value.
So far we have spoken about defining a critical region and rejecting the null hypothesis,
H0 , whenever the test statistic value falls in this critical region. An alternative approach
is to compute the probability of obtaining the observed test statistic value or a more
extreme value conditional on H0 being true, using the probability distribution of the
test statistic. This probability is the p-value.
So, given we reject H0 if the test statistic value is sufficiently extreme, we would also
reject H0 if we obtained a sufficiently small p-value. To summarise, in the p-value world
of testing, we use the following simple decision rule.
Hypothesis testing – p-value decision rule
Reject H0 if p-value < α, where α is the significance level.
Do not reject H0 if p-value ≥ α.
It remains to show how to calculate the p-value. Again, this will depend on the form of
H1 and the test statistic distribution.
Calculating p-values
Let the test statistic value be x, and the test statistic (a random variable as it is
constructed from an estimator) be X. Assuming a test statistic distribution which is
symmetric about zero (for the two-tailed test), in general these can be summarised
as follows.
Form of H1          p-value calculation
H1 : θ ≠ θ0         2 × P (X ≥ |x|)
H1 : θ < θ0         P (X ≤ x)
H1 : θ > θ0         P (X ≥ x)
In each case use the New Cambridge Statistical Tables corresponding to the distribution
of test statistic X.
Example 8.2 If X ∼ N (0, 1), and we obtain the test statistic value x = −1.82,
then for a lower-tailed test the p-value would be:
P (X ≤ −1.82) = 1 − Φ(1.82) = 1 − 0.9656 = 0.0344
using Table 4 of the New Cambridge Statistical Tables. Hence this is significant at
the 5% significance level (since 0.0344 < 0.05), but not at the 1% significance level
(since 0.0344 > 0.01), indicating a moderately significant result.
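The same calculation can be checked in Python, where Φ is available as the cdf method of the standard library's NormalDist:

```python
from statistics import NormalDist

x = -1.82                       # observed test statistic value
p_value = NormalDist().cdf(x)   # lower-tailed: P(X <= -1.82)
print(f"p-value = {p_value:.4f}")   # about 0.0344, matching the tables
```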
The next few sections detail various hypothesis testing situations analogous to the
confidence interval situations presented in Chapter 7.
8.12
Hypothesis test for a single mean (σ known)
We first consider the test of a single population mean when the population standard
deviation, σ, is known. To test H0 : µ = µ0 when sampling from N (µ, σ 2 ) we use the
following test statistic.
z-test of hypothesis for a single mean (σ known)
In this case, the test statistic is:
Z=
X̄ − µ0
√ ∼ N (0, 1).
σ/ n
(8.1)
Hence critical values are obtained from the standard normal distribution, i.e. using
Table 5 of the New Cambridge Statistical Tables.
Example 8.3 The mean lifetime of 100 components in a sample is 1,570 hours and
their standard deviation is known to be 120 hours. µ is the mean lifetime of all the
components produced. Is it likely the sample comes from a population whose mean
is 1,600 hours?
We perform a two-tailed test since we are testing whether or not µ is 1,600.⁵
Hence we test:
H0 : µ = 1600 vs. H1 : µ ≠ 1600.
Since σ is known, we use (8.1) to calculate the test statistic value, which is:
z = (x̄ − µ0 ) / (σ/√n) = (1570 − 1600) / (120/√100) = −2.5.
We first test at the 5% significance level. Since this is a two-tailed test, the critical
values are ±1.96. Since the test statistic value is in the critical region
(−2.5 < −1.96), we reject H0 . If we now refer to the significance level decision tree
(Figure 8.1), we now test at the 1% significance level with critical values of ±2.5758.
Since −2.5758 < −2.5 we are (just) unable to reject H0 and conclude that the test
result is ‘moderately significant’. Hence at the 5% significance level there is evidence
to suggest that the mean lifetime of components, µ, is not equal to 1,600.
If we wanted to compute the p-value, then from Section 8.11, for this two-tailed test
the p-value would be:
p-value = 2 × P (Z > | − 2.5|) = 2 × P (Z ≥ 2.5) = 2 × 0.0062 = 0.0124.
Since 0.01 < 0.0124 < 0.05, we see that using the p-value testing method we still
conclude (as we must) that the test is significant at the 5% significance level, but not
at the 1% significance level.
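As an illustrative check, Example 8.3 can be reproduced with a few lines of Python (a sketch using only the standard library, not part of the subject guide):

```python
import math
from statistics import NormalDist

n, xbar, sigma, mu0 = 100, 1570, 120, 1600

# Test statistic for a z-test of a single mean, sigma known (equation (8.1)).
z = (xbar - mu0) / (sigma / math.sqrt(n))

# Two-tailed p-value: 2 * P(Z >= |z|).
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```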
⁵ Common sense might lead us to perform a lower-tailed test since 1570 < 1600, suggesting that if µ
is not 1,600, then it is likely to be less than 1,600. However, since the question is phrased as a two-tailed
8.13
Hypothesis test for a single mean (σ unknown)
Here the only difference from the previous case is that σ² is unknown, in which case it
is estimated with the sample variance estimator, S². To test H0 : µ = µ0 when
sampling from N (µ, σ²) we use the following test statistic.
t-test of hypothesis for a single mean (σ unknown)
In this case, the test statistic is:
T = (X̄ − µ0 ) / (S/√n) ∼ tn−1 .    (8.2)
Hence critical values are obtained from the Student’s t distribution with n−1 degrees
of freedom, i.e. using Table 10 of the New Cambridge Statistical Tables.
Example 8.4 A study on the impact of comprehensive planning on financial
performance reported that the average annual return on investment for American
banks was 10.2%, and suggested that banks who exercised comprehensive planning
would do better than this. A random sample of 26 such banks gave the following
returns on investment. Do these data support the conjecture?
10.00, 11.90, 9.90, 10.09, 10.31, 9.96, 10.34, 10.30, 10.50,
10.23, 10.72, 11.54, 10.81, 10.15, 9.04, 11.55, 10.81, 8.69,
10.74, 10.31, 10.76, 10.92, 11.26, 11.21, 10.20, 10.76.
We test:
H0 : µ = 10.2 vs. H1 : µ > 10.2.
Note that the alternative hypothesis is one-sided as this is the region of interest (we
hypothesise that banks exercising comprehensive planning perform better than
10.2%). This implies that an upper-tailed test is needed.
The summary statistics are n = 26, x̄ = 10.5 and s = 0.714. Hence, using (8.2), the
test statistic value is:
t = (x̄ − µ0)/(s/√n) = (10.5 − 10.2)/(0.714/√26) = 2.14.
We compare this test statistic value with the upper tail of the Student’s t
distribution with 26 − 1 = 25 degrees of freedom.
If we choose α = 0.05,⁶ then Table 10 of the New Cambridge Statistical Tables can
be used to find the (upper tail) critical value of 1.708. For α = 0.01, the critical
value is 2.485. Since 1.708 < 2.14 < 2.485, we conclude that the test is significant at
the 5% significance level, but not at the 1% significance level, indicating that the
result is moderately significant.
Using the p-value approach, the p-value is calculated to be:
p-value = P (T > 2.14) ≈ 0.02
8. Hypothesis testing
where T ∼ t25 , using Table 9 of the New Cambridge Statistical Tables, with 24 as the
nearest available degrees of freedom. We see that 0.01 < 0.021 < 0.05 so, of course,
we reach the same conclusion. Note that we can view the p-value as the smallest α
value such that we would reject the null hypothesis.⁷
We conclude that there is moderate evidence against the null hypothesis and that
comprehensive planning does improve the average annual return on investment.
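The summary statistics and the test statistic in Example 8.4 can be verified directly from the data with a short Python sketch (the snippet simply reproduces the calculation; it is not part of the syllabus):

```python
from statistics import mean, stdev

# Returns on investment (%) for the 26 banks in Example 8.4.
returns = [10.00, 11.90, 9.90, 10.09, 10.31, 9.96, 10.34, 10.30, 10.50,
           10.23, 10.72, 11.54, 10.81, 10.15, 9.04, 11.55, 10.81, 8.69,
           10.74, 10.31, 10.76, 10.92, 11.26, 11.21, 10.20, 10.76]

n = len(returns)                      # 26
x_bar = mean(returns)                 # 10.5
s = stdev(returns)                    # 0.714 (sample standard deviation)
t = (x_bar - 10.2) / (s / n ** 0.5)   # compared with t on 25 degrees of freedom

print(round(t, 2))  # 2.14
```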
Activity 8.3 A sample of seven batteries is taken at random from a large batch of
(nominally 12 volt) batteries. These are tested and their true voltages are shown
below:
12.9   11.6   13.5   13.9   12.1   11.9   13.0.
(a) Test whether the mean voltage of the whole batch is 12 volts using two
appropriate significance levels.
(b) Test whether the mean batch voltage is less than 12 volts using two appropriate
significance levels.
(c) Which test do you think is the more appropriate? Briefly explain why.
Solution
(a) We are to test H0 : µ = 12 vs. H1 : µ ≠ 12. The key points here are that n is
small and that σ² is unknown. We can use the t test, and this is valid provided
the data are normally distributed. The test statistic value is:
t = (x̄ − 12)/(s/√7) = (12.7 − 12)/(0.858/√7) = 2.16.
This is compared to a Student’s t distribution on n − 1 = 6 degrees of freedom.
The critical values corresponding to a 5% significance level two-tailed test are
±t0.025, 6 = ±2.447, using Table 10 of the New Cambridge Statistical Tables.
Hence we cannot reject the null hypothesis at the 5% significance level since
2.16 < 2.447. Moving to the 10% significance level the critical values are
±t0.05, 6 = ±1.943, so we can reject H0 and conclude that the test is weakly
significant as there is weak evidence that µ ≠ 12.
(b) We are to test H0 : µ = 12 vs. H1 : µ < 12. There is no need to do a formal
statistical test. As the sample mean is 12.7, which is greater than 12, the test
statistic value will be positive while the lower-tail critical value(s) will be
negative and hence there is no evidence whatsoever in favour of the alternative
hypothesis.
⁶ Recall that if the question does not specify a significance level, then α = 0.05, representing a 5% significance level, is a good starting point.
⁷ You are not expected to calculate p-values from the Student's t distribution in the examination, although the examiners will not penalise those who use this approach provided the p-value is compared against two appropriate significance levels.
(c) In (a) you are asked to conduct a two-tailed test; in (b) it is a one-tailed test.
Which is more appropriate will depend on the purpose of the experiment, and
your suspicions before you conduct it.
• If you suspected before collecting the data that the mean voltage was less
than 12 volts, the test in (b) would be appropriate.
• If you had no prior reason to believe that the mean was less than 12 volts,
you would use the test in (a).
• General rule: decide on whether it is a one-tailed or two-tailed test before
collecting the data!
Activity 8.4 A salesman claims that he can sell at least 100 DVD players per day.
Over a period of 11 days his resulting sales were as shown below. Do the data
support his claim?
94   105   100   90   96   101   92   97   93   100   91.
Solution
If µ is the mean number of sales per day, then we wish to test:
H0 : µ = 100 vs. H1 : µ < 100.
The main statistics are: n = 11, sample mean x̄ = 96.273, sample standard
deviation s = 4.777, and the estimated standard error of the mean is
4.777/√11 = 1.440.
So the test statistic value is:
t = (96.273 − 100)/1.440 = −2.59.
Comparing this with t10 (since there are 11 − 1 = 10 degrees of freedom), we see that
it is significant at the 5% significance level (for this one-tailed test, the critical value
is −t0.05, 10 = −1.812 and −2.59 < −1.812).
However, it is not significant at the 1% significance level (for which the critical value
is −t0.01, 10 = −2.764 and −2.764 < −2.59). Hence we reject the salesman's claim,
with only moderate evidence against it.
Activity 8.5 A random sample of 18 observations from the population N (µ, σ 2 )
yields the sample mean x̄ = 8.83 and the sample variance s2 = 1.59. At the 1%
significance level, test the following hypotheses by obtaining critical values:
(a) H0 : µ = 8 vs. H1 : µ > 8.
(b) H0 : µ = 8 vs. H1 : µ < 8.
(c) H0 : µ = 8 vs. H1 : µ ≠ 8.
Repeat the above exercise with the additional assumption that σ 2 = 1.59. Compare
the results with those derived without this assumption and comment.
Solution
When σ² is unknown, we use the test statistic T = √n(X̄ − 8)/S. Under H0 ,
T ∼ t17 . With α = 0.01, we reject H0 if:
(a) t > t0.01, 17 = 2.567, against H1 : µ > 8.
(b) t < −t0.01, 17 = −2.567, against H1 : µ < 8.
(c) |t| > t0.005, 17 = 2.898, against H1 : µ ≠ 8.
For the given sample, t = 2.79. Hence we reject H0 against the alternative
H1 : µ > 8, but we will not reject H0 against the two other alternative hypotheses.
When σ² is known, we use the test statistic Z = √n(X̄ − 8)/σ. Under H0 ,
Z ∼ N (0, 1). With α = 0.01, we reject H0 if:
(a) z > z0.01 = 2.326, against H1 : µ > 8.
(b) z < −z0.01 = −2.326, against H1 : µ < 8.
(c) |z| > z0.005 = 2.576, against H1 : µ ≠ 8.
For the given sample, z = 2.79. Hence we reject H0 against the alternative H1 : µ > 8
and H1 : µ 6= 8, but we will not reject H0 against H1 : µ < 8.
With σ² known, we should be able to perform inference better simply because we
have more information about the population. More precisely, for the given
significance level, we require less extreme values to reject H0 . Put another way, the
p-value of the test is reduced when σ² is given. Therefore, the risk of failing to
reject a false H0 is also reduced.
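The single test statistic value in Activity 8.5, and the comparison against the t and z critical values quoted above, can be sketched in a few lines of Python (for checking only; the critical values are taken from the tables, as in the solution):

```python
from math import sqrt

# Activity 8.5: n = 18, sample mean 8.83, sample variance 1.59, testing mu = 8.
n, x_bar, s2 = 18, 8.83, 1.59

# The same numerical value serves as t (sigma^2 unknown) and z (sigma^2 known).
stat = sqrt(n) * (x_bar - 8) / sqrt(s2)

print(round(stat, 2))  # 2.79

# Table critical values at the 1% significance level:
#   t17: 2.567 (one-tailed), 2.898 (two-tailed)
#   z:   2.326 (one-tailed), 2.576 (two-tailed)
# 2.567 < 2.79 < 2.898, so the t test rejects only against H1: mu > 8,
# while 2.79 > 2.576 means the z test also rejects against H1: mu != 8.
```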
Activity 8.6 To instil customer loyalty, airlines, hotels, rental car companies and
credit card companies (among others) have initiated marketing programmes which
reward their loyal customers. In the United States alone, several million people are
members of the frequent-flyer programmes of the airline industry.
A large fast-food restaurant chain wished to explore the profitability of such a
programme. They randomly selected 12 of their 1,200 restaurants nationwide and
instituted a loyalty programme which rewarded customers with a $5.00 gift
certificate after every 10 meals purchased at full price.
They ran the trial programme for three months. The restaurants not in the sample
had an average increase in profits of $1,047.34 over the previous three months,
whereas the restaurants in the sample had the following changes in profit:
$2,232.90   $545.47     $3,440.70   $1,809.10
$6,552.70   $4,798.70   $2,965.00   $2,610.70
$3,381.30   $1,591.40   $2,376.20   −$2,191.00
Note that the last number is negative, representing a decrease in profits. Specify the
appropriate null and alternative hypotheses for determining whether the mean profit
change for restaurants with loyalty programmes is significantly greater (in a
statistical sense, which you should make clear) than $1,047.34, and conduct an
appropriate test.
Solution
It is useful to discuss the issues about this question first.
• We want to know whether a loyalty programme such as that at the 12 selected
restaurants would result in an increase in mean profits greater than that
observed (during the three-month test) at the other sites within the chain.
• So we can model the profits across the chain as $1047.34 + x, where $x is the
supposed effect of the promotion, and if the true mean value of x is µ, then we
wish to test:
H0 : µ = 0 vs. H1 : µ > 0
which is a one-tailed test since, clearly, there are (preliminary) grounds for
thinking that there is an increase due to the loyalty programme.
• We know nothing about the variability of profits across the rest of the chain, so
we will have to use the sample data, i.e. to calculate the sample variance and to
employ the t distribution on ν = 12 − 1 = 11 degrees of freedom.
• Although we shall want the variance of the data ‘sample value −1047.34’, this
will be the same as the variance of the sample data, since for any random
variable X and constant k we have:
Var(X + k) = Var(X)
because in calculating the variance every value (xi − x̄) is ‘replaced’ by:
(xi + k) − (x̄ + k)
which is in fact the same value.
• So we need to calculate x̄, the sum of squares Σx²ᵢ, Sxx and s².
The total change in profits for restaurants in the programme is Σxᵢ = 30113.17.
Since n = 12, the mean change in profit for restaurants in the programme is
30113.17/12 = 2509.431 = 1047.34 + 1462.091, hence we use x̄ = 1462.091.
The raw sum of squares is Σx²ᵢ = 126,379,568.8. So the corrected sum of squares is:
Sxx = Σx²ᵢ − nx̄² = 126,379,568.8 − 12 × (2509.431)² = 50,812,651.51
using the mean of the raw data, 2509.431 (Sxx is unchanged by the shift). Hence:
s² = Sxx/(n − 1) = 50,812,651.51/11 = 4,619,331.956.
Therefore, the estimated standard error is:
s/√n = √(4,619,331.956/12) = √384,944.3296 = 620.439.
So the test statistic value is:
t = (x̄ − µ0)/(s/√n) = (1462.091 − 0)/620.439 = 2.3567
on 11 degrees of freedom.
The relevant critical values for t11 in this one-tailed test are:
t0.05, 11 = 1.796 and t0.01, 11 = 2.718.
So we see that the test is significant at the 5% significance level, but not at the 1%
significance level (1.796 < 2.357 < 2.718), so we can reject H0 and conclude that the
test is moderately significant with moderate evidence that the loyalty programme
has an effect.
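The whole calculation for this activity can be checked against the raw data with a short Python sketch (again purely for verification; the shift by $1,047.34 leaves the variance unchanged, exactly as argued above):

```python
from math import sqrt
from statistics import mean, stdev

# Changes in profit ($) for the 12 restaurants in the loyalty programme.
profits = [2232.90, 545.47, 3440.70, 1809.10,
           6552.70, 4798.70, 2965.00, 2610.70,
           3381.30, 1591.40, 2376.20, -2191.00]

# Work with x = (profit change) - 1047.34, the excess over the other restaurants.
x = [p - 1047.34 for p in profits]

x_bar = mean(x)                   # 1462.091
se = stdev(x) / sqrt(len(x))      # 620.439 (shifting the data does not change s)
t = x_bar / se                    # about 2.357, on 11 degrees of freedom

print(round(t, 2))  # 2.36
```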
8.14 Hypothesis test for a single proportion
We now consider the hypothesis test for a single proportion. Recall from Section 7.10
that the standard error for a proportion is S.E.(P) = √(π(1 − π)/n). When testing
H0 : π = π0 , under H0 the standard error is √(π0(1 − π0)/n), leading to the following
test statistic, obtained by standardising the sample proportion estimator, P, given in (7.1).
z test of hypothesis for a single proportion
In this case, the test statistic is:
Z = (P − π0)/√(π0(1 − π0)/n) ∼ N (0, 1) (approximately).   (8.3)
Hence critical values are obtained from the standard normal distribution, i.e. using
Table 5 of the New Cambridge Statistical Tables.
Example 8.5 To illustrate this, let us reconsider Example 7.3 in which a survey
was conducted by a bank to estimate the proportion of its customers who would be
interested in using a proposed new mobile telephone banking service. If we denote
the population proportion of customers who are interested in this proposal by π, and
it is found that 68 out of a random sample of 150 customers are in favour, then we
would estimate π with p = 68/150 = 0.453. Let us suppose that other surveys have
shown that 40% of the public are interested in mobile telephone banking and it is
proposed to test whether or not the above survey agrees with this figure, i.e. we
conduct a two-tailed test. Hence:
H0 : π = 0.4 vs. H1 : π ≠ 0.4.
The test statistic value, using (8.3), is:
z = (p − π0)/√(π0(1 − π0)/n) = (0.453 − 0.4)/√(0.4 × 0.6/150) = 1.325.
Let α = 0.05, then we compare this test statistic value with the critical values ±1.96
(using Table 5 of the New Cambridge Statistical Tables), so it can be seen that the
test is not significant at the 5% significance level as we do not reject H0 since
1.325 < 1.96. The significance level decision tree (see Figure 8.1) now requires us to
test at the 10% significance level, which gives critical values of ±1.6449, so again we
do not reject H0 since 1.325 < 1.6449, i.e. the test result is not statistically
significant.
We conclude that these data are consistent with the null hypothesis and that the
level of interest shown among the bank’s customers for the proposed new mobile
telephone banking service may well be 40%.
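Following (8.3), the test statistic in Example 8.5 can be reproduced with a quick Python sketch (note it uses the rounded proportion 0.453, as in the worked example; the unrounded 68/150 gives z = 1.33):

```python
from math import sqrt

# Example 8.5: 68 of 150 customers in favour; test H0: pi = 0.4 (two-tailed).
n, p0 = 150, 0.4
p = round(68 / n, 3)                     # 0.453, as rounded in the example
z = (p - p0) / sqrt(p0 * (1 - p0) / n)   # standard error sqrt(0.4*0.6/150) = 0.04

print(round(z, 3))  # 1.325
```

Since 1.325 < 1.6449, the test statistic falls short of even the 10% critical value, consistent with the conclusion above.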
Activity 8.7 In a particular city it is known, from past surveys, that 25% of
households regularly use a washing powder named ‘Snolite’. After an advertising
campaign, a survey of 300 randomly selected households showed that 100 had
recently purchased ‘Snolite’. Is there evidence that the campaign has been
successful? Use a 1% significance level.
Solution
Here we are testing a proportion: H0 : π = 0.25 vs. H1 : π > 0.25. Note that this is a
one-sided test – we have reason to believe that the sales campaign has increased
sales, and we believe this before collecting any data. As this is a test for a single
proportion and n is large, we compute the test statistic value to be:
z = (p − π0)/√(π0(1 − π0)/n) = (100/300 − 0.25)/√0.000625 = 3.33.
Compare this to the critical value of z0.01 = 2.326 (from the bottom row of Table 10
of the New Cambridge Statistical Tables) and you will see that it is significant at the
1% significance level since 2.326 < 3.33. That is, there is very strong evidence that
more than 25% of households use ‘Snolite’. It appears that the campaign has been
successful.
Activity 8.8 Consumer Reports evaluated and rated 46 brands of toothpaste. One
attribute examined in the study was whether or not a toothpaste brand carries an
American Dental Association (ADA) seal verifying effective decay prevention. The
data for the 46 brands (coded as 1 = ADA seal, 0 = no ADA seal) are reproduced
below.
0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 1 1 1 0 1
1 1 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 1 1 1 0 0 1
(a) State the null and alternative hypotheses testing whether the true proportion of
toothpaste brands with the ADA seal verifying effective decay prevention is less
than 0.5.
(b) Assuming that the sample was random, from a much larger population, conduct
a test between the two hypotheses.
Solution
(a) Appropriate hypotheses are H0 : π = 0.50 vs. H1 : π < 0.50.
(b) The data show that there are r = 21 ADA seals among n = 46 items in the
sample. Let π be the true proportion. If H0 is true then the sample proportion
P ∼ N (0.50, 0.005435), approximately, since 46 is ‘large’ and hence we can use
the normal approximation with µ = 0.50 and σ² = π(1 − π)/n = (0.50)²/46
= 0.005435.
The test statistic value for this test would then be:
z = (21/46 − 0.5)/√0.005435 = −0.04348/0.07372 = −0.59.
This is clearly not going to be significant at any plausible significance level (in
fact the p-value is P (Z < −0.59) = 0.2776). Therefore, we cannot reject the null
hypothesis that half the brands have the ADA seal.
Activity 8.9 A government agency carries out a large-scale random survey of
public attitudes towards job security. 78 of the 300 workers surveyed indicated they
were worried about losing their job. A newspaper report claims 30% of workers fear
losing their job. Is such a high percentage claim justified? State and carry out an
appropriate hypothesis test at two appropriate significance levels and explain your
results.
Solution
The null and alternative hypotheses are:
H0 : π = 0.30 vs. H1 : π < 0.30.
The test statistic under H0 is:
Z = (P − π0)/√(π0(1 − π0)/n) = (P − 0.30)/√(0.30 × (1 − 0.30)/n) ∼ N (0, 1).
The sample proportion is p = 78/300 = 0.26 and the test statistic value is −1.51.
For α = 0.05, the one-tailed (lower tail) critical value is z0.95 = −1.645. Since
−1.645 < −1.51, we do not reject H0 at the 5% significance level.
Moving to α = 0.10, the critical value is now z0.90 = −1.282. Since −1.51 < −1.282,
we reject H0 at the 10% significance level.
The test is weakly significant and we conclude that there is weak evidence that fewer
than 30% of workers were worried about losing their jobs, suggesting there is only a
marginal reason to doubt the newspaper report’s claim.
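A quick Python sketch confirms both the test statistic and the weak significance of the result (the p-value lies between 0.05 and 0.10):

```python
from math import sqrt
from statistics import NormalDist

# Activity 8.9: 78 of 300 workers worried; test H0: pi = 0.30 vs. H1: pi < 0.30.
n, p0 = 300, 0.30
p = 78 / n                               # 0.26
z = (p - p0) / sqrt(p0 * (1 - p0) / n)   # about -1.51
p_value = NormalDist().cdf(z)            # lower-tailed p-value, about 0.065

print(round(z, 2))  # -1.51
```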
Activity 8.10 If you live in California, the decision to purchase earthquake
insurance is a critical one. An article in the Annals of the Association of American
Geographers (June 1992) investigated many factors which California residents
consider when purchasing earthquake insurance. The survey revealed that only 133
of 337 randomly selected households in Los Angeles County were protected by
earthquake insurance.
(a) What are the appropriate null and alternative hypotheses to test the research
hypothesis that fewer than 40% of households in Los Angeles County were
protected by earthquake insurance?
(b) Do the data provide sufficient evidence to support the research hypothesis? (Use
α = 0.10.)
(c) Calculate and interpret the p-value for the test.
Solution
(a) Appropriate hypotheses are H0 : π = 0.40 vs. H1 : π < 0.40.
(b) The estimate of the proportion covered by insurance is 133/337 = 0.395. As this
is a test of a single proportion and n is large, the test statistic value is:
z = (p − π0)/√(π0(1 − π0)/n) = (0.395 − 0.40)/0.02669 = −0.20.
Compare this to the critical value of z0.90 = −1.282 (from the bottom row of
Table 10 of the New Cambridge Statistical Tables) and you will see that it is not
significant at even the 10% significance level since −1.282 < −0.20. Hence the
test is insignificant, i.e. there is no evidence that fewer than 40% of households
in Los Angeles County were protected by earthquake insurance.
(c) To compute the p-value for this lower-tailed test, we determine:
P (Z < −0.20) = 0.4207
using Table 4 of the New Cambridge Statistical Tables. Hence we would not
reject H0 for any α < 0.4207, remembering that the p-value is the smallest
significance level such that we reject the null hypothesis.
Activity 8.11 A museum conducts a survey of its visitors in order to assess the
popularity of a device which is used to provide information on the museum exhibits.
The device will be withdrawn if fewer than 30% of all of the museum’s visitors make
use of it. Of a random sample of 80 visitors, 20 chose to use the device.
(a) Carry out a test at the 5% significance level to see if the device should be
withdrawn and state your conclusions.
(b) Determine the p-value of the test.
Solution
(a) Let π be the population proportion of visitors who would use the device. We
test H0 : π = 0.30 vs. H1 : π < 0.30. The sample proportion is p = 20/80 = 0.25.
The standard error of the sample proportion is √(0.3 × 0.7/80) = 0.0512. The
test statistic value is:
z = (0.25 − 0.30)/0.0512 = −0.976.
For a one-sided (lower-tailed) test at the 5% significance level, the critical value
is z0.95 = −1.645, so the test is not significant – and not even at the 10%
significance level (critical value of z0.90 = −1.282). On the basis of the data,
there is no reason to withdraw the device.
The critical region for the above test is: reject H0 if the sample proportion is less
than 0.3 − 1.645 × 0.0512, i.e. if the sample proportion, p, is less than 0.2157.
(b) The p-value of the test is the probability of the test statistic value or a more
extreme value conditional on H0 being true. Hence the p-value is:
P (Z ≤ −0.976) = 0.1645.
So, for any α < 0.1645 we would fail to reject H0 .
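Both parts of this solution can be verified with a short Python sketch, using the standard normal CDF for the lower-tailed p-value:

```python
from math import sqrt
from statistics import NormalDist

# Activity 8.11: 20 of 80 visitors used the device; H0: pi = 0.30 vs. H1: pi < 0.30.
n, p0 = 80, 0.30
se = sqrt(p0 * (1 - p0) / n)    # 0.0512
z = (20 / n - p0) / se          # about -0.976
p_value = NormalDist().cdf(z)   # lower-tailed p-value, about 0.1645

print(round(z, 3), round(p_value, 4))
```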
8.15 Difference between two population proportions
As with confidence intervals, the correct approach to the comparison of two population
proportions, π1 and π2 , is to consider the difference between them, i.e. π1 − π2 . When
testing for equal proportions (i.e. a zero difference), the null hypothesis is
H0 : π1 − π2 = 0, or equivalently H0 : π1 = π2 . We derive the test statistic by
standardising the sampling distribution of the difference in two independent sample
proportions, P1 − P2 , given in (7.4), leading to the proposed test statistic:
Z = (P1 − P2 − (π1 − π2))/√(π1(1 − π1)/n1 + π2(1 − π2)/n2) ∼ N (0, 1).
However, when evaluating this test statistic, which values do we use for π1 and π2 ? In
the test of a single proportion, we had H0 : π = π0 , where π0 is the tested value. When
comparing two proportions, under H0 no value is given for π1 and π2 , only that they are
equal, that is π1 = π2 = π, where π is the common proportion whose value, of course,
is still unknown! Hence we need to estimate π from the sample data using the pooled
proportion estimator.
Pooled proportion estimator
If R1 and R2 represent the number of ‘favourable’ responses from two independent
random samples with sample sizes of n1 and n2 , respectively, then the pooled
proportion estimator is:
P = (R1 + R2)/(n1 + n2).   (8.4)
This leads to the following revised test statistic.
z test for the difference between two proportions
In this case, the test statistic is:
Z = (P1 − P2 − (π1 − π2))/√(P(1 − P)(1/n1 + 1/n2)) ∼ N (0, 1) (approximately).   (8.5)
Hence critical values are obtained from the standard normal distribution, i.e. using
Table 5 of the New Cambridge Statistical Tables.
Example 8.6 To illustrate this, let us reconsider Example 7.7 to test for a
difference between the population proportions of the general public who are aware of
a particular commercial product before and after an advertising campaign. Two
surveys were conducted and the results of the two random samples were:
                  Sample size   Number aware
Before campaign       150            68
After campaign        120            65
If π1 and π2 are the true population proportions before and after the campaign,
respectively, then we wish to test:
H0 : π1 = π2
vs. H1 : π1 < π2 .
Note that we use a one-sided alternative on the assumption that the campaign
would not decrease awareness!⁸
After checking that this way round the value we get is positive (avoiding negative
differences wherever possible to minimise arithmetic errors), we estimate the
difference in the population proportions as:
‘after’ − ‘before’ = p2 − p1 = 65/120 − 68/150 = 0.0884.
On the assumption that H0 is true, we estimate the common proportion, π, using
(8.4), as:
(68 + 65)/(150 + 120) = 0.4926.
So our test statistic value, using (8.5), is:
z = 0.0884/√(0.4926 × (1 − 0.4926) × (1/150 + 1/120)) = 1.44.
For a z test at the 5% significance level, we compare this test statistic value with the
upper-tail critical value of 1.6449 for this upper-tailed test. Since 1.44 < 1.6449, we
fail to reject H0 and conclude that the test is not significant at the 5% significance
level. However, it is significant at the 10% significance level (because 1.2816 < 1.44),
and so we maintain belief in the null hypothesis with some reservation as the data
have thrown some doubt on it. It is possible that the campaign has increased
awareness but, on the basis of these random samples, we would not be fully
convinced – the test is at best ‘weakly significant’. If we had to take an important
decision, we would almost certainly want more information before doing so!
Incidentally, the p-value for this problem is simply P (Z > 1.44) = 0.0749, using
Table 4 of the New Cambridge Statistical Tables. So, clearly, we would be unable to
reject H0 for any significance level α < 0.0749, agreeing with our earlier findings.
Activity 8.12 A market research company has conducted a survey of adults in two
large towns, either side of an international border, in order to judge attitudes
towards a controversial internationally-broadcast celebrity television programme.
The following table shows some of the information obtained by the survey:
                                           Town A   Town Z
Sample size                                   80       80
Sample number approving of the programme      44       41
(a) Conduct a formal hypothesis test, at two appropriate significance levels, of the
claim that the population proportions approving of the programme in the two
towns are equal.
(b) Would your conclusion be the same if, in both towns, the sample sizes had been
1500 (with the same sample proportions of approvals and the same pooled
sample proportion)?
Solution
(a) We test H0 : πA = πZ vs. H1 : πA ≠ πZ . Note there is no a priori reason to
suppose which town would have a greater proportion approving of the
programme in the event that the proportions are not equal, hence we conduct a
two-tailed test.
The sample proportions are pA = 0.55 and pZ = 0.5125, and the estimate of the
common proportion under H0 is:
p = (44 + 41)/(80 + 80) = 0.53125.
⁸ An example of the importance of using common sense in determining the alternative hypothesis!
We use the test statistic:
Z = (PA − PZ)/√(P(1 − P)(1/nA + 1/nZ)) ∼ N (0, 1)
approximately under H0 . The test statistic value is:
(0.55 − 0.5125)/√(0.53125 × (1 − 0.53125) × (1/80 + 1/80)) = 0.48.
At the 5% significance level, the critical values are ±z0.025 = ±1.96. Since
0.48 < 1.96 we do not reject H0 . Following the significance level decision tree,
we now test at the 10% significance level, for which the critical values are
±z0.05 = ±1.645. Since 0.48 < 1.645, again we do not reject H0 . Therefore, the
test is not significant as there is no evidence of a difference in the population
proportions approving of the programme.
For reference, the p-value for this two-tailed test is:
2 × P (Z ≥ 0.48) = 2 × 0.3156 = 0.6312
which is greater than any reasonable significance level, α.
(b) If nA = nZ = 1500, with the same values for pA , pZ and p, the test statistic
value becomes:
(0.55 − 0.5125)/√(0.53125 × (1 − 0.53125) × (1/1500 + 1/1500)) = 2.06.
Since 1.96 < 2.06, we reject H0 at the 5% significance level. Now testing at the
1% significance level we have critical values of ±2.576. Since 2.06 < 2.576, we do
not reject H0 and conclude that the test is moderately significant indicating
moderate evidence of a difference in the population proportions approving of
the programme.
The larger sample sizes have increased the power of the test, i.e. increased our
ability to reject a false H0 . The difference in the sample proportions is attributed
to chance (that is, sampling error) when nA = nZ = 80, while the same difference
is statistically significant when nA = nZ = 1500.
For reference, the p-value for this two-tailed test is:
2 × P (Z ≥ 2.06) = 2 × 0.0197 = 0.0394.
Since 0.01 < 0.0394 < 0.05, the test is moderately significant, as before.
Activity 8.13 A random sample of 250 households in a particular community was
taken and in 50 of these households the lead level in the water supply was found to
be above an acceptable level. A sample was also taken from a second community,
which adds anti-corrosives to its water supply, and, of these, only 16 out of 320
households were found to have unacceptably high lead levels. Is this conclusive
evidence that the addition of anti-corrosives reduces lead levels?
Solution
Here we are testing the difference between two proportions with a one-sided test.
Also, n1 and n2 are both large. Let:
• π1 = the population proportion of households with unacceptable levels of lead
in the community without anti-corrosives
• π2 = the population proportion of households with unacceptable levels of lead
in the community with anti-corrosives.
We wish to test H0 : π1 = π2 vs. H1 : π1 > π2 . The test statistic value is:
z = (p1 − p2)/√(p(1 − p)(1/n1 + 1/n2)) = (50/250 − 16/320)/√(0.116 × (1 − 0.116) × (1/250 + 1/320)) = 5.55
where p is the pooled sample proportion, i.e. p = (50 + 16)/(250 + 320) = 0.116. This
test statistic value is, at 5.55, obviously very extreme and hence is highly significant
(since z0.01 = 2.326 < 5.55), so there is strong evidence that anti-corrosives reduce
lead levels.
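The pooled two-proportion calculation in this solution can be checked with a short Python sketch:

```python
from math import sqrt

# Activity 8.13: high lead levels in 50/250 households (no anti-corrosives)
# vs. 16/320 households (anti-corrosives); H0: pi1 = pi2 vs. H1: pi1 > pi2.
r1, n1 = 50, 250
r2, n2 = 16, 320

p1, p2 = r1 / n1, r2 / n2
p_pool = (r1 + r2) / (n1 + n2)                        # 0.116, pooled proportion
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # estimated standard error
z = (p1 - p2) / se                                    # very extreme value

print(round(z, 2))  # 5.55
```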
8.16 Difference between two population means
In this section we are primarily interested in the difference between two population
means, µ1 − µ2 . As in Chapter 7, there are four cases to consider.
8.16.1 Unpaired samples – variances known
Suppose we have two random samples of size n1 and n2 , respectively, drawn from
N (µ1 , σ1²) and N (µ2 , σ2²), where σ1² and σ2² are known. Testing for the equality of means
gives the null hypothesis H0 : µ1 = µ2 or, in terms of their difference, H0 : µ1 − µ2 = 0.
Standardising the sampling distribution of X̄1 − X̄2 , shown in (7.6), gives the test
statistic.
z test for the difference between two means (variances known)
In this case, the test statistic is:
Z = (X̄1 − X̄2 − (µ1 − µ2))/√(σ1²/n1 + σ2²/n2) ∼ N (0, 1).   (8.6)
Hence critical values are obtained from the standard normal distribution, i.e. using
Table 5 of the New Cambridge Statistical Tables.
Note if testing for the equality of means, then µ1 − µ2 = 0 under H0 . Hence, in (8.6),
we set the term (µ1 − µ2 ) = 0.
8.16.2 Unpaired samples – variances unknown and unequal
We have the same set-up as above, with the same sampling distribution for X̄1 − X̄2 ,
but now the population variances σ1² and σ2² are unknown. Assuming large sample
sizes we can replace these unknown parameters with the sample variance estimators
S1² and S2² to obtain the test statistic.
z test for the difference between two means (variances unknown)
If the population variances σ1² and σ2² are unknown, provided the sample sizes n1 and
n2 are large (greater than 30), then:
Z = (X̄1 − X̄2 − (µ1 − µ2))/√(S1²/n1 + S2²/n2) ∼ N (0, 1) (approximately, for large samples).   (8.7)
Hence critical values are obtained from the standard normal distribution, i.e. using
Table 5 of the New Cambridge Statistical Tables.
Note if testing for the equality of means, then µ1 − µ2 = 0 under H0 . Hence, in (8.7),
we set the term (µ1 − µ2 ) = 0.
Activity 8.14 Two companies supplying a television repair service are compared by
their repair times (in days). Random samples of recent repair times for these
companies gave the following statistics:
            Sample size   Sample mean   Sample variance
Company A        44           11.9             7.3
Company B        52           10.8             6.2
(a) Is there evidence that the companies differ in their true mean repair times? Give
an appropriate significance test to support your conclusions.
(b) Determine the p-value of the test.
(c) What difference would it have made if the sample sizes had each been smaller
by 35 (i.e. sample sizes of 9 and 17, respectively)?
Solution
(a) H0 : µA = µB vs. H1 : µA ≠ µB , where we use a two-tailed test since ‘by eye’ the
evidence we are given only barely suggests the direction of the difference – if
any. The test statistic value is:
z = (11.9 − 10.8)/√(7.3/44 + 6.2/52) = 2.06.
For a two-tailed test, this is significant at the 5% significance level
(z0.025 = 1.96 < 2.06), but not at the 1% significance level (2.06 < 2.326 = z0.01 ).
We reject H0 and conclude that company A is slower in repair times than
company B, with a moderately significant result.
(b) The p-value for this two-tailed test is 2 × P (Z > 2.06) = 0.0394.
(c) For small samples, we should use a pooled estimate of the population standard deviation:

s = √( ((9 − 1) × 7.3 + (17 − 1) × 6.2) / ((9 − 1) + (17 − 1)) ) = 2.5626

on 24 degrees of freedom. Hence the test statistic value in this case is:

t = (11.9 − 10.8) / (2.5626 × √(1/9 + 1/17)) = 1.04.

This should be compared with t24 and is clearly not significant, even at the 10% significance level, since 1.04 < 1.711 = t0.05, 24. With the smaller samples we fail to detect the difference.
Comparing the two test statistic calculations shows that the different results
flow from differences in the estimated standard errors, hence ultimately (and
unsurprisingly) from the differences in the sample sizes used in the two
situations.
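As a numerical check, the pooled small-sample calculation in part (c) can be reproduced with a few lines of Python (a sketch, not part of the subject guide):

```python
import math

# Part (c): pooled standard deviation from the two sample variances
n1, n2 = 9, 17
s1sq, s2sq = 7.3, 6.2
sp = math.sqrt(((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2))

# t statistic on n1 + n2 - 2 = 24 degrees of freedom
t = (11.9 - 10.8) / (sp * math.sqrt(1 / n1 + 1 / n2))
print(round(sp, 4), round(t, 2))   # 2.5626 and 1.04
```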
8.16.3 Unpaired samples – variances unknown and equal
Although still unknown, if we assume the population variances are equal to some common variance, i.e. σ1² = σ2² = σ², then we only have one (common) unknown variance to estimate. As with confidence intervals, we utilise the pooled variance estimator, given in (7.8).
t test for the difference between two means (variances unknown but equal)

If the population variances σ1² and σ2² are unknown but assumed equal, then:

T = (X̄1 − X̄2 − (µ1 − µ2)) / √(Sp² (1/n1 + 1/n2)) ∼ t_{n1+n2−2}  (8.8)

where Sp² is the pooled variance estimator, given in (7.8).
Hence critical values are obtained from the Student's t distribution with n1 + n2 − 2 degrees of freedom, i.e. using Table 10 of the New Cambridge Statistical Tables.⁹
Note that if testing for the equality of means, then µ1 − µ2 = 0 under H0. Hence, in (8.8), we set the term (µ1 − µ2) to zero.
Look back at Section 7.13 for hints about how to decide whether or not σ1² = σ2².
⁹ Recall that for sufficiently ‘large’ degrees of freedom, a standard normal approximation can be used, provided a justification is given.

Example 8.7 To illustrate this, let us reconsider Example 7.8 where two companies supplying a similar service are compared for their mean reaction times (in days) to complaints. Random samples of recent complaints to these companies gave the following descriptive statistics:
            Sample size   Sample mean   Sample std. dev.
Company A        12            8.5             3.6
Company B        10            4.8             2.1
We want to test for a true difference in mean reaction times, that is:

H0: µA = µB vs. H1: µA ≠ µB.
Because the markets are ‘similar’, it is reasonable to assume (though it is only an assumption!) that the two population variances are equal. Under this assumption, using (7.8), we have:

sp² = ((12 − 1) × (3.6)² + (10 − 1) × (2.1)²) / (12 + 10 − 2) = 9.1125.
Using (8.8), the test statistic value is:

t = (8.5 − 4.8) / √(9.1125 × (1/12 + 1/10)) = 2.87.
There are 12 + 10 − 2 = 20 degrees of freedom, hence we obtain critical values from
the t20 distribution using Table 10 of the New Cambridge Statistical Tables. For a
two-tailed test at the 5% significance level, we obtain critical values of ±2.086, hence
the test is significant at the 5% significance level since 2.086 < 2.87. Moving to the
1% significance level, as per the significance level decision tree in Figure 8.1, the
critical values are ±2.845, so again we (just) reject H0 since 2.845 < 2.87.
On this basis, we have a highly significant result when we reject H0 , and we conclude
that the mean reaction times are different. Indeed, we have strong evidence that
Company B reacts faster than Company A, on average.
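A quick numerical sketch (Python, not part of the subject guide) reproduces Example 8.7. Note that carrying full precision throughout gives t = 2.86, while rounding intermediate results as in the text gives 2.87; the conclusions are identical either way.

```python
import math

def pooled_t(x1bar, x2bar, s1, s2, n1, n2):
    """t statistic of (8.8) for H0: mu1 = mu2 under the equal-variance
    assumption; compare with Student's t on n1 + n2 - 2 df."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    t = (x1bar - x2bar) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, sp2, n1 + n2 - 2

t, sp2, df = pooled_t(8.5, 4.8, 3.6, 2.1, 12, 10)   # Example 8.7 data
print(round(sp2, 4), round(t, 2), df)
```

Returning the degrees of freedom alongside the statistic mirrors how the critical value is then looked up in Table 10.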
Activity 8.15 A study was conducted to determine the number of hours spent on Facebook by university and high school students. For this purpose, a questionnaire was administered to a random sample of 14 university students and 11 high school students, and the hours per day spent on Facebook were recorded.
                       Sample size   Sample mean   Sample variance
University students        14            3.3             0.9
High school students       11            2.9             1.3
(a) Use an appropriate hypothesis test to determine whether the mean hours per
day spent on Facebook is different between university and high school students.
Test at two appropriate significance levels, stating clearly the hypotheses, the
test statistic and its distribution under the null hypothesis. Comment on your
findings.
(b) State clearly any assumptions you made in (a).
(c) Adjust the procedure above to determine whether the mean hours spent per day
on Facebook for university students is higher than that of high school students.
Solution
(a) Let µ1 denote the mean hours per day spent on Facebook for university
students and let µ2 denote the mean hours per day spent on Facebook for high
school students.
We test:

H0: µ1 = µ2 vs. H1: µ1 ≠ µ2.
Since the sample sizes are small, we assume σ1² = σ2² and use the test statistic:

T = (X̄1 − X̄2) / √(Sp² (1/n1 + 1/n2)) ∼ t_{n1+n2−2}
under H0. The estimate of the common variance is:

sp² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) = (13 × 0.9 + 10 × 1.3) / (14 + 11 − 2) = 1.0739.
The test statistic value is:

t = (3.3 − 2.9) / √(1.0739 × (1/14 + 1/11)) = 0.9580.
For α = 0.05, the critical values are ±t0.025, 23 = ±2.069 (the t23 distribution is
used). Since 0.9580 < 2.069, we do not reject H0 .
For α = 0.10, the critical values are ±t0.05, 23 = ±1.714. Since 0.9580 < 1.714,
again we do not reject H0 .
We conclude that there is no evidence of a difference in the mean hours spent on
Facebook between university and high school students. The test is insignificant.
(b) It is assumed that the samples are independent random samples and that σ1² = σ2² (an assumption made due to the small sample sizes).
(c) Now we test:

H0: µ1 = µ2 vs. H1: µ1 > µ2.
All that changes are the critical values.
For α = 0.05, we have a critical value of t0.05, 23 = 1.714. Since 0.9580 < 1.714,
we do not reject H0 .
For α = 0.10, we have a critical value of t0.10, 23 = 1.319. Since 0.9580 < 1.319,
we again do not reject H0 . The conclusion is the same as in (a).
Activity 8.16 The following data show the times of flow through an orifice of two
types of sand. Is there sufficient evidence that one type flows faster than the other?
Type I:   27.2   26.8   27.4   27.1   26.5
Type II:  29.6   30.0   28.4   30.2
Solution
If µ1 and µ2 are the respective population means, then we wish to test:

H0: µ1 = µ2 vs. H1: µ1 ≠ µ2.
The main statistics are:

           n   Sample mean   Sample std. dev.   E.S.E. of mean
Type I     5      27.000          0.354         0.354/√5 = 0.16
Type II    4      29.550          0.806         0.806/√4 = 0.40
The pooled sample variance is:

sp² = ((5 − 1)(0.354)² + (4 − 1)(0.806)²) / ((5 − 1) + (4 − 1)) = 2.450172/7 = 0.3500.
The pooled standard deviation is:

sp = √0.3500 = 0.592 on 7 degrees of freedom.
The estimated standard error is:

sp × √(1/n1 + 1/n2) = 0.592 × √(1/5 + 1/4) = 0.397.

The test statistic value is:

t = (27 − 29.55)/0.397 = −6.43.
Comparing this with the t distribution on 7 degrees of freedom, we see that the test is significant even at the 0.1% significance level (since −6.43 < −4.785 = −t0.001, 7) and has a p-value of 0.000178. Conclusion: we are convinced that Type I sand flows faster!
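The whole calculation, from raw data to test statistic, can be reproduced with a short Python sketch (not part of the subject guide), using the sample standard deviation from the standard library's statistics module:

```python
import math
from statistics import mean, stdev

# Flow times for the two types of sand (Activity 8.16)
type1 = [27.2, 26.8, 27.4, 27.1, 26.5]
type2 = [29.6, 30.0, 28.4, 30.2]
n1, n2 = len(type1), len(type2)

# Pooled variance, estimated standard error and t statistic of (8.8)
sp2 = ((n1 - 1) * stdev(type1)**2 + (n2 - 1) * stdev(type2)**2) / (n1 + n2 - 2)
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
t = (mean(type1) - mean(type2)) / se
print(round(t, 2))   # -6.43 on n1 + n2 - 2 = 7 degrees of freedom
```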
Activity 8.17 The data in the following table show the numbers of daily parking
offences in two areas of a city. The day identifications are unknown and the
recordings were not necessarily made on the same days. Is there evidence that the
areas experience different numbers of offences?
Area A:  38   38   29   45   42   33   27   32   32   34
Area B:  32   38   22   30   34   28   32   34   24
Solution
It is clear that the data for the two areas cannot be regarded as paired, because
there is no detail about the days to which the figures refer (and because there are
different numbers of values for the two areas anyway).
The sample sizes are 10 and 9 for A and B, respectively, so we need to decide
whether we can use the combination of a pooled estimate of the variance and a t
distribution test on the appropriate number of degrees of freedom.
So, start by calculating the sample means and standard deviations (or variances):

          n    Mean    Standard deviation
Area A    10   35            5.68
Area B     9   30.44         5.08
Because the events to which the data refer are of the same type, it is reasonable to assume that the population variances of the numbers of offences are the same, especially since the sample standard deviations turn out to be quite similar. (It can be shown that, if the ratio of the sample variances is between roughly 1/3 and 3, then there is no reason to suppose that the population variances are different.)
We test H0: µA − µB = 0 vs. H1: µA − µB ≠ 0. The pooled variance is:

sp² = ((10 − 1) × (5.68)² + (9 − 1) × (5.08)²) / ((10 − 1) + (9 − 1)) = 29.2243.
Hence the estimated standard error is:

√(29.2243 × (1/10 + 1/9)) = 5.406 × √0.2111 = 2.484

on 17 degrees of freedom. The test statistic value is:

t = (35 − 30.44)/2.484 = 4.56/2.484 = 1.836.
For a two-tailed test at the 5% significance level the critical values from t17 are
±t0.025, 17 = ±2.110, so the evidence is not significant at the 5% significance level
since 1.836 < 2.110.
For a two-tailed test at the 10% significance level the critical values from t17 are
±t0.05, 17 = ±1.740, so the evidence is weakly significant since 1.740 < 1.836.
We cannot conclude with confidence that there is a difference in the number of
parking offences between the two areas, but there is a suspicion that there may be
such a difference. It would be preferable to examine more data in order to come to a
firmer conclusion.
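The variance-ratio rule of thumb quoted in the solution is easily coded as a quick check (a Python sketch, not part of the subject guide; the 1/3-to-3 bounds are the rough guide given in the text, not an exact test):

```python
def similar_variances(s1, s2):
    """Rule of thumb from the text: treat the population variances as
    equal if the ratio of the sample variances lies between about 1/3 and 3."""
    ratio = s1**2 / s2**2
    return 1 / 3 <= ratio <= 3

print(similar_variances(5.68, 5.08))   # True: ratio is about 1.25
```

A formal alternative, not covered in this course, would be an F test of equal variances.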
8.16.4 Paired (dependent) samples
Recall from Section 7.13 that for paired (dependent) samples we work with differenced data to reduce matters to a one-sample analysis. As before, we compute the differenced data as:

d1 = x1 − y1, d2 = x2 − y2, . . . , dn = xn − yn

reducing the two-sample problem to a one-sample problem. The test is then analogous to the hypothesis test of a single mean with σ unknown.
t test for the difference in means in paired samples

Using the sample mean and sample standard deviation of the differenced data, then:

T = (X̄d − µd) / (Sd/√n) ∼ t_{n−1}.  (8.9)

Hence critical values are obtained from the Student's t distribution with n − 1 degrees of freedom, i.e. using Table 10 of the New Cambridge Statistical Tables.
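The paired t calculation in (8.9) can be sketched in a few lines of Python (not part of the subject guide); the two lists in the usage line are hypothetical paired readings:

```python
import math
from statistics import mean, stdev

def paired_t(x, y, mu_d=0.0):
    """t statistic of (8.9) from paired observations x and y;
    compare with Student's t on n - 1 degrees of freedom."""
    d = [xi - yi for xi, yi in zip(x, y)]   # differenced data
    n = len(d)
    t = (mean(d) - mu_d) / (stdev(d) / math.sqrt(n))
    return t, n - 1

# Hypothetical paired readings
t, df = paired_t([5.1, 4.8, 6.0], [4.9, 4.9, 5.2])
print(round(t, 2), df)
```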
Example 8.8 The table below shows the before and after weights (in pounds) of 8
adults after a diet. Test whether there is evidence that the diet is effective, on
average.
Before   127   130   114   139   150   147   167   153
After    122   120   116   132   144   138   155   152
We want to test:
H0 : µ1 = µ2
vs. H1 : µ1 > µ2
which is equivalent to:
H0 : µd = 0 vs. H1 : µd > 0
where we choose a one-tailed test because we are looking for an average reduction (if
there is any change, we would expect weight loss from a diet!) and we define
µd = µ1 − µ2 since we anticipate that this way round the values will (more likely) be
positive.
The differences (calculated as ‘before − after’) are:

5, 10, −2, 7, 6, 9, 12, 1.

Hence n = 8, x̄d = 6 and sd = 4.66 on n − 1 = 7 degrees of freedom. Using (8.9), the test statistic value is:

t = (6 − 0) / (4.66/√8) = 3.64.
At the 5% significance level, the upper-tail critical value is 1.895 (using Table 10 of
the New Cambridge Statistical Tables), hence we reject H0 since 1.895 < 3.64. Using
Figure 8.1, we proceed to test at the 1% significance level, which uses a critical value
of 2.998. Hence we, again, reject H0 since 2.998 < 3.64 and we conclude that the test
is highly significant. These data provide strong evidence that the diet reduces
weight, on average.
Activity 8.18 Two different methods of determination of the percentage fat
content in meat are available. Both methods are used on portions of the same meat
sample. Is there any evidence to suggest that one method gives a higher reading
than the other?
Meat sample      1     2     3     4     5     6     7     8
Method I       23.1  23.2  26.5  26.6  27.1  48.3  40.5  25.0
Method II      22.7  23.6  27.1  27.4  27.4  46.8  40.4  24.9

Meat sample      9    10    11    12    13    14    15    16
Method I       38.4  23.5  22.2  24.7  45.1  27.6  25.0  36.7
Method II      38.1  23.8  22.5  24.4  43.5  27.0  24.9  35.2
Solution

Since the same meat is being put through two different tests, we are clearly dealing with paired samples. We wish to test:

H0: µ1 = µ2 vs. H1: µ1 ≠ µ2

but for paired samples we do our calculations using the differenced data, so we might reformulate this presentation of the hypotheses in the form:

H0: µ1 − µ2 = 0 vs. H1: µ1 − µ2 ≠ 0.
If we list the differences (Method I − Method II) from the original data table then we get:

0.4  −0.4  −0.6  −0.8  −0.3  1.5  0.1  0.1  0.3  −0.3  −0.3  0.3  1.6  0.6  0.1  1.5
The main statistics of the differences are: n = 16, sample mean = 0.238, sample standard deviation = 0.745, and E.S.E. of the mean = 0.745/√16 = 0.186.

So the test statistic value is:

t = (0.238 − 0)/0.186 = 1.28.
Looking at Table 10 of the New Cambridge Statistical Tables for the t15 distribution,
we find that the critical values for a two-tailed test at the 5% significance level are
±t0.025, 15 = ±2.131, so this result is not significant at the 5% significance level since
1.28 < 2.131. Indeed, the critical values for a test at the 20% significance level are
±t0.10, 15 = ±1.341, so the result would still not be significant (which, of course, is in
any case too high a significance level to be used).
The test shows, in fact, that the p-value of the test is over 0.2, although the limited
information in Table 10 means that we cannot work it out exactly. (To do so, we
would need much more detailed statistical tables, or a computer.)
8.17 Summary
The idea of testing hypotheses is a central part of statistics, and underpins the
development of theories and checking of ideas in management and the social sciences. It
is important that you make sure you understand the material introduced in this chapter
and Chapters 6 and 7 before you move on to look at the chi-squared distribution in
Chapter 9.
8.18 Key terms and concepts
Alternative hypothesis        Common proportion
Critical region               Critical value
Lower-tailed test             Null hypothesis
P-value                       Power
Significance level            Test statistic (value)
Two-tailed test               Type I error
Type II error                 Upper-tailed test

8.19 Learning activities
Solutions can be found in Appendix A.
1. Think about each of the following statements. For each, state the null and
alternative hypotheses and say whether they will need one-tailed or two-tailed
tests.
(a) The general mean level of family income in a population is known to be 10,000
‘ulam’ per year. You take a random sample in an urban area and find the
mean family income is 6,000 ulam per year in that area. Do the families in the
chosen area have a lower mean income than the population as a whole?
(b) You are looking at data from two schools on the heights and weights of
children by age. Are the mean weights of children aged 10–11 the same in the
two schools?
(c) You are looking at reading scores for children before and after a new teaching
programme. Have their scores improved on average?
2. Complete the following chart (the empty cells are for you to fill in):

                                 Result from your test
   Real situation     Not reject H0                 Reject H0
   H0 true            Correct decision:
                      probability (1 − α)
   H1 true            Type II error                 Probability (1 − β), called
                                                    the power of the test
3. Say whether the following statement is true or false and briefly give your reasons.
‘The power of a test is the probability that the correct hypothesis is chosen.’
4. Explain your attitude to a null hypothesis if a test of the hypothesis is significant:
(a) at the 1% significance level
(b) at the 10% significance level, but not at the 5% significance level.
5. The manufacturer of a patent medicine claimed that it was 90% effective in relieving an allergy for a period of 8 hours. In a random sample of 200 people suffering from the allergy, the medicine provided relief for 160 people.

Determine whether the manufacturer's claim is legitimate. (Be careful: your parameter here will be π.) Is your test one-tailed or two-tailed?
6. You have been asked to compare the percentages of people in two groups with
n1 = 16 and n2 = 24 who are in favour of a new plan. You decide to make a pooled
estimate of the proportion and make a test. Which test would you use?
7. Look again at Question 5 in the equivalent section of Chapter 7. You were asked in
part (b) whether you thought one company gave on average more satisfaction than
the other. Now give the null and alternative hypotheses for such a one-tailed test.
Show how you would decide whether or not to reject the null hypothesis.
8.20 A reminder of your learning outcomes
After completing this chapter, and having completed the essential reading and
activities, you should be able to:
set up the null and alternative hypotheses for a problem and state whether the
latter is one-sided or two-sided, leading to a one-tailed or two-tailed test,
respectively
define and apply the terminology of hypothesis testing
define and explain the types of errors which can be made in hypothesis testing
perform hypothesis tests on means and proportions.
8.21 Sample examination questions
Solutions can be found in Appendix B.
1. An experiment is conducted to determine whether intensive tutoring (covering a
great deal of material in a fixed amount of time) is more effective than standard
tutoring (covering less material in the same amount of time) for a particular
course. Two randomly chosen groups of students were tutored separately and their
examination mark on the course was recorded. The data are summarised in the
table below:
                     Sample size   Mean examination mark   Sample standard deviation
Intensive tutoring        22              65.33                     6.61
Standard tutoring         25              61.58                     5.37
(a) Use an appropriate hypothesis test to determine whether there is a difference
between the mean examination marks between the two tutoring groups. State
clearly the hypotheses, the test statistic and its distribution under the null
hypothesis, and carry out the test at two appropriate significance levels.
Comment on your findings.
(b) State clearly any assumptions you made in part (a).
2. (a) A pharmaceutical company is conducting an experiment to test whether a new
type of pain reliever is effective. In this context, a treatment is considered
effective if it is successful with a probability of more than 0.5. The pain
reliever was given to 30 patients and it reduced the pain for 20 of them. You
are asked to use an appropriate hypothesis test to determine whether the pain
reliever is effective. State the test hypotheses, and specify your test statistic
and its distribution under the null hypothesis. Comment on your findings.
(b) A second experiment followed where a placebo pill was given to another group
of 40 patients. A placebo pill contains no medication and is prescribed so that
the patient will expect to get well. In some situations, this expectation is
enough for the patient to recover. This effect, also known as the placebo effect,
occurred to some extent in the second experiment where the pain was reduced
for 21 of the patients. You are asked to consider an appropriate hypothesis test
to incorporate this new evidence with the previous data and reassess the
effectiveness of the pain reliever.
3. A paired-difference experiment involved n = 121 adults. For each adult a
characteristic was measured under two distinct conditions and the difference in
characteristic values was recorded. The sample mean of the differences was 1.195,
whereas their sample standard deviation was 10.2. The researchers reported a t
statistic value of 1.289 when testing whether the means of the two conditions are
the same.
(a) Show how the researchers obtained the t statistic value of 1.289.
(b) Calculate the p-value of the test and use the p-value to test the hypothesis of
equal means. Use a 5% significance level.
Chapter 9
Contingency tables and the chi-squared test
9.1 Aims
This chapter introduces you to a further application of hypothesis testing using a new
distribution – the chi-squared (χ2 ) distribution. This distribution enables you to work
with categorical variables. Your aims are to:
learn a new application
link this with different applications in the business and social statistical fields.
9.2 Learning outcomes
After completing this chapter, and having completed the Essential reading and activities, you should be able to:

set up the null and alternative hypotheses appropriate for a contingency table

compute the degrees of freedom, expected frequencies and appropriate critical values of chi-squared for a contingency table

summarise the limitations of a chi-squared test

be able to extend from chi-squared to an appropriate test of proportions, if necessary

be able to work with a one-row or one-column contingency table as above.
9.3 Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060]
Sections 14.1 and 14.3.
In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials
accessible via the ST104a Statistics 1 area at http://my.londoninternational.ac.uk
9.4 Further reading
Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and
Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN
9780471615170] Chapter 17.
9.5 Introduction
In Chapter 8 we focused on testing the value of a particular population parameter of
interest, such as a mean, µ, or a proportion, π. Being able to perform such statistical
tests is of particular use when making policy decisions. Here we shall look at two
additional testing procedures, one which deals with testing whether an association
exists between two categorical variables (introduced in Chapter 4), and a second which
considers the shape of the distribution from which the sample data were drawn. Both
tests can easily be applied to business management and social science fields.
9.6 Correlation and association
If we are asking whether two measurable variables (see Section 4.6) are related to each
other, the technical way of describing the case when they are is to say that the two
variables are correlated. It is possible to measure the strength of a correlation based on
sample data, and also to study to what extent changes in a variable ‘explain’ changes in
another – a topic known as regression. These topics will be discussed in Chapter 12.
However, we also often wish to answer questions such as ‘Do vegetarians tend to
support Party X?’ or ‘Do Party X supporters tend to be vegetarians?’. If either question
has the affirmative answer, we say that there is an association between the two
categories (or ‘attributes’ or ‘factors’). We will need frequency data in order to answer
such questions, but, in this course, we will not be concerned with the strength of any
such association, rather we restrict ourselves to testing for its existence and commenting
on the nature of any association which we discover.
Finally, note that if one variable is measurable and the other categorical, we would still
be testing for the existence of association instead of computing correlation. In which
case we would ‘demote’ the measurable variable to categorical status by creating levels.
For example, age could be converted to a categorical variable by creating age groups.¹
9.7 Tests for association
This type of test tests the null hypothesis that two factors (or attributes) are not
associated, against the alternative hypothesis that they are associated. Each data unit
we sample has one level (or ‘type’ or ‘variety’) of each factor.
¹ Note that an age group would be an ordinal variable, since we could rank the age groups in order, from youngest to oldest.
Example 9.1 Suppose that we are sampling people, and that one factor of interest
is hair colour (black, blonde, brown etc.) while another factor of interest is eye
colour (blue, brown, green etc.). In this example, each sampled person has one level
of each factor. We wish to test whether or not these factors are associated. Hence:
H0: There is no association between hair colour and eye colour²
H1: There is an association between hair colour and eye colour.
So, under H0 , the distribution of eye colour is the same for blondes as it is for
brunettes etc., whereas if H1 is true it may be attributable to blonde-haired people
having a (significantly) higher proportion of blue eyes, say.
The association might also depend on the sex of the person, and that would be a third factor which was associated with (i.e. interacted with) both of the others.³ The
main way of analysing these questions is by using a contingency table, discussed
in the next section.
9.7.1 Contingency tables
In a contingency table, also known as a cross-tabulation, the data are in the form
of frequencies (counts), where the observations are organised in cross-tabulated
categories. We sample a certain number of units (such as people) and classify them
according to the two factors of interest.
Example 9.2 In three areas of a city, a record has been kept of the numbers of
burglaries, robberies and car thefts which take place in a year. The total number of
offences was 150, and they were divided into the various categories as shown in the
following contingency table:
Area    Burglary   Robbery   Car theft   Total
A          30         19          6        55
B          12         23         14        49
C           8         18         20        46
Total      50         60         40       150
The cell frequencies are known as observed frequencies and show how the data
are spread across the different combinations of factor levels (known as
contingencies). The first step in any analysis is to complete the row and column
totals (as already done in this table).
² When conducting tests for association, the null hypothesis can be expressed either as ‘There is no association between categorical variables X and Y’, or as ‘Categorical variables X and Y are independent’. The corresponding alternative hypothesis would then replace ‘no association’ or ‘independent’ with ‘association’ or ‘not independent (dependent)’, respectively.
³ We will not consider interactions in these testing procedures in ST104a Statistics 1.
9.7.2 Expected frequencies
We proceed by computing a corresponding set of expected frequencies, conditional
on the null hypothesis of no association between the factors, i.e. that the factors are
independent.
Now suppose that you are only given the row and column totals for the frequencies. If
the factors were assumed to be independent, consider how you would calculate the
expected frequencies. Recall from Chapter 5 that if A and B are two independent
events, then P (A ∩ B) = P (A) P (B). We now apply this idea.
Example 9.3 For the data in Example 9.2, if a record was selected at random from the 150 records:

P(a crime being a burglary) = 50/150
P(a crime being in area A) = 55/150.

Hence, under H0, we have:

P(a crime being a burglary in area A) = (50/150) × (55/150)

and so the expected number of burglaries in area A is:

150 × (50/150) × (55/150).
So the expected frequency is obtained by multiplying the product of the ‘marginal’ probabilities by n, the total number of observations. This can be generalised as follows.

Expected frequencies in contingency tables

The expected frequency, Eij, for the cell in row i and column j of a contingency table with r rows and c columns, is:

Eij = (row i total × column j total) / (total number of observations)

where i = 1, . . . , r and j = 1, . . . , c.
Example 9.4 The completed expected frequency table for the data in Example 9.2 is (rounding to two decimal places – which is recommended in the examination):

Area    Burglary   Robbery   Car theft   Total
A         18.33      22.00      14.67      55
B         16.33      19.60      13.07      49
C         15.33      18.40      12.27      46
Total     50         60         40        150
Make sure you can replicate these expected frequencies using your own calculator.
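The calculation generalises directly from the formula in the box above. A short Python sketch (not part of the subject guide) computes all the expected frequencies at once from the observed table:

```python
def expected_frequencies(observed):
    """Expected cell counts under independence:
    E_ij = (row i total * column j total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

O = [[30, 19, 6], [12, 23, 14], [8, 18, 20]]   # Example 9.2 observed counts
E = expected_frequencies(O)
print([[round(e, 2) for e in row] for row in E])
```

Note that the expected frequencies have the same row and column totals as the observed counts, which is a useful arithmetic check.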
9.7.3 Test statistic
To motivate our choice of test statistic, if H0 is true then we would expect to observe
small differences between the observed and expected frequencies, while large differences
would suggest that H1 is true. This is because the expected frequencies have been
calculated conditional on the null hypothesis of independence hence, if H0 is actually
true, then what we actually observe (the observed frequencies) should be
(approximately) equal to what we expect to observe (the expected frequencies).
χ² test of association

Let the contingency table have r rows and c columns, then formally the test statistic⁴ used for tests of association is:

Σi Σj (Oij − Eij)² / Eij ∼ χ²_{(r−1)(c−1)}  (9.1)

where the summation runs over i = 1, . . . , r and j = 1, . . . , c.

Critical values are found using Table 8 of the New Cambridge Statistical Tables.
Notice the ‘double summation’ here just means summing over all rows and columns.
This test statistic follows an (approximate) chi-squared distribution with (r − 1)(c − 1)
degrees of freedom, where r and c denote the number of rows and columns, respectively,
in the contingency table.
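The double summation in (9.1) is just a sum over every cell of the table. A minimal Python sketch (not part of the subject guide) computes the statistic and its degrees of freedom:

```python
def chi_squared_statistic(observed, expected):
    """Pearson's chi-squared statistic of (9.1): the sum over all cells
    of (O - E)^2 / E, on (r - 1)(c - 1) degrees of freedom."""
    stat = sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))
    r, c = len(observed), len(observed[0])
    return stat, (r - 1) * (c - 1)

O = [[30, 19, 6], [12, 23, 14], [8, 18, 20]]          # Example 9.2
row = [sum(r) for r in O]
col = [sum(c) for c in zip(*O)]
n = sum(row)
E = [[row[i] * col[j] / n for j in range(3)] for i in range(3)]
stat, df = chi_squared_statistic(O, E)
print(round(stat, 2), df)
```

Carrying full precision gives a statistic of 23.15 on 4 degrees of freedom; summing cell contributions rounded to two decimal places first gives a marginally different total, but either way the value is far beyond the Table 8 critical values.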
9.7.4 The χ² distribution
A chi-squared variable is only defined over positive values. The precise shape of the
distribution is dependent on the degrees of freedom, which is a parameter of the
distribution. Figure 9.1 illustrates the chi-squared distribution for a selection of degrees
of freedom.
Although the shape of the distribution does change quite significantly for different
degrees of freedom, note the distribution is always positively skewed.
9.7.5 Degrees of freedom
The general expression for the number of degrees of freedom in tests for association is:
(number of cells) − (number of times data are used to calculate the Eij s) .
For an r × c contingency table, we begin with rc cells. We lose one degree of freedom for needing to use the total number of observations to compute the expected frequencies. However, we also use the row and column totals in these calculations, but we only need r − 1 row totals and c − 1 column totals, as the final one in each case can be deduced using the total number of observations. Hence we only lose r − 1 degrees of freedom for the row totals, and similarly we only lose c − 1 degrees of freedom for the column totals. Hence the overall degrees of freedom are:

ν = rc − (r − 1) − (c − 1) − 1 = (r − 1)(c − 1).

⁴ This test statistic was first used by the statistician Karl Pearson, hence it is often referred to as ‘Pearson's chi-squared statistic’. We will not worry in this course about how this test statistic is derived.

Figure 9.1: Examples of the chi-squared distribution with a selection of degrees of freedom (densities shown for 1, 4, 8 and 10 degrees of freedom).
9.7.6 Performing the test
As usual, we choose a significance level, α, at which to conduct the test. However, are
we performing a one-tailed test or a two-tailed test? To determine this, we need to
consider what sort of test statistic value would be considered extreme under H0 . As seen
in Figure 9.1, the chi-squared distribution only takes positive values. The squared term
in the numerator of the test statistic in (9.1) ensures the test statistic value will be
positive (the Eij s in the denominator are clearly positive too). If H0 is true, then the
observed and expected frequencies should be quite similar, since the expected
frequencies are computed conditional on the null hypothesis of independence. This
means that |Oij − Eij | should be quite small for all cells. In contrast if H0 is not true,
then we would expect comparatively large values for |Oij − Eij | due to large differences
between the two sets of frequencies. Therefore, upon squaring |Oij − Eij |, sufficiently
large test statistic values suggest rejection of H0 . Hence χ2 tests of association are
always upper-tailed tests.
Example 9.5 Using the data in Example 9.2, we proceed with the hypothesis test.
Note it is advisable to present your calculations as an extended contingency table as
shown below, where the three rows in each cell correspond to the observed
frequencies, the expected frequencies and the test statistic contributor.
                         Burglary   Robbery   Car theft   Total
A   O1·                     30        19          6         55
    E1·                    18.33     22.00      14.67       55
    (O1· − E1·)²/E1·        7.48      0.41       5.15
B   O2·                     12        23         14         49
    E2·                    16.33     19.60      13.07       49
    (O2· − E2·)²/E2·        1.13      0.59       0.06
C   O3·                      8        18         20         46
    E3·                    15.33     18.40      12.27       46
    (O3· − E3·)²/E3·        3.48      0.01       4.82
Total                       50        60         40        150
Using (9.1), we obtain a test statistic value of:
Σ_{i=1}^{3} Σ_{j=1}^{3} (Oij − Eij)²/Eij = 7.48 + 0.41 + · · · + 4.82 = 23.13.
Since r = c = 3, we have (r − 1)(c − 1) = (3 − 1)(3 − 1) = 4 degrees of freedom. For
α = 0.05, using Table 8 of the New Cambridge Statistical Tables, we obtain an
upper-tail critical value of 9.488. Hence we reject H0 since 9.488 < 23.13. Moving to
the 1% significance level, the critical value is now 13.28 so, again, we reject H0 since
13.28 < 23.13. Therefore, the test is highly significant and we conclude that there is
strong evidence of an association between the factors.
Looking again at the contingency table, comparing observed and expected
frequencies, the interpretation of this association becomes clear – burglary is the
main problem in area A whereas car theft is a problem in area C. (We can deduce
this by looking at the cells with large test statistic contributors, which are a
consequence of large differences between observed and expected frequencies.)
Incidentally, the p-value for this upper-tailed test is (using a computer)
P (X > 23.13) = 0.000119, where X ∼ χ24 , emphasising the extreme significance of
this test statistic value.
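The p-value quoted above can be reproduced without statistical software: for an even number of degrees of freedom k, the chi-squared upper-tail probability has the closed form P(X > x) = e^(−x/2) Σ_{j=0}^{k/2−1} (x/2)^j / j!. The following Python sketch (Python is not part of the course; the closed form holds only for even k) checks both the p-value and the 5% critical value used above.

```python
import math

def chi2_sf_even(x, k):
    """Upper-tail probability P(X > x) for a chi-squared distribution
    with k degrees of freedom; exact closed form, valid only for even k."""
    if k % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k // 2))

# Example 9.5: test statistic 23.13 on 4 degrees of freedom.
p = chi2_sf_even(23.13, 4)
print(round(p, 6))  # 0.000119, the p-value quoted above

# Sanity check against Table 8: the 5% critical value for 4 df is 9.488.
print(round(chi2_sf_even(9.488, 4), 3))  # 0.05
```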
The conclusions in Example 9.5 are fairly obvious, given the small dimensions of the
contingency table. However, for data involving more factors, and more factor levels, this
type of analysis can be very insightful. Cells which make a large contribution to the test
statistic value (i.e. which have large values of (Oij − Eij )2 /Eij ) should be studied
carefully when determining the nature of an association. This is because, in cases where
H0 has been rejected, rather than simply conclude that there is an association between
two categorical variables, it is helpful to describe the nature of the association.
Activity 9.1 An analyst of the retail trade uses as analytical tools the concepts of
‘Footfall’ (the daily number of customers per unit sales area of a shop) and ‘Ticket
price’ (the average sale price of an item in the shop’s offer).
Shops are classified as offering low, medium or high price items and, during any sales
period, as having low, medium or high footfall.
During the January sales, the analyst studies a sample of shops and obtains the
following frequency data for the nine possible combined classifications:
               Low footfall   Medium footfall   High footfall
Low price           22              37                45
Medium price        43             126                75
High price          16              25                23
Conduct a suitable test for association between Ticket classification and Footfall
level, and report on your findings.
Solution
We test:
H0 : No association between ticket price and footfall level
vs.
H1 : Association between ticket price and footfall level.
We next compute the expected values for each cell using:
(row i total × column j total) / total number of observations
which gives:
                              Ticket price
Footfall                  Low     Medium    High     Total
Low    O1·                 22        43      16        81
       E1·                20.45     47.97   12.58      81
       (O1· − E1·)²/E1·    0.12      0.52    0.93
Medium O2·                 37       126      25       188
       E2·                47.46    111.34   29.20     188
       (O2· − E2·)²/E2·    2.30      1.93    0.61
High   O3·                 45        75      23       143
       E3·                36.10     84.69   22.21     143
       (O3· − E3·)²/E3·    2.20      1.11    0.03
Total                     104       244      64       412
The test statistic value is:
Σ_{i=1}^{3} Σ_{j=1}^{3} (Oij − Eij)²/Eij = 2.20 + 1.11 + · · · + 0.93 = 9.73.
Since r = c = 3, we have (r − 1)(c − 1) = (3 − 1)(3 − 1) = 4 degrees of freedom.
For α = 0.05, using Table 8 of the New Cambridge Statistical Tables, we obtain an
upper-tail critical value of 9.488. Hence we reject H0 since 9.488 < 9.73. Moving to
the 1% significance level, the critical value is now 13.277 so we do not reject H0 since
9.73 < 13.277. Therefore, the test is moderately significant and we conclude that
there is moderate evidence of an association between ticket price and footfall.
Activity 9.2 Give an example of a 2 × 2 contingency table in which there is:
(a) no association
(b) strong association.
Be sure to show why each contingency table exhibits the property.
Solution
(a) Any 2 × 2 contingency table with equal observed frequencies would suffice. For
example (where k is some constant):
             Level 1   Level 2
Level 1         k         k
Level 2         k         k
Clearly, Oij = Eij = k for all cells, leading to a test statistic value of zero for the
chi-squared test.
(b) A possible example of a 2 × 2 contingency table showing strong association is:
             Level 1   Level 2
Level 1        100       200
Level 2        200       100
All expected frequencies are 300 × 300/600 = 150, and (Oij − Eij )2 /Eij = 16.67
for all cells. Therefore, the test statistic value for the chi-squared test is
4 × 16.67 = 66.67 which is highly significant given the critical value for the 0.1%
significance level is 10.83.
Activity 9.3 The table below shows the relationship between gender and party
identification in a US state.
               Party identification
Gender     Democrat   Independent   Republican   Total
Male          108          72           198        378
Female        216          59           142        417
Total         324         131           340        795
Test for an association between gender and party identification at two appropriate
significance levels and comment on your results.
Solution
We test:
H0 : There is no association between gender and party identification
vs.
H1 : There is an association between gender and party identification.
We construct the following contingency table:
                             Party identification
                        Democrat   Independent   Republican   Total
Male   O1j                108          72           198        378
       E1j               154.05       62.29        161.66      378
       (O1j − E1j)²/E1j   13.77        1.51          8.17
Female O2j                216          59           142        417
       E2j               169.95       68.71        178.34      417
       (O2j − E2j)²/E2j   12.48        1.37          7.40
Total                     324         131           340        795
The χ² test statistic value is Σᵢ,ⱼ (Oij − Eij)²/Eij = 44.71. The number of degrees of
freedom is (2 − 1)(3 − 1) = 2, so we compare the test statistic with the χ²
distribution with 2 degrees of freedom in Table 8 of the New Cambridge Statistical Tables. The test is
significant at the 5% significance level (critical value is 5.991, and 5.991 < 44.71).
Clearly, it is also significant at the 1% significance level (critical value is 9.210, and
9.210 < 44.71) so we have found (very!) strong evidence of an association between
gender and party identification.
Comparing the observed and expected frequencies, we see that males are more likely
to identify as Republican supporters, while females are more likely to identify as
Democrat supporters.
Activity 9.4 The following table shows the numbers of car accidents in an urban
area over a period of time. These are classified by severity and by type of vehicle.
Carry out a test for association on these data and draw conclusions.
              Severity of accident
              Minor   Medium   Major
Saloon          29      39       16
Van             15      24       12
Sports car       7      20       12
Solution
We test:
H0 : There is no association between type of vehicle and severity of accident
vs.
H1 : There is an association between type of vehicle and severity of accident.
We construct the following contingency table:
                             Severity of accident
                         Minor   Medium   Major   Total
Saloon O1j                 29      39       16      84
       E1j                24.6    40.1     19.3     84
       (O1j − E1j)²/E1j    0.78    0.03     0.57
Van    O2j                 15      24       12      51
       E2j                15.0    24.3     11.7     51
       (O2j − E2j)²/E2j    0.00    0.00     0.00
Sports O3j                  7      20       12      39
       E3j                11.4    18.6      9.0     39
       (O3j − E3j)²/E3j    1.72    0.10     1.03
Total                      51      83       40     174
The χ² test statistic value is Σᵢ,ⱼ (Oij − Eij)²/Eij = 4.24. The number of degrees of
freedom is (3 − 1)(3 − 1) = 4, so we compare the test statistic with the χ²
distribution with 4 degrees of freedom in Table 8 of the New Cambridge Statistical Tables.
The test is not significant at the 5% significance level (critical value is 9.488, and
4.24 < 9.488).
It is also not significant at the 10% significance level (critical value is 7.779, and
4.24 < 7.779), so we have found no evidence of an association between type of vehicle
and severity of accident.
Activity 9.5 The classification of examination candidates for three examining
bodies is shown below. Is there evidence that the examining bodies are classifying
differently? If so, explain what the differences are.
          Examining bodies
          A     B     C     D
Pass     233   737   358   176
Refer     16    68    29    20
Fail      73   167   136    64
Solution
We test:
H0 : There is no association between examining bodies and grade classifications
vs.
H1 : There is an association between examining bodies and grade classifications.
We construct the following contingency table:
                              Examining bodies
                          A       B       C       D     Total
Pass   O1j               233     737     358     176     1504
       E1j               233.2   703.8   378.7   188.3   1504
       (O1j − E1j)²/E1j    0       1.56    1.13    0.80
Refer  O2j                16      68      29      20      133
       E2j                20.6    62.2    33.5    16.6    133
       (O2j − E2j)²/E2j    1.04    0.53    0.60    0.53
Fail   O3j                73     167     136      64      440
       E3j                68.2   205.9   110.8    55.1    440
       (O3j − E3j)²/E3j    0.34    7.35    5.73    1.47
Total                    322     972     523     260     2077
The χ² test statistic value is Σᵢ,ⱼ (Oij − Eij)²/Eij = 21.21. The number of degrees of
freedom is (4 − 1)(3 − 1) = 6, so we compare the test statistic with the χ²
distribution with 6 degrees of freedom in Table 8 of the New Cambridge Statistical Tables.
The test is significant at the 5% significance level (the critical value is 12.59, and
12.59 < 21.21), indeed it is also significant at the 1% significance level (the critical
value is 16.81, and 16.81 < 21.21). So we can conclude that the test is highly
significant and hence there is an association between examining bodies and grade
classifications.
Looking at the table for large individual contributors to the test statistic value and
comparing observed and expected frequencies, it appears that B fails fewer people
than expected under independence and C fails more. Note that none of the other
individual values of the χ2 test statistic is a significant fraction of the 21.21 total.
Activity 9.6 In a random sample of 200 children in a school it was found that 171
had been inoculated against the common cold before the winter. The table below
shows the numbers observed to have suffered from colds over the winter season.
                  No cold   Cold
Inoculated           40      131
Not inoculated        5       24
Test for an association between colds and inoculation using the standard test for a
contingency table, and draw conclusions.
Solution
We test:
H0 : There is no association between being inoculated and cold prevention.
vs.
H1 : There is an association between being inoculated and cold prevention.
We construct the following contingency table:
                            Inoculated   Not inoculated   Total
No cold   O1j                   40              5            45
          E1j                   38.475          6.525        45
          (O1j − E1j)²/E1j       0.060          0.356
Cold      O2j                  131             24           155
          E2j                  132.525         22.475       155
          (O2j − E2j)²/E2j       0.018          0.103
Total                          171             29           200
The number of degrees of freedom is (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1.
The χ2 test statistic value is 0.060 + 0.356 + 0.018 + 0.103 = 0.537.
For comparison, the χ21 critical value for α = 0.3 (i.e. a 30% significance level!) is
1.074. Since 0.537 < 1.074, the test statistic value is clearly not significant. There is
no evidence that inoculation prevents colds.
9.7.7 Extending to an appropriate test of proportions
For a 2 × 2 contingency table, there are two equivalent ways of testing the
hypothesis: (i) a χ² test as outlined above, and (ii) a test of the difference
between two proportions as detailed in Chapter 8. See the fourth Learning activity at the end of this
chapter for an example to practise.
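This equivalence can be verified numerically. The sketch below (Python, for illustration only) applies both procedures to the 2 × 2 inoculation table of Activity 9.6: the squared two-proportion z statistic (with pooled variance, as in Chapter 8) coincides with the χ² statistic.

```python
import math

# Sketch (for illustration only): for a 2 x 2 table, the chi-squared test
# of association and the pooled two-proportion z test agree, in the sense
# that z squared equals the chi-squared statistic. Data are the
# inoculation counts of Activity 9.6.

no_cold = (40, 5)     # (inoculated, not inoculated)
cold = (131, 24)

n1 = no_cold[0] + cold[0]    # 171 inoculated
n2 = no_cold[1] + cold[1]    # 29 not inoculated
n = n1 + n2

# Two-proportion z test: proportions with no cold in each group.
p1, p2 = no_cold[0] / n1, no_cold[1] / n2
pooled = (no_cold[0] + no_cold[1]) / n
z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

# Chi-squared statistic for the same table.
col_totals = (n1, n2)
chi2 = 0.0
for row in (no_cold, cold):
    row_total = sum(row)
    for o, col_total in zip(row, col_totals):
        e = row_total * col_total / n
        chi2 += (o - e) ** 2 / e

print(round(z ** 2, 3), round(chi2, 3))  # both 0.538
```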
9.8 Goodness-of-fit tests
In addition to tests of association, the chi-squared distribution is often used more
generally in so-called ‘goodness-of-fit’ tests. We may, for example, wish to answer
hypotheses such as ‘Is it reasonable to assume the data follow a particular distribution?’.
This justifies the name ‘goodness-of-fit’ tests, since we are testing whether or not a
particular probability distribution provides an adequate fit to the observed data. The
null hypothesis will assert that a specific hypothetical population distribution is the
true one. The alternative hypothesis is that this specific distribution is not the true one.
There is a special case which we shall consider in ST104a Statistics 1 when you are
only dealing with one row or one column. This is when we wish to test that the sample
data are drawn from a (discrete) uniform distribution, i.e. that each characteristic is
equally likely.
Example 9.6 Is a given die fair? If the die is fair, then the values of the faces (1, 2,
3, 4, 5 and 6) are all equally likely so we have the following hypotheses:
H0 : Score is uniformly distributed vs. H1 : Score is not uniformly distributed.
9.8.1 Observed and expected frequencies
As with tests of association, the goodness-of-fit test involves both observed and expected
frequencies. In all goodness-of-fit tests, the sample data must be expressed in the form
of observed frequencies associated with certain classifications of the data. Assume k
classifications, hence the observed frequencies can be denoted by Oi , for i = 1, . . . , k.
Example 9.7 Extending Example 9.6, for a die the obvious classifications would be
the six faces. If the die is thrown n times, then our observed frequency data would
be the number of times each face appeared. Here k = 6.
Recall that in hypothesis testing we always assume that the null hypothesis, H0 , is true.
In order to conduct a goodness-of-fit test, expected frequencies are computed
conditional on the probability distribution expressed in H0 . The test statistic will then
involve a comparison of the observed and expected frequencies. If H0 is true, then we
would expect small differences between these two sets of frequencies, while large
differences would indicate support for H1 .
We now consider how to compute expected frequencies for discrete uniform probability
distributions.
Expected frequencies in goodness-of-fit tests
For discrete uniform probability distributions, expected frequencies are computed as:
Ei = n × (1/k),   for i = 1, 2, . . . , k

where n denotes the sample size and 1/k is the uniform (i.e. equal, same) probability
for each characteristic.
Expected frequencies should not be rounded, just as we do not round sample means,
say. Note that the final expected frequency (for the kth category) can easily be
computed using the formula:
Ek = n − Σ_{i=1}^{k−1} Ei.
This is because we have a constraint that the sum of the observed and expected
frequencies must be equal,5 that is:
Σ_{i=1}^{k} Oi = Σ_{i=1}^{k} Ei
which results in a loss of one degree of freedom (discussed below).
⁵ Recall the motivation for computing expected frequencies in the first place – assuming H0 is true,
we want to know how a sample of size n is expected to be distributed across the k categories.
9.8.2 The goodness-of-fit test
Having computed expected frequencies, we now need to formally test H0 and so we need
a test statistic.
Goodness-of-fit test statistic
For a discrete uniform distribution with k categories, observed frequencies Oi and
expected frequencies Ei, the test statistic is:

Σ_{i=1}^{k} (Oi − Ei)²/Ei ∼ χ²_{k−1}   approximately under H0.   (9.2)
Note that this test statistic does not have a true χ²_{k−1} distribution under H0; rather, it is
only an approximating distribution. An important point to note is that this
approximation is only good enough provided all the expected frequencies are at least 5.
As seen in (9.2), there are k − 1 degrees of freedom when testing a discrete uniform
distribution: k is the number of categories, and we lose one degree of freedom due to the
constraint that:

Σᵢ Oi = Σᵢ Ei.
As with the test of association, goodness-of-fit tests are upper-tailed tests as, under H0 ,
we would expect to see small differences between the observed and expected
frequencies, as the expected frequencies are computed conditional on H0 . Hence large
test statistic values are considered extreme under H0 , since these arise due to large
differences between the observed and expected frequencies.
Example 9.8 A confectionery company is trying out different wrappers for a
chocolate bar – its original, A, and two new ones, B and C. It puts the bars on
display in a supermarket and looks to see how many of each wrapper type have been
sold in the first hour, with the following results.
Wrapper type            A    B    C    Total
Observed frequencies    8   10   15     33
Is there a difference between wrapper types in the consumer choices made?
To answer this we need to test:
H0 : There is no difference in preference for the wrapper types.
vs.
H1 : There is a difference in preference for the wrapper types.
How do we work out the expected frequencies? Well, for equal preferences, with
three choices each (k = 3), the expected frequencies will be:
Ei = 33 × 1/3 = 11,   for i = 1, 2 and 3.
Applying (9.2), our test statistic value is:
Σ_{i=1}^{3} (Oi − Ei)²/Ei = (8 − 11)²/11 + (10 − 11)²/11 + (15 − 11)²/11 = 2.364.
The degrees of freedom will be k − 1 = 3 − 1 = 2. At the 5% significance level, the
upper-tail critical value is 5.991, using Table 8 of the New Cambridge Statistical
Tables, so we do not reject H0 since 2.364 < 5.991. If we now consider the 10%
significance level,⁶ the critical value is 4.605, so again we do not reject H0 since
2.364 < 4.605. Hence the test is not significant. Therefore, there is no evidence of a
strict preference for a particular wrapper type based on the choices observed in the
supermarket during the first hour.7
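The arithmetic of this example is easily checked by machine. The following Python sketch (not part of the course) reproduces the goodness-of-fit statistic for the wrapper data:

```python
# Sketch (not part of the course): the goodness-of-fit statistic of
# Example 9.8, testing a discrete uniform distribution over k = 3 wrappers.

observed = [8, 10, 15]              # sales with wrappers A, B, C
n, k = sum(observed), len(observed)

expected = [n / k] * k              # E_i = n * (1/k) = 11 for each wrapper

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = k - 1

print(round(stat, 3), dof)  # 2.364 on 2 degrees of freedom, as in the example
```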
Activity 9.7 Many people believe that when a horse races, it has a better chance of
winning if its starting line-up position is closer to the rail on the inside of the track.
The starting position of 1 is closest to the inside rail, followed by position 2, and so
on. The table below lists the numbers of wins of horses in the different starting
positions. Do the data support the claim that the probabilities of winning in the
different starting positions are not all the same?
Starting position    1    2    3    4    5    6    7    8
Number of wins      29   19   18   25   17   10   15   11
Solution
We test whether the data follow a discrete uniform distribution of 8 categories. Let
pi = P (X = i), for i = 1, . . . , 8.
We test the null hypothesis H0 : pi = 1/8, for i = 1, . . . , 8.
Note n = 29 + 19 + 18 + 25 + 17 + 10 + 15 + 11 = 144. The expected frequencies are
Ei = 144/8 = 18, for all i = 1, . . . , 8.
Starting position      1      2     3     4      5      6     7     8    Total
Oi                    29     19    18    25     17     10    15    11     144
Ei                    18     18    18    18     18     18    18    18     144
Oi − Ei               11      1     0     7     −1     −8    −3    −7       0
(Oi − Ei)²/Ei        6.72   0.06    0   2.72   0.06   3.56  0.50  2.72   16.34
Under H0, Σᵢ (Oi − Ei)²/Ei ∼ χ² with 7 degrees of freedom. At the 5% significance level, the critical value is
14.067. Since 14.067 < 16.34, we reject the null hypothesis. Turning to the 1%
⁶ Remember if the test is not significant at the 5% significance level, we should then test at the 10%
significance level. See Figure 8.1.
⁷ This is the same as testing for the suitability of the discrete uniform distribution, i.e. that each
wrapper is equally popular.
significance level, the critical value is 18.475. Since 16.34 < 18.475, we cannot reject
the null hypothesis, hence we conclude that the test is moderately significant. There
is moderate, but not strong, evidence to support the claim that the chances of
winning in the different starting positions are not all the same.
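As a check on the arithmetic, the following Python sketch (not part of the course) recomputes the statistic; working with unrounded cell contributions gives 16.33, against the 16.34 obtained above from per-cell rounding.

```python
# Sketch (not part of the course): recomputing the goodness-of-fit
# statistic of Activity 9.7 for a discrete uniform null distribution.

wins = [29, 19, 18, 25, 17, 10, 15, 11]   # wins by starting position 1-8
n, k = sum(wins), len(wins)               # n = 144, k = 8
e = n / k                                 # E_i = 18 for every position

stat = sum((o - e) ** 2 / e for o in wins)
dof = k - 1

print(round(stat, 2), dof)  # 16.33 on 7 degrees of freedom
```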
9.9 Summary
You should regard this chapter as an opportunity to revise your work on hypothesis
testing in Chapter 8 and also revisit your work on testing proportions. The only new
material here is a way of testing the significance of countable figures as opposed to their
attributes.
9.10 Key terms and concepts
Association
Categorical variable
Chi-squared distribution
Contingency table
Cross-tabulation
Discrete uniform distribution
Expected frequency
Goodness-of-fit test
Observed frequency

9.11 Learning activities
Solutions can be found in Appendix A.
1. A survey has been made of levels of satisfaction with housing by people living in
different types of accommodation. Levels of satisfaction are categorised as being
high, medium, low and very dissatisfied. Levels of housing type are public housing
apartment, public housing house, private apartment, private detached house,
private semi-detached house, miscellaneous (includes boat, caravan etc). State:
(a) the null and alternative hypotheses
(b) the degrees of freedom
(c) the 5% and 1% critical values for a χ2 test.
(Remember that you will reject the null hypothesis if your calculated test statistic
value is greater than the critical value obtained from Table 8 of the New Cambridge
Statistical Tables.)
2. Draw the χ2 curve and put in the rejection region for 5% and 1% significance levels
with 6 degrees of freedom. Make sure you understand which calculated values of χ2
will lead you to reject H0 .
3. You have carried out a χ2 test on a 3 × 4 contingency table to study whether there
is evidence of an association between advertising type and level of sales of a
product. You have four types of advertising (A, B, C and D) and three levels of
sales (low, medium and high).
Your calculated χ2 value is 13.5. Giving degrees of freedom and an appropriate
significance level, set out your hypotheses. What would you say about the result?
4. In a survey conducted to decide where to locate a factory, samples from five towns
were examined to see the numbers of skilled and unskilled workers. The data were
as follows.
        Number of          Number of
Area    skilled workers    unskilled workers
A            80                 184
B            58                 147
C           114                 276
D            55                 196
E            83                 229
(a) Does the population proportion of skilled workers vary with the area?
(b) Test:

H0 : πD = πothers   vs.   H1 : πD < πothers.
Think about your results and what you would explain to your management
team who had seen the chi-squared results and want, for other reasons, to site
their factory at area D.
5. Look at the following table taken from a study of sex differences in perception. One
of the tests was of scores in verbal reasoning.
          High   Low   Totals
Male        50   150      200
Female      90   210      300
Do these figures show a difference in verbal reasoning by gender?
Try to do this:
(a) using a chi-squared test
(b) using a test of differences between proportions.
6. The table below shows a contingency table for a sample of 1,104 randomly-selected
adults from three types of environment (City, Town and Rural) who have been
classified into two groups by level of exercise (high and low). Test the hypothesis
that there is no association between level of exercise and type of environment and
draw conclusions.
               Level of exercise
Environment    High    Low
City            221    256
Town            230    118
Rural           159    120
7. Two surveys have collected information on adult and teenage cigarette use. The
results of the first survey are given in the first table below (sample size 4,000) and
the results of the second survey, carried out two years later on a new sample of
1,000 households, are underneath it.
(a) Without doing any further calculations, comment on any association for the
first survey.
(b) Calculate the chi-squared statistic for the second survey and test for
association.
(c) Write a short report explaining what the first survey table shows about the
nature of the association, if any, between adult and teenage cigarette use in a
household, and (by comparing the two tables) discuss whether or not the
extent of any association changed over the years.
Survey 1 (χ² = 4.61)

Teenage             Adult cigarette use
cigarette use       Yes               No                Total
Yes                   198   (8.4%)      170  (10.4%)      368
No                  2,164  (91.6%)    1,468  (89.6%)    3,632
Total               2,362 (100.0%)    1,638 (100.0%)    4,000

Survey 2

Teenage             Adult cigarette use
cigarette use       Yes               No                Total
Yes                    48   (8.1%)       45  (11.0%)       93
No                    544  (91.9%)      363  (89.0%)      907
Total                 592 (100.0%)      408 (100.0%)    1,000
8. You have been given the number of births in a country for each of the four seasons
of the year and are asked whether births vary over the year. What would you need
to know in order to carry out a chi-squared test of the hypothesis that births are
spread uniformly between the four seasons? Outline the steps in your work.
9.12 A reminder of your learning outcomes
After completing this chapter, and having completed the Essential reading and
activities, you should be able to:
set up the null and alternative hypotheses appropriate for a contingency table
compute the degrees of freedom, expected frequencies and appropriate critical
values of chi-squared for a contingency table
summarise the limitations of a chi-squared test
be able to extend from chi-squared to an appropriate test of proportions, if
necessary
be able to work with a one-row or one-column contingency table as above.
9.13 Sample examination questions
Solutions can be found in Appendix B.
1. State whether the following statements are true or false. Give a brief explanation.
(a) A negative chi-squared value shows that there is little association between the
variables tested.
(b) Similar observed and expected frequencies indicate strong evidence of an
association in an r × c contingency table.
2. An experiment was conducted to examine whether age, in particular being over 30
or not, has any effect on preferences for a digital or an analogue watch. Specifically,
129 randomly-selected people were asked what watch they prefer and their
responses are summarised in the table below:
                   30 years old or younger   Over 30 years old
Analogue watch               10                      31
Undecided                    17                      22
Digital watch                37                      12
(a) Based on the data in the table, and without conducting any significance test,
would you say there is an association between age and watch preference?
Provide a brief justification for your answer.
(b) Calculate the χ2 statistic for the hypothesis of independence between age and
watch preference, and test that hypothesis. What do you conclude?
3. Set out the null and alternative hypotheses, degrees of freedom, expected
frequencies, and 10%, 5% and 1% critical values for the following problem. The
following figures give live births by season in a town.
Season    Live births
Spring        100
Summer        200
Autumn        250
Winter        180
The number of days per season in this country are 93 (Spring), 80 (Summer), 100
(Autumn) and 92 (Winter). Is there any evidence that births vary over the year?
Hint: You would expect, if the births are uniformly distributed over the year, that
the number of births would be proportional to the number of days per season. So
work out your expected frequencies by taking the number of days per season
divided by the number of days in the year and multiplying by the total number of
births over the year.
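The expected-frequency step described in the hint can be sketched as follows (Python, for illustration only; the full solution is in Appendix B). Note that 730 births over 365 days makes each season's expected count exactly twice its number of days.

```python
# Sketch of the hint: with seasons of unequal length, expected births
# under a uniform (per-day) distribution are proportional to season length.

births = {"Spring": 100, "Summer": 200, "Autumn": 250, "Winter": 180}
days = {"Spring": 93, "Summer": 80, "Autumn": 100, "Winter": 92}

total_births = sum(births.values())   # 730
total_days = sum(days.values())       # 365

# E_i = (days in season / days in year) * total births.
expected = {s: days[s] * total_births / total_days for s in births}

print(expected)  # {'Spring': 186.0, 'Summer': 160.0, 'Autumn': 200.0, 'Winter': 184.0}
```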
Chapter 10
Sampling design
10.1 Aims
If you are using this subject guide in the order in which it is presented, you can now
take a break from learning any more new statistical ideas for a while. Now is your
chance to consolidate these ideas by looking at their applications in business and the
social sciences. You should aim to consider the following types of question.
What are your choices when you design a survey?
How reliable are results of surveys reported in the media?
Are random samples the best we can do in every circumstance?
10.2 Learning outcomes
After completing this chapter, and having completed the Essential reading and
activities, you should be able to:
define random, simple random and quota sampling and describe the implications of
using them
explain the reasons for stratifying and clustering samples
describe the factors which contribute to errors in surveys, including:
• inaccurate and poorly-judged sampling frames
• sampling error
• non-sampling error (non-response, biased response, interviewer error)
discuss the various methods of contact which may be used in a survey and the
related implications:
• interviewer
• postal
• other self-administered
• telephone.
10.3 Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060]
Section 6.1 and Chapter 17.
In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials
accessible via the ST104a Statistics 1 area at http://my.londoninternational.ac.uk
10.4 Further reading
Office for National Statistics Social Trends (Basingstoke: Palgrave Macmillan,
2009) [ISBN 9780230220508].
10.5 Introduction
This chapter describes the main stages of a survey and the sources of error at each
stage. Remember that a first treatment and background to this subject matter is given
earlier in Chapter 3. It makes sense to review that chapter now before continuing. This
part of the course is the foundation for your future work in applied social science,
business and management. There is not much point in learning the various techniques
we have introduced you to in the rest of the subject guide unless you understand the
sources and limitations of the data you are using. This is important to academics and
policymakers alike!
10.6 Motivation for sampling
In Chapter 3, we introduced the term target population as representing the collection
of units (people, objects etc.) in which we are interested. In the absence of time and
budget constraints we conduct a census, that is a total enumeration of the population.
Examples include the population census in the UK and in other countries around the
world. Its advantage is that there is no sampling error because all population units are
observed, and so there is no estimation of population parameters – we can, in principle,
determine the true values exactly.
Of course, in the real world, we do need to take into account time and budget
constraints. Due to the large size, N , of most populations, an obvious disadvantage with
a census is cost. So a census is not feasible in practice because there is only a limited
amount of information which is cost-effective to collect. Also, non-sampling errors may
occur. For example, if we have to resort to using cheaper (hence less reliable)
interviewers, then they may erroneously record data, misunderstand a respondent etc.
So we select a sample, that is a certain number of population members are selected and
studied. The selected members are known as elementary sampling units.
Sample surveys (hereafter ‘surveys’) are how new data are collected on a population
and tend to be based on samples rather than a census. However, in practice, surveys are
usually conducted under circumstances which cannot be fully controlled. Since the
sample size, n, can be very large, the design and implementation of surveys requires
considerable planning and teamwork.
Example 10.1 Examples of sample surveys include:
demography: study of births, deaths, families etc.
government: study of crime, education, employment, housing, health etc.
market research: study of consumer preferences, attitudes etc.
political science: study of voting intention.
Selected respondents may be contacted using a variety of methods such as face-to-face
interviews, telephone, mail or email questionnaires. Sampling errors will occur (since
not all population units are observed, so we resort to estimating population parameter
values, which means there will be some uncertainty attached to our point estimates),
but because of the smaller numbers involved (since typically n ≪ N)¹ resources can be
used to ensure high-quality interviews or to check completed questionnaires. Therefore,
non-sampling errors should be less and consequently researchers can ask more questions.
So a census is difficult to administer, time-consuming, expensive and does not guarantee
completely accurate results due to non-sampling errors. In contrast, a sample is easier,
faster and cheaper, although it introduces sampling errors while non-sampling errors
can still occur. Unsurprisingly, we would like sampling (and non-sampling) errors to be
minimised.
Before proceeding, it is worth spending a moment thinking about the classification of
data into primary data and secondary data:
primary data are new data collected by researchers for a particular purpose
secondary data refer to existing data which have already been collected by others
or for another purpose.
Of course, if secondary data exist, then there would be no need to collect new data! In
ST104a Statistics 1, the focus will be on the collection of primary data using various
sampling techniques.
Activity 10.1 Under what conditions would (a) a sample be preferable to a census,
and (b) a census be preferable to a sample?
Solution
A sample is preferable to a census in the following cases:
• Cost constraints prohibit measuring all target respondents.
• The time available to conduct the study is relatively short.
1 The symbol ≪ means 'much less than'.
10. Sampling design
• The target population size is large.
• The variance of the key characteristics of the study is small.
• The cost of sampling error is low, i.e. there are no ‘significant’ respondents who,
if they were omitted, would significantly bias the results.
• The cost of non-sampling error is high. For example, if interviewers were to make
an error in recording responses, the costs would be exacerbated in a census.
• The nature of measurement is destructive, i.e. damage or some change to those
measured is incurred.
• Attention must be given to individual cases.
In a ‘mirror image’, a census is preferable to a sample in the following cases:
• Cost constraints are minimal, i.e. when there is a large budget.
• The time available to conduct the study is relatively long.
• The target population size is small.
• The variance of the key characteristics of the study is large.
• The cost of sampling error is high.
• The cost of non-sampling error is low.
• The nature of measurement is non-destructive.
• Attention need not be given to individual cases.
10.7 Types of sample
We seek a sample which is representative of the whole population to ensure our
estimated parameter values are good approximations of the true parameter values. A
particular difficulty is how to select a sample which yields a ‘fair’ representation.
Fortunately, there are many different sampling techniques (reviewed below) from which
to choose. As expected, these all have relative advantages and disadvantages, hence
each will be suitable in particular situations. Knowledge of these techniques is
frequently examined in this course, so it is strongly advised that you become very
familiar with this material.
Of course, consideration also needs to be given to the choice of sample size, n. In
practice we face a trade-off – n should be large enough to give a fair representation of
the population, yet be sufficiently small to be practical and cost-effective. Section 6.11
considered this problem in the context of setting a tolerance on the maximum permitted
sampling error. Qualitative factors to take into consideration when determining n are
study objectives, budget and time constraints, type of data analysis planned and
non-sampling errors.
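This trade-off can be made concrete using the Section 6.11 approach for estimating a mean: choose the smallest n for which the maximum sampling error at a given confidence level does not exceed a tolerance e, giving n ≥ (zσ/e)². A minimal Python sketch (the function name and the figures used are illustrative, not part of the course):

```python
import math

def sample_size(sigma, e, z=1.96):
    """Smallest n for which the sampling error z*sigma/sqrt(n) is at most e.

    A sketch of the Section 6.11 idea, assuming the population standard
    deviation sigma is (approximately) known; z = 1.96 corresponds to
    95% confidence.
    """
    return math.ceil((z * sigma / e) ** 2)

# Halving the tolerance e roughly quadruples the required sample size.
print(sample_size(sigma=10, e=2))  # 97
print(sample_size(sigma=10, e=1))  # 385
```

Note how quickly n grows as the tolerance tightens, which is exactly why cost and time constraints pull against accuracy.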
We proceed by partitioning sampling techniques into two groups:
non-probability (non-random) sampling
probability (random) sampling.
10.7.1 Non-probability sampling
Non-probability samples are characterised by the following properties.
Some population units have no chance (i.e. a zero probability) of being selected.
Units which can be selected have an unknown (non-zero) probability of selection.
Sampling errors cannot be quantified.
Non-probability samples are used in the absence of a sampling frame.2
We shall consider three types of non-probability sampling.
Convenience sampling – individuals are selected from the members of the
population which happen to be in the vicinity. For example, those people in a
shopping mall at the time of the survey, say a school day. This technique is
convenient, hence the name! Of course, such an approach is unlikely to yield a
representative sample. (In the mall example, children and working professionals are
unlikely to be in the mall at the time.)
Judgemental sampling – this uses expert opinion to judge which individuals to
choose. An example would be a research firm selecting people to participate in a
focus group3 by telephoning people and identifying who matches the target profile
through a series of questions developed by an expert.
Quota sampling – this attempts to obtain a representative sample by specifying
quotas on certain specified characteristics, such as age, gender, social class and any
other variables relevant to the investigation being undertaken. The (approximate)
distribution of such characteristics in the population is required in order to
replicate it in the sample.
Of these, quota sampling is the most important type which we will consider now in
greater depth.
Activity 10.2 What is the difference between judgemental and convenience
sampling? Give examples of where these techniques may be applied successfully.
2 Sampling frames (lists) were discussed at length in Chapter 3.
3 Focus groups are used for qualitative research by asking a group of individuals about their opinions toward a new product or advertising campaign, consumer preferences etc.
Solution
The main difference is that judgemental sampling purposively selects sampling units
based on the judgement of the researcher as opposed to the haphazard method of
convenience sampling.
Judgemental sampling could be used, for example, to try out a new product on
consumers who in the judgement of the researcher are sophisticated, highly
discerning and extremely fussy – the researcher may contend that satisfying such a
group may mean high levels of satisfaction in less discerning groups. The success of a
judgemental sample lies in how knowledgeable the researcher is of the target
population and how well they justify the selection of specific elements. Given these
requirements, judgemental sampling is usually limited to relatively small target
populations and qualitative studies.
Convenience sampling could be used in the piloting of a questionnaire. For example,
imagine a postal questionnaire which is to be administered to retired women. A
group of retired women who may be easy to contact and willing to discuss problems
with the questionnaire and are elements of the target population would make a good
convenience sample.
Quota sampling
Quota sampling is useful in the absence of a sampling frame, since non-probability
sampling techniques do not require a sampling frame. Another reason for conducting a
quota sample, instead of a random sample, might be speed. We may be in a hurry and
not want to spend time organising interviewers for a random sample – it is much
quicker to set target numbers (quotas) to interview.
As with any form of sampling, the objective is to obtain a sample which is representative
of the population. Quota sampling is no exception and to do so the interviewer:
seeks out units which satisfy some control characteristics (known as quota
controls), such as age, gender and social class
requires the distribution of these characteristics in the population to be
(approximately) known in order to replicate it in the sample.
Quota sampling is cheap, but it may be systematically biased by the choice of
interviewee made by the interviewer, and their willingness to reply. For instance,
interviewers might avoid choosing anyone who looks threatening, or too busy, or too
strange! Quota sampling also does not allow us to measure the sampling error – a
consequence of it being a non-probability sampling technique.
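As an illustration, the quota-filling process an interviewer follows can be sketched as below. The quota cells and the notion of a 'stream' of passers-by are hypothetical constructions for this sketch, not something prescribed by the guide:

```python
def quota_sample(stream, quotas):
    """Fill quotas from people in the order they happen to be encountered.

    `stream` yields (person_id, gender, age_group) tuples in arrival order;
    `quotas` maps (gender, age_group) cells to target counts. There is no
    sampling frame and no randomisation, which is why the sampling error of
    a quota sample cannot be quantified.
    """
    remaining = dict(quotas)
    sample = []
    for person_id, gender, age_group in stream:
        cell = (gender, age_group)
        if remaining.get(cell, 0) > 0:    # this cell still needs respondents
            sample.append(person_id)
            remaining[cell] -= 1
        if not any(remaining.values()):   # every quota met: stop interviewing
            break
    return sample

# Hypothetical quota controls mirroring an assumed population distribution.
quotas = {("female", "18-34"): 3, ("female", "35+"): 2,
          ("male", "18-34"): 3, ("male", "35+"): 2}
```

Notice that once a cell is full, further passers-by who match it are simply skipped, which is where the interviewer's discretion (and hence potential bias) enters.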
Each of the three techniques above (convenience, judgemental and quota) has an
appeal; however, in all cases we have no real guarantee that we have achieved an
adequately representative sample – we might do by chance, of course, but this would be
highly unlikely! For example, were the people we interviewed only those working in local
offices? Were the young adults all students?
Basically, since we do not know the probability that an individual will be selected for
the survey, the basic rules of inference which we have been learning (confidence intervals
and hypothesis tests) do not apply. Specifically, standard errors (a key ingredient in
inferential procedures) are not measurable. However, in the absence of a sampling frame
we will have to resort to non-probability sampling techniques.
That said, non-random samples are also frequently used by market research
organisations or companies when speed (if not accuracy) is important. They are rarely
used by governments.
Example 10.2 You would likely use a quota sample (the main non-probability
sampling technique considered in ST104a Statistics 1) in the following situations.
When speed is important. Clearly, an interviewer with a target to reach a certain
number (quota) of people on a given day is likely to be quicker than one who is
required to contact a specific person or household (as determined by a random
sample). Typical quota controls for the interviewer to meet are:
• age group
• gender
• socio-economic group, or social class.
Note that the more controls the interviewer is given, the longer it will take to
complete the required number of interviews (and hence it will take longer to
complete your study).
No available sampling frame covering the target population. If you think
obtaining a sampling frame is likely to be very complicated, then a sensible
targeting of the population by taking a quota sample might be helpful. You
might, for example, wish to contact drivers of coaches and buses over a set of
routes. There are a lot of bus companies involved, and some of them will not let
you have their list of employees for data protection reasons, say. One of the
things you could do in these circumstances is to carry out a quota sample at
different times of the day.
There are often random alternatives though, using sampling frames you may
not have thought of. In the case above, you might be able to make a list of
scheduled journeys on the routes you wish to study and take a random sample
of routes, interviewing the relevant drivers as they complete their journeys.
When you need to reduce cost. Clearly, time-saving is an important element in
cost-saving.
When accuracy is not important. You may not need to have an answer to your
question to the high – and known – level of accuracy which is possible using a
random sample; rather you merely require a rough idea about a subject.
Perhaps you only need to know if people, on the whole, like your new flavour of
ice cream in order to judge whether or not there is likely to be sufficient
consumer demand to justify full-scale production and distribution. In this case,
asking a representative group of people (a quota) would be perfectly adequate
for your needs.
Although there may be several reasons to justify the use of a quota sample, you should
be aware of the problem caused by the omission of non-respondents. Because you only
count the individuals who reply (unlike random sampling where your estimate has to
allow for bias through non-response), the omission of non-respondents4 can lead to
serious errors as your results would be misleading. For this reason, members of the
British Market Research Association have now agreed to list non-response as it occurs
in their quota samples, and this is regarded as good practice.
One possible remedy to this problem is to introduce more detailed quota controls. For
example, we might ask for age group, gender, employment status, marital status and/or
the age group of the respondent’s children. However, this can take away a lot of the cost
advantages of using a quota sample, rather than a random sample. Imagine the time
you would take locating the last female for your sample aged 35–44, married with
teenage children and a full-time job! There is the additional expense of paying
interviewers more for a smaller number of interviews (on the basis of the time they
spend on the job). If this is not done, the temptation to cheat, and hence make results
completely invalid, will be strong.
Activity 10.3 What are the main potential disadvantages of quota sampling with
respect to probability sampling?
Solution
Quota sampling may yield biased estimates. The size of the bias may be hard to
assess because the sampling is not probability based. Empirically it has been shown
to give inferior results.
10.7.2 Probability sampling
Probability sampling means that every population unit has a known (not necessarily
equal), non-zero probability of being selected in the sample. In all cases selection is
performed through some form of randomisation. For example, a pseudo-random number
generator can be used to generate a sequence of ‘random’ numbers.
Relative to non-probability sampling methods, probability sampling can be expensive
and time-consuming, and also requires a sampling frame. We aim to minimise both the
(random) sampling error and the systematic sampling bias. Since the probability of
selection is known, standard errors can be computed which allows confidence intervals
to be determined and hypothesis tests to be performed.
We shall consider five types of probability sampling.
Simple random sampling (SRS) – a special case where each population unit
has a known, equal and non-zero probability of selection. Of the various probability
samples, its simplicity is desirable and it produces unbiased estimates, although
more accurate methods exist (i.e. those with smaller standard errors).
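With a sampling frame available, an SRS is straightforward to draw. A Python sketch, using a hypothetical frame of N = 100 numbered units (the seed is fixed only to make the illustration reproducible):

```python
import random

def simple_random_sample(frame, n):
    """SRS without replacement: every unit (indeed, every possible set of
    n units) in the sampling frame has the same probability of selection."""
    return random.sample(frame, n)

frame = list(range(1, 101))   # hypothetical sampling frame, N = 100
random.seed(1)
print(simple_random_sample(frame, 10))
```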
Systematic random sampling – a 1-in-x systematic random sample is obtained
by randomly selecting one of the first x units in the sampling frame and then
selecting every subsequent xth unit. Though easy to implement, it is important to
4 Non-response is discussed later in this chapter.
consider how the sampling frame is compiled. For example, the data may exhibit a
periodic property such as sales in a shop. All Monday sales are likely to be similar,
as are all Saturday sales, but Saturdays are traditionally much busier shopping
days. So sales from a 1-in-7 systematic sample will have a large variation between
days, but a small variation within days. This would lead us to underestimate the
true variation in sales. A 1-in-8 systematic sample would be better. Can you say
why?5
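A sketch of a 1-in-x systematic sample, applied to the shop-sales example (eight weeks of daily sales, here labelled only by weekday; the data are hypothetical):

```python
import random

def systematic_sample(frame, x):
    """1-in-x systematic sample: pick one of the first x units at random,
    then take every subsequent xth unit from the sampling frame."""
    start = random.randrange(x)       # random start among the first x units
    return frame[start::x]

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] * 8  # 8 weeks of daily sales
print(set(systematic_sample(days, 7)))  # 1-in-7: only one weekday is ever selected
print(set(systematic_sample(days, 8)))  # 1-in-8: the sample cycles through all weekdays
```

Whatever the random start, the 1-in-7 sample contains a single weekday, whereas the 1-in-8 sample contains all seven, which is the point made in the text about periodic sampling frames.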
Stratified random sampling – this sampling technique achieves a higher level of
accuracy (smaller standard errors) by exploiting natural groupings within the
population. Such groupings are called strata 6 and these are characterised as having
population units which are similar within strata, but are different between strata.
More formally, elements within a stratum are homogeneous while the strata are
collectively heterogeneous.
A simple random sample is taken from each stratum, hence ensuring a
representative overall sample since a broad cross-section of the population will have
been selected – provided suitable strata were created. Hence great care needs to be
taken when choosing the appropriate stratification factors, which should be
relevant to the purpose of the sample survey such as age group or gender.
Imagine we were investigating student satisfaction levels at a university. If we took
a simple random sample of students we could, by chance, select just (or mainly)
first-year students whose opinions may very well differ from those in other year
groups. So in this case a sensible stratification factor would be ‘year of study’. By
taking a simple random sample from each stratum you ensure that you will not end
up with an extreme (non-representative) sample and also avoid the possibility of
one particular group not being represented at all in the sample. Of course, in order
to perform stratified random sampling, we would need to be able to allocate each
population unit to a stratum. For students this should be straightforward, since the
university’s database of student records would no doubt include year of study too.
Cluster sampling – this is used to reduce costs (time and money). Here the
population is divided into clusters, ideally such that each cluster is as variable as
the overall population (i.e. heterogeneity within clusters and homogeneity between
clusters). Next, some of the clusters are selected by SRS. This is where the
economy savings are expected (in face-to-face interviews), since typically the
clusters may be constructed on a geographical basis allowing interviewers to
restrict themselves to visiting households in certain areas rather than having to
travel long distances to meet the respondents required by a simple random sample
who potentially cover a large area. A one-stage cluster sample would mean that
every unit in the chosen clusters is surveyed.
It may be that the cluster sizes are large, meaning that it is not feasible to survey
every unit within a cluster. A two-stage cluster sample involves taking a simple
random sample from the clusters which were selected by SRS during the first stage.
5 A 1-in-7 systematic sample would lead to only one day of the week being represented in the sample, such as Mondays. A 1-in-8 systematic sample would have different days of the week. For example, if the first day chosen was a Monday, the next day selected would be 8 days later, i.e. Tuesday of the following week, followed by a Wednesday etc.
6 'Strata' is the plural form, 'stratum' is the singular form.
When a subsample is taken from a selected cluster, this is known as a multistage
design.
Individual respondents will be identified in a random manner – but crucially within
an area. You will reduce costs (the interviewer will be able to complete a higher
number of interviews in a given time, using less petrol and reducing shoe leather
costs), but will probably have to sacrifice a degree of accuracy, that is the method
is less efficient in general compared to, say, stratified random sampling. However,
cluster sampling is very useful when a sampling frame is not immediately available
for the entire population. For example, individual universities will have databases
of their own students, but a national student database does not exist (admittedly it
may be possible to create a national student database, but it would take a lot of
effort, time and money).
Clustering is clearly useful in an interviewer-administered survey. It is less
important as a design feature for telephone or postal interviews, unless you are
particularly interested in the cluster itself. To the extent that individuals in a
cluster are similar (exhibiting intra-class correlation), they will be less representative
of other clusters. In other words, the variance (hence standard error) of your sample
estimate will be greater.
A further reason for cluster sampling, in addition to cost savings, may arise from
your need as a researcher to look at the clusters themselves for their own sake. An
interest in income and educational levels for a group living in one area, or a study
of the children at a particular school and their reaction to a new television
programme, will require you to look at individuals in a cluster.
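A two-stage cluster sample can be sketched as below, with geographical areas standing in as hypothetical clusters of households:

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster):
    """Two-stage cluster sample: stage one selects some clusters by SRS;
    stage two takes an SRS of units within each selected cluster.

    Surveying every unit of each chosen cluster instead (n_per_cluster
    equal to the cluster size) would give a one-stage cluster sample.
    """
    chosen = random.sample(sorted(clusters), n_clusters)          # stage one
    return {name: random.sample(clusters[name], n_per_cluster)    # stage two
            for name in chosen}

# Hypothetical geographical clusters of 50 households each.
areas = {f"area {i}": list(range(i * 100, i * 100 + 50)) for i in range(10)}
random.seed(3)
sample = two_stage_cluster_sample(areas, n_clusters=3, n_per_cluster=5)
```

An interviewer would then only need to visit the three selected areas, rather than households scattered across all ten, which is where the cost saving comes from.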
Multistage sampling – refers to the case when sample selection occurs at two or
more successive stages (the two-stage cluster sample above is an example).
Multistage sampling is frequently used in large surveys. During the first stage, large
compound units are sampled (primary units). During the second stage, smaller
units (secondary units) are sampled from the primary units. From here, additional
sampling stages of this type may be performed, as required, until we finally sample
the basic units.
As we have already seen, this technique is often used in cluster sampling as we
initially sample main clusters, then clusters within clusters etc. We can also use
multistage sampling with a mixture of techniques. A large government survey will
likely incorporate elements of both stratification and clustering at different stages.
A typical multistage sample in the UK might involve the following:
• divide the areas of the country into strata by industrial region
• sample clusters (local areas) from each industrial region
• from each local area choose some areas for which you have lists (such as
electoral registers) and take a simple random sample from the chosen lists
(clusters).
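The three steps above can be sketched as follows. All region and area names, and the register extracts, are hypothetical; regions act as strata (all are used), local areas as sampled clusters, and a simple random sample is taken from each chosen area's list:

```python
import random

def multistage_sample(regions, areas_per_region, n_per_area):
    """Sketch of the multistage design above: every region (stratum) is
    used, some local areas (clusters) are sampled within each region, and
    an SRS of listed individuals is taken from each chosen area."""
    sample = []
    for region, areas in regions.items():                    # all strata used
        for area in random.sample(sorted(areas), areas_per_region):
            sample.extend(random.sample(areas[area], n_per_area))  # SRS from list
    return sample

# Hypothetical electoral-register extracts by region and local area.
regions = {"North": {"A": list(range(0, 20)), "B": list(range(20, 40)),
                     "C": list(range(40, 60))},
           "South": {"D": list(range(60, 80)), "E": list(range(80, 100))}}
random.seed(5)
print(len(multistage_sample(regions, areas_per_region=2, n_per_area=4)))  # 16
```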
Note that from a technical perspective, stratified sampling can be thought of as an
extreme form of two-stage cluster sampling where at the first stage all clusters in the
population are selected. In addition, one-stage cluster sampling is at the opposite end of
this spectrum. We can summarise as follows.
Similarities and differences between stratified and cluster sampling
Stratified sampling: all strata chosen, some units selected in each stratum.
One-stage cluster sampling: some clusters chosen, all units selected in each
sampled cluster.
Two-stage cluster sampling: some clusters chosen, some units selected in each
sampled cluster.
Example 10.3 You have been asked to make a sample survey of each of the
following. Would you use quota sampling or random sampling? Explain your choice.
i. Airline pilots, for their company, about their use of holiday entitlement in order
to bring in a new work scheme.
In this case as the survey is for the company (and hence there are no
confidentiality issues) it is quite easy to use the company’s list of personnel. A
quota sample would not be very easy in these circumstances – you would have
to send your interviewers to a venue where most pilots would be likely to meet,
or you would risk a very unrepresentative sample.
So, in this case, a random sample would be easy and efficient to use. You would
be able to collect accurate information and use your statistical techniques on it.
The subject matter, too, means that it is likely the pilots would take the survey
more seriously if they were contacted through the company’s list.
ii. Possible tourists, about their holiday destinations and the likely length of time
and money they expect to spend on holiday in the next year, for a holiday
company planning its holiday schedule and brochure for next year.
The situation for the tourist survey is different from that for airline pilots.
There will be not just one, but several lists of tourists from different holiday
companies, and data confidentiality might well mean you could not buy lists
which do not belong to your company. You might use the register of voters or
list of households, but then you would not necessarily target those thinking
about holidays in the near future. So a random sample sounds like an expensive
option if this is to be a general study for a tourist company assessing its future
offers and illustrations for its holiday brochure. Here, a quota sample makes
more sense: interviewers can quickly find the right respondent for the company’s
needs and get a general picture of holidaymakers’ preferences.
iii. Household expenditure for the government’s assessment of the effect of different
types of taxes.
The government survey will require accuracy of information in an important
policy area. A random sample will have to be used and the national lists of
addresses or voters used.
Activity 10.4 The simplest probability-based sampling method is simple random
sampling. Give two reasons why it may be desirable to use a sampling design which
is more sophisticated than simple random sampling.
Solution
Possible reasons include the following.
• We can estimate population characteristics with greater precision.
• It may be more cost effective.
• It may not be possible to do simple random sampling.
Activity 10.5 A corporation wants to estimate the total number of worker-hours
lost for a given month because of accidents among its employees. Each employee is
classified into one of three categories – labourer, technician and administrator. Which
sampling method do you think would be preferable here – simple random sampling,
stratified sampling, or cluster sampling? Give arguments to explain your choice.
Solution
We can expect the three groups to be homogeneous with respect to accident rates –
for example, labourers probably have the highest rates. Therefore, stratified
sampling is preferable to simple random sampling. Cluster sampling does not make
sense here, with only three homogeneous groups.
Activity 10.6 Describe how stratified random sampling is performed and explain
how it differs from quota sampling.
Solution
Description of stratified random sampling: the population is divided into strata,
natural groupings within the population, and a simple random sample is taken from
each stratum.
Stratified random sampling is different from quota sampling in the following ways.
Stratified random sampling is probability sampling, whereas quota sampling is
non-probability sampling.
In stratified random sampling a sampling frame is required, whereas in quota
sampling pre-chosen frequencies in each category are sought.
Activity 10.7 Provide a similarity and a difference between one-stage cluster
sampling and two-stage cluster sampling.
Solution
Similarity: In both one-stage and two-stage cluster sampling, only some clusters are
sampled (at random).
Difference: In one-stage every unit within each selected cluster is sampled, while in
two-stage a random sample is taken from each selected cluster.
10.8 Types of error
We can distinguish between two types of error in sampling design7 as follows.
Sampling error – this occurs as a result of us selecting a sample, rather than
performing a census (where a total enumeration of the population is undertaken).
It is attributable to random variation due to the sampling scheme used. For
probability sampling, we can estimate the statistical properties of the sampling
error, i.e. we can compute (estimated) standard errors which allow confidence
intervals to be determined and hypothesis tests to be performed.
Non-sampling error – this occurs as a result of the (inevitable) failures of the
sampling scheme. In practice this sort of error is very difficult to quantify; when it is
attempted, this is typically done through a separate investigation. We distinguish between two sorts of
non-sampling error.
• Selection bias – this may be due to (i.) the sampling frame not being equal
to the target population, (ii.) the sampling frame not being strictly adhered to,
or (iii.) non-response bias.
• Response bias – the actual measurements might be wrong due to, for
example, ambiguous question wording, misunderstanding of a word in a
questionnaire by less-educated people, or sensitivity of information which is
sought. Interviewer bias is another aspect of this, where the interaction
between the interviewer and interviewee influences the response given in some
way, either intentionally or unintentionally, such as through leading questions,
the dislike of a particular social group by the interviewer, the interviewer’s
manner or lack of training, or perhaps the loss of a batch of questionnaires
from one local post office. These could all occur in an unplanned way and bias
your survey badly.
10.9 Pilot and post-enumeration surveys
Both kinds of error can be controlled or allowed for more effectively by a pilot survey.
A pilot survey is used:
to find the standard error which can be attached to different kinds of questions and
hence to underpin the sampling design chosen
to sort out non-sampling questions:
• do people understand the questionnaires?
7 These should not be confused with Type I errors and Type II errors, which only concern hypothesis testing and are covered in Section 8.7.
• are our interviewers working well?
• are there particular organisational problems associated with this enquiry?
Sometimes, particularly for government surveys, a post-enumeration survey is used to
check the above. Here a subsample of those interviewed are re-interviewed to make sure
that they have understood the questions and replied correctly. This is particularly
useful when technical questions are being asked.
10.10 Non-response and response bias
Bias caused by non-response and response is worth a special entry. It can cause
problems at every stage of a survey, both in random samples and quota samples,
regardless of how the surveys are administered.
The first problem can be in the sampling frame. Is an obvious group missing?
Example 10.4 The following are examples of coverage bias in the sampling frame.
If the sampling frame is of householders, those who have just moved in will be
missing.
If the sampling frame is of those aged 18 or over, and the under-20s are careless
about registration, then younger people will be missing from the sample.
In the field, non-response (data not provided by a sampling unit which we wish to
sample) is one of the major problems of sample surveys as the non-respondents, in
general, cannot be treated like the rest of the population. As such, it is most important
to try to get a picture of any shared characteristics in those refusing to answer or people
who are not available at the time of the interview.
Classifications of non-response
We can classify non-response as follows.
Item non-response occurs when a sampled member fails to respond to a
question in the questionnaire.
Unit non-response occurs when no information is collected from a sample
member.
Non-response may be due to any of the following factors:
not-at-home due to work commitments, or on holiday
refusals due to subject matter, or sponsorship of the survey
incapacity to respond due to illness, or language difficulties
not found due to vacant houses, incorrect addresses, moved on
lost schedules due to information being lost or destroyed after it had been collected.
How should we deal with non-response? Well, note that increasing the sample size will
not solve the problem – the only outcome would be that we have more data on the
types of individuals who are willing to respond! Instead, we might look at improving
our survey procedures such as data collection and interviewer training. Non-respondents
could be followed up using call-backs, or an alternative contact method to the original
survey in an attempt to subsample the non-respondents. A proxy interview (where a
unit from your sample is substituted with an available unit) may be another possibility.
(Note that non-response also occurs in quota sampling but is not generally recorded –
see the earlier discussion.) However, an obvious remedy is to provide an incentive (for
example, cash or entry into a prize draw) to complete the survey – this exploits the
notion that human behaviour can be influenced in response to the right incentives!
Response error is very problematic because it is not so easy to detect. A seemingly
clear reply may be based on a misunderstanding of the question asked or a wish to
deceive. A good example from the UK is the reply to the question about the
consumption of alcohol in the Family Expenditure Survey. Over the years, there is up to
a 50% understatement of alcohol use compared with the overall known figures for sales
from HM Revenue & Customs!
Sources of response error
The main sources of response error include:
role of the interviewer due to the characteristics and/or opinions of the
interviewer, asking leading questions and the incorrect recording of responses
role of the respondent who may lack knowledge, forget information or be
reluctant to give the correct answer due to the sensitivity of the subject matter.
Control of response errors typically involves improving the recruitment, training and
supervision of interviewers, re-interviewing, consistency checks and increasing the
number of interviewers.
In relation to all these problems, pilot work is very important. It may also be possible
to carry out a check on the interviewers and methods used after the survey
(post-enumeration surveys).
Activity 10.8 In the context of sampling, explain the difference between item
non-response and unit non-response.
Solution
• Item non-response occurs when a sampled member fails to respond to a particular
question (item) in the survey.
• Unit non-response occurs when no information at all is collected from a sample
member.
Activity 10.9 Non-response in surveys is considered to be problematic.
(a) Give two possible reasons why non-response may occur.
10. Sampling design
(b) Why is non-response problematic for the person or organisation conducting the
research? Give two reasons.
(c) How can non-response be reduced in telephone surveys and mail surveys,
respectively?
Solution
(a) There may be people whom the data collection procedure does not reach, who
refuse to respond, or who are unable to respond. As an example, in telephone
surveys, people may not be at home or may not have time to do the survey. In a
mail survey, the survey form may not reach the addressee.
(b) The non-respondents may be different from respondents, leading to biased
parameter estimates. Additionally, the sample size is effectively reduced, leading
to a loss of precision in estimation.
(c)
i. Numerous calls, concentrating on evenings and weekends. Interviewers with
flexible schedules who can make appointments with interviewees at a time
convenient for them. Cooperation can be enlisted by sending an advance
letter, ensuring respondents feel their help is important and having
well-trained interviewers.
ii. Clear layout of survey form, easy response tasks, send reminders and call
respondents by telephone.
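The problems described in the solution above – biased estimates and a reduced effective sample size – can be illustrated with a short simulation. This is a hypothetical sketch: the population (weekly hours of internet use) and the response propensities are invented purely for illustration.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: weekly hours of internet use, roughly N(20, 6).
population = [random.gauss(20, 6) for _ in range(100_000)]
true_mean = statistics.mean(population)

chosen = random.sample(population, 2_000)  # simple random sample

def responds(hours):
    """Invented response propensity: heavy users are assumed less likely
    to take part, from 0.9 for light users down to a floor of 0.3."""
    p = 0.9 - 0.02 * max(hours - 10, 0)
    return random.random() < max(p, 0.3)

respondents = [x for x in chosen if responds(x)]

print(f"True population mean:        {true_mean:.2f}")
print(f"Mean among respondents only: {statistics.mean(respondents):.2f}")
print(f"Effective sample size:       {len(respondents)} of {len(chosen)}")
```

Because the non-respondents differ systematically from the respondents, the respondent-only mean understates the truth, and the shrunken effective sample size reduces precision as well.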
Activity 10.10 Give an example of response bias. Is response bias a form of
sampling error or a form of non-sampling error? Briefly explain why.
Solution
Interviewer bias, where the interaction between the interviewer and interviewee
influences the response given in some way, is an example of response bias. It is a
form of non-sampling error. Non-sampling error results from failures in the data
collection process (such as biased responses or interviewer effects), whereas
sampling error arises from the random variation inherent in observing only a sample
rather than the whole population.
Activity 10.11 Briefly discuss the advantages and disadvantages of paying
respondents for an interview.
Solution
Advantages include a higher response rate and hence less non-response bias.
Disadvantages are that it costs money, which we may not have or which could be
spent on other ways of improving the survey. Also, there are ethical considerations:
we take up respondents' time, so arguably they should get something back!
Activity 10.12 A research group has designed a survey and finds the costs are
greater than the available budget. Two possible methods of saving money are a
sample size reduction or spending less on interviewers (for example, by providing less
interviewer training or taking on less-experienced interviewers). Discuss the
advantages and disadvantages of these two approaches.
Solution
A smaller sample size leads to less precision in terms of estimation. Reducing
interviewer training leads to response bias. We can statistically assess the reduction
in precision, but the response bias is much harder to quantify, making it hard to draw
conclusions from the survey. Therefore, the group should probably opt for a smaller
sample size, unless this would give results which are too imprecise for their purposes.
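The reduction in precision from a smaller sample size can be quantified directly, since the standard error of the sample mean is σ/√n. A minimal sketch (σ = 10 is an assumed value, purely for illustration):

```python
import math

sigma = 10  # assumed population standard deviation, for illustration

for n in (1000, 500, 250):
    se = sigma / math.sqrt(n)  # standard error of the sample mean
    print(f"n = {n:4d}: standard error of the sample mean = {se:.3f}")
```

Halving the sample size inflates the standard error by a factor of √2 ≈ 1.41 – a calculable cost. No comparable formula exists for the response bias introduced by poorly trained interviewers, which is why that option is harder to defend.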
Activity 10.13 Discuss the role of the interviewer in a survey and the importance
of training an interviewer.
Solution
The role of the interviewer is very important: to gain cooperation, train and
motivate respondents. Training interviewers leads to less interviewer bias, more
uniformity in response, and a higher response rate.
Activity 10.14 Readers of the magazine Popular Science were asked to phone in
(on a premium rate number) their responses to the following question: ‘Should the
United States build more fossil fuel-generating plants or the new so-called safe
nuclear generators to meet the energy crisis?’. Of the total call-ins, 86% chose the
nuclear option. Discuss the way the poll was conducted, the question wording, and
whether or not you think the results are a good estimate of the prevailing mood in
the country.
Solution
Both selection bias and response bias may have occurred. Due to the phone-in
method, respondents may well be very different from the general population
(selection bias). Response bias is possible due to the question wording – for example
‘so-called safe nuclear generators’ calls safety into question. It is probably hard to
say much about the prevailing mood in the country based on this survey.
10.11 Method of contact
A further point you should think about when assessing how to carry out a survey is the
method of contact. The most common methods of contact are face-to-face interviews,
telephone interviews and online/postal/mail (so-called ‘self-completion’) interviews. In
most countries you can assume the following.
An interviewer-administered face-to-face questionnaire will be the most expensive
to carry out.
Telephone surveys depend very much on whether your target population is on the
telephone (and how good the telephone system is).
Self-completion questionnaires can have a low response rate.
We now explore some of the advantages and disadvantages of various contact methods.
Face-to-face interviews:
• Advantages – good for personal questions; allow for probing issues in greater
depth; permit difficult concepts to be explained; can show samples (such as
new product designs).
• Disadvantages – (very) expensive; not always easy to obtain detailed
information on the spot.
Telephone interviews:
• Advantages – easy to achieve a large number of interviews; easy to check the
quality of interviewers (through a central switchboard perhaps).
• Disadvantages – not everyone has a telephone so the sample can be biased;
cannot usually show samples; although telephone directories exist for landline
numbers, what about mobile telephone numbers? Also, young people are more
likely to use mobile telephones rather than landline telephones, so are more
likely to be excluded.
Self-completion interview:
• Advantages – most people can be contacted this way (there will be little
non-response due to not-at-home reasons); allow time for people to look up
details such as income, tax returns etc.
• Disadvantages – high non-response rate as it requires effort to complete the
questionnaire; answers to some questions may influence answers to earlier
questions since the whole questionnaire is revealed to the respondent, which is
important where the order of a questionnaire matters; you have no control over
who answers the questionnaire.
Example 10.5 Examples of occasions when you might use a particular contact
method are as follows.
Face-to-face interviewer – a survey of shopping patterns.
Here you need to be able to contact a sample of the whole population. You can
assume that a large proportion would not bother to complete a postal
questionnaire – after all, the subject matter is not very important and it takes
time to fill in a form! Using a telephone would exclude those (for example, the
poor and the elderly) who either do not have access to a telephone, or are
unwilling to talk to strangers by telephone.
(Note: this list of advantages and disadvantages is not necessarily exhaustive. Can
you add any more?)
Telephone interviewer – a survey of businessmen and their attitudes
to a new item of office equipment.
All of them will have a telephone, and also the questions should be simple to
ask. Here, booking time for a telephone interview at work (once it has been
agreed with the management) should be much more effective than waiting for a
form to be filled in, or sending interviewers to disrupt office routine.
Postal/mail questionnaire – a survey of teachers about their pay and
conditions.
Here, on-the-spot interviews will not elicit the level of detail needed. Most
people do not remember their exact pay and taxation details, particularly if figures
are needed for earlier years. We would expect a high response rate and good-quality
data – the recipients are motivated to reply, since they may be hoping for a pay
rise! Also, the recipients come from a group of people who find it relatively easy
to fill in forms without needing the help, or prompting, of an interviewer.
Remember that it is always possible to combine methods. The Family Expenditure
Survey in the UK, for example, combines the approach of using an interviewer three
times over a fortnight (to raise response and explain details), while the respondent
household is required to fill in a self-completion diary (showing expenditure, which
could not be obtained by an interview alone).
Similarly, telephone interviews may be combined with a mail-shot subsampled
individual survey. In the case of the telephone survey of business people described above,
a description of the new equipment could be sent in advance, or as they are telephoned.
Remember also that email surveys are already widespread and becoming increasingly
popular, although they are only appropriate when the population to be studied uses
them heavily and is likely to reply to your questions, such as employees in your office.
An obvious advantage is that this method is very cheap to administer.
Activity 10.15 What criteria would you use in deciding which method of contact
to use in a survey of individuals?
Solution
Key criteria to use when deciding on the appropriate form of contact relate to the
extent to which the respondent can interact with the interviewer. Examples of
criteria are:
• flexibility of contact method
• diversity of questions which can be asked
• ability to use physical stimuli
• control of the data collection environment
• quantity of data required
• response rate
• perceived respondent anonymity
• obtaining sensitive information
• potential for interviewer bias
• potential to probe respondents
• speed
• cost.
Activity 10.16 Discuss the relative advantages and disadvantages of telephone
interviewing and face-to-face interviewing.
Solution
A number of advantages and disadvantages could be given, clearly presented.
• Telephone interviewing advantages:
• Less expensive than personal interviews.
• Random digit dialling samples of general population.
• Shorter data collection period than personal interviews.
• Interviewer administration (versus mail).
• Better control and supervision of interviewers (versus personal).
• Better response rate than mail for list samples.
• Telephone interviewing disadvantages:
• Biased against households without telephones, unlisted numbers.
• Non-response.
• Questionnaire constraints.
• Difficult to administer questionnaires on sensitive or complex topics.
• Personal interviewing advantages:
• Generally yields highest cooperation and lowest refusal rates.
• Allows for longer, more complex interviews.
• High response quality.
• Takes advantage of interviewer presence.
• Multi-method data collection.
• Personal interviewing disadvantages:
• Most costly mode of administration.
• Longer data collection period.
• Interviewer concerns.
Activity 10.17 Discuss the feasibility of each of the types of survey contact
methods (personal interview, postal survey, email and telephone survey) for a
random sample of university students about their undergraduate experiences and
attitudes at the end of the academic year.
Solution
• Personal interview:
• Allows for probing details about experiences and attitudes in greater detail.
• Likely to have a high response rate, although costly.
• Postal survey:
• High non-response rate.
• Effort to return survey form.
• Email:
• Easiest method of contact for university students, who regularly check
email.
• Likely to suffer from non-response.
• Telephone survey:
• Most students would use mobile phones rather than landlines, hence
telephone numbers not available in directories.
• Likely to suffer from non-response.
Activity 10.18 Retirement and Investment Services would like to conduct a survey
on online users’ demands for additional internet retirement services. Outline your
suggested sampling and contact method and explain how the results might be
affected by your methodology.
Solution
Online sampling can be used, via web intercepts (pop-ups) or client lists (email
addresses). It is cheap and potentially very quick, but the users might tend to be
young people unless a client list is used. Internet penetration is still not as high as
landline telephone penetration. Respondents who are not online may have a different
profile. There is no recognised sampling frame for the online population.
10.12 Summary
This chapter has described the main stages of a survey and the sources of error at each
stage. The various techniques in the rest of the subject guide are of little use unless you
understand the sources and limitations of the data you are using. The contents of this
chapter should have helped you to understand how statistical techniques you have
learned about so far can be used in practice.
Activity 10.19 How do probability sampling techniques differ from non-probability
sampling techniques? Which factors should be considered when choosing between
probability and non-probability sampling?
Solution
Probability techniques differ from non-probability techniques in the following ways.
• Probability techniques select sampling units by chance as opposed to the
judgement of the researcher.
• The number of elements to be included in the sample set can be pre-specified.
• It is possible to specify the probability of selecting any particular sample of a
given size.
• Estimates are statistically projectable to the population.
The factors which should be considered are as follows.
• The nature of the research, i.e. is it exploratory or conclusive?
• Degree of error tolerance, i.e. the precision sought by decision-makers.
• Relative magnitude of sampling versus non-sampling errors, i.e. the need to
eliminate selection bias.
• Variability in the population, i.e. how divergent are target respondents in the
issues being measured?
• Statistical considerations, i.e. what forms of analysis will help create the
meanings sought by the decision-maker?
• Operational considerations, i.e. are sampling frames available and accurate?
Activity 10.20 You work for a market research company and your manager has
asked you to carry out a random sample survey for a mobile phone company to
identify whether a recently launched mobile phone is attractive to people over 40
years old. Limited time and money resources are available at your disposal. You are
being asked to prepare a brief summary containing the items below.
i. Choose an appropriate probability sampling scheme. Provide a brief justification
for your answer.
ii. Describe the sampling frame and the method of contact you will use. Briefly
explain the reasons for your choices.
iii. Provide an example in which response bias may occur. State an action that you
would take to address this issue.
iv. State the main research question of the survey. Identify the variables associated
with this question.
Solution
i. Cluster sampling is appropriate here due to the cost issue. Also, multistage
sampling is an option. Although the question mentions limited time, discussion
of quota sampling (for speed) would be incorrect as the question stresses a
‘probability sampling scheme’.
ii. The question requires:
• a description of a sampling frame
• a justification of its choice
• mentioning a (sensible) contact method
• stating an advantage of the contact method mentioned above.
Note that there is no unique answer. A suggested answer is given below.
The sampling frame could be an email list from the records of the mobile phone
company, which should be easy to obtain. Assuming that the new type is a
smartphone, the method of contact can be via email. Alternatively, if the new
type is not a smartphone, some basic questions can be sent by text. Method of
contact by post would be too expensive compared with the other two methods.
iii. The question requires an example of response bias and an action suggested to
address this issue.
Those least comfortable with the phone are unlikely to reply, so the
questionnaire should be designed appropriately. Also, busy people may not want
to spend time on such market research. A reward of, say, ten free text messages
could be offered as an incentive so that the cost remains low.
iv. A suggested answer for the question is ‘Are attitudes toward the new phone more
favourable among younger owners?’.
In terms of variables one could mention ‘age’ and ‘measure of favourableness
toward the new phone’.
Activity 10.21 A company producing handheld electronic devices (tablets, mobile
phones etc.) wants to understand how people of different ages rate its products. For
this reason, the company’s management has decided to use a survey of its customers
and has asked you to devise an appropriate random sampling scheme. Outline the
key components of your sampling scheme.
Solution
Note that there is no unique answer. A possible set of ‘ingredients’ of an answer is
given below.
• Propose stratified sampling since customers of all ages are to be surveyed.
• Sampling frame could be the company’s customer database.
• Take a simple random sample from each stratum.
• Stratification factors should include age group.
• Other stratification factors could be gender, country of residence etc.
• Contact method: mail, telephone or email (likely to have all details on
database).
• Minimise non-response through a suitable incentive, such as discount off the
next purchase.
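The scheme outlined above can be sketched in code. The customer records, age strata and sample size below are all invented for illustration; proportional allocation (each stratum sampled in proportion to its size, with a simple random sample within each) is one common choice.

```python
import random
from collections import defaultdict

random.seed(1)

# Hypothetical customer database: (customer_id, age_group) records.
age_groups = ["under 25", "25-44", "45-64", "65+"]
customers = [(i, random.choice(age_groups)) for i in range(10_000)]

def stratified_sample(records, total_n):
    """Proportional allocation: each stratum contributes in proportion
    to its size, with a simple random sample taken within it."""
    strata = defaultdict(list)
    for record in records:
        strata[record[1]].append(record)
    chosen = []
    for members in strata.values():
        n_h = round(total_n * len(members) / len(records))
        chosen.extend(random.sample(members, n_h))
    return chosen

sample = stratified_sample(customers, 400)
print(f"Total sample size: {len(sample)} (rounding may shift this slightly)")
for group in age_groups:
    print(f"{group:>8}: {sum(1 for _, g in sample if g == group)}")
```

In practice further stratification factors (gender, country of residence) would simply extend the stratum key.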
Activity 10.22 You have been asked to design a nationwide survey in your country
to find out about internet usage among children less than 10 years old. Provide a
probability sampling scheme and a sampling frame that you would like to use.
Identify a potential source of selection bias which may occur and discuss how this
issue can be addressed.
Solution
Note that there is no unique answer.
Sampling frame: note that the target group is ‘children less than ten years old’, hence
one might take the view that we only need to look at children who are at school or
nursery school – say aged from 4 or 5 to 10. If this is the case, a sampling frame of
schools and nurseries and sampling from their lists may be used. Another example is
to use doctors’ lists (if possible).
Sampling scheme: a probability/random sampling scheme is required, which should
be accompanied by a justification. For example, if one goes with clustering (area of
the country/type of school such as junior, infants, preschool etc.) or stratified
sampling (stratification factors: gender, age group), why would these schemes be
advantageous?
Source of selection bias: selection bias could arise from the omission of those who are
not at school or preschool (in most countries, school is compulsory only from five or
six years old) and those who are home schooled. Note that this should not be
confused with response bias – for example, children may respond differently if their
teachers ask the questions!
Way to address it: reset the target population group to match what the sampling
frame is actually providing.
10.13 Key terms and concepts

Census
Cluster sampling
Convenience sampling
Incentive
Judgement sampling
Method of contact
Multistage design
Non-probability sampling
Non-response
Non-sampling error
Pilot survey
Post-enumeration survey
Primary data
Probability sampling
Quota sampling
Representative
Response error
Sample
Sample survey
Sampling error
Sampling frame
Secondary data
Selection bias
Simple random sampling
Stratified random sampling
Systematic random sampling
Target population

10.14 Learning activities
Solutions can be found in Appendix A.
1. Think of at least three lists you could use in your country as a basis for sampling.
Remember, each list must:
• be generally available
• be up-to-date
• provide a reasonable target group for the people you might wish to sample.
2. Find out about one of the government surveys carried out in your own country.
This should help you understand the problems involved in designing a useful survey
and help you with illustrations for your examination questions. (Remember that
your understanding of the general points raised here should be illustrated by
examples. The examiners are very happy if you give examples from your own
country or area of interest. They are not looking for points memorised from
textbooks!)
3. Discuss the statistical problems you might expect to have in each of the following
situations.
(a) Conducting a census of the population.
(b) Setting up a study of single-parent families.
(c) Establishing future demand for post-compulsory education.
4. (a) Define a quota sample.
(b) What are the main reasons for using a quota sample, and what are the
alternatives?
(c) What are the main sources of error in a quota sample, and how would you deal
with them?
5. Think of three quota controls you might use to make a quota sample of shoppers in
order to ask them about the average amount of money they spend on shopping per
week. Two controls should be easy for your interviewer to identify. How can you
help them with the third control?
6. Your company has five sites and a mixture of managerial, clerical and factory
workers. You want to know what kind of help they would like with their
travel-to-work arrangements. One of the trade unions involved has asked for free
parking for all its members, another has suggested subsidised rail season tickets.
You, as a statistician, have been asked to design a sample survey of employees to
help them decide what to do. You decide to make a random sample. There is no
problem with the sampling frame and you have been asked to give statistically
reliable advice. You decide to design a stratified sample.
(a) State the strata you will use.
(b) Explain how the strata will help you.
(Do not be afraid to use two sets of strata in these circumstances!)
7. You have been asked to design a random sample in order to study the way school
children learn in your country. Explain the clusters you might choose, and why.
8. Which method of contact might you use for your questionnaire in each of the
following circumstances?
(a) A random sample of school children about their favourite lessons.
(b) A random sample of households about their expenditure on non-essential
items.
(c) A quota sample of shoppers about shopping expenditure.
(d) A random sample of bank employees about how good their computing facilities
are.
(e) A random sample of the general population about whether they liked
yesterday’s television programmes.
9. (a) Outline the main stages of a random survey. Where do the main dangers of
errors lie?
(b) Why might you carry out a quota survey rather than a random survey?
(c) ‘The designing of questionnaires and the training of interviewers is a waste of
money.’ Discuss.
(d) When would you carry out a telephone survey rather than using a face-to-face
interview?
(e) You have been asked to survey the interest of a population in a new type of
technology. How might you stratify your sample? Explain.
10. You are carrying out a random sample survey of leisure patterns for a holiday
company, and have to decide whether to use interviews at people’s homes and
workplaces, postal (mail) questionnaires, or a telephone survey. Explain which
method of contact you would use, and why.
10.15 A reminder of your learning outcomes
After completing this chapter, and having completed the Essential reading and
activities, you should be able to:
define random, simple random and quota sampling and describe the implications of
using them
explain the reasons for stratifying and clustering samples
describe the factors which contribute to errors in surveys, including:
• inaccurate and poorly-judged frames
• sampling error
• non-sampling error (non-response, biased response, interviewer error)
discuss the various methods of contact which may be used in a survey and the
related implications:
• interviewer
• postal
• other self-administered
• telephone.
10.16 Sample examination questions
Solutions can be found in Appendix B.
1. (a) State one advantage and one disadvantage of quota sampling.
(b) State one advantage and one disadvantage of telephone interviews.
2. You work for a market research company and your manager has asked you to carry
out a random sample survey for a laptop company to identify whether a new laptop
model is attractive to consumers. The main concern is to produce results of high
accuracy. You are being asked to prepare a brief summary containing the items
below.
(a) Choose an appropriate probability sampling scheme. Provide a brief
justification for your answer.
(b) Describe the sampling frame and the method of contact you will use. Briefly
explain the reasons for your choices.
(c) Provide an example in which selection bias may occur. State an action which
you would take to address this issue.
(d) State the main research question of the survey. Identify the variables
associated with this question.
3. You have been asked to design a stratified random sample survey from the
employees of a certain large company to examine whether job satisfaction of
employees varies between different job types.
(a) Discuss how you will choose your sampling frame. Also discuss the limitations
of your choice.
(b) Propose two relevant stratification factors. Justify your answers.
(c) Provide two actions to reduce response bias and explain why you think they
would be successful.
(d) Briefly discuss the statistical methodology you would use to analyse the
collected data.
Chapter 11
Some ideas underlying causation – the use of control groups and time order
11.1 Aims
As with Chapter 10, this chapter does not introduce any new statistical calculations.
However, it addresses the key ideas of causation in non-experimental science. It also
prepares you for the final ideas – namely correlation and regression – you need to know
in order to complete ST104a Statistics 1.
11.2 Learning outcomes
After completing this chapter, and having completed the Essential reading and
activities, you should be able to:
distinguish between an experiment in the natural sciences and the observations
possible in social science, business or management studies
determine sensible controls for a given experiment or trial
set up a panel study
explain the merits and limitations of a longitudinal/panel survey compared with a
cross-sectional survey.
11.3 Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and Economics
(London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060] Section 13.2.
In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials
accessible via the ST104a Statistics 1 area at http://my.londoninternational.ac.uk
11.4 Further reading
Shipman, M. The Limitations of Social Research (London: Longman, 1997) fourth
edition [ISBN 9780582311039] Part 2.
11.5 Introduction
So far, we have looked at ways of collecting and describing data. Chapters 3 and 10
introduced you to the basic ideas of sampling from populations. The main ways of
describing and presenting data were covered in Chapter 4. Chapters 6 to 8 dealt with
the ways we can assess the relevance or significance of these figures, while Chapter 9
introduced the idea of association between categorical variables. Chapter 12 will
complete the process, so far as ST104a Statistics 1 is concerned, by covering
correlation and regression.
Before we look at the ideas of correlation and regression more formally, it is important
to take stock of the limitations of assuming causation when working in social research.
Anyone who studied science at school will be familiar with the idea of an experiment.
Subjects are measured (observations are made), a treatment is administered, then
further observations are made. Providing that we can be sure that nothing else has
happened between observations, apart from the treatment (scientists write ‘other things
equal’, or ceteris paribus, in Latin), the assumption is made that the treatment has
caused the change between the first and second set of observations.
If we are dealing with such a controlled experiment, the meaning of our statistical
measures and work is very clear. However, in social science, business and management
fields, things are rarely that simple. We are generally faced with figures which show
changes in variables but the treatment given is not easy to assess.
Example 11.1 Consider determining the effectiveness of an advertising campaign
for your new chocolate bars with improved wrappers. Your company measures the
sales of the old chocolate bars, product A, in two areas, X and Y, before introducing
the new chocolate bars, product B, and running a four-month advertising campaign.
Imagine your marketing manager’s joy when sales of product B are much higher
than for product A in area X. Clearly, the changes have worked! However, oh dear,
product B is achieving a lower level of sales than product A in area Y. How can this
be? Well, on closer investigation, we find that our main rival has withdrawn their
product from area X and concentrated their sales focus on area Y (where they have
realised that a larger number of their target population lives).
So your success with product B in area X is not necessarily related to your
advertising campaign; rather, it may be entirely due to your rival's actions. Clearly,
whatever measures you use, there is a problem with the measurement of your
results. ‘Other things’ are no longer equal, since the behaviour of the rival changed
while you were conducting your experiment.
So it is important to control for confounding factors, that is, factors which are
correlated with the observed variables (such as the rival’s actions in Example 11.1).
Failure to properly control for such factors may lead us to treat a false positive as a
genuine causal relationship.
11.6 Observational studies and designed experiments
In medical, industrial and agricultural applications of statistics, it is possible to control
the levels of the important factors which affect the results. Bias from the factors which
cannot be controlled is dealt with by randomisation. These investigations are designed
experiments.
In economic and other social science applications of statistics, one usually just observes
a sample of the available population, without controlling for any of the factors which
may influence the measures observed. Such studies are observational studies.
Much of the methodology of statistics is devoted to disentangling the effects of different
factors in such observational studies, but it is better to have a designed experiment,
whenever possible. Of course, the social sciences do not really lend themselves to
designed experiments. For example, we do not have multiple versions of an economy to
which we can apply different mixtures of fiscal and monetary policies. Instead, we have
one economy to which we apply a particular mix and observe what happens.
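The claim that randomisation deals with bias from uncontrolled factors can be illustrated with a small simulation. The units and the unobserved binary trait below are invented for illustration: by allocating units to groups purely by chance, the trait ends up (approximately) equally represented in both groups even though the investigator never measures it.

```python
import random
import statistics

random.seed(7)

# 2,000 hypothetical units; 30% carry an unobserved binary trait
# (a confounder the investigator cannot measure).
units = [{"trait": random.random() < 0.3} for _ in range(2_000)]

random.shuffle(units)  # randomisation: allocation is by chance alone
treatment, control = units[:1_000], units[1_000:]

p_treat = statistics.mean(u["trait"] for u in treatment)
p_ctrl = statistics.mean(u["trait"] for u in control)
print(f"Trait rate in treatment group: {p_treat:.3f}")
print(f"Trait rate in control group:   {p_ctrl:.3f}")
```

The two rates agree up to random variation, which is what makes a subsequent difference in outcomes attributable to the treatment rather than to the trait. Note that the groups must be reasonably large for this balancing to be reliable.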
In ST104a Statistics 1, we will only look at correlation and regression as ways of
studying the connections between variables, but analysis of variance (covered in
ST104b Statistics 2) is also very relevant in this context.
Even if it is possible to experiment, there may be ethical objections. The ethical
questions raised by medical trials are still a lively area of debate. There can be strong
objections from patients (or their relatives) to their treatment by an established method
(or a placebo) when a new treatment is available. After the trial is complete, if the new
treatment is shown to be worse than the established one, there may be complaints from
those who were given the new treatment. Some of these difficulties can be resolved by
the use of sequential methods, which try to discontinue a trial at the earliest
opportunity when one treatment has been shown to give better results than others.
When we come to study regression in Chapter 12, it will be important to remember its
limitations. It is easy to be too optimistic about its use. There is a big difference
between a randomised experiment and an observational study, even if regression is used
to allow for some of the distorting factors in the latter. The long-standing difficulties in
proving that smoking causes lung cancer or, more recently, in demonstrating the
positive effects of seat belts in cars, or of investigating the possible increase in
leukaemia cases in children of workers at nuclear power plants, should be enough to
show how the problems arise. It is difficult to allow for all the distorting factors; even to
say whether a factor should be eliminated is hard.
So how do statisticians try to measure causal relationships in the social sciences? They
use two main tools:
the control group
time order.
We now proceed to consider both of these.
Activity 11.1 Explain the difference between an experimental design and an
observational study.
11. Some ideas underlying causation – the use of control groups and time order
Solution
In an observational study, correlation is not the same as causation because of the
possible presence of unobserved confounding variables. This problem can be avoided
using an experimental design, where randomisation can be used. This creates groups
of units which are similar on average. For example, if we perform a clinical trial to
find out the effect of a treatment on a disease, and a certain gene is an unobserved
confounder, then randomisation ensures the two groups have similar genetic makeup
(on average; note that the groups need to be large enough to ensure this).
Association between type of treatment and recovery then becomes the same as
causation. In an observational study, we have no control over the genetic makeup of
the groups, so we cannot conclude causality if we find an association.
11.7 Use of the control group
We are often limited in the social sciences because we are unable to perform
experiments either for ethical or practical reasons. Imagine, for example, that you need
to assess the likely effect on tooth decay of adding fluoride to the water supply in a
town. There is no possibility of being allowed to experiment, as you are afraid that
fluoride might have harmful side effects. However, you know that fluoride occurs
naturally in some communities. What can you do, as a statistician?
11.7.1 Observational study
In an observational study data are collected on units (not necessarily people) without
any intervention. Researchers do their best not to influence the observations in any way.
A sample survey is a good example of such a study, where data are collected in the form
of questionnaire responses. As discussed in Chapter 10, every effort is made to ensure
response bias is minimised (if not completely eliminated).
Example 11.2 To assess the likely effect on tooth decay of adding fluoride to the
water supply, you can look at the data for your non-fluoridated water population
and compare it with one of the communities with naturally-occurring fluoride in
their water and measure tooth decay in both populations, but be careful! A lot of
other things may be different. Are the following the same for both communities?
Number of dentists per person.
Number of sweet shops per person.
Eating habits.
Age distribution.
Think of other relevant attributes which may differ between the two. If you can
match in this way (i.e. find two communities which share the same characteristics
and only differ in the fluoride concentration of their water supply), your results may
have some significance.
So, to credibly establish a causal link in an observational study, all other relevant
factors need to be adequately controlled for, such that any change between observation
periods can be explained by only one variable.
11.7.2 Experimental study
In an experiment, an intervention or treatment is administered to some or all of the
experimental units (usually people). Allocation of the treatment (or perhaps a
combination of treatments) is determined by using a form of randomisation.1 Some time
after the treatments are given, the outcomes are recorded by the researchers. Data
analysis involves comparing outcomes across all treatments.
Occasionally it is possible and permissible to carry out experiments in business
situations. There are still difficulties caused by the sheer number of variables which will
need to be taken into consideration, but at least it is possible to distinguish those who
had the treatment (the experimental group) from those who did not (the control
group).
In order to avoid bias in assessing the effects of the treatments, blinding and double
blinding are recommended. In blind experimentation the experimental units are
unaware of whether they receive a treatment or a placebo, while in a double-blind set-up
the people administering the treatment and placebo also do not know which is which.
The sample size in each treatment group must be large enough to ensure that medically
(or socially) important differences can be detected. Sample size determination (as
previously discussed for observational studies in Section 7.11) to ensure adequate power
is a routine part of experimental design.
On the whole, such methods are associated with medical statistics more often than with
other applied areas, but they are also used in marketing and test marketing when
possible.
Activity 11.2 What is randomisation in the context of experimental design?
Solution
Experimental units are randomly allocated to treatment groups. In this way,
confounders are (probabilistically) controlled for, and causal conclusions can be
drawn from an experiment (with high probability). In observational studies, we can
never be sure if all confounders have been controlled, and causal conclusions are
harder to draw.
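The allocation step itself is simple to sketch in code. The following is an illustrative sketch only (it is not part of the course material, and the function name `randomise` is our own): each experimental unit is randomly assigned to the treatment group or the control group, which balances confounders on average.

```python
import random

def randomise(units, seed=None):
    """Randomly split experimental units into treatment and
    control groups of (near-)equal size."""
    rng = random.Random(seed)   # seed used only to make the sketch reproducible
    shuffled = list(units)      # copy, so the input is untouched
    rng.shuffle(shuffled)       # random order: every split is equally likely
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Ten hypothetical patient IDs, randomly allocated
treatment, control = randomise(range(1, 11), seed=1)
```

Because the split is random, any unobserved characteristic (such as the gene mentioned in the solution above) is, on average, equally represented in both groups.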
Activity 11.3 Explain what is meant by each of the following and why they are
considered desirable in an experiment:
(a) placebo
(b) double blinding.
1 For example, treatment A is administered if a fair coin toss comes up ‘heads’, otherwise treatment B is administered.
Solution
(a) Many statistical studies involve testing the effectiveness of drugs. A placebo
looks identical to the actual drug but contains no active ingredient and so has
no real physical effect. Any patient wants to be helped by the medication
administered to them. If the patient thinks s/he is receiving a drug to remedy
their condition, they tend to improve even if it turns out that the drug is a
placebo. This is known as the placebo effect. Experimental design should
control for the placebo effect.
(b) Double blinding means that neither the subjects receiving the treatment, nor
the researchers know to which group, treatment or control, subjects have been
assigned. This avoids bias in assessing the effects of the treatments. If a
researcher knows that a subject is in the control group, they do not expect a
treatment effect, and their measurement of a response might be understated. If
a researcher knows that a subject is in the treatment group, they might
overstate a response simply because they expect it.
Activity 11.4 Explain the following terms:
(a) blind trials
(b) control and treatment groups
(c) measuring causation.
Solution
(a) Blind trials are a feature of experiments. In a blind experiment, the subjects do
not know if they are getting the treatment or the placebo. In a double blind
experiment, neither the person handing out the treatments nor the person
evaluating the results know which subjects are in which group. This is to
neutralise psychological effects.
(b) A control group (which either gets no treatment or gets the old, established
treatment) must be used for comparison. The treatment group, of course,
receives the treatment. For example, if you are testing a treatment for a sore
throat, you must have a group which gets no treatment because sore throats
naturally get better over time. You need to show that the treatment group gets
better faster than the control group.
(c) Ideally you want the control and treatment groups to be as equal as possible, so
that the only difference between the groups is the treatment itself. At the end of
an experiment, if you then notice a difference between the two groups – for
example, in terms of the recovery time from sore throats – then this can be
attributed to the treatment and hence we would have shown a causal link in
that the treatment caused a change in recovery time.
11.8 Time order
Another way we might attempt to disentangle causal relationships is to look at the order
in which events occurred. Clearly, if we eat more, we gain weight, other things being
equal – if, for example, we do not exercise more to offset the extra consumption of calories!
This underpins work on time series data which you will meet if you study EC2020
Elements of econometrics. For now, you should know a little about longitudinal
surveys and panel surveys, where the same individuals are resurveyed over time.
11.8.1
Longitudinal surveys
Policymakers use longitudinal surveys over a long period of time to look at the
development of childhood diseases, educational development and unemployment. There
are many long-term studies in these areas. Some longitudinal medical studies of rare
diseases have been carried out at an international level over long periods. One such,
very well-known, study which is readily available is the UK National Child Development
Survey. This began with a sample of about 5,000 children born in 1958. It is still going
on! It was initially set up to look at the connections between childhood health and
development and nutrition by social groups. The figures produced in the first few years
were so useful that it was extended to study educational development and work
experience. There are several books which describe the survey at its different stages.
You should note the advantages and disadvantages of using such methods. The
advantages are that you:
can actually measure individual change (not just averages)
do not depend on people’s memories about what they did four years ago, say.
The disadvantages are:
on the one hand, drop out – if the subject material is trivial, people may not agree
to be continually resurveyed
on the other hand, conditioning – if the researcher manages to persuade
participants to continue to cooperate, they may have altered their perceptions (the
participants may have become too involved in the study).
Despite the disadvantages, such studies are widely regarded as being the best way of
studying change over time.
Activity 11.5 Compare and contrast cross-sectional and longitudinal designs.
Solution
A cross-sectional design involves the collection of data from a sample at a single
point in time. For example, if a researcher observes the buying behaviour of shoppers
selected in a street interview scenario and then analyses the data gathered, then such
a study will be called a cross-sectional study. On the other hand, longitudinal design
involves collecting data periodically from a fixed sample of respondents. Therefore, a
longitudinal study provides a series of ‘snapshots’, which, when viewed together,
portray a vivid illustration of the situation and the changes which are taking place.
Compared to a cross-sectional design, a longitudinal design is more effective in
detecting change, can be more accurate due to the rapport which can be built up
with respondents, and provides a larger amount of data. A cross-sectional design can
have advantages in creating a ‘one-off’ sample which is more representative of a
target population, compared to a longitudinal sample where respondents may drop
out and particular types of respondent become under-represented. The disadvantage
of building up a rapport with respondents in a longitudinal design is the bias which
may be introduced through an over-familiarity with the study and the
interviewers/researchers working on it.
11.8.2 Panel surveys
In business and management research, we generally use slightly less ambitious surveys
called panels. They also involve contacting the same individuals over a period, but are
generally different from longitudinal surveys in the following ways:
they are more likely to be chosen by quota rather than random methods
individuals are interviewed every two to four weeks (rather than every few years)
individuals are unlikely to be panel members for longer than two years at a time.
We can use results from such surveys to look at brand loyalty and brand switching. It is
particularly useful for assessing the effectiveness of advertising. For now, make sure you
understand how a longitudinal or panel study is set up.
Activity 11.6 Discuss the advantages and disadvantages of panels.
Solution
A panel consists of a sample of respondents, generally households, who have agreed
to provide information at specified intervals over an extended period. Panels are
maintained by syndicated firms, and panel members are compensated for their
participation with gifts, coupons, information or cash.
Advantages:
Over a period of time, panel data can reveal changes in market share and/or
changes in consumer attitudes and behaviour.
Relatively large amounts of data can be gathered.
The researcher can collect more detailed data regarding the primary variables of
interest.
The data collected are more accurate because of the support given to
respondents and the development of a relationship between panel members and
the sponsoring organisation.
Disadvantages:
Panels may not necessarily be any more representative than a single
cross-sectional survey. This may arise due to respondents refusing to cooperate,
‘mortality’, i.e. respondents dropping out of the panel due to lack of interest,
and the nature of payments or tangible rewards which may attract a certain
type of respondent.
New panel members are often biased in their initial responses, and seasoned
panel members may provide biased responses through being tired or bored with
the process.
11.9 Case study: Smoking and lung cancer
In order to clarify the issues raised in this chapter, we will consider a case study. It
describes how control groups were used in order to see what was connected with lung
cancer. You might be interested to know that initially the researchers expected that
pollution in the environment would show the strongest link.
They compared patients in hospital diagnosed as having lung cancer with a control
group who did not. (They were likely to have had accidents or non-chest diseases such
as appendicitis or, in the case of women, be pregnant.) Each lung cancer patient was
matched with a control by:
age
sex
occupation
home area
being in hospital at roughly the same time.
Using the measures which you will meet in Chapter 12, it was found that the main
difference between the two groups was in smoking behaviour – those with lung cancer
were more likely to be heavy smokers than the other group. Although this study was
carried out some time ago, similar studies have confirmed the results.
However, what did the study tell us? Consider again Section 11.5. Descriptive statistics
alone cannot distinguish between the following ideas.
Smoking causes lung cancer: smoking ⇒ lung cancer.
Smoking is a symptom of developing lung cancer: lung cancer ⇒ smoking.
A personality factor leads to both smoking and lung cancer: personality factor ⇒ smoking + lung cancer.
It is important that you understand this. Although animal experimentation (ethics
aside) may help to resolve the conundrum to some extent, we are really at a stalemate
without an experiment on the people in whom we are interested!
At this point, the original researchers carried out a longitudinal study of 40,000 doctors
in the UK. Their smoking histories were collected and subsequent death rates from lung
cancer checked. The results confirmed the initial findings.
However, although these results make the causal connection ‘heavy smoking ⇒ lung
cancer’ much more likely, it is still possible that the heavy smokers may have some
other characteristic (for example, genetic) which leads them to contract lung cancer.
Ethically it is not possible to choose individuals to participate in an experiment and ask
them to smoke heavily, so a pure experiment could not be made.
11.10 Summary
This chapter’s focus on causation gives you an opportunity to think of the implications
of the work you have covered in the earlier chapters and prepares the way for the
material in Chapter 12.
11.11 Key terms and concepts

Blinding
Causation
Conditioning
Confounding factor
Control group
Drop out
Experiment
Intervention
Longitudinal survey
Observational study
Outcome
Panel survey
Randomisation
Time order
Treatment

11.12 Learning activities
Solutions can be found in Appendix A.
1. Your government is assessing whether it should change the speed limit on its
motorways or main roads. Several countries in your immediate area have lowered
their limit recently by 10 miles per hour.
What control factors might you use in order to examine the likely effect on road
accidents of a change in your country?
2. (a) List the advantages and disadvantages of making a panel study.
(b) Work out how you might set up a panel of children in order to see whether
they like the television programmes your company is making for under-10s in
the after school/before homework time slot.
11.13 A reminder of your learning outcomes
After completing this chapter, and having completed the Essential reading and
activities, you should be able to:
distinguish between an experiment in the natural sciences and the observations
possible in social science, business or management studies
determine sensible controls for a given experiment or trial
set up a panel study
explain the merits and limitations of a longitudinal/panel survey compared with a
cross-sectional survey.
11.14 Sample examination questions
Solutions can be found in Appendix B.
1. What are the strengths and weaknesses of a longitudinal survey? Describe how you
would design such a survey if you were aiming to study the changes in people’s use
of health services over a 20-year period. State the target group, survey design, and
frequency of contact. You should give examples of a few of the questions you might
ask.
2. Explain how panel surveys are different from longitudinal surveys.
Chapter 12
Correlation and linear regression
12.1 Aims
This final chapter of the subject guide takes us back to the ideas of the formula for a
straight line, which you worked on back in Chapter 2. Here, we are going to use basic
mathematics to understand the relationship between two measurable variables.
If you are taking, or likely to take, ST104b Statistics 2, and later EC2020
Elements of econometrics, you should ensure that you cover all the examples here
very carefully as they prepare you for work on those courses.
If you do not intend to take your statistics any further, you need, at the very least as a
social scientist, to be able to assess whether relationships shown by data are what they
appear!
Be careful not to confuse the techniques used here – on measurable variables – with
those you worked with in Chapter 9 on categorical variables!
12.2 Learning outcomes
After completing this chapter, and having completed the Essential reading and
activities, you should be able to:
draw and label a scatter diagram
calculate r
explain the meaning of a particular value and the general limitations of r
calculate a and b for the line of best fit in a scatter diagram
explain the relationship between b and r
summarise the problems caused by extrapolation.
12.3 Essential reading
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060]
Sections 1.6, 2.4, 11.1–11.3 and 14.6.
In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials
accessible via the ST104a Statistics 1 area at http://my.londoninternational.ac.uk
12.4 Further reading
Aczel, A.D. Complete Business Statistics (London: McGraw–Hill Higher
Education, 2009) seventh edition [ISBN 9780071287531] Chapters 10 and 11.
Anderson, D.R., D.J. Sweeney, T.A. Williams, J. Freeman and E. Shoesmith
Statistics for Business and Economics (South–Western Cengage Learning, 2010)
eleventh edition [ISBN 9780324783247] Chapter 14 and Section 15.1.
Wonnacott, T.H. and R.J. Wonnacott Introductory Statistics for Business and
Economics (Chichester: John Wiley & Sons, 1990) fourth edition [ISBN
9780471615170] Chapters 12–14.
12.5 Introduction
In Chapter 9, you were introduced to the idea of testing for association between different
attributes of a variable using the chi-squared distribution. We did this by looking at the
number of individuals falling into a category, or experiencing a particular contingency.
Correlation and regression enable us to see the connection between the actual
dimensions of two or more measurable variables. The work we will do in this chapter
will only involve looking at two variables at a time, but you should be aware that
statisticians use these theories and similar formulae to look at the relationship between
many variables – so-called multivariate analysis.
When we use these terms we are concerned with using models for prediction and
decision making. So, how do we model the relationship between two variables? We
are going to look at:
correlation – which measures the strength of a linear relationship
regression – which is a way of representing that linear relationship.
It is important you understand what these two terms have in common, but also the
differences between them.
Activity 12.1 Explain the main types of variables which statistical methods are
used to study. What are the differences between ‘association’ and ‘correlation’ as
statistical terms?
Solution
There are two main types of variable: measurable variables (such as distances,
weights, temperatures, frequencies and sums of money), and categorical variables,
which cannot be measured.
Categorical variables can be either ordinal (when the values can be ranked, as, for
example, a scale which measures dissatisfaction versus satisfaction) or nominal, for
which no ranking is possible (such as nationalities).
Association represents the relationship between categorical variables (nominal or
ordinal scales) whereas correlation represents (and assesses numerically the strength
of) the linear relationship between measurable variables.
12.6 Scatter diagrams
Let us assume that we have some data in paired form: (xi , yi ), for i = 1, 2, . . . , n.
Example 12.1 An example of paired data is the following which represents the
number of people unemployed and the corresponding monthly reported crime figures
for twelve areas of a city.
Unemployed, x            2,614   1,160   1,055   1,199   2,157   2,305
Number of offences, y    6,200   4,610   5,336   5,411   5,808   6,004

Unemployed, x            1,687   1,287   1,869   2,283   1,162   1,201
Number of offences, y    5,420   5,588   5,719   6,336   5,103   5,268
This dataset will be used repeatedly throughout this chapter.
When dealing with paired data, the first action is to construct a scatterplot (or
scatter diagram) of the data, and visually inspect it for any apparent relationship
between the two variables.1 Figure 12.1 shows such a scatterplot for these data.
Figure 12.1 gives an impression of a positive, linear relationship, i.e. it can be seen that
x (unemployment) and y (number of offences) increase together, roughly in a straight
line, but subject to a certain amount of scatter. So, the relationship between x and y is
not exactly linear – the points do not lie exactly on a straight line.
Data showing a general ‘upward shape’ like this are said to be positively correlated, and
we shall see how to quantify this correlation. Other possible scatter patterns are shown
in Figure 12.2.
The left-hand plot shows data which have a negative correlation, i.e. y decreases as x
increases, and vice versa. The right-hand plot shows uncorrelated data, i.e. no clear
relationship between x and y. Note that correlation assesses the strength of the linear
relationship between two variables. Hence uncorrelated data, in general, just means an
absence of linearity. It is perfectly possible that uncorrelated data are related, just not
linearly – for example, x and y may exhibit a quadratic relationship (considered later).
1 Notice that, in contrast to categorical variables considered in Chapter 9, for the remainder of this course we will deal with measurable variables only.
Figure 12.1: Scatterplot of ‘Number of offences’ against ‘Unemployment’.
Example 12.2 Below is a list of variables, along with their expected correlation.
Variables                                      Expected correlation
Height and weight                              Positive
Rainfall and sunshine hours                    Negative
Ice cream sales and sun cream sales            Positive
Hours of study and examination mark            Positive
Car’s petrol consumption and goals scored      Zero

12.7 Causal and non-causal relationships
When two variables are correlated, an interesting question which arises is whether the
correlation indicates a causal relationship. In Example 12.2, it is natural to assume that
hours of study and examination marks are positively correlated, due to more study
resulting in better examination performance, i.e. a higher mark. In this case, the
relationship is causal.
Figure 12.2: Scatterplots – negatively correlated variables (left) and uncorrelated variables (right).
For the ice cream and sun cream sales, however, the relationship is not causal. It is not
the selling of ice cream which causes sun cream sales to rise, nor vice versa. Rather both
sales respond to warm weather and so these sales are seasonal, with both rising and
falling together in response to other variables such as temperature and sunshine hours.
It should be clear from this that care needs to be taken in interpreting correlated
relationships. Remember correlation does not imply causality!
Example 12.3 Let us consider a few more examples. In each case we observe
strong correlations.
i. ‘Average salary of school teachers’ and ‘Consumption of alcohol measured by
country’.
ii. ‘Stork population in Bavaria’ and ‘Human birth rate’.
iii. ‘Size of student population’ and ‘Number of juvenile offenders by local area in a
country’.
Would you seriously think there was a causal connection in these cases? Let us look
at them in a little more detail.
i. It would be frivolous to deduce that respect for teachers causes alcoholism! It is
much more likely that buying and consuming alcohol and high salaries are both
signs of a flourishing economy.
ii. This is a bit more difficult – can the fairy stories be true? Again, there is
probably a further variable at play here – are young families living in areas of
natural beauty (to raise their children in pleasant surroundings) which
encourage storks?
iii. The more young people there are, the more juvenile offenders, scholarship
winners, and students there are likely to be. Connecting these two figures is
pretty meaningless.
12.8 Correlation coefficient
The strength of a linear relationship between two random variables is given by the
correlation coefficient. For random variables X and Y , the population correlation
coefficient, denoted ρ, is defined as:
ρ = E[(X − E(X))(Y − E(Y))] / √(Var(X) Var(Y)) = [E(XY) − E(X) E(Y)] / √(Var(X) Var(Y))
and some textbooks may refer to this as the ‘product-moment correlation coefficient’.
Technically, ρ can only be determined if we have perfect knowledge of the ‘bivariate
density function’ of X and Y .2 In practice, we will wish to estimate ρ, using the
sample correlation coefficient, from a set of random sample observations of X and
Y , i.e. using sample paired data (xi , yi ), for i = 1, 2, . . . , n.
Sample correlation coefficient

The sample correlation coefficient, denoted r, is:

r = Sxy / √(Sxx Syy) = (Σ xi yi − n x̄ ȳ) / √((Σ xi² − n x̄²)(Σ yi² − n ȳ²))    (12.1)

where the sums run over i = 1, 2, . . . , n, the corrected sum of squares of the x-data is:

Sxx = Σ xi² − n x̄²

the corrected sum of squares of the y-data is:

Syy = Σ yi² − n ȳ²

and the corrected sum of cross-products is:

Sxy = Σ xi yi − n x̄ ȳ.
It is quite common in examination questions to be given certain summary statistics (for
example, Σ xi, Σ xi², Σ yi, Σ yi² and Σ xi yi) to save you time from computing all
such quantities directly using the raw data. Hence it may be easier for you to remember
the expressions for Sxx, Syy and Sxy (the ‘corrected sum of squares’ for x and y, and
corresponding cross-products, respectively), and how they combine to calculate r.
2 A bivariate density function shows the dependence of two variables (here X and Y ) and is covered in ST104b Statistics 2.
Example 12.4 For the dataset in Example 12.1, we have n = 12, x̄ = 1665 and
ȳ = 5567, and so:

Sxx = 3431759,  Syy = 2584497  and  Sxy = 2563066.

Using (12.1), we have:

r = 2563066 / √(3431759 × 2584497) = 0.8606.
This is a strong, positive correlation.3 We also note that the value of r agrees with
the scatterplot shown in Figure 12.1.
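The arithmetic of Example 12.4 can be reproduced directly from the raw data of Example 12.1. The following Python sketch (illustrative only, not part of the course material) computes the corrected sums of squares, cross-products and r:

```python
import math

# Paired data from Example 12.1: unemployment (x) and offences (y)
x = [2614, 1160, 1055, 1199, 2157, 2305, 1687, 1287, 1869, 2283, 1162, 1201]
y = [6200, 4610, 5336, 5411, 5808, 6004, 5420, 5588, 5719, 6336, 5103, 5268]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n   # 1665 and 5567 when rounded

# Corrected sums of squares and cross-products
Sxx = sum(xi**2 for xi in x) - n * x_bar**2                    # about 3431759
Syy = sum(yi**2 for yi in y) - n * y_bar**2                    # about 2584497
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar # about 2563066

r = Sxy / math.sqrt(Sxx * Syy)
print(round(r, 4))   # 0.8606
```

Note that the quoted values of Sxx, Syy and Sxy are rounded to the nearest integer; the exact sample means, rather than the rounded x̄ = 1665 and ȳ = 5567, are used in the calculation.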
The sample correlation coefficient measures how closely the points in the scatterplot
lie around a straight line, and the sign of r tells us the direction of this line, i.e.
upward-sloping or downward-sloping, for positive and negative r, respectively. It does
not tell us the gradient of this line – this is what we will determine in regression.
Properties of the correlation coefficient
Both types of correlation coefficient (ρ and r) have the following properties. They:
are independent of the scale of measurement
are independent of the origin of measurement
are symmetric; that is, the correlation of X and Y is the same as the correlation
of Y and X
can only take values between ±1, i.e. −1 ≤ ρ ≤ 1 and −1 ≤ r ≤ 1, alternatively
|ρ| ≤ 1 and |r| ≤ 1, i.e. correlation coefficients always have an absolute value
less than or equal to 1.
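The first three properties are easy to check numerically. In the sketch below (illustrative only; the helper `corr` and the data are our own), r is unchanged by a positive change of scale and origin in either variable, and by swapping the two variables:

```python
import math

def corr(x, y):
    """Sample correlation coefficient r, as in equation (12.1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(a * a for a in x) - n * x_bar**2
    syy = sum(b * b for b in y) - n * y_bar**2
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r = corr(x, y)
r_scaled = corr([10*xi + 3 for xi in x], [0.5*yi - 7 for yi in y])  # same value
r_sym = corr(y, x)                                                  # same value
```

Note that a negative scale factor would flip the sign of r, which is why ‘independent of the scale of measurement’ is usually understood as positive rescaling.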
Having defined the correlation coefficient, it is important to remember the following
when interpreting r (or ρ).
An r (ρ) near the top of this range (i.e. near 1) indicates a strong, positive linear
relationship between X and Y .
An r (ρ) near the bottom of this range (i.e. near −1) indicates a strong, negative
linear relationship between X and Y .
Zero correlation indicates that X and Y are not linearly related, i.e. the variables
are uncorrelated.
3 If you were following ST104b Statistics 2 and were to test the statistical significance of this r, using the kind of techniques and ideas in Chapter 8, you would find that the result is highly significant at the 1% significance level. Note that you are not expected to test for the significance of r in ST104a Statistics 1.
Example 12.5 To reiterate, correlation assesses the strength of linear relationships
between variables only. ρ = 0 does not imply that X and Y are independent, since
the variables could have a non-linear relationship. For example, if:
y = x (1 − x) for 0 ≤ x ≤ 1
then the correlation is zero (as they are not linearly related), but, clearly, there is a
well-defined relationship between the two variables, so they are certainly not
independent. Figure 12.3 demonstrates this point for simulated sample data, where
we see a clear relationship between x and y, but it is clearly not a linear
relationship.4 Data of this kind would have a sample correlation near zero (here,
r = 0.148), which emphasises the linear nature of r.
Figure 12.3: Scatterplot showing a non-linear relationship.
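The point is easy to reproduce. The sketch below is illustrative: it evaluates the parabola from the footnote at evenly spaced x values, rather than using the simulated data behind Figure 12.3, so the relationship is exact but the correlation still collapses (here to exactly zero, because the x values are symmetric about the parabola's vertex):

```python
import math

def corr(x, y):
    """Sample correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar
    sxx = sum(a * a for a in x) - n * xbar ** 2
    syy = sum(b * b for b in y) - n * ybar ** 2
    return sxy / math.sqrt(sxx * syy)

# Deterministic parabola y = 2(x - 15)(85 - x): a perfect, but non-linear,
# relationship between x and y
x = list(range(20, 81, 5))
y = [2 * (v - 15) * (85 - v) for v in x]
r = corr(x, y)   # near zero despite the exact functional relationship
```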
Activity 12.2 Define the term ‘sample correlation coefficient’, r, based on data
(x1, y1), . . . , (xn, yn). Describe some properties of r in terms of how its value differs when the data show different patterns in the scatter diagram.
Solution
The sample correlation coefficient is:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
4 In fact, these data are scattered around a parabola with (approximate) equation y = 2(x − 15)(85 − x).
where:
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$$
and:
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}.$$
Examples of scatter diagrams include:
• linear up: r near 1
• linear down: r near −1
• random cloud: r close to 0.
Activity 12.3 The following table shows the number of salespeople employed by a
company and the corresponding values of sales (in £000s).
Salespeople (x)   210  209  219  225  232  221  220  233  200  215  205  227
Sales (y)         206  200  204  215  222  216  210  218  201  212  204  212
Here are some useful summary statistics:
$$\sum_{i=1}^{12} x_i = 2616, \quad \sum_{i=1}^{12} y_i = 2520, \quad \sum_{i=1}^{12} x_i y_i = 550{,}069, \quad \sum_{i=1}^{12} x_i^2 = 571{,}500 \quad \text{and} \quad \sum_{i=1}^{12} y_i^2 = 529{,}746.$$
Compute the sample correlation coefficient for these data.
Solution
We have:
$$\bar{x} = \frac{2616}{12} = 218 \quad \text{and} \quad \bar{y} = \frac{2520}{12} = 210.$$
Therefore:
$$r = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum x_i^2 - n\bar{x}^2\right)\left(\sum y_i^2 - n\bar{y}^2\right)}} = \frac{550069 - 12 \times 218 \times 210}{\sqrt{(571500 - 12 \times (218)^2)(529746 - 12 \times (210)^2)}} = 0.8716.$$
This indicates a very strong, positive and linear relationship between salespeople and
sales. The more salespeople employed, the higher the sales.
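Hand calculations like this are easily checked in code. A minimal sketch (pure Python, using the data from the activity) reproduces r = 0.8716:

```python
import math

x = [210, 209, 219, 225, 232, 221, 220, 233, 200, 215, 205, 227]
y = [206, 200, 204, 215, 222, 216, 210, 218, 201, 212, 204, 212]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar   # 709
sxx = sum(a * a for a in x) - n * xbar ** 2                # 1212
syy = sum(b * b for b in y) - n * ybar ** 2                # 546
r = sxy / math.sqrt(sxx * syy)
print(round(r, 4))  # 0.8716
```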
Activity 12.4 State whether the following statements are true or false, explaining
your answers.
(a) The sample correlation coefficient between x and y is the same as the sample
correlation coefficient between y and x.
(b) If the slope is negative in a regression equation y = a + bx, then the correlation
coefficient between x and y would be negative too.
(c) If two variables have a correlation coefficient of minus 1 they are not related.
(d) A large correlation coefficient means the regression line will have a steep slope.
Solution
(a) True. The sample correlation coefficient is:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}.$$
Switching x and y does not change r because Sxy = Syx.
(b) True. The sign of the slope and the sign of the correlation coefficient between x and y are the same. A negative slope indicates a negative correlation between the variables. Using sample data, the sign is determined by the term:
$$S_{xy} = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$$
which is the numerator of both b and r.
(c) False. A correlation coefficient of minus 1 represents a perfect negative
correlation between two variables. A correlation coefficient of zero would
indicate that the two variables are not linearly related.
(d) False. A large correlation coefficient means a strong linear relationship between
two variables. The magnitude of the correlation coefficient says nothing about
how steep the regression line is.
12.8.1 Spearman rank correlation
Just as we saw in Chapter 4 that we can use the median and interquartile range as
measures of location and dispersion, respectively, instead of the mean and standard
deviation (or variance), it is possible to calculate the Spearman rank correlation, rs ,
instead of r.
To compute rs we rank the xi and yi values in ascending order and apply the formula for r above to the ranks of the xi's and yi's, rather than to their actual values. Of course, it may be that we only have the ranks, in which case we would have to use this method.
Spearman rank correlation
If there are no tied rankings of xi and yi , the Spearman rank correlation is:
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \tag{12.2}$$
where the di's are the differences in the ranks between each xi and yi.
As with other order statistics, such as the median and quartiles, it is helpful to use the
Spearman rank correlation if you are worried about the effect of extreme observations
(outliers) in your sample. The limits for rs are the same as for r, that is |rs | ≤ 1.
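The effect of an outlier on the two coefficients can be illustrated directly. In the sketch below (made-up data with no tied ranks; `pearson` and `spearman` are helper names chosen here), one extreme y value drags the Pearson r well below 1, while rs, which sees only the ranks, still equals 1:

```python
import math

def pearson(x, y):
    """Sample (Pearson) correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    xbar, ybar = sum(a for a in x) / n, sum(b for b in y) / n
    sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar
    sxx = sum(a * a for a in x) - n * xbar ** 2
    syy = sum(b * b for b in y) - n * ybar ** 2
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation via formula (12.2); assumes no tied ranks."""
    rank = lambda v: [sorted(v).index(u) + 1 for u in v]
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank(x), rank(y)))
    n = len(x)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 1000]   # final y value is an extreme outlier

r = pearson(x, y)        # dragged down by the outlier (about 0.71)
rs = spearman(x, y)      # ranks still agree perfectly, so rs = 1.0
```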
Example 12.6 An aptitude test has been designed to examine a prospective
salesperson’s ability to sell. Ten current staff sit the test. Instead of putting achieved
scores in the computer, a research assistant ranks the individuals in ascending order
in terms of the test as well as productivity. The ranked data are:
Staff member                  A   B   C   D   E   F   G   H   I   J
Rank order in test            2   3   5   1   4   9  10   6   7   8
Rank order in productivity    1   2   3   4   5   6   7   8   9  10
Does it look as if the test is a good predictor of sales ability?
We first compute the differences in the ranks:
Staff member   A   B   C   D   E   F   G   H   I   J
di             1   1   2  −3  −1   3   3  −2  −2  −2
di²            1   1   4   9   1   9   9   4   4   4

and so:
$$\sum_{i=1}^{10} d_i^2 = 46.$$
Hence, using (12.2), we have:
$$r_s = 1 - \frac{6 \times 46}{10 \times ((10)^2 - 1)} = 0.7212$$
which is quite strong, indicating that the test is a reasonably good predictor of sales
ability.
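The calculation can be verified with a short script (a sketch applying formula (12.2), which is valid here because there are no tied ranks):

```python
# Ranks from Example 12.6: aptitude test vs. productivity, staff A to J
test = [2, 3, 5, 1, 4, 9, 10, 6, 7, 8]
prod = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n = len(test)

d2 = sum((t - p) ** 2 for t, p in zip(test, prod))  # sum of squared rank differences: 46
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(round(rs, 4))  # 0.7212
```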
Activity 12.5 Seven students in a class received the following examination and
project marks in a subject:
Examination mark   50  80  70  40  30  75  95
Project mark       75  60  55  40  50  80  65
You want to know if students who excel in examinations in the subject also had
relatively high project marks.
(a) Calculate the Spearman rank correlation.
(b) Based on your answer to part (a) do you think students who score well in
examinations are also likely to have the highest project marks? Briefly justify
your answer.
Solution
(a) We have (ranking marks from highest to lowest):

Examination mark            50   80   70   40   30   75   95
Project mark                75   60   55   40   50   80   65
Rank of examination mark     5    2    4    6    7    3    1
Rank of project mark         2    4    5    7    6    1    3
Difference in ranks, di      3   −2   −1   −1    1    2   −2
di²                          9    4    1    1    1    4    4
The Spearman rank correlation is:
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 24}{7 \times 48} = 0.5714.$$
(b) There is some correlation, but it is not strong. It looks as if there is a positive
connection between examination marks and project marks.
12.9 Linear regression
In ST104a Statistics 1, we can only hope to review the fundamentals of what is a very
large (and important) topic in statistical analysis – several textbooks and courses focus
exclusively on regression analysis. Here, we shall concentrate on the simple linear
regression model (also known as the bivariate regression model). This will allow
calculations to be performed on a hand-held calculator, unlike multiple linear regression
(with multiple independent variables) which typically requires the help of statistical
computer packages due to the complexity and sheer number of calculations involved.
In the simple linear regression model, we have two variables Y and X.
Y is the dependent (or response) variable – that which we are trying to
explain.
X is the independent (or explanatory) variable – the variable we think
influences Y .
Multiple linear regression is just a natural extension of this set-up, but with more than
one explanatory variable.
There can be a number of reasons for wanting to establish a mathematical relationship
between a response variable and an explanatory variable, for example:
• to find and interpret unknown parameters in a known relationship

• to understand the reason for such a relationship – is it causal?

• to predict or forecast Y for specific values of the explanatory variable.
12.9.1 The simple linear regression model
We assume that there is a true (population) linear relationship between a response
variable, y, and an explanatory variable, x, of the approximate form:
y = α + βx
where α and β are fixed, but unknown, population parameters. Our objective is to
estimate α and β using (paired) sample data (xi , yi ), for i = 1, . . . , n.
Note the use of the word approximate. Particularly in the social sciences, we would not
expect a perfect linear relationship between the two variables. Therefore, we modify this
basic model to:
y = α + βx + ε
where ε is some random perturbation from the initial ‘approximate’ line. In other
words, each y observation almost lies on the line, but ‘jumps’ off the line according to
the random variable ε. This perturbation is often referred to as the error term.
For each pair of observations (xi , yi ), for i = 1, . . . , n, we can write this as:
yi = α + βxi + εi .
The random deviations ε1 , ε2 , . . . , εn corresponding to the n data points are assumed to
be independent and identically normally distributed, with zero mean and constant (but
unknown) variance, σ 2 . That is:
εi ∼ N (0, σ 2 ) for i = 1, . . . , n.
This completes the model specification.
Specification of the simple linear regression model
To summarise, we list the assumptions of the simple linear regression model.
• A linear relationship between the variables of the form y = α + βx + ε.

• The existence of three model parameters: the linear equation parameters α and β, and the error term variance, σ².

• Var(εi) = σ² for all i = 1, . . . , n, i.e. the error term variance is constant and does not depend on the explanatory variable.

• The εi's are independent and N(0, σ²) distributed for all i = 1, . . . , n.
You may feel that some of these assumptions are particularly strong and restrictive. For
example, why should the error term variance be constant across all observations?
Indeed, your scepticism serves you well! In a more comprehensive discussion of linear
regression, such as in EC2020 Elements of econometrics, model assumptions would
be properly tested to assess their validity. Given the limited scope of linear regression in
ST104a Statistics 1, sadly we are too time-constrained to consider such tests in
detail. However, do be aware that with any form of modelling, a thorough critique of
model assumptions is essential. Analysis based on false assumptions leads to invalid
results – a bad thing.
Activity 12.6 Write down the simple linear regression model, explaining each term
in the model.
Solution
The simple linear regression model is:
y = α + βx + ε
where:
• y is the response (or dependent) variable
• x is the explanatory (or independent) variable
• α is the y-intercept
• β is the slope of the line
• ε is a random error term.
Activity 12.7 List the assumptions which we make when using the simple linear
regression model to explain changes in y by changes in x (i.e. the ‘regression of y on
x’).
Solution
We have the following four assumptions.
• The equation of the form y = α + βx + ε describes the true population
relationship.
12
• There are three unknown population parameters in the model, namely the
‘linear’ parameters α and β, and the error term variance, σ 2 .
• σ 2 is the same for i = 1, . . . , n, so in particular it does not depend on the value
of x we observe.
• εi ∼ N (0, σ 2 ) for i = 1, . . . , n, and the different εi s are independent of each
other.
12.9.2 Parameter estimation
As mentioned above, our principal objective is to estimate α and β, that is the
y-intercept and slope of the true line. To fit a line to some data, we need a criterion for
establishing which straight line is in some sense ‘best’. The criterion used is to minimise
the sum of the squared distances between the observed values of yi and the values
predicted by the model.5
The estimated least squares regression line is written as:
ŷ = a + bx
where a and b denote our estimates of α and β, respectively. The derivation of the
formulae for a and b is not required for this course,6 although you do need to know how
to calculate point estimates of α and β.
Simple linear regression line estimates
We estimate α and β with a and b, where:
$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{S_{xy}}{S_{xx}}$$
and:
$$a = \bar{y} - b\bar{x}.$$
Note you need to compute b first, since this is needed to calculate a.
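These formulae translate directly into code. A minimal sketch (pure Python; `fit_line` is a name chosen for illustration, not from the guide), which also respects the order of computation — the slope b first, then the intercept a:

```python
def fit_line(x, y):
    """Least squares estimates (a, b) for the fitted line y-hat = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum(u * v for u, v in zip(x, y)) - n * xbar * ybar
    sxx = sum(u * u for u in x) - n * xbar ** 2
    b = sxy / sxx        # slope first...
    a = ybar - b * xbar  # ...then the intercept, which needs b
    return a, b

# Perfectly linear data recovers the line exactly: y = 1 + 2x
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```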
Activity 12.8
(a) Explain the meaning of the term ‘least squares estimation’.
(b) Explain, briefly, the difference between correlation and regression.
Solution
(a) ‘Least squares estimation’ refers to moving a line around until the sum of the
(vertical) squared deviations from the data is at a minimum.
(b) The main point is that we take xi fixed in regression, but correlation is for two
random variables. Also, the estimate of β is related to correlation. For example,
positive β occurs with positive correlation.
5 This 'least squares' estimation technique is considered in ST104b Statistics 2.
6 The derivation is presented in ST104b Statistics 2.
Activity 12.9 The table below shows the cost of fire damage for ten fires together
with the corresponding distances of the fires to the nearest fire station.
Distance in miles (x)    4.9   4.5   6.3   3.2   5.0   5.7   4.0   4.3   2.5   5.2
Cost in £000s (y)       31.1  31.1  43.1  22.1  36.2  35.8  25.9  28.0  22.9  33.5
Here are some useful summary statistics:
$$\sum_{i=1}^{10} x_i = 45.6, \quad \sum_{i=1}^{10} y_i = 309.7, \quad \sum_{i=1}^{10} x_i y_i = 1475.1, \quad \sum_{i=1}^{10} x_i^2 = 219.46 \quad \text{and} \quad \sum_{i=1}^{10} y_i^2 = 9973.99.$$
Fit a straight line to these data.
Solution
We have:
$$\bar{x} = \frac{45.6}{10} = 4.56 \quad \text{and} \quad \bar{y} = \frac{309.7}{10} = 30.97.$$
The slope estimate is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{1475.1 - 10 \times 4.56 \times 30.97}{219.46 - 10 \times (4.56)^2} = 5.46.$$
The intercept estimate is:
$$a = \bar{y} - b\bar{x} = 30.97 - 5.46 \times 4.56 = 6.07.$$
Hence the estimated regression line is ŷ = 6.07 + 5.46x.
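As a cross-check on the arithmetic, the same estimates follow from the raw data. A sketch in pure Python (note the hand calculation's a = 6.07 comes from rounding b to 5.46 before computing a; with b unrounded, a ≈ 6.09):

```python
x = [4.9, 4.5, 6.3, 3.2, 5.0, 5.7, 4.0, 4.3, 2.5, 5.2]     # distance (miles)
y = [31.1, 31.1, 43.1, 22.1, 36.2, 35.8, 25.9, 28.0, 22.9, 33.5]  # cost (£000s)
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Slope b = Sxy / Sxx, then intercept a = ybar - b * xbar
b = (sum(u * v for u, v in zip(x, y)) - n * xbar * ybar) / \
    (sum(u * u for u in x) - n * xbar ** 2)
a = ybar - b * xbar
print(round(b, 2), round(a, 2))  # 5.46 6.09
```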
Activity 12.10 Consider the following dataset (which should be treated as sample
data):
Observation    1     2     3     4     5     6     7     8     9    10
xi           1.1   3.9   2.8   3.2   2.9   4.4   3.4   4.9   2.3   3.8
yi           6.4  17.0  12.8  14.4  13.1  18.7  15.1  20.6  11.0  16.6
Summary statistics for the dataset are:
$$\sum_{i=1}^{10} x_i = 32.7, \quad \sum_{i=1}^{10} y_i = 145.7, \quad \sum_{i=1}^{10} x_i y_i = 516.19, \quad \sum_{i=1}^{10} x_i^2 = 117.57 \quad \text{and} \quad \sum_{i=1}^{10} y_i^2 = 2271.39.$$
Calculate the point estimates of α and β for the regression of y on x based on the
above sample data.
Solution
We have:
$$\bar{x} = \frac{32.7}{10} = 3.27 \quad \text{and} \quad \bar{y} = \frac{145.7}{10} = 14.57.$$
The slope estimate is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{516.19 - 10 \times 3.27 \times 14.57}{117.57 - 10 \times (3.27)^2} = 3.7356.$$
The intercept estimate is:
$$a = \bar{y} - b\bar{x} = 14.57 - 3.7356 \times 3.27 = 2.3546.$$
Hence the estimated regression line is ŷ = 2.3546 + 3.7356x.
Activity 12.11 In a study of household expenditure a population was divided into
five income groups with the mean income, x, and the mean expenditure, y, on
essential items recorded (in Euros per month). The results are in the following table.
x   1000  2000  3000  4000  5000
y    871  1300  1760  2326  2950
Here are some useful summary statistics:
$$\sum_{i=1}^{5} x_i = 15{,}000, \quad \sum_{i=1}^{5} y_i = 9207, \quad \sum_{i=1}^{5} x_i^2 = 55{,}000{,}000, \quad \sum_{i=1}^{5} x_i y_i = 32{,}805{,}000 \quad \text{and} \quad \sum_{i=1}^{5} y_i^2 = 19{,}659{,}017.$$
(a) Fit a straight line to the data.
(b) How would you use the fit to predict the percentage of income which households
spend on essential items? Comment on your answer.
Solution
(a) We have:
$$\bar{x} = \frac{15000}{5} = 3000 \quad \text{and} \quad \bar{y} = \frac{9207}{5} = 1841.4.$$
The slope estimate is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{32805000 - 5 \times 3000 \times 1841.4}{55000000 - 5 \times (3000)^2} = 0.5184.$$
The intercept estimate is:
$$a = \bar{y} - b\bar{x} = 1841.4 - 0.5184 \times 3000 = 286.2.$$
Hence the estimated regression line is ŷ = 286.2 + 0.5184x.
(b) The percentage of income which households spend on essential items is:
$$\frac{y}{x} \times 100\%$$
which is approximated by:
$$\frac{a + bx}{x} \times 100\%.$$
This percentage is decreasing with increasing x so that poorer households spend
a larger proportion of their income on essentials.
Activity 12.12 The following table shows the number of computers (in 000s), x,
produced by a company each month and the corresponding monthly costs (in
£000s), y, for running its computer maintenance department.
Number of computers (in 000s), x    7.2  8.1  6.4  7.7  8.2  6.8  7.3  7.8  7.9  8.1
Maintenance costs (in £000s), y     100  116   98  112  115  103  106  107  112  111
The following statistics can be calculated from the data.
$$\sum_{i=1}^{10} x_i = 75.5, \quad \sum_{i=1}^{10} y_i = 1080, \quad \sum_{i=1}^{10} x_i y_i = 8184.9, \quad \sum_{i=1}^{10} x_i^2 = 573.33 \quad \text{and} \quad \sum_{i=1}^{10} y_i^2 = 116{,}988.$$
(a) Calculate the correlation coefficient for computers and maintenance costs.
(b) Find the best-fitting straight line relating y and x.
(c) Plot the points on a scatter diagram and draw the line of best fit.
(d) Comment on your results. How would you check on the strength of the
relationship you have found?
Solution
(a) We have:
$$\bar{x} = \frac{75.5}{10} = 7.55 \quad \text{and} \quad \bar{y} = \frac{1080}{10} = 108.$$
The sample correlation coefficient is:
$$r = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum x_i^2 - n\bar{x}^2\right)\left(\sum y_i^2 - n\bar{y}^2\right)}} = \frac{8184.9 - 10 \times 7.55 \times 108}{\sqrt{(573.33 - 10 \times (7.55)^2)(116988 - 10 \times (108)^2)}} = 0.9111.$$
(b) The slope estimate is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{8184.9 - 10 \times 7.55 \times 108}{573.33 - 10 \times (7.55)^2} = 9.349.$$
The intercept estimate is:
$$a = \bar{y} - b\bar{x} = 108 - 9.349 \times 7.55 = 37.415.$$
Hence the estimated regression line is ŷ = 37.415 + 9.349x.
(c) We have:
[Scatter diagram of maintenance costs (in £000s) against number of computers (in 000s), with the data points and the line of best fit.]
Note to plot the line you need to compute any two points on the line. For
example, for x = 7 and x = 8 the y-coordinates are determined, respectively, as:
37.415 + 9.349 × 7 ≈ 103 and 37.415 + 9.349 × 8 ≈ 112
giving points (7, 103) and (8, 112). The line of best fit should be drawn passing
through these two points.
(d) The sample correlation coefficient is close to 1, indicating a very strong positive
correlation between maintenance costs and the number of computers. For each
additional 1,000 computers, maintenance costs increase by £9,349.
12.9.3 Prediction
Having estimated the regression line, an important application of it is for prediction.
That is, for a given value of the explanatory variable, we can use it in the estimated
regression line to obtain a prediction of y.
Predicting y for a given value of x
For a given value of the explanatory variable, x0, the predicted value of the dependent variable, ŷ, is:
$$\hat{y} = a + bx_0.$$
Remember to attach the appropriate units to the prediction (i.e. the units of
measurement of the original y data). Also, ensure the value you use for x0 is correct – if
the original x data is in 000s, then a prediction of y when the explanatory variable is
10,000, say, would mean x0 = 10, and not 10,000!
Example 12.7 A study was made by a retailer to determine the relationship
between weekly advertising expenditure and sales (both in thousands of pounds).
Find the equation of a regression line to predict weekly sales from advertising.
Predict weekly sales when advertising costs are £35,000. The data are:
Advertising costs (in £000s)    40   20   25   20   30   50   40   20   50   40   25   50
Sales (in £000s)               385  400  395  365  475  440  490  420  560  525  480  510
Summary statistics, representing sales as yi and advertising costs as xi, for i = 1, . . . , 12, are:
$$\sum x_i = 410, \quad \sum x_i^2 = 15650, \quad \sum y_i = 5445, \quad \sum y_i^2 = 2512925 \quad \text{and} \quad \sum x_i y_i = 191325.$$
So, using the least squares formulae for b and a, the parameter estimates are:
$$b = \frac{191325 - 12 \times (410/12) \times (5445/12)}{15650 - 12 \times (410/12)^2} = 3.221$$
and:
$$a = \frac{5445}{12} - 3.221 \times \frac{410}{12} = 343.7.$$
Hence the estimated regression line is:
$$\hat{y} = 343.7 + 3.221x.$$
The predicted sales for £35,000 worth of advertising is:
$$\hat{y} = 343.7 + 3.221 \times 35 = 456.4$$
which is £456,400. Note that since the advertising costs were given in £000s, we used x0 = 35, and then converted the predicted sales into pounds.
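The whole example — estimation followed by prediction, with the units handled through x0 = 35 — can be sketched as follows (pure Python, data from the example):

```python
x = [40, 20, 25, 20, 30, 50, 40, 20, 50, 40, 25, 50]              # advertising (£000s)
y = [385, 400, 395, 365, 475, 440, 490, 420, 560, 525, 480, 510]  # sales (£000s)
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b = (sum(u * v for u, v in zip(x, y)) - n * xbar * ybar) / \
    (sum(u * u for u in x) - n * xbar ** 2)
a = ybar - b * xbar

yhat = a + b * 35        # advertising of £35,000 means x0 = 35, since x is in £000s
print(round(yhat, 1))    # 456.4, i.e. £456,400
```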
Activity 12.13 The examination marks of thirteen students in two statistics papers
(a foundation paper, Paper I, and a more advanced paper, Paper II) were as follows:
Paper I (x)     65  73  42  52  84  60  70  79  60  83  57  77  54
Paper II (y)    78  88  60  73  92  77  84  89  70  99  73  88  70
Useful summary statistics for these data are:
$$\sum x_i = 856, \quad \sum x_i^2 = 58{,}402, \quad \sum y_i = 1041, \quad \sum y_i^2 = 84{,}801 \quad \text{and} \quad \sum x_i y_i = 70{,}203.$$
(a) Calculate the sample correlation coefficient and comment on its value.
(b) Determine the line of best fit of y on x.
(c) Plot the points on a scatter diagram and draw the line of best fit.
(d) If you were told a student achieved a mark of 68 in Paper I, what would you
predict for the student’s mark in Paper II? Would you trust this prediction?
Solution
(a) We have:
$$\bar{x} = \frac{856}{13} = 65.846 \quad \text{and} \quad \bar{y} = \frac{1041}{13} = 80.077.$$
Therefore:
$$r = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum x_i^2 - n\bar{x}^2\right)\left(\sum y_i^2 - n\bar{y}^2\right)}} = \frac{70203 - 13 \times 65.846 \times 80.077}{\sqrt{(58402 - 13 \times (65.846)^2)(84801 - 13 \times (80.077)^2)}} = 0.9671.$$
This indicates a very strong, positive and linear relationship between Paper I
and Paper II examination marks. Assuming these papers are sat in order, the
higher the examination mark in Paper I, the higher the examination mark in
Paper II.
(b) The slope estimate is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{70203 - 13 \times 65.846 \times 80.077}{58402 - 13 \times (65.846)^2} = 0.8132.$$
The intercept estimate is:
$$a = \bar{y} - b\bar{x} = 80.077 - 0.8132 \times 65.846 = 26.5310.$$
Hence the estimated regression line is ŷ = 26.5310 + 0.8132x.
(c) We have:
[Scatter diagram of Paper II examination marks against Paper I examination marks, with the data points and the line of best fit.]
Note to plot the line you need to compute any two points on the line. For
example, for x = 50 and x = 70 the y-coordinates are determined, respectively,
as:
26.5310 + 0.8132 × 50 ≈ 67 and 26.5310 + 0.8132 × 70 ≈ 83
giving points (50, 67) and (70, 83). The line of best fit should be drawn passing
through these two points.
(d) We have:
$$\hat{y} = 26.5310 + 0.8132 \times 68 = 81.8286.$$
Since examination marks are integers, we round this to 82. The Paper I mark of 68 is within the range of our x data, hence this is interpolation and so we can trust the accuracy of the prediction.
Activity 12.14 The following table, published in USA Today, lists divorce rates
and mobility rates for different regions of the USA. The divorce rate is measured as
the annual number of divorces per 1,000 population, and the mobility rate is the
percentage of people living in a different household from five years before.
Region                Mobility rate (x-variable)   Divorce rate (y-variable)
New England                      41                         4.0
Middle Atlantic                  37                         3.4
East North Central               44                         5.1
West North Central               46                         4.6
South Atlantic                   47                         5.6
East South Central               44                         6.0
West South Central               50                         6.5
Mountain                         57                         7.6
Pacific                          56                         5.9
Summary statistics for the dataset are:
$$\sum_{i=1}^{9} x_i = 422, \quad \sum_{i=1}^{9} x_i^2 = 20132, \quad \sum_{i=1}^{9} y_i = 48.7, \quad \sum_{i=1}^{9} y_i^2 = 276.91 \quad \text{and} \quad \sum_{i=1}^{9} x_i y_i = 2341.6.$$
(a) Calculate the sample correlation coefficient and comment on its value.
(b) Calculate the regression equation.
(c) Plot the points on a scatter diagram and draw the line of best fit.
(d) Compute the expected divorce rate if the mobility rate is 40.
(e) Why might it be reasonable to use the divorce rate as the y variable?
Solution
(a) The sample correlation coefficient is:
$$r = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum x_i^2 - n\bar{x}^2\right)\left(\sum y_i^2 - n\bar{y}^2\right)}} = \frac{2341.6 - 9 \times 46.89 \times 5.41}{\sqrt{(20132 - 9 \times (46.89)^2)(276.91 - 9 \times (5.41)^2)}} = 0.8552$$
which suggests a fairly strong, positive and linear relationship between divorce rate and mobility rate (as deduced from the scatter diagram).
(b) The slope estimate is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = 0.1685.$$
The intercept estimate is:
$$a = \bar{y} - b\bar{x} = -2.4893.$$
Hence the estimated regression line is ŷ = −2.4893 + 0.1685x.
(c) We have:
[Scatter diagram of divorce rate (per 1,000 population) against mobility rate (%), with the data points and the line of best fit.]
Note to plot the line you need to compute any two points on the line. For
example, for x = 45 and x = 50 the y-coordinates are determined, respectively,
as:
−2.4893 + 0.1685 × 45 ≈ 5 and − 2.4893 + 0.1685 × 50 ≈ 6
giving points (45, 5) and (50, 6). The line of best fit should be drawn passing
through these two points.
(d) For x = 40, the expected divorce rate is −2.4893 + 0.1685 × 40 = 4.25 per 1,000
population.
(e) The use of the divorce rate as the dependent variable is reasonable due to the
likely disruptive effect moving home may have on relationships (any reasonable
argument would be acceptable here).
Activity 12.15 An area manager in a department store wants to study the
relationship between the number of workers on duty and the value of merchandise
lost to shoplifters. To do so, she assigned a different number of clerks for each of 10
weeks. The results were:
Week                      1    2    3    4    5    6    7    8    9   10
Number of workers (xi)    9   11   12   13   15   18   16   14   12   10
Loss (yi)               420  350  360  300  225  200  230  280  315  410
Here are some useful summary statistics:
$$\sum_{i=1}^{10} x_i = 130, \quad \sum_{i=1}^{10} y_i = 3090, \quad \sum_{i=1}^{10} x_i y_i = 38{,}305, \quad \sum_{i=1}^{10} x_i^2 = 1760 \quad \text{and} \quad \sum_{i=1}^{10} y_i^2 = 1{,}007{,}750.$$
(a) Which is the independent variable and which is the dependent variable?
(b) Compute the correlation coefficient between the number of workers and the loss.
(c) Calculate the least squares regression line.
(d) Plot the points on a scatter diagram and draw the line of best fit.
(e) Interpret the regression coefficients of the fitted line.
(f) Predict the loss when the number of workers is 17. Do you trust this value?
Solution
(a) The independent variable should be the number of workers, xi , and the
dependent variable should be the loss, yi . One might expect that more workers
would mean less shoplifting, as shoplifters would be more likely to get caught,
hence they are deterred from stealing merchandise.
(b) We have:
$$\bar{x} = \frac{130}{10} = 13 \quad \text{and} \quad \bar{y} = \frac{3090}{10} = 309.$$
The sample correlation coefficient is:
$$r = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum x_i^2 - n\bar{x}^2\right)\left(\sum y_i^2 - n\bar{y}^2\right)}} = \frac{38305 - 10 \times 13 \times 309}{\sqrt{(1760 - 10 \times (13)^2)(1007750 - 10 \times (309)^2)}} = -0.9688.$$
(c) The slope estimate is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{38305 - 10 \times 13 \times 309}{1760 - 10 \times (13)^2} = -26.64.$$
The intercept estimate is:
$$a = \bar{y} - b\bar{x} = 309 - (-26.64) \times 13 = 655.32.$$
Hence the estimated regression line is ŷ = 655.32 − 26.64x.
(d) We have:
[Scatter diagram of value of merchandise lost to shoplifters against number of workers on duty, with the data points and the line of best fit.]
Note to plot the line you need to compute any two points on the line. For
example, for x = 10 and x = 16 the y-coordinates are determined, respectively,
as:
655.32 − 26.64 × 10 ≈ 389 and 655.32 − 26.64 × 16 ≈ 229
giving points (10, 389) and (16, 229). The line of best fit should be drawn
passing through these two points.
(e) For every extra worker, the value of merchandise lost to shoplifters decreases by £26.64. If there were no workers on duty (unlikely), then £655.32 would be lost. (Since x = 0 is not realistic here, the intercept has little practical meaning.)
(f) When x = 17, the predicted loss is ŷ = 655.32 − 26.64 × 17 = £202.44.
Although 17 is within the range of the x data, it is only just! So, while strictly
speaking this is interpolation, being near the limit of the x data (the maximum
value of x is 18) means we should treat the prediction with caution!
Activity 12.16 The following data were recorded during an investigation into the
effect of fertiliser in g/m2 , x, on crop yields in kg/ha, y.
Crop yields (kg/ha)   160  168  176  179  183  186  189  186  184
Fertiliser (g/m2)       0    1    2    3    4    5    6    7    8
Here are some useful summary statistics:
$$\sum_{i=1}^{9} x_i = 36, \quad \sum_{i=1}^{9} y_i = 1611, \quad \sum_{i=1}^{9} x_i y_i = 6627, \quad \sum_{i=1}^{9} x_i^2 = 204 \quad \text{and} \quad \sum_{i=1}^{9} y_i^2 = 289{,}099.$$
(a) Plot the data and comment on the appropriateness of using the simple linear
regression model.
(b) Calculate a least squares regression line for the data.
(c) Predict the crop yield for 3.5 g/m2 of fertiliser.
(d) Would you feel confident predicting a crop yield for 10 g/m2 of fertiliser?
Explain briefly why or why not.
Solution
(a) We have:
[Scatter diagram of crop yields (kg/ha) against fertiliser (g/m2), with the data points.]
Fitting a linear model seems reasonable, although there is a strong hint of
non-linearity.
(b) We have:
$$\bar{x} = \frac{36}{9} = 4 \quad \text{and} \quad \bar{y} = \frac{1611}{9} = 179.$$
The slope estimate is:
$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{6627 - 9 \times 4 \times 179}{204 - 9 \times 4^2} = 3.05.$$
The intercept estimate is:
$$a = \bar{y} - b\bar{x} = 179 - 3.05 \times 4 = 166.8.$$
Hence the estimated regression line is ŷ = 166.8 + 3.05x.
(c) When x = 3.5, the predicted crop yield is ŷ = 166.8 + 3.05 × 3.5 = 177.475 kg/ha.
(d) No, we would not feel confident since this would be clear extrapolation. The x
data values do not exceed 8, and the scatter diagram suggests that fertiliser
values above 7 g/m2 may actually have a negative effect on crop yield.
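The danger of extrapolating here is also visible numerically. In the sketch below (pure Python, data from the activity), the fitted line keeps climbing at x = 10 even though the observed yields already turn down after 6 g/m2:

```python
x = list(range(9))                                  # fertiliser: 0..8 g/m^2
y = [160, 168, 176, 179, 183, 186, 189, 186, 184]   # crop yields (kg/ha)
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b = (sum(u * v for u, v in zip(x, y)) - n * xbar * ybar) / \
    (sum(u * u for u in x) - n * xbar ** 2)
a = ybar - b * xbar          # fitted line: y-hat = 166.8 + 3.05x

# Extrapolating to x = 10 predicts a yield higher than anything observed,
# even though the data peak at x = 6 and then decline
print(round(a + b * 10, 1))  # 197.3
```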
12.9.4 Points to watch about linear regression
Non-linear relationships
We have only seen here how to use a straight line to model the best fit. So, we could be
missing some quite important non-linear relationships, particularly if we were working
in the natural sciences.
Which is the dependent variable?
Note it is essential to correctly establish which is the dependent variable, y. In Example
12.7, you would have a different line if you had taken advertising costs as y and sales as
x! So remember to exercise your common sense – we would expect sales to react to
advertising campaigns rather than vice versa.
Extrapolation
In Example 12.7, we used our estimated line of best fit to predict the value of y for a
given value of x, i.e. advertising expenditure of £35,000. Such prediction is only reliable
if you are dealing with figures which lie within the dataset, known as interpolation. If
you use the estimated regression line for prediction using x0 which lies outside the range
of the available x data, then this is known as extrapolation, for which any predictions
should be viewed with extreme caution.
For Example 12.7, it may not be immediately obvious that the relationship between
advertising expenditure and sales could change, but a moment’s thought should
convince you that, were you to quadruple advertising expenditure, you would be
unlikely to get a nearly 13-fold increase in sales! Basic economics would suggest
diminishing marginal returns to advertising expenditure.
Sometimes it is very easy to see that the relationship must change. For instance,
consider Example 12.8, which shows an anthropologist’s figures on years of education of
a mother and the number of children she has, based on a Pacific island.
Example 12.8 Figures from an anthropologist show a negative linear relationship
between the number of years of education, x, of a mother and the number of live
births she has, y. The estimated regression line is:
ŷ = 8 − 0.6x
based on figures of women with between 5 and 8 years of education who had 0 to 8 live births. This looks sensible. We predict ŷ = 8 − 0.6 × 5 = 5 live births for those with 5 years of education, and ŷ = 8 − 0.6 × 10 = 2 live births for those with 10 years of education.
This is all very convincing, but say a woman on the island went to university and completed a doctorate, and so had 15 years of education. She clearly cannot have ŷ = 8 − 0.6 × 15 = −1 children! Also, if someone missed school entirely, is she really likely to have ŷ = 8 − 0.6 × 0 = 8 children? We have no way of knowing. The relationship shown by the existing figures will probably not hold beyond the x data range of 5 to 8 years of education.
12.10 Points to note about correlation and regression
As already mentioned, examiners frequently give you the following summary statistics:

Σx_i, Σx_i², Σy_i, Σy_i² and Σx_iy_i (each sum taken over i = 1, . . . , n)

in order to save you time. If you do not know how to take advantage of
these, you will waste valuable time which you really need for the rest of the question.
Note that if you use your calculator, show no working and get the answer wrong, you
are unlikely to get any credit. This part of the syllabus leads directly into ST104b
Statistics 2 and EC2020 Elements of econometrics, so it is important that you
understand it if you are taking these courses.
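To see how the five summary sums are used in practice, here is a sketch in Python (the small dataset is invented for illustration, not taken from the subject guide) applying the standard least squares and correlation formulae built from those sums:

```python
import math

# Invented illustrative data; an exam question would give you the five sums directly.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
n = len(x)

# The five summary statistics usually supplied.
sum_x = sum(x)
sum_x2 = sum(xi ** 2 for xi in x)
sum_y = sum(y)
sum_y2 = sum(yi ** 2 for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

xbar, ybar = sum_x / n, sum_y / n

# Corrected sums of squares and cross-products.
sxx = sum_x2 - n * xbar ** 2
syy = sum_y2 - n * ybar ** 2
sxy = sum_xy - n * xbar * ybar

b = sxy / sxx                     # slope of the least squares line
a = ybar - b * xbar               # intercept
r = sxy / math.sqrt(sxx * syy)    # sample correlation coefficient

print(f"y-hat = {a:.2f} + {b:.2f}x, r = {r:.3f}")
```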
What is the relationship between correlation and regression?
As we have seen, all the calculations we do for correlation and regression involve similar
summation measures, so they must be connected in some way. This is indeed the case.
For example, a large r2 (the square of the sample correlation coefficient) means that the
standard error of b will be low (in other words, if there is a strong connection between x
and y, the points will be close to the line of best fit). A small r2 means that the points
are scattered (resulting in a large standard error of b).
While you are not expected to determine confidence intervals or perform hypothesis
tests for the parameters α, β and ρ (these are covered in ST104b Statistics 2), it is
important that you are aware of these ideas. On the other hand, be sure you are clear
that a large |r| does not necessarily mean that the slope of the regression line is steep – a
strong correlation can exist if the points are all close to a line with a moderate gradient.
12.11
Summary
This brings us to the end of our introduction to dealing with the estimation of two
variables (bivariate statistics). This chapter has been chiefly concerned with two
important features when modelling the (linear) relationship between two variables: (i.)
correlation – measuring the strength of a (linear) relationship, and (ii.) linear regression
– a way of representing that linear relationship.
12. Correlation and linear regression
12.12
Key terms and concepts
Causality
Correlation
Correlation coefficient
Dependent (response) variable
Error term
Extrapolation
Independent (explanatory) variable
Non-linear relationship
Prediction
Regression
Scatter diagram (plot)
Simple linear regression
Spearman rank correlation
Uncorrelated

12.13

Learning activities
Solutions can be found in Appendix A.
1. State whether the following statements are true or false and explain why.
(a) ‘The correlation between X and Y is the same as the correlation between Y
and X.’
(b) ‘If the slope is negative in a regression equation, then the correlation
coefficient between X and Y would be negative too.’
(c) ‘If two variables have a correlation coefficient of minus 1 they are not related.’
(d) ‘A strong correlation coefficient means the regression line will have a steep
slope.’
2. Sketch a scatter diagram for each of the following situations:
(a) r = 0
(b) r is strong and positive
(c) r is strong and negative
(d) r is quite strong, but negative.
3. Think of an example where you feel the correlation is clearly spurious (that is,
there is a correlation, but no causal connection) and explain how it might arise.
Now think of a ‘clear’ correlation and the circumstances in which you might accept
causality.
4. Work out a and b in Example 12.7 using advertising costs as the dependent
variable. Now predict advertising costs when sales are £460,000.
Make sure you understand how and why your results are different from Example
12.7!
5. Try to think of a likely linear relationship between x and y which would probably
work over some of the data, but then break down like that in the anthropologist
case in Example 12.8.
This should make sure you understand the difference between interpolation and
extrapolation.
6. Sketch a scatter diagram with the line of best fit for the following a, b and r.
(a) a = 2, b = 1/2 and r = 0.9
(b) a = −3, b = −2 and r = −0.3.
12.14
A reminder of your learning outcomes
After completing this chapter, and having completed the Essential reading and
activities, you should be able to:
draw and label a scatter diagram
calculate r
explain the meaning of a particular value and the general limitations of r
calculate a and b for the line of best fit in a scatter diagram
explain the relationship between b and r
summarise the problems caused by extrapolation.
12.15
Sample examination questions
Solutions can be found in Appendix B.
1. Seven students in a class received the following examination and project marks in a
subject:

Examination mark | 50 | 80 | 70 | 40 | 30 | 75 | 95
Project mark     | 75 | 60 | 55 | 40 | 50 | 80 | 65
You want to know if students who excel in examinations in the subject also had
relatively high project marks.
(a) Calculate the Spearman rank correlation.
(b) Based on your answer to part (a), do you think students who score well in
examinations are also likely to have the highest project marks? Briefly explain
your answer.
2. An area manager in a department store wants to study the relationship between
the number of workers on duty, x, and the value of merchandise lost to shoplifters,
y, in $. To do so, the manager assigned a different number of workers for each of 10
weeks. The results were as follows:
Week | x  | y
#1   |  9 | 420
#2   | 11 | 350
#3   | 12 | 360
#4   | 13 | 300
#5   | 15 | 225
#6   | 18 | 200
#7   | 16 | 230
#8   | 14 | 280
#9   | 12 | 315
#10  | 10 | 410
The summary statistics for these data are:

Sum of x data: 130. Sum of the squares of x data: 1760.
Sum of y data: 3090. Sum of the squares of y data: 1007750.
Sum of the products of x and y data: 38305.
(a) Draw a scatter diagram of these data on the graph paper provided. Carefully
label the diagram.
(b) Calculate the sample correlation coefficient. Interpret its value.
(c) Calculate and report the least squares line of y on x. Draw the line on the
scatter diagram.
(d) Based on the regression model above, what will be the predicted loss from
shoplifting when there are 17 workers on duty? Would you trust this value?
Justify your answer.
A
Appendix A
Solutions to Learning activities
A.1
Chapter 2 – Mathematical revision – simple
algebra and coordinate geometry
1. (a) Rules: Brackets take precedence so this is:
(6) × (10) = 60.
(b) Here the ‘of’ and ÷ take precedence and this is:
(1/3) of 12 (= 4) minus 4 ÷ 2 (= 2)
i.e. 4 minus 2, i.e. +2.
(c) Brackets take precedence so we have:
(5/5) × 2
which is 1 × 2, which is 2.
You should find these really simple. If you are in any doubt at all, go over the rules
again given in Section 2.5 of the subject guide. Work through these examples again
and (even better) go to your old school arithmetic textbook and try some more!
2. (a) 16 is 4 × 4 and −4 × −4, so the square root of 16 is ±4.
(b) (0.07)2 = 0.07 × 0.07 = 0.0049. Be careful with the decimal points!
(c) 0.49 is 0.7 × 0.7 and −0.7 × −0.7, so the square root of 0.49 is ±0.7.
Be sure that you understand the rules for placing decimal points. When you work
on estimation or tests of hypotheses of proportions you will need this.
3. (a) The answer is (98/100) × 200 which is 98 × 2, or 196.
(b) To get a percentage, multiply the given fraction by 100, i.e. we obtain:
(17/25) × 100 = 17 × 4 = 68%.
So 17/25 can be written as 68%.
(c) 25% of 98/144 is another way of writing (25/100) × (98/144) which is
(1/4) × (49/72) or 49/288 which is 0.1701 to four decimal places.
4. (a) This is, simply put, 8.
(b) This is the same as |6|, which is 6.
5. (a) 7 and 9 are greater than 3.
(b) 2 is less than 3.
(c) 2 and 3 are less than or equal to 3.
(d) If x² ≥ 49, then x ≥ 7. This is the case when x is 7 or 9.
6. Substituting in the values, we obtain:

Σ_{i=1}^{2} x_i³ = x₁³ + x₂³ = 4³ + 1³ = 64 + 1 = 65.
7. (a) Substituting in the values, we obtain:

Σ_{i=1}^{3} x_i = x₁ + x₂ + x₃ = 2 + 3 + 5 = 10.

(b) Substituting in the values, we obtain:

(1/n) Σ_{i=1}^{4} x_i² = (1/4)(x₁² + x₂² + x₃² + x₄²) = (1/4)(2² + 3² + 5² + 7²) = (1/4)(4 + 9 + 25 + 49) = 21.75.
8. (a) We have:

Σ_{i=1}^{5} x_i = x₁ + x₂ + x₃ + x₄ + x₅ = 3 + 1 + 4 + 6 + 8 = 22.

(b) We have:

Σ_{i=3}^{4} x_i² = x₃² + x₄² = 4² + 6² = 16 + 36 = 52.

(c) We have:

Σ_{i=1}^{5} p_i x_i = p₁x₁ + p₂x₂ + p₃x₃ + p₄x₄ + p₅x₅
= (1/4) × 3 + (1/8) × 1 + (1/8) × 4 + (1/3) × 6 + (1/6) × 8
= 3/4 + 1/8 + 1/2 + 2 + 4/3
= 4 and 17/24

or 4.7083, if you prefer to work in decimals.

(Note: do not forget the ‘times’ sign takes precedence over the ‘plus’. Check
over Question 1 if in any doubt!)
(d) We have:

Σ_{i=3}^{5} p_i x_i² = p₃x₃² + p₄x₄² + p₅x₅²
= (1/8) × 4² + (1/3) × 6² + (1/6) × 8²
= (1/8) × 16 + (1/3) × 36 + (1/6) × 64
= 2 + 12 + 10 and 2/3
= 24 and 2/3

or 24.6667, if you prefer to work in decimals.
Check this carefully if you could not do it yourself first time. Note that i = 3 and
i = 5 mark the beginning and end, respectively, of the summation, as did i = 3 and
i = 4 in (b).
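A quick way to check summation arithmetic like this (a sketch, not part of the guide) is Python's fractions module, which keeps the working exact:

```python
from fractions import Fraction as F

p = [F(1, 4), F(1, 8), F(1, 8), F(1, 3), F(1, 6)]   # p_1, ..., p_5
x = [3, 1, 4, 6, 8]                                 # x_1, ..., x_5

# (c): sum of p_i * x_i over i = 1, ..., 5
total_c = sum(pi * xi for pi, xi in zip(p, x))
print(total_c)   # 113/24, i.e. 4 and 17/24

# (d): sum of p_i * x_i^2 over i = 3, ..., 5 (Python indices 2, 3, 4)
total_d = sum(pi * xi ** 2 for pi, xi in zip(p[2:], x[2:]))
print(total_d)   # 74/3, i.e. 24 and 2/3
```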
9. (a) Here the straight line graph has a slope of 1 and the line cuts the y-axis at
y = 3 (when x is zero, y is 3).

[Sketch: the line y = x + 3, crossing the y-axis at 3 and the x-axis at −3.]

(b) Here the straight line still has a positive slope of 3 (y increases by 3 when x
increases by 1) but the line crosses the y-axis at a minus value (−2).

[Sketch: the line y = 3x − 2, crossing the y-axis at −2 and the x-axis at 2/3.]
A.2
Chapter 3 – The nature of statistics
1. To properly define a circle we need to specify its location and size. Hence two
suitable parameters would be the co-ordinates of the circle’s centre point (defines
its location) and its radius (defines its size).
2. In a simple random sample, each member of the population has a known, equal
and non-zero probability of selection, while in a non-simple random sample each
member of the population has a known and non-zero probability of selection, but
these probabilities are not all equal. Stratified sampling and cluster
sampling are two examples of non-simple random sampling.
There will be more on this topic in Chapter 10.
A.3
Chapter 4 – Data visualisation and descriptive
statistics
1. (a) Imagine there are only three possible parties people can vote for: A, B and C.
Hence there will be three possible variables with nA , nB and nC being the
numbers voting for each (where nA + nB + nC = 1000), or pA , pB and pC being
the proportions voting for each (where pA + pB + pC = 1). Whichever way you
look at it, these are discrete variables.
(b) The number of hours students watch television per week could be anything
from 0 to 168! If you only measure whole numbers, you would regard these as
integers, though with such large numbers the distribution will approximate to
a continuous one. It is more likely that you will be given more detailed figures,
such as 1.33 hours, for example, or 23.798 hours. In this case, hours are clearly
continuous variables.
(c) Only whole numbers are possible which are unlikely to be large, so the variable
is discrete.
2. (a) We use a histogram when we want to compare the numbers of observations or
frequencies in categories which form part of a continuum, such as numbers in
the income groups:
◦ less than £200 per week
◦ £200 and less than £300 per week
◦ £300 and less than £500 per week
and so on. There must be no overlaps between the categories. The areas under
the histogram are important to watch.
(b) A stem-and-leaf diagram is very useful to work out before we draw a
histogram. It is a way of organising data and understanding their structure.
You could not use:
• a histogram if you are looking at named groups with no particular relationship
with each other
• a stem-and-leaf diagram if you are looking, for example, at proportions in
categories.
3. Adding the ten observations and dividing by 10 gives us:

(2 + 2 + 2 + 5 + 5 + 10 + 10 + 10 + 10 + 10)/10 = 66/10 = 6.6.

Using the grouped formula gives us, again:

((2 × 3) + (5 × 2) + (10 × 5))/10 = (6 + 10 + 50)/10 = 6.6.
4. The statement is false. The mean is only greater than the median when the dataset
is positively skewed. If the distribution is symmetric, then the mean and median
are equal. If the distribution is negatively skewed, then the mean is less than the
median.
5. We have:

(1/3) Σ_{i=2}^{4} x_i = (1/3)(x₂ + x₃ + x₄) = (4 + 5 + 6)/3 = 15/3 = 5.
The first observation, x1 , could be an outlier due to its value, 1, being ‘far’ from
the other observed values. Given the small sample size of n = 4, including this
potentially extreme observation might give a misleading estimate of the true
population mean.
6. (a) Since n = 3, the median is the midpoint of the ordered dataset, which is:

x(1) = 1, x(2) = 2 and x(3) = 4

hence the median is x(2) = 2.

(b) The mean is:

(1/n) Σ_{i=1}^{n} x_i = (4 + 1 + 2)/3 = 7/3.
7. (a) The mode is the value which occurs most frequently. Here the mode is 2.
(b) The mean is:

(1/n) Σ_{i=1}^{n} x_i = (x₁ + x₂ + x₃ + x₄ + x₅)/5 = (4 + 2 + 2 + 5 + 6)/5 = 19/5 = 3.8.
8. There are 13 observations so the mean is the sum of them (1749 as given) divided
by 13. This comes to £134.54.
The numbers have been arranged in order of size, so the median is the
((13 + 1)/2)th observation, that is the 7th observation. This is £110.
313
A
A
A. Solutions to Learning activities
The mode of the data, as given, is £195 which occurs twice (the other figures only
occur once). However, if we round to the nearest 10 (counting the 5s downwards)
the prices become (in £s) 50, 60, 60, 80, 100, 110, 110, 110, 150, 190, 190, 240, 260
and the mode is then £110 which occurs three times (£60 and £190 occur twice –
less often).
9. Using the first, we need to know the mean, which is:

(2 + 4 + 5 + 6 + 6 + 7 + 8 + 11 + 20)/9 = 69/9 = 7.67.

Therefore:

s² = (1/8) Σ_{i=1}^{9} (x_i − 7.67)²
= (1/8) [(2 − 7.67)² + (4 − 7.67)² + (5 − 7.67)² + (6 − 7.67)² + (6 − 7.67)² + (7 − 7.67)² + (8 − 7.67)² + (11 − 7.67)² + (20 − 7.67)²]
= (1/8)(32.15 + 13.47 + 7.13 + 2.79 + 2.79 + 0.45 + 0.11 + 11.09 + 152.03)
= 222.01/8
= 27.75.

You can see this is very hard work, and it is quite remarkable that having rounded
figures to the nearest two decimal places before squaring the (x_i − x̄) terms we still
get the same answer as that using the short-cut formula.

The short-cut formula gives us:

s² = (1/(n − 1)) (Σ_{i=1}^{n} x_i² − n x̄²) = (1/8)(751 − 9 × (7.67)²) = 27.75.
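Both routes can be checked with Python's statistics module, which uses the same divisor of n − 1 for the sample variance (a sketch, not part of the guide):

```python
import statistics

data = [2, 4, 5, 6, 6, 7, 8, 11, 20]

# Sample variance with divisor n - 1, as in the definitional formula.
s2 = statistics.variance(data)
print(s2)   # 27.75

# Short-cut formula: (sum of x_i^2 - n * xbar^2) / (n - 1).
n = len(data)
xbar = sum(data) / n
s2_shortcut = (sum(x ** 2 for x in data) - n * xbar ** 2) / (n - 1)
print(round(s2_shortcut, 2))   # 27.75
```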
10. The statement is false. By definition, the lower quartile separates the bottom 25%
of observations in a dataset from the top 75% of observations.
11. The range of the data is 265 − 52 = 213, so a price difference of 213 between the
most expensive and cheapest brands. The variance, standard deviation and
interquartile range are provided in the table, so check through your working and
make sure you can determine these.
12. (a) Stem-and-leaf diagram of daily telephone calls received by an office supplies
company:

Stem = 100s | Leaf = 10s & 1s
0 | 07 13 36 48 58 77 78
1 | 21 33 38 64 75 77 83
2 | 19 44 58 91
3 | 12 14 36
4 | 18 55 76
5 | 41

Since n = 25, the median is:

x((n+1)/2) = x(13) = 177.
(b) For the quartiles, noting that different quartile calculation methods are
acceptable, possible (interpolated) solutions are:

Q1 = x(n/4) = x(6.25) = x(6) + (x(7) − x(6))/4 = 77 + (78 − 77)/4 = 77.25

and:

Q3 = x(3n/4) = x(18.75) = x(18) + 3(x(19) − x(18))/4 = 291 + 3(312 − 291)/4 = 306.75.
(c) The stem-and-leaf diagram clearly shows a positively-skewed distribution,
hence we would expect the mean to be greater than the median. (In fact the
mean of the data is 210.88.)
(d) As discussed in (c), the data are clearly skewed to the right. The third quartile
indicates that about 75% of the time no more than 300 telephone calls are
received. The company’s management could use this information when
considering staffing levels.
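These order statistics can be checked directly (a sketch; note that the guide's interpolation rule for quartiles differs from some software defaults, so the positions are computed by hand here):

```python
# The 25 ordered observations read off the stem-and-leaf diagram.
calls = [7, 13, 36, 48, 58, 77, 78,
         121, 133, 138, 164, 175, 177, 183,
         219, 244, 258, 291, 312, 314, 336,
         418, 455, 476, 541]
n = len(calls)

median = calls[(n + 1) // 2 - 1]                   # x_(13); list indices start at 0
q1 = calls[5] + (calls[6] - calls[5]) / 4          # x_(6.25)
q3 = calls[17] + 3 * (calls[18] - calls[17]) / 4   # x_(18.75)
mean_calls = sum(calls) / n

print(median, q1, q3, round(mean_calls, 2))
```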
A.4
Chapter 5 – Probability theory
1. We have:
S = {1, 2, 3, 4, 5, 6},
E = {3, 4} and F = {4, 5, 6}.
(a) Fᶜ means the elements not in F, which are 1, 2 and 3.
(b) Eᶜ ∩ Fᶜ means the elements not in E and not in F. The elements not in E are
1, 2, 5 and 6. So only 1 and 2 are in neither E nor F.
(c) (E ∪ F)ᶜ means elements which are neither in E nor F; the elements in E
or F or both are 3, 4, 5 and 6. So, once again, the answer is 1 and 2.
(d) Eᶜ ∩ F are the elements not in E and in F. Eᶜ elements are 1, 2, 5 and 6 and
F elements are 4, 5 and 6, so the answer is 5 and 6.
2. The Venn diagram shows two overlapping circles, E (enclosing 3 and 4) and F
(enclosing 4, 5 and 6), within the sample space S.
(a) E ∪ F is shown by the total area enclosed by both circles.
(b) E ∩ F is the intersection of the two circles.
(c) Eᶜ is the area completely outside the circle which encloses 3 and 4.
3. (a) Of the total number of equally likely outcomes, 300, there are 30 which are
early. Hence required probability is 30/300 = 0.1.
(b) Again, of the total number of equally likely outcomes, 300, there are 150 from
Smith. Hence required probability is 150/300 = 0.5.
(c) Now, of the 300, there are only 10 which are late and from Jones. Hence the
probability is 10/300 = 1/30.
4. We have that P(A) + P(B) + P(C) = 1. Therefore:

P(C) = 1 − (1/2 + 1/3) = 1/6.
5. Since A and B are independent:

P(A ∩ B) = P(A) P(B) = 1/3 × 1/4 = 1/12.
6. False. If two events are independent, then the occurrence (or non-occurrence) of
one event has no impact on whether or not the other event occurs. Mutually
exclusive events cannot both occur at the same time. A simple counterexample to
show the statement is false would be the events A = ‘a coin landing heads up’, and
B = ‘rolling a 6 on a die’. The outcome of the coin toss has no effect on the die
outcome, and vice versa. However, it is possible to get heads and a 6, so the events
are not mutually exclusive.
7. (a) P(X² < 4) = P(X = 1) = 0.3, since X² < 4 is only satisfied when X = 1.
(b) We require the following conditional probability:

P(X > 2 | X is an even number) = P(X = 4)/(P(X = 2) + P(X = 4)) = 0.2/(0.5 + 0.2) = 0.2857.
8. (a) The additive law, for any two events A and B, is:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

For example, consider a Venn diagram in which area x is the part of A outside B,
area y is the part of B outside A, and area z is the intersection of the two circles.

We have P(A) = area x + area z, P(B) = area y + area z, P(A ∩ B) = area z
and P(A ∪ B) = area x + area y + area z. Therefore:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
= (area x + area z) + (area y + area z) − (area z)
= area x + area y + area z.

Therefore, to compute P(A ∪ B), we need to subtract P(A ∩ B) otherwise that
region would have been counted twice.
(b) The multiplicative law, for any two independent events A and B, is:
P (A ∩ B) = P (A) P (B).
There is no clear way of representing this on a Venn diagram. However, this is
a very important result whenever events are independent, such as when
randomly sampling objects with replacement.
9. (a) From the given probabilities we obtain:
P (C ∪ S) = P (C) + P (S) − P (C ∩ S) = 0.05 + 0.04 − 0.01 = 0.08.
(b) Having no defects is the complement of the event of having at least one defect,
hence:
P (no defects) = 1 − P (C ∪ S) = 1 − 0.08 = 0.92.
10. Note that the percentage of good and bad parts total to 100.
(a) The probability that a randomly chosen part comes from A is 0.6 (60%), the
probability that one of A’s parts is bad is 0.03 (3%), so the probability that a
randomly chosen part comes from A and is bad is 0.6 × 0.03 = 0.018.
(b) Add all the outcomes, which gives:
(0.6×0.97)+(0.6×0.03)+(0.4×0.93)+(0.4×0.07) = 0.582+0.018+0.372+0.028 = 1.
(c) The probability of receiving a bad part is the probability of either receiving a
bad part from A or from B, which is:
(0.6 × 0.03) + (0.4 × 0.07) = 0.018 + 0.028 = 0.046.
11. Here we have:

Device | P(not fail) | P(fail)
A      | 0.9         | 0.1
B      | 0.9         | 0.1
C      | 0.9         | 0.1
D      | 0.9         | 0.1
The system fails if both A and B fail (or more) or both C and D fail (or more).
To work out the probability that the system works properly, first work out the
probability it will fail, P (F ). The probability it will work is P (F c ).
The system fails if any of the following occur:
• A, B, C and D all fail: (0.1)⁴ = 0.0001
• exactly three fail (ABC, ABD, ACD or BCD): (0.1)³ × 0.9 × 4 = 0.0036
• A and B fail and C and D are okay: 0.1 × 0.1 × 0.9 × 0.9 = 0.0081
• C and D fail and A and B are okay: 0.1 × 0.1 × 0.9 × 0.9 = 0.0081.
So the total probability that the system will fail is
0.0001 + 0.0036 + 0.0162 = 0.0199, which makes the probability it will run
smoothly 1 − 0.0199 = 0.9801.
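The case-by-case count can be verified by brute force (a sketch, assuming the structure described above: the system fails exactly when both A and B fail, or both C and D fail):

```python
from itertools import product

P_FAIL = 0.1  # each device fails independently with probability 0.1

p_system_fail = 0.0
# Enumerate all 2^4 fail/work outcomes for devices A, B, C, D.
for a, b, c, d in product([True, False], repeat=4):
    prob = 1.0
    for fails in (a, b, c, d):
        prob *= P_FAIL if fails else 1 - P_FAIL
    if (a and b) or (c and d):       # condition for system failure
        p_system_fail += prob

print(round(p_system_fail, 4))       # 0.0199
print(round(1 - p_system_fail, 4))   # 0.9801
```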
12. Let B = ‘a beginner student’, T = ‘a transferring student’ and G = ‘will graduate’.
Hence P (B) = 0.73, P (T ) = 0.27, P (G | B) = 0.62 and P (G | T ) = 0.78.
(a) We need P(B ∩ G) which can be derived from:

P(G | B) = P(B ∩ G)/P(B)  ⇒  P(B ∩ G) = P(G | B) P(B) = 0.62 × 0.73 = 0.4526.
(b) Using the total probability formula:
P (G) = P (G | B) P (B) + P (G | T ) P (T ) = 0.62 × 0.73 + 0.78 × 0.27
= 0.4526 + 0.2106
= 0.6632.
(c) The required probability is:
P (B ∪ G) = P (B) + P (G) − P (B ∩ G) = 0.73 + 0.6632 − 0.4526 = 0.9406.
(d) We need to check whether P(G ∩ T) = P(G) P(T). Since:

P(G ∩ T) = P(G | T) P(T) = 0.78 × 0.27 = 0.2106 ≠ P(G) P(T) = 0.6632 × 0.27 = 0.1791

the events are not statistically independent.
(e) This is a conditional probability, given by:

P(T | G) = P(T ∩ G)/P(G) = P(G | T) P(T)/P(G) = 0.2106/0.6632 = 0.3176.
(f) We already have P(G ∩ T) = 0.2106 and P(G ∩ B) = 0.4526. We also require:

P(Gᶜ ∩ T) = P(Gᶜ | T) P(T) = (1 − 0.78) × 0.27 = 0.0594

and:

P(Gᶜ ∩ B) = P(Gᶜ | B) P(B) = (1 − 0.62) × 0.73 = 0.2774.

Two students being chosen at random can be considered as independent
events, hence the required probability is:

(0.2106)² + (0.4526)² + (0.0594)² + (0.2774)² = 0.3297.
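The probabilities in this solution can be reproduced in a few lines (a sketch using the probabilities given in the question):

```python
p_B, p_T = 0.73, 0.27                    # beginner / transferring student
p_G_given_B, p_G_given_T = 0.62, 0.78    # graduation probabilities

p_BG = p_G_given_B * p_B                 # (a) P(B and G)
p_G = p_BG + p_G_given_T * p_T           # (b) total probability formula
p_BorG = p_B + p_G - p_BG                # (c) additive law
p_T_given_G = (p_G_given_T * p_T) / p_G  # (e) Bayes' theorem

print(round(p_BG, 4), round(p_G, 4), round(p_BorG, 4), round(p_T_given_G, 4))
```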
A.5
Chapter 6 – The normal distribution and ideas of
sampling
1. Z ∼ N(0, 1), i.e. µ = 0 and σ² = 1.
(a) Looking at Table 4 of the New Cambridge Statistical Tables, we see the area
under the curve for which Z ≤ 1 is 0.8413, i.e. P(Z ≤ 1) = 0.8413. [Sketch:
standard normal curve with the area to the left of z = 1 shaded.]
(b) Since the total area under the curve is 1, the area for which Z ≥ 1 is
P (Z ≥ 1) = 1 − 0.8413 = 0.1587.
2. (a) If we are looking for 0.68 as the area under the curve between two standard
deviations, we need A (in the diagram below) to be 0.68/2 = 0.34 and hence
the whole area to the left of x to be 0.34 + 0.5 = 0.84.
Looking at Φ(x) = 0.84, we see that x is between 0.99 (where Φ(x) is 0.8389)
and 1.00 (where Φ(x) = 0.8413), i.e. approximately correct.
319
A
A
A. Solutions to Learning activities
(b) Similarly, look up Φ(x) of 0.95/2 + 0.5 = 0.475 + 0.5 = 0.975, which gives us
an x of exactly 1.96.
(c) Similarly, look up Φ(x) of 0.99/2 + 0.5 = 0.495 + 0.5 = 0.995, which gives x
between 2.57 and 2.58. Therefore, the third example is more approximate!
3. Let X ∼ N(20, 4).

(a) Standardising, we have:

P(X > 22) = P((X − 20)/√4 > (22 − 20)/√4) = P(Z > 1) = 1 − Φ(1) = 1 − 0.8413 = 0.1587.

(b) Standardising, we have:

P(14 < X < 16) = P((14 − 20)/√4 < (X − 20)/√4 < (16 − 20)/√4)
= P(−3 < Z < −2)
= (1 − Φ(2)) − (1 − Φ(3))
= (1 − 0.97725) − (1 − 0.99865)
= 0.0214.
4. Let A ∼ N(50, 5²) and B ∼ N(55, 6²). We have:

P(A < 45) = P((A − 50)/5 < (45 − 50)/5) = P(Z < −1)

and:

P(B < 45) = P((B − 55)/6 < (45 − 55)/6) = P(Z < −1.67).

Clearly, P(Z < −1.67) < P(Z < −1) (if you do not see this immediately, shade
these regions on a sketch of the standard normal distribution) and hence schools of
type A would have a higher proportion of students with marks below 45.
5. Let X ∼ N(3800, (250)²).

(a) Standardising, we have (noting the symmetry of the standard normal
distribution about zero):

P(X > 3500) = P((X − 3800)/250 > (3500 − 3800)/250) = P(Z > −1.2) = Φ(1.2) = 0.8849.

Hence, as a percentage, this is 88.49%.

(b) Standardising, we have:

P(X > 4000) = P((X − 3800)/250 > (4000 − 3800)/250) = P(Z > 0.8) = 1 − Φ(0.8) = 1 − 0.7881 = 0.2119.

Hence, as a percentage, this is 21.19%.

(c) Using (a) and (b), we have:

P(3500 < X < 4000) = (1 − P(X > 4000)) − (1 − P(X > 3500)) = (1 − 0.2119) − (1 − 0.8849) = 0.6730.

Therefore, we would expect 700 × 0.6730 ≈ 471 batteries to last between 3,500
and 4,000 hours.
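These table look-ups can be checked with the standard normal CDF, Φ(z) = (1 + erf(z/√2))/2 (a sketch; small differences from the tabulated values are rounding):

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 3800, 250
p_between = phi((4000 - mu) / sigma) - phi((3500 - mu) / sigma)
print(round(p_between, 4))   # about 0.6731 (the tables give 0.6730)

expected = 700 * p_between
print(round(expected))       # about 471 batteries
```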
6. (a) The population mean is:

µ = (Σ_{i=1}^{N} x_i)/N = (15 + 14 + 10 + 12 + 20 + 15)/6 = 86/6 = 14.33.

The population standard deviation is:

σ = √((1/N) Σ_{i=1}^{N} (x_i − x̄)²) = √((1/6) × 57.33) = √9.56 = 3.09.

(b)
i. Case of two individuals
Taking samples of two elements, the 15 possible samples and their means
are:

Sample number | Individuals | Times  | x̄
1             | A, B        | 15, 14 | 14.5
2             | A, C        | 15, 10 | 12.5
3             | A, D        | 15, 12 | 13.5
4             | A, E        | 15, 20 | 17.5
5             | A, F        | 15, 15 | 15.0
6             | B, C        | 14, 10 | 12.0
7             | B, D        | 14, 12 | 13.0
8             | B, E        | 14, 20 | 17.0
9             | B, F        | 14, 15 | 14.5
10            | C, D        | 10, 12 | 11.0
11            | C, E        | 10, 20 | 15.0
12            | C, F        | 10, 15 | 12.5
13            | D, E        | 12, 20 | 16.0
14            | D, F        | 12, 15 | 13.5
15            | E, F        | 20, 15 | 17.5
ii. Case of three individuals
Taking samples of three elements, the 20 possible samples and their means
are:

Sample number | Individuals | Times      | x̄
1             | A, B, C     | 15, 14, 10 | 13.00
2             | A, B, D     | 15, 14, 12 | 13.67
3             | A, B, E     | 15, 14, 20 | 16.33
4             | A, B, F     | 15, 14, 15 | 14.67
5             | A, C, D     | 15, 10, 12 | 12.33
6             | A, C, E     | 15, 10, 20 | 15.00
7             | A, C, F     | 15, 10, 15 | 13.33
8             | A, D, E     | 15, 12, 20 | 15.67
9             | A, D, F     | 15, 12, 15 | 14.00
10            | A, E, F     | 15, 20, 15 | 16.67
11            | B, C, D     | 14, 10, 12 | 12.00
12            | B, C, E     | 14, 10, 20 | 14.67
13            | B, C, F     | 14, 10, 15 | 13.00
14            | B, D, E     | 14, 12, 20 | 15.33
15            | B, D, F     | 14, 12, 15 | 13.67
16            | B, E, F     | 14, 20, 15 | 16.33
17            | C, D, E     | 10, 12, 20 | 14.00
18            | C, D, F     | 10, 12, 15 | 12.33
19            | C, E, F     | 10, 20, 15 | 15.00
20            | D, E, F     | 12, 20, 15 | 15.67
iii. Case of four individuals
Taking samples of four elements, the 15 possible samples and their means
are:

Sample number | Individuals | Times          | x̄
1             | C, D, E, F  | 10, 12, 20, 15 | 14.25
2             | B, D, E, F  | 14, 12, 20, 15 | 15.25
3             | B, C, E, F  | 14, 10, 20, 15 | 14.75
4             | B, C, D, F  | 14, 10, 12, 15 | 12.75
5             | B, C, D, E  | 14, 10, 12, 20 | 14.00
6             | A, D, E, F  | 15, 12, 20, 15 | 15.50
7             | A, C, E, F  | 15, 10, 20, 15 | 15.00
8             | A, C, D, F  | 15, 10, 12, 15 | 13.00
9             | A, C, D, E  | 15, 10, 12, 20 | 14.25
10            | A, B, E, F  | 15, 14, 20, 15 | 16.00
11            | A, B, D, F  | 15, 14, 12, 15 | 14.00
12            | A, B, D, E  | 15, 14, 12, 20 | 15.25
13            | A, B, C, F  | 15, 14, 10, 15 | 13.50
14            | A, B, C, E  | 15, 14, 10, 20 | 14.75
15            | A, B, C, D  | 15, 14, 10, 12 | 12.75
(c) For the case of two individuals, the mean of all the sample means is:

(x̄₁ + x̄₂ + · · · + x̄₁₅)/15 = 215/15 = 14.33.

For the case of three individuals, the mean of all the sample means is:

(x̄₁ + x̄₂ + · · · + x̄₂₀)/20 = 286.67/20 = 14.33.

For the case of four individuals, the mean of all the sample means is:

(x̄₁ + x̄₂ + · · · + x̄₁₅)/15 = 215/15 = 14.33.

In each case this equals the population mean, µ. The standard deviations of X̄
are 1.9551, 1.3824 and 0.9776 for samples of size 2, 3 and 4, respectively. This
confirms that the accuracy of our population mean estimator improves as we
increase our sample size because we increase the amount of information about
the population in the sample.
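The three tables, the mean of the sample means and the quoted standard deviations can all be generated directly (a sketch using itertools; the standard deviations agree with those quoted above to rounding):

```python
from itertools import combinations
from statistics import mean, pstdev

population = [15, 14, 10, 12, 20, 15]   # times for A, B, C, D, E, F

results = {}
for n in (2, 3, 4):
    # Means of every possible sample of size n, drawn without replacement.
    sample_means = [mean(s) for s in combinations(population, n)]
    # (number of samples, mean of sample means, sd of the sampling distribution)
    results[n] = (len(sample_means), mean(sample_means), pstdev(sample_means))
    print(n, results[n])
```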
A.6
Chapter 7 – Point and interval estimation
1. False. When calculated from the same data set, the only difference is the change in
confidence level. There is a trade-off whereby higher confidence (which is good,
other things equal) leads to a larger multiplier coefficient (a z-value or t-value)
leading to a greater margin of error, and hence a wider confidence interval (which is
bad, other things equal). Therefore a 95% confidence interval would be wider, since
95% confidence is greater than 90% confidence.
2. We are given a population variance of σ² = (1.2)², a sample mean of x̄ = 25, and a
sample size of n = 100.

To get a 95% confidence interval for the mean, we need:

x̄ ± 1.96 × σ/√n  ⇒  25 ± 1.96 × 1.2/10  ⇒  (24.7648, 25.2352).
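The interval can be reproduced in a couple of lines (a sketch):

```python
import math

xbar, sigma, n = 25, 1.2, 100
z = 1.96   # 95% standard normal critical value

margin = z * sigma / math.sqrt(n)
ci = (xbar - margin, xbar + margin)
print(tuple(round(v, 4) for v in ci))   # (24.7648, 25.2352)
```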
3. We were given x̄ = 0.824 and n = 200. This time we are told that the standard
deviation is derived from the sample, i.e. we have s = 0.042 rather than a known
population σ.
We should use the t distribution, but, looking at Table 10 of the New Cambridge
Statistical Tables the largest n we see is 121 (i.e. the degrees of freedom are 120),
for which the appropriate 95% value is 1.98 (compared with 1.96 for an
infinite-sized sample) and the 99% value is 2.617 (compared with 2.576).
As 200 is beyond the values given in the tables, in practice we use the values given
on the bottom line, which gives the same answers as those given in Example 7.1.
4. Do this exactly as stated. Proceed carefully. Look at Table 10 of the New
Cambridge Statistical Tables as suggested!
5. (a) The true standard error of the sample proportion, P, is:

S.E.(P) = √(π(1 − π)/n).

If π, the population proportion, is unknown then we use the estimated
standard error in inference calculations, i.e. we use:

E.S.E.(P) = √(p(1 − p)/n).

For two samples drawn from the same population (again with a population
proportion π), the pooled sample proportion is:

(n₁p₁ + n₂p₂)/(n₁ + n₂)

when the sample sizes are n₁ and n₂ giving rise to sample proportions p₁ and
p₂, respectively.
(b) The sample proportions are pA = 220/300 = 0.733 and pB = 200/250 = 0.8.
For a 95% confidence interval, we use the z-value of 1.96. A 95% confidence
interval is:

0.8 − 0.733 ± 1.96 × √(0.733 × (1 − 0.733)/300 + 0.8 × (1 − 0.8)/250)  ⇒  (−0.0035, 0.1375).

The confidence interval (just) contains zero, suggesting no evidence of a
difference in the true population proportions. Therefore, there is no reason to
believe that one company gives greater satisfaction than the other. However,
given the close proximity of the lower endpoint to zero, this is a marginal case.
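The interval can be checked numerically (a sketch using the rounded sample proportions as in the working above):

```python
import math

pA, nA = 0.733, 300   # company A: rounded sample proportion, sample size
pB, nB = 0.8, 250     # company B

# Estimated standard error of the difference in sample proportions.
se = math.sqrt(pA * (1 - pA) / nA + pB * (1 - pB) / nB)
diff = pB - pA
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(tuple(round(v, 4) for v in ci))   # (-0.0035, 0.1375)
```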
6. (a) We are given n = 200 and p = 30/200 = 0.15. So the estimated standard error
is:

E.S.E.(p) = √(0.15 × 0.85/200) = √0.0006375 = 0.0252.

A 95% confidence interval is:

0.15 ± 1.96 × 0.0252  ⇒  (0.1006, 0.1994).

Taking this as percentages, we have a 95% confidence interval of between
10.06% and 19.94%, to two decimal places.
(b) In the second institution, n = 100 and p = 18/100 = 0.18. So the estimated
standard error is:

E.S.E.(p) = √(0.18 × 0.82/100) = √0.001476 = 0.0384

which is larger than the one in the first sample owing to the relatively smaller
sample size.
This gives us a 95% confidence interval of:

0.18 ± 1.96 × 0.0384  ⇒  (0.1047, 0.2553).

Taking this as percentages, we have a 95% confidence interval of between
10.47% and 25.53%, to two decimal places.
(c) The estimated standard error is:

E.S.E.(pA − pB) = √(0.15 × 0.85/200 + 0.18 × 0.82/100) = √(0.0006375 + 0.001476) = 0.0460

which gives a 95% confidence interval of:

0.4 − 0.18 ± 1.96 × 0.0460  ⇒  (0.1298, 0.3102)

or between 12.98% and 31.02% – not a very useful confidence interval! Such a
wide interval arises because the sample proportions differ greatly, meaning the
proportions of students enjoying Statistics in these two different institutions
seem to be very different.
A.7
Chapter 8 – Hypothesis testing
1. Look at the wording in each of (a) to (c) carefully. You can get clues as to whether
you are dealing with a one-tailed test (where H1 will use a < or >), or a two-tailed
test (where H1 will involve 6=) according to the use of words like ‘increase’, ‘higher’,
‘greater’ and ‘diminished’ which all imply a one-tailed test; or the use of words like
‘equal’, ‘changed’ and ‘different from’ which all imply a two-tailed test.
(a) Here you want to know whether the mean incomes in a particular area (µU ) is
less than the general population mean (µ = 10000).
So you need a one-tailed test, where:
H0 : µU = 10000 (i.e. µU is equal to µ)
H1 : µU < 10000 (i.e. µU is less than µ).
(b) Here we are comparing the mean weights in two schools A and B, let us say
they are x̄A and x̄B , with population means µA and µB , respectively. We want
to know if they are different so we need a two-tailed test.
H0 : µA = µB (i.e. the means are the same)
H1 : µA ≠ µB (i.e. the means are different).
(c) We look at the reading score means for children before, x̄B , and after, x̄A , for a
teaching programme with populations means µB and µA , respectively.
H0 : µA = µB (i.e. µB is no different from µA )
H1 : µA > µB (i.e. the mean score after the programme is greater than
before).
A. Solutions to Learning activities
2. Your completed chart should look like this:

                                Result from your test
   Real situation    Not reject H0                   Reject H0
   H0 true           Correct decision                Type I error
                     Probability (1 − α), called     Probability α, called the
                     the confidence level            significance level
                     of the test                     of the test
   H1 true           Type II error                   Correct decision
                     Probability β                   Probability (1 − β), called
                                                     the power of the test
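The probabilities in this chart can be made concrete by simulation. The sketch below (illustrative only, not part of the original activity) repeatedly performs a two-tailed z test of H0 : µ = 0 on samples which really do come from a population with µ = 0 and σ = 1, and checks that the long-run Type I error rate is close to the significance level α = 0.05.

```python
import random
from statistics import NormalDist

random.seed(2019)
crit = NormalDist().inv_cdf(0.975)    # two-sided 5% critical value, about 1.96

n, trials, rejections = 30, 20000, 0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]   # sample drawn with H0 true
    z = (sum(xs) / n) * n ** 0.5                  # z = x-bar / (sigma / sqrt(n)), sigma = 1
    if abs(z) > crit:
        rejections += 1                           # a Type I error

print(rejections / trials)   # close to alpha = 0.05
```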
3. False. The power of a test is the probability of rejecting the null hypothesis when
the alternative hypothesis is true. Although this outcome results in the correct
hypothesis being chosen, the statement is false because it implies that not rejecting
a true null hypothesis (a correct decision) also constitutes the power of a test,
which is incorrect.
4. (a) A test which is significant at the 1% significance level is highly significant. This
indicates that it is very unlikely to have obtained the sample we did if the null
hypothesis was actually true. Therefore, we would be very confident in
rejecting the null hypothesis.
(b) A test which is significant at the 10% significance level, but not at the 5%
significance level, is weakly significant. This indicates mild (at best) support for
rejecting the null hypothesis. The test outcome is less ‘conclusive’ than in (a).
5. Here the manufacturer is claiming that 90% of the population will be relieved over
8 hours, i.e. π = 0.9. There is a random sample of n = 200 and 160 gained relief,
giving a sample proportion of p = 160/200 = 0.8.
Given the manufacturer’s claim, we would be concerned if we found significant
evidence that fewer than 90% were relieved (more than 90% would not be a
problem!). So we have a one-tailed test with:
H0 : π = 0.9 vs. H1 : π < 0.9.
We use the population value π to work out the standard error, and so the test
statistic value is:
z = (0.8 − 0.9)/√((0.9 × 0.1)/200) = −0.1/0.0212 = −4.717.
This goes beyond Table 5 of the New Cambridge Statistical Tables (the most
extreme critical value is −4.4172, so this result is highly significant – the p-value is
nearly zero). Alternatively, you could look at Table 10 of the New Cambridge
Statistical Tables (which provides percentage points of the t distribution) and take
the bottom line (where ν = ∞), since t∞ = N(0, 1).
Note that the 5% critical value is −1.645 for a one-tailed test and the 1% and the
0.1% critical values are −2.326 and −3.090, respectively. This confirms that the
result is highly significant. So we reject H0 and the manufacturer’s claim is not
legitimate. Based on the observed sample, we have (very) strong evidence that the
proportion of the population given relief is significantly less than the 90% claimed.
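The test statistic and its p-value can be checked using only the Python standard library; this is an illustrative sketch (not part of the original activity). The slight difference from the quoted −4.717 comes from rounding the standard error to 0.0212 in the hand calculation.

```python
from math import sqrt
from statistics import NormalDist

# H0: pi = 0.9 vs. H1: pi < 0.9, with p = 160/200 = 0.8 and n = 200
pi0, n = 0.9, 200
p = 160 / n
z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)   # standard error uses the H0 value pi0
p_value = NormalDist().cdf(z)               # lower-tail p-value for the one-tailed test

print(round(z, 3))    # -4.714
```

The p-value is of the order 10⁻⁶, confirming that the result is highly significant.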
6. We would test H0 : π1 = π2 vs. H1 : π1 ≠ π2. Under H0, the test statistic would be:

(P1 − P2)/√(P(1 − P)(1/16 + 1/24)) ∼ N(0, 1)
where P is the pooled proportion estimator:
P = (n1 P1 + n2 P2)/(n1 + n2) = (16P1 + 24P2)/40.
Since the test statistic follows (approximately) the standard normal distribution,
there are no degrees of freedom to consider. For a two-sided test, the 5% critical
values would be ±1.96.
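Once sample counts X1 and X2 were observed, the calculation could be sketched as below; the helper function name is ours, purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

def pooled_z(x1, n1, x2, n2):
    """Two-sample proportion z statistic using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)          # pooled proportion estimate
    return (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

crit = NormalDist().inv_cdf(0.975)     # two-sided 5% critical values are +/- this

print(round(crit, 2))                  # 1.96
print(pooled_z(8, 16, 12, 24))         # equal sample proportions give z = 0.0
```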
7. There is no a priori reason, i.e. before obtaining the sample data, which holiday
company should have a greater proportion of customer satisfaction in the event they
did differ, hence we test H0 : πA = πB vs. H1 : πA ≠ πB. The test statistic value is:

(pB − pA)/√(p(1 − p)(1/n1 + 1/n2)) = (0.8 − 0.733)/√(0.764 × (1 − 0.764) × (1/300 + 1/250)) = 1.843
since p = (220 + 200)/(300 + 250) = 0.764. At the 5% significance level, the critical
values are ±1.96. Therefore, since 1.843 < 1.96, we do not reject the null hypothesis
and conclude that there is no evidence of a difference in the proportions of
customer satisfaction between the two companies.
Note that the link between a 95% confidence interval and hypothesis testing only holds when considering a two-sided test at the 5% significance level. Since a two-sided test was conducted here at the 5% significance level and we failed to reject the null hypothesis, this is consistent with zero being included in the corresponding 95% confidence interval for the difference in proportions.
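The calculation can be reproduced as follows (an illustrative sketch, not part of the original activity). At full precision the statistic is about 1.83, which the guide rounds to 1.843; the conclusion is unchanged either way.

```python
from math import sqrt
from statistics import NormalDist

# Company A: 220 of 300 satisfied; company B: 200 of 250 satisfied
xA, nA, xB, nB = 220, 300, 200, 250
pA, pB = xA / nA, xB / nB
p = (xA + xB) / (nA + nB)                             # pooled proportion, about 0.764
z = (pB - pA) / sqrt(p * (1 - p) * (1 / nA + 1 / nB))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))          # two-tailed p-value

print(round(z, 2))      # 1.83
print(p_value > 0.05)   # True: do not reject H0 at the 5% significance level
```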
A.8 Chapter 9 – Contingency tables and the chi-squared test
1. (a) The null and alternative hypotheses are:
H0 : There is no association between the kind of accommodation people
have and their level of satisfaction with it.
H1 : There is an association between the kind of accommodation people
have and their level of satisfaction with it.
(b) The degrees of freedom are (number of rows − 1) × (number of columns − 1).
Let us take the levels of satisfaction to be rows and the housing types to be
columns. Your table would look like this:
                                        Housing type
Level of          Public      Public    Private      Private       Private semi-
satisfaction      apartment   house     apartment    detached      detached         Misc.
                                                     house         house
High
Medium
Low
Very dissatisfied
There are 4 rows and 6 columns and therefore (4 − 1) × (6 − 1) = 3 × 5 = 15
degrees of freedom.
(c) Now look up the 5% and 1% critical values for a χ2 distribution with 15
degrees of freedom, using Table 8 of the New Cambridge Statistical Tables.
The 5% critical value is 25.00 and the 1% critical value is 30.58.
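If statistical software is available, these critical values can be obtained directly rather than from printed tables; for example, assuming SciPy is installed:

```python
from scipy.stats import chi2

df = 15
crit_5 = chi2.ppf(0.95, df)   # 5% critical value of the chi-squared distribution
crit_1 = chi2.ppf(0.99, df)   # 1% critical value

print(round(crit_5, 2), round(crit_1, 2))   # 25.0 30.58
```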
2. Table 8 of the New Cambridge Statistical Tables gives percentage points of the χ² distribution. Sketch a χ² curve and mark on it the 5% and 1% critical values for 6 degrees of freedom: these are 12.59 and 16.81, respectively.
Note that the rejection regions are in the right tail of the distribution. So you reject the appropriate null hypothesis at the significance level required if your calculated test statistic value is greater than these critical values.
3. We test:
H0 : No association between advertising type and level of sales
H1 : Association between advertising type and level of sales.
The degrees of freedom are (r − 1)(c − 1) = (3 − 1)(4 − 1) = 6. Using a 5% significance level, the critical value is 12.59. Since 12.59 < 13.5, we reject the null hypothesis and conclude that there is an association between advertising type and level of sales. Moving to the 1% significance level, the critical value is 16.81, hence we do not reject the null hypothesis at this level. The test is therefore moderately significant (significant at the 5% level, but not at the 1% level).
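An exact p-value makes the ‘moderately significant’ conclusion transparent; for example, assuming SciPy is installed:

```python
from scipy.stats import chi2

stat, df = 13.5, 6
p_value = chi2.sf(stat, df)   # upper-tail probability of the chi-squared distribution

print(round(p_value, 3))      # 0.036: between 0.01 and 0.05, so moderately significant
```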
4. (a) We test:
H0 : There is no association between proportion of skilled workers and
area.
H1 : There is an association between proportion of skilled workers and
area.
We get a contingency table which should look something like:

                       Number of           Number of
                       skilled workers     unskilled workers     Total
A     O                  80                  184                  264
      E                  72.41               191.59
      (O − E)²/E          0.80                 0.30
B     O                  58                  147                  205
      E                  56.22               148.78
      (O − E)²/E          0.06                 0.02
C     O                 114                  276                  390
      E                 106.96               283.04
      (O − E)²/E          0.46                 0.18
D     O                  55                  196                  251
      E                  68.84               182.16
      (O − E)²/E          2.78                 1.05
E     O                  83                  229                  312
      E                  85.57               226.43
      (O − E)²/E          0.08                 0.03
Total                   390                 1032                 1422

The χ² test statistic value is Σ(O − E)²/E = 5.76. The number of degrees of freedom is (5 − 1)(2 − 1) = 4, so we compare the test statistic value with the χ² distribution with 4 degrees of freedom, using Table 8 of the New Cambridge Statistical Tables. The test is not significant at the 5% significance level (the critical value is 9.488). It is also not significant at the 10% significance level (the critical value is 7.779) so we have found no evidence of an association. That is, the proportion of skilled workers does not appear to vary with area.
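The whole calculation can be delegated to software. Assuming SciPy and NumPy are installed, the following sketch reproduces the test; small differences from a hand calculation arise from rounding the expected frequencies to two decimal places.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = areas A to E, columns = skilled / unskilled workers
observed = np.array([[ 80, 184],
                     [ 58, 147],
                     [114, 276],
                     [ 55, 196],
                     [ 83, 229]])

stat, p_value, df, expected = chi2_contingency(observed, correction=False)

print(df)              # 4
print(round(stat, 2))  # 5.75 at full precision
print(p_value > 0.10)  # True: not significant even at the 10% level
```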
(b) We have a one-tailed z test:
H0 : πD = πothers vs. H1 : πD < πothers.
We have pD = 55/251 = 0.2191, pothers = 335/1171 = 0.2861, and a pooled
sample proportion of (55 + 335)/(251 + 1171) = 0.2742.
So the test statistic value is:

z = (0.2861 − 0.2191)/√(0.2742 × (1 − 0.2742) × (1/251 + 1/1171)) = 0.0670/0.0310 = 2.1613.
The 5% critical value is 1.645 and the 1% critical value is 2.326. So our
calculated test statistic value of 2.1613 lies between the two, meaning we reject
the null hypothesis at the 5% significance level only.
We can tell the management team that there is moderate evidence of a lower
proportion of skilled workers in area D compared with the others. However, it
is not significant at the 1% significance level and so it would be sensible to
check whether there are other factors of interest to them before deciding
against the area.
5. (a) We test:
H0 : There is no association between verbal reasoning ability and sex.
H1 : There is an association between verbal reasoning ability and sex.
The complete table with expected values for each cell in brackets looks like:

            Male         Female       Totals
High         50 (56)      90 (84)      140
Low         150 (144)    210 (216)     360
Totals      200          300           500
Looking at this table, we can see that there are fewer men in the higher
reasoning category than we would expect from the overall figures and slightly
more women. However, are the differences statistically significant?
We use the formula:

Σ (Oi,j − Ei,j)²/Ei,j

summing over all cells i, j, and get a test statistic value of:

χ² = (50 − 56)²/56 + (150 − 144)²/144 + (90 − 84)²/84 + (210 − 216)²/216
   = 0.6429 + 0.2500 + 0.4286 + 0.1667
   = 1.488.

Looking at Table 8 of the New Cambridge Statistical Tables for 1 degree of freedom [(r − 1) × (c − 1) = (2 − 1) × (2 − 1) = 1], the 5% critical value is 3.841. Since 1.488 < 3.841 we do not reject H0. In fact, 1.488 is not significant even at the 20% significance level and so we cannot reject the null hypothesis and we conclude that there is no evidence of an association between the sexes in their verbal reasoning ability on these figures.
(b) So what about testing the differences between proportions? Will this show
anything different? Well, it should not, otherwise we would have a problem! Try it
anyway.
Let us look at the proportion with the high verbal reasoning score. Here the
test is two-tailed. Is there a difference between the sexes? We test:
H0 : πF = πM vs. H1 : πF ≠ πM.
We have pM = 50/200 = 0.25, pF = 90/300 = 0.30 and a pooled sample
proportion p = (50 + 90)/(200 + 300) = 0.28. The test statistic value is:
z = (0.30 − 0.25)/√(0.28 × (1 − 0.28) × (1/200 + 1/300)) = 0.05/√0.00168 = 0.05/0.0410 = 1.220.
Using Table 5 of the New Cambridge Statistical Tables, we see that 1.220 is not significant at the 5% significance level (1.220 < 1.96), nor at the 10% significance level (1.220 < 1.645). So we do not reject H0 and so we have insufficient evidence to say that there is a difference in the proportions of men and women with a high level of verbal reasoning.
Notice that the null hypothesis used in (b) (that there is no difference in the proportion with a high level of reasoning ability) is effectively the same as that used in (a) (that there is no association between levels of verbal reasoning and sex). Indeed, for a 2 × 2 table the two tests are equivalent: squaring the two-tailed z test statistic gives the χ² test statistic (1.220² ≈ 1.488), so the two approaches always lead to the same conclusion.
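The equivalence of the two tests for a 2 × 2 table is easy to verify numerically; this is an illustrative sketch (not part of the original activity) using the counts above.

```python
from math import sqrt

# 2x2 verbal reasoning table: males 50 high / 150 low, females 90 high / 210 low
cells = [(50, 56.0), (150, 144.0), (90, 84.0), (210, 216.0)]   # (observed, expected)
chi_sq = sum((o - e) ** 2 / e for o, e in cells)

# Two-sample proportion z statistic with the pooled proportion p = 140/500
pM, pF, p = 50 / 200, 90 / 300, 140 / 500
z = (pF - pM) / sqrt(p * (1 - p) * (1 / 200 + 1 / 300))

print(round(chi_sq, 3))   # 1.488
print(round(z ** 2, 3))   # 1.488: identical, as it must be for a 2x2 table
```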
6. We test:
H0 : No association between level of exercise and type of environment
H1 : Association between level of exercise and type of environment.
Determining row and column totals, we obtain the expected frequencies as shown
in the following table, along with the calculated χ2 values for each cell.
                                 Level of exercise
Environment                      High        Low         Total
City        Observed             221         256          477
            Expected             263.56      213.44
            (O − E)²/E             6.87        8.49
Town        Observed             230         118          348
            Expected             192.28      155.72
            (O − E)²/E             7.40        9.14
Rural       Observed             159         120          279
            Expected             154.16      124.84
            (O − E)²/E             0.15        0.19
Total                            610         494         1104
The test statistic value is:
6.87 + 8.49 + 7.40 + 9.14 + 0.15 + 0.19 = 32.24.
There are (r − 1)(c − 1) = (3 − 1)(2 − 1) = 2 degrees of freedom, and the 5%
critical value is 5.991. Since 5.991 < 32.24, we reject H0 . Testing at the 1%
significance level, the critical value is 9.210. Again, we reject H0 and conclude that
the test is highly significant. There is strong evidence of an association between
level of exercise and type of environment. Looking at the large contributors to the
test statistic value, we see that those living in cities tend not to exercise much,
while those in towns tend to exercise more.
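The expected frequencies and the test statistic can be reproduced with the usual E = (row total × column total)/(grand total) rule; a sketch assuming NumPy is installed (the hand calculation above gives 32.24 from rounded cell contributions):

```python
import numpy as np

# Rows: City, Town, Rural; columns: High, Low exercise
observed = np.array([[221, 256],
                     [230, 118],
                     [159, 120]])

row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected = row_tot * col_tot / observed.sum()    # E = row total x col total / grand total
stat = ((observed - expected) ** 2 / expected).sum()

print(round(float(stat), 2))   # 32.23 at full precision
```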
7. (a) We examine the percentages. Doing so, we see that the proportion of teenagers
who do not smoke is roughly equal (around 90%) regardless of whether or not
there is adult cigarette use in a household. Eyeballing the figures there is no
clear association between teenager and adult cigarette use, although there is a
large sample size of 4,000. Therefore, small differences in percentages may be
statistically significant when formally tested. However, any association would
be subtle due to the similar column percentages by row.
Noting the χ2 value of 4.61, we would reject the null hypothesis of no
association at the 5% significance level, because the critical value with
(r − 1)(c − 1) = (2 − 1)(2 − 1) = 1 degree of freedom is 3.841, but fail to reject
the null hypothesis at the 1% significance level (with a critical value of 6.635).
Hence the association is significant, but not highly significant.
(b) We test:
H0 : No association between adult and teenage cigarette use
H1 : Association between adult and teenage cigarette use.
The expected frequencies are shown in the following table, along with the calculated χ² values for each cell.

Teenage                       Adult cigarette use
cigarette use                 Yes         No          Total
Yes       Observed             48          45           93
          Expected             55.06       37.94
          (O − E)²/E            0.91        1.31
No        Observed            544         363          907
          Expected            536.94      370.06
          (O − E)²/E            0.09        0.13
Total                         592         408         1000
The test statistic value is:
0.91 + 1.31 + 0.09 + 0.13 = 2.44.
There is (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1 degree of freedom, and the 5%
critical value is 3.841. Since 2.44 < 3.841, we do not reject H0 . Testing at the
10% level, the critical value is 2.706. Again, we do not reject H0 and conclude
that the test is insignificant. There is no evidence of an association between
adult and teenage cigarette use.
(c) At best, there is only a slight association between teenage and adult cigarette
use evident in the first survey table. Teenage cigarette use is marginally higher
in households without adult cigarette use (10.4%) than households with adult
cigarette use (8.4%).
We have seen that a test for association was significant for the first survey
table, but insignificant for the second survey table. Therefore, it seems that
the association has disappeared over the two years. Note the different sample
sizes used – although still a ‘large’ sample size of 1,000 households in the
second survey, it is harder to detect association than in the first survey with
4,000 households. So despite there being very similar percentages when
comparing across the two tables, the smaller sample size in the second survey
has caused the test of association to become insignificant.
8. We are told that we are supplied with the observed frequencies, i.e. the number of
births in a country for each of the four seasons. To test whether births are evenly
distributed between the four seasons, we would need to calculate the expected
number of births for each season. We would need to know the exact number of days
in each season to be able to calculate the expected frequencies, and hence perform
a goodness-of-fit test of an even distribution. Remember the seasons are of slightly
different lengths reflecting the variable number of days in each month. We would
then test:
H0 : There is a uniform distribution of live births throughout the year.
H1 : There is a non-uniform distribution of live births throughout the year.
Denoting the observed and expected frequencies for season i as Oi and Ei, respectively, the test statistic is:

Σ (Oi − Ei)²/Ei, summed over the four seasons i = 1, 2, 3, 4,

which under H0 follows a χ² distribution with 3 degrees of freedom.
At a 5% significance level, the critical value would be 7.815 and so we would reject
the null hypothesis if the test statistic value exceeded 7.815. We would then test at
a suitable second significance level to determine the strength of the test result.
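With hypothetical figures (the activity itself supplies no data), the test could be carried out as below, assuming SciPy is installed. The observed counts and season lengths here are invented purely for illustration.

```python
from scipy.stats import chisquare

# Hypothetical data for illustration only:
# observed births per season, and days per season for the expected frequencies
observed = [380, 390, 420, 410]
days = [90, 92, 92, 91]

total = sum(observed)
expected = [total * d / sum(days) for d in days]   # births proportional to season length
stat, p_value = chisquare(observed, f_exp=expected)

print(round(stat, 2))   # 1.97
print(p_value > 0.05)   # True: no evidence against a uniform birth rate
```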
A.9 Chapter 10 – Sampling design
1. Generally available lists include registers of electors, postal addresses used by the
postal services, lists of schools and companies. Whether they are up-to-date or not
depends on the purpose for which the list was intended. For example, looking at
the kinds of lists mentioned above you would expect there to be some government
regulation about registering electors – every year, or perhaps before each national
election. What is the policy in your country?
For postal addresses, the post office will register new buildings by postcode on a continuous basis. For lists of schools or companies, you need to check who would be responsible for them, and why and when they would be updated. Is there a subscription to pay annually, for example, in order to be a member of a school or company association, or is there a government regulation which means that schools or companies must be registered?
The question of target group also affects the usefulness of your sampling frame. If
you want to know about foreign nationals resident in your country, then the
register of electors is hardly appropriate, nor will it help you if you are interested
primarily in contacting people under voting age.
The list of new buildings will help you get in touch with people living in particular
areas so you may find you stratify by the socio-economic characteristics of the place
rather than the people.
Schools and companies are fine as a contact for people for many purposes, but the
fact that they have been approached through a particular organisation may affect
responses to particular questions. Would you tell interviewers who contacted you in
school time and had got hold of you through your school that you hate school, or
even about what you do after school? Would you reply to a postal questionnaire
sent via your company if you were asked how often you took unnecessary sick leave?
The more you think about all this, the more difficult it can seem! So think of three
lists and then list the ways you could find them useful as a basis for sampling and
the problems you might have in particular circumstances.
2. I am afraid I really cannot help you here! You must do this yourself. Surveys in
your home country should tell you details of the:
• sampling frame
• respondents
• coverage of the questionnaires
• response rates
• general subjects covered.
You could also get hold of one of these through your library and see the careful
explanation of how the survey works and the purpose(s) for which it is used.
3. (a) If the population is very large, then this increases the chances of making
non-sampling errors which reduces the quality of the collected data. For
example, large amounts of data collected may result in more data-entry errors
leading to inaccurate data and hence inaccurate conclusions drawn from any
statistical analysis of the data.
(b) An adequate sampling frame of single parent families is unlikely to be available
for use. This would make random sampling either impossible, or very
time-consuming if a sampling frame had to be created.
(c) Those who would potentially undertake post-compulsory education in the
future will be ‘young’ today. Data protection laws may prevent us from
accessing lists of school-age children. In addition, young children may not
realise or appreciate the merits of post-compulsory education so their current
opinions on the subject may differ from their future decision about whether to
continue into further education and higher education.
4. (a) Quota sampling may be viewed as two-stage restricted judgemental sampling.
The first stage consists of developing control categories, or quota controls, of
population elements. In the second stage, sample elements are selected based
on convenience or judgement until the set quotas are met. An example might
be using gender as our control characteristic. Suppose 48% of the population is
male, with 52% being female, then we would want the sample composition to
reflect this.
(b) Quota sampling is a form of non-probability (non-random) sampling which is
appropriate in the absence of a sampling frame. An advantage over other forms
of non-probability sampling is that we can control for certain characteristics.
Alternative forms of non-probability sampling include convenience sampling
and judgemental sampling.
(c) Quota sampling suffers from selection bias and hence there is no guarantee of
selecting a sample which is representative of the population. Despite
controlling for population characteristics, ultimately it is up to the interviewer
to select people, which inevitably results in selection bias. Refining the
different levels of the control characteristics and training interviewers may
reduce, if not eliminate, such bias. Also, non-response bias may result,
although this is not unique to quota sampling. Suitable incentives, callbacks
etc. may help to improve response rates.
5. The two quota controls which are relatively easy for your interviewer to identify are
gender and age group.
They are also useful controls to use as they help you gain a representative picture
of shopping patterns. We would expect people to buy different things according to
whether they are female (particular perfumes, for example) or male (special car
accessories) or by age group (out-of-season holidays for older people, or pop concert
tickets for younger people) to give trivial examples. If we are wrong, and older people are into pop concerts or lots of females want to make their cars sportier, then we would find this out if we have made sure we have sufficient numbers in each of the controls.
The question of a third control is more tricky. We would like to know about
people’s preferences if they spend more or less money when they shop, or according
to their income group, or depending on how many people they are shopping for
(just themselves, or their family and their elderly neighbours).
Of course, the interviewer could ask people how much they spent on shopping last
week, or what their family income is, or how many people they are shopping for.
People might reply to the last of these questions, but may well be unhappy about
the other two – so you get a lot of refusals – and lose your sample!
Even if they do reply, the interviewer will then have to discard some of the
interviews they have started. If everyone interviewed so far is shopping for
themselves, for example, and the quota has been filled, the interviewer will have to
ignore the person they have stopped and go to look for someone else to ask whether
they have shopped for others!
If the aim is to interview people who have bought a lot of things at that time, then
the interviewer could wait for people to leave the store concerned, and stop people
who have bought a lot that day, a medium amount, or a small amount, judging by
how full their shopping trolley is! Or do the same if people are accompanied by
children or not on their shopping expedition.
An alternative is to interview at more than one shopping area and deliberately go to
some shops in rich areas and some in poor areas, knowing that most of the people
you interview will be in an income category which matches their surroundings.
6. (a) Useful strata here might be:
• the site – people might have different travel arrangements depending on
where they work
• their work type – they may have different hours of work (which would
affect their travel possibilities), they may have different wage levels (which
might affect how they travel – do they need car parks?) and what they
can afford (would a season ticket loan help?).
You might also like to consider gender – safety issues, secure parking etc.
(b) These strata will help you to get good overall estimates (because you reduce
the overall standard error by using relevant stratification). They will also
enable you to look at particular groups and assess their needs (lower paid
women, people on a site with no good public transport, shift workers etc.).
7. Here an obvious cluster will be the school (some textbooks may indeed refer to it
as the primary sampling unit). This is because a main influence on how children
learn will be the school they are at – you could give yourself an enormous amount
of unnecessary extra work and expense if you choose a sample of 100 children, say,
from 100 different schools – you would in fact have to find out about each of their
schools individually. So it makes sense to sample clusters of schools themselves. For
a similar reason, it would be a good idea to cluster children according to the study
group (class) to which they belong. The only problem with this is that, if you have
too few clusters, it may be difficult to separate the effect of outside influences (the
effect of a particular teacher, for example) as opposed to the actual teaching
methods used.
8. There are no absolute right answers to these – you may disagree with the
suggestions, but if you do make sure you can explain why you think what you do! In
addition, the explanation must use the kinds of arguments given in this chapter.
(a) This slightly depends on the age of the children concerned. The younger the
child, the more difficult it will be to elicit a clear written reply. Telephone is
probably not an option (annoying to parents). On the whole, interviews at
school make most sense – though we hope the children’s replies would not be
too influenced by their environment.
(b) The items are non-essential – we need clear records – probably a diary kept at
home (if people will agree to do this) would be best. Random digit dialling
might work – but you would need to catch people while they still remembered
their shopping. Interviewing outside the store is not an option – we have been
told this is a random survey so we will need a list of addresses rather than a
quota approach to people as they shop.
(c) A quota sample is specified, so a face-to-face interview contact method would
be appropriate. Suitable quota controls need to be provided to the interviewer
to carry out the exercise in, say, a shopping centre.
(d) Email might be a good idea – quick and easy – but, of course, if their
computer facilities are really poor, we will only get replies from employees with
state-of-the-art facilities! If the company is supporting this, you could use any
method – an interview might obtain detailed and useful information. If we
only want to know the proportions of employees who have experienced
particular problems with their computers, a postal survey would work – and
the response rate should be good, as they are doing it for their employer!
(e) Here time is important – mail will be too slow. Even a random face-to-face
survey might take too long – you need to catch the interviewee at home.
So the choice is the following.
• If random, use random digit dialling and telephone people at home
the day following the TV programmes – everyone will remember them and
you can make your questions short and to the point!
• Alternatively, you may prefer to use a quota survey with the possibility of
showing pictures to illustrate your questions and perhaps ask more
detailed questions. You can also do this straight after the TV programmes
are broadcast.
9. (a) The main points you should make are the following.
• Deciding on your target group. Be careful – is there something particular
about them (i.e. children who will find writing difficult, employees who
might not feel comfortable to be interviewed about work at their
workplace) which may affect your next stage?
• The sampling frame – it should be complete (and appropriate to the
target), up-to-date, not have duplicates, be centrally and cheaply
available.
• Sampling design – your stratification and/or clustering factors – you want
to maximise accuracy for cost with respect to the variables in which you
are interested.
• Contact method – will you interview face-to-face, use telephone, postal
survey, or email, or what? This may be limited by your sampling frame.
You cannot ring people up, for example, if you only have their addresses.
• Analysis – everything you have studied in the other chapters!
• Errors can happen at each stage – your sampling frame may not
adequately reflect your target, the clusters in your design may cause more
variability than you expected, people may refuse to answer, not be at
home for interviews, lose your mail questionnaire, misunderstand your
questions or lie, hence your analysis may be faulty.
(b) You might carry out a quota survey because:
• you have no convenient sampling frame
• you are in a hurry
• you are not too concerned at this point about measuring accuracy – you
just want to get a rough idea about the problem you are studying
• money is a little tight and a random sample would cost too much.
(c) Hint – you should look at all stages of a survey. How many of them involve
interviewers making complicated surveys? How important is that? Is it a waste
of money? However, the kind of information – think about it – you can get is
really impossible to get without a trained interviewer. Go into this argument
in depth and think of examples.
(d) The following circumstances are worth thinking about.
• Am I in a hurry? Do I need a random sample (i.e. if quota will not do on
this occasion)?
• Will I be able to ask questions without needing to use visual aids?
(e) You need to think about what might affect interest in this new technology so
you can segment the possible market. The information you have may be
limited.
• You could look at the figures for the amount of money spent on electrical
goods by sales area, and stratify by area – high, medium and low spend.
This would mean that you would be likely to be able to contact people
with different degrees of interest in your product. This would form a good
basis for a random design.
• If you are doing this study for a company which had its customer sales
figures, you could stratify by high, medium and low spend on appropriate
items and sample from your customer base.
• If you are intending to make a quota sample, you could still stratify the
areas or shops where you send your interviewers by using information
about the area or customer database (if you have it) and taking interview
points which represent high, medium and low likely expenditure on
appropriate products.
10. Arguments could be made for and against any of the suggested contact methods.
However, on balance, a postal (mail) questionnaire might be most preferable.
Respondents could then return the survey form at their own convenience and
would not be rushed to respond as would be the case with an interview or
telephone survey. Also, this would be the least expensive and least time-consuming
method to employ. An obvious disadvantage is the likely high non-response rate. A
suitable incentive could be used to help reduce non-response.
A.10 Chapter 11 – Some ideas underlying causation – the use of control group and time order
1. Possible control factors might be the:
• number of vehicles per capita of population (because the effect of speed limits
will be different the more cars there are being driven)
• number of vehicles per mile (or kilometre) of made-up road (for the same
reason)
• age distribution of the driving population (younger drivers are more likely to
be involved in high-speed accidents; the very old have a high incidence of
low-speed, low fatality accidents)
• ‘mix’ of road types (on the whole, motorways tend to have fewer (but more serious) accidents compared with minor roads).
These are a few controls you might like to use. You would not use them all at once – the data would become too detailed for sensible statistical analysis and some of the figures you want might not be available. Of the first two controls suggested, which is the easier to get for your country?
You may think of more – check if the figures exist for your country!
2. (a) The advantages given in most textbooks are the following.
• The fact that asking people questions about their experience and
behaviour at the time (rather than retrospectively) eliminates ‘memory
effect’.
• There is an enormous advantage to be gained from seeing how people
change their behaviour over time. (If you make repeated observations
using independent samples each time, you have no way of telling whether
the 20% of people who bought ‘Yum Yum’ sweets this month are the same
people as the 20% who did so last month. In a panel, we can distinguish
those people who are buying ‘Yum Yums’ for the first time and repeat
buyers.)
The disadvantages given in most textbooks are the following.
• The problem of drop-outs – the more often you ask people questions for a
survey, the more likely they are to eventually refuse to participate any
more. In other words, you risk a high non-response. The main way of
dealing with this – retaining respondents’ interest by use of rewards or
information about your findings – may lead to the other difficulty.
• Conditioning – which means that the respondents begin to behave
differently from the general population from which they were originally
sampled because they have been made more aware of the subject of the
study.
Think of examples of how all this works – perhaps in the case of a panel of
sweet eaters (taken every two weeks) – or old people using mobility aids, taken
monthly.
(b) The things to look for here are the following.
◦ How you will contact the children – it is easy to do through school, if you
are sure that will not bias results. You could do this by choosing schools
randomly, or choose a few representative schools (quota). The alternative
is through youth clubs – this is likely to be a quota sample (but that
leaves out less social children, or those that live in out-of-the-way places).
Alternatively, you could sample families (randomly using the register of
electors or a post list) and ask the parents to let you talk to their children
(this would probably be the most expensive). The main thing is that with
this age of child you will probably have to interview them rather than ask
them to fill in questionnaires.
◦ Once you have decided that, you may decide to contact the children
weekly for three months while you are running your pilot programme. You
will have to think about incentives to keep them replying (if it is through
school or a youth club, you may offer, for example, videos to the school, or
television show tickets for the youth club or the same things to the
families). This exercise is likely to cause bias – so you should be careful to
ask children not involved in the panel about the programmes and see if
their general answers are different.
The things you need to decide for any panel or cohort study then are:
• how to target the subjects of your study
• how often
• which sampling method to use
• how to balance the need for a good response rate with the need to get
unbiased results.
A. Solutions to Learning activities
A.11
Chapter 12 – Correlation and linear regression
1. (a) True. Correlation measures the strength of any linear relationship between two
(measurable) random variables, not causality. Therefore, whether we consider
X and Y , or Y and X, does not affect the strength of the linear relationship
and so the correlation is the same.
(b) True. A negative slope in a regression equation means the line of best fit is
downward-sloping. Therefore, this indicates a negative linear relationship
between X and Y , hence they must be negatively correlated, so the correlation
coefficient would be negative too.
(c) False. A correlation coefficient of −1 indicates a perfect negative linear
relationship between the variables. That is, increases in one variable are
perfectly reflected as decreases in the other variable.
(d) False. A strong correlation coefficient just means a strong linear relationship
between the variables. For example, on a scatter diagram this would mean the
points lie close to a straight line. It says nothing about the gradient of the line.
This may be steep, and hence a high slope b, but not necessarily so as the
points could be tightly clustered around a line with a low slope.
2. (a) When r = 0, this means that there is no one clear line. The scatter diagram
would look like this:
You can see that any straight line would be as good (or bad) as any other if
you were to try to draw a line of best fit!
(b) If r is strong and positive, the line will slope upwards from left to right and
the points in the scatter diagram will be close. A typical diagram might be:
Note that the following could give the same r. The angle of slope of the line is
immaterial – it is the closeness of the observations which matters.
(c) If r is strong and negative, the slope is in the other direction, but again the
points must be close.
(d) If r is quite strong and negative, the slope may be the same, but the points
will be more spread out.
3. Here you should think about two things which have risen or fallen over time
together, but have no obvious connection. Examples might be the number of
examination successes and the number of films shown per year in a town.
A ‘clear’ correlation might be recovery rates from a particular illness in relation to
the amount of medicine given. You might accept this correlation as perhaps being
causal ‘other things equal’ if everyone who got the disease were given medicine as
soon as it was diagnosed (control) and also if recovery began only after the
medicine was administered (time effect).
4. We have:
b=
12 × 191325 − 410 × 5445
= 0.1251
12 × 2512925 − (5445)2
and:

a = (410 − 0.1251 × 5445)/12 = −22.5975.

So the regression equation, if we decide that advertising costs depend on sales, is:

ŷ = 0.1251x − 22.5975.
We are assuming that as sales rise, the company concerned spends more on
advertising. When sales are £460,000 we get predicted advertising costs of:
ŷ = 0.1251 × 460 − 22.5975 = 34.9485
i.e. £34,948.50.
Note that the xs and the ys were given in thousands, so be careful over the units of
measurement!
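As a check on the arithmetic, the coefficients can be reproduced from the summary statistics quoted in the solution (n = 12, ∑x = 5445, ∑y = 410, ∑xy = 191325, ∑x² = 2512925); a minimal Python sketch (not part of the original solution), rounding b to four decimal places as the solution does before computing a:

```python
# Cross-check of the regression calculation using the summary statistics
# from the solution: sales are x, advertising costs are y (both in £000s).
n, sum_x, sum_y = 12, 5445, 410
sum_xy, sum_x2 = 191325, 2512925

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
b = round(b, 4)                    # the solution rounds b before finding a
a = (sum_y - b * sum_x) / n        # a = y-bar - b * x-bar

y_hat = b * 460 + a                # predicted advertising costs at sales of 460
print(b, round(a, 4), round(y_hat, 4))   # 0.1251 -22.5975 34.9485
```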
5. An example might be an equation linking national income as an independent
variable to the percentage of employed persons as the dependent variable. Clearly
it is impossible to employ more than 100% of the population!
6. (a) We have:
Note how close the observations are to the line of best fit.
(b) We have:
Note how scattered the points are about the steep (gradient 2:1) negative
sloping line.
Appendix B
Solutions to Sample examination
questions
B.1
Chapter 2 – Mathematical revision – simple algebra and coordinate geometry
1. (a) We have:

∑_{i=3}^{5} x_i² = (−3.7)² + (0.8)² + (7.4)² = 13.69 + 0.64 + 54.76 = 69.09.
(b) We have:

∑_{i=1}^{2} 1/(x_i y_i) = 1/((−0.2) × (−0.2)) + 1/(2.5 × 8.0) = 25 + 0.05 = 25.05.
(c) We have:

y_4³ + ∑_{i=4}^{5} y_i²/x_i = (−2.0)³ + (−2.0)²/0.8 + 0²/7.4 = −8 + 5 + 0 = −3.
2. (a) We have:

∑_{i=1}^{3} z_i² = 8² + (−5)² + 6² = 64 + 25 + 36 = 125.
(b) We have:

∑_{i=4}^{5} √(y_i z_i) = √(16 × 4) + √(10 × 10) = 8 + 10 = 18.
(c) We have:

z_4² + ∑_{i=1}^{3} 1/y_i = 4² + (−1/2 − 1/5 + 1) = 16 + 0.3 = 16.3.
B.2
Chapter 4 – Data visualisation and descriptive statistics
1. A general tip for identifying continuous and categorical variables is to think of the
possible values they can take. If these are finite and represent specific entities the
variable is categorical. Otherwise, if these consist of numbers corresponding to
measurements, the data are continuous and the variable is measurable. Such
variables may also have measurement units or can be measured to various decimal
places.
(a) Measurable, because GDP can be measured in $bn or $tn to several decimal
places.
(b) Each satisfaction level corresponds to a category. The level of satisfaction is in
a ranked order – for example, in terms of the list items provided. Therefore,
this is a categorical ordinal variable.
(c) Each name (James, Jane etc.) is a category. Also, there is no natural ordering
between the names – for example, we cannot really say that ‘James is higher
than Jane’. Therefore, this is a categorical nominal variable.
2. (a) We have:
Class interval   Interval width   Frequency   Frequency density
[90, 100)              10              6             0.6
[100, 110)             10              9             0.9
[110, 120)             10              7             0.7
[120, 130)             10              5             0.5
[130, 140)             10              2             0.2
[140, 150)             10              1             0.1
[Histogram of the low-density lipoprotein data: frequency density (0 to 0.9) against low-density lipoproteins (in mg/dL), from 90 to 150.]
(b) We have x̄ = 3342/30 = 111.4 mg/dL, median = 109 mg/dL, and the standard deviation is:

s = √((1/29) × (377076 − 30 × (111.4)²)) = 12.83 mg/dL.
(c) The data exhibit positive skewness, as shown by the mean being greater than
the median.
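The mean and standard deviation in part (b) can be verified from the summary statistics used there (n = 30, ∑x = 3342, ∑x² = 377076); a short Python sketch, offered only as a cross-check:

```python
import math

# Verify the sample mean and standard deviation from the summary
# statistics n = 30, sum of x = 3342, sum of x^2 = 377076.
n, sum_x, sum_x2 = 30, 3342, 377076

mean = sum_x / n                                   # sample mean (mg/dL)
s = math.sqrt((sum_x2 - n * mean**2) / (n - 1))    # sample standard deviation
print(mean, round(s, 2))   # 111.4 12.83
```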
3. (a) We have:
[Boxplot of average daily intakes of calories (in kcal) for 12 athletes; the axis runs from 1800 to 3000 kcal.]
In order to identify the box, the quartiles are needed which are 1992.25
(anything between 1957 and 2009 is acceptable), 2124, 2207.75 (anything
between 2200 and 2231 is also acceptable as long as it is consistent with Q1 ),
hence giving an interquartile range of 215.5 (or anything else consistent with
the values of Q1 and Q3 ). Hence the outlier limits are from 1669 to 2531.
Therefore, the value of 3061 is an outlier.
Note that no label of the x-axis is necessary and that the plot can be
transposed.
(b) Based on the shape of the boxplot above, we can see that the distribution of
the data is positively skewed.
(c) A histogram, stem-and-leaf diagram or a dot plot are other types of suitable
graphical displays. The reason is that the variable income is measurable and
these graphs are suitable for displaying the distribution of such variables.
B.3
Chapter 5 – Probability theory
1. The solution of this exercise requires the following steps. Note, however, that these
steps can be performed in a different order.
We have that:
P(Bᶜ) = 1 − P(B) = 1 − 0.05 = 0.95

and:

P(A ∩ B) = P(A | B) P(B) = 0.70 × 0.05 = 0.035.

Applying the total probability formula, we have:

P(A) = P(A | B) P(B) + P(A | Bᶜ) P(Bᶜ) = 0.70 × 0.05 + 0.30 × 0.95 = 0.32.

Applying Bayes’ formula, we have:

P(B | A) = P(A ∩ B)/P(A) = 0.035/0.32 = 0.1094.
2. (a) Applying the total probability formula, we have:
P(F) = P(F | N) P(N) + P(F | S) P(S) = 0.02 × 0.95 + 0.5 × 0.05 = 0.044.
(b) Applying Bayes’ formula, we have:

P(S | F) = P(F | S) P(S)/P(F) = 0.025/0.044 = 25/44 = 0.5682.
3. (a) Let B1 and B2 denote boxes 1 and 2, respectively. Let G denote a green ball and R denote a red ball. Applying the total probability formula, we have:

P(G) = P(G | B1) P(B1) + P(G | B2) P(B2) = (3/4 × 1/2) + (1/2 × 1/2) = 5/8.
(b) Applying Bayes’ formula, we have:

P(B1 | G) = P(G | B1) P(B1)/P(G) = (3/4 × 1/2)/(5/8) = 3/5 = 0.6.

B.4
Chapter 6 – The normal distribution and ideas of sampling

1. (a) We must have ∑_i p(x_i) = 0.20 + 5k = 1, hence k = 0.16. Therefore, the
probability distribution is:
x        −1     1      2
p_X(x)   0.20   0.16   0.64
(b) We have:

E(X) = ∑_i x_i p(x_i) = (−1) × 0.20 + 1 × 0.16 + 2 × 0.64 = 1.24.
(c) We have:

E(X²) = ∑_i x_i² p(x_i) = (−1)² × 0.20 + 1² × 0.16 + 2² × 0.64 = 2.92

hence:

Var(X) = E(X²) − (E(X))² = 2.92 − (1.24)² = 1.3824.

An alternative method to find the variance is through the formula ∑_i (x_i − µ)² p(x_i), where µ = E(X) was found in part (b).
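Both the expectation and the variance can be verified directly from the completed distribution in part (a); a short Python sketch, included only as a numerical check:

```python
# Verify E(X) and Var(X) for the completed distribution from part (a):
# P(X = -1) = 0.20, P(X = 1) = 0.16, P(X = 2) = 0.64.
xs = [-1, 1, 2]
ps = [0.20, 0.16, 0.64]

ex = sum(x * p for x, p in zip(xs, ps))        # E(X)
ex2 = sum(x * x * p for x, p in zip(xs, ps))   # E(X^2)
var = ex2 - ex**2                              # Var(X) = E(X^2) - (E(X))^2
print(round(ex, 2), round(ex2, 2), round(var, 4))   # 1.24 2.92 1.3824
```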
2. The first part just requires knowledge of the fact that X is a normal random
variable with mean µ = 100 and variance σ 2 = 100. However, for the second part of
the question it is important to note that X̄, the sample mean, is also a normal
random variable with mean µ and variance σ 2 /n. Direct application of this fact
then yields that:
X̄ ∼ N(µ, σ²/n) = N(100, 5).
For both parts, the basic property of normal random variables for this question is
that if X ∼ N(µ, σ²), then:

Z = (X − µ)/σ ∼ N(0, 1).
Note also that:
• P (Z < a) = P (Z ≤ a) = Φ(a)
• P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
• P (a < Z < b) = P (a ≤ Z < b) = P (a < Z ≤ b) = P (a ≤ Z ≤ b) = Φ(b) − Φ(a).
The above is all you need to find the requested probabilities.
(a) We have X ∼ N(100, 100), hence:

P(X ≥ 105) = P(Z ≥ (105 − 100)/10) = P(Z ≥ 0.5) = 1 − Φ(0.5) = 1 − 0.6915 = 0.3085.
(b) We have X̄ ∼ N(100, 5), hence:

P(97 ≤ X̄ ≤ 104) = P((97 − 100)/√5 ≤ Z ≤ (104 − 100)/√5)
= P(−1.34 ≤ Z ≤ 1.79)
= Φ(1.79) − (1 − Φ(1.34))
= 0.9633 − (1 − 0.9099)
= 0.8732.
3. (a) Since X̄ ∼ N(7.3, (1.9)²/40), we have:

P(7.0 ≤ X̄ ≤ 7.4) = P((7.0 − 7.3)/(1.9/√40) ≤ Z ≤ (7.4 − 7.3)/(1.9/√40))
= P(−1.00 ≤ Z ≤ 0.33)
= Φ(0.33) − (1 − Φ(1.00))
= 0.6293 − (1 − 0.8413)
= 0.4706

using Table 4 of the New Cambridge Statistical Tables.
(b) The researcher’s statement is justified since we can apply the central limit
theorem given the large sample size n.
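Both interval probabilities can be checked with Python's standard library, which provides Φ via statistics.NormalDist; a sketch using the rounded z-values from the solutions (small discrepancies with the tables are rounding only):

```python
from statistics import NormalDist

# Phi, the standard normal cdf, via the standard library.
phi = NormalDist().cdf

# Question 2(b): P(97 <= X-bar <= 104) with rounded z-values -1.34 and 1.79.
p1 = phi(1.79) - phi(-1.34)
# Question 3(a): P(7.0 <= X-bar <= 7.4) with rounded z-values -1.00 and 0.33.
p2 = phi(0.33) - phi(-1.00)
print(p1, p2)
```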
B.5
Chapter 7 – Point and interval estimation
1. (a) The sample proportion, p, must be in the centre of the interval (0.3676, 0.5324). Adding the two endpoints and dividing by 2 gives:

p = (0.3676 + 0.5324)/2 = 0.45.
(b) The (estimated) standard error when estimating a single proportion is:

√(p(1 − p)/n) = √(0.45 × 0.55)/√n = 0.4975/√n.
Since this is a 100(1 − α)% = 99% confidence interval, then α = 0.01, so the confidence coefficient is z_{α/2} = z_{0.005} = 2.576. Therefore, to determine n we need to solve:

2.576 × 0.4975/√n = 0.5324 − 0.45 = 0.0824.

The correct sample size is n = 242.
Note: In questions regarding sample size determination remember to round up
when the solution is not an integer.
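The sample size calculation is easy to verify; a Python sketch, with the final ceiling matching the ‘round up’ note:

```python
import math

# Solve 2.576 * 0.4975 / sqrt(n) = 0.0824 for n, rounding up to an integer.
z, se_num, half_width = 2.576, 0.4975, 0.0824

n = math.ceil((z * se_num / half_width) ** 2)
print(n)   # 242
```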
2. (a) Let p1 , n1 refer to the proportion of large retailers using regression and to the
total number of large retailers, respectively. Similarly, denote by p2 and n2 the
corresponding quantities for small retailers. We have p1 = 85/120 = 0.7083,
n1 = 120, p2 = 78/163 = 0.4785 and n2 = 163.
The estimated standard error is:

E.S.E.(p1 − p2) = √(p1(1 − p1)/n1 + p2(1 − p2)/n2) = 0.0570.

The correct z-value is z_{0.01} = 2.326, leading to the lower and upper bounds of 0.0971 and 0.3625, respectively. Presented as an interval this is (0.0971, 0.3625).
(b) We are 98% confident that the difference between the two population
proportions is between 0.0971 and 0.3625. The interval excludes zero,
suggesting there is a difference between the true proportions at the 2%
significance level.
(c) For a 94% confidence interval, the correct confidence coefficient is z0.03 = 1.88.
The sample proportions and standard error are unaffected, hence the new
interval is (0.1226, 0.3370).
3. (a) The confidence interval formula is:

x̄ ± t_{α/2, n−1} × s/√n.

The degrees of freedom are 21 − 1 = 20, and the correct t value from tables is t_{0.01, 20} = 2.528. The computed confidence interval is:

165 ± 2.528 × 9/√21  ⇒  (160.04, 169.96).
(b) We seek n such that 2.528 × 9/√n ≤ 1.5 (for a confidence interval of width 3). Solve for n = 231 (remembering to round up to the nearest integer).
(c) The confidence interval formula remains as:

x̄ ± t_{α/2, n−1} × s/√n.

The correct t value is t_{0.005, 14} = 2.977. The computed interval is:

160 ± 2.977 × 9/√15  ⇒  (153.08, 166.92).
Note the confidence intervals are not directly comparable since the one in part
(a) is for 98% confidence, while the one in part (c) is for 99% confidence.
Other things equal, a 99% confidence interval is wider than a 98% one. Also,
the sample sizes are different and, other things equal, a smaller sample size
leads to a wider confidence interval. Although there is some overlap of the
computed confidence intervals (suggesting that the mean heights are
potentially the same), a formal hypothesis test should be performed.
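The intervals in parts (a) and (c) can be reproduced numerically; a Python sketch using the t values taken from tables:

```python
import math

# Part (a): 98% interval, n = 21, x-bar = 165, s = 9, t_{0.01,20} = 2.528.
half_a = 2.528 * 9 / math.sqrt(21)
lo_a, hi_a = 165 - half_a, 165 + half_a

# Part (c): 99% interval, n = 15, x-bar = 160, s = 9, t_{0.005,14} = 2.977.
half_c = 2.977 * 9 / math.sqrt(15)
lo_c, hi_c = 160 - half_c, 160 + half_c

print(round(lo_a, 2), round(hi_a, 2))   # 160.04 169.96
print(round(lo_c, 2), round(hi_c, 2))   # 153.08 166.92
```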
B.6
Chapter 8 – Hypothesis testing
1. (a) Let µ1 denote the mean examination mark for the intensive tutoring group and
µ2 denote the mean examination mark for the standard tutoring group. The
wording ‘whether there is a difference between the mean examination marks’
implies a two-sided test, hence the hypotheses can be written as:
H0: µ1 = µ2  vs.  H1: µ1 ≠ µ2.
The test statistic formulae, depending on whether or not a pooled variance is used, are provided on the formula sheet:

(x̄1 − x̄2)/√(s_p²(1/n1 + 1/n2))  or  (x̄1 − x̄2)/√(s1²/n1 + s2²/n2).
If equal variances are assumed, the test statistic value is 2.1449. If equal
variances are not assumed, the test statistic value is 2.1164.
The variances are unknown but the sample size is large enough, so the
standard normal distribution can be used due to the central limit theorem.
The t40 distribution is also correct and will be used in what follows.
The critical values at the 5% significance level are ±2.021, hence we reject the
null hypothesis. If we take a (smaller) α of 1%, the critical values are ±2.704,
so we do not reject H0 . We conclude that there is moderate evidence of a
difference between the two tutoring groups.
(b) The assumptions for part (a) relate to the following.
• Assumption about whether variances are equal, i.e. whether σ1² = σ2² or σ1² ≠ σ2².
• Assumption about whether n1 + n2 is ‘large’ so that the normality
assumption is satisfied via the central limit theorem.
• Assumption about the samples being independent.
• Assumption about whether a normal or t distribution is used.
2. (a) Let π_T denote the true probability for the new treatment to work. We can use the following test.

Test H0: π_T = 0.5 vs. H1: π_T > 0.5. Under H0, the test statistic is:

(P − 0.5)/√(0.5 × (1 − 0.5)/n) ∼ N(0, 1)

approximately due to the central limit theorem, since here n = 30 is (just about) large enough. Hence the test statistic value is:

(20/30 − 0.5)/0.0913 = 1.83.

For α = 0.05, the critical value is 1.645. Since 1.645 < 1.83, we reject H0 at the 5% significance level. There is moderate evidence that the treatment is effective.
(b) Let π_P denote the true probability for the patient to recover with the placebo. Now we test H0: π_T = π_P vs. H1: π_T > π_P.

For reference, the test statistic (in terms of the sample proportions P_T and P_P) is:

(P_T − P_P)/√(P(1 − P)(1/n1 + 1/n2)) ∼ N(0, 1)

approximately, due to the central limit theorem.

The standard error is:

S.E.(P_T − P_P) = √((41/70) × (29/70) × (1/40 + 1/30)) = 0.119

and so the test statistic value is:

(20/30 − 21/40)/0.119 = 1.191.

For α = 0.05, the critical value is 1.645. Since 1.191 < 1.645, we do not reject H0 at the 5% significance level. There is no evidence of higher effectiveness than the placebo effect.
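The test in part (b) can be reproduced step by step; a Python sketch (20 of 30 recoveries on treatment, 21 of 40 on placebo, as in the solution):

```python
import math

# Two-sample test of proportions with a pooled proportion estimator.
r1, n1 = 20, 30   # treatment: recoveries, sample size
r2, n2 = 21, 40   # placebo: recoveries, sample size

p_pool = (r1 + r2) / (n1 + n2)                           # 41/70
se = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))    # standard error
z = (r1/n1 - r2/n2) / se                                 # test statistic
print(round(se, 3), round(z, 3))   # 0.119 1.191
```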
3. (a) The test statistic value is:

x̄_d/(s_d/√n) = 1.195/(10.2/√121) = 1.289.
(b) In this case one can use the tn−1 = t120 distribution, since the standard
deviation is unknown and estimated by sd . Nevertheless the assumption of a
standard normal distribution is also justified by the central limit theorem
given the large sample size n.
Combining the above, noting that this is a two-tailed test we get that the
p-value is:
2 × P (T ≥ 1.289) = 2 × 0.10 = 0.20
where T ∼ t120 , using Table 10 of the New Cambridge Statistical Tables. Since
the p-value is 0.20 > 0.05, the test is not significant at the 5% significance level.
B.7
Chapter 9 – Contingency tables and the
chi-squared test
1. (a) False. Chi-squared values cannot be negative.
(b) False. The expected frequencies are determined under the null hypothesis of no
association, hence similar observed and expected frequencies provide little to
no evidence of an association.
2. (a) There are some differences between younger and older people regarding watch
preference. More specifically, 16% of younger people prefer an analogue watch
compared to 48% for people over 30. Hence there seems to be an association
between age and watch preference, although this needs to be investigated
further.
(b) Set out the null hypothesis that there is no association between age and watch
preference against the alternative, that there is an association. Be careful to
get these the correct way round!
H0 : No association between age and watch preference.
H1 : Association between age and watch preference.
Work out the expected values to obtain the table below:
                          Analogue watch   Undecided   Digital watch
30 years old or younger        20.34          19.35         24.31
Over 30 years old              20.66          19.65         24.69
The test statistic formula is:

∑_{i=1}^{r} ∑_{j=1}^{c} (O_ij − E_ij)²/E_ij
which gives a test statistic value of 24.146. This is a 2 × 3 contingency table so
the degrees of freedom are (2 − 1) × (3 − 1) = 2.
For α = 0.05, the critical value is 5.991, hence reject H0 since 5.991 < 24.146.
For α = 0.01, the critical value is 9.210, hence reject H0 since 9.210 < 24.146.
We conclude that there is strong evidence of an association between age and
watch preference.
3. We perform a goodness-of-fit test with:

H0: There is a uniform distribution of live births throughout the year.
H1: There is a non-uniform distribution of live births throughout the year.

How do we calculate the expected values? You should not expect equal numbers of births in each season unless they had exactly the same number of days. The clue lies in the fact that you are given the number of days per season, so logically you would expect the number of births to be (out of the 730 given):
Season   Expected live births
Spring   730 × 93/365 = 186
Summer   730 × 80/365 = 160
Autumn   730 × 100/365 = 200
Winter   730 × 92/365 = 184
The test statistic is:

χ² = ∑_{i=1}^{4} (O_i − E_i)²/E_i

giving a test statistic value of:

χ² = (100 − 186)²/186 + (200 − 160)²/160 + (250 − 200)²/200 + (180 − 184)²/184
= 39.76 + 10.00 + 12.50 + 0.09
= 62.35.
The convention is to take the number of categories minus 1 so, as there are four
seasons given, we have 3 degrees of freedom. Looking at Table 8 of the New
Cambridge Statistical Tables, we see that for 3 degrees of freedom this is highly
significant (the 5% critical value is 7.815 and the 1% critical value is 11.34), so we
reject H0 and conclude that there is very strong evidence that births vary over the
year.
If you look again at the observed and expected values you see:
Season   Observed   Expected
Spring      100        186
Summer      200        160
Autumn      250        200
Winter      180        184
Only observed winter births are anywhere near the expected value – both summer
and autumn show higher births than expected, while spring is much lower. You
might like to test whether the proportion of births in spring is higher than in other
seasons. You will get a highly significant result!
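The goodness-of-fit statistic can be checked in a few lines; a Python sketch:

```python
# Chi-squared goodness-of-fit for live births by season.
observed = [100, 200, 250, 180]          # Spring, Summer, Autumn, Winter
days = [93, 80, 100, 92]                 # days per season (365 in total)

expected = [730 * d / 365 for d in days]             # [186, 160, 200, 184]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))   # 62.35
```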
B.8
Chapter 10 – Sampling design
1. Note that for these types of questions there is no single correct answer. Below are
some suggested ‘good’ answers for each part.
(a)
– Possible advantages: useful in the absence of a sampling frame; speed; cost.
– Possible disadvantages: standard errors not measurable; systematically
biased due to interviewer; no guarantee of representativeness.
(b)
– Possible advantages: easy to achieve a large number of interviews; easy to
check on quality of interviewers.
– Possible disadvantages: not everyone has a telephone so the sample can be
biased; cannot usually show samples; although telephone directories exist
for landline numbers, what about mobile telephone numbers? Also, young
people are more likely to use mobile telephones rather than landline
telephones, so are more likely to be excluded.
2. One of the main things to avoid in this part is to write ‘essays’ without any
structure. This question asks for specific things and each one of them requires only
one or two lines of response. If you are unsure of what these things are, do not
write lengthy answers. This is a waste of your valuable examination time. If you
can identify what is being asked, keep in mind that the answer should not be long.
Note also that in some cases there is no single right answer to the question. Some
suggested answers are given below.
(a) We are asked for accuracy and random (probability) sampling, so a reasonable
option is the use of stratified random sampling which is known to produce
results of high accuracy. An example of a sampling scheme could be ‘a
stratified sample of those customers who bought this laptop recently’.
(b) The question requires:
◦ a description of a sampling frame
◦ a justification of its choice
◦ mentioning a (sensible) contact method
◦ stating an advantage of the contact method mentioned above.
A suggested answer is given below.
Use a list provided by retailers to identify those who bought this laptop model
recently. The list could include the postal address, telephone or email.
Stratification can be made by area of country or by gender of buyer. Finally,
an explanation as to which you would prefer – for example, email is fast if all
have it but there may be a lot of non-response.
(c) The question requires an example of selection bias and an action suggested to
address this issue.
For example, retailers’ records may be incomplete. Offer incentives to make
sure they keep accurate records.
(d) A suggested answer for the question is ‘How does preference for the laptop
model compare for men and women?’. In terms of variables one could mention
‘gender’ and ‘buying preference’.
3. (a) An indicative answer here would be to use an email list. A limitation with this
choice is that this list may not contain all the employees.
(b) Examples of stratification factors are income level, gender, age group etc. In
order for stratified sampling to be effective, within strata the members should
be homogeneous.
(c) Example actions are incentives and face-to-face interviews. Note here that
response bias occurs when the respondents give consistently false answers, such
as claiming a younger age.
(d) Examples here are appropriate graphs (boxplots, histograms etc.), confidence
intervals and hypothesis tests of job satisfaction measurement across different
job types.
B.9
Chapter 11 – Some ideas underlying causation –
the use of control group and time order
1. Longitudinal surveys have the following strengths and weaknesses:
• Strengths: Can detect change over time, and are likely to get more accurate
results than with a single cross-sectional survey.
• Weaknesses: Due to attrition there may be a lack of representativeness over
time, and we may experience response bias.
To study people’s use of health services over a 20-year period, we would begin in
year 1 by selecting the study participants. These should be representative of the
population in terms of several characteristics such as gender, age groups, general
well-being, diet types, different fitness levels etc. To allow for likely attrition over
20 years, a large number of participants would be required. Appropriate incentives
should be provided to induce people to remain in the study over time.
The participants would be interviewed face-to-face on an annual basis or perhaps
every four or five years, depending on the budget resources available. In order to
detect change, data on the same variables would be needed at each interview stage.
Examples of questions to include might be the following.
• ‘How many times have you visited a doctor in the past 12 months?’
• ‘How long did you have to wait from making an appointment to actually
seeing a doctor?’
• ‘Overall, how satisfied are you with the level of health service provision in your
area?’
2. Panel surveys are generally different from longitudinal surveys in the following
ways:
• they are more likely to be chosen by quota rather than random methods
• individuals are interviewed every two to four weeks (rather than every few
years)
• individuals are unlikely to be panel members for longer than two years at a
time.
B.10
Chapter 12 – Correlation and linear regression
1. (a) We have:
Examination mark           50    80    70    40    30    75    95
Project mark               75    60    55    40    50    80    65
Rank of examination mark    5     2     4     6     7     3     1
Rank of project mark        2     4     5     7     6     1     3
Difference in ranks, d_i    3    −2    −1    −1     1     2    −2
d_i²                        9     4     1     1     1     4     4
The Spearman rank correlation is:

r_s = 1 − 6 ∑_{i=1}^{n} d_i²/(n(n² − 1)) = 1 − (6 × 24)/(7 × 48) = 0.5714.
(b) There is some correlation, but it is not strong. It looks as if there is a positive
connection between examination marks and project marks.
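The ranking and the Spearman formula are easy to automate; a Python sketch that recomputes the ranks (1 = highest mark; there are no ties in these data) and r_s:

```python
# Spearman rank correlation for the seven examination/project mark pairs.
exam = [50, 80, 70, 40, 30, 75, 95]
proj = [75, 60, 55, 40, 50, 80, 65]

def ranks(values):
    # Rank 1 for the highest value (valid here as there are no ties).
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

d = [a - b for a, b in zip(ranks(exam), ranks(proj))]
n = len(exam)
rs = 1 - 6 * sum(di ** 2 for di in d) / (n * (n ** 2 - 1))
print(sum(di ** 2 for di in d), round(rs, 4))   # 24 0.5714
```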
2. (a) We have:
[Scatter diagram ‘Stolen merchandise vs number of workers’: value of merchandise in $s lost to shoplifters (roughly 200 to 400) plotted against the number of workers on duty (10 to 18).]
(b) The summary statistics can be substituted into the formula for the sample correlation coefficient (make sure you know which one it is!) to obtain the value r = −0.9688. An interpretation of this value is the following – the data suggest that the higher the number of workers, the lower the loss from shoplifters. The fact that the value is very close to −1 suggests that this is a strong, negative, linear association.
(c) The regression line can be written by the equation ŷ = a + bx or y = a + bx + ε. The formula for b is:

b = (∑ x_i y_i − n x̄ ȳ)/(∑ x_i² − n x̄²)

and by substituting the summary statistics we get b = −26.64.

The formula for a is a = ȳ − b x̄, and we get a = 655.36.

Hence the regression line can be written as:

ŷ = 655.36 − 26.64x  or  y = 655.36 − 26.64x + ε

which is drawn on the scatter diagram above.
(d) In this case one can note in the scatter diagram that the points seem to be
‘scattered’ around a straight line. Hence a linear regression model does seem to
be a good model here. According to the model, the expected loss from
shoplifting for 17 workers on duty is:
655.36 − 26.64 × 17 ≈ $202.48.
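Given the fitted coefficients, the prediction is a one-liner; a Python sketch (the summary statistics themselves appear in the question and are not repeated here):

```python
# Predicted loss from shoplifting with x = 17 workers on duty,
# using the fitted line y-hat = 655.36 - 26.64x from part (c).
a, b = 655.36, -26.64
y_hat = a + b * 17
print(round(y_hat, 2))   # 202.48
```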
Appendix C
Examination formula sheet
Expected value of a discrete random variable:

µ = E(X) = ∑_{i=1}^{N} p_i x_i

Standard deviation of a discrete random variable:

σ = √σ² = √(∑_{i=1}^{N} p_i (x_i − µ)²)

The transformation formula:

Z = (X − µ)/σ

Finding Z for the sampling distribution of the sample mean:

Z = (X̄ − µ)/(σ/√n)

Finding Z for the sampling distribution of the sample proportion:

Z = (P − π)/√(π(1 − π)/n)

Confidence interval endpoints for a single mean (σ known):

x̄ ± z_{α/2} × σ/√n

Confidence interval endpoints for a single mean (σ unknown):

x̄ ± t_{α/2, n−1} × s/√n

Confidence interval endpoints for a single proportion:

p ± z_{α/2} × √(p(1 − p)/n)

Sample size determination for a mean:

n ≥ (z_{α/2})² σ²/e²

Sample size determination for a proportion:

n ≥ (z_{α/2})² p(1 − p)/e²

z test of hypothesis for a single mean (σ known):

Z = (X̄ − µ0)/(σ/√n)

t test of hypothesis for a single mean (σ unknown):

T = (X̄ − µ0)/(S/√n)

z test of hypothesis for a single proportion:

Z ≈ (P − π0)/√(π0(1 − π0)/n)

z test for the difference between two means (variances known):

Z = (X̄1 − X̄2 − (µ1 − µ2))/√(σ1²/n1 + σ2²/n2)

t test for the difference between two means (variances unknown):

T = (X̄1 − X̄2 − (µ1 − µ2))/√(S_p²(1/n1 + 1/n2))

Pooled variance estimator:

S_p² = ((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2)

Confidence interval endpoints for the difference between two means:

x̄1 − x̄2 ± t_{α/2, n1+n2−2} × √(s_p²(1/n1 + 1/n2))

t test for the difference in means in paired samples:

T = (X̄_d − µ_d)/(S_d/√n)

Confidence interval endpoints for the difference in means in paired samples:

x̄_d ± t_{α/2, n−1} × s_d/√n

z test for the difference between two proportions:

Z = (P1 − P2 − (π1 − π2))/√(P(1 − P)(1/n1 + 1/n2))

Pooled proportion estimator:

P = (R1 + R2)/(n1 + n2)

Confidence interval endpoints for the difference between two proportions:

p1 − p2 ± z_{α/2} × √(p1(1 − p1)/n1 + p2(1 − p2)/n2)

χ² statistic for test of association:

∑_{i=1}^{r} ∑_{j=1}^{c} (O_ij − E_ij)²/E_ij

Sample correlation coefficient:

r = (∑_{i=1}^{n} x_i y_i − n x̄ ȳ)/√((∑_{i=1}^{n} x_i² − n x̄²)(∑_{i=1}^{n} y_i² − n ȳ²))

Spearman rank correlation:

r_s = 1 − 6 ∑_{i=1}^{n} d_i²/(n(n² − 1))

Simple linear regression line estimates:

b = (∑_{i=1}^{n} x_i y_i − n x̄ ȳ)/(∑_{i=1}^{n} x_i² − n x̄²)

a = ȳ − b x̄