STAT354: Distribution Theory (Part I) and Statistical Inference (Part II)
School of Mathematics, Statistics and Computer Science
Printed at the University of New England, December 6, 2005
Contents

0.1 Details of the unit
0.2 PLAGIARISM

I Distribution Theory

1 Prerequisites
   1.1 Probability concepts assumed known
   1.2 Assumed knowledge of matrices and vector spaces
       1.2.1 Worked Example

2 Preliminaries
   2.1 Introduction
   2.2 Indicator Functions
   2.3 Distribution Functions (cdf's)
   2.4 Bivariate and Conditional Distributions
       2.4.1 Conditional Mean and Variance
   2.5 Stochastic Independence
   2.6 Moment Generating Functions (mgf)
       2.6.1 Multivariate mgfs
   2.7 Multinomial Distribution

3 Transformations
   3.1 Introduction
       3.1.1 The Probability Integral Transform
   3.2 Bivariate Transformations
   3.3 Multivariate Transformations (One-to-One)
   3.4 Multivariate Transformations Not One-to-One
   3.5 Convolutions
   3.6 General Linear Transformation
   3.7 Worked Examples: Bivariate MGFs

4 Multivariate Normal Distribution
   4.1 Bivariate Normal
   4.2 Multivariate Normal (MVN) Distribution
   4.3 Moment Generating Function
   4.4 Independence of Quadratic Forms
   4.5 Distribution of Quadratic Forms
   4.6 Cochran's Theorem

5 Order Statistics
   5.1 Introduction
   5.2 Distribution of Order Statistics
   5.3 Marginal Density Functions
   5.4 Joint Distribution of Yr and Ys
   5.5 The Transformation F(X)
   5.6 Examples
   5.7 Worked Examples: Order statistics

6 Non-central Distributions
   6.1 Introduction
   6.2 Distribution Theory of the Non-Central Chi-Square
   6.3 Non-Central t and F-distributions
   6.4 POWER: an example of use of non-central t
       6.4.1 Introduction
       6.4.2 Power calculations
   6.5 POWER: an example of use of non-central F
       6.5.1 Analysis of variance
   6.6 R commands

II Statistical Inference

7 Reduction of Data
   7.1 Types of inference
   7.2 Frequentist inference
   7.3 Sufficient Statistics
   7.4 Factorization Criterion
   7.5 The Exponential Family of Distributions
   7.6 Likelihood
   7.7 Information in a Sample

8 Estimation
   8.1 Some Properties of Estimators
   8.2 Cramér–Rao Lower Bound
   8.3 Properties of Maximum Likelihood Estimates
   8.4 Interval Estimates

9 Hypothesis Testing
   9.1 Basic Concepts and Notation
       9.1.1 Introduction
       9.1.2 Power Function and Significance Level
       9.1.3 Relation between Hypothesis Testing and Confidence Intervals
   9.2 Evaluation of and Construction of Tests
       9.2.1 Unbiased and Consistent Tests
       9.2.2 Certain Best Tests
       9.2.3 Neyman Pearson Theorem
       9.2.4 Uniformly Most Powerful (UMP) Test
   9.3 Likelihood Ratio Tests
       9.3.1 Background
       9.3.2 The Likelihood Ratio Test Procedure
       9.3.3 Some Examples
       9.3.4 Asymptotic Distribution of −2 log Λ
   9.4 Worked Example
       9.4.1 Example

10 Bayesian Inference
   10.1 Introduction
       10.1.1 Overview
   10.2 Basic Concepts
       10.2.1 Bayes Theorem
   10.3 Bayesian Inference
   10.4 Normal data
       10.4.1 Note
   10.5 Normal data – several observations
   10.6 Highest density regions
       10.6.1 Comparison of HDR with CI
   10.7 Choice of Prior
       10.7.1 Improper priors
       10.7.2 Reference Priors
       10.7.3 Locally uniform priors
   10.8 Data–translated likelihoods
   10.9 Sufficiency
   10.10 Conjugate priors
   10.11 Exponential family
   10.12 Reference prior for the binomial
       10.12.1 Bayes
       10.12.2 Haldane
       10.12.3 Arc–sine
       10.12.4 Conclusion
   10.13 Jeffrey's Rule
       10.13.1 Fisher information
       10.13.2 Jeffrey's prior
   10.14 Approximations based on the likelihood
       10.14.1 Example
   10.15 Reference posterior distributions
       10.15.1 Information provided by an experiment
       10.15.2 Reference priors under asymptotic normality
       10.15.3 Reference priors under asymptotic normality
   10.16 References
0.1 Details of the unit
Coordinator
Lecturer: Dr. Bernard A. Ellem
Address: School of Mathematics, Statistics and Computer Science
University of New England, Armidale, NSW, 2351.
Telephone: 02 6773 2284
Fax: 02 6773 3312
Email : [email protected]
Objectives
Although many of the topics considered in STAT354, Distribution Theory and Statistical
Inference, sound the same as those in the second year unit, a more theoretical approach
is used, and a deeper level of understanding is required. The unit concentrates on the
fundamental aspects of statistics, rather than on the particular methods in use in various disciplines. To gain most benefit from studying this unit, you should read more widely in the
prescribed text and other texts than the minimum indicated in the Study Guide.
The subject of statistics is concerned with summarizing the data to understand the
evidence contained in the data. The topics in this unit are foundations for statistical
strategies of interpreting data.
There are two (or more) schools of statistical thought. The traditional parametric
statistics relies upon the assumption that the data can be approximated by a parametric
distribution. In this way the essential information is contained in a small set of parameters.
The other way is to make very few assumptions and let the data themselves decide the
distribution. Whilst this may seem appealing, we could have difficulty in reducing the data
to a form we can interpret. Both schools have their advantages, disadvantages and adherents
but most statisticians use whatever will do the job; often a blend of both.
This unit is predominantly about parametric statistics with one section on non-parametric
statistics, Order Statistics. Bayesian inference is introduced in the final chapter.
At times, distribution theory and inference seem remote from the practical problems
of analysing data but progress in statistics necessitates a firm foundation in inference. As
statisticians, you will be challenged with the non-standard problems where your insight
into the problems may be the most important factor in success. The insight is honed by
studying the theoretical bases of statistics.
Content
There are two sections to Statistics 354, Distribution Theory and Statistical Inference and
they will be covered in that order and completed in First Semester.
Textbook
The text that will be used for both sections of the course is
Casella G. and Berger R.L., (2002), Statistical Inference, 2nd ed., Duxbury.
An additional reading is
Hogg, R.V. and Craig, A.T. Introduction to Mathematical Statistics, 5th edition, 1995, Macmillan.
There is now a 6th edition, (2005), by Hogg R.V., McKean J.W. and Craig
A.T., Pearson Prentice Hall.
Throughout the Study Guides, the text will be referred to as CB, while the additional
reading will be referred to as HC.
The text for the last chapter is :
Lee P.M., (2004), Bayesian Statistics, 3rd. ed., Arnold, London
Timetable
The two sections of the course are about equal in length and you should plan to finish
Distribution Theory about half–way through the Semester. Be sure to plan your work to
leave enough time for revision of both sections before the mid–year examination period
begins.
Copies of examination papers for 1999-2005 (inclusive) are held at the UNE library web
page
http://www.une.edu.au/library/
It should be noted, however, that units change from year to year so these examination
papers are meant only as a guide. You cannot expect that previous questions or slight
modifications of them, will necessarily reappear in later examination papers.
Assessment
Assignments will count 30% towards the total assessment. The remainder will be by a two–
hour examination at the end of First Semester. To obtain a pass, students are expected to
perform at least close to pass standard in each of the two sections.
Assignments are to be of a high standard; remember this is a BSc. It seems redundant
to say that legibility, writing in ink, neat setting out etc. are required but a reminder is
not wasted.
The combination of assignments and examination should allow diligent students to
comfortably pass the course.
Acknowledgements
These notes were written by Dr Gwenda Lewis at the Statistics department of U.N.E.
Bob Murison made revisions during 1997-99.
Bernard Ellem made further revisions in 2003-2005.
0.2 PLAGIARISM
Students are warned to read the statement in the Faculty’s Undergraduate and Postgraduate Handbooks for 2006 regarding the University’s Policy on Plagiarism. Full details of the
Policy on Plagiarism are available in the 2006 UNE Handbook and at the following web site:
http://www.une.edu.au/offsect/policies.htm
In addition, you must complete the Plagiarism Declaration Form for all assignments,
practical reports, etc. submitted in this unit.
0.3 Assignments
Both sections of the unit have assignments. These are very closely linked to the material
in the Study Guide and each will indicate which chapter they depend on. Don’t wait until
you feel you fully understand a whole chapter before beginning an assignment. Later parts
of chapters are often easier to understand, after you have had some practice at problems.
Submission of all assignments is compulsory, and a reasonable attempt should be made
at every question. The despatch dates are listed for both Distribution Theory and Statistical Inference, and you should make every effort to submit assignments on time. If you
anticipate a delay of more than a couple of days, contact the unit coordinator. Every effort
will be made to mark assignments promptly and return them to you.
2006 ASSIGNMENT SUBMISSION DATES
Assignment   Date (to reach the University by)   Topic
1            3rd March                           Distribution theory and Matrix theory
2            17th March                          Transformations and multivariate normal
3            31st March                          Order statistics
4            21st April                          Sufficiency, likelihood ratio
5            12th May                            CRLB
6            26th May                            Generalized likelihood ratio
7            9th June                            Bayesian Inference
Each assignment is worth 20 marks.
All questions are of equal value.
Assignment 1
[This assignment covers the work in Distribution Theory, Chapters 1 and 2.]
1. Assuming that the conditional pdf of Y given X=x is
f_{Y|X=x}(y) = 2y/x^2 if 0 < y < x < 1 (and 0 otherwise),
and the pdf of X is
f_X(x) = 4x^3, 0 < x < 1,
find
(a) the joint pdf of X and Y ;
(b) the marginal pdf of Y ;
(c) P (X > 2Y ).
2. Given random variables X and Y where the pdf of X is given by
f_X(x) = e^{-x}, x > 0,
and Y is discrete. The conditional distribution of Y given X=x is given by
f_{Y|X=x}(y) = e^{-x} x^y / y!,   y = 0, 1, 2, . . . .
Show that the marginal distribution of Y is given by
f_Y(y) = (1/2)^{y+1},   y = 0, 1, 2, . . . .
3. If matrix A is idempotent and A + B = I, show that B is idempotent and AB =
BA = 0.
4. Given the matrix A below, find the eigenvalues and eigenvectors. Hence find the matrix P such that P′AP is diagonal and has as its diagonal elements the eigenvalues of A.

       (  1    2    1 )
   A = (  2    1   −1 )
       (  1   −1   −2 )
Assignment 2
[This assignment depends on Distribution Theory, Chapters 3 and 4.]
1. Let X denote the proportion of a bulk item stocked by a supplier at the beginning of
the day, and let Y denote the proportion of that item sold during the day. Suppose
X and Y have a joint df
f (x, y) = 2, 0 < y < x < 1
Of interest to the supplier is the random variable U = X − Y , which denotes the
proportion left at the end of the day.
(a) Find the df of U .
(b) Give E(U ).
(c) Interpret your results.
2. Let X ∼ P (λx ) and Y ∼ P (λy ), where X and Y are independent.
(a) Use the change of variable technique to show that
X + Y ∼ P (λx + λy )
(b) Verify your result using MGFs.
3. If X ∼ N_p(µ, Σ), show that Y = CX is distributed N(Cµ, CΣC′) where C is a p × p
non–singular matrix.
4. X1 , . . . , Xp are normal random variables which have zero means and covariance
matrix Σ. Show that a necessary and sufficient condition for the independence of
the quadratic forms X′BX and X′CX is BΣC = 0.
Assignment 3
[This assignment covers the work in Distribution Theory, Chapters 5 and 6.]
1. Let X1 , X2 , X3 be a random sample from a distribution with pdf
f (x) = 2x, 0 ≤ x ≤ 1.
Find
(a) the pdf of the smallest of these, Y1 ;
(b) the probability that Y1 exceeds the median of the distribution.
2. X1 , X2 , . . . , Xn is a random sample from a continuous distribution with pdf f(x).
An additional observation, Xn+1 , say, is taken from this distribution. Find the probability that Xn+1 exceeds the largest of the other n observations.
3. The median is calculated for a sample size of n, where n is an odd integer.
(a) Give the distribution function for the sample median.
(b) If the population from which the sample is drawn is N(µ, σ^2), prove that the pdf of
the sample median M is symmetric about the vertical axis centered at µ.
(c) Deduce from (b) that
E(M) = µ
4. (a) Determine the probability distribution function for the largest observation in a
random sample from the uniform distribution.
(b) A convoy of 10 trucks is to pass through a town with a low level underpass of
height 3.8m. If the heights of the loads on each truck are uniformly distributed
between cabin top height (3.0m) and the legal upper limit (4.0m), what is the
probability that at least one truck will have to turn back? Provide an alternative
explanation of your answer to convince a sceptic of the veracity of your solution.
Assignment 4
[This assignment covers work in Statistical Inference, Chapter 7.]
1. X1 , X2 , . . . Xn is a random sample from the Bernoulli distribution with parameter π.
Use Definition 7.2 to show that T = Σ_{i=1}^{n} X_i is sufficient for π.
2. Use the factorization criterion [Theorem 7.1] to determine in each of the cases below,
a sufficient one–dimensional statistic based on a random sample of size n from
(a) a binomial distribution, f(x; θ) = θ^x (1 − θ)^{1−x}, x = 0, 1.
(b) a geometric distribution, f(x; θ) = θ(1 − θ)^{x−1}, x = 1, 2, . . ..
(c) a N (θ, 1).
(d) the Rayleigh distribution, f(x; θ) = (x/θ) e^{−x^2/(2θ)}, x > 0.
3. The following distributions, among others, belong to the exponential family defined
in (7.5), Definition 7.4.
Identify the terms p(θ), B(θ), h(x), K(x) in each case.
(a) Binomial(n, θ),
(b) Fisher’s logarithmic series,
P (X = x) = −θ x / (x ln(1 − θ)) , 0 < θ < 1, x = 1, 2, . . . ,
(c) Normal(0, θ),
(d) Rayleigh, as in 2(d).
4. Compute the information in a random sample of size n from each of the following
populations:
(a) N (0, θ)
(b) N (θ, 1)
(c) Geometric with P(success) = 1/θ, f(x; θ) = (1/θ)(1 − 1/θ)^{x−1}, x = 1, 2, . . ..
Assignment 5
[This assignment covers work in Statistical Inference, Chapter 8.]
1. (a) What multiple of the sample mean X estimates the population mean with minimum mean square error?
(b) In particular, if there is a known relationship between µ and σ 2 , say σ 2 = kµ2 ,
what is this multiple?
2. The sample X_1, . . . , X_n is randomly chosen from the distribution with
f(x; θ) = (1/θ) e^{−x/θ}, x > 0.
(a) Find the Cramer–Rao lower bound for the variance of an unbiased estimator of
θ.
(b) Identify the estimator that has this variance.
3. For the Cauchy distribution with location parameter θ,
f(x; θ) = 1/{π[1 + (x − θ)^2]}, −∞ < x < ∞,
show that the MVB cannot be attained.
4. For a random sample from a population with df
f(y; θ) = (1 + θ)(y + θ)^{−2}, y > 1, θ > −1,
(a) Find the minimum variance bound for an unbiased estimator of θ,
and
(b) show that the minimum variance of an estimator for log(1 + θ) is independent
of θ.
Assignment 6
[This assignment covers work in Statistical Inference, Chapter 9.]
1. Let X1 , . . . , Xn denote a random sample from a distribution with probability function
f(x; θ) = θ^x (1 − θ)^{1−x}, x = 0, 1.
(a) Show that C = {x : Σ x_i ≤ K} is a best critical region for testing H_0: θ = 1/2 against H_1: θ = 1/3.
(b) Use the central limit theorem to find n and K so that approximately
P(Σ X_i ≤ K | H_0) = 0.10   and   P(Σ X_i ≤ K | H_1) = 0.80.
2. Let X1 , . . . , X25 denote a random sample of size 25 from a normal distribution,
N (θ, 100).
Find a uniformly most powerful critical region of size α = 0.10 for testing H0 : θ = 75
against H1 : θ > 75.
3. Two large batches of logs are offered for sale to a mining company whose concern is
to have the diameters as uniform as possible. Batch A is more expensive than batch
B, but the extra cost will be offset by the uniformity of product. Thus batch A would
be preferred if the standard deviation of diameters in batch A is less than 1/2 the
standard deviation of the diameters in batch B.
Produce a likelihood ratio test for deciding which batch should be purchased based
on the results of samples of the same size from each batch.
4. In a demonstration experiment on Boyle’s law
P V = constant
the volume V is measured accurately, but pressure measurements P are subject to
normally distributed random errors. Two sets of results are obtained, (P1 , V1 ) and
(P2 , V2 ).
Derive a likelihood ratio test of the validity of Boyle’s law.
Assignment 7
[ This assignment covers work in Bayesian Inference, Chapter 10.]
1. For the Normal distribution with prior N (µ0 , σ02 ), obtain the posterior distribution
for n observations by considering each observation x1 , . . . , xn sequentially.
2. A random sample of size n is to be taken from a N(µ, σ^2) distribution, where σ^2 is
known. How large must the sample size n be to reduce the posterior variance of µ to the
fraction σ^2/k of its original value, where k > 1?
3. Laplace claimed that an event (success) which has occurred n times, and has had no
failures, will occur again with probability (n + 1)/(n + 2). Use Bayes’ uniform prior
to give grounds for this claim.
4. (a) Find the Jeffreys prior for the parameter α of the Maxwell distribution
p(x|α) = √(2/π) α^{3/2} x^2 e^{−αx^2/2}.
(b) Find a transformation of α for which the corresponding prior is uniform.
Part I
Distribution Theory
Chapter 1
Prerequisites
1.1 Probability concepts assumed known
1. RANDOM VARIABLES.
Discrete and continuous.
Probability functions and probability density functions.
Specification of a distribution by its cumulative distribution function (cdf).
Particular distributions: binomial, negative binomial, Poisson, exponential, uniform,
normal, (simple) gamma, generalized gamma, beta, chi-square, t, F.
2. MOMENTS AND GENERATING FUNCTIONS.
Mean and variance of the common distributions.
Moment generating function of the common distributions,
Use of moment generating functions and cumulant generating functions.
3. BIVARIATE DISTRIBUTIONS.
Correlation and covariance.
Marginal and conditional distributions.
Independence.
4. MULTIVARIATE DISTRIBUTIONS.
Multinomial distribution.
Mean and variance of a sum of random variables.
Use of mgf to find the distribution of a sum of independent random variables.
5. CHANGE OF VARIABLE TECHNIQUE.
In the univariate case, given the probability distribution of X and a function g, we
find the distribution of Y defined by Y = g(X).
1.2 Assumed knowledge of matrices and vector spaces
1. Use of terms singular, diagonal, unit, null, symmetric.
2. Operations of addition, subtraction, multiplication, inverse and transpose.
[We will use A′ for the transpose of A.]
3. (a) (AB)′ = B′A′,
(b) (AB)^{−1} = B^{−1}A^{−1},
(c) (A^{−1})′ = (A′)^{−1}.
4. The trace of a matrix A, written tr(A), is defined as the sum of the diagonal elements of A. That is,
tr(A) = Σ_i a_ii.
(a) tr(A ± B) =tr(A)±tr(B),
(b) tr(AB) =tr(BA).
5. Linear Independence and Rank
(a) Let x_1, . . . , x_n be a set of vectors and c_1, . . . , c_n be scalar constants. If Σ_i c_i x_i = 0 only if c_1 = c_2 = . . . = c_n = 0, then the set of vectors is linearly independent.
(b) The rank of a set of vectors is the maximum number of linearly independent
vectors in the set.
(c) For a square matrix A, the rank of A, denoted by r(A), is the maximum order
of non–zero subdeterminants.
(d) r(AA′) = r(A′A) = r(A) = r(A′).
6. Quadratic Forms
For a p-vector x, where x′ = (x_1, . . . , x_p), and a square p × p matrix A,
x′Ax = Σ_{i,j=1}^{p} a_{ij} x_i x_j
is a quadratic form in x_1, . . . , x_p.
The matrix A and the quadratic form are called:
(a) positive semidefinite if x′Ax ≥ 0 for all x and x′Ax = 0 for some x ≠ 0;
(b) positive definite if x′Ax > 0 for all x ≠ 0.
i. A necessary and sufficient condition for A to be positive definite is that each
leading diagonal sub–determinant is greater than 0. So a positive definite
matrix is non–singular.
ii. A necessary and sufficient condition for a symmetric matrix A to be positive
definite is that there exists a non–singular matrix P such that A = PP′.
7. Orthogonality.
A matrix P is said to be orthogonal if PP′ = I (or P′P = I).
(a) An orthogonal matrix is non–singular.
(b) The determinant of an orthogonal matrix is ±1.
(c) The transpose of an orthogonal matrix is also orthogonal.
(d) The product of two orthogonal matrices is orthogonal.
(e) If P is orthogonal, tr(P′AP) = tr(APP′) = tr(A).
(f) If P is orthogonal, r(P′AP) = r(A).
8. Eigenvalues and eigenvectors.
Eigenvalues of a square matrix A are defined as the roots of the equation
|A − λI| = 0. The corresponding x satisfying (A − λI)x = 0 are the eigenvectors.
(a) The eigenvectors corresponding to two different eigenvalues are orthogonal.
(b) The number of non–zero eigenvalues of a square matrix A is equal to the rank
of A.
9. Reduction to diagonal form
(a) Given any symmetric p × p matrix A there exists an orthogonal matrix P such that P′AP = Λ where Λ is a diagonal matrix whose elements are the eigenvalues of A. We write P′AP = diag(λ_1, . . . , λ_p).
i. If A is not of full rank, some of the λ_i will be zero.
ii. If A is positive definite (and therefore non–singular), all the λ_i will be greater than zero.
iii. The eigenvectors of A form the columns of matrix P.
(b) If A is symmetric of rank r and P is orthogonal such that P′AP = Λ, then
i. tr(A) = Σ_{i=1}^{r} λ_i, since tr(A) = tr(P′AP) = tr(Λ).
ii. tr(A^s) = Σ_{i=1}^{r} λ_i^s.
(c) For every quadratic form Q = x′Ax there exists an orthogonal transformation x = Py which reduces Q to a diagonal quadratic form, so that
Q = λ_1 y_1^2 + λ_2 y_2^2 + . . . + λ_r y_r^2
where r is the rank of A.
10. Idempotent Matrices.
A matrix A is said to be idempotent if A2 = A. In the following we shall mean
symmetric idempotent matrices. Some properties are:
(a) If A is idempotent and non–singular then A = I. To prove this, note that
AA = A and pre–multiply both sides by A−1 .
(b) The eigenvalues of an idempotent matrix are either 1 or 0.
(c) If A is idempotent of rank r, there exists an orthogonal matrix P such that P′AP = E_r where E_r is a diagonal matrix with the first r leading diagonal elements 1 and the remainder 0.
(d) If A is idempotent of rank r then tr(A) = r. To prove this, note that there is an orthogonal matrix P such that P′AP = E_r. Now tr(P′AP) = tr(A) = tr(E_r) = r.
(e) If the ith diagonal element of A is zero, all elements in the ith row and column
are zero.
(f) All idempotent matrices not of full rank are positive semi–definite. No idempotent matrix can have negative elements on its diagonal.
1.2.1 Worked Example
A student had a query about idempotent matrices, especially item 10(b): "The eigenvalues of an idempotent matrix are either 1 or 0." How can this be shown?
Answer
The eigenvalues λ of A are given by |A − λI| = 0, and for square matrices X and Y, |XY| = |X||Y|.
The original equation giving the eigenvalues of A is
Ax = λx.
Since A is idempotent, premultiplying by A gives
AAx = Ax = λAx,
so that (A − λA)x = 0 and hence
|A − λA| = |A(I − λI)| = |A| |I − λI| = 0.
So the eigenvalues are 1, unless A is singular, in which case some of the eigenvalues will be zero, since A would then not be of full rank.
Examples
• If A = [1 0; 0 0] then A is idempotent and the eigenvalues are 1 and 0.
• If A = [1 0; 0 1] then A is idempotent and the eigenvalues are 1 and 1.
• If A = [0 1; 1 0] then A is NOT idempotent; its eigenvalues are ±1.
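As a quick numerical check (an illustrative sketch, not part of the original notes), these properties can be verified in R, the package used later in the unit. The sketch builds a symmetric idempotent matrix as a projection ("hat") matrix from an arbitrary illustrative matrix X and inspects its eigenvalues and trace.

    # Sketch: eigenvalues of a symmetric idempotent matrix are 0 or 1
    set.seed(1)
    X <- matrix(rnorm(12), nrow = 4, ncol = 3)   # arbitrary 4 x 3 matrix of full column rank
    H <- X %*% solve(t(X) %*% X) %*% t(X)        # projection (hat) matrix: symmetric, idempotent
    max(abs(H %*% H - H))                        # effectively 0, so H is idempotent
    round(eigen(H)$values, 10)                   # eigenvalues: 1, 1, 1, 0
    sum(diag(H))                                 # trace = rank = 3, as in property 10(d)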
Chapter 2
Preliminaries
2.1 Introduction
We shall refresh some basic notions to get focused. Statistics is the science (or art) of
interpreting data when there are random events operating in conjunction with systematic
events. Mostly there is a pattern to the randomness which allows us to make sense of
the observations. Distribution functions and their derivatives called density functions or
probability functions are mathematical ways of describing the randomness.
In this course, we shall be mostly studying parametric distributions where the mathematical description of randomness is in terms of parameters because we can assume we
know the general form of the distribution. There is another topic in statistics called nonparametric statistics where the distribution function is parameter free. Order Statistics
are one example of non-parametric statistics.
2.2 Indicator Functions
A class of functions known as indicator functions is useful in statistics.
Definition 2.1
Suppose Ω is a set with typical element ω, and let A be a subset of Ω. The indicator
function of A, denoted by IA (·), is defined by
I_A(ω) = 1 if ω ∈ A, and 0 if ω ∉ A.    (2.1)
That is, IA (·) indicates the set A. Some properties of the indicator function are listed.
(a) IA (ω) = 1 − IĀ (ω) where Ā is the complement of A.
(b) I_A(ω)^2 = I_A(ω)
(c) I_{A∩B}(ω) = I_A(ω) · I_B(ω)
(d) I_{A∪B}(ω) = I_A(ω) + I_B(ω) − I_{A∩B}(ω)
(e) I_{A_1∪A_2∪...∪A_n}(ω) = max{I_{A_1}(ω), . . . , I_{A_n}(ω)}
The following example shows a use for indicator functions.
Example 2.1
Suppose random variable X has pdf given by
f(x) = 0,        x < −1
     = 1 + x,    −1 ≤ x < 0
     = 1 − x,    0 ≤ x < 1
     = 0,        x ≥ 1.
We can write f(x) as
f(x) = (1 + x) I_{[−1,0)}(x) + (1 − x) I_{[0,1)}(x),
or more concisely
f(x) = (1 − |x|) I_{[−1,1]}(x).
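Indicator functions translate directly into code. As an illustrative aside (not part of the original notes), the R sketch below evaluates the density of Example 2.1 through its indicator form and confirms that it integrates to 1.

    # Triangular density of Example 2.1: f(x) = (1 - |x|) I_[-1,1](x)
    f <- function(x) (1 - abs(x)) * (x >= -1 & x <= 1)   # the logical term is the 0/1 indicator
    f(c(-2, -0.5, 0, 0.5, 2))                            # 0.0 0.5 1.0 0.5 0.0
    integrate(f, -Inf, Inf)$value                        # 1, so f is a valid pdf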
2.3 Distribution Functions (cdf's)
The density or probability function is an idealised pattern which would be a reasonable
approximation to represent the frequency of the data; the slight imperfections can be
disregarded. If we can accept that approximation, we can reduce the data and understand
it. To use the density or probability function, we usually have to integrate (or sum if it is
discrete). The distribution function arises as the integral or sum. Whether we refer to the
distribution or density (probability) function, we are still referring to the same information.
Read CB page 29–37 or HC 1.7 where distribution functions for the univariate case
are considered.
Example 2.2
Given random variables X and Y which are identically distributed and independent (iid),
with pdf f (x), x > 0, find P (Y > X). Consider one particular value of Y , say y ∗ . Then
the probability that this value is greater than any X is written mathematically as
P(y* > X) = P(X < y*) = ∫_0^{y*} f(x) dx.
Now to generalize for all Y, we need to take into account the frequency of y*, and that information is contained in the density f(y). We integrate the above probability over f(y). The joint pdf of X and Y, f_{X,Y}(x, y), can be written f(x)f(y), so
P(Y > X) = ∫_0^∞ ∫_0^y f(x) f(y) dx dy    [or ∫_0^∞ ∫_x^∞ f(x) f(y) dy dx]
         = ∫_0^∞ f(y) ∫_0^y f(x) dx dy
         = ∫_0^∞ f(y) F(y) dy
         = ∫_0^∞ F(y) dF(y)
         = [ {F(y)}^2 / 2 ]_0^∞
         = 1/2.
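A quick Monte Carlo check (an illustrative sketch only, using the exponential as one choice of f) makes the answer plausible: by symmetry, either of two iid continuous variables is equally likely to be the larger.

    # Monte Carlo check of P(Y > X) = 1/2 for iid continuous X and Y
    set.seed(42)
    n <- 1e5
    x <- rexp(n); y <- rexp(n)   # any common continuous distribution on (0, Inf) will do
    mean(y > x)                  # approximately 0.5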
2.4 Bivariate and Conditional Distributions
(CB chapter 4 or HC chapter 2)
Rather than use f, g, h, f1 , f2 , etc as function names for pdf’s, we will almost always
use f , and if there is more than one random variable in the problem we will use a subscript
to indicate the name of the variable whose pdf we are identifying.
For example, we may say that the pdf of X is fX (x) = α e−αx , x > 0. Of course, the
x could be replaced by any other letter. It is the fX that determines the function, not
the (·). A similar notation is used for cumulative distribution functions. In the case of
a conditional pdf, we will use, for example, fX|Y =y (x) for the conditional pdf of X given
Y = y. An alternative notation is f (x|y).
Read CB 4.2 or HC 2.2 where most of the ideas should be familiar to you.
The two variables of a bivariate density fX,Y are correlated so the outcome due to one
is influenced by the other. The conditional density fY |X allows us to make statements
about Y if we have information on X. Recall that when we integrate out the terms in X
(or average over fX ), to get a density in Y only (ie fY ), we call that the marginal density
of Y .
Definition 2.2
The conditional density function of Y given X = x is defined to be
f_{Y|X=x}(y) = f_{X,Y}(x, y) / f_X(x),   for f_X(x) > 0,    (2.2)
and is undefined elsewhere.
Comments
1. In fY |X=x (y), x is fixed and should be considered as a parameter.
2. f_{X,Y}(x, y) is a surface above the xy-plane. A plane perpendicular to the xy-plane on the line x = x_0 will intersect the surface in the curve f_{X,Y}(x_0, y). The area under this curve is then given by ∫_{−∞}^{∞} f_{X,Y}(x_0, y) dy = f_X(x_0). So dividing f_{X,Y}(x_0, y) by f_X(x_0) we obtain a pdf, which is f_{Y|X=x_0}(y).
3. The definition given can be considered an extension of the concept of a conditional probability. The conditional distribution is a normalised slice from the joint distribution function, since
∫ f_{Y|X=x}(y) dy = ∫ f_{X,Y}(x, y) dy / f_X(x) = f_X(x) / f_X(x) = 1
as required. Thus the marginal density f_X(x) is the normalising function.
Definition 2.3
If X and Y are jointly continuous, then the conditional distribution function of Y given
X = x is defined as
F_{Y|X=x}(y) = P(Y ≤ y | X = x) = ∫_{−∞}^{y} f_{Y|X=x}(t) dt    (2.3)
for all x such that f_X(x) > 0.
2.4.1 Conditional Mean and Variance
Note in CB p150–152 and the latter part of HC 2.2 how to find the conditional mean and
conditional variance. The first job is to find the conditional density. Important results are:
E{E(Y|X)} = E(Y)    (2.4)
var{E(Y|X)} ≤ var(Y)    (2.5)
var{Y} = E{var(Y|X)} + var{E(Y|X)}    (2.6)
The proof of (2.6) follows from basic definitions:
E{var(Y|X)} = E[ E(Y^2|X) − {E(Y|X)}^2 ]
            = E[E(Y^2|X)] − E[{E(Y|X)}^2]
            = E(Y^2) − E[{E(Y|X)}^2] + [E(Y)]^2 − [E(Y)]^2    (adding and subtracting [E(Y)]^2, a common trick in statistics)
            = [ E(Y^2) − {E(Y)}^2 ] − ( E[{E(Y|X)}^2] − [E{E(Y|X)}]^2 )    (using 2.4)
            = var(Y) − var{E(Y|X)}.
Therefore,
var(Y) = E[var(Y|X)] + var[E(Y|X)].
Note
To be precise, the statement of the result for the mean, E{E(Y|X)} = E(Y), should read
E_X{E_Y(Y|X)} = E_Y(Y).
Proof:
Now E_Y(Y|X) = ∫ y f(y|x) dy, and so
E_X{E_Y(Y|X)} = ∫ E_Y(Y|X) f_X(x) dx = ∫ ( ∫ y [f(x, y)/f_X(x)] dy ) f_X(x) dx
             = ∫∫ y f(x, y) dy dx = ∫ y ( ∫ f(x, y) dx ) dy = ∫ y f_Y(y) dy = E_Y(Y).
As part of the proof of the formula for conditional variance, the result
E_X[E_Y(Y^2|X)] = E_Y(Y^2)
was invoked. This can be verified easily by replacing y by y^2 in the integrand of the derivation for the conditional mean. In fact the general result
E_X{E_Y(g(Y)|X)} = E_Y[g(Y)]
can be so derived.
Exercise
For the density
f (x, y) = 2, 0 < x < y < 1
verify that
Ex {Ey (Y |X)} = Ey (Y ) = 2/3
empirically.
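One way to carry out the empirical check is sketched below in R (an illustration only; other simulation schemes are possible). If U_1 and U_2 are independent U(0,1), then (X, Y) = (min, max) has joint density 2 on 0 < x < y < 1, and since Y | X = x is U(x, 1), E(Y | X) = (1 + X)/2.

    # Empirical check that E_X{E_Y(Y|X)} = E_Y(Y) = 2/3 for f(x,y) = 2, 0 < x < y < 1
    set.seed(1)
    n  <- 1e5
    u1 <- runif(n); u2 <- runif(n)
    x  <- pmin(u1, u2); y <- pmax(u1, u2)   # (x, y) has density 2 on the triangle
    mean(y)                                 # estimate of E(Y), about 0.667
    mean((1 + x) / 2)                       # estimate of E{E(Y|X)}, also about 0.667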
2.5 Stochastic Independence
Read CB p152 and HC 2.4 up to the end of Example 4 and note the definition of
stochastically independent random variables (HC Definition 2). The word stochastic is
often omitted.
The case of mutual independence for more than 2 variables is summarized below. Definition 2.5 gives an alternative criterion in terms of CDF’s.
Definition 2.4
Let (X1 , X2 , . . . , Xn ) be an n-dimensional continuous random vector with joint pdf
fX1 , ..., Xn (x1 , . . . , xn ) and range space Rn . Then X1 , X2 , . . . , Xn are defined to be stochastically independent if and only if
fX1 , ..., Xn (x1 , . . . , xn ) = fX1 (x1 ) . . . fXn (xn )
(2.7)
for all (x1 , . . . , xn ) ∈ Rn .
Definition 2.5
Let (X1 , X2 , . . . , Xn ) be an n-dimensional random vector with joint cdf
FX1 , ..., Xn (x1 , . . . , xn ). Then X1 , X2 , . . . , Xn are defined to be stochastically independent
if and only if
FX1 , ..., Xn (x1 , . . . , xn ) = FX1 (x1 ) . . . FXn (xn )
(2.8)
for all xi .
Comments
HC’s Theorem 1 on page 102 says, in effect,
1. If the joint pdf of X1 , . . . , Xn factorizes into g1 (x1 ) . . . gn (xn ), where gi (xi ) is a function of xi alone (including the range space), i = 1, 2, . . . , n, then X1 , X2 , . . . Xn are
mutually stochastically independent. It is not assumed that gi (xi ) is the marginal
pdf of Xi .
2. Similarly, if the joint cdf of X1 , . . . , Xn factorizes into G1 (x1 ) . . . Gn (xn ) where Gi (xi )
is a function of xi alone, then X1 , . . . , Xn are mutually stochastically independent.
2.6 Moment Generating Functions (mgf)
Moments are defined as µ′_r = E(Y^r) and central moments about µ as µ_r = E(Y − µ)^r for r = 1, 2, . . .. These are entities by which we start reducing data.
µ_1 ≡ mean
µ_2 ≡ variance
µ_3 ≡ skewness
µ_4 ≡ kurtosis
µ_5, µ_6, . . . have no special names.
Often µ1 and µ2 are enough to summarize the data. However, fourth moments and their
counterparts, cumulants, are needed to find the variance of a variance. Moment generating
functions give us a way of determining the formula for a particular moment. But they are
more versatile than that, see below.
It will be recalled that in the univariate case, random variable X has mgf defined
by
MX (t) = E(eXt )
for values of t for which the series or the integral converges.
2.6.1 Multivariate mgfs
First we revise the concept of a bivariate mgf.
Bivariate mgfs
For a bivariate distribution, the rs-th moment about the origin is defined as
E(X_1^r X_2^s) = ∫∫ x_1^r x_2^s f(x_1, x_2) dx_1 dx_2 = µ′_{rs}.
Thus µ′_{10} = µ_{x1} = E(X_1), µ′_{01} = µ_{x2} = E(X_2), and µ′_{11} = E(X_1 X_2).
For central moments about the mean,
µ_{rs} = E[(X_1 − µ_{x1})^r (X_2 − µ_{x2})^s] = ∫∫ (x_1 − µ_{x1})^r (x_2 − µ_{x2})^s f(x_1, x_2) dx_1 dx_2.
Now we find that µ_{20} = σ^2_{x1}, µ_{02} = σ^2_{x2} and µ_{11} = cov(X_1, X_2).
The bivariate MGF is defined as
M_{X1,X2}(t_1, t_2) = E(e^{t_1 X_1 + t_2 X_2})
 = ∫∫ e^{t_1 x_1 + t_2 x_2} f(x_1, x_2) dx_1 dx_2
 = ∫∫ [1 + (t_1 x_1 + t_2 x_2) + (t_1 x_1 + t_2 x_2)^2/2! + . . .] f(x_1, x_2) dx_1 dx_2
 = 1 + µ′_{10} t_1 + µ′_{01} t_2 + µ′_{11} t_1 t_2 + . . .
Theorem
If X1 and X2 are independent,
MX1 ,X2 (t1 , t2 ) = MX1 (t1 ) × MX2 (t2 )
Proof
M_{X1,X2}(t_1, t_2) = E(e^{t_1 X_1 + t_2 X_2})
 = ∫∫ e^{t_1 x_1 + t_2 x_2} f(x_1, x_2) dx_1 dx_2
 = ∫∫ e^{t_1 x_1 + t_2 x_2} f_{X1}(x_1) f_{X2}(x_2) dx_1 dx_2
since X_1 and X_2 are independent. Now
M_{X1,X2}(t_1, t_2) = ∫ e^{t_2 x_2} ( ∫ e^{t_1 x_1} f_{X1}(x_1) dx_1 ) f_{X2}(x_2) dx_2
 = M_{X1}(t_1) ∫ e^{t_2 x_2} f_{X2}(x_2) dx_2 = M_{X1}(t_1) M_{X2}(t_2).
Example
If X1 ∼ B(1, π) and X2 ∼ B(1, π), what is the distribution of X1 + X2 if the two variables
are independent?
Answer
Now
M_{X1}(t_1) = πe^{t_1} + 1 − π   and   M_{X2}(t_2) = πe^{t_2} + 1 − π,
giving
M_{X1,X2}(t_1, t_2) = M_{X1}(t_1) M_{X2}(t_2) = (πe^{t_1} + 1 − π)(πe^{t_2} + 1 − π).
Since the proportions are the same, put t_1 = t_2 = t, so that
M_{X1,X2}(t, t) = E(e^{tX_1 + tX_2}) = E(e^{(X_1 + X_2)t}) = M_{X_1+X_2}(t).
Finally
M_{X_1+X_2}(t) = (πe^t + 1 − π)^2,
which means that X_1 + X_2 ∼ B(2, π), as expected.
Multivariate mgfs
We will now consider the mgf for a random vector X′ = (X_1, X_2, . . . , X_p). The moment generating function of X is defined by
M_X(t_1, . . . , t_p) = E(e^{X_1 t_1 + . . . + X_p t_p}) = E(e^{X′t})    (2.9)
where t′ = (t_1, t_2, . . . , t_p). Of course E(e^{X′t}) could be written E(e^{t′X}).
Read HC 2.4 from Theorem 4 to the end. Note in particular how multivariate mgf’s
can be used to find moments (including product moments), to find marginal distributions
of one or more variables, and to prove independence. These are summarized below.
1. ∂^{s_1+s_2} M_{X,Y}(t_1, t_2) / ∂t_1^{s_1} ∂t_2^{s_2}, evaluated at t_1 = t_2 = 0, equals E(X^{s_1} Y^{s_2}).    (2.10)
The obvious extension can be made to the case of p (> 2) variables.
The obvious extension can be made to the case of p (> 2) variables.
2. The marginal distributions for subsets of the p components have mgf’s obtained by
setting equal to zero those ti ’s that correspond to the variables not in the subset. For
example, if X0 = (X1 , X2 , X3 , X4 ) has mgf MX (t1 , t2 , t3 , t4 ), then
MX2 ,X3 (t2 , t3 ) = MX (0, t2 , t3 , 0).
3. If the random variables X1 , X2 , . . . , Xp are independent, then
MX (t) = MX1 (t1 )MX2 (t2 ) . . . MXp (tp )
and the converse is also true.
2.7 Multinomial Distribution
(CB p181 and HC p121)
Recall that the binomial distribution arises when we observe X, the number of
successes in n independent Bernoulli trials (experiments with only 2 possible outcomes,
success and failure). The multinomial distribution arises when each trial has k possible
outcomes. We say that the random vector (X1 , X2 , . . . , Xk−1 ) has a k- nomial distribution
if the joint probability function of X1 , . . . , Xk−1 is
P(X_1 = x_1, . . . , X_{k−1} = x_{k−1}) = [n! / (x_1! . . . x_k!)] p_1^{x_1} p_2^{x_2} . . . p_k^{x_k}    (2.11)
where x_k = n − Σ_{i=1}^{k−1} x_i and Σ_{i=1}^{k} p_i = 1. Note that, if k = 2, this reduces to the binomial probability function.
Now the joint mgf of the k-nomial distribution is
M_{X_1,...,X_{k−1}}(t_1, . . . , t_{k−1}) = E(e^{X_1 t_1 + ··· + X_{k−1} t_{k−1}}) = (p_1 e^{t_1} + · · · + p_{k−1} e^{t_{k−1}} + p_k)^n.    (2.12)
To show this, multiply the RHS of (2.11) by ex1 t1 +···+xk−1 tk−1 and sum over all (k −1)-tuples,
(x1 , . . . , xk−1 ). [HC deals with this for k = 3 on page 122.]
Comments
1. When k = 2, (2.12) agrees with the familiar form of the mgf of a binomial (n, p)
distribution.
2. Note that the marginal mgf of any Xi (obtained by putting the other ti equal to 0)
is the familiar mgf of the binomial distribution.
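Comment 2 can also be seen numerically. The R sketch below (with arbitrary illustrative values of n and the p_i) draws from a trinomial distribution and compares the distribution of the first component with Binomial(n, p_1) probabilities.

    # Marginal of a multinomial component: X_1 ~ Bin(n, p_1)
    set.seed(7)
    n <- 10; p <- c(0.2, 0.3, 0.5)
    x  <- rmultinom(1e5, size = n, prob = p)   # 3 x 100000 matrix of counts
    x1 <- x[1, ]                               # first component of each draw
    rbind(empirical = table(factor(x1, levels = 0:n)) / length(x1),
          binomial  = dbinom(0:n, n, p[1]))    # the two rows should agree closely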
Chapter 3
Transformations
3.1 Introduction
We frequently have the type of problem where we have a random variable X with
known distribution and a function g and wish to find the distribution of the random
variable Y = g(X). There are essentially 3 methods for finding the distribution of Y and
these are summarized briefly as follows.
1. Method of Distribution Functions
Let FY (y) denote the cdf of Y . Then
F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y))   (for g increasing),
where F_X is the cdf of the random variable X.
2. Method of Transformations
In the case of a continuous random variable X with pdf fX (x), x ∈ RX , and g a strictly
increasing or strictly decreasing function for x ∈ RX , the random variable Y has pdf given
by
f_Y(y) = f_X(x) |dx/dy|    (3.1)
where the RHS is expressed as a function of y.
For example, if f_X(x) = αe^{−αx} and y = x^2, write f_X(x) = αe^{−α√y}. The Jacobian |dx/dy| keeps track of the scale change in going from x to y.
A modification of the procedure enables us to deal with the situation where g is piecewise monotone.
3. Method of Moment Generating Functions
This method is based on the uniqueness theorem, which states that if two mgf’s are
identical, the two random variables with those mgf’s possess the same probability distribution. So we would need to find the mgf of Y and compare it with the mgf’s for the common
distributions. If it is identical to some well-known mgf, the probability distribution of Y
will be identified.
The problem above was dealt with in a section called Change of Variable in the Statistics
unit STAT260. The new work in this chapter concerns what may be called bivariate
transformations. That is, we begin with the joint distribution of 2 random variables, X1
and X2 say, and two functions, g and h, and wish to find the joint distribution of the
random variables Y1 = g(X1 , X2 ) and Y2 = h(X1 , X2 ). The marginal distribution of one or
both of Y1 and Y2 can then be found. We may wish to do this if we changed coordinates
from Cartesian (X1 , X2 ) to polar coordinates (Y1 , Y2 ).
This can, of course, be extended to multivariable transformations.
Before leaving this section, the following example should help you recall the technique.
Example 3.1
We are given that Z ∼ N (0, 1) and wish to find the distribution of Y = Z 2 .
Method 1
If G_Y(y) is the cdf of Y, then
G_Y(y) = P(Y ≤ y) = P(Z^2 < y) = P[−√y < Z < √y] = Φ(√y) − Φ(−√y),
where Φ is the standard normal integral. Differentiating wrt y, we get (φ = Φ′)
G′_Y(y) = g_Y(y) = φ(√y)·(1/2)y^{−1/2} − φ(−√y)·(−1/2)y^{−1/2}
        = (1/2) y^{−1/2} [ e^{−(√y)^2/2} + e^{−(−√y)^2/2} ] / √(2π)
        = e^{−y/2} y^{1/2−1} / √(2π),
ie, χ^2_1 as expected.
Method 2
Now Y = Z^2 where
f_Z(z) = (1/√(2π)) e^{−z^2/2}.
The transformation is not 1:1, so
P(α < Y < β) = P(√α < Z < √β) + P(−√β < Z < −√α) = 2P(√α < Z < √β).
Thus
f_Y(y) = 2 f_Z(z = √y) |J|,   where J = d(√y)/dy = (1/2) y^{−1/2}. So
f_Y(y) = 2 · (1/√(2π)) e^{−y/2} · (1/2) y^{−1/2} = e^{−y/2} y^{−1/2} / (√2 Γ(1/2)).
So Y ∼ χ^2_1.
Method 3
The MGF of Y is
M_Y(t) = E(e^{tY}) = (1/√(2π)) ∫ e^{tz^2} e^{−z^2/2} dz = (1/√(2π)) ∫ e^{−z^2(1−2t)/2} dz.
Putting W = (1 − 2t)^{1/2} Z gives
M_Y(t) = (1/√(2π)) ∫ e^{−w^2/2} dw / (1 − 2t)^{1/2} = (1/(1 − 2t)^{1/2}) [ (1/√(2π)) ∫ e^{−w^2/2} dw ] = 1/(1 − 2t)^{1/2},
ie Y ∼ χ^2_1 as expected. The problem with the MGF approach is that you have to be able to recognize the distribution from the form of the MGF.
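All three methods give the same answer, and a simulation (an illustrative sketch, not part of the original notes) provides a quick sanity check that Z^2 behaves like a χ^2_1 variable.

    # Monte Carlo check that Z^2 has a chi-square distribution with 1 df
    set.seed(3)
    y <- rnorm(1e5)^2
    c(mean(y), var(y))                           # theory: mean 1, variance 2
    ks.test(y, "pchisq", df = 1)                 # large p-value: consistent with chi^2_1
    rbind(simulated = quantile(y, c(.5, .9, .99)),
          theory    = qchisq(c(.5, .9, .99), df = 1))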
Exercise
Obtain the density function for the log–normal distribution, a random variable whose logarithm is normally distributed. That is, if ln Y = X with X ∼ N(µ, σ^2), find the distribution of Y.
Example 3.2
Now suppose random variable X is distributed N (µ, σ 2 ), and random variable Y is
defined by Y = X 2 /σ 2 , find the distribution of Y .
Method 1.
Let GY (y) be the cdf of Y . Then
G_Y(y) = P(Y ≤ y) = P(X^2/σ^2 ≤ y) = P(X^2 ≤ σ^2 y) = P(−σ√y ≤ X ≤ σ√y)
       = P( (−σ√y − µ)/σ ≤ Z ≤ (σ√y − µ)/σ )
       = Φ(√y − µ/σ) − Φ(−√y − µ/σ).
The pdf of Y, g_Y(y), is obtained by differentiating G_Y(y) wrt y:
g_Y(y) = φ(√y − µ/σ)·(1/2)y^{−1/2} + φ(−√y − µ/σ)·(1/2)y^{−1/2}
       = (1/2) y^{−1/2} [ e^{−(√y − µ/σ)^2/2} + e^{−(√y + µ/σ)^2/2} ] / √(2π)
       = [ e^{−y/2} y^{1/2−1} / (2^{1/2} Γ(1/2)) ] · e^{−µ^2/(2σ^2)} · (1/2)( e^{µ√y/σ} + e^{−µ√y/σ} ),
where y ∈ [0, ∞).
Note that the first part of the RHS is the pdf of a chi-square random variable with 1 df. In
fact Y is said to have a non-central χ2 distribution with 1 df and non-centrality parameter
µ^2/σ^2. [This will be dealt with further in Chapter 6.]
Method 2.
Noting that y = x2 /σ 2 is strictly decreasing for x ∈ (−∞, 0] and strictly increasing for
x ∈ (0, ∞), we use a modification of (3.1).
For x ∈ (−∞, 0] we have x = −σy^{1/2} and |dx/dy| = (1/2)σy^{−1/2}. So
f*_Y(y) = (1/(√(2π)σ)) e^{−(−σy^{1/2} − µ)^2/(2σ^2)} · (1/2)σy^{−1/2},
replacing x in the N(µ, σ^2) pdf by −σy^{1/2}.
For x ∈ (0, ∞) we have x = +σy^{1/2} and |dx/dy| = (1/2)σy^{−1/2}. So
f**_Y(y) = (1/(√(2π)σ)) e^{−(σy^{1/2} − µ)^2/(2σ^2)} · (1/2)σy^{−1/2}.
The pdf of Y is the sum of f*_Y(y) and f**_Y(y), which simplifies to the expression obtained in Method 1.
Method 3.
M_Y(t) = E(e^{tY}) = ∫_{−∞}^{∞} e^{tx^2/σ^2} · (2π)^{−1/2} σ^{−1} exp{ −(x − µ)^2/(2σ^2) } dx
       = σ^{−1} (2π)^{−1/2} ∫_{−∞}^{∞} exp{ −[ (1/2 − t)x^2 − µx + µ^2/2 ] / σ^2 } dx
       = (1 − 2t)^{−1/2} exp{ µ^2 t / (σ^2(1 − 2t)) },
which is the M.G.F. of a non-central χ^2 distribution (see Continuous Distributions by Johnson and Kotz, chapter 28). The integral is a standard result obtained by completing the square in the exponent and using the result that ∫_{−∞}^{∞} e^{−u^2} du = √π, giving
∫_{−∞}^{∞} e^{−(ax^2 + bx + c)} dx = √(π/a) · e^{(b^2 − 4ac)/(4a)}.

3.1.1 The Probability Integral Transform
The transformation which produces the cdf for a random variable is of particular interest.
This transformation (the probability integral transform) is defined by
F(x) = ∫_{−∞}^{x} f(t) dt = P(X ≤ x).
The new variable Y is given by Y = F (X), and has the property of being uniform on
(0,1), ie, Y ∼ U (0, 1).
Thus we are required to prove that
fY (y) = 1, 0 < y < 1.
Proof
Now Y = φ(X) = F (X) and X = F −1 (Y ) = ψ(Y ).
The pdf for Y is then
f_Y(y) = f_X[x = ψ(y)] |J| = f_X[ψ(y)] |dψ(y)/dy|.
But y = F(x) = F[ψ(y)], and so dF[ψ(y)]/dψ(y) = f[ψ(y)], which gives |dψ(y)/dy| = 1/f[ψ(y)]. Hence
f_Y(y) = f[ψ(y)] · (1/f[ψ(y)]) = 1.
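The probability integral transform is the basis of inverse-transform random number generation, which is what Exercise 3 below is pointing at: if U ∼ U(0,1) then X = F^{-1}(U) has cdf F, and conversely F(X) ∼ U(0,1). A hedged R sketch, using the density f(x) = 2x on (0,1) from the exercises (so F(x) = x^2 and F^{-1}(u) = √u):

    # Inverse-transform sampling for f(x) = 2x on (0,1), and the transform F(X) ~ U(0,1)
    set.seed(10)
    u <- runif(1e5)
    x <- sqrt(u)                  # X = F^{-1}(U) has cdf F(x) = x^2, i.e. pdf 2x
    c(mean(x), 2/3)               # E(X) = 2/3 for this density
    ks.test(x^2, "punif")         # F(X) = X^2 is U(0,1): large p-value expected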
Exercises
1. Determine the probability integral transform for
• the general uniform distribution U (a, b), and
• the pdf f (x) = 2x, 0 < x < 1
2. Verify that the transformed distribution is U (0, 1) in each case.
3. What is the connection between the probability integral transform and generating
pseudo–random numbers on a computer?
3.2 Bivariate Transformations
The discrete case will be used as a bridge to the continuous two dimensional transformation
of variables.
One dimensional case
Assuming that we have a change of variable, say from X to Y for a discrete pf, then the
original variable space is A and the transformed variable space is B. The transformation is
Y = φ(X)
with backtransform
X = ψ(Y ).
The pf in B is then
py (Y = y) = px [X = ψ(y)] = p[ψ(y)]
Example
If X is P (λ) and Y = 2X what is the pf of Y ?
P(X = x) = e^{−λ} λ^x / x!,   x = 0, 1, 2, . . .
Using the MGF
M_Y(t) = e^0 M_X(2t) = exp{λ(e^{2t} − 1)},
but can we recognise the distribution?
Using the change of variable,
Y = 2X = φ(X)
and
X = Y /2 = ψ(Y ).
So
p_Y(Y = y) = e^{−λ} λ^{y/2} / (y/2)!,   y = 0, 2, 4, . . .
Two dimensional case
The original variable space (X, Y ) is denoted by A while the transformed space (U, V ) is
denoted by B. We use the notation
U = φ1 (X, Y ), V = φ2 (X, Y )
and
X = ψ1 (U, V ), Y = ψ2 (U, V )
The pf in transformed space B, is then
pU,V (u, v) = pX,Y [x = ψ1 (u, v), y = ψ2 (u, v)]
Example
If X ∼ B(1, π) and Y ∼ B(1, π), what is the distribution of X + Y if the two variables
are independent?
The original pf is
pX,Y (x, y) = pX (x)pY (y) = π x (1 − π)1−x × π y (1 − π)1−y , x, y = 0, 1
Now
U = φ1 (X, Y ) = X + Y, V = φ2 (X, Y ) = Y
and
X = ψ1 (U, V ) = U − V, Y = ψ2 (U, V ) = V
The spaces A and B are shown in Figure 3.1.
The joint pf of U and V is
p_{U,V}(u, v) = π^{u−v}(1 − π)^{1−(u−v)} × π^v (1 − π)^{1−v} = π^u (1 − π)^{2−u},   (u, v) ∈ B.
Figure 3.1: The spaces A (the (x, y) values) and B (the (u, v) values). (Graphic not reproduced.)
We now sum over v to get the marginal distribution for U , ie, pU (u).
p_U(u) = Σ_v p_{U,V}(u, v)
       = π^u (1 − π)^{2−u},       u = 0
       = 2 · π^u (1 − π)^{2−u},   u = 1
       = π^u (1 − π)^{2−u},       u = 2.
Thus
p_U(u) = C(2, u) π^u (1 − π)^{2−u},   u = 0, 1, 2,
and so U ∼ B(2, π), in line with the MGF solution.
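The same conclusion can be reached by brute force. The sketch below (with an arbitrary illustrative value of π) simulates two independent B(1, π) variables in R and compares the distribution of their sum with B(2, π) probabilities.

    # Sum of two independent Bernoulli(p) variables is Binomial(2, p)
    set.seed(5)
    p <- 0.3; n <- 1e5
    u <- rbinom(n, 1, p) + rbinom(n, 1, p)
    rbind(empirical = table(factor(u, levels = 0:2)) / n,
          exact     = dbinom(0:2, 2, p))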
Continuous Variables
Both univariate and bivariate transformations of the discrete type are covered in CB p47,
156 and HC 4.2, whereas transformations for continuous variables are covered in 4.3. The
main result here, which is the two-dimensional extension of (3.1), can be stated as follows.
For (X, Y ) continuous with joint pdf fX,Y (x, y), (x, y) ∈ A, and defining U = g(X, Y ),
V = h(X, Y ), the joint pdf of U and V , fU,V (u, v) is given by
f_{U,V}(u, v) = f_{X,Y}(x, y) · abs|J|    (3.2)
providing the inverse transformation x = G(u, v), y = H(u, v) is one-to-one. Here abs|J| refers to the absolute value of the Jacobian,
J = | ∂x/∂u  ∂x/∂v |
    | ∂y/∂u  ∂y/∂v |.
The RHS of (3.2) has to be expressed in terms of u and v, and could be written more precisely as f_{X,Y}(G(u, v), H(u, v)) abs|J|.
The diagonal elements of J account for scale change and the off-diagonal elements
account for rotations.
Comments
1. In examples, it is essential to draw diagrams showing
(i) the range space of (X, Y ), and
(ii) the region this maps into under the transformation.
2. The distribution of the new random variables U and V is not complete unless the
range space is specified.
3. Frequently we use this technique to find the distribution of some function of random
variables, e.g., X/Y . That is, we are not mainly interested in the joint distribution
of U = g(X, Y ) and V = h(X, Y ), but in the marginal distribution of one of them.
These points will be illustrated by the following examples.
Note that the full version of the joint density is now
fU,V (u, v) = fX,Y [x = G(u, v), y = H(u, v)] · abs|J|
in line with the two dimensional discrete case.
Worked Examples
Three simple examples are presented, as a ’lead–in’ to the later problems.
1. The distribution of the sum of two unit normals.
If X ∼ N(0, 1) and Y ∼ N(0, 1) are independent, what is the distribution of X + Y? Use the change of variable method. (We can state the answer already!)
The joint distribution function is
f_{X,Y}(x, y) = (1/2π) e^{−x^2/2} e^{−y^2/2}.
The transformation is
U = X + Y = φ_1(X, Y) = g(X, Y),   V = Y = φ_2(X, Y) = h(X, Y)
while the inverse is
X = U − V = ψ_1(U, V) = G(U, V),   Y = V = ψ_2(U, V) = H(U, V)
in line with the notation in the Notes.
The spaces A and B both contain the entire variable space in each case. (Verify!)
The Jacobian J is
J = | ∂ψ_1/∂u  ∂ψ_1/∂v |  =  | 1  −1 |  = 1.
    | ∂ψ_2/∂u  ∂ψ_2/∂v |     | 0   1 |
Now the joint distribution function for U and V is
f_{U,V}(u, v) = (1/2π) e^{−(u−v)^2/2} e^{−v^2/2} · abs|J|
             = (1/2π) e^{−(u^2 + v^2 − 2uv + v^2)/2} = (1/2π) e^{−(u^2/4) − (v − u/2)^2}
by completing the square. The joint density can now be written as
f_{U,V}(u, v) = [ (1/(√(2π)·√2)) e^{−u^2/4} ] × [ (1/(√(2π)·(1/√2))) e^{−(v − u/2)^2} ],
which is the product of two normal densities, U ∼ N(0, 2) and a N(0, 1/2). Thus
X + Y ∼ N(0, 2)
as expected.
2. The distribution of the sum, and of the difference of two unit normals.
This problem is similar to the previous one, since we now have the transformation
U = X + Y = φ_1(X, Y) = g(X, Y),   V = X − Y = φ_2(X, Y) = h(X, Y)
while the inverse is
X = (U + V)/2 = ψ_1(U, V) = G(U, V),   Y = (U − V)/2 = ψ_2(U, V) = H(U, V)
in line with the notation in the notes.
The spaces A and B are as for the first example.
The Jacobian J is
J = | ∂ψ_1/∂u  ∂ψ_1/∂v |  =  | 1/2   1/2 |  = −1/2.
    | ∂ψ_2/∂u  ∂ψ_2/∂v |     | 1/2  −1/2 |
The joint density is now
f_{U,V}(u, v) = (1/2π) e^{−((u+v)/2)^2/2} e^{−((u−v)/2)^2/2} · (1/2).
Minus eight times the exponent is
(u + v)^2 + (u − v)^2 = u^2 + v^2 + 2uv + u^2 + v^2 − 2uv = 2(u^2 + v^2),
to give
f_{U,V}(u, v) = (1/2π) e^{−(u^2 + v^2)/4} · (1/2)
             = [ (1/(√(2π)·√2)) e^{−u^2/4} ] × [ (1/(√(2π)·√2)) e^{−v^2/4} ],
and so U and V are independent. Thus we can observe that U = X + Y ∼ N(0, 2) and V = X − Y ∼ N(0, 2). This is related to X̄ and S^2 being independent for the case n = 2, since V^2/2 ∼ χ^2_1. The n–dimensional example of this independence of the mean and variance is given on pages 21–22 of the Notes.
3. An example from the exponential distribution.
If X ∼ E(1) and Y ∼ E(1), and X and Y are independent, then show that X + Y
and X/(X + Y ) are independent.
Thus the joint density of X and Y is
fX,Y (x, y) = e−x e−y , x, y > 0.
We have the transformation
U = X + Y = φ1 (X, Y ) = g(X, Y ), V = X/(X + Y ) = φ2 (X, Y ) = h(X, Y )
while the inverse is
X = U V = ψ1 (U, V ) = G(U, V ), Y = U − U V = ψ2 (U, V ) = H(U, V )
The region A is defined by X, Y > 0 while B is given by U, V > 0 and V < 1.
The Jacobian is
J = | ∂ψ_1/∂u  ∂ψ_1/∂v |  =  | v     u  |  = −u.
    | ∂ψ_2/∂u  ∂ψ_2/∂v |     | 1−v  −u  |
The joint density of U and V is
f_{U,V}(u, v) = f(ψ_1, ψ_2) · abs|J| = e^{−uv} e^{−(u−uv)} · u = u e^{−u},   (u, v) ∈ B.
This gives the marginal distribution for U as
f_U(u) = ∫ f_{U,V}(u, v) dv = ∫_0^1 u e^{−u} dv = u e^{−u},   u > 0,
ie, U ∼ G(2), being the sum of two independent exponentials. This can be verified by the use of the MGF, since
M_{X+Y}(t) = M_X(t) × M_Y(t) = 1/(1 − t) × 1/(1 − t) = 1/(1 − t)^2,
which is the MGF of a G(2).
The marginal for V is
f_V(v) = ∫_0^∞ u e^{−u} du = 1,   0 < v < 1.
So V is distributed as a uniform on (0, 1). Thus
f_{U,V}(u, v) = f_U(u) × f_V(v)
and so U and V are independent.
Note
The full version [ X, Y ∼ N (µ, σ 2 )] of the first two examples is in theory not complicated,
but in practice is very messy.
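A simulation check of the exponential example (a sketch only, not part of the original notes): with X and Y independent unit exponentials, U = X + Y should behave like a Gamma(2) variable, V = X/(X + Y) like a U(0,1) variable, and the two should show no association.

    # U = X + Y ~ Gamma(2, 1) and V = X/(X+Y) ~ U(0,1), independently
    set.seed(2)
    x <- rexp(1e5); y <- rexp(1e5)
    u <- x + y; v <- x / (x + y)
    ks.test(u, "pgamma", shape = 2, rate = 1)   # consistent with Gamma(2)
    ks.test(v, "punif")                         # consistent with U(0,1)
    cor(u, v)                                   # near 0, as expected under independence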
Example 3.3
[This is HC Example 3 in 4.3.]
Given independent random variables X and Y , each with uniform distributions on
(0, 1), find the joint pdf of U and V defined by U = X + Y, V = X − Y , and the marginal
pdf of U .
The joint pdf of X and Y is
fX,Y (x, y) = 1, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 .
The inverse transformation, written in terms of observed values is
x = (u + v)/2 and y = (u − v)/2.
and is clearly one-to-one. The Jacobian is
J = ∂(x, y)/∂(u, v) = | 1/2   1/2 |  = −1/2,   so abs|J| = 1/2.
                      | 1/2  −1/2 |
Following the notation of HC, we will use A to denote the range space of (X, Y ), and B
to denote that of (U, V ) shown in the Figure 3.2. Firstly, note that there are 4 inequalities
specifying ranges of x and y, and these give 4 inequalities concerning u and v, from which
B can be determined. That is,
x ≥ 0  ⇒  u + v ≥ 0,  that is, v ≥ −u
x ≤ 1  ⇒  u + v ≤ 2,  that is, v ≤ 2 − u
y ≥ 0  ⇒  u − v ≥ 0,  that is, v ≤ u
y ≤ 1  ⇒  u − v ≤ 2,  that is, v ≥ u − 2
Drawing the four lines
v = −u, v = 2 − u, v = u, v = u − 2
on the graph, enables us to see the region specified by the 4 inequalities.
Now, using (3.2) we have
1
fU,V (u, v) = 1. ,
2
(
−u ≤ v ≤ u, 0 ≤ u ≤ 1
u − 2 ≤ v ≤ 2 − u, 1 ≤ u ≤ 2
The importance of having the range space correct is seen when we find the marginal
pdf of U .
fU (u) =
=
Z
∞
−∞
fU,V (u, v) dv
 Ru 1

 R−u 2 dv ,
2−u 1
dv
 u−2 2

(
0,
0≤u≤1
, 1≤u≤2
otherwise
u,
0≤u≤1
2−u , 1≤u≤ 2
= uI[0,1] (u) + (2 − u)I(1,2] (u), using indicator functions.
=
29
Figure 3.2: The region B.
v
v=u
-
@
@
@
@
@
@
@
@
@
@
@
1
@
@
@
v=u-2
@
@-
@
@
@
-
@
@
@
@
@
@
@
@
2
@
@
u
@
@
@
v=2-u
@
@
v=-u
Example 3.4
[HC Example 6, 4.3]
Given X and Y are independent random variables each with pdf
fX (x) = 21 e−x/2 , x ∈ [0, ∞), find the distribution of (X − Y )/2.
We note that the joint pdf of X and Y is
fX,Y (x, y) = 14 e−(x+y)/2 , 0 ≤ x < ∞, 0 ≤ y < ∞.
Define U = (X − Y )/2. Now we need to introduce a second random variable V which
is a function of X and Y . We wish to do this in such a way that the resulting bivariate
transformation is one-to-one and our actual task of finding the pdf of U is as easy as
possible. Our choice for V is of course, not unique. Let us define V = Y . Then the inverse
transformation is, (using u, v, x, y, since we are really dealing with the range spaces here).
x = 2u + v
y = v
from which we find the Jacobian,
2 1 J=
=2.
0 1 30
To determine B, the range space of U and V , we note that
x≥0
x<∞
y≥0
y<∞
⇒
⇒
⇒
⇒
2u + v ≥ 0, that is, , v ≥ −2u
2u + v < ∞
v≥0
v<∞
So B is as indicated in Figure 3.3.
Figure 3.3: The region B
v
A
A
A
A
A
A
A
A
A
Now using (3.2) we have
u
A v=-2u
A
A
A
AA
1
4
1
2
fU,V (u, v) =
=
e−(2u+v+v)/2 .2
e−(u+v) , (u, v) ∈ B.
The marginal pdf of U is obtained by integrating fU,V (u, v) with respect to v, giving
fU (u) =
 R
∞ 1
 −2u
2
 R∞ 1
=
(
=
1
2
0
e
2
1 u
e
2
1 −u
e
2
−|u|
e−(u+v) dv , u < 0
e−(u+v) dv ,
u>0
u<0
, u>0
, −∞ < u < ∞
[This is sometimes called the folded (or double) exponential distribution.]
Example 3.5
Given Z is distributed N (0, 1) and Y is distributed as χ2ν , and Z and Y are independent, find the pdf of a random variable T defined by
T =
Z
1
(Y /ν) 2
31
.
Now the joint pdf of Z and Y is
fZ,Y (z, y) =
e−z
2 /2
(2π)
1
2
.
e−y/2 y (ν/2)−1
, y > 0, −∞ < z < ∞.
2ν/2 Γ(ν/2)
Let V = Y , and we will find the joint pdf of T and V and then the marginal pdf of T . The
inverse transformation is
1
1
z = tv 2 /ν 2
y = v
1
from which |J| = (v/ν) 2 .
It is easy to check that B = {(t, v) : −∞ < t < ∞, 0 < v < ∞}. So the joint pdf of T
and V is
1
2
e−t v/2ν e−v/2 v (ν/2)−1 v 2
fT,V (t, v) =
1 .
1 ,
(2π) 2 2ν/2 Γ(ν/2) ν 2
for (t, v) ∈ B. The marginal pdf of T is found by integrating fT,V (t, v) with respect to v,
the limits on the integral being 0 and ∞. Carry out this integration, substituting x (say)
2
for v2 (1 + tν ), and reducing the integral to a gamma function. The answer should be
Γ( ν+1
)
2
t2
fT (t) =
1
+
1
ν
(νπ) 2 Γ(ν/2)
!−(ν+1)/2
, −∞ < t < ∞ ,
which you will recognise as the pdf of a random variable with a t-distribution with ν degrees
of freedom. (X is the sample mean and Y is the sample variance)
Exercise. [See HC 4.4.]
Given random variables X and Y are independently distributed as chi-square with ν1 , ν2
degrees of freedom, respectively, find the pdf of the random variable F defined by F =
ν2 X/ν1 Y .
Let V = Y and find the joint pdf of F and V , noting that the range space B = {(f, v) :
f > 0, v > 0}. You should find that |J| = ν1 v/ν2 . Find the marginal pdf of F , which you
should recognize as that for an Fν1 ,ν2 distribution. You should try the following substitution
to simplify the integration. Let s = v2 (1 + νν12f ).
3.3
Multivariate Transformations (One-to-One)
Note that in this extension, we will use X1 , X2 , . . . , Xn for the ‘original’ continuous
variables (rather than X and Y as we had for 2 variables) and U1 , U2 , . . . , Un or Y1 , . . . , Yn
are used for the ‘new’ variables (rather than U and V ).
32
Given random variables X1 , X2 , . . . , Xn with joint pdf fX (x1 , x2 , . . . , xn ) which is
non-zero on the n-dimensional space A. Define

u1 = g1 (x1 , x2 , . . . , xn ) 



u2 = g2 (x1 , x2 , . . . , xn ) 
..


.
un = gn (x1 , x2 , . . . , xn )
(3.3)



and suppose this is a one-to-one transformation mapping A onto a space B. Extending
(2.3) to this case we have, for the joint pdf of U1 , U2 , . . . , Un ,
fU (u1 , u2 , . . . , un ) = fX (x1 , x2 , . . . , xn ).abs|J|, where J =
∂xi
∂uj
!
.
(3.4)
[Note that J is the matrix of partial derivatives.]
3.4
Multivariate Transformations Not One-to-One
With the definitions of X1 , X2 , . . . , Xn , U1 , U2 , . . . , Un as in section 3.3, suppose now that
to each point of A there corresponds exactly one point of B, but that to each point of B
there may correspond more than one point of A.
Assume that we can represent A as the union of a finite number, k, of disjoint sets
A1 , A2 , . . . , Ak , such that (2.4) does represent a one-to-one mapping of each Aj onto
B, j = 1, . . . , k. That is, for each transformation of Aj onto B there is a unique inverse
transformation
xi = Gij (u1 , u2 , . . . , un ), i = 1, 2, . . . , n; j = 1, 2, . . . , k,
each having a non-vanishing Jacobian, |Jj |, j = 1, 2, . . . , k. The joint pdf of U1 , U2 , . . . , Un
is then given by
k
X
j=1
abs|Jj |f [G1j (u1 , . . . , un ), . . . , Gnj (u1 , . . . , un )]
for (u1 , u2 , . . . , un ) ∈ B.
The marginal pdf’s may be found in the usual way if required.
Example 3.6
Given X1 and X2 are independent random variables each distributed N (0, 1), so that
2
2
f (x1 , x2 ) = (2π)−1 e−(x1 +x2 )/2 , −∞ < x1 < ∞; −∞ < x2 < ∞,
define U1 = (X1 + X2 )/2, U2 = (X1 − X2 )2 /2 and find their joint distribution. The
transformation is not one to one since to each point in
B = {(u1 , u2 ) : −∞ < u1 < ∞, 0 ≤ u2 < ∞} there corresponds two points in
A = {(x1 , x2 ) : −∞ < x1 < ∞, −∞ < x2 < ∞}. There are two sets of inverse functions.
33
1
1
1
1
(i) x1 = u1 − (u2 /2) 2 ; x2 = u1 + (u2 /2) 2 .
(ii) x1 = u1 + (u2 /2) 2 ; x2 = u1 − (u2 /2) 2 .
From the definition of U2 , there is one type of mapping when x1 > x2 and another when
x2 > x1 . Consequently we define
A1 = {(x1 x2 ); x2 > x1 }
and
A2 = {(x1 , x2 ); x2 < x1 } .
Note that the line x1 = x2 has been omitted since when x1 = x2 we have u2 = 0. However,
since P (X1 = X2 ) = 0, excluding this line does not alter the distribution and we therefore
consider only A = A1 ∪ A2 .
Then (i) defines a one-to-one transformation of A2 onto B and (ii) defines a one-to-one
transformation of A1 onto B. Thus the joint pdf of (U1 , U2 ) is given by

1
1

 [u − (u /2) 2 ]2
1
1
[u1 + (u2 /2) 2 ]2 
1
2
fU1 ,U2 (u1 , u2 ) =
.(2u2 )− 2
exp −
−


2π
2
2

=
1
1

 [u + (u /2) 2 ]2
1
[u1 − (u2 /2) 2 ]2 
1
1
2
(2u2 )− 2
exp −
−
+


2π
2
2
1
1
2
(π)
1
2
e−u1 .
2
1
2
1
Γ( 12 )
u22
−1
e−u2 /2 , for (u1 , u2 ) ∈ B.
Comment: This also shows that U1 and U2 are stochastically independent.
3.5
Convolutions
Consider the problem of finding the distribution of the sum of 2 independent (but not
necessarily identically distributed) random variables. The pdf of the sum can be neatly
expressed using convolutions.
Theorem 3.1
Let X and Y be independent random variables with pdf’s fX , fY respectively, and
define U = X + Y . Then the pdf of U is
fU (u) =
Z
∞
−∞
fX (u − v)fY (v) dv.
Proof
34
(3.5)
Because of independence, the joint pdf of X and Y may be written
fX,Y (x, y) = fX (x)fY (y) .
Define V = Y and, noting that the Jacobian of the inverse transformation is 1, the joint
pdf of U and V is
fU,V (u, v) = fX (u − v)fY (v),
and hence the marginal pdf of U , found by integrating with respect to v, is as given in
(3.5).
Now fU (u) is called the convolution of fX and fY .
The following heuristic explanation may assist.
Equation (3.5) defines a convolution in the mathematical sense. Each single point of
fU (u) is formed by a weighted average of the entire density fY (v). The weights are the
other density fX (u − v) where its value depends on how far apart each v is from u. Thus
each single point of the density fU (u) arises from all the density fY (v).
Example 3.7
Random variables X and Y are identically and independently distributed (iid) uniformly on [0, 1]. Find the distribution of U = X + Y .
We note that fX (x) = 1, 0 ≤ x ≤ 1, fY (y) = 1, 0 ≤ y ≤ 1 and that the inverse
transformation is x = u − v, y = v with |J| = 1. The range space for (u, v) is determined
from
x≥0
x≤1
y≥0
y≤1
⇒
⇒
⇒
⇒
u − v ≥ 0,
u − v ≤ 1,
v≥0
v ≤ 1,
and is shown in the Figure 3.4.
So from Theorem 3.5,
fU (u) =
 R
 0u 1.dv,
 R1
u−1
that is, v ≤ u
that is, v ≥ u − 1
0≤u≤1
1.dv, 1 < u ≤ 2
resulting in what is sometimes called the triangular distribution,
fU (u) =
(
u,
0≤u≤1
2 − u, 1 < u ≤ 2
as in Example 3.2.
The method of convolutions is a special case of the transformation of variables, being
another method for finding the distribution of the sum of two variables. The problem that
is solved here could be solved using MGFs, viz,
MU (t) = MX (t)MY (t) =
35
et − 1
t
!2
v=u
v=u-1
Figure 3.4: Region B
v
v=1
1
since
Z
u
2
et − 1
,
MX (t) = Ee =
e dt =
t
0
but again the problem is to recognize the distribution from the resulting MGF.
Xt
3.6
1
xt
General Linear Transformation
Here we will use matrix notation to express the results of 3.3, and give a useful result
using moment generating functions.
The one-to-one linear transformation referred to in section 3.3 on page 32 can be written
in matrix notation as Y = AX, (using Y for the new variables rather that U). Here X and
Y are vectors of random variables and A is a matrix of constants. In particular, note that
E(X) is the vector whose components are E(X1 ), . . . E(Xp ), or µ1 , . . . , µp . The covariance
matrix of X (sometimes called the variance-covariance matrix) is frequently referred to as
cov(X), and is denoted by Σ. Note that it is a square matrix whose diagonal terms are
variances, and off-diagonal terms are covariances.
If A is non-singular so that there is an inverse transformation X = A−1 Y, and if X
has pdf fX (x), the corresponding pdf of Y is
fY (y) = fX (A−1 y)abs|J| = fX (A−1 y)abs|A−1 |.
(3.6)
Recall that the joint mgf of (X1 , X2 , . . . , Xp ) is expressed in matrix notation as
0
MX (t) = E et1 X1 +t2 X2 +...+tp Xp = E(et X )
provided this expectation exists. Now if Y = AX, so that Y is a p-dimensional random
vector, the mgf of Y is
0
0
0
0
MY (t) = E(et Y ) = E(et AX ) = E(e(A t) X ) = MX (A0 t)
36
(3.7)
3.7
Worked Examples : Bivariate MGFs
The distribution of linear combinations of independent random variables can sometimes
be determined by the use of moment generating functions.
Example 1
Find the distribution of W = X1 + X2 where X1 ∼ N (µ1 , σ12 ),
X2 ∼ N (µ2 , σ22 ) and X1 and X2 are independent.
Solution
MW (t) = E(eW t ) = E[eX1 t + X2 t ] = E(eX1 t ) × E(eX2 t )
since X1 and X2 are independent. Now
2 2
2 2
MW (t) = eµ1 t + σ1 t /2 × eµ2 t + σ2 t /2
2 2
2 2
MW (t) = eµ1 t + σ1 t /2 + µ2 t + σ2 t /2
2
2 2
MW (t) = e(µ1 + µ2 )t + (σ1 + σ2 )t /2
Thus
X1 + X2 ∼ N (µ1 + µ2 , σ12 + σ22 )
Example 2
In general, if X = aX1 + bX2 then
MW (t) = E(eW t ) = E[e(aX1 + bX2 )t ] = E(eaX1 t )E(ebX2 t )
ie
MW (t) = MX1 (t)MX2 (t)
Of course, the procedure relies on the resulting MGF being recognizable. Thus we
can now find the distribution of
W = X 1 − X2
Solution
2 2
2
2
MW (t) = eµ1 t + σ1 t /2 × eµ2 (−t) + σ2 (−t) /2
2 2
2 2
MW (t) = eµ1 t + σ1 t /2 − µ2 t + σ2 t /2
2
2 2
MW (t) = e(µ1 − µ2 )t + (σ1 + σ2 )t /2
37
Thus
X1 − X2 ∼ N (µ1 − µ2 , σ12 + σ22 )
The procedure can be extended to the n dimensional case, an indeed forms the basis
of one of the proofs for the Central Limit Theorem.
Central Limit Theorem
The general form of the CLT states that :
If X1 . . . Xn are iid rvs with mean µ and variance σ 2 , then
Z=
X −µ
√ ∼ N (0, 1) (asy).
σ/ n
Proof
Let the rvs Xi (i = 1 . . . n) have MGF
0
0
MXi (t) = 1 + µ1 t + µ2 t2 /2 + . . .
Now
X=
so
MX (t) =
n
Y
1
1
X1 + . . . + Xn
n
n
MXi /n (t) =
i=1
Now
n
Y
MXi (t/n) = [MXi (t/n)]n
i=1
√
n
Z=
(X − µ)
σ
to give
MZ (t) = e−(
Thus
√
n/σ)µt
√
h
√
in
√
MX [ n/σ]t = e−( n/σ)µt MXi [ n/σ](t/n)
√
"
0
n
µ t2
t
0
log MZ (t) = −
µt + n log 1 + µ1 √ + 2 2 + . . .
σ
σ n
2! σ n
√
"
#
n
t
t2
0
0
=−
µt + n µ1 √ + µ2 2 + . . .
σ
σ n
2σ n
√
2
n
t√
0 t
µt + µ
=−
n + µ2 2 −
σ
σ
2σ
Therefore, as n → ∞
0
log MZ (t) →
"
#2
n 0 t
t2
0
µ1 √ + µ 2 2 + . . .
−
2
σ n
2σ n
1 2 t2
1
µ 2 + . . . (terms in √ )
2 σ
n
(µ2 − µ2 ) t2
t2
=
σ2
2
2
38
#
+...
0
since σ 2 = µ2 − µ2 .
This is the MGF of a N (0, 1) rv, and as
σZ
X = √ +µ
n
then
X ∼ N (µ, σ 2 /n) (asy).
39
40
Chapter 4
Multivariate Normal Distribution
4.1
Bivariate Normal
If X1 , X2 have a bivariate normal distribution with parameters µ1 , µ2 , σ12 , σ22 , ρ, then the
joint pdf of X1 and X2 is
1
− 2(1−ρ
2)
h
(x1 −µ1 )2
σ12
−
2ρ(x1 −µ1 )(x2 −µ2 )
σ1 σ2
f (x1 , x2 ) = ke
√
where k = 1/2πσ1 σ2 1 − ρ2 .
Let X0 = (X1 , X2 ), µ0 = (µ1 , µ2 ), and define Σ by
Σ=
"
σ12
ρσ1 σ2
ρσ1 σ2 σ22
+
(x2 −µ2 )2
σ22
i
,
#
and we see that the joint pdf can be written in matrix notation as
−1
1
− 21 (x−µ)0 Σ (x−µ)
fX (x) =
e
1
2π|Σ| 2
where |Σ| is the determinant of Σ.
Check that |Σ| = σ12 σ22 (1 − ρ2 ) and that
Σ−1

1 
=
1 − ρ2
1
σ12
−ρ
σ1 σ2
−ρ
σ1 σ2
1
σ22


.
We write X ∼ N2 (µ, Σ). Read CB p175 or HC 3.5 to revise some of the properties of the
bivariate normal distribution, which can be regarded as a special case of the multivariate
normal distribution. This will be considered in the remainder of this chapter.
The following is a ’derivation’ of the Bivariate Normal from first principles, ie, from
two univariate independent normals. Let the two univariate independent normals be Z1
and Z2 . Then form the bivariate vector
Z1
Z2
Z=
41
!
The joint distribution function for Z is then
f (z1 , z2 ) =
1 −(z12 +z22 )/2
e
, ∞ < z 1 , z2 < ∞
2π
Alternatively
1 −(z 0 z )/2
1 −(z 0 Iz )/2
e
=
e
2π
2π
which is called the spherical normal.
Note that Z ∼ N (O, I).
If we now consider these two unit normals to be the result of transforming two bivariate
normal variables that are not necessarily independent, we can use a transformation that is
the two dimensional equivalent of the Z–score from univariate statistics. Thus we have
f (z) =
Z = P −1 (X − µ)
where
X=
X1
X2
µ=
µ1
µ2
and
!
!
The distribution of X will be the bivariate normal.
Using the general linear transformation from p 24 (3.6) of the Notes, we get
h
i
fX (x) = fZ z = P −1 (x − µ) abs|P −1 |
=
1 −[P −1 (x−µ)]0 [P −1 (x−µ)]/2
e
abs|P −1 |
2π
n
1 −(x−µ)0 P P
e
=
2π
It is easily verified that E(X) = µ since
0
o−1
(x−µ)/2
abs|P −1 |
E(X) = E[P Z + µ] = O + µ
0
but what of P P ?
The variance/covariance matrix of X is
0
Σ = E[(X − µ)(X − µ) ] = E(P Z)(P Z)
0
0
0
0
= E[P ZZ P ] = E(P IP ) = E(P P )
AS |P −1 | = |Σ|−1/2 we have
f (x) =
1
e−[(x−µ) Σ
1/2
2π|Σ|
0
42
−1
(x−µ)]/2
0
Exercise
Derive from this general form the equation for the bivariate normal.
4.2
Multivariate Normal (MVN) Distribution
The Multivariate Normal distribution has a prominent role in statistics as a consequence
of the Central Limit Theorem. For example, estimates of regression parameters are asymptotically Normal. (Some people prefer to call it a Gaussian distribution).
We will extend the notation of section 4.1 to p dimensions, so E(X)= µ is the vector
whose components are E(X1 ), . . . , E(Xp ) or µ1 , . . . , µp , and Σ = cov(X) is the variancecovariance matrix (p × p) whose diagonal terms are variances and off-diagonal terms are
covariances, and
Σ = cov(X) = E[(X − µ)(X − µ)0 ].
Definition 4.1
The random p-vector X is said to be multivariate normal if and only if the linear
function
a0 X = a 1 X 1 + . . . + a p X p
is normal for all a, where a0 = (a1 , a2 , . . . , ap ).
In loose statistical jargon, the terms ‘linear’ and ‘Normal’ are sometimes interchangeable. Where we have random variables that are ‘normal’, we can think of the components
as additive.
Theorem 4.1
If X is p-variate normal with mean µ and covariance matrix Σ (non-singular), then X
has a pdf given by
−1
1
− 12 (x−µ)0 Σ (x−µ)
e
fX (x) =
(4.1)
(2π)p/2 |Σ|1/2
Proof
We are given that E(X) = µ, E(X−µ)(X − µ)0 = Σ.
Since Σ is positive definite, there is a non-singular matrix P such that Σ =PP0 . [Chapter 1, sec 1.2, 6(b)(ii).] Consider the transformation Y = P−1 (X − µ). By Definition 4.1,
the components of Y are normal and
E(Y) = E(P−1 (X − µ)) = P−1 E(X − µ) = 0 since E(X) = µ,
cov(Y) = E(YY0 ) = P−1 E[(X − µ)(X − µ)0 ](P−1 )0 = P−1 Σ(P0 )−1 = P−1 PP0 (P0 )−1 = I,
43
So Y1 ,. . . ,Yp are iid N (0, 1) and their joint pdf is given by
fY (y) =
1
− 12 y0 y
e
.
(2π)p/2
Using (3.6), the density of X is
fX (x) = fY P−1 (x − µ) abs|P−1 |,
where
|P−1 | =
and the result follows.
1
1
1
=
0 1/2 =
|P|
|PP |
|Σ|1/2
Comments
1. Note that the transformation Y = P−1 (X − µ) is used to standardize X, in the
same way as
X −µ
Z=
σ
was used in univariate theory.
2. Note that when p = 1, equation (4.1) reduces to the pdf of the univariate normal.
3. The covariance matrix is symmetric, since cov(Xi , Xj ) = cov(Xj , Xi ).
4. It is often convenient to write X ∼ Np (µ, Σ).
5. Note that
Z
4.3
∞
−∞
...
Z
∞
−∞
−1
1
1
− 12 (x−µ)0 Σ (x−µ)
dx1 . . . dxp = |Σ| 2 .
e
p/2
(2π)
(4.2)
Moment Generating Function
We will now derive the mgf for a p-variate normal distribution and see how it can be
used in deriving other results.
Theorem 4.2
Given X ∼ Np (µ, Σ) and t0 = (t1 , t2 , . . . , tp ) a vector of real numbers, then the mgf of X
is
1 0
0
MX (t) = et µ+ 2 t Σt .
(4.3)
Proof
44
There exists a non-singular matrix P so that Σ = PP0 . Let Y = P−1 (X − µ). Then
Y ∼ Np (0, I) from the proof of Theorem 4.2. That is, each Yi ∼ N (0, 1) and we know
1 2
that MYi (t) = E(eYi t ) = e 2 t . Now
0
MY (t) = E(eY1 t1 +...+Yp tp ) = E(et Y )
= E(eY1 t1 )E(eY2 t2 ) . . . E(eYp tp )
1 2
1 2
1 2
1 e 2 t2 . . . e 2 tp
= e 2 tP
1
2
= e 2 ti
1 0
= e2t t
Also
0
E(et X )
0
E(et (µ+PY) )
0
0
E(et µ et PY )
0
0 0
et µ E(e(P t) Y )
0
et µ MY (P0 t)
1
0
0 0
0
= et µ .e 2 (P t) (P t)
1 0
0
= et µ+ 2 t Σt , putting Σ for PP0 .
MX (t) =
=
=
=
=
Comments
1. Note that when p = 1, Theorem 4.2 reduces to the familiar result for a univariate
normal.
2. If X is multivariate normal with diagonal covariance matrix, then the components of
X are independent.
3. The marginal distributions of a multivariate normal are all multivariate (or univariate) normal. eg.
"
#
"
# "
X1
µ1
∼ N
,
X2
µ2
X1 ∼ N (µ1 , Σ11 )
X2 ∼ N (µ2 , Σ22 )
Σ11 Σ12
Σ21 Σ22
#!
4. If X is multivariate normal, then AX is multivariate normal for any matrix A (of
appropriate dimension).
For a r.v X where
E(X) = µ,
var(X) = Σ ,
E(AX) = Aµ, var(AX) = AΣA0
45
5. We also note for future reference the conditional distributions,
(X 2 |X 1 = x1 ) ∼ N (µ2.1 , Σ22.1 )
where µ2.1 = µ2 + Σ21 Σ−1
11 (x1 − µ1 )
Σ22.1 = Σ22 − Σ21 Σ−1
11 Σ12
Comments
The marginal distributions of a multivariate normal
(see H and C p229 4.134)
The MGF of the MVN can be written as
0
0
1
MX (t) = et µ+ 2 t Σt = M(X 1 ,X 2 ) (t1 , t2 )
where X =
X1
X2
!
,t=
t1
t2
!
and µ =
So
µ1
µ2
!
0
.
0
MX 1 (t1 ) = et1 µ1 + 2 t1 Σ11 t1
where
1
Σ11 Σ12
Σ21 Σ22
Σ=
!
.
By setting t2 = 0, we obtain
X 1 ∼ N (µ1 , Σ11 ).
Similarly, by setting t1 = 0, we get
X 2 ∼ N (µ2 , Σ22 ).
Conditional distributions
Decomposing X as before, we form
X1
X 2 − BX 1
AX =
=
0
AΣA =
X1
X2
#
Σ11 Σ12
Σ21 Σ22
!
I
O
−B I
Now
I
O
−B I
!
!"
!
46
I −B
O I
0
!
=
I
O
−B I
!
0
Σ11 −Σ11 B + Σ12
0
Σ21 −Σ21 B + Σ22
!
0
=
−Σ11 B + Σ12
Σ11
0
0
−BΣ11 + Σ21 −BΣ11 B + BΣ12 − Σ12 B + Σ22
!
We now choose the matrix B, so that the off diagonal matrices become O, so that X 1
and X 2 − BX 1 are independent. This implies that B = Σ21 Σ−1
11 , to give
0
AΣA =
Σ11 O
O
−Σ21 Σ−1
11 Σ12 + Σ22
!
(verify!). Thus we now have
(X 2 − BX 1 ) ∼ Nn2 (µ2 − Bµ1 , Σ22.1 )
where Σ22.1 = Σ22 − Σ21 Σ−1
11 Σ12 .
(The length of the vector X 2 is n2).
Since we may treat X 1 as a constant, then
(X 2 |X 1 = x1 ) ∼ N (µ2 − Bµ1 + Bx1 , Σ22.1 )
and so
(X 2 |X 1 = x1 ) ∼ N µ2 + Σ21 Σ−1
11 (x1 − µ1 ) , Σ22.1
as previously stated.
Exercise
Using the MGF/CGF derive the first five moments of the Bivariate Normal.
4.4
Independence of Quadratic Forms
We will consider here some useful results involving quadratic forms in normal random
variables.
Theorem 4.3
Suppose X1 , X2 , . . . , Xp are identically and independently distributed as N (0, 1) and
let X0 = (X1 , X2 , . . . , Xp ). Define Q1 and Q2 by
Q1 = X0 BX, Q2 = X0 CX,
where B and C are p × p symmetric matrices with ranks less than or equal to p. Then Q1
and Q2 are independent if and only if BC = 0.
47
Proof
Firstly note that X0 BX and X0 CX are scalars so that Q1 and Q2 each have univariate
distributions. We will find the joint mgf of Q1 and Q2 . Note that the pdf of X is given by
(4.1) with µ = 0 and Σ = I, so we have
0
0
MQ1 ,Q2 (t1 , t2 ) = E(et1 X BX+t2 X CX )
Z ∞
Z ∞
1 0
1
0
0
et1 x Bx+t2 x Cx− 2 x x dx1 . . . dxp
=
...
p/2
−∞
−∞ (2π)
1
=
(2π)p/2
Z
∞
−∞
...
Z
∞
1
−∞
0
e− 2 x (I−2t1 B−2t2 C)x dx1 . . . dxp
1
= |I − 2t1 B − 2t2 C|− 2 , using (4.2),
for values of t1 , t2 which make I − 2t1 B − 2t2 C positive definite. Now the mgf’s of the
marginal distributions of Q1 and Q2 are MQ1 ,Q2 (t1 , 0), MQ1 ,Q2 (0, t2 ) respectively. That is,
1
MQ1 (t1 ) = |I − 2t1 B|− 2 ,
1
MQ2 (t2 ) = |I − 2t2 C|− 2 .
Now Q1 and Q2 are independent if and only if
MQ1 ,Q2 (t1 , t2 ) = MQ1 (t1 )MQ2 (t2 ) .
That is, if
|I − 2t1 B − 2t2 C| = |I − 2t1 B||I − 2t2 C|
= |I − 2t1 B − 2t2 C + 4t1 t2 BC|.
This is true if and only if BC = 0.
[Note that the 0 here is a p × p matrix with every entry zero.]
The matrices B, C are projection matrices. Q1 is the shadow of (X 0 X) in the B plane
and Q2 is the shadow of X 0 X in the C plane. Q1 and Q2 will be independent if B ⊥ C
since in that case, none of the information in Q1 is contained in Q2 .
Example 4.1
If X1 , X2 , . . . , Xp are iid N (0, 1) random variables and X and S 2 are defined by
X =
S2 =
p
X
i=1
p
X
i=1
Xi /p
(Xi − X)2 /(p − 1),
48
2
show that S 2 and pX are independent.
Outline of proof
2
We need to write both S 2 and pX as quadratic forms. It is easy to verify that (p −
1)S 2 = X0 BX where


1 − 1p − 1p . . . − 1p


 −1
1 − 1p . . . − 1p 
p


B=
.. 
..
..

. 
.
.


1
1
−p
− p . . . 1 − 1p
2
and that pX = X0 CX where

1
p
1
p
...
1
p
1
p
1
p
...
1
p
 .
.
C=
 .

.. 
. 

..
.
and that BC = CB = 0, implying independence.
Proof in detail
Now
0
(p − 1)S 2 = X BX, B = I − I/p
where I is a matrix of ones. To verify this, note that


X1
 .

2
. 
(p − 1)S = [X1 , X2 , . . . , Xp ] [I − I/p] 
 .

Xp


That is

= [X1 , . . . , Xp ] 

X1
O
..
.
O
Xp
(p − 1)S 2 = (X12 + . . . + Xp2 ) − (
X

 
−
 
i
0
pX = X CX
where
C = [I/p]
49
..
.
P
i
Xi /p
i
Xi /p
Xi )2 /p =
If we define
2
 P
X
i




Xi2 − pX
2
then


X1

 .
2
. 
pX = [X1 , . . . , Xp ] [I/p] 

 .
Xp
=
as expected.
Thus
 P

[X1 , . . . , Xp ] 

..
.
i
P
i

Xi /p



Xi /p
=
(
P
i
Xi ) 2
2
= pX
p
BC(= CB) = O = [I − I/p] [I/p] = I/p − II/p2

=
1
 .
 .
 .
...
..
.


1
1
 .
.. 
 .
. 
 .
p
1 ... 1
1
−
2
p

=


p
1
 .
.. 
 .
p
. 
 .
p
1 ... 1
−
p2
1
 .
 .
 .
...
..
.

1
1
 .
.. 
 .
. 
 .
1
... 1
p
...
..
.
...
..
.

p
.. 
. 

... p
p2
...
..
.

1
.. 
. 

... 1
p
=O
2
and so S 2 and pX are independent.
4.5
Distribution of Quadratic Forms
Consider the quadratic form Q = X0 BX where B is a p × p matrix of rank r ≤ p. We
will find the distribution of Q, making certain assumptions about B.
We use the cumulant generating function as a mathematical tool to derive the results.
Knowledge of cumulants up to a given order is equivalent to that of the corresponding
moments. Although moments have a direct physical or geometric interpretation, cumulants
sometimes have an advantage, due to :
• the vanishing of the cumulants for the normal distribution,
• their behaviour for sums of independent random variables, and
• especially in the multivariate case, their behaviour under linear transformation of the
random variables concerned.
Theorem 4.4
Given X is a vector of p components, X1 , . . . , Xp distributed iid N (0, 1), and
Q = X0 BX where B is a p × p matrix of rank r ≤ p, the distribution of Q
50
(i) has sth cumulant, κs = 2s−1 (s − 1)!tr(B s )
(ii) is χ2r if and only if B is idempotent (that is, B 2 = B).
Proof
Now there is an orthogonal matrix P which transforms Q into a sum of squares. That
is, let X = PY, and
Q = X0 BX = Y 0 P0 BPY = Y 0 ΛY
where Λ is a diagonal matrix with elements λ1 , λ2 , . . . , λp , the eigenvalues of B. Now
exactly r of these are non-zero where r = rank(B). So
Q=
r
X
λi Yi2 .
(4.4)
i=1
Now if X ∼ Np (0, I), then Y = P−1 X is distributed as p-variate normal with
E(Y) = P−1 E(X) = 0
and
cov(Y) =
=
=
=
E(YY0 ) = E(P−1 XX0 (P−1 )0 )
P−1 E(XX0 )(P0 )−1
(P0 P)−1 since E(XX0 ) = I
I
So Y ∼ Np (0, I).
Consider now the ith component of Y. Since Yi ∼ N (0, 1) it follows that Yi2 ∼ χ21 and
has mgf
1
MYi2 (t) = (1 − 2t)− 2 .
So λi Yi2 has mgf
1
Mλi Yi2 (t) = (1 − 2λi t)− 2 ,
and Q, defined by (4.4), has mgf
MQ (t) =
r
Y
i=1
1
(1 − 2λi t)− 2 ,
(4.5)
since the Yi are independent. The cumulant generating function (cgf) is
KQ (t) = log MQ (t)
r
1X
= −
log(1 − 2λi t)
2 i=1
"
r
1X
22 λ2i t2 23 λ3i t3
−
−...
= −
−2λi t −
2 i=1
2
3
=
"
r
X
i=1
λi t +
t
2λ2i
2
2!
+...+
51
t
2s−1 λsi
#
s
s!
(s − 1)! + . . .
#
(i) So the sth cumulant of Q, κs is
κs = 2s−1 (s − 1)!
Now
Pr
i=1
s
λ
i=1 i =
Pr
r
X
λsi , s = 1, 2, 3, . . .
(4.6)
i=1
λsi is the sum of elements of the leading diagonal of B s . That is,
tr(B s ). So (4.6) can be written
κs = 2s−1 (s − 1)! tr(B s ).
(4.7)
(ii) Now for a χ2r distribution the mgf is (1 − 2t)−r/2 , the cgf is − 2r log(1 − 2t), and the
sth cumulant is
2s−1 (s − 1)!r.
(4.8)
So if Q ∼ χ2r the sth cumulant must be given by (4.8).
Comparing with (4.7), we must have
tr(B s ) = r = tr(B).
That is, B s = B, and B is idempotent.
On the other hand, if B is idempotent, r of the λi = 1 and the others are 0, so from
P
(4.4), Q = ri=1 Yi2 , and Q ∼ χ2r .
The following theorems (stated without proof) cover more general cases.
Theorem 4.5
Let X ∼ Np (0, σ 2 I) and define Q = X0 BX where B is symmetric of rank r. Then
Q/σ 2 ∼ χ2r if and only if B is idempotent.
What form might B take?
See if the projection matrices X(X 0 X)−1 X 0 and I − X(X 0 X)−1 X 0 are idempotent.
Theorem 4.6
Let X ∼ Np (0, Σ) where Σ is positive definite. Define Q = X0 BX where B is symmetric of rank r. Then Q ∼ χ2r if and only if BΣB = B.
Example
(After H and C p485)
52
If X1 , X2 , X3 are iid N (0, 8), and


1/2 0 1/2
1 0 
B=
 0

1/2 0 1/2
show that
X 0 BX
∼ χ22 .
8
Solution
Now r(B) = 2, B is symmetric and idempotent, since





1/2 0 1/2
1/2 0 1/2
1/2 0 1/2



1 0 
1 0 
1 0 
B2 =  0

= 0
 0
1/2 0 1/2
1/2 0 1/2
1/2 0 1/2
Because B is idempotent, then
X 0 BX
∼ χ22
8
by Thm 4.5.
√ , and claim
Alternatively, we could use Thm 4.4 on X ∗ = X
8
0
X ∗ BX ∗ ∼ χ22
ie
X 0 BX
∼ χ22
8
since X ∗1 , . . . , X ∗3 ∼ N (0, 1).
Notes
1. Thm 4.5 as shown in the example is a trivial application of Thm 4.4 with the original
variables divided by σ.
2. Thm 4.6 is more general.
Again
Q = X 0 BX
but now we define
Z = P −1 X
53
where
Σ = PP0
ie,
X = PZ
Thus
Q = X 0 BX = (P Z)0 BP Z = Z 0 P 0 BP Z = Z 0 EZ
say, then Q ∼ χ2r iff E is idempotent, by Thm 4.4. This means that E 2 = E, ie,
(P 0 BP ) (P 0 BP ) = P 0 BP
Thus
P 0 B[P P 0 ]BP = P 0 BP
ie
P 0 (BΣB) P = P 0 (B) P
to give the condition
BΣB = B
as required.
3. If we have
X ∼ N (µ, Σ)
then the distribution of Q involves the non–central χ2 which is covered in Chapter 6.
4.6
Cochran’s Theorem
This is a very important theorem which allows us to decompose sums of squares into several
quadratic forms and identify their distributions and establish their independence. It can
be used to great advantage in Analysis of Variance and Regression. The importance of the
terms in the model is assessed via the distributions of their sums of squares.
Theorem 4.7
Given X ∼ Np (0, I), suppose that X0 X is decomposed into k quadratic forms,
Qi = X0 Bi X, i = 1, 2, . . . , k, where the rank of Bi is ri and the Bi are positive semidefinite, then any one of the following conditions implies the other two.
(a) the ranks of the Qi add to p;
(b) each Qi ∼ χ2ri ;
(c) all the Qi are mutually independent.
54
Proof
We can write
X0 X = X0 IX =
k
X
X0 Bi X.
i=1
That is,
I=
k
X
Bi .
i=1
(i) Given (a) we will prove (b).
Select an arbitrary Qi , say Q1 = X0 B1 X. If we make an orthogonal transformation
X = PY which diagonalizes B1 , we obtain from
X0 B1 X + X0 (I − B1 )X = X0 IX
Y 0 P0 B1 PY + Y 0 P0 (I − B1 )PY = Y 0 B0 IBY
= Y 0 IY.
(4.9)
Since the first and last terms are diagonal, so is the second. Since r(B1 ) = r1 and
therefore r(P0 B1 P) = r1 , p − r1 of the leading diagonal elements of P0 B1 P are zero.
Thus the corresponding elements of P0 (I − B1 )P are 1 and since by (a) the rank
of P0 (I − B1 )P is p − r1 , the other elements of its leading diagonal are 0 and the
corresponding elements of P0 B1 P are 1. Hence from Theorem 4.4, Q1 ∼ χ2r1 and B1
is idempotent.
The same result holds for the other Bi and we have established (b) from (a).
(ii) Given (b) we will prove (c).
I = B1 + B2 + . . . + Bk
(4.10)
and (b) implies that each Bi is idempotent (with rank ri ). Choose an arbitrary Bi ,
say Bj . There is an orthogonal matrix C such that
0
C Bj C =
"
Irj 0
0 0
#
.
Premultiplying (4.10) by C0 and post-multiplying by C, we have
0
C IC = I =
k
X
i=1,i6=j
0
C Bi C +
"
Irj 0
0 0
#
.
Now each C0 Bi C is idempotent and can’t have any negative elements on its diagonal.
So C0 Bi C must have the first rj leading diagonal elements 0, and submatrices for
55
rows rj + 1, . . . , p, columns 1, . . . , rj and for rows 1, . . . , rj , columns rj + 1, . . . , p
must have all elements 0. So
C0 Bi CC0 Bj C = 0 ,
i = 1, 2, . . . , k, i 6= j,
and thus C0 Bi Bj C = 0 which can only be so if Bi Bj = 0.
Since Bj was arbitrarily chosen, we have proved (c) from (b).
(iii) Given (b) we will prove (a).
If (b) holds, Bi has ri eigenvalues 1 and p − ri zero and since I =
P
we have p = ri .
P
Bi , taking traces
(iv) Given (c) we will prove (b).
If (c) holds, taking powers of I =
integers s. Taking traces we have
tr(
k
X
Pk
i=1
Bi , we have
Pk
i=1
Bsi = I for all positive
Bsi ) = p , for all s.
i=1
This can hold if and only if every eigenvalue of Bi is 1. That is, if each Qi ∼ χ2 .
So we have proved (b) from (c).
A more general version of Cochran’s Theorem is stated (without proof) in Theorem 4.8.
Note that
X 0 X = X 0 IX =
k
X
X 0BiX
i
so that
k
X
I=
Bi
i
The logic of the three conditions can be summarised in Table 4.1.
(1)
(2)
(3)
(4)
a
b
b
c
→
→
→
→
b
c
a
b
Table 4.1: Logic table for Cochran’s theorem
Thus (1) and (2) imply that ’a’ implies ’b’ then ’c’.
56
Also (2) and (3) directly shows that ’b’ implies ’a’ and ’c’,
while (3) and (4) mean that ’c’ implies ’b’ then ’a’.
Theorem 4.8
Given X ∼ Np (0, σ 2 I), suppose that X0 X is decomposed into k quadratic forms, Qi =
X Bi X, r = 1, 2, . . . , k, when r(Bi ) = ri . Then Q1 , Q2 , . . . , Qk are mutually independent
P
and Qi /σ 2 ∼ χ2ri if and only if ki=1 ri = p.
0
Proof
Let X ∗ = X/σ and use Thm 4.7 (a) on X ∗ .
Example 4.2
We will consider again Example 4.1 from the point of view of Cochran’s Theorem.
Recall that X1 , . . . , Xp are iid N (0, 1) and
p
X
i=1
That is,
2
(xi − x) =
X
p
X
i=1
x2i
−
(
P
p
xi ) 2 X
x2i − px2 .
=
p
i=1
x2i = (p − 1)s2 + px2 ,
where s2 is defined in the usual way. Equivalently,
X0 IX = X0 BX + X0 CX,
where B and C are defined in Example 4.1.
We can apply Cochran’s Theorem, noting that we can easily show that (a) is true, since
r(I) = p, r(B) = p − 1 and r(C) = 1 where
B 1 = B, B 2 = C
in the notation of Example 4.1 and Thm 4.7.
So we may conclude that
X
2
Xi2 ∼ χ2p , νS 2 ∼ χ2ν where ν = p − 1, and pX ∼ χ21 .
and that X and S 2 are independent.
Note that
X̄ − 0
√ ∼ N (0, 1)
1/ p
57
leading to pX̄ 2 ∼ χ21 .
The statements about the rank of B and C can be confirmed by row and column
operations on B and C.
For example, C can be reduced to columns of zeros except for the last by subtracting
the last column on the rhs from the rest. The resulting last row from the top can be
subtracted from the rest, to give a single non zero entry, showing the rank of C as one.
For B, multiply all rows by p, then add cols 1 to (p − 1), counting from lhs, to col p.
Then subtract the resulting row p from rows 1 to (p − 1), counting down from the top of
the matrix. Add 1/p of the resulting rows 1 to (p − 1) to row p, and divide by p to show
the rank of B as (p − 1).
Query
Page 3 , (ii), Notes.
A necessary and sufficient condition for a symmetric matrix A to be positive
definite is that there exists a nonsingular matrix P such that P P 0 = A.
1. If A = P P 0 then x0 (P P 0 )x = (P 0 x)0 P 0 x = Z 0 Z > 0, ∀ Z. Thus if A = P P 0 then
A is positive definite.
2. The reverse requires that if A is positive definite, then A can be written as P P 0 .
If A is pd, then λi > 0 ∀ i, which means that there exists an R such that x = Ry
for which x0 Ax = y 0 R0 ARy = y 0 Dy where D = diag(λ1 , . . . , λn ).
√
√
If we define D = dd where d = diag( λ1 , . . . , λn ) and define w = dy, then as a
check
Thus
0
−1
x0 Ax = d−1 w Dd−1 w = w0 d0 Dd−1 w = w 0 w > 0 ∀ w
x = Ry = Rd−1 w
so
0
0
x0 Ax = Rd−1 w A Rd−1 w = w 0 Rd−1 ARd−1 w
which leads to
and so
ARd−1 = Rd−1
0
Rd−1 ARd−1 = I
0 −1
=⇒ A = Rd−1
0 −1 Rd−1
−1
So, if A is positive definite, then A can be written as A = P P 0 .
58
= PP0
Chapter 5
Order Statistics
5.1
Introduction
Parametric statistics allows us to reduce the data to a few parameters which makes it
easier to interpret the data. Statistics such as the mean and variance describe the pattern
of random events and allow us to evaluate the probability of events of interest. Under
the assumption that the data follow a known parametric distribution, we estimate the
parameters from the data. However, the usefulness of the parameters depends on the
assumptions about the data being reliable and this is not necessarily guaranteed.
One strategy for interpreting data without stringent assumptions, is to use order statistics.
Read CB 5.4 or HC 4.6. In the following, we will use the notation of HC where the
pdf of the random variable X is denoted by f (x), rather than fX (x).
pdf of the random variable X is denoted by f (x), rather than fX (x).
Definition 5.1
Let X1 , X2 , . . . Xn denote a random sample from a continuous distribution with pdf
f (x), a < x < b. Let Y1 be the smallest of these, Y2 the next Xi in order of magnitude, etc. Then Yi , i = 1, 2, . . . , n is called the ith order statistic of the sample, and
(Y1 , Y2 , . . . , Yn ) the vector of order statistics. We may write Y1 < Y2 < . . . < Yn . The
following alternative notation is also common; X(1) < X(2) , < . . . , < X(n) .
Order statistics are non-parametric and only rely upon the weak assumption that the
data are samples from a continuous distribution. we pick up information by ordering the
data. If we know the underlying distribution, we can combine that knowledge with the
rank of the order statistic of interest. For instance, if the underlying distribution is normal, Y50 from a sample of size 101 will have a higher probability of being near the median
than Y10 or Y90 . But without ordering, the same could not be said for X10 , X50 , X90 . So
the ordering gives us extra information and we shall now explore the densities of order
statistics, denoted fYr (y) etc.
59
Example 1
Suppose you were required to assess the ability to handle a crowd at a railway station
with regard to stair width, staff etc. The statistic of interest is Yn .
Example 2
An oil product freezes at ≈ 10◦ C and the company ponders whether it should market
it in a cold climate. We would require the density of the minimum order statistic, fY1 (y),
to assess the risk of the product failing.
Examples of other situations where an order statistic is of interest are :
1. Largest component Yn : maximum temperature, highest rainfall, maximum storage
capacity of a dam, etc.
2. Smallest component Y1 : minimum temperature, minimum breaking strength of rope,
etc.
3. Median. Median income, median examination mark, etc.
Order statistics are useful for summarizing data but may be limited for detailed descriptions of some process which has been measured. Order statistics are also ingredients
for higher level statistical procedures.
1
Figure 5.1 shows the sample cdf as a step function increasing by n+1
at each order
statistic. We can make statements about individual order statistics by borrowing information provided by the entire set. Remember that all we assumed about the original data
was that it were continuous; there were no assumptions about the distribution. But now
that the data are ordered, we can use the extra information provided by the ordering to
derive density functions.
The data, X1 . . . Xn might be independent but the ordered data Y1 . . . Yn , are not.
To begin our study of order statistics we first want to find the joint distribution of
Y1 , . . . , Y n .
5.2
Distribution of Order Statistics
The following theorem is proved in CB p230 or HC (page 193–195) for k = 3.
Theorem 5.1
60
Figure 5.1: The cdf of order statistics, steps of
F (y)
1
.
n+1
6
1
n
n+1
..
.
4
n+1
3
n+1
2
n+1
1
n+1
-
0
Y1
Y2
Y3 Y4
Y5
...
Yn−1
y
Yn
If X1 , X2 , . . . , Xn is a random sample of size n from a continuous population with pdf
f (x), a < x < b, then the order statistic Y = (Y1 , Y2 , . . . , Yn ) has joint pdf given by
fY (y1 , y2 , . . . , yn ) = n!
n
Y
f (yi ), a < y1 < y2 < . . . < yn < b.
(5.1)
i=1
Comment:
The proof essentially uses the change of variable technique for the case when the transformation is not one to one. That is, we have a transformation of the form
Y1 = smallest observation in (X1 , X2 , . . . , Xn )
Y2 = second smallest observation in (X1 , X2 , . . . , Xn )
.
.
.
Yn = largest observation in (X1 , X2 , . . . , Xn ). This has n! possible inverse transformations. You should read and understand the proof given in HC.
61
The Jacobian for the transformation is:-
5.3
n
0 0 . . . 0 0 (n − 1)
. . . 0 ..
.. ..
.. .. = n!
.
. .
. . 0
1 Marginal Density Functions
Before engaging in the theory we do a sketch of the information we are modelling.
When we derive the distribution of a single order statistic, we divide the underlying
distribution into 3.
Figure 5.2: An order statistic in an underlying distribution
f(y)
2
f(yr )
1
3
yr
y
The observed value of the rth order statistic is yr . ie Yr = yr . This is a random variable
(Yr will have a different value from a new sample) with density f (yr ). We have
1. (r − 1) observations < yr with probability F (yr ),
2. 1 observation with density f (yr ),
3. (n − r) observations > yr with probability 1 − F (yr ).
Ordering and classifying by Yr has produced a form similar to a multinomial distribution
with 3 categories , < yr ,= yr ,> yr , and associated with these categories we have the entities
F (yr ),f (yr ),1 − F (yr ).
62
The multinomial density function for 3 categories is
P (X1 = x1 , X2 = x2 , X3 = x3 ) =
n!
× pn1 1 pns 2 pn3 3 .
n1 !n2 !n3 !
The density of an order statistic has a similar form,
fYr (y) =
n!
× [F (yr )](r−1) × f (yr ) × [1 − F (yr )](n−r) .
(r − 1)!1!(n − r)!
Note there are 3 components of the density corresponding to the 3 categories.
For 2 order statistics, Yr , Ys , there are 5 categories,
1. y < yr
2. y = yr
3. yr < y < ys
4. y = ys
5. y > ys
Figure 5.3: Two order statistics in an underlying distribution
f(y)
2
f(yr )
4
f(ys)
1
3
yr
5
y
ys
63
From the same analogy to multinomials used for a single order statistic, there are 5
components to the joint density,
fYr ,Ys (yr , ys ) =
n!
×
(r − 1)!1!(n − r − 1)!1!(n − s)!
[F (yr )](r−1) × f (yr ) × [F (ys) − F (yr )]s−r−1 × f (ys ) [1 − F (ys )](n−s) .
Formal derivation of the marginal densities of order statistics
Since we know the pdf of Y = (Y1 , Y2 , . . . , Yn ) is given by (5.1), the marginal pdf of
the rth smallest component, Yr , can be found by integrating over the remaining (n − 1)
variables. Thus
fYr (yr ) =
Z
yr
Z
yr−1
−∞ −∞
...
Z
y2
−∞
"Z
∞
yr
...
Z
∞
yn−1
n!
n
Y
#
f (yi )dyn . . . dyr+1 dy1 . . . dyr−1 .
i=1
(5.2)
(The parentheses are inserted as a guide to the integration – they are not actually required.)
Notice the order of integration used is to first integrate over yn , then yn−1 , . . . and then
yr+1 (this is the part of (5.2) enclosed by the parentheses). This is followed by integration
over y1 , then y2 , . . ., and finally over yr−1 . The limits of integration are obtained from the
inequalities.
∞ > yn > yn−1 > . . . > yr+1 > yr
and
−∞ < y1 < y2 < . . . < yr−1 < yr .
In order to integrate (5.1), we first have
Z
∞
yr
=
...
Z
Z
∞
yn−2
∞
yr
Z
...
∞
n
Y
yn−1 i=r+1
Z
Z
∞
yn−2
f (yi )dyn . . . dyr+1
[1 − F (yn−1 )]f (yn−1 )
=
=
[1 − F (yr )]n−r
, on simplification.
(n − r)!
yr
...
∞
yn−3
f (yi )dyn−1 . . . dyr+1
i=r+1
Z
∞
n−2
Y
n−3
Y
[1 − F (yn−2 )]2
f (yn−2 )
f (yi )dyn−2 . . . dyr+1
2!
i=r+1
(5.3)
Similarly
Z
yr
Z
yr−1
−∞ −∞
...
Z
y2 r−1
Y
−∞ i=1
f (yi )dy1 . . . dyr−1 =
Z
yr
−∞
64
...
Z
y3
−∞
F (y2 )f (y2 )
r−1
Y
i=3
f (yi )dy2 . . . dyr−1
Z
=
yr
−∞
...
Z
y4
−∞
r−1
Y
[F (y3 )]2
f (yi )dy3 . . . dyr−1
f (y3 )
2!
i=4
= [F (yr )]r−1 /(r − 1)!, on simplification.
(5.4)
Hence using (5.3) and (5.3) in (5.2), we obtain
fYr (yr ) = n!f (yr )
= n!f (yr )
Z
yr
−∞
...
Z
y2
−∞
"
#
Y
[1 − F (yr )]n−r r−1
f (yi )dy1 . . . dyr−1
(n − r)!
i=1
[1 − F (yr )]n−r [F (yr )]r−1
.
(n − r)!
(r − 1)!
so that the marginal p.d.f. of Yr is given by
fYr (yr ) =
n!
[F (yr )]r−1 [1 − F (yr )]n−r f (yr ), for − ∞ < yr < ∞ . (5.5)
(n − r)!(r − 1)!
The probability density functions of both the minimum observation (r = 1) and the
maximum observation (r = n) are special cases of (5.5).
For r = 1,
fY1 (y1 ) = n[1 − F (y1 )]n−1 f (y1 ), −∞ < y1 < ∞
(5.6)
fYn (yn ) = n[F (yn )]n−1 f (yn )
(5.7)
For r = n,
The integration technique can be applied to find the joint pdf of two (or more) order
statistics, and this is done in 5.4. Before examining that, we will give an alternative (much
briefer) derivation of (5.7).
Let the cdf of Yn be denoted by FYn . For any value y in the range space of Yn , the cdf
of Yn is
FYn (y) = P (Yn ≤ y) = P(all n observations ≤ y)
= [P(an observation ≤ y)]n
= [F (y)]n
The pdf of Yn is thus fYn (y) = FY0 n (y) = n[F (y)]n−1 .f (y), a < y < b.
Of course y in the above is just a dummy, and could be replaced by yn to give (5.7).
Exercise
Use this technique to prove (5.6).
65
Example 5.1
Let X1 , . . . Xn be a sample from the uniform distribution f (x) = 1/θ , 0 < x < θ. Find a
100(1 − α)% CI for θ using the largest order statistic, Yn .
By definition,
0 < Y 1 < Y2 < . . . < Y n < θ .
So Yn will suffice as the lower limit for θ. Given the information gleaned from the order
statistics, what is the upper limit? Using the above result for the density of the largest
order statistic,
fYN (yn ) = n [F (yn )]n−1 f (yn )
y n−1
= n nn
θ
Choose c such that,
P (cθ < Yn < θ)
Z θ
nynn−1
dyn
θn
cθ
n θ
yn
θ n cθ
1 − cn
= 1−α
= 1−α
= 1−α
1
= 1 − α ⇒ c = αn
Therefore,
1
P θα n < Yn < θ = 1 − α
1
1
1
<
<
P
= 1 − α since monotone decreasing
1
θ
Yn
n
θα
Yn
P Yn < θ < 1
= 1−α
αn
1
A 100(1 − α)% CI for θ is given by (Yn , Yn α− n ).
Verification of the multinomial formulation for marginal distributions
For n = 2, the multinomial formulation becomes a binomial.
smallest
Now P1 ∝ f (y1 ), and P2 = P (Y2 > y1 ) = 1 − F (y1 ) = P (obs > y1 ),
So
66
a
P1
P2
↓ − − − →
Y1
Y2
b
1
1
fY1 (y1 ) ∝ 2P11 P21 = 2f (y1 )[1 − F (y1 )], a < y1 < b
in agreement with equation (5.6) for n=2.
largest
P2
P1
← − − − ↓
a
Y1
Y2
1
1
b
Now P1 ∝ f (y2 ), and P2 = P (Y1 < y2 ) = F (y2 ) = P (obs < y2 ),
So
fY2 (y2 ) ∝ 2P11 P21 = 2f (y2 )F (y2 ), a < y2 < b
in agreement with equation (5.7) for n=2.
For n = 3, we have a trinomial. Consider the median Y2 .
P2
P1
P3
← − − − ↓ − − − →
a
Y1
Y2
Y3
b
1
1
1
Now P1 ∝ f (y2 ), P2 = P (obs < y2 ) = F (y2 ) and
P3 = P (obs > y2 ) = 1 − F (y2 ) to give
fY2 (y2 ) ∝ 3!P11 P21 P31 = 6f (y2 )F (y2 )[1 − F (y2 )], a < y2 < b
as per equation (5.5) with n = 3 and r = 2.
5.4
Joint Distribution of Yr and Ys
The joint pdf of the order statistics for a sample of size two is derived. The original
sample is X1 , X2 while the order statistics are denoted by Y1 , Y2 . Thus Y1 = X1 or X2 and
Y2 = X1 or X2 . Thus the transformation is not 1:1.
67
The space A1 is defined by a < x1 < x2 < b and A2 is a < x2 < x1 < b, giving
A = A1 + A2 .
Both these regions map into the space B defined by a < y1 < y2 < b.
(You should draw these regions.)
In A1 we have
X1 = Y 1 , Y 1 = X 1 = ψ 1
X2 = Y 2 , Y 2 = X 2 = ψ 2
with Jacobian
1 0 =1
J1 =
0 1 In A2 we have
X1 = Y 2 , Y 1 = X 2 = ψ I
X2 = Y1 , Y2 = X1 = ψII
with Jacobian
0 1 = −1
J1 =
1 0 This gives the joint density for Y1 and Y2 as
fY (y1 , y2 ) = abs|J1 |f (ψ1 , ψ2 ) + abs|J2 |f (ψI , ψII ), (y1 , y2 ) B
= 2f (y1 , y2 ) = 2f (y1 )f (y2 ), a < y1 < y2 < b
since X2 and X2 are iid f (x).
Marginal Density Functions
Case : n = 2
We first examine the case n = 2 observations.
Smallest OS
fY1 (y1 ) =
Z
fY (y1 , y2 )dy2 =
Z
y2 =b
y2 =y1
2f (y1 )f (y2 )dy2
= 2f (y1 ) [F (y2 )]by1 = 2f (y1 ) [1 − f (y1 )] , a < y1 < b
68
Largest OS
fY2 (y2 ) =
Z
fY (y1 , y2 )dy1 =
Z
y1 =y2
y1 =a
2f (y1 )f (y2 )dy1
= 2f (y2 ) [F (y1 )]ya2 = 2f (y2 )F (y2 ), a < y2 < b
These results can be compared with those n = 2, viz, equations (5.6) and (5.7).
(5.6) gives
fY1 (y1 ) = n[1 − F (y1 )]n−1 f (y1 ) = 2[1 − F (y1 )]1 f (y1 ), a < y1 < b
which is the same as the previous result for the smallest order statistic for a sample size
of two.
(5.7) gives
fYn (yn ) = n[F (yn )]n−1 f (yn ) = 2[F (y2 )]1 f (y2 ), a < y2 < b
which is the same as the previous result for the largest order statistic for a sample size of
two.
Case : n = 3
Now for the case n = 3 as per H and C, p193–195, 5th ed.
We have X1 , X2 , X3 → Y1 , Y2 , Y3 and the joint density is
fY (y1 , y2 , y3 ) = 3!f (y1 )f (y2 )f (y3 ), a < y1 < y2 < y3 < b
First, find the distribution of Y2 (the median).
fY2 (y2 ) =
= 6f (y2 )
Z
Z Z
b
y3 =y2
fY (y1 , y2 , y3 )dy1 dy3 =
f (y3 )
= 6f (y2 )
Z
Z
y1 =y2
a
b
y3 =y2
Z
y3 =b
y3 =y2
Z
y1 =y2
y1 =a
6f (y1 )f (y2 )f (y3 )dy1 dy3
f (y1 )dy1 dy3 = 6f (y2 )
Z
b
y3 =y2
f (y3 ) [F (y1 )]ya2 dy3
f (y3 )F (y2 )dy3 = 6f (y2 )F (y2 ) [F (y3 )]by2
= 6f (y2 )F (y2 )[1 − F (y2 )], a < y2 < b
This can be verified by using n = 3, r = 2 on equation (5.5).
The marginal distribution of Y1 , the smallest observation :
69
Z Z
fY1 (y1 ) =
Z Z
=
= 6f (y1 )
fY (y1 , y2 , y3 )dy2 dy3 , a < y1 < y2 < y3 < b
fY (y1 , y2 , y3 )dy3 dy2 =
Z
y2 =y1
= 6f (y1 )
Z
b
y2 =y1
y2 =b
y2 =y1
Z
y3 =b
y3 =y2
6f (y1 )f (y2 )f (y3 )dy3 dy2
!
f (y3 )dy3 dy2 = 6f (y1 )
Z
f (y2 )[1 − F (y2 )]dy2 = 6f (y1 )
Z
Z
b
Z
b
y3 =y2
"
[1 − F (y2 )]2
= 6f (y1 )
(−1)
2
So
b
y2 =y1
b
y1
f (y2 )[F (y3 )]by2 dy2
[1 − F (y2 )]dF (y2 )
#b
y1
#
"
(1 − F (b))2 (1 − F (y1 ))2
−
(−1)
fY1 (y1 ) = 6f (y1 )
2
2
"
#
(1 − F (y1 ))2
= 6f (y1 ) 0 +
= 3f (y1 )(1 − F (y1 ))2 , a < y1 < b
2
as per equation (5.6), for the case n=3.
The marginal distribution of Y3 , the largest observation :
fY3 (y3 ) =
= 6f (y3 )
Z
Z Z
y3 =y3
a
fY (y1 , y2 , y3 )dy1 dy2 =
f (y2 )
= 6f (y3 )
Z
Z
y1 =y2
a
y2 =y3
a
Z
y2 =y3
y2 =a
Z
y1 =y2
y1 =a
f (y1 )dy1 dy2 = 6f (y3 )
f (y2 )F (y2 )dy2 = 6f (y3 )
"
F (y2 )2
= 6f (y3 )
2
# y3
Z
a
Z
6f (y1 )f (y2 )f (y3 )dy1 dy2
y2 =y3
a
y2 =y3
f (y2 ) [F (y1 )]ay1 =y2 dy2
F (y2 )dF (y2 )
= 3f (y3 )F (y3 )2 , a < y3 < b
a
as per equation (5.7), for n = 3.
70
The General Case – joint distribution
The joint p.d.f. of Yr and Ys , (r < s) is found by integrating over the other (n−2) variables.
Then
Z
fYr ,Ys (yr , ys ) =
∞
ys
...
Z
∞
yn−1
(Z
ys
...
yr
Z
yr+3
yr
Z
yr+2
yr
"Z
yr
−∞
...
Z
y3
Z
y2
−∞ −∞
n!
n
Y
f (yi )dy1 . . . dyr−1
i=1
. dyr+1 . . . dys−1 } dyn . . . dys+1
(5.8)
The order of integration is first over y1 , y2 , . . . , to yr−1 , then over yr+1 , yr+2 , . . . , to
ys−1 and finally over yn , yn−1 , . . . , to ys+1 . The limits of integration are obtained from the
inequalities
−∞ < y1 < y2 < . . . < yr−1 < yr ,
yr < yr+1 < . . . < ys−1 < ys ,
∞ > yn > yn−1 > . . . > ys+1 > ys .
In order to integrate (5.8) we use methods similar to (5.3) and (5.4) together with
Z
ys
yr
...
=
Z
Z
yr+3
yr
ys
yr
...
Z
Z
Z
yr+2 s−1
Y
yr
yr+3
yr
f (yi )dyr+1 dyr+2 . . . dys−1
i=r+1
[F (yr+2 ) − F( yr )]fX (yr+2 )
s−1
Y
=
=
[F (ys ) − F (yr )]s−r−1
, on simplification.
(s − r − 1)!
yr
...
yr+4
yr
f (yi )dyr+2 . . . dys−1
i=r+3
s−1
Y
Z
ys
[F (yr+3 ) − F (yr )]2
f (yr+3 )
f (yi )dyr+3 . . . dys−1
2!
i=r+4
(5.9)
Thus we get
fYr ,Ys (yr , ys ) = n!f (yr )f (ys )
Z
∞
ys
...
Z
∞
yn
(Z
ys
yr
)
...
Z
yr+2
yr
n
[F (yr )]r−1 Y
f (yi ).
(r − 1)! i=r+1
i6=s
. dyr+1 . . . dys−1 dyn . . . dys+1
)
Z ∞ (
[F (yr )]r−1 Z ∞
[F (ys ) − F (yr )]s−r−1
= n!f (yr )f (ys )
.
...
(r − 1)! ys
(s − r − 1)!
yn−1
.
n
Y
f (yi )dyn . . . dys+1
i=s+1
[F (yr )]r−1 [F (ys ) − F (yr )]s−r−1 [1 − F (ys )]n−s
= n!f (yr )f (ys )
.
(r − 1)!
(s − r − 1)!
(n − s)!
71
#
Hence the pdf of (Yr , Ys ) is given by
fYr ,Ys (yr , ys ) =
n!
[F (yr )]r−1 [F (ys ) − F (yr )]s−r−1 .
(r − 1)!(s − r − 1)!(n − s)!
.[1 − F (ys )]n−s f (yr )f (ys).
(5.10)
We now give an alternative derivation of the special case of (5.10) where r = 1, s = n.
In this method we first find the joint cumulative distribution function, then derive the joint
pdf from it by differentiation.
The joint cdf of Y1 and Yn is P (Y1 ≤ y1 , Yn ≤ yn ). Note that
{Yn ≤ yn } = {Y1 ≤ y1 ∩ Yn ≤ yn } ∪ {Y1 > y1 ∩ Yn ≤ yn }
where the two events on the RHS are mutually exclusive. So
P (Yn ≤ yn ) = P (Y1 ≤ y1 ∩ Yn ≤ yn ) + P (Y1 ≥ y1 ∩ Yn ≤ yn )
P (Y1 ≤ y1 ∩ Yn ≤ yn ) = P (all n obs. ≤ yn ) − P (all n obs. are between y1 and yn )
So
So the joint pdf is
FY1 ,Yn (y1 , yn ) = [F (yn )]n − [F (yn ) − F (y1 )]n , y1 ≤ yn
i
∂ h
∂ ∂
n (F (yn ))n−1 f (yn ) − n [F (yn ) − F (y1 )]n−1 f (yn )
[FY1 ,Yn (y1 , yn )] =
∂y1 ∂yn
∂y1
= 0 + n(n − 1) [F (yn ) − F (y1 )]n−2 f (yn )f (y1 )
(5.11)
which is (5.10) with r = 1 and s = n.
The multinomial formulation of the joint distribution of two order statistics has been
given earlier. The 5 components and their probabilities are shown in Figure 5.4.
Obsn.
P rob.
1, . . . , (r − 1)
r
(r + 1), . . . , (s − 1)
s
(s + 1), . . . , n
F (yr )
f (yr )
F (ys ) − F (yr )
f (ys )
1 − F (ys )
←−
↔
←→
↔
−→
#obsn.
(r − 1)
1
(s − r − 1)
1
(n − s)
Figure 5.4: Multinomial probabilities for the joint order statistics
For the general case, the joint distribution function is obtained by integration. The
procedure is demonstrated for the simple cases n = 2 and n = 3, for the smallest and
largest observations.
72
(Case n = 2)
In this case, the joint distribution function is known already as
fY (y1 , y2 ) = 2f (y1 )f (y2 ), a < y1 < y2 < b
which can be seen to be that given by equation (5.10), for n = 2, r = 1 and s = 2.
(Verify!)
(Case n = 3)
We now want fY1 ,Y3 (y1 , y3 ), for example to find the distribution of the range. We
know
fY (y1 , y2 , y3 ) = 6f (y1 )f (y2 )f (y3 )
and so we need to integrate out y2 . Thus
fY1 ,Y3 (y1 , y3 ) =
Z
6f (y1 )f (y2 )f (y3 )dy2 = 6f (y1 )f (y3 )
Z
y2 =y3
y2 =y1
f (y2 )dy2
= 6f (y1 )f (y3 ) [F (y2 )]yy31 = 6f (y1 )f (y3 ) [F (y3 ) − F (y1 )] , a < y1 < y3 < b
This is the same as equation (5.10), for n = 3, r = 1 and s = 3. (Verify!)
5.5
The Transformation F (X)
Theorem 5.2
(Probability Integral Transform)
Let the random variable X have cdf F (x). If F (x) is continuous, the random variable
Z produced by the transformation
Z = F (X)
(5.12)
has the uniform probability distribution over the interval 0 ≤ z ≤ 1.
Proof
See CB p54 or HC 4.1, p 161.
The above result is a useful ploy for inference using order statistics and the following
diagram (Figure 5.5) illustrates the connections amongst the underlying distribution, the
observed data and their order statistics, and the transform of these to a set of data (Z i )
whose distribution is known.
73
The top row is an expression for Theorem 5.2. The second row is the one-to-one
mapping of samples from F (x) to samples from the uniform distribution. The ordered
variables Y1 . . . Yn are transformed to ordered Zi and properties about the original data,
X, may be discerned from the Z(1) . . . Z(n) .
Figure 5.5: Probability Integral Transform of order statistics.
X ∼ f (x), F (x)
?
Z = F (X)
-
?
Zi = F (Xi )
-
X1 . . . X n
?
Z ∼ g(Z) = 1
G(z) = z
0<z<1
Z1 . . . Zn
?
Z(i) = F (Yi )
-
Y1 . . . Y n
Z(1) . . . Z(n)
Theorem 5.3
Consider (Y1 , Y2 , . . . , Yn ), the vector of order statistics from a random sample of size n
from a population with a continuous cdf F . Then the joint pdf of the random variables
Z(i) = F (Yi ),
i = 1, 2, . . . , n
(5.13)
is given by
fZ (z(1) , z(2) , . . . , z(n) ) =
(
n! for 0 < z(1) < . . . < z(n) < 1
0 elsewhere
Proof
See HC 11.2, p 502.
74
(5.14)
Since Z(i) = F (Yi ), then
fZ [z(1) , z(2) , . . . , z(n) ] = n!
Y
g[z(i) ] = n!
Y
g(zi )
i
i
but since g(zi ) = 1 ∀ i , 0 < zi < 1, by Thm 5.2, then
fZ [z(1) , z(2) , . . . , z(n) ] = n! , 0 < z(1) < . . . < z(n) < 1
Theorem 5.4
The marginal pdf of Z(r) = F (Yr ) is given by
fZ(r) (z(r) ) =
n!
z r−1 (1 − z(r) )n−r , 0 < z(r) < 1.
(r − 1)!(n − r)! (r)
(5.15)
It can be seen that Z(r) ∼ Beta(r, n − r + 1) so its mean is
E(Z(r) ) =
r
n+1
(5.16)
Note: The Beta density is given by
f (x; a, b) =
1
Γ(a)Γ(b)
× xa−1 (1 − x)b−1 × I(0,1) (x) where B(a, b) =
B(a, b)
Γ(a + b)
and you may need to revise the Gamma function.
Proof
We need to integrate out all variables except Z(r) . Thus
fZ(r) [z(r) ] = n!
Z
z(r)
0
Z
z(r−1)
0
...
Z
z(2)
0
"Z
1
z(r)
Z
1
z(r+1)
...
Z
1
z(n−1)
#
dz(n) . . . dz(r+1) dz(1) . . . dz(r−1)
Note that the correct order of integration is determined by the inequalities
1 > z(n) > z(n−1) > . . . > z(r+1) > z(r)
and
0 < z(1) < z(2) < . . . < z(r−1) < z(r)
Successive integrations yield the final result. Thus
Z
z(2)
0
dz(1) = z(2)
75
and for the inner group
Z
1
z(n−1)
dz(n) = [1 − z(n−1) ]
leading to the two different terms in z for fZ(r) [z(r) ].
Alternatively, simply use equation (5.5) on page 45 of the Notes, with G = F = z and
the lower and upper limits being 0 and 1 respectively.
Theorem 5.5
The joint pdf of Z(r) and Z(s)
fZ(r) ,Z(s) (z(r) , z(s) ) =
(r < s) is given by
n!
r−1
z(r)
(z(s) − z(r) )s−r−1 (1 − z(s) )n−s
(r − 1)!(s − r − 1)!(n − s)!
for 0 < z(r) < z(s) < 1
(5.17)
Proof
This is left as an exercise. You need only to notice Z(r) and Z(s) have uniform distributions and use (5.10), with lower and upper limits of 0 and 1 on z = y and note that
G = F = z, since Z ∼ U (0, 1).
5.6
Examples
Example 1
Distribution of the Sample Median
The sample median M is defined as
M=



Y n+1
2
for n odd
[Y n2 + Y n2 +1 ]/2 for n even.
For the case of n odd, replace r by (n + 1)/2 in (5.5) on page 65 .
For the case of n even, let n = 2m, and U = [Ym + Ym+1 ]/2. Then
fYm ,Ym+1 (ym , ym+1 ) =
(2m)!
[F (ym )]m−1 [1 − F (ym+1 )]m−1 f (ym )f (ym+1 ).
[(m − 1)!]2
Define u and v as follows
u = (ym + ym+1 )/2
v = ym+1 .
76
(5.18)
Then
ym = 2u − v
ym+1 = v
and |J| = 2.
Thus
fU,V (u, v) =
(2m)!
[F (2u − v)]m−1 [1 − F (v)]m−1 f (2u − v).f (v).2
[(m − 1)!]2
−∞<u<v <∞
and integrating with respect to v we obtain the pdf of U (the sample median for a sample
size n = 2m),
2(2m)! Z ∞
fU (u) =
[F (2u − v)]m−1 [1 − F (v)]m−1 f (2u − v).f (v)dv.
2
[(m − 1)!] u
Example 2
(5.19)
Distribution of the Sample Midrange
For an ordered sample Y1 < Y2 < . . . < Yn , this is defined as 12 (Y1 + Yn ), and its pdf
can be found for a particular distribution, beginning with (5.11) and using the technique
of bivariate transformation.
Example 3
Distribution of the Sample Range
Distribution of the range, for n = 2.
In this case, the range R is defined by R = Y2 − Y1 .
The transformation can be written as
R = Y 2 − Y 1 , Y1 = S = ψ 1
S = Y 1 , Y2 = R + S = ψ 2
with Jacobian
∂ψ1 /∂R ∂ψ2 /∂R
J=
∂ψ1 /∂S ∂ψ2 /∂S
The original region A is defined by
0 1 =
= −1
1 1 a < Y 1 < Y2 < b
The transformed region B is defined by (r, s) such that (y1 , y2 ) A.
Thus
77
Y1 > a → S > a
Y1 < b → S < b
Y2 < b → R + S < b ; R < b − a
Y2 > a → R + S > a (redundant)
Y2 > Y 1 → R + S > S ; R > 0
This region B should be sketched, to check limits for integration etc.
Thus the joint distribution of R and S is now
fR,S (r, s) = f (ψ1 , ψ2 )abs|J| = 2f (s) × f (r + s) × 1, (r, s) B
For the special case of the uniform distribution, U (a, b), we have that
f (x) =
1
, a<x<b
b−a
This gives
fR,S (r, s) =
and
fR (r) =
Z
fR,S (r, s)ds =
Z
s=b−r
s=a
2
(b − a)2
2(b − r − a)
2
ds =
, 0<r <b−a
2
(b − a)
(b − a)2
as per the general case with n = 2.
The general case
For an ordered sample Y1 < Y2 < . . . < Yn , the sample range is R = Yn − Y1 . Assuming
that the sample is from a continuous distribution with pdf f (x), a < x < b and cdf F (x),
the joint pdf of Y1 and Yn is given by equation (5.11) on page 72. Finding the distribution
of R becomes a problem in bivariate transformations. Define
r = yn − y1 and v = y1 .
The inverse relationship, which is one-to-one, is
y1 = v and yn = r + v,
with |J| = 1. So we have
fR,V (r, v) = fY1 ,Yn (y1 , yn )|J| = n(n − 1)[F (r + v) − F (v)]n−2 f (r + v) f (v).
To find the range space, note that a < y_1 < y_n < b implies v > a, r > 0 and r + v < b, i.e. v < b − r. So the range space is a < v < b − r, 0 < r < b − a, and

f_R(r) = \int_a^{b-r} n(n-1)\, [F(r+v) - F(v)]^{n-2} f(v) f(r+v)\, dv, \quad 0 < r < b - a.

As a special case, for f(x) = \frac{1}{b-a}, a < x < b, find f_R(r).

We have

f(v) = \frac{1}{b-a}, \ a < v < b \quad \text{and} \quad f(r+v) = \frac{1}{b-a}, \ a < r + v < b, \text{ i.e. } a < v < b - r.

Now

F(r+v) - F(v) = \int_v^{r+v} \frac{1}{b-a}\, dx = \frac{r}{b-a}.

So

f_R(r) = n(n-1) \int_a^{b-r} \left(\frac{r}{b-a}\right)^{n-2} \frac{dv}{(b-a)^2} = \frac{n(n-1)\, r^{n-2}\,(b-r-a)}{(b-a)^n}, \quad 0 < r < b - a.
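A quick numerical check of the range density just derived may be helpful; the following R sketch (not part of the original notes, with n, a, b chosen arbitrarily) simulates the range of a uniform sample and overlays the derived density.

# Simulate the sample range for U(a, b) and compare with
#   f_R(r) = n(n-1) r^(n-2) (b - r - a) / (b - a)^n ,  0 < r < b - a.
set.seed(1)
n <- 5; a <- 0; b <- 1
R <- replicate(10000, {x <- runif(n, a, b); max(x) - min(x)})
hist(R, breaks = 40, freq = FALSE, main = "Sample range, U(0,1), n = 5")
curve(n * (n - 1) * x^(n - 2) * (b - x - a) / (b - a)^n, add = TRUE)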
Example 4
Estimating Coverage
The ith coverage is defined as
Ui = F (Yi ) − F (Yi−1 ) = Z(i) − Z(i−1)
being the area under the density function f (y) between Yi and Yi−1 .
To find the distribution of Ui we need the joint distribution of Z(i) and Z(i−1) . This is
given by Thm (5.5) equation (5.17), with r = i − 1 and s = i.
The joint distribution of Z_{(r)} and Z_{(s)} is given by

f_{Z_{(r)},Z_{(s)}}[z_{(r)}, z_{(s)}] = \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!}\, z_{(r)}^{r-1} [z_{(s)} - z_{(r)}]^{s-r-1} [1 - z_{(s)}]^{n-s}, \quad 0 < z_{(r)} < z_{(s)} < 1.

This becomes, under r = i − 1 and s = i,

f_{Z_{(i-1)},Z_{(i)}}[z_{(i-1)}, z_{(i)}] = \frac{n!}{(i-2)!\,0!\,(n-i)!}\, z_{(i-1)}^{i-2} [z_{(i)} - z_{(i-1)}]^{0} [1 - z_{(i)}]^{n-i}, \quad 0 < z_{(i-1)} < z_{(i)} < 1.
The remainder of the derivation is left as an assignment question, using the outline
given below.
[Figure 5.6: Coverage — a density f(y), with the area between Y_{i-1} and Y_i shaded.]

Define the ith coverage as U_i = F(Y_i) − F(Y_{i−1}), the area under the density f(y) between Y_{i−1} and Y_i, as shown in Figure 5.6. We have data X_1, . . . , X_n but do not assume a parametric form for the distribution, and rely on order statistics to estimate coverage. From Theorem 5.5, Z = F(X) ∼ Unif(0, 1) and Z_{(1)}, . . . , Z_{(n)} is an ordered sample from Unif(0, 1). By definition, U_i = Z_{(i)} − Z_{(i−1)}. From Theorem 5.5, Z_{(i)} ∼ Beta(i, n − i + 1) and Z_{(i−1)} ∼ Beta(i − 1, n − i + 2). Theorem 5.5 gives the joint distribution of Z_{(i−1)}, Z_{(i)}, leading to the joint distribution of Z_{(i−1)}, U_i. Integrating wrt Z_{(i−1)} gives the distribution of U_i: U_i ∼ Beta(1, n). Therefore E(U_i) = 1/(n + 1) and var(U_i) = n/{(n + 1)^2(n + 2)}.
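A short simulation sketch of the coverage result (a standard normal parent is assumed purely for illustration; any continuous distribution would do since the Z's are uniform):

# Each coverage U_i should behave like Beta(1, n), so its mean should be
# close to 1/(n + 1).
set.seed(2)
n <- 10; i <- 4
U <- replicate(10000, {
  z <- sort(pnorm(rnorm(n)))   # Z_(1) < ... < Z_(n), an ordered U(0,1) sample
  z[i] - z[i - 1]              # U_i = Z_(i) - Z_(i-1)
})
c(mean(U), 1/(n + 1))          # empirical mean vs theoretical 1/(n+1)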
5.7
Worked Examples : Order statistics
Example 1
The cdf and df of the smallest order statistic.
Solution
The cdf first :
FY1 (y1 ) = P (Y1 ≤ y1 ) = P (at least one of n obs ≤ y1 ) = 1 − P (no obs ≤ y1 )
giving

F_{Y_1}(y_1) = 1 − P(\text{all obs} > y_1) = 1 − [1 − F(y_1)]^n.

Thus the df is

f_{Y_1}(y_1) = F'_{Y_1}(y_1) = n[1 − F(y_1)]^{n−1} f(y_1).
Example 2
Suppose links of a chain are such that the population of individual links has breaking strengths Y (kg) with df

f(y) = λe^{−λy}, \quad y > 0,

where λ is a positive constant. If a chain is made up of 100 links of this type taken at random from the population of links, what is the probability that such a chain would have a breaking strength exceeding K kilograms? Interpret your results.
Solution
Since the breaking strength of a chain is equal to the breaking strength of its weakest
link, the problem reduces to finding the probability that the smallest order statistic
in a sample of 100 will exceed K.
We have

P(Y_1 > K) = 1 − P(Y_1 < K) = 1 − F_{Y_1}(K)

and

F_{Y_1}(y_1) = 1 − [1 − F(y_1)]^n,

but

F(y_1) = \int_0^{y_1} λe^{−λy}\, dy = 1 − e^{−λ y_1}, \quad y_1 > 0,

and so

F_{Y_1}(y_1) = 1 − [e^{−λ y_1}]^{100} = 1 − e^{−100 λ y_1},

giving

P(Y_1 > K) = 1 − F_{Y_1}(K) = e^{−100λK}.
The df for Y_1 is

f_{Y_1}(y_1) = F'_{Y_1}(y_1) = 100λ e^{−100λ y_1}, \quad y_1 > 0,

and so

E(Y) = 1/λ, \qquad E(Y_1) = 1/(100λ),
which explains the extreme quality control used for high performance units (like
chains) which are made up of large numbers of similar components.
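The probability above is easy to evaluate or check by simulation; the following R sketch uses illustrative values of λ and K (not from the notes).

lambda <- 0.01; K <- 2
exp(-100 * lambda * K)                        # closed form P(Y_1 > K)
mean(replicate(10000,                         # simulation check
     min(rexp(100, rate = lambda)) > K))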
Example 3
A random sample of size n is drawn from a U (0, θ) population.
1. Suppose that kYn is used to estimate θ. Find k so that E(kYn −θ)2 is a minimum.
2. What is the probability that all the observations will be less than cθ for 0 <
c < 1?
Solution
1.
Now,

f_{Y_n}(y_n) = n\, y_n^{n-1} θ^{-n}, \quad 0 < y_n < θ,

ie, Y_n/θ ∼ Beta(n, 1). Thus

E(Y_n/θ) = \frac{n}{n+1}

while

V(Y_n/θ) = \frac{n}{(n+1)^2(n+2)}.

Now

E(kY_n − θ)^2 = k^2 E(Y_n)^2 + θ^2 − 2kθ E(Y_n)
             = k^2 [V(Y_n) + (E Y_n)^2] + θ^2 − \frac{2kθ^2 n}{n+1}
             = k^2 \left[\frac{nθ^2}{(n+1)^2(n+2)} + \frac{n^2θ^2}{(n+1)^2}\right] + θ^2 − \frac{2kθ^2 n}{n+1}.

This will be a minimum when

\frac{dV}{dk} = 0,

ie, when

\frac{2knθ^2}{(n+1)^2(n+2)} + \frac{2kn^2θ^2}{(n+1)^2} − \frac{2θ^2 n}{n+1} = 0.

Thus

k = \frac{n+2}{n+1}.
2.
In general, for 0 < φ < θ,

P(Y_n < φ) = [F(φ)]^n = (φ/θ)^n.

So if φ = cθ, then P(Y_n < cθ) = c^n.
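A quick simulation check of part 2 (θ and c below are arbitrary illustrative values):

set.seed(3)
n <- 7; theta <- 2; cc <- 0.8
mean(replicate(10000, max(runif(n, 0, theta)) < cc * theta))  # empirical
cc^n                                                          # theoretical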
Example
If X1 , . . . , Xn is a random sample from a uniform distribution with pdf
fX (x) = 1/θ, 0 < x < θ
with order statistics Y1 , . . . , Yn , show that Y1 /Yn and Yn are independent.
Solution
The joint distribution of Y1 and Yn is
f_{Y_1,Y_n}(y_1, y_n) = n(n−1)\,[F(y_n) − F(y_1)]^{n−2} f(y_1) f(y_n), \quad 0 < y_1 < y_n < θ,

as per (5.11) page 48 of the Notes. This simplifies to

f(y_1, y_n) = n(n−1)\left(\frac{y_n − y_1}{θ}\right)^{n−2} \frac{1}{θ^2}, \quad 0 < y_1 < y_n < θ,

since f(y) = 1/θ and F(y) = y/θ.

The transformations are

U = Y_1/Y_n,    Y_1 = UV = ψ_1(U, V)
V = Y_n,        Y_n = V = ψ_2(U, V)

The Jacobian of the transformation is

J = \begin{vmatrix} \partial ψ_1/\partial u & \partial ψ_1/\partial v \\ \partial ψ_2/\partial u & \partial ψ_2/\partial v \end{vmatrix} = \begin{vmatrix} v & u \\ 0 & 1 \end{vmatrix} = v.

The joint distribution of U and V is then

f_{U,V}(u, v) = n(n−1)\left(\frac{v − uv}{θ}\right)^{n−2} \frac{v}{θ^2}, \quad (u, v) ∈ B.

The region A is defined by 0 < Y_1 < Y_n < θ while B is defined by 0 < U < 1 and 0 < V < θ, as obtained from the inequalities

Y_1 < Y_n → UV < V ;  U < 1
0 < Y_1 → UV > 0 ;  U, V > 0
Y_n < θ → V < θ
Y_n > 0 → V > 0

The joint distribution factorises, viz

f_{U,V}(u, v) = (n−1)(1−u)^{n−2} × n\left(\frac{v}{θ}\right)^{n−1}\frac{1}{θ}.

Thus

f_U(u) = \int_0^θ f(u, v)\, dv = (n−1)(1−u)^{n−2}, \quad 0 < u < 1,

and so U = Y_1/Y_n ∼ B(1, n−1). Also

f_V(v) = \int_0^1 f(u, v)\, du = n\left(\frac{v}{θ}\right)^{n−1}\frac{1}{θ}, \quad 0 < \frac{v}{θ} < 1,

and so V/θ = Y_n/θ ∼ B(n, 1), independently of U = Y_1/Y_n since

f_{U,V}(u, v) = f_U(u) f_V(v).

Note also that the distribution of V = Y_n is the distribution of the largest order statistic, as per equation (5.7). Thus

f_{Y_n}(y_n) = n[F(y_n)]^{n−1} f(y_n) = n\left(\frac{y_n}{θ}\right)^{n−1}\frac{1}{θ}, \quad 0 < y_n < θ,

in line with the df for V = Y_n.
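A small simulation sketch giving informal evidence of the independence just derived (n and θ are arbitrary illustrative choices):

set.seed(4)
n <- 6; theta <- 3
sims <- t(replicate(10000, {x <- runif(n, 0, theta); c(min(x)/max(x), max(x))}))
cor(sims[, 1], sims[, 2])       # should be near 0
cor(sims[, 1]^2, sims[, 2]^2)   # and likewise for transformed versions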
Chapter 6
Non-central Distributions
6.1
Introduction
Recall that if Z is a random variable having a standard normal distribution then Z 2
has a chi-square distribution with one degree of freedom. Furthermore, if Z1 , Z2 , . . . , Zp
are independent and each Z_i is distributed N(0, 1), then the random variable \sum_{i=1}^{p} Z_i^2 has a chi-square distribution with p degrees of freedom.

Suppose now that the means of the normal distributions are not zero. We wish to find the distributions of Z_i^2 and \sum_{i=1}^{p} Z_i^2.
Definition 6.1
Let X_i be distributed as N(μ_i, 1), i = 1, 2, . . . , p. Then X_i^2 is said to have a non-central chi-square distribution with one degree of freedom and non-centrality parameter μ_i^2, and \sum_{i=1}^{p} X_i^2 has a non-central chi-square distribution with p degrees of freedom and non-centrality parameter λ where λ = \sum_{i=1}^{p} μ_i^2.
Notation
If W has a non-central chi-square distribution with p degrees of freedom and non-centrality parameter λ, we will write W ∼ χ^2_p(λ). Of course if λ = 0 we have the usual χ^2 distribution, sometimes called the central chi-square distribution.
The term non-central can also apply in the case of the t-distribution. Recall that Z/\sqrt{W/ν}, where Z ∼ N(0, 1), W ∼ χ^2_ν and Z and W are independent, has a t-distribution with parameter ν. When the variable in the numerator has a non-zero mean then the distribution is said to be non-central t.
[Non-central t and F distributions are defined in Section 6.3.]
A common use of the non-central distributions is in calculating the power of the χ2 , t
and F tests and in such applications as robustness studies.
6.2
Distribution Theory of the Non-Central Chi-Square
The following theorem is of considerable help in deriving results concerning the non-central
chi-square distribution.
Theorem 6.1
A random variable W ∼ χ^2_p(λ) can be represented as the sum of a non-central chi-square variable with one degree of freedom and non-centrality parameter λ and a (central) chi-square variable with p − 1 degrees of freedom, where the two variables are independent.

Proof

Let X_1, X_2, . . . , X_p be independently distributed, where X_i ∼ N(μ_i, 1), and write X' = (X_1, X_2, . . . , X_p). Define

W = \sum_{i=1}^{p} X_i^2.    (6.1)

Choose an orthogonal matrix B such that the elements in the first row are defined by

b_{1j} = μ_j λ^{-1/2} \quad \text{for } j = 1, 2, . . . , p,    (6.2)

where λ = \sum_{j=1}^{p} μ_j^2. Define Y' = (Y_1, Y_2, . . . , Y_p) by the orthogonal transformation

Y = BX.    (6.3)

Then using the result of Assignment 2, Q. 9, we see that Y ∼ N_p(Bμ, BIB'). That is, since B is orthogonal,

Y ∼ N_p(Bμ, I).    (6.4)

But the mean of the vector Y can be written as

E(Y) = Bμ = \begin{pmatrix} \sum b_{1j}μ_j \\ \sum b_{2j}μ_j \\ \vdots \\ \sum b_{pj}μ_j \end{pmatrix},

where, using (6.2), we have

E(Y_1) = \sum b_{1j}μ_j = \sum μ_j^2 λ^{-1/2} = λ/λ^{1/2} = λ^{1/2}.

Further, since the rows of B are mutually orthogonal,

E(Y_i) = \sum_{j=1}^{p} b_{ij}μ_j = 0 \quad \text{for } i = 2, 3, . . . , p.

From (6.3), Y_1, Y_2, . . . , Y_p is a set of independent normally distributed random variables. Also

W = X'X = (B^{-1}Y)'B^{-1}Y = Y'(B^{-1})'B^{-1}Y = Y'Y = Y_1^2 + \sum_{i=2}^{p} Y_i^2 = V + U,    (6.5)

where V = Y_1^2 and U = \sum_{i=2}^{p} Y_i^2. Since U depends only on Y_2, . . . , Y_p, U is independent of V. Furthermore, Y_1 ∼ N(λ^{1/2}, 1) so that Y_1^2 is distributed as a chi-square with one degree of freedom and non-centrality parameter λ. Also, U is distributed as a chi-square with (p − 1) degrees of freedom since Y_2, . . . , Y_p are independently and identically distributed N(0, 1). This completes the proof.
This theorem will now be used to derive the density function for a random variable
with a non-central χ2 distribution.
Theorem 6.2
If W ∼ χ^2_n(λ), then the probability density function of W is

g_W(w) = \frac{e^{-\frac{w}{2}} e^{-\frac{λ}{2}} w^{\frac{1}{2}n - 1}}{2^{\frac{1}{2}n}\, Γ(\frac{1}{2}n)} \left[ 1 + \frac{1}{n}\frac{wλ}{2} + \frac{1}{n(n+2)}\frac{1}{2!}\left(\frac{wλ}{2}\right)^2 + \ldots \right], \quad 0 ≤ w < ∞.    (6.6)
Proof:
Write V = Y_1^2 and U = \sum_{i=2}^{n} Y_i^2, so that by Theorem 6.1 we can write W as W = V + U. Now U ∼ χ^2_{n-1}(0), so that the probability density function of U is

f_U(u) = \frac{e^{-\frac{u}{2}}\, u^{\frac{n-1}{2} - 1}}{2^{\frac{1}{2}(n-1)}\, Γ[\frac{1}{2}(n-1)]}, \quad 0 ≤ u < ∞.    (6.7)

In Example 3.1 of Chapter 3, the density function of V was found to be

f_V(v) = \frac{v^{-\frac{1}{2}} e^{-\frac{v}{2}} e^{-\frac{λ}{2}}}{2^{\frac{3}{2}}\, Γ(\frac{1}{2})} \left[ e^{(vλ)^{1/2}} + e^{-(vλ)^{1/2}} \right].    (6.8)

But U and V are independent, so that the joint p.d.f. of U and V is

f_{U,V}(u, v) = f_U(u) f_V(v) = \frac{u^{\frac{n-3}{2}}\, e^{-\frac{(v+u)}{2}}\, v^{-\frac{1}{2}}\, e^{-\frac{λ}{2}}}{2^{\frac{n+2}{2}}\, Γ(\frac{1}{2})\, Γ[\frac{1}{2}(n-1)]} \left[ e^{(λv)^{1/2}} + e^{-(λv)^{1/2}} \right].
Define random variables W and T by

W = U + V,    T = U.

Then

V = W − T,    U = T.

The original variable space A, (U, V), is defined by U > 0, V > 0. The transformation is W = U + V, T = U, while the space B, (T, W), is defined by T > 0, W > T and W > 0. Clearly the Jacobian of the transformation is 1, so that

f_{T,W}(t, w) = \frac{t^{\frac{n-3}{2}} (w-t)^{-\frac{1}{2}} e^{-\frac{w}{2}} e^{-\frac{λ}{2}}}{2^{\frac{n+2}{2}}\, Γ(\frac{1}{2})\, Γ[\frac{1}{2}(n-1)]} \left[ e^{λ^{1/2}(w-t)^{1/2}} + e^{-λ^{1/2}(w-t)^{1/2}} \right], \quad 0 ≤ t ≤ w,\ 0 ≤ w < ∞.

Now write (w − t)^{1/2} = w^{1/2}\left(1 − \frac{t}{w}\right)^{1/2} and expand the terms in the brackets, so that

f_{T,W}(t, w) = \frac{e^{-\frac{λ}{2}} e^{-\frac{w}{2}} w^{\frac{n-4}{2}}}{2^{\frac{n}{2}}\, Γ(\frac{1}{2})\, Γ[\frac{1}{2}(n-1)]} \left(\frac{t}{w}\right)^{\frac{n-3}{2}} \left\{ \left(1 − \frac{t}{w}\right)^{-\frac{1}{2}} + \frac{wλ}{2!}\left(1 − \frac{t}{w}\right)^{\frac{1}{2}} + \ldots \right\}.

To obtain the marginal density function of W we integrate with respect to t, (0 ≤ t < w). Notice we have a series of integrals of the form

\int_0^w \left(\frac{t}{w}\right)^{\frac{n-3}{2}} \left(1 − \frac{t}{w}\right)^{r − \frac{1}{2}} dt = w \int_0^1 v^{\frac{n-3}{2}} (1 − v)^{r − \frac{1}{2}}\, dv = w\, B\!\left(\frac{n-1}{2},\, r + \frac{1}{2}\right)

for r = 0, 1, 2, . . . . Thus

g_W(w) = \frac{e^{-\frac{w}{2}} e^{-\frac{λ}{2}} w^{\frac{n-2}{2}}}{2^{\frac{n}{2}}\, Γ(\frac{1}{2})\, Γ[\frac{1}{2}(n-1)]} \left\{ B\!\left(\frac{n-1}{2}, \frac{1}{2}\right) + \frac{wλ}{2!}\, B\!\left(\frac{n-1}{2}, \frac{3}{2}\right) + \ldots \right\},

and using the relationship B(m, n) = Γ(m)Γ(n)/Γ(m + n) we obtain (6.6).
To obtain the marginal density function of W we integrate with respect to t, (0 ≤ t < w).
Notice we have a series of integrals of the form
Z
w
0
t
w
(n−3) 2
t
1−
w
r− 1
2
dt =
Z
1
v
(n−3)
2
0
1
(1 − v)r+ 2
dv
w
1
1
n−1
=
B
, r−
w
2
2
for r = 0, 1, 2, . . . .
Thus
w
λ
n−2
e− 2 e− 2 w ( 2 )
gW (w) = n 1 1
2 2 Γ( 2 )Γ[ 2 (n − 1)]
(
!
)
n−1 3
wλ
(n − 1) 1
,
B
,
+
+...
B
2
2
2!
2
2
and using the relationship B(m, n) = Γ(m)Γ(n)/Γ(m + n) we obtain (6.6).
Note:
No generality is lost by assuming unit variances as the more general case (where the
variance is σ 2 , say) can easily be reduced to this case. That is, if X ∼ N (µ, σ 2 ) then
X/σ ∼ N (µ/σ, 1).
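As a check, the series (6.6) can be compared numerically with R's built-in non-central chi-square density dchisq(w, df, ncp). This is a sketch only; the function name dens66 and the truncation at 30 terms are illustrative choices.

dens66 <- function(w, n, lambda, terms = 30) {
  # series in (6.6): sum_r (w*lambda/2)^r / ( r! * n(n+2)...(n+2(r-1)) )
  r <- 0:terms
  coef <- (w * lambda / 2)^r / (factorial(r) * sapply(r, function(k)
            if (k == 0) 1 else prod(n + 2 * (0:(k - 1)))))
  exp(-w / 2) * exp(-lambda / 2) * w^(n / 2 - 1) / (2^(n / 2) * gamma(n / 2)) * sum(coef)
}
dens66(3.7, 4, 2.5)
dchisq(3.7, df = 4, ncp = 2.5)   # should agree closely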
MGF of W
Direct calculation of the moments would be tedious, so we need the MGF of W .
M_W(t) = \int e^{wt} g_W(w)\, dw = \int \frac{e^{-w(1-2t)/2} e^{-λ/2} w^{(n/2)-1}}{2^{n/2}\, Γ(n/2)} \left[ f(wλ, n) \right] dw,

where f(wλ, n) denotes the series factor in (6.6). Choose w' = w(1 − 2t) and λ = Λ(1 − 2t), and note that wλ = w'Λ. Thus

M_W(t) = \int \frac{e^{-w'/2} e^{-Λ(1-2t)/2} \left(\frac{w'}{1-2t}\right)^{(n/2)-1}}{2^{n/2}\, Γ(n/2)} \left[ f(w'Λ, n) \right] d\!\left(\frac{w'}{1-2t}\right)
       = \frac{e^{Λt}}{(1-2t)^{n/2}} \underbrace{\int \frac{e^{-w'/2} e^{-Λ/2} (w')^{(n/2)-1}}{2^{n/2}\, Γ(n/2)} \left[ f(w'Λ, n) \right] dw'}_{=1}.

So

M_W(t) = e^{λt/(1-2t)}\, (1 − 2t)^{-n/2}.
6.3
Non-Central t and F-distributions
Suppose X ∼ N(μ, 1) and W ∼ χ^2_n(0) and that X and W are independent. Then the random variable T' defined by

T' = \frac{X}{\sqrt{W/n}}

has a non-central t-distribution with n df and non-centrality parameter μ. No attempt will be made to derive the pdf of the non-central t. Clearly, when μ = 0, T' reduces to the central t distribution.
Let W_1 ∼ χ^2_{n_1}(λ) and W_2 ∼ χ^2_{n_2}(0) be independent random variables. Then the random variable F' defined by

F' = \frac{W_1/n_1}{W_2/n_2}

has a non-central F-distribution with non-centrality parameter λ. We write F' ∼ F_{n_1,n_2}(λ).
The F' statistic has a non-central F distribution with probability density function

g(x) = \sum_{r=0}^{\infty} \frac{e^{-\frac{λ}{2}} (\frac{1}{2}λ)^r}{r!} \times \frac{(n_1/n_2)^{\frac{1}{2}n_1 + r}}{B(\frac{1}{2}n_1 + r, \frac{1}{2}n_2)} \times \frac{x^{\frac{1}{2}n_1 + r - 1}}{[1 + (n_1/n_2)x]^{\frac{1}{2}(n_1+n_2) + r}},

where n_1, n_2 are the degrees of freedom and λ is the non-centrality parameter defined by

λ = \frac{\sum_{i=1}^{p} m_i (τ_i − \bar{τ})^2}{σ^2},    (6.9)

where m_i is the number of observations in group i for the AOV effects model

Y_{ij} = μ + τ_i + ε_{ij}, \quad ε_{ij} ∼ N(0, σ^2).

If all means are equal, λ = 0 and g(x) is the pdf of a central F variable.
The terms of the form B(a, b) are beta functions.
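The density above can also be checked against R's df() with its ncp argument; the sketch below uses arbitrary illustrative values and a truncated series.

gF69 <- function(x, n1, n2, lambda, terms = 50) {
  r <- 0:terms
  sum(exp(-lambda / 2) * (lambda / 2)^r / factorial(r) *
      (n1 / n2)^(n1 / 2 + r) / beta(n1 / 2 + r, n2 / 2) *
      x^(n1 / 2 + r - 1) / (1 + (n1 / n2) * x)^((n1 + n2) / 2 + r))
}
gF69(1.3, 5, 42, 8.89)
df(1.3, df1 = 5, df2 = 42, ncp = 8.89)   # should agree closely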
6.4
6.4.1
POWER: an example of use of non-central t
Introduction
In hypothesis testing we can make a Type I or a Type II error.
             Accept H0             Reject H0
H0 true      correct (1)           Type I error (α)
H0 false     Type II error (β)     correct (2)
The power of the test is P = 1 − β, which is P(reject H0 |H0 is false) = P[correct(2)].
Note that some authors use β to mean Power. We will use the definition
Power = 1 - P(Type II error)
Both Type I and Type II errors need to be controlled. In fact, we are faced with a
trade–off between the two.
If we lower Type I, Type II will increase.
If we lower Type II, Type I will increase.
Type I is preset, and so we need to have some idea about Type II error and thus the
Power of the test. The regions for Type I (α) and Type II (β) errors are shown in Figure 6.1. The area to the right of the critical value (cv) under the solid curve (Ho) is the
Type I error. The area under the dotted curve (Ha) to the left of the cv is the Type II error.
[Figure 6.1: The Null and Alternative Hypotheses — normal densities under Ho (solid) and Ha (dotted), with the critical value cv marked and the α and β regions shaded.]
Consider a simple one sample t–test, with
H0 : µ = 0 vs Ha : µ > 3
If we have a sample of 26 observations with s = 5, what is the Power of the test?
Under H0 the test statistic is

T = \frac{\bar{X}}{s/\sqrt{n}} ∼ t_{n−1},
since the mean of X is taken to be zero.
When the alternative hypothesis is true, the mean of X is no longer zero and hence the
test statistic no longer follows an ordinary t–distribution. The distribution is no longer
symmetric, but becomes a non–central t–distribution. The lack of symmetry is described
by a non–centrality parameter λ, where
λ = \frac{\text{diff}}{σ/\sqrt{n}} = \frac{3}{5/5.099} = 3.0594
in this case. Now the critical value for the test is t25,5% = 1.708, so the Type I error is
5%. The plot of the cumulative non–central t–distribution imposed by Ha is shown in
Figure 6.2, together with the critical value :
> # after Dalgaard p141
> curve(pt(x,25,ncp=3.0594),from =0, to=6)
[Figure 6.2: Plot of the cdf for the non–central t distribution, pt(x, 25, ncp = 3.0594), for x from 0 to 6.]
> abline(v=qt(0.95,25))
> qt(0.95,25)
[1] 1.708141
>
> pt(qt(0.95,25),25,ncp=3.0594)
[1] 0.0917374
> 1-pt(qt(0.95,25),25,ncp=3.0594)
[1] 0.9082626
The Type II error is the area under the curve to the left of the critical value as shown
by the vertical line. Thus the Type II error is 0.092 and the Power is 0.908. The desired
value for Power is usually 0.8 to 0.9.
For the two sample t–test, the non–centrality parameter becomes
λ = \frac{\text{diff}}{σ\sqrt{1/n_1 + 1/n_2}}
The software power.t.test in R can estimate the sample size needed to attain a given
Power.
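Before turning to power.t.test, the two–sample power can also be computed 'by hand' with the non–central t, mirroring the one–sample calculation above. The sketch below uses n1 = n2 = 35, diff = 3 and sd = 5, which are the illustrative values used in the two–sample example later in this section.

n1 <- n2 <- 35; diff <- 3; sdev <- 5
ncp   <- diff / (sdev * sqrt(1/n1 + 1/n2))
dfree <- n1 + n2 - 2
1 - pt(qt(0.95, dfree), dfree, ncp = ncp)   # approximate power, one-sided 5% test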
6.4.2
Power calculations
The base package in R, has a function called power.t.test which is useful for calculating
power curves a priori by plugging in guessed parameters. The function has the form:
power.t.test(n=NULL, delta=NULL, sd=1, sig.level=0.05, power=NULL,
type=c("two.sample", "one.sample", "paired"),
alternative=c("two.sided", "one.sided"))
Its use requires that one parameter be set as NULL and all the others defined with
values. The function will return the value of that which is set to NULL. It may be used to
calculate
(i) sample size (n), given:- the difference (delta),sd, sig.level and power,
(ii) power, given:- n, delta, sd, sig.level,
(iii) detectable difference (delta), given the other arguments
Thus to show the equivalence with the previous method, we will estimate the sample
size needed to get the power given for our one sample problem, using power.t.test.
> power.t.test(delta=3,sd=5, sig.level =0.05,
power=0.908, type="one.sample",
+ alt="one.sided")
One-sample t test power calculation

              n = 25.97355
          delta = 3
             sd = 5
      sig.level = 0.05
          power = 0.908
    alternative = one.sided
As expected, we find that a sample of 26 is needed!
We can also perform the same calculation, ie, given the sample size, find the power.
> power.t.test(n=26,delta=3,sd=5,
sig.level =0.05, type="one.sample",
+ alt="one.sided")
One-sample t test power calculation

              n = 26
          delta = 3
             sd = 5
      sig.level = 0.05
          power = 0.9082645
    alternative = one.sided
The results are equivalent to the original power calculations, ie, power is approx 91%.
To return to the two sample case, now consider
H0 : µ1 = µ2 vs Ha : µ1 > µ2
Let us find the sample sizes n1 = n2 = n needed to detect a difference of 3 when the
common sd is 5, with a Power = 0.8.
> power.t.test(delta=3,sd=5, sig.level=0.05,
power =0.8,alt="one.sided")
Two-sample t test power calculation

              n = 35.04404
          delta = 3
             sd = 5
      sig.level = 0.05
          power = 0.8
    alternative = one.sided
NOTE: n is number in *each* group
So we need 35 observations in each group.
We will now run a similar but not identical problem.
Let us find the sample sizes n1 = n2 = n needed to detect a difference of 3 when the common sd is 5, with a Power = 0.9, for a two sided alternative. Thus we now have Ha : µ1 ≠ µ2.
> power.t.test(delta=3,sd=5, sig.level=0.05,
power =0.9,alt="two.sided")
Two-sample t test power calculation

              n = 59.35157
          delta = 3
             sd = 5
      sig.level = 0.05
          power = 0.9
    alternative = two.sided
NOTE: n is number in *each* group
We now need 59 observations in each group.
6.5
6.5.1
POWER: an example of use of non-central F
Analysis of variance
For the AOV where we have more than two groups, we need to use the non–central F distribution. The non–centrality parameter is now

Λ = \frac{n \sum_i (μ_i − μ)^2}{σ^2}.

Note that this differs from the t–test definition! The formulation uses equal numbers in each group (n), but this is not necessary.
First up, we will use the non–central F distribution in R to verify our calculations for
the last t–test. The non–central F in R is simply the standard F with the optional parameter ncp.
There are some points to note :
1. to compare the two, the t–test has to be two–sided.

2. the two ncps are not exactly the same but are related, since in the two sample case

   λ = \frac{\text{diff}}{σ\sqrt{2/n}}

   and so

   λ^2 = \frac{n\,\text{diff}^2}{2σ^2},

   which is exactly the Λ defined above: the F non-centrality parameter is the square of the t non-centrality parameter.
AOV table (df only) :

The number of obs = 59 x 2 = 118

   Source    df
   Groups     1
   Error    116
   TOTAL    117
> qf(0.95,1,116)
[1] 3.922879
> lambda= 59 * 9 /(25 * 2)
> 1-pf(qf(0.95,1,116),df1=1,df2=116,ncp=lambda)
[1] 0.8982733
Thus the non–central F produces a Power of 90% as per the non–central t.
Lastly, we note the comment from Giesbrecht and Gumpertz (G and G), p61, concerning the definition of the non–centrality parameter :
"Note that usage in the literature is not at all consistent. Some writers divide by 2 . . . and others define the non–centrality parameter as the square root . . . Users must be cautious."
In developing our definition here we are using the definition that obviously works in R
and SAS, the package used by G and G.
For more information, see the R functions pf and Chisquare.
We now turn our attention to more than two groups, ie, Analysis of Variance proper.
It is worth noting that the form of the non–centrality parameter can be seen in the
table of Expected Mean Squares for the fixed effects AOV. This will be exploited later in
post–mortem power calculations on the AOV of data.
Example
Giesbrecht F.G. and Gumpertz M.L., (2004), Planning, Construction and Statistical
Analysis of Comparative Experiments, Wiley, Hoboken, pp62–63.
We have an experiment with a factor at 6 levels, and 8 reps within each level. This gives an AOV table (df only) :
The number of obs = 6 x 8 = 48

   Source    df
   Groups     5
   Error     42
   TOTAL     47
What is the Power of the test?
We are given that the mean is approx 75 and the coefficient of variation (CV ) is 20%.
Remember that
CV = 100 × σ/µ
Thus we can estimate
σ = 0.2 × µ = 15 → σ 2 = 225
The expected treatment means are (65, 85, 70, 75, 75, 80).
We now have all the information needed to calculate the non–centrality parameter.

Λ = \frac{8 \sum_i (μ_i − μ)^2}{σ^2} = \frac{8[(−10)^2 + (10)^2 + (−5)^2 + (0)^2 + (0)^2 + (5)^2]}{225} = 8.89

The R calculations :
> lambda <- 8.89
> qf(0.95,5,42)
[1] 2.437693
> pf(2.44,5,42,ncp=lambda,lower.tail=F)
[1] 0.5524059
Thus the probability of detecting a shift in the means of the order nominated is only 55%! We would need more than 8 observations per treatment level to pick up a change in means of the type suggested.
Example
Consider the following AOV of yield data from a randomized block experiment to
measure yields of 8 lucerne varieties from 3 replicates.
The statistical model is :

y_{ij} = μ + \underbrace{β_i}_{\text{block effect}} + \underbrace{τ_j}_{\text{treatment effect}} + \underbrace{ε_{ij}}_{\text{error}}, \quad ε_{ij} ∼ N(0, σ^2).
Table 6.1: AOV of lucerne variety yields

   Source       df    SS        MS       F      E(MS)
   replicate     2    41,091    20,545   2.4    σ² + 8 Σ_{i=1}^{3} β_i²/2
   variety       7    75,437    10,777   1.26   σ² + 3 Σ_{i=1}^{8} τ_i²/7
   residuals    14    120,218   8,587           σ²
The variety effects are:

   variety:   1    2    3   4   5   6   7   8
   effect:    0  -69  -46  90  81  71  60  22
Is there sufficient evidence to say that the observed differences are due to systematic
effects? If there is not sufficient evidence to say this, we are obliged to take the position
that the observed differences could have arisen from random sampling from a population
for which there were no variety effects.
If the null hypothesis that τ_j = 0, ∀j, is true,

variety MS ∼ σ^2 χ^2_7 / 7   and   residual MS ∼ σ^2 χ^2_{14} / 14.
The ratio of these 2 mean squares is a random variable whose distribution is named
the F distribution. We assess the null hypothesis using the F statistic which in this case
is 1.26 and the probability of getting this or more extreme by sampling from a population
with τj = 0 is 0.34. Thus we have no strong evidence to support the alternate hypothesis
that not all τj = 0.
Given the effort of conducting an experiment, this is a disappointing result. Specific
contrasts amongst the varieties should be tested but if there were no contrasts specified at
the design stage, this “data-snooping” may also be misleading. A post mortem poses the
following questions.
• Was the experiment designed to account for natural variation? If the blocking is not
effective, the systematic component τj is masked by the random component ij .
• Suppose that genuine differences do exist in the population. What was the probability
of detecting them with such a design?
The quantile of F marking the 5% critical region is Fcrit = 2.67. To be 95% sure that
the observed differences were not due to chance, the observed F has to exceed Fcrit . The
probability of rejection of the null hypothesis when the differences are actually not zero is
called POWER and is measured by the area under the non-central F density to the right
of Fcrit .
The F curves for this example are shown in Figure 6.3 where the vertical lines is Fcrit .
The calculated value of power is 0.1 which is very poor. A better design which will reduce
experiment error is required.
For post mortems of AOV's, the non-centrality parameter is estimated by

\hat{λ} = \frac{df_1 × (\text{Trt MS} − \text{Error MS})}{\text{Error MS}}.
It remains a student exercise to compare this with previous definitions (eg. 6.9).
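For concreteness, a sketch of the post-mortem power calculation for the lucerne AOV, using the Table 6.1 mean squares and the estimate of the non-centrality parameter just given:

trt_ms <- 10777; err_ms <- 8587; df1 <- 7; df2 <- 14
lambda_hat <- df1 * (trt_ms - err_ms) / err_ms
Fcrit <- qf(0.95, df1, df2)
pf(Fcrit, df1, df2, ncp = lambda_hat, lower.tail = FALSE)   # estimated power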
We revisit the topic of POWER in Statistical Inference.
[Figure 6.3: Central and non-central F densities — the central and non-central F curves plotted against the F quantile, with the area marked POWER = 0.1 lying beyond Fcrit.]

6.6
R commands
   df(x,df1,df2)          density of a central F distribution
   qf(p,df1,df2)          quantile corresponding to probability p for a central F
   pf(q,df1,df2,ncp)      probability for an F distribution with non-centrality parameter ncp
   ncpdf(q,df1,df2,ncp)   density of a non-central F, see below
ncpdf <- function(x, df1, df2, ncp) {
  # written by Bob Murison at UNE based on paper by
  # O'Neill and Thomson (1998), Aust J exp Agric, 38 617-22.
  # Density of the non-central F distribution, from the series form,
  # truncated at r = 100 terms.
  Beta <- function(v1, v2) (gamma(v1) * gamma(v2)) / gamma(v1 + v2)
  r  <- 0:100
  gF <- x * 0
  for (i in seq_along(x)) {
    gF[i] <- sum((((exp(-0.5 * ncp) * (ncp / 2)^r) / gamma(r + 1) *
                   (df1 / df2)^(df1 / 2 + r)) / Beta(df1 / 2 + r, df2 / 2) *
                  x[i]^(df1 / 2 + r - 1)) /
                 ((1 + (df1 / df2) * x[i])^((df1 + df2) / 2 + r)))
  }
  gF
}
Part II
Statistical Inference
Chapter 7
Reduction of Data
7.1
Types of inference
There are several ways to model data and use statistical inference to interpret the models.
Common strategies include
1. frequentist parametric inference,
2. Bayesian parametric inference,
3. non-parametric inference (frequentist and Bayesian),
4. semi-parametric inference,
and there are others.
Applied statisticians use the techniques best suited to the data and each technique
has its strengths and limitations. This section of the unit is about parametric frequentist
inference and many of the principles and skills are transportable to the other styles of
inference.
7.2
Frequentist inference
The object of statistics is to make an inference about a population based on information
contained in a sample. Populations are characterized by parameters, so many statistical
investigations involve inferences about one or more parameters. The process of performing
repetitions of an experiment and gathering data from it is called sampling.
The basic ideas of random sampling, presentation of data by way of density or
probability functions, and a statistic as a function of the data, are assumed known.
Computation of a statistic from a set of observations constitutes a reduction of the data
(where there are n items, say) to a single number. In the process of such reduction, some
information about the population may be lost. Hopefully, the statistic used is chosen so
that the information lost is not relevant to the problem. The notion of sufficiency, covered
in the next section, deals with this idea.
Commonly used statistics are: sample mean, sample variance, sample median, sample
range and mid–range. These are random variables with probability distributions dependent
on the original distribution from which the sample was taken.
7.3
Sufficient Statistics
[Read CB 6.1 or HC 7.2. The notation we will use for a statistic is T(X) = T(X_1, X_2, . . . , X_n), rather than their choice of Y_1 = u(X_1, X_2, . . . , X_n).]
The idea of sufficiency is that if we observe a random variable X (using a sample
X1 , . . . , Xn , or X) whose distribution depends on θ, often X can be reduced via a function,
without losing any information about θ. For example,
T(X) = T(X_1, . . . , X_n) = \sum_{i=1}^{n} X_i / n,
which is the sample mean, may in some cases contain all the relevant information about
θ, and in that case T(X) is called a sufficient statistic. That is, knowing the actual n
observations doesn’t contribute any more to the inference about θ, than just knowing the
average of the n observations. We can then base our inference about θ on T(X), which
can be considerably simpler than X (involving a univariate distribution, rather than an
n–variate one).
Definition 7.1
A statistic T = T (X) is said to be sufficient for a family of distributions if and only if
the conditional distribution of X given the value of T is the same for all members of the
family (that is, doesn’t depend on θ).
Equivalent definitions for the discrete case and continuous cases respectively are given
below.
Definition 7.2
Let f (x; θ), θ ∈ Θ be a family of distributions of the discrete type. For a random sample
X1 , . . . , Xn from f (x; θ), define T = T (X1 , . . . , Xn ). Then T is a sufficient statistic for θ
if, for all θ and all possible sample points,
P (X1 = x1 , . . . , Xn = xn |T = t(x1 , . . . , xn ))
(7.1)
does not involve θ. [Note that the lack of dependence on θ includes not only the function,
but the range space as well.]
Consider here the role of θ. Its job is to represent all the stochastic information of the
data. Other information such as the scale of measurement should not be random. So if
T is sufficient for θ, then interpretation of the data conditional on f (t(x1 , . . . , xn )) should
remove all the stochastic bits, leaving only the non-random bits.
Definition 7.3
Let X1 , . . . , Xn be a random sample from a continuous distribution, f (x; θ), θ ∈ Θ. Let
T = T (X1 , . . . , Xn ) be a statistic with pdf fT (t). Then T is sufficient for θ if and only if
\frac{f(x_1; θ) × f(x_2; θ) × \cdots × f(x_n; θ)}{f_T(t(x_1, x_2, . . . , x_n); θ)}    (7.2)
does not depend on θ, for every fixed value of t.
Again the range space of the xi must not depend on θ either.
Example
A random sample of size n is taken from the Poisson distribution, P (λ).
Is \sum_i X_i sufficient for λ?

Let

P(X_1, . . . , X_n \mid \sum X_i = t) = P(A|B) = \frac{P(A)}{P(B)},

since A represents only one way in which the total t could be achieved, and so A ⊂ B; P(A ∩ B) = P(A).

As \sum X_i ∼ P(nλ) we have

P(X_1, . . . , X_n \mid \sum X_i = t) = \frac{\prod_i e^{-λ} λ^{x_i}/x_i!}{e^{-nλ}(nλ)^{\sum_i x_i}/(\sum_i x_i)!} = \frac{(\sum_i x_i)!}{n^{\sum_i x_i} \prod (x_i!)},

which does not involve λ, and so T = \sum_i X_i is sufficient for λ.
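An empirical illustration of this can be useful: conditional on the total, the counts are multinomial with equal cell probabilities, whatever λ is. The sketch below (illustrative values of n, t and x) estimates one conditional probability at two different λ values and compares with the exact multinomial value.

set.seed(5)
cond_prob <- function(lambda, n = 3, t = 6, x = c(1, 2, 3), nsim = 2e5) {
  sims <- matrix(rpois(n * nsim, lambda), ncol = n)
  keep <- sims[rowSums(sims) == t, , drop = FALSE]
  mean(apply(keep, 1, function(row) all(row == x)))
}
c(cond_prob(0.5), cond_prob(2))                       # similar despite different lambda
dmultinom(c(1, 2, 3), size = 6, prob = rep(1/3, 3))   # exact conditional value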
Example 7.1
Given X_1, . . . , X_n is a random sample from a binomial distribution with parameters m, θ, show that T = \sum_{i=1}^{n} X_i is a sufficient statistic for θ.
Solution.

From Definition 7.2, we need to consider

P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n \mid \sum X_i = t),

and note that the X_i are independent and that \sum_{i=1}^{n} X_i ∼ bin(mn, θ). So equation (7.1) becomes

\frac{\binom{m}{x_1} θ^{x_1}(1-θ)^{m-x_1} \cdots \binom{m}{x_n} θ^{x_n}(1-θ)^{m-x_n}}{\binom{mn}{t} θ^{t}(1-θ)^{mn-t}}, \quad x_i = 0, 1, . . . , m, \ \sum x_i = t,

which on simplification is

\frac{\binom{m}{x_1}\binom{m}{x_2} \cdots \binom{m}{x_n}}{\binom{mn}{t}},

which is seen to be free of θ, and of the range space of the x_i also. Hence \sum_{i=1}^{n} X_i is sufficient for θ.
Continuous case
The statistic T is sufficient for θ if

\frac{\prod_i f(x_i; θ)}{f_T(t(x_1, . . . , x_n); θ)}

does not involve θ.
Example
For a random sample of size n from an exponential distribution, show that the sample
total is sufficient for the exponential parameter.
A form of the exponential distribution is
f (x; λ) = λe−λx , x > 0
If the sample total is sufficient for λ then

\frac{\prod_i f(x_i; λ)}{f_T(t(x_1, . . . , x_n); λ)}

should not contain λ. The distribution of the sample total is Gamma with parameters n and λ, as can be shown using moment generating functions. Thus the conditional distribution becomes

\frac{\prod_{i=1}^{n} λ e^{-λ x_i}}{λ^n (\sum_{i=1}^{n} x_i)^{n-1} e^{-λ \sum_i x_i} / Γ(n)} = \frac{Γ(n)}{(\sum_i x_i)^{n-1}},

which does not contain λ, indicating that the sample total is sufficient for λ.
Example 7.2
Let X1 , . . . , Xn be a random sample from the truncated exponential distribution, where
fXi (xi ) = eθ−xi , xi > θ
or, using the indicator function notation,
fXi (xi ) = eθ−xi I(θ,∞) (xi ).
Show that Y1 = min(Xi ) is sufficient for θ.
Solution.
In Definition 7.3, T = T (X1 , . . . , Xn ) = Y1 and to examine equation (7.2), we need
fT (t), the pdf of the smallest order statistic. Now for the pdf above,
F(x) = \int_θ^{x} e^{θ-z}\, dz = e^{θ}[e^{-θ} − e^{-x}] = 1 − e^{θ-x}.

From Distribution Theory equation (5.6), the pdf of Y_1 is

n[1 − F(y_1)]^{n-1} f(y_1) = n e^{(θ-y_1)(n-1)} × e^{θ-y_1} = n e^{n(θ-y_1)}, \quad y_1 > θ.

So the conditional density of X_1, . . . , X_n given T = t is

\frac{e^{θ-x_1} e^{θ-x_2} \cdots e^{θ-x_n}}{n e^{n(θ-y_1)}} = \frac{e^{-\sum x_i}}{n e^{-n y_1}}, \quad x_i ≥ y_1, \ i = 1, . . . , n,

which is free of θ for each fixed y_1 = min(x_i). Note that since x_i ≥ y_1, i = 1, . . . , n, neither the expression nor the range space depends on θ, so the first order statistic, Y_1, is a sufficient statistic for θ.
In establishing that a particular statistic is sufficient, we do not usually use the above
definition(s) directly. Instead, a factorization criterion is preferred and this is described in
7.4.
7.4
Factorization Criterion
The Theorem stated below is often referred to as the Fisher–Neyman criterion.
Theorem 7.1
Let X1 , . . . , Xn (or X) denote a random sample from a distribution with density function
f(x; θ). Then the statistic T=t(X) is a sufficient statistic for θ if and only if we can find
two functions g and h such that
f (x; θ) = g(t(x); θ)h(x)
where, for every fixed value of t(x), h(x) does not depend on θ.
(The range space of x for which f(x; θ) > 0 must not depend on θ.)
(An aside; heuristic explanation.)
Factorisation is a way of separating the random and non-random components. Only
when t(x) is comprehensive enough such that
f (x; θ) = g (t(x); θ) h(x) ,
will it be sufficient information to pin down θ. Conversely, when it is sufficient, the extra
“enhancements” are redundant.
Proof.
For the continuous case, a proof is given in HC 7.2, Theorem 1, where their k1 is our
g and their k2 is our h.
To use the factorization criterion, we examine the joint density function, f(x; θ) and
see whether there is any factorization of the type required in terms of some function t(x).
It is usually not easy to use the factorization criterion to show that a statistic T is not
sufficient.
Note that the family of distributions may be indexed by a vector parameter θ, in which
case the statistic T in the definition of sufficiency can be a vector function of observations,
for example, (X̄, S 2 ) or (Xmin , Xmax ).
Example
(discrete)
For the Poisson distribution, is the sample total sufficient for λ?
\prod_i \frac{e^{-λ} λ^{x_i}}{x_i!} = e^{-nλ} λ^{\sum_i x_i} \frac{1}{\prod_i x_i!} = g(t; λ) × h(x),

and so T = \sum_i X_i is sufficient for λ, as expected.

(continuous)

For the exponential distribution, is the sample total sufficient for λ?

\prod_i λ e^{-λ x_i} = λ^n e^{-λ \sum_i x_i} × 1 = g(t; λ) × h(x),

and so T = \sum_i X_i is sufficient for λ, as expected.

Example 7.3
We will consider again example 7.2, using the factorization criterion.
The joint probability density function of X1 , . . . , Xn is
f(x; θ) = e^{-\sum (x_i − θ)} \prod_{i=1}^{n} I_{(θ,∞)}(x_i)
        = \begin{cases} e^{-\sum x_i} e^{nθ} \cdot 1 & \text{if } \min(x_1, . . . , x_n) > θ \\ e^{-\sum x_i} e^{nθ} \cdot 0 & \text{otherwise} \end{cases}
        = e^{-\sum x_i} e^{nθ} I_{(θ,∞)}(t)
        = h(x)\, g(t(x); θ),

where

h(x) = e^{-\sum x_i}, \quad t(x) = \min x_i \quad \text{and} \quad g(t(x); θ) = e^{nθ} I_{(θ,∞)}(t).
So, by Theorem 7.1, min(Xi ) is a sufficient statistic for θ.
Example 7.4
(Example 4 in HC, 7.2, p319)
Let X1 , . . . , Xn be a random sample from a N (θ, σ 2 ) distribution where σ 2 is known. Show
that X̄ is sufficient for θ.
Now

f(x_1; θ) . . . f(x_n; θ) = c\, e^{-\sum_{i=1}^{n}(x_i − θ)^2 / 2σ^2}.

Writing x_i − θ as x_i − \bar{x} + (\bar{x} − θ), we have

\sum (x_i − θ)^2 = \sum (x_i − \bar{x})^2 + n(\bar{x} − θ)^2 + 2(\bar{x} − θ)\underbrace{\sum (x_i − \bar{x})}_{=0}.

So the RHS is

c\, e^{-n(\bar{x}−θ)^2/2σ^2} \cdot e^{-\sum (x_i − \bar{x})^2 / 2σ^2} = g(\bar{x}; θ) · h(x),

since the first term on the RHS depends on x only through \bar{x} (or \sum x_i) and the second term does not depend on θ.
Read HC 7.2, Examples 5, 6.
Example 7.5
Consider a random sample of size n from the uniform distribution, f (x; θ) = 1/θ, x ∈
(0, θ]. We will use the factorization criterion to find a sufficient statistic for θ.
The joint density function is

f(x; θ) = \begin{cases} \frac{1}{θ^n}, & \text{if } 0 < x_i < θ \text{ for } i = 1, 2, . . . , n \\ 0, & \text{if } x_i > θ \text{ or } x_i < 0 \text{ for any } i. \end{cases}

This can be written in the form

f(x; θ) = g(y_n, θ)\, h(x),

where

g(y_n, θ) = \begin{cases} \frac{1}{θ^n}, & \text{if } θ > y_n \\ 0, & \text{if } θ ≤ y_n \end{cases}
\qquad \text{and} \qquad
h(x) = \begin{cases} 1, & \text{if } x_i > 0 \text{ for all } i \\ 0, & \text{if any } x_i ≤ 0. \end{cases}

Of course, y_n in the above is the largest order statistic. The factorization criterion is satisfied in terms of the statistic T = Y_n, so this statistic is sufficient for θ.
Comment.
Note that the joint pdf is 1/θ n which is just a function of θ, so it would appear that
any statistic is sufficient. The fallacy in this argument is that the joint density function
is not always given by 1/θ^n, but is equal to zero for x_i ∉ [0, θ]. So it really is not just a
function of θ. However, we can get it into the required form by taking T = Yn . [Note that
if Yn < θ then all the Xi ≤ θ.]
Although the factorization criterion works here and in other cases where the range
space depends on the parameter, one has to be careful, and it is often safer to find the
conditional density for the sample given the statistic, rather than use the factorization
criterion. This is done below.
The joint pdf of the ordered sample Y_1 < Y_2 < . . . < Y_n is

\begin{cases} \frac{n!}{θ^n} & \text{for } 0 ≤ y_1 ≤ . . . ≤ y_n ≤ θ \\ 0 & \text{otherwise,} \end{cases}

and the density for Y_n is

\begin{cases} n y_n^{n-1}/θ^n & \text{for } 0 ≤ y_n ≤ θ \\ 0 & \text{otherwise.} \end{cases}

Hence the conditional density of Y_1, . . . , Y_n given Y_n (which is ≤ θ) is

\begin{cases} (n-1)!/y_n^{n-1} & \text{for } 0 ≤ y_1 ≤ . . . ≤ y_n \\ 0 & \text{otherwise,} \end{cases}

which does not depend on θ.
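A brief simulation illustration of the same point (n and the two θ values are arbitrary): conditional on the maximum, the remaining order statistics from U(0, θ) look the same whatever θ is, so a scale-free summary such as Y_1/Y_n has the same distribution for any θ.

set.seed(6)
ratio <- function(theta, n = 5, nsim = 10000)
  replicate(nsim, {y <- sort(runif(n, 0, theta)); y[1] / y[n]})
summary(ratio(1))
summary(ratio(10))    # essentially the same distribution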
7.5
The Exponential Family of Distributions
[Read CB 3.4 or HC 7.5 where we will use B(θ) for their eq(θ) and h(x) for their eS(x) .]
Definition 7.4
The exponential family of distributions is a one-parameter family that can be written
in the form
f (x; θ) = B(θ)h(x)e[p(θ)K(x)] , a < x < b,
(7.3)
where γ < θ < δ. If, in addition,
(a) neither a nor b depends on θ,
(b) p(θ) is a non-trivial continuous function of θ,
(c) each of K'(x) ≢ 0 and h(x) is a continuous function of x, a < x < b,
we say that we have a regular case of the exponential family.
Most of the well-known distributions can be put into this form, for example, binomial,
Poisson, geometric, gamma and normal. The joint density function of a random sample X
from such a distribution can be written as
f(x; θ) = B^n(θ) \prod_{i=1}^{n} h(x_i)\, e^{p(θ) \sum_{i=1}^{n} K(x_i)}, \quad a < x_i < b.    (7.4)

Putting

T = t(X) = \sum_{i=1}^{n} K(X_i) \quad \text{and} \quad t(x) = \sum_{i=1}^{n} K(x_i),

we see that f(x; θ) can be written as

f(x; θ) = \left[ B^n(θ)\, e^{p(θ) t(x)} \right] \left[ \prod_{i=1}^{n} h(x_i) \right] = g(t(x); θ)\, h(x),

so that Theorem 7.1 applies and t(X) is a sufficient statistic for θ.
Example 7.6
Let X ∼ U[0, θ]. Then f(x) = 1/θ, x ∈ [0, θ]. We see that f(x) cannot be written in the form of equation (7.3). We could write B(θ) = 1/θ, p(θ) = 0, but then we would need

h(x) = \begin{cases} 1, & 0 ≤ x ≤ θ \\ 0 & \text{otherwise,} \end{cases}

which makes h(x) depend on θ, and condition (c) of Definition 7.4 would not be satisfied. [We already know that max X_i is sufficient for θ here, and note that max X_i is not of the form \sum_{i=1}^{n} K(X_i).]
Example 7.7
Consider the normal distribution with mean θ and variance 1. The density function can be written in the form of equation (7.3), where

\frac{1}{\sqrt{2π}} e^{-(x-θ)^2/2} = \underbrace{\frac{1}{\sqrt{2π}} e^{-θ^2/2}}_{B(θ)} \cdot \underbrace{(e^{-x^2/2})}_{h(x)} \cdot e^{θx},

and p(θ) = θ, K(x) = x. So T = \sum K(X_i) = \sum X_i is minimal sufficient for θ. Note that we could have defined p(θ) = nθ and K(x) = x/n, so that T = \sum X_i/n = \bar{X} is also sufficient for θ.
A distribution from the exponential family arises from tilting a simple density,

f(x; θ) = f(x) × e^{θx − K(θ)},

where

K(θ) = \log E(e^{θx}), \quad μ = K'(θ), \quad σ^2 = K''(θ),

and θ is termed the natural parameter.
Theorem
A necessary and sufficient condition for a pdf to possess a sufficient statistic is that it
belongs to the exponential family of distributions.
The exponential family also gives the form of the sufficient statistic, viz,

f(x; θ) = B^n(θ)\, e^{p(θ) \sum_i K(x_i)} \prod_i h(x_i).

Choosing

T = t(X) = \sum_i K(X_i) \quad \text{and} \quad t(x) = \sum_i K(x_i)

gives

f(x; θ) = B^n(θ)\, e^{p(θ) \sum_i K(x_i)} \left[ \prod_i h(x_i) \right] = g[t(x); θ]\, h(x).

Thus t(X) = \sum_i K(X_i) is a sufficient statistic for θ, by use of the factorisation criterion.
Examples
(Poisson)

f(x; λ) = \frac{e^{-λ} λ^{x}}{x!} = e^{-λ}\, \frac{1}{x!}\, e^{x \ln λ}.

So θ = λ, B(θ) = e^{-θ}, h(x) = 1/(x!), p(θ) = \ln θ = \ln λ and K = I. Thus \sum_i X_i is sufficient for λ.

(Exponential)

f(x; λ) = λ e^{-λx}.

So θ = λ, h = 1, p(θ) = −λ = −θ and K = I. Thus \sum_i X_i is sufficient for λ.
Likelihood
The likelihood is the joint probability function of the sample as a function of θ.
Thus
L(θ; x) = f (x; θ) = L(θ)
The fact that the likelihood as a function of θ differs from the pdf as a function of x was first articulated by Fisher :
(1921) ”What we can find from a sample is the likelihood of any particular value of ρ, if
we define the likelihood as a quantity proportional to the probability that, from a
population having that particular value of ρ, a sample having the observed value
r should be obtained. So defined, probability and likelihood are quantities of an
entirely different nature.”
(1925) ”What has now appeared is that the mathematical concept of probability is inadequate to express our mental confidence or diffidence in making such inferences, and
that the mathematical quantity which appears to be appropriate for measuring our
order of preference among different possible populations does not in fact obey the laws
of probability. To distinguish it from probability, I have used the term ’Likelihood’
to designate this quantity.”
The value of θ that maximises this likelihood is called the maximum likelihood estimator (mle).
Definition 7.5
Let X1 , . . . , Xn be a random sample from f (x; θ) and x1 , . . . , xn the corresponding
observed values. The likelihood of the sample is the joint probability function (or the
joint probability density function, in the continuous case) evaluated at x1 , . . . , xn , and is
denoted by L(θ; x1 , . . . , xn ).
Now the notation emphasizes that, for a given sample x, the likelihood is a function of
θ. Of course
L(θ; x) = f (x; θ), [= L(θ), in a briefer notation].
The likelihood function is a statistic, depending on the observed sample x. A statistical
inference or procedure should be consistent with the assumption that the best explanation
of a set of data is provided by θ̂, a value of θ that maximizes the likelihood function. This
value of θ is called the maximum likelihood estimate (mle). The relationship of a
sufficient statistic for θ to the mle for θ is contained in the following theorem.
Theorem 7.2
Let X1 , . . . , Xn be a random sample from f (x; θ). If a sufficient statistic T = t(X) for θ
exists, and if a maximum likelihood estimate θ̂ of θ also exists uniquely, then θ̂ is a function
of T .
Proof
Let g(t(x; θ)) be the pdf of T . Then by the definition of sufficiency, the likelihood
function can be written
L(θ; x1 , . . . , xn ) = f (x1 , θ) . . . f (xn ; θ) = g (t(x1 , . . . , xn ); θ) h(x1 , . . . , xn )
(7.5)
where h(x1 , . . . , xn ) does not depend on θ. So L and g as functions of θ are maximized
simultaneously. Since there is one and only one value of θ that maximizes L and hence
g(t(x1 , . . . , xn ); θ), that value θ must be a function of t(x1 , . . . , xn ). Thus the mle θ̂ is a
function of the sufficient statistic T = t(X1 , . . . , Xn ).
Sometimes we cannot find the maximum likelihood estimator by differentiating the
likelihood (or log of the likelihood) with respect to θ and setting the equation equal to
zero. Two possible problems are:
(i) The likelihood is not differentiable throughout the range space;
(ii) The likelihood is differentiable, but there is a terminal maximum (that is, at one end
of the range space).
For example, consider the uniform distribution on [0, θ]. The likelihood, using a random
sample of size n is
L(θ; x1 , . . . , xn ) =
(
1
θn
0
for 0 ≤ xi ≤ θ, i = 1, . . . , n
otherwise .
(7.6)
Now 1/θ n is decreasing in θ over the range of positive values. Hence it will be maximized
by choosing θ as small as possible while still satisfying 0 ≤ xi ≤ θ. That is, we choose θ
equal to X(n) , or Yn , the largest order statistic.
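A small R sketch makes the terminal maximum visible (sample size and θ are arbitrary illustrative choices):

set.seed(7)
x <- runif(8, 0, 2)
loglik <- function(theta)
  ifelse(theta >= max(x), -length(x) * log(theta), NA_real_)  # zero likelihood below max(x)
theta_grid <- seq(0.5, 4, by = 0.01)
plot(theta_grid, sapply(theta_grid, loglik), type = "l",
     xlab = expression(theta), ylab = "log-likelihood")
abline(v = max(x), lty = 2)   # the mle is the largest order statistic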
Example 7.8
Consider the truncated exponential distribution with pdf
f (x; θ) = e−(x−θ) I[θ,∞) (x).
The Likelihood is
L(θ; x_1, . . . , x_n) = e^{nθ − \sum x_i} \prod_{i=1}^{n} I_{[θ,∞)}(x_i).
Hence the likelihood is increasing in θ and we choose θ as large as possible, that is, equal
to min(xi ).
Further use is made of the concept of likelihood in Hypothesis Testing (Chapter 9), but
here we will define the term likelihood ratio, and in particular monotone likelihood
ratio.
Definition 7.6
Let θ1 and θ2 be two competing values of θ in the density f (x; θ), where a sample of
values X leads to likelihood, L(θ; X). Then the likelihood ratio is
Λ = L(θ1 ; X)/L(θ2 ; X).
This ratio can be thought of as comparing the relative merits of the two possible values
of θ, in the light of the data X. Large values of Λ would favour θ1 and small values of Λ
would favour θ2 . Sometimes the statistic T has the property that for each pair of values
θ1 , θ2 , where θ1 > θ2 , the likelihood ratio is a monotone function of T . If it is monotone
increasing, then large values of T tend to be associated with the larger of the two parameter
values. This idea is often used in an intuitive approach to hypothesis testing where, for
example, a large value of X would support the larger of two possible values of µ.
Definition 7.7
A family of distributions indexed by a real parameter θ is said to have a monotone
likelihood ratio if there is a statistic T such that for each pair of values θ1 and θ2 , where
θ1 > θ2 , the likelihood ratio L(θ1 )/L(θ2 ) is a non–decreasing function of T .
Example 7.9
Let X1 , . . . , Xn be a random sample from a Poisson distribution with parameter λ.
Determine whether (X1 , . . . , Xn ) has a monotone likelihood ratio (mlr).
Here the likelihood of the sample is

L(λ; x_1, . . . , x_n) = e^{-nλ}\, λ^{\sum x_i} \Big/ \prod x_i!.
Let λ_1, λ_2 be 2 values of λ with 0 < λ_1 < λ_2 < ∞. Then for given x_1, . . . , x_n,

\frac{L(λ_2; x)}{L(λ_1; x)} = \frac{e^{-nλ_2} λ_2^{\sum x_i}}{e^{-nλ_1} λ_1^{\sum x_i}} = \left(\frac{λ_2}{λ_1}\right)^{\sum x_i} e^{-n(λ_2 − λ_1)}.

Note that (λ_2/λ_1) > 1, so this ratio is increasing as T(x) = \sum x_i increases. Hence (X_1, . . . , X_n) has a monotone likelihood ratio in T(x) = \sum x_i.

7.7

Information in a Sample
In the next chapter we will be considering properties of estimators. One of these properties
involves the variance of an estimator and our desire to choose an estimator with variance
as small as possible. Some concepts and results that will be used there are introduced in
this section. In particular, we will consider the notion of information in a sample, and
how we measure this information when data from several experiments is combined.
Consider a distribution indexed by a real parameter θ and suppose X1 , . . . , Xn1 and
Y1 , . . . , Yn2 are independent sets of data, then the likelihood of the combined sample is the
product of the likelihoods of the two individual samples. That is,
L(θ; x, y) = L1 (θ; x)L2 (θ; y)
and so
log L(θ; x, y) = log L1 (θ; x) + log L2 (θ; y).
The statistic that we shall be concerned with is the derivative with respect to θ of the log
likelihood.
Definition 7.8
The score of a sample, denoted by V, is defined by

V = \frac{\partial}{\partial θ} \log L(θ; X) = \frac{L'(θ)}{L(θ)} = \ell'(θ),

where L'(θ) = \frac{\partial}{\partial θ} L(θ) and \ell(θ) = \log L(θ).
Some properties of V are given below. Rigorous proofs of these results depend on fulfillment of conditions (sometimes referred to as regularity conditions) that permit interchange
of integration and differentiation operations, and on the existence and integrability of the
various partial derivatives. The proofs are not required in this course but an outline of the
proof of equation (7.7) is given on page 121.
Properties of V
(i) The expected value of V is zero.

E(V) = E\!\left(\frac{\partial \ell(θ)}{\partial θ}\right) = E\!\left(\frac{\partial}{\partial θ} \ln f\right) = \int \frac{1}{f}\frac{\partial f}{\partial θ}\, f\, dx = \int \frac{\partial f}{\partial θ}\, dx,

but since

\int f\, dx = 1,

then by differentiating wrt θ we get

\int \frac{\partial f}{\partial θ}\, dx = 0,

which gives

E\!\left(\frac{\partial \ell(θ)}{\partial θ}\right) = 0.

Intuitively, this is reasonable, as the mle is obtained by solving

\frac{\partial \ell}{\partial θ} = 0.
(ii) Var(V) is called the information (or Fisher's information) in a sample and is denoted by I_X(θ), so we have

I_X(θ) = Var(V) = E\!\left\{ \left[ \frac{\partial}{\partial θ} \log f(X; θ) \right]^2 \right\}.

That is,

Var(score) = Var\!\left(\frac{\partial \ell(θ)}{\partial θ}\right) = E\!\left(\frac{\partial \ell(θ)}{\partial θ}\right)^2 \stackrel{def}{=} I_X(θ),    (7.7)

where I_X(θ) is called the information in the sample.

If we consider two likelihood functions, both centered on the mle, one a spike and the other a flat pulse, then Var(score) is larger for the spike than for the pulse, since the function

\left(\frac{\partial \ell}{\partial θ}\right)^2

corresponds to the absolute change in the derivative, which is greater for the spike than for the pulse. Thus the information contained in the spike is stronger, while the pulse is less informing.

Later it will be shown (p126) that the variance of an estimator is related to the inverse of the information, and so the spike will correspond to a situation where the parameter is well estimated (ie, high precision or information), with low variance of estimation and a short confidence interval for the parameter. The flat pulse will correspond to a poorly estimated parameter, with low precision or information, and thus with high variance of estimation and a wide confidence interval for the parameter.

Thus we need to distinguish between the variance of the score and the variance of the estimator of the parameter, V(θ̂).
(iii) Information is additive over independent experiments. For X, Y independent, we
have
IX (θ) + IY (θ) = IX+Y (θ).
(7.8)
(iv) As a special case of (iii), the information in a random sample of size n is n times the information in a single observation. That is,

I_{\mathbf{X}}(θ) = n\, I_X(θ).    (7.9)
(v) The information provided by a sufficient statistic T = t(X) is the same as that in
the sample X. That is,
IT (θ) = IX (θ).
(7.10)
(vi) The information in a sample can be computed by an alternate formula,

I_X(θ) = −E\!\left(\frac{\partial V}{\partial θ}\right).
An alternative form for I_X(θ) is

I_X(θ) = −E\!\left(\frac{\partial^2 \ell}{\partial θ^2}\right),    (7.11)

since the expected value of the score gives

\int \frac{\partial}{\partial θ}(\ln f)\, f\, dx = 0,

which differentiated wrt θ gives

\int \left[ \frac{\partial^2 \ln f}{\partial θ^2}\, f + \frac{\partial \ln f}{\partial θ}\frac{\partial f}{\partial θ} \right] dx = 0,

ie

\int \left[ \frac{\partial^2 \ln f}{\partial θ^2}\, f + \frac{\partial \ln f}{\partial θ}\frac{1}{f}\frac{\partial f}{\partial θ}\, f \right] dx = 0,

or

\int \left[ \frac{\partial^2 \ln f}{\partial θ^2} + \frac{\partial \ln f}{\partial θ}\frac{\partial \ln f}{\partial θ} \right] f\, dx = 0,

from which the result follows.
(vii) For T = t(X) a statistic,
IT (θ) ≤ IX (θ)
(7.12)
with equality holding if and only if T is a sufficient statistic for θ.
[This property emphasizes the importance of sufficiency. The reduction of a sample
to a statistic may lose information relative to θ, but there is no loss of information if
and only if sufficiency is maintained in the data reduction.]
Comment on (i) and (vi).
A typical example where the “regularity conditions” don’t hold is the case where X is
distributed U(0, θ). When the range space of X depends on θ, the order of integration
(over X) and differentiation (with respect to θ) can’t usually be interchanged, as is done
in proving (i) and (vi). In particular, for a sample of size 1 from f (x) = 1/θ, 0 < x < θ,
we have L(θ; x) = 1/θ, and

\log L(θ; x) = −\log θ,
V = \frac{\partial}{\partial θ} \log L(θ; x) = −\frac{1}{θ},
E(V) = \int_0^{θ} −\frac{1}{θ} \cdot \frac{1}{θ}\, dx = −\frac{1}{θ} \neq 0.
Example 7.10
For X1 , . . . , Xn a random sample from a N(µ, σ 2 ) distribution, find
(a) the information for µ; (b) the information for σ 2 .
(a) We have

f(x_i; μ) = (2πσ^2)^{-1/2} e^{-(x_i − μ)^2/2σ^2}
\log f(x_i; μ) = −\frac{1}{2}\log(2πσ^2) − \frac{1}{2σ^2}(x_i − μ)^2 = −\frac{1}{2}\log(2π) − \frac{1}{2}\log σ^2 − \frac{1}{2σ^2}(x_i − μ)^2
V = \frac{\partial}{\partial μ} \log f = \frac{1}{2σ^2} \cdot 2(x_i − μ)
\frac{\partial V}{\partial μ} = −\frac{1}{σ^2}, \text{ a constant,}
I_X(μ) = −E\!\left(\frac{\partial V}{\partial μ}\right) = \frac{1}{σ^2}.
Alternatively, we note that V^2 = (X_i − μ)^2/σ^4 and that

Var(V) = E(V^2) = \frac{E(X_i − μ)^2}{σ^4} = \frac{1}{σ^2}.

[Both I_X(μ) and Var(V) are expressions for the information in a single observation. The information in a random sample of size n is thus n/σ^2.]
(b) We have

V = \frac{\partial}{\partial σ^2} \log f = −\frac{1}{2σ^2} + \frac{(x_i − μ)^2}{2σ^4}
\frac{\partial V}{\partial σ^2} = \frac{1}{2(σ^2)^2} − \frac{(x_i − μ)^2}{(σ^2)^3}
−E\!\left(\frac{\partial V}{\partial σ^2}\right) = −\frac{1}{2σ^4} + \frac{E(X_i − μ)^2}{(σ^2)^3} = −\frac{1}{2σ^4} + \frac{σ^2}{(σ^2)^3} = \frac{1}{2σ^4}.

For a sample of size n, I_X = n/2σ^4.
Example 7.11
Compute the information on p from n Bernoulli trials with probability of success equal to p. Now

f(x; p) = p^x (1 − p)^{1−x}
\log f(x; p) = x \log p + (1 − x)\log(1 − p)
V = \frac{\partial}{\partial p}\log f(x; p) = \frac{x}{p} − \frac{1 − x}{1 − p}
\frac{\partial V}{\partial p} = −\frac{x}{p^2} − \frac{1 − x}{(1 − p)^2}
E\!\left(\frac{\partial V}{\partial p}\right) = −\frac{1}{p^2}\, p − \frac{1 − p}{(1 − p)^2} = −\frac{1}{p(1 − p)}
−E\!\left(\frac{\partial V}{\partial p}\right) = \frac{1}{pq}.

For a sample of size n, the information on p is I_X(p) = n/pq.
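The result is easy to check numerically using the alternative form I(p) = −E(∂²ℓ/∂p²); the sketch below approximates the second derivative by finite differences at the true p (values of n, p and eps are illustrative).

set.seed(8)
n <- 50; p <- 0.3; eps <- 1e-4
loglik <- function(p, x) sum(dbinom(x, 1, p, log = TRUE))
d2 <- replicate(20000, {
  x <- rbinom(n, 1, p)
  (loglik(p + eps, x) - 2 * loglik(p, x) + loglik(p - eps, x)) / eps^2
})
c(-mean(d2), n / (p * (1 - p)))   # should be close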
Examples
(Poisson)

f(x; θ) = \prod_i \frac{e^{-λ} λ^{x_i}}{x_i!} = \frac{e^{-nλ} λ^{\sum_i x_i}}{\prod_i (x_i!)}

and so

\ell = −nλ + \sum_i x_i \ln λ − \sum_i \ln(x_i!),

to give

\frac{\partial \ell}{\partial λ} = −n + \frac{\sum_i x_i}{λ}
\quad \text{and} \quad
\frac{\partial^2 \ell}{\partial λ^2} = −\frac{\sum_i x_i}{λ^2}.

Using the alternative formula for I_x(θ) gives

I(θ) = −E\!\left[\frac{\partial^2 \ell}{\partial λ^2}\right] = \frac{E[\sum_i x_i]}{λ^2} = \frac{nλ}{λ^2} = n/λ.

Using the first form gives

I(θ) = E\!\left(\frac{\partial \ell}{\partial λ}\right)^2 = E\!\left(−n + \frac{\sum_i x_i}{λ}\right)^2 = n^2 + \frac{E(\sum_i x_i)^2}{λ^2} − \frac{2n\, E(\sum_i x_i)}{λ} = \frac{E(\sum_i x_i)^2}{λ^2} − n^2.

But

\left(\sum_i x_i\right)^2 = \sum_i x_i^2 + 2\sum_{i<j} x_i x_j,

with expectation

nV(x) + nμ_x^2 + 2\left(\frac{n(n−1)}{2}\right)λ^2 = nλ + nλ^2 + (n^2 − n)λ^2,

to give

I(θ) = \frac{nλ + n^2λ^2}{λ^2} − n^2 = n/λ,

as before.
(Exponential)

f(x; θ) = \prod_i λ e^{-λ x_i} = λ^n e^{-λ \sum_i x_i}

and so

\ell = n \ln λ − λ \sum_i x_i,

with

\frac{\partial \ell}{\partial λ} = \frac{n}{λ} − \sum_i x_i
\quad \text{and} \quad
\frac{\partial^2 \ell}{\partial λ^2} = −\frac{n}{λ^2}; \qquad I(λ) = \frac{n}{λ^2}.

Using the first form,

I(λ) = E\!\left(\frac{\partial \ell}{\partial λ}\right)^2 = E\!\left(\frac{n}{λ} − \sum_i x_i\right)^2 = E\!\left[\frac{n^2}{λ^2} + \left(\sum_i x_i\right)^2 − \frac{2n \sum_i x_i}{λ}\right] = E\!\left(\sum_i x_i\right)^2 − \frac{n^2}{λ^2},

but

\left(\sum_i x_i\right)^2 = \sum_i x_i^2 + 2\sum_{i<j} x_i x_j,

with expectation

n(σ^2 + μ^2) + 2\,\frac{n(n−1)}{2}\,\frac{1}{λ^2} = n(1/λ^2 + 1/λ^2) + (n^2 − n)(1/λ^2) = n/λ^2 + n^2/λ^2.

Thus

I(λ) = E\!\left(\frac{\partial \ell}{\partial λ}\right)^2 = n/λ^2,

as before.
Outline of Proof of equation 7.7
\frac{\partial}{\partial θ} \log f(X; θ) = \frac{f'(X; θ)}{f(X; θ)} = \frac{f'}{f} = V
\frac{\partial V}{\partial θ} = \frac{f f'' − (f')^2}{f^2} = \frac{f''}{f} − V^2
E\!\left(\frac{\partial V}{\partial θ}\right) = E\!\left(\frac{f''}{f}\right) − E(V^2)

Now

f'' = \frac{\partial^2}{\partial θ^2} f(X; θ),

so

E\!\left(\frac{f''}{f}\right) = \int_{-∞}^{∞} \frac{\frac{\partial^2}{\partial θ^2} f(x; θ)}{f(x; θ)}\, f(x; θ)\, dx = \frac{\partial^2}{\partial θ^2} \underbrace{\int_{-∞}^{∞} f(x; θ)\, dx}_{=1} = 0.

So

E(V^2) = −E\!\left(\frac{\partial V}{\partial θ}\right) = Var(V) = I_X(θ).
Comments
1. The proof is somewhat 'simplistic' in the sense that just a single X is used rather than the full sample X.
The latter would require multiple integrals rather than just a single integral.
2. The proof that E(V ) = 0 is similar.
3. Note the line where the order of integration (wrt x) and differentiation (wrt θ) is
interchanged. This can only be done when regularity conditions apply. For instance,
the limits on the integrals must not involve θ.
Chapter 8
Estimation
8.1
Some Properties of Estimators
[Read CB 10.1 or HC 6.1.]
Let the random variable X have a pdf (or probability function) that is of known functional
form, but in which the pdf depends on an unknown parameter θ (which may be a vector)
that may take any value in a set Θ (the parameter space). We can write the pdf as
f (x; θ), θ ∈ Θ. To each value θ ∈ Θ there corresponds one member of the family. If the
experimenter needs to select precisely one member of the family as being the pdf of the
random variable, he needs a point estimate of θ, and this is the subject of sections 8.1 to
8.4 of this chapter.
Of course, we estimate θ by some (appropriate) function of the observations X1 , . . . , Xn
and such a function is called a statistic or an estimator. A particular value of an estimator,
say t(x1 , . . . , xn ), is called an estimate. We will be considering various qualities that a
“good” estimator should possess, but firstly, it should be noted that, by virtue of it being
a function of the sample values, an estimator is itself a random variable. So its behaviour
for different random samples will be described by a probability distribution.
It seems reasonable to require that the distribution of the estimator be somehow centred
with respect to θ. If it is not, the estimator will tend either to under-estimate or overestimate θ. A further property that a good estimator should possess is precision, that
is, the dispersion of the distribution should be small. These two properties need to be
considered together. It is not very helpful to have an estimator with small variance if it
is “centred” far from θ. The difference between an estimator T = t(X1 , . . . , Xn ) and θ is
referred to as an error, and the “mean squared error” defined below is a commonly used
measure of performance of an estimator.
Unbiasedness
Definition 8.1
For a random sample X1 , . . . , Xn from f (x; θ), a statistic T = t(X1 , . . . , Xn ) is an
unbiased estimator of θ if E(T ) = θ.
Definition 8.2
The bias in T (as an estimator of θ) is
bT (θ) = E(T ) − θ.
(8.1)
Mean Square Error
Definition 8.3
For a random sample X1 , . . . , Xn from f (x; θ) and a statistic
T= t(X1 , . . . , Xn ) which is an estimator of θ, the mean square error (mse) is defined as
mse = E[(T − θ)2 ].
(8.2)
The mse can be expressed alternatively as
E[(T − θ)2 ] = E[(T − E(T )) + (E(T ) − θ)]2
= E[(T − E(T ))2 ] + [E(T ) − θ]2 .
So we have
mse = Var(T) + b_T^2(θ).    (8.3)
Now from (8.3) we can see that the mse cannot usually be made equal to zero. It will only
be small when both Var(T ) and the bias in T are small. So rather than use unbiasedness
and minimum variance to characterize “goodness” of a point estimator, we might employ
the mean square error.
Example 8.1
Consider the problem of the choice of estimator of σ 2 based on a random sample of size
n from a N (µ, σ 2 ) distribution. Recall that
S^2 = \sum_{i=1}^{n} (X_i − \bar{X})^2/(n − 1)
is often called the sample variance and has the properties
E(S 2 ) = σ 2 , (so S 2 is unbiased)
Var(S 2 ) = 2σ 4 /(n − 1).
[Note that this is not HC’s use of S 2 . See 4.1 Definition 3.]
Consider the mle of σ², Σ_{i=1}^n (Xi − X̄)²/n, which we'll denote by σ̂². Now
σ̂² = ((n − 1)/n) S²
and
E(σ̂²) = ((n − 1)/n) σ² = (1 − 1/n) σ²,
Var(σ̂²) = ((n − 1)²/n²) Var(S²) = (2(n − 1)/n²) σ⁴.
Why is σ̂² biased? To calculate σ̂² we first have to estimate the mean, consuming 1 degree of freedom. So we do not have n independent deviations about the mean; we have (n − 1).
Now σ̂² is biased, but what about its mean square error? Using (8.3),
mse σ̂² = 2(n − 1)σ⁴/n² + [σ² − (1 − 1/n)σ²]² = σ⁴(2n − 1)/n².
Now for S²,
mse S² = Var(S²) = 2σ⁴/(n − 1) > ((2n − 1)/n²) σ⁴
since
2/(n − 1) > (2n − 1)/n²
for n an integer greater than 1. So for the normal distribution the mle of σ 2 is better in
the sense of mse than the sample variance.
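This comparison is easy to check by simulation. The following R sketch (the choices n = 10, σ² = 4 and the number of replications are arbitrary) estimates both mean square errors and compares them with the formulae above.

## Compare the mse of the sample variance S^2 and of the mle sigma-hat^2
## for N(mu, sigma^2) samples; n and sigma2 are arbitrary choices.
set.seed(1)
n      <- 10
sigma2 <- 4
S2     <- replicate(100000, var(rnorm(n, mean = 0, sd = sqrt(sigma2))))
sig2h  <- (n - 1) * S2 / n                 # mle = (n-1)/n * S^2
mse    <- function(est, truth) mean((est - truth)^2)
c(mse.S2     = mse(S2, sigma2),            # theory: 2*sigma2^2/(n-1)
  mse.mle    = mse(sig2h, sigma2),         # theory: (2n-1)*sigma2^2/n^2
  theory.S2  = 2 * sigma2^2 / (n - 1),
  theory.mle = (2 * n - 1) * sigma2^2 / n^2)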
Consistency
A further desirable property of estimators is that of consistency, which is an asymptotic
property. To understand consistency, it is necessary to think of T as really being Tn ,
the nth member of an infinite sequence of estimators, T1 , . . . , Tn . Roughly speaking, an
estimator is consistent if, as n gets large, the probability that Tn lies arbitrarily close to the
parameter being estimated becomes itself arbitrarily close to 1. More formally, we have
Definition 8.4
Tn = t(X1, . . . , Xn) is a consistent estimator of θ if
lim_{n→∞} P(|Tn − θ| ≥ ε) = 0 for any ε > 0.   (8.4)
This is often referred to as convergence in probability of Tn to θ.
An equivalent definition (for cases where the second moment exists) is
Definition 8.5
Tn = t(X1, . . . , Xn) is a consistent estimator of θ if
lim_{n→∞} E[(Tn − θ)²] = 0.   (8.5)
That is, the mse of Tn as an estimator of θ, decreases to zero as more and more observations
are incorporated into its composition. Note that, using (8.3) we see that (8.5) will be
satisfied if Tn is asymptotically unbiased and if Var(Tn ) → 0 as n → ∞.
Asymptotically, Tn approaches the true value: as the sample size increases, Tn gets closer to θ, and in the limit we have effectively sampled the entire population. The idea of consistency can be gleaned from Figure 8.1, where Tn converges to θ. If it did not, Tn would not be a consistent estimator.
[Figure 8.1: Convergence of an estimator — Tn plotted against n, approaching θ.]
Example 8.2
Let Y be a random variable with mean µ and variance σ 2 . Let Y be the sample mean
of n random observations taken on Y . Is Y a consistent estimator of µ?
Now E(Y ) = µ so Y is unbiased. Also Var(Y ) = σ 2 /n → 0 as n → ∞, so Y is a
consistent estimator of µ.
NOTE
These statements are equivalent for a consistent estimator Tn of θ:
1. lim_{n→∞} P(|Tn − θ| ≥ ε) = 0, ε > 0
2. lim_{n→∞} E[(Tn − θ)²] = 0
3. lim_{n→∞} P(|Tn − θ| < ε) = 1, ε > 0
4. E(Tn) → θ, V(Tn) → 0, as n → ∞
5. for any δ > 0 and ε > 0 there exists N such that, for n > N, P(|Tn − θ| < δ) > 1 − ε.
Operationally, consistency boils down to V(Tn) → 0 as n → ∞, assuming that Tn is unbiased.
For example, the sample mean from a N (µ, σ 2 ) population is unbiased, and
V (X̄) = σ 2 /n
and so X̄ is consistent for µ since
V (X̄) → 0 as n → ∞.
Examples
(1 & 3) If Tn is unbiased for θ and σn2 = V (Tn ), then by Chebychev
P (|Tn − θ| < δσn ) > 1 − 1/δ 2 , δ > 0
and choosing δσn = as being fixed, then
P (|Tn − θ| < ) > 1 −
This equivalent to 1 and 3.
127
σn2
→ 1 as σn2 → 0.
2
(2 & 4) Choose a sample of size n from the uniform distribution U (θ), and the estimator
of θ as the largest order statistic, Y(n) . Now
f (y; θ) = 1/θ, 0 < y < θ
The distribution of Tn = Y(n) is
fYn (yn ) = nynn−1 /θ n , 0 < yn < θ
Now
n
nθ
θ and V [Y(n) ] =
n+1
(n + 1)2 (n + 2)
E[Y(n) ] =
and so
E(Tn ) → θ and V (Tn ) → 0 as n → ∞
(5) For the uniform distribution problem defined in (2 & 4),
P(|Tn − θ| < δ) = P(θ − δ < Tn < θ), since yn < θ,
= ∫_{θ−δ}^{θ} f_{Y(n)}(yn) dyn = 1 − (θ − δ)^n/θ^n.
Now
P(|Tn − θ| < δ) = 1 − (θ − δ)^n/θ^n > 1 − ε
provided (θ − δ)^n/θ^n = (1 − δ/θ)^n < ε. For any δ, it is possible to make (1 − δ/θ)^n as small as we please (in particular smaller than ε) by suitable choice of n. Thus we require (1 − δ/θ)^n < ε, or
n ln(1 − δ/θ) < ln ε,  ie  n > ln ε / ln(1 − δ/θ) = N,  (0 < δ < θ, 0 < ε < 1).
Thus Y(n) is consistent for θ.
Exercise
Show that mse[Y(n)] → 0 as n → ∞ (cf. statement 2).
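As an illustration (not a proof) of the consistency just argued, the following R sketch simulates Y(n) for increasing n; the value θ = 2 is an arbitrary choice.

## Consistency of the largest order statistic for U(0, theta); theta = 2 is arbitrary.
set.seed(2)
theta <- 2
for (n in c(10, 100, 1000)) {
  yn <- replicate(10000, max(runif(n, 0, theta)))
  cat("n =", n,
      " mean =", round(mean(yn), 4),
      " mse =", signif(mean((yn - theta)^2), 3), "\n")
}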
Efficiency
We will next make some comments on the property of efficiency of estimators. The term
is frequently used in comparison of two estimators where a measure of relative efficiency is
used. In particular,
Definition 8.6
Given two unbiased estimators, T1 and T2 of θ, the efficiency of T1 relative to T2 is
defined to be
e(T1 , T2 ) = Var(T2 )/Var(T1 ),
and T2 is more efficient than T1 if Var(T2 ) <Var(T1 ).
Note that it is only reasonable to compare estimators on the basis of variance if they
are both unbiased. To allow for cases where this is not so, we can use mse in the definition
of efficiency. That is,
Definition 8.7
An estimator T2 of θ is more efficient than T1 if
mse T2 ≤ mse T1 ,
with strict inequality for some θ. Also the relative efficiency of T1 with respect to T2 is
e(T1, T2) = mse T2 / mse T1 = E[(T2 − θ)²] / E[(T1 − θ)²].   (8.6)
Example 8.3
Let X1 , . . . , Xn denote a random sample from U(0, θ), with Y1 , Y2 , . . . , Yn the corresponding ordered sample.
(i) Show that T1 = 2X̄ and T2 = ((n + 1)/n) Yn are unbiased estimators of θ.
(ii) Find e(T1, T2).
Solution
(i) Now E(Xi) = θ/2 and Var(Xi) = θ²/12, so
E(T1) = 2E(X̄) = 2E(Xi) = 2 · θ/2 = θ.
To find the mean of T2, first note that the probability density function of Yn is
f_{Yn}(y) = n(F_X(y))^{n−1} f_X(y) = (n y^{n−1}/θ^n) I_{(0,θ)}(y).
So
E(Yn) = (n/θ^n) ∫_0^θ y^n dy = (n/θ^n) [y^{n+1}/(n + 1)]_0^θ = nθ/(n + 1).
For T2 = ((n + 1)/n) Yn, we have
E(T2) = ((n + 1)/n) E(Yn) = θ.
So both T1 and T2 are unbiased.
(ii)
Var(T1) = Var(2X̄) = 4 Var(X̄) = 4 Var(Xi)/n = 4θ²/(12n) = θ²/(3n).
To find Var(T2), first we need E(Yn²):
E(Yn²) = ∫_0^θ y² (n y^{n−1}/θ^n) dy = (n/θ^n) θ^{n+2}/(n + 2) = nθ²/(n + 2).
So
Var(Yn) = nθ²/(n + 2) − n²θ²/(n + 1)² = nθ²/((n + 1)²(n + 2)),
and hence
Var(T2) = ((n + 1)²/n²) · nθ²/((n + 1)²(n + 2)) = θ²/(n(n + 2)).
Since these estimators are unbiased, we may use Definition 8.6:
e(T1, T2) = Var(T2)/Var(T1) = [θ²/(n(n + 2))] · [3n/θ²] = 3/(n + 2).
This is less than 1 for n > 1, so T2 is more efficient than T1.
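The following R sketch checks the unbiasedness of T1 and T2 and the relative efficiency 3/(n + 2) by simulation; θ = 5 and n = 8 are arbitrary choices.

## Unbiasedness and relative efficiency of T1 = 2*xbar and T2 = (n+1)/n * max(x)
## for U(0, theta) samples; theta = 5 and n = 8 are arbitrary.
set.seed(3)
theta <- 5; n <- 8; reps <- 100000
x  <- matrix(runif(reps * n, 0, theta), nrow = reps)
T1 <- 2 * rowMeans(x)
T2 <- (n + 1) / n * apply(x, 1, max)
c(mean.T1 = mean(T1), mean.T2 = mean(T2),
  e.sim = var(T2) / var(T1), e.theory = 3 / (n + 2))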
Example
The mean and median from a Normal population are both unbiased for the population mean. The mean X̄ has variance σ²/n while the median X̃ has variance (π/2)σ²/n for large n. Thus the mean is more efficient, with e = 2/π ≈ 0.637. Another interpretation is the following:
if we required the median to give the same precision as the mean based on a sample of 100 observations, the sample using the median would need to be based on about 157 (= 100 × π/2) observations.
If the estimators are not both unbiased, then we must use mse, and we call the estimator with the smaller mse the more efficient. Then
e(T1, T2) = mse(T2)/mse(T1),
with mse(T2) < mse(T1) giving e < 1.
Example
The sample variance S² versus the mle σ̂². Now
e = mse(S²)/mse(σ̂²) = [2σ⁴/(n − 1)] / [(2n − 1)σ⁴/n²] = 2n²/((2n − 1)(n − 1))
  = [2n/(2n − 1)] · [n/(n − 1)] = (1 + 1/(2n − 1))(1 + 1/(n − 1)) > 1.
Since mse(σ̂²) < mse(S²), the mle is the more efficient.
Notice that in both definitions of e the ordering is purely conventional; the less efficient estimator could equally be placed in the numerator, giving e > 1, as in this second example.
8.2 Cramér–Rao Lower Bound
The concept of relative efficiency provides a criterion for choosing between two competing
estimators, but it does not give us any assurance that the better of the two is any good.
How do we know, for example, that there is not another estimator whose variance (or mse)
is much smaller than the two considered?
Minimum Variance Estimation
The Theorem below gives a lower bound for the variance (or mse) of an estimator.
Theorem 8.1
Let T = t(X), based on a sample X from f(x; θ), be an estimator of θ (assumed to be one-dimensional), and write τ(θ) = E(T) = θ + bT(θ). Then
Var(T) ≥ [1 + b′T(θ)]²/IX(θ) = [τ′(θ)]²/IX(θ)   (8.7)
and
mse(T) ≥ [1 + b′T(θ)]²/IX(θ) + b²T(θ),   (8.8)
where bT(θ) is given in (8.1) and IX(θ) is defined in equation (7.7).
Outline of Proof.
[The validity depends on regularity conditions, under which the interchange of integration and differentiation operations is permitted, and on the existence and integrability of various partial derivatives.]
Now
V = ∂ log f(X; θ)/∂θ = ∂ log L(θ)/∂θ,
as in Definition (7.7), and we note that E(V) = 0, so Var(V) = E(V²) and
cov(V, T) = E(V T)
          = E[T ∂ log f(X; θ)/∂θ]
          = ∫ · · · ∫ t(x1, . . . , xn) [Σ_i ∂ ln f(xi; θ)/∂θ] f(x1; θ) · · · f(xn; θ) dx1 · · · dxn
          = ∂E(T)/∂θ
          = ∂[θ + bT(θ)]/∂θ   from (8.1)
          = 1 + b′T(θ).
Recall that the absolute value of the correlation coefficient between any two variables is at most 1, and that ρ_{V,T} = cov(V, T)/(σV σT), so that we have
[cov(V, T)]² ≤ Var(V) Var(T),
or
Var(T) ≥ [cov(V, T)]²/Var(V) = [1 + b′T(θ)]²/IX(θ),
thus proving (8.7). Now (8.8) follows using (8.3).
Corollary.
For the class of unbiased estimators,
Var(T) = mse(T) ≥ 1/IX(θ).   (8.9)
Now inequality (8.9) is known as the Cramér–Rao lower bound, or sometimes the Information inequality. It provides (in "regular estimation" cases) a lower bound on the variance of an unbiased estimator T. The inequality is generally attributed to Cramér's work in 1946 and Rao's work in 1945, though it was apparently first given by M. Fréchet in 1937–38.
Example 1
Sampling from a Bernoulli distribution with pf
f(x; π) = π^x (1 − π)^{1−x}, x = 0, 1.
What is the bound on the variance for an estimator of π?
Note that we can use IX(θ) = n Ix(θ), (θ = π). Now
ℓ = ln f = x ln π + (1 − x) ln(1 − π)
and so
∂ℓ/∂π = x/π − (1 − x)/(1 − π) = (x − π)/(π(1 − π)).
Hence
I = E(∂ℓ/∂π)² = E(x − π)²/(π²(1 − π)²) = 1/(π(1 − π)).
We note that X̄ (with E(X̄) = π) is unbiased for π, and so
Var(T) ≥ π(1 − π)/n
as expected.
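A quick simulation check of this bound in R (π = 0.3 and n = 25 are arbitrary choices): the variance of the sample proportion should be close to π(1 − π)/n.

## The sample proportion is unbiased for pi and attains the bound pi*(1-pi)/n.
set.seed(4)
p <- 0.3; n <- 25
phat <- rbinom(200000, size = n, prob = p) / n
c(var.phat = var(phat), bound = p * (1 - p) / n)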
Example 2
Sampling from an exponential distribution with pdf
f(x; θ) = e^{−x/θ}/θ, x > 0.
What is the MVB for an estimator of θ?
ℓ = ln f = −ln θ − x/θ
∂ℓ/∂θ = −1/θ + x/θ²;  E(X) = θ (and the mle of θ is X̄)
∂²ℓ/∂θ² = 1/θ² − 2x/θ³,
giving
E(∂²ℓ/∂θ²) = 1/θ² − 2θ/θ³ = −1/θ²,
and so I = 1/θ², whence
IX(θ) = n/θ².
If T is unbiased, eg T = X̄, then
V(T) ≥ 1/(nI(θ)) = θ²/n,
in agreement with
V(X̄) = V(X)/n = θ²/n.
Reparameterisation
An alternative form of the exponential is
f(x; λ) = λ e^{−λx}, x > 0.
What is the MVB for an estimator of λ?
ℓ = ln λ − λx
∂ℓ/∂λ = 1/λ − x;  E(X) = 1/λ = τ(λ)
∂²ℓ/∂λ² = −1/λ²  →  I = 1/λ².
So the MVB for X̄, which estimates τ(λ) = 1/λ (and so is a biased estimator of λ), is given by
V(T) ≥ [τ′(λ)]²/(nI(λ)) = [−1/λ²]²/(n/λ²) = 1/(nλ²),
in agreement with E(X) = 1/λ, V(X) = 1/λ² and
V(X̄) = 1/(nλ²).
Alternative method
To return to the θ parameterisation of the exponential distribution, if we do not use an unbiased estimator, but in fact want the second (λ) form, we can examine an estimator T′ with τ(θ) = 1/θ, in place of the unbiased version τ(θ) = θ. Now
V(T′) ≥ [τ′(θ)]²/(nI(θ)) = (−1/θ²)²/(n/θ²) = 1/(nθ²),
in agreement with the λ parameterisation.
Thus there is no need to re-solve in terms of the new parameterisation: simply use the biased form of the MVB.
Definition 8.8
The (absolute) efficiency of an unbiased estimator T is defined as
e(T) = [1/IX(θ)] / Var(T).   (8.10)
Note that, because of (8.9), e(T) ≤ 1, so we can think of e(T) as a measure of efficiency of any given estimator, rather than the relative efficiency of one with respect to another as in Definition 8.6.
In the case where e(T) = 1, so that the actual lower bound of Var(T) is achieved, some texts refer to the estimator T as efficient. This terminology is not universally accepted. Some prefer to use the phrase minimum variance bound (MVB) for 1/IX(θ), and an estimator which is unbiased and which attains this bound is called a minimum variance bound unbiased (MVBU) estimator.
Example 8.4
In the problem of estimating θ in a normal distribution with mean θ and known variance σ², find the MVB of an unbiased estimator.
The MVB is 1/IX(θ) where
IX(θ) = E[∂ log L(θ)/∂θ]².
For a sample X1, . . . , Xn we have for the likelihood
L(θ) = Π_{i=1}^n (1/(σ√(2π))) e^{−(xi − θ)²/(2σ²)} = (2π)^{−n/2}(σ²)^{−n/2} e^{−Σ(xi − θ)²/(2σ²)}
log L(θ) = −(n/2) log(2π) − (n/2) log σ² − (1/2) Σ(xi − θ)²/σ²
∂ log L(θ)/∂θ = (1/σ²) Σ_{i=1}^n (xi − θ) = (n/σ²)(x̄ − θ).
Hence
E[∂ log L(θ)/∂θ]² = (n²/σ⁴) E(X̄ − θ)² = (n²/σ⁴) Var(X̄) = n/σ².
So the MVB is σ²/n.
When can the MVB be Attained?
It is easy to establish the condition under which the minimum variance bound of an unbiased estimator, (8.9), is achieved. In the proof of Theorem 8.1, it should be noted that the inequality concerning the correlation of V and T becomes an equality (that is, ρ_{V,T} = +1 or −1) when V is a linear function of T.
Recalling that
V = ∂ log L(θ)/∂θ,
we may write this condition as
∂ log L(θ)/∂θ = A(T − θ),
where A is independent of the observations but may be a function of θ, so we will write it as A(θ). So the condition for the MVB to be attained is that the statistic T = t(X1, . . . , Xn) satisfies
∂ log L(θ)/∂θ = A(θ)(T − E(T)).   (8.11)
Example 8.5
In the problem of estimating θ in a normal distribution with mean θ and known variance σ², show that the MVB of an unbiased estimator can be attained.
As in Example 8.4,
∂ log L(θ)/∂θ = n(x̄ − θ)/σ².
Now defining T = t(X1, . . . , Xn) = X̄, we know it is an unbiased estimator of θ, and we see that (8.11) is satisfied, with A(θ) = n/σ² (A not being a function of θ in this case). Thus the minimum variance bound can be attained.
Comment.
In the case of an unbiased estimator T where the MVB is attained, the inequality in (8.9) becomes an equality and we have
Var(T) = 1/IX(θ) = 1/E[∂ log L(θ)/∂θ]².   (8.12)
Also, squaring (8.11) and taking expectations of both sides, we have
E[∂ log L(θ)/∂θ]² = [A(θ)]² E[(T − θ)²] = [A(θ)]² Var(T).
That is,
Var(T) = E[∂ log L(θ)/∂θ]²/[A(θ)]² = 1/([A(θ)]² Var(T)), using (8.12),
giving
Var(T) = 1/A(θ).
So, if the statistic T satisfies (8.11), Var(T) can be identified immediately as the reciprocal of the multiple of T − θ on the RHS. For instance, in Example 8.5 the factor n/σ² multiplying x̄ − θ can be identified as the reciprocal of the variance of T, and it was not necessary to evaluate the MVB directly as in Example 8.4.
Example 8.6
Consider the problem of estimating the variance, θ, of a normal distribution with known mean µ, based on a sample of size n.
Now the likelihood is
L(θ) = (2π)^{−n/2} θ^{−n/2} e^{−Σ(xi − µ)²/(2θ)}
log L(θ) = −(n/2) log(2π) − (n/2) log θ − Σ_{i=1}^n (xi − µ)²/(2θ)
∂ log L(θ)/∂θ = −n/(2θ) + Σ(xi − µ)²/(2θ²) = (n/(2θ²)) [Σ(xi − µ)²/n − θ],
which is in the form (8.11) with T = t(X1, . . . , Xn) = Σ_{i=1}^n (Xi − µ)²/n. So, using this as the estimator of θ, the MVB is achieved and it is 1/A(θ) = 2θ²/n.
Note that E(Σ(Xi − µ)²) = Σ E(Xi − µ)² = n Var(Xi) = nθ, so T is an unbiased estimator of θ. Also, Σ(Xi − µ)²/θ ∼ χ²_n and so has variance 2n. Hence
Var(T) = Var[(θ/n) · Σ(Xi − µ)²/θ] = (θ²/n²) · 2n = 2θ²/n.
Example 8.7
Consider the problem where we have a random sample X1, . . . , Xn from a Poisson distribution with parameter θ and we wish to find the Cramér–Rao lower bound for the variance of an unbiased estimator of θ, and to identify the estimator that has this variance.
Now for f(x; θ) = e^{−θ} θ^x/x!, the likelihood of the sample is
L(θ; x1, . . . , xn) = e^{−nθ} θ^{Σ xi} / Π_{i=1}^n (xi!)
log L(θ; x1, . . . , xn) = −nθ + Σ xi log θ − log K   (where K = Π xi!)
∂ log L(θ)/∂θ = −n + Σ xi/θ = (−nθ + nx̄)/θ = (n/θ)[x̄ − θ] = A(θ)[T − θ],
where T(X) = X̄ is the statistic. This is in the correct form for the minimum variance bound to be attained and it is 1/A(θ) = θ/n. We note that X̄ is an estimator which has variance θ/n.
Example
For the Bernoulli distribution, can the MVB be attained?
f = π^x (1 − π)^{1−x}, x = 0, 1
ℓ = x ln π + (1 − x) ln(1 − π)
∂ℓ/∂π = x/π − (1 − x)/(1 − π) = (x − π)/(π(1 − π)) = I(π)(x − π).
Therefore
A(π) = 1/(π(1 − π)),
T = X, θ = π and the lower bound is attained.
Notice again that the working was for a single sample value, so that for a sample of size n,
A(π) = IX(π) = n/(π(1 − π))
with T = X̄.
Notes
Finally, some important results :
1. If an unbiased estimator of some function of θ exists having a lower bound variance 1/[nI(θ)], the sampling is necessarily from a member of the exponential family.
Conversely, for any member of the exponential family, there is always precisely one
function of θ for which there exists an unbiased estimator with the minimum variance
1/[nI(θ)].
2. A Cramér–Rao lower bound estimator can only exist if there is a sufficient statistic
for θ. (The reverse is not necessarily true.)
Proof
The factorisation criterion for sufficiency requires that
f(x; θ) = g[t(x); θ] h(x),
ie
∂ℓ/∂θ = ∂ ln g/∂θ,
while the MVB is attained if
∂ℓ/∂θ = ∂ ln f(x; θ)/∂θ = A(θ)[T − τ(θ)],
which is a special case of the sufficiency condition. Thus even if the MVB is not attained there may still be a sufficient statistic.
3. (Blackwell–Rao)
If an unbiased estimator T1 exists for τ(θ), where θ is the unknown parameter, and if T is a sufficient statistic for θ, then there exists a function u(T) of T which is also unbiased for τ(θ) with variance Var[u(T)] ≤ Var(T1).
8.3 Properties of Maximum Likelihood Estimates
Statistical inference should be consistent with the assumption that the best explanation of
a set of data is provided by the value of θ, (θ̂, say) that maximizes the likelihood function.
Estimators derived by the method of maximum likelihood have some desirable properties.
These are stated without proof below.
1. Sufficiency
It was already established in section 7.6, that if a single sufficient statistic exists for
θ, the maximum likelihood estimate of θ must be a function of it. That is, the mle
depends on the sample observations only through the value of a sufficient statistic.
2. Invariance
The maximum likelihood estimate is invariant under functional transformations.
That is, if T = t(X1 , . . . , Xn ) is the mle of θ and if u(θ) is a function of θ, then
u(T ) is the mle of u(θ). For example, if σ̂ is the mle of σ, then σ̂ 2 is the mle of σ 2 .
That is, the mle of σ² is (σ̂)².
A full proof exists, but essentially the argument here is
∂ℓ/∂u = (∂ℓ/∂θ)(∂θ/∂u),
so that maximisation wrt θ is equivalent to maximisation wrt u.
Example
If σ̂² is the mle of σ² in sampling from a N(0, σ²) population, then σ̂ is the mle of σ.
L = Π_i f_i = Π_i (1/(σ√(2π))) e^{−xi²/(2σ²)} = (1/(σ^n (2π)^{n/2})) e^{−Σ_i xi²/(2σ²)}
ℓ = −(n/2) ln(σ²) − (n/2) ln(2π) − (1/2) Σ_i xi²/σ²
∂ℓ/∂σ² = −n/(2σ²) + Σ_i xi²/(2σ⁴) = 0,
to give the mle of σ² as
σ̂² = Σ_i xi²/n.
The mle of σ is given by
∂ℓ/∂σ = −n/σ − (1/2) Σ_i xi² (−2)/σ³ = 0
if
σ² = Σ_i xi²/n;  σ̂ = √(Σ_i xi²/n),
as required.
As a final check,
∂ℓ/∂u = (∂ℓ/∂θ)(∂θ/∂u)
becomes
∂ℓ/∂σ = (∂ℓ/∂σ²)(∂σ²/∂σ) = [−n/(2σ²) + Σ_i xi²/(2σ⁴)] 2σ = −n/σ + Σ_i xi²/σ³,
as required by invariance.
3. Consistency
The maximum likelihood estimator is consistent.
This can be shown from first principles (Wald(1949)), using the expected value of
the log likelihood, but the simpler demonstration is via the asymptotic behaviour of
the score statistic. Essentially we find that V (θ̂) → 0 as n → ∞
4. Efficiency
If there is a MVB estimator of θ, the method of maximum likelihood will produce it.
5. Asymptotic Normality
Under certain regularity conditions, a maximum likelihood estimator has an asymptotically Normal distribution with variance 1/I(θ), ie
(θ̂ − θ)/√(V(θ̂)) ∼ N(0, 1), asymptotically.
Proof   [CB p472]
ℓ = Σ_i ln f(xi; θ),
and, writing ℓ′(θ) = ∂ℓ/∂θ,
ℓ′(θ) = ℓ′(θ0) + (θ − θ0) ℓ″(θ0) + . . .
by a Taylor series expansion. Replacing θ by the mle θ̂ gives
ℓ′(θ̂) = ℓ′(θ0) + (θ̂ − θ0) ℓ″(θ0) + . . . = 0,
and so, approximately,
√n (θ̂ − θ0) = −√n ℓ′(θ0)/ℓ″(θ0) = −[ℓ′(θ0)/√n] / [ℓ″(θ0)/n].
If I(θ0) denotes the information from a single observation, then
ℓ′(θ0)/√n → N[0, I(θ0)]
by the CLT, and
−ℓ″(θ0)/n → I(θ0),
leading to
√n (θ̂ − θ0) → N[0, 1/I(θ0)].
Thus
(θ̂ − θ0)/√(1/(nI(θ0))) → N(0, 1) asymptotically,
as required.
Thus the mle is consistent, and since the MVB is attained (asymptotically), it is also efficient.
Example
Poisson distribution:
p(xi; λ) = e^{−λ} λ^{xi}/xi!, xi = 0, 1, . . .
L = Π_i p(xi; λ) = e^{−nλ} λ^{Σ_i xi} / Π_i (xi!)
ℓ = ln L = −nλ + Σ_i xi ln λ − Σ_i ln(xi!)
∂ℓ/∂λ = −n + Σ_i xi/λ
∂²ℓ/∂λ² = −Σ_i xi/λ²  →  nI(λ) = E(Σ_i xi)/λ² = nλ/λ² = n/λ,
in line with V(X̄) = λ/n. Thus
(X̄ − λ)/√(λ/n) ∼ N(0, 1)
as expected.
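The following R sketch illustrates this asymptotic normality by simulation (λ = 3, n = 50 and the number of replications are arbitrary choices).

## (X-bar - lambda)/sqrt(lambda/n) should be approximately N(0, 1).
set.seed(5)
lambda <- 3; n <- 50
xbar <- replicate(50000, mean(rpois(n, lambda)))
z    <- (xbar - lambda) / sqrt(lambda / n)
c(mean.z = mean(z), var.z = var(z))   # should be close to 0 and 1
## qqnorm(z); qqline(z)               # optional visual check of normality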
6. MLE of vector parameters
Define
Iij(θ) = E[∂ log L(θ)/∂θi · ∂ log L(θ)/∂θj].   (8.13)
Now the RHS of (8.13) can be expressed as
cov(∂ log L(θ)/∂θi, ∂ log L(θ)/∂θj).
As was the case with information on one parameter, IX(θ) [see (1.6) and (1.10)], there is an alternative formula for computing the terms of the information matrix:
Iij(θ) = −E[∂² log L(θ)/∂θi ∂θj],   (8.14)
provided certain regularity conditions are satisfied. Define an information matrix, I(θ), to have elements Iij; then the mle's θ̂i, found by solving the set of equations
∂ log L(θ)/∂θi = 0, i = 1, 2, . . . ,
have an asymptotically normal distribution with means θi and covariance matrix [I(θ)]^{−1}.
Example 8.8
Sampling Y1, . . . , Yn from N(µ, σ²) with σ² unknown. We wish to find the joint Information matrix for µ and σ².
L(µ, σ²) = Π_i (1/(σ√(2π))) e^{−(yi − µ)²/(2σ²)} = σ^{−n} (2π)^{−n/2} e^{−Σ_i (yi − µ)²/(2σ²)}
ℓ = ln L = −(n/2) ln(σ²) − (n/2) ln(2π) − (1/2) Σ_i (yi − µ)²/σ²
∂ℓ/∂µ = Σ_i (yi − µ)/σ²
∂ℓ/∂σ² = −n/(2σ²) + Σ_i (yi − µ)²/(2σ⁴).
Now to get the terms in the Information matrix:
∂²ℓ/∂µ² = −Σ_i 1/σ² = −n/σ²
∂²ℓ/∂(σ²)² = n/(2σ⁴) − Σ_i (yi − µ)²/σ⁶
E[∂²ℓ/∂(σ²)²] = n/(2σ⁴) − nσ²/σ⁶ = −n/(2σ⁴),
since E(yi − µ)² = σ². Also
E[∂²ℓ/∂µ∂σ²] = E[∂²ℓ/∂σ²∂µ] = E[−Σ_i (yi − µ)/σ⁴] = 0.
Thus the matrix of expected second derivatives of ℓ is
[ −n/σ²        0
     0     −n/(2σ⁴) ]
which is negative definite. Since
V(θ̂) = [I(θ)]^{−1} = −[E(∂²ℓ/∂θi∂θj)]^{−1},
then
V[(µ̂, σ̂²)′] = −[ −n/σ²  0 ;  0  −n/(2σ⁴) ]^{−1} = [ σ²/n  0 ;  0  2σ⁴/n ]
as expected.
We note that X̄ and σ̂² are asymptotically normally distributed, and independent, with the variances given. However, we know that the independence property and the normality and variance of X̄ are exact for any n, whereas the normality property and the variance of Σ(Xi − X̄)²/n are strictly limiting ones.
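The information matrix can also be checked numerically. The R sketch below (arbitrary true values µ = 1, σ² = 4 and n = 500) minimises the negative log-likelihood with optim and compares the inverse of the numerically computed Hessian (the observed information) with the theoretical asymptotic covariance matrix diag(σ̂²/n, 2σ̂⁴/n).

## Numerical check of Example 8.8 via the observed information matrix.
set.seed(6)
n <- 500; y <- rnorm(n, mean = 1, sd = 2)

negloglik <- function(par) {             # par = c(mu, sigma2)
  mu <- par[1]; s2 <- par[2]
  0.5 * n * log(2 * pi * s2) + sum((y - mu)^2) / (2 * s2)
}
fit <- optim(c(0, 1), negloglik, hessian = TRUE, method = "L-BFGS-B",
             lower = c(-Inf, 1e-6))
round(solve(fit$hessian), 5)             # approx. covariance of (mu-hat, sigma2-hat)
s2hat <- fit$par[2]
round(diag(c(s2hat / n, 2 * s2hat^2 / n)), 5)   # theoretical values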
8.4 Interval Estimates
The notion of an interval estimate of a parameter θ with a confidence coefficient is
assumed to be familiar. A point estimate, on its own, doesn’t convey any indication of
reliability, but a point estimate together with its standard error would do so. This idea is
incorporated into a confidence interval, which is a range of values within which we are
“fairly confident” that the true (unknown) value of the parameter θ lies. The length and
location of the interval are random variables and we cannot be certain that θ will actually
fall within the limits evaluated from a single sample. So the object is to generate narrow
intervals which include θ with a high probability.
Examples such as
(i) A CI for µ in a normal distribution where σ is either known or unknown;
(ii) A CI for p where p is the probability of success in a binomial distribution;
(iii) A CI for σ 2 in a normal distribution where the mean is either known or unknown;
etc.
will not be repeated here. However, we will mention the general method of construction of
a confidence interval using a pivotal quantity. Further, we will find a confidence interval
for a population quantile.
Suppose θ̂L and θ̂U (both functions of X1 , . . . , Xn and hence random variables) are the
lower and upper confidence limits respectively, for a parameter θ. Then if
P (θ̂L < θ < θ̂U ) = γ,
(8.15)
the probability γ is called the confidence coefficient. The interval (θ̂L , θ̂U ) is referred to
as a two–sided confidence interval, both endpoints being random variables.
It is possible to construct 1–sided intervals such that
P (θ̂L < θ) = γ
or
P (θ < θ̂U ) = γ,
in which case only one end-point is random. The confidence intervals are respectively,
(θ̂L , ∞), (−∞, θ̂U ).
Example
In the construction of a confidence interval for the true mean µ when sampling from a Normal population, the sampling distribution of the sample mean X̄ is used, viz
X̄ ∼ N(µ, σ²/n), or Z = (X̄ − µ)/(σ/√n) ∼ N(0, 1).
Now Z has density function
f(z) = (1/√(2π)) e^{−z²/2}.
Thus Z has a distribution which does not depend on µ, but Z itself involves the unknown µ. The variable Z is said to be pivotal for µ, as confidence intervals can be constructed easily for µ using the form of the density of Z.
Pivotal Method
A very useful method for finding confidence intervals uses a pivotal quantity that has 2
characteristics.
1. It is a function of the sample measurements and the unknown parameter θ (where θ
is the only unknown).
2. It has a probability distribution which does not depend on the parameter θ.
Suppose that T=t(X) is a reasonable point estimate of θ, then we will denote this pivotal
quantity by p(T, θ), and we will use the known form of the probability distribution of
p(T, θ) to make the following statement.
For a specified constant γ, (0 < γ < 1), and constants a and b, (a < b),
P (a < p(T, θ) < b) = γ.
(8.16)
So, given T , the inequality (8.16) is solved for θ to obtain a region of θ–values which is a
confidence region (usually an interval) for θ corresponding to the observed T–value. This
rearrangement, of course, results in an equation of the form (8.15).
Example 8.9
For a random variable X ∼ U(0, θ), construct a 90% confidence interval for θ.
Now we know that Yn, the largest order statistic from a sample of size n from this distribution, is sufficient for θ and has pdf
f_{Yn}(y) = n y^{n−1}/θ^n, 0 ≤ y ≤ θ.
Let Z = Yn/θ; then the pdf of Z is
f_Z(z) = n z^{n−1}, 0 ≤ z ≤ 1.
We see that Yn/θ is a suitable pivotal quantity with the 2 characteristic properties referred to earlier. So we have
P(a < Yn/θ < b) = 0.90.
Noting that the cdf of Z is F_Z(z) = z^n, 0 ≤ z ≤ 1, values of a and b may be found as follows:
F_Z(a) = 0.05 and F_Z(b) = 0.95, ie a^n = 0.05 and b^n = 0.95, giving a = 0.05^{1/n} and b = 0.95^{1/n}.
So we may write
P(0.05^{1/n} < Yn/θ < 0.95^{1/n}) = 0.90.
Rearranging, the confidence interval for θ is
(Yn/0.95^{1/n}, Yn/0.05^{1/n}).
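In R the interval is a one-liner; the sample below is simulated purely for illustration (true θ = 10 is an arbitrary choice).

## 90% CI for theta in U(0, theta) from the pivot Yn/theta (Example 8.9).
set.seed(7)
x  <- runif(20, min = 0, max = 10)
n  <- length(x); yn <- max(x)
c(lower = yn / 0.95^(1/n), upper = yn / 0.05^(1/n))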
Examples
[Exponential] Using the form of the exponential as
f(y; θ) = θ e^{−θy}, y > 0,
construct a 100(1 − α)% confidence interval for θ using the quantity T = 2θ Σ_i Yi for a sample Y1, . . . , Yn.
Now
M_{Yi}(t) = θ/(θ − t)
(verify!). So
M_{2θYi}(t) = M_{Yi}(2θt) = θ/(θ − 2θt) = 1/(1 − 2t),
giving
M_{2θ Σ_i Yi}(t) = 1/(1 − 2t)^n = M_T(t),
and so T ∼ χ²_{2n}.
Since the distribution of T does not depend on θ, T is pivotal for θ. Thus
P(χ²_L < T < χ²_U) = 1 − α,
ie
P(χ²_L < 2θ Σ_i Yi < χ²_U) = 1 − α,
or
P(χ²_L/(2 Σ_i Yi) < θ < χ²_U/(2 Σ_i Yi)) = 1 − α
is the required CI for θ.
[Poisson]
For sampling from the Poisson distribution, the sample mean
Ȳ ∼ N(λ, λ/n)
for large n. Examine methods for constructing confidence intervals for λ.
Now
W = (Ȳ − λ)/√(λ/n) ∼ N(0, 1),
and so W is pivotal for λ.
(Method 1) Replace the λ in the SE with Ȳ; that is, use observed instead of expected information. Then we have
W1 = (Ȳ − λ)/√(Ȳ/n) ∼ N(0, 1) approximately,
and the approximate (1 − α)100% CI for λ becomes
Ȳ ± Z_{α/2} √(Ȳ/n).
(Method 2) Using the expected information for λ gives a (1 − α)100% CI for λ from
P(−Z_{α/2} < W < Z_{α/2}) = 1 − α,
ie the upper and lower limits for λ satisfy
(Ȳ − λ)²/(λ/n) = Z²_{α/2},
or
λ² − 2Ȳλ + Ȳ² = Z²_{α/2} λ/n.
The roots of the resulting equation
λ² − λ(2Ȳ + Z²_{α/2}/n) + Ȳ² = 0
are real, and give λl and λu where
P(λl < λ < λu) = 1 − α
defines the 100(1 − α)% confidence interval for λ.
(Method 3)
We have the exact result that Σ_i Yi ∼ P(nλ). So if we define Λ = nλ and use Λ̂ = Σ_i yi, then we can use tables of the Poisson distribution to find
P(L < Λ < U) = 1 − α,
which becomes
P(L/n < λ < U/n) = 1 − α.
Because the Poisson is discrete, the choice of α will be limited. For example, if n = 10 and Σ_i yi = 9.0 then
P(3 < Λ < 15) = 0.957,
giving a 96% CI for λ of (0.3, 1.5).
> y=0:20
> sp <- ppois(y,9)
> print(cbind(y,sp,1-sp))
          y           sp
 [1,]     0 0.0001234098 0.9998765902
 [2,]     1 0.0012340980 0.9987659020
 [3,]     2 0.0062321951 0.9937678049
 [4,]     3 0.0212264863 0.9787735137
 [5,]     4 0.0549636415 0.9450363585
 [6,]     5 0.1156905208 0.8843094792
 [7,]     6 0.2067808399 0.7932191601
 [8,]     7 0.3238969643 0.6761030357
 [9,]     8 0.4556526043 0.5443473957
[10,]     9 0.5874082443 0.4125917557
[11,]    10 0.7059883203 0.2940116797
[12,]    11 0.8030083825 0.1969916175
[13,]    12 0.8757734292 0.1242265708
[14,]    13 0.9261492307 0.0738507693
[15,]    14 0.9585336745 0.0414663255
[16,]    15 0.9779643408 0.0220356592
[17,]    16 0.9888940906 0.0111059094
[18,]    17 0.9946804287 0.0053195713
[19,]    18 0.9975735978 0.0024264022
[20,]    19 0.9989440463 0.0010559537
[21,]    20 0.9995607481 0.0004392519
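For comparison, a sketch of Method 2 on the same numbers (n = 10, ȳ = 0.9): the CI limits are the roots of the quadratic in λ.

## Method 2: roots of lambda^2 - lambda*(2*ybar + z^2/n) + ybar^2 = 0.
ybar <- 0.9; n <- 10; z <- qnorm(0.975)
b <- 2 * ybar + z^2 / n
(b + c(-1, 1) * sqrt(b^2 - 4 * ybar^2)) / 2    # approximate 95% CI for lambda

Method 1 on the same data gives 0.9 ± 1.96√(0.9/10) ≈ (0.31, 1.49), so the three methods give broadly similar, but not identical, intervals.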
Large Sample Confidence Intervals
The asymptotic distribution of the mle gives
Q = (θ̂ − θ)/√(1/(nI(θ))) ∼ N(0, 1),
so Q is pivotal for θ.
Furthermore, if we replace the expected information I(θ) by the observed information I(θ̂), then a confidence interval for θ can be constructed easily, as in Method 1.
If the parameterisation of θ is such that I(θ) does not depend on θ, then observed and expected information coincide, and Method 1 applies without the necessity of replacing expected values for θ by observed values.
Comment.
Note that there is some arbitrariness in the choice of a confidence interval in a given
problem. There are usually several statistics T = t(X1 , . . . , Xn ) that could be used, and
it is not really necessary to allocate equal probability to the two tails of the distribution,
as was done in the above example. However, it is customary to do this, as this often leads
to the shortest confidence interval (for the same confidence coefficient), another property
considered desirable.
Chapter 9
Hypothesis Testing
9.1 Basic Concepts and Notation
9.1.1 Introduction
As an alternative to estimating the values of one or more parameters of a probability distribution, as was the objective in Chapter 8, we may test hypotheses about such parameters.
Both estimation and hypothesis testing may be viewed as different aspects of the same
general problem of reaching decisions on the basis of data. Explicit formulation as well
as important basic concepts on the theory of testing statistical hypotheses are due to J.
Neyman and E.S. Pearson, who are considered pioneers in the area.
Although the notation of H and A for hypotheses and alternatives has been used,
we will now use that of Hogg and Craig where the terms null hypothesis and alternative
hypothesis are used, with corresponding notation H0 and H1 . The null hypothesis is always
a statement of either ‘no effect’ or the ‘status quo’. If the statistical hypothesis completely
specifies the distribution, it is called simple; if it does not, it is called composite.
After experimentation (for example, taking a sample of size n, (X1 , . . . , Xn ) or X), some
reduction in data is used, resulting in a test statistic, T= t(X), say. We may consider a
subset of the range of possible values of T as a rejection (or critical) region. Previously
this has been denoted by R, but to be consistent with HC, we will now use C. That is, C
is the subset of the sample space of T , which leads to the rejection of the hypothesis under
consideration. The region C can refer to either the X-values or the T-values.
The case where both H0 and H1 are simple, where the size of the Type I and Type II
error is easily determined, is assumed known. You may find it helpful to read CB 10.3 or
HC 7.1. In the case of composite hypotheses and alternatives, the power function of
the test is an important tool for evaluating its performance, and this is examined in the
following section.
9.1.2 Power Function and Significance Level
Suppose that T is the test statistic and C the critical region for a test of a hypothesis
concerning the value of a parameter θ. Then the power function of the test is the probability
that the test rejects H0 , when the actual parameter value is θ. That is,
π(θ) = Pθ(rejecting H0) = Pθ(C).   (9.1)
Some texts interpret power as the probability of rejecting H0 when it is false, but the more
general interpretation is the probability that a test rejects H0 for θ taking values given by
H0 or H1 .
Suppose we want to test the simple H0 : θ = θ0 , against the composite alternative
H1 : θ 6= θ0 . Ideally we would like a test to detect a departure from H0 with certainty;
that is, we would like π(θ) to be 1 for all θ in H1 , and π(θ) to be 0 for θ in H0 . Since for
a fixed sample size, P(rejecting H0 |H0 is true) and P(not rejecting H0 |H0 is false) cannot
both be made arbitrarily small, the ideal test is not possible.
So long as H0 is simple, it is possible to define P(Type I error), denoted by α, as
P(rejecting H0 |H0 is true). But to allow for H0 to be composite, we need the following
definitions.
Definition 9.1
The size of a test (or of a critical region) is
α = max_{θ∈H0} Pθ(reject H0) = max_{θ∈H0} π(θ).   (9.2)
This is also known as the significance level.
Definition 9.2
The size of the Type II error is
β = max_{θ∈H1} [1 − π(θ)].   (9.3)
Some statisticians regard the formal approach above, of setting up a rejection region,
as not the most appropriate, and prefer to compute a P–value. This involves the choice of
a test statistic T, the extreme values of which provide evidence against H0 . The statistic
T should be a good estimator of θ and its distribution under H0 known. After experimentation, an observed value of T, t say, is examined to see whether it can be considered
extreme in the sense of being unlikely to occur if H0 were true. The computed P–value is
the probability of observing T = t or something more extreme. This is the “α” at which
the observed value T = t is just significant.
The test situation can be summarised in a table:
                  Accept H0 (C̄)          Reject H0 (C)
  H0 true         correct decision        α  (Type I error)
  H0 false        β  (Type II error)      correct decision
and Power = 1 − β = P( Reject H0 |H0 is false).
The size of a test is α = P(Type I error), and is also known as the significance level.
There is a trade-off in practice between α and β: α is preset, while β is unknown but can be estimated under a given alternative.
Power is typically a function of the parameter under test.
A P–Value is a computer generated mechanism for enabling an experimenter to undertake a test without recourse to examining tables of test statistic values at the prescribed
size.
Formally, P–Value =
P[Obtaining values more extreme than observed,
in the direction of H1 |(H0 is true)].
9.1.3 Relation between Hypothesis Testing and Confidence Intervals
It may be recalled that rejecting a null hypothesis about θ (θ = θ0 , say) at the 5% significance level is equivalent to saying that the value θ0 is not included in a 95% confidence
interval for θ.
So we have a duality property here. We will illustrate this with an example.
Example 9.1
Consider the family of normal distributions with unknown mean µ and known variance σ 2 .
Let zα be defined by P (Z ≥ zα ) = α. For a 2–sided alternative (using a 2–tailed test), the
rejection region for a test of size α is
{x̄ : |x̄ − µ0|/(σ/√n) > z_{α/2}}.
That is,
{x̄ : x̄ > µ0 + z_{α/2} σ/√n or x̄ < µ0 − z_{α/2} σ/√n}.
This is the event {X ∈ C(θ)} and it has probability α. The complementary event, {X ∉ C(θ)}, has probability 1 − α. The latter event can be written equivalently as
{x̄ : µ0 − z_{α/2} σ/√n < x̄ < µ0 + z_{α/2} σ/√n},
which is equivalent to
x̄ − z_{α/2} σ/√n < µ0 < x̄ + z_{α/2} σ/√n,
and so
(x̄ − z_{α/2} σ/√n, x̄ + z_{α/2} σ/√n)
is a 100(1 − α)% confidence interval for µ.
Note
There is a duality between the size of the test and the confidence coefficient of the
corresponding confidence interval, for a two sided test.
Example
Testing the variance from a Normal population.
For the test H0 : σ 2 = σ02 vs H1 : σ 2 6= σ02 , the test statistic is
νs2
∼ χ2ν
2
σ
where ν = n − 1 = df .
Thus the corresponding acceptance region (C) is defined as
χ2ν,L <
νs2
< χ2ν,U
σ02
where L and U are the lower and upper α/2 points of the χ2 distribution on n − 1 df. That
is
!
νs2
2
2
P χν,L < 2 < χν,U = 1 − α.
σ0
This can be written as
P
νs2
νs2
2
<σ < 2
χ2ν,U
χν,L
!
to give a confidence interval for σ 2 with confidence coefficient α.
154
9.2 Evaluation and Construction of Tests
9.2.1 Unbiased and Consistent Tests
In the case of estimation of parameters, it was necessary to define some desirable properties
for estimators, to enable us to have criteria for choosing between competing estimators.
Similarly in hypothesis testing, we would like to use a test that is “best” in some sense.
Note that a test specifies a critical region. Alternatively, the choice of a critical region
defines a test. That is, the terms ‘test’ and ‘critical region’ can, in this sense, be used
interchangeably. So if we define a best critical region, we have defined a best test.
The analogue for unbiasedness and consistency in estimation are defined below for
hypothesis testing.
Definition 9.3
A test is unbiased if Pθ (rejecting H0 |H1 ) is always greater than Pθ (rejecting H0 |H0 ).
That is,
min_{θ∈H1} π(θ) ≥ max_{θ∈H0} π(θ).
Definition 9.4
A sequence of tests {ψn }, each of size α, is consistent if their power functions approach
1 for all θ specified by the alternative. That is,
πψn (θ) → 1, for θ ∈ H1 .
9.2.2 Certain Best Tests
When H0 and H1 are both simple, the error sizes α and β are uniquely defined. In this
section we require that both the null hypothesis and alternative hypothesis are simple, so
that in effect, the parameter space is a set consisting of exactly 2 points. We will define a
best test for testing H0 against H1 , and in 9.2.3 we will prove a Theorem that provides
a method for determining a best test.
Let f (x; θ) denote the density function of a random variable X. Let X1 , X2 , . . . , Xn
denote a random sample from this distribution and consider the simple hypothesis
H0 : θ = θ0 and the simple alternative H1 : θ = θa . So H0 ∪ H1 = {θ0 , θa }.
One repetition of the experiment will result in a particular n–tuple, (x1 , x2 , . . . , xn ).
Consider a set Ci, which is a collection of n–tuples having size α; that is, Ci has the property that
P[(X1, X2, . . . , Xn) ∈ Ci | H0 is true] = α.
It follows that Ci can be thought of as a critical region for the test. Specifically, if the observed n–tuple (x1, x2, . . . , xn) falls in our pre–selected Ci, we will reject H0. However, if H1 were true, then intuitively the ‘best’ critical region would be the one having the highest probability of containing (x1, x2, . . . , xn). Formalizing this notion, we have the following
definition.
Definition 9.5
C is called the best critical region, (BCR) of size α for testing the simple H0 against
the simple H1 if,
(a) P [(X1 , X2 , . . . , Xn ) ∈ C|H0 ] = α,
(b) P [(X1 , . . . , Xn ) ∈ C|H1 ] ≥ P [(X1 , . . . , Xn ) ∈ Ci |H1 ] for every other Ci (of size α).
This definition can be stated in terms of power. Suppose that there is one of these subsets, say C, such that when H1 is true, the power of the test associated with C is at least as great as the power of the test associated with each other Ci.
Example
A coin is tossed twice.
We wish to test H0 : π = 1/2 vs H1 : π = 2/3, where π = P (heads).
The test rejects H0 if two heads occur.
Let the number of heads be denoted by X.
Is X = 2 the BCR for the test?
Under H0 the distribution of outcomes is:
  x          0    1    2
  P(X = x)  1/4  1/2  1/4
Under H1 the distribution of outcomes is:
  x          0    1    2
  P(X = x)  1/9  4/9  4/9
The size of the test is
α = P(X = 2 | π = 1/2) = 1/4,
while the power is given by
1 − β = P(X = 2 | π = 2/3) = 4/9.
To see whether X = 2 gives the BCR, try X ≥ 1 as an alternative region. Now
α = P(X ≥ 1 | π = 1/2) = 1/2 + 1/4 = 3/4
and the power is
1 − β = P(X ≥ 1 | π = 2/3) = 8/9,
so the power has increased but so has the size, and the regions are not comparable. Among regions of size 1/4 (here {X = 2} or {X = 0}), {X = 2} has the greater power (4/9 against 1/9), so X = 2 is the BCR for the test.
Definition 9.6
A test of the simple hypothesis H0 versus the simple alternative H1 that has the smallest
β (or equivalently, the largest π(θ)) among tests with no larger α is called most powerful.
Example 9.2
Suppose X ∼ bin(5, θ). Let f (x; θ) denote the probability function of X. Consider
H0 : θ = 21 , H1 : θ = 43 . The table in Figure 9.1 gives the values of f (x; 21 ), f (x; 34 ) and
f (x; 12 )/f (x; 34 ) for x = 0, 1, . . . , 5.
Figure 9.1: Null vs alternative
  x                      0        1        2         3         4         5
  f(x; 1/2)             1/32     5/32    10/32     10/32      5/32      1/32
  f(x; 3/4)            1/1024  15/1024  90/1024  270/1024  405/1024  243/1024
  f(x; 1/2)/f(x; 3/4)    32      32/3     32/9     32/27     32/81    32/243
Using X to test H0 against H1 , we shall first assign significance level α = 1/32 and
want a best critical region of this size. Now C1 = {x : x = 0} and C2 = {x : x = 5} are
possible critical regions and there is no other subset with α = 1/32. So either C1 or C2 is
the best critical region for this α. If we use C1 then P(X ∈ C1|H1) = 1/1024 and
P(rejecting H0 | H1 is true) < P(rejecting H0 | H0 is true),
an unacceptable situation. On the other hand, if we use C2 then P(X ∈ C2|H1) = 243/1024 and
P(rejecting H0 | H1 is true) > P(rejecting H0 | H0 is true),
a much more desirable state of affairs. So C2 is the best critical region of size α = 1/32 for
testing H0 against H1 . It should be noted that, in this problem, the best critical region,
C, is found by including in C the point (or points) at which f (x; 12 ) is small in comparison
with f (x; 43 ). This suggests that in general, the ratio f (x; H0 )/f (x; H1 ) provides a tool by
which to find a best critical region for a certain given value of α.
Example
We wish to test
H0: θ = 2 vs H1: θ = 4
for a sample of 2 observations from the exponential distribution with pdf
f(y; θ) = e^{−y/θ}/θ, y > 0.
The critical region for the test is defined as
C: {(Y1, Y2); Y1 + Y2 ≥ 9.5},
which makes sense if you plot the null and alternative densities.
Find the size and power of the test.
Now
f(y; θ = 2) = e^{−(y1 + y2)/2}/4, y1, y2 > 0.
The size of the test is
α = P(Y ∈ C|H0) = 1 − P(Y1 + Y2 < 9.5 | θ = 2)
  = 1 − (1/4) ∫_0^{9.5} ∫_0^{9.5 − y2} e^{−(y1 + y2)/2} dy1 dy2 = 0.0497.
Power? If H1 is true, then
f(y; θ = 4) = e^{−(y1 + y2)/4}/16, y1, y2 > 0,
and
β = P(Y ∉ C|H1) = P(Y1 + Y2 < 9.5 | θ = 4)
  = (1/16) ∫_0^{9.5} ∫_0^{9.5 − y2} e^{−(y1 + y2)/4} dy1 dy2 = 0.686,
giving the power of the test as 0.314.
Can we find a better test? For example, try the CR
C: {(Y1, Y2); Y1 + Y2 ≥ 9.0}.
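Before invoking the theorem, the double integrals above can be checked numerically: Y1 + Y2 is Gamma(shape = 2, scale = θ), so in R,

## Size and power of the two critical regions, via the Gamma(2, theta) distribution of Y1+Y2.
1 - pgamma(9.5, shape = 2, scale = 2)   # size alpha, approx 0.0497
1 - pgamma(9.5, shape = 2, scale = 4)   # power 1 - beta, approx 0.314
1 - pgamma(9.0, shape = 2, scale = 2)   # size of {Y1+Y2 >= 9.0}, approx 0.061
1 - pgamma(9.0, shape = 2, scale = 4)   # its power, approx 0.343

The alternative region {Y1 + Y2 ≥ 9.0} does have higher power, but only because its size has grown to about 0.061.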
The following fundamental theorem, due to Neyman and Pearson, tells us that we
cannot find a better test and it provides the methodology for deriving the most powerful
test for testing simple H0 against simple H1 .
9.2.3 Neyman–Pearson Theorem
Suppose X1 , . . . , Xn is a random sample with joint density function f (X; θ). For simple H0
and simple H1 , the joint density function can be written as f0 (x; θ), f1 (x; θ), respectively.
Alternatively, we could use the likelihood notation, L(θ0 ; x), L(θ1 ; x).
Theorem 9.1
In testing H0 : θ = θ0 against H1 : θ = θ1 , the critical region
CK = {x : f0 (x)/f1 (x) < K}
is most powerful (where K ≥ 0).
[Or, in terms of likelihood, for a given α, the test that maximizes the power at θ1 has
rejection region determined by
L(θ0 ; x1 , . . . , xn )/L(θ1 ; x1 , . . . , xn ) < K.
Such a test will be most powerful for testing H0 against H1 .]
Proof.
The constant K is chosen so that
P(x ∈ CK |H0) = ∫ · · · ∫_{CK} f0(x) dx1 . . . dxn = α.
Let AK be another region in the sample space of size α. Then
P(x ∈ AK |H0) = ∫ · · · ∫_{AK} f0(x) dx1 . . . dxn = α.
The regions CK and AK may overlap, as shown in Figure 9.2.
[Figure 9.2: The regions CK and AK, with intersection I; C′K = CK \ AK and A′K = AK \ CK.]
Now
α = ∫_{CK} f0(x) dx = ∫_{C′K} f0(x) dx + ∫_I f0(x) dx = ∫_{AK} f0 dx = ∫_{A′K} f0 dx + ∫_I f0 dx.
This implies that
∫_{C′K} f0 dx = ∫_{A′K} f0 dx,
which are equal to α if CK and AK are disjoint.
Since C′K ⊂ CK and, on CK,
f0(x)/f1(x) < K,
we have f0(x) < K f1(x), giving
∫_{C′K} f0 dx < K ∫_{C′K} f1 dx,  ie  ∫_{C′K} f1 dx > (1/K) ∫_{C′K} f0 dx.
Also, for x ∉ CK,
f0(x)/f1(x) > K.
But A′K is outside CK, and so
∫_{A′K} f0 dx > K ∫_{A′K} f1 dx,  giving  ∫_{A′K} f1 dx < (1/K) ∫_{A′K} f0 dx.
Now for the power:
1 − β = ∫_{CK} f1(x) dx = ∫_{C′K} f1(x) dx + ∫_I f1 dx
      ≥ (1/K) ∫_{C′K} f0(x) dx + ∫_I f1 dx
      = (1/K) ∫_{A′K} f0(x) dx + ∫_I f1 dx
      ≥ ∫_{A′K} f1 dx + ∫_I f1 dx = ∫_{AK} f1 dx,
and so the test based on CK is at least as powerful as the test based on AK; hence the test based on CK is the most powerful.
Example 1
Sampling from a N(µ, 1) distribution and testing
H0: µ = µ0 vs H1: µ = µ1.
Now
f0 = f(x; µ0) = (2π)^{−n/2} e^{−Σ_i (xi − µ0)²/2}
and
f1 = f(x; µ1) = (2π)^{−n/2} e^{−Σ_i (xi − µ1)²/2},
giving
f0/f1 = e^{−[Σ_i (xi − µ0)² − Σ_i (xi − µ1)²]/2}.
On simplification, the BCR is defined by
f0/f1 = e^{Σ_i xi(µ0 − µ1) − n(µ0² − µ1²)/2} < K,
ie
Σ_i xi (µ0 − µ1) < ln K + (n/2)(µ0 − µ1)(µ0 + µ1).
Now if µ0 > µ1 then
Σ_i xi < ln K/(µ0 − µ1) + (n/2)(µ0 + µ1),
ie the CR becomes Σ_i Xi < constant. Thus we reject H0 for small values of the sample mean, as expected.
When µ0 < µ1, the NP condition becomes
Σ_i xi (µ1 − µ0) > −ln K + (n/2)(µ1 − µ0)(µ0 + µ1),
ie
Σ_i Xi > −ln K/(µ1 − µ0) + (n/2)(µ0 + µ1),
and thus we reject H0 for large values of the sample mean, as expected.
Note that we use the sampling distribution of the mean to find K.
These tests will be the most powerful under each of the conditions specified.
Example 2
What is the BCR for a test of size 0.05 for
H0: θ = 1/2 vs H1: θ = 2
for a single observation from the population with pdf
f(y; θ) = θ e^{−θy}, y > 0?
Now
f0 = e^{−y/2}/2 while f1 = 2e^{−2y}.
The critical region C is defined by
f0/f1 < K,  ie  (e^{−y/2}/2)/(2e^{−2y}) = e^{1.5y}/4 < K.
Thus we reject H0 for small Y, as expected. (Verify using diagrams of the densities under H0 and H1.)
Under H0, the size of the test requires that
P(Y ∈ C|H0) = α = 0.05 = P(0 < Y < cv | θ = 0.5).
Thus
(1/2) ∫_0^{cv} e^{−y/2} dy = 1 − e^{−cv/2} = 0.05  →  cv = 0.1026.
To find K, use
e^{1.5 × 0.1026}/4 = 0.2916 = K.
The test defined by C will be the most powerful.
Example 9.3
Suppose X represents a single observation from the probability density function given by
f(x; θ) = θ x^{θ−1}, 0 < x < 1.
Find the most powerful (MP) test with significance level α = .05 to test H0: θ = 1 versus H1: θ = 2.
Solution.
Since both H0 and H1 are simple, the previous Theorem can be applied to derive the test. Here
L(θ0)/L(θa) = f(x; θ0)/f(x; θa) = (1 × x^{1−1})/(2 × x^{2−1}) = 1/(2x).
The form of the rejection region for the MP test is
1/(2x) < k,
equivalently x > 1/(2k) or, since 1/(2k) is a constant (k′ say), the critical region is x > k′. The value of k′ is determined by
.05 = P(X is in the critical region when θ = 1) = P(X > k′ when θ = 1) = ∫_{k′}^1 1 dx = 1 − k′.
So k′ = .95, and the rejection region is C = {x : x > .95}. Among all tests for H0 versus H1 based on a sample of size 1 and α = .05, this test has the smallest Type II error probability.
[Note that the form of the test statistic and rejection region depends on both H0 and H1. If H1 is changed to θ = 3, the MP test is based on X² and we reject H0 in favour of H1 if X² > k′ for some k′.]
9.2.4 Uniformly Most Powerful (UMP) Test
Suppose we sample from a population with a distribution that is completely specified
except for the value of a single parameter θ. If we wish to test H0 : θ = θ0 (simple) versus
H1 : θ > θ0 (composite) there is no general theorem like Theorem 9.1 that can be applied.
But it can be applied to find a MP test for H0 : θ = θ0 versus HA : θ = θa for any single
value θa ∈ H1 . In many situations the form of the rejection region for the MP test does
not depend on the particular choice of θa . When a test obtained by Theorem 9.1 actually
maximizes the power for every value of θ > θ0 , it is said to be uniformly most powerful,
(UMP) for H0 : θ = θ0 against H1 : θ > θ0 .
We may state the definition as follows:
Definition 9.7
The critical region C is a uniformly most powerful critical region (UMPCR) of size α for testing the simple hypothesis H0 against a composite alternative H1 if the set C is a best critical region of size α for testing H0 against each simple hypothesis in H1. A test defined by this critical region C is called a uniformly most powerful test, with significance level α, for testing the simple H0 against the composite H1.
Uniformly most powerful tests don’t always exist, but when they do, the Neyman
Pearson Theorem provides a technique for finding them.
Example 9.4
Let X1, . . . , Xn be a random sample from a N(0, θ) distribution where the variance θ is unknown. Find a UMP test for H0: θ = θ0 (> 0) against H1: θ > θ0.
Solution.
Now H0 ∪ H1 = {θ : θ ≥ θ0}. The likelihood of the sample is
L(θ) = (1/(2πθ))^{n/2} e^{−Σ xi²/(2θ)}.
Let θa be a number greater than θ0 and let K > 0. Let C be the set of points where
L(θ0; x1, . . . , xn)/L(θa; x1, . . . , xn) ≤ K,
that is, the set of points where
(θa/θ0)^{n/2} e^{−(1/2) Σ xi² (θa − θ0)/(θ0 θa)} ≤ K,
or equivalently
Σ xi² ≥ [2θ0θa/(θa − θ0)] [(n/2) log(θa/θ0) − log K] = c, say.
The set C = {(x1, . . . , xn) : Σ xi² ≥ c} is then a BCR for testing H0: θ = θ0 against H1: θ = θa. It remains to determine c so that this critical region has the desired size α. If H0 is true, Σ Xi²/θ0 is distributed as χ²_n. Since
α = P(Σ Xi²/θ0 ≥ c/θ0 | H0),
c/θ0 may be found from tables of χ² or using pchisq in R.
So C defined above is a BCR of size α for testing H0: θ = θ0 against H1: θ = θa. We note that, for each number θa > θ0, the above argument holds. So C = {(x1, . . . , xn) : Σ xi² ≥ c} is a UMP critical region of size α for testing H0: θ = θ0 against H1: θ > θ0. To be specific, suppose now that θ0 = 3, n = 15, α = .05; show that c = 75.
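As a numerical check of this last claim (using qchisq in R rather than tables):

## c/theta0 is the upper 5% point of chi-square on n = 15 df:
3 * qchisq(0.95, df = 15)    # 74.996..., so c is approximately 75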
Example 1
For a sample Y1, . . . , Yn from a N(µ, 1) distribution, we wish to test
H0: µ = µ0 vs H1: µ > µ0.
To test
H0: µ = µ0 vs H1: µ = µ1
with µ1 > µ0, the region
C: Ȳ > K⁺
provides a MP test, where K⁺ is chosen such that
P(Ȳ > K⁺|H0) = α.
This same region is MP for every µ1 > µ0, since C depends on µ0 and not on µ1. So the test defined by
[T⁺ : Reject H0 if Ȳ > K⁺]
is a UMP test for
H0: µ = µ0 vs H1: µ > µ0.
Similarly, if µ1 < µ0, the region
C: Ȳ < K⁻
provides a MP test, where K⁻ is chosen such that
P(Ȳ < K⁻|H0) = α.
This same region is MP for every µ1 < µ0, since C depends on µ0 and not on µ1. So the test defined by
[T⁻ : Reject H0 if Ȳ < K⁻]
is a UMP test for
H0: µ = µ0 vs H1: µ < µ0.
Counterexample
To show that a UMP test does not necessarily exist, consider testing
H0: µ = µ0 vs H1: µ ≠ µ0.
Since the UMP test for H1: µ > µ0 is T⁺ and the UMP test for H1: µ < µ0 is T⁻, and these are different tests, no single test (T±, say) can be UMP (by definition) for H1: µ ≠ µ0.
Exercise
Produce a plot of the power functions for T⁻, T⁺ and T± to verify the counterexample graphically.
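One way to produce such a plot in R (a sketch; µ0 = 0, n = 10 and α = 0.05 are arbitrary choices) uses the normal power functions directly:

## Power functions for T-, T+ and the two-sided test (N(mu, 1), H0: mu = mu0).
mu0 <- 0; n <- 10; alpha <- 0.05
mu  <- seq(-1.5, 1.5, length = 301)
d   <- sqrt(n) * (mu - mu0)
za  <- qnorm(1 - alpha); za2 <- qnorm(1 - alpha/2)

pow.plus  <- pnorm(d - za)                      # T+: reject if Ybar > K+
pow.minus <- pnorm(-d - za)                     # T-: reject if Ybar < K-
pow.two   <- pnorm(d - za2) + pnorm(-d - za2)   # two-sided test

matplot(mu, cbind(pow.minus, pow.plus, pow.two), type = "l", lty = 1:3,
        xlab = expression(mu), ylab = "power")
abline(h = alpha, lty = 4)
legend("bottomright", c("T-", "T+", "two-sided"), lty = 1:3)

The plot shows that T⁺ dominates for µ > µ0 and T⁻ dominates for µ < µ0, so no single test can be uniformly most powerful against H1: µ ≠ µ0.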
Example 2
For a random sample X1, . . . , X20 from a P(θ) population, show that the CR defined by
Σ_{i=1}^{20} Xi ≥ 5
is a UMP CR for testing
H0: θ = 0.1 vs H1: θ > 0.1.
Consider
H0: θ = θ0 vs H1: θ = θ1.
Now
P(X = x) = e^{−θ} θ^x/x!, x = 0, 1, 2, . . .
and so
f0/f1 = [Π_{i=1}^{20} e^{−θ0} θ0^{xi}/xi!] / [Π_{i=1}^{20} e^{−θ1} θ1^{xi}/xi!] = e^{−nθ0} θ0^{Σ xi} / (e^{−nθ1} θ1^{Σ xi}) < K,
which gives
−nθ0 + Σ_i xi ln θ0 + nθ1 − Σ_i xi ln θ1 < ln K,
ie
X̄ (ln θ0 − ln θ1) < (ln K)/n + (θ0 − θ1).
Now if θ1 > θ0 then
X̄ > [−(1/n) ln K + θ1 − θ0] / (ln θ1 − ln θ0),
which leads to the CR
Σ_i Xi > constant.
This is a MP CR by the NP lemma, and since its form does not depend on the particular value of θ1 (> θ0), this CR is also UMP for the test of H1: θ > θ0.
Exercise
Plot the power function and give the size of the test.
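A sketch of this exercise in R, using the fact that Σ Xi ∼ Poisson(20θ):

## Size and power function of the CR {sum(Xi) >= 5}, n = 20, H0: theta = 0.1.
1 - ppois(4, lambda = 20 * 0.1)                 # size of the test, approx 0.053
theta <- seq(0.05, 0.6, length = 200)
plot(theta, 1 - ppois(4, 20 * theta), type = "l",
     xlab = expression(theta), ylab = "power")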
9.3 Likelihood Ratio Tests
9.3.1 Background
The Neyman Pearson Theorem provides a method of constructing most powerful tests for
simple hypotheses when the distribution of the observation is known except for the value
of a single parameter. But in many cases the problem is more complex than this. In this
section we will examine a general method that can be used to derive tests of hypotheses.
The procedure works for simple or composite hypotheses and whether or not there are
‘nuisance’ parameters with unknown values.
As well as thinking of H0 as being a statement (or assertion) about a parameter θ, it
is a set of values taken by θ. Similarly for H1 . So it is appropriate to write θ ∈ H0 , for
example, or maxH0 L(θ). The set of all possible values of θ is H0 ∪ H1 .
Let f (x; θ) be the density function of a random variable X with unknown parameter
θ, and let X1 , X2 , . . . , Xn be a random sample from this distribution, with observed values
x1 , x2 , . . . , xn . The likelihood function of the sample is
L(θ) = L(θ; x1 , . . . , xn ) =
n
Y
f (xi ; θ) = f (x; θ).
i=1
It is necessary to have a clear idea of what is meant by the parameter space and that subset
of it defined by the hypothesis.
Example 9.5
(a) If X is distributed as Bin(n, p) and we are testing H0 : p = p0 , then H0 can be
written H0 = {p : p = p0 } and H0 ∪ H1 = {p : p ∈ [0, 1]}.
(b) If X is distributed as N(µ, σ 2 ) where both µ and σ 2 are unknown and we are testing
H0 : σ 2 = σ02 , then
H0 ∪ H1 = {(µ, σ 2 ) : µ ∈ (−∞, ∞), σ 2 ∈ (0, ∞)}
H0 = {(µ, σ 2 ) : µ ∈ (−∞, ∞), σ 2 = σ02 }
Now (b) is illustrated in Figure 9.3.
9.3.2 The Likelihood Ratio Test Procedure
The notion of using the magnitude of the ratio of two probability density functions as
the basis of a best test or of a uniformly most powerful test can be modified, and made
intuitively appealing, to provide a method of constructing a test where either or both of
the hypothesis and alternative are composite.
[Figure 9.3: Parameter space — H0 ∪ H1 is the upper half-plane {(µ, σ²) : σ² > 0}, and H0 is the horizontal line σ² = σ0².]
The method leads to tests called likelihood ratio tests, and although not necessarily uniformly most powerful, they often have desirable properties. The test involves a comparison of the maximum value the likelihood can
take when θ is allowed to take any value in the parameter space, and the maximum value
of the likelihood when θ is restricted by the hypothesis. Define
Λ = max_{H0} L(θ) / max_{H0∪H1} L(θ).   (9.4)
Note that
(i) θ may be a vector of parameters;
(ii) both numerator and denominator (and hence Λ) are functions of the sample values x1, . . . , xn, and the right hand side could be written more fully as
max_{θ∈H0} f(x; θ) / max_{θ∈H0∪H1} f(x; θ).
Strictly speaking, Λ as defined in (9.4) is a function of the random variables X1, . . . , Xn and so is itself a random variable with a probability distribution. When X is replaced by the observed values x in the ratio, we will use λ for the observed value of Λ, and both will be called the likelihood ratio.
Clearly, by the definition of maximum likelihood estimates, max_{H0∪H1} L(θ) will be obtained by substituting the mle(’s) for θ into L(θ). Note that
(i) λ ≥ 0 since it is a ratio of pdf’s;
(ii) maxH0 L(θ) ≤ maxH0 ∪H1 L(θ) since the set H0 over which L(θ) is maximized is a
subset of H0 ∪ H1 . This means that λ ≤ 1.
So the random variable Λ has a probability distribution on [0, 1]. If, for a given sample
x1 , . . . , xn , λ is close to 1, then maxH0 L(θ) is almost as large as maxH0 ∪H1 L(θ). This
means that we can’t find an appreciably larger value of the likelihood, L(θ), by searching
for a value of θ through the entire parameter space H0 ∪ H1 supports the proposition that
H0 is true. On the other hand, if λ is small, we note that the observed x1 , . . . , xn was
unlikely to occur if H0 were true, so the occurrence of it casts doubt on H0 . So a value of
λ near zero implies the unreasonableness of the hypothesis.
Let the random variable Λ have probability density function g(λ), 0 ≤ λ ≤ 1. To carry
out the LR test in a given problem involves finding a value λ0 (< 1) so that the critical
region for a size α test is {λ : 0 < λ < λ0 }. That is,
P(Λ ≤ λ0) = ∫_0^{λ0} g(λ) dλ = α.   (9.5)
Since the distribution of Λ is generally very complicated, we would appear to have a difficult problem here. But in many cases, a certain function of Λ has a well-known distribution
and an equivalent test can be carried out. [See the Examples in sub–section 9.3.3 below.]
Cases where this is not so are dealt with in sub–section 9.3.4.
To summarise, note the following :
1. λ = Λ̂ ≥ 0, since Λ is a ratio of pdfs.
2. λ ≤ 1, since H0 ⊂ H0 ∪ H1 and thus max_{H0} L(θ) ≤ max_{H0∪H1} L(θ).
3. If λ ≈ 0, then H0 is not supported, since max_{H0} L(θ) << max_{H0∪H1} L(θ), whereas
4. if λ ≈ 1, then H0 is supported, since then max_{H0} L(θ) ≈ max_{H0∪H1} L(θ).
5. Using all of the above, if Λ has pdf g(λ), 0 < λ < 1, then a CR of size α for the LRT will be {λ : 0 < λ < λ0 << 1}, where
P(Λ < λ0) = ∫_0^{λ0} g(λ) dλ = α.
9.3.3 Some Examples
Example 9.6
Let X have a normal distribution with unknown mean µ and known variance σ02 .
Suppose we have a random sample x1 , x2 , . . . , xn from this distribution and wish to test
H0 : µ = 3 against H1 : µ 6= 3. Now
H0 ∪ H1 = {(µ, σ 2 ) : µ ∈ (−∞, ∞), σ 2 = σ02 }
H0 = {(µ, σ 2 ) : µ = 3, σ 2 = σ02 }
Rather than L(θ) we have here L(µ; x1 , . . . , xn , σ02 ) or more briefly, L(µ).
L(µ) =
=
n
Y
1
2 1
i=1 (2πσ0 ) 2
exp{− 21 (xi − µ)2 /σ02 }
(2π)−n/2 σ0−n
exp
(
− 21
n
X
i=1
(xi − µ)
2
/σ02
)
Now max L(µ) is obtained by replacing µ in the above by its mle, x. So
H0 ∪H1
max L(µ) =
H0 ∪H1
(2π)−n/2 σ0−n
exp
(
− 21
n
X
i=1
(xi − x)
2
/σ02
)
.
Also L(µ|H0 ) has only one value, obtained by replacing µ by 3 and σ 2 by σ02 . So
max_{H0} L(µ) = (2π)^{-n/2} σ0^{-n} exp{ −(1/2) Σ_{i=1}^n (xi − 3)^2/σ0^2 }
             = (2π)^{-n/2} σ0^{-n} exp{ −(1/2) Σ_{i=1}^n (xi − x̄)^2/σ0^2 − (n/2)(x̄ − 3)^2/σ0^2 }
Thus, on simplification
λ = exp{ −n(x̄ − 3)^2/(2σ0^2) }.                    (9.6)
Intuitively, we would expect that values of x̄ close to 3 support the hypothesis, and it can be seen that in this case λ is close to 1. Values of x̄ far from 3 lead to λ close to 0. We need to find the critical value λ0 to satisfy equation (9.5). That is, we need to know the distribution of Λ. From equation (9.6), using random variables instead of observed values, we have

−2 log Λ = ( (X̄ − 3) / (σ0/√n) )^2

which is the square of a N(0, 1) variate and therefore is distributed as χ^2_1. For α = .05, the critical region is {x̄ : n(x̄ − 3)^2/σ0^2 ≥ 3.84}, or alternatively

{x̄ : √n(x̄ − 3)/σ0 > 1.96 or √n(x̄ − 3)/σ0 < −1.96}.
Figure 9.4: Critical regions (rejection region for −2 log λ beyond 3.84, and the corresponding rejection region for λ near zero).
The relationship between the critical region for λ and the critical region for −2 log λ is
shown in Figure 9.4.
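As a numerical illustration of Example 9.6 (a minimal Python sketch only; the sample values, σ0 and the seed are hypothetical), λ and −2 log λ can be computed directly and compared with the χ^2_1 critical value:

import numpy as np
from scipy import stats

# Hypothetical data: n = 25 observations from N(mu, sigma0^2), sigma0 known.
rng = np.random.default_rng(1)
sigma0, n = 2.0, 25
x = rng.normal(3.8, sigma0, size=n)

mu0 = 3.0                                  # H0: mu = 3
xbar = x.mean()
lam = np.exp(-n * (xbar - mu0) ** 2 / (2 * sigma0 ** 2))  # equation (9.6)
minus2loglam = -2 * np.log(lam)            # equals n*(xbar - mu0)^2 / sigma0^2

crit = stats.chi2.ppf(0.95, df=1)          # 3.84 for alpha = 0.05
print(lam, minus2loglam, minus2loglam >= crit)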
Example 9.7
Given X1 , X2 , . . . , Xn is a random sample from a N (µ, σ 2 ) distribution, where σ 2 is
unknown, derive the LR test of H0 : µ = µ0 against H1 : µ ≠ µ0.
Now the parameter space is
H0 ∪ H1 = {(µ, σ 2 ) : µ ∈ (−∞, ∞), σ 2 > 0},
and that restricted by the hypothesis is
H0 = {(µ, σ 2 ) : µ = µ0 , σ 2 > 0}.
We note that there are 2 unknown parameters here, and the likelihood of the sample,
L(µ, σ 2 ; x1 , x2 , . . . , xn ) can be written as
L(µ, σ^2) = ∏_{i=1}^n (2πσ^2)^{-1/2} e^{−(xi − µ)^2/(2σ^2)}
          = (2π)^{-n/2} (σ^2)^{-n/2} e^{−(1/2) Σ_{i=1}^n (xi − µ)^2/σ^2}                    (9.7)

Now the mle’s of µ and σ^2 are

µ̂ = x̄,    σ̂^2 = Σ_{i=1}^n (xi − x̄)^2 / n,
and max_{H0∪H1} L(µ, σ^2) is obtained by substituting these for µ and σ^2 in equation (9.7). This gives

max_{H0∪H1} L(µ, σ^2) = (2π)^{-n/2} n^{n/2} [ Σ_{i=1}^n (xi − x̄)^2 ]^{-n/2} e^{-n/2}.                    (9.8)
Now max_{H0} L(µ, σ^2) is obtained by substituting µ0 for µ in equation (9.7) and replacing σ^2 by its MLE where µ is known, namely Σ_{i=1}^n (xi − µ0)^2/n = σ̃^2, say. Thus

max_{H0} L(µ, σ^2) = (2π)^{-n/2} n^{n/2} [ Σ_{i=1}^n (xi − µ0)^2 ]^{-n/2} e^{-n/2}.
So λ = max_{H0} L(µ, σ^2) / max_{H0∪H1} L(µ, σ^2) becomes

λ = [ Σ_i (xi − µ0)^2 / Σ_i (xi − x̄)^2 ]^{-n/2}.

Taking 2/n-th powers of both sides and writing Σ_i (xi − µ0)^2 as Σ_i [(xi − x̄) + (x̄ − µ0)]^2 = Σ_i (xi − x̄)^2 + n(x̄ − µ0)^2, we have

λ^{2/n} = Σ_i (xi − x̄)^2 / [ Σ_i (xi − x̄)^2 + n(x̄ − µ0)^2 ]
        = 1 / [ 1 + n(x̄ − µ0)^2 / Σ_i (xi − x̄)^2 ].                    (9.9)
Recalling that λ is the observed value of a random variable with a range space [0, 1], and that the critical region is of the form 0 < λ < λ0, we would like to find a function of λ (or of λ^{2/n}) whose probability distribution we recognize. Now

X̄ ∼ N(µ0, σ^2/n), so (X̄ − µ0)/(σ/√n) ∼ N(0, 1) and n(X̄ − µ0)^2/σ^2 ∼ χ^2_1.

Also, Σ_{i=1}^n (Xi − X̄)^2 = νS^2, so Σ_{i=1}^n (Xi − X̄)^2/σ^2 = νS^2/σ^2 ∼ χ^2_ν, where ν = n − 1. So, expressing the denominator of equation (9.9) in terms of random variables, we have

1 + [ n(X̄ − µ0)^2/σ^2 ] / [ νS^2/σ^2 ]  ∼  1 + χ^2_1/χ^2_ν  ∼  1 + χ^2_1/[ ν(χ^2_ν/ν) ]  ∼  1 + T^2/ν
where T is a random variable with a t distribution on ν degrees of freedom. Considering range spaces, the relationship between λ (or λ^{2/n}) and t^2 is a strictly decreasing one, and a critical region of the form 0 < λ < λ0 is equivalent to a CR of the form t^2 > t0^2, as indicated in Figure 9.5.
Figure 9.5: Rejection regions (the correspondence between t^2 > t0^2 and λ < λ0).
That is, the critical region is of the form |t| > t0 where t0 is obtained from tables, using
ν degrees of freedom, and the appropriate significance level.
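A quick numerical check of the algebra above (a sketch only; the data are simulated and hypothetical) confirms that λ^{2/n} from equation (9.9) equals 1/(1 + t^2/ν):

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(10.5, 3.0, size=12)     # hypothetical sample
n, mu0 = len(x), 10.0                  # H0: mu = mu0

xbar = x.mean()
ss = np.sum((x - xbar) ** 2)
lam_2n = 1.0 / (1.0 + n * (xbar - mu0) ** 2 / ss)   # equation (9.9)

nu = n - 1
s2 = ss / nu
t2 = n * (xbar - mu0) ** 2 / s2                     # square of the usual t statistic
print(lam_2n, 1.0 / (1.0 + t2 / nu))                # the two values coincide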
Example 9.8
Given the random sample X1 , X2 , . . . , Xn from a N (µ, σ 2 ) distribution, derive the LR
test of the hypothesis H0 : σ^2 = σ0^2, where µ is unknown, against H1 : σ^2 ≠ σ0^2.
The parameter space, and restricted parameter space are
H0 ∪ H1 = {(µ, σ 2 ) : µ ∈ (−∞, ∞), σ 2 > 0}
H0 = {(µ, σ 2 ) : µ ∈ (−∞, ∞), σ 2 = σ02 }
L(µ, σ^2) is given by equation (9.7) in Example 9.7, and max_{H0∪H1} L(µ, σ^2) is given by equation (9.8). To find max_{H0} L(µ, σ^2) we replace σ^2 by σ0^2 and µ by the mle, x̄. So

max_{H0} L(µ, σ^2) = (2π)^{-n/2} (σ0^2)^{-n/2} e^{−(1/2) Σ_i (xi − x̄)^2/σ0^2}

So

λ = e^{n/2} [ Σ_i (xi − x̄)^2 / (nσ0^2) ]^{n/2} e^{−(1/2) Σ_i (xi − x̄)^2/σ0^2}
Again, we would like to express Λ as a function of a random variable whose distribution we know. Denoting Σ_i (xi − x̄)^2/σ0^2 by w, the random variable W, whose observed value this is, has a χ^2 distribution with parameter ν = n − 1. So we have

λ = e^{n/2} (w/n)^{n/2} e^{−w/2},
and the relationship between the range spaces of Λ and W is shown in Figure 9.6. A critical region of the form 0 < λ < λ0 corresponds to the pair of intervals 0 < w < a, b < w < ∞. So for a size-α test, H0 is rejected if

νs^2/σ0^2 < χ^2_{ν,α/2} or νs^2/σ0^2 > χ^2_{ν,1−α/2}.
[This of course is the familiar intuitive test for this problem.]
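The test can be carried out directly from w = νs^2/σ0^2 (an illustrative sketch with hypothetical data; σ0^2 and α are assumed values):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(5.0, 1.6, size=20)      # hypothetical sample
sigma0_sq, alpha = 1.0, 0.05           # H0: sigma^2 = sigma0^2

nu = len(x) - 1
w = np.sum((x - x.mean()) ** 2) / sigma0_sq        # observed W, chi^2_nu under H0
lo, hi = stats.chi2.ppf([alpha / 2, 1 - alpha / 2], df=nu)
print(w, (lo, hi), w < lo or w > hi)               # reject H0 outside (lo, hi)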
Some Examples
1. Sampling from the Normal distribution.
Let Xi ∼ N(µ, 1). We wish to test H0 : µ = µ0 vs H1 : µ ≠ µ0.
L = ∏_i (1/√(2π)) e^{−(xi − µ)^2/2} = (2π)^{-n/2} e^{−Σ_i (xi − µ)^2/2}

In H0, we have

L0 = (2π)^{-n/2} e^{−Σ_i (xi − µ0)^2/2}

Figure 9.6: Λ vs W (a critical region 0 < λ < λ0 for Λ corresponds to the intervals 0 < w < a and b < w < ∞ for W).

In H1, the likelihood is

L = (2π)^{-n/2} e^{−Σ_i (xi − µ)^2/2}

which gives

ℓ = ln[(2π)^{-n/2}] − Σ_i (xi − µ)^2/2

Now ∂ℓ/∂µ = 0 gives µ̂ = x̄ and so

L1 = (2π)^{-n/2} e^{−Σ_i (xi − x̄)^2/2}

The LR becomes

Λ̂ = λ = L0/L1 = e^{−Σ_i (xi − µ0)^2/2} / e^{−Σ_i (xi − x̄)^2/2}

Thus

−2 ln λ = Σ_i (xi − µ0)^2 − Σ_i (xi − x̄)^2 = n(x̄ − µ0)^2
Now

(X̄ − µ0)^2 / (1/n) ∼ χ^2_1

since

(X̄ − µ0) / (1/√n) ∼ N(0, 1)
2. Sampling from the Poisson distribution.
Let Y ∼ P (λ). We wish to test H0 : λ = λ0 vs H1 : λ > λ0 .
Now the likelihood is

L = ∏_i e^{−λ} λ^{yi} / yi!

In H0, we get

L0 = ∏_i e^{−λ0} λ0^{yi} / yi! = e^{−nλ0} λ0^{Σ_i yi} / ∏_i (yi!)

while in H1 the likelihood is

L = e^{−nλ} λ^{Σ_i yi} / ∏_i (yi!)

to give

ℓ = ln L = −nλ + Σ_i yi ln λ − Σ_i ln yi!.

Now

∂ℓ/∂λ = 0 = −n + Σ_i yi/λ

gives

λ̂ = Σ_i yi / n

and so

L1 = e^{−Σ_i yi} (Σ_i yi / n)^{Σ_i yi} / ∏_i (yi!)

Thus the LR becomes

Λ̂ = L0/L1 = e^{−nλ0} λ0^{Σ_i yi} / [ e^{−Σ_i yi} (Σ_i yi / n)^{Σ_i yi} ]
So the rejection region Λ < Λ0 is equivalent to

L0/L1 = e^{−nλ0} λ0^{Σ_i yi} / [ e^{−Σ_i yi} (Σ_i yi / n)^{Σ_i yi} ] < Λ0

ie

n(Ȳ − λ0) + Σ_i yi ln(λ0/Ȳ) < K
Thus the LR test is equivalent to a test based on Ȳ , as expected.
What about the distribution of −2 ln Λ? Now

−2 ln Λ = −2[ −n(λ0 − ȳ) + Σ_i yi ln(nλ0/Σ_i yi) ]
        = 2nλ0 − 2 Σ_i yi − 2 Σ_i yi ln(nλ0/Σ_i yi)

Using

ln(x) = (x − 1) − (x − 1)^2/2 + . . . ,  0 < x < 2

and assuming that Ȳ > λ0, ie Σ_i yi/n > λ0, then

ln(nλ0/Σ_i yi) = (nλ0/Σ_i yi − 1) − (nλ0/Σ_i yi − 1)^2/2 + . . .

so that

−2 ln Λ = 2nλ0 − 2 Σ_i yi − 2 Σ_i yi [ (nλ0/Σ_i yi − 1) − (nλ0/Σ_i yi − 1)^2/2 + . . . ]
        = Σ_i yi (nλ0/Σ_i yi − 1)^2 + . . .
        = (Σ_i yi − nλ0)^2 / Σ_i yi + . . .
        = (Ȳ − λ0)^2 / (Ȳ/n)  ∼  χ^2_1

since V(Ȳ) = λ0/n and λ̂ = Ȳ.
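As a rough check of this approximation (a sketch only, with hypothetical counts and a hypothetical λ0), the exact −2 ln Λ and the quadratic approximation can be compared:

import numpy as np

rng = np.random.default_rng(4)
y = rng.poisson(3.6, size=40)          # hypothetical Poisson counts
lam0 = 3.0                             # H0: lambda = lambda0

n, ybar = len(y), y.mean()
exact = -2 * (n * (ybar - lam0) + y.sum() * np.log(lam0 / ybar))   # -2 ln Lambda
approx = (ybar - lam0) ** 2 / (ybar / n)                           # leading term of the series
print(exact, approx)                   # both referred to chi^2_1 percentage points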
9.3.4
Asymptotic Distribution of −2 log Λ
The case of a single parameter will be covered first, then the situation of multiple parameters.
(single) We have H0 : θ = θ0 vs H1 : θ ≠ θ0. It is required to demonstrate that

−2 ln Λ ∼ χ^2_1.
Now write ℓ = ln L(θ), and using Taylor’s expansion we get

ℓ(θ) = ℓ(θ̂) + ℓ'(θ̂)(θ − θ̂) + ℓ''(θ̂)(θ − θ̂)^2/2 + . . .

but ℓ'(θ̂) = 0 and so

ℓ(θ) = ℓ(θ̂) + ℓ''(θ̂)(θ − θ̂)^2/2 + . . .

Now

−2 ln Λ = −2[ ℓ(θ0) − ℓ(θ̂) ]

and since

ℓ(θ0) = ℓ(θ̂) + ℓ''(θ̂)(θ0 − θ̂)^2/2 + . . .

then

−2 ln Λ = −2[ ℓ(θ̂) + ℓ''(θ̂)(θ0 − θ̂)^2/2 − ℓ(θ̂) ] = −ℓ''(θ̂)(θ0 − θ̂)^2

Thus

−2 ln Λ = (θ̂ − θ0)^2 / [ −1/ℓ''(θ̂) ]  ∼  χ^2_1

as V(θ̂) = 1/I and I ≈ −E[ℓ''(θ̂)].
This is also the large-sample distribution for the mle θ̂.
(multiple) The proof in the multiple parameter case follows the approach in the single
parameter case, but is not so straightforward. For a full proof, see Silvey (1987),
pages 112–114 and Theorem 7.2.2.
Let the null be written as
H0 : θ ∈ ω
and the alternative be
H1 : θ ∈ Ω.
The LR is then

Λ = L(ω̂) / L(Ω̂)

where ω̂ stands for the mle of θ in ω, and where Ω̂ stands for the mle of θ in Ω. Thus

−2 ln Λ = −2 ln[ L(ω̂)/L(Ω̂) ] = −2[ ln f(ω̂) − ln f(Ω̂) ] = −2[ ℓ(ω̂) − ℓ(Ω̂) ]
Using a Taylor expansion

ℓ(ω̂) = ℓ(Ω̂) + ℓ'(Ω̂)'(ω̂ − Ω̂) + (ω̂ − Ω̂)'ℓ''(Ω̂)(ω̂ − Ω̂)/2 + . . .

Since ℓ'(Ω̂) = 0, then

ℓ(ω̂) − ℓ(Ω̂) = (ω̂ − Ω̂)'ℓ''(Ω̂)(ω̂ − Ω̂)/2 + . . .

which gives

−2 ln Λ ≈ −(ω̂ − Ω̂)'ℓ''(Ω̂)(ω̂ − Ω̂) ≈ (ω̂ − Ω̂)' nI(θ) (ω̂ − Ω̂)

since I(θ̂) ≈ I(θ).

By consideration of the distributions of √n(Ω̂ − θ*) and √n(ω̂ − θ*), where θ* is the true value of θ, the final result can be obtained.

Essentially, the change in the number of parameters in moving from H0 ∪ H1 to H0 imposes restrictions which lead to the degrees of freedom of the resulting χ^2 being of the same dimension as this difference in the number of parameters between H0 ∪ H1 and H0, as per page 116-117 of the Notes.
Thus

−2 ln Λ ∼ χ^2_df

where df is the difference between the number of parameters specified in H0 ∪ H1 and H0.
Reference
Silvey S.D., (1987), Statistical Inference, Chapman and Hall, London.
The distribution of some function of Λ can’t always be found as readily as in the previous
examples. If n is large and certain conditions are satisfied, there is an approximation to
the distribution of Λ that is satisfactory in most large-sample applications of the test. We
state without proof the following theorem.
Theorem 9.2
Under the proper regularity conditions on f (x; θ), the random variable −2 log Λ is
distributed asymptotically as chi-square. The number of degrees of freedom is equal to the
difference between the number of independent parameters in H0 ∪ H1 and H0 .
[Note that in Example 9.6 the distribution of −2 log Λ was exactly χ^2_1.]
Example 9.9
(Test for equality of several variances.)
The hypothesis of equality of variances in two normal distributions is tested using the
F -test. We will now derive a test for the k-sample case by the likelihood ratio procedure.
Consider independent samples

x11, x12, . . . , x1n1   from the population N(µ1, σ1^2),
x21, x22, . . . , x2n2   from the population N(µ2, σ2^2),
. . .
xk1, xk2, . . . , xknk   from the population N(µk, σk^2).
That is, we have observations {xij , j = 1, . . . , ni , i = 1, 2, . . . , k}.
We wish to test the hypothesis
H0 : σ1^2 = σ2^2 = . . . = σk^2 (= σ^2)

against the alternative that the σi^2 are not all the same. Let n = Σ_i ni. Now the p.d.f. of the random variable Xij is

f(xij) = (2π)^{-1/2} σi^{-1} exp{ −(1/2)(xij − µi)^2/σi^2 }.
So the likelihood function of the samples above is

L(µ, σ^2) = (2π)^{-n/2} σ1^{-n1} . . . σk^{-nk} exp{ −(1/2) Σ_{i=1}^k Σ_{j=1}^{ni} (xij − µi)^2/σi^2 }.                    (9.10)

The whole parameter space and restricted parameter space are given by

H0 ∪ H1 = {(µi, σi^2) : µi ∈ (−∞, ∞), σi^2 ∈ (0, ∞), i = 1, . . . , k}
H0 = {(µi, σ^2) : µi ∈ (−∞, ∞), σ^2 ∈ (0, ∞), i = 1, . . . , k}.
The log of the likelihood is

log L = −(n/2) log 2π − (n1/2) log σ1^2 − . . . − (nk/2) log σk^2 − (1/2) Σ_{i=1}^k σi^{-2} Σ_{j=1}^{ni} (xij − µi)^2,

using L for L(µ, σ^2).

To find max_{H0∪H1} L we need the MLE’s of the 2k parameters µ1, . . . , µk, σ1^2, . . . , σk^2.

∂ log L/∂µi = σi^{-2} Σ_{j=1}^{ni} (xij − µi),  i = 1, . . . , k                    (9.11)
∂ log L/∂σi^2 = −ni/(2σi^2) + (1/(2σi^4)) Σ_{j=1}^{ni} (xij − µi)^2,  i = 1, . . . , k                    (9.12)
Equating (9.11) and (9.12) to zero and solving we obtain

µ̂i = (1/ni) Σ_{j=1}^{ni} xij = x̄i·,  i = 1, . . . , k                    (9.13)

σ̂i^2 = (1/ni) Σ_{j=1}^{ni} (xij − x̄i·)^2,  i = 1, . . . , k.                    (9.14)
Substituting these in (9.10) we obtain

max_{H0∪H1} L = (2π)^{-n/2} σ̂1^{-n1} . . . σ̂k^{-nk} exp{ −(1/2) Σ_{i=1}^k Σ_{j=1}^{ni} (xij − x̄i·)^2 ni / Σ_{j=1}^{ni} (xij − x̄i·)^2 }
             = (2π)^{-n/2} σ̂1^{-n1} . . . σ̂k^{-nk} e^{-n/2},  since n = Σ_{i=1}^k ni.                    (9.15)
Now in the restricted parameter space H0 there are (k + 1) parameters, µ1 , . . . , µk
and σ 2 . So we need to find the mle’s of these parameters. The likelihood function now is
(putting σi2 = σ 2 , all i)
L = (2π)^{-n/2} (σ^2)^{-n/2} exp{ −(1/(2σ^2)) Σ_{i=1}^k Σ_{j=1}^{ni} (xij − µi)^2 }                    (9.16)

and
log L = −(n/2) log(2π) − (n/2) log σ^2 − (1/(2σ^2)) Σ_{i=1}^k Σ_{j=1}^{ni} (xij − µi)^2

∂ log L/∂µi = (1/σ^2) Σ_{j=1}^{ni} (xij − µi),  i = 1, . . . , k                    (9.17)

∂ log L/∂σ^2 = −n/(2σ^2) + (1/(2σ^4)) Σ_{i=1}^k Σ_{j=1}^{ni} (xij − µi)^2                    (9.18)
Equating (9.17) and (9.18) to zero and solving we obtain

µ̃i = (1/ni) Σ_{j=1}^{ni} xij = x̄i· (= µ̂i),  i = 1, . . . , k                    (9.19)

σ̃^2 = (1/n) Σ_{i=1}^k Σ_{j=1}^{ni} (xij − x̄i·)^2 = (1/n) Σ_{i=1}^k ni σ̂i^2.                    (9.20)
Substituting (9.19) and (9.20) into (9.16) we obtain

max_{H0} L = e^{-n/2} (2π)^{-n/2} / (σ̃^2)^{n/2}                    (9.21)

So

λ = σ̂1^{n1} . . . σ̂k^{nk} / [ Σ_i ni σ̂i^2/n ]^{n/2} = ∏_{i=1}^k (σ̂i^2)^{ni/2} / [ Σ_{i=1}^k ni σ̂i^2/n ]^{n/2}                    (9.22)
Now, using Theorem 9.2, the distribution of −2 log Λ is asymptotically χ2 . To determine
the number of degrees of freedom we note that the number of parameters in H0 ∪ H1 is 2k
and in H0 is k + 1. Hence the number of degrees of freedom is 2k − (k + 1) = k − 1. Thus
−2 log Λ = − Σ_{i=1}^k ni log σ̂i^2 + n log σ̃^2                    (9.23)

is distributed approximately χ^2_{k−1}.
Bartlett (1937) modified this statistic by using unbiased estimates of σi2 and σ 2 instead
of MLE’s. That is, he used (ni − 1) and (n − k) as divisors, so the statistic becomes
B = − Σ_{i=1}^k νi log si^2 + ( Σ_{i=1}^k νi ) log s^2

where

si^2 = Σ_{j=1}^{ni} (xij − x̄i·)^2/(ni − 1)  and  s^2 = (ν1 s1^2 + . . . + νk sk^2)/(ν1 + . . . + νk),

with νi = ni − 1.
[Investigate the form of this when k = 2.]
A better approximation still is obtained using as the statistic Q = B/C where the
constant C is defined by

C = 1 + (1/(3(k − 1))) [ Σ_i 1/νi − 1/(Σ_i νi) ]

and this statistic is commonly referred to as Bartlett’s statistic for testing homogeneity of variances. That is,

Q = [ (Σ_i νi) log s^2 − Σ_i νi log si^2 ] / { 1 + (1/(3(k−1))) [ Σ_i (1/νi) − 1/(Σ_i νi) ] }                    (9.24)
is distributed approximately as χ^2_{k−1} under the hypothesis H0 : σ1^2 = . . . = σk^2. The
approximation is not very good for small ni .
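The statistic Q of equation (9.24) is easy to compute; the following sketch uses three hypothetical samples (scipy.stats.bartlett should reproduce the same statistic, and is mentioned only as a cross-check):

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
samples = [rng.normal(0, s, size=m) for s, m in [(1.0, 12), (1.3, 15), (0.9, 10)]]  # hypothetical

k = len(samples)
nu = np.array([len(x) - 1 for x in samples], dtype=float)
s2 = np.array([np.var(x, ddof=1) for x in samples])    # unbiased variance estimates s_i^2
s2_pooled = np.sum(nu * s2) / np.sum(nu)               # pooled s^2

B = np.sum(nu) * np.log(s2_pooled) - np.sum(nu * np.log(s2))
C = 1 + (np.sum(1 / nu) - 1 / np.sum(nu)) / (3 * (k - 1))
Q = B / C
print(Q, stats.chi2.ppf(0.95, k - 1))   # compare Q with the chi^2_(k-1) 95% point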
9.4
Worked Example
The aim of this worked example is to demonstrate the use of the likelihood ratio test for
testing composite hypotheses. The example used is akin to that in Example 9.6, but results
are drawn from sections 9.3.4 and 9.3.2.
9.4.1
Example
Let X1 , . . . , Xn be drawn from a N (µ, σ02 ) population, ie, a Normal distribution with unknown mean but known variance σ02 .
We wish to test
H0 : µ = µ0 against H1 : µ ≠ µ0.
Thus

ω = [H0] = [(µ0, σ0^2)]

while

Ω = [H0 ∪ H1] = [(µ, σ0^2)]

Now

L = L(µ) = ∏_{i=1}^n (1/(σ0√(2π))) e^{−(xi − µ)^2/(2σ0^2)}

and if ℓ = log L then

∂ℓ/∂µ = −(1/2)(2)(−1) Σ_i (xi − µ)/σ0^2 = 0

which is zero if µ = x̄, so µ̂ = x̄. Thus

L(Ω̂) = e^{−(1/2) Σ_i (xi − x̄)^2/σ0^2} / ( σ0^n (2π)^{n/2} )
In ω, µ is single valued, as it corresponds to the value µ0. Thus

L(ω̂) = e^{−(1/2) Σ_i (xi − µ0)^2/σ0^2} / ( σ0^n (2π)^{n/2} )

and so the LR becomes

λ = e^{−(1/2) Σ_i (xi − µ0)^2/σ0^2} / e^{−(1/2) Σ_i (xi − x̄)^2/σ0^2}

This gives

−2 log λ = [ Σ_i (xi − µ0)^2 − Σ_i (xi − x̄)^2 ]/σ0^2 = Σ_i [ xi^2 + µ0^2 − 2xi µ0 − xi^2 − x̄^2 + 2xi x̄ ]/σ0^2

and so

−2 log λ = [ nµ0^2 + nx̄^2 − 2 Σ_i xi µ0 ]/σ0^2 = [ n(x̄ − µ0)^2 ]/σ0^2 = ( (x̄ − µ0)/(σ0/√n) )^2

Since

(X̄ − µ0)/(σ0/√n) ∼ N(0, 1)

exactly, then

−2 log λ ∼ χ^2_1

exactly.
In general, the df for the asymptotic distribution of −2 log λ is
dim(Ω) − dim(ω)
which in this case is
1−0=1
The formal definition of ’dim’ is the number of free parameters. In ω both µ and σ 2 are
fixed, while Ω has one free parameter, µ.
Example
Simple Linear Regression :
We have a sample (Xi , Yi ), i = 1, . . . , n, where
Yi ∼ N (β0 + β1 Xi , σ 2 ),
and wish to test
H0 : β1 = 0 vs H1 : β1 ≠ 0
So θ = (β0 , β1 , σ 2 ), with θ 0 = (β0 , 0, σ 2 ).
The likelihood is:

L = ∏_i (1/(σ√(2π))) e^{−(yi − β0 − β1 xi)^2/(2σ^2)}

ie

ℓ = ln L = −(n/2) ln σ^2 − . . . − Σ_i (yi − β0 − β1 xi)^2/(2σ^2)

Under H0,

L = (1/(σ√(2π))^n) e^{−Σ_i (yi − β0)^2/(2σ^2)}

and

ℓ = −(n/2) ln σ^2 − . . . − Σ_i (yi − β0)^2/(2σ^2)
and so

∂ℓ/∂β0 = −(1/2) Σ_i 2(yi − β0)(−1)/σ^2 = 0 → β̂0 = ȳ

and

∂ℓ/∂σ^2 = −n/(2σ^2) − (1/2) Σ_i (yi − β0)^2(−1)/σ^4 = 0 → σ̂^2 = Σ_i (yi − ȳ)^2/n

to give

L0 = max_{H0} L = e^{-n/2} / [ 2π Σ_i (yi − ȳ)^2/n ]^{n/2}

In H0 ∪ H1,

ℓ = ln L = −(n/2) ln σ^2 − . . . − Σ_i (yi − β0 − β1 xi)^2/(2σ^2)
and so

∂ℓ/∂β0 = −(1/2) Σ_i 2(yi − β0 − β1 xi)(−1)/σ^2 = 0 → β̂0 = ȳ − β̂1 x̄

Also

∂ℓ/∂β1 = −(1/2) Σ_i 2(yi − β0 − β1 xi)(−xi)/σ^2 = 0
→ β̂1 = [ Σ_i xi yi − Σ_i xi Σ_i yi/n ] / [ Σ_i xi^2 − (Σ_i xi)^2/n ]

Finally

∂ℓ/∂σ^2 = −n/(2σ^2) − (1/2)(−1) Σ_i (yi − β0 − β1 xi)^2/σ^4 = 0
→ σ̂^2 = Σ_i (yi − β̂0 − β̂1 xi)^2/n
So the maximised likelihood in H0 ∪ H1 is

L1 = max_{H0∪H1} L = e^{-n/2} / [ 2π Σ_i (yi − β̂0 − β̂1 xi)^2/n ]^{n/2}

Thus the likelihood ratio is

Λ = L0/L1 = [ Σ_i (yi − β̂0 − β̂1 xi)^2 / Σ_i (yi − ȳ)^2 ]^{n/2}

and so

Λ^{2/n} = (n − 2)S^2 / Σ_i (yi − ȳ)^2

but

Σ_i (yi − ȳ)^2 = (n − 2)S^2 + β̂1^2 Σ_i (xi − x̄)^2

and so

Λ^{2/n} = (n − 2)S^2 / [ (n − 2)S^2 + β̂1^2 Σ_i (xi − x̄)^2 ]
        = 1 / [ 1 + β̂1^2 Σ_i (xi − x̄)^2 / ((n − 2)S^2) ]
        = 1 / [ 1 + T^2/(n − 2) ]

where

T = β̂1 / ( S/√(Σ_i (xi − x̄)^2) ) ∼ t_{n−2}

is recognised as the familiar test statistic for testing H0 : β1 = 0.
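The identity Λ^{2/n} = 1/(1 + T^2/(n − 2)) can be verified numerically (a sketch with simulated, hypothetical data; the true β0, β1 and σ are arbitrary):

import numpy as np

rng = np.random.default_rng(6)
n = 30
x = rng.uniform(0, 10, size=n)                       # hypothetical regressor values
y = 2.0 + 0.4 * x + rng.normal(0, 1.5, size=n)       # hypothetical responses

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - ybar)) / Sxx           # beta1-hat
b0 = ybar - b1 * xbar                                # beta0-hat
resid = y - b0 - b1 * x
S2 = np.sum(resid ** 2) / (n - 2)                    # residual mean square S^2

lam_2n = np.sum(resid ** 2) / np.sum((y - ybar) ** 2)    # Lambda^(2/n)
T = b1 / np.sqrt(S2 / Sxx)                               # usual t statistic, t_(n-2) under H0
print(lam_2n, 1.0 / (1.0 + T ** 2 / (n - 2)))            # the two values coincide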
Chapter 10
Bayesian Inference
10.1
Introduction
The reference text for this material is :
Lee P.M., (2004), Bayesian Statistics, 3rd ed, Arnold, London.
The topics covered are a selected and condensed set of extracts from Chapters 1 to 3
of Lee, written in a notation consistent with previous chapters of this set of Unit Notes.
Some additional exercises and workings are given.
10.1.1
Overview
This chapter introduces an alternative theory of statistical inference to the procedures
already covered in this Unit. The notion of subjective probability or prior belief is incorporated into the data analysis procedure via the Bayesian inferential process.
As this development is confined to only one chapter, the exposition of Bayesian inference is cursory only.
The references at the end of this chapter are supplied for further reading into this
important topic.
10.2
Basic Concepts
10.2.1
Bayes Theorem
Discrete Case
This will be the form most familiar to students of introductory statistics.
Let E be an event, with H1 , . . . , Hn a sequence of mutually exclusive and exhaustive
events. Then
P(Hn|E) ∝ P(Hn)P(E|Hn)

P(Hn|E) = P(Hn)P(E|Hn) / P(E)

assuming that P(E) ≠ 0. This can be written as

P(Hn|E) = P(Hn)P(E|Hn) / Σ_m P(Hm)P(E|Hm)
which is the full form of Bayes theorem in the discrete case.
Continuous Case
If x and y are continuous random variables, and p(y|x) ≥ 0 with ∫ p(y|x) dy = 1, then

p(y) = ∫ p(x, y) dx = ∫ p(x)p(y|x) dx

since

p(y|x) = p(x, y)/p(x)

So

p(y|x) = p(x, y)/p(x) = p(y)p(x|y)/p(x)

giving

p(y|x) ∝ p(y)p(x|y)

which is Bayes theorem, with the constant of proportionality

1/p(x) = 1 / ∫ p(y)p(x|y) dy

ie,

p(y|x) = p(y)p(x|y) / ∫ p(y)p(x|y) dy
10.3
Bayesian Inference
Information known a priori about the parameters θ is incorporated into the prior pdf p(θ).
The pdf of the data X subject to the parameters is denoted by p(X|θ).
Using Bayes’ theorem for (vector) random variables, we have
p(θ|X) ∝ p(θ)p(X|θ)
where p(θ|X) is called the posterior distribution for θ given X.
The likelihood function L considers the probability law for the data as a function of the
parameters, hence
L(θ|X) = p(X|θ)
so Bayes’ theorem can be written as
posterior ∝ prior × likelihood
which shows how prior information is updated by knowledge of the data.
The posterior/prior/likelihood relation is sometimes written as

p(θ|x) = p(θ)p(x|θ) / p(x)

where

p(x) = ∫ p(θ)p(x|θ) dθ
where we have reverted to scalar notation momentarily.
The marginal distribution p(x) is called the predictive or preposterior distribution.
These equations will be used in later work.
10.4
Normal data
The procedure whereby prior information is updated by knowledge of the data is now
demonstrated using a simple example of sampling of a single observation from a Normal
population with known variance. Hence the data point X comes from N (µ, σ 2 ) where σ 2
is assumed known.
The parameter of interest is µ. Thus
p(x) = (1/(σ√(2π))) e^{−(x − µ)^2/(2σ^2)}
The prior is taken as
µ ∼ N (µ0 , σ02 )
and the likelihood is

L(µ|x) = (1/(σ√(2π))) e^{−(x − µ)^2/(2σ^2)}

So the posterior becomes

p(µ|x) ∝ p(µ) · p(x|µ) = p(µ) · L(µ|x)
      = (1/(σ0√(2π))) e^{−(µ − µ0)^2/(2σ0^2)} · (1/(σ√(2π))) e^{−(x − µ)^2/(2σ^2)}
      ∝ e^{−µ^2(1/σ0^2 + 1/σ^2)/2 + µ(µ0/σ0^2 + x/σ^2)}
      = e^{−µ^2/(2σ1^2) + µµ1/σ1^2}

where

σ1^2 = 1/(1/σ0^2 + 1/σ^2)

and

µ1 = σ1^2 (µ0/σ0^2 + x/σ^2)

Therefore

p(µ|x) ∝ e^{−(µ^2 − 2µµ1 + µ1^2)/(2σ1^2)} = e^{−(µ − µ1)^2/(2σ1^2)}

Thus the posterior distribution is given by

µ|x ∼ N(µ1, σ1^2)
10.4.1
Note
If we define precision as the inverse of the variance, then since
1/σ12 = 1/σ02 + 1/σ 2
we have that
posterior precision = prior precision + data precision .
For the mean, we have
µ1 /σ12 = µ0 /σ02 + x/σ 2
and so the posterior mean is a weighted sum of the prior mean and the data mean (point),
with the weights being proportional to the respective precisions.
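The precision-weighting interpretation is easy to see numerically (a minimal sketch; the function name and all numerical values are purely illustrative):

def normal_posterior(x, sigma2, mu0, sigma0_2):
    # single observation x ~ N(mu, sigma2), prior mu ~ N(mu0, sigma0_2)
    prec = 1 / sigma0_2 + 1 / sigma2                 # posterior precision = prior + data precision
    sigma1_2 = 1 / prec
    mu1 = sigma1_2 * (mu0 / sigma0_2 + x / sigma2)   # precision-weighted mean
    return mu1, sigma1_2

print(normal_posterior(x=4.0, sigma2=1.0, mu0=0.0, sigma0_2=100.0))  # vague prior: mu1 near x
print(normal_posterior(x=4.0, sigma2=1.0, mu0=0.0, sigma0_2=0.5))    # sharp prior pulls mu1 to mu0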
10.5
Normal data - several observations
The process that was undertaken for a single data point is now described for a sample
consisting of more than one observation.
The prior is again
µ ∼ N (µ0 , σ02 )
but the likelihood is

L(µ|x1, . . . , xn) = p(x1|µ) . . . p(xn|µ) = (1/(σ√(2π))^n) e^{−Σ_i (xi − µ)^2/(2σ^2)}
Thus the posterior becomes

p(µ|x1, . . . , xn) ∝ p(µ) · p(x1, . . . , xn|µ) = p(µ) · L(µ|x1, . . . , xn)
                  = (1/(σ0√(2π))) e^{−(µ − µ0)^2/(2σ0^2)} · (1/(σ√(2π))^n) e^{−Σ_i (xi − µ)^2/(2σ^2)}
                  ∝ e^{−µ^2(1/σ0^2 + n/σ^2)/2 + µ(µ0/σ0^2 + Σ_i xi/σ^2)}
                  = e^{−µ^2/(2σ1^2) + µµ1/σ1^2}

where

σ1^2 = 1/(1/σ0^2 + n/σ^2)

and

µ1 = σ1^2 (µ0/σ0^2 + Σ_i xi/σ^2)

Therefore

p(µ|x) ∝ e^{−(µ^2 − 2µµ1 + µ1^2)/(2σ1^2)} = e^{−(µ − µ1)^2/(2σ1^2)}

Thus the posterior distribution is given by

µ|x ∼ N(µ1, σ1^2)
This result could be obtained using the single observation derivation, since
x̄ ∼ N (µ, σ 2 /n)
and so the posterior result given here for a sample of size n is equivalent to that obtained
for a single observation on the mean of a sample of size n.
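The equivalence can be checked directly (an illustrative sketch; the sample, σ^2 and the prior parameters are hypothetical):

import numpy as np

rng = np.random.default_rng(7)
sigma2, mu0, sigma0_2 = 4.0, 0.0, 9.0                  # assumed known variance and prior
x = rng.normal(2.5, np.sqrt(sigma2), size=16)          # hypothetical sample
n = len(x)

# posterior from the n-observation formulae above
sigma1_2 = 1 / (1 / sigma0_2 + n / sigma2)
mu1 = sigma1_2 * (mu0 / sigma0_2 + x.sum() / sigma2)

# same answer treating xbar as one observation from N(mu, sigma2/n)
v = sigma2 / n
sigma1_2_alt = 1 / (1 / sigma0_2 + 1 / v)
mu1_alt = sigma1_2_alt * (mu0 / sigma0_2 + x.mean() / v)

print((mu1, sigma1_2), (mu1_alt, sigma1_2_alt))        # identical pairs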
10.6
Highest density regions
One method of characterising the posterior distribution or density is to describe an interval
or region that contains ’most’ of the distribution. Such an interval would be expected to contain more of the distribution inside than out, and the interval or region should be chosen to be as short or as small as possible. In most cases, there is one such interval or region for a chosen probability level. For Bayesian inference, such an interval or region is called a highest (posterior) density region or HDR. Alternative terminology includes Bayesian confidence interval, credible interval or highest posterior density (HPD) region.
10.6.1
Comparison of HDR with CI
The confidence interval or CI obtained from the sampling theory approach of classical
frequentist statistics appears at first similar to the HDR. For either method, using the normal distribution as an example, we use the fact that

(x − µ)/σ ∼ N(0, 1)

For the sampling theory approach x is considered random, while µ is taken as fixed. The resulting interval is then random. In the Bayesian context, the random variable is taken as µ, while the interval is fixed once the data are available.

If we use a tilde to mark the random quantity, then for a 95% CI we have

|(x̃ − µ)/σ| < 1.96

whereas a 95% HDR is saying that

|(µ̃ − x)/σ| < 1.96
In cases other than the simple situation described here, the two methods can differ.
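For a normal posterior the 95% HDR is just the symmetric interval about µ1 (a sketch; µ1 and σ1^2 below are hypothetical values carried over from a calculation like the one above):

from scipy import stats

mu1, sigma1_2 = 2.3, 0.24                              # hypothetical posterior parameters
hdr = stats.norm.interval(0.95, loc=mu1, scale=sigma1_2 ** 0.5)
print(hdr)   # symmetric about mu1 because the posterior is normal and unimodal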
10.7
Choice of Prior
Using the results of the two simple examples already given, it should be clear that both the form of the prior and its parameters can have a bearing on the posterior distribution. For this reason, the choice of prior needs to be given consideration, both in its form and in its representation via suggested values for the parameters.
10.7.1
Improper priors
In the case of the Normal distribution with several observations, the posterior variance was
σ1^2 = 1/(1/σ0^2 + n/σ^2).

This posterior variance approaches σ^2/n when σ0^2 is large compared to σ^2/n. Alas, in the limit this means that the prior N(µ0, σ0^2) would become uniform on the whole real line, and thus would not be a proper density. Hence the term improper prior.
Such improper priors can nevertheless produce quite proper posterior distributions when
combined with an ordinary likelihood. (In such cases, the likelihood dominates the prior.)
Operationally, improper priors are best interpreted as approximations over a large range
of values rather than being truly valid over an infinite interval. The construction of such
approximations can be formalised, see Lee p42, theorem.
10.7.2
Reference Priors
The term ’reference’ prior is used to cover the case where the data analysis proceeds on
the assumption that the likelihood should be expected to dominate the prior, especially
where there is no strongly held belief ’a priori’.
10.7.3
Locally uniform priors
A prior that is reasonably constant over the region where the likelihood dominates and is
not large elsewhere is said to be locally uniform. For such a prior, the posterior becomes
p(θ|x) ∝ p(x|θ) = L(θ|x)
and so as expected, the likelihood dominates the prior. Bayes postulated (not his theorem)
that the ’know nothing’ prior p(θ) should be represented by a uniform prior where θ is an
unknown probability such that 0 < θ < 1. This implies
p(θ) = 1, 0 < θ < 1
Alas, this suffers from the same problems as improper priors, but again if appropriate intervals can be chosen, a locally uniform prior can be found that will be workable.
10.8
Data–translated likelihoods
In choosing a reference prior that is ’flat’, it would seem natural to choose an appropriate
scale of measurement for the uniform prior, which is related to the problem at hand. One
such scale is one on which the likelihood is data–translated, ie, one for which
L(θ|x) = g(θ − t(x))
for a function t, which is a sufficient statistic for θ.
For example, the Normal has
L(µ|x) ∝ e^{−(x − µ)^2/(2σ^2)}

and it is thus clearly of this type.
However, a binomial B(n, π) has
L(π|x) ∝ π^x (1 − π)^{n−x}

which cannot be put into the form g(π − t(x)).
If the likelihood L is data–translated, then different data values give the same form
for L except for a shift in location. Thus the main function of the data is to determine
location. Thus it would seem sensible to adopt a uniform prior for such data–translated
likelihoods.
10.9
Sufficiency
Recall that a statistic T = t(x) is sufficient for θ iff
f(x; θ) = g(t(x); θ) · h(x)
by the Fisher–Neyman factorisation criterion.
Theorem
For any prior distribution, the posterior of θ given x is the same as the posterior of θ
given a sufficient statistic t for θ, assuming that t exists.
Proof
From sufficiency
p(x|θ) = p(t|θ)p(x|t)
and so the posterior is
p(θ|x) ∝ p(θ)p(x|θ) = p(θ)p(t|θ)p(x|t) ∝ p(θ)p(t|θ)

which proves the result.

10.10 Conjugate priors
A class P of priors forms a conjugate family if the posterior is in the class for all x when
the prior is in P .
In practice, we mostly restrict P to the exponential family of distributions.
As an example, the Normal has a Normal conjugate prior, as shown by the posterior
being Normal as well.
10.11
Exponential family
Earlier in the Unit we defined the exponential family of distributions by
f(xi; θ) = p(xi|θ) = h(xi)B(θ)e^{q(θ)K(xi)}
Note the change from p(θ) to q(θ) to avoid confusion with the general form for the prior.
This form for the density gives the likelihood as
L(θ|x) = h(x)B^n(θ)e^{q(θ) Σ_i K(xi)}
For the exponential family, there is a family of conjugate priors defined by
p(θ) ∝ B(θ)e^{q(θ)τ}

since the posterior is then

p(θ) · L(θ|x) ∝ B(θ)e^{q(θ)τ} B^n(θ)e^{q(θ) Σ_i K(xi)} = B^{n+1}(θ)e^{q(θ)(τ + Σ_i K(xi))}
which belongs to the same class as the prior. Hence the class is conjugate.
Normal mean
For the normal case with known variance, τ ∝ x and t = Σ_i K(xi) ∝ nx̄, and the general form for the class (in the exponent) is

−νµ^2/(2σ^2) + Tµ/σ^2

where T can be chosen as νµ0 to give the correct scale. The full form is proportional to

e^{−ν(µ − µ0)^2/(2σ^2)}

which for ν = 1 generates the prior used for the normal.
Binomial proportion
In a fixed number of n trials, the number of successes x can be such that
x ∼ B(n, π)
ie, x follows the Binomial distribution, where π is the probability of a success at each of
the (independent) trials. Thus
p(x|π) = (n choose x) π^x (1 − π)^{n−x},  x = 0, 1, . . . , n
       ∝ π^x (1 − π)^{n−x}.

If we choose the prior as a Beta distribution, ie,

p(π) ∝ π^{α−1} (1 − π)^{β−1},  0 ≤ π ≤ 1

then the posterior is

p(π|x) ∝ π^{α+x−1} (1 − π)^{β+n−x−1}
which is also a Beta distribution. Thus the prior is conjugate.
Alternatively, we could construct the prior from the general form, using the fact that
the Binomial is a member of the exponential family.
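A small sketch of the conjugate update (the counts and prior parameters are hypothetical):

from scipy import stats

n, x = 20, 6                          # hypothetical: x successes in n trials
alpha, beta = 2.0, 2.0                # Beta(alpha, beta) prior on pi

posterior = stats.beta(alpha + x, beta + n - x)        # Beta(alpha + x, beta + n - x)
print(posterior.mean())                                # posterior mean
print((alpha + x - 1) / (alpha + beta + n - 2))        # posterior mode
print(posterior.interval(0.95))                        # central 95% interval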
10.12
Reference prior for the binomial
We should note from the previous section that if we choose a B(α, β) prior, then the
posterior is B(α + x, β + n − x), when we have x successes in n trials.
10.12.1
Bayes
Bayes proposed the following uniform prior for the binomial:

p(π) = 1 for 0 ≤ π ≤ 1, and 0 elsewhere,

which is really B(1, 1). In short, the implication of this prior is that the mode of the posterior distribution corresponds to the unbiased estimate of the proportion from classical statistics.
10.12.2
Haldane
Haldane proposed a B(0, 0) prior, ie,
p(π) ∝ π^{−1} (1 − π)^{−1}
For this prior, the mean of the posterior is the observed proportion.
10.12.3
Arc–sine
The arc–sine prior is a B(1/2, 1/2), which gives a uniform prior on the scale sin^{−1} √π, hence the name arc–sine. This transformation corresponds to the variance stabilising transformation for the binomial proportion.
10.12.4
Conclusion
All 3 priors give equivalent answers even for small amounts of data, but the reason for
labouring the point is to show the problem of describing the situation of knowing nothing
about the proportion.
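The following sketch compares the posterior mean and mode under the three priors for hypothetical counts (Haldane's prior needs 0 < x < n for a proper posterior):

n, x = 20, 6                                           # hypothetical counts
for name, a, b in [("Bayes B(1,1)", 1.0, 1.0),
                   ("Haldane B(0,0)", 0.0, 0.0),
                   ("arc-sine B(1/2,1/2)", 0.5, 0.5)]:
    a1, b1 = a + x, b + n - x                          # posterior is B(a1, b1)
    mean = a1 / (a1 + b1)
    mode = (a1 - 1) / (a1 + b1 - 2)
    print(name, mean, mode)    # Haldane: mean = x/n; Bayes: mode = x/n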
10.13
Jeffrey’s Rule
10.13.1
Fisher information
Recall that Fisher information from a single observation x is given by

I(θ|x) = −E[∂^2 ℓ/∂θ^2] = E[(∂ℓ/∂θ)^2]

where ℓ = ln L. The information from n observations x = (x1, . . . , xn) is given by I(θ|x) and

I(θ|x) = nI(θ|x)
10.13.2
Jeffrey’s prior
If the parameter θ is transformed to φ via φ = φ(θ), then

∂ℓ(φ|x)/∂φ = (∂ℓ(θ|x)/∂θ)(dθ/dφ)

So

I(φ|x) = E[ (∂ℓ(φ|x)/∂φ)^2 ] = E[ (∂ℓ(θ|x)/∂θ)^2 ] (dθ/dφ)^2 = I(θ|x)(dθ/dφ)^2

So if p(θ) ∝ √I(θ|x) then p(φ) ∝ √I(φ|x); therefore choose √I as a reference prior, as it is invariant to change of scale. So the reference prior is given by

p(θ) ∝ √I(θ|x)
called Jeffrey’s rule.
Note that this is not the only form of prior possible, but it can be used as a guide
especially if there is no other obvious method of finding a prior.
Example
Consider the binomial parameter π for which we have
p(x|π) ∝ π^x (1 − π)^{n−x}

and so

ℓ = x ln π + (n − x) ln(1 − π) + constant

to give

∂ℓ/∂π = x/π − (n − x)/(1 − π)

which becomes

∂^2 ℓ/∂π^2 = −x/π^2 − (n − x)/(1 − π)^2

The information is thus

I = −E[∂^2 ℓ/∂π^2] = E[x]/π^2 + E[n − x]/(1 − π)^2 = n/π + n/(1 − π) = n/(π(1 − π))

So finally the prior is

p(π) ∝ π^{−1/2} (1 − π)^{−1/2}
or, π ∼ B(1/2, 1/2), ie, the arc–sine distribution given earlier as a reference prior for the
binomial.
10.14
Approximations based on the likelihood
One way of describing a HDR is to quote the mode of the posterior density, although this goes against the idea of constructing a HDR.

If the likelihood dominates the prior, for example if the prior chosen is a reference prior, then the posterior mode will be close to the MLE θ̂, obtained by solving

ℓ'(θ̂) = 0

In the neighbourhood of θ̂, we have

ℓ(θ) = ℓ(θ̂) + (θ − θ̂)^2 ℓ''(θ̂)/2 + . . .

and so

L(θ|x) ∝ e^{(θ − θ̂)^2 ℓ''(θ̂)/2}

Thus

L(θ|x) ∝ N(θ̂, −1/ℓ''(θ̂))

or

∝ N(θ̂, 1/I(θ̂|x))

Thus a HDR could be constructed for θ using this approximation to the likelihood, assuming that the likelihood dominates the prior.
10.14.1
Example
For a sample x' = (x1, . . . , xn) from a P(λ) with T = Σ_i xi, we have

L(λ|x) ∝ e^{−nλ} λ^T / T!

Thus

ℓ(λ) = T ln λ − nλ + constant

to give

ℓ'(λ) = T/λ − n

and

ℓ''(λ) = −T/λ^2

Now λ̂ = T/n = x̄ and I(λ|x) = nλ/λ^2 = n/λ.

So I(λ̂|x) = −ℓ''(λ̂) = n^2/T. Thus the posterior can be approximated by N(T/n, T/n^2), asymptotically.
Note that this posterior differs from the posterior obtained by using a conjugate prior.
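To see how good the N(T/n, T/n^2) approximation is, it can be set against an exact posterior; the comparison below assumes a flat prior p(λ) ∝ 1 (an assumption, not a prior used in the notes above), under which the posterior is Gamma(T + 1) with rate n, and the counts are hypothetical:

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.poisson(4.0, size=30)                          # hypothetical Poisson sample
T, n = x.sum(), len(x)

approx = stats.norm(loc=T / n, scale=np.sqrt(T) / n)   # N(T/n, T/n^2)
exact = stats.gamma(a=T + 1, scale=1 / n)              # Gamma(T+1, rate n): flat-prior posterior
for q in (0.025, 0.5, 0.975):
    print(q, approx.ppf(q), exact.ppf(q))              # quantiles agree closely for moderate T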
10.15
Reference posterior distributions
10.15.1
Information provided by an experiment
Note that while the term information used here is related to Fisher information, it is best
to treat the two as separate quantities for the moment.
10.15.2
Reference priors under asymptotic normality
Lindley defined the amount I(x|p) of information that an observation x provides about θ as

I(x|p) = ∫ p(θ|x) ln[ p(θ|x)/p(θ) ] dθ

Averaged over all random observations, we obtain the expected information as

I = ∫ p(x) ∫ p(θ|x) ln[ p(θ|x)/p(θ) ] dθ dx

This is equivalent to Shannon’s information.
Exercise
( Show that the two expressions

∫ p(θ) ∫ p(x|θ) ln[ p(x|θ)/p(x) ] dx dθ

and

∫∫ p(θ, x) ln[ p(θ, x)/(p(θ)p(x)) ] dθ dx

are equivalent to the original form for expected information. )
Define In as the information about θ from n (independent) observations from the same
distribution as x.
Thus I∞ gives the information about θ provided by the data when the prior is p(θ), since with an infinite number of observations we would have the exact value of θ. Maximising this ensures that we have a true reference prior, ie, one that contains no information about θ.
Define pn (θ|x) as the posterior corresponding to the prior pn (θ), where the prior pn (θ)
maximises In .
The reference posterior p(θ|x) is then limn→∞ pn (θ|x).
The reference prior p(θ) is defined (indirectly) as the p(θ) such that
p(θ|x) ∝ p(θ)p(x|θ)
where p(θ|x) is the reference posterior defined above.
To find the reference prior, define entropy H as
H{p(θ)} = − ∫ p(θ) ln p(θ) dθ
Now the information about θ contained in n observations x' = (x1, . . . , xn) is

In = ∫ p(x) ∫ p(θ|x) ln[ p(θ|x)/p(θ) ] dθ dx

where the posterior pn(θ|x) = p(θ|x) and the prior pn(θ) = p(θ) to make the notation concise.

In = ∫ p(x) ∫ p(θ|x) ln p(θ|x) dθ dx − ∫ p(x) ∫ p(θ|x) ln p(θ) dθ dx
   = − ∫ p(x) H{p(θ|x)} dx − ∫∫ p(x)p(θ|x) ln p(θ) dθ dx
   = − ∫ p(x) H{p(θ|x)} dx − ∫∫ p(θ)p(x|θ) ln p(θ) dx dθ
   = − ∫ p(x) H{p(θ|x)} dx − ∫ p(θ) ln p(θ) [ ∫ p(x|θ) dx ] dθ     (and ∫ p(x|θ) dx = 1)
   = − ∫ p(x) H{p(θ|x)} dx − ∫ p(θ) ln p(θ) dθ
   = − ∫ p(θ) [ ∫ p(x|θ) H{p(θ|x)} dx ] dθ + H{p(θ)}
   = ∫ p(θ) ln[ e^{−∫ p(x|θ)H{p(θ|x)}dx} / p(θ) ] dθ

This puts In into the form

In = ∫ p(θ) ln[ f(θ)/p(θ) ] dθ

Now In is maximal when p(θ) ∝ f(θ), via the calculus of variations (Exercise!!).

Thus the prior corresponding to a maximal In is

pn(θ) ∝ e^{−∫ p(x|θ)H{p(θ|x)}dx}
10.15.3
Reference priors under asymptotic normality
Here the posterior distribution for n observations is approximately

N(θ̂, 1/I(θ̂|x))

and so

p(θ|x) ∼ N(θ̂, 1/(nI(θ̂|x)))

using the additive property of Fisher information.

The entropy H for a N(µ, σ^2) density becomes

H{p(µ)} = − ∫ (1/√(2πσ^2)) e^{−(z − µ)^2/(2σ^2)} { −ln √(2πσ^2) − (z − µ)^2/(2σ^2) } dz = ln √(2πeσ^2)

This gives

H{p(µ|x)} = ln √( 2πe/(nI(θ̂|x)) )

where µ = θ.
Thus we get

−∫ p(x|θ)H{p(θ|x)}dx = −∫ p(x|θ) ln √( 2πe/(nI(θ̂|x)) ) dx
                      = −∫ p(θ̂|θ) ln √( 2πe/(nI(θ̂|x)) ) dθ̂
                      = −ln √( 2πe/(nI(θ|x)) )

since p(θ̂|θ) ≈ 0 except when θ̂ ≈ θ. Thus

pn(θ) ∝ e^{ln √I(θ|x)} = √I(θ|x)

which is Jeffrey’s prior.
This result has use for a wider class of problems and for handling nuisance parameters.
Exercise
Show that the entropy for the prior p(θ) = e^{−θ/β}/β is 1 + ln β.
10.16
References
1. Bernardo J.M. and A.F.M. Smith, (1994), Bayesian Theory, Wiley.
2. Broemeling L.D., (1985), Bayesian Analysis of Linear Models, Marcel Dekker.
3. Carlin B.P. and Louis T.A., (2000), Bayes and Empirical Bayes Methods for Data
Analysis, 2nd ed., Chapman and Hall.
4. Gelman A., Carlin J.B., Stern H.S. and Rubin D.B., (2004), Bayesian Data Analysis,
2nd ed., Chapman and Hall.
5. Lee P.M., (2004), Bayesian Statistics, 3rd ed., Arnold.
6. Leonard T. and Hsu T.S.J., (2001), Bayesian Methods, Cambridge University Press.