School of Mathematics, Statistics and Computer Science

STAT354
Part I: Distribution Theory
Part II: Statistical Inference

Printed at the University of New England, December 6, 2005

Contents

0.1 Details of the unit  v
0.2 PLAGIARISM  viii

I  Distribution Theory  1

1 Prerequisites  1
1.1 Probability concepts assumed known  1
1.2 Assumed knowledge of matrices and vector spaces  2
1.2.1 Worked Example  4

2 Preliminaries  7
2.1 Introduction  7
2.2 Indicator Functions  7
2.3 Distribution Functions (cdf's)  8
2.4 Bivariate and Conditional Distributions  9
2.4.1 Conditional Mean and Variance  10
2.5 Stochastic Independence  12
2.6 Moment Generating Functions (mgf)  12
2.6.1 Multivariate mgfs  13
2.7 Multinomial Distribution  15

3 Transformations  17
3.1 Introduction  17
3.1.1 The Probability Integral Transform  21
3.2 Bivariate Transformations  22
3.3 Multivariate Transformations (One-to-One)  32
3.4 Multivariate Transformations Not One-to-One  33
3.5 Convolutions  34
3.6 General Linear Transformation  36
3.7 Worked Examples: Bivariate MGFs  37

4 Multivariate Normal Distribution  41
4.1 Bivariate Normal  41
4.2 Multivariate Normal (MVN) Distribution  43
4.3 Moment Generating Function  44
4.4 Independence of Quadratic Forms  47
4.5 Distribution of Quadratic Forms  50
4.6 Cochran's Theorem  54

5 Order Statistics  59
5.1 Introduction  59
5.2 Distribution of Order Statistics  60
5.3 Marginal Density Functions  62
5.4 Joint Distribution of Yr and Ys  67
5.5 The Transformation F(X)  73
5.6 Examples  76
5.7 Worked Examples: Order statistics  80

6 Non-central Distributions  85
6.1 Introduction  85
6.2 Distribution Theory of the Non-Central Chi-Square  86
6.3 Non-Central t and F-distributions  89
6.4 POWER: an example of use of non-central t  90
6.4.1 Introduction  90
6.4.2 Power calculations  92
6.5 POWER: an example of use of non-central F  95
6.5.1 Analysis of variance  95
6.6 R commands  99

II  Statistical Inference  101

7 Reduction of Data  102
7.1 Types of inference  102
7.2 Frequentist inference  102
7.3 Sufficient Statistics  103
7.4 Factorization Criterion  106
7.5 The Exponential Family of Distributions  110
7.6 Likelihood  112
7.7 Information in a Sample  115

8 Estimation  123
8.1 Some Properties of Estimators  123
8.2 Cramér–Rao Lower Bound  131
8.3 Properties of Maximum Likelihood Estimates  140
8.4 Interval Estimates  145

9 Hypothesis Testing  151
9.1 Basic Concepts and Notation  151
9.1.1 Introduction  151
9.1.2 Power Function and Significance Level  152
9.1.3 Relation between Hypothesis Testing and Confidence Intervals  153
9.2 Evaluation of and Construction of Tests  155
9.2.1 Unbiased and Consistent Tests  155
9.2.2 Certain Best Tests  155
9.2.3 Neyman Pearson Theorem  159
9.2.4 Uniformly Most Powerful (UMP) Test  163
9.3 Likelihood Ratio Tests  167
9.3.1 Background  167
9.3.2 The Likelihood Ratio Test Procedure  167
9.3.3 Some Examples  170
9.3.4 Asymptotic Distribution of −2 log Λ  177
9.4 Worked Example  182
9.4.1 Example  183

10 Bayesian Inference  187
10.1 Introduction  187
10.1.1 Overview  187
10.2 Basic Concepts  188
10.2.1 Bayes Theorem  188
10.3 Bayesian Inference  189
10.4 Normal data  189
10.4.1 Note  190
10.5 Normal data - several observations  191
10.6 Highest density regions  192
10.6.1 Comparison of HDR with CI  192
10.7 Choice of Prior  192
10.7.1 Improper priors  192
10.7.2 Reference Priors  193
10.7.3 Locally uniform priors  193
10.8 Data–translated likelihoods  193
10.9 Sufficiency  194
10.10 Conjugate priors  194
10.11 Exponential family  195
10.12 Reference prior for the binomial  196
10.12.1 Bayes  196
10.12.2 Haldane  196
10.12.3 Arc–sine  197
10.12.4 Conclusion  197
10.13 Jeffrey's Rule  197
10.13.1 Fisher information  197
10.13.2 Jeffrey's prior  197
10.14 Approximations based on the likelihood  198
10.14.1 Example  199
10.15 Reference posterior distributions  199
10.15.1 Information provided by an experiment  199
10.15.2 Reference priors under asymptotic normality  199
10.15.3 Reference priors under asymptotic normality  201
10.16 References  203

0.1 Details of the unit

Coordinator/Lecturer: Dr. Bernard A. Ellem
Address: School of Mathematics, Statistics and Computer Science, University of New England, Armidale, NSW, 2351.
Telephone: 02 6773 2284
Fax: 02 6773 3312
Email: [email protected]

Objectives

Although many of the topics considered in STAT354, Distribution Theory and Statistical Inference, sound the same as those in the second year unit, a more theoretical approach is used and a deeper level of understanding is required. The unit concentrates on the fundamental aspects of statistics rather than on the particular methods in use in various disciplines. To gain most benefit from studying this unit, you should read more widely in the prescribed text and other texts than the minimum indicated in the Study Guide.

The subject of statistics is concerned with summarizing data in order to understand the evidence they contain. The topics in this unit are foundations for statistical strategies of interpreting data. There are two (or more) schools of statistical thought. Traditional parametric statistics relies upon the assumption that the data can be approximated by a parametric distribution.
In this way the essential information is contained in a small set of parameters. The other way is to make very few assumptions and let the data themselves decide the distribution. Whilst this may seem appealing, we could have difficulty in reducing the data to be able to interpret. Both schools have their advantages, disadvantages and adherents but most statisticians use whatever will do the job; often a blend of both. This unit is predominantly about parametric statistics with one section on non-parametric statistics, Order Statistics. Bayesian inference is introduced in the final chapter. At times, distribution theory and inference seem remote from the practical problems of analysing data but progress in statistics necessitates a firm foundation in inference. As statisticians, you will be challenged with the non-standard problems where your insight into the problems may be the most important factor in success. The insight is honed by studying the theoretical bases of statistics. Content There are two sections to Statistics 354, Distribution Theory and Statistical Inference and they will be covered in that order and completed in First Semester. v Textbook The text that will be used for both sections of the course is Casella G. and Berger R.L., (2002), Statistical Inference, 2nd ed., Duxbury. An additional reading is Hogg, R.V. and Craig, A.T. Introduction to Mathematical Statistics, 5th edition, 1995, Macmillan. There is now a 6th edition, (2005), by Hogg R.V., McKean J.W. and Craig A.T., Pearson Prentice Hall. Throughout the Study Guides, the text will be referred to as CB, while the additional reading will be referred to as HC. The text for the last chapter is : Lee P.M., (2004), Bayesian Statistics, 3rd. ed., Arnold, London Timetable The two sections of the course are about equal in length and you should plan to finish Distribution Theory about half–way through the Semester. Be sure to plan your work to leave enough time for revision of both sections before the mid–year examination period begins. Copies of examination papers for 1999-2005 (inclusive) are held at the UNE library web page http://www.une.edu.au/library/ It should be noted, however, that units change from year to year so these examination papers are meant only as a guide. You cannot expect that previous questions or slight modifications of them, will necessarily reappear in later examination papers. Assessment Assignments will count 30% towards the total assessment. The remainder will be by a two– hour examination at the end of First Semester. To obtain a pass, students are expected to perform at least close to pass standard in each of the two sections. Assignments are to be of a high standard; remember this is a BSc. It seems redundant to say that legibility, writing in ink, neat setting out etc. are required but a reminder is not wasted. The combination of assignments and examination should allow diligent students to comfortably pass the course. vi Acknowledgements These notes were written by Dr Gwenda Lewis at the Statistics department of U.N.E. Bob Murison made revisions during 1997-99. Bernard Ellem made further revisions in 2003-2005. vii 0.2 PLAGIARISM Students are warned to read the statement in the Faculty’s Undergraduate and Postgraduate Handbooks for 2006 regarding the University’s Policy on Plagiarism. 
Full details of the Policy on Plagiarism are available in the 2006 UNE Handbook and at the following web site: http://www.une.edu.au/offsect/policies.htm

In addition, you must complete the Plagiarism Declaration Form for all assignments, practical reports, etc. submitted in this unit.

0.3 Assignments

Both sections of the unit have assignments. These are very closely linked to the material in the Study Guide and each will indicate which chapters it depends on. Don't wait until you feel you fully understand a whole chapter before beginning an assignment. Later parts of chapters are often easier to understand after you have had some practice at problems. Submission of all assignments is compulsory, and a reasonable attempt should be made at every question. The despatch dates are listed for both Distribution Theory and Statistical Inference, and you should make every effort to submit assignments on time. If you anticipate a delay of more than a couple of days, contact the unit coordinator. Every effort will be made to mark assignments promptly and return them to you.

2006 ASSIGNMENT SUBMISSION DATES

Assignment   Date (to reach the University by)   Topic
1            3rd March                           Distribution theory and Matrix theory
2            17th March                          Transformations and multivariate normal
3            31st March                          Order statistics
4            21st April                          Sufficiency, likelihood ratio
5            12th May                            CRLB
6            26th May                            Generalized likelihood ratio
7            9th June                            Bayesian Inference

Each assignment is worth 20 marks. All questions are of equal value.

Assignment 1

[This assignment covers the work in Distribution Theory, Chapters 1 and 2.]

1. Assuming that the conditional pdf of Y given X = x is

   fY|X=x(y) = 2y/x^2 if 0 < y < x < 1, and 0 otherwise,

   and the pdf of X is fX(x) = 4x^3, 0 < x < 1, find
   (a) the joint pdf of X and Y;
   (b) the marginal pdf of Y;
   (c) P(X > 2Y).

2. Given random variables X and Y, where the pdf of X is fX(x) = e^(-x), x > 0, and Y is discrete. The conditional distribution of Y given X = x is

   fY|X=x(y) = e^(-x) x^y / y!,  y = 0, 1, 2, . . . .

   Show that the marginal distribution of Y is fY(y) = (1/2)^(y+1), y = 0, 1, 2, . . . .

3. If matrix A is idempotent and A + B = I, show that B is idempotent and AB = BA = 0.

4. Given the matrix A below, find the eigenvalues and eigenvectors. Hence find the matrix P such that P'AP is diagonal and has as its diagonal elements the eigenvalues of A.

       ( 1  2  1 )
   A = ( 2  1 -1 )
       ( 1 -1 -2 )

Assignment 2

[This assignment depends on Distribution Theory, Chapters 3 and 4.]

1. Let X denote the proportion of a bulk item stocked by a supplier at the beginning of the day, and let Y denote the proportion of that item sold during the day. Suppose X and Y have joint df f(x, y) = 2, 0 < y < x < 1. Of interest to the supplier is the random variable U = X - Y, which denotes the proportion left at the end of the day.
   (a) Find the df of U.
   (b) Give E(U).
   (c) Interpret your results.

2. Let X ~ P(λx) and Y ~ P(λy), where X and Y are independent.
   (a) Use the change of variable technique to show that X + Y ~ P(λx + λy).
   (b) Verify your result using MGFs.

3. If X ~ Np(µ, Σ), show that Y = CX is distributed N(Cµ, CΣC') where C is a p × p non-singular matrix.

4. X1, . . . , Xp are normal random variables which have zero means and covariance matrix Σ. Show that a necessary and sufficient condition for the independence of the quadratic forms X'BX and X'CX is BΣC = 0.

Assignment 3

[This assignment covers the work in Distribution Theory, Chapters 5 and 6.]

1.
Let X1 , X2 , X3 be a random sample from a distribution with pdf f (x) = 2x, 0 ≤ x ≤ 1. Find (a) the pdf of the smallest of these, Y1 ; (b) the probability that Y1 exceeds the median of the distribution. 2. X1 , X2 , . . . , Xn is a random sample from a continuous distribution with pdf f(x). An additional observation, Xn+1 , say, is taken from this distribution. Find the probability that Xn+1 exceeds the largest of the other n observations. 3. The median is calculated for a sample size of n, where n is an odd integer. (a) Give the distribution function for the sample median. (b) If population from which the sample is drawn is N (µ, σ 2 ), prove that the pdf of the sample median M is symmetric about the vertical axis centered at µ. (c) Deduce from (b) that E(M) = µ 4. (a) Determine the probability distribution function for the largest observation in a random sample from the uniform distribution. (b) A convoy of 10 trucks is to pass through a town with a low level underpass of height 3.8m. If the heights of the loads on each truck are uniformly distributed between cabin top height (3.0m) and the legal upper limit (4.0m), what is the probability that at least one truck will have to turn back? Provide an alternative explanation of your answer to convince a sceptic of the veracity of your solution. xiii Assignment 4 [This assignment covers work in Statistical Inference, Chapter 7.] 1. X1 , X2 , . . . Xn is a random sample from the Bernoulli distribution with parameter π. Use Definition 7.2 to show that n T = X Xi i=1 is sufficient for π. 2. Use the factorization criterion [Theorem 7.1] to determine in each of the cases below, a sufficient one–dimensional statistic based on a random sample of size n from (a) a binomial distribution, f (x; θ) = θ x (1 − θ)1−x , x = 0, 1. (b) a geometric distribution, f (x; θ) = θ(1 − θ)x−1 , x = 1, 2, . . .. (c) a N (θ, 1). (d) the Rayleigh distribution, f (x; θ) = xθ e−x 2 /2θ , x > 0. 3. The following distributions, among others, belong to the exponential family defined in (7.5), Definition 7.4. Identify the terms p(θ), B(θ), h(x), K(x) in each case. (a) Binomial(n, θ), (b) Fisher’s logarithmic series, P (X = x) = −θ x / (x ln(1 − θ)) , 0 < θ < 1, x = 1, 2, . . . , (c) Normal(0, θ), (d) Rayleigh, as in 2(d). 4. Compute the information in a random sample of size n from each of the following populations: (a) N (0, θ) (b) N (θ, 1) (c) Geometric with P(success)= 1/θ, f (x; θ) = xiv 1 θ 1− 1 θ x−1 , x = 1, 2, . . .. Assignment 5 [This assignment covers work in Statistical Inference, Chapter 8.] 1. (a) What multiple of the sample mean X estimates the population mean with minimum mean square error? (b) In particular, if there is a known relationship between µ and σ 2 , say σ 2 = kµ2 , what is this multiple? 2. The sample X1 , . . . , Xn is a randomly chosen from the distribution with 1 f (x; θ) = e−x/θ , x > 0. θ (a) Find the Cramer–Rao lower bound for the variance of an unbiased estimator of θ. (b) Identify the estimator that has this variance. 3. For the Cauchy distribution with location parameter θ, f (x; θ) = 1/π[1 + (x − θ)2 ], −∞ < x < ∞, show that the MVB cannot be attained. 4. For a random sample from a population with df f (y; θ) = (1 + θ)(y + θ)−2 , y > 1, θ > −1, (a) Find the minimum variance bound for an unbiased estimator of θ, and (b) show that the minimum variance of an estimator for log(1 + θ) is independent of θ. xv Assignment 6 [This assignment covers work in Statistical Inference, Chapter 9.] 1. Let X1 , . . . 
, Xn denote a random sample from a distribution with probability function f (x; θ) = θ x (1 − θ)1−x , x = 0, 1. (a) Show that C = {x: against H1 : θ = 13 . P xi ≤ K} is a best critical region for testing H0 : θ = 1 2 (b) Use the central limit theorem to find n and K so that approximately X Xi ≤ K|H0 ) = 0.10 X Xi ≤ K|H1 ) = 0.80. P( and P( 2. Let X1 , . . . , X25 denote a random sample of size 25 from a normal distribution, N (θ, 100). Find a uniformly most powerful critical region of size α = 0.10 for testing H0 : θ = 75 against H1 : θ > 75. 3. Two large batches of logs are offered for sale to a mining company whose concern is to have the diameters as uniform as possible. Batch A is more expensive than batch B, but the extra cost will be offset by the uniformity of product. Thus batch A would be preferred if the standard deviation of diameters in batch A is less than 1/2 the standard deviation of the diameters in batch B. Produce a likelihood ratio test for deciding which batch should be purchased based on the results of samples of the same size from each batch. 4. In a demonstration experiment on Boyle’s law P V = constant the volume V is measured accurately, but pressure measurements P are subject to normally distributed random errors. Two sets of results are obtained, (P1 , V1 ) and (P2 , V2 ). Derive a likelihood ratio test of the validity of Boyle’s law. xvi Assignment 7 [ This assignment covers work in Bayesian Inference, Chapter 10.] 1. For the Normal distribution with prior N (µ0 , σ02 ), obtain the posterior distribution for n observations by considering each observation x1 , . . . , xn sequentially. 2. A random sample of size n is to be taken from a N (µ, σ 2 ) distribution, where σ 2 is known. How large sample must n be to reduce the posterior variance of σ 2 to the fraction σ 2 /k of its original value, where k > 1? 3. Laplace claimed that an event (success) which has occurred n times, and has had no failures, will occur again with probability (n + 1)/(n + 2). Use Bayes’ uniform prior to give grounds for this claim. 4. (a) Find the Jeffreys prior for the parameter α of the Maxwell distribution p(x|α) = q 2 2/πα3/2 x2 e−αx /2 (b) Find a transformation of α for which the corresponding prior is uniform. xvii Part I Distribution Theory 1 Chapter 1 Prerequisites 1.1 Probability concepts assumed known 1. RANDOM VARIABLES. Discrete and continuous. Probability functions and probability density functions. Specification of a distribution by its cumulative distribution function (cdf). Particular distributions: binomial, negative binomial, Poisson, exponential, uniform, normal, (simple) gamma, generalized gamma, beta, chi-square, t, F. 2. MOMENTS AND GENERATING FUNCTIONS. Mean and variance of the common distributions. Moment generating function of the common distributions, Use of moment generating functions and cumulant generating functions. 3. BIVARIATE DISTRIBUTIONS. Correlation and covariance. Marginal and conditional distributions. Independence. 4. MULTIVARIATE DISTRIBUTIONS. Multinomial distribution. Mean and variance of a sum of random variables. Use of mgf to find the distribution of a sum of independent random variables. 5. CHANGE OF VARIABLE TECHNIQUE. In the univariate case, given the probability distribution of X and a function g, we find the distribution of Y defined by Y = g(X). 1 1.2 Assumed knowledge of matrices and vector spaces 1. Use of terms singular, diagonal, unit, null, symmetric. 2. Operations of addition, subtraction, multiplication, inverse and transpose. 
[We will use A0 for the transpose of A.] 3. (a) (AB)0 = B 0 A0 , (b) (AB)−1 = B −1 A−1 (c) (A−1 )0 = (A0 )−1 . 4. The trace of a matrix A, written tr(A), is defined as the sum of the diagonal elements of A. That is, tr(A) = X aii . i (a) tr(A ± B) =tr(A)±tr(B), (b) tr(AB) =tr(BA). 5. Linear Independence and Rank P (a) Let x1 , . . . , xn be a set of vectors and c1 , . . . , cn be scalar constants. If i ci xi = 0 only if c1 = c2 = . . . = cr = 0, the the set of vectors is linearly independent. (b) The rank of a set of vectors is the maximum number of linearly independent vectors in the set. (c) For a square matrix A, the rank of A, denoted by r(A), is the maximum order of non–zero subdeterminants. (d) r(AA0 ) =r(A0 A) =r(A) =r(A0 ), 6. Quadratic Forms For a p-vector x, where x0 = (x1 , . . . , xp ), and a square p × p matrix A, x0 Ax = p X aij xi xj i,j=1 is a quadratic form in x1 , . . . , xn . The matrix A and the quadratic form are called: (a) positive semidefinite if x0 Ax ≥ 0 for all x and x0 Ax = 0 for some x 6= 0. (b) positive definite if x0 Ax > 0 for all x 6= 0. 2 i. A necessary and sufficient condition for A to be positive definite is that each leading diagonal sub–determinant is greater than 0. So a positive definite matrix is non–singular. ii. A necessary and sufficient condition for a symmetric matrix A to be positive definite is that there exists a non–singular matrix P such that A = P P 0 . 7. Orthogonality. A matrix P is said to be orthogonal if P P 0 = I (or P 0 P = I). (a) An orthogonal matrix is non–singular. (b) The determinant of an orthogonal matrix is ±1. (c) The transpose of an orthogonal matrix is also orthogonal. (d) The product of two orthogonal matrices is orthogonal. (e) If P is orthogonal, tr(P 0 AP ) =tr(AP P 0 ) =tr(A). (f) If P is orthogonal, r(P 0 AP ) =r(A). 8. Eigenvalues and eigenvectors. Eigenvalues of a square matrix A are defined as the roots of the equation |A − λI| = 0. The corresponding x satisfying x0 (A − λI) = 0 are the eigenvectors. (a) The eigenvectors corresponding to two different eigenvalues are orthogonal. (b) The number of non–zero eigenvalues of a square matrix A is equal to the rank of A. 9. Reduction to diagonal form (a) Given any symmetric p×p matrix A there exists an orthogonal matrix P such that P 0 AP = Λ where Λ is a diagonal matrix whose elements are the eigenvalues of A. We write P 0 AP = diag(λ1 , . . . , λp ). i. If A is not of full rank, some of the λi will be zero. ii. If A is positive definite (and therefore non–singular), all the λi will be greater than zero. iii. The eigenvectors of A form the columns of matrix P . (b) If A is symmetric of rank r and P is orthogonal such that P 0 AP = Λ, then P i. tr(A) = ri=1 λi since tr(A) =tr(P 0 AP ) =tr(Λ). P ii. tr(As ) = ri=1 λsi . 3 (c) For every quadratic form Q = x0 Ax there exists an orthogonal transformation x = P y which reduces Q to a diagonal quadratic form so that Q = λ1 y12 + λ2 y22 + . . . + λr yr2 where r is the rank of A. 10. Idempotent Matrices. A matrix A is said to be idempotent if A2 = A. In the following we shall mean symmetric idempotent matrices. Some properties are: (a) If A is idempotent and non–singular then A = I. To prove this, note that AA = A and pre–multiply both sides by A−1 . (b) The eigenvalues of an idempotent matrix are either 1 or 0. (c) If A is idempotent of rank r, there exists an orthogonal matrix P such that P 0 AP = Er where Er is a diagonal matrix with the first r leading diagonal elements 1 and the remainder 0. 
(d) If A is idempotent of rank r then tr(A) = r. To prove this, note that there is an orthogonal matrix P such that P 0 AP = Er . Now tr(P 0 AP ) =tr(A) =tr(Er ) = r. (e) If the ith diagonal element of A is zero, all elements in the ith row and column are zero. (f) All idempotent matrices not of full rank are positive semi–definite. No idempotent matrix can have negative elements on its diagonal. 1.2.1 Worked Example A student had a query about idempotent matrices, esp 10 part (b), ”The eigenvalues of an idempotent matrix are either 1 or 0”. How can this be shown? Answer The eigenvalues λ of A are given by |A − λI| = 0. For square matrices X and Y , |XY | = |X||Y |. 4 The original equation to give the eigenvalues of A is Ax = λx Since A is idempotent, we obtain by premultiplying by A, AAx = Ax = λAx to give |A − λA| = |A(I − λI)| = |A||I − λI| = 0 so the eigenvalues are 1, unless A is singular, in which case some of the eigenvalues will be zero, since A would then not be of full rank. Examples • If A = 1 0 0 0 ! then A is idempotent and the eigenvalues are 1 and 0. • If A = 1 0 0 1 ! then A is idempotent and the eigenvalues are 1 and 1. • If A = 0 1 1 0 ! then A is NOT idempotent, with eigenvalues ±1. 5 6 Chapter 2 Preliminaries 2.1 Introduction We shall refresh some basic notions to get focused. Statistics is the science (or art) of interpreting data when there are random events operating in conjunction with systematic events. Mostly there is a pattern to the randomness which allows us to make sense of the observations. Distribution functions and their derivatives called density functions or probability functions are mathematical ways of describing the randomness. In this course, we shall be mostly studying parametric distributions where the mathematical description of randomness is in terms of parameters because we can assume we know the general form of the distribution. There is another topic in statistics called nonparametric statistics where the distribution function is parameter free. Order Statistics are one example of non-parametric statistics. 2.2 Indicator Functions A class of functions known as indicator functions is useful in statistics. Definition 2.1 Suppose Ω is a set with typical element ω, and let A be a subset of Ω. The indicator function of A, denoted by IA (·), is defined by IA (w) = ( 1 0 if ω ∈ A if ω ∈ / A. (2.1) That is, IA (·) indicates the set A. Some properties of the indicator function are listed. (a) IA (ω) = 1 − IĀ (ω) where Ā is the complement of A. (b) IA2 (ω) = IA (ω) 7 (c) IA∩B (ω) = IA (w).IB (ω) (d) IA∪B (ω) = IA (ω) + IB (ω) − IA∩B (ω) (e) IA1 ∪A2 ∪...∪An (ω) = max{IA1 (ω), . . . , IAn (ω)} The following example shows a use for indicator functions. Example 2.1 Suppose random variable X has pdf given by We can write f (x) as 0, x < −1 1 + x , −1 ≤ x < 0 f (x) = 1−x , 0≤x<1 0, x≥1 f (x) = (1 + x)I[−1,0) (x) + (1 − x)I[0,1) (x), or more concisely f (x) = (1 − |x|)I[−1,1] (x) 2.3 Distribution Functions (cdf ’s) The density or probability function is an idealised pattern which would be a reasonable approximation to represent the frequency of the data; the slight imperfections can be disregarded. If we can accept that approximation, we can reduce the data and understand it. To use the density or probability function, we usually have to integrate (or sum if it is discrete). The distribution function arises as the integral or sum. Whether we refer to the distribution or density (probability) function, we are still referring to the same information. 
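As a small computational aside (an R sketch added to these notes, not taken from CB or HC), the density of Example 2.1 can be coded directly with an indicator function and its distribution function obtained by numerical integration; the function names below are my own.

# Density of Example 2.1 written with an indicator function:
# f(x) = (1 - |x|) I_[-1,1](x)
f <- function(x) (1 - abs(x)) * (x >= -1 & x <= 1)

# cdf F(x) obtained by integrating the density up to x
Fcdf <- function(x) sapply(x, function(u) integrate(f, lower = -1, upper = u)$value)

Fcdf(c(-1, 0, 1))   # approximately 0, 0.5, 1

This is only a numerical check; the exact cdf is easily written down by integrating (1 - |x|) piecewise.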
Read CB page 29–37 or HC 1.7 where distribution functions for the univariate case are considered. Example 2.2 Given random variables X and Y which are identically distributed and independent (iid), with pdf f (x), x > 0, find P (Y > X). Consider one particular value of Y , say y ∗ . Then the probability that this value is greater than any X is written mathematically as ∗ ∗ P (y > X) = P (X < y ) = Z y∗ 0 f (x)dx . Now to generalize for all Y , we need to take into account the frequency of y ∗ and that information is contained in the density f (y). We integrate the above probability over f (y). 8 The joint pdf of X and Y , fX,Y (x, y), can be written f (x)f (y) so P (Y > X) = = = = 2.4 Z ∞Z y Z0 ∞ Z0 ∞ Z0 ∞ 0 = " = 1 2 0 f (x)f (y) dx dy or f (y) Z y 0 Z 0 ∞Z ∞ x f (x)f (y) dy dx f (x) dx dy f (y)F (y) dy F (y) dF (y) {F (y)}2 2 #∞ 0 Bivariate and Conditional Distributions (CB chapter 4 or HC chapter 2) Rather than use f, g, h, f1 , f2 , etc as function names for pdf’s, we will almost always use f , and if there is more than one random variable in the problem we will use a subscript to indicate the name of the variable whose pdf we are identifying. For example, we may say that the pdf of X is fX (x) = α e−αx , x > 0. Of course, the x could be replaced by any other letter. It is the fX that determines the function, not the (·). A similar notation is used for cumulative distribution functions. In the case of a conditional pdf, we will use, for example, fX|Y =y (x) for the conditional pdf of X given Y = y. An alternative notation is f (x|y). Read CB 4.2 or HC 2.2 where most of the ideas should be familiar to you. The two variables of a bivariate density fX,Y are correlated so the outcome due to one is influenced by the other. The conditional density fY |X allows us to make statements about Y if we have information on X. Recall that when we integrate out the terms in X (or average over fX ), to get a density in Y only (ie fY ), we call that the marginal density of Y . Definition 2.2 The conditional density function of Y given X = x is defined to be fY |X=x (y) = fX,Y (x, y) for fX (x) > 0 fX (x) and is undefined elsewhere. 9 (2.2) Comments 1. In fY |X=x (y), x is fixed and should be considered as a parameter. 2. fX,Y (x, y) is a surface above the xy-plane. A plane perpendicular to the xy-plane on the line x = x0 will intersect the surface in the curve fX,Y (x0 , y). The area under R∞ this curve is then given by −∞ fX,Y (x0 , y) dy = fX (x0 ). So dividing fX,Y (x0 , y) by fX (x0 ) we obtain a pdf which is fY |X=x0 (y). 3. The definition given can be considered an extension of the concept of a conditional probability. The conditional distribution is a normalised slice from the joint distribution function, since Z fY |X=x (y)dy = R fX,Y (x, y)dy fX (x) = =1 fX (x) fX (x) as required. Thus the marginal density fX (x) is the normalising function. Definition 2.3 If X and Y are jointly continuous, then the conditional distribution function of Y given X = x is defined as FY |X=x (y) = P (Y ≤ y|X = x) = Z y −∞ fY |X=x (y) dy (2.3) for all x such that fX (x) > 0. 2.4.1 Conditional Mean and Variance Note in CB p150–152 and the latter part of HC 2.2 how to find the conditional mean and conditional variance. The first job is to find the conditional density. 
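The identities stated next can also be checked by simulation. As a preview, here is a minimal Monte Carlo sketch in R (added to the notes; it uses the density f(x, y) = 2, 0 < x < y < 1 from the Exercise at the end of this subsection, for which the marginal of X is fX(x) = 2(1 - x) and the conditional of Y given X = x is uniform on (x, 1)).

set.seed(354)
n <- 1e5
x <- 1 - sqrt(runif(n))          # draws from the marginal fX(x) = 2(1 - x), by inversion
y <- runif(n, min = x, max = 1)  # Y | X = x is uniform on (x, 1) for this density

mean(y)                          # estimates E(Y); should be close to 2/3
mean((1 + x) / 2)                # estimates E{E(Y|X)}, since E(Y|X = x) = (1 + x)/2

Both averages settle near 2/3, in line with E{E(Y|X)} = E(Y).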
Important results are: E{E(Y |X)} = E(Y ) var{E(Y |X)} ≤ var(Y ) var{Y } = E{var(Y |X)} + var{E(Y |X)} (2.4) (2.5) (2.6) The proof of 2.6 follows from basic definitions:h E{var(Y |X)} = E E(Y 2 |X) − {E(Y |X)}2 h i h i = E E(Y 2 |X) − E {E(Y |X)}2 h i = E(Y 2 ) − E {E(Y |X)}2 + 10 i [E(Y )]2 − [E(Y )]2 | {z } common trick in statistics = h i h i E(Y 2 ) − {E(Y )}2 − E {E(Y |X)}2 + [E{E(Y |X)}]2 = var(Y ) − var{E(Y |X)} . | {z using 2.4 } Therefore, var(Y ) = E[var(Y |X)] + var[E(Y |X)] . Note To be precise, the statement of the result for the mean E{E(Y |X)} = E(Y ) should read Ex {Ey (Y |X)} = Ey (Y ) Proof : Now Ey (Y |X) = and so Ex {Ey (Y |X)} = = Z Z Z Z yf (y|x)dy Ey (Y |X)fX (x)dx = yf (x, y)dydx = Z y Z Z Z f (x, y)dxdy = ! f (x, y) y dy fX (x)dx fX (x) Z yfY (y)dy = Ey (Y ) As part of the proof of the formula for conditional variance, the result Ex [Ey (Y 2 |X)] = Ey (Y 2 ) was invoked. This can be verified easily by replacing y by y 2 in the integrand of the derivation for the conditional mean. In fact the general result Ex {Ey (g(Y |X)} = Ey [g(Y )] can be so derived. Exercise For the density f (x, y) = 2, 0 < x < y < 1 verify that Ex {Ey (Y |X)} = Ey (Y ) = 2/3 empirically. 11 2.5 Stochastic Independence Read CB p152 and HC 2.4 up to the end of Example 4 and note the definition of stochastically independent random variables (HC Definition 2). The word stochastic is often omitted. The case of mutual independence for more than 2 variables is summarized below. Definition 2.5 gives an alternative criterion in terms of CDF’s. Definition 2.4 Let (X1 , X2 , . . . , Xn ) be an n-dimensional continuous random vector with joint pdf fX1 , ..., Xn (x1 , . . . , xn ) and range space Rn . Then X1 , X2 , . . . , Xn are defined to be stochastically independent if and only if fX1 , ..., Xn (x1 , . . . , xn ) = fX1 (x1 ) . . . fXn (xn ) (2.7) for all (x1 , . . . , xn ) ∈ Rn . Definition 2.5 Let (X1 , X2 , . . . , Xn ) be an n-dimensional random vector with joint cdf FX1 , ..., Xn (x1 , . . . , xn ). Then X1 , X2 , . . . , Xn are defined to be stochastically independent if and only if FX1 , ..., Xn (x1 , . . . , xn ) = FX1 (x1 ) . . . FXn (xn ) (2.8) for all xi . Comments HC’s Theorem 1 on page 102 says, in effect, 1. If the joint pdf of X1 , . . . , Xn factorizes into g1 (x1 ) . . . gn (xn ), where gi (xi ) is a function of xi alone (including the range space), i = 1, 2, . . . , n, then X1 , X2 , . . . Xn are mutually stochastically independent. It is not assumed that gi (xi ) is the marginal pdf of Xi . 2. Similarly, if the joint cdf of X1 , . . . , Xn factorizes into G1 (x1 ) . . . Gn (xn ) where Gi (xi ) is a function of xi alone, then X1 , . . . , Xn are mutually stochastically independent. 2.6 Moment Generating Functions (mgf ) Moments are defined as µ0r = E(Y r ) and central moments about µ as µr = E(Y − µ)r for r = 1, 2 . . .. These are entities by which we start reducing data. µ1 ≡ mean µ2 ≡ variance 12 µ5 µ6 .. . µ3 ≡ skewness µ4 ≡ kurtosis no special names Often µ1 and µ2 are enough to summarize the data. However, fourth moments and their counterparts, cumulants, are needed to find the variance of a variance. Moment generating functions give us a way of determining the formula for a particular moment. But they are more versatile than that, see below. It will be recalled that in the univariate case, random variable X has mgf defined by MX (t) = E(eXt ) for values of t for which the series or the integral converges. 2.6.1 Multivariate mgfs First we revise the concept of a bivariate mgf. 
Bivariate mgfs For a bivariate distribution, the rsth moment about the origin is defined as E(X1r X2s ) = Z Z xr1 xs2 f (x1 , x2 )dx1 dx2 = µ0rs Thus µ010 = µx1 = E(X1 ), µ001 = µx2 = E(X2 ), and µ011 = E(X1 X2 ). For central moments about the mean, r s µrs = E[(X1 − µx1 ) (X2 − µx2 ) ] = Z Z (x1 − µx1 )r (x2 − µx2 )s f (x1 , x2 )dx1 dx2 Now we find that µ20 = σx21 , µ02 = σx22 and µ11 = cov(X1 , X2 ). The bivariate MGF is defined as M (X1 , X2 )(t1 , t2 ) = E et1 X1 +t2 X2 = = Z Z Z Z et1 X1 +t2 X2 f (x1 , x2 )dx1 dx2 1 + (t1 x1 + t2 x2 ) + (t1 x1 + t2 x2 )2 /2! + . . . f (x1 , x2 )dx1 dx2 = 1 + µ010 t1 + µ001 t2 + µ011 t1 t2 + . . . Theorem If X1 and X2 are independent, MX1 ,X2 (t1 , t2 ) = MX1 (t1 ) × MX2 (t2 ) 13 Proof M (X1 , X2 )(t1 , t2 ) = E et1 X1 +t2 X2 = = Z Z Z Z et1 X1 +t2 X2 f (x1 , x2 )dx1 dx2 et1 X1 +t2 X2 fX1 (x1 )fX2 (x2 )dx1 dx2 since X1 and X2 are independent. Now M (X1 , X2 )(t1 , t2 ) = = MX1 (t1 ) Z Z e t 2 x2 Z et1 x1 fX1 (x1 )dx1 fX2 (x2 )dx2 et2 x2 fX2 (x2 )dx2 = MX1 (t1 )MX2 (t2 ) Example If X1 ∼ B(1, π) and X2 ∼ B(1, π), what is the distribution of X1 + X2 if the two variables are independent? Answer Now MX1 (t1 ) = πet1 + 1 − π and MX2 (t2 ) = πet2 + 1 − π giving M (X1 , X2 )(t1 , t2 ) = MX1 (t1 )MX2 (t2 ) = πet1 + 1 − π πet2 + 1 − π Since the proportions are the same then t1 = t2 = t and so M (X1 , X2 )(t, t) = E etX1 +tX2 = E e(X1 +X2 )t = MX1 +X2 (t) Finally MX1 +X2 (t) = πet + 1 − π which means that X1 + X2 ∼ B(2, π) as expected. 14 2 Multivariate mgfs We will now consider the mgf for a random vector X0 = (X1 , X2 , . . . , Xp ). The moment generating function of X is defined by MX (t1 , . . . , tp ) = E(eX1 t1 +...+Xp tp ) 0 = E(eX t ) 0 (2.9) 0 where t0 = (t1 , t2 , . . . , tp ). Of course E(eX t ) could be written E(et X ). Read HC 2.4 from Theorem 4 to the end. Note in particular how multivariate mgf’s can be used to find moments (including product moments), to find marginal distributions of one or more variables, and to prove independence. These are summarized below. 1. ∂ s1 +s2 MX,Y (t1 , t2 ) = E(X s1 Y s2 ). s1 s2 ∂t1 ∂t2 t1 =t2 =0 (2.10) The obvious extension can be made to the case of p (> 2) variables. 2. The marginal distributions for subsets of the p components have mgf’s obtained by setting equal to zero those ti ’s that correspond to the variables not in the subset. For example, if X0 = (X1 , X2 , X3 , X4 ) has mgf MX (t1 , t2 , t3 , t4 ), then MX2 ,X3 (t2 , t3 ) = MX (0, t2 , t3 , 0). 3. If the random variables X1 , X2 , . . . , Xp are independent, then MX (t) = MX1 (t1 )MX2 (t2 ) . . . MXp (tp ) and the converse is also true. 2.7 Multinomial Distribution (CB p181 and HC p121) Recall that the binomial distribution arises when we observe X, the number of successes in n independent Bernoulli trials (experiments with only 2 possible outcomes, success and failure). The multinomial distribution arises when each trial has k possible outcomes. We say that the random vector (X1 , X2 , . . . , Xk−1 ) has a k- nomial distribution if the joint probability function of X1 , . . . , Xk−1 is P (X1 = x1 , . . . , Xk−1 = xk−1 ) = Pk−1 where xk = n − probability function. i=1 xi , Pk i=1 n! px1 px2 . . . pxk k x1 ! . . . x k ! 1 2 (2.11) pi = 1. Note that, if k = 2, this reduces to the binomial 15 Now the joint mgf of the k-nomial distribution is MX1 , ..., Xk−1 (t1 , . . . , tk−1 ) = E(eX1 t1 +···+Xk−1 tk−1 ) = (p1 et1 + · · · + pk−1 etk−1 + pk )n . 
(2.12) To show this, multiply the RHS of (2.11) by ex1 t1 +···+xk−1 tk−1 and sum over all (k −1)-tuples, (x1 , . . . , xk−1 ). [HC deals with this for k = 3 on page 122.] Comments 1. When k = 2, (2.12) agrees with the familiar form of the mgf of a binomial (n, p) distribution. 2. Note that the marginal mgf of any Xi (obtained by putting the other ti equal to 0) is the familiar mgf of the binomial distribution. 16 Chapter 3 Transformations 3.1 Introduction We frequently have the type of problem where we have a random variable X with known distribution and a function g and wish to find the distribution of the random variable Y = g(X). There are essentially 3 methods for finding the distribution of Y and these are summarized briefly as follows. 1. Method of Distribution Functions Let FY (y) denote the cdf of Y . Then FY (y) = = = = P (Y ≤ y) P (g(X) ≤ y) P (X ≤ g −1 (y)) FX (g −1 (y)) where FX is the cdf of random variable X. 2. Method of Transformations In the case of a continuous random variable X with pdf fX (x), x ∈ RX , and g a strictly increasing or strictly decreasing function for x ∈ RX , the random variable Y has pdf given by dx fY (y) = fX (x) (3.1) dy where the RHS is expressed as a function of y. √ For example, if f (x) = αe−αX and y = x2 , write fX (x) = αe−α y . The Jacobian keeps track of the scale change in going from x to y. A modification of the procedure enables us to deal with the situation where g is piecewise monotone. 17 3. Method of Moment Generating Functions This method is based on the uniqueness theorem, which states that if two mgf’s are identical, the two random variables with those mgf’s possess the same probability distribution. So we would need to find the mgf of Y and compare it with the mgf’s for the common distributions. If it is identical to some well-known mgf, the probability distribution of Y will be identified. The problem above was dealt with in a section called Change of Variable in the Statistics unit STAT260. The new work in this chapter concerns what may be called bivariate transformations. That is, we begin with the joint distribution of 2 random variables, X1 and X2 say, and two functions, g and h, and wish to find the joint distribution of the random variables Y1 = g(X1 , X2 ) and Y2 = h(X1 , X2 ). The marginal distribution of one or both of Y1 and Y2 can then be found. We may wish to do this if we changed coordinates from Cartesian (X1 , X2 ) to polar coordinates (Y1 , Y2 ). This can, of course, be extended to multivariable transformations. Before leaving this section, the following example should help you recall the technique. Example 3.1 We are given that Z ∼ N (0, 1) and wish to find the distribution of Y = Z 2 . Method 1 If GY (y) is the cdf of Y , then √ √ GY (y) = P (Y ≤ y) = P (Z 2 < y) = P [− y < Z < y] √ √ = Φ( y) − Φ(− y) where Φ is the standard normal integral. Differentiating wrt y, we get (φ = Φ0 ) √ √ G0Y (y) = gY (y) = φ( y)y −1/2 − φ(− y)(−1)y −1/2 h 1 √ 2 √ 2i √ 1 e−y/2 y 1/2−1 1 √ = y −1/2 e− 2 ( y) + e− 2 (− y) / 2π = 2 2π ie, χ21 as expected. Method 2 Now Y = Z 2 where 1 2 fZ (z) = √ e−z /2 2π 18 The transformation is not 1:1, so q q √ √ P (α < Y < β) = P ( α < Z < β) + P (− α < Z < − β) q √ = 2P ( α < Z < β) Thus fY (y) = 2fZ (Z = where J = √ d y dy √ y)|J| = y −1/2 /2. So 1 1 fy (y) = 2 √ e−y/2 y −1/2 = 2 2π √ 2e−y/2 y −1/2 Γ(1/2) So Y ∼ χ21 . 
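Before Method 3, a quick simulation check may be reassuring. This is an R sketch added to the notes (not one of the original methods); it simply compares empirical quantiles of Z^2 with chi-square(1) quantiles.

set.seed(354)
z <- rnorm(1e5)     # Z ~ N(0, 1)
y <- z^2            # Y = Z^2

probs <- c(0.5, 0.9, 0.95, 0.99)
round(rbind(empirical = quantile(y, probs),
            chisq_1   = qchisq(probs, df = 1)), 3)

The two rows agree closely, as the analytic argument says they must.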
Method 3 The MGF of Y is 1 Z tz 2 −z 2 /2 e e dz MY (t) = EetY = √ 2π Z Z 1 −z 2 /2−tz 2 1 2 = √ e e−z (1−2t)/2 dz dz = √ 2π 2π Putting W = (1 − 2t)1/2 Z gives 1 MY (t) = √ 2π " χ21 1 1 √ = 1/2 (1 − 2t) 2π Z Z e−w e −w 2 /2 2 /2 dw (1 − 2t)1/2 dw # =1 = 1 (1 − 2t)1/2 as expected. The problem with the MGF approach is that you have to be ie Y ∼ able to recognize the distribution from the form of the MGF. Exercise Obtain the density function for the log–normal distribution, which is simply the log of a normal distribution. If Y = ln X and X ∼ N (µ, σ 2 ) find the distribution function for Y . Example 3.2 Now suppose random variable X is distributed N (µ, σ 2 ), and random variable Y is defined by Y = X 2 /σ 2 , find the distribution of Y . 19 Method 1. Let GY (y) be the cdf of Y . Then GY (y) = P (Y ≤ y) X2 = P ( 2 ≤ y) σ = P (X 2 ≤ σ 2 y) √ √ = P (−σ y ≤ X ≤ σ y) ! √ √ −σ y − µ σ y−µ = P ≤Z≤ σ σ µ µ √ √ = Φ( y − ) − Φ(− y − ). σ σ The pdf of Y, gY (y) is obtained by differentiating GY (y) wrt y. µ 1 1 µ 1 1 √ √ gY (y) = φ( y − ). y − 2 + φ(− y − ) y − 2 σ 2 σ 2 1 √ µ 2 1 √ µ 2 e− 2 ( y+ σ ) 1 − 1 e− 2 ( y− σ ) √ √ + y 2 = 2 2π 2π y = 1 1 1 e− 2 y 2 −1 2 2 Γ( 12 ) e −µ2 /2σ 2 1 (eµy 2 /σ + e−µy 2 /σ ) 2 where y ∈ [0, ∞). Note that the first part of the RHS is the pdf of a chi-square random variable with 1 df. In fact Y is said to have a non-central χ2 distribution with 1 df and non-centrality parameter µ2 /σ 2 . [This will be dealt with further in Chapter 5.] Method 2. Noting that y = x2 /σ 2 is strictly decreasing for x ∈ (−∞, 0] and strictly increasing for x ∈ (0, ∞), we use a modification of (3.1). 1 1 For x ∈ (−∞, 0] we have x = −σy 2 and |dx/dy| = 12 σy − 2 . So −1 1 1 2 2 1 fY∗ (y) = √ e− 2 (−σy 2 −µ) /σ . 12 σy − 2 , 2πσ 1 replacing x in the N (µ, σ 2 ) pdf by −σy 2 . 1 1 For x ∈ (0, ∞) we have x = +σy 2 and |dx/dy| = 21 σy − 2 . So 1 1 1 2 2 1 fY∗∗ (y) = √ e− 2 (σy 2 −µ) /σ . 12 σy − 2 . 2πσ 20 The pdf of Y is the sum of fY∗ (y) and fY∗∗ (y) which simplifies to (3.2). Method 3. MY (t) = E(etY ) = Z ∞ −∞ e tx2 /σ 2 1 = σ −1 (2π)− 2 = σ −1 (2π) = (1 − 2t) − 21 − 21 ( ) 1 (x − µ)2 × (2π) σ exp − dx 2 σ2 ) ( Z ∞ ( 12 − t)x2 − µx + 21 µ2 dx exp − σ2 −∞ − 21 −1 1 µ2 /σ 4 − 4( 1 − t) 2 µ2 /σ 4 1 1 2 π σ( − t)− 2 × exp 2 4( 12 − t)/σ 2 1 2 ( µ2 t × exp 2 σ (1 − 2t) ) , which is the M.G.F. of a non-central χ2 distribution (see Continuous Distributions by Johnson and Kotz, chapter 28.) The integral is a standard result obtained by completing R ∞ −u2 √ the square in the exponent and using the result that −∞ e du = π giving Z 3.1.1 ∞ ∞ e −(ax2 +bx+c) dx = r π (b2 −4ac)/4a e . a The Probability Integral Transform The transformation which produces the cdf for a random variable is of particular interest. This transformation (the probability integral transform) is defined by F (x) = Z x −∞ f (t)dt = P (X ≤ x) The new variable Y is given by Y = F (X), and has the property of being uniform on (0,1), ie, Y ∼ U (0, 1). Thus we are required to prove that fY (y) = 1, 0 < y < 1. Proof Now Y = φ(X) = F (X) and X = F −1 (Y ) = ψ(Y ). The pdf for Y is then fY (y) = fX [x = ψ(y)]|J| = 21 dψ(y) fX [ψ(y)] dy but y = F (x) = F [ψ(y)] and so fY (y) = since dψ(y) f [ψ(y)] dF [ψ(y)] = f [ψ(y)] 1 =1 f [ψ(y)] dF ψ(y) = f [ψ(y)] dψ(y) Exercises 1. Determine the probability integral transform for • the general uniform distribution U (a, b), and • the pdf f (x) = 2x, 0 < x < 1 2. Verify that the transformed distribution is U (0, 1) in each case. 3. 
What is the connection between the probability integral transform and generating pseudo–random numbers on a computer? 3.2 Bivariate Transformations The discrete case will be used as a bridge to the continuous two dimensional transformation of variables. One dimensional case Assuming that we have a change of variable, say from X to Y for a discrete pf, then the original variable space is A and the transformed variable space is B. The transformation is Y = φ(X) with backtransform X = ψ(Y ). The pf in B is then py (Y = y) = px [X = ψ(y)] = p[ψ(y)] Example If X is P (λ) and Y = 2X what is the pf of Y ? P (X = x) = e−λ λx , x = 0, 1, 2, . . . x! 22 Using the MGF MY (t) = e0 MX (2t) = e−λ(e 2t −1) but can we recognise the distribution? Using the change of variable, Y = 2X = φ(X) and X = Y /2 = ψ(Y ). So py (Y = y) = e−λ λy/2 , y = 0, 2, 4, . . . (y/2)! Two dimensional case The original variable space (X, Y ) is denoted by A while the transformed space (U, V ) is denoted by B. We use the notation U = φ1 (X, Y ), V = φ2 (X, Y ) and X = ψ1 (U, V ), Y = ψ2 (U, V ) The pf in transformed space B, is then pU,V (u, v) = pX,Y [x = ψ1 (u, v), y = ψ2 (u, v)] Example If X ∼ B(1, π) and Y ∼ B(1, π), what is the distribution of X + Y if the two variables are independent? The original pf is pX,Y (x, y) = pX (x)pY (y) = π x (1 − π)1−x × π y (1 − π)1−y , x, y = 0, 1 Now U = φ1 (X, Y ) = X + Y, V = φ2 (X, Y ) = Y and X = ψ1 (U, V ) = U − V, Y = ψ2 (U, V ) = V The spaces A and B are shown in Figure 3.1. The joint pf of U and V is pU,V (u, v) = π u−v (1 − π)1−(u−v) × π v (1 − π)1−v = π u (1 − π)2−u , (u, v) B 23 Region B 0.0 0.0 0.5 v 0.4 y 1.0 0.8 Region A 0.0 0.4 0.8 0.0 0.5 x 1.0 1.5 2.0 u Figure 3.1: The spaces A and B. We now sum over v to get the marginal distribution for U , ie, pU (u). pU (u) = X pU,V (u, v) v = π u (1 − π)2−u , u = 0 = 2 · π u (1 − π)2−u , u = 1 = π u (1 − π)2−u , u = 2 Thus pU (u) = 2 u π u (1 − π)2−u , u = 0, 1, 2 and so, U ∼ B(2, π) in line with the MGF solution. Continuous Variables Both univariate and bivariate transformations of the discrete type are covered in CB p47, 156 and HC 4.2, whereas transformations for continuous variables are covered in 4.3. The main result here, which is the two-dimensional extension of (3.1), can be stated as follows. For (X, Y ) continuous with joint pdf fX,Y (x, y), (x, y) ∈ A, and defining U = g(X, Y ), V = h(X, Y ), the joint pdf of U and V , fU,V (u, v) is given by fU,V (u, v) = fX,Y (x, y).abs|J| 24 (3.2) providing the inverse transformation x = G(u, v) y = H(u, v) ) is one-to-one. Here abs|J| refers to the absolute value of the Jacobian, ∂x/∂u ∂x/∂v . ∂y/∂u ∂y/∂v The RHS of (2.3) has to be expressed in terms of u and v, and could be written more precisely as fX,Y (G(u, v), H(u, v))abs|J|. The diagonal elements of J account for scale change and the off-diagonal elements account for rotations. Comments 1. In examples, it is essential to draw diagrams showing (i) the range space of (X, Y ), and (ii) the region this maps into under the transformation. 2. The distribution of the new random variables U and V is not complete unless the range space is specified. 3. Frequently we use this technique to find the distribution of some function of random variables, e.g., X/Y . That is, we are not mainly interested in the joint distribution of U = g(X, Y ) and V = h(X, Y ), but in the marginal distribution of one of them. These points will be illustrated by the following examples. 
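As a numerical companion to the worked examples below (an R sketch added here, not part of the original notes), the conclusion of the first two examples, that for X, Y iid N(0, 1) the variables U = X + Y and V = X - Y are each N(0, 2) and uncorrelated, can be checked directly by simulation.

set.seed(354)
x <- rnorm(1e5)
y <- rnorm(1e5)
u <- x + y          # U = X + Y
v <- x - y          # V = X - Y

c(var(u), var(v))   # both close to 2
cor(u, v)           # close to 0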
Note that the full version of the joint density is now fU,V (u, v) = fX,Y [x = G(u, v), y = H(u, v)] · abs|J| in line with the two dimensional discrete case. Worked Examples Three simple examples are presented, as a ’lead–in’ to the later problems. 1. The distribution of the sum of two unit normals. If X ∼ N (0, 1) and Y ∼ N (0, 1), what is the distribution of X + Y if X and Y are independent, using the change of variable method. (We can state the answer already?) 25 The joint distribution function is fX,Y (x, y) = 1 −x2 /2 −y2 /2 e e 2π The transformation is U = X + Y = φ1 (X, Y ) = g(X, Y ), V = Y = φ2 (X, Y ) = h(X, Y ) while the inverse is X = U − V = ψ1 (U, V ) = G(U, V ), Y = V = ψ2 (U, V ) = H(U, V ) in line with the notation in the Notes. The spaces A and B both contain the entire variable space in each case. (Verify!) The Jacobian J is ∂ψ1 /∂u ∂ψ1 /∂v J= ∂ψ2 /∂u ∂ψ2 /∂v 1 −1 = =1 0 1 Now the joint distribution function for U and V is now fU,V (u, v) = 1 −(U −V )2 /2 −V 2 /2 e e · abs|J| 2π 1 −(U 2 /4)−(V −U/2)2 1 −(U 2 +V 2 −2U V +V 2 )/2 e = e 2π 2π by completing the square. The joint function can now be written as = 1 1 −U 2 /4 1 −(V −U/2)2 √ e √ e 2π 2 1/ 2 which is the product of two normals, U ∼ N (0, 2) and a N (0, 1/2). Thus as expected. X + Y ∼ N (0, 2) 2. The distribution of the sum, and of the difference of two unit normals. This problem is similar to the previous, since we now have the transformation U = X + Y = φ1 (X, Y ) = g(X, Y ), V = X − Y = φ2 (X, Y ) = h(X, Y ) while the inverse is X = (U + V )/2 = ψ1 (U, V ) = G(U, V ), Y = (U − V )/2 = ψ2 (U, V ) = H(U, V ) 26 in line with the notation in the notes. The spaces A and B are as for the first example. The Jacobian J is ∂ψ1 /∂u ∂ψ1 /∂v J= ∂ψ2 /∂u ∂ψ2 /∂v The joint density is now fU,V (u, v) = 1/2 1/2 = 1/2 −1/2 = −1/2 1 −( U +V )2 /2 −( U −V )2 /2 1 e 2 e 2 · 2π 2 Minus eight times the exponent is thus U 2 + V 2 + 2U V + U 2 + V 2 − 2U V = 2(U 2 + V 2 ) to give 1 − 2(U 2 +V 2 ) /2 1 4 e · 2π 2 2 +V 2 ) 2) 2) (U (U (V 1 1 − 1 − 1 /2 1 2 e e 2 /2 e− 2 /2 · √ · √ = · = 2π 2 2π 2 2 fU,V (u, v) = and so U and V are independent. Thus we can observe that U = X + Y ∼ N (0, 2) and V = (X − Y ) ∼ N (0, 2). This is related to X and S 2 being independent for the case n = 2, since V 2 /2 ∼ χ21 . The n–dimensional example of this independence of the mean and variance is given on pages 21–22 of the Notes. 3. An example from the exponential distribution. If X ∼ E(1) and Y ∼ E(1), and X and Y are independent, then show that X + Y and X/(X + Y ) are independent. Thus the joint density of X and Y is fX,Y (x, y) = e−x e−y , x, y > 0. We have the transformation U = X + Y = φ1 (X, Y ) = g(X, Y ), V = X/(X + Y ) = φ2 (X, Y ) = h(X, Y ) while the inverse is X = U V = ψ1 (U, V ) = G(U, V ), Y = U − U V = ψ2 (U, V ) = H(U, V ) 27 The region A is defined by X, Y > 0 while B is given by U, V > 0 and V < 1. The Jacobian is ∂ψ1 /∂u ∂ψ1 /∂v J= ∂ψ2 /∂u ∂ψ2 /∂v The joint density of U and V is v u = 1 − v −u = −u fU,V (u, v) = f (ψ1 , ψ2 ) · abs|J| = e−uv e−(u−uv) · u = ue−u , (u, v) B This gives the marginal distribution for U as fU (u) = Z fU,V (u, v)dv = Z 1 0 ue−u dv = ue−u , u > 0 ie, U ∼ G(2), being the sum of two independent exponentials. This can be verified by the use of the MGF, since MX,Y (t) = MX (t) × MY (t) = 1 1 1 × = 1−t 1−t (1 − t)2 which is the MGF of a G(2). The marginal for V is fV (v) = Z ∞ 0 ue−u du = 1, 0 < v < 1. So V is distributed as a uniform on (0,1). Thus fU,V (u, v) = fU (u) × fV (v) and so U and V are independent. 
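The distributional claims in this last example are easy to probe by simulation. The following R sketch is added here (the use of Kolmogorov-Smirnov tests is my choice, not something done in the notes).

set.seed(354)
x <- rexp(1e5)                 # X ~ E(1)
y <- rexp(1e5)                 # Y ~ E(1), independent of X
u <- x + y                     # should be G(2), i.e. gamma with shape 2
v <- x / (x + y)               # should be U(0, 1)

ks.test(u, "pgamma", shape = 2, rate = 1)$p.value   # typically not small
ks.test(v, "punif")$p.value                         # typically not small
cor(u, v)                                           # near zero, consistent with independence

A correlation near zero does not by itself prove independence, but it is consistent with the factorization fU,V(u, v) = fU(u) fV(v) derived above.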
Note The full version [ X, Y ∼ N (µ, σ 2 )] of the first two examples is in theory not complicated, but in practice is very messy. Example 3.3 [This is HC Example 3 in 4.3.] Given independent random variables X and Y , each with uniform distributions on (0, 1), find the joint pdf of U and V defined by U = X + Y, V = X − Y , and the marginal pdf of U . 28 The joint pdf of X and Y is fX,Y (x, y) = 1, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 . The inverse transformation, written in terms of observed values is x = (u + v)/2 and y = (u − v)/2. and is clearly one-to-one. The Jacobian is ∂(x, y) 1/2 1/2 J= = ∂(u, v) 1/2 −1/2 =− 1 1 , so abs|J| = . 2 2 Following the notation of HC, we will use A to denote the range space of (X, Y ), and B to denote that of (U, V ) shown in the Figure 3.2. Firstly, note that there are 4 inequalities specifying ranges of x and y, and these give 4 inequalities concerning u and v, from which B can be determined. That is, x≥0 x≤1 y≥0 y≤1 ⇒ ⇒ ⇒ ⇒ u+v u+v u−v u−v ≥ 0, ≤ 2, ≥ 0, ≤ 2, that that that that is, is, is, is, v v v v ≥ −u ≤ 2−u ≤u ≥ u−2 Drawing the four lines v = −u, v = 2 − u, v = u, v = u − 2 on the graph, enables us to see the region specified by the 4 inequalities. Now, using (3.2) we have 1 fU,V (u, v) = 1. , 2 ( −u ≤ v ≤ u, 0 ≤ u ≤ 1 u − 2 ≤ v ≤ 2 − u, 1 ≤ u ≤ 2 The importance of having the range space correct is seen when we find the marginal pdf of U . fU (u) = = Z ∞ −∞ fU,V (u, v) dv Ru 1 R−u 2 dv , 2−u 1 dv u−2 2 ( 0, 0≤u≤1 , 1≤u≤2 otherwise u, 0≤u≤1 2−u , 1≤u≤ 2 = uI[0,1] (u) + (2 − u)I(1,2] (u), using indicator functions. = 29 Figure 3.2: The region B. v v=u - @ @ @ @ @ @ @ @ @ @ @ 1 @ @ @ v=u-2 @ @- @ @ @ - @ @ @ @ @ @ @ @ 2 @ @ u @ @ @ v=2-u @ @ v=-u Example 3.4 [HC Example 6, 4.3] Given X and Y are independent random variables each with pdf fX (x) = 21 e−x/2 , x ∈ [0, ∞), find the distribution of (X − Y )/2. We note that the joint pdf of X and Y is fX,Y (x, y) = 14 e−(x+y)/2 , 0 ≤ x < ∞, 0 ≤ y < ∞. Define U = (X − Y )/2. Now we need to introduce a second random variable V which is a function of X and Y . We wish to do this in such a way that the resulting bivariate transformation is one-to-one and our actual task of finding the pdf of U is as easy as possible. Our choice for V is of course, not unique. Let us define V = Y . Then the inverse transformation is, (using u, v, x, y, since we are really dealing with the range spaces here). x = 2u + v y = v from which we find the Jacobian, 2 1 J= =2. 0 1 30 To determine B, the range space of U and V , we note that x≥0 x<∞ y≥0 y<∞ ⇒ ⇒ ⇒ ⇒ 2u + v ≥ 0, that is, , v ≥ −2u 2u + v < ∞ v≥0 v<∞ So B is as indicated in Figure 3.3. Figure 3.3: The region B v A A A A A A A A A Now using (3.2) we have u A v=-2u A A A AA 1 4 1 2 fU,V (u, v) = = e−(2u+v+v)/2 .2 e−(u+v) , (u, v) ∈ B. The marginal pdf of U is obtained by integrating fU,V (u, v) with respect to v, giving fU (u) = R ∞ 1 −2u 2 R∞ 1 = ( = 1 2 0 e 2 1 u e 2 1 −u e 2 −|u| e−(u+v) dv , u < 0 e−(u+v) dv , u>0 u<0 , u>0 , −∞ < u < ∞ [This is sometimes called the folded (or double) exponential distribution.] Example 3.5 Given Z is distributed N (0, 1) and Y is distributed as χ2ν , and Z and Y are independent, find the pdf of a random variable T defined by T = Z 1 (Y /ν) 2 31 . Now the joint pdf of Z and Y is fZ,Y (z, y) = e−z 2 /2 (2π) 1 2 . e−y/2 y (ν/2)−1 , y > 0, −∞ < z < ∞. 2ν/2 Γ(ν/2) Let V = Y , and we will find the joint pdf of T and V and then the marginal pdf of T . The inverse transformation is 1 1 z = tv 2 /ν 2 y = v 1 from which |J| = (v/ν) 2 . 
It is easy to check that B = {(t, v) : −∞ < t < ∞, 0 < v < ∞}. So the joint pdf of T and V is 1 2 e−t v/2ν e−v/2 v (ν/2)−1 v 2 fT,V (t, v) = 1 . 1 , (2π) 2 2ν/2 Γ(ν/2) ν 2 for (t, v) ∈ B. The marginal pdf of T is found by integrating fT,V (t, v) with respect to v, the limits on the integral being 0 and ∞. Carry out this integration, substituting x (say) 2 for v2 (1 + tν ), and reducing the integral to a gamma function. The answer should be Γ( ν+1 ) 2 t2 fT (t) = 1 + 1 ν (νπ) 2 Γ(ν/2) !−(ν+1)/2 , −∞ < t < ∞ , which you will recognise as the pdf of a random variable with a t-distribution with ν degrees of freedom. (X is the sample mean and Y is the sample variance) Exercise. [See HC 4.4.] Given random variables X and Y are independently distributed as chi-square with ν1 , ν2 degrees of freedom, respectively, find the pdf of the random variable F defined by F = ν2 X/ν1 Y . Let V = Y and find the joint pdf of F and V , noting that the range space B = {(f, v) : f > 0, v > 0}. You should find that |J| = ν1 v/ν2 . Find the marginal pdf of F , which you should recognize as that for an Fν1 ,ν2 distribution. You should try the following substitution to simplify the integration. Let s = v2 (1 + νν12f ). 3.3 Multivariate Transformations (One-to-One) Note that in this extension, we will use X1 , X2 , . . . , Xn for the ‘original’ continuous variables (rather than X and Y as we had for 2 variables) and U1 , U2 , . . . , Un or Y1 , . . . , Yn are used for the ‘new’ variables (rather than U and V ). 32 Given random variables X1 , X2 , . . . , Xn with joint pdf fX (x1 , x2 , . . . , xn ) which is non-zero on the n-dimensional space A. Define u1 = g1 (x1 , x2 , . . . , xn ) u2 = g2 (x1 , x2 , . . . , xn ) .. . un = gn (x1 , x2 , . . . , xn ) (3.3) and suppose this is a one-to-one transformation mapping A onto a space B. Extending (2.3) to this case we have, for the joint pdf of U1 , U2 , . . . , Un , fU (u1 , u2 , . . . , un ) = fX (x1 , x2 , . . . , xn ).abs|J|, where J = ∂xi ∂uj ! . (3.4) [Note that J is the matrix of partial derivatives.] 3.4 Multivariate Transformations Not One-to-One With the definitions of X1 , X2 , . . . , Xn , U1 , U2 , . . . , Un as in section 3.3, suppose now that to each point of A there corresponds exactly one point of B, but that to each point of B there may correspond more than one point of A. Assume that we can represent A as the union of a finite number, k, of disjoint sets A1 , A2 , . . . , Ak , such that (2.4) does represent a one-to-one mapping of each Aj onto B, j = 1, . . . , k. That is, for each transformation of Aj onto B there is a unique inverse transformation xi = Gij (u1 , u2 , . . . , un ), i = 1, 2, . . . , n; j = 1, 2, . . . , k, each having a non-vanishing Jacobian, |Jj |, j = 1, 2, . . . , k. The joint pdf of U1 , U2 , . . . , Un is then given by k X j=1 abs|Jj |f [G1j (u1 , . . . , un ), . . . , Gnj (u1 , . . . , un )] for (u1 , u2 , . . . , un ) ∈ B. The marginal pdf’s may be found in the usual way if required. Example 3.6 Given X1 and X2 are independent random variables each distributed N (0, 1), so that 2 2 f (x1 , x2 ) = (2π)−1 e−(x1 +x2 )/2 , −∞ < x1 < ∞; −∞ < x2 < ∞, define U1 = (X1 + X2 )/2, U2 = (X1 − X2 )2 /2 and find their joint distribution. The transformation is not one to one since to each point in B = {(u1 , u2 ) : −∞ < u1 < ∞, 0 ≤ u2 < ∞} there corresponds two points in A = {(x1 , x2 ) : −∞ < x1 < ∞, −∞ < x2 < ∞}. There are two sets of inverse functions. 33 1 1 1 1 (i) x1 = u1 − (u2 /2) 2 ; x2 = u1 + (u2 /2) 2 . 
(ii) x1 = u1 + (u2 /2) 2 ; x2 = u1 − (u2 /2) 2 . From the definition of U2 , there is one type of mapping when x1 > x2 and another when x2 > x1 . Consequently we define A1 = {(x1 x2 ); x2 > x1 } and A2 = {(x1 , x2 ); x2 < x1 } . Note that the line x1 = x2 has been omitted since when x1 = x2 we have u2 = 0. However, since P (X1 = X2 ) = 0, excluding this line does not alter the distribution and we therefore consider only A = A1 ∪ A2 . Then (i) defines a one-to-one transformation of A2 onto B and (ii) defines a one-to-one transformation of A1 onto B. Thus the joint pdf of (U1 , U2 ) is given by 1 1 [u − (u /2) 2 ]2 1 1 [u1 + (u2 /2) 2 ]2 1 2 fU1 ,U2 (u1 , u2 ) = .(2u2 )− 2 exp − − 2π 2 2 = 1 1 [u + (u /2) 2 ]2 1 [u1 − (u2 /2) 2 ]2 1 1 2 (2u2 )− 2 exp − − + 2π 2 2 1 1 2 (π) 1 2 e−u1 . 2 1 2 1 Γ( 12 ) u22 −1 e−u2 /2 , for (u1 , u2 ) ∈ B. Comment: This also shows that U1 and U2 are stochastically independent. 3.5 Convolutions Consider the problem of finding the distribution of the sum of 2 independent (but not necessarily identically distributed) random variables. The pdf of the sum can be neatly expressed using convolutions. Theorem 3.1 Let X and Y be independent random variables with pdf’s fX , fY respectively, and define U = X + Y . Then the pdf of U is fU (u) = Z ∞ −∞ fX (u − v)fY (v) dv. Proof 34 (3.5) Because of independence, the joint pdf of X and Y may be written fX,Y (x, y) = fX (x)fY (y) . Define V = Y and, noting that the Jacobian of the inverse transformation is 1, the joint pdf of U and V is fU,V (u, v) = fX (u − v)fY (v), and hence the marginal pdf of U , found by integrating with respect to v, is as given in (3.5). Now fU (u) is called the convolution of fX and fY . The following heuristic explanation may assist. Equation (3.5) defines a convolution in the mathematical sense. Each single point of fU (u) is formed by a weighted average of the entire density fY (v). The weights are the other density fX (u − v) where its value depends on how far apart each v is from u. Thus each single point of the density fU (u) arises from all the density fY (v). Example 3.7 Random variables X and Y are identically and independently distributed (iid) uniformly on [0, 1]. Find the distribution of U = X + Y . We note that fX (x) = 1, 0 ≤ x ≤ 1, fY (y) = 1, 0 ≤ y ≤ 1 and that the inverse transformation is x = u − v, y = v with |J| = 1. The range space for (u, v) is determined from x≥0 x≤1 y≥0 y≤1 ⇒ ⇒ ⇒ ⇒ u − v ≥ 0, u − v ≤ 1, v≥0 v ≤ 1, and is shown in the Figure 3.4. So from Theorem 3.5, fU (u) = R 0u 1.dv, R1 u−1 that is, v ≤ u that is, v ≥ u − 1 0≤u≤1 1.dv, 1 < u ≤ 2 resulting in what is sometimes called the triangular distribution, fU (u) = ( u, 0≤u≤1 2 − u, 1 < u ≤ 2 as in Example 3.2. The method of convolutions is a special case of the transformation of variables, being another method for finding the distribution of the sum of two variables. The problem that is solved here could be solved using MGFs, viz, MU (t) = MX (t)MY (t) = 35 et − 1 t !2 v=u v=u-1 Figure 3.4: Region B v v=1 1 since Z u 2 et − 1 , MX (t) = Ee = e dt = t 0 but again the problem is to recognize the distribution from the resulting MGF. Xt 3.6 1 xt General Linear Transformation Here we will use matrix notation to express the results of 3.3, and give a useful result using moment generating functions. The one-to-one linear transformation referred to in section 3.3 on page 32 can be written in matrix notation as Y = AX, (using Y for the new variables rather that U). 
Here X and Y are vectors of random variables and A is a matrix of constants. In particular, note that E(X) is the vector whose components are E(X1 ), . . . E(Xp ), or µ1 , . . . , µp . The covariance matrix of X (sometimes called the variance-covariance matrix) is frequently referred to as cov(X), and is denoted by Σ. Note that it is a square matrix whose diagonal terms are variances, and off-diagonal terms are covariances. If A is non-singular so that there is an inverse transformation X = A−1 Y, and if X has pdf fX (x), the corresponding pdf of Y is fY (y) = fX (A−1 y)abs|J| = fX (A−1 y)abs|A−1 |. (3.6) Recall that the joint mgf of (X1 , X2 , . . . , Xp ) is expressed in matrix notation as 0 MX (t) = E et1 X1 +t2 X2 +...+tp Xp = E(et X ) provided this expectation exists. Now if Y = AX, so that Y is a p-dimensional random vector, the mgf of Y is 0 0 0 0 MY (t) = E(et Y ) = E(et AX ) = E(e(A t) X ) = MX (A0 t) 36 (3.7) 3.7 Worked Examples : Bivariate MGFs The distribution of linear combinations of independent random variables can sometimes be determined by the use of moment generating functions. Example 1 Find the distribution of W = X1 + X2 where X1 ∼ N (µ1 , σ12 ), X2 ∼ N (µ2 , σ22 ) and X1 and X2 are independent. Solution MW (t) = E(eW t ) = E[eX1 t + X2 t ] = E(eX1 t ) × E(eX2 t ) since X1 and X2 are independent. Now 2 2 2 2 MW (t) = eµ1 t + σ1 t /2 × eµ2 t + σ2 t /2 2 2 2 2 MW (t) = eµ1 t + σ1 t /2 + µ2 t + σ2 t /2 2 2 2 MW (t) = e(µ1 + µ2 )t + (σ1 + σ2 )t /2 Thus X1 + X2 ∼ N (µ1 + µ2 , σ12 + σ22 ) Example 2 In general, if X = aX1 + bX2 then MW (t) = E(eW t ) = E[e(aX1 + bX2 )t ] = E(eaX1 t )E(ebX2 t ) ie MW (t) = MX1 (t)MX2 (t) Of course, the procedure relies on the resulting MGF being recognizable. Thus we can now find the distribution of W = X 1 − X2 Solution 2 2 2 2 MW (t) = eµ1 t + σ1 t /2 × eµ2 (−t) + σ2 (−t) /2 2 2 2 2 MW (t) = eµ1 t + σ1 t /2 − µ2 t + σ2 t /2 2 2 2 MW (t) = e(µ1 − µ2 )t + (σ1 + σ2 )t /2 37 Thus X1 − X2 ∼ N (µ1 − µ2 , σ12 + σ22 ) The procedure can be extended to the n dimensional case, an indeed forms the basis of one of the proofs for the Central Limit Theorem. Central Limit Theorem The general form of the CLT states that : If X1 . . . Xn are iid rvs with mean µ and variance σ 2 , then Z= X −µ √ ∼ N (0, 1) (asy). σ/ n Proof Let the rvs Xi (i = 1 . . . n) have MGF 0 0 MXi (t) = 1 + µ1 t + µ2 t2 /2 + . . . Now X= so MX (t) = n Y 1 1 X1 + . . . + Xn n n MXi /n (t) = i=1 Now n Y MXi (t/n) = [MXi (t/n)]n i=1 √ n Z= (X − µ) σ to give MZ (t) = e−( Thus √ n/σ)µt √ h √ in √ MX [ n/σ]t = e−( n/σ)µt MXi [ n/σ](t/n) √ " 0 n µ t2 t 0 log MZ (t) = − µt + n log 1 + µ1 √ + 2 2 + . . . σ σ n 2! σ n √ " # n t t2 0 0 =− µt + n µ1 √ + µ2 2 + . . . σ σ n 2σ n √ 2 n t√ 0 t µt + µ =− n + µ2 2 − σ σ 2σ Therefore, as n → ∞ 0 log MZ (t) → " #2 n 0 t t2 0 µ1 √ + µ 2 2 + . . . − 2 σ n 2σ n 1 2 t2 1 µ 2 + . . . (terms in √ ) 2 σ n (µ2 − µ2 ) t2 t2 = σ2 2 2 38 # +... 0 since σ 2 = µ2 − µ2 . This is the MGF of a N (0, 1) rv, and as σZ X = √ +µ n then X ∼ N (µ, σ 2 /n) (asy). 39 40 Chapter 4 Multivariate Normal Distribution 4.1 Bivariate Normal If X1 , X2 have a bivariate normal distribution with parameters µ1 , µ2 , σ12 , σ22 , ρ, then the joint pdf of X1 and X2 is 1 − 2(1−ρ 2) h (x1 −µ1 )2 σ12 − 2ρ(x1 −µ1 )(x2 −µ2 ) σ1 σ2 f (x1 , x2 ) = ke √ where k = 1/2πσ1 σ2 1 − ρ2 . 
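As a quick numerical check of this density (an added illustration, not part of the original notes; the parameter values are arbitrary), the R sketch below codes $f(x_1, x_2)$ directly from the formula above, confirms that it integrates to approximately one over a fine grid, and confirms that when $\rho = 0$ it factorises into the product of the two univariate normal densities.

```r
## Bivariate normal density, written directly from the formula above
dbvn <- function(x1, x2, mu1, mu2, s1, s2, rho) {
  k <- 1 / (2 * pi * s1 * s2 * sqrt(1 - rho^2))
  q <- (x1 - mu1)^2 / s1^2 -
       2 * rho * (x1 - mu1) * (x2 - mu2) / (s1 * s2) +
       (x2 - mu2)^2 / s2^2
  k * exp(-q / (2 * (1 - rho^2)))
}

mu1 <- 1; mu2 <- -2; s1 <- 2; s2 <- 1; rho <- 0.6    # arbitrary values
x1 <- seq(mu1 - 6 * s1, mu1 + 6 * s1, length = 401)
x2 <- seq(mu2 - 6 * s2, mu2 + 6 * s2, length = 401)
grid <- outer(x1, x2, dbvn, mu1 = mu1, mu2 = mu2, s1 = s1, s2 = s2, rho = rho)
sum(grid) * diff(x1)[1] * diff(x2)[1]                # approx 1

## With rho = 0 the density factorises into the two marginals
dbvn(0.3, -1.7, mu1, mu2, s1, s2, 0) -
  dnorm(0.3, mu1, s1) * dnorm(-1.7, mu2, s2)         # approx 0
```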
Let X0 = (X1 , X2 ), µ0 = (µ1 , µ2 ), and define Σ by Σ= " σ12 ρσ1 σ2 ρσ1 σ2 σ22 + (x2 −µ2 )2 σ22 i , # and we see that the joint pdf can be written in matrix notation as −1 1 − 21 (x−µ)0 Σ (x−µ) fX (x) = e 1 2π|Σ| 2 where |Σ| is the determinant of Σ. Check that |Σ| = σ12 σ22 (1 − ρ2 ) and that Σ−1 1 = 1 − ρ2 1 σ12 −ρ σ1 σ2 −ρ σ1 σ2 1 σ22 . We write X ∼ N2 (µ, Σ). Read CB p175 or HC 3.5 to revise some of the properties of the bivariate normal distribution, which can be regarded as a special case of the multivariate normal distribution. This will be considered in the remainder of this chapter. The following is a ’derivation’ of the Bivariate Normal from first principles, ie, from two univariate independent normals. Let the two univariate independent normals be Z1 and Z2 . Then form the bivariate vector Z1 Z2 Z= 41 ! The joint distribution function for Z is then f (z1 , z2 ) = 1 −(z12 +z22 )/2 e , ∞ < z 1 , z2 < ∞ 2π Alternatively 1 −(z 0 z )/2 1 −(z 0 Iz )/2 e = e 2π 2π which is called the spherical normal. Note that Z ∼ N (O, I). If we now consider these two unit normals to be the result of transforming two bivariate normal variables that are not necessarily independent, we can use a transformation that is the two dimensional equivalent of the Z–score from univariate statistics. Thus we have f (z) = Z = P −1 (X − µ) where X= X1 X2 µ= µ1 µ2 and ! ! The distribution of X will be the bivariate normal. Using the general linear transformation from p 24 (3.6) of the Notes, we get h i fX (x) = fZ z = P −1 (x − µ) abs|P −1 | = 1 −[P −1 (x−µ)]0 [P −1 (x−µ)]/2 e abs|P −1 | 2π n 1 −(x−µ)0 P P e = 2π It is easily verified that E(X) = µ since 0 o−1 (x−µ)/2 abs|P −1 | E(X) = E[P Z + µ] = O + µ 0 but what of P P ? The variance/covariance matrix of X is 0 Σ = E[(X − µ)(X − µ) ] = E(P Z)(P Z) 0 0 0 0 = E[P ZZ P ] = E(P IP ) = E(P P ) AS |P −1 | = |Σ|−1/2 we have f (x) = 1 e−[(x−µ) Σ 1/2 2π|Σ| 0 42 −1 (x−µ)]/2 0 Exercise Derive from this general form the equation for the bivariate normal. 4.2 Multivariate Normal (MVN) Distribution The Multivariate Normal distribution has a prominent role in statistics as a consequence of the Central Limit Theorem. For example, estimates of regression parameters are asymptotically Normal. (Some people prefer to call it a Gaussian distribution). We will extend the notation of section 4.1 to p dimensions, so E(X)= µ is the vector whose components are E(X1 ), . . . , E(Xp ) or µ1 , . . . , µp , and Σ = cov(X) is the variancecovariance matrix (p × p) whose diagonal terms are variances and off-diagonal terms are covariances, and Σ = cov(X) = E[(X − µ)(X − µ)0 ]. Definition 4.1 The random p-vector X is said to be multivariate normal if and only if the linear function a0 X = a 1 X 1 + . . . + a p X p is normal for all a, where a0 = (a1 , a2 , . . . , ap ). In loose statistical jargon, the terms ‘linear’ and ‘Normal’ are sometimes interchangeable. Where we have random variables that are ‘normal’, we can think of the components as additive. Theorem 4.1 If X is p-variate normal with mean µ and covariance matrix Σ (non-singular), then X has a pdf given by −1 1 − 12 (x−µ)0 Σ (x−µ) e fX (x) = (4.1) (2π)p/2 |Σ|1/2 Proof We are given that E(X) = µ, E(X−µ)(X − µ)0 = Σ. Since Σ is positive definite, there is a non-singular matrix P such that Σ =PP0 . [Chapter 1, sec 1.2, 6(b)(ii).] Consider the transformation Y = P−1 (X − µ). 
By Definition 4.1, the components of Y are normal and E(Y) = E(P−1 (X − µ)) = P−1 E(X − µ) = 0 since E(X) = µ, cov(Y) = E(YY0 ) = P−1 E[(X − µ)(X − µ)0 ](P−1 )0 = P−1 Σ(P0 )−1 = P−1 PP0 (P0 )−1 = I, 43 So Y1 ,. . . ,Yp are iid N (0, 1) and their joint pdf is given by fY (y) = 1 − 12 y0 y e . (2π)p/2 Using (3.6), the density of X is fX (x) = fY P−1 (x − µ) abs|P−1 |, where |P−1 | = and the result follows. 1 1 1 = 0 1/2 = |P| |PP | |Σ|1/2 Comments 1. Note that the transformation Y = P−1 (X − µ) is used to standardize X, in the same way as X −µ Z= σ was used in univariate theory. 2. Note that when p = 1, equation (4.1) reduces to the pdf of the univariate normal. 3. The covariance matrix is symmetric, since cov(Xi , Xj ) = cov(Xj , Xi ). 4. It is often convenient to write X ∼ Np (µ, Σ). 5. Note that Z 4.3 ∞ −∞ ... Z ∞ −∞ −1 1 1 − 12 (x−µ)0 Σ (x−µ) dx1 . . . dxp = |Σ| 2 . e p/2 (2π) (4.2) Moment Generating Function We will now derive the mgf for a p-variate normal distribution and see how it can be used in deriving other results. Theorem 4.2 Given X ∼ Np (µ, Σ) and t0 = (t1 , t2 , . . . , tp ) a vector of real numbers, then the mgf of X is 1 0 0 MX (t) = et µ+ 2 t Σt . (4.3) Proof 44 There exists a non-singular matrix P so that Σ = PP0 . Let Y = P−1 (X − µ). Then Y ∼ Np (0, I) from the proof of Theorem 4.2. That is, each Yi ∼ N (0, 1) and we know 1 2 that MYi (t) = E(eYi t ) = e 2 t . Now 0 MY (t) = E(eY1 t1 +...+Yp tp ) = E(et Y ) = E(eY1 t1 )E(eY2 t2 ) . . . E(eYp tp ) 1 2 1 2 1 2 1 e 2 t2 . . . e 2 tp = e 2 tP 1 2 = e 2 ti 1 0 = e2t t Also 0 E(et X ) 0 E(et (µ+PY) ) 0 0 E(et µ et PY ) 0 0 0 et µ E(e(P t) Y ) 0 et µ MY (P0 t) 1 0 0 0 0 = et µ .e 2 (P t) (P t) 1 0 0 = et µ+ 2 t Σt , putting Σ for PP0 . MX (t) = = = = = Comments 1. Note that when p = 1, Theorem 4.2 reduces to the familiar result for a univariate normal. 2. If X is multivariate normal with diagonal covariance matrix, then the components of X are independent. 3. The marginal distributions of a multivariate normal are all multivariate (or univariate) normal. eg. " # " # " X1 µ1 ∼ N , X2 µ2 X1 ∼ N (µ1 , Σ11 ) X2 ∼ N (µ2 , Σ22 ) Σ11 Σ12 Σ21 Σ22 #! 4. If X is multivariate normal, then AX is multivariate normal for any matrix A (of appropriate dimension). For a r.v X where E(X) = µ, var(X) = Σ , E(AX) = Aµ, var(AX) = AΣA0 45 5. We also note for future reference the conditional distributions, (X 2 |X 1 = x1 ) ∼ N (µ2.1 , Σ22.1 ) where µ2.1 = µ2 + Σ21 Σ−1 11 (x1 − µ1 ) Σ22.1 = Σ22 − Σ21 Σ−1 11 Σ12 Comments The marginal distributions of a multivariate normal (see H and C p229 4.134) The MGF of the MVN can be written as 0 0 1 MX (t) = et µ+ 2 t Σt = M(X 1 ,X 2 ) (t1 , t2 ) where X = X1 X2 ! ,t= t1 t2 ! and µ = So µ1 µ2 ! 0 . 0 MX 1 (t1 ) = et1 µ1 + 2 t1 Σ11 t1 where 1 Σ11 Σ12 Σ21 Σ22 Σ= ! . By setting t2 = 0, we obtain X 1 ∼ N (µ1 , Σ11 ). Similarly, by setting t1 = 0, we get X 2 ∼ N (µ2 , Σ22 ). Conditional distributions Decomposing X as before, we form X1 X 2 − BX 1 AX = = 0 AΣA = X1 X2 # Σ11 Σ12 Σ21 Σ22 ! I O −B I Now I O −B I ! !" ! 46 I −B O I 0 ! = I O −B I ! 0 Σ11 −Σ11 B + Σ12 0 Σ21 −Σ21 B + Σ22 ! 0 = −Σ11 B + Σ12 Σ11 0 0 −BΣ11 + Σ21 −BΣ11 B + BΣ12 − Σ12 B + Σ22 ! We now choose the matrix B, so that the off diagonal matrices become O, so that X 1 and X 2 − BX 1 are independent. This implies that B = Σ21 Σ−1 11 , to give 0 AΣA = Σ11 O O −Σ21 Σ−1 11 Σ12 + Σ22 ! (verify!). Thus we now have (X 2 − BX 1 ) ∼ Nn2 (µ2 − Bµ1 , Σ22.1 ) where Σ22.1 = Σ22 − Σ21 Σ−1 11 Σ12 . (The length of the vector X 2 is n2). 
Since we may treat X 1 as a constant, then (X 2 |X 1 = x1 ) ∼ N (µ2 − Bµ1 + Bx1 , Σ22.1 ) and so (X 2 |X 1 = x1 ) ∼ N µ2 + Σ21 Σ−1 11 (x1 − µ1 ) , Σ22.1 as previously stated. Exercise Using the MGF/CGF derive the first five moments of the Bivariate Normal. 4.4 Independence of Quadratic Forms We will consider here some useful results involving quadratic forms in normal random variables. Theorem 4.3 Suppose X1 , X2 , . . . , Xp are identically and independently distributed as N (0, 1) and let X0 = (X1 , X2 , . . . , Xp ). Define Q1 and Q2 by Q1 = X0 BX, Q2 = X0 CX, where B and C are p × p symmetric matrices with ranks less than or equal to p. Then Q1 and Q2 are independent if and only if BC = 0. 47 Proof Firstly note that X0 BX and X0 CX are scalars so that Q1 and Q2 each have univariate distributions. We will find the joint mgf of Q1 and Q2 . Note that the pdf of X is given by (4.1) with µ = 0 and Σ = I, so we have 0 0 MQ1 ,Q2 (t1 , t2 ) = E(et1 X BX+t2 X CX ) Z ∞ Z ∞ 1 0 1 0 0 et1 x Bx+t2 x Cx− 2 x x dx1 . . . dxp = ... p/2 −∞ −∞ (2π) 1 = (2π)p/2 Z ∞ −∞ ... Z ∞ 1 −∞ 0 e− 2 x (I−2t1 B−2t2 C)x dx1 . . . dxp 1 = |I − 2t1 B − 2t2 C|− 2 , using (4.2), for values of t1 , t2 which make I − 2t1 B − 2t2 C positive definite. Now the mgf’s of the marginal distributions of Q1 and Q2 are MQ1 ,Q2 (t1 , 0), MQ1 ,Q2 (0, t2 ) respectively. That is, 1 MQ1 (t1 ) = |I − 2t1 B|− 2 , 1 MQ2 (t2 ) = |I − 2t2 C|− 2 . Now Q1 and Q2 are independent if and only if MQ1 ,Q2 (t1 , t2 ) = MQ1 (t1 )MQ2 (t2 ) . That is, if |I − 2t1 B − 2t2 C| = |I − 2t1 B||I − 2t2 C| = |I − 2t1 B − 2t2 C + 4t1 t2 BC|. This is true if and only if BC = 0. [Note that the 0 here is a p × p matrix with every entry zero.] The matrices B, C are projection matrices. Q1 is the shadow of (X 0 X) in the B plane and Q2 is the shadow of X 0 X in the C plane. Q1 and Q2 will be independent if B ⊥ C since in that case, none of the information in Q1 is contained in Q2 . Example 4.1 If X1 , X2 , . . . , Xp are iid N (0, 1) random variables and X and S 2 are defined by X = S2 = p X i=1 p X i=1 Xi /p (Xi − X)2 /(p − 1), 48 2 show that S 2 and pX are independent. Outline of proof 2 We need to write both S 2 and pX as quadratic forms. It is easy to verify that (p − 1)S 2 = X0 BX where 1 − 1p − 1p . . . − 1p −1 1 − 1p . . . − 1p p B= .. .. .. . . . 1 1 −p − p . . . 1 − 1p 2 and that pX = X0 CX where 1 p 1 p ... 1 p 1 p 1 p ... 1 p . . C= . .. . .. . and that BC = CB = 0, implying independence. Proof in detail Now 0 (p − 1)S 2 = X BX, B = I − I/p where I is a matrix of ones. To verify this, note that X1 . 2 . (p − 1)S = [X1 , X2 , . . . , Xp ] [I − I/p] . Xp That is = [X1 , . . . , Xp ] X1 O .. . O Xp (p − 1)S 2 = (X12 + . . . + Xp2 ) − ( X − i 0 pX = X CX where C = [I/p] 49 .. . P i Xi /p i Xi /p Xi )2 /p = If we define 2 P X i Xi2 − pX 2 then X1 . 2 . pX = [X1 , . . . , Xp ] [I/p] . Xp = as expected. Thus P [X1 , . . . , Xp ] .. . i P i Xi /p Xi /p = ( P i Xi ) 2 2 = pX p BC(= CB) = O = [I − I/p] [I/p] = I/p − II/p2 = 1 . . . ... .. . 1 1 . .. . . . p 1 ... 1 1 − 2 p = p 1 . .. . p . . p 1 ... 1 − p2 1 . . . ... .. . 1 1 . .. . . . 1 ... 1 p ... .. . ... .. . p .. . ... p p2 ... .. . 1 .. . ... 1 p =O 2 and so S 2 and pX are independent. 4.5 Distribution of Quadratic Forms Consider the quadratic form Q = X0 BX where B is a p × p matrix of rank r ≤ p. We will find the distribution of Q, making certain assumptions about B. We use the cumulant generating function as a mathematical tool to derive the results. 
Knowledge of cumulants up to a given order is equivalent to that of the corresponding moments. Although moments have a direct physical or geometric interpretation, cumulants sometimes have an advantage, due to : • the vanishing of the cumulants for the normal distribution, • their behaviour for sums of independent random variables, and • especially in the multivariate case, their behaviour under linear transformation of the random variables concerned. Theorem 4.4 Given X is a vector of p components, X1 , . . . , Xp distributed iid N (0, 1), and Q = X0 BX where B is a p × p matrix of rank r ≤ p, the distribution of Q 50 (i) has sth cumulant, κs = 2s−1 (s − 1)!tr(B s ) (ii) is χ2r if and only if B is idempotent (that is, B 2 = B). Proof Now there is an orthogonal matrix P which transforms Q into a sum of squares. That is, let X = PY, and Q = X0 BX = Y 0 P0 BPY = Y 0 ΛY where Λ is a diagonal matrix with elements λ1 , λ2 , . . . , λp , the eigenvalues of B. Now exactly r of these are non-zero where r = rank(B). So Q= r X λi Yi2 . (4.4) i=1 Now if X ∼ Np (0, I), then Y = P−1 X is distributed as p-variate normal with E(Y) = P−1 E(X) = 0 and cov(Y) = = = = E(YY0 ) = E(P−1 XX0 (P−1 )0 ) P−1 E(XX0 )(P0 )−1 (P0 P)−1 since E(XX0 ) = I I So Y ∼ Np (0, I). Consider now the ith component of Y. Since Yi ∼ N (0, 1) it follows that Yi2 ∼ χ21 and has mgf 1 MYi2 (t) = (1 − 2t)− 2 . So λi Yi2 has mgf 1 Mλi Yi2 (t) = (1 − 2λi t)− 2 , and Q, defined by (4.4), has mgf MQ (t) = r Y i=1 1 (1 − 2λi t)− 2 , (4.5) since the Yi are independent. The cumulant generating function (cgf) is KQ (t) = log MQ (t) r 1X = − log(1 − 2λi t) 2 i=1 " r 1X 22 λ2i t2 23 λ3i t3 − −... = − −2λi t − 2 i=1 2 3 = " r X i=1 λi t + t 2λ2i 2 2! +...+ 51 t 2s−1 λsi # s s! (s − 1)! + . . . # (i) So the sth cumulant of Q, κs is κs = 2s−1 (s − 1)! Now Pr i=1 s λ i=1 i = Pr r X λsi , s = 1, 2, 3, . . . (4.6) i=1 λsi is the sum of elements of the leading diagonal of B s . That is, tr(B s ). So (4.6) can be written κs = 2s−1 (s − 1)! tr(B s ). (4.7) (ii) Now for a χ2r distribution the mgf is (1 − 2t)−r/2 , the cgf is − 2r log(1 − 2t), and the sth cumulant is 2s−1 (s − 1)!r. (4.8) So if Q ∼ χ2r the sth cumulant must be given by (4.8). Comparing with (4.7), we must have tr(B s ) = r = tr(B). That is, B s = B, and B is idempotent. On the other hand, if B is idempotent, r of the λi = 1 and the others are 0, so from P (4.4), Q = ri=1 Yi2 , and Q ∼ χ2r . The following theorems (stated without proof) cover more general cases. Theorem 4.5 Let X ∼ Np (0, σ 2 I) and define Q = X0 BX where B is symmetric of rank r. Then Q/σ 2 ∼ χ2r if and only if B is idempotent. What form might B take? See if the projection matrices X(X 0 X)−1 X 0 and I − X(X 0 X)−1 X 0 are idempotent. Theorem 4.6 Let X ∼ Np (0, Σ) where Σ is positive definite. Define Q = X0 BX where B is symmetric of rank r. Then Q ∼ χ2r if and only if BΣB = B. Example (After H and C p485) 52 If X1 , X2 , X3 are iid N (0, 8), and 1/2 0 1/2 1 0 B= 0 1/2 0 1/2 show that X 0 BX ∼ χ22 . 8 Solution Now r(B) = 2, B is symmetric and idempotent, since 1/2 0 1/2 1/2 0 1/2 1/2 0 1/2 1 0 1 0 1 0 B2 = 0 = 0 0 1/2 0 1/2 1/2 0 1/2 1/2 0 1/2 Because B is idempotent, then X 0 BX ∼ χ22 8 by Thm 4.5. √ , and claim Alternatively, we could use Thm 4.4 on X ∗ = X 8 0 X ∗ BX ∗ ∼ χ22 ie X 0 BX ∼ χ22 8 since X ∗1 , . . . , X ∗3 ∼ N (0, 1). Notes 1. Thm 4.5 as shown in the example is a trivial application of Thm 4.4 with the original variables divided by σ. 2. Thm 4.6 is more general. 
Again Q = X 0 BX but now we define Z = P −1 X 53 where Σ = PP0 ie, X = PZ Thus Q = X 0 BX = (P Z)0 BP Z = Z 0 P 0 BP Z = Z 0 EZ say, then Q ∼ χ2r iff E is idempotent, by Thm 4.4. This means that E 2 = E, ie, (P 0 BP ) (P 0 BP ) = P 0 BP Thus P 0 B[P P 0 ]BP = P 0 BP ie P 0 (BΣB) P = P 0 (B) P to give the condition BΣB = B as required. 3. If we have X ∼ N (µ, Σ) then the distribution of Q involves the non–central χ2 which is covered in Chapter 6. 4.6 Cochran’s Theorem This is a very important theorem which allows us to decompose sums of squares into several quadratic forms and identify their distributions and establish their independence. It can be used to great advantage in Analysis of Variance and Regression. The importance of the terms in the model is assessed via the distributions of their sums of squares. Theorem 4.7 Given X ∼ Np (0, I), suppose that X0 X is decomposed into k quadratic forms, Qi = X0 Bi X, i = 1, 2, . . . , k, where the rank of Bi is ri and the Bi are positive semidefinite, then any one of the following conditions implies the other two. (a) the ranks of the Qi add to p; (b) each Qi ∼ χ2ri ; (c) all the Qi are mutually independent. 54 Proof We can write X0 X = X0 IX = k X X0 Bi X. i=1 That is, I= k X Bi . i=1 (i) Given (a) we will prove (b). Select an arbitrary Qi , say Q1 = X0 B1 X. If we make an orthogonal transformation X = PY which diagonalizes B1 , we obtain from X0 B1 X + X0 (I − B1 )X = X0 IX Y 0 P0 B1 PY + Y 0 P0 (I − B1 )PY = Y 0 B0 IBY = Y 0 IY. (4.9) Since the first and last terms are diagonal, so is the second. Since r(B1 ) = r1 and therefore r(P0 B1 P) = r1 , p − r1 of the leading diagonal elements of P0 B1 P are zero. Thus the corresponding elements of P0 (I − B1 )P are 1 and since by (a) the rank of P0 (I − B1 )P is p − r1 , the other elements of its leading diagonal are 0 and the corresponding elements of P0 B1 P are 1. Hence from Theorem 4.4, Q1 ∼ χ2r1 and B1 is idempotent. The same result holds for the other Bi and we have established (b) from (a). (ii) Given (b) we will prove (c). I = B1 + B2 + . . . + Bk (4.10) and (b) implies that each Bi is idempotent (with rank ri ). Choose an arbitrary Bi , say Bj . There is an orthogonal matrix C such that 0 C Bj C = " Irj 0 0 0 # . Premultiplying (4.10) by C0 and post-multiplying by C, we have 0 C IC = I = k X i=1,i6=j 0 C Bi C + " Irj 0 0 0 # . Now each C0 Bi C is idempotent and can’t have any negative elements on its diagonal. So C0 Bi C must have the first rj leading diagonal elements 0, and submatrices for 55 rows rj + 1, . . . , p, columns 1, . . . , rj and for rows 1, . . . , rj , columns rj + 1, . . . , p must have all elements 0. So C0 Bi CC0 Bj C = 0 , i = 1, 2, . . . , k, i 6= j, and thus C0 Bi Bj C = 0 which can only be so if Bi Bj = 0. Since Bj was arbitrarily chosen, we have proved (c) from (b). (iii) Given (b) we will prove (a). If (b) holds, Bi has ri eigenvalues 1 and p − ri zero and since I = P we have p = ri . P Bi , taking traces (iv) Given (c) we will prove (b). If (c) holds, taking powers of I = integers s. Taking traces we have tr( k X Pk i=1 Bi , we have Pk i=1 Bsi = I for all positive Bsi ) = p , for all s. i=1 This can hold if and only if every eigenvalue of Bi is 1. That is, if each Qi ∼ χ2 . So we have proved (b) from (c). A more general version of Cochran’s Theorem is stated (without proof) in Theorem 4.8. Note that X 0 X = X 0 IX = k X X 0BiX i so that k X I= Bi i The logic of the three conditions can be summarised in Table 4.1. 
(1) (2) (3) (4) a b b c → → → → b c a b Table 4.1: Logic table for Cochran’s theorem Thus (1) and (2) imply that ’a’ implies ’b’ then ’c’. 56 Also (2) and (3) directly shows that ’b’ implies ’a’ and ’c’, while (3) and (4) mean that ’c’ implies ’b’ then ’a’. Theorem 4.8 Given X ∼ Np (0, σ 2 I), suppose that X0 X is decomposed into k quadratic forms, Qi = X Bi X, r = 1, 2, . . . , k, when r(Bi ) = ri . Then Q1 , Q2 , . . . , Qk are mutually independent P and Qi /σ 2 ∼ χ2ri if and only if ki=1 ri = p. 0 Proof Let X ∗ = X/σ and use Thm 4.7 (a) on X ∗ . Example 4.2 We will consider again Example 4.1 from the point of view of Cochran’s Theorem. Recall that X1 , . . . , Xp are iid N (0, 1) and p X i=1 That is, 2 (xi − x) = X p X i=1 x2i − ( P p xi ) 2 X x2i − px2 . = p i=1 x2i = (p − 1)s2 + px2 , where s2 is defined in the usual way. Equivalently, X0 IX = X0 BX + X0 CX, where B and C are defined in Example 4.1. We can apply Cochran’s Theorem, noting that we can easily show that (a) is true, since r(I) = p, r(B) = p − 1 and r(C) = 1 where B 1 = B, B 2 = C in the notation of Example 4.1 and Thm 4.7. So we may conclude that X 2 Xi2 ∼ χ2p , νS 2 ∼ χ2ν where ν = p − 1, and pX ∼ χ21 . and that X and S 2 are independent. Note that X̄ − 0 √ ∼ N (0, 1) 1/ p 57 leading to pX̄ 2 ∼ χ21 . The statements about the rank of B and C can be confirmed by row and column operations on B and C. For example, C can be reduced to columns of zeros except for the last by subtracting the last column on the rhs from the rest. The resulting last row from the top can be subtracted from the rest, to give a single non zero entry, showing the rank of C as one. For B, multiply all rows by p, then add cols 1 to (p − 1), counting from lhs, to col p. Then subtract the resulting row p from rows 1 to (p − 1), counting down from the top of the matrix. Add 1/p of the resulting rows 1 to (p − 1) to row p, and divide by p to show the rank of B as (p − 1). Query Page 3 , (ii), Notes. A necessary and sufficient condition for a symmetric matrix A to be positive definite is that there exists a nonsingular matrix P such that P P 0 = A. 1. If A = P P 0 then x0 (P P 0 )x = (P 0 x)0 P 0 x = Z 0 Z > 0, ∀ Z. Thus if A = P P 0 then A is positive definite. 2. The reverse requires that if A is positive definite, then A can be written as P P 0 . If A is pd, then λi > 0 ∀ i, which means that there exists an R such that x = Ry for which x0 Ax = y 0 R0 ARy = y 0 Dy where D = diag(λ1 , . . . , λn ). √ √ If we define D = dd where d = diag( λ1 , . . . , λn ) and define w = dy, then as a check Thus 0 −1 x0 Ax = d−1 w Dd−1 w = w0 d0 Dd−1 w = w 0 w > 0 ∀ w x = Ry = Rd−1 w so 0 0 x0 Ax = Rd−1 w A Rd−1 w = w 0 Rd−1 ARd−1 w which leads to and so ARd−1 = Rd−1 0 Rd−1 ARd−1 = I 0 −1 =⇒ A = Rd−1 0 −1 Rd−1 −1 So, if A is positive definite, then A can be written as A = P P 0 . 58 = PP0 Chapter 5 Order Statistics 5.1 Introduction Parametric statistics allows us to reduce the data to a few parameters which makes it easier to interpret the data. Statistics such as the mean and variance describe the pattern of random events and allow us to evaluate the probability of events of interest. Under the assumption that the data follow a known parametric distribution, we estimate the parameters from the data. However, the usefulness of the parameters depends on the assumptions about the data being reliable and this is not necessarily guaranteed. One strategy for interpreting data without stringent assumptions, is to use order statistics. Read CB 5.4 or HC 4.6. 
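As a small computational aside (an added illustration, not part of the original notes; the sample and its size are arbitrary), ordering a sample in R is a one-line operation, and the ordered values, together with plotting positions $i/(n+1)$, give the sample cdf used in this chapter.

```r
set.seed(354)                          # arbitrary seed
x <- rexp(10)                          # any continuous distribution will do
y <- sort(x)                           # the order statistics y[1] < ... < y[10]

n <- length(y)
cbind(order.statistic = y,             # Y_i
      sample.cdf = (1:n) / (n + 1))    # step heights i/(n+1)
```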
In the following, we will use the notation of HC, where the pdf of the random variable $X$ is denoted by $f(x)$ rather than $f_X(x)$.

Definition 5.1

Let $X_1, X_2, \ldots, X_n$ denote a random sample from a continuous distribution with pdf $f(x)$, $a < x < b$. Let $Y_1$ be the smallest of these, $Y_2$ the next $X_i$ in order of magnitude, and so on. Then $Y_i$, $i = 1, 2, \ldots, n$, is called the $i$th order statistic of the sample, and $(Y_1, Y_2, \ldots, Y_n)$ the vector of order statistics. We may write $Y_1 < Y_2 < \ldots < Y_n$. The following alternative notation is also common: $X_{(1)} < X_{(2)} < \ldots < X_{(n)}$.

Order statistics are non-parametric and rely only upon the weak assumption that the data are samples from a continuous distribution. We pick up information by ordering the data. If we know the underlying distribution, we can combine that knowledge with the rank of the order statistic of interest. For instance, if the underlying distribution is normal, $Y_{50}$ from a sample of size 101 will have a higher probability of being near the median than $Y_{10}$ or $Y_{90}$. But without ordering, the same could not be said for $X_{10}$, $X_{50}$, $X_{90}$. So the ordering gives us extra information, and we shall now explore the densities of order statistics, denoted $f_{Y_r}(y)$ etc.

Example 1
Suppose you were required to assess the ability to handle a crowd at a railway station with regard to stair width, staff etc. The statistic of interest is $Y_n$.

Example 2
An oil product freezes at $\approx 10^{\circ}$C and the company ponders whether it should market it in a cold climate. We would require the density of the minimum order statistic, $f_{Y_1}(y)$, to assess the risk of the product failing.

Examples of other situations where an order statistic is of interest are:

1. Largest component $Y_n$: maximum temperature, highest rainfall, maximum storage capacity of a dam, etc.

2. Smallest component $Y_1$: minimum temperature, minimum breaking strength of rope, etc.

3. Median: median income, median examination mark, etc.

Order statistics are useful for summarizing data but may be limited for detailed descriptions of some process which has been measured. Order statistics are also ingredients for higher level statistical procedures.

Figure 5.1 shows the sample cdf as a step function increasing by $\frac{1}{n+1}$ at each order statistic. We can make statements about individual order statistics by borrowing information provided by the entire set. Remember that all we assumed about the original data was that they were continuous; there were no assumptions about the distribution. But now that the data are ordered, we can use the extra information provided by the ordering to derive density functions. The data $X_1, \ldots, X_n$ might be independent, but the ordered data $Y_1, \ldots, Y_n$ are not.

To begin our study of order statistics we first want to find the joint distribution of $Y_1, \ldots, Y_n$.

5.2 Distribution of Order Statistics

The following theorem is proved in CB p230 or HC (page 193–195) for k = 3.

Theorem 5.1
(5.1) i=1 Comment: The proof essentially uses the change of variable technique for the case when the transformation is not one to one. That is, we have a transformation of the form Y1 = smallest observation in (X1 , X2 , . . . , Xn ) Y2 = second smallest observation in (X1 , X2 , . . . , Xn ) . . . Yn = largest observation in (X1 , X2 , . . . , Xn ). This has n! possible inverse transformations. You should read and understand the proof given in HC. 61 The Jacobian for the transformation is:- 5.3 n 0 0 . . . 0 0 (n − 1) . . . 0 .. .. .. .. .. = n! . . . . . 0 1 Marginal Density Functions Before engaging in the theory we do a sketch of the information we are modelling. When we derive the distribution of a single order statistic, we divide the underlying distribution into 3. Figure 5.2: An order statistic in an underlying distribution f(y) 2 f(yr ) 1 3 yr y The observed value of the rth order statistic is yr . ie Yr = yr . This is a random variable (Yr will have a different value from a new sample) with density f (yr ). We have 1. (r − 1) observations < yr with probability F (yr ), 2. 1 observation with density f (yr ), 3. (n − r) observations > yr with probability 1 − F (yr ). Ordering and classifying by Yr has produced a form similar to a multinomial distribution with 3 categories , < yr ,= yr ,> yr , and associated with these categories we have the entities F (yr ),f (yr ),1 − F (yr ). 62 The multinomial density function for 3 categories is P (X1 = x1 , X2 = x2 , X3 = x3 ) = n! × pn1 1 pns 2 pn3 3 . n1 !n2 !n3 ! The density of an order statistic has a similar form, fYr (y) = n! × [F (yr )](r−1) × f (yr ) × [1 − F (yr )](n−r) . (r − 1)!1!(n − r)! Note there are 3 components of the density corresponding to the 3 categories. For 2 order statistics, Yr , Ys , there are 5 categories, 1. y < yr 2. y = yr 3. yr < y < ys 4. y = ys 5. y > ys Figure 5.3: Two order statistics in an underlying distribution f(y) 2 f(yr ) 4 f(ys) 1 3 yr 5 y ys 63 From the same analogy to multinomials used for a single order statistic, there are 5 components to the joint density, fYr ,Ys (yr , ys ) = n! × (r − 1)!1!(n − r − 1)!1!(n − s)! [F (yr )](r−1) × f (yr ) × [F (ys) − F (yr )]s−r−1 × f (ys ) [1 − F (ys )](n−s) . Formal derivation of the marginal densities of order statistics Since we know the pdf of Y = (Y1 , Y2 , . . . , Yn ) is given by (5.1), the marginal pdf of the rth smallest component, Yr , can be found by integrating over the remaining (n − 1) variables. Thus fYr (yr ) = Z yr Z yr−1 −∞ −∞ ... Z y2 −∞ "Z ∞ yr ... Z ∞ yn−1 n! n Y # f (yi )dyn . . . dyr+1 dy1 . . . dyr−1 . i=1 (5.2) (The parentheses are inserted as a guide to the integration – they are not actually required.) Notice the order of integration used is to first integrate over yn , then yn−1 , . . . and then yr+1 (this is the part of (5.2) enclosed by the parentheses). This is followed by integration over y1 , then y2 , . . ., and finally over yr−1 . The limits of integration are obtained from the inequalities. ∞ > yn > yn−1 > . . . > yr+1 > yr and −∞ < y1 < y2 < . . . < yr−1 < yr . In order to integrate (5.1), we first have Z ∞ yr = ... Z Z ∞ yn−2 ∞ yr Z ... ∞ n Y yn−1 i=r+1 Z Z ∞ yn−2 f (yi )dyn . . . dyr+1 [1 − F (yn−1 )]f (yn−1 ) = = [1 − F (yr )]n−r , on simplification. (n − r)! yr ... ∞ yn−3 f (yi )dyn−1 . . . dyr+1 i=r+1 Z ∞ n−2 Y n−3 Y [1 − F (yn−2 )]2 f (yn−2 ) f (yi )dyn−2 . . . dyr+1 2! i=r+1 (5.3) Similarly Z yr Z yr−1 −∞ −∞ ... Z y2 r−1 Y −∞ i=1 f (yi )dy1 . . . dyr−1 = Z yr −∞ 64 ... Z y3 −∞ F (y2 )f (y2 ) r−1 Y i=3 f (yi )dy2 . . . 
dyr−1 Z = yr −∞ ... Z y4 −∞ r−1 Y [F (y3 )]2 f (yi )dy3 . . . dyr−1 f (y3 ) 2! i=4 = [F (yr )]r−1 /(r − 1)!, on simplification. (5.4) Hence using (5.3) and (5.3) in (5.2), we obtain fYr (yr ) = n!f (yr ) = n!f (yr ) Z yr −∞ ... Z y2 −∞ " # Y [1 − F (yr )]n−r r−1 f (yi )dy1 . . . dyr−1 (n − r)! i=1 [1 − F (yr )]n−r [F (yr )]r−1 . (n − r)! (r − 1)! so that the marginal p.d.f. of Yr is given by fYr (yr ) = n! [F (yr )]r−1 [1 − F (yr )]n−r f (yr ), for − ∞ < yr < ∞ . (5.5) (n − r)!(r − 1)! The probability density functions of both the minimum observation (r = 1) and the maximum observation (r = n) are special cases of (5.5). For r = 1, fY1 (y1 ) = n[1 − F (y1 )]n−1 f (y1 ), −∞ < y1 < ∞ (5.6) fYn (yn ) = n[F (yn )]n−1 f (yn ) (5.7) For r = n, The integration technique can be applied to find the joint pdf of two (or more) order statistics, and this is done in 5.4. Before examining that, we will give an alternative (much briefer) derivation of (5.7). Let the cdf of Yn be denoted by FYn . For any value y in the range space of Yn , the cdf of Yn is FYn (y) = P (Yn ≤ y) = P(all n observations ≤ y) = [P(an observation ≤ y)]n = [F (y)]n The pdf of Yn is thus fYn (y) = FY0 n (y) = n[F (y)]n−1 .f (y), a < y < b. Of course y in the above is just a dummy, and could be replaced by yn to give (5.7). Exercise Use this technique to prove (5.6). 65 Example 5.1 Let X1 , . . . Xn be a sample from the uniform distribution f (x) = 1/θ , 0 < x < θ. Find a 100(1 − α)% CI for θ using the largest order statistic, Yn . By definition, 0 < Y 1 < Y2 < . . . < Y n < θ . So Yn will suffice as the lower limit for θ. Given the information gleaned from the order statistics, what is the upper limit? Using the above result for the density of the largest order statistic, fYN (yn ) = n [F (yn )]n−1 f (yn ) y n−1 = n nn θ Choose c such that, P (cθ < Yn < θ) Z θ nynn−1 dyn θn cθ n θ yn θ n cθ 1 − cn = 1−α = 1−α = 1−α 1 = 1 − α ⇒ c = αn Therefore, 1 P θα n < Yn < θ = 1 − α 1 1 1 < < P = 1 − α since monotone decreasing 1 θ Yn n θα Yn P Yn < θ < 1 = 1−α αn 1 A 100(1 − α)% CI for θ is given by (Yn , Yn α− n ). Verification of the multinomial formulation for marginal distributions For n = 2, the multinomial formulation becomes a binomial. smallest Now P1 ∝ f (y1 ), and P2 = P (Y2 > y1 ) = 1 − F (y1 ) = P (obs > y1 ), So 66 a P1 P2 ↓ − − − → Y1 Y2 b 1 1 fY1 (y1 ) ∝ 2P11 P21 = 2f (y1 )[1 − F (y1 )], a < y1 < b in agreement with equation (5.6) for n=2. largest P2 P1 ← − − − ↓ a Y1 Y2 1 1 b Now P1 ∝ f (y2 ), and P2 = P (Y1 < y2 ) = F (y2 ) = P (obs < y2 ), So fY2 (y2 ) ∝ 2P11 P21 = 2f (y2 )F (y2 ), a < y2 < b in agreement with equation (5.7) for n=2. For n = 3, we have a trinomial. Consider the median Y2 . P2 P1 P3 ← − − − ↓ − − − → a Y1 Y2 Y3 b 1 1 1 Now P1 ∝ f (y2 ), P2 = P (obs < y2 ) = F (y2 ) and P3 = P (obs > y2 ) = 1 − F (y2 ) to give fY2 (y2 ) ∝ 3!P11 P21 P31 = 6f (y2 )F (y2 )[1 − F (y2 )], a < y2 < b as per equation (5.5) with n = 3 and r = 2. 5.4 Joint Distribution of Yr and Ys The joint pdf of the order statistics for a sample of size two is derived. The original sample is X1 , X2 while the order statistics are denoted by Y1 , Y2 . Thus Y1 = X1 or X2 and Y2 = X1 or X2 . Thus the transformation is not 1:1. 67 The space A1 is defined by a < x1 < x2 < b and A2 is a < x2 < x1 < b, giving A = A1 + A2 . Both these regions map into the space B defined by a < y1 < y2 < b. (You should draw these regions.) 
In A1 we have X1 = Y 1 , Y 1 = X 1 = ψ 1 X2 = Y 2 , Y 2 = X 2 = ψ 2 with Jacobian 1 0 =1 J1 = 0 1 In A2 we have X1 = Y 2 , Y 1 = X 2 = ψ I X2 = Y1 , Y2 = X1 = ψII with Jacobian 0 1 = −1 J1 = 1 0 This gives the joint density for Y1 and Y2 as fY (y1 , y2 ) = abs|J1 |f (ψ1 , ψ2 ) + abs|J2 |f (ψI , ψII ), (y1 , y2 ) B = 2f (y1 , y2 ) = 2f (y1 )f (y2 ), a < y1 < y2 < b since X2 and X2 are iid f (x). Marginal Density Functions Case : n = 2 We first examine the case n = 2 observations. Smallest OS fY1 (y1 ) = Z fY (y1 , y2 )dy2 = Z y2 =b y2 =y1 2f (y1 )f (y2 )dy2 = 2f (y1 ) [F (y2 )]by1 = 2f (y1 ) [1 − f (y1 )] , a < y1 < b 68 Largest OS fY2 (y2 ) = Z fY (y1 , y2 )dy1 = Z y1 =y2 y1 =a 2f (y1 )f (y2 )dy1 = 2f (y2 ) [F (y1 )]ya2 = 2f (y2 )F (y2 ), a < y2 < b These results can be compared with those n = 2, viz, equations (5.6) and (5.7). (5.6) gives fY1 (y1 ) = n[1 − F (y1 )]n−1 f (y1 ) = 2[1 − F (y1 )]1 f (y1 ), a < y1 < b which is the same as the previous result for the smallest order statistic for a sample size of two. (5.7) gives fYn (yn ) = n[F (yn )]n−1 f (yn ) = 2[F (y2 )]1 f (y2 ), a < y2 < b which is the same as the previous result for the largest order statistic for a sample size of two. Case : n = 3 Now for the case n = 3 as per H and C, p193–195, 5th ed. We have X1 , X2 , X3 → Y1 , Y2 , Y3 and the joint density is fY (y1 , y2 , y3 ) = 3!f (y1 )f (y2 )f (y3 ), a < y1 < y2 < y3 < b First, find the distribution of Y2 (the median). fY2 (y2 ) = = 6f (y2 ) Z Z Z b y3 =y2 fY (y1 , y2 , y3 )dy1 dy3 = f (y3 ) = 6f (y2 ) Z Z y1 =y2 a b y3 =y2 Z y3 =b y3 =y2 Z y1 =y2 y1 =a 6f (y1 )f (y2 )f (y3 )dy1 dy3 f (y1 )dy1 dy3 = 6f (y2 ) Z b y3 =y2 f (y3 ) [F (y1 )]ya2 dy3 f (y3 )F (y2 )dy3 = 6f (y2 )F (y2 ) [F (y3 )]by2 = 6f (y2 )F (y2 )[1 − F (y2 )], a < y2 < b This can be verified by using n = 3, r = 2 on equation (5.5). The marginal distribution of Y1 , the smallest observation : 69 Z Z fY1 (y1 ) = Z Z = = 6f (y1 ) fY (y1 , y2 , y3 )dy2 dy3 , a < y1 < y2 < y3 < b fY (y1 , y2 , y3 )dy3 dy2 = Z y2 =y1 = 6f (y1 ) Z b y2 =y1 y2 =b y2 =y1 Z y3 =b y3 =y2 6f (y1 )f (y2 )f (y3 )dy3 dy2 ! f (y3 )dy3 dy2 = 6f (y1 ) Z f (y2 )[1 − F (y2 )]dy2 = 6f (y1 ) Z Z b Z b y3 =y2 " [1 − F (y2 )]2 = 6f (y1 ) (−1) 2 So b y2 =y1 b y1 f (y2 )[F (y3 )]by2 dy2 [1 − F (y2 )]dF (y2 ) #b y1 # " (1 − F (b))2 (1 − F (y1 ))2 − (−1) fY1 (y1 ) = 6f (y1 ) 2 2 " # (1 − F (y1 ))2 = 6f (y1 ) 0 + = 3f (y1 )(1 − F (y1 ))2 , a < y1 < b 2 as per equation (5.6), for the case n=3. The marginal distribution of Y3 , the largest observation : fY3 (y3 ) = = 6f (y3 ) Z Z Z y3 =y3 a fY (y1 , y2 , y3 )dy1 dy2 = f (y2 ) = 6f (y3 ) Z Z y1 =y2 a y2 =y3 a Z y2 =y3 y2 =a Z y1 =y2 y1 =a f (y1 )dy1 dy2 = 6f (y3 ) f (y2 )F (y2 )dy2 = 6f (y3 ) " F (y2 )2 = 6f (y3 ) 2 # y3 Z a Z 6f (y1 )f (y2 )f (y3 )dy1 dy2 y2 =y3 a y2 =y3 f (y2 ) [F (y1 )]ay1 =y2 dy2 F (y2 )dF (y2 ) = 3f (y3 )F (y3 )2 , a < y3 < b a as per equation (5.7), for n = 3. 70 The General Case – joint distribution The joint p.d.f. of Yr and Ys , (r < s) is found by integrating over the other (n−2) variables. Then Z fYr ,Ys (yr , ys ) = ∞ ys ... Z ∞ yn−1 (Z ys ... yr Z yr+3 yr Z yr+2 yr "Z yr −∞ ... Z y3 Z y2 −∞ −∞ n! n Y f (yi )dy1 . . . dyr−1 i=1 . dyr+1 . . . dys−1 } dyn . . . dys+1 (5.8) The order of integration is first over y1 , y2 , . . . , to yr−1 , then over yr+1 , yr+2 , . . . , to ys−1 and finally over yn , yn−1 , . . . , to ys+1 . The limits of integration are obtained from the inequalities −∞ < y1 < y2 < . . . < yr−1 < yr , yr < yr+1 < . . . < ys−1 < ys , ∞ > yn > yn−1 > . . . > ys+1 > ys . 
In order to integrate (5.8) we use methods similar to (5.3) and (5.4) together with Z ys yr ... = Z Z yr+3 yr ys yr ... Z Z Z yr+2 s−1 Y yr yr+3 yr f (yi )dyr+1 dyr+2 . . . dys−1 i=r+1 [F (yr+2 ) − F( yr )]fX (yr+2 ) s−1 Y = = [F (ys ) − F (yr )]s−r−1 , on simplification. (s − r − 1)! yr ... yr+4 yr f (yi )dyr+2 . . . dys−1 i=r+3 s−1 Y Z ys [F (yr+3 ) − F (yr )]2 f (yr+3 ) f (yi )dyr+3 . . . dys−1 2! i=r+4 (5.9) Thus we get fYr ,Ys (yr , ys ) = n!f (yr )f (ys ) Z ∞ ys ... Z ∞ yn (Z ys yr ) ... Z yr+2 yr n [F (yr )]r−1 Y f (yi ). (r − 1)! i=r+1 i6=s . dyr+1 . . . dys−1 dyn . . . dys+1 ) Z ∞ ( [F (yr )]r−1 Z ∞ [F (ys ) − F (yr )]s−r−1 = n!f (yr )f (ys ) . ... (r − 1)! ys (s − r − 1)! yn−1 . n Y f (yi )dyn . . . dys+1 i=s+1 [F (yr )]r−1 [F (ys ) − F (yr )]s−r−1 [1 − F (ys )]n−s = n!f (yr )f (ys ) . (r − 1)! (s − r − 1)! (n − s)! 71 # Hence the pdf of (Yr , Ys ) is given by fYr ,Ys (yr , ys ) = n! [F (yr )]r−1 [F (ys ) − F (yr )]s−r−1 . (r − 1)!(s − r − 1)!(n − s)! .[1 − F (ys )]n−s f (yr )f (ys). (5.10) We now give an alternative derivation of the special case of (5.10) where r = 1, s = n. In this method we first find the joint cumulative distribution function, then derive the joint pdf from it by differentiation. The joint cdf of Y1 and Yn is P (Y1 ≤ y1 , Yn ≤ yn ). Note that {Yn ≤ yn } = {Y1 ≤ y1 ∩ Yn ≤ yn } ∪ {Y1 > y1 ∩ Yn ≤ yn } where the two events on the RHS are mutually exclusive. So P (Yn ≤ yn ) = P (Y1 ≤ y1 ∩ Yn ≤ yn ) + P (Y1 ≥ y1 ∩ Yn ≤ yn ) P (Y1 ≤ y1 ∩ Yn ≤ yn ) = P (all n obs. ≤ yn ) − P (all n obs. are between y1 and yn ) So So the joint pdf is FY1 ,Yn (y1 , yn ) = [F (yn )]n − [F (yn ) − F (y1 )]n , y1 ≤ yn i ∂ h ∂ ∂ n (F (yn ))n−1 f (yn ) − n [F (yn ) − F (y1 )]n−1 f (yn ) [FY1 ,Yn (y1 , yn )] = ∂y1 ∂yn ∂y1 = 0 + n(n − 1) [F (yn ) − F (y1 )]n−2 f (yn )f (y1 ) (5.11) which is (5.10) with r = 1 and s = n. The multinomial formulation of the joint distribution of two order statistics has been given earlier. The 5 components and their probabilities are shown in Figure 5.4. Obsn. P rob. 1, . . . , (r − 1) r (r + 1), . . . , (s − 1) s (s + 1), . . . , n F (yr ) f (yr ) F (ys ) − F (yr ) f (ys ) 1 − F (ys ) ←− ↔ ←→ ↔ −→ #obsn. (r − 1) 1 (s − r − 1) 1 (n − s) Figure 5.4: Multinomial probabilities for the joint order statistics For the general case, the joint distribution function is obtained by integration. The procedure is demonstrated for the simple cases n = 2 and n = 3, for the smallest and largest observations. 72 (Case n = 2) In this case, the joint distribution function is known already as fY (y1 , y2 ) = 2f (y1 )f (y2 ), a < y1 < y2 < b which can be seen to be that given by equation (5.10), for n = 2, r = 1 and s = 2. (Verify!) (Case n = 3) We now want fY1 ,Y3 (y1 , y3 ), for example to find the distribution of the range. We know fY (y1 , y2 , y3 ) = 6f (y1 )f (y2 )f (y3 ) and so we need to integrate out y2 . Thus fY1 ,Y3 (y1 , y3 ) = Z 6f (y1 )f (y2 )f (y3 )dy2 = 6f (y1 )f (y3 ) Z y2 =y3 y2 =y1 f (y2 )dy2 = 6f (y1 )f (y3 ) [F (y2 )]yy31 = 6f (y1 )f (y3 ) [F (y3 ) − F (y1 )] , a < y1 < y3 < b This is the same as equation (5.10), for n = 3, r = 1 and s = 3. (Verify!) 5.5 The Transformation F (X) Theorem 5.2 (Probability Integral Transform) Let the random variable X have cdf F (x). If F (x) is continuous, the random variable Z produced by the transformation Z = F (X) (5.12) has the uniform probability distribution over the interval 0 ≤ z ≤ 1. Proof See CB p54 or HC 4.1, p 161. 
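Theorem 5.2 is easy to illustrate by simulation. The R sketch below is an added illustration (not part of the original notes); the exponential model, rate and sample size are arbitrary choices. It transforms an exponential sample through its own cdf and compares the result with the $U(0,1)$ distribution.

```r
set.seed(354)                       # arbitrary seed
n <- 1e4                            # arbitrary sample size
lambda <- 2                         # arbitrary rate
x <- rexp(n, rate = lambda)         # X ~ f(x), with cdf F(x) = 1 - exp(-lambda * x)
z <- pexp(x, rate = lambda)         # Z = F(X)

c(mean(z), var(z))                  # approx 1/2 and 1/12, the U(0,1) moments
ks.test(z, "punif")                 # no evidence against Z ~ U(0, 1)
```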
The above result is a useful ploy for inference using order statistics and the following diagram (Figure 5.5) illustrates the connections amongst the underlying distribution, the observed data and their order statistics, and the transform of these to a set of data (Z i ) whose distribution is known. 73 The top row is an expression for Theorem 5.2. The second row is the one-to-one mapping of samples from F (x) to samples from the uniform distribution. The ordered variables Y1 . . . Yn are transformed to ordered Zi and properties about the original data, X, may be discerned from the Z(1) . . . Z(n) . Figure 5.5: Probability Integral Transform of order statistics. X ∼ f (x), F (x) ? Z = F (X) - ? Zi = F (Xi ) - X1 . . . X n ? Z ∼ g(Z) = 1 G(z) = z 0<z<1 Z1 . . . Zn ? Z(i) = F (Yi ) - Y1 . . . Y n Z(1) . . . Z(n) Theorem 5.3 Consider (Y1 , Y2 , . . . , Yn ), the vector of order statistics from a random sample of size n from a population with a continuous cdf F . Then the joint pdf of the random variables Z(i) = F (Yi ), i = 1, 2, . . . , n (5.13) is given by fZ (z(1) , z(2) , . . . , z(n) ) = ( n! for 0 < z(1) < . . . < z(n) < 1 0 elsewhere Proof See HC 11.2, p 502. 74 (5.14) Since Z(i) = F (Yi ), then fZ [z(1) , z(2) , . . . , z(n) ] = n! Y g[z(i) ] = n! Y g(zi ) i i but since g(zi ) = 1 ∀ i , 0 < zi < 1, by Thm 5.2, then fZ [z(1) , z(2) , . . . , z(n) ] = n! , 0 < z(1) < . . . < z(n) < 1 Theorem 5.4 The marginal pdf of Z(r) = F (Yr ) is given by fZ(r) (z(r) ) = n! z r−1 (1 − z(r) )n−r , 0 < z(r) < 1. (r − 1)!(n − r)! (r) (5.15) It can be seen that Z(r) ∼ Beta(r, n − r + 1) so its mean is E(Z(r) ) = r n+1 (5.16) Note: The Beta density is given by f (x; a, b) = 1 Γ(a)Γ(b) × xa−1 (1 − x)b−1 × I(0,1) (x) where B(a, b) = B(a, b) Γ(a + b) and you may need to revise the Gamma function. Proof We need to integrate out all variables except Z(r) . Thus fZ(r) [z(r) ] = n! Z z(r) 0 Z z(r−1) 0 ... Z z(2) 0 "Z 1 z(r) Z 1 z(r+1) ... Z 1 z(n−1) # dz(n) . . . dz(r+1) dz(1) . . . dz(r−1) Note that the correct order of integration is determined by the inequalities 1 > z(n) > z(n−1) > . . . > z(r+1) > z(r) and 0 < z(1) < z(2) < . . . < z(r−1) < z(r) Successive integrations yield the final result. Thus Z z(2) 0 dz(1) = z(2) 75 and for the inner group Z 1 z(n−1) dz(n) = [1 − z(n−1) ] leading to the two different terms in z for fZ(r) [z(r) ]. Alternatively, simply use equation (5.5) on page 45 of the Notes, with G = F = z and the lower and upper limits being 0 and 1 respectively. Theorem 5.5 The joint pdf of Z(r) and Z(s) fZ(r) ,Z(s) (z(r) , z(s) ) = (r < s) is given by n! r−1 z(r) (z(s) − z(r) )s−r−1 (1 − z(s) )n−s (r − 1)!(s − r − 1)!(n − s)! for 0 < z(r) < z(s) < 1 (5.17) Proof This is left as an exercise. You need only to notice Z(r) and Z(s) have uniform distributions and use (5.10), with lower and upper limits of 0 and 1 on z = y and note that G = F = z, since Z ∼ U (0, 1). 5.6 Examples Example 1 Distribution of the Sample Median The sample median M is defined as M= Y n+1 2 for n odd [Y n2 + Y n2 +1 ]/2 for n even. For the case of n odd, replace r by (n + 1)/2 in (5.5) on page 65 . For the case of n even, let n = 2m, and U = [Ym + Ym+1 ]/2. Then fYm ,Ym+1 (ym , ym+1 ) = (2m)! [F (ym )]m−1 [1 − F (ym+1 )]m−1 f (ym )f (ym+1 ). [(m − 1)!]2 Define u and v as follows u = (ym + ym+1 )/2 v = ym+1 . 76 (5.18) Then ym = 2u − v ym+1 = v and |J| = 2. Thus fU,V (u, v) = (2m)! 
[F (2u − v)]m−1 [1 − F (v)]m−1 f (2u − v).f (v).2 [(m − 1)!]2 −∞<u<v <∞ and integrating with respect to v we obtain the pdf of U (the sample median for a sample size n = 2m), 2(2m)! Z ∞ fU (u) = [F (2u − v)]m−1 [1 − F (v)]m−1 f (2u − v).f (v)dv. 2 [(m − 1)!] u Example 2 (5.19) Distribution of the Sample Midrange For an ordered sample Y1 < Y2 < . . . < Yn , this is defined as 12 (Y1 + Yn ), and its pdf can be found for a particular distribution, beginning with (5.11) and using the technique of bivariate transformation. Example 3 Distribution of the Sample Range Distribution of the range, for n = 2. In this case, the range R is defined by R = Y2 − Y1 . The transformation can be written as R = Y 2 − Y 1 , Y1 = S = ψ 1 S = Y 1 , Y2 = R + S = ψ 2 with Jacobian ∂ψ1 /∂R ∂ψ2 /∂R J= ∂ψ1 /∂S ∂ψ2 /∂S The original region A is defined by 0 1 = = −1 1 1 a < Y 1 < Y2 < b The transformed region B is defined by (r, s) such that (y1 , y2 ) A. Thus 77 Y1 > a → S > a Y1 < b → S < b Y2 < b → R + S < b ; R < b − a Y2 > a → R + S > a (redundant) Y2 > Y 1 → R + S > S ; R > 0 This region B should be sketched, to check limits for integration etc. Thus the joint distribution of R and S is now fR,S (r, s) = f (ψ1 , ψ2 )abs|J| = 2f (s) × f (r + s) × 1, (r, s) B For the special case of the uniform distribution, U (a, b), we have that f (x) = 1 , a<x<b b−a This gives fR,S (r, s) = and fR (r) = Z fR,S (r, s)ds = Z s=b−r s=a 2 (b − a)2 2(b − r − a) 2 ds = , 0<r <b−a 2 (b − a) (b − a)2 as per the general case with n = 2. The general case For an ordered sample Y1 < Y2 < . . . < Yn , the sample range is R = Yn − Y1 . Assuming that the sample is from a continuous distribution with pdf f (x), a < x < b and cdf F (x), the joint pdf of Y1 and Yn is given by equation (5.11) on page 72. Finding the distribution of R becomes a problem in bivariate transformations. Define r = yn − y1 and v = y1 . The inverse relationship, which is one-to-one, is y1 = v and yn = r + v, with |J| = 1. So we have fR,V (r, v) = fY1 ,Yn (y1 , yn )|J| = n(n − 1)[F (r + v) − F (v)]n−2 f (r + v) f (v). 78 To find the range space, we deduce from the fact that a < y1 < yn < b, v > a, r > 0 and r + v < b or v < b − r. So the range space is a < v < b − r; 0 < r < b − a, and fR (r) = Z b−r a n(n − 1) [F (r + v) − F (v)]n−2 f (v)f (r + v)dv, 0 < r < b − a. As a special case for f (x) = We have f (v) = 1 ,a b−a < x < b, find fR (r). 1 1 , a < v < b and f (r + v) = , a < r + v < b, i.e. a < v < b − r. b−a b−a Now F (r + v) − F (v) = Z r+v v So Z r 1 dx = b−a b−a n−2 r dv fR (r) = n(n − 1) · b−a (b − a)2 a n(n − 1)r n−2 (b − r − a) , 0 < r < b − a. = (b − a)n Example 4 b−r Estimating Coverage The ith coverage is defined as Ui = F (Yi ) − F (Yi−1 ) = Z(i) − Z(i−1) being the area under the density function f (y) between Yi and Yi−1 . To find the distribution of Ui we need the joint distribution of Z(i) and Z(i−1) . This is given by Thm (5.5) equation (5.17), with r = i − 1 and s = i. The joint distribution of Z(r) and Z(s) is given by fZ(r) ,Z(s) [z(r) , z(s) ] = n! z r−1 [z(s) − z(r) ]s−r−1 [1 − z(s) ]n−s , (r − 1)!(s − r − 1)!(n − s)! (r) 0 < z(r) < z(s) < 1 This becomes, under r = i − 1 ands = i fZ(i−1) ,Z(i) [z(i−1) , z(i) ] = n! z i−2 [z(i) − z(i−1) ]0 [1 − z(i) ]n−i , (i − 2)!0!(n − i)! (i−1) 0 < z(i−1) < z(i) < 1 The remainder of the derivation is left as an assignment question, using the outline given below. 
79 Figure 5.6: Coverage 0.0 0.1 0.2 0.3 f (y) y Yi−1 Yi Define the ith coverage as Ui = F (Yi ) − F (Yi−1 ), the area under the density f (y) 0 2 4 between Yi and Yi−1 , as shown in Figure (5.5). We have data,6 X1 . . . Xn but do not assume a parametric form for the distribution and rely on order statistics to estimate coverage. From theorem 5.5, Z = F (X) ∼ U nif (0, 1) and Z(1) . . . Z(n) is an ordered sample from U nif (0, 1). By definition, Ui = Z(i) − Z(i−1) . From theorem 5.5, Z(i) ∼ Beta(i, n − i + 1) and Z(i−1) ∼ Beta(i − 1, n − i + 2). Theorem 5.5 gives the joint distribution of Z(i−1) , Z(i) leading to the joint distribution of Z(i−1) , U(i) . Integrating wrt Z(i−1) gives the distribution of Ui , Ui ∼ Beta(1, n). Therefore E(Ui ) = 1/(n + 1) and var(Ui ) = n/{(n + 1)2 (n + 1)}. 5.7 Worked Examples : Order statistics Example 1 The cdf and df of the smallest order statistic. Solution The cdf first : FY1 (y1 ) = P (Y1 ≤ y1 ) = P (at least one of n obs ≤ y1 ) = 1 − P (no obs ≤ y1 ) 80 giving Thus the df is FY1 (y1 ) = 1 − P (all obs > y1 ) = 1 − [1 − F (y1 )]n . fY1 (y1 ) = FY0 1 (y1 ) = n[1 − F (y1 )]n−1 f (y1 ). Example 2 Suppose links of a chain are such that the population of individual links having breaking strengths Y ( Kg) has the df f (y) = λe−λy , y > 0, where λ is a positive constant. If a chain is made up of 100 links of this type taken at random from the populations of links, what is the probability that such a chain would have a breaking strength exceeding K kilograms? Interpret your results. Solution Since the breaking strength of a chain is equal to the breaking strength of its weakest link, the problem reduces to finding the probability that the smallest order statistic in a sample of 100 will exceed K. We have and but P (Y1 > K) = 1 − P (Y1 < K) = 1 − FY1 (K) FY1 (y1 ) = 1 − [1 − F (y1 )]n F (y1 ) = and so giving Z y1 0 λe−λy dy = 1 − e−λy1 , y1 > 0 FY1 (y1 ) = 1 − [e−λy1 ]100 = 1 − e−100λy1 P (Y1 > K) = 1 − FY1 (y1 = K) = e−100λK . The df for Y1 is fY1 (y1 ) = FY0 1 (y1 ) = 100λe−100λy1 , y1 > 0 and so E(Y ) = 1/λ, E(Y1 ) = 1/(100λ) which explains the extreme quality control used for high performance units (like chains) which are made up of large numbers of similar components. Example 3 A random sample of size n is drawn from a U (0, θ) population. 81 1. Suppose that kYn is used to estimate θ. Find k so that E(kYn −θ)2 is a minimum. 2. What is the probability that all the observations will be less than cθ for 0 < c < 1? Solution 1. Now, fYn (yn ) = nyn n−1 θ −n , 0 < yn < θ ie, Yn /θ ∼ Beta(n, 1). Thus E(Yn /θ) = n/(n + 1) while V (Yn /θ) = n . (n + 1)2 (n + 2) Now E(kYn − θ)2 = k 2 E(Yn )2 + θ 2 − 2kθE(Yn ) =k 2 " = k 2 [V (Yn ) + [EYn ]2 ] + θ 2 − # 2kθ 2 n n+1 nθ 2 n2 θ 2 2kθ 2 n 2 + + θ − (n + 1)2 (n + 2) (n + 1)2 (n + 1) This will be a minimum when dV = 0, dk ie, when 2kn2 θ 2 2θ 2 n 2knθ 2 + − =0 (n + 1)2 (n + 2) (n + 1)2 (n + 1) Thus k= (n + 2) . (n + 1) 2. The probability is in general P (Yn < φ) = F (Yn (φ)) = F (yn (φ))n = (φ/θ)n . So if φ = cθ, then P (Yn < cθ) = cn . 82 Example If X1 , . . . , Xn is a random sample from a uniform distribution with pdf fX (x) = 1/θ, 0 < x < θ with order statistics Y1 , . . . , Yn , show that Y1 /Yn and Yn are independent. Solution The joint distribution of Y1 and Yn is FY1 ,Yn (y1 , yn ) = n(n − 1) [F (yn ) − F (y1 )]n−2 f (y1 )f (yn ), 0 < y1 < yn < θ as per (5.11) page 48 of the Notes. This simplifies to f (y1 , yn ) = n(n − 1) yn − y 1 θ n−2 1 , 0 < y 1 < yn < θ θ2 since f (y) = 1/θ and F (y) = y/θ. 
The transformations are U = Y1 /Yn , Y1 = U V = ψ1 (U, V ) V = Yn , Yn = V = ψ2 (U, V ) The Jacobian of the transformation is ∂ψ1 /∂u ∂ψ1 /∂v J= ∂ψ2 /∂u ∂ψ2 /∂v The joint distribution of U and V is then v − uv fU,V (u, v) = n(n − 1) θ v u = =v 0 1 n−2 v , (u, v) B θ2 The region A is defined by 0 < Y1 < Yn < θ while B is defined by 0 < U < 1 and 0 < V < θ, as obtained from the inequalities Y1 < Yn −→ U V < V ; U < 1 0 < Y1 −→ U V > 0 ; U, V > 0 Yn < θ −→ V < θ Yn > 0 −→ V > 0 The joint distribution factorises, viz fU,V (u, v) = (n − 1)(1 − u)n−2 × n n−2 v θ 83 n−1 v v = (n − 1)(1 − u)n−2 n 2 θ θ 1 θ Thus fU (u) = Z θ 0 f (u, v)dv = (n − 1)(1 − u)n−2 and so U = Y1 /Yn ∼ B(1, n − 1), 0 < u < 1 Also n−1 Z 1 1 v v , 0< <1 fV (v) = f (u, v)du = n θ θ θ 0 and so V /θ = Yn /θ ∼ B(n, 1) independently of U = Y1 /Yn since fU,V (u, v) = fU (u)fV (v) Note also that the distribution of V = Yn is the distribution of the largest order statistic, as per equation (5.7). Thus fYn (yn ) = n[F (yn )]n−1 f (yn ) = n in line with the df for V = Yn . 84 yn θ n−1 1 , 0 < yn < θ θ Chapter 6 Non-central Distributions 6.1 Introduction Recall that if Z is a random variable having a standard normal distribution then Z 2 has a chi-square distribution with one degree of freedom. Furthermore, if Z1 , Z2 , . . . , Zp P are independent and each Zi is distributed N (0, 1) then the random variable pi=1 Zi2 has a chi-square distribution with p degrees of freedom. Suppose now that the means of the normal distributions are not zero. We wish to find P the distributions of Zi2 and pi=1 Zi2 . Definition 6.1 Let Xi be distributed as N (µi , 1), i = 1, 2, . . . , p. Then Xi2 is said to have a noncentral chi-square distribution with one degree of freedom and non-centrality parameter P µ2i , and pi=1 Xi2 has a non-central chi-square distribution with p degrees of freedom and P non-centrality parameter λ where λ = pi=1 µ2i . Notation If W has a non-central chi-square distribution, with p degrees of freedom and noncentrality parameter λ we will write W ∼ χ2p (λ). Of course if λ = 0 we have the usual χ2 distribution, sometimes called the central chi-square distribution. The term non-central can also apply in the case of the t-distribution. Recall that √ Z where Z ∼ N (0, 1), W ∼ χ2ν and Z and W are independent has a t-distribution W/ν with parameter ν. When the variable in the numerator has a non-zero mean then the distribution is said to be non-central t. [Non-central t and F distributions are defined in 5.4.] A common use of the non-central distributions is in calculating the power of the χ2 , t and F tests and in such applications as robustness studies. 85 6.2 Distribution Theory of the Non-Central Chi-Square The following theorem is of considerable help in deriving results concerning the non-central chi-square distribution. Theorem 6.1 A random variable W ∼ χ2p (λ) can be represented as the sum of a non-central chisquare variable with one degree of freedom and non-centrality parameter λ and a (central) chi-square variable with p − 1 degrees of freedom where the two variables are independent. Proof Let X1 , X2 , . . . , Xp be independently distributed, where Xi ∼ N (µi , 1) and write X = (X1 , X2 , . . . , Xp ). Define p 0 W = X Xi2 . (6.1) i=1 Choose an orthogonal matrix B such that the elements in the first row are defined by 1 b1j = µj λ− 2 for j = 1, 2, . . . , p (6.2) P where λ = pj=1 µ2j . Define Y 0 = (Y1 , Y2 , . . . , Yp ) by the orthogonal transformation Y = BX . (6.3) Then using the result of Assignment 2, Q. 
9, we see that Y ∼ Np (Bµ, BIB0 ). That is, since B is orthogonal Y ∼ Np (Bµ, I) (6.4) But, the mean of the vector Y can be written as E(Y) = Bµ = P b1j µj Pb µ 2j j . . . P bpj µj where, using (6.2) we have E(Y1 ) = X b1j µj = X 1 1 1 µ2j λ− 2 = λ/λ 2 = λ 2 . Further, since the rows of B are mutually orthogonal E(Yi ) = p X bij µj = 0 for i = 2, 3, . . . , p . j=1 86 From (6.3), Y1 , Y2 , . . . , Yp is a set of independent normally distributed random variables. Also W = X0 X = (B−1 Y)0 B−1 Y = Y 0 (B−1 )0 B−1 Y = Y 0 Y = Y12 + p X Yi2 i=2 = V +U (6.5) Pp where V = Y12 and U = i=2 Yi2 . Since U depends only on Y2 , . . . , Yp , U is independent of V . Furthermore, 1 2 2 Y1 ∼ N (λ , 1) so that Y1 is distributed as a chi-square with one degree of freedom and non-centrality parameter λ. Also, U is distributed as a chi-square with (p − 1) degrees of freedom since Y2 , . . . , Yp are independently and identically distributed N (0, 1). This completes the proof. This theorem will now be used to derive the density function for a random variable with a non-central χ2 distribution. Theorem 6.2 If W ∼ χ2n (λ), then the probability density function of W is 1 λ 1 e− 2 w e− 2 w 2 n−1 1 1 1 (wλ) gW (w) = + . 1+ 1 1 n n 2 n(n + 2) 2! 2 2 Γ( 2 n) wλ 2 !2 0≤w<∞. + . . . , (6.6) Proof: P Write V = Y12 and U = ni=2 Yi2 , so that by Theorem 6.1, we can write W as W = V +U . Now U ∼ χ2n−1 (0), so that the probability density function of U is 1 fU (u) = e− 2 u u n−1 −1 2 1 2 2 (n−1) Γ[ 12 (n − 1)] , 0≤u<∞. (6.7) In Example 3.1 of Chapter 3, the density function of V was found to be 1 fV (v) = v λ v − 2 e− 2 e− 2 3 2 2 Γ( 12 ) 1 1 e (vλ) 2 +e −(vλ) 2 . (6.8) But U and V are independent so that the joint p.d.f. of U and V is fU,V (u, v) = fU (u)fV (v) = u 2 (n−3) 2 (n+2) 2 e− (v+u) 2 1 λ v − 2 e− 2 Γ( 12 )Γ[ 12 (n − 1)] 87 1 1 e(λv) 2 + e−(λv) 2 . Define random variables W and T by Then ( W =U +V T =U . ( V =W −T U =T . The original variable space A, (U, V ) is defined by U > 0, V > 0. The transformation is W = U + V, T = U while the space B, (T, W ) is defined by T > 0, W > T, and W > 0. Clearly the Jacobian of the transformation is 1 so that w fT,W (t, w) = e− 2 t 2 n−3 2 (n+2) 2 1 1 w 1 λ t w 1 2 e− 2 e− 2 w − 2 1 − n λ 2 (w−t) 2 λ +e 1 −λ 2 (w−t) 2 , 0 ≤ t ≤ w, 0 ≤ w < ∞. and expand the terms in the brackets so that t w − 1 t( t w 2 2.2 2 Γ( 21 )Γ[ 12 (n − 1)] w 1 1 1 . e Γ( 12 )Γ[ 12 (n − 1)] Now write (w − t) 2 = w 1 − fT,W (t, w) = λ (w − t)− 2 e− 2 (n−4) e− 2 e− 2 w 2 . = n 2 2 Γ( 12 )Γ[ 12 (n − 1)] n−3 ) 2 ( t 2wλ 1− +... 2+ 2! w ( (n−3) ( 2 t 1− w − 1 2 ) t wλ 1− + 2! w 1 2 ) +... . To obtain the marginal density function of W we integrate with respect to t, (0 ≤ t < w). Notice we have a series of integrals of the form Z w 0 t w (n−3) 2 t 1− w r− 1 2 dt = Z 1 v (n−3) 2 0 1 (1 − v)r+ 2 dv w 1 1 n−1 = B , r− w 2 2 for r = 0, 1, 2, . . . . Thus w λ n−2 e− 2 e− 2 w ( 2 ) gW (w) = n 1 1 2 2 Γ( 2 )Γ[ 2 (n − 1)] ( ! ) n−1 3 wλ (n − 1) 1 , B , + +... B 2 2 2! 2 2 and using the relationship B(m, n) = Γ(m)Γ(n)/Γ(m + n) we obtain (6.6). Note: No generality is lost by assuming unit variances as the more general case (where the variance is σ 2 , say) can easily be reduced to this case. That is, if X ∼ N (µ, σ 2 ) then X/σ ∼ N (µ/σ, 1). 88 MGF of W Direct calculation of the moments would be tedious, so we need the MGF of W . MW (t) = Z ewt gW (w)dw = Z e−w(1−2t)/2 e−λ/2 w (n/2)−1 [f (wλ, n)] dw 2n/2 Γ(n/2) Choose w 0 = w(1 − 2t) and λ = Λ(1 − 2t) and note that wλ = w 0 Λ. 
Thus MW (t) = Z 0 e−w /2 e−Λ(1−2t)/2 w0 1−2t 2n/2 Γ(n/2) So eΛt = (1 − 2t)n/2 Z (n/2)−1 w0 [f (w 0 Λ, n)] d 1 − 2t ! e−w /2 e−Λ/2 (w 0 )(n/2)−1 [f (w 0 Λ, n)] dw 0 n/2 2 Γ(n/2) 0 =1 MW (t) = e(λt)/(1−2t) (1 − 2t)−n/2 6.3 Non-Central t and F-distributions Suppose X ∼ N (µ, 1) and W ∼ χ2n (0) and that X and W are independent. Then the random variable T 0 defined by X T0 = q W/n has a non-central t-distribution with n df and non-centrality parameter µ. No attempt will be made to derive the pdf of the non-central t. Clearly, when µ = 0, T 0 reduces to the central t distribution. Let W1 ∼ χ2n1 (λ) and W2 ∼ χ2n2 (0) be independent random variables. Then the random variable F 0 defined by W1 /n1 F0 = W2 /n2 has a non-central F -distribution with non-centrality parameter λ. We write F 0 ∼ Fn1 ,n2 (λ). The F 0 statistic has a non-central F distribution with probability density function g(x) = ∞ X r=0 1 e −λ 2 1 ( 1 λ)r (n1 /n2 ) 2 n1 +r x 2 n1 +r−1 × 2 × × 1 r! B( 12 n1 + r, 12 n2 ) [1 + (n1 /n2 )x] 2 (n1 +n2 )+r where n1 , n2 are the degrees of freedom, λ is the non-centrality parameter defined by λ= Pp i=1 mi (τi − τ̂i )2 σ2 89 (6.9) where mi is the number of observations in group i for the AOV effects model Yij = µ + τi + εij , εij ∼ N (0, σ 2 ). If all means are equal, λ = 0 and g(x) is the pdf of a central F variable. The terms of the form B(a, b) are beta functions. 6.4 6.4.1 POWER: an example of use of non-central t Introduction In hypothesis testing we can make a Type I or a Type II error. H0 true H0 false Accept H0 Reject H0 correct (1) Type I error (α) Type II error (β) correct(2) The power of the test is P = 1 − β, which is P(reject H0 |H0 is false) = P[correct(2)]. Note the some authors use β to mean Power. We will use the definition Power = 1 - P(Type II error) Both Type I and Type II errors need to be controlled. In fact, we are faced with a trade–off between the two. If we lower Type I, Type II will increase. If we lower Type II, Type I will increase. Type I is preset, and so we need to have some idea about Type II error and thus the Power of the test. The regions for Type I (α) and Type II (β) errors are shown in Figure 6.1. The area to the right of the critical value (cv) under the solid curve (Ho) is the Type I error. The area under the dotted curve (Ha) to the left of the cv is the Type II error. 90 0.4 Ha 0.2 0.1 Normal density 0.3 Ho cv α 0.0 β −2 0 2 4 6 x Figure 6.1: The Null and Alternative Hypotheses Consider a simple one sample t–test, with H0 : µ = 0 vs Ha : µ > 3 If we have a sample of 26 observations with s = 5, what is the Power of the test? Under H0 the test statistic is the T = X̄ √ ∼ tn−1 s/ n since the mean of X is taken to be zero. When the alternative hypothesis is true, the mean of X is no longer zero and hence the test statistic no longer follows an ordinary t–distribution. The distribution is no longer symmetric, but becomes a non–central t–distribution. The lack of symmetry is described by a non–centrality parameter λ, where λ= diff 3 √ = = 3.0594 σ/ n 5/5.099 in this case. Now the critical value for the test is t25,5% = 1.708, so the Type I error is 5%. 
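Both of these quantities can be reproduced directly in base R (a minimal sketch; 26, 3 and 5 are the sample size, hypothesised difference and standard deviation of this example):

n     <- 26
delta <- 3
s     <- 5
delta / (s / sqrt(n))    # non-centrality parameter, approx 3.0594
qt(0.95, df = n - 1)     # one-sided 5% critical value t_{25,5%}, approx 1.708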
The plot of the cumulative non–central t–distribution imposed by Ha is shown in Figure 6.2, together with the critical value : > # after Dalgaard p141 > curve(pt(x,25,ncp=3.0594),from =0, to=6) 91 1.0 0.8 0.6 0.4 0.2 0.0 pt(x, 25, ncp = 3.0594) 0 1 2 3 4 5 6 x Figure 6.2: Plot of the cdf for the non–central t distribution > abline(v=qt(0.95,25)) > qt(0.95,25) [1] 1.708141 > > pt(qt(0.95,25),25,ncp=3.0594) [1] 0.0917374 > 1-pt(qt(0.95,25),25,ncp=3.0594) [1] 0.9082626 The Type II error is the area under the curve to the left of the critical value as shown by the vertical line. Thus the Type II error is 0.092 and the Power is 0.908. The desired value for Power is usually 0.8 to 0.9. For the two sample t–test, the non–centrality parameter becomes diff λ= q σ 1/n1 + 1/n2 The software power.t.test in R can estimate the sample size needed to attain a given Power. 6.4.2 Power calculations The base package in R, has a function called power.t.test which is useful for calculating power curves a priori by plugging in guessed parameters. The function has the form:92 power.t.test(n=NULL, delta=NULL, sd=1, sig.level=0.05, power=NULL, type=c("two.sample", "one.sample", "paired"), alternative=c("two.sided", "one.sided")) Its use requires that one parameter be set as NULL and all the others defined with values. The function will return the value of that which is set to NULL. It may be used to calculate (i) sample size (n), given:- the difference (delta),sd, sig.level and power, (ii) power, given:- n, delta, sd, sig.level, (iii) detectable difference (delta), given the other arguments Thus to show the equivalence with the previous method, we will estimate the sample size needed to get the power given for our one sample problem, using power.t.test. > power.t.test(delta=3,sd=5, sig.level =0.05, power=0.908, type="one.sample", + alt="one.sided") One-sample t test power calculation n delta sd sig.level power alternative = = = = = = 25.97355 3 5 0.05 0.908 one.sided As expected, we find that a sample of 26 is needed! We can also perform the same calculation, ie, given the sample size, find the power. > power.t.test(n=26,delta=3,sd=5, sig.level =0.05, type="one.sample", + alt="one.sided") One-sample t test power calculation n delta sd sig.level power alternative = = = = = = 26 3 5 0.05 0.9082645 one.sided 93 The results are equivalent to the original power calculations, ie, power is approx 91%. To return to the two sample case, now consider H0 : µ1 = µ2 vs Ha : µ1 > µ2 Let us find the sample sizes n1 = n2 = n needed to detect a difference of 3 when the common sd is 5, with a Power = 0.8. > power.t.test(delta=3,sd=5, sig.level=0.05, power =0.8,alt="one.sided") Two-sample t test power calculation n delta sd sig.level power alternative = = = = = = 35.04404 3 5 0.05 0.8 one.sided NOTE: n is number in *each* group So we need 35 observations in each group. We now will run a similar but not identical problem. Let us find the sample sizes n1 = n2 = n needed to detect a difference of 3 when the common sd is 5, with a Power = 0.9, for a two sided alternative. Thus we now have Ha : µ1 6= µ2 > power.t.test(delta=3,sd=5, sig.level=0.05, power =0.9,alt="two.sided") Two-sample t test power calculation n delta sd sig.level power alternative = = = = = = 59.35157 3 5 0.05 0.9 two.sided NOTE: n is number in *each* group We now need 59 observations in each group. 94 6.5 6.5.1 POWER: an example of use of non-central F Analysis of variance For the AOV where we have more than two groups, we need to use the non–central F distribution. 
The non—centrality parameter is now Λ= n P i (µi − σ2 µ)2 Note that this differs from the t–test definition! The formulation uses equal numbers is each group (n), but this is not necessary. First up, we will use the non–central F distribution in R to verify our calculations for the last t–test. The non–central F in R is simply the standard F with the optional parameter ncp. There are some points to note : 1. to compare the two, the t–test has to be two–sided. 2. the two ncps are not exactly the same but are related, since in the two sample case diff λ= q σ 2/n and so λ2 = ndiff 2 = Λ/2 2σ 2 AOV table (df only) : The number of obs = 59 x 2 = 118 Source Groups Error TOTAL df 1 116 117 > qf(0.95,1,116) [1] 3.922879 > lambda= 59 * 9 /(25 * 2) > 1-pf(qf(0.95,1,116),df1=1,df2=116,ncp=lambda) [1] 0.8982733 Thus the non–central F produces a Power of 90% as per the non–central t. Lastly, we note the comment from Geisbrecht and Gumpertz (G and G) , p61, concerning the definition of the non–centrality parameter : 95 ” Note that usage in the literature is not at all consistent, Some writers divide by 2 . . . and other define the non–centrality parameter as the square root, . . . Users must be cautious”. In developing our definition here we are using the definition that obviously works in R and SAS, the package used by G and G. For more information, see the R functions pf and Chisquare. We now turn our attention to more than two groups, ie, Analysis of Variance proper. It is worth noting that the form of the non–centrality parameter can be seen in the table of Expected Mean Squares for the fixed effects AOV. This will be exploited later in post–mortem power calculations on the AOV of data. Example Giesbrecht F.G. and Gumpertz M.L., (2004), Planning, Construction and Statistical Analysis of Comparative Experiments, Wiley, Hoboken, pp62–63. We have an experiment with a factor at 6 levels, and 8 reps within each levels. This gives an AOV table (df only) : The number of obs = 6 x 8 = 48 Source Groups Error TOTAL df 5 42 47 What is the Power of the test? We are given that the mean is approx 75 and the coefficient of variation (CV ) is 20%. Remember that CV = 100 × σ/µ Thus we can estimate σ = 0.2 × µ = 15 → σ 2 = 225 The expected treatment means are (65, 85, 70, 75, 75, 80). We now have all the information needed to calculate the non–centrality parameter. Λ= 8 P i (µi − σ2 µ)2 8[(−10)2 + (10)2 + (−5)2 + (0)2 + (0)2 + (5)2 ] = 8.89 225 The R calculations : = 96 > lambda <- 8.89 > qf(0.95,5,42) [1] 2.437693 > pf(2.44,5,42,ncp=lambda,lower.tail=F) [1] 0.5524059 Thus the probability of detecting a shift in the means of the order nominated is only 55%! Thus we would need more than 8 observations per treatment level to pick up a change in means of the type suggested. Example Consider the following AOV of yield data from a randomized block experiment to measure yields of 8 lucerne varieties from 3 replicates. The statistical model is :yij = µ + β i |{z} + block effect + ij ij ∼ N (0, σ 2 ) |{z} treatment effect error τj |{z} Table 6.1: AOV of lucerne variety yields Source df SS MS F E(MS) P replicate 2 41,091 20,545 2.4 σ 2 + 8 3i=1 βi2 /2 P variety 7 75,437 10,777 1.26 σ 2 + 3 8i=1 τi2 /7 σ2 residuals 14 120,218 8,587 The variety effects are:1 0 2 3 4 5 -69 -46 90 81 6 7 8 71 60 22 Is there sufficient evidence to say that the observed differences are due to systematic effects? 
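One numerical way into this question is to compute the tail probability of the observed variance ratio under the null hypothesis. A short R sketch using the mean squares from Table 6.1; the result agrees with the probability of 0.34 quoted in the discussion below.

Fobs <- 10777 / 8587     # variety MS / residual MS from Table 6.1, approx 1.26
1 - pf(Fobs, 7, 14)      # upper-tail probability under the central F(7,14), approx 0.34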
If there is not sufficient evidence to say this, we are obliged to take the position that the observed differences could have arisen from random sampling from a population for which there were no variety effects. If the null hypothesis that τj = 0, ∀j is true, variety MS ∼ σ 2 χ27 /7 and residual MS ∼ σ 2 χ214 /14 The ratio of these 2 mean squares is a random variable whose distribution is named the F distribution. We assess the null hypothesis using the F statistic which in this case 97 is 1.26 and the probability of getting this or more extreme by sampling from a population with τj = 0 is 0.34. Thus we have no strong evidence to support the alternate hypothesis that not all τj = 0. Given the effort of conducting an experiment, this is a disappointing result. Specific contrasts amongst the varieties should be tested but if there were no contrasts specified at the design stage, this “data-snooping” may also be misleading. A post mortem poses the following questions. • Was the experiment designed to account for natural variation? If the blocking is not effective, the systematic component τj is masked by the random component ij . • Suppose that genuine differences do exist in the population. What was the probability of detecting them with such a design? The quantile of F marking the 5% critical region is Fcrit = 2.67. To be 95% sure that the observed differences were not due to chance, the observed F has to exceed Fcrit . The probability of rejection of the null hypothesis when the differences are actually not zero is called POWER and is measured by the area under the non-central F density to the right of Fcrit . The F curves for this example are shown in Figure 6.3 where the vertical lines is Fcrit . The calculated value of power is 0.1 which is very poor. A better design which will reduce experiment error is required. For post mortems of AOV’s, the non-centrality parameter is estimated by λ̂ = df1 × (Trt MS − Error MS) Error MS It remains a student exercise to compare this with previous definitions (eg. 6.9). We revisit the topic of POWER in Statistical Inference. 98 Figure 6.3: Central and non-central F densities 0.6 F density central F non-central F 0.4 0.2 POWER = 0.1 0.0 0 6.6 2 4 6 F quantile 8 10 R commands df(x,df1,df2) qf(p,df1,df2) pf(q,df1,df2,ncp) ncpdf(q,df1,df2,ncp) density of a central F distribution quantile corresponding to probability p for a central -F probability for a F distribution with non-centrality parameter density of non-central F , see below ncpdf_function(x, df1, df2, ncp){ # written by Bob Murison at UNE based on paper by # O’Neill and Thomson (1998), Aust J exp Agric, 38 617-22. ############################################### Beta <- function(v1, v2){(gamma(v1) * gamma(v2))/gamma(v1 + v2)} r <- 0:100 gF <- x * 0 for(i in seq(along = x)) { gF[i] <- sum((((exp(-0.5 * ncp) * (ncp/2)^r)/gamma(r + 1) * ( df1/df2)^(df1/2 + r))/Beta((df1/2 + r), (df2/2)) * x[i]^(df1/2 + r - 1))/((1 + (df1/df2) *x[i])^((df1 + df2)/2 + r))) } gF } 99 100 Part II Statistical Inference 101 Chapter 7 Reduction of Data 7.1 Types of inference There are several ways to model data and use statistical inference to interpret the models. Common strategies include 1. frequentist parametric inference, 2. Bayesian parametric inference, 3. non-parametric inference (frequentist and Bayesian), 4. semi-parametric inference, and there are others. Applied statisticians use the techniques best suited to the data and each technique has its strengths and limitations. 
This section of the unit is about parametric frequentist inference and many of the principles and skills are transportable to the other styles of inference. 7.2 Frequentist inference The object of statistics is to make an inference about a population based on information contained in a sample. Populations are characterized by parameters, so many statistical investigations involve inferences about one or more parameters. The process of performing repetitions of an experiment and gathering data from it is called sampling. The basic ideas of random sampling, presentation of data by way of density or probability functions, and a statistic as a function of the data, are assumed known. Computation of a statistic from a set of observations constitutes a reduction of the data (where there are n items, say) to a single number. In the process of such reduction, some information about the population may be lost. Hopefully, the statistic used is chosen so 102 that the information lost is not relevant to the problem. The notion of sufficiency, covered in the next section, deals with this idea. Commonly used statistics are: sample mean, sample variance, sample median, sample range and mid–range. These are random variables with probability distributions dependent on the original distribution from which the sample was taken. 7.3 Sufficient Statistics [Read CB 6.1 or HC 7.2. The notation we will use for a statistic is T (X) = T (X1 , X2 , . . . , Xn ), rather than Y1 = u(X1 , X2 , . . . , Xn ) as their choice.] The idea of sufficiency is that if we observe a random variable X (using a sample X1 , . . . , Xn , or X) whose distribution depends on θ, often X can be reduced via a function, without losing any information about θ. For example, T (X) = T (X1 , . . . , Xn ) = n X Xi /n, i=1 which is the sample mean, may in some cases contain all the relevant information about θ, and in that case T(X) is called a sufficient statistic. That is, knowing the actual n observations doesn’t contribute any more to the inference about θ, than just knowing the average of the n observations. We can then base our inference about θ on T(X), which can be considerably simpler than X (involving a univariate distribution, rather than an n–variate one). Definition 7.1 A statistic T = T (X) is said to be sufficient for a family of distributions if and only if the conditional distribution of X given the value of T is the same for all members of the family (that is, doesn’t depend on θ). Equivalent definitions for the discrete case and continuous cases respectively are given below. Definition 7.2 Let f (x; θ), θ ∈ Θ be a family of distributions of the discrete type. For a random sample X1 , . . . , Xn from f (x; θ), define T = T (X1 , . . . , Xn ). Then T is a sufficient statistic for θ if, for all θ and all possible sample points, P (X1 = x1 , . . . , Xn = xn |T = t(x1 , . . . , xn )) (7.1) does not involve θ. [Note that the lack of dependence on θ includes not only the function, but the range space as well.] 103 Consider here the role of θ. Its job is to represent all the stochastic information of the data. Other information such as the scale of measurement should not be random. So if T is sufficient for θ, then interpretation of the data conditional on f (t(x1 , . . . , xn )) should remove all the stochastic bits, leaving only the non-random bits. Definition 7.3 Let X1 , . . . , Xn be a random sample from a continuous distribution, f (x; θ), θ ∈ Θ. Let T = T (X1 , . . . , Xn ) be a statistic with pdf fT (t). 
Then T is sufficient for θ if and only if f (x1 , θ) × f (x2 ; θ) · · · × f (xn ; θ) fT (t(x1 , x2 , . . . , xn ); θ) (7.2) does not depend on θ, for every fixed value of t. Again the range space of the xi must not depend on θ either. Example A random sample of size n is taken from the Poisson distribution, P (λ). P Is i Xi sufficient for λ? Let X P (A) P (X1 , . . . , Xn | Xi = t) = P (A|B) = P (B) since A represents only one way in which the total t could be achieved, and so A ⊂ B ; P (A ∩ B) = P (A). P As Xi ∼ P (nλ) we have P (X1 , . . . , Xn | X Xi = t) = which does not involve λ, and so T = Example 7.1 Q i e−λ λxi /xi ! e−nλ (nλ) P P x i i /( P i xi )! Xi is sufficient for λ. = ( n P P i x i i xi )! Q ( xi !) Given X1 , . . . , Xn is a random sample from a binomial distribution with parameters P m, θ, show that T = ni=1 Xi is a sufficient statistic for θ. Solution. From Definition 7.2, we need to consider P (X1 = x1 , X2 = x2 , . . . , Xn = xn | and note that the Xi are independent and that becomes m x1 θ x1 (1 − θ)m−x1 . . . mn t m xn Pn i=1 θ xn (1 − θ)m−xn θ t (1 − θ)mn−t 104 X Xi = t), Xi ∼ bin(mn, θ). So equation (7.1) , xi = 0, 1, . . . m, X xi = t, which on simplification is m x1 m . . . xmn x2 mn t which is seen to be free of θ and the range space of the xi also. P Hence ni=1 Xi is sufficient for θ. Continuous case The statistic T is sufficient for θ if Q f (xi ; θ) fT (t(x1 , . . . , xn ); θ) i does not involve θ. Example For a random sample of size n from an exponential distribution, show that the sample total is sufficient for the exponential parameter. A form of the exponential distribution is f (x; λ) = λe−λx , x > 0 If the sample total is sufficient for λ then Q f (xi ; λ) fT (t(x1 , . . . , xn ); λ) i should not contain λ. The distribution of the sample total is Gamma with parameters n and λ, as can be shown using moment generating functions. Thus the conditional distribution becomes λn ( Pn i xi Qn −λxi i λe P n )n−1 e−λ i xi /Γ(n) = Γ(n)/( X xi )n−1 i which does not contain λ, indicating that the sample total is sufficient for λ. Example 7.2 Let X1 , . . . , Xn be a random sample from the truncated exponential distribution, where fXi (xi ) = eθ−xi , xi > θ or, using the indicator function notation, fXi (xi ) = eθ−xi I(θ,∞) (xi ). 105 Show that Y1 = min(Xi ) is sufficient for θ. Solution. In Definition 7.3, T = T (X1 , . . . , Xn ) = Y1 and to examine equation (7.2), we need fT (t), the pdf of the smallest order statistic. Now for the pdf above, F (x) = Z x θ eθ−z dz = eθ [eθ − e−x ] = 1 − eθ−x . From Distribution Theory equation (5.6), the pdf of Y1 is n[1 − F (y1 )]n−1 f (y1 ) = ne(θ−y1 )(n−1) × eθ−y1 = nen(θ−y1 ) , y1 > θ. So the conditional density of X1 , . . . Xn given T = t is P e − xi eθ−x1 eθ−x2 . . . eθ−xn = , xi ≥ y1 , i = 1, . . . n, nen(θ−y1 ) ne−ny1 which is free of θ for each fixed y1 = min(xi ). Note that since xi ≥ y1 , i = 1, . . . , n neither the expression nor the range space depends on θ, so the first order statistic, Y1 is a sufficient statistic for θ. In establishing that a particular statistic is sufficient, we do not usually use the above definition(s) directly. Instead, a factorization criterion is preferred and this is described in 7.4. 7.4 Factorization Criterion The Theorem stated below is often referred to as the Fisher–Neyman criterion. Theorem 7.1 Let X1 , . . . , Xn (or X) denote a random sample from a distribution with density function f(x; θ). 
Then the statistic T=t(X) is a sufficient statistic for θ if and only if we can find two functions g and h such that f (x; θ) = g(t(x); θ)h(x) where, for every fixed value of t(x), h(x) does not depend on θ. (The range space of x for which f (x; θ) > 0 must not depend on θ. (An aside; heuristic explanation.) Factorisation is a way of separating the random and non-random components. Only when t(x) is comprehensive enough such that f (x; θ) = g (t(x); θ) h(x) , 106 will it be sufficient information to pin down θ. Conversely, when it is sufficient, the extra “enhancements” are redundant. Proof. For the continuous case, a proof is given in HC 7.2, Theorem 1, where their k1 is our g and their k2 is our h. To use the factorization criterion, we examine the joint density function, f(x; θ) and see whether there is any factorization of the type required in terms of some function t(x). It is usually not easy to use the factorization criterion to show that a statistic T is not sufficient. Note that the family of distributions may be indexed by a vector parameter θ, in which case the statistic T in the definition of sufficiency can be a vector function of observations, for example, (X̄, S 2 ) or (Xmin , Xmax ). Example (discrete) For the Poisson distribution, is the sample total sufficient for λ? Y i and so T = (continuous) P P P e−nλ λ i xi 1 e−λ λxi = e−nλ λ i xi Q = Q xi ! i xi ! i xi ! = g(t; λ) × h(xi ) i Xi is sufficient for λ, as expected. For the exponential distribution, is the sample total sufficient for λ? Y λe−λxi = λn e−nλ i and so T = P P i = g(t; λ) × h(xi ) i Xi is sufficient for λ, as expected. Example 7.3 107 xi ×1 We will consider again example 7.2, using the factorization criterion. The joint probability density function of X1 , . . . , Xn is f (x; θ) = e− ( = P (xi −θ) P n Y I(θ,∞) (xi ) i=1 e− P xi enθ .1 e− xi enθ .0 P if min(x1 , . . . , xn ) > θ otherwise. = e− xi enθ I(θ,∞) (t) = h(x)g(t(x; θ)) where h(x) = e− P xi , t(x) = min xi and g(t(x; θ)) = enθ I(θ,∞) (t). So, by Theorem 7.1, min(Xi ) is a sufficient statistic for θ. Example 7.4 (Example 4 in HC, 7.2, p319) Let X1 , . . . , Xn be a random sample from a N (θ, σ 2 ) distribution where σ 2 is known. Show that X̄ is sufficient for θ. Now Pn 2 2 f (x1 ; θ) . . . f (xn ; θ) = ce− i=1 (xi −θ) /2σ . Writing xi − θ as xi − x̄ + (x̄ − θ), we have, X (xi − θ)2 = X (xi − x̄)2 + n(x̄ − θ)2 + 2(x − θ) So the RHS is ce−n(x̄−θ) 2 /2σ 2 .e− P (xi −x̄)2 /2σ 2 X | (xi − x̄) . = g(x̄; θ) · h(x), since the first term on the RHS depends on x only through x̄ (or term does not depend on θ. {z =0 P } xi ) and the second Read HC 7.2, Examples 5, 6. Example 7.5 Consider a random sample of size n from the uniform distribution, f (x; θ) = 1/θ, x ∈ (0, θ]. We will use the factorization criterion to find a sufficient statistic for θ. The joint density function is f (x; θ) = ( 1 , θn 0, if 0 < xi < θ for i = 1, 2, . . . , n if xi > θ or xi < 0 for any i. 108 This can be written in the form f (x : θ) = g(yn , θ)h(x), where g(yn, θ) = and h(x) = ( ( 1, 0, 1 , θn 0, if θ > yn , if θ ≤ yn , if xi > 0 for all i, if any xi ≤ 0. Of course, yn in the above, is the largest order statistic. The factorization criterion is satisfied in terms of the statistic T = Yn , so this statistic is sufficient for θ. Comment. Note that the joint pdf is 1/θ n which is just a function of θ, so it would appear that any statistic is sufficient. The fallacy in this argument is that the joint density function is not always given by 1/θ n , but is equal to zero for xi ∈ / [0, θ]. 
So it really is not just a function of θ. However, we can get it into the required form by taking T = Yn . [Note that if Yn < θ then all the Xi ≤ θ.] Although the factorization criterion works here and in other cases where the range space depends on the parameter, one has to be careful, and it is often safer to find the conditional density for the sample given the statistic, rather than use the factorization criterion. This is done below. The joint pdf of the ordered sample Y1 < Y2 < . . . < Yn is ( n! θn 0 for 0 ≤ y1 ≤ . . . yn ≤ θ otherwise , and the density for Yn is ( nynn−1 /θ n 0 for 0 ≤ yn ≤ θ otherwise . Hence the conditional density of Y1 , . . . , Yn given Yn (which is ≤ θ) is ( (n − 1)!/ynn−1 0 for 0 ≤ y1 ≤ . . . ≤ yn otherwise. which does not depend on θ. 109 7.5 The Exponential Family of Distributions [Read CB 3.4 or HC 7.5 where we will use B(θ) for their eq(θ) and h(x) for their eS(x) .] Definition 7.4 The exponential family of distributions is a one-parameter family that can be written in the form f (x; θ) = B(θ)h(x)e[p(θ)K(x)] , a < x < b, (7.3) where γ < θ < δ. If, in addition, (a) neither a nor b depends on θ, (b) p(θ) is a non-trivial continuous function of θ, (c) each of K 0 (x) 6≡ 0 and h(x) is a continuous function of x, a < x < b, we say that we have a regular case of the exponential family. Most of the well-known distributions can be put into this form, for example, binomial, Poisson, geometric, gamma and normal. The joint density function of a random sample X from such a distribution can be written as f (x; θ) = B n (θ) n Y h(xi )ep(θ) i=1 Putting T = t(X) = n X Pn i=1 K(xi ) K(Xi ) and t(x) = i=1 n B (θ)e p(θ)t(x) n X (7.4) K(xi ), i=1 we see that f (x; θ) can be written as h , a < xi < b. " n i Y # h(xi ) = g(t(x; θ)h(x), i=1 so that Theorem 7.1 applies and t(X) is a sufficient statistic for θ. Example 7.6 Let X ∼ U [0, θ]. Then f (x) = 1/θ, x ∈ [0, θ]. We see that f (x) cannot be written in the form of equation (7.3). We could write B(θ) = 1/θ, p(θ) = 0, but then we would need h(x) = ( 1, 0 ≤ x ≤ θ 0 otherwise which makes h(x) depend on θ and condition (c) of definition (7.4) would not be satisfied. [We already know that max Xi is sufficient for θ here, and note that max Xi is not of the P form ni=1 K(Xi ).] 110 Example 7.7 Consider the normal distribution with mean θ and variance 1. The density function can be written in the form of equation (7.3) where 1 1 2 2 2 √ e−(x−θ) /2 = √ e−θ /2 . (e−x /2 ) .eθx , | {z } 2π 2π{z | } h(x) B(θ) P P and p(θ) = θ, K(x) = x. So T = K(Xi ) = Xi is minimal sufficient for θ. P Note that we could have defined p(θ) = nθ and K(x) = x/n, so that T = Xi /n = X is also sufficient for θ. A distribution from the exponential family arises from tilting a simple density , f (x; θ) = f (x) × eθx−K(θ) where K(θ) = log E(eθx ) µ = K 0 (θ) σ 2 = K 00 (θ) and θ is termed the natural parameter. Theorem A necessary and sufficient condition for a pdf to possess a sufficient statistic is that it belongs to the exponential family of distributions. The exponential family also gives the form of the sufficient statistic, viz, P K(xi ) Y h(x ) X K(Xi ) f (x; θ) = B n (θ)ep(θ) Choosing T = t(X) = i i i i and t(x) = X K(xi ) i gives f (x; θ) = B (θ)ep(θ) Thus t(X) = P i n P i K(xi ) "Y i # h(xi ) = g[t(x; θ)]h(x) K(Xi ) is a sufficient statistic for θ, by use of the factorisation criterion. 111 Examples (Poisson) 1 e−λ λx = e−λ ex ln λ x! x! = e−λ , h(x) = 1/(x!), p(θ) = ln θ = ln λ and K = I. 
f (x; λ) = So θ = λ, B(θ) = e−θ Thus (Exponential) P i Xi is sufficient for λ. f (x; λ) = λe−λx So θ = λ, h = 1, p(θ) = −λ = −θ and K = I Thus 7.6 P i Xi is sufficient for λ. Likelihood The likelihood is the joint probability function of the sample as a function of θ. Thus L(θ; x) = f (x; θ) = L(θ) The fact the likelihood as a function of θ differs from the pdf as a function of x was first defined by Fisher : (1921) ”What we can find from a sample is the likelihood of any particular value of ρ, if we define the likelihood as a quantity proportional to the probability that, from a population having that particular value of ρ, a sample having the observed value r should be obtained. So defined, probability and likelihood are quantities of an entirely different nature.” (1925) ”What has now appeared is that the mathematical concept of probability is inadequate to express our mental confidence or diffidence in making such inferences, and that the mathematical quantity which appears to be appropriate for measuring our order of preference among different possible populations does not in fact obey the laws of probability. To distinguish it from probability, I have used the term ’Likelihood’ to designate this quantity.” The value of θ that maximises this likelihood, is called the maximum likelihood estimator (mle). Definition 7.5 Let X1 , . . . , Xn be a random sample from f (x; θ) and x1 , . . . , xn the corresponding observed values. The likelihood of the sample is the joint probability function (or the 112 joint probability density function, in the continuous case) evaluated at x1 , . . . , xn , and is denoted by L(θ; x1 , . . . , xn ). Now the notation emphasizes that, for a given sample x, the likelihood is a function of θ. Of course L(θ; x) = f (x; θ), [= L(θ), in a briefer notation]. The likelihood function is a statistic, depending on the observed sample x. A statistical inference or procedure should be consistent with the assumption that the best explanation of a set of data is provided by θ̂, a value of θ that maximizes the likelihood function. This value of θ is called the maximum likelihood estimate (mle). The relationship of a sufficient statistic for θ to the mle for θ is contained in the following theorem. Theorem 7.2 Let X1 , . . . , Xn be a random sample from f (x; θ). If a sufficient statistic T = t(X) for θ exists, and if a maximum likelihood estimate θ̂ of θ also exists uniquely, then θ̂ is a function of T . Proof Let g(t(x; θ)) be the pdf of T . Then by the definition of sufficiency, the likelihood function can be written L(θ; x1 , . . . , xn ) = f (x1 , θ) . . . f (xn ; θ) = g (t(x1 , . . . , xn ); θ) h(x1 , . . . , xn ) (7.5) where h(x1 , . . . , xn ) does not depend on θ. So L and g as functions of θ are maximized simultaneously. Since there is one and only one value of θ that maximizes L and hence g(t(x1 , . . . , xn ); θ), that value θ must be a function of t(x1 , . . . , xn ). Thus the mle θ̂ is a function of the sufficient statistic T = t(X1 , . . . , Xn ). Sometimes we cannot find the maximum likelihood estimator by differentiating the likelihood (or log of the likelihood) with respect to θ and setting the equation equal to zero. Two possible problems are: (i) The likelihood is not differentiable throughout the range space; (ii) The likelihood is differentiable, but there is a terminal maximum (that is, at one end of the range space). For example, consider the uniform distribution on [0, θ]. The likelihood, using a random sample of size n is L(θ; x1 , . . . 
, xn ) = ( 1 θn 0 for 0 ≤ xi ≤ θ, i = 1, . . . , n otherwise . (7.6) Now 1/θ n is decreasing in θ over the range of positive values. Hence it will be maximized by choosing θ as small as possible while still satisfying 0 ≤ xi ≤ θ. That is, we choose θ equal to X(n) , or Yn , the largest order statistic. 113 Example 7.8 Consider the truncated exponential distribution with pdf f (x; θ) = e−(x−θ) I[θ,∞) (x). The Likelihood is L(θ; x1 , . . . , xn ) = e nθ− P xi n Y I[θ,∞) (xi ). i=1 Hence the likelihood is increasing in θ and we choose θ as large as possible, that is, equal to min(xi ). Further use is made of the concept of likelihood in Hypothesis Testing (Chapter 9), but here we will define the term likelihood ratio, and in particular monotone likelihood ratio. Definition 7.6 Let θ1 and θ2 be two competing values of θ in the density f (x; θ), where a sample of values X leads to likelihood, L(θ; X). Then the likelihood ratio is Λ = L(θ1 ; X)/L(θ2 ; X). This ratio can be thought of as comparing the relative merits of the two possible values of θ, in the light of the data X. Large values of Λ would favour θ1 and small values of Λ would favour θ2 . Sometimes the statistic T has the property that for each pair of values θ1 , θ2 , where θ1 > θ2 , the likelihood ratio is a monotone function of T . If it is monotone increasing, then large values of T tend to be associated with the larger of the two parameter values. This idea is often used in an intuitive approach to hypothesis testing where, for example, a large value of X would support the larger of two possible values of µ. Definition 7.7 A family of distributions indexed by a real parameter θ is said to have a monotone likelihood ratio if there is a statistic T such that for each pair of values θ1 and θ2 , where θ1 > θ2 , the likelihood ratio L(θ1 )/L(θ2 ) is a non–decreasing function of T . Example 7.9 Let X1 , . . . , Xn be a random sample from a Poisson distribution with parameter λ. Determine whether (X1 , . . . , Xn ) has a monotone likelihood ratio (mlr). Here the likelihood of the sample is L(λ; x1 , . . . , xn ) = e−nλ λ 114 P xi / Y xi !. Let λ1 , λ2 be 2 values of λ with 0 < λ1 < λ2 < ∞. Then for given x1 , . . . , xn −nλ2 P x L(λ2 ; x) e λ2 i P = = x L(λ1 ; x) e−nλ1 λ1 i λ2 λ1 ! P xi e−n(λ2 −λ1 ) . Note that (λ2 /λ1 ) > 1 so this ratio is increasing as T (x) = P (X1 , . . . , Xn ) has a monotone likelihood ratio in T (x) = xi . 7.7 P xi increases. Hence Information in a Sample In the next chapter we will be considering properties of estimators. One of these properties involves the variance of an estimator and our desire to choose an estimator with variance as small as possible. Some concepts and results that will be used there are introduced in this section. In particular, we will consider the notion of information in a sample, and how we measure this information when data from several experiments is combined. Consider a distribution indexed by a real parameter θ and suppose X1 , . . . , Xn1 and Y1 , . . . , Yn2 are independent sets of data, then the likelihood of the combined sample is the product of the likelihoods of the two individual samples. That is, L(θ; x, y) = L1 (θ; x)L2 (θ; y) and so log L(θ; x, y) = log L1 (θ; x) + log L2 (θ; y). The statistic that we shall be concerned with is the derivative with respect to θ of the log likelihood. Definition 7.8 The score of a sample, denoted by V is defined by V = where L0 (θ) = ∂ L(θ) ∂θ L0 (θ) ∂ log L(θ; X) = = `0 (θ) ∂θ L(θ) and `(θ) = log L(θ). Some properties of V are given below. 
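As a concrete illustration before those properties are listed, the score for the Poisson sample of Example 7.9 can be evaluated over a grid of λ values. In the R sketch below the data are arbitrary illustrative counts; note that the score changes sign at λ = x̄, the maximum likelihood estimate.

x <- c(2, 0, 3, 1, 4)                                  # illustrative Poisson counts
score <- function(lambda) sum(x) / lambda - length(x)  # V = dl/dlambda = -n + sum(x)/lambda
lam <- seq(0.5, 5, by = 0.5)
cbind(lambda = lam, V = score(lam))                    # V = 0 at lambda = mean(x) = 2, the mle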
Rigorous proofs of these results depend on fulfillment of conditions (sometimes referred to as regularity conditions) that permit interchange of integration and differentiation operations, and on the existence and integrability of the various partial derivatives. The proofs are not required in this course but an outline of the proof of equation (7.7) is given on page 121. Properties of V 115 (i) The expected value of V is zero. ∂`(θ) E(V ) = E ∂θ ! ∂ ln f =E ∂θ but since then by differentiating wrt θ we get Z Z ! = Z Z 1 ∂f ∂f f dx = dx f ∂θ ∂θ f dx = 1 ∂f dx = 0 ∂θ which gives ∂`(θ) E ∂θ ! = 0. Intuitively, this is reasonable, as the mle is obtained by solving ∂` =0 ∂θ (ii) Var(V ) is called the information (or Fisher’s information) in a sample and is denoted by IX (θ), so we have " #2 ∂ IX (θ) = Var(V ) = E . log f (X; θ) ∂θ ∂`(θ) Var(score) = Var ∂θ ! ∂`(θ) =E ∂θ where Ix (θ) is called the information in the sample. !2 (7.7) def = Ix (θ) V If we consider two likelihood functions, both centered on the mle, one a spike ( ) T and the other a flat pulse ( ), then Var(score) is larger for the spike than for the pulse, since the function !2 ∂` ∂θ corresponds to the absolute change in the derivative, which is greater for the spike than for the pulse. Thus the information contained in the spike is stronger, while the pulse is less informing. Later it will be shown (p126) that the variance of an estimator is related to the inverse of the information, and so the spike will correspond to a situation where the 116 parameter is well estimated (ie, high precision or information), but low variance of estimation and a short confidence interval for the parameter. The flat pulse will correspond to a poorly estimated parameter, with low precision or information, and thus with high variance of estimation and wide confidence interval for the parameter. Thus we need to distinguish between the variance of the score, and the variance of b the estimator for the parameter (V (θ)). (iii) Information is additive over independent experiments. For X, Y independent, we have IX (θ) + IY (θ) = IX+Y (θ). (7.8) (iv) As a special case of (iii), the information in a random sample of size n is n times the information in a single observation. That is, IX (θ) = nIX (θ). (7.9) (v) The information provided by a sufficient statistic T = t(X) is the same as that in the sample X. That is, IT (θ) = IX (θ). (7.10) (vi) The information in a sample can be computed by an alternate formula, ∂V IX (θ) = −E ∂θ ! . An alternative form for Ix (θ) is ∂2` Ix (θ) = −E ∂θ 2 since the expected value of the score gives Z ! ∂ (ln f )f dx = 0 ∂θ which differentiated wrt θ gives Z " ie or Z " # ∂ ln f ∂f ∂ 2 ln f f+ dx = 0 2 ∂θ ∂θ ∂θ # ∂ 2 ln f ∂ ln f 1 ∂f f dx = 0 f+ 2 ∂θ ∂θ f ∂θ Z " # ∂ 2 ln f ∂ ln f ∂ ln f f dx = 0 + ∂θ 2 ∂θ ∂θ from which the result follows. 117 (7.11) (vii) For T = t(X) a statistic, IT (θ) ≤ IX (θ) (7.12) with equality holding if and only if T is a sufficient statistic for θ. [This property emphasizes the importance of sufficiency. The reduction of a sample to a statistic may lose information relative to θ, but there is no loss of information if and only if sufficiency is maintained in the data reduction.] Comment on (i) and (vi). A typical example where the “regularity conditions” don’t hold is the case where X is distributed U(0, θ). When the range space of X depends on θ, the order of integration (over X) and differentiation (with respect to θ) can’t usually be interchanged, as is done in proving (i) and (vi). 
In particular, for a sample of size 1 from f (x) = 1/θ, 0 < x < θ, we have L(θ; x) = 1/θ, and log L(θ; x) = − log θ ∂ 1 V = log L(θ; x) = − ∂θ θ Z θ 1 1 − . dx E(V ) = θ θ 0 1 = − 6= 0. θ Example 7.10 For X1 , . . . , Xn a random sample from a N(µ, σ 2 ) distribution, find (a) the information for µ; (b) the information for σ 2 . (a) We have 2 2 f (xi ; µ) = (2πσ 2 )−1/2 e−(xi −µ) /2σ 1 1 log f (xi ; µ) = − log(2πσ 2 ) − 2 (xi − µ)2 2 2σ 1 1 1 = − log(2π) − log σ 2 − 2 (xi − µ)2 2 2 2σ ∂ 1 V = log f = .2(xi − µ) ∂µ 2σ 2 ∂V 1 = − 2 , a constant, ∂µ σ ! 1 ∂V = 2 IX (µ) = −E ∂µ σ 118 Alternatively, we note that V 2 = (Xi − µ)2 /σ 4 and that Var(V ) = E(V 2 ) = 1 E(Xi − µ)2 = 2. 4 σ σ [Both IX (µ) and Var(V ) are expressions for the information in a single observation. The information in a random sample of size n is thus n/σ 2 .] (b) We have ∂ 1 (xi − µ)2 log f = − + ∂σ 2 2σ 2 2σ 4 ∂V 1 (xi − µ)2 = − ∂σ 2 2(σ 2 )2 (σ 2 )3 ! 1 E(Xi − µ)2 ∂V = − −E + ∂σ 2 2σ 4 (σ 2 )3 1 σ2 = − 4+ 2 3 2σ (σ ) 1 = 2σ 4 V = For a sample of size n, IX = n/2σ 4 . Example 7.11 Compute the information on p from n Bernoulli trials with probability of success equal to p. Now f (x; p) = px (1 − p)1−x log f (x; p) = x log p + (1 − x) log(1 − p) ∂ x 1−x V = log f (x; p) = − ∂p p 1−p ∂V x 1−x = − 2 − ∂p p (1 − p)2 ! ∂V 1 1−p E = − 2 .p − ∂p p (1 − p)2 1 = − p(1 − p) ! 1 ∂V −E = ∂p pq For a sample of size n, the information on p is IX (p) = n/pq. 119 Examples (Poisson) f (x; θ) = Y i and so ` = −nλ + to give P e−λ λxi e−nλ λ i xi = Q xi ! i (xi !) X i xi ln λ − ∂` = −n + ∂λ and Y (ln xi !) i P xi λ i P ∂2` xi (−1) = i 2 2 ∂λ λ Using the alternative formula for Ix (θ) gives " # P ∂2` nλ E[ i xi ] = 2 = n/λ I(θ) = −E = 2 2 ∂λ λ λ Using the first form gives ∂` I(θ) = E ∂λ !2 = E −n + 2 =n +E But ( ( P xi λ i 2 2 =n +E P xi )2 nλ ( i xi ) 2 − 2n = E − n2 λ2 λ λ2 i xi ) 2 = i X x2i + 2 i X xi xj i6=j with expectation nV (x) + nµ2x + 2(n(n − 1)/2)λ2 = nλ + nλ2 + (n2 − n)λ2 to give nλ + n2 λ2 − n2 = n/λ I(θ) = 2 λ as before. (Exponential) f (x; θ) = Y i and so λe−λxi = λn e−λ ` = n ln λ − λ 120 X i xi P P xi xi ) 2 − 2nE i 2 λ λ i P P X ( i xi with n X ∂` xi = − ∂λ λ i and ∂2` n n = − 2 ; I(λ) = 2 2 ∂λ λ λ Using the first form ∂` I(λ) = E ∂λ 2 P xi xi ) − 2n λ X X = E n /λ + ( i ( = E n/λ − X 2 but !2 2 xi ) 2 = i i ! x2i + 2 i X = E( X xi i X i !2 xi )2 − n2 /λ2 xi xj i6=j with expectation n(σ 2 + µ2 ) + 2 n(n − 1) 1 = n(1/λ2 + 1/λ2 ) + (n2 − n)(1/λ2 ) = n/λ2 + n2 /λ2 2 2 λ Thus ∂` I(λ) = E ∂λ !2 = n/λ2 as before. Outline of Proof of equation 7.7 ∂ f 0 (X; θ) f0 log f (X; θ) = = ∂θ f (X; θ) f 00 0 2 00 f f − (f ) f ∂V = = −V2 2 ∂θ f f ! ! ∂V f 00 E = E − E(V 2 ) ∂θ f V = Now f 00 = ∂2 f (X; θ) ∂θ 2 121 So f 00 E f ! = Z ∞ ∂2 f (x; θ) ∂θ 2 −∞ f (x; θ) f (x; θ)dx Z ∂2 ∞ = f (x; θ)dx ∂θ 2 | −∞ {z } =1 = 0 So ∂V E(V ) = −E ∂θ 2 ! = Var(V ) = IX (θ). Comments 1. The proof is somewhat ‘simplistic’ in the sense that just X is used rather that X. The latter would require multiple integrals rather than just a single integral. 2. The proof that E(V ) = 0 is similar. 3. Note the line where the order of integration (wrt x) and differentiation (wrt θ) is interchanged. This can only be done when regularity conditions apply. For instance, the limits on the integrals must not involve θ. 122 Chapter 8 Estimation 8.1 Some Properties of Estimators [Read CB 10.1 or HC 6.1.] 
Let the random variable X have a pdf (or probability function) that is of known functional form, but in which the pdf depends on an unknown parameter θ (which may be a vector) that may take any value in a set Θ (the parameter space). We can write the pdf as f (x; θ), θ ∈ Θ. To each value θ ∈ Θ there corresponds one member of the family. If the experimenter needs to select precisely one member of the family as being the pdf of the random variable, he needs a point estimate of θ, and this is the subject of sections 8.1 to 8.4 of this chapter. Of course, we estimate θ by some (appropriate) function of the observations X1 , . . . , Xn and such a function is called a statistic or an estimator. A particular value of an estimator, say t(x1 , . . . , xn ), is called an estimate. We will be considering various qualities that a “good” estimator should possess, but firstly, it should be noted that, by virtue of it being a function of the sample values, an estimator is itself a random variable. So its behaviour for different random samples will be described by a probability distribution. It seems reasonable to require that the distribution of the estimator be somehow centred with respect to θ. If it is not, the estimator will tend either to under-estimate or overestimate θ. A further property that a good estimator should possess is precision, that is, the dispersion of the distribution should be small. These two properties need to be considered together. It is not very helpful to have an estimator with small variance if it is “centred” far from θ. The difference between an estimator T = t(X1 , . . . , Xn ) and θ is referred to as an error, and the “mean squared error” defined below is a commonly used measure of performance of an estimator. Unbiasedness Definition 8.1 123 For a random sample X1 , . . . , Xn from f (x; θ), a statistic T = t(X1 , . . . , Xn ) is an unbiased estimator of θ if E(T ) = θ. Definition 8.2 The bias in T (as an estimator of θ) is bT (θ) = E(T ) − θ. (8.1) Mean Square Error Definition 8.3 For a random sample X1 , . . . , Xn from f (x; θ) and a statistic T= t(X1 , . . . , Xn ) which is an estimator of θ, the mean square error (mse) is defined as mse = E[(T − θ)2 ]. (8.2) The mse can be expressed alternatively as E[(T − θ)2 ] = E[(T − E(T )) + (E(T ) − θ)]2 = E[(T − E(T ))2 ] + [E(T ) − θ]2 . So we have mse = Var(T ) + b2T (θ). (8.3) Now from (8.3) we can see that the mse cannot usually be made equal to zero. It will only be small when both Var(T ) and the bias in T are small. So rather than use unbiasedness and minimum variance to characterize “goodness” of a point estimator, we might employ the mean square error. Example 8.1 Consider the problem of the choice of estimator of σ 2 based on a random sample of size n from a N (µ, σ 2 ) distribution. Recall that S2 = n X i=1 (Xi − X)2 /(n − 1) is often called the sample variance and has the properties E(S 2 ) = σ 2 , (so S 2 is unbiased) Var(S 2 ) = 2σ 4 /(n − 1). 124 [Note that this is not HC’s use of S 2 . See 4.1 Definition 3.] P 2 Consider the mle of σ 2 , ni=1 (Xi − X )/n, which we’ll denote by σ̂ 2 . Now σ̂ 2 = n−1 2 S n and 1 n−1 2 σ = 1− E(σ̂ ) = σ2 n n 2(n − 1) 4 (n − 1)2 Var(S 2 ) = σ . Var(σ̂ 2 ) = 2 n n2 2 Why is σ̂ 2 biased? To calculate σ̂ 2 we first have to extract the mean, consuming 1 degree of freedom. So we do not have n independent estimates of dispersion about the mean; we have (n − 1). Now σ̂ 2 is biased, but what about its mean square error? 
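The mean square error comparison is taken up immediately below. As a quick empirical illustration of the bias just described, the following R sketch averages S² and σ̂² over repeated normal samples; µ = 0, σ² = 1 and n = 5 are arbitrary illustrative choices.

set.seed(42)
n <- 5; sigma2 <- 1; nsim <- 20000     # arbitrary illustrative choices
s2 <- numeric(nsim); sig2hat <- numeric(nsim)
for (i in 1:nsim) {
  x          <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  s2[i]      <- var(x)                      # sample variance, divisor n - 1
  sig2hat[i] <- sum((x - mean(x))^2) / n    # mle, divisor n
}
mean(s2)        # close to sigma2 = 1 (unbiased)
mean(sig2hat)   # close to (n - 1) * sigma2 / n = 0.8 (biased downwards)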
Using (8.3), mse σ̂ Now for S 2 , 2 2(n − 1)σ 4 + = n2 σ 4 (2n − 1) = . n2 1 σ − (1 − )σ 2 n 2 2 2σ 4 2n − 1 4 mse S = Var(S ) = > σ n−1 n2 2 since 2 2 2n − 1 > n−1 n2 for n an integer greater than 1. So for the normal distribution the mle of σ 2 is better in the sense of mse than the sample variance. Consistency A further desirable property of estimators is that of consistency, which is an asymptotic property. To understand consistency, it is necessary to think of T as really being Tn , the nth member of an infinite sequence of estimators, T1 , . . . , Tn . Roughly speaking, an estimator is consistent if, as n gets large, the probability that Tn lies arbitrarily close to the parameter being estimated becomes itself arbitrarily close to 1. More formally, we have Definition 8.4 Tn = t(X1 , . . . , Xn ) is a consistent estimator of θ if lim P (|Tn − θ| ≥ ) = 0 for any > 0. n→∞ 125 (8.4) This is often referred to as convergence in probability of Tn to θ. An equivalent definition (for cases where the second moment exists) is Definition 8.5 Tn = t(X1 , . . . , Xn ) is a consistent estimator of θ if lim E[(Tn − θ)2 ] = 0. (8.5) n→∞ That is, the mse of Tn as an estimator of θ, decreases to zero as more and more observations are incorporated into its composition. Note that, using (8.3) we see that (8.5) will be satisfied if Tn is asymptotically unbiased and if Var(Tn ) → 0 as n → ∞. Asymptote means the truth. So as the sample size increases, Tn gets closer to the true value. When n = ∞, we have sampled the entire population. The idea of consistency can be gleaned from Figure 8.1 where Tn converges to θ. If it didn’t, Tn would not be a consistent estimator. Figure 8.1: Convergence of an estimator Tn θ n Example 8.2 126 Let Y be a random variable with mean µ and variance σ 2 . Let Y be the sample mean of n random observations taken on Y . Is Y a consistent estimator of µ? Now E(Y ) = µ so Y is unbiased. Also Var(Y ) = σ 2 /n → 0 as n → ∞, so Y is a consistent estimator of µ. NOTE These statements are the same for a consistent estimator Tn of θ : 1. lim P (|T − θ| ≥ ) = 0, > 0 n n→∞ 2. lim E[(T − θ)2 ] = 0 n n→∞ 3. lim P (|T − θ| < ) = 1, > 0 n n→∞ 4. E(Tn ) → θ, V (Tn ) → 0, as n → ∞ 5. ∃N : n > N for δ > 0, > 0 : P (|Tn − θ| < δ) > 1 − Operationally, consistency boils down to V (Tn ) → 0 as n → ∞, assuming that Tn is unbiased. For example, the sample mean from a N (µ, σ 2 ) population is unbiased, and V (X̄) = σ 2 /n and so X̄ is consistent for µ since V (X̄) → 0 as n → ∞. Examples (1 & 3) If Tn is unbiased for θ and σn2 = V (Tn ), then by Chebychev P (|Tn − θ| < δσn ) > 1 − 1/δ 2 , δ > 0 and choosing δσn = as being fixed, then P (|Tn − θ| < ) > 1 − This equivalent to 1 and 3. 127 σn2 → 1 as σn2 → 0. 2 (2 & 4) Choose a sample of size n from the uniform distribution U (θ), and the estimator of θ as the largest order statistic, Y(n) . Now f (y; θ) = 1/θ, 0 < y < θ The distribution of Tn = Y(n) is fYn (yn ) = nynn−1 /θ n , 0 < yn < θ Now n nθ θ and V [Y(n) ] = n+1 (n + 1)2 (n + 2) E[Y(n) ] = and so E(Tn ) → θ and V (Tn ) → 0 as n → ∞ (5) For the uniform distribution problem defined in (4) P (|Tn − θ| < δ) = P (θ − δ < Tn < θ), as yn < θ = Z θ θ−δ fY(n) (yn )dyn = 1 − Now P (|Tn − θ| < δ) = 1 − (θ − δ)n θn (θ − δ)n >1−ε θn where = (1 − δ/θ)n . For any δ, it is possible to make as small as possible (in particular smaller than ε), by suitable choice of n. Thus P = 1 − > 1 − ε ; < ε. 
Thus (1 − δ/θ)n < ε, or or n> n ln(1 − δ/θ) < ln ε ln ε = N (0 < δ < θ), ; (0 < ε < 1) ln(1 − δ/θ) Thus Y(n) is consistent for θ. Exercise Show that mse[Y(n) ] → 0 as n → ∞ (2) 128 Efficiency We will next make some comments on the property of efficiency of estimators. The term is frequently used in comparison of two estimators where a measure of relative efficiency is used. In particular, Definition 8.6 Given two unbiased estimators, T1 and T2 of θ, the efficiency of T1 relative to T2 is defined to be e(T1 , T2 ) = Var(T2 )/Var(T1 ), and T2 is more efficient than T1 if Var(T2 ) <Var(T1 ). Note that it is only reasonable to compare estimators on the basis of variance if they are both unbiased. To allow for cases where this is not so, we can use mse in the definition of efficiency. That is, Definition 8.7 An estimator T2 of θ is more efficient than T1 if mse T2 ≤ mse T1 , with strict inequality for some θ. Also the relative efficiency of T1 with respect to T2 is e(T1 , T2 ) = mse T2 E[(T2 − θ)2 ] = . mse T1 E[(T1 − θ)2 ] (8.6) Example 8.3 Let X1 , . . . , Xn denote a random sample from U(0, θ), with Y1 , Y2 , . . . , Yn the corresponding ordered sample. (i) Show that T1 = 2X and T2 = n+1 Yn n are unbiased estimates of θ. (ii) Find e(T1 , T2 ). Solution (i) Now E(Xi ) = θ/2 and Var(Xi ) = θ 2 /12 so θ E(T1 ) = 2E(X) = 2E(Xi ) = 2. = θ. 2 To find the mean of T2 , first note that the probability density function of Yn is fYn (y) = n(FX (y))n−1 fX (y), for 0 ≤ y ≤ θ ny n−1 = I(0,θ) (y). θn 129 So n Zθ n y dy E(Yn ) = n θ 0 " #θ n y n+1 = n θ n+1 0 nθ = . n+1 For T2 defined by T2 = n+1 Yn , n we have E(T2 ) = n+1 E(Yn ) = θ. n So both T1 and T2 are unbiased. (ii) Var(T1 ) = Var(2X) = 4Var(X) = 4θ 2 θ2 4Var(Xi ) = = . n 12n 3n To find Var(T2 ), first we need to find E(Yn2 ) from E(Yn2 ) = Z θ 0 y2 n n−1 y dy θn n θ n+2 = n θ n+2 n 2 = θ . n+2 n2 θ 2 n 2 θ − n+2 (n + 1)2 nθ 2 = . (n + 1)2 (n + 2) Var(Yn ) = So (n + 1)2 nθ 2 . n2 (n + 1)2 (n + 2) 2 θ = n(n + 2) Var(T2 ) = Since these estimates are unbiased, we may use definition 8.6, e(T1 , T2 ) = 3n Var(T2 ) θ2 3 = = . 2 Var(T1 ) n(n + 2) θ n+2 This is less than 1 for n > 1 so T2 is more efficient than T1 . 130 Example The mean and median from a Normal population are both unbiased for the population mean. The mean X̄ has variance σ 2 /n while the median X̃ has variance π 2 σ /n 2 for large n. Thus the mean is more efficient, with e = 0.637. Another interpretation is the following : If we required the median to give the same precision as the mean based on a sample of 100 observations, the sample using the median would need to be based on 157 (= π/2) observations. If both estimators are not unbiased, then we must use mse. Thus we call the more efficient estimator the one with the smaller mse. Then e(T1 , T2 ) = mse(T2 ) mse(T1 ) with mse(T2 ) < mse(T1 ) giving e < 1. Example The sample variance S 2 versus the mle σ̂ 2 . Now S2 2σ 4 /(n − 1) 2n2 e= 2 = = σ̂ (2n − 1)σ 4 /n2 (2n − 1)(n − 1) = 2n n = (1 + 1/(2n − 1)) (1 + 1/(n − 1)) > 1 2n − 1 n − 1 The mse(σ̂ 2 ) < mse(S 2 ), making the mle more efficient. Notice that both definitions of e the efficiency is purely conventional; the less efficient estimator could easily be the numerator, giving e > 1, as in the second example. 8.2 Cramér–Rao Lower Bound The concept of relative efficiency provides a criterion for choosing between two competing estimators, but it does not give us any assurance that the better of the two is any good. 
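Before taking up that concern, the relative efficiency obtained in Example 8.3 is easy to check by simulation. In the R sketch below, θ = 1 and n = 10 are arbitrary choices, and the ratio of sampling variances should be close to 3/(n + 2) = 0.25.

set.seed(7)
theta <- 1; n <- 10; nsim <- 20000     # arbitrary illustrative choices
t1 <- numeric(nsim); t2 <- numeric(nsim)
for (i in 1:nsim) {
  x     <- runif(n, 0, theta)
  t1[i] <- 2 * mean(x)                 # T1 = 2 * Xbar
  t2[i] <- (n + 1) * max(x) / n        # T2 = (n + 1) Y_n / n
}
var(t2) / var(t1)                      # e(T1, T2), close to 3 / (n + 2) = 0.25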
How do we know, for example, that there is not another estimator whose variance (or mse) is much smaller than the two considered? Minimum Variance Estimation The Theorem below gives a lower bound for the variance (or mse) of an estimator. 131 Theorem 8.1 Let T = t(X), based on a sample X from f (x; θ) be an estimator of θ (assumed to be one-dimensional). Then [1 + b0T (θ)]2 def [τ 0 (θ)]2 (8.7) Var(T ) ≥ = IX (θ) IX (θ) and mse(T ) ≥ [1 + b0T (θ)]2 IX (θ) b2T (θ) + (8.8) where bT (θ) is given in (8.1) and IX (θ) is defined in equation (7.7). Outline of Proof. [The validity depends on regularity conditions, where the interchange of integration and differentiation operations is permitted and on the existence and integrability of various partial derivatives.] Now V = ∂ ∂ log f (X; θ) = log L(θ) ∂θ ∂θ as in Definition (7.7) and we note that E(V ) = 0, so Var(V)= E(V 2 ) and cov(V, T ) = E(V T ) ! ∂ = E T log f (X; θ) ∂θ = Z ··· Z t(x1 , . . . , xn ) " X i # ∂ ln f (xi ; θ) f (x1 ; θ) . . . f (xn ; θ)dx1 . . . dxn ∂θ ∂ E(T ) ∂θ ∂ = [θ + bT (θ)] from (8.1) ∂θ = 1 + b0T (θ). = Recall that the absolute value of the correlation coefficient, for measuring correlation between any two variables is less than or equal to 1 and that ρV,T = cov(V, T )/σV σT so that we have [cov(V, T )]2 ≤ Var(V )Var(T ) 132 or Var(T ) ≥ [cov(V, T )]2 /Var(V ) [1 + b0T (θ)]2 = IX (θ) thus proving (8.7). Now (8.8) follows using (8.3). Corollary. For the class of unbiased estimators, Var(T ) = mse(T ) ≥ 1 . IX (θ) (8.9) Now inequality (8.9) is known as the Cramér–Rao lower bound, or sometimes the Information inequality. It provides (in “regular estimation” cases) a lower bound on the variance of an unbiased estimator, T. The inequality is generally attributed to Cramér’s work in 1946 and Rao’s work in 1945, though it was apparently first given by M. Frechet in 1937–38. Example 1 Sampling from a Bernoulli distribution with pf f (x; π) = π x (1 − π)1−x , x = 0, 1 What is the bound on the variance for an estimator of π? Note that we can use the following : Ix (θ) = nIx (θ), (θ = π) Now and so Hence ` = ln f = x ln π + (1 − x) ln(1 − π) x 1−x x−π ∂ ln f = + (−1) = ∂θ π 1−π π(1 − π) I = E( E(x − π)2 ∂` 2 1 ) = 2 = 2 ∂θ π (1 − π) π(1 − π) We notice the E(X) = π is unbiased for π, and so Var(T ) ≥ π(1 − π) n as expected. 133 Example 2 Sampling from an exponential distribution with pdf f (x; θ) = e−x/θ /θ, x > 0 What is the MVB for an estimator of θ? ` = ln f = − ln θ − x θ 1 x ∂` = − + 2 , ; E(X) = θ (mle) ∂θ θ θ ∂2` 1 x = 2 −2 3 2 ∂θ θ θ giving ∂2` 1 E(x) )= 2 −2 3 2 ∂θ θ θ θ 1 1 = 2 −2 3 =− 2 θ θ θ E( and so I = 1/θ 2 . So Ix (θ) = n θ2 If T is unbiased, eg, T = X̄ then V (T ) ≥ 1 θ2 = nI(θ) n V (X̄) = V (X) θ2 = n n in agreement with Reparameterisation An alternative form of the exponential is f (x; λ) = λe−λx , x > 0 What is the MVB for an estimator of λ? ` = ln λ − λx 1 ∂` = − x ; E(X) = 1/λ = τ (λ) ∂λ λ ∂2` 1 = − → I = 1/λ2 ∂λ2 λ2 134 So now the MVB for a (biased) estimator of λ, say X̄, is given by V (T ) ≥ [−1/λ2 ]2 1 [τ 0 (λ)]2 = = 2 nI(λ) n/λ nλ2 in agreement with E(X) = 1/λ, V (X) = 1/λ2 and V (X̄) = 1 . nλ2 Alternative method To return to the θ parameterisation of the exponential distribution, if we do not use an unbiased estimator, but in fact want the second λ form, we can examine an estimator T 0 with τ (θ) = 1/θ, in place of the unbiased version τ (θ) = θ. Now 2 [τ 0 (θ)]2 (−1/θ 2 ) 1 0 V (T ) ≥ = = 2 2 nI(θ) n/θ nθ in agreement with the λ parameterisation. 
Thus there is no need to resolve in terms of the new parameterisation. Simply use the biased form the the MVB. Definition 8.8 The (absolute) efficiency of an unbiased estimator T is defined as e(T ) = 1/IX (θ) . Var(T ) (8.10) Note that, because of (8.9), e(T ) ≤ 1, so we can think of e(T ) as a measure of efficiency of any given estimator, rather than the relative efficiency of one with respect to another as in Definition 8.1. In the case where e(T ) = 1, so that the actual lower bound of Var(T ) is achieved, some texts refer to the estimator T as efficient. This terminology is not universally accepted. Some prefer to use the phrase minimum variance bound (MVB) for 1/IX (θ), and an estimator which is unbiased and which attains this bound is called a minimum variance bound unbiased (MVBU) estimator. Example 8.4 In the problem of estimating θ in a normal distribution with mean θ and known variance σ , find the MVB of an unbiased estimator. 2 135 The MVB is 1/IX (θ) where ∂ log L(θ) IX (θ) = E ∂θ !2 . For a sample X1 , . . . , Xn we have for the likelihood, L(θ) = n Y i=1 √ 1 1 2 2 e− 2 (xi −θ) /σ 2πσ P 2 2 = (2π)−n/2 (σ 2 )−n/2 e− (xi −θ) /2σ n 1X n (xi − θ)2 /σ 2 log L(θ) = − log(2π) − log σ 2 − 2 2 2 n X ∂ −1 n log L(θ) = . − 2 (xi − θ) = − 2 (x − θ) 2 ∂θ 2σ σ i=1 ∂ E log L(θ) ∂θ !2 n2 E(X − θ)2 σ4 n2 = Var(X) = n/σ 2 4 σ = So the MVB is σ 2 /n. When can the MVB be Attained? It is easy to establish the condition under which the minimum variance bound of an unbiased estimator, (8.9), is achieved. In the proof of Theorem 8.2, it should be noted that the inequality concerning the correlation of V and T becomes an equality (that is, ρV,T = +1 or −1) when V is a linear function of T. Recalling that ∂ log L(θ) V = , ∂θ we may write this condition as ∂ log L(θ) = A(T − θ), ∂θ where A is independent of the observations but may be a function of θ, so we will write it as A(θ). So the condition for the MVB to be attained is that the statistic T=t(X1 , . . . , Xn ) satisfies ∂ log L(θ) = A(θ)(T − E(T )). (8.11) ∂θ Example 8.5 136 In the problem of estimating θ in a normal distribution with mean θ and known variance σ , where σ 2 is known, show that the MVB of an unbiased estimator can be attained. As in Example 8.2, ∂ log L(θ) n(x − θ) = . ∂θ σ2 2 Now defining T = t(X1 , . . . , Xn ) = X, we know it is an unbiased estimator of θ, and we see that (8.11) is satisfied, where A(θ) = n/σ 2 , (A not being a function of θ in this case). Thus the minimum variance bound can be attained. Comment. In the case of an unbiased estimator T where the MVB is attained, note that the inequality in (8.9) becomes an equality and we have ∂ log L(θ) Var(T ) = 1/IX (θ) = 1/E ∂θ !2 . (8.12) Also, squaring (8.11) and taking expectations of both sides, we have ∂ log L(θ) E ∂θ !2 = [A(θ)]2 E[(T − θ)2 ] = [A(θ)]2 Var(T ) That is, !2 ∂ log L(θ) Var(T ) = E /[A(θ)]2 ∂θ 1 1 = . , using (2.12) 2 [A(θ)] Var(T ) giving Var(T ) = 1/A(θ). So, if the statistic T satisfies (8.11), Var(T) can be identified immediately as the multiple of T − θ on the RHS. For instance, in Example 8.2, the factor n/σ 2 multiplying x − θ can be identified as the reciprocal of the variance of T , and it was not necessary to evaluate the MVB as in Example 8.2. Example 8.6 Consider the problem of estimating the variance, θ, of a normal distribution with known mean µ, based on a sample of size n. 
137 Now the likelihood is P 2 L(θ) = (2π)−n/2 θ −n/2 e− (xi −µ) /2θ n X n n log L(θ) = − log(2π) − log θ − (xi − µ)2 /2θ 2 2 i=1 P ∂ log L(θ) (xi − µ)2 n = − + ∂θ 2θ P 2θ 2 ! 2 n (xi − µ) = −θ 2 2θ n P which is in the form (2.11) where T = t(X1 , . . . , Xn ) = ni=1 (Xi − µ)2 /n. So, using this as the estimate of θ, the MVB is achieved and it is 2θ 2 /n. P P Note that E( (Xi − µ)2 ) = E(Xi − µ)2 = nVar(Xi ) = nθ so T is an unbiased estimator P of θ. Also, (Xi − µ)2 /θ ∼ χ2n so has variance 2n. Hence, " # P 2θ 2 θ2 θ (Xi − µ)2 . = 2 .2n = Var(T ) = Var . n θ n n Example 8.7 Consider the problem where we have a random sample X1 , . . . , Xn from a Poisson distribution with parameter θ and we wish to find the Cramér-Rao lower bound for the variance of an unbiased estimator of θ, and identify the estimator that has his variance. Now for f (x; θ) = e−θ θ x /x!, the likelihood of the sample is P e−nθ θ xi L(θ; x1 , . . . , xn ) = Qn i=1 (xi !) X log L(θ; x1 , . . . , xn ) = −nθ + xi log θ − log K P ∂ log L(θ) xi = −n + ∂θ θ −nθ + nx = θ n = [x − θ] θ = A(θ) [T − θ] where T (Xi ) = X is the statistic. This is in the correct form for the minimum variance bound to be attained and it is 1/A(θ) = θ/n. We note that X is an estimator which has variance θ/n. 138 Example For the Bernoulli distribution, can the MVB be attained? f = π x (1 − π)1−x , x = 0, 1 ` = x ln π + (1 − x) ln(1 − π) x 1−x ∂` = + (−1) ∂θ π 1−π = Therefore x−π = I(π)(x − π) π(1 − π) A(π) = 1 , π(1 − π) T = X, θ = π and the lower bound is attained. Notice again that the working was for a single sample value, so that for sample of size n, n A(π) = Ix (π) = π(1 − π) with T = X̄. Notes Finally, some important results : 1. If an unbiased estimator of some function of θ exists having a lower bound variance 1/[nI(θ)], the sampling is necessarily from a member of the exponential family. Conversely, for any member of the exponential family, there is always precisely one function of θ for which there exists an unbiased estimator with the minimum variance 1/[nI(θ)]. 2. A Cramér–Rao lower bound estimator can only exist if there is a sufficient statistic for θ. (The reverse is not necessarily true.) Proof The factorisation criterion for sufficiency requires that f (x; θ) = g[t(x); θ]h(x) 139 ie ∂ ln g ∂` = , ∂θ ∂θ while the MVB is attained if ∂` ∂ ln f (x; θ) = = A(θ)[T − τ (θ)] ∂θ ∂θ in general, which is a special case of the sufficiency condition. Thus even if the MVB is not attained there still may be a sufficient statistic. 3. (Blackwell–Rao) If an unbiased estimator T1 exists for τ (θ), where θ is the unknown parameter, and if T is a sufficient estimator of τ (θ), then there exist a function u(T ) of T which is also unbiased for τ (θ) with variance Var[u(T )] ≤ Var(T1 ). 8.3 Properties of Maximum Likelihood Estimates Statistical inference should be consistent with the assumption that the best explanation of a set of data is provided by the value of θ, (θ̂, say) that maximizes the likelihood function. Estimators derived by the method of maximum likelihood have some desirable properties. These are stated without proof below. 1. Sufficiency It was already established in section 7.6, that if a single sufficient statistic exists for θ, the maximum likelihood estimate of θ must be a function of it. That is, the mle depends on the sample observations only through the value of a sufficient statistic. 2. Invariance The maximum likelihood estimate is invariant under functional transformations. That is, if T = t(X1 , . . . 
, Xn ) is the mle of θ and if u(θ) is a function of θ, then u(T ) is the mle of u(θ). For example, if σ̂ is the mle of σ, then σ̂ 2 is the mle of σ 2 . That is, σc2 = σ̂ 2 . A full proof exists, but essentially the argument here is ∂` ∂θ ∂` = ∂u ∂θ ∂u so that maximisation wrt θ is equivalent to maximisation wrt u. Example 140 If σ̂ 2 is the mle of σ 2 in sampling from a N (0, σ 2 ) population, then σ̂ is the mle of σ. L= Y fi = i Y i P 2 2 2 2 1 1 − i xi /2σ √ e−xi /2σ = e n/2 σ n 2π σ 2π n 1X 2 2 n x /σ ` = − ln(σ 2 ) − ln(2π) − 2 2 2 i i P 2 ∂` n i xi = − + =0 ∂σ 2 2σ 2 2σ 4 to give the mle of σ 2 as 2 σ̂ = P x2i . n i The mle of σ̂ is given by ∂` n 1 X 2 (−2) =− − =0 x ∂σ σ 2 i i σ3 if 2 σ = P x2i ; σ̂ = n i sP x2i n i as required. As a final check ∂` ∂θ ∂` = ∂u ∂θ ∂u becomes ∂` ∂` ∂σ 2 = ∂σ ∂σ 2 ∂σ P ! P 2 2 n ∂` n i xi i xi 2σ = − + = = − 2+ 4 3 2σ 2σ σ σ ∂σ as required by invariance. 3. Consistency The maximum likelihood estimator is consistent. This can be shown from first principles (Wald(1949)), using the expected value of the log likelihood, but the simpler demonstration is via the asymptotic behaviour of the score statistic. Essentially we find that V (θ̂) → 0 as n → ∞ 4. Efficiency If there is a MVB estimator of θ, the method of maximum likelihood will produce it. 141 5. Asymptotic Normality Under certain regularity conditions, a maximum likelihood estimator has an asymptotically Normal distribution with variance 1/I(θ), ie, θb − θ q b V (θ) Proof ∼ N (0, 1), asy. CB p472 `= X ln f (xi ; θ) i def `0 (θ) = ∂` = `0 (θ0 ) + (θ − θ0 )`00 (θ0 ) + . . . ∂θ by a Taylor series expansion. Replacing θ by the mle (θ̂) gives `0 (θ̂) = `0 (θ0 ) + (θ̂ − θ0 )`00 (θ0 ) + . . . = 0 and so, approx √ √ `0 (θ0 ) √ `0 (θ0 )/ n n(θ̂ − θ0 ) = − n 00 = − 00 ` (θ0 ) ` (θ0 )/n If I(θ0 ) = E[`0 (θ0 )]2 = 1/V (θ) is the information from a single observation, then √ `0 (θ0 )/ n → N [0, I(θ0 )] by the CLT and −`00 (θ0 )/n → I(θ0 ) leading to √ n(θ̂ − θ0 ) → N [0, 1/I(θ0 )] Thus (θ̂ − θ0 ) q 1/ nI(θ0 ) as required. → N (0, 1) asy Thus the mle is consistent, and since the MVB is attained, it is also efficient. 142 Example Poisson distribution p(xi ; λ) = L= Y i e−λ λxi , xi = 0, 1, . . . xi ! P e−nλ λ i xi p(xi ; θ) = Q i (xi !) ` = ln L = −nλ + X i xi ln λ + P X (xi !) i ∂` xi = −n + i ∂λ λ P 2 xi ∂ ` nλ n = − i 2 → nI(θ) = 2 = 2 ∂λ λ λ λ in line with V (X̄) = λ/n. Thus X̄ − λ q as expected. λ/n ∼ N (0, 1) 6. MLE of vector parameters Define ∂ log L(θ) ∂ log L(θ) . Iij (θ) = E ∂θi ∂θj ! (8.13) Now the RHS of (8.13) can be expressed as ! ∂ log L(θ) ∂ log L(θ) cov , . ∂θi ∂θj As was the case with information on one parameter, IX (θ) [see (1.6) and (1.10)], there is an alternative formula for computing the terms of the information matrix. ! ∂ 2 log L(θ) Iij (θ) = − E , ∂θi ∂θj (8.14) provided certain regularity conditions are satisfied. Define an information matrix, I(θ) to have elements Ii,j , then, the mle’s θ̂i , found by solving the set of equations ∂ log L(θ) = 0, ∂θi i = 1, 2, . . . , have an asymptotically normal distribution with means θi and covariance matrix [I(θ)]−1 . 143 Example 8.8 Sampling Y1 , . . . , Yn from N (µ, σ 2 ) with σ 2 unknown. We wish to find the joint Information matrix for µ and σ 2 . L(µ, σ 2 ) = Y i 2 2 1 √ e−(yi − µ) /2σ σ 2π = σ −n (2π)−n/2 e− P i (yi − µ)2 /2σ 2 P n 1 i (yi − µ)2 n ` = ln L = − ln(σ 2 ) − ln(2π) − 2 2 2 σ2 X ∂` = − (yi − µ)(−1)/σ 2 ∂µ i X (−1) n ∂` = − 2 − (yi − µ)2 4 2 ∂σ 2σ 2σ i Now to get the terms in the Information matrix. 
X 1 n ∂2` = − =− 2 2 2 ∂µ σ i σ X n ∂2` (yi − µ)2 /σ 6 = − ∂σ 4 2σ 4 i " # n ∂2` nσ 2 n E = − = − ∂σ 4 2σ 4 σ6 2σ 4 since E(yi − µ)2 = σ 2 . Also " # " # # " X ∂2` ∂2` (yi − µ)/σ 4 = 0 E = E = E − ∂µ∂σ 2 ∂σ 2 ∂µ i Thus the matrix of second derivatives of ` is " −n/σ 2 0 0 −n/2σ 4 which is negative definite. Since V (θ) = I then V " µ̂ σ̂ 2 !# =− " −1 # " ∂2` =− E ∂θi ∂θj −n/σ 2 0 0 −n/2σ 4 144 #−1 !#−1 = " σ 2 /n 0 0 2σ 4 /n # as expected. We note that X̄ and σ̂ 2 are asymptotically normally distributed, and independent, with variances given. However, we know that the independence property and the normality and variance of X̄ are exact for any n. But the normality property and the variance of P (Xi − X̄)2 /n are strictly limiting ones. 8.4 Interval Estimates The notion of an interval estimate of a parameter θ with a confidence coefficient is assumed to be familiar. A point estimate, on its own, doesn’t convey any indication of reliability, but a point estimate together with its standard error would do so. This idea is incorporated into a confidence interval, which is a range of values within which we are “fairly confident” that the true (unknown) value of the parameter θ lies. The length and location of the interval are random variables and we cannot be certain that θ will actually fall within the limits evaluated from a single sample. So the object is to generate narrow intervals which include θ with a high probability. Examples such as (i) A CI for µ in a normal distribution where σ is either known or unknown; (ii) A CI for p where p is the probability of success in a binomial distribution; (iii) A CI for σ 2 in a normal distribution where the mean is either known or unknown; etc. will not be repeated here. However, we will mention the general method of construction of a confidence interval using a pivotal quantity. Further, we will find a confidence interval for a population quantile. Suppose θ̂L and θ̂U (both functions of X1 , . . . , Xn and hence random variables) are the lower and upper confidence limits respectively, for a parameter θ. Then if P (θ̂L < θ < θ̂U ) = γ, (8.15) the probability γ is called the confidence coefficient. The interval (θ̂L , θ̂U ) is referred to as a two–sided confidence interval, both endpoints being random variables. It is possible to construct 1–sided intervals such that P (θ̂L < θ) = γ or P (θ < θ̂U ) = γ, in which case only one end-point is random. The confidence intervals are respectively, (θ̂L , ∞), (−∞, θ̂U ). 145 Example In the construction of a confidence interval for the true mean µ when sampling from a Normal population, the sampling distribution of the sample mean X̄ is used, viz X̄ ∼ N (µ, σ 2 /n) or Z= X̄ − µ √ ∼ N (0, 1). σ/ n Now Z has distribution function 2 1 f (z) = √ e−z /2 . 2π Thus Z has a distribution which does not depend on knowledge of µ but Z involves the unknown µ. The variable Z is said to be pivotal for µ, as confidence intervals can be constructed easily for µ, using the form of the density for Z. Pivotal Method A very useful method for finding confidence intervals uses a pivotal quantity that has 2 characteristics. 1. It is a function of the sample measurements and the unknown parameter θ (where θ is the only unknown). 2. It has a probability distribution which does not depend on the parameter θ. Suppose that T=t(X) is a reasonable point estimate of θ, then we will denote this pivotal quantity by p(T, θ), and we will use the known form of the probability distribution of p(T, θ) to make the following statement. 
For a specified constant γ, (0 < γ < 1), and constants a and b, (a < b), P (a < p(T, θ) < b) = γ. (8.16) So, given T , the inequality (8.16) is solved for θ to obtain a region of θ–values which is a confidence region (usually an interval) for θ corresponding to the observed T–value. This rearrangement, of course, results in an equation of the form (8.15). Example 8.9 For random variable X ∼ U (0, θ), construct a 90% confidence interval for θ. Now we know that Yn , the largest order statistic from a sample of size n from this distribution, is sufficient for θ and has pdf fYn (y) = n y n−1 /θ n , 0 ≤ y ≤ θ. 146 Let Z = Yn /θ, then the pdf of Z is fZ (z) = n z n−1 , 0 ≤ z ≤ 1. We see that Yn /θ is a suitable pivotal quantity with the 2 characteristic properties referred to earlier. So we have P (a < Yn /θ < b) = .90. Noting that the cdf of Z is FZ (z) = z n , 0 ≤ z ≤ 1, values of a and b may be found as follows. FZ (a) = .05, and FZ (b) = .95 √ √ n n an = .05 and bn = .95, giving a = .05 and b = .95. So we may write √ √ n n P ( .05 < Yn /θ < .95) = .90. Rearranging, the confidence interval for θ is √ √ n n (Yn / .95, Yn / .05). Examples [Exponential] Using the form of the exponential as f (y; θ) = θe−θy , y > 0 construct a 100(1 − α)% confidence interval for θ using the estimator T = 2θ for a sample Y1 . . . Yn . Now MYi (t) = (verify!). So M2θYi (t) = MYi (2θt) = giving and so T ∼ χ22n . M2θ P Y (t) i i = θ θ−t 1 θ = θ − 2θt 1 − 2t 1 = MT (t) (1 − 2t)n Since the distribution of T is independent of θ, then T is pivotal for θ. Thus P (χ2L < T < χ2U ) = 1 − α 147 P i Yi ie P (χ2L < 2θ X i or is the required CI for θ. Yi < χ2U ) = 1 − α χ2 χ2 P ( PL < θ < PU ) = 1 − α 2 i Yi 2 i Yi [Poisson] For sampling from the Poisson distribution, the sample mean Ȳ ∼ N (λ, λ/n) for large n. Examine methods for constructing confidence intervals for λ. Now and so W is pivotal for λ. Ȳ − λ W = q ∼ N (0, 1) λ/n (Method 1) Replace the λ in the SE with Ȳ . That is, use observed instead of expected information. Then we have Ȳ − λ W1 = q ∼ N (0, 1) Ȳ /n and the approximate (1 − α)100% CI for λ becomes q Ȳ ± Zα/2 Ȳ /n (Method 2) Using the expected information for λ gives a (1 − α)100% CI for λ from P (−Zα/2 < Z < Zα/2 ) = 1 − α ie, the upper and lower limits for λ satisfy ie or Ȳ − λ q λ/n Ȳ − λ 2 = Zα/2 2 = Zα/2 λ/n 2 λ2 − 2Ȳ λ + Ȳ 2 = Zα/2 λ/n. 148 The root of the resulting equation 2 λ2 − 2λ Ȳ + Zα/2 /n + Ȳ 2 = 0 are real, and give λu and λl where P (λl < λ < λu ) = 1 − α defines the 100(1 − α)% confidence interval for λ. (Method 3) We have the exact result that X i Yi ∼ P (nλ) So if we define Λ = nλ and use Λ̂ = X yi i then we can use tables of the Poisson distribution to find P (L < Λ < U ) = 1 − α which will become P (L/n < λ < U/n) = 1 − α Due to the Poisson being discrete, the choice of α will be limited. For example, P if n = 10 and i yi = 9.0 then P (3 < Λ < 15) = 0.957 giving a 96% CI for λ of (0.3, 1.5). 
> y=0:20 > sp <- ppois(y,9) > print(cbind(y,sp,1-sp)) y sp [1,] 0 0.0001234098 0.9998765902 [2,] 1 0.0012340980 0.9987659020 [3,] 2 0.0062321951 0.9937678049 [4,] 3 0.0212264863 0.9787735137 [5,] 4 0.0549636415 0.9450363585 [6,] 5 0.1156905208 0.8843094792 [7,] 6 0.2067808399 0.7932191601 149 [8,] [9,] [10,] [11,] [12,] [13,] [14,] [15,] [16,] [17,] [18,] [19,] [20,] [21,] 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0.3238969643 0.4556526043 0.5874082443 0.7059883203 0.8030083825 0.8757734292 0.9261492307 0.9585336745 0.9779643408 0.9888940906 0.9946804287 0.9975735978 0.9989440463 0.9995607481 0.6761030357 0.5443473957 0.4125917557 0.2940116797 0.1969916175 0.1242265708 0.0738507693 0.0414663255 0.0220356592 0.0111059094 0.0053195713 0.0024264022 0.0010559537 0.0004392519 Large Sample Confidence Intervals The asymptotic distribution of the mle is given as Q= q θ̂ − θ 1/nI(θ) ∼ N (0, 1) so Q is pivotal for θ. Furthermore if we replace expected information I(θ) by observed information I(θ̂) then a confidence interval for θ can be constructed easily, as per Method 1. If the parameterisation of θ is such that I(θ) is independent of θ then observed and expected information coincide, and Method 1 applies without the necessity of replacing expected values for θ by observed values. Comment. Note that there is some arbitrariness in the choice of a confidence interval in a given problem. There are usually several statistics T = t(X1 , . . . , Xn ) that could be used, and it is not really necessary to allocate equal probability to the two tails of the distribution, as was done in the above example. However, it is customary to do this, as this often leads to the shortest confidence interval (for the same confidence coefficient), another property considered desirable. 150 Chapter 9 Hypothesis Testing 9.1 9.1.1 Basic Concepts and Notation Introduction As an alternative to estimating the values of one or more parameters of a probability distribution, as was the objective in Chapter 8, we may test hypotheses about such parameters. Both estimation and hypothesis testing may be viewed as different aspects of the same general problem of reaching decisions on the basis of data. Explicit formulation as well as important basic concepts on the theory of testing statistical hypotheses are due to J. Neyman and E.S. Pearson, who are considered pioneers in the area. Although the notation of H and A for hypotheses and alternatives has been used, we will now use that of Hogg and Craig where the terms null hypothesis and alternative hypothesis are used, with corresponding notation H0 and H1 . The null hypothesis is always a statement of either ‘no effect’ or the ‘status quo’. If the statistical hypothesis completely specifies the distribution, it is called simple; if it does not, it is called composite. After experimentation (for example, taking a sample of size n, (X1 , . . . , Xn ) or X), some reduction in data is used, resulting in a test statistic, T= t(X), say. We may consider a subset of the range of possible values of T as a rejection (or critical) region. Previously this has been denoted by R, but to be consistent with HC, we will now use C. That is, C is the subset of the sample space of T , which leads to the rejection of the hypothesis under consideration. The region C can refer to either the X-values or the T-values. The case where both H0 and H1 are simple, where the size of the Type I and Type II error is easily determined, is assumed known. You may find it helpful to read CB 10.3 or HC 7.1. 
In the case of composite hypotheses and alternatives, the power function of the test is an important tool for evaluating its performance, and this is examined in the following section.

9.1.2 Power Function and Significance Level

Suppose that T is the test statistic and C the critical region for a test of a hypothesis concerning the value of a parameter θ. Then the power function of the test is the probability that the test rejects H0 when the actual parameter value is θ. That is,

    π(θ) = Pθ(rejecting H0) = Pθ(C).    (9.1)

Some texts interpret power as the probability of rejecting H0 when it is false, but the more general interpretation is the probability that the test rejects H0 for θ taking values given by either H0 or H1.

Suppose we want to test the simple H0 : θ = θ0 against the composite alternative H1 : θ ≠ θ0. Ideally we would like a test to detect a departure from H0 with certainty; that is, we would like π(θ) to be 1 for all θ in H1, and π(θ) to be 0 for θ in H0. Since, for a fixed sample size, P(rejecting H0 | H0 is true) and P(not rejecting H0 | H0 is false) cannot both be made arbitrarily small, the ideal test is not possible.

So long as H0 is simple, it is possible to define P(Type I error), denoted by α, as P(rejecting H0 | H0 is true). But to allow for H0 to be composite, we need the following definitions.

Definition 9.1 The size of a test (or of a critical region) is

    α = max_{θ∈H0} Pθ(reject H0) = max_{θ∈H0} π(θ).    (9.2)

This is also known as the significance level.

Definition 9.2 The size of the Type II error is

    β = max_{θ∈H1} [1 − π(θ)].    (9.3)

Some statisticians regard the formal approach above, of setting up a rejection region, as not the most appropriate, and prefer to compute a P–value. This involves the choice of a test statistic T, extreme values of which provide evidence against H0. The statistic T should be a good estimator of θ and its distribution under H0 should be known. After experimentation, an observed value of T, t say, is examined to see whether it can be considered extreme, in the sense of being unlikely to occur if H0 were true. The computed P–value is the probability of observing T = t or something more extreme. It is the "α" at which the observed value T = t is just significant.

The test situation can be summarised in a table:

                     Accept H0 (C̄)          Reject H0 (C)
    H0 true          correct decision        α (Type I error)
    H0 false         β (Type II error)       correct decision

and Power = 1 − β = P(Reject H0 | H0 is false).

The size of the test is α, the probability of a Type I error, and is also known as the significance level. In practice there is a trade–off between α and β: α is preset, while β is unknown but can be estimated under the alternative. Power is typically a function of the parameter under test.

A P–value allows an experimenter to carry out a test without consulting tables of critical values of the test statistic at a prescribed size, and is routinely reported by statistical software. Formally,

    P–value = P(obtaining a value of the test statistic at least as extreme as that observed, in the direction of H1 | H0 is true).

9.1.3 Relation between Hypothesis Testing and Confidence Intervals

It may be recalled that rejecting a null hypothesis about θ (θ = θ0, say) at the 5% significance level is equivalent to saying that the value θ0 is not included in a 95% confidence interval for θ. So we have a duality property here. We will illustrate this with an example.

Example 9.1

Consider the family of normal distributions with unknown mean µ and known variance σ². Let zα be defined by P(Z ≥ zα) = α.
For a 2–sided alternative (using a 2–tailed test), the rejection region for a test of size α is ( That is, ( ) |x − µ0 | √ > zα/2 . x: σ/ n ) zα/2 σ zα/2 σ x : x > µ0 + √ or x < µ0 − √ . n n 153 This is the event {X ∈ C(θ)} and it has probability α. The complementary event, {X ∈ / C(θ)} has probability 1 − α. The latter event can be written equivalently as n which is equivalent to and √ √ o x : µ0 − zα/2 σ/ n < x < µ0 + zα/2 σ/ n . √ √ x − zα/2 σ/ n < µ0 < x + xα/2 σ/ n √ √ (x − zα/2 σ/ n, x + zα/2 σ/ n) is a 100(1 − α)% confidence interval for µ. Note There is a duality between the size of the test and the confidence coefficient of the corresponding confidence interval, for a two sided test. Example Testing the variance from a Normal population. For the test H0 : σ 2 = σ02 vs H1 : σ 2 6= σ02 , the test statistic is νs2 ∼ χ2ν 2 σ where ν = n − 1 = df . Thus the corresponding acceptance region (C) is defined as χ2ν,L < νs2 < χ2ν,U σ02 where L and U are the lower and upper α/2 points of the χ2 distribution on n − 1 df. That is ! νs2 2 2 P χν,L < 2 < χν,U = 1 − α. σ0 This can be written as P νs2 νs2 2 <σ < 2 χ2ν,U χν,L ! to give a confidence interval for σ 2 with confidence coefficient α. 154 9.2 9.2.1 Evaluation of and Construction of Tests Unbiased and Consistent Tests In the case of estimation of parameters, it was necessary to define some desirable properties for estimators, to enable us to have criteria for choosing between competing estimators. Similarly in hypothesis testing, we would like to use a test that is “best” in some sense. Note that a test specifies a critical region. Alternatively, the choice of a critical region defines a test. That is, the terms ‘test’ and ‘critical region’ can, in this sense, be used interchangeably. So if we define a best critical region, we have defined a best test. The analogue for unbiasedness and consistency in estimation are defined below for hypothesis testing. Definition 9.3 A test is unbiased if Pθ (rejecting H0 |H1 ) is always greater than Pθ (rejecting H0 |H0 ). That is, min π(θ) ≥ max π(θ). θ∈H1 θ∈H0 Definition 9.4 A sequence of tests {ψn }, each of size α, is consistent if their power functions approach 1 for all θ specified by the alternative. That is, πψn (θ) → 1, for θ ∈ H1 . 9.2.2 Certain Best Tests When H0 and H1 are both simple, the error sizes α and β are uniquely defined. In this section we require that both the null hypothesis and alternative hypothesis are simple, so that in effect, the parameter space is a set consisting of exactly 2 points. We will define a best test for testing H0 against H1 , and in 9.2.3 we will prove a Theorem that provides a method for determining a best test. Let f (x; θ) denote the density function of a random variable X. Let X1 , X2 , . . . , Xn denote a random sample from this distribution and consider the simple hypothesis H0 : θ = θ0 and the simple alternative H1 : θ = θa . So H0 ∪ H1 = {θ0 , θa }. One repetition of the experiment will result in a particular n–tuple, (x1 , x2 , . . . , xn ). Consider a set Ci , which is a collection of n–tuples having size α that is, Ci has the property that P [(X1 , X2 , . . . , Xn ) ∈ Ci |H0 is true ] = α. It follows that Ci can be thought of as a critical region for the test. Specifically, if the observed n–tuple (x1 , x2 , . . . , xn ) falls in our pre–selected Ci , we will reject H0 . However, if 155 HA were true, then intuitively the ‘best’ critical region would be the one having the highest probability of containing (x1 , x2 , . . . , xn ). 
Formalizing this notion, we have the following definition.

Definition 9.5 C is called the best critical region (BCR) of size α for testing the simple H0 against the simple H1 if

(a) P[(X1, X2, . . . , Xn) ∈ C | H0] = α,

(b) P[(X1, . . . , Xn) ∈ C | H1] ≥ P[(X1, . . . , Xn) ∈ Ci | H1] for every other Ci of size α.

This definition can be stated in terms of power: suppose that there is one of these subsets, say C, such that when H1 is true, the power of the test associated with C is at least as great as the power of the test associated with every other Ci; then C is a best critical region of size α.

Example

A coin is tossed twice. We wish to test H0 : π = 1/2 vs H1 : π = 2/3, where π = P(heads). The test rejects H0 if two heads occur. Let the number of heads be denoted by X. Is {X = 2} the BCR for the test?

Under H0 the distribution of outcomes is:

    x          0     1     2
    P(X = x)   1/4   1/2   1/4

Under H1 the distribution of outcomes is:

    x          0     1     2
    P(X = x)   1/9   4/9   4/9

The size of the test is α = P(X = 2 | π = 1/2) = 1/4, while the power is 1 − β = P(X = 2 | π = 2/3) = 4/9.

To see whether {X = 2} is a BCR, try X ≥ 1 as an alternative rejection region. Now α = P(X ≥ 1 | π = 1/2) = 1/2 + 1/4 = 3/4 and the power is 1 − β = P(X ≥ 1 | π = 2/3) = 8/9, so the power has increased but so has the size; a region of larger size is not a fair competitor. Among critical regions of size 1/4 the only other choice is {X = 0}, which has power P(X = 0 | π = 2/3) = 1/9 < 4/9. Thus {X = 2} is the BCR of size 1/4 for the test.

Definition 9.6 A test of the simple hypothesis H0 versus the simple alternative H1 that has the smallest β (or equivalently, the largest π(θ)) among tests with no larger α is called most powerful.

Example 9.2

Suppose X ∼ bin(5, θ). Let f(x; θ) denote the probability function of X. Consider H0 : θ = 1/2, H1 : θ = 3/4. The table in Figure 9.1 gives the values of f(x; 1/2), f(x; 3/4) and f(x; 1/2)/f(x; 3/4) for x = 0, 1, . . . , 5.

    Figure 9.1: Null vs alternative

    x                     0        1         2         3          4          5
    f(x; 1/2)             1/32     5/32      10/32     10/32      5/32       1/32
    f(x; 3/4)             1/1024   15/1024   90/1024   270/1024   405/1024   243/1024
    f(x; 1/2)/f(x; 3/4)   32       32/3      32/9      32/27      32/81      32/243

Using X to test H0 against H1, we shall first assign significance level α = 1/32 and want a best critical region of this size. Now C1 = {x : x = 0} and C2 = {x : x = 5} are possible critical regions and there is no other subset with α = 1/32. So either C1 or C2 is the best critical region for this α. If we use C1 then P(X ∈ C1 | H1) = 1/1024, so P(rejecting H0 | H1 is true) < P(rejecting H0 | H0 is true), an unacceptable situation. On the other hand, if we use C2 then P(X ∈ C2 | H1) = 243/1024, so P(rejecting H0 | H1 is true) > P(rejecting H0 | H0 is true), a much more desirable state of affairs. So C2 is the best critical region of size α = 1/32 for testing H0 against H1.

It should be noted that, in this problem, the best critical region, C, is found by including in C the point (or points) at which f(x; 1/2) is small in comparison with f(x; 3/4). This suggests that in general the ratio f(x; θ0)/f(x; θ1) provides a tool by which to find a best critical region for a given value of α.

Example

We wish to test H0 : θ = 2 vs H1 : θ = 4 for a sample of 2 observations from the exponential distribution with density

    f(y; θ) = e^{−y/θ}/θ, y > 0.

The critical region for the test is defined as C : {(Y1, Y2) ; 9.5 ≤ Y1 + Y2}, which makes sense if you plot the null and alternative densities. Find the size and power of the test.

Under H0 the joint density is

    f(y1, y2; θ = 2) = (1/4) e^{−(y1 + y2)/2}, y1, y2 > 0.

The size of the test is

    α = P(Y ∈ C | H0) = 1 − P(Y1 + Y2 < 9.5 | θ = 2)
      = 1 − (1/4) ∫_{y2=0}^{9.5} ∫_{y1=0}^{9.5−y2} e^{−(y1 + y2)/2} dy1 dy2 = 0.0497

Power?
If H1 is true, then f (y; θ = 4) = e−(y1 + y2 )/4 , y1 , y2 > 0 16 and β = P (Y ∈ C|H1 ) = P (Y1 + Y2 < 9.5|θ = 4) = (1/16) Z y2 =9.5 y2 =0 Z y1 =9.5−y2 y1 =0 e−(y1 + y2 )/4 dy1 dy2 = 0.686 giving the power of the test as 0.314. Can we find a better test? For example, try the CR as C : {(Y1 , Y2 ) ; 9.0 ≤ Y1 + Y2 } The following fundamental theorem, due to Neyman and Pearson, tells us that we cannot find a better test and it provides the methodology for deriving the most powerful test for testing simple H0 against simple H1 . 158 9.2.3 Neyman Pearson Theorem Suppose X1 , . . . , Xn is a random sample with joint density function f (X; θ). For simple H0 and simple H1 , the joint density function can be written as f0 (x; θ), f1 (x; θ), respectively. Alternatively, we could use the likelihood notation, L(θ0 ; x), L(θ1 ; x). Theorem 9.1 In testing H0 : θ = θ0 against H1 : θ = θ1 , the critical region CK = {x : f0 (x)/f1 (x) < K} is most powerful (where K ≥ 0). [Or, in terms of likelihood, for a given α, the test that maximizes the power at θ1 has rejection region determined by L(θ0 ; x1 , . . . , xn )/L(θ1 ; x1 , . . . , xn ) < K. Such a test will be most powerful for testing H0 against H1 .] Proof. The constant K is chosen so that P (x ∈ CK |H0 ) = Z Z ... CK f0 (x)dx1 . . . dxn = α Let AK be another region in the sample space of size α. Then P (x ∈ AK |H0 ) = Z Z ... AK f0 (x)dx1 . . . dxn = α The regions CK and AK may overlap, as shown in Figure 9.2. AK CK 0 CK I A0K 0 Figure 9.2: The regions CK , CK , AK and A0K . Now α= = Z Z CK A0K f0 (x)dx = f0 (x)dx + Z Z I 0 CK f0 (x)dx + f0 dx = 159 Z AK Z I f0 dx f0 (x)dx. This implies that Z 0 CK Z f0 dx = A0K f0 dx which are equal to α if CK and AK are disjoint. 0 Since CK ∈ CK and f0 (x)/f1 (x) < K then f0 (x) < Kf1 (x) giving Z ie Z 0 CK f0 dx < K Z 0 CK f1 dx > (1/K) 0 CK Also, for x 6∈ CK , f1 dx Z f0 dx 0 CK f0 (x)/f1 (x) > K But A0K is outside CK , and so Z giving Z A0K f0 dx > K Z A0K f1 dx < (1/K) A0K f1 dx Z f0 dx A0K Now for the power : 1−β = Z CK ≥ (1/K) ≥ (1/K) ≥ and so Z A0K Z f1 (x)dx = Z Z 0 CK A0K f0 (x)dx + f0 (x)dx + f1 (x)dx + Z CK 0 CK f1 (x)dx + Z I f1 dx ≥ Z f1 dx = Z AK f1 dx f1 dx I Z I f1 dx I Z Z AK f1 dx f1 dx. Thus the test based on CK is more powerful than the test based on AK , and so the test based on CK is the most powerful. Example 1 160 Sampling from N (µ, 1) distribution and testing H0 : µ = µ0 vs H1 : µ = µ1 Now P 2 1 − i (xi − µ0 ) e f0 = f (x; θ0 ) = (2π)n/2 and P 2 1 − i (xi − µ1 ) f1 = f (x; θ1 ) = e (2π)n/2 giving P P 2 2 f0 = e−[ i (xi − µ0 ) − i (xi − µ1 ) ]/2 f1 On simplification, the BCR is defined by P 2 f0 = e−n(µ0 − µ1 ) /2 + i xi (µ0 − µ1 ) < K f1 ie, X i Now if µ0 > µ1 then P xi (µ0 − µ1 ) < ln K + n(µ0 − µ1 )(µ0 + µ1 ) X i xi < n ln K + (µ0 + µ1 ) (µ0 − µ1 ) 2 ie, the CR becomes i Xi < constant. Thus we reject H0 for small values of the sample mean, as expected. When µ0 < µ1 , then the NP condition becomes X n xi (µ1 − µ0 ) (µ1 − µ0 )(µ0 + µ1 ) < ln K + 2 i ie, X i Xi > − n ln K + (µ0 + µ1 ) (µ1 − µ0 ) 2 and thus we reject H0 for large values of the sample mean, as expected. Note that we use the sampling distribution of the mean to find K. These tests will be the most powerful under each of the conditions specified. 
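The final step of Example 1 — finding K from the sampling distribution of X̄ and then evaluating the power — is easy to carry out numerically. A minimal R sketch follows, using illustrative values µ0 = 0, µ1 = 1, n = 10 and α = 0.05 (these numbers are assumptions chosen for illustration, not values taken from the example):

# Most powerful test of H0: mu = mu0 vs H1: mu = mu1 (> mu0) for N(mu, 1) data:
# reject H0 when the sample mean exceeds a cutoff found from its null distribution.
mu0 <- 0; mu1 <- 1; n <- 10; alpha <- 0.05

# Under H0, Xbar ~ N(mu0, 1/n), so the size-alpha cutoff is the upper alpha point
cutoff <- qnorm(1 - alpha, mean = mu0, sd = 1 / sqrt(n))
cutoff                                            # approximately 0.52

# Power at mu = mu1 is P(Xbar > cutoff) computed under N(mu1, 1/n)
1 - pnorm(cutoff, mean = mu1, sd = 1 / sqrt(n))   # approximately 0.94

For µ1 < µ0 the same idea applies with the lower tail: take the cutoff from qnorm(alpha, ...) and reject H0 for small values of the sample mean.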
Example 2 What is the BCR for a test of size 0.05 for H0 : θ = 1/2 vs H1 : θ = 2 161 for a single observation from the population with df f (y; θ) = θe−θy , y > 0 Now f0 = e−y/2 /2 while f1 = 2e−2y The critical region C is defined by f0 <K f1 ie, e1.5y e−y/2 /2 <K = 4 2e−2y Thus, reject H0 for small Y , as expected. (Verify using diagrams of the densities under H0 and H1 .) Under H0 , the size of the test requires that P (Y ∈ C|H0 ) = α = 0.05 = P (0 < Y < cv|θ = 0.5) Thus (1/2) To find K, use Z 0 cv cv e−y/2 → cv = 0.1026 e−y/2 dy = 0.05 = (1/2) −1/2 0 e1.5 × 0.1026 = 0.2916 = K 4 The test defined by C will be the most powerful. Example 9.3 Suppose X represents a single observation from the probability density function given by f (x; θ) = θxθ−1 , 0 < x < θ. Find the most powerful (MP) test with significance level α = .05 to test H0 : θ = 1 versus H1 : θ = 2. Solution. 162 Since both H0 and H1 are simple, the previous Theorem can be applied to derive the test. Here L(θ0 ) f (x; θ0 ) 1 × x1−1 1 = = = . 2−1 L(θa ) f (x; θa ) 2×x 2x The form of the rejection region for the MP test is 1 < k. 2x equivalently, x > 1/2k or, since 1/2k is a constant (k 0 say), the critical region is x > k 0 . The value of k 0 is determined by .05 = P (X is in the critical region when θ = 1) = P (X > k 0 when θ = 1) = Z 1 k0 1.dy = 1 − k0 So k 0 = .95. So the rejection region is C = {y : y > .95}. Among all tests for H0 versus H1 based on a sample of size 1 and α = .05, this test has smallest Type II error probability. [ Note that the form of the test statistic and rejection region depends on both H0 and H1 . If H1 is changed to θ = 3, the MP test is based on Y 2 and we reject H0 in favour of H1 if Y 2 > k 0 for some k 0 ]. 9.2.4 Uniformly Most Powerful (UMP) Test Suppose we sample from a population with a distribution that is completely specified except for the value of a single parameter θ. If we wish to test H0 : θ = θ0 (simple) versus H1 : θ > θ0 (composite) there is no general theorem like Theorem 9.1 that can be applied. But it can be applied to find a MP test for H0 : θ = θ0 versus HA : θ = θa for any single value θa ∈ H1 . In many situations the form of the rejection region for the MP test does not depend on the particular choice of θa . When a test obtained by Theorem 9.1 actually maximizes the power for every value of θ > θ0 , it is said to be uniformly most powerful, (UMP) for H0 : θ = θ0 against H1 : θ > θ0 . We may state the definition as follows: Definition 9.7 The critical region C is a uniformly most powerful critical (UMPCR) of size α for testing the simple hypothesis H0 against a composite alternative H1 if the set C is a best critical region of size α for testing H0 against each simple hypothesis in H1 . A test defined by this critical region C is called a uniformly most powerful test, with significance level α, for testing the simple H0 against the composite H1 . 163 Uniformly most powerful tests don’t always exist, but when they do, the Neyman Pearson Theorem provides a technique for finding them. Example 9.4 Let X1 , . . . , Xn be a random sample from a N (0, θ) distribution where the variance θ is unknown. Find a UMP test for H0 : θ = θ0 (> 0) against H1 : θ > θ0 . Solution. Now H0 ∪ H1 = {θ : θ ≥ θ0 }. The likelihood of the sample is L(θ) = 1 2πθ n/2 1 e− 2 P x2i /θ . Let θa be a number greater than θ0 and let K > 0. Let C be the set of points where L(θ0 ; x1 , . . . , xn )/L(θa ; x1 , . . . , xn ) ≤ K. 
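As a quick numerical check of Example 9.3 above: the density θx^(θ−1) on (0, 1) is the Beta(θ, 1) density, so the size and power of the most powerful test "reject H0 if X > 0.95" can be evaluated directly in R (this is only a sketch of the check, not an additional derivation):

# Size under H0: theta = 1, so X ~ U(0, 1)
1 - punif(0.95)          # = 0.05, as required

# Power under H1: theta = 2, so X ~ Beta(2, 1) and P(X > x) = 1 - x^2
1 - pbeta(0.95, 2, 1)    # = 0.0975

The power is small because the test is based on a single observation, but it is still the largest power achievable by any size-0.05 test of these two simple hypotheses.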
That is, the set of points where θa θ0 !n/2 Or equivalently, X x2i 1 e− 2 P x2i (θa −θ0 )/θ0 θa " ≤ K. # 2θ0 θa n θa ≥ log( ) − log K = c , say. (θa − θ0 ) 2 θ0 P The set C = {(x1 , . . . , xn ) : x2i ≥ c} is then a BCR for testing H0 : θ = θ0 against H1 : θ = θa . It remains to determine c so that this critical region has the desired α. If H0 P is true, Xi2 /θ0 is distributed as χ2n . Since α = P( X Xi2 /θ0 ≥ c |H0 ), θ0 c/θ0 may be found from tables of χ2 or using pchisq in R. So C defined above is a BCR of size α for testing H0 : θ = θ0 against H1 : θ = θa . We note that, for each number θa > θ0 , the above argument holds. So C = {(x1 , . . . , xn ) : P 2 xi ≥ c} is a UMP critical region of size α for testing H0 : θ = θ0 against H1 : θ > θ0 . To be specific, suppose now that θ0 = 3, n = 15, α = .05, show that c = 75. Example 1 For a sample Y1 , . . . , Yn from a N (µ, 1) distribution, we wish to test H0 : µ = µ0 vs H1 : µ > µ0 . 164 To test H0 : µ = µ0 vs H1 : µ = µ1 . if µ1 > µ0 , the the region C : Ȳ > K + provides a MP test, where K + is chosen such that P (Ȳ > K + |H0 ) = α. Thus for every µ > µ0 we can find a MP test, since C depends on µ0 and not on µ1 . So the test defined by i h T + : Reject H0 if Ȳ > K + defines a UMP test for H0 : µ = µ0 vs H1 : µ > µ0 . Similarly, if µ1 < µ0 , the the region C : Ȳ < K − provides a MP test, where K − is chosen such that P (Ȳ < K − |H0 ) = α. Thus for every µ < µ0 we can find a MP test, since C depends on µ0 and not on µ1 . So the test defined by i h T − : Reject H0 if Ȳ < K − defines a UMP test for H0 : µ = µ0 vs H1 : µ < µ0 . Counterexample To show that a UMP test does not necessarily exist, consider testing H0 : µ = µ0 vs H1 : µ 6= µ0 . Since we have UMP tests for H1 : µ < µ0 , (T − ) and H1 : µ > µ0 , (T + ) then no UMP test can exist (by definition) for H1 : µ 6= µ0 , say (T ± ) Exercise 165 Produce a plot of the power functions for T − , T + and T ± to verify the counterexample graphically. Example 2 For a random sample X1 , . . . , X20 from a P (θ) population, show that the CR defined by i=20 X i=1 is a UMP CR for testing Xi ≥ 5 H0 : θ = 0.1 vs H1 : θ > 0.1 Consider H0 : θ = θ0 vs H1 : θ = θ1 Now P (X = x) = and so e−θ θ x , x = 0, 1, 2, . . . x! Qi=20 −θ xi e 0 θ0 /xi ! f0 = Qi=1 i=20 −θ1 xi f1 θ1 /xi ! i=1 e P x e−nθ0 θ0 i i P = <K i xi − nθ 1 e θ 1 which gives −nθ0 + ie, X i xi ln θ0 + nθ1 − X̄ (ln θ0 − ln θ1 ) < Now if θ1 > θ0 then X̄ > which leads to the CR X xi ln θ1 < ln K i ln K + (θ0 − θ1 ) n −(1/n) ln K + θ1 − θ0 ln θ1 − ln θ0 X Xi > constant i which is a MP CR by the NP lemma, but is independent of θ, and thus this CR is also UMP for the test of H1 : θ > θ0 . Exercise Plot the power function and give the size of the test. 166 9.3 9.3.1 Likelihood Ratio Tests Background The Neyman Pearson Theorem provides a method of constructing most powerful tests for simple hypotheses when the distribution of the observation is known except for the value of a single parameter. But in many cases the problem is more complex than this. In this section we will examine a general method that can be used to derive tests of hypotheses. The procedure works for simple or composite hypotheses and whether or not there are ‘nuisance’ parameters with unknown values. As well as thinking of H0 as being a statement (or assertion) about a parameter θ, it is a set of values taken by θ. Similarly for H1 . So it is appropriate to write θ ∈ H0 , for example, or maxH0 L(θ). The set of all possible values of θ is H0 ∪ H1 . 
Let f (x; θ) be the density function of a random variable X with unknown parameter θ, and let X1 , X2 , . . . , Xn be a random sample from this distribution, with observed values x1 , x2 , . . . , xn . The likelihood function of the sample is L(θ) = L(θ; x1 , . . . , xn ) = n Y f (xi ; θ) = f (x; θ). i=1 It is necessary to have a clear idea of what is meant by the parameter space and that subset of it defined by the hypothesis. Example 9.5 (a) If X is distributed as Bin(n, p) and we are testing H0 : p = p0 , then H0 can be written H0 = {p : p = p0 } and H0 ∪ H1 = {p : p ∈ [0, 1]}. (b) If X is distributed as N(µ, σ 2 ) where both µ and σ 2 are unknown and we are testing H0 : σ 2 = σ02 , then H0 ∪ H1 = {(µ, σ 2 ) : µ ∈ (−∞, ∞), σ 2 ∈ (0, ∞)} H0 = {(µ, σ 2 ) : µ ∈ (−∞, ∞), σ 2 = σ02 } Now (b) is illustrated in Figure 9.3. 9.3.2 The Likelihood Ratio Test Procedure The notion of using the magnitude of the ratio of two probability density functions as the basis of a best test or of a uniformly most powerful test can be modified, and made intuitively appealing, to provide a method of constructing a test where either or both of the hypothesis and alternative are composite. The method leads to tests called likelihood 167 σ2 6 H 0 ∪H1 2 σ0 H0 µ Figure 9.3: Parameter space ratio tests, and although not necessarily uniformly most powerful, they often have desirable properties. The test involves a comparison of the maximum value the likelihood can take when θ is allowed to take any value in the parameter space, and the maximum value of the likelihood when θ is restricted by the hypothesis. Define Λ = max L(θ)/ max L(θ). H0 ∪H1 H0 (9.4) Note that (i) θ may be a vector of parameters; (ii) Both numerator and denominator (and hence Λ) are functions of the sample values x1 , . . . , xn , and the right hand side could be written more fully as max f (x; θ)/ max f (x, θ). θ∈H0 θ∈H0 ∪H1 Strictly speaking, Λ as defined in (9.5) is a function of random variables X1 , . . . , Xn and so is itself a random variable with a probability distribution. When X is replaced by the observed values x in the ratio, we will use λ for the observed value of Λ, and both will be called the likelihood ratio. Clearly, by the definition of maximum likelihood estimates, maxH0 ∪H1 L(θ) will be obtained by substituting the mle(’s) for θ into L(θ). Note that (i) λ ≥ 0 since it is a ratio of pdf’s; 168 (ii) maxH0 L(θ) ≤ maxH0 ∪H1 L(θ) since the set H0 over which L(θ) is maximized is a subset of H0 ∪ H1 . This means that λ ≤ 1. So the random variable Λ has a probability distribution on [0, 1]. If, for a given sample x1 , . . . , xn , λ is close to 1, then maxH0 L(θ) is almost as large as maxH0 ∪H1 L(θ). This means that we can’t find an appreciably larger value of the likelihood, L(θ), by searching for a value of θ through the entire parameter space H0 ∪ H1 supports the proposition that H0 is true. On the other hand, if λ is small, we note that the observed x1 , . . . , xn was unlikely to occur if H0 were true, so the occurrence of it casts doubt on H0 . So a value of λ near zero implies the unreasonableness of the hypothesis. Let the random variable Λ have probability density function g(λ), 0 ≤ λ ≤ 1. To carry out the LR test in a given problem involves finding a value λ0 (< 1) so that the critical region for a size α test is {λ : 0 < λ < λ0 }. That is, P (Λ ≤ λ0 ) = Z λ0 0 g(λ)dλ = α. (9.5) Since the distribution of Λ is generally very complicated, we would appear to have a difficult problem here. 
But in many cases, a certain function of Λ has a well-known distribution and an equivalent test can be carried out. [See the Examples in sub–section 9.3.3 below.] Cases where this is not so are dealt with in sub–section 9.3.4. To summarise, note the following : 1. That λ = Λ̂ ≥ 0, since Λ is a ratio of pdfs. 2. That λ ≤ 1, since H0 ∈ H0 ∪ H1 and thus maxH0 L(θ) ≤ maxH0 ∪H1 L(θ) 3. If λ ≈ 0, then H0 is not true, since maxH0 L(θ) << maxH0 ∪H1 L(θ) whereas 4. If λ ≈ 1, then H0 is true, since then maxH0 L(θ) ≈ maxH0 ∪H1 L(θ) 5. Using all of the above, if Λ has a pdf of g(λ), 0 < λ < 1, then the a CR of size α for the LRT will be {λ : 0 < λ < λ0 << 1}, where P (Λ < λ0 ) = Z 169 λ0 0 g(λ)dλ = α. 9.3.3 Some Examples Example 9.6 Let X have a normal distribution with unknown mean µ and known variance σ02 . Suppose we have a random sample x1 , x2 , . . . , xn from this distribution and wish to test H0 : µ = 3 against H1 : µ 6= 3. Now H0 ∪ H1 = {(µ, σ 2 ) : µ ∈ (−∞, ∞), σ 2 = σ02 } H0 = {(µ, σ 2 ) : µ = 3, σ 2 = σ02 } Rather than L(θ) we have here L(µ; x1 , . . . , xn , σ02 ) or more briefly, L(µ). L(µ) = = n Y 1 2 1 i=1 (2πσ0 ) 2 exp{− 21 (xi − µ)2 /σ02 } (2π)−n/2 σ0−n exp ( − 21 n X i=1 (xi − µ) 2 /σ02 ) Now max L(µ) is obtained by replacing µ in the above by its mle, x. So H0 ∪H1 max L(µ) = H0 ∪H1 (2π)−n/2 σ0−n exp ( − 21 n X i=1 (xi − x) 2 /σ02 ) . Also L(µ|H0 ) has only one value, obtained by replacing µ by 3 and σ 2 by σ02 . So max L(µ) = H0 = (2π)−n/2 σ0−n (2π)−n/2 σ0−n exp exp ( − 12 ( − 12 n X i=1 n X i=1 (xi − 3) 2 (xi − x) /σ02 2 ) /σ02 n − (x − 3)2 /σ02 2 ) Thus, on simplification λ = exp{−n(x − 3)2 /2σ02 }. (9.6) Intuitively, we would expect that values of x close to 3 support the hypothesis and it can be seen that in this case λ is close to 1. Values of x̄ far from 3 lead to λ close to 0. We need to find the critical value λ0 to satisfy equation (9.6). That is, we need to know the distribution of Λ. From equation (9.7), using random variables instead of observed values, we have !2 X −3 √ −2 log Λ = σ0 / n which is the square of a N (0, 1) variate and therefore is distributed as χ21 . For α = .05, the critical region is {x : n(x − 3)2 /σ02 ≥ 3.84}, or alternatively √ √ {x : n(x − 3)/σ0 > 1.96 or n(x − 3)/σ0 < −1.96}. 170 Figure 9.4: Critical regions −2 log λ rejection region for −2 log λ 3.84 λ 6 rejection region for λ The relationship between the critical region for λ and the critical region for −2 log λ is shown in Figure 9.4. Example 9.7 Given X1 , X2 , . . . , Xn is a random sample from a N (µ, σ 2 ) distribution, where σ 2 is unknown, derive the LR test of H0 : µ = µ0 against H1 : µ 6= µ0 . Now the parameter space is H0 ∪ H1 = {(µ, σ 2 ) : µ ∈ (−∞, ∞), σ 2 > 0}, and that restricted by the hypothesis is H0 = {(µ, σ 2 ) : µ = µ0 , σ 2 > 0}. 171 We note that there are 2 unknown parameters here, and the likelihood of the sample, L(µ, σ 2 ; x1 , x2 , . . . , xn ) can be written as n Y L(µ, σ 2 ) = 1 1 (2πσ 2 )− 2 e− 2 (xi −µ) i=1 1 = (2π)−n/2 (σ 2 )−n/2 e− 2 Now the mle’s of µ and σ 2 are 2 µ̂ = x, σ̂ = 2 /σ 2 Pn i=1 (xi −µ)2 /σ 2 (9.7) n X i=1 (xi − x)2 /n, and max L(µ, σ 2 ) is obtained by substituting these for µ and σ 2 in equation (9.8). This H0 ∪H1 gives 2 max L(µ, σ ) = (2π) H0 ∪H1 " −n/2 n/2 n n X i=1 (xi − x) 2 #−n/2 e−n/2 . (9.8) Now max L(µ, σ 2 ) is obtained by substituting µ0 for µ in equation (9.8) and replacing σ 2 H0 by its MLE where µ is known. This is Thus 2 max L(µ, σ ) = (2π) H0 Pn i=1 (xi −n/2 n/2 n − µ0 )2 /n = σ̃ 2 , say. " n X i=1 (xi − µ0 ) 2 #−n/2 e−n/2 . 
2 2 So λ = max L(µ, σ )/ max L(µ, σ ) becomes H0 ∪H1 H0 λ= " X i 2 (xi − x) / X i Taking 2/nth powers of both sides and writing we have λ 2/n = X i = 2 (xi − x) / 1 n(x−µ0 )2 1+ P (xi −x)2 " X i (xi − µ0 ) P i (xi 2 #−n/2 . − µ0 )2 as 2 P i (xi − x) + n(x − µ0 ) [(xi − x) + (x − µ0 )]2 , 2 # . (9.9) i Recalling that λ is the observed value of a random variable with a range space [0, 1], and that the critical region is of the form 0 < λ < λ0 , we would like to find a function of λ (or of λ2/n ) whose probability distribution we recognize. Now X − µ0 n(X − µ0 )2 √ X ∼ N (µ0 , σ /n) so ∼ N (0, 1) and ∼ χ21 . 2 σ/ n σ 2 172 P P Also, ni=1 (Xi − X)2 = νS 2 so ni=1 (Xi − X)2 /σ 2 = νS 2 /σ 2 ∼ χ2ν where ν = n − 1. So, expressing the denominator of equation (9.10) in terms of random variables we have 1+ n(X − µ0 )2 /σ 2 νS 2 /σ 2 ∼ 1+ χ21 χ2ν ∼ 1+ χ21 ν(χ2ν /ν) ∼ 1+ T2 ν where T is a random variable with a t distribution on ν degrees of freedom. Considering range spaces, the relationship between λ (or λ2/n ) and t2 is a strictly decreasing one, and a critical region of the form 0 < λ < λ0 is equivalent to a CR of the form t2 > t20 , as indicated in Figure 9.5. Figure 9.5: Rejection regions t2 t20 λ λ0 That is, the critical region is of the form |t| > t0 where t0 is obtained from tables, using ν degrees of freedom, and the appropriate significance level. Example 9.8 173 Given the random sample X1 , X2 , . . . , Xn from a N (µ, σ 2 ) distribution, derive the LR test of the hypothesis H0 : σ 2 = σ02 , where µ is unknown, against H1 : σ 2 6= σ02 . The parameter space, and restricted parameter space are H0 ∪ H1 = {(µ, σ 2 ) : µ ∈ (−∞, ∞), σ 2 > 0} H0 = {(µ, σ 2 ) : µ ∈ (−∞, ∞), σ 2 = σ02 } L(µ, σ 2 ) is given by equation (9.8) in Example 9.9, and max L(µ, σ 2 ) is given by equation H0 ∪H1 (9.9). To find max L(µ, σ 2 ) we replace σ 2 by σ02 and µ by the mle, x. So H0 1 max L(µ, σ 2 ) = (2π)−n/2 (σ02 )−n/2 e− 2 H0 So λ=e n/2 "P − x)2 nσ02 i (xi #n/2 1 e− 2 P i P i (xi −x)2 /σ02 (xi −x)2 /σ02 Again, we would like to express Λ as a function of a random variable whose distribution P we know. Denoting i (xi − x)2 /σ02 by w, the random variable W , whose observed value this is, has a χ2 distribution with parameter ν = n − 1. So we have λ = en/2 (w/n)n/2 e−w/2 , and the relationship between the range spaces of Λ and W is shown in Figure 9.6. A critical region of the form 0 < λ < λ0 corresponds to the pair of intervals, 0 < w < a, b < w < ∞. So for a size-α test, H is rejected if νs2 /σ02 < χ2ν,α/2 or νs2 /σ02 > χ2ν,1−(α/2) . [This of course is the familiar intuitive test for this problem.] Some Examples 1. Sampling from the Normal distribution. Let Xi ∼ N (µ, 1). We wish to test H0 : µ = µ0 vs H1 : µ 6= µ0 . L= Y i P 2 2 1 √ e−(xi − µ) /2 = (2π)−n/2 e− i (xi − µ) /2 2π In H0 , we have L0 = (2π)−n/2 e− 174 P i (xi − µ0 )2 /2 Figure 9.6: Λ vs W λ λ0 W a b In H1 , the likelihood is L = (2π)−n/2 e− P which gives ` = ln[(2π)−n/2 ] − i (xi X i − µ)2 /2 (xi − µ)2 /2 Now ∂`/∂µ = 0 gives µ̂ = x̄ and so L1 = (2π)n/2 e− The LR becomes Thus P i (xi − x̄)2 /2 P 2 e− i (xi − µ0 ) /2 L0 P = Λ̂ = λ = 2 L1 e− i (xi − x̄) /2 −2 ln λ = X i (xi − µ0 )2 − 175 X i (xi − x̄)2 = n(x̄ − µ0 )2 Now (X̄ − µ0 )2 ∼ χ21 1/n since (X̄ − µ0 ) √ ∼ N (0, 1) 1/ n 2. Sampling from the Poisson distribution. Let Y ∼ P (λ). We wish to test H0 : λ = λ0 vs H1 : λ > λ0 . Now the likelihood is e−λ λyi yi ! Y L= i In H0 , we get P while in H1 the likelihood is e−λ0 λ0 yi e−nλ0 λ0 i yi = Q yi ! i (yi !) to give e−nλ λ i yi L= Q i (yi !) 
L0 = Y i P X ` = ln L = −nλ + i yi ln λ − Now ∂`/∂λ = 0 = −n + gives λ̂ = X X X ln yi !. i yi /λ i yi /n i and so L1 = Thus the LR becomes e− P i yi (P y /n) i i Q i (yi !) P i yi P L0 e−nλ0 λ0 i yi P P Λ̂ = = P L1 e− i yi ( i yi /n) i yi So the rejection region Λ < Λ0 is equivalent to P L0 e−nλ0 λ0 i yi P P = < Λ0 P L1 e− i yi ( i yi /n) i yi 176 ie n(Ȳ − λ0 ) + X yi ln(λ0 /Ȳ ) < K i Thus the LR test is equivalent to a test based on Ȳ , as expected. What about the distribution of −2 ln Λ? Now " −2 ln Λ = −2 −n(λ0 − ȳ) + X = −2nλ0 − 2 i Using yi − 2 X X yi ln(nλ0 / i yi ln(nλ0 / i X X yi ) i # yi ) i ln(x) = (x − 1) − (x − 1)/2 + . . . , 0 < x < 2 and assuming that Ȳ > λ0 , ie, ln(nλ0 / X P yi ) = i then −2 ln Λ = 2nλ0 − 2 X i i yi /n > λ0 , then ! nλ0 nλ0 −1 − P −1 P i yi i yi yi − 2 = ( P = X i i !2 /2 + . . . nλ0 nλ0 yi P −1− P −1 i yi i yi yi − nλ0 )2 +... P i yi !2 + . . . (Ȳ − λ0 )2 ∼ χ21 Ȳ /n since V (Ȳ ) = λ0 /n and λ̂ = Ȳ . 9.3.4 Asymptotic Distribution of −2 log Λ The case of a single parameter will be covered first, then the situation of multiple parameters. (single) We have H0 : θ = θ0 vs H1 : θ 6= θ0 . It is required to demonstrate the −2 ln Λ ∼ χ21 . Now ln L(θ) = ` 177 and using Taylor’s expansion we get `(θ) = `(θ̂) + `0 (θ̂)(θ − θ̂) + `00 (θ̂)(θ − θ̂)2 /2 + . . . but `0 (θ̂) = 0 and so `(θ) = `(θ̂) + `00 (θ̂)(θ − θ̂)2 /2 + . . . Now h −2 ln Λ = −2 `(θ0 ) − `(θ̂) and since i `(θ0 ) = `(θ̂) + `00 (θ̂)(θ0 − θ̂)2 /2 + . . . then h i −2 ln Λ = −2 `(θ̂) + `00 (θ̂)(θ0 − θ̂)2 /2 − `(θ̂) = −`00 (θ̂)(θ0 − θ̂)2 Thus −2 ln Λ = (θ̂ − θ0 )2 −1/`00 (θ̂) ∼ χ21 as V (θ̂) = 1/I and I ≈ −E[`00 (θ̂)]. This is also the large scale sampling distribution for the mle θ̂. (multiple) The proof in the multiple parameter case follows the approach in the single parameter case, but is not so straightforward. For a full proof, see Silvey (1987), pages 112–114 and Theorem 7.2.2. Let the null be written as H0 : θ ∈ ω and the alternative be H1 : θ ∈ Ω. The LR is then Λ= L(ω̂) L(Ω̂) where ω̂ stands for the mle of θ in ω, and where Ω̂ stands for the mle of θ in Ω. Thus L(ω̂) −2 ln Λ = −2 ln L(Ω̂) ! h i h = −2 ln f (ω̂) − ln f (Ω̂) = −2 `(ω̂) − `(Ω̂) 178 i Using a Taylor expansion `(ω̂) = `(Ω̂) + `0 (Ω̂)0 (ω̂ − Ω̂) + (ω̂ − Ω̂)0 `00 (Ω̂)(ω̂ − Ω̂)/2 + . . . Since `0 (Ω̂) = 0 then `(ω̂) − `(Ω̂) = (ω̂ − Ω̂)0 `00 (Ω̂)(ω̂ − Ω̂)/2 + . . . which gives −2 ln Λ ≈ −(ω̂ − Ω̂)0 `00 (Ω̂)(ω̂ − Ω̂) ≈ (ω̂ − Ω̂)0 nI(θ)(ω̂ − Ω̂) since I(θ̂) ≈ I(θ). √ √ By consideration of the distributions of n(Ω̂ − θ∗) and n(ω̂ − θ∗), where θ∗ is the true value of θ, the final result can be obtained. Essentially, the change in the number of parameters in moving from H0 ∪ H1 to H0 imposes restrictions which lead to the degrees of freedom of of the resulting χ2 being of the same dimension as this difference in the number of parameters between H0 ∪H1 and H0 , as per page 116-117 of the Notes. Thus −2 ln Λ ∼ χ2df where df is the difference between the number of parameters specified in H0 ∪ H1 and H0 . Reference Silvey S.D., (1987), Statistical Inference, Chapman and Hall, London. The distribution of some function of Λ can’t always be found as readily as in the previous examples. If n is large and certain conditions are satisfied, there is an approximation to the distribution of Λ that is satisfactory in most large-sample applications of the test. We state without proof the following theorem. Theorem 9.2 Under the proper regularity conditions on f (x; θ), the random variable −2 log Λ is distributed asymptotically as chi-square. 
Example 9.9 (Test for equality of several variances.)

The hypothesis of equality of variances in two normal distributions is tested using the $F$-test. We now derive a test for the $k$-sample case by the likelihood ratio procedure. Consider independent samples
\[
\begin{array}{ll}
x_{11},x_{12},\dots,x_{1n_1} & \text{from the population } N(\mu_1,\sigma_1^2),\\
x_{21},x_{22},\dots,x_{2n_2} & \text{from } N(\mu_2,\sigma_2^2),\\
\qquad\vdots & \qquad\vdots\\
x_{k1},x_{k2},\dots,x_{kn_k} & \text{from } N(\mu_k,\sigma_k^2).
\end{array}
\]
That is, we have observations $\{x_{ij},\ j=1,\dots,n_i,\ i=1,2,\dots,k\}$. We wish to test the hypothesis
\[
H_0:\sigma_1^2=\sigma_2^2=\dots=\sigma_k^2\ (=\sigma^2)
\]
against the alternative that the $\sigma_i^2$ are not all the same. Let $n=\sum_i n_i$. Now the p.d.f. of the random variable $X_{ij}$ is
\[
f(x_{ij})=(2\pi)^{-1/2}\sigma_i^{-1}\exp\big\{-\tfrac{1}{2}(x_{ij}-\mu_i)^2/\sigma_i^2\big\},
\]
so the likelihood function of the samples above is
\[
L(\mu,\sigma^2)=(2\pi)^{-n/2}\sigma_1^{-n_1}\dots\sigma_k^{-n_k}\exp\Big\{-\tfrac{1}{2}\sum_{i=1}^k\sum_{j=1}^{n_i}(x_{ij}-\mu_i)^2/\sigma_i^2\Big\}. \tag{9.10}
\]
The whole parameter space and restricted parameter space are given by
\[
H_0\cup H_1=\{(\mu_i,\sigma_i^2): \mu_i\in(-\infty,\infty),\ \sigma_i^2\in(0,\infty),\ i=1,\dots,k\},
\]
\[
H_0=\{(\mu_i,\sigma^2): \mu_i\in(-\infty,\infty),\ \sigma^2\in(0,\infty),\ i=1,\dots,k\}.
\]
The log of the likelihood is
\[
\log L = -\frac{n}{2}\log 2\pi - \frac{n_1}{2}\log\sigma_1^2 - \dots - \frac{n_k}{2}\log\sigma_k^2 - \frac{1}{2}\sum_{i=1}^k\sigma_i^{-2}\sum_{j=1}^{n_i}(x_{ij}-\mu_i)^2,
\]
using $L$ for $L(\mu,\sigma^2)$. To find $\max_{H_0\cup H_1}L$ we need the MLE's of the $2k$ parameters $\mu_1,\dots,\mu_k,\sigma_1^2,\dots,\sigma_k^2$:
\[
\partial\log L/\partial\mu_i = \sigma_i^{-2}\sum_{j=1}^{n_i}(x_{ij}-\mu_i), \qquad i=1,\dots,k, \tag{9.11}
\]
\[
\partial\log L/\partial\sigma_i^2 = -\frac{n_i}{2\sigma_i^2} + \frac{1}{2\sigma_i^4}\sum_{j=1}^{n_i}(x_{ij}-\mu_i)^2, \qquad i=1,\dots,k. \tag{9.12}
\]
Equating (9.11) and (9.12) to zero and solving, we obtain
\[
\hat{\mu}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}x_{ij} = \bar{x}_{i\cdot}, \qquad i=1,\dots,k, \tag{9.13}
\]
\[
\hat{\sigma}_i^2 = \frac{1}{n_i}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_{i\cdot})^2, \qquad i=1,\dots,k. \tag{9.14}
\]
Substituting these in (9.10) we obtain
\[
\max_{H_0\cup H_1} L = (2\pi)^{-n/2}\hat{\sigma}_1^{-n_1}\dots\hat{\sigma}_k^{-n_k}\exp\Big\{-\frac{1}{2}\sum_{i=1}^k n_i\frac{\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_{i\cdot})^2}{\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_{i\cdot})^2}\Big\}
= (2\pi)^{-n/2}\hat{\sigma}_1^{-n_1}\dots\hat{\sigma}_k^{-n_k}\,e^{-n/2}, \tag{9.15}
\]
since $n=\sum_{i=1}^k n_i$.

Now in the restricted parameter space $H_0$ there are $(k+1)$ parameters, $\mu_1,\dots,\mu_k$ and $\sigma^2$, so we need to find the mle's of these parameters. The likelihood function now is (putting $\sigma_i^2=\sigma^2$ for all $i$)
\[
L = (2\pi)^{-n/2}(\sigma^2)^{-n/2}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^k\sum_{j=1}^{n_i}(x_{ij}-\mu_i)^2\Big\} \tag{9.16}
\]
and
\[
\log L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^k\sum_{j=1}^{n_i}(x_{ij}-\mu_i)^2,
\]
\[
\frac{\partial\log L}{\partial\mu_i} = \frac{1}{\sigma^2}\sum_{j=1}^{n_i}(x_{ij}-\mu_i), \qquad i=1,\dots,k, \tag{9.17}
\]
\[
\frac{\partial\log L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^k\sum_{j=1}^{n_i}(x_{ij}-\mu_i)^2. \tag{9.18}
\]
Equating (9.17) and (9.18) to zero and solving, we obtain
\[
\tilde{\mu}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}x_{ij} = \bar{x}_{i\cdot}\ (=\hat{\mu}_i), \qquad i=1,\dots,k, \tag{9.19}
\]
\[
\tilde{\sigma}^2 = \frac{1}{n}\sum_{i=1}^k\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_{i\cdot})^2 = \frac{1}{n}\sum_{i=1}^k n_i\hat{\sigma}_i^2. \tag{9.20}
\]
Substituting (9.19) and (9.20) into (9.16), we obtain
\[
\max_{H_0} L = e^{-n/2}(2\pi)^{-n/2}\big/(\tilde{\sigma}^2)^{n/2}. \tag{9.21}
\]
So
\[
\lambda = \frac{\hat{\sigma}_1^{n_1}\dots\hat{\sigma}_k^{n_k}}{\big[\sum_i n_i\hat{\sigma}_i^2/n\big]^{n/2}}
= \prod_{i=1}^k(\hat{\sigma}_i^2)^{n_i/2}\Big/\Big[\sum_{i=1}^k n_i\hat{\sigma}_i^2/n\Big]^{n/2}. \tag{9.22}
\]
Now, using Theorem 9.2, the distribution of $-2\log\Lambda$ is asymptotically $\chi^2$. To determine the number of degrees of freedom we note that the number of parameters in $H_0\cup H_1$ is $2k$ and in $H_0$ is $k+1$. Hence the number of degrees of freedom is $2k-(k+1)=k-1$. Thus
\[
-2\log\Lambda = -\sum_{i=1}^k n_i\log\hat{\sigma}_i^2 + n\log\tilde{\sigma}^2 \tag{9.23}
\]
is distributed approximately $\chi^2_{k-1}$.
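As a sketch (not part of the notes; the three groups and their sizes are invented), the statistic (9.23) can be computed directly in R from the per-group MLE's $\hat\sigma_i^2$ and the pooled $\tilde\sigma^2$, and referred to $\chi^2_{k-1}$:

    # -2 log Lambda of (9.23) for k = 3 hypothetical normal samples
    set.seed(3)
    x <- list(rnorm(12, sd = 1.0), rnorm(15, sd = 1.3), rnorm(10, sd = 0.8))
    ni         <- sapply(x, length)
    sig2.hat   <- sapply(x, function(z) mean((z - mean(z))^2))  # MLE's, divisor n_i
    n          <- sum(ni)
    sig2.tilde <- sum(ni * sig2.hat) / n                        # pooled MLE under H0
    m2logLam   <- -sum(ni * log(sig2.hat)) + n * log(sig2.tilde)
    c(stat = m2logLam,
      p.value = pchisq(m2logLam, df = length(x) - 1, lower.tail = FALSE))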
Bartlett (1937) modified this statistic by using unbiased estimates of $\sigma_i^2$ and $\sigma^2$ instead of the MLE's. That is, he used $(n_i-1)$ and $(n-k)$ as divisors, so that, with $\nu_i=n_i-1$, the statistic becomes
\[
B=-\sum_{i=1}^k \nu_i\log s_i^2 + \Big(\sum_{i=1}^k\nu_i\Big)\log s^2,
\]
where
\[
s_i^2=\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_{i\cdot})^2/(n_i-1) \quad\text{and}\quad s^2=\frac{\nu_1 s_1^2+\dots+\nu_k s_k^2}{\nu_1+\dots+\nu_k}.
\]
[Investigate the form of this when $k=2$.]

A better approximation still is obtained using as the statistic $Q=B/C$, where the constant $C$ is defined by
\[
C = 1+\frac{1}{3(k-1)}\Big[\sum_i\frac{1}{\nu_i}-\frac{1}{\sum_i\nu_i}\Big],
\]
and this statistic is commonly referred to as Bartlett's statistic for testing homogeneity of variances. That is,
\[
Q = \frac{\big(\sum_i\nu_i\big)\log s^2-\sum_i\nu_i\log s_i^2}{1+\frac{1}{3(k-1)}\big[\sum_i(1/\nu_i)-(1/\sum_i\nu_i)\big]} \tag{9.24}
\]
is distributed approximately as $\chi^2_{k-1}$ under the hypothesis $H_0:\sigma_1^2=\dots=\sigma_k^2$. The approximation is not very good for small $n_i$.
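As a sketch (not from the notes), Bartlett's statistic (9.24) can be computed on the same hypothetical data used in the previous sketch and checked against R's built-in bartlett.test(), which uses this corrected statistic; the two computations should agree.

    # Bartlett's Q of (9.24), compared with stats::bartlett.test()
    set.seed(3)
    x   <- list(rnorm(12, sd = 1.0), rnorm(15, sd = 1.3), rnorm(10, sd = 0.8))
    ni  <- sapply(x, length); vi <- ni - 1; k <- length(x)
    si2 <- sapply(x, var)                       # unbiased variances, divisor n_i - 1
    s2  <- sum(vi * si2) / sum(vi)              # pooled unbiased variance
    B   <- -sum(vi * log(si2)) + sum(vi) * log(s2)
    C   <- 1 + (sum(1 / vi) - 1 / sum(vi)) / (3 * (k - 1))
    Q   <- B / C
    c(Q = Q, p.value = pchisq(Q, df = k - 1, lower.tail = FALSE))
    bartlett.test(x)                            # should reproduce Q and the p-value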
9.4 Worked Example

The aim of this worked example is to demonstrate the use of the likelihood ratio test for testing composite hypotheses. The example used is akin to that in Example 9.8, but results are drawn from Sections 9.3.4 and 9.3.2.

9.4.1 Example

Let $X_1,\dots,X_n$ be drawn from a $N(\mu,\sigma_0^2)$ population, i.e. a Normal distribution with unknown mean but known variance $\sigma_0^2$. We wish to test $H_0:\mu=\mu_0$ against $H_1:\mu\neq\mu_0$. Thus
\[
\omega=[H_0]=[(\mu_0,\sigma_0^2)] \quad\text{while}\quad \Omega=[H_0\cup H_1]=[(\mu,\sigma_0^2)].
\]
Now
\[
L=L(\mu)=\prod_{i=1}^n\frac{1}{\sigma_0\sqrt{2\pi}}\,e^{-\frac{1}{2}(x_i-\mu)^2/\sigma_0^2},
\]
and if $\ell=\log L$ then
\[
\frac{\partial\ell}{\partial\mu}=-\frac{1}{2}(2)(-1)\sum_i(x_i-\mu)/\sigma_0^2=\sum_i(x_i-\mu)/\sigma_0^2=0 \quad\text{if } \mu=\bar{x}.
\]
Thus
\[
L(\hat{\Omega})=\frac{e^{-\frac{1}{2}\sum_i(x_i-\bar{x})^2/\sigma_0^2}}{\sigma_0^n(2\pi)^{n/2}}.
\]
In $\omega$, $\mu$ is single valued, as it corresponds to the value $\mu_0$. Thus
\[
L(\hat{\omega})=\frac{e^{-\frac{1}{2}\sum_i(x_i-\mu_0)^2/\sigma_0^2}}{\sigma_0^n(2\pi)^{n/2}},
\]
and so the LR becomes
\[
\lambda=\frac{e^{-\frac{1}{2}\sum_i(x_i-\mu_0)^2/\sigma_0^2}}{e^{-\frac{1}{2}\sum_i(x_i-\bar{x})^2/\sigma_0^2}}.
\]
This gives
\[
-2\log\lambda = \Big[\sum_i(x_i-\mu_0)^2-\sum_i(x_i-\bar{x})^2\Big]\Big/\sigma_0^2
= \sum_i\big[x_i^2+\mu_0^2-2x_i\mu_0-x_i^2-\bar{x}^2+2x_i\bar{x}\big]\Big/\sigma_0^2,
\]
and so
\[
-2\log\lambda = \Big[n\mu_0^2+n\bar{x}^2-2\mu_0\sum_i x_i\Big]\Big/\sigma_0^2
= \big[n(\bar{x}-\mu_0)^2\big]\big/\sigma_0^2
= \left(\frac{\bar{x}-\mu_0}{\sigma_0/\sqrt{n}}\right)^2.
\]
Since
\[
\frac{\bar{X}-\mu_0}{\sigma_0/\sqrt{n}}\sim N(0,1)
\]
exactly, then $-2\log\lambda\sim\chi^2_1$ exactly.

In general, the df for the asymptotic distribution of $-2\log\lambda$ is $\dim(\Omega)-\dim(\omega)$, which in this case is $1-0=1$. The formal definition of 'dim' is the number of free parameters. In $\omega$ both $\mu$ and $\sigma^2$ are fixed, while $\Omega$ has one free parameter, $\mu$.

Example (Simple Linear Regression)

We have a sample $(X_i,Y_i)$, $i=1,\dots,n$, where $Y_i\sim N(\beta_0+\beta_1X_i,\sigma^2)$, and wish to test
\[
H_0:\beta_1=0 \quad\text{vs}\quad H_1:\beta_1\neq0.
\]
So $\theta=(\beta_0,\beta_1,\sigma^2)$, with $\theta_0=(\beta_0,0,\sigma^2)$. The likelihood is
\[
L=\prod_i\frac{1}{\sigma\sqrt{2\pi}}\,e^{-(y_i-\beta_0-\beta_1x_i)^2/2\sigma^2},
\]
i.e.
\[
\ell=\ln L = -\frac{n}{2}\ln\sigma^2-\dots-\sum_i(y_i-\beta_0-\beta_1x_i)^2/2\sigma^2.
\]
Under $H_0$,
\[
L=\frac{1}{(\sigma\sqrt{2\pi})^n}\,e^{-\sum_i(y_i-\beta_0)^2/2\sigma^2}
\quad\text{and}\quad
\ell=-\frac{n}{2}\ln\sigma^2-\dots-\sum_i(y_i-\beta_0)^2/2\sigma^2,
\]
and so
\[
\partial\ell/\partial\beta_0 = -\frac{1}{2}\sum_i 2(y_i-\beta_0)(-1)/\sigma^2=0 \;\rightarrow\; \hat{\beta}_0=\bar{y}
\]
and
\[
\partial\ell/\partial\sigma^2 = -\frac{n}{2\sigma^2}+\frac{1}{2}\sum_i(y_i-\beta_0)^2/\sigma^4=0 \;\rightarrow\; \hat{\sigma}^2=\sum_i(y_i-\bar{y})^2/n,
\]
to give
\[
L_0=\max_{H_0}L=\frac{e^{-n/2}}{\big[2\pi\sum_i(y_i-\bar{y})^2/n\big]^{n/2}}.
\]
In $H_0\cup H_1$,
\[
\ell=\ln L=-\frac{n}{2}\ln\sigma^2-\dots-\sum_i(y_i-\beta_0-\beta_1x_i)^2/2\sigma^2,
\]
and so
\[
\partial\ell/\partial\beta_0=-\frac{1}{2}\sum_i2(y_i-\beta_0-\beta_1x_i)(-1)/\sigma^2=0 \;\rightarrow\; \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}.
\]
Also
\[
\partial\ell/\partial\beta_1=-\frac{1}{2}\sum_i2(y_i-\beta_0-\beta_1x_i)(-x_i)/\sigma^2=0
\;\rightarrow\; \hat{\beta}_1=\frac{\sum_ix_iy_i-\sum_ix_i\sum_iy_i/n}{\sum_ix_i^2-(\sum_ix_i)^2/n}.
\]
Finally
\[
\partial\ell/\partial\sigma^2=-\frac{n}{2\sigma^2}+\frac{1}{2}\sum_i(y_i-\beta_0-\beta_1x_i)^2/\sigma^4=0
\;\rightarrow\; \hat{\sigma}^2=\sum_i(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)^2/n.
\]
So the maximised likelihood in $H_0\cup H_1$ is
\[
L_1=\max_{H_0\cup H_1}L=\frac{e^{-n/2}}{\big[2\pi\sum_i(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)^2/n\big]^{n/2}}.
\]
Thus the likelihood ratio is
\[
\Lambda=\frac{L_0}{L_1}=\left(\frac{\sum_i(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)^2}{\sum_i(y_i-\bar{y})^2}\right)^{n/2},
\]
and so, writing $(n-2)S^2=\sum_i(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)^2$ for the residual sum of squares,
\[
\Lambda^{2/n}=\frac{(n-2)S^2}{\sum_i(y_i-\bar{y})^2}.
\]
But
\[
\sum_i(y_i-\bar{y})^2=(n-2)S^2+\hat{\beta}_1^2\sum_i(x_i-\bar{x})^2,
\]
and so
\[
\Lambda^{2/n}=\frac{(n-2)S^2}{(n-2)S^2+\hat{\beta}_1^2\sum_i(x_i-\bar{x})^2}
= 1\Big/\Big(1+\frac{\hat{\beta}_1^2\sum_i(x_i-\bar{x})^2}{(n-2)S^2}\Big)
= \frac{1}{1+T^2/(n-2)},
\]
where
\[
T=\frac{\hat{\beta}_1}{S\big/\sqrt{\sum_i(x_i-\bar{x})^2}}\sim t_{n-2}
\]
is recognised as the familiar test statistic for testing $H_0:\beta_1=0$.
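The identity $\Lambda^{2/n} = 1/\{1+T^2/(n-2)\}$ can be verified numerically; the sketch below (not from the notes; the data are invented) fits the two models with R's lm() and compares the residual-sum-of-squares ratio with the usual $t$ statistic for $\hat\beta_1$.

    # Numerical check: Lambda^{2/n} = RSS1/RSS0 = 1 / (1 + T^2/(n-2))
    set.seed(4)
    n <- 30
    x <- runif(n); y <- 2 + 0.5 * x + rnorm(n, sd = 0.4)
    fit0  <- lm(y ~ 1)                                    # model under H0: beta1 = 0
    fit1  <- lm(y ~ x)                                    # full model
    Lam2n <- sum(resid(fit1)^2) / sum(resid(fit0)^2)      # Lambda^{2/n}
    Tstat <- coef(summary(fit1))["x", "t value"]          # usual t statistic for beta1
    c(Lam2n, 1 / (1 + Tstat^2 / (n - 2)))                 # the two numbers agree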
Chapter 10

Bayesian Inference

10.1 Introduction

The reference text for this material is:

Lee P.M. (2004), Bayesian Statistics, 3rd ed., Arnold, London.

The topics covered are a selected and condensed set of extracts from Chapters 1 to 3 of Lee, written in a notation consistent with previous chapters of this set of Unit Notes. Some additional exercises and workings are given.

10.1.1 Overview

This chapter introduces an alternative theory of statistical inference to the procedures already covered in this Unit. The notion of subjective probability, or prior belief, is incorporated into the data analysis procedure via the Bayesian inferential process. As this development is confined to only one chapter, the exposition of Bayesian inference is cursory only. The references at the end of this chapter are supplied for further reading on this important topic.

10.2 Basic Concepts

10.2.1 Bayes Theorem

Discrete Case

This will be the form most familiar to students of introductory statistics. Let $E$ be an event, with $H_1,\dots,H_n$ a sequence of mutually exclusive and exhaustive events. Then
\[
P(H_n|E)\propto P(H_n)P(E|H_n), \qquad P(H_n|E)=\frac{P(H_n)P(E|H_n)}{P(E)},
\]
assuming that $P(E)\neq0$. This can be written as
\[
P(H_n|E)=\frac{P(H_n)P(E|H_n)}{\sum_m P(H_m)P(E|H_m)},
\]
which is the full form of Bayes theorem in the discrete case.

Continuous Case

If $x$ and $y$ are continuous random variables, with $p(y|x)\geq0$ and $\int p(y|x)\,dy=1$, then
\[
p(y)=\int p(x,y)\,dx=\int p(x)p(y|x)\,dx,
\]
since $p(y|x)=p(x,y)/p(x)$. So
\[
p(y|x)=\frac{p(x,y)}{p(x)}=\frac{p(y)p(x|y)}{p(x)},
\]
giving
\[
p(y|x)\propto p(y)p(x|y),
\]
which is Bayes theorem, with the constant of proportionality $1/p(x)=1\big/\!\int p(y)p(x|y)\,dy$; i.e.
\[
p(y|x)=\frac{p(y)p(x|y)}{\int p(y)p(x|y)\,dy}.
\]

10.3 Bayesian Inference

Information known a priori about the parameters $\theta$ is incorporated into the prior pdf $p(\theta)$. The pdf of the data $X$ given the parameters is denoted by $p(X|\theta)$. Using Bayes' theorem for (vector) random variables, we have
\[
p(\theta|X)\propto p(\theta)p(X|\theta),
\]
where $p(\theta|X)$ is called the posterior distribution for $\theta$ given $X$. The likelihood function $L$ considers the probability law for the data as a function of the parameters, hence $L(\theta|X)=p(X|\theta)$, so Bayes' theorem can be written as
\[
\text{posterior}\ \propto\ \text{prior}\times\text{likelihood},
\]
which shows how prior information is updated by knowledge of the data. The posterior/prior/likelihood relation is sometimes written as
\[
p(\theta|x)=\frac{p(\theta)p(x|\theta)}{p(x)}, \qquad\text{where}\quad p(x)=\int p(\theta)p(x|\theta)\,d\theta,
\]
and we have reverted to scalar notation momentarily. The marginal distribution $p(x)$ is called the predictive or preposterior distribution. These equations will be used in later work.

10.4 Normal data

The procedure whereby prior information is updated by knowledge of the data is now demonstrated using a simple example: the sampling of a single observation from a Normal population with known variance. Hence the data point $X$ comes from $N(\mu,\sigma^2)$ where $\sigma^2$ is assumed known. The parameter of interest is $\mu$. Thus
\[
p(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/2\sigma^2}.
\]
The prior is taken as
\[
\mu\sim N(\mu_0,\sigma_0^2)
\]
and the likelihood is
\[
L(\mu|x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/2\sigma^2}.
\]
So the posterior becomes
\[
p(\mu|x)\propto p(\mu)\cdot p(x|\mu)=p(\mu)\cdot L(\mu|x)
=\frac{1}{\sigma_0\sqrt{2\pi}}e^{-(\mu-\mu_0)^2/2\sigma_0^2}\cdot\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/2\sigma^2}
\]
\[
\propto e^{-\mu^2(1/\sigma_0^2+1/\sigma^2)/2+\mu(\mu_0/\sigma_0^2+x/\sigma^2)}
= e^{-\mu^2/2\sigma_1^2+\mu\mu_1/\sigma_1^2},
\]
where
\[
\sigma_1^2=\frac{1}{1/\sigma_0^2+1/\sigma^2} \quad\text{and}\quad \mu_1=\sigma_1^2\big(\mu_0/\sigma_0^2+x/\sigma^2\big).
\]
Therefore
\[
p(\mu|x)\propto e^{-(\mu^2/\sigma_1^2-2\mu\mu_1/\sigma_1^2+\mu_1^2/\sigma_1^2)/2}=e^{-(\mu-\mu_1)^2/2\sigma_1^2}.
\]
Thus the posterior distribution is given by
\[
\mu|x\sim N(\mu_1,\sigma_1^2).
\]

10.4.1 Note

If we define precision as the inverse of the variance, then since
\[
1/\sigma_1^2=1/\sigma_0^2+1/\sigma^2,
\]
we have that
\[
\text{posterior precision}=\text{prior precision}+\text{data precision}.
\]
For the mean, we have
\[
\mu_1/\sigma_1^2=\mu_0/\sigma_0^2+x/\sigma^2,
\]
and so the posterior mean is a weighted sum of the prior mean and the data mean (point), with the weights proportional to the respective precisions.

10.5 Normal data — several observations

The process that was undertaken for a single data point is now described for a sample consisting of more than one observation. The prior is again
\[
\mu\sim N(\mu_0,\sigma_0^2),
\]
but the likelihood is
\[
L(\mu|x_1,\dots,x_n)=p(x_1|\mu)\dots p(x_n|\mu)=\frac{1}{(\sigma\sqrt{2\pi})^n}e^{-\sum_i(x_i-\mu)^2/2\sigma^2}.
\]
Thus the posterior becomes
\[
p(\mu|x_1,\dots,x_n)\propto p(\mu)\cdot L(\mu|x_1,\dots,x_n)
=\frac{1}{\sigma_0\sqrt{2\pi}}e^{-(\mu-\mu_0)^2/2\sigma_0^2}\cdot\frac{1}{(\sigma\sqrt{2\pi})^n}e^{-\sum_i(x_i-\mu)^2/2\sigma^2}
\]
\[
\propto e^{-\mu^2(1/\sigma_0^2+n/\sigma^2)/2+\mu(\mu_0/\sigma_0^2+\sum_ix_i/\sigma^2)}
=e^{-\mu^2/2\sigma_1^2+\mu\mu_1/\sigma_1^2},
\]
where now
\[
\sigma_1^2=\frac{1}{1/\sigma_0^2+n/\sigma^2} \quad\text{and}\quad \mu_1=\sigma_1^2\Big(\mu_0/\sigma_0^2+\sum_ix_i/\sigma^2\Big).
\]
Therefore
\[
p(\mu|x)\propto e^{-(\mu-\mu_1)^2/2\sigma_1^2},
\]
and the posterior distribution is given by
\[
\mu|x\sim N(\mu_1,\sigma_1^2).
\]
This result could also be obtained using the single-observation derivation, since $\bar{x}\sim N(\mu,\sigma^2/n)$: the posterior result given here for a sample of size $n$ is equivalent to that obtained from a single observation on the mean of a sample of size $n$.

10.6 Highest density regions

One method of characterising the posterior distribution or density is to describe an interval or region that contains 'most' of the distribution. Such an interval would be expected to contain more of the distribution inside than out, and the interval or region should be chosen to be as short or as small as possible. In most cases there is one such interval or region for a chosen probability level. For Bayesian inference, such an interval or region is called a highest (posterior) density region, or HDR. Alternative terminology includes Bayesian confidence interval, credible interval, and highest posterior density (HPD) region.

10.6.1 Comparison of HDR with CI

The confidence interval (CI) obtained from the sampling theory approach of classical frequentist statistics appears at first similar to the HDR. For either method, using the normal distribution as an example, we use the fact that
\[
\frac{x-\mu}{\sigma}\sim N(0,1).
\]
For the sampling theory approach $x$ is considered random, while $\mu$ is taken as fixed; the resulting interval is then random. In the Bayesian context the random variable is taken as $\mu$, while the interval is fixed once the data are available. If we mark with a tilde the quantity treated as random, then a 95% CI asserts
\[
|(\tilde{x}-\mu)/\sigma|<1.96,
\]
whereas a 95% HDR is saying that
\[
|(\tilde{\mu}-x)/\sigma|<1.96.
\]
In cases other than the simple situation described here, the two methods can differ.
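A small sketch (not from the notes; the data, prior values and known variance are all invented) of the Normal updating rule of Sections 10.4–10.5, showing the precision-weighted posterior and the resulting 95% HDR alongside the classical 95% CI:

    # Normal-Normal update with known variance, and 95% HDR vs classical 95% CI
    set.seed(5)
    sigma <- 2
    x     <- rnorm(25, mean = 10, sd = sigma)         # data, variance treated as known
    mu0 <- 8; sigma0 <- 3                             # prior: mu ~ N(mu0, sigma0^2)
    prec1    <- 1 / sigma0^2 + length(x) / sigma^2    # posterior precision
    sigma1sq <- 1 / prec1
    mu1      <- sigma1sq * (mu0 / sigma0^2 + sum(x) / sigma^2)
    z   <- qnorm(0.975)
    hdr <- mu1 + c(-1, 1) * z * sqrt(sigma1sq)        # 95% HDR (posterior is symmetric)
    ci  <- mean(x) + c(-1, 1) * z * sigma / sqrt(length(x))   # classical 95% CI
    rbind(HDR = hdr, CI = ci)   # similar here because the prior is fairly weak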
10.7 Choice of Prior

From the two simple examples already given, it should be clear that both the form of the prior and its parameters can have a bearing on the posterior distribution. For this reason the choice of prior needs to be given consideration, in its form and in its representation via suggested values for the parameters.

10.7.1 Improper priors

In the case of the Normal distribution with several observations, the posterior variance was
\[
\sigma_1^2=\frac{1}{1/\sigma_0^2+n/\sigma^2}.
\]
This posterior variance approaches $\sigma^2/n$ when $\sigma_0^2$ is large compared with $\sigma^2/n$. Alas, in the limit this means that the prior $N(\mu_0,\sigma_0^2)$ would become uniform on the whole real line, and thus would not be a proper density. Hence the term improper prior. Such improper priors can nevertheless produce quite proper posterior distributions when combined with an ordinary likelihood. (In such cases the likelihood dominates the posterior.) Operationally, improper priors are best interpreted as approximations over a large range of values rather than as being truly valid over an infinite interval. The construction of such approximations can be formalised; see Lee, p. 42, theorem.

10.7.2 Reference Priors

The term 'reference' prior is used to cover the case where the data analysis proceeds on the assumption that the likelihood should be expected to dominate the prior, especially where there is no strongly held belief a priori.

10.7.3 Locally uniform priors

A prior that is reasonably constant over the region where the likelihood dominates, and is not large elsewhere, is said to be locally uniform. For such a prior the posterior becomes
\[
p(\theta|x)\propto p(x|\theta)=L(\theta|x),
\]
and so, as expected, the likelihood dominates the prior.

Bayes postulated (this is his postulate, not his theorem) that the 'know nothing' prior $p(\theta)$ should be represented by a uniform prior, where $\theta$ is an unknown probability such that $0<\theta<1$. This implies
\[
p(\theta)=1, \qquad 0<\theta<1.
\]
Alas this suffers the same problems as improper priors, but again, if appropriate intervals can be chosen, a locally uniform prior can be found that will be workable.

10.8 Data–translated likelihoods

In choosing a reference prior that is 'flat', it would seem natural to choose an appropriate scale of measurement for the uniform prior, one related to the problem at hand. One such scale is one on which the likelihood is data–translated, i.e. one for which
\[
L(\theta|x)=g(\theta-t(x))
\]
for some function $t$, which is a sufficient statistic for $\theta$. For example, the Normal has
\[
L(\mu|x)\propto e^{-(x-\mu)^2/2\sigma^2}
\]
and is thus clearly of this type. However, a binomial $B(n,\pi)$ has
\[
L(\pi|x)\propto\pi^x(1-\pi)^{n-x},
\]
which cannot be put into the form $g(\pi-t(x))$. If the likelihood $L$ is data–translated, then different data values give the same form for $L$ except for a shift in location. Thus the main function of the data is to determine location, and it would seem sensible to adopt a uniform prior for such data–translated likelihoods.

10.9 Sufficiency

Recall that a statistic $T=t(x)$ is sufficient for $\theta$ iff
\[
f(x;\theta)=g(t(x);\theta)\cdot h(x)
\]
by the Fisher–Neyman factorisation criterion.

Theorem. For any prior distribution, the posterior of $\theta$ given $x$ is the same as the posterior of $\theta$ given a sufficient statistic $t$ for $\theta$, assuming that $t$ exists.

Proof. From sufficiency,
\[
p(x|\theta)=p(t|\theta)p(x|t),
\]
and so the posterior is
\[
p(\theta|x)\propto p(\theta)p(x|\theta)=p(\theta)p(t|\theta)p(x|t)\propto p(\theta)p(t|\theta),
\]
which proves the result.
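A numerical sketch of the theorem just proved (not from the notes; prior, data and known variance are invented): for Normal data with known variance, $\bar{x}$ is sufficient for $\mu$, and the posterior computed from the full sample agrees with the posterior computed from $\bar{x}$ alone.

    # Posterior from the full data versus posterior from the sufficient statistic xbar
    set.seed(6)
    sigma <- 1
    x <- rnorm(15, mean = 3, sd = sigma)
    mu.grid <- seq(0, 6, length.out = 2001)
    prior   <- dnorm(mu.grid, mean = 2, sd = 2)           # hypothetical N(2, 4) prior
    post.full <- prior * sapply(mu.grid, function(m) prod(dnorm(x, m, sigma)))
    post.suff <- prior * dnorm(mean(x), mu.grid, sigma / sqrt(length(x)))
    post.full <- post.full / sum(post.full)               # normalise on the grid
    post.suff <- post.suff / sum(post.suff)
    max(abs(post.full - post.suff))                       # essentially zero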
10.10 Conjugate priors

A class $P$ of priors forms a conjugate family if the posterior is in the class for all $x$ whenever the prior is in $P$. In practice we mostly restrict $P$ to classes associated with the exponential family. As an example, the Normal has a Normal conjugate prior, as shown by the posterior being Normal as well.

10.11 Exponential family

Earlier in the Unit we defined the exponential family of distributions by
\[
f(x_i;\theta)=p(x_i|\theta)=h(x_i)B(\theta)e^{q(\theta)K(x_i)}.
\]
Note the change from $p(\theta)$ to $q(\theta)$ to avoid confusion with the general form for the prior. This form for the density gives the likelihood as
\[
L(\theta|x)=h(x)B^n(\theta)e^{q(\theta)\sum_iK(x_i)}.
\]
For the exponential family there is a family of conjugate priors, defined by
\[
p(\theta)\propto B^{\nu}(\theta)e^{q(\theta)\tau},
\]
since the posterior is then
\[
p(\theta)\cdot L(\theta|x)\propto B^{\nu}(\theta)e^{q(\theta)\tau}\,B^n(\theta)e^{q(\theta)\sum_iK(x_i)}
=B^{\nu+n}(\theta)\,e^{q(\theta)\left(\tau+\sum_iK(x_i)\right)},
\]
which belongs to the same class as the prior. Hence the class is conjugate.

Normal mean

For the normal case with known variance we may take $B(\mu)=e^{-\mu^2/2\sigma^2}$, $q(\mu)=\mu/\sigma^2$ and $K(x_i)=x_i$, so that $t=\sum_iK(x_i)=n\bar{x}$. The general form for the class (in the exponent) is
\[
-\nu\mu^2/2\sigma^2+T\mu/\sigma^2,
\]
where $T$ can be chosen as $\nu\mu_0$ to give the correct scale. The full form is proportional to
\[
e^{-\nu(\mu-\mu_0)^2/2\sigma^2},
\]
which for $\nu=1$ generates the prior used for the normal.

Binomial proportion

In a fixed number $n$ of trials, the number of successes $x$ is such that
\[
x\sim B(n,\pi),
\]
i.e. $x$ follows the Binomial distribution, where $\pi$ is the probability of a success at each of the (independent) trials. Thus
\[
p(x|\pi)=\binom{n}{x}\pi^x(1-\pi)^{n-x}, \quad x=0,1,\dots,n, \qquad\text{so}\quad p(x|\pi)\propto\pi^x(1-\pi)^{n-x}.
\]
If we choose the prior as a Beta distribution, i.e.
\[
p(\pi)\propto\pi^{\alpha-1}(1-\pi)^{\beta-1}, \qquad 0\leq\pi\leq1,
\]
then the posterior is
\[
p(\pi|x)\propto\pi^{\alpha+x-1}(1-\pi)^{\beta+n-x-1},
\]
which is also a Beta distribution. Thus the prior is conjugate. Alternatively, we could construct the prior from the general form, using the fact that the Binomial is a member of the exponential family.
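A sketch of the Beta–Binomial update just described (not from the notes; the Beta(2, 3) prior and the data of $x=7$ successes in $n=20$ trials are invented). A grid evaluation of prior $\times$ likelihood reproduces the Beta$(\alpha+x,\ \beta+n-x)$ posterior:

    # Beta prior x Binomial likelihood gives a Beta posterior (conjugacy check)
    alpha <- 2; beta <- 3; n <- 20; x <- 7
    pi.grid <- seq(0.001, 0.999, by = 0.001)
    post  <- dbeta(pi.grid, alpha, beta) * dbinom(x, n, pi.grid)   # prior x likelihood
    post  <- post / (sum(post) * 0.001)                            # normalise numerically
    exact <- dbeta(pi.grid, alpha + x, beta + n - x)               # Beta(alpha+x, beta+n-x)
    max(abs(post - exact))                                         # small (grid error only)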
The information from n observations x is given by I(θ|x) and I(θ|x) = nI(θ|x) 10.13.2 Jeffrey’s prior If the parameter θ is transformed to φ via φ = φ(θ), then ∂`(φ|x) ∂`(θ|x) dθ = ∂φ ∂θ dφ So " #2 " ∂`(θ|x) = I(φ|x) = E ∂θ #2 !2 !2 dθ = I(θ|x) dφ q q √ So if p(θ) ∝ I(θ|x) then p(φ) ∝ I(φ|x), therefore choose I as a reference prior, as it is invariant to change of scale. So the reference prior is given by ∂`(φ|x) E ∂φ p(θ) ∝ q dθ dφ (I(θ|x) called Jeffrey’s rule. Note that this is not the only form of prior possible, but it can be used as a guide especially if there is no other obvious method of finding a prior. 197 Example Consider the binomial parameter π for which we have p(x|θ) ∝ π x (1 − π)(n−x) and so ` = x ln π + (n − x) ln(1 − π) + constant to give which becomes The information is thus ∂`/∂π = x/π − (n − x)/(1 − π) ∂ 2 `/∂π 2 = −x/π 2 − (n − x)/(1 − π)2 I = −E∂ 2 `/∂π 2 = Ex/π 2 + E(n − x)/(1 − π)2 = n/π + n/(1 − π) n I= π(1 − π) So finally the prior is p(π) ∝ π −1/2 (1 − π)−1/2 or, π ∼ B(1/2, 1/2), ie, the arc–sine distribution given earlier as a reference prior for the binomial. 10.14 Approximations based on the likelihood One way of describing a HDR is to quote the mode of the posterior density, although this goes against the idea of constricting a HDR. If the likelihood dominates the prior, if for example the prior chosen is a reference prior, b obtained by solving then the posterior mode will be close to the MLE θ, b we have In the neighbourhood of θ, and so Thus or b =0 L0 (θ) b + (θ − θ) b 2 L00 (θ)/2 b L(θ) = L(θ) +... b 2 00 b L(θ|x) ∝ e(θ − θ) L (θ)/2 b −1/L00 (θ)) b L(θ|x) ∝ N (θ, b 1/I(θ|x)) b ∝ N (θ, Thus a HDR could be constructed for θ using this approximation. to the likelihood, assuming that the likelihood dominates the prior. 198 10.14.1 Example For a sample x0 = (x1 , . . . , xn ) from a P (λ) with T = L(λ|x) ∝ e−nλ λT T! P i xi , we have Thus `(λ) = T ln λ − nλ + constant to give `0 (λ) = T /λ − n and `00 (λ) = −T /λ2 b = T /n = x̄ and I(λ|x) = nλ/λ2 = n/λ. Now λ b b = n2 /T . Thus the posterior can be approximated by N (T /n, T /n2 ), So I(λ|x) = −`00 (λ) asymptotically. Note that this posterior differs from the posterior obtained by using a conjugate prior. 10.15 Reference posterior distributions 10.15.1 Information provided by an experiment Note that while the term information used here is related to Fisher information, it is best to treat the two as separate quantities for the moment. 10.15.2 Reference priors under asymptotic normality Lindley defined the amount I(x|p) of information that an observation x prvides about θ as I(x|p) = Z p(θ|x) ln [p(θ|x)/p(θ)] dθ Z p(θ|x) ln [p(θ|x)/p(θ)] dθdx Averaged over all random observations, we obtaind the expected information as I= Z p(x) This is equivalent to Shannon’s information. 199 Exercise ( Show that the two expressions Z and p(θ) Z Z Z p(x|θ) ln [p(x|θ)/p(x)] dxdθ " # p(θ, x) p(θ, x) ln dθdx p(θ)p(x) are equivalent to the original form for expected information. ) Define In as the information about θ from n (independent) observations from the same distribution as x. Thus I∞ gives the information about θ when the prior is p(θ), since in that case we would have the exact value of θ. This condition ensures that we have a true reference prior, ie, one that contains no information about θ. Define pn (θ|x) as the posterior corresponding to the prior pn (θ), where the prior pn (θ) maximises In . The reference posterior p(θ|x) is then limn→∞ pn (θ|x). 
10.15 Reference posterior distributions

10.15.1 Information provided by an experiment

Note that while the term information used here is related to Fisher information, it is best to treat the two as separate quantities for the moment.

Lindley defined the amount $I(x|p)$ of information that an observation $x$ provides about $\theta$ as
\[
I(x|p)=\int p(\theta|x)\ln\big[p(\theta|x)/p(\theta)\big]\,d\theta.
\]
Averaged over all random observations, we obtain the expected information
\[
I=\int p(x)\int p(\theta|x)\ln\big[p(\theta|x)/p(\theta)\big]\,d\theta\,dx.
\]
This is equivalent to Shannon's information.

Exercise. Show that the two expressions
\[
\int p(\theta)\int p(x|\theta)\ln\big[p(x|\theta)/p(x)\big]\,dx\,d\theta
\quad\text{and}\quad
\int\!\!\int p(\theta,x)\ln\!\left[\frac{p(\theta,x)}{p(\theta)p(x)}\right]d\theta\,dx
\]
are equivalent to the original form for the expected information.

Define $I_n$ as the information about $\theta$ from $n$ (independent) observations from the same distribution as $x$. Thus $I_\infty$ gives the information about $\theta$ supplied by the data when the prior is $p(\theta)$, since in that limit we would have the exact value of $\theta$. This condition ensures that we have a true reference prior, i.e. one that contains no information about $\theta$.

Define $p_n(\theta|x)$ as the posterior corresponding to the prior $p_n(\theta)$, where the prior $p_n(\theta)$ maximises $I_n$. The reference posterior $p(\theta|x)$ is then $\lim_{n\to\infty}p_n(\theta|x)$. The reference prior $p(\theta)$ is defined (indirectly) as the $p(\theta)$ such that
\[
p(\theta|x)\propto p(\theta)p(x|\theta),
\]
where $p(\theta|x)$ is the reference posterior defined above.

To find the reference prior, define the entropy $H$ as
\[
H\{p(\theta)\}=-\int p(\theta)\ln p(\theta)\,d\theta.
\]
Now the information about $\theta$ contained in $n$ observations $\mathbf{x}'=(x_1,\dots,x_n)$ is
\[
I_n=\int p(x)\int p(\theta|x)\ln\big[p(\theta|x)/p(\theta)\big]\,d\theta\,dx,
\]
where we write the posterior $p_n(\theta|x)$ simply as $p(\theta|x)$ and the prior $p_n(\theta)$ as $p(\theta)$ to keep the notation concise. Then
\[
I_n = \int p(x)\int p(\theta|x)\ln p(\theta|x)\,d\theta\,dx-\int p(x)\int p(\theta|x)\ln p(\theta)\,d\theta\,dx
\]
\[
= -\int p(x)H\{p(\theta|x)\}\,dx-\int\!\!\int p(x)p(\theta|x)\ln p(\theta)\,d\theta\,dx
\]
\[
= -\int p(x)H\{p(\theta|x)\}\,dx-\int\!\!\int p(\theta)p(x|\theta)\ln p(\theta)\,dx\,d\theta
\]
\[
= -\int p(x)H\{p(\theta|x)\}\,dx-\int p(\theta)\ln p(\theta)\Big[\int p(x|\theta)\,dx\Big]d\theta
\qquad\Big(\text{and }\int p(x|\theta)\,dx=1\Big)
\]
\[
= -\int\!\!\int p(\theta)p(x|\theta)H\{p(\theta|x)\}\,dx\,d\theta+H\{p(\theta)\}.
\]
This puts $I_n$ into the form
\[
I_n=\int p(\theta)\ln\frac{f(\theta)}{p(\theta)}\,d\theta, \qquad\text{where}\quad
f(\theta)=\exp\Big\{-\int p(x|\theta)H\{p(\theta|x)\}\,dx\Big\}.
\]
Now $I_n$ is maximal when $p(\theta)\propto f(\theta)$, via the calculus of variations (Exercise!). Thus the prior corresponding to a maximal $I_n$ is
\[
p_n(\theta)\propto \exp\Big\{-\int p(x|\theta)H\{p(\theta|x)\}\,dx\Big\}.
\]

10.15.2 Reference priors under asymptotic normality

Here the posterior distribution for $n$ observations is approximately
\[
N\big(\hat{\theta},1/I(\hat{\theta}|\mathbf{x})\big),
\]
and so
\[
p(\theta|x)\approx N\big(\hat{\theta},1/nI(\hat{\theta}|x)\big),
\]
using the additive property of Fisher information. The entropy $H$ for a $N(\mu,\sigma^2)$ density becomes
\[
H\{p(\mu)\}=-\int\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(z-\mu)^2/2\sigma^2}\Big\{-\ln\sqrt{2\pi\sigma^2}-(z-\mu)^2/2\sigma^2\Big\}\,dz=\ln\sqrt{2\pi e\sigma^2}.
\]
This gives
\[
H\{p(\theta|x)\}=\ln\sqrt{2\pi e/nI(\hat{\theta}|x)},
\]
with $\theta$ playing the role of $\mu$. Thus we get
\[
\int p(x|\theta)H\{p(\theta|x)\}\,dx=\int p(x|\theta)\ln\sqrt{2\pi e/nI(\hat{\theta}|x)}\,dx
=\int p(\hat{\theta}|\theta)\ln\sqrt{2\pi e/nI(\hat{\theta}|x)}\,d\hat{\theta}
\approx\ln\sqrt{2\pi e/nI(\theta|x)},
\]
since $p(\hat{\theta}|\theta)\approx0$ except when $\hat{\theta}\approx\theta$. Thus
\[
p_n(\theta)\propto e^{-\ln\sqrt{2\pi e/nI(\theta|x)}}\propto\sqrt{I(\theta|x)},
\]
which is Jeffrey's prior. This result has use for a wider class of problems and for handling nuisance parameters.

Exercise. Show that the entropy for the prior $p(\theta)=e^{-\theta/\beta}/\beta$ is $1+\ln\beta$.

10.16 References

1. Bernardo J.M. and Smith A.F.M. (1994), Bayesian Theory, Wiley.
2. Broemeling L.D. (1985), Bayesian Analysis of Linear Models, Marcel Dekker.
3. Carlin B.P. and Louis T.A. (2000), Bayes and Empirical Bayes Methods for Data Analysis, 2nd ed., Chapman and Hall.
4. Gelman A., Carlin J.B., Stern H.S. and Rubin D.B. (2004), Bayesian Data Analysis, 2nd ed., Chapman and Hall.
5. Lee P.M. (2004), Bayesian Statistics, 3rd ed., Arnold.
6. Leonard T. and Hsu T.S.J. (2001), Bayesian Methods, Cambridge University Press.