Stat 2 mock final answers
Question 1: I am hired to test the effectiveness of a flu vaccine on Berkeley students. I wish to
vaccinate a simple random sample of students, but only half of them consent to be vaccinated. I
vaccinate all those who consent (the treatment group), while retaining those who do not consent as a
control group. I compare the percentage of the treatment group who get the flu to the percentage of the
control group who get the flu using a two-sample z-test.
What is the major flaw in my experimental design? How could it be improved?
The design is not randomised, and the people who do not consent to be vaccinated make a poor control
group, since they differ from the treatment group for more reasons than just the treatment. For one
thing, they don't receive a placebo, so any placebo effect is not accounted for. For another, those who
refuse to consent are, on average, from different socio-economic groups than those who do, and may be
more or less likely to get the flu. For yet another, they know they are unvaccinated, and so may behave
differently from the vaccinated when faced with the possibility of getting the flu.
A better design would be to flip coins to split those who consent into a treatment group and a control
group. With large samples, this would make the two-sample z-test legitimate.
The histogram of incomes for employees of a large company is shown below:
Question 2:
What is the median income of the employees at the company?
Work out the percentage of employees who fall within each block by finding their areas:
Income range       Percentage of employees
$0-$10000          30%
$10000-$20000      10%
$20000-$40000      30%
$40000-$80000      30%
So $20000 is the 40th percentile and $40000 is the 70th percentile. The median, which is the 50th
percentile, must therefore be between $20000 and $40000. We can't say exactly where in that block it falls.
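The percentile bookkeeping can be checked with a short Python sketch. The block boundaries and percentages are the ones read off above; the variable names are just for illustration.

```python
from itertools import accumulate

# (lower bound, upper bound, percentage of employees) for each histogram block
blocks = [
    (0, 10000, 30),
    (10000, 20000, 10),
    (20000, 40000, 30),
    (40000, 80000, 30),
]

# Running total of percentages: the percentile reached at each block's upper bound
cumulative = list(accumulate(pct for _, _, pct in blocks))

# The median block is the first one whose upper bound reaches the 50th percentile
median_block = next(
    (lo, hi) for (lo, hi, _), cum in zip(blocks, cumulative) if cum >= 50
)
print(cumulative, median_block)
```

This only locates the block containing the median; it makes no assumption about how incomes are spread inside the block.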
Question 3: The correlation between height and weight for adults is 0.7. Body mass index (BMI) is
equal to weight divided by height squared. Multiple choice: the correlation between height and BMI is:
(a) equal to 0.7
(b) equal to -0.7
(c) equal to 0.3
(d) equal to 0.49
(e) cannot be determined from the information given
(e). The correlation is a measurement of linear association. We don't know what will happen to it if we
perform the non-linear transformation of weight into BMI.
In a very large class, midterm scores are approximately normally distributed, with average 50 and SD
15. Final scores are also approximately normally distributed, with average 60 and SD 10. The scatter
plot of final scores against midterm scores is football-shaped, with a correlation of 0.6.
Question 4:
What percentage of students scored higher than 75 on the final?
We convert to standard units and use the normal curve. z = (75 – 60)/10 = 1.5, P(Z > 1.5) = 6.7%.
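As a check on the table lookup, here is a small Python sketch that converts 75 to standard units and finds the area to the right under the normal curve. The helper name normal_upper_tail is made up for this example; it relies on the standard identity P(Z > z) = erfc(z / sqrt(2)) / 2.

```python
import math

def normal_upper_tail(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

final_avg, final_sd = 60, 10

# Convert a final score of 75 to standard units
z = (75 - final_avg) / final_sd

# Percentage of students above 75, assuming the normal approximation holds
pct_above = 100 * normal_upper_tail(z)
print(z, round(pct_above, 1))
```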
Question 5:
What is the regression prediction for the final score of a student who scored 45 on the midterm?
Many ways to do this. For example, the regression equation gives 60 + 0.6*10/15*(45 – 50) = 58.
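The regression equation used here can be written as a small Python function. This is just a sketch with the summary statistics from the problem; the function name predict_final is an illustrative choice.

```python
mid_avg, mid_sd = 50, 15      # midterm average and SD
final_avg, final_sd = 60, 10  # final average and SD
r = 0.6                       # correlation between midterm and final

def predict_final(midterm):
    """Regression prediction of the final score from a midterm score."""
    slope = r * final_sd / mid_sd  # final points per midterm point
    return final_avg + slope * (midterm - mid_avg)

prediction = predict_final(45)
print(round(prediction, 2))
```

A student exactly at the midterm average (50) is predicted to score the final average (60), which is a quick sanity check on the formula.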
Question 6:
What is the RMS error of the regression?
RMS error = SD(y) * sqrt(1 – r^2) = 10 * sqrt(1 – 0.6^2) = 8.
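The same arithmetic in Python, as a quick check (variable names are illustrative):

```python
import math

final_sd = 10  # SD of the final scores (the y variable)
r = 0.6        # correlation between midterm and final

# RMS error of the regression line for predicting final from midterm
rms_error = final_sd * math.sqrt(1 - r ** 2)
print(round(rms_error, 2))
```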
Question 7: I have a bag containing ten marbles: four blue and six green. I draw five marbles with
replacement. What is the probability that exactly three of the marbles I draw are blue?
The number of draws is fixed, the draws are independent and the probability of a blue marble is the
same for each draw. So the conditions of the binomial are satisfied, and we can just use the binomial
formula with n = 5, k = 3, p = 0.4. This gives 23.04%.
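For anyone who wants to see the binomial formula spelled out, here is a minimal Python sketch. The helper name binomial_probability is made up for this example.

```python
import math

def binomial_probability(n, k, p):
    """Chance of exactly k successes in n independent trials, each with chance p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Five draws with replacement, chance of blue is 4/10 on each draw
prob_three_blue = binomial_probability(5, 3, 0.4)
print(round(100 * prob_three_blue, 2))
```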
Question 8: Which of the following two events is more likely, and why?
(a) I toss a fair coin 100 times, and get exactly 50 heads.
(b) I toss a fair coin 1000 times, and get exactly 500 heads.
Note: You do not need to find the exact probabilities.
(a) is more likely. The SE of the number of heads grows with the number of tosses, so with 1000 tosses
the chance is spread over more possible values, and it's harder to hit any one exact value.
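The question doesn't ask for the exact probabilities, but they are easy to compute as a check. This sketch uses the binomial formula with p = 1/2; the helper name is illustrative.

```python
import math

def prob_exactly_half_heads(n):
    """Chance of exactly n/2 heads in n tosses of a fair coin (n even)."""
    return math.comb(n, n // 2) / 2 ** n

p_100 = prob_exactly_half_heads(100)    # exactly 50 heads in 100 tosses
p_1000 = prob_exactly_half_heads(1000)  # exactly 500 heads in 1000 tosses
print(round(100 * p_100, 2), round(100 * p_1000, 2))
```

The first chance is about 8% and the second about 2.5%, so (a) is indeed more likely.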
Question 9:
I roll a biased die that has the following set of probabilities for each outcome:
Number of spots   1     2     3     4     5     6
Probability       20%   10%   10%   10%   20%   30%
What is the expected number of spots I see on one roll of the die?
This question asks for the expected value of one draw from the following box:
[ 20 ones 10 twos 10 threes 10 fours 20 fives 30 sixes ]
The expected value of one draw is just the box average, which is 3.9.
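The box average is a weighted average of the outcomes, which this Python sketch computes directly from the table of probabilities (kept as whole percentages to avoid rounding noise).

```python
# Number of spots -> chance of that outcome, in percent
percent_by_spots = {1: 20, 2: 10, 3: 10, 4: 10, 5: 20, 6: 30}

# Sanity check: the chances must add up to 100%
total_percent = sum(percent_by_spots.values())

# Expected value = sum of (outcome * chance), i.e. the average of the box
expected_spots = sum(s * pct for s, pct in percent_by_spots.items()) / 100
print(total_percent, expected_spots)
```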
I have a box containing 1000 tickets, each with a number written on it. I make 400 draws from the box
with replacement. The average of these 400 draws is 10, while their SD is 30.
Question 10: Find a 95% confidence interval for the average of the tickets in the box.
Our estimate of the average is 10. We estimate its standard error as 30/sqrt(400) = 1.5. A 95%
confidence interval runs from 2 SEs below the estimate to 2 SEs above it, giving 7 to 13.
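Here is the interval computed step by step in Python. It follows the same recipe as above: estimate plus or minus 2 SEs, with the sample SD standing in for the unknown SD of the box.

```python
import math

n_draws = 400
sample_avg = 10
sample_sd = 30  # used as an estimate of the SD of the box

# SE of the average of the draws
se_avg = sample_sd / math.sqrt(n_draws)

# Approximate 95% confidence interval: 2 SEs either side of the estimate
ci_low = sample_avg - 2 * se_avg
ci_high = sample_avg + 2 * se_avg
print(se_avg, ci_low, ci_high)
```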
I now examine all 1000 tickets. I find their average is 11.
Question 11: Can you now find a 95% confidence interval for the average of the tickets in the box? If
so, calculate the confidence interval. If not, explain why you cannot.
You know what the average of the tickets in the box is: it's 11. There's no variation in this number. So
there's no point in making a confidence interval.
Question 12: A mayoral election is to be held in a town where the population is 40% black, 30%
white, 20% Latino and 10% Asian. Two survey companies take opinion polls on the mayoral election.
Company A takes a quota sample of 400 black residents, 300 white residents, 200 Latino residents and
100 Asian residents, and asks them which mayoral candidate they prefer. The interviewers for
Company A may select the members of the sample, as long as the quotas are met. Company B takes a
simple random sample of 1000 town residents. This includes 390 black residents, 310 white residents,
195 Latino residents and 105 Asian residents.
Which has the better sampling design, Company A or Company B? Why?
Company A's design leaves room for selection bias on the part of the interviewers. For example, if the
interviewers tend to select well-dressed people, these people may be more inclined to vote for a certain
candidate than others of their ethnicity. Company B's design is a probability sample, so if the simple
random sample is taken correctly, no such bias exists. Although the percentage of their sample in each
ethnicity doesn't exactly match the true percentages for the town, they're close enough for any variation
to be due to chance (we can check this using a chi-square test). In short, Company B's sample is better
because it avoids selection bias.
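The chi-square check mentioned above can be sketched in Python. The observed counts are Company B's sample and the expected counts come from the town's percentages. The 5% critical value of 7.81 for 3 degrees of freedom is taken from a standard chi-square table rather than computed here.

```python
observed = [390, 310, 195, 105]  # black, white, Latino, Asian in Company B's sample
town_percent = [40, 30, 20, 10]  # true percentages in the town

n = sum(observed)
expected = [n * pct / 100 for pct in town_percent]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 4 categories, so 3 degrees of freedom; 5% critical value from a table
critical_5pct_df3 = 7.81
print(round(chi_square, 2), chi_square < critical_5pct_df3)
```

The statistic comes out a little under 1, far below the critical value, so the differences from the town's percentages look like chance variation.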
Of a simple random sample of 240 Berkeley students, 150 support Proposition Z. Of a simple random
sample of 200 Stanford students, 120 support Proposition Z.
Question 13: Find a 95% confidence interval for the percentage of all Berkeley students who support
Proposition Z. (You do not need to use the correction factor.)
The sample percentage of Berkeley students who support Proposition Z is 62.5%; this is our estimate
for the percentage of all Berkeley students who support Proposition Z. The standard error of this
estimate is approximately sqrt(0.625*0.375/240)*100% = 3.125%. A 95% confidence interval runs
from 2 SEs below the estimate to 2 SEs above it, giving 56.25% to 68.75%.
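The same calculation in Python, as a sketch. As the question allows, no correction factor is used.

```python
import math

n_berkeley = 240
supporters = 150

# Sample proportion, used as the estimate for all Berkeley students
p_hat = supporters / n_berkeley

# Estimated SE of the sample percentage
se_pct = 100 * math.sqrt(p_hat * (1 - p_hat) / n_berkeley)

# Approximate 95% confidence interval: 2 SEs either side of the estimate
ci_low = 100 * p_hat - 2 * se_pct
ci_high = 100 * p_hat + 2 * se_pct
print(100 * p_hat, se_pct, ci_low, ci_high)
```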
Question 14: To see if there is a statistically significant difference between the percentage of Berkeley
students who support Proposition Z and the percentage of Stanford students who support Proposition
Z, I wish to carry out a two-sample z-test. To do this, I need to estimate the standard error of the
difference of the two sample percentages. What is this standard error?
In the previous question, we found the SE for the sample percentage of Berkeley students is 3.125%.
Similarly, we find the SE for the sample percentage of Stanford students is 3.46%. The standard error
of their difference is sqrt(3.125^2 + 3.46^2), which is about 4.67%.
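A short Python sketch of this calculation. It carries the unrounded Stanford SE through, which is why the result rounds to 4.67%. The helper name se_of_percentage is illustrative.

```python
import math

def se_of_percentage(successes, n):
    """Estimated SE of a sample percentage from a simple random sample."""
    p = successes / n
    return 100 * math.sqrt(p * (1 - p) / n)

se_berkeley = se_of_percentage(150, 240)
se_stanford = se_of_percentage(120, 200)

# The samples are independent, so the SEs combine by the square-root rule
se_difference = math.sqrt(se_berkeley ** 2 + se_stanford ** 2)
print(round(se_berkeley, 3), round(se_stanford, 2), round(se_difference, 2))
```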
Question 15: For my two-sample z-test, I end up getting a two-tailed P-value of 59%. What should I
conclude from this? (If you accept or reject a hypothesis, then state that hypothesis.)
Our null hypothesis would be that the percentage of all Berkeley students who support Proposition Z
equals the percentage of all Stanford students who support Proposition Z; our alternative hypothesis would
be that these percentages are different. A P-value of 59% is in no way significant evidence against the
null hypothesis: the difference between our observed results and what we expect under the null can
easily be explained by chance. We do not reject the null hypothesis; the percentages of students who
support Proposition Z may be the same at Berkeley as at Stanford.
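The question only states the P-value, but it can be reproduced from the numbers in Questions 13 and 14. The sketch below does this in Python, using the identity that the two-tailed area beyond |z| under the normal curve equals erfc(|z| / sqrt(2)). The helper name is illustrative.

```python
import math

def se_of_percentage(successes, n):
    """Estimated SE of a sample percentage from a simple random sample."""
    p = successes / n
    return 100 * math.sqrt(p * (1 - p) / n)

berkeley_pct = 100 * 150 / 240
stanford_pct = 100 * 120 / 200
se_difference = math.sqrt(
    se_of_percentage(150, 240) ** 2 + se_of_percentage(120, 200) ** 2
)

# Under the null hypothesis the expected difference is 0
z = (berkeley_pct - stanford_pct) / se_difference

# Two-tailed P-value from the normal curve
p_value = math.erfc(abs(z) / math.sqrt(2))
print(round(z, 2), round(100 * p_value))
```

The z-statistic is only about 0.5, which is where the P-value of roughly 59% comes from.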
Question 16 (hard): I have an equal number of red, blue and green marbles in a large box. I draw
four marbles from the box, with replacement. What is the probability I get at least one of each color?
One way to do this is to list the ways of getting at least one marble of each colour, and find the
probability of each of these ways. Here “way” means four marbles, each either red, blue or green, in
some order, so there are 3*3*3*3 = 81 ways we can colour four marbles.
“At least one marble of each colour” means two marbles of one colour and one of each of the other two
colours. Let's count the numbers of ways we can get two reds, one blue and one green.
RRBG RBRG RBGR BRRG BRGR BGRR
RRGB RGRB RGBR GRRB GRBR GBRR
That's twelve ways. (Quicker way to count: there are 4!/(2!2!) = 6 ways to place the two reds, then for
each of these, either blue comes before green, or green comes before blue. 6 x 2 = 12.) Similarly, there
are twelve ways with two blues, and twelve ways with two greens. That's 36 ways altogether. The
probability of each way is 1/81 (or 1/3 * 1/3 * 1/3 * 1/3). So the overall probability is 36/81 = 4/9, or about 44%.
I told you it was hard.
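If you don't trust the counting, a brute-force check is easy: list all 81 equally likely ways and count those that contain every colour. This Python sketch does exactly that; the one-letter colour codes match the lists above.

```python
from fractions import Fraction
from itertools import product

colours = "RBG"

# All ordered ways to colour four draws: 3 * 3 * 3 * 3 = 81 of them
outcomes = list(product(colours, repeat=4))

# Keep the ways in which all three colours appear at least once
favourable = [o for o in outcomes if set(o) == set(colours)]

# Each way is equally likely, so the chance is favourable / total
probability = Fraction(len(favourable), len(outcomes))

# The ways with two reds, one blue and one green
two_reds = ["".join(o) for o in favourable if o.count("R") == 2]
print(len(outcomes), len(favourable), probability, len(two_reds))
```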