When do you use paired t, and when do you use 2-sample t?
Give examples of each
Use paired t when we have two related samples (same variable).
ex. Sight readings for left and right eyes of a sample of people. There are two samples: the sample of readings from left eyes and the sample of readings from right eyes. The readings from the left and right eye of the same person are related.
- ex. Pairs of trees in different locations of a forest, where one tree of each pair receives extra CO2 and the other tree does not. This determines whether extra CO2 makes trees grow faster.
Use 2-sample t when we have two unrelated samples (same variable).
ex. Comparing male and female students in terms of binge drinking.
ex. Comparing heights of men and women
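To see why the choice between the two tests matters, here is a minimal Python sketch that computes both statistics on the same data. The eye readings are made up for illustration, and the 2-sample statistic shown is the unpooled (Welch) form, which is an assumption about the version used in the course.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sight readings for the same six people (not real data).
left = [20.1, 25.3, 30.2, 35.4, 40.0, 45.2]
right = [20.9, 26.0, 31.1, 36.0, 41.1, 46.0]

def paired_t(x, y):
    """t statistic for the mean of the within-pair differences x - y."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

def two_sample_t(x, y):
    """Unpooled (Welch) t statistic, treating x and y as unrelated samples."""
    se = sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / se

t_paired = paired_t(right, left)
t_two = two_sample_t(right, left)
print(f"paired t = {t_paired:.2f}, 2-sample t = {t_two:.2f}")
```

Because people differ a lot from one another but each person's two eyes are similar, the paired statistic comes out far larger than the 2-sample one here. Ignoring the pairing would hide a consistent difference.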
What are the conditions for using paired t?
1. We must have a representative sample of individuals who are measured twice, or a representative sample of pairs of individuals.
2. There should be no outliers in the sample of differences.
3. The population of differences should be normal.
4. Outliers and non-normality do not matter if the sample is large (n ≥ 40).
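Conditions 2 and 4 can be sketched in Python as follows. The differences are invented, and the 1.5 × IQR fences are the usual boxplot rule, assumed here to match what the software flags; quartile conventions vary slightly between packages. Normality (condition 3) still needs a look at the histogram.

```python
from statistics import quantiles

# Hypothetical after-minus-before differences (not real data).
differences = [0.2, 0.5, 0.3, 0.4, 0.6, 0.1, 0.45, 3.0]

def boxplot_outliers(values):
    """Return the values that fall outside the 1.5 * IQR fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

outliers = boxplot_outliers(differences)
n = len(differences)

# Outliers only stop mattering once the sample is large (n >= 40).
outlier_condition_ok = n >= 40 or not outliers
print(outliers, outlier_condition_ok)
```

With these eight differences, the value 3.0 is flagged and the sample is too small for condition 4 to rescue the test.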
Do a significance test for ex. 20.46, Recruiting T-cells using paired t
FOUR STEP PROCESS
- How strong is the evidence that the average t-cell count for all patients who use the medication is higher than the average t-cell count for all patients who don't use the medication?
- PLAN: μA is the average t-cell count in all patients who use the medication
μB is the average t-cell count in all patients who don't use the medication
- Ho: μA - μB ≤ 0
- Ha: μA - μB > 0
We'll use paired t because we have two related samples.
The conditions for paired t are
- 1. Two related representative samples
- 2. No outliers in the sample of differences
- 3. The population of differences should be normal
- 4. If n ≥ 40, #2 and #3 don't matter.
- Checking the conditions:
- 1. No information is given about how they got the data. In practice, we would contact the person who collected the data to find out
2. The boxplot of the sample of differences shows there are no outliers
3. The histogram of the sample of differences suggests that the population of differences is skewed.
- Test statistic: t = 2.83
- p-value: 0.018
- Since the p-value is between 0.01 and 0.05, we have strong evidence that the average t-cell count in all patients who use the medication in the future will exceed the average t-cell count in all patients who do not use the medication in the future.
*This test might be a bit inaccurate because of possible non-normality.
1. How do you do a significance test for paired t on Minitab? (using the t-cells hypothesis, Ha: μA - μB > 0)
2. What do you need to do differently to find a CI for paired t on Minitab?
1a. Stat -> Basic Statistics -> Paired t
b. Insert the variables into the Sample 1 and Sample 2 boxes.
*In the Sample 1 box, put the sample which should have the larger mean, and put the smaller in the Sample 2 box.
You can do it the other way around, but this will produce a negative Test statistic.
Because you expect the number of t-cells to be greater after the medication is taken, you would put the 'after' sample in first and the 'baseline' sample in second. That way, Minitab will subtract the baseline results (lower t-cells) from the 'after' results (higher t-cells). After - Baseline (Ha: μA - μB > 0)
- The hypothesis μA - μB > 0 means that μA is expected to be larger than μB because, after the subtraction, the result is greater than 0.
If the result was 0 or less, then we couldn't reject Ho, and we couldn't claim our Ha.
c. Click 'options' button. Hypothesized difference will always be 0 for this class, but in practice it could change.
- Change 'Alternative hypothesis:' to fit your Ha. In this case, we are looking for 'Difference > hypothesized difference'.
d. Click 'Graphs' box to select histogram and boxplot so you can check conditions.
The following information will provide the rest of the answers to finish the significance test.
2. In the 'options' section, put in the confidence level, but then make sure the 'Alternative Hypothesis' box says 'Difference ≠ hypothesized difference', even if you have a greater than or less than hypothesis.
- This is just a technicality with Minitab
1. How do we know that two samples are related in order to use paired t?
2. When setting up a hypothesis for a paired t significance test, what is the optimal way to do it?
3. Why should you do it this way?
1. Ask: if we were to jumble up the two samples, would it matter? If it does matter, use paired t. If it doesn't, use 2-sample t.
- ex. Midterm scores and quiz scores for STAT 104.
- If Gillian was to give me my quiz scores, but give me someone else's Midterm score, would it matter? YES!
ex. Heights of men and women. If we were to mix up the order of the heights of men, it wouldn't matter. Then we would use 2-sample t.
2. When comparing the averages of two samples representing two populations, set it up as a greater than test.
ex. Ha: μA - μB > 0 (t-cells after medication are more than before), rather than Ha: μB - μA < 0 (t-cells before medication are less than t-cells after).
By doing it this way, you'll get a positive test statistic and a positive CI, rather than a negative test statistic and a negative CI.
3. While both answers are correct, people tend to prefer working with positive numbers.
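A short Python sketch of the sign convention, using invented t-cell counts (not the exercise data): swapping the order of subtraction only flips the sign of the test statistic, so both set-ups carry the same evidence.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical t-cell counts for five patients (not the exercise data).
baseline = [0.8, 1.1, 0.9, 1.3, 1.0]
after = [1.2, 1.4, 1.5, 1.6, 1.1]

def paired_t(first, second):
    """Paired t statistic for the differences first - second."""
    d = [a - b for a, b in zip(first, second)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

t_after_minus_baseline = paired_t(after, baseline)   # matches Ha: μA - μB > 0
t_baseline_minus_after = paired_t(baseline, after)   # matches Ha: μB - μA < 0

print(f"After - Baseline: t = {t_after_minus_baseline:.2f}")
print(f"Baseline - After: t = {t_baseline_minus_after:.2f}")
```

The two statistics are equal in size and opposite in sign, which is why putting the sample with the larger expected mean first is only a matter of convenience.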
1. Give a 90% confidence interval to estimate the average difference in the number of t-cells after 20 days on the medication.
2. Interpret the interval
3. Which gives us more information, a significance test for t-cells or the CI?
1. 90% CI for μA - μB: (0.152, 0.905)
2. We estimate that the average number of t-cells in all patients who use medication in the future exceeds the average of t-cells in all patients who don't use the medication in the future by between 0.152 and 0.905 (thousand per microlitre).
3. The CI. The significance test only tells us that the medication works. The CI tells us not only that the medication works, but also by how much the count increases in patients who take the medication.
*Many people think that significance tests should be banned because confidence intervals are much more informative.
1. What is the most important condition for using inference in practice, and how can we be fairly sure this requirement is met?
2. What should we do if this condition is not met?
3a. What are the two sampling methods (used in this course) that are required in order to use inferential methods?
b. What happens if they are not used?
1. The most important condition is that the sample represents the population.
We can be fairly sure the sample represents the population if the sample does not suffer from under-coverage, non-response or response bias.
2. Samples that suffer from under-coverage, non-response or response bias are biased and should be discarded.
3a. Either the sample is a simple random sample (SRS), or the data come from a randomized comparative experiment.
b. If the sample is not an SRS or does not come from a randomized comparative experiment, conclusions may be challenged because the sample may not represent the population.
1. How is a randomized comparative experiment obtained generally speaking?
2. What are the three possible situations when determining the validity of a test?
3. What would we do in practice if we couldn't determine validity? (2 parts)
1. Individuals are randomly assigned to groups.
2a. We know that an SRS or randomized comparative experiment was obtained, and therefore the test is valid.
b. We have no information about the sampling design, and cannot confirm the validity of the test.
c. We know that the sample is biased (because of under-coverage, non-response or response bias).
3. In practice, we need to find out about the sampling design, and would send an email to the person who collected the data asking whether the sample is an SRS.
If not, we'd ask whether the individuals in the sample can be regarded as typical of those in the population.
If we know that the sample is biased, we cannot perform any inference. The sample should be discarded.
1. What does the margin of error in a confidence interval account for? (2 parts)
2. What does the margin of error in a confidence interval NOT account for?
3. What does a 95% confidence interval mean?
1. The margin of error accounts for random sampling variation.
(The variation between different random samples of the same population with the same sample size n).
Random sampling variation refers to the fact that different random samples produce different results.
1b. Occasionally, a random sample will be a freak sample because it happened to obtain relatively high or low numbers, even though it was a correctly obtained random sample.
This means that the sample average will not be close to the population average, so we will obtain a bad estimate of the population average.
- 2. It does not account for sample bias.
- This includes under-coverage, non-response and response bias.
3. The margin of error (m) in a 95% confidence interval is set up so that 95% of all possible samples will yield a sample average that is less than m units away from the population mean.
5% of samples could be considered freak samples - their averages will be more than m units away from the population mean, so their CIs miss the population mean.
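The "95% of all possible samples" idea can be checked with a small simulation. This is a sketch under assumptions not in the notes: a made-up normal population with known σ, samples of size 25, and z* = 1.96 for 95% confidence.

```python
import random
from math import sqrt
from statistics import fmean

random.seed(1)  # fixed seed so the run is repeatable

MU, SIGMA = 50.0, 10.0        # hypothetical population mean and SD
N, Z_STAR = 25, 1.96          # sample size and z* for 95% confidence
m = Z_STAR * SIGMA / sqrt(N)  # margin of error, treating sigma as known

trials = 2000
hits = 0
for _ in range(trials):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    # A "hit" means the sample average is less than m away from the mean,
    # so the interval (average - m, average + m) captures the mean.
    if abs(fmean(sample) - MU) < m:
        hits += 1

coverage = hits / trials
print(f"{coverage:.1%} of sample averages fell within m of the population mean")
```

The proportion of hits should land close to 95%. The remaining samples are the freak samples whose intervals miss the population mean.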
1. When determining whether the results of a test are significant, what is one important point to keep in mind?
2. What do we need to consider when determining the population we want to make inferences about from our sample?
Give an example.
1. There is no sharp border between significant and non-significant; only increasing evidence for the research hypothesis as the p-value gets smaller.
ex. Many people incorrectly think that a p-value below 0.05 indicates significant results, and a p-value above 0.05 indicates insignificant results. In actuality, p-values of 0.049 and 0.051 give similar evidence in favour of the research hypothesis.
No black and white answer!!
2. We need to consider whether the sample can generalize to the larger population, or whether we need to adjust our population to obtain more accurate results.
ex. When attempting to determine the highest price women would pay for a purse by selecting women who enter a high-end department store, we cannot make inferences about the population of all women.
These are women with specific tastes, and so we must adjust the population to all women who shop at this high-end department store.
In this example, the person 'randomly selected' women who entered the store, so this is a convenience sample and therefore suffers from under-coverage bias.
1. In the example of changing biomass of wildlife in reserves where the individuals are years, what is the population?
2. Interpret the interval for adults in the US who actively avoid drinking pop.
A 95% CI is 61% ± 4% (57, 65).
3. What is the important thing to remember in this example?
4. Interpret the confidence level (not on test, but good to know).
1. When making a time-series plot, we see a downward trend over the 29 years. Since the variable is changing over time, we cannot generalize to all years, so the population is simply the 29 years for which we made observations.
2. We estimate that between 57% and 65% of all adults in the US actively avoid drinking pop.
3. The word average is not included in this interpretation because we're estimating the percentage, not the population mean.
4. 95% of all possible random samples will yield a percentage that is less than 4 percentage points away from the true percentage who avoid drinking pop.
1. What is the formula for calculating the sample size for achieving a given margin of error in a CI for μ?
2. What does m represent?
3. When doing this equation on a test, what are three very important things to remember?
1. n ≥ [(z*σ)/m]²
*Put this on cheat sheet. Don't forget the ²!
2. The desired margin of error.
3a. Don't forget to square the answer ²
b. She gives 1 point just for writing down the formula. So don't forget the formula!!
c. If you get a decimal, always round n up to the next whole number (never down). Ex. 216.09 would be 217
1. How do you calculate the margin of error (m) from a confidence interval?
ex. CI: (25.33, 28.27)
2. Calculate the sample size (n) if you want a 99% confidence level.
σ = 7
m = 0.1
z* (at a 99% confidence level found in the t table) = 2.576
3. What would we need to do if we couldn't afford to pay for such a large sample size? (2 strategies).
1. m = (28.27 - 25.33) / 2 = 1.47
Subtract the lower endpoint from the upper endpoint and divide by 2.
2. n ≥ [(z*σ)/m]²
n ≥ [(2.576)(7) / 0.1]²
n ≥ 32,515.30
***ROUND UP TO NEXT WHOLE NUMBER
Answer: n ≥ 32,516
3. To make n smaller, either make m bigger (accept a wider interval) or lower the confidence level (which makes z* smaller).
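The three answers above can be checked with a few lines of Python. The function names are made up for this sketch. The z* of 1.960 for 95% confidence and the larger margin of 0.5 are illustrative values that are not part of the original question.

```python
from math import ceil

def margin_from_ci(lower, upper):
    """Margin of error is half the width of the interval."""
    return (upper - lower) / 2

def sample_size(z_star, sigma, m):
    """Smallest whole n with n >= (z* * sigma / m)^2, always rounding up."""
    return ceil((z_star * sigma / m) ** 2)

m_example = margin_from_ci(25.33, 28.27)

n_99 = sample_size(2.576, 7, 0.1)    # the worked example above
n_95 = sample_size(1.960, 7, 0.1)    # strategy: lower the confidence level
n_wide = sample_size(2.576, 7, 0.5)  # strategy: accept a bigger margin of error

print(f"m = {m_example:.2f}")
print(n_99, n_95, n_wide)
```

This gives m = 1.47 and sample sizes of 32516, 18824 and 1301. Because m sits in the denominator and the result is squared, making m five times bigger cuts n by a factor of about 25.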