STAT 104 - Chapter 1

  1. Define:


    Individuals: The objects described by a set of data. Individuals may be people, but they may also be animals or things.

    ex. dogs, schools, years

    Variables: A characteristic of an individual. A variable may take different values from one individual to another.

    ex. hair colour of dog, age of school, amount spent on healthcare in a year
  2. How do you access Q drive?
    1. Go to Computer

    2. Instructor Files

    3. Math & Stat

    4. Gillian

    5. Stat 104
  3. What is the difference between raw data and summarized data?

    Give an example of summarized data citing individual and variable
    Raw data shows a list of the individuals and the information for each individual is presented.

    Summarized data does not list individuals with information alongside. The data is already summarized.

    • ex Car colour in percentages
    • White - 35%
    • Black - 17%
    • Silver - 12%    etc.

    Here the individual is the car, and the variable is the colour.

    ex. Percentage of eye colour in the class.

    • Individual - students
    • Variable - Colour of eyes

    *With summarized data, you can't find an individual by going from one line to the next.
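The step from raw data to summarized data can be sketched in Python. This is only an illustration: the list `raw_colours` and the helper `summarize` are made up for the example and are not course data.

```python
from collections import Counter

# Hypothetical raw data: one entry per individual (car), recording its colour.
raw_colours = ["White", "Black", "White", "Silver", "White", "Black", "Red", "White"]

def summarize(values):
    """Return each category's share of the individuals, as a percentage."""
    counts = Counter(values)
    total = len(values)
    return {category: 100 * count / total for category, count in counts.items()}

# The summary no longer lists individual cars, only the share of each colour.
summary = summarize(raw_colours)
for colour, pct in summary.items():
    print(f"{colour} - {pct:.1f}%")
```

Once the data is in the summarized form, the individual cars cannot be recovered from it.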
  4. Describe the two types of variables

    Give examples of each.
    1. Categorical variables - A qualitative characteristic that describes an individual in a non-numeric way. The information is usually expressed as words. 

    ex. Gender of person, opinion of person, primary colour of dog

    2. Quantitative Variable: A characteristic that describes an individual in a numeric way.

    The numbers must have numeric meaning: it must make sense to form an average. So the following are not quantitative: student number, gender coded as 1 for male and 0 for female, or opinions on a Likert scale, because there is no concept of distance between Disagree and Strongly Disagree.

    ex. height of person, weight of dog, speed of car, number of children per family.
  5. 1. What is the distribution of a variable?

    2. a. What are the two charts used to display categorical variables?

    b. How do we describe categorical variables? (3 points)

    3. a. What are the five charts used to display quantitative variables? 

    b. How do we describe quantitative variables? (4 points)
    1. The distribution of a variable gives the possible values of the variable along with how often each value occurs.

    • Sometimes presented in tables 
    • ex. The car colour data table gives the distribution of car colours.

    2a. bar charts and pie charts

    b. Most common, least common and anything else of interest

    3a. histograms, dot-plots, stem-and-leaf plots, box-and-whisker plots.

    *Time-series plots: only for variables that fluctuate over time

    b. Center, spread, shape and outliers
  6. How do you make a bar chart on Minitab?

    What will be on the Horizontal Axis and what will be on the Vertical Axis?

    * What do you need to do after the graph comes up?
    1. Go to Graph -> Bar Chart

    2. Bars represent: Values from a table

    3. Simple

    4. Graph Variable: Select the numerical data you want to show in your graph.
    ex. number of people with certain eye colour. *Vertical Axis

    5. Categorical Variable: Select the categorical variable you want to graph.
    ex. eye colour. *Horizontal Axis

    *Change the name of the table to give a clear description of what data is being presented.
  7. Brown eyes - 55%
    Blue eyes - 26%
    Hazel eyes - 10%
    Green eyes - 7%

    How would you describe the distribution of eye colour?
    1. The most common eye colour is brown. 55% of people in the room have brown eyes.

    2. The least common eye colour is green. 7% of people in the room have green eyes.

    3. Something else of interest:
    Brown eyes are about twice as common as blue eyes.

    Hazel and green eyes are approximately equally common.
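The three points above can be checked with a short Python sketch. The percentages are copied from this card; the variable names are only for illustration.

```python
# Percentages copied from the card above.
eye_colour = {"Brown": 55, "Blue": 26, "Hazel": 10, "Green": 7}

# Most common and least common categories.
most_common = max(eye_colour, key=eye_colour.get)
least_common = min(eye_colour, key=eye_colour.get)

# Something else of interest: how many times more common brown is than blue.
ratio = eye_colour["Brown"] / eye_colour["Blue"]

print(f"Most common: {most_common} ({eye_colour[most_common]}%)")
print(f"Least common: {least_common} ({eye_colour[least_common]}%)")
print(f"Brown is about {ratio:.1f} times as common as blue")
```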
  8. 1. What are the two necessary components of a pie chart?

    2. How do you make a pie chart on MiniTab?
    1. Pie charts can be used to describe a categorical variable if:

    a. Each individual is included exactly once
    ex. each person in the room (eye colour)

    b. The percentages add up to 100%

    *In other words, the categories are all part of a single pie

    2a. Go to Graph -> Pie Chart

    b. Click on Value from a Table box

    c. Categorical Variable: Insert the categories
    ex. Eye colour

    d. Summary variables: Insert the numerical data you want to represent.
    ex. Number of people with x eye colour
  9. What do you have to do to make a Histogram more readable? (2)
    1. Click on the numbers along the bottom axis, change the position of the ticks, then click on the Binning tab, and change Interval Type to Cutpoint instead of Midpoint.

    2. Change the labels so they describe the data better.
  10. Define:

    1. Center
    2. Spread
    3. Shape 
    4. Outliers

    *How do you calculate these?

    ** How would you put these into a sentence, using the health care data?
    1. Center: The value of the variable that has half of the individuals below it and half above it

    Position of center = (n + 1) / 2, where n is the number of individuals in the data set and the values are counted in order

    * If the position is not a whole number, the center is the average of the two values on either side of it.

    2. Spread: How much the variable varies, calculated as Max - Min

    3. Shape: The shape of a distribution. Either symmetric or skewed.

    4. Outliers: Generally, an individual that does not follow the overall pattern of the data. An individual whose value is a lot more or a lot less than the others. 

    *In histograms, stem-plots and dot-plots, outliers can be identified because there is a gap between them and the other individuals.

    ** Center: The typical amount spent on health care is $2200

    Spread: The amounts spent are spread over $10,000.

    Shape: The distribution of the amounts spent on health care is skewed to the right because the spread of the top 50% of the amounts spent is more than the spread of the bottom 50% of the amounts spent

    Outliers: One country spent a lot more on health care than all the other countries. The country spent approx. $9500 on health care.
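The calculations for center and spread can be sketched in Python. This is a minimal illustration: the function names and the sample amounts are made up for the example and are not the health care data set.

```python
def center(values):
    """Find the center using the (n + 1) / 2 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position in the ordered list
    if position.is_integer():
        return ordered[int(position) - 1]
    # Position falls between two values: average the values on either side.
    below = ordered[int(position) - 1]
    above = ordered[int(position)]
    return (below + above) / 2

def spread(values):
    """Spread is calculated as Max - Min."""
    return max(values) - min(values)

# Hypothetical amounts spent, in dollars.
odd_count = [200, 1500, 2200, 3100, 9500]
even_count = [200, 1500, 2200, 3100]

print(center(odd_count))   # position (5 + 1) / 2 = 3, so the 3rd value
print(center(even_count))  # position 2.5, so the average of the 2nd and 3rd values
print(spread(odd_count))
```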
  11. 1. When do we NOT consider outliers?

    2. What does it mean when a shape is skewed to the right?

    1. When determining the shape. Outliers DO NOT cause skewness. 

    2. The distribution has a long tail on the right. The spread of the lower 50% is less than the spread of the top 50%.
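The comparison of the two halves can be sketched in Python. This is a rough illustration only: `describe_shape` and the sample amounts are made up, the sketch assumes any outliers have already been set aside, and treating exactly equal spreads as "symmetric" is a simplification, since real data is rarely exactly equal.

```python
from statistics import median

def describe_shape(values):
    """Compare the spread of the lower 50% with the spread of the top 50%."""
    middle = median(values)
    lower_spread = middle - min(values)  # spread of the lower 50%
    upper_spread = max(values) - middle  # spread of the top 50%
    if upper_spread > lower_spread:
        return "skewed to the right"
    if upper_spread < lower_spread:
        return "skewed to the left"
    return "symmetric"

# Hypothetical amounts with a long tail on the right.
amounts = [200, 900, 1500, 2200, 3100, 4800, 7000]
print(describe_shape(amounts))
```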
  12. 1. In a stem-and-leaf plot, what are the leaf units?

    2. How do you read a stem plot?

    3. What is the center, spread, shape and outliers?
    Image Upload 1
    1. The leaf units tell us the units for the data. ex. If the leaf unit is 100, then the numbers are in hundreds. 

    2. When reading a number from a stemplot, put the stem and leaf together and multiply by the leaf unit. 

    ex. If the stem is 60, and the leaf is 2, and the unit is 10, the number is 602 x 10 = 6020. 

    3.
    Image Upload 2

    Center: position = (n + 1) / 2

    (35 + 1) / 2 = 18

    The left hand column shows how many numbers the row passes. Count 18 from either the bottom or the top. 

    22 x 100 = 2200

    Typically, the 35 countries that had the highest GDP in 2013 spent $2200 on health care.

    Spread: 9100 - 200 = 8900

    The amounts spent on health care in 2013 are spread over $8900

    Shape: The distribution of the amounts spent on health care in 2013 is skewed to the right because the spread of the top 50% is more than the spread of the bottom 50% of the amounts spent.

    Outliers: One country spent a lot more than all the other countries. This country spent $9100.
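The reading rule and the center position can be sketched in Python. The helper `stemplot_value` is made up for the example and assumes single-digit leaves. The split of 22 into a stem of 2 and a leaf of 2 is also an assumption, since the plot itself is not shown here.

```python
def stemplot_value(stem, leaf, leaf_unit):
    """Put the stem and a single-digit leaf together, then multiply by the leaf unit."""
    return (stem * 10 + leaf) * leaf_unit

# Example from the card: stem 60, leaf 2, leaf unit 10.
print(stemplot_value(60, 2, 10))  # 602 x 10 = 6020

# Assumed split of the center row: stem 2, leaf 2, leaf unit 100.
print(stemplot_value(2, 2, 100))  # 22 x 100 = 2200

# Position of the center for the 35 countries.
n = 35
center_position = (n + 1) // 2
print(center_position)  # 18

# Spread from the card: Max - Min.
print(9100 - 200)  # 8900
```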
  13. 1. Why use a stem-and-leaf over a histogram, and vice-versa. (2 points)

    2. How do you find outliers on a stem-and-leaf using MiniTab?

    3. How do you describe the healthcare data if there are no outliers?

    4. Because the results of reading different graphs are not always the same, what do you need to do on a test?
    1a. The numbers are generally more accurate using a stem-and-leaf, but histograms give a better visual illustration.

    b. For a large dataset, histograms are better. For a small dataset, histograms are not useful, and a stem-and-leaf plot or a dotplot is better.

    2. Click on box called Trim Outliers
    LO means a low outlier, HI means a high outlier.

    3. No countries spent a lot more or a lot less than the other countries on healthcare.

    4. Indicate which graph you have drawn.
  14. How do you make a side-by-side histogram if there is more than one category?

    Image Upload 3
    • Go to Stat bar at the top
    • Basic Statistics
    • Display Descriptive Statistics...
    • Select Variables and By Variables

    Click Graphs button, and select Histogram of data
  15. 1. What is a time-series plot, and what does it show?

    2. Describe the 3 features of a time series plot

    3. How do you make a Time-Series plot in Minitab?

    4. How would you describe this time series plot?
    Image Upload 4
    1. A collection of readings of a variable taken sequentially in time.
    ex. recorded every year.

    It shows how the variable has changed over time. 

    2a. Trend: the overall change in the variable during the time period for which we have data. 

    Trends may be increasing, decreasing or constant. A trend is constant if the numbers fluctuate but do not generally get bigger or smaller.

    b. Seasonality: A repeating pattern that continues throughout the time period for which we have data. 

    ex. Temperatures go up in summer, and down in winter every year. 

    c. Random Fluctuations: irregular short term changes up or down, including spikes. 

    Time series graphs are never smooth. The line wobbles, and these are random fluctuations. 

    3a. In the Series box, put the variable, not the year (individual).

    b. Click Time/Scale button. 

    c. Under Time Scale: column, select Stamp

    d. In Stamp columns box, select year and then click 'ok' 'ok'.

    4.
    Image Upload 5

    Trend: Increased from approximately 1980 until 1992, and then levelled off. 

    Seasonality: No repeating patterns. 

    *Large random fluctuation in 2002.
  16. 1. Why should bar charts NOT be used for time series data?

    2. Why should time series plots NOT be used to describe the distribution of categorical variables?
    1. Because you want to see how data changes over time. Bar charts are choppy, and do not show random fluctuations.

    2. Because categorical variables have no concept of distance. 

    ex. Percentage of students with each letter grade.