1. How do you load a dataset into R? (2 steps)
2. What is another name for a causal variable?
3. What is the name for the effect?
4. What is experimental research in a nutshell?
b. data(congress, package = "qss")
2. Treatment variable
3. Outcome variable
4. Experimental research examines how a treatment causally affects an outcome by assigning varying values of the treatment variable to different observations, and measuring their corresponding values of the outcome variable.
##  4870 4
What does 4870 represent, and what does 4 represent?
2. In the example of race/gender and job applicants, what is the outcome variable, and what is the treatment variable?
3. How would we look at the first several observations of a data set?
1. 4870 are the number of observations, and 4 is the number of variables
2. The outcome variable is whether the fictitious applicant received a callback from a prospective employer.
The treatment variable is the race and gender of each applicant (more precisely, how employers perceived the race and gender of applicants, since those attributes were not manipulated directly).
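A minimal sketch of checking a data frame's dimensions and looking at its first rows. The toy data frame below is hypothetical: it only mimics a four-column layout so the code runs without the qss package, and its values are made up.

```r
# Hypothetical toy data; NOT the real resume data set
toy <- data.frame(
  firstname = c("Allison", "Kristen", "Lakisha", "Latonya", "Carrie", "Jay", "Jill"),
  sex       = c("female", "female", "female", "female", "female", "male", "female"),
  race      = c("white", "white", "black", "black", "white", "white", "white"),
  call      = c(0, 0, 0, 0, 0, 0, 1)
)

# dim() returns the number of rows (observations), then columns (variables)
toy.dim <- dim(toy)
toy.dim

# head() shows the first six observations by default
toy.head <- head(toy)
toy.head

# The n argument changes how many rows are shown
head(toy, n = 3)
```

Here dim(toy) reports 7 observations and 4 variables, in the same order as the 4870 and 4 above.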
1a. With the resume data, what is the first step to seeing if black-sounding names are less likely to receive a call back?
b. How do you do this step in R?
2. What does the $ sign do?
3. How do you add totals to a two-way table?
4. How would you sum up the following from the contingency table?
a. All black applicants
b. All white applicants
c. All applicants who did not receive a call
d. All applicants who did receive a call
5. By indexing, how would you see the number of black applicants who did receive a call?
6a. How would you find the call back rate (in general)?
6b. How would you find the callback rate for each race?
1. Create a two-way contingency table (cross tabulation) summarizing the relationship between the race of each fictitious job applicant and whether a callback was received.
A two-way contingency table contains the number of observations that fall within each category, defined by its corresponding row (race variable) and column (call variable).
b. Create the table and give it a name using the following syntax:
race.call.tab <- table(race = resume$race, call = resume$call)
2. It extracts a specific variable from the data frame
ex. resume$race will extract only the race (black or white) for each observation in the data set.
- 4a. sum(race.call.tab[1, ])
- b. sum(race.call.tab[2, ])
- c. sum(race.call.tab[, 1])
- d. sum(race.call.tab[, 2])
5. race.call.tab[1, 2]
6a. ## overall callback rate: total callbacks divided by the sample size
sum(race.call.tab[, 2]) / nrow(resume)
##  0.08049281
6b. ## callback rates for each race
- race.call.tab[1, 2] / sum(race.call.tab[1, ]) # black
- ##  0.06447639
- race.call.tab[2, 2] / sum(race.call.tab[2, ]) # white
- ##  0.096509
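The steps above can be sketched end to end on a small example. The toy data and its counts are hypothetical, chosen only so the arithmetic is easy to check by hand. addmargins() is the base R function that appends row and column totals to a table.

```r
# Hypothetical toy data; the counts are made up for illustration
toy <- data.frame(
  race = c("black", "black", "black", "black", "white", "white", "white", "white"),
  call = c(0, 0, 0, 1, 0, 0, 1, 1)
)

# Two-way contingency table: rows are race, columns are call (0 or 1)
race.call.tab <- table(race = toy$race, call = toy$call)
race.call.tab

# Append row and column totals
addmargins(race.call.tab)

# Overall callback rate: total callbacks divided by the sample size
overall.rate <- sum(race.call.tab[, 2]) / nrow(toy)

# Callback rate for each race; rows are in alphabetical order (black, white)
black.rate <- race.call.tab[1, 2] / sum(race.call.tab[1, ])
white.rate <- race.call.tab[2, 2] / sum(race.call.tab[2, ])
```

Indexing by name, as in race.call.tab["black", "1"], picks out the same cell as race.call.tab[1, 2] and can be easier to read.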
1. How do you create a new column in a data set? Ex. Create a 'treatment' category for the three levels from the rosca data. (3 STEPS)
2. What is the difference between a 'character' and a 'factor'?
1a. rosca$treatment <- NA
* This will create a new column (variable), and all of its values will appear as 'NA' until you assign values to them.
1b. rosca$treatment[rosca$encouragement == 1] <- "control"
rosca$treatment[rosca$safe_box == 1] <- "safebox"
rosca$treatment[rosca$locked_box == 1] <- "lockbox"
- * This tells R to assign the label "control" to all observations under the variable 'encouragement' that have a value of 1.
- * 1 means they were part of the 'encouragement only' group, which is the control group, and 0 means that they were not.
The same applies to the other groups
1c. rosca$treatment <- as.factor(rosca$treatment)
* This changes the class from "character" into a "factor".
2a. While factors look (and often behave) like character vectors, they are actually integers under the hood.
b. Factors can be ordered or unordered.
c. Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order.
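A minimal sketch of the three steps, followed by a look at what changes when a character vector becomes a factor. The toy data frame is hypothetical: it borrows the three indicator column names from the rosca example, but its values are made up.

```r
# Hypothetical toy data mimicking the three indicator columns
toy <- data.frame(
  encouragement = c(1, 0, 0, 1, 0),
  safe_box      = c(0, 1, 0, 0, 0),
  locked_box    = c(0, 0, 1, 0, 1)
)

toy$treatment <- NA                                 # step a: empty column
toy$treatment[toy$encouragement == 1] <- "control"  # step b: fill in labels
toy$treatment[toy$safe_box == 1]      <- "safebox"
toy$treatment[toy$locked_box == 1]    <- "lockbox"

class.before <- class(toy$treatment)                # character
toy$treatment <- as.factor(toy$treatment)           # step c: convert
class.after <- class(toy$treatment)                 # factor

# Levels are the pre-defined set of values, sorted alphabetically
treatment.levels <- levels(toy$treatment)

# The integer codes that the factor stores under the hood
treatment.codes <- as.integer(toy$treatment)
```

Each integer code is the position of that observation's label in the levels vector.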
1a. When is the logical operator & TRUE?
1b. When is the logical operator '&' FALSE?
2. What does | stand for?
3a. When is the logical operator '|' TRUE?
3b. When is the logical operator | FALSE?
4. Can the two be combined, and if so, how would you write that?
1a. The value of “AND” (&) is only TRUE when all of the objects have a value of TRUE
1b. The value of & is FALSE if any of the objects have a value of FALSE.
ex. TRUE & FALSE & TRUE = FALSE
2. | stands for 'OR'
3a. “OR” | is true when at least one of the objects has the value TRUE
3b. “OR” is false when none of the objects has the value TRUE
4. “AND” and “OR” can be used simultaneously, but parentheses should be used to avoid confusion
ex. (TRUE | FALSE) & FALSE
# the parentheses evaluate to TRUE
##  FALSE
TRUE | (FALSE & FALSE)
# the parentheses evaluate to FALSE
##  TRUE
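The same operators work element by element on logical vectors. The sketch below uses two made-up vectors, a and b, and also shows why parentheses matter: when they are left out, R evaluates & before |.

```r
# Two made-up logical vectors covering every TRUE/FALSE combination
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

and.result <- a & b   # TRUE only where both elements are TRUE
or.result  <- a | b   # TRUE where at least one element is TRUE

# Without parentheses, & is evaluated before |
no.parens   <- TRUE | FALSE & FALSE    # read as TRUE | (FALSE & FALSE)
with.parens <- (TRUE | FALSE) & FALSE  # parentheses force | to go first
```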
1. What do relational operators do?
2. What are the symbols for the following relational operators?
a. greater than
b. greater than or equal to
c. less than
d. less than or equal to
e. equal to
f. not equal to
3. Are the following true or false?
a. 4 > 3
b. "Hello" == "hello"
c. "Hello" != "hello"
4. What is the answer to the following?
x <- c(3, 2, 1, -2, -1)
x >= 2
5. What is the answer to the following?
x <- c(3, 2, 1, -2, -1)
(x > 0) & (x <= 2)
6. What is the answer to the following logical disjunction?
x <- c(3, 2, 1, -2, -1)
(x > 2) | (x <= -1)
1. Relational operators evaluate the relationships between two values.
- a. (>)
- b. (>=)
- c. (<)
- d. (<=)
- e. (==)
- f. (!=)
- 3a. TRUE
- b. FALSE #because R is case sensitive
- c. TRUE
4. ##  TRUE TRUE FALSE FALSE FALSE
5. ##  FALSE TRUE TRUE FALSE FALSE
6. ##  TRUE FALSE FALSE TRUE TRUE
1. In the 'resume' dataset, how do you find the mean (callback rate) among the résumés with black-sounding names by subsetting in one step?
2. How does this syntax work (in layman's terms)?
3. How would you do the same thing in 2 steps? HINT: create a separate object
4. What is VERY IMPORTANT when doing the 2-step process?
1. mean(resume$call[resume$race == "black"])
##  0.06447639
2. This command syntax subsets the call variable in the resume data frame for the observations whose values for the race variable are equal to black.
We can utilize square brackets [ ] to index the values in a vector by placing the logical value of each element into a vector of the same length within the square brackets.
* The elements whose indexing value is TRUE are extracted.
** LAYMAN'S TERMS:
resume$call[resume$race == "black"] will give you all of the values of the variable 'call' for which the value of the variable 'race' is black, that is, for which resume$race == "black" is TRUE.
mean(resume$call[resume$race == "black"]) will give you the average of the call variable (the callback rate) for only the observations whose race value is 'black'.
If you multiply the callback rate 0.06447639 by 100, you get the percentage of black applicants who received callbacks: about 6.4%.
3. ## subset blacks only
STEP 1: resumeB <- resume[resume$race == "black", ]
STEP 2: mean(resumeB$call)
# callback rate for blacks
##  0.06447639
4. Unlike in the case of indexing vectors, we use a comma to separate row and column indexes.
This comma is important and forgetting to include it will lead to an error.
* R needs to know whether the logical vector refers to rows or to columns. Placing resume$race == "black" before the comma selects rows, because we want every observation (row) whose race is black, together with all of its variables (columns).
Without the comma, R treats the logical vector as a column selector. In the resume data that vector has 4870 elements but there are only 4 columns, so R stops with an error.
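A minimal sketch comparing the one-step and two-step approaches, and showing what happens when the comma is left out. The toy data frame is hypothetical with made-up values. tryCatch() is used only so the script keeps running after the error.

```r
# Hypothetical toy data; NOT the real resume data set
toy <- data.frame(
  race = c("black", "white", "black", "white", "black"),
  sex  = c("female", "male", "male", "female", "female"),
  call = c(0, 1, 1, 0, 0)
)

# One step: index the call vector with a logical vector
one.step <- mean(toy$call[toy$race == "black"])

# Two steps: keep the matching rows (note the comma), then take the mean
toyB <- toy[toy$race == "black", ]
two.step <- mean(toyB$call)

# Omitting the comma makes R treat the logical vector as a column selector;
# it is longer than the number of columns, so R raises an error here
no.comma <- tryCatch(toy[toy$race == "black"], error = function(e) "error")
```

Both approaches give the same callback rate; the two-step version also leaves toyB available for further analysis.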
1. What are the levels in the following example:
## race of first 5 observations
resume$race[1:5]
##  white white black black white
2. Generally speaking, what are in the rows and what are in the columns of a dataset?
3. In which order do you present them when subsetting?
1. The levels are black and white.
- Rows = observations
- Columns = variables
3. [rows, columns]
rows = horizontal RH
columns = vertical CV
1. What does the subset() function do?
2. What does the select argument of subset() do?
3. Using the 'resume' dataset, give an example of how you would use these by extracting the call and firstname variables for the résumés which contain female black-sounding names?
4. How can you shorten this?
5. What is important to note in logical statements such as these?
6. How would we separately compute the racial gap in callback rate among female and male job applicants?
* Note that we do not include a select argument to specify which variables to keep. Consequently, all variables will be retained.
1. Use the subset() function to construct a data frame that contains just some of the original observations and just some of the original variables.
The subset argument takes a logical vector that indicates whether each individual row should be kept for the new data frame.
2. The select argument takes a character vector that specifies the names of variables to be retained.
3. ## keep "call" and "firstname" variables
## also keep observations with female black-sounding names
resumeBf <- subset(resume, select = c("call", "firstname"), subset = (race == "black" & sex == "female"))
4. subset(resume, subset = (race == "black" & sex == "female")) shortens to subset(resume, race == "black" & sex == "female").
Note that one could specify the data frame name to which the race and sex variables belong, i.e., subset(resume, (resume$race == "black" & resume$sex == "female")), but this is unnecessary. By default, the variable names in this argument are assumed to come from the data frame specified in the first argument (resume in this case).
So we can use simpler syntax:
subset(resume, (race == "black" & sex == "female"))
- ## alternative syntax with the same results
resumeBf <- resume[resume$race == "black" & resume$sex == "female",
c("call", "firstname")]
5. It is important to pay close attention to parentheses so that each logical statement is contained within a pair of parentheses.
- black male
- resumeBm <- subset(resume, subset = (race == "black") & (sex == "male"))
- White female
- resumeWf <- subset(resume, subset = (race == "white") & (sex == "female"))
- white male
- resumeWm <- subset(resume, subset = (race == "white") & (sex == "male"))
- among females
- mean(resumeWf$call) - mean(resumeBf$call)
##  0.03264689
- among males
- mean(resumeWm$call) - mean(resumeBm$call)
##  0.03040786
- CONCLUSION: It appears that the racial gap exists but does not vary across gender groups. For both female and male job applicants, the callback rate is higher for whites than for blacks by roughly 3 percentage points.
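The whole calculation can be sketched on a small example. The toy data frame is hypothetical and its names and values are made up, so the gaps it produces say nothing about the real study. They only show the mechanics of subset() with and without a select argument.

```r
# Hypothetical toy data; NOT the real resume data set
toy <- data.frame(
  firstname = c("Aisha", "Ebony", "Darnell", "Tyrone", "Emily", "Anne", "Greg", "Todd"),
  sex  = c("female", "female", "male", "male", "female", "female", "male", "male"),
  race = c("black", "black", "black", "black", "white", "white", "white", "white"),
  call = c(0, 1, 0, 0, 1, 1, 1, 0)
)

# With select: keep only two variables for female black-sounding names
toyBf <- subset(toy, select = c("call", "firstname"),
                subset = (race == "black" & sex == "female"))

# Without select: all variables are retained
toyBm <- subset(toy, subset = (race == "black") & (sex == "male"))
toyWf <- subset(toy, subset = (race == "white") & (sex == "female"))
toyWm <- subset(toy, subset = (race == "white") & (sex == "male"))

# Racial gap in callback rate, computed separately by gender
gap.female <- mean(toyWf$call) - mean(toyBf$call)
gap.male   <- mean(toyWm$call) - mean(toyBm$call)
```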
1. When would we use the ifelse() function?
2. Explain the three elements of an ifelse() function
3. Suppose that we want to create a new binary variable called BlackFemale in the resume data frame that equals 1 if the job applicant’s name sounds black and female, and 0 otherwise.
How would we use the ifelse() function to achieve this?
1. When we would like to perform different actions depending on whether a statement is true or false.
2. The function ifelse(X, Y, Z) contains three elements.
For each element in X that is TRUE, the corresponding element in Y is returned.
In contrast, for each element in X that is FALSE, the corresponding element in Z is returned.
ex. new_vector <- ifelse(condition, value_if_condition_true, value_if_condition_false)
3. resume$BlackFemale <- ifelse(resume$race == "black" &
resume$sex == "female", 1, 0)
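A minimal sketch of the same ifelse() call on a four-row example, which makes it easy to see which rows receive 1 and which receive 0. The toy data frame is hypothetical.

```r
# Hypothetical toy data with one row per race/sex combination
toy <- data.frame(
  race = c("black", "black", "white", "white"),
  sex  = c("female", "male", "female", "male")
)

# 1 where both conditions hold, 0 otherwise
toy$BlackFemale <- ifelse(toy$race == "black" & toy$sex == "female", 1, 0)
toy
```

Only the first row satisfies both conditions, so only that row gets a 1.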
1. What is a factor variable?
2. Using the resume dataset, how do we create a factor variable that takes one of the four values, i.e., BlackFemale, BlackMale, WhiteFemale, and WhiteMale? (3 steps)
3. How would we check the 'levels' within the new variable we've created?
4. How would you see all the observations in the new variable?
1. A factor variable is another name for a categorical variable that takes a finite number of distinct values or levels.
2a. First create a new variable, type, which is filled with missing values NA.
2b. We then specify each type using the characteristics of the applicant
ex. resume$type <- NA
resume$type[resume$race == "black" & resume$sex == "female"] <- "BlackFemale"
resume$type[resume$race == "black" & resume$sex == "male"] <- "BlackMale"
resume$type[resume$race == "white" & resume$sex == "female"] <- "WhiteFemale"
resume$type[resume$race == "white" & resume$sex == "male"] <- "WhiteMale"
2c. Because this new variable will appear as a character vector, we must change the new character variable into a factor variable by inserting the following:
resume$type <- as.factor(resume$type)
3. levels(resume$type)
##  "BlackFemale" "BlackMale" "WhiteFemale" "WhiteMale"
1. What does the tapply() function do?
2. What are the three arguments (parts) of the tapply function, and what do they do?
3. Using the resume dataset, and the factor variable resume$type that includes the four levels, "BlackFemale" "BlackMale" "WhiteFemale" "WhiteMale", find the mean for these levels using the tapply() function.
1. It applies a function repeatedly within each level of the factor variable.
Suppose, for example, we want to calculate the callback rate for each of the four categories we created in the resume dataset using the factor variable resume$type, which includes the four levels "BlackFemale", "BlackMale", "WhiteFemale", and "WhiteMale".
If we use the tapply() function, this can be done in one line rather than computing the rates one by one.
2. tapply(X, INDEX, FUN)
We use the function as in tapply(X, INDEX, FUN), which applies the function indicated by argument FUN to the object X for each of the groups defined by unique values of the vector INDEX.
3. tapply(resume$call, resume$type, mean)
* The result indicates that black males have the lowest callback rate followed by black females, white males, and white females.
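A minimal sketch of building the four-level factor and passing it to tapply(). The toy data frame is hypothetical with made-up values, so the rates below will not match the ordering found in the real data.

```r
# Hypothetical toy data; NOT the real resume data set
toy <- data.frame(
  race = c("black", "black", "black", "black", "white", "white", "white", "white"),
  sex  = c("female", "female", "male", "male", "female", "female", "male", "male"),
  call = c(0, 1, 0, 0, 1, 1, 1, 0)
)

# Build the four-level type variable, then convert it to a factor
toy$type <- NA
toy$type[toy$race == "black" & toy$sex == "female"] <- "BlackFemale"
toy$type[toy$race == "black" & toy$sex == "male"]   <- "BlackMale"
toy$type[toy$race == "white" & toy$sex == "female"] <- "WhiteFemale"
toy$type[toy$race == "white" & toy$sex == "male"]   <- "WhiteMale"
toy$type <- as.factor(toy$type)

# X = call, INDEX = type, FUN = mean: one callback rate per level
rates <- tapply(toy$call, toy$type, mean)
rates
```

The result is named by level, so a single rate can be pulled out with, for example, rates["BlackMale"].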
1. Using the resume dataset, how would we compute the callback rate for each first name?
- STEP 1: turn first name into a factor variable
resume$firstname <- as.factor(resume$firstname)
- STEP 2: compute callback rate for each first name
callback.name <- tapply(resume$call, resume$firstname, mean)
- STEP 3: sort the result into increasing order
sort(callback.name)
* As expected from the above aggregate result, we find that many typical names for black males and females have low callback rates
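The three steps can be sketched on a small example. The toy data frame is hypothetical: the names and callback values are made up and do not reflect the real results.

```r
# Hypothetical toy data; NOT the real resume data set
toy <- data.frame(
  firstname = c("Emily", "Emily", "Jamal", "Jamal", "Greg", "Greg"),
  call      = c(1, 1, 0, 0, 1, 0)
)

# Step 1: turn first name into a factor variable
toy$firstname <- as.factor(toy$firstname)

# Step 2: callback rate for each first name
callback.name <- tapply(toy$call, toy$firstname, mean)

# Step 3: sort into increasing order; names travel with the values
sorted.rates <- sort(callback.name)
sorted.rates

# decreasing = TRUE puts the highest rates first instead
sort(callback.name, decreasing = TRUE)
```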
1. What does the subset function actually do?
2. What does this allow you to do?
3. Give an example of this (rosca)
4. How would you create a subset for the rosca2 data in R using the above example?
1. It creates a separate data frame containing only the observations (and, optionally, only the variables) you are interested in.
2. It simplifies the process of analyzing data, because you can look at that particular subset instead of the larger dataset.
3. ex. The rosca experiment.
We only are interested in married females, so we can create a subset of the dataframe that gives the amount invested, age and treatment group for ONLY married females.
4. rosca2.married <- subset(rosca2, subset = (bg_married == 1) & (bg_female == 1))
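A minimal sketch of the married-females subset that also keeps only a few variables. The toy data frame is hypothetical: bg_married and bg_female come from the example above, while the age and amount column names and all values are assumptions made for illustration.

```r
# Hypothetical toy data; NOT the real rosca data
toy <- data.frame(
  bg_female  = c(1, 1, 0, 1, 0),
  bg_married = c(1, 0, 1, 1, 0),
  age        = c(30, 25, 41, 36, 52),    # hypothetical column name
  amount     = c(100, 0, 250, 80, 0),    # hypothetical column name
  treatment  = c("control", "safebox", "lockbox", "safebox", "control")
)

# Keep married females only, and only the three variables of interest
toy.married <- subset(toy,
                      subset = (bg_married == 1) & (bg_female == 1),
                      select = c("amount", "age", "treatment"))

# Equivalent bracket syntax: rows before the comma, columns after
toy.married2 <- toy[toy$bg_married == 1 & toy$bg_female == 1,
                    c("amount", "age", "treatment")]
```

Both versions return the same two rows, so the choice between them is mostly a matter of readability.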