Data Scientist

  1. You are using MADlib for Linear Regression analysis. Which value does the statement return?

    SELECT (linregr(depvar, indepvar)).r2 FROM zeta1;

    A. Goodness of fit
    B. Coefficients
    C. Standard error
    D. P-value
    A
  2. Which data asset is an example of quasi-structured data?

    A. Webserver log
    B. XML data file
    C. Database table
    D. News article
    A
  3. What would be considered "Big Data"?

    A. An OLAP Cube containing customer demographic information about 100,000,000 customers
    B. Daily Log files from a web server that receives 100,000 hits per minute
    C. Aggregated statistical data stored in a relational database table
    D. Spreadsheets containing monthly sales data for a Global 100 corporation
    B
  4. A data scientist plans to classify the sentiment polarity of 10,000 product reviews collected from
    the Internet. What is the most appropriate model to use? Suppose labeled training data is
    available.

    A. Naïve Bayesian classifier
    B. Linear regression
    C. Logistic regression
    D. K-means clustering
    A
  5. In which lifecycle stage are test and training data sets created?

    A. Model building
    B. Model planning
    C. Discovery
    D. Data preparation
    A
  6. When creating a presentation for a technical audience, what is the main objective?

    A. Show that you met the project goals
    B. Show how you met the project goals
    C. Show if the model will meet the SLA
    D. Show the technique to be used in the production environment
    B
  7. Your company has 3 different sales teams. Each team's sales manager has developed incentive
    offers to increase the size of each sales transaction. Any sales manager whose incentive program
    can be shown to increase the size of the average sales transaction will receive a bonus.

    Data are available for the number and average sale amount for transactions offering one of the
    incentives as well as transactions offering no incentive.

    The VP of Sales has asked you to determine analytically if any of the incentive programs has
    resulted in a demonstrable increase in the average sale amount. Which analytical technique would
    be appropriate in this situation?

    A. One-way ANOVA
    B. Multi-way ANOVA
    C. Student's t-test
    D. Wilcoxon Rank Sum Test
    A
  8. In data visualization, what is used to focus the audience on a key part of a chart?

    A. Emphasis colors
    B. Detailed text
    C. Pastel colors
    D. A data table
    A
  9. Which word or phrase completes the statement? Data-ink ratio is to data visualization as
    __________ .

    A. Confusion matrix is to classifier
    B. Data scientist is to big data
    C. Seasonality is to ARIMA
    D. K-means is to Naive Bayes
    A
  10. Consider a database with 4 transactions:
    Transaction 1: {cheese, bread, milk}
    Transaction 2: {soda, bread, milk}
    Transaction 3: {cheese, bread}
    Transaction 4: {cheese, soda, juice}

    You decide to run the association rules algorithm where minimum support is 50%. Which rule has
    a confidence at least 50%?

    A. {cheese} => {bread}
    B. {juice} => {cheese}
    C. {milk} => {soda}
    D. {soda} => {milk}
    A
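The support and confidence figures behind this question can be checked by hand or with a short script. The following is a minimal Python sketch, not part of the original question; the helper names `support` and `confidence` are illustrative.

```python
# Transactions from the question above.
transactions = [
    {"cheese", "bread", "milk"},
    {"soda", "bread", "milk"},
    {"cheese", "bread"},
    {"cheese", "soda", "juice"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Support of lhs and rhs together, divided by the support of lhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)

rules = [
    ({"cheese"}, {"bread"}),
    ({"juice"}, {"cheese"}),
    ({"milk"}, {"soda"}),
    ({"soda"}, {"milk"}),
]
for lhs, rhs in rules:
    print(sorted(lhs), "=>", sorted(rhs),
          "support:", support(lhs | rhs),
          "confidence:", round(confidence(lhs, rhs), 2))
```

Only {cheese} => {bread} clears the 50% support threshold (support 0.5, confidence about 0.67); the other three rules each have support 0.25.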
  11. You are using the Apriori algorithm to determine the likelihood that a person who owns a home
    has a good credit score. You have determined that the confidence for the rules used in the
    algorithm is > 75%. You calculate lift = 1.011 for the rule, "People with good credit are
    homeowners". What can you determine from the lift calculation?

    A. Support for the association is low
    B. Leverage of the rules is low
    C. The rule is coincidental
    D. The rule is true
    C
  12. Consider a database with 4 transactions:
    Transaction 1: {cheese, bread, milk}
    Transaction 2: {soda, bread, milk}
    Transaction 3: {cheese, bread}
    Transaction 4: {cheese, soda, juice}

    The minimum support is 25%. Which rule has a confidence equal to 50%?

    A. {bread,milk} => {cheese}
    B. {bread} => {milk}
    C. {juice} => {soda}
    D. {bread} => {cheese}
    A
  13. Under which circumstance do you need to implement N-fold cross-validation after creating a
    regression model?

    A. There is not enough data to create a test set.
    B. The data is unformatted.
    C. There are missing values in the data.
    D. There are categorical variables in the model.
    A
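As an illustration of how N-fold cross-validation reuses a small data set, here is a minimal Python sketch that only generates the row indices for each fold. The function name `n_fold_splits` and the interleaved fold assignment are assumptions made for this example; real projects often shuffle rows first.

```python
def n_fold_splits(n_rows, n_folds):
    """Yield (train_indices, test_indices) pairs; each row is tested exactly once."""
    indices = list(range(n_rows))
    for k in range(n_folds):
        test = indices[k::n_folds]          # every n_folds-th row, starting at k
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

for train, test in n_fold_splits(n_rows=10, n_folds=5):
    print("train:", train, "test:", test)
```

Every row serves as test data in exactly one fold and as training data in the others, so no separate test set has to be carved out.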
  14. What is an appropriate data visualization to use in a presentation for an analyst audience?

    A. Pie chart
    B. Area chart
    C. Stacked bar chart
    D. ROC curve
    D
  15. When would you use GROUP BY ROLLUP clause in your OLAP query?

    A. where all subtotals and grand totals are to be included in the output
    B. where only the subtotals are to be included in the output
    C. where only the grand totals are to be included in the output
    D. where only specific subtotals and grand totals for a combination of variables are to be included
    in the output
    A
  16. Which type of numeric value does a logistic regression model estimate?

    A. Probability
    B. A p-value
    C. Any integer
    D. Any real number
    A
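A short Python sketch can show why the estimate is a probability: the logistic function squeezes any real-valued linear score into the interval (0, 1). The scores below are hypothetical and not taken from any fitted model.

```python
import math

def logistic(z):
    """Map any real-valued linear score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear scores (intercept plus coefficients times inputs).
for z in (-4.0, 0.0, 2.5):
    print(z, round(logistic(z), 3))
```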
  17. Your colleague, who is new to Hadoop, approaches you with a question. They want to know how
    best to access their data. This colleague has a strong background in data flow languages and
    programming.

    Which query interface would you recommend?

    A. Pig
    B. Hive
    C. Howl
    D. HBase
    A
  18. The web analytics team uses Hadoop to process access logs. They now want to correlate this
    data with structured user data residing in a production single-instance JDBC database. They
    collaborate with the production team to import the data into Hadoop. Which tool should they use?

    A. Sqoop
    B. Pig
    C. Chukwa
    D. Scribe
    A
  19. What does the R code
    z <- f[1:10, ]
    do?

    A. Assigns the first 10 rows of f to the vector z
    B. Assigns the 1st 10 columns of the 1st row of f to z
    C. Assigns a sequence of values from 1 to 10 to z
    D. Assigns the 1st 10 columns to z
    A
  20. In R, functions like plot() and hist() are known as what?

    A. generic functions
    B. virtual methods
    C. virtual functions
    D. generic methods
    A
  21. Review the following code:

    SELECT pn, vn, sum(prc*qty)
    FROM sale
    GROUP BY CUBE(pn, vn)
    ORDER BY 1, 2, 3;

    Which combination of subtotals do you expect to be returned by the query?

    A. (pn,vn)
    B. ( (pn,vn),(pn) )
    C. ( (pn,vn),(pn),(vn) )
    D. ( (pn,vn),(pn),(vn),( ) )
    D
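In standard SQL, GROUP BY CUBE over a list of columns is shorthand for grouping by every subset of those columns, while ROLLUP keeps only the leading prefixes. The Python sketch below simply enumerates those grouping sets; the function names are illustrative and no database is involved.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """All subsets of the grouping columns, from the full set down to ()."""
    return [combo
            for size in range(len(columns), -1, -1)
            for combo in combinations(columns, size)]

def rollup_grouping_sets(columns):
    """Only the leading prefixes of the column list, down to ()."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]

print(cube_grouping_sets(["pn", "vn"]))    # [('pn', 'vn'), ('pn',), ('vn',), ()]
print(rollup_grouping_sets(["pn", "vn"]))  # [('pn', 'vn'), ('pn',), ()]
```

The empty tuple corresponds to the grand total row.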
  22. In MADlib what does MAD stand for?

    A. Magnetic, Agile, Deep
    B. Machine Learning, Algorithms for Databases
    C. Mathematical Algorithms for Databases
    D. Modular, Accurate, Dependable
    A
  23. The web analytics team uses Hadoop to process access logs. They now want to correlate this
    data with structured user data residing in their massively parallel database. Which tool should they
    use to export the structured data from Hadoop?

    A. Sqoop
    B. Pig
    C. Chukwa
    D. Scribe
    A
  24. When would you prefer a Naive Bayes model to a logistic regression model for classification?

    A. When you are using several categorical input variables with over 1000 possible values each.
    B. When you need to estimate the probability of an outcome, not just which class it is in.
    C. When all the input variables are numerical.
    D. When some of the input variables might be correlated.
    A
  25. Before you build an ARMA model, how can you tell if your time series is weakly stationary?

    A. There appears to be a constant variance around a constant mean.
    B. The mean of the series is close to 0.
    C. The series is normally distributed.
    D. There appears to be no apparent trend component.
    A
  26. What is an example of a null hypothesis?

    A. that a newly created model does not provide better predictions than the currently existing model
    B. that a newly created model provides a prediction of a null sample mean
    C. that a newly created model provides a prediction of a null population mean
    D. that a newly created model provides a prediction that will be well fit to the null distribution
    A
  27. You have fit a decision tree classifier using 12 input variables. The resulting tree used 7 of the 12
    variables, and is 5 levels deep. Some of the nodes contain only 3 data points. The AUC of the
    model is 0.85. What is your evaluation of this model?

    A. The tree is probably overfit. Try fitting shallower trees and using an ensemble method.
    B. The AUC is high, and the small nodes are all very pure. This is an accurate model.
    C. The tree did not split on all the input variables. You need a larger data set to get a more
    accurate model.
    D. The AUC is high, so the overall model is accurate. It is not well-calibrated, because the small
    nodes will give poor estimates of probability.
    A
  28. If your intention is to show trends over time, which chart type is the most appropriate way to depict
    the data?

    A. Line chart
    B. Bar chart
    C. Stacked bar chart
    D. Histogram
    A
  29. You are analyzing a time series and want to determine its stationarity. You also want to determine
    the order of autoregressive models.

    How are the autocorrelation functions used?

    A. ACF as an indication of stationarity, and PACF for the correlation between Xt and Xt-k not
    explained by their mutual correlation with X1 through Xk-1.
    B. PACF as an indication of stationarity, and ACF for the correlation between Xt and Xt-k not
    explained by their mutual correlation with X1 through Xk-1.
    C. ACF as an indication of stationarity, and PACF to determine the correlation of X1 through Xk-1.
    D. PACF as an indication of stationarity, and ACF to determine the correlation of X1 through Xk-1.
    A
  30. Which word or phrase completes the statement? A spreadsheet is to a data island as a centralized
    database for reporting is to a ________?

    A. Data Warehouse
    B. Data Repository
    C. Analytic Sandbox
    D. Data Mart
    A
  31. What is one modeling or descriptive statistical function in MADlib that is typically not provided in a
    standard relational database?

    A. Linear regression
    B. Expected value
    C. Variance
    D. Quantiles
    A
  32. In which phase of the data analytics lifecycle do Data Scientists spend the most time in a project?

    A. Discovery
    B. Data Preparation
    C. Model Building
    D. Communicate Results
    B
  33. You are testing two new weight-gain formulas for puppies. The test gives the results:

    Control group: 1% weight gain
    Formula A: 3% weight gain
    Formula B: 4% weight gain

    A one-way ANOVA returns a p-value = 0.027
    What can you conclude?

    A. Either Formula A or Formula B is effective at promoting weight gain.
    B. Formula B is more effective at promoting weight gain than Formula A.
    C. Formula A and Formula B are both effective at promoting weight gain.
    D. Formula A and Formula B are about equally effective at promoting weight gain.
    A
  34. Data visualization is used in the final presentation of an analytics project. For what else is this
    technique commonly used?

    A. Data exploration
    B. Descriptive statistics
    C. ETLT
    D. Model selection
    A
  35. Which functionality do regular expressions provide?

    A. text pattern matching
    B. underflow prevention
    C. increased numerical precision
    D. decreased processing complexity
    A
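To make text pattern matching concrete, here is a small Python sketch using the standard `re` module. The log lines and the pattern are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical log lines and pattern, for illustration only.
lines = [
    "GET /index.html 200",
    "POST /login 302",
    "GET /missing 404",
]
# Method, then a path without spaces, then a three-digit status code.
pattern = re.compile(r"^(GET|POST) (\S+) (\d{3})$")

for line in lines:
    match = pattern.match(line)
    if match:
        method, path, status = match.groups()
        print(method, path, status)
```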
  36. When creating a project sponsor presentation, what is the main objective?

    A. Show that you met the project goals
    B. Show how you met the project goals
    C. Show how well the model will meet the SLA (service level agreement)
    D. Clearly describe the methods and techniques used
    A
  37. The average purchase size from your online sales site is $17,200. The customer experience team
    believes a certain adjustment of the website will increase sales. A pilot study on a few hundred
    customers showed an increase in average purchase size of $1.47, with a significance level of
    p=0.1.

    The team runs a larger study, of a few thousand customers. The second study shows an
    increased average purchase size of $0.74, with a significance level of 0.03. What is your
    assessment of this study?

    A. The change in purchase size is not practically important, and the good p-value of the second
    study is probably a result of the large study size.
    B. The change in purchase size is small, but may aggregate up to a large increase in profits over
    the entire customer base.
    C. The difference in the change in purchase size between the two studies is troubling; the team
    should run another, larger study.
    D. The p-value of the second study shows a statistically significant change in purchase size. The
    new website is an improvement.
    A
  38. Which word or phrase completes the statement? Business Intelligence is to monitoring trends as
    Data Science is to ________ trends.

    A. Predicting
    B. Discarding
    C. Driving
    D. Optimizing
    A
  39. Consider a scale that has five (5) values that range from “not important” to “very important”. Which
    data classification best describes this data?

    A. Ordinal
    B. Nominal
    C. Real
    D. Ratio
    A
  40. Which key role for a successful analytic project can provide business domain expertise with a
    deep understanding of the data and key performance indicators?

    A. Business Intelligence Analyst
    B. Project Manager
    C. Project Sponsor
    D. Business User
    A
  41. On analyzing your time series data you suspect that the data represented as

    y_1, y_2, y_3, ..., y_{n-1}, y_n

    may have a trend component that is quadratic in nature. Which pattern of data will indicate that
    the trend in the time series data is quadratic in nature?

    A. (y_3 - y_2) - (y_2 - y_1) = ... = (y_n - y_{n-1}) - (y_{n-1} - y_{n-2})
    B. (y_2 - y_1) = (y_3 - y_2) = ... = (y_n - y_{n-1})
    C. ((y_2 - y_1) / y_1) * 100% = ... = ((y_n - y_{n-1}) / y_{n-1}) * 100%
    D. (y_4 - y_2) - (y_3 - y_1) = ... = (y_n - y_{n-2}) - (y_{n-1} - y_{n-3})
    A
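The pattern in the correct option says that the second differences of the series are constant. A minimal Python sketch, using a hypothetical series generated from an exact quadratic, illustrates this:

```python
def differences(series):
    """First differences: each value minus the one before it."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical series following y = 2*t**2 + 3*t + 1 for t = 1..6.
y = [2 * t**2 + 3 * t + 1 for t in range(1, 7)]

first = differences(y)
second = differences(first)
print(first)   # [9, 13, 17, 21, 25]: not constant, so the trend is not linear
print(second)  # [4, 4, 4, 4]: constant, as expected for a quadratic trend
```

With real data the second differences would only be approximately constant because of noise.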
  42. Which analytical method is considered unsupervised?

    A. K-means clustering
    B. Naïve Bayesian classifier
    C. Decision tree
    D. Linear regression
    A
  43. You have used k-means clustering to classify behavior of 100,000 customers for a retail store.
    You decide to use household income, age, gender and yearly purchase amount as measures. You
    have chosen to use 8 clusters and notice that 2 clusters only have 3 customers assigned. What
    should you do?

    A. Decrease the number of clusters
    B. Increase the number of clusters
    C. Decrease the number of measures used
    D. Identify additional measures to add to the analysis
    A
  44. What does R code nv <- v[v < 1000] do?

    A. Selects the values in vector v that are less than 1000 and assigns them to the vector nv
    B. Sets nv to TRUE or FALSE depending on whether all elements of vector v are less than 1000
    C. Removes elements of vector v less than 1000 and assigns the elements >= 1000 to nv
    D. Selects values of vector v less than 1000, modifies v, and makes a copy to nv
    A
  45. For which class of problem is MapReduce most suitable?

    A. Embarrassingly parallel
    B. Minimal result data
    C. Simple marginalization tasks
    D. Non-overlapping queries
    A
  46. Which activity is performed in the Operationalize phase of the Data Analytics Lifecycle?

    A. Define the process to maintain the model
    B. Try different analytical techniques
    C. Try different variables
    D. Transform existing variables
    A
  47. Since R factors are categorical variables, they are most closely related to which data classification
    level?

    A. nominal
    B. ordinal
    C. interval
    D. ratio
    A
  48. In which phase of the analytic lifecycle would you expect to spend most of the project time?

    A. Discovery
    B. Data preparation
    C. Communicate Results
    D. Operationalize
    B
  49. You are building a logistic regression model to predict whether a tax filer will be audited within the
    next two years. Your training set population is 1000 filers. The audit rate in your training data is
    4.2%. What is the sum of the probabilities that the model assigns to all the filers in your training set
    that have been audited?

    A. 42.0
    B. 4.2
    C. 0.42
    D. 0.042
    A
  50. You are asked to write a report on how specific variables impact your client’s sales using a data
    set provided to you by the client. The data includes 15 variables that the client views as directly
    related to sales, and you are restricted to these variables only.

    After a preliminary analysis of the data, the following findings were made:

    1. Multicollinearity is not an issue among the variables
    2. Only three variables—A, B, and C—have significant correlation with sales

    You build a linear regression model on the dependent variable of sales with the independent
    variables of A, B, and C. The results of the regression are seen in the exhibit.

    You cannot request additional data. What is a way that you could try to increase the R2 of the
    model without artificially inflating it?

    A. Create clusters based on the data and use them as model inputs
    B. Force all 15 variables into the model as independent variables
    C. Create interaction variables based only on variables A, B, and C
    D. Break variables A, B, and C into their own univariate models
    A
  51. You have two tables of customers in your database. Customers in cust_table_1 were sent an
    e-mail promotion last year, and customers in cust_table_2 received a newsletter last year.
    Customers can only be entered in once per table. You want to create a table that includes all
    customers, and any of the communications they received last year. Which type of join would you
    use for this table?

    A. Full outer join
    B. Inner join
    C. Left outer join
    D. Cross join
    A
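The effect of a full outer join can be sketched in Python with two dictionaries keyed by customer id. The table contents below are hypothetical, and `full_outer_join` is an illustrative helper, not a database API.

```python
# Hypothetical contents of the two tables: customer id -> communication sent.
cust_table_1 = {101: "e-mail promotion", 102: "e-mail promotion"}
cust_table_2 = {102: "newsletter", 103: "newsletter"}

def full_outer_join(left, right):
    """Keep every key from either side; None marks a missing match."""
    return {key: (left.get(key), right.get(key))
            for key in sorted(left.keys() | right.keys())}

joined = full_outer_join(cust_table_1, cust_table_2)
for cust_id, (promo, newsletter) in joined.items():
    print(cust_id, promo, newsletter)
```

Customer 101 appears only in the first table and customer 103 only in the second, yet both survive the join. An inner join would keep only customer 102.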
  52. In which lifecycle stage are initial hypotheses formed?

    A. Discovery
    B. Model planning
    C. Model building
    D. Data preparation
    A
  53. You are given 10,000,000 user profile pages of an online dating site in XML files, and they are
    stored in HDFS. You are assigned to divide the users into groups based on the content of their
    profiles. You have been instructed to try K-means clustering on this data. How should you
    proceed?

    A. Run MapReduce to transform the data, and find relevant key value pairs.
    B. Divide the data into sets of 1,000 user profiles, and run K-means clustering in RHadoop
    iteratively.
    C. Run a Naive Bayes classification as a pre-processing step in HDFS.
    D. Partition the data by XML file size, and run K-means clustering in each partition.
    A
  54. The Marketing department of your company wishes to track opinion on a new product that was
    recently introduced. Marketing would like to know how many positive and negative reviews are
    appearing over a given period and potentially retrieve each review for more in-depth insight.

    They have identified several popular product review blogs that historically have published
    thousands of user reviews of your company’s products.

    You have been asked to provide the desired analysis. You examine the RSS feeds for each blog
    and determine which fields are relevant. You then craft a regular expression to match your new
    product’s name and extract the relevant text from each matching review.

    What is the next step you should take?

    A. Convert the extracted text into a suitable document representation and index into a review
    corpus
    B. Use the extracted text and your regular expression to perform a sentiment analysis based on
    mentions of the new product
    C. Read the extracted text for each review and manually tabulate the results
    D. Group the reviews using Naïve Bayesian classification
    A
  55. Which word or phrase completes the statement? A Data Scientist would consider that a RDBMS is
    to a Table as R is to a ______________ .

    A. Data frame
    B. List
    C. Matrix
    D. Array
    A
  56. Which word or phrase completes the statement? Unix is to bash as Hadoop is to:

    A. Pig
    B. HDFS
    C. Sqoop
    D. NameNode
    A
  57. A call center for a large electronics company handles an average of 35,000 support calls a day.
    The head of the call center would like to optimize the staffing of the call center during the rollout of
    a new product due to recent customer complaints of long wait times. You have been asked to
    create a model to optimize call center costs and customer wait times.

    The goals for this project include:
    1. Relative to the release of a product, how does the call volume change over time?
    2. How to best optimize staffing based on the call volume for the newly released product, relative
    to old products.
    3. Historically, what time of day does the call center need to be most heavily staffed?
    4. Determine the frequency of calls by both product type and customer language.

    Which goals are suitable to be completed with MapReduce?

    A. Goals 2 and 4
    B. Goals 1 and 3
    C. Goals 1, 2, 3, 4
    D. Goals 2, 3, 4
    A
  58. Consider the example of an analysis for fraud detection on credit card usage. You will need to
    ensure higher-risk transactions that may indicate fraudulent credit card activity are retained in your
    data for analysis, and not dropped as outliers during pre-processing. What will be your approach
    for loading data into the analytical sandbox for this analysis?

    A. ELT
    B. ETL
    C. EDW
    D. OLTP
    A
  59. Trend, seasonal, and cyclical are components of a time series. What is another component?

    A. Irregular
    B. Linear
    C. Quadratic
    D. Exponential
    A
  60. You are studying the behavior of a population, and you are provided with multidimensional data at
    the individual level. You have identified four specific individuals who are valuable to your study,
    and would like to find all users who are most similar to each individual. Which algorithm is the
    most appropriate for this study?

    A. K-means clustering
    B. Linear regression
    C. Association rules 
    D. Decision Trees
    A
  61. Which R data structure allows elements to have different data types?

    A. List
    B. Vector
    C. Matrix
    D. Array
    A
  62. Which key role for a successful analytic project can consult and advise the project team on the
    value of end results and how these will be used on a day-to-day basis?

    A. Business User
    B. Project Manager
    C. Data Scientist
    D. Business Intelligence Analyst
    A
  63. A disk drive manufacturer has a defect rate of less than 1.0% with 98% confidence. A quality
    assurance team samples 1000 disk drives and finds 14 defective units. Which action should the
    team recommend?

    A. The manufacturing process should be inspected for problems.
    B. A larger sample size should be taken to determine if the plant is functioning properly
    C. A smaller sample size should be taken to determine if the plant is functioning properly
    D. The manufacturing process is functioning properly and no further action is required.
    A
  64. What is required in a presentation for project sponsors?

    A. The "Big Picture" takeaways for executive level stakeholders
    B. Data warehouse design changes
    C. Line by line review of the developed code
    D. Detailed statistical basis for the modeling approach used in the project
    A
  65. A data scientist wants to predict the probability of death from heart disease based on three risk
    factors: age, gender, and blood cholesterol level.

    What is the most appropriate method for this project?

    A. Logistic regression
    B. Linear regression
    C. K-means clustering
    D. Apriori algorithm
    A
  66. What are the characteristics of Big Data?

    A. Data volume, processing complexity, and data structure variety.
    B. Data volume, business importance, and data structure variety.
    C. Data type, processing complexity, and data structure variety.
    D. Data volume, processing complexity, and business importance.
    A
  67. You are analyzing data in order to build a classifier model. You discover non-linear data and
    discontinuities that will affect the model. Which analytical method would you recommend?

    A. Decision Trees
    B. Logistic Regression
    C. ARIMA
    D. Linear Regression
    A
  68. What is an appropriate data visualization to use in a presentation for a project sponsor?

    A. Bar chart
    B. Pie chart
    C. Box and Whisker plot
    D. Density plot
    A
  69. In a Student's t-test, what is the meaning of the p-value?

    A. it is the area under the appropriate tails of the Student's distribution
    B. it is the "power" of the Student's t-test
    C. it is the mean of the distribution for the null hypothesis
    D. it is the mean of the distribution for the alternate hypothesis
    A
  70. In addition to less data movement and the ability to use larger datasets in calculations, what is a
    benefit of analytical calculations in a database?

    A. quicker time to insight
    B. more efficient handling of categorical values
    C. improved connections between disparate data sources
    D. full use of data aggregation functionality
    A
  71. You have been assigned to do a study of the daily revenue effect of a pricing model of online
    transactions. When have you completed the analytics lifecycle?

    A. You have written documentation, and the code has been handed off to the Database
    Administrator and business operations.
    B. You have a completely developed model, and the results have shown statistically acceptable
    results.
    C. You have presented the results of the model to both the internal analytics team and the
    business owner of the project.
    D. You have a completely developed model based on both a sample of the data and the entire set
    of data available.
    A
  72. Consider these itemsets:

    (hat, scarf, coat)
    (hat, scarf, coat, gloves)
    (hat, scarf, gloves)
    (hat, gloves)
    (scarf, coat, gloves)

    What is the confidence of the rule (gloves -> hat)?

    A. 75%
    B. 60%
    C. 66%
    D. 80%
    A
  73. What is holdout data?

    A. a subset of the provided data set selected at random and used to validate the model
    B. a subset of the provided data set selected at random and used to initially construct the model
    C. a subset of the provided data set that is removed by the data scientist because it contains data
    errors
    D. a subset of the provided data set that is removed by the data scientist because it contains
    outliers
    A
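A holdout split can be sketched in a few lines of Python. The 20% holdout fraction, the fixed seed, and the helper name `holdout_split` are assumptions made for this example.

```python
import random

def holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Randomly set aside a fraction of rows for validating the model."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = rows[:]                 # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]   # (training, holdout)

training, holdout = holdout_split(list(range(100)))
print(len(training), len(holdout))  # 80 20
```

The model is fit on `training` only; `holdout` is used afterwards to check how well it generalizes.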
  74. Which characteristic applies mainly to Data Science as opposed to Business Intelligence?

    A. Advanced analytical methods
    B. Robust reporting
    C. Focus on structured data
    D. Data dashboards
    A
  75. Which word or phrase completes the statement?
    Theater actor is to "Artistic and Expressive" as Data Scientist is to ________________

    A. "Communicative and Collaborative"
    B. "Introverted and Technical"
    C. "Logical and Steadfast"
    D. "Independent and Intelligent"
    A
  76. Which process in text analysis can be used to reduce dimensionality?

    A. Stemming
    B. Parsing
    C. Digitizing
    D. Sorting
    A
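To see how stemming shrinks the vocabulary, and therefore the number of dimensions in a bag-of-words representation, consider this Python sketch. The `toy_stem` function is a deliberately crude suffix stripper written for this example; it is not the Porter stemmer or any library routine.

```python
def toy_stem(word):
    """Strip a few common English suffixes. Crude on purpose; not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["connect", "connected", "connecting", "connects", "review", "reviews"]
stems = [toy_stem(t) for t in tokens]
print(len(set(tokens)), "distinct tokens ->", len(set(stems)), "distinct stems")
```

Six distinct tokens collapse to two stems, so a term-document matrix built on stems needs two columns instead of six.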
  77. What is the format of the output from the Map function of MapReduce?

    A. Key-value pairs
    B. Binary representation of keys concatenated with structured data
    C. Compressed index
    D. Unique key record and separate records of all possible values
    A
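The key-value shape of Map output can be illustrated with a single-process word count in Python. This sketch mimics the paradigm only; it does not use Hadoop or any of its APIs, and the input lines are hypothetical.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit one (key, value) pair per word."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(pairs):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["big data", "Big clusters"]           # hypothetical input records
mapped = [pair for line in lines for pair in map_words(line)]
print(mapped)                 # [('big', 1), ('data', 1), ('big', 1), ('clusters', 1)]
print(reduce_counts(mapped))  # {'big': 2, 'data': 1, 'clusters': 1}
```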
  78. Which data type value is used for the observed response variable in a logistic regression model?

    A. Any positive real number
    B. Any integer
    C. A binary value
    D. Any real number
    C
  79. A data scientist is given an R data frame, “empdata”, with the columns Age, Salary, Occupation,
    Education, and Gender. The data scientist would like to examine only the Salary and Occupation
    columns for ages greater than 40. Which command extracts the appropriate rows and columns
    from the data frame?

    A. empdata[empdata$Age > 40,c("Salary","Occupation")]
    B. empdata[c("Salary","Occupation"),empdata$Age > 40]
    C. empdata[Age > 40,("Salary","Occupation")]
    D. empdata[,c("Salary","Occupation")]$Age > 40
    A
  80. What is required in a presentation for business analysts?

    A. Budgetary considerations and requests
    B. Operational process changes
    C. Detailed statistical explanation of the applicable modeling theory
    D. The presentation author's credentials
    B
  81. What is LOESS used for?

    A. It fits a smoothed curve to scatterplot data, to give a general sense of the data's behavior.
    B. It is a significance test for the correlation between two variables.
    C. It plots a continuous variable versus a discrete variable, to compare distributions across classes.
    D. It is run after a one-way ANOVA,to determine which population has the highest mean value.
    A
  82. Which word or phrase completes the statement? Mahout is to Hadoop as MADlib is to
    ____________ .

    A. PostgreSQL
    B. R
    C. Excel
    D. SAS
    A
  83. In linear regression modeling, which action can be taken to improve the linearity of the relationship
    between the dependent and independent variables?

    A. Apply a transformation to a variable
    B. Use a different statistical package
    C. Calculate the R-Squared value
    D. Change the units of measurement on the independent variable
    A
  84. Data visualization is used in the final presentation of an analytics project. For what else is this
    technique commonly used?

    A. Assessing data quality
    B. Descriptive statistics
    C. ETLT
    D. Model selection
    A
  85. You have been assigned to do a study of the daily revenue effect of a pricing model of online
    transactions. All the data currently available to you has been loaded into your analytics database;
    revenue data, pricing data, and online transaction data. You find that all the data comes in
    different levels of granularity. The transaction data has timestamps (day, hour, minutes, seconds),
    pricing is stored at the daily level, and revenue data is only reported monthly. What is your next
    step?

    A. Report back to the business owner that the current data model does not support the business
    question.
    B. Interpolate a daily model for revenue from the monthly revenue data.
    C. Aggregate all data to the monthly level in order to create a monthly revenue model.
    D. Disregard revenue as a driver in the pricing model, and create a daily model based on pricing
    and transactions only.
    A
  86. Which SQL OLAP extension provides all possible grouping combinations?

    A. CUBE
    B. ROLLUP
    C. UNION ALL
    D. CROSS JOIN
    A
  87. What is the primary bottleneck in text classification?

    A. The availability of tagged training data.
    B. The ability to parse unstructured text data.
    C. The high dimensionality of text data.
    D. The fact that text corpora are dynamic.
    A
  88. Which characteristic applies only to Business Intelligence as opposed to Data Science?

    A. Uses only structured data
    B. Supports solving “what if” scenarios
    C. Uses large data sets
    D. Uses predictive modeling techniques
    A
  89. You have been assigned to run a linear regression model for each of 5,000 distinct districts, and
    all the data is currently stored in a PostgreSQL database. Which tool/library would you use to
    produce these models with the least effort?

    A. MADlib
    B. Mahout
    C. R
    D. HBase
    A
  90. Your customer provided you with 2,000 unlabeled records and asked you to separate them into
    three groups. What is the correct analytical method to use?

    A. K-means clustering
    B. Linear regression
    C. Naive Bayesian classification
    D. Logistic regression
    A
  91. You are performing a market basket analysis using the Apriori algorithm. Which measure is a ratio
    describing how many more times two items occur together than would be expected if
    those two items were statistically independent?

    A. Lift
    B. Leverage
    C. Support
    D. Confidence
    A
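A minimal sketch of the lift measure from question 91, using the usual definition lift(X -> Y) = support(X and Y) / (support(X) * support(Y)). The transactions below are made up for illustration and are not part of the card set:

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def lift(transactions, x, y):
    """Observed co-occurrence divided by the co-occurrence expected under independence."""
    joint = support(transactions, set(x) | set(y))
    return joint / (support(transactions, x) * support(transactions, y))


# Hypothetical baskets, for illustration only.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]

bread_butter_lift = lift(transactions, {"bread"}, {"butter"})
print(round(bread_butter_lift, 3))  # 0.5 / (0.75 * 0.5) = 1.333
```

A lift above 1 means the items appear together more often than independence would predict; a lift of exactly 1 means no association.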
  92. In which lifecycle stage are appropriate analytical techniques determined?

    A. Model planning
    B. Model building
    C. Data preparation
    D. Discovery
    A
  93. What is Hadoop?

    A. Java classes for HDFS types and MapReduce job management and HDFS
    B. Java classes for HDFS types and MapReduce job management and the MapReduce paradigm
    C. MapReduce paradigm and HDFS
    D. MapReduce paradigm and massive unstructured data storage on commodity hardware
    A
  94. You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient
    Sex, Height, Weight, Age and Income as measures and have used 3 clusters. When you create a
    pair-wise plot of the clusters, you notice that there is significant overlap between the clusters.
    What should you do?

    A. Identify additional measures to add to the analysis
    B. Remove one of the measures
    C. Decrease the number of clusters
    D. Increase the number of clusters
    C
  95. How does Pig’s use of a schema differ from that of a traditional RDBMS?

    A. Pig's schema is optional
    B. Pig's schema requires that the data is physically present when the schema is defined
    C. Pig's schema is required for ETL
    D. Pig's schema supports a single data type
    A
  96. You are provided four different datasets. Initial analysis of these datasets shows that they have
    identical mean, variance and correlation values. What should your next step in the analysis be?

    A. Visualize the data to further explore the characteristics of each data set
    B. Select one of the four datasets and begin planning and building a model
    C. Combine the data from all four of the datasets and begin planning and building a model
    D. Recalculate the descriptive statistics since they are unlikely to be identical for each dataset
    A
  97. You are asked to create a model to predict the total number of monthly subscribers for a specific
    magazine. You are provided with 1 year's worth of subscription and payment data, user
    demographic data, and 10 years worth of content of the magazine (articles and pictures). Which
    algorithm is the most appropriate for building a predictive model for subscribers?

    A. Linear regression
    B. Logistic regression
    C. Decision trees
    D. TF-IDF
    A
  98. Which word or phrase completes the statement? Structured data is to OLAP data as
    quasi-structured data is to ____

    A. Clickstream data
    B. XML data
    C. Text documents
    D. Image files
    A
  99. What describes a true property of Logistic Regression method?

    A. It is robust with redundant variables and correlated variables.
    B. It handles missing values well.
    C. It works well with discrete variables that have many distinct values.
    D. It works well with variables that affect the outcome in a discontinuous way.
    A
  100. You have been assigned to do a study of the daily revenue effect of a pricing model of online
    transactions. You have tested all the theoretical models in the previous model planning stage, and
    all tests have yielded statistically insignificant results. What is your next step?

    A. Report that the results are insignificant, and reevaluate the original business question.
    B. Run all the models again against a larger sample, leveraging more historical data.
    C. Move forward on the model with the highest significance scores relative to the others.
    D. Modify samples used by the models and iterate until a significant result occurs.
    A
  101. A data scientist is asked to implement an article recommendation feature for an on-line magazine.
    The magazine does not want to use client tracking technologies such as cookies or reading
    history. Therefore, only the style and subject matter of the current article is available for making
    recommendations. All of the magazine's articles are stored in a database in a format suitable for
    analytics.

    Which method should the data scientist try first?

    A. K Means Clustering
    B. Naive Bayesian
    C. Logistic Regression
    D. Association Rules
    A
  102. How are window functions different from regular aggregate functions?

    A. Rows retain their separate identities and the window function can access more than the current
    row.
    B. Rows are grouped into an output row and the window function can access more than the
    current row.
    C. Rows retain their separate identities and the window function can only access the current row.
    D. Rows are grouped into an output row and the window function can only access the current row.
    A
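The contrast in question 102 can be sketched with Python's built-in sqlite3 module. This assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added; the `sales` table and its values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30)],  # hypothetical daily sales
)

# Regular aggregate: the three input rows collapse into one output row.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchall()

# Window function: each row keeps its identity, and the OVER clause lets
# SUM() read the earlier rows in the window as well as the current one.
running = conn.execute(
    "SELECT day, amount, SUM(amount) OVER (ORDER BY day) AS running_total "
    "FROM sales ORDER BY day"
).fetchall()

print(total)    # [(60,)]
print(running)  # [(1, 10, 10), (2, 20, 30), (3, 30, 60)]
conn.close()
```

The running total is the same pattern as a "total sales to date" column: one value per row, computed from more than the current row.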
  103. Consider these itemsets:
    (hat, scarf, coat)
    (hat, scarf, coat, gloves)
    (hat, scarf, gloves)
    (hat, gloves)
    (scarf, coat, gloves)

    What is the confidence of the rule (hat, scarf) -> gloves? 

    A. 66%
    B. 40%
    C. 50%
    D. 60%
    A
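The arithmetic behind question 103 can be checked with a short sketch. Confidence of X -> Y is taken here as (number of itemsets containing X and Y) / (number of itemsets containing X); the helper name is illustrative:

```python
# The five itemsets listed in question 103.
transactions = [
    {"hat", "scarf", "coat"},
    {"hat", "scarf", "coat", "gloves"},
    {"hat", "scarf", "gloves"},
    {"hat", "gloves"},
    {"scarf", "coat", "gloves"},
]


def confidence(transactions, antecedent, consequent):
    """Share of the itemsets containing the antecedent that also contain the consequent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_both = sum(both <= t for t in transactions)
    return n_both / n_antecedent


rule_confidence = confidence(transactions, {"hat", "scarf"}, {"gloves"})
print(f"{rule_confidence:.1%}")  # 2 of 3 itemsets -> 66.7%
```

Three itemsets contain both hat and scarf, and two of those also contain gloves, so the confidence is 2/3, listed in the options as 66%.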
  104. In the MapReduce framework, what is the purpose of the Map Function?

    A. It processes the input and generates key-value pairs
    B. It collects the output of the Reduce function
    C. It sorts the results of the Reduce function
    D. It breaks the input into smaller components and distributes to other nodes in the cluster
    A
  105. You have completed your model and are handing it off to be deployed in production. What should
    you deliver to the production team, along with your commented code?

    A. The production team needs to understand how your model will interact with the processes they
    already support. Give them documentation on expected model inputs and outputs, and guidance on error-handling.
    B. The production team are technical, and they need to understand how the processes that they support work, so give them the same presentation that you prepared for the analysts.
    C. The production team supports the processes that run the organization, and they need context to understand how your model interacts with the processes they already support. Give them the same presentation that you prepared for the project sponsor.
    D. The production team supports the processes that run the organization, and they need context to understand how your model interacts with the processes they already support. Give them the executive summary.
    A
  106. While having a discussion with your colleague, this person mentions that they want to perform
    K-means clustering on text file data stored in HDFS.

    Which tool would you recommend to this colleague?

    A. Mahout
    B. HBase
    C. Scribe
    D. Sqoop
    A
  107. Which method is used to solve for the coefficients b0, b1, ..., bn in your linear regression model:
    Y = b0 + b1x1 + b2x2 + ... + bnxn

    A. Ordinary Least squares
    B. Apriori Algorithm
    C. Ridge and Lasso
    D. Integer programming
    A
  108. What describes a true limitation of Logistic Regression method?

    A. It does not handle missing values well.
    B. It does not handle redundant variables well.
    C. It does not handle correlated variables well.
    D. It does not have explanatory values.
    A
  109. You submit a MapReduce job to a Hadoop cluster and notice that although the job was
    successfully submitted, it is not completing. What should you do?

    A. Ensure that the TaskTracker is running.
    B. Ensure that the JobTracker is running
    C. Ensure that the NameNode is running
    D. Ensure that a DataNode is running
    A
  110. A disk drive manufacturer has a defect rate of less than 1.5% with 98% confidence. A quality
    assurance team samples 1000 disk drives and finds 14 defective units. Which action should the
    team recommend?

    A. The manufacturing process is functioning properly and no further action is required
    B. A larger sample size should be taken to determine if the plant is operating correctly
    C. A smaller sample size should be taken to determine if the plant is operating correctly
    D. There is a flaw in the quality assurance process and the sample should be repeated
    A
  111. What is a core deliverable at the end of the analytic project?

    A. An implemented database design
    B. A whitepaper describing the project and the implementation
    C. A presentation for project sponsors
    D. The training materials
    C
  112. You have been assigned to run a logistic regression model for each of 100 countries, and all the
    data is currently stored in a PostgreSQL database. Which tool/library would you use to produce
    these models with the least effort?

    A. MADlib
    B. Mahout
    C. RStudio
    D. HBase
    A
  113. Your organization has a website where visitors randomly receive one of two coupons. It is also
    possible that visitors to the website will not receive a coupon. You have been asked to determine if
    offering a coupon to visitors to your website has any impact on their purchase decision.

    Which analysis method should you use?

    A. K-means clustering
    B. Association rules
    C. Student T-test
    D. One-way ANOVA
    D
  114. Imagine you are trying to hire a Data Scientist for your team. In addition to technical ability and
    quantitative background, which additional essential trait would you look for in people applying for
    this position?

    A. Communication skill
    B. Scientific background
    C. Domain expertise
    D. Well Organized
    A
  115. What describes the use of UNION clause in a SQL statement?

    A. Operates on queries and potentially increases the number of rows
    B. Operates on queries and potentially decreases the number of rows
    C. Operates on tables and potentially decreases the number of columns
    D. Operates on both tables and queries and potentially increases both the number of rows and
    columns
    A
  116. You have run the association rules algorithm on your data set, and the two rules {banana, apple}
    => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be
    true?

    A. {grape,apple,orange} must be a frequent itemset.
    B. {banana,apple,grape,orange} must be a frequent itemset.
    C. {grape} => {banana,apple} must be a relevant rule.
    D. {banana,apple} => {orange} must be a relevant rule.
    A
  117. When would you use a Wilcoxon Rank Sum test?

    A. When you cannot make an assumption about the distribution of the populations
    B. When the data can easily be sorted
    C. When the populations represent the sums of other values
    D. When the data cannot easily be sorted
    A
  118. In the MapReduce framework, what is the purpose of the Reduce function?

    A. It aggregates the results of the Map function and generates processed output
    B. It distributes the input to multiple nodes for processing
    C. It writes the output of the Map function to storage
    D. It breaks the input into smaller components and distributes to other nodes in the cluster
    A
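The roles of the Map and Reduce functions (questions 104 and 118) can be sketched in plain Python with a word count. This is a single-process toy, not the Hadoop API; the function names and input lines are hypothetical:

```python
from collections import defaultdict


def map_fn(line):
    """Map: process one input record and emit (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key; the framework does this between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Reduce: aggregate all values emitted for one key into the final output."""
    return key, sum(values)


lines = ["big data big models", "data wins"]  # hypothetical input records
mapped = [pair for line in lines for pair in map_fn(line)]
counts = dict(reduce_fn(key, values) for key, values in shuffle(mapped).items())

print(counts)  # {'big': 2, 'data': 2, 'models': 1, 'wins': 1}
```

In a real cluster the framework, not the Map function, splits the input and distributes it across nodes, which is why the "breaks the input into smaller components" options are distractors.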
  119. Which of the following is an example of quasi-structured data?

    A. OLAP
    B. OLTP
    C. Customer record table
    D. Clickstream data
    D
  120. A Data Scientist is assigned to build a model from a reporting data warehouse. The warehouse
    contains data collected from many sources and transformed through a complex, multi-stage ETL
    process. What is a concern the data scientist should have about the data?

    A. It is too processed
    B. It is not structured
    C. It is not normalized
    D. It is too centralized
    A
  121. Which word or phrase completes the statement? Emphasis color is to standard color as _______ .

    A. Main message is to context
    B. Main message is to key findings
    C. Frequent item set is to item
    D. Pie chart is to proportions
    A
  122. Which data asset is an example of semi-structured data?

    A. XML data file
    B. Database table
    C. Webserver log
    D. News article
    A
  123. Your colleague, who is new to Hadoop, approaches you with a question. They want to know how
    best to access their data. This colleague has previously worked extensively with SQL and
    databases.

    Which query interface would you recommend?

    A. Hive
    B. Pig
    C. Howl
    D. HBase
    A
  124. In linear regression, what indicates that an estimated coefficient is significantly different than zero?

    A. A small p-value
    B. R-squared near 1
    C. R-squared near 0
    D. The estimated coefficient is greater than 3
    A
  125. Which graphical representation shows the distribution and multiple summary statistics of a
    continuous variable for each value of a corresponding discrete variable?

    A. box and whisker plot
    B. dotplot
    C. scatterplot
    D. binplot
    A
  126. Assume that you have a data frame in R. Which function would you use to display descriptive
    statistics about this variable?

    A. summary
    B. str
    C. attributes
    D. levels
    A
  127. Which clause is mandatory when using window functions?

    A. OVER
    B. RANK
    C. PARTITION BY
    D. RANK BY
    A
  128. What is the purpose of the process step "parsing" in text analysis?

    A. imposes a structure on the unstructured/semi-structured text for downstream analysis
    B. performs the search and/or retrieval in finding a specific topic or an entity in a document
    C. executes the clustering and classification to organize the contents
    D. computes the TF-IDF values for all keywords and indices
    A
  129. Which word or phrase completes the statement? A data warehouse is to a centralized database
    for reporting as an analytic sandbox is to a _______?

    A. Collection of data assets for modeling
    B. Collection of low-volume databases
    C. Centralized database of KPIs
    D. Collection of data assets for ETL
    A
  130. You do a Student’s t-test to compare the average test scores of sample groups from populations A
    and B. Group A averaged 10 points higher than group B. You find that this difference is significant,
    with a p-value of 0.03. What does that mean?

    A. There is a 3% chance that you have identified a difference between the populations when in
    reality there is none.
    B. The difference in scores between a sample from population A and a sample from population B
    will tend to be within 3% of 10 points.
    C. There is a 3% chance that a sample group from population A will score 10 points higher than a
    sample group from population B.
    D. There is a 97% chance that a sample group from population A will score 10 points higher than a
    sample group from population B.
    A
  131. Which word or phrase completes the statement?

    Business Intelligence is to ad-hoc reporting and dashboards as Data Science is to
    ______________ .

    A. Optimization and Predictive Modeling
    B. Alerts and Queries
    C. Structured Data and Data Sources
    D. Sales and profit reporting
    A
  132. What is a property of window functions in SQL commands?

    A. They can be used to calculate moving averages over various intervals.
    B. They group rows into a single output row.
    C. They can be used between the keywords FROM and WHERE in a SELECT command.
    D. They don't require ordering of data within a window.
    A
  133. You are attempting to find the Euclidean distance between two centroids:

    Centroid A's coordinates: (X = 2, Y = 4)
    Centroid B's coordinates: (X = 8, Y = 10)

    Which formula finds the correct Euclidean distance?

    A. SQRT((2-8)^2 + (4-10)^2) or 8.49
    B. SQRT(((2-8) x 2) + ((4-10) x 2)) or 12.17
    C. ((2-8)^2 + (4-10)^2) or 72
    D. ((2-8) x 2 + (4-10) x 2) or 148
    A
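A quick check of the calculation in question 133, using the centroid coordinates given in the question (the variable names are illustrative):

```python
import math

centroid_a = (2, 4)
centroid_b = (8, 10)

# Euclidean distance: square root of the sum of squared coordinate differences.
squared_sum = sum((a - b) ** 2 for a, b in zip(centroid_a, centroid_b))
distance = math.sqrt(squared_sum)

print(squared_sum)         # 72
print(round(distance, 2))  # 8.49
```

Leaving out the square root gives 72, the squared distance, which is the value quoted in option C.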
  134. In data visualization, which type of chart is recommended to represent frequency data?

    A. Line chart
    B. Histogram
    C. Q-Q chart
    D. Scatterplot
    B
  135. Which activity might be performed in the Operationalize phase of the Data Analytics Lifecycle?

    A. Run a pilot
    B. Try different analytical techniques
    C. Try different variables
    D. Transform existing variables
    A
  136. Image Upload 4

    You are asked to write a report on how specific variables impact your client’s sales using a data
    set provided to you by the client. The data includes 15 variables that the client views as directly
    related to sales, and you are restricted to these variables only.

    After a preliminary analysis of the data, the following findings were made:

    1. Multicollinearity is not an issue among the variables
    2. Only three variables—A, B, and C—have significant correlation with sales

    You build a linear regression model on the dependent variable of sales with the independent
    variables of A, B, and C. The results of the regression are seen in the exhibit.

    Which interpretation is supported by the analysis?

    A. Variables A, B, and C are significantly impacting sales, but are not effectively estimating sales
    B. Variables A, B, and C are significantly impacting sales and are effectively estimating sales
    C. Due to the R-squared of 0.10, the model is not valid; the linear regression should be re-run with all 15
    variables forced into the model to increase the R-squared
    D. Due to the R-squared of 0.10, the model is not valid; a different analytical model should be attempted
    A
  137. Image Upload 6

    In the exhibit, what is the chart's primary flaw for effective visualization?

    A. The use of 3 dimensions.
    B. The slanting of axis labels.
    C. The location of the legend.
    D. The order of the columns.
    A
  138. Image Upload 8

    You have plotted the distribution of savings account sizes for your bank. How would you proceed,
    based on this distribution?

    A. The data is extremely skewed. Replot the data on a logarithmic scale to get a better sense of it.
    B. The data is extremely skewed, but looks bimodal; replot the data in the range 2,500-10,000 to
    be sure.
    C. The accounts of size greater than 2,500 are rare, and probably outliers. Eliminate them from
    your future analysis.
    D. The data is extremely skewed. Split your analysis into two cohorts: accounts less than
    2,500, and accounts greater than 2,500
    A
  139. Image Upload 10

    In the exhibit, a correlogram is provided based on an autocorrelation analysis of a sample dataset.

    What can you conclude based only on this exhibit?

    A. There appears to be no structure left to model in the data
    B. There appears to be a seasonal component in the data
    C. Lag 1 has a significant autocorrelation
    D. There appears to be a cyclical component in the data
    A
  140. Image Upload 12
    In the exhibit, the x-axis represents the derived probability of a borrower defaulting on a loan. Also
    in the exhibit, the pink represents borrowers that are known to have not defaulted on their loan,
    and the blue represents borrowers that are known to have defaulted on their loan.

    Which analytical method could produce the probabilities needed to build this exhibit?

    A. Logistic Regression
    B. Linear Regression
    C. Discriminant Analysis
    D. Association Rules
    A
  141. Image Upload 14

    You have created a density plot of purchase amounts from a retail website as shown. What should
    you do next?

    A. Recreate the plot using the barplot() function
    B. Use the rug() function to add elements to the plot
    C. Recreate the density plot using a log normal distribution of the purchase amount data
    D. Reduce the sample size of the purchase amount data used to create the plot
    C
  142. Image Upload 16
    You are building a decision tree. In this exhibit, four variables are listed with their respective values
    of info-gain.

    Based on this information, on which attribute would you expect the next split to be in the decision
    tree?

    A. Credit Score
    B. Age
    C. Income
    D. Gender
    A
  143. Image Upload 18

    In the Exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also
    shows the values for the output attribute "class". Which decision tree is valid for the data?

    A. Tree B
    B. Tree A
    C. Tree C
    D. Tree D
    A
  144. Image Upload 20

    In the Exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also
    shows the values for the output attribute "class". Which decision tree is valid for the data?

    A. Tree B
    B. Tree A
    C. Tree C
    D. Tree D
    A
  145. Image Upload 22

    You are assigned to do an end of the year sales analysis of 1,000 different products, based on
    the transaction table. Which column in the end of year report requires the use of a window
    function?

    A. Total Sales to Date
    B. Daily Sales
    C. Average Daily Price
    D. Maximum Price
    A
  146. Image Upload 24
    You are creating an OLAP query that outputs summary rows of subtotals and grand totals
    in addition to regular rows, any of which may contain NULL, as shown in the
    exhibit. Which function can you use in your query to distinguish a regular row from a
    subtotal row?

    A. GROUPING
    B. RANK
    C. GROUP_ID
    D. ROLLUP
    A
  147. Image Upload 26
    After analyzing a dataset, you report findings to your team:

    1. Variables A and C are significantly and positively impacting the dependent variable.
    2. Variable B is significantly and negatively impacting the dependent variable.
    3. Variable D is not significantly impacting the dependent variable.

    After seeing your findings, the majority of your team agreed that variable B should be positively
    impacting the dependent variable.

    What is a possible reason the coefficient for variable B was negative and not positive?

    A. Variable B is interacting with another variable due to correlated inputs
    B. Variable B needs a quadratic transformation due to its relationship to the dependent variable
    C. The information gain from variable B is already provided by another variable
    D. Variable B needs a logarithmic transformation due to its relationship to the dependent variable
    A
  148. Image Upload 28

    You have run a linear regression model against your data, and have plotted true outcome versus
    predicted outcome. The R-squared of your model is 0.75. What is your assessment of the model?

    A. The R-squared may be biased upwards by the extreme-valued outcomes. Remove them and
    refit to get a better idea of the model's quality over typical data.
    B. The R-squared is good. The model should perform well.
    C. The extreme-valued outliers may negatively affect the model's performance. Remove them to
    see if the R-squared improves over typical data.
    D. The observations seem to come from two different populations, but this model fits them both
    equally well.
    A
  149. Image Upload 30
    You are using K-means clustering to classify customer behavior for a large retailer. You need to
    determine the optimum number of customer groups. You plot the within-sum-of-squares (wss)
    data as shown in the exhibit. How many customer groups should you specify?

    A. 2
    B. 3
    C. 4
    D. 8
    C
  150. Image Upload 32
    Click on the calculator icon in the upper left corner. You are given a list of pre-defined association
    rules:

    A) RENTER => BAD CREDIT
    B) RENTER => GOOD CREDIT
    C) HOME OWNER => BAD CREDIT
    D) HOME OWNER => GOOD CREDIT
    E) FREE HOUSING => BAD CREDIT
    F) FREE HOUSING => GOOD CREDIT

    For your next analysis, you must limit your dataset based on rules with confidence greater than
    60%.

    Which of the rules will be kept in the analysis?

    A. Rules B and D
    B. Rules A and F
    C. Rules C and E
    D. Rules D and E
    A
  151. Image Upload 34
    You are using k-means clustering to discover groupings within a data set. You plot within-sum-of-
    squares (wss) of multiple cluster sizes. Based on the exhibit, how many clusters should you use in
    your analysis?

    A. 4
    B. 2
    C. 8
    D. 10
    A
  152. Image Upload 36

    Consider the training data set shown in the exhibit. What are the classification (Y = 0 or 1) and the
    probability of the classification for the tuple X(0, 0, 1) using a Naive Bayesian classifier?

    A. Classification Y = 1, Probability = 4/54
    B. Classification Y = 0, Probability = 1/54
    C. Classification Y = 1, Probability = 1/54
    D. Classification Y = 0, Probability = 4/54
    A
  153. Image Upload 38

    In the exhibit, a correlogram is provided based on an autocorrelation analysis of a sample dataset.

    What can you conclude from only this exhibit?

    A. There is significant autocorrelation through lag 3
    B. There is no structure left to model in the data
    C. Lag 7 has a significant negative autocorrelation
    D. Differencing is required before proceeding with any analysis
    A
  154. Image Upload 40
    Which type of data issue would you suspect based on the exhibit?

    A. "Saturated" data, indicating potential issues with data definitions
    B. Incomplete data, indicating potential issues with data transmission
    C. Mis-scaled data, indicating potential issues with data entry
    D. The exhibit does not raise any obvious concerns with the data.
    A
  155. Image Upload 42
    Click on the calculator icon in the upper left corner. An analyst is searching a corpus of documents
    for the topic "solid state disk". In the Exhibit, Table A provides the inverse document frequency for
    each term across the corpus. Table B provides each term's frequency in four documents selected
    from corpus. Which of the four documents is most relevant to the analyst's search?

    A. Document C
    B. Document A
    C. Document B
    D. Document D
    A
  156. Image Upload 44

    The exhibit provides the decision tree for predicting whether someone is a good or bad credit risk.
    What would be the assigned probability, p(good), of a single male with no known savings?

    A. 0.83
    B. 0
    C. 0.498
    D. 0.6
    A
  157. Image Upload 46
    The exhibit shows four graphs labeled as Fig A through Fig D. Which figure represents the
    entropy function relative to a Boolean classification and is represented by the formula shown in
    the exhibit?

    A. Fig-A
    B. Fig-B
    C. Fig-C
    D. Fig-D
    A
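The exhibit for question 157 is not reproduced here. Assuming its formula is the standard entropy of a Boolean classification, H(p) = -p*log2(p) - (1-p)*log2(1-p), a short sketch shows the shape the correct figure should have:

```python
import math


def boolean_entropy(p):
    """Entropy in bits of a two-class split where p is the share of one class."""
    if p in (0.0, 1.0):
        return 0.0  # by convention 0 * log2(0) is taken as 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Evaluate on a coarse grid from p = 0.0 to p = 1.0.
values = [boolean_entropy(step / 10) for step in range(11)]

print(values[0], values[5], values[10])  # 0.0 1.0 0.0
```

Under that assumption the curve is an inverted U: zero at p = 0 and p = 1, symmetric about p = 0.5, and peaking there at 1 bit.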
  158. Image Upload 48
    You ran a linear regression, and the final output is seen in the exhibit.

    Based only on the information in the exhibit and an acceptable confidence level of 95%, how
    would you interpret the interaction of variable D with the dependent variable?

    A. In this model, Variable D is not significantly interacting with the dependent variable
    B. For every 1 unit increase in variable D, holding all other variables constant, we can expect the
    dependent variable to increase by 10.23 units
    C. For every 1 unit increase in variable D, holding all other variables constant, we can expect the
    dependent variable to be multiplied by 10.23 units
    D. Variable D is more significant than variables A, B, and C.
    A
  159. Image Upload 50

    The graph represents an ROC space with four classifiers labelled A through D. Which point in the
    graph represents a perfect classification?

    A. S
    B. P
    C. Q
    D. R
    A
  160. Image Upload 52

    Consider the training data set shown in the exhibit. What are the classification (Y = 0 or 1) and the
    probability of the classification for the tuple

    X(1, 0, 0)
    using Naive Bayesian classifier?

    A. Classification Y = 0, Probability = 4/54
    B. Classification Y = 1, Probability = 4/54
    C. Classification Y = 0, Probability = 1/54
    D. Classification Y = 1, Probability = 1/54
    A
  161. Image Upload 54

    You have scored your Naive Bayesian classifier model on hold-out test data for cross-validation,
    determined how the samples scored, and tabulated them as shown in the exhibit.

    What are the Precision and Recall rate of the model?

    A. Precision = 262/277
    Recall = 262/288
    B. Precision = 262/288
    Recall = 262/277
    C. Precision = 277/262
    Recall = 288/262
    D. Precision = 288/262
    Recall = 277/262
    A
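The confusion matrix for question 161 is not reproduced here. The counts in this sketch are inferred from answer A (262 true positives, 277 predicted positives, 288 actual positives) and are assumptions rather than values read from the exhibit:

```python
# Counts inferred from answer A, not taken from the missing exhibit.
true_positives = 262
false_positives = 277 - true_positives  # predicted positive but actually negative
false_negatives = 288 - true_positives  # actually positive but predicted negative

# Precision: of everything predicted positive, how much was truly positive.
precision = true_positives / (true_positives + false_positives)
# Recall: of everything truly positive, how much the model found.
recall = true_positives / (true_positives + false_negatives)

print(round(precision, 3), round(recall, 3))
```

Both ratios have the true-positive count in the numerator, so neither can exceed 1. That rules out options C and D without needing the exhibit.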
  162. Image Upload 56
    Click on the calculator icon in the upper left corner. An analyst is searching a corpus of documents
    for the topic "solid state disk". In the Exhibit, Table A provides the inverse document frequency for
    each term across the corpus. Table B provides each term's frequency in four documents selected
    from corpus. Which of the four documents is most relevant to the analyst's search?

    A. Document B
    B. Document A
    C. Document C
    D. Document D
    A
  163. Image Upload 58
    Click on the calculator icon in the upper left corner. You are going into a meeting where you know
    your manager will have a question on your dataset -- specifically relating to customers that are
    classified as renters with good credit status.

    In order to prepare for the meeting, you create a rule: RENTER => GOOD CREDIT. What is the
    confidence of the rule?

    A. 63%
    B. 41%
    C. 18%
    D. 73%
    A
Author
Anonymous
ID
220673
Card Set
Data Scientist
Description
EMC Data Scientist E20-007
Updated