-
What is big data?
Data that you can't handle normally
- it won't fit on a single computer
- it is constantly being added/updated
-
What are the 5 Vs of Big Data?
- Volume- scale of data
- Variety- different forms of data
- Velocity- analysis of streaming data
- Veracity- uncertainty in data
- Value- get value from your efforts
-
What do statisticians usually consider?
Populations and samples (subsets of populations)
Make assumptions about the data-generating process in order to make inferences or predictions using sampling distributions
Test hypotheses and create confidence intervals
-
What is the most commonly looked at statistic?
Sample mean
-
How can you simulate a sampling distribution?
stats.binom.rvs(n, p, size=size) (from scipy.stats)
- n= number of trials in each draw
- p= probability of success on each trial
- size= number of ys to draw
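A minimal sketch of that simulation, assuming SciPy and NumPy are installed. The values of n, p, and size are made up for illustration, and random_state is only there to make the run repeatable.

```python
from scipy import stats

n, p = 20, 0.3        # trials per draw and success probability (illustrative values)
size = 10_000         # number of ys to draw

# Each y is the number of successes observed in n trials.
y = stats.binom.rvs(n, p, size=size, random_state=0)

# The sample proportion y / n has a sampling distribution centered near p.
p_hat = y / n
print(p_hat.mean())   # should be close to p
print(p_hat.std())    # should be close to sqrt(p * (1 - p) / n)
```

A histogram of p_hat would show the simulated sampling distribution of the sample proportion.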
-
Do we still need statistics with Big Data?
i.e., "with enough data, the numbers speak for themselves..."
- Yes!
- Can our sample really represent the entire population? We don't actually observe everything
-
How can Big Data model user-level data (n=1)?
Example: modeling user intention on social media networks can detect depression
We are looking at one user over time and aggregating over other users over time to compare
-
How often do "rare" events happen in big data?
More often than you might think! If you have enough data, you'll eventually see weird things just by chance (similar to the idea of multiple testing in hypothesis testing)
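A quick back-of-the-envelope check of this idea. The numbers are hypothetical and the observations are assumed to be independent.

```python
# Hypothetical numbers chosen only for illustration.
p = 1e-6          # chance of a "one in a million" event per observation
n = 10_000_000    # number of independent observations

expected_count = n * p               # average number of such events we expect to see
p_at_least_one = 1 - (1 - p) ** n    # chance of seeing the event one or more times

print(expected_count)
print(p_at_least_one)
```

With ten million observations, a one-in-a-million event is expected about ten times, so seeing it at least once is nearly certain.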
-
How do you consider bias in data/analysis?
Consider what might be a SIGNAL problem!
Example: Twitter analysis was considered to determine which area was hit hardest by Hurricane Sandy.
The most tweets came from Manhattan, which gave the illusion that Manhattan was the hub of the damage. Coney Island and other areas were actually affected more, but people there were without power or busy dealing with the storm damage instead of tweeting
- Example: Machine learning model determining that military targets likely had an "arch-shaped gate"
- Resulted in military attacks on some civilian locations, because that is a common building type in the Middle East
-
Types of flat files used to store data:
- .csv
- .txt
- .dat
- .json (think dictionary)
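To see why "think dictionary" fits, here is a small sketch using Python's built-in json module. The record itself is made up for illustration.

```python
import json

# A hypothetical record as it might appear inside a .json file.
text = '{"name": "Ada", "scores": [90, 85], "active": true}'

record = json.loads(text)          # JSON object -> Python dict
print(type(record).__name__)
print(record["scores"][0])

# Going the other way: dict -> JSON text.
print(json.dumps(record, sort_keys=True))
```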
-
What do we use when we have multiple data sets or sources of data?
We use a database, along with a Database Management System (DBMS)!
-
How is a relational database organized?
Think of a bunch of 2D tables that are connected by keys
-
RDBMS
Relational Database Management Systems
-
Common types of relational database management systems
Note: most RDBMSs have their own dialect of Structured Query Language (SQL)
- Oracle
- MySQL
- SQL Server
- PostgreSQL
- SQLite
- ...
-
C.R.U.D. acronym for the few common actions we want to perform on a database:
- Create Data
- Read Data
- Update Data
- Delete Data
(also: provide access control, monitoring, tuning, backup and recovery)
-
ACID acronym for the properties defining relational database transactions
Atomicity- treats all the operations that make up a transaction as a single unit: either all of them succeed or none are applied
Consistency- ensures a transaction only moves the data from one correct state to another, following the defined rules
Isolation- keeps the effect of a transaction invisible to others until it is committed, to avoid confusion
Durability- ensures that data changes become permanent once the transaction is committed
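A rough sketch of atomicity using Python's sqlite3 module. The account table, the names, and the CHECK rule are hypothetical. The second update breaks the rule, so the whole transfer is rolled back, including the first update that had already run.

```python
import sqlite3

con = sqlite3.connect(":memory:")   # temporary in-memory database
con.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
con.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
con.commit()

try:
    # "with con" commits if the block succeeds and rolls back if it raises.
    with con:
        con.execute("UPDATE account SET balance = balance + 150 WHERE name = 'b'")
        # This would make a's balance negative, which violates the CHECK rule.
        con.execute("UPDATE account SET balance = balance - 150 WHERE name = 'a'")
except sqlite3.IntegrityError:
    pass

balances = con.execute("SELECT name, balance FROM account ORDER BY name").fetchall()
print(balances)   # both balances are unchanged
con.close()
```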
-
Normalized Data
Data that is split up into smaller tables so that the same data is not stored more than once
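A small sketch of the idea with sqlite3, assuming a hypothetical emp table that points at a dept table through a dept_id key. The department name is stored once, so renaming it means changing a single row.

```python
import sqlite3

con = sqlite3.connect(":memory:")   # temporary in-memory database
cur = con.cursor()

# Hypothetical schema: department details live in one table only.
cur.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, dept_name TEXT)")
cur.execute("""CREATE TABLE emp (
    emp_id   INTEGER PRIMARY KEY,
    emp_name TEXT,
    dept_id  INTEGER REFERENCES dept(dept_id))""")

cur.execute("INSERT INTO dept VALUES (1, 'Statistics')")
cur.executemany("INSERT INTO emp VALUES (?, ?, ?)", [(10, "Ana", 1), (11, "Ben", 1)])

# Renaming the department touches one row, not one row per employee.
cur.execute("UPDATE dept SET dept_name = 'Data Science' WHERE dept_id = 1")

rows = cur.execute("""SELECT emp_name, dept_name
                      FROM emp JOIN dept ON emp.dept_id = dept.dept_id
                      ORDER BY emp_id""").fetchall()
print(rows)
con.close()
```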
-
The code to connect to a SQLite database:
import sqlite3
con = sqlite3.connect("database_name")
-
How do you use SQLite in Python?
Usually, we will write our SQL command as a string and then execute it using the execute() method on a cursor object
-
How would you say in SQL that you wanted to get col_1, col_2 from the 'pretty' table, but only if col_2 values are less than 2?
SELECT col_1, col_2 FROM pretty WHERE col_2 < 2;
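A minimal sketch of running that query from Python, with the SQL written as a string and executed on a cursor. The rows put into 'pretty' are made up, and an in-memory database stands in for a real file.

```python
import sqlite3

con = sqlite3.connect(":memory:")   # temporary in-memory database
cur = con.cursor()

# Hypothetical contents for the 'pretty' table.
cur.execute("CREATE TABLE pretty (col_1 TEXT, col_2 INTEGER)")
cur.executemany("INSERT INTO pretty VALUES (?, ?)", [("a", 1), ("b", 2), ("c", 0)])

query = "SELECT col_1, col_2 FROM pretty WHERE col_2 < 2;"
rows = cur.execute(query).fetchall()   # list of tuples, one per matching row
print(rows)                            # the row with col_2 = 2 is filtered out
con.close()
```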
-
What do you do when you want to save any changes you made to a database?
- We need to commit them on the connection via
- `con.commit()`
-
What should you do when you are done working in your SQL database?
close your connection with the `close()` method:
`con.close()`
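A sketch of why committing before closing matters, assuming the sqlite3 module's default transaction handling. The file name and table are hypothetical, and the file is created in a temporary directory.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.db")   # hypothetical file name

con = sqlite3.connect(path)
con.execute("CREATE TABLE t (x INTEGER)")
con.execute("INSERT INTO t VALUES (1)")
con.commit()                             # this insert is saved
con.execute("INSERT INTO t VALUES (2)")
con.close()                              # closed without a commit: this insert is lost

con = sqlite3.connect(path)              # reopen to see what was actually stored
saved = con.execute("SELECT x FROM t").fetchall()
print(saved)
con.close()
```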
-
What types of joins are the four most common in SQL?
- LEFT JOIN
- RIGHT JOIN
- INNER JOIN
- FULL OUTER JOIN
-
What does an Inner Join do?
- Returns records with matching keys in both tables
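A small sketch with sqlite3. The emp and dept tables and their rows are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE emp  (name TEXT, dept_id INTEGER);
    CREATE TABLE dept (dept_id INTEGER, dept_name TEXT);
    INSERT INTO emp  VALUES ('Ana', 1), ('Ben', 2), ('Cy', 9);
    INSERT INTO dept VALUES (1, 'Stats'), (2, 'Math'), (3, 'CS');
""")

rows = con.execute("""
    SELECT emp.name, dept.dept_name
    FROM emp INNER JOIN dept ON emp.dept_id = dept.dept_id
    ORDER BY emp.name
""").fetchall()
print(rows)   # [('Ana', 'Stats'), ('Ben', 'Math')]
con.close()
```

'Cy' (no department 9) and 'CS' (no employees) are both dropped because neither has a matching key in the other table.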

-
How do you do joins in pandas?
- with the `merge()` method
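A minimal sketch, assuming pandas is installed. The two data frames and the column name "key" are hypothetical. The how argument picks the join type.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 9], "name": ["Ana", "Ben", "Cy"]})
right = pd.DataFrame({"key": [1, 2, 3], "dept": ["Stats", "Math", "CS"]})

inner = left.merge(right, on="key", how="inner")      # only keys found in both
left_join = left.merge(right, on="key", how="left")   # every row of `left`

print(inner)
print(left_join)   # key 9 has no match, so its dept is missing (NaN)
```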

-
What does a Left Join do?
- Returns all records from the 'left' table and any matching records from the 'right' table

-
NaN vs. None in python
NaN can be used as a numerical value in mathematical operations, while None cannot.
NaN is a numeric value, as defined in the IEEE 754 floating-point standard.
None is a built-in Python object (of type NoneType) and would be more like "nonexistent" or "empty" than "numerically invalid" in this context.
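A short illustration of the difference using only the standard library:

```python
import math

nan = float("nan")

print(nan + 1)            # arithmetic works; the result is still nan
print(math.isnan(nan))    # the reliable way to test for NaN
print(nan == nan)         # NaN never compares equal, even to itself

try:
    None + 1              # None is not a number, so arithmetic fails
except TypeError as err:
    print("TypeError:", err)

print(None is None)       # test for None with "is"
```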
-
What does Right Join do? (also, how do you do it in SQLite?)
- Returns all records from the 'right' table and any matching records from the left table.
- It's not supported in SQLite before version 3.39.0! Just do a left join and switch the tables.
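A sketch of the switch-the-tables trick with sqlite3. The emp and dept tables are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE emp  (name TEXT, dept_id INTEGER);
    CREATE TABLE dept (dept_id INTEGER, dept_name TEXT);
    INSERT INTO emp  VALUES ('Ana', 1), ('Cy', 9);
    INSERT INTO dept VALUES (1, 'Stats'), (3, 'CS');
""")

# Same result as "emp RIGHT JOIN dept": keep every dept row
# by putting dept on the left side of a LEFT JOIN.
rows = con.execute("""
    SELECT emp.name, dept.dept_name
    FROM dept LEFT JOIN emp ON emp.dept_id = dept.dept_id
    ORDER BY dept.dept_id
""").fetchall()
print(rows)   # [('Ana', 'Stats'), (None, 'CS')]
con.close()
```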

-
What does a Full Outer Join do (and how is it done in SQLite?)
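- Returns all records from both tables, matched up where the keys agree; rows without a match are filled in with NULLs.
- It's not supported in SQLite before version 3.39.0! Have to do some work: two left joins (one with the tables switched), then union them.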
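A rough sketch of the workaround with sqlite3, using hypothetical emp and dept tables. Note that UNION also removes duplicate rows, so this assumes the result has no genuine duplicates that need to be kept.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE emp  (name TEXT, dept_id INTEGER);
    CREATE TABLE dept (dept_id INTEGER, dept_name TEXT);
    INSERT INTO emp  VALUES ('Ana', 1), ('Cy', 9);
    INSERT INTO dept VALUES (1, 'Stats'), (3, 'CS');
""")

# First left join keeps every emp row, second keeps every dept row;
# UNION stacks them and drops the matched rows that appear twice.
rows = con.execute("""
    SELECT emp.name, dept.dept_name
    FROM emp LEFT JOIN dept ON emp.dept_id = dept.dept_id
    UNION
    SELECT emp.name, dept.dept_name
    FROM dept LEFT JOIN emp ON emp.dept_id = dept.dept_id
""").fetchall()
print(rows)
con.close()
```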

-
What does the Cross Join do?
Returns every combination of rows from the left table with the right table
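A tiny sketch with sqlite3. The size and color tables are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE size  (s TEXT);
    CREATE TABLE color (c TEXT);
    INSERT INTO size  VALUES ('S'), ('M');
    INSERT INTO color VALUES ('red'), ('blue'), ('green');
""")

rows = con.execute("SELECT s, c FROM size CROSS JOIN color").fetchall()
print(len(rows))   # 2 sizes x 3 colors = 6 combinations
con.close()
```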
-
What types of Joins are supported by SQLite?
- Cross Join
- Inner Join
- Left Join
- Right Join and Full Outer Join (only in SQLite 3.39.0 and later)
-
Common SQL commands
- CREATE TABLE
- SELECT
- INSERT INTO
- UPDATE
- DELETE FROM
- DROP TABLE
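A quick walk through these commands with sqlite3. The dept table and its columns are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, dept_name TEXT)")
cur.execute("INSERT INTO dept VALUES (1, 'Stats'), (2, 'Math')")
cur.execute("UPDATE dept SET dept_name = 'Statistics' WHERE dept_id = 1")
cur.execute("DELETE FROM dept WHERE dept_id = 2")

remaining = cur.execute("SELECT * FROM dept").fetchall()
print(remaining)   # [(1, 'Statistics')]

con.commit()                      # save the changes made so far
cur.execute("DROP TABLE dept")    # remove the table entirely
con.close()
```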
-
How do you run the sample SQL code with pandas?
SELECT * FROM dept;
pd.read_sql("SELECT * FROM dept", con)
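A self-contained sketch, assuming pandas is installed. The contents of dept are made up and an in-memory database stands in for a real file.

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dept (dept_id INTEGER, dept_name TEXT);
    INSERT INTO dept VALUES (1, 'Stats'), (2, 'Math');
""")

# The query result comes back as a DataFrame, one column per SQL column.
df = pd.read_sql("SELECT * FROM dept", con)
print(df)
con.close()
```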