-
What is meant by a "Data Lake House"?
An intermediary between an unstructured data lake and a structured database
-
What are the two ways to bring data in?
- "Batch process"
- "streaming process"
-
What are a few examples of databases?
- Object oriented databases
- NoSQL ('not only SQL') databases
- Cloud databases
- Self-driving databases (use machine learning to automate tuning and management of the data)
-
Data Warehouse
A type of database. ETL is done as the data comes in, and then the data is stored
-
ETL
extract, transform, load
-
What does ETL do?
(Extract, Transform, Load)
Produces a structured schema and denormalized data ready for dashboards, analysis, etc.
-
What are Data Warehouses designed to do?
- Store large amounts of data in a standard format in a central database
- Maintain historical records
- Keep data secure by storing it in a single location
- Provide quick access to enable faster business decisions
-
OLTP
- OnLine Transactional Processing
- Used by Databases
Optimized for frequently adding, modifying, and deleting records
-
OLAP
- OnLine Analytical Processing
- Used by Data Warehouses
Optimized to execute a smaller number of complex queries
-
Databases use which format of data?
Normalized format
Reduces redundancy and increases consistency as data isn't stored in multiple locations
-
Data Warehouses use which format of data?
Denormalized format
More query efficient, but the data may exist in multiple places and become inconsistent
-
What is a Data Mart?
A focused version of a data warehouse for specific teams or departments
-
Data Lake
A central repository where raw data goes until it is needed
Can handle structured, unstructured, or semi-structured data
Usually includes raw data and data after ETL
-
What is Kafka used for?
streaming data
-
Data sources: batch versus streaming process
Batch data is updated in bulk
Streaming data is updated in 'real-time'
-
Data Swamp
What happens when a data lake has poor data management
- Can happen if:
- has no metadata
- broken ingestion process
- broken metadata management
- no data governance
-
What is a Lake House?
an intermediary between the unstructured data lake and very structured database/data warehouse
-
What is a Delta Lake?
Storage technology that can power the lake house
Guarantees ACID transactions
-
HDFS
Hadoop Distributed File System
Distributes data across a cluster of computers so datasets can be analyzed in parallel
-
Four commonly used data storage systems:
- Hadoop Distributed File System (HDFS)
- Amazon's Simple Storage Service (S3)
- Google's Cloud Storage (GCS)
- Azure's Blob Storage
-
Hadoop HDFS is ideal for what sort of data?
works well for computations that can be split, run in parallel and combined
-
Hadoop Heartbeat:
A signal is sent from each datanode to the namenode. If the namenode sees no signal, the datanode is considered dead
When datanodes fail, data may be under-replicated, but the namenode will send signals to replicate and re-balance the data replicas
-
Hadoop Yarn (came out in Hadoop 2.0)
Yet Another Resource Negotiator
platform responsible for managing computing resources in clusters
-
What happens when a datanode fails in HDFS?
Balancing: when datanodes fail, data may be under-replicated. The namenode will send signals to replicate and balance the data
-
Hadoop MapReduce
programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster
Leverages parallel computing
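A minimal plain-Python sketch of the map/shuffle/reduce idea (this is not Hadoop's actual API; the function names and input lines are made up for illustration):
```python
from collections import defaultdict

# Map phase: each "mapper" emits (word, 1) pairs for its chunk of input
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle phase: group values by key (Hadoop does this across the cluster)
def shuffle_phase(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: each "reducer" combines the values for one key
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data is big", "data lakes hold raw data"]
print(reduce_phase(shuffle_phase(map_phase(lines))))
# {'big': 2, 'data': 3, 'is': 1, 'lakes': 1, 'hold': 1, 'raw': 1}
```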
-
Hadoop limitations
HDFS requires a decent amount of "on-premises" infrastructure
difficulties with scalability
-
Horizontal scaling with Hadoop
adding more machines (nodes) to the cluster
-
Vertical scaling with Hadoop
adding additional computational power (CPU, RAM) to existing machines
-
What type of storage do cloud storage companies use?
Object storage: includes the data, metadata, and a unique identifier
S3, GCS, Blob storage systems
-
What are the five major components of Spark?
- Spark Core as its foundation
- Spark SQL for SQL queries
- Spark Streaming for real-time analytics
- Spark MLlib for machine learning
- Spark GraphX for graph processing
-
How can we interact with Big data using python?
Using pySpark!
-
Advantage of using Hadoop MapReduce with HDFS:
allows for massively parallelized operations without worrying about worker distribution or fault tolerance.
-
Advantage of using Spark (instead of Hadoop MapReduce):
- Does similar things as MapReduce, but is built to be faster by processing in memory (RAM)
-
What type of data can Spark work with?
HDFS, CSV, JSON, S3, GCS (most data types!)
If your data is in HDFS, Spark can use YARN as its cluster manager
If your data is elsewhere, Spark's own cluster manager determines how to split up data and run computations
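A minimal PySpark sketch of reading a few of these formats (the file paths and bucket name are placeholders, not from the source):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-examples").getOrCreate()

# CSV from HDFS (header/schema options are optional)
csv_df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

# JSON from local disk
json_df = spark.read.json("data/events.json")

# Parquet from S3 (requires the S3 connector and credentials to be configured)
s3_df = spark.read.parquet("s3a://my-bucket/events/")
```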
-
What is a `Spark Session`?
It's what you start when you want to use Spark
- Here we define:
- What the cluster is and the Workers that handle computation on that cluster/node
- Central Spark coordinator (i.e. the Driver)
- The name of the app
-
Example of pseudo code to start a Spark Session
On a real cluster you'll put a cluster URL where 'local[*]' appears in the code (see the sketch below)
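A minimal sketch of what that pseudo code might look like in PySpark (the app name is just an example):
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")   # or a cluster URL, e.g. "spark://host:7077"
    .appName("my_app")    # the name of the app
    .getOrCreate()
)
```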
-
RDDs
Resilient Distributed Datasets
The core object that Spark works on
RDD: an immutable, distributed collection of elements of your data, partitioned across nodes in your cluster, that can be operated on in parallel with a low-level API offering transformations and actions
-
RDD transformation
Transformation: something that creates a new RDD (say by filtering, grouping, or mapping)
-
RDD Action:
Action: operation applied to an RDD that performs a computation and sends the result back
-
How do you explicitly create an RDD with Spark?
`.sparkContext.parallelize()` method
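For example (a minimal sketch, assuming a SparkSession named `spark` as in the card above):
```python
# Turn an in-memory Python list into an RDD distributed across the cluster
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(rdd.count())  # 5
```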
-
What are some common transformations done in Spark?
- map(): apply function to each RDD to return a new RDD
- filter(): returns a subsetted RDD
- select()
- filterByRange()
- groupByKey()
- reduceByKey()
- distinct()
- sample()
- union()
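A minimal sketch of a few of these transformations (assuming a SparkSession named `spark`; the data is made up):
```python
rdd = spark.sparkContext.parallelize([1, 2, 2, 3, 4, 5])

doubled = rdd.map(lambda x: x * 2)          # map(): apply a function to each element
evens   = rdd.filter(lambda x: x % 2 == 0)  # filter(): keep a subset of elements
unique  = rdd.distinct()                    # distinct(): drop duplicate elements
# Nothing has been computed yet -- transformations are lazy until an action runs
```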
-
What are some common actions done to RDDs in Spark?
- reduce()
- count()
- min()
- max()
- collect()
- take()
- first()
- foreach()
- aggregate()
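A minimal sketch of a few of these actions (assuming a SparkSession named `spark`; the data is made up):
```python
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

rdd.reduce(lambda a, b: a + b)  # 15 -- combine elements pairwise
rdd.count()                     # 5
rdd.first()                     # 1
rdd.take(2)                     # [1, 2]
rdd.collect()                   # [1, 2, 3, 4, 5] -- pulls all elements back to the driver
```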
-
What is Fault Tolerance in Hadoop?
If a node goes down, its data is still available from replicas on other nodes (data is re-replicated to stay available)
-
What is Fault Tolerance in Spark?
Turns transformations and actions into a directed acyclic graph (DAG) that allows computation to be picked back up if something fails
-
Why are transformations in spark called lazy?
- They don't compute their results right away
- Transformations are built up through a DAG
- Computation is only done when an action requires a result
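A small sketch of that behavior (assuming a SparkSession named `spark`):
```python
rdd = spark.sparkContext.parallelize(range(10))
squared = rdd.map(lambda x: x * x)          # nothing runs yet; just recorded in the DAG
total = squared.reduce(lambda a, b: a + b)  # the action triggers the computation (285)
```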
-
DAG
Directed acyclic graph
allows computation to be picked back up if something fails
-
Broadcast Variables
Spark can share variables across clusters or machines
Broadcast variables give access to common (read-only) variables to all workers
-
Accumulators
Spark can share variables across clusters or machines
Accumulators are variables that each worker can do an operation on (usually things like sums)
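A minimal sketch showing both shared-variable types (assuming a SparkSession named `spark`; the lookup table and data are made up):
```python
# Broadcast variable: a read-only lookup table every worker can access
lookup = spark.sparkContext.broadcast({"a": 1, "b": 2})

# Accumulator: workers can only add to it; the driver reads the final value
bad_rows = spark.sparkContext.accumulator(0)

def score(row):
    if row not in lookup.value:
        bad_rows.add(1)           # count rows that aren't in the lookup table
        return 0
    return lookup.value[row]

rdd = spark.sparkContext.parallelize(["a", "b", "c", "a"])
rdd.map(score).collect()          # [1, 2, 0, 1]
print(bad_rows.value)             # 1
```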