Big Data: week 8- Dataflow, HDFS, Spark

  1. What is meant by a "Data Lake House"?
    An intermediary between an unstructured data lake and a structured database
  2. What are the two ways to bring data in?
    • "Batch process"
    • "streaming process"
  3. What are a few examples of databases?
    • Object oriented databases
    • NoSQL ('not just SQL') databases
    • Cloud databases
    • Self-driving databases (use machine learning to learn about the data)
  4. Data Warehouse
    A type of database. ETL is done as the data comes in, and then the data is stored
  5. ETL
    extract, transform, load
  6. What does ETL do?
    (Extract, Transform, Load)

    Produces a structured schema with denormalized data ready for dashboards, analysis, etc.
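
    A minimal sketch of the ETL idea in Python (the file names, column names, and use of pandas are illustrative assumptions, not the course's pipeline):

    ```python
    import pandas as pd

    # Extract: read raw data from a source system (hypothetical CSV export)
    raw = pd.read_csv("orders_raw.csv")

    # Transform: clean up types and denormalize into an analysis-ready shape
    raw["order_date"] = pd.to_datetime(raw["order_date"])
    summary = (
        raw.groupby(["customer_id", "order_date"])["amount"]
           .sum()
           .reset_index()
    )

    # Load: write the structured result into the warehouse (here just a file)
    summary.to_parquet("warehouse/orders_summary.parquet")
    ```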
  7. What are Data Warehouses designed to do?
    Store large amounts of data in a standard format in a central database

    • Maintain historical records
    • Keep data secure by storing it in a single location
    • Provide quick access to enable faster business decisions

  8. OLTP
    • OnLine Transactional Processing
    • Used by Databases

    Optimized to frequently add, modify, and delete records
  9. OLAP
    • OnLine Analytical Processing
    • Used by Data Warehouses

    optimized to execute a smaller number of complex queries
  10. Databases use which format of data?
    Normalized format

    Reduces redundancy and increases consistency as data isn't stored in multiple locations
  11. Data Warehouses use which format of data?
    Denormalized format

    More query efficient, but the data may exist in multiple places and become inconsistent
  12. What is a Data Mart?
    A focused version of a data warehouse for specific teams or departments
  13. Data Lake
    A central repository for raw data to go until it is needed

    Can handle structured, unstructured, or semi-structured data

    Usually includes raw data and data after ETL
  14. What is Kafka used for?
    streaming data
  15. Data sources: batch versus streaming process
    Batch data is updated in bulk

    Streaming data is updated in 'real-time'
  16. Data Swamp
    What happens when a data lake has poor data management

    Can happen if there is:
    • no metadata
    • a broken ingestion process
    • broken metadata management
    • no data governance
  17. What is a Lake House
    an intermediary between the unstructured data lake and very structured database/data warehouse
  18. What is a Delta Lake?
    Storage technology that can power the lake house

    Guarantees ACID transactions
  19. HDFS
    Hadoop Distributed File System

    Clusters data on multiple computers to analyze datasets in parallel
  20. Four commonly used data storage systems:
    • Hadoop Distributed File System (HDFS)
    • Amazon's Simple Storage Service (S3)
    • Google's Cloud Storage (GCS)
    • Azure's Blob Storage
  21. Hadoop HDFS is ideal for what sort of data?
    works well for computations that can be split, run in parallel and combined
  22. Hadoop Heartbeat:
    A signal is sent from each DataNode to the NameNode. If the NameNode sees no signal, the DataNode is considered dead

    When DataNodes fail, data may be under-replicated, but the NameNode will send signals to replicate and rebalance the data
  23. Hadoop YARN (came out in Hadoop 2.0)
    Yet Another Resource Negotiator

    Platform responsible for managing computing resources in clusters
  24. What happens when a DataNode fails in HDFS?
    Balancing: when DataNodes fail, data may be under-replicated. The NameNode will send signals to replicate and balance the data
  25. Hadoop MapReduce
    programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster

    Leverages parallel computing
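
    A toy illustration of the MapReduce paradigm in plain Python (real Hadoop jobs are usually written in Java or via Hadoop Streaming; this only sketches the map, shuffle, and reduce phases):

    ```python
    from collections import defaultdict

    documents = ["big data big clusters", "spark and hadoop", "big spark"]

    # Map phase: turn each input record into (key, value) pairs
    mapped = []
    for doc in documents:
        for word in doc.split():
            mapped.append((word, 1))

    # Shuffle phase: group all values by key
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce phase: combine the values for each key
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)  # {'big': 3, 'data': 1, 'clusters': 1, ...}
    ```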
  26. Hadoop limitations
    HDFS requires a decent amount of "on-premises" infrastructure

    difficulties with scalability
  27. Horizontal scaling with Hadoop
    Adding more machines (nodes) to the cluster
  28. Vertical scaling with Hadoop
    Adding additional computational power (CPU, RAM, disk) to existing machines
  29. What type of storage do cloud storage companies use?
    Object storage: each object includes the data, metadata, and a unique identifier

    S3, GCS, Blob storage systems
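
    For example, writing an object to S3 with the boto3 library might look like this (the bucket name, key, and metadata are hypothetical, and valid AWS credentials are assumed):

    ```python
    import boto3

    s3 = boto3.client("s3")

    # An object = the data + metadata + a unique identifier (the bucket/key pair)
    s3.put_object(
        Bucket="my-example-bucket",
        Key="raw/2024/events.json",      # unique identifier within the bucket
        Body=b'{"event": "click"}',      # the data itself
        Metadata={"source": "web"},      # user-defined metadata
    )
    ```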
  30. What are the five major components of Spark?
    • Spark Core as its foundation
    • Spark SQL for SQL queries
    • Spark Streaming for real-time analytics
    • Spark MLlib for machine learning
    • Spark GraphX for graph processing
  31. How can we interact with Big data using python?
    Using pySpark!
  32. Advantage of using Hadoop MapReduce with HDFS:
    allows for massively parallelized operations without worrying about worker distribution or fault tolerance.

  33. Advantage of using Spark (instead of Hadoop MapReduce):
    • Does similar things as MapReduce, but is built to be faster by processing in memory (RAM)
  34. What type of data can Spark work with?
    HDFS, CSV, JSON, S3, GCS (most data sources!)

    If your data is in HDFS, Spark's cluster manager will work with YARN

    If your data is elsewhere, Spark has its own cluster manager to determine how to split up data and run computations
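
    For example, assuming a SparkSession named `spark` already exists (the paths here are hypothetical), reading a few of these sources might look like:

    ```python
    # CSV file in HDFS, letting Spark infer the schema
    df_csv = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

    # Local JSON file
    df_json = spark.read.json("data/events.json")

    # Parquet data in S3 (requires the appropriate S3/Hadoop connector)
    df_s3 = spark.read.parquet("s3a://my-bucket/events/")
    ```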
  35. What is a `Spark Session`?
    It's what you start when you want to use Spark

    Here we define:
    • What the cluster is, and the Workers that handle computation on that cluster/node
    • The central Spark coordinator (i.e. the Driver)
    • The name of the app
  36. Example of pseudo code to start a Spark Session
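    A minimal sketch of what this typically looks like in PySpark (the app name is just a placeholder):

    ```python
    from pyspark.sql import SparkSession

    # 'local[*]' runs Spark locally using all available cores;
    # the app name is just a label for this session
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("MyApp")
        .getOrCreate()
    )
    ```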

    Sometimes you'll put a cluster URL where 'local[*]' is in the code
  37. RDDs
    Resilient Distributed Datasets

    The core object that Spark works on

    RDD: an immutable distributed collection of elements of your data, partitioned across nodes in your cluster, that can be operated on in parallel with a low-level API that offers transformations and actions
  38. RDD transformation
    Transformation: something that creates a new RDD (say by filtering, grouping, or mapping)
  39. RDD Action:
    Action: operation applied to an RDD that performs a computation and sends the result back
  40. How do you explicitly create an RDD with Spark?
    `.sparkContext.parallelize()` method
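
    A minimal sketch (assuming a SparkSession named `spark`):

    ```python
    # Distribute a local Python list across the cluster as an RDD
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

    print(rdd.count())  # 5
    ```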
  41. What are some common transformations done in Spark?
    • map(): apply function to each RDD to return a new RDD
    • filter(): returns a subsetted RDD

    • select()
    • filterByRange()
    • groupByKey()
    • reduceByKey()
    • distinct()
    • sample()
    • union()
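
    A small sketch of chaining transformations (assuming a SparkSession named `spark`):

    ```python
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 5])

    squared = rdd.map(lambda x: x * x)            # new RDD of squares
    evens = squared.filter(lambda x: x % 2 == 0)  # keep only even values
    unique = evens.distinct()                     # drop duplicates

    # Nothing has been computed yet; an action (next card) triggers the work
    ```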
  42. What are some common actions done to RDDs in Spark?
    • reduce()
    • count()
    • min()
    • max()
    • collect()
    • take()
    • first()
    • foreach()
    • aggregate()
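
    A small sketch of actions triggering computation (assuming a SparkSession named `spark`):

    ```python
    rdd = spark.sparkContext.parallelize([3, 1, 4, 1, 5])

    print(rdd.count())                     # 5
    print(rdd.min(), rdd.max())            # 1 5
    print(rdd.reduce(lambda a, b: a + b))  # 14
    print(rdd.take(2))                     # first two elements
    print(rdd.collect())                   # pulls the whole RDD back to the driver
    ```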
  43. What is Fault Tolerance in Hadoop?
    If a node holding data goes down, backup copies on other nodes are used (the data is re-replicated to restore the replication factor)
  44. What is Fault Tolerance in Spark?
    Turns transformations and actions into a directed acyclic graph (DAG) that allows computation to be picked back up if something fails
  45. Why are transformations in spark called lazy?
    They don't compute their results right away

    • Transformations are built up through a DAG
    • Computation is only done when an action requires a result
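
    A quick way to see this laziness (assuming a SparkSession named `spark`):

    ```python
    rdd = spark.sparkContext.parallelize(range(1_000_000))

    # These lines return immediately: only the DAG is built, no data is touched
    doubled = rdd.map(lambda x: x * 2)
    filtered = doubled.filter(lambda x: x > 1_000)

    # Only this action actually runs the computation
    print(filtered.count())
    ```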
  46. DAG
    Directed acyclic graph

    Allows computation to be picked back up if something fails
  47. Broadcast Variables
    Spark can share variables across clusters or machines

    Broadcast variables give access to common (read-only) variables to all workers
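
    A minimal sketch (assuming a SparkSession named `spark`; the lookup table is made up):

    ```python
    # Send a read-only lookup table to every worker once
    lookup = {"a": 1, "b": 2, "c": 3}
    bc_lookup = spark.sparkContext.broadcast(lookup)

    rdd = spark.sparkContext.parallelize(["a", "b", "c", "a"])
    mapped = rdd.map(lambda k: bc_lookup.value[k])  # workers read the shared value

    print(mapped.collect())  # [1, 2, 3, 1]
    ```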
  48. Accumulators
    Spark can share variables across clusters or machines

    Accumulators are variables that each worker can do an operation on (usually things like sums)
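
    A minimal sketch (assuming a SparkSession named `spark`; the log lines are made up):

    ```python
    # A counter that workers can only add to
    errors = spark.sparkContext.accumulator(0)

    def count_errors(line):
        if "ERROR" in line:
            errors.add(1)  # each worker adds to the shared counter

    logs = spark.sparkContext.parallelize(["ok", "ERROR x", "ok", "ERROR y"])
    logs.foreach(count_errors)  # foreach is an action, so this runs now

    print(errors.value)  # 2 -- the total is only readable back on the driver
    ```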