Big Data: week 8- Dataflow, HDFS, Spark

  1. What is meant by a "Data Lake House"?
    An intermediary between an unstructured data lake and a structured database
  2. What are the two ways to bring data in?
    • "Batch process"
    • "streaming process"
  3. What are a few examples of databases?
    • Object oriented databases
    • NoSQL ('not just SQL') databases
    • Cloud databases
    • Self-driving databases (use machine learning to learn about the data)
  4. Data Warehouse
    A type of database. ETL is done as the data comes in, and then the data is stored
  5. ETL
    extract, transform, load
  6. What does ETL do?
    (Extract, Transform, Load)

    Produces a structured schema with denormalized data ready for dashboards, analysis, etc.
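
    A minimal sketch of the ETL idea in Python (the file names, column names, and use of pandas are illustrative assumptions, not the course's pipeline):

    ```python
    import pandas as pd

    # Extract: read raw data from a source system (hypothetical CSV export)
    raw = pd.read_csv("orders_raw.csv")

    # Transform: clean up types and denormalize into an analysis-ready shape
    raw["order_date"] = pd.to_datetime(raw["order_date"])
    summary = (
        raw.groupby(["customer_id", "order_date"])["amount"]
           .sum()
           .reset_index()
    )

    # Load: write the structured result into the warehouse (here just a file)
    summary.to_parquet("warehouse/orders_summary.parquet")
    ```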
  7. What are Data Warehouses designed to do?
    Store large amounts of data in a standard format in a central database

    • Maintain historical records
    • Keep data secure by storing it in a single location
    • Provide quick access to enable faster business decisions

  8. OLTP
    • OnLine Transactional Processing
    • Used by Databases

    Optimized to frequently add, modify, and delete records
  9. OLAP
    • OnLine Analytical Processing
    • Used by Data Warehouses

    optimized to execute a smaller number of complex queries
  10. Databases use which format of data?
    Normalized format

    Reduces redundancy and increases consistency as data isn't stored in multiple locations
  11. Data Warehouses use which format of data?
    Denormalized format

    More query efficient, but the data may exist in multiple places and become inconsistent
  12. What is a Data Mart?
    A focused version of a data warehouse for specific teams or departments
  13. Data Lake
    A central repository for raw data to go until it is needed

    Can handle structured, unstructured, or semi-structured data

    Usually includes raw data and data after ETL
  14. What is Kafka used for?
    streaming data
  15. Data sources: batch versus streaming process
    Batch data is updated in bulk

    Streaming data is updated in 'real-time'
  16. Data Swamp
    What happens when a data lake has poor data management

    Can happen if there is:
    • no metadata
    • a broken ingestion process
    • broken metadata management
    • no data governance
  17. What is a Lake House
    an intermediary between the unstructured data lake and very structured database/data warehouse
  18. What is a Delta Lake?
    Storage technology that can power the lake house

    Guarantees ACID transactions
  19. HDFS
    Hadoop Distributed File System

    Clusters data on multiple computers to analyze datasets in parallel
  20. Four commonly used data storage systems:
    • Hadoop Distributed File System (HDFS)
    • Amazon's Simple Storage Service (S3)
    • Google's Cloud Storage (GCS)
    • Azure's Blob Storage
  21. Hadoop HDFS is ideal for what sort of data?
    works well for computations that can be split, run in parallel and combined
  22. Hadoop Heartbeat:
    A signal is sent from each DataNode to the NameNode. If the NameNode sees no signal, the DataNode is considered dead

    When DataNodes fail, data may be under-replicated, but the NameNode will send signals to replicate and rebalance the data
  23. Hadoop YARN (came out in Hadoop 2.0)
    Yet Another Resource Negotiator

    Platform responsible for managing computing resources in clusters
  24. What happens when a DataNode fails in HDFS?
    Balancing: when DataNodes fail, data may be under-replicated. The NameNode will send signals to replicate and balance the data
  25. Hadoop MapReduce
    programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster

    Leverages parallel computing
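
    A toy illustration of the MapReduce paradigm in plain Python (real Hadoop jobs are usually written in Java or via Hadoop Streaming; this only sketches the map, shuffle, and reduce phases):

    ```python
    from collections import defaultdict

    documents = ["big data big clusters", "spark and hadoop", "big spark"]

    # Map phase: turn each input record into (key, value) pairs
    mapped = []
    for doc in documents:
        for word in doc.split():
            mapped.append((word, 1))

    # Shuffle phase: group all values by key
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce phase: combine the values for each key
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)  # {'big': 3, 'data': 1, 'clusters': 1, ...}
    ```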
  26. Hadoop limitations
    HDFS requires a decent amount of "on-premises" infrastructure

    difficulties with scalability
  27. Horizontal scaling with Hadoop
    Adding more machines (nodes) to the cluster
  28. Vertical scaling with Hadoop
    Adding additional computational power (CPU, RAM, disk) to existing machines
  29. What type of storage do cloud storage companies use?
    Object storage: each object includes the data, metadata, and a unique identifier

    S3, GCS, Blob storage systems
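
    For example, writing an object to S3 with the boto3 library might look like this (the bucket name, key, and metadata are hypothetical, and valid AWS credentials are assumed):

    ```python
    import boto3

    s3 = boto3.client("s3")

    # An object = the data + metadata + a unique identifier (the bucket/key pair)
    s3.put_object(
        Bucket="my-example-bucket",
        Key="raw/2024/events.json",      # unique identifier within the bucket
        Body=b'{"event": "click"}',      # the data itself
        Metadata={"source": "web"},      # user-defined metadata
    )
    ```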
  30. What are the five major components of Spark?
    • Spark Core as its foundation
    • Spark SQL for SQL queries
    • Spark Streaming for real-time analytics
    • Spark MLlib for machine learning
    • Spark GraphX for graph processing
  31. How can we interact with Big data using python?
    Using pySpark!
  32. Advantage of using Hadoop MapReduce with HDFS:
    allows for massively parallelized operations without worrying about worker distribution or fault tolerance.

  33. Advantage of using Spark (instead of Hadoop MapReduce):
    • Does similar things as MapReduce, but is built to be faster by processing in memory (RAM)
  34. What type of data can Spark work with?
    HDFS, CSV, JSON, S3, GCS (most data sources!)

    If your data is in HDFS, Spark's cluster manager will work with YARN

    If your data is elsewhere, Spark has its own cluster manager to determine how to split up data and run computations
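
    For example, assuming a SparkSession named `spark` already exists (the paths here are hypothetical), reading a few of these sources might look like:

    ```python
    # CSV file in HDFS, letting Spark infer the schema
    df_csv = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

    # Local JSON file
    df_json = spark.read.json("data/events.json")

    # Parquet data in S3 (requires the appropriate S3/Hadoop connector)
    df_s3 = spark.read.parquet("s3a://my-bucket/events/")
    ```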
  35. What is a `Spark Session`?
    It's what you start when you want to use Spark

    Here we define:
    • What the cluster is, and the Workers that handle computation on that cluster/node
    • The central Spark coordinator (i.e. the Driver)
    • The name of the app
  36. Example of pseudo code to start a Spark Session
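    A minimal sketch of what this typically looks like in PySpark (the app name is just a placeholder):

    ```python
    from pyspark.sql import SparkSession

    # 'local[*]' runs Spark locally using all available cores;
    # the app name is just a label for this session
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("MyApp")
        .getOrCreate()
    )
    ```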

    Sometimes you'll put a cluster URL where 'local[*]' is in the code
  37. RDDs
    Resilient Distributed Datasets

    The core object that Spark works on

    RDD: an immutable distributed collection of elements of your data, partitioned across nodes in your cluster, that can be operated on in parallel with a low-level API that offers transformations and actions
  38. RDD transformation
    Transformation: something that creates a new RDD (say by filtering, grouping, or mapping)
  39. RDD Action:
    Action: operation applied to an RDD that performs a computation and sends the result back
  40. How do you explicitly create an RDD with Spark?
    `.sparkContext.parallelize()` method
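
    A minimal sketch (assuming a SparkSession named `spark`):

    ```python
    # Distribute a local Python list across the cluster as an RDD
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

    print(rdd.count())  # 5
    ```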
  41. What are some common transformations done in Spark?
    • map(): apply function to each RDD to return a new RDD
    • filter(): returns a subsetted RDD

    • select()
    • filterByRange()
    • groupByKey()
    • reduceByKey()
    • distinct()
    • sample()
    • union()
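
    A small sketch of chaining transformations (assuming a SparkSession named `spark`):

    ```python
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 5])

    squared = rdd.map(lambda x: x * x)            # new RDD of squares
    evens = squared.filter(lambda x: x % 2 == 0)  # keep only even values
    unique = evens.distinct()                     # drop duplicates

    # Nothing has been computed yet; an action (next card) triggers the work
    ```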
  42. What are some common actions done to RDDs in Spark?
    • reduce()
    • count()
    • min()
    • max()
    • collect()
    • take()
    • first()
    • foreach()
    • aggregate()
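
    A small sketch of actions triggering computation (assuming a SparkSession named `spark`):

    ```python
    rdd = spark.sparkContext.parallelize([3, 1, 4, 1, 5])

    print(rdd.count())                     # 5
    print(rdd.min(), rdd.max())            # 1 5
    print(rdd.reduce(lambda a, b: a + b))  # 14
    print(rdd.take(2))                     # first two elements
    print(rdd.collect())                   # pulls the whole RDD back to the driver
    ```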
  43. What is Fault Tolerance in Hadoop?
    If a node holding data goes down, backup copies on other nodes are used (the data is re-replicated to restore the replication factor)
  44. What is Fault Tolerance in Spark?
    Turns transformations and actions into a directed acyclic graph (DAG) that allows computation to be picked back up if something fails
  45. Why are transformations in spark called lazy?
    They don't compute their results right away

    • Transformations are built up through a DAG
    • Computation is only done when an action requires a result
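
    A quick way to see this laziness (assuming a SparkSession named `spark`):

    ```python
    rdd = spark.sparkContext.parallelize(range(1_000_000))

    # These lines return immediately: only the DAG is built, no data is touched
    doubled = rdd.map(lambda x: x * 2)
    filtered = doubled.filter(lambda x: x > 1_000)

    # Only this action actually runs the computation
    print(filtered.count())
    ```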
  46. DAG
    Directed acyclic graph

    Allows computation to be picked back up if something fails
  47. Broadcast Variables
    Spark can share variables across clusters or machines

    Broadcast variables give access to common (read-only) variables to all workers
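
    A minimal sketch (assuming a SparkSession named `spark`; the lookup table is made up):

    ```python
    # Send a read-only lookup table to every worker once
    lookup = {"a": 1, "b": 2, "c": 3}
    bc_lookup = spark.sparkContext.broadcast(lookup)

    rdd = spark.sparkContext.parallelize(["a", "b", "c", "a"])
    mapped = rdd.map(lambda k: bc_lookup.value[k])  # workers read the shared value

    print(mapped.collect())  # [1, 2, 3, 1]
    ```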
  48. Accumulators
    Spark can share variables across clusters or machines

    Accumulators are variables that each worker can do an operation on (usually things like sums)
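
    A minimal sketch (assuming a SparkSession named `spark`; the log lines are made up):

    ```python
    # A counter that workers can only add to
    errors = spark.sparkContext.accumulator(0)

    def count_errors(line):
        if "ERROR" in line:
            errors.add(1)  # each worker adds to the shared counter

    logs = spark.sparkContext.parallelize(["ok", "ERROR x", "ok", "ERROR y"])
    logs.foreach(count_errors)  # foreach is an action, so this runs now

    print(errors.value)  # 2 -- the total is only readable back on the driver
    ```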