Lately, I’ve seen these words a lot: “big data ecosystem”, “hadoop”, “spark”, “mapreduce”, “hive”, “pig” etc.
I feel I should at least learn a little bit about these concepts.
I found this free course: https://www.udemy.com/course/hadoopstarterkit/. It is short and concise, and it "scratches the surface" of the concepts I want to learn about.
I have some notes from the course. I'd like to organize them according to the "what you'll learn" list on the course description page. The course certainly delivers what it promised. The notes shouldn't be a substitute for the course; they're just for learning and sharing purposes. All rights belong to the original authors of the course.
1. Understand the Big Data problem in terms of storage and computation
big data criteria:
3Vs: volume, velocity (rate of data growth), variety
challenges:
- storage
- computational efficiency
- data loss
- cost
2. Understand how Hadoop approaches the Big Data problem and provides a solution to it
the big data problem has 2 major components
2.1. storage
e.g. imagine having 1 TB of data; at an average disk read rate of 122 MB/s, just reading the data from a single disk takes roughly 2.5 hours
idea: divide the 1 TB into ~100 blocks stored on different disks/nodes so they can be read in parallel
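(roughly: 1 TB ≈ 1,000,000 MB, and 1,000,000 MB ÷ 122 MB/s ≈ 8,200 s, i.e. in the 2–2.5 hour range; with the data split into 100 blocks read in parallel from 100 disks, that drops to about 82 s)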
2.2. computation
challenge: requesting and transferring data over the network chokes the network and is slow
idea: run the computation on the node that already stores the data (move the computation to the data, not the data to the computation)
another problem: what about data loss and recovery?
idea: keep multiple copies (replicas) of each block
hadoop: HDFS (reliable shared storage) + mapreduce (distributed computation)
3. Understand the need for another file system like HDFS
traditional solutions (that won’t work):
RDBMS: cost, format (data needs to be structured), scalability
grid computing: not for mainstream use, suited to low-volume data with heavy computation
hadoop: a good solution
- supports huge volumes
- storage efficiency
- good data recovery
- horizontal scaling
- cost effective
situations suitable for Hadoop vs RDBMS:
Hadoop:
- dynamic schema
- linear scaling
- batch processing
- petabytes of data
- write once, read many times
RDBMS:
- static schema
- nonlinear scaling
- interactive + batch
- gigabytes of data
- read and write many times
4. Work with HDFS
4.1 Understand the architecture of HDFS
4.1.1 functions of a file system:
1). control how data is stored and retrieved
2). metadata about files and folders
3). permissions and security
4). manage storage space efficiently
4.1.2 benefits of HDFS:
1). support distributed processing
- data is stored as blocks, not as whole files
2). handle failures:
- replicate blocks
3). scalability
- able to support future expansion
4). cost effective
- commodity hardware is enough
4.2 HDFS in more detail
e.g. a dataset on a cluster with 100 nodes
the dataset is divided into blocks and spread across the 100 nodes
each block is replicated 3 times (the default replication factor)
hdfs manages the block info and where each block is stored; it has a global view of the distributed file system and knows how to reconstruct a file from its blocks
- data nodes: know the blocks they are managing (but not which file those blocks belong to)
- name node: given a file name, knows the blocks that make up the file; it keeps the metadata of files and folders on disk and the block locations in memory
data nodes and the name node are in constant communication
- a node = cpu + ram + disk
- name node: needs a lot of RAM (the block locations live in memory)
- data node: needs a lot of disk space (storage capacity)
network: the same requirements for name node and data nodes
- rack: a collection of nodes
- cluster: racks interconnected
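To make the name node / data node picture a bit more concrete, here is a minimal Java sketch (not from the course) that talks to HDFS through Hadoop's FileSystem client API. It assumes a Hadoop client configuration is on the classpath; the /user/hirw/input/stocks path is just reused from the course examples, and stocks.csv is a hypothetical local file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTour {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath,
        // which tell the client where the name node is
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The name node answers this query: which files live under the
        // directory, and how big they are
        for (FileStatus status : fs.listStatus(new Path("/user/hirw/input/stocks"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // Copy a local file into HDFS: the client asks the name node where to
        // place each block, then streams the data to the chosen data nodes,
        // which replicate the block among themselves
        fs.copyFromLocalFile(new Path("stocks.csv"),
                             new Path("/user/hirw/input/stocks/stocks.csv"));

        fs.close();
    }
}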
5. Understand the MapReduce programming model
a distributed computing model for processing large datasets
it manages communication, data transfer and parallel execution
it's not a programming language
hadoop implements the mapreduce idea
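the classic illustration is word count: every mapper emits (word, 1) for each word in its split, the framework groups the pairs by word, and each reducer sums the 1s to produce (word, total count)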
6. Understand the phases in MapReduce
There are 3 steps:
6.1. map:
- input splits: know the record boundaries
- mappers: any program; processes each record in the split once
- key-value pairs: what the mapper returns after processing
the dataset is divided into multiple parts, called input splits
each mapper processes one input split
each mapper can be called multiple times (once per record), depending on the content of the input split
the mapper emits key-value pairs as output
there will be 1 or more mappers in a mapreduce job (number of mappers = number of splits)
6.2. shuffle (optional)
each key is assigned to one reducer and sticks to it, so when there are multiple reducers the framework may need to sort, copy and merge keys from the mappers
6.3. reduce
- reducers (there can be more than one)
- result
the reduce function takes key-value pairs from multiple map functions as input and reduces them to the output
keys are grouped together with their values; the reduce function is called once per key, with all of that key's values
there can be 0, 1 or more reduce functions in a mapreduce job
input: a key and the list of values grouped under that key from all the mappers
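for instance, in the max closing price example in the next section, if two mappers emit (ABC, 60.2) and (ABC, 61.5) for some symbol ABC, the shuffle delivers (ABC, [60.2, 61.5]) to a single reducer, which outputs (ABC, 61.5)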
https://mapr.com/blog/spark-101-what-it-what-it-does-and-why-it-matters/
7. Envision a problem in MapReduce
example: calculate Max Close Price from stock dataset using MapReduce
need 3 files (a Java sketch of all three follows this list):
driver program:
- define the MapReduce job
- set the input and output locations
- set the input and output formats
- input format: validates the inputs, divides the input files into logical input splits, provides the RecordReader implementation
- output format: validates the output specification (the output dir must not already exist), provides the RecordWriter implementation
- set the mapper and reducer classes
- set the output types
- Writable: Hadoop's serializable object format; fast, compact and efficient, which matters because a lot of data has to be transferred over the network
- submit the job
mapper program
reducer program
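Below is a minimal sketch of what those three pieces can look like in Java (not the course's exact code, and combined into one file with nested classes for brevity). It assumes the same CSV layout as the Pig and Hive examples later, i.e. the symbol in column 2 and the closing price in column 7, and no header row.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxClosePrice {

    // Mapper: called once per record of its input split; emits (symbol, close price)
    public static class MaxCloseMapper
            extends Mapper<LongWritable, Text, Text, FloatWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            String symbol = fields[1];                       // column 2: stock symbol
            float close = Float.parseFloat(fields[6]);       // column 7: closing price
            context.write(new Text(symbol), new FloatWritable(close));
        }
    }

    // Reducer: called once per symbol with all of its close prices; emits the maximum
    public static class MaxCloseReducer
            extends Reducer<Text, FloatWritable, Text, FloatWritable> {
        @Override
        protected void reduce(Text key, Iterable<FloatWritable> values, Context context)
                throws IOException, InterruptedException {
            float max = Float.NEGATIVE_INFINITY;
            for (FloatWritable value : values) {
                max = Math.max(max, value.get());
            }
            context.write(key, new FloatWritable(max));
        }
    }

    // Driver: defines the job, input/output locations and types, then submits it
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "MaxClosePrice");
        job.setJarByClass(MaxClosePrice.class);
        job.setMapperClass(MaxCloseMapper.class);
        job.setReducerClass(MaxCloseReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job would be packaged into a jar and submitted with something like hadoop jar; the default TextInputFormat handles the split/RecordReader part described above, and the default TextOutputFormat writes the (symbol, max) pairs out.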
8. Pig and Hive
Pig and Hive have some similarities (both take instructions and translate them into mapreduce jobs); their typical use cases differ, e.g.
pig: for standard nightly jobs like extracting, transforming and loading data, and doing predefined aggregations
hive: for ad hoc analysis
8.1 Write Pig Latin instructions
Pig scripts are written in Pig Latin, a dataflow language, so you don't need to write the Java programs shown in the last section to do the same job
A pig script
--Load dataset with column names and datatypes
stock_records = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float,volume:int, adj_close:float);
--Group records by symbol
grp_by_sym = GROUP stock_records BY symbol;
--Calculate maximum closing price
max_closing = FOREACH grp_by_sym GENERATE group, MAX(stock_records.close) as maxclose;
--Store output
STORE max_closing INTO 'output/pig/stocks' USING PigStorage(',');
the pig script is just a set of instructions that Pig translates into mapreduce jobs
8.2 Create and query Hive tables
hive takes an SQL query, converts it into 1 or more mapreduce jobs and submits them to the hadoop cluster
### CREATE EXTERNAL TABLE ###
hive> CREATE EXTERNAL TABLE IF NOT EXISTS stocks_starterkit (
exch STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','  -- csv file
LOCATION '/user/hirw/input/stocks';  -- location in hdfs
### SELECT 100 RECORDS ###
hive> SELECT * FROM stocks_starterkit
LIMIT 100;
### DESCRIBE TO GET MORE INFORMATION ABOUT TABLE ###
hive> DESCRIBE FORMATTED stocks_starterkit;
### CALCULATE MAX CLOSING PRICE ###
hive> SELECT symbol, max(price_close) max_close FROM stocks_starterkit
GROUP BY symbol;
2 table types in Hive:
external table: dropping the table does not delete the underlying data (only the metadata); use this when the data is also used by mapreduce jobs or pig
internal (managed) table: dropping the table also deletes the data, which can be "dangerous"