Lately, I’ve seen these words a lot: “big data ecosystem”, “hadoop”, “spark”, “mapreduce”, “hive”, “pig” etc.

I feel I should at least learn a little bit about these concepts.

I found this free course: https://www.udemy.com/course/hadoopstarterkit/. It is short and concise, and it “scratches the surface” of the concepts I want to learn about.

I have some notes from the course. I’d like to organize them according to the “what you’ll learn” list on the course description page. The course certainly delivers what it promises. These notes aren’t a substitute for the course; they’re just for learning and sharing purposes. All rights belong to the original authors of the course.

1. Understand the Big Data problem in terms of storage and computation

big data criteria:

3Vs: volume, velocity (rate of data growth), variety

challenges:

  • storage

  • computational efficiency

  • data loss

  • cost

2. Understand how Hadoop approaches the Big Data problem and provides a solution

the big data problem has 2 major components

2.1. storage

e.g. imagine having 1 TB of data; assuming an average disk transfer rate of 122 MB/s, just reading the data sequentially from one disk takes roughly 2.5 hours

idea: divide the 1 TB into 100 blocks spread over many disks, so they can be read in parallel
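Just to make the arithmetic concrete, here is a back-of-the-envelope sketch in Java. The 122 MB/s rate and the 100-way split come from the course’s example; the exact hour figure depends on which units you assume.

public class ReadTimeEstimate {
    public static void main(String[] args) {
        double dataMb = 1024 * 1024;   // ~1 TB expressed in MB
        double rateMbPerSec = 122;     // assumed average disk transfer rate
        double sequentialHours = dataMb / rateMbPerSec / 3600;
        System.out.printf("one disk, sequential read: ~%.1f hours%n", sequentialHours);   // ~2.4 hours

        int disks = 100;               // the 100 blocks read in parallel from 100 disks
        System.out.printf("%d disks in parallel: ~%.1f minutes%n",
                disks, sequentialHours * 60 / disks);                                     // ~1.4 minutes
    }
}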

2.2. computation

challenge: requesting data and transferring it over the network would choke the network and be slow

idea: run the computation on the same node that stores the data

another problem: what about data loss and recovery?

idea: keep multiple copies (replicas) of each block

Hadoop: HDFS (reliable shared storage) + MapReduce (distributed computation)

3. Understand the need for another file system like HDFS

traditional solutions (that won’t work):

RDBMS: cost, format (data needs to be structured), scalability

grid computing: not mainstream; suited to compute-intensive work on relatively low volumes of data

Hadoop: a good solution

  • support huge volume

  • storage efficiency

  • good data recovery

  • horizontal scaling

  • cost effective

situations suitable for Hadoop vs RDBMS:

Hadoop:

  • dynamic schema

  • linear scaling

  • batch processing

  • petabyte-scale data

  • write once, read many times

RDBMS:

  • static schema

  • nonlinear scaling

  • interactive + batch

  • gigabyte-scale data

  • read and write many times

4. Work with HDFS

4.1 Understand the architecture of HDFS

4.1.1 functions of a file system:

1). control how data is stored and retrieved

2). keep metadata about files and folders

3). permissions and security

4). manage storage space efficiently

4.1.2 benefits of HDFS:

1). supports distributed processing

  • data is stored as blocks, not as whole files

2). handles failures

  • blocks are replicated

3). scalability

  • able to support future expansion

4). cost effective

  • commodity hardware is enough

4.2 more details on the HDFS architecture

e.g. a dataset on a cluster with 100 nodes

the dataset is divided into blocks and spread across the 100 nodes

each block is stored as 3 copies (the default replication factor)

HDFS manages the block info and where each block is stored; it has a global view of the distributed file system and knows how to reconstruct a file from its blocks

  • data nodes: know the blocks they are managing (but not which file those blocks belong to)

  • name node: given a file name, knows the blocks that make up the file

  • on disk: metadata about files and folders

  • in memory: block locations (which data nodes hold each block; see the sketch at the end of this section)

data nodes and the name node are in constant communication

  • node = cpu + ram + disk

  • name node: high on RAM (it keeps the file system metadata and block locations in memory)

  • data node: high on disk space (storage capacity)

network: requirements are the same for the name node and data nodes

  • rack: collection of nodes

  • cluster: racks interconnected
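To see the name node’s view in practice, here is a minimal sketch using Hadoop’s FileSystem Java API to ask where the blocks of a file live. The file path is hypothetical; point it at any file on your cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster described by the local Hadoop configuration
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, reusing the stocks directory from the examples below
        Path file = new Path("/user/hirw/input/stocks/stocks.csv");
        FileStatus status = fs.getFileStatus(file);

        // The name node answers this from its in-memory block map:
        // for each block, which data nodes hold a replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}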

5. Understand the MapReduce programming model

a distributed computing model for processing large data

the framework manages communication, data transfers, and parallel execution

it’s not a programming language

Hadoop implements the MapReduce idea

6. Understand the phases in MapReduce

There are 3 phases (a toy end-to-end example is sketched at the end of this section):

6.1. map:

  • input splits: respect record boundaries

  • mappers: user code; called once per record in the split

  • key-value pairs: what each mapper call emits

the dataset is divided into multiple parts, called input splits

each mapper processes one input split

each mapper is called multiple times (once per record), depending on the content of its input split

a mapper emits key-value pairs as output

a MapReduce job has 1 or more mappers (number of mappers = number of splits)

6.2. shuffle (optional)

each key is assigned to one reducer and sticks to it; when there are multiple reducers, the keys from the mappers’ output need to be sorted, copied, and merged so that all values for a key reach the same reducer
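For intuition on the key-to-reducer assignment: as far as I know, Hadoop’s default HashPartitioner hashes the key and takes it modulo the number of reducers. A small stand-alone sketch of that formula (the class and method names here are mine, not Hadoop’s):

public class WhichReducer {
    // Sketch of the default hash-partitioning logic: hash the key,
    // clear the sign bit, then take it modulo the number of reducers
    static int reducerFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String symbol : new String[] {"ABC", "XYZ", "KLM"}) {
            System.out.println(symbol + " -> reducer " + reducerFor(symbol, 3));
        }
    }
}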

6.3. reduce

  • reducers (there can be multiple)

  • result

the reduce function takes the key-value pairs from multiple map functions as input and reduces them to the output

keys are grouped with their values; the reduce function is called once per key with that key’s list of values

a MapReduce job can have 0, 1, or more reducers

input to each reduce call: a key and the list of values grouped under that key from all mappers
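Before looking at real Hadoop code, here is a toy single-machine simulation of the three phases in plain Java (no Hadoop involved): map each record to a (symbol, close price) pair, group the pairs by key as the shuffle would, then reduce each group to its maximum. The records are made up just to show the data flow.

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceOnPaper {
    public static void main(String[] args) {
        // Toy records in "symbol,close" form (made-up values)
        List<String> records = Arrays.asList("ABC,60.5", "XYZ,12.1", "ABC,61.0", "XYZ,11.8");

        // MAP: each record becomes a (key, value) pair
        // SHUFFLE: pairs are grouped by key
        Map<String, List<Double>> grouped = records.stream()
                .map(r -> r.split(","))
                .collect(Collectors.groupingBy(fields -> fields[0],
                        Collectors.mapping(fields -> Double.parseDouble(fields[1]),
                                Collectors.toList())));

        // REDUCE: called once per key with the list of values for that key
        grouped.forEach((symbol, closes) ->
                System.out.println(symbol + " -> max close " + Collections.max(closes)));
    }
}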

related reading on Spark: https://mapr.com/blog/spark-101-what-it-what-it-does-and-why-it-matters/

7. Envision a problem in MapReduce

example: calculate the Max Close Price per symbol from a stock dataset using MapReduce

need 3 files (a minimal sketch of all three follows this list):

driver program

  • define the MapReduce job

  • set the input and output locations

  • set the input and output formats

    • input format: validates the inputs, splits the input files into logical input splits, and provides the RecordReader implementation

    • output format: validates the output specification (e.g. the output dir must not already exist) and provides the RecordWriter implementation

  • set the Mapper and Reducer classes

  • set the output types

    • Writable: Hadoop’s serializable object format; fast, compact, and efficient, which matters because a lot of data has to be transferred over the network

  • submit the job

mapper program

reducer program
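Here is a minimal sketch of what those three pieces could look like for the Max Close Price problem, combined into one class for brevity. It assumes the same CSV layout as the Pig and Hive examples below (exchange, symbol, date, open, high, low, close, volume, adj_close); the class name and field positions are my assumptions, not the course’s exact code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxClosePrice {

    // MAPPER: called once per input record; emits (symbol, close price)
    public static class MaxClosePriceMapper
            extends Mapper<LongWritable, Text, Text, FloatWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length < 9) {
                return; // skip malformed lines
            }
            try {
                // fields[1] = symbol, fields[6] = close (assumed layout)
                context.write(new Text(fields[1]),
                        new FloatWritable(Float.parseFloat(fields[6])));
            } catch (NumberFormatException e) {
                // skip records whose close price is not a number (e.g. a header line)
            }
        }
    }

    // REDUCER: called once per symbol with all its close prices; emits the maximum
    public static class MaxClosePriceReducer
            extends Reducer<Text, FloatWritable, Text, FloatWritable> {
        @Override
        protected void reduce(Text key, Iterable<FloatWritable> values, Context context)
                throws IOException, InterruptedException {
            float max = Float.NEGATIVE_INFINITY;
            for (FloatWritable value : values) {
                max = Math.max(max, value.get());
            }
            context.write(key, new FloatWritable(max));
        }
    }

    // DRIVER: defines the job, sets input/output locations and types, submits it
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "MaxClosePrice");
        job.setJarByClass(MaxClosePrice.class);
        job.setMapperClass(MaxClosePriceMapper.class);
        job.setReducerClass(MaxClosePriceReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, this would be run with the hadoop jar command, passing the input and output HDFS paths as the two arguments.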

8. Pig and Hive

Pig and Hive have some similarities (both take instructions and translate them into MapReduce jobs); their typical use cases differ, e.g.:

Pig: standard nightly jobs like extracting, transforming, and loading data, and doing predefined aggregations

Hive: ad hoc analysis

8.1 Write Pig Latin instructions

written in Pig Latin, a dataflow language, so you don’t need to write the Java programs shown in the last section to do the same job

A Pig script:

--Load dataset with column names and datatypes
stock_records = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float,volume:int, adj_close:float);

--Group records by symbol
grp_by_sym = GROUP stock_records BY symbol;

--Calculate maximum closing price
max_closing = FOREACH grp_by_sym GENERATE group, MAX(stock_records.close) as maxclose;

--Store output
STORE max_closing INTO 'output/pig/stocks' USING PigStorage(',');

Pig takes these instructions and translates them into MapReduce jobs

8.2 Create and query Hive tables

Hive takes a SQL-like query, converts it into 1 or more MapReduce jobs, and submits them to the Hadoop cluster

### CREATE EXTERNAL TABLE ###
hive> CREATE EXTERNAL TABLE IF NOT EXISTS stocks_starterkit (
exch STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' -- csv file
LOCATION '/user/hirw/input/stocks'; -- location in hdfs

### SELECT 100 RECORDS ###
hive> SELECT * FROM stocks_starterkit
LIMIT 100;

### DESCRIBE TO GET MORE INFORMATION ABOUT TABLE ###
hive> DESCRIBE FORMATTED stocks_starterkit;

### CALCULATE MAX CLOSING PRICE ###
hive> SELECT symbol, max(price_close) max_close FROM stocks_starterkit
GROUP BY symbol;

2 table types in Hive:

external table: the underlying data is not deleted when the table is dropped; use this when the data is also used by MapReduce jobs or Pig

internal (managed) table: the data is deleted when the table is dropped, which could be “dangerous”