Hadoop: Just the Basics for Big Data Rookies


Slide 1: Hadoop: Just the Basics for Big Data Rookies
Adam Shook
ashook@gopivotal.com


Slide 2: Agenda
Hadoop Overview
HDFS Architecture
Hadoop MapReduce
Hadoop Ecosystem
MapReduce Primer

Buckle up!




Slide 3: Hadoop Overview


Slide 4: Hadoop Core
Open-source Apache project out of Yahoo! in 2006
Distributed fault-tolerant data storage and batch processing
Provides linear scalability on commodity hardware
Adopted by many:
Amazon, AOL, eBay, Facebook, Foursquare, Google, IBM, Netflix, Twitter, Yahoo!, and many, many more



Slide 5: Why?
Bottom line:
Flexible
Scalable
Inexpensive


Slide 6: Overview
Great at
Reliable storage for multi-petabyte data sets
Batch queries and analytics
Complex hierarchical data structures with changing schemas, unstructured and structured data
Not so great at
Changes to files (can't do it…)
Low-latency responses
Analyst usability
This is less of a concern now due to higher-level languages

Slide 7: Data Structure
Bytes!
No more ETL necessary
Store data now, process later
Structure on read
Built-in support for common data types and formats
Extendable
Flexible


Slide 8: Versioning
Versions 0.20.x, 0.21.x, 0.22.x, 1.x.x
Two main MR packages:
org.apache.hadoop.mapred (deprecated)
org.apache.hadoop.mapreduce (new hotness)
Version 2.x.x, alpha'd in May 2012
NameNode HA
YARN – Next Gen MapReduce


Slide 9: HDFS Architecture


Slide 10: HDFS Overview
Hierarchical UNIX-like file system for data storage
sort of
Splitting of large files into blocks
Distribution and replication of blocks to nodes
Two key services
Master NameNode
Many DataNodes
Checkpoint Node (Secondary NameNode)

Slide 11: NameNode
Single master service for HDFS
Single point of failure (HDFS 1.x)
Stores file-to-block-to-location mappings in the namespace
All transactions are logged to disk
NameNode startup reads namespace image and logs


Slide 12: Checkpoint Node (Secondary NN)
Performs checkpoints of the NameNode's namespace and logs
Not a hot backup!
Loads up namespace
Reads log transactions to modify namespace
Saves namespace as a checkpoint

Slide 13: DataNode
Stores blocks on local disk
Sends frequent heartbeats to NameNode
Sends block reports to NameNode
Clients connect to DataNodes for I/O

Slide 14: How HDFS Works - Writes
Client contacts NameNode to write data
NameNode says write it to these nodes
Client sequentially writes blocks to the DataNodes


Slide 15: How HDFS Works - Writes
DataNodes replicate data blocks, orchestrated by the NameNode


Slide 16: How HDFS Works - Reads
Client contacts NameNode to read data
NameNode says you can find it here
Client sequentially reads blocks from the DataNodes
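
To make the write/read flow concrete, here is a minimal client sketch against the org.apache.hadoop.fs.FileSystem API; the path and contents are illustrative, and the configuration is assumed to point at a running cluster via core-site.xml on the classpath:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/hello.txt");  // illustrative path

        // Write: the client streams bytes to DataNodes chosen by the NameNode
        FSDataOutputStream out = fs.create(file);
        out.write("Hello, HDFS!\n".getBytes("UTF-8"));
        out.close();

        // Read: the NameNode returns block locations, data flows from DataNodes
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());
        in.close();
    }
}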


Slide 17: How HDFS Works - Failure
Client connects to another node serving that block


Slide 18: Block Replication
Default of three replicas
Rack-aware system
One block on same rack
One block on same rack, different host
One block on another rack
Automatic re-copy by NameNode, as needed
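
The replication factor is a per-file attribute: the cluster default comes from dfs.replication in hdfs-site.xml, and a file can be adjusted at runtime. A small sketch, assuming a reachable cluster and an illustrative path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask the NameNode to keep five copies of this file's blocks;
        // re-replication happens in the background
        fs.setReplication(new Path("/data/important.log"), (short) 5);
    }
}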




Slide 19: HDFS 2.0 Features
NameNode High Availability (HA)
Two redundant NameNodes in active/passive configuration
Manual or automated failover
NameNode Federation
Multiple independent NameNodes using the same collection of DataNodes

Slide 20: Hadoop MapReduce


Slide 21: Hadoop MapReduce 1.x
Moves the code to the data
JobTracker
Master service to monitor jobs
TaskTracker
Multiple services to run tasks
Same physical machine as a DataNode
A job contains many tasks
A task contains one or more task attempts

Slide 22: JobTracker
Monitors job and task progress
Issues task attempts to TaskTrackers
Retries failed task attempts
Four failed attempts = one failed job
Schedules jobs in FIFO order
Fair Scheduler available as an alternative
Single point of failure for MapReduce



Slide 23: TaskTrackers
Runs on same node as DataNode service
Sends heartbeats and task reports to JobTracker
Configurable number of map and reduce slots
Runs map and reduce task attempts
Separate JVM!



Slide 24: Exploiting Data Locality
JobTracker will schedule a task on a TaskTracker that is local to the block
3 options!
If TaskTracker is busy, selects TaskTracker on same rack
Many options!
If still busy, chooses an available TaskTracker at random
Rare!

Slide 25: How MapReduce Works
Client submits job to JobTracker
JobTracker submits tasks to TaskTrackers
Job output is written to DataNodes with replication
JobTracker reports metrics


Slide 26: How MapReduce Works - Failure
JobTracker assigns task to different node


Slide 27: YARN
Abstract framework for distributed application development
Split functionality of JobTracker into two components:
ResourceManager
ApplicationMaster
TaskTracker becomes NodeManager
Containers instead of map and reduce slots
Configurable amount of memory per NodeManager



Slide 28: MapReduce 2.x on YARN
MapReduce API has not changed
Rebuild required to upgrade from 1.x to 2.x
ApplicationMaster launches and monitors job via YARN
MapReduce History Server to store… history


Slide 29: Hadoop Ecosystem


Slide 30: Hadoop Ecosystem
Core Technologies
Hadoop Distributed File System
Hadoop MapReduce
Many other tools…
Which I will be describing… now


Slide 31: Moving Data
Sqoop
Moving data between RDBMS and HDFS
Say, migrating MySQL tables to HDFS
Flume
Streams event data from sources to sinks
Say, weblogs from multiple servers into HDFS


Slide 32: Flume Architecture


Slide 33: Higher-Level APIs
Pig
Data-flow language, aptly named PigLatin, to generate one or more MapReduce jobs against data stored locally or in HDFS
Hive
Data warehousing solution, allowing users to write SQL-like queries to generate a series of MapReduce jobs against data stored in HDFS

Slide 34: Pig Word Count

A = LOAD '$input';
B = FOREACH A GENERATE FLATTEN(TOKENIZE($0)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group AS word, COUNT(B);
STORE D INTO '$output';

Slide 35: Key/Value Stores
HBase
Accumulo
Implementations of Google's BigTable for HDFS
Provides random, real-time access to big data
Supports updates and deletes of key/value pairs
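
As a sketch of that random-access model, here is a put and get using the HBase 0.94-era Java client; the users table, info column family, and row key are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");  // table assumed to exist

        // A put to an existing row key is an update in place
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("adam"));
        table.put(put);

        // Random, real-time read of a single row
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        table.close();
    }
}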

Slide 36: HBase Architecture
(diagram: Master, ZooKeeper, Client, HDFS)


Slide 37: Data Structure
Avro
Data serialization system designed for the Hadoop ecosystem
Expressed as JSON
Parquet
Compressed, efficient columnar storage for Hadoop and other systems

Slide 38: Scalable Machine Learning
Mahout
Library for scalable machine learning written in Java
Very robust examples!
Classification, Clustering, Pattern Mining, Collaborative Filtering, and much more

Slide 39: Workflow Management
Oozie
Scheduling system for Hadoop jobs
Support for:
Java MapReduce
Streaming MapReduce
Pig, Hive, Sqoop, DistCp
Any ol' Java or shell script program

Slide 40: Real-Time Stream Processing
Storm
Open-source project that feeds streams of data from sources, called spouts, to a series of execution agents called bolts
Scalable and fault-tolerant, with guaranteed processing of data
Benchmarks of over a million tuples processed per second per node

Slide 41: Distributed Application Coordination
ZooKeeper
An effort to develop and maintain an open-source server which enables highly reliable distributed coordination
Designed to be simple, replicated, ordered, and fast
Provides configuration management, distributed synchronization, and group services for applications
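
A minimal sketch of that coordination model with the ZooKeeper Java client; the ensemble address, znode path, and payload are illustrative assumptions:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble (address is an assumption)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) { /* ignore events */ }
        });

        // Publish a piece of shared configuration as a znode
        zk.create("/demo-config", "maxWorkers=10".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any coordinating process sees the same ordered, replicated view
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}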

Slide 42: ZooKeeper Architecture


Slide 43: Hadoop Streaming
Write MapReduce mappers and reducers using stdin and stdout
Execute on the command line using the Hadoop Streaming JAR

hadoop jar hadoop-streaming.jar -input input -output outputdir \
  -mapper org.apache.hadoop.mapreduce.Mapper -reducer /bin/wc

Slide 44: SQL on Hadoop
Apache Drill
Cloudera Impala
Hive Stinger
Pivotal HAWQ
MPP execution of SQL queries against HDFS data

Slide 45: HAWQ Architecture


Slide 46: That's a lot of projects
I am likely missing several (Sorry, guys!)
Each cropped up to solve a limitation of Hadoop Core
Know your ecosystem
Pick the right tool for the right job



Slide 47: Sample Architecture
(diagram: three Flume Agents, HDFS, MapReduce, Pig, HBase, Storm, Oozie, Webserver, Sales, Call Center, SQL)


Slide 48: MapReduce Primer


Slide 49: MapReduce Paradigm
Data processing system with two key phases
Map
Perform a map function on input key/value pairs to generate intermediate key/value pairs
Reduce
Perform a reduce function on intermediate key/value groups to generate output key/value pairs
Groups created by sorting map output

Slide 50: (diagram: Map Input → Map Output → Reducer Input Groups → Reducer Output)




Slide 51: Hadoop MapReduce Components
Map Phase
Input Format
Record Reader
Mapper
Combiner
Partitioner
Reduce Phase
Shuffle
Sort
Reducer
Output Format
Record Writer


Slide 52: Writable Interfaces

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

public interface WritableComparable<T> extends Writable, Comparable<T> {
}

BooleanWritable
BytesWritable
ByteWritable
DoubleWritable
FloatWritable
IntWritable
LongWritable
NullWritable
Text
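
Custom key types just implement these interfaces; below is a hedged sketch of a hypothetical composite key (the class name and fields are illustrative, not from the deck):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: serialized field-by-field,
// compared during the framework's sort phase
public class UserDayWritable implements WritableComparable<UserDayWritable> {
    private String user = "";
    private int day = 0;

    public void write(DataOutput out) throws IOException {
        out.writeUTF(user);
        out.writeInt(day);
    }

    public void readFields(DataInput in) throws IOException {
        user = in.readUTF();
        day = in.readInt();
    }

    public int compareTo(UserDayWritable other) {
        int cmp = user.compareTo(other.user);
        return cmp != 0 ? cmp : Integer.compare(day, other.day);
    }
}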


Slide 53: InputFormat

public abstract class InputFormat<K, V> {

  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  public abstract RecordReader<K, V> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException;
}

Slide 54: RecordReader

public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {

  public abstract void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException;

  public abstract boolean nextKeyValue()
      throws IOException, InterruptedException;

  public abstract KEYIN getCurrentKey()
      throws IOException, InterruptedException;

  public abstract VALUEIN getCurrentValue()
      throws IOException, InterruptedException;

  public abstract float getProgress()
      throws IOException, InterruptedException;

  public abstract void close() throws IOException;
}

Slide 55: Mapper

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  protected void setup(Context context)
      throws IOException, InterruptedException { /* NOTHING */ }

  protected void cleanup(Context context)
      throws IOException, InterruptedException { /* NOTHING */ }

  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  public void run(Context context)
      throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue())
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    cleanup(context);
  }
}

Slide 56: Partitioner

public abstract class Partitioner<KEY, VALUE> {

  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}

Default HashPartitioner uses key's hashCode() % numPartitions
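
Overriding the default is a one-method job; here is a hypothetical sketch that routes keys by their first character (the class name and scheme are illustrative, and keys are assumed non-empty):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical: all words starting with the same letter go to the same reducer
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask to non-negative before the modulo, as HashPartitioner does
        return (Character.toLowerCase(key.toString().charAt(0)) & Integer.MAX_VALUE)
                % numPartitions;
    }
}

It would be wired in with job.setPartitionerClass(FirstLetterPartitioner.class).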

Slide 57: Reducer

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  protected void setup(Context context)
      throws IOException, InterruptedException { /* NOTHING */ }

  protected void cleanup(Context context)
      throws IOException, InterruptedException { /* NOTHING */ }

  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
      throws IOException, InterruptedException {
    for (VALUEIN value : values)
      context.write((KEYOUT) key, (VALUEOUT) value);
  }

  public void run(Context context)
      throws IOException, InterruptedException {
    setup(context);
    while (context.nextKey())
      reduce(context.getCurrentKey(), context.getValues(), context);
    cleanup(context);
  }
}

Slide 58: OutputFormat

public abstract class OutputFormat<K, V> {

  public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
      throws IOException, InterruptedException;

  public abstract void checkOutputSpecs(JobContext context)
      throws IOException, InterruptedException;

  public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException, InterruptedException;
}

Slide 59: RecordWriter

public abstract class RecordWriter<K, V> {

  public abstract void write(K key, V value)
      throws IOException, InterruptedException;

  public abstract void close(TaskAttemptContext context)
      throws IOException, InterruptedException;
}

Slide 60: Word Count Example


Slide 61: Problem
Count the number of times each word is used in a body of text
Uses TextInputFormat and TextOutputFormat

map(byte_offset, line)
  foreach word in line
    emit(word, 1)

reduce(word, counts)
  sum = 0
  foreach count in counts
    sum += count
  emit(word, sum)


Slide 62: Mapper Code

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable ONE = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);

    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, ONE);
    }
  }
}

Slide 63: Shuffle and Sort
Mapper outputs to a single logically partitioned file
Reducers copy their parts
Reducer merges partitions, sorting by key

Slide 64: Reducer Code

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable outvalue = new IntWritable();
  private int sum = 0;

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    outvalue.set(sum);
    context.write(key, outvalue);
  }
}
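
The deck stops at the mapper and reducer; to actually run them you also need a driver. This is the conventional Job-API pattern rather than code from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setCombinerClass(IntSumReducer.class); // safe here: summing is associative
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}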

Slide 65: So what's so hard about it?


Slide 66: So what's so hard about it?
MapReduce is a limitation
Entirely different way of thinking
Simple processing operations such as joins are not so easy when expressed in MapReduce
Proper implementation is not so easy
Lots of configuration and implementation details for optimal performance
Number of reduce tasks, data skew, JVM size, garbage collection

Slide 67: So what does this mean for you?
Hadoop is written primarily in Java
Components are extendable and configurable
Custom I/O through Input and Output Formats
Parse custom data formats
Read and write using external systems
Higher-level tools enable rapid development of big data analysis

Slide 68: Resources, Wrap-up, etc.
http://hadoop.apache.org
Very supportive community
Strata + Hadoop World, Oct. 28th–30th in Manhattan
Plenty of resources available to learn more
Blogs
Email lists
Books
Shameless plug: MapReduce Design Patterns


Slide 69: Getting Started
Pivotal HD Single-Node VM and Community Edition
http://gopivotal.com/pivotal-products/data/pivotal-hd
For the brave and bold: roll your own!
http://hadoop.apache.org/docs/current




Slide 70: Acknowledgements
Apache Hadoop, the Hadoop elephant logo, HDFS, Accumulo, Avro, Drill, Flume, HBase, Hive, Mahout, Oozie, Pig, Sqoop, YARN, and ZooKeeper are trademarks of the Apache Software Foundation
Cloudera Impala is a trademark of Cloudera
Parquet is copyright Twitter, Cloudera, and other contributors
Storm is licensed under the Eclipse Public License



Slide 71: Learn More. Stay Connected.
Talk to us on Twitter: @springcentral
Find session replays on YouTube: spring.io/video
