Slide 1: Hadoop
Just the Basics for Big Data Rookies
Adam Shook
ashook@gopivotal.com
Slide 2: Agenda
Hadoop Overview
HDFS Architecture
Hadoop MapReduce
Hadoop Ecosystem
MapReduce Primer
Buckle up!
Slide 4: Hadoop Core
Open-source Apache project out of Yahoo! in 2006
Distributed, fault-tolerant data storage and batch processing
Provides linear scalability on commodity hardware
Adopted by many:
Amazon, AOL, eBay, Facebook, Foursquare, Google, IBM, Netflix, Twitter, Yahoo!, and many, many more
Slide 5: Why?
Bottom line:
Flexible
Scalable
Inexpensive
Slide 6: Overview
Great at
Reliable storage for multi-petabyte data sets
Batch queries and analytics
Complex hierarchical data structures with changing schemas; unstructured and structured data
Not so great at
Changes to files (can’t do it…)
Low-latency responses
Analyst usability
This is less of a concern now due to higher-level languages
Slide 7: Data Structure
Bytes!
No more ETL necessary
Store data now, process later
Structure on read
Built-in support for common data types and formats
Extendable
Flexible
Slide 8: Versioning
Version 0.20.x, 0.21.x, 0.22.x, 1.x.x
Two main MR packages:
org.apache.hadoop.mapred (deprecated)
org.apache.hadoop.mapreduce (new hotness)
Version 2.x.x, alpha'd in May 2012
NameNode HA
YARN – Next Gen MapReduce
Slide 10: HDFS Overview
Hierarchical UNIX-like file system for data storage (sort of)
Splitting of large files into blocks
Distribution and replication of blocks to nodes
Two key services
Master NameNode
Many DataNodes
Checkpoint Node (Secondary NameNode)
Slide 11: NameNode
Single master service for HDFS
Single point of failure (HDFS 1.x)
Stores file-to-block-to-location mappings in the namespace
All transactions are logged to disk
NameNode startup reads namespace image and logs
Slide 12: Checkpoint Node (Secondary NN)
Performs checkpoints of the NameNode’s namespace and logs
Not a hot backup!
Loads up namespace
Reads log transactions to modify namespace
Saves namespace as a checkpoint
Slide 13: DataNode
Stores blocks on local disk
Sends frequent heartbeats to NameNode
Sends block reports to NameNode
Clients connect to DataNode for I/O
Slide 14: How HDFS Works - Writes
Client contacts NameNode to write data
NameNode says: write it to these nodes
Client sequentially writes blocks to DataNodes
Slide 15: How HDFS Works - Writes
DataNodes replicate data blocks, orchestrated by the NameNode
Slide 16: How HDFS Works - Reads
Client contacts NameNode to read data
NameNode says: you can find it here
Client sequentially reads blocks from DataNodes
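From client code, all of this is hidden behind the Java FileSystem API. A minimal sketch of a write followed by a read (the path is hypothetical, and fs.defaultFS is assumed to point at the cluster's NameNode):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");  // hypothetical path

    // Write: the NameNode chooses DataNodes; the client streams blocks to them
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("Hello HDFS");
    }

    // Read: the NameNode returns block locations; the client reads from the DataNodes
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
  }
}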
Slide 17: How HDFS Works - Failure
Client connects to another node serving that block
Slide 18: Block Replication
Default of three replicas
Rack-aware system
One replica on the local node
One replica on a different host in the same rack
One replica on a node in another rack
Automatic re-copy by NameNode, as needed
Slide 19: HDFS 2.0 Features
NameNode High-Availability (HA)
Two redundant NameNodes in active/passive configuration
Manual or automated failover
NameNode Federation
Multiple independent NameNodes using the same collection of DataNodes
Slide 21: Hadoop MapReduce 1.x
Moves the code to the data
JobTracker
Master service to monitor jobs
TaskTracker
Multiple services to run tasks
Same physical machine as a DataNode
A job contains many tasks
A task contains one or more task attempts
Slide 22: JobTracker
Monitors job and task progress
Issues task attempts to TaskTrackers
Re-tries failed task attempts
Four failed attempts = one failed job
Schedules jobs in FIFO order
Fair Scheduler
Single point of failure for MapReduce
Slide 23: TaskTrackers
Runs on same node as DataNode service
Sends heartbeats and task reports to JobTracker
Configurable number of map and reduce slots
Runs map and reduce task attempts
Separate JVM!
Slide 24: Exploiting Data Locality
JobTracker will schedule a task on a TaskTracker that is local to the block
3 options!
If TaskTracker is busy, selects TaskTracker on same rack
Many options!
If still busy, chooses an available TaskTracker at random
Rare!
Slide 25: How MapReduce Works
Client submits job to JobTracker
JobTracker submits tasks to TaskTrackers
Job output is written to DataNodes with replication
JobTracker reports metrics
Slide 26: How MapReduce Works - Failure
JobTracker assigns task to different node
Slide 27: YARN
Abstract framework for distributed application development
Split functionality of JobTracker into two components
ResourceManager
ApplicationMaster
TaskTracker becomes NodeManager
Containers instead of map and reduce slots
Configurable amount of memory per NodeManager
Slide 28: MapReduce 2.x on YARN
MapReduce API has not changed
Rebuild required to upgrade from 1.x to 2.x
Application Master launches and monitors job via YARN
MapReduce History Server to store… history
Slide 30: Hadoop Ecosystem
Core Technologies
Hadoop Distributed File System
Hadoop MapReduce
Many other tools…
Which I will be describing… now
Slide 31: Moving Data
Sqoop
Moving data between RDBMS and HDFS
Say, migrating MySQL tables to HDFS
Flume
Streams event data from sources to sinks
Say, weblogs from multiple servers into HDFS
Slide 33: Higher Level APIs
Pig
Data-flow language (aptly named PigLatin) to generate one or more MapReduce jobs against data stored locally or in HDFS
Hive
Data warehousing solution, allowing users to write SQL-like queries to generate a series of MapReduce jobs against data stored in HDFS
Slide 34: Pig Word Count
A = LOAD '$input';
B = FOREACH A GENERATE FLATTEN(TOKENIZE($0)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group AS word, COUNT(B);
STORE D INTO '$output';
Slide 35: Key/Value Stores
HBase
Accumulo
Implementations of Google's Bigtable for HDFS
Provides random, real-time access to big data
Supports updates and deletes of key/value pairs
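To give a feel for that access pattern, here is a minimal sketch using the newer HBase Java client API (the table, column family, and values are hypothetical):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrud {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table

      // Update: write a cell into column family "info"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Adam"));
      table.put(put);

      // Random, real-time read of that row
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

      // Delete the row's cells
      table.delete(new Delete(Bytes.toBytes("row1")));
    }
  }
}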
Slide 36: HBase Architecture
[Architecture diagram: Client, Master, ZooKeeper, HDFS]
Slide 37: Data Structure
Avro
Data serialization system designed for the Hadoop ecosystem
Expressed as JSON
Parquet
Compressed, efficient columnar storage for Hadoop and other systems
Slide 38: Scalable Machine Learning
Mahout
Library for scalable machine learning written in Java
Very robust examples!
Classification, Clustering, Pattern Mining, Collaborative Filtering, and much more
Slide 39: Workflow Management
Oozie
Scheduling system for Hadoop Jobs
Support for:
Java MapReduce
Streaming MapReduce
Pig, Hive, Sqoop, DistCp
Any ol’ Java or shell script program
Slide 40: Real-time Stream Processing
Storm
Open-source project that runs streams of data from sources, called spouts, through a series of processing agents called bolts
Scalable and fault-tolerant, with guaranteed processing of data
Benchmarks of over a million tuples processed per second per node
Slide 41: Distributed Application Coordination
ZooKeeper
An effort to develop and maintain an open-source server which enables highly reliable distributed coordination
Designed to be simple, replicated, ordered, and fast
Provides configuration management, distributed synchronization, and group services for applications
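As a flavor of the client API, a minimal sketch that stores and reads a piece of shared configuration as a znode (the ensemble address, znode path, and value are hypothetical):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    // Connect to the ZooKeeper ensemble (hypothetical address)
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> { });

    // Create a persistent znode holding a config value, if it does not exist yet
    String path = "/app-config";
    if (zk.exists(path, false) == null) {
      zk.create(path, "batch.size=128".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Read it back; passing true sets a watch so the client is notified of changes
    byte[] data = zk.getData(path, true, null);
    System.out.println(new String(data));

    zk.close();
  }
}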
Slide 43: Hadoop Streaming
Write MapReduce mappers and reducers using stdin and stdout
Execute on the command line using the Hadoop Streaming JAR
hadoop jar hadoop-streaming.jar -input input -output outputdir -mapper org.apache.hadoop.mapreduce.Mapper -reducer /bin/wc
Slide 44: SQL on Hadoop
Apache Drill
Cloudera Impala
Hive Stinger
Pivotal HAWQ
MPP execution of SQL queries against HDFS data
Slide 46: That's a lot of projects
I am likely missing several (Sorry, guys!)
Each cropped up to solve a limitation of Hadoop Core
Know your ecosystem
Pick the right tool for the right job
Slide 47: Sample Architecture
[Architecture diagram: Webserver, Sales, and Call Center sources; Flume Agents; HDFS; MapReduce; Pig; HBase; Storm; Oozie; SQL access]
Slide 49: MapReduce Paradigm
Data processing system with two key phases
Map
Perform a map function on input key/value pairs to generate intermediate key/value pairs
Reduce
Perform a reduce function on intermediate key/value groups to generate output key/value pairs
Groups created by sorting map output
Slide 50
[Diagram: Map Input → Map Output → Reducer Input Groups → Reducer Output]
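To make those stages concrete, here is how the word-count example later in the deck would flow through them (sample input invented for illustration; map input keys are byte offsets):

Map input:            (0, "the cat sat")  (12, "the dog")
Map output:           ("the", 1) ("cat", 1) ("sat", 1) ("the", 1) ("dog", 1)
Reducer input groups: ("cat", [1]) ("dog", [1]) ("sat", [1]) ("the", [1, 1])
Reducer output:       ("cat", 1) ("dog", 1) ("sat", 1) ("the", 2)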
Slide 51: Hadoop MapReduce Components
Map Phase
Input Format
Record Reader
Mapper
Combiner
Partitioner
Reduce Phase
Shuffle
Sort
Reducer
Output Format
Record Writer
Slide 52: Writable Interfaces
public interface Writable {
  void write(DataOutput out);
  void readFields(DataInput in);
}

public interface WritableComparable<T>
    extends Writable, Comparable<T> {
}
BooleanWritable
BytesWritable
ByteWritable
DoubleWritable
FloatWritable
IntWritable
LongWritable
NullWritable
Text
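When the built-in types are not enough, you implement Writable (or WritableComparable, for keys) yourself. A minimal sketch of a hypothetical custom value type:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value type carrying a count and a sum
public class CountSumWritable implements Writable {
  private long count;
  private double sum;

  public void set(long count, double sum) { this.count = count; this.sum = sum; }
  public long getCount() { return count; }
  public double getSum() { return sum; }

  @Override
  public void write(DataOutput out) throws IOException {
    // Serialize the fields in a fixed order
    out.writeLong(count);
    out.writeDouble(sum);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Deserialize in exactly the same order
    count = in.readLong();
    sum = in.readDouble();
  }
}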
Slide 53: InputFormat
public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context);
  public abstract RecordReader<K, V>
      createRecordReader(InputSplit split, TaskAttemptContext context);
}
Slide 54: RecordReader
public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
  public abstract void initialize(InputSplit split, TaskAttemptContext context);
  public abstract boolean nextKeyValue();
  public abstract KEYIN getCurrentKey();
  public abstract VALUEIN getCurrentValue();
  public abstract float getProgress();
  public abstract void close();
}
Slide 55: Mapper
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  protected void setup(Context context) { /* NOTHING */ }
  protected void cleanup(Context context) { /* NOTHING */ }
  protected void map(KEYIN key, VALUEIN value, Context context) {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
  public void run(Context context) {
    setup(context);
    while (context.nextKeyValue())
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    cleanup(context);
  }
}
Slide 56: Partitioner
public abstract class Partitioner<KEY, VALUE> {
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
Default HashPartitioner uses key’s hashCode() % numPartitions
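If the default hashing distributes work badly (a single very hot key, for instance), you can plug in your own. A hypothetical sketch that gives one hot word its own reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical: send the very frequent word "the" to partition 0
// and spread everything else over the remaining reducers
public class HotKeyPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions == 1 || key.toString().equals("the")) {
      return 0;
    }
    return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
  }
}

It is enabled in the driver with job.setPartitionerClass(HotKeyPartitioner.class).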
Slide 57: Reducer
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  protected void setup(Context context) { /* NOTHING */ }
  protected void cleanup(Context context) { /* NOTHING */ }
  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) {
    for (VALUEIN value : values)
      context.write((KEYOUT) key, (VALUEOUT) value);
  }
  public void run(Context context) {
    setup(context);
    while (context.nextKey())
      reduce(context.getCurrentKey(), context.getValues(), context);
    cleanup(context);
  }
}
Slide 58: OutputFormat
public abstract class OutputFormat<K, V> {
  public abstract RecordWriter<K, V>
      getRecordWriter(TaskAttemptContext context);
  public abstract void checkOutputSpecs(JobContext context);
  public abstract OutputCommitter
      getOutputCommitter(TaskAttemptContext context);
}
Slide 59: RecordWriter
public abstract class RecordWriter<K, V> {
  public abstract void write(K key, V value);
  public abstract void close(TaskAttemptContext context);
}
Slide 61: Problem
Count the number of times each word is used in a body of text
Uses TextInputFormat and TextOutputFormat
map(byte_offset, line)
foreach word in line
emit(word, 1)
reduce(word, counts)
sum = 0
foreach count in counts
sum += count
emit(word, sum)
Slide 62: Mapper Code
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value, Context context) {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, ONE);
    }
  }
}
Slide 63: Shuffle and Sort
Mapper outputs to a single logically partitioned file
Reducers copy their parts
Reducer merges partitions, sorting by key
Slide 64: Reducer Code
public class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable outvalue = new IntWritable();
  private int sum = 0;
  public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    outvalue.set(sum);
    context.write(key, outvalue);
  }
}
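The deck does not show the driver that ties the mapper and reducer together; a minimal sketch using the new (mapreduce) API on Hadoop 2.x (class name and argument handling are ours):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordMapper.class);
    // IntSumReducer's input and output types match, so it also works as a combiner
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // TextInputFormat and TextOutputFormat are the defaults
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}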
Slide 66: So what's so hard about it?
MapReduce is a limitation
Entirely different way of thinking
Simple processing operations such as joins are not so easy when expressed in MapReduce
Proper implementation is not so easy
Lots of configuration and implementation details for optimal performance
Number of reduce tasks, data skew, JVM size, garbage collection
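A few of those knobs are simply driver-side settings; a fragment sketching some of them (values are illustrative, not recommendations, and the JVM property name shown is the MapReduce 1.x one):

Configuration conf = new Configuration();
// Heap size for each task attempt's child JVM (1.x property name)
conf.set("mapred.child.java.opts", "-Xmx1024m");

Job job = Job.getInstance(conf, "tuned job");
// Too few reducers under-uses the cluster; too many produce tiny output files
job.setNumReduceTasks(20);
// A combiner can shrink the data shuffled to reducers when the operation allows it
job.setCombinerClass(IntSumReducer.class);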
Slide 67: So what does this mean for you?
Hadoop is written primarily in Java
Components are extendable and configurable
Custom I/O through Input and Output Formats
Parse custom data formats
Read and write using external systems
Higher-level tools enable rapid development of big data analysis
Slide 68: Resources, Wrap-up, etc.
http://hadoop.apache.org
Very supportive community
Strata + Hadoop World, Oct. 28th – 30th in Manhattan
Plenty of resources available to learn more
Blogs
Email lists
Books
Shameless Plug -- MapReduce Design Patterns
Slide 69: Getting Started
Pivotal HD Single-Node VM and Community Edition
http://gopivotal.com/pivotal-products/data/pivotal-hd
For the brave and bold -- roll your own!
http://hadoop.apache.org/docs/current
Slide 70: Acknowledgements
Apache Hadoop, the Hadoop elephant logo, HDFS, Accumulo, Avro, Drill, Flume, HBase, Hive, Mahout, Oozie, Pig, Sqoop, YARN, and ZooKeeper are trademarks of the Apache Software Foundation
Cloudera Impala is a trademark of Cloudera
Parquet is copyright Twitter, Cloudera, and other contributors
Storm is licensed under the Eclipse Public License
Slide 71: Learn More. Stay Connected.
Talk to us on Twitter: @springcentral
Find session replays on YouTube: spring.io/video