Big Data Platform at Pinterest (presentation)

Data Architecture

Slide 1: Mao Ye

Big Data Platform at Pinterest


Slide 2


Slide 3: Data Architecture


Slide 4: Data at Pinterest

60 billion Pins
1 billion boards
100M MAU (monthly active users)
60 PB of data on S3
3 PB processed every day
2,000-node Hadoop cluster
250 engineers

Slide 5: Pinterest Data Architecture

Diagram labels: App

Slide 6: Pinterest Data Architecture

Diagram labels: App, events, Kafka, Secor, Singer


Slide 7: Pinterest Data Architecture

Diagram labels: App, events, Kafka, Secor, Singer


Slide 8: Pinterest Data Architecture

Diagram labels: App, events, Kafka, Secor, Skyline, Pinball, Redshift, Pinalytics, Features, Qubole (Hadoop), Singer


Slide 9: Design Choices for Hadoop Platform


Slide 10: Hadoop Platform Requirements

Ephemeral clusters
Access control layer
Shared data store
Easy deployment
Isolated multi-tenancy
Elasticity
Support multiple clusters

Slide 11: Decoupling compute & storage

Diagram labels: Hadoop Cluster 1 (Transient HDFS), Hadoop Cluster 2 (Transient HDFS), S3 Persistent Store

Slide 12: Centralized Hive Metastore

Diagram labels: Pig, Cascading, Hive; Hive Metastore (metadata); HDFS/S3 (data)


Slide 13: Multi-layered Packaging

Runtime Staging (on S3): MapReduce jobs, Hadoop jars/libs, job/user-level configs
Automated Configuration (Masterless Puppet): software packages/libs, configs (OS/Hadoop), misc sys admin
Baked AMI: OS, bootstrap script, core SW


Slide 14: Executor Abstraction Layer

Diagram labels: Hive Metastore, HDFS/S3, Qubole, Managed Hadoop, EMR, Executor, Pinball, Dev Server
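
One way to read this diagram is that workflow code submits jobs through a single executor interface, and the backend (Qubole, EMR, or a dev server) is chosen by configuration. The sketch below illustrates only that idea. The class names, the `run` method, and the `EXECUTORS` registry are hypothetical and are not the actual Pinball or Qubole API; each backend simply returns a string describing what it would run.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Hypothetical common interface for running a job on some backend."""

    @abstractmethod
    def run(self, job_name: str, command: str) -> str:
        """Run the command and return a backend-specific description."""


class QuboleExecutor(Executor):
    def run(self, job_name: str, command: str) -> str:
        # A real implementation would call the managed service here.
        return f"qubole:{job_name}:{command}"


class EmrExecutor(Executor):
    def run(self, job_name: str, command: str) -> str:
        return f"emr:{job_name}:{command}"


class DevServerExecutor(Executor):
    def run(self, job_name: str, command: str) -> str:
        return f"dev:{job_name}:{command}"


# Hypothetical registry: callers pick a backend by name, not by class.
EXECUTORS = {
    "qubole": QuboleExecutor,
    "emr": EmrExecutor,
    "dev": DevServerExecutor,
}


def make_executor(kind: str) -> Executor:
    """Build the executor registered under the given backend name."""
    return EXECUTORS[kind]()


result = make_executor("emr").run("daily_report", "hive -f report.sql")
```

Because callers depend only on `Executor`, switching a workflow from one backend to another is a configuration change rather than a code change.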


Slide 15: Why Qubole?

API for simplified executor abstraction
Advanced support for spot instances
Baked AMI customization
Hadoop & Spark as managed services
Tight integration with Hive
Graceful cluster scaling


Slide 16: Pinball for Workflow Management


Slide 17: Scale of Processing

Scale:
60 billion Pins
Hundreds of workflows
Thousands of jobs
500+ jobs in a workflow
3 petabytes processed daily

Support:
Hadoop, Cascading, Hive, Spark …

Diagram labels: job, workflow



Slide 18: Why Pinball?

Requirements
Simple abstractions
Extensible in future
Reliable stateless computing
Easy to debug
Scales horizontally
Can be upgraded w/o aborting workflows
Rich features like auto-retries, per-job emails, overrun policies…

Options
Apache Oozie, Azkaban, Luigi



Slide 19: Pinball Design

Diagram labels: Master, Worker, Scheduler, Command Line Clients, UI





Slide 20: Workflow Model

Workflow
A directed graph of nodes called jobs
Edge
Run-after dependency
Node
Job is a node
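
The model above can be sketched as a dependency map plus a topological ordering. This is a minimal illustration using the standard-library `graphlib` module (Python 3.9+). The job names are invented, and this is not how Pinball itself stores or schedules workflows.

```python
from graphlib import TopologicalSorter

# Each key is a job (node); its value is the set of jobs it must run after (edges).
workflow = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def run_order(deps):
    """Return one execution order that respects every run-after edge."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(workflow)
```

A cycle in the dependency map makes `static_order` raise `graphlib.CycleError`, which is a convenient check that the workflow is a valid directed acyclic graph.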



Slide 21: Job State

Job state is captured in a token
Tokens are named hierarchically

Diagram labels: Master, Job

Token

version: 123
name: /workflow/w1/job
owner: worker_0
expiration: 1234567
data: JobTemplate(....)
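
A token with the fields shown above could be modeled as a small record, with the hierarchical name allowing all tokens of one workflow to be selected by prefix. This is an illustrative sketch: the `Token` dataclass and the `tokens_under` helper are hypothetical names, not Pinball's actual classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical record mirroring the fields listed on the slide."""
    name: str
    version: int = 0
    owner: Optional[str] = None
    expiration: Optional[int] = None
    data: Optional[str] = None


def tokens_under(tokens: List[Token], prefix: str) -> List[Token]:
    """Return tokens whose hierarchical name falls under the given prefix."""
    prefix = prefix.rstrip("/") + "/"
    return [t for t in tokens if t.name.startswith(prefix)]


tokens = [
    Token(name="/workflow/w1/job_a", version=123, owner="worker_0",
          expiration=1234567),
    Token(name="/workflow/w1/job_b"),
    Token(name="/workflow/w2/job_a"),
]
w1_tokens = tokens_under(tokens, "/workflow/w1")
```

Appending the trailing slash before matching keeps `/workflow/w1` from also matching a sibling such as `/workflow/w10`.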




Slide 22: Job State Machine

RUNNABLE
RUNNING
WAITING
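
The three states can be written down as an enum with an explicit transition table. The slide names the states but the extracted text does not show the edges between them, so the transitions below are an assumption made for illustration: a job waits for its dependencies, becomes runnable, is picked up and runs, and then either returns to waiting or is made runnable again for a retry.

```python
from enum import Enum


class JobState(Enum):
    WAITING = "WAITING"
    RUNNABLE = "RUNNABLE"
    RUNNING = "RUNNING"


# Assumed edges; the source lists only the state names.
ALLOWED = {
    JobState.WAITING: {JobState.RUNNABLE},
    JobState.RUNNABLE: {JobState.RUNNING},
    JobState.RUNNING: {JobState.WAITING, JobState.RUNNABLE},
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the target state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


state = transition(JobState.WAITING, JobState.RUNNABLE)
state = transition(state, JobState.RUNNING)
```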


Slide 23: Master Worker Interaction

Master keeps the state
Workers claim and execute tasks
Horizontally scalable

Diagram labels: Worker, Master, Persistent Store
1: request
2: update
3: ack
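
The request, update, ack sequence can be sketched as a claim operation on the master. This is a simplified illustration, not Pinball's implementation: the `ClaimMaster` class, the lease length, and the use of a plain list as the persistent store are all assumptions. It also assumes that the token's expiration field acts as a lease, so a task whose owner stops renewing it can be claimed by another worker.

```python
class ClaimMaster:
    """Hypothetical in-memory master; `store` stands in for the persistent store."""

    def __init__(self):
        self.tokens = {}  # token name -> {"version", "owner", "expiration"}
        self.store = []   # append-only stand-in for the persistent store

    def add(self, name):
        self.tokens[name] = {"version": 0, "owner": None, "expiration": 0}

    def claim(self, name, worker, now, lease=60):
        """Handle step 1 (request); return the new version as the ack, or None."""
        token = self.tokens[name]
        if token["owner"] is not None and token["expiration"] > now:
            return None  # another worker still holds a live lease
        token.update(owner=worker, expiration=now + lease,
                     version=token["version"] + 1)
        self.store.append((name, dict(token)))  # step 2: update the store
        return token["version"]                 # step 3: ack


claims = ClaimMaster()
claims.add("/workflow/w1/job")
first = claims.claim("/workflow/w1/job", "worker_0", now=100)
second = claims.claim("/workflow/w1/job", "worker_1", now=120)
third = claims.claim("/workflow/w1/job", "worker_1", now=200)
```

Here the second claim is refused because the first lease runs until time 160, while the third succeeds because that lease has expired. Since workers hold no state of their own beyond the claim, more of them can be added without coordination among themselves.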



Slide 24: Master

Entire state is kept in memory
Each state update is synchronously persisted before master replies to client
Master runs on a single thread: no concurrency issues
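
The persist-before-reply rule can be illustrated with an append-only journal that is flushed to disk before the in-memory state changes and before the caller receives an answer. This is a sketch under stated assumptions: the `JournaledState` class, the JSON-lines file format, and the `"ack"` reply value are hypothetical and do not describe Pinball's actual storage layer.

```python
import json
import os
import tempfile


class JournaledState:
    """Hypothetical single-threaded state holder backed by a journal file."""

    def __init__(self, path):
        self.path = path
        self.state = {}
        # Rebuild the in-memory state by replaying the journal, if one exists.
        if os.path.exists(path):
            with open(path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.state[key] = value

    def update(self, key, value):
        # Persist synchronously first, then apply in memory, then reply.
        with open(self.path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.state[key] = value
        return "ack"


path = os.path.join(tempfile.mkdtemp(), "journal.log")
primary = JournaledState(path)
reply = primary.update("/workflow/w1/job", "RUNNABLE")
recovered = JournaledState(path)  # simulates a restart after a crash
```

Because the write reaches disk before the reply is returned, any update a client has seen acknowledged is still present after a restart.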



Slide 26: Open Source
Git repo:
https://github.com/pinterest/pinball

Mailing list:
https://groups.google.com/forum/#!forum/pinball-users





Slide 27: Thank You

