Taming Big Data! (presentation)


Slide 1: Taming Big Data!
Ian Foster, Argonne National Laboratory and University of Chicago
foster@anl.gov
ianfoster.org


Slide 2: Discovery is an iterative process
Pose question
Publish results
Janet Rowley, 1972


Slide 3: Discovery in the big data era: Resource-intensive, expensive, slow
Pose question
Publish results


Slide 4: Three big data challenges
Channel massive flows


Automate management


Build discovery engines


Slide 6: Channel massive data flows
Data must move to be useful. We may optimize, but we can never entirely eliminate distance.

Sources: experimental facilities, sensors, computations
Sinks: analysis computers, display systems
Stores: impedance matchers & time shifters
Pipes: IO systems and networks connect other elements


“We must think of data as a flowing river over time, not a static snapshot. Make copies, share, and do magic” – S. Madhavan


Slide 7: Transfer is challenging at many levels
Speed and reliability
GridFTP protocol
Globus implementation

Scheduling and modeling
SEAL and STEAL algorithms
RAMSES project

Slide 8: File transfer is an end-to-end problem
Source data store
Wide Area Network
Destination data store

Slide 9: File transfer is an end-to-end problem
[Diagram labels: Application, OS, FS Stack, HBA/HCA, TCP, IP, NIC, LAN, Switch, Router (shown for both the source data transfer node and the destination data transfer node); Wide Area Network; Storage Array; Lustre file system (OSS, OST, MDS, MDT)]

+ diverse environments
+ diverse workloads
+ contention


Slide 10: GridFTP protocol and implementations: Fast, reliable, secure 3rd-party data transfer
Extend legacy FTP protocol to enhance performance, reliability, security

Globus GridFTP provides a widely-used open source implementation.
Modular, pluggable architecture (different protocols, I/O interfaces).
Many optimizations: e.g., concurrency, parallelism, pipelining.
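
As a rough illustration of the concurrency optimization named above (several files in flight at once), the sketch below copies files with a thread pool. It is not GridFTP or Globus code: `transfer_file` and `transfer_many` are hypothetical names, and a local file copy stands in for a network transfer.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def transfer_file(src: Path, dst_dir: Path) -> Path:
    """Stand-in for one file transfer: a plain local copy (assumption)."""
    dst = dst_dir / src.name
    shutil.copyfile(src, dst)
    return dst


def transfer_many(files, dst_dir: Path, concurrency: int = 4):
    """Move several files at once; `concurrency` caps the files in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() returns results in the same order as the input files
        return list(pool.map(lambda f: transfer_file(f, dst_dir), files))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as s, tempfile.TemporaryDirectory() as d:
        sources = []
        for i in range(8):
            p = Path(s) / f"part{i}.dat"
            p.write_bytes(bytes([i]) * 1024)
            sources.append(p)
        copied = transfer_many(sources, Path(d), concurrency=4)
        print(len(copied), "files copied")
```

Parallelism (several streams per file) and pipelining (issuing the next request before the previous one completes) are separate optimizations that this sketch does not attempt.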


Slide 11: 85 Gbps sustained disk-to-disk over 100 Gbps network, Ottawa to New Orleans
Raj Kettimuthu and team, Argonne
Nov 2014

Slide 12
Higgs discovery “only possible because of the extraordinary achievements of … grid computing” – Rolf Heuer, CERN DG

10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, Bs of tasks


Slide 13: One Advanced Photon Source data node: 125 destinations


Slide 14: Same node
(1 Gbps link)


Slide 17: Transfer scheduling and optimization
Science data traffic is extremely bursty

User experience can be improved by scheduling to minimize slowdown

Traffic can be categorized: interactive or batch

Increased concurrency tends to increase aggregate throughput, to a point

Concurrency over 24 hours. Kettimuthu et al., 2015

Throughput vs. concurrency & parallelism. Kettimuthu et al., 2014


Slide 18: A load-aware, adaptive algorithm: (1) Data-driven model of throughput
Collect many <s, d, cs, cd, v, a> data points
Estimate throughput(s, d, cs, cd, v)
Adjust with estimate of external load
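
One way to picture the three steps above is a lookup over past transfers. This is a minimal sketch, not the algorithm from the talk. The slide does not define the tuple fields, so the meanings given in the comments are guesses. The nearest-neighbour averaging, the distance measure, and the multiplicative load adjustment are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import log10


@dataclass(frozen=True)
class Record:
    # Field meanings are assumed; the slide only gives the symbols.
    s: str      # source endpoint
    d: str      # destination endpoint
    cs: int     # concurrency at the source
    cd: int     # concurrency at the destination
    v: float    # transfer volume in bytes
    a: float    # achieved throughput in Mbps


def estimate_throughput(history, s, d, cs, cd, v, k=3):
    """Average the k most similar past transfers between s and d."""
    same_pair = [r for r in history if r.s == s and r.d == d]
    if not same_pair:
        raise ValueError("no history for this endpoint pair")

    def distance(r):
        # Hypothetical similarity: concurrency gap plus the
        # order-of-magnitude gap in transfer volume.
        return abs(r.cs - cs) + abs(r.cd - cd) + abs(log10(r.v) - log10(v))

    nearest = sorted(same_pair, key=distance)[:k]
    return sum(r.a for r in nearest) / len(nearest)


def adjust_for_load(estimate, external_load):
    """Scale the estimate by the capacity fraction left by other traffic."""
    return estimate * (1.0 - external_load)
```

For example, `adjust_for_load(estimate_throughput(history, "A", "B", 4, 4, 1e9), 0.25)` would discount the historical estimate by a quarter when other traffic is assumed to use 25% of capacity.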


Slide 19: A load-aware, adaptive algorithm: (2) Concurrency-constrained scheduling
Should a new transfer be scheduled?
When scheduling a transfer, with what concurrency?
When should an active transfer be preempted?
When should the concurrency of an active transfer change?

Define transfer priority

Schedule transfers if neither source nor destination is saturated, using model to decide concurrency

If source or destination is saturated, interrupt active transfer(s) to service waiting requests, if doing so can reduce overall average slowdown
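
The admission rule ("schedule if neither source nor destination is saturated") can be sketched as follows. The priority formula did not survive on the slide, so this example substitutes the common slowdown metric, (wait + run) / run, as an assumed priority. `Request`, `slowdown`, `schedule`, and the per-endpoint `limit` are hypothetical, and preemption is left out.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    src: str
    dst: str
    wait: float      # seconds spent waiting so far
    est_run: float   # estimated run time in seconds if started now


def slowdown(req: Request) -> float:
    """Assumed priority: (wait + run) / run, which grows while waiting."""
    return (req.wait + req.est_run) / req.est_run


def schedule(waiting, active_per_endpoint, limit):
    """Start waiting requests, highest slowdown first, as long as neither
    endpoint of a request already has `limit` active transfers."""
    load = dict(active_per_endpoint)
    started = []
    for req in sorted(waiting, key=slowdown, reverse=True):
        if load.get(req.src, 0) < limit and load.get(req.dst, 0) < limit:
            load[req.src] = load.get(req.src, 0) + 1
            load[req.dst] = load.get(req.dst, 0) + 1
            started.append(req.name)
    return started
```

Ranking by slowdown favours short transfers that have waited a long time relative to their size, which is one plausible reading of "minimize slowdown" for interactive traffic.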


Slide 22: Robust analytic models for science at extreme scales
Gagan Agarwal [1]*, Prasanna Balaprakash [2], Ian Foster [2]*, Raj Kettimuthu [2], Sven Leyffer [2], Vitali Morozov [2], Todd Munson [2], Nagi Rao [3]*, Saday Sadayappan [1], Brad Settlemyer [3], Brian Tierney [4]*, Don Towsley [5]*, Venkat Vishwanath [2], Yao Zhang [2]

[1] Ohio State University, [2] Argonne National Laboratory, [3] Oak Ridge National Laboratory, [4] ESnet, [5] UMass Amherst (* Co-PIs)

Advanced Scientific Computing Research
Program manager: Rich Carlson


Slide 23: How to create more accurate, useful, and portable models of distributed systems?

Simple analytical model:
T = α + β*l
[startup cost + sustained bandwidth]
Experiment + regression to estimate α, β

First-principles modeling to better capture details of system & application components

Data-driven modeling to learn unknown details of system & application components

Model composition

Model, data comparison
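
The "experiment + regression" step for the simple model T = α + β*l can be sketched with an ordinary least-squares fit. This assumes l is the amount of data moved and T the measured time, so β comes out as time per unit of data (the reciprocal of sustained bandwidth). The function name `fit_alpha_beta` is hypothetical.

```python
def fit_alpha_beta(lengths, times):
    """Ordinary least-squares fit of T = alpha + beta * l."""
    n = len(lengths)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (l, T) pairs of equal length")
    mean_l = sum(lengths) / n
    mean_t = sum(times) / n
    sxx = sum((l - mean_l) ** 2 for l in lengths)
    if sxx == 0:
        raise ValueError("all lengths are identical; slope is undefined")
    sxy = sum((l - mean_l) * (t - mean_t) for l, t in zip(lengths, times))
    beta = sxy / sxx                  # per-unit cost
    alpha = mean_t - beta * mean_l    # startup cost
    return alpha, beta
```

Given measurements at several transfer sizes, `alpha, beta = fit_alpha_beta(sizes, seconds)` returns the two parameters, and `1 / beta` is then the implied sustained bandwidth.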


Slide 24: Differential regression for combining data from different sources
Example of use: Predict performance on connection length L not realizable on physical infrastructure
E.g., IB-RDMA or HTCP throughput on 900-mile connection

Make multiple measurements of performance on path lengths d:
$M_S(d)$: OPNET simulation
$M_E(d)$: ANUE-emulated path
$M_U(d_i)$: Real network (USN)

Compute measurement regressions on d: $\dot{M}_A(\cdot)$, $A \in \{S, E, U\}$

Compute differential regressions: $\Delta\dot{M}_{A,B}(\cdot) = \dot{M}_A(\cdot) - \dot{M}_B(\cdot)$, $A, B \in \{S, E, U\}$

Apply differential regression to obtain estimates, $C \in \{S, E\}$:
$\hat{M}_U(d) = M_C(d) - \Delta\dot{M}_{C,U}(d)$

[Figure legend: simulated/emulated measurements; point regression estimate]
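
The procedure above can be sketched in a few lines, using simulation (S) as the source C. The slide does not say what form the regressions take, so straight-line least squares is an assumption here, and `fit_line` and `differential_estimate` are hypothetical names.

```python
def fit_line(xs, ys):
    """Least-squares line; returns a function d -> predicted value."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda d: intercept + slope * d


def differential_estimate(sim, real, d_new, sim_at_d_new):
    """Estimate real-network performance at a path length d_new.

    sim, real: lists of (path length, measurement) pairs.
    sim_at_d_new: the simulated measurement taken at d_new.
    """
    reg_s = fit_line(*zip(*sim))     # regression over simulation data
    reg_u = fit_line(*zip(*real))    # regression over real-network data
    delta_su = reg_s(d_new) - reg_u(d_new)   # differential regression at d_new
    return sim_at_d_new - delta_su           # estimate for the real network
```

The idea is that the simulator can be run at any path length, while the real network offers only a few lengths; the differential regression corrects the simulated value by the systematic gap between the two sources.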




Slide 25: End-to-end profile composition
Source LAN profile: configuration for host and edge devices
WAN profile: configuration for WAN devices
Destination LAN profile: configuration for host and edge devices
Composition operations


Slide 26: Three big data challenges
Channel massive flows


Automate management


Build discovery engines


Slide 27: It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA

… but in reality it’s often very challenging

[Diagram labels: Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive, Mirror]


Slide 28: One researcher’s perspective on data management challenges


Slide 30: Tripit exemplifies process automation
Me: Book flights, Book hotel
Other services: Record flights, Suggest hotel, Record hotel, Get weather, Prepare maps, Share info, Monitor prices, Monitor flight
[Axis: Time]


Slide 31: How the “business cloud” works
Infrastructure services
Computing, storage, networking
Elastic capacity
Multiple availability zones


Slide 32: Process automation for science
Run experiment
Collect data
Move data
Check data
Annotate data
Share data
Find similar data
Link to literature
Analyze data
Publish data

[Axis: Time]

Automate and outsource: the Discovery cloud


Slide 33: Globus research data management services
In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting

www.globus.org

[Diagram labels: Next-gen genome sequencer, Telescope, Simulation, Staging, Ingest, Analysis, Community Repository, Archive, Mirror]


Slide 34: Reliable, secure, high-performance file transfer and synchronization
“Fire-and-forget” transfers

Automatic fault recovery

Seamless security integration

Powerful GUI and APIs

[Diagram labels: Data Source, Data Destination]



Slide 35: Simple, secure sharing off existing storage systems
Easily share large data with any user or group

No cloud storage required

[Diagram label: Data Source]

Slide 36: Extreme ease of use
InCommon, OAuth, OpenID, X.509, …
Credential management
Group definition and management
Transfer management and optimization
Reliability via transfer retries
Web interface, REST API, command line
One-click “Globus Connect Personal” install
5-minute Globus Connect Server install
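
"Reliability via transfer retries" can be pictured with a generic retry loop. This is a sketch only, not the Globus implementation: `transfer_with_retries` is a hypothetical helper, and the exponential backoff and the choice to retry on `OSError` are assumptions.

```python
import time


def transfer_with_retries(do_transfer, max_attempts=5, base_delay=1.0,
                          sleep=time.sleep):
    """Call do_transfer() until it succeeds or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return do_transfer()
        except OSError:
            # Network failures such as ConnectionError subclass OSError.
            if attempt == max_attempts:
                raise
            # Exponential backoff: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists so the waiting behaviour can be replaced in tests; a caller would normally leave it at the default.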



Slide 39: High-speed transfers to/from AWS cloud, via Globus transfer service
UChicago → AWS S3 (US region): Sustained 2 Gbps
2 GridFTP servers, GPFS file system at UChicago
Multi-part upload via 16 concurrent HTTP connections
AWS → AWS (same region): Sustained 5 Gbps


go#s3


Slide 40: Globus transfer & sharing; identity & group management, data discovery & publication

25,000 users, 75 PB and 3B files transferred, 8,000 endpoints

Globus endpoints


Slide 41: Globus under the covers
Identity, group, profile management services
Sharing service
Transfer service
Globus Toolkit
Globus Connect


Slide 42: Globus under the covers
Identity, group, profile management services
Sharing service
Transfer service
Globus Toolkit
Globus Connect
Publication and discovery


Slide 44: Globus Platform-as-a-Service
Identity, group, profile management services
Sharing service
Transfer service
Globus Toolkit
Globus APIs
Globus Connect
Publication and discovery


Slide 45: The Globus Galaxies platform: Science as a service
Ematter materials science
FACE-IT


Slide 46: Three big data challenges
Channel massive flows


Automate management


Build discovery engines


Slide 47: Discovery engines: Integrate simulation, experiment, and informatics


Slide 48: A discovery engine for metagenomics
metagenomics.anl.gov


Slide 49: kbase.us


Slide 50: DOE Systems Biology Knowledge Base (KBase)
Source: Rick Stevens


Slide 52: A discovery engine for the study of disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne

[Diagram labels: Sample; Experimental scattering; Material composition (La 60%, Sr 40%); Simulated structure; Simulated scattering]

Detect errors (secs to mins)

Knowledge base: past experiments; simulations; literature; expert knowledge

Select experiments (mins to hours)

Contribute to knowledge base

Simulations driven by experiments (mins to days)

Knowledge-driven decision making

Evolutionary optimization




Slide 53: Immediate assessment of alignment quality in near-field high-energy diffraction microscopy
Before / After
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer




Slide 54
Integrate data movement, management, workflow, and computation to accelerate data-driven applications

New data, computational capabilities, and methods create opportunities and challenges

Integrate statistics/machine learning to assess many models and calibrate them against `all' relevant data

New computer facilities enable on-demand computing and high-speed analysis of large quantities of data


Slide 55: Big Data to Knowledge: bd2k.org


Slide 56: Three big data challenges
Channel massive flows
New protocols and management algorithms

Automate management
The Discovery Cloud

Build discovery engines
MG-RAST, KBase, Materials

Slide 57: My work is supported by:


Slide 58: Thank you! foster@anl.gov ianfoster.org

