Taming Big Data! (presentation)

Slide 1: Taming Big Data!
Ian Foster, Argonne National Laboratory and University of Chicago
foster@anl.gov
ianfoster.org

Slide 2: Discovery is an iterative process
Pose question
Publish results

Janet Rowley, 1972

Slide 3: Discovery in the big data era: Resource-intensive, expensive, slow
Pose question
Publish results

Slide 4: Three big data challenges
Channel massive flows

Automate management

Build discovery engines

Slide 5: Three big data challenges
Channel massive flows

Automate management

Build discovery engines

Slide 6: Channel massive data flows
Data must move to be useful. We may optimize, but we can never entirely eliminate distance.

Sources: experimental facilities, sensors, computations
Sinks: analysis computers, display systems
Stores: impedance matchers & time shifters
Pipes: IO systems and networks connect other elements

“We must think of data as a flowing river over time, not a static snapshot. Make copies, share, and do magic” – S. Madhavan

Slide 7: Transfer is challenging at many levels
Speed and reliability
GridFTP protocol
Globus implementation

Scheduling and modeling
SEAL and STEAL algorithms
RAMSES project

Slide 8: File transfer is an end-to-end problem
Source data store
Destination data store
Wide Area Network

Slide 9: File transfer is an end-to-end problem
[Diagram labels: Source data transfer node and Destination data transfer node, each with Application, OS, FS Stack, HBA/HCA, TCP, IP, NIC, LAN, Switch, Router; Wide Area Network; Storage Array; Lustre file system (OSS, OST, MDS, MDT)]

+ diverse environments
+ diverse workloads
+ contention

Slide 10: GridFTP protocol and implementations: Fast, reliable, secure 3rd-party data transfer
Extend legacy FTP protocol to enhance performance, reliability, security

Globus GridFTP provides a widely used open source implementation.
Modular, pluggable architecture (different protocols, I/O interfaces).
Many optimizations: e.g., concurrency, parallelism, pipelining.
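
As a generic illustration of the "parallelism" optimization named above (several streams moving one file at once), the sketch below splits a file into contiguous byte ranges, one per stream. The function name and the contiguous partitioning scheme are assumptions made for illustration; the slide does not say how GridFTP assigns data to streams.

```python
def stream_ranges(file_size, parallelism):
    """Split [0, file_size) into `parallelism` contiguous (start, end) ranges.

    Hypothetical helper: a real protocol may interleave blocks instead.
    """
    base, extra = divmod(file_size, parallelism)
    ranges, start = [], 0
    for i in range(parallelism):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


ranges = stream_ranges(10, 4)
```

Concurrency (several files in flight at once) and pipelining (several requests outstanding on one control channel) are separate knobs and are not modeled here.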

Slide 11: 85 Gbps sustained disk-to-disk over 100 Gbps network, Ottawa to New Orleans
Raj Kettimuthu and team, Argonne
Nov 2014

Slide 12
Higgs discovery “only possible because of the extraordinary achievements of … grid computing” (Rolf Heuer, CERN DG)

10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, Bs of tasks

Slide 13: One Advanced Photon Source data node: 125 destinations

Slide 14: Same node (1 Gbps link)

Slide 15

Slide 16

Slide 17: Transfer scheduling and optimization
Science data traffic is extremely bursty

User experience can be improved by scheduling to minimize slowdown

Traffic can be categorized: interactive or batch

Increased concurrency tends to increase aggregate throughput, to a point

Concurrency over 24 hours. Kettimuthu et al., 2015

Throughput vs. concurrency & parallelism. Kettimuthu et al., 2014

Slide 18: A load-aware, adaptive algorithm: (1) Data-driven model of throughput
Collect many <s, d, cs, cd, v, a> data
Estimate throughput(s, d, cs, cd, v)
Adjust with estimate of external load
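
A minimal sketch of what such a data-driven model could look like. The slide does not define its symbols, so the readings used here are assumptions: s and d as source and destination endpoints, cs and cd as the concurrency at each end, v as the transfer volume, and a as the achieved throughput. The class name, the averaging rule, and the `external_load` fraction are also hypothetical.

```python
from collections import defaultdict
from statistics import mean


class ThroughputModel:
    """Toy model: average the throughputs observed for the same settings."""

    def __init__(self):
        # (s, d, cs, cd) -> list of achieved throughputs a
        self._obs = defaultdict(list)

    def record(self, s, d, cs, cd, v, a):
        # This sketch ignores the volume v; a fuller model would use it.
        self._obs[(s, d, cs, cd)].append(a)

    def estimate(self, s, d, cs, cd, v, external_load=0.0):
        """Return the mean observed throughput, scaled down by external load.

        external_load is an assumed fraction in [0, 1) of capacity
        consumed by traffic outside the scheduler's control.
        """
        samples = self._obs.get((s, d, cs, cd))
        if not samples:
            return None  # no history for this configuration
        return mean(samples) * (1.0 - external_load)


model = ThroughputModel()
model.record("A", "B", 2, 2, 100, 4.0)
model.record("A", "B", 2, 2, 250, 6.0)
unloaded = model.estimate("A", "B", 2, 2, 100)
loaded = model.estimate("A", "B", 2, 2, 100, external_load=0.2)
```

Averaging exact matches is the simplest possible estimator; a regression over cs, cd, and v would generalize to configurations that were never observed.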

Slide 19: A load-aware, adaptive algorithm: (2) Concurrency-constrained scheduling
Define transfer priority:

Schedule transfers if neither source nor destination is saturated, using model to decide concurrency

If source or destination is saturated, interrupt active transfer(s) to service waiting requests, if doing so can reduce overall average slowdown

Should a new transfer be scheduled?
When scheduling a transfer, with what concurrency?
When should an active transfer be preempted?
When should the concurrency of an active transfer be changed?
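
The first scheduling rule on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: the priority value is supplied by the caller because the slide's priority formula is not in the text, "saturated" is taken to mean that an endpoint has no free concurrency slots, and preemption is left out. All names (`Transfer`, `schedule`, `choose_concurrency`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    name: str
    src: str
    dst: str
    priority: float      # higher runs first (assumed convention)
    concurrency: int = 0


def schedule(waiting, active, capacity, choose_concurrency):
    """Start waiting transfers whose source and destination both have free slots.

    capacity maps endpoint name -> maximum total concurrency (assumed limit).
    choose_concurrency(t) stands in for the throughput model's suggestion.
    """
    used = {ep: 0 for ep in capacity}
    for t in active:
        used[t.src] += t.concurrency
        used[t.dst] += t.concurrency

    started = []
    for t in sorted(waiting, key=lambda t: -t.priority):
        free = min(capacity[t.src] - used[t.src],
                   capacity[t.dst] - used[t.dst])
        if free <= 0:
            continue  # an endpoint is saturated; preemption not modeled here
        t.concurrency = min(choose_concurrency(t), free)
        used[t.src] += t.concurrency
        used[t.dst] += t.concurrency
        started.append(t)
    return started


capacity = {"A": 4, "B": 4, "C": 2}
waiting = [
    Transfer("t1", "A", "B", priority=2),
    Transfer("t2", "A", "C", priority=1),
    Transfer("t3", "B", "C", priority=3),
]
started = schedule(waiting, [], capacity, choose_concurrency=lambda t: 3)
```

In this run t3 and t1 start with reduced concurrency, and t2 keeps waiting because endpoint C has no free slots left.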

Slide 20

Slide 21

Slide 22: Robust analytic models for science at extreme scales
Gagan Agarwal1*, Prasanna Balaprakash2, Ian Foster2*, Raj Kettimuthu2, Sven Leyffer2, Vitali Morozov2, Todd Munson2, Nagi Rao3*, Saday Sadayappan1, Brad Settlemyer3, Brian Tierney4*, Don Towsley5*, Venkat Vishwanath2, Yao Zhang2

1 Ohio State University 2 Argonne National Laboratory
3 Oak Ridge National Laboratory 4 ESnet 5 UMass Amherst (* Co-PIs)

Advanced Scientific Computing Research
Program manager: Rich Carlson


Slide 23: How to create more accurate, useful, and portable models of distributed systems?

Simple analytical model:
T = α + β*l
[startup cost + sustained bandwidth]
Experiment + regression to estimate α, β
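
A minimal sketch of the "experiment + regression" step using ordinary least squares, assuming l is the amount of data moved and T the measured transfer time. Under that reading α is the startup cost and 1/β the sustained bandwidth. The function name and the sample measurements are made up for illustration.

```python
def fit_transfer_model(sizes, times):
    """Least-squares fit of T = alpha + beta * l; returns (alpha, beta)."""
    n = len(sizes)
    mean_l = sum(sizes) / n
    mean_t = sum(times) / n
    sxx = sum((l - mean_l) ** 2 for l in sizes)
    sxy = sum((l - mean_l) * (t - mean_t) for l, t in zip(sizes, times))
    beta = sxy / sxx
    alpha = mean_t - beta * mean_l
    return alpha, beta


# Synthetic measurements generated from T = 2 + 0.5 * l (illustrative values).
sizes = [1, 2, 4, 8]
times = [2.5, 3.0, 4.0, 6.0]
alpha, beta = fit_transfer_model(sizes, times)
```

With real measurements the fit will not be exact, and the residuals indicate how much the two-parameter model leaves unexplained.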

First-principles modeling to better capture details of system & application components

Data-driven modeling to learn unknown details of system & application components

Model composition

Model, data comparison

Slide 24: Differential regression for combining data from different sources
Example of use: Predict performance on connection length L not realizable on physical infrastructure
E.g., IB-RDMA or HTCP throughput on 900-mile connection

Make multiple measurements of performance on path lengths d:
MS(d): OPNET simulation
ME(d): ANUE-emulated path
MU(di): Real network (USN)

Compute measurement regressions on d: ṀA(.), A∈{S, E, U}

Compute differential regressions: ∆ṀA,B(.) = ṀA(.) - ṀB(.), A, B∈{S, E, U}

Apply differential regression to obtain estimates, C∈{S, E}
M̂U(d) = MC(d) - ∆ṀC,U(d)
where MC(d) is the simulated/emulated measurement and ∆ṀC,U(d) is the point regression estimate
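
The steps above can be sketched as follows, assuming straight-line regressions (the slide does not state the regression form) and invented sample data. `fit_line` and `estimate_real` are hypothetical names. The real network is measured only up to 500 miles, and the simulation is used to reach 900 miles.

```python
def fit_line(ds, ms):
    """Least-squares line through (d, m) points; returns a callable regression."""
    n = len(ds)
    mean_d = sum(ds) / n
    mean_m = sum(ms) / n
    sxx = sum((d - mean_d) ** 2 for d in ds)
    slope = sum((d - mean_d) * (m - mean_m) for d, m in zip(ds, ms)) / sxx
    intercept = mean_m - slope * mean_d
    return lambda d: intercept + slope * d


def estimate_real(d, m_c_at_d, reg_c, reg_u):
    """Estimate of M_U(d): M_C(d) minus the differential regression C vs. U at d."""
    delta = reg_c(d) - reg_u(d)
    return m_c_at_d - delta


# Invented throughput data (Gbps) versus path length (miles).
sim_d = [100, 300, 500, 700, 900]
sim_m = [10 - 0.004 * d for d in sim_d]     # simulation, covers 900 miles
real_d = [100, 200, 300, 400, 500]
real_m = [9 - 0.005 * d for d in real_d]    # real network, shorter paths only

reg_s = fit_line(sim_d, sim_m)
reg_u = fit_line(real_d, real_m)
estimate = estimate_real(900, sim_m[-1], reg_s, reg_u)
```

The estimate extrapolates the real-network regression beyond its measured range, so its quality depends on how well the chosen regression form holds at longer distances.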

Slide 25: End-to-end profile composition
Source LAN profile: configuration for host and edge devices
WAN profile: configuration for WAN devices
Destination LAN profile: configuration for host and edge devices
Composition operations
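
The slide does not say what the composition operations are. As one plausible first-order reading, the sketch below composes per-segment profiles by taking the minimum bandwidth (the bottleneck) and summing the latencies. The `Profile` fields, the `compose` function, and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    bandwidth_gbps: float   # sustainable bandwidth of the segment (assumed field)
    latency_ms: float       # one-way latency of the segment (assumed field)


def compose(*segments):
    """Compose segment profiles into one end-to-end profile."""
    return Profile(
        name=" + ".join(p.name for p in segments),
        bandwidth_gbps=min(p.bandwidth_gbps for p in segments),  # bottleneck
        latency_ms=sum(p.latency_ms for p in segments),          # additive
    )


end_to_end = compose(
    Profile("source LAN", 40.0, 0.5),
    Profile("WAN", 100.0, 30.0),
    Profile("destination LAN", 10.0, 0.5),
)
```

Here the destination LAN limits the end-to-end bandwidth even though the WAN is the fastest segment, which is the kind of effect an end-to-end profile is meant to expose.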

Slide 26: Three big data challenges
Channel massive flows

Automate management

Build discovery engines

Slide 27
[Diagram labels: Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive, Mirror]
It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA

… but in reality it’s often very challenging

Slide 28: One researcher’s perspective on data management challenges

Slide 29

Slide 30: Tripit exemplifies process automation
Me
Book flights

Book hotel

Record flights
Suggest hotel

Record hotel
Get weather
Prepare maps
Share info
Monitor prices
Monitor flight

Other services

Time

Slide 31: How the “business cloud” works
Infrastructure services
Computing, storage, networking
Elastic capacity
Multiple availability zones

Slide 32: Process automation for science
Run experiment
Collect data
Move data
Check data
Annotate data
Share data
Find similar data
Link to literature
Analyze data
Publish data

Time

Automate and outsource: the Discovery cloud

Slide 33
[Diagram labels: Analysis, Staging, Ingest, Community Repository, Archive, Mirror; data sources: Next-gen genome sequencer, Telescope, Simulation]
In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting

Globus research data management services

www.globus.org

Slide 34: Reliable, secure, high-performance file transfer and synchronization
“Fire-and-forget” transfers

Automatic fault recovery

Seamless security integration

Powerful GUI and APIs

Data Source

Data Destination

Slide 35: Simple, secure sharing off existing storage systems
Data Source

Easily share large data with any user or group

No cloud storage required

Slide 36: Extreme ease of use
InCommon, OAuth, OpenID, X.509, …
Credential management
Group definition and management
Transfer management and optimization
Reliability via transfer retries
Web interface, REST API, command line
One-click “Globus Connect Personal” install
5-minute Globus Connect Server install

Slide 37

Slide 38

Slide 39: High-speed transfers to/from AWS cloud, via Globus transfer service
UChicago → AWS S3 (US region): Sustained 2 Gbps
2 GridFTP servers, GPFS file system at UChicago
Multi-part upload via 16 concurrent HTTP connections
AWS → AWS (same region): Sustained 5 Gbps

go#s3
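
The multi-part upload pattern mentioned on this slide can be sketched with a thread pool of 16 workers, one per concurrent connection. This is a generic illustration, not the Globus or S3 client code: `upload_part` is a hypothetical callable standing in for whatever sends one numbered part over HTTP, and here it simply stores parts in a dictionary.

```python
from concurrent.futures import ThreadPoolExecutor


def split_parts(data, part_size):
    """Return [(part_number, chunk), ...] with part numbers starting at 1."""
    return [(i // part_size + 1, data[i:i + part_size])
            for i in range(0, len(data), part_size)]


def multipart_upload(data, part_size, upload_part, workers=16):
    """Send all parts through `upload_part` using up to `workers` threads."""
    parts = split_parts(data, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any exception from a worker.
        list(pool.map(lambda p: upload_part(*p), parts))
    return len(parts)


# Stand-in for a remote store: parts may arrive in any order.
store = {}

def fake_upload(part_number, chunk):
    store[part_number] = chunk

payload = bytes(range(256)) * 10
count = multipart_upload(payload, part_size=100, upload_part=fake_upload)
reassembled = b"".join(store[k] for k in sorted(store))
```

Numbering the parts lets the receiver reassemble the object regardless of arrival order, which is what makes concurrent connections safe to use.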

Slide 40: Globus transfer & sharing; identity & group management, data discovery & publication

25,000 users, 75 PB and 3B files transferred, 8,000 endpoints

Globus endpoints

Slide 41: Globus under the covers
Identity, group, profile management services
…

Sharing service
Transfer service

Globus Toolkit

Globus Connect

Slide 42: Globus under the covers
Identity, group, profile management services

Sharing service
Transfer service

Globus Toolkit

Globus Connect

Publication and discovery

Slide 43

Slide 44: Globus Platform-as-a-Service
Identity, group, profile management services

Sharing service
Transfer service
Globus Toolkit

Globus APIs

Globus Connect

Publication and discovery

Slide 45: The Globus Galaxies platform: Science as a service
Ematter materials science
FACE-IT

Slide 46: Three big data challenges
Channel massive flows

Automate management

Build discovery engines

Slide 47: Discovery engines: Integrate simulation, experiment, and informatics

Slide 48: metagenomics.anl.gov
A discovery engine for metagenomics

Slide 49: kbase.us

Slide 50: DOE Systems Biology Knowledge Base (KBase)
Source: Rick Stevens

Slide 51

Slide 52: A discovery engine for the study of disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne

Sample

Experimental scattering

Material composition

Simulated structure

Simulated scattering

La 60%
Sr 40%

Detect errors (secs—mins)

Knowledge base
Past experiments; simulations; literature; expert knowledge

Select experiments (mins—hours)

Contribute to knowledge base

Simulations driven by experiments (mins—days)

Knowledge-driven
decision making

Evolutionary optimization

Slide 53: Immediate assessment of alignment quality in near-field high-energy diffraction microscopy

Before
After
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer

Slide 54: Integrate data movement, management, workflow, and computation to accelerate data-driven applications
New data, computational capabilities, and methods create opportunities and challenges

Integrate statistics/machine learning to assess many models and calibrate them against “all” relevant data

New computer facilities enable on-demand computing and high-speed analysis of large quantities of data

Slide 55: Big Data to Knowledge: bd2k.org

Slide 56: Three big data challenges
Channel massive flows
New protocols and management algorithms

Automate management
The Discovery Cloud

Build discovery engines
MG-RAST, KBase, Materials

Slide 57: My work is supported by:

Slide 58: Thank you! foster@anl.gov ianfoster.org
