Overview of Machine Learning & Feature Engineering (presentation)


Slide 1: Overview of Machine Learning & Feature Engineering
Machine Learning 101 Tutorial
Strata + Hadoop World, NYC, Sep 2015
Alice Zheng, Dato


Slide 2: About us
Chris DuBois: Intro to recommenders
Alice Zheng: Overview of ML
Piotr Teterwak: Intro to image search & deep learning
Krishna Sridhar: Deploying ML as a predictive service
Danny Bickson: TA
Alon Palombo: TA


Slide 3: Why machine learning?
Model data.
Make predictions.
Build intelligent applications.


Slide 4: Classification
Predict amongst a discrete set of classes


Slide 5: Input
Output


Slide 6: Spam filtering
data
prediction
Spam vs. Not spam


Slide 7: Text classification
EDUCATION
FINANCE
TECHNOLOGY


Slide 8: Regression
Predict real/numeric values


Slide 9: Stock market
Input
Output


Slide 10: Similarity
Find things like this


Slide 11: Similar products
Product I’m buying
Output: other products I might be interested in


Slide 12: Given image, find similar images
http://www.tiltomo.com/


Slide 13: Recommender systems
Learn what I want before I know it


Slide 15: Playlist recommendations
Recommendations form a coherent & diverse sequence


Slide 16: Friend recommendations
Users and “items” are of the same type


Slide 17: Clustering
Grouping similar items


Slide 18: Clustering images
Goldberger et al.
Set of Images


Slide 19: Clustering web search results


Slide 20: Machine learning … how?
Data
Answers
I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …
Many systems
Many tools
Many teams
Lots of methods/jargon


Slide 21: The machine learning pipeline
I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …
Raw data
Features
Models




Slide 22: Three things to know about ML
Feature = numeric representation of raw data
Model = mathematical “summary” of features
Making something that works = choose the right model and features, given data and task

Slide 23: Feature = numeric representation of raw data


Slide 24: Representing natural text
Raw text: It is a puppy and it is extremely cute.
What’s important? Phrases? Specific words? Ordering? Subject, object, verb?
Classify: puppy or not?


Slide 25: Representing natural text
Raw text: It is a puppy and it is extremely cute.
Classify: puppy or not?
Sparse vector representation
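
Note: a minimal sketch of one possible sparse vector representation for the sentence on this slide, namely a bag-of-words count (the representation the later slides visualize). The function name bag_of_words and the tokenization rule are assumptions made for illustration, not taken from the slides.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map raw text to a sparse vector stored as {word: count}.

    Words that do not occur are simply absent, which is what makes
    the representation sparse. Lowercasing and the regex tokenizer
    are illustrative assumptions.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return dict(Counter(tokens))

sentence = "It is a puppy and it is extremely cute."
bow = bag_of_words(sentence)
print(bow)
```

Each distinct word becomes one dimension of feature space; the counts are the coordinates along those dimensions.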


Slide 26: Representing images
Raw image: millions of RGB triplets, one for each pixel
Image source: “Recognizing and learning object categories,” Li Fei-Fei, Rob Fergus, Antonio Torralba, ICCV 2005-2009.


Slide 27: Representing images
Raw image → deep learning features
Dense vector representation (values as shown on the slide):
3.29, -15, -5.24, 48.3, 1.36, 47.1, -1.92, 36.5, 2.83, 95.4, -19, -89, 5.09, 37.8


Slide 28: Feature space in machine learning
Raw data → high-dimensional vectors
Collection of data points → point cloud in feature space
Feature engineering = creating features of the appropriate granularity for the task

Slide 29: Crudely speaking, mathematicians fall into two categories: the algebraists, who find it easiest to reduce all problems to sets of numbers and variables, and the geometers, who understand the world through shapes. -- Masha Gessen, “Perfect Rigor”

Slide 30: Algebra vs. Geometry
Algebra: a^2 + b^2 = c^2
Geometry (Euclidean space): [Figure: triangle with sides labeled a, b, c]


Slide 31: Visualizing a sphere in 2D
x^2 + y^2 = 1


Slide 32: Visualizing a sphere in 3D
x^2 + y^2 + z^2 = 1
[Figure: axes x, y, z, each marked at 1]


Slide 33: Visualizing a sphere in 4D
x^2 + y^2 + z^2 + t^2 = 1
[Figure: axes x, y, z, each marked at 1]


Slide 34: Why are we looking at spheres?
Poincaré Conjecture: All physical objects without holes are “equivalent” to a sphere.
[Figure: a row of shapes joined by “=” signs]

Slide 35: The power of higher dimensions
A sphere in 4D can model the birth and death process of physical objects
High dimensional features can model many things
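
Note: one way to read the "birth and death" remark, offered as an interpretation rather than the speaker's own derivation. Fixing t in x^2 + y^2 + z^2 + t^2 = 1 leaves x^2 + y^2 + z^2 = 1 - t^2, an ordinary 3D sphere of radius sqrt(1 - t^2). If t is treated as time (an assumption), the sphere appears as a point at t = -1, grows to radius 1 at t = 0, and shrinks back to a point at t = 1. The helper name slice_radius is hypothetical.

```python
import math

def slice_radius(t):
    """Radius of the 3D cross-section of the unit 4D sphere at fixed t.

    Returns None when |t| > 1, where the cross-section is empty
    (the object does not exist at that t).
    """
    if abs(t) > 1:
        return None
    return math.sqrt(1 - t * t)

# Sample the cross-section at a few values of t.
radii = [slice_radius(t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
print(radii)
```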

Slide 36: Visualizing Feature Space


Slide 37: The challenge of high dimension geometry
Feature space can have hundreds to millions of dimensions
In high dimensions, our geometric imagination is limited
Algebra comes to our aid
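
Note: a small sketch of how algebra stands in for geometric intuition. The Euclidean distance formula from two or three dimensions carries over unchanged to any number of dimensions, so it can be computed even when the space cannot be pictured. The vectors p and q below are made-up examples.

```python
import math

def euclidean_distance(u, v):
    """Distance between two points given as equal-length coordinate lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two points in a 1000-dimensional feature space (illustrative values).
p = [0.0] * 1000
q = [1.0] * 1000
d = euclidean_distance(p, q)  # equals sqrt(1000)
print(d)
```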


Slide 38: Visualizing bag-of-words
I have a puppy and it is extremely cute


Slide 39: Visualizing bag-of-words
[Figure: axes puppy, cute, extremely, each marked at 1]


Slide 40: Document point cloud
[Figure: scatter plot of documents; axes: word 1, word 2]


Slide 41: Model = mathematical “summary” of features


Slide 42: What is a summary?
Data → point cloud in feature space
Model = a geometric shape that best “fits” the point cloud

Slide 43: Clustering model
[Figure: scatter plot; axes: Feature 1, Feature 2]
Group data points tightly
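
Note: the slide does not name a specific clustering algorithm. As one common example, here is a minimal k-means sketch (Lloyd's algorithm) on made-up 2D points. The hand-picked starting centroids and the fixed iteration count are simplifying assumptions.

```python
def nearest(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
    return dists.index(min(dists))

def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of (x, y) points."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [mean_point(cl) if cl else c
                     for cl, c in zip(clusters, centroids)]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Two visibly separate groups of points (illustrative data).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

Here the "geometric shape" summarizing the point cloud is just the set of centroids.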


Slide 44: Classification model
[Figure: scatter plot; axes: Feature 1, Feature 2]
Decide between two classes
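
Note: the slide does not say which classifier is drawn. As an illustration only, here is a perceptron, one of the simplest models that separates two classes with a straight line in a two-feature space. The data, the labels (+1 and -1), and the function names are made up for this sketch.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Learn weights w and bias b so that sign(w.x + b) matches the labels."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(samples, labels):
            # Update only when the current line misclassifies the point
            # (or the point lies exactly on the line).
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

def predict(w, b, point):
    """Return +1 or -1 depending on which side of the line the point falls."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else -1

# Illustrative, linearly separable data: (Feature 1, Feature 2) pairs.
samples = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
           (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_perceptron(samples, labels)
print([predict(w, b, s) for s in samples])
```

The learned line w.x + b = 0 is the geometric shape that summarizes the boundary between the two classes.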


Slide 45: Regression model
[Figure: scatter plot; axes: Feature, Target]
Fit the target values
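
Note: a minimal sketch of fitting target values with a straight line by ordinary least squares on a single feature. The slide does not specify the model; the line fit, the data, and the name fit_line are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for target = slope * feature + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(feature, target) / variance(feature).
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data lying exactly on target = 2 * feature + 1.
features = [1.0, 2.0, 3.0, 4.0]
targets = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(features, targets)
print(slope, intercept)
```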


Slide 46: Visualizing Feature Engineering


Slide 47: When does bag-of-words fail?
Task: find a surface that separates documents about dogs vs. cats
Problem: the word “have” adds fluff instead of information
[Figure: documents plotted on axes puppy, cat, have]


Slide 48: Improving on bag-of-words
Idea: “normalize” word counts so that popular words are discounted
Term frequency (tf) = number of times a term appears in a document
Inverse document frequency of a word (idf) = log(N / number of documents in which the word appears)
N = total number of documents
Tf-idf count = tf × idf

Slide 49: From BOW to tf-idf
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
[Figure: documents plotted on axes puppy, cat, have]


Slide 50: From BOW to tf-idf
tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0
[Figure: documents plotted on axes puppy, cat, have; the puppy and cat axes are marked at log 4]
Tf-idf flattens uninformative dimensions in the BOW point cloud
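
Note: a minimal sketch of the tf-idf computation, assuming idf = log(N / number of documents containing the word), which matches the values idf(puppy) = log 4 and idf(have) = log 1 = 0 on the slides. The four-document corpus is hypothetical, chosen so that "have" appears in every document while "puppy" and "cat" each appear in one.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf * idf} dict per document."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}
    return [{word: tf * idf[word] for word, tf in Counter(tokens).items()}
            for tokens in tokenized]

# Hypothetical corpus with N = 4 documents.
docs = ["I have a puppy", "I have a cat", "I have a kitten", "I have a dog"]
vectors = tf_idf(docs)
print(vectors[0])
```

Words shared by every document ("i", "have", "a") get weight 0, so those dimensions collapse and only the distinguishing words keep a nonzero coordinate.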


Slide 51: Entry points of feature engineering
Start from data and task
- What’s the best text representation for classification?
Start from modeling method
- What kind of features does k-means assume?
- What does linear regression assume about the data?

Slide 52: Dato’s Machine Learning Platform


Slide 53: Dato’s machine learning platform
Raw data

Features
GraphLab Create
Dato Distributed
Dato Predictive Services


Slide 54: Data structures for feature engineering

Features
SFrames
SGraphs



Slide 55: Machine learning toolkits in GraphLab Create
Classification/regression
Clustering
Recommenders
Deep learning
Similarity search
Data matching
Sentiment analysis
Churn prediction
Frequent pattern mining
And on…

Slide 57: Dimensionality reduction
[Figure: scatter plot; axes: Feature 1, Feature 2]
Flatten non-useful features
PCA: Find most non-flat linear subspace


Slide 58: PCA: Principal Component Analysis
Center data at origin


Slide 59: PCA: Principal Component Analysis
Find a line, such that the average (squared) distance of every data point to the line is minimized.

This is the 1st Principal Component


Slide 60: PCA: Principal Component Analysis
Find a 2nd line,
- at right angles to the 1st
- such that the average (squared) distance of every data point to the line is minimized.

This is the 2nd Principal Component


Slide 61: PCA: Principal Component Analysis
Find a 3rd line
- at right angles to the previous lines
- such that the average (squared) distance of every data point to the line is minimized.


There can only be as many principal components as the dimensionality of the data.
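
Note: a minimal sketch of the PCA steps for 2D data: center the points, then find the first principal component. It relies on the standard fact that the line through the center minimizing the average squared distance to the points is also the direction of greatest variance, that is, the top eigenvector of the covariance matrix. Power iteration is used here only to stay within the standard library; the data, the starting vector, and the name pca_2d are assumptions.

```python
import math

def pca_2d(points, iterations=100):
    """Return (center, pc1, pc2) for a list of (x, y) points.

    Assumes the points are not all identical, so the covariance
    matrix is nonzero.
    """
    n = len(points)
    # Step 1: center the data at the origin.
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 covariance matrix.
    sxx = sum(x * x for x, _ in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    syy = sum(y * y for _, y in centered) / n
    # Step 2: power iteration converges to the top eigenvector,
    # which is the 1st principal component.
    v = (1.0, 0.5)
    for _ in range(iterations):
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    pc1 = v
    # Step 3: in 2D the 2nd component is the direction at right angles.
    pc2 = (-pc1[1], pc1[0])
    return (mean_x, mean_y), pc1, pc2

# Illustrative points scattered around the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
center, pc1, pc2 = pca_2d(points)
print(center, pc1, pc2)
```

With two features there are exactly two principal components, in line with the limit stated above.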


Slide 63: Coursera Machine Learning Specialization
Learn machine learning in depth
Build and deploy intelligent applications
Year-long certification program
Joint project between University of Washington + Dato
Details: https://www.coursera.org/specializations/machine-learning

Slide 64: Next up today
alicez@dato.com, @RainyData, #StrataConf

11:30am - Intro to recommenders
Chris DuBois

1:30pm - Intro to image search & deep learning
Piotr Teterwak

3:30pm - Deploying ML as a predictive service
Krishna Sridhar

