Get Out of My Dreams...

...and Into My Model.

Incorporating Unstructured Data Into Your Feature Set

Rob Mealey

Machine Learning @ NewBrand

Nov 20, 2014

Making it easy for myself...

meaning at minimum I will definitely annoy the PhDs in the room...

just a smidge of false advertising...

First, a Few Questions...

What's your goal here?

"MALE" 22
"FEMALE" 35
"Female" 40
"male" 212
"malEe" 10
"Emale" 191
"f" 142
"m" 220
"Unknown" 10040
"XXXX" 8020
"Michigan" 20

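A field like the one above usually needs collapsing before it is usable as a feature. A minimal sketch, assuming the counts on the slide and a hypothetical mapping (`CANONICAL`) that only trusts spellings that are unambiguous after lowercasing; everything else, including "malEe" and "Emale", is sent to "unknown" rather than guessed at:

```python
from collections import Counter

# Raw value counts copied from the slide above.
RAW_COUNTS = {
    "MALE": 22, "FEMALE": 35, "Female": 40, "male": 212, "malEe": 10,
    "Emale": 191, "f": 142, "m": 220, "Unknown": 10040, "XXXX": 8020,
    "Michigan": 20,
}

# Hypothetical mapping (an assumption, not from the talk): only values
# that are unambiguous once stripped and lowercased.
CANONICAL = {"male": "male", "m": "male", "female": "female", "f": "female"}


def normalize_gender(value):
    """Return 'male', 'female', or 'unknown' for a raw field value."""
    return CANONICAL.get(value.strip().lower(), "unknown")


def collapse(raw_counts):
    """Re-aggregate raw value counts under the normalized labels."""
    out = Counter()
    for value, count in raw_counts.items():
        out[normalize_gender(value)] += count
    return out


cleaned = collapse(RAW_COUNTS)
for label, count in cleaned.most_common():
    print(label, count)
```

Most of the field ends up "unknown", which is the real point: the structured column may have more holes than signal.
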
What does your boss care about?

What do your structured data look like?

How much of it you got?

Sparse? Dense?

What does your corpus look like?

big data? data stream? data lake?

Got labels?

how'd you get em?

There is some there there, right?

Last question: Model evaluation?

Preprocessing

bag-of-ngrams

chunking

stopword filtering
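
Two of the steps above, bag-of-ngrams and stopword filtering, can be sketched in plain Python. This is a rough illustration, not a recommended pipeline: the stopword list, the tokenizer regex, and the example sentence are all made up, and chunking is not covered.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real lists are longer.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "to"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_ngrams(text, n_max=2, stopwords=STOPWORDS):
    """Count all 1..n_max-grams after dropping stopwords.

    Stopwords are removed before n-grams are built, so a bigram can
    span words that were not adjacent in the original text.
    """
    tokens = [t for t in tokenize(text) if t not in stopwords]
    bag = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag


bag = bag_of_ngrams("The staff was rude and the table was dirty")
print(sorted(bag))
```

Whether to filter stopwords before or after building n-grams is a design choice; filtering first keeps the vocabulary small but invents adjacencies such as "staff rude".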

What do you mean, specification?

Heuristics

"dirty" vs "clean"

"rust"

"rude"

DocId  DIRTY  RUDE  CLEAN  ...
DOC1   0      1     0      ...
DOC2   0      1     0      ...
DOC3   1      0     1      ...
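
A keyword-flag table like the one above can be built with a few lines of Python. This is a minimal sketch under stated assumptions: the documents and the keyword list are hypothetical, chosen so the output has the same shape as the table.

```python
import re

# Hypothetical keyword list and documents, for illustration only.
KEYWORDS = ["dirty", "rude", "clean"]
DOCS = {
    "review_a": "The cashier was rude to us.",
    "review_b": "Rude service, never again.",
    "review_c": "Dirty tables, but the restrooms were clean.",
}


def keyword_flags(text, keywords=KEYWORDS):
    """Return one 0/1 flag per keyword, matching whole tokens only."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [int(k in tokens) for k in keywords]


matrix = {doc_id: keyword_flags(text) for doc_id, text in DOCS.items()}
for doc_id, row in matrix.items():
    print(doc_id, row)
```

Matching whole tokens rather than substrings keeps a keyword such as "rust" from firing on "trust". It does nothing about negation: "not dirty" still sets the DIRTY flag.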

"OMFG, is that a quadruple negative?"

Total. Probability.

You should check if you've been fired.

Also, it is 2037 now.

PyMC, BUGS/JAGS, Stan

the middle roads...

sub-classifiers

need good labels...

inter-annotator agreement?
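
One common way to put a number on inter-annotator agreement is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The talk does not name a specific statistic, so treat this as one option among several; the label lists below are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators on six documents.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 4 of 6 items, but half of that is expected by chance, so kappa comes out well below the raw agreement rate.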

use the uncertainty...

GIGO!

Also YOLO.

semi- or unsupervised methods

dimensionality reduction

clustering

topic modeling

probabilistic or otherwise

need lots of data...

Gotchas

"endogeneity"

If you are an econometrician,

read this paper:

"Regression and Causation: A Critical Examination of Six Econometrics Textbooks"

by Bryant Chen and Judea Pearl

mind your betas...

you thought they were hard to interpret before?

Task:

predict receptiveness of potential targets for a promotional campaign.

Structured Data:

  1. demographics
  2. purchase history
  3. ?

Unstructured Data:

Streams of "Chirps"

Links: one "row" of structured data to many "documents"
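
The one-to-many link means document-level features have to be rolled up to one value per structured row before they can sit next to demographics and purchase history. A minimal sketch, assuming hypothetical customer ids, column names, and per-document flags; averaging is only one possible roll-up.

```python
from collections import defaultdict

# Hypothetical structured rows keyed by customer id.
customers = {
    "c1": {"age": 34, "purchases": 5},
    "c2": {"age": 51, "purchases": 1},
}

# Hypothetical per-document flags: (customer id, features of one document).
doc_features = [
    ("c1", {"rude": 1, "clean": 0}),
    ("c1", {"rude": 0, "clean": 1}),
    ("c1", {"rude": 1, "clean": 0}),
]


def aggregate_docs(doc_features):
    """Sum each feature per key and count documents per key."""
    sums = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(int)
    for key, feats in doc_features:
        counts[key] += 1
        for name, value in feats.items():
            sums[key][name] += value
    return sums, counts


def join_rows(rows, doc_features, feature_names):
    """Attach document count and per-feature shares to every row."""
    sums, counts = aggregate_docs(doc_features)
    joined = {}
    for key, row in rows.items():
        n = counts.get(key, 0)
        out = dict(row, n_docs=n)
        for name in feature_names:
            # Rows with no documents get 0.0; n_docs records the gap.
            out["share_" + name] = sums[key][name] / n if n else 0.0
        joined[key] = out
    return joined


table = join_rows(customers, doc_features, ["rude", "clean"])
for key, row in table.items():
    print(key, row)
```

Keeping `n_docs` as its own column lets the model tell "never said rude" apart from "never said anything".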

Links: classifying document/content authors

Kitchen sink approach...

also can fill in demographic holes, as mentioned before

Task:

predict operational metrics at different locations of a service business

Task:

predict merchant upside for participating in a "daily deal" promotion

What AND why.

tools matter...

because you're probably going to need some help...

How many SASonistas we got in the house?

What about Stataniacs?

numpy/scipy

CountVectorizer

HashingVectorizer
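
The idea behind a hashing vectorizer is that each token is hashed straight to a column index, so no vocabulary has to be stored. The sketch below shows only that idea in plain Python. It is not scikit-learn's implementation: the hash function (MD5), the bucket count, and the tokenizer are assumptions made for illustration.

```python
import hashlib
import re


def hashed_features(text, n_buckets=16):
    """Map a text to a fixed-length count vector via the hashing trick."""
    vec = [0] * n_buckets
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        # A stable hash, so the same token always lands in the same bucket.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        vec[index] += 1
    return vec


v1 = hashed_features("rude rude staff")
v2 = hashed_features("staff rude rude")
print(v1)
```

The trade-off against a count-based vectorizer: memory stays fixed and new tokens need no refit, but different tokens can collide in one bucket and columns can no longer be mapped back to words.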

"tm" package

CRAN Task View: Natural Language Processing

"I'm so much more than just ggplot2."

(Disclaimer: No way he has ever said that.)

Questions?

www.obscureanalytics.com