Feature hashing for large scale multitask learning

Feature hashing

Presented at ILPS Reading Group by Lars Buitinck

Symbolic features

  • Learning algorithms formulated on vectors/matrices: w⋅x > 0
  • Also true for cosine similarity etc.
  • NLP/IR applications: text
  • Features are initially strings
  • Need to vectorize: "buy v1agra now!" → (0, 0, 1, 0, 1, 1, 0, 1)

Vectorizing

  • Naive way, step 1: learn a dictionary
# assign each new feature string the next free column index
feature_to_index = {}
i = 0
for s in samples:
    for f, v in s:
        if f not in feature_to_index:
            feature_to_index[f] = i
            i += 1
  • Step 2: construct (sparse) vectors
# look up each feature's column; features not seen during training are dropped
vector = {}  # sparse representation: column index -> value
for f, v in sample:
    if f in feature_to_index:
        vector[feature_to_index[f]] = v

Problems

  • Dictionary size = vocabulary size
  • Specialized string tables (tries, automata) help, but still linear memory use
  • In true online settings, feature set keeps growing
  • Sometimes intentionally: once the spam filter has learned a high weight for "v1agra", spammers invent new misspellings, exploit Unicode, etc.

"Hashing trick" 0.9

  • Idea: hash the feature strings directly to column indices: O(1) extra memory (sketch below)
  • Fix some arbitrary vector length n
  • The column index of a feature f is h(f) mod n
  • In case of collision, add values (or OR them)
  • Ganchev and Dredze (2008) showed this works well (figure below), but collisions increase as n decreases
[Figure: ganchev-dredze.png — results from Ganchev and Dredze (2008)]
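  • A minimal runnable sketch of this vectorizer (assumptions: the name hashing_vectorize, the (feature, value) sample format, and zlib.crc32 standing in for h):
from zlib import crc32

def hashing_vectorize(sample, n):
    """Map an iterable of (feature_string, value) pairs to a length-n vector."""
    vector = [0.0] * n
    for f, v in sample:
        j = crc32(f.encode("utf-8")) % n  # column index h(f) mod n
        vector[j] += v                    # colliding features simply add up
    return vector

# e.g. hashing_vectorize([("buy", 1), ("v1agra", 1), ("now", 1)], n=16)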

Second hash function

  • For very large (online) problems, collisions matter
  • Solution of Weinberger et al.:
  • Let ξ be a hash function with range {-1, 1}
  • The index to update is still h(f) mod n, but multiply the feature's value by ξ(f):
for f, v in sample:
    vector[h(f) % n] += ξ(f) * v
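  • A runnable sketch of this signed version (deriving ξ from a second, salted crc32 call is an illustrative assumption, not the exact hash pair of Weinberger et al.):
from zlib import crc32

def signed_hashing_vectorize(sample, n):
    vector = [0.0] * n
    for f, v in sample:
        key = f.encode("utf-8")
        j = crc32(key) % n                                 # h(f) mod n
        sign = 1.0 if crc32(b"sign:" + key) % 2 else -1.0  # ξ(f) ∈ {-1, +1}
        vector[j] += sign * v
    return vector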

Second hash function's purpose

  • Suppose we're dealing with boolean features
  • Collision between f1 and f2, both true
      ξ(f1)    ξ(f2)    ξ(f1) + ξ(f2)
       -1       -1          -2
       -1        1           0
        1       -1           0
        1        1           2
  • 50% chance of resolving the collision!

Properties

  • Expected value in each column is zero, so data is centered for free
  • With boolean input, Gaussian-like output
  • This is what many other learning algorithms want
  • Let φ(x) be the vector produced for x by our hashing vectorizer
  • Works like a kernel K(x, x′) = φ(x)⋅φ(x′) with E[φ(x)⋅φ(x′)] = x⋅x′ (expectation over the random map φ, i.e. over h and ξ)
  • Can store the classifier's weight vectors as a sparse table
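  • A quick numerical check of that kernel property, averaging φ(x)⋅φ(x′) over many independently salted hash/sign functions (the salting scheme and the toy feature vectors are made up for illustration):
from zlib import crc32

def phi(sample, n, salt):
    vec = [0.0] * n
    for f, v in sample:
        key = (salt + f).encode("utf-8")
        vec[crc32(key) % n] += v if crc32(b"s" + key) % 2 else -v
    return vec

x1 = [("buy", 1.0), ("v1agra", 1.0), ("now", 1.0)]
x2 = [("buy", 1.0), ("cheap", 1.0), ("now", 1.0)]
n = 8
dots = [sum(a * b for a, b in zip(phi(x1, n, str(t)), phi(x2, n, str(t))))
        for t in range(2000)]
print(sum(dots) / len(dots))  # ≈ 2.0, the true inner product x1⋅x2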

Multitask learning

  • Global spam filter + personalized filter
  • Massively multiclass classification
  • Hash not f, but (T, f) for task T
  • So (user_id, term) for personalized spam filter
  • Single parameter vector w shared by all tasks: f(x) = w⋅(φ(x) + φ(x, T)) (see the sketch below)
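  • A sketch of this construction (the "|" separator and crc32 are illustrative choices, not the paper's exact scheme):
from zlib import crc32

def multitask_vectorize(sample, task, n):
    """Hash both the global feature f and the task-specific pair (task, f)."""
    vector = [0.0] * n
    for f, v in sample:
        for key in (f, task + "|" + f):  # global term, then personalized term
            k = key.encode("utf-8")
            sign = 1.0 if crc32(b"sign:" + k) % 2 else -1.0
            vector[crc32(k) % n] += sign * v
    return vector

# A single weight vector w of length n scores every task:
# score = dot(w, multitask_vectorize(x, user_id, n))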

Further reading

  • Bai, Weston, Grangier, Collobert, Chapelle and Weinberger (2009), Supervised semantic indexing, CIKM. Describes an extension to learning to rank (L2R).
  • Shi, Petterson, Dror, Langford, Smola and Vishwanathan (2009), Hash kernels for structured data, JMLR. Application to structured prediction.

That's it

  • Questions?
