Feature hashing for large scale multitask learning

Feature hashing

Presented at ILPS Reading Group by Lars Buitinck

Symbolic features

  • Learning algorithms formulated on vectors/matrices: w⋅x > 0
  • Also true for cosine similarity etc.
  • NLP/IR applications: text
  • Features are initially strings
  • Need to vectorize: "buy v1agra now!" → (0, 0, 1, 0, 1, 1, 0, 1)

Vectorizing

  • Naive way, step 1: learn a dictionary
# assign each new feature string the next free column index
feature_to_index = {}
i = 0
for s in samples:
    for f, v in s:
        if f not in feature_to_index:
            feature_to_index[f] = i
            i += 1
  • Step 2: construct (sparse) vectors
# look up each feature's column; features not seen during training are dropped
vector = {}  # sparse representation: column index -> value
for f, v in sample:
    if f in feature_to_index:
        vector[feature_to_index[f]] = v

Problems

  • Dictionary size = vocabulary size
  • Specialized string tables (tries, automata) help, but still linear memory use
  • In true online settings, feature set keeps growing
  • Sometimes intentionally: once the spam filter has learned a high weight for "v1agra", spammers invent new misspellings, exploit Unicode, etc.

"Hashing trick" 0.9

  • Idea: hash the feature strings directly to column indices: O(1) extra memory (sketch below)
  • Fix some arbitrary vector length n
  • The column index of a feature f is h(f) mod n
  • In case of collision, add values (or OR them)
  • Ganchev and Dredze (2008) showed this works well (figure below), but collisions increase as n decreases
[Figure: ganchev-dredze.png — results from Ganchev and Dredze (2008)]
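  • A minimal runnable sketch of this vectorizer (assumptions: the name hashing_vectorize, the (feature, value) sample format, and zlib.crc32 standing in for h):
from zlib import crc32

def hashing_vectorize(sample, n):
    """Map an iterable of (feature_string, value) pairs to a length-n vector."""
    vector = [0.0] * n
    for f, v in sample:
        j = crc32(f.encode("utf-8")) % n  # column index h(f) mod n
        vector[j] += v                    # colliding features simply add up
    return vector

# e.g. hashing_vectorize([("buy", 1), ("v1agra", 1), ("now", 1)], n=16)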

Second hash function

  • For very large (online) problems, collisions matter
  • Solution of Weinberger et al.:
  • Let ξ be a hash function with range {-1, 1}
  • The index to update is still h(f) mod n, but multiply the feature's value by ξ(f):
for f, v in sample:
    vector[h(f) % n] += ξ(f) * v
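  • A runnable sketch of this signed version (deriving ξ from a second, salted crc32 call is an illustrative assumption, not the exact hash pair of Weinberger et al.):
from zlib import crc32

def signed_hashing_vectorize(sample, n):
    vector = [0.0] * n
    for f, v in sample:
        key = f.encode("utf-8")
        j = crc32(key) % n                                 # h(f) mod n
        sign = 1.0 if crc32(b"sign:" + key) % 2 else -1.0  # ξ(f) ∈ {-1, +1}
        vector[j] += sign * v
    return vector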

Second hash function's purpose

  • Suppose we're dealing with boolean features
  • Collision between f1 and f2, both true
      ξ(f1)    ξ(f2)    ξ(f1) + ξ(f2)
       -1       -1          -2
       -1        1           0
        1       -1           0
        1        1           2
  • 50% chance of resolving the collision!

Properties

  • Expected value in each column is zero, so data is centered for free
  • With boolean input, Gaussian-like output
  • This is what many other learning algorithms want
  • Let φ(x) be the vector produced for x by our hashing vectorizer
  • Works like a kernel K(x, x′) = φ(x)⋅φ(x′) with E[φ(x)⋅φ(x′)] = x⋅x′ (expectation over the random map φ, i.e. over h and ξ)
  • Can store the classifier's weight vectors as a sparse table
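  • A quick numerical check of that kernel property, averaging φ(x)⋅φ(x′) over many independently salted hash/sign functions (the salting scheme and the toy feature vectors are made up for illustration):
from zlib import crc32

def phi(sample, n, salt):
    vec = [0.0] * n
    for f, v in sample:
        key = (salt + f).encode("utf-8")
        vec[crc32(key) % n] += v if crc32(b"s" + key) % 2 else -v
    return vec

x1 = [("buy", 1.0), ("v1agra", 1.0), ("now", 1.0)]
x2 = [("buy", 1.0), ("cheap", 1.0), ("now", 1.0)]
n = 8
dots = [sum(a * b for a, b in zip(phi(x1, n, str(t)), phi(x2, n, str(t))))
        for t in range(2000)]
print(sum(dots) / len(dots))  # ≈ 2.0, the true inner product x1⋅x2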

Multitask learning

  • Global spam filter + personalized filter
  • Massively multiclass classification
  • Hash not f, but (T, f) for task T
  • So (user_id, term) for personalized spam filter
  • Single parameter vector w shared by all tasks: f(x) = w⋅(φ(x) + φ(x, T)) (see the sketch below)
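  • A sketch of this construction (the "|" separator and crc32 are illustrative choices, not the paper's exact scheme):
from zlib import crc32

def multitask_vectorize(sample, task, n):
    """Hash both the global feature f and the task-specific pair (task, f)."""
    vector = [0.0] * n
    for f, v in sample:
        for key in (f, task + "|" + f):  # global term, then personalized term
            k = key.encode("utf-8")
            sign = 1.0 if crc32(b"sign:" + k) % 2 else -1.0
            vector[crc32(k) % n] += sign * v
    return vector

# A single weight vector w of length n scores every task:
# score = dot(w, multitask_vectorize(x, user_id, n))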

Further reading

  • Bai, Weston, Grangier, Collobert, Chapelle and Weinberger (2009), Supervised semantic indexing, CIKM. Describes an extension to learning to rank (L2R).
  • Shi, Petterson, Dror, Langford, Smola and Vishwanathan (2009), Hash kernels for structured data, JMLR. Application to structured prediction.

That's it

  • Questions?
