Overview

For short, MinHash is a technique used to estimating the similarity between two sets.
It was introduced by Andrei Broder in 1997, primarily to aid in detecting duplicate web pages

MinHash operates on the priciple of locality-sensitive hashing(LSH). The fundamental idea is to represent a set by its minimum hash value derived from multiple hash function, avoiding access the details of all elements.

How it works

  • Hash Functions: select different hash function, and generate hash value of all element in each set
  • Minimum Hash Value: For each hash functions, the minimum value obtained from the hashed elements is taken as a representative value for that Set
  • Similarity Estimation: Counts how many times the minimum values of both Sets(A, B) are equal(across k functions). The similarity can be calculated as:

The similarity here is the estimation of Jaccard similarity, which defined as

Implement

An easy version

MinHash — datasketch 1.6.5 documentation
datasketch lib offer a MinHash class

from datasketch import MinHash
 
data1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'datasets']
data2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'documents']
 
m1, m2 = MinHash(), MinHash()
for d in data1:
    m1.update(d.encode('utf8'))
for d in data2:
    m2.update(d.encode('utf8'))
print("Estimated Jaccard for data1 and data2 is", m1.jaccard(m2))
 
s1 = set(data1)
s2 = set(data2)
actual_jaccard = float(len(s1.intersection(s2)))/float(len(s1.union(s2)))
print("Actual Jaccard for data1 and data2 is", actual_jaccard)

Maybe I will

Implement by myself

Just for fun or learning, it may not efficient enough compare to the easy one?