Bloom Filter in Go

Source:

What is Bloom filter?

When building high-performance systems, minimizing expensive operations like disk reads or remote database queries is critical.

A Bloom filter is a highly space-efficient, probabilistic data structure designed to test whether an element is a member of a set. It was conceived by Burton Howard Bloom in 1970.

Unlike standard data structures like hash tables or binary search trees that store the actual data, a Bloom filter only stores a cryptographic representation of the data. Because it does not store the raw elements, it requires a fraction of the memory.

The defining characteristic of a Bloom filter is its ironclad rule regarding accuracy:

By placing a Bloom filter in front of a slow datastore, you can instantly filter out requests for missing items, saving massive amounts of latency and computational overhead.

The Core architect

To understand how it works, you need to understand its two primary components:

  1. The Bit Array (m):
    The filter begins as an array of m bits, all initially set to 0.

  2. The Hash Functions (k):
    The filter uses k independent hash functions. When you feed data into a hash function, it outputs a seemingly random, uniform integer. In a Bloom filter, this output is mapped to an index position within the m-bit array.

To initialize an optimal Bloom filter, we need two parameters based on our expected capacity (n) and acceptable false-positive probability (p):

  1. Size of the bit array (m):

  1. Number of hash functions (k):

Instead of actually running k separate hash functions (which is computationally expensive), we will use the Kirsch–Mitzenmacher optimization. We can hash the data once into a 64-bit integer, split it into two 32-bit integers (h₁ and h₂), and simulate k hash functions using the formula: hash_i = h_1 + i * h_2

How It Works: Step-by-Step

Adding an Element

When you want to add an item (for example, the string "user_123") to the Bloom filter:

Let’s explore how this works with a concrete example. Imagine we created a Bloom filter of 3 bits and 2 hash functions and inserted the strings “Hello” and “Bloom” into the filter. If the hash functions match an incoming value with an index in the bit array, the Bloom filter will make sure the bit at that position in the array is 1.

tinybird-bloom-filter-gif
tinybird-bloom-filter-gif

In this example, “Hello” was hashed to 1 by the first hash function and 3 by the second hash function. So, the Bloom filter made sure the bits at index 1 and 3 were flipped to 1. Then, “Bloom” was hashed to 1 and 2. The Bloom filter made sure those were both a 1 as well (even though position 1 already had a 1).

Checking for an Element (Lookup)

When you want to check if "user_123" is in the filter:

Why False Positives Happen

The reason an item is only probably in the set is due to hash collisions.

Imagine you add "Item A" and it flips bits at indices 2, 5, and 8.
Then you add "Item B" and it flips bits at indices 4, 11, and 15.

Now, you check for "Item C". By pure coincidence, the hash functions for "Item C" point to indices 2, 8, and 15. The filter checks those positions, sees they are all 1, and reports that "Item C" is present.

This is a false positive, caused by the overlapping bits of "Item A" and "Item B".

Use cases

Today, Bloom filters are widely used across many technologies including databases, networks, and distributed systems. They serve to reduce the need for expensive disk or network operations, thus improving performance. For example:

The Implementation

mkdir bloom
cd bloom
go mod init bloom
touch bloom.go bloom_test.go

This implementation uses a sync.RWMutex to allow concurrent reads (Contains) while ensuring exclusive access during writes (Add). We use a []uint64 slice for maximum memory density.

package bloom

import (
    "hash/fnv"
    "math"
    "sync"
)

type BloomFilter struct {
    mu     sync.RWMutex
    bitset []uint64
    m      uint64
    k      uint64
}

func New(n uint64, p float64) *BloomFilter {
    mFloat := -(float64(n) * math.Log(p)) / (math.Pow(math.Log(2), 2))
    kFloat := (mFloat / float64(n)) * math.Log(2)

    m := uint64(math.Ceil(mFloat))
    k := uint64(math.Ceil(kFloat))
    if k < 1 {
        k = 1
    }

    return &BloomFilter{
        bitset: make([]uint64, (m+63)/64),
        m:      m,
        k:      k,
    }
}

func (bf *BloomFilter) getHashes(data []byte) (uint32, uint32) {
    h := fnv.New64a()
    h.Write(data)
    hash64 := h.Sum64()
    return uint32(hash64 & ((1<<32) - 1)), uint32(hash64 >> 32)
}

func (bf *BloomFilter) Add(data []byte) {
    h1, h2 := bf.getHashes(data)
    bf.mu.Lock()
    defer bf.mu.Unlock()

    for i := uint64(0); i < bf.k; i++ {
        idx := (uint64(h1) + i*uint64(h2)) % bf.m
        bf.bitset[idx/64] |= 1 << (idx % 64)
    }
}

func (bf *BloomFilter) Contains(data []byte) bool {
    h1, h2 := bf.getHashes(data)
    bf.mu.RLock()
    defer bf.mu.RUnlock()

    for i := uint64(0); i < bf.k; i++ {
        idx := (uint64(h1) + i*uint64(h2)) % bf.m
        if (bf.bitset[idx/64] & (1 << (idx % 64))) == 0 {
            return false
        }
    }
    return true
}

Benchmarking the Impact

To demonstrate the value of the filter, we simulate a slow system (like a database taking 50 microseconds to respond) and benchmark looking up an item that does not exist.

package bloom

import (
    "testing"
    "time"
)

func expensiveDatabaseLookup(key string) bool {
    time.Sleep(50 * time.Microsecond)
    return false
}

func BenchmarkWithoutBloomFilter(b *testing.B) {
    key := "missing-user-id"
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = expensiveDatabaseLookup(key)
    }
}

func BenchmarkWithBloomFilter(b *testing.B) {
    bf := New(1_000_000, 0.01)
    key := []byte("missing-user-id")

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        // Fast path: Bloom filter checks first.
        if bf.Contains(key) {
            _ = expensiveDatabaseLookup(string(key))
        }
    }
}

The Results

Running go test -bench=. -benchtime=2s yields stark results:

By verifying that the item is definitely not in the set, the Bloom filter intercepts the request in nanoseconds, completely bypassing the expensive 50-microsecond database penalty.