Adaptive Bloom Filter Encoding

Thanks to the command line tools in RSK, I was able to analyze the block structure (a short experiment using the RSK testnet). One thing to explore is the compression of the Bloom filters that are included in the block headers. Read Bloom filters in Ethereum.

Since I also have Bloom filters in my personal blockchain project, I coded a Bloom filter and its corresponding encoder. As in the Ethereum and RSK blockchains, my Bloom filter contains 256 bytes (2048 bits). Usually, each topic or address in the logs emitted during the execution of the block's transactions is marked by setting bits in the Bloom filter.
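
As a reference, a minimal sketch of such a Bloom filter could look like the following. The class name, BLOOM_BYTES and getBytes() match the snippets below, but the way addTopic derives the bit positions (a SHA-256 hash and three bits per topic, roughly as Ethereum does with Keccak-256) is only my assumption for illustration:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class Bloom {
    public static final int BLOOM_BYTES = 256; // 2048 bits

    private final byte[] bytes = new byte[BLOOM_BYTES];

    // Mark a topic (or address) by setting a few bits derived from its hash
    public void addTopic(byte[] topic) {
        byte[] hash = sha256(topic);

        // Three 11-bit positions taken from the first six bytes of the hash
        for (int i = 0; i < 6; i += 2) {
            int position = ((hash[i] & 0x07) << 8) | (hash[i + 1] & 0xff);
            setBit(position);
        }
    }

    public void setBit(int position) {
        bytes[position / 8] |= (byte)(1 << (position % 8));
    }

    public byte[] getBytes() {
        return Arrays.copyOf(bytes, BLOOM_BYTES);
    }

    private static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);
        }
    }
}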

My encoding algorithm uses RLP (Recursive Length Prefix), as in Ethereum and RSK. But instead of directly encoding the 256 bytes inside the Bloom filter, I took an adaptive approach: choose the better of two algorithms:

public static byte[] encode(Bloom bloom) {
    byte[] bytes = bloom.getBytes();
    byte[] bytes2 = encodeNonZero(bloom);

    // Use the non-zero encoding only when it is shorter than the full 256 bytes
    if (bytes2.length < Bloom.BLOOM_BYTES)
        return RLP.encode(bytes2);

    return RLP.encode(bytes);
}

The encodeNonZero algorithm encodes ONLY the bytes of the Bloom filter that are NOT ZERO:

public static byte[] encodeNonZero(Bloom bloom) {
    byte[] bytes = bloom.getBytes();
    // Worst case: every byte is non-zero, giving 256 (index, value) pairs
    byte[] nonzero = new byte[Bloom.BLOOM_BYTES * 2];
    int n = 0;

    for (int k = 0; k < bytes.length; k++)
        if (bytes[k] != 0) {
            // Emit the byte index followed by the byte value
            nonzero[n++] = (byte)k;
            nonzero[n++] = bytes[k];
        }

    return Arrays.copyOf(nonzero, n);
}
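
For completeness, a decoder can tell the two formats apart just by the length of the decoded payload, since the non-zero encoding is only used when it is shorter than 256 bytes. This is only a sketch of how it could work, applied to the byte array already recovered from the RLP encoding; it is not the actual code of the project:

public static byte[] decodeBloomBytes(byte[] decoded) {
    // Full encoding: the payload already is the 256 filter bytes
    if (decoded.length == Bloom.BLOOM_BYTES)
        return Arrays.copyOf(decoded, Bloom.BLOOM_BYTES);

    byte[] bytes = new byte[Bloom.BLOOM_BYTES];

    // Non-zero encoding: pairs of (byte index, byte value);
    // the index is masked because byte is signed in Java
    for (int k = 0; k < decoded.length; k += 2)
        bytes[decoded[k] & 0xff] = decoded[k + 1];

    return bytes;
}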

Then I wrote a new main class to generate random results:

java -cp blockchainj.jar com.ajlopez.BloomEncodingMemory 1000 0 200 10

This produces random Blooms having from 0 to 200 random topics, changing the number of topics in steps of 10. Each step generates 1000 such random Blooms. Even when the number of topics varies, some topics could be repeated, so the number of bits on inside the Bloom filter (I call it “its size”) is a bit smaller.
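
I do not reproduce the whole class here, but the core of the experiment could be sketched as follows; the class name BloomEncodingExperiment, the topic generation and the size accounting are my reconstruction of what the command line arguments above suggest, using the Bloom sketch and the encode method shown earlier:

import java.util.Random;

public class BloomEncodingExperiment {
    public static void main(String[] args) {
        int count = Integer.parseInt(args[0]); // Blooms per step, e.g. 1000
        int from = Integer.parseInt(args[1]);  // minimum number of topics, e.g. 0
        int to = Integer.parseInt(args[2]);    // maximum number of topics, e.g. 200
        int step = Integer.parseInt(args[3]);  // step for the number of topics, e.g. 10

        Random random = new Random();

        for (int ntopics = from; ntopics <= to; ntopics += step) {
            long totalBytes = 0;

            for (int j = 0; j < count; j++) {
                Bloom bloom = new Bloom();

                // Add random 32-byte topics; some bit positions may repeat
                for (int k = 0; k < ntopics; k++) {
                    byte[] topic = new byte[32];
                    random.nextBytes(topic);
                    bloom.addTopic(topic);
                }

                // encode is the adaptive method shown above
                totalBytes += encode(bloom).length;
            }

            System.out.println(ntopics + " topics: average encoded length = "
                + (totalBytes / (double)count) + " bytes");
        }
    }
}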

Notably, the number of bits on is linear with the number of random topics:

[Chart: Bloom size (number of bits on) varying the number of topics (they could be repeated)]

(read Important Formulas (Part 7) — Permutation and Combination)
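
As a sanity check on that shape: assuming each topic sets bits independently and uniformly among the 2048 positions, the expected number of bits on after n topics that set k bits each is 2048 * (1 - (1 - 1/2048)^(k*n)), which stays close to k*n while k*n is small compared to 2048. A quick way to evaluate it (k = 3 bits per topic as in Ethereum is an assumption; with one bit per topic use k = 1):

public static double expectedBitsOn(int ntopics, int bitsPerTopic) {
    int m = 2048; // total bits in the filter

    // Each of the k*n bit choices misses a given position with probability 1 - 1/m
    return m * (1 - Math.pow(1 - 1.0 / m, (double)bitsPerTopic * ntopics));
}

// expectedBitsOn(200, 3) is roughly 520, already a bit below the 600 bits
// that a purely linear count would give.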

There is an improvement in the memory used with compressed Blooms:

[Chart: with few topics the adaptive encoding is better]

But after more than 200 random topics, there is no advantage over the classical encoding of all 256 bytes (3 bytes are usually added as the RLP length prefix).

So the compression only makes sense with fewer than 200 random topics. But topics are not random in a blockchain: many of them could be repeated if the most popular contracts are standard tokens and DeFi applications. I would prefer, then, to repeat these experiments with real data, over a chain with real activity.

Angel “Java” Lopez