Future-Proofing Object Integrity

Integrating multihash and multibase

Figure: Visual explanation of multihashes (explained in article). Source: multiformats/multihash

Falsifiable hashes notebooks and image data as part of a simple content-addressing scheme. The initial implementation used SHA256 with BASE58-encoded digests for various file names. SHA256 grants data integrity; BASE58 affords (semi-)human-friendly parsing and manual copying: deliberately excluded from its alphabet are the number 0 and the letters I, O, and l, which are often hard to distinguish. While this scheme works fine, it fails to anticipate any unknown future demands. Unsurprisingly, Protocol Labs, a team that has thought a lot about sustainable content-addressing, has a solution: multibase with multihash.
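As an aside, the original SHA256-plus-BASE58 scheme can be sketched with the stdlib alone. This is a minimal illustration, assuming the Bitcoin-style BASE58 alphabet; the names `ALPHABET`, `b58encode`, and `b58decode` are hypothetical helpers written for this note, not the code Falsifiable actually ran.

```python
import hashlib

# Assumed alphabet: Bitcoin-style base58, which omits 0, O, I, and l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Encode bytes as base58 (hypothetical helper, not a library API)."""
    # Treat the input as one big-endian integer and repeatedly divide by 58.
    n = int.from_bytes(data, "big")
    chars = []
    while n > 0:
        n, rem = divmod(n, 58)
        chars.append(ALPHABET[rem])
    # Each leading zero byte is preserved as a leading '1'.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + "".join(reversed(chars))


def b58decode(text: str) -> bytes:
    """Inverse of b58encode."""
    n = 0
    for ch in text:
        n = n * 58 + ALPHABET.index(ch)
    pad = len(text) - len(text.lstrip("1"))
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\x00" * pad + body


# A content address in the old style: just the encoded digest, no self-description.
name = b58encode(hashlib.sha256(b"Hello, Multihash!").digest())
```

Nothing in `name` says which hash function or which encoding produced it, which is exactly the gap that multihash and multibase close.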

Multibase prefixes a message with a single character that declares the base encoding,

<base-encoding-character><base-encoded-data>

The mapping of characters to encodings is listed in the multiformats/multibase repository. The multihash protocol does something similar but for providing self-identifying hashes,

<varint hash function code><varint digest size in bytes><hash function output>

All variable integers are base-128 varints as defined in multiformats/unsigned-varint. The multiformats/multicodec library establishes the mapping table for hash function codes. Encoded this way, if I give you a multihash encoded with multibase, you have everything you need to correctly identify the relevant encoding and hash function for verifying data integrity in a future-proof way. Thus, I switched to multihash and multibase for content-addressing.
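For the curious, base-128 varints can be sketched in a few lines. This is an illustration under my reading of multiformats/unsigned-varint (seven payload bits per byte, least-significant group first, high bit set on every byte except the last); `encode_varint` and `decode_varint` are hypothetical names, and the sketch does not enforce the spec's limit on encoded length.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned base-128 varint."""
    if n < 0:
        raise ValueError("unsigned varints cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = n & 0x7F          # low seven bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(buf: bytes) -> Tuple[int, int]:
    """Return (value, number of bytes consumed) from the start of buf."""
    value = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, i + 1
    raise ValueError("truncated varint")


# Values below 128 fit in one byte, which covers SHA256's code and digest size.
assert encode_varint(0x12) == b"\x12"
assert encode_varint(32) == b"\x20"
# Larger values spill into a second byte.
assert encode_varint(300) == b"\xac\x02"
```

Because both 0x12 and 32 are below 128, their varint encodings are the single bytes themselves.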

There is a good chance your favorite language has a multihash and multibase implementation. But, I like from-scratch code examples to make sure I understand. For this one, I’ll be working with the message,

msg = b"Hello, Multihash!"

using SHA256 as the hash function and URL-safe base64 without padding as the encoding. (I’m using base64 here because the stdlib doesn’t have base58, although plenty of libraries are available.) The stdlib provides everything I need.

import hashlib
import base64

Working backwards, I’ll calculate the SHA256 digest first,

digest = hashlib.sha256(msg).digest()

Consulting the lookup table for multihash, I see that SHA256 has code 0x12. Combined with the 32 byte digest size (implicit to SHA256), the concatenated message is,

# I'm not showing how base-128 varints work. The 'big'-endian here just 
# conforms to the function signature of `to_bytes`. But since it's one
# byte in both cases, it's irrelevant.
multihash = (0x12).to_bytes(1, 'big') + (32).to_bytes(1, 'big') + digest

Consulting the lookup table for multibase, I see u is the prefix character for base64url with no padding. Concatenating this with the BASE64 encoding, I get the output.

output = b'u' + base64.urlsafe_b64encode(multihash).rstrip(b"=")
output
b'uEiBCmDZawQ7sHQYcOq4FDSgVbWEi72GVV-x6JZ2zZmadBA'

For the purposes of verification, I can check this against py-multibase,

from multibase import decode as decode_base

got_multihash = decode_base(output) 
assert got_multihash == multihash, f"{got_multihash} != {multihash}"

and then against pymultihash,

from multihash import decode as decode_hash

m = decode_hash(multihash)
assert m.digest == digest, f"{m.digest} != {digest}"
assert m.verify(msg)  # We have data integrity!
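The same check can also be sketched without third-party libraries, by reversing the encoding steps by hand. This is a minimal sketch that assumes only the two codes used in this post (`u` for unpadded base64url and 0x12 for SHA256) and single-byte varints; `verify_multibase_multihash` is a hypothetical helper, not part of py-multibase or pymultihash.

```python
import base64
import hashlib


def verify_multibase_multihash(encoded: bytes, msg: bytes) -> bool:
    """Check msg against a 'u'-prefixed, SHA256 multihash (sketch only)."""
    prefix, body = encoded[:1], encoded[1:]
    if prefix != b"u":
        raise ValueError(f"unsupported multibase prefix: {prefix!r}")
    # The stdlib decoder expects padding, so restore the stripped '=' characters.
    padded = body + b"=" * (-len(body) % 4)
    raw = base64.urlsafe_b64decode(padded)
    # Both header fields are below 128 here, so each varint is one byte.
    code, size, digest = raw[0], raw[1], raw[2:]
    if code != 0x12:
        raise ValueError(f"unsupported hash function code: {code:#x}")
    if size != len(digest):
        raise ValueError("declared digest size does not match payload")
    return hashlib.sha256(msg).digest() == digest


# Rebuild the encoded value from the walkthrough and verify it.
msg = b"Hello, Multihash!"
multihash = bytes([0x12, 32]) + hashlib.sha256(msg).digest()
encoded = b"u" + base64.urlsafe_b64encode(multihash).rstrip(b"=")
assert verify_multibase_multihash(encoded, msg)
assert not verify_multibase_multihash(encoded, b"tampered")
```

A real decoder would dispatch on the prefix and hash code through the multibase and multicodec tables rather than rejecting everything else.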

And that’s all there is to it.

But, Why?

For this example, multihash requires two extra bytes when compared to just the digest. The multibase requires another byte for its prefix. And the underlying base encoding generally inflates the two extra bytes of the multihash some more (for example, by a scaling factor of \(\frac{4}{3}\) for base64, depending on the full message size and padding). Rounding up, take that as about four extra bytes for transmission and storage: in this example, 47 characters versus 43 for the bare digest in the same encoding.

Additionally, as the README carefully notes, the leading bytes of multihash/multibase-encoded values aren’t uniformly distributed. This often really matters on the backend: if you’re not careful, you’ll end up hotspotting a single partition. Reversing the byte order internally can often avoid this.

The overhead is small. But I’m always apprehensive to dismiss the costs of a few bytes, especially in the context of indexing, so the flip side is: what does this scheme purchase?
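To put numbers on that cost first, the overhead can be measured directly. This is a small sketch using the same SHA256 and unpadded base64url choices as above; `b64url` is a hypothetical helper, and the 256 one-byte sample messages are arbitrary.

```python
import base64
import hashlib


def b64url(data: bytes) -> bytes:
    """URL-safe base64 without padding (hypothetical helper)."""
    return base64.urlsafe_b64encode(data).rstrip(b"=")


digest = hashlib.sha256(b"Hello, Multihash!").digest()
multihash = bytes([0x12, len(digest)]) + digest

bare = b64url(digest)              # digest only, no self-description
wrapped = b"u" + b64url(multihash)  # multibase prefix + encoded multihash

print(len(digest), len(multihash))  # 32 34 raw bytes
print(len(bare), len(wrapped))      # 43 47 encoded characters

# Every SHA256 multihash starts with the same two header bytes, so the
# encoded values share a prefix and differ only from the fourth character on.
samples = [
    b"u" + b64url(bytes([0x12, 32]) + hashlib.sha256(bytes([i])).digest())
    for i in range(256)
]
leading = {s[:4] for s in samples}
print(sorted(leading))
```

The shared `uEi` prefix is the non-uniform leading data behind the hotspotting caveat: a store partitioned on the first few characters would route every key to a handful of partitions.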

The answer, again, is that it future-proofs content-addressing.

It is hard to anticipate how people may integrate the shared content-hashes. Without multihash and multibase, the choice you make early on becomes the stone you can’t unthrow. If you change your encoding, the use cases you don’t see may break. Worse, if you break downstream expectations often enough, clients may start to treat the content hash as an opaque identifier. That is, they may start to skip verification, effectively implementing insecure location-addressing. Conversely, when you share a multihash in multibase, you have a well-defined procedure for decoding and verification, as do your users. To me, that’s worth the price.