What’s in a CID? Multi, Multi, Multi…


Every piece of data on IPFS can be referenced through its CID, which stands for Content IDentifier. You might have spotted some while navigating the IPFS jungle. They look like this: Qmd286K6pohQcTKYqnS1YhWrCiS4gz7Xi34sdwMe9USZ7u and this: QmYf4sT9KbtW3ZCKoX8DdgJy9tDKVAUjbPTBi525RNR29V. And also this: bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi.

But what are they made of, and how do you read one? A CID is built from three pieces: a multihash, a multicodec and a multibase.
Multi-what? In this short article, we’ll explain each of those terms so you can better understand CIDs.


CIDs are… Multihash
As explained in a previous article, IPFS uses cryptographic hashes to identify data. A cryptographic hash function is, in essence, a mathematical formula that turns a piece of data into a short fingerprint that is, for all practical purposes, unique to that data.

A popular algorithm for generating such hashes is SHA2-256, which produces a hash 256 bits long. It’s awesome and very useful today, but will it still be useful 10 years from now, when computing devices are more powerful and quantum computing might weaken this particular algorithm?

This scenario is not unheard of: older hash functions such as MD5 and SHA-1 are no longer considered secure, because practical collision attacks against them were found.

So which algorithm should IPFS use, knowing that eventually the algorithm could be broken? The solution is to use a multihash.


The multihash is very simple. It consists of an identifier for the hashing function used, the length of the hash and then the hash itself.

The hashing function is identified through a table that everyone agrees on, in which each possible hashing function is assigned a number.

So the multihash ends up looking like this: <hashing function identifier><length of the hash><the hash>
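In the case of a SHA2-256 hash, this is <SHA2-256><256 bits in length><001010101010…>. In practice, the table assigns SHA2-256 the number 0x12, and the length is recorded in bytes, so 256 bits is written as 32 (0x20).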
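
As a rough sketch of that layout in Python, here is one way to assemble a SHA2-256 multihash by hand. The helper name sha256_multihash is made up for this example, and the code assumes both prefix values fit in a single byte (true for SHA2-256, not for every entry in the table).

```python
import hashlib

SHA2_256_CODE = 0x12  # number assigned to SHA2-256 in the multihash table

def sha256_multihash(data: bytes) -> bytes:
    """Return <function id><digest length in bytes><digest> for SHA2-256."""
    digest = hashlib.sha256(data).digest()
    # Both values are below 128, so each one fits in a single byte here.
    return bytes([SHA2_256_CODE, len(digest)]) + digest

mh = sha256_multihash(b"hello world")
print(mh[:2].hex())  # 1220
print(len(mh))       # 34
```

The first two bytes, 0x12 and 0x20, are the reason so many IPFS hashes share the same opening characters once they are encoded as text.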

Awesome! Now our content identifiers are future-proof! If we want to change the hashing function, we can!

But what if we could get a little more information from our CID? Like what the data represents?

That’s where the next part comes in…

CIDs are… Multicodec

We want to add more information to our CID so that we have a better idea of the type of data it points to. Is it JSON? CBOR? Something else?

What we do is very simple: we add more data in front of our multihash to describe the codec, according to a table. This works in the same manner as the identifier of the hashing function in the multihash.

So now, our CID looks like this:
<codec identifier><multihash>

The CID is just a long, self-describing series of bits. First comes the multicodec, which describes the type of the data. Then comes the multihash, as explained earlier.
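
To make this concrete, here is a minimal Python sketch that puts a codec identifier in front of a multihash. The three table entries shown (raw, dag-pb, dag-cbor) are real multicodec values, but the CODECS dictionary and the add_codec helper are names invented for this illustration, and the sketch only handles identifiers small enough to fit in one byte.

```python
import hashlib

# A few entries from the multicodec table (identifier -> name).
CODECS = {0x55: "raw", 0x70: "dag-pb", 0x71: "dag-cbor"}

def add_codec(codec: int, multihash: bytes) -> bytes:
    """Return <codec identifier><multihash>."""
    if codec >= 0x80:
        # Larger identifiers need a multi-byte encoding, not covered here.
        raise ValueError("this sketch only handles one-byte identifiers")
    return bytes([codec]) + multihash

# SHA2-256 multihash: function id 0x12, length 0x20 (32 bytes), digest.
multihash = bytes([0x12, 0x20]) + hashlib.sha256(b"hello world").digest()
prefixed = add_codec(0x55, multihash)

print(CODECS[prefixed[0]])  # raw
print(len(prefixed))        # 35
```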

But there’s more…

CIDs are… Multibase
Originally, IPFS CIDs were written in base58, the same base used to encode Bitcoin addresses.
But of course, we could be using all kinds of bases, such as base32, so, once again, we need to add more data in front of our CID. We now add the multibase, which simply tells us which base was used to encode the CID.

<multibase>base(<multicodec><multihash>)
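
Here is a small Python sketch of the idea, covering two bases only. The prefix characters are the ones the multibase table assigns to lowercase base16 ("f") and lowercase base32 ("b"); the function name multibase_encode is hypothetical.

```python
import base64

def multibase_encode(data: bytes, base: str) -> str:
    """Encode data in the chosen base and put the multibase prefix in front."""
    if base == "base16":
        return "f" + data.hex()
    if base == "base32":
        # CIDs use lowercase base32 without the trailing '=' padding.
        text = base64.b32encode(data).decode("ascii").lower().rstrip("=")
        return "b" + text
    raise ValueError(f"base not covered by this sketch: {base}")

payload = b"\x01\x70"
print(multibase_encode(payload, "base16"))  # f0170
print(multibase_encode(payload, "base32"))  # bafya
```

The same bytes produce different strings, and the first character is enough to tell a reader how to decode the rest.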

Okay, so that’s it with prefixing data, right? We are done?

No… We need to look at a bit of CID history first, before we can understand the last bit of information to add.

V1 vs V0

At the beginning of IPFS, there were no multibases or multicodecs. All CIDs were multihashes only.
We call those CIDs version 0. Later, the IPFS project decided to improve CIDs by adding the multicodec and the multibase. And thus version 1 replaced version 0.

So how do we differentiate between version 0 and version 1 CIDs? How do we tell if a CID is of an upcoming hypothetical version 2? Or even version 3?

That’s why, from version 1 onward, it was decided to add the version to all CIDs. It goes right after the multibase.
So now CIDs look like this:

<multibase>base(<CID version><multicodec><multihash>)
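
Reading that layout from left to right can be sketched in a few lines of Python. This is an illustration rather than a full parser: the function name decode_cidv1_base32 is made up, it only accepts base32 CIDs (multibase prefix "b"), and it assumes the version, codec, hash function and length each fit in one byte.

```python
import base64

def decode_cidv1_base32(cid: str) -> dict:
    """Split a base32 CIDv1 into version, codec, hash function and digest."""
    if not cid.startswith("b"):
        raise ValueError("expected the 'b' (base32) multibase prefix")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding that CIDs omit
    raw = base64.b32decode(body)
    # Assumption: each of these four fields is below 128 (one byte each).
    version, codec, hash_fn, hash_len = raw[0], raw[1], raw[2], raw[3]
    digest = raw[4:]
    if len(digest) != hash_len:
        raise ValueError("digest length does not match the length field")
    return {"version": version, "codec": codec,
            "hash_fn": hash_fn, "hash_len": hash_len, "digest": digest}

parts = decode_cidv1_base32(
    "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi")
print(parts["version"], hex(parts["codec"]), hex(parts["hash_fn"]))
# 1 0x70 0x12
```

For this CID, the fields read as version 1, codec 0x70 (dag-pb) and hash function 0x12 (SHA2-256) with a 32-byte digest.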

BAFY vs Qm

To help you synthesise this information, here’s an awesome tool that will allow you to analyse CIDs and each of their components: https://cid.ipfs.io/

What I want you to do is to plug this version 0 CID into the tool: Qmd286K6pohQcTKYqnS1YhWrCiS4gz7Xi34sdwMe9USZ7u

As you might notice, the multicodec and the multibase are implicit. Why is that? Because they didn’t exist for version 0 CIDs! Therefore, we just assume what they are: base58btc for the base and dag-pb for the codec.

At the bottom of the page, you will see a string starting with bafy. This is the equivalent version 1 CID. A neat trick to differentiate v0 and v1 CIDs is to look at the first letters. If it starts with Qm, it is probably a v0 CID. If it starts with bafy, it is probably a v1 CID.
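
The conversion the tool performs can be approximated in Python as follows. Treat it as a sketch under a few assumptions: the helper names b58decode and v0_to_v1 are invented here, the base58 decoder is a bare-bones version written for this example, and the codec is taken to be dag-pb (0x70), which is what version 0 CIDs imply.

```python
import base64

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58decode(text: str) -> bytes:
    """Decode a base58btc string into bytes."""
    num = 0
    for ch in text:
        num = num * 58 + B58_ALPHABET.index(ch)
    body = num.to_bytes((num.bit_length() + 7) // 8, "big")
    # Each leading '1' character stands for one leading zero byte.
    pad = len(text) - len(text.lstrip("1"))
    return b"\x00" * pad + body

def v0_to_v1(cid_v0: str) -> str:
    """Re-encode a version 0 CID as a base32 version 1 CID."""
    multihash = b58decode(cid_v0)
    if len(multihash) != 34 or multihash[:2] != b"\x12\x20":
        raise ValueError("expected a 34-byte SHA2-256 multihash")
    raw = bytes([0x01, 0x70]) + multihash  # version 1, dag-pb codec
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")

cid_v1 = v0_to_v1("Qmd286K6pohQcTKYqnS1YhWrCiS4gz7Xi34sdwMe9USZ7u")
print(cid_v1[:7], len(cid_v1))  # bafybei 59
```

The fixed opening bytes (version 1, dag-pb, SHA2-256, 32 bytes) are what produce the familiar bafy prefix in base32, just as the 0x12 0x20 multihash header produces Qm in base58.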

Finally, plug this version 1 CID into the tool: bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi.

The tool will now show us the base and codec according to the format described above.

Congrats!

You are now a master of CIDs! You understand the ins and outs of IPFS CIDs and can describe each component.