The data structure within Bitcoin, Amazon Dynamo DB, ZFS, …
This article explores what are Merkle trees, and how they are used in practice in different systems including Bitcoin, Amazon’s Dynamo DB, and the ZFS filesystem. The basic concept is quite simple, but some of the clever applications are not so obvious.
First, lets start with the concept of Merkle trees. As I said, it is not too complicated in its basic form.
What is a Merkle Tree
A Merkle tree is fundamentally just a hierarchical set of hash values, building from a set of actual data (Merkle leaf) to intermediate hashes (Merkle braches) and up to the Merkle root that summarizes all the data in one hash value.
Example: A Basic Merkle Tree
The following figure illustrates a very small Merkle tree:
In this figure, the bottom nodes (Data1–Data4) are the actual data processed by the application. Each of these is summarized by their respective hash value (Hash1–Hash4), as a Merkle leaf. From these, the Merkle tree builds a hierarchy, combining hashes together until only one is left. The nodes combining other hash nodes are called Merkle branches (here Hash12 and Hash34). When there is only one left (here Hash1234), this is called the Merkle root. There can be multiple levels of branches and hashing as following examples will demonstrate.
Handling An Unbalanced Merkle Tree
The above example illustrates the very basic case of a Merkle tree. It is a convenient example, as at every level there are just the right number of nodes to form exact pairs. What happens if you have an uneven (odd) number of leaf (data) nodes? For example, what happens to the above example if you have 5 Data nodes? You can hash Data1+Data2 together to form a Merkle branch, and same for Data3+Data4. But Data 5 is left without a pair to hash into a new branch.
Different approaches can be taken to address this. For example, in this situation Bitcoin simply copies the un-pairable (odd) hash, and uses the duplicate as a pair (the odd hash is paired with itself). The following figure illustrates this:
In this example, Hash 5 has no immediate pair, and is duplicated to pair it with itself. Same happens for Hash55. The above is just one option, there are different ways to handle this pairing issue. Of the ones I have seen, I like the Monero (or Cryptonote?) one most. I will present it shortly. First, something slightly different.
A Very Simple Merkle Tree in Python
Theories and explanations are great. But for the technical person, some code helps formalize it. The following code from my Github repo illustrates a naive but simple example of implementing a Merkle tree in Python (or just skip it for more conceptual explanation after):
https://medium.com/media/c99d8470b4f41b5c0603a7f311adf474A very simple and naive Merkle tree implementation in Python. Code by author.
The last four lines in the above code run the example to create a Merkle tree with data nodes Data1–Data5. The calculation of the tree with this algorithm looks like this (cropping the hashes to the first 5 characters):
I used the keccak hash function in this example, as this is what the Monero cryptocurrency/blockchain uses, and I was looking into Monero code lately. The 5th (odd) data element here (using my Python code above) is handled slightly different from the Bitcoin example, as this simply re-uses the hash if one is left over (un-pairable, cf54b above). Practically this re-use should have the same effect as the duplication in the Bitcoin algorithm.
Optimizing Merkle Calculation: Monero Style
As I was looking into the Monero blockchain to understand it better, I found it has a different but clever approach to hashing the Merkle tree. The code for it is available in the tree-hash.c file in the Monero Github repository. And a Python version I made to implement the same in my Github.
The Monero approach could be described as converting the hash tree to a perfect binary tree. It hashes enough leaf nodes in the first iteration, so that the following iterations will always have some variant of 2ˣ (a power of 2) nodes. The following figure illustrates this:
This works as follows: First, calculate x so that 2ˣ is greater than number of transactions (data nodes). In this example, it is 2³=8, because 8>5 (there are 5 transactions/Data nodes). The value of x before that, 2²=4, would not fit (4 > 5 is not true). From this, the number of transactions is subtracted. In this case 8–5=3. This 3 is the index of transaction to start iteration 1 from. With 0-based indexing, the start index is Data4 in this example.
Above explanation is maybe a bit abstract. Below is a table with concrete examples of transaction counts and how these are converted into a “perfect binary tree” shape (size is always 2ˣ, eliminating any “odd” counts or leftover un-pairable hashes from following iterations) for iteration 2 and following iterations:
- Transactions: Number of transactions in the block being calculated
- Closest 2^x: First 2ˣ that is bigger than transactions
- Iteration 1 start index: start hashing pairs from this index in transaction list
- Iteration 1 values: Number of transactions in transaction list starting from iteration 1 start index to the end of transaction list
- Iteration 1 pairs: iteration 1 values divided by 2 (because hashed in pairs)
- Iteration 2 formula: How the above values lead to number of items to hash in the following iteration
- Iteration 2 size: The number of transactions to hash in iteration 2. As you can see, it is always 2ˣ, and leads into a “perfect binary tree”.
Too many numbers and formulas I guess. Did you fall asleep yet? Anyway, if you pick the row from above table with 23 transactions, you can follow this nice 8-bit animation I made on how the Merkle root calculation progresses for that row:
In this case (animation above), the start index is calculated as 2⁶=32–23=9. This is transaction number 10 (zero index adds 1). The following iteration then has 16 nodes, which is 2⁵, next one 8 or 2⁴, and so on. The animation above actually represents how the hash array is manipulated in memory by the Monero code.
Example Use Cases
Concepts are nice, while concrete, real-world, use cases make the point. Here I will discuss the use of Merkle trees in blockchains/cryptocurrencies, and Amazon’s AWS DynamoDB distributed database. I will also briefly touch on the ZFS filesystem, and the Git version control system, as these are sometimes also mentioned as examples of Merkle tree use.
Cryptocurrencies and blockchains: Bitcoin, Monero, et al.
As already briefly discussed above, Bitcoin and similar cryptocurrencies make use of Merkle trees to summarize and validate the transactions in a block, and embed the Merkle root into their block header as a summary of it all. The following figure illustrates this:
Each block has an ID value, which is the hash of its header fields. A part of this is the Merkle root. Another part is the previous block ID (Parent in above figure). By linking with the previous block ID, and including that as part of the next block ID, the blocks form a blockchain. By embedding the Merkle root in it, they make an immutable record of transactions in the block.
For example, the Bitcoin block header contains:
- Difficulty target value (called bits in Bitcoin)
- Merkle root of all transactions in the block
- Nonce; a value changed in mining to find accepted blocks
- Previous block hash (block ID); linking this block to the previous block in chain (this previous hash is named parent in figure above)
- Timestamp; Time of mining (creating the hash for) the block
- Block version; identifies support features and formats (also used to identify hash function in some blockchains such as Monero)
These header fields are all hashed together to form the block ID. Making the block ID a hash of all header data makes it practically impossible to modify a block header. Including the Merkle root in the block ID further makes it practically impossible to modify the transactions in a block.
But let’s look in a bit more detail into how these types of blockchains actually use the Merkle tree in different ways. It’s all about data integrity, but in different and sometimes ingenious ways.
Blockchain Use Case 1: Transaction Data (Block) Immutability
As I noted above, the base use case of the Merkle tree in Bitcoin style blockchains is to build the Merkle root into the block header, and using it to verify that no transaction has been changed.
As an example, let’s change Data4 to Data6 in the earlier example:
Comparing this to the earlier example, with Data4, the Merkle root was aa8d3. With Data6, it is now f8932. This way, any change in any transaction data changes the Merkle root, and it no longer matches the one stored in the blockchain for that block (in block header). And through the block header, this issue would propagate to the whole blockchain following this block.
However, if you think about it for a while, you may ask a question:
Do you really need Merkle trees to verify the block data?
No. Instead, one could just concatenate all the transaction data together and build a single root hash. Consider the above example with Data1–Data5, but just hashing them all together:
Now, changing just one data value will have the same effect of invalidating the summary hash:
This single hash could also be used in place of the Merkle root in the block header and block hash. Same effect so far.
So what is the real benefit of using a Merkle tree here? To summarize, I would say it gives you more granularity in the hash verification, and enables other clever tricks to process the blockchain more efficiently. While also providing the transaction integrity assurance / verification.
The original Bitcoin whitepaper provides two additional examples for use of the Merkle tree: blockchain pruning and simplified payment verification (SPV). Beyond these, the one I really like is the Stratum mining pool protocol. Let’s look at these next.
Over time, a blockchain will grow larger and larger as it accumulates more and more blocks. For example, today (Feb/2021) the size of the Bitcoin blockchain is about 380GB. Blockchain pruning is an approach to reduce this used space with the help of Merkle trees. By removing used transactions from local storage where no longer needed.
We want to have all data around for full scale verification and history, but not all nodes in the peer to peer network need all the data. The blockchain pruning approach proposed in the original Bitcoin whitepaper suggests addressing this by using the Merkle tree to prune (remove) spent (used) transactions from the blocks.
The following figure illustrates this:
In this example, we have used (spent) transactions TX1, TX2, and TX4. We still have unused transactions TX3 and TX5 left in this block. Assume we are not interested in a full node with full archival, just keeping a list of spendable transactions. Satoshi’s proposal was to prune the block from the used transaction data, and simply leave the Merkle tree branches needed to verify the unspent transaction data:
This is perhaps more clear if we also spend TX3:
The block could now be pruned to only contain the TX5 data, and the hash of the other Merkle tree branch (96b8d) in the block:
We would just have saved the space require to store 4 our of 5 transactions, and 6 out of 9 Merkle tree nodes. About 80% space savings on that example. The longer the blockchain gets, and the more it is used, the more spent transactions it would have. And thus more space could be saved. It’s a very clever idea, like most ideas in the original Bitcoin whitepaper.
But much as with block verification, we can always ask the question:
Do you really need Merkle trees to prune the blockchain?
The Bitcoin StackExchange is full of insightful discussion on the actual ways pruning is applied in the actual implementations. While the Merkle tree pruning is a clever approach, it is not applied in this way in (at least) the Bitcoin core software.
Instead, the unspent transactions are stored in a separate database for fast lookup. This database is initialized on first startup by scanning the blockchain and updating the database as new and spent transactions are found. The database is continously updated as new blocks are broadcasted for the Bitcoin network. A pruned node could then just rely on this database.
Running a pruned node in general is sufficient for the basic operations, but does not fully support all features of a blockchain, so some nodes will still need to hold the full data. But that goes beyond the scope of this article.
I think the basic takeaway here is that Merkle trees are cool but sometimes the basic and even simpler approach is fine, and even better. Of course, the hard part is identifying when this is true and when the cooler (Merkle) approach is really best. And don’t take this in any way as a suggestion to not use Merkle trees. Just to think about the overall picture. Also with Bitcoin, and related blockchains, I believe the Merkle tree enables many other things and thus makes a lot of sense. As following sections will show.
Simplified Payment Verification (SPV)
The simplified payment verification (SPV) algorithm was also suggested in the original Bitcoin whitepaper. In SPV, a light-weight blockchain client only stores the block headers, but also wishes to verify payments it receives as valid transactions in the blockchain. Lacking full transaction details, the SPV client uses Merkle trees to effectively verify the transaction details in collaboration with full nodes.
I will re-use the example from blockchain pruning above. The SPV client wants to verify TX5 in the given block. The following figure illustrates this:
Here, the SPV client node asks a full node for the Merkle branches required to build the Merkle root with the TX5 data. By rebuilding the Merkle root (aa8d3) from the transaction of interest (TX5), and the Merkle branches (96b8d) provided by the full node, the SPV client can have confidence in having received a valid transaction. By checking this re-built Merkle root against the one stored in the block header, it can verify that the whole tree (and thus TX5) is in fact valid, and is a part of the blockchain.
I find SPV to be an interesting example of how Merkle trees can be used, together with (block) data filtering (Bitcoin uses Bloom filters but that is again another topic), to synchronize and verify existence and correctness of selected data in a distributed system.
Pool Mining: The Stratum Protocol
The traditional way to create cryptocurrencies has been mining via proof of work (PoW) hashing. Mining pools are a way for smaller miners to join together and get some rewards for mining, based on how much compute (hash) power they contribute to the pool. This leads to the requirement for a central entity, the mining pool, to be able to coordinate the miners. It needs a way to effectively distribute and track the overall mining process across all clients. With some clever tricks, the Stratum protocol uses Merkle trees to enable this.
Especially Stratum version 1 makes use of Merkle trees to distribute the work efficiently. The pool server provides the miner nodes with the block header elements needed to build the block. This includes the partial Merkle tree, with the branches calculated for all other transactions except the Coinbase transaction. The coinbase transaction is a special transaction that pays out the mining reward. This is illustrated by the following figure:
This figure contains three elements: the Merkle tree template, the Coinbase template, and the search for the nonce(s). The pool provides the Merkle tree template to the miner, containing the pre-calculated Merkle branches for the transactions in the block (here 96b8d, thus highly reducing bandwidth and calculation needs for miners). The miner is then expected to fill the Merkle tree template with a suitable coinbase transaction, built from the server provided coinbase template. The winner is one who fills the template with values providing a hash that matches the blockchain network difficulty level. Meaning just a hash with suitable small value.
The coinbase template is provided by the pool, and is filled by the miner using their choice of nonce and extranonce fields. When combined with the Merkle tree template, each of these provides a different block hash, and the only way to find a winning block is to try values until one produces a hash matching the network difficult target. If the miner finds a nonce that matches the network hash difficulty for this Merkle template, it submits it to the mining pool. The coinbase template contains the mining pool address, ensuring that the pool receives the block reward, and can distribute it to all the miners.
In a broader context, the Merkle tree here is used to distribute a partial solution (the pre-calculated Merkle tree branches and coinbase template), while allowing different distributed nodes to work independently to try to find the solution to the missing part (nonces for the coinbase transaction to build an acceptable PoW hash). By embedding the pool address as part of the template, it ensures that all distributed nodes contribute to the common goal and can share the rewards.
AWS Dynamo DB
Dynamo DB is a distributed database provided as part of the Amazon Web Services (AWS) platform. It was originally designed to handle much of Amazon’s retail global infrastructure needs. As such, a lot of consideration has been put into making it scalable and optimizing the efficiency in all parts of its architecture. The architecture of Dynamo DB is described in the AWS Dynamo DB paper from 2007, including its use of Merkle trees to efficiently synchronize diverged nodes (section 4.7 in the paper).
Dynamo DB is what is called a key-value store. The actual data (values) are stored in data nodes, and identified and indexed by their keys. Dynamo DB hosts this data in what it calls virtual nodes. Each virtual node hosts a key-range. To handle high-availability and scalability requirements, Dynamo DB distributes the data across these key-ranges, and hosts each key-range on multiple virtual nodes (across multiple physical nodes, or partitions).
The following figure tries to illustrate this, with 3 (virtual) Dynamo DB nodes, each holding 2 key ranges, each key range duplicated across two nodes. I am sure this is not exact to all the details, and the system had evolved since the paper was published, but I believe it depics the general concept well enough:
In this type of a distributed system, there will eventually always come a time when some of these virtual nodes are out of sync with other virtual nodes holding the same key-range. Dynamo DB uses Merkle trees to perform effective comparison and synchronization of the key-ranges in these nodes. In the above figure I made a small 3-node Merkle tree inside each key-range to illustrate this The KR3 key-range tree is shown partially in red to illustrate how the nodes would have diverged and need resolving to find correct values.
A Merkle tree is built for each key-range, where the leaf nodes of the tree are the key-range data values. The Merkle root summarizes the data in each node. By comparing the Merkle roots of each virtual node hosting the same key-range, divergence in nodes is immediately visible. This full comparison requires only the communication and comparison of a single value, the Merkle root. If there is a difference, one can keep comparing the branches of the tree to efficiently find the exact divergent spots, and the data values to synchronize.
The following figure illustrates this process:
As you can see, I just copied the blockchain example into the above figure. Because the fundamel structure, and the concepts it relies on, are exactly the same! The difference is only in the (clever) application of the same fundamentals in a different context.
Yes, some details would likely differ. For example , DynamoDB likely uses a different hash function than Keccak. But, as illustrated here, the fundamental approach and concepts are the same, and as such applicable to more than just the common blockchain example of Merkle trees.
ZFS File System
ZFS is a filesystem supporting data spread over multiple volumes, a form of distributed filesystem. Multiple ZFS volumes can be grouped into storage pools, which can also host redundant copies of the data. ZFS is advertised as using Merkle trees to ensure data integrity.
ZFS uses Merkle trees to checksum data, to identify issues where some part of the data written to, or read from, disk is corrupted (or misread etc.). As in most filesystems, the data is stored as blocks of specific sizes. These blocks form a natural leaf node for building a Merkle tree.
The main rationale I found online for ZFS to use Merkle trees always discusses how ZFS stores the Merkle tree as a checksum outside the block itself, providing higher resiliency for identifying errors. This leaves it quite unclear to me why you need a Merkle tree for this, and not just store a general checksum outside the block. Sometimes I feel people should ask more questions.
On the other hand, I could see the benefit of verifying data across multiple storage pools and their replicas by using hierarchical Merkle trees. This case would make it very similar to how DynamoDB operates and uses Merkle trees to sync nodes (vs ZFS volumes). However, I could not find any description for using Merkle trees for this. If someone knows better, I would be happy to hear.
In any case, this seems like a very similar case to Dynamo DB, as being used to verify integrity of a distributed data storage system more efficiently. With the added benefit of separating the consistency properties (checksum hashes) from the data itself, for increased resiliency.
Git Version Control System
Another example that I found referenced as an example of Merkle trees is the Git version control system. Git does use a lot of hashing, and its basic functionality is heavily based on hashing file contents and various metadata. Git forms a directed acyclic graph (DAG) of commits identified by these hashes. Which seems to be considered by some to be a Merkle tree. I disagree, so I will use this as an example of some nuances in the concept.
A DAG is a graph where connections go one way (directed), and there are no cycles (acyclic). Unfortunately, I do not understand how the Git DAG would be a Merkle tree. The following figure represents a Git DAG with a few of the most common Git graph actions – branches and merges:
Here you have nodes that are identified by their hashes, and a tree-type structure where each node points to a parent, and has one or more children. However, there are multiple issues I see in calling this a Merkle tree:
- Nodes in a Merkle tree do not have multiple parents. The DAG does. The figure above has two such nodes: 8e290 and 184da.
- Merkle tree only hashes actual data in its leaf nodes, the Merkle branches after the leaf nodes are hashes of hashes. The Git DAG does not have branches based on hashes of hashes, just nodes with hashes of raw data.
- The Merkle tree starts with the leaf nodes, and proceeds up the tree, constantly halving the number of branches in each iteration. The Git DAG contracts and expands all the time with branching and merging. Because it is a graph and not a tree.
Each node in the Git DAG hashes its parent hash ID as part of its own hash identifier, and it is ordered in time (directed). As such it actually very closely resembles a blockchain. Unlike cryptocurrencies, it allows forking (branching), and integrates those as a central part of its structure. However, I see no problem to relax this as a requirement for a general blockchain definition. And I do not see a blockchain requiring a Merkle tree, just as I do not see Git DAG as one. It is just useful for cryptocurrencies to have one, but you could have a different form of one, with other mechanisms, such as Git.
In summary, not everything needs to be a Merkle tree, even if it is cool. Although you should consider Merkle trees where it makes sense, because they are cool. However, I do not see the Merkle tree in the Git DAG (or elsewhere in Git). If you know better, happy to hear and learn :).
I find Merkle trees are a simple but clever data structure. And I like my algorithms and systems best when they are simple but clever. The examples I covered in this article were
- blockchain integrity verification,
- blockchain pruning
- simplified payment verification,
- mining pool protocol for work distribution,
- DynamoDB data synchronization
- ZFS checksumming (and possibly node synchronization)
What is common in all of these? All of them use the Merkle tree as a tool to take a potentially large set of data elements, and hierarchically verify their integrity in different steps. This allows for efficient data communication, and efficiently processing the verification of the overall data, going into more details only as needed. This, I belive sums up what I see as the best use case for Merkle trees: Enabling efficient verification and synchronization of data over distributed nodes.
Of course, then there is the stratum mining protocol, that takes it a bit further and uses Merkle trees as a means to distribute work, while controlling specific properties of the output, in a decentralized and untrusted setup. Cool.
Well, in most day-to-day software engineering tasks there is not often much chance to use, and build on, these features. But it certainly is interesting to learn them, and thus better understand what some of the latest technologies built on them are based on. And to make those informated decisions when the time is right. Right?