Merkle Trees: Growing in Use

Ralph C. Merkle (not pictured above), born 1952, is one of the founding fathers of Public Key Cryptography. Throughout his career he has developed and contributed to a list of monumental cryptographic systems, some of which are embedded in the backbone of the online protocols and applications we rely on in daily life. In this post we will look at his work on Merkle Trees, how they function, and why and how they are growing in use.

Merkle _What_

The overall structure of a Merkle Tree is quite simple and is very familiar to computer scientists: a tree data structure. Data trees have one root node (the main file or piece of data) which is then divided or branched out into child nodes — in this case exactly one or two. By branching out again, these children then become parent nodes to subsequent children and so on. The final child node of any branch is called a leaf node. In reverse, each leaf node in combination with its sibling node (should it have one) yields its parent node and so on, until the original root node is reconstructed.

Merkle Trees differ from traditional data trees in one simple way: they use cryptographic hashes of each piece of data rather than the data itself. From leaf nodes up to the root, the process works as such:

  1. Each leaf node is hashed
  2. Moving up the tree towards the root, each leaf or child hash is XOR’d (cryptographically combined) with the hash of its sibling node
  3. Finally, the hash of the the top parent nodes are XOR’d into the root node which precisely equals the aggregate hash of the original piece of data

Merkle _Why_

For many years Merkle Trees were little more than a cryptographic magic trick. But as is common with mathematical and computational breakthroughs, years later it began to play a critical roll in various protocols and software projects. Merkle Trees are mainly used for two reasons:

  1. They allow a client or server to efficiently validate the contents of large files/data
  2. They allow a client or server to validate any segment or sub-segment of the file/data without possessing any other segments

To understand why Merkle Trees have these properties, it is important to know a little bit about Cryptographic Hash Functions. Hash functions are easy to compute in one direction but extraordinarily hard to compute in the opposite direction. For example, the industry standard SHA-256 can be hashed in milliseconds but would theoretically take about 3.85 × 10²⁹ years to reverse, according to computer scientist Luke Dash Jr.

In the case of Merkle Trees, the hash function is used to calculate the hash of each segment of data which becomes the leaf nodes, then XOR each leaf node with sibling leaf/child nodes into parent nodes and so on, all the way up to the root node or Merkle Root. If each segment of the data is in tact and unchanged, the Merkle Root is exactly equal to the aggregate hash of the entire file or piece of data. If even one bit of information is changed in any segment, the hash is completely different which propagates up the tree, resulting in an entirely different Merkle Root. This root can then be compared with the actual/desired root and thus any inconsistency is detected.

“These numbers have nothing to do with the technology of the devices, they are the maximums that thermodynamics will allow. They strongly imply that brute-force attacks against 256-bit keys will be infeasible until computers are built from something other than matter and occupy something other than space.”

— Bruce Schneier, Cryptographer

Merkle _How_

One important use of Merkle Trees is in downloading files. If a user/client attempts to download a large file all at once and something goes wrong, the entire file can be corrupted and the full download process would need to be restarted. Using Merkle Trees, the user can download a smaller segment of the data and hash it. By combining this hash with the hashes of each other segment (trivially small to download/check compared to the data itself), they can check to see if any data was corrupted during the download process. These steps continue for each segment such that if the Merkle Root doesn’t match up at any step during the download process, they know which segment of the data was corrupted. It is important to note that this process takes place automatically in the background and does not require any action from the actual user.

Another place we see Merkle Trees being used is in the Distributed Version Control System Git. Git maintains and reconciles many different versions of a piece of software as it is built, changed and updated simultaneously by many contributors. Git is both fast and secure because instead of storing and comparing every instance of the actual software, it instead stores and compares the hashes of each segment and version in the form of a Merkle Tree. Git’s implementation is slightly more complicated as it XOR’s some additional nonces along the way, but the overall structure is that of a Merkle Tree.

Finally, Merkle Trees play a key roll in the construction of the Bitcoin blockchain. When a Bitcoin node broadcasts a transaction to the rest of the peer-to-peer network and is included in a block by a mining node, the miner hashes that Transaction ID along with every other Transaction ID in that block. Those hashes are then put into pairs and XOR’d into a new hash and so forth in the form of a tree, all the way up to the Merkle Root. The root is then XOR’d once more with some additional data pertaining to the block itself and the hash of the preceding block (forming the chain), which finally results in a new block hash. While it is impossible to infer the details any particular transaction from it, the details of all transactions in the block are needed to compute the final block hash.

Merkle _Root_

In conclusion, Merkle Trees are a very clever way to maintain and verify databases and large files across networks of users/devices. When checked against the original Merkle Root, every piece of data must remain completely unchanged, else the roots will not align. As we live in a continuously more digital age and distributed systems gain popularity, it is likely we will start to see more Merkle Trees sprouting up in new and exciting places.

Data Science | Data Engineering | Python Development