Hash & Hashing
A great deal of what Codel has to offer depends on our novel management of Hash Values. Anyone wishing to understand our products will therefore find it useful to understand hashes. Most of the explanations you will find on the web are somewhat technical and aimed at mathematicians or cryptographers!
This page describes, in non-technical fashion, what a Hash Value is and what it does for us. Go here, if you want one of the more technical explanations.
Hashes are referred to in a number of ways: Message Digests or just Digests and One Way Hashes are common. We just use "Hash" or hashes. Without getting technical, all you need to know about how they are created is that various mathematical transformations are applied to the original data in order to create a small "fingerprint" of that data.
There are two important features of a hash that provide immense benefit to our authentication database.
First, unlike encryption, no one can reverse the hash to discover the original data. The "fingerprint" analogy is useful here. You cannot retrieve a human fingerprint from the scene of the crime and, from that print alone, deduce anything of significance about the human it belonged to. You can't tell what sex they are, how tall they are, how much they weigh, what ethnic group they belong to, and so on.
In the same way, you can deduce nothing about the original data from the hash it creates. This makes it much easier to protect the data.
In both cases, however, if you find a human whose prints match the ones you retrieved from the crime scene, or a document which creates the same hash as one you recorded earlier, then you can be reasonably certain that the human or the document is the original.
In fact, in the case of a document hash, we can be even more certain than we can with human fingerprints that the match really does confirm the original document. With state-of-the-art human fingerprinting, the most authoritative estimates are that the probability of two humans alive today having identical prints is of the order of 6 in 10^19. Odds that long make human fingerprints pretty safe. With a document hash, however, the odds are many orders of magnitude longer (and the probability of "collision" that much smaller). For SHA1, the chance of two different documents creating the same hash is just 1 in 2^160 (which is roughly 1 in 1.5 x 10^48). Given the number of documents the human race has created since the written word was invented, which we'll wildly guess at being of the order of 10^15, the chance that any given document shares its hash with one of the others is around 1 in 10^33, i.e. around 10^14 times less likely than finding a match in human fingerprints.
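The arithmetic behind the SHA1 figures can be checked in a few lines of Python. This is only a rough sketch: the document count is the wild guess made above, and the calculation assumes hash values behave as if they were picked at random.

```python
# Rough check of the SHA1 figures (inputs are the page's own estimates).
sha1_outputs = 2 ** 160      # number of distinct 160-bit hash values
documents = 10 ** 15         # wild guess at the number of documents ever written

# Chance that one given document shares its hash with any of the others:
# roughly documents / sha1_outputs, i.e. "1 in (sha1_outputs / documents)".
one_in = sha1_outputs // documents

print(f"2^160 is about {sha1_outputs:.2e}")
print(f"chance of a match is about 1 in {one_in:.1e}")
```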
If that's not rare enough for you, we have one more trick up our sleeve. We check the uniqueness of the hashes before we put them on our database. If we ever encountered a duplication, we would simply request the owner of the hash to make a trivial alteration (eg the addition of a space or full stop). This is enough to completely alter the hash.
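The effect of a trivial alteration, and a uniqueness check of the kind described, can be sketched as follows. The `register` function and the in-memory set are hypothetical stand-ins, not our actual database logic, and the sketch appends the space itself rather than asking the owner to do it.

```python
import hashlib

def sha1_hex(text: str) -> str:
    """Return the SHA1 hash of text as 40 hex digits."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def register(text: str, known_hashes: set) -> str:
    """Hypothetical helper: store a hash only if it is not already known."""
    digest = sha1_hex(text)
    while digest in known_hashes:
        text += " "                  # the trivial alteration: add a space
        digest = sha1_hex(text)
    known_hashes.add(digest)
    return digest

known = set()
first = register("My document", known)
second = register("My document", known)   # duplicate, so a space gets added

print(first)
print(second)   # bears no visible resemblance to the first hash
```

Registering the same text twice forces the duplicate case here; two genuinely different documents with the same SHA1 hash would be handled in exactly the same way.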
Second, the algorithm we use to create the hash is freely available (in the public domain) and usable from within standard web browsers. This makes it much easier for us to allow remote users to authenticate data, validate documents etc.
Let's examine hashes in a bit more detail.
This is what the hash value of this sentence looks like:
(using ASCII characters)
IÔ}HÛy–ÍZ*%å5ª0|ô
If you prefer Hex format, it looks like:
E46DF6BC77A6C2FE3E05AC14AB8DE25C591BDE6A
Neither is a pretty sight, although the hex version does at least look readable.
Fortunately, it doesn't have to be humanly readable. It only ever needs to be read and understood by computers.
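For anyone who wants to produce both forms themselves, here is a minimal sketch using Python's standard `hashlib` module. The exact bytes that were hashed for the example above (encoding, punctuation, trailing spaces) are an assumption, so the output may not reproduce the value shown.

```python
import hashlib

# Assumed input: the sentence exactly as typed here, encoded as ASCII.
sentence = "This is what the hash value of this sentence looks like:"

digest = hashlib.sha1(sentence.encode("ascii")).digest()   # 20 raw bytes
hex_form = digest.hex().upper()                            # 40 hex digits

print(digest)     # raw bytes, many of which are not printable characters
print(hex_form)   # the same value written in hex
```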
There are various features of a hash value which remain true, whatever the source. For example, this is the hash value of an early draft of our Detailed Guide Document before it became several html documents - all 117000 bytes of it: m‘¿tÅÃ8/>°=d__°oÏ
What similarity is there between that hash and the previous one? The only obvious one is length. Both are 20 bytes long (40 digits in hex), although some of those bytes do not show up as printable characters. The same is true of the hash of anything at all if we create it using the SHA1 hashing algorithm. Whether it is the letter "A" or the entire text of "War and Peace", its hash value will be 20 bytes long (32 bytes if we use SHA256).
What this implies (correctly) is that hashing has nothing to do with encryption. Twenty bytes cannot possibly contain the whole of "War and Peace", so the original is simply not in there to be recovered. You can't work backwards from a hash value to work out what its source must have been - even if you know how the hash was created.
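The fixed length is easy to confirm. In this sketch the long input is just a repeated sentence standing in for a very large document, not the real text of any book.

```python
import hashlib

short = b"A"
long_text = b"It was a very long novel. " * 100_000   # stand-in for a huge document

for label, data in [("short", short), ("long", long_text)]:
    sha1 = hashlib.sha1(data).digest()
    sha256 = hashlib.sha256(data).digest()
    # The input size varies enormously; the hash sizes never change.
    print(label, len(data), "bytes in,", len(sha1), "out (SHA1),",
          len(sha256), "out (SHA256)")
```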
What you can do, however, is create a hash yourself (the software is widely available) and see if it matches the one on the database. This method of cracking the code is commonly referred to as "Brute Force" - it involves trying every possible combination till you find a match. For small data-sources (like all the names or telephone numbers in a telephone directory), it is a viable technique.
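A brute force search over a small data-source might look like the sketch below. The six-digit number and the `brute_force` helper are made up for illustration; a real attacker would use far faster tools, but the principle is the same.

```python
import hashlib

def sha1_hex(text: str) -> str:
    """Return the SHA1 hash of text as 40 hex digits."""
    return hashlib.sha1(text.encode("ascii")).hexdigest()

def brute_force(target_hex: str, digits: int = 6):
    """Try every number with the given count of digits until one matches."""
    for n in range(10 ** digits):
        candidate = f"{n:0{digits}d}"      # e.g. 42 becomes "000042"
        if sha1_hex(candidate) == target_hex:
            return candidate
    return None                            # source was not in the search space

# Hypothetical stored hash of a six-digit telephone number.
stored_hash = sha1_hex("555012")

found = brute_force(stored_hash)
print(found)
```

With only a million possible sources, the search finishes in about a second on an ordinary computer.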
The trick, therefore, in using hashes in place of real data, is to ensure that the number of possible valid sources is literally astronomical and way beyond the capacity of modern computers to sift through. Preferably it should be way beyond even the capacity of the computers we anticipate being available in the not too distant future. Quantum computers may cause us a problem in this respect but we'll cross that bridge when we come to it.
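To get a feel for what "astronomical" means here, the sketch below estimates how long an exhaustive search would take for sources of different sizes. The rate of one billion hashes per second is an assumed round figure, not a measurement, and the example sources are purely illustrative.

```python
HASHES_PER_SECOND = 10 ** 9                 # assumed speed of the attacker
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def years_to_search(possible_sources: int) -> float:
    """Years needed to try every possible source at the assumed speed."""
    return possible_sources / HASHES_PER_SECOND / SECONDS_PER_YEAR

examples = [
    ("7-digit telephone numbers", 10 ** 7),
    ("20 random lower-case letters", 26 ** 20),
    ("every possible 160-bit value", 2 ** 160),
]

for label, size in examples:
    print(f"{label}: about {years_to_search(size):.1e} years")
```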
Finally, if you're interested in how all this feeds in to our logic in deciding the length of our VRs, click here.