And now for something completely different: A piece of information that seems to get taught at universities that not only has plenty of practical use in everyday web development, it is also surprisingly easy to understand. Here goes.
The little gem we are talking about is called the Hash Function.
A hash function is something you can feed a wild blob of data you know absolutely nothing about, and it will give you back a neat, short, predictable piece of data back.
The key here is that it’s deterministic: Every wild blob you can come up with will give you the same neat hash every time, but even a slightly different wild blob will give you a completely different neat hash.
Of course, it can’t be done to perfection. Try a gazillion different blobs and eventually, you will find two hashes that are the same. This is called a duplicate. The sheer volume of tries you would need to find a duplicate tells you what the purpose of the hash is. If it would take your computer 100 million years of trying to find a duplicate, you have a cryptographic hash function. If it would take your computer something in a manner of days to find a duplicate, you have a fast hash function. SHA-1 and SHA-256 examples of the former. CRC32 is an example of the latter. MD5 used to be a cryptographic hash function, but computer speeds caught up with it; it’s now too fast to be a cryptographic hash function, and too slow to be a fast hash function. So it falls into the category of hash functions better not used for anything at all. SHA-1 is on the way there as well, but for now, it’s okay.
If you’re not sure which kind of hash function to use for a given purpose, you can start cryptographic. If you want better performance and you can do without the million year certainty, than CRC32 works well.
A third kind of hash function is a lookup table: It doesn’t calculate the hash, but simply stores some kind of key for each blob. These are often known as objects, or the venerable “associative array”, or database tables.
The neat thing about looking about objects, arrays and database tables as kinds of hash functions, is that often you can replace a lookup table by a calculated hash function, and do without saving things. This can make your operation stateless. As a rule, a stateless piece of code will be easier to maintain than one that has state, because there are fewer things to keep track of.
Example 1: Send an email only once
Say you have a a form on your web page that will allow your visitors to tell their friends about something cool. This can work pretty well, because it saves people the hassle of having to type their passwords into their webmail. But often, they might get impatient and click send twice. There are three solutions to this problem, two of which suck, and one of which employs a hash function.
Solution 2: You save the message and email address in the database and save its ID in a cookie. This is fine, but it has a privacy issue: You see the personalized messages your visitors are sending to people. Like solution 1, this appears to work, but if you do any kind of scaling or getting a little bigger, people will get privacy sensitive, and you will have to admit you’re saving private messages on your server. Further, you will have to put in some kind of mechanism to delete all those messages you don’t need any more, which is crufty.
Solution 3: The golden third solution with the hash function eliminates the problems of solution 2 by replacing the lookup table (the database table) with a real, calculated hash function. You calculate the CRC32 function of the message concatenated with the email address, and save that in a cookie. No harm done, no one gets the original message from the hash, and it’s small enough to small in a cookie. When the cookie expires it expires, so you have auto cleanup. If the second post comes in a second time, you take the CRC32 of the incoming message and know if it’s a duplicate by comparing it to the value in the cookie.
If you need to ask your users for a password, you need to compare it to something to see if it is correct. That something could be the password itself, but then you would know all your users passwords, which has similar privacy implications to the email example.
It so happens that if it takes a cryptographic amount of time to find a duplicate, it takes a similar amount of time to find the blob to a given hash. So if you store the cryptographic hash on the one hand, and compute the same hash of what the user types into the password box, you have a way to verify passwords without actually having them. Neat, huh?
Pitfalls. Most passwords are around 10 characters long, if that. For blobs that short, cryptographic hash functions break down. You can make a lookup table of the shortest blobs, and then finding the blob that made the hash no longer takes millions of years.
So instead of throwing away the otherwise perfectly good hash function, you make the blob longer. This is also called “salting” it. You append both the password-to-store and the password-to-check to a long, random string. At the time of writing, long means something like 128 characters (more doesn’t hurt). Random means something like hZ7gUTQ@e6p5By@3, not something like “harf harf the narf” or keyboard mashes like “HJKHJKHJKHJ”. Password generators are handy.
Another pitfall is the speed of computers increasing while at the same time lots of cryptography people are trying to break hash functions before people with bad intentions do. This is what happened to MD5 and is happening to SHA-1. So one thing you can do to give you a little bit of extra safety is to append the hash to the salt, and hash the result again. Repeat, say, 1000 times. This will make it harder to to make actual use of cryptographic breakage, and also increase the amount of time it would take to recover a password. It costs some performance, but if you think it’s okay to do that once every time someone supplies a password to be checked, it’s an entirely realistic thing to do.
Since you will be required to upgrade the hash function used from time to time, it can be a good idea to store a unique bit of data called an identifier along with the hash. The identifier can a number, a string or pretty much anything, as long as it can identify which part of your code you used the generate the a hash. For example, you can use a switch statement on a bunch of strings, or have a class and store the method names for dynamic languages. Then save the identifier to be used used for new users and password changes in an easy-to-find place in your database or code.
That way, every year or so, you can up the safety margin new users, without locking out all your existing ones, where the safety margin might not be great but still acceptable. You can still purposely lock out users that have hash functions that are so old it hurts.
Example 2: Confirm an Email address
Often enough, you need to send out some kind of one-time password to a user so he can prove he owns an email address. Of course, you can use the lookup table approach, but again, it’s not stateless, so you have stale pieces of data you need to periodically get rid of. But as any lookup table, you can replace it with a hash function.
Since hash functions are one way only, you need to send any data you might need along with it. For example, for a user to prove he owns an email account, you can send the user a link that contains his own user id and the time of sending, along with a salted hash of that data, like so:
The user can see the data being transmitted, but he can’t fake verifying another user because he can’t predict which hash would be necessary to conform to his modified data. On receiving the token, you can confirm the user id and use the time to expire old tokens, all without maintaing any data.
In order not to tempt the user, you can also veil that you are using the string to transmit data by base32 encoding it, which turns out something like http://yoursite.com/confirm/ditmtmzmdk3ntewmy1hzjyxzjc4y2e. Note this doesn’t add any security at all, the security is in the hash function. But it does make your tokens look more uniform at first glance, which doesn’t hurt, as long as you’re not relying on it for security.
This might seem like too much effort to go through in order to save yourself from having a token database table, and that might be true if one table is all you have. But before you know it you will be sending out tokens for all kinds of things, like sending out one-time download URLs to gift certificates to early bird access… the list goes on. All of that data maintenance really adds up, so in time the “send and forget” can save a lot of ongoing brain labor, which happens to be the most precious thing we human beings possess.
Example 3: Check downloads
Another classic. You use a cryptographic hash function to make sure a huge file you just got hasn’t been tampered with, either on purpose by a mischief maker or by accident because of a faulty transmission. You don’t even need a hash function, because as a rule the file will be big enough forging a hash wouldn’t work out. As a rule, if you’re offering a large download, provide a hash. If you’re downloading something large, check the hash. This kind of security is what makes those sleep at night who have started to think about what could go wrong. And witnessing for yourself that it actually works is great for building trust in cryptography.
Example 4: Handle untrusted data
It is often that you will receive a piece of data that you don’t entirely trust, because it might be carefully crafted to exploit weaknesses in your computer system. An example would be introducing double quotes followed by an SQL command in a web form, known as an SQL injection. While it is good practice to always escape all data that comes in, or even better, have your web framework do that automatically, there could still be bits of data that are problematic for your particular application.
If you only need the data to do some kind of comparing, for example look something up, hash it. For example, you can store the CRC32 hash of email addresses along with the actual address. Not only will the lookup be faster, you will be protected against every conceivable type of injection attack as long as you’re just handling the hash.
This isn’t always applicable so use it when you can. You’re not replacing a lookup table here, but using the neatness of the hash, which implies its security.
Example 5: Get predictable random data
Sometimes you will be needing a set of random numbers that you will be able to re-use later. One example would be every user receiving a random special offer, but you need to know in advance which user will be receiving what.
You could let a random number generator churn out numbers and save those in your database, but there you’re saving random data again in a lookup table, and data needs to be maintained.
Hash functions don’t necessarily make sense immediately, because they are more a property of computing and mathematics rather than an application of it. But once you start wrapping your head around them, they are so incredibly useful for so many different ways, it simply sucks not to know about them. And they’re fun… they’re one of the areas where the previously unkown can be computed, like a greek seafarer asking the oracle. To make them a bit less oracle-like, here’s how they all work: You start with a bunch of predictable numbers, like 2, 3, 5, 8, 13. Then you loop over the data you want to hash and add to, substract from, xor over, generally just operate on the same five numbers so every bit of the incoming data had an effect on the numbers. What takes a lot of brains is sculpting the operations in such a way that the resulting function works well for a given purpose, like having few duplicates or being fast. If you want to spend your time using hash functions rather them making them, I find it works well to go for what the leading edge of research recommends, as witnessed by the all-knowning Wikipedia.