Hashing
I always enjoyed math as a kid. The objectivity of the content pushed opinion to the margin. All I had to do was understand and repeat the steps to do well. Unfortunately, that's how I approached learning. Understand the logic and repeat the steps. Goal focused grade optimization helped me perform well on paper without thinking deeply about the topic.
After entering the work force I had the good fortune of working with people who had developed a strong intuition with math. Watching them solve problems made me realize the power of having an intuition around mathematics and CS concepts. Considering the prevalence of computers in our lives, repeating the steps of a formula is a skill that's been commoditized. What's left for most of us engineers is the creative application of tools to solve problems.
In this article I want to consider hashing with the purpose of gaining some intuition. While it's a familiar everyday concept, it's a largely solved problem. Many engineers may have breezed over a topic that can go surprisingly deep. It's a tool useful for a variety of problems so good to know about.
Common Uses
Hash table lookup - Probably the most familiar, hashing allows us to take an object and create a key index for lookup of the object in a hash table. Hash table data structures lend themselves to innumerable problems, their distinctive feature being that they offer average constant time storage and lookup but essentially no contextual information.
Caching - Really an application of the hash table above.
Checksums - Used for data and data transfer integrity verification like Unix's use of MD5, TCP checksum, etc.
Information Security - While checksums detect alteration, collision resistant cryptographic hashes aim to assure no alteration.
Secure Storage - Storing passwords opens up the risk that they'll be compromised. Storing hashes adds a layer of protection.
Secure Storage - Storing passwords opens up the risk that they'll be compromised. Storing hashes adds a layer of protection.
Other unique identification - Like Git commit SHAs.
Algorithms - Hashes can be useful tools for algorithms, particularly in pattern matching owing to their ability to serve as a succinct unique identifier of something.
Example
Let's take the solved problem of Unix file permissions. Let's first try to represent it in OO fasion:
public class FilePermission {private Enum Permission {READ,WRITE,EXECUTE}private EnumSet user;private EnumSet group;private EnumSet others;...public boolean canUserRead() {return user.contains(Permission.READ);}...public boolean canOthersExecute() {return others.contains(Permission.EXECUTE);}}
There are three member variables, each with eight possible values.
1
2 READ
4 READ EXECUTE
5 READ WRITE EXECUTE
6 WRITE
7 WRITE EXECUTE
8 EXECUTE
We therefore have 8^3 = 512 possible states. Now it makes sense that Unix has chosen to represent it as 0 - 777 value (0777 = 512). This is like a bit vector representing all possible states, in octal form. So...an octal vector.
Now let's override the hashCode function (omitting the necessary accompanying equals() override for brevity):
public int hashCode() {
int result = 0;
result = 1 * (canUserRead() ? 4 : 0
+ canUserWrite() ? 2 : 0
+ canUserExecute() ? 1 : 0);
result = 10 * (canUserRead() ? 4 : 0
+ canUserWrite() ? 2 : 0
+ canUserExecute() ? 1 : 0);
result = 100 * (canUserRead() ? 4 : 0
+ canUserWrite() ? 2 : 0
+ canUserExecute() ? 1 : 0);
return result;
}
This could be considered a minimal perfect hash for this object.
- It has one and only one value for each state. There are no collisions.
- It has no possible values beyond those that represent state. It's space optimal.
A hash is a fixed length representation of a variable length value. In this case our hashCode will output a unique, three digit octal representation for the state of any FilePermission instance. Realistically, in most cases the exhaustive enumeration of possible values is not known ahead of time. We need to come up with an algorithm that will generate a hash that is most likely unique.
Instead of 2^9 = 512 possible states, let's say we think we have 2^12 = 4096 states. We could just use a much bigger variable like an unsigned int that has 2^32 possible values. As long as we distribute values with enough space it's unlikely that we'll use any value more than once. One way to do this is to make the value of each member variable occupy a portion of contiguous memory that does not overlap with those taken by any other variables. This is accomplished in our hashCode above by multiplying each EnumSet's value by 10 (If we're doing arithmetic in octal we'd be multiplying by 8, if binary 2. It's a single, left digit shift operation. Familiar as << for binary operations). Since the max value of any digit in our hashCode is <= the max value possible for any digit (7 in octal), we have no overlap.
More often you'll see a hash start at a small prime number and multiply by a larger prime to accomplish the same thing. There's a lot of wasted space but that works in our favor for now. We're producing a numeric representation of our object with 0s padding between member values.
Frequently our object will have far more states that can fit in even an unsigned 64bit long. In fact a useful property of a hash is that it usually occupies less memory than the object being hashed. If this is the case we can't pad with 0s anymore. We have to accept some overlap. Overlap means there'll be data loss. This in turn means that our hash is less likely to be unique. We'll eventually experience collision or duplicate hash values for different objects. Given our naive implementation, the greater the compression, the greater the likelihood of collision. Overlap, data loss, and collision also mean that we won't be able to reverse the process.
You may have asked yourself, "What's the difference between a hash and compression?" Now we can clearly see some characteristic differences. A hash produces a fixed length representation of an arbitrary length value. A compression algorithm produces an arbitrary (hopefully smaller) length value of an arbitrary length value. Compression is reversible. A hash is not.
It may seem that a compression algorithm is then superior to a hash. It turns out that by virtue of some of this inconvenience, we get a tool that lends itself well to solving certain problems. First, a hash + indexed lookup should perform better than compression + decompression. Second, since hashes can't (trivially) be reversed they're useful for cryptography. Simply stated, a hash that had a simple reversal would be easy to hack. Now it can't actually be proven that a hash is irreversible, but it can be proven that it is reversible. As soon as it is, the hash is no longer considered as secure as it once was.
Once we can no longer avoid overlap in our hash, we need to design the algorithm to minimize it. Said another way we need to try for uniform distribution of our values. This will minimize collisions. Even if our object takes up less memory than our hash, a good hash will still demonstrate the property of uniform application of change. For cryptographic hashes, small change in the object should produce a significant change in the hash. This is known as the avalanche effect and it will aid uniform distribution.
Cryptographic hashes should avoid funneling, or causing collisions when objects differ by only a few bits. Continuing with our naive implementation above let's squeeze the member variables' blocks together until the zero padding falls out. At the edges member variable blocks will overlap. Instances that differed in state by only the portion of overlap would demonstrate funneling.
We may need to consider peculiarities in our data. Volume may bunch up around certain dates if our dataset displays seasonality or cyclicality. Anyone familiar with Scrabble or Wheel of Fortune knows that in English, certain letters occur with a greater frequency. Numbers will probably tend to group around zero since they start from an origin and accumulate outward. The proportion of 0s to 1s will tend to grow as we move away from zero.
With all of these considerations, how does a proper hash address them all? Generally speaking, rather than operating on discrete values within an object as we've done above, hashes will treat the object as a contiguous memory block and apply several mathematical (bitwise) operations to scramble the data. The good news is that this means we don't need a bespoke hash for every class, a generalized algorithm is sufficient. There's also no shortage of methodologies and complete implementations out there for us to borrow from.
In fact what's written here is only the tip of the iceberg. I hope it's clearly built up an introduction using simple examples.
Further reading
Instead of 2^9 = 512 possible states, let's say we think we have 2^12 = 4096 states. We could just use a much bigger variable like an unsigned int that has 2^32 possible values. As long as we distribute values with enough space it's unlikely that we'll use any value more than once. One way to do this is to make the value of each member variable occupy a portion of contiguous memory that does not overlap with those taken by any other variables. This is accomplished in our hashCode above by multiplying each EnumSet's value by 10 (If we're doing arithmetic in octal we'd be multiplying by 8, if binary 2. It's a single, left digit shift operation. Familiar as << for binary operations). Since the max value of any digit in our hashCode is <= the max value possible for any digit (7 in octal), we have no overlap.
More often you'll see a hash start at a small prime number and multiply by a larger prime to accomplish the same thing. There's a lot of wasted space but that works in our favor for now. We're producing a numeric representation of our object with 0s padding between member values.
Frequently our object will have far more states that can fit in even an unsigned 64bit long. In fact a useful property of a hash is that it usually occupies less memory than the object being hashed. If this is the case we can't pad with 0s anymore. We have to accept some overlap. Overlap means there'll be data loss. This in turn means that our hash is less likely to be unique. We'll eventually experience collision or duplicate hash values for different objects. Given our naive implementation, the greater the compression, the greater the likelihood of collision. Overlap, data loss, and collision also mean that we won't be able to reverse the process.
You may have asked yourself, "What's the difference between a hash and compression?" Now we can clearly see some characteristic differences. A hash produces a fixed length representation of an arbitrary length value. A compression algorithm produces an arbitrary (hopefully smaller) length value of an arbitrary length value. Compression is reversible. A hash is not.
It may seem that a compression algorithm is then superior to a hash. It turns out that by virtue of some of this inconvenience, we get a tool that lends itself well to solving certain problems. First, a hash + indexed lookup should perform better than compression + decompression. Second, since hashes can't (trivially) be reversed they're useful for cryptography. Simply stated, a hash that had a simple reversal would be easy to hack. Now it can't actually be proven that a hash is irreversible, but it can be proven that it is reversible. As soon as it is, the hash is no longer considered as secure as it once was.
Once we can no longer avoid overlap in our hash, we need to design the algorithm to minimize it. Said another way we need to try for uniform distribution of our values. This will minimize collisions. Even if our object takes up less memory than our hash, a good hash will still demonstrate the property of uniform application of change. For cryptographic hashes, small change in the object should produce a significant change in the hash. This is known as the avalanche effect and it will aid uniform distribution.
Cryptographic hashes should avoid funneling, or causing collisions when objects differ by only a few bits. Continuing with our naive implementation above let's squeeze the member variables' blocks together until the zero padding falls out. At the edges member variable blocks will overlap. Instances that differed in state by only the portion of overlap would demonstrate funneling.
We may need to consider peculiarities in our data. Volume may bunch up around certain dates if our dataset displays seasonality or cyclicality. Anyone familiar with Scrabble or Wheel of Fortune knows that in English, certain letters occur with a greater frequency. Numbers will probably tend to group around zero since they start from an origin and accumulate outward. The proportion of 0s to 1s will tend to grow as we move away from zero.
With all of these considerations, how does a proper hash address them all? Generally speaking, rather than operating on discrete values within an object as we've done above, hashes will treat the object as a contiguous memory block and apply several mathematical (bitwise) operations to scramble the data. The good news is that this means we don't need a bespoke hash for every class, a generalized algorithm is sufficient. There's also no shortage of methodologies and complete implementations out there for us to borrow from.
In fact what's written here is only the tip of the iceberg. I hope it's clearly built up an introduction using simple examples.
Further reading
- http://burtleburtle.net/bob/hash/evahash.html
- http://www.azillionmonkeys.com/qed/hash.html
- https://kl2217.wordpress.com/2011/07/21/common-hashing-algorithms/
- http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
- http://valerieaurora.org/monkey.html
- MIT 6.046
Comments
Post a Comment