Last time we talked about information-entropy, which is a way of quantifying the information in a given string of symbols. Each symbol has a certain uncertainty associated with it, which may be low if the symbol can be one of two things, or high if the symbol can be one of a hundred things. Summing that uncertainty over all the possible outcomes, each weighted by its probability, gives the entropy, which is larger for more uncertain situations and smaller for more certain situations. So the more you stand to learn from the string, the higher its uncertainty and entropy, and the larger its information content.

But let’s get less abstract and think about examples. Say that we flip a coin and see if it lands with the head or tail side up. An evenly weighted coin has an equal probability of landing heads or tails. But perhaps an uneven coin is more likely to land heads, or maybe gnomes have swapped in a coin that’s heads on both sides. The results of the heads-only coin toss have zero entropy, because there is only one possible result and nothing is learned by examining it. The weighted coin has higher entropy, since two results are now possible. And the evenly weighted coin not only has more possible results than the heads-only coin, those results are also maximally unpredictable, so its entropy is the highest of the three. If we do a series of coin tosses, we get a string of coin toss results. A longer string will have more entropy than a shorter string, except in the case where the coin is heads-only. Now if we replace the binary coin toss with choosing a letter from the alphabet, which has 26 possibilities instead of two, we have significantly increased the information-entropy!
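To put numbers on the coin examples, here is a minimal sketch that computes the entropy in bits for each case. The function name `shannon_entropy` and the 90/10 weighting of the uneven coin are illustrative assumptions, not values from the discussion above, and the string entropy assumes each toss is independent of the others.

```python
from math import log2

def shannon_entropy(probs):
    """Entropy in bits: each outcome's surprise, log2(1/p), weighted by p."""
    # Outcomes with p == 0 never occur and contribute nothing.
    return sum(p * log2(1 / p) for p in probs if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])
weighted_coin = shannon_entropy([0.9, 0.1])   # assumed weighting, for illustration
heads_only = shannon_entropy([1.0, 0.0])
random_letter = shannon_entropy([1 / 26] * 26)

def string_entropy(probs, n_symbols):
    """Entropy of n independent symbols is n times the per-symbol entropy."""
    return n_symbols * shannon_entropy(probs)

for name, h in [("fair coin", fair_coin), ("weighted coin", weighted_coin),
                ("heads-only coin", heads_only), ("random letter", random_letter)]:
    print(f"{name}: {h:.3f} bits per symbol")
```

The fair coin comes out at exactly 1 bit per toss, the 90/10 coin at about 0.47 bits, the heads-only coin at 0, and a uniformly random letter at about 4.7 bits. Ten tosses of the fair coin carry 10 bits, while ten tosses of the heads-only coin still carry none.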

Of course, sometimes it is useful to impose some rules on a string of symbols, for example the rules associated with a specific language. Doing so will reduce the uncertainty, and thus the entropy and information content, of the string. This is another way of saying that a string of letters that spells out words in English has a lower entropy than a string of random letters of the same length, because in English not all the letters are equally probable, and each letter affects the probability of the letters that follow it. It’s the equivalent of weighting the coin! In fact, the trick of data compression is to reduce the number of symbols used in a string without reducing the entropy (and thus the information content) of the message. Data compression is not possible when each symbol in a message is maximally surprising, which explains the difficulty of compressing things like white noise.
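The link between predictability and compression can be checked with a rough experiment using Python's standard `zlib` module. This is only a sketch under stated assumptions: the "weighted" text draws letters independently with made-up weights of 1/rank, which is a stand-in for uneven letter frequencies and not real English, and it ignores the way one letter influences the next.

```python
import random
import string
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 20_000

# "White noise": every byte value is equally likely, so each symbol is maximally surprising.
noise = bytes(rng.getrandbits(8) for _ in range(n))

# Weighted letters: hypothetical 1/rank weights, so 'a' is far more likely than 'z'.
letters = string.ascii_lowercase
weights = [1 / rank for rank in range(1, len(letters) + 1)]
weighted_text = "".join(rng.choices(letters, weights=weights, k=n)).encode("ascii")

noise_ratio = compression_ratio(noise)
weighted_ratio = compression_ratio(weighted_text)

print(f"noise:            {noise_ratio:.2f}")
print(f"weighted letters: {weighted_ratio:.2f}")
```

The noise should come out at a ratio of about 1 (the compressor cannot shrink it, and adds a few bytes of overhead), while the weighted letters shrink substantially because each symbol carries fewer bits than the byte used to store it.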

Now, what if instead of a sequence of coin tosses or a string of letters, you instead had a collection of atoms that could be in different states? Consider a box filled with a gas, where each atom of the gas can be described by its position in the box and its momentum. The entropy of any given configuration of atoms would then be a tally over all the possible states for each atom, the same way the entropy of a string was a sum over the possible symbols in the string (weighted by probability). Entropy is still a measure of uncertainty, but in this physical example the question is how many arrangements of atoms in specific states can make a configuration that has the same measurable properties, such as pressure, temperature, and volume. For example, if the gas is evenly distributed throughout the box, we can make a wide variety of changes to the individual atom positions and momenta without changing the measurable properties of the gas. Thus the entropy is high because of the large number of atomic arrangements that could yield the same result, which means there is a high uncertainty in what any individual atom is doing. In contrast, if the gas atoms are confined to a very small region of the box, there are fewer positions and momenta available to the atoms, and thus a smaller number of indistinguishable arrangements. So the entropy is lower, because there are fewer ways to have the same number of atoms all in a corner of the box.
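A heavily simplified version of the box can be counted by hand. The sketch below assumes each atom is described only by which half of the box it sits in (left or right), ignoring momentum and exact position, and it takes the entropy to be the natural logarithm of the number of arrangements, following Boltzmann's formula with the constant left out. The function names are illustrative.

```python
from math import comb, log

def arrangements(n_atoms: int, n_left: int) -> int:
    """Ways to choose which n_left of the n_atoms sit in the left half."""
    return comb(n_atoms, n_left)

def entropy(n_atoms: int, n_left: int) -> float:
    """Entropy in units of Boltzmann's constant: log of the arrangement count."""
    return log(arrangements(n_atoms, n_left))

n_atoms = 20
for n_left in (0, 5, 10):
    print(f"{n_left:2d} atoms on the left: "
          f"{arrangements(n_atoms, n_left):6d} arrangements, "
          f"entropy {entropy(n_atoms, n_left):.2f}")
```

With 20 atoms there is exactly one way to put every atom in the right half, but 184,756 ways to split them ten and ten, so the evenly spread gas has by far the higher entropy.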

The technical way to describe this formulation of entropy is that each atom has a number of microstates available to it, and all the atoms together have measurable properties (pressure, volume, temperature, etc.) that define the macrostate. The entropy of any given macrostate is set by the number of microstate configurations that could produce that macrostate (strictly, it is proportional to the logarithm of that number), which means it’s still about uncertainty. But you can also see that entropy is a form of state-counting: higher entropy macrostates can be attained in a larger number of ways than lower entropy macrostates. This means that in general, higher entropy states are more probable. If there is one way to pack all the atoms into the corner of a box, but there are a million ways to evenly distribute the atoms in the box, and every arrangement is equally likely, then the chances of finding the atoms in the corner are about one in a million. And since those atoms are constantly moving and exploring new microstates, over time they will tend to the highest entropy macrostates. This is where the Second Law of Thermodynamics comes from, which says that in any isolated system, total entropy never decreases, and tends to increase over time toward a maximum value.
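The drift toward high-entropy macrostates can be watched in a toy simulation, a version of what is usually called the Ehrenfest urn model. This is a sketch under simplifying assumptions, not a real gas: each atom is either in the left or right half of the box, all atoms start on the left, and at every step one randomly chosen atom hops to the other half. The macrostate is the number of atoms on the left, and its entropy is taken as the logarithm of the number of ways to pick which atoms those are. The function name and parameters are illustrative.

```python
import random
from math import comb, log

def simulate(n_atoms: int = 100, n_steps: int = 2000, seed: int = 0):
    """Return the final left-half count and the entropy after every step."""
    rng = random.Random(seed)
    on_left = [True] * n_atoms      # lowest-entropy start: every atom on the left
    n_left = n_atoms
    entropy_history = []
    for _ in range(n_steps):
        i = rng.randrange(n_atoms)  # pick one atom at random...
        on_left[i] = not on_left[i] # ...and move it to the other half
        n_left += 1 if on_left[i] else -1
        # Entropy of the macrostate "n_left atoms on the left", in units of k_B.
        entropy_history.append(log(comb(n_atoms, n_left)))
    return n_left, entropy_history

n_left, entropy_history = simulate()
print("atoms on the left at the end:", n_left)
print(f"entropy after first step: {entropy_history[0]:.1f}")
print(f"entropy after last step:  {entropy_history[-1]:.1f}")
```

The entropy starts near 4.6 (the log of 100, since after one hop there are 100 ways to choose the atom that moved) and climbs toward the maximum of about 66.8 for an even split. It does not rise at every single step, since small fluctuations go both ways, but with this many atoms a return to the all-left state is overwhelmingly unlikely.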

The idea of entropy as state-counting came from Ludwig Boltzmann, more than fifty years before information theory was developed. Shannon called his measure information-entropy because of the resemblance to entropy as defined in collections of atoms, which is the basis of statistical mechanics. Entropy is a measure of information and uncertainty, but also a way to count the number of states, and a measure of the relative ordering of a system.