Shannon Paper

Claude Shannon’s A mathematical theory of communication is an oft cited classic in information theory. In fact, as of this writing there are 84’411 citations and 139 versions of the article on Google Scholar.

Screenshot of Google Scholar showing citation count for Shannon’s paper

Let’s dive in and try to tease apart the “why”s that are often overlooked when people build on top of the introduced theory. They are definitely things that I didn’t consider to be obvious without the benefit of reading the paper.

The section headings below correspond to the sections in Shannon’s paper.


The paper’s introduction states the purpose as:

… [To] extend [the general theory of communication] to include a number of new factors, in particular the effect of noise in the channel, and the savings possible due to the statistical structure of the original message and due to the nature of the final destination of information.

The two most salient points from this section are

  1. the arguments for a logarithmic measure of information, and
  2. what a measure of information actually measures.

Logarithmic measures are well suited for measuring information, Shannon states, because it’s “nearer to our intuitive feeling as to the proper measure.” Additional rationalization exists in the paper, but I found this to be the most convincing.

To intuit, adding a channel that’s equal in capacity to an existing channel increases the space of transferrable messages quadratically. However our intuition finds it easier to think of this as doubling the capacity. The latter is more in line with a logarithmic measure.

Information is what’s ultimately useful for the recipient. Hence a measure of information must be based on the ability gained by the recipient upon receipt of the message. Receiving a signal means that the recipient can now uniquely identify a message among a large number of possible messages, or to reduce the space of possible messages.

If the destination receives 10 bits of information, that means that the recipient is now able to uniquely select a message from 2^{10} possible messages. Or equivalently, the recipient can choose one out of 2^{10} outcomes.

Hence the measure of information is based on the ratio of the space of possible messages before and after receipt of the information.

The following sections focus on defining and measuring the space of messages.

Part I: Discrete Noiseless Systems

A discrete noiseless channel is one in which a selection of symbols from a finite alphabet can be transmitted from a source to a destination without loss. It’s not necessarily the case that each symbol consumes an equal amount of resources to send. Since the paper pertains to electronic communication Shannon states the cost in terms of time.

1. Discrete Noiseless Channel

Capacity of a channel is defined to be the maximum amount of information that can be transferred over that channel.

The capacity C of a discrete noiseless channel is defined to be:

C = \lim_{T \to \infty} \frac{\log N(T)}{T}

Here N(T) is the number of allowed messages that take a duration of T to transmit. Let’s see how this comes about.

From the prior definition of information, if the recipient receives C bits of information that means that they were able to distinguish between 2^C possible messages.

So we start with counting the number of possible messages that could’ve been transmitted over T time. If we denote this count as N(T), then we know that there were \log{N(T)} bits of information transferred during T time. Hence the rate is:

R(T) = \frac{\log{N(T)}}{T}

To generalize, we find the limit of R(T) as T tends to \infty.

Suppose the channel can transmit n symbols {S_1, S_2, ..., S_n} each of which consume {t_1, t_2, ..., t_n} units of time. How many possible messages can there be in T units of time? I.e. what’s N(T)?

To break this down a bit further, we can ask “how many possible messages could there be that end in S_i during time period T?”

Any message that ends in S_i can be written as a message consisting of arbitrary symbols s_1 s_2 ... s_k (where each s_i \in S) and ending with the symbol S_i. I.e.:

\underbrace{\underbrace{s_1 s_2 s_3 ... s_k}_{T - t_i \text{ seconds}} S_i}_{T \text{ seconds}}

As shown, the length of the s_1..s_k portion in time must be T - t_i since the remaining t_i time is consumed by S_i.

Hence there can be N(T - t_i) possibilities for the s_1 ... s_k prefix.

Thus we can state this problem recursively as follows:

N(T) = N(T - t_1) + N(T - t_2) + ... + N(T - t_n)

… where N(T) = 0 for all T \lt \min(t_1, t_2, ... t_n).

Solving this equation involves recognizing that this is a linear difference equation (not to be confused with a linear differential equation).

If all t_i are distinct, then the characteristic polynomial corresponding to this recurrence is:

X^{-t_1} + X^{-t_2} + ... + X^{-t_n} = 1

If X_0 is the largest real solution to the characteristic equation, then N(T) approaches X_0^T for large T.

(I need to check the math here since Shannon just says this is a well known result and moves on.)

So now we have:

C = \lim_{T \to \infty} \frac{\log N(T)}{T} = \log X_0

2. The Discrete Source of information

This section of the paper focuses on defining what at discrete source of information is. For our purposes it can be thought of as a stochastic process which produces a stream of symbols.

3. Discrete Approximations to English

Here we come to word and sentence generation using stochastic processes. In particular Shannon takes us through the progression of constructing messages via:

  1. Zero-order approximation. Symbols are individual letters with the addition of a space. Each symbol is selected with equal probability.
  2. First-order approximation. As above, but the relative frequencies correspond to actual letter frequencies in English.
  3. Second-order approximation. As above, but selection of a symbol depends on the previous symbol. I.e. follows a digram structure. Probabilities correspond to English.
  4. Third-order approximation. As above, but using trigrams instead of digrams.
  5. First-order word approximation. Start with a list of words and pick them at random with probability corresponding to their usage in English.
  6. Second-order word approximation. As above, but selection frequencies depend on the previous word. I.e. follows digrams.

By the time we arrive at second-order word approximations, the generated messages start to resemble natural English albeit nonsensical.

4. Graphical Representation of a Markoff process

Identifies the previous stochastic processes as Markoff processes. They are represented as a set of states. Each state can probabilistically transition to number of states with some probability distribution. Each transition emits a symbol.

If the emitted symbol doesn’t depend on any prior symbol, then we only really need one state. It will have a self-edge for each possible emitted symbol.

If the emitted symbol depends on the previous symbol, then there will be as many states needed as there are symbols. This is the case for digrams.

For trigrams, since they depend on the previous two symbols, require n^2 states.

5. Ergodic and Mixed Sources

In an ergodic process every sequence produced by the process is the same in statistical properties. […] Roughly the ergodic property means statistical homogeneity.

6. Choice Uncertainty and Entropy

Here we come to the part that is most often cited. The definition of entropy.

Shannon lays out three requirements for a measure of information H(p_1, p_2, .. p_n) which measures the information content of a set of outcomes each of whose probability is p_i:

  1. H should be continuous in the p_i.
  2. If all p_i are equal, i.e. p_i = \frac{1}{n} then H should be a monotonic increasing function of n.
  3. If a choice be broken down into two choices, then the original H should be the weighted sum of the individual values of H.

The only H satisfying the above three requirements is of the form: H = -K \sum_{i=1}^n p_i \log p_i

The constant K is merely a unit of measure and can be set to 1.

Some interesting properties of H:

  1. H = 0 only when all but one p_i = 0, and the remaining p_j = 1.
  2. For a given n, the maximum H is reached when all p_i = \frac{1}{n}.
  3. The joint entropy of two random variables follow the relation: H(x,y) \le H(x) + H(y) The condition of equality implies that x and y are independent.
  4. Any change towards equalization of probabilities p_i increases H. I.e. if p_1 \lt p_2 and \delta \lt |p_1 - p_2|, then the entropy of p_1, p_2, ... p_n is lower than the entropy of (p_1 + \delta), (p_2 - \delta), p_3, ... p_n.
  5. For two variables x,y,there’s conditional probability p_i(j) of y having value j assuming x has value i. This is given by: p_i(j) = \frac{p(i,j)}{\sum_j p(i,j)} The corresponding conditional entropy H_x(y) is given by: H_x(y) = - \sum_{i,j} p(i,j) \log p_i(j) This conditional entropy then holds the relation: H(x,y) = H(x) + H_x(y)
  6. From 3 and 5, we have: H(y) \le H_x(y)

7. The Entropy of an Information Source

Let S_i be a set of states each of which produce some set of outputs s_j with probability p_{i,j}.

Each state S_i has an associated entropy H_i representing the entropy of all possible outputs from that state. I.e.:

H_i = - \sum_j p_{i,j} \log p_{i,j}

Each such state also has a probability of occurrence P_i.

The entropy of the entire information source is the weighted sum of the state-wise entropies. The weight being the probability of occurrence. Thus:

\begin{align*} H & = \sum_i P_i H_i \\ & = - \sum_{i,j} P_i p_{i,j} \log p_{i,j} \end{align*}

If the source is operating at some definite time rate, then the entropy per second, or entropy rate is defined to be:

H' = \sum_i f_i H_i

where f_i is the average frequency of state i.

Shannon says “Clearly H' = mH where m is the average number of symbols produced per second.” But in order to make the connection you need to observe that f_i = m \cdot P_i.

This is followed by a derivation that takes some work to follow. Let’s go through it step by step:

  1. Suppose that the information source consists of a single state and n symbols each of which is emitted with a probability of p_i.

  2. Now take a string of length N emitted by this source. The expected number of occurrences for symbol s_i is N p_i.

  3. A message of sufficient length where all symbols appear at their expected occurrence count would have a probability of occurring of:

    p = \prod_i p_i^{N p_i}

From there we see that:

\begin{align*} \log p & = \sum_i N p_i \log p_i \\ & = N H \end{align*}