Shannon Paper

Claude Shannon’s A Mathematical Theory of Communication is an oft-cited classic in information theory. In fact, as of this writing there are 84,411 citations and 139 versions of the article on Google Scholar.

Screenshot of Google Scholar showing citation count for Shannon’s paper

Let’s dive in and try to tease apart the “why”s that are often overlooked when people build on top of the theory introduced here. These are things I wouldn’t have considered obvious without the benefit of reading the paper.

The section headings below correspond to the sections in Shannon’s paper.

Introduction

The paper’s introduction states the purpose as:

… [To] extend [the general theory of communication] to include a number of new factors, in particular the effect of noise in the channel, and the savings possible due to the statistical structure of the original message and due to the nature of the final destination of information.

The two most salient points from this section are

  1. the arguments for a logarithmic measure of information, and
  2. what a measure of information actually measures.

Logarithmic measures are well suited for measuring information, Shannon states, because a logarithmic measure is “nearer to our intuitive feeling as to the proper measure.” Additional justifications exist in the paper, but I found this to be the most convincing.

To build that intuition: adding a second channel with the same capacity as an existing one squares the number of messages that can be transmitted in a given time. Our intuition, however, finds it easier to think of this as doubling the capacity, and that is exactly what a logarithmic measure gives us.

Information is what’s ultimately useful to the recipient. Hence a measure of information must be based on what the recipient gains upon receipt of the message. Receiving a signal means that the recipient can now uniquely identify a message among a large number of possible messages, or at least narrow down the space of possible messages.

If the destination receives 10 bits of information, that means that the recipient is now able to uniquely select a message from $2^{10}$ possible messages. Or equivalently, the recipient can choose one out of $2^{10}$ outcomes.

Hence the measure of information is based on the ratio of the space of possible messages before and after receipt of the information.
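To make this concrete (the $M$ notation here is my own, not the paper's): if receiving a signal narrows the space of equally likely messages from $M_{\text{before}}$ down to $M_{\text{after}}$, the information conveyed is

$$\log_2 \frac{M_{\text{before}}}{M_{\text{after}}} \text{ bits}, \quad \text{e.g. } \log_2 \frac{1024}{1} = 10 \text{ bits.}$$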

The following sections focus on defining and measuring the space of messages.

Part I: Discrete Noiseless Systems

A discrete noiseless channel is one in which a selection of symbols from a finite alphabet can be transmitted from a source to a destination without loss. It’s not necessarily the case that each symbol consumes an equal amount of resources to send. Since the paper pertains to electronic communication, Shannon states the cost in terms of time.

1. Discrete Noiseless Channel

The capacity of a channel is defined to be the maximum rate at which information can be transferred over that channel.

The capacity $C$ of a discrete noiseless channel is defined to be:

$$C = \lim_{T \to \infty} \frac{\log N(T)}{T}$$

Here $N(T)$ is the number of allowed messages that take a duration of $T$ to transmit. Let’s see how this comes about.

From the prior definition of information, if the recipient receives $C$ bits of information, that means that they were able to distinguish between $2^C$ possible messages.

So we start with counting the number of possible messages that could’ve been transmitted over $T$ units of time. If we denote this count as $N(T)$, then we know that there were $\log N(T)$ bits of information transferred during that time. Hence the rate is:

$$R(T) = \frac{\log N(T)}{T}$$

To generalize, we find the limit of $R(T)$ as $T$ tends to $\infty$.

Suppose the channel can transmit $n$ symbols $S_1, S_2, \ldots, S_n$, which consume $t_1, t_2, \ldots, t_n$ units of time respectively. How many possible messages can there be in $T$ units of time? I.e. what’s $N(T)$?

To break this down a bit further, we can ask “how many possible messages could there be that end in $S_i$ during time period $T$?”

Any message that ends in $S_i$ can be written as a message consisting of arbitrary symbols $s_1 s_2 \ldots s_k$ (where each $s_i \in S$) and ending with the symbol $S_i$. I.e.:

$$\underbrace{\underbrace{s_1 s_2 s_3 \ldots s_k}_{T - t_i \text{ seconds}} \; S_i}_{T \text{ seconds}}$$

As shown, the length of the $s_1 \ldots s_k$ portion in time must be $T - t_i$, since the remaining $t_i$ time is consumed by $S_i$.

Hence there can be $N(T - t_i)$ possibilities for the $s_1 \ldots s_k$ prefix.

Thus we can state this problem recursively as follows:

$$N(T) = N(T - t_1) + N(T - t_2) + \ldots + N(T - t_n)$$

… where $N(T) = 0$ for $T < 0$ and $N(0) = 1$ (the empty message), so that the recursion bottoms out.

Solving this equation involves recognizing that this is a linear difference equation (not to be confused with a linear differential equation).

The characteristic equation corresponding to this recurrence is:

$$X^{-t_1} + X^{-t_2} + \ldots + X^{-t_n} = 1$$

If $X_0$ is the largest real solution of the characteristic equation, then $N(T)$ grows like $X_0^T$ (up to a constant factor) for large $T$.

(I need to check the math here since Shannon just says this is a well known result and moves on.)

So now we have:

$$C = \lim_{T \to \infty} \frac{\log N(T)}{T} = \log X_0$$
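To sanity-check this result and the recurrence above, here is a minimal Python sketch. The three-symbol channel with durations 1, 2 and 2 is a toy example of my own, not one from the paper; the sketch solves the characteristic equation for $X_0$ by bisection and compares $\log_2 X_0$ against $\log_2 N(T) / T$ computed directly from the recurrence.

```python
import math

# Hypothetical channel: three symbols with durations 1, 2 and 2 time units.
durations = [1, 2, 2]

# Solve the characteristic equation  sum_i X^(-t_i) = 1  for its largest real
# root X0. The left-hand side is decreasing for X > 1, so bisection works.
def char_eq(x):
    return sum(x ** -t for t in durations) - 1.0

lo, hi = 1.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if char_eq(mid) > 0:
        lo = mid
    else:
        hi = mid
X0 = (lo + hi) / 2

# Count messages with the recurrence N(T) = sum_i N(T - t_i),
# with N(0) = 1 (the empty message) and N(T) = 0 for T < 0.
N = {0: 1}
for T in range(1, 61):
    N[T] = sum(N.get(T - t, 0) for t in durations)

print(f"log2(X0)         = {math.log2(X0):.6f}")
print(f"log2(N(60)) / 60 = {math.log2(N[60]) / 60:.6f}")
```

For this particular channel the characteristic equation reduces to $X^2 - X - 2 = 0$, so $X_0 = 2$ and the capacity works out to exactly 1 bit per unit time; the recurrence-based estimate converges to the same value as $T$ grows.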

2. The Discrete Source of Information

This section of the paper focuses on defining what a discrete source of information is. For our purposes it can be thought of as a stochastic process which produces a stream of symbols.

3. Discrete Approximations to English

Here we come to word and sentence generation using stochastic processes. In particular Shannon takes us through the progression of constructing messages via:

  1. Zero-order approximation. Symbols are individual letters with the addition of a space. Each symbol is selected with equal probability.
  2. First-order approximation. As above, but the relative frequencies correspond to actual letter frequencies in English.
  3. Second-order approximation. As above, but selection of a symbol depends on the previous symbol. I.e. follows a digram structure. Probabilities correspond to English.
  4. Third-order approximation. As above, but using trigrams instead of digrams.
  5. First-order word approximation. Start with a list of words and pick them at random with probability corresponding to their usage in English.
  6. Second-order word approximation. As above, but selection frequencies depend on the previous word. I.e. follows digrams.

By the time we arrive at the second-order word approximation, the generated messages start to resemble natural English, albeit nonsensical.
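To get a feel for how such approximations can be generated, here is a minimal Python sketch of the second-order (digram) letter approximation. The tiny corpus is a stand-in of my own; Shannon worked from published frequency tables and letters sampled out of books rather than anything like this.

```python
import random
from collections import defaultdict

# Toy corpus standing in for a large sample of English text.
corpus = (
    "the theory of communication is concerned with the statistical "
    "structure of messages produced by an information source"
)

# Digram statistics: for each symbol, collect the symbols that follow it.
# Sampling uniformly from such a list reproduces the digram frequencies.
followers = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    followers[a].append(b)

# Second-order approximation: each symbol is chosen with a probability that
# depends only on the previous symbol.
random.seed(0)
symbol = random.choice(list(followers))
out = [symbol]
for _ in range(80):
    symbol = random.choice(followers[symbol])
    out.append(symbol)
print("".join(out))
```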

4. Graphical Representation of a Markoff Process

This section identifies the previous stochastic processes as Markoff processes. They are represented as a set of states; each state can transition to a number of other states according to some probability distribution, and each transition emits a symbol.

If the emitted symbol doesn’t depend on any prior symbol, then we only really need one state. It will have a self-edge for each possible emitted symbol.

digraph {
  nodesep=1.5;
  S -> S [label="0.2\n'A'"];
  S -> S [label="0.2\n'B'"];
  S -> S [label="0.6\n'C'"];
}

If the emitted symbol depends on the previous symbol, then there will be as many states needed as there are symbols. This is the case for digrams.

digraph {
  A -> A [label="0.33, A"];
  A -> B [label="0.33, B"];
  A -> C [label="0.33, C"];
  B -> A [label="0.33, A"];
  B -> B [label="0.33, B"];
  B -> C [label="0.33, C"];
  C -> A [label="0.33, A"];
  C -> B [label="0.33, B"];
  C -> C [label="0.33, C"];
}

Trigrams, since they depend on the previous two symbols, require $n^2$ states, as sketched below.
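A quick sketch of why that is: for a trigram model the state must remember the previous two symbols, so the states are ordered pairs over the alphabet. The three-symbol alphabet below is just an illustrative choice of mine.

```python
from itertools import product

# With a trigram model the next symbol depends on the previous two symbols,
# so each state of the equivalent Markoff process is an ordered pair.
alphabet = ["A", "B", "C"]                  # n = 3 symbols
states = list(product(alphabet, repeat=2))  # n^2 = 9 states
print(len(states))                          # -> 9

# Emitting symbol s from state (x, y) moves the process to state (y, s):
# the state always remembers the last two symbols produced.
def next_state(state, emitted):
    return (state[1], emitted)

print(next_state(("A", "B"), "C"))          # -> ('B', 'C')
```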

5. Ergodic and Mixed Sources

In an ergodic process every sequence produced by the process is the same in statistical properties. […] Roughly the ergodic property means statistical homogeneity.

6. Choice, Uncertainty and Entropy

Here we come to the part that is most often cited: the definition of entropy.

Shannon lays out three requirements for a measure of information $H(p_1, p_2, \ldots, p_n)$, which measures the information content of a set of outcomes each of whose probability is $p_i$:

  1. $H$ should be continuous in the $p_i$.
  2. If all $p_i$ are equal, i.e. $p_i = \frac{1}{n}$, then $H$ should be a monotonic increasing function of $n$.
  3. If a choice be broken down into two successive choices, then the original $H$ should be the weighted sum of the individual values of $H$.

The only $H$ satisfying the above three requirements is of the form:

$$H = -K \sum_{i=1}^n p_i \log p_i$$

The constant $K$ is merely a unit of measure and can be set to 1.
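A minimal sketch of the definition, with $K = 1$ and base-2 logarithms so that $H$ comes out in bits; the distributions are arbitrary examples of my own.

```python
import math

def entropy(probs, k=1.0):
    """H = K * sum_i p_i * log2(1 / p_i), with 0 log 0 taken as 0."""
    return k * sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([1.0, 0.0, 0.0]))  # 0.0   -- a certain outcome carries no information
print(entropy([1/8] * 8))        # 3.0   -- uniform over 8 outcomes: log2(8) bits
print(entropy([0.5, 0.3, 0.2]))  # ~1.49 -- skewed, so below log2(3) ≈ 1.58
```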

Some interesting properties of $H$:

  1. $H = 0$ only when all but one $p_i = 0$, and the remaining $p_j = 1$.
  2. For a given $n$, the maximum $H$ is reached when all $p_i = \frac{1}{n}$.
  3. The joint entropy of two random variables follows the relation $H(x,y) \le H(x) + H(y)$. The condition of equality implies that $x$ and $y$ are independent.
  4. Any change towards equalization of the probabilities $p_i$ increases $H$. I.e. if $p_1 < p_2$ and $\delta < |p_1 - p_2|$, then the entropy of $p_1, p_2, \ldots, p_n$ is lower than the entropy of $(p_1 + \delta), (p_2 - \delta), p_3, \ldots, p_n$.
  5. For two variables $x, y$, there is a conditional probability $p_i(j)$ of $y$ having value $j$ given that $x$ has value $i$. It is given by $p_i(j) = \frac{p(i,j)}{\sum_j p(i,j)}$. The corresponding conditional entropy is $H_x(y) = - \sum_{i,j} p(i,j) \log p_i(j)$, and it satisfies the relation $H(x,y) = H(x) + H_x(y)$.
  6. From 3 and 5, we have $H_x(y) \le H(y)$: knowing $x$ can never increase the uncertainty about $y$. (These relations are checked numerically in the sketch after this list.)
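Here is a sketch that checks the joint and conditional relations (properties 3, 5 and 6) numerically, using an arbitrary two-by-two joint distribution of my own.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary joint distribution p(i, j) over x in {0, 1} and y in {0, 1},
# chosen so that x and y are not independent.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

px = {i: sum(v for (a, _), v in p.items() if a == i) for i in (0, 1)}  # marginal of x
py = {j: sum(v for (_, b), v in p.items() if b == j) for j in (0, 1)}  # marginal of y

Hxy = H(p.values())
Hx, Hy = H(px.values()), H(py.values())
# Conditional entropy H_x(y) = -sum_{i,j} p(i,j) log2 p_i(j), with p_i(j) = p(i,j) / p(i).
Hx_y = -sum(v * math.log2(v / px[i]) for (i, _), v in p.items() if v > 0)

print(f"H(x,y)        = {Hxy:.4f}")
print(f"H(x) + H_x(y) = {Hx + Hx_y:.4f}")  # equals H(x,y)   (property 5)
print(f"H(x) + H(y)   = {Hx + Hy:.4f}")    # >= H(x,y)       (property 3)
print(f"H(y) = {Hy:.4f} >= H_x(y) = {Hx_y:.4f}")  #          (property 6)
```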

7. The Entropy of an Information Source

Let $\{S_i\}$ be a set of states, each of which produces some set of outputs $s_j$ with probability $p_{i,j}$.

Each state $S_i$ has an associated entropy $H_i$ representing the entropy of all possible outputs from that state. I.e.:

$$H_i = - \sum_j p_{i,j} \log p_{i,j}$$

Each such state also has a probability of occurrence $P_i$.

The entropy of the entire information source is the weighted sum of the state-wise entropies, where each state is weighted by its probability of occurrence. Thus:

$$\begin{align*} H &= \sum_i P_i H_i \\ &= - \sum_{i,j} P_i \, p_{i,j} \log p_{i,j} \end{align*}$$
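As a quick numeric illustration, here is a sketch that computes $H$ for a hypothetical two-state source of my own. Each transition emits one symbol, so the per-state entropies $H_i$ are just the entropies of the transition rows, and the occurrence probabilities $P_i$ are found by iterating the chain to its fixed point.

```python
import math

def H(row):
    return -sum(p * math.log2(p) for p in row if p > 0)

# transitions[i][j] is the probability of moving from state i to state j;
# each transition emits one symbol.
transitions = [[0.9, 0.1],
               [0.5, 0.5]]

# Occurrence probabilities P_i: iterate the chain until it settles.
P = [0.5, 0.5]
for _ in range(1000):
    P = [sum(P[i] * transitions[i][j] for i in range(2)) for j in range(2)]

H_source = sum(P[i] * H(transitions[i]) for i in range(2))
print(P)         # ~[0.833, 0.167]
print(H_source)  # ~0.56 bits per symbol
```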

If the source is operating at some definite time rate, then the entropy per second, or entropy rate, is defined to be:

$$H' = \sum_i f_i H_i$$

where $f_i$ is the average frequency of state $i$ (occurrences per second).

Shannon says “Clearly $H' = mH$ where $m$ is the average number of symbols produced per second.” But in order to make the connection you need to observe that $f_i = m \cdot P_i$.
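Spelling that out: each visit to a state emits one symbol, so if the source produces $m$ symbols per second on average, state $i$ occurs $f_i = m P_i$ times per second. Substituting:

$$H' = \sum_i f_i H_i = \sum_i m P_i H_i = m \sum_i P_i H_i = m H$$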

This is followed by a derivation that takes some work to follow. Let’s go through it step by step:

  1. Suppose that the information source consists of a single state and $n$ symbols, each of which is emitted with a probability of $p_i$.

  2. Now take a string of length $N$ emitted by this source. The expected number of occurrences of symbol $s_i$ is $N p_i$.

  3. A message of sufficient length where all symbols appear at their expected occurrence count would have a probability of occurring of:

    $$p = \prod_i p_i^{N p_i}$$

From there we see that:

$$\begin{align*} \log p &= \sum_i N p_i \log p_i \\ &= -NH \end{align*}$$

Equivalently, $H = \frac{\log(1/p)}{N}$: the entropy is approximately the per-symbol log of the reciprocal probability of a typical long message.
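This is easy to check empirically: sample a long message from a single-state source, compute the log-probability of the particular message produced, and compare $-\log_2 p / N$ with $H$. The three-symbol distribution below is an arbitrary choice of mine.

```python
import math
import random

# An arbitrary single-state source over three symbols.
probs = {"a": 0.5, "b": 0.3, "c": 0.2}
H = -sum(p * math.log2(p) for p in probs.values())

# Emit a long message of N independent symbols.
random.seed(1)
N = 100_000
message = random.choices(list(probs), weights=list(probs.values()), k=N)

# log2 of the probability of this particular message.
log_p = sum(math.log2(probs[s]) for s in message)

print(f"H            = {H:.4f}")
print(f"-log2(p) / N = {-log_p / N:.4f}")  # approaches H as N grows
```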

Last modified: March 26, 2021