Shannon Paper
Claude Shannon’s A Mathematical Theory of Communication is an oft-cited classic in information theory. In fact, as of this writing there are 84,411 citations and 139 versions of the article on Google Scholar.
Let’s dive in and try to tease apart the “why”s that are often overlooked when people build on top of the theory Shannon introduced. These are things I didn’t find obvious before reading the paper.
The section headings below correspond to the sections in Shannon’s paper.
Introduction
The paper’s introduction states the purpose as:
… [To] extend [the general theory of communication] to include a number of new factors, in particular the effect of noise in the channel, and the savings possible due to the statistical structure of the original message and due to the nature of the final destination of information.
The two most salient points from this section are:
- the arguments for a logarithmic measure of information, and
- what a measure of information actually measures.
Logarithmic measures are well suited for measuring information, Shannon states, because the logarithm is “nearer to our intuitive feeling as to the proper measure.” The paper gives further justifications, but I found this one the most convincing.
To build intuition: adding a second channel equal in capacity to an existing channel squares the space of transferable messages, since every message on one channel can be paired with every message on the other. Our intuition, however, finds it easier to think of this as doubling the capacity. The latter is more in line with a logarithmic measure.
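To make the arithmetic explicit (my example, not the paper’s): if a single channel can deliver one of $N$ possible messages in a given time, two such channels together can deliver one of $N^2$ possible combinations, yet the logarithmic measure grows only additively:
$$\log_2 N^2 = 2 \log_2 N$$
So two identical channels carry exactly twice the information of one.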
Information is what’s ultimately useful to the recipient. Hence a measure of information must be based on what the recipient gains upon receipt of the message: receiving a signal means that the recipient can now uniquely identify a message among a large number of possible messages, or at least reduce the space of possible messages.
If the destination receives 10 bits of information, that means that the recipient is now able to uniquely select a message from $2^{10} = 1024$ possible messages. Or equivalently, the recipient can choose one out of $2^{10}$ outcomes.
Hence the measure of information is based on the ratio of the space of possible messages before and after receipt of the information.
The following sections focus on defining and measuring the space of messages.
Part I: Discrete Noiseless Systems
A discrete noiseless channel is one in which a selection of symbols from a finite alphabet can be transmitted from a source to a destination without loss. It’s not necessarily the case that each symbol consumes an equal amount of resources to send. Since the paper pertains to electronic communication, Shannon states the cost in terms of time.
1. Discrete Noiseless Channel
The capacity of a channel is defined to be the maximum rate at which information can be transferred over that channel.
The capacity of a discrete noiseless channel is defined to be:
$$C = \lim_{T \to \infty} \frac{\log N(T)}{T}$$
Here $N(T)$ is the number of allowed messages that take a duration of $T$ to transmit. Let’s see how this comes about.
From the prior definition of information, if the recipient receives $n$ bits of information, that means that they were able to distinguish between $2^n$ possible messages.
So we start by counting the number of possible messages that could’ve been transmitted over time $T$. If we denote this count as $N(T)$, then we know that there were $\log_2 N(T)$ bits of information transferred during time $T$. Hence the rate is:
$$\frac{\log_2 N(T)}{T}$$
To generalize, we find the limit of $\frac{\log_2 N(T)}{T}$ as $T$ tends to $\infty$.
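As a quick sanity check (my example): suppose the channel carries binary symbols, each taking one second to transmit. Then $N(T) = 2^T$ and
$$C = \lim_{T \to \infty} \frac{\log_2 2^T}{T} = 1 \text{ bit per second},$$
which is what we’d expect.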
Suppose the channel can transmit $n$ symbols $S_1, S_2, \dots, S_n$, each of which consumes $t_1, t_2, \dots, t_n$ units of time respectively. How many possible messages can there be in $t$ units of time? I.e. what’s $N(t)$?
To break this down a bit further, we can ask “how many possible messages could there be that end in $S_i$ during time period $t$?”
Any message that ends in $S_i$ can be written as a message consisting of arbitrary symbols (where each is one of $S_1, \dots, S_n$) and ending with the symbol $S_i$. I.e.:
$$\underbrace{S_{j_1} S_{j_2} \cdots S_{j_k}}_{\text{duration } t - t_i} \, S_i$$
As shown, the length of the prefix portion in time must be $t - t_i$, since the remaining time is consumed by $S_i$.
Hence there can be $N(t - t_i)$ possibilities for the prefix.
Thus we can state this problem recursively as follows:
$$N(t) = N(t - t_1) + N(t - t_2) + \cdots + N(t - t_n)$$
… where $N(t) = 0$ for all $t < 0$.
Solving this equation involves recognizing that this is a linear difference equation (not to be confused with a linear differential equation).
If all the $t_i$ are distinct, then the characteristic equation corresponding to this recurrence is:
$$X^{-t_1} + X^{-t_2} + \cdots + X^{-t_n} = 1$$
If $X_0$ is the largest real solution to the characteristic equation, then $N(t)$ approaches $A X_0^t$ for large $t$ (for some constant $A$).
(I need to check the math here since Shannon just says this is a well known result and moves on.)
So now we have:
$$C = \lim_{t \to \infty} \frac{\log N(t)}{t} = \lim_{t \to \infty} \frac{\log A + t \log X_0}{t} = \log X_0$$
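To convince myself of the asymptotics, here’s a small numerical sketch (mine, not Shannon’s; the symbol durations are made up). It counts $N(t)$ directly from the recurrence and compares $\log_2 N(t) / t$ against $\log_2 X_0$, with $X_0$ found by bisection on the characteristic equation:
import math
from functools import lru_cache

# Hypothetical channel: three symbols taking 1, 2 and 3 time units respectively.
durations = [1, 2, 3]

@lru_cache(maxsize=None)
def N(t):
    """Number of allowed messages that fit in exactly t time units."""
    if t < 0:
        return 0
    if t == 0:
        return 1  # the empty message
    return sum(N(t - d) for d in durations)

def characteristic(x):
    """f(x) = sum_i x**(-t_i) - 1; the largest real root of f is X0."""
    return sum(x ** -d for d in durations) - 1

# Bisection for X0. characteristic(x) is decreasing for x > 0, and for these
# durations the root lies between 1 and 2.
lo, hi = 1.0, 2.0
for _ in range(100):
    mid = (lo + hi) / 2
    if characteristic(mid) > 0:
        lo = mid
    else:
        hi = mid
X0 = (lo + hi) / 2

t = 60
print("log2(N(t))/t =", math.log2(N(t)) / t)  # empirical rate
print("log2(X0)     =", math.log2(X0))        # predicted capacity log X0
For these particular durations $X_0$ is the tribonacci constant, roughly 1.839, so the capacity comes out to about 0.88 bits per unit time.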
2. The Discrete Source of Information
This section of the paper focuses on defining what a discrete source of information is. For our purposes it can be thought of as a stochastic process which produces a stream of symbols.
3. Discrete Approximations to English
Here we come to word and sentence generation using stochastic processes. In particular Shannon takes us through the progression of constructing messages via:
- Zero-order approximation. Symbols are individual letters with the addition of a space. Each symbol is selected with equal probability.
- First-order approximation. As above, but the relative frequencies correspond to actual letter frequencies in English.
- Second-order approximation. As above, but selection of a symbol depends on the previous symbol. I.e. follows a digram structure. Probabilities correspond to English.
- Third-order approximation. As above, but using trigrams instead of digrams.
- First-order word approximation. Start with a list of words and pick them at random with probability corresponding to their usage in English.
- Second-order word approximation. As above, but selection frequencies depend on the previous word. I.e. follows digrams.
By the time we arrive at the second-order word approximation, the generated messages start to resemble natural English, albeit nonsensical English.
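Here’s a rough sketch of how one might generate a second-order word approximation (my own sketch, not how Shannon constructed his examples): record which words follow which in a sample text, then repeatedly sample the next word conditioned on the previous one. The tiny inline corpus is a stand-in; a large English text gives output much closer to Shannon’s examples.
import random
from collections import defaultdict

def build_digram_model(text):
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    followers = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        followers[prev].append(nxt)
    return followers

def generate(followers, length=30):
    """Second-order word approximation: each word depends only on the previous word."""
    word = random.choice(list(followers))
    out = [word]
    for _ in range(length - 1):
        choices = followers.get(word)
        if not choices:                    # dead end: restart from a random word
            word = random.choice(list(followers))
        else:
            word = random.choice(choices)  # duplicates in the list encode the frequencies
        out.append(word)
    return " ".join(out)

# Stand-in corpus; replace with any large English text.
corpus = ("the structure of english constrains the choice of the next word "
          "and the structure of the message reflects the structure of the source")
print(generate(build_digram_model(corpus.lower())))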
4. Graphical Representation of a Markoff Process
This section identifies the previous stochastic processes as Markoff processes. They are represented as a set of states. Each state can transition to some number of other states according to a probability distribution, and each transition emits a symbol.
If the emitted symbol doesn’t depend on any prior symbol, then we only really need one state. It will have a self-edge for each possible emitted symbol.
digraph {
nodesep=1.5;
S -> S [label="0.2, A"];
S -> S [label="0.2, B"];
S -> S [label="0.6, C"];
}
If the emitted symbol depends on the previous symbol, then as many states are needed as there are symbols. This is the case for digrams.
digraph {
A -> A [label="0.33, A"];
A -> B [label="0.33, B"];
A -> C [label="0.33, C"];
B -> A [label="0.33, A"];
B -> B [label="0.33, B"];
B -> C [label="0.33, C"];
C -> A [label="0.33, A"];
C -> B [label="0.33, B"];
C -> C [label="0.33, C"];
}
Trigrams, since they depend on the previous two symbols, require $n^2$ states, one for each possible pair of preceding symbols.
5. Ergodic and Mixed Sources
In an ergodic process every sequence produced by the process is the same in statistical properties. […] Roughly the ergodic property means statistical homogeneity.
6. Choice, Uncertainty and Entropy
Here we come to the part that is most often cited: the definition of entropy.
Shannon lays out three requirements for a measure of information $H(p_1, p_2, \dots, p_n)$ which measures the information content of a set of $n$ possible outcomes each of whose probability is $p_i$:
- $H$ should be continuous in the $p_i$.
- If all the $p_i$ are equal, i.e. $p_i = \frac{1}{n}$, then $H$ should be a monotonic increasing function of $n$.
- If a choice be broken down into two successive choices, then the original $H$ should be the weighted sum of the individual values of $H$.
The only $H$ satisfying the above three requirements is of the form:
$$H = -K \sum_{i=1}^{n} p_i \log p_i$$
The constant $K$ merely determines the unit of measure and can be set to 1.
Some interesting properties of $H$:
- $H = 0$ only when all but one of the $p_i$ are zero, and the remaining $p_i$ is 1.
- For a given $n$, the maximum $H = \log n$ is reached when all $p_i = \frac{1}{n}$.
- The joint entropy of two random variables $x$ and $y$ follows the relation $H(x, y) \le H(x) + H(y)$. The condition of equality implies that $x$ and $y$ are independent.
- Any change towards equalization of the probabilities increases $H$. I.e. if $p_1 < p_2$ and we replace them with $p_1 + \epsilon$ and $p_2 - \epsilon$ (bringing them closer to equal), then the entropy of the original distribution is lower than the entropy of the adjusted one.
- For two variables $x, y$, there’s a conditional probability $p_i(j)$ of $y$ having the value $j$ assuming $x$ has the value $i$. This is given by:
  $$p_i(j) = \frac{p(i, j)}{\sum_j p(i, j)}$$
  The corresponding conditional entropy is given by:
  $$H_x(y) = -\sum_{i, j} p(i, j) \log p_i(j)$$
  This conditional entropy then satisfies the relation:
  $$H(x, y) = H(x) + H_x(y)$$
- From properties 3 and 5 above, we have $H(x) + H_x(y) = H(x, y) \le H(x) + H(y)$, and hence $H(y) \ge H_x(y)$: the uncertainty of $y$ is never increased by knowledge of $x$.
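These properties are easy to poke at numerically. Here’s a minimal sketch (mine, not from the paper) that computes the entropies for a small made-up joint distribution and checks properties 3, 5 and 6:
import math

def H(probs):
    """Entropy in bits of a probability distribution (zero entries contribute nothing)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A made-up joint distribution p(i, j) over x in {0, 1} and y in {0, 1}.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

px = [sum(v for (i, j), v in p.items() if i == x) for x in (0, 1)]  # marginal of x
py = [sum(v for (i, j), v in p.items() if j == y) for y in (0, 1)]  # marginal of y

H_xy, H_x, H_y = H(p.values()), H(px), H(py)

# Conditional entropy H_x(y) = -sum_{i,j} p(i, j) log p_i(j), where p_i(j) = p(i, j) / p(i).
H_y_given_x = -sum(v * math.log2(v / px[i]) for (i, j), v in p.items() if v > 0)

print(H_xy <= H_x + H_y)                       # property 3
print(abs(H_xy - (H_x + H_y_given_x)) < 1e-9)  # property 5
print(H_y >= H_y_given_x)                      # property 6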
7. The Entropy of an Information Source
Let $S_1, S_2, \dots, S_n$ be a set of states, each of which produces some set of outputs; state $i$ emits output $j$ with probability $p_i(j)$.
Each state has an associated entropy $H_i$ representing the entropy of all possible outputs from that state. I.e.:
$$H_i = -\sum_j p_i(j) \log p_i(j)$$
Each such state also has a probability of occurrence $P_i$.
The entropy of the entire information source is the weighted sum of the state-wise entropies, the weights being the probabilities of occurrence. Thus:
$$H = \sum_i P_i H_i = -\sum_{i, j} P_i \, p_i(j) \log p_i(j)$$
If the source is operating at some definite time rate, then the entropy per second, or entropy rate, is defined to be:
$$H' = \sum_i f_i H_i$$
where $f_i$ is the average frequency (occurrences per second) of state $i$.
Shannon says “Clearly $H' = mH$ where $m$ is the average number of symbols produced per second.” But in order to make the connection you need to observe that $f_i = m P_i$.
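To make this concrete, here’s a small sketch (my own made-up two-state source, not one from the paper) that takes the probabilities of occurrence $P_i$ to be the stationary distribution of the state transition matrix and computes $H = \sum_i P_i H_i$:
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Made-up two-state source. From state 0 it emits 'A' with probability 0.9 and
# 'B' with probability 0.1; the emitted symbol determines the next state
# (A -> state 0, B -> state 1), so the transition matrix mirrors the emissions.
emissions = [
    {"A": 0.9, "B": 0.1},  # state 0
    {"A": 0.4, "B": 0.6},  # state 1
]
transition = [[0.9, 0.1],
              [0.4, 0.6]]

# Probabilities of occurrence P_i: the stationary distribution, via power iteration.
P = [0.5, 0.5]
for _ in range(1000):
    P = [sum(P[i] * transition[i][j] for i in range(2)) for j in range(2)]

H_i = [entropy(e.values()) for e in emissions]  # per-state entropies
H = sum(P[i] * H_i[i] for i in range(2))        # source entropy, bits per symbol
print("P =", P, " H =", H)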
This is followed by a derivation that takes some work to follow. Let’s go through it step by step:
- Suppose that the information source consists of a single state and $n$ symbols, each of which is emitted with a probability of $p_i$.
- Now take a string of length $N$ emitted by this source. The expected number of occurrences of symbol $i$ is $p_i N$.
- A message of sufficient length where all symbols appear at their expected occurrence count would have a probability of occurring of:
  $$p \approx p_1^{p_1 N} \, p_2^{p_2 N} \cdots p_n^{p_n N}$$
From there we see that:
$$\log p \approx N \sum_i p_i \log p_i = -NH \quad\Longrightarrow\quad H \approx \frac{\log (1/p)}{N}$$
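A quick numerical check of that last step, with made-up symbol probabilities:
import math

p = [0.5, 0.25, 0.25]  # made-up symbol probabilities
N = 1000               # message length

H = -sum(q * math.log2(q) for q in p)

# log2-probability of a "typical" message in which symbol i appears p_i * N times
log_p_typical = sum(q * N * math.log2(q) for q in p)

print(H)                   # 1.5 bits per symbol
print(-log_p_typical / N)  # also 1.5, i.e. log(1/p)/N recovers H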
Last modified: March 26, 2021