Tuesday, January 11, 2005

Information Entropy: A Derivation

Given a set of M symbols, x(i), these can be combined to make up a set of messages. The probability of occurrence of any symbol, x(i), is p[x(i)], and the probabilities sum to unity.

Σ { p[x(i)], i=1,M } = 1 (1)

A measure of the information gained from a message is the minimum number of yes/no questions required to identify the message. Thus, if the message is a number less than 64 and each number has an equal probability of 1/64 of occurring, we would first ask if the number is less than or equal to 32. If the answer is “yes”, we could ask if it is less than or equal to 16. If the answer is “no”, we know it is greater than 16 (and, implicitly, less than or equal to 32), and we continue halving the remaining range. Thus the number of questions with two possible answers is six. Each question adds one unit of information, and a convenient additive measure in a binary system is the logarithm to base 2 [“log2”; 1 unit = log2 (2)]. The addition of 6 such units is then (^ denotes an exponent)

6 log2 (2) = log2 (2^6) = log2 (64) = 6.
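As a quick illustration (mine, not Kanasewich's), the halving strategy can be simulated with a short Python sketch; the helper name questions_needed is a made-up label, and the range 0–63 stands in for the 64 equally likely numbers:

import math

def questions_needed(target, lo=0, hi=63):
    # Count the yes/no questions a halving (binary) search needs
    # to pin down `target` in the inclusive range [lo, hi].
    count = 0
    while lo < hi:
        mid = (lo + hi) // 2
        count += 1                 # one question: "is it <= mid?"
        if target <= mid:          # "yes": keep the lower half
            hi = mid
        else:                      # "no": keep the upper half
            lo = mid + 1
    return count

# Every one of the 64 equally likely numbers takes exactly six questions,
# matching log2(64) = 6.
assert all(questions_needed(n) == 6 for n in range(64))
print(math.log2(64))               # 6.0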

Define a quantity called self-information, S, which is the logarithm of the reciprocal of the probability of occurrence.

S[x(i)] = log {1/p[x(i)]} = - log {p[x(i)]} (2)

Using base 2, for the example above S[x(i)] = log2 (64) = 6.
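In code, equation (2) is a one-liner; the sketch below (mine, with base 2 assumed) reproduces the 6-unit result and anticipates the properties listed next:

import math

def self_information(p, base=2):
    # Self-information S = -log(p), in bits when base = 2 (equation 2).
    return -math.log(p, base)

print(self_information(1/64))    # 6.0 bits, the 64-number example
print(self_information(1.0))     # 0.0: a certain event carries no information
print(self_information(1/1024))  # 10.0: a rarer event carries more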

Properties of S[x(i)] and implications for message interpretation:
a) Decreases if a symbol occurs with greater probability.
b) A rare event contains a more significant message.
c) If p[x(i)] = 1, then S[x(i)] = 0.
d) If a message consists of an infinite string of yeses, there is no information or significance to this.
e) Further, if all probabilities are bracketed, 0 < p[x(i)] < 1, then S[x(i)] > 0.

Over a long period of time, T, the number of occurrences of x(i) is p[x(i)]T; thus the total information obtained about the system is
–p[x(1)]T log2{p[x(1)]} – p[x(2)]T log2{p[x(2)]} – ….

The average information per unit time, also called the expectation value, E, of S(x), in which x = [x(1), x(2), x(3), … , x(M)], is then

E[S(x)] = Σ { p[x(i)] S[x(i)], i=1, M } (3)

This is termed, by analogy with Gibbs thermodynamics, the entropy.

H = – Σ { p[x(i)] log p[x(i)], i=1, M } (4)

In thermodynamics, H (with a multiplicative constant) measures the degree of disorder in a physical process. Here it measures the uncertainty of the system in a particular state, or our “ignorance” of it. The base of the logarithm can be any convenient number but is logically 2 if binary arithmetic is used. In the example above, yes = 1 and no = 0.
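A direct transcription of equation (4), in Python and with base 2 assumed (the function name entropy is a made-up label), makes the definition concrete:

import math

def entropy(probs, base=2):
    # H = -sum of p_i log(p_i) over the distribution (equation 4);
    # terms with p_i = 0 are skipped, consistent with p log p -> 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: four equally likely symbols
print(entropy([0.7, 0.1, 0.1, 0.1]))      # about 1.357 bits: less uncertainty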

The entropy of a discrete distribution is a maximum when the probabilities are all equal. For example, for a binary system that transmits only 0 and 1, the probabilities are p and 1 – p.

H = – p log2(p) – (1 – p) log2(1 – p) (5)

The entropy is a maximum when p = ½; H is, therefore, 1.
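Tabulating equation (5) for a few values of p (a small sketch of mine) shows the maximum of 1 bit at p = ½ and the symmetry of the curve:

import math

def binary_entropy(p):
    # H(p) = -p log2(p) - (1 - p) log2(1 - p) (equation 5), with H(0) = H(1) = 0.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(binary_entropy(p), 4))
# The maximum, 1 bit, occurs at p = 0.5; note H(p) = H(1 - p).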

If the variance of x is a constant, σ^2, then the entropy is a maximum if the probability density is a normal or Gaussian distribution.

p(x) = (2πσ^2)^(–½) exp (–x^2/2σ^2) (6)

The entropy for the Gaussian case is

H = (½) ln (2πeσ^2) (7)
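As a numerical check (mine, not Kanasewich's), a Monte Carlo estimate of E[–ln p(x)] for samples drawn from equation (6) should land close to the closed form of equation (7); σ = 2 is an arbitrary choice:

import math
import random

def gaussian_entropy(sigma):
    # Closed form: H = 0.5 * ln(2 * pi * e * sigma^2) nats (equation 7).
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def monte_carlo_entropy(sigma, n=200000, seed=1):
    # Estimate H = E[-ln p(x)] by sampling x from the Gaussian of equation (6).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, sigma)
        log_p = -0.5 * math.log(2 * math.pi * sigma ** 2) - x ** 2 / (2 * sigma ** 2)
        total -= log_p
    return total / n

sigma = 2.0
print(gaussian_entropy(sigma))     # about 2.112 nats
print(monte_carlo_entropy(sigma))  # close to the closed-form value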


(Based on p. 146-147 of Kanasewich, E. R., 1981, Time sequence analysis in geophysics, University of Alberta Press, Edmonton, Alberta, Canada, Third Edition, 480 p.)

