*8 June, 2005 at 00:04*

THE REDUNDANCY OF ENGLISH – CLAUDE E. SHANNON

Bell Laboratories, Murray Hill, N. J.

“Note the things in bold – this could get interesting…”

The chief subject I should like to discuss is a recently developed method of estimating the amount of redundancy in printed English. Before doing so, I wish to review briefly what we mean by redundancy. In communication engineering we regard information perhaps a little differently than some of the rest of you do. In particular, we are not at all interested in semantics or the meaning implications of information. Information for the communication engineer is something he transmits from one point to another as it is given to him, and it may not have any meaning at all. It might, for example, be a random sequence of digits, or it might be information for a guided missile or a television signal.

Carrying this idea along, we can idealize a communication system, from our point of view, as a series of boxes, as in Figure 22, of which I want to talk mainly about the first two. The first box is the information source. It is the thing which produces the messages to be transmitted. For communication work we abstract all properties of the messages except the statistical properties, which turn out to be very important. The communication engineer can visualize his job as the transmission of the particular messages chosen by the information source to be sent to the receiving point. What the message means is of no importance to him; the thing that does have importance is the set of statistics with which it was chosen, the probabilities of various messages. In general, we are usually interested in messages that consist of a sequence of discrete symbols, or symbols that at least can be reduced to that form by suitable approximation.

The second box is a coding device which translates the message into a form suitable for transmission to the receiving point, and the third box has the function of decoding it into its original form. Those two boxes are very important, because it is there that the communication engineer can make a saving by the choice of an efficient code. During the last few years a theory has been developed to solve the problem of finding efficient codes for various types of communication systems.

The redundancy is related to the extent to which it is possible to compress the language. I think I can explain that simply. A telegraph company uses commercial codes consisting of a few letters or numbers for common words and phrases. By translating the message into these codes you get an average compression. The encoded message is shorter, on the average, than the original. Although this is not the best way to compress, it is a start in the right direction. The redundancy is the measure of the extent to which it is possible to compress if the best possible code is used. It is assumed that you stay in the same alphabet, translating English into a twenty-six-letter alphabet. The amount that you shorten it, expressed as a percentage, is then the redundancy. If it is possible, by proper encoding, to reduce the length of English text 40 per cent, English then is 40 per cent redundant. The redundancy can be calculated in terms of probabilities associated with the language: the probabilities of the different letters and pairs of letters; probabilities of words and pairs of words; and so on. The formula for this calculation is related to the formula of entropy, as no doubt has appeared in these meetings before.
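The calculation Shannon describes can be sketched in a few lines of Python. The letter frequencies below are rough modern estimates for a 27-symbol alphabet (26 letters plus space), not Shannon's own tables; the point is only to show how the entropy formula H = −Σ p log₂ p turns a frequency table into a redundancy figure.

```python
import math

def entropy(probs):
    """Shannon entropy in bits per symbol: H = -sum p * log2(p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative single-letter frequencies (assumed rough values, not
# Shannon's tabulated data), for 26 letters plus the space.
freqs = {
    ' ': 0.19, 'e': 0.10, 't': 0.07, 'a': 0.06, 'o': 0.06, 'i': 0.06,
    'n': 0.06, 's': 0.05, 'h': 0.05, 'r': 0.05, 'd': 0.035, 'l': 0.03,
    'u': 0.025, 'c': 0.02, 'm': 0.02, 'w': 0.02, 'f': 0.02, 'g': 0.015,
    'y': 0.015, 'p': 0.015, 'b': 0.01, 'v': 0.008, 'k': 0.006,
    'j': 0.002, 'x': 0.002, 'q': 0.001, 'z': 0.001,
}
total = sum(freqs.values())
probs = [p / total for p in freqs.values()]   # normalize to sum to 1

h_max = math.log2(27)        # entropy if all 27 symbols were equally likely
h1 = entropy(probs)          # entropy using single-letter statistics only
redundancy = 1 - h1 / h_max  # fraction by which such text could be compressed
print(f"H_max = {h_max:.3f} bits, H_1 = {h1:.3f} bits, "
      f"redundancy = {redundancy:.0%}")
```

With frequencies in this neighborhood the single-letter redundancy comes out near the 15 per cent that Shannon quotes below for Approximation 2.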

Actually, to perform this calculation is quite a task. I was interested in calculating the redundancy of printed English. I started in by calculating it from the entropy formulas. What is actually done is to obtain the redundancy of artificial languages which are approximations to English. I pointed out that we represent an information source as a statistical process. In order to see what is involved in that representation, I constructed some approximations to English in which the statistics of English are introduced by easy stages. The following are examples of these approximations:

1. xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd
2. ocro hli rgwr nmielwis eu ll nbnesebya th eei
3. on ie antsoutinys are t inctore st be s deamy achin d ilonasive tucoowe at teasonare fuso
4. in no ist lat whey cratict froure birs grocid pondenome of demonstures of the retagin is regiactiona of cre.
5. representing and speedily is an good apt or come can different natural here he the a in came the to of to expert gray come to furnishes the line message had be these.
6. the head and in frontal attack on an english writer that the character of this point is therefore another method for the letters that the time of who ever told the problem for an unexpected.

In the first approximation we don’t introduce any statistics at all. The letters are chosen completely at random. The only property of English used lies in the fact that the letters are the twenty-seven letters of English, counting the space as an additional letter. Of course this produces merely a meaningless sequence of letters. The next step (2) is to introduce the probabilities of various letters. This is constructed essentially by putting all the letters of the alphabet in a hat, with more E’s than Z’s in proportion to their relative frequency, and then drawing out letters at random. If you introduce probabilities for pairs of letters, you get something like Approximation 3. It looks a bit more like English, since the vowel-consonant alternation is beginning to appear. In Approximation 3 we begin to have a few words produced from the statistical process. Approximation 4 introduces the statistics of trigrams, that is, triplets of letters, and is again somewhat closer to English. Approximation 5 is based on choosing words according to their probabilities in normal English, and Approximation 6 introduces the transition probabilities between pairs of words. It is evident that Approximation 6 is quite close to normal English. The text makes sense over rather long stretches. These samples show that it is perhaps reasonable to represent English text as a time series produced by an involved stochastic process.
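The hat-drawing construction above is easy to reproduce. The sketch below generates the first three approximations: letters at random, letters in proportion to their frequency, and letters chosen by digram (pair) statistics. The tiny training corpus is an assumption for illustration; any longer English text would give better statistics.

```python
import random
from collections import Counter, defaultdict

random.seed(0)
ALPHABET = "abcdefghijklmnopqrstuvwxyz "   # 27 symbols, space included

# An assumed small training sample, standing in for a real corpus.
corpus = ("the head and in frontal attack on an english writer that the "
          "character of this point is therefore another method for the "
          "letters that the time of whoever told the problem")

def zeroth_order(n):
    """Approximation 1: letters chosen completely at random."""
    return "".join(random.choice(ALPHABET) for _ in range(n))

def first_order(n):
    """Approximation 2: letters drawn from a 'hat' in proportion
    to their frequency in the corpus (more E's than Z's)."""
    counts = Counter(corpus)
    letters, weights = zip(*counts.items())
    return "".join(random.choices(letters, weights=weights, k=n))

def second_order(n):
    """Approximation 3: each letter chosen with the probability that
    it follows the previous letter (digram statistics)."""
    follows = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        follows[a][b] += 1
    out = [random.choice(corpus)]
    for _ in range(n - 1):
        nxt = follows[out[-1]]
        if not nxt:                      # dead end: restart at random
            out.append(random.choice(corpus))
            continue
        letters, weights = zip(*nxt.items())
        out.append(random.choices(letters, weights=weights)[0])
    return "".join(out)

for f in (zeroth_order, first_order, second_order):
    print(f.__name__, "->", f(40))
```

Approximations 5 and 6 work the same way with whole words in the hat instead of letters.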

The redundancies of the languages 1 to 5 have been calculated. The first sample (1) is a random sequence and has zero redundancy. The second, involving letter frequencies only, has a redundancy of 15 per cent. This language could be compressed 15 per cent by the use of suitable coding schemes. The next approximation (3), based on digram structure, gives a redundancy of 29 per cent. Approximation 4, based on trigram structure, gives a redundancy of 36 per cent. These are all the tables that are available on the basis of letter frequencies. Although cryptographers have tabulated frequencies of letters, digrams, and trigrams, so far as I know no one has obtained a complete table of quadrigram frequencies. However, there are tables of word frequencies in English which are quite extensive, and it is possible to calculate from them the amount of redundancy (Approximation 5) due to unequal probability of words. This came out to be 54 per cent, making a few incidental approximations; the tables were not complete and it was necessary to extrapolate them.

In this case the language is treated as though each word were a letter in a more elaborate language, and the redundancy is computed by the same formula. To compare with the other figures, it is reduced to the letter basis by dividing by the average number of letters in a word.
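The arithmetic of that reduction is short enough to write out. The figures below are assumed for illustration (a word entropy of 11.82 bits and an average of 5.5 letters per word, counting the space that follows each word); they are chosen to land near the 54 per cent quoted above, not taken from Shannon's tables.

```python
import math

h_per_word = 11.82        # assumed entropy of the word "alphabet", bits/word
letters_per_word = 5.5    # assumed average word length, space included

# Reduce to the letter basis by dividing by the average word length,
# then compare with the maximum of log2(27) bits per letter.
h_per_letter = h_per_word / letters_per_word
redundancy = 1 - h_per_letter / math.log2(27)
print(f"{h_per_letter:.2f} bits/letter, redundancy = {redundancy:.0%}")
```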

Pitts: Do other languages have the same frequency or the same degree of redundancy?

Shannon: I have not calculated them; but according to the work of Zipf [1], who has calculated the frequency of words in various languages, for a large number of them the falling off of frequency against rank order of the word, plotted on log-log coordinates, is essentially a straight line. The probability of the nth most probable word is essentially a constant over n for quite a large range of n:

p(n) ≈ k/n.

Pitts: But presumably the constants are different for different languages?

Shannon: That I don’t know. They are not vastly different in the examples which Zipf gave in his books, but the difference in constants would make some difference in the calculated redundancy. The equation cannot hold indefinitely. It sums to infinity. If you go out to infinite words, it must tail off. That was one of the approximations involved here which makes the figure somewhat uncertain.
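Shannon's point that the equation "sums to infinity" is the divergence of the harmonic series: Σ k/n grows like k·ln(n) without bound, so the probabilities must be cut off somewhere for the total to reach 1. A quick numerical check, using k = 0.1 as an assumed constant (the value often quoted for English word frequencies):

```python
import math

def harmonic(n):
    """Partial sum of 1/i for i = 1..n; grows like ln(n), without bound."""
    return sum(1.0 / i for i in range(1, n + 1))

# The partial sums keep growing, so p(n) = k/n cannot be a probability
# distribution over infinitely many words.
for n in (10**2, 10**4, 10**6):
    print(f"sum of 1/n up to n={n:>7}: {harmonic(n):.3f}")

# With k = 0.1 (assumed), the cutoff vocabulary N where the probabilities
# sum to 1 satisfies 0.1 * (ln N + 0.5772) ~= 1, i.e. N ~= e^9.42.
k = 0.1
N = math.exp(1 / k - 0.5772)
print(f"approximate cutoff vocabulary: {N:.0f} words")
```

This is the sort of truncation and extrapolation Shannon refers to as making the 54 per cent figure somewhat uncertain.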

Teuber: Your probabilities are based on predicting from one letter to the next, or one word to the next?

McCulloch: No, upon the word.

Shannon: In the particular case (4) this refers to a language in which words are chosen independently of each other, but each has the probability that it has in English. “The” has a probability of .07 in the English language. So we have seven of them in a hat of 100.

Teuber: As you led up to it you said, first of all, you take the probability that any one letter will occur in that particular system, and then, given that letter, and the next one that occurs, you predict the one that follows immediately after that?

Shannon: Yes.

Teuber: Do you go further than that and say that any one letter will follow the one you took beforehand, regardless of what letter was in between?

Shannon: That is true in the calculation from 3. This is based on probabilities of groups of three letters. Number 4 goes on a new tack and starts over with the word as a new unit. **The words are independently chosen.** It is a better approximation to English than 3, since each group of letters forms a word, but the words don’t hang together in sentences. At this point there seemed to be no way to go any further, **because no one had tabulated any frequencies for pairs of words. Of course, such a table would be impractically large because of the enormous number of possible pairs of words. However, the thought occurs that every one of us who speaks a language has implicitly an enormous statistical knowledge of the structure of the language if we could only get it out. That is to say, we know what words follow other words; we know the standard clichés, grammar, and syntax of English.** If it were possible, for example, to translate that information into some form adapted to numerical analysis,
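The impracticality of a word-pair table is a matter of simple squaring: the number of ordered pairs is the square of the vocabulary size. With an assumed round working vocabulary of 10,000 words:

```python
# Assumed round vocabulary size, for illustration only.
vocabulary = 10_000
pairs = vocabulary ** 2   # every ordered pair of words needs an entry
print(f"{vocabulary:,} words -> {pairs:,} possible word pairs")
```

A hundred million entries was far beyond any frequency table that could be compiled by hand.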

[1] Zipf, G. K.: Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Cambridge, Mass.: Addison-Wesley Press, Inc. (1949).
