8 June, 2005 at 00:04

THE REDUNDANCY OF ENGLISH – CLAUDE E. SHANNON
Bell Laboratories,
Murray Hill, N. J.

“Note the things in bold – this could get interesting…”

The chief subject I should like to discuss is a recently developed method of estimating
the amount of redundancy in printed English. Before doing so, I wish to review
briefly what we mean by redundancy. In communication engineering we regard information
perhaps a little differently than some of the rest of you do. In particular, we are
not at all interested in semantics or the meaning implications of information. Information
for the communication engineer is something he transmits from one point to
another as it is given to him, and it may not have any meaning at all. It might, for
example, be a random sequence of digits, or it might be information for a guided missile
or a television signal.
Carrying this idea along, we can idealize a communication system, from our point
of view, as a series of boxes, as in Figure 22, of which I want to talk mainly about the
first two. The first box is the information source. It is the thing which produces the
messages to be transmitted. For communication work we abstract all properties of the
messages except the statistical properties which turn out to be very important. The
communication engineer can visualize his job as the transmission of the particular
messages chosen by the information source to be sent to the receiving point. What the
message means is of no importance to him; the thing that does have importance is the
set of statistics with which it was chosen, the probabilities of various messages. In general,
we are usually interested in messages that consist of a sequence of discrete symbols
or symbols that at least can be reduced to that form by suitable approximation.
The second box is a coding device which translates the message into a form suitable
for transmission to the receiving point, and the third box has the function of decoding
it into its original form. Those two boxes are very important, because it is there that
the communication engineer can make a saving by the choice of an efficient code.
During the last few years a theory has been developed to solve the problem of finding
efficient codes for various types of communication systems.
The redundancy is related to the extent to which it is possible to compress the language.
I think I can explain that simply. A telegraph company uses commercial codes
consisting of a few letters or numbers for common words and phrases. By translating
the message into these codes you get an average compression. The encoded message is
shorter, on the average, than the original. Although this is not the best way to compress,
it is a start in the right direction. The redundancy is the measure of the extent to
which it is possible to compress if the best possible code is used. It is assumed that you
stay in the same alphabet, translating English into a twenty-six-letter alphabet. The
amount that you shorten it, expressed as a percentage, is then the redundancy. If it is
possible, by proper encoding, to reduce the length of English text 40 per cent, English
then is 40 per cent redundant. The redundancy can be calculated in terms of probabilities
associated with the language: the probabilities of the different letters, pairs of letters;
probabilities of words, pairs of words; and so on. The formula for this calculation
is related to the formula of entropy, as no doubt has appeared in these meetings before.
Actually, to perform this calculation is quite a task. I was interested in calculating the
redundancy of printed English. I started in by calculating it from the entropy formulas.
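To make that calculation concrete, here is a minimal sketch (an editorial illustration, not Shannon's own procedure) of the first-order estimate: take the single-letter statistics of a sample text over the twenty-seven-symbol alphabet of letters plus space, compute the entropy, and compare it with log2(27), the entropy of equally likely symbols. The file name is only a placeholder for any long stretch of ordinary English.

import math
from collections import Counter

def first_order_redundancy(text):
    # keep only the twenty-seven symbols Shannon describes: the letters a-z plus the space
    symbols = [c for c in text.lower() if ('a' <= c <= 'z') or c == ' ']
    counts = Counter(symbols)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 1.0 - entropy / math.log2(27)   # redundancy relative to equally likely symbols

sample = open('english_sample.txt').read()   # placeholder: any long plain-English text
print(round(first_order_redundancy(sample), 2))

On a long enough sample this lands near the 15 per cent quoted below for the letter-frequency approximation.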
What is actually done is to obtain the redundancy of artificial languages which are
approximations to English. I pointed out that we represent an information source as a
statistical process. In order to see what is involved in that representation, I constructed
some approximations to English in which the statistics of English are introduced by
easy stages. The following are examples of these approximations:
1. xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd
2. ocro hli rgwr nmielwis eu ll nbnesebya th eei
3. on ie antsoutinys are t inctore st be s deamy achin d ilonasive tucoowe at
teasonare fuso
4. in no ist lat whey cratict froure birs grocid pondenome of demonstures of
the retagin is regiactiona of cre.
5. representing and speedily is an good apt or come can different natural
here he the a in came the to of to expert gray come to furnishes the line
message had be these.
6. the head and in frontal attack on an english writer that the character of
this point is therefore another method for the letters that the time of who
ever told the problem for an unexpected.
In the first approximation we don’t introduce any statistics at all. The letters are chosen
completely at random. The only property of English used lies in the fact that the letters
are the twenty-seven letters of English, counting the space as an additional letter. Of
course this produces merely a meaningless sequence of letters. The next step (2) is to
introduce the probabilities of various letters. This is constructed essentially by putting
all the letters of the alphabet in a hat, with more E’s than Z’s in proportion to their relative
frequency, and then drawing out letters at random. If you introduce probabilities
for pairs of letters, you get something like Approximation 3. It looks a bit more like
English, since the vowel consonant alternation is beginning to appear. In Approximation
3 we begin to have a few words produced from the statistical process. Approximation
4 introduces the statistics of trigrams, that is, triplets of letters, and is again somewhat
closer to English. Approximation 5 is based on choosing words according to their
probabilities in normal English, and Approximation 6 introduces the transition probabilities
between pairs of words. It is evident that Approximation 6 is quite close to normal
English. The text makes sense over rather long stretches. These samples show that
it is perhaps reasonable to represent English text as a time series produced by an
involved stochastic process.
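As a rough illustration of how such an approximation can be produced mechanically (a sketch only; Shannon describes drawing letters from a hat), the digram language of Approximation 3 amounts to drawing each letter according to its conditional frequency given the previous letter. The corpus file name is a placeholder for any plain-English text.

import random
from collections import defaultdict

def digram_model(text):
    # record, for each symbol, every symbol that follows it; repeats carry the frequencies
    followers = defaultdict(list)
    symbols = [c for c in text.lower() if ('a' <= c <= 'z') or c == ' ']
    for a, b in zip(symbols, symbols[1:]):
        followers[a].append(b)
    return followers

def generate(followers, length=80):
    out = [' ']
    for _ in range(length):
        nxt = followers.get(out[-1]) or list(followers)   # if a symbol has no recorded successor, restart anywhere
        out.append(random.choice(nxt))
    return ''.join(out)

corpus = open('english_sample.txt').read()   # placeholder file name
print(generate(digram_model(corpus)))

The same scheme carried over to words rather than letters gives Approximations 5 and 6.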
The redundancies of the languages 1 to 5 have been calculated. The first sample (1)
is a random sequence and has zero redundancy. The second, involving letter frequencies
only, has a redundancy of 15 per cent. This language could be compressed 15 per
cent by the use of suitable coding schemes. The next approximation (3), based on digram
structure, gives a redundancy of 29 per cent. Approximation 4, based on trigram
structure, gives a redundancy of 36 per cent. These are all the tables that are
available on the basis of letter frequencies. Although cryptographers have tabulated
frequencies of letters, digrams, and trigrams, so far as I know no one has obtained a
complete table of quadrigram frequencies. However, there are tables of word frequencies
in English which are quite extensive, and it is possible to calculate from them the
amount of redundancy (Approximation 5) due to unequal probability of words. This
came out to be 54 per cent, making a few incidental approximations; the tables were
not complete and it was necessary to extrapolate them.
In this case the language is treated as though each word were a letter in a more elaborate
language, and the redundancy is computed by the same formula. To compare
with the other figures, it is reduced to the letter basis by dividing by the average number
of letters in a word.
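Stated as a sketch (with the caveat that Shannon worked from published word-frequency tables rather than a raw text): treat each distinct word as a symbol, compute the entropy of the word distribution, and divide by the average word length, counting the space after each word, to get bits per letter and hence a redundancy figure comparable with the letter-based ones.

import math
from collections import Counter

def word_level_redundancy(text):
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    h_word = -sum((n / total) * math.log2(n / total) for n in counts.values())
    avg_letters = sum(len(w) + 1 for w in words) / total   # +1 counts the space after each word
    bits_per_letter = h_word / avg_letters
    return 1.0 - bits_per_letter / math.log2(27)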
Pitts: Do other languages have the same frequency or the same degree of redundancy?
Shannon: I have not calculated them; but according to the work of Zipf,1 who has
calculated the frequency of words in various languages, for a large number of them the
falling off of frequency against rank order of the word, plotted on log-log coordinates,
is essentially a straight line. The probability of the nth most probable word is essentially
a constant over n for quite a large range of n:

p_n ≈ k/n.
Pitts: But presumably the constants are different for different languages?
Shannon: That I don’t know. They are not vastly different in the examples which
Zipf gave in his books, but the difference in constants would make some difference in
the calculated redundancy. The equation cannot hold indefinitely. It sums to
infinity. If you go out to infinite words, it must tail off. That was one of the approximations
involved here which makes the figure somewhat uncertain.
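The point about the sum can be made precise. If the law p_n = k/n held for every rank n, the probabilities could not be normalized, because the harmonic series diverges; truncating it at some largest rank N fixes the constant:

\[
  \sum_{n=1}^{N} \frac{k}{n} = 1
  \qquad\Longrightarrow\qquad
  k = \frac{1}{\sum_{n=1}^{N} 1/n},
\]

and the right-hand side goes to zero as N grows without bound, which is why the law must eventually tail off and why the extrapolation leaves the 54 per cent figure somewhat uncertain.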
Teuber: Your probabilities are based on predicting from one letter to the next, or one
word to the next?
McCulloch: No, upon the word.
Shannon: In the particular case (4) this refers to a language in which words are chosen
independently of each other, but each has the probability that it has in English.
"The" has a probability of .07 in the English language. So we have seven of them in a
hat of 100.
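A toy version of that hat, written as a sketch (only the .07 for "the" comes from the talk; the other weights are made-up illustrative values):

import random

hat = {'the': 7, 'of': 4, 'and': 3, 'to': 3, 'a': 2}   # slips per hundred; toy values except "the"
hat['(any other word)'] = 100 - sum(hat.values())      # lump the rest of the vocabulary together

words, weights = zip(*hat.items())
print(' '.join(random.choices(words, weights=weights, k=20)))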
Teuber: As you led up to it you said, first of all, you take the probability that any one
letter will occur in that particular system, and then, given that letter, and the next one
that occurs, you predict the one that follows immediately after that?
Shannon: Yes.
Teuber: Do you go further than that and say that any one letter will follow the one
you took beforehand, regardless of what letter was in between?
Shannon: That is true in the calculation from 3. This is based on probabilities of
groups of three letters. Number 4 goes on a new tack and starts over with the word as
a new unit. The words are independently chosen. It is a better approximation to English
than 3, since each group of letters forms a word, but the words don't hang together in
sentences. At this point there seemed to be no way to go any further, because no one had
tabulated any frequencies for pairs of words. Of course, such a table would be impractically
large because of the enormous number of possible pairs of words. However, the thought
occurs that every one of us who speaks a language has implicitly an enormous statistical
knowledge of the structure of the language if we could only get it out. That is to say, we
know what words follow other words; we know the standard clichés, grammar, and syntax
of English. If it were possible, for example, to translate that information into some form
adapted to numerical analysis,

1. Zipf, G. K.: Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Cambridge, Mass.: Addison-Wesley Press, 1949.
