

Michael Stubbs
FB2 Anglistik, University of Trier, D-54286 Trier, Germany


One area of linguistics which has developed very rapidly in the last 25 years is phraseology. Corpus study has shown that routine phraseology is pervasive in language use, and also that recurrent word-combinations can be modelled in various ways. This paper discusses two aspects of frequent phraseology in English:

1. the distribution of recurrent phrases in different text-types
2. the function, structure and lexis of some of the most frequent phrases in English.

Some of the data which I will discuss come from a major new interactive data-base which provides extensive quantitative information on recurrent phraseology in the BNC. This data-base has been developed by William Fletcher and is available at http://pie.usna.edu. Quantitative phraseological data have important implications for linguistic theory, because they show how findings from phraseology can be related to independent findings from other areas of linguistics, including recent studies of grammar and of semantic change. However, the very large amount of data itself poses methodological and interpretative puzzles.


First, let me thank the organizers for the invitation to Verona. It is a particular pleasure to give a talk at the 25th anniversary meeting of ICAME. The call for papers says that we should be taking stock of corpus linguistics after 25 years, and I’ll start with a few comments on this topic.

I remember attending a Fortran course around 1970, but the first time I did anything with both computers and language was around 1980, when I did some programming in BASIC for a desktop PET computer. (Hands up those of you who remember these machines … The main way of getting data and programs in and out was on normal audio-cassette tapes.) My corpus analysis got going a little later, when I attended a course run by my colleague Chris Butler at the University of Nottingham, and I’d like to acknowledge here how grateful I am to him for this course. He taught SNOBOL / SPITBOL, using materials for the book he was writing at the time (Butler 1985), plus the Green Bible by Griswold, Poage and Polonsky (1971), plus Lou Burnard’s materials on SNOBOL. When I applied for a job in Germany in 1990, I gave a lecture as part of the selection process, using data produced with concordance software which I had written myself. One big change in the last ten years is that, for many tasks, software is either commercially available, or is made available free or at minimal cost through the enormous generosity of people in the field. I’ll come back to this point shortly.

Many of my current students have never seen anything but Microsoft Windows, and seem to think that Bill Gates invented computers sometime around 1995. But one shouldn’t be too ironic, because it is very easy to underestimate how far back some important ideas go. I gave a talk about phraseology recently in Sweden. In the question session, Sture Allén pointed out to me, very politely, that some ideas which I had been trying to explain had been explained much more clearly by him and his colleagues in the early 1970s. The work was done in connection with the Swedish frequency dictionary (Allén et al 1975), and I don’t read Swedish, but that is no excuse: the introduction is in English, and it discusses very clearly some essential concepts in phraseology, including “collocational frameworks” and the “constructional tendency” of many words. I will also come back to these concepts shortly.

I’m going to talk about phraseology:

  • some ways of studying frequent phraseology across large corpora, including software for identifying recurrent phrases and a new phraseology data-base
  • then I’ll present some illustrative findings from the data-base and discuss some of their implications.

So the presentation will fall into these two sections:

  • first: some introductory points, plus details of the software and the data-base
  • second: a discussion of some findings.



One main discovery of corpus work over the last 25 years is that there is a level of syntagmatic phrasal organization, which had been largely ignored

1. because it did not fit into either lexis or grammar
2. because it involved facts about frequency

– facts which were either out of fashion, or which could be studied only with computational help.

During the 1980s and 1990s, there was a crucial change of perspective on phraseology. The topic had previously been discussed only by a few individuals (such as Harold Palmer, A S Hornby and J R Firth in the UK, and, in the USA, Dwight Bolinger), but had been seen by most linguists only as a collection of oddities.

NOTE. To take just one recent example, Levinson (2000: 23) admits that there are routine formulae, conventional and idiomatic usages, but sees no place for them except within “a great body of language lore […] beyond knowledge of grammar and semantics, extensively studied […] by traditional rhetoric, the ethnography of speaking, and students of translation and second language learning”. This seems to be separate from the “theory of idiomaticity” (p.24) which he wants to develop.

But by the 1980s, it had started to be seen as a pervasive and possibly unified phenomenon. This change is discussed in major reviews by Tony Cowie (1998), Ros Moon (1998), Andy Pawley (2000), Alison Wray (2001) and others. It gained support in approaches to syntax such as construction grammar, as developed by Charles Fillmore and Paul Kay (e.g. Kay & Fillmore 1999), and in related large-scale projects such as FrameNet [?? insert URL].

This change of perspective meant that connections could be seen between phraseology and other areas. In any area of study, findings are significant only if they can be linked to other findings. It is best if findings can be causally linked, or if they can be shown to be the same findings seen from a slightly different point of view.

In order to study phraseology in these ways, and to make these links, we need three things:

1. definitions of “phrase” which are suitable for corpus work
2. software to find these phrases
3. a data-base formed from the application of the software to a large corpus.


There are many possible concepts of “phrase”, and I will use just three related definitions.


First, I will use the term n-gram to mean a recurrent string of uninterrupted word-forms. Software to extract n-grams proceeds through a text or corpus, a given number of words at a time, with a moving window (and stopping at sentence boundaries). It keeps a record of how often each n-gram occurs, and orders them alphabetically or by frequency.
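This moving-window procedure can be sketched in a few lines of Python. This is a minimal illustration of the idea only, not the actual software used for the data-base, and it ignores the tokenization and normalization questions which real software must settle:

```python
from collections import Counter

def extract_ngrams(sentences, n):
    """Count n-grams with a moving window of n words,
    restarting at each sentence boundary."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

# Toy example: each string is one sentence, so no window crosses a boundary.
counts = extract_ngrams(["at the end of the day", "by the end of the week"], 3)
print(counts.most_common(2))
# [('the end of', 2), ('end of the', 2)]
```

Real software must also decide how words are defined (punctuation, case, hyphenation, and so on), which is why such details matter for any published frequency figures.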

There are no standard terms for n-grams, which are called “lexical bundles” in the “Big Longman Grammar” by Doug Biber, Geoff Leech, Stig Johansson et al (1999), but several other terms are also used.

NOTE. Other terms are “clusters” (Scott 1997a), “recurrent word-combinations” (Altenberg 1998), “dyads”, “tryads”, etc (Piotrowski 1984: 93), “statistical phrases” (Strzalkowski 1998: xiv), “chains” (Stubbs & Barth 2003).


We need a second concept of phrase which is more flexible than an n-gram. I will use the term phrase-frame (p-frame) to mean an n-gram with one variable slot. For example, we can study 5-frames, such as plays a * part in, with their variant 5-gram realisations. The following are extracted from a data-base – which I will describe in a moment – constructed from the 100-million-word BNC. The adjectives are listed in descending frequency.

  • plays a * part in <large, significant, big, major, vital, essential, key, central, full, great, prominent, …>

These adjectives are all rough synonyms (in this p-frame), and we can use such data to study the extent to which such frames contain only restricted vocabulary.

NOTE. You might think that phrases such as PLAY a small / minor / minimal / negligible part in also occur. Well, they do, but there are very few in the whole BNC.

Phrase-frames are similar to the two-word “collocational frameworks” identified by Antoinette Renouf and John Sinclair (1991). These are “discontinuous pairings” which enclose characteristic sets of words. For example, they show that the 3-frame a * of has frequent realisations such as

  • a * of <number, lot, couple, series, variety, group, range, set, pair, list, …>

So, collocational frameworks are one special case of phrase-frames.
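One way of deriving p-frames from an n-gram list can be sketched as follows. This is my own illustration of the concept, not the data-base’s implementation: each position in an n-gram is replaced in turn by a variable slot, and the slot-fillers are collected under the resulting frame:

```python
from collections import Counter, defaultdict

def phrase_frames(ngram_counts):
    """Map each frame (an n-gram with one '*' slot) to a frequency
    list of the word-forms which fill the slot."""
    frames = defaultdict(Counter)
    for ngram, freq in ngram_counts.items():
        words = ngram.split()
        for i in range(len(words)):
            frame = " ".join(words[:i] + ["*"] + words[i + 1:])
            frames[frame][words[i]] += freq
    return frames

# Toy counts in the style of the example above (frequencies invented).
counts = Counter({"plays a large part in": 5,
                  "plays a major part in": 3,
                  "plays a vital part in": 2})
frames = phrase_frames(counts)
print(frames["plays a * part in"].most_common())
# [('large', 5), ('major', 3), ('vital', 2)]
```

Listing the fillers in descending frequency, as here, is exactly the form in which the data-base presents the adjectives in plays a * part in.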


Third, I will use the term POS-gram to mean a string of part of speech categories. This makes sense only with reference to a particular set of POS-tags which have been used on a particular corpus. I will be talking about POS-grams extracted from the BNC. For example, one of the most frequent 5-POS-grams in the BNC is

  • = PREP + DET + sing NOUN + of + DET
  • eg at the end of the; by the end of the; as a result of the; in the middle of the; etc
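The same moving-window idea applies to POS-grams, simply counting over tag sequences instead of word-forms. A minimal sketch, with one toy sentence whose BNC-style tags are assumed for illustration:

```python
from collections import Counter

def pos_grams(tagged_sentences, n):
    """Count n-grams over POS tags rather than word-forms."""
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for i in range(len(tags) - n + 1):
            counts[" ".join(tags[i:i + n])] += 1
    return counts

# One toy sentence with BNC-style tags (assumed here for illustration).
sentence = [("at", "PRP"), ("the", "AT0"), ("end", "NN1"),
            ("of", "PRF"), ("the", "AT0"), ("day", "NN1")]
counts = pos_grams([sentence], 5)
print(sorted(counts))
# ['AT0 NN1 PRF AT0 NN1', 'PRP AT0 NN1 PRF AT0']
```

As the definition above requires, the output makes sense only relative to the particular tagset used on the corpus.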


The software I have used in studying n-grams and p-frames was written by Isabel Barth and Bill Fletcher. The term “phrase-frame” is also due to Bill Fletcher.


One thing Bill Fletcher has done is to apply this software to the BNC in order to construct a data-base of frequent phrases. He has designed and implemented a major new resource for studying phraseology, which was announced on the Corpora List in December 2003, and is available at http://pie.usna.edu (“pie” = Phrases in English). Fletcher has taken the BNC, 100 million words of written and spoken data, and constructed from it a very large and powerful interactive data-base, which can be searched in many different ways for quantitative information on recurrent phrases.

He has extracted all n-grams, p-frames and POS-grams of length 1 to 6, down to a cut-off frequency of 3 for n-grams. (1-grams are individual word-forms, so the data-base can also provide customized word frequency lists of various kinds.) Clicking on a p-frame or a POS-gram produces a list of its n-gram variants. Clicking on an n-gram produces up to 50 examples of its use in context in the BNC.

The web-site gives full details of exactly how words are defined, in what ways the data have been normalized, etc.

The data-base also allows different search parameters to be set. For example:

  • the minimum or maximum frequency of phrases to be retrieved
  • whether results are shown with POS-tags (as defined by the BNC coding)
  • searches can include wildcards (* = any word, ? = any single character)
  • various filters can be set, in order to specify individual word-forms or POS-tags or both (for the same position in the phrase), or a mixture of both for the whole phrase.

The number of possible combinations here is clearly astronomically high, but some simple examples give a rough idea of the range available. It is possible to search for patterns such as the following, in order to generate tailor-made frequency lists of many kinds, for example:

  • The most frequent realizations of the verb lemma KNOW: kn?w* + VERB.
    *kn?w* would in addition give unknown, knowledge, etc.
  • The most frequent exponents of any POS-category, e.g. the most frequent lexical verbs, nouns, prepositions, or whatever (a 1-gram is an unlemmatized word-form).
    So, the data-base can be used to generate customized word-frequency lists.
  • 3-grams consisting of: adjective + noun + noun.
    This finds a lot of well known phrases, e.g. high interest rates, high blood pressure, including many names of institutions: National Health Service, Local Education Authorities, Royal Air Force.
  • 4-frames which do or don’t begin with a preposition.
    That is, searches can include or exclude patterns.
  • 5-grams consisting of: PLAY (lemma) + a/an + adjective + part in
  • Searches can also be defined for fuzzy matches, e.g.
    ~at the ~ top of: right at the very top of, at the very top of, right at the top of, or at the top of
  • most frequent POS-grams of length 5 … as I illustrated above.
  • etc, etc.
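One way to picture how such wildcard and fuzzy queries work is as translation into regular expressions. The following sketch reflects my reading of the query syntax, not the site’s actual implementation: a “~” makes one extra word optional at that point, “*” matches any run of characters, and “?” matches a single character:

```python
import re

def pie_pattern_to_regex(pattern):
    """Illustrative translation of a PIE-style query into a regex
    (an assumption about the syntax, not the site's own code)."""
    parts = []
    for token in pattern.split():
        if token == "~":
            parts.append(r"(?:\S+ )?")   # optional extra word here
            continue
        optional = token.startswith("~")
        if optional:
            token = token[1:]
        word = re.escape(token).replace(r"\*", r"\S*").replace(r"\?", r"\S")
        parts.append((r"(?:\S+ )?" if optional else "") + word + " ")
    return re.compile("^" + "".join(parts).rstrip() + "$")

rx = pie_pattern_to_regex("~at the ~ top of")
examples = ["right at the very top of", "at the very top of",
            "right at the top of", "at the top of"]
print([bool(rx.match(e)) for e in examples])
# [True, True, True, True]
```

The same function handles the earlier example: pie_pattern_to_regex("kn?w*") matches know, knows, knew, and so on.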

Future developments which are planned by Fletcher include:

  • providing searches by regular expressions (only wildcards are currently supported)
  • providing data on both frequency and range (defined by a dispersion measure which counts how many text sectors of arbitrary length a phrase occurs in, as in Leech et al’s 2001 BNC word-frequency lists)
  • providing data on frequency in different text-types (such as spoken – written, fiction – non-fiction, academic – non-academic, etc, via David Lee’s 2002 revised BNC categories)
  • including other corpora in the data-base (such as the ANC and MICASE: Michigan Corpus of Academic Spoken English)

In summary: The data-base is a massive virtual machine for re-arranging data, in order that we can see previously invisible patterns of phraseology. It is a very rich resource, which can be used for many kinds of study, and it will take a long time before we can properly appreciate the full range of generalizations about phraseology which we can investigate.

A severe interpretative problem now arises from the very large amounts of data involved. For example, the numbers of n-grams and p-frames which occur 5 times or more, and 100 times or more (i.e. at least once per million words: which is not very frequent), in the BNC are as follows.

5 times or more:

  3-grams: over 1.5 million    3-frames: over 120,000
  4-grams: over 700,000        4-frames: over 51,000
  5-grams: over 225,000        5-frames: over 7,800

100 times or more:

  3-grams: over 40,500         3-frames: over 6,400
  4-grams: over 8,000          4-frames: over 1,600
  5-grams: over 1,100          5-frames: over 100

This provides a severe problem for description. It is difficult to know what level of delicacy is appropriate in making generalisations across so much data. The only realistic strategy is to start small: to use a restricted sample to generate hypotheses which can be tested on larger samples. We just have to make some simplifying assumptions in order to get started.


The next question is: what can we do with this data-base? what kinds of questions can we answer? I’ll mention a few immediately, then discuss these more systematically in the second half of the presentation.

(1) One question concerns the status of the n-grams which are identified by the software. For example, here are some recurrent 5-grams:

  • in and out of the; they looked at each other; for a moment or two; the corner of his eye; you see what I mean; it seemed to him / me that

Some are not complete grammatical units (e.g. in and out of the). Some are grammatical units, but are not necessarily pre-constructed (they looked at each other). Some, perhaps a surprisingly large number, are grammatical constituents which express a common meaning in a habitual and idiomatic way (e.g. for a moment or two, the corner of his eye, you see what I mean). And others are exponents of more abstract variable phrases, i.e. p-frames (e.g. it seemed to him / me that). … So, a general question is: what kind of units are we talking about?

(2) Second, the software and the data-base make it easy to discover the most frequent n-grams in English, but it is difficult to explain why they are frequent. In written corpora, around 30 per cent of the top hundred 5-grams are the beginnings of prepositional phrases. For example

  • at the end of the; in the middle of the; in the case of the; at the beginning of the; by the end of the; on the part of the; at the top of the; at the time of the; on the basis of the

Here, the question is: why are they frequent? I will argue in the second half of the presentation that the explanation is predominantly linguistic: these phrases have predominantly textual functions.

Some lexical characteristics of these frequent 5-grams are also rather obvious: they contain high frequency nouns from the core vocabulary, especially place, time and logical terms (and, in spoken data, a few high frequency verbs). But, again, these facts would have to be related to other facts before they could explain anything.

(3) It is also clear that many of these most frequent prepositional phrases cannot be interpreted compositionally. The meaning of the nouns in these cases is not entirely transparent:

  • on the eve of the; in the face of the; in the heat of the; at the height of the; in the lap of the; on the spur of the; at the turn of the; in the wake of the

Some are also parts of longer fixed phrases: in the lap of the gods, on the spur of the moment.

(4) And finally, as Della Summers (1996: 262-63) and John Sinclair (1999: 162) have pointed out, many words are frequent because they occur in frequent phrases. This means that there is something slightly wrong, logically, with the concept of “word frequency”.

NOTE. This applies both to units which are tagged as “multi-words” in the BNC, but also to highly frequent n-grams which are not multi-words. For example, (in the BNC) over one in five instances of the word middle occurs in the phrase in the middle of. This attraction can be measured as the “constructional tendency” of words (Allén et al 1975).

So, in this first half of the presentation, I have discussed:

  • three definitions of “phrase” which are very simple, but useful for corpus work
  • software to identify phrases and their frequencies across texts and corpora
  • a few questions about phraseology which can be studied with these data
  • a very powerful interactive data-base which can help in this study.

In the second half of my presentation, I’ll now take a few examples from the data-base in a little more detail.


The general question here is: what can we do with the software and the data-base? what kinds of questions can we answer?


One application of n-gram software is almost purely descriptive. It can show that different phrases occur with different frequencies in different text-types. The “Big Longman Grammar” (Biber et al 1999) compares n-grams in the broad text-types “conversation” and “academic prose”. (They call n-grams “lexical bundles”.)

In a study which I carried out with Isabel Barth, we also showed that the frequencies of different n-grams are very different in different text-types. The results and reference lists of n-grams are published in the resulting article (Stubbs & Barth 2003).

NOTE. Other studies are by Milton (1998) and Aarts & Granger (1998).

The main idea here is very simple. For example, the frequency of pronouns distinguishes text-types, e.g. fiction and academic articles. But if we look at the n-grams in which pronouns occur, then the differences between the text-types are much more striking. It is intuitively obvious which of the following 4-grams come from FICTION and which from ACADEMIC PROSE:

  • 1. I don’t want to; I want you to; I don’t know what
  • 2. I have already mentioned; I shall show that; I will conclude with

NOTE. Rank orders and frequencies are also very different: set 1 are from the top 25 4-grams in FICTION and all occur almost 30 times per million; the top 50 4-grams in ACADEMIC PROSE contain no pronouns at all, and set 2 are from much further down the list from ACADEMIC PROSE and occur 6 times or fewer per million.
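Per-million figures of the kind quoted in the note are simple normalizations: the raw count divided by the sample size, scaled to a million words, so that counts are comparable across samples of different sizes. For instance, with made-up counts, purely for illustration:

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw frequency to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size

# Invented example: a 4-gram occurring 1,480 times
# in a hypothetical 50-million-word fiction sample.
print(per_million(1480, 50_000_000))  # 29.6 occurrences per million
```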

Similarly, the kind of prepositional phrases which I have started to illustrate are much more frequent in written academic texts than in spoken language.


However, recurrent and frequent phrases have other implications, and I will illustrate this from a small set of facts which are both unexpected and inevitable (Hardy 1940/1967: 113): unexpected in that native speakers cannot produce the facts from introspection, but inevitable once you realise why a particular search method finds these phrases.

One problem, as I have mentioned, is simply the very large amount of quantitative data which the data-base makes available. We just have to select a sub-set of data and start somewhere. One simple starting place is the most frequent phrases in the whole BNC. These are parts of nominal and prepositional phrases, which express spatial, chronological and logical relations.

The top 5-POS-grams in the whole BNC (down to a cut-off of 3 for n-grams) are:

  • PRP AT0 NN1 PRF AT0 98,541 (e.g. at the end of the; by the end of the)
  • AT0 NN1 PRF AT0 NN1 57,427 (e.g. the end of the year; the end of the day)

The next couple in descending frequency are related phrases with adjective + noun:

  • AT0 AJ0 NN1 PRF AT0 26,493 (e.g. the other side of the; the other end of the)
  • PRP AT0 AJ0 NN1 PRF 19,779 (e.g. on the other side of; at the other end of)

The following all occur in the top twelve 5-frames in the BNC:

rank order:

  1.  in the * of the
  2.  at the * of the
  3.  to the * of the
  6.  on the * of the
  9.  for the * of the
  10. by the * of the
  12. in the * of a

The other 5-frames in the top 25 are almost all variants of these phrases:

  • of the * of the; * the end of the; the end of the *; and the * of the; at * end of the; etc

Now, the BNC over-represents written data (90 million words written data and only 10 million spoken), and, as I have pointed out, different text-types have significantly different phraseology. (In spoken data 5-frames with high frequency verbs are frequent.) Nevertheless, these prepositional phrases are at the top in both written and spoken samples. So, this is a good pattern to start with.

To make sampling even simpler – and to make sure that we have a well defined and replicable sample – we can start with just the top 5-grams in the whole BNC, which all occur 100 times or more (i.e. at least once per million running words) and which all have the structure

  • = adapted BNC coding: PRP AT0 NN? PRF AT0
  • the BNC tags do not support wild-cards: NN? = NN0 | NN1 | NN2

These are listed in the appendix. There are just over 150 types and around 43,625 tokens.

NOTE. So, one of these 150 items occurs around once every 2,300 running words on average. There are, in fact, many more items with the same structure, since there are many other realizations which occur fewer than 100 times each. In addition, corresponding 4-grams (at the end of) occur much more frequently.
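The average quoted in the note can be checked directly:

```python
tokens = 43_625        # tokens of these top 5-gram types (figure from the text)
corpus = 100_000_000   # running words in the BNC
print(round(corpus / tokens))  # 2292, i.e. about once every 2,300 words
```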

This is clearly only a tiny tip of the iceberg of English phraseology, but these phrases are amongst the most frequent, as selected by very simple criteria. So, we can look at these phrases, and use them to make hypotheses which can be tested on larger data-sets, and see whether generalizations here can be convincingly related to other facts.


The nouns are selected from a quite restricted set. By far the most frequent noun is end: in over 10 per cent of the types (16 out of the 151) and over 20 per cent of the tokens (9,217 out of the 43,622). As in all other areas of language use, the list shows a very uneven distribution. The prepositional phrases occur in a cluster at the top of frequency lists of phrase-frames. And these 150 top 5-grams follow a Zipf-type rank-frequency curve.

NOTE. The top two 5-grams (at / by the end of the) constitute 13 per cent of all the tokens (in this top set of 151). The top ten (with the nouns end, result, middle, time, top, beginning, case, bottom, form) constitute over 30 per cent of all the tokens (in this top set of 151).
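A Zipf-type curve means that frequency falls off roughly in proportion to 1/rank, so that log frequency plotted against log rank is approximately a straight line with slope near -1. A quick way to check this on any frequency list, here with idealized figures rather than the actual BNC counts:

```python
import math

def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank);
    a value near -1 indicates a Zipf-type rank-frequency curve."""
    points = [(math.log(r), math.log(f)) for r, f in enumerate(freqs, start=1)]
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

# Idealized frequencies falling off as 6000/rank (illustration only).
freqs = [6000 / r for r in range(1, 51)]
print(round(zipf_slope(freqs), 2))  # -1.0
```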


Some semantic generalizations are as follows:

(1) Wholes and parts, space and time. First, the list consists overwhelmingly of (the beginnings of) expressions which denote wholes and parts of things, especially with reference to the centre or the periphery of places and periods of time:

  • for the duration of the; for the rest of the; in this part of the; for the remainder of the; for the whole of the; since the beginning of the; at the edge of the; to the side of the; etc

(2) Logic and cause. A second set express logical or causal connections:

  • in the case of the; in the event of a; as a result of the; on the basis of the; as a consequence of the; with the exception of the; etc

(3) Intention and influence. A third set – perhaps not so well defined – and rather further down the frequency list – express intentions and/or relations of power or influence, especially between people:

  • for the benefit of the; under the auspices of; with the aid / help of the; in the interests of the; to the needs of the; for the purposes of the; at the request of the; for the sake of the; for the use of the; under the control of; at the expense of; at / in the hands of the; under the terms of the; etc

Now, a corpus can tell us which phrases are frequent. But an explanation of why they are frequent can come only from texts. It is not surprising that expressions for place, time, cause and intention are amongst the most frequent in the language, because these are precisely the relations which we need in order to reconstruct plausible sequences of events, and therefore to make sense of connected discourse.

There is another striking feature of the 5-grams: many are not semantically transparent.

(1) Some are, of course, because some prepositional phrases are simply literal place expressions:

  • in the corner of the <room, field, …>
  • in the direction of the <river, town, …>
  • at the top of the <stairs, hill, …>
  • also: centre, edge, floor, middle, north, rear, surface

(2) Many of these place expressions are metaphorical extensions from body terms. This is well known from work on diachronic semantic shifts: see below.

  • at the back of the <house, book, …>
  • in the heart of the <city, forest, …>
  • by the side of the <road, bed, …>
  • also: bottom, face, foot, hands, head

(3) But it is perhaps more surprising that only a minority are literal place expressions. For example, the expressions in and at the heart of the are used quite differently: at is used only for abstract cases:

  • at the heart of the <matter, problem, …>

(4) Other nouns are also delexicalized. The etymology may be transparent, but no literal interpretation is possible, and the meaning of the resultant n-grams cannot be derived purely compositionally:

  • on the eve of the <battle, election, …>
  • in the eyes of the <law, public, …>
  • in the wake of the <riots, scandal, …>
  • also: case, course, …

An attested example such as the following shows just how delexicalized such nouns can be. The writer was apparently not aware of any logical contradiction between the nouns.

  • at the height of the depression

(5) Finally, several of the expressions have pragmatic connotations (semantic prosodies). Some are quite obvious. For example, the phrase at the hands of the has a conventionally negative evaluative meaning, which is clear in examples such as:

  • suffered humiliation at the hands of the Puritans
  • experienced persecution at the hands of the regime

But others are less obvious. For example, the phrase in the middle of often occurs when the speaker is complaining about something (usually someone else’s behaviour) which is unexpected and/or inappropriate, and which has happened where it normally doesn’t and/or shouldn’t. One hint of this is that in the whole BNC the 6-gram in the middle of the night is much more frequent than the next 6-gram, in the middle of the room. The following are some illustrative examples:

  • he gets called out right in the middle of the night
  • they just left it in the middle of the road
  • I’ll give you a ring back … we’re in the middle of eating
  • they live in a ghastly little bungalow in the middle of nowhere


Now we can also make connections to other work. First, these observations corroborate generalizations in other studies about the functions of recurrent phrases.

For English data (spoken, from the London-Lund corpus), Bengt Altenberg (1998) makes several generalizations about the “pragmatic specialization” of recurrent word-combinations. He identifies frequent syntactic constructions (including nominal and prepositional groups), and shows that many routine expressions have the “conventionalized discourse function” of presenting information in textual frames. For English and Spanish data (spoken and written), Chris Butler (1998a, b) makes similar observations. He also notes that many frequent multi-word units are nominal or prepositional phrases, that rather few of these phrases encode representational meanings, except for expressions of time and place, and that many frequent sequences express speaker-orientation and information management.


We also have to remember that different observational methods lead to different findings: if we use a microscope we will discover small things, but if we use a telescope we will discover distant things, and if we use x-rays we will discover what is inside things.

The software picks out phrases which are both frequent and widely distributed, and which are therefore, by definition, not tied to the topic of individual texts. They are used by speakers, irrespective of what they are talking about, in order to organize their discourse, and therefore contain many markers of point of view, topicalization, and the like. So, as well as seeing these generalizations as an empirical finding which is induced from the data, we can also look at things the other way round. The criteria of high frequency and wide range automatically capture predominantly non-representational phraseology.

I am not saying that the findings are a mere artefact of the method. In retrieval tasks, success is measured in terms of precision and recall. If we are searching for such phraseology, then this method has fairly high precision (much of what is found is relevant). Speakers constantly refer to times and places, in routine ways, in order to organize both narrative and non-narrative texts. These prepositional phrases are a recurrent way of organizing information. We do not know how high the recall is (whether the method finds most of what is relevant). But then, recall is always much more difficult to check, since we cannot observe what is not found.
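Precision and recall can be stated concretely. With made-up figures, purely to fix the definitions:

```python
def precision_recall(retrieved, relevant):
    """Precision = fraction of retrieved items that are relevant;
    recall = fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    return len(hits) / len(retrieved), len(hits) / len(relevant)

# Hypothetical search: 10 phrases retrieved, 8 of them genuinely
# text-organizing, out of 20 such phrases in the corpus (invented numbers).
retrieved = [f"phrase{i}" for i in range(10)]
relevant = [f"phrase{i}" for i in range(2, 22)]
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.8 0.4
```

As the text notes, precision can be inspected directly from what the search returns, whereas recall requires knowing the full relevant set, which is exactly what we cannot observe.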


Summarizing some of these points … What I have described is a grammatical construction which has lexical-semantic characteristics and specialized pragmatic functions. The construction has a well-defined syntax. It has prototypical (= high frequency) exemplars. It contains vocabulary from restricted lexical classes. It has pragmatic functions, primarily in managing information and structuring text. It is an idiomatic form-meaning complex, in the sense of Construction Grammar, although the construction is rather less specific than those which have been discussed in the literature (Goldberg 1995, Michaelis & Lambrecht 1996, Kay & Fillmore 1999, Croft 2001, et al). The syntactic structure itself has a frequent (pragmatic) function.


Finally, I will point out briefly that these phraseological data bear a striking similarity to data on one type of language change. The vocabulary of these recurrent 5-grams is remarkably similar to vocabulary which has long been identified as particularly prone to certain kinds of semantic shift. The prepositional phrases which I have discussed are very frequent and they frequently contain nouns from certain semantic fields. These two factors make these phrases a plausible context for semantic change, and the nouns are indeed frequently delexicalized.

There is extensive evidence from many languages that words in certain lexical classes undergo predictable semantic shifts. For example, words for body parts often become place terms (e.g. back, side), and these words often shift further, to become place adverbials and discourse markers (e.g. beside, besides). These shifts are examples of well documented uni-directional diachronic processes which affect predictable classes of words in the basic vocabulary, and which involve shifts from concrete to increasingly abstract expressions with a weakening of semantic meaning, and a corresponding strengthening of pragmatic meaning (e.g. speaker attitude). Typical developmental tracks have been identified and labelled in slightly different but related ways (e.g. by Traugott & Heine 1991, Hopper & Traugott 1993).

  • concrete / literal > abstract / metaphorical
  • body part > locational term > temporal term > discourse term
  • locational > temporal > logical > illocutionary
  • propositional / extralinguistic > expressive / attitudinal

To take just one example: The word face is used both as a body term, and also as a place term (the north face of the Eiger). But the phrase in the face of the is usually followed by a word denoting a problem, and is almost always used entirely abstractly. Similarly, on the face of it has the pragmatic function of introducing a potentially disputed interpretation.
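A rough way to check such claims is to look at what follows the phrase in concordance lines. The sketch below is a minimal illustration of this kind of check; the example sentences are invented, not real BNC data:

```python
import re
from collections import Counter

# Toy sentences standing in for concordance lines; a real study would
# draw these from the BNC or another corpus.
lines = [
    "They persevered in the face of the opposition to the scheme.",
    "The plan collapsed in the face of the crisis.",
    "She remained calm in the face of the danger.",
    "Production continued in the face of the difficulties.",
]

# Collect the word immediately following the phrase.
pattern = re.compile(r"\bin the face of the (\w+)")
following = Counter(m.group(1) for line in lines for m in pattern.finditer(line))

print(following.most_common())
```

Even on real data, a tally of this kind quickly shows whether the following nouns cluster in a semantic field such as "problem" words.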

It is always a hint that we are on the right track if it can be shown that two – apparently distinct – areas are different ways of looking at the same thing. Quantitative phraseological data can now provide many examples to support the hypothesis (proposed by Paul Hopper and Elizabeth Traugott 1993) that predictable kinds of semantic shifts take place in local grammatical constructions. This is true especially of

  • the semantic weakening of nouns in frequent prepositional constructions
  • the corresponding strengthening of pragmatic meanings (to form text-structuring expressions and/or conventional evaluative connotations).


Here are a few concluding comments. Some of the following points are well known, but the importance of discoveries about phraseology depends on their consequences for the whole field of language study, so it is useful to try and state explicitly their theoretical implications. Central contributions of corpus work have been to discover large numbers of new facts, to discover regularities (where people had previously seen only irregularities), and to discover relations between things (which had previously seemed independent).

The broadest significance of such findings is for a theory of idiomatic language. Andrew Pawley and Francis Syder (1983) pointed out – twenty years ago – that “native speakers do not use the creative potential of syntactic rules to anything like their full extent”, but instead have standard ways of talking about culturally recognized concepts (pp.191-93). This observation can now be studied in empirical quantitative detail.

Then there are more specific implications for how we model the vocabulary, grammatical units and textual structure. The phrases which I have identified consist of predictable classes of words and are used in predictable ways. They are similar to units which have been discovered independently in other areas (such as construction grammar). They have predictable functions (such as managing information). Their recurrent vocabulary has been independently identified by diachronic linguists as responsible for semantic shifts.

If we are thinking of words … Many words are frequent because of their strong constructional tendency: they occur in frequent phrases. Therefore the concept of “word frequency” needs some reinterpretation.
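The point can be made concrete with a toy computation: a word's raw frequency can be decomposed into tokens which occur inside a given frequent phrase and tokens which occur elsewhere. The mini-corpus below is invented for illustration:

```python
# Minimal sketch: split a word's raw frequency into phrase-bound and
# free occurrences. The text and the phrase are illustrative toy data.
text = ("as a result of the delay the result was announced late "
        "and as a result of the vote the final result stood").split()

word = "result"
phrase = ["as", "a", "result", "of", "the"]

total = text.count(word)                      # raw frequency of the word
in_phrase = sum(                              # tokens inside the 5-gram
    1 for i in range(len(text) - len(phrase) + 1)
    if text[i:i + len(phrase)] == phrase
)

print(total, in_phrase)
```

Here half of the tokens of the word sit inside the recurrent 5-gram, which is the sense in which "word frequency" is partly a by-product of phrase frequency.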

If we are thinking of phrases … Many phrases are very frequent: they are conventional ways of expressing common meanings. Studies of frequent phrases alter our understanding of what speakers have to infer versus what is conventional and just has to be known.

If we are thinking of texts … Very frequent phrases – which express place, time, cause and intention – express relations which are essential for understanding connected discourse.

If we are thinking of the vocabulary … For a long time, many linguists accepted Bloomfield’s (1933: 274) description of the lexicon as “a list of basic irregularities”. (The view that phraseology was just a list of oddities about fixed phrases and idioms fitted into this more general scepticism about regularities in the lexicon.) However, it has become clear that it is possible to make many generalizations about vocabulary.

Finally, if we are thinking of language change … Quantitative phraseological data can explain why words in the core vocabulary, which occur frequently in well-defined grammatical constructions, undergo predictable semantic shifts.

Both corpus linguists and historical linguists see an inherent relation between frequent use and structure, and argue that rather basic facts about functional load have often been ignored in linguistic description.

In some ways, I have done little more than present a type of data which has not been previously available (though some quantitative data and analysis go back further than we might think, to work in the early 1970s). But maybe a characteristic of the text-type “plenary lecture” is that I am allowed to use certain speech acts:

  • to report on work in progress
  • to present some new observational data
  • to speculate on some explanations for the data
  • to suggest some relations between different areas of linguistics
  • and to thank colleagues whose generosity makes available valuable new resources.


This paper reports work which was done in collaboration with Isabel Barth (Stubbs & Barth 2003) and Katrin Ungeheuer (Stubbs & Ungeheuer in prep). I am especially grateful to Bill Fletcher, for much discussion, for the use of a beta-version of his n-gram and p-frame software, and for access to early versions of his BNC data-base (Fletcher 2003/04). For comments on previous drafts I am also grateful to Chris Butler, Joanna Channell, Naomi Hallan and members of the BAAL Special Interest Group on corpus linguistics, to whom an earlier version of the talk was given at Birmingham University on 16 April 2004.



References

Website:

Fletcher, W. (2003/04) Exploring Words and Phrases from the British National Corpus. Website at http://pie.usna.edu.

Books and articles:

Aarts, J. and Granger, S. (1998) Tag sequences in learner corpora. In S. Granger ed Learner English on Computer. London: Longman. 132-41.
Allén, S. et al (1975) Nusvensk frekvensordbok. Stockholm: Almqvist & Wiksell.
Altenberg, B. (1998) On the phraseology of spoken English: the evidence of recurrent word combinations. In A. P. Cowie ed Phraseology: Theory, Analysis and Applications. Oxford: Oxford University Press. 101-122.
Benson, M., Benson, E. & Ilson, R. (1986) The BBI Dictionary of English Word Combinations. Revised ed. Amsterdam: Benjamins.
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999) Longman Grammar of Spoken and Written English. London: Longman.
Bloomfield, L. (1933) Language. London: Allen & Unwin.
Butler, C. (1985) Computers in Linguistics. Oxford: Blackwell.
Butler, C. (1998a) Collocational frameworks in Spanish. International Journal of Corpus Linguistics, 3, 1: 1-32.
Butler, C. (1998b) Multi-word lexical phenomena in functional grammar. Revista Canaria de Estudios Ingleses, 36: 13-36.
Cortes, V. (2002) Lexical bundles in freshman composition. In R. Reppen, S. M. Fitzmaurice & D. Biber eds Using Corpora to Explore Linguistic Variation. Amsterdam: Benjamins. 131-45.
Coxhead, A. (2000) A new academic word list. TESOL Quarterly, 34, 2: 213-238.
Cowie, A. P. ed (1998) Phraseology. Oxford: Oxford University Press.
Croft, W. (2001) Radical Construction Grammar. Oxford: Oxford University Press.
Fillmore, C., Kay, P. and O’Connor, M. C. (1988) Regularity and idiomaticity in grammatical constructions. Language, 64: 501-38.
Flowerdew, J. (1993) Concordancing as a tool in course design. System, 21, 2: 231-44. Also in M. Ghadessy, A. Henry & R. L. Roseberry eds Small Corpus Studies and ELT. Amsterdam: Benjamins, 2000. 71-92.
Goldberg, A. (1995) Constructions. Chicago: University of Chicago Press.
Griswold, R. E., Poage, J. F. & Polonsky, I. P. (1971) The SNOBOL4 Programming Language. 2nd ed. Englewood Cliffs, NJ: Prentice Hall.
Hallan, N. (2001) Paths to prepositions: a corpus-based study of the acquisition of a lexico-grammatical category. In J. Bybee & P. Hopper eds Frequency and the Emergence of Linguistic Structure. Amsterdam: Benjamins.
Hardy, G. H. (1940/1967) A Mathematician’s Apology. Cambridge: Cambridge University Press.
Hopper, P. and Traugott, E. (1993) Grammaticalization. Cambridge: Cambridge University Press.
Kay, P. & Fillmore, C. (1999) Grammatical constructions and linguistic generalizations: the What’s X doing Y? construction. Language, 75, 1: 1-33.
Kennedy, G. (1992) Preferred ways of putting things with implications for language teaching. In J. Svartvik ed Directions in Corpus Linguistics. Berlin: De Gruyter.
Kennedy, G. (1998) An Introduction to Corpus Linguistics. London: Longman.
Lakoff, G. & Johnson, M. (1980) Metaphors We Live By. Chicago: Chicago University Press.
Lee, D. Y. W. (2002) Genres, registers, text types, domains and styles: clarifying the concepts and navigating a path through the BNC jungle. In B. Kettermann & G. Marks eds Teaching and Learning by Doing Corpus Analysis. Amsterdam: Rodopi. 247-92.
Leech, G., Rayson, P. & Wilson, A. (2001) Word Frequencies in Written and Spoken English: Based on the British National Corpus. London: Longman.
Levinson, S. C. (2000) Presumptive Meanings. Cambridge, MA: MIT Press.
Louw, B. (1993) Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies. In M. Baker, G. Francis & E. Tognini-Bonelli eds Text and Technology. Amsterdam: Benjamins. 157-76.
Macfarlane, A. & Martin, G. (2002) The Glass Bathyscaphe. London: Profile.
McGuire, W. J. (1999) Constructing Social Psychology. Cambridge: Cambridge University Press.
Michaelis, L. & Lambrecht, K. (1996) Towards a construction-based theory of language functions. Language, 72: 215-47.
Milton, J. (1998) Exploiting L1 and interlanguage corpora in the design of an electronic language learning and production environment. In S. Granger ed Learner English on Computer. London: Longman. 186-98.
Moon, R. (1998) Fixed Expressions and Idioms in English: A Corpus-Based Approach. Oxford: Clarendon.
Nation, P. (2001) Using small corpora to investigate learner needs. In M. Ghadessy, A. Henry & R. L. Roseberry eds Small Corpus Studies and ELT. Amsterdam: Benjamins. 31-45.
Pawley, A. (2003, in press??) Where have all the verbs gone? In F. Ameka et al eds Issues in Grammar-Writing. Publisher??
Pawley, A. (2000) Developments in the study of formulaic language 1970-2000. Paper read to AAAL Conference, Vancouver, 11-14 March 2000. [?? revised version in I J Lexicography?]
Pawley, A. and Syder, F. H. (1983) Two puzzles for linguistic theory. In J. C. Richards & R. W. Schmidt eds Language and Communication. London: Longman. 191-226.
Piotrowski, R. G. (1984) Text, Computer, Mensch. Bochum: Brockmeyer.
Renouf, A. & Sinclair, J. (1991) Collocational frameworks in English. In K. Aijmer & B. Altenberg eds English Corpus Linguistics. London: Longman. 128-43.
Scott, M. (1997a) WordSmith Tools Manual. Oxford: Oxford University Press.
Scott, M. (1997b) PC analysis of keywords, and key key words. System, 25, 2: 233-45.
Sinclair, J. (1999) A way with common words. In H. Hasselgard & S. Oksefjell eds Out of Corpora. Amsterdam: Rodopi. 157-79.
Strzalkowski, T. ed (1998) Natural Language Information Retrieval. Dordrecht: Kluwer.
Stubbs, M. (2001) On inference theories and code theories: corpus evidence for semantic schemas. Text, 21, 3: 437-65.
Stubbs, M. (2002) Two quantitative methods of studying phraseology in English. International Journal of Corpus Linguistics, 7, 2: 215-44.
Stubbs, M. & Barth, I. (2003) Using recurrent phrases as text-type discriminators: a quantitative method and some findings. Functions of Language, 10, 1: 61-104.
Stubbs, M. & Ungeheuer, K. (in prep) Quantitative data on highly frequent phrases in English. Working paper, University of Trier.
Summers, D. (1996) Computer lexicography: the importance of representativeness in relation to frequency. In J. Thomas & M. Short eds Using Corpora for Language Research. London: Longman. 260-66.
Traugott, E. & Heine, B. eds (1991) Approaches to Grammaticalization. 2 volumes. Amsterdam: Benjamins.
Warren, B. (2001) Accounting for compositionality. In K. Aijmer ed A Wealth of English. Göteborg: Acta Universitatis Gothenburgensis. 103-14.
West, M. (1953) A General Service List of English Words. London: Longman.
Wray, A. (2002) Formulaic Language and the Lexicon. Cambridge: Cambridge University Press.
Zipf, G. K. (1945) The meaning-frequency relationship of words. Journal of General Psychology, 33: 251-56.


All 5-grams matching the tag sequence PRP AT0 NN? PRF AT0 (in the BNC C5 tagset: preposition + article + noun + of + article) which occur 100 times or more in the BNC, ordered by frequency.

4031 at the end of the
1707 by the end of the
1485 as a result of the
1354 in the middle of the
1079 at the time of the
950 at the top of the
864 at the beginning of the
856 in the case of the
728 in the form of a(n)
684 in the case of a(n)
632 at the bottom of the
584 at the back of the
573 on the edge of the
572 for the rest of the
566 on the basis of the
548 in the context of the
541 in the centre of the
527 at the start of the
496 at the end of a
485 towards the end of the
419 at the foot of the
410 in the course of the
403 to the top of the
387 in the hands of the
372 to the end of the
366 on both sides of the
358 in the direction of the
356 in the middle of a
353 in the wake of the
348 at the heart of the
339 at the end of this
325 before the end of the
313 on either side of the
307 at the expense of the
304 until the end of the
299 with the rest of the
297 as a result of a
294 at the centre of the
278 for the benefit of the
278 on the side of the
275 at the head of the
271 during the course of the
262 from the rest of the
253 at the turn of the
249 to the rest of the
245 at the edge of the
240 on the back of the
236 in the history of the
235 in the event of a
234 for the purposes of the
234 under the terms of the
233 at the end of each
231 as a member of the
229 at the side of the
228 on the basis of a
226 in the context of a
224 in the interests of the
222 as a result of this
218 in the face of the
210 on the floor of the
209 in the course of a
207 in the aftermath of the
206 in the back of the
203 to the back of the
202 at the front of the
200 in the heart of the
199 on the eve of the
199 under the auspices of the
198 in the form of the
197 to the edge of the
195 at the base of the
191 at the beginning of this
190 in the rest of the
186 in the absence of a
183 with the exception of the
180 in the name of the
178 in the absence of any
174 to the bottom of the
169 at the height of the
169 in the development of the
165 from the top of the
165 on the surface of the
162 in many parts of the
161 at the level of the
161 for the rest of his
161 in the eyes of the
159 under the control of the
158 at the end of his
158 to the right of the
154 to the left of the
153 on the top of the
153 to the attention of the
152 by the time of the
150 at the time of his
150 with the help of the
147 by the end of this
147 in the corner of the
147 to the side of the
144 in the hands of a
143 after the end of the
143 in the course of his
142 from the end of the
141 as a consequence of the
140 at a meeting of the
140 in the event of the
138 on the banks of the
137 at the rear of the
137 in the life of the
136 on the day of the
134 since the end of the
132 in the shape of a
132 to the needs of the
132 with the help of a
131 at the request of the
129 on the basis of their
128 for the purposes of this
128 to the front of the
126 on the nature of the
125 for the duration of the
125 for the sake of the
125 for the use of the
124 on the face of the
124 to the centre of the
122 at the hands of the
122 on each side of the
120 since the beginning of the
119 by the side of the
117 to the north of the
116 about the nature of the
116 in this part of the
115 at the centre of a
112 in the absence of the
110 from the back of the
110 with the aid of a
108 on this side of the
107 as a result of his
106 into the hands of the
106 in the face of a
106 on the edge of a
105 in the words of the
103 for the remainder of the
102 on the site of the
101 by the middle of the
101 in the presence of the
101 on the end of the
100 as a result of their
100 at the beginning of a
100 at the end of their
100 for the whole of the
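Extracting a list of this kind is mechanically straightforward once the corpus is POS-tagged. A minimal sketch of the matching step, using a few invented tokens with simplified C5-style tags rather than the BNC itself:

```python
from collections import Counter

# Toy POS-tagged tokens with simplified BNC C5-style tags; a real run
# would iterate over the tagged BNC.
tagged = [
    ("at", "PRP"), ("the", "AT0"), ("end", "NN1"), ("of", "PRF"), ("the", "AT0"),
    ("day", "NN1"), ("we", "PNP"), ("met", "VVD"),
    ("at", "PRP"), ("the", "AT0"), ("end", "NN1"), ("of", "PRF"), ("the", "AT0"),
    ("road", "NN1"),
]

# The frame PRP AT0 NN? PRF AT0: preposition + article + noun + "of" + article.
def matches(window):
    tags = [t for _, t in window]
    return (tags[0] == "PRP" and tags[1] == "AT0"
            and tags[2].startswith("NN") and tags[3] == "PRF"
            and tags[4] == "AT0")

# Slide a 5-token window over the corpus and count the matching 5-grams.
grams = Counter(
    " ".join(w for w, _ in tagged[i:i + 5])
    for i in range(len(tagged) - 4)
    if matches(tagged[i:i + 5])
)
print(grams)
```

Run over the full tagged BNC and filtered at a frequency threshold of 100, this procedure yields the list above.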


© Copyright Michael Stubbs 2004.

This is a slightly revised version of a plenary lecture given at ICAME 25, the 25th anniversary meeting of the International Computer Archive for Modern and Medieval English, in Verona, Italy, 19-23 May 2004.

This HTML file last up-dated 30 May 2004.

