Oxford Text Archive
Identifying language change in corpora

© Martin Wynne

The Brown corpus was one of the very first computer corpora. It contains approximately one million words of written American English, sampled from various text types. All of the texts date from the early 1960s.

The LOB corpus is a corpus of written British English, modelled on the Brown corpus and following the same design and selection criteria. The texts in the LOB corpus also date from the early 1960s.

For many years, corpus linguists have made comparisons between British and American English using these two resources.

More recently, however, researchers have become aware that the texts in these corpora are now several decades old and are not appropriate for studying many features of present-day English usage. A team at Freiburg University in Germany therefore decided to build new corpora of more recent texts, following the same design and sampling procedures as Brown and LOB. The resulting corpora are called FLOB (the counterpart of LOB) and FROWN (the counterpart of Brown). Since they are of the same size and design as Brown and LOB, the frequencies of linguistic features can be compared directly across all four corpora.
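Because the four corpora are of roughly equal size, a comparison can be as simple as setting two raw counts side by side. The following Python sketch shows one way to normalise a count to a frequency per million words and to express the difference between two periods as a percentage change. The function names and the figures are made up for illustration; they are not real corpus counts and are not taken from Leech and Smith's study.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million words."""
    return count / corpus_size * 1_000_000


def percent_change(old_count, new_count):
    """Percentage change from the older corpus to the newer one."""
    return (new_count - old_count) / old_count * 100


# Placeholder figures for illustration only; these are NOT real corpus counts.
CORPUS_SIZE = 1_000_000  # each corpus holds approximately one million words
count_1960s = 400        # hypothetical count of a word in a 1960s corpus
count_1990s = 300        # hypothetical count of the same word in a 1990s corpus

print("1960s per million:", per_million(count_1960s, CORPUS_SIZE))
print("1990s per million:", per_million(count_1990s, CORPUS_SIZE))
print("change (%):", percent_change(count_1960s, count_1990s))
```

Normalising matters less here than it would for corpora of different sizes, but it keeps the figures comparable if the actual word totals of the four corpora turn out to differ slightly.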

There are versions of each corpus which have been part-of-speech tagged. In the versions available here, the original corpora and the Freiburg corpora use different tagsets.

Geoffrey Leech recently gave a paper at the ICAME conference in Gothenburg in Sweden reporting on research he had carried out with Nick Smith, in which he compared frequencies of various lexical and syntactic features in all four of these corpora.

This exercise involves looking at modal and semi-modal verbs, which were among the categories studied by Leech, in all four corpora, in tagged and untagged versions.

Resources available

For each of the four corpora, both the tagged and untagged versions are available. You can see the corpora in their original format in the directories LOB, BROWN, FLOB and FROWN. Versions which have been reformatted to work more effectively with Wordsmith are in the directories with "4" on the end - LOB4, BROWN4, FLOB4 and FROWN4. The tagged versions are also there, in directories called LOBTAG, BROWNTAG, etc., and again there are reformatted versions in "4" directories, which are the best ones to use with Wordsmith.

(If you want to get hold of these corpora to use after the course, you can get them all on the ICAME CD-ROM. The tagged Freiburg corpora have not however been made officially available yet, but will be released on a future ICAME CD-ROM.)

Exercises

  1. Form a hypothesis about relative frequencies of occurrence in British v. American and 1960s v. 1990s for one or more of the English modal verbs "would", "may", "might", "shall" and "ought". Obtain word counts for these words to test your hypothesis. If you make a wordlist for the relevant corpora in Wordsmith, then you can find all of the words you are looking for, but you may want to look at some concordances too in order to make sure that you are counting the right things.
  2. Now try to examine frequencies for some more complex patterns. Leech has identified some words and phrases as "semi-modals", where their meaning and patterns of usage are close to those of the more prototypical modal verbs in English. These include "going to", "need to", "have to", "better", "want to", "used to" and "supposed to". You can start with the untagged corpora to see whether you can identify and isolate the relevant examples by searching for lexical patterns. You may find that you need to use the tagged corpora instead, in which case you may need to adjust one or two settings in Wordsmith so that it can handle the tags.

    You may be able to work out how the relevant words are tagged by looking at concordances of them. If you want help on identifying the relevant tags, here are some hints.
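The lexical searches in exercises 1 and 2 can also be sketched outside Wordsmith. The following Python example is a minimal illustration, assuming the untagged text has already been read into a plain string; the function names and the sample sentence are invented for the purpose, and the code does not handle the markup or tags of the actual corpus files.

```python
import re
from collections import Counter

MODALS = ["would", "may", "might", "shall", "ought"]
SEMI_MODALS = ["going to", "need to", "have to", "better",
               "want to", "used to", "supposed to"]


def count_modals(text):
    """Count each modal as a whole word, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {modal: counts[modal] for modal in MODALS}


def count_phrases(text, phrases=SEMI_MODALS):
    """Count each phrase, allowing any whitespace between its words."""
    lowered = text.lower()
    result = {}
    for phrase in phrases:
        parts = [re.escape(word) for word in phrase.split()]
        pattern = r"\b" + r"\s+".join(parts) + r"\b"
        result[phrase] = len(re.findall(pattern, lowered))
    return result


# Invented sample sentence, not taken from any of the corpora.
sample = ("She may be going to London, but I am going to stay. "
          "You ought to go; May is a good month.")

print(count_modals(sample))
print(count_phrases(sample))
```

In this sample the count for "may" includes the month name "May", and the count for "going to" includes "going to London", where "going" is an ordinary verb of motion. Inflected forms such as "needs to" or "had to" are not matched at all. These are the kinds of cases that concordances and the tagged corpora help you to sort out.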

Questions for further study

  1. How do the design and annotation of the corpora help or hinder this type of investigation?
  2. What other features might you want to study? Can you study them with these corpora?