Abstracts

Pre-conference workshop at CL2005, 14th July 2005

Why did those Lancastrians bother to annotate a corpus for speech, writing and thought presentation categories and what good did it do them?

Mick Short

This paper describes a corpus-based research project designed to investigate the adequacy of the model of speech and thought presentation outlined in chapter 10 of Leech and Short (1981) Style in Fiction, which has been influential in Stylistics and other fields, and to improve that model. A corpus of around 260,000 words, in approximately 2,000-word samples and comprising three different narrative text types (fiction, news reports and (auto)biographies), was annotated extensively by hand to see how well the model coped with prototypical and non-prototypical cases. The process of annotation led to various changes to the model. We established three parallel scales of discourse presentation (speech presentation, writing presentation and thought presentation) and some additional presentation categories on each of the scales. The annotation helped us to describe discourse presentation in texts, and its effects, more accurately. This, in turn, helped us to understand better how, and in what ways, the thought presentation scale differs from the other two scales (and so why it is unhelpful to push the scales together under the term 'discourse presentation' when discussing literary and non-literary texts in any detail).

We made a point of coding discourse presentation ambiguities wherever we found them so that we could better understand the kinds of ambiguity and their causes, and also the general extent of discourse presentation ambiguity in the corpus. The project also helped us to understand better the extent and nature of a series of phenomena which can be found across a range of speech, writing and thought presentation categories, for example brief quotation phenomena, hypothetical discourse presentation and the embedding of one discourse inside another. We annotated such phenomena whenever we found them in order to be able to study them quantitatively and qualitatively. Our statistical work shows that brief quotation phenomena are particularly frequent in news reports and that hypothetical discourse presentation is a relatively infrequent phenomenon, leading us to want to reject or modify the recent arguments in favour of rejecting the notion of degrees of faithfulness to an original in discourse presentation theory.

Thus we would argue that, although time-consuming, this use of a specially constructed and annotated corpus to examine the accuracy of a discourse presentation model has helped us to refine that model and so (a) make it more useful for analysing texts, their meanings and effects and (b) use the insights derived from the corpus-based work to make our theorising about discourse presentation more accurate.


Collocation and Semantic Prosodies in Literature and in Corpora

Bill Louw

Peter Stockwell's now famous remark about corpus stylistics (that it is "...all methodology and no results") is entirely justified when one weighs up objectively the collective output of this nascent but exciting discipline. Time-consuming methodology may be associated with a desire to attain scientific rigour. However, the work of John Sinclair and the legacy of Cobuild have demonstrated that linguistic science has a human face. In 1941, Malinowski defined science as follows: “Science is the translation of experience into general laws which have predictive value...” (emphasis added). Sadly, the majority of corpus stylisticians assume that corpora will confirm intuitive literary critical and stylistic practice rather than overturn it.

The rehabilitation of corpus stylistics lies in a holistic rather than atomistic approach to both text and corpus which simultaneously restores authority and primacy to contextualised natural language in the corpus and which removes, by means of simple proofs, the misleading or falsifiable labels and fake formalisms of an outmoded apparatus criticus. As a result of this largely Malinowskian revaluation, literary texts will increasingly be read against a large sample of the whole language and ‘general laws which have predictive value’ will be arrived at during that process. The result is direct access to the target text’s institutional meaning as a mode of action (within society) rather than as the countersign of thought (in and through human intuition which too readily attests the existence of word meaning).

Linguistics, after 2,500 years devoted to description, has lost sight of its initial goal: meaning. The act of description has come to worship itself, leaving stylistic results often only dimly or tendentiously inferable and unsatisfying to the reader/critic. For example, a welter of statistical data is often followed by a fairly obvious statement of interpretation. The only critical labels from the past that ought to be allowed to survive into the era of digital stylistics will be those which are capable of (corpus-valid) re-definition, e.g. irony as a reversal of a semantic prosody, but distinct from insincerity (Louw, 1993).

Unless the reader/critics’ interests are satisfied, corpus stylistics will never prosper and will instead become increasingly arcane and inaccessible. The output of corpus stylistics must take critical practice objectively and scientifically towards a proven consensus of meaning which is both incontrovertible and fully replicable (between different investigators and different corpora, including sub-corpora). This process will be inspirational because of the disclosure by semantic prosodies of ‘hidden meaning’ (Sinclair, 2003:117).

The insights gathered by corpus stylistics must no longer extend to single sentences or devices alone. The traditional demands of literary criticism of a literary work need to be addressed:

  1. The thought form or literary world;
  2. The rise and fall of emotions;
  3. The syntactic form;
  4. Diction, imagery and symbolism. (After M.H. Abrams)

Corpus development and composition must be broad enough to cover and discover the full range of situational meanings encountered in literary texts. Corpus stylistics will offer fresh insights into corpus development and the issue of corpus size.

None of the above can be attempted or accomplished without recourse on a large scale to collocation, a phenomenon which operates probabilistically and which is recoverable computationally through frequency and especially by means of co-selection. It allows for the painless revaluation of all literary phenomena as collocation. As this occurs, all critical labels based upon word-meaning or “...concerned with the conceptual or idea approach to the meaning of words...” (Firth, 1957) are likely, gradually, to go to the wall. The new labels will be scientifically robust and fewer in number and will hold nothing of the phlogiston fallacy within them which sustains so many intuitively-derived, falsifiable (in Popper’s terms) theories.


Phraseology and Equivalence

Kieran O'Halloran

Key aspects of stylistics have included taking account of parallelisms in poetry and the equivalences these set up. Another feature has been a focus on deviation at the morphological, grammatical, semantic, phonological and graphological level. Deviation at the phraseological level, however, has not received much attention. In other words, there has not been much focus on how literary texts may deviate from lexico-grammatical norms of usage which show up in large corpora. After looking back to the work of Formalists such as Jakobson and Mukarovsky, this presentation looks forward, arguing that employing large corpora to reveal phraseological deviation in a literary text can be useful for cognitive approaches to stylistics where the notion of the schema is drawn upon (e.g. Cook, 1994; Semino, 1997). In the absence of empirical grounding, the notion of a schema remains speculative and relative to the individual interpreter (Carter, 1999). However, schemata can receive some empirical grounding using corpus techniques. Phraseological regularities as revealed by a concordancer tell us something about habitual expectations with regard to particular usage and thus about likely default schemata generally speaking.

I aim to show that one useful role corpora can play is the following: helping to sift more personal and thus 'local' schemata activated in reading a literary text from schemata that are more likely to be activated by readers generally. This can enhance analysis of a literary text's capacity to draw in readers generally, rather than just the analyst. I examine this issue with regard to schemata that were triggered in Fowler's (1996) reading of Fleur Adcock's poem Street Song. I show, via a large corpus of contemporary everyday English, that the poem's grammatical equivalences in Fowler's Jakobsonian analysis are actually not equivalent from a phraseological perspective. This is because of deviation within relevant phraseologies. Since phraseological norms bear some analogy to schemata, I show how this tension between grammatical equivalences and (lack of) phraseological equivalences in the poem helps to account more empirically for why the poem is likely to draw readers into it.


Towards corpus stylistics: semi-automatic analysis of early García Lorca texts

Sara Piccioni

This paper proposes an example of contrastive quantitative semi-automatic analysis of two literary electronic corpora based on the work of Spanish poet Federico García Lorca, the first from his early prose production (ca. 170,000 words), and the second from his early poetry production (ca. 56,000 words).

The analysis concentrates on the study of the distribution of semantic classes across the two corpora; the log-likelihood measure (Dunning 1993, LL henceforth) is used to look for recurrent semantic class combinations and to identify the semantic classes and class combinations more strongly associated with each corpus. To do this, all content words in both corpora were semi-automatically annotated with tags representing their semantic field. The semantic tags were then automatically extracted from the two corpora and a number of association measures were computed to verify recurrent co-occurrence patterns. Interpretation of the results thus obtained reveals interesting aspects of language use and highlights meaningful differences in the treatment of specific themes in the two corpora. Three different kinds of analysis were conducted.

Firstly, the co-occurrence of semantic fields was investigated by analyzing the distribution of the raw frequencies of pairs of semantic tags. High incidence of "unusual" pairs highlighted frequent association of words belonging to the semantic fields of BODY/SENSUALITY and RELIGION, where body and soul are constantly combined in an attempt to celebrate the spiritual virtues of the former, while simultaneously contemplating the inner suffering and sense of guilt derived from unwanted and irrepressible sexual urges.

Secondly, the LL measure was used to highlight preferred association of specific semantic fields with one of the two corpora, thus drawing attention to the semantic classes that better characterise each corpus. Results show that the highest scoring semantic tags in the poetry corpus bear some relevance to nature (SKY, VEGETATION, WATER, etc.), while the most significant semantic classes in the prose corpus refer to human activities or human-determined categories. Further typical semantic classes in the prose corpus refer to social structure, cognitive processes and activities derived from these (INTELLECT, ART, MEMORY, etc.), and interior life (FEELINGS, RELIGION, etc.). The data thus suggest a complementary distribution of themes and meanings in the two corpora, which can be encapsulated in the distinction 'Nature vs. Culture'.
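As an illustration of this second kind of analysis, the following minimal Python sketch computes a log-likelihood (G2) score for one semantic tag across two corpora. It assumes the common 2x2 contingency-table formulation of the statistic; the abstract does not say which variant was used. The function name and all frequencies in the usage note are hypothetical, not figures from the study.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """G2 score for one tag: corpus A vs. corpus B, tag vs. all other tags.

    freq_a, freq_b: occurrences of the tag in each corpus (hypothetical inputs).
    size_a, size_b: total number of tagged words in each corpus.
    """
    observed = [
        [freq_a, size_a - freq_a],
        [freq_b, size_b - freq_b],
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [freq_a + freq_b, total - freq_a - freq_b]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

A call such as log_likelihood(50, 1000, 5, 1000) returns a high score because the tag is far more frequent in the first corpus than in the second, while identical relative frequencies give a score of zero. The score alone does not say which corpus overuses the tag; that has to be read off the relative frequencies.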

Finally, combinations of semantic classes typical of each corpus were analysed by using the LL to measure the strength of association between each corpus and the pairs of semantic tags they contain. Analysis of the data suggests language use in the poetry corpus is more figurative and creative than in the prose corpus. While the prose corpus is characterised by high incidence of collocations typical of general language, the poetry corpus contains more instances of figurative language, typical pairs being VEGETATION-DEATH ("the roses of death", etc.), BODY/SENSUALITY-VEGETATION ("your body was covered with pain and roses"), DEATH-SKY ("there are wounds in the sky"), etc.

In conclusion, while some aspects of the methodology are still problematic (e.g., semantic class selection and attribution) and can affect the reliability of the results, the semi-automatic quantitative study of word class distribution proposed here can provide insights into how language is used in texts, thus making a meaningful contribution to stylistic studies.


Phraseology and meaning in literary texts and corpora: how frequent phraseological units contribute both to the characterization of protagonists and places and the structural organisation of literary texts

Bettina Starcke

Corpus stylistics is still a minor focus of research in linguistics. This is the case despite its potential for developing techniques for the extraction of meaning from fiction and non-fiction texts. Relatively few linguists have worked in stylistics, and among the existing corpus stylistic studies, analyses discussing corpora or texts longer than short stories or poems are still rare (exceptions are e.g. Tabata 2002, Semino & Short 2004 and Stubbs 2005). However, technical developments now allow for the analysis of large quantities of data, so that longer texts have also become open to analysis.

As a first step, this paper illustrates how recurrent phraseological units not only contribute to coherence and cohesion in a text, but also encode implicit meaning. Using the example of Jane Austen's novel Northanger Abbey, the presentation shows how collocations and colligations of the novel's most frequent phraseological units contribute to the characterization of protagonists and places. It is then demonstrated how this linguistic evidence can explain intuitive reactions to the text by its readers. For this purpose, linguistic findings are related to the content of the novel and quantitative data serve as a basis for literary interpretations. For example, collocations and colligations of the novel's most frequent 4-word string, 'i am sure i', are identified and the protagonists' typically delexicalized use of the phrase is discussed. The phrase functions as a discourse marker and its use is mainly phatic. This speech behaviour reflects on the protagonists, whose use of the phrase contributes to their characterization as superficial. The occurrence of 'i am sure i' mainly during that part of the novel which is set at Bath also contributes to Austen's characterization of the place as a scene of superficial social activities. These conclusions on superficiality are supported by the observation that the literal meaning of the phrase ("certainty") is reversed, since it collocates and colligates with explicit and implicit negations.
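The extraction of a text's most frequent 4-word strings can be sketched in a few lines of Python. This is a minimal illustration only: the tokenisation (lower-casing, keeping letters and apostrophes) and the function name are assumptions, not the procedure used in the study, and the sample sentence in the usage note is invented rather than quoted from the novel.

```python
import re
from collections import Counter

def most_frequent_ngrams(text, n=4, top=5):
    """Return the `top` most frequent n-word strings in `text`.

    Tokenisation is deliberately simple (an assumption): the text is
    lower-cased and split into runs of letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n tokens across the text, one token at a time.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams).most_common(top)
```

Applied to an invented sample such as "I am sure I do not know. I am sure I cannot say.", the function ranks 'i am sure i' first with two occurrences. Lower-casing is why such strings appear without capitals in frequency lists. Note that this simple window also runs across sentence boundaries.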

As a second step, the most frequent phraseological units from a corpus of literature contemporary to Austen are presented and structural features of the corpus are distinguished by way of its most frequent phraseological units. The differences between the most frequent phraseological units identified for Northanger Abbey and for the corpus are used to identify and illustrate structural differences between the two sets of data.

Finally, this paper suggests that the techniques demonstrated for the analysis of fiction texts and corpora can also be used for the analysis of non-fiction texts and corpora. This shows that corpus stylistics could take a key position in the development of techniques for text analysis.


Dickensian patterns: meaning and form in literary text

Michaela Mahlberg

Meaning and form cannot be separated. This is one of the fundamental points of corpus linguistic approaches to the description of language. In corpora we can observe collocations and patterns of words that make visible the meanings shared by the members of a discourse community. However, meanings in texts are not only conventional. The meaning of a particular text is characterised by the interplay of conventional patterns and novel or creative combinations of words. It is the creative or 'unusual' collocations that literary stylistics is interested in: the relationship between meaning and form is analysed with regard to the effects that can be achieved by deviating from linguistic norms. Corpus linguistics can provide useful tools to identify textual features that characterise the style of a particular text. Corpora enable comparisons of typical patterns with features foregrounded in a particular text/texts by a particular author, and corpus linguistic techniques can suggest quantitative approaches to the analysis of texts. A corpus stylistic analysis may, for instance, describe the development of a narrative by looking at distributions of content words (e.g. Stubbs 2001), or the analysis may focus on collocations that reflect the way in which characters are portrayed (e.g. Hori 2004). Moreover, corpus linguistic methodology can be used to test existing theories for the stylistic analysis of texts (e.g. Semino & Short 2004). With the help of corpus linguistic methodology, the stylistic analysis of literary text is not restricted to those features striking enough to be discovered by a human observer. What makes corpora such a valuable tool in stylistics is the analysis of the interplay between conventional and creative ways of creating meaning. With the help of corpora we can identify features that readers may not be fully aware of when they arrive at a particular interpretation of a text.

The present paper will take examples from texts by Charles Dickens to illustrate how corpus linguistic methodology can provide useful tools for the stylistic analysis of texts. What appears to be an unusual instance of collocation may turn out to be part of a textual pattern that is created by a set of semantic prosodies spanning large passages of text. The paper will focus on Dickensian patterns that illustrate how Dickens portrays specific features of people and places in the worlds of his novels. In addition to methodological issues the paper will also address theoretical questions of how facts and realities may be created and interpreted.