

Legal Theory Lexicon on Corpus Linguistics
Michael Ramsey

At Legal Theory Blog, Larry Solum's Legal Theory Lexicon has this entry on Corpus Linguistics.  From the introduction:

Recently . . . lawyers, judges, and legal scholars have discovered a data-driven approach to ascertaining the semantic meaning of disputed language.  This technique, called "corpus linguistics," has already been used by courts and plays an increasingly prominent role in legal scholarship.  This entry in the Legal Theory Lexicon provides a basic introduction to corpus linguistics. ...

Why has corpus linguistics become important in contemporary legal theory and practice?  The answer to that question is complicated.  One important impetus is rooted in the revival of formalism in general legal theory: that revival is reflected in developments in the law and theory of both statutory and constitutional interpretation.  Statutory interpretation in the 1960s and 1970s was dominated by approaches that emphasized legislative intent and statutory purpose, but in the last three decades, textualism (or "plain meaning textualism") has been on the ascendance.  Similarly, living constitutionalism once held hegemonic sway over the realm of constitutional interpretation, but in recent years, originalism has become increasingly important in both the academy and the courts.

And from later on:

How Does Corpus Linguistics Work? 

 Corpus linguistics begins with data sets, singular "corpus" or plural "corpora."  These data can be very large--with millions or even billions of words.  For example, the Corpus of Contemporary American English (COCA) consists of approximately 520 million words.  News on the Web (NOW) consists of more than 5.21 billion words.

Corpus lexicography uses these datasets to investigate the meaning of words and phrases.  Whereas traditional dictionary lexicography relied on researchers compiling instances of usage by reading various sources, the corpus approach allows random sampling from large databases with blind coding by multiple coders.

A complete description of the methods of corpus lexicography is beyond the scope of this brief Lexicon entry, but there are two search techniques that can be described briefly.  The first of these is the Key-word-in-context (or KWIC) search.  This method is simple: a corpus is searched for the occurrence of a string (a word or phrase) and the search reports back the context in which the string occurs.  The individual instances can then be coded for meaning.  The result will be a set of meanings and data about the frequency of the meanings within the sample.  The second method involves a search for the collocates of a word or phrase: for example, the word "bank" might have collocates like "river," "shady," "deposit," and "ATM."  Collocates may help to disambiguate a word like "bank" that has multiple meanings.
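
For readers who want to see the two searches in miniature, here is a rough Python sketch. It is not taken from the Lexicon entry and is not how COCA or NOW are actually queried: the three-sentence corpus is invented for illustration, and the function names (tokenize, kwic, collocates) and the window sizes are assumptions chosen to keep the example short.

```python
import re
from collections import Counter

# Toy corpus invented for this illustration; a real study would draw on
# a large corpus such as COCA rather than three hand-written sentences.
CORPUS = (
    "She sat on the river bank under a shady tree. "
    "He made a deposit at the bank before noon. "
    "The bank installed a new ATM near the river."
)


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def kwic(tokens, keyword, window=3):
    """Key-word-in-context: return (left, keyword, right) for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


def collocates(tokens, keyword, window=4):
    """Count the words that appear within `window` tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


if __name__ == "__main__":
    tokens = tokenize(CORPUS)
    # Each concordance line could then be hand-coded for meaning.
    for left, word, right in kwic(tokens, "bank"):
        print(f"{left:>20} [{word}] {right}")
    print(collocates(tokens, "bank").most_common(10))
```

Each concordance line printed by kwic is the kind of instance a coder would label (say, "riverside" versus "financial institution"), and the collocate counts surface neighbors such as "river" and "deposit." A realistic version would also filter out function words like "the" and "a," which otherwise dominate the counts.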

And in conclusion:

The introduction of a new methodology to legal theory is a rare event, but corpus linguistics is one of the black swans.  It is still early days, but the use of corpus methods has already begun in earnest--both in the courts and the academy.  The Bibliography provides many of the key sources in a literature that still can easily be read in just a few days.