
11/24/2020

Larry Solum's Legal Theory Lexicon: Corpus Linguistics
Michael Ramsey

At Legal Theory Blog, Larry Solum has this entry in the "Legal Theory Lexicon": Corpus Linguistics. From the introduction:

... Legal disputes frequently turn on the meaning of a contract, will, rule, regulation, statute, or constitutional provision.  How do we determine the meaning of legal texts?  One possibility is that judges could consult their linguistic intuitions.  Another possibility is the use of dictionaries.  Recently, however, lawyers, judges, and legal scholars have discovered a data-driven approach to ascertaining the semantic meaning of disputed language.  This technique, called "corpus linguistics," has already been used by courts and plays an increasingly prominent role in legal scholarship.  This entry in the Legal Theory Lexicon provides a basic introduction to corpus linguistics.  As always, the Lexicon is aimed at law students with an interest in legal theory.

Situating Corpus Linguistics

Why has corpus linguistics become important in contemporary legal theory and practice?  The answer to that question is complicated.  One important impetus is rooted in the revival of formalism in general legal theory: that revival is reflected in developments in the law and theory of both statutory and constitutional interpretation.  Statutory interpretation in the 1960s and 1970s was dominated by approaches that emphasized legislative intent and statutory purpose, but in the last three decades, textualism (or "plain meaning textualism") has been in the ascendant.  Similarly, living constitutionalism once held hegemonic sway over the realm of constitutional interpretation, but in recent years, originalism has become increasingly important in both the academy and the courts.

And from later on:

How Does Corpus Linguistics Work? 

Corpus linguistics begins with data sets, called "corpora" (singular: "corpus").  These data sets can be very large--with millions or even billions of words.  For example, the Corpus of Contemporary American English (COCA) consists of approximately 520 million words.  News on the Web (NOW) consists of more than 5.21 billion words.

Corpus lexicography uses these data sets to investigate the meaning of words and phrases.  Whereas traditional dictionary lexicography relied on researchers compiling instances of usage by reading various sources, the corpus approach allows random sampling from large databases with blind coding by multiple coders.

A complete description of the methods of corpus lexicography is beyond the scope of this brief Lexicon entry, but two search techniques can be described briefly.  The first is the key-word-in-context (or KWIC) search.  This method is simple: the corpus is searched for occurrences of a string (a word or phrase), and the search reports back the context in which each occurrence appears.  The individual instances can then be coded for meaning.  The result is a set of meanings and data about the frequency of each meaning within the sample.  The second technique involves a search for the collocates of a word or phrase: for example, the word "bank" might have collocates like "river," "shady," "deposit," and "ATM."  Collocates may help to disambiguate a word like "bank" that has multiple meanings.
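
For readers curious about the mechanics, here is a minimal Python sketch of a KWIC search and a collocate count run over a toy text.  The function names and the toy sentences are invented for illustration; real corpus work would query an indexed corpus such as COCA or NOW rather than a raw string, and real collocate tools rank neighbors by statistical association rather than raw frequency.

```python
import re
from collections import Counter

def kwic(text, target, window=5):
    """Key word in context: return each occurrence of `target`
    with up to `window` words of context on either side."""
    tokens = re.findall(r"[a-z']+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{target}] {right}")
    return lines

def collocates(text, target, window=5, top_n=5):
    """Count words co-occurring with `target` within `window` tokens.
    Frequent neighbors ("river," "deposit") hint at which sense of a
    polysemous word is in play; real tools also filter stopwords and
    weight neighbors by measures like mutual information."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts.most_common(top_n)

sample_text = ("She opened a deposit account at the bank near the ATM. "
               "We walked along the shady river bank at dusk.")
for line in kwic(sample_text, "bank"):
    print(line)
print(collocates(sample_text, "bank"))
```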
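
And a sketch of the sampling-and-coding step the entry describes.  Everything here is hypothetical: the concordance lines, the sample size of 100, and the coder labels are simulated only to show how random sampling and a basic agreement check might be set up; a real study would use human coders and report a chance-corrected statistic such as Cohen's kappa.

```python
import random

# Hypothetical concordance lines standing in for the output of a KWIC
# query against a large corpus such as COCA or NOW.
concordance = [f"... occurrence {i} of the disputed term ..." for i in range(10_000)]

# Random sampling keeps hand-coding tractable while staying
# representative of the full result set; the seed makes it reproducible.
random.seed(1)
sample = random.sample(concordance, k=100)

# Two coders label each sampled line independently ("blind" coding).
# The labels are simulated here; in practice each coder reads the
# context and assigns a sense without seeing the other's labels.
senses = ["sense A", "sense B"]
coder_1 = [random.choice(senses) for _ in sample]
coder_2 = [random.choice(senses) for _ in sample]

# Raw percent agreement between the coders.
agreement = sum(a == b for a, b in zip(coder_1, coder_2)) / len(sample)
print(f"Raw inter-coder agreement: {agreement:.0%}")
```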