Big Talk: The Possibilities of Large Linguistic Databases

Two language families' trees from Dr. Dunn's paper, with two word-order traits.

How does human language work? What are its possibilities and limitations? Where did it come from? Many linguists have asked these questions and made contributions to our understanding of language, but how do they get their answers?

One approach is to go out and document a language, which can then be compared to other languages, writings from the past, etc. Through various methods, linguists have succeeded in discovering patterns within and between languages that allow us to define some of their parameters and to organize them into families. But, as two recent publications demonstrate, our ability to recognize patterns—and their underlying causes—may be dramatically increasing with the development of technology that can centralize, organize and manipulate enormous amounts of information.

The two studies were highlighted in The Economist, and both offer conclusions that are likely to spark lively debate. Dr. Michael Dunn, of the Max Planck Institute for Psycholinguistics in the Netherlands, published a paper in the journal Nature on word-order dependencies: the idea that, for example, a language that places verbs before objects (eat lunch) will also place prepositions before nouns (at home). By comparing different languages, linguists have found strong consistencies in these dependencies, which have been taken to indicate that they are the result of “underlying cognitive or systems biases.” Dr. Dunn, however, used large databases of basic vocabulary and statistical methods borrowed from evolutionary biology to approach the problem in a different way, as The Economist describes:

To substitute for fossils, and thus reconstruct the ancient branches of the tree as well as the modern-day leaves, Dr Dunn used mathematically informed guesswork. The maths in question is called the Markov chain Monte Carlo (MCMC) method. As its name suggests, this spins the software equivalent of a roulette wheel to generate a random tree, then examines how snugly the branches of that tree fit the modern foliage. It then spins the wheel again, to tweak the first tree ever so slightly, at random. If the new tree is a better fit for the leaves, it is taken as the starting point for the next spin. If not, the process takes a step back to the previous best fit. The wheel whirrs millions of times until such random tweaking has no discernible effect on the outcome.

When Dr Dunn fed the languages he had chosen into the MCMC casino, the result was several hundred equally probable family trees. Next, he threw eight grammatical features, all related to word order, into the mix, and ran the game again.

He found that particular word-order traits were not necessarily linked to others in the way that current theories propose. Rather, such dependencies seemed to be ‘lineage-specific,’ suggesting that they have been passed down through language families. “Nurture, in other words, rather than nature,” as The Economist put it.
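For readers who want to see the "roulette wheel" loop in concrete form, here is a minimal Python sketch. It is not Dr. Dunn's model or software. The four languages, their cognate data, and the names `COGNATES`, `fitch`, `misfit` and `sample_trees` are all invented for illustration, and a simple count of implied changes (Fitch parsimony) stands in for whatever measure of fit the study used. One difference from the simplified description quoted above: a standard Metropolis sampler does not always step back from a worse tree, but occasionally accepts one so that it can explore the full range of plausible trees.

```python
import math
import random

# Hypothetical data: 1 means the language has a word from that cognate set.
COGNATES = {
    "A": [1, 1, 0, 0, 1, 0],
    "B": [1, 1, 0, 0, 0, 0],
    "C": [0, 0, 1, 1, 1, 1],
    "D": [0, 0, 1, 1, 0, 1],
}


def fitch(tree, site):
    """Return (possible ancestral states, changes needed) for one cognate set."""
    if isinstance(tree, str):  # a leaf: one observed state, no changes yet
        return {COGNATES[tree][site]}, 0
    left_states, left_cost = fitch(tree[0], site)
    right_states, right_cost = fitch(tree[1], site)
    shared = left_states & right_states
    if shared:  # the two branches can agree, so no extra change is needed
        return shared, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def misfit(order):
    """Total changes implied by the tree ((o0, o1), (o2, o3)); lower fits better."""
    tree = ((order[0], order[1]), (order[2], order[3]))
    n_sites = len(COGNATES[order[0]])
    return sum(fitch(tree, site)[1] for site in range(n_sites))


def sample_trees(steps=500, temperature=1.0, seed=0):
    """Metropolis-style search over the groupings of four languages."""
    rng = random.Random(seed)
    current = ["A", "C", "B", "D"]  # arbitrary starting tree
    current_cost = misfit(current)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        # Spin the wheel: tweak the tree by swapping two randomly chosen leaves.
        i, j = rng.sample(range(4), 2)
        proposal = list(current)
        proposal[i], proposal[j] = proposal[j], proposal[i]
        cost = misfit(proposal)
        # Keep a tree that fits at least as well; keep a worse one only sometimes.
        accept_worse = math.exp((current_cost - cost) / temperature)
        if cost <= current_cost or rng.random() < accept_worse:
            current, current_cost = proposal, cost
        if current_cost < best_cost:
            best, best_cost = list(current), current_cost
    return best, best_cost
```

Calling `sample_trees()` returns the best-fitting grouping the search visited, together with its misfit score. With only four languages there are just three possible groupings, so the search is trivial here; the same loop becomes useful when the number of candidate trees is far too large to enumerate.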

The other study, published in Science by Dr. Quentin Atkinson of the University of Auckland, also uses statistics and databases in an innovative way. He looked at information from the World Atlas of Language Structures on the sounds of different languages and found that phonemic diversity (the number of distinct sounds a language uses) decreases as you follow the pathways of human migration outwards from central/southern Africa. The Science article argues that modern language originated in that part of Africa and that phonemic diversity decreased with every stage of human expansion, as small groups of people set off in search of new territory.
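The kind of relationship described here, a measure that falls as distance grows, can be checked with a simple correlation and a least-squares slope. The sketch below uses numbers that are entirely made up for illustration; they are not from the World Atlas of Language Structures or from Dr. Atkinson's paper, and the function name `correlation_and_slope` is likewise hypothetical.

```python
import math

# Invented example values, NOT real measurements.
distance_thousand_km = [0, 5, 10, 15, 20, 25]   # distance from a proposed origin
phoneme_counts = [45, 38, 30, 27, 20, 13]       # sounds in each language's inventory


def correlation_and_slope(xs, ys):
    """Return (Pearson r, least-squares slope) for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy), sxy / sxx


r, slope = correlation_and_slope(distance_thousand_km, phoneme_counts)
print(f"r = {r:.2f}, slope = {slope:.2f} phonemes per thousand km")
```

A strongly negative `r` and a negative slope would be the signature of the pattern the study reports. The real analysis works with hundreds of languages and has to account for other factors, so this is only a picture of the basic idea.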

Both of these studies use phylogenetic language groupings, based on evolutionary theory, and both run statistical analyses on large amounts of data made available by central repositories of linguistic information, such as the World Atlas of Language Structures. The Long Now Foundation’s Rosetta Project is an effort to improve and facilitate exactly this kind of creative methodology: to organize and make available large amounts of data so that researchers can develop fundamentally new methods of inquiry.