Blog Archive for the ‘PanLex’ Category

PanLex: Overcoming Language Barriers with the World’s Largest Lexical Translation Database

Posted on Wednesday, October 25th, 02017 by Ahmed Kabil
Categories: PanLex

In an unassuming office on the fourth floor of Downtown Berkeley’s historic Chamber of Commerce high rise, three linguists are at work building the world’s largest lexical translation database. The mission of PanLex, a project of The Long Now Foundation, is to overcome language barriers to human rights, information, and opportunities. After ten years of pooling together different sources from across the world, PanLex’s database covers over 2,500 dictionaries, 5,700 languages, 25 million words, and 1.3 billion translations. Now, the PanLex team is ready to see what it can do. They’re targeting under-served language communities, international humanitarian organizations, and global businesses to explore what practical problems PanLex can address.

“Choose any language you can imagine,” Julie Anderson, the PanLex Director of Programs, instructs me as we power up the PanLex translator for a demo. “The most interesting language you can think of,” Ben Yang, the Director of Technology, adds.

Unlike the machine translation service Google Translate, which translates whole sentences and texts in up to a hundred major world languages (sometimes to comedic effect), PanLex is a database that is both panlingual (built to contain every language) and lexical (focused on words, not sentences).

Stumped by the possibilities, I opt for Classical Nahuatl, the language of the Aztec empire, modern forms of which are spoken today by an estimated 1.5 million people. I once read that the word “avocado” originated with Nahuatl, and that the same Nahuatl word for avocado also meant “testicle”—due, presumably, to the similarity in shape.

“How about avocado?” I ask.

The PanLex translator app in action, translating “avocado” into Classical Nahuatl.


Yang types avocado in the field for English, selects Nahuatl, and we’re immediately presented with words with different translation quality scores, with ahuacatl having the top score. Tapping on the word displays the paths from the English word through equivalent words in different languages which lead to the Nahuatl word. Translating ahuacatl back to English provides the words: avocado, bollock, egg, and testicle, among others, with avocado having the highest quality score.

It’s a simple, intuitive interface, one that belies the implications for human rights embedded within. At the heart of the PanLex project is the conviction that with access to information and the ability to communicate comes the ability to exercise one’s rights.

“You might want to communicate in a different language just because you want to connect with someone,” David Kamholz, PanLex’s Project Director, tells me. “Say you see a person on the street and you want to talk to them and you don’t share a language. By not speaking the same language, you’ve lost the richness in life that comes from communicating with someone you might want to. But that’s not necessarily a human rights issue.”

“Human rights comes into play where you’re talking about a scenario where, say you’re sick and want to see a doctor, but there’s no doctor who speaks your language. Or you need a lawyer, or you want to vote, but you can’t understand what’s going on in the election. Or you just want to look up some information on Wikipedia, or understand something about a field you’re studying, and you’re trying to read it and you can’t. The rights outlined in documents like the Universal Declaration of Human Rights require the ability to communicate. If you can’t communicate with certain entities, be they your government, your doctor, your lawyer, or your teacher in school, you can’t exercise your rights.”

The PanDictionary was the conceptual forerunner to PanLex.


This focus on breaking down barriers to human rights did not mark the PanLex project from its inception. At first, it wasn’t even known whether building such a database was possible. In 02004, a group of researchers at the University of Washington’s Turing Center set out to answer a question:

Can we automatically compose a large set of Wiktionaries and translation dictionaries to yield a massive, multilingual dictionary whose coverage is substantially greater than that of any of its constituent dictionaries?

The result of their research, PanDictionary, demonstrated that it was possible to significantly improve the quality of inferred translations using a novel algorithm that pooled together several multilingual dictionaries and placed them in an interoperable format.

Say you want to translate a word from Basque, a language unrelated to any other living language, into Zulu, the most widely spoken home language in South Africa. You can go from Basque to English, and from English to Zulu, but what is the probability that the word in Zulu is an accurate translation of the Basque word? The English might not preserve the meaning completely, giving rise to what’s called a transitive inference problem. But if you have independent confirmation from enough intermediate languages, such as French, Russian, Hindi, et cetera, you can correct for the ambiguities: multiple paths that converge on the same Zulu word indicate a reliable translation.
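The idea of converging paths can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word pairs are toy data, not PanLex records, and scoring a candidate by a plain count of distinct pivot words is a simplification chosen here, since the article does not describe how PanDictionary or PanLex actually weight their paths.

```python
from collections import defaultdict

# Toy bilingual entries as ((language, word), (language, word)) pairs.
# Illustrative data only; not taken from the PanLex database.
EDGES = [
    (("eus", "etxe"), ("eng", "house")),
    (("eus", "etxe"), ("fra", "maison")),
    (("eus", "etxe"), ("rus", "дом")),
    (("eng", "house"), ("zul", "indlu")),
    (("fra", "maison"), ("zul", "indlu")),
    (("rus", "дом"), ("zul", "indlu")),
    (("eng", "house"), ("zul", "umndeni")),  # hypothetical ambiguous link
]


def build_graph(edges):
    """Build an undirected translation graph from word pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def rank_translations(graph, source, target_lang):
    """Rank target-language candidates by how many distinct pivot words reach them."""
    pivots = defaultdict(set)
    for pivot in graph[source]:
        if pivot[0] == target_lang:
            continue  # a direct attestation is not a pivot
        for candidate in graph[pivot]:
            if candidate[0] == target_lang:
                pivots[candidate].add(pivot)
    scored = ((cand, len(via)) for cand, via in pivots.items())
    return sorted(scored, key=lambda item: (-item[1], item[0]))


graph = build_graph(EDGES)
ranking = rank_translations(graph, ("eus", "etxe"), "zul")
for (lang, word), support in ranking:
    print(lang, word, support)
```

In this toy graph the candidate reached through three independent pivot languages ranks above the one reached only through English, which is the intuition behind using several intermediate languages rather than one.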

Jonathan Pool, a political scientist who helped advise the research project, wanted to go a step further than a proof of concept demonstrating that such a database was possible. He wanted to build it.

Pool was struck at an early age by the degree to which linguistic knowledge influenced the universe of opportunities for people. As a member of the Peace Corps teaching English in Turkey in the 1960s, he observed that it was knowledge of languages, rather than professional skills, that more often than not determined who got hired for jobs. Thus began a career at the intersection of academia, language politics and policy, where Pool’s research focused on individual and collective choices about language, linguistic diversity and the consequences of linguistic discrimination.

In the PanDictionary project, Pool glimpsed the practical implications that a massive lexical database could provide. The vision of PanLex—a database enabling anyone in the world, regardless of their language, to communicate and exercise their rights—was born.

For the next six years, Pool dedicated himself to building out the database, singlehandedly doing the programming, improving its structure, and scouring the Internet for every possible linguistic source he could find. He was also independently funding the venture.

The PanLex team. From left: Project Director David Kamholz, Director of Technology Ben Yang, and Director of Programs Julie Anderson. Photo by Carolyn Wachnicki.


As the database grew larger, Pool expanded his one-man operation, bringing in linguist and self-taught programmer David Kamholz in 02013. Linguists Julie Anderson and Ben Yang joined in 02015. Anderson acquired new data to be ingested into the database, which Yang analyzed and integrated, along with building new tools for it.

“It’s really fun getting my hands on all these dictionaries of languages from all over the world, especially the under-served languages,” Anderson says. “To me, this is brain candy.”

PanLex soon caught the attention of Laura Welcher, project director for The Long Now Foundation’s Rosetta Project. The Rosetta Project began in 02000 as Long Now’s first exploration into long-term archiving, with the goal of building a publicly accessible digital library of human languages. Rosetta had been collecting parallel vocabulary lists early on as a targeted collection effort. As part of Rosetta’s sharing efforts with other linguistics projects, many of these lists made their way to the PanLex project, where the PanLex team incorporated them into their database, linked that data to other language data, and cleaned up and normalized the data. Rosetta and PanLex agreed that they were complementary projects and should work closely together. PanLex became a sponsored nonprofit project of the Long Now Foundation in 02012.

“I think of Rosetta and PanLex as sister projects,” Welcher says. “They are functionally separate projects with separate staff, but with similar and complementary goals. Rosetta also focuses on explorations in very long-term archiving media which PanLex doesn’t specifically do, although they are participating in the larger data collection effort and PanLex lexical data currently makes up about half the language data on the Rosetta Wearable Disk.”

Pool stepped back from day-to-day operations in 02017, and Kamholz took over as Project Director. Now that the database is sufficiently robust (“We have the largest collection of lexical data in the world,” notes Yang), Kamholz is leading PanLex through its next phase. A part of that next phase entails more clearly elucidating PanLex’s value proposition. Another part means finding sustainable ways to generate revenue.

PanLex’s data is freely available and no permission is needed for noncommercial use. Photo by Carolyn Wachnicki


“Earlier this year, we started the process of asking: Who are we, what are we really trying to do?” Kamholz says. “What is the world we want and what is our vision of where PanLex fits into that? We’ve always said that we want to help these under-served communities and partner with global humanitarian organizations, but what exactly do we want to do for them? I wouldn’t say we necessarily changed our mission so much as made it more explicit and concrete.”

“Before, our mission was to translate every word into every language, with a vision of universal communication,” Anderson says. “Now…”

“I wouldn’t say that’s not what we’re trying to do at this point,” Kamholz interjects. “But we’re also trying to do things that in the relatively short term can immediately help people.”

At the moment, PanLex is looking to partner with international organizations both large and small, from the Red Cross, World Bank and Oxfam to Translators Without Borders.

“If, for example, there are NGOs that deal with disaster preparedness,” Anderson says, “we can provide them with dictionaries of languages with disaster and medical terminology tailored to their specific needs and specific regions.”

PanLex is also looking to partner with global businesses. “There are many businesses that are trying to expand into markets around the world,” Kamholz says. “And they’re getting to the point where the major world languages are not enough for them to reach everyone, and we would have the ability to help them reach more people.”

Katrina Esau, one of the last remaining speakers of a Khoisan language that was thought extinct nearly 40 years ago, teaches her native tongue to a group of school children in Upington, South Africa on 21 September 2015. Photo by Mujahid Safodien/AFP/Getty


PanLex’s vision of overcoming language barriers to human rights is inspiring, to be sure. But there are some who contend that the preservation of a diversity of languages could actually make it more challenging for communities to communicate. In an increasingly globalized and interconnected world, wouldn’t an easier solution to the problem be to have everyone learn the same language, like Mandarin or English? As philosopher Rebecca Roache recently put it:

The advantages to adopting a single language are clear. It would enable us to travel anywhere in the world, confident that we could communicate with the people we met. We would save money on translation and interpretation. Scientific advances and other news could be shared faster and more thoroughly. By preserving a diversity of languages, we preserve the obstacles to communication. Wouldn’t it be better to allow as many languages as possible to die out, leaving us with just one universal lingua franca?

“There are two ways to answer that,” Kamholz says. “One is, well, what about the people who don’t speak those languages yet, what are they supposed to do now? Do we say to them: You won’t have human rights until fifty to a hundred years from now and then you’ll speak English or Mandarin? Those people exist now and still need their rights.”

Endangered Languages in Australia, Indonesia and Papua New Guinea. Via Endangered Languages

 

“But I would go even further and say, we don’t want a world where the only possible future, and the only way to exercise your rights, is to speak English or Mandarin. We want a diverse world with many points of view, with different cultural traditions. We don’t want everyone to be the same in that sense, and we don’t want that to be the only solution. We’re enabling people to access information and exercise their rights, but it’s also driven by this desire for diversity and pluralism. We want to make it easier and more possible for people who are in these under-served language communities to access the information they need, and empower them to make their own decisions regarding the preservation of their cultures, their traditions, their languages. There are a lot of people in the world who want to do that, but it feels like such a lopsided struggle of us against the world. It seems impossible. But we believe PanLex helps make it easier for people to maintain things they want to maintain. This is just one small piece of the many things that need to happen to make that a reality. I’m not under the illusion that we can do it singlehandedly. I just want us to contribute to the process and hopefully inspire others along the way.”

To learn more about PanLex, go to panlex.org or email info@panlex.org.

 

PanLex Looking for Endangered Language Digital Detectives

Posted on Thursday, January 14th, 02016 by Julie Anderson
Categories: Announcements, PanLex

Every word

The PanLex project aims to translate every word from every language into every other language. We already have solid groundwork with 10,000 language varieties and 22 million expressions in the PanLex database, but we still have a long way to go, especially with the more obscure and under-documented languages of the world which are most susceptible to extinction. Ethnologue: Languages of the World explains:

Language endangerment is a serious concern to which linguists and language planners have turned their attention in the last several decades. For a variety of reasons, speakers of many smaller, less dominant languages stop using their heritage language and begin using another.… As a consequence there may be no speakers who use the [heritage] language as their first or primary language and eventually the language may no longer be used at all…. Languages which have not been adequately documented disappear altogether.

In light of the fragile situation facing many smaller languages, PanLex is in a hurry to track down existing data on them, and we’d love to get some help from the larger Long Now community.

Digital detectives

We are looking for some intrepid, word-loving, puzzle-solving language sleuths who can help us search for words in some of the world’s most obscure languages. Do you love a challenge? Are you a brave armchair world traveler? Can you defy the needle-in-haystack odds? Let us send you a list of five little-known languages to search on the Internet using your best cyber-snooping creativity. You’ll think of places to hunt that we haven’t. Find a dictionary, glossary, or sociolinguistic research paper with a word list in the appendix. Check social media, booksellers, and library catalogs. Tweet your followers. If you find something, you’ll simply email us the URL or the publication info and we’ll take it from there.

What your research will do

You’ll be our springboard towards gathering words in more than 2,000 languages still needed in our database, shoring up the most neglected and least documented in the world. Indigenous communities, researchers, translators, students, and linguists will have online access to this valuable data. Increased visibility and accessibility of these languages allows the stakeholders to develop their projects in the directions they choose, be it research, education, or language revitalization. Pressure may be relieved on some smaller communities who are in danger of abandoning their mother tongue in favor of a politically dominant language. Your efforts support long-term preservation of linguistic diversity, accessibility of data, and ultimately improved communication. Plus, you may gather an esoteric name or two to complete your next high-brow crossword puzzle.

Getting technical

If you prefer, we can also use some help on the technical side:

  • designing apps and interfaces
  • localization tools
  • mobile apps
  • graph visualizations
  • adding links to other linguistic or geographic databases
  • investigating translation inference algorithms

We’d love to hear from you

Contact us at anderson@panlex.org to volunteer. Your support will be greatly appreciated!

Getting Wiktionary into PanLex

Posted on Friday, December 4th, 02015 by David Kamholz
Categories: PanLex

If we want to achieve the miracle of translation from any language into any other language, it would be enormously helpful to have a machine that can translate any word, or word-like phrase, from any language into any other language. The PanLex project aims to build exactly that machine. It is documenting all known lexical translations among all the world’s languages and dialects. The project draws mainly on published sources rather than eliciting translations directly from native speakers. An obvious place to turn in working toward this ambitious goal is Wiktionary, an online multilingual dictionary with content curated by thousands of users. Wiktionary contains millions of translations in thousands of languages, and in fact was one of the first sources mined for PanLex in 02006. However, this was done as a rough one-off procedure that could not take advantage of the regular growth of Wiktionary over time. Over the past several months, the PanLex team has been developing a better procedure for incorporating most of Wiktionary’s translations into the PanLex database. This has turned out to be an intricate process.

Wiktionary is in fact many resources, not just one. There are more than 150 editions of Wiktionary, each based on a particular language. Each edition contains entries mainly in that language; many entries include translations into other languages. For example, the English Wiktionary contains an entry for the verb go, whose primary sense “to move through space” is translated into German as either gehen (“to walk”) or fahren (“to go by vehicle”). The German Wiktionary contains separate entries for gehen and fahren, each of which is translated into English as go. Entries among different Wiktionaries must be manually linked, as there is no reliable automatic way to do this.

Several factors make it very difficult to treat different Wiktionaries as a single, uniform, computer-readable resource. Each Wiktionary has its own editorial standards for the structure of an entry, and these standards are not perfectly followed by all editors. Furthermore, the wiki markup in which entries are written is designed to be easy for editors to learn, not easy for computers to parse.

The DBnary project, created by Gilles Sérasset at the Université Joseph Fourier in Grenoble, is an effort to convert some of the largest Wiktionaries (currently 13 editions) into linked online data. This means that the data are computer-readable and made to conform to existing standards for lexical data, language codes, parts of speech, and so on. DBnary is a valuable contribution to making Wiktionaries tractable for PanLex, without which our task would have been much more difficult. However, much additional work has been necessary to make use of DBnary.

DBnary translation map of “cat”

One major challenge in interpreting DBnary for PanLex is language variety identification. DBnary uses three-letter codes, drawn from the ISO 639-3 standard that identifies more than 7,000 languages. PanLex uses codes from this and other ISO 639 standards, but additionally recognizes varieties of each language, which generally correspond to dialects or to different script standards for writing the language. Given a language code and a text string in that language, it is no simple matter to identify the PanLex variety code. Many cases can be resolved with a heuristic that detects the string’s Unicode script (e.g., Roman, Cyrillic, Arabic, Han) and then looks for a variety of the appropriate language which is written in that script. For about a hundred more difficult cases, we have had to create custom mappings and (in a small number of cases) custom code.
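A rough sketch of the script heuristic, for readers who want to see its shape. Everything specific here is an assumption: the variety labels are invented for illustration and are not real PanLex variety codes, and because Python's standard library does not expose the Unicode Script property directly, the sketch approximates it with the first word of each character's Unicode name.

```python
import unicodedata
from collections import Counter

# Hypothetical lookup table: (ISO 639-3 code, script) -> variety label.
# The labels are made up for this example; they are not PanLex codes.
VARIETIES = {
    ("srp", "CYRILLIC"): "srp-cyrillic",
    ("srp", "LATIN"): "srp-latin",
    ("rus", "CYRILLIC"): "rus-cyrillic",
}


def dominant_script(text):
    """Approximate a string's script from the first word of each letter's Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def guess_variety(lang_code, text):
    """Return a variety label for the language and detected script, or None."""
    return VARIETIES.get((lang_code, dominant_script(text)))


print(guess_variety("srp", "мачка"))
print(guess_variety("srp", "mačka"))
```

Strings whose script is absent from the table come back as `None`, which corresponds to the harder cases the post says needed custom mappings.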

Another major challenge in making use of DBnary is lemmatization. PanLex records only the lemma of any given word or phrase, which generally corresponds to a dictionary headword, also known as a citation form. For example, most English nouns are recorded in the singular (table, not tables), and verbs are recorded in the infinitive (go, not goes or went). Wiktionaries generally record lemmas as their translations, but there is significant messiness in the data. We use a variety of heuristics to detect whether a string is likely to be lemmatic. For example, we remove most parenthesized material from strings, so that “divan (old-fashioned)” is converted to “divan”; the complete original string is preserved as a definition. Strings that contain certain characters, such as commas or semicolons, are likely to be lists of translations rather than single translations and are also converted to definitions.
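The two heuristics named above can be sketched as follows. This is a simplified stand-in, not the project's code: it strips every parenthesized group (the post says only "most" are removed), and checking for list characters before stripping parentheses is an ordering chosen here for illustration.

```python
import re

# Matches one parenthesized group plus any whitespace before it.
PAREN = re.compile(r"\s*\([^()]*\)")
# Characters that suggest a list of translations rather than a single lemma.
LIST_CHARS = (",", ";")


def analyze(raw):
    """Return (lemma, definition) for one translation string; either may be None."""
    raw = raw.strip()
    if any(ch in raw for ch in LIST_CHARS):
        # Probably several translations: keep the whole string as a definition.
        return None, raw
    stripped = PAREN.sub("", raw).strip()
    if stripped != raw:
        # Parenthesized material removed; the original is preserved as a definition.
        return (stripped or None), raw
    return raw, None


for sample in ["divan (old-fashioned)", "sofa; couch", "table"]:
    print(sample, "->", analyze(sample))
```

Keeping the original string as a definition means nothing is discarded when a heuristic guesses wrong, which matches the post's description of how non-lemmatic strings are handled.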

We have written extensive custom code to convert all 13 available DBnary editions into a format that can be ingested into the PanLex database. The resulting files contain over 4 million translations. We are still in the process of perfecting the code and expect to have the ingestion completed in 02015. This will represent a substantial contribution to PanLex, which currently contains about 57 million translations. Once the new DBnary-provided Wiktionary data are ingested, we will retire the out-of-date PanLex Wiktionary sources. We will also be able to periodically update PanLex with the latest data from DBnary, thereby incorporating new crowd-sourced Wiktionary translations.

The PanLex project is always looking for skilled help in analyzing sources such as Wiktionary. Other sources, though typically much smaller, present similar challenges. We currently hope to hire a small number of source analysts to process our ever-growing backlog of sources. If this sort of work would interest you, please contact info@panlex.org.

Marie’s Dictionary

Posted on Thursday, August 27th, 02015 by Andrew Warner
Categories: PanLex, Rosetta

This short documentary tells the story of Marie Wilcox, the last fluent speaker of the Wukchumni language, and the dictionary she created in an effort to keep her language alive. Long Now’s PanLex project collects dictionaries such as this one with the goal of creating a universal translation engine and fighting language extinction.

The Front Line of Language Extinction

Posted on Friday, April 17th, 02015 by Andrew Warner
Categories: Digital Dark Age, PanLex, Rosetta

We live in an era of mass extinction of linguistic heritage. Thousands of years of ancestral knowledge and stories are vanishing with the last speakers of hundreds of languages. Come and find out how mobile devices and social media are being used to preserve the “wisdom of the tribe” for generations far into the future.

Linguists worldwide are engaged in an urgent task of recording the world’s languages while there is still time. Oral cultures are in particular jeopardy because they lack a written record. However, languages are disappearing more quickly than they can be preserved, so a new initiative is trying to speed up the work using mobile technologies.

Steven Bird, a linguist and anthologist who spoke for us at The Interval in November 02014, has been testing a new mobile app in Amazonia, Melanesia, and Central Asia. The app, called Aikuma, has been designed by Steven and his team to permit people who speak endangered languages to record and translate their stories and songs. When Steven visited The Interval, he ran a hands-on demonstration of the app, facilitated a discussion of some thorny issues it raised, and shared some of his ingenious solutions. In this recent interview with the Australian Broadcasting Corporation, Steven Bird explains how the app works and how it can be used to save endangered languages.

amazoniatranscribe stevenamazonia

The above photos are from the village of Terra Preta, near Manaus, in the heart of the Brazilian Amazon. Steven’s team worked with local speakers of the Nhengatu language to record, translate, and transcribe the stories of the rainforest. One of the products is a story book illustrated by the children of the village, which has been uploaded to the Internet Archive where anyone can access it.

Steven Bird is a Senior Research Associate at the Linguistic Data Consortium at UPenn and Associate Professor of Computing and Information Systems at the University of Melbourne, Australia. He travels extensively to remote indigenous communities and through a variety of projects he works to bring the power of technology to bear on efforts to preserve the world’s endangered languages.

Shooting for 10,000 Autoglossonyms

Posted on Friday, February 27th, 02015 by Jonathan Pool
Categories: Announcements, PanLex

How many autoglossonyms do you know? Presumably, “English”; probably “español”, “français”, and “Deutsch”; perhaps “русский”, “日本語”, “עברית”, or “हिंदी”.

As you may have guessed, an autoglossonym is the name of a language in that language. While most people know a few of them, PanLex, as a Long Now project, aims to discover and document all of them that can be found, all the way into the farthest corners of the world and the remotest eras in time.

PanLex has amassed facts about words in nearly 10,000 language varieties (languages and their dialects). PanLex prefers to use autoglossonyms in naming language varieties; so far we have collected about 9,000, which we believe to be the largest such collection in existence. In some cases we find phrases that mean “language of the X people” or “language of X region” or “our language” used as autoglossonyms. But in about a thousand cases the PanLex team has not yet found autoglossonyms of any kind, and then we substitute exoglossonyms—names used by outsiders.

Finding autoglossonyms is hardest for extinct languages, languages of small groups, and obscure dialects. For example, PanLex has documented eight varieties of Shoshoni, a Uto-Aztecan language of Nevada, Idaho, Wyoming, and Utah, and for three of these we haven’t found autoglossonyms. Our database contains over 2,600 expressions in Big Smokey Valley Shoshoni, but we still don’t know its autoglossonym. It’s possible that speakers of this variety did not have a name for it, or the name has never been recorded. The search continues.

Using exoglossonyms when autoglossonyms are not available can be a delicate issue. As with names for racial and ethnic groups, names that outsiders give to languages are sometimes considered offensive by the people whose languages are being labeled. The words “Lapp” and “Hottentot”, for example, are generally recognized as pejorative terms for the Saami and Nama languages, respectively. But in many cases a non-native speaker would not recognize a language name as pejorative (for example, “Ngiao” for Shan and “Quottu” for Eastern Oromo).

Autoglossonyms can often be found in the documentation produced by other projects, including Ethnologue, Geonames, Lexvo, and Wikipedia. We use data from all these projects, and we make our data available to them in return.

You can see PanLex’s labels for language varieties on the home page of the expert PanLex interface. If you see any autoglossonyms there that you know to be incorrect, or exoglossonyms that you can replace with autoglossonyms, please notify info@panlex.org.

The Heirlooms of Language Through Temporary Tattoos and a Nickel Disk

Posted on Wednesday, October 23rd, 02013 by Catherine Borgeson
Categories: Events, PanLex, Rosetta

On Saturday October 19, 02013, Long Now participated in Exploratorium Market Days—a series of free, outdoor “mini-festivals” geared to educate the public through the science and art communities and museums. The theme of the month was “Heirlooms,” which focused on the “diverse treasures that we preserve and pass along to future generations.”

Together the Rosetta and PanLex Project staff presented the intangible culture of language in a very tangible way—the Rosetta Disk and temporary tattoos.

The PanLex Project is building an enormous database with the goal of translating all of the words of all of the world’s languages. They created an interface to this database where people could either choose from a list of commonly-used words in tattoos, such as “patience” or “victory,” or enter one of their own choice.

The next screen listed all the translations of that word in the PanLex database, sometimes for hundreds of languages. People were captivated by looking through the list and deciding which language to print their tattoo in. For some, the deciding factor was an interesting script, or the fact that only a handful of people spoke that language. For others it was a language they themselves spoke and personally connected with.

In addition to the PanLex and Rosetta Project staff, Exploratorium Explainers helped run the booth. These are a diverse group of high school students interested in learning new things while explaining and helping others in the process.

Representing the more permanent side of archiving and preserving languages, the Rosetta Disk was also on display. Throughout the day, a steady stream of people viewed the micro-etched languages through a microscope.

Exploratorium’s Director of Public Programs Melissa Alexander invited Long Now to participate in Market Day. She wanted people to get a sense of the vast number of languages while understanding that, like many species, languages are endangered and are disappearing from the planet regularly.

“I had a Ray Bradbury moment–I wanted everyone to learn how to say hello, please & thank you and welcome in at least one endangered language. Loved the setup and clearly our Explainers did too–if our Explainers like it, it’s golden–teenagers are great thermometers.”

Rosetta and PanLex Projects at Exploratorium Market Days 10/19/13

Posted on Thursday, October 17th, 02013 by Austin Brown
Categories: Events, PanLex, Rosetta

This Saturday, October the 19th, Rosetta and PanLex Project staff will be at the Exploratorium’s final Market Days event of this year. The Exploratorium has been holding these free, outdoor events in the spirit of “exchanging fresh ideas on local phenomena.” Saturday’s theme is Heirlooms, and Rosetta and PanLex will showcase our planet’s diverse linguistic stock.

Come to the Rosetta / PanLex Project booth where you can:

  • Learn about the thousands of languages spoken around the world, why many of them are endangered, and why this is important for everybody.
  • Learn how you can make and archive language recordings that document the languages used in your family, classroom and community.
  • Use the PanLex tattoo generator to make a temporary tattoo using words from thousands of languages around the world.
  • See a real Rosetta Disk – an archive of thousands of the world’s languages that you can read with a microscope and hold in the palm of your hand.

The event runs from 11:00am to 3:00pm at the Exploratorium’s new location at Pier 15.

PanLex hits a billion translations

Posted on Wednesday, October 2nd, 02013 by Jonathan Pool
Categories: Announcements, PanLex, Rosetta

The PanLex project of The Long Now Foundation, which is building a database of words and phrases in the world’s languages, has recently passed the one-billion-translation mark. That means there are now over a billion pairs of words or phrases, such as “clock” in English and “ঘড়ী” in Assamese, that PanLex records as attested translations of each other. The translations are derived from publications collected from around the world.

Beyond these billion attested translations, it is possible to infer others from longer paths of translations. For example, the number of pairs shoots up from 1 billion to 30 billion if we include translations at distance 2, namely translations of translations.  The longer the path, the greater the number, and the lower the reliability, of translations.
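To make "distance 2" concrete, here is a small sketch that counts attested pairs and the extra pairs reachable as translations of translations. The graph is toy data (only the English and Assamese pair comes from the post), and treating any two-step path as an inferred pair, with no filtering by language or reliability, is a simplification assumed for the example.

```python
from itertools import combinations

# Attested translation pairs, stored as unordered pairs of "lang:word" labels.
# Toy data for illustration; not an extract of the PanLex database.
ATTESTED = {
    frozenset(pair)
    for pair in [
        ("eng:clock", "asm:ঘড়ী"),
        ("eng:clock", "fra:horloge"),
        ("fra:horloge", "deu:Uhr"),
        ("deu:Uhr", "eng:watch"),
    ]
}


def neighbors(pairs):
    """Map each expression to the set of expressions it is attested with."""
    adj = {}
    for pair in pairs:
        a, b = tuple(pair)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def distance2_pairs(pairs):
    """Return unordered pairs joined by a two-step path but not attested directly."""
    inferred = set()
    for linked in neighbors(pairs).values():
        # Any two expressions sharing a middle expression are at distance 2.
        for a, b in combinations(sorted(linked), 2):
            pair = frozenset((a, b))
            if pair not in pairs:
                inferred.add(pair)
    return inferred


inferred = distance2_pairs(ATTESTED)
print(len(ATTESTED), "attested;", len(ATTESTED) + len(inferred), "including distance 2")
```

Even in this tiny graph the count nearly doubles once two-step paths are included, and none of the inferred pairs has been checked by a lexicographer, which is the reliability trade-off the post describes.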

Because counting up these totals would overload the PanLex servers, we have estimated them using a random sample of 3,000 words and phrases. The figures below show that, as more words and phrases are added to the sample, the estimates of distance-1 and distance-2 translations become more stable.

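[Figure: estimate of distance-1 translations as the sample grows]

[Figure: estimate of distance-2 translations as the sample grows]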
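One way such a sample-based estimate could work is sketched below. The post does not say which estimator was used, so this is an assumption: take the average number of translations per sampled expression, scale it to the whole population, and halve it because each pair is seen from both ends. The per-expression counts are synthetic.

```python
import random


def running_estimates(degree_of, population, sample_size, seed=0):
    """Estimate the total number of pairs from growing prefixes of one random sample."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    estimates = []
    total = 0
    for n, expr in enumerate(sample, start=1):
        total += degree_of(expr)
        # Mean count per expression, scaled to the population; each pair counted twice.
        estimates.append(total / n * len(population) / 2)
    return estimates


# Synthetic translation counts per expression, for illustration only.
DEGREES = [(i * 37) % 11 for i in range(5000)]
POPULATION = list(range(len(DEGREES)))

estimates = running_estimates(lambda i: DEGREES[i], POPULATION, 3000)
for n in (10, 100, 1000, 3000):
    print(n, round(estimates[n - 1]))
```

Printing the estimate at increasing sample sizes gives the same kind of curve the figures describe: it wobbles at first and then settles as more expressions are added.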

 

The main goal of the PanLex database is to make it possible ultimately to translate any word or phrase in any language into any other language on Earth. With about 7,000 languages, and assuming an average of 100,000 words and phrases per language, there should eventually be about 2.5 trillion translation pairs available from PanLex. Project participants don’t hope to reach this total on their own. Instead, they plan to provide their data to researchers who will develop increasingly effective methods of automatically inferring unattested translations from networks of attested ones.
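The 2.5 trillion figure can be reproduced with a short calculation. The post gives only the inputs and the result, so the reading below is an assumption: every expression has exactly one counterpart in each of the other languages, and each pair is counted once regardless of direction.

```python
LANGUAGES = 7_000
EXPRESSIONS_PER_LANGUAGE = 100_000

# Total expressions across all languages: 700 million.
expressions = LANGUAGES * EXPRESSIONS_PER_LANGUAGE

# Assumption: one counterpart per expression in every other language.
ordered_pairs = expressions * (LANGUAGES - 1)

# Each translation pair is counted from both ends, so halve the total.
unordered_pairs = ordered_pairs // 2

print(f"{unordered_pairs:,} pairs, about {unordered_pairs / 1e12:.2f} trillion")
```

Under these assumptions the total comes to roughly 2.45 trillion, in line with the post's "about 2.5 trillion".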