Blog Archive for the ‘PanLex’ Category

PanLex Looking for Endangered Language Digital Detectives

Posted on Thursday, January 14th, 02016 by Julie Anderson
Categories: Announcements, PanLex

Every word

The PanLex project aims to translate every word from every language into every other language. We already have solid groundwork with 10,000 language varieties and 22 million expressions in the PanLex database, but we still have a long way to go, especially with the more obscure and under-documented languages of the world which are most susceptible to extinction. Ethnologue: Languages of the World explains:

Language endangerment is a serious concern to which linguists and language planners have turned their attention in the last several decades. For a variety of reasons, speakers of many smaller, less dominant languages stop using their heritage language and begin using another.… As a consequence there may be no speakers who use the [heritage] language as their first or primary language and eventually the language may no longer be used at all…. Languages which have not been adequately documented disappear altogether.

In light of the fragile situation facing many smaller languages, PanLex is in a hurry to track down existing data on them, and we’d love to get some help from the larger Long Now community.

Digital detectives

We are looking for some intrepid, word-loving, puzzle-solving language sleuths who can help us search for words in some of the world’s most obscure languages. Do you love a challenge? Are you a brave armchair world traveler? Can you defy the needle-in-haystack odds? Let us send you a list of five little-known languages to search on the Internet using your best cyber-snooping creativity. You’ll think of places to hunt that we haven’t. Find a dictionary, glossary, or sociolinguistic research paper with a word list in the appendix. Check social media, booksellers, and library catalogs. Tweet your followers. If you find something, you’ll simply email us the URL or the publication info and we’ll take it from there.

What your research will do

You’ll be our springboard towards gathering words in more than 2,000 languages still needed in our database, shoring up the most neglected and least documented in the world. Indigenous communities, researchers, translators, students, and linguists will have online access to this valuable data. Increased visibility and accessibility of these languages allow the stakeholders to develop their projects in the directions they choose, be it research, education, or language revitalization. Pressure may be relieved on some smaller communities who are in danger of abandoning their mother tongue in favor of a politically dominant language. Your efforts support long-term preservation of linguistic diversity, accessibility of data, and ultimately improved communication. Plus, you may gather an esoteric name or two to complete your next high-brow crossword puzzle.

Getting technical

If you prefer, we can also use some help on the technical side:

  • designing apps and interfaces
  • localization tools
  • mobile apps
  • graph visualizations
  • adding links to other linguistic or geographic databases
  • investigating translation inference algorithms

We’d love to hear from you

Contact us at anderson@panlex.org to volunteer. Your support will be greatly appreciated!

Getting Wiktionary into PanLex

Posted on Friday, December 4th, 02015 by David Kamholz
Categories: PanLex

If we want to achieve the miracle of translation from any language into any other language, it would be enormously helpful to have a machine that can translate any word, or word-like phrase, from any language into any other language. The PanLex project aims to build exactly that machine. It is documenting all known lexical translations among all the world’s languages and dialects. The project draws mainly on published sources rather than eliciting translations directly from native speakers. An obvious place to turn in working toward this ambitious goal is Wiktionary, an online multilingual dictionary with content curated by thousands of users. Wiktionary contains millions of translations in thousands of languages, and in fact was one of the first sources mined for PanLex in 02006. However, this was done as a rough one-off procedure that could not take advantage of the regular growth of Wiktionary over time. Over the past several months, the PanLex team has been developing a better procedure for incorporating most of Wiktionary’s translations into the PanLex database. This has turned out to be an intricate process.

Wiktionary is in fact many resources, not just one. There are more than 150 editions of Wiktionary, each based on a particular language. Each edition contains entries mainly in that language; many entries include translations into other languages. For example, the English Wiktionary contains an entry for the verb go, whose primary sense “to move through space” is translated into German as either gehen (“to walk”) or fahren (“to go by vehicle”). The German Wiktionary contains separate entries for gehen and fahren, each of which is translated into English as go. Entries among different Wiktionaries must be manually linked, as there is no reliable automatic way to do this.

Several factors make it very difficult to treat different Wiktionaries as a single, uniform, computer-readable resource. Each Wiktionary has its own editorial standards for the structure of an entry, and these standards are not perfectly followed by all editors. Furthermore, the wiki markup in which entries are written is designed to be easy for editors to learn, not easy for computers to parse.

The DBnary project, created by Gilles Sérasset at the Université Joseph Fourier in Grenoble, is an effort to convert some of the largest Wiktionaries (currently 13 editions) into linked online data. This means that the data are computer-readable and made to conform to existing standards for lexical data, language codes, parts of speech, and so on. DBnary is a valuable contribution to making Wiktionaries tractable for PanLex, without which our task would have been much more difficult. However, much additional work has been necessary to make use of DBnary.

DBnary translation map of “cat”

One major challenge in interpreting DBnary for PanLex is language variety identification. DBnary uses three-letter codes, drawn from the ISO 639-3 standard that identifies more than 7,000 languages. PanLex uses codes from this and other ISO 639 standards, but additionally recognizes varieties of each language, which generally correspond to dialects or to different script standards for writing the language. Given a language code and a text string in that language, it is no simple matter to identify the PanLex variety code. Many cases can be resolved with a heuristic that detects the string’s Unicode script (e.g., Roman, Cyrillic, Arabic, Han) and then looks for a variety of the appropriate language which is written in that script. For about a hundred more difficult cases, we have had to create custom mappings and (in a small number of cases) custom code.
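The script-based heuristic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PanLex team's actual code: the `VARIETIES` table and its variety codes are invented for the example, and the script is guessed from the first word of each character's Unicode name, since Python's standard library does not expose the Unicode script property directly.

```python
import unicodedata
from collections import Counter

# Hypothetical lookup table: (ISO 639-3 code, script) -> variety code.
# These entries are invented for illustration; they are not real PanLex data.
VARIETIES = {
    ("srp", "Cyrillic"): "srp-x-cyrl",
    ("srp", "Latin"): "srp-x-latn",
    ("rus", "Cyrillic"): "rus-x-cyrl",
}

# First word of a Unicode character name -> script label.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Arabic",
    "CJK": "Han",
}


def detect_script(text):
    """Return the most common known script among the letters of text, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, spaces, and punctuation
        prefix = unicodedata.name(ch, "").split(" ")[0]
        script = NAME_PREFIX_TO_SCRIPT.get(prefix)
        if script:
            counts[script] += 1
    return counts.most_common(1)[0][0] if counts else None


def guess_variety(lang_code, text):
    """Look up a variety for the language written in the detected script."""
    return VARIETIES.get((lang_code, detect_script(text)))
```

A call such as `guess_variety("srp", "мачка")` selects the Cyrillic entry of the toy table, while a pair with no matching entry returns `None`, which is where custom mappings would have to take over.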

Another major challenge in making use of DBnary is lemmatization. PanLex records only the lemma of any given word or phrase, which generally corresponds to a dictionary headword, also known as a citation form. For example, most English nouns are recorded in the singular (table, not tables), and verbs are recorded in the infinitive (go, not goes or went). Wiktionaries generally record lemmas as their translations, but there is significant messiness in the data. We use a variety of heuristics to detect whether a string is likely to be lemmatic. For example, we remove most parenthesized material from strings, so that “divan (old-fashioned)” is converted to “divan”; the complete original string is preserved as a definition. Strings that contain certain characters, such as commas or semicolons, are likely to be lists of translations rather than single translations and are also converted to definitions.
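The two heuristics mentioned above (stripping parenthesized material, and treating strings with commas or semicolons as lists) might look something like this. It is a minimal sketch: the function name `classify` and the `(expression, definition)` return shape are assumptions made for the example, not PanLex's real code, which uses a wider variety of heuristics.

```python
import re

# Matches a parenthesized chunk plus any whitespace before it.
PAREN = re.compile(r"\s*\([^()]*\)")
# Characters that suggest a list of translations rather than one lemma.
LIST_CHARS = (",", ";")


def classify(raw):
    """Return (expression, definition) for a raw translation string.

    expression is the likely lemma, or None if no lemma could be extracted;
    definition is the full original string when it was altered or rejected,
    otherwise None.
    """
    text = raw.strip()
    if any(c in text for c in LIST_CHARS):
        # Probably a list of translations: keep it only as a definition.
        return None, text
    stripped = PAREN.sub("", text).strip()
    if stripped != text:
        # Parenthesized material was removed: keep the original as a definition.
        return (stripped or None), text
    return text, None
```

With this sketch, `classify("divan (old-fashioned)")` yields the lemma “divan” together with the original string as a definition, and `classify("sofa; couch")` yields no lemma at all.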

We have written extensive custom code to convert all 13 available DBnary editions into a format that can be ingested into the PanLex database. The resulting files contain over 4 million translations. We are still in the process of perfecting the code and expect to have the ingestion completed in 02015. This will represent a substantial contribution to PanLex, which currently contains about 57 million translations. Once the new DBnary-provided Wiktionary data are ingested, we will retire the out-of-date PanLex Wiktionary sources. We will also be able to periodically update PanLex with the latest data from DBnary, thereby incorporating new crowd-sourced Wiktionary translations.

The PanLex project is always looking for skilled help in analyzing sources such as Wiktionary. Other sources, though typically much smaller, present similar challenges. We currently hope to hire a small number of source analysts to process our ever-growing backlog of sources. If this sort of work would interest you, please contact info@panlex.org.

Marie’s Dictionary

Posted on Thursday, August 27th, 02015 by Andrew Warner
Categories: PanLex, Rosetta

This short documentary tells the story of Marie Wilcox, the last fluent speaker of the Wukchumni language, and the dictionary she created in an effort to keep her language alive. Long Now’s PanLex project collects dictionaries such as these with the goal of creating a universal translation engine and fighting language extinction.

The Front Line of Language Extinction

Posted on Friday, April 17th, 02015 by Andrew Warner
Categories: Digital Dark Age, PanLex, Rosetta

We live in an era of mass extinction of linguistic heritage. Thousands of years of ancestral knowledge and stories are vanishing with the last speakers of hundreds of languages. Come and find out how mobile devices and social media are being used to preserve the “wisdom of the tribe” for generations far into the future.

Linguists worldwide are engaged in the urgent task of recording the world’s languages while there is still time. Oral cultures are in particular jeopardy because they lack a written record. However, the languages are disappearing more quickly than they can be preserved, and so a new initiative is trying to speed up the work using mobile technologies.

Steven Bird, a linguist and anthologist who spoke for us at The Interval in November 02014, has been testing a new mobile app in Amazonia, Melanesia, and Central Asia. The app, called Aikuma, has been designed by Steven and his team to permit people who speak endangered languages to record and translate their stories and songs. When Steven visited The Interval, he ran a hands-on demonstration of the app, facilitated a discussion of some thorny issues it raised, and shared some of his ingenious solutions. In this recent interview with the Australian Broadcasting Corporation, Steven Bird explains how the app works and how it can be used to save endangered languages.

amazoniatranscribe stevenamazonia

The above photos are from the village of Terra Preta, near Manaus, in the heart of the Brazilian Amazon. Steven’s team worked with local speakers of the Nhengatu language to record, translate, and transcribe the stories of the rainforest. One of the products is a story book illustrated by the children of the village, which has been uploaded to the Internet Archive where anyone can access it.

Steven Bird is a Senior Research Associate at the Linguistic Data Consortium at UPenn and Associate Professor of Computing and Information Systems at the University of Melbourne, Australia. He travels extensively to remote indigenous communities and, through a variety of projects, works to bring the power of technology to bear on efforts to preserve the world’s endangered languages.

Shooting for 10,000 Autoglossonyms

Posted on Friday, February 27th, 02015 by Jonathan Pool
Categories: Announcements, PanLex

How many autoglossonyms do you know? Presumably, “English”; probably “español”, “français”, and “Deutsch”; perhaps “русский”, “日本語”, “עברית”, or “हिंदी”.

As you may have guessed, an autoglossonym is the name of a language in that language. While most people know a few of them, PanLex, as a Long Now project, aims to discover and document all of them that can be found, all the way into the farthest corners of the world and the remotest eras in time.

PanLex has amassed facts about words in nearly 10,000 language varieties (languages and their dialects). PanLex prefers to use autoglossonyms in naming language varieties; so far we have collected about 9,000, which we believe to be the largest such collection in existence. In some cases we find phrases that mean “language of the X people” or “language of X region” or “our language” used as autoglossonyms. But in about a thousand cases the PanLex team has not yet found autoglossonyms of any kind, and then we substitute exoglossonyms—names used by outsiders.

Finding autoglossonyms is hardest for extinct languages, languages of small groups, and obscure dialects. For example, PanLex has documented eight varieties of Shoshoni, a Uto-Aztecan language of Nevada, Idaho, Wyoming, and Utah, and for three of these we haven’t found autoglossonyms. Our database contains over 2,600 expressions in Big Smokey Valley Shoshoni, but we still don’t know its autoglossonym. It’s possible that speakers of this variety did not have a name for it, or the name has never been recorded. The search continues.

Using exoglossonyms when autoglossonyms are not available can be a delicate issue. As with names for racial and ethnic groups, names that outsiders give to languages are sometimes considered offensive by the people whose languages are being labeled. The words “Lapp” and “Hottentot”, for example, are generally recognized as pejorative terms for the Saami and Nama languages, respectively. But in many cases a non-native speaker would not recognize a language name as pejorative (for example, “Ngiao” for Shan and “Quottu” for Eastern Oromo).

Autoglossonyms can often be found in the documentation produced by other projects, including Ethnologue, Geonames, Lexvo, and Wikipedia. We use data from all these projects, and we make our data available to them in return.

You can see PanLex’s labels for language varieties on the home page of the expert PanLex interface. If you see any autoglossonyms there that you know to be incorrect, or exoglossonyms that you can replace with autoglossonyms, please notify info@panlex.org.

The Heirlooms of Language Through Temporary Tattoos and a Nickel Disk

Posted on Wednesday, October 23rd, 02013 by Catherine Borgeson
Categories: Events, PanLex, Rosetta

On Saturday, October 19, 02013, Long Now participated in Exploratorium Market Days, a series of free, outdoor “mini-festivals” meant to educate the public through the science and art communities and museums. The theme of the month was “Heirlooms,” which focused on the “diverse treasures that we preserve and pass along to future generations.”

Together the Rosetta and PanLex Project staff presented the intangible culture of language in a very tangible way—the Rosetta Disk and temporary tattoos.

The PanLex Project is building an enormous database with the goal of translating all of the words of all of the world’s languages. They created an interface to this database where people could either choose from a list of words commonly used in tattoos, such as “patience” or “victory,” or enter one of their own choice.

The next screen listed all the translations of that word in the PanLex database, sometimes for hundreds of languages. People were captivated by the list as they decided which language to print their tattoo in. For some, the deciding factor was an interesting script, or the fact that only a handful of people spoke that language. For others it was a language they themselves spoke and personally connected with.

In addition to the PanLex and Rosetta Project staff, Exploratorium Explainers helped run the booth. These are a diverse group of high school students interested in learning new things while explaining and helping others in the process.

Market Day 1

Market Day 2

Market Day 3

Representing the more permanent side of archiving and preserving languages, the Rosetta Disk was also on display. A steady stream of people viewed the micro-etched languages through a microscope throughout the day.

Market Day 4 Market Day 5

Market Day 6 Market Day 7

Exploratorium’s Director of Public Programs Melissa Alexander invited Long Now to participate in Market Day. She wanted people to get a sense of the vast number of languages while understanding that, like many species, languages are endangered and are disappearing from the planet regularly.

“I had a Ray Bradbury moment–I wanted everyone to learn how to say hello, please & thank you and welcome in at least one endangered language. Loved the setup and clearly our Explainers did too–if our Explainers like it, it’s golden–teenagers are great thermometers.”

Rosetta and PanLex Projects at Exploratorium Market Days 10/19/13

Posted on Thursday, October 17th, 02013 by Austin Brown
Categories: Events, PanLex, Rosetta

MarketDays

This Saturday, October 19th, Rosetta and PanLex Project staff will be at the Exploratorium’s final Market Days event of this year. The Exploratorium has been holding these free, outdoor events in the spirit of “exchanging fresh ideas on local phenomena.” Saturday’s theme is Heirlooms, and Rosetta and PanLex will showcase our planet’s diverse linguistic stock.

Come to the Rosetta / PanLex Project booth where you can:

  • Learn about the thousands of languages spoken around the world, why many of them are endangered, and why this is important for everybody.
  • Learn how you can make and archive language recordings that document the languages used in your family, classroom and community.
  • Use the PanLex tattoo generator to make a temporary tattoo using words from thousands of languages around the world.
  • See a real Rosetta Disk – an archive of thousands of the world’s languages that you read with a microscope and can hold in the palm of your hand.

The event runs from 11:00am to 3:00pm at the Exploratorium’s new location at Pier 15.

PanLex hits a billion translations

Posted on Wednesday, October 2nd, 02013 by Jonathan Pool
Categories: Announcements, PanLex, Rosetta

The PanLex project of The Long Now Foundation, which is building a database of words and phrases in the world’s languages, has recently passed the one-billion-translation mark. That means there are now over a billion pairs of words or phrases, such as “clock” in English and “ঘড়ী” in Assamese, that PanLex records as attested translations of each other. The translations are derived from publications collected from around the world.

Beyond these billion attested translations, it is possible to infer others from longer paths of translations. For example, the number of pairs shoots up from 1 billion to 30 billion if we include translations at distance 2, namely translations of translations. The longer the path, the greater the number of translations and the lower their reliability.
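To make “distance 2” concrete, here is a small sketch that treats attested translations as an undirected graph and collects translations of translations. The data structure, the function names, and the three-expression toy data set are assumptions made for illustration; they do not reflect how the PanLex database is actually stored or queried.

```python
from collections import defaultdict


def build_graph(pairs):
    """Build an undirected graph from attested translation pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def distance2(graph, expr):
    """Return expressions reachable in exactly two steps but not one."""
    direct = graph.get(expr, set())
    two_step = set()
    for middle in direct:
        two_step |= graph.get(middle, set())
    # Drop the starting expression and anything already attested directly.
    return two_step - direct - {expr}


# Toy data: each expression is a (language code, text) tuple.
attested = [
    (("eng", "clock"), ("deu", "Uhr")),
    (("deu", "Uhr"), ("fra", "horloge")),
]
graph = build_graph(attested)
print(distance2(graph, ("eng", "clock")))  # {('fra', 'horloge')}
```

Here the English and French expressions are never attested as a pair, but the shared German translation links them. The same mechanism that multiplies the number of pairs also lowers reliability, since a middle word with several senses can connect unrelated expressions.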

Because counting up these totals would overload the PanLex servers, we have estimated them using a random sample of 3,000 words and phrases. The figures below show that, as more words and phrases are added to the sample, the estimates of distance-1 and distance-2 translations become more stable.
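The sampling idea can be sketched as follows. This is not the procedure PanLex actually ran; it is a minimal illustration that assumes we can look up, for each sampled expression, how many translations it has, and that a translation pair is counted once (so the sum over all expressions counts every pair twice). The function and parameter names are hypothetical.

```python
import random


def running_estimates(degrees, sample_size, seed=0):
    """Estimate the total number of translation pairs from a random sample.

    degrees: the number of translations of every expression in the
             population (known here only because this is a toy example).
    Returns one estimate per sample size from 1 up to sample_size, so the
    way the estimate settles down as the sample grows can be inspected.
    """
    rng = random.Random(seed)
    sample = rng.sample(degrees, sample_size)
    population = len(degrees)
    total = 0
    estimates = []
    for k, degree in enumerate(sample, start=1):
        total += degree
        mean_degree = total / k
        # Each pair is seen from both of its ends, hence the division by 2.
        estimates.append(mean_degree * population / 2)
    return estimates
```

Plotting the returned list against the sample size gives the kind of curve the figures describe: noisy for the first few expressions, then flattening out.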

distance1

distance2


The main goal of the PanLex database is to make it possible ultimately to translate any word or phrase in any language into any other language on Earth. With about 7,000 languages, and assuming an average of 100,000 words and phrases per language, there should eventually be about 2.5 trillion translation pairs available from PanLex. Project participants don’t hope to reach this total on their own. Instead, they plan to provide their data to researchers who will develop increasingly effective methods of automatically inferring unattested translations from networks of attested ones.
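One way to arrive at a figure of about 2.5 trillion is sketched below. The post does not spell out the calculation, so this is a reconstruction under a simplifying assumption: every word or phrase has exactly one translation in each of the other languages, and each pair is counted once regardless of direction.

```python
languages = 7_000
expressions_per_language = 100_000

# Every expression paired with one translation in each other language.
ordered_pairs = languages * expressions_per_language * (languages - 1)

# A pair (a, b) is the same translation as (b, a), so count it once.
unordered_pairs = ordered_pairs // 2

print(f"{unordered_pairs:,}")  # 2,449,650,000,000
```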