PanLex hits a billion translations

Posted on Wednesday, October 2nd, 02013 by Jonathan Pool
Categories: Announcements, PanLex, Rosetta

The PanLex project of The Long Now Foundation, which is building a database of words and phrases in the world’s languages, has recently passed the one-billion-translation mark. That means there are now over a billion pairs of words or phrases, such as “clock” in English and “ঘড়ী” in Assamese, that PanLex records as attested translations of each other. The translations are derived from publications collected from around the world.

Beyond these billion attested translations, it is possible to infer others from longer paths of translations. For example, the number of pairs shoots up from 1 billion to 30 billion if we include translations at distance 2, namely translations of translations. The longer the path, the more translations it yields, but the less reliable they are.
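To make "translations of translations" concrete, here is a minimal sketch on a toy graph. The data, the language-prefixed labels, and the function names are all hypothetical, and it assumes attested translations are symmetric; it is not PanLex's actual schema or inference method.

```python
from collections import defaultdict

# Hypothetical toy data: each pair is an attested (distance-1) translation.
ATTESTED = [
    ("eng:clock", "asm:ঘড়ী"),
    ("eng:clock", "fra:horloge"),
    ("fra:horloge", "deu:Uhr"),
    ("eng:watch", "deu:Uhr"),
]

def build_graph(pairs):
    """Map each expression to the set of its attested translations."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

def distance2_pairs(graph):
    """Unordered pairs joined by a two-step path but not attested directly."""
    inferred = set()
    for neighbours in graph.values():
        # Any two translations of the same expression are at distance <= 2.
        for a in neighbours:
            for b in neighbours:
                if a < b and b not in graph[a]:
                    inferred.add((a, b))
    return inferred

graph = build_graph(ATTESTED)
inferred = distance2_pairs(graph)
print(len(ATTESTED), "attested pairs,", len(inferred), "inferred at distance 2")
```

In this toy graph, "clock" and "Uhr" are never attested as a pair, but both translate "horloge", so the pair is inferred at distance 2. The same step also links "horloge" to "watch", which hints at why longer paths are less reliable.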

Because counting up these totals would overload the PanLex servers, we have estimated them using a random sample of 3,000 words and phrases. The figures below show that, as more words and phrases are added to the sample, the estimates of distance-1 and distance-2 translations become more stable.
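The post does not say which estimator was used, so the following is only a sketch of one plausible approach: count the translations of each sampled expression, scale the sample mean up to the whole population, and halve the result because every pair is seen from both ends. The synthetic population and the function names are assumptions made for illustration.

```python
import random

def estimate_total_pairs(sample_degrees, population_size):
    """Scale the mean translations-per-expression up to the population.

    Each pair is counted once from each end, hence the division by 2.
    """
    mean_degree = sum(sample_degrees) / len(sample_degrees)
    return population_size * mean_degree / 2

def running_estimates(sample_degrees, population_size):
    """Estimate after each additional sampled expression."""
    estimates = []
    running_sum = 0
    for n, degree in enumerate(sample_degrees, start=1):
        running_sum += degree
        estimates.append(population_size * (running_sum / n) / 2)
    return estimates

# Synthetic stand-in for the database: a translation count per expression.
rng = random.Random(0)
population = [rng.randint(0, 20) for _ in range(100_000)]

sample = rng.sample(population, 3000)
estimates = running_estimates(sample, len(population))
exact = sum(population) / 2

for n in (10, 100, 1000, 3000):
    print(f"sample of {n:>4}: estimate {estimates[n - 1]:,.0f}")
print(f"exact count:     {exact:,.0f}")
```

Printing the running estimate at growing sample sizes is a text version of the stabilization the figures illustrate. The same scheme would cover distance 2 by counting each sampled expression's distance-2 translations instead.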

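[Figure: estimate of distance-1 translations as the sample grows]

[Figure: estimate of distance-2 translations as the sample grows]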

The main goal of the PanLex database is ultimately to make it possible to translate any word or phrase in any language into any other language on Earth. With about 7,000 languages, and assuming an average of 100,000 words and phrases per language, there should eventually be about 2.5 trillion translation pairs available from PanLex. Project participants do not expect to reach this total on their own. Instead, they plan to provide their data to researchers who will develop increasingly effective methods of automatically inferring unattested translations from networks of attested ones.
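The post does not spell out how the 2.5 trillion figure is derived. One reading that reproduces it, offered here as an assumption, is to count every unordered pair of languages once and suppose each pair shares 100,000 translations:

```python
from math import comb

LANGUAGES = 7_000                   # approximate figure from the post
EXPRESSIONS_PER_LANGUAGE = 100_000  # assumed average from the post

# Unordered pairs of distinct languages: 7,000 * 6,999 / 2.
language_pairs = comb(LANGUAGES, 2)

# Assumption: one translation per expression for each language pair.
translation_pairs = language_pairs * EXPRESSIONS_PER_LANGUAGE

print(f"{language_pairs:,} language pairs")
print(f"{translation_pairs:,} translation pairs")
```

That comes to 24,496,500 language pairs and 2,449,650,000,000 translation pairs, which rounds to about 2.5 trillion.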
