Building an Audio Collection for All the World’s Languages

July 21st, 02010 by Laine Stranahan

The Rosetta Project is pleased to announce the Parallel Speech Corpus Project, a year-long volunteer-based effort to collect parallel recordings in languages representing at least 95% of the world’s speakers. The resulting corpus will include audio recordings of the same set of texts in hundreds of languages, each accompanied by a transcription. This will provide a platform for creating new educational and preservation-oriented tools, as well as technologies that may one day allow artificial systems to comprehend, translate, and generate human languages.

Huge text and speech corpora of varying degrees of structure already exist for many of the most widely spoken languages in the world—English is probably the most extensively documented, followed by other majority languages like Russian, Spanish, and Portuguese. Given some degree of access to these corpora (though many are not publicly accessible), research, education and preservation efforts in the ten languages which represent 50% of the world’s speakers (Mandarin, Spanish, English, Hindi, Urdu, Arabic, Bengali, Portuguese, Russian and Japanese) can be relatively well-resourced.

But what about the other half of the world? The next 290 most widely spoken languages account for another 45% of the population, and the remaining 6,500 or so are spoken by only 5%. This latter group represents the “long tail” of human languages:

Long_Tail_of_Languages.jpg
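
The figures above can be tallied in a short sketch. The language counts and speaker shares come from the post; the tier labels and the function name are illustrative assumptions, and 6,500 is only the post's approximate figure ("or so").

```python
# Tiers as described in the post: (label, number of languages, % of world speakers).
# Labels are illustrative; 6,500 is the post's approximate count for the long tail.
TIERS = [
    ("top 10", 10, 50),
    ("next 290", 290, 45),
    ("long tail", 6500, 5),
]


def cumulative_coverage(tiers):
    """Return running totals of languages and speaker share after each tier."""
    rows = []
    total_langs = 0
    total_share = 0
    for label, langs, share in tiers:
        total_langs += langs
        total_share += share
        rows.append((label, total_langs, total_share))
    return rows


for label, langs, share in cumulative_coverage(TIERS):
    print(f"through {label}: {langs} languages, {share}% of speakers")
```

On these numbers, the running total after the second tier is 300 languages covering 95% of speakers, which is the project's stated target.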

Documenting all the world’s languages equally is an enormous challenge, especially in light of the tremendous quantity and diversity represented by the long tail. The Parallel Speech Corpus Project will take a first step toward universal documentation of human languages: its goal is to document the top 300 and to provide a model that can then be extended out to the long tail. Eventually, researchers, educators and engineers should all have access to every living human language, creating new opportunities for expanding knowledge and technology and helping to preserve our threatened linguistic diversity.

This project is made possible through the support and sponsorship of speech technology expert James Baker and will be developed in partnership with his ALLOW initiative. We will be putting out a call for volunteers soon. In the meantime, please contact rosetta@longnow.org with questions or suggestions.

This entry was posted on Wednesday, July 21st, 02010 at 1:09 pm and is filed under Long Term Science, Rosetta, Technology.

  • http://ighalsk.blogspot.com Luke Schubert

    Excellent.

  • E I N

    Good to see. I’ve posted this at the Phi Beta Iota public intelligence journal (thanks to the twitter feed of argotechnica) – http://www.phibetaiota.net/?p=27156

  • Davide Bocelli

    Very good !

  • http://twitter.com/sally_j Practical Archivist

    Exciting project to build an audio collection of the world's languages. I'm curious what media they plan to use in order to preserve the recordings for as long as possible…

  • http://profiles.yahoo.com/u/C457RWVNM2URJSL52FSWFKJJ6U A. Marina P. Fournier

    Wow! THis is grat, esp. in light of languages dying out with the last speakers of same.

  • http://profiles.yahoo.com/u/C457RWVNM2URJSL52FSWFKJJ6U A. Marina P. Fournier

    that's “great”, not grat.

  • Batshua

    Would this include video for sign languages? It would be nice to preserve those, as well.

  • Laine

    Hi Practical Archivist! Following the “movage” model of Danny Hillis, et al. (http://blog.longnow.org/2008/12/11/movage/), we’re storing our data at The Internet Archive (www.archive.org), where it will be publicly accessible and indefinitely downloadable. According to the movage model, continuing changes in physical modes of storage mean that the cycle of copying and moving data from one medium to another is inevitable, and in light of this, the more copies there are of some piece of data, the more likely it is to survive in such a rapidly evolving environment.

  • Laine

    Batshua–absolutely!

Some Rights Reserved (CC)
