Blog Archive for the ‘Rosetta’ Category



Endangered Language Linguist awarded prestigious MacArthur Fellowship

Published on Wednesday, September 29th, 02010 by Laura Welcher

Jessie Little Doe Baird, a linguist who has worked for years on reviving the Wampanoag (Wôpanâak) Language, has just been awarded a 02010 MacArthur “Genius” Fellowship in honor of her work and research.

Baird, who is of Wampanoag heritage, studied at MIT under the indigenous language scholar Kenneth Hale. By immersing herself in the language, she has achieved fluency, effectively reviving in herself the spoken use of the long-silent language. Her research is focused on developing a dictionary of Wampanoag, which now includes nearly 10,000 words, as well as language teaching resources, through which she hopes to help usher the language into modern use in the Wampanoag community.

Be a Pilot Tester for The 300 Languages Project

Published on Tuesday, September 28th, 02010 by Laine Stranahan

The 300 Languages Project is a special effort by The Rosetta Project to create a parallel text and audio corpus for the world’s 300 most widely spoken languages. We are seeking a limited set of volunteers to test its submission process and offer feedback to its coordinators before the project launches globally in November. Native speakers of any language (including English) are encouraged to participate.

To participate, sign up here or email laine@longnow.org.

Swadesh List data now re-enabled in Rosetta Internet Archive Collection

Published on Friday, September 24th, 02010 by Laine Stranahan

Puoc Swadesh List
Swadesh list for the Puoc language in the International Phonetic Alphabet

In the 01950s, American linguist Morris Swadesh, as part of his overarching vision of a quantitative method for determining language relationships on a global and multimillennial scale, developed a set of one hundred words found to be unusually stable across time and language boundaries. Swadesh hypothesized that words like “fire,” “moon,” “mother” and “bone,” common to human experience, were far less likely to change or be substituted with words borrowed from other dialects or languages. The 100-word “Swadesh list” (sometimes up to 207 words, depending on the variety of the list used) is now widely collected in linguistic field research, and functions as a kind of universal linguistic fossil. With careful study, these lists can reveal ancient language relationships and processes of linguistic change typically obscured by centuries-long processes of evolution and borrowing. As familiar examples, such processes transformed Chaucer’s English into modern English and Latin into the modern Romance languages.

In 02004, The Rosetta Project undertook a National Science Foundation funded project to increase both the size and utility of its long-term multilingual archive and at this time added a large number of Swadesh lists to its collection. Lexical database archivists Tim Usher and Paul Whitehouse contributed original research (Tim Usher’s 02002 Indo-Pacific database and Paul Whitehouse’s 02002 Australian and New Guinea database were central among the additions) and also brought in outside resources, including Darrell Tryon’s Comparative Austronesian Dictionary (01995), George Starostin’s Dravidian database, and Ilya Peiros’ Mon Khmer database. In many of these cases, as with the Usher and Whitehouse collection, the 100-200 term Swadesh lists were a subset of a larger lexical data collection project. Despite the Swadesh list’s limitation in size compared with a resource like a dictionary, a large collection of the same material in many different languages is useful as a parallel dataset for cross-linguistic comparison.
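To make the idea of a parallel dataset concrete, here is a minimal Python sketch. It is not the method Swadesh or the Rosetta archivists used: real comparison relies on expert cognate judgments over phonetic transcriptions, whereas this sketch only measures crude surface similarity of spellings. The four concepts come from the examples above; the function names and the choice of `difflib` are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Tiny illustrative excerpt: four Swadesh-style concepts in three languages.
# A real list has 100 to 207 items, usually in phonetic transcription.
SWADESH_SAMPLE = {
    "English": {"fire": "fire", "moon": "moon", "mother": "mother", "bone": "bone"},
    "German": {"fire": "Feuer", "moon": "Mond", "mother": "Mutter", "bone": "Knochen"},
    "Spanish": {"fire": "fuego", "moon": "luna", "mother": "madre", "bone": "hueso"},
}


def form_similarity(form_a, form_b):
    """Crude surface similarity of two word forms, between 0.0 and 1.0."""
    return SequenceMatcher(None, form_a.lower(), form_b.lower()).ratio()


def list_similarity(list_a, list_b):
    """Average similarity over the concepts that both lists contain."""
    shared = sorted(set(list_a) & set(list_b))
    if not shared:
        raise ValueError("the two lists share no concepts")
    total = sum(form_similarity(list_a[c], list_b[c]) for c in shared)
    return total / len(shared)


if __name__ == "__main__":
    english = SWADESH_SAMPLE["English"]
    for name in ("German", "Spanish"):
        score = list_similarity(english, SWADESH_SAMPLE[name])
        print(f"English vs {name}: {score:.2f}")
```

Because every list covers the same concepts, any pair of languages can be compared with the same few lines, which is what makes a large collection of short lists useful despite each list's small size.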

This collection of Swadesh lists was included as a parallel data set among the documents micro-etched on the Rosetta Disk, a physical copy of The Rosetta Project’s long-term linguistic archive created in 02008. And for a period of time, the lists were available on The Rosetta Project’s website via an interactive tool which allowed visitors to view and compare lexical items in over a thousand languages and also contribute their own lexical data. But as the Rosetta Project site evolved and the structure of serving environments changed, this tool became technologically obsolete. While there was (and remains) no lack of storage space for the lists, there was a critical lack of what Long Now board member Kevin Kelly calls “movage.”

“Movage,” says Kelly, “means transferring the material to current platforms on a regular basis – that is, before the old platform completely dies, and it becomes hard to do. This movic rhythm of refreshing content should be as smooth as a respiratory cycle – in, out, in, out. Copy, move, copy, move.” And it is movage, not storage, says Kelly, that is critical to keeping information alive: “The only way to archive digital information is to keep it moving.” In other words, simply storing data isn’t enough to ensure its longevity; it must be copied, moved, and made redundant. And not just once or twice, but indefinitely. Kurt Bollacker, Long Now Foundation Digital Research Director, adds: “[b]ecause any single piece of digital media tends to have a relatively short lifetime, we will have to make copies far more often than has been historically required of analog media. Like species in nature, a copy of data that is more easily ‘reproduced’ before it dies makes the data more likely to survive.” [1]
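One way to picture a single “breath” of movage is a copy step that refuses to trust a copy until it has been verified. The Python sketch below is only an illustration under stated assumptions: the function names, the use of SHA-256 checksums, and the idea of local destination folders are ours, not a description of how Long Now or The Internet Archive actually move data.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Iterable, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_forward(source: Path, destinations: Iterable[Path]) -> List[Path]:
    """Copy `source` into each destination folder and verify every copy.

    Hypothetical helper: raises OSError if any copy differs from the source.
    """
    expected = sha256_of(source)
    verified = []
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy = folder / source.name
        shutil.copy2(source, copy)
        if sha256_of(copy) != expected:
            raise OSError(f"checksum mismatch for {copy}")
        verified.append(copy)
    return verified
```

Run on a schedule against fresh storage, a step like this yields several verified copies each cycle, which is the redundancy the quotes above call for.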

Since the 02004 iteration of the Swadesh list program, The Rosetta Project has launched a comprehensive migration of all of its data to The Internet Archive, a free online digital library founded in 01996 with over 4 petabytes of storage. The Internet Archive exemplifies the paradigm shift in the field of information preservation from storage to movage: users can upload, for free, any document they have permission to distribute, and anyone with access to the internet can then download it to their own machine. Thousands of downloads are made every day from Internet Archive servers by users all over the world: early “movage” on a massive scale.

After a long process of unraveling and decoding the Swadesh list data, which had fallen victim to rapid changes in character encoding and database standards, The Rosetta Project has now moved the collection of 1,235 Swadesh lists into The Internet Archive. Recognizing the substantial merit and long-term advantages of the movage model and its successful early implementation by The Internet Archive, our goal is for the lists to have a long, useful, and redundant residence there.
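The post does not say which encodings the Swadesh data had passed through, so the following Python sketch is only a generic illustration of one recovery step: try a short list of candidate encodings and re-save the text as UTF-8. The function names and the candidate list are assumptions. A guessed encoding can decode without error and still be wrong, so results like these need checking by someone who knows the language.

```python
def decode_legacy(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the candidate list is an assumption, not the set of
    encodings the Rosetta data actually used.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


def to_utf8(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Re-encode legacy bytes as UTF-8 so current tools can read them."""
    text, _ = decode_legacy(raw, candidates)
    return text.encode("utf-8")


if __name__ == "__main__":
    # Sample bytes chosen purely for illustration: a single-byte encoding
    # of a word containing accented vowels.
    legacy = b"W\xf4pan\xe2ak"
    text, used = decode_legacy(legacy)
    print(text, "decoded as", used)
```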

The relocation of the Swadesh lists is also the first step of The Rosetta Project’s latest undertaking, The 300 Languages Project. Source materials collected for The 300 Languages Project, whose aim is to address a need for highly structured linguistic resources in the world’s 300 most widely spoken languages, will be stored at The Internet Archive with the rest of The Rosetta Project collection.

Was the 5-to-6-year period the Swadesh list data spent in the darkness unusual? According to Kelly, not at all: “We don’t know what the natural movage respiration cycle is for digital media yet since it is still very new,” says Kelly, “but I suspect the cycle is much shorter than we think. I would guess it is 5 years. No matter what digital format you have your precious [data] stored on, you should expect to move it onto new media in five years — and five years after that forever!”

Building an Audio Collection for All the World’s Languages

Published on Wednesday, July 21st, 02010 by Laine Stranahan

The Rosetta Project is pleased to announce the Parallel Speech Corpus Project, a year-long volunteer-based effort to collect parallel recordings in languages representing at least 95% of the world’s speakers. The resulting corpus will include audio recordings of the same set of texts in hundreds of languages, each accompanied by a transcription. This will provide a platform for creating new educational and preservation-oriented tools, as well as technologies that may one day allow artificial systems to comprehend, translate, and generate these languages.

Huge text and speech corpora of varying degrees of structure already exist for many of the most widely spoken languages in the world—English is probably the most extensively documented, followed by other majority languages like Russian, Spanish, and Portuguese. Given some degree of access to these corpora (though many are not publicly accessible), research, education and preservation efforts in the ten languages which represent 50% of the world’s speakers (Mandarin, Spanish, English, Hindi, Urdu, Arabic, Bengali, Portuguese, Russian and Japanese) can be relatively well-resourced.

But what about the other half of the world? The next 290 most widely spoken languages account for another 45% of the population, and the remaining 6,500 or so are spoken by only 5%, with this latter group representing the “long tail” of human languages:

[Figure: the long tail of languages]
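The shape of that long tail can be checked with a little arithmetic. The short Python sketch below uses only the rounded figures quoted in this post (10, 290 and roughly 6,500 languages covering 50%, 45% and 5% of speakers); the tier labels and function name are ours, and the per-language averages are back-of-the-envelope values rather than measured data.

```python
# (label, number of languages, share of the world's speakers), as quoted above.
TIERS = [
    ("top 10", 10, 0.50),
    ("next 290", 290, 0.45),
    ("remaining ~6,500", 6500, 0.05),
]


def cumulative_coverage(tiers):
    """Return rows of (label, languages so far, share so far, average share per language)."""
    rows = []
    languages = 0
    share = 0.0
    for label, count, tier_share in tiers:
        languages += count
        share += tier_share
        rows.append((label, languages, round(share, 4), tier_share / count))
    return rows


if __name__ == "__main__":
    for label, languages, share, per_language in cumulative_coverage(TIERS):
        print(f"{label:>18}: {languages:>5} languages, "
              f"{share:.0%} of speakers, {per_language:.5%} per language on average")
```

On these figures the first 300 languages reach 95% of speakers, while an average language in the tail accounts for less than a thousandth of a percent.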

Equal documentation of all the world’s languages is an enormous challenge, especially in light of the tremendous quantity and diversity represented by the long tail. The Parallel Speech Corpus Project will take a first step toward universal documentation of all human languages, with the goal of documenting the top 300 and providing a model that can then be extended out to the long tail. Eventually, researchers, educators and engineers alike should have access to every living human language, creating new opportunities for expanding knowledge and technology and helping to preserve our threatened diversity.

This project is made possible through the support and sponsorship of speech technology expert James Baker and will be developed in partnership with his ALLOW initiative. We will be putting out a call for volunteers soon. In the meantime, please contact rosetta@longnow.org with questions or suggestions.

Long Now at Wikimania 02010 in Gdansk, Poland

Published on Tuesday, July 6th, 02010 by Danielle Engelman

Dr. Laura Welcher and Dr. Kurt Bollacker of Long Now will be speaking on the creation of a new Language Commons Wiki at this year’s Wikimania conference in Gdansk, Poland, over the weekend of July 9 – 11, 02010.

Wikimania is a conference for users of the wiki projects operated by the Wikimedia Foundation. Topics of presentations and discussions include Wikimedia Foundation projects, other wikis, open source software, and free content.

Attendance is €15 per day, or €40 for all three days, and you can register here.

If you have questions, you can contact Wikimania directly through this page.

Rosetta Spotlight: Ormuri – a piece of Middle Eastern identity

Published on Tuesday, May 11th, 02010 by Sarina Spector

Ormuri Description in the Rosetta Collection

“Language is identity,” Darfur refugee Daowd I. Salih told the New York Times about a week ago. He was being interviewed for an article called “Listening to (and Saving) the World’s Languages.” As mentioned in this Rosetta Project blog post, the article discusses the amazing variety of spoken languages in New York City, and what residents are doing (or not doing) to preserve their native language.

One of the languages the article touches on is Ormuri, a language of multiple dialects spoken in small regions of Afghanistan and Pakistan. According to the Ethnologue, Ormuri has only about 1,050 speakers. The New York Times article reveals a plan to canvass New York City for speakers of Ormuri in order to learn more about the language and the cultural information it holds.

Languages with small speaker populations are quickly dying out, and the data they contain (whether it be linguistic, historical, or cultural) is important enough to merit a concerted effort at saving them. Ormuri is a perfect example, especially in the political and economic environment of our time (read: the complex tangle that is our current Middle Eastern relations). The Rosetta Project’s database in the Internet Archive contains a detailed description of Ormuri, including a history of its speakers: where they came from, who their ancestors are, and how their language has co-evolved with those around it to become what it is today.

In my mind there is nothing that illustrates a culture’s unity so much as its language. It allows people to build social relationships, conduct business transactions, and express to fellow humans everything they hold dear. What’s more, as any good anthropologist knows, learning the language of a culture is one of the most important steps an outsider can take to gain the trust and respect of its people.

What does this have to do with an obscure Afghan language, or with Darfur refugees? Only this: if we intend to successfully navigate the conflicts of the modern global world, it is absolutely necessary to understand and relate to the people with whom we intend to work. The Middle East in particular, Afghanistan being an illustrative example, is culturally very foreign to the West; its people have lived for centuries in small, autonomous groups that hold to varied, often contradictory beliefs. The fact that so many of these groups have their own language, like Ormuri, is telling of their relative isolation, and gives clues to how they live their lives.

Rosetta’s description of Ormuri tells the story of its peoples’ interactions through Ormuri’s morphology. By studying the languages Ormuri had contact with and how these influenced its words, we can begin to create a web of social and economic interaction that would show the connections and dissociations between groups in the area. For example, Ormuri has many morphological similarities to Pashto, a common language in the region of Waziristan where Ormuri is spoken. Ormuri pronouns are strikingly similar to their Pashto equivalents, and many scattered words share similarities, like “wife,” “glitter,” and “to sit down.” Pashto has also phonetically influenced Ormuri, replacing some traditional Ormuri allophones with similar Pashto ones.

Ormuri has also sustained contact with Persian, which is evident in many morphological changes that mimic the latter: loss of gendered nouns, simplification of plural nouns, and reduction of irregular past participles.  Analyzing this data led the author, Georg Morgenstierne, to doubt the previous belief that Ormuri speakers descend from Kurds, and provided evidence for further theoretical investigations.

The very existence of this kind of knowledge is what Rosetta is all about; by preserving minority languages and stressing their importance, we hope to contribute vital insights into the lives of their speakers, insights that can be put to good use in surprising places. After all, you never know who you’ll meet on the New York City subway.

[A note of introduction: this is my first post as an intern with the Rosetta Project. I will be working with Rosetta for three months, building the collection in the Internet Archive and continuing to spotlight Rosetta material on this blog.]

The Global Lives Project

Published on Tuesday, March 2nd, 02010 by Laura Welcher

Last Friday evening, Long Now joined the Global Lives Project in celebrating their world premiere opening at San Francisco’s Yerba Buena Center for the Arts. Through a huge volunteer effort, Global Lives has produced ten films – each 24 hours long – that visually capture the everyday life of ten people around the planet. And on Friday we could view them all, at the same time, in the same room. Ten huge screens hung from the ceiling of the Yerba Buena Forum, and throughout the evening around a thousand people ambled around and under them, listening as voices emerged: Kai Lu, from Anren, China, speaking to his wife in a village dialect of Sichuan Yi; young Edith Kaphuka from Ngwale Village, Malawi, code-switching with her friends on the playground between Chichewa and Chiyao; James Bullock of San Francisco chatting up the tourists on his cable car in West Coast American English. Some screens showed people working, others playing, some eating, others sleeping – a glimpse into one human day on planet Earth.

Global Lives Opening - Installation in the Forum

Global Lives Opening - Big Screen Installation in the YBCA Forum

A second ongoing installation in the YBCA Room for Big Ideas provides a more intimate viewing space, with ten partitioned rooms and LCD viewing screens.  Each room is furnished with seating for one or two, and with walls and floors embellished with fabrics, colors and textures evocative of the region of the film.  Kiosks and wall graphics give a bit of background about the project, and the ten participants.  And while the installation as a whole gives the sense of a finished, polished project, three computers set up prominently in the room tell a different – and quite wonderful – story.

Global Lives Project - Installation in YBCA Room for Big Ideas

This is not a finished project – in fact, it is very much a work in progress. One of the greatest ongoing efforts is one that anyone can help with – the subtitling of each film in as many languages as possible (through the crowdsourced subtitling site dotSUB). The first pass was getting all ten films subtitled in English for the opening night, and that effort is still only about 80% done. It is an enormous effort. Jason Price, one of the producers of the Malawi shoot, tells the story of being nearly at wits’ end trying to find anyone to help translate Edith Kaphuka’s Chichewa into English – until someone suggested he set up a Facebook Group, and then 2,500 mostly expatriate Chichewa speakers arrived ready to help (there are, of course, many speakers of Chichewa in Malawi, but the need to access streaming video to do the translations made that nearly impossible).

Through the steadfast effort of about 25 of these people, the full twenty-four hours of video has now not only been transcribed and translated, but put through about five stages of checking, rechecking and review to ensure its accuracy. And it is now the largest corpus of transcribed spoken Chichewa on the web. (What might this ‘seed’ corpus enable down the road? Chichewa online dictionaries? Spell checkers? Natural language processing? Search? This group of translators may, without realizing it, be forging the way for a real Chichewa language online presence.)

For Global Lives, this set of ten videos is just the beginning of a much larger library of human life experience.  Not grand experiences, not Hollywood, not Bollywood — in the words of David Harris, the project’s director (responding to the umpteenth activist proposal, this one by yours truly) “we want boring!”  Because what we see as the everyday, the mundane, the routine is in fact a picture of our own humanity – and for that each Global Lives shoot is worth a thousand Hollywood productions.

The Global Lives installation in the Room for Big Ideas will be open through June 20, 02010 at San Francisco’s Yerba Buena Center for the Arts.  The Long Now Foundation sponsored the world premiere installation in the YBCA Forum through a grant from the William and Flora Hewlett Foundation.

3 Long Now Events in 8 Days

Published on Tuesday, February 23rd, 02010 by Alexander Rose - Twitter: @zander

Long Now has three events coming up over the next 8 days and we wanted to be sure you all had the right info for reserving tickets and making it out to all three.

  • Alan Weisman on “World Without Us, World With Us.” Wednesday February 24 (Thanks for coming; this event went great.)

Avoiding a Digital Dark Age

Published on Friday, February 19th, 02010 by Austin Brown

Long Now Digital Research Director Kurt Bollacker was recently published in New Scientist discussing the challenges in maintaining data for the long haul:

It seems unavoidable that most of the data in our future will be digital, so it behooves us to understand how to manage and preserve digital data so we can avoid what some have called the “digital dark age.” This is the idea—or fear!—that if we cannot learn to explicitly save our digital data, we will lose that data and, with it, the record that future generations might use to remember and understand us.

It’s a fairly long and comprehensive piece with lots of good advice and a good description of how the Rosetta Disk tries to address some of these problems.

Read the full article at New Scientist.

No More New Old Knowledge

Published on Thursday, February 18th, 02010 by Austin Brown

King’s College London president Rick Trainor announced recently that the university would be closing the chair of paleography, the UK’s only one. Held by Professor David Ganz, the chair of paleography is the position that oversees a discipline many consider to be a vital component of historical research. Paleography is the study of ancient manuscripts, and its practitioners have pieced together and deciphered many of the texts that have provided the basis for our knowledge of history.

Budget cuts are the precipitating factor, or rather “strategic disinvestment” as the official announcement goes, but they’re being met with some resistance.

“Palaeography is not simply an arcane auxiliary science,” says Professor Jeffrey Hamburger, chair of medieval studies at Harvard University. “It is as basic to the training and practice of historians as mastery of Dos or Unix might be to a computer scientist.”

-from the Guardian

Some Rights Reserved (CC)

The Long Now Foundation - Fostering Long-term Responsibility - est. 01996.