Blog Archive for the ‘Rosetta’ Category



A Rosetta Disk is on public display in the University of Colorado Boulder Libraries Special Collection

Published on Wednesday, March 2nd, 02011 by Laura Welcher

Rosetta Disk by Spencer Mishlen

In 02008, one of the first prototype Rosetta Disks went to the family of the late Charles Butcher, who was the founder of The Lazy 8 Foundation. Lazy Eight was one of the first supporters of the Long Now 10,000 Year Library and Rosetta Projects.

This Rosetta Disk has now been donated by the Charles Butcher family to the University of Colorado Boulder. It looks like it is housed in the Library Special Collections, and that it is currently on exhibit as part of Realia: Everyday Objects from Other Lives.

If anyone has a chance to go visit the Rosetta Disk in this exhibit, please send us photos!

An Archive Model with Long Term Benefits

Published on Thursday, January 13th, 02011 by Laura Welcher

On January 9, The Rosetta Project presented a poster at the Linguistic Society of America annual meeting, describing a distributed archive model we’ve developed and implemented with the Rosetta digital collection. Here is a video describing this model, and some of its long-term benefits:

A pdf of this poster is available for download here (12 MB).

The Rosetta Project: A Distributed Archive Model

North American Dialects On Twitter and YouTube

Published on Wednesday, January 12th, 02011 by Austin Brown

Using data from the Atlas of North American English (ANAE) by William Labov, Sharon Ash, and Charles Boberg combined with his own research, linguist Rick Aschmann created the detailed map above to show regional dialects throughout North America.  One of the coolest features is that he’s linked over 600 YouTube videos to the map, so that clicking a region will take you to video clips of (mostly famous) people raised in that area so that you can hear a sample of the dialect.

Researchers at Carnegie Mellon have done similar research, though they’re using social media – Twitter specifically – as the data source, rather than just to illustrate linguistic nuance. Jacob Eisenstein and his colleagues recently analyzed 380,000 geotagged tweets and explored the geographical dialects represented within. They saw regional differences in the way people abbreviate words to fit the short medium and in the slang terms they use in informal messaging. From that variation they built a statistical model that could predict a user’s location to within about 300 miles based on the dialect used.

Twitter and other informal microblogging platforms offer linguistics researchers a newly accessible, low-cost source of data, since they don’t require labor-intensive in-person interviews to uncover patterns of informal speech:

Studies of regional dialects traditionally have been based primarily on oral interviews, Eisenstein said, noting that written communication often is less reflective of regional influences because writing, even in blogs, tends to be formal and thus homogenized. But Twitter offers a new way of studying regional lexicon, he explained, because tweets are informal and conversational. Furthermore, people who tweet using mobile phones have the option of geotagging their messages with GPS coordinates.

- Carnegie Mellon University

Eisenstein also points out that the identifiable regional variation could be an indicator that the internet is less a force for homogenization than often thought.

The Georgetown University Round Table on Languages and Linguistics later this year will explore many ways in which these “new worlds of words occasion innovative uses of language and new spaces for constructing identities, forming relationships, and expressing social meanings.” (GURT 2011)

So, expect to see plenty more research mining social media and remember to act normal online so you don’t throw off the results.

Presentism in Google Books

Published on Tuesday, January 4th, 02011 by Austin Brown

Google’s new Ngram Viewer is a graphical interface for looking at the frequency of words over time in the several million books scanned into their database.  As a publicly mine-able data set, it’s huge and ripe for exploration with 500 years’ worth of published books spanning several languages.  And while it may seem a simple ‘just so’ kind of information to be able to call up how often a word was used in a particular year, the lives of words can often illuminate historical and cultural trends in surprising ways.

A paper published by researchers who helped develop the project (and summarized by Discover) rounded up a few interesting findings. One delectably recursive tidbit they mentioned was that a search for years (e.g., 1865, 1990) can show the historical efforts focused on particular eras and the extent to which those years remain part of present-day discussion.

They found a general pattern that each individual year follows: a spike around the year itself, followed by a downward-trending long tail as it recedes into history. They also noticed, however, that the pattern is shifting over time: more recent years show higher peaks with shorter tails.

When the team looked at the frequency of individual years, they found a consistent pattern. In their own words: “’1951’ was rarely discussed until the years immediately preceding 1951. Its frequency soared in 1951, remained high for three years, and then underwent a rapid decay, dropping by half over the next fifteen years.” But the shape of these graphs is changing. The peak gets higher with every year and we are forgetting our past with greater speed. The half-life of ‘1880’ was 32 years, but that of ‘1973’ was a mere 10 years.

So, at a cultural level, we can see a developing ‘presentism’ in which the year we’re currently inhabiting takes on great significance, but is more quickly forgotten once it’s passed.
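To put the quoted half-lives in concrete terms, here is a small Python sketch. It assumes that mentions of a year fall off as a pure exponential after the peak, which is a simplification of ours and not something the paper's summary spells out; the function name is likewise just illustrative.

```python
def remaining_share(years_elapsed: float, half_life: float) -> float:
    """Fraction of peak frequency left after `years_elapsed` years.

    Assumes simple exponential decay (an illustrative assumption):
    the share halves once every `half_life` years.
    """
    return 0.5 ** (years_elapsed / half_life)


# Half-lives quoted above, in years.
half_lives = {"1880": 32, "1973": 10}

for year, half_life in half_lives.items():
    share = remaining_share(20, half_life)
    print(f"'{year}': about {share:.0%} of peak frequency left after 20 years")
```

Under that assumption, two decades after its peak ‘1880’ would still be mentioned about 65% as often, while ‘1973’ would be down to about 25%.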

Rosetta Disk at the Hammer Museum for an “Enormous Microscopic Evening”

Published on Thursday, November 4th, 02010 by Laura Welcher

Join Long Now’s Rosetta Project on November 6 from 4 – 7 pm at UCLA’s Hammer Museum where we team up with San Francisco-based CRITTER for an Enormous Microscopic Evening.  We’ll put a Rosetta Disk under the microscope, check out the fine (and finer) print, and maybe hunt for Easter eggs…  More information on the evening’s lineup from the Hammer Museum:

Enormous Microscopic Evening examines the museum from a microscopic perspective with CRITTER, a San Francisco-based salon dedicated to expanding the relationships between culture and the environment. The evening will focus on demonstrations and workshops about building and manipulating microscopes. Materials and samples taken from around the museum will be examined. Continuing the theme of microscopy, there will be micro performances (short concerts with tiny instruments) and other related events throughout the museum.

Critter

Opening Celebration: Global Lives Project at the Long Now

Published on Friday, October 29th, 02010 by Austin Brown

Photo by Jessie Levandov

Opening: Global Lives Project Installation
at The Long Now Museum & Store

Wednesday November 10
6:00 – 8:00 pm

We’ll be celebrating the opening of a Global Lives Project installation at the Long Now Foundation Museum & Store on the evening of November 10th. Please join us for drinks, snacks and some words from Global Lives Project Founder and Executive Director, David Evan Harris. Global Lives Project filmmakers Ya-Hsuan Huang and Jason J. Price will also be in attendance to answer questions.

The Global Lives Project is a collaboratively-built library of human experience gathered from an orphanage in Kazakhstan, a corner store in China, a streetcar in San Francisco and many other locations foreign and familiar. It takes shape online and as a video installation.

Framed by the arc of the day and conveyed through the intimacy of video, we have slowly and faithfully captured 24 continuous hours in the lives of 10 people from around the world. They are screened here in their own right, but also in relation to one another.

There is no narrative other than that which is found in the composition of everyday life, no overt interpretations other than that which you may bring to it.

By extending the long take to a certain extreme and infusing it with the spirit of cinema verité, we invite audiences to confer close attention onto other worlds, and simultaneously reflect upon their own.  The force and depth of human difference and similarity are revealed in this process. Gaps which mark cultural divides feel, at once, both wider and narrower.  This sense – that we, as humans, are both knowable and unknowable, fundamentally different as well as the same – opens a space for dialogue.

-Artist’s Statement 2010

Endangered Language Linguist awarded prestigious MacArthur Fellowship

Published on Wednesday, September 29th, 02010 by Laura Welcher

Jessie Little Doe Baird, a linguist who has worked for years on reviving the Wampanoag (Wôpanâak) Language, has just been awarded a 02010 MacArthur “Genius” Fellowship in honor of her work and research.

Baird, who is of Wampanoag heritage, studied at MIT under the indigenous language scholar Kenneth Hale. By immersing herself in the language, she has achieved fluency, effectively reviving in herself the spoken use of the long-silent language. Her research is focused on developing a dictionary of Wampanoag, which now includes nearly 10,000 words, as well as language teaching resources, through which she hopes to help usher the language into modern use in the Wampanoag community.

Be a Pilot Tester for The 300 Languages Project

Published on Tuesday, September 28th, 02010 by Laine Stranahan

The 300 Languages Project is a special effort by The Rosetta Project to create a parallel text and audio corpus for the world’s 300 most widely-spoken languages. We are seeking a limited set of volunteers to test its submission process and offer feedback to its coordinators before the project is globally launched in November. Native speakers of any language (including English) are encouraged to participate.

To participate, sign up here or email laine@longnow.org.

Swadesh List data now re-enabled in Rosetta Internet Archive Collection

Published on Friday, September 24th, 02010 by Laine Stranahan

Puoc Swadesh List
Swadesh list for the Puoc language in the International Phonetic Alphabet

In the 01950s, American linguist Morris Swadesh, as part of his overarching vision of a quantitative method for determining language relationships on a global and multimillennial scale, developed a set of one hundred words found to be unusually stable across time and language boundaries. Swadesh hypothesized that words like “fire,” “moon,” “mother” and “bone,” common to human experience, were far less likely to change or be substituted with words borrowed from other dialects or languages. The 100-word “Swadesh list” (sometimes up to 207 words, depending on the version of the list used) is now widely collected in linguistic field research, and functions as a kind of universal linguistic fossil. With careful study, these lists can reveal ancient language relationships and processes of linguistic change typically obscured by centuries-long processes of evolution and borrowing. As familiar examples, such processes transformed Chaucer’s English into modern English and Latin into the modern Romance languages.

In 02004, The Rosetta Project undertook a National Science Foundation-funded project to increase both the size and utility of its long-term multilingual archive, and at that time added a large number of Swadesh lists to its collection. Lexical database archivists Tim Usher and Paul Whitehouse contributed original research (Tim Usher’s 02002 Indo-Pacific database and Paul Whitehouse’s 02002 Australian and New Guinea database were central among the additions) and also brought in outside resources, including Darrell Tryon’s Comparative Austronesian Dictionary (01995), George Starostin’s Dravidian database, and Ilya Peiros’ Mon-Khmer database. In many of these cases, as with the Usher and Whitehouse collection, the 100-200 term Swadesh lists were a subset of a larger lexical data collection project. Although a Swadesh list is small compared with a resource like a dictionary, a large collection of the same material in many different languages is useful as a parallel dataset for cross-linguistic comparison.
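For a rough feel of what a “parallel dataset” allows, here is a toy Python sketch. It uses only the four meanings mentioned above, in ordinary spelling instead of IPA, and it scores word pairs by plain string similarity. That scoring is a crude stand-in of ours for illustration, not the comparative method linguists actually apply, and the names `SWADESH`, `similarity` and `mean_similarity` are made up for this example.

```python
from difflib import SequenceMatcher

# A tiny parallel word list: the same meanings, recorded in each language.
# Standard orthography, four items only; real lists have 100 or more.
SWADESH = {
    "English": {"fire": "fire", "moon": "moon", "mother": "mother", "bone": "bone"},
    "German": {"fire": "Feuer", "moon": "Mond", "mother": "Mutter", "bone": "Knochen"},
    "Spanish": {"fire": "fuego", "moon": "luna", "mother": "madre", "bone": "hueso"},
}


def similarity(word_a: str, word_b: str) -> float:
    """String similarity between 0 and 1 (a rough proxy, not a cognate judgment)."""
    return SequenceMatcher(None, word_a.lower(), word_b.lower()).ratio()


def mean_similarity(lang_a: str, lang_b: str) -> float:
    """Average similarity over the meanings both languages have entries for."""
    shared = SWADESH[lang_a].keys() & SWADESH[lang_b].keys()
    scores = [similarity(SWADESH[lang_a][m], SWADESH[lang_b][m]) for m in shared]
    return sum(scores) / len(scores)


for other in ("German", "Spanish"):
    print(f"English vs {other}: {mean_similarity('English', other):.2f}")
```

Even on this tiny sample the English and German lists score as more alike than the English and Spanish ones. Because every list records the same meanings, adding another language to the comparison only requires adding another entry.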

This collection of Swadesh lists was included as a parallel data set among the documents micro-etched on the Rosetta Disk, a physical copy of The Rosetta Project’s long-term linguistic archive created in 02008. And for a period of time, the lists were available on The Rosetta Project’s website via an interactive tool which allowed visitors to view and compare lexical items in over a thousand languages and also contribute their own lexical data. But as the Rosetta Project site evolved and the structure of serving environments changed, this tool became technologically obsolete. While there was (and remains) no lack of storage space for the lists, there was a critical lack of what Long Now board member Kevin Kelly calls “movage.”

“Movage,” says Kelly, “means transferring the material to current platforms on a regular basis – that is, before the old platform completely dies, and it becomes hard to do. This movic rhythm of refreshing content should be as smooth as a respiratory cycle – in, out, in, out. Copy, move, copy, move.” And it is movage, not storage, says Kelly, that is critical to keeping information alive: “The only way to archive digital information is to keep it moving.” In other words, simply storing data isn’t enough to ensure its longevity; it must be copied, moved, and made redundant. And not just once or twice, but indefinitely. Kurt Bollacker, Long Now Foundation Digital Research Director, adds: “[b]ecause any single piece of digital media tends to have a relatively short lifetime, we will have to make copies far more often than has been historically required of analog media. Like species in nature, a copy of data that is more easily ‘reproduced’ before it dies makes the data more likely to survive.” [1]

Since the 02004 iteration of the Swadesh list program, The Rosetta Project has launched a comprehensive migration of all of its data to The Internet Archive, a free online digital library founded in 01996 with over 4 petabytes of storage. The Internet Archive exemplifies the paradigm shift in the field of information preservation from storage to movage: users can upload, for free, any document they have permission to distribute, and anyone with access to the internet can then download it to their own machine. Thousands of downloads are made every day from Internet Archive servers by users all over the world: early “movage” on a massive scale.

After a long process of unraveling and decoding the Swadesh list data, which had fallen victim to rapid changes in character encoding and database standards, The Rosetta Project has now moved the collection of 1,235 Swadesh lists into The Internet Archive. Recognizing the substantial merit and long-term advantages of the movage model and its successful early implementation by The Internet Archive, our goal is for the lists to have a long, useful, and redundant residence there.

The relocation of the Swadesh lists is also the first step of The Rosetta Project’s latest undertaking, The 300 Languages Project. Source materials collected for The 300 Languages Project, whose aim is to address a need for highly-structured linguistic resources in the world’s 300 most widely-spoken languages, will be stored at The Internet Archive with the rest of The Rosetta Project collection.

Was the 5-to-6-year period the Swadesh list data spent in the darkness unusual? According to Kelly, not at all: “We don’t know what the natural movage respiration cycle is for digital media yet since it is still very new,” says Kelly, “but I suspect the cycle is much shorter than we think. I would guess it is 5 years. No matter what digital format you have your precious [data] stored on, you should expect to move it onto new media in five years — and five years after that forever!”
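As a minimal sketch of a single “copy, move” step in that cycle, the Python below copies a file to a new location and checks that the copy is bit-for-bit identical before trusting it. This is our own illustration, not the process used by The Rosetta Project or The Internet Archive; the function names and the directory layout are assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, destination_dir: Path) -> Path:
    """Copy `source` into `destination_dir` and confirm the copy is intact.

    The original is left in place, so each step adds redundancy
    instead of replacing one fragile copy with another.
    """
    destination_dir.mkdir(parents=True, exist_ok=True)
    target = destination_dir / source.name
    shutil.copy2(source, target)  # copies the contents plus file metadata
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Copy of {source} to {target} does not match the original")
    return target
```

Something like `copy_and_verify(Path("swadesh_lists.xml"), Path("/mnt/new_platform"))` (both paths hypothetical) would be one breath of the cycle. Note that a checksum only guards the bits: the Swadesh data also needed its character encoding untangled, which is a reminder that formats have to be migrated along with the files.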

Building an Audio Collection for All the World’s Languages

Published on Wednesday, July 21st, 02010 by Laine Stranahan

The Rosetta Project is pleased to announce the Parallel Speech Corpus Project, a year-long volunteer-based effort to collect parallel recordings in languages representing at least 95% of the world’s speakers. The resulting corpus will include audio recordings in hundreds of languages of the same set of texts, each accompanied by a transcription. This will provide a platform for creating new educational and preservation-oriented tools as well as technologies that may one day allow artificial systems to comprehend, translate, and generate these languages.

Huge text and speech corpora of varying degrees of structure already exist for many of the most widely spoken languages in the world—English is probably the most extensively documented, followed by other majority languages like Russian, Spanish, and Portuguese. Given some degree of access to these corpora (though many are not publicly accessible), research, education and preservation efforts in the ten languages which represent 50% of the world’s speakers (Mandarin, Spanish, English, Hindi, Urdu, Arabic, Bengali, Portuguese, Russian and Japanese) can be relatively well-resourced.

But what about the other half of the world? The next 290 most widely spoken languages account for another 45% of the population, and the remaining 6,500 or so are spoken by only 5%–this latter group representing the “long tail” of human languages:

[Figure: The long tail of languages]

Equal documentation of all the world’s languages is an enormous challenge, especially in light of the tremendous quantity and diversity represented by the long tail. The Parallel Speech Corpus Project will take a first step toward universal documentation of all human languages, with the goal of providing documentation of the top 300 and providing a model that can then be extended out to the long tail. Eventually, researchers, educators and engineers alike should have access to every living human language, creating new opportunities for expanding knowledge and technology alike and helping to preserve our threatened diversity.
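The arithmetic behind those tiers is easy to tally. The short Python sketch below uses only the approximate figures given above (10 languages for 50% of speakers, the next 290 for 45%, and roughly 6,500 more for the last 5%); the names are just for illustration.

```python
# (label, number of languages, share of the world's speakers), as given above.
TIERS = [
    ("top 10", 10, 0.50),
    ("next 290", 290, 0.45),
    ("long tail", 6500, 0.05),
]


def cumulative_coverage(tiers):
    """Running totals of languages and speaker share, tier by tier."""
    rows = []
    languages = 0
    share = 0.0
    for label, count, tier_share in tiers:
        languages += count
        share += tier_share
        rows.append((label, languages, share))
    return rows


for label, languages, share in cumulative_coverage(TIERS):
    print(f"{label:>9}: {languages:>5} languages cover {share:.0%} of speakers")
```

The second row is the project's target: 300 languages reach about 95% of speakers, while getting from there to everyone takes some 6,500 more.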

This project is made possible through the support and sponsorship of speech technology expert James Baker and will be developed in partnership with his ALLOW initiative. We will be putting out a call for volunteers soon. In the meantime, please contact rosetta@longnow.org with questions or suggestions.

Looking for more blog articles?



The Long Now Blog

Ideas about Long-term Thinking.


Some Rights Reserved (CC)

The Long Now Foundation
Fostering Long-term Responsibility
est. 01996.