A good article on the digital preservation problem in Popular Mechanics:
When the aircraft carrier USS Nimitz takes to sea, it carries more than a half-million files with diagrams of the propulsion, electrical and other systems critical to operation. Because this is the 21st century, these are not unwieldy paper scrolls of engineering drawings, but digital files on the ship’s computers. The shift to digital technology, which enables Navy engineers anywhere in the world to access the diagrams, makes maintenance and repair more efficient. In theory. Several years ago, the Navy noticed a problem when older files were opened on newer versions of computer-aided design (CAD) software.
“We would open up these drawings and be like, ‘Wow, this doesn’t look exactly like the drawing did before,’” says Brad Cumming, head of the aircraft carrier planning yard division at Norfolk Navy Shipyard.
The changes were subtle — a dotted line instead of dashes or minor dimension changes — but significant enough to worry the Navy’s engineers. Even the tiniest discrepancy might be mission critical on a ship powered by two nuclear reactors and carrying up to 85 aircraft.
The challenge of retrieving digital files isn’t an issue just for the U.S. Navy. In fact, the threat of lost or corrupted data faces anyone who relies on digital media to store documents — and these days, that’s practically everyone. Digital information is so simple to create and store, we naturally think it will be easily and accurately preserved for the future. Nothing could be further from the truth. In fact, our digital information — everything from photos of loved ones to diagrams of Navy ships — is at risk of degrading, becoming unreadable or disappearing altogether.
The problem is both immediately apparent and invisible to the average citizen. It crops up when our hard drive crashes, or our new computer lacks a floppy disk drive, or our online e-mail service goes out of business and takes our correspondence with it. We regard these kinds of data loss as personal catastrophes. Writ large, they are symptomatic of a growing crisis. If the software and hardware we use to create and store information are not inherently trustworthy over time, then everything we build using that information is at risk.
Large government and academic institutions began grappling with the problem of data loss years ago, with little substantive progress to date. Experts in the field agree that if a solution isn’t worked out soon, we could end up leaving behind a blank spot in history. “Quite a bit of this period could conceivably be lost,” says Jeff Rothenberg, a computer scientist with the Rand Corp. who has studied digital preservation.
Throughout most of our past, preserving information for posterity was mostly a matter of stashing photographs, letters and other documents in a safe place. Personal accounts from the Civil War can still be read today because people took pains to save letters, but how many of the millions of e-mails sent home by U.S. servicemen and servicewomen from the front lines in Iraq will be accessible a century from now?
One irony of the Digital Age is that archiving has become a more complex process than it was in the past. You not only have to save the physical discs, tapes and drives that hold your data, but you also need to make sure those media are compatible with the hardware and software of the future. “Most people haven’t recognized that digital stuff is encoded in some format that requires software to render it in a form that humans can perceive,” Rothenberg says. “Software that knows how to render those bits becomes obsolete. And it runs on computers that become obsolete.”
In 1986, for example, the British Broadcasting Corp. compiled a modern, interactive version of William the Conqueror’s Domesday Book, a survey of life in medieval England. More than a million people submitted photographs, written descriptions and video clips for this new “book.” It was stored on laser discs — considered indestructible at the time — so future generations of students and scholars could learn about life in the 20th century.
But 15 years later, British officials found the information on the discs was practically inaccessible — not because the discs were corrupted, but because they were no longer compatible with modern computer systems. By contrast, the original Domesday Book, written on parchment in 1086, is still in readable condition in England’s National Archives in Kew. (The multimedia version was ultimately salvaged.)
Changing computer standards aren’t the only threat to digital data. In 2004, Miami-Dade County announced it had lost almost all the electronic voting records from a 2002 election because of a series of computer crashes, a reminder that many failures of digital record-keeping are attributable to everyday equipment failure. Additionally, software companies can go out of business, taking their proprietary code with them. In 2001, the online photo storage site PhotoPoint shut down and hundreds of people lost the digital photos they stored on the site.
But data loss is not always as apparent as a fried hard drive or a disc with no machine to play it. A digital file is just a long string of binary code. Unlike a letter or a photograph, its content is not immediately apparent to the end user. In order to see a photograph that has been saved as a JPEG file or to read a letter composed in a word processing program, we need software that can translate that code for us.
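The idea that bits mean nothing until software interprets them can be illustrated with a few lines of Python. This is only a sketch: the four byte values are made up for illustration, and the three interpretations (Latin-1 text, big-endian integer, little-endian integer) are arbitrary choices, not anything described in the article.

```python
import struct

# Four arbitrary bytes standing in for the contents of a tiny "file".
# These values are an illustrative assumption, not real archived data.
raw = bytes([0x42, 0x4D, 0x00, 0x01])


def as_text(data: bytes) -> str:
    """Interpret the bytes as Latin-1 text (one character per byte)."""
    return data.decode("latin-1")


def as_big_endian_int(data: bytes) -> int:
    """Interpret the same four bytes as an unsigned 32-bit big-endian integer."""
    return struct.unpack(">I", data)[0]


def as_little_endian_int(data: bytes) -> int:
    """Interpret the same four bytes as an unsigned 32-bit little-endian integer."""
    return struct.unpack("<I", data)[0]


if __name__ == "__main__":
    # Same bits, three different "meanings", depending on the reader.
    print(repr(as_text(raw)))
    print(as_big_endian_int(raw))
    print(as_little_endian_int(raw))
```

None of the three readings is the "right" one. Without a record of which format the bytes were written in, a future reader has no way to choose among them.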
Software applications are updated on average every 18 months to two years, according to the Software and Information Industry Association, and newer versions are not always backward compatible with the previous ones. That could be a problem on the USS Nimitz, just as it could make trouble for you if the file in question held your medical records.
Likewise, law firms find that metadata—data about the data, such as the date when a file was created—are often not transferred accurately when files are copied. For example, magnetic storage media, such as hard drives, allow for a three-part date storage system (created/accessed/modified), whereas the file architecture of optical media, such as CD-Rs, allows for only one date. This presents a difficulty in litigation, when attorneys must build chronologies of key events in a case. “I see this in almost every single case,” says Craig Ball, a computer forensics expert who advises law firms. “It’s a complex problem at so many levels. We are losing so much.”
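A related way dates get lost is visible on an ordinary desktop: some copy operations carry a file's timestamps along and others do not. The sketch below uses Python's standard `shutil.copyfile` (contents only) and `shutil.copy2` (contents plus an attempt to preserve metadata). The file names and the 2001 date are assumptions for illustration, and this shows a copy-tool difference rather than the optical-media limitation Ball describes.

```python
import os
import shutil
import tempfile
from typing import Tuple


def copy_and_compare(preserve_metadata: bool) -> Tuple[float, float]:
    """Copy a file and return (source_mtime, copy_mtime) in seconds since the epoch."""
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "memo.txt")        # hypothetical file name
        dst = os.path.join(workdir, "memo_copy.txt")   # hypothetical file name
        with open(src, "w", encoding="utf-8") as handle:
            handle.write("Draft contract terms\n")

        # Pretend the memo was last modified on 2001-01-01 00:00:00 UTC.
        old_time = 978307200
        os.utime(src, (old_time, old_time))

        if preserve_metadata:
            shutil.copy2(src, dst)      # copies contents and tries to keep timestamps
        else:
            shutil.copyfile(src, dst)   # copies contents only

        return os.stat(src).st_mtime, os.stat(dst).st_mtime


if __name__ == "__main__":
    for preserve in (False, True):
        source_mtime, copy_mtime = copy_and_compare(preserve)
        print(preserve, source_mtime, copy_mtime)
```

With `preserve_metadata=False`, the copy's modification time is the moment of copying, so a chronology built from the copy would place the memo years later than it was actually written.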
As Richard Pearce-Moses, past president of the Society of American Archivists, puts it, “We can keep the 0s and 1s alive forever, but can we make sense of them?”
I traveled recently to Washington, D.C., to meet with Ken Thibodeau, head of the National Archives’ Electronic Records Archive (ERA). The National Archives is charged with the daunting task of preserving all historically relevant documents and materials generated by the federal government, everything from White House e-mails to the storage locations of nuclear waste. Ten years ago, Thibodeau’s biggest concern was how to handle the 32 million e-mails sent to the archives by the Clinton administration. And that was just the beginning. The Bush White House is expected to produce 100 million e-mails by 2008. Thibodeau long ago realized that simply copying the data to magnetic tapes, the archives’ previous means of storing electronic records, was not going to work in the Digital Age. It would take years to copy those e-mails to tape, and that was just a trickle compared to the avalanche of more complex digital files that were coming his way.
“The problem is that everything we build, whether it is a highway, tunnel, ship or airplane, is designed using computers,” Thibodeau says. “Electronic records are being sent to the archives at 100 times the rate of paper records. We don’t know how to prevent the loss of most digital information that’s being created today.”
The National Archives must not only sort through the tremendous volume of data, it must also find a way to make sense of it. Thibodeau hopes to develop a system that preserves any type of document—created on any application and any computing platform, and delivered on any digital media—for as long as the United States remains a republic. Complicating matters further, the archive needs to be searchable. When Thibodeau told the head of a government research lab about his mission, the man replied, “Your problem is so big, it’s probably stupid to try and solve it.”
Last year, the National Archives awarded Lockheed Martin a $308 million contract to develop the system. “We think this is a groundbreaking effort of the Information Age,” says Clyde Relick, the project’s program director.
To date, the ERA has identified more than 4500 file types that need to be accounted for. Each file type essentially requires an independent solution. What type of information needs to be preserved? How does that information need to be presented?
As a relatively simple example, let’s take an e-mail from the head of a regulatory agency. If the correspondence is pure text, it’s a straightforward solution. But what if there is an attachment? What type of file is the attachment? If the attachment is a spreadsheet, does the behavior of the spreadsheet need to be retained? In other words, will it be important for future generations to be able to execute the formulas and play with the data?
“That is unlike a challenge we would have with a paper document,” Relick says. More complex file formats, such as NASA virtual reality training programs, require more complex solutions. The ERA is working with a number of research partners, including the San Diego Supercomputer Center and the National Science Foundation, on some of those more intricate challenges.
Lockheed is building what is primarily a “migration” system, in which files are translated into flexible formats such as XML (extensible markup language), so the files can be accessed by technologies of the future. The idea is to make copies without losing essential characteristics of the data.
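To make the migration idea concrete, here is a minimal sketch of translating a tiny spreadsheet into XML while keeping both the formulas and their computed values. Everything in it is an assumption for illustration: the `sheet`/`cell`/`formula`/`value` element names, the in-memory dictionary standing in for a legacy file, and the function name are all hypothetical and are not the ERA's or Lockheed's actual design.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for a parsed legacy spreadsheet:
# cell reference -> (formula or None, last computed value).
Cells = Dict[str, Tuple[Optional[str], float]]


def migrate_cells_to_xml(cells: Cells) -> str:
    """Write each cell as XML, keeping the formula (behavior) and the value (appearance)."""
    root = ET.Element("sheet")
    for ref, (formula, value) in sorted(cells.items()):
        cell = ET.SubElement(root, "cell", ref=ref)
        if formula is not None:
            # Keeping the formula lets a future reader re-run the calculation.
            ET.SubElement(cell, "formula").text = formula
        # Keeping the value lets a future reader see what the author saw.
        ET.SubElement(cell, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


legacy_cells: Cells = {
    "A1": (None, 120),
    "A2": (None, 80),
    "A3": ("=A1+A2", 200),
}
xml_text = migrate_cells_to_xml(legacy_cells)

if __name__ == "__main__":
    print(xml_text)
```

The choice of what to write out is exactly the judgment call the article describes: a migration that kept only the `value` elements would still look right on screen, but the spreadsheet's behavior would be gone.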
Not everyone agrees with Lockheed’s approach. Rothenberg, of the Rand Corp., for example, believes an “emulation” strategy would be more appropriate. Emulation allows a modern computer to mimic an older computer so it can run a certain program. Popular emulation programs in use today are those that allow people to take video games made for Sony PlayStation 2 or Microsoft Xbox and play them on PCs.
“It seems to me that migration throws away the original,” Rothenberg says. “It doesn’t even try to save the original. What you end up with is somebody’s idea about what was important about the original.”
Relick says the cost and technical effort involved in emulation are not feasible for a project the size of the ERA. In addition, he notes that the archives in their entirety will need to be accessible to anyone with a browser, and emulation becomes more difficult when you have to account for users with an infinite variety of hardware and software.
The goal for the Lockheed team is to have initial operating capability for the ERA in September 2007, but budget cuts may delay the program’s search functionality.
The data crisis is by no means limited to the National Archives, or to branches of the military. The Library of Congress is in the midst of its own preservation project, and many universities are scrambling to build systems that capture and retain valuable academic research.
But the programs in development for government and academia won’t help find the lost e-mail of an individual computer user. Some experts believe that this is the result of simple market forces: Consumers have shown little interest in digital preservation, and corporations are in the business of meeting consumer demand. Others say corporations are only concerned with selling more new products.
“Their interest, it seems to me, is creating incompatibilities over time, not compatibilities,” Rothenberg says. “Looking at it cynically, they have very little motivation to burden themselves with compatibility because doing so only allows their customers to avoid upgrading.”
Nevertheless, there have been encouraging developments. In late 2005, Microsoft announced it was opening the file formats of its Office suite, including Word and Excel, to competitors in order to get Office certified as an international standard. By ceding proprietary control of the formats to third-party developers, Microsoft greatly increases the odds that those formats will be accessible for future generations.
Meanwhile, the International Organization for Standardization recently certified a modified version of Adobe Systems’ popular Portable Document Format (PDF) specifically for long-term archiving. It’s called PDF/A. In essence, PDF/A preserves everything contained in a document that can be printed while excluding features that may be useful in the short term but problematic in the long term. For example, the new format does not allow embedded links to external applications, which could become obsolete, and it doesn’t allow for passwords, which can be lost or forgotten. “It is all about creating a reliable presentation down the road,” says Melonie Warfel, director of worldwide standards for Adobe, who worked on the project. Adobe is also working on archiving standards for engineering documents and digital images.
If history is a guide (and that, after all, is the point of preserving history), we know the future will offer the means to manipulate digital information in ways we cannot yet imagine. The trick is to keep moving forward without leaving too much behind.
“It goes beyond this notion of ‘important records’; it goes to the things that are important to us,” says Warfel, the mother of two children. “My mom had shoeboxes full of photographs, but we don’t do that anymore. I have hard drives full of photographs.”