Henry Newman has written a basic overview of the issues with data preservation in an increasingly all-digital world, but the title of his essay, "Rocks Don't Need to Be Backed Up," gives me hives, because it is transparently wrong. Newman opens the essay by explaining the origins of the title he chose:
My wife and I were in New York’s Central Park last fall when we saw a nearly 4,000-year-old Egyptian obelisk that has been remarkably well preserved, with hieroglyphs that were clearly legible — to anyone capable of reading them, that is. I’ve included a couple of pictures below to give you a better sense of this ancient artifact — and how it relates to data storage issues.
As we stood wondering at this archaeological marvel, my wife, ever mindful of how I spend the bulk of my time, blurted out, “Rocks do not need backing up!”
But of course rocks do need backing up. Of the hundreds of obelisks apparently built by the ancient Egyptians, only 27 have survived completely intact to today. If the Egyptians had, perhaps, made a backup copy of each of those obelisks, some of them might have survived to our time intact. Of course, making the initial version was a laborious process, so even had they wanted to make backup copies, it might very well have been out of the question.
Compare this to the situation with paper and animal skins. Paper is easier both to create and to destroy than the granite the obelisks were carved from. For historical documents more than a few hundred years old, we rarely have the original. Rather, we have copies and, in many cases, copies of copies of translations, etc. Oftentimes this in itself creates problems, as the copying process was rarely 100% accurate and the copyists would occasionally intentionally insert or delete passages. The most pronounced example of this is the books of the New Testament, of which there are numerous versions, none of them "original" copies, and frequently diverging from one another in ways both small and large.
So now we enter the digital age. We still have one of the main problems that has vexed historians — the possibility that we'll forget how to read certain documents, although this has switched from not being able to decipher long-lost languages to not being able to read long-abandoned formats (and by "long abandoned" that could mean "in the last 24 months").
The other day, for example, I was cleaning out my office and ran across a stack of SyQuest disk cartridges that belonged to someone who left my organization about a decade ago. SyQuest systems were essentially the forerunner of Zip disks, and SyQuest dominated the market for large-scale removable storage in the 1980s and early 1990s. By the mid-1990s, however, SyQuest found itself with huge quality-control problems that eventually forced it into bankruptcy in 1998.
The data on those SyQuest disks is largely unreadable to me. Or, more precisely, I probably could recover the data, but at a price I'm not willing to pay. Moreover, since the data was all created on a mid-1990s Macintosh, I'm not certain I'd even be able to make meaningful use of it if I did.
But there are clearly ways we could turn this around and use the technology to our advantage. Unlike rocks and paper, making copies — lots of copies — is trivially easy with data. Most people, and a surprising number of organizations, don't seem to have much of a plan at all for doing so, but it's becoming easier and cheaper to do.
On a personal level, I literally have dozens of copies of my personal data store (which is roughly 500GB and growing), created with the frankly still-primitive tools available for doing so. Now, there is a major difference between backing up my little old 500GB and backing up an organization that may have 500TB spread across its systems. Additionally, it's easy enough for someone obsessed with data preservation to pull this off on an individual basis with a bit of attention, but there's still a significant amount of personal intervention required to keep everything going and to make sure everything works.
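To give a sense of what I mean by "primitive tools," here is a minimal sketch of the kind of copy-and-verify routine this amounts to. The paths and function names are hypothetical stand-ins, not my actual tooling:

```python
# A minimal copy-and-verify backup sketch. Paths are hypothetical;
# swap in your own source and destination.
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_and_verify(source: Path, destination: Path) -> list[Path]:
    """Copy every file under source into destination, then confirm each
    copy's checksum matches the original. Returns any mismatched files."""
    failures = []
    for src_file in source.rglob("*"):
        if not src_file.is_file():
            continue
        dst_file = destination / src_file.relative_to(source)
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 preserves timestamps
        if sha256_of(src_file) != sha256_of(dst_file):
            failures.append(src_file)
    return failures

if __name__ == "__main__":
    bad = backup_and_verify(Path("~/data").expanduser(), Path("/mnt/backup/data"))
    print(f"{len(bad)} file(s) failed verification")
```

The verification pass is exactly the part that demands the personal intervention I mentioned: something has to read the checksums, notice the mismatches, and re-copy the failures.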
But if I can do it, surely there's a way to scale backups wider than just me. The rise of the (too) numerous companies offering online backups at least suggests a growing awareness of the problem of data loss. Of course, many of these companies have business models that suggest they'll soon be part of the problem rather than the solution (i.e., when that VC funding finally runs out).
On the other hand, few people seem to give a damn about the file-format incompatibility issue. The solution there is also simple enough: only use open, well-documented file formats that can easily be reconstructed when support for them disappears. Instead, the reality is that people rush around seeing who can upgrade first to the latest Microsoft product whose file formats are completely incompatible with every other file format in the history of the world.
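For what it's worth, the open-format discipline costs almost nothing in practice. Here is an illustration, with made-up records, of writing the same data to two open, well-documented formats that a future reader could parse with nothing more than the published specifications:

```python
# Writing data to open, documented formats (CSV and JSON).
# The records here are invented for illustration.
import csv
import json
from pathlib import Path

records = [
    {"title": "Quarterly report", "created": "1996-03-01", "author": "unknown"},
    {"title": "Obelisk photo index", "created": "1997-11-15", "author": "unknown"},
]

# CSV: a flat, human-readable table any future tool can parse.
with Path("catalog.csv").open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "created", "author"])
    writer.writeheader()
    writer.writerows(records)

# JSON: still plain text, still self-describing.
Path("catalog.json").write_text(json.dumps(records, indent=2), encoding="utf-8")
```

A plain-text table written in 1996 would still be readable today; the same cannot be said for most proprietary binary formats of that vintage.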
Even then, I think Newman is overly pessimistic about the effect of data loss:
Digital data management concepts, technologies and standards just do not exist today. I don’t know of anyone or anything that addresses all of these problems, and if it is not being done by a standards body, it will not help us manage the data in the long run. It is only a matter of time until a lot of data starts getting lost. A few thousand years from now, what will people know about our lives today? If we are to leave obelisks for future generations, we’d better get started now.
It is almost inevitable that a lot of data will get lost; previous civilizations rarely left behind more than a small percentage of their "data" in forms that are usable by us, and we will likely be no different, except that we are generating so much data that even the trickle that survives will still be sufficient to overwhelm future historians trying to get a handle on us.
We should definitely make the issues Newman writes about a priority, but I worry more that future generations will be drowning in our data remnants than that they will see our era as a black hole of data loss.