Henry Newman has written a basic overview of the issues with data preservation in an increasingly all-digital world, but the title of his essay, "Rocks Don't Need to Be Backed Up," gives me hives, because it is transparently wrong. Newman opens the essay by explaining the origins of the title he chose:
My wife and I were in New York’s Central Park last fall when we saw a nearly 4,000-year-old Egyptian obelisk that has been remarkably well preserved, with hieroglyphs that were clearly legible — to anyone capable of reading them, that is. I’ve included a couple of pictures below to give you a better sense of this ancient artifact — and how it relates to data storage issues.
As we stood wondering at this archaeological marvel, my wife, ever mindful of how I spend the bulk of my time, blurted out, “Rocks do not need backing up!”
But of course rocks do need backing up. Of the hundreds of obelisks apparently built by the ancient Egyptians, only 27 have survived completely intact to today. If the Egyptians had, perhaps, made a backup copy of each of those obelisks, some of them might have survived to our time intact. Of course, making the initial version was a laborious process, so even had they wanted to make backup copies, it might very well have been out of the question.
Compare this to the situation with paper and animal skins. Paper is easier both to create and to destroy than the granite the obelisks were carved from. For historical documents more than a few hundred years old, we rarely have the original. Rather, we have copies and, in many cases, copies of copies of translations, etc. Oftentimes this in itself creates problems, as the copying process was rarely 100% accurate and the copyists would occasionally intentionally insert or delete passages. The most pronounced example of this is the books of the New Testament, of which there are numerous versions, none of them "original" copies, and frequently diverging from one another in ways both small and large.
So now we enter the digital age. We still have one of the main problems that has vexed historians — the possibility that we'll forget how to read certain documents, although this has switched from not being able to decipher long-lost languages to not being able to read long-abandoned formats (and by "long abandoned" that could mean "in the last 24 months").
The other day, for example, I was cleaning out my office and ran across a stack of SyQuest disk cartridges that belonged to someone who left my organization about a decade ago. SyQuest systems were essentially the forerunner of Zip disks, and SyQuest dominated the market for large-scale removable storage in the 1980s and early 1990s. By the mid-1990s, however, SyQuest found itself with huge quality-control problems that eventually forced it into bankruptcy in 1998.
The data on those SyQuest disks is largely unreadable to me. Or, more precisely, I probably could recover the data, but at a price I'm not willing to pay. Moreover, since the data was all created on a mid-1990s Macintosh, I'm not certain I'd even be able to make meaningful use of it if I did.
But there are clearly ways we could turn this around and use the technology to our advantage. Unlike rocks and paper, making copies — lots of copies — is trivially easy with data. Most people, and a surprising number of organizations, don't seem to have much of a plan at all for doing so, but it's becoming easier and cheaper to do.
On a personal level, I literally have dozens of copies of my personal data store (which is roughly 500GB and growing), created with the frankly still-primitive tools available for doing so. Now, there is a major difference between backing up my little old 500GB and backing up an organization that may have 500TB spread across its systems. Additionally, it's easy enough for someone obsessed with data preservation to pull this off on an individual basis with a bit of attention, but there's still a significant amount of personal intervention required to keep everything going and to make sure everything works.
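To give a sense of what I mean by "primitive tools," here is a minimal sketch of the kind of copy-and-verify routine this amounts to. The paths and function names are hypothetical stand-ins, not my actual tooling:

```python
# A minimal copy-and-verify backup sketch. Paths are hypothetical;
# swap in your own source and destination.
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_and_verify(source: Path, destination: Path) -> list[Path]:
    """Copy every file under source into destination, then confirm each
    copy's checksum matches the original. Returns any mismatched files."""
    failures = []
    for src_file in source.rglob("*"):
        if not src_file.is_file():
            continue
        dst_file = destination / src_file.relative_to(source)
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 preserves timestamps
        if sha256_of(src_file) != sha256_of(dst_file):
            failures.append(src_file)
    return failures

if __name__ == "__main__":
    bad = backup_and_verify(Path("~/data").expanduser(), Path("/mnt/backup/data"))
    print(f"{len(bad)} file(s) failed verification")
```

The verification pass is exactly the part that demands the personal intervention I mentioned: something has to read the checksums, notice the mismatches, and re-copy the failures.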
But if I can do it, surely there's a way to scale backups wider than just me. The rise of the (too) numerous companies offering online backups at least suggests a growing awareness of the problem of data loss. Of course, many of these companies have business models that suggest they'll soon be part of the problem rather than the solution (i.e., when that VC funding finally runs out).
On the other hand, few people seem to give a damn about the file-format incompatibility issue. The solution there is also simple enough: only use open, well-documented file formats that can easily be reconstructed when support for them disappears. Instead, the reality is that people rush around seeing who can upgrade first to the latest Microsoft product whose file formats are completely incompatible with every other file format in the history of the world.
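For what it's worth, the open-format discipline costs almost nothing in practice. Here is an illustration, with made-up records, of writing the same data to two open, well-documented formats that a future reader could parse with nothing more than the published specifications:

```python
# Writing data to open, documented formats (CSV and JSON).
# The records here are invented for illustration.
import csv
import json
from pathlib import Path

records = [
    {"title": "Quarterly report", "created": "1996-03-01", "author": "unknown"},
    {"title": "Obelisk photo index", "created": "1997-11-15", "author": "unknown"},
]

# CSV: a flat, human-readable table any future tool can parse.
with Path("catalog.csv").open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "created", "author"])
    writer.writeheader()
    writer.writerows(records)

# JSON: still plain text, still self-describing.
Path("catalog.json").write_text(json.dumps(records, indent=2), encoding="utf-8")
```

A plain-text table written in 1996 would still be readable today; the same cannot be said for most proprietary binary formats of that vintage.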
Even then, I think Newman is overly pessimistic about the effect of data loss:
Digital data management concepts, technologies and standards just do not exist today. I don’t know of anyone or anything that addresses all of these problems, and if it is not being done by a standards body, it will not help us manage the data in the long run. It is only a matter of time until a lot of data starts getting lost. A few thousand years from now, what will people know about our lives today? If we are to leave obelisks for future generations, we’d better get started now.
It is almost inevitable that a lot of data will get lost; previous civilizations rarely left behind more than a small percentage of their "data" in forms that are usable by us, and we will likely be no different, except that we are generating so much data that even the trickle that survives will still be sufficient to overwhelm future historians trying to get a handle on us.
We should definitely make the issues Newman writes about a priority, but I worry more that future generations will be drowning in our data remnants than that they will see our era as a black hole of data loss.