Think the memory card in your camera is high-capacity? It's got nothing on DNA. With data accumulating at a faster rate now than at any other point in human history, scientists and engineers are looking to genetic code as a form of next-generation digital information storage.
Now, a team of Harvard and Johns Hopkins geneticists has developed a new method of DNA encoding that makes it possible to store more digital information than ever before. We spoke with lead researcher Sriram Kosuri to learn why the future of archival data storage is in genetic code, and why his team's novel encoding scheme represents such an important step toward harnessing DNA's vast storage potential.
Humanity has a storage problem. Recent surveys conducted by IDC Digital Universe suggest that the perfusion of technology throughout society has triggered an explosion in the volume of information that we as a species produce on a daily basis. Between photos, video, texts, tweets, Facebook updates, unsolicited FarmVille requests, Instagram posts and various other forms of digital data production, the world's information is doubling every two years, and that raises some important questions, chief among them being: where the hell do we put it all?
"In 2011 we had 1.8 * 1021 bytes of information stored and replicated" explains Sriram Kosuri, a Harvard geneticist and member of the Wyss Institute's synthetic biology platform, in an email to io9. "By 2020 it will be 50 times that. That's an astounding number; and doesn't include a much larger set of data that's thrown away (e.g., video feeds)."
As Kosuri points out, not all of this information needs to be stored, but — being the diligent little hoarders that we are — a good deal of it will be cached away somewhere for posterity; and at the rate we're generating information, we'll need to find new storage solutions if we want to have any hope of keeping up with our demand for space. "Our ability to store, manage, and archive such information is being constantly strained already," notes Kosuri. "Archival storage is also a large problem."
Archival storage is where DNA comes in. As storage media go, it's hard to compete with the universal building blocks of life. In an article published in today's issue of Science, Kosuri — in co-authorship with geneticist Yuan Gao and synthetic biology pioneer George Church — describes a new technique for using DNA to encode digital information in unprecedented quantities. We'll get to their novel storage method in the next section, but for now let's look at some numbers that help contextualize what Kosuri identifies as the two major advantages of DNA storage: information density and stability.
At its theoretical maximum, one gram of single-stranded genetic code can encode 455 exabytes of information. That's almost half a billion terabytes, or 4.9 × 10^11 GB. (As a point of reference, the latest iPad tops out at 64 GB of storage space.) DNA strands also like to fold over on top of themselves, meaning that, unlike most other digital storage media, data needn't be restricted to two dimensions; and being able to store data in three-space translates to more free space.
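For anyone who wants to check those conversions, here is a rough sketch in Python. It assumes binary unit steps (1,024 of each unit per next-larger unit), which is the convention that reproduces the 4.9 × 10^11 GB figure; the function and variable names are just illustrative.

```python
# Binary unit steps (assumption): 1 EB = 1,024 PB, 1 PB = 1,024 TB, 1 TB = 1,024 GB.
STEP = 1024

def exabytes_to_terabytes(exabytes):
    """Convert exabytes to terabytes (two unit steps down)."""
    return exabytes * STEP**2

def exabytes_to_gigabytes(exabytes):
    """Convert exabytes to gigabytes (three unit steps down)."""
    return exabytes * STEP**3

DENSITY_EB_PER_GRAM = 455   # theoretical maximum quoted in the article
IPAD_GB = 64                # top iPad capacity quoted in the article

terabytes = exabytes_to_terabytes(DENSITY_EB_PER_GRAM)
gigabytes = exabytes_to_gigabytes(DENSITY_EB_PER_GRAM)
ipads_per_gram = gigabytes / IPAD_GB

print(f"{terabytes:.2e} TB per gram")
print(f"{gigabytes:.2e} GB per gram")
print(f"about {ipads_per_gram:.1e} 64 GB iPads per gram")
```

With decimal units instead (1 EB = 10^9 GB), the gigabyte figure comes out slightly lower, at about 4.6 × 10^11.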
DNA is also incredibly robust, and is often readable even after being exposed to unfavorable conditions for thousands of years. Every time researchers recover genetic information from a woolly mammoth specimen, or sequence the genome of a 5,300-year-old human mummy, it's a testament to DNA's durability and data life. Just try recovering files from a 5,000-year-old CD or DVD. Hell, try it with a 20-year-old disc; odds are it just isn't going to happen.
That being said, DNA has its shortcomings. "It's not re-writable, it's not random access, and it is very high latency," explains Kosuri, "so really the applications are for archival storage (not to downplay the importance of archives)."
5.27 megabits probably doesn't strike you as a lot (that comes out to roughly 660 kilobytes of information, about what you'd find on a 3.5" floppy from the 80s), but it's impressive for at least three reasons:
One: It positively crushes the previous DNA-storage record of 7,920 bits.
Two: The novel encoding method employed by Kosuri and his colleagues allowed them to address issues of cost and accuracy, two long-standing technical hurdles facing DNA storage:
The major reason why this would have been difficult in the past is that it is really difficult to construct a large stretch of DNA with exact sequence, and make it cheaply. We took an approach that allows us to use short stretches of DNA (basically by having an address (19 bits) and data block (96 bits)), so each short stretch can be stitched together later after sequencing. Using short stretches allowed us to leverage both next-generation synthesis [for writing data]… and next-generation sequencing [for reading data] technologies to really lower cost and ease.
Three: It offers a compelling proof of concept that DNA can be used to store digital information at remarkable densities. "What we published in terms of scale is… obviously small compared to commercial technologies now," explains Kosuri, but "using our method, a petabyte of data [one petabyte = 1,024 terabytes] would require about 1.5 mg of DNA." Since that genetic information can be packaged in three dimensions, that translates to a storage volume of about one cubic millimeter.
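The address-plus-data-block idea Kosuri describes under reason two can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the team's actual pipeline: it works on plain bit strings and stops short of mapping bits onto DNA bases, and the sequential numbering, zero padding and function names are all assumptions made for the example. Only the 19-bit and 96-bit sizes come from Kosuri's description.

```python
import random

ADDRESS_BITS = 19   # address length quoted by Kosuri
DATA_BITS = 96      # data-block length quoted by Kosuri

def split_into_fragments(bits):
    """Cut a bit string into fragments of 19 address bits + 96 data bits."""
    # Zero-pad the tail so the last block is full length (assumed padding rule).
    padded = bits + "0" * (-len(bits) % DATA_BITS)
    n_blocks = len(padded) // DATA_BITS
    if n_blocks > 2**ADDRESS_BITS:
        raise ValueError("too many blocks for a 19-bit address")
    fragments = []
    for index in range(n_blocks):
        address = format(index, f"0{ADDRESS_BITS}b")
        block = padded[index * DATA_BITS:(index + 1) * DATA_BITS]
        fragments.append(address + block)
    return fragments

def reassemble(fragments, original_length):
    """Sort fragments by address, join the data blocks, drop the padding."""
    ordered = sorted(fragments, key=lambda frag: int(frag[:ADDRESS_BITS], 2))
    joined = "".join(frag[ADDRESS_BITS:] for frag in ordered)
    return joined[:original_length]

text = "Archival storage is where DNA comes in."
bits = "".join(format(byte, "08b") for byte in text.encode("ascii"))

fragments = split_into_fragments(bits)
random.shuffle(fragments)   # fragments come back in no particular order

recovered = reassemble(fragments, len(bits))
decoded = bytes(
    int(recovered[i:i + 8], 2) for i in range(0, len(recovered), 8)
).decode("ascii")
print(decoded)
```

Because every fragment carries its own address, the order in which fragments are read back doesn't matter. At 96 data bits apiece, 5.27 megabits works out to roughly 54,900 fragments, comfortably within the 524,288 addresses that 19 bits can express.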
The logarithmic plot featured here illustrates how the storage density demonstrated by Kosuri and his team (labeled "This Work") compares to technologies of today and tomorrow. You should really just reference the graph, but to summarize: DNA wins out by a landslide.
"For example," explains Kosuri, "we are ~10 orders of magnitude (100 billion fold) more dense than a CD, a million-fold more dense than the best commercial storage technologies, and about ~1000 fold more dense than [other] proof-of-concept work (e.g., position atoms on a surface)." He says the secret to DNA's superiority harkens back to the fact that it can be stored dry in three dimensions; "thus there is no surface that requires a thickness, which really kills 3D data density."
DNA storage has its limitations. As I mentioned earlier, it's not re-writable, and it's not random access. Its latency is also too high for it to be practical for anything other than archival storage, but we've already established that we're in dire need of space for archiving, anyway. The only other big limiting factors, at present, are synthesis and sequencing technologies — and those won't be an issue for much longer.
According to Kosuri, the costs of DNA synthesis and sequencing have been dropping much faster than Moore's law. In the supplementary information section of their paper, Kosuri and his colleagues imagine what a petabyte of storage would require, from the standpoint of synthesis and sequencing costs, and conclude that they would need a drop of roughly 6 orders of magnitude in sequencing costs, and 7-8 orders of magnitude in synthesis costs, for storage media of that capacity to become feasible.
"To give perspective," explains Kosuri, "costs have been dropping for the past 5-10 years at 10x and 5x per year for sequencing and synthesis respectively." In other words: this tech is right around the corner. Are you ready for your DNA drive?
The researchers' results are published in the latest issue of Science.