It may not seem logical, or even possible, but lately there has been a lot of discussion about using DNA sequencing for data storage. The reason? More data has been generated in the last two years than in all of preceding history, creating a data storage problem for all of humanity.
Up until now, one of the best ways to store a lot of information in a small amount of space has been magnetic tape. Antiquated, yes, but magnetic tape is cheap and durable enough to hold information for up to 30 years, with as much as a terabyte of data per roll.
It is safe to say that the future will surpass the present when it comes to the amount of data created, so magnetic tape is probably not the best long-term plan.
Enter DNA data storage. A biological material such as DNA might appear to be an odd choice for backing up large amounts of digital information, yet its ability to pack enormous amounts of data in a tiny space has been clear for more than 70 years.
But what is the risk of malware infecting a DNA storage mechanism? Turns out, it’s pretty high.
DNA data storage – its history and its importance
Looking back to the 1940s, physicist Erwin “cat in a box” Schrödinger proposed a hereditary “code-script” that could be packed into a non-repeating structure he described as an aperiodic crystal. His suggestion led to the determination of DNA’s helical structure, sparking a revolution in understanding the mechanics of life.
How does this determination evolve into data storage?
While strings of nucleic acid (DNA) have been used to cram information into living cells for billions of years, their role in IT data storage was demonstrated for the first time just five years ago, when a Harvard University geneticist encoded his book, including JPG data for illustrations, in just under 55,000 strands of DNA. Since then, the technology has progressed to the point where scientists have been able to record a whopping 215 petabytes (215 million gigabytes) of information on a single gram of DNA.
Storing 215 million gigabytes of data per gram may sound abstract, but in practical terms, this technology could, in principle, store every bit of data ever recorded by humans in a container about the size and weight of a couple of pickup trucks.
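To make the idea of writing digital data into DNA more concrete, here is a minimal Python sketch. It assumes a simple mapping of two bits per base (A=00, C=01, G=10, T=11), chosen purely for illustration; it is not necessarily the scheme used in the work described above. The function names are hypothetical, and real systems also add addressing and error correction.

```python
# Illustrative only: a 2-bits-per-base mapping (an assumption, not the
# encoding used by the researchers mentioned in the article).
BASES = "ACGT"  # index 0..3 corresponds to bit pairs 00, 01, 10, 11


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(strand: str) -> bytes:
    """Decode a strand produced by bytes_to_dna back into bytes."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4 bases")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            # str.index raises ValueError for anything other than A, C, G, T
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    strand = bytes_to_dna(b"Hi")
    print(strand)                # CAGACGGC
    print(dna_to_bytes(strand))  # b'Hi'
```

At two bits per base, every byte costs four bases, which is why even a short text turns into a long sequence that has to be split across many strands.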
The advantages of this type of DNA data storage are plentiful.
DNA is ultracompact, and it can last hundreds of thousands of years if kept in a cool, dry place. And as long as human societies are reading and writing DNA, they will be able to decode it. “DNA won’t degrade over time like cassette tapes and CDs, and it won’t become obsolete,” says Yaniv Erlich, a computer scientist at Columbia University. And unlike other high-density approaches, such as manipulating individual atoms on a surface, new technologies can write and read large amounts of DNA at a time, allowing it to be scaled up.
There are a couple of things DNA data storage isn’t – fast and cheap. But that doesn’t seem to deter large corporations from investing in the technology.
For example, last year Microsoft demonstrated its DNA data storage by encoding roughly 200 megabytes of data, in the form of 100 literary classics, in DNA. But the process, according to MIT Technology Review, would have cost around $800,000 on the open market. In addition to the exorbitant cost, the process proved to be painfully slow, with data stored at a rate of about 400 bytes per second.
Microsoft says the storage rate needs to reach around 100 megabytes per second for the technology to be feasible. Despite the setbacks in cost and time, the company hopes to start storing its data on strands of DNA within the next few years, with an operational DNA-based storage system inside a data center by the end of the decade.
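A quick back-of-the-envelope calculation shows how large the gap is. The sketch below uses only the figures quoted above and assumes decimal units (1 megabyte = 1,000,000 bytes); the variable and function names are illustrative.

```python
# Figures quoted in the article; decimal megabytes are an assumption.
DATA_BYTES = 200 * 10**6          # roughly 200 megabytes encoded
CURRENT_RATE = 400                # bytes per second achieved
TARGET_RATE = 100 * 10**6         # bytes per second Microsoft says it needs
COST_USD = 800_000                # estimated open-market cost


def seconds_to_write(n_bytes: int, rate_bytes_per_s: int) -> float:
    """Time needed to store n_bytes at a constant rate."""
    return n_bytes / rate_bytes_per_s


current_seconds = seconds_to_write(DATA_BYTES, CURRENT_RATE)
current_days = current_seconds / 86_400
target_seconds = seconds_to_write(DATA_BYTES, TARGET_RATE)
speedup_needed = TARGET_RATE / CURRENT_RATE
cost_per_megabyte = COST_USD / (DATA_BYTES / 10**6)

if __name__ == "__main__":
    print(f"At 400 B/s: {current_seconds:,.0f} s (about {current_days:.1f} days)")
    print(f"At 100 MB/s: {target_seconds:.0f} s")
    print(f"Speedup needed: {speedup_needed:,.0f}x")
    print(f"Cost: ${cost_per_megabyte:,.0f} per megabyte")
```

In other words, the same 200 megabytes that took the better part of a week would take about two seconds at the target rate, a speedup of roughly 250,000 times.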
Potential risks of DNA data storage
As companies like Microsoft invest in DNA data storage technology, scientists have been busy probing the security risks that could come along with it, with the primary goal of better understanding the feasibility of DNA-based code injection attacks.
In what seems like a scene from a science fiction movie, researchers at the University of Washington have been able to successfully infect a computer with malware coded into a strand of DNA.
In order to see if a computer could be compromised in that way, the team included a known security vulnerability in a DNA-processing program before creating a synthetic DNA strand with the malicious code embedded. A computer then analyzed the “infected” strand, and as a result of the malware in the DNA, the researchers were able to remotely exploit the computer. The results were published in a recent paper.
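The experiment is a reminder that sequenced DNA is untrusted input to whatever software processes it. The article does not say what kind of vulnerability was used, so the following Python sketch is only a generic illustration of defensive input checking, not a reconstruction of the attack or its fix. The length limit, the allowed alphabet, and the function names are all assumptions made for the example.

```python
# Hypothetical limits for illustration; real pipelines would choose their own.
MAX_READ_LENGTH = 300
ALLOWED_BASES = frozenset("ACGTN")  # N is commonly used for an unknown base


def validate_read(read: str, max_length: int = MAX_READ_LENGTH) -> str:
    """Return the read unchanged if it looks sane, otherwise raise ValueError."""
    if not read:
        raise ValueError("empty read")
    if len(read) > max_length:
        raise ValueError(f"read longer than {max_length} bases")
    unexpected = set(read) - ALLOWED_BASES
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return read


def filter_reads(reads):
    """Split reads into (accepted, rejected) before further processing."""
    accepted, rejected = [], []
    for read in reads:
        try:
            accepted.append(validate_read(read))
        except ValueError:
            rejected.append(read)
    return accepted, rejected


if __name__ == "__main__":
    sample = ["ACGT", "A" * 301, "ACGX", ""]
    ok, bad = filter_reads(sample)
    print(len(ok), "accepted,", len(bad), "rejected")
```

Checks like these do not make a vulnerable program safe on their own, but they illustrate the principle of never assuming that data read from a biological sample is well formed.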
Karl Koscher, a member of that research team, spoke with NPR’s Scott Simon about the findings and what they mean for gene sequencing.
SIMON: What are the implications of this? What could hackers do?
KOSCHER: Currently, we don’t think that there is much of a threat. These attacks are very difficult to pull off in practice. But looking towards the future, we think that if the technology trends continue, there would be the ability for malicious DNA to compromise computers.
SIMON: In what way? What could somebody with nefarious intentions do?
KOSCHER: So, sequencing DNA is currently most cost efficient when you sequence a bunch of different samples together. And so, typically, you’ll outsource your sequencing to a dedicated sequencing facility or lab. And let’s say you want to learn what other people are sequencing. Say you want to get a leg up on some GMO research that people are doing. You could potentially insert malware into the sample that you send to the DNA sequencing facility to exfiltrate some of that data back to you.
SIMON: How did you guys discover this?
KOSCHER: What we try to do here is sort of look at emerging technologies and see if there are any security implications of those emerging technologies and try to get ahead of potential threats before they become actual threats. And so, we had a team of people here working on a DNA storage project for using DNA instead of, say, hard drives for long-term storage. And so, we had sort of the biological and chemical backgrounds for that as well as security expertise here.
SIMON: Are there any good reasons to use this?
KOSCHER: There is a separate group here working on storing data in DNA. It turns out that DNA is really robust for long-term storage, whereas, you know, hard drives may die after a few years. You can conceivably store data in DNA for hundreds or thousands of years. And as long as life continues to be based on DNA, we’ll always have a reason to read and write DNA. So, it’s sort of a technology that won’t go obsolete.
SIMON: So, I mean, we underscored the aspect of some kind of mischief or outright miscreants. But I wonder, you know, a few years from now, instead of having hard drives or storage systems, will people just carry around that information in themselves, in their DNA?
KOSCHER: I’m not sure that people will carry around that information within them. But I do believe people will start using DNA for sort of long-term storage. So, accessing data from DNA and putting data into DNA is a pretty slow process. And so, you’ll want to, you know, archive photos in there and things like that. But you won’t want to use it for your day-to-day tasks, at least not for the foreseeable future.
The best offense is a good defense
The researchers state that there is currently no evidence that DNA data is under attack. They argue, however, that it is better to consider security threats early in an emerging technology's life, before it matures, because security issues are much easier to fix before real attacks manifest.
“There is no cause for people to be alarmed today, but we also encourage the DNA sequencing community to proactively address computer security risks before any adversaries manifest. That said, it is time to improve the state of DNA security.”