Scientists Are Running Out of Space for Climate Data
(Inside Science) -- Climate science has gotten big. Debates about the research, its implications and the need for corrective action erupt everywhere from Middle American dinner tables to Capitol Hill and the bowels of social media. But climate science has gotten big in another sense, too -- one of sheer data volume. Mathematical models of Earth's climate are producing so much data that scientists may soon be forced to give some of it up.
When it comes to understanding Earth's climate, the actual measurements of things like rainfall and temperature, even when built up over the course of decades, can only get you so far. To predict hypothetical scenarios and tease out what's causing what, scientists use computer programs that simulate the complex forces at work. By "seeding" these models with conditions on a given day and then watching how they play out over years or centuries, scientists can conduct virtual experiments that would be impossible in real life.
"A climate model is really like a virtual laboratory," said Allison Baker, a computational scientist at the National Center for Atmospheric Research in Boulder, Colorado. "You can ask 'what if' questions, like 'what if the Arctic ice all melts?'"
As climate models run, they produce detailed snapshots of conditions around the globe at hourly, daily or monthly intervals. And because the climate is chaotic, simulations don't always play out the same way, so scientists run them over and over again. This quickly adds up to vast amounts of data that are difficult to store and manage. A single experiment can yield two petabytes, or two million gigabytes -- thousands of times what a typical laptop can hold, according to Baker.
The need for storage is growing fast, thanks to increases in computing power and investment in climate research. Five years ago, Baker's lab produced about 2.5 petabytes for the Coupled Model Intercomparison Project, an international collaboration to compare models developed by different labs. For the project's continuation, Baker expects to generate between 20 and 40 additional petabytes of data this year.
Scientists pay around $250,000 to store each petabyte, and those costs are already forcing them to make tough decisions, according to V. Balaji, a climate scientist at Princeton University in New Jersey and head of the Modeling Systems Group at the U.S. National Oceanic and Atmospheric Administration. Balaji's models currently divide Earth's surface into squares 30 to 60 miles on a side, treating each square as one pixel. If data storage and computing were cheaper, he could look at smaller squares approaching 5 miles per side.
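To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above and rests on two simplifying assumptions that are not from the researchers: that storage cost scales linearly with volume, and that the number of grid squares grows with the inverse square of the side length (ignoring vertical layers and time steps). The function names are illustrative.

```python
# Figures quoted in the article.
COST_PER_PETABYTE_USD = 250_000
PROJECTED_PETABYTES = (20, 40)


def storage_cost(petabytes, cost_per_pb=COST_PER_PETABYTE_USD):
    """Storage cost in dollars, assuming cost scales linearly with volume."""
    return petabytes * cost_per_pb


def cell_count_factor(current_side_miles, target_side_miles):
    """How many times more squares cover the same area when the side shrinks."""
    return (current_side_miles / target_side_miles) ** 2


low, high = (storage_cost(pb) for pb in PROJECTED_PETABYTES)
print(f"Storing 20 to 40 petabytes: ${low:,} to ${high:,}")

for side in (30, 60):
    factor = cell_count_factor(side, 5)
    print(f"{side}-mile squares -> 5-mile squares: {factor:.0f}x as many squares")
```

Under these assumptions, storing this year's expected output would cost $5 million to $10 million, and moving to 5-mile squares would mean 36 to 144 times as many squares to track.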
Perhaps ironically, data storage and computing have a significant carbon footprint of their own, which implies that climate scientists may be contributing to the very changes they are studying. But Baker's NCAR computing center works hard to conserve, with up to 89 percent greater energy efficiency than typical data centers, according to the center's website.
The problem isn't just with storage. The more data scientists have, the harder it is to analyze or share between institutions.
"In some cases we actually have people getting on an airplane with a bunch of disk drives," said Peter Lindstrom, a computer scientist at Lawrence Livermore National Laboratory in California. "That's more efficient than sending the data over the internet."
The solution, Lindstrom and Baker argue, must involve "lossy" compression -- ways of scrunching digital information into compact forms at the cost of some of the data itself. Earlier this month at the Joint Statistical Meetings in Baltimore, they joined colleagues from North Carolina State University and Newcastle University to present the latest advances in climate data compression.
Climate scientists already use lossless compression methods that can get data down to about half its original size, said Balaji, who was not at the meeting. Lossy methods can compress data further by sacrificing some information -- ideally, parts that are meaningless noise or irrelevant to future analyses. Many people are familiar with this concept from video and music streaming or saving pictures on cameras. But such entertainment media are relatively easy to compress, said Baker.
"They kind of joke about using the eyeball norm, or the eyeball metric. Like, if it looks okay, it's fine," she said. With climate data, "we can't be so blasé."
It can be hard to know in advance which parts of a climate model dataset are expendable, and many scientists are leery of losing anything at all. But Baker and Lindstrom see plenty of places to trim without compromising important results. Some scientists want values stored down to 15 decimal places, even though the raw data aren't accurate to that level of detail, said Lindstrom.
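The idea of trimming digits the data cannot support can be illustrated with a minimal Python sketch. This is not the algorithm Lindstrom developed; it is plain rounding to a fixed number of significant digits followed by standard lossless compression (zlib), applied to made-up temperature values. The choice of four significant digits and all names here are assumptions for illustration only.

```python
import math
import random
import struct
import zlib


def round_to_significant(value, digits):
    """Round a number to the given count of significant decimal digits."""
    if value == 0.0:
        return 0.0
    magnitude = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - magnitude)


def compressed_size(values):
    """Bytes needed after packing values as 64-bit floats and zlib-compressing."""
    raw = struct.pack(f"{len(values)}d", *values)
    return len(zlib.compress(raw, 9))


random.seed(0)
# Hypothetical surface temperatures in kelvin, not real model output.
temperatures = [288.0 + random.gauss(0.0, 5.0) for _ in range(10_000)]
trimmed = [round_to_significant(t, 4) for t in temperatures]

full_size = compressed_size(temperatures)
trimmed_size = compressed_size(trimmed)
worst_error = max(abs(a - b) for a, b in zip(temperatures, trimmed))

print(f"Full precision: {full_size} bytes after lossless compression")
print(f"Trimmed to 4 significant digits: {trimmed_size} bytes")
print(f"Largest change to any value: {worst_error:.3f} kelvin")
```

The trimmed values compress much better because far fewer distinct numbers remain, while no value moves by more than about 0.05 kelvin. Whether that loss is acceptable depends on how accurate the original values were and on what analysis comes next.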
Even when compression does produce a small difference in the end results, it may not always be worth worrying about, said Baker. For example, say a model indicates that Earth will warm by 2.0004 degrees in 50 years, and compression changes that number to 2.0003 degrees.
"Like, who cares, right?" said Baker. "You should probably take the same action either way."
Baker is doing her best to help the climate community overcome its fear of loss. In a project published last year in the journal Geoscientific Model Development, she challenged scientists around the world to identify which of several datasets had been subjected to lossy compression. She used an algorithm developed by Lindstrom to shrink the target dataset by more than 80 percent, then decompressed it and sent it to other researchers alongside datasets that had never been compressed. Those colleagues then reviewed the data, examining everything from the total heat stored in the climate system to the net amount of precipitation falling on Earth's surface.
The challenge participants did find a few subtle clues that gave the compressed dataset away. However, those differences were mostly irrelevant to the end results, said Baker.
Lossy compression does have potential pitfalls. For example, if someone rounds off too many digits before comparing the amount of water evaporating from Earth's surface to the amount falling back to the ground, the difference might look artificially large, and that error can magnify through the course of the analysis, said Lindstrom.
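The pitfall Lindstrom describes can be shown with a small Python sketch. The two values below are invented for illustration and are not drawn from any climate model; the function name and units are likewise hypothetical.

```python
def net_flux(evaporation, precipitation, decimals=None):
    """Evaporation minus precipitation, optionally rounding the inputs first."""
    if decimals is not None:
        evaporation = round(evaporation, decimals)
        precipitation = round(precipitation, decimals)
    return evaporation - precipitation


# Made-up values in arbitrary units; the two quantities nearly cancel.
evaporation = 1234.5678
precipitation = 1234.1234

exact = net_flux(evaporation, precipitation)
rounded = net_flux(evaporation, precipitation, decimals=0)

print(f"Difference at full precision: {exact:.4f}")
print(f"Difference after rounding inputs to whole numbers: {rounded:.4f}")
print(f"Relative error in the difference: {abs(rounded - exact) / exact:.0%}")
```

Rounding changes each input by less than a tenth of a percent, yet the difference between them more than doubles, from about 0.44 to 1.0. When two large, nearly equal quantities are subtracted, small rounding errors in the inputs can dominate the result.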
Baker and Lindstrom said that if they know how someone will use a dataset, they can customize compression algorithms to preserve important details. But for projects like the Coupled Model Intercomparison Project, where the data are presented to the community at large, there's no telling what everyone will do with them, said Baker. In those cases, she just has to declare how the compression worked, and hope users understand the limitations.
To use compression tools to their best advantage, researchers need to keep studying how different types of compression affect scientific conclusions, said Lindstrom. But, he added, the basic tools for compressing climate data already exist. Now, the biggest hurdle is convincing researchers to let go of what they don't need.