Insight by Hitachi Vantara Federal

How to get supercomputer-like data strategies for AI/ML


For federal agencies, artificial intelligence and machine learning applications have moved from gee-whiz-that-sounds-cool ideas to regular parts of IT planning. For instance, the U.S. Patent and Trademark Office is fielding a new prior-art search engine for its 8,000 examiners. The search engine has an artificial intelligence component to help examiners deal with the fast-growing body of prior art, none of which can be overlooked in the patent decision process.

Further success in artificial intelligence and machine learning will require updated strategies for one of its three basic ingredients: the data.

An AI/ML project starts, of course, with the application: what the agency wants to accomplish, what process it wants to transform. Then comes the algorithm. Crucial to both, though, is the data that will be used to train the algorithm.

How the agency curates, grows, stores and manages its data will determine how effective it will be at deploying AI/ML applications, according to Gary Hix, the chief technology officer at Hitachi Vantara Federal.

The deeper and more diverse the data, the better the outcomes will be, Hix said. And, he said, it’s likely the agency will want to further refine its AI/ML applications as new data becomes available.

“Data is at the core of [AI and ML projects],” Hix said. With respect to training data, he added, “One of the things that I’m starting to see is that data set is still relatively small in regards to all the data that’s available. People have just started, within the last couple years, really curating data.”

To save money and conserve storage, he said, organizations have tended to purge data deemed no longer useful. That’s changed, and now data is growing fast.

“Now we’ve gone to the other end of the spectrum, where we say, ‘Hey, keep everything,’” Hix said. More data can more accurately train a model, “and you want to have as much data in that lake as you possibly can.”

Sources of data themselves are growing as agencies and commercial organizations alike deploy sensors in the internet-of-things model. Hix cited a sensor-laden train set that generates 3 terabytes of data daily.

But the larger the data sets, and the more numerous they become, Hix said, the more important metadata management becomes to keeping data easily discoverable.
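To make the metadata point concrete, a catalog that records descriptive attributes alongside each data set lets users discover data by its characteristics rather than by scanning its contents. The sketch below is a minimal, hypothetical Python example; the field names, tags, and query interface are invented for illustration and do not describe any Hitachi Vantara product.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """Descriptive metadata kept alongside a stored data set."""
    name: str
    source: str          # which sensor or application produced it
    created: str         # ISO-8601 timestamp
    size_bytes: int
    tags: set = field(default_factory=set)

class MetadataCatalog:
    """In-memory catalog; a real system would back this with a database."""
    def __init__(self):
        self._records = []

    def register(self, record: DatasetRecord):
        self._records.append(record)

    def find(self, tag=None, source=None):
        """Discover data sets by tag or originating source, not by content."""
        return [r for r in self._records
                if (tag is None or tag in r.tags)
                and (source is None or r.source == source)]

catalog = MetadataCatalog()
catalog.register(DatasetRecord("train-sensors-day1", "railcar-telemetry",
                               "2021-03-01T00:00:00Z", 3_000_000_000_000,
                               tags={"iot", "training"}))
catalog.register(DatasetRecord("patent-corpus", "uspto-ingest",
                               "2020-11-15T00:00:00Z", 9_500_000_000,
                               tags={"text", "training"}))

hits = catalog.find(tag="iot")
```

Queries like `catalog.find(tag="iot")` touch only the metadata records, which is what keeps discovery cheap even when the underlying data sets run to terabytes.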

Once data keepers solve the metadata and taxonomy issues, Hix said, they have to look after what he called the data performance challenge. Data can be arranged logically for discoverability and access, but keeping it scattered across data silos can impede usage.

“There’s data access, and then there’s data performance,” Hix said. “One of the things that we’ve seen with AI and ML is heritage, almost legacy, file-based solutions.” He added, “Because most of this is unstructured data, at the end of the day, file-based solutions fall down from a performance perspective.”

Hix recommends storage consisting of what he called a unified namespace, accessible to all with rights to it, and incorporating a data-locking mechanism to ensure integrity and consistency. Even when subsets of data are pulled out for training or some other use, “you still need a consistent core repository,” Hix said.
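One way to picture the unified-namespace-plus-locking idea: every authorized client resolves the same logical path space, and a per-object lock keeps concurrent writers from leaving the core repository inconsistent. The toy, single-process Python sketch below shows the pattern under those assumptions; every class and path name here is invented, and real systems use distributed lock managers rather than in-process locks.

```python
import threading

class UnifiedNamespace:
    """One logical path space shared by all clients, with a per-path lock
    so reads and writes against the core repository stay consistent."""
    def __init__(self):
        self._objects = {}
        self._locks = {}
        self._meta_lock = threading.Lock()  # guards the lock table itself

    def _lock_for(self, path):
        with self._meta_lock:
            return self._locks.setdefault(path, threading.Lock())

    def write(self, path, data: bytes):
        with self._lock_for(path):   # only one writer per path at a time
            self._objects[path] = data

    def read(self, path) -> bytes:
        with self._lock_for(path):   # never observe a half-written object
            return self._objects[path]

ns = UnifiedNamespace()
ns.write("/datasets/patents/2021/corpus.bin", b"prior-art records")
payload = ns.read("/datasets/patents/2021/corpus.bin")
```

The design point is that clients never copy data to private locations; they address one namespace, and the locking discipline is what makes that single copy safe to share.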

Storage capacity once maintained only by agencies with supercomputing capabilities is now, thanks to the commercial cloud, available to all agencies, Hix said. But in using clouds, agencies should draw on the data management practices and experience of agencies with a history of high performance computing. He cited the National Oceanic and Atmospheric Administration as an example.

Moreover, with cloud computing scalability, agencies can not only keep their data there, but also upload and train their AI algorithms in the cloud.

Hitachi Vantara’s own global parallel file system is a contemporary instance of tiered data storage, where up to 20 percent of the data, the portion actively in use, is considered “hot” and kept in CPU direct-access storage. The rest sits in one or more lower, less expensive tiers. Data is dynamically moved in and out of hot storage depending on application demand. The solution can run in a commercial cloud as an object store, or it can run in hybrid mode incorporating agency data centers.
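The tiering behavior described above can be sketched as a cache policy: keep the most recently used data, up to a hot-tier capacity cap, in fast storage and demote the rest. The 20 percent figure comes from the article, but the least-recently-used policy and every name in this Python sketch are assumptions for illustration, not Hitachi Vantara's actual algorithm.

```python
from collections import OrderedDict

class TieredStore:
    """Hot tier holds up to hot_fraction of total bytes; least recently
    used data is demoted to the cold tier and promoted back on access."""
    def __init__(self, total_capacity: int, hot_fraction: float = 0.20):
        self.hot_capacity = int(total_capacity * hot_fraction)
        self.hot = OrderedDict()   # path -> size, ordered by recency
        self.cold = {}             # path -> size

    def _hot_bytes(self):
        return sum(self.hot.values())

    def _demote_lru(self):
        # Move least recently used items down until the hot tier fits.
        while self._hot_bytes() > self.hot_capacity and self.hot:
            path, size = self.hot.popitem(last=False)  # evict oldest
            self.cold[path] = size

    def put(self, path: str, size: int):
        self.hot[path] = size          # new data lands in the hot tier
        self.hot.move_to_end(path)
        self._demote_lru()

    def access(self, path: str):
        if path in self.cold:          # promote on demand
            self.hot[path] = self.cold.pop(path)
        self.hot.move_to_end(path)     # mark as most recently used
        self._demote_lru()

store = TieredStore(total_capacity=100)   # hot tier caps at 20 bytes
store.put("a", 10)
store.put("b", 10)
store.put("c", 10)   # exceeds the cap, so "a" (oldest) is demoted
```

Accessing a demoted path promotes it back into the hot tier and pushes out whatever is now least recently used, which is the "dynamically moved in and out" behavior the article describes.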

“I would say as data continues to grow, the public cloud is attractive,” Hix said. “But I think there’s going to be this tipping point where you have to operate in a hybrid environment, just from cost effectiveness and data security perspectives.”

In short, agencies looking to train AI and ML applications can now use data techniques once available only in supercomputer environments.


Training Data for AI and ML Applications

“At the end of the day, file-based solutions fall down from a performance perspective. From the data sharing perspective, you need something that has kind of a global parallel data access methodology as part of this. In the past, we would just copy data around. We don’t want to do that. Today, we want to host one kind of unified namespace, and everyone can access that.”


Metadata and AI and ML

“I think when lay people think about AI, ML, big data, just data analytics, they think you’re always looking at that specific piece of data. And many times you’re not. You’re looking at the characteristics: When was it created? Who created it? What application created it? AI and ML use that metadata most often. So you want to make sure you have a solution.”

Listen to the full show: 

Featured speakers

  • Gary Hix

    Chief Technology Officer, Hitachi Vantara Federal

  • Tom Temin

    Host, The Federal Drive, Federal News Network