IBM first applied its “storage made simple” theme to block-based systems, most notably its FlashSystem offerings, as the non-mainframe portion of its storage for hybrid multicloud. Now the company has applied the same theme to storage for its data and AI systems: most prominently, the file-based Elastic Storage System (ESS) hardware managed by IBM Spectrum Scale software-defined storage (SDS) and the object-based Cloud Object Storage hardware managed by IBM Cloud Object Storage SDS. The role these products play in edge computing was discussed in my recent blog (see IBM Storage Solutions for Edge Computing).

We can now turn our attention to the role of these products in an information architecture (IA), without which AI is impossible. In contrast to a transaction processing system, where the application serves as the platform for creating data using business process rules (i.e., a process-driven application), AI is a data-driven application. That means the AI application must be able to respond dynamically to different data types. The IA for AI is divided into three buckets: collect, organize, and analyze. Visually, think of data being infused into a data lake on the collect side of the house. The organize phase then prepares the necessary data for the analyze phase. But remember, it all starts with a functional, accessible, manageable data lake.

A data lake can be unbelievably large

We’re all familiar with data storage media and systems measured in megabytes (MB), gigabytes (GB), and terabytes (TB), and recently we have started hearing more about petabyte (PB) systems (1024 TB, recalling that storage capacities are conventionally measured in powers of 2 rather than powers of 10). The exabyte (EB) is the next factor-of-1024 increase, followed by the zettabyte (ZB) and, finally, topped off by the unbelievably huge yottabyte (YB).
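For concreteness, here is a small Python sketch of that factor-of-1024 ladder. It is not from any IBM material; it simply uses the IEC names (KiB, PiB, YiB, and so on) for the binary units the paragraph describes:

```python
# Binary storage units: each step up the ladder is a factor of 2**10 = 1024.
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]

def unit_in_bytes(unit: str) -> int:
    """Return the size of one binary unit in bytes."""
    return 1024 ** UNITS.index(unit)

# One petabyte (binary) is 1024 terabytes, i.e., 2**50 bytes.
assert unit_in_bytes("PiB") == 1024 * unit_in_bytes("TiB") == 2**50

# A yottabyte is 2**80 bytes: eight factors of 1024 above a single byte.
print(unit_in_bytes("YiB"))  # 1208925819614629174706176
```

The gap between units is easy to underestimate: a yottabyte is a full million times larger than an exabyte, which is itself a million times larger than a terabyte.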

A data lake is a storage repository of file and/or object data in its natural/raw format. Even though a yottabyte is incredibly huge, it is possible that some end use cases, to wit, autonomous vehicles en masse, may someday require data lakes that exceed a YB in size.

IBM Focuses on the Collect Stage

As with a natural lake that is constantly refreshed with water from springs or rivers that flow into it, so is a data lake refreshed with the infusion of fresh new data.

IBM has just announced the addition of the IBM Elastic Storage System 5000. Unlike its slightly older brother, the ESS 3000, an NVMe all-flash storage system announced in October 2019, the ESS 5000 is an all hard disk drive (HDD) storage system. Isn’t this a “back to the past” moment? The answer is no. We are talking about random-access bulk storage demands where ingest speed in terms of bandwidth is the key metric, not IOPS or latency, which is where the ESS 3000 shines in the analyze stage of the infrastructure. And, of course, on a bulk basis, HDDs are still considerably more cost-effective than flash media.

The ESS 5000 scales up to 8 YB per data lake, although realistically the actual size is likely to be far smaller. The ESS 5000 comes in one of two rack models: an IBM standard-size (SL) rack with a capacity of ½ PB to 8 PB, or the expanded-size (SC) rack with a capacity of 1 PB to 13.5 PB. Each system is powered by two IBM Power9 servers. Up to 6 storage enclosures can be accommodated in an SL rack, whereas an SC rack can handle 8 enclosures. IBM claims that the ESS 5000’s better density and faster performance lead to both CAPEX and OPEX savings over two chief rivals, namely Dell EMC Isilon and the NetApp FAS6000 with DS460C expansion shelves.
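To put those rack figures in perspective, here is a back-of-the-envelope Python sketch (my own illustration, not an IBM sizing tool) of how many fully populated racks a given data-lake target would require, assuming each rack reaches the quoted maximum raw capacity and ignoring usable-capacity overhead from erasure coding and spares:

```python
import math

# Quoted maximum capacities per fully populated rack, in PB (from the text).
RACK_MAX_PB = {"SL": 8.0, "SC": 13.5}

def racks_needed(target_pb: float, rack_type: str = "SC") -> int:
    """Rough count of fully populated racks to reach a target raw capacity."""
    return math.ceil(target_pb / RACK_MAX_PB[rack_type])

# A hypothetical 1 EB (1024 PB) data lake would need on the order of:
print(racks_needed(1024, "SC"))  # 76 racks
print(racks_needed(1024, "SL"))  # 128 racks
```

Even an exabyte, three orders of magnitude short of the 8 YB file system ceiling, already implies a floor full of racks, which is why realistic deployments fall far below the theoretical maximum.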

A key point from the AI perspective is that continuous real-time updates of metadata, enabled by IBM Spectrum Discover, lead to faster insights without the need to rescan (which could be a slow process).

Note that the ESS 5000 is file-oriented and uses IBM Spectrum Scale. So, what do we do about bringing object data into the data lake? The answer is IBM Spectrum Scale Data Acceleration for AI, which allows access and data movement between IBM Spectrum Scale and object storage on premises or in the cloud (planned GA Q4 2020).

IBM Cloud Object Storage (COS), in addition to its traditional roles of backup and archive, can now take a greater role in AI with faster data collection and integration, which according to IBM is “designed to increase system performance to 55 GB/s in a 12-node configuration, improving reads by 300% and writes by 150%, depending on object size.” (planned August 2020 GA)

IBM strengthens the organize stage by bringing Red Hat OpenShift into the mix

IBM Spectrum Discover is the software heart of the organize stage. It delivers the real-time metadata management and indexing that AI software, such as IBM Watson and IBM Cloud Pak for Data, can access through the Spectrum Discover API and needs for both the collect and analyze stages. In another piece of news, IBM Spectrum Scale, Spectrum Discover, and Red Hat OpenShift are now integrated. Why is this important? The ability to leverage Red Hat OpenShift makes multicloud deployments easier and, according to IBM, requires up to 50% less memory. The original deployment configuration for Spectrum Discover was on virtual machines; with this announcement, Spectrum Discover can also be deployed in a container configuration. This gives end users the choice of a container configuration, a virtual machine configuration, or both, if their needs require it. In addition, the Spectrum Discover policy engine has been upgraded to take better advantage of low-cost options for data movement and migration (planned GA Q4), including support for third-party data movers, such as Moonwalk.

Mesabi musings

IBM has long emphasized the importance of AI, but supporting artificial intelligence is not just about the software necessary to analyze the data; it is also about the collection of the data and the ability to organize it to support those analyses. With the introduction of the ESS 5000 for file data and the ability to incorporate object storage into the mix, IBM can now meet the storage needs of even the largest data lakes. But physically meeting the storage requirements is, in and of itself, not enough. The information required for AI workloads also has to be organized so that the right data can be found and processed in a timely manner, which IBM accomplishes in the organize phase with Spectrum Discover, especially with integration with Red Hat OpenShift.

All in all, IBM has built an impressive narrative and solution portfolio for simplifying the storage required to support data analysis and AI.