IBM Advances Cluster Virtualization for HPC and Deep Learning AI

On the classic Groucho Marx quiz show You Bet Your Life, if a contestant accidentally said the “secret word” of the day, he or she would win a prize. There’s no prize included in this commentary, but the secret word of the day is virtualization, especially as it relates to IBM’s new HPC and AI solutions.

IBM defines virtualization as “A technology that makes a set of shared resources appear to each workload as if they were dedicated only to it.” IT professionals are very familiar with this concept, what with operating system-level virtualization, server virtualization, network virtualization, and storage virtualization all continuing to permeate computing infrastructures and the collective consciousness. So it should come as no surprise that IBM is advancing the concept of cluster virtualization in its latest announcement, tying it closely to cloud and cognitive computing.

IBM’s cluster virtualization initiative combines products from its Spectrum Computing family, namely Spectrum LSF, Spectrum Symphony, and Spectrum Conductor, with overall cluster virtualization software (Spectrum Cluster Foundation) that manages the whole process. It also includes the storage delivered through IBM Spectrum Scale, a member of the IBM Spectrum Storage family.

The goal of this approach is to automate the self-service provisioning of multiple heterogeneous high-performance computing (HPC) and analytics (AI and big data) clusters on a shared secure multi-tenant compute and storage infrastructure. Doing so delivers multiple benefits to numerous technical computing end users, including data scientists and HPC professionals.

The announcement focuses on these products: IBM Spectrum LSF, IBM Spectrum Conductor, and IBM Spectrum Scale.

IBM Spectrum LSF Suites 10.2

IBM Spectrum LSF Suite 10.2 is the latest release of IBM’s end-to-end workload management solution for HPC. It enables multiple users and applications to share a computing cluster, optimizing performance and cost efficiency while simplifying resource management. Spectrum LSF comes in three editions: the Workgroup Edition, which is no slouch as it handles up to 128 nodes and 25,000 jobs; the HPC Edition, which scales to 1,024 nodes and 250,000 jobs while adding workflow automation, hybrid cloud auto-scaling, and intelligent data staging; and the Enterprise Edition, which removes all scaling limits on nodes and jobs. IBM states that Spectrum LSF Suite 10.2 offers more functionality at a lower cost through a new licensing and pricing model.
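To make the workload-sharing model concrete, the following is a minimal sketch of how a user might submit work to an LSF-managed cluster from Python. It assumes LSF’s standard bsub command is available on the submission host; the queue name, slot count, and application path are illustrative placeholders rather than anything from IBM’s announcement.

```python
# Minimal sketch: submitting a job to an LSF-managed cluster from Python.
# Assumes LSF's standard "bsub" command is on the PATH of the submission host.
# The queue name, slot count, and application path are illustrative only.
import subprocess

def submit_lsf_job(command, slots=4, queue="normal", job_name="demo"):
    """Build a bsub command line, run it, and return bsub's textual response."""
    bsub_cmd = [
        "bsub",
        "-J", job_name,     # job name
        "-n", str(slots),   # number of job slots requested
        "-q", queue,        # target queue (illustrative name)
        "-o", "%J.out",     # stdout file; %J expands to the LSF job ID
    ] + command             # the command to run, given as a list of arguments

    result = subprocess.run(bsub_cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print(submit_lsf_job(["./my_simulation", "--steps", "1000"]))
```

The scheduler, not the user, decides where the job actually runs, which is precisely what allows many users and applications to share one cluster.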

IBM Spectrum Conductor

Apache Spark is an open-source cluster computing framework that provides a fast, general engine for big data processing, including built-in modules for machine learning, SQL, and streaming. IBM is a strong supporter of Spark, which is why it initially positioned this product as IBM Spectrum Conductor with Spark. While the new version continues to include the Apache Spark framework, IBM has simplified the name to IBM Spectrum Conductor to avoid the impression that it is only for Spark-based workloads.

IBM Spectrum Conductor enables an enterprise to establish and manage multiple Spark deployments while eliminating the inefficient resource silos tied to separate Apache Spark implementations. It also extends beyond Spark, eliminating cluster sprawl through a shared resource pool and through granular, dynamic resource allocation that improves both performance and resource usage. The IBM-supported distribution also has workload, resource, and data management capabilities that, along with high-performance shared storage through IBM Spectrum Scale, deliver an enterprise-class solution.

A key focus of IBM’s approach is multitenant shared services and resources: each group logically appears to have its own Spark cluster, with isolated, protected, secure, and SLA-managed resource allocation. The overall result is the ability to respond dynamically to changing workload demand. IBM claims this leads to a 43% reduction in required infrastructure compared with a siloed infrastructure.
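As a point of reference, below is a minimal sketch of the kind of Spark workload such a Conductor-managed cluster would run. It is illustrative only: the input path is hypothetical, and Conductor-specific settings (resource groups, tenant assignments, and so on) are handled by administrators outside the application code.

```python
# Minimal sketch of a Spark job of the sort a Conductor-managed cluster runs.
# Assumes pyspark is installed; the input path is a hypothetical example, and
# cluster/tenant configuration is supplied by the managed environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("word-count-demo")
    .getOrCreate()   # on a managed cluster the master URL comes from the environment
)

lines = spark.read.text("/data/sample.txt")                     # hypothetical input
words = lines.selectExpr("explode(split(value, ' ')) AS word")  # split lines into words
counts = words.groupBy("word").count().orderBy("count", ascending=False)
counts.show(10)

spark.stop()
```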

The biggest news is an add-on called IBM Spectrum Conductor Deep Learning Impact. Deep learning is a member of the broader family of machine learning methods that are used to train and extend AI (artificial intelligence) applications and services.

The software extension for Spectrum Conductor enables enterprises to build distributed environments that simplify deep learning (using pre-built frameworks such as Caffe and TensorFlow), accelerate time to results (more accurate models, faster), and ease administration and management.
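For context, here is a minimal sketch of the sort of TensorFlow training script that Deep Learning Impact is designed to schedule and scale. The model, data, and hyperparameters are toy placeholders; the add-on’s own distribution, monitoring, and tuning features are configured outside a script like this.

```python
# Minimal sketch of a TensorFlow training script of the kind that
# Spectrum Conductor Deep Learning Impact schedules and scales.
# The data and model below are toy placeholders, not anything from IBM.
import numpy as np
import tensorflow as tf

# Synthetic data standing in for a real training set.
x_train = np.random.rand(1024, 32).astype("float32")
y_train = np.random.randint(0, 2, size=(1024, 1)).astype("float32")

# A small binary classifier built with the Keras API.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, batch_size=64)
```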

IBM Spectrum Scale 5.0

HPC and deep learning AI are heavily data-driven, so storage and compute resources must work together seamlessly in cluster virtualization implementations. IBM supports these capabilities with Spectrum Scale, which is available as standalone software, on public clouds, or embedded in the IBM Elastic Storage Server (ESS). Spectrum Scale is the file-oriented member of IBM Spectrum Storage, a family of software-defined storage (SDS) products that deal with the immense volumes of semi-structured and unstructured data common in HPC, big data, and AI applications. At its core, Spectrum Scale uses GPFS (General Parallel File System), a long-proven, robust technology.
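One practical consequence is that applications see Spectrum Scale as an ordinary POSIX file system, so no special client code is needed. The short sketch below makes that point; the mount point and directory layout are hypothetical examples set by a storage administrator, not product defaults.

```python
# Minimal sketch: Spectrum Scale presents a POSIX file system, so applications
# read and write it with ordinary file I/O. The mount point "/gpfs/fs1" and the
# directory layout are hypothetical examples, not product defaults.
from pathlib import Path

data_dir = Path("/gpfs/fs1/projects/demo")   # hypothetical Spectrum Scale mount
data_dir.mkdir(parents=True, exist_ok=True)

results = data_dir / "results.csv"
results.write_text("step,loss\n1,0.93\n2,0.71\n")
print(results.read_text())
```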

IBM’s next-generation ESS combines the clustered file system of Spectrum Scale with the CPU and I/O capability of IBM’s POWER architecture, and connects to solid-state drives (SSDs) for performance-oriented requirements or just-a-bunch-of-disks (JBOD) arrays for lower-cost, capacity-oriented requirements. ESS uses a building-block approach, with a minimum configuration of two IBM Power Systems servers and at least one storage enclosure. ESS starts with a minimum capacity of 40 TB but can scale into hundreds of petabytes.

The objective of the latest release, Spectrum Scale 5.0, is to meet the requirements of CORAL (the Collaboration of Oak Ridge, Argonne, and Livermore national laboratories) to advance leading-edge HPC technologies. ESS running Spectrum Scale V5 meets CORAL’s stated performance goals (as set out in a CORAL RFP), such as 1 TB/sec of 1 MB sequential read/write and 50,000 file creates per second in a shared directory. These capabilities are the result of improvements that include significantly reduced latency between nodes (to support NVMe), an enhanced GUI that simplifies system administration, and improved security and compliance functionality with integrated file audit logging.

Mesabi musings

These new solutions demonstrate how IBM has built a cluster virtualization approach using products from its Spectrum Computing family — namely Spectrum LSF for HPC and Spectrum Conductor with its Deep Learning Impact add-on for deep learning AI.  Cluster virtualization enables very large numbers of users and applications to share common pools of resources, including the massive implementations planned by CORAL.

Even though there is no monetary prize for the secret word of the day — virtualization — better understanding IBM’s cluster virtualization approach offers benefits of its own. And if you are a member of the HPC or deep learning AI communities, IBM’s new cluster virtualization solutions qualify as a very big prize, indeed.