Delivering high performance computing through interconnected solutions
Charlie Foo, VP & General Manager, Asia Pacific/Japan, Mellanox

With computing and storage devices accelerating and operating at ever higher speeds, the bottleneck tends to shift to the network, creating a need for innovation in both speed and smarts. By providing low latency, high bandwidth, a high message rate, transport offload for extremely low CPU overhead, Remote Direct Memory Access (RDMA), and advanced communication and computation offloads, Mellanox's interconnect solutions are the most widely deployed high-speed solutions for large-scale simulations, delivering the highest scalability, efficiency, and performance for HPC systems today and in the future. Smart offloads such as RDMA and GPUDirect can also dramatically improve neural network training performance and machine learning applications overall.
With demanding application requirements, HPC and AI clusters are getting larger; some applications need millions of cores running in parallel. Communication between computing cores, and between compute and storage, is becoming ever more critical to application performance. A smart network built on in-network computing, adaptive routing, and self-healing technology can improve application performance and avoid congestion and link-failure issues in the data center.
In any distributed application, the CPU handles both application computation and communication processing. If communication consumes more CPU resources, less remains for the application, and vice versa; the best HPC and AI systems must balance the two. Mellanox's solution moves the communication processing into the InfiniBand HCA (Host Channel Adapter) and the switches, leaving all of the CPU resources free to run the application. In-network computing performs data algorithms within the network devices themselves, delivering up to ten times higher performance and enabling the era of "data-centric" data centers. Using in-network computing not only gives the application more CPU resources, but also reduces the CPU jitter that affects the application. Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) is the technology that delivers this. As shown in Diagram 1, communication operations such as data synchronization, data movement, and collectives between the end nodes (hosts) can be performed in the aggregation nodes (switches), with no need to involve the host CPUs. A good use case for SHARP is the Allreduce operation in machine learning: the switch can implement the Allreduce and avoid the many-to-one communication between the workers and the parameter server, as sketched in the example below.
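As a hedged illustration, and assuming an MPI library whose collectives can use SHARP (for example, Mellanox HPC-X), the sketch below shows the ordinary MPI_Allreduce call that such a fabric can offload to the switches; the offload itself is transparent to the application code.

/* allreduce_demo.c - minimal MPI Allreduce sketch.
 * The call below is plain MPI; on a SHARP-capable fabric with a suitably
 * configured MPI/collectives layer (assumption), the reduction is aggregated
 * in the switches instead of being computed on the host CPUs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes a local, gradient-like buffer. */
    double local[4]  = { rank + 0.0, rank + 1.0, rank + 2.0, rank + 3.0 };
    double global[4];

    /* Sum across all ranks; every rank receives the same result,
     * exactly the pattern used to combine gradients in ML training. */
    MPI_Allreduce(local, global, 4, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks: %.1f %.1f %.1f %.1f\n",
               size, global[0], global[1], global[2], global[3]);

    MPI_Finalize();
    return 0;
}

The code compiles with mpicc and runs with mpirun; whether the reduction is actually offloaded depends on the fabric and the MPI configuration, not on the application source.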
In HPC and AI systems, the communication between compute nodes and storage must be reliable: no packets should be dropped between any pair of nodes. However, the communication patterns of HPC and AI, such as Allreduce, are prone to creating network congestion, and that congestion can lead to packet drops. Smart solutions like AR (Adaptive Routing) and SHIELD (Mellanox's self-healing interconnect technology) not only resolve congestion when it occurs, but also help the network avoid congestion and packet drops in the first place. AR uses a special packet (the adaptive routing notification) to detect congestion on the next few hops before the switch sends the actual data packet onward. If that notification indicates that congestion is likely on the following hops, the switch automatically changes the routing path to a new one, avoiding the congestion that would have occurred on the original path. Please refer to Picture 2.
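The following is a simplified, hypothetical sketch of the decision AR makes (invented for clarity, not switch firmware): when a notification flags congestion on the preferred output port, the forwarding logic falls back to another port that also leads toward the destination.

/* ar_sketch.c - hypothetical illustration of an adaptive-routing decision.
 * Names and data structures are invented; real switches keep far richer
 * state and implement this in hardware. */
#include <stdio.h>
#include <stdbool.h>

#define NUM_PORTS 4

/* Congestion state learned from adaptive routing notifications,
 * one flag per candidate output port (assumption). */
static bool congested[NUM_PORTS] = { false, true, false, false };

/* Candidate output ports that all lead toward the destination. */
static const int candidates[] = { 1, 2, 3 };

/* Use the preferred port unless a notification marked it congested;
 * otherwise fall back to the first non-congested alternative. */
static int select_output_port(int preferred)
{
    if (!congested[preferred])
        return preferred;
    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++)
        if (!congested[candidates[i]])
            return candidates[i];
    return preferred;           /* no better choice: keep the original path */
}

int main(void)
{
    int port = select_output_port(1);   /* port 1 is flagged congested */
    printf("forwarding on port %d instead of congested port 1\n", port);
    return 0;
}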
SHIELD resolves link failures between sender and receiver at the switch level, without the application being aware. If a link fails, for example because of a cable, transceiver, or switch-port issue, the closest switch identifies the failure and sends a special packet (the fault recovery notification) back toward the uplink switches until a switch with an alternate path is reached and traffic is re-routed onto a healthy link. If we rely on the application to detect the link failure, recovery takes about 5 to 30 seconds in clusters of 1K to 10K nodes; SHIELD can detect and repair the failure within a few milliseconds, with no application involvement. Please refer to Picture 3.
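A simplified, hypothetical sketch of this notification-driven reroute is shown below; the switch names and structures are invented for illustration and are not Mellanox firmware.

/* shield_sketch.c - hypothetical illustration of SHIELD-style recovery.
 * A fault recovery notification (FRN) travels back up the switch chain
 * until it reaches a switch that has an alternate route. */
#include <stdio.h>
#include <stdbool.h>

struct sw {
    const char *name;
    bool has_alternate_route;   /* can this switch route around the fault? */
};

/* Propagate the FRN upstream from the switch nearest the failed link. */
static void propagate_frn(struct sw path[], int hops)
{
    for (int i = 0; i < hops; i++) {
        if (path[i].has_alternate_route) {
            printf("%s reroutes traffic onto a healthy link\n", path[i].name);
            return;             /* recovery stays entirely inside the fabric */
        }
        printf("%s has no alternate route, forwarding FRN upstream\n",
               path[i].name);
    }
    printf("no alternate route found; recovery falls back to the host\n");
}

int main(void)
{
    /* The leaf switch detects the failure; its uplink spine can reroute. */
    struct sw path[] = {
        { "leaf-1",  false },
        { "spine-2", true  },
    };
    propagate_frn(path, 2);
    return 0;
}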
In summary, accelerating computing and storage, together with a smart network (adaptive routing, fabric automation, and recovery), has become imperative in this age of HPC, AI, and digitalization, where there is no room for performance compromise. Mellanox's scalable HPC and AI interconnect solutions are paving the road to Exascale computing by delivering the highest scalability, efficiency, and performance for HPC and AI systems today and in the future.