

# High Performance Computing Ecosystem and Trends

4 July 2016

Nikolay Mester

### Moore's Law and Parallelism

#### 35 YEARS OF MICROPROCESSOR TREND DATA



Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten. Dotted-line extrapolations by C. Moore.



# CPU Parallelism is Already a MUST



Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: Intel Performance Projections as of Q1 2015. For more information go to http://www.intel.com/performance. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Copyright © 2015, Intel Corporation.
Chart illustrates relative performance of the Binomial Options DP workload running on an Intel® Xeon® processor from the adjacent generation.

\*Product specification for launched and shipped products available on ark.intel.com.

<sup>1</sup>Not launched



## Growing Challenges in HPC

# "The Walls" System Bottlenecks



Memory | I/O | Storage | Energy Efficient Performance | Space | Resiliency | Unoptimized Software

# Divergent Infrastructure

VISUALIZATION

HPC

BIG DATA

MACHINE LEARNING

Barriers to Extending Usage



Resources Split Among Modeling and Simulation | Big Data Analytics | Machine Learning | Visualization

Copyright © 2016 Intel Corporation. All rights reserved. \*Other names and brands may be claimed as the property of others.

Democratization at Every Scale | Cloud Access | Exploration of New Parallel Programming Models



# Intel® Scalable System Framework

#### A Holistic Design Solution for All HPC Needs



Small Clusters Through Supercomputers

Compute and Data-Centric Computing

Standards-Based Programmability

On-Premise and Cloud-Based

Intel® Xeon® Processors
Intel® Xeon Phi™ Processors
Intel® Xeon Phi™ Coprocessors
Intel® Server Boards and Platforms

Intel® Solutions for Lustre\*
Intel® Optane™ Technology
3D XPoint™ Technology
Intel® SSDs

Intel® Omni-Path Architecture
Intel® True Scale Fabric
Intel® Ethernet
Intel® Silicon Photonics

Intel® Software Tools
Intel® Cluster Ready Program
Intel Supported SDVis



#### How It Works



## **Innovative Technologies**



# Tighter Integration and Co-Design



Increased System Density
Reduced System Power Consumption




# **High Performance Compute**







#### **Common Programming Model**



## Intel® Xeon Phi™ x200 Product Family





(codename Knights Landing)



1. Over 3 Teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating point operations per cycle.
2. Projected result based on internal Intel analysis of STREAM benchmark using a Knights Landing processor with 16GB of ultra high-bandwidth memory versus DDR4 memory only with all channels populated.
3. Source: Intel internal information.
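The first footnote derives peak theoretical performance from three inputs: cores, clock frequency and floating point operations per cycle. A minimal sketch of that product follows. The function name and the example inputs (64 cores, 1.5 GHz, 32 FLOPs per cycle) are hypothetical values chosen for illustration and are not figures from this deck.

```python
def peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Peak theoretical GFLOP/s = cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * ghz * flops_per_cycle


# Hypothetical inputs for illustration only (assumptions, not slide data):
# 64 cores at 1.5 GHz, each retiring 32 double-precision FLOPs per cycle
# (for example two 512-bit FMA units: 2 units x 8 lanes x 2 operations).
example = peak_gflops(cores=64, ghz=1.5, flops_per_cycle=32)
print(f"{example / 1000:.2f} TFLOP/s")  # prints "3.07 TFLOP/s"
```

Sustained application performance is normally well below this figure, which is why the footnote calls it theoretical.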



#### Intel® Xeon® Processors

#### At the Heart of Intel® Scalable System Framework







| Feature                                      | Xeon E5-2600 v3<br>(Haswell-EP, 22nm)                      | Xeon E5-2600 v4<br>(Broadwell-EP, 14nm) |
|----------------------------------------------|------------------------------------------------------------|-----------------------------------------|
| Cores Per Socket                             | Up to 18                                                   | Up to 22                                |
| Threads Per Socket                           | Up to 36 threads                                           | Up to 44 threads                        |
| Last-level Cache (LLC)                       | Up to 45 MB                                                | Up to 55 MB                             |
| QPI Speed (GT/s)                             | 2x QPI 1.1 channels: 6.4, 8.0, 9.6 GT/s                    | Same as v3                              |
| PCIe* Lanes / Controllers / Speed (GT/s)     | 40 / 10 / PCIe* 3.0 (2.5, 5, 8 GT/s)                       | Same as v3                              |
| Memory Population                            | 4 channels of up to 3 RDIMMs or<br>3 LRDIMMs               | + 3DS LRDIMM                            |
| Max Memory Speed (MT/s)                      | Up to 2133                                                 | Up to 2400                              |
| TDP (W)                                      | 160 (Workstation only), 145, 135, 120, 105, 90, 85, 65, 55 | Same as v3                              |
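The memory rows of the table translate into a peak per-socket DRAM bandwidth. The sketch below multiplies channels by transfer rate by bus width. It assumes the standard 64-bit (8-byte) DDR4 data bus per channel and reads the "Max Memory Speed" values as MT/s. The function name is illustrative.

```python
BYTES_PER_TRANSFER = 8  # assumption: standard 64-bit DDR4 data bus per channel


def peak_dram_gbs(channels: int, mega_transfers_per_s: int) -> float:
    """Peak theoretical DRAM bandwidth in GB/s for one socket."""
    return channels * mega_transfers_per_s * BYTES_PER_TRANSFER / 1000.0


v3 = peak_dram_gbs(channels=4, mega_transfers_per_s=2133)  # Haswell-EP row
v4 = peak_dram_gbs(channels=4, mega_transfers_per_s=2400)  # Broadwell-EP row
print(f"v3: {v3:.1f} GB/s, v4: {v4:.1f} GB/s, uplift: {(v4 / v3 - 1) * 100:.1f}%")
```

Under these assumptions the move from 2133 to 2400 raises the theoretical ceiling by about 12.5%, while the core count grows from 18 to 22, so peak bandwidth per core falls slightly.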

# THE HEART OF THE DATA CENTER

#### Core Single Thread IPC Performance





# Intel® Xeon® Processor E5 v4 Family: Core Improvements



#### Extract more parallelism in scheduling uops

- Reduced instruction latencies (ADC, CMOV, PCLMULQDQ)
- Larger out-of-order scheduler (60->64 entries)
- New instructions (ADCX/ADOX)
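ADC, ADCX and ADOX are add-with-carry instructions, which large-integer arithmetic (for example in cryptography) chains across machine words. The sketch below models that carry chain over 64-bit limbs in plain Python. It only illustrates the data flow and does not execute the instructions themselves. The helper name `add_limbs` is hypothetical.

```python
MASK64 = (1 << 64) - 1  # one 64-bit machine word


def add_limbs(a: list[int], b: list[int]) -> tuple[list[int], int]:
    """Add two equal-length little-endian lists of 64-bit limbs.

    Each loop iteration corresponds to one add-with-carry step:
    the carry out of limb i becomes the carry in of limb i + 1.
    Returns (result limbs, final carry out).
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same number of limbs")
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & MASK64)  # low 64 bits stay in this limb
        carry = s >> 64         # bit 64 propagates to the next limb
    return out, carry


# (2**64 - 1) + 1 overflows the low limb and carries into the high limb.
print(add_limbs([MASK64, 0], [1, 0]))  # prints "([0, 1], 0)"
```

ADCX carries through the carry flag and ADOX through the overflow flag, so two such chains can be interleaved without one clobbering the other's carry. That independence is what lets the scheduler extract more parallelism from multi-word arithmetic.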

#### Improved performance on large data sets

- Larger L2 TLB (1K->1.5K entries)
- New L2 TLB for 1GB pages (16 entries)
- 2nd TLB page miss handler for parallel page walks
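One way to read the TLB bullets is in terms of "reach", the amount of memory that can be addressed without a page walk. The sketch below assumes that "1K" and "1.5K" mean 1024 and 1536 entries and that the L2 TLB entries map 4 KiB pages. Both are assumptions, since the slide does not state the page size.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_bytes


old_l2 = tlb_reach(1024, 4 * KIB)   # previous generation, 4 KiB pages assumed
new_l2 = tlb_reach(1536, 4 * KIB)   # larger L2 TLB, 4 KiB pages assumed
huge = tlb_reach(16, 1 * GIB)       # new L2 TLB dedicated to 1 GB pages

print(f"L2 TLB reach: {old_l2 // MIB} MiB -> {new_l2 // MIB} MiB")
print(f"1 GB-page TLB reach: {huge // GIB} GiB")
```

Under these assumptions the reach with small pages grows from 4 MiB to 6 MiB, and the 16 entries for 1 GB pages cover 16 GiB, which is where large data sets benefit most.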

#### Improved address prediction for branches and returns

- Increased Branch Prediction Unit Target Array from 8 ways to 10

#### Floating Point Instruction performance improvements

- Faster vector floating point multiplier (5 to 3 cycles)
- 1024 Radix divider for reduced latency, increased throughput
- Split Scalar divides for increased parallelism/bandwidth
- Faster vector Gather



All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel may make changes to specifications and product descriptions at any time, without notice.



# High Performance Memory and Storage



High-Bandwidth Memory

Configurable Modes

**Integrated** into the Processor





**New Technologies Are Bringing Memory Closer to Compute** 



# Bringing Memory Back Into Balance



#### Up to 16 GB of high-bandwidth on-package memory in Knights Landing




<sup>1</sup> Projected result based on internal Intel analysis of STREAM benchmark using a Knights Landing processor with 16GB of ultra high-bandwidth memory versus DDR4 memory with all channels populated.

<sup>2</sup> Projected result based on internal Intel analysis comparison of 16GB of ultra high-bandwidth memory to 16GB of GDDR5 memory used in the Intel® Xeon Phi™ coprocessor 7120P.

# Bridging the Memory-Storage Gap



Intel® Optane™ Technology Based on 3D XPoint™



Memory and storage hierarchy (figure): CPU | DDR | Intel® DIMMs | Intel® Optane™ SSD | NAND SSD | Hard Disk Drives

#### SSD

- 10x More Dense than Conventional Memory<sup>3</sup>
- Intel® Optane™ SSDs 5-7x Current Flagship NAND-Based SSDs (IOPS)<sup>1</sup>

#### **DRAM-like performance**

- Intel® DIMMs Based on 3D XPoint™
- 1,000x Faster than NAND<sup>1</sup>
- 1,000x the Endurance of NAND<sup>2</sup>

<sup>1</sup> Performance difference based on comparison between 3D XPoint™ Technology and other industry NAND.

<sup>2</sup> Endurance difference based on comparison between 3D XPoint™ Technology and other industry NAND.

<sup>3</sup> Density difference based on comparison between 3D XPoint™ Technology and other industry DRAM.




# NAND Flash and 3D XPoint™ Technology









## NVMe<sup>™</sup> and 3D XPoint<sup>™</sup> are the next Quantum Leaps!





Source: Storage Technologies Group, Intel. Comparisons between memory technologies based on in-market product specifications and internal Intel specifications.




#### Intel® Solutions for Lustre\* Software

#### The Speed of Lustre\* with the Support of Intel

- Intel® Enterprise Edition for Lustre\* Software v2.4
  - Support for "Distributed Namespace" (DNE) Feature to Scale Out the Metadata Performance of Lustre\*
  - Support for the Latest OS: Red Hat\* 6.7-7 and SUSE\* 11sp4-12
  - Parallel Read IO Performance & HSM Scalability Improvements
- Intel® Cloud Edition for Lustre\* Software v1.2
  - Support for Over-the-Wire and Storage Encryption
  - Disaster Recovery from File System Snapshots
  - Simplified File System Mounting on Clients
  - Support for Intel® Xeon® Processor E5-2600 v3 Product Family-Based Instances
- Intel® Foundation Edition for Lustre\* Software v2.8
  - Delivers the Latest Functions and Features
  - Fully Supported by Intel

# **EXTREME SCALE STORAGE FOR HPC**









# Tighter System-Level Integration




#### Intel® Omni-Path Architecture



#### Evolutionary Approach, Revolutionary Features, End-to-End Solution

#### **Director Switches QSFP-based**

192 and 768 port

768-port Director Switch (20U chassis)

192-port Director Switch (7U chassis)

#### Silicon

OEM custom designs

**HFI and Switch ASICs**

HFI silicon: up to 2 ports (50 GB/s total b/w)

Switch silicon: up to 48 ports (1200 GB/s total b/w)
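The two "total b/w" figures can be cross-checked against each other. The sketch below assumes a line rate of 100 Gb/s per port per direction, which is the published Intel® OPA link speed but is not stated on this slide, and counts both directions. The function name is illustrative.

```python
def total_bandwidth_gbytes(ports: int, gbits_per_direction: float = 100.0) -> float:
    """Aggregate bidirectional bandwidth in GB/s across all ports."""
    gbytes_per_direction = gbits_per_direction / 8  # 8 bits per byte
    return ports * gbytes_per_direction * 2         # send plus receive


hfi = total_bandwidth_gbytes(ports=2)      # HFI silicon, up to 2 ports
switch = total_bandwidth_gbytes(ports=48)  # switch silicon, up to 48 ports
print(f"HFI: {hfi:.0f} GB/s, switch: {switch:.0f} GB/s")  # prints "HFI: 50 GB/s, switch: 1200 GB/s"
```

Both results match the slide, which suggests the quoted totals are bidirectional aggregates at 25 GB/s per port.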

#### **Software**

Open Source Host Software and **Fabric Manager** 



#### Cables

Third Party Vendors

Passive Copper | Active Optical



#### Better Scaling vs. EDR

- 48 Radix Chip Ports
- Up to 26% More Servers than InfiniBand\* EDR within the Same Budget<sup>1</sup>
- Up to 60% Lower Power and Cooling Costs<sup>2</sup>

x8 Adapter (58 Gb/s)
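Switch radix matters for scaling because it bounds how many hosts a fat tree can connect at a given depth. The sketch below uses the textbook two-tier full-bisection-bandwidth model, in which each edge switch splits its ports evenly between hosts and uplinks. It is a simplification for illustration and not Intel's cost model. The 750-node target comes from the footnotes, and the function name is hypothetical.

```python
def max_hosts_two_tier(radix: int) -> int:
    """Largest two-tier full-bisection fat tree built from radix-port switches.

    Each edge switch uses radix/2 ports for hosts and radix/2 for uplinks,
    and each spine switch can connect to at most `radix` edge switches.
    """
    return radix * (radix // 2)


CLUSTER_NODES = 750  # cluster size assumed in the slide footnotes

for name, radix in [("48-radix", 48), ("36-radix", 36)]:
    limit = max_hosts_two_tier(radix)
    fits = limit >= CLUSTER_NODES
    print(f"{name}: up to {limit} hosts in two tiers, fits {CLUSTER_NODES} nodes: {fits}")
```

In this model 36-port switches top out at 648 hosts, so a 750-node cluster needs an extra layer of switching, whereas 48-port switches reach 1152 hosts within two tiers.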

#### **Configurable / Resilient**

- Job Prioritization (Traffic Flow Optimization)
- No-Compromise Resiliency (Packet Integrity Protection and Dynamic Lane Scaling)

#### Robust product offerings and ecosystem

- End-to-end Intel product line
- >100 OEM designs<sup>3</sup>
- Strong ecosystem with 70+ Fabric Builders members

#### Maximizes price-performance, freeing up cluster budgets for increased compute and storage capability

1. Assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of 648-port director switches and 36-port edge switches. Mellanox component pricing from www.kernelsoftware.com, with prices as of November 3, 2015. Compute node pricing based on Dell PowerEdge R730 server from www.dell.com, with prices as of May 26, 2015. Intel® OPA pricing based on estimated reseller pricing based on Intel MSRP pricing on ark.intel.com.
2. Assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of director switches and edge switches. Mellanox power data based on Mellanox CS7500 Director Switch, Mellanox SB7700/SB7790 Edge switch, and Mellanox ConnectX-4 VPI adapter card installation documentation posted on www.mellanox.com as of November 1, 2015. Intel® OPA pricing based on estimated reseller pricing based on Intel MSRP pricing on ark.intel.com.
3. Intel internal information. Design win count based on OEM and HPC storage vendors who are planning to offer either Intel-branded or custom switch products, along with the total number of OEM platforms that are currently planned to support custom and/or standard Intel® OPA adapters. Design win count as of November 1, 2015 and subject to change without notice based on vendor product plans.


# Intel® Omni-Path Architecture

Accelerating data movement through the fabric



Based on Intel projections for Wolf River and Prairie River maximum messaging rates, compared to Mellanox CS7500 Director Switch and Mellanox ConnectX-4 adapter and Mellanox SB7700/SB7790 Edge switch product briefs posted on www.mellanox.com as of November 3, 2015.


<sup>2</sup> Latency reductions based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switch product briefs posted on www.mellanox.com as of July 1, 2015, compared to Intel measured data that was calculated from the difference between a back-to-back osu\_latency test and an osu\_latency test through one switch hop. 10ns variation due to "near" and "far" ports on an Intel® OPA edge switch. All tests performed using Intel® Xeon® E5-2697v3 with Turbo Mode enabled.


#### Intel® Software Solutions



# Intel® Software Defined Visualization

#### **Low Cost**

No Dedicated Viz Cluster

#### **Excellent Performance**

Less Data Movement, I/O

Invest Power, Space, Budget in Greater Compute Capability

#### **High Fidelity**

Work with Larger Data Sets – Not Constrained by GPU Memory

#### Intel® Parallel Studio

#### **Faster Code**

Boost Application Performance on Current and Next-Gen CPUs

#### **Create Code Faster**

Utilizing a Toolset that Simplifies Creating Fast and Reliable Parallel Code

# **HPC System Software Stack**

# An Open Community Effort

Broad Range of Ecosystem Partners
Open Source Availability

#### Benefits the Entire HPC Ecosystem

Accelerate Application Development

Turnkey to Customizable

#### **Open Software Available Today!**



#### What Makes a Great HPC Solution?





Reference Architecture Intel® Cluster Ready

Actual configurations depend on specific OEM offerings and implementation.




## Summary: a Holistic Architectural Approach





Intel® Scalable System Framework | Application | Modernized Code | Community | ISV




experience what's inside™

Thank you!

nikolay.mester@intel.com