# The Development of an ARM System on Chip based Processing Unit for Data Stream Computing

MITCHELL A. COX

UNIVERSITY OF THE WITWATERSRAND, JOHANNESBURG, SOUTH AFRICA

**GRID 2014** 





#### Overview

- DATA IN MODERN TIMES
- CONVENTIONAL COMPUTING PARADIGMS
- DATA STREAM COMPUTING PARADIGM
- SYSTEM ON CHIP BASED PROCESSING UNIT
- PCI-EXPRESS I/O BENCHMARKS

#### Data is getting BIGGER!

- Big Data
- NICA
- SKA

#### Massive Data Processing

- Data volume must be reduced before storage
  - Generic processing complements existing FPGAs



#### Conventional Computing Paradigms

- High Performance Computing
  - Tightly Coupled
  - FLOPS
- High Throughput Computing
  - Loosely Coupled
  - Jobs/Day (FLOPS)
- Many Task Computing
  - Tightly or Loosely Coupled
  - FLOPS or I/O Throughput









#### Data Stream Computing

Three important constraints:

- High Data Throughput
- No Offline Storage Allowed
- Programmer Friendly





#### High Data Throughput

CPU and External I/O must be balanced.

Unbalanced (Conventional Systems)

Balanced (Data Stream Computing)





#### Data Stream Computing







No Offline Storage Allowed

#### The Offline Problem

- PB/s storage is not feasible.
- 1.3 PB at 8.4 GB/s: ~2 days
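The figures on this slide (1.3 PB, 8.4 GB/s, ~2 days) can be checked with a short calculation. This is a minimal sketch that assumes the numbers mean "moving 1.3 PB at a sustained 8.4 GB/s" and that decimal units are intended; neither assumption is stated in the slides, and the function name is illustrative.

```python
# Assumption: decimal (SI) units, as is usual for storage and link rates.
PB = 1e15  # bytes in one petabyte
GB = 1e9   # bytes in one gigabyte
SECONDS_PER_DAY = 86_400


def transfer_days(volume_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to move volume_bytes at a sustained rate."""
    return volume_bytes / rate_bytes_per_s / SECONDS_PER_DAY


days = transfer_days(1.3 * PB, 8.4 * GB)
print(f"{days:.2f} days")  # about 1.8 days, consistent with the "~2 days" on the slide
```

Under the same assumptions, anything arriving faster than the storage rate would have to be reduced on the fly, which is the point the slide makes.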



#### Data Stream Computing

- High Data Throughput
- No Offline Storage Allowed
- Programmer Friendly

#### Systems on Chip

- ARM or Intel Atom SoC
  - Low Power Consumption
  - Low Cost
  - High CPU Performance per Watt
- What about I/O performance?







Cortex-A9



Cortex-A15



#### System on Chip External I/O Ports

- **Ethernet**: 100 Mb/s - 1 Gb/s (12 - 125 MB/s)
- **PCI-Express**: N x 5 GT/s ($\geq$ 500 MB/s)
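The conversions behind these port figures can be sketched as follows. The slides give only the rates; the assumption here is that 5 GT/s per lane refers to PCI-Express 2.0, which uses 8b/10b encoding (8 payload bits for every 10 bits on the wire). The function names are illustrative.

```python
def ethernet_mb_per_s(megabits_per_s: float) -> float:
    """Convert an Ethernet line rate in Mb/s to MB/s (8 bits per byte)."""
    return megabits_per_s / 8


def pcie_gen2_mb_per_s(lanes: int) -> float:
    """Payload bandwidth of an N-lane link at 5 GT/s per lane.

    Assumption: PCIe 2.0 signalling with 8b/10b encoding.
    """
    raw_bits_per_s = 5e9 * lanes
    payload_bits_per_s = raw_bits_per_s * 8 / 10
    return payload_bits_per_s / 8 / 1e6


print(ethernet_mb_per_s(100))   # 12.5 (the slide rounds this to 12)
print(ethernet_mb_per_s(1000))  # 125.0
print(pcie_gen2_mb_per_s(1))    # 500.0
```

Protocol overheads (packet headers, flow control) are ignored in both cases, so these are upper bounds rather than expected application throughput.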



#### PCI-Express Benchmark Rig

- Test PCI-Express with a pair of SoCs:
  - Wandboard is a Quad-Core Cortex-A9 at 1 GHz (i.MX6 SoC)



#### PCI-Express Test Results

- PCIe x1 Link on i.MX6 SoC:
  - 500 MB/s Theoretical

|              | CPU memcpy  | DMA (EP)    | DMA (RC)    |
|--------------|-------------|-------------|-------------|
| Read (MB/s)  | 94.8 ±1.1%  | 174.1 ±0.3% | 236.4 ±0.2% |
| Write (MB/s) | 283.3 ±0.3% | 352.2 ±0.3% | 357.9 ±0.4% |

- 72 % of theoretical with Direct Memory Access (DMA)
  - Superior to Ethernet
  - Successful Proof of Concept
- 40 Gb/s PU needs 12 Freescale i.MX6 SoCs
  - 12 x 5 W = 60 W Power Consumption
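The "72 % of theoretical" figure follows from the table above. A minimal sketch, using the measured mean rates from the table and the 500 MB/s theoretical x1 rate; the dictionary layout and function name are illustrative choices, not part of the benchmark code.

```python
THEORETICAL_MB_S = 500.0  # PCIe x1 theoretical rate from the slide

# Mean throughput in MB/s, copied from the results table.
results = {
    "CPU memcpy": {"read": 94.8, "write": 283.3},
    "DMA (EP)": {"read": 174.1, "write": 352.2},
    "DMA (RC)": {"read": 236.4, "write": 357.9},
}


def efficiency(mb_per_s: float) -> float:
    """Fraction of the theoretical link rate that was achieved."""
    return mb_per_s / THEORETICAL_MB_S


for method, rates in results.items():
    for direction, rate in rates.items():
        print(f"{method:10s} {direction:5s} {efficiency(rate):6.1%}")

best = max(rate for rates in results.values() for rate in rates.values())
print(f"best: {best} MB/s = {efficiency(best):.0%} of theoretical")
```

The best case is the DMA write initiated by the root complex (RC), at 357.9 MB/s.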

#### Further Prototyping

- Test 8 i.MX6 SoCs via PCI-Express Switch
- Develop Linux Driver:
  - Emulate Ethernet (RDMA)
  - Emulate File
  - "Programmer Friendly"
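To illustrate what "Emulate File" could mean for application code, here is a minimal sketch. It assumes the planned driver would expose the PCI-Express link as a readable byte stream; the device path `/dev/pcie_stream0` mentioned in the comments is hypothetical and does not come from the slides. An in-memory buffer stands in for the device so the example runs anywhere.

```python
import io
from typing import BinaryIO, Iterator

CHUNK_SIZE = 1 << 20  # 1 MiB per read; an arbitrary choice for illustration


def read_chunks(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive chunks from a file-like stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


def count_bytes(stream: BinaryIO) -> int:
    """Placeholder 'processing' step: total number of bytes seen."""
    return sum(len(chunk) for chunk in read_chunks(stream))


# Stand-in for the device. With a file-emulating driver this might instead be
# open("/dev/pcie_stream0", "rb"), where the path is a hypothetical name.
demo_stream = io.BytesIO(b"\x00" * 3_000_000)
print(count_bytes(demo_stream))
```

The design point is that the processing code depends only on `read()`, so it would not need to know whether the bytes come from a file, a socket, or a PCI-Express link.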







PCIe Development Board at Wits

#### Summary

Data Stream Computing







- 12 i.MX6 SoCs for 40 Gb/s I/O
  - 60 GFLOPS
  - 60 W Power Consumption
  - Low Cost
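The summary figures can be cross-checked against the Cortex-A9 column of the ARM Performance table in the backup slides. This sketch assumes the 60 GFLOPS figure refers to single-precision HPL results; the slides do not say which precision is meant.

```python
N_SOCS = 12

# Cortex-A9 values from the ARM Performance backup slide.
A9_SP_GFLOPS = 5.12     # HPL, single precision
A9_PEAK_POWER_W = 5.03  # measured peak power

total_gflops = N_SOCS * A9_SP_GFLOPS
total_power_w = N_SOCS * A9_PEAK_POWER_W

print(f"{total_gflops:.1f} SP GFLOPS")  # roughly the 60 GFLOPS quoted
print(f"{total_power_w:.1f} W")         # roughly the 60 W quoted
```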



#### Questions or Comments?

MITCHELL.COX@STUDENTS.WITS.AC.ZA

#### Acknowledgements

- The "Massive Affordable Computing Project" team:
  - Robert Reed, Thomas Wrigley, Matthew Spoor
  - MSc Supervisor: Prof. Bruce Mellado
- The financial assistance of the National Research Foundation (NRF) towards this research is hereby acknowledged. Opinions expressed and conclusions arrived at are those of the authors and are not necessarily to be attributed to the NRF.
- I would also like to acknowledge the School of Physics, the Faculty of Science and the Research Office at the University of the Witwatersrand, Johannesburg.



#### Backup Slides

#### **ARM Performance**

|                 | Cortex-A7 | Cortex-A9 | Cortex-A15 |
|-----------------|-----------|-----------|------------|
| CPU Clock (MHz) | 1008      | 996       | 1000       |
| HPL (SP GFLOPS) | 1.76      | 5.12      | 10.56      |
| HPL (DP GFLOPS) | 0.70      | 2.40      | 6.04       |
| CoreMark        | 4858      | 11327     | 14994      |
| Peak Power (W)  | 2.85      | 5.03      | 7.48       |
| DP GFLOPS/Watt  | 0.25      | 0.48      | 0.81       |
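The last row of the table is the double-precision HPL result divided by peak power. A short sketch that recomputes it from the rows above; the variable names are illustrative.

```python
# (DP GFLOPS, peak power in W) per core type, copied from the table.
measurements = {
    "Cortex-A7": (0.70, 2.85),
    "Cortex-A9": (2.40, 5.03),
    "Cortex-A15": (6.04, 7.48),
}

gflops_per_watt = {
    core: round(dp_gflops / power_w, 2)
    for core, (dp_gflops, power_w) in measurements.items()
}

for core, value in gflops_per_watt.items():
    print(f"{core:10s} {value:.2f} DP GFLOPS/W")
```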