

## Stefan Huber on behalf of AMBER DAQ group

Institute for Hadronic Structure and Fundamental Symmetries (E18)

TUM Department of Physics

Technical University of Munich

SPD Collaboration Meeting, 9 June 2021






# AMBER DAQ Architecture



# DAQ Evolution: From Triggered to Free-Running





# Free-Running DAQ

## Pros

- Very general data reduction scheme
  - No dedicated trigger detectors; any detector can be included in the filter (trigger) logic
  - No fast trigger, so latency is not an issue
  - Integration of both slow (TPC) and fast detectors
  - Programmable filtering algorithms in hardware (FPGA) or software (HLT)
  - High-quality data selection and higher precision of measurements

## Cons

- High data rate
  - PRM setup in 2022 => 1 GB/s sustained
  - Drell-Yan setup => 10-20 GB/s sustained (preliminary)
- Challenge of data flow control
- Online time/space alignment
- Synchronization and event building
  - Physics trigger => slice signal, issued at a fixed time interval
  - GATE => time slice; the length of a time slice equals the interval between consecutive slice signals
  - Event builder => time slice builder
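The switch from physics triggers to a fixed-period slice signal means hits are grouped by time rather than by trigger. The snippet below is a minimal sketch of that grouping, assuming every hit carries a timestamp in nanoseconds relative to the spill start; the `Hit` type and the 10 µs slice period are hypothetical values, not AMBER parameters.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    detector: str
    time_ns: int  # assumed: hit timestamp relative to the start of the spill


SLICE_PERIOD_NS = 10_000  # hypothetical fixed interval between slice signals


def slice_index(time_ns: int, period_ns: int = SLICE_PERIOD_NS) -> int:
    """Index of the time slice (GATE) that contains the given timestamp."""
    return time_ns // period_ns


def build_time_slices(hits, period_ns: int = SLICE_PERIOD_NS):
    """Group hits by time slice instead of by trigger."""
    slices = defaultdict(list)
    for hit in hits:
        slices[slice_index(hit.time_ns, period_ns)].append(hit)
    return dict(slices)


hits = [Hit("SciFi", 1_200), Hit("ECAL2", 9_999), Hit("SciFi", 10_000), Hit("ALPIDE", 25_300)]
slices = build_time_slices(hits)
print({k: [h.detector for h in v] for k, v in slices.items()})
# → {0: ['SciFi', 'ECAL2'], 1: ['SciFi'], 2: ['ALPIDE']}
```

In this sketch a hit exactly on a slice boundary belongs to the later slice; how the real system treats boundary hits is not specified here.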






# COMPASS++/AMBER DAQ Architecture





# **Data Taking Modes**

- T0 Calibration (Alignment)
  - Rough T0 calibration, no physics data
- Data taking without hardware filter (preferred mode of data taking)
  - Fine T0 calibration
  - Software filter algorithms executed by the HLT
    - Evaluation of algorithms for the hardware filter
    - Normal data taking mode if HLT performance is sufficient
- Data taking with filter
  - Economic mode: the HLT stops processing an image as soon as one of the algorithms flags it
  - Full-power mode: all algorithms are executed; used if CPU power is sufficient
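The two filter modes differ only in whether evaluation stops at the first positive decision. A minimal sketch, assuming each filter algorithm is a function that takes an image and returns `True` when it flags it; the `Image` type and the example algorithms are hypothetical.

```python
from typing import Callable, Sequence

Image = Sequence[int]                # hypothetical: an image reduced to a list of hit times
Algorithm = Callable[[Image], bool]  # True if the algorithm flags the image


def filter_economic(image: Image, algorithms: Sequence[Algorithm]) -> bool:
    """Economic mode: stop as soon as one algorithm flags the image."""
    return any(alg(image) for alg in algorithms)  # any() short-circuits


def filter_full_power(image: Image, algorithms: Sequence[Algorithm]) -> list[bool]:
    """Full-power mode: run every algorithm and keep all decisions."""
    return [alg(image) for alg in algorithms]


calls = []  # records which algorithms were actually executed


def make_alg(name: str, result: bool) -> Algorithm:
    def alg(image: Image) -> bool:
        calls.append(name)
        return result
    return alg


algs = [make_alg("coincidence", True), make_alg("calorimeter", False)]
print(filter_economic([1, 2, 3], algs), calls)    # → True ['coincidence']
calls.clear()
print(filter_full_power([1, 2, 3], algs), calls)  # → [True, False] ['coincidence', 'calorimeter']
```

Economic mode saves CPU time per image, while full-power mode keeps the decision of every algorithm, which is useful when evaluating algorithms against each other.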



# Data Structure


















# **DAQ Hardware**




# **Crosspoint Switch**

### **Crosspoint Switch Components**

#### Interfaces:

- 12 × 12-channel CXP transceivers (MPO fiber connectors)
- Ethernet for IPbus
- JTAG
- TCS (Trigger Control System) receiver

#### Switching and control:

- Vitesse VSC3144-02: fully configurable, asynchronous 144x144 crosspoint switch, 6.5 Gbps
- Xilinx Artix-7 FPGA for switch control and monitoring
- Xilinx Artix-7 FPGA for switch control and monitoring





- Interface between FPGA and crosspoint switch:
  - 90 MHz, 11-bit parallel data bus
  - Multiple program assignments can be queued and issued simultaneously ⇒ fast programming (<< 1 µs)

Future developments will use MACOM M21605G-12 switch ICs

(non-blocking, asynchronous 12.5 Gbps 160x160 switch)
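Queuing program assignments and issuing them together means the data path never sees a half-programmed connection map. The toy model below sketches that idea; it is an assumption for illustration and does not reflect the actual VSC3144 register interface. Each output is driven by exactly one input, so the map is stored as output → input.

```python
class CrosspointConfig:
    """Toy crosspoint switch with a staged (queued) configuration.

    Assumption: assignments go into a pending table and all become
    active together on commit().
    """

    def __init__(self, n_ports: int) -> None:
        self.n_ports = n_ports
        self.active: dict[int, int] = {}   # output -> input currently connected
        self.pending: dict[int, int] = {}  # queued output -> input assignments

    def queue(self, output: int, input_: int) -> None:
        """Stage one assignment without touching the active map."""
        if not (0 <= output < self.n_ports and 0 <= input_ < self.n_ports):
            raise ValueError("port out of range")
        self.pending[output] = input_

    def commit(self) -> None:
        """Apply all queued assignments in a single step."""
        self.active.update(self.pending)
        self.pending.clear()


xp = CrosspointConfig(144)
xp.queue(0, 5)
xp.queue(1, 7)
assert xp.active == {}  # nothing is switched before commit()
xp.commit()
print(xp.active)  # → {0: 5, 1: 7}
```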



# DHmx/DHsw

## Backbone module of the DAQ

- Xilinx Virtex-6 VLX130T
  - Custom board in AMC form factor
  - 16x 6.25 Gb/s links
  - 4 GB DDR3 memory
  - VME carrier board
  - ATCA carrier board (under development)
- Firmware versions
  - LVL0 multiplexer (frontend-specific)
  - LVL1/LVL2 multiplexer
  - Full-bandwidth DAQ switch





# New Spill Buffer PCIe Card



- Based on commercial hardware
- Nereid Kintex-7 PCI Express board
- Trenz FMC SFP adapter
- Kintex-7 XC7K160T-FBG676
- PCIe Gen2 x4 interface
- 4 GB DDR3 memory
- No dedicated TCS interface





# Online Computer Performance, single computer

## NUMA optimization



Disk speed test:

- Read speed: 2.2 GB/s
- Write speed: 1.5 GB/s
- Read/write speed: 1.5 GB/s
  - Requires at least 12 disks

Reference number for scaling up the system:

1 GB/s per server

More tests will be done by the end of this year
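With the reference figure of 1 GB/s per server, the number of online servers follows from the sustained data rate. The sketch below assumes perfectly linear scaling with no headroom; the rates are the PRM and preliminary Drell-Yan figures quoted in the Cons list.

```python
import math

PER_SERVER_GBPS = 1.0  # reference number from the disk tests: 1 GB/s per server


def servers_needed(sustained_rate_gbps: float, per_server_gbps: float = PER_SERVER_GBPS) -> int:
    """Smallest number of servers whose combined rate covers the input rate (linear scaling assumed)."""
    return math.ceil(sustained_rate_gbps / per_server_gbps)


for setup, rate in [("PRM 2022", 1.0), ("Drell-Yan, low", 10.0), ("Drell-Yan, high", 20.0)]:
    print(setup, servers_needed(rate))
```

In practice some margin above this minimum would be needed for spill peaks and for concurrent reading by the HLT.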



## **Timeslice Builder**

# Data format

Data are split into nested frames, each marked by the equipment that produced it:

- SPILLBUFFER: begin of slice
  - SWITCH: begin of slice
    - MUX: begin of slice
      - Detector specific: MUX + PAYLOAD
      - Detector specific: MUX + PAYLOAD
      - Detector specific: MUX + PAYLOAD
    - MUX: end of slice
    - MUX: begin of slice
      - Detector specific: MUX + PAYLOAD
    - MUX: end of slice
  - SWITCH: end of slice
- SPILLBUFFER: end of slice
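Because every level opens and closes its own frame, a time slice can be checked with a simple stack. The sketch below validates the begin/end nesting and groups payload frames per MUX block. The `(source, kind)` tuple representation is hypothetical and stands in for the real binary frame headers.

```python
# Nesting order taken from the frame layout: SPILLBUFFER > SWITCH > MUX > payload.
LEVELS = ["SPILLBUFFER", "SWITCH", "MUX"]


def parse_slice(frames):
    """Validate one time slice and return a list of payload sources per MUX block.

    `frames` is a list of (source, kind) tuples with kind in {"begin", "end", "payload"}
    (a hypothetical format used only for illustration).
    """
    stack = []
    mux_blocks = []
    for source, kind in frames:
        if kind == "begin":
            expected = LEVELS[len(stack)] if len(stack) < len(LEVELS) else None
            if source != expected:
                raise ValueError(f"unexpected begin of {source}, expected {expected}")
            stack.append(source)
            if source == "MUX":
                mux_blocks.append([])  # start collecting payloads of a new MUX block
        elif kind == "end":
            if not stack or stack[-1] != source:
                raise ValueError(f"unmatched end of {source}")
            stack.pop()
        elif kind == "payload":
            if stack != LEVELS:
                raise ValueError("payload outside a MUX block")
            mux_blocks[-1].append(source)
        else:
            raise ValueError(f"unknown frame kind {kind}")
    if stack:
        raise ValueError(f"slice not closed: {stack}")
    return mux_blocks


frames = [
    ("SPILLBUFFER", "begin"), ("SWITCH", "begin"),
    ("MUX", "begin"), ("SciFi", "payload"), ("SciFi", "payload"), ("SciFi", "payload"), ("MUX", "end"),
    ("MUX", "begin"), ("ECAL2", "payload"), ("MUX", "end"),
    ("SWITCH", "end"), ("SPILLBUFFER", "end"),
]
print([len(block) for block in parse_slice(frames)])  # → [3, 1]
```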



# **Timeslice Builder Switch**

- Removes the bottleneck
- Pre-sorts time slices
- Increases the number of timeslice builder nodes
- N-to-N switch in FPGA fabric
- No external memory for data
- Requires memory in multiplexers





# Switch firmware



- 8 Aurora receivers
- TCS receiver
- Deep TCS FIFO in DDR3 memory
- 8x8 switch
- 8 × 6.25 Gb/s 8b/10b Aurora links
  - Timeslice information and switch configuration distributed over a sideband link
- Slow control over IPbus
  - Control
  - Configuration
  - Diagnostics
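The 8b/10b line code carries 8 payload bits in every 10 transmitted bits, which sets an upper bound on the user data per link. The arithmetic below ignores Aurora framing, idle and clock-compensation overhead, so real throughput is somewhat lower.

```python
LINE_RATE_BPS = 6_250_000_000        # 6.25 Gb/s serial line rate per Aurora link
ENCODING_NUM, ENCODING_DEN = 8, 10   # 8b/10b: 8 payload bits per 10 line bits


def payload_rate_bytes_per_s(line_rate_bps: int = LINE_RATE_BPS) -> int:
    """Upper bound on user data per link after 8b/10b coding, in bytes per second."""
    return line_rate_bps * ENCODING_NUM // ENCODING_DEN // 8


per_link = payload_rate_bytes_per_s()
print(per_link / 1e6, "MB/s per link")        # → 625.0 MB/s per link
print(8 * per_link / 1e9, "GB/s for 8 links")  # → 5.0 GB/s for 8 links
```

The 625 MB/s bound is close to the ~600 MB/s single-spillbuffer limit and the ~580 MB/s measured with the switch, once protocol overhead is taken into account.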



# Switch architecture



### Switch control

- Switch mapping changes when frame transmission for the given timeslice is complete

- 4-to-4 switch
  - Routes frames from an input to a specific output
  - Input-output mapping change
  - Configuration BRAM
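A common way to drive an N-to-N switch for event building is a barrel-shift (round-robin) schedule: each input sends its fragments in a staggered order, and the mapping changes so that all fragments of one timeslice reach the same builder output. The slides do not spell out the AMBER schedule, so the sketch below is an assumption. Input `i` lags input 0 by `i` steps, and slice `s` is collected on output `s % n_ports`.

```python
from collections import defaultdict


def barrel_shift_schedule(n_ports: int, n_slices: int) -> dict[int, dict[int, tuple[int, int]]]:
    """Return {step: {input port: (slice id, output port)}} for an N-to-N switch.

    Assumptions: input i starts i steps later than input 0 (staggered order
    prepared upstream), and slice s is routed to output s % n_ports.
    """
    schedule: dict[int, dict[int, tuple[int, int]]] = defaultdict(dict)
    for s in range(n_slices):
        for i in range(n_ports):
            step = s + i  # input i sends its fragment of slice s at step s + i
            schedule[step][i] = (s, s % n_ports)
    return dict(schedule)


sched = barrel_shift_schedule(n_ports=4, n_slices=8)
# At every step the connections form a permutation: no output is driven twice.
for step, conns in sched.items():
    outputs = [out for (_, out) in conns.values()]
    assert len(outputs) == len(set(outputs))
print(dict(sorted(sched[3].items())))  # → {0: (3, 3), 1: (2, 2), 2: (1, 1), 3: (0, 0)}
```

Each step corresponds to one switch mapping, which in a BRAM-driven design could be stored as one configuration entry per step.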







#### Event builder switch: sequence IV

(Figure: timeslices Slice1 to Slice11 routed through switches 1.1, 1.2, 2.1 and 2.2.)

# Switch Setup at CERN






- System integration into the AMBER run and slow control
- Validation of the data read-out chain
  - Verification of the data format and integration with the read-out software
  - Optimization of the PCIe driver for high data rates
- Stability test
  - Scan of the parameter space of the 6 Gbps links to find optimal operational parameters for error-free operation

# Switch Performance (2 links active)



- Artificial spill structure: 5 s on, 15 s off
- Nearly the full available time is needed for processing
- Data rate ~580 MB/s
- 10% efficiency loss
- Reason:
  - Different timeslices contain different amounts of data
  - The following switch configuration has to wait until the current timeslice has been transmitted completely
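The efficiency loss can be illustrated with a simple model, assuming all links have equal speed, transfer time is proportional to data size, and the next switch configuration starts only when the largest transfer of the current step has finished. The slice sizes below are made up for illustration and are not measured values.

```python
def switch_efficiency(slice_sizes):
    """Fraction of link capacity used when each reconfiguration waits for the largest transfer.

    slice_sizes[step][link] is the amount of data (arbitrary units) that `link`
    must send during that step (hypothetical input).
    """
    n_links = len(slice_sizes[0])
    busy = sum(sum(step) for step in slice_sizes)               # data actually transferred
    capacity = sum(n_links * max(step) for step in slice_sizes)  # link time spent, including waiting
    return busy / capacity


# Two active links, as in the measurement; sizes are illustrative only.
sizes = [[10, 8], [9, 10], [10, 10]]
print(round(switch_efficiency(sizes), 3))  # → 0.95
```

The larger the spread of slice sizes within a step, the more link time is spent idle, so the efficiency loss grows with the fluctuation of the data volume per timeslice.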

## Computing



# **HLT Framework**

- Optimized high-performance framework capable of running on an arbitrary number of nodes
- Includes tools for monitoring and simulation, e.g. detector response simulation





# **HLT** Functionality






# From Continuous Data Stream to Events

## HLT tasks

- Analyze data from the detectors included in the filtering algorithms
- Find time and geometrical coincidences between hits in different detectors
- Calculate T0 and remove uncorrelated images
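The time part of the coincidence search can be sketched with a binary search over sorted hit times. This is a simplified illustration: it uses timing only, while real filter algorithms would also apply geometry and T0 corrections. The detector names come from the pilot-run list and the hit times are made up.

```python
from bisect import bisect_left, bisect_right


def find_coincidences(times_a, times_b, window_ns):
    """Return (t_a, t_b) pairs with |t_a - t_b| <= window_ns.

    Both inputs are hit timestamps in ns from two detectors; times_b must be sorted.
    """
    pairs = []
    for t_a in times_a:
        lo = bisect_left(times_b, t_a - window_ns)   # first hit inside the window
        hi = bisect_right(times_b, t_a + window_ns)  # one past the last hit inside the window
        pairs.extend((t_a, t_b) for t_b in times_b[lo:hi])
    return pairs


scifi = [100, 550, 2_000]
ecal2 = sorted([105, 1_000, 1_996, 2_003])
print(find_coincidences(scifi, ecal2, window_ns=5))
# → [(100, 105), (2000, 1996), (2000, 2003)]
```

Hits that take part in no pair (here SciFi at 550 ns and ECAL2 at 1000 ns) are the uncorrelated ones a filter would drop.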

### Problem:

- HLT output is in raw format, with no event definition
- CORAL (event reconstruction software) can only process event-type data

## **Missing software: event formation**

- Apply all trigger algorithms
- Extract the images for each trigger => event
- DAQ decoding library together with an event builder
- Process raw data with CORAL, COOOL and the event display
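Event formation amounts to cutting a window around each software trigger out of the continuous stream. A minimal sketch, assuming the hits are sorted by time; the window widths and the dictionary event layout are hypothetical and do not represent the CORAL input format.

```python
from bisect import bisect_left, bisect_right


def form_events(hit_times, trigger_times, before_ns, after_ns):
    """Cut one event per software trigger out of a sorted, continuous hit stream.

    Each event contains the hits in [t_trig - before_ns, t_trig + after_ns].
    Overlapping windows may share hits.
    """
    events = []
    for t_trig in trigger_times:
        lo = bisect_left(hit_times, t_trig - before_ns)
        hi = bisect_right(hit_times, t_trig + after_ns)
        events.append({"trigger_time": t_trig, "hits": hit_times[lo:hi]})
    return events


stream = [10, 95, 101, 130, 400, 405, 900]
for ev in form_events(stream, trigger_times=[100, 403], before_ns=10, after_ns=30):
    print(ev)
# → {'trigger_time': 100, 'hits': [95, 101, 130]}
# → {'trigger_time': 403, 'hits': [400, 405]}
```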





# **HLT Performance**





## Outlook



# FriDAQ Objectives for 2021 Pilot Run

- Integration of SciFi detectors with iFTDCs
- Integration of one plane of ALPIDE pixel detector
- Evaluation of TPC readout integration (problem: the SIS ADC does not support trigger-less acquisition)
- 16 ADC channels of ECAL2
- Detector commissioning
- HLT tests
  - Unfiltered mode
  - Online time calibration
  - Filtering mode with a simple coincidence algorithm

# Outlook

- Performance of a single spillbuffer is limited by the link speed to ~600 MB/s



- An internal 2:1 MUX would increase performance to 1.2 GB/s per readout engine
  - Total limit with 8 readout engines using 1 switch: 4.8 GB/s
  - Total limit with 8 readout engines using 2 switches: 9.6 GB/s
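The quoted totals can be reproduced by taking the tighter of two limits: the readout engines and the switches. The model below assumes that each switch forwards 8 links at ~600 MB/s each; this is an interpretation of the numbers above, not a documented design rule.

```python
def daq_limit_mb_s(n_engines, per_engine_mb_s, n_switches, links_per_switch=8, per_link_mb_s=600):
    """Aggregate throughput limit in MB/s as the smaller of engine and switch bandwidth.

    Assumption: each switch forwards at most links_per_switch * per_link_mb_s.
    """
    engines = n_engines * per_engine_mb_s
    switches = n_switches * links_per_switch * per_link_mb_s
    return min(engines, switches)


print(daq_limit_mb_s(8, 1200, n_switches=1))  # → 4800
print(daq_limit_mb_s(8, 1200, n_switches=2))  # → 9600
```

Under this model, doubling the per-engine rate only pays off once the switch bandwidth is doubled as well.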



# Outlook – Scaling to 20 GB/s

Evaluating more modern FPGAs (Virtex UltraScale+ VU37P)

- More links (e.g. 32x32 switch)
- Higher bandwidth per link
- HBM memory



Alpha Data ADM-PCIE-9H7



# THANK YOU
