(1) Stability problems: we can start taking data, but the data flow in the ROD stops after a small number of events (usually < 1000). When this happens, it is associated with at least one of three different fatal errors:

(a) DX Fault (see below for acronyms and a brief system overview) - this is reported through hardware channels. It has been difficult for me to trace the source of this fault; making progress requires some study of the firmware (written in Verilog). The Verilog code is well commented, but I am reaching my limits here.

(b) RPU Fault - reported by one or both of the RPUs (DSPs). We understand which condition leads to the fault; we are trying to find out what causes that condition. In brief, the RPU expects data in its input buffer which is not there. Our goal this week is to understand the underlying code (C++, on the DSP) and find or rule out a bug in it.

(c) RPU stall - one of the two RPUs stops processing events at some point. Same approach as in (b).

The described problems are observed at L1A rates of 50 Hz and 1 kHz alike, but not when triggers are generated by the ROD itself (in that case it is auto-throttled and the effective processing rate is low). There is one more:

(d) One of the 10 SPUs fails to process the first (and any subsequent) event. This happens at the beginning of a run and requires a reboot. There is no pattern as to which SPU fails.

(2) Rate problems: we cannot sustain the design rate. A couple of months ago we could barely write at 80 Hz. This was in part due to the HPU code, which was optimized not for speed but for system tests. I am now working on new code which can already process one event in less than 4 ms, and I believe further refinement can push this to 1 ms soon. And then comes the fine tuning... this is where my limited expertise with this kind of pipelined processing could benefit from advice and discussion, which I have been lacking since our engineer left.
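To make the difference between (b) and (c) concrete, here is a minimal sketch of the kind of input-buffer check I believe the RPU code performs. All names (InputFifo, wait_for_fragment) and the timeout mechanism are hypothetical illustrations, not the actual DSP code.

```cpp
#include <cstdint>

// Hypothetical sketch only: names are invented, not the real DSP code.
struct InputFifo {
    volatile uint32_t word_count;  // would be a memory-mapped FPGA register
};

enum RpuStatus { RPU_OK, RPU_FAULT, RPU_STALL };

// Wait until the SPU data for the next event is present in the input buffer.
// (b) RPU Fault: the buffer ends up holding fewer words than the event needs.
// (c) RPU stall: no data arrives at all, so without a timeout the RPU
//     would spin here forever.
RpuStatus wait_for_fragment(const InputFifo* fifo, uint32_t expected_words,
                            unsigned timeout_polls) {
    for (unsigned i = 0; i < timeout_polls; ++i) {
        if (fifo->word_count >= expected_words) return RPU_OK;
    }
    // Distinguish "partial data" (fault-like) from "no data" (stall-like).
    return (fifo->word_count > 0) ? RPU_FAULT : RPU_STALL;
}
```

In the real system the open question is why the expected data never arrives in the buffer, not how the wait itself is written.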
One concrete bottleneck: a particular subsystem call - a simple read operation targeting all SPUs and RPUs - was taking 2 ms. This read operation has to be performed for every event, so by design it should never take this long. In fact, the same operation was used in our 100 kHz test years ago to transfer the data along the same path - a much larger data volume! If we succeeded in reading out at 100 kHz back then, this operation should not take more than 10 us. We managed to factorize the problem a bit and partially bypass it, but the driver routine should still run faster. The drivers are implemented in the DSP code, but the actual read/write process is implemented in the FPGAs.

Now, for a very brief introduction to the system. One ROD processes the data from two chambers (960 channels each) at 20 or 40 MHz sampling rate. The data comes, as far as the ROD is concerned, from 5 frontend electronics boards per chamber. The frontend readout control, deserializing of the incoming data, communication with the TIM (TTC interface module in the ROD crate), and the connection to the ROL via the S-Link (HOLA) mezzanine board are all implemented entirely on the CSC transition module (CTM), which acts together with the ROD as a unit. The ROD consists of the motherboard and 13 mezzanine boards (GPUs) which are identical in hardware (one TI 6203 DSP, two Xilinx Spartan FPGAs, one 2 Mword SDRAM) but have different functions: 10 act as SPUs (Sparsifying Processing Units), which receive the incoming data via the expansion bus (XB) and perform zero suppression and cluster identification; 2 act as RPUs (Rejection Processing Units, for neutron background rejection), which build the event fragment for each chamber out of the SPUs' output and optionally perform further noise suppression by matching clusters across the chamber layers; one acts as the HPU (Host PU), which orchestrates the whole ensemble and also adds the event header and trailer information to the fragment. The data flow between the SPUs, RPUs, and CTM/S-Link is handled by a bus system we call the Data Exchange (DX).
The bus-GPU interface is implemented as a set of FIFOs in FPGAs (one per GPU). Communication between the HPU and the SPUs/RPUs (also referred to as DPUs) is handled by the DPU Control (DC) subsystem, also implemented in an FPGA.

The available documentation on the ROD system is posted at http://positron.ps.uci.edu/~pier/csc/CSCElectronics.html and the most relevant documents to start with on this page are:

http://positron.ps.uci.edu/~pier/csc/CSC_ROD_FDR_1.pdf
http://positron.ps.uci.edu/~pier/csc/IRODBlockDiagram9.pdf (Overview)
http://positron.ps.uci.edu/~pier/csc/IROD_Subsystems/DX_Notes22.pdf (Data Exchange)
http://positron.ps.uci.edu/~pier/csc/IROD_Subsystems/DC_Notes25.pdf (DPU Control)
http://positron.ps.uci.edu/~pier/csc/CTM/CTM_ReferenceManual_01.pdf (CTM)
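Finally, to put the slow read operation mentioned above in perspective, the per-event time budget can be written down explicitly. The rates and the 2 ms figure are from the text; the even split over 12 DPUs (10 SPUs + 2 RPUs) is my own assumption about how the per-event read is shared.

```cpp
// Back-of-envelope check of the timing argument; assumptions noted above.
constexpr double kMicrosecondsPerSecond = 1e6;

// Per-event time budget at a given L1A rate: 100 kHz -> 10 us.
constexpr double per_event_budget_us(double l1a_rate_hz) {
    return kMicrosecondsPerSecond / l1a_rate_hz;
}

// Factor by which an observed per-event operation overshoots the budget:
// a 2000 us read against a 10 us budget is 200x too slow.
constexpr double overshoot_factor(double observed_us, double l1a_rate_hz) {
    return observed_us / per_event_budget_us(l1a_rate_hz);
}
```

Even ignoring all other per-event work, the 2 ms read alone caps the rate at 500 Hz, which is consistent with the rate problems described under (2).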