
Investigations into EMT data corruption following the installation of the new patch panel (September 2nd 2003)

On September 2nd 2003 the new EMT patch panel was installed in IR2.
On testing the installation, data corruption was seen in several EMT towers.
The problem was traced to a fault in the UPCs where the clock and frame are produced such that their rising edges coincide.
The investigation was carried out by the following team:

  • Chris O'Grady
  • Jamie Boyd
  • Matt Weaver
  • Su Dong
  • Tom Latham
Greatly assisted by:
  • Paul Dauncey
  • Phil Clark

The first test that was performed was a serial number calibration. Here is the summary output:

The following towers don't match their serial numbers

Phi 6 : Theta 2 : SN word seen = 0x60a2
Phi 26 : Theta 2 : SN word seen = 0x92c1
Phi 26 : Theta 3 : SN word seen = 0x5092

The following towers don't match their bit patterns

Phi 6 : Theta 1 : BP word seen = 0x80e3
Phi 6 : Theta 2 : BP word seen = 0x9249
Phi 26 : Theta 1 : BP word seen = 0x2701
Phi 26 : Theta 2 : BP word seen = 0x5124

The following towers were not consistent for all readouts

Phi 6 : Theta 1 : number of changes to SN word = 359 and to BP word = 1
Phi 6 : Theta 3 : number of changes to SN word = 93 and to BP word = 94
Phi 26 : Theta 1 : number of changes to SN word = 147 and to BP word = 1
Phi 26 : Theta 3 : number of changes to SN word = 1 and to BP word = 5
Phi 38 : Theta 1 : number of changes to SN word = 0 and to BP word = 237
Phi 38 : Theta 3 : number of changes to SN word = 94 and to BP word = 38

The following towers had missing channel TCs

Phi 16 : Theta 0 : number missing = 100
Phi 17 : Theta 0 : number missing = 100

The following towers show sample time alignment errors

Phi 3 : Theta 1 : event misalignments = 60
Phi 3 : Theta 2 : event misalignments = 40
Phi 3 : Theta 3 : event misalignments = 30
Phi 6 : Theta 1 : event misalignments = 60
Phi 6 : Theta 2 : event misalignments = 40
Phi 6 : Theta 3 : event misalignments = 53
Phi 16 : Theta 0 : event misalignments = 62
Phi 17 : Theta 0 : event misalignments = 62
Phi 23 : Theta 1 : event misalignments = 40
Phi 23 : Theta 2 : event misalignments = 50
Phi 23 : Theta 3 : event misalignments = 60
Phi 26 : Theta 1 : event misalignments = 60
Phi 26 : Theta 2 : event misalignments = 70
Phi 26 : Theta 3 : event misalignments = 60
Phi 38 : Theta 1 : event misalignments = 60
Phi 38 : Theta 2 : event misalignments = 30
Phi 38 : Theta 3 : event misalignments = 80

The "missing channel TCs" problem with Phi 16,17 Theta 0 was clearly an unplugged cable and was fixed immediately.
The "time alignment errors" for Phi 3,6,23,26,38 Theta 1,2,3 were a regular feature from previous runs. Although these were not understood they appeared not to cause problems and so had been ignored in the past.
The remaining problems, however, were new since the installation of the new cables and panel. They also indicate severe corruption of the data in all these channels.
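
One way to compare the sections of a summary like the one above is to group the listed towers by heading and look at which Phi values recur. The following is a minimal sketch in Python, assuming the summary is available as plain text in the format shown. The function name `towers_by_section` and the `SAMPLE` excerpt are ours for illustration and are not part of the calibration software.

```python
import re
from collections import defaultdict

# Matches lines such as "Phi 6 : Theta 2 : SN word seen = 0x60a2"
TOWER = re.compile(r"^Phi\s+(\d+)\s*:\s*Theta\s+(\d+)\s*:")

def towers_by_section(text):
    """Map each section heading to the set of (phi, theta) towers listed under it."""
    sections = defaultdict(set)
    heading = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("The following towers"):
            heading = line
            continue
        match = TOWER.match(line)
        if match and heading is not None:
            sections[heading].add((int(match.group(1)), int(match.group(2))))
    return dict(sections)

# A short excerpt of the summary output, copied from the listing above.
SAMPLE = """\
The following towers don't match their serial numbers
Phi 6 : Theta 2 : SN word seen = 0x60a2
Phi 26 : Theta 2 : SN word seen = 0x92c1
Phi 26 : Theta 3 : SN word seen = 0x5092
The following towers show sample time alignment errors
Phi 6 : Theta 1 : event misalignments = 60
Phi 6 : Theta 2 : event misalignments = 40
Phi 26 : Theta 2 : event misalignments = 70
Phi 26 : Theta 3 : event misalignments = 60
"""

sections = towers_by_section(SAMPLE)
sn = sections["The following towers don't match their serial numbers"]
ta = sections["The following towers show sample time alignment errors"]
corrupt_phis = {phi for phi, _ in sn}
aligned_phis = {phi for phi, _ in ta}

# True when every Phi with a serial number mismatch also shows alignment errors.
print(sorted(corrupt_phis), corrupt_phis <= aligned_phis)
```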

It was quickly noticed that the corrupted channels all belonged to cables already familiar from the "time alignment error" section. These channels had exhibited another strange feature in the past, in that their timing appeared different in frameclash calibrations. This led to the idea that the timing had shifted due to the new cables and that these five cables were now frameclashing. So a frameclash calibration was performed. The output can be seen here.

This confirmed that the timing had indeed shifted by one clock 60 tick. The cable error spikes were previously seen at UPC frame-offset = 1 for the five cables Phi 3,6,23,26,38 Theta 1,2,3 and at 2 for all the others; they were now seen at 0 for the five cables and at 1 for all the others.

The fact that the 3 cables showing corruption showed cable errors for all UPC frame-offset values, and the fact that the current frame-offset value was 14 (0xe), cast some suspicion on the timing change being the cause of the corruption. However, this avenue was still pursued, partly to gain a better understanding of the situation.

Serial number calibrations were performed for every value of the UPC frame-offset. The results were that the same 3 cables showed corruption for every value of the frame-offset. This proved conclusively that the timing shift had not caused the problem.

We then performed a simple check of the signal integrity by placing an ohm-meter across the two legs of each differential signal on each of the problem cables. When we unplugged the UPC end and checked, we got the correct 100 ohms. When we unplugged the EMT end and checked, we got the correct 940 ohms. The problem was clearly more subtle than a bad cable or a poor connection. But the cables and panel were the only things that had changed in the system, which was puzzling.

We decided to attempt to understand better the conditions that could lead to a cable error.
The following diagram shows that there are three conditions that give a cable error.
It also shows exactly how these are checked for:

  1. Frame Clash - cable-frame coincides with board-frame
  2. Frame Drift - time between cable-frames is not equal to 16 cable-clock ticks
  3. Loss of Cable Accident - cable-clock not present
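
The three conditions listed above can be written down as a short sketch in Python. This is only an illustration of the logic, not the check implemented in the hardware: it assumes that the cable-frame and board-frame pulses can be expressed as tick indices on one common counter, and the function name `cable_errors` is ours.

```python
FRAME_PERIOD = 16  # cable-clock ticks between cable-frames, from the list above

def cable_errors(cable_frames, board_frames, clock_present=True):
    """Return the set of error conditions raised for one cable.

    cable_frames / board_frames: tick indices at which each frame pulse is seen.
    Using one shared tick counter for both is a simplifying assumption.
    """
    errors = set()
    if not clock_present:
        # Without a cable-clock nothing else can be checked.
        errors.add("loss of cable")
        return errors
    if set(cable_frames) & set(board_frames):
        errors.add("frame clash")
    gaps = [later - earlier for earlier, later in zip(cable_frames, cable_frames[1:])]
    if any(gap != FRAME_PERIOD for gap in gaps):
        errors.add("frame drift")
    return errors

print(cable_errors([3, 19, 35], [0, 16, 32]))    # healthy cable: no errors
print(cable_errors([0, 16, 32], [0, 16, 32]))    # frames coincide
print(cable_errors([3, 19, 36], [0, 16, 32]))    # one gap of 17 ticks
print(cable_errors([], [], clock_present=False)) # clock missing
```

Note that in this sketch an intermittent cable-frame (a pulse missing from the sequence) shows up as frame drift, because the gap between the surviving pulses is no longer 16 ticks.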

Since frame clash had already been eliminated we turn to examine the other two possible causes.
The best way of doing this seemed to be to follow the cable-frame and cable-clock signals across the TPB from the backplane (where the signals come in from the patch panel) to the AX chip (where the error state is checked). We placed one of the TPBs with the error condition on the extender board and got out the TPB schematics and an oscilloscope.

We found that the cable-frame disappeared or became intermittent after passing through a quad dual-port register marked U15 on the TPB schematics.
This register takes as its inputs: two sets of four "data lines", a selector, and a clock. Its outputs are one or other set of the "data" inputs, depending on the value of the selector.
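
To see how a register of this kind can make a pulse disappear, here is a minimal behavioural sketch in Python. It assumes the register latches its inputs on the rising edge of the clock and that a selector value of 0 picks the first set of data lines. Both the class name and the selector polarity are our assumptions, not taken from the TPB schematics.

```python
class QuadDualPortRegister:
    """Behavioural model: four outputs, two 4-bit input ports, one selector."""

    def __init__(self):
        self.q = (0, 0, 0, 0)
        self._last_clock = 0

    def step(self, clock, select, port_a, port_b):
        """Advance one time step; inputs are latched only on a rising clock edge."""
        if clock and not self._last_clock:
            self.q = tuple(port_b if select else port_a)
        self._last_clock = clock
        return self.q

idle = (0, 0, 0, 0)
frame = (1, 0, 0, 0)  # frame pulse carried on the first data line

# Frame pulse still high when the clock rises: it is captured.
reg = QuadDualPortRegister()
reg.step(0, 0, frame, idle)
captured = reg.step(1, 0, frame, idle)

# Frame pulse has already fallen when the clock rises: it is lost.
reg = QuadDualPortRegister()
reg.step(0, 0, frame, idle)
lost = reg.step(1, 0, idle, idle)

print(captured, lost)
```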

The selector input was the same as that of a neighbouring channel.
Comparing cable-frame inputs, however, we saw a difference.
The following diagrams show the scope output from these readings. The first shows a channel with no corruption, the second a corrupted channel and the third shows a corrupted channel with the cable-frame from an uncorrupted one in yellow.





Compare these with this sketch of the specification for the UPC cable signals.
The middle of the frame is meant to be on the rising edge of the clock since this is when chips like U15 read the data.
However, this is not the case for either of the channels shown above.
Even the channel which exhibits no corruption has the rising clock edge almost on the falling edge of the frame.
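
The difference between the specification and what was observed can be expressed as a sampling margin: the distance from the rising clock edge to the nearest edge of the frame pulse. The sketch below is a minimal illustration in Python. The numbers are purely illustrative time units chosen by us and are not readings from the scope, and the function name `sampling_margin` is hypothetical.

```python
def sampling_margin(frame_start, frame_end, clock_edge):
    """Distance from the sampling edge to the nearest frame edge.

    Positive: the rising clock edge falls inside the frame pulse.
    Negative: the edge misses the pulse, so the frame is not read.
    All arguments are in the same (arbitrary) time units.
    """
    return min(clock_edge - frame_start, frame_end - clock_edge)

# Illustrative frame pulse occupying the interval 0..16.
as_specified = sampling_margin(0, 16, 8)    # edge in the middle of the frame
marginal = sampling_margin(0, 16, 15)       # edge almost on the falling edge
missed = sampling_margin(0, 16, 17)         # edge just after the frame has fallen

print(as_specified, marginal, missed)
```

A small shift in the relative delay of clock and frame is enough to move a channel from the marginal case to the missed case, which is consistent with the problem only appearing once the cabling changed.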

We swapped in a spare TPB to see if the problem might be TPB related but the problem persisted.

At this point we decided to look at the clock and frame as they left the UPC to see whether they were in the wrong phase before they entered the cable.
Below are the readings from the scope.
They clearly show that the clock and frame are already in phase, with their rising edges coinciding, on entering the cable and so the problem is somewhere on the UPC board.





This problem is not just the case for the three corrupted channels but is common to all the UPCs.
The permanent solution is to make the necessary changes to the UPCs (probably in the firmware) and this will be looked at in the coming weeks.
In the meantime a temporary fix was required such that the EMT could run without masking 9 towers.
The solution decided upon was to build a temporary extender cable that swaps the differential legs of the clock, thereby inverting its phase.
These cables were built and applied to the problem cables, clearing up the corruption and the cable errors.
This has been confirmed by performing bit by bit comparisons of the EMC and EMT trigger sums.
The "time alignment errors" were not remedied by this action, however.
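
The effect of swapping the differential legs of the clock can be illustrated with a sampled square wave: inverting the signal moves every rising edge by half a clock period. The following is a minimal sketch in Python, assuming an ideal 50% duty cycle clock and an arbitrary period of 8 samples. The helper `rising_edges` is ours.

```python
def rising_edges(samples):
    """Indices at which a sampled logic signal goes from 0 to 1."""
    return [i for i in range(1, len(samples)) if samples[i] and not samples[i - 1]]

PERIOD = 8  # samples per clock period (arbitrary choice for illustration)
clock = [0, 0, 0, 0, 1, 1, 1, 1] * 3

# Swapping the two legs of a differential pair inverts the received signal.
inverted = [1 - level for level in clock]

original_edges = rising_edges(clock)
inverted_edges = rising_edges(inverted)
print(original_edges, inverted_edges)

# Each rising edge of the inverted clock lies half a period after an original one.
shifts = [inv - orig for orig, inv in zip(original_edges, inverted_edges)]
print(shifts)
```

A rising edge that sat on a frame transition therefore ends up roughly in the middle of the frame, provided the frame is about one clock period wide (an assumption on our part).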

During the following week we tried to understand what made the 5 UPCs worse than the others.
It was suspected that the "time alignment errors" were linked to this in some way.
As such, a printout of the full data received by the EMT during a serial number calibration was produced.
The following extract from the printout clearly shows the change in alignment of the data from one of these UPCs.
On event 20, the data from channel 6 and 7 are aligned - the bit pattern word 37449 (0x9249) occurs on the first tick for both.
However, on event 21 it is seen that channel 6 now has the serial number 49233 (0xc051) on the first tick and the bit pattern on the second.

MajorCycle 1 Event 20/120 L1EmtChannelTC(0x880a7b8)->print()
Channel 6
Number of words 32
Bitmaps 0x ffff 0 0 ffff
Phi Energy 37449 49233 37449 49233 37449 49233 37449 49233 37449 49233 37449 49233 37449 49233 37449 49233
X Energy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Y Energy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Theta Bits 302 302 302 302 302 302 302 302 302 302 302 302 302 302 302 302

MajorCycle 1 Event 20/120 L1EmtChannelTC(0x880a802)->print()
Channel 7
Number of words 32
Bitmaps 0x ffff 0 0 ffff
Phi Energy 37449 49178 37449 49178 37449 49178 37449 49178 37449 49178 37449 49178 37449 49178 37449 49178
X Energy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Y Energy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Theta Bits 302 302 302 302 302 302 302 302 302 302 302 302 302 302 302 302

MajorCycle 1 Event 21/121 L1EmtChannelTC(0x880a7b8)->print()
Channel 6
Number of words 32
Bitmaps 0x ffff 0 0 ffff
Phi Energy 49233 37449 49233 37449 49233 37449 49233 37449 49233 37449 49233 37449 49233 37449 49233 37449
X Energy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Y Energy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Theta Bits 302 302 302 302 302 302 302 302 302 302 302 302 302 302 302 302

MajorCycle 1 Event 21/121 L1EmtChannelTC(0x880a802)->print()
Channel 7
Number of words 32
Bitmaps 0x ffff 0 0 ffff
Phi Energy 37449 49178 37449 49178 37449 49178 37449 49178 37449 49178 37449 49178 37449 49178 37449 49178
X Energy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Y Energy 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Theta Bits 302 302 302 302 302 302 302 302 302 302 302 302 302 302 302 302
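
The change of alignment in the printout above can be detected automatically by checking on which tick the bit pattern word appears. The following is a minimal sketch in Python using the Phi Energy words copied from the printout. The function name `bit_pattern_phase` is ours and is not part of the EMT software.

```python
BIT_PATTERN = 0x9249  # 37449, the bit pattern word quoted in the text

def bit_pattern_phase(words, pattern=BIT_PATTERN):
    """Return 0 if the pattern sits on even ticks, 1 if on odd ticks.

    Returns None if the pattern is absent or appears on both parities.
    Ticks are counted from 0, so phase 0 means "on the first tick".
    """
    parities = {index % 2 for index, word in enumerate(words) if word == pattern}
    return parities.pop() if len(parities) == 1 else None

# Phi Energy words from the printout above.
event20_ch6 = [37449, 49233] * 8
event20_ch7 = [37449, 49178] * 8
event21_ch6 = [49233, 37449] * 8
event21_ch7 = [37449, 49178] * 8

for label, words in [("event 20, channel 6", event20_ch6),
                     ("event 20, channel 7", event20_ch7),
                     ("event 21, channel 6", event21_ch6),
                     ("event 21, channel 7", event21_ch7)]:
    print(label, bit_pattern_phase(words))
```

Channel 6 changes phase between events 20 and 21 while channel 7 does not, which is the misalignment described in the text.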

Attempts were made to locate a problem with these UPCs that might cause such behaviour using an oscilloscope and logic analyser.
Unfortunately these efforts did not locate any "smoking gun".

It was decided at this point to check the firmware running on these boards.
The verification failed for each of the 5 boards.
The boards were then programmed with the correct firmware and checked again.
When returned to the system, the serial number calibration showed that the "time alignment errors" had gone.
The frameclash calibration also showed that the timing of the UPCs was now identical to the others.

The kludge cables were then removed and the calibrations performed again - both still yielded no errors.
This was confirmed once again by the bit by bit trigger sum comparison checks.

The only imperfection present at this point was that Phi 24,25 Theta 0 was showing frameclash on tick 2 rather than tick 1.
These towers had been fitted with one of the test 25' cables 6 months prior to the panel replacement. These cables are identical in appearance to the new cables.
It was thought, therefore, that this cable had accidentally not been replaced. It was swapped with a new cable and the calibration performed again.
The results from this calibration showed uniform alignment of all UPCs.

Summary

  • 5 UPCs have been sending their frames and data too close to the clock due to incorrect firmware
  • This behaviour has not corrupted the trigger sums in the past but has caused the "time alignment errors" for reasons not fully understood
  • The new patch panel and cables changed the skew between the clock and the rest of the signals, highlighting the marginal timing on those UPCs - this was temporarily kludged around by flipping the clock
  • Even with the correct firmware all the UPCs have only a 2ns margin on their timing
  • The three options for the future are:
    1. Live with the 2ns margin and hope that it is enough.
    2. Try to modify the UPC firmware to delay the signals a few more ns - not trivial since there are 115 boards.
    3. Flip the clock somewhere - one possible place would be the EMT patch panel - again not trivial.
  • The EMT is now running better than it ever has, exhibiting no errors of any kind in either calibration

Page created by Tom Latham September 7th 2003.
Updated September 15th 2003.