Hardware, Configuration

All I/O for the binary archive format is based on the LowLevelIO class in the Tools directory.
Via ToolsConfig.h, this class can be configured to use "fd" file descriptor I/O (open, write, lseek, ...) or "FILE *" calls from stdio (fopen, fwrite, ...).
In addition, all ChannelArchiver files can be compiled either with debug information or optimized, depending on the HOST_OPT setting, which is usually set in EPICS/base/config/CONFIG_SITE but can be overridden in ArchiverConfig.h.
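
As a rough sketch of the idea (the macro and function names below are invented for illustration and are not the actual ToolsConfig.h/LowLevelIO interface), such a compile-time switch between stdio and file-descriptor I/O could look like this on a Unix-like system:

  // Sketch only: hypothetical names, not the real ToolsConfig.h/LowLevelIO API.
  // Selects "FILE *" stdio calls or raw file descriptors at compile time.
  #include <cstdio>
  #include <fcntl.h>
  #include <unistd.h>

  // #define LLIO_USE_STDIO 1        // would normally come from the config header

  #ifdef LLIO_USE_STDIO
  typedef std::FILE *llio_handle;
  inline llio_handle llio_open(const char *name)
  { return std::fopen(name, "w+b"); }
  inline size_t llio_write(llio_handle h, const void *buf, size_t len)
  { return std::fwrite(buf, 1, len, h); }
  inline void llio_close(llio_handle h) { std::fclose(h); }
  #else
  typedef int llio_handle;
  inline llio_handle llio_open(const char *name)
  { return open(name, O_RDWR | O_CREAT, 0644); }
  inline size_t llio_write(llio_handle h, const void *buf, size_t len)
  { return (size_t) write(h, buf, len); }
  inline void llio_close(llio_handle h) { close(h); }
  #endif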

The ChannelArchiver/Engine directory contains a simple benchmark program, bench.cpp. Depending on the I/O and debug settings, markedly different results were obtained:

  Machine             Settings   Values/sec written
  800 MHz, NT 4.0     fd         35000
  800 MHz, NT 4.0     FILE       20000
  500 MHz, RH 6.1     debug         60 (!)
  500 MHz, RH 6.1     FILE, -O   14500
  2x333 MHz, RAID-5   debug         42 (!)
  2x333 MHz, RAID-5   FILE, -O    8600
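
For orientation, the kind of measurement bench.cpp performs can be approximated by a minimal stand-in like the following (this is not the actual benchmark): write a large number of fixed-size records through stdio and report the achieved rate.

  // Not the actual bench.cpp, just a minimal stand-in: write fixed-size
  // records through stdio and report the achieved rate in values/sec.
  #include <chrono>
  #include <cstdio>

  int main()
  {
      const std::size_t num = 100000;   // number of samples to write
      char sample[16] = { 0 };          // stand-in for one archived value
      std::FILE *f = std::fopen("bench.tmp", "wb");
      if (!f)
          return 1;
      auto start = std::chrono::steady_clock::now();
      for (std::size_t i = 0; i < num; ++i)
          std::fwrite(sample, sizeof sample, 1, f);
      std::fclose(f);
      std::chrono::duration<double> secs =
          std::chrono::steady_clock::now() - start;
      std::printf("%g values/sec\n", num / secs.count());
      return 0;
  }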

In the Win32 case, raw fd I/O seems to be faster, and the debug settings have little influence.
For Linux, fd I/O is terribly slow in any case, as is FILE I/O without the optimizing -O flag. On Linux, only FILE I/O with optimization yields acceptable results.
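
In practice this means building the host tools optimized. As an example only (the exact syntax may differ between EPICS versions), the site configuration would contain something like:

  # In EPICS/base/config/CONFIG_SITE:
  HOST_OPT = YES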

This underlying I/O performance limits the number of values that the ArchiveEngine can handle.

Archive Engine Limits

Test: "Up to 10000 values per second"

The ArchiveEngine was started with this configuration file:

  #Archive channels of example CA server (excas)
  !file_size 10
  !write_period 10
  fred 1.0 Monitor
  freddy 1.0 Monitor
  janet 0.1 Monitor
  alan 1.0 Monitor
  jane0 0.1 Monitor
  jane1 0.1 Monitor
  jane2 0.1 Monitor
  # ... and so on until
  jane999 0.1 Monitor
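
The long run of "jane" lines need not be typed by hand; a small, purely illustrative helper can print them in the format used above:

  // Purely illustrative: print the "jane0" ... "jane999" configuration lines.
  #include <cstdio>

  int main()
  {
      for (int i = 0; i < 1000; ++i)
          std::printf("jane%d 0.1 Monitor\n", i);
      return 0;
  }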

These channels were served by the example CA server, running on a 233 MHz Linux machine, launched as:

excas -c1000

The "jane" channels change at about 10Hz. Together with the other channels, including the array "alan", this should provide at least 10000 values per second.

Both machines were on a 10/100BaseT hub, but the Linux box only supports 10BaseT.

Observed behaviour

This plot shows the CPU load on an 800 MHz PC (Windows NT 4.0) archiving about 10000 values per second:

[Image: cpu_10kpersec.gif]

While the machine is quite busy archiving, it does still respond to user input. It cannot be used for much else, though: additional load like launching and using Paint Shop Pro to create this CPU-load snapshot can cause delays, resulting in messages like
    "Warning: WriteThread called while busy"
Even more load can lead to "overwrite" errors (= data loss) and is to be avoided.

This image shows the limit for the slower Linux machine, for which bench.cpp reported about 7700 writes/second (233 MHz, RedHat 6.1):

[Image: ioc94_4kpersec.gif]

The setup is similar to the 10000 val/sec example but uses only 4000 values per second.

The next two images show a dual 233 MHz Pentium machine with a RAID-5 disk array archiving at different rates:

[Image: 2x233_5000persec.gif - 5000 values/sec]
[Image: 2x233_4000persec.gif - 4000 values/sec]

At 5000 values/sec, the write thread does not finish in time, so numerous "called while busy" warnings result. No overwrites were reported, but this rate can probably not be maintained. In the 4000 values/sec case, the write thread is called every 10 seconds and finishes in time, resulting in many brief peaks in CPU load.
(Point 1: The average CPU load is constant, but the individual CPUs are loaded randomly.
Point 2: The machine and disks might be faster than the previous test machine, and while the RAID array provides data safety and good read performance, the write performance is not better than for the older PC.)

So one could conclude that archiving rates of 10k values/sec are possible on a dedicated machine, except for the additional problem of ...

Channel Access Flow Control

The Archive Engine uses a dedicated write thread to balance the CPU load. It also tries to increase the default TCP buffer size in order to buffer incoming values while the program is busy writing samples to the disk.
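
The engine does this through the Channel Access library; as a generic illustration only, enlarging a socket's TCP receive buffer on a POSIX system looks roughly like this (not the engine's actual code):

  // Generic POSIX sketch of enlarging a socket's TCP receive buffer;
  // the ArchiveEngine itself works through the Channel Access library.
  #include <sys/socket.h>

  bool enlarge_rcvbuf(int sock, int bytes)
  {
      return setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                        &bytes, sizeof(bytes)) == 0;
  }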

The current Channel Access library, though, silently implements a flow control mechanism. It is not based on a TCP buffer "high-water mark" idea; instead, it can drop values when the "ca_poll" routine is called successfully several times, causing it to believe that it might get behind the CA server.

There are ways to detect losses due to flow control. When this was done, the results showed occasional flow-control losses at 10000 values per second, so the conclusion is that the current engine can archive close to 10000 values per second.

Behaviour when approaching the limit

When archiving more and more values per second, the write thread needs an increasing amount of time. The peaks in the CPU load snapshot show the write thread running. Because the write thread takes certain mutex semaphores, the engine's web server is delayed while writing. (If the web server request does not need to access the channel data, as in the case of the "stop" request, the server and write thread actually run concurrently, depending only on the scheduling as handled by the operating system.)

When even more values are received, the write thread cannot flush the buffers within the configured write period. If this happens rarely, it will simply result in "write thread called while busy" messages and no data loss. If the incoming data rate is too high, additional "overwrite" errors occur and data is lost.
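
The interplay of write period, "called while busy" warning, and overwrite can be sketched as follows; all names are invented and this is not the engine's code:

  // Names invented; not the engine's code. A periodic trigger attempts the
  // disk flush; if the previous flush is still running, it only warns.
  // Data is lost only when the in-memory buffers fill up and old samples
  // are overwritten before they could be written to disk.
  #include <cstdio>
  #include <mutex>

  static std::mutex flush_lock;      // held while samples go to disk

  static void flush_buffers()        // stand-in for writing buffered samples
  {
      // ... write all buffered samples to the archive ...
  }

  void write_trigger()               // invoked every write_period seconds
  {
      std::unique_lock<std::mutex> lock(flush_lock, std::try_to_lock);
      if (!lock.owns_lock())
      {
          std::printf("Warning: WriteThread called while busy\n");
          return;                    // a warning only; nothing lost yet
      }
      flush_buffers();
  }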

The web server always has to wait for the write thread to finish, so occasional delays in a browser request are expected. While approaching the load limit, this happens more and more often, up to the point where the web interface no longer responds because the write thread is constantly running and data is lost.

