Preliminary Aida Performance Testing Report

Last modified on
Tuesday, July 27, 2004 Robert C. Sass

Thursday, August 19, 2004
Update DaValue test to include toString extract.
Add Slc ref.get test

Friday, August 20, 2004
Add second batch test report.
Minor corrections.

1.0 The Goal

For Aida to be an interface used by High Level Applications to access diverse data, it must have "reasonable" performance. A series of tests was run getting different types and amounts of data, different numbers of clients and using different Aida interfaces. The timing results were plotted and analyzed using Matlab.

1.1 Considerations

A nearly limitless number of tests could be done varying the machines, client/server parameters, distribution of clients and servers, data types and client interfaces. To measure primarily the Aida performance alone, we have used test servers that provide canned data for most of the tests. This measures just the Aida performance without adding the variability associated with each of the currently implemented real data servers: EPICS, SLC and Oracle. In the end, short tests were run using the real EPICS and SLC data servers to see how they compare with the Aida-only performance.

2.0 The Approach

The test environment itself is also a significant variable. An ideal environment would be a set of machines and networks dedicated to the performance test. Alas this is not available so existing UNIX machines were used and multi-client tests were run in the wee hours of the morning as cron jobs. Hopefully owl shift didn't notice.

2.1 daPerfTest Program

An additional test program called daPerfTest was written in the Aida package/aida/test/java path. The aidatest command is used to run a single test at a time and it requires 3 parameters:
  1. Test number
  2. Matlab data file name (path defaults to ~rcs/Matlab)
  3. Comment string for this test run
E.g. aidatest daPerfTest 1 matlabfile.dat "Test client xyz"

Each test is run a fixed number of iterations that's defined internally and timing data is accumulated in a local array. After the test is complete the test type, date/time, comment string and the data are written to a file for Matlab analysis.
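For reference, the guts of each test look roughly like the sketch below. This is a minimal illustration of the pattern just described (timing loop, local array, append to the Matlab file); the class name, file layout and other details are mine, not the actual daPerfTest source.

    // Illustrative sketch only -- the real daPerfTest's class names, file format
    // and internals may differ.  Java 1.4 style; System.currentTimeMillis gives
    // the 1 ms granularity seen in the plots.
    import java.io.FileWriter;
    import java.io.PrintWriter;

    public class PerfLoopSketch
    {
        public static void main(String[] args) throws Exception
        {
            int testNum    = Integer.parseInt(args[0]);   // which test to run
            String matFile = args[1];                     // Matlab data file name
            String comment = args[2];                     // comment for this run

            int iterations = 900;                         // fixed internally per test
            long[] elapsed = new long[iterations];        // local timing array

            for (int i = 0; i < iterations; i++)
            {
                long start = System.currentTimeMillis();
                // ... issue one Aida get for test testNum here ...
                elapsed[i] = System.currentTimeMillis() - start;
            }

            // Append test type, date/time, comment and the raw times for Matlab.
            PrintWriter out = new PrintWriter(new FileWriter(matFile, true));
            out.println("% test " + testNum + " " + new java.util.Date() + " " + comment);
            for (int i = 0; i < iterations; i++)
                out.println(elapsed[i]);
            out.close();
        }
    }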

The file ~rcs/Matlab/AidaPerf.m is the Matlab analysis file. This has also been added to the test/java directory in the Aida package.

2.2 The Machines

The Aida DaNameServer, DaServer, DpTestServer and DpTestHistServer were all run on slcs2. DpTestServer and DpTestHistServer provide canned data directly and don't access additional data providers.

For the multi-client tests, nine clients ran the same series of tests simultaneously on nlcdev-opi01, nlcdev-opi02, nlcdev-opi03, nlcdev-opi04, slcs1, slcs6, mccas0, opi00gtw04 and opi00gtw05. They all have 100Mbit ethernet and have the following configurations:

Server:

slcs2:
System Configuration: Sun Microsystems sun4u Sun Fire V240
System clock frequency: 167 MHz
Memory size: 2GB

Clients:

nlcdev-opi01:
System Configuration: Sun Microsystems sun4u Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 333MHz)
System clock frequency: 111 MHz
Memory size: 256 Megabytes

nlcdev-opi02:
System Configuration: Sun Microsystems sun4u Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 270MHz)
System clock frequency: 90 MHz
Memory size: 128 Megabytes

nlcdev-opi03:
System Configuration: Sun Microsystems sun4u Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 270MHz)
System clock frequency: 90 MHz
Memory size: 128 Megabytes

nlcdev-opi04:
System Configuration: Sun Microsystems sun4u Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 270MHz)
System clock frequency: 90 MHz
Memory size: 256 Megabytes

slcs1:
System Configuration: Sun Microsystems sun4u Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 333MHz)
System clock frequency: 111 MHz
Memory size: 512 Megabytes

slcs6:
System Configuration: Sun Microsystems sun4u Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 400MHz)
System clock frequency: 100 MHz
Memory size: 384 Megabytes

mccas0:
System Configuration: Sun Microsystems sun4u Sun Fire V240
System clock frequency: 167 MHz
Memory size: 1GB

opi00gtw04:
System Configuration: Sun Microsystems sun4u Sun Fire V240
System clock frequency: 167 MHz
Memory size: 1GB

opi00gtw05:
System Configuration: Sun Microsystems sun4u Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 400MHz)
System clock frequency: 100 MHz
Memory size: 256 Megabytes

2.3 The Software

We're using Java 1.4.2 and Orbacus V4.1, a licensed CORBA package. All test clients were run on Sun UNIX machines; only the SLC data provider, of course, runs on VMS.

3.0 Single Client Results

We'll first do a series of tests with a single client to characterize the Aida behavior before moving on to the multi-client tests.

3.1 Simple Get results

Let's first look at a single client doing simple gets of a double and a long. This is the simplest and least efficient Aida interface: the user specifies a name and the data type to be returned, Aida looks up the name in the Oracle database, finds the correct server, calls the server and returns the data. Again, all servers ran on slcs2 and the test client on slcs6. Slcs6 is my development machine and, although somewhat slow (400 MHz), I control the load, so there were no other activities in progress. Each test was run twice with 900 iterations (I'll explain why below), the results plotted and simple statistics calculated.
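For orientation, the simple-get pattern on the client side looks roughly like the fragment below. The DaObject/DaValue class and method names and the query string are assumptions about the client API for illustration, not verified Aida signatures.

    // Simple get: the name and desired type are supplied on every call, so every
    // call pays for the name lookup.  Names below are illustrative only.
    DaObject da = new DaObject();
    for (int i = 0; i < 900; i++)
    {
        long t0 = System.currentTimeMillis();
        Double d = (Double) da.get("TEST//VAL", DaValue.Type.DOUBLE);
        long dt = System.currentTimeMillis() - t0;   // accumulate dt in the timing array
    }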


Figure 1: Simple da.get of Test Server Double


Figure 2: Simple da.get of Test Server Long

The first thing that stands out is the variability between requests. True we are going over the network but the minimum time is 1.0 millisecond (ms) which is pretty darn good and the mean time for both data types is ~2.5 ms which isn't too bad either. Even the standard deviation (one sigma) around 3 ms isn't all that bad considering what it's doing. Remember this includes a trip to Oracle each time so clearly Oracle itself does serious caching and buffering. What's not so good are the max time peaks which can be 50-80 times the minimum! These tests were run on Saturday so the Oracle server shouldn't be heavily loaded. Is it the JVM/garbage collection etc. that's causing the variation?

Before jumping to any conclusions, let's do the next logical thing and run the same tests using Aida's most efficient interface.

3.2 Direct to Server Get Results

Using this interface, the client creates a reference object which is then connected directly to the server. The overhead of name lookup and reference construction is only done once. After that, all communication is directly with the appropriate server through the reference object i.e. the client doesn't know what server is being invoked or where it's located. Again each test was run twice with 900 iterations.
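Schematically, the difference from the simple get is that the name lookup is hoisted out of the loop. Again, the class and method names below (DaReference, getDaReference, setType) are illustrative stand-ins for the actual Aida API, not verified signatures.

    // Reference get: one lookup builds the reference, then the loop talks
    // directly to the data provider through it.
    DaReference ref = da.getDaReference("TEST//VAL");   // name lookup + CORBA reference, done once
    ref.setType(DaValue.Type.DOUBLE);
    for (int i = 0; i < 900; i++)
    {
        long t0 = System.currentTimeMillis();
        Double d = (Double) ref.get();                  // direct call to the server
        long dt = System.currentTimeMillis() - t0;      // accumulate as before
    }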


Figure 3: ref.get of Test Server Double



Figure 4: ref.get of Test Server Long

The minimum times are about the same (we only have 1 ms granularity) even if we go directly to the server. However the minimum is now the same as the median and the mean time has dropped to ~1.6 ms. Slcs2 is a BaBar production machine and our servers do run at a standard (low?) priority, so we expect some variability on the server side. Still we see large maximum excursions. Furthermore, the test run of the get double in particular shows a remarkable similarity of the excursions between the 2 independent runs, suggesting some Java Virtual Machine (JVM) dependencies. Before we explore this further, let's look at the timings for getting more complex data as would normally be the case.

3.3 Get by Reference of DaValue and IDL Struct

The following tests are all done by reference directly to the server. The DaValue object is fairly heavy but can be used by a client who doesn't know the format of the data to be returned. This test returns 2 arrays: 11 time strings and 11 doubles.
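The call pattern for this test is roughly the fragment below. DaValue is real Aida terminology, but the method names are assumptions, and the toString call stands for the "toString extract" that is included in the timing.

    // DaValue get: a self-describing container, so the client need not know the
    // layout of the returned data in advance.  Illustrative names only.
    DaValue dv = ref.getDaValue();      // here: 11 time strings and 11 doubles
    String s = dv.toString();           // the toString extract, included in the timing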



Figure 5: ref.get DaValue of Test Server DaValue

Several things to note. Using the fairly heavy DaValue object to get 2 small arrays, the time goes up to a mean of ~4.5 ms with the (inefficient) toString extract, which still isn't too bad. Again we see a very similar set of excursions between the 2 runs. This can't be variations in the server; it now really looks like the JVM deciding when to do its garbage collection etc. If you do the same thing twice, you get the same results. At least that's encouraging.

Now let's do the IDL struct known to both client and server. In this case, the server returns a Corba Any, but it contains a known struct and there is an associated "helper" class that both the server and client use to easily put and get the data. This uses the test history server and again returns 2 arrays: 11 time strings and 11 doubles.
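With the IDL struct, both sides use the IDL-generated helper, so packing and unpacking the Any is one call each way. The Helper insert/extract pattern is the standard CORBA Java mapping; the struct name below (SampleHistData) is hypothetical, used only to show the shape of the code.

    // Server side: wrap the known struct in a CORBA Any via its generated helper.
    org.omg.CORBA.Any any = orb.create_any();
    SampleHistData data = new SampleHistData(times, values);   // 11 strings, 11 doubles
    SampleHistDataHelper.insert(any, data);

    // Client side: one helper call recovers the typed struct from the Any.
    SampleHistData back = SampleHistDataHelper.extract(any);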



Figure 6: ref.get Any of Test Hist Server

The mean and median time values are down to about 3 ms and again we see the remarkable parallel excursions between the two runs. Another item to note is that any startup time for the first request is also quite variable. It's ~50 ms for the first runs and < 20 ms for the second.

3.4 Large Arrays

In addition to timing, another important question is the ability of Aida to deliver large amounts of data without additional overhead except for the network. For this we had the test server return an arbitrary array of longs with the size specified by the client. BTW that is a very nice Aida feature whereby you can implement arbitrary parameters to direct server behavior.

We chose powers of 10 from 10 to 1,000,000 longs i.e. 4 Mbytes. We do 30 iterations of each of the 6 array lengths. Here's a semi log plot of the results.
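The sweep itself is just a nested loop over the six sizes, with the requested length passed as a parameter on the query. The setParam call and the parameter name are illustrative, not the actual Aida parameter syntax.

    // Sweep array sizes by powers of 10; the server allocates and returns that
    // many longs per request.  Parameter-passing syntax is a stand-in only.
    int[] sizes = { 10, 100, 1000, 10000, 100000, 1000000 };
    for (int s = 0; s < sizes.length; s++)
    {
        ref.setParam("SIZE", Integer.toString(sizes[s]));
        for (int i = 0; i < 30; i++)
        {
            long t0 = System.currentTimeMillis();
            long[] data = (long[]) ref.get();
            // record System.currentTimeMillis() - t0 ...
        }
    }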


Figure 7: ref.get of Longs Arrays from the Test Hist Server Semi Log Plot


Hmmm. The nice Matlab log tick marks disappeared in the translation to a jpeg image but the results are clear. The first 60 iterations, which are the 10- and 100-element arrays respectively, are indistinguishable. From 1K longs (4K bytes) up to 1,000,000 longs we're clearly just dependent on network bandwidth. The saw tooth behavior is presumably garbage collection and heap allocation, at least on the server, which does a "new" for each request. What's the real variation between times at the 4 Mbyte size? Here's a straight plot to show us.


Figure 8: ref.get of Longs Arrays from the Test Hist Server Linear Plot

It oscillates quite regularly between 4 and 12 seconds. Actually the times cycle through three levels, repeating every 3 requests, which is hard to see from the graphs. For the 4 Mbyte transfers the actual times in ms look like: 4173.0 8248.0 12354.0 4069.0 8144.0 12230.0 etc.

Garbage, garbage everywhere.

3.5 Back to the Variability

Before we go on to the multi-client tests, let's take a closer look at the amazing synchronous JVM. To eliminate network, production server and Oracle loads we'll move both client and test server onto my old-n-slow but trusty and unloaded (except for me) Sun development machine slcs6. We go up and down the TCP stack but not out on the wire. With Aida and Greg's fancy scripts and documentation it's pretty easy to make the switch. Here are the results.

Figure 9: ref.get 2000 Longs from the Test Server on Same Machine as Client

Wow. I boosted the iterations to 2000 and see a huge delay somewhere around 1000 iterations. The delays on the two runs also match up very nicely. Note too that the minimum turnaround is now 0 milliseconds and the mean is down around 1 ms. Without that huge excursion the turnaround would be well under one ms. Does the delay around iteration 1000 repeat if we do many iterations? Here's 20000.
Figure 10: ref.get 20000 Longs from the Test Server on Same Machine as Client

Hmmm. The more I test the more confused I get. Do I understand that huge excursion around iteration 1000? Is the JVM invoking the JIT compiler? Aha! That could very well be it. Since we're iterating in a small section of code it decided to take a little time and compile it and thereafter we only get occasional delays of 10-50 ms but no more 250 ms waits. It's why you can't use Java for real time. But you already knew that. Still, after the presumed compile, it runs pretty darn fast on my old machine except for the occasional presumed garbage collection.

So now you know why I chose 900 iterations for my standard tests. Because at 1000 I always got this huge delay and it screwed up my plots!

That does leave another intriguing question. What if we fudge the data and eliminate the first 1000 iterations to let the JVM get most of its optimization etc. out of the way? What do the results look like then? Seek and ye shall find.
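Operationally that just means a warm-up phase before the measured loop, something like the fragment below (using the same illustrative ref.get from the earlier sketches).

    // Let the JIT compile the hot path before recording any times.
    for (int i = 0; i < 1000; i++)
        ref.get();                                   // warm-up, results discarded

    long[] elapsed = new long[19000];                // measured phase
    for (int i = 0; i < 19000; i++)
    {
        long t0 = System.currentTimeMillis();
        ref.get();
        elapsed[i] = System.currentTimeMillis() - t0;
    }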



Figure 11: ref.get 19000 Longs from the Test Server on Same Machine as Client

Pretty impressive eh? While we still get a few excursions, the statistics look much better: a mean of 0.7 ms and a sigma of only 0.5 ms in the first run and 0.7 ms in the second. There is also much less parallelism between the two runs now. The lesson here is that once an application has been up for a while, it settles into a much more predictable mode that isn't dominated by JVM machinations.

3.6 SLC and EPICS Data Tests

Below are the plots to get a simple float using the reference object going directly to the Slc server. The two runs are quite similar in their statistics which is a good sign. The mean of ~5.5 ms is higher than for the test servers which might be expected since the SLC server is using the JNI interface to access the data via dbgets. Undoubtedly we can do some optimization here. A curious factor is the periodic excursions of ~20 ms that seem to happen about every 50 iterations which must contribute to the higher mean time. Previous tests showed patterns between runs but not so much regularity in the intervals between excursions. I have no idea as to the cause. On the good side, the minimum time is 2 ms which is about as good as it gets.


Figure 12: ref.get of Slc Float


Later.

Figure 13: ref.get of EPICS Double

4.0 Multi-Client Tests

There are at least two basic questions relating to multiple clients using Aida:
  1. Will Aida properly handle many multiple requests with some degraded performance but without errors or failures?
  2. What are the performance statistics associated with multiple clients and are they "reasonable"?

4.1 The First Approach

The initial set of multi-client tests used nine simultaneous clients on different machines, each of which sequentially ran the first 6 tests in the daPerfTest program. The client machines were nlcdev-opi01, opi02, opi03, opi04, slcs1, slcs6, mccas0, opi00gtw04 and opi00gtw05. Cron jobs were started at the same time for each of the clients and ran at 3:07 and 4:13 on Sunday morning. Prime numbers are good things. The tests were:
  1. Simple get double
  2. Simple get long
  3. Reference get double
  4. Reference get long
  5. Reference get of DaValue
  6. Reference get of an IDL struct via a Corba Any
I used trscrontab to schedule the two runs and didn't change the default priority used for cron jobs.

#
# First time
#
nlcdev-opi01 07 03 25 7 * /u/cd/rcs/Matlab/startdev01
nlcdev-opi02 07 03 25 7 * /u/cd/rcs/Matlab/startdev02
nlcdev-opi03 07 03 25 7 * /u/cd/rcs/Matlab/startdev03
nlcdev-opi04 07 03 25 7 * /u/cd/rcs/Matlab/startdev04
slcs1 07 03 25 7 * /u/cd/rcs/Matlab/startslcs1
slcs6 07 03 25 7 * /u/cd/rcs/Matlab/startslcs6
mccas0 07 03 25 7 * /u/cd/rcs/Matlab/startmccas0
opi00gtw04 07 03 25 7 * /u/cd/rcs/Matlab/startopi04
opi00gtw05 07 03 25 7 * /u/cd/rcs/Matlab/startopi05
#
# Second Time
#
nlcdev-opi01 13 04 25 7 * /u/cd/rcs/Matlab/startdev01
nlcdev-opi02 13 04 25 7 * /u/cd/rcs/Matlab/startdev02
nlcdev-opi03 13 04 25 7 * /u/cd/rcs/Matlab/startdev03
nlcdev-opi04 13 04 25 7 * /u/cd/rcs/Matlab/startdev04
slcs1 13 04 25 7 * /u/cd/rcs/Matlab/startslcs1
slcs6 13 04 25 7 * /u/cd/rcs/Matlab/startslcs6
mccas0 13 04 25 7 * /u/cd/rcs/Matlab/startmccas0
opi00gtw04 13 04 25 7 * /u/cd/rcs/Matlab/startopi04
opi00gtw05 13 04 25 7 * /u/cd/rcs/Matlab/startopi05

Figure 14: Trscrontab Schedule of Scripts to Execute

4.1.1 The First Test Results

All of the tests ran "concurrently" without a problem. Since the test server and a number of the clients are production machines, I didn't want to get a nasty phone call about resource consumption so all tests together take less than a minute to run standalone. All the machines are running ntp and synchronized with the SCS master time server. This should provide client accuracy within a millisecond. Indeed, one can nicely correlate the startup times of the first test in Table 1 with the speed of the CPU as detailed in The Machines section.

Table 1 below shows the actual start of each of the tests as recorded on each client machine. This time stamp is recorded after all connections to the servers have been made and is just before the start of the test loop. There are 900 iterations for each test so that's 5400 gets/client. The total time from the start of the first client to the finish (actually start of the last test) for the last client is just over 1 minute. That's 48,600 gets in 61 seconds. That's over 770 gets/second. Not too shabby at all. Each client gets its own instance of a server and slcs2 is a pretty fast machine so there's very little degradation in client response time at this load level.

As you can see from the times below, because machines mccas0 and opi00gtw04 are so much faster, they were done (mccas0) or almost done (opi00gtw04) before nlcdev-opi02 through opi04 even got started! For opi00gtw04, 7 other machines were still executing when it started its last test, but that's not enough overlap for a good measurement. The lesson here is that I was much too conservative in my multi-client experiment and need a longer test to allow all machines to compete. Aida running on the newer machines is just way too fast!

Client/Test #    1 get double   2 get long   3 get ref double   4 get ref long   5 get ref DaValue   6 get ref IDL Any
nlcdev-opi01     3:07:20        3:07:26      3:07:31            3:07:36          3:07:40             3:07:47
nlcdev-opi02     3:07:30        3:07:40      3:07:48            3:07:56          3:08:04             3:08:16
nlcdev-opi03     3:07:29        3:07:39      3:07:48            3:07:56          3:08:09             3:08:21
nlcdev-opi04     3:07:29        3:07:39      3:07:48            3:07:56          3:08:04             3:08:15
slcs1            3:07:27        3:07:33      3:07:40            3:07:45          3:07:51             3:07:58
slcs6            3:07:20        3:07:24      3:07:29            3:07:33          3:07:37             3:07:43
mccas0           3:07:15        3:07:17      3:07:20            3:07:22          3:07:23             3:07:26
opi00gtw04       3:07:16        3:07:21      3:07:25            3:07:27          3:07:29             3:07:34
opi00gtw05       3:07:24        3:07:29      3:07:35            3:07:38          3:07:42             3:07:47

Table 1: Start Times for First Simultaneous Run on Sunday Morning, July 25, 2004 at 3:07 AM

I have 54 Matlab data files (6 tests * 9 machines) and it's hard to know how to graph the results in a meaningful way, so it's clear I need a simpler approach to properly analyze the data. For this first pass I'll pick just one test: the IDL get through an Any by reference. This returns 2 small arrays and would be a common mechanism to efficiently access frequently acquired structured data. I plotted and compared the concurrent with the standalone results on the same machine, opi00gtw04, since for this test it had the most overlap with other machines still executing.


Figure 15: ref.get Any Cron Job with Simultaneous Clients


Figure 16: ref.get Any Invoked Manually

Figure 15 above is the run with other more-or-less simultaneous clients and Figure 16 is a standalone run by itself. The mean for the standalone tests was ~2 ms as opposed to ~2.5 ms for the concurrent test. The standard deviation went from ~4 ms standalone to ~7 ms for the multi-client run. This is what one would expect in a contention situation. Still, the increase in mean response time for the multi-client situation was quite modest. Interestingly, there was considerable variation between the 2 standalone runs, with sigmas of 2.9 and 3.9 ms. For the second pass I'll also run the standalone test in the wee hours rather than during the day when demand may be higher.

The improved multi-client test should be more revealing.

4.2 The Second Approach

A separate batch test was written and run with 5000 iterations so that for a time all clients were operating simultaneously. This test did 2 reference gets, one to the test server and one to the testhist server, so as to alternate between the software servers on the slcs2 server machine. Running the test manually to get the data and time the results, I found that the fastest machines (mccas0 and opi00gtw04) ran the test in about 15 seconds while the slowest machines (nlcdev-opi01 to 04) took 52 seconds to run the same test. Therefore, to get maximum overlap, I'll compare standalone vs. multi-client runs only for one of the slowest machines.
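Sketched with the same illustrative names as before, the batch test body is essentially the fragment below (assuming each iteration hits both servers in turn; the query strings are placeholders).

    // Alternate between the two software servers on slcs2.
    DaReference testRef = da.getDaReference("TEST//VAL");       // DpTestServer
    DaReference histRef = da.getDaReference("TESTHIST//VAL");   // DpTestHistServer
    for (int i = 0; i < 5000; i++)
    {
        testRef.get();                  // time each call and accumulate, as in the other tests
        histRef.get();
    }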

4.2.1 The Second Test Results

Both the manual and batch/concurrent runs were done during the noon hour. Let's first look at the manual run done on nlcdev-opi02. The statistics for the two runs are nearly identical, with means and sigmas essentially the same. As good as it gets.


Figure 17: Nlcdev-opi02 Multi-Client Test Invoked Standalone

Hmmm. The first multi-client run (Figure 18 below) shows a higher mean and much higher sigma. The second run is only slightly higher than the standalone, without the large group of higher delays around 2000 iterations. What can I say except that 9 simultaneous clients doing minimal gets show a slight degradation in overall response, as expected.

Figure 18: Nlcdev-opi02 Concurrent Multi-Client Test

Just to be sure, let's look at the statistics for the standalone and multi-client results for one of the fastest machines, mccas0. As you can see from the figures below, the statistics are slightly worse for the multi-client test but not dramatically so.


Figure 19: Mccas0 Multi-Client Test Invoked Standalone


Figure 20: Mccas0 Concurrent Multi-Client Test

5.0 Conclusions

You already knew this but the plots in the Single Client Results section clearly show that:
  1. Standard Java is not a real-time environment.
This doesn't necessarily mean that it's not "real-fast" on the average, it just means that it's not deterministic. Hence the Real-Time Java working group that has been at it for some time.

What is surprising is the reproducibility of the execution of a given set of code in the same environment. Actually I guess it's not so surprising, just unexpected.
  2. Large Aida data transfers above ~1 Kbyte take time proportional to the data size, with oscillating times probably due to garbage collection and heap allocation.
Referring to the plot in Figure 8 we see that the minimum time to transfer 4Mbytes (1 Mlong) is about 4 seconds. This is ~1 Mbyte/second. Though it's hard to see from the plot, the raw data shows that the minimum transfer time for 400K bytes is about 0.4 seconds so our "time proportional to data size" conjecture holds. As stated above, for transfer sizes below ~1Kbyte the network time is too small to stand out above the normal variation in processing time within Java and between machines.
  3. The overall Aida/Java performance is quite good.
In general you can see from many of the tests in the Single Client Results and the Multi Client Tests that overall response time is quite good even though it can be highly variable, especially near the program startup. Furthermore, Figure 11 shows that the runtime performance, while still not real-time can be quite stable in a long-running Java program that repeatedly executes the same set of code.
  4. Multi-client tests showed that Aida/Java was stable and scaled with multiple clients.
Two separate multi-client tests were done and both revealed that there was only a modest decline in the response statistics when compared with standalone runs. Aida easily handled 9 simultaneous clients with little degradation in throughput. Orbacus is currently configured so that each new client gets a new instance of the server though other options are available for very large numbers of clients.

These tests revealed no fundamental performance reason why Aida could not be used as efficient, scalable middleware to service the data needs of a variety of applications.