Preliminary Aida Performance
Testing Report
Last modified on
Tuesday, July 27, 2004 Robert C. Sass
Thursday, August 19, 2004
Update DaValue test to include toString extract.
Add Slc ref.get test
Friday, August 20, 2004
Add second batch test report.
Minor corrections.
1.0 The Goal
For Aida to be an interface used by High Level
Applications to access diverse data, it must have "reasonable"
performance. A series of tests was run getting different types
and amounts of data, different numbers of clients and using different
Aida interfaces. The timing results were plotted and analyzed using
Matlab.
1.1 Considerations
A nearly limitless number of tests could be
done varying the machines, client/server parameters, distribution
of clients and servers, data types and client interfaces. To isolate the
performance of Aida itself, we used test servers that provide canned
data for most of the tests. This measures just the Aida performance
without adding the variability associated with each of the currently
implemented real data servers: EPICS, SLC and Oracle. In the end, short
tests were run using the real EPICS and SLC data servers to see how
they compare with the Aida-only performance.
2.0 The Approach
The test environment itself is also
a significant variable. An ideal environment would be a set of
machines and networks dedicated to the performance test. Alas this is not available so existing UNIX machines were
used and multi-client tests were run in the wee hours of the morning
as cron jobs. Hopefully owl shift didn't notice.
2.1 daPerfTest Program
An additional test program called daPerfTest was written
in the package/aida/test/java path of the Aida package. The aidatest
command is used to run a single test at a time and requires 3 parameters:
- Test number
- Matlab data file name (path defaults to ~rcs/Matlab)
- Comment string for this test run
E.g. aidatest daPerfTest 1 matlabfile.dat "Test client xyz"
Each test is run for a fixed number of iterations that's
defined internally, and timing data is accumulated in a local array.
After the test is complete, the test type, date/time, comment string
and the data are written to a file for Matlab analysis.
The file ~rcs/Matlab/AidaPerf.m is the Matlab analysis
file. This has also been added to the test/java directory in the
Aida package.
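As a sketch of the timing loop just described (written in modern Java for brevity, though the 2004 program ran under Java 1.4, and with illustrative class and method names rather than the actual daPerfTest code):

```java
import java.io.FileWriter;
import java.io.IOException;

// Illustrative sketch of a per-iteration timing loop like the one in
// daPerfTest: run a task N times, record each elapsed time in a local
// array, then append the results to a file for Matlab analysis.
public class TimingLoop {

    // Time each of `iterations` calls in milliseconds. Note the 1 ms
    // granularity of System.currentTimeMillis(), visible in the plots.
    public static long[] timeIterations(Runnable task, int iterations) {
        long[] times = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.currentTimeMillis();
            task.run();
            times[i] = System.currentTimeMillis() - start;
        }
        return times;
    }

    public static double mean(long[] t) {
        double sum = 0.0;
        for (long v : t) sum += v;
        return sum / t.length;
    }

    // One-sigma (population) standard deviation, as quoted in the text.
    public static double sigma(long[] t) {
        double m = mean(t), ss = 0.0;
        for (long v : t) ss += (v - m) * (v - m);
        return Math.sqrt(ss / t.length);
    }

    // Append a comment line and the raw data in a form Matlab can load.
    public static void writeMatlabFile(String path, String comment,
                                       long[] t) throws IOException {
        try (FileWriter w = new FileWriter(path, true)) {
            w.write("% " + comment + "\n");
            for (long v : t) w.write(v + "\n");
        }
    }
}
```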
2.2 The Machines
The Aida DaNameServer, DaServer, DpTestServer and DpTestHistServer
were all run on slcs2. DpTestServer and DpTestHistServer
provide canned data directly and don't access additional data providers.
For the multi-client tests, nine clients ran the same
series of tests simultaneously on nlcdev-opi01, nlcdev-opi02, nlcdev-opi03,
nlcdev-opi04, slcs1, slcs6, mccas0, opi00gtw04 and opi00gtw05. They
all have 100Mbit ethernet and have the following configurations:
Server:
slcs2:
System Configuration: Sun Microsystems sun4u Sun Fire V240
System clock frequency: 167 MHz
Memory size: 2GB
Clients:
nlcdev-opi01 :
System Configuration: Sun Microsystems sun4u
Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 333MHz)
System clock frequency: 111 MHz
Memory size: 256 Megabytes
nlcdev-opi02:
System Configuration: Sun Microsystems sun4u
Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 270MHz)
System clock frequency: 90 MHz
Memory size: 128 Megabytes
nlcdev-opi03:
System Configuration: Sun Microsystems sun4u
Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 270MHz)
System clock frequency: 90 MHz
Memory size: 128 Megabytes
nlcdev-opi04:
System Configuration: Sun Microsystems sun4u
Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 270MHz)
System clock frequency: 90 MHz
Memory size: 256 Megabytes
slcs1:
System Configuration: Sun Microsystems sun4u
Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 333MHz)
System clock frequency: 111 MHz
Memory size: 512 Megabytes
slcs6:
System Configuration: Sun Microsystems sun4u
Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 400MHz)
System clock frequency: 100 MHz
Memory size: 384 Megabytes
mccas0:
System Configuration: Sun Microsystems sun4u
Sun Fire V240
System clock frequency: 167 MHz
Memory size: 1GB
opi00gtw04:
System Configuration: Sun Microsystems sun4u Sun Fire V240
System clock frequency: 167 MHz
Memory size: 1GB
opi00gtw05:
System Configuration: Sun Microsystems sun4u
Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 400MHz)
System clock frequency: 100 MHz
Memory size: 256 Megabytes
2.3 The Software
We're using Java 1.4.2 and the licensed Orbacus V4.1 Corba package.
All test clients were run on Sun UNIX machines. Only the SLC
data provider, of course, runs on VMS.
3.0 Single Client Results
We'll first do a series of tests with a single client to characterize
the Aida behavior before moving on to the multi-client tests.
3.1 Simple Get results
Let's first look at a single client doing simple gets
of a double and a long. This is the simplest and least efficient Aida
interface where the user specifies a name and data type to be returned.
Aida looks up the name in the Oracle database, finds the correct server,
calls the server and returns the data. Again, all servers ran on slcs2
and the test client on slcs6. slcs6 is my development machine and,
although somewhat slow (400 MHz), I control the load so there are no
other activities in progress. Each test was run twice with 900 iterations
(I'll explain why below), the results plotted and simple statistics calculated.
Figure 1: Simple da.get of Test
Server Double
Figure 2: Simple da.get of Test
Server Long
The first thing that stands out is the variability between
requests. True we are going over the network but the minimum time
is 1.0 millisecond (ms) which is pretty darn good and the mean time
for both data types is ~2.5 ms which isn't too bad either. Even the standard
deviation (one sigma) around 3 ms isn't all that bad considering what
it's doing. Remember this includes a trip to Oracle each time so clearly
Oracle itself does serious caching and buffering. What's not so good
are the max time peaks which can be 50-80 times the minimum! These tests
were run on Saturday so the Oracle server shouldn't be heavily loaded.
Is it the JVM/garbage collection etc. that's causing the variation?
Before jumping to any conclusions, let's do the next logical
thing and run the same tests using Aida's most efficient interface.
3.2 Direct to Server Get Results
Using this interface, the client creates a reference object
which is then connected directly to the server. The overhead of
name lookup and reference construction is done only once. After that,
all communication is directly with the appropriate server through
the reference object, i.e. the client doesn't know what server is being
invoked or where it's located. Again, each test was run twice with 900
iterations.
Figure 3: ref.get of Test Server
Double
Figure 4: ref.get of Test
Server Long
The minimum times are about the same (we only have 1 ms
granularity) even if we go directly to the server. However the minimum
is now the same as the median and the mean time has dropped to ~1.6
ms. Slcs2 is a BaBar production machine and our servers do run at a
standard (low?) priority so we expect some variability on the server
side. Still we see large maximum excursions. Furthermore, the test
run of the get double in particular shows a remarkable similarity of
the excursions between the 2 independent runs, suggesting some Java Virtual
Machine (JVM) dependencies. Before we explore this further, let's look
at the timings for getting more complex data as would normally be the
case.
3.3 Get by Reference of DaValue and IDL Struct
The following tests are all done by reference directly
to the server. The DaValue object is fairly heavy but can be used
by a client who doesn't know the format of the data to be returned.
This test returns 2 arrays; 11 time strings and 11 doubles.
Figure 5: ref.get DaValue
of Test Server DaValue
Several things to note. Using the fairly heavy DaValue
object to get 2 small arrays, the time goes up to a mean of ~4.5 ms
with the inefficient toString extract, which still isn't too bad.
Again we see a very similar set of excursions between the 2 runs. This
can't be variations in the server but now really looks like the JVM
deciding when to do its garbage collection etc. If you do the same thing
twice, you get the same results. At least that's encouraging.
Now let's do the IDL struct known to client and server.
In this case, the server returns a Corba Any but it contains a known
struct and there is an associated "helper" function that both the server
and client use to easily put and get the data. This uses the test history
server and again returns 2 arrays; 11 time strings and 11 doubles.
Figure 6: ref.get Any of
Test Hist Server
The mean and median time values are down to about 3 ms
and again we see the remarkable parallel excursions between the two
runs. Another item to note is that the startup time for the first
request is also quite variable. It's ~50 ms for the first runs and
< 20 ms for the second.
3.4 Large Arrays
In addition to timing, another important question is the
ability of Aida to deliver large amounts of data without additional
overhead except for the network. For this we had the test server return
an arbitrary array of longs with the size specified by the client. BTW
that is a very nice Aida feature whereby you can implement arbitrary
parameters to direct server behavior.
We chose powers of 10 from 10 to 1,000,000 longs, i.e. up to
4 Mbytes. We do 30 iterations of each of the 6 array lengths. Here's
a semi-log plot of the results.
Figure 7: ref.get of Longs
Arrays from the Test Hist Server Semi Log Plot
Hmmm. The nice Matlab log tick marks disappeared in the
translation to a jpeg image but the results are clear. The first
60 iterations, which are the 10 and 100 long arrays respectively, are
indistinguishable. From 1K longs (4K bytes) up to 1M longs we're clearly
just dependent on network bandwidth. The sawtooth behavior is presumably
garbage collection and heap allocation, at least on the server, which
does a "new" for each request. What's the real variation between times
at the 4 Mbyte size? Here's a straight plot to show us.
Figure 8: ref.get of Longs Arrays from the Test Hist Server Linear Plot
It cycles quite regularly between 4 and 12 seconds.
Actually the times climb in a repeating 3-request pattern, which is
hard to see from the graphs. For the 4 Mbyte transfers the
actual times in ms look like: 4173.0 8248.0 12354.0 4069.0 8144.0 12230.0
etc.
Garbage, garbage everywhere.
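As a quick check on that 3-request pattern, take successive differences of the quoted times: each slower transfer picks up roughly one extra ~4.1 second unit before the time drops back. A small sketch (the class name is just for illustration):

```java
// Successive differences of the 4 Mbyte transfer times quoted above,
// in ms. The two rising steps are each ~4.1 s, then the time falls
// back, giving the regular 3-request cycle seen in Figure 8.
public class TransferDiffs {
    public static long[] diffs(long[] times) {
        long[] d = new long[times.length - 1];
        for (int i = 1; i < times.length; i++) {
            d[i - 1] = times[i] - times[i - 1];
        }
        return d;
    }

    public static void main(String[] args) {
        long[] times = {4173, 8248, 12354, 4069, 8144, 12230};
        for (long d : diffs(times)) {
            System.out.println(d); // 4075, 4106, -8285, 4075, 4086
        }
    }
}
```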
3.5 Back to the Variability
Before we go on to the multi-client tests, let's take a closer
look at the amazing synchronous JVM. To eliminate network, production
server and Oracle loads we'll move both client and test server onto
my old-n-slow but trusty and unloaded (except for me) Sun development
machine slcs6. We go up and down the TCP stack but not out on the wire.
With Aida and Greg's fancy scripts and documentation it's pretty easy
to make the switch. Here are the results.
Figure 9: ref.get 2000 Longs
from the Test Server on Same Machine as Client
Wow. I boosted the iterations to 2000 and see a huge delay
somewhere around 1000 iterations. The delays on the two runs also
match up very nicely. Note too that the minimum turnaround is now
0 milliseconds and the mean is down around 1. Without that huge excursion
the turnaround would be well under one ms. Does the 1000-iteration delay
repeat if we do many iterations? Here's 20000.
Figure 10: ref.get 20000 Longs
from the Test Server on Same Machine as Client
Hmmm. The more I test the more confused I get. Do I understand
that huge excursion around iteration 1000? Is the JVM invoking the
JIT compiler? Aha! That could very well be it. Since we're iterating
in a small section of code it decided to take a little time and compile
it and thereafter we only get occasional delays of 10-50 ms but no more
250 ms waits. It's why you can't use Java for real time. But you already
knew that. Still, after the presumed compile, it runs pretty darn fast
on my old machine except for the occasional presumed garbage collection.
So now you know why I chose 900 iterations for my standard
tests. Because at 1000 I always got this huge delay and it screwed
up my plots!
That does leave another intriguing question. What if we fudge
the data and eliminate the first 1000 iterations to let the JVM get
most of its optimization etc. out of the way? What do the results look
like then? Seek and ye shall find.
Figure 11: ref.get 19000 Longs from the Test Server on Same Machine as Client
Pretty impressive, eh? While we still get a few excursions,
the statistics look much better. A mean of 0.7 ms and a sigma of only
0.5 in the first run and 0.7 in the second. There is also now much less
parallelism between the two runs. The lesson here is that once an
application has been up for a while, it settles into a much more
predictable mode that isn't dominated by JVM machinations.
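The fudge used for Figure 11 amounts to discarding the first 1000 warm-up samples (JIT compilation, class loading) before computing the statistics. A minimal sketch with illustrative names:

```java
import java.util.Arrays;

// Drop the first `warmup` samples before computing statistics so the
// JIT compile spike around iteration 1000 doesn't dominate the plot.
public class WarmupFilter {
    public static long[] dropWarmup(long[] samples, int warmup) {
        if (warmup >= samples.length) return new long[0];
        return Arrays.copyOfRange(samples, warmup, samples.length);
    }

    public static double mean(long[] t) {
        double sum = 0.0;
        for (long v : t) sum += v;
        return sum / t.length;
    }
}
```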
3.6 SLC and EPICS Data Tests
Below are the plots to get a simple float using the reference object
going directly to the Slc server. The two runs are quite similar in their
statistics which is a good sign. The mean of ~5.5 ms is higher than for
the test servers which might be expected since the SLC server is using
the JNI interface to access the data via dbgets. Undoubtedly we can do
some optimization here. A curious factor is the periodic excursions of
~20 ms that seem to happen about every 50 iterations which must contribute
to the higher mean time. Previous tests showed patterns between runs but
not so much regularity in the intervals between excursions. I have no idea
as to the cause. On the good side, the minimum time is 2 ms which is about
as good as it gets.
Figure 12: ref.get of Slc
Float
Later.
Figure 13: ref.get of EPICS
Double
4.0 Multi-Client Tests
There are at least two basic questions relating to multiple
clients using Aida:
- Will Aida properly handle many multiple requests with
some degraded performance but without errors or failures?
- What are the performance statistics associated with
multiple clients and are they "reasonable"?
4.1 The First Approach
The initial set of multi-client tests used nine simultaneous
clients on different machines, each of which sequentially ran the first
6 tests in the DaAidaPerf test program. The client machines were nlcdev-opi01,
opi02, opi03, opi04, slcs1, slcs6, mccas0, opi00gtw04 and opi00gtw05.
Cron jobs were started at the same time for each of the clients and
ran at 3:07 and 4:13 on Sunday morning. Prime numbers are good things.
The tests were:
- Simple get double
- Simple get long
- Reference get double
- Reference get long
- Reference get of DaValue
- Reference get an IDL struct via a Corba Any.
I used trscrontab to schedule the two runs and didn't change the
default priority used for cron jobs.
#
# First time
#
nlcdev-opi01 07 03 25 7 * /u/cd/rcs/Matlab/startdev01
nlcdev-opi02 07 03 25 7 * /u/cd/rcs/Matlab/startdev02
nlcdev-opi03 07 03 25 7 * /u/cd/rcs/Matlab/startdev03
nlcdev-opi04 07 03 25 7 * /u/cd/rcs/Matlab/startdev04
slcs1 07 03 25 7 * /u/cd/rcs/Matlab/startslcs1
slcs6 07 03 25 7 * /u/cd/rcs/Matlab/startslcs6
mccas0 07 03 25 7 * /u/cd/rcs/Matlab/startmccas0
opi00gtw04 07 03 25 7 * /u/cd/rcs/Matlab/startopi04
opi00gtw05 07 03 25 7 * /u/cd/rcs/Matlab/startopi05
#
# Second Time
#
nlcdev-opi01 13 04 25 7 * /u/cd/rcs/Matlab/startdev01
nlcdev-opi02 13 04 25 7 * /u/cd/rcs/Matlab/startdev02
nlcdev-opi03 13 04 25 7 * /u/cd/rcs/Matlab/startdev03
nlcdev-opi04 13 04 25 7 * /u/cd/rcs/Matlab/startdev04
slcs1 13 04 25 7 * /u/cd/rcs/Matlab/startslcs1
slcs6 13 04 25 7 * /u/cd/rcs/Matlab/startslcs6
mccas0 13 04 25 7 * /u/cd/rcs/Matlab/startmccas0
opi00gtw04 13 04 25 7 * /u/cd/rcs/Matlab/startopi04
opi00gtw05 13 04 25 7 * /u/cd/rcs/Matlab/startopi05
Figure 14: Trscrontab Schedule
of Scripts to Execute
4.1.1 The First Test Results
All of the tests ran "concurrently" without a problem. Since
the test server and a number of the clients are production machines,
I didn't want to get a nasty phone call about resource consumption
so all tests together take less than a minute to run standalone. All the machines are running ntp and synchronized with the
SCS master time server. This should provide client accuracy within a
millisecond. Indeed, one can nicely correlate the startup
times of the first test in Table 1 with the speed
of the CPU as detailed in The Machines section.
Table 1 below shows the actual start
of each of the tests as recorded on each client machine. This time stamp
is recorded after all connections to the servers have been made and
is just before the start of the test loop. There are 900 iterations
for each test so that's 5400 gets/client. The total time from the start
of the first client to the finish (actually start of the last test) for
the last client is just over 1 minute. That's 48,600 gets in 61 seconds.
That's over 770 gets/second. Not too shabby at all. Each client gets
its own instance of a server and slcs2 is a pretty fast machine so there's
very little degradation in client response time at this load level.
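The aggregate throughput arithmetic works out as follows:

```java
// Aggregate throughput for the first multi-client run: 9 clients each
// running 6 tests of 900 iterations over a 61 second span.
public class Throughput {
    public static int totalGets(int clients, int tests, int iterations) {
        return clients * tests * iterations;
    }

    public static double getsPerSecond(int totalGets, int seconds) {
        return (double) totalGets / seconds;
    }

    public static void main(String[] args) {
        int total = totalGets(9, 6, 900); // 48,600 gets
        System.out.println(total + " gets in 61 s = "
                + Math.round(getsPerSecond(total, 61)) + " gets/s");
    }
}
```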
As you can see from the times below, because machines mccas0
and opi00gtw04 are so much faster they were done (mccas0) or almost done
(opi00gtw04) before nlcdev-opi02 through 04 even got started! For opi00gtw04,
7 other machines were still executing when it started its last test but
that's not enough for a good measurement. The lesson here is that I was
much too conservative in my multi-client experiment and need a longer test
to allow all machines to compete. Aida running on the newer machines is
just way too fast!
Client/Test # | 1 get double | 2 get long | 3 get ref double | 4 get ref long | 5 get ref DaValue | 6 get ref IDL Any
nlcdev-opi01  | 3:07:20      | 3:07:26    | 3:07:31          | 3:07:36        | 3:07:40           | 3:07:47
nlcdev-opi02  | 3:07:30      | 3:07:40    | 3:07:48          | 3:07:56        | 3:08:04           | 3:08:16
nlcdev-opi03  | 3:07:29      | 3:07:39    | 3:07:48          | 3:07:56        | 3:08:09           | 3:08:21
nlcdev-opi04  | 3:07:29      | 3:07:39    | 3:07:48          | 3:07:56        | 3:08:04           | 3:08:15
slcs1         | 3:07:27      | 3:07:33    | 3:07:40          | 3:07:45        | 3:07:51           | 3:07:58
slcs6         | 3:07:20      | 3:07:24    | 3:07:29          | 3:07:33        | 3:07:37           | 3:07:43
mccas0        | 3:07:15      | 3:07:17    | 3:07:20          | 3:07:22        | 3:07:23           | 3:07:26
opi00gtw04    | 3:07:16      | 3:07:21    | 3:07:25          | 3:07:27        | 3:07:29           | 3:07:34
opi00gtw05    | 3:07:24      | 3:07:29    | 3:07:35          | 3:07:38        | 3:07:42           | 3:07:47
Table 1: Start Times for First Simultaneous Run on Sunday Morning,
July 25, 2004 at 3:07 AM
I have 54 Matlab data files (6 tests * 9 machines) and it's
hard to know how to graph the results in a meaningful way, so it's
clear I need a simpler approach to properly analyze the data. For now
I'll pick just one test, the IDL get through an Any by reference.
This returns 2 small arrays and would be a common mechanism to efficiently
access frequently acquired structured data. I plotted and compared the
concurrent with the standalone results on the same machine, opi00gtw04,
since for this test it had the most overlap with other machines still executing.
Figure 15: ref.get Any Cron
Job with Simultaneous Clients
Figure 16: ref.get Any Invoked
Manually
Figure 15 above is the run with other more-or-less simultaneous
clients and Figure 16 is a standalone run by itself. The mean for
the standalone tests was ~2 ms as opposed to ~ 2.5 ms for the concurrent
test. The standard deviation went from ~4 ms standalone to ~7 for the
multi-client. This is what one would expect in a contention situation.
Still, the increase in mean response time for the multi-client situation
was quite modest. Interestingly, there was considerable variation between
the 2 standalone runs, with sigmas of 2.9 and 3.9. For the second pass I'll
also run the standalone test in the wee hours rather than during the day
when demand may be higher.
The improved multi-client test should be more revealing.
4.2 The Second Approach
A separate batch test was written and run with 5000 iterations so that
for a time all clients were operating simultaneously. This test did 2 reference
gets; one to the test server and one to the testhist server to alternate
software servers on the slcs2 server machine. Running the test manually
to get the data and time the results, I found that the fastest machines (mccas0 and opi00gtw04) ran the test in about 15 seconds
while the slowest machines (nlcdev-opi01 to 04)
took 52 seconds to run the same test. Therefore to get maximum overlap
I'll only compare one of the slowest machines standalone vs. multi-client
runs.
4.2.1 The Second Test Results
Both the manual and batch/concurrent runs were done during the
noon hour. Let's first look at the manual run done on nlcdev-opi02. The
statistics for the two runs are nearly identical; mean and sigma are
almost the same. As good as it gets.
Figure 17: Nlcdev-opi02 Multi-Client
Test Invoked Standalone
Hmmm. The first multi-client run shows a higher mean and much higher
sigma. The second run is only slightly higher than the standalone without
the large group of higher delays around 2000 iterations. What can I say
except that 9 simultaneous clients doing minimal gets shows a slight degradation
in overall response as expected.
Figure 18: Nlcdev-opi02 Concurrent
Multi-Client Test
Just to be sure, let's look at the statistics for the standalone and
multi-client results for one of the fastest machines, mccas0. As you can
see from the figures
below, the statistics are slightly worse for the multi-client test but not
dramatically so.
Figure 19: Mccas0 Multi-Client
Test Invoked Standalone
Figure 20: Mccas0 Concurrent
Multi-Client Test
5.0 Conclusions
You already knew this but the plots in the Single Client Results section clearly
show that:
- Standard Java
is not a real-time environment.
This doesn't necessarily mean that it's not "real-fast"
on the average, it just means that it's not deterministic. Hence the
Real-Time Java working group that has been at it for some time.
What is surprising is the reproducibility of the execution
of a given set of code in the same environment. Actually I guess it's
not so surprising, just unexpected.
- Large Aida data transfers above ~1Kbyte
are proportional to the data size with oscillating times, probably
due to garbage collection and heap allocation.
Referring to the plot in Figure 8 we see that the minimum time to transfer
4Mbytes (1 Mlong) is about 4 seconds. This is ~1 Mbyte/second. Though
it's hard to see from the plot, the raw data shows that the minimum transfer
time for 400K bytes is about 0.4 seconds so our "time proportional to
data size" conjecture holds. As stated above, for transfer sizes below
~1Kbyte the network time is too small to stand out above the normal variation
in processing time within Java and between machines.
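The proportionality check is simple arithmetic:

```java
// Check the "time proportional to data size" conjecture from Figure 8:
// 4 Mbytes in ~4 s and 0.4 Mbytes in ~0.4 s both give ~1 Mbyte/second.
public class Bandwidth {
    public static double mbytesPerSecond(double mbytes, double seconds) {
        return mbytes / seconds;
    }

    public static void main(String[] args) {
        System.out.println(mbytesPerSecond(4.0, 4.0));  // 1.0
        System.out.println(mbytesPerSecond(0.4, 0.4));  // 1.0
    }
}
```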
- The overall Aida/Java performance is quite
good.
In general you can see from many of the tests in the Single Client Results and the Multi Client Tests that overall response
time is quite good even though it can be highly variable, especially
near program startup. Furthermore, Figure 11
shows that the runtime performance, while still not real-time, can be
quite stable in a long-running Java program that repeatedly executes
the same set of code.
- Multi-Client Tests Showed that Aida/Java
was Stable and Scaled with Multiple Clients.
Two separate multi-client tests were done and both revealed that there
was only a modest decline in the response statistics when compared with
standalone runs. Aida easily handled 9 simultaneous clients with little
degradation in throughput. Orbacus is currently configured so that each
new client gets a new instance of the server though other options are
available for very large numbers of clients.
These tests revealed no fundamental performance
reason why Aida could not be used as efficient, scalable middleware
to service the data needs of a variety of applications.