Thursday, 24 January 2013

DataSnap, RO, RTC, mORMot, WCF, Node speed test

After reading the "DataSnap test" blog article, I wanted to do some extra tests: RemObjects SDK (RO) and the effect of ScaleMM2. I also got a server build from RealThinClient (RTC) for testing.
To get a reference point, I downloaded the test servers and JMeter 2.8 and ran them on my PC. After that I got a RemObjects server working with JMeter, so I could compare the results with the other solutions. I also made a "plain Indy" server to mimic the plain Node.js test (to see what Indy 10 is capable of).

Test setup

I tested on a single PC: a quad-core Windows 7 machine with 8 GB of memory. I used JMeter 2.8 with 50 threads and 1000 requests per thread. I configured all test servers to use a thread pool of 50 (whenever this was possible).
It is a rather rough and quick test, with no averaging over 3 runs etc.: it is only meant to give a quick indication of the performance. I also did not monitor memory usage, because most test servers had a low usage of about 6 to 9 MB (and memory is cheap :) ). Only the ScaleMM2 versions use about 50 MB: each thread starts with its own 1 MB memory block (this needs to be optimized further, but it works for now).
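The load pattern described above (a fixed number of client threads, each sending a fixed number of requests over its own connection) can be sketched roughly as follows. This is not the JMeter test plan that was used, just a minimal Python illustration; the `HelloHandler` server, the function names and the small default counts are assumptions made for the example.

```python
import http.client
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a test server: answers every GET with 'hello'."""
    protocol_version = "HTTP/1.1"  # HTTP/1.1 allows keep-alive connections

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the console quiet during the test


def worker(host, port, requests_per_thread):
    """One client thread: reuse a single connection for all its requests."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    ok = 0
    try:
        for _ in range(requests_per_thread):
            conn.request("GET", "/")
            resp = conn.getresponse()
            resp.read()  # drain the body so the connection can be reused
            if resp.status == 200:
                ok += 1
    finally:
        conn.close()
    return ok


def run_load_test(host, port, threads, requests_per_thread):
    """Return (successful requests, requests per second)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(worker, host, port, requests_per_thread)
                   for _ in range(threads)]
        ok = sum(f.result() for f in futures)
    elapsed = time.perf_counter() - start
    return ok, ok / elapsed


def demo(threads=5, requests_per_thread=20):
    """Start the stand-in server on a free localhost port and load it."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), HelloHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return run_load_test("127.0.0.1", port, threads, requests_per_thread)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    ok, rate = demo()
    print(f"{ok} requests OK, {rate:.0f} requests/s")
```

Calling `demo(threads=50, requests_per_thread=1000)` would mirror the 50 x 1000 pattern, but the numbers from such a Python client say nothing about the servers tested here.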

Results

See below for the (crowded) results. Tip: the Google Chart is interactive, which helps when digging through the many bars...
[Google Chart with the test results]



Limitations

My test has some big limitations you have to be aware of:
  • I tested on "localhost", so this is no real network test with reliability, error rate, etc.
  • Because all software is run on 1 PC, most results are CPU bound. This means the results have some kind of automatic "correction" for servers with a high CPU usage: these are "less efficient" and get "punished" for it with a lower request rate.
    Again: it is not a real network test, and it does not show the maximum possible speed. A real network test should be done before conclusions or decisions can be made!
  • I only did a short test (50,000 requests, so a couple of seconds), so it says nothing about long-term performance.

Observations (not conclusions :) )

So, how useful is this test then? Well, despite the rough edges, you can still make some interesting observations:
  • RTC and mORMot both perform very well!
  • However, mORMot shows big differences between results: it runs fast in admin mode (run as Administrator), but slower in normal user mode (higher kernel (red) CPU usage). The XE3 build also acts weird: I could not make connections in user mode, only in admin mode, and it is much slower than the D2010 build (same source, same release compiler options)! Are there known slowdowns in the XE3 compiler?
  • Node.js and "plain Indy" have similar results, so Indy itself is quite fast (but slower than RTC and mORMot).
  • RemObjects SDK (RO) comes close to RTC and mORMot (and is a little bit faster than WCF), but only(!) if ScaleMM2 is used and Synapse or DX is used instead of Indy.
  • RO, DataSnap and Indy benefit the most from ScaleMM2. RTC and mORMot are very fast on their own (less MM bound, so more optimized and more efficient?). But a multithreaded memory manager becomes very useful anyway once you start writing user code (creating strings, objects, records, etc.).
  • DataSnap XE3 is disappointing: even with ScaleMM2 it performs much slower than the others. The performance also drops very quickly! In the first second (nr1) it does about 3500 requests per second, but after a couple of seconds(!) (nr2) it drops to an average of 1700/s! See below for a ProcessExplorer screenshot, and notice the steep drop of the IO chart at the bottom of the screen! Maybe some kind of growing array or list causes this logarithmic decline? Is it due to automatic session management? (But I don't want sessions, I want a fast stateless server!) Or maybe a lot of "interthread memory" is used, so that a multithreaded MM needs to lock too? (CPU usage also gets lower, which can indicate higher locking times?)
    Using Google's TCMalloc memory manager (Delphi unit, dll) instead of ScaleMM2 shows similar behavior; however, it starts lower and drops less steeply.
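To put the DataSnap drop in perspective, here is a quick back-of-the-envelope calculation with the numbers from this test (50 threads x 1000 requests, 3500 requests/s at the start, about 1700/s on average afterwards). The helper names are made up for this sketch, and the durations assume a constant rate for the whole run, which is a simplification.

```python
def relative_change(before, after):
    """Change from `before` to `after` as a fraction of `before`."""
    return (after - before) / before


def seconds_for(total_requests, requests_per_second):
    """Run time if the whole test ran at one constant request rate."""
    return total_requests / requests_per_second


total = 50 * 1000  # 50 JMeter threads x 1000 requests each

print(f"throughput change: {relative_change(3500, 1700):.1%}")   # -51.4%
print(f"run time at 3500/s: {seconds_for(total, 3500):.1f} s")   # 14.3 s
print(f"run time at 1700/s: {seconds_for(total, 1700):.1f} s")   # 29.4 s
```

So the request rate roughly halves, and the same 50,000 requests take about twice as long.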

Other remarks:

Some other minor remarks about the test:
  • The Delphi software is compiled with D2010; only DataSnap is built with the XE3 trial.
  • The RTC server is compiled by www.realthinclient.com, also with D2010.
  • The fastest RTC settings are used: non-blocking and not multithreaded are also very fast, but blocking and multithreaded gave me the best results.
  • DataSnap is used with keep-alive: this gives me the 3500 requests/s, while without it I got about 3100/s
    (but keep-alive on a real network gives a high error rate?)

Todo

Some todos:
  • I would like to test RemObjects SDK for .Net too: I saw they also have an http.sys server (like mORMot)!
  • I should redo the tests with multiple physical PCs, but unfortunately I don't have that many good ones...
  • I will also build the "plain Indy" server in Delphi XE3 to see if the XE3 compiler is really broken (maybe that's why DataSnap performs badly?)

Used server software

Some download links to the used servers:

8 comments:

Bunny said

Use a few PCs, and take care that the "server's" network card is fast. If you have a server, you can assign it several IP addresses; this can help. If you have 2 or more NICs, even better.

What is important is to check whether the load is balanced: clients should not starve. In all my tests the Indy server worked well from this perspective.

I had some success with pure DataSnap (just DataSnap + WebBroker). But compared to pure Indy HTTP it is a lot slower.

In order to run a fair test you should add maybe a 200ms to 500ms delay (emulating work). Very dependent on what you want to compare.

Arnaud said

In mORMot, you have two servers, one WinSock-based and one http.sys-based. The second is much faster, but the http.sys API requires you to register the URI:port if you are not running with administrator rights (only since Vista/Seven). This is a security feature, also well known to all WCF users. If http.sys does not start, mORMot falls back to the WinSock-based implementation, which is indeed slower.

You can register the URI from the command line tool, or use a method of mORMot units - see TestSQL3Register.dpr

I suspect that you used version 1.17 of the framework. An XE3 issue has already been fixed in the trunk since October. Buggy XE3.

We try to avoid any unnecessary memory allocation during mORMot process, so this is the reason why SMM2 is not making a big difference here.

By the way, localhost tests are interesting, but not as good as over a real network, or the Internet.

Thanks for sharing your experiment.

Roberto Schneiders said

It's great to see someone continuing my work, especially because I did not have enough time to test all the options (RTC and RO), as you did.
The tests are very different, but you have arrived at the same conclusions: mORMot and RTC are extremely fast. Of course mORMot is much more than a REST API, but that is not the point.

I enjoyed your post, excellent work.

Roberto Schneiders said

This comment has been removed by the author.

Michael Justin said

In the source code for your Indy test project, the server is not using Keep-Alive. This can quickly slow down performance, as the client connection resources will not be removed by the operating system after disconnect. Keep-Alive can be activated with a server property, and can improve performance a lot.

André Mussche said

Michael Justin: what do you mean? At least in my sources I have "keep alive" enabled in the dfm:

object IdHTTPServer1: TIdHTTPServer
KeepAlive = True

André Mussche said

Strange: I did some hacking and profiling to see how fast I could get DataSnap, and I see that my "plain Indy" server is also much slower in XE3 than in D2010! In D2010 I get about 11,500 requests per second, while the same code in XE3 gets only 7,700...
After some hacking (mainly disabling the sessions) I got 4700 req/s (was 3200 req/s) and also stable performance (no steep decline). I could not get more without a lot of rework...

When looking at the DS source code my conclusion is: it is not optimized for "high performance" (stupid advertisement of EMBT!). I mean: all kinds of helper objects are created and destroyed on the fly, many UTF8 decoding conversions (implicit due to rtti?), RTTI context is not cached, no connection pool (new connections are created and closed for each request), etc.

Darian Miller said

I ran your JMeter plain Indy test on my XE4 machine, and a rebuilt Project1.exe outperformed the downloaded version by a significant margin. (Throughput of 1,000/sec over 30,000 samples versus a throughput of 400/sec using the project1.exe provided in the zipfile, which was presumably built with Delphi 2010.)


On my machine, the provided project1smm2.exe achieved a very substantial throughput increase of 1450/sec over the provided project1.exe.

A rebuilt project1smm2 in XE4/32-bit mode achieved 1050/sec. The XE4/64-bit version started throwing AV errors after a few thousand samples.