Thursday, August 22, 2013

ADO Bug: do NOT use comments in parameterized SQL queries in ADO/OLEDB!

Yesterday I finally found the cause of a problem in our middleware service: after 1 hour the service did not return any results from the database anymore. The problem was a SQL comment (--) in a query within a report...

Today I did some more investigation and found out WHY we got no results, but also no errors.

For example, we could create a very simple query like this:
select *
from (select 1 as value) as dummytable
where value > ?
When we prepare this query in an ADO query (e.g. TADOQuery in Delphi), it sends the following SQL to the server to determine the parameter types:
 set fmtonly on select v from  (select 1 as v) as t where 1=2 set fmtonly off
So it replaces the "where" clause with "where 1=2" and surrounds the statement with "set fmtonly on/off". This "fmtonly" setting instructs SQL Server to return only metadata (column names and types), no rows.
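You can see this "metadata only" behavior for yourself in Management Studio; a minimal sketch, reusing the dummy query from above:
set fmtonly on
-- with fmtonly on, only the column metadata is returned, the query is not really executed
select * from (select 1 as value) as dummytable
set fmtonly off
-- with fmtonly off again, the same select returns its row as usual
select * from (select 1 as value) as dummytable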

So far, so good.
But when you use "--" comments in your SQL (normally no problem) in combination with parameters, you get big problems! Take a look at what ADO (or OLEDB?) does with the following statement:
select *
from (select 1 as value) as dummytable
--dummy comment
where value > ?
When you prepare this query, you will see the following SQL statement in SQL Server Profiler:
set fmtonly on select v from  (select 1 as v) as t
--dummy comment where 1=2 set fmtonly off
Can you spot the problem? The appended "where 1=2 set fmtonly off" ends up on the same line as the "--" comment, so it is commented out: "set fmtonly off" is never executed! Your connection stays in a "metadata only" state, and all queries you execute after this won't return any results!

Conclusion: DO NOT USE COMMENTS IN PARAMETERIZED QUERIES!
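If you really cannot avoid a comment in a parameterized query, a /* ... */ block comment should be safe, because it is terminated explicitly and cannot swallow the text that ADO appends on the same line. That is only my assumption based on the behavior above, so check it in SQL Server Profiler first:
select *
from (select 1 as value) as dummytable
/* dummy comment: a block comment is closed explicitly */
where value > ?
And when a connection is already stuck in the metadata-only state, executing a plain "set fmtonly off" on that connection should bring it back to normal (again, my assumption).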

Tuesday, April 23, 2013

SQL Server and slow compile times of queries


I posted a message on Google+ on Friday about long compile times of queries on SQL Server.
(To be clear: the execution of the query itself was fast enough; it was only slow the first time, when the plan was compiled.)
I could make my query a lot faster by not using view-in-view-in-view, but I wanted to know why it was slow in the first place.
Today I found some more detailed information about (slow) compilations of queries.
Today I found some more detailed information about (slow) compilations of queries.

xperf profiling

First, I found some information about using xperf with SQL Server to profile the performance. In an old blog post I also found an explanation of how to log stack traces with xperf. This was exactly what I was looking for: using a (free, 64-bit) sampling profiler to find out what sqlservr.exe is doing all that time!
(Note: xperf.exe is now part of the "Windows Performance Toolkit" (WPT); you have to download the Windows ADK and install only the WPT component.)
However, I got a cryptic error on the server: "xperf: error: NT Kernel Logger: Cannot create a file when that file already exists."
By closing Process Explorer (which also uses ETW events for timing and performance information) I got it working on my own workstation. (I did not want to install the (low level?) WPT on the server :), but xperf should also work via a simple xcopy deployment?)
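For reference, a rough sketch of how such a sampled profile with call stacks can be captured with xperf (based on the blog posts mentioned above; the trace file name is just an example; the "Cannot create a file" error above simply means another tool, such as Process Explorer, already opened the NT Kernel Logger session):
rem start kernel sampling with stack walking (on 64-bit Windows, stack walking also
rem requires the DisablePagingExecutive registry setting plus a reboot)
xperf -on PROC_THREAD+LOADER+PROFILE -stackwalk Profile
rem ... now run the slow query so its compilation gets sampled ...
rem stop the NT Kernel Logger and merge the result into an .etl file
xperf -d slowcompile.etl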

Viewing results

The next hurdle was getting my stack traces into the latest xperf viewer (Windows Performance Analyzer). After some more "googling" I found a newer blog post about xperf. It turned out I had to do the following:
1: load all symbols (takes a while) via the menu Trace -> Load Symbols
2: click on the "display graph and table" icon (top right in, for example, the CPU usage view)
3: show the "stack" column in the detail grid (right click in the grid)
[Inline image 1: WPA screenshot of the CPU sampling stacks]
As you can see, most of the time is spent in the "Optimize" procedure (ironic, isn't it? :) ).
(Note: 25% CPU is one core at 100% on my quad-core PC, so 20% total is actually 80% usage of one core.)

Optimization phase

Armed with this information I could search further and I found some posts about the optimization phase.
Inside SQL Server: Parse, Compile, and Optimize
The SQL Server Query Optimizer

Somehow I also found a "deep dive" into the optimizer, including some undocumented trace flags (!) for the really low-level inner details. Yeah! :)
http://sqlblog.com/blogs/paul_white/archive/2012/04/28/query-optimizer-deep-dive-part-1.aspx
http://sqlblog.com/blogs/paul_white/archive/2012/04/28/query-optimizer-deep-dive-part-2.aspx
http://sqlblog.com/blogs/paul_white/archive/2012/04/29/query-optimizer-deep-dive-part-3.aspx
http://sqlblog.com/blogs/paul_white/archive/2012/04/29/query-optimizer-deep-dive-part-4.aspx
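As far as I understand from those posts, you can dump the optimizer's internal trees and the final memo for a single query with the (undocumented, unsupported) QUERYTRACEON hints below; the flag numbers are taken from those articles, so treat this purely as an illustration:
select *
from (select 1 as value) as dummytable
where value > 0
option (querytraceon 3604,  -- send the diagnostic output to the client (Messages tab)
        querytraceon 8606,  -- show the logical trees of the simplification/normalization phases
        querytraceon 8615)  -- show the final memo structure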

Too much information to handle for now, but at least (as far as I understand from the internal counters etc.) the optimizer did not even complete the normalization phase, because there were too many groups and the tree was too large or complex. It timed out in the "simplification" phase (see part 2), so it didn't really optimize at all? It seems so: the query executes slower than my one big single select (24s vs. 58s for all 30,000 records in a test db). I read somewhere that a view acts like a kind of macro that is fully expanded while compiling the plan. When you have a lot of "view-in-view-in-view", a lot of "macros" have to be expanded, which then takes a lot of time to process. It seems even worse: a subselect with the same content as a view is much faster to compile than using or joining that same view.
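To see how much of the total time is really compile time, and whether the optimizer gave up early, the following standard option and DMV can be used (no undocumented flags needed); a rough sketch:
-- shows "SQL Server parse and compile time" separately from the execution time
set statistics time on;

-- server-wide optimizer counters; compare before/after compiling the query:
-- an increase of the "timeout" counter means the optimizer stopped searching early
select counter, occurrence, value
from sys.dm_exec_query_optimizer_info
where counter in ('optimizations', 'elapsed time', 'timeout');
The actual execution plan shows the same thing as the plan property "Reason For Early Termination Of Statement Optimization" = "TimeOut".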


Remarks and questions

By the way, some remarks of my test:
  • I used a new database with no data in it (so it is probably even slower on big databases with larger statistics?)
  • I also tried to delete all auto-generated statistics (and to disable the automatic generation of statistics), but this didn't change anything (compilation stayed slow)
  • I also tried all SQL OPTION hints, but again, no effect (see the sketch right after this list)
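To be concrete about those last two bullets, a quick sketch (the database, table and statistics names below are just made-up examples):
-- disable the automatic creation of statistics for the test database
alter database TestDb set auto_create_statistics off;

-- drop an auto-generated statistics object (auto-created ones get names like _WA_Sys_...)
drop statistics dbo.SomeTable._WA_Sys_00000002_0425A276;

-- examples of compile-related OPTION hints on the slow view query
select *
from dbo.SomeNestedView
option (force order, recompile);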
Some remaining questions:
  • Why do I get an optimization timeout after 7s on my PC, but after 15s on an (older, internal, virtualized) server and 25s on a customer server?
  • I tested on SQL Server 2008 R2; how would it perform on SQL Server 2012?
  • Why did we use so many views?
I hope to answer (some of) these questions next time.

Conclusion

Well, the conclusion hasn't changed: do not use many nested views! Or at least be aware of the consequences.


Thursday, January 24, 2013

DataSnap, RO, RTC, mORMot, WCF, Node speed test

After reading the "DataSnap test" blog article, I wanted to do some extra tests: RemObjects SDK (RO) and the effect of ScaleMM2. I also got a server build from RealThinClient (RTC) for testing.
To get some reference numbers, I downloaded the test servers and JMeter 2.8 and ran them on my PC. After that I got a RemObjects server working with JMeter, so I could compare the results with the other solutions. I also made a "plain Indy" server to mimic the plain Node.js test (to see what Indy 10 is capable of).

Test setup

I tested on a single PC: a quad-core Windows 7 machine with 8 GB of RAM. I used JMeter 2.8 with 50 threads and 1000 requests per thread, and configured all test servers to use a thread pool of 50 (whenever that was possible).
It is a rather rough and quick test (no average over 3 runs, etc.): it is only meant to give a quick indication of the performance. I also did not monitor the memory usage, because most test servers had a low usage of about 6 to 9 MB (and memory is cheap :) ). Only the ScaleMM2 versions use about 50 MB: each thread starts with its own 1 MB memory block (this needs to be optimized further, but it works for now).

Results

See below for the (crowded) results. Tip: the Google Chart is interactive, which helps to dig through the many bars...
[Interactive Google Chart with the benchmark results]



Limitations

My test has some big limitations you have to be aware of:
  • I tested on "localhost", so it is not a real network test with reliability, error rates, etc.
  • Because all software runs on one PC, most results are CPU bound. This means the results have some kind of automatic "correction" for servers with a high CPU usage: these are "less efficient" and get "punished" for it, so they get a lower request rate.
    Again: it is not a real network test and it does not show the maximum possible speed. A real network test should be done before conclusions or decisions can be made!
  • I only did a short test (50,000 requests, so a couple of seconds), so no long-term performance.

Observations (not conclusions :) )

So, how useful is this test then? Well, despite the rough setup, you can still make some interesting observations:
  • RTC and mORMot both perform very well!
  • However, mORMot has big differences between the results: it runs fast in admin mode (run as Administrator), but slower in normal user mode (higher kernel (red) CPU). Also, the XE3 build acts weird: I could not make connections in user mode, only in admin mode, and it is much slower than the D2010 build (same source, same release compiler options)! Are there known slowdowns in the XE3 compiler?
  • Node.js and "plain Indy" have similar results, so Indy itself is quite fast (but slower than RTC and mORMot)
  • RemObjects SDK (RO) comes close to RTC and mORMot (and is a little bit faster than WCF), but only(!) if ScaleMM2 is used and Synapse or DX is used instead of Indy
  • RO, DataSnap and Indy benefit the most from ScaleMM2. RTC and mORMot are very fast on their own (less memory-manager bound, so more optimized and more efficient?). But a multi-threaded memory manager becomes very useful anyway once you start writing user code (creating strings, objects, records, etc.).
  • DataSnap XE3 is disappointing: even with ScaleMM2 it performs much slower than the others. Also, the performance drops very quickly! In the first second (nr 1) it does about 3500 requests per second, but after a couple of seconds(!) (nr 2) it drops to an average of 1700/s! See below for a Process Explorer screenshot; notice the steep drop of the I/O chart at the bottom of the screen! Maybe some kind of growing array or list causes this decline? Due to automatic session management? (But I don't want sessions, I want a fast stateless server!) Or maybe a lot of "inter-thread memory" is used, so even a multi-threaded MM needs to lock? (The CPU usage also gets lower, which can indicate more time spent waiting on locks?)
    Using Google's TCMalloc memory manager (Delphi unit + DLL) instead of ScaleMM2 shows similar behavior, but it starts lower and drops less steeply.

Other remarks

Some other minor remarks about the test:
  • The Delphi software is compiled with D2010; only DataSnap is built with the XE3 trial.
  • The RTC server is compiled by www.realthinclient.com, also with D2010.
  • The fastest RTC settings are used; non-blocking and non-multithreaded are also very fast, but blocking and multithreaded gave me the best results.
  • DataSnap with keep-alive is used; this gives me the 3500 requests/s, but without it I got about 3100/s
    (but keep-alive on a real network gives a high error rate?)

Todo

Some todo's:
  • I would like to test RemObjects SDK for .NET too: I saw they also have an http.sys based server (like mORMot)!
  • I should redo the tests with multiple physical PCs, but unfortunately I don't have enough good ones...
  • I will also build the "plain Indy" server in Delphi XE3 to see if the XE3 compiler is really broken (maybe that's why DataSnap performs badly?)

Used server software

Some download links to the used servers: