
Saturday, 13 July, 2013

MySQL Performance: Why Performance Schema Overhead?..

As there have been a lot of complaints recently about various cases of PERFORMANCE SCHEMA (PFS) overhead, I've decided to start a series of articles to help you use PFS in the best way to analyze your workloads. This is the first part, starting with an overview of instrumentation problems in general..

First of all, instrumentation cannot be free! - when you want to trace or collect some stats about your application internals, you're adding new code, and this code lengthens the overall code path, so by definition your final binary simply cannot run at the same speed!.. Of course, if your instrumentation code is poorly written, the overall overhead will be bigger ;-) and if it's well designed, it will be lower - but still not zero!..

Dynamic instrumentation brings additional cost - as soon as you want to add some advanced stuff to your instrumentation, it makes sense not to keep it active all the time.. - but checking all these active/inactive flags and states brings its own cost, as it requires yet more code to manage them.. (the best realization of dynamic instrumentation today probably belongs to DTrace, but it is still not available on most platforms, and it remains a story of its own)..

Instrumentation overhead depends directly on event frequency - sounds trivial, right? - the more frequently your instrumentation code is executed, the bigger the overhead you may expect. So, before you activate any instrumentation on some execution level of your code, you have to think about the frequency of events on this level.. (just to give you numbers: a probe costing ~100ns is invisible on an event firing 1000 times/sec - that's 0.01% of one core - while on an event firing 10M times/sec the same probe alone will burn a whole CPU core!) BTW, you'll hit exactly the same problem with DTrace as well (for ex. see my old article with an example of a x10 slowdown under DTrace). However, when we're adding instrumentation into our own code, we can keep in mind what event frequency to expect within each part of the code, and make the overhead of such instrumentation predictable(!)..

All this sounds simple.. - until we start to feel the overhead directly related to the involved instrumentation ;-) To give you a feeling of what it could be, I wrote the simple "dumb code" of a program which is not doing anything useful, but just looping to keep a CPU warm, and within its loops checking different conditions ;-) (you may find the full source code here if you want to play with it too).

The program recursively calls Loop_func(), which first does "N" iterations of a work loop on its level, and then calls Loop_func() "M" times to involve execution of the next level (in fact the program simulates work like: main() -> do some work -> call function1() -> do some work -> call func2() in a loop -> do some work -> call func3() in a loop -> and so on...)

For database code it could be something similar to:

main()
  ...
  '-> loop: execute_SQL_SELECT();    <== level 1
     '-> loop: read_ranges();           <== level 2
        '-> loop: read_rows();             <== level 3
           '-> loop: read_row();              <== level 4
              '-> ...


So, the deeper the level, the bigger the probability that the code on this level is executed more frequently (more loop iterations). And then we want to instrument it to measure exactly the time we're spending there. So, following the PFS idea, I've added the following to my code around the call of Loop_func():
  • first, I'm able to compile the code with or without the "PFS instructions" (depending on whether PFS_DEF is defined or not)
  • then, I'm able to "dynamically" enable the instrumentation code according to the "PFS" argument value
  • and then, according to the enabled instrumentation of each level, I'm executing (or not) the instrumentation code on this level (for the TIMED instrumentation I'm using gettimeofday(), which is probably not the most optimal, but good enough for an example)..
#ifdef PFS_DEF                  /* instrumentation compiled in at all? */
    if( PFS )                   /* instrumentation globally ("dynamically") enabled? */
    {
      if( Data[ level ].Enabled )  Data[ level ].CNT++;        /* count the event */
      if( Data[ level ].Timed )  gettimeofday( &tv1, NULL );   /* timestamp before */
    }
#endif

    Loop_func( level + 1, max );

#ifdef PFS_DEF
    if( PFS )
    {
      if( Data[ level ].Enabled && Data[ level ].Timed )
      {
        gettimeofday( &tv2, NULL );                            /* timestamp after */
        /* accumulate the elapsed time of this call, in microseconds */
        Data[ level ].TIMED += (tv2.tv_sec - tv1.tv_sec) * 1000000 + tv2.tv_usec - tv1.tv_usec;
      }
    }
#endif
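
Note the three independent "layers" of cost in this fragment: the instrumentation may be not compiled in at all (no PFS_DEF), compiled in but globally disabled (PFS=0), or globally enabled with the counting (Enabled) and timing (Timed) toggled per level - exactly the combinations we'll compare below.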




Hm.. it will probably be faster for you to read the code than my explanations ;-)

Well, now - how to use this program:
  • first of all, I've compiled 2 binaries (see the compile sketch just after this list):
    • one "without PFS" - named "pfs_loop-noPFS"
    • another one "with PFS" (PFS_DEF defined) - named "pfs_loop-withPFS"
  • then I'm executing the program with the following arguments:
    • $ pfs_loop PFS Loop1_N Loop1_M Instr1 Timed1 [Loop2_N Loop2_M Instr2 Timed2]...
    • where:
      • PFS - 1/0 : PFS code enabled/disabled
      • Loop1_N - number of iterations of the "work loop" on "level1", simulating some work
      • Loop1_M - number of calls of the Loop_func() function for the next level ("level2")
      • Instr1 - 1/0 : instrumentation enabled/disabled on "level1"
      • Timed1 - 1/0 : TIMED instrumentation enabled/disabled on "level1"
      • ... - and so on for each next level
      • the total number of levels is then calculated automatically
So, in the following example I'm executing the "pfs_loop" program with 4 levels in total: on the first 3 levels the program will run 1000 iterations of the "work loop" (N), and then call execution of the next level 500 times (M). Only on the last (4th) level is the "work loop" shorter - just 10 iterations (similar to when you read a few rows from the FS cache, etc.)..

The first execution is with the binary compiled without any instrumentation:
$ time pfs_loop-noPFS   0 1000 500 0 0 1000 500 0 0 1000 500 0 0 10 0 0 0
real    0m4.380s
user    0m4.366s
sys     0m0.000s


Now exactly the same thing, but with the binary compiled with the "PFS instrumentation", while PFS is "disabled":
$ time pfs_loop-withPFS 0 1000 500 0 0 1000 500 0 0 1000 500 0 0 10 0 0 0
real    0m4.551s <-- 4% more..
user    0m4.537s
sys     0m0.000s



The same again, but with PFS "enabled" and all instrumentation "disabled":
$ time pfs_loop-withPFS 1 1000 500 0 0 1000 500 0 0 1000 500 0 0 10 0 0 0
real    0m4.685s <-- 6.6% more..
user    0m4.671s
sys     0m0.001s


And now "enabling" instrumentation on different levels:
$ time pfs_loop-withPFS 1 1000 500 1 1 1000 500 0 0 1000 500 0 0 10 0 0 0
real    0m4.685s  <-- same 6.6% for "level 1"
user    0m4.671s
sys     0m0.000s

$ time pfs_loop-withPFS 1 1000 500 0 0 1000 500 1 1 1000 500 0 0 10 0 0 0
real    0m4.700s <-- 7.3% for "level 2"
user    0m4.684s
sys     0m0.001s

$ time pfs_loop-withPFS 1 1000 500 0 0 1000 500 0 0 1000 500 1 1 10 0 0 0
real    0m12.864s  <-- ~194% more (almost 3x slower!) for "level 3"
user    0m12.826s
sys     0m0.000s




As you can see, the lower the level on which we involve the instrumentation, the bigger the overhead we get (trivial, right? - on a lower level the same instrumentation code is executed way more often, so the resulting impact is also way bigger)..

Exactly the same goes for PERFORMANCE SCHEMA in MySQL! - the lower the level of instrumentation you're enabling, the bigger the overhead you may hit.. However, keep in mind:
  • ALL PFS instrumentation is completely dynamic
  • at any time you may disable everything! (see the sketch just after this list)
  • you may go very progressively and in detail, level by level, with the instrumentation - just be aware of what exactly you're doing ;-)
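
For example, to switch everything off on a live server, something like the following is enough (using the standard PFS setup tables, as present in MySQL 5.6):

mysql> -- stop all the instruments (no more counting, no more timing)
mysql> UPDATE performance_schema.setup_instruments SET ENABLED='NO', TIMED='NO';
mysql> -- and/or stop feeding all the consumers
mysql> UPDATE performance_schema.setup_consumers SET ENABLED='NO';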


Now, let's take the popular example of the Sysbench OLTP_RO Point-Select 8-tables workload. On 16cores-HT this workload usually reaches its max performance level at 64 or 256 concurrent user sessions. Let's analyze the PFS impact on this workload in the following configurations:
  • noPFS -- MySQL binary compiled without PFS
  • withPFS, PFS=off -- MySQL binary compiled with PFS, but PFS disabled
  • withPFS, PFS=def -- same as before, but PFS enabled with the default instrumentation (as in MySQL 5.6 by default)
  • withPFS, PFS=none -- PFS enabled, but all instrumentation disabled
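
(BTW, to check in which of these states a running server actually is, the standard statements are enough:)

mysql> SHOW GLOBAL VARIABLES LIKE 'performance_schema';  -- ON / OFF (empty set if not compiled in)
mysql> SELECT COUNT(*) FROM performance_schema.setup_instruments WHERE ENABLED='YES';
mysql> SELECT * FROM performance_schema.setup_consumers;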

So, we're getting:

Observations :
  • noPFS : 64usr-> 250K QPS, 256usr-> 280K QPS
  • PFS=off : 64usr-> 249K QPS, 256usr-> 275K QPS
  • PFS=def : 64usr-> 240K QPS, 256usr-> 260K QPS
  • PFS=none : 64usr-> 245K QPS, 256usr-> 270K QPS
  • so, in the worst case we have 8% "overhead" - 260K QPS -vs- 280K QPS (PFS default -vs- compiled without PFS)..
  • while "not compiled" -vs- "compiled but disabled" (noPFS -vs- PFS=off) is nearly the same - max 2% overhead only..

However, the "default" PFS instrumentation is composed from "table I/O" and "statements/digest" instrumentations, let's see the impact of the each one:

Observations :
  • PFS=none : 64usr-> 245K QPS, 256usr-> 270K QPS
  • PFS=tables_only : 64usr-> 240K QPS, 256usr-> 265K QPS
  • PFS=statements_only : 64usr-> 240K QPS, 256usr-> 260K QPS
  • so, as you can see, by disabling the statements instrumentation you can reduce the max overhead from 8% to 5%
  • and all of this dynamically, without a MySQL Server restart! (see the sketch just below)
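
As a sketch, here is how such a split can be done on a live server via the PFS setup tables (the instrument names are those of MySQL 5.6: 'wait/io/table/sql/handler' is the table I/O instrument, while 'statement/%' covers the statements ones):

mysql> -- "tables_only" : leave just the table I/O instrumented
mysql> UPDATE performance_schema.setup_instruments SET ENABLED='NO', TIMED='NO';
mysql> UPDATE performance_schema.setup_instruments SET ENABLED='YES', TIMED='YES'
    ->   WHERE NAME = 'wait/io/table/sql/handler';

mysql> -- "statements_only" : leave just the statements instrumented
mysql> UPDATE performance_schema.setup_instruments SET ENABLED='NO', TIMED='NO';
mysql> UPDATE performance_schema.setup_instruments SET ENABLED='YES', TIMED='YES'
    ->   WHERE NAME LIKE 'statement/%';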

Well, I'll go into more detail in the next articles, but to finish with this one, let's see how well the "alternative" solutions are working on the same workload...

Let me start with the MySQL 5.6 code patched by Facebook, providing TABLE_STATISTICS (and other stats). As mentioned by Domas, there is no need for any enable/disable settings - just start the MySQL server and you get it all. So, let's compare its overhead on the same workload with noPFS and PFS=def :


Observations :
  • as you can see, there is nearly no difference in overhead between PFS=def and the Facebook stats instrumentation
  • there is the same max 8% overhead -vs- noPFS.. - so, there is no instrumentation without cost, right?..
  • while PFS=def gives you not only the table stats, but also the query digest..
  • and the PFS instrumentation you may dynamically reduce at any moment, while the Facebook code is just always there..

Another "alternative" stats solution is implemented in Percona Server, but this instrumentation is dynamic, in fact binary: you may just enable or disable it completely by setting the "userstat" global variable to 1 or 0. Percona Server 5.6 is only RC for the moment, and I was unable to compile it without PFS due errors in Percona code.. So, let's compare it on the same workload with noPFS and PFS=off (the difference in PFS=off config between MySQL 5.6 and Percona 5.6 may be a sign of yet additional overhead coming from the "userstat" related code) :

Observations :
  • in fact, yes, the MySQL 5.6 max QPS with PFS=off is higher than on Percona 5.6..
  • and then, once userstat=1, the regression is horrible.. over 30%...
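
(for completeness, the userstat switch itself - with its stats exposed via INFORMATION_SCHEMA tables, as documented by Percona:)

mysql> SET GLOBAL userstat = 1;  -- and 0 to disable it again
mysql> SELECT * FROM INFORMATION_SCHEMA.TABLE_STATISTICS LIMIT 3;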

So, it's very easy to complain about PFS overhead.. - but what if we try together to improve it rather than blame it?.. ;-)) PFS is a universal solution for MySQL profiling, it has a HUGE power, we all just need to learn how to use it right, and then life will become much easier.. ;-)

INSTEAD OF SUMMARY :
  • Instrumentation is not free ;-)
  • Instrumentation of higher-frequency events brings a bigger overhead!
  • PFS is a really great tool, just need some love ;-)

Well, this was only the story about the PFS instrumentation overhead, and nothing yet about practical usage.. So, the next one in this series will be all about practice ;-)

To be continued..

Rgds,
-Dimitri
Posted by Dimitri at 0:55
Categories: MySQL