
Wednesday, 02 April, 2014

MySQL 5.7 just rocks! ;-)

The next MySQL 5.7 milestone release is available, and it just rocks! ;-)

Here are a few benchmark results to see where we are today, comparing:

  • MySQL 5.7 / 5.6 / 5.5
  • Percona Server 5.6 / 5.5
  • MariaDB 10 / 5.5

For all engines the latest available versions were used; the data set fits in memory, so the main focus here is on internal contentions: already fixed in some engines, still present in others ;-)

Sysbench OLTP_RO 8-tables :


Sysbench OLTP_RO Point-Selects 8-tables :


Sysbench OLTP_RW 8-tables :


All the details about these benchmark results and others (IO-bound OLTP_RW, Uniform & Pareto, DBT2, LinkBench) I'll present during my talk tomorrow at Percona Live 2014. I will also cover the internals of the InnoDB flushing design and how we're improving it further in 5.7, the issues fixed and still pending today, the impact of InnoDB purge, filesystem and flash storage choices on IO-heavy workloads, and several still unexplained mysteries.. - prepare your brain for some storming ;-)

If you're attending Percona Live, don't miss the following talks from the MySQL Team:

We'll also all be there this evening during our "Meeting the MySQL Team" BOF session, to answer any questions you have and discuss any issues - your valuable feedback helps us make MySQL even better! And nothing beats live, honest face-to-face discussion.. - so, don't miss it ;-)

Also, to get an overview of all the new features and improvements coming with this 5.7 milestone release, you'll find a lot of interesting information in Geir's article - http://mysqlserverteam.com/the-mysql-5-7-4-milestone-release-is-available/
Posted by Dimitri at 16:37 - Comments...
Categories: MySQL

Friday, 31 January, 2014

MySQL & Friends @ FOSDEM-2014

February seemed so far away.. - and now it's finally just tomorrow, starting with the MySQL & Friends Dev Room at FOSDEM 2014 in Brussels. I have a talk, "Starting with MySQL PERFORMANCE SCHEMA" - in fact I would rather call it "Using PFS with zero configuration" ;-) -- many people think PFS is complicated, while in reality it's very simple and just needs a little bit of love ;-) Since MySQL 5.6, PFS is enabled by default, and as a result several very useful instrumentation stats are available out-of-the-box - my talk will be about them..

Of course I'll speak about MySQL Performance as well, so feel free to ask me any questions about it.

Also, don't miss talks from our MySQL Team :

See you all there!

(and hurry up so you don't miss the MySQL & Friends Community Dinner - only a few places left)

Rgds,
-Dimitri, heading to FOSDEM in train ;-)

Posted by Dimitri at 15:10 - Comments...
Categories: MySQL

Friday, 22 November, 2013

MySQL Performance: over 1M QPS with InnoDB Memcached Plugin in MySQL 5.7

Last week, during Tomas' keynote at the Percona Live Conference in London, we announced, as one of the "previews" of the upcoming MySQL 5.7 release(s), a result of over 1,000,000 queries/sec obtained with the InnoDB Memcached plugin on a read-only workload. This article is just to confirm the announced results without going into too much detail..

In fact, we don't yet know today what exactly the scalability and performance limits of this solution are.. The huge gain in performance here was possible due to the initial overall speed-up made recently in MySQL 5.7, which let us reach 500K QPS in a "normal" SQL read-only workload. Further improvements in the InnoDB Memcached plugin code then came just naturally - especially since the Facebook team challenged us pretty well here by describing all the performance limitations they were hitting in their workloads. Facebook also provided us a test case workload which we successfully used to improve our code even more, and finally the same test case was used to obtain the following benchmark results ;-)

The test was executed in "standalone" mode (both server and client running on the same machine). So, we used the biggest HW box we have in the lab - a 48-core server. This machine was able to point us very quickly to any existing or potential performance issues and bottlenecks (and, interestingly, most of them were now in the memcached code itself). However, the queries/sec (QPS) rate depends a lot here on memory latency and CPU frequency, while this server only has 2GHz CPU cores - so on faster HW you may expect even better results ;-)

Now, comparing the best-to-best QPS results obtained on this server, we have the following :

and for people who prefer 2D charts :



I've placed "MySQL 5.6" in the legend, while a truer label would rather be "the best result we observed until now" ;-)) -- because some of the Memcached code improvements will be back-ported to MySQL 5.6 as well, so we may expect the next 5.6 releases to run better here too. However - only with the MySQL 5.7 code base will you be able to go really high..

During my talk at Percona Live in London I also presented the following graphs - the Memcached QPS corresponds here to the InnoDB "dml_reads/sec" stat :

There are 4 tests on these graphs, representing the "previous" MySQL code running the Memcached workload :

  • #1 - running on 48 cores as-is.. - we're hitting severe contention related to the MVCC code (which was fixed in the latest MySQL 5.7)..
  • #2 - limiting the MySQL server to 16 cores only to lower this contention.. - and then hitting transaction-related contention (also fixed in the latest MySQL 5.7 code)..
  • #3 - tuning the memcached plugin to keep several reads within a single internal transaction -- this helps, but we hit other contentions..
  • #4 - limiting the MySQL server to 8 cores to see if contention can be lowered further -- indeed, the peak QPS becomes higher (at 32 users), but overall performance is worse..


While on the latest MySQL 5.7 code things look completely different :

There are 2 tests on these graphs:
  • #1 - running on 48 cores as-is (no comments ;-))
  • #2 - using the "tuning" option to keep several reads within a single internal transaction - just slightly better at the peak max QPS, otherwise no significant difference anymore..


And to really feel the difference in the obtained QPS, let's bring them all together on the same graph :

As you can see, the difference is more than impressive ;-))
  • all the curves on the left part of the graph represent QPS levels obtained with the "previous" MySQL 5.6 / 5.7 code..
  • then, the last curves on the right part - with the latest MySQL 5.7 code..


So, work is still in progress, and I'll let Sunny and Jimmy provide you all the deep details about this huge step forward made in the latest MySQL 5.7 release!

I don't know what the performance limit will be here.. Probably only the HW level.. And I don't know if we'll have big enough HW to see it ;-) -- via a single 1Gbit network link we've already observed over 700K QPS, and while the limitation there comes from the single network link, the main troubles come from client-side processing rather than the server.. - so it seems Memcached @InnoDB now scales way better than the "original" Memcached itself ;-) -- then, what kind of performance may be expected when several network links (or simply faster network cards) are used? -- there is still a lot to discover! And RW workload performance will be yet another challenge as well ;-)

Kudos to Sunny and Jimmy! And my special thanks to Yoshinori (Facebook)! - I think this is an excellent example of how working together on a given problem produces a fantastic result for all MySQL users!..

If you need some details about the Memcached plugin design, you may start your reading here: https://blogs.oracle.com/MySQL/entry/nosql_memcached_api_for_mysql - and then, keeping in mind all the results presented here, I'll let you imagine what kind of performance you may expect if data is accessed directly via the "native" InnoDB API, by-passing the Memcached level.. ;-))
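For reference, enabling the plugin follows the standard documented setup - a minimal sketch (the script path is illustrative and varies per install; see the linked article for the authoritative steps):

```sql
-- Load the config schema shipped with the server, then load the plugin
-- (path below is illustrative; the script lives under the server's share/ dir):
SOURCE /usr/share/mysql/innodb_memcached_config.sql;
INSTALL PLUGIN daemon_memcached SONAME "libmemcached.so";

-- The innodb_memcache.containers table maps memcached keys to an InnoDB
-- table; the demo mapping is created by the script above:
SELECT name, db_schema, db_table FROM innodb_memcache.containers;
```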

Rgds,
-Dimitri
Posted by Dimitri at 13:36 - Comments...
Categories: MySQL

Friday, 08 November, 2013

Speaking at Percona Live London 2013

  I'm speaking next week at Percona Live in London. My talk will, of course, be mostly about MySQL Performance, covering the major MySQL 5.6 improvements and the latest findings made in MySQL 5.7 as of today. Percona Server 5.6 and MariaDB 10 will not be missed either.. - we're living in very interesting times, and the performance topic is as exciting today as ever ;-)

As usual, I'll tell you "one more thing" about the latest MySQL Performance news that you cannot read or find anywhere else.. - so, stay tuned ;-)

Rgds,
-Dimitri
Posted by Dimitri at 21:27 - Comments...
Categories: MySQL

Thursday, 10 October, 2013

My Slides from MySQL Connect 2013

  As promised, my slides from my MySQL Connect talks are now uploaded to the #OOW13 site and are also available from here:
Thanks to all who attended MySQL Connect this year, for all great talks, discussions, suggestions and your support!

Rgds,
-Dimitri

Posted by Dimitri at 16:17 - Comments...
Categories: MySQL

Friday, 04 October, 2013

MySQL Performance: The Road to 500K QPS with MySQL 5.7

This article provides the details behind the 500K QPS with MySQL 5.7 benchmark results published earlier and explained during my talk at MySQL Connect..

Looking back at the history of MySQL / InnoDB improvements, you may easily realize that no release brought as big a speed-up in read-only (RO) performance as MySQL 5.6 did. And it's easy to understand as well that without a good level of scalability on read-only, it's hard to expect to reach a good level on read+write (especially when reads are dominating ;-))

However, we were so happy with RO performance in MySQL 5.6 that with 5.7 we turned our main focus to read+write (RW), because on big data volumes things are not yet going as we wish.. But the RW dependency on RO jumped up once more, and the InnoDB team came up with a few more improvements to push the max QPS level in 5.7 yet further.

But let's take things in order..

In fact, in MySQL 5.6 on RO workloads the dominating internal contentions are:

  • Single table used only: MDL, trx_sys and lock_sys (InnoDB)
  • Many tables used: trx_sys and lock_sys (so, InnoDB mainly)

Any fast single-table-bound workload will be mainly blocked by MDL contention, while a multi-table workload will be limited by InnoDB internals (different tables are protected by different MDL locks, so the MDL bottleneck is lower in this case). But again, everything depends on the workload too -- a more generic and wider RO workload should scale pretty well in MySQL 5.6 (like Sysbench OLTP_RO), while a workload with short and fast queries (like Sysbench Point-Selects (fetch a record by PK)) hits all these contentions much harder and scales only up to 16cores-HT, with a worse result on 32cores.. But any workload similar to the Point-Select test will show you the max possible performance rate at which all the MySQL internals are able to work together (starting with the SQL parser and finishing with the row fetch round-trip). This is also the max possible SQL queries/sec (QPS) rate you may ever achieve on a given MySQL version and a given HW configuration.

For MySQL 5.6, our best result was around 250K QPS, which at the time was the best ever seen on MySQL / InnoDB with SQL queries ;-)

However, it was only achievable if READ-ONLY TRANSACTIONS (a new feature in MySQL 5.6) were used, and only with AUTOCOMMIT=1 - otherwise CPU cycles are simply wasted processing START TRANSACTION / COMMIT statements, and you're losing overall performance for sure..

So, the very first improvement introduced in MySQL 5.7 was auto-discovery of RO transactions (in fact, any InnoDB transaction is now considered read-only until a DML statement is involved in it).. - this simplified the RO transactions feature itself a lot, and life for users and developers too - no need to care anymore whether the RO transaction statement was used or not. But again, you cannot reach the max potential QPS on MySQL with this feature, as CPU time is still wasted processing transaction begin / end statements..
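To illustrate the difference (using a hypothetical sysbench-style table `sbtest1` - the table and column names here are just for illustration):

```sql
-- MySQL 5.6: the read-only optimization applies only when declared explicitly:
START TRANSACTION READ ONLY;
SELECT k FROM sbtest1 WHERE id = 42;
COMMIT;

-- MySQL 5.7: a plain transaction is treated as read-only automatically
-- until its first DML statement, so no explicit declaration is needed:
START TRANSACTION;
SELECT k FROM sbtest1 WHERE id = 42;
COMMIT;
```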

At the same time, Percona came up with a different solution to "transaction list" (TRX-list) management, lowering the trx_sys mutex contention within InnoDB as a result. Percona's solution sustained a high load better on the Point-Selects-with-transactions workload, but MySQL 5.7 was better still (though I was not able to publish the 5.7 results then, as its code was not yet public).. So, at least I'm able to do it now :


Observations :
  • the same RO Point-Select-TRX test (using transactions) on 8 tables is executed on MySQL 5.6, Percona 5.5, and then MySQL 5.7 (results are from May 2013)
  • as you can see, we're far here from the peak 250K QPS obtained on the same 16cores-HT configuration..
  • MySQL 5.6 hits contention on the trx_sys mutex, and QPS decreases from 64 users on..
  • Percona 5.5 sustains the load longer, and the QPS decrease starts only from 512 users..
  • while MySQL 5.7 already kept QPS at the time without any decrease at all (even at higher numbers of concurrent users, which you don't see on these graphs)..

However, it's clear that if one wants to get the MAX potential RO QPS rate from MySQL - transactions should be avoided..

So, let's see where our MAX QPS was in May 2013.

Same Point-Select 8-tables test, but without transactions @MySQL 5.6 :

Observations :
  • the test is executed keeping MySQL 5.6 running on 16 cores only, then 16cores-HT, then 32cores, and 32cores-HT
  • as you can see, the max QPS is even bigger than expected - 275K QPS on MySQL 5.6
  • and the max result is reached on 16cores-HT
  • while the result on 32cores is no better than on 16cores-HT (due to internal contentions; contention is managed better here by having 2 CPU threads on the same core - so the true concurrency stays at 16 threads, not 32 as it would on 32cores)..

While the same test on MySQL 5.7 looked pretty different, as by that time contention on the lock_sys mutex had already been lowered in 5.7, and the trx_sys mutex related code had also got its first changes:

Observations :
  • first of all, you can see that on the same 16cores-HT configuration 5.7 was already performing better than 5.6
  • then, performance improved yet more on the 32cores configuration!
  • reaching its max of 350K QPS on the 32cores-HT config!!
  • for the first time in history on this particular (and most aggressive) RO workload we got a better result on 32cores -vs- 16cores-HT, and it improved yet more with hyper-threading enabled (32cores-HT).. - awesome! ;-)

Compared to everything we had seen until then, this was really very good!! ;-)

On the other hand, it was clear as well that there was still room for improvement. Contention on trx_sys still remained, and we were not using the full CPU power for useful work (a lot of CPU cycles were still spent on lock spinning).. But the result was already better than ever, and way better than 5.6, so there was no valid reason to keep digging to improve performance here, while our main focus was on RW workloads, where the room for improvement was even bigger..

By the end of May, during our Perf-Meeting, Sunny added a few more changes around the trx_sys mutex contention, and our max QPS moved up to 375K QPS(!) - that should be more than enough of an improvement for 5.7, right? ;-)

At the same time, we continued the exchange with the Percona team, who proposed managing the TRX-list in a different way.. - the solution looked interesting, but on the 5.5 code base it was unable to show a higher QPS, and on the 5.6 code base (once Percona Server 5.6 was available) the max QPS was no bigger than in MySQL 5.6. However, the discussion raised a very interesting point: what will be the impact on RO performance if some RW workload is running in parallel?.. And even if the MySQL 5.7 code still ran better under the same test conditions, the impact was very visible (you may find my analysis of it here; again, I was not able to present 5.7 results at the time as the code was not yet public - I may provide them in one of the next articles now)..

And as any pure RW workload was impacted here as well, there was enough motivation to finally remaster the whole TRX-list related code the way Sunny had wanted to see it for a long time.. And this experience was simply amazing! ;-)) Day after day we enjoyed seeing our QPS graphs going higher and higher.. - until they reached 440K QPS(!) on the same 32cores-HT server.

Point-Selects 8-tables @5.7-DMR2 :

No comments.. ;-))

However, there was a small mystery around.. - Sunny and I intentionally analyzed all the bottlenecks and the impact of code changes via different tools, and on some of the tests, to my big surprise, Sunny observed higher QPS levels than me.. This "mystery" came down to the following factors:
  • under high load the 5.7 code now runs near the limits of the HW capacity (mainly CPU), so every instruction matters!
  • the difference between using a UNIX socket or an IP port becomes very visible!
  • Sysbench itself uses 30% of CPU time, but if an older Sysbench version is used (with a smaller code path), the same test load consumes only 20% CPU, leaving an additional 10% to the MySQL server ;-)
  • so, in the same test workload, by using a UNIX socket instead of an IP port and replacing Sysbench-0.4.13 with Sysbench-0.4.8, we're out-passing 500K QPS(!) - easy, right? ;-))

Let's compare "before" and "after" :

Observations :
  • lowered CPU% usage by Sysbench
  • resulting in higher CPU time availability for the MySQL server
  • and we're hitting 500K QPS..

What else?.. ;-)
I may only add: kudos Sunny & whole MySQL Dev Team! ;-)

Let's now take a look at the max QPS obtained on the same Point-Select 8-tables workload on other engines as well. I've used the latest versions available at the beginning of Sep 2013 :
  • MySQL-5.7.2 (DMR2)
  • MySQL-5.6.14
  • MySQL-5.5.33
  • Percona Server 5.6.13-rc60.5
  • Percona Server 5.5.33-rel31.1
  • MariaDB-10.0.4
  • MariaDB-5.5.32

Each engine is tested within the following configuration variations:
  • CPU taskset: 8cores-HT, 16cores, 16cores-HT, 32cores, 32cores-HT
  • Concurrent Users Sessions: 8, 16, 32 ... 1024
  • InnoDB Spin Wait Delay: 6, 96

Then the best results from all combinations for each engine are compared -vs- the others, giving the graph I already presented in the previous article:

A few comments :
  • no need to comment on the gap in the MySQL 5.7 results, it's just evident..
  • what is interesting is that none of the engines based on the MySQL 5.5 code base got anywhere close to the MySQL 5.6 results..
  • which just confirms that only once the MySQL 5.6 code base was used did Percona Server reach the 5.6 level, while MariaDB-10 is still on the road..
  • so, the MySQL 5.6 code base rocks, no doubt!
  • and MySQL 5.7 simply rocks twice! ;-))

And what about scalability?

The answer is simple: MySQL 5.7 is the only one scaling here.

Then, if an IP port is used along with the "heavier" Sysbench-0.4.13, the results are the following:

QPS is just slightly lower, but the tendency is exactly the same..

And scalability is pretty similar:

More other results will come, so stay tuned ;-)

NOTE : for single-table-bound workloads the story is not yet so good:
  • decreased contention on the InnoDB level made other contentions much more visible
  • and MDL contention became even more dominating when the load is bound to a single table..
  • this is as expected, and should be fixed within the next DMRs..

We still have many challenges ahead of us ;-)

For reference, my HW config for the presented tests was the following:
  • Server : 32cores-HT (bi-thread) Intel 2300MHz, 128GB RAM
  • OS : Oracle Linux 6.2
  • FS : EXT4 mounted with "noatime,nodiratime,nobarrier"

my.cnf:
 max_connections=4000
 key_buffer_size=200M
 low_priority_updates=1
 table_open_cache = 8000
 back_log=1500
 query_cache_type=0
 table_open_cache_instances=16

# files
 innodb_file_per_table
 innodb_log_file_size=1024M
 innodb_log_files_in_group = 3
 innodb_open_files=4000

# buffers
 innodb_buffer_pool_size=32000M
 innodb_buffer_pool_instances=32
 innodb_additional_mem_pool_size=20M
 innodb_log_buffer_size=64M
 join_buffer_size=32K
 sort_buffer_size=32K

# innodb
 innodb_checksums=0
 innodb_doublewrite=0
 innodb_support_xa=0
 innodb_thread_concurrency=0
 innodb_flush_log_at_trx_commit=2
 innodb_max_dirty_pages_pct=50
 innodb_use_native_aio=1
 innodb_stats_persistent = 1
 innodb_spin_wait_delay= 6 / 96

# perf special
 innodb_adaptive_flushing = 1
 innodb_flush_neighbors = 0
 innodb_read_io_threads = 4
 innodb_write_io_threads = 4
 innodb_io_capacity = 4000
 innodb_purge_threads=1
 innodb_adaptive_hash_index=0

# monitoring
 innodb_monitor_enable = '%'
 performance_schema=OFF


In case you need them, the Linux Sysbench binaries are available from here:
  • Sysbench-0.4.13-lux86
  • Sysbench-0.4.8-lux86

The Sysbench command used to run Point-Selects test via UNIX socket (starting 8 processes in parallel):
LD_PRELOAD=/usr/lib64/libjemalloc.so /BMK/sysbench-0.4.8 --num-threads=$1 --test=oltp --oltp-table-size=10000000 \
        --oltp-dist-type=uniform --oltp-table-name=sbtest_10M_$n \
        --max-requests=0 --max-time=$2 --mysql-socket=/SSD_raid0/mysql.sock \
        --mysql-user=dim --mysql-password=dim --mysql-db=sysbench \
        --mysql-table-engine=INNODB  --db-driver=mysql \
        --oltp-point-selects=1 --oltp-simple-ranges=0 --oltp-sum-ranges=0 \
        --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-skip-trx=on \
        --oltp-read-only=on run  > /tmp/test_$n.log &


The Sysbench command used to run Point-Selects test via IP port (starting 8 processes in parallel):
LD_PRELOAD=/usr/lib64/libjemalloc.so /BMK/sysbench-0.4.13 --num-threads=$1 --test=oltp --oltp-table-size=10000000 \
        --oltp-dist-type=uniform --oltp-table-name=sbtest_10M_$n \
        --max-requests=0 --max-time=$2 --mysql-host=127.0.0.1 --mysql-port=5700 \
        --mysql-user=dim --mysql-password=dim --mysql-db=sysbench \
        --mysql-table-engine=INNODB  --db-driver=mysql \
        --oltp-point-selects=1 --oltp-simple-ranges=0 --oltp-sum-ranges=0 \
        --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-skip-trx=on \
        --oltp-read-only=on run  > /tmp/test_$n.log &


Rgds,
-Dimitri
Posted by Dimitri at 4:16 - Comments...
Categories: MySQL

Saturday, 21 September, 2013

MySQL Performance: reaching 500K QPS with MySQL 5.7

Yes, we've done it! ;-)

Tomas just announced that we've reached the 500K QPS performance level in the OLTP_RO Point-Selects 8-tables benchmark, and I can only confirm it and tell you a little bit more:

This is the best-to-best comparison between all the listed engines, obtained on the same 32cores-HT server I used in my previously published benchmark results. Same workload, same conditions, updated players. All the details about this and other test results I'll provide during my talk tomorrow at the MySQL Connect conference (11:30AM), and then publish later in another blog post..

Well, what else to say.. - MySQL 5.7 is preparing to become the next "best ever MySQL Server release in history" ;-)

The full story is published here now : The Road to 500K QPS with MySQL 5.7

Rgds,
-Dimitri

Posted by Dimitri at 18:15 - Comments...
Edited on: Thursday, 10 October, 2013 16:16
Categories: MySQL

Thursday, 19 September, 2013

Heading to MySQL Connect

Heading to MySQL Connect right now.. Seems there will be a lot of interesting stuff - my prediction ;-)

This year I have 2 talks about MySQL Performance:

Looking forward to exchanging ideas and learning from others.. And, of course, there will be some surprises ;-))

So, stay tuned ;-)

Rgds,
-Dimitri

Posted by Dimitri at 9:43 - Comments...
Categories: MySQL

Thursday, 08 August, 2013

MySQL Tech Day @Paris

If you're around the Paris area in October, don't miss the following event, which we're preparing in collaboration with the French MySQL User Group. The event is fully technical (trust me ;-)), and all the material will be presented by my good friends from different MySQL dev teams at Oracle. I'm pretty sure this event will be really great! And I'm also expecting to have such Tech Days delivered on a more regular basis - we're becoming better and better organized, and, of course, we always have a lot of things to tell you ;-)

So, open your agenda, and follow our official announcement :


MySQL Tech Day @Paris

We're happy to announce that the MySQL Tech Day will take place in Paris on Oct 10, 2013 in the Oracle main office. It'll be a full-day event giving you an occasion to hear directly from Oracle developers about most of the improvements made recently within MySQL 5.6 and 5.7 development.

The agenda is the following :
  • Overview: MySQL Innovation @Oracle
  • MySQL Performance: Latest improvements in MySQL 5.6 & 5.7
  • MySQL Optimizer: What is new, what is coming
  • 50 tips to boost MySQL Performance
  • MySQL Performance Schema: Overview / HOWTO
  • MySQL Performance Schema: Hands-on (live lab)

Don't miss this opportunity to get most of the hot news about MySQL improvements directly from the source! It'll be just two weeks after MySQL Connect, so there will be really a lot of news! So, book this date in your agenda right now, and don't forget to register for the event - attending is totally free, but places are limited, so registration is mandatory to secure enough seats.

To register : please, contact Alexis Bensa:
  • by email: alexis.bensa(@)oracle.com
  • or directly by phone: 01.57.60.27.96

Oracle main office address: Portes de La Defense 15, boulevard Charles de Gaulle.

Further details will be communicated later.

Rgds,
-Dimitri
Posted by Dimitri at 20:38 - Comments...
Categories: MySQL

Saturday, 13 July, 2013

MySQL Performance: Why Performance Schema Overhead?..

As there have been a lot of complaints recently about various cases of PERFORMANCE SCHEMA (PFS) overhead, I've decided to start a series of articles to help you use PFS in the best way to analyze your workloads. This is the first part, starting with an overview of instrumentation problems in general..

First of all, instrumentation cannot be free! - when you want to trace or get some stats about your application's internals, you're adding new code, and this code increases the overall code path, so your final binary simply cannot run at the same speed, by definition!.. Of course, if your instrumentation code is poor, the overall overhead will be bigger ;-) and if you design it well, it'll be lower - but still not zero!..

Dynamic instrumentation brings additional cost - as soon as you're willing to add some advanced stuff within your instrumentation, it makes sense not to keep it active all of the time.. - but checking all these active/inactive flags/states brings additional cost, as more code is required to manage it.. (the best implementation of dynamic instrumentation today should probably be attributed to DTrace, but you cannot yet get it on most platforms, and it remains a story in itself)..

Instrumentation overhead depends directly on event frequency - sounds trivial, right? - the more frequently your instrumentation code is executed, the bigger the overhead you may expect. So, before you activate any instrumentation at some execution level of your code, you have to think about the frequency of events at that level.. BTW, you meet exactly the same problem with DTrace as well (for ex. see my old article with an example of a x10 slowdown under DTrace). However, when we add instrumentation to our own code, we can keep in mind what level of event frequency to expect within each part of the code, and make the overhead of such instrumentation predictable(!)..

All this sounds simple.. - until we start to feel the overhead directly related to the instrumentation involved ;-) To give you a feeling of what it could be, I wrote a simple "dumb" program which does not do anything useful, but just loops to keep the CPU warm, and within its loops checks different conditions ;-) (you may find the full source code here if you want to play with it too).

The program calls Loop_func() recursively: it does "N" loops at a given level, and then calls Loop_func() "M" times to involve execution of the next level (in fact the program simulates work like: main() -> do some work -> call function1() -> do some work -> call func2() in a loop -> do some work -> call func3() in a loop -> and so on...)

For a database code it could be something similar to:

main()
  ...
  '-> loop: execute_SQL_SELECT();    <== level 1
     '-> loop: read_ranges();           <== level 2
        '-> loop: read_rows();             <== level 3
           '-> loop: read_row();              <== level 4
              '-> ...


So, the deeper we go in levels, the bigger the probability that the code at this level is executed more frequently (more loop iterations). And then we want to instrument it to measure exactly the time we're spending there. Following the PFS idea, I'll add the following to my code around the call to Loop_func():
  • first, I'm able to compile the code with or without the "PFS instructions" (depending on whether PFS_DEF is defined or not)
  • then, I'm able to "dynamically" enable the instrumentation code according to the "PFS" arg value
  • and then, according to the enabled instrumentation at each level, I execute or not the instrumentation code at that level (for TIMED instrumentation I'm using gettimeofday(), which is probably not the most optimal, but good for an example)..
#ifdef PFS_DEF
    if( PFS )
    {
      if( Data[ level ].Enabled )  Data[ level ].CNT++;
      if( Data[ level ].Timed )  gettimeofday( &tv1, NULL );
    }
#endif

    Loop_func( level + 1, max );

#ifdef PFS_DEF
    if( PFS )
    {
      if( Data[ level ].Enabled && Data[ level ].Timed )
      {
        gettimeofday( &tv2, NULL );
        Data[ level ].TIMED += (tv2.tv_sec - tv1.tv_sec) * 1000000 + tv2.tv_usec - tv1.tv_usec;
      }
    }
#endif




Hm.. it will probably be faster for you to read the code than to read my explanations ;-)

But well, now how to use this program:
  • first of all, I've compiled 2 binaries:
    • one "without PFS enabled", named "pfs_loop-noPFS"
    • another with "PFS enabled" (PFS_DEF defined) and named "pfs_loop-withPFS"
  • then I'm executing the program with following arguments:
    • $ pfs_loop PFS Loop1_N Loop1_M Instr1 Timed1 [Loop2_N Loop2_M Instr2 Timed2]...
    • where:
      • PFS - 1/0: enabled/disabled PFS code
      • Loop1_N - number of loops to execute on the "level1" to simulate a kind of work
      • Loop1_M - number of loops with call of the Loop_func() function for the next level ("level2")
      • Instr1 - 1/0 : enabled/disabled instrumentation on the "level1"
      • Timed1 - 1/0 : enabled/disabled TIMED instrumentation on the "level1"
      • ... - and so on for each next level
      • the total number of levels is calculated then automatically

So, in the following example I'm executing the "pfs_loop" program with 4 levels in total; on the first 3 levels the program runs 1000 iterations of the "work loop" (N), and then calls execution of the next level 500 times. Only on the last (4th) level is the "work loop" shorter - 10 iterations only (similar to reading a few rows from the FS cache, etc.)..

The first execution is with a binary without any instrumentation:
$ time pfs_loop-noPFS   0 1000 500 0 0 1000 500 0 0 1000 500 0 0 10 0 0 0
real    0m4.380s
user    0m4.366s
sys     0m0.000s


Now exactly the same thing, but with the binary compiled with "PFS instrumentation", while PFS is "disabled":
$ time pfs_loop-withPFS 0 1000 500 0 0 1000 500 0 0 1000 500 0 0 10 0 0 0
real    0m4.551s <-- 4% more..
user    0m4.537s
sys     0m0.000s



The same again, but with PFS "enabled" and all instrumentation "disabled":
$ time pfs_loop-withPFS 1 1000 500 0 0 1000 500 0 0 1000 500 0 0 10 0 0 0
real    0m4.685s <-- 6.6% more..
user    0m4.671s
sys     0m0.001s


And now "enabling" instrumentation on different levels:
$ time pfs_loop-withPFS 1 1000 500 1 1 1000 500 0 0 1000 500 0 0 10 0 0 0
real    0m4.685s  <-- same 6.6% for "level 1"
user    0m4.671s
sys     0m0.000s

$ time pfs_loop-withPFS 1 1000 500 0 0 1000 500 1 1 1000 500 0 0 10 0 0 0
real    0m4.700s <-- 7.3% for "level 2"
user    0m4.684s
sys     0m0.001s

$ time pfs_loop-withPFS 1 1000 500 0 0 1000 500 0 0 1000 500 1 1 10 0 0 0
real    0m12.864s  <-- 293% for "level 3"
user    0m12.826s
sys     0m0.000s




As you can see, the lower the level on which we involve instrumentation, the bigger the overhead we get (trivial, right? - on a lower level the same instrumentation code is executed way more often, so the impact is bigger too)..

Exactly the same is happening with PERFORMANCE SCHEMA in MySQL! - the more low-level instrumentation you're enabling, the bigger the overhead you may hit.. However, keep in mind:
  • ALL PFS instrumentation is completely dynamic
  • at any time you may disable everything!
  • you may go very progressively, level by level, into more detailed instrumentation - just be aware of what exactly you're doing ;-)
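To illustrate that point - a sketch of how such runtime toggling looks in practice, assuming a standard MySQL 5.6 setup (adjust the LIKE patterns to your needs):

```sql
-- Disable ALL instrumentation and consumers at runtime, no restart needed:
UPDATE performance_schema.setup_instruments SET ENABLED = 'NO', TIMED = 'NO';
UPDATE performance_schema.setup_consumers   SET ENABLED = 'NO';

-- Then re-enable progressively, e.g. table I/O waits only:
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES'
 WHERE NAME LIKE 'wait/io/table/%';
```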


Now, let's take a popular example: the Sysbench OLTP_RO Point-Select 8-tables workload. On a 16cores-HT server this workload usually reaches its max performance level on 64 or 256 concurrent user sessions. Let's analyze the PFS impact on this workload in the following configurations:
  • noPFS -- MySQL binary was compiled without PFS
  • withPFS, PFS=off -- MySQL binary was compiled with PFS, but PFS is disabled
  • withPFS, PFS=def -- same as before, but PFS is enabled with default instrumentation (as in MySQL 5.6 by default)
  • withPFS, PFS=none -- PFS is enabled, but all instrumentations are disabled

So, we're getting:

Observations :
  • noPFS : 64usr-> 250K QPS, 256usr-> 280K QPS
  • PFS=off : 64usr-> 249K QPS, 256usr-> 275K QPS
  • PFS=def : 64usr-> 240K QPS, 256usr-> 260K QPS
  • PFS=none : 64usr-> 245K QPS, 256usr-> 270K QPS
  • so, in the worst case we have 8% "overhead" - 260K QPS -vs- 280K QPS.. (PFS default -vs- compiled without PFS)
  • while "not compiled" -vs- "compiled, but disabled" (noPFS -vs- PFS=off) is nearly the same - only 2% overhead max here..

However, the "default" PFS instrumentation is composed of the "table I/O" and "statements/digest" instrumentations; let's see the impact of each one:

Observations :
  • PFS=none : 64usr-> 245K QPS, 256usr-> 270K QPS
  • PFS=tables_only : 64usr-> 240K QPS, 256usr-> 265K QPS
  • PFS=statements_only : 64usr-> 240K QPS, 256usr-> 260K QPS
  • so, as you can see, by disabling the statements instrumentation you can reduce the max overhead from 8% to 5%
  • and all this is dynamic, without a MySQL Server restart!
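For example, the "PFS=tables_only" config can be obtained at runtime roughly like this, assuming the standard 5.6 setup tables:

```sql
-- Keep table I/O stats, but drop the statements/digest part
-- of the default instrumentation:
UPDATE performance_schema.setup_instruments SET ENABLED = 'NO', TIMED = 'NO'
 WHERE NAME LIKE 'statement/%';
UPDATE performance_schema.setup_consumers   SET ENABLED = 'NO'
 WHERE NAME LIKE '%statements%';
```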

Well, I'll go more in details within next articles, but to finish with this one, let's see how well the "alternative" solutions are working on the same workload...

Let me start with the MySQL 5.6 code patched by Facebook, providing TABLE_STATISTICS (and other stats). As Domas mentioned, there is no need for any enable/disable settings - just start the MySQL server and you get it. So, let's compare its overhead on the same workload with noPFS and PFS=def :


Observations :
  • as you can see, there is nearly no difference in overhead between PFS=def and the Facebook stats instrumentation
  • there is the same max 8% overhead -vs- noPFS.. - so, there is no instrumentation without cost, right?..
  • while PFS=def not only has table stats, but also query digest..
  • and PFS instrumentation may be dynamically reduced at any moment, while the Facebook code is always active..

Another "alternative" stats solution is implemented in Percona Server; this instrumentation is dynamic, but in fact binary: you may only enable or disable it completely, by setting the "userstat" global variable to 1 or 0. Percona Server 5.6 is only RC for the moment, and I was unable to compile it without PFS due to errors in the Percona code.. So, let's compare it on the same workload with noPFS and PFS=off (the difference in the PFS=off config between MySQL 5.6 and Percona 5.6 may be a sign of additional overhead coming from the "userstat" related code) :

Observations :
  • in fact, yes, the MySQL 5.6 max QPS with PFS=off is higher than on Percona 5.6..
  • and then, once userstat=1, the regression is horrible.. over 30%...
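For completeness - the "userstat" switch is all-or-nothing, sketched here assuming a Percona Server 5.6 instance (the INFORMATION_SCHEMA tables are the ones the userstat code populates):

```sql
SET GLOBAL userstat = 1;   -- enable all userstat collection at once
SELECT * FROM INFORMATION_SCHEMA.TABLE_STATISTICS;
SELECT * FROM INFORMATION_SCHEMA.INDEX_STATISTICS;
SET GLOBAL userstat = 0;   -- and back off
```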

So, it's very easy to complain about PFS overhead.. - but what if we try together to improve it rather than blame it?.. ;-)) PFS is a universal solution for MySQL profiling, it has HUGE power; we all just need to learn how to use it right, and then life will become much easier.. ;-)

INSTEAD OF SUMMARY :
  • Instrumentation is not free ;-)
  • Instrumentation of higher frequency events has a bigger overhead!
  • PFS is a really great tool, just need some love ;-)

Well, this was only a story about PFS instrumentation overhead, and nothing about practical usage.. So, the next one in this series will be all about practice ;-)

To be continued..

Rgds,
-Dimitri
Posted by Dimitri at 0:55 - Comments...
Categories: MySQL