My Slides about MySQL 8.0 Scalability & Benchmarks from #PerconaLIVE 2019 Austin, TX

Here are my slides about MySQL 8.0 Scalability & Benchmarks from Percona LIVE 2019 in Austin, TX. (and I also realized I forgot to publish my slides from MySQL pre-FOSDEM 2019 Day about MySQL Performance Tuning, so fixing it now ;-))

P.S. Percona LIVE was "just awesome" ! ;-))



My Slides about MySQL 8.0 Performance from #OOW18 and #PerconaLIVE 2018

As promised, here are slides about MySQL 8.0 Performance from my talks at Oracle Open World 2018 and Percona LIVE Europe 2018 -- all is combined into a single PDF file to give you an overall summary about what we already completed, where we're going in the next updates within our "continuous release", and what kind of performance issues we're digging right now.. ;-))

Also, I'd like to say that both Conferences were simply awesome, and it's great to see a constantly growing level of skills of all MySQL Users attending these Conferences ! -- hope you'll have even more fun with MySQL 8.0 now ;-))


MySQL Performance : 8.0 on IO-bound OLTP_RW vs Percona Server 5.7

This article is inspired by Percona blog post comparing MySQL 8.0 and Percona Server 5.7 on IO-bound workload with Intel Optane storage. There are several claims made by Vadim based on a single test case, which is simply unfair. So, I'll try to clarify this all based on more test results and more tech details..

But before we start, some intro :

InnoDB Parallel Flushing -- was introduced with MySQL 5.7 (as a single-thread flushing could no more follow), and implemented as dedicated parallel threads (cleaners) which are involved in background once per second to do LRU-driven flushing first (in case there is no more or too low amount of free pages) and then REDO-driven flushing (to flush the oldest dirty pages and allow more free space in REDO). The amount of cleaners was intentionally made configurable as there were many worries that these threads will use too much CPU ;-)) -- but at least configuring their number equal to number of your Buffer Pool (BP) Instances was resulting in nearly the same as if you have dedicated cleaner-per-BP-instance.

Multi-threaded LRU Flusher -- was introduced in Percona Server 5.7, implementing dedicated LRU cleaners (one thread per BP instance) independently running in background. The real valid point in this approach is to keep LRU cleaners independent to so called "detected activity" in InnoDB (which was historically always buggy), so whatever happens, every LRU cleaner remains active to deliver free pages according the demand. While in MySQL 5.7 the same was expected to be covered by involving "free page event" (not what I'd prefer, but this is also historical to InnoDB). However, on any IO-bounded workload I've tested with MySQL 5.7 and 8.0 by configuring 16 BP instances with 16 page cleaners and with LRU depth setting matching the required free page rate -- I've never observed lower TPS comparing to Percona..

Single Page Flushing -- historically, in InnoDB when a user thread was not able to get a free page for its data, it was involving a "single page flush" itself, expecting to get a free page sooner -- the motivation behind such an approach was "better to try to do something than just do nothing". And this was blamed so often.. -- while, again, it's largely exaggerated, because the only real problem here is coming due a historical "limited space" for single page flush in DoubleWrite Buffer, and that's all. To be honest, making this option configurable could allow anyone to evaluate it very easily and decide to keep it ON or not by his own results ;-))

DoubleWrite Buffer -- probably one of the biggest historical PITA in InnoDB.. -- the feature is implemented to guarantee page "atomic writes" (e.g. to avoid partially written pages, each page is written first to DoubleWrite (DBLWR) place, and only then to its real place in data file). It was still "good enough" while storage was very slow, but quickly became a bottleneck on faster storage. However, such a bottleneck you could not observe on every workload.. -- despite you have to write your data twice, but as long as your storage is able to follow and you don't have waits on IO writes (e.g. not on REDO space nor on free pages) -- your overall TPS will still not be impacted ;-)) The impact is generally becomes visible since 64 concurrent users (really *concurrent*, e.g. doing things on the same time). Anyway, we addressed this issue yet for MySQL 5.7, but our fix arrived after GA date, so it was not delivered with 5.7 -- on the same time Percona delivered their "Parallel DoubleWrite", solving the problem for Percona Server 5.7 -- lucky guys, kudos for timing ! ;-))

Now, why we did NOT put all these points on the first priority for MySQL 8.0 release ?
  • there is one main thing changes since MySQL 8.0 -- for the first time in MySQL history we decided to move to "continuous release" model !
  • which means that we may still deliver new changes with every update ;-))
  • (e.g. if the same was possible with 5.7, the fix for DBLWR would be already here)
  • however, we should be also "realistic" as we cannot address fundamental changes in updates..
  • so, if any fundamental changes should be delivered, they should be made before GA deadline
  • and the most critical from such planned changes was our new REDO log implementation !
  • (we can address DBLWR and other changes later, but redesigning REDO is much more complex story ;-))
  • so, yes, we're aware about all the real issues we have, and we know how to fix them -- and it's only a question of time now..

So far, now let's see what of the listed issues are real problems and how much each one is impacting ;-))

For my following investigation I'll use :
  • the same 48cores-HT 2S Skylake server as before
  • x2 Optane drives used together as a single RAID-0 volume via MDM
  • same OL7.4, EXT4
  • Sysbench 50M x 8-tables data volume (same as I used before, and then Vadim)
  • Similar my.conf but with few changes :
    • trx_commit=1 (flush REDO on every COMMIT as before)
    • PFS=on (Performance Schema)
    • checksums=on (crc32)
    • doublewrite=off/on (to validate the impact)
    • binlog=off/on & sync_binlog=1 (to validate the impact as well)

Test scenarios :
  • Concurrent users : 32, 64, 128
  • Buffer Pool : 128GB / 32GB
  • Workload : Sysbench OLTP_RW 50Mx8tab (100GB)
  • Config variations :
    • 1) base config + dblwr=0 + binlog=0
    • 2) base config + dblwr=1 + binlog=0
    • 3) base config + dblwr=0 + binlog=1
    • 4) base config + dblwr=1 + binlog=1
    • where base config : trx_commit=1 + PFS=on + checksums=on

NOTE : I did not build for all the following test results "user friendly" charts -- all the graphs are representing real TPS (Commit/sec) stats collected during the tests, and matching 3 load levels : 32, 64, and 128 concurrent users.


So far, let's start first with Buffer Pool =128GB :

OLTP_RW-50Mx8tab | BP=128G, trx=1, dblwr=0, binlog=0

Comments :
  • with Buffer Pool (BP) of 128GB we keep the whole dataset "cached" (so, no IO reads)
  • NOTE : PFS=on and checksums=on -- while TPS is mostly the same as in the previous test on the same data volume (where both PFS & checksums were OFF), they are not impacting here
  • as well from the previous test you can see that even if the data are fully cached in BP, there is still an impact due used storage -- the result on Intel SSD was way lower than on Intel Optane
  • so, in the current test result you can see that MySQL 8.0 also getting better benefit from a faster storage comparing to Percona Server, even if the given test is mostly about REDO related improvements ;-))

And now, let's switch DobleWrite=ON :


MySQL Performance : 8.0 GA on IO-bound TPCC

This post is mainly inspired by findings from the previous testing of MySQL 8.0 on TPCC workload(s) and observations from IO-bound Sysbench OLTP on Optane -vs- SSD. But also by several "urban myths" I'm often hearing when discussing with users about their IO-bound OLTP performance problems :

Myth #1 : "if I'll double the number of my storage drives -- I'll get x2 times better TPS !"
  • this was mostly true during "HDD era", and again..
  • (ex.: a single thread app doing single random IO reads from a single HDD will not go faster by doing the same from 2x HDD -- similar like single thread workload will not run faster on 8CPU cores -vs- 2CPU cores, etc.)
  • all depends on your workload and how many parallel IO operations you're involving..
  • indeed, it is much more easier to saturate HDD, but it's much more harder to do it with modern SSD/NVMe
  • NOTE : we're speaking about OLTP (e.g. if you started to observe full table scans in your workload -- means you're already doing something wrong ;-))
  • simple rule : if you're not saturating your storage on any of its limits => you'll not see any gain by adding more drives, you'll probably just have a bigger storage space, that's all.

Myth #2 : "I'll go faster with flash drive which is claimed capable x2 times more IOps in specs than my current drive !"
  • if you're expecting to run OLTP workload, rather pay attention to IO latency first !
  • sometimes it may be not mentioned these x2 times more IOps were obtained with x4 times more IO threads ;-))
  • e.g. a drive capable of x2 times more IOps but with x4 times higher latency will still be x4 times slower than your current drive on 1 user load, and 2, and 4, and .. maybe up to 64 ? (depends on where your current drive the limit is reached) -- and if you don't have more than 64 concurrent users ? -- then you'll never see your x2 times more IOps, but rather x4 times worse TPS ;-))

Test it yourself - this is the only advice I could give you !
  • because "only a real test will give you a real answer" (and I'm repeating to say it again and again.. ;-))
  • testing your own workload will give you the best answer !
  • otherwise you may still use some generic benchmark workloads which are representative for you
  • for ex. your new flash drive may be look better from all points and passing very well all generic pure IO tests, but show x3 times worse results once used by MySQL -- and this is just because, for ex., every involved fsync() will take x3 times more time, etc. (based on real story, no kidding ;-))

So far, for the following story I'll use :
  • Sysbench-TPCC workload, 10x100W (x10 of 100 warehouses, ~100GB data, here is why)
  • Same Skylake server as before, same config, etc.
  • and the same Optane & SSD as in the previous testing, except that I'll also add to the game x2 Optane drives used as a single RAID-0 MDM volume !
  • EXT4 is used on top of each drive or volume

Starting Test scenario :
  • concurrent users : 1, 2, 4, .. 1024
  • trx_commit : 1 (flush REDO on every COMMIT)
  • Buffer Pool (BP) : 128GB (in-memory), 32GB (IO-bound)

Sorry, no "user friendly" graphs this time, but hope it'll still be easy to understand the results ;-))

On the first graph you'll see :
  • 3 test results : with x2 Optaine first, then with single Optane, and then single SSD
  • on each test the load is going from 1, 2, 4 .. up to 1024 concurrent users
  • the Commits/sec curve is showing obtained TPS level

and the first result is with 128GB BP :

InnoDB Buffer Pool 128GB

Comments :
  • first of all, as you can see, TPS is not better with two -vs- one Optane
  • as we're not hitting any single limit of Optane drive, there is no any gain by using x2 ;-))
  • also, regardless the used data volume is ~100GB, we're not observing here such a big difference between Optane -vs- SSD as it was with Sysbench OLTP on a similar data size and still "in-memory"..
  • this because TPCC workload is much less aggressive on IO writes, so REDO flushing is less impacted..

now, how TPS will be impacted if BP size was smaller, just 32GB ?


MySQL Performance : 8.0 GA on IO-bound Sysbench OLTP with Optane -vs- SSD

MySQL Performance on IO-bound workloads is still extremely depending on the underlaying storage layer (thus is directly depending on your Storage Performance).. Indeed, flash storage is definitively changing the game, but even with flash there is, as usual, "flash and flash" -- all storage vendors are improving their stuff constantly, so every time you have something new to discover and to learn ;-)) During all my MySQL 8.0 GA tests I was very pleasantly surprised by IO performance delivered by Intel Optane SSD. However, what the storage device can deliver alone on pure IO tests is not at all the same to what you could observe when it's used by MySQL -- unfortunately, in the past I've observed many cases when with a device claimed to be x2 times faster we were even not observing 10% gain.. But MySQL 8.0 is probably the most best placed MySQL version today to re-visit all this IO-bound story (there are many "under-hood" changes in the code helping to use your storage more efficiently) -- well, we're definitively yet very far from "perfect" ;-)) -- but again, as usual -- "everything is relative".. And for the "relative" point for this article I'll propose you to get a look on 2 different flash drives from Intel :

yes, both devices are from Intel, so it's only Intel -vs- Intel (e.g. no blame for "competition" post ;-))

so far, if you'll look on their specs, both drives are looking pretty good for avg. IO-bound workloads, except that Optane drive is claimed to be x5 times faster in all categories (and it's particularly impressive with its low IO latency -- see 1M IO-bound QPS with MySQL 8.0 -- so, can confirm at least for latency ;-)) -- however, will the x5 claim still be valid in different conditions / workloads ?

For my testing I'll use the same config as before (the same Skylake server as well), and the key config points are :
  • InnoDB page size : 16K
  • trx_commit : 1 (flush REDO on every commit)
  • single flash drive (just EXT4 on top of single drive, no RAID, no LVM, etc..)
  • InnoDB Buffer Pool (BP) : 128G / 64G / 32G (will vary along with test scenarios)

(no double write, no binlog for the moment -- these ones will be covered later, patience ;-))..

So far, let's start first with so called "in-memory" Sysbench OLTP_RW workload :
  • test workload : Sysbench OLTP_RW uniform
  • concurrent users : 1, 2, 4, .. 1024
  • data volume : 8 tables of 10M rows each (~20GB)
  • InnoDB Buffer Pool (BP) : 128GB
  • so, all the data will be fully cached in BP, thus NO any IO reads (the reason to be so called "in-memory")
  • however, there will be definitively IO writes (as there are writes in OLTP_RW ;-))
  • and in our case, if all is going well, there will be only 2 types of writes :
    • 1) dirty pages flushing
    • 2) REDO log writes
  • the 1) is going in background, and as soon as your flushing is fast enough -- your performance is not impacted
  • the 2) are pure sequential writes only mixed with periodic fsync(), and generally as soon as you use a good flash drive for your REDO files -- you're fine ;-))

Ok, what about results ?

Sysbench OLTP_RW 10Mx8-tables BP = 128GB : Intel SSD -vs- Intel Optane

Comments :
  • on the left side of the graph you can see TPS level of MySQL 8.0 running on SSD, and on the right side -- the same but on Optane
  • concurrent users level is progressing from 1 to 2, 4, 8, .. 1024
  • and from TPS results you can see that up to 16 users there is nearly no TPS difference between SSD and Optane
  • then Optane is slightly better on 64usr level, and definitively better on higher load
  • but you don't see any sign of expected x5 difference, right ?
  • and if you'll stop here, you could just say "all this Optane story is b*shit !", Dimitri, WFT ?.. ;-))

Indeed, on such a small data volume and having all the active data set cached in BP you'll not see much difference with a faster storage.. -- however, what will happen if your data will grow ?..


MySQL Performance : IP port -vs- UNIX socket impact in 8.0 GA

Generally, when I'm analyzing MySQL Performance on Linux with "localhost" test workloads, I'm configuring client connections to use IP port (loopback) to connect to MySQL Server (and not UNIX socket) -- this is still at least involving IP stack in the game, and if something is going odd on IP, we can be aware ahead about. And indeed, it already helped several times to discover such kind of problems even without network links between client/server (like this one, etc.). However, in the past we also observed a pretty significant difference in QPS results when IP port was used comparing to UNIX socket (communications via UNIX socket were going near 15% faster).. Over a time with newer OL kernel releases this gap became smaller and smaller. But in all such cases it's always hard to say if the gap was reduced due OS kernel / IP stack improvements, or it's just because MySQL Server is hitting new scalability bottlenecks ;-))

Anyway, I've still continued to use IP port in all my tests until now. But recently the same discussion about IP port impact -vs- UNIX socket came up again in other investigations, and I'd like to share the results I've obtained on different HW servers with MySQL 8.0 GA release.

The test workload is the most "aggressive" for client/server ping-pong exchange -- Sysbench point-selects (the same test scenario which initially already reported 2.1M QPS on MySQL 8.0) -- but this time on 3 different servers :
  • 4CPU Sockets (4S) 96cores-HT Broadwell
  • 2CPU Sockets (2S) 48cores-HT Skylake
  • 2CPU Sockets (2S) 44cores-HT Broadwell

and see how different the results will be when exactly the same test is running via IP port / or UNIX socket.

Broadwell 2S 44cores-HT

- IP port :
- UNIX socket :
Comments :
  • wow, up to 18% difference !
  • and you can clearly see MySQL 8.0 out-passing 1.2M QPS with UNIX socket, and staying under 1.1M QPS with IP port..


MySQL Performance : more in depth with latin1 and utf8mb4 in 8.0 GA

Looking on my previously obtained results on Read-Only (RO) tests for latin1 and UTF8 charsets, one question continued to turn in my mind :

- if MariaDB 10.3 is hitting a so deep drop on "distinct-ranges" workload :

- why then this is not impacting the "mixed" OLTP_RO workload results (which is containing "distinct-ranges" query too) :

The answer was within the test title :
  • I've missed one zero in my scripts while preparing initial tests.. ;-))
  • so, the "distinct-ranges" test was using range size=10 (instead of 100 by default)
  • while "mixed" OLTP_RO remained with default settings, and used range size=100 for all range tests..
  • was the use of a smaller range size which that much impacted MariaDB ?..
  • (generally people are expecting to see wider range queries to be more impacting than the smaller ones, right ?)..

To clarify this all -- I decided to replay RO tests with a more detailed analyze..


MySQL Performance : 8.0 GA and TPCC Workloads

Generally TPC-C benchmark workload is considered as one of the #1 references for Database OLTP Performance. On the same time, for MySQL users it's often not something which is seen as "the most compelling" for performance evaluations.. -- well, when you're still fighting to scale with your own very simple queries, any good result on something more complex may only look as "fake" ;-)) So, since a long time Sysbench workloads remained (and will remain) as the main #1 "entry ticket" for MySQL evaluation -- the most simple to install, to use, and to point on some sensible issues (if any). Specially that since new Sysbench version 1.0 a lot of improvements were made in Sysbench code itself, it really scales now, has the lowest ever overhead, and also allowing you to add your own test scenario via extended LUA scripts (and again, with lowest ever overhead) -- so, anyone can easily add whatever kind of different test scenarios and share with others ! (while I'd say "the most compelling test workload" for any given user should be the workload which is the most closely reproducing his production load -- and you can probably do it now with new Sysbench, just try it !).

However, from MySQL Dev side, every given benchmark workload is mostly seen asa yet one (or several ones) problem(s) to resolve. Some of problems are common for many workloads, some are completely different ones, but generally it's never about "something cool" -- and we're just progressing in this long road by fixing one problem after another (to hit yet another one again). So, TPC-C workload for MySQL is just yet another problem to resolve ;-))

Historically the most popular TPC-C implementations for MySQL were :
  • DBT-2 : an open source version of TPC-C
  • TPCC-mysql : another open source version of TPC-C developed by Percona


MySQL Performance : 1M *IO-bound* QPS with 8.0 GA on Intel Optane SSD !

Historically, Random I/O Reads were always a major PITA for any OLTP workload.. If Random I/O Writes you could yet "delay" via controller's caches (or any kind of other battery-protected caches -- specially if Writes are coming in bursts), there is no way to "predict" I/O Reads if they are fully Random (so you cannot "cache" or "prefetch" them ahead and have to deliver the data directly from storage, read by read.. -- which is hitting a huge "rotation penalty" on HDD).

Indeed, things changed dramatically since arriving of Flash Storage. You don't need to spend any particular attention if your I/O Reads are Random or Sequential. However, you still need to keep in mind to not hit the overall throughout limit of your Flash Device. As the result, reading by smaller I/O blocks allowing you to do more I/O operations/sec than with bigger blocks.
And what about InnoDB ? -- InnoDB is using by default 16KB page size (so by default all Random I/O Reads are of 16KB) :

And with 16KB Random Reads you definitively will reach your throughput limit sooner than with 8KB or 4KB Reads. Many users are seeing "compression" as the best matching solution here. And, indeed, if you're very lucky and can compress your data, say, by x4 times => you'll Read then only 4KB from storage to deliver 16KB of data. And, yes, you'll be able to deliver x4 times more Reads/sec :


MySQL Performance : 8.0 RW & Binlog impact

In the previous article I've intentionally skipped the topic related to Binlog impact on MySQL 8.0 Performance, because it's not a short story, nor a simple one..

In fact, for most of people Binlog in MySQL is generally representing and additional overhead, and historically it was true. Since MySQL 5.6 there is Binlog Group Commit (BGC) feature available, and it was rather doing well, decreasing the gap between "binlog=OFF" and "binlog=ON sync_bin=1". However, storage vendors are making flash drives more and more better from year to year.. And when we delivered MySQL 5.7 the scope of Binlog impact moved with code and flash improvements -- the main impact was no more coming from the I/O operations related to Binlog, but to the Binlog code itself ! -- indeed, this may sound odd initially, but let's go to "pictures" to see it better in details ;-))

So far, I'll reuse the same Skylake server as before, but will reduce it to 1S (1CPU Socket, 24cores-HT) -- so, you don't believe it's all because I'm using a "big HW" (while 2S HW is the most common server config in all data centers where people are running MySQL, and having 16-24cores per CPU Socket is what is todays "commodity HW") -- but well, let's stay with 24cores-HT for the following experiment ;-)) And as the storage it'll be the same Optane drive with EXT4 filesystem.

I'll now replay the same Sysbench OLTP_RW and Update-NoKEY tests as before, but using only 1S (24cores-HT) and with the same my.conf but :
  • 1) binlog=OFF
  • 2) binlog=ON sync_bin=0
  • 3) binlog=ON sync_bin=1000
  • 4) binlog=ON sync_bin=1

and the results are here, each graph is representing a given test workload growing with number of concurrent user sessions from 1, 2, 4, .. to 1024 :

Sysbench OLTP_RW 10Mx8-tables @MySQL-5.7

Comments :
  • as you can see, on a higher load enabling Binlog is helping a higher TPS !
  • (well, in fact it's helping not to gain, but rather not to loose TPS on high load)
  • but in any case, you're seeing here a positive impact ! ;-))
  • and you can understand that in such a case it was difficult to blame Binlog code along MySQL 5.7 time ;-))

Specially that even more important "positive impact" happens on much more aggressive writes with Update-NoKey workload :