Tuesday, 15 May, 2018
Generally, the TPC-C benchmark workload is considered one of the #1 references for Database OLTP Performance. At the same time, for MySQL users it's often not seen as "the most compelling" one for performance evaluations -- well, when you're still fighting to scale with your own very simple queries, any good result on something more complex may only look "fake" ;-)) So, for a long time Sysbench workloads remained (and will remain) the main #1 "entry ticket" for MySQL evaluation -- the simplest to install, to use, and to point to some sensitive issues (if any). Especially since a lot of improvements were made in the Sysbench code itself with the new version 1.0: it really scales now, has the lowest ever overhead, and also allows you to add your own test scenarios via extended Lua scripts (again, with the lowest ever overhead) -- so anyone can easily add whatever kind of test scenarios and share them with others ! (while I'd say "the most compelling test workload" for any given user should be the workload which most closely reproduces their production load -- and you can probably do it now with the new Sysbench, just try it !).
However, from the MySQL Dev side, every given benchmark workload is mostly seen as yet another problem (or several) to resolve. Some problems are common to many workloads, some are completely different, but generally it's never about "something cool" -- we're just progressing along this long road by fixing one problem after another (only to hit yet another one). So, the TPC-C workload for MySQL is just yet another problem to resolve ;-))
Historically the most popular TPC-C implementations for MySQL were :
- DBT-2 : an open source version of TPC-C
- TPCC-mysql : another open source version of TPC-C developed by Percona
Both versions were implemented completely differently, but at least both were very "compelling" to MySQL users, as they can run the TPC-C workload via SQL queries (and not via stored procedures, which are more popular with other DB vendors).. So, it was up to anyone's preference which of the 2 test cases to use (while personally I'd say it was always simpler to install and use TPCC-mysql). However, since the new Sysbench is here and Percona has now ported their TPCC-mysql to Sysbench -- for me there is no doubt everyone should move to Sysbench-TPCC if interested in TPC-C testing ! (and kudos to Percona for making this happen !! ;-))
So, what is good with new Sysbench-TPCC :
- first of all it's fully integrated with Sysbench, so if you already have Sysbench installed on your server, the TPCC workload will just work, as will all your old sysbench scripts for collecting the test results and so on ;-))
- it also goes further than the original TPC-C test -- it allows you to run several TPC-C data sets in parallel (I was doing the same in the past by running several TPCC-mysql or DBT-2 processes at the same time -- which allows you to see "what is the next problem") -- but now you have the same out-of-the-box!
From past testing I've already observed that the most "representative" data set size for the TPCC workload is around 1000W (1K "warehouses") -- it's not too small, nor so big that it takes a long time to generate the data or allocates too much disk space (generally it uses around 100GB in InnoDB once loaded).. -- probably over time I will also test a x10 bigger volume (or more), but for the moment the 1000W volume is already big enough to investigate MySQL scalability limits on this workload..
So far, my test scenario will be the following :
- data set :
- 1000W (single dataset as in original TPC-C workload)
- 10x100W (10 datasets of 100W executed in parallel)
- concurrent users : 1, 2, 4, .. 1024
- InnoDB Buffer Pool (BP) :
- 128GB : data set is fully cached in BP (no I/O reads)
- 32GB : not more than 1/3 of data set can be cached in BP (I/O reads and yet more writes to expect)
- HW and my.conf are the same as those used in the previous article about 8.0 OLTP_RW performance.
- and I'm also curious how well MySQL 8.0 scales on this workload when 1 CPU socket (1S) is used compared to 2 CPU sockets (2S)
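The scenario list above is easy to turn into a run matrix. Here is a minimal sketch of how it could be enumerated; the `--tables`, `--scale` and `--threads` option names are what I'd expect a Sysbench-TPCC script to take, but treat them as an assumption and check them against your own version (the helper names are hypothetical as well):

```python
from itertools import product

# Values taken from the scenario list above.
DATASETS = {"1000W": (1, 1000), "10x100W": (10, 100)}  # name -> (tables, scale)
USERS = [2 ** i for i in range(11)]                     # 1, 2, 4, .. 1024
BUFFER_POOLS_GB = [128, 32]


def build_matrix():
    """Return one dict per (data set, BP size, user count) combination."""
    runs = []
    for (name, (tables, scale)), bp_gb, users in product(
            DATASETS.items(), BUFFER_POOLS_GB, USERS):
        runs.append({
            "dataset": name,
            "bp_gb": bp_gb,
            "users": users,
            # Assumed Sysbench-TPCC options; verify against your version.
            "args": [f"--tables={tables}", f"--scale={scale}",
                     f"--threads={users}"],
        })
    return runs


matrix = build_matrix()
print(len(matrix), "runs; first one:", matrix[0])
```

Each entry only describes a run; starting MySQL with the right BP size and pinning it to 1S or 2S is left out on purpose.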
here is the result with MySQL 8.0 :
- the above graph is representing the test executed on 1S (left side) and then on 2S (right side)
- the load is starting with 1 user session, then progressively increased to 2 users, 4, 8, .. 1024
- as you can see, there is not much difference between 1S and 2S results..
We're scaling on this workload only up to 32 concurrent users, so having more CPU cores cannot bring any help here.. And what is the bottleneck ? -- we're hitting hard on index RW-lock contention here :
with such hot contention the difference between MySQL 8.0 and older versions cannot be big ;-))
Sysbench-TPCC 1000W, BP=128GB
- interesting that MySQL 8.0 is still winning here anyway !
- and even on low load 8.0 is mostly matching TPS of 5.6, which is also very positive
- (and we may expect even better TPS once the index lock contention is lowered, but this is probably not a simple fix)..
- no idea why MariaDB is not even matching the TPS level of 5.7 (while using InnoDB 5.7)
Sysbench-TPCC 1000W, BP=32GB
- the same 1000W workload, but with BP=32GB, so more IO-bound activity is expected..
- however, TPS results between BP=128GB and BP=32GB configs are not that much different, right ?
- this is just because the TPCC workload itself is not as IO-intensive as many could imagine ;-))
- (well, yes, it'll still involve a lot of I/O writes and reads, but they will often be grouped around the same data, so already cached pages can be re-used)
- (this is the complete opposite of, for ex., the Sysbench OLTP_RW workload, which with a similar amount of data and configured with BP=32GB will become extremely aggressive on I/O and get its TPS decreased by several times)
- again, no idea about MariaDB..
On the other hand, I don't recall hitting the same index lock contention while testing the 1000W dataset with DBT-2, so I was very curious to compare it with Sysbench-TPCC 1000W on 2S and with BP=128GB :
DBT-2 1000W -vs- Sysbench-TPCC 1000W (2S, BP=128GB)
- here the DBT2 workload is on the left side, and Sysbench-TPCC is on the right
- as you can see, the peak TPS level reached on DBT2 is nearly twice as high as on Sysbench-TPCC
- why ? -- this is a good question ;-))
- initially I supposed this was due to more indexes being used in the Sysbench-TPCC schema, but removing them did not help..
- in fact, everything looks similar, but there is still something which gives Sysbench-TPCC a different processing "signature", resulting in this intensive index lock contention..
- it would be great if Percona eng. could find what is making this difference, and then make an additional test scenario out of it ! -- after which we could definitely forget about DBT2 and use Sysbench-TPCC exclusively ;-))
Now, let's have a look at how things change when the 10x100W dataset is used :
here is the result with MySQL 8.0 :
- as in the previous case, the result on 1S is on the left, and 2S on the right side of the graph
- and you can see here that peak TPS on 2S is more than 50% higher !
- (but not yet near x2 ;-))
- the peak TPS on 1S is also higher than the 1000W result on 1S
This is because by using 10 datasets in parallel we multiplied the number of all tables by 10, which divides the initially observed "index lock" contention by 10 too (!) for the same number of concurrent users -- and moves the internal bottleneck to another place: now we're hitting the lock management part (lock_sys mutex contention) :
- as you can see, the index lock contention is still present, but it's divided by 10 now
- and the presence of lock_sys contention is blocking us from going further..
- work in progress, and I'm impatient to see this bottleneck gone ;-))
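The "divided by 10" reasoning above is plain arithmetic. A minimal sketch, assuming sessions spread evenly over the datasets (this even spread is my simplification, and the function name is just for illustration):

```python
def users_per_dataset(total_users, datasets):
    """Average number of sessions competing on one copy of each table."""
    return total_users / datasets


# Compare a single 1000W dataset with 10 datasets of 100W.
for users in (32, 64, 128, 256, 512, 1024):
    single = users_per_dataset(users, 1)
    split = users_per_dataset(users, 10)
    print(f"{users:5d} users: {single:7.1f} per index (1000W)"
          f" vs {split:6.1f} per index (10x100W)")
```

It says nothing about lock_sys, which is shared by all tables and so is not divided at all.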
Ok, and now -- how does MySQL 8.0 compare to other versions ?
Sysbench-TPCC 10x100W, BP=128GB
- MySQL 8.0 is still showing the best TPS result on this workload as well !
- TPS is lower -vs- 5.7 on the 512 and 1024 users load due to higher lock_sys contention in 8.0
- (by fixing REDO locking in 8.0 we also made other locks hotter, and this is as expected)
- NOTE : I could "hide" this TPS decrease by limiting thread concurrency, but I intentionally did not do it here, to see the impact of all the other locks..
- and yes, the index lock itself causes huge trouble when present -- as we see here x2 better TPS -vs- 1000W
- note as well that MySQL 8.0 is matching 5.6 TPS on low load (which, sadly, was no longer the case for 5.7)
- no idea about MariaDB..
and now the same with 32GB BP :
Sysbench-TPCC 10x100W, BP=32GB
- MySQL 8.0 is still doing better than the others here too !
- I'm a bit surprised to see 5.6 doing slightly better on low load (but I hope it'll be improved in 8.0 soon)
- again, TPS is not that much different compared to the BP=128GB config, so the workload is not as IO-bound as anyone could expect.. -- definitely not something to use as a test case if your target is Storage Performance evaluation..
And I was ready to finish my post here, when Percona published their benchmark results comparing InnoDB -vs- MyRocks on the Sysbench-TPCC 10x100W workload ! -- I was happy to see that MyRocks is progressing and doing well, but my main attention was attracted by the InnoDB results.. -- as you can see from all the results I've presented above, there is not too much difference when we're going from the 128GB BP config to the 32GB BP config.. While in the Percona results we're seeing exactly the opposite.. -- not far from x3 lower TPS between the 128GB and 32GB BP configs ! -- how is such a thing even possible then ?..
Unfortunately the article does not try to explain what is going on behind the scenes, but just shows you the numbers.. -- so, let's try to investigate this a little bit ;-))
First of all, a 2S 28cores-HT server was used in the Percona tests, so I'll limit my HW setup to 1S and use 24cores-HT only (for sure, it's not the same CPU chips, but at least the number of truly concurrent tasks executed in parallel will be similar)..
Then, comparing the configuration settings, the most notable differences are :
- checksums = ON
- doublewrite = ON
- binlog = ON
- adaptive hash index = ON (AHI)
- and lower values for : io capacity / io max / lru depth / BP instances / cleaner threads / etc..
From the follow-up Percona results you can see that this TPS drop between 128GB and 32GB BP is NOT related to binlog, so I have at least one option less to investigate ;-))
So, first of all I wanted to re-check the "base line" results with BP=128GB.
The following graph represents MySQL 8.0 under the Sysbench-TPCC 10x100W workload with different config settings -- I'll try to be short and present all the test cases together rather than one by one, so you can see all 4 tests here :
- the test #1 is executed with my config as presented in all the results above
- the test #2 is the same as #1, but doublewrite=ON and AHI=ON, and you can see a significant TPS drop..
- however, this TPS drop is exactly because of AHI ! -- and as I've mentioned to PeterZ during his "InnoDB Tutorial" @PerconaLIVE -- as soon as you have data changes in your workload, in the current implementation AHI becomes the bottleneck by itself.. -- so, the only AHI option you should keep in mind in this case is to switch AHI=OFF ! ;-))
- so, the test #3 is the same as #2, but with AHI=OFF -- and as you can see, we got our lost TPS back ! ;-))
- and another observation you may make here is that "doublewrite=ON" is not impacting the TPS result at all in the current workload.. -- even though it's still not fixed yet in MySQL 8.0
- (Sunny, please, push the new doublewrite code asap to show people a real power of MySQL 8.0 !)
- and the test #4 is with : doublewrite=ON, AHI=OFF, checksums=ON (crc32), io capacity=2K, io capacity max=4K, etc. -- mostly the same as Percona config, and you can see TPS on the same level again ;-))
- NOTE : while using such low IO capacity settings does not result in TPS drops here, it lowers the resistance of the MySQL Server instance to activity bursts -- Checkpoint Age is hitting its max limit, and sync flushing waits are already present during the test (aka "furious flushing").. -- so, I'd not suggest it as the right tuning.
- I don't test the impact of checksums here as it will simply not be present in this workload (all data is in BP, checksums are involved only on page writes, which happen in the background, so zero impact on overall processing)..
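To keep the four test cases above straight, here is a minimal sketch that renders each one as a my.cnf fragment. The variable names are the regular InnoDB ones; the base values marked "implied" are not spelled out in the text and are my reading of it, and "etc." in test #4 is left unfilled on purpose:

```python
# Base settings for test #1. Values marked "implied" are assumptions derived
# from test #2 being described as "the same as #1, but ...".
BASE = {
    "innodb_buffer_pool_size": "128G",
    "innodb_doublewrite": "OFF",            # implied
    "innodb_adaptive_hash_index": "OFF",    # implied
}

# Per-test overrides, as listed above (io capacity 2K / 4K -> 2000 / 4000).
TESTS = {
    1: {},
    2: {"innodb_doublewrite": "ON", "innodb_adaptive_hash_index": "ON"},
    3: {"innodb_doublewrite": "ON", "innodb_adaptive_hash_index": "OFF"},
    4: {"innodb_doublewrite": "ON", "innodb_adaptive_hash_index": "OFF",
        "innodb_checksum_algorithm": "crc32",
        "innodb_io_capacity": "2000", "innodb_io_capacity_max": "4000"},
}


def render(test_id):
    """Return the [mysqld] fragment for one test case."""
    settings = {**BASE, **TESTS[test_id]}
    lines = ["[mysqld]"] + [f"{name} = {value}" for name, value in settings.items()]
    return "\n".join(lines)


for test_id in TESTS:
    print(f"# test #{test_id}")
    print(render(test_id))
    print()
```

Keeping the overrides as small diffs against one base makes it obvious which single setting changed between two neighbouring tests.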
Now, let's see the same workload, but with BP=32GB :
- the first test is, again, with my initial config settings, and TPS is not too much lower than with BP=128GB..
- the test #2 is as test #1, but with doublewrite=ON and AHI=ON, and indeed, not far from x2 TPS drop..
- let's switch AHI=OFF now as in the previous case..
- the test #3 is as test #2, but with AHI=OFF, and as expected, we can see increased TPS here ;-))
- now, what is the impact of checksums ?
- the test #4 is the same as #3, but with checksums=ON (crc32) -- mostly zero impact on TPS
- and the test #5 is mostly reusing Percona config, except with AHI=off -- which is slightly lowering TPS..
So far :
- the biggest impact here is coming from doublewrite=ON
- and the impact is NOT because we're writing the data twice.. -- but because there is lock contention in the doublewrite code ! -- historically doublewrite was implemented as a small write zone, and as soon as you have many writes going in parallel -- you have a locking fight..
- the new doublewrite code was implemented by Sunny without any of these limitations, and as soon as your storage is able to follow (e.g. to write the same data twice) -- your TPS will remain the same ! ;-))
- e.g. in my case I should obtain the same over 10K TPS as you can see in the test #1
- but Percona is claiming to have it fixed, so that's why this x3 TPS drop in their results between the 128GB and 32GB BP configs surprises me.. -- is AHI impacting their tests that much ?.. -- no idea
- then, why is doublewrite impacting more in the 32GB BP config compared to 128GB BP ?
- with 32GB BP we are doing much more I/O :
- first of all, only 1/3 of data may remain cached in BP, so we'll often Read from storage
- but before being able to Read, we first need to find a free page in BP to re-use
- and if most of the pages in BP are "dirty" with changes, we need to Write these changes first before declaring a given page "free" and ready for re-use
- which results in many more Writes -vs- the 128GB BP config (where you don't have any Reads at all)
- another point : you should also keep in mind to look at TPS results as "one whole"
- e.g. if you look at the 32 users load, you'll see 7.5K TPS, but if you look at 128 users only -- you'll see 5K TPS (or even less, depending on config ;-))
- and if you're looking to reach the max possible TPS, your main load level is then around the peak TPS
- once the peak TPS is reached, your main worry then is only about how not to lose it with higher load..
- there are many solutions available around (and the most optimal IMHO is with a ProxySQL pool) -- and you have as well the good old one -- "thread concurrency" tuning ;-))
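The Read-then-Write chain described above for the 32GB BP can be illustrated with a toy LRU page cache. This is only a sketch under loud assumptions: uniform random page access (TPCC access is more "grouped"), a 50% write ratio picked arbitrarily, no background flushing, and only eviction-driven writes counted. It is not how InnoDB manages its buffer pool:

```python
import random
from collections import OrderedDict


def simulate(pool_pages, data_pages, accesses, write_ratio=0.5, seed=1):
    """Count storage reads and eviction writes for a toy LRU page cache."""
    rng = random.Random(seed)
    pool = OrderedDict()              # page id -> dirty flag, oldest first
    reads = writes = 0
    for _ in range(accesses):
        page = rng.randrange(data_pages)
        is_write = rng.random() < write_ratio
        if page in pool:
            dirty = pool.pop(page) or is_write      # cache hit, keep dirtiness
        else:
            reads += 1                              # miss: Read from storage
            if len(pool) >= pool_pages:
                _, victim_dirty = pool.popitem(last=False)
                if victim_dirty:
                    writes += 1                     # Write before re-using the page
            dirty = is_write
        pool[page] = dirty                          # most recently used at the end
    return reads, writes


print("pool holds all data :", simulate(300, 300, 20000))
print("pool holds 1/3      :", simulate(100, 300, 20000))
```

With the whole data set cached, only the initial cold reads happen and nothing is ever evicted; with 1/3 cached, every miss may first cost a write of a dirty victim.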
So, let's add the test #6 which is the same as test #4 (doublewrite=ON, AHI=OFF, checksums=ON) but with innodb_thread_concurrency=32 :
- as you can see, even on higher load TPS is "improved" now as well ;-))
- (I'd rather say it's "saved" from contention, as we're not improving anything here, just limiting the concurrency)
- one day we will have no more TPS drops on high load at all (even with thread concurrency=0), but this day is not yet today (nor tomorrow ;-))
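The idea behind "limiting the concurrency" is just an admission gate: extra sessions wait outside instead of fighting on the hot locks inside. A minimal sketch of that idea with a semaphore; this is an illustration only, not how InnoDB implements innodb_thread_concurrency, and the class and function names are hypothetical:

```python
import threading
import time


class ConcurrencyGate:
    """Admit at most `limit` workers at a time and record the observed peak."""

    def __init__(self, limit):
        self._slots = threading.Semaphore(limit)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def run(self, work):
        with self._slots:                 # wait here while `limit` are inside
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return work()
            finally:
                with self._lock:
                    self._active -= 1


def demo(sessions=128, limit=32):
    """Start `sessions` threads and return how many were ever inside at once."""
    gate = ConcurrencyGate(limit)
    threads = [threading.Thread(target=gate.run,
                                args=(lambda: time.sleep(0.01),))
               for _ in range(sessions)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return gate.peak


print("peak concurrent workers:", demo())
```

Whatever the number of sessions, the work inside never sees more than `limit` of them at once, which is why TPS stays close to its peak level instead of dropping.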
Ok, we're able to "hide" the doublewrite contention, fine ! -- but could we reduce the overall impact of Writes here ? (with fewer Writes we'll put much less stress on the doublewrite buffer, which means its lock contention will be lower, and probably overall TPS will be higher then ?..) -- And YES, in this workload it's possible ! ;-))
How ?.. -- recall that this is TPCC, i.e. a pure OLTP workload, and, as I've mentioned before -- the data access is "grouped" (so, some data is re-used from the BP cache before other data needs to be Read). And these workload conditions perfectly match the story I've explained in the 1M IO-bound QPS article with MySQL 8.0 -- let's try the same test #6, but with InnoDB configured with page size = 4KB, which will be the test #7 on the next graph :
- as you can see, with 4KB page size the TPS level is even higher than in test #1 !! ;-))
- (note : we're still having doublewrite=ON and checksums=ON)
- and with the new doublewrite code it should be just the same in all test results here (just mind to switch AHI=OFF ;-))
- also, as you can see, even with x4 times less RAM for BP (32GB -vs- 128GB) and doublewrite=ON and checksums=ON we're still NOT x3 times worse on TPS, but rather near the same result as with 128GB BP !!
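A bit of arithmetic helps to see why the smaller page size matters with the same 32GB BP. A minimal sketch, assuming the comparison is against the default 16KB InnoDB page size (the default is a fact, but that the earlier tests used it is my assumption):

```python
GIB = 1024 ** 3
KIB = 1024


def pages_in_pool(pool_bytes, page_bytes):
    """Number of pages that fit into a buffer pool of the given size."""
    return pool_bytes // page_bytes


bp = 32 * GIB
for page_kb in (16, 4):
    count = pages_in_pool(bp, page_kb * KIB)
    print(f"{page_kb:2d}KB pages: {count} pages fit in BP, "
          f"{page_kb}KB moved per page read or flushed")
```

So the same RAM tracks x4 more distinct pages, and every page-level Read or Write moves x4 fewer bytes; how much that helps depends on how many rows of a page your workload really touches.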
INSTEAD OF SUMMARY
- Sysbench-TPCC itself still has some surprises (comparing 1000W case with DBT2)
- MySQL 8.0 is doing better than any other/older version here !
- (but we're yet far from a good scalability -- work in progress, stay tuned ;-))
- believe me, you're not yet done being surprised by InnoDB ;-))
- Sunny, please, push to 8.0 the new doublewrite code ASAP !! ;-))
Thank you for using MySQL ! -- stay tuned ;-))