This post sat in stand-by for a long time, mostly because I was expecting the observed issues to be fixed soon. But time is passing, and the problems remain. And I'm constantly asked "Dimitri, why are you now suggesting XFS, when in the past you always suggested EXT4 ??" -- I hope the following article will clarify the "why", and maybe motivate you to run your own evaluations to see how well things work for you on your own systems under your own workloads..
NOTE : this will also clarify why the new Double Write did not appear in MySQL 8.0 in 2018, as planned, but only recently (http://dimitrik.free.fr/blog/posts/mysql-80-perf-new-dblwr.html)
First of all, some background history
- historically, with MySQL we always observed better performance and more stable processing on EXT4
- there were many attempts to bring XFS to the front, but, again, historically, there were always some issues as soon as the workload became IO-bound..
- however, over the last few years we seriously addressed the problems we have in the InnoDB Double-Write (DBLWR) feature
- why do we have DBLWR in InnoDB ?
- in fact, historically (and still right now), the Linux kernel / FS layer cannot guarantee atomic IO writes
- so, DBLWR is here to protect InnoDB page writes from partially written data
- (e.g. we first write each page to some "reserved place", and once this write is confirmed, we then write the page to its own place in its datafile, so if there is a crash in the middle, we can re-apply the page write from DBLWR (or discard the DBLWR record if the crash happened during the DBLWR write))..
- however, while the DBLWR feature provides the necessary safety for user data, it also increases page write latency by x2 (in the best case) or more, and, as a result, cuts the life of any flash storage by x2 by sending twice as much IO write traffic for the same data..
- over time, many Flash Storage vendors wanted to deliver a solution for atomic IO writes on Linux (and allow InnoDB to disable DBLWR safely)..
- but until now only Fusion-io could deliver a truly safe solution (by shipping their own Storage + own Kernel Driver + own proprietary filesystem "NVMFS")
- unfortunately, Fusion-io is no longer around, and the patches they proposed for the Linux kernel have been in stand-by for over 5 years already, and there is still no expectation of seeing them pushed upstream one day..
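To make the DBLWR idea described above more concrete, here is a minimal toy sketch in Python. It is not InnoDB code: the 16-byte page, the CRC32 trailer, and the names `ToyStorage`, `make_page`, `is_valid` and `torn` are all made up for illustration, and a "crash" is simulated by leaving a half-written page behind.

```python
import zlib

PAGE_SIZE = 16  # toy page size in bytes, only to keep the example readable


def make_page(payload: bytes) -> bytes:
    """Build a fixed-size page: padded payload + 4-byte CRC32 trailer."""
    body = payload[:PAGE_SIZE - 4].ljust(PAGE_SIZE - 4, b"\0")
    return body + zlib.crc32(body).to_bytes(4, "big")


def is_valid(page: bytes) -> bool:
    """A page is valid when its trailer matches the CRC32 of its body."""
    return (len(page) == PAGE_SIZE
            and zlib.crc32(page[:-4]) == int.from_bytes(page[-4:], "big"))


def torn(new: bytes, old: bytes) -> bytes:
    """Simulate a partial write: first half is new, second half is still old."""
    half = PAGE_SIZE // 2
    return new[:half] + old[half:]


class ToyStorage:
    def __init__(self):
        self.datafile = {0: make_page(b"old")}  # page_no -> page image
        self.dblwr = {}                         # the "reserved place"

    def write_page(self, page_no, page, use_dblwr=True, crash_at=None):
        """crash_at is None, "dblwr" or "data": where the simulated crash hits."""
        old = self.datafile[page_no]
        if use_dblwr:
            if crash_at == "dblwr":
                # crash in the middle of the DBLWR write: torn DBLWR record
                self.dblwr[page_no] = torn(page, b"\0" * PAGE_SIZE)
                return
            self.dblwr[page_no] = page          # step 1: confirmed DBLWR write
        if crash_at == "data":
            # crash in the middle of the datafile write: torn page in place
            self.datafile[page_no] = torn(page, old)
            return
        self.datafile[page_no] = page           # step 2: write to the datafile
        self.dblwr.pop(page_no, None)

    def recover(self):
        """After a crash: re-apply valid DBLWR copies over broken pages."""
        for page_no, copy in self.dblwr.items():
            if is_valid(copy) and not is_valid(self.datafile[page_no]):
                self.datafile[page_no] = copy
        self.dblwr.clear()                      # torn DBLWR records are dropped
```

With `use_dblwr=True`, a crash during the datafile write is repaired by `recover()`, and a crash during the DBLWR write leaves the old (still valid) page in place. With `use_dblwr=False`, a crash during the datafile write leaves a torn page that nothing can repair, which is exactly the case atomic IO writes would have to cover.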
Now, what is going on
- our new DBLWR code for MySQL 8.0 was ready by the middle of 2018
- and we were preparing the final validation tests to evaluate the expected performance gain on systems running the latest Linux kernels
- but we were far from imagining what kind of surprises were waiting for us.. ;-))
First, we observed regressions on EXT4 while running the same workloads with the same MySQL binaries on similar HW systems, but with newer kernels.. -- which, after long evaluations, pointed to an overall regression in EXT4 in recent kernels on mixed IO-bound workloads (regardless of the HW used), and can be summarized by the following graph :
Not sure if the kernel-4.1.x series was the last one "working well", but newer versions were definitely doing much worse..
During the Linux Plumbers Conference in Lisbon there was valuable input from Jan KARA (SUSE), who suspected the problem may come from the newly implemented redesign of EXT4 internals involving shared locking, which potentially does not work efficiently on mixed RW workloads (when IO reads & writes are coming at the same time).. -- this was confirmed later by other devs, and also reproduced by Alexey, see : https://www.percona.com/blog/2019/11/12/watch-out-for-disk-i-o-performance-issues-when-running-ext4/ -- and the "real" fix for this EXT4 issue is expected to come with kernel 5.6, let's see..
And another surprise arrived with XFS.
The following graph is the kind of difference we were observing in the past between EXT4 and XFS on IO-bound MySQL workloads:
Note that this workload was running with DBLWR switched OFF, and as you can imagine, by any "normal" logic, enabling DBLWR can only make things worse, because:
- we'll write twice as much, and in particular page write latency will be doubled (in the best case)
- which will lower our potential IO read rate
- (because before we can read any given page from storage into the Buffer Pool (BP), we have to find room in the BP (a free page), and if most of the pages are modified (dirty), we have to flush a dirty page first to make it "free")
- i.e. we have to do a page IO write before we can do a page IO read
- (and as page write time will become twice as long (or more), our page reads will stay waiting twice as long (or more))..
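The reasoning above can be put into a back-of-envelope model. This is only a sketch of the "logical" expectation, not a measurement: the function `read_wait_us`, its parameters, and the latency numbers in the usage note are made up, and the x2 factor is the best-case DBLWR cost mentioned above.

```python
def read_wait_us(read_us, write_us, dirty_fraction, dblwr_on):
    """Toy estimate of the average time to bring one page into the Buffer Pool.

    dirty_fraction is the share of page reads that first have to flush
    a dirty page to get a free one (1.0 = every read waits for a write).
    """
    # best case from the text: DBLWR doubles the page write latency
    flush_us = write_us * (2 if dblwr_on else 1)
    # a read pays its own latency plus, when no free page is available,
    # the flush of a dirty page before it
    return read_us + dirty_fraction * flush_us
```

For example, with made-up latencies of 100us per read and 100us per write, and every read waiting for a flush, `read_wait_us(100, 100, 1.0, False)` gives 200.0 and `read_wait_us(100, 100, 1.0, True)` gives 300.0. So by this logic DBLWR=on should never win.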
But this is the "logical" explanation of what should happen, right ? -- now, what about the "practical" one ? -- well, "in practice" we've observed the following:
- EXT4 : doing better with DBLWR=off (blue), worse with DBLWR=on (green)
- XFS : doing worse with DBLWR=off (red), and better (!!) with DBLWR=on (yellow)
- WTF about XFS ?..
- (and especially the better result with DBLWR enabled on XFS compared to EXT4 ??)
Long story short, further investigations at least allowed us to identify a "workaround" to improve performance on XFS with DBLWR=off :
- the above graph represents the same workload, but executed with 4 different config settings
- (4 variations based on combinations of 2 config variables)
- LRU scan depth (lru) = 1K or 10K (direct impact on how many dirty pages we may write in one pass)
- IO write threads (iow) = 16 or 4 (how many IO write threads are used to complete AIO requests)
- curiously, limiting iow to 4 largely improves performance !
- now, why exactly is this helping ? => this is part of the "mystery" as of today ;-))
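If you want to replay the same 4 combinations on your own system, the small Python sketch below generates the corresponding config fragments. The mapping is my assumption, not something stated above: "lru" is taken to be `innodb_lru_scan_depth`, "iow" to be `innodb_write_io_threads`, DBLWR=off to be `innodb_doublewrite = OFF`, and 1K / 10K are taken as 1024 / 10240. The function name `config_variants` is made up.

```python
from itertools import product

# Assumed values: "1K" and "10K" taken as 1024 and 10240
LRU_SCAN_DEPTHS = (1024, 10240)
IO_WRITE_THREADS = (16, 4)


def config_variants():
    """Return {variant_name: my.cnf fragment} for the 2 x 2 combinations."""
    variants = {}
    for lru, iow in product(LRU_SCAN_DEPTHS, IO_WRITE_THREADS):
        name = f"lru{lru}_iow{iow}"
        variants[name] = "\n".join([
            "[mysqld]",
            "innodb_doublewrite = OFF",           # the workaround is about DBLWR=off
            f"innodb_lru_scan_depth = {lru}",     # assumed meaning of "lru"
            f"innodb_write_io_threads = {iow}",   # assumed meaning of "iow"
        ])
    return variants


if __name__ == "__main__":
    for name, fragment in config_variants().items():
        print(f"# --- {name} ---\n{fragment}\n")
```

Each fragment is meant to be run against the same workload, one after another, so that only these two variables change between runs.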
There are many questions around this :
- Are we saturating something at the XFS level ?
- Is any starvation happening somewhere ?
- Is there any way to find the answer to this ?..
- Could any of the XFS stats described in the following article point to something ? => http://xfs.org/index.php/Runtime_Stats
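One possible way to start looking at these XFS stats is to snapshot the counters before and after a test run and compare them. The Python sketch below assumes the counters are exposed in `/proc/fs/xfs/stat` as lines of the form "name value value ..."; the helper names (`parse_xfs_stat`, `stat_delta`, `read_xfs_stat`) are made up, and the sample numbers in the usage note are not real measurements.

```python
def parse_xfs_stat(text):
    """Parse 'name v1 v2 ...' lines into {name: [v1, v2, ...]}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            stats[fields[0]] = [int(v) for v in fields[1:]]
        except ValueError:
            continue  # skip any line that is not purely numeric
    return stats


def stat_delta(before, after):
    """Per-counter difference between two snapshots (after - before)."""
    return {
        name: [a - b for a, b in zip(after[name], before[name])]
        for name in after
        if name in before and len(after[name]) == len(before[name])
    }


def read_xfs_stat(path="/proc/fs/xfs/stat"):
    """Read the current counters (assumed path, Linux with XFS only)."""
    with open(path) as f:
        return parse_xfs_stat(f.read())
```

For example, with two made-up snapshots `"rw 100 200\nxpc 4096 8192 16384"` and `"rw 150 260\nxpc 8192 8192 32768"`, `stat_delta` returns `{"rw": [50, 60], "xpc": [4096, 0, 16384]}`. What each column means is described in the article linked above; comparing the deltas between the iow=16 and iow=4 runs could at least show which counters move differently.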
Probably over time we will answer them all.. But as of today, here is where we are ;-))
And if there is one thing you should retain from this => consider moving to XFS if you're using a Linux kernel newer than 4.1 !!!
Work in progress, stay tuned ;-)) And thank you for using MySQL !