
Monday, 17 August, 2009

MySQL Performance: Final fix for Ahead Flushing & Purge Lag

The goal of this post is to tell you the final story of a performance study looking to fix the Ahead Flushing and Purge Lag issues.. (I've tried to make it short, but if you feel you need more details or just have some more time for reading - go directly to the full report: http://dimitrik.free.fr/db_STRESS_MySQL_540_Purge_Lag_and_Ahead_Flushing_Fixed_Aug2009.html  :-))

Before telling you the story of the final solution, let me show you again the TPS level graph obtained during my last tests:

[Graph: TPS level during a one-hour Read+Write workload on the 4 engines]

The read+write workload is executed for one hour non-stop on 4 engines (one after another):

  • MySQL 5.4
  • InnoDB plugin-1.0.4
  • MySQL 5.4-Perf build #45
  • XtraDB-6

As you may see from the graph, all engines are getting periodic performance drops!!

NOTE :

  • InnoDB plugin-1.0.4 is using the recently added "Adaptive Flushing" feature !
  • XtraDB-6 is using Percona's famous "Adaptive Checkpoint" !

So, WHY are they also getting performance drops???...

And let me repeat it again - because during a heavy read+write workload, under the current InnoDB design, the Master thread never leaves the purge loop! So whatever "Adaptive" flushing or checkpoint code there is, as well as the dirty page limit check, is never reached within the Master thread! - (for more details see my initial report and next story )..
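(Schematically, the purge part of the Master thread's background loop looks like this - a simplified sketch for illustration only, not the literal InnoDB source:)

 /* simplified sketch of the purge part of srv_master_thread()'s
    background work (not the literal source code) */
 for (;;) {
  ulint n_pages_purged = trx_purge();

  if (n_pages_purged == 0) {
   break; /* under a heavy RW workload we never get here - */
  }       /* there is always more to purge! */
 }
 /* ...adaptive flushing and the dirty page % check live below
    this loop, so they are never reached during the workload */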

(To understand all these critical points better, I've instrumented InnoDB with light counters and was able to monitor it live during workload activity (see Extension of InnoDB Status command for more details...))

Initially I tried to extend the purge loop with ahead flushing to avoid "furious flushing". The modification is light, but it cannot be a true fix because all the other Master code should still be executed too...

And then an absolute radical idea changed everything ! :-)

  • Tim Cook came up with the idea: if it's so, WHY NOT isolate the purge processing within a separate thread ?..
  • and Vince Carbone made an initial code split with separated Master and Purge threads!!!

Why is having a separate purge thread absolutely great ?! ;-)

  • because after that, at least all the Master thread code will work as expected in any situation!! :-))
  • we may leave purge looping alone forever and not worry about it anymore! :-))
  • the Master thread activity becomes way simpler! :-))

This is why I've always loved Sun - it's a place of innovation !! :-))

This idea inspired me to adapt all my changes to the split threads, and the current logic works as follows:

Purge :

  • it's a separate thread now, just looping alone on the trx_purge() call and doing nothing else ! :-)
  • by design, purge should initially be invoked once per 10 sec, but I've made this interval auto-adaptive: if no work was done during the last loop, the sleep timeout increases (still respecting the 10 sec max), and if there was some purge work done, the sleep timeout decreases (still respecting the 10 ms min) - it works just fine and adapts itself to the active workload (see the sketch just below) :-)
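(Just to illustrate the idea, a minimal sketch of such an auto-adaptive purge loop - this is NOT the real patch code; the doubling/halving policy and the purge_thread_loop() name are my illustrative assumptions, only trx_purge() and the 10 ms .. 10 sec bounds come from the description above:)

 /* illustrative sketch of the auto-adaptive purge loop (not the patch) */
 static void purge_thread_loop(void)
 {
  ulint sleep_ms = 10000; /* start with the designed 10 sec interval */

  for (;;) {
   ulint n_purged = trx_purge(); /* one purge batch; returns work done */

   if (n_purged == 0) {
    /* no work done: back off, but never sleep more than 10 sec */
    sleep_ms = ut_min(sleep_ms * 2, 10000);
   } else {
    /* purge had work to do: wake up sooner, but not below 10 ms */
    sleep_ms = ut_max(sleep_ms / 2, 10);
   }

   os_thread_sleep(sleep_ms * 1000); /* the argument is in microseconds */
  }
 }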

Master :

  • no more purge! :-)
  • no more checks for redo log flush! :-)
  • flushing redo log within a 1sec loop (as designed)
  • do all other stuff within a 1sec or 10sec loop (as designed)
  • for the first time on my workload, I saw the Master thread checking the dirty page percentage limit!!! :-)))

Wow! That's really great!!! :-))


Ahead Flushing

Now, when the dirty percentage limit is really checked, do we still need Ahead Flushing? ;-))

And I will tell you YES! :-)

Why?..

  • currently, when the dirty page limit is reached, there will be a burst buffer flush at the 100% I/O capacity setting, which may be very heavy; and setting the I/O capacity lower than the real one will not be good either, as it may be too low to flush fast enough (and the goal is to avoid the critical checkpoint age level)

  • it's quite painful to calculate every time whether the dirty page limit is set correctly according to the redo log and buffer pool sizes - it'd be much better to let the database engine take care of it, no? ;-)

  • there is no half-way solution: either it's not flushing at all, or it's flushing at 100% I/O capacity :-)

So, yes, we need it!

And with the current separated-thread model, both Percona's Adaptive Checkpoint and InnoDB's freshly available Adaptive Flushing enter the game perfectly now! :-)) However, I absolutely wanted to test a final solution, and neither "Adaptive" code was directly adaptable to MySQL 5.4... So I made my own :-) with a small personal touch :-))
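(The exact code is in the final report; just to give you a feeling of the approach, here is a minimal sketch of a proportional ahead-flushing heuristic - the log_checkpoint_age() / log_max_checkpoint_age() helpers and the 1/2 .. 7/8 thresholds are my illustrative assumptions here, not necessarily what the real patch does:)

 /* illustrative sketch only: flush ahead proportionally to how close
    the checkpoint age comes to its max, so the 100% I/O capacity burst
    at the dirty page limit is never needed */
 static ulint ahead_flush_batch(void)
 {
  ulint age;     /* current checkpoint age */
  ulint max_age; /* max allowed checkpoint age */
  ulint pct;

  age = log_checkpoint_age();         /* hypothetical helper */
  max_age = log_max_checkpoint_age(); /* hypothetical helper */

  if (age < max_age / 2) {
   return(0); /* still far from the critical level: nothing to do */
  }

  /* scale from 0% to 100% of io_capacity between 1/2 and 7/8 of max */
  pct = (age - max_age / 2) * 100 / (max_age * 3 / 8);
  if (pct > 100) {
   pct = 100;
  }

  return(buf_flush_batch(BUF_FLUSH_LIST,
         srv_io_capacity * pct / 100, ut_dulint_max));
 }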

You may find all the details in the final report , but I'll just show here that from the previous situation:

[Graph: before the fix - periodic TPS drops and furious flushing]

we moved to the following one:

[Graph: after the fix - no more periodic drops]
A few comments:

  • no more periodic performance drops!
  • the checkpoint age stays under 1.2 GB - probably a shorter time in case of recovery? ;-)
  • the critical checkpoint age is never reached! :-)
  • most of the buffer flushing is now invoked from the Ahead Flushing code
  • and still no more than one flush per second is really executed..

Everything goes pretty well now.. except a constantly growing "History len" value...


Purge Lag

Observing my workload graphs, I was surprised to see a constantly growing "History len" value - this number represents the current number of un-purged pages..

Why it happens, and where the danger lies, you may find in the MySQL manual:

In the InnoDB multi-versioning scheme, a row is not physically removed from the database immediately when you delete it with an SQL statement. Only when InnoDB can discard the update undo log record written for the deletion can it also physically remove the corresponding row and its index records from the database. This removal operation is called a purge, and it is quite fast, usually taking the same order of time as the SQL statement that did the deletion.

In a scenario where the user inserts and deletes rows in smallish batches at about the same rate in the table, it is possible that the purge thread starts to lag behind, and the table grows bigger and bigger, making everything disk-bound and very slow. Even if the table carries just 10MB of useful data, it may grow to occupy 10GB with all the "dead" rows. In such a case, it would be good to throttle new row operations and allocate more resources to the purge thread. The innodb_max_purge_lag system variable exists for exactly this purpose.

(see http://dev.mysql.com/doc/refman/5.0/en/innodb-multi-versioning.html for more details)

In my case, every time the test finished, it still took 15-20 minutes for InnoDB to flush all the dirty pages!

Observations:

  • the History len surpasses 6 million
  • once the workload is finished, InnoDB starts to decrease the purge gap
  • the dirty pages level cannot reach zero until the history len becomes zero too
  • it took 17 minutes to free all the dirty pages!

This workload was kept running for just one hour. But what will happen after a whole day of activity?..

The problem seems to come from the fact that the purge thread cannot keep up with the workload activity - there are too many pages to purge...

What kind of solution may be used here?

  • speed up the purge processing (even if it may be quite costly - there are many user threads modifying data and only one purge thread cleaning removed rows, so several purge threads will be needed, which leaves less CPU power for useful work (and finally you'll probably not win anything in your workload throughput) - however, even the single thread currently does not run at its full speed, as it crosses some common locks with other threads during its work..) - but with time, I think this direction will get the main priority!..

  • add a kind of throttling for write operations to keep their activity at the same level as your maximum possible purge throughput - it'll slow down your transactions a little bit, but avoid a big potential disaster!..

The second solution is already implemented within InnoDB - you may set the innodb_max_purge_lag parameter to tell InnoDB to keep the purge gap under this limit. But there is only one problem - it doesn't work here..
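(To give a quick example, the same value I'm using in the test below may simply be set in my.cnf - the value itself comes from my test, not a general recommendation; the variable may also be changed at runtime via SET GLOBAL:)

 [mysqld]
 # delay DML statements once the purge lag (History len) exceeds 200,000
 innodb_max_purge_lag = 200000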

WHY?..

As I explained in my previous report , the innodb_max_purge_lag condition is ignored as long as InnoDB considers there is a consistent read view which may need to see the rows to be purged. The problem is that I don't have the feeling it works properly: as long as there is any SELECT within my workload, InnoDB considers it a consistent read, even if it started well after the last DML statement was committed. Probably keeping track of the oldest LSN for SELECTs may help?.. But well, let's get back to the innodb_max_purge_lag setting:

This variable controls how to delay INSERT, UPDATE, and DELETE operations when purge operations are lagging (see Section 13.2.9, "InnoDB Multi-Versioning"). The default value is 0 (no delays). The InnoDB transaction system maintains a list of transactions that have delete-marked index records by UPDATE or DELETE operations. Let the length of this list be purge_lag. When purge_lag exceeds innodb_max_purge_lag, each INSERT, UPDATE, and DELETE operation is delayed by ((purge_lag/innodb_max_purge_lag)×10)−5 milliseconds. The delay is computed in the beginning of a purge batch, every ten seconds. The operations are not delayed if purge cannot run because of an old consistent read view that could see the rows to be purged.

(see http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html#sysvar_innodb_max_purge_lag)
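(A quick worked example: with innodb_max_purge_lag = 200000 and a current purge_lag of 300000, each DML operation gets delayed by ((300000/200000)×10)−5 = 10 milliseconds.)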

But honestly - what is the problem with the DML query delay and a consistent view?.. If innodb_max_purge_lag is set, it means we want to avoid any purge lag higher than this value! And if it becomes higher while new inserts/deletes/updates keep arriving - what else can we do here if we do not delay them just a little bit? The same thing may be done within a customer application (if things are going slowly - add a few ms of sleep between transaction orders) - will it be broken after that? ;-)) The same logic is implemented within many file systems (and recently in ZFS too)..

So, my fix is simple here - removing the consistent read check within the trx_purge() function! :-)

In short, the:

 if (srv_max_purge_lag > 0
  && !UT_LIST_GET_LAST(trx_sys->view_list)) {

becomes:

 if (srv_max_purge_lag > 0) {

I also limit the max potential delay value to 50 ms.
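(Putting both changes together, here is a hedged sketch of how the modified code path may look - the delay formula follows the manual quote above, srv_dml_needed_delay and trx_sys->rseg_history_len mirror existing InnoDB names, and the exact patch is within the report:)

 /* sketch only, NOT the exact patch: the view_list check is removed,
    and the computed DML delay is capped at 50 ms */
 if (srv_max_purge_lag > 0) {
  float ratio = (float) trx_sys->rseg_history_len / srv_max_purge_lag;

  if (ratio > 1.0) {
   /* manual formula: ((purge_lag / max_purge_lag) * 10) - 5 ms,
      stored here in microseconds */
   ulint delay = (ulint) ((ratio * 10.0 - 5.0) * 1000.0);

   if (delay > 50000) {
    delay = 50000; /* never delay a DML by more than 50 ms */
   }

   srv_dml_needed_delay = delay;
  }
 }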

The result

And now let me present you the result - the following test was executed with the Purge Lag fix applied and an innodb_max_purge_lag = 200000 limit:
[Graph: test results with the Purge Lag fix and innodb_max_purge_lag = 200000]
Observations :

  • the TPS level is slightly lower - from 8,200 TPS it moved to 7,800 (400 TPS less, a 5% performance loss)
  • however, everything else looks just fine! :-)
  • the history len is stable at 200000
  • the max DML delay never exceeded 5 ms during the whole test
  • the checkpoint age and dirty pages level are rock stable
  • and the most fun: once the test finished, instead of 17 minutes, all dirty pages were freed within 30 seconds ! :-)

You may find more details in the full report: http://dimitrik.free.fr/db_STRESS_MySQL_540_Purge_Lag_and_Ahead_Flushing_Fixed_Aug2009.html

Any comments are welcome! :-)

Posted by Dimitri at 18:18
Categories: MySQL, Solaris, Tools/ dbSTRESS