by Dimitri, SSC Team, 2009 Sun Microsystems Inc. |
Benchmark Information |
Customer Name(s): SSC Team
NDA: no
Contact Information:
- dimitrik (at) sun.com
Dates: Aug.2009
Keywords: MySQL, InnoDB, Ahead Flushing, purge, purge lag, innodb_max_purge_lag, log size, dirty pages, performance
Hardware Configuration |
Server(s):
- M5000 8CPU SPARC64-VII (quad-core bi-thread) 2400MHz, 64GB RAM, 2x FC-port 4Gbit
Storage:
- ST6140 2x LUN (500GB RAID1 8HD), each LUN is connected by its own fiber channel to the host
Software Configuration |
System:
- Solaris 10 update7
- UFS
Application(s):
- MySQL 5.4
- MySQL Perf #45
- XtraDB-6
- InnoDB plugin-1.0.4
- dbSTRESS (injector)
Abstract |
Overview: this performance study explains the idea and functionality of the final fix for the previously discovered issues with Purge Lag and Ahead Flushing (for more details see the following report and/or the blog posts: Jul.2009 , Jun.2009 )
Goal(s):
- finalize the fix with the latest ideas...
- validate the fix on dbSTRESS workloads
Result(s): Wow! :-))
Benchmark Scenario |
My benchmark scenario is still the same as explained during the Purge Lag & Ahead Flushing performance study. And I was lucky again to get exactly the same test server :-)) The following dbSTRESS scenario is used to reproduce the same problem again:
Test Scenario :
- a 5-minute Read-Only warm-up
- then a 1-hour non-stop Read+Write workload
- each session operates on a fixed list of objects (the choice is random, but the list is fixed, and the InnoDB buffer pool is sized big enough to keep all necessary pages cached!)
- 32 active concurrent sessions
- no think time (zero wait between transactions)
Configuration :
- 16 CPU cores
- 1024MB redo log size
MySQL Configuration |
Settings within my.conf for 5.4 and 5.Perf :

[mysqld]
max_connections = 2000
key_buffer_size = 200M
low_priority_updates = 1
sort_buffer_size = 2097152
table_open_cache = 8000

# files
innodb_file_per_table
innodb_log_group_home_dir = /LOG/log.54   # separated storage LUN
innodb_log_file_size = 1024M
innodb_log_files_in_group = 3

# buffers
innodb_buffer_pool_size = 12000M
innodb_additional_mem_pool_size = 20M
innodb_log_buffer_size = 8M

# tune
innodb_checksums = 0
innodb_doublewrite = 0
innodb_support_xa = 0
innodb_thread_concurrency = 16
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_max_dirty_pages_pct = 15
innodb_max_purge_lag = 0

# perf special
innodb_read_io_threads = 16
innodb_write_io_threads = 16
innodb_io_capacity = 2000

Additionally for XtraDB :
innodb_adaptive_checkpoint = 1
innodb_thread_concurrency_timer_based = 1

Additionally for InnoDB plugin-1.0.4 :
innodb_adaptive_flushing = 1
Ahead Flushing, and why... |
Let me show you once again what's going on... First, here is a small graph representing the TPS level observed during the current test scenario, executed against several engines in the following order:
- MySQL 5.4.0
- InnoDB plugin-1.0.4
- MySQL Perf build #45
- XtraDB-6
TPS level drops on the non-stop Read+Write workload |
As you may see, all of them are getting periodic performance drops. NOTE :
- InnoDB plugin-1.0.4 is using a recently added "Adaptive Flushing" feature
- XtraDB-6 is using Percona's famous "Adaptive Checkpoint"
So, WHY are they also getting performance drops???...
And let me repeat again - it's just because during a heavy read+write workload, under the current design, the Master thread never leaves the purge loop! So whatever "Adaptive" flushing or checkpoint code there is, as well as the dirty page limit check, is never reached within the Master thread! - (for more details see my initial report , and the next story , but it's not the end yet :-))
Let's see in more detail how it works..
How it works... |
By design, the InnoDB engine is doing the following steps:
- on database start, according to the total redo log space setting, InnoDB sets up the maximum possible checkpoint age, and the critical limit is set to something like ~77% of the redo log occupancy
- then during database activity the current checkpoint age is checked permanently - it's counted as the difference between the current redo log LSN and the oldest page LSN still present within the buffer pool
- if this difference outpasses the critical age (77% of redo space), the "preflush" is invoked, and dirty pages are then flushed as fast as possible to lower the checkpoint age and avoid a possible database freeze once it runs out of free redo log space...
- however, the flush at this point is so aggressive that it's blocking database activity anyway!... (well, at least due to the I/O bottleneck, but anyway it's not cool :-))
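To make the logic above more concrete, here is a minimal sketch of the checkpoint age check - hypothetical names and example values only, not the exact InnoDB identifiers (the ~2.3GB limit below just illustrates ~77% of a 3GB total redo space):

/* Sketch of the checkpoint age check described above
   (hypothetical names and values, not the exact InnoDB code). */
#include <stdio.h>

typedef unsigned long long lsn_t;

static lsn_t current_lsn        = 2500ULL * 1024 * 1024;  /* latest redo LSN    */
static lsn_t oldest_page_lsn    =  100ULL * 1024 * 1024;  /* oldest dirty page  */
static lsn_t max_checkpoint_age = 2300ULL * 1024 * 1024;  /* ~77% of redo space */

static void preflush_dirty_pages(void)
{
    /* in InnoDB this flush is so aggressive that it effectively
       blocks user activity until the age drops back */
    puts("preflush: flushing dirty pages as fast as possible!");
}

int main(void)
{
    /* checkpoint age = distance between redo head and oldest dirty page */
    lsn_t age = current_lsn - oldest_page_lsn;

    if (age > max_checkpoint_age)
        preflush_dirty_pages();

    return 0;
}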
To get a global view of InnoDB internals, I've instrumented the original code with several performance counters and exported them for monitoring via the classic "SHOW INNODB STATUS" command.
Extension of InnoDB Status command |
mysql> show innodb status\G
...
------------ PURGE STATUS ------------
History len: 2
DML Delay: 0  max: 0  delayed-queries: 0
Purge calls: 15  exec: 1  sleeps: 15  purged-pages: 2
------------ BUFFER FLUSH ------------
Flush calls: 350  exec: 33  flushed-pages: 13250
Called-by recovery: 0  redolog: 325  LRU: 0  master: 20  ahead: 0  dirty-pct: 5

Description :
- History len: shows the current value of the purge history length
- DML Delay: shows the current value of srv_dml_needed_delay (in case a purge lag limit was set); max: the max of srv_dml_needed_delay since the previous show status print; delayed-queries: the total number of delayed queries (if any)
- Purge calls: how many times the purge function was called ( trx_purge() ); exec: how many purges were really processed (it may be called but have nothing to do); sleeps: how many times purge was able to come back to the sleep loop; purged-pages: the total number of purged pages
- Flush calls: shows how many times buffer flushing was called; exec: how many times a buffer flush was really executed (IMPORTANT: if there is already a batch flush running, the call is ignored!); flushed-pages: the total number of flushed pages
- Called-by recovery: the total number of times flush was called from recovery; redolog: from the redo log; LRU: from the LRU (buffer margin); master: from the Master thread (cases other than dirty pages or ahead); ahead: called from the Ahead Flushing code; dirty-pct: called when the dirty pages percentage limit is reached
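For illustration, here is a minimal sketch of how such a counter is wired in - a global variable incremented at the call site and printed from the status routine. The counter name comes from my patch code shown further below; the surrounding function names are simplified placeholders, not the exact patch code:

/* A global counter, incremented at the call site and printed
   from the SHOW INNODB STATUS routine (simplified sketch). */
#include <stdio.h>

typedef unsigned long ulint;

ulint srv_buf_flush_dirtypct_calls = 0;   /* one global per counter */

/* at the call site, just before the dirty-pct flush batch: */
void count_dirtypct_flush(void)
{
    srv_buf_flush_dirtypct_calls++;
}

/* within the status print routine: */
void print_flush_status(FILE *file)
{
    fputs("------------ BUFFER FLUSH ------------\n", file);
    fprintf(file, "dirty-pct: %lu\n",
            (unsigned long) srv_buf_flush_dirtypct_calls);
}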
Then I added all this stuff into the innodbSTAT Add-On of dim_STAT (see about MySQL Add-Ons here ), and within the next 5 minutes I was able to graph all the InnoDB internals I'm interested in ;-)
What's cool is that such instrumentation may be reused and compared on any platform (even if you're completely unlucky because you don't have DTrace on your system ;-))
Let's see now what's going on...
Workload STATs during the test |
Observations :
- once the read warm-up is finished, the read+write test starts..
- at the beginning of the load everything is going well :-)
- everything keeps going well until the checkpoint age reaches its critical limit
- no flushing at all is involved until that moment, but then it jumps to 600 calls/sec from the redo log margin code requiring more free space..
- I/O writes jump at the same time
- however, you may see no more than 2 flush exec/sec there, when the real buffer flushes happen - meaning there is no need to call it so often, but being more efficient would be preferable ;-)
- the flushing involved is still insufficient - every time the critical age limit is reached again, the flush burst is involved again and again...
- you may also note the history length growing during the whole test, outpassing 6 million at the end(!)
- as well, you may note that purge never enters the sleep loop(!) and the Master thread is always looping on purge...
Final solution |
Initially I tried to extend the purge loop with ahead flushing to avoid the "furious flushing". The modification is light, but cannot be a true fix because all the other Master code should still be executed too :-) And then an absolutely radical idea changed everything ! :-)
- Tim Cook came with the idea: if it's so, WHY NOT isolate the purge processing within a separate thread ?..
- and Vince Carbone made the initial code split with separate Master and Purge threads!!!
Why is having a separate purge thread absolutely great ?! ;-)
- because after that, at least all the Master thread code will work as expected in any situation!! :-))
- we may leave purge looping alone forever and not care about it anymore! :-))
- the Master thread activity becomes way simpler! :-))
It's what I always loved about Sun - it's a place of innovation !! :-))
This idea inspired me to adapt all my changes to the split threads, and the current logic works as follows:
Purge :
- it's a separate thread now, just looping alone on the trx_purge() call and doing nothing else ! :-)
- initially by design purge should be invoked once per 10 sec, but I've made this interval auto-adaptive: if no work was done during the last loop, the sleep timeout is increased (still respecting the max of 10sec), and if there was some purge work done, the sleep timeout is decreased (still respecting the min of 10ms) - it works just fine and adapts itself to the active workload :-) (see the sketch just below)
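Here is a minimal sketch of such an auto-adaptive purge loop. The names and prototypes are simplified, and the doubling/halving step is my own illustrative choice - the report only says the timeout increases or decreases within the [10ms, 10sec] bounds:

/* Sketch of an auto-adaptive purge thread loop (hypothetical names
   and simplified prototypes, not the exact patch code). */
#include <time.h>          /* nanosleep() stands in for os_thread_sleep() */

#define SLEEP_MIN_MS     10        /* lower bound: 10ms  */
#define SLEEP_MAX_MS  10000        /* upper bound: 10sec */

extern int trx_purge(void);        /* returns the amount of purge work done */

void purge_thread_loop(void)
{
    int sleep_ms = SLEEP_MAX_MS;   /* start at the designed 10sec interval */

    for (;;) {
        int n_done = trx_purge();

        if (n_done == 0) {
            /* nothing purged: back off, sleep longer next time */
            sleep_ms *= 2;
            if (sleep_ms > SLEEP_MAX_MS)
                sleep_ms = SLEEP_MAX_MS;
        } else {
            /* work was done: wake up sooner next time */
            sleep_ms /= 2;
            if (sleep_ms < SLEEP_MIN_MS)
                sleep_ms = SLEEP_MIN_MS;
        }

        struct timespec ts = { sleep_ms / 1000,
                               (sleep_ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
    }
}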
Master :
- no more purge! :-)
- no more checks for redo log flush! :-)
- flushing the redo log within a 1sec loop (as designed)
- doing all the other stuff within the 1sec or 10sec loops (as designed) - see the sketch below
- for the first time on my workload I saw the Master thread checking the dirty page percentage limit!!! :-)))
Wow! that's really great!!! :-))
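Roughly, the Master thread loop now looks like this minimal sketch - the helper names (log_flush_redo(), do_one_second_tasks(), do_ten_second_tasks()) are hypothetical stand-ins for the real srv_master_thread() internals, which do much more per iteration:

/* Sketch of the simplified Master thread loop after the split
   (hypothetical names, not the exact patch code). */
#include <unistd.h>

extern void log_flush_redo(void);        /* flush the redo log            */
extern void do_one_second_tasks(void);   /* incl. dirty pct / ahead check */
extern void do_ten_second_tasks(void);   /* misc 10sec background work    */

void master_thread_loop(void)
{
    for (;;) {
        int i;
        for (i = 0; i < 10; i++) {
            /* no trx_purge() call anymore - purge lives
               in its own thread now! */
            log_flush_redo();            /* within the 1sec loop */
            do_one_second_tasks();
            sleep(1);
        }
        do_ten_second_tasks();
    }
}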
Ahead Flushing |
Now, when the dirty percentage limit is really checked, do we still need Ahead Flushing? ;-)) And I will tell you YES! :-)
Why?..
- currently, when the dirty page limit is reached, there will be a burst buffer flush at 100% of the I/O capacity setting, which may be very heavy; and setting a lower I/O capacity than the real one is not good either, as it may be too low to flush fast enough (because the goal is to avoid the critical checkpoint age level)
- it's quite painful to calculate every time whether the dirty page limit is set correctly according to the redo log and buffer pool sizes - it would be much better to leave the database engine to take care of it, no? ;-)
- there is no half-force solution: either it does no flushing, or it flushes at 100% I/O capacity :-)
So, yes, we need it!
And with the current model with separate threads, both Percona's Adaptive Checkpoint and InnoDB's freshly available Adaptive Flushing enter perfectly into the game now! :-)) However, I absolutely wanted to test a final solution, but neither "Adaptive" code was directly adaptable to MySQL 5.4... So I've made mine :-) and of course I wanted to give it at least a small personal touch :-))
Ahead Flushing :
- may be enabled / disabled live at any time
- if disabled - does nothing..
- the dirty pages percentage limit is verified first - as it already involves flushing at 100% I/O capacity, there is no need to check anything else!
- if the dirty pages level is still under the limit, the current checkpoint age is verified - and depending on the reached level of the max age, the percentage of flush I/O capacity is calculated (NOTE: differently from Percona's code, we try to flush here even if the max checkpoint age is reached! - we absolutely have to leave this critical zone ASAP, and from what I observed, having flushing invoked only from the redo log code will not be enough here)
- then(!) we check if half of the dirty pages percentage limit was reached - and if so, we involve flushing at 40% of I/O capacity if the previously found percent value (based on checkpoint age) is lower
- we set the skip sleep flag only(!) if a buffer flush was really executed! (not as it's done currently)
Well, it's easier to take a look at the code :-))
Ahead Flushing code |
if (UNIV_UNLIKELY(buf_get_modified_ratio_pct() > srv_max_buf_pool_modified_pct)) {

    /* Try to keep the number of modified pages in the buffer
    pool under the limit wished by the user */

    srv_buf_flush_dirtypct_calls++;
    n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST, PCT_IO(100),
                                      ut_dulint_max);

    /* If we had to do the flush, it may have taken even more
    than 1 second, and also, there may be more to flush. Do
    not sleep 1 second during the next iteration of this loop. */

    if( n_pages_flushed != ULINT_UNDEFINED )
        skip_sleep = TRUE;

} else if( srv_ahead_flushing ) {

    /* Flush ahead dirty pages to avoid oldest modification age
    coming too close to the max checkpoint age OR dirty pct
    coming too close to the limit set by user */

    dulint  lsn;
    int     pct = 0;

    lsn = buf_pool_get_oldest_modification();

    if( !ut_dulint_is_zero(lsn) ) {
        pct = (int)( ut_dulint_minus( log_sys->lsn, lsn ) * 100
                     / log_sys->max_checkpoint_age );
        pct = pct < 25 ? 0 : (pct < 75 ? pct - 20 : 80 );
    }

    if (UNIV_UNLIKELY(buf_get_modified_ratio_pct() > srv_max_buf_pool_modified_pct / 2)) {
        srv_buf_flush_ahead_calls++;
        n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST,
                                          PCT_IO( pct > 40 ? pct : 40 ),
                                          ut_dulint_max);
        if( n_pages_flushed != ULINT_UNDEFINED )
            skip_sleep = TRUE;

    } else if( pct ) {
        srv_buf_flush_ahead_calls++;
        n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST,
                                          PCT_IO( pct ),
                                          ut_dulint_max);
        if( n_pages_flushed != ULINT_UNDEFINED )
            skip_sleep = TRUE;
    }
}
...
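Note the pct mapping here: up to 25% of the max checkpoint age nothing is flushed ahead; between 25% and 75% the flush runs at (pct − 20) of the I/O capacity; above 75% it's pinned at 80%. And once the dirty pages level has crossed half of its limit, a floor of 40% I/O capacity is applied on top of that.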
Workload STATs with the new code |
Observations :
- should I say I'm happy? ;-))
- no more periodic performance drops!
- the checkpoint age stays under 1.2GB - probably meaning a shorter time in case of recovery? ;-)
- the critical checkpoint age is never reached! :-)
- mainly the buffer flushing is invoked from the Ahead Flushing code, and from time to time from the Master thread (misc)
- and still no more than one flush per second is really executed..
Everything goes pretty well now.. except the constantly growing "History len" value...
Ready for the next chapter? ;-)
Purge Lag |
So, if you look attentively at the previous graphs, you should be surprised to see a constantly growing "History len" value - this number represents the current number of un-purged pages.. Why it happens, and where the danger is, you may find in the MySQL manual:
In the InnoDB multi-versioning scheme, a row is not physically removed from the database immediately when you delete it with an SQL statement. Only when InnoDB can discard the update undo log record written for the deletion can it also physically remove the corresponding row and its index records from the database. This removal operation is called a purge, and it is quite fast, usually taking the same order of time as the SQL statement that did the deletion.

(see http://dev.mysql.com/doc/refman/5.0/en/innodb-multi-versioning.html for more details)

In a scenario where the user inserts and deletes rows in smallish batches at about the same rate in the table, it is possible that the purge thread starts to lag behind, and the table grows bigger and bigger, making everything disk-bound and very slow. Even if the table carries just 10MB of useful data, it may grow to occupy 10GB with all the “dead” rows. In such a case, it would be good to throttle new row operations and allocate more resources to the purge thread. The innodb_max_purge_lag system variable exists for exactly this purpose.
Now, let me first show you what happens when my test is finished (and it was the main issue which inspired all these performance studies since last year :-)) - once the workload is finished, it still takes 15-20 minutes for InnoDB to flush all dirty pages! But what will change now, when the dirty pages percentage is really limited and Ahead Flushing is in action?...
Workload STATs with the new code - next... |
Observations :
- History len outpasses 6 million
- once the workload is finished, InnoDB starts to decrease the purge gap
- the dirty pages level cannot reach zero until the history len becomes zero too
- it took 17 minutes to free all dirty pages!
This workload was kept running for one hour. But what will happen after a whole day of activity?..
Looking for a solution |
The problem seems to come from the fact that the purge thread cannot follow the workload activity - there are too many pages to purge... What kind of solution may be used here?
- speed up the purge processing (even if it may be quite costly - there are many user threads modifying data, and only one purge thread cleaning removed rows - so several purge threads will be needed, which leaves less CPU power for useful work (and finally you'll probably not win anything for your workload throughput) - and even the single thread currently is not going at its full speed, as it crosses common locks with other threads during its work..) - but with time I think this direction will get the main priority!..
- add a kind of throttling for write operations to keep their activity at the same level as your maximum possible purge throughput - it'll slow down your transactions a little bit, but avoid a big potential disaster!..
The second solution is already implemented within InnoDB - you may set the innodb_max_purge_lag parameter to tell InnoDB to keep the purge gap under this limit. But! The only problem is - it doesn't work here..
WHY?..
As I explained in my previous report , the innodb_max_purge_lag condition is ignored while InnoDB considers there is a consistent read view which may need to see the rows to be purged. The problem is I don't have the feeling it works properly, because as long as I have any SELECTs within my workload, InnoDB considers them as consistent reads, even if they started way after the last DML statement was committed. Probably keeping track of the oldest LSN for SELECTs may help?.. But well, let's get back to the innodb_max_purge_lag setting:
This variable controls how to delay INSERT, UPDATE, and DELETE operations when purge operations are lagging (see Section 13.2.9, “InnoDB Multi-Versioning”). The default value is 0 (no delays). The InnoDB transaction system maintains a list of transactions that have delete-marked index records by UPDATE or DELETE operations. Let the length of this list be purge_lag. When purge_lag exceeds innodb_max_purge_lag, each INSERT, UPDATE, and DELETE operation is delayed by ((purge_lag/innodb_max_purge_lag)×10)–5 milliseconds. The delay is computed at the beginning of a purge batch, every ten seconds. The operations are not delayed if purge cannot run because of an old consistent read view that could see the rows to be purged.

(see http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html#sysvar_innodb_max_purge_lag )

But honestly - what is the problem with the DML query delay and the consistent view?.. If innodb_max_purge_lag is set, it means we want to avoid any purge lag higher than this value! And if it becomes higher and new insert/delete/update operations keep arriving - what else can we do here if we don't delay them just a little bit? The same thing may be done within a customer application (if things go slowly - add a few ms of sleep between transaction orders) - will it be broken after that? ;-)) The same logic is implemented within many file systems (and recently in ZFS too)..
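For example, with innodb_max_purge_lag = 200000 (the value used in the test below) and a purge_lag of 300000, each DML operation gets delayed by ((300000/200000) × 10) − 5 = 10 ms.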
So, my fix is simple here - removing the consistent read check within the trx_purge function! :-)
In short, the check:

if (srv_max_purge_lag > 0 && !UT_LIST_GET_LAST(trx_sys->view_list)) {

becomes:

if (srv_max_purge_lag > 0 ) {

As well, I limit the max potential delay value to 50ms.
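Here is a minimal sketch of the resulting delay logic, with the view check removed and the 50ms cap applied - simplified names and placement, not the exact patch code:

/* Sketch of the patched purge lag delay logic (simplified names,
   not the exact patch code). */
#include <stdio.h>

typedef unsigned long ulint;

#define MAX_DML_DELAY_USEC 50000UL            /* cap the delay at 50ms */

static ulint srv_max_purge_lag    = 200000;   /* innodb_max_purge_lag  */
static ulint srv_dml_needed_delay = 0;        /* per-DML delay, usec   */

static void update_dml_delay(ulint purge_lag)
{
    /* the !UT_LIST_GET_LAST(trx_sys->view_list) condition is gone:
       the limit is enforced whenever srv_max_purge_lag is set */
    if (srv_max_purge_lag > 0 && purge_lag > srv_max_purge_lag) {
        float ratio = (float) purge_lag / (float) srv_max_purge_lag;
        ulint delay = (ulint) ((ratio * 10 - 5) * 1000);   /* in usec */

        if (delay > MAX_DML_DELAY_USEC)
            delay = MAX_DML_DELAY_USEC;

        srv_dml_needed_delay = delay;
    } else {
        srv_dml_needed_delay = 0;
    }
}

int main(void)
{
    update_dml_delay(300000);   /* lag of 300000 vs limit of 200000 */
    printf("delay = %lu usec\n", srv_dml_needed_delay);   /* -> 10000 */
    return 0;
}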
Now let me present you the results.. :-))
Workload STATs with Ahead Flushing + Purge Lag fix |
The following test was executed with the Purge Lag fix applied and an innodb_max_purge_lag = 200000 limit.
Observations :
- TPS level is slightly lower - from 8,200 TPS it moved to 7,800 (400 TPS less, a 5% loss)
- however, everything else looks just fine! :-)
- history len is stable at 200000
- the max DML delay never outpassed 5ms during the whole test duration
- checkpoint age and dirty pages level are rock stable
- and the most fun: once the test is finished, instead of 17 minutes, all dirty pages were freed within 30 seconds ! :-)
SUMMARY |
What else to say?..
- I'm quite happy with the obtained results! (as I usually say to my customers - "the idea is good, but what about the implementation?" :-))
- I think it'll need more testing now to validate the solution..
Time to prepare a patch :-)
UPDATE : the patch for MySQL 5.4.1 is here: FIX-innobase-5.4.1-dim-sep_purge-ahead_flushing-purge_lag.patch (to apply on storage/innobase source tree).
Any comments are welcome! :-)