by Dimitri SSC Team, 2009 Sun Microsystems Inc. |
Benchmark Information |
Customer Name(s): SSC Team
NDA: no
Contact Information:
- dimitri (at) sun.com
Dates: Jun.2009
Keywords: MySQL, InnoDB, Ahead Flushing, purge, purge lag, innodb_max_purge_lag, log size, dirty pages, performance
Hardware Configuration |
Server(s):
- M5000 8CPU SPARC64-VII (quad-core bi-thread) 2400Mhz, 64GB RAM, 2x FC-port 4Gbit
Storage:
- ST6140 2x LUN (500GB RAID1 8HD), each LUN is connected by its own fiber channel to the host
Software Configuration |
System:
- Solaris 10 u6
- UFS
Application(s):
- MySQL 5.4
- dbSTRESS (injector)
Abstract |
Overview: After publishing a benchmark report in May about InnoDB Dirty pages & Log size impact, I received by email several very interesting comments, and one of them pointed to the purge lag problem and the InnoDB innodb_max_purge_lag parameter created specially for this purpose! So I was very curious to see how it might help my workload... The following study is about how innodb_max_purge_lag works now and what may be done to make things better :-))
Goal(s): I'm not claiming any absolute truth here :-) My goal is to understand what's going on with purge in my workload and to see if anything else can be improved..
Result(s): are following :-)
Benchmark Scenario |
My benchmark scenario will be exactly the same as in my previous test - and by chance I still have exactly the same server :-)) To reproduce the same problem again (if it still exists, of course :-)) I've prepared the following dbSTRESS scenario:
Test Scenario :
- during 5 minutes a Read-Only warm-up
- then during 1 hour a Read+Write workload
- each session operates on a fixed list of objects (the choice is random, but the list is fixed, and the InnoDB buffer pool is sized big enough to keep all necessary pages cached!)
From my previous tests I observed that the max throughput on 16 cores is generally reached with 32 sessions (with innodb thread concurrency equal to either 0 or 16). So, the other settings for this scenario:
Config :
- CPU cores: 16
- concurrent sessions: 32
- think time: 0
Each test is executed :
- with 1024MB redo log size
- or 128MB redo to compare
For every test, check :
- dirty pages percentage
- total time of dirty page flushing at the end of the test
- reached performance level
- History length
- DML delays (if any)
However, I'm concentrating only on MySQL 5.4 testing now.
InnoDB Purge Lag |
According to the InnoDB documentation (and the InnoDB code as well :-)), the innodb_max_purge_lag variable controls how to delay INSERT, UPDATE, and DELETE operations when purge operations are lagging. The default value is 0 (no delays). The InnoDB transaction system maintains a list of transactions that have index records delete-marked by UPDATE or DELETE operations. Let the length of this list be purge_lag. When purge_lag exceeds innodb_max_purge_lag, each INSERT, UPDATE, and DELETE operation is delayed by ((purge_lag/innodb_max_purge_lag)×10)–5 milliseconds. The delay is computed at the beginning of a purge batch, every ten seconds. The operations are not delayed if purge cannot run because of an old consistent read view that could see the rows to be purged.
However, as I explained before, the purge routine is not called as expected - the Main Thread is looping all the time on trx_purge() - but let's see more in depth how it works.
Purge lag is also called the Rollback segment History length within the InnoDB code, and that's the name I prefer :-))
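To make the documented formula more concrete, here is a small standalone sketch of my own (not the actual InnoDB code, and with the overflow guard omitted) reproducing this delay computation:

```c
/* Standalone sketch of the documented delay computation:
   delay = ((purge_lag / innodb_max_purge_lag) * 10 - 5) milliseconds,
   expressed below in microseconds as in trx0purge.c, and applied
   only when the ratio exceeds 1 (overflow guard omitted). */
unsigned long
purge_delay_usec(unsigned long purge_lag, unsigned long max_purge_lag)
{
	float	ratio;

	if (max_purge_lag == 0) {
		return(0);	/* 0 means 'infinity': never delay */
	}

	ratio = (float) purge_lag / max_purge_lag;

	if (ratio <= 1) {
		return(0);	/* purge is not lagging enough */
	}

	return((unsigned long) ((ratio - .5) * 10000));
}
```

For example, with innodb_max_purge_lag=100000 and a purge lag of 8,000,000 the ratio is 80, so each DML statement would be delayed by (80 - 0.5) × 10000 = 795,000 microseconds - almost 0.8 second(!), far beyond the 5ms minimum.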
InnoDB Code Instrumentations |
To be able to monitor InnoDB internals, I've added the following stats:
- Flush calls (a DTrace script tracing who is calling the buf_flush_batch() function and how often)
- innodbSTAT (based on "show innodb status" - to check the buffer pool state)
- added "History len" value into "show innodb status"
- added DML delay values into "show innodb status"
The added code looks like:
fputs("--------------\n"
"PURGE STATUS\n"
"--------------\n", file);
fprintf(file, "History len: %d\nDML Delay: %d max: %d\n",
trx_sys->rseg_history_len, srv_dml_needed_delay, srv_dml_needed_delay_max);
srv_dml_needed_delay_max = 0;
As the DML delay (srv_dml_needed_delay) may vary a lot at any time, I've added a DML-delay-max value (srv_dml_needed_delay_max) - it keeps the max delay value reached since the previous "show innodb status" call (and is set to zero after).
First Observations |
As you may see from my previous tests, a smaller redo log size gives better TPS stability on InnoDB, while a bigger redo log gives a higher performance level but generates periodic bursts of data flushing, which may partially freeze throughput as a side effect... And even if I don't like any kind of artificial limit too much (the storage box is not even fully used here), perhaps the innodb_max_purge_lag option may bring a little bit more stability?..
Let's see..
First test :
- innodb_max_purge_lag=0
- innodb_max_purge_lag=1000000
- innodb_max_purge_lag=100000
- innodb_max_purge_lag=10000
To my big surprise, the History length exceeded 8,000,000 in all cases and there was NO DIFFERENCE between these tests..
Why?..
Analyzing the code |
Let's take a look at the InnoDB code where the innodb_max_purge_lag value is analyzed. File : innobase/trx/trx0purge.c
/* If we cannot advance the 'purge view' because of an old
'consistent read view', then the DML statements cannot be delayed.
Also, srv_max_purge_lag <= 0 means 'infinity'. */
if (srv_max_purge_lag > 0 && !UT_LIST_GET_LAST(trx_sys->view_list))
{
float ratio = (float) trx_sys->rseg_history_len
/ srv_max_purge_lag;
if (ratio > ULINT_MAX / 10000) {
/* Avoid overflow: maximum delay is 4295 seconds */
srv_dml_needed_delay = ULINT_MAX;
} else if (ratio > 1) {
/* If the history list length exceeds the
innodb_max_purge_lag, the
data manipulation statements are delayed
by at least 5000 microseconds. */
srv_dml_needed_delay = (ulint) ((ratio - .5) * 10000);
}
}
As you may see, innodb_max_purge_lag is considered only when UT_LIST_GET_LAST(trx_sys->view_list) returns zero! - meaning there is no longer any "consistent read" running and needing "unpurged" data. Tracing this condition, I discovered it never returns zero while there is at least one select running within the database! So, my purge lag setting was simply ignored !
What is dangerous here?
- first of all - as the history length becomes bigger and bigger, the DML delay value may be calculated way higher than 5ms(!) - and once there are no more reads in the database, a single DML statement may be delayed for several minutes(!) and break throughput stability! - that's probably why I've found several posts in forums about possible InnoDB freezes when the innodb_max_purge_lag option is used...
- second - I don't see any reason why we cannot delay DML statements for a few milliseconds even if there are some "consistent reads" running on the database?.. - what is the problem?.. - probably I'm missing something here, but I'll try the code without the UT_LIST_GET_LAST() condition check.
Updated code:
/* If we cannot advance the 'purge view' because of an old
'consistent read view', then the DML statements cannot be delayed.
Also, srv_max_purge_lag <= 0 means 'infinity'. */
//(dim) if (srv_max_purge_lag > 0 && !UT_LIST_GET_LAST(trx_sys->view_list))
if (srv_max_purge_lag > 0)
{
float ratio = (float) trx_sys->rseg_history_len
/ srv_max_purge_lag;
if (ratio > ULINT_MAX / 10000) {
/* Avoid overflow: maximum delay is 4295 seconds */
srv_dml_needed_delay = ULINT_MAX;
} else if (ratio > 1) {
/* If the history list length exceeds the
innodb_max_purge_lag, the
data manipulation statements are delayed
by at least 5000 microseconds. */
srv_dml_needed_delay = (ulint) ((ratio - .5) * 10000);
}
if (srv_dml_needed_delay > srv_dml_needed_delay_max)
	srv_dml_needed_delay_max = srv_dml_needed_delay;
}
I've also added the srv_dml_needed_delay_max variable to keep the max DML delay value set during the last loop.
Testing updated code |
To keep it short, I'll skip all the various tests I've done to find the most optimal combination and just present a short summary before going further. Purge lag summary (with the updated code!):
- from my observations, the main benefit of setting the innodb_max_purge_lag variable is that it makes the final dirty pages flushing (once the one-hour R+W workload is finished) much shorter! - instead of 5-15 minutes I've got a stable 1-2 minutes(!)
- the most optimal setting was innodb_max_purge_lag=200000 in my case (and the history length did not exceed this value during the whole test! - I remind you it jumped to 8 million before!)
- the max DML delay never exceeded 5ms
However :
- there is still something wrong with dirty pages recycling (but Mark Callaghan already seems to be looking closely at this issue)
- as before, the TPS level is lower with a 128MB redo log, but throughput is more stable
- and with 1024MB the TPS level is higher, but it generates a periodic "furious flushing"...
Let's see what else may be done here...
Improving Dirty Pages Flushing code |
As I explained before during my previous tests, under a high Read+Write workload the Master Thread is mostly looping on the trx_purge() call and may never even reach the code checking whether the innodb_max_dirty_pages_pct value is reached (nor the Adaptive checkpoint code added by the Percona team)... After monitoring dirty page flush calls with a DTrace script, I may list here the buf_flush_batch() callers:
- srv_master_thread() - when everything is working as expected (but not under high load)
- buf_flush_free_margin() - when the buffer pool has no more free pages
- log_check_margins() - when the redo log has no more free space, so the oldest dirty pages must be flushed to disk to free the corresponding space within the redo log
The most "furious" flushing happens when buf_flush_batch() is called by log_check_margins() - the demand is so urgent and the flushing so important that it creates a partial freeze of the workload - all InnoDB time is spent flushing a part of the data as soon as possible, and then everything depends on the disk storage subsystem performance...
What may be changed here?..
From my point of view, when flushing is invoked from the log_check_margins() function it's already too late - data should be flushed ahead to avoid an I/O burst! Analyzing the Master Thread code, I've split it into 3 critical parts:
- Part-1 : if the load is not high enough, the percentage of dirty pages (innodb_max_dirty_pages_pct) is verified here, and if the percentage is exceeded - pages are flushed with a 100% IO_CAPACITY setting; I've tested some low values here too, but in this case flushing becomes very aggressive and slows down global performance...
- Part-2 : here page flushing is invoked ahead if there is not too much activity within the database...
- Part-3 : is the most critical place, because if the load is very high, the Main Thread never leaves the purge loop! And from my point of view, a kind of Ahead Flushing should be invoked here!...
Initial code of purge loop :
n_pages_purged = 1;
last_flush_time = time(NULL);
while (n_pages_purged) {
if (srv_fast_shutdown && srv_shutdown_state > 0) {
goto background_loop;
}
srv_main_thread_op_info = "purging";
n_pages_purged = trx_purge();
current_time = time(NULL);
if (difftime(current_time, last_flush_time) > 1) {
srv_main_thread_op_info = "flushing log";
log_buffer_flush_to_disk();
last_flush_time = current_time;
srv_sync_flush++;
}
}
My additional changes :
- changes are involving ahead flushing within a purge loop
- check if the half of the innodb_max_dirty_pages_pct value is reached
- if so, queue page flushing with 20% of IO_CAPACITY
- ideally setting of these condition values should be available via my.conf
n_pages_purged = 1;
last_flush_time = time(NULL);
while (n_pages_purged) {
if (srv_fast_shutdown && srv_shutdown_state > 0) {
goto background_loop;
}
srv_main_thread_op_info = "purging";
n_pages_purged = trx_purge();
//(dim)
if (UNIV_UNLIKELY(buf_get_modified_ratio_pct()
> srv_max_buf_pool_modified_pct/2 )) {
n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST,
PCT_IO(20), ut_dulint_max);
skip_sleep = TRUE;
}
current_time = time(NULL);
if (difftime(current_time, last_flush_time) > 1) {
srv_main_thread_op_info = "flushing log";
log_buffer_flush_to_disk();
last_flush_time = current_time;
srv_sync_flush++;
}
}
Re-Testing with Ahead Flushing enabled |
Combining Purge Lag changes + Ahead Flushing, I've got the following result:
- very short final flushing once database activity is stopped (purge lag = 200000)
- due to Ahead Flushing, there is stable background flushing of dirty pages, so log_check_margins() is never invoked!
- as log_check_margins() is never invoked, the "furious" flushing is gone ! :-) and my TPS throughput became very stable! (finally :-))
- very stable dirty page percentage
The following few graphs compare 2 cases: a 128MB redo log (stable, but aggressive) and Ahead Flushing with a 1024MB redo log (a performance level equal to the plain 1024MB redo log, but less aggressive and without the periodic freezes due to "furious flushing" :-))
Setting used in my.conf :
- innodb_max_dirty_pages_pct = 60
- innodb_max_purge_lag = 200000
- innodb_log_file_size = 128M or 1024M + AheadFlushing enabled
TPS level and Response times |
Observations :
- as seen before, the TPS level is higher with a 1024MB redo log, but with Ahead Flushing it has also become stable now!
- response times are better from the beginning, while with a 128MB log it still needs some time to become stable
- NOTE : the 128MB configuration also benefits here from the improvement made by my Purge Lag changes (!)
I/O Activity and DML delays |
Observations :
- curiously, I/O activity (writing) is much lower with Ahead Flushing enabled...
- I may explain it only by the supposition that some pages are modified several times, and with less aggressive flushing we're not flushing all page versions but only the latest one!
- DML delays are also less frequent with Ahead Flushing (but a more accurate measurement will be needed to confirm this)..
Buffer Pool and Dirty Page flushing |
Observations :
- the dirty pages level stays stable in both configurations
- 128M : aggressive flushing is invoked from log_check_margins() during the first half of the test, then, due to the Purge Lag changes, flushing progressively comes back to "normal"
- 1024M + AheadFlushing : log_check_margins() is never called; flushing is less aggressive and is called only from the Master Thread during the whole test duration!
SUMMARY |
Based on the presented performance study, here is the list of changes I'm suggesting to apply to the MySQL 5.4 InnoDB engine. Purge Lag :
- verify whether checking for the presence of a "consistent read" is really needed within the purge lag code (what do you think?)
- add a kind of innodb_max_dml_delay configuration parameter to avoid any unexpectedly long DML delays!
- add indicators about History length and DML delays into "show innodb status" output!
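As an illustration of the innodb_max_dml_delay idea - the parameter name and the variable below are hypothetical, nothing like this exists in the current code - the cap could be as simple as:

```c
/* Hypothetical ceiling on the computed DML delay, in microseconds
   (srv_max_dml_delay is my invention, not an InnoDB variable);
   0 would mean 'no cap', keeping today's behavior. */
unsigned long	srv_max_dml_delay = 100000;	/* e.g. 100ms max */

unsigned long
capped_dml_delay(unsigned long computed_delay)
{
	if (srv_max_dml_delay > 0 && computed_delay > srv_max_dml_delay) {
		return(srv_max_dml_delay);
	}

	return(computed_delay);
}
```

This way, even a multi-million History length could never freeze a DML statement for minutes.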
Ahead Flushing : (if the idea is accepted :-))
- add a kind of innodb_ahead_dirty_pages_pct - a percentage of dirty pages level when Ahead Flushing should be activated
- add a kind of innodb_ahead_io_capacity_pct - an I/O capacity percentage to use when Ahead Flushing is activated
- add indicators about flushing calls (function names and number of calls) into "show innodb status" !!!
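To sketch how these two hypothetical parameters could replace the hard-coded values from my patch above (the /2 threshold and PCT_IO(20)) - again, the names and code below are my assumption, not existing InnoDB code:

```c
/* Hypothetical tunables (my invention, not InnoDB variables): */
unsigned long	srv_ahead_dirty_pages_pct = 50;	/* % of max_dirty_pct */
unsigned long	srv_ahead_io_capacity_pct = 20;	/* % of io_capacity */

/* Returns the number of pages to flush in one Ahead Flushing batch,
   or 0 if the dirty page ratio is still below the activation level. */
unsigned long
ahead_flush_batch(unsigned long modified_ratio_pct,	/* current dirty % */
		  unsigned long max_dirty_pct,	/* innodb_max_dirty_pages_pct */
		  unsigned long io_capacity)	/* I/O capacity, pages/sec */
{
	unsigned long	threshold = max_dirty_pct
				* srv_ahead_dirty_pages_pct / 100;

	if (modified_ratio_pct <= threshold) {
		return(0);	/* not dirty enough yet: no ahead flushing */
	}

	return(io_capacity * srv_ahead_io_capacity_pct / 100);
}
```

With the defaults above and innodb_max_dirty_pages_pct=60, ahead flushing would kick in once more than 30% of the buffer pool pages are dirty, flushing batches of 20% of the I/O capacity - exactly the behavior hard-coded in my patch.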
Any comments are welcome! :-))