by Dimitri, SSC Team, Sun Microsystems Inc., 2009 |
Benchmark Information |
Customer Name(s): SSC Team
NDA: no
Contact Information:
- dimitri (at) sun.com
Date: May 2009
Keywords: MySQL, log size, dirty pages, performance
Hardware Configuration |
Server(s):
- M5000: 8x CPU SPARC64-VII (quad-core, bi-thread) 2400MHz, 64GB RAM, 2x FC-port 4Gbit
Storage:
- ST6140 2x LUN (500GB RAID1 8HD), each LUN is connected by its own fiber channel to the host
Software Configuration |
System:
- Solaris 10 u6
- UFS
Application(s):
- MySQL 5.4
- MySQL 5.Perf build5
- XtraDB-5
- XtraDB-3
- InnoDB plugin-1.0.3
- dbSTRESS (injector)
History |
The story goes back to the beginning of the year: while preparing a MySQL Workshop, I continued my first tests with XtraDB and looked for the MySQL configuration parameters giving the most optimal, performant and stable processing. During these tests I discovered several strange things:
- my dirty page percentage setting was completely ignored by MySQL
- once my test workload was finished, it still took 30min(!) for MySQL to finish flushing dirty pages!
I told a lot of people about it at the time, but everybody was just surprised and did not really believe such a case was possible... Agreed, it's hard to believe ;-)
So, to refresh the story, let me first show you the following picture:
InnoDB Buffer Pool Usage |
As you may see here, the dirty pages percentage is well above 15%, for sure :-) You may also see the 30 minutes of dirty page flushing once the workload is finished (way too long on an idle server, in my opinion! - and the recovery process would probably take even longer if the server crashed?...). It looked abnormal to me, and what I found at the time was that reducing the log size to 128MB gave more or less better stability, and dirty page flushing always finished within a few minutes at the end of each workload!
My understanding was :
- the dirty pages percentage is completely ignored by MySQL
- the only thing which limits the number of dirty pages is the log size!
- once the redo log has no more free blocks, it forces dirty page flushing so it can continue its work!
- the most optimal redo log size I found at that time was 128MB
So, all the MySQL tests I've done until now were executed with innodb_log_file_size=128M. And I started the current study to bring some light to the previously found issue and, where possible, to understand what's going wrong (if anything :-))...
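For reference, the key settings discussed throughout this report boil down to a few my.cnf lines - just an illustrative fragment of the values used in my tests, not the complete test config:

[mysqld]
innodb_buffer_pool_size    = 12G    # 4G in the simplified tests later
innodb_log_file_size       = 128M   # then 1024M for comparison
innodb_max_dirty_pages_pct = 15     # also tested with 50
innodb_thread_concurrency  = 16     # or 0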
Test Scenario |
To reproduce the same problem again (if it still exists, of course :-)) I've prepared the following dbSTRESS scenario:
Test Scenario :
- during 5 minutes a Read-Only warm-up
- then during 1 hour a Read+Write workload
- IMPORTANT - each session is operating on a fixed list of objects (the choice is random, but the list is fixed, and the InnoDB buffer pool is sized big enough to keep all necessary pages cached!) - it's important to keep this in mind while reading the following chapters :-))
From my previous tests I observed that the max throughput on 16 cores is generally reached with 32 sessions (both with innodb thread concurrency equal to 0 and 16). So, the other settings for this scenario:
Config :
- CPU cores: 16
- concurrent sessions: 32
- think time: 0
Each test is executed :
- with 128MB redo log size
- then with 1024MB redo log size
For every test, I'll check :
- dirty pages percentage (see the note after this list on how it can be observed)
- total time of dirty page flushing at the end of the test
- reached performance level
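BTW, for those who'd like to follow the dirty pages level on their own server: the percentage can be derived from the SHOW STATUS counters as 100 * Innodb_buffer_pool_pages_dirty / Innodb_buffer_pool_pages_total. (That's the externally observable value; the internal function InnoDB itself uses for its check is shown later in this report.)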
Initial list of engines :
- MySQL 5.4
- MySQL 5.Perf build5
- XtraDB-5
- XtraDB-3
- InnoDB plugin-1.0.3
First observations |
First of all, I want to be sure that with a 128MB log size everything is still ok until now. So, the following graph shows the I/O and Buffer Pool state during a test with MySQL 5.4 and:
- redo log size: 128MB
- innodb thread concurrency: 16
- innodb pool size: 12GB
- dirty pages percentage: 15
MySQL 5.4 with log=128MB |
Observations :
- 20min Read-Only warm-up (it was 20min initially, but then I saw that 5min is more than enough)
- as you may see, during the warm-up all needed pages are loaded into the pool
- there are no more disk reads until the end of the test
- once the Read+Write workload is started, the dirty page level is kept stable !
- once the test is finished, all dirty pages are flushed within a few minutes !
Question - do you think the dirty pages percentage is kept at the observed level because I set it equal to 15 ?...
(No need to answer for the moment, just keep it in mind :-))
But what about performance?..
MySQL 5.4 Performance with log=128MB |
Performance :
- after the first 5 min of instability, throughput reached a more or less stable 6,000 TPS, and 6,500 TPS at the end
- response times become quite stable over time
- DELETE is taking slightly more time than INSERT and UPDATE
I'm not presenting graphs for the other engines here because they all look similar with a 128MB log size. However, what's interesting is that not all engines spend the same time on the final dirty page flushing!
Final flushing time :
- MySQL 5.4: 160 sec.
- MySQL 5.Perfb5: 50 sec. (quite interesting..)
- XtraDB-5: 200 sec.
- XtraDB-3: 190 sec.
- InnoDB-plugin 1.0.3: 280 sec. (it seems that not having several I/O write threads limits the flushing here)
But well, 3-4 minutes may be considered enough to finish the flushing here. Let's continue...
What if bigger redo log?.. |
Let's see now what will change if the redo log is 1024MB in size. The following graph shows the I/O and Buffer Pool state during a test with MySQL 5.4 and:
- redo log size: 1024MB
- innodb thread concurrency: 16
- innodb pool size: 12GB
- dirty pages percentage: 15
MySQL 5.4 with log=1024MB |
Observations :
- Read-Only warm-up
- as you may see, during the warm-up all needed pages are loaded into the pool
- there are no more disk reads until the end of the test
- once the Read+Write workload is started, the dirty page level keeps growing !
- BUT - once the test is finished, all dirty pages are still flushed within a few minutes !
Question - do you think the dirty pages percentage is kept at the observed level because I set it equal to 15 ?...
Only a fool would say yes here, no? ;-)
As you see, my 15% setting is completely ignored here, and in the next chapters we'll try to understand WHY? :-))
But, anyway, what about performance?..
MySQL 5.4 Performance with log=1024MB |
Performance :
- throughput grew within a few minutes and reached a quite stable 8,000 (!) TPS - nearly a 30% improvement!!
- only some periodic drops are breaking the whole picture, but we'll look at them later..
- response time is stable and twice as good!! (except for the periodic jumps, hmm..)
- DELETE is taking slightly more time than INSERT and UPDATE
Before presenting graphs for the other engines, let's finish with the final dirty page flushing!
Final flushing time :
- MySQL 5.4: 190 sec.
- MySQL 5.Perfb5: 50 sec. (hmm.. quite interesting..)
- XtraDB-5: 200 sec.
- XtraDB-3: 200 sec.
- InnoDB-plugin 1.0.3: 250 sec.
Again, we may still consider 3-4 minutes enough to finish the flushing here. Build5 is surprisingly 4 times faster compared to the others - it seems to me Mikael has to hurry up and merge it into 5.4 :-)) But well - the main point here is that there is no more 30min wait to flush all the dirty pages!
Positive :
- increased performance!
- no more long wait on the final dirty pages flushing
Negative :
- constantly growing number of dirty pages
- constantly growing number of "data" pages
- periodic performance drops
Let's first look more closely at the last one: why do we see those periodic performance drops?... Probably the folks from Percona will point a finger here and explain again the benefit of the Adaptive Checkpoint (and I'd agree with them :-)), but in the current case the Adaptive Checkpoint will not help...
Why?...
Let's take a look at the XtraDB-5 performance graphs:
XtraDB-5 Performance with log=1024MB |
As you may see :
- same kind of periodic performance drops
- Adaptive Checkpoint is ON and it doesn't help...
BTW, the result is better anyway compared to all my previous XtraDB-5 results. But the following graph will be even more fun:
XtraDB-3 Performance with log=1024MB |
Hard not to see :
- same kind of periodic performance drops
- Adaptive Checkpoint is ON and it doesn't help...
At the same time :
- XtraDB-3 outperforms XtraDB-5 here !!
- XtraDB-3 outperforms MySQL 5.4 here as well !!
It seems to me the Percona team should think about this and prepare a bunch of regression tests to see if there are any other cases where the old XtraDB performs better than the new one..
And to completely finish the performance part of this report, I just want to present the result of Build5:
MySQL 5.Perf-b5 Performance with log=1024MB |
As you see :
- same problems as others
- but nearly 12,000 TPS - the best result for the moment!
InnoDB Log size and Performance |
Performance impact :
- a very important performance gain on the Read+Write workload with a bigger InnoDB log size!
- keep in mind the periodic throughput drops
- once they are fixed - it'll be fine as long as performance is not degraded too much :-)
Seems I'll need to replay all my previous tests with a bigger log size.. :-)
In depth with Dirty Pages Level |
The main question of this chapter is - why was the dirty pages percentage setting (innodb_max_dirty_pages_pct=15) ignored?..
To go more in depth, we'll need to open the MySQL sources. The file "innodb/srv/srv0srv.c" contains the source code we're interested in.
Analyzing code of the Master Thread |
The code checking the dirty pages percentage is part of the srv_master_thread() function. Every time the dirty pages level exceeds the requested value, the buf_flush_batch() function is called to flush dirty pages to the disk storage. Let's take a closer look at the master thread code; I've split it into 3 parts:
PART-1
In this part, buf_flush_batch() is called if the dirty pages level exceeded the required value:
/*************************************************************************
The master thread controlling the server. */
os_thread_ret_t
srv_master_thread(
/*==============*/
...
	if (UNIV_UNLIKELY(buf_get_modified_ratio_pct()
			  > srv_max_buf_pool_modified_pct)) {

		/* Try to keep the number of modified pages in the
		buffer pool under the limit wished by the user */

		n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST,
						  PCT_IO(100),
						  ut_dulint_max);

		/* If we had to do the flush, it may have taken
		even more than 1 second, and also, there may be more
		to flush. Do not sleep 1 second during the next
		iteration of this loop. */

		skip_sleep = TRUE;
	}
...
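A quick note on the PCT_IO() macro used above: it scales a percentage of the configured I/O capacity (innodb_io_capacity). If I remember the plugin-era sources correctly, it's defined in srv0srv.c roughly as follows, so PCT_IO(100) means "up to innodb_io_capacity pages":

/* srv0srv.c (quoted from memory, may differ slightly per version): */
#define PCT_IO(p) ((ulong) (srv_io_capacity * ((double) p / 100.0)))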
PART-2
In this part, buf_flush_batch() is called if the I/O activity was low during the last 10 seconds and innodb_extra_dirty_writes is enabled (it is by default):
...
	/* ---- We perform the following code approximately once per
	10 seconds when there is database activity */

#ifdef MEM_PERIODIC_CHECK
	/* Check magic numbers of every allocated mem block once in 10
	seconds */
	mem_validate_all_blocks();
#endif

	/* If i/os during the 10 second period were less than 200% of
	capacity, we assume that there is free disk i/o capacity
	available, and it makes sense to flush srv_io_capacity pages.
	Note that this is done regardless of the fraction of dirty
	pages relative to the max requested by the user. The one second
	loop above requests writes for that case. The writes done here
	are not required, and may be disabled. */

	n_pend_ios = buf_get_n_pending_ios() + log_sys->n_pending_writes;
	n_ios = log_sys->n_log_ios + buf_pool->n_pages_read
		+ buf_pool->n_pages_written;

	if (srv_extra_dirty_writes
	    && n_pend_ios < 3 && (n_ios - n_ios_very_old < PCT_IO(200))) {

		srv_main_thread_op_info = "flushing buffer pool pages";
		buf_flush_batch(BUF_FLUSH_LIST, PCT_IO(100), ut_dulint_max);

		srv_main_thread_op_info = "flushing log";

		/* No fsync when srv_flush_log_at_trx_commit 1 */
		log_buffer_flush_maybe_sync();

		srv_async_flush++;
	}
...
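To put numbers on the PART-2 condition: with innodb_io_capacity=200 (the default, if I remember well), PCT_IO(200) = 400, so this "extra" flush of PCT_IO(100) = 200 pages is submitted only when fewer than 400 I/O operations happened during the whole last 10 seconds - in other words, only on a nearly idle server, which is why it cannot help during an active workload.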
PART-3
The master thread is looping here on purge:
...
	/* We run a full purge every 10 seconds, even if the server
	were active */

	n_pages_purged = 1;
	last_flush_time = time(NULL);

	while (n_pages_purged) {
		if (srv_fast_shutdown && srv_shutdown_state > 0) {

			goto background_loop;
		}

		srv_main_thread_op_info = "purging";
		n_pages_purged = trx_purge();

		current_time = time(NULL);

		if (difftime(current_time, last_flush_time) > 1) {
			srv_main_thread_op_info = "flushing log";

			log_buffer_flush_to_disk();
			last_flush_time = current_time;
			srv_sync_flush++;
		}
	}
...
Initially, what I supposed :
- the buf_get_modified_ratio_pct() function has surely been tested over several years now, so there is a very small probability it contains any bug...
- also, it uses the same functions to calculate the dirty pages percentage as the "show innodb status" command, so the returned value should be correct...
- also, it's hard to believe there is an error in the comparison condition :-)
- the only remaining possibility - probably the amount of flushed data is not big enough???...
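For the curious, the function itself is trivial - quoted here from memory of the plugin-era buf0buf.c (so take it as a sketch rather than the exact code), it simply compares the flush list length against the total number of pages:

ulint
buf_get_modified_ratio_pct(void)
{
	ulint	ratio;

	ratio = (100 * UT_LIST_GET_LEN(buf_pool->flush_list))
		/ (1 + UT_LIST_GET_LEN(buf_pool->LRU)
		   + UT_LIST_GET_LEN(buf_pool->free));

	/* 1 + is there to avoid division by zero */

	return(ratio);
}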
So, my initial intention was to trace the flushing activity involved within PART-1 and PART-2 of the code and see more or less what kind of values would be observed, etc. To start, I just added a few fprintf(stderr, ...) calls into the code to print the values whenever either of the 2 flushing conditions matched. The messages are then written to the mysqld log file and may be monitored with a simple "tail -f mysqld.log". What I expected was, by incrementally tuning the flushing, to first bring the dirty pages level to some stable value, and then to see how much better it could be improved...
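For example, for the PART-1 condition the added tracing looked more or less like this (just a sketch, not the exact lines I used):

	if (UNIV_UNLIKELY(buf_get_modified_ratio_pct()
			  > srv_max_buf_pool_modified_pct)) {

		/* added tracing (sketch): dump the observed dirty pages
		percentage and the configured limit to the mysqld log */
		fprintf(stderr, "PART-1: dirty pct %lu > %lu, flushing...\n",
			(ulong) buf_get_modified_ratio_pct(),
			(ulong) srv_max_buf_pool_modified_pct);
		...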
Now, check first you're sitting well in your chair! :-))
READY?..
WHAT KIND OF VALUES DO YOU THINK WERE PRINTED WITHIN THE LOG FILE ?...
???
???
To my big surprise...
NOTHING AT ALL WAS PRINTED!!!....
???
WHY ?....
???
JUST BECAUSE NEITHER PART-1 NOR PART-2 IS EVER REACHED!!
Hard to believe?.. ;-))
Once the workload is started, processing NEVER LEAVES THE PART-3 LOOP !!!
All the Master Thread is doing is looping on trx_purge()... Hard to believe, but it's so...
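To make the control flow crystal clear, here is a purely illustrative pseudo-code sketch (not the real srv0srv.c) of what the main loop effectively degrades into under such a workload:

	/* Illustrative pseudo-code only: under a write-heavy workload
	trx_purge() always finds more undo records to process, so
	n_pages_purged never drops to 0 and the loop never exits */

	n_pages_purged = 1;
	while (n_pages_purged) {
		n_pages_purged = trx_purge();	/* always > 0 here... */
		/* + flush the log roughly once per second, nothing else */
	}

	/* the PART-1 (dirty pages pct) and PART-2 (idle flushing)
	checks sit outside this loop - so they are never reached */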
So, who is flushing dirty pages in this case ???... |
The answer is very easy to get with DTrace. Let's see who is calling buf_flush_batch() during a Read+Write workload:
# pgrep -n mysqld
28440

# dtrace -qn "pid28440:a.out:buf_flush_batch:entry { @fun[ustack(2)] = count(); }"
^C

  mysqld`buf_flush_batch
  mysqld`buf_flush_free_margin+0x444
        2

  mysqld`buf_flush_batch
  mysqld`log_check_margins+0x564
     3085

So well :
- flushing is mainly called by the log_check_margins() function..
- then, if space was not freed fast enough, the buf_flush_free_margin() function is called..
- I don't know if there are other functions calling the flushing too (still need to monitor)
- as you may imagine, for the same reasons the Adaptive Checkpoint code within XtraDB is also never reached here...
- and from all of the above it's clear that, at least in my tests, setting the dirty pages percentage to any value makes no sense, as it's simply never checked by the MySQL code ! And I'm quite curious how others observed different performance levels by changing the percentage value?.. - it seems to me it should also depend on how aggressive the workload itself is.. - dbSTRESS is a truly stressful workload, and I hope it'll be very helpful in resolving this issue :-))
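To explain why the log free space drives the flushing here: in the plugin-era log0log.c, log_check_margins() ends up comparing the "age" of the oldest modified page (in LSN terms) against thresholds derived from the log size. A simplified pseudo-code sketch from memory (the function and field names are real, the exact code differs):

	/* inside log_checkpoint_margin(), called from log_check_margins() */
	age = log_sys->lsn - oldest_modification_in_flush_list;

	if (age > log_sys->max_modified_age_sync) {
		/* not enough reusable space left in the ib_logfiles:
		flush dirty pages so the checkpoint can advance -
		this is the flushing DTrace is showing above */
		buf_flush_batch(BUF_FLUSH_LIST, ULINT_MAX,
				new_oldest /* = lsn - max_modified_age_sync */);
	}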
Monitoring MySQL flushing with DTrace |
I've made a short DTrace script printing, at a given time interval, the names of all functions that called buf_flush_batch() and how many times each. As there may be only a few calls, I don't print calls/sec but just the number of calls per interval. Also, after the first probe monitoring I discovered that some functions sometimes make several thousand calls(!) while others make just a few - so, to be able to show them on the same graph and still keep them identifiable, I've added a third column - Calls Power - a value representing the number of digits in the Calls value (e.g. 1 if the Calls value is between 1 and 9, 2 if it's between 10 and 99, and so on).
NOTE: the "none" name is used as a heartbeat - just to print something when there are no calls at all. Also, its "power" value has a special meaning: 0 means the MySQL server is off-line, and 1 means on-line (this change was added lately, so it may not be reflected in some graphs).
Output example :
# myflushSTAT.sh 5
FUNCTION             CALLS  POWER
none                     0      1
srv_master_thread       87      2

FUNCTION             CALLS  POWER
none                     0      1
srv_master_thread       20      2
log_check_margins     3646      4
^C
#
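For those who'd like to reproduce it, the core of such a script may look like the following sketch, built around the same probe as the one-liner above (the real myflushSTAT.sh also post-processes the output into the FUNCTION / CALLS / POWER columns shown above - the POWER value is just floor(log10(CALLS)) + 1):

#!/usr/sbin/dtrace -qs
/*
 * Sketch only: count the callers of buf_flush_batch() and
 * dump/reset the counters every 5 seconds.
 * Usage: ./flushstat.d <mysqld-pid>
 */
pid$1:a.out:buf_flush_batch:entry
{
	@fun[ustack(2)] = count();
}

tick-5s
{
	printa(@fun);
	trunc(@fun);
}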
I've added this script as a new Add-On into dim_STAT, and within a few minutes I was able to monitor my workload.
Let's see what we'll discover now...
MySQL 5.4 flushing with log=128MB and concurrency=0 |
Here is just an example of a graph showing the number of buf_flush_batch() calls from other functions. (NOTE: the Power value here does not yet reflect the number of digits, but just 1 or 0 depending on whether a call was present.) As you may see :
- during the Read-Only warm-up it's called from time to time by srv_master_thread() (PART-2)
- once activity switches to Read+Write, log_check_margins() becomes the most active caller
- from time to time buf_flush_free_margin() is involved to start more aggressive flushing, and processing slightly freezes during this period...
I've obtained similar graphs for all engines, and there is nothing surprising - as the Main Thread is looping on trx_purge(), the only thing that changes with a different log size is the call frequency of the flush functions...
At the same time, tracing the function call stacks, I discovered that all flushing calls are initiated from trx_purge() - not directly, but via other functions called from it. And I have a strong feeling something is going wrong here...
To debug further, I decided to simplify the tested workload a little bit.
In depth with a Purge Process |
The main question of this chapter is - what is wrong with the Purge Process?...
As I did not design InnoDB and don't yet have enough knowledge about its code, my goal here will be to clarify what's going wrong rather than to make a patch - I'm pretty sure there are at least several people around managing the InnoDB code as their own :-))
To make things more clear I'll first of all simplify my workload.
Simplified Read+Write :
- still 32 concurrent sessions
- still zero think time
- 5 min Read-Only warm-up
- then 1 hour of Read+Write, but with only UPDATE statements as write transactions
- buffer pool = 4GB to make more things happen at the same time :-))
Question: if I'm doing only updates of my rows and all the rows I'm using fit into my buffer pool - how much buffer space will I need to keep them cached?...
NOTE: the current workload is using only fixed-length rows! And from the previous graphs you may see that the whole current working set fits into 70,000 InnoDB pages within the buffer pool. So, I'd say 140,000 pages should be more than enough, even if we need to keep a previous copy of each page at all times.
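(A quick sanity check on these numbers: assuming the default 16KB InnoDB page size, 70,000 pages is only ~1.1GB, and even the doubled 140,000 pages is ~2.2GB - so both fit comfortably within the 4GB buffer pool.)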
But let's see now what is going on there...
Simplified test with log=128M and concurrency=0 |
Buffer Pool state with log=128M and concurrency=0
Observations :
- after the 5 min of warm-up reads, all rows are cached in the buffer pool - no more reads
- stable AVG performance, however too aggressive flushing..
- in the middle of the test there is a performance drop due to a jump in reads... Why?...
Observations :
- after the Read-Only warm-up, Database pages occupy ~70,000 pages within the Buffer Pool
- once the Read+Write workload is started, even though dirty pages are kept at a low level by the aggressive flushing, the number of Database pages grows linearly over time and very quickly reaches the max pool size...
- once there are no more Free pages, the following happens: by mistake, the "useful" pages are flushed out instead of the old ones, etc. - which requires additional page reads for some SELECT queries!
- these unexpected reads dropped the overall workload performance!
- and it seems to me the buf_flush_free_margin() function is involved when space is needed for new pages (but I may be wrong)
Simplified test with log=1024M and concurrency=0 |
Buffer Pool state with log=1024M and concurrency=0
Observations :
- moving to a 1024MB redo log, the performance becomes more stable!
- periodic performance drops happen due to missing free space within the log files, which triggers aggressive dirty page flushing (and now it's for sure!)
- still the same significant performance drop in the middle of the test!
Observations :
- the dirty pages level is still driven by the free space within the log files
- the Database pages level is still quickly increasing
- the performance drop in the middle of the test is due to wrongly flushed data pages
NOTE: why is this performance drop in the middle of the test so important???... It simply demonstrates there is something wrong with Database page management - cache space is simply wasted under such conditions - we don't really need more than, say, 80,000 pages to keep the whole working set of the current test cached! But in the end space is wasted and performance drops! It's also generally important to keep the cache optimal: if, for example, one of the users suspends his activity for a while, once it resumes there will be none of his pages left in the cache, even though there was more than enough space to keep them!!!
Simplified test with log=1024M and concurrency=16 |
IMPORTANT: the dirty pages percentage is set to 50 here (innodb_max_dirty_pages_pct=50)
Buffer Pool state with log=1024M and concurrency=16 (pct=50)
Observations :
- concurrency=16 decreases thread contention, and it improves response time!
- however, it did not improve TPS throughput, just because we're reaching the current limit with 16 concurrently processing threads, while at the same time CPU usage is 2 times(!) lower
- periodic performance drops as before
- BUT! there is no more huge drop in the middle of the test! Why?
Observations :
- there is no performance drop in the middle of the test because there were enough Free pages till the end!
- as long as purge processing goes fast enough, dirty pages are flushed by the Main Thread
- once there is no more free space in the log files - log_check_margins() is called
- the Buffer Pool is better managed, but space is still wasted as the Database pages level continues to increase...
Simplified test with log=1024M and concurrency=16 (pct=15) |
IMPORTANT: the dirty pages percentage is set to 15 here (innodb_max_dirty_pages_pct=15)
Buffer Pool state with log=1024M and concurrency=16 (pct=15)
Observations :
- no periodic performance drops anymore...
- no aggressive flushing...
- what's going on?...
Observations :
- this test simply demonstrates how it should work :-))
- the Master Thread is flushing pages fast enough! :-))
- no space is wasted within the Buffer Pool! :-))
- no missing space within the log files! :-))
All this makes me think :
- if Purge Processing is able to follow the workload activity - everything works as it should!
- at the same time - once it's unable to follow the workload activity - the problems begin!...
Instead of SUMMARY... |
Last notes :
- I have a strong feeling all the presented problems have the same source - the Purge Process is, in some cases, unable to keep up with the workload activity!
- We first need to understand why this may happen.
- And then fix it! :-))
- Seriously, it's probably already clear to some of you how to fix it - don't hesitate to present your ideas!
- Honestly, to me it sounds like a bug! Of course, nothing crashes, etc. - but something was missed in the design, and it globally limits InnoDB from processing optimally!
I'm also surprised nobody observed such an issue before... I encourage you to monitor your databases - probably it's already your case too...
At least one positive thing - I'll now obtain even better results in my tests using a 1024M log size! :-))
As well, once this issue is fixed, it'll be important to avoid any regression compared to the currently observed performance levels :-))