by Dimitri |
1. PREFACE |
What started as a simple comparison of database engines quickly outgrew its initial test perimeter and was transformed into a more general database scalability study. This is the second part, covering ZFS performance issues under the same test conditions. The following benchmark results cover mixed subjects, combining:
- MySQL MyISAM and InnoDB
- PostgreSQL
- Oracle 10g
- ZFS (Sol10 11/06)
- UFS buffered/forcedirectio
- T2000
- etc.
This study was full of rebound events and crazy surprises. I'm presenting all events here as they happened, in chronological order, finding by finding, with all fixed and potential problems. The final results are very positive, but I was always most interested in knowing *how* people reach the best possible performance level (starting from the same initial conditions as others), rather than just the final numbers... I hope the information presented in this document will also be useful to others, and if it wakes up any new ideas, I'll be even happier! :)
Have fun reading! Any feedback is welcome!
Rgds,
-Dimitri

Note: due to Oracle license restrictions, all Oracle result numbers have been removed...
2. Benchmark Platform |
The same platform is used here as in the first part of the tests (http://dimitrik.free.fr/db_STRESS_BMK_Part1.html), except that only the T2000 host is used as the database server.
2.1. Platform Overview |
System details:
1) T2000 UST1 8 cores, 32GB RAM
   - Solaris 10
   - 2 HBA (2Gbit)
   - Gbit NIC
2) T2000 UST1 8 cores, 32GB RAM
   - Solaris 10
   - Gbit NIC
3) 4x SE3510 (73GB 12HD)
   - primary controller used (2Gbit port)
   - two SE3510 connected to V890, and two others to T2000 (#2)
   - 2 LUNs: RAID1 (6HD) and RAID5 (6HD)
*) Note: V890 was not used in the ZFS tests due to time limitations...
3. Benchmark Overview |
3.1. db_STRESS Benchmark |
See the detailed benchmark description here: http://dimitrik.free.fr/db_STRESS.html
3.2. Benchmark Scenario |
Stress Test:
- think time: 0 sec
- RW: 0, 1, 10, 1000
- injectors: 1, 2, 4, 8, 16, 32, 64, 128, 256

1600usr Test:
- think time: 1 sec
- RW: 0, 1, 10
- injectors: 50, 100, 200, 400, 600, 800, 1000, 1200, 1400, 1600

Additional variations:
- DB Engine: MyISAM, InnoDB, PostgreSQL, Oracle 10g
- Storage: RAID1 + ZFS
- Tuning: any interesting database-specific tuning that doesn't change the sense of the test :))
4. ZFS Benchmark |
ZFS is a very new filesystem, taking only its first steps in the database world. However, its features are so exciting that more and more customers are trying to adopt it in production systems, integrate it into new projects, etc. But constructive performance feedback is still missing to build strong guidelines, so that at least the most common errors can be avoided by following simple best practices. This chapter is just one more step in that direction...
4.1. Build Steps |
Well, I have only one big LUN available from the SE3510 storage box, but let's play with it :)
As expected, building a file system with ZFS takes just a few amazingly simple steps:

# zpool create -f -m /RAID1 zfs_bench c3t40d0s6
# zpool list
NAME        SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
zfs_bench   408G    85K   408G   0%  ONLINE  -
# zfs create zfs_bench/DATA
# zfs create zfs_bench/LOG
#
# df -k
Filesystem            kbytes     used     avail  capacity  Mounted on
/dev/dsk/c0t0d0s0    61805268 47573116  13614100    78%    /
/devices                    0        0         0     0%    /devices
ctfs                        0        0         0     0%    /system/contract
...
zfs_bench           421134336       29 421134127     1%    /RAID1
zfs_bench/DATA      421134336       28 421134127     1%    /RAID1/DATA
zfs_bench/LOG       421134336       24 421134127     1%    /RAID1/LOG
#
That's all! :)
4.2. Tune ZFS |
Let's try with the default settings initially:

root@T2000 # zfs get -o property,value,source all zfs_bench/DATA
PROPERTY       VALUE                  SOURCE
type           filesystem             -
creation       Sat Mar 24  8:55 2007  -
used           28.5K                  -
available      402G                   -
referenced     28.5K                  -
compressratio  1.00x                  -
mounted        yes                    -
quota          none                   default
reservation    none                   default
recordsize     128K                   default
mountpoint     /RAID1/DATA            inherited from zfs_bench
sharenfs       off                    default
checksum       on                     default
compression    off                    default
atime          on                     default
devices        on                     default
exec           on                     default
setuid         on                     default
readonly       off                    default
zoned          off                    default
snapdir        hidden                 default
aclmode        groupmask              default
aclinherit     secure                 default
It seems we may only play with the 'recordsize' parameter ('atime=off' is another candidate, but I'm not sure atime isn't used by databases; and 'checksum=off' would simply take away the main ZFS benefits; but let's see :))
To begin with, I'll leave the recordsize at its default value, so I can validate its impact later.
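For reference, each of these properties is a one-line 'zfs set' away; a quick sketch (not applied here, since for now I'm keeping the defaults):

root@T2000 # zfs set atime=off zfs_bench/DATA                    (skip access-time updates on data files)
root@T2000 # zfs get recordsize,atime,checksum zfs_bench/DATA    (verify the current values)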
However, I'll at least separate the DATA and LOG files of all database servers into two different file systems:
- MySQL: simply by using different data and log paths in the my.cnf file
- PostgreSQL: there is no parameter for the log path, BUT having the tablespaces feature simplifies many things :) So I'll simply initialize the server on the LOG file system (keeping the WAL inside it), and then add a new tablespace with its location pointing to the DATA file system to keep my data files there :) (see the sketch just after this list)
- Oracle: well, no question here :))
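To make this separation concrete, here is a minimal sketch for MySQL and PostgreSQL; the paths are the ZFS mount points created above, while the directory and tablespace names (mysql, pgdata, bench) are just my illustrative choices:

my.cnf:
  [mysqld]
  datadir                   = /RAID1/DATA/mysql     # table data on the DATA file system
  innodb_data_home_dir      = /RAID1/DATA/mysql
  innodb_log_group_home_dir = /RAID1/LOG/mysql      # InnoDB logs on the LOG file system

PostgreSQL:
  # initialize the cluster (and thus the WAL) on the LOG file system
  $ initdb -D /RAID1/LOG/pgdata
  # then point a tablespace at the DATA file system and create the tables there
  psql> CREATE TABLESPACE bench LOCATION '/RAID1/DATA/pgdata';
  psql> CREATE TABLE ... TABLESPACE bench;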
Observations: during the probe tests, database activity was freezing from time to time due to aggressive memory operations within the ZFS cache! We absolutely need to limit the cache to let the other processes live! (it seems to me the default ZFS setting is too optimistic :))
Well, ZFS cache allocation seems to be very aggressive, and I suppose it may slow down other RAM allocations in the system... Let's try a severe limitation now and check if things go better:

/etc/system:
** limit ZFS cache to 4GB
set zfs:zfs_arc_max = 0x100000000

But depending on your Solaris build version this may not work yet, and you'll need to do it manually via 'mdb' like me :))
(see http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6505658 for details)

Observations:
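For completeness, here is the manual 'mdb' route as it circulated at the time; treat it as an assumption for your particular build (the printed address differs on every system, and writing live kernel memory is obviously at your own risk):

root@T2000 # mdb -kw
> arc::print -a c_max                 (print the address and current value of the ARC c_max field)
ffffffffc00b3260 c_max = 0x...
> ffffffffc00b3260/Z 0x100000000      (write the new 4GB limit at that address)
> $q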
- things are going much better!
- let's adapt the 'recordsize' now to avoid over-read/write operations...

Performance?...
- Oracle: stable, but processing 10x slower(!) vs UFS
- MySQL MyISAM: working stably, but 2x slower vs UFS
- MySQL InnoDB:
- tx=1: freezing...
- tx=0: working, but still 3x slower vs UFS
- PostgreSQL:
- fsync=on: working stably, 2x slower vs UFS (but the best ZFS result so far)
- fsync=off: near UFS performance level
Setting the record size to 8K on both DATA and LOG file systems:

root@T2000 # zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
zfs_bench            238G   164G  29.5K  /RAID1
zfs_bench/DATA       233G   164G   233G  /RAID1/DATA
zfs_bench/DATA_32k  24.5K   164G  24.5K  /RAID1/DATA_32k
zfs_bench/DATA_8k   24.5K   164G  24.5K  /RAID1/DATA_8k
zfs_bench/LOG       5.27G   164G  5.27G  /RAID1/LOG
root@T2000 # zfs set recordsize=8K zfs_bench/DATA
root@T2000 # zfs set recordsize=8K zfs_bench/LOG
root@T2000 # zfs get all zfs_bench/DATA
NAME            PROPERTY       VALUE                  SOURCE
zfs_bench/DATA  type           filesystem             -
zfs_bench/DATA  creation       Sat Mar 24  8:55 2007  -
zfs_bench/DATA  used           233G                   -
zfs_bench/DATA  available      164G                   -
zfs_bench/DATA  referenced     233G                   -
zfs_bench/DATA  compressratio  1.00x                  -
zfs_bench/DATA  mounted        yes                    -
zfs_bench/DATA  quota          none                   default
zfs_bench/DATA  reservation    none                   default
zfs_bench/DATA  recordsize     8K                     local
zfs_bench/DATA  mountpoint     /RAID1/DATA            inherited from zfs_bench
zfs_bench/DATA  sharenfs       off                    default
zfs_bench/DATA  checksum       on                     default
zfs_bench/DATA  compression    off                    default
zfs_bench/DATA  atime          on                     default
zfs_bench/DATA  devices        on                     default
zfs_bench/DATA  exec           on                     default
zfs_bench/DATA  setuid         on                     default
zfs_bench/DATA  readonly       off                    default
zfs_bench/DATA  zoned          off                    default
zfs_bench/DATA  snapdir        hidden                 default
zfs_bench/DATA  aclmode        groupmask              default
zfs_bench/DATA  aclinherit     secure                 default
root@T2000 #
NOTE: the new record size is applied ONLY to newly created files!!! That means you need to re-copy all your files to benefit from the new setting.
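In practice that re-copy is trivial; a minimal sketch with an assumed database directory name (the database must be stopped while its files are moved):

root@T2000 # mv /RAID1/DATA/db /RAID1/DATA/db.old
root@T2000 # cp -rp /RAID1/DATA/db.old /RAID1/DATA/db     (fresh copies are written with 8K records)
root@T2000 # rm -rf /RAID1/DATA/db.old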
Observations:
- Well, I prefer to redesign the ZFS layout at this point to avoid any further ambiguity :)
Step 4: Redesign ZFS layout

Why do we need to separate DATA and LOG files to increase database performance on ZFS? Simply because, to offer such an easy external interface, ZFS is quite complex internally, and I don't think that surprises anybody, no? ;) So, being complex and aiming for the most optimal throughput, ZFS also manages internal I/O operation priorities according to the system load. And it's exactly to avoid any priority conflict between the database and ZFS that we do such a separation: LOG data must be written as fast as possible and should get the highest priority in the system load.

First, separate DATA and LOG into two different ZFS pools:
Split the SE3510 LUN:
- LOG:  c3t40d0s6  size: 80G
- DATA: c3t40d0s7  size: 320G

root@T2000 # zpool create zfs_bench_log c3t40d0s6
root@T2000 # zpool create zfs_bench_data c3t40d0s7
root@T2000 # zpool list
NAME             SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
zfs_bench_data   328G    79K   328G   0%  ONLINE  -
zfs_bench_log     80G    79K  80.0G   0%  ONLINE  -
root@T2000 #

Create the file systems:
root@T2000 # zfs create zfs_bench_log/LOG
root@T2000 # zfs create zfs_bench_data/DATA
root@T2000 #
root@T2000 # zfs set recordsize=8k zfs_bench_data/DATA
root@T2000 # zfs set recordsize=8k zfs_bench_log/LOG
root@T2000 # zfs set mountpoint=/RAID1/LOG zfs_bench_log/LOG
root@T2000 # zfs set mountpoint=/RAID1/DATA zfs_bench_data/DATA
root@T2000 #
root@T2000 # df -k
Filesystem            kbytes     used     avail  capacity  Mounted on
/dev/dsk/c0t0d0s0    61805268 47573501  13613715    78%    /
/devices                    0        0         0     0%    /devices
...
zfs_bench_log        82575360       24  82575249     1%    /zfs_bench_log
zfs_bench_data      338558976       24 338558865     1%    /zfs_bench_data
zfs_bench_log/LOG    82575360       24  82575249     1%    /RAID1/LOG
zfs_bench_data/DATA 338558976       24 338558865     1%    /RAID1/DATA
root@T2000 #

Probe Stress (RW=1) test observations:
- This iostat snapshot taken during PostgreSQL activity shows the positive change in the workload: WRITE is back to low latency, and you may also see 25 concurrent I/O operations (well, it's probably too much and we'll need to tune it down to 16, but in all the previous tests this value was never greater than 1.0 :))
extended device statistics
     r/s    w/s     kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
  1269.0  448.7  68980.5  9693.6   0.0  24.9     0.0    14.5   3  100  c3t40d0
     0.0  187.7      0.0     0.0   0.0   0.0     0.0     0.0   0    0  c3t40d0s0
     1.0  260.7     64.0  9693.7   0.0   0.4     0.0     1.6   1   38  c3t40d0s6
  1267.7    0.0  68895.7     0.0   0.0  24.5     0.0    19.3   2  100  c3t40d0s7
^C

- PostgreSQL: 90% of UFS performance!
- Oracle: 75% of UFS performance!
Step 5: ZFS zil_noflush setting

ZFS by default will force a flush of the storage write cache during each write operation, for all files opened with the O_SYNC or O_DSYNC option... However, it seems my SE3510 is clever enough to ignore these flushes and keep write latency under 1.0 millisecond (otherwise my storage box would be reduced to the performance of its internal disks)...

NOTE:
- zil_noflush value is read only during file system mount
- don't change this value if you don't have secure NVRAM on your storage
- zil_noflush value may migrate to another name/parameter

Well, I don't need it...
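For those who do have secure NVRAM and want to try it anyway, a sketch of the two usual ways to flip such a tunable; the variable name is an assumption for this build (as noted above, it may migrate, e.g. to zfs_nocacheflush in later releases):

/etc/system (read at boot, before the file systems are mounted):
  set zfs:zil_noflush = 1

or live, before (re)mounting the file system:
  root@T2000 # echo zil_noflush/W1 | mdb -kw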
Step 6: ZFS Checksum feature

I/O checksumming is one of the main ZFS features. However, there is also block checksumming done by Oracle! In the case of UFS it's a must (as UFS doesn't do any checking), but with ZFS it's just an additional waste of CPU time. It seems that sometimes changing the Oracle db_block_checksum parameter to false may give a significant performance gain on ZFS! But it did not change anything in my case. Let's continue... :))
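(For reference, the change itself is a single line in the Oracle init.ora/spfile; 'false' is the pre-10gR2 spelling and the exact value names vary by release, so treat this as an assumption:)

init.ora:
  db_block_checksum = false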
Step 7: ZFS Prefetch

We may still probably gain in I/O latency by reducing the default (64K) ZFS prefetch size (the less over-reading is done, the faster write operations may go :)). The ZFS prefetch size may be changed dynamically via 'mdb', and Roch already wrote a life-simplifying script to do it (ztune.sh).

Setting an 8K prefetch size with ztune.sh:
root@T2000 # ksh /ztune.sh
Usage: /ztune.sh
Sets the leaf vdev prefetch size on pool.
Sets the number max_pending of per dev I/O.
Prefetch size must be power of 2 and pool must be online.
Size can be in bytes or in K: 64K.
Pool will go through an export/import cycle.
Tuning is not persistant; lasts until next import/reboot.
root@T2000 #
root@T2000 # ksh /ztune.sh zfs_bench_data 8K 16
**********************************************************
CRASH Warning: scripts edits in-kernel parameters
and plays with unstable interfaces. May potentialy
crash your system.
**********************************************************
Setting vdev prefetch to 8K and max pending to 16 on zpool zfs_bench_data
Requires the appropriate priviledges (mdb -kw)
Will need a cycle of export/import of pool zfs_bench_data
ctrl-c now if you don't want that...
Exporting pool zfs_bench_data...done
Importing pool zfs_bench_data with new soft track buffer and max pending...done
Restoring defaults for other/future imports...done
root@T2000 # ksh /ztune.sh zfs_bench_log 8K 16
**********************************************************
CRASH Warning: scripts edits in-kernel parameters
and plays with unstable interfaces. May potentialy
crash your system.
**********************************************************
Setting vdev prefetch to 8K and max pending to 16 on zpool zfs_bench_log
Requires the appropriate priviledges (mdb -kw)
Will need a cycle of export/import of pool zfs_bench_log
ctrl-c now if you don't want that...
Exporting pool zfs_bench_log...done
Importing pool zfs_bench_log with new soft track buffer and max pending...done
Restoring defaults for other/future imports...done
root@T2000 #

Observations: as we say all the time in the Benchmark Center, there is no magic ball to solve all your problems, and each tip may turn out differently on each workload. The only true answer to a performance issue is a true test! :) To my big surprise, with the 8K prefetch things became worse... And the processing even became less stable... Well, a bad hit for me.
Step 8: BUMMER!...

Just out of curiosity, to compare some other data points, I ran a small test on UFS and was VERY surprised to actually see the same results as on ZFS... So, something went wrong with my data, or my storage, or my system, etc. And to be sure I'm comparing apples to apples from now on, I prefer to simply replay all the MySQL5, PostgreSQL and Oracle tests on UFS and use these new (or same, but confirmed) results as the comparison base!...
4.3. ReTEST db_STRESS @UFS on T2000 |
Quite interesting results:
- well, if you have the habit of meeting customers having performance problems, you may find they all have at least one thing in common: all of them generally claim they changed NOTHING! :))
- but the next joke is that I'm in the same situation myself, and I'm SURE I changed nothing :))
- but the numbers are not really the same as the ones I've seen before:
- MySQL5 tx=1: lost 50% of its previous performance level
- MySQL5 tx=0: lost 40% in performance; it seems the anti-deadlock feature removed some part of the locking in InnoDB and the processing became more stable, but still with a significant slowdown at the top level of the workload...
- PostgreSQL: lost 20% in performance
- Oracle: at least this one is still the same as before
Note:
- tx is the 'innodb_flush_log_at_trx_commit' MySQL configuration parameter (possible values are 0, 1, 2; see the MySQL doc for more details)
- tx=1 means a log flush for each commit
- tx=0 means a single grouped log flush for all commits every 1 second (and there are real customers using this setting in production!); what's important: all commit requests are confirmed to all transactions at that point while the data is still not flushed to disk, so the price to pay is possibly losing the transactions of the last second...
- PostgreSQL also has a commit-regrouping feature, but will never confirm a commit to any transaction before a real flush. Of course, this approach is very professional, no doubt! But perhaps, for customers who agree to lose some transactions in a crash, there could be an additional option? ;)
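For reference, here is a sketch of the knobs discussed on both sides; the values are only illustrative:

my.cnf (MySQL/InnoDB):
  innodb_flush_log_at_trx_commit = 1    # flush the log at each commit (safe)
  # innodb_flush_log_at_trx_commit = 0  # grouped flush ~every second (may lose the last second on crash)

postgresql.conf:
  fsync = on            # never confirm a commit before a real flush
  commit_delay = 10     # microseconds to wait for companion commits, grouping the flush
  commit_siblings = 5   # ...but only if at least this many other transactions are active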
UFS Test: apples to apples |
TPS Stress TEST @UFS (without Oracle)
UFS Test: Detailed STATs |
db_STRESS @UFS STATs - MySQL5 MyISAM @T2000
db_STRESS @UFS STATs - MySQL5 InnoDB tx=1 @T2000
db_STRESS @UFS STATs - MySQL5 InnoDB tx=0 @T2000
db_STRESS @UFS STATs - PostgreSQL @T2000
4.4. ZFS, Continue |
Strange memory management issue: I have a strong impression there is some sort of conflict in memory/cache management between UFS and ZFS. Just after finishing the UFS workload, I started another one on ZFS. To my big surprise, the ZFS test went very sloooooowly and started to freeze jump by jump... It's really strange... I don't like to reboot machines, but I did not find any faster workaround ;) After the reboot everything seems to work just fine... Any ideas?
The tests still continue... Fingers crossed? :))
First good news:
- I've reached 100% of the UFS performance level on ZFS with Oracle!!!
- PostgreSQL seems to work correctly, but suffers from aggressive reading (small database cache due to the 32-bit version); need to check with a bigger ARC on ZFS (16G?)...
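Raising the limit follows the same recipe as before; a sketch (again, depending on the build you may need the 'mdb' route instead):

/etc/system:
** raise the ZFS cache limit to 16GB (0x400000000 = 16 * 2^30)
set zfs:zfs_arc_max = 0x400000000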
ZFS Tests with ARC=4G |
db_STRESS @ZFS.arc=4G STATs - MySQL5 MyISAM @T2000
db_STRESS @ZFS.arc=4G STATs - MySQL5 InnoDB tx=1 @T2000
db_STRESS @ZFS.arc=4G STATs - MySQL5 InnoDB tx=0 @T2000
db_STRESS @ZFS.arc=4G STATs - PgSQL @T2000
ZFS Tests with ARC=16G |
db_STRESS @ZFS.arc=16G STATs - MySQL5 MyISAM @T2000
db_STRESS @ZFS.arc=16G STATs - MySQL5 InnoDB tx=1 @T2000
db_STRESS @ZFS.arc=16G STATs - MySQL5 InnoDB tx=0 @T2000
db_STRESS @ZFS.arc=16G STATs - PgSQL @T2000

Reaching 100% of the UFS performance level!
5. SUMMARY: db_STRESS@ZFS |
Even if ZFS is still young and not yet widely popular, its features and performance have improved with each new Solaris release. Seeing that it's already able to perform at least like its big brother UFS, it's quite attractive to move to ZFS and get all the benefits of the most innovative file system solution! And of course, the more we test it, the better it'll become! :))
5.1. Tests ZFS arc=4G |
5.2. Tests ZFS arc=16G |
5.3. Tests MyISAM: UFS vs ZFS |
5.4. Tests InnoDB tx=1: UFS vs ZFS |
5.5. Tests InnoDB tx=0: UFS vs ZFS |
5.6. Tests PostgreSQL: UFS vs ZFS |