Oprofile Analysis
Preface
First we profiled a RHEL4/U5 system with kernel 2.6.9-55.0.2.ELsmp with glock_purging enabled to 50% on the filesystem being the base of the profiling.
Next we increased the gfs_scand timeout to 30 seconds then to 60seconds.
Normal setting with glock_purging 50%
We see that gfs uses 40% of all time spend in the kernel. Gfs spends most of its time in scan_glock caused by gfs_scand. examine_bucket is called by it. DLM which is mainly called by gfs then spends 64% of its time in __find_lock_by_id. Why?
Results are shown below.
report-all-long
CPU: AMD64 processors, speed 2405.53 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 CPU_CLK_UNHALT...| samples| %| 55234805 44.7989 /usr/lib/debug/lib/modules/2.6.9-55.0.2.ELsmp/vmlinux 48863618 39.6315 /gfs 9353924 7.5866 /dlm 3609413 2.9275 /lib64/tls/libc-2.3.4.so 1494639 1.2122 /usr/sbin/smbd 782285 0.6345 /usr/sbin/winbindd 599506 0.4862 /lock_dlm 466190 0.3781 /oprofile 418400 0.3393 /bin/gawk 415330 0.3369 /bin/bash 408049 0.3310 /usr/bin/oprofiled
report-gfs
CPU: AMD64 processors, speed 2405.53 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 17298461 35.4015 scan_glock 8465404 17.3246 examine_bucket 7882389 16.1314 trylock_on_glock 3630956 7.4308 run_queue 3506498 7.1761 gfs_stat_gfs 1626100 3.3278 gfs_glock_poll 989411 2.0248 unlock_on_glock 815211 1.6683 search_bucket 408136 0.8353 gfs_glock_dq 341993 0.6999 gfs_glock_nq 303472 0.6211 getbuf 251025 0.5137 glock_wait_internal 235018 0.4810 gfs_glock_nq_init
report-dlm
CPU: AMD64 processors, speed 2405.53 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 5987275 64.0082 __find_lock_by_id 2587276 27.6598 search_hashchain 204512 2.1864 search_bucket 132882 1.4206 dlm_astd 39139 0.4184 find_lock_by_id 38604 0.4127 grant_lock 35605 0.3806 dlm_hash 33917 0.3626 find_lockspace_by_local_id 32165 0.3439 dlm_lock 25499 0.2726 queue_conflict 20191 0.2159 dlm_unlock_stage2 19587 0.2094 release_lkb 19253 0.2058 dlm_unlock
This leed to the idea to increase the gfs_scand timeout from 5 to 30 seconds to see if differnt results could be produced.
gfs_scand 30seconds
After profiling for another day we see that the utilisation of gfs_scand could be dramatically reduced. From 40% to 20%. Although still dlm is 7% utilized and still spends 64% of its time in __find_lock_by_id. This still is quite a high figure. Because all the other two functions we see below (search_hashchain and search_bucket) themselves call __find_lock_by_id. We'll see if we can get even better gfs results from increasing the gfs-scand timeout to 60 seconds.
report-all-long(30)
# opreport --long-filenames session:atix-070907
CPU: AMD64 processors, speed 2405.53 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
CPU_CLK_UNHALT...|
samples| %|
32105670 58.0859 /usr/lib/debug/lib/modules/2.6.9-55.0.2.ELsmp/vmlinux
11785838 21.3230 /gfs
3370925 6.0987 /dlm
3005555 5.4377 /lib64/tls/libc-2.3.4.so
1311688 2.3731 /usr/sbin/smbd
698238 1.2633 /usr/sbin/winbindd
391465 0.7082 /e1000
375225 0.6789 /lock_dlm
350211 0.6336 /bin/gawk
349392 0.6321 /bin/bash
344060 0.6225 /oprofile
207063 0.3746 /usr/bin/oprofiled
176924 0.3201 /lib64/ld-2.3.4.so
84152 0.1522 /dm_mod
report-gfs(30)
# cat report-gfs CPU: AMD64 processors, speed 2405.53 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 2429869 20.6169 scan_glock 1193030 10.1226 examine_bucket 1119122 9.4955 trylock_on_glock 1029632 8.7362 search_bucket 821964 6.9742 run_queue 783207 6.6453 gfs_stat_gfs 385457 3.2705 gfs_glock_dq 320107 2.7160 getbuf 275509 2.3376 gfs_glock_nq 237569 2.0157 gfs_drevalidate 218782 1.8563 gfs_flush_meta_cache 208996 1.7733 gfs_glock_nq_init 206271 1.7502 glock_wait_internal 191224 1.6225 gfs_holder_init 181087 1.5365 leaf_search 168320 1.4282 gfs_glock_poll 144269 1.2241 unlock_on_glock 136681 1.1597 dirent_next 134429 1.1406 lock_on_glock 122123 1.0362 gfs_get_meta_buffer
report-dlm(30)
[root@rhein ~]# cat report-dlm CPU: AMD64 processors, speed 2405.53 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 2164044 64.1973 __find_lock_by_id 806314 23.9197 search_hashchain 186350 5.5282 search_bucket 47499 1.4091 dlm_astd 12453 0.3694 find_lock_by_id 12442 0.3691 find_lockspace_by_local_id 12408 0.3681 dlm_hash 11176 0.3315 grant_lock 11037 0.3274 dlm_lock 8903 0.2641 release_lkb 6561 0.1946 queue_conflict 6313 0.1873 wake_astd 6023 0.1787 queue_ast 5985 0.1775 _release_rsb 5826 0.1728 find_rsb 5802 0.1721 dlm_unlock_stage2
gfs_scand 60seconds
After profiling for a day we see that the utilisation of gfs_scand could be not be reduced much more. From 21% to 20% (compared to 30sec). Although dlm is only 3.5% utilized but still spends 53% of its time in __find_lock_by_id. Which I would say didnt' change very much. This still is quite a high figure. Because all the other two functions we see below (search_hashchain and search_bucket) themselves call __find_lock_by_id.
report-all-long(60)
# opreport --long-filenames
CPU: AMD64 processors, speed 2400.3 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
CPU_CLK_UNHALT...|
samples| %|
43849743 60.0708 /usr/lib/debug/lib/modules/2.6.9-55.0.2.ELsmp/vmlinux
14834351 20.3219 /gfs
4586071 6.2826 /lib64/tls/libc-2.3.4.so
2578246 3.5320 /dlm
1524648 2.0887 /usr/sbin/smbd
1004656 1.3763 /usr/sbin/winbindd
627152 0.8592 /bin/bash
551659 0.7557 /e1000
550770 0.7545 /oprofile
363930 0.4986 /lock_dlm
297914 0.4081 /usr/bin/oprofiled
268969 0.3685 /lib64/ld-2.3.4.so
255994 0.3507 /dm_mod
248763 0.3408 /qla2xxx
235919 0.3232 /bin/gawk
120316 0.1648 /usr/lib64/gconv/IBM850.so
84104 0.1152 /scsi_mod
72984 0.1000 /usr/lib64/gconv/IBM437.so
61666 0.0845 /tg3
61078 0.0837 /usr/sbin/slapd
report-gfs(60)
# opreport -l image:/gfs warning: could not check that the binary file /gfs has not been modified since the profile was taken. Results may be inaccurate. CPU: AMD64 processors, speed 2405.53 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 1650760 11.1047 gfs_stat_gfs 1408348 9.4740 scan_glock 1350035 9.0817 gfs_dpin 826363 5.5589 run_queue 700365 4.7114 examine_bucket 660786 4.4451 trylock_on_glock 660358 4.4422 search_bucket 580405 3.9044 gfs_glock_dq 465269 3.1299 gfs_glock_nq 464307 3.1234 getbuf 327035 2.2000 gfs_flush_meta_cache 316230 2.1273 gfs_holder_init 303626 2.0425 gfs_drevalidate 303492 2.0416 glock_wait_internal 282833 1.9026 gfs_glock_poll 263121 1.7700 gfs_glock_nq_init 242750 1.6330 gfs_get_meta_buffer 237037 1.5945 gfs_log_commit 205337 1.3813 lock_on_glock 202481 1.3621 leaf_search 134180 0.9026 dirent_next
report-dlm(60)
# opreport -l image:/dlm warning: could not check that the binary file /dlm has not been modified since the profile was taken. Results may be inaccurate. CPU: AMD64 processors, speed 2405.53 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 1371891 53.0608 __find_lock_by_id 618652 23.9277 search_hashchain 234298 9.0620 search_bucket 52164 2.0176 dlm_astd 23632 0.9140 dlm_sendd 19355 0.7486 dlm_hash 16578 0.6412 find_lock_by_id 16505 0.6384 find_lockspace_by_local_id 15551 0.6015 dlm_lock 13710 0.5303 _release_rsb 11572 0.4476 allocate_rsb 11493 0.4445 release_lkb 10680 0.4131 send_cluster_request 10439 0.4038 process_cluster_request 9598 0.3712 find_rsb
Conclusion
What we see is in that case of cluster we can reduce the utilitzation of the GFS Module by increasing the gfs_scand timeouts. Also that utilitsation of the DLM Module seems to be influenced. Still there is one function which seems to be the hotspot (__find_lock_by_id). With all differnt gfs_scand timeout settings the utilistation of that function in the DLM module stayed quite constant (64%-53%).
One idea now to even increase the reaction time of the system is to give the dlm_astd kernel process a higher priority so that the kernel-scheduler calls it more frequent (see cross-post).
Another step would be to increase the gfs_scand even more (from 60 to 120 seconds).