C(++)言語: valgrindの使い方 (cachegrind)
Kazuki Ohta, 2006/05/20

「Binary Hacks」
(2-2) Cachegrind: キャッシュミスを検出する
cachegrindを用いるとプログラム実行中のキャッシュ(L1, D1, L2)のヒット率を測定する事が出来る。実験の為に次のようなプログラムを用意した。
static char buf[4096][4096]; // 16M
int main(void)
{
int i, j;
for (i = 0; i < 4096; i++) {
for (j = 0; j < 4096; j++) {
#ifndef BAD
buf[i][j] = 'a';
#else
buf[j][i] = 'a';
#endif
}
}
}
BADをdefineしておくと、アクセスする度に違うページにアクセスするのでキャッシュ効率が悪くなる。手元の環境では大体5倍ぐらいの差が出た。
ubuntu% gcc cache.c; time ./a.out ./a.out 0.41s user 0.03s system 99% cpu 0.444 total ubuntu% gcc -DBAD cache.c; time ./a.out ./a.out 1.98s user 0.03s system 99% cpu 2.020 total
ubuntu% gcc cache.c; valgrind --tool=cachegrind ./a.out ==11733== Cachegrind, an I1/D1/L2 cache profiler. ==11733== Copyright (C) 2002-2005, and GNU GPL'd, by Nicholas Nethercote et al. ==11733== Using LibVEX rev 1367, a library for dynamic binary translation. ==11733== Copyright (C) 2004-2005, and GNU GPL'd, by OpenWorks LLP. ==11733== Using valgrind-3.0.1, a dynamic binary instrumentation framework. ==11733== Copyright (C) 2000-2005, and GNU GPL'd, by Julian Seward et al. ==11733== For more details, rerun with: -v ==11733== ==11733== ==11733== I refs: 167,994,594 ==11733== I1 misses: 554 ==11733== L2i misses: 547 ==11733== I1 miss rate: 0.00% ==11733== L2i miss rate: 0.00% ==11733== ==11733== D refs: 84,004,019 (67,192,625 rd + 16,811,394 wr) ==11733== D1 misses: 263,767 ( 1,374 rd + 262,393 wr) ==11733== L2d misses: 263,555 ( 1,172 rd + 262,383 wr) ==11733== D1 miss rate: 0.3% ( 0.0% + 1.5% ) ==11733== L2d miss rate: 0.3% ( 0.0% + 1.5% ) ==11733== ==11733== L2 refs: 264,321 ( 1,928 rd + 262,393 wr) ==11733== L2 misses: 264,102 ( 1,719 rd + 262,383 wr) ==11733== L2 miss rate: 0.1% ( 0.0% + 1.5% )
ubuntu% gcc -DBAD cache.c; valgrind --tool=cachegrind ./a.out ==11740== Cachegrind, an I1/D1/L2 cache profiler. ==11740== Copyright (C) 2002-2005, and GNU GPL'd, by Nicholas Nethercote et al. ==11740== Using LibVEX rev 1367, a library for dynamic binary translation. ==11740== Copyright (C) 2004-2005, and GNU GPL'd, by OpenWorks LLP. ==11740== Using valgrind-3.0.1, a dynamic binary instrumentation framework. ==11740== Copyright (C) 2000-2005, and GNU GPL'd, by Julian Seward et al. ==11740== For more details, rerun with: -v ==11740== ==11740== ==11740== I refs: 167,994,594 ==11740== I1 misses: 554 ==11740== L2i misses: 547 ==11740== I1 miss rate: 0.00% ==11740== L2i miss rate: 0.00% ==11740== ==11740== D refs: 84,004,019 (67,192,625 rd + 16,811,394 wr) ==11740== D1 misses: 16,778,838 ( 1,374 rd + 16,777,464 wr) ==11740== L2d misses: 16,778,626 ( 1,172 rd + 16,777,454 wr) ==11740== D1 miss rate: 19.6% ( 0.0% + 99.1% ) ==11740== L2d miss rate: 19.6% ( 0.0% + 99.1% ) ==11740== ==11740== L2 refs: 16,779,392 ( 1,928 rd + 16,777,464 wr) ==11740== L2 misses: 16,779,173 ( 1,719 rd + 16,777,454 wr) ==11740== L2 miss rate: 6.5% ( 0.0% + 99.1% )
D1は20%、L2は6.5%まで跳ね上がった。これぐらいのcache miss rateの違いでプログラムのパフォーマンスが5倍も違うのだから驚きだ。さらにcg_annotateというコマンドを用いるとcache missが起こっている箇所を特定する事が出来る。cachegrindを実行するとcachegrind.out.pidというファイルが出来るので、そのファイルが有るディレクトリでcg_annotate --pidを実行する。
ubuntu% valgrind --tool=cachegrind ls -al
ubuntu% cg_annotate --11886
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 32768 B, 64 B, 8-way associative
L2 cache: 1048576 B, 64 B, 8-way associative
Command: ls -al
Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Thresholds: 99 0 0 0 0 0 0 0 0
Include dirs:
User annotated:
Auto-annotation: off
--------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
--------------------------------------------------------------------------------
14,361,788 9,436 2,345 5,103,436 26,138 3,082 2,415,534 6,362 2,738 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
--------------------------------------------------------------------------------
5,728,224 5,683 1,569 2,010,043 14,285 2,390 1,325,469 2,973 1,834 ???:???
3,052,279 1 1 1,102,170 4,882 14 0 0 0 ???:strcmp
1,475,659 30 30 532,371 367 121 274,034 27 0 ???:strcoll_l
544,839 465 33 165,068 14 5 135,500 22 11 ???:vfprintf
508,370 10 7 223,196 2 0 52,252 2 0 ???:_IO_file_xsputn
401,365 6 4 69,811 535 28 0 0 0 ???:strlen
322,361 29 13 126,790 976 0 77,493 0 0 ???:tsearch
235,601 398 2 73,294 2 1 37,036 4 1 ???:_IO_default_xsputn
221,939 22 6 18,858 0 0 1,728 0 0 ???:index
212,179 2 1 156,352 1,602 1 125,337 1,524 21 ???:memcpy
176,000 27 7 25,492 421 98 3,500 0 0 ???:memchr
171,333 13 5 71,139 86 0 4,498 0 0 ???:getenv
(以下略)
それぞれのcolumnは以下のものを示している。
- Ir: Iキャッシュ読込み
- I1mr: I1キャッシュ読込みミス
- I2mr: I2キャッシュ読込みミス
- Dr: Dキャッシュ読込み
- D1mr: D1キャッシュ読込みミス
- D2mr: L2キャッシュ読込みミス
- Dw: Dキャッシュ書き込みミス
- D1mw: D1キャッシュ書き込みミス
- D2mw: L2キャッシュ書き込みミス
[ return ]