2 versions

0.2.1 July 5, 2022
0.2.0 July 5, 2022

#320 in Profiling

MIT/Apache

14KB
89

criterion-cust

This crate provides MeasurementCudaTime for benchmarking CUDA kernels with criterion-rs.

See examples/add.rs for a usage example.

How is CUDA time measured?

GPU execution time is measured by recording CUDA events on the CUDA default stream before and after the benchmark.
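The event-pair idea can be illustrated without a GPU. The sketch below is a simulation only: `SimStream`, `SimEvent`, and `measure` are hypothetical names invented for this illustration and are not part of this crate, cust, or criterion-rs. It shows why events recorded on the stream capture device time: work that ran before the first event is excluded from the result.

```rust
// Hypothetical stand-in for a CUDA stream: work runs in order and advances
// a simulated device clock (in microseconds).
struct SimStream {
    device_clock_us: u64,
}

// Hypothetical stand-in for a CUDA event: it stores the device timestamp
// at the point where it was recorded on the stream.
#[derive(Clone, Copy)]
struct SimEvent {
    timestamp_us: u64,
}

impl SimStream {
    fn new() -> Self {
        SimStream { device_clock_us: 0 }
    }

    // Simulate a kernel that occupies the device for `duration_us`.
    fn launch_kernel(&mut self, duration_us: u64) {
        self.device_clock_us += duration_us;
    }

    // Record an event at the current position in the stream.
    fn record_event(&self) -> SimEvent {
        SimEvent { timestamp_us: self.device_clock_us }
    }
}

// Elapsed device time between two events, in milliseconds.
fn elapsed_ms(start: SimEvent, stop: SimEvent) -> f64 {
    (stop.timestamp_us - start.timestamp_us) as f64 / 1000.0
}

// Record an event, run `iters` simulated kernels, record a second event,
// and return the device time between the two events.
fn measure(stream: &mut SimStream, iters: u64, kernel_us: u64) -> f64 {
    let start = stream.record_event();
    for _ in 0..iters {
        stream.launch_kernel(kernel_us);
    }
    let stop = stream.record_event();
    elapsed_ms(start, stop)
}

fn main() {
    let mut stream = SimStream::new();
    stream.launch_kernel(500); // warm-up work, outside the measured region
    let iters = 10;
    let total_ms = measure(&mut stream, iters, 14);
    println!("total: {} ms, per iteration: {} ms", total_ms, total_ms / iters as f64);
}
```

With a real GPU the two timestamps would come from the CUDA driver rather than from a counter, but the subtraction between the "after" and "before" events is the same.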

Running the example benchmark

cargo bench

Alternatively, install cargo-criterion and run

cargo criterion

This produces output like the following

add kernel/add kernel/2000
                        time:   [0.0142 ms 0.0142 ms 0.0142 ms]
                        thrpt:  [0.5229 GiB/s 0.5231 GiB/s 0.5232 GiB/s]
                 change:
                        time:   [-1.8762% -1.2732% -0.7326%] (p = 0.00 < 0.05)
                        thrpt:  [+0.7380% +1.2896% +1.9121%]
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  3 (3.00%) high mild
  8 (8.00%) high severe
add kernel/add kernel/20000
                        time:   [0.1163 ms 0.1163 ms 0.1164 ms]
                        thrpt:  [0.6403 GiB/s 0.6404 GiB/s 0.6404 GiB/s]
                 change:
                        time:   [-1.5252% -1.0335% -0.4522%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4542% +1.0443% +1.5488%]
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe
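The throughput lines can be cross-checked against the time lines. The sketch below assumes each element is 4 bytes (for example an `f32`); that element size is inferred from the reported numbers and is not stated in this README. The function name `throughput_gib_per_s` is made up for this illustration.

```rust
/// Bytes in one GiB.
const GIB: f64 = 1024.0 * 1024.0 * 1024.0;

/// Assumed element size in bytes (e.g. `f32`). This is an inference from the
/// reported figures, not something the crate documents.
const ASSUMED_BYTES_PER_ELEMENT: f64 = 4.0;

/// Throughput in GiB/s for `elements` processed in `time_ms` milliseconds.
fn throughput_gib_per_s(elements: u64, time_ms: f64) -> f64 {
    let bytes = elements as f64 * ASSUMED_BYTES_PER_ELEMENT;
    let seconds = time_ms / 1000.0;
    bytes / seconds / GIB
}

fn main() {
    // Buffer sizes and time estimates taken from the output above.
    for (elements, time_ms) in [(2000u64, 0.0142), (20000u64, 0.1163)] {
        println!(
            "{} elements in {} ms: {:.4} GiB/s",
            elements,
            time_ms,
            throughput_gib_per_s(elements, time_ms)
        );
    }
}
```

Under that assumption the computed values land close to the reported 0.5231 GiB/s and 0.6404 GiB/s; small differences are expected because the printed times are rounded to four decimal places.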

Making an optimization

Now change the following line in the example

- launch!(module.sum<<<buffer_size, 1, 0, stream>>>(
+ launch!(module.sum<<<256, ((buffer_size + 256 - 1) / 256), 0, stream>>>(
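To see what the change does, it helps to work out the launch dimensions for the two benchmarked buffer sizes. The sketch below assumes the first two launch parameters are the grid size (number of blocks) and the block size (threads per block), in that order. `LaunchConfig`, `original`, `optimized`, and `ceil_div` are hypothetical helpers written for this illustration, not part of the example or of cust.

```rust
/// Hypothetical one-dimensional launch configuration, assuming the first
/// launch parameter is the grid size and the second is the block size.
#[derive(Debug, PartialEq)]
struct LaunchConfig {
    grid: u32,
    block: u32,
}

impl LaunchConfig {
    /// Total number of threads launched.
    fn total_threads(&self) -> u32 {
        self.grid * self.block
    }
}

/// Integer division rounded up, the same `(n + d - 1) / d` form as in the diff.
fn ceil_div(n: u32, d: u32) -> u32 {
    (n + d - 1) / d
}

/// Configuration before the change: `buffer_size` blocks of one thread each.
fn original(buffer_size: u32) -> LaunchConfig {
    LaunchConfig { grid: buffer_size, block: 1 }
}

/// Configuration after the change: 256 blocks, each with enough threads to
/// cover the whole buffer.
fn optimized(buffer_size: u32) -> LaunchConfig {
    LaunchConfig { grid: 256, block: ceil_div(buffer_size, 256) }
}

fn main() {
    for buffer_size in [2000u32, 20000u32] {
        let before = original(buffer_size);
        let after = optimized(buffer_size);
        println!(
            "n = {}: before {:?} ({} threads), after {:?} ({} threads)",
            buffer_size,
            before,
            before.total_threads(),
            after,
            after.total_threads()
        );
    }
}
```

Because the rounded-up configuration launches at least as many threads as there are elements (and usually a few more), a kernel launched this way would normally need a bounds check on its index.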

The benchmark should now run faster

add kernel/add kernel/2000
                        time:   [0.0041 ms 0.0041 ms 0.0041 ms]
                        thrpt:  [1.8300 GiB/s 1.8311 GiB/s 1.8321 GiB/s]
                 change:
                        time:   [-71.520% -71.397% -71.249%] (p = 0.00 < 0.05)
                        thrpt:  [+247.81% +249.61% +251.13%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
add kernel/add kernel/20000
                        time:   [0.0041 ms 0.0041 ms 0.0041 ms]
                        thrpt:  [18.0229 GiB/s 18.0325 GiB/s 18.0405 GiB/s]
                 change:
                        time:   [-96.459% -96.441% -96.421%] (p = 0.00 < 0.05)
                        thrpt:  [+2694.2% +2709.4% +2724.3%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
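The time and throughput change percentages in the outputs above describe the same effect from two directions. For a fixed amount of data, throughput is inversely proportional to time, so one figure can be converted into the other. The helper name `throughput_change_pct` is made up for this sketch.

```rust
/// Convert a relative change in time (in percent) into the corresponding
/// relative change in throughput (in percent), assuming the amount of data
/// processed per iteration is unchanged.
fn throughput_change_pct(time_change_pct: f64) -> f64 {
    // New time as a fraction of the old time, e.g. -71.397% gives 0.28603.
    let time_ratio = 1.0 + time_change_pct / 100.0;
    // Throughput scales with the inverse of the time ratio.
    (1.0 / time_ratio - 1.0) * 100.0
}

fn main() {
    // Point estimates of the time change taken from the outputs above.
    for time_change in [-1.2732, -71.397, -96.441] {
        println!(
            "time {:+.4}% corresponds to throughput {:+.2}%",
            time_change,
            throughput_change_pct(time_change)
        );
    }
}
```

The converted values are close to the reported throughput changes (+1.2896%, +249.61%, +2709.4%). The last one differs slightly, presumably because each point estimate is computed and rounded on its own.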

Troubleshooting

This project uses cust to run CUDA programs. If you encounter build problems, see its README. To use RustaCUDA instead, use version "0.1.0" of this crate.

Dependencies

~14–25MB
~388K SLoC