5 个版本 (破坏性)

0.6.1	2021 年 9 月 15 日
0.4.0	2021 年 8 月 2 日
0.3.0	2021 年 6 月 29 日
0.2.0	2021 年 6 月 27 日
0.1.0	2021 年 6 月 20 日

#323 在压缩

Apache-2.0

430KB
1.5K SLoC

DataSketches 在 Rust 中

Rust 绑定 Apache DataSketches 库和命令行工具。

在命令行中，我们提供

dsrs [--key] [--raw] [--merge] 用于近似不同的行计数，以及
dsrs --hh k 用于热门元素（近似最频繁的行）。

例如，以下实验检查了当你打印 1 亿个数字两次时存在多少唯一的行。

m100=$((100 * 1000 * 1000))
(seq $m100 && seq $m100) | \
  /usr/bin/time -f "%e sec %M KB" dsrs
102055590
5.22 sec 4288 KB

(seq $m100 && seq $m100) | \
  /usr/bin/time -f "%e sec %M KB" sort -u | wc -l
438.66 sec 12880 KB
100000000

(seq $m100 && seq $m100) | \
  /usr/bin/time -f "%e sec %M KB" awk '{a[$0]=1}END{print length(a)}'
100000000
39.28 sec 898240 KB

接下来，我们可以要求从流中获取最受欢迎的行（有一个 topfew Rust 包，但它不支持流）。

m10=$((10 * 1000 * 1000))
seq $m10 | sed 's/$/\n1\n2\n3/' | \
  /usr/bin/time -f "%e sec %M KB" sort | \
  uniq -c | sort -rn | head -3
54.88 sec 8968 KB
10000001 3
10000001 2
10000001 1
  
# exact hashmap solution, requires go
pushd /tmp && \
  (test -d topfew || git clone git@github.com:timbray/topfew.git topfew) && \
  pushd topfew && make && popd && popd
seq $m10 | sed 's/$/\n1\n2\n3/' | \
  /usr/bin/time -f "%e sec %M KB" /tmp/topfew/bin/tf -f 1 -n 3
10000001 2
10000001 3
10000001 1
10.67 sec 1060332 KB
  
seq $m10 | sed 's/$/\n1\n2\n3/' | \
  /usr/bin/time -f "%e sec %M KB" target/release/dsrs --hh 3
10000001 2
10000001 1
10000001 3
4.48 sec 4560 KB

以下是一个工具的复杂示例在此处运行，用于计算过去几十年亚马逊的滚动平均活跃评论者。基于非 sketch 的解决方案会 OOM。

安装

假设已经安装了现代 Rust cargo。命令行工具 dsrs 可以使用以下命令安装

cargo install dsrs

库可以通过将其添加到您的 Cargo.toml 文件中作为常规 Rust 依赖项使用。

嵌入式 C++ 库

此 Rust 库包含从提交 043b947f 的 header-only datasketches-cpp 库中手动复制的头文件。

这是通过提取所有头文件来完成的。假设您位于 datasketches-rs 目录中，它有一个兄弟 datasketches-cpp

# make all required directories
find ../datasketches-cpp/ -name "*.h" -or -name "*.hpp" | \
  xargs dirname | \
  sort -u |
  cut -d/ -f2- | \
  xargs mkdir -p
# copy over the actual headers
find ../datasketches-cpp/ -name "*.h" -or -name "*.hpp" | \
  cut -d/ -f2- | \
  xargs -I {} cp ../{} {}
# and the license info too
cp ../datasketches-cpp/{NOTICE,LICENSE} datasketches-cpp/

# some manual interventions were required for the heavy hitters
# implementation, which requires the C++ side to temporarily own
# keys from Rust, so additional management code needs to be injected.
git apply fi.patch
git grep -l "uint16_t DRIFT_LIMIT = [0-9]*;" | xargs sed -i 's/uint16_t DRIFT_LIMIT = [0-9]*;/uint32_t DRIFT_LIMIT = 1024 * 1024 * 1024;/'

这一切都归功于出色的 dtolnay/cxx 库！

为什么在 Rust 中使用 DataSketches？

有很多包含HyperLogLog抽样的crate。然而，当我尝试使用它们（截至2021年6月20日），我发现它们在特定输入上API会崩溃（例如，尝试amadeus_streaming::HyperLogLog::<u64>::new(0.0001);），或者没有合并操作。对1M个唯一键的非常基础的cargo criterion测试发现，CPC的准确度仍然更好（以下所有内容均设置相同的名义精度配置，因此它们应该使用大致相同数量的内存）

repeat-ten/dsrs::CpcSketch/1000000
                        time:   [144.95 ms 149.27 ms 155.42 ms]
repeat-ten/amadeus_streaming::HyperLogLog/1000000
                        time:   [132.89 ms 134.01 ms 135.49 ms]
repeat-ten/probabilistic_collections::HyperLogLog/1000000
                        time:   [159.99 ms 165.94 ms 172.32 ms]
repeat-ten/probably::HyperLogLog/1000000
                        time:   [119.47 ms 123.95 ms 127.84 ms]
repeat-ten/hyperloglogplus::HyperLogLogPlus/1000000
                        time:   [120.74 ms 121.32 ms 122.10 ms]

relative errors
size: 1000000
  relerr:   1.1% name: dsrs::CpcSketch
  relerr:   3.3% name: amadeus_streaming::HyperLogLog
  relerr:   4.3% name: hyperloglogplus::HyperLogLogPlus
  relerr:  50.7% name: probably::HyperLogLog
  relerr:   inf% name: probabilistic_collections::HyperLogLog

而整体更新速度在不同实现之间变化不大。

依赖关系

~3.5–5MB
~79K SLoC