#line #input #sampling #reservoir #random #sample #file

app rsam

A Rust-based random sampler for text data that uses the reservoir sampling algorithm

2 releases (1 stable)

1.0.0 Mar 21, 2023
0.1.0 Mar 21, 2023

#2 in #reservoir

28 downloads per month

GPL-3.0-only

25KB
306

rsam

A random sampler for text data that uses the reservoir sampling algorithm.
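For readers unfamiliar with reservoir sampling, the following is a minimal sketch of the classic "Algorithm R" variant. It is not taken from rsam's source: the names `XorShift64` and `reservoir_sample` are hypothetical, and the small xorshift generator is only there so the example needs no external crate (a real tool would more likely use the `rand` crate).

```rust
/// Tiny xorshift64 generator (assumption: stands in for a proper RNG crate).
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Roughly uniform index in 0..bound (slight modulo bias, fine for a sketch).
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Algorithm R: keep a uniform random sample of up to `k` items from a
/// stream whose length is not known in advance.
fn reservoir_sample<I>(items: I, k: usize, rng: &mut XorShift64) -> Vec<I::Item>
where
    I: Iterator,
{
    let mut reservoir = Vec::with_capacity(k);
    for (i, item) in items.enumerate() {
        if i < k {
            // The first k items fill the reservoir directly.
            reservoir.push(item);
        } else {
            // Item i (0-based) replaces a random slot with probability k / (i + 1).
            let j = rng.below(i + 1);
            if j < k {
                reservoir[j] = item;
            }
        }
    }
    reservoir
}
```

The function accepts any iterator, so lines read from a file or from stdin can be passed in the same way. Only `k` items are held in memory at a time, which is the main reason to choose this technique for large inputs.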

Installation

Install cargo and Rust from: https://www.rust-lang.net.cn/tools/install

cargo install rsam

Usage

## sample 1000 lines from a file
rsam -s 1000 -o output.txt -i input.txt

## sample 1000 lines from a file and output to stdout
rsam -s 1000 -i input.txt 1>output.txt 2>output.log

## sample 1000 lines from a file and overwrite the existing output file
rsam -s 1000 -i input.txt -o output.txt -r

## sample 10% of the lines from a file (size given as a fraction)
rsam -s 0.1 -o output.txt -i input.txt
rsam -s .1 -o output.txt -i input.txt

## keep comment lines (lines starting with "#")
rsam -s 0.1 -o output.txt -c "#" -i input.txt

## read from stdin
zcat input.txt.gz | rsam -s 0.1 -o output.txt
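The `-s` values above are either whole numbers or decimals below 1. The sketch below shows one way such an argument could be parsed and then turned into a line count. It is an illustration under stated assumptions, not rsam's actual code: `Absolute` is the variant name printed in the benchmark log, while `Fraction`, `parse_size`, and `resolve_size` are hypothetical names, and treating a decimal as a fraction of the total line count is an assumption.

```rust
/// How many lines to keep. `Absolute` appears in rsam's log output;
/// `Fraction` is a hypothetical name for the decimal case.
#[derive(Debug, PartialEq)]
enum SampleSize {
    Absolute(usize),
    Fraction(f64),
}

/// Parse the text given to `-s`.
/// Assumption: an integer is an absolute line count, and a decimal
/// strictly between 0 and 1 is a fraction of the total line count.
fn parse_size(input: &str) -> Result<SampleSize, String> {
    if let Ok(n) = input.parse::<usize>() {
        return Ok(SampleSize::Absolute(n));
    }
    match input.parse::<f64>() {
        Ok(f) if f > 0.0 && f < 1.0 => Ok(SampleSize::Fraction(f)),
        _ => Err(format!("invalid sample size: {input:?}")),
    }
}

/// Turn the parsed size into a concrete line count once the total is known,
/// never asking for more lines than the input contains.
fn resolve_size(size: &SampleSize, total_lines: usize) -> usize {
    match size {
        SampleSize::Absolute(n) => (*n).min(total_lines),
        SampleSize::Fraction(f) => ((total_lines as f64) * *f).round() as usize,
    }
}
```

Under these assumptions both `0.1` and `.1` parse to the same fraction, and a fractional size can only be resolved after the input's lines have been counted.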

Benchmark

Environment: 1.4 GHz quad-core Intel Core i5; 16 GB 2133 MHz DDR3; macOS 13.2 (22D49)

~/code/rsam main* ❯ time seq 200000 |./target/release/rsam -s 100000 -o /dev/null -r
2023-03-20T17:43:04.722500+08:00 INFO input size: "100000"
2023-03-20T17:43:04.722699+08:00 INFO parsed size: Absolute(100000)
2023-03-20T17:43:04.722726+08:00 INFO input from stdin
2023-03-20T17:43:04.722741+08:00 INFO output to: "/dev/null"
2023-03-20T17:43:04.722776+08:00 WARN file /dev/null exist, will rewrite it
2023-03-20T17:43:04.722796+08:00 INFO comment char: None
2023-03-20T17:43:04.771903+08:00 INFO total line count: 200000
2023-03-20T17:43:04.771934+08:00 INFO true size: 100000
2023-03-20T17:43:04.771947+08:00 INFO Start sample
2023-03-20T17:43:04.803320+08:00 INFO sample done
2023-03-20T17:43:04.803407+08:00 INFO start output

________________________________________________________
Executed in  116.49 millis    fish           external
   usr time  126.31 millis    0.35 millis  125.96 millis
   sys time   14.19 millis    1.24 millis   12.95 millis

Roadmap

  • Support multiple input files
  • Support gz file input
  • Support multiple output sizes

License

GPL-3.0

Dependencies

~3.5MB
~70K SLoC