4 releases (2 breaking)

0.3.0	Jul 23, 2023
0.2.0	Jul 19, 2023
0.1.1	Jul 14, 2023
0.1.0	Jul 14, 2023

#694 in Text processing

22 downloads per month

Custom license

51KB
877 lines

regex-chunker

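Split the output of Read types using regular expressions.

The main type in this crate is ByteChunker, which wraps a type that implements Read and iterates over chunks of its byte stream, using a supplied regular expression as the delimiter. The following example reads from standard input and prints word counts: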

use std::collections::BTreeMap;
use std::error::Error;
use regex_chunker::ByteChunker;

fn main() -> Result<(), Box<dyn Error>> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    let stdin = std::io::stdin();
    
    // The regex is a stab at something matching strings of
    // "between-word" characters in general English text.
    let chunker = ByteChunker::new(stdin, r#"[ "\r\n.,!?:;/]+"#)?;
    for chunk in chunker {
        let word = String::from_utf8_lossy(&chunk?).to_lowercase();
        *counts.entry(word).or_default() += 1;
    }

    println!("{:#?}", &counts);
    Ok(())
}
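
To illustrate the idea of chunking a Read source at delimiters, here is a minimal std-only sketch. It is not the crate's implementation: it swaps the regular expression for a plain byte predicate, and the names DelimChunker, is_between_words, and words are hypothetical, introduced only for this example. It shows why a chunker has to buffer across reads: a word or a run of delimiters may straddle two read calls. Unlike a regex split, this sketch also chooses to skip empty chunks.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a regex-based chunker: splits a `Read`
/// source on runs of delimiter bytes chosen by a predicate.
struct DelimChunker<R: Read> {
    source: R,
    is_delim: fn(u8) -> bool,
    buf: Vec<u8>, // bytes read from `source` but not yet returned
    eof: bool,
}

impl<R: Read> DelimChunker<R> {
    fn new(source: R, is_delim: fn(u8) -> bool) -> Self {
        Self { source, is_delim, buf: Vec::new(), eof: false }
    }
}

impl<R: Read> Iterator for DelimChunker<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        let is_delim = self.is_delim;
        loop {
            if let Some(start) = self.buf.iter().position(|&b| is_delim(b)) {
                // End of the delimiter run, if a non-delimiter byte follows it.
                let end = self.buf[start..]
                    .iter()
                    .position(|&b| !is_delim(b))
                    .map(|n| start + n);
                // A run touching the end of the buffer may continue in the
                // next read, so only split there once the source is exhausted.
                if end.is_some() || self.eof {
                    let cut = end.unwrap_or(self.buf.len());
                    let chunk = self.buf[..start].to_vec();
                    self.buf.drain(..cut);
                    if chunk.is_empty() {
                        continue; // sketch-specific choice: skip empty chunks
                    }
                    return Some(Ok(chunk));
                }
            } else if self.eof {
                if self.buf.is_empty() {
                    return None;
                }
                // Trailing bytes with no delimiter form the last chunk.
                return Some(Ok(std::mem::take(&mut self.buf)));
            }
            // Need more data before a split can be decided.
            let mut tmp = [0u8; 4096];
            match self.source.read(&mut tmp) {
                Ok(0) => self.eof = true,
                Ok(n) => self.buf.extend_from_slice(&tmp[..n]),
                Err(e) => return Some(Err(e)),
            }
        }
    }
}

/// Mirrors the character class used in the example regex above.
fn is_between_words(b: u8) -> bool {
    matches!(b, b' ' | b'"' | b'\r' | b'\n' | b'.' | b',' | b'!' | b'?' | b':' | b';' | b'/')
}

/// Collects the lowercased chunks of `source` into a vector.
fn words<R: Read>(source: R) -> io::Result<Vec<String>> {
    DelimChunker::new(source, is_between_words)
        .map(|chunk| chunk.map(|bytes| String::from_utf8_lossy(&bytes).to_lowercase()))
        .collect()
}

fn main() -> io::Result<()> {
    // Two chained readers force the word "world" to span two reads.
    let source = "Hello, wo".as_bytes().chain("rld! Hello?".as_bytes());
    for word in words(source)? {
        println!("{word}");
    }
    Ok(())
}
```

A regular expression can express delimiters that a per-byte predicate cannot, such as multi-byte sequences or context-dependent separators, and that is the case ByteChunker is meant to cover.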

The async feature enables the stream submodule, which contains an asynchronous version of ByteChunker that wraps a tokio::io::AsyncRead type and produces a Stream of byte chunks.

Running the tests

If you want to run the tests for the async feature, you first need to build src/bin/slowsource.rs with the async and test features:

$ cargo build --bin slowsource --all-features

Some of the stream module tests run it in a subprocess and use it as a source of byte data.

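Unanswered questions and to-dos

The current implementation is a basic one. How could its performance be improved?

Is there room to tighten up the RcErr type?

Once non-overlapping generic implementations (1672, perhaps 20400) are available, remove the SimpleCustomChunker type.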

Dependencies

~2–4.5MB
~75K SLoC