1 unstable release

0.3.0	Apr 5, 2024

MIT license

Tokengrams

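Tokengrams allows you to efficiently compute $n$-gram statistics for pre-tokenized text corpora used to train large language models. Rather than explicitly pre-computing $n$-gram counts for a fixed $n$, it builds a suffix array index over the corpus, which lets you efficiently compute the count of an $n$-gram for any $n$.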
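To make the suffix array idea concrete, here is a minimal pure-Python sketch. It is not the Tokengrams implementation (which is written in Rust); the names `build_suffix_array` and `count_ngram` are hypothetical, and the construction is deliberately naive. It only illustrates why a single sorted index can answer count queries for any $n$: all occurrences of a query form one contiguous range of the sorted suffixes, and two binary searches find its ends.

```python
from typing import List, Sequence


def build_suffix_array(tokens: Sequence[int]) -> List[int]:
    """Return suffix start positions, sorted by the suffix each one begins.

    Naive construction that compares whole suffixes; fine for a toy corpus only.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens: Sequence[int], sa: List[int], query: Sequence[int]) -> int:
    """Count occurrences of `query` with two binary searches over `sa`."""
    q = list(query)
    m = len(q)

    def prefix(rank: int) -> List[int]:
        # First m tokens of the suffix at this rank (shorter near the corpus end).
        return list(tokens[sa[rank]: sa[rank] + m])

    # Lower bound: first rank whose m-token prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < q:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first rank whose m-token prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= q:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


corpus = [1, 2, 3, 1, 2, 4, 1, 2, 3]
sa = build_suffix_array(corpus)
bigram_count = count_ngram(corpus, sa, [1, 2])
trigram_count = count_ngram(corpus, sa, [1, 2, 3])
```

The same `sa` serves both the bigram and the trigram query, which is the point: nothing about the index depends on $n$.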

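Our code also allows you to turn your suffix array index into an efficient $n$-gram language model, which can be used to generate text or compute the perplexity of a given text.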
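As a rough illustration of how $n$-gram counts become a language model, the sketch below turns next-token counts into probabilities and uses them to compute perplexity. Everything here is an assumption made for illustration: the function names are hypothetical, the counts come from a linear scan rather than a suffix array, and the add-alpha smoothing is simply one common way to avoid zero probabilities, not necessarily what Tokengrams does.

```python
import math
from collections import Counter
from typing import List, Sequence


def next_token_counts(corpus: Sequence[int], context: Sequence[int]) -> Counter:
    """Count which token follows each occurrence of `context` (linear scan)."""
    ctx = list(context)
    m = len(ctx)
    counts: Counter = Counter()
    for i in range(len(corpus) - m):
        if list(corpus[i: i + m]) == ctx:
            counts[corpus[i + m]] += 1
    return counts


def next_token_probs(
    corpus: Sequence[int], context: Sequence[int], vocab_size: int, alpha: float = 1.0
) -> List[float]:
    """Add-alpha smoothed distribution over the next token (illustrative choice)."""
    counts = next_token_counts(corpus, context)
    total = sum(counts.values()) + alpha * vocab_size
    return [(counts[t] + alpha) / total for t in range(vocab_size)]


def perplexity(
    corpus: Sequence[int], text: Sequence[int], n: int, vocab_size: int, alpha: float = 1.0
) -> float:
    """Perplexity of `text` under an n-gram model estimated from `corpus`."""
    log_prob = 0.0
    for i, tok in enumerate(text):
        # Condition on at most n - 1 previous tokens.
        context = text[max(0, i - (n - 1)): i]
        log_prob += math.log(next_token_probs(corpus, context, vocab_size, alpha)[tok])
    return math.exp(-log_prob / len(text))


corpus = [0, 1, 2, 0, 1, 2, 0, 1, 3]
probs = next_token_probs(corpus, [0, 1], vocab_size=4)
ppl_seen = perplexity(corpus, [0, 1, 2], n=2, vocab_size=4)
ppl_unseen = perplexity(corpus, [3, 3, 3], n=2, vocab_size=4)
```

Text that resembles the corpus should receive a lower perplexity than text that does not, which is what `ppl_seen` and `ppl_unseen` are meant to show.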

The backend is written in Rust, and the Python bindings are generated using PyO3.

Installation

pip install tokengrams

Development

pip install maturin
maturin develop

Usage

Building an index

from tokengrams import MemmapIndex

# Create a new index from an on-disk corpus called `document.bin` and save it to
# `pile.idx`.
index = MemmapIndex.build(
    "/data/document.bin",
    "/pile.idx",
)

# Verify index correctness
print(index.is_sorted())

# Get the count of "hello world" in the corpus.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
print(index.count(tokenizer.encode("hello world")))

# You can now load the index from disk later using __init__
index = MemmapIndex(
    "/data/document.bin",
    "/pile.idx"
)
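The examples above assume that a pre-tokenized corpus already exists on disk. The sketch below shows one way such a file could be produced, under the assumption (not stated in this README) that the corpus is a flat array of unsigned 16-bit token IDs in native byte order. The helper names `write_token_file` and `read_token_file` are hypothetical, so check the library documentation for the exact format your version expects.

```python
import os
import tempfile
from array import array
from typing import Iterable, List


def write_token_file(token_ids: Iterable[int], path: str) -> int:
    """Write token IDs as a flat array of unsigned 16-bit integers.

    ASSUMPTION: this on-disk layout is illustrative, not confirmed by the README.
    """
    buf = array("H", token_ids)  # raises OverflowError for IDs outside 0..65535
    with open(path, "wb") as f:
        buf.tofile(f)
    return len(buf)


def read_token_file(path: str) -> List[int]:
    """Read the file back into a list of token IDs."""
    buf = array("H")
    with open(path, "rb") as f:
        buf.frombytes(f.read())
    return buf.tolist()


# Round-trip a few made-up token IDs through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "document.bin")
    n_written = write_token_file([15339, 1917, 0, 65535], demo_path)
    demo_size = os.path.getsize(demo_path)
    demo_tokens = read_token_file(demo_path)
```

A 16-bit layout only works when the tokenizer's vocabulary has at most 65,536 entries; larger vocabularies would need a wider integer type.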

Using an index

# Count how often each token in the corpus follows "hello world".
print(index.count_next(tokenizer.encode("hello world")))

# Parallelize over queries
print(index.batch_count_next(
    [tokenizer.encode("hello world"), tokenizer.encode("hello universe")]
))

# Autoregressively sample 10 tokens using 5-gram language statistics. Initial
# gram statistics are derived from the query, with lower-order gram statistics
# used until the sequence contains at least 5 tokens.
print(index.sample(tokenizer.encode("hello world"), n=5, k=10))

# Parallelize over sequence generations
print(index.batch_sample(tokenizer.encode("hello world"), n=5, k=10, num_samples=20))

# Query whether the corpus contains "hello world"
print(index.contains(tokenizer.encode("hello world")))

# Get the positions of all n-grams beginning with "hello world" in the corpus
print(index.positions(tokenizer.encode("hello world")))
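The sampling comments above describe using lower-order statistics while the sequence is still short. A minimal pure-Python sketch of that idea follows. It is not the library's implementation: `followers` and `sample_backoff` are hypothetical names, counts come from a linear scan, and backing off further when a context never appears in the corpus is an extra assumption of this sketch. The sketch returns the query followed by the sampled tokens, which may differ from what `index.sample` returns.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence


def followers(corpus: Sequence[int], context: Sequence[int]) -> Counter:
    """Count which token follows each occurrence of `context` (linear scan)."""
    ctx = list(context)
    m = len(ctx)
    counts: Counter = Counter()
    for i in range(len(corpus) - m):
        if list(corpus[i: i + m]) == ctx:
            counts[corpus[i + m]] += 1
    return counts


def sample_backoff(
    corpus: Sequence[int],
    query: Sequence[int],
    n: int,
    k: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Append k sampled tokens to `query` using at most (n - 1) tokens of context."""
    rng = rng or random.Random()
    seq = list(query)
    for _ in range(k):
        # Short sequences start from a lower order automatically.
        order = min(n - 1, len(seq))
        while True:
            context = seq[len(seq) - order:]
            counts = followers(corpus, context)
            if counts or order == 0:
                break
            order -= 1  # ASSUMPTION: back off when the context is unseen
        if not counts:
            break  # nothing to sample from (empty corpus)
        tokens, weights = zip(*sorted(counts.items()))
        seq.append(rng.choices(tokens, weights=weights, k=1)[0])
    return seq


corpus = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]
generated = sample_backoff(corpus, [1, 2], n=3, k=4, rng=random.Random(0))
```

The toy corpus is a strict cycle, so every context has exactly one possible continuation and the output does not depend on the random seed.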

Support

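The best way to get support is to open an issue on this repo or post in #inductive-biases in the EleutherAI Discord server. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!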
