5 versions

0.1.4	Jan 13, 2022
0.1.3	Jan 7, 2022
0.1.2	Jan 2, 2022
0.1.1	Jan 2, 2022
0.1.0	Jan 2, 2022

#822 in Data structures

MIT license

69KB
1.5K SLoC

`tongrams`: Tons of N-grams

tongrams is a crate for indexing and querying large language models in compressed space. Its data structures are presented in the following papers:

Giulio Ermanno Pibiri and Rossano Venturini, Efficient Data Structures for Massive N-Gram Datasets. In Proceedings of the 40th ACM Conference on Research and Development in Information Retrieval (SIGIR 2017), pp. 615-624.
Giulio Ermanno Pibiri and Rossano Venturini, Handling Massive N-Gram Datasets Efficiently. ACM Transactions on Information Systems (TOIS), 37.2 (2019): 1-41.

This is a Rust port of the tongrams C++ library.

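What it can do

Store N-gram language models with frequency counts.
Look up N-grams to get their frequency counts.

Features

Compressed language model. tongrams-rs can store large N-gram language models in very little space. For example, the word N-gram datasets (N=1..5) in test_data take only 2.6 bytes per gram.
Time and memory efficiency. tongrams-rs uses the Elias-Fano Trie, which encodes a trie made of N-grams with Elias-Fano codes, enabling fast lookups in compressed space.
Pure Rust. tongrams-rs is written only in Rust and can be easily plugged into your Rust code.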
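
To give a feel for the Elias-Fano codes mentioned above, here is a minimal, unoptimized sketch of encoding a sorted integer sequence. It is not the crate's implementation: the `EliasFano` type and its methods are hypothetical names chosen for illustration, the bits are kept in plain vectors rather than packed, and the select operation is a linear scan instead of an indexed constant-time lookup.

```rust
/// Toy Elias-Fano encoding of a sorted sequence (illustration only, not the crate's code).
struct EliasFano {
    low_bits: u32,    // number of low bits stored verbatim per value
    lows: Vec<u64>,   // low parts (a real implementation packs these into a bit array)
    highs: Vec<bool>, // high parts in unary: value i sets bit (value >> low_bits) + i
}

impl EliasFano {
    /// Encodes `values`, which must be sorted and smaller than `universe`.
    fn new(values: &[u64], universe: u64) -> Self {
        assert!(!values.is_empty(), "need at least one value");
        assert!(values.windows(2).all(|w| w[0] <= w[1]), "values must be sorted");
        assert!(*values.last().unwrap() < universe, "values must be < universe");
        let n = values.len() as u64;
        // low_bits = floor(log2(universe / n)), clamped to 0 for dense inputs.
        let ratio = (universe / n).max(1);
        let low_bits = 63 - ratio.leading_zeros();
        let mask = (1u64 << low_bits) - 1;
        let mut lows = Vec::with_capacity(values.len());
        let mut highs = vec![false; (n + (universe >> low_bits) + 1) as usize];
        for (i, &v) in values.iter().enumerate() {
            lows.push(v & mask);
            highs[(v >> low_bits) as usize + i] = true;
        }
        Self { low_bits, lows, highs }
    }

    /// Decodes the i-th value, or returns None if `i` is out of range.
    fn get(&self, i: usize) -> Option<u64> {
        let low = *self.lows.get(i)?;
        // select(i): position of the i-th set bit. A linear scan keeps the sketch short;
        // practical implementations answer this with a small auxiliary index.
        let pos = self
            .highs
            .iter()
            .enumerate()
            .filter_map(|(p, &bit)| if bit { Some(p) } else { None })
            .nth(i)?;
        let high = (pos - i) as u64;
        Some((high << self.low_bits) | low)
    }

    /// Bits the payload would take if both parts were bit-packed.
    fn size_in_bits(&self) -> usize {
        self.lows.len() * self.low_bits as usize + self.highs.len()
    }
}

fn main() {
    let values = [3u64, 4, 7, 13, 14, 15, 21, 43];
    let ef = EliasFano::new(&values, 44);
    // Every value can be decoded back from its position.
    for (i, &v) in values.iter().enumerate() {
        assert_eq!(ef.get(i), Some(v));
    }
    println!("{} values in {} bits", values.len(), ef.size_in_bits());
}
```

The space saving comes from splitting each value: only the low bits are stored per element, while the high parts share one unary-coded bit vector whose length grows with the number of elements rather than with their magnitude.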

Installation

To use tongrams, add the following to your Cargo manifest:

# Cargo.toml

[dependencies]
tongrams = "0.1"

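Input data format

N-gram counts files use the same format as tongrams, a modified Google format, where

there is one separate file for each distinct value of N (order), listing one gram per line.
the header line <number_of_grams> of each file gives the number of N-grams in that file.
tokens in a <gram> are separated by a space (e.g., the same time), and
a <gram> and its <count> are separated by a horizontal tab.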

<number_of_grams>
<gram1><TAB><count1>
<gram2><TAB><count2>
<gram3><TAB><count3>
...

For example,

61516
the // parent	1
the function is	22
the function a	4
the function to	1
the function and	1
...
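
As a rough illustration of the format above, the following sketch reads one uncompressed counts file into a plain `HashMap`. The helpers `parse_counts` and `invalid` are hypothetical and not part of the tongrams API; the sketch also assumes plain text input, so it does not handle the gzip-compressed files used elsewhere in this README, and the hash map is an uncompressed baseline rather than the crate's data structure.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Builds an InvalidData error (hypothetical helper for this sketch).
fn invalid(msg: String) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

/// Parses a counts file: a header line with the number of grams,
/// followed by `<gram><TAB><count>` lines.
fn parse_counts<R: BufRead>(reader: R) -> io::Result<HashMap<String, u64>> {
    let mut lines = reader.lines();
    let header = lines
        .next()
        .ok_or_else(|| invalid("missing header line".to_string()))??;
    let expected = header
        .trim()
        .parse::<usize>()
        .map_err(|e| invalid(format!("bad header {header:?}: {e}")))?;

    let mut counts = HashMap::new();
    for line in lines {
        let line = line?;
        if line.is_empty() {
            continue; // tolerate a trailing blank line
        }
        // Tokens inside a gram are separated by spaces, so the tab
        // unambiguously separates the gram from its count.
        let (gram, count) = line
            .rsplit_once('\t')
            .ok_or_else(|| invalid(format!("missing tab in {line:?}")))?;
        let count = count
            .trim()
            .parse::<u64>()
            .map_err(|e| invalid(format!("bad count in {line:?}: {e}")))?;
        counts.insert(gram.to_string(), count);
    }
    // The header must agree with the number of distinct grams read.
    if counts.len() != expected {
        return Err(invalid(format!(
            "header says {expected} grams, found {}",
            counts.len()
        )));
    }
    Ok(counts)
}

fn main() -> io::Result<()> {
    let input = "3\nthe function is\t22\nthe function a\t4\nthe function to\t1\n";
    let counts = parse_counts(input.as_bytes())?;
    assert_eq!(counts.get("the function is"), Some(&22));
    assert_eq!(counts.get("vector is array"), None);
    Ok(())
}
```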

Examples

The following code uses the datasets in test_data at the root of this repository.

use tongrams::EliasFanoTrieCountLm;

// File names of N-grams.
let filenames = vec![
    "../test_data/1-grams.sorted.gz",
    "../test_data/2-grams.sorted.gz",
    "../test_data/3-grams.sorted.gz",
];

// Builds the language model from n-gram counts files.
let lm = EliasFanoTrieCountLm::from_gz_files(&filenames).unwrap();

// Creates the instance for lookup.
let mut lookuper = lm.lookuper();

// Gets the count of a query N-gram written in a space-separated string.
assert_eq!(lookuper.with_str("vector"), Some(182));
assert_eq!(lookuper.with_str("in order"), Some(47));
assert_eq!(lookuper.with_str("the same memory"), Some(8));
assert_eq!(lookuper.with_str("vector is array"), None);

// Gets the count of a query N-gram formed by a string array.
assert_eq!(lookuper.with_tokens(&["vector"]), Some(182));
assert_eq!(lookuper.with_tokens(&["in", "order"]), Some(47));
assert_eq!(lookuper.with_tokens(&["the", "same", "memory"]), Some(8));
assert_eq!(lookuper.with_tokens(&["vector", "is", "array"]), None);

// Serializes the index into a writable stream.
let mut data = vec![];
lm.serialize_into(&mut data).unwrap();

// Deserializes the index from a readable stream.
let other = EliasFanoTrieCountLm::deserialize_from(&data[..]).unwrap();
assert_eq!(lm.num_orders(), other.num_orders());
assert_eq!(lm.num_grams(), other.num_grams());

License

This library is free software provided under the MIT license.

Dependencies

~1.4–2MB
~39K SLoC
