12 stable releases

1.4.3	April 28, 2024
1.4.2	April 26, 2024
1.3.1	March 24, 2024
1.3.0	December 12, 2023
1.1.2	May 27, 2023

LGPL-3.0-or-later

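Rust Keyword Extraction

Introduction

This is a simple NLP library with a collection of unsupervised keyword extraction algorithms:

Tokenizer, for splitting text into tokens;
TF-IDF, for calculating the importance of a word in one or more documents;
Co-occurrence, for calculating the relationships between words within a specific window size;
RAKE, for extracting key phrases from a document;
TextRank, for extracting keywords and key phrases from a document;
YAKE, for extracting keywords with an n-gram size (defaults to 3) from a document.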
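
To make the co-occurrence idea concrete, here is a minimal sketch written against the standard library only. It is not the crate's implementation: the `co_occurrence` function, its signature, and the choice to store each pair in sorted order are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of keyword_extraction): counts how often two
/// distinct words appear within `window` tokens of each other.
/// Each pair is stored in sorted order so ("a", "b") and ("b", "a") share one entry.
fn co_occurrence(tokens: &[&str], window: usize) -> HashMap<(String, String), usize> {
    let mut counts = HashMap::new();
    for (i, &left) in tokens.iter().enumerate() {
        // Look ahead at most `window` tokens from position `i`.
        let end = (i + window + 1).min(tokens.len());
        for &right in &tokens[i + 1..end] {
            if left == right {
                continue;
            }
            let key = if left < right {
                (left.to_string(), right.to_string())
            } else {
                (right.to_string(), left.to_string())
            };
            *counts.entry(key).or_insert(0) += 1;
        }
    }
    counts
}
```

Calling `co_occurrence(&["rust", "keyword", "extraction", "rust"], 2)` yields three pairs, with ("keyword", "rust") counted twice. A larger window links more distant words at the cost of a denser table.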

Algorithms

The full list of algorithms in this library:

Helper algorithms
- Tokenizer
- Co-occurrence
Keyword extraction algorithms
- TF-IDF
- RAKE
- TextRank
- YAKE

Usage

Add the library to your Cargo.toml:

[dependencies]
keyword_extraction = "1.4.3"

Or use cargo add:

cargo add keyword_extraction

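Features

Features can be enabled or disabled:

"tf_idf": the TF-IDF algorithm;
"rake": the RAKE algorithm;
"text_rank": the TextRank algorithm;
"yake": the YAKE algorithm;
"all": all algorithms and helpers;
"parallel": parallelizes the algorithms with Rayon;
"co_occurrence": the co-occurrence algorithm;

Default features: ["tf_idf", "rake", "text_rank"]. By default, every algorithm except "co_occurrence" and "yake" is enabled.

Note: the "parallel" feature is only recommended for large documents, because it trades extra memory for faster computation.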
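
As an illustration of selecting features, a Cargo.toml entry along these lines should work. It uses standard Cargo dependency syntax with the feature names listed above; the particular combinations shown are only examples.

```toml
[dependencies]
# Keep the default features (tf_idf, rake, text_rank) and add YAKE on top.
keyword_extraction = { version = "1.4.3", features = ["yake"] }

# Alternative: drop the defaults and opt in to a smaller set.
# keyword_extraction = { version = "1.4.3", default-features = false, features = ["rake", "parallel"] }
```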

Examples

For stop words, you can use the stop-words library:

[dependencies]
stop-words = "0.8.0"

For example, for English:

use stop_words::{get, LANGUAGE};

fn main() {
    let stop_words = get(LANGUAGE::English);
    let punctuation: Vec<String> = [
        ".", ",", ":", ";", "!", "?", "(", ")", "[", "]", "{", "}", "\"", "'",
    ].iter().map(|s| s.to_string()).collect();
    // ...
}
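
To show what stop-word and punctuation lists like the ones above are typically used for, here is a rough tokenizer sketch. It is an assumption-laden illustration, not the crate's tokenizer: the `tokenize` function is hypothetical, splits on whitespace only, and lowercases every word.

```rust
/// Hypothetical helper (not part of keyword_extraction): lowercases each
/// whitespace-separated word, strips the given punctuation marks, and drops
/// stop words and empty leftovers.
fn tokenize(text: &str, stop_words: &[String], punctuation: &[String]) -> Vec<String> {
    text.split_whitespace()
        .map(|raw| {
            // Normalise case first, then remove every punctuation mark.
            let mut word = raw.to_lowercase();
            for mark in punctuation {
                word = word.replace(mark.as_str(), "");
            }
            word
        })
        .filter(|word| !word.is_empty() && !stop_words.contains(word))
        .collect()
}
```

With stop words "this", "is", "a" and "." as punctuation, `tokenize("This is a test document.", ...)` returns `["test", "document"]`.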

TF-IDF

Create a TfIdfParams enum value, which can be one of the following:

Unprocessed documents: TfIdfParams::UnprocessedDocuments;
Processed documents: TfIdfParams::ProcessedDocuments;
A single unprocessed document/text block: TfIdfParams::TextBlock;

use keyword_extraction::tf_idf::{TfIdf, TfIdfParams};

fn main() {
    // ... stop_words & punctuation
    let documents: Vec<String> = vec![
        "This is a test document.".to_string(),
        "This is another test document.".to_string(),
        "This is a third test document.".to_string(),
    ];

    let params = TfIdfParams::UnprocessedDocuments(&documents, &stop_words, Some(&punctuation));

    let tf_idf = TfIdf::new(params);
    let ranked_keywords: Vec<String> = tf_idf.get_ranked_words(10);
    let ranked_keywords_scores: Vec<(String, f32)> = tf_idf.get_ranked_word_scores(10);

    // ...
}
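
For readers who want to see the arithmetic behind the scores, the following is a from-scratch sketch of one common TF-IDF variant: term frequency as count divided by document length, and inverse document frequency as ln(N / df). The `tf_idf` function is hypothetical, and the crate may weight or normalise differently.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical helper (not part of keyword_extraction): returns one
/// word -> score map per document, using tf = count / len and idf = ln(N / df).
fn tf_idf(docs: &[Vec<&str>]) -> Vec<HashMap<String, f32>> {
    // Document frequency: in how many documents does each word occur?
    let mut df: HashMap<&str, usize> = HashMap::new();
    for doc in docs {
        let unique: HashSet<&str> = doc.iter().copied().collect();
        for word in unique {
            *df.entry(word).or_insert(0) += 1;
        }
    }

    let n_docs = docs.len() as f32;
    docs.iter()
        .map(|doc| {
            // Raw term counts for this document.
            let mut counts: HashMap<&str, usize> = HashMap::new();
            for &word in doc {
                *counts.entry(word).or_insert(0) += 1;
            }
            counts
                .into_iter()
                .map(|(word, count)| {
                    let tf = count as f32 / doc.len() as f32;
                    let idf = (n_docs / df[word] as f32).ln();
                    (word.to_string(), tf * idf)
                })
                .collect()
        })
        .collect()
}
```

Under this variant a word that occurs in every document gets an idf of ln(1) = 0, which is why words shared by all documents, such as "test" and "document" in the example above, would score zero here.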

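RAKE

Create a RakeParams enum value, which can be one of the following:

With defaults: RakeParams::WithDefaults;
With defaults and a phrase length (a limit on the phrase window size): RakeParams::WithDefaultsAndPhraseLength;
All: RakeParams::All;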

use keyword_extraction::rake::{Rake, RakeParams};

fn main() {
    // ... stop_words
    let text = r#"
        This is a test document.
        This is another test document.
        This is a third test document.
    "#;

    let rake = Rake::new(RakeParams::WithDefaults(text, &stop_words));
    let ranked_keywords: Vec<String> = rake.get_ranked_words(10);
    let ranked_keywords_scores: Vec<(String, f32)> = rake.get_ranked_word_scores(10);

    // ...
}
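
As background on what RAKE computes, here is a simplified sketch of the classic scheme: split the text into candidate phrases at stop words and punctuation, score each word as degree divided by frequency, and score a phrase as the sum of its word scores. The `rake` function below is hypothetical and omits details the crate may handle, such as phrase length limits.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of keyword_extraction): a bare-bones RAKE.
fn rake(text: &str, stop_words: &[&str]) -> Vec<(String, f32)> {
    // 1. Split into candidate phrases at stop words and trailing punctuation.
    let mut phrases: Vec<Vec<String>> = Vec::new();
    let mut current: Vec<String> = Vec::new();
    for raw in text.split_whitespace() {
        let word = raw.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase();
        let ends_phrase = raw.ends_with(|c: char| !c.is_alphanumeric());
        if word.is_empty() || stop_words.contains(&word.as_str()) {
            if !current.is_empty() {
                phrases.push(std::mem::take(&mut current));
            }
        } else {
            current.push(word);
            if ends_phrase {
                phrases.push(std::mem::take(&mut current));
            }
        }
    }
    if !current.is_empty() {
        phrases.push(current);
    }

    // 2. Word frequency, and degree (total length of the phrases containing the word).
    let mut freq: HashMap<&str, f32> = HashMap::new();
    let mut degree: HashMap<&str, f32> = HashMap::new();
    for phrase in &phrases {
        for word in phrase {
            *freq.entry(word.as_str()).or_insert(0.0) += 1.0;
            *degree.entry(word.as_str()).or_insert(0.0) += phrase.len() as f32;
        }
    }

    // 3. Phrase score = sum of degree / frequency over its words.
    let mut scored: HashMap<String, f32> = HashMap::new();
    for phrase in &phrases {
        let score: f32 = phrase.iter().map(|w| degree[w.as_str()] / freq[w.as_str()]).sum();
        scored.insert(phrase.join(" "), score);
    }

    // Highest score first; ties broken alphabetically for a stable order.
    let mut ranked: Vec<(String, f32)> = scored.into_iter().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked
}
```

Because degree grows with phrase length, this scheme favours longer phrases, which is one reason a phrase length limit is useful in practice.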

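TextRank

Create a TextRankParams enum value, which can be one of the following:

With defaults: TextRankParams::WithDefaults;
With defaults and a phrase length (a limit on the phrase window size): TextRankParams::WithDefaultsAndPhraseLength;
All: TextRankParams::All;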

use keyword_extraction::text_rank::{TextRank, TextRankParams};

fn main() {
    // ... stop_words
    let text = r#"
        This is a test document.
        This is another test document.
        This is a third test document.
    "#;

    let text_rank = TextRank::new(TextRankParams::WithDefaults(text, &stop_words));
    let ranked_keywords: Vec<String> = text_rank.get_ranked_words(10);
    let ranked_keywords_scores: Vec<(String, f32)> = text_rank.get_ranked_word_scores(10);
}
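
The idea behind TextRank can be sketched as PageRank over a word graph: words are nodes, words that appear within a window of each other share an edge, and scores are propagated for a fixed number of iterations. The `text_rank` function, its parameters, and the unweighted edges below are assumptions for illustration, not the crate's internals.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical helper (not part of keyword_extraction): unweighted TextRank
/// over already-tokenized text.
fn text_rank(tokens: &[&str], window: usize, damping: f32, iterations: usize) -> Vec<(String, f32)> {
    // Undirected graph: an edge links two distinct words within `window` tokens.
    let mut neighbours: BTreeMap<&str, BTreeSet<&str>> = BTreeMap::new();
    for (i, &word) in tokens.iter().enumerate() {
        neighbours.entry(word).or_default();
        let end = (i + window + 1).min(tokens.len());
        for &other in &tokens[i + 1..end] {
            if other != word {
                neighbours.entry(word).or_default().insert(other);
                neighbours.entry(other).or_default().insert(word);
            }
        }
    }

    // PageRank-style update: each neighbour shares its score equally among its edges.
    let mut scores: BTreeMap<&str, f32> = neighbours.keys().map(|&w| (w, 1.0)).collect();
    for _ in 0..iterations {
        let mut next = BTreeMap::new();
        for (&word, adjacent) in &neighbours {
            let incoming: f32 = adjacent
                .iter()
                .map(|n| scores[n] / neighbours[n].len() as f32)
                .sum();
            next.insert(word, (1.0 - damping) + damping * incoming);
        }
        scores = next;
    }

    // Highest score first; ties broken alphabetically for a stable order.
    let mut ranked: Vec<(String, f32)> =
        scores.into_iter().map(|(w, s)| (w.to_string(), s)).collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked
}
```

A damping factor of 0.85 is the value conventionally used for PageRank. For the tokens `["rust", "keyword", "rust", "graph"]` with a window of 1, "rust" ranks first because it is linked to both other words.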

YAKE

Note: YAKE is a more complex algorithm and does not yet support the parallel feature.

Create a YakeParams enum value, which can be one of the following:

With defaults: YakeParams::WithDefaults;
All: YakeParams::All;

use keyword_extraction::yake::{Yake, YakeParams};

fn main() {
    // ... stop_words
    let text = r#"
        This is a test document.
        This is another test document.
        This is a third test document.
    "#;

    let yake = Yake::new(YakeParams::WithDefaults(text, &stop_words));
    let ranked_keywords: Vec<String> = yake.get_ranked_keywords(10);
    let ranked_keywords_scores: Vec<(String, f32)> = yake.get_ranked_keyword_scores(10);
    // ...
}
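
To illustrate what the n-gram size controls, the sketch below generates keyword candidates of up to `max_n` words and skips any candidate that begins or ends with a stop word. It covers candidate generation only, not YAKE's statistical scoring, and the `ngram_candidates` function is hypothetical rather than part of the crate.

```rust
/// Hypothetical helper (not part of keyword_extraction): lists every n-gram of
/// 1 to `max_n` tokens that neither starts nor ends with a stop word.
fn ngram_candidates(tokens: &[&str], stop_words: &[&str], max_n: usize) -> Vec<String> {
    let mut candidates = Vec::new();
    for n in 1..=max_n {
        // `windows(n)` yields every run of `n` consecutive tokens.
        for gram in tokens.windows(n) {
            let first = gram[0];
            let last = gram[n - 1];
            if stop_words.contains(&first) || stop_words.contains(&last) {
                continue;
            }
            candidates.push(gram.join(" "));
        }
    }
    candidates
}
```

With the tokens `["keyword", "extraction", "in", "rust"]`, the stop word "in", and `max_n = 3`, this keeps "keyword extraction" and "extraction in rust" but rejects "extraction in" and "in rust".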

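Contributing

I would love your input! I want to make contributing to this project as easy and transparent as possible, so please read the CONTRIBUTING.md file for details.

License

This project is licensed under the GNU Lesser General Public License v3.0. See the Copying and Copying Lesser files for details.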
