shibboleth
A simple, pure-Rust implementation of word2vec with stemming and negative sampling. With Shibboleth you can easily:
- Build a corpus vocabulary.
- Train word vectors.
- Find words by vector distance (a minimal sketch appears at the end of this README).
Automatic text tokenization
let tokens = shibboleth::tokenize("Totally! I love cupcakes!");
assert_eq!(tokens[0], "total");
assert_eq!(tokens[3], "cupcak");
Data input
Shibboleth can consume a training corpus supplied as a sqlite file matching this schema:
CREATE TABLE documents (id PRIMARY KEY, text);
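If you want to experiment with your own documents, one way (a sketch, not part of the crate) is to build a matching sqlite file with the sqlite3 command-line tool; the corpus.db name and sample rows below are only placeholders.
$ sqlite3 corpus.db "CREATE TABLE documents (id PRIMARY KEY, text);"
$ sqlite3 corpus.db "INSERT INTO documents (id, text) VALUES (1, 'I like to eat fish and chips.');"
$ sqlite3 corpus.db "INSERT INTO documents (id, text) VALUES (2, 'Steve has chips with his fish.');"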
A popular training resource is Wikipedia. The command below downloads and unzips a sqlite file containing over 5 million documents. For Wikipedia licensing, see here.
$ wget -O wiki.db.gz https://dl.fbaipublicfiles.com/drqa/docs.db.gz && gunzip wiki.db.gz
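To sanity-check the download, you can count the rows with the sqlite3 command-line tool (assuming it is installed):
$ sqlite3 wiki.db "SELECT COUNT(*) FROM documents;"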
Building a vocabulary
This example uses the wiki.db file downloaded above, runs over the first 1,000,000 documents, stems the words, and builds a vocabulary of the 25,000 most common words. The output is saved to WikiVocab25k.txt.
use shibboleth;
// (corpus db, output vocabulary file, documents to scan, vocabulary size)
shibboleth::build_vocab_from_db("wiki.db", "WikiVocab25k.txt", 1000000, 25000);
Training
use shibboleth;
// create a new encoder object
let mut enc = shibboleth::Encoder::new(
    200,                // elements per word vector
    "WikiVocab25k.txt", // vocabulary file
    0.03                // alpha (learning rate)
);
// the prediction (sigmoid) for 'chips' occurring near 'fish' should be near 0.5 prior to training
let p = enc.predict("fish", "chips");
match p {
    Some(val) => println!("'Fish'->'Chips' sigmoid activation before training: {}", val),
    None => println!("One of these words is not in your vocabulary")
}
// train
for _ in 0..100 {
    enc.train_doc("I like to eat fish & chips.");
    enc.train_doc("Steve has chips with his fish.");
}
// after training, the prediction should be near unity
let p = enc.predict("fish", "chips");
match p {
    Some(val) => println!("'Fish'->'Chips' sigmoid activation after training: {}", val),
    None => println!("One of these words is not in your vocabulary")
}
Typical output
'Fish'->'Chips' sigmoid activation before training: 0.5002038
'Fish'->'Chips' sigmoid activation after training: 0.999495
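Finding words by vector distance
The bullet list at the top mentions finding words by vector distance. Shibboleth's own lookup API is not shown in this README, so the following is only a minimal, self-contained sketch of cosine-similarity ranking; the small vector map is hypothetical data, not output from the crate.
use std::collections::HashMap;

// cosine similarity between two equal-length vectors
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

// rank every word in `vectors` by similarity to `query`, most similar first
fn nearest(query: &[f32], vectors: &HashMap<String, Vec<f32>>) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = vectors
        .iter()
        .map(|(w, v)| (w.clone(), cosine(query, v)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored
}

fn main() {
    // hypothetical 3-element vectors, only to illustrate the ranking
    let mut vectors = HashMap::new();
    vectors.insert("fish".to_string(), vec![0.9, 0.1, 0.0]);
    vectors.insert("chips".to_string(), vec![0.8, 0.2, 0.1]);
    vectors.insert("cupcake".to_string(), vec![0.0, 0.9, 0.4]);

    let query = vectors["fish"].clone();
    for (word, score) in nearest(&query, &vectors) {
        println!("{}: {:.3}", word, score);
    }
}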