2 versions
Uses the old Rust 2015 edition

| Version | Date |
|---|---|
| 0.1.1 | Sep 6, 2017 |
| 0.1.0 | Aug 28, 2017 |

#790 in Math
70KB
1.5K SLoC
SloWord2Vec
This is a naive implementation of Word2Vec written in Rust.
The goal is to learn the basic principles and formulas behind Word2Vec. As a side effect, it also runs rather slowly ;)
How to get it
This can be used both as a library and as a binary.
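Assuming the crate is published to crates.io with its binary target (which the version listing above suggests), the usual way to get the command line tool is via cargo:

```
# Install the sloword2vec binary with a Rust toolchain already set up
cargo install sloword2vec
```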
Binary
```
A naive Word2Vec implementation

USAGE:
    sloword2vec [SUBCOMMAND]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    add-subtract    Given a number of words to add and to subtract, returns a list of words in that area.
    help            Prints this message or the help of the given subcommand(s)
    similar         Given a path to a saved Word2Vec model and a target word, finds words in the model's vocab that are similar.
    train           Given a corpus and a path to save a trained model, trains Word2Vec encodings for the vocabulary in the corpus and saves it.
```
Training
```
Given a corpus and a path to save a trained model, trains Word2Vec encodings for the vocabulary in the corpus and saves it.

USAGE:
    sloword2vec train [OPTIONS] --corpus <corpus> --path <path>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -A, --acceptable-error <acceptable-error>              Acceptable error threshold under which training will end. [default: 0.1]
    -R, --context-radius <context-radius>                  The context radius (how many words surrounding a centre word to take into account per training sample). [default: 5]
    -C, --corpus <corpus>                                  Where the corpus file is.
    -D, --dimensions <dimensions>                          Number of dimensions to use for encoding a word as a vector. [default: 100]
    -I, --iterations <iterations>                          Max number of training iterations. [default: 500]
    -L, --learning-rate <learning-rate>                    Learning rate. [default: 0.001]
    -M, --min-error-improvement <min-error-improvement>    Minimum improvement in average error magnitude in a single training iteration (over all words) to keep on training [default: 0.0001]
    -O, --min-word-occurences <min-word-occurences>        Minimum number of occurences in the corpus a word needs to have in order to be included in the trained vocabulary. [default: 2]
    -P, --path <path>                                      Where to store the model when training is done.
```
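For example, a training run might look something like this (the corpus and output paths are just placeholders, and the flag values shown are the defaults listed above):

```
# Train encodings from a plain-text corpus and save the model to ./model.w2v
sloword2vec train \
  --corpus ./corpus.txt \
  --path ./model.w2v \
  --dimensions 100 \
  --context-radius 5 \
  --learning-rate 0.001
```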
Similarity
```
Given a path to a saved Word2Vec model and a target word, finds words in the model's vocab that are similar.

USAGE:
    sloword2vec similar --limit <limit> --path <path> --word <word>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -L, --limit <limit>    Max number of similar entries to show. [default: 20]
    -P, --path <path>      Where to store the model when training is done.
    -W, --word <word>      Word to find similar terms for.
```
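For example, to look up the neighbours of a single word in a previously saved model (the path and the query word are placeholders):

```
# Show the 10 vocabulary entries most similar to "king" in the saved model
sloword2vec similar --path ./model.w2v --word king --limit 10
```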
Add-subtract
The classic Word2Vec demo...
```
Given a number of words to add and to subtract, returns a list of words in that area.

USAGE:
    sloword2vec add-subtract --add <add>... --limit <limit> --path <path> --subtract <subtract>...

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -A, --add <add>...              Words to add encodings for
    -L, --limit <limit>             Max number of similar entries to show. [default: 20]
    -P, --path <path>               Where to store the model when training is done.
    -S, --subtract <subtract>...    Words to subtract encodings for
```
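For example, the classic "king - man + woman" query might be run like this (the model path is a placeholder, whether anything like "queen" comes back depends entirely on the training corpus, and if space-separated values are not accepted the --add/--subtract flags can be repeated once per word instead):

```
# Add the encodings for "king" and "woman", subtract "man", and list the nearest words
sloword2vec add-subtract --path ./model.w2v --add king woman --subtract man --limit 5
```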
Details
This is pretty much the simplest possible implementation of Word2Vec; the only slightly fancy part is the use of matrix/vector maths to speed things up.
The linear algebra library behind this lib is ndarray, with OpenBLAS enabled (Fortran and transparent multithreading FTW!).
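As a rough sketch of why the matrix/vector formulation helps (this describes the usual Word2Vec similarity lookup in general, not necessarily this crate's exact internals): similarity between word vectors is typically cosine similarity, and stacking all vocabulary vectors as the rows of a matrix $V$ lets all the dot products be computed in one matrix-vector product $V\mathbf{w}$ instead of a per-word loop, which is exactly the kind of operation a BLAS backend parallelises:

$$\operatorname{sim}(\mathbf{v}_i, \mathbf{w}) \;=\; \frac{\mathbf{v}_i \cdot \mathbf{w}}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{w} \rVert}, \qquad i = 1, \dots, |\text{vocab}|$$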
Dependencies
~9–18MB
~253K SLoC