#nlp #linalg #word2vec

bin+lib sloword2vec

Word2Vec 的简单实现

2 个版本

使用旧的 Rust 2015

0.1.1 2017年9月6日
0.1.0 2017年8月28日

#790数学

MIT 许可证

70KB
1.5K SLoC

SloWord2Vec 构建状态

这是在 Rust 中实现的 Word2Vec 的简单版本。

目标是学习 Word2Vec 背后的基本原理和公式。顺便说一下,它运行得比较慢 ;)

获取方法

此库可以作为库和二进制文件使用。

二进制文件

A naive Word2Vec implementation

USAGE:
    sloword2vec [SUBCOMMAND]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    add-subtract    Given a number of words to add and to subtract, returns a list of words in that area.
    help            Prints this message or the help of the given subcommand(s)
    similar         Given a path to a saved Word2Vec model and a target word, finds words in the model's vocab that are similar.
    train           Given a corpus and a path to save a trained model, trains Word2Vec encodings for the vocabulary in the corpus and saves it.

训练

Given a corpus and a path to save a trained model, trains Word2Vec encodings for the vocabulary in the corpus and saves it.

USAGE:
    sloword2vec train [OPTIONS] --corpus <corpus> --path <path>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -A, --acceptable-error <acceptable-error>              Acceptable error threshold under which training will end. [default: 0.1]
    -R, --context-radius <context-radius>                  The context radius (how many word surrounding a centre word to take into account per training sample). [default: 5]
    -C, --corpus <corpus>                                  Where the corpus file is.
    -D, --dimensions <dimensions>                          Number of dimensions to use for encoding a word as a vector. [default: 100]
    -I, --iterations <iterations>                          Max number of training iterations. [default: 500]
    -L, --learning-rate <learning-rate>                    Learning rate. [default: 0.001]
    -M, --min-error-improvement <min-error-improvement>    Minimum improvement in average error magnitude in a single training iteration (over all words) to keep on training [default:
                                                           0.0001]
    -O, --min-word-occurences <min-word-occurences>        Minimum number of occurences in the corpus a word needs to have in order to be included in the trained vocabulary. [default:
                                                           2]
    -P, --path <path>                                      Where to store the model when training is done.

相似度

Given a path to a saved Word2Vec model and a target word, finds words in the model's vocab that are similar.

USAGE:
    sloword2vec similar --limit <limit> --path <path> --word <word>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -L, --limit <limit>    Max number of similar entries to show. [default: 20]
    -P, --path <path>      Where to store the model when training is done.
    -W, --word <word>      Word to find similar terms for.

加减操作

Word2Vec 的经典演示...

Given a number of words to add and to subtract, returns a list of words in that area.

USAGE:
    sloword2vec add-subtract --add <add>... --limit <limit> --path <path> --subtract <subtract>...

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -A, --add <add>...              Words to add encodings for
    -L, --limit <limit>             Max number of similar entries to show. [default: 20]
    -P, --path <path>               Where to store the model when training is done.
    -S, --subtract <subtract>...    Words to subtract encodings for

详细信息

Word2Vec 的几乎所有最简单实现,唯一特殊的地方是使用矩阵/向量数学来加速。

此库背后的线性代数库是 ndarray,启用了 OpenBlas(Fortran 和透明多线程 FTW!)。

参考资料

  1. 论文
  2. Word2Vec Skip-gram 模型教程 文章

依赖关系

~9–18MB
~253K SLoC