12 个版本 (1 个稳定版)
使用旧的 Rust 2015
1.0.0 | 2020年11月22日 |
---|---|
0.2.4 | 2019年2月18日 |
0.2.3 | 2019年1月12日 |
0.2.2 | 2018年4月15日 |
0.1.6 | 2017年11月5日 |
#220 in 游戏
320KB
1.5K SLoC
rana
又一个词组算法,这次是在 Rust 中实现
用法
提供 --help-long
的文本。
USAGE:
rana [FLAGS] [OPTIONS] <word>
FLAGS:
-w, --words-in Returns the set of words composable from the letters in the input phrase
--strict When finding --words-in, returns only words that occur in some anagram
--prove Like --strict, but emits a phrase proving this word occurs in an anagram.
-h, --help Prints help information
--help-long Prints *detailed* help information
-C, --no-cache Do not cache partial results (this saves memory and costs speed)
-r, --random (Partially) shuffle order of discovery
--ribbit Ego sum
-V, --version Prints version information
OPTIONS:
-d, --dictionary <file> A line-delimited list of words usable in anagrams [default: ~/.anagram-dictionary.txt]
-x, --exclude <word>... Exclude this word from anagrams
-i, --include <word>... Include this word in the anagrams
-l, --limit <n> Only find this many anagrams
-m, --minimum-word-length <n> Words in anagrams must be at least this long
-t, --threads <n> The number of threads to use during anagram collection [default: 8]
ARGS:
<word>... The words for which you want an anagram
Rana generates all the possible anagrams from a given phrase and
dictionary. Note "given some dictionary." Rana does not have a word list
built in. You must tell it what words it may use in an anagram. I have made
myself such a list out of a list of English words I found on the Internet from
which I delted all the words likely to offend people. By default rana will
look in your home directory for a file called .anagrams-dictionary.txt.
In many cases a simple phrase will have hundreds of thousands or millions of
anagrams, setting aside permutations. The phrase "rotten apple", for example,
with a fairly ordinary dictionary of of 109,217 English words, produces 2695
anagrams. Here are 10:
pone prattle
plea portent
pole pattern
portent pale
platter pone
planter poet
potent paler
porn palette
pron palette
poler patent
Because so many anagrams are available, you are likely to want to focus your
search. Rana provides several options to facilitate this.
--words-in
This will list all the words in your dictionary composable from some subset of
your phrase. You likely want to add --strict to this, so you only get words that
occur in *some* anagram. The --strict version is slower. If you want to verify
that each word listed has some anagram, you can add --prove, or replace --strict
with --prove. This will cause rana to emit the remaining words in an anagram
using the word in question.
--exclude
Discard from your word list particular words.
--include
Include only those phrases which include particular words.
--limit
Only provide a sample of this many phrases.
--random
Shuffle the search order over partial results while searching for anagrams. This
does not provide a fully random sample of the possible anagrams, since only the
results found at any point are shuffled, not all possible results, but this is
a decent way to look at a sample of anagrams when the phrase you've fed in has
many thousands of results. This is particularly useful when paired with --limit.
Caching and Threads
Rana by default uses as many processing threads as there are cores on your
machine. Generally this is what you want, but if you've got a lot of other
things going on, you can limit the number of available threads to reduce the
load your kernel has to deal with.
Rana also uses a dynamic programming algorithm to reduce the complexity of
finding algorithms for large phrases. This is probably unnecessary for short
phrases, though rana provides no lower limit. For larger phrases, like the
complete alphabet, the cache used by the dynamic programming algorithm may grow
so large that the process crashes. If you turn off the cache rana will use
a constant amount of memory, though it may take considerably longer to find all
anagrams.
Text Normalization
Rana attempts to strip away certain characters from your word list and all
other textual input, so it will treat "c-a-t" and " C A T " the same as "cat".
Here is the actual code that does this:
pub fn normalize(word: &str) -> String {
word.trim()
.to_lowercase()
.chars()
.filter(|c| c.is_alphabetic())
.collect::<String>()
}
I have not tested what this will do for something like ß or Í. You may want to
normalize the text yourself before you give it to rana.
NOTE:
The caching algorithm treats character counts as long base-10 numbers. So, for
example, "cat" might be 111 if "c" is 100s, "a" is 10s, and "t" is 1s. Then
"cad" might be 1110 -- "d" is 1000s, and there is one of them, but there is no
"t", so the 1s place is 0. If you have a phrase, like
"Dhrtaraashtra uvaaca, dharmakshetre kurukshetre samavetaa yuyutsavah"
which has 15 a's, the "a" count has to be represented as 5 -- 15 mod 10 is 5. In
other words, there isn't space in the a's column of the number for all the a's
in the phrase. It is unlikely that this will ever cause trouble, because for a
particular phrase you are unlikely to be encounter two sets of character counts
during processing that have different counts but the same code. Also, a phrase
this size will consume so much cache space that the process will probably crash
before you encounter this collision.
Another consideration with caching is that this scheme can only accommodate
alphabets up to 38 characters in size.
示例用法
~ $ rana eat
eat
a et
eta
ate
tea
安装
Rana 可在 crates.io
上找到,因此您可以使用以下命令安装它:
cargo install ranagrams
您还可以克隆或复制此项目到您的机器,并在其目录中运行 cargo build --release
。这将生成一个名为 target/release/rana
的可执行文件。要使用该可执行文件,您需要一个单词列表。我还没有检查我的。您可能在计算机上有一个名为 /usr/share/dict/words
的文件。如果是这样,这将工作,尽管它可能包含所有单独的字母以及其他单词,这意味着如果您将其用作字典,则 c a t
将是 cat
的一个词组。如果您只搜索“单词列表”,您可以在网上找到单词列表。列表必须包含您感兴趣的所有形式。Rana 无法从“cat”推断出“cats”,更不用说从“bring”推断出“brought”了。
历史与致谢
我已经制作了许多这种词组算法的变体。这是 Rust 中的第一个。它比之前的任何版本都更高效、更快。我无法说这是特别好的或符合 Rust 习惯的。这是我写的第一段有意义的 Rust 代码。在某种程度上,如果这是好的 Rust 代码,那么功劳完全归功于 @TurkeyMcMac,他比我更了解 Rust,并且可以大致告诉我我在做什么特别愚蠢的事情。
依赖项
~1.2–1.8MB
~24K SLoC