17 versions
| Version | Date |
|---|---|
| 0.1.16 | July 15, 2024 |
| 0.1.13 | June 30, 2024 |
| 0.1.12 | March 4, 2024 |
| 0.1.7 | October 10, 2023 |
| 0.1.6 | July 22, 2023 |
#609 in Web Programming
310 downloads per month
88KB
2K SLoC
wdict
Create dictionaries by scraping webpages or crawling local files.
Similar tools (some of their features served as inspiration)
Try it out

```sh
# build with nix and run the result
nix build .#
./result/bin/wdict --help

# just run it directly
nix run .# -- --help

# run it without cloning
nix run github:pyqlsa/wdict -- --help

# install from crates.io
# (nixOS users may need to do this within a dev shell)
cargo install wdict

# using a dev shell
nix develop .#
cargo build
./target/debug/wdict --help

# ...or a release version
cargo build --release
./target/release/wdict --help
```
Usage
```
Create dictionaries by scraping webpages or crawling local files.

Usage: wdict [OPTIONS] <--url <URL>|--theme <THEME>|--path <PATH>|--resume|--resume-strict>

Options:
  -u, --url <URL>
          URL to start crawling from

      --theme <THEME>
          Pre-canned theme URLs to start crawling from (for fun)

          Possible values:
          - star-wars:   Star Wars themed URL <https://www.starwars.com/databank>
          - tolkien:     Tolkien themed URL <https://www.quicksilver899.com/Tolkien/Tolkien_Dictionary.html>
          - witcher:     Witcher themed URL <https://witcher.fandom.com/wiki/Elder_Speech>
          - pokemon:     Pokemon themed URL <https://www.smogon.com>
          - bebop:       Cowboy Bebop themed URL <https://cowboybebop.fandom.com/wiki/Cowboy_Bebop>
          - greek:       Greek Mythology themed URL <https://www.theoi.com>
          - greco-roman: Greek and Roman Mythology themed URL <https://www.gutenberg.org/files/22381/22381-h/22381-h.htm>
          - lovecraft:   H.P. Lovecraft themed URL <https://www.hplovecraft.com>

  -p, --path <PATH>
          Local file path to start crawling from

      --resume
          Resume crawling from a previous run; state file must exist; existence of dictionary is optional; parameters from state are ignored, instead favoring arguments provided on the command line

      --resume-strict
          Resume crawling from a previous run; state file must exist; existence of dictionary is optional; 'strict' enforces that all arguments from the state file are observed

  -d, --depth <DEPTH>
          Limit the depth of crawling URLs

          [default: 1]

  -m, --min-word-length <MIN_WORD_LENGTH>
          Only save words greater than or equal to this value

          [default: 3]

  -x, --max-word-length <MAX_WORD_LENGTH>
          Only save words less than or equal to this value

          [default: 18446744073709551615]

  -j, --include-js
          Include javascript from <script> tags and URLs

  -c, --include-css
          Include CSS from <style> tags and URLs

      --filters <FILTERS>...
          Filter strategy for words; multiple can be specified (comma separated)

          [default: none]

          Possible values:
          - deunicode:    Transform unicode according to <https://github.com/kornelski/deunicode>
          - decancer:     Transform unicode according to <https://github.com/null8626/decancer>
          - all-numbers:  Ignore words that consist of all numbers
          - any-numbers:  Ignore words that contain any number
          - no-numbers:   Ignore words that contain no numbers
          - only-numbers: Keep only words that exclusively contain numbers
          - all-ascii:    Ignore words that consist of all ascii characters
          - any-ascii:    Ignore words that contain any ascii character
          - no-ascii:     Ignore words that contain no ascii characters
          - only-ascii:   Keep only words that exclusively contain ascii characters
          - none:         Leave the word as-is
```
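The number- and ascii-based filters above come in ignore/keep-only pairs. A rough sketch of how a few of them could behave (an illustration only, not wdict's actual implementation; all function names here are made up):

```rust
// Illustrative sketch of a few filter strategies; not wdict's actual code.
fn is_all_numbers(word: &str) -> bool {
    !word.is_empty() && word.chars().all(|c| c.is_numeric())
}

fn is_all_ascii(word: &str) -> bool {
    word.chars().all(|c| c.is_ascii())
}

// "all-numbers" drops words made entirely of digits;
// "only-numbers" keeps exactly those words.
fn apply_all_numbers<'a>(words: &[&'a str]) -> Vec<&'a str> {
    words.iter().copied().filter(|w| !is_all_numbers(w)).collect()
}

fn apply_only_numbers<'a>(words: &[&'a str]) -> Vec<&'a str> {
    words.iter().copied().filter(|w| is_all_numbers(w)).collect()
}

fn main() {
    let words = ["hello", "1234", "café", "a1b2"];
    println!("all-numbers:  {:?}", apply_all_numbers(&words));
    println!("only-numbers: {:?}", apply_only_numbers(&words));
    println!("café is all ascii: {}", is_all_ascii("café"));
}
```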
```
      --site-policy <SITE_POLICY>
          Site policy for discovered URLs

          [default: same]

          Possible values:
          - same:      Allow crawling URL, only if the domain exactly matches
          - subdomain: Allow crawling URLs if they are the same domain or subdomains
          - sibling:   Allow crawling URLs if they are the same domain or a sibling
          - all:       Allow crawling all URLs, regardless of domain
```
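One plausible reading of these policies, sketched as plain host-string matching (an assumption for illustration; the hostnames are made up and this is not wdict's actual logic):

```rust
// Illustrative sketch of the site policies; not wdict's actual code.
// Hosts such as "blog.example.com" are made-up examples.
fn same(host: &str, start: &str) -> bool {
    host == start
}

fn subdomain(host: &str, start: &str) -> bool {
    host == start || host.ends_with(&format!(".{start}"))
}

// A "sibling" shares the start host's parent domain, e.g.
// docs.example.com would be a sibling of blog.example.com.
fn sibling(host: &str, start: &str) -> bool {
    match start.split_once('.') {
        Some((_, parent)) => host == parent || host.ends_with(&format!(".{parent}")),
        None => host == start,
    }
}

fn main() {
    let start = "blog.example.com";
    assert!(same("blog.example.com", start));
    assert!(!same("docs.example.com", start));
    assert!(subdomain("a.blog.example.com", start));
    assert!(sibling("docs.example.com", start));
    assert!(!sibling("example.org", start));
    println!("policy checks passed");
}
```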
```
  -r, --req-per-sec <REQ_PER_SEC>
          Number of requests to make per second

          [default: 5]

  -l, --limit-concurrent <LIMIT_CONCURRENT>
          Limit the number of concurrent requests to this value

          [default: 5]
```
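A concurrency cap like `--limit-concurrent` can be pictured as a fixed pool of workers draining a shared queue, so at most that many requests are ever in flight. This is only a sketch of the idea with threads and a simulated fetch; wdict's own implementation will differ:

```rust
// Illustrative worker-pool sketch of a concurrency limit; not wdict's code.
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

fn crawl_all(urls: Vec<String>, limit_concurrent: usize) -> Vec<String> {
    let queue = Arc::new(Mutex::new(urls));
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for _ in 0..limit_concurrent {
        let queue = Arc::clone(&queue);
        let tx = tx.clone();
        handles.push(thread::spawn(move || loop {
            // Take the next URL, holding the lock only briefly.
            let next = queue.lock().unwrap().pop();
            match next {
                // A real crawler would perform the HTTP request here.
                Some(url) => tx.send(format!("fetched {url}")).unwrap(),
                None => break,
            }
        }));
    }
    drop(tx); // channel closes once every worker is done
    let mut results: Vec<String> = rx.iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    results.sort();
    results
}

fn main() {
    let urls: Vec<String> = (0..8)
        .map(|i| format!("https://example.com/page{i}"))
        .collect();
    let results = crawl_all(urls, 3);
    println!("{} pages fetched", results.len());
}
```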
```
  -o, --output <OUTPUT>
          File to write dictionary to (will be overwritten if it already exists)

          [default: wdict.txt]

      --append
          Append extracted words to an existing dictionary

      --output-state
          Write crawl state to a file

      --state-file <STATE_FILE>
          File to write state, json formatted (will be overwritten if it already exists)

          [default: state-wdict.json]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
```
Lib
This crate exposes a library, but its interface should currently be considered unstable.
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
Dependencies
~13–25MB
~428K SLoC