#web-scraping #dictionary #word-list #web-crawler #webpage #local #reconnaissance

bin+lib wdict

Create dictionaries by scraping webpages or crawling local files

17 versions

0.1.16 July 15, 2024
0.1.13 June 30, 2024
0.1.12 March 4, 2024
0.1.7 October 10, 2023
0.1.6 July 22, 2023

#609 in Web programming

Download history: 193/week @ 2024-04-21, 300/week @ 2024-06-30, 51/week @ 2024-07-07, 261/week @ 2024-07-14, 1/week @ 2024-07-21, 47/week @ 2024-07-28

310 downloads per month

MIT/Apache

88KB
2K SLoC

wdict

Create dictionaries by scraping webpages or crawling local files.

Similar tools (some features inspired by them)

Try it out

# build with nix and run the result
nix build .#
./result/bin/wdict --help

# just run it directly
nix run .# -- --help

# run it without cloning
nix run github:pyqlsa/wdict -- --help

# install from crates.io
# (nixOS users may need to do this within a dev shell)
cargo install wdict

# using a dev shell
nix develop .#
cargo build
./target/debug/wdict --help

# ...or a release version
cargo build --release
./target/release/wdict --help
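
After building or installing, a crawl can be started directly from the command line. A minimal sketch, using only flags documented in the help output below (the URL is a placeholder):

# crawl one level deep starting from a placeholder URL and write words to wdict.txt
wdict --url https://example.com --depth 1 --output wdict.txt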

Usage

Create dictionaries by scraping webpages or crawling local files.

Usage: wdict [OPTIONS] <--url <URL>|--theme <THEME>|--path <PATH>|--resume|--resume-strict>

Options:
  -u, --url <URL>
          URL to start crawling from

      --theme <THEME>
          Pre-canned theme URLs to start crawling from (for fun)

          Possible values:
          - star-wars:   Star Wars themed URL <https://www.starwars.com/databank>
          - tolkien:     Tolkien themed URL <https://www.quicksilver899.com/Tolkien/Tolkien_Dictionary.html>
          - witcher:     Witcher themed URL <https://witcher.fandom.com/wiki/Elder_Speech>
          - pokemon:     Pokemon themed URL <https://www.smogon.com>
          - bebop:       Cowboy Bebop themed URL <https://cowboybebop.fandom.com/wiki/Cowboy_Bebop>
          - greek:       Greek Mythology themed URL <https://www.theoi.com>
          - greco-roman: Greek and Roman Mythology themed URL <https://www.gutenberg.org/files/22381/22381-h/22381-h.htm>
          - lovecraft:   H.P. Lovecraft themed URL <https://www.hplovecraft.com>

  -p, --path <PATH>
          Local file path to start crawling from

      --resume
          Resume crawling from a previous run; state file must exist; existence of dictionary is optional; parameters from state are ignored, instead favoring arguments provided on the command line

      --resume-strict
          Resume crawling from a previous run; state file must exist; existence of dictionary is optional; 'strict' enforces that all arguments from the state file are observed

  -d, --depth <DEPTH>
          Limit the depth of crawling URLs

          [default: 1]

  -m, --min-word-length <MIN_WORD_LENGTH>
          Only save words greater than or equal to this value

          [default: 3]

  -x, --max-word-length <MAX_WORD_LENGTH>
          Only save words less than or equal to this value

          [default: 18446744073709551615]

  -j, --include-js
          Include javascript from <script> tags and URLs

  -c, --include-css
          Include CSS from <style> tags and URLs

      --filters <FILTERS>...
          Filter strategy for words; multiple can be specified (comma separated)

          [default: none]

          Possible values:
          - deunicode:    Transform unicode according to <https://github.com/kornelski/deunicode>
          - decancer:     Transform unicode according to <https://github.com/null8626/decancer>
          - all-numbers:  Ignore words that consist of all numbers
          - any-numbers:  Ignore words that contain any number
          - no-numbers:   Ignore words that contain no numbers
          - only-numbers: Keep only words that exclusively contain numbers
          - all-ascii:    Ignore words that consist of all ascii characters
          - any-ascii:    Ignore words that contain any ascii character
          - no-ascii:     Ignore words that contain no ascii characters
          - only-ascii:   Keep only words that exclusively contain ascii characters
          - none:         Leave the word as-is

      --site-policy <SITE_POLICY>
          Site policy for discovered URLs

          [default: same]

          Possible values:
          - same:      Allow crawling URL, only if the domain exactly matches
          - subdomain: Allow crawling URLs if they are the same domain or subdomains
          - sibling:   Allow crawling URLs if they are the same domain or a sibling
          - all:       Allow crawling all URLs, regardless of domain

  -r, --req-per-sec <REQ_PER_SEC>
          Number of requests to make per second

          [default: 5]

  -l, --limit-concurrent <LIMIT_CONCURRENT>
          Limit the number of concurrent requests to this value

          [default: 5]

  -o, --output <OUTPUT>
          File to write dictionary to (will be overwritten if it already exists)

          [default: wdict.txt]

      --append
          Append extracted words to an existing dictionary

      --output-state
          Write crawl state to a file

      --state-file <STATE_FILE>
          File to write state, json formatted (will be overwritten if it already exists)

          [default: state-wdict.json]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
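
A few illustrative invocations combining the documented options (URLs and filenames are placeholders; exact behavior of combined flags should be confirmed against --help):

# normalize unicode, drop purely numeric words, and only follow links on the exact same domain
wdict --url https://example.com --depth 2 --filters deunicode,all-numbers --site-policy same

# throttle the crawl and write crawl state (default state-wdict.json) for later resumption
wdict --url https://example.com --req-per-sec 2 --limit-concurrent 2 --output-state

# resume the previous run, appending newly found words to the existing dictionary
wdict --resume --append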

Lib

This crate also exposes a library, but the interface should currently be considered unstable.

License

Licensed under either of

- Apache License, Version 2.0
- MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Dependencies

~13–25MB
~428K SLoC