6 versions

0.1.4	Aug 12, 2023
0.1.3-alpha	May 15, 2023
0.1.2	May 7, 2023
0.1.1	May 7, 2023
0.1.0	May 7, 2023

#619 in Database interfaces

31 downloads per month

MIT license

29KB
526 lines

Waper

Waper is a CLI tool to scrape HTML websites. Here is a simple usage example:

waper --seed-links "https://example.com/" --whitelist "https://example.com/.*" --whitelist "https://www.iana.org/domains/example"

This will scrape "https://example.com/" and save the HTML of every link it finds to an SQLite database named waper_out.sqlite.

Installation

cargo install waper

CLI usage

A CLI tool to scrape HTML websites

Usage: waper [OPTIONS]
       waper <COMMAND>

Commands:
  scrape      This is also default command, so it's optional to include in args
  completion  Print shell completion script
  help        Print this message or the help of the given subcommand(s)

Options:
  -w, --whitelist <WHITELIST>
          whitelist regexes: only these urls will be scanned other then seeds
  -b, --blacklist <BLACKLIST>
          blacklist regexes: these urls will never be scanned By default nothing will be blacklisted [default: a^]
  -s, --seed-links <SEED_LINKS>
          Links to start with
  -o, --output-file <OUTPUT_FILE>
          Sqlite output file [default: waper_out.sqlite]
  -m, --max-parallel-requests <MAX_PARALLEL_REQUESTS>
          Sqlite output file [default: 5]
  -i, --include-db-links
          Will also include unprocessed links from `links` table in db if present. Helpful when you want to continue the scraping from a previously unfinished session
  -v, --verbose
          Should verbose (debug) output
  -h, --help
          Print help
  -V, --version
          Print version
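
The interaction between the seed, whitelist, and blacklist options above can be illustrated with a small Python sketch. This is not Waper's actual implementation: the helper name `should_scan` is hypothetical, and it assumes an unanchored regex search, which may differ from how Waper matches patterns. It does show why the default blacklist `a^` blocks nothing: no string can have an `a` followed by the start of the input.

```python
import re

def should_scan(url, whitelist, blacklist=("a^",)):
    """Return True if url matches some whitelist regex and no blacklist regex.

    Assumption: unanchored search semantics; Waper's real matching may differ.
    Seed links are not checked here, since they are always scanned.
    """
    allowed = any(re.search(pattern, url) for pattern in whitelist)
    blocked = any(re.search(pattern, url) for pattern in blacklist)
    return allowed and not blocked

if __name__ == "__main__":
    # Same whitelist as in the usage example at the top of this document.
    whitelist = [
        r"https://example.com/.*",
        r"https://www.iana.org/domains/example",
    ]
    for url in (
        "https://example.com/about",
        "https://www.iana.org/domains/example",
        "https://www.iana.org/help",
    ):
        print(url, should_scan(url, whitelist))
```

Passing an extra pattern such as `\.pdf$` in `blacklist` would exclude matching URLs even when they are whitelisted.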

Querying data

Data is stored in an SQLite database whose schema is defined in ./sqls/INIT.sql. There are three tables:

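results: stores the content of every request for which a response was received
errors: stores the error message of every request that could not be completed
links: stores the URLs of all links, both visited and unvisited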
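
To confirm that an output file contains these tables, one option is to read SQLite's built-in `sqlite_master` catalog from Python. The sketch below is only illustrative: `list_tables` and `demo_connection` are hypothetical helpers, and the stand-in database uses placeholder columns rather than the real schema from INIT.sql.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

def demo_connection():
    """Build an in-memory stand-in with placeholder columns (not the real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE results (url TEXT);
        CREATE TABLE errors (url TEXT);
        CREATE TABLE links (url TEXT);
        """
    )
    return conn

if __name__ == "__main__":
    # For a real run, use sqlite3.connect("waper_out.sqlite") instead.
    print(list_tables(demo_connection()))
```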

Results can be queried with any SQLite client. For example, using the sqlite3 CLI:

$ sqlite3 waper_out.sqlite 'select url, time, length(html) from results'
https://example.com/|2023-05-07 06:47:33|1256
https://www.iana.org/domains/example|2023-05-07 06:47:39|80

For prettier output, you can adjust the sqlite3 settings:

$ sqlite3 waper_out.sqlite '.headers on' '.mode column' 'select url, time, length(html) from results'
url                                   time                 length(html)
------------------------------------  -------------------  ------------
https://example.com/                  2023-05-07 06:47:33  1256
https://www.iana.org/domains/example  2023-05-07 06:47:39  80
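
The same query can also be run programmatically. The following is a minimal sketch using Python's standard `sqlite3` module. It assumes only the `url`, `time`, and `html` columns that appear in the query above; `summarize_results` and `demo_connection` are hypothetical helper names, and the demo rows are made-up stand-ins rather than real scrape output.

```python
import sqlite3

def summarize_results(conn):
    """Return (url, time, html_length) for every row in `results`, oldest first."""
    query = "SELECT url, time, length(html) FROM results ORDER BY time"
    return conn.execute(query).fetchall()

def demo_connection():
    """Build an in-memory stand-in; column names are taken from the query above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (url TEXT, time TEXT, html TEXT)")
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?)",
        [
            ("https://example.com/", "2023-05-07 06:47:33", "<html>" + "x" * 10),
            ("https://www.iana.org/domains/example", "2023-05-07 06:47:39", "<html>"),
        ],
    )
    return conn

if __name__ == "__main__":
    # For real data, use sqlite3.connect("waper_out.sqlite") instead.
    for url, time, size in summarize_results(demo_connection()):
        print(f"{url:<40} {time} {size}")
```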

To quickly search through all the URLs, you can use fzf:

sqlite3 waper_out.sqlite 'select url from links' | fzf

Planned improvements

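Allow users to specify priorities for URLs, so that some URLs can be scraped before others
Support complex rate limits
Allow continuation of a previously stopped scrape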
- Should keep working across IP roaming (auto-detect and continue)
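Handle redirects explicitly
Allow users to modify parts of the request (such as the user-agent)
Improve storage efficiency by compressing/de-duplicating the HTML
Provide more visibility into the number of URLs in the queue, the processing rate, etc.
Support JS execution using ... (v8 or webkit, not many options)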

Feedback

If you find any bugs or have a feature suggestion, please file an issue on GitHub.

Dependencies

~41–57MB
~1M SLoC