9 releases (4 breaking)

0.6.2	Apr 3, 2023
0.6.1	Feb 9, 2023
0.6.0	Dec 13, 2022
0.5.1	Oct 24, 2022
0.2.2	Jul 30, 2022

#202 in HTTP client

38 downloads per month

MIT/Apache

38KB
926 lines

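Steven Hé (Sīchàng)'s Recursive Scraper

This scraper scrapes recursively, at a constant frequency (requests per unit of time), asynchronously, and writes the results to disk continuously. It is simple to use while still offering flexible configuration.

It is an open-source rewrite of the proprietary scraper used in DKU SSO.

Objective

The scraper is designed for recursive scraping: by default, it processes every href and img it finds in the HTML it fetches, and then processes the pages behind those URLs in turn. An obvious use of recursive scraping is scraping a full site.

If you only want to scrape the URLs you provide, supply a filter that no discovered URL can match, such as "#", and the scraper behaves as a non-recursive scraper. One use of non-recursive scraping is bulk image scraping.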

Installation

Install recursive_scraper with Cargo:

cargo install recursive_scraper

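Features

Constant frequency

The scraper guarantees that, eventually, the number of requests sent per unit of time is constant. This constant is determined by the delay set between consecutive requests.

The delay is set in milliseconds. The default is 500.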
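As a rough illustration of how the delay fixes the request rate, here is a minimal sketch. The helpers `requests_per_second` and `send_time` are hypothetical names made up for this example and are not part of the crate; the only value taken from the text is the 500 ms default.

```rust
use std::time::Duration;

/// Default delay between requests, as stated above (500 ms).
const DEFAULT_DELAY: Duration = Duration::from_millis(500);

/// Steady-state request rate implied by a fixed delay between requests.
/// Hypothetical helper for illustration; a zero delay would yield infinity.
fn requests_per_second(delay: Duration) -> f64 {
    1.0 / delay.as_secs_f64()
}

/// Time offset at which the n-th request (counting from 0) would be sent
/// if requests are spaced exactly `delay` apart.
fn send_time(n: u32, delay: Duration) -> Duration {
    delay * n
}
```

With the default delay, `requests_per_second(DEFAULT_DELAY)` evaluates to 2.0, that is, two requests per second once the scraper reaches its steady state.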

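Regex filter and blacklist

The scraper does not process any newly discovered URL that fails to match the given filter regex or that matches the given blacklist regex.

URLs specified by the user are never checked. If not specified, the filter defaults to ".*" to match any URL, and the blacklist defaults to "#" to match no URL. (Processed URLs never contain #, because the scraper strips it to avoid duplicates.)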
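The rule above can be sketched as follows. To keep the example free of external crates, plain closures stand in for the two regexes, so this is a simplification and not the crate's actual code; `strip_fragment` and `should_process` are hypothetical names.

```rust
/// Drop the `#fragment` part so that `page#a` and `page#b` count as one URL.
fn strip_fragment(url: &str) -> &str {
    match url.find('#') {
        Some(index) => &url[..index],
        None => url,
    }
}

/// Decide whether a newly discovered URL should be processed.
/// `matches_filter` and `matches_blacklist` stand in for the two regexes.
fn should_process(
    url: &str,
    matches_filter: impl Fn(&str) -> bool,
    matches_blacklist: impl Fn(&str) -> bool,
) -> bool {
    let url = strip_fragment(url);
    matches_filter(url) && !matches_blacklist(url)
}
```

Under these assumptions, a blacklist of "#" (modeled as `|u| u.contains('#')`) never rejects anything because the fragment is already gone, and a filter of "#" never accepts anything, which is why it turns the scraper non-recursive.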

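Adjustable connection timeout

By default, the scraper times out a request if it fails to connect within 10 seconds. You can set a custom timeout in milliseconds.

Under the hood, the scraper also uses eight times the connection timeout as the limit for the whole request and response to finish.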
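The relationship between the two timeouts is simple arithmetic. The sketch below only restates it; `request_timeout` is a hypothetical helper name, and the factor of eight and the 10-second default are the values given in the text.

```rust
use std::time::Duration;

/// Default connection timeout, as stated above (10 seconds).
const DEFAULT_CONNECTION_TIMEOUT: Duration = Duration::from_secs(10);

/// Overall limit for a request and its response:
/// eight times the connection timeout.
fn request_timeout(connection_timeout: Duration) -> Duration {
    connection_timeout * 8
}
```

With the default, the overall limit works out to 80 seconds; passing `-c 2000` would, under the same rule, give 2 seconds to connect and 16 seconds overall.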

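Continuously updated record

The scraping record is written to disk as summary.toml in the log directory. The record is updated from time to time as the scraper runs.

In [urls], each URL is mapped to an id according to the order in which it was discovered. [scrapes] records the ids of the URLs that have been scraped. [fails] records the ids of the URLs the scraper failed to process. [redirections] records whether a URL (whose id is on the left) was redirected to another URL (on the right).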
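To make the id scheme of [urls] concrete, here is a minimal sketch of assigning ids in discovery order. The `UrlIds` type is hypothetical, and starting the ids at 0 is an assumption; the text does not say where numbering begins or how the crate stores the table internally.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory counterpart of the `[urls]` table.
#[derive(Default)]
struct UrlIds {
    ids: HashMap<String, usize>,
}

impl UrlIds {
    /// Return the id of `url`, assigning the next free id the first time
    /// the URL is seen. Ids start at 0 here, which is an assumption.
    fn id_of(&mut self, url: &str) -> usize {
        let next_id = self.ids.len();
        *self.ids.entry(url.to_owned()).or_insert(next_id)
    }
}
```

Because a URL keeps the id it received on first discovery, the other tables can refer to it by that id alone.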

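Rings

URLs that do not match the filter are URLs in the outer rings. To be precise, these URLs must also not match the blacklist. While scraping, if number_of_rings is set, the scraper appends these hrefs to the "next" pending list. When the scraper runs out of tasks, if number_of_rings is set and the current ring is less than it, the scraper takes the "next" pending list as the pending list and continues scraping.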
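The ring bookkeeping described above could look roughly like the sketch below. The `Rings` struct, its field names, and its methods are all hypothetical; the booleans passed to `discover` stand in for the results of the filter and blacklist regex checks.

```rust
/// Hypothetical bookkeeping for ring-based scraping.
struct Rings {
    pending: Vec<String>,
    next_pending: Vec<String>,
    current_ring: usize,
    number_of_rings: Option<usize>,
}

impl Rings {
    /// Queue a discovered URL. URLs inside the filter go to `pending`;
    /// URLs outside it go to `next_pending`, but only if rings are enabled.
    fn discover(&mut self, url: &str, matches_filter: bool, matches_blacklist: bool) {
        if matches_blacklist {
            return;
        }
        if matches_filter {
            self.pending.push(url.to_owned());
        } else if self.number_of_rings.is_some() {
            self.next_pending.push(url.to_owned());
        }
    }

    /// Called when `pending` is exhausted. Moves on to the next ring if one
    /// is allowed and reports whether it did so.
    fn advance_ring(&mut self) -> bool {
        match self.number_of_rings {
            Some(limit) if self.current_ring < limit => {
                self.pending = std::mem::take(&mut self.next_pending);
                self.current_ring += 1;
                true
            }
            _ => false,
        }
    }
}
```

In this sketch, with `number_of_rings` set to 1, URLs just outside the filter are scraped once the main list is exhausted, and scraping stops after that single extra ring.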

Usage

$ recursive_scraper --help
Scrapes given urls (separated by commas) recursively.
Saves the results to `html/` and `other/`, the log to `log/`,
or other directories if specified.
See <https://github.com/SichangHe/scraper> for more instructions.

Usage: recursive_scraper [OPTIONS] <START_URLS>

Arguments:
  <START_URLS>  The URLs to start scraping from, separated by commas.

Options:
  -b, --blacklist <BLACKLIST>
          Regex to match URLs that should be excluded.
  -c, --connection-timeout <CONNECTION_TIMEOUT>
          Connection timeout for each request in integer milliseconds.
  -d, --delay <DELAY>
          Delay between each request in integer milliseconds
  -f, --filter <FILTER>
          Regex to match URLs that should be included.
  -i, --disregard-html
          Do not save HTMLs.
  -l, --log-dir <LOG_DIR>
          Directory to output the log.
  -o, --other-dir <OTHER_DIR>
          Directory to save non-HTMLs.
  -r, --number-of-rings <NUMBER_OF_RINGS>
          Set the number of rings for the URLs outside the filter.
  -s, --disregard-other
          Do not save non-HTMLs.
  -t, --html-dir <HTML_DIR>
          Directory to save HTMLs.
  -h, --help
          Print help information
  -V, --version
          Print version information

Recursively scrape the whole of https://example.com/:

recursive_scraper -f "https://example.com/.*" https://example.com/

Same as above, but without images (-s skips saving all non-HTML files):

recursive_scraper -f "https://example.com/.*" -s https://example.com/

Scrape only the URLs I provide (separated by commas):

recursive_scraper -f "#" https://example.com/blah,https://example.com/blahblah,https://example.com/bla

Scrape everything into one folder, result/:

recursive_scraper -f "https://example.com/.*" -l result/ -o result/ -t result/ https://example.com/

Environment variables

recursive_scraper uses env_logger for logging, so you can set RUST_LOG to control the log level.

For example, to run the first example above with the log level set to info:

RUST_LOG=recursive_scraper=info recursive_scraper -f "https://example.com/.*" https://example.com/

In fish shell, you would run this instead:

env RUST_LOG=recursive_scraper=info recursive_scraper -f "https://example.com/.*" https://example.com/

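By default, the log level is error. Other options include warn, info, and debug.

For more instructions, see the Enabling logging section of the env_logger documentation.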

Dependencies

~11–25MB
~380K SLoC