Paperoni: Rust application // Lib.rs

10 releases (5 breaking)

0.6.1-alpha1	Aug 24, 2021
0.6.0-alpha1	Jul 24, 2021
0.5.0-alpha1	Jun 24, 2021
0.3.0-alpha1	Feb 24, 2021
0.2.0-alpha1	Nov 24, 2020

#10 in #article

30 downloads per month

MIT license

270KB
6K SLoC

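Salami not included

Paperoni is a CLI tool written in Rust for downloading web articles as EPUB or HTML files. There is also provisional support for exporting articles to PDF.

The project is in an alpha release, so it might crash when you use it. If it does, please create an issue on GitHub.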

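Installation

Precompiled binaries

Check the releases page for precompiled binaries. Currently there are only builds for Debian and Arch.

Installing from crates.io

Paperoni is published on crates.io. If you have cargo installed, run: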

cargo install paperoni --version 0.6.1-alpha1

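Paperoni is still in alpha, so the version flag has to be passed: cargo install skips pre-release versions unless one is requested explicitly.

Building from source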

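This project uses async/.await, so it should be compiled with Rust 1.39 or later, the release in which async/.await was stabilized. Preferably use the latest version of Rust.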

git clone https://github.com/hipstermojo/paperoni.git
cd paperoni
## You can build and install paperoni locally
cargo install --path .
## or use it from within the project
cargo run -- # pass your url here

Usage

USAGE:
    paperoni [OPTIONS] [urls]...

OPTIONS:
        --export <type>
            Specify the file type of the export. The type must be in lower case. [default: epub] [possible values: html, epub]
    -f, --file <file>
            Input file containing links
    -h, --help
            Prints help information
        --inline-images
            Inlines the article images when exporting to HTML using base64.
            This is used when you do not want a separate folder created for images during HTML export.
            NOTE: It uses base64 encoding on the images which results in larger HTML export sizes as each image
            increases in size by about 25%-33%.
        --inline-toc
            Add an inlined Table of Contents page at the start of the merged article. This does not affect the Table of Contents navigation
        --log-to-file
            Enables logging of events to a file located in .paperoni/logs with a default log level of debug. Use -v to specify the logging level
        --max-conn <max-conn>
            The maximum number of concurrent HTTP connections when downloading articles. Default is 8.
            NOTE: It is advised to use as few connections as needed i.e between 1 and 50. Using more connections can end up overloading your network card with too many concurrent requests.
        --no-css
            Removes the stylesheets used in the EPUB generation.
            The EPUB file will then be laid out based on your e-reader's default stylesheets.
            Images and code blocks may overflow when this flag is set and layout of generated
            PDFs will be affected. Use --no-header-css if you want to only disable the styling on headers.
        --no-header-css
            Removes the header CSS styling but preserves styling of images and codeblocks. To remove all the default CSS, use --no-css instead.
        --merge <output-name>
            Merge multiple articles into a single epub that will be given the name provided
    -o, --output-dir <output_directory>
            Directory to store output epub documents
    -V, --version
            Prints version information
    -v
            This takes upto 4 levels of verbosity in the following order.
             - Error (-v)
             - Warn (-vv)
             - Info (-vvv)
             - Debug (-vvvv)
            When this flag is passed, it disables the progress bars and logs to stderr.
            If you would like to send the logs to a file (and enable progress bars), pass the log-to-file flag.

ARGS:
    <urls>...    Urls of web articles
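
The --max-conn option caps how many downloads run at the same time. The Python sketch below illustrates that general idea with a semaphore. It is not Paperoni's implementation (Paperoni is written in Rust); the function name download_all, the example.com URLs and the sleep standing in for an HTTP request are all assumptions made for illustration.

```python
import asyncio


async def download_all(urls, max_conn=8):
    """Fetch every URL with at most max_conn requests in flight (hypothetical helper)."""
    limiter = asyncio.Semaphore(max_conn)
    in_flight = 0
    peak = 0

    async def download(url):
        nonlocal in_flight, peak
        async with limiter:  # waits here while max_conn downloads are active
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the real HTTP request
            in_flight -= 1
            return f"fetched {url}"

    # gather() returns results in the same order as the input URLs
    results = await asyncio.gather(*(download(u) for u in urls))
    return results, peak


# Placeholder URLs, used only to exercise the limiter
urls = [f"https://example.com/article/{i}" for i in range(20)]
results, peak = asyncio.run(download_all(urls, max_conn=8))
print(f"{len(results)} downloads, at most {peak} at once")
```

Raising max_conn lets more requests overlap, which is why the help text advises keeping the value small.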

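To download a single article, pass in its URL:

paperoni https://en.wikipedia.org/wiki/Pepperoni

Paperoni also supports passing multiple links as arguments.

paperoni https://en.wikipedia.org/wiki/Pepperoni https://en.wikipedia.org/wiki/Salami

Alternatively, if you are on a Unix-like OS, you can do something like this:

cat links.txt | xargs paperoni

The links can also be read from a file using the -f/--file flag.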

paperoni -f links.txt
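
When Paperoni is driven from a script, the same invocation can be assembled programmatically. The following is a minimal Python sketch, not part of Paperoni. It uses only flags that appear in the help text, and the helper names parse_links and build_command are hypothetical. It assumes one URL per line; skipping blank lines is a choice of the sketch, not documented Paperoni behaviour.

```python
import shlex


def parse_links(text):
    """Return the non-empty, stripped lines of a links file (assumes one URL per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_command(urls, export="epub", merge=None):
    """Build a paperoni argument list using flags listed in --help."""
    cmd = ["paperoni", *urls, "--export", export]
    if merge is not None:
        cmd += ["--merge", merge]  # merges all articles into one epub
    return cmd


links_txt = """
https://en.wikipedia.org/wiki/Pepperoni

https://en.wikipedia.org/wiki/Salami
"""
cmd = build_command(parse_links(links_txt), export="html")
print(shlex.join(cmd))  # the command line as it would be typed in a shell
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a machine where paperoni is installed.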

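Exporting articles

By default, Paperoni exports articles as EPUB files, but you can change this to HTML by passing the --export html flag.

paperoni https://en.wikipedia.org/wiki/Pepperoni --export html

HTML exports let you read the articles as plain HTML documents in your browser, and they can also be converted to PDF as explained under PDF exports.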

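When exporting to HTML, Paperoni downloads the article's images to a folder named after the article. The folder structure for the command above would therefore look like this:

.
├── Pepperoni - Wikipedia
│   ├── 1a9f886e9b58db72e0003a2cd52681d8.png
│   ├── 216f8a4265a1ceb3f8cfba4c2f9057b1.jpeg
│   ...
└── Pepperoni - Wikipedia.html

If you would instead prefer to have the images inlined directly into the HTML export, pass the inline-images flag, i.e.: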

paperoni https://en.wikipedia.org/wiki/Pepperoni --export html --inline-images

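This is especially useful when exporting multiple links.

NOTE: Inlining images in HTML exports uses base64 encoding, which increases the size of each image by about 25% to 33%.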
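
The size increase follows from how base64 works: every 3 bytes of input become 4 ASCII characters, which is where the upper figure of roughly 33% comes from. The Python sketch below measures this on dummy data. It also shows a data URI, which is assumed here to be the form an inlined image takes in the HTML; the helper name to_data_uri is hypothetical and the code is not taken from Paperoni.

```python
import base64


def to_data_uri(image_bytes, mime="image/png"):
    """Encode raw image bytes as a data URI (assumed form of an inlined image)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


raw = bytes(3000)  # 3000 zero bytes standing in for an image file
encoded_len = len(base64.b64encode(raw))
overhead = encoded_len / len(raw) - 1
print(f"{len(raw)} bytes -> {encoded_len} base64 characters (+{overhead:.0%})")
```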

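Disabling CSS

The no-css and no-header-css flags can be used to remove the default styling added by Paperoni. Refer to --help to see how to use them.

Merging articles

By default, Paperoni generates an epub file for each link. You can also merge multiple links into a single epub by passing the merge flag and specifying the output file.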

paperoni -f links.txt --merge out.epub

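Logging events

Logging is disabled by default. It can be activated with either the -v flag or the --log-to-file flag. If the --log-to-file flag is passed, the logs are sent to a file in the default Paperoni directory, .paperoni/logs, which is located in your home directory. The -v flag configures the verbosity level as follows:

-v Logs only the error level
-vv Logs only the warn level
-vvv Logs only the info level
-vvvv Logs only the debug level

If only the -v flag is passed, the progress bars are disabled. If both the -v and --log-to-file flags are passed, the progress bars are still shown.

How it works

The URL passed to Paperoni is fetched and the returned HTML response is passed to the extractor. The extractor retrieves a possible article using a custom port of the Mozilla Readability algorithm. The article is then saved as an EPUB.

The port of the algorithm is still unstable, so it is not fully compatible with all the websites that can be extracted using Readability.

How it (currently) doesn't work

This program is still in alpha, so a number of things won't work:

- Websites that only run with JavaScript cannot be extracted.
- Website articles that cannot be extracted by Readability cannot be extracted by Paperoni either.
- Code snippets on Medium articles that are lazy loaded will not appear in the EPUB.

There are also web pages it won't work on in general, such as Twitter and Reddit threads.

PDF exports

PDF conversion can be done using a third-party tool. There are two ways to do so:

EPUB to PDF

This requires installing Calibre, which comes with ebook conversion. You can convert the epub to a pdf from the terminal with ebook-convert.

# Assuming the downloaded epub was called foo.epub
ebook-convert foo.epub foo.pdf

Alternatively, you can use the Calibre GUI to do the file conversion.

HTML to PDF

The recommended way is to use Weasyprint, a free and open-source tool that converts HTML documents to PDF. It is available on Linux, macOS and Windows. Using the CLI, it can be done as follows:

paperoni https://en.wikipedia.org/wiki/Pepperoni --export html
weasyprint "Pepperoni - Wikipedia.html" Pepperoni.pdf

Inlining images is not mandatory, as Weasyprint will be able to find the files on its own.

Comparison of PDF conversion methods

Either conversion method is sufficient for most use cases. The main differences are listed below.

	EPUB to PDF	HTML to PDF
Wrapping code blocks	Yes	No
CSS customization	No	Yes
Generated file size	Slightly larger	Slightly smaller

The difference in file size is due to additional fonts that ebook-convert adds to the PDF file.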
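
The two conversion routes can be wrapped in a small helper that picks the tool from the file extension. The Python sketch below only builds the command line and does not run anything. The function name pdf_command is hypothetical, and naming the output after the input file is an assumption of the sketch; the ebook-convert and weasyprint invocations follow the input-then-output form used in this README.

```python
from pathlib import Path


def pdf_command(source):
    """Return the argument list that converts an EPUB or HTML export to PDF."""
    src = Path(source)
    out = str(src.with_suffix(".pdf"))  # same name, .pdf extension (sketch's choice)
    if src.suffix == ".epub":
        return ["ebook-convert", str(src), out]  # ships with Calibre
    if src.suffix == ".html":
        return ["weasyprint", str(src), out]
    raise ValueError(f"unsupported export type: {src.suffix!r}")


print(pdf_command("foo.epub"))
print(pdf_command("Pepperoni - Wikipedia.html"))
```

Passing the returned list to subprocess.run(cmd, check=True) performs the conversion, provided the corresponding tool is installed.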

Dependencies

~23–37MB
~617K SLoC

async-std, base64 0.13, chrono, clap 2.33+yaml, colored, comfy-table 3.0, derive_builder 0.10.2, directories 3.0, epub-builder 0.4.8, flexi_logger 0.18, futures, html5ever 0.25.1, indicatif 0.16.2, itertools 0.10.1, kuchiki 0.8.1, lazy_static, log, md5, regex, surf 2.2, thiserror, url