#pdf #http #scraping #scrape

urls2disk

An HTTP client that downloads URLs to disk concurrently, optionally converting them to PDF first

2 versions

Uses old Rust 2015

0.1.1 Apr 12, 2018
0.1.0 Apr 12, 2018

#458 in HTTP clients

MIT license

43KB
658 lines of code

urls2disk


crates.io | docs.rs | github.com

urls2disk is a Rust crate that helps you download a series of webpages in parallel and save them to disk. Depending on your choice, it will either write the raw bytes of each webpage directly to disk or first convert them to PDF before writing them to disk. It's useful for general web scraping as well as for converting large batches of webpages to PDF.

A key feature of urls2disk is that you can set a maximum number of requests per second while downloading; this lets you effectively throttle yourself so you don't hammer a server by sending too many requests at once.

Under the hood, urls2disk uses wkhtmltopdf if you choose to convert webpages to PDF, so to use that feature you need wkhtmltopdf installed on your machine. Installing wkhtmltopdf on macOS with Homebrew is very easy: just type brew install Caskroom/cask/wkhtmltopdf into your terminal. On other systems, or if you don't have Homebrew, you'll need to install wkhtmltopdf on your own, though perhaps at some point I'll look up instructions for installing it in different setups and include them here. As for versions, I've only tested against wkhtmltopdf 0.12.4.

Here's an example that uses urls2disk to download Apple, Inc.'s 2010-2017 annual reports from the SEC's website:

extern crate reqwest;
extern crate urls2disk;

use std::fs;
use std::path::Path;

use urls2disk::{wkhtmltopdf, ClientBuilder, Result, SimpleDocument, Url};

// This function will download Apple, Inc.'s annual reports for the years 2010 to 2017
// from the SEC's website to your disk.  It will download two copies of each annual
// report: one of just the raw html and another that has been converted to PDF.
fn run() -> Result<()> {
    // Create an output directory.
    let output_directory = Path::new("./data");
    if !output_directory.exists() {
        fs::create_dir_all(output_directory)?;
    }

    // Create a vector of urls we would like to download.
    // These urls represent the annual reports for Apple, Inc. from 2010 to 2017.
    let base = "https://www.sec.gov/Archives/edgar/data/";
    let urls = vec![
        "320193/000119312510238044/d10k.htm",
        "320193/000119312511282113/d220209d10k.htm",
        "320193/000119312512444068/d411355d10k.htm",
        "320193/000119312513416534/d590790d10k.htm",
        "320193/000119312514383437/d783162d10k.htm",
        "320193/000119312515356351/d17062d10k.htm",
        "320193/000162828016020309/a201610-k9242016.htm",
        "320193/000032019317000070/a10-k20179302017.htm",
    ].iter()
        .map(|stem| format!("{}{}", &base, stem))
        .collect::<Vec<String>>();

    // Turn the vector of urls into a vector of boxed Document trait objects (here we'll
    // be using the SimpleDocument struct as one possible implementer of the Document trait).
    // For this batch, we set the wkhtmltopdf option to false; so when we feed this list
    // to the Client it will just download the raw webpages in html format instead of
    // first converting them to PDF.
    let html_documents = urls.iter()
        .enumerate()
        .map(|(i, url_string)| {
            let filename = format!("Apple 10-K {}.html", i + 2010);
            let path = output_directory.join(&filename);
            let url = url_string.parse::<Url>()?;
            let wkhtmltopdf = false;
            let document = SimpleDocument::new(path, url, wkhtmltopdf);
            Ok(Box::new(document))
        })
        .collect::<Result<Vec<Box<SimpleDocument>>>>()?;

    // Turn the vector of urls into another vector of boxed Document trait objects
    // (to show off additional functionality).  This time we'll set the wkhtmltopdf
    // option to true; so when we feed this list to the Client it will first convert
// the webpages to PDF before writing them to disk.
    let pdf_documents = urls.iter()
        .enumerate()
        .map(|(i, url_string)| {
            let filename = format!("Apple 10-K {}.pdf", i + 2010);
            let path = output_directory.join(&filename);
            let url = url_string.parse::<Url>()?;
            let wkhtmltopdf = true;
            let document = SimpleDocument::new(path, url, wkhtmltopdf);
            Ok(Box::new(document))
        })
        .collect::<Result<Vec<Box<SimpleDocument>>>>()?;

    // Combine our two vectors into one vector of Box<SimpleDocument>.
    let mut documents = [&html_documents[..], &pdf_documents[..]].concat();

    // Create the client.
    // Here, we're showing several customization options, but if you want to use
    // just the default settings, you could simply build the client with
    // `let client = ClientBuilder::default().build()?;`
    let client = ClientBuilder::default()
        .set_max_requests_per_second(9)
        .set_max_threads_cpu(4)
        .set_max_threads_io(50)
        .set_reqwest_client(reqwest::Client::new())
        .set_wkhtmltopdf_setting(wkhtmltopdf::Setting::Zoom(3.5))
        .set_wkhtmltopdf_settings(vec![
            wkhtmltopdf::Setting::DisableExternalLinks(true),
            wkhtmltopdf::Setting::DisableJavascript(true),
        ])
        .build()?;

    // Let the client go. It will download and write to disk all the
    // documents while simultaneously respecting the 'requests per second' and
    // other limits we provided. If you already have the documents on disk,
    // the client will not redownload them.
    client.get_documents(&mut documents)?;

    // Note: Here, if you want to, you can now access the raw bytes of all the urls
    // you downloaded, since they are now stored on each SimpleDocument in addition
    // to being saved on your disk.
    Ok(())
}

fn main() {
    run().unwrap();
}

Dependencies

~14–23MB
~414K SLoC