678 stable releases
New 2.0.9 | Aug 21, 2024
1.99.37 | Aug 14, 2024
1.99.13 | Jul 31, 2024
1.89.5 | Mar 30, 2024
1.1.1 | Feb 12, 2018
#42 in Web programming
6,060 downloads per month
Used in 9 crates (8 directly)
570KB
11K SLoC
Spider
An asynchronous crawler/indexer that communicates using isolates and IPC channels, with the ability to run decentralized.
Dependencies
On Linux
- OpenSSL 1.0.1, 1.0.2, 1.1.0, or 1.1.1
Example
This is a basic async example crawling a web page. Add spider to your Cargo.toml:
[dependencies]
spider = "2.0.9"
And then the code:
extern crate spider;
use spider::website::Website;
use spider::tokio;
#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com";
    let mut website = Website::new(&url);
    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
You can use the `Configuration` object to configure your crawler:
// ..
let mut website = Website::new("https://choosealicense.com");
website.configuration.respect_robots_txt = true;
website.configuration.subdomains = true;
website.configuration.tld = false;
website.configuration.delay = 0; // Defaults to 0 ms due to concurrency handling
website.configuration.request_timeout = None; // Defaults to 15000 ms
website.configuration.http2_prior_knowledge = false; // Enable if you know the webserver supports http2
website.configuration.user_agent = Some("myapp/version".into()); // Defaults to using a random agent
website.on_link_find_callback = Some(|s, html| { println!("link target: {}", s); (s, html)}); // Callback to run on each link find - useful for mutating the url, ex: convert the top level domain from `.fr` to `.es`.
website.configuration.blacklist_url.get_or_insert(Default::default()).push("https://choosealicense.com/licenses/".into());
website.configuration.proxies.get_or_insert(Default::default()).push("socks5://10.1.1.1:12345".into()); // Defaults to None - proxy list.
website.configuration.budget = Some(spider::hashbrown::HashMap::from([(spider::CaseInsensitiveString::new("*"), 300), (spider::CaseInsensitiveString::new("/licenses"), 10)])); // Defaults to None.
website.configuration.cron_str = "1/5 * * * * *".into(); // Defaults to empty string - Requires the `cron` feature flag
website.configuration.cron_type = spider::website::CronType::Crawl; // Defaults to CronType::Crawl - Requires the `cron` feature flag
website.configuration.limit = 300; // The limit of pages crawled. By default there is no limit.
website.configuration.cache = false; // HTTP caching. Requires the `cache` or `chrome` feature flag.
website.crawl().await;
A builder pattern is also available as of v1.33.0:
let mut website = Website::new("https://choosealicense.com");

website
    .with_respect_robots_txt(true)
    .with_subdomains(true)
    .with_tld(false)
    .with_delay(0)
    .with_request_timeout(None)
    .with_http2_prior_knowledge(false)
    .with_user_agent(Some("myapp/version".into()))
    .with_budget(Some(spider::hashbrown::HashMap::from([("*", 300), ("/licenses", 10)])))
    .with_limit(300)
    .with_caching(false)
    .with_external_domains(Some(Vec::from(["https://creativecommons.org/licenses/by/3.0/"].map(|d| d.to_string())).into_iter()))
    .with_headers(None)
    .with_blacklist_url(Some(Vec::from(["https://choosealicense.com/licenses/".into()])))
    .with_cron("1/5 * * * * *", Default::default())
    .with_proxies(None);
Features
We provide the following optional feature flags.
[dependencies]
spider = { version = "2.0.9", features = ["regex", "ua_generator"] }
- `ua_generator`: Enables auto-generating a random real User-Agent.
- `regex`: Enables blacklisting paths with regex.
- `jemalloc`: Enables the jemalloc memory backend.
- `decentralized`: Enables decentralized processing of IO; requires starting spider_worker before crawls.
- `sync`: Subscribe to changes for async page data processing. [Enabled by default]
- `control`: Enables the ability to pause, start, and shut down crawls on demand.
- `full_resources`: Enables gathering all content related to the domain, such as CSS, JS, and so on.
- `serde`: Enables serde serialization support.
- `socks`: Enables socks5 proxy support.
- `glob`: Enables URL glob support.
- `fs`: Enables storing resources to disk for parsing (may greatly increase performance at the cost of temporary storage).
- `sitemap`: Includes sitemap pages in results.
- `time`: Enables duration tracking per page.
- `cache`: Enables HTTP caching of requests to disk.
- `cache_mem`: Enables HTTP caching of requests persisted in memory.
- `cache_chrome_hybrid`: Enables hybrid Chrome request caching between HTTP.
- `cache_openai`: Enables caching the OpenAI requests. This can drastically reduce costs when developing AI workflows.
- `chrome`: Enables Chrome headless rendering; use the env var `CHROME_URL` to connect remotely.
- `chrome_store_page`: Stores the page object to perform other actions. The page may be closed.
- `chrome_screenshot`: Enables storing a screenshot of each page on crawl. Screenshots are stored in the ./storage/ directory by default. Use the env var `SCREENSHOT_DIRECTORY` to adjust the directory. To save the background, set the env var `SCREENSHOT_OMIT_BACKGROUND` to false.
- `chrome_headed`: Enables Chrome headful rendering.
- `chrome_headless_new`: Uses headless=new to launch the browser.
- `chrome_cpu`: Disables GPU usage for the Chrome browser.
- `chrome_stealth`: Enables stealth mode to make it harder to be detected as a bot.
- `chrome_intercept`: Allows intercepting network requests to speed up processing.
- `adblock`: Enables the ability to block ads when using Chrome and chrome_intercept.
- `cookies`: Enables storing and setting cookies for requests.
- `real_browser`: Enables the ability to bypass protected pages.
- `cron`: Enables the ability to start cron jobs for the website.
- `spoof`: Spoofs the HTTP headers for the request.
- `openai`: Enables OpenAI to generate dynamic scripts to drive the browser. Make sure to set the env var `OPENAI_API_KEY`.
- `smart`: Enables smart mode. This runs requests as HTTP until JavaScript rendering is needed, avoiding multiple network requests by reusing the content.
- `encoding`: Enables handling content with different encodings, such as Shift_JIS.
- `headers`: Enables extracting header information on each retrieved page. Adds a `headers` field to the page struct.
- `decentralized_headers`: Enables extracting the header information suppressed by the decentralized processing of IO. This is needed if `headers` is set in both spider and spider_worker.
Decentralization
Move processing to a worker. This can drastically improve performance, even if the worker runs on the same machine, since the efficient runtime splits the IO work.
[dependencies]
spider = { version = "2.0.9", features = ["decentralized"] }
# install the worker
cargo install spider_worker
# start the worker [set the worker on another machine in prod]
RUST_LOG=info SPIDER_WORKER_PORT=3030 spider_worker
# start rust project as normal with SPIDER_WORKER env variable
SPIDER_WORKER=http://127.0.0.1:3030 cargo run --example example --features decentralized
The `SPIDER_WORKER` env variable takes a comma-separated list of URLs to set the workers. If the `scrape` feature flag is enabled, use the `SPIDER_WORKER_SCRAPER` env variable to determine the scraper worker.
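As a rough sketch of that format, the snippet below sets `SPIDER_WORKER` to two hypothetical worker addresses before crawling. It assumes the variable is read when the first crawl starts; exporting it in the shell, as in the commands above, is the usual approach.

extern crate spider;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Hypothetical worker addresses; replace them with your own spider_worker instances.
    // Requires the `decentralized` feature flag for the workers to be used.
    std::env::set_var("SPIDER_WORKER", "http://127.0.0.1:3030,http://127.0.0.1:3031");

    let mut website = Website::new("https://choosealicense.com");
    website.crawl().await;
    println!("Links found {:?}", website.get_links().len());
}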
Decentralized handling of headers
Without decentralization, the header values of a page are untouched. When working with decentralized workers, each worker stores the headers used for the original request retrieval and labels them with a prefixed element name ("zz-spider-r--").
The `decentralized_headers` feature provides some useful tools to clean and extract the raw header entries under `spider::features::decentralized_headers`; see the `WORKER_SUPPRESSED_HEADER_PREFIX` constant.
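As a rough illustration of what that prefix labelling implies, the sketch below strips the worker prefix from a plain header map. The function name and map type are assumptions for the example only; prefer the helpers shipped under `spider::features::decentralized_headers`.

use std::collections::HashMap;

// The worker prefix described above.
const WORKER_PREFIX: &str = "zz-spider-r--";

/// Hypothetical helper: strip the worker prefix from header names, leaving other entries untouched.
fn extract_raw_headers(headers: HashMap<String, String>) -> HashMap<String, String> {
    headers
        .into_iter()
        .map(|(name, value)| match name.strip_prefix(WORKER_PREFIX) {
            Some(original) => (original.to_string(), value),
            None => (name, value),
        })
        .collect()
}

fn main() {
    let mut headers = HashMap::new();
    headers.insert("zz-spider-r--content-type".to_string(), "text/html".to_string());
    headers.insert("x-request-id".to_string(), "abc123".to_string());

    for (name, value) in extract_raw_headers(headers) {
        println!("{}: {}", name, value);
    }
}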
Subscribe to changes
Use the subscribe method to get a broadcast channel.
[dependencies]
spider = { version = "2.0.9", features = ["sync"] }
extern crate spider;
use spider::website::Website;
use spider::tokio;
#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    website.crawl().await;
    website.unsubscribe();
}
Regex blacklisting
Allow regex for blacklisting routes:
[dependencies]
spider = { version = "2.0.9", features = ["regex"] }
extern crate spider;
use spider::website::Website;
use spider::tokio;
#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");
    website.configuration.blacklist_url.push("/licenses/".into());
    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
Pause, resume, and shutdown
If you are performing large workloads, you may need to control the crawler by enabling the `control` feature flag:
[dependencies]
spider = { version = "2.0.9", features = ["control"] }
extern crate spider;
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
    use spider::utils::{pause, resume, shutdown};
    use spider::tokio::time::sleep;

    let url = "https://choosealicense.com/";
    let mut website: Website = Website::new(&url);

    tokio::spawn(async move {
        pause(url).await;
        sleep(tokio::time::Duration::from_millis(5000)).await;
        resume(url).await;
        // perform shutdown if crawl takes longer than 15s
        sleep(tokio::time::Duration::from_millis(15000)).await;
        // you could also abort the task to shutdown crawls if using website.crawl in another thread.
        shutdown(url).await;
    });

    website.crawl().await;
}
Scrape/Gather HTML
extern crate spider;
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
    use std::io::{Write, stdout};

    let url = "https://choosealicense.com/";
    let mut website = Website::new(&url);
    website.scrape().await;

    let mut lock = stdout().lock();
    let separator = "-".repeat(url.len());

    for page in website.get_pages().unwrap().iter() {
        writeln!(
            lock,
            "{}\n{}\n\n{}\n\n{}",
            separator,
            page.get_url_final(),
            page.get_html(),
            separator
        )
        .unwrap();
    }
}
Cron jobs
Use cron jobs to run crawls continuously at any time.
[dependencies]
spider = { version = "2.0.9", features = ["sync", "cron"] }
extern crate spider;
use spider::website::{Website, run_cron};
use spider::tokio;
#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");
    // set the cron to run or use the builder pattern `website.with_cron`.
    website.cron_str = "1/5 * * * * *".into();

    let mut rx2 = website.subscribe(16).unwrap();

    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    // take ownership of the website. You can also use website.run_cron, except you need to perform abort manually on handles created.
    let mut runner = run_cron(website).await;

    println!("Starting the Runner for 10 seconds");
    tokio::time::sleep(tokio::time::Duration::from_secs(10)).await;
    let _ = tokio::join!(runner.stop(), join_handle);
}
Chrome
Connecting to Chrome can be done using the env var `CHROME_URL`; if no connection is found, a new browser is launched on the system. You do not need Chrome installed if you are connecting remotely. If you are not scraping content for downloads, use the `chrome_intercept` feature flag to possibly speed up requests via network interception.
[dependencies]
spider = { version = "2.0.9", features = ["chrome", "chrome_intercept"] }
You can use `website.crawl_concurrent_raw` to perform a crawl without chromium when needed. Use the `chrome_headed` feature flag if you need to enable a headful browser for debugging.
extern crate spider;
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud")
        .with_chrome_intercept(cfg!(feature = "chrome_intercept"), true)
        .build()
        .unwrap();
    website.crawl().await;
    println!("Links found {:?}", website.get_links().len());
}
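To reuse a remote browser instead of launching one locally, point `CHROME_URL` at your Chrome instance. The sketch below sets the variable in code purely for illustration; the endpoint is a placeholder, the variable is assumed to be read when the crawl launches the browser, and exporting it in the shell before running the binary is the usual approach.

extern crate spider;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Placeholder endpoint: replace with your own remote Chrome debugging URL.
    std::env::set_var("CHROME_URL", "http://localhost:9222");

    let mut website: Website = Website::new("https://spider.cloud");
    website.crawl().await;
    println!("Links found {:?}", website.get_links().len());
}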
Cache
Enabling HTTP caching can be done with the `cache` or `cache_mem` feature flag.
[dependencies]
spider = { version = "2.0.9", features = ["cache"] }
You also need to set `website.cache` to true to enable it.
extern crate spider;
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud")
        .with_caching(true)
        .build()
        .unwrap();
    website.crawl().await;
    println!("Links found {:?}", website.get_links().len());
    // the next run of website.crawl().await will be faster since content is stored on disk.
}
Smart mode
Intelligently run crawls using HTTP and JavaScript rendering when needed. This gets the best of both worlds to maintain speed and extract every page. It requires a Chrome connection or a browser installed on the system.
[dependencies]
spider = { version = "2.0.9", features = ["smart"] }
extern crate spider;
use spider::website::Website;
use spider::tokio;
#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");
    website.crawl_smart().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
OpenAI
Use OpenAI to generate dynamic scripts to drive the browser with the `openai` feature flag.
[dependencies]
spider = { version = "2.0.9", features = ["openai"] }
extern crate spider;
use spider::{tokio, website::Website, configuration::GPTConfigs};
#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://google.com")
        .with_openai(Some(GPTConfigs::new("gpt-4-1106-preview", "Search for Movies", 256)))
        .with_limit(1)
        .build()
        .unwrap();

    website.crawl().await;
}
Depth
Set a depth limit to bound how far links are followed.
[dependencies]
spider = { version = "2.0.9" }
extern crate spider;
use spider::{tokio, website::Website};
#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com").with_depth(3).build().unwrap();
    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
Reusable configuration
The same configuration can be reused for a list of crawls.
extern crate spider;
use spider::configuration::Configuration;
use spider::{tokio, website::Website};
use std::io::Error;
use std::time::Instant;
const CAPACITY: usize = 5;
const CRAWL_LIST: [&str; CAPACITY] = [
    "https://spider.cloud",
    "https://choosealicense.com",
    "https://jeffmendez.com",
    "https://spider-rs.github.io/spider-nodejs/",
    "https://spider-rs.github.io/spider-py/",
];

#[tokio::main]
async fn main() -> Result<(), Error> {
    let config = Configuration::new()
        .with_user_agent(Some("SpiderBot"))
        .with_blacklist_url(Some(Vec::from(["https://spider.cloud/resume".into()])))
        .with_subdomains(false)
        .with_tld(false)
        .with_redirect_limit(3)
        .with_respect_robots_txt(true)
        .with_external_domains(Some(
            Vec::from(["http://loto.rsseau.fr/"].map(|d| d.to_string())).into_iter(),
        ))
        .build();

    let mut handles = Vec::with_capacity(CAPACITY);

    for website_url in CRAWL_LIST {
        match Website::new(website_url)
            .with_config(config.to_owned())
            .build()
        {
            Ok(mut website) => {
                let handle = tokio::spawn(async move {
                    println!("Starting Crawl - {:?}", website.get_url().inner());

                    let start = Instant::now();
                    website.crawl().await;
                    let duration = start.elapsed();

                    let links = website.get_links();

                    for link in links {
                        println!("- {:?}", link.as_ref());
                    }

                    println!(
                        "{:?} - Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
                        website.get_url().inner(),
                        duration,
                        links.len()
                    );
                });

                handles.push(handle);
            }
            Err(e) => println!("{:?}", e),
        }
    }

    for handle in handles {
        let _ = handle.await;
    }

    Ok(())
}
Blocking
If you need a blocking sync implementation, use a version prior to v1.12.0.
Dependencies
~14–39MB
~700K SLoC