678 stable releases
New 2.0.9 | Aug 21, 2024
1.99.37 | Aug 14, 2024
1.99.13 | Jul 31, 2024
1.89.5 | Mar 30, 2024
1.1.1 | Feb 12, 2018
#42 in Web programming
6,060 downloads per month
Used in 9 crates (8 directly)
570KB
11K SLoC
Spider
An asynchronous crawler/indexer that communicates using isolates and IPC channels, with the ability to run decentralized.
Dependencies
On Linux
- OpenSSL 1.0.1, 1.0.2, 1.1.0, or 1.1.1
Example
This is a basic async example crawling a web page. Add spider to your Cargo.toml:
[dependencies]
spider = "2.0.9"
And then the code:
extern crate spider;
use spider::website::Website;
use spider::tokio;
#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com";
    let mut website = Website::new(&url);
    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
You can use the `Configuration` object to configure your crawler:
// ..
let mut website = Website::new("https://choosealicense.com");
website.configuration.respect_robots_txt = true;
website.configuration.subdomains = true;
website.configuration.tld = false;
website.configuration.delay = 0; // Defaults to 0 ms due to concurrency handling
website.configuration.request_timeout = None; // Defaults to 15000 ms
website.configuration.http2_prior_knowledge = false; // Enable if you know the webserver supports http2
website.configuration.user_agent = Some("myapp/version".into()); // Defaults to using a random agent
website.on_link_find_callback = Some(|s, html| { println!("link target: {}", s); (s, html)}); // Callback to run on each link find - useful for mutating the url, ex: convert the top level domain from `.fr` to `.es`.
website.configuration.blacklist_url.get_or_insert(Default::default()).push("https://choosealicense.com/licenses/".into());
website.configuration.proxies.get_or_insert(Default::default()).push("socks5://10.1.1.1:12345".into()); // Defaults to None - proxy list.
website.configuration.budget = Some(spider::hashbrown::HashMap::from([(spider::CaseInsensitiveString::new("*"), 300), (spider::CaseInsensitiveString::new("/licenses"), 10)])); // Defaults to None.
website.configuration.cron_str = "1/5 * * * * *".into(); // Defaults to empty string - Requires the `cron` feature flag
website.configuration.cron_type = spider::website::CronType::Crawl; // Defaults to CronType::Crawl - Requires the `cron` feature flag
website.configuration.limit = 300; // The limit of pages crawled. By default there is no limit.
website.configuration.cache = false; // HTTP caching. Requires the `cache` or `chrome` feature flag.
website.crawl().await;
A builder pattern is also available as of v1.33.0:
let mut website = Website::new("https://choosealicense.com");

website
    .with_respect_robots_txt(true)
    .with_subdomains(true)
    .with_tld(false)
    .with_delay(0)
    .with_request_timeout(None)
    .with_http2_prior_knowledge(false)
    .with_user_agent(Some("myapp/version".into()))
    .with_budget(Some(spider::hashbrown::HashMap::from([("*", 300), ("/licenses", 10)])))
    .with_limit(300)
    .with_caching(false)
    .with_external_domains(Some(Vec::from(["https://creativecommons.org/licenses/by/3.0/"].map(|d| d.to_string())).into_iter()))
    .with_headers(None)
    .with_blacklist_url(Some(Vec::from(["https://choosealicense.com/licenses/".into()])))
    .with_cron("1/5 * * * * *", Default::default())
    .with_proxies(None);
Features
We provide the following optional feature flags.
[dependencies]
spider = { version = "2.0.9", features = ["regex", "ua_generator"] }
- `ua_generator`: Enables auto-generating a random real User-Agent.
- `regex`: Enables blacklisting paths with regex.
- `jemalloc`: Enables the jemalloc memory backend.
- `decentralized`: Enables decentralized processing of IO; requires starting spider_worker before crawls.
- `sync`: Subscribe to changes for async page data processing. [Enabled by default]
- `control`: Enables the ability to pause, start, and shut down crawls on demand.
- `full_resources`: Enables gathering all content related to the domain, such as CSS, JS, and so on.
- `serde`: Enables serde serialization support.
- `socks`: Enables socks5 proxy support.
- `glob`: Enables URL glob support.
- `fs`: Enables storing resources to disk for parsing (may greatly increase performance at the cost of temporary storage).
- `sitemap`: Includes sitemap pages in results.
- `time`: Enables duration tracking per page.
- `cache`: Enables HTTP caching of requests to disk.
- `cache_mem`: Enables HTTP caching of requests persisted in memory.
- `cache_chrome_hybrid`: Enables hybrid Chrome request caching between HTTP.
- `cache_openai`: Enables caching the OpenAI requests. This can drastically reduce costs when developing AI workflows.
- `chrome`: Enables Chrome headless rendering; use the env var `CHROME_URL` to connect remotely.
- `chrome_store_page`: Stores the page object to perform other actions. The page may be closed.
- `chrome_screenshot`: Enables storing a screenshot of each page on crawl. Screenshots are stored in the ./storage/ directory by default. Use the env var `SCREENSHOT_DIRECTORY` to adjust the directory. To save the background, set the env var `SCREENSHOT_OMIT_BACKGROUND` to false.
- `chrome_headed`: Enables Chrome headful rendering.
- `chrome_headless_new`: Uses headless=new to launch the browser.
- `chrome_cpu`: Disables GPU usage for the Chrome browser.
- `chrome_stealth`: Enables stealth mode to make it harder to be detected as a bot.
- `chrome_intercept`: Allows intercepting network requests to speed up processing.
- `adblock`: Enables the ability to block ads when using Chrome and chrome_intercept.
- `cookies`: Enables storing and setting cookies for requests.
- `real_browser`: Enables the ability to bypass protected pages.
- `cron`: Enables the ability to start cron jobs for the website.
- `spoof`: Spoofs the HTTP headers for the request.
- `openai`: Enables OpenAI to generate dynamic scripts to drive the browser. Make sure to set the env var `OPENAI_API_KEY`.
- `smart`: Enables smart mode. This runs requests as HTTP until JavaScript rendering is needed, avoiding multiple network requests by reusing the content.
- `encoding`: Enables handling content with different encodings, such as Shift_JIS.
- `headers`: Enables extracting header information on each retrieved page. Adds a `headers` field to the page struct.
- `decentralized_headers`: Enables extracting the header information suppressed by the decentralized processing of IO. This is needed if `headers` is set in both spider and spider_worker.
Decentralization
Move processing to a worker. This can drastically improve performance, even if the worker runs on the same machine, since the efficient runtime splits the IO work.
[dependencies]
spider = { version = "2.0.9", features = ["decentralized"] }
# install the worker
cargo install spider_worker
# start the worker [set the worker on another machine in prod]
RUST_LOG=info SPIDER_WORKER_PORT=3030 spider_worker
# start rust project as normal with SPIDER_WORKER env variable
SPIDER_WORKER=http://127.0.0.1:3030 cargo run --example example --features decentralized
The `SPIDER_WORKER` env variable takes a comma-separated list of URLs to set the workers. If the `scrape` feature flag is enabled, use the `SPIDER_WORKER_SCRAPER` env variable to determine the scraper worker.
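As a rough sketch of that format, the snippet below sets `SPIDER_WORKER` to two hypothetical worker addresses before crawling. It assumes the variable is read when the first crawl starts; exporting it in the shell, as in the commands above, is the usual approach.

extern crate spider;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Hypothetical worker addresses; replace them with your own spider_worker instances.
    // Requires the `decentralized` feature flag for the workers to be used.
    std::env::set_var("SPIDER_WORKER", "http://127.0.0.1:3030,http://127.0.0.1:3031");

    let mut website = Website::new("https://choosealicense.com");
    website.crawl().await;
    println!("Links found {:?}", website.get_links().len());
}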
Decentralized handling of headers
Without decentralization, the header values of a page are untouched. When working with decentralized workers, each worker stores the headers used for the original request retrieval and labels them with a prefixed element name ("zz-spider-r--").
The `decentralized_headers` feature provides some useful tools to clean and extract the raw header entries under `spider::features::decentralized_headers`; see the `WORKER_SUPPRESSED_HEADER_PREFIX` constant.
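As a rough illustration of what that prefix labelling implies, the sketch below strips the worker prefix from a plain header map. The function name and map type are assumptions for the example only; prefer the helpers shipped under `spider::features::decentralized_headers`.

use std::collections::HashMap;

// The worker prefix described above.
const WORKER_PREFIX: &str = "zz-spider-r--";

/// Hypothetical helper: strip the worker prefix from header names, leaving other entries untouched.
fn extract_raw_headers(headers: HashMap<String, String>) -> HashMap<String, String> {
    headers
        .into_iter()
        .map(|(name, value)| match name.strip_prefix(WORKER_PREFIX) {
            Some(original) => (original.to_string(), value),
            None => (name, value),
        })
        .collect()
}

fn main() {
    let mut headers = HashMap::new();
    headers.insert("zz-spider-r--content-type".to_string(), "text/html".to_string());
    headers.insert("x-request-id".to_string(), "abc123".to_string());

    for (name, value) in extract_raw_headers(headers) {
        println!("{}: {}", name, value);
    }
}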
Subscribe to changes
Use the subscribe method to get a broadcast channel.
[dependencies]
spider = { version = "2.0.9", features = ["sync"] }
extern crate spider;
use spider::website::Website;
use spider::tokio;
#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    website.crawl().await;
    website.unsubscribe();
}
Regex blacklisting
Allow regex for blacklisting routes:
[dependencies]
spider = { version = "2.0.9", features = ["regex"] }
extern crate spider;
use spider::website::Website;
use spider::tokio;
#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");
    website.configuration.blacklist_url.push("/licenses/".into());
    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
Pause, resume, and shutdown
If you are performing large workloads, you may need to control the crawler by enabling the `control` feature flag:
[dependencies]
spider = { version = "2.0.9", features = ["control"] }
extern crate spider;
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
    use spider::utils::{pause, resume, shutdown};
    use spider::tokio::time::sleep;

    let url = "https://choosealicense.com/";
    let mut website: Website = Website::new(&url);

    tokio::spawn(async move {
        pause(url).await;
        sleep(tokio::time::Duration::from_millis(5000)).await;
        resume(url).await;
        // perform shutdown if crawl takes longer than 15s
        sleep(tokio::time::Duration::from_millis(15000)).await;
        // you could also abort the task to shutdown crawls if using website.crawl in another thread.
        shutdown(url).await;
    });

    website.crawl().await;
}
Scrape/Gather HTML
extern crate spider;
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
    use std::io::{Write, stdout};

    let url = "https://choosealicense.com/";
    let mut website = Website::new(&url);
    website.scrape().await;

    let mut lock = stdout().lock();
    let separator = "-".repeat(url.len());

    for page in website.get_pages().unwrap().iter() {
        writeln!(
            lock,
            "{}\n{}\n\n{}\n\n{}",
            separator,
            page.get_url_final(),
            page.get_html(),
            separator
        )
        .unwrap();
    }
}
Cron jobs
Use cron jobs to run crawls continuously at any time.
[dependencies]
spider = { version = "2.0.9", features = ["sync", "cron"] }
extern crate spider;
use spider::website::{Website, run_cron};
use spider::tokio;
#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");
    // set the cron to run or use the builder pattern `website.with_cron`.
    website.cron_str = "1/5 * * * * *".into();

    let mut rx2 = website.subscribe(16).unwrap();

    let join_handle = tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    // take ownership of the website. You can also use website.run_cron, except you need to perform abort manually on handles created.
    let mut runner = run_cron(website).await;

    println!("Starting the Runner for 10 seconds");
    tokio::time::sleep(tokio::time::Duration::from_secs(10)).await;
    let _ = tokio::join!(runner.stop(), join_handle);
}
Chrome
Connecting to Chrome can be done using the env var `CHROME_URL`; if no connection is found, a new browser is launched on the system. You do not need Chrome installed if you are connecting remotely. If you are not scraping content for downloads, use the `chrome_intercept` feature flag to possibly speed up requests via network interception.
[dependencies]
spider = { version = "2.0.9", features = ["chrome", "chrome_intercept"] }
You can use `website.crawl_concurrent_raw` to perform a crawl without chromium when needed. Use the `chrome_headed` feature flag if you need to enable a headful browser for debugging.
extern crate spider;
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud")
        .with_chrome_intercept(cfg!(feature = "chrome_intercept"), true)
        .build()
        .unwrap();
    website.crawl().await;
    println!("Links found {:?}", website.get_links().len());
}
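To reuse a remote browser instead of launching one locally, point `CHROME_URL` at your Chrome instance. The sketch below sets the variable in code purely for illustration; the endpoint is a placeholder, the variable is assumed to be read when the crawl launches the browser, and exporting it in the shell before running the binary is the usual approach.

extern crate spider;
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Placeholder endpoint: replace with your own remote Chrome debugging URL.
    std::env::set_var("CHROME_URL", "http://localhost:9222");

    let mut website: Website = Website::new("https://spider.cloud");
    website.crawl().await;
    println!("Links found {:?}", website.get_links().len());
}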
Cache
Enabling HTTP caching can be done with the `cache` or `cache_mem` feature flag.
[dependencies]
spider = { version = "2.0.9", features = ["cache"] }
You also need to set `website.cache` to true to enable it.
extern crate spider;
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud")
        .with_caching(true)
        .build()
        .unwrap();
    website.crawl().await;
    println!("Links found {:?}", website.get_links().len());
    // the next run of website.crawl().await will be faster since content is stored on disk.
}
Smart mode
Intelligently run crawls using HTTP and JavaScript rendering when needed. This gets the best of both worlds to maintain speed and extract every page. It requires a Chrome connection or a browser installed on the system.
[dependencies]
spider = { version = "2.0.9", features = ["smart"] }
extern crate spider;
use spider::website::Website;
use spider::tokio;
#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");
    website.crawl_smart().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
OpenAI
Use OpenAI to generate dynamic scripts to drive the browser with the `openai` feature flag.
[dependencies]
spider = { version = "2.0.9", features = ["openai"] }
extern crate spider;
use spider::{tokio, website::Website, configuration::GPTConfigs};
#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://google.com")
        .with_openai(Some(GPTConfigs::new("gpt-4-1106-preview", "Search for Movies", 256)))
        .with_limit(1)
        .build()
        .unwrap();

    website.crawl().await;
}
Depth
Set a depth limit to bound how far links are followed.
[dependencies]
spider = { version = "2.0.9" }
extern crate spider;
use spider::{tokio, website::Website};
#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com").with_depth(3).build().unwrap();
    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
Reusable configuration
The same configuration can be reused for a list of crawls.
extern crate spider;
use spider::configuration::Configuration;
use spider::{tokio, website::Website};
use std::io::Error;
use std::time::Instant;
const CAPACITY: usize = 5;
const CRAWL_LIST: [&str; CAPACITY] = [
    "https://spider.cloud",
    "https://choosealicense.com",
    "https://jeffmendez.com",
    "https://spider-rs.github.io/spider-nodejs/",
    "https://spider-rs.github.io/spider-py/",
];

#[tokio::main]
async fn main() -> Result<(), Error> {
    let config = Configuration::new()
        .with_user_agent(Some("SpiderBot"))
        .with_blacklist_url(Some(Vec::from(["https://spider.cloud/resume".into()])))
        .with_subdomains(false)
        .with_tld(false)
        .with_redirect_limit(3)
        .with_respect_robots_txt(true)
        .with_external_domains(Some(
            Vec::from(["http://loto.rsseau.fr/"].map(|d| d.to_string())).into_iter(),
        ))
        .build();

    let mut handles = Vec::with_capacity(CAPACITY);

    for website_url in CRAWL_LIST {
        match Website::new(website_url)
            .with_config(config.to_owned())
            .build()
        {
            Ok(mut website) => {
                let handle = tokio::spawn(async move {
                    println!("Starting Crawl - {:?}", website.get_url().inner());

                    let start = Instant::now();
                    website.crawl().await;
                    let duration = start.elapsed();

                    let links = website.get_links();

                    for link in links {
                        println!("- {:?}", link.as_ref());
                    }

                    println!(
                        "{:?} - Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
                        website.get_url().inner(),
                        duration,
                        links.len()
                    );
                });

                handles.push(handle);
            }
            Err(e) => println!("{:?}", e),
        }
    }

    for handle in handles {
        let _ = handle.await;
    }

    Ok(())
}
Blocking
If you need a blocking sync implementation, use a version prior to v1.12.0.
Dependencies
~14–39MB
~700K SLoC