Spider Cloud Rust SDK
The Spider Cloud Rust SDK offers a toolkit for straightforward website scraping, large-scale crawling, and other features such as link extraction and screenshots, allowing you to collect data formatted for LLMs. It features a user-friendly interface for seamless integration with the Spider Cloud API.
Installation
To use the Spider Cloud Rust SDK, include the following in your Cargo.toml:
[dependencies]
spider-client = "0.1"
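The examples below also rely on an async runtime and JSON values, so a fuller dependency section might look like the following sketch (the tokio and serde_json entries are assumptions based on the example code; version numbers are illustrative):

```toml
[dependencies]
spider-client = "0.1"
# Needed for the #[tokio::main] async runtime used in the examples
tokio = { version = "1", features = ["full"] }
# Needed for serde_json::Value and the json! macro
serde_json = "1"
```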
Usage
- Get an API key from spider.cloud
- Set the API key as an environment variable named SPIDER_API_KEY, or pass it as an argument when creating a Spider struct instance.
Here is an example of using the SDK:
use serde_json::json;
use spider_client::{RequestParams, RequestType, Spider};
use std::env;

#[tokio::main]
async fn main() {
    // Set the API key as an environment variable
    env::set_var("SPIDER_API_KEY", "your_api_key");

    // Initialize the Spider with your API key
    let spider = Spider::new(None).expect("API key must be provided");

    let url = "https://spider.cloud";

    // Scrape a single URL
    let scraped_data = spider.scrape_url(url, None, false, "application/json").await.expect("Failed to scrape the URL");
    println!("Scraped Data: {:?}", scraped_data);

    // Crawl a website
    let crawler_params = RequestParams {
        limit: Some(1),
        proxy_enabled: Some(true),
        store_data: Some(false),
        metadata: Some(false),
        request: Some(RequestType::Http),
        ..Default::default()
    };
    let crawl_result = spider.crawl_url(url, Some(crawler_params), false, "application/json", None::<fn(serde_json::Value)>).await.expect("Failed to crawl the URL");
    println!("Crawl Result: {:?}", crawl_result);
}
Scraping a URL
Scrape data from a single URL:
let url = "https://example.com";
let scraped_data = spider.scrape_url(url, None, false, "application/json").await.expect("Failed to scrape the URL");
Crawling a Website
Automatically crawl a website:
let url = "https://example.com";
let crawl_params = RequestParams {
    limit: Some(200),
    request: Some(RequestType::Smart),
    ..Default::default()
};
let crawl_result = spider.crawl_url(url, Some(crawl_params), false, "application/json", None::<fn(serde_json::Value)>).await.expect("Failed to crawl the URL");
Crawling with Streaming
Stream-crawl a website in chunks via a callback, to scale:
fn handle_json(json_obj: serde_json::Value) {
    println!("Received chunk: {:?}", json_obj);
}

let url = "https://example.com";
let crawl_params = RequestParams {
    limit: Some(200),
    store_data: Some(false),
    ..Default::default()
};
spider.crawl_url(
    url,
    Some(crawl_params),
    true,
    "application/json",
    Some(handle_json)
).await.expect("Failed to crawl the URL");
Search
Perform a search that crawls websites or gathers search results:
let query = "a sports website";
let crawl_params = RequestParams {
    request: Some(RequestType::Smart),
    search_limit: Some(5),
    limit: Some(5),
    fetch_page_content: Some(true),
    ..Default::default()
};
let crawl_result = spider.search(query, Some(crawl_params), false, "application/json").await.expect("Failed to perform search");
Retrieving Links from URL(s)
Extract all links from a specified URL:
let url = "https://example.com";
let links = spider.links(url, None, false, "application/json").await.expect("Failed to retrieve links from URL");
Transform
Quickly convert HTML to markdown or text:
let data = vec![json!({"html": "<html><body><h1>Hello world</h1></body></html>"})];
let params = RequestParams {
    readability: Some(false),
    return_format: Some(ReturnFormat::Markdown),
    ..Default::default()
};
let result = spider.transform(data, Some(params), false, "application/json").await.expect("Failed to transform HTML to markdown");
println!("Transformed Data: {:?}", result);
Taking Screenshots of URL(s)
Capture a screenshot of a given URL:
let url = "https://example.com";
let screenshot = spider.screenshot(url, None, false, "application/json").await.expect("Failed to take screenshot of URL");
Extracting Contacts
Extract contact details from a specified URL:
let url = "https://example.com";
let contacts = spider.extract_contacts(url, None, false, "application/json").await.expect("Failed to extract contacts from URL");
println!("Extracted Contacts: {:?}", contacts);
Labeling Data from URL(s)
Label the data extracted from a specific URL:
let url = "https://example.com";
let labeled_data = spider.label(url, None, false, "application/json").await.expect("Failed to label data from URL");
println!("Labeled Data: {:?}", labeled_data);
Checking Crawl State
You can check the crawl state of a specific URL:
let url = "https://example.com";
let state = spider.get_crawl_state(url, None, false, "application/json").await.expect("Failed to get crawl state for URL");
println!("Crawl State: {:?}", state);
Downloading Files
You can download the query results for a website:
let url = "https://example.com";
// Note: `hashmap!` comes from the `maplit` crate; a std HashMap works as well
let options = hashmap!{
    "page" => 0,
    "limit" => 100,
    "expiresIn" => 3600 // Optional, add if needed
};
let response = spider.create_signed_url(Some(url), Some(options)).await.expect("Failed to create signed URL");
println!("Download URL: {:?}", response);
Checking Available Credits
You can check the remaining credits on your account:
let credits = spider.get_credits().await.expect("Failed to get credits");
println!("Remaining Credits: {:?}", credits);
Data Operations
The Spider client can now interact with specific data tables to create, retrieve, and delete data.
Retrieving Data from a Table
Retrieve data from a specified table by applying query parameters:
let table_name = "pages";
let query_params = RequestParams {
    limit: Some(20),
    ..Default::default()
};
let response = spider.data_get(table_name, Some(query_params)).await.expect("Failed to retrieve data from table");
println!("Data from table: {:?}", response);
Deleting Data from a Table
Delete data from a specified table based on certain conditions:
let table_name = "websites";
let delete_params = RequestParams {
    domain: Some("www.example.com".to_string()),
    ..Default::default()
};
let response = spider.data_delete(table_name, Some(delete_params)).await.expect("Failed to delete data from table");
println!("Delete Response: {:?}", response);
Streaming
If you need to stream results, set the stream parameter to true and provide a callback function:
fn handle_json(json_obj: serde_json::Value) {
    println!("Received chunk: {:?}", json_obj);
}

let url = "https://example.com";
let crawler_params = RequestParams {
    limit: Some(1),
    proxy_enabled: Some(true),
    store_data: Some(false),
    metadata: Some(false),
    request: Some(RequestType::Http),
    ..Default::default()
};
spider.links(url, Some(crawler_params), true, "application/json").await.expect("Failed to retrieve links from URL");
Content Types
The following Content-Type headers are supported via the content_type parameter:
- application/json
- text/csv
- application/xml
- application/jsonl
let url = "https://example.com";
let crawler_params = RequestParams {
    limit: Some(1),
    proxy_enabled: Some(true),
    store_data: Some(false),
    metadata: Some(false),
    request: Some(RequestType::Http),
    ..Default::default()
};
// Stream JSON lines back to the client
spider.crawl_url(url, Some(crawler_params), true, "application/jsonl", None::<fn(serde_json::Value)>).await.expect("Failed to crawl the URL");
Error Handling
The SDK handles errors returned by the Spider Cloud API and raises appropriate errors. If an error occurs during a request, it is propagated to the caller with a descriptive error message.
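The examples above call expect, which panics on failure. In real applications you will usually want to match on the Result or propagate it with the ? operator instead. A minimal, self-contained sketch of that pattern using a hypothetical stand-in function (the SDK's actual error type may differ):

```rust
use std::fmt;

// Hypothetical error type standing in for the SDK's error.
#[derive(Debug)]
struct ApiError(String);

impl fmt::Display for ApiError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "API error: {}", self.0)
    }
}

impl std::error::Error for ApiError {}

// Stand-in for a client call such as `spider.scrape_url(...)`:
// returns Ok on success and a descriptive Err on failure.
fn scrape(url: &str) -> Result<String, ApiError> {
    if url.is_empty() {
        Err(ApiError("URL must not be empty".to_string()))
    } else {
        Ok(format!("scraped {url}"))
    }
}

fn main() {
    // Match on the Result instead of panicking with `expect`
    match scrape("https://example.com") {
        Ok(data) => println!("{data}"),
        Err(e) => eprintln!("request failed: {e}"),
    }
}
```

The same shape applies to the async SDK calls: replace expect with ? inside a function returning Result, and the error message propagates to the caller.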
Contributing
Contributions to the Spider Cloud Rust SDK are welcome! If you find any issues or have suggestions for improvement, please open an issue or submit a pull request on the GitHub repository.
License
The Spider Cloud Rust SDK is open source and released under the MIT License.