
Spider Cloud Rust SDK

The Spider Cloud Rust SDK offers a toolkit for straightforward website scraping, crawling at scale, and other utilities such as extracting links and taking screenshots, enabling you to collect data formatted for language models (LLMs). It features a user-friendly interface for seamless integration with the Spider Cloud API.

Installation

To use the Spider Cloud Rust SDK, include the following in your Cargo.toml:

[dependencies]
spider-client = "0.1"

Usage

  1. Get an API key from spider.cloud
  2. Set the API key as an environment variable named SPIDER_API_KEY, or pass it as an argument when instantiating the Spider struct, as shown in the sketch below.
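
If you prefer not to read the key from the environment, it can be passed directly when constructing the client. A minimal sketch, assuming Spider::new accepts an Option<String> API key, consistent with the Spider::new(None) call in the example below:

use spider_client::Spider;

// Construct the client with an explicit API key instead of relying on
// the SPIDER_API_KEY environment variable (Option<String> parameter assumed).
let spider = Spider::new(Some("your_api_key".to_string()))
    .expect("API key must be provided");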

Here is an example of using the SDK:

use serde_json::json;
use spider_client::{RequestParams, RequestType, Spider};
use std::env;

#[tokio::main]
async fn main() {
    // Set the API key as an environment variable
    env::set_var("SPIDER_API_KEY", "your_api_key");

    // Initialize the Spider with your API key
    let spider = Spider::new(None).expect("API key must be provided");

    let url = "https://spider.cloud";

    // Scrape a single URL
    let scraped_data = spider.scrape_url(url, None, false, "application/json").await.expect("Failed to scrape the URL");

    println!("Scraped Data: {:?}", scraped_data);

    // Crawl a website
    let crawler_params = RequestParams {
        limit: Some(1),
        proxy_enabled: Some(true),
        store_data: Some(false),
        metadata: Some(false),
        request: Some(RequestType::Http),
        ..Default::default()
    };

    let crawl_result = spider.crawl_url(url, Some(crawler_params), false, "application/json", None::<fn(serde_json::Value)>).await.expect("Failed to crawl the URL");

    println!("Crawl Result: {:?}", crawl_result);
}
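
Note the shared calling convention: each request method takes the target, an optional RequestParams struct, a stream flag, and the desired content type, and the crawl methods additionally accept an optional callback for streamed chunks. The same positional pattern repeats throughout the examples below.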

Scraping a URL

Scrape data from a single URL:

let url = "https://example.com";
let scraped_data = spider.scrape_url(url, None, false, "application/json").await.expect("Failed to scrape the URL");

Crawling a Website

Crawl a website automatically:

let url = "https://example.com";
let crawl_params = RequestParams {
    limit: Some(200),
    request: Some(RequestType::Smart),
    ..Default::default()
};
let crawl_result = spider.crawl_url(url, Some(crawl_params), false, "application/json", None::<fn(serde_json::Value)>).await.expect("Failed to crawl the URL");

Streaming Crawl

Stream crawl the website in chunks with a callback, to scale:

fn handle_json(json_obj: serde_json::Value) {
    println!("Received chunk: {:?}", json_obj);
}

let url = "https://example.com";
let crawl_params = RequestParams {
    limit: Some(200),
    store_data: Some(false),
    ..Default::default()
};

spider.crawl_url(
    url,
    Some(crawl_params),
    true,
    "application/json",
    Some(handle_json)
).await.expect("Failed to crawl the URL");

Search

Perform a search to gather a list of websites to crawl or to collect search results:

let query = "a sports website";
let crawl_params = RequestParams {
    request: Some(RequestType::Smart),
    search_limit: Some(5),
    limit: Some(5),
    fetch_page_content: Some(true),
    ..Default::default()
};
let crawl_result = spider.search(query, Some(crawl_params), false, "application/json").await.expect("Failed to perform search");

Retrieving Links from a URL(s)

Extract all links from a specified URL:

let url = "https://example.com";
let links = spider.links(url, None, false, "application/json").await.expect("Failed to retrieve links from URL");

Transform

Transform HTML to markdown or text quickly:

let data = vec![json!({"html": "<html><body><h1>Hello world</h1></body></html>"})];
let params = RequestParams {
    readability: Some(false),
    return_format: Some(ReturnFormat::Markdown),
    ..Default::default()
};
let result = spider.transform(data, Some(params), false, "application/json").await.expect("Failed to transform HTML to markdown");
println!("Transformed Data: {:?}", result);

Taking Screenshots of a URL(s)

Capture a screenshot of a given URL:

let url = "https://example.com";
let screenshot = spider.screenshot(url, None, false, "application/json").await.expect("Failed to take screenshot of URL");

Extracting Contact Information

Extract contact details from a specified URL:

let url = "https://example.com";
let contacts = spider.extract_contacts(url, None, false, "application/json").await.expect("Failed to extract contacts from URL");
println!("Extracted Contacts: {:?}", contacts);

Labeling Data from a URL(s)

Label the data extracted from a particular URL:

let url = "https://example.com";
let labeled_data = spider.label(url, None, false, "application/json").await.expect("Failed to label data from URL");
println!("Labeled Data: {:?}", labeled_data);

Checking Crawl State

You can check the crawl state of a specific URL:

let url = "https://example.com";
let state = spider.get_crawl_state(url, None, false, "application/json").await.expect("Failed to get crawl state for URL");
println!("Crawl State: {:?}", state);

Downloading Files

You can download the results of a website query:

let url = "https://example.com";
// Build the query options as a standard HashMap
let options = std::collections::HashMap::from([
    ("page", 0),
    ("limit", 100),
    ("expiresIn", 3600), // Optional, add if needed
]);
let response = spider.create_signed_url(Some(url), Some(options)).await.expect("Failed to create signed URL");
println!("Download URL: {:?}", response);

Checking Available Credits

You can check the remaining credits on your account:

let credits = spider.get_credits().await.expect("Failed to get credits");
println!("Remaining Credits: {:?}", credits);

Data Operations

The Spider client can now interact with specific data tables to create, retrieve, and delete data.

Retrieve Data from a Table

Retrieve data from a specified table by applying query parameters:

let table_name = "pages";
let query_params = RequestParams {
    limit: Some(20),
    ..Default::default()
};
let response = spider.data_get(table_name, Some(query_params)).await.expect("Failed to retrieve data from table");
println!("Data from table: {:?}", response);

Delete Data from a Table

Delete data from a specified table based on certain conditions:

let table_name = "websites";
let delete_params = RequestParams {
    domain: Some("www.example.com".to_string()),
    ..Default::default()
};
let response = spider.data_delete(table_name, Some(delete_params)).await.expect("Failed to delete data from table");
println!("Delete Response: {:?}", response);

Streaming

If you need to use streaming, set the stream parameter to true and provide a callback function:

fn handle_json(json_obj: serde_json::Value) {
    println!("Received chunk: {:?}", json_obj);
}

let url = "https://example.com";
let crawler_params = RequestParams {
    limit: Some(1),
    proxy_enabled: Some(true),
    store_data: Some(false),
    metadata: Some(false),
    request: Some(RequestType::Http),
    ..Default::default()
};

spider.links(url, Some(crawler_params), true, "application/json").await.expect("Failed to retrieve links from URL");

Content-Type

The following Content-Type headers are supported via the content_type argument:

  • application/json
  • text/csv
  • application/xml
  • application/jsonl

let url = "https://example.com";

let crawler_params = RequestParams {
    limit: Some(1),
    proxy_enabled: Some(true),
    store_data: Some(false),
    metadata: Some(false),
    request: Some(RequestType::Http),
    ..Default::default()
};

// Stream JSON lines back to the client
spider.crawl_url(url, Some(crawler_params), true, "application/jsonl", None::<fn(serde_json::Value)>).await.expect("Failed to crawl the URL");

Error Handling

The SDK handles errors returned by the Spider Cloud API and surfaces them as appropriate error values. If an error occurs during a request, it is propagated to the caller with a descriptive error message.
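
Because each request method returns a Result, errors can also be handled explicitly instead of being unwrapped with expect. A minimal sketch; it assumes only that the error type implements Debug, which the expect calls above already rely on:

// Handle a failed request explicitly rather than panicking via expect
match spider.scrape_url("https://example.com", None, false, "application/json").await {
    Ok(data) => println!("Scraped Data: {:?}", data),
    Err(e) => eprintln!("Request failed: {:?}", e), // descriptive error from the API or transport
}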

Contributing

Contributions to the Spider Cloud Rust SDK are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.

License

The Spider Cloud Rust SDK is open source and released under the MIT License.
