2 versions

0.1.1	Aug 22, 2020
0.1.0	Aug 22, 2020

#11 in #news

Used in the-daily-stallman

MIT/Apache

315KB
14K SLoC

extrablatt

Customizable article scraping and curation library and CLI. Also runs in Wasm.

Basic Wasm example with some CORS limitations: https://mattsse.github.io/extrablatt/

Inspired by newspaper.

HTML scraping is done via select.rs.

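Features

News URL identification
Text extraction
Top image extraction
All image extraction
Keyword extraction
Author extraction
Publishing date
References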
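
To make the keyword extraction feature more concrete, here is a minimal std-only sketch of one common approach: count word frequencies after dropping stop words. It is an illustration only and assumes nothing about how extrablatt itself ranks keywords; the function name `top_keywords` and the tiny stop-word list are hypothetical.

```rust
use std::collections::HashMap;

// Tiny illustrative stop-word list (an assumption); a real list would be much longer.
const STOP_WORDS: &[&str] = &["a", "an", "and", "in", "is", "of", "on", "the", "to"];

/// Returns up to `n` of the most frequent non-stop-words in `text`,
/// most frequent first; ties are broken alphabetically.
fn top_keywords(text: &str, n: usize) -> Vec<String> {
    let mut counts: HashMap<String, usize> = HashMap::new();

    // Split on anything that is not a letter or digit and skip empty pieces.
    for raw in text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
    {
        let word = raw.to_lowercase();
        if !STOP_WORDS.iter().any(|&s| s == word) {
            *counts.entry(word).or_insert(0) += 1;
        }
    }

    // Sort by descending count, then alphabetically for a stable result.
    let mut ranked: Vec<(String, usize)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.into_iter().take(n).map(|(word, _)| word).collect()
}

fn main() {
    let text = "The council approved the budget. \
                The budget funds the new library and the council offices.";
    println!("{:?}", top_keywords(text, 2));
}
```

Frequency counting is crude, but it shows why a stop-word list matters: without it, "the" would outrank every meaningful word in the sample text.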

Customizable for specific news sites/layouts via the Extractor trait.
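
The general idea of per-site customization through a trait can be sketched as follows. This is a std-only illustration of the pattern, not extrablatt's actual `Extractor` trait: `TitleExtractor`, `text_between`, `DefaultExtractor`, and `HeadlineSite` are hypothetical names, and the real trait's methods and signatures should be taken from the crate documentation.

```rust
/// Hypothetical stand-in for a per-site extraction hook.
/// NOT the real `extrablatt::Extractor` trait.
trait TitleExtractor {
    /// Returns the article title found in `html`, if any.
    /// The default implementation reads the first <title> element.
    fn title(&self, html: &str) -> Option<String> {
        text_between(html, "<title>", "</title>")
    }
}

/// Returns the trimmed text between the first `open` and the next `close`.
fn text_between(html: &str, open: &str, close: &str) -> Option<String> {
    let start = html.find(open)? + open.len();
    let end = html[start..].find(close)? + start;
    let text = html[start..end].trim();
    if text.is_empty() {
        None
    } else {
        Some(text.to_string())
    }
}

/// Uses the default behaviour unchanged.
struct DefaultExtractor;
impl TitleExtractor for DefaultExtractor {}

/// Hypothetical site whose layout keeps the headline in <h1 class="headline">.
struct HeadlineSite;
impl TitleExtractor for HeadlineSite {
    fn title(&self, html: &str) -> Option<String> {
        // Prefer the site-specific headline, fall back to the generic lookup.
        text_between(html, "<h1 class=\"headline\">", "</h1>")
            .or_else(|| text_between(html, "<title>", "</title>"))
    }
}

fn main() {
    let html = r#"<html><head><title>Some News | Home</title></head>
<body><h1 class="headline">Local cat elected mayor</h1></body></html>"#;

    println!("{:?}", DefaultExtractor.title(html));
    println!("{:?}", HeadlineSite.title(html));
}
```

The default method keeps generic behaviour in one place, so a site-specific implementation only overrides the parts of the layout that differ.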

Documentation

Full documentation: https://docs.rs/extrablatt

Example

Extract all articles from a news outlet.

use extrablatt::Extrablatt;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Point the builder at the news outlet's URL and build it.
    let site = Extrablatt::builder("https://some-news.com/")?.build().await?;

    // Consume the site as a stream of article results.
    let mut stream = site.into_stream();

    while let Some(article) = stream.next().await {
        if let Ok(article) = article {
            println!("article '{:?}'", article.content.title);
        } else {
            // Print the failed result and move on to the next article.
            println!("{:?}", article);
        }
    }

    Ok(())
}

Command Line

Install

cargo install extrablatt --features="cli"

Usage

USAGE:
    extrablatt <SUBCOMMAND>

SUBCOMMANDS:
    article     Extract a set of articles
    category    Extract all articles found on the page
    help        Prints this message or the help of the given subcommand(s)
    site        Extract all articles from a news source.

Extract a set of specific articles and store the result as JSON

extrablatt article "https://www.example.com/article1.html", "https://www.example.com/article2.html" -o "articles.json"

License

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or https://apache.ac.cn/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or https://open-source.org.cn/licenses/MIT)

Dependencies

~9–22MB
~339K SLoC