hudi-datafusion — Rust library // Lib.rs

3 versions

0.1.0	July 15, 2024
0.1.0-rc.2	July 12, 2024
0.1.0-rc.1	July 11, 2024
0.1.0-alpha2	~~July 10, 2024~~

#10 in #datalake

196 downloads per month
Used in hudi

Apache-2.0

115KB
2.5K SLoC

A native Rust library for Apache Hudi, with Python bindings

The hudi-rs project aims to broaden the use of Apache Hudi for a diverse range of users and projects.

Source	Installation command
PyPI	`pip install hudi`
Crates.io	`cargo add hudi`

Example usage

Python

Read a Hudi table into a PyArrow table.

from hudi import HudiTable

hudi_table = HudiTable("/tmp/trips_table")
records = hudi_table.read_snapshot()

import pyarrow as pa
import pyarrow.compute as pc

arrow_table = pa.Table.from_batches(records)
result = arrow_table.select(
    ["rider", "ts", "fare"]).filter(
    pc.field("fare") > 20.0)
print(result)

Rust

Add the crate `hudi` with the `datafusion` feature to your application to query a Hudi table.

[dependencies]
hudi = { version = "0" , features = ["datafusion"] }
tokio = "1"
datafusion = "39.0.0"

use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};
use hudi::HudiDataSource;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    let hudi = HudiDataSource::new("/tmp/trips_table").await?;
    ctx.register_table("trips_table", Arc::new(hudi))?;
    let df: DataFrame = ctx.sql("SELECT * from trips_table where fare > 20.0").await?;
    df.show().await?;
    Ok(())
}

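Working with cloud storage

Make sure cloud storage credentials are set as environment variables, e.g., AWS_*, AZURE_*, or GOOGLE_*. The relevant storage environment variables are then picked up automatically, and a target table's base URI with a scheme such as s3://, az://, or gs:// is handled accordingly.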
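
As a rough illustration of the pairing described above, the sketch below maps a base URI's scheme to the environment-variable prefix mentioned for it and lists which matching variables are set. The helper `storage_env_vars` and the scheme-to-prefix mapping are assumptions made for this example; they are not part of the hudi API, and the library's actual lookup may differ.

```python
import os
from urllib.parse import urlparse

# Assumed pairing of URI schemes with the variable prefixes named in the text.
SCHEME_TO_ENV_PREFIX = {"s3": "AWS_", "az": "AZURE_", "gs": "GOOGLE_"}


def storage_env_vars(base_uri, environ=None):
    """Return the environment variables whose prefix matches the URI scheme.

    Hypothetical pre-flight helper: it only inspects variable names and does
    not validate the credentials themselves.
    """
    environ = os.environ if environ is None else environ
    scheme = urlparse(base_uri).scheme
    prefix = SCHEME_TO_ENV_PREFIX.get(scheme)
    if prefix is None:
        # Local paths and unknown schemes have no associated prefix here.
        return {}
    return {name: value for name, value in environ.items() if name.startswith(prefix)}


if __name__ == "__main__":
    # "my-bucket" is a placeholder bucket name, not a real location.
    found = storage_env_vars("s3://my-bucket/trips_table")
    print(sorted(found) or "no AWS_* variables set")
```

Running a check like this before opening a table can surface a missing credential early. A local path such as `/tmp/trips_table` has no scheme, so the helper returns an empty dict.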

Contributing

Check out the contributing guide for all the details about making contributions to the project.

Dependencies

~67MB
~1.5M SLoC

arrow-schema 52.0+serde
async-trait
datafusion 39.0
datafusion-common 39.0
datafusion-expr 39.0
datafusion-physical-expr 39.0
hudi-core
tokio+rt-multi-thread
url