5 releases

0.1.4	Jul 20, 2023
0.1.3	Apr 10, 2023
0.1.2	Oct 26, 2022
0.1.1	Oct 24, 2022
0.1.0	Sep 21, 2022

#789 in Filesystem

747 downloads per month
Used in 2 crates (via ballista-core)

Apache-2.0

31KB
521 lines

datafusion-objectstore-hdfs

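HDFS as a remote ObjectStore for DataFusion.

Querying files on HDFS with DataFusion

This crate introduces HadoopFileSystem as a remote ObjectStore, which makes it possible to query files stored on HDFS.

HDFS access

`libhdfs` is only a C interface wrapper; the actual HDFS access is implemented by a set of Java jar files. For this crate to work, the Hadoop client jars and a JRE environment must therefore be prepared.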

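Prepare Java

Install Java.

Specify and export JAVA_HOME.

Prepare the Hadoop client

To get a Hadoop distribution, download the latest stable release from an Apache download mirror. Currently, Hadoop-2 and Hadoop-3 are supported.

Unpack the downloaded Hadoop distribution, for example into /opt/hadoop. Then set the following environment variables:

export HADOOP_HOME=/opt/hadoop

export PATH=$PATH:$HADOOP_HOME/bin

Prepare the JRE environment

First, add the library path for the JVM-related dependencies. For example, on macOS:

export DYLD_LIBRARY_PATH=$JAVA_HOME/jre/lib/server

Because the compiled libhdfs is a JNI native implementation, it needs a proper CLASSPATH to load the Hadoop-related jars. For example:

export CLASSPATH=$CLASSPATH:`hadoop classpath --glob`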
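A missing variable from the steps above usually only surfaces once the JVM fails to load, so it can help to check the environment up front. The following is a minimal sketch using only the Rust standard library. The `missing_vars` helper and the `REQUIRED_VARS` list are hypothetical names introduced for illustration and are not part of this crate. The platform-specific library path variable (DYLD_LIBRARY_PATH on macOS) is left out on purpose.

```rust
use std::env;

/// Variables exported in the setup steps above.
/// Assumption for illustration: the crate itself is not known to check these.
const REQUIRED_VARS: [&str; 3] = ["JAVA_HOME", "HADOOP_HOME", "CLASSPATH"];

/// Returns the names for which `lookup` yields no value or only whitespace.
fn missing_vars<F>(names: &[&'static str], lookup: F) -> Vec<&'static str>
where
    F: Fn(&str) -> Option<String>,
{
    names
        .iter()
        .copied()
        .filter(|name| lookup(*name).map_or(true, |v| v.trim().is_empty()))
        .collect()
}

fn main() {
    // Look each variable up in the real process environment.
    let missing = missing_vars(&REQUIRED_VARS, |name| env::var(name).ok());
    if missing.is_empty() {
        println!("HDFS environment looks complete");
    } else {
        eprintln!("missing environment variables: {}", missing.join(", "));
    }
}
```

Passing the lookup as a closure keeps the check testable without modifying the real process environment.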

Examples

Suppose there is an HDFS directory,

let hdfs_file_uri = "hdfs://:8020/testing/tpch_1g/parquet/line_item";

that contains some parquet files. These parquet files can then be queried as follows:

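use std::sync::Arc;
use datafusion::prelude::*;
use url::Url;
// HadoopFileSystem is provided by this crate.

let ctx = SessionContext::new();
// Register the HDFS object store for URLs with the hdfs:// scheme.
let url = Url::parse("hdfs://").unwrap();
ctx.runtime_env().register_object_store(&url, Arc::new(HadoopFileSystem));
let table_name = "line_item";
println!(
    "Register table {} with parquet file {}",
    table_name, hdfs_file_uri
);
// Expose the parquet files in the HDFS directory as a SQL table.
ctx.register_parquet(table_name, &hdfs_file_uri, ParquetReadOptions::default()).await?;

// Run the query and collect the resulting record batches.
let sql = "SELECT count(*) FROM line_item";
let result = ctx.sql(sql).await?.collect().await?;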

Testing

First, clone the test data repository:

git submodule update --init --recursive

Run the tests:

cargo test

During testing, an HDFS cluster is mocked and started automatically.

Run the tests with the hdfs3 feature enabled:

cargo build --no-default-features --features datafusion-objectstore-hdfs/hdfs3,datafusion-objectstore-hdfs-testing/hdfs3,datafusion-hdfs-examples/hdfs3

cargo test --no-default-features --features datafusion-objectstore-hdfs/hdfs3,datafusion-objectstore-hdfs-testing/hdfs3,datafusion-hdfs-examples/hdfs3

Test by running ballista-sql:

cargo run --bin ballista-sql --no-default-features --features hdfs3

Dependencies

~7–17MB
~244K SLoC