#glue #datafusion #aws #arrow

datafusion-catalogprovider-glue

Glue as a CatalogProvider for Datafusion

11 unstable releases (3 breaking)

0.4.0 May 2, 2024
0.3.0 Apr 23, 2024
0.2.0 Sep 27, 2022
0.1.7 Jun 10, 2022
0.1.0 May 30, 2022

#1644 in Parser implementations

Apache-2.0

63KB
1K SLoC

DataFusion-CatalogProvider-Glue

Glue as a CatalogProvider for Datafusion.

Output from the example

mbpro16(timvw)➜  datafusion-catalogprovider-glue git:(main) ✗ cargo run --example=demo

   Compiling datafusion-catalogprovider-glue v0.1.0 (/Users/timvw/src/github/datafusion-catalogprovider-glue)
    Finished dev [unoptimized + debuginfo] target(s) in 7.43s
     Running `target/debug/examples/demo`
registering tpc-h-parquet-1.customer
+---------------+--------------------+------------+------------+
| table_catalog | table_schema       | table_name | table_type |
+---------------+--------------------+------------+------------+
| glue          | tpc-h-parquet-1    | customer   | BASE TABLE |
| glue          | information_schema | tables     | VIEW       |
| glue          | information_schema | columns    | VIEW       |
| datafusion    | information_schema | tables     | VIEW       |
| datafusion    | information_schema | columns    | VIEW       |
+---------------+--------------------+------------+------------+
+---------------+-----------------+------------+--------------+------------------+----------------+-------------+-----------+--------------------------+------------------------+-------------------+-------------------------+---------------+--------------------+---------------+
| table_catalog | table_schema    | table_name | column_name  | ordinal_position | column_default | is_nullable | data_type | character_maximum_length | character_octet_length | numeric_precision | numeric_precision_radix | numeric_scale | datetime_precision | interval_type |
+---------------+-----------------+------------+--------------+------------------+----------------+-------------+-----------+--------------------------+------------------------+-------------------+-------------------------+---------------+--------------------+---------------+
| glue          | tpc-h-parquet-1 | customer   | c_custkey    | 0                |                | NO          | Int64     |                          |                        |                   |                         |               |                    |               |
| glue          | tpc-h-parquet-1 | customer   | c_name       | 1                |                | YES         | Utf8      |                          | 2147483647             |                   |                         |               |                    |               |
| glue          | tpc-h-parquet-1 | customer   | c_address    | 2                |                | YES         | Utf8      |                          | 2147483647             |                   |                         |               |                    |               |
| glue          | tpc-h-parquet-1 | customer   | c_nationkey  | 3                |                | NO          | Int64     |                          |                        |                   |                         |               |                    |               |
| glue          | tpc-h-parquet-1 | customer   | c_phone      | 4                |                | YES         | Utf8      |                          | 2147483647             |                   |                         |               |                    |               |
| glue          | tpc-h-parquet-1 | customer   | c_acctbal    | 5                |                | NO          | Float64   |                          |                        | 24                | 2                       |               |                    |               |
| glue          | tpc-h-parquet-1 | customer   | c_mktsegment | 6                |                | YES         | Utf8      |                          | 2147483647             |                   |                         |               |                    |               |
| glue          | tpc-h-parquet-1 | customer   | c_comment    | 7                |                | YES         | Utf8      |                          | 2147483647             |                   |                         |               |                    |               |
+---------------+-----------------+------------+--------------+------------------+----------------+-------------+-----------+--------------------------+------------------------+-------------------+-------------------------+---------------+--------------------+---------------+
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
| c_custkey | c_name             | c_address                             | c_nationkey | c_phone         | c_acctbal | c_mktsegment | c_comment                                                                                                         |
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
| 1         | Customer#000000001 | IVhzIApeRb ot,c,E                     | 15          | 25-989-741-2988 | 711.56    | BUILDING     | to the even, regular platelets. regular, ironic epitaphs nag e                                                    |
| 2         | Customer#000000002 | XSTf4,NCwDVaWNe6tEgvwfmRchLXak        | 13          | 23-768-687-3665 | 121.65    | AUTOMOBILE   | l accounts. blithely ironic theodolites integrate boldly: caref                                                   |
| 3         | Customer#000000003 | MG9kdTD2WBHm                          | 1           | 11-719-748-3364 | 7498.12   | AUTOMOBILE   | deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov             |
| 4         | Customer#000000004 | XxVSJsLAGtn                           | 4           | 14-128-190-5944 | 2866.83   | MACHINERY    | requests. final, regular ideas sleep final accou                                                                  |
| 5         | Customer#000000005 | KvpyuHCplrB84WgAiGV6sYpZq7Tj          | 3           | 13-750-942-6364 | 794.47    | HOUSEHOLD    | n accounts will have to unwind. foxes cajole accor                                                                |
| 6         | Customer#000000006 | sKZz0CsnMD7mp4Xd0YrBvx,LREYKUWAh yVn  | 20          | 30-114-968-4951 | 7638.57   | AUTOMOBILE   | tions. even deposits boost according to the slyly bold packages. final accounts cajole requests. furious          |
| 7         | Customer#000000007 | TcGe5gaZNgVePxU5kRrvXBfkasDTea        | 18          | 28-190-982-9759 | 9561.95   | AUTOMOBILE   | ainst the ironic, express theodolites. express, even pinto beans among the exp                                    |
| 8         | Customer#000000008 | I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5 | 17          | 27-147-574-9335 | 6819.74   | BUILDING     | among the slyly regular theodolites kindle blithely courts. carefully even theodolites haggle slyly along the ide |
| 9         | Customer#000000009 | xKiAFTjUsCuxfeleNqefumTrjS            | 8           | 18-338-906-3675 | 8324.07   | FURNITURE    | r theodolites according to the requests wake thinly excuses: pending requests haggle furiousl                     |
| 10        | Customer#000000010 | 6LrEaV6KR6PLVcgl2ArL Q3rqzLzcT1 v2    | 5           | 15-741-346-9870 | 2753.54   | HOUSEHOLD    | es regular deposits haggle. fur                                                                                   |
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+

Known issues

  • Cannot infer whether a column is nullable. Nullability currently defaults to true.
  failed to sample datafusion.parquet_testing_datapage_v2_snappy_parquet due to Arrow error: External error: Execution error: Failed to map column projection for field e. Incompatible data types List(Field { name: "element", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }) and List(Field { name: "element", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })
  • Datafusion sometimes infers a column's type as Time32(Millisecond) instead of Timestamp(Nanosecond, None).
failed to sample datafusion.parquet_testing_encrypt_columns_plaintext_footer_parquet_encrypted due to Arrow error: External error: Execution error: Failed to map column projection for field int32_field. Incompatible data types Time32(Millisecond) and Timestamp(Nanosecond, None)
  • Cannot infer that a column is empty (in which case its data_type comes out as Null).
failed to sample datafusion.parquet_testing_null_list_parquet due to Arrow error: External error: Execution error: Failed to map column projection for field emptylist. Incompatible data types List(Field { name: "item", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) and List(Field { name: "element", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })
  • Cannot infer whether a column is signed or unsigned.
failed to sample datafusion.parquet_testing_nested_structs_rust_parquet due to Arrow error: External error: Execution error: Failed to map column projection for field roll_num. Incompatible data types 

Development

Standard Rust toolchain, for example:

cargo build
cargo test

Run all linting

./dev/rust_lint.sh

Testing

First, clone the test data repositories

git submodule update --init --recursive .

If that does not work

git submodule add -f https://github.com/apache/parquet-testing.git parquet-testing
git submodule add -f https://github.com/apache/arrow-testing testing

Upload the test data

aws s3api create-bucket \
    --bucket datafusion-testing \
    --region eu-central-1 \
    --create-bucket-configuration LocationConstraint=eu-central-1

find testing  -type f -exec aws s3 cp ./{} s3://datafusion-{} \;

aws s3api create-bucket \
    --bucket datafusion-parquet-testing \
    --region eu-central-1 \
    --create-bucket-configuration LocationConstraint=eu-central-1

find parquet-testing  -type f -exec aws s3 cp ./{} s3://datafusion-{} \;
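The find commands above substitute each file path into both the source and the destination, so testing/data/x lands at s3://datafusion-testing/data/x. A minimal sketch for checking that mapping before uploading anything: preview_uploads is a hypothetical helper (not part of this repository) that echoes the commands instead of running them, and the demo tree is a throwaway directory rather than the real submodules.

```shell
# Hypothetical helper: print the upload commands for one directory
# instead of executing them, so the S3 key mapping can be inspected.
preview_uploads() {
    # $1 is the local directory, e.g. "testing" or "parquet-testing".
    find "$1" -type f -exec echo aws s3 cp ./{} s3://datafusion-{} \;
}

# Demonstrate on a temporary tree with a single placeholder file.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/testing/data"
touch "$demo_root/testing/data/example.parquet"
(cd "$demo_root" && preview_uploads testing)
```

This relies on find replacing {} inside a larger argument, which GNU and BSD find both do; a strictly POSIX find is only required to replace a standalone {}.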

Create the Glue database

aws glue create-database \
    --database-input "{\"Name\":\"datafusion\"}"
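The escaped JSON passed to --database-input can also be written with single quotes in a POSIX shell, which avoids the backslashes. The sketch below only compares the two spellings locally and does not call AWS; the variable names are illustrative.

```shell
# The double-quoted form with escapes, as used in the command above.
escaped="{\"Name\":\"datafusion\"}"
# The same payload in single quotes, with no escaping needed.
quoted='{"Name":"datafusion"}'

# Both spellings expand to the same string.
if [ "$escaped" = "$quoted" ]; then
    echo "identical payloads"
fi
printf '%s\n' "$quoted"
```

Either form can then be passed unchanged as the value of --database-input.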

Dependencies

~114MB
~2M SLoC