#parquet #arrow #command-line-tool

app pqrs

Apache Parquet command-line tools and utilities

6 releases

0.3.2 Jun 14, 2024
0.3.1 May 27, 2023
0.2.2 Aug 9, 2022
0.2.1 May 16, 2022
0.1.2 Jun 21, 2021

#421 in command-line tools

202 downloads per month

MIT/Apache

42KB
718

pqrs

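  • pqrs is a command-line tool for inspecting Parquet files
  • It is a replacement for the parquet-tools utility, written in Rust
  • It is built on the Rust implementations of Parquet and Arrow
  • pqrs roughly means "parquet-tools in Rust"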

Installation

Release binaries are available for download from the project's releases page.

Other methods

Using Homebrew

For macOS users, pqrs is available as a Homebrew tap.

brew install manojkarthick/tap/pqrs

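Note: if you are upgrading from v0.2 or earlier, the location of the pqrs Homebrew tap has changed. To update to v0.2.1+, first uninstall with the following command, then reinstall using the command above: brew uninstall pqrs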

Using cargo

pqrs can also be installed from crates.io with cargo, the Rust package manager.

cargo install pqrs

Building and running from source

Make sure rustc and cargo are installed on your machine.

git clone https://github.com/manojkarthick/pqrs.git
cd pqrs
cargo build --release
./target/release/pqrs

Running

The following snippet shows the available subcommands:

 pqrs --help
pqrs 0.2.1
Manoj Karthick
Apache Parquet command-line utility

USAGE:
    pqrs [FLAGS] [SUBCOMMAND]

FLAGS:
    -d, --debug      Show debug output
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    cat         Prints the contents of Parquet file(s)
    head        Prints the first n records of the Parquet file
    help        Prints this message or the help of the given subcommand(s)
    merge       Merge file(s) into another parquet file
    rowcount    Prints the count of rows in Parquet file(s)
    sample      Prints a random sample of records from the Parquet file
    schema      Prints the schema of Parquet file(s)
    size        Prints the size of Parquet file(s)

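Subcommand: cat

Prints the contents of the given files and folders. If the input is a directory, it is traversed recursively and every file in it is printed. Output is available in a JSON-like format (the default), JSON, or CSV. Use --json for JSON output, --csv for CSV output with the column names in the first row, and --csv-data-only for CSV output without the column-names row.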

 pqrs cat data/cities.parquet
{continent: "Europe", country: {name: "France", city: ["Paris", "Nice", "Marseilles", "Cannes"]}}
{continent: "Europe", country: {name: "Greece", city: ["Athens", "Piraeus", "Hania", "Heraklion", "Rethymnon", "Fira"]}}
{continent: "North America", country: {name: "Canada", city: ["Toronto", "Vancouver", "St. John's", "Saint John", "Montreal", "Halifax", "Winnipeg", "Calgary", "Saskatoon", "Ottawa", "Yellowknife"]}}
 pqrs cat data/cities.parquet --json
{"continent":"Europe","country":{"name":"France","city":["Paris","Nice","Marseilles","Cannes"]}}
{"continent":"Europe","country":{"name":"Greece","city":["Athens","Piraeus","Hania","Heraklion","Rethymnon","Fira"]}}
{"continent":"North America","country":{"name":"Canada","city":["Toronto","Vancouver","St. John's","Saint John","Montreal","Halifax","Winnipeg","Calgary","Saskatoon","Ottawa","Yellowknife"]}}
 pqrs cat data/simple.parquet --csv
foo,bar
1,2
10,20
 pqrs cat data/simple.parquet --csv --no-header
1,2
10,20

Note: the CSV format is not supported for files that contain Struct or Byte fields.
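
The --json output above is one JSON object per line, and the --csv output is plain comma-separated text, so both can be read with standard parsers. The following is a minimal Python sketch, assuming the output has already been captured as text. The sample strings are copied from the transcripts above, and the function names are illustrative, not part of pqrs.

```python
import csv
import io
import json


def parse_json_lines(text):
    """Parse `pqrs cat --json` output: one JSON object per line.

    Lines that do not start with "{" are skipped, in case the captured
    text contains anything besides records (an assumption, for robustness).
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("{"):
            records.append(json.loads(line))
    return records


def parse_csv(text):
    """Parse `pqrs cat --csv` output, using the first row as column names."""
    return list(csv.DictReader(io.StringIO(text)))


# Sample output copied from the transcripts above.
json_output = (
    '{"continent":"Europe","country":{"name":"France","city":["Paris","Nice","Marseilles","Cannes"]}}\n'
    '{"continent":"Europe","country":{"name":"Greece","city":["Athens","Piraeus","Hania","Heraklion","Rethymnon","Fira"]}}\n'
)
csv_output = "foo,bar\n1,2\n10,20\n"

cities = parse_json_lines(json_output)
rows = parse_csv(csv_output)

print([c["country"]["name"] for c in cities])
print(rows)
```

Note that the CSV parser returns every value as a string; converting to numbers is left to the caller.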

Subcommand: head

Prints the first N records of the Parquet file. Use the --records flag to set the number of records.

 pqrs head data/cities.parquet --json --records 2
{"continent":"Europe","country":{"name":"France","city":["Paris","Nice","Marseilles","Cannes"]}}
{"continent":"Europe","country":{"name":"Greece","city":["Athens","Piraeus","Hania","Heraklion","Rethymnon","Fira"]}}
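
To call the head subcommand from a script, the command line shown above can be built and run with Python's subprocess module. This is a sketch only: it assumes the pqrs binary is on PATH, and head_argv and head_records are hypothetical helper names, not part of pqrs.

```python
import json
import subprocess


def head_argv(path, records):
    """Build the argument list for `pqrs head <path> --json --records <n>`."""
    if records < 1:
        raise ValueError("records must be a positive integer")
    return ["pqrs", "head", path, "--json", "--records", str(records)]


def head_records(path, records):
    """Run pqrs (assumed to be on PATH) and parse one JSON object per line."""
    result = subprocess.run(
        head_argv(path, records),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return [
        json.loads(line)
        for line in result.stdout.splitlines()
        if line.strip().startswith("{")
    ]


# Only the argument list is built here; nothing is executed.
print(head_argv("data/cities.parquet", 2))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.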

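Subcommand: merge

Merges two Parquet files by placing the row groups (or blocks) of the two files one after the other.

Disclaimer: this does not combine the files into one with optimized row groups. Do not use it in production!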

 pqrs merge --input data/pems-1.snappy.parquet data/pems-2.snappy.parquet --output data/pems-merged.snappy.parquet

 ls -al data
total 408
drwxr-xr-x   6 manojkarthick  staff     192 Feb 14 08:53 .
drwxr-xr-x  20 manojkarthick  staff     640 Feb 14 08:52 ..
-rw-r--r--   1 manojkarthick  staff     866 Feb  8 19:50 cities.parquet
-rw-r--r--   1 manojkarthick  staff   16468 Feb  8 19:50 pems-1.snappy.parquet
-rw-r--r--   1 manojkarthick  staff   17342 Feb  8 19:50 pems-2.snappy.parquet
-rw-r--r--   1 manojkarthick  staff  160950 Feb 14 08:53 pems-merged.snappy.parquet
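
In the listing above, the merged file is several times larger than its two inputs combined. The sketch below only does the arithmetic on the byte sizes copied from the ls output. The source does not say why the merged file is larger, so no cause is assumed here.

```python
# Byte sizes copied from the `ls -al data` listing above.
sizes = {
    "pems-1.snappy.parquet": 16468,
    "pems-2.snappy.parquet": 17342,
    "pems-merged.snappy.parquet": 160950,
}

inputs_total = sizes["pems-1.snappy.parquet"] + sizes["pems-2.snappy.parquet"]
merged = sizes["pems-merged.snappy.parquet"]

# Size of the merged file relative to the two inputs combined.
ratio = merged / inputs_total

print(f"inputs combined: {inputs_total} bytes")
print(f"merged: {merged} bytes ({ratio:.2f}x the combined inputs)")
```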

Subcommand: rowcount

Prints the number of rows in the Parquet file(s).

 pqrs rowcount data/pems-1.snappy.parquet data/pems-2.snappy.parquet
File Name: data/pems-1.snappy.parquet: 2693 rows
File Name: data/pems-2.snappy.parquet: 2880 rows
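
The row-count lines above follow a regular pattern, so they can be turned into numbers with a small parser. The following Python sketch assumes the "File Name: <path>: <n> rows" format shown in the transcript, which may differ between versions; parse_row_counts is an illustrative name.

```python
import re

# Matches lines such as "File Name: data/pems-1.snappy.parquet: 2693 rows".
ROW_COUNT_LINE = re.compile(r"^File Name: (?P<path>.+): (?P<rows>\d+) rows$")


def parse_row_counts(text):
    """Map each file path to its row count; lines that do not match are ignored."""
    counts = {}
    for line in text.splitlines():
        match = ROW_COUNT_LINE.match(line.strip())
        if match:
            counts[match.group("path")] = int(match.group("rows"))
    return counts


# Sample output copied from the transcript above.
output = (
    "File Name: data/pems-1.snappy.parquet: 2693 rows\n"
    "File Name: data/pems-2.snappy.parquet: 2880 rows\n"
)

counts = parse_row_counts(output)
total = sum(counts.values())
print(counts)
print(total)
```

Because merge places the row groups of its inputs one after the other, a file merged from these two inputs would be expected to report the same total.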

Subcommand: sample

Prints a random sample of records from the given Parquet file.

 pqrs sample data/pems-1.snappy.parquet --records 3
{timeperiod: "01/17/2016 07:01:27", flow1: 0, occupancy1: 0E0, speed1: 0E0, flow2: 0, occupancy2: 0E0, speed2: 0E0, flow3: 0, occupancy3: 0E0, speed3: 0E0, flow4: null, occupancy4: null, speed4: null, flow5: null, occupancy5: null, speed5: null, flow6: null, occupancy6: null, speed6: null, flow7: null, occupancy7: null, speed7: null, flow8: null, occupancy8: null, speed8: null}
{timeperiod: "01/17/2016 07:47:27", flow1: 0, occupancy1: 0E0, speed1: 0E0, flow2: 0, occupancy2: 0E0, speed2: 0E0, flow3: 0, occupancy3: 0E0, speed3: 0E0, flow4: null, occupancy4: null, speed4: null, flow5: null, occupancy5: null, speed5: null, flow6: null, occupancy6: null, speed6: null, flow7: null, occupancy7: null, speed7: null, flow8: null, occupancy8: null, speed8: null}
{timeperiod: "01/17/2016 09:44:27", flow1: 0, occupancy1: 0E0, speed1: 0E0, flow2: 0, occupancy2: 0E0, speed2: 0E0, flow3: 0, occupancy3: 0E0, speed3: 0E0, flow4: null, occupancy4: null, speed4: null, flow5: null, occupancy5: null, speed5: null, flow6: null, occupancy6: null, speed6: null, flow7: null, occupancy7: null, speed7: null, flow8: null, occupancy8: null, speed8: null}

Subcommand: schema

Prints the schema of the given Parquet file. Use the --detailed flag for more detailed statistics.

 pqrs schema data/cities.parquet
Metadata for file: data/cities.parquet

version: 1
num of rows: 3
created by: parquet-mr version 1.5.0-cdh5.7.0 (build ${buildNumber})
message hive_schema {
  OPTIONAL BYTE_ARRAY continent (UTF8);
  OPTIONAL group country {
    OPTIONAL BYTE_ARRAY name (UTF8);
    OPTIONAL group city (LIST) {
      REPEATED group bag {
        OPTIONAL BYTE_ARRAY array_element (UTF8);
      }
    }
  }
}
 pqrs schema data/cities.parquet --detailed

num of row groups: 1
row groups:

row group 0:
--------------------------------------------------------------------------------
total byte size: 466
num of rows: 3

num of columns: 3
columns:

column 0:
--------------------------------------------------------------------------------
column type: BYTE_ARRAY
column path: "continent"
encodings: BIT_PACKED PLAIN_DICTIONARY RLE
file path: N/A
file offset: 4
num of values: 3
total compressed size (in bytes): 93
total uncompressed size (in bytes): 93
data page offset: 4
index page offset: N/A
dictionary page offset: N/A
statistics: {min: [69, 117, 114, 111, 112, 101], max: [78, 111, 114, 116, 104, 32, 65, 109, 101, 114, 105, 99, 97], distinct_count: N/A, null_count: 0, min_max_deprecated: true}

<....output clipped>
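
In the detailed output above, the min and max statistics of the continent column are printed as lists of byte values. Since the schema marks that column as UTF8, the lists can be decoded back to text. A short Python sketch, with the byte lists copied from the transcript (decode_stat is an illustrative name):

```python
def decode_stat(byte_values):
    """Decode a list of byte values, as printed in the statistics, as UTF-8 text."""
    return bytes(byte_values).decode("utf-8")


# Byte lists copied from the `statistics:` line above.
stat_min = [69, 117, 114, 111, 112, 101]
stat_max = [78, 111, 114, 116, 104, 32, 65, 109, 101, 114, 105, 99, 97]

print(decode_stat(stat_min))  # Europe
print(decode_stat(stat_max))  # North America
```

The decoded values match the smallest and largest continent names in the cat output for the same file.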

 pqrs schema --json data/cities.parquet
{"version":1,"num_rows":3,"created_by":"parquet-mr version 1.5.0-cdh5.7.0 (build ${buildNumber})","metadata":null,"columns":[{"optional":"true","physical_type":"BYTE_ARRAY","name":"continent","path":"continent","converted_type":"UTF8"},{"name":"name","converted_type":"UTF8","path":"country.name","physical_type":"BYTE_ARRAY","optional":"true"},{"optional":"true","name":"array_element","physical_type":"BYTE_ARRAY","path":"country.city.bag.array_element","converted_type":"UTF8"}],"message":"message hive_schema {\n  OPTIONAL BYTE_ARRAY continent (UTF8);\n  OPTIONAL group country {\n    OPTIONAL BYTE_ARRAY name (UTF8);\n    OPTIONAL group city (LIST) {\n      REPEATED group bag {\n        OPTIONAL BYTE_ARRAY array_element (UTF8);\n      }\n    }\n  }\n}\n"}
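
The --json form of the schema is convenient for scripts, because each leaf column carries its full dotted path. The Python sketch below lists those paths from a trimmed copy of the output above; the keys used ("columns", "path", "physical_type", "converted_type", "optional") are the ones visible in that output, and leaf_columns is an illustrative name.

```python
import json

# Trimmed copy of the `pqrs schema --json` output above: the
# "created_by", "metadata" and "message" keys are left out for brevity.
schema_json = """
{"version":1,"num_rows":3,"columns":[
 {"optional":"true","physical_type":"BYTE_ARRAY","name":"continent","path":"continent","converted_type":"UTF8"},
 {"name":"name","converted_type":"UTF8","path":"country.name","physical_type":"BYTE_ARRAY","optional":"true"},
 {"optional":"true","name":"array_element","physical_type":"BYTE_ARRAY","path":"country.city.bag.array_element","converted_type":"UTF8"}
]}
"""


def leaf_columns(text):
    """Return (path, physical_type, converted_type) for every leaf column."""
    schema = json.loads(text)
    return [
        (col["path"], col["physical_type"], col["converted_type"])
        for col in schema["columns"]
    ]


columns = leaf_columns(schema_json)
for path, physical, converted in columns:
    print(f"{path}: {physical} ({converted})")

# In the output above "optional" is the string "true", not a JSON boolean.
optional_paths = [
    col["path"]
    for col in json.loads(schema_json)["columns"]
    if col["optional"] == "true"
]
```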

Subcommand: size

Prints the compressed or uncompressed size of the Parquet file. The uncompressed size is shown by default.

 pqrs size data/pems-1.snappy.parquet --pretty
Size in Bytes:

File Name: data/pems-1.snappy.parquet
Uncompressed Size: 61 KiB
 pqrs size data/pems-1.snappy.parquet --pretty --compressed
Size in Bytes:

File Name: data/pems-1.snappy.parquet
Compressed Size: 12 KiB
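
The two --pretty results above can be combined into an approximate compression ratio. The Python sketch below parses the size line from each output. Only KiB appears in the transcripts, so the other entries in the unit table are assumptions, and the result is approximate because the pretty sizes are rounded.

```python
import re

# Matches "Uncompressed Size: 61 KiB" or "Compressed Size: 12 KiB".
SIZE_LINE = re.compile(r"^(Uncompressed|Compressed) Size: (\d+(?:\.\d+)?) (\w+)$")

# Only KiB is seen in the transcripts; B, MiB and GiB are assumed unit names.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_pretty_size(text):
    """Return (kind, size_in_bytes) from `pqrs size --pretty` output."""
    for line in text.splitlines():
        match = SIZE_LINE.match(line.strip())
        if match:
            kind, value, unit = match.groups()
            return kind, float(value) * UNITS[unit]
    raise ValueError("no size line found")


# Sample outputs copied from the transcripts above.
uncompressed_output = (
    "Size in Bytes:\n\nFile Name: data/pems-1.snappy.parquet\nUncompressed Size: 61 KiB\n"
)
compressed_output = (
    "Size in Bytes:\n\nFile Name: data/pems-1.snappy.parquet\nCompressed Size: 12 KiB\n"
)

_, uncompressed = parse_pretty_size(uncompressed_output)
_, compressed = parse_pretty_size(compressed_output)
print(f"approximate compression ratio: {uncompressed / compressed:.1f}x")
```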

TODOs

  • Test on Windows

Dependencies

~29–39MB
~755K SLoC