10个版本

0.1.1	2024年6月2日
0.1.0	2024年5月28日
0.1.0-beta.8	2024年1月18日
0.1.0-beta.7	2023年9月15日
0.1.0-beta.1	2022年11月11日

#309 in 解析器实现

Apache-2.0

135KB
2.5K SLoC

PostgreSQL -> Parquet

一个简单的工具，可以将PostgreSQL表导出为Parquet，支持比int和text更复杂的Postgres特性。

安装

从GitHub下载二进制文件

从GitHub Actions工件中下载二进制文件（点击最新运行，滚动到页面底部，选择您的系统）。

使用Nix flakes

如果您使用Nix，此命令将安装最新的pg2parquet版本。它将从源代码编译，因此安装需要一些时间。

nix shell github:exyi/pg2parquet

然后在新的shell中使用pg2parquet。请注意，您可能需要在nix调用中添加--extra-experimental-features 'nix-command flakes'参数。

使用Cargo

cargo install pg2parquet

从源代码安装

安装Rust和Cargo。克隆仓库。

cd cli
env RUSTFLAGS="-C target-cpu=native" cargo build --release

这应该会在几分钟内完成（约10分钟CPU时间）。取target/release/pg2parquet文件，删除其余的目标目录（它占用相当多的磁盘空间）。您可以选择性地对二进制文件进行strip，但如果它崩溃，您将得到较差的堆栈跟踪。

基本用法

pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -t the_table_to_export

或者，您也可以导出SQL查询的结果

pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -q 'select column_a, column_b::text from another_table'

您也可以使用环境变量$PGPASSWORD和$PGUSER

支持的数据类型

基本SQL类型: text, char, varchar及其朋友，所有种类的int，bool，浮点数，timestamp，timestamptz，date，time，uuid
- interval - 在Parquet中（毫秒）的时间间隔比在Postgres中（微秒）的精度低，因此转换是有损的。有一个选项 --interval-handling=struct 可以不同地序列化它，而不进行四舍五入。
十进制数值类型
- numeric 将根据 --decimal-scale 和 --decimal-precision 参数具有固定精度。或者使用 --numeric-handling 将其写入浮点数或字符串。
- money 始终是带有2位小数的64位十进制数
json 和 jsonb：默认情况下，JSON序列化为文本字段。 --json-handling 选项允许将Parquet LogicalType设置为 JSON，但此功能并不广泛支持，因此默认情况下已禁用。
xml：序列化为文本
macaddr 和 inet：默认情况下以文本表示形式写入。可以使用 --macaddr-handling 选项将macaddr序列化为字节或Int64。
bit 和 varbit：表示为 0 和 1 的文本
枚举
- 默认情况下序列化为文本，使用 --enum-handling int 将其序列化为整数
范围
- 序列化为 struct { lower: T, upper: T, lower_inclusive: bool, upper_inclusive: bool, is_empty: bool }
数组
- 序列化为Parquet列表
- 始终序列化为单维数组，并且丢弃起始索引信息
复合类型
- 序列化为Parquet结构类型

已知限制（和解决方案）

并非所有PostgreSQL类型都受支持
- 解决方案：在PostgreSQL端将其转换为文本（或其他受支持类型）--query 'select weird_type_column::text from my_table'
- 请提交问题
我需要的文件格式略有不同（重命名列等）
- 解决方案1：使用 --query 参数来塑造结果架构
- 解决方案2：使用DuckDB或Spark对Parquet文件进行后处理
  - DuckDB COPY (SELECT my_col as myCol, ... FROM 'export.parquet') TO 'export2.parquet' (FORMAT PARQUET);

选项

>pg2parquet导出--帮助

Exports a PostgreSQL table or query to a Parquet file

Usage: pg2parquet export [OPTIONS] --output-file <OUTPUT_FILE> --host <HOST> --dbname <DBNAME>

Options:
  -o, --output-file <OUTPUT_FILE>
          Path to the output file. If the file exists, it will be overwritten

  -q, --query <QUERY>
          SQL query to execute. Exclusive with --table

  -t, --table <TABLE>
          Which table should be exported. Exclusive with --query

      --compression <COMPRESSION>
          Compression applied on the output file. Default: zstd, change to Snappy or None if it's too slow
          
          [possible values: none, snappy, gzip, lzo, brotli, lz4, zstd]

      --compression-level <COMPRESSION_LEVEL>
          Compression level of the output file compressor. Only relevant for zstd, brotli and gzip. Default: 3

      --quiet
          Avoid printing unnecessary information (schema and progress). Only errors will be written to stderr

  -H, --host <HOST>
          Database server host

  -U, --user <USER>
          Database user name. If not specified, PGUSER environment variable is used

  -d, --dbname <DBNAME>
          

  -p, --port <PORT>
          

      --password <PASSWORD>
          Password to use for the connection. It is recommended to use the PGPASSWORD environment variable instead, since process arguments are visible to other users on the system

      --sslmode <SSLMODE>
          Controls whether to use SSL/TLS to connect to the server

          Possible values:
          - disable: Do not use TLS
          - prefer:  Attempt to connect with TLS but allow sessions without (default behavior compiled with SSL support)
          - require: Require the use of TLS

      --ssl-root-cert <SSL_ROOT_CERT>
          File with a TLS root certificate in PEM or DER (.crt) format. When specified, the default CA certificates are considered untrusted. The option can be specified multiple times. Using this options implies --sslmode=require

      --macaddr-handling <MACADDR_HANDLING>
          How to handle `macaddr` columns
          
          [default: text]

          Possible values:
          - text:       MAC address is converted to a string
          - byte-array: MAC is stored as fixed byte array of length 6
          - int64:      MAC is stored in Int64 (lowest 6 bytes)

      --json-handling <JSON_HANDLING>
          How to handle `json` and `jsonb` columns
          
          [default: text]

          Possible values:
          - text-marked-as-json: JSON is stored as a Parquet JSON type. This is essentially the same as text, but with a different ConvertedType, so it may not be supported in all tools
          - text:                JSON is stored as a UTF8 text

      --enum-handling <ENUM_HANDLING>
          How to handle enum (Enumerated Type) columns
          
          [default: text]

          Possible values:
          - text:       Enum is stored as the postgres enum name, Parquet LogicalType is set to ENUM
          - plain-text: Enum is stored as the postgres enum name, Parquet LogicalType is set to String
          - int:        Enum is stored as an 32-bit integer (one-based index of the value in the enum definition)

      --interval-handling <INTERVAL_HANDLING>
          How to handle `interval` columns
          
          [default: interval]

          Possible values:
          - interval: Enum is stored as the Parquet INTERVAL type. This has lower precision than postgres interval (milliseconds instead of microseconds)
          - struct:   Enum is stored as struct { months: i32, days: i32, microseconds: i64 }, exactly as PostgreSQL stores it

      --numeric-handling <NUMERIC_HANDLING>
          How to handle `numeric` columns
          
          [default: double]

          Possible values:
          - decimal: Numeric is stored using the DECIMAL parquet type. Use --decimal-precision and --decimal-scale to set the desired precision and scale
          - double:  Numeric is converted to float64 (DOUBLE)
          - float32: Numeric is converted to float32 (FLOAT)
          - string:  Convert the numeric to a string and store it as UTF8 text. This option never looses precision. Note that text "NaN" may be present if NaN is present in the database

      --decimal-scale <DECIMAL_SCALE>
          How many decimal digits after the decimal point are stored in the Parquet file in DECIMAL data type
          
          [default: 18]

      --decimal-precision <DECIMAL_PRECISION>
          How many decimal digits are allowed in numeric/DECIMAL column. By default 38, the largest value which fits in 128 bits. If <= 9, the column is stored as INT32; if <= 18, the column is stored as INT64; otherwise BYTE_ARRAY
          
          [default: 38]

      --array-handling <ARRAY_HANDLING>
          Parquet does not support multi-dimensional arrays and arrays with different starting index. pg2parquet flattens the arrays, and this options allows including the stripped information in additional columns
          
          [default: plain]

          Possible values:
          - plain:                 Postgres arrays are simply stored as Parquet LIST
          - dimensions:            Postgres arrays are stored as struct of { data: List[T], dims: List[int] }
          - dimensions+lowerbound: Postgres arrays are stored as struct of { data: List[T], dims: List[int], lower_bound: List[int] }

依赖项

~28–45MB
~870K SLoC