#postgresql #parquet-file #parquet #command-line #command-line-tool

app pg2parquet

命令行工具,用于将PostgreSQL表或查询导出为Parquet文件

10个版本

0.1.1 2024年6月2日
0.1.0 2024年5月28日
0.1.0-beta.82024年1月18日
0.1.0-beta.72023年9月15日
0.1.0-beta.12022年11月11日

#309 in 解析器实现

Apache-2.0

135KB
2.5K SLoC

PostgreSQL -> Parquet

一个简单的工具,可以将PostgreSQL表导出为Parquet,支持比int和text更复杂的Postgres特性。

安装

从GitHub下载二进制文件

GitHub Actions工件中下载二进制文件(点击最新运行,滚动到页面底部,选择您的系统)。

使用Nix flakes

如果您使用Nix,此命令将安装最新的pg2parquet版本。它将从源代码编译,因此安装需要一些时间。

nix shell github:exyi/pg2parquet

然后在新的shell中使用pg2parquet。请注意,您可能需要在nix调用中添加--extra-experimental-features 'nix-command flakes'参数。

使用Cargo

cargo install pg2parquet

从源代码安装

安装Rust和Cargo。克隆仓库。

cd cli
env RUSTFLAGS="-C target-cpu=native" cargo build --release

这应该会在几分钟内完成(约10分钟CPU时间)。取target/release/pg2parquet文件,删除其余的目标目录(它占用相当多的磁盘空间)。您可以选择性地对二进制文件进行strip,但如果它崩溃,您将得到较差的堆栈跟踪。

基本用法

pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -t the_table_to_export

或者,您也可以导出SQL查询的结果

pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -q 'select column_a, column_b::text from another_table'

您也可以使用环境变量$PGPASSWORD$PGUSER

支持的数据类型

  • 基本SQL类型: text, char, varchar及其朋友,所有种类的intbool,浮点数,timestamptimestamptzdatetimeuuid
    • interval - 在Parquet中(毫秒)的时间间隔比在Postgres中(微秒)的精度低,因此转换是有损的。有一个选项 --interval-handling=struct 可以不同地序列化它,而不进行四舍五入。
  • 十进制数值类型
    • numeric 将根据 --decimal-scale--decimal-precision 参数具有固定精度。或者使用 --numeric-handling 将其写入浮点数或字符串。
    • money 始终是带有2位小数的64位十进制数
  • jsonjsonb:默认情况下,JSON序列化为文本字段。 --json-handling 选项允许将Parquet LogicalType设置为 JSON,但此功能并不广泛支持,因此默认情况下已禁用。
  • xml:序列化为文本
  • macaddrinet:默认情况下以文本表示形式写入。可以使用 --macaddr-handling 选项将macaddr序列化为字节或Int64。
  • bitvarbit:表示为 01 的文本
  • 枚举
    • 默认情况下序列化为文本,使用 --enum-handling int 将其序列化为整数
  • 范围
    • 序列化为 struct { lower: T, upper: T, lower_inclusive: bool, upper_inclusive: bool, is_empty: bool }
  • 数组
    • 序列化为Parquet列表
    • 始终序列化为单维数组,并且丢弃起始索引信息
  • 复合类型
    • 序列化为Parquet结构类型

已知限制(和解决方案)

  • 并非所有PostgreSQL类型都受支持
    • 解决方案:在PostgreSQL端将其转换为文本(或其他受支持类型)--query 'select weird_type_column::text from my_table'
    • 提交问题
  • 我需要的文件格式略有不同(重命名列等)
    • 解决方案1:使用 --query 参数来塑造结果架构
    • 解决方案2:使用DuckDB或Spark对Parquet文件进行后处理
      • DuckDB COPY (SELECT my_col as myCol, ... FROM 'export.parquet') TO 'export2.parquet' (FORMAT PARQUET);

选项

>pg2parquet导出--帮助

Exports a PostgreSQL table or query to a Parquet file

Usage: pg2parquet export [OPTIONS] --output-file <OUTPUT_FILE> --host <HOST> --dbname <DBNAME>

Options:
  -o, --output-file <OUTPUT_FILE>
          Path to the output file. If the file exists, it will be overwritten

  -q, --query <QUERY>
          SQL query to execute. Exclusive with --table

  -t, --table <TABLE>
          Which table should be exported. Exclusive with --query

      --compression <COMPRESSION>
          Compression applied on the output file. Default: zstd, change to Snappy or None if it's too slow
          
          [possible values: none, snappy, gzip, lzo, brotli, lz4, zstd]

      --compression-level <COMPRESSION_LEVEL>
          Compression level of the output file compressor. Only relevant for zstd, brotli and gzip. Default: 3

      --quiet
          Avoid printing unnecessary information (schema and progress). Only errors will be written to stderr

  -H, --host <HOST>
          Database server host

  -U, --user <USER>
          Database user name. If not specified, PGUSER environment variable is used

  -d, --dbname <DBNAME>
          

  -p, --port <PORT>
          

      --password <PASSWORD>
          Password to use for the connection. It is recommended to use the PGPASSWORD environment variable instead, since process arguments are visible to other users on the system

      --sslmode <SSLMODE>
          Controls whether to use SSL/TLS to connect to the server

          Possible values:
          - disable: Do not use TLS
          - prefer:  Attempt to connect with TLS but allow sessions without (default behavior compiled with SSL support)
          - require: Require the use of TLS

      --ssl-root-cert <SSL_ROOT_CERT>
          File with a TLS root certificate in PEM or DER (.crt) format. When specified, the default CA certificates are considered untrusted. The option can be specified multiple times. Using this options implies --sslmode=require

      --macaddr-handling <MACADDR_HANDLING>
          How to handle `macaddr` columns
          
          [default: text]

          Possible values:
          - text:       MAC address is converted to a string
          - byte-array: MAC is stored as fixed byte array of length 6
          - int64:      MAC is stored in Int64 (lowest 6 bytes)

      --json-handling <JSON_HANDLING>
          How to handle `json` and `jsonb` columns
          
          [default: text]

          Possible values:
          - text-marked-as-json: JSON is stored as a Parquet JSON type. This is essentially the same as text, but with a different ConvertedType, so it may not be supported in all tools
          - text:                JSON is stored as a UTF8 text

      --enum-handling <ENUM_HANDLING>
          How to handle enum (Enumerated Type) columns
          
          [default: text]

          Possible values:
          - text:       Enum is stored as the postgres enum name, Parquet LogicalType is set to ENUM
          - plain-text: Enum is stored as the postgres enum name, Parquet LogicalType is set to String
          - int:        Enum is stored as an 32-bit integer (one-based index of the value in the enum definition)

      --interval-handling <INTERVAL_HANDLING>
          How to handle `interval` columns
          
          [default: interval]

          Possible values:
          - interval: Enum is stored as the Parquet INTERVAL type. This has lower precision than postgres interval (milliseconds instead of microseconds)
          - struct:   Enum is stored as struct { months: i32, days: i32, microseconds: i64 }, exactly as PostgreSQL stores it

      --numeric-handling <NUMERIC_HANDLING>
          How to handle `numeric` columns
          
          [default: double]

          Possible values:
          - decimal: Numeric is stored using the DECIMAL parquet type. Use --decimal-precision and --decimal-scale to set the desired precision and scale
          - double:  Numeric is converted to float64 (DOUBLE)
          - float32: Numeric is converted to float32 (FLOAT)
          - string:  Convert the numeric to a string and store it as UTF8 text. This option never looses precision. Note that text "NaN" may be present if NaN is present in the database

      --decimal-scale <DECIMAL_SCALE>
          How many decimal digits after the decimal point are stored in the Parquet file in DECIMAL data type
          
          [default: 18]

      --decimal-precision <DECIMAL_PRECISION>
          How many decimal digits are allowed in numeric/DECIMAL column. By default 38, the largest value which fits in 128 bits. If <= 9, the column is stored as INT32; if <= 18, the column is stored as INT64; otherwise BYTE_ARRAY
          
          [default: 38]

      --array-handling <ARRAY_HANDLING>
          Parquet does not support multi-dimensional arrays and arrays with different starting index. pg2parquet flattens the arrays, and this options allows including the stripped information in additional columns
          
          [default: plain]

          Possible values:
          - plain:                 Postgres arrays are simply stored as Parquet LIST
          - dimensions:            Postgres arrays are stored as struct of { data: List[T], dims: List[int] }
          - dimensions+lowerbound: Postgres arrays are stored as struct of { data: List[T], dims: List[int], lower_bound: List[int] }

依赖项

~28–45MB
~870K SLoC