10 versions

Version | Release date
---|---
0.1.1 | June 2, 2024
0.1.0 | May 28, 2024
0.1.0-beta.8 | January 18, 2024
0.1.0-beta.7 | September 15, 2023
0.1.0-beta.1 | November 11, 2022
PostgreSQL -> Parquet
A simple tool for exporting PostgreSQL tables to Parquet, with support for Postgres features more complex than int and text.
Installation
Downloading a binary from GitHub
Download a binary from the GitHub Actions artifacts (click the latest run, scroll to the bottom of the page, and pick the build for your system).
Using Nix flakes
If you use Nix, this command installs the latest pg2parquet version. It is compiled from source, so the installation takes a while.
nix shell github:exyi/pg2parquet
Then use pg2parquet in the new shell. Note that you might need to add the --extra-experimental-features 'nix-command flakes' argument to the nix invocation.
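For example, on a system where flakes are not enabled globally, the complete invocation could look like this sketch:

```bash
# Enable the flake features for this single nix invocation
nix shell --extra-experimental-features 'nix-command flakes' github:exyi/pg2parquet

# pg2parquet is now available inside the new shell
pg2parquet --help
```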
Using Cargo
cargo install pg2parquet
Installing from source
Install Rust and Cargo, then clone the repository.
cd cli
env RUSTFLAGS="-C target-cpu=native" cargo build --release
This should finish within a few minutes (~10 minutes of CPU time). Take the target/release/pg2parquet file and delete the rest of the target directory (it takes up quite a bit of disk space). You can optionally strip the binary, but if it crashes you will get a worse stack trace.
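Putting the steps together, a minimal build script might look like the following sketch (it assumes the repository referenced by the flake above, and ~/.local/bin as an example install location):

```bash
# Clone the repository and build the CLI with native CPU optimizations
git clone https://github.com/exyi/pg2parquet.git
cd pg2parquet/cli
env RUSTFLAGS="-C target-cpu=native" cargo build --release

# Keep only the binary (destination directory is just an example),
# then drop the rest of target/ to reclaim disk space
cp target/release/pg2parquet ~/.local/bin/
cargo clean
```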
Basic usage
pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -t the_table_to_export
Alternatively, you can export the result of an SQL query:
pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -q 'select column_a, column_b::text from another_table'
You can also use the $PGPASSWORD and $PGUSER environment variables.
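For example, credentials can be supplied through the environment instead of command-line arguments (the user, password, host, database, and table names here are placeholders):

```bash
# PGUSER/PGPASSWORD are picked up from the environment, so the password
# never appears in the process argument list
export PGUSER=readonly_user
export PGPASSWORD='s3cret'
pg2parquet export --host localhost.for.example --dbname my_database \
    --output-file output.parquet -t the_table_to_export
```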
Supported data types
- Basic SQL types: text, char, varchar and friends, all kinds of ints, bool, floating-point numbers, timestamp, timestamptz, date, time, uuid
- interval
  - The Parquet interval (millisecond precision) is less precise than the Postgres one (microsecond precision), so the conversion is lossy. The --interval-handling=struct option serializes it differently, without the rounding (see the example after this list).
- Decimal numeric types
  - numeric gets a fixed precision based on the --decimal-scale and --decimal-precision arguments. Alternatively, use --numeric-handling to write it as a float or a string. money is always a 64-bit decimal with 2 decimal places.
- json and jsonb: by default, JSON is serialized into a text field. The --json-handling option allows setting the Parquet LogicalType to JSON, but that is not widely supported, so it is disabled by default.
- xml: serialized as text
- macaddr and inet: written in their text representation by default. The --macaddr-handling option can serialize macaddr as bytes or as an Int64.
- bit and varbit: represented as text of 0s and 1s
- Enums
  - serialized as text by default; use --enum-handling int to serialize them as integers
- Ranges
  - serialized as struct { lower: T, upper: T, lower_inclusive: bool, upper_inclusive: bool, is_empty: bool }
- Arrays
  - serialized as Parquet lists
  - always serialized as single-dimensional arrays; the starting-index information is discarded
- Composite types
  - serialized as Parquet struct types
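As an illustration of the handling options above, the following sketch exports a table while avoiding the lossy interval conversion and keeping numeric values exact (host, database, and table names are the placeholders from the earlier examples):

```bash
# Keep intervals as an exact struct and numerics as precision-preserving strings
pg2parquet export --host localhost.for.example --dbname my_database \
    --output-file output.parquet -t the_table_to_export \
    --interval-handling=struct --numeric-handling=string
```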
Known limitations (and workarounds)
- Not all PostgreSQL types are supported
  - Workaround: cast the column to text (or another supported type) on the PostgreSQL side: --query 'select weird_type_column::text from my_table'
  - Please file an issue
- I need a slightly different file format (renamed columns, etc.)
  - Workaround 1: use the --query argument to shape the resulting schema
  - Workaround 2: post-process the Parquet file using DuckDB or Spark. In DuckDB, for example:
    COPY (SELECT my_col as myCol, ... FROM 'export.parquet') TO 'export2.parquet' (FORMAT PARQUET);
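A quick way to sanity-check the result of either workaround is the DuckDB command-line client (assuming the file names used above):

```bash
# Print the inferred schema and a few sample rows of the exported file
duckdb -c "DESCRIBE SELECT * FROM 'export2.parquet'"
duckdb -c "SELECT * FROM 'export2.parquet' LIMIT 5"
```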
Options
> pg2parquet export --help
Exports a PostgreSQL table or query to a Parquet file
Usage: pg2parquet export [OPTIONS] --output-file <OUTPUT_FILE> --host <HOST> --dbname <DBNAME>
Options:
-o, --output-file <OUTPUT_FILE>
Path to the output file. If the file exists, it will be overwritten
-q, --query <QUERY>
SQL query to execute. Exclusive with --table
-t, --table <TABLE>
Which table should be exported. Exclusive with --query
--compression <COMPRESSION>
Compression applied on the output file. Default: zstd, change to Snappy or None if it's too slow
[possible values: none, snappy, gzip, lzo, brotli, lz4, zstd]
--compression-level <COMPRESSION_LEVEL>
Compression level of the output file compressor. Only relevant for zstd, brotli and gzip. Default: 3
--quiet
Avoid printing unnecessary information (schema and progress). Only errors will be written to stderr
-H, --host <HOST>
Database server host
-U, --user <USER>
Database user name. If not specified, PGUSER environment variable is used
-d, --dbname <DBNAME>
-p, --port <PORT>
--password <PASSWORD>
Password to use for the connection. It is recommended to use the PGPASSWORD environment variable instead, since process arguments are visible to other users on the system
--sslmode <SSLMODE>
Controls whether to use SSL/TLS to connect to the server
Possible values:
- disable: Do not use TLS
- prefer: Attempt to connect with TLS but allow sessions without (default behavior when compiled with SSL support)
- require: Require the use of TLS
--ssl-root-cert <SSL_ROOT_CERT>
File with a TLS root certificate in PEM or DER (.crt) format. When specified, the default CA certificates are considered untrusted. The option can be specified multiple times. Using this option implies --sslmode=require
--macaddr-handling <MACADDR_HANDLING>
How to handle `macaddr` columns
[default: text]
Possible values:
- text: MAC address is converted to a string
- byte-array: MAC is stored as a fixed byte array of length 6
- int64: MAC is stored in Int64 (lowest 6 bytes)
--json-handling <JSON_HANDLING>
How to handle `json` and `jsonb` columns
[default: text]
Possible values:
- text-marked-as-json: JSON is stored as a Parquet JSON type. This is essentially the same as text, but with a different ConvertedType, so it may not be supported in all tools
- text: JSON is stored as a UTF8 text
--enum-handling <ENUM_HANDLING>
How to handle enum (Enumerated Type) columns
[default: text]
Possible values:
- text: Enum is stored as the postgres enum name, Parquet LogicalType is set to ENUM
- plain-text: Enum is stored as the postgres enum name, Parquet LogicalType is set to String
- int: Enum is stored as a 32-bit integer (one-based index of the value in the enum definition)
--interval-handling <INTERVAL_HANDLING>
How to handle `interval` columns
[default: interval]
Possible values:
- interval: Interval is stored as the Parquet INTERVAL type. This has lower precision than the postgres interval (milliseconds instead of microseconds)
- struct: Interval is stored as struct { months: i32, days: i32, microseconds: i64 }, exactly as PostgreSQL stores it
--numeric-handling <NUMERIC_HANDLING>
How to handle `numeric` columns
[default: double]
Possible values:
- decimal: Numeric is stored using the DECIMAL parquet type. Use --decimal-precision and --decimal-scale to set the desired precision and scale
- double: Numeric is converted to float64 (DOUBLE)
- float32: Numeric is converted to float32 (FLOAT)
- string: Convert the numeric to a string and store it as UTF8 text. This option never loses precision. Note that text "NaN" may be present if NaN is present in the database
--decimal-scale <DECIMAL_SCALE>
How many decimal digits after the decimal point are stored in the Parquet file in DECIMAL data type
[default: 18]
--decimal-precision <DECIMAL_PRECISION>
How many decimal digits are allowed in numeric/DECIMAL column. By default 38, the largest value which fits in 128 bits. If <= 9, the column is stored as INT32; if <= 18, the column is stored as INT64; otherwise BYTE_ARRAY
[default: 38]
--array-handling <ARRAY_HANDLING>
Parquet does not support multi-dimensional arrays or arrays with a different starting index. pg2parquet flattens the arrays, and this option allows including the stripped information in additional columns
[default: plain]
Possible values:
- plain: Postgres arrays are simply stored as Parquet LIST
- dimensions: Postgres arrays are stored as struct of { data: List[T], dims: List[int] }
- dimensions+lowerbound: Postgres arrays are stored as struct of { data: List[T], dims: List[int], lower_bound: List[int] }
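Putting several options together, a fuller export command might look like this sketch (all connection and table names are placeholders; every flag is documented in the help output above):

```bash
# Snappy compression, exact decimals stored as INT64 (precision 18, scale 4),
# and array dimension metadata preserved
pg2parquet export \
    --host localhost.for.example --dbname my_database \
    -t the_table_to_export -o output.parquet \
    --compression snappy \
    --numeric-handling decimal --decimal-precision 18 --decimal-scale 4 \
    --array-handling dimensions
```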