38 versions

0.18.1	Jun 8, 2024
0.17.10	Feb 5, 2024
0.17.8	Dec 20, 2023
0.17.3	Nov 20, 2023
0.1.4	Mar 29, 2021

#2232 in Parser implementations

218 downloads per month

MIT/Apache

20KB
307 lines

CSV to Parquet

Convert CSV files to Apache Parquet. This package is part of the Arrow CLI tools.

Installation

Download prebuilt binaries

You can get the latest release from https://github.com/domoritz/arrow-tools/releases.

With Homebrew

brew install domoritz/homebrew-tap/csv2parquet

With Cargo

cargo install csv2parquet

With Cargo B(inary)Install

To avoid recompiling and to speed up installation, you can install this tool with cargo binstall:

cargo binstall csv2parquet

Usage

Usage: csv2parquet [OPTIONS] <CSV> <PARQUET>

Arguments:
  <CSV>      Input CSV file, stdin if not present
  <PARQUET>  Output file

Options:
  -s, --schema-file <SCHEMA_FILE>
          File with Arrow schema in JSON format
      --max-read-records <MAX_READ_RECORDS>
          The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed
      --header <HEADER>
          Set whether the CSV file has headers [possible values: true, false]
  -d, --delimiter <DELIMITER>
          Set the CSV file's column delimiter as a byte character [default: ,]
  -c, --compression <COMPRESSION>
          Set the compression [possible values: uncompressed, snappy, gzip, lzo, brotli, lz4, zstd, lz4-raw]
  -e, --encoding <ENCODING>
          Sets encoding for any column [possible values: plain, plain-dictionary, rle, rle-dictionary, delta-binary-packed, delta-length-byte-array, delta-byte-array, byte-stream-split]
      --data-page-size-limit <DATA_PAGE_SIZE_LIMIT>
          Sets data page size limit
      --dictionary-page-size-limit <DICTIONARY_PAGE_SIZE_LIMIT>
          Sets dictionary page size limit
      --write-batch-size <WRITE_BATCH_SIZE>
          Sets write batch size
      --max-row-group-size <MAX_ROW_GROUP_SIZE>
          Sets max size for a row group
      --created-by <CREATED_BY>
          Sets "created by" property
      --dictionary
          Sets flag to enable/disable dictionary encoding for any column
      --statistics <STATISTICS>
          Sets flag to enable/disable statistics for any column [possible values: none, chunk, page]
      --max-statistics-size <MAX_STATISTICS_SIZE>
          Sets max statistics size for any column. Applicable only if statistics are enabled
  -p, --print-schema
          Print the schema to stderr
  -n, --dry
          Only print the schema
  -h, --help
          Print help
  -V, --version
          Print version

The --schema-file option uses the same file format as --dry and --print-schema.
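
As a small illustration of combining the options listed above, the sketch below converts a semicolon-delimited file with a header row and writes zstd-compressed output. The file names `colors.csv` and `colors.parquet` and the sample rows are placeholders chosen for this example. Only flags and values from the help text are used, and the conversion runs only if csv2parquet is found on PATH.

```shell
#!/bin/sh
# Sample semicolon-delimited CSV with a header row (illustrative data).
printf 'id;label\n1;red\n2;blue\n' > colors.csv

if command -v csv2parquet >/dev/null 2>&1; then
  # Quote ';' so the shell does not treat it as a command separator.
  csv2parquet --header true --delimiter ';' --compression zstd colors.csv colors.parquet
else
  echo "csv2parquet not found on PATH; skipping conversion" >&2
fi
```

Since no schema file is given here, the column types are inferred from the data.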

Examples

Convert a CSV to Parquet

csv2parquet data.csv data.parquet

Convert a CSV with no `header` to Parquet

csv2parquet --header false <CSV> <PARQUET>

Get the `schema` from a CSV with a header

csv2parquet --header true --dry <CSV> <PARQUET>
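
One way to use the dry run is to capture the inferred schema, adjust it by hand, and feed it back through --schema-file. The sketch below is hedged: it assumes the dry run writes the schema JSON to stdout, which the help text does not state (it only says --print-schema goes to stderr). If your build prints to stderr instead, redirect with `2>` rather than `>`. The file names and sample data are placeholders.

```shell
#!/bin/sh
# Sample CSV with a header row (illustrative data).
printf 'name,age\nalice,30\nbob,25\n' > people.csv

if command -v csv2parquet >/dev/null 2>&1; then
  # Dry run: infer the schema without writing any Parquet output.
  # Assumption: the schema JSON arrives on stdout.
  csv2parquet --header true --dry people.csv people.parquet > inferred-schema.json

  # Edit inferred-schema.json if needed, then reuse it for the real conversion.
  csv2parquet --header true --schema-file inferred-schema.json people.csv people.parquet
else
  echo "csv2parquet not found on PATH; skipping" >&2
fi
```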

Convert a CSV to Parquet using `schema-file`

Below is an example of the schema-file content:

{
  "fields": [
    {
      "name": "col1",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": "col2",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}

Then add the arguments --schema-file schema.json to the command:

csv2parquet --header false --schema-file schema.json <CSV> <PARQUET>
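
As a minimal end-to-end sketch of the steps above, the following script writes a two-row headerless CSV and a matching schema file, then runs the conversion only if csv2parquet is found on PATH. The names `rows.csv`, `schema.json`, and `rows.parquet` and the sample rows are placeholders chosen for illustration.

```shell
#!/bin/sh
# Sample headerless CSV with two string columns (illustrative data).
printf 'alice,london\nbob,paris\n' > rows.csv

# Schema file in the same layout as the example above:
# two non-nullable Utf8 columns named col1 and col2.
cat > schema.json <<'EOF'
{
  "fields": [
    { "name": "col1", "data_type": "Utf8", "nullable": false, "dict_id": 0, "dict_is_ordered": false, "metadata": {} },
    { "name": "col2", "data_type": "Utf8", "nullable": false, "dict_id": 0, "dict_is_ordered": false, "metadata": {} }
  ],
  "metadata": {}
}
EOF

if command -v csv2parquet >/dev/null 2>&1; then
  # The column names come from schema.json because the CSV has no header row.
  csv2parquet --header false --schema-file schema.json rows.csv rows.parquet
else
  echo "csv2parquet not found on PATH; skipping conversion" >&2
fi
```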

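Convert streams piped from standard input to standard output

This technique lets you avoid writing large files to disk. For example, here we stream a CSV file from a URL to S3.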

curl <FILE_URL> | csv2parquet /dev/stdin /dev/stdout | aws s3 cp - <S3_DESTINATION>

Dependencies

~35MB
~690K SLoC