evolution - Rust application // Lib.rs

1 stable release

1.0.0	May 15, 2024
0.1.0	~~Nov 21, 2023~~

#351 in Parser implementations

Custom license

160KB
3K SLoC

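🦖 Evolve your fixed-length data files into Apache Parquet, fully parallelized!

🔎 Overview

This repository hosts the evolution program, which lets you both convert existing fixed-length files into other data formats and quickly generate large amounts of mocked data. The program supports full parallelism and uses SIMD techniques, where possible, for highly efficient data processing.

To get started, follow the installation, schema setup, and example usage sections of this README below. Happy coding! 👋🥳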

All of the code in this repository is open source and should be licensed according to LICENSE; see that file for more information.

📦 Installation

The easiest way to install the evolution binary on your system is with the Cargo package manager, which downloads it from crates.io.

$ cargo install evolution
    
(available features)    
 - rayon
 - nightly

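Alternatively, you can build from source by cloning the repository and compiling with Cargo. See below for more information on the available optional features.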

$ git clone https://github.com/firelink-data/evolution.git
$ cd evolution
$ cargo build --release

(optional: copy the binary to your user's binary folder)
$ cp ./target/release/evolution /usr/bin/evolution

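Installing with the rayon feature uses the rayon crate for parallel execution instead of standard-library threads. It also enables the chunked conversion mode. See the rayon reference for more information.
Installing with the nightly feature uses the nightly toolchain, which is inherently unstable. To run this version you need the nightly toolchain installed on your system; you can install it by running rustup install nightly in your shell.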

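📝 Schema setup

In evolution, every available command requires an existing, valid schema. In this context, a schema is a json file that specifies the layout of the contents of a fixed-length file (flf). Every schema you use must adhere to the project's schema template. If you are unsure whether your schema file conforms to the template, you can check it with the validation tool the project links to.

An example schema from the repository has the following contents: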

{
    "name": "EvolutionExampleSchema",
    "version": 1337,
    "columns": [
        {
            "name": "id",
            "offset": 0,
            "length": 9,
            "dtype": "Int32",
            "alignment": "Right",
            "pad_symbol": "Underscore",
            "is_nullable": false
        },
        {
            "name": "name",
            "offset": 9,
            "length": 32,
            "dtype": "Utf8",
            "is_nullable": true
        },
        {
            "name": "city",
            "offset": 41,
            "length": 32,
            "dtype": "Utf8",
            "alignment": "Right",
            "pad_symbol": "Backslash",
            "is_nullable": false
        },
        {
            "name": "employed",
            "offset": 73,
            "length": 5,
            "dtype": "Boolean",
            "alignment": "Center",
            "pad_symbol": "Asterisk",
            "is_nullable": true
        }
    ]
}

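If you are unsure which values are valid for the dtype, alignment, and pad_symbol fields, see the template, which lists all valid values.
Every column must provide the fields name, offset, length, and is_nullable, whereas alignment and pad_symbol can be omitted (as the name column in this example shows). If omitted, they default to "Right" and "Whitespace" respectively.
The defaults come from the padder crate, which defines the enums Alignment and Symbol with the default implementations Alignment::Right and Symbol::Whitespace respectively.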
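To make the layout concrete, here is a minimal sketch of how one row could be sliced into fields using offset, length, alignment, and pad_symbol. It is not the program's actual implementation. The Column struct, the helper functions, the sample row, and the mapping of pad symbol names to characters are all assumptions made for illustration, as is the reading that a right-aligned value is padded on its left. The sketch also assumes ASCII data, so byte offsets and character offsets coincide.

```rust
/// Side of the field on which the value sits; padding fills the other side.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Alignment {
    Left,
    Right,
    Center,
}

/// Hypothetical in-memory mirror of one entry in the schema's "columns" array.
struct Column {
    name: &'static str,
    offset: usize,
    length: usize,
    alignment: Alignment,
    pad: char,
}

/// Columns of the example schema. The pad symbol names are mapped to
/// characters by assumption: Underscore, Whitespace, Backslash, Asterisk.
fn example_columns() -> Vec<Column> {
    vec![
        Column { name: "id", offset: 0, length: 9, alignment: Alignment::Right, pad: '_' },
        // "name" omits alignment and pad_symbol, so the defaults apply.
        Column { name: "name", offset: 9, length: 32, alignment: Alignment::Right, pad: ' ' },
        Column { name: "city", offset: 41, length: 32, alignment: Alignment::Right, pad: '\\' },
        Column { name: "employed", offset: 73, length: 5, alignment: Alignment::Center, pad: '*' },
    ]
}

/// Builds one made-up 78-byte row that matches the example schema.
fn example_line() -> String {
    let id = "_".repeat(6) + "123";
    let name = " ".repeat(27) + "Alice";
    let city = "\\".repeat(23) + "Stockholm";
    let employed = "true*";
    format!("{}{}{}{}", id, name, city, employed)
}

/// Slices one field out of a row and strips its padding.
/// Returns None if the row is too short for the column.
fn extract<'a>(line: &'a str, col: &Column) -> Option<&'a str> {
    let raw = line.get(col.offset..col.offset + col.length)?;
    Some(match col.alignment {
        Alignment::Right => raw.trim_start_matches(col.pad),
        Alignment::Left => raw.trim_end_matches(col.pad),
        Alignment::Center => raw.trim_matches(col.pad),
    })
}

fn main() {
    let line = example_line();
    for col in example_columns() {
        println!("{} = {:?}", col.name, extract(&line, &col));
    }
}
```

One limitation of trimming this way: a value that itself starts or ends with the pad symbol would be over-trimmed, so a real parser would need a stricter rule.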

🚀 Example usage

If you installed the program as described above, simply running the binary will show the following helpful usage output:

🦖 Evolve your fixed-length data files into Apache Arrow tables, fully parallelized!

Usage: evolution [OPTIONS] <COMMAND>

Commands:
  convert  Convert a fixed-length file (.flf) to parquet
  mock     Generate mocked fixed-length files (.flf) for testing purposes
  help     Print this message or the help of the given subcommand(s)

Options:
      --n-threads <NUM-THREADS>  Set the number of threads (logical cores) to use when multi-threading [default: 1]
  -h, --help                     Print help
  -V, --version                  Print version

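As shown above, the program's functionality consists of two main commands: convert and mock. If you installed the program with the rayon feature, you also have access to a third command called c-convert. This stands for chunked-convert and is an alternative implementation. Documentation for this command is in progress.

If you want to see debug output during execution, set the environment variable RUST_LOG to DEBUG before running the program.

🏗️👷‍♂️ Converting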

Convert a fixed-length file (.flf) to parquet

Usage: evolution convert [OPTIONS] --in-file <IN-FILE> --out-file <OUT-FILE> --schema <SCHEMA>

Options:
  -i, --in-file <IN-FILE>
          The fixed-length file to convert
  -o, --out-file <OUT-FILE>
          Specify output (target) file name
  -s, --schema <SCHEMA>
          Specify the .json schema file to use when converting
      --buffer-size <BUFFER-SIZE>
          Set the size of the buffer (in bytes)
      --thread-channel-capacity <THREAD-CHANNEL-CAPACITY>
          Set the capacity of the thread channel (number of messages)
  -h, --help
          Print help

To convert a fixed-length file called old-data.flf into a parquet file called converted.parquet, with the associated schema located at ./my/path/to/schema.json, you can run the following command:

$ evolution convert --in-file old-data.flf --out-file converted.parquet --schema ./my/path/to/schema.json

👨‍🎨 Mocking

Generate mocked fixed-length files (.flf) for testing purposes

Usage: evolution mock [OPTIONS] --schema <SCHEMA>

Options:
  -s, --schema <SCHEMA>
          Specify the .json schema file to mock data for
  -o, --out-file <OUT-FILE>
          Specify output (target) file name
  -n, --n-rows <NUM-ROWS>
          Set the number of rows to generate [default: 100]
      --force-new
          Set the writer option to fail if the file already exists
      --truncate-existing
          Set the writer option to truncate a previous file if the out file already exists
      --buffer-size <MOCKER-BUFFER-SIZE>
          Set the size of the buffer (number of rows)
      --thread-channel-capacity <MOCKER-THREAD-CHANNEL-CAPACITY>
          Set the capacity of the thread channel (number of messages)
  -h, --help
          Print help

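For example, to mock 1 billion rows of a fixed-length file from the schema located at ./my/path/to/schema.json, with the output name mocked-data.flf, and to require that the file does not already exist, you can run the following command: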

$ evolution mock --schema ./my/path/to/schema.json --out-file mocked-data.flf --n-rows 1000000000 --force-new
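Before mocking a billion rows it is worth estimating how much disk space the output will need. The sketch below does that arithmetic for the example schema from the schema setup section. It assumes ASCII data and one newline byte per row; whether evolution actually writes a row terminator is an assumption here, and the function name is hypothetical.

```rust
/// Column lengths of the example schema: id, name, city, employed.
const EXAMPLE_LENGTHS: [u64; 4] = [9, 32, 32, 5];

/// Estimated file size in bytes: (sum of column lengths + row terminator) * rows.
fn estimated_bytes(lengths: &[u64], n_rows: u64, newline_bytes: u64) -> u64 {
    let row_bytes: u64 = lengths.iter().sum();
    (row_bytes + newline_bytes) * n_rows
}

fn main() {
    // Assumption: each row ends with a single newline byte.
    let bytes = estimated_bytes(&EXAMPLE_LENGTHS, 1_000_000_000, 1);
    println!("approximately {} GB", bytes / 1_000_000_000);
}
```

Under these assumptions each row takes 79 bytes, so a billion rows come to roughly 79 GB.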

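🧵 Threading

The program has a global option, --n-threads, which specifies whether the invoked command executes in single-threaded or multithreaded mode. Its value should be a number giving how many threads (logical cores) you want to use. If you request more threads than the system has logical cores, the program uses all available logical cores. If the option is omitted, the program runs in single-threaded mode.

Note that multithreading only really improves performance for large workloads.

If you are unsure how many logical cores your CPU has, the easiest way to find out is to run the program with --n-threads set to a large number. The program checks how many logical cores you have and whether the option exceeds that number. If the value you pass is greater than the number of logical cores on the system, the number of available logical cores is printed to stdout.

Depending on your host system, you can also use one of the following commands.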

Windows

$ Get-WmiObject Win32_Processor | Select-Object Name, NumberOfCores, NumberOfLogicalProcessors

Use the value found under NumberOfLogicalProcessors.

Unix

$ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('

The number of logical cores is calculated as: threads per core × cores per socket × sockets.
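The thread-count rule described in this section, and the lscpu arithmetic, can be sketched in a few lines of Rust. This is an illustration, not the program's source: both function names are hypothetical, and treating a request of 0 as 1 is an assumption. Note that std::thread::available_parallelism may report fewer cores than the hardware has if the process is restricted, for example by CPU affinity or container limits.

```rust
use std::thread;

/// Caps the requested thread count at the number of available logical cores.
/// Assumption: a request of 0 is treated as 1 (single-threaded).
fn effective_threads(requested: usize, available: usize) -> usize {
    requested.clamp(1, available.max(1))
}

/// Logical cores as derived from lscpu output:
/// threads per core * cores per socket * sockets.
fn logical_cores(threads_per_core: usize, cores_per_socket: usize, sockets: usize) -> usize {
    threads_per_core * cores_per_socket * sockets
}

fn main() {
    // Ask the standard library how much parallelism is available,
    // falling back to 1 if it cannot be determined.
    let available = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);

    let requested = 1024; // a deliberately large request
    println!(
        "requested {}, available {}, using {}",
        requested,
        available,
        effective_threads(requested, available)
    );
}
```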

📜 License

All code is subject to the general MIT license; see LICENSE for details.

Dependencies

~30–42MB
~819K SLoC