#proc-macro #orc #apache-orc #macro-derive

orcxx_derive

Procedural macros to deserialize structures from Apache ORC using orcxx

9 releases (4 breaking)

Uses old Rust 2015

0.5.0 Feb 8, 2024
0.4.2 Oct 13, 2023
0.3.0 Aug 24, 2023
0.2.2 Aug 10, 2023
0.1.0 Aug 7, 2023

#2006 in Encoding

36 downloads per month

GPL-3.0-or-later

1MB
19K SLoC

C++ 16K SLoC // 0.1% comments
Rust 3.5K SLoC // 0.1% comments
Python 370 SLoC // 0.1% comments
Shell 28 SLoC

orcxx-rs

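Rust wrapper for the official C++ library for Apache ORC.

It uses a submodule pointing to an Apache ORC release, builds its C++ part (including vendored protobuf, lz4, zstd, ...), and links against it, unless the ORC_USE_SYSTEM_LIBRARIES environment variable is set. If it is set, you need to make sure the dependencies are installed (on Debian-based distributions: apt-get install libprotoc-dev liblz4-dev libsnappy-dev libzstd-dev zlib1g-dev).

The orcxx_derive crate provides a custom derive macro.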
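The choice between the vendored build and system libraries can be pictured as a small decision in a build script. The following is a minimal sketch and not the crate's actual build script: DepSource, dep_source and link_directives are hypothetical names, the variable is assumed to count as set whenever it is present, and the library names are guesses derived from the Debian package names above.

```rust
/// Where the C++ dependencies come from. Hypothetical type for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum DepSource {
    /// Build the copies bundled with the Apache ORC submodule.
    Vendored,
    /// Link against libraries already installed on the system.
    System,
}

/// Assumption: the variable counts as "set" whenever it is present,
/// whatever its value.
fn dep_source(env_value: Option<&str>) -> DepSource {
    if env_value.is_some() {
        DepSource::System
    } else {
        DepSource::Vendored
    }
}

/// Lines a build script could print for Cargo in the system case.
/// The vendored case returns nothing here because that path would build
/// and link the bundled copies instead.
/// The library names are guesses derived from the Debian package names.
fn link_directives(source: &DepSource) -> Vec<String> {
    match *source {
        DepSource::Vendored => Vec::new(),
        DepSource::System => ["protobuf", "lz4", "snappy", "zstd", "z"]
            .iter()
            .map(|lib| format!("cargo:rustc-link-lib={}", lib))
            .collect(),
    }
}
```

In a real build script, the argument would come from std::env::var_os("ORC_USE_SYSTEM_LIBRARIES"), and each returned line would be printed with println! so that Cargo picks it up.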

orcxx_derive examples

RowIterator API

extern crate orcxx;
extern crate orcxx_derive;

use std::num::NonZeroU64;

use orcxx::deserialize::{OrcDeserialize, OrcStruct};
use orcxx::row_iterator::RowIterator;
use orcxx::reader;
use orcxx_derive::OrcDeserialize;

// Define structure
#[derive(OrcDeserialize, Clone, Default, Debug, PartialEq, Eq)]
struct Test1 {
    long1: Option<i64>,
}

// Open file
let orc_path = "../orcxx/orc/examples/TestOrcFile.test1.orc";
let input_stream = reader::InputStream::from_local_file(orc_path).expect("Could not open .orc");
let reader = reader::Reader::new(input_stream).expect("Could not read .orc");

let batch_size = NonZeroU64::new(1024).unwrap();
let rows: Vec<Option<Test1>> = RowIterator::new(&reader, batch_size)
    .expect("Could not open ORC file")
    .collect();

assert_eq!(
    rows,
    vec![
        Some(Test1 {
            long1: Some(9223372036854775807)
        }),
        Some(Test1 {
            long1: Some(9223372036854775807)
        })
    ]
);

Loop API

RowIterator clones structures before yielding them. This can be avoided by looping and writing directly to a buffer:

extern crate orcxx;
extern crate orcxx_derive;

use orcxx::deserialize::{CheckableKind, OrcDeserialize, OrcStruct};
use orcxx::reader;
use orcxx_derive::OrcDeserialize;

// Define structure
#[derive(OrcDeserialize, Default, Debug, PartialEq, Eq)]
struct Test1 {
    long1: Option<i64>,
}

// Open file
let orc_path = "../orcxx/orc/examples/TestOrcFile.test1.orc";
let input_stream = reader::InputStream::from_local_file(orc_path).expect("Could not open .orc");
let reader = reader::Reader::new(input_stream).expect("Could not read .orc");

// Only read columns we need
let options = reader::RowReaderOptions::default().include_names(Test1::columns());

let mut row_reader = reader.row_reader(&options).expect("Could not open ORC file");
Test1::check_kind(&row_reader.selected_kind()).expect("Unexpected schema");

let mut rows: Vec<Option<Test1>> = Vec::new();

// Allocate work buffer
let mut batch = row_reader.row_batch(1024);

// Read structs until the end
while row_reader.read_into(&mut batch) {
    let new_rows = Option::<Test1>::from_vector_batch(&batch.borrow()).unwrap();
    rows.extend(new_rows);
}

assert_eq!(
    rows,
    vec![
        Some(Test1 {
            long1: Some(9223372036854775807)
        }),
        Some(Test1 {
            long1: Some(9223372036854775807)
        })
    ]
);
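The trade-off between the two styles can be illustrated without ORC at all. The sketch below is a conceptual model only and makes no claim about how orcxx is implemented: Row, BatchSource, CloningRows and collect_by_loop are hypothetical stand-ins. The iterator keeps a decoded batch in an internal buffer and has to clone each row it hands out, while the loop moves whole batches into the result.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Row {
    name: String,
}

/// Hypothetical stand-in for a reader that hands out rows batch by batch.
struct BatchSource {
    batches: Vec<Vec<Row>>,
}

impl BatchSource {
    /// Replaces the contents of `buffer` with the next batch.
    /// Returns false once no batch is left.
    fn read_into(&mut self, buffer: &mut Vec<Row>) -> bool {
        buffer.clear();
        if self.batches.is_empty() {
            return false;
        }
        buffer.append(&mut self.batches.remove(0));
        true
    }
}

/// Iterator style: rows stay in the internal buffer, so each one is cloned.
struct CloningRows {
    source: BatchSource,
    buffer: Vec<Row>,
    index: usize,
}

impl Iterator for CloningRows {
    type Item = Row;

    fn next(&mut self) -> Option<Row> {
        // Refill until a non-empty batch is available or the source is done.
        while self.index >= self.buffer.len() {
            if !self.source.read_into(&mut self.buffer) {
                return None;
            }
            self.index = 0;
        }
        let row = self.buffer[self.index].clone(); // one clone per row
        self.index += 1;
        Some(row)
    }
}

/// Loop style: each batch is moved into the result, no per-row clone.
fn collect_by_loop(mut source: BatchSource) -> Vec<Row> {
    let mut rows = Vec::new();
    let mut buffer = Vec::new();
    while source.read_into(&mut buffer) {
        rows.append(&mut buffer);
    }
    rows
}
```

Both styles produce the same rows in the same order; the loop only skips the extra copy, at the cost of managing the buffer by hand.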

Nested structures

The two examples above also work with nested structures:

extern crate orcxx;
extern crate orcxx_derive;

use orcxx_derive::OrcDeserialize;

#[derive(OrcDeserialize, Default, Debug, PartialEq)]
struct Test1Option {
    boolean1: Option<bool>,
    byte1: Option<i8>,
    short1: Option<i16>,
    int1: Option<i32>,
    long1: Option<i64>,
    float1: Option<f32>,
    double1: Option<f64>,
    bytes1: Option<Vec<u8>>,
    string1: Option<String>,
    list: Option<Vec<Option<Test1ItemOption>>>,
}

#[derive(OrcDeserialize, Default, Debug, PartialEq)]
struct Test1ItemOption {
    int1: Option<i32>,
    string1: Option<String>,
}
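To make the layering of Option and Vec in such a structure concrete, here is a minimal plain-Rust sketch without the derive macro. NestedRow, NestedItem and non_null_items are hypothetical names. The reading that the outer Option stands for a null list and the inner Option for a null list item is an assumption based on the field types above.

```rust
#[derive(Default, Debug, PartialEq)]
struct NestedItem {
    int1: Option<i32>,
    string1: Option<String>,
}

#[derive(Default, Debug, PartialEq)]
struct NestedRow {
    long1: Option<i64>,
    // Outer Option: the whole list may be null.
    // Inner Option: an individual item may be null.
    list: Option<Vec<Option<NestedItem>>>,
}

/// Counts the items that are present, treating a null list as empty.
fn non_null_items(row: &NestedRow) -> usize {
    row.list
        .as_ref()
        .map_or(0, |items| items.iter().filter(|item| item.is_some()).count())
}
```

A row built with NestedRow::default() has every field set to None, which is why the structures in the examples derive Default.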

orcxx examples

ColumnTree API

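Columns can also be read directly, without writing their values to structures. This is particularly useful for reading files whose schema is not known at compile time.

Low-level API

This reads batches directly from the C++ library and leaves it to the Rust code to dynamically cast base vectors to more specific types; here, string vectors.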
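The idea of handling a schema that is only known at run time can be sketched with a small tree of columns. This is an illustration in plain Rust: the Column enum and the describe function are hypothetical and are not claimed to match the variants or methods of orcxx's own ColumnTree type.

```rust
/// Hypothetical column tree, reduced to three variants for this sketch.
#[derive(Debug)]
enum Column {
    Long(Vec<Option<i64>>),
    Text(Vec<Option<String>>),
    Struct(Vec<(String, Column)>),
}

/// Builds an ORC-style type description by walking the tree at run time.
fn describe(column: &Column) -> String {
    match column {
        Column::Long(_) => "bigint".to_owned(),
        Column::Text(_) => "string".to_owned(),
        Column::Struct(fields) => {
            let inner: Vec<String> = fields
                .iter()
                .map(|(name, child)| format!("{}:{}", name, describe(child)))
                .collect();
            format!("struct<{}>", inner.join(","))
        }
    }
}
```

Code written this way matches on whatever variants it finds instead of naming the fields in a struct definition, so it needs no knowledge of the schema at compile time.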

extern crate orcxx;
extern crate orcxx_derive;

use orcxx::reader;
use orcxx::vector::ColumnVectorBatch;

let input_stream = reader::InputStream::from_local_file("../orcxx/orc/examples/TestOrcFile.test1.orc")
    .expect("Could not open");

let reader = reader::Reader::new(input_stream).expect("Could not read");

println!("{:#?}", reader.kind()); // Prints the type of columns in the file

let mut row_reader = reader.row_reader(&reader::RowReaderOptions::default()).unwrap();
let mut batch = row_reader.row_batch(1024);

let mut total_elements = 0;
let mut all_strings: Vec<String> = Vec::new();
while row_reader.read_into(&mut batch) {
    total_elements += (&batch).num_elements();

    let struct_vector = batch.borrow().try_into_structs().unwrap();
    let vectors = struct_vector.fields();

    for vector in vectors {
        match vector.try_into_strings() {
            Ok(string_vector) => {
                for s in string_vector.iter() {
                    all_strings.push(
                        std::str::from_utf8(s.unwrap_or(b"<null>"))
                        .unwrap().to_owned())
                }
            }
            Err(_) => {} // Not a string column; skip it
        }
    }
}

assert_eq!(total_elements, 2);
assert_eq!(
    all_strings,
    vec!["\0\u{1}\u{2}\u{3}\u{4}", "", "hi", "bye"]
        .iter()
        .map(|s| s.to_owned())
        .collect::<Vec<_>>()
);
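The example above unwraps the result of std::str::from_utf8, which panics if a cell holds bytes that are not valid UTF-8. Where that is a concern, a lossy conversion is one possible alternative. The sketch below uses only the standard library; decode_cell and decode_column are hypothetical helper names, and the cell type Option<&[u8]> is assumed from the way the example calls unwrap_or(b"<null>").

```rust
/// Converts one cell to a String: nulls become "<null>", and invalid
/// UTF-8 sequences are replaced with U+FFFD instead of panicking.
fn decode_cell(cell: Option<&[u8]>) -> String {
    match cell {
        None => "<null>".to_owned(),
        Some(bytes) => String::from_utf8_lossy(bytes).into_owned(),
    }
}

/// Applies `decode_cell` to every cell of a column.
fn decode_column<'a, I>(cells: I) -> Vec<String>
where
    I: IntoIterator<Item = Option<&'a [u8]>>,
{
    cells.into_iter().map(decode_cell).collect()
}
```

Whether replacing invalid bytes is acceptable depends on the data; for binary columns, keeping the raw bytes is usually the better choice.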

Dependencies

~1.2–3.5MB
~57K SLoC