6 releases (stable)

1.1.3	Oct 9, 2023
1.0.0	Oct 3, 2023
0.9.1	Oct 3, 2023

#151 in Machine learning

28 downloads per month

MIT/Apache

36KB
144 lines of code

tensor_types

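The tensor_types crate provides strong typing and size checking for tensors in PyTorch Rust frameworks, preventing a class of hard-to-find errors.

The tensor_types crate has been tested with tch-rs versions 0.13.0 and 0.14.0.

Introduction

The Problem

PyTorch is a powerful machine-learning library. Rust projects such as tch-rs and Candle bring its well-designed architecture together with Rust's correctness and reliability. As its main data structure, Torch uses the tensor, a flexible data representation with a rich set of supporting functions.

However, a problem arises when writing reliable tensor-based software: as tensors are reused throughout a system, they become overused and undifferentiated. They represent different data in different parts of the program, yet they all share the same data structure. For example, in a machine-learning workflow a tensor might represent tokens from raw data, embedded tokens, batched sequences of tokens, batched logits, sequences of probabilities, and finally output tokens.

Passing tensors around in code raises the risk of defects in three main ways:

Tensors can change shape, possibly propagating an incompatible shape to other code. As tensors are transformed and manipulated, changes to their shape may not be obvious. An unintended change in shape can cause hard-to-find bugs. The Rust compiler cannot help here because, whatever the shape, the representation is still a tensor.
Tensors can change kind. For example, a transformer architecture may accept token sequences as a 2d tensor of Int64 but embed them into a 3d tensor of Float. A Torch tensor represents both of these very different types as a tensor.
Tensors passed as function or struct parameters can be misordered or misassigned. You may intend your function to operate on input1 and input2, but because both have type tch::Tensor, the compiler cannot catch a call that passes them in reversed order.

The Solution

Wrapping tensors in size-checked and kind-checked types increases reliability, reduces defects, and improves readability. It lets the compiler catch, at compile time, hard-to-find mistakes in code that would otherwise run but produce wrong results. The `tensor_types` crate lets you write code like this: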

    // Define TokenizedInput as a type holding a 2d tensor for batched token 
    // sequences. TokenizedInput tensors will be checked to be size 
    // [params.batch_size, params.sequence_length] where params is an instance 
    // of Params.
    tensor_type!(TokenizedInput, [batch_size, sequence_length], Params, Kind::Int64);
    // Define EmbeddedInput as a type holding a 3d tensor for batched sequences
    // of embedded tokens.
    tensor_type!(EmbeddedInput, [batch_size, sequence_length, model_dim], Params, Kind::Float);
    
    let params = load_parameters(); // Or however Params is initialized.
    let input = tokenizer(...); // Tokenize the input into a tch::Tensor.
    let tokenized = TokenizedInput::new(input, &params)?; // Wrap, checking shape.
    // Embed, confirming the new shape and kind. Here, embed() accepts
    // TokenizedInput and returns a tch::Tensor.
    let embedded = EmbeddedInput::new(embed(tokenized)?, &params)?; 
    // More effectively, you would define embed() to accept a TokenizedInput
    // and return an EmbeddedInput, so that your code would read like:
    // let embedded = embed(tokenized, &params)?;

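Tensor types created by the `tensor_type!` macro check their size against a struct that you define and that holds your runtime values. This lets the parameters be loaded once at runtime, perhaps from a configuration file, or be set up easily for tests.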
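
To make the runtime check concrete, here is a minimal hand-written sketch of the kind of wrapper such a type could be. It is an illustration only, not the macro's actual expansion: `Tensor`, `Kind`, and `CheckError` are stand-ins defined in the block so that it compiles without tch, and the `tensor()` accessor is hypothetical.

```rust
// Stand-ins so the sketch compiles without tch: a "tensor" that only records
// its shape and element kind. These are NOT the tch-rs types.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Int64,
    Float,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    pub shape: Vec<i64>,
    pub kind: Kind,
}

// Runtime dimension values, loaded once (for example from a config file).
pub struct Params {
    pub batch_size: i64,
    pub sequence_length: i64,
}

// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum CheckError {
    ShapeMismatch { expected: Vec<i64>, found: Vec<i64> },
    KindMismatch { expected: Kind, found: Kind },
}

// Hand-written counterpart of a declaration such as
//   tensor_type!(TokenizedInput, [batch_size, sequence_length], Params, Kind::Int64);
// The real macro expansion may differ.
#[derive(Debug)]
pub struct TokenizedInput {
    tensor: Tensor,
}

impl TokenizedInput {
    // Wrap a tensor, checking its shape and kind against the runtime values.
    pub fn new(tensor: Tensor, params: &Params) -> Result<Self, CheckError> {
        let expected = vec![params.batch_size, params.sequence_length];
        if tensor.shape != expected {
            return Err(CheckError::ShapeMismatch { expected, found: tensor.shape });
        }
        if tensor.kind != Kind::Int64 {
            return Err(CheckError::KindMismatch { expected: Kind::Int64, found: tensor.kind });
        }
        Ok(Self { tensor })
    }

    // Read-only access to the wrapped tensor.
    pub fn tensor(&self) -> &Tensor {
        &self.tensor
    }
}
```

Because the expected sizes come from `params` rather than from the type itself, a test can build a small `Params` of its own and exercise the same type with whatever shape it needs.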

Motivation and Examples

As another example, here is a line of code with a bug:

    transform(encoder_input, decoder_input)?;

Did you spot the bug?

`transform()` is defined as

    pub fn transform(decoder_input: Tensor, encoder_input: Tensor) -> Result<()> {

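The buggy line passes the arguments in the wrong order, but the compiler cannot help because both are tensors. With a distinct tensor type for each argument, the same call no longer compiles: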

    // Define EncoderInput as a type holding a 3d tensor.
    tensor_type!(EncoderInput, [batch_size, sequence_length, model_dim], Params, Kind::Float);
    // Define DecoderInput as a type holding a 3d tensor.
    tensor_type!(DecoderInput, [batch_size, sequence_length, model_dim], Params, Kind::Float);
    
    ...
    transform(encoder_input, decoder_input)?; // Won't compile. They're backwards.

...

    pub fn transform(decoder_input: DecoderInput, encoder_input: EncoderInput) -> Result<()> {

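In addition, TensorTypes define the required shape of a tensor, preventing hard-to-find bugs in which an operation applied to a tensor unexpectedly changes its shape. Such a shape change can happen anywhere in a program.

For example, the following code may have a bug: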

    let input: tch::Tensor =...;
    let output: tch::Tensor = my_function(input)?; // This function may transpose the input.

...

    pub fn my_function(input: tch::Tensor) -> Result<tch::Tensor> {

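Did the transpose happen? The compiler cannot tell. Whether or not it happened, there is no runtime error, because the output is a tensor either way. Only an explicit check of the output tensor's shape can tell, and only if the two dimensions differ.

The tensor_types crate makes it easy to maintain the correct tensor shape as operations are applied. For example: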

    tensor_type!(BatchSeqModel, [batch_size, sequence_length, model_dim], Params, Kind::Float);
    tensor_type!(BatchModelSeq, [batch_size, model_dim, sequence_length], Params, Kind::Float);

    let input: BatchSeqModel =...;
    // This function will transpose the input or return an error if the expected
    // shape change doesn't happen.
    let output: BatchModelSeq = my_function(input, &params)?;

...

    pub fn my_function(input: BatchSeqModel, params: &Params) -> Result<BatchModelSeq> {
        ...
        let output: tch::Tensor = ...; // Output from some tch::Tensor operations.
        Ok(BatchModelSeq::new(output, params)?)
    }

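Now `my_function()` is explicitly defined to return a transposed result. The code will not compile unless the developer returns a value of the `BatchModelSeq` type. At runtime, when the function wraps its output tensor with `BatchModelSeq::new()` and that tensor does not match the expected shape, `my_function` returns a `ShapeMismatch` error.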

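Details

Key Features

Key features include:

Strongly typed tensors: each tensor type declares its dimensions and kind at compile time, with the dimension values supplied at runtime.
Type-safe operations: operations on tensors are checked for type, size, and kind.
Dimension checking: operations are checked for matching dimensions.

Readability

As the earlier examples show, code written with tensor types is as easy to read as before, but it now includes size and kind checks at runtime. Using TensorTypes throughout your code improves readability further. A function signature like this, which says nothing about its effect on the Tensor's size and kind,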

    fn prepare_input(t: Tensor) -> Result<Tensor, Error> {
        ...

...now reads more clearly:

    fn prepare_input(t: BatchSeq) -> Result<BatchSeqEmbed, Error> {
        ...

Example Usage

    // Define your TensorTypes at the start of the program for reuse throughout.
    // Or as needed in each function.
    // 1. Define DecoderInputType as a type holding a 3d tensor. The fields in a 
    //    Params instance that will give the dimensions for the tensor are 
    //    batch_size, sequence_length, and model_dim.
    tensor_type!(DecoderInputType, [batch_size, sequence_length, model_dim], Params, Kind::Float);
    //    Define BatchSeqType as a 2d tensor of tokens, so Int64.
    tensor_type!(BatchSeqType, [batch_size, sequence_length], Params, Kind::Int64);

    // Define Params.
    pub struct Params {
        batch_size: i64,
        sequence_length: i64,
        model_dim: i64
    }

    // 2. At runtime, set the required dimensions for the typed parameters.
    let params = Params {
        batch_size: 1,
        sequence_length: 100,
        model_dim: 256,
    };

    // 3. Use your new type's new() function to create a new instance of your
    //    type that wraps any tch::Tensor. The tensor will be checked for the
    //    correct size.
    
    // For example, suppose we obtain t0 from some other function...
    let t0 = Tensor::randn([1, 100, 256], (tch::Kind::Float, tch::Device::Cpu));
    // Wrap it in the DecoderInputType, which will check the size and fail if it
    // is not [batch_size, sequence_length, model_dim], i.e., [1, 100, 256].
    let wrapped_t0 = DecoderInputType::new(t0, &params)?;

    // Apply tensor functions. The result is size checked again.
    let new_my_tensor = wrapped_t0.apply_fn(|t| t.triu(0), &params)?; // Type: DecoderInputType

    // Or use the tensor in the TensorType directly. No size checking though.
    let cos = new_my_tensor.cos();  // Type: tch::Tensor
    
    // After a sequence of tch::Tensor operations, you can convert back to a 
    // TensorType to confirm the expected shape.
    let cos = DecoderInputType::new(cos, &params)?;

    // Suppose you have a decoder that will convert from 3d Float to 2d Int64.
    let tokens = my_tokenizer::decode_tensor(&cos);  // Type: tch::Tensor

    // Convert into a tensor_type before returning it to validate it.
    BatchSeqType::new(tokens, &params)?; // Type: BatchSeqType
    ...

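Extending Types

It is easy to add functionality to types created with the `tensor_type!` macro. For example, here is a type extended so that two TensorTypes can be added directly.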

    // BatchSeqDModelTensor: Embedding converts each token to a vector of size
    // d_model. They are embedded in a floating-point space, so are now kind Float.
    tensor_type!(
        BatchSeqDModelTensor,
        [batch_size, sequence_length, d_model],
        ModelParams,
        Kind::Float
    );
    impl BatchSeqDModelTensor {
        // Add two tensors of this type; the sum is size checked again by new().
        pub fn add(&self, t2: &Self, params: &crate::ModelParams) -> Result<Self> {
            use tensor_types::TensorType;
            Ok(Self::new(&self.tensor + &t2.tensor, params)?)
        }
    }

BatchSeqDModelTensor values can now be added:

    pub fn forward_t(
        &self,
        decoder_input: &BatchSeqDModelTensor,
        ...
    ) -> Result<BatchSeqDModelTensor> {
        let masked_mha_output: BatchSeqDModelTensor = ...
        let sum = decoder_input.add(&masked_mha_output, &self.params)?;

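Traits and Marker Traits

All types created with the `tensor_type!` macro implement a trait named `TensorType`. This trait enables the usual Rust trait operations, such as polymorphic arrays and function parameters. But wait! Doesn't that defeat the very purpose of the `tensor_types` crate, which is to make different types distinct? Well, yes, if used directly. The purpose of the trait, however, is to allow some limited generics where appropriate.

For example, suppose you have added a local attention layer that reduces the dimensionality of the embedded training examples. Now you want the next layer, a dense attention layer, to operate on either the reduced-dimension BatchSeqDReducedTensor examples or the full-dimension BatchSeqDModelTensor examples. We need a function that accepts either of the two, but we don't want to allow any tensor type or any tch::Tensor. Doing so would effectively remove the size checking.

What we can do is use Rust's trait bounds to restrict which TensorTypes may be passed to the function. It's simple. First, define a marker trait and attach it to the types.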

    // AttentionTensorTrait is a marker trait used to limit what can be passed into
    // the Attention function.
    pub trait AttentionTensorTrait {}

    // BatchSeqDReducedTensor are reduced dimensionality tensors produced by the
    // Local Attention layer.
    tensor_type!(
        BatchSeqDReducedTensor,
        [batch_size, sequence_length, d_reduced],
        ModelParams,
        Kind::Float
    );

    // Attach the AttentionTensorTrait to our types.
    impl AttentionTensorTrait for BatchSeqDModelTensor {}
    impl AttentionTensorTrait for BatchSeqDReducedTensor {}

Now our function can be defined to accept only these TensorTypes and no others.

    fn attention<T: TensorType<InnerType = ModelParams> + AttentionTensorTrait>(
        query: &T,
        params: &ModelParams,
    ) -> Result<T, TensorTypeError> {
        // Do the attention calculation. [Here, just a tch::Tensor upper 
        // triangle fn, returned directly.]
        query.apply_fn(|t| t.triu(1), params)
    }

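Our function therefore declares a generic parameter bounded by TensorType, which brings in the TensorType methods, and constrains it further with the AttentionTensorTrait trait bound. Note that the `InnerType = ...` binding is how we tell the Rust compiler which type supplies the runtime dimension values for the tensor type.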
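
The effect of the marker-trait bound can be seen in isolation in the sketch below. It assumes a simplified stand-in `TensorType` trait with a single hypothetical `name()` method, and unit structs in place of real tensor types, so that it compiles without tch or tensor_types.

```rust
// Stand-in for the crate's trait: here it only exposes a type name.
// Hypothetical; the real TensorType trait has a different set of methods.
pub trait TensorType {
    fn name(&self) -> &'static str;
}

// Marker trait: no methods, it only tags the types that attention() accepts.
pub trait AttentionTensorTrait {}

// Unit structs standing in for types created with tensor_type!.
pub struct BatchSeqDModelTensor;
pub struct BatchSeqDReducedTensor;
pub struct BatchSeqTensor; // A tensor type that attention() must reject.

impl TensorType for BatchSeqDModelTensor {
    fn name(&self) -> &'static str {
        "BatchSeqDModelTensor"
    }
}
impl TensorType for BatchSeqDReducedTensor {
    fn name(&self) -> &'static str {
        "BatchSeqDReducedTensor"
    }
}
impl TensorType for BatchSeqTensor {
    fn name(&self) -> &'static str {
        "BatchSeqTensor"
    }
}

// Only the two attention inputs receive the marker.
impl AttentionTensorTrait for BatchSeqDModelTensor {}
impl AttentionTensorTrait for BatchSeqDReducedTensor {}

// Accepts any type that is both a TensorType and tagged with the marker.
pub fn attention<T: TensorType + AttentionTensorTrait>(query: &T) -> String {
    format!("attention over {}", query.name())
}

// attention(&BatchSeqTensor) would not compile: BatchSeqTensor implements
// TensorType but not AttentionTensorTrait.
```

The marker costs one empty `impl` line per permitted type, and the set of accepted types stays visible in one place.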

Design and Alternatives Considered

The tensor_type! macro is designed for simplicity and flexibility. The current version has the following format:

    tensor_type!(<name>, <list of fields>, <struct with those fields>, <kind>);

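This design requires an instance of the parameters to be passed through the code so that it can be supplied to a tensor type's new() and to other functions that check the dimensions of the wrapped tensor. The parameter instance should be immutable to ensure consistent dimensions throughout the lifetime of the tensor types.

This design makes testing easy, since a test parameter struct can be created and passed to the code under test with little effort.

Alternative Design: Fixed Dimensions

An alternative design fixes the dimensions as part of the type. That is, the macro call would look like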

    tensor_type!(<name>, <list of types>, <kind>);

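The type is created by the macro with the specified dimensions, as it is now. However, this version requires a set() call to initialize the runtime values of the dimensions. Once set, the size is fixed for the type. So setup looks like this: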

    tensor_type!(DecoderInputType, [BatchSize, SequenceLength, ModelDim], Kind::Float);
    DecoderInputType::set(BatchSize(1), SequenceLength(100), ModelDim(256));
    let my_tensor = DecoderInputType::new(t); // For some tch::Tensor t.

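This syntax is slightly more concise, and it means that the values of the dimensions, not just how many there are, are part of the type. One advantage of this design is that the runtime dimension values do not need to be passed to the type's new() function or to other functions that check dimensions.

However, this design is too restrictive. It means the tensor type needs internal memory to store the dimensions given to set(). It was implemented using module-level static variables, to avoid name collisions, together with std::sync::Once, so that once set, the dimensions are fixed and cannot change, which is a goal of the tensor_types crate. The complexity of the implementation required proc macros, which complicates testing and packaging because they need a sub-crate. And Crates.io does not recognize sub-crates; it treats them as separate crates.

Although more concise, the biggest drawback of this design is that it makes testing very difficult. Tests usually exercise a function with tensors of several different shapes. For example, a function might be defined as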

    pub fn embed(t: BatchTokens) -> Result<BatchTokenEmbed, Error> {

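BatchTokens would be set() at the start of the program to a size of [BatchSize, SequenceLength] and would keep that size for the entire run. Tests of embed(), however, need to run with different shapes. Because Rust tests run in parallel, the first test to run would define the shape of BatchTokens. Every other set() call would then fail, since set() can be called only once. Allowing set() to be called repeatedly solves that problem, but 1) defeats the purpose of using set() to fix the dimensions, and 2) means the tensor type must be protected against mid-test changes caused by thread interleaving when tests run in parallel. Because of this added complexity, the approach was abandoned.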
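
The failure mode is easy to reproduce. The sketch below assumes a hypothetical pair of functions, `set_batch_tokens_shape()` and `batch_tokens_shape()`, and uses `std::sync::OnceLock` in place of the `std::sync::Once`-based implementation described above; it is not the abandoned code itself.

```rust
use std::sync::OnceLock;

// Process-wide storage for the shape of one tensor type, as a set()-based
// design would need. OnceLock stands in for the Once-based original.
static BATCH_TOKENS_SHAPE: OnceLock<[i64; 2]> = OnceLock::new();

// Fix the shape for the whole process. Returns Err carrying the rejected
// shape if a shape has already been set.
pub fn set_batch_tokens_shape(batch_size: i64, sequence_length: i64) -> Result<(), [i64; 2]> {
    BATCH_TOKENS_SHAPE.set([batch_size, sequence_length])
}

// Read the shape, or None if set_batch_tokens_shape() has not run yet.
pub fn batch_tokens_shape() -> Option<[i64; 2]> {
    BATCH_TOKENS_SHAPE.get().copied()
}
```

Whichever caller reaches `set_batch_tokens_shape()` first wins, and every later call gets `Err` back. Since all tests in one test binary share the same static, a second test that wants a different shape cannot have it.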

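Alternative Design: Traits

In the current design, the tensor_type! macro is invoked with the fields of a struct that define the runtime values of the expected tensor dimensions, together with the type of the struct that provides those fields.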

    tensor_type!(<name>, <list of fields>, <struct with those fields>, <kind>);

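Another approach considered was to use a trait to define the runtime values of the expected tensor dimensions. For example, the macro call might look like this: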

    tensor_type!(<name>, <list of getters>, <trait with those getters>, <kind>);

So an example might look like this:

    tensor_type!(BatchSeqType, [get_sequence_length, get_model_dim], ParamsTrait, Kind::Int64);

    // Define the Parameters trait.
    pub trait ParamsTrait {
        fn get_sequence_length(&self) -> i64;
        fn get_model_dim(&self) -> i64;
    }

    // Define the Params struct.
    pub struct Params {
        sequence_length: i64,
        model_dim: i64
    }

    // Implement the trait for the Params struct.
    impl ParamsTrait for Params {
        fn get_sequence_length(&self) -> i64 {
            self.sequence_length
        }
        fn get_model_dim(&self) -> i64 {
            self.model_dim
        }
    }

    // At runtime, set the required dimensions for the typed parameters.
    let params = Params {
        sequence_length: 100,
        model_dim: 256,
    };

    let t0 = Tensor::zeros([100, 256], (tch::Kind::Int64, tch::Device::Cpu));
    let batch_seq = BatchSeqType::new(t0, &params)?;

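As can be seen, this approach adds a lot of boilerplate to the type definitions just to supply the dimensions. A possible advantage is the encapsulation the trait provides. However, the macro call that creates a new type is essentially the same as in the current design, as are the functions on the new type. The advantage is therefore outweighed by the burden of maintaining the trait and its boilerplate.

Alternative Design: Encoded Dimensions

For completeness, another design considered was to build the tensor shape into the macro code. In this version no runtime memory is used. A call like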

    tensor_type!(<name>, value1, value2, value3, ..., <kind>);

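...could be expanded by the macro system into code similar to the following:

    let expected_size = vec![value1, value2, value3, ...];
    if tensor.size() != expected_size {
        return Error...
    }
    ...

The code itself thus stores the values. However, this design also locks the size into the tensor type too rigidly. Specifically, the values passed to the macro must be known at compile time. Besides ruling out runtime configuration, this also makes testing difficult: as described above, once a tensor type is defined, it cannot change during testing.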
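
One way to picture this trade-off in plain Rust is with const generics rather than macro expansion. That is a different mechanism from the one described above, but it has the same consequence: the sizes are compile-time constants. The names `FixedBatchSeq` and the stand-in `Tensor` are hypothetical and exist only for this sketch.

```rust
// Stand-in tensor that only records its shape; not the tch-rs type.
pub struct Tensor {
    pub shape: Vec<i64>,
}

// A tensor type whose expected shape is part of the type itself.
pub struct FixedBatchSeq<const BATCH: i64, const SEQ: i64> {
    tensor: Tensor,
}

impl<const BATCH: i64, const SEQ: i64> FixedBatchSeq<BATCH, SEQ> {
    // Wrap a tensor, checking it against the sizes baked into the type.
    pub fn new(tensor: Tensor) -> Result<Self, String> {
        let expected = vec![BATCH, SEQ];
        if tensor.shape != expected {
            return Err(format!("expected {:?}, found {:?}", expected, tensor.shape));
        }
        Ok(Self { tensor })
    }

    // Shape of the wrapped tensor.
    pub fn shape(&self) -> &[i64] {
        &self.tensor.shape
    }
}
```

No parameter struct is passed to `new()`, but `FixedBatchSeq<1, 100>` and `FixedBatchSeq<4, 16>` are unrelated types. A function written against one cannot be tested with the other, and neither size can come from a configuration file.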

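Learn More

To learn how to use the tensor_types crate, see

examples/usage.rs: Various ways to use TensorTypes.
examples/before_after.rs: Simple examples of bugs and how TensorTypes prevent them.
tests/*: Tests that illustrate correct use of TensorTypes.
test/compilation_tests/*: Compilation errors that TensorTypes catch.

Dependencies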

~11–20MB
~308K SLoC