32 versions (9 stable)

2.0.0-dev3	Apr 22, 2023
2.0.0-dev1	Mar 25, 2023
1.0.8	Jan 23, 2023
1.0.7	Apr 16, 2021
0.4.2	Jul 10, 2018

#35 in Hardware support

1,444 downloads per month
Used in 16 crates (7 directly)

Apache-2.0/MIT

445KB
12K SLoC

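A library that abstracts over SIMD instruction sets, including ones with differing widths. SIMDeez is designed to let you write a function once and produce SSE2, SSE41, and AVX2 versions of it. You can have the version you want chosen at compile time, or selected automatically at runtime.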
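To make the "write once, build for several instruction sets, pick one at runtime" idea concrete, here is a minimal sketch that uses only the Rust standard library. It is not how SIMDeez is implemented, and every function name in it (`sum_of_squares_impl`, `sum_of_squares_avx2`, `sum_of_squares_scalar`, `sum_of_squares`) is made up for illustration. It relies on `#[target_feature]` and `is_x86_feature_detected!` from std, and falls back to the baseline build on non-x86 targets.

```rust
// Std-only sketch of "write once, compile for several instruction sets".
// This is NOT SIMDeez's implementation; all names here are hypothetical.

/// The algorithm, written once. `#[inline(always)]` lets it be inlined into
/// each wrapper below and compiled with that wrapper's target features.
#[inline(always)]
fn sum_of_squares_impl(data: &[f32]) -> f32 {
    data.iter().map(|x| x * x).sum()
}

/// AVX2 build of the same body. Only call after confirming AVX2 support.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_of_squares_avx2(data: &[f32]) -> f32 {
    sum_of_squares_impl(data)
}

/// Baseline build, valid on every target.
pub fn sum_of_squares_scalar(data: &[f32]) -> f32 {
    sum_of_squares_impl(data)
}

/// Runtime selection: probe the CPU, then call the best available version.
pub fn sum_of_squares(data: &[f32]) -> f32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // Sound because AVX2 support was checked on the line above.
            return unsafe { sum_of_squares_avx2(data) };
        }
    }
    sum_of_squares_scalar(data)
}
```

Calling `sum_of_squares(&[1.0, 2.0, 3.0])` goes through the runtime check once per call. Calling `sum_of_squares_scalar` directly corresponds to forcing a specific version.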

Originally developed by @jackmott; I have since volunteered to take over ownership.

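If you need intrinsics that are not yet implemented, open an issue and I will add them. PRs that add more intrinsics are welcome. Currently the i32, i64, f32, and f64 types are well supported.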

As Rust stabilizes support for Neon and AVX-512, I plan to add those as well.

Refer to the excellent Intel Intrinsics Guide for documentation on these functions.

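Features

SSE2, SSE41, AVX, and AVX2, with a scalar fallback
Can be used with compile-time or runtime selection
No runtime overhead
Uses the familiar Intel intrinsic naming conventions, so code is easy to port.
- _mm_add_ps(a,b) becomes add_ps(a,b)
Fills in intrinsics missing from older APIs with fast SIMD workarounds.
- ceil, floor, round, blend, etc.
Can be used by #[no_std] projects
Operator overloading: let sum = va + vb or s *= s
Extract or set a single lane with the index operator: let v1 = v[1];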

Falls back to scalar code on platforms with no SIMD or unsupported SIMD
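The operator-overloading and lane-indexing features listed above can be pictured with a small std-only stand-in. The `F32x4` type and the free `add_ps` function below are hypothetical and backed by a plain array; they only illustrate how `va + vb`, `s *= s`-style updates, `v[1]`, and the shortened intrinsic spelling might look to the caller, not how SIMDeez's vector types are actually defined.

```rust
use std::ops::{Add, Index, IndexMut, MulAssign};

/// Hypothetical 4-lane f32 vector backed by a plain array (scalar stand-in).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x4(pub [f32; 4]);

impl Add for F32x4 {
    type Output = F32x4;
    // Lane-wise addition, enabling `va + vb`.
    fn add(self, rhs: F32x4) -> F32x4 {
        let mut out = self.0;
        for (o, r) in out.iter_mut().zip(rhs.0.iter()) {
            *o += *r;
        }
        F32x4(out)
    }
}

impl MulAssign for F32x4 {
    // Lane-wise in-place multiplication, enabling `v *= w`.
    fn mul_assign(&mut self, rhs: F32x4) {
        for (l, r) in self.0.iter_mut().zip(rhs.0.iter()) {
            *l *= *r;
        }
    }
}

impl Index<usize> for F32x4 {
    type Output = f32;
    // Read a single lane: `v[1]`.
    fn index(&self, lane: usize) -> &f32 {
        &self.0[lane]
    }
}

impl IndexMut<usize> for F32x4 {
    // Write a single lane: `v[1] = 2.0`.
    fn index_mut(&mut self, lane: usize) -> &mut f32 {
        &mut self.0[lane]
    }
}

/// Free function mirroring the intrinsic-style spelling `add_ps(a, b)`.
pub fn add_ps(a: F32x4, b: F32x4) -> F32x4 {
    a + b
}
```

With these impls, `add_ps(a, b)` and `a + b` give the same result, and `sum[1]` reads the second lane of `sum`.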

Trig Functions via Sleef-sys

The sleef-sys crate provides vectorized forms of many trig functions and other common math functions. This is an optional feature, sleef, that you can enable. Doing so currently requires nightly Rust, as well as having CMake and Clang installed.

Compared to packed_simd

- SIMDeez can abstract over differing SIMD widths; packed_simd cannot.
- SIMDeez builds on stable Rust now; packed_simd does not.

Compared to Faster

- SIMDeez can use runtime selection; Faster cannot.
- SIMDeez has faster fallbacks for some functions.
- SIMDeez does not currently work with iterators; Faster does.
- SIMDeez uses more idiomatic intrinsic syntax, while Faster uses more idiomatic Rust syntax.
- SIMDeez builds on stable Rust now; Faster does not.

All of the above could change! Faster seems to generally have the same performance, as long as you don't run into some of the slower fallback functions.

Example

    use simdeez::*;
    use simdeez::scalar::*;
    use simdeez::sse2::*;
    use simdeez::sse41::*;
    use simdeez::avx::*;
    use simdeez::avx2::*;

    // If you want your SIMD function to use runtime feature detection to call
    // the fastest available version, use the simd_runtime_generate macro:
    simd_runtime_generate!(
    fn distance(
        x1: &[f32],
        y1: &[f32],
        x2: &[f32],
        y2: &[f32]) -> Vec<f32> {

        let mut result: Vec<f32> = Vec::with_capacity(x1.len());
        result.set_len(x1.len()); // for efficiency (skips zero-initialization)

        // Set each slice to the same length for iteration efficiency
        let mut x1 = &x1[..x1.len()];
        let mut y1 = &y1[..x1.len()];
        let mut x2 = &x2[..x1.len()];
        let mut y2 = &y2[..x1.len()];
        let mut res = &mut result[..x1.len()];

        // Operations have to be done in terms of the vector width
        // so that it will work with any size vector.
        // The width of a vector type is provided as a constant
        // so the compiler is free to optimize it more.
        // S::VF32_WIDTH is a constant, 4 when using SSE, 8 when using AVX2, etc
        while x1.len() >= S::VF32_WIDTH {
            // Load data from your vec into a SIMD value
            let xv1 = S::loadu_ps(&x1[0]);
            let yv1 = S::loadu_ps(&y1[0]);
            let xv2 = S::loadu_ps(&x2[0]);
            let yv2 = S::loadu_ps(&y2[0]);

            // Use the usual intrinsic syntax if you prefer
            let mut xdiff = S::sub_ps(xv1, xv2);
            // Or use operator overloading if you like
            let mut ydiff = yv1 - yv2;
            xdiff *= xdiff;
            ydiff *= ydiff;
            let distance = S::sqrt_ps(xdiff + ydiff);
            // Store the SIMD value into the result vec
            S::storeu_ps(&mut res[0], distance);

            // Move each slice to the next position
            x1 = &x1[S::VF32_WIDTH..];
            y1 = &y1[S::VF32_WIDTH..];
            x2 = &x2[S::VF32_WIDTH..];
            y2 = &y2[S::VF32_WIDTH..];
            res = &mut res[S::VF32_WIDTH..];
        }

        // (Optional) Compute the remaining elements. Not necessary if you are sure the length
        // of your data is always a multiple of the maximum S::VF32_WIDTH you compile for
        // (4 for SSE, 8 for AVX2, etc).
        // This can be asserted by putting `assert_eq!(x1.len(), 0);` here
        for i in 0..x1.len() {
            let mut xdiff = x1[i] - x2[i];
            let mut ydiff = y1[i] - y2[i];
            xdiff *= xdiff;
            ydiff *= ydiff;
            let distance = (xdiff + ydiff).sqrt();
            res[i] = distance;
        }

        result
    });

    fn main() {
    }

This will generate the following functions for you:

- distance<S:Simd>: the generic version of your function
- distance_scalar: a scalar fallback
- distance_sse2: SSE2 version
- distance_sse41: SSE41 version
- distance_avx: AVX version
- distance_avx2: AVX2 version
- distance_runtime_select: picks the fastest of the above at runtime

You can use any of these, although typically you would use the runtime-select version unless you want to force an older instruction set to avoid throttling or for other arcane reasons.

Optionally, you can use the simd_compiletime_generate! macro in the same way. This will produce 2 active functions via the cfg attribute feature:

- distance<S:Simd>: the generic version of your function
- distance_compiletime: the fastest instruction set available for the given compile-time feature set

You may also forgo the macros if you know what you are doing; just keep in mind that there are lots of arcane subtleties with inlining and target_features that must be managed. See the macro expansions for more detail.
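The key pattern in the example above, processing whole blocks of "vector width" elements and then finishing the tail with a scalar loop, can be tried without the crate. The sketch below is a std-only approximation under stated assumptions: the `Lanes` trait and the `Four` and `Eight` marker types are hypothetical stand-ins for a backend's width constant (such as `S::VF32_WIDTH`), and each block is computed with ordinary scalar arithmetic rather than real SIMD loads and stores.

```rust
/// Hypothetical stand-in for a SIMD backend: only the lane count matters here.
pub trait Lanes {
    const WIDTH: usize;
}

pub struct Four; // e.g. an SSE-sized f32 vector
pub struct Eight; // e.g. an AVX2-sized f32 vector

impl Lanes for Four {
    const WIDTH: usize = 4;
}
impl Lanes for Eight {
    const WIDTH: usize = 8;
}

/// Distance between points (x1, y1) and (x2, y2), WIDTH elements per block.
pub fn distance<L: Lanes>(x1: &[f32], y1: &[f32], x2: &[f32], y2: &[f32]) -> Vec<f32> {
    let n = x1.len();
    assert!(y1.len() == n && x2.len() == n && y2.len() == n);
    let mut result = vec![0.0f32; n];

    // Number of elements covered by whole blocks of WIDTH lanes.
    let full = n - n % L::WIDTH;
    let mut i = 0;
    while i < full {
        // A real backend would handle this block with one vector
        // load / sub / mul / sqrt / store per operand instead of a loop.
        for lane in i..i + L::WIDTH {
            let dx = x1[lane] - x2[lane];
            let dy = y1[lane] - y2[lane];
            result[lane] = (dx * dx + dy * dy).sqrt();
        }
        i += L::WIDTH;
    }

    // Remainder loop for the tail that does not fill a whole block.
    for k in full..n {
        let dx = x1[k] - x2[k];
        let dy = y1[k] - y2[k];
        result[k] = (dx * dx + dy * dy).sqrt();
    }
    result
}
```

Because the block size comes from an associated constant, `distance::<Four>` and `distance::<Eight>` share one body and must agree on every input, including lengths that are not a multiple of either width.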

Dependencies

~0–1.3MB
~16K SLoC

cfg-if
no_std? libm
paste
sleef? sleef-sys
dev criterion 0.4
dev rand
dev rand_chacha 0.3.1