6 versions

0.3.14	Apr 20, 2024
0.3.13	Apr 13, 2024
0.3.9	~~Mar 30, 2024~~
0.2.5	Mar 21, 2024
0.1.7	~~Feb 25, 2024~~

#498 in Text processing

36 downloads per month
Used in dynamic-token

GPL-2.0-or-later WITH Bison-exception-2…

97KB
1.5K SLoC

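Simple String Patterns

This library makes it easier to match, split and extract strings in Rust. It builds on the Rust standard library. The parallel string-patterns crate provides extensions that work with regular expressions. Together, the two crates aim to make working with strings in Rust as easy as in JavaScript or Python, with cleaner syntax.

Simpler string matching methods such as starts_with, contains or ends_with will generally perform better, especially when processing large data sets. The starts_with_ci and starts_with_ci_alphanum methods, for example, build on these core methods so that strings can be handled without regular expressions.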
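As a rough illustration of how such helpers can be layered on the core methods, the sketch below uses only the standard library. The function names mirror the crate's methods, but the bodies are assumptions written for illustration and are not taken from the crate's source.

```rust
/// Case-insensitive prefix test: lower-case both sides, then defer to str::starts_with.
fn starts_with_ci(sample: &str, pattern: &str) -> bool {
    sample.to_lowercase().starts_with(&pattern.to_lowercase())
}

/// Case-insensitive prefix test that ignores every non-alphanumeric
/// character in the sample before comparing.
fn starts_with_ci_alphanum(sample: &str, pattern: &str) -> bool {
    let stripped: String = sample
        .chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(|c| c.to_lowercase())
        .collect();
    stripped.starts_with(&pattern.to_lowercase())
}
```

With these definitions, starts_with_ci_alphanum("-B-l-u-e--sky", "bluesky") is true, while the same call on "Blueberry" is false.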

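Version 0.3.0 fundamentally revised the enums used with the matched_by_rules(), matched_conditional(), filter_all_rules() and filter_any_rules() methods.

Simple patterns versus regular expressions

The main advantages of simple-string-patterns are readability and minimal overhead in lightweight applications that do not otherwise need regular expression support. Under the hood, a regular expression engine compiles regex syntax and translates it into more efficient string-matching subroutines. Preliminary benchmarks suggest that rule sets built on basic matching methods such as contains_ci outperform their regex counterparts, but regular expressions may be faster once multiple nested rules are needed. The regex-based string-patterns crate makes that case straightforward. This crate is ideal for simple tools that process large numbers of strings with highly predictable formats, for example in cryptography or logging.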

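Method overview

Component	Position	Meaning
-	⇥	Many methods without a _ci or _cs suffix take an additional boolean case_insensitive parameter
_ci	⇥	Case-insensitive (both strings are converted to lower case for comparison)
_cs	⇥	Case-sensitive
_ci_alphanum	⇥	Case-insensitive match on alphanumeric characters only in the sample string
_rules	⇥	Accepts a rule set defined via bounds_builder(); see the examples below
_conditional	⇥	Accepts an array of StringBounds rules, mainly for internal use
strip_by_	⇤	Returns a string without the specified character type (or combination of types)
filter_by_	⇤	Returns a string containing only the specified character type (or combination of types)
filter_all	⇤	Filters an array or vector of strings that match all rules (and logic)
filter_any	⇤	Filters an array or vector of strings that match any of the rules (or logic)
to_parts	⇤	Splits a string on a separator into a vector of string parts
to_segments	⇤	Splits a string on a separator into a vector of non-empty string parts
_part(s)	↔︎⇥	Includes leading or trailing separators and may return empty elements in vectors
_segment(s)	↔︎⇥	Excludes leading, trailing and repeated consecutive separators, and therefore empty elements
_head, _tail	↔︎⇥	With split methods, head is the segment before the first split and tail is the remainder
_start, _end	↔︎⇥	start is the whole string before the last split, and end is only the last part after the last matched separator
_escaped	⇥	Enclose or wrap methods that take an additional optional escape character parameter
_safe	⇥	Inserts a backslash before every non-final occurrence of the closing character unless one is already present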
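The difference between parts and segments, and between head/tail and start/end, can be sketched with standard-library calls alone. The functions below borrow the crate's method names for readability, but they are stand-alone assumptions rather than the crate's implementation; in particular, returning the whole string plus an empty second element when no separator is found is a guess.

```rust
/// Parts: keep the empty elements produced by leading, trailing or repeated separators.
fn to_parts(sample: &str, separator: &str) -> Vec<String> {
    sample.split(separator).map(String::from).collect()
}

/// Segments: the same split with empty elements dropped.
fn to_segments(sample: &str, separator: &str) -> Vec<String> {
    sample
        .split(separator)
        .filter(|part| !part.is_empty())
        .map(String::from)
        .collect()
}

/// Head and tail: split once at the first separator.
fn to_head_tail(sample: &str, separator: &str) -> (String, String) {
    match sample.split_once(separator) {
        Some((head, tail)) => (head.to_string(), tail.to_string()),
        // Assumed fallback when the separator is absent.
        None => (sample.to_string(), String::new()),
    }
}

/// Start and end: split once at the last separator.
fn to_start_end(sample: &str, separator: &str) -> (String, String) {
    match sample.rsplit_once(separator) {
        Some((start, end)) => (start.to_string(), end.to_string()),
        // Assumed fallback when the separator is absent.
        None => (sample.to_string(), String::new()),
    }
}
```

For "/var//www/" split on "/", the parts are "", "var", "", "www" and "", while the segments are just "var" and "www".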

Simple case-insensitive match

let str_1 = "Dog food";
if str_1.starts_with_ci("dog") {
  println!("{} is dog-related", str_1);
}

Simple case-insensitive match on alphanumeric characters only in longer text

// This method is handy for validating text values from external data sources with
// inconsistent naming conventions, e.g. first-name, first_name, firstName or "first name"
let str_1 = "Do you spell hip-hop with a hyphen?";
if str_1.contains_ci_alphanum("hiphop") {
  println!("{} is hip-hop-related", str_1);
}

Filter a vector of strings by their first alphanumeric characters

// Methods ending in _alphanum are good for filtering strings that may have other
// non-alphanumeric characters, such as punctuation or spaces, mixed in.
// to_strings() converts an array of &str references to a vector of strings
let sample_strs = [
  "/blue-sky.jpg",
  "----bluesky.png",
  "-B-l-u-e--sky",
  "Blueberry",
  " Blue sky thinking"
].to_strings();
let strings_starting_with_blue = sample_strs
  .into_iter()
  .filter(|s| s.starts_with_ci_alphanum("bluesky"))
  .collect::<Vec<String>>();
// should return all except "Blueberry"

Extract the third non-empty segment of a long path name

let path_string = "/var/www/mysite.com/web/uploads";
if let Some(domain) = path_string.to_segment("/", 2) {
  println!("The domain folder name is: {}", domain); // "mysite.com" is an owned string
}

Extract the head and tail or start and end from a longer string

let test_string = "long-list-of-technical-words";
let (head, tail) = test_string.to_head_tail("-");
println!("Head: {}, tail: {}", head, tail); // Head: long, tail: list-of-technical-words

let (start, end) = test_string.to_start_end("-");
println!("Start: {}, end: {}", start, end); // Start: long-list-of-technical, end: words

Capture an inner segment via multiple patterns

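let source_str = "long/path/with-a-long-title/details";
let target_str = "long"; // the expected inner segment
// take the segment at index 2 when split on "/", then the segment at index 2 within it when split on "-"
if let Some(inner_segment) = source_str.to_inner_segment(&[("/", 2), ("-", 2)]) {
  println!("The inner segment between 'a' and 'title' is: '{}'", inner_segment); // should read 'long'
}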

Extract the first decimal value from a longer string as an f64

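// Illustrative rate expressed as pounds per euro, so a price in pounds is divided by it to obtain euros
const GBP_TO_EURO: f64 = 0.835;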

let sample_str = "Price £12.50 each";
if let Some(price_gbp) = sample_str.to_first_number::<f64>() {
    let price_eur = price_gbp / GBP_TO_EURO;
    println!("The price in euros is {:.2}", price_eur);
}

Extract a sequence of numbers from a phrase and convert them to a vector of floats

// extract European-style numbers with commas as decimal separators and points as thousand separators
let sample_str = "2.500 grammi di farina costa 9,90€ al supermercato.";
let numbers: Vec<f32> = sample_str.to_numbers_euro();
// If two valid numbers are matched assume the first is the weight
if numbers.len() > 1 {
  let weight_grams = numbers[0];
  let price_euros = numbers[1];
  let price_per_kg = price_euros / (weight_grams / 1000f32);
  // the price per kg should be 3.96
  println!("Flour costs €{:.2} per kilo", price_per_kg);
}

Split a list of numeric strings into floats

// extract 64-bit floats from a comma-separated list
// numbers within each segment are evaluated separately
let sample_str = "34.2929,-93.701";
let numbers = sample_str.split_to_numbers::<f64>(",");
// should yield vec![34.2929,-93.701]; (Vec<f64>)

Match by all or any pattern rules without regular expressions

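// A rule set from bounds_builder() is passed directly to methods ending in _rules;
// call .as_vec() only when a method ending in _conditional needs a vector of StringBounds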
let mixed_conditions = bounds_builder()
  .containing_ci("nepal")
  .ending_with_ci(".jpg");

let sample_name_1 = "picture-Nepal-1978.jpg";
let sample_name_2 = "edited_picture-Nepal-1978.psd";

// contains `nepal` and ends with .jpg
sample_name_1.match_all_rules(&mixed_conditions); // true

// contains `nepal` but does not end with .jpg
sample_name_2.match_all_rules(&mixed_conditions); // false

// contains `nepal` and/or .jpg
sample_name_1.match_any_rules(&mixed_conditions); // true

// contains `nepal` and/or .jpg
sample_name_2.match_any_rules(&mixed_conditions); // true

Filter by all pattern rules without regular expressions


// The same array may also be expressed via the new bounds_builder() function with chainable rules:
// You may call .as_vec() to convert to a vector of StringBounds rules as used by methods ending in _conditional
let mixed_conditions = bounds_builder()
  .containing_ci("nepal")
  .not_ending_with_ci(".psd");

let file_names = [
  "edited-img-Nepal-Feb-2003.psd",
  "image-Thailand-Mar-2003.jpg",
  "photo_Nepal_Jan-2005.jpg",
  "image-India-Mar-2003.jpg",
  "pic_nepal_Dec-2004.png"
];
// The filter_all_rules() method accepts a BoundsBuilder object.
let nepal_source_files: Vec<&str> = file_names.filter_all_rules(&mixed_conditions);
// should yield two file names: ["photo_Nepal_Jan-2005.jpg", "pic_nepal_Dec-2004.png"]
// This will now return Vec<&str> or Vec<String> depending on the source string type.

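Nested rule sets

Since version 0.3.0 you can add nested rule sets with and / or logic. The former is true only if all conditions are met, while the latter is true if any condition is met. The BoundsBuilder struct now has a set of methods starting with and or or. If you have a mix of rule types, you can call and(rules: BoundsBuilder) or or(rules: BoundsBuilder) directly with a nested rule set. However, if all rules share the same bounds, other methods that accept an array of simple patterns are available, for example: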

or_starting_with_ci(patterns: &[&str])
or_starting_with_ci_alphanum(patterns: &[&str])
or_containing_ci(patterns: &[&str])
or_ending_with_ci(patterns: &[&str])
and_not_ending_with_ci(patterns: &[&str])


let filenames = [
  "my_rabbit_2019.webp",
  "my_CaT_2020.jpg",
  "neighbours_Dog_2021.gif",
  "daughters_Dog_2023.jpeg",
  "big cat.psd"
];

// Match files containing the letter sequences "cat" or "dog" and ending in ".jpg" or ".jpeg"
let rules = bounds_builder()
  .or_containing_ci(&["cat", "dog"])
  .or_ending_with_ci(&[".jpg", ".jpeg"]);

let matched_files = filenames.filter_all_rules(&rules);
// Should yield a vector with "my_CaT_2020.jpg" and "daughters_Dog_2023.jpeg"

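The above example corresponds roughly to the case-insensitive regular expression /(cat|dog).*?\.jpe?g$/i. The _alphanum-suffixed variants match only on the digits and letters within the string, i.e. they ignore any spaces or punctuation.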
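To make the and / or nesting concrete, here is a minimal standard-library sketch of a rule tree evaluated recursively. The Rule enum and the cat_or_dog_jpegs function are hypothetical names invented for this illustration; the crate's own types are StringBounds and BoundsBuilder, and its internals may differ.

```rust
/// A hypothetical rule tree, not the crate's StringBounds enum.
enum Rule {
    ContainsCi(String),
    EndsWithCi(String),
    /// True when at least one nested rule matches (or logic).
    Any(Vec<Rule>),
    /// True only when every nested rule matches (and logic).
    All(Vec<Rule>),
}

impl Rule {
    fn matches(&self, sample: &str) -> bool {
        let lower = sample.to_lowercase();
        match self {
            Rule::ContainsCi(p) => lower.contains(&p.to_lowercase()),
            Rule::EndsWithCi(p) => lower.ends_with(&p.to_lowercase()),
            Rule::Any(rules) => rules.iter().any(|r| r.matches(sample)),
            Rule::All(rules) => rules.iter().all(|r| r.matches(sample)),
        }
    }
}

/// Rebuilds the file name example: ("cat" or "dog") and (".jpg" or ".jpeg").
fn cat_or_dog_jpegs() -> Rule {
    Rule::All(vec![
        Rule::Any(vec![
            Rule::ContainsCi("cat".to_string()),
            Rule::ContainsCi("dog".to_string()),
        ]),
        Rule::Any(vec![
            Rule::EndsWithCi(".jpg".to_string()),
            Rule::EndsWithCi(".jpeg".to_string()),
        ]),
    ])
}
```

Filtering the file names from the example with cat_or_dog_jpegs().matches(..) keeps "my_CaT_2020.jpg" and "daughters_Dog_2023.jpeg": each outer condition is an or group, and the two groups are combined with and.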

Filter by any pattern rules without regular expressions


// The same array may also be expressed via the new bounds_builder() function with chainable rules:
// The rule set is passed directly to filter_any_rules(); .as_vec() is only needed for methods ending in _conditional
let mixed_or_conditions = bounds_builder()
  .containing_ci("nepal")
  .containing_ci("india");

let file_names = &[
  "edited-img-Nepal-Feb-2003.psd",
  "image-Thailand-Mar-2003.jpg",
  "photo_Nepal_Jan-2005.jpg",
  "image-India-Mar-2003.jpg",
  "pic_nepal_Dec-2004.png"
];
  
let nepal_and_india_source_files: Vec<&str> = file_names.filter_any_rules(&mixed_or_conditions);
// should yield four file names: ["edited-img-Nepal-Feb-2003.psd", "photo_Nepal_Jan-2005.jpg", "image-India-Mar-2003.jpg", "pic_nepal_Dec-2004.png"]

// To combine and/or logic, you can filter all rules with a nested "or" clause.
let mixed_conditions_jpeg_only = bounds_builder()
  .ending_with_ci(".jpg")
  .or(mixed_or_conditions);
let nepal_and_india_source_files_jpgs: Vec<&str> = file_names.filter_all_rules(&mixed_conditions_jpeg_only);
// should yield two file names: ["photo_Nepal_Jan-2005.jpg", "image-India-Mar-2003.jpg"]

Enclose strings in common bounding characters

let sample_phrase = r#"LLM means "large language model""#;

let phrase_in_round_brackets = sample_phrase.parenthesize();
// yields (LLM means "large language model")
// but will not escape any parentheses in the source string.

let phrase_in_left_right_quotes = sample_phrase.enclose('“', '”');
// yields “LLM means "large language model"”
// in custom left and right quotation marks, but will not escape double quotes.

let phrase_in_double_quotes = sample_phrase.double_quotes_safe();
// yields "LLM means \"large language model\"" with backslash-escaped double quotes
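The escaping behaviour of the _safe variants can be approximated as follows. This is a simplified sketch under stated assumptions: enclose_safe is a hypothetical free function, and it treats any closing character preceded by a single backslash as already escaped, without handling runs of backslashes.

```rust
/// Wraps `sample` in `open` and `close`, inserting a backslash before each
/// occurrence of `close` inside the text unless one already precedes it.
fn enclose_safe(sample: &str, open: char, close: char) -> String {
    let mut out = String::with_capacity(sample.len() + 2);
    out.push(open);
    let mut prev_is_backslash = false;
    for ch in sample.chars() {
        if ch == close && !prev_is_backslash {
            out.push('\\');
        }
        out.push(ch);
        prev_is_backslash = ch == '\\';
    }
    out.push(close);
    out
}
```

Calling enclose_safe with a double quote for both bounds reproduces the double_quotes_safe() result shown above, and input that is already escaped passes through unchanged.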

Filter strings by character categories

let sample_str = "Products: $9.99 per unit, £19.50 each, €15 only. Zürich café cañon";

let vowels_only = sample_str.filter_by_type(CharType::Chars(&['a','e','i','o', 'u', 'é', 'ü', 'y']));
println!("{}", vowels_only);
// should print "oueuieaoyüiaéao"

let lower_case_letters_a_to_m_only = sample_str.filter_by_type(CharType::Range('a'..'n'));
println!("{}", lower_case_letters_a_to_m_only);
// should print  "dceieachlichcafca"

// You can filter strings by multiple character categories
let sample_with_lower_case_chars_and_spaces = sample_str.filter_by_types(&[CharType::Lower, CharType::Spaces]);
println!("{}", sample_with_lower_case_chars_and_spaces);
// Should print "roducts  per unit  each  only ürich café cañon"

Remove spaces only

let sample_str = "19 May 2021 ";
let sample_without_spaces = sample_str.strip_spaces();
println!("{}", sample_without_spaces);
// should print "19May2021"

Remove character categories from strings

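// the same product string as in the character category filter example
let sample_str = "Products: $9.99 per unit, £19.50 each, €15 only. Zürich café cañon";
let sample_without_punctuation = sample_str.strip_by_type(CharType::Punctuation);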
println!("{}", sample_without_punctuation);
// should print "Products 999 per unit £1950 each €15 only Zürich café cañon";

let sample_without_spaces_and_punct = sample_str.strip_by_types(&[CharType::Spaces, CharType::Punctuation]);
println!("{}", sample_without_spaces_and_punct);
// should print "Products999perunit£1950each€15onlyZürichcafécañon";
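Filtering and stripping by character category both reduce to a predicate over chars. The sketch below shows one way to express that with the standard library; CharClass is a hypothetical, reduced stand-in for the crate's CharType enum, and the two functions are illustrative rather than the crate's own code.

```rust
/// A hypothetical, reduced stand-in for the crate's CharType enum.
#[derive(Clone, Copy)]
enum CharClass {
    Spaces,
    Punctuation,
    Lower,
}

impl CharClass {
    fn matches(self, c: char) -> bool {
        match self {
            CharClass::Spaces => c.is_whitespace(),
            CharClass::Punctuation => c.is_ascii_punctuation(),
            CharClass::Lower => c.is_lowercase(),
        }
    }
}

/// Keeps only the characters that belong to at least one of the given classes.
fn filter_by_classes(sample: &str, classes: &[CharClass]) -> String {
    sample
        .chars()
        .filter(|c| classes.iter().any(|class| class.matches(*c)))
        .collect()
}

/// Removes every character that belongs to any of the given classes.
fn strip_by_classes(sample: &str, classes: &[CharClass]) -> String {
    sample
        .chars()
        .filter(|c| !classes.iter().any(|class| class.matches(*c)))
        .collect()
}
```

Note that £ and € survive the punctuation strip because is_ascii_punctuation() only covers ASCII symbols such as $ and the full stop.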

Split a string on a set of characters

let sample_str = "jazz-and-blues_music/section";
let parts = sample_str.split_on_any_char(&['-','_', '/']);
// should yield "jazz", "and", "blues", "music", "section" as a vector of strings

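Traits

Name	Number of methods	Description
MatchOccurrences	2	Returns the indices of all occurrences of an exact string (find_matched_indices) or of a single character (find_char_indices)
CharGroupMatch	6	Validates strings with character classes: has_digits, has_alphanumeric, has_alphabetic
IsNumeric	1	Checks whether the string can be parsed as an integer or float
StripCharacters	17	Removes unwanted characters by type, or extracts vectors of numeric strings, integers or floats, without regular expressions
SimpleMatch	6	Matches strings against common validation rules without regular expressions, e.g. starts_with_ci_alphanum checks whether the first letters or digits of a sample string match in case-insensitive mode
SimpleMatchesMany	6	Regex-free multiple match methods that accept an array of StringBounds items, tuples or patterns and return a vector of boolean results
SimpleMatchAll	4	Regex-free multiple match methods that accept an array of StringBounds items, tuples or patterns and return a boolean indicating whether all are matched
SimpleMatchAny	4	Regex-free multiple match methods that accept an array of StringBounds items, tuples or patterns and return a boolean indicating whether any is matched
SimpleFilterAll	2	Applies simple regex-free multiple match methods to an array or vector of strings and returns a filtered vector of string slices
ToSegments	14	Splits strings into parts, segments or head and tail pairs on a separator
ToSegmentFromChars	3	Splits strings on any of an array of characters
SimpleEnclose	10	Wraps strings in pairs of matching characters, with variants for different escape character rules
ToStrings	1	Converts an array or vector of strs to a vector of owned strings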

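Enums

CaseMatchMode

Defines case-sensitivity and alphanumeric-only modes.

Name	Equivalent suffix	Meaning
Sensitive	_cs	Case-sensitive
Insensitive	_ci	Case-insensitive; converts both the needle and the haystack to lower case for comparison
AlphanumInsensitive	_ci_alphanum	Removes all non-alphanumeric characters from the sample string and converts both the needle and the haystack to lower case for comparison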

StringBounds

Defines simple match rules with a pattern and a positivity flag, e.g. StringBounds::Contains("report", true, CaseMatchMode::Insensitive) or StringBounds::EndsWith(".docx", true, CaseMatchMode::Insensitive). The bounds_builder method helps build these rule sets.

All options accept the same three arguments (&str, bool, CaseMatchMode): the pattern to match, the positivity flag and the case match mode.

Name	Meaning
StartsWith	Starts with
EndsWith	Ends with
Contains	Contains
Whole	Whole string match

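CharType

Defines categories, sets or ranges of characters, as well as single characters.

Name	Arguments	Meaning
Any	-	Matches any character
DecDigit	-	Matches 0-9 only (is_ascii_digit)
Digit	(u8)	Matches digits in the specified radix (e.g. 16 for hexadecimal)
Numeric	-	Matches number-like characters in the decimal base. Unlike the is_numeric() extension method, it excludes . and -. Use to_numbers_conditional() to extract valid decimal numbers as strings
AlphaNum	-	Matches any alphanumeric character (is_alphanumeric)
Lower	-	Matches lower case letters (is_lowercase)
Upper	-	Matches upper case letters (is_uppercase)
Alpha	-	Matches any letter in most supported alphabets (is_alphabetic)
Spaces	-	Matches whitespace (c.is_whitespace())
Punctuation	-	Matches ASCII punctuation (c.is_ascii_punctuation())
Char	(char)	Matches a single character
Chars	(&[char])	Matches an array of characters
Range	(Range)	Matches a range, e.g. 'a'..'d' includes a, b and c, but not d. This follows the Unicode sequence.
Between	(char, char)	Matches characters between the specified characters, inclusive of both, e.g. Between('a', 'd') includes d.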
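The difference between Range and Between mirrors Rust's half-open and inclusive range syntax. The following standard-library sketch shows the two behaviours side by side; filter_by_range and filter_between are hypothetical helper names, not crate methods.

```rust
use std::ops::{Range, RangeInclusive};

/// Keeps characters inside a half-open range, so the end character is excluded.
fn filter_by_range(sample: &str, range: Range<char>) -> String {
    sample.chars().filter(|c| range.contains(c)).collect()
}

/// Keeps characters between two bounds, with both bounds included.
fn filter_between(sample: &str, first: char, last: char) -> String {
    let bounds: RangeInclusive<char> = first..=last;
    sample.chars().filter(|c| bounds.contains(c)).collect()
}
```

On "abcdef", filter_by_range with 'a'..'d' keeps "abc", whereas filter_between with 'a' and 'd' keeps "abcd".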

Structs

BoundsBuilder

This struct helps you build string pattern rules for use with the matched_by_rules(), filter_all_rules() and filter_any_rules() methods. The bounds_builder() function returns a base instance on which you can chain any number of rules and sub-rules.

Rule type (suffixed)	Meaning	Arguments	Variants
starting_with_ (✓)	Starts with	pattern: &str	_ci, _cs, _ci_alphanum
containing_ (✓)	Contains	pattern: &str	_ci, _cs, _ci_alphanum
ending_with_ (✓)	Ends with	pattern: &str	_ci, _cs, _ci_alphanum
is_ (✓)	Matches a whole pattern	pattern: &str	_ci, _cs, _ci_alphanum
not_starting_with_ (✓)	Does not start with	pattern: &str	_ci, _cs, _ci_alphanum
not_containing_ (✓)	Does not contain	pattern: &str	_ci, _cs, _ci_alphanum
not_ending_with_ (✓)	Does not end with	pattern: &str	_ci, _cs, _ci_alphanum
is_not_ (✓)	Does not match a whole pattern	pattern: &str	_ci, _cs, _ci_alphanum
starts_with (⤬)	Starts with	pattern: &str, is_positive: bool, case_insensitive: bool	-
contains (⤬)	Contains	pattern: &str, is_positive: bool, case_insensitive: bool	-
ends_with (⤬)	Ends with	pattern: &str, is_positive: bool, case_insensitive: bool	-
whole (⤬)	Matches a whole pattern	pattern: &str, is_positive: bool, case_insensitive: bool	-
or (⤬)	Matches any of the specified rules	rules: &BoundsBuilder	-
or_ (✓)	Matches any of the patterns with implied rules	patterns: &[&str]	All in the starting_with_, containing_, ending_with_ and is_ series
and (⤬)	Matches all of the specified rules	rules: &BoundsBuilder	-
and_ (✓)	Matches all of the patterns with implied rules	patterns: &[&str]	All in the starting_with_, containing_, ending_with_ and is_ series, plus their not_ equivalents

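Dev notes

This crate serves as a building block for other crates and supplements future versions of string-patterns. Some updates reflect minor editorial changes.

Version 0.3.13 introduced the .strip_spaces() method as shorthand for .strip_by_type(CharType::Spaces).

Version 0.3.11 introduced the .split_to_numbers::<T>(pattern: &str) method, which splits a list of numeric strings into a vector of the specified number type. This is useful for parsing common input formats such as latitude and longitude expressed as "42.282,-89.3938". Such input may fail with .to_numbers(): when a comma or point serves as the separator with no other characters in between, it may be confused with a decimal or thousands separator.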
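The idea of splitting first and parsing each piece separately can be sketched with the standard library as below. This is an illustrative stand-alone function under stated assumptions, not the crate's implementation; silently skipping pieces that fail to parse is a choice made for this sketch.

```rust
use std::str::FromStr;

/// Splits on the separator first, then parses each piece on its own,
/// skipping pieces that are not valid numbers of type T (an assumption).
fn split_to_numbers<T: FromStr>(sample: &str, separator: &str) -> Vec<T> {
    sample
        .split(separator)
        .filter_map(|piece| piece.trim().parse::<T>().ok())
        .collect()
}
```

Because the comma is consumed as the separator before any parsing happens, "42.282,-89.3938" yields two floats and the comma can never be mistaken for a decimal or thousands separator.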

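Version 0.3.8 adds and_not_ rule methods

This version introduces a set of rule methods prefixed with and_not_ to filter strings that do not match the specified array of patterns. For example, given a list of image file names that start with animal names, we may want to match those starting with "cat" or "dog", irrespective of case, but exclude those ending in ".psd" or ".pdf".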

  /// file names starting with cat or dog, but not ending in .pdf or .psd
  let file_names = [
    "CAT-pic-912.png", // OK
    "dog-pic-234.psd",
    "dOg-photo-876.png", // OK
    "rabbit-pic-194.jpg",
    "cat-pic-787.pdf",
    "cats-image-873.webp", // OK
    "cat-pic-090.jpg", // OK
  ];

  let rules = bounds_builder()
    .or_starting_with_ci(&["cat", "dog"])
    .and_not_ending_with_ci(&[".psd", ".pdf"]);
  let matched_files = file_names.filter_all_rules(&rules);
  /// This should yield ["CAT-pic-912.png", "dOg-photo-876.png", "cats-image-873.webp", "cat-pic-090.jpg"]

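Version 0.3.0 extends the range of available rules

This version substantially revises the StringBounds enum by introducing the supplementary BoundsPosition and CaseMatchMode enums, so that it can handle the full range of rules available via bounds_builder(). These rule sets can be used with matched_by_rules(), filter_all_rules() and filter_any_rules().

Full documentation for the 0.2.* series is available in the v0-2 branch of the GitHub repository.

Version 0.2.5 introduces SimpleMatchAny and Whole matches in StringBounds

This complements SimpleMatchAll by applying or logic to rule sets (StringBounds, tuples or simple strings). The StringBounds enum now has whole-string match options (with case-insensitive and case-sensitive variants) to accommodate a mix of partial and whole-string matches. It also adds a variety of single-argument methods to bounds_builder().

Versions of string-patterns before 0.3.0 contained many of these extensions. Since version 0.3.0, all traits, enums and methods defined in simple-string-patterns have been removed from it. The two crates complement each other but can be installed independently.

Version 0.2.2 introduced three new features

bounds_builder() makes it easier to define string matching rules for methods that require an array of StringBounds rules, e.g. filter_all_conditional(). See the examples above.
ToSegmentFromChars provides new methods to split on an array of characters, e.g. when dealing with common patterns that may use a predictable set of separators. This mimics character classes in regular expressions and is more efficient when only a limited number of split characters needs to be allowed.
MatchOccurrences has a find_char_indices method that accepts a char rather than a &str. This avoids the need to convert a character to a string.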
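Both of the last two features map closely onto standard-library primitives. The sketch below is an approximation for illustration only: the function names follow the crate's methods, but the bodies are assumptions, and whether the crate reports byte offsets or character positions is not stated here, so byte offsets are assumed.

```rust
/// Splits on any character in the set; std accepts a char slice as a pattern.
fn split_on_any_char(sample: &str, separators: &[char]) -> Vec<String> {
    sample.split(separators).map(String::from).collect()
}

/// Byte offsets of every occurrence of a single character,
/// found without converting the char to a string first.
fn find_char_indices(sample: &str, target: char) -> Vec<usize> {
    sample
        .char_indices()
        .filter(|(_, c)| *c == target)
        .map(|(index, _)| index)
        .collect()
}
```

For example, split_on_any_char("jazz-and-blues_music/section", &['-', '_', '/']) produces the five words, and find_char_indices("a-b-c", '-') returns the offsets 1 and 3.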