3 unstable releases

0.3.9	May 3, 2024
0.3.8	Apr 6, 2024
0.3.7	~~Mar 29, 2024~~
0.3.3	~~Feb 29, 2024~~
0.1.9	~~Jan 4, 2024~~

#230 in Text processing

517 downloads per month
Used in file organiser

GPL-2.0-or-later WITH Bison-exception-2…

70KB
740 lines

String Patterns

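This library makes it easier to work with regular expressions in Rust. It builds on the standard regular expression crate, regex, and has no other dependencies. It complements simple-string-patterns, which provides a range of regex-free extension methods to match, split and filter strings by character type or range, relying on the standard library only.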

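Together, these crates aim to make string handling in Rust as simple as it is in JavaScript or Python, with cleaner syntax. Simple string-matching methods such as starts_with, contains or ends_with will always perform better than a regular expression, especially when processing large data sets.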
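As a rough illustration of that last point, the sketch below handles a few simple checks with standard-library methods alone, so no regular expression is compiled at all. The classify function and its categories are hypothetical and exist only for this example.

```rust
/// Classifies a file name using only standard-library string methods.
/// The categories and rules are illustrative assumptions, not part of
/// string-patterns or simple-string-patterns.
fn classify(file_name: &str) -> &'static str {
    // Lower-case once so that every check below is case-insensitive.
    let lower = file_name.to_lowercase();
    if lower.starts_with("img_") || lower.ends_with(".jpg") {
        "image"
    } else if lower.contains("report") {
        "report"
    } else {
        "other"
    }
}
```

A regular expression only becomes worthwhile once the rule needs alternation, character classes or captures.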

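The core PatternMatch and PatternReplace traits are implemented for arrays and vectors of strings so that a regular expression need not be compiled inside a loop. You may need to reimplement them for vectors of custom structs, following the example further down. Simply calling my_string.pattern_match_ci("complex_regex") in a loop is an anti-pattern, because the same regular expression is expensively recompiled on every iteration. The same principle applies to the replace methods, which are implemented only for String and Vec<String>.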

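Version 0.3.8 introduced the _replace_first methods, which replace only the first match in the sample string by calling re.replace rather than re.replace_all. This is faster when you only need to replace one matched pattern in each string.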

Method overview

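Component(s)	Meaning
_result _⇥	Returns a Result with a regex::Error if the regular expression fails
- _⇥	Many match and replace methods without a _ci or _cs suffix require a boolean case_insensitive parameter
_cs _⇥	Case-sensitive
_ci _⇥	Case-insensitive
_replace _↔︎⇥	Replaces all matches
_replace_first _↔︎⇥	Replaces the first match only
_word(s) _↔︎⇥	Matches whole or partial words according to boundary rules
_match_all _↔︎⇥	Requires all patterns in the array to match
_match_any _↔︎⇥	Returns true if any pattern in the array matches
_captures _⇥	Returns an iterable regex captures object
_matches _↔︎⇥	Returns a vector of boolean results for an array of regex patterns
_matches_vec _⇥	Returns a vector of regex::Match objects with start and end offsets
_matches_outer _⇥	Returns a vector of outer (whole-pattern) Match objects with start and end offsets
_matches_filtered _⇥	Returns a filtered vector of the matched string slices
_split _↔︎⇥	Returns a vector or a tuple pair
_filter, _filter_word _↔︎⇥	Filters an array or vector of strings by a regex pattern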

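Version 0.3.4 added PatternFilter, with methods to filter arrays or vectors of strings by a regex pattern. These methods are similar to filter_all_conditional in simple-string-patterns, but use a single regular expression rather than a set of rules.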

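As of version 0.3.0, this crate contains only the core text-processing extensions that depend on regular expressions. Other methods bundled with earlier versions have migrated to the simple-string-patterns crate. The two crates complement each other, but can be installed separately if you need only some of their features.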

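In case-insensitive mode, the non-capturing /(?i)/ flag is prepended automatically, but it is omitted if you add another non-capturing group at the start of the regular expression. The _ci suffix is equivalent to the i modifier in /my_complex_regex/i, as used in JavaScript, Perl and many command-line tools.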
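The flag handling described above might look roughly like the following sketch. with_case_flag is a hypothetical helper written for this illustration, not the crate's actual code, and the test for a leading "(?" group is an assumption about how the omission rule is applied.

```rust
/// Hypothetical helper (not the crate's own code): prefixes a pattern with
/// the inline `(?i)` flag when case-insensitive matching is requested,
/// unless the pattern already opens with a `(?` group of its own.
fn with_case_flag(pattern: &str, case_insensitive: bool) -> String {
    if case_insensitive && !pattern.starts_with("(?") {
        format!("(?i){}", pattern)
    } else {
        // Case-sensitive mode, or the caller supplied a leading group:
        // pass the pattern through untouched.
        pattern.to_string()
    }
}
```

Under these assumptions, a pattern that begins with its own flag group, such as "(?x)", is left alone, so the caller stays in control of the flags.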

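In other respects, the pattern_-prefixed methods behave like the re.is_match, re.replace_all, re.replace, re.find and re.captures_iter methods in the regex crate. String-patterns unlocks much of the core functionality of the regex crate, on which it depends, to cover most common use cases in text processing and to serve as a building block for specific validators (e.g. email validation) and text transformers.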

Case sensitivity

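Most match methods work on both &str and String, whereas the replace methods work on owned strings only. Likewise, match methods work on arrays of Strings or of string slices, whereas the replace methods work only on arrays of owned strings. The traits may be implemented for structs or tuples with a string field.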

Regular expression match in standard Rust with the regex library


use regex::Regex;

fn is_valid_time_string(input: &str) -> bool {
  let time_format_pattern = r#"^([01]\d|2[0-3])?:[0-5]\d(:[0-5]\d)?$"#;
  if let Ok(re) = Regex::new(time_format_pattern) {
    re.is_match(input)
  } else {
    false
  }
}

Cleaner syntax with the string-patterns library

use string_patterns::PatternMatch;

fn is_valid_time_string(input: &str) -> bool {
  input.pattern_match_cs(r#"^([01]\d|2[0-3])?:[0-5]\d(:[0-5]\d)?$"#)
}

Sample replacement in standard Rust with the regex library


use regex::Regex;

fn replace_final_os(input: &str) -> String {
  let regex_str = r#"(\w)o\b"#;
  if let Ok(re) = Regex::new(regex_str) {
    re.replace_all(input, "${1}um").to_string()
  } else {
    input.to_string()
  }
}

Cleaner syntax with the string-patterns library


use string_patterns::PatternReplace;

fn replace_final_os(input: &str) -> String {
  // case-insensitive replacement,
  // NB: the regex syntax and capture rules are enforced by the Regex library
  input.to_string().pattern_replace_ci(r#"(\w)o\b"#, "${1}um") 
}

Replace the first match only

  let sample_path = "/User/me/Documents/Accounts/docs/2023/earnings-report.pdf".to_string();
  // should match any segment between / characters starting with 'doc' and ending in 's'
  let pattern = r#"/doc[^/]*s/"#;
  let replacement = r#"/files/"#;
  // Only replace the first segment matching the above pattern case-insensitively
  let new_path_1 = sample_path.pattern_replace_first_ci(pattern, replacement);
  // should yield = "/User/me/files/Accounts/docs/2023/earnings-report.pdf"
  // replace all matches. Will replace /docs/ as well as /Documents/
  let new_path_2 = sample_path.pattern_replace_ci(pattern, replacement);
  // should yield = "/User/me/files/Accounts/files/2023/earnings-report.pdf"

Extract the first match from a string

let str_1 = "The park has many lions, spotted hyenas, leopards, rhinoceroses, hippopotamuses, giraffes, cheetahs and baboons";
if let Some(matched_item) = str_1.pattern_first_match(r#"\bspotted\s+\w+\b"#, true) {
  println!("`{}` occurs between positions {} and {}", matched_item.as_str(), matched_item.start(), matched_item.end());
}

Match within an array of strings

let sample_strs = [
  "pictures_Italy-1997",
  "photos-portugal-2001",
  "imagini-italia_2002",
  "images-france-2003",
];
let test_pattern = r#"[^a-z]ital(y|ia)"#; // matches 'italy' or 'italia'
// The regular expression will only be compiled once
if sample_strs.pattern_match_ci(test_pattern) {
  println!("Some of these folders are related to Italy");
}

// Filter the above array
let filtered_strs = sample_strs.pattern_matches_filtered_ci(test_pattern);
// should yield ["pictures_Italy-1997","imagini-italia_2002"]

Count matches of a pattern

let sample_text = r#"Humpty Dumpty sat on a wall,
          Humpty Dumpty had a great fall
          All the king's horses and all the king's men
          Couldn't put Humpty together again."#;
let sample_word = "humpty";
// count the number of whole-word occurrences in case-insensitive mode
let num_occurrences = sample_text.count_word(sample_word, true);
println!("{} occurs {} times in the above text", sample_word, num_occurrences);
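For a single literal word, a comparable count can be made without a regular expression. The sketch below is a standard-library approximation and not part of string-patterns. count_whole_word is a hypothetical helper that treats every non-alphanumeric character as a word separator, which is simpler than the regex \b rule and may differ from it (for example around underscores).

```rust
/// Hypothetical std-only helper: counts case-insensitive whole-word
/// occurrences of `word` in `text`, splitting on any character that is
/// not alphanumeric.
fn count_whole_word(text: &str, word: &str) -> usize {
    let target = word.to_lowercase();
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|token| token.to_lowercase() == target)
        .count()
}
```

Anything beyond a fixed word, such as optional plurals or alternative spellings, is better served by the pattern-based methods.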

Replace text in a vector of strings

let sample_strings = ["apples", "bananas", "carrots", "dates"].to_strings(); // cast to vector of owned strings
let pattern = r#"a([pr])"#;
let replacement = "æ$1";
// With arrays or vectors the regex need only be compiled once
// case-insensitive replacement
let new_strings = sample_strings.pattern_replace_ci(pattern, replacement); 
// should yield the strings "æpples", "bananas", "cærrots", "dates"
// only replacing 'a' with 'æ' before 'p' or 'r'

Replace multiple pattern/replacement pairs

let source_str = "The colourful fishing boats adorned the island's harbours.".to_string();
let pattern_replacements = [
  ("colour", "color"),
  ("harbour", "harbor"),
];
// Matching is case-sensitive, so a capitalised "Colourful" would be left unchanged
// Should read "The colorful fishing boats adorned the island's harbors."
let target_str = source_str.pattern_replace_pairs_cs(&pattern_replacements);
// NB: Prior to version 0.2.19  this was pattern_replace_pairs()
// which now requires a second parameter

Filter an array or vector of strings by a regex pattern

let source_strs = [
  "Ristorante-Venezia-2019.jpg",
  "Mercado_venecia_2000.jpg",
  "Mercado_venezuela_2011.jpg",
  "Venice_Oct_2012.png",
  "2venice2003.jpg",
  "venetian_blinds.jpg",
];

// filter by file names referencing Venice in various languages, but not Venezuela or venetian blinds
let pattern = "ven(ezia|ecia|edig|ice|ise)[^a-z]*";

let filtered_strs = source_strs.pattern_filter_ci(pattern); 
// should yield ["Ristorante-Venezia-2019.jpg", "Mercado_venecia_2000.jpg", "Venice_Oct_2012.png", "2venice2003.jpg"]

Replace multiple word pairs in case-sensitive mode

// Whole-word replacement, with cleaner and less error-prone syntax than hand-written word boundaries
let source_str = "The dying King Edmund decides to try to save Lear and Cordelia.";
let pattern_replacements = [
  ("Edmund", "Edward"),
  ("Lear", "Larry"),
  ("Cordelia", "Cecilia")
];
// Should read "The dying King Edward decides to try to save Larry and Cecilia."
let target_str = source_str.to_string().replace_words_cs(&pattern_replacements);

Match any word in case-insensitive mode

let source_str = "Two cheetahs ran across the field";
let cat_like_words = [
  "lions?","tigers?", "pumas?",
  "panthers?", "jaguars?", "leopards?",
  "lynx(es)?", "cheetahs?"
];
if source_str.match_any_words_ci(&cat_like_words) {
  println!("`{}` is related to cats", source_str);
}

Split a string by a pattern

let sample_string = "books, records and videotapes";
let pattern = r#"\s*(,|and)\s"#;
 // case-insensitive split
let items = sample_string.pattern_split_ci(pattern);
// should yield a vector of strings: vec!["books", "records", "videotapes"]

Split a string into a head/tail pair by a pattern (case-sensitive)

let sample_string = "first / second - third ; fourth";
let pattern = r#"\s*[/;-]\s*"#;
// case-sensitive split
let (head, tail) = sample_string.pattern_split_pair_cs(pattern); 
// should yield => head: "first" and tail: "second - third ; fourth"

Get a vector of pattern match objects with start and end indices, plus captured substrings

let sample_string = "All the world's a stage, and all the men and women merely players.";
// Words ending in 'men' with 0 to 4 preceding characters. the sequence in parentheses is an inner capture.
let pattern = r#"\b\w{0,4}(men)\b"#;
let outer_matches = sample_string.pattern_matches_outer(pattern,true);
// should yield a vector with the outer matches only, but with start and end offsets
if let Some(second_match) = outer_matches.get(1) {
  println!("the second match '{}' starts at {} and ends at {}", second_match.as_str(), second_match.start(), second_match.end());
  // should print the matched word 'women' and its start and end indices
}

let all_captures = sample_string.pattern_captures(pattern, true);
// Yields an iterable regex::Captures object with all nested captured groups

Extract three floating-point values from a longer string

This example requires the simple-string-patterns crate.


let input_str = "-78.29826, 34.15 160.9";
// the pattern expects valid decimal numbers separated by commas and/or one or more spaces
let split_pattern = r#"(\s*,\s*|\s+)"#;

let numbers: Vec<f64> = input_str.pattern_split_cs(split_pattern)
    .into_iter()
    .filter_map(|s| s.to_first_number::<f64>())
    .collect();
// yields a vector of three f64 numbers [-78.29826, 34.15, 160.9];
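When neither crate is available, much the same result can be approximated with the standard library, assuming each token is a clean decimal number. extract_numbers is a hypothetical helper written for this sketch. It skips any token that str::parse rejects, so it is stricter than a method that looks for a number inside a longer token.

```rust
/// Hypothetical std-only helper: splits on commas and whitespace, then
/// keeps every token that parses as an f64.
fn extract_numbers(input: &str) -> Vec<f64> {
    input
        .split(|c: char| c == ',' || c.is_whitespace())
        // Consecutive separators such as ", " produce empty tokens.
        .filter(|token| !token.is_empty())
        .filter_map(|token| token.parse::<f64>().ok())
        .collect()
}
```

This trades the flexibility of a regex split pattern for zero dependencies, in line with the earlier remark that simple string methods perform better where they suffice.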

Example of PatternMatch and PatternFilter implemented for a custom struct

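use regex::Error; // also re-exported by string-patterns as of 0.3.6
use string_patterns::{PatternFilter, PatternMatch};
// build_regex is assumed to be in scope; its import is not shown in the original example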

// Simple struct with a core text field
#[derive(Debug, Clone)]
pub struct Message {
  text: String,
  timestamp: i64,
  from: String,
  to: String,
}

impl PatternMatch for Message {
  // All other pattern_match variants with a single regular expression are implemented automatically
  fn pattern_match_result(&self, pattern: &str, case_insensitive: bool) -> Result<bool, Error> {
    self.text.pattern_match_result(pattern, case_insensitive)
  }
}

/// The regular expression is compiled only once. If the regex fails, all items are returned
impl<'a> PatternFilter<'a, Message> for [Message] {
  fn pattern_filter(&'a self, pattern: &str, case_insensitive: bool) -> Vec<Message> {
    if let Ok(re) = build_regex(pattern, case_insensitive) {
      self.iter().filter(|m| re.is_match(&m.text)).cloned().collect()
    } else {
      self.to_owned()
    }
  }
}

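Traits

Name	Description
PatternMatch	Core regular expression match methods, wrappers for re.is_match with case-insensitive (_ci) and case-sensitive (_cs) variants
PatternMatchMany	Provides methods to match multiple patterns expressed as arrays of tuples or simple strings
PatternMatchesMany	As above, but returns a vector of booleans with the result for each pattern, with variant methods for whole-word matches
PatternMatches	Pattern methods for arrays or vectors only; return vectors of (boolean result, string slice) pairs or filtered vectors of matched string slices
PatternReplace	Core regular expression replacement methods
PatternFilter	Filters arrays or vectors of strings by a single regex pattern
PatternReplaceMany	Provides methods to replace multiple patterns expressed as arrays of tuples
PatternSplit	Provides methods to split a string into a vector of strings or a head/tail tuple of strings
MatchWord	Provides convenience methods to match words with various word boundary rules
ReplaceWord	Provides methods to replace one or more words with clean syntax
PatternCapture	Returns captures or vectors of each match, whether overlapping or not, and counts of matched patterns or words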

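Enums

WordBounds: has None, Start, End and Both options, and provides methods to render regex sub-patterns with the chosen word boundaries
- None: no bounds
- Start: from the start of a word
- End: to the end of a word
- Both: whole word, although the pattern may include spaces or other punctuation to match one or more words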
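To make the boundary options concrete, here is a minimal sketch of how such an enum could render a sub-pattern. Bounds and its render method are hypothetical stand-ins written for this illustration; the crate's own WordBounds methods may be named and implemented differently.

```rust
/// Hypothetical stand-in for the crate's WordBounds enum. The variant
/// names follow the list above; the `render` method is an assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Bounds {
    None,
    Start,
    End,
    Both,
}

impl Bounds {
    /// Wraps `word` in regex `\b` word-boundary markers as requested.
    fn render(self, word: &str) -> String {
        match self {
            Bounds::None => word.to_string(),
            Bounds::Start => format!(r"\b{}", word),
            Bounds::End => format!(r"{}\b", word),
            Bounds::Both => format!(r"\b{}\b", word),
        }
    }
}
```

In this sketch, Bounds::Both.render("cheetahs?") produces a pattern that matches "cheetah" or "cheetahs" only as a whole word.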

Dev notes

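Version 0.3.8 added the pattern_replace_first_result and pattern_replace_first variant methods. They are implemented for String and Vec<String>, but need to be reimplemented for custom structs or collection types. Only the _ci and _cs variants have default implementations.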

As of version 0.3.8, the crate re-exports regex::Captures and regex::Match to help with custom implementations.

As of version 0.3.6, the crate re-exports regex::Regex and regex::Error to help with custom implementations.

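As of version 0.3.0, the crate is almost feature-complete, although still in beta. All new features will go into a future string-patterns-extras crate, which builds on this library and simple-string-patterns. Version 0.3.5 has no new features, only more comments, and some methods now have default implementations.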

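Notes for the 0.2.* series can be found in the v0.2.* branch of the GitHub repository. If you are upgrading from a version prior to 0.3.0, you may also need to install simple-string-patterns.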

Removed methods

Only one regex method, match_words_by_proximity, has been removed. However, it may reappear in a future string-patterns-extras crate.

NB: Some updates reflect editorial changes only.

Dependencies

~2.2–3MB
~54K SLoC