23 unstable releases (3 breaking changes)

Latest 0.5.4	Aug 23, 2024
0.5.3	Jul 26, 2024
0.4.6	Jul 15, 2024
0.3.4	Jun 27, 2024
0.2.10	Jun 20, 2024

#853 in Text processing

650 downloads per month

Apache-2.0 OR MIT

340KB
3K SLoC

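Matcher Rust Implementation with PyO3 Binding

A high-performance matcher, implemented in Rust, designed to handle logical rules and text variations in word matching.

For implementation details, see the Design Document.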

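Features

Multiple matching methods:
- Simple word matching
- Regex-based matching
- Similarity-based matching
Text normalization:
- Fanjian: convert traditional Chinese characters to simplified ones. Example: 蟲艸 -> 虫艹
- Delete: remove specific characters. Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
- Normalize: normalize special characters to recognizable characters. Example: 𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟! -> hello world!
- PinYin: convert Chinese characters to pinyin, keeping syllable boundaries, for fuzzy matching. Example: 西安 -> xi an, which matches 洗按 -> xi an but not 先 -> xian
- PinYinChar: convert Chinese characters to pinyin without syllable boundaries. Example: 西安 -> xian, which matches both 洗按 and 先 -> xian
AND, OR, NOT word matching:
- Takes the number of repetitions of each word into account.
- Example: hello&world matches hello world and world,hello
- Example: 无&法&无&天 matches 无无法天 (because 无 appears twice), but not 无法天
- Example: hello~helloo~hhello matches hello, but not helloo or hhello
Customizable exemption lists: exclude specific words from matching.
Efficient handling of large word lists: optimized for performance.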
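
The repetition and NOT semantics listed above can be illustrated with a small pure-Python sketch. It is not the library's implementation: `rule_matches` is a hypothetical helper, it assumes the text has already been normalized, and it assumes that the first `~`-separated segment holds the required words while every later segment is a single forbidden word.

```python
from collections import Counter


def rule_matches(rule: str, text: str) -> bool:
    """Evaluate one `&`/`~` rule against already-normalized text."""
    # Assumption: the first `~`-separated segment holds the required words,
    # and every later segment is a single forbidden word.
    required_part, *forbidden = rule.split("~")
    required = Counter(required_part.split("&"))

    # A word listed n times must occur at least n times in the text.
    for word, times in required.items():
        if text.count(word) < times:
            return False

    # Any forbidden word that is present cancels the match.
    return not any(word in text for word in forbidden)
```

Counting occurrences with `Counter` is what makes 无&法&无&天 require two occurrences of 无 rather than one.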

Installation

Use pip

pip install matcher_py

Install a pre-built binary

Visit the release page to download a pre-built binary.

Usage

All relevant types are defined in extension_types.py.

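Explanation of the configuration

Matcher's configuration is defined by the MatchTableMap = Dict[int, List[MatchTable]] type. The key of MatchTableMap is called match_id, and within each match_id every table_id must be unique.
SimpleMatcher's configuration is defined by the SimpleTable = Dict[ProcessType, Dict[int, str]] type. The key of Dict[int, str] is called word_id, and word_id must be globally unique.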
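
The two uniqueness rules can be checked before a configuration is handed to the matcher. The sketch below is only an illustration: `duplicate_word_ids` and `duplicate_table_ids` are hypothetical helpers, and plain strings and dicts stand in for the real ProcessType and MatchTable types.

```python
from typing import Dict, List


def duplicate_word_ids(simple_table: Dict[str, Dict[int, str]]) -> List[int]:
    """Return word_ids that appear under more than one process type."""
    seen = set()
    duplicates = []
    for words in simple_table.values():
        for word_id in words:
            if word_id in seen:
                duplicates.append(word_id)
            seen.add(word_id)
    return duplicates


def duplicate_table_ids(match_table_map: Dict[int, List[dict]]) -> Dict[int, List[int]]:
    """Return, per match_id, the table_ids that are used more than once."""
    result = {}
    for match_id, tables in match_table_map.items():
        seen, dups = set(), []
        for table in tables:
            if table["table_id"] in seen:
                dups.append(table["table_id"])
            seen.add(table["table_id"])
        if dups:
            result[match_id] = dups
    return result
```

An empty return value from both helpers means the configuration satisfies the constraints described above.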

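MatchTable

table_id: the unique ID of the match table.
match_table_type: the type of the match table.
word_list: the word list of the match table.
exemption_process_type: the process type applied when matching exemption words (exemptions always use simple matching).
exemption_word_list: the exemption word list of the match table.

For each match table, word matching is performed against word_list and exemption matching is performed against exemption_word_list. If the exemption match is True, the word match result is False.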
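
The exemption rule amounts to a veto. The following sketch shows only that logic: `table_matches` is a hypothetical helper, and plain substring search stands in for the library's actual matching and text processing.

```python
from typing import List


def table_matches(text: str, word_list: List[str], exemption_word_list: List[str]) -> bool:
    """Return True when a word matches and no exemption word does."""
    # Plain substring search stands in for the library's real matching.
    if any(word in text for word in exemption_word_list):
        return False
    return any(word in text for word in word_list)
```

With word_list ["hello", "world"] and exemption_word_list ["word"], the text "hello" matches, while "hello word" does not because the exemption fires.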

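MatchTableType

Simple: supports simple multi-pattern matching, with the text normalization defined by process_type.
- It handles combined patterns and repetition-sensitive matching, with parts delimited by & and ~. For example, hello&world&hello matches hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
Regex: supports regex pattern matching.
- SimilarChar: supports similar-character matching using regex.
  - ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] matches helloworld!, hollowrd?, hi🌍~ ··· any combination of the comma-separated words in the list.
- Acrostic: supports acrostic matching using regex (currently only Chinese and simple English sentences are supported).
  - ["h,e,l,l,o", "你,好"] matches "hope, endures, love, lasts, onward." and "你的笑容温暖, 好心情常伴。".
- Regex: supports regex matching.
  - ["h[aeiou]llo", "w[aeiou]rd"] matches hello, world, hillo, wurld ··· any text matched by a regex in the list.
Similar: supports similar-text matching based on distance and threshold.
- Levenshtein: supports similar-text matching based on Levenshtein distance.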
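
One way to picture SimilarChar is as a regex with one group of alternatives per list entry. The sketch below builds such a pattern with Python's `re` module. `similar_char_pattern` is a hypothetical helper, and the plain concatenation of groups is an assumption: the library may construct its pattern differently, for example by tolerating extra characters between groups.

```python
import re
from typing import List


def similar_char_pattern(word_list: List[str]) -> re.Pattern:
    """Build one regex with a group of alternatives per list entry."""
    groups = []
    for entry in word_list:
        # Escape each alternative so characters like `?` are taken literally.
        alternatives = "|".join(re.escape(word) for word in entry.split(","))
        groups.append(f"({alternatives})")
    # Assumption: groups are simply concatenated; the library may allow
    # extra characters between them.
    return re.compile("".join(groups))


pattern = similar_char_pattern(["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"])
```

Calling `pattern.search(text)` then returns a match object for inputs such as helloworld! or hi🌍~, and None for text that lacks one of the three groups.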

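ProcessType

None: no transformation.
Fanjian: traditional Chinese to simplified Chinese conversion. Based on FANJIAN.
- 妳好 -> 你好
- 現⾝ -> 现身
Delete: delete all punctuation, special characters, and whitespace. Based on TEXT_DELETE and WHITE_SPACE.
- hello, world! -> helloworld
- 《你∷好》 -> 你好
Normalize: normalize all English character variations and number variations to basic characters. Based on NORM and NUM_NORM.
- ℋЀ⒈㈠Õ -> he11o
- ⒈Ƨ㊂ -> 123
PinYin: convert all Unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
- 你好 -> ni hao
- 西安 -> xi an
PinYinChar: convert all Unicode Chinese characters to pinyin without boundaries. Based on PINYIN.
- 你好 -> nihao
- 西安 -> xian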
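
As a rough approximation of what Delete does, the sketch below drops every character whose Unicode general category is punctuation, symbol, or separator. This is an assumption made for illustration: the library relies on its own TEXT_DELETE and WHITE_SPACE tables, which may differ from these categories, and `delete_process` is a hypothetical name.

```python
import unicodedata


def delete_process(text: str) -> str:
    """Drop punctuation, symbols, separators, and whitespace."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # P* = punctuation, S* = symbols, Z* = separators (spaces).
        if category[0] in "PSZ" or ch.isspace():
            continue
        kept.append(ch)
    return "".join(kept)
```

Under this approximation the two Delete examples above reduce to helloworld and 你好.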

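You can combine these transformations as needed. For convenience, pre-defined combinations such as DeleteNormalize and FanjianDeleteNormalize are provided.

Avoid combining PinYin and PinYinChar, because PinYin is a more restricted version of PinYinChar. Without boundaries, a string such as xian can be read either as the two syllables xi and an or as the single syllable xian; PinYin keeps the boundaries and so does not have this ambiguity.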
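
The difference between the two pinyin modes can be shown with a toy table. Everything here is hypothetical: the five-entry `PINYIN` dict stands in for the library's full PINYIN map, and wrapping each syllable in spaces is an assumed way of marking boundaries.

```python
# Hypothetical miniature pinyin table; the library ships a full PINYIN map.
PINYIN = {"西": "xi", "安": "an", "洗": "xi", "按": "an", "先": "xian"}


def pinyin(text: str) -> str:
    """Convert with boundaries: each syllable is wrapped in spaces."""
    return "".join(f" {PINYIN[ch]} " if ch in PINYIN else ch for ch in text)


def pinyin_char(text: str) -> str:
    """Convert without boundaries: syllables are concatenated."""
    return "".join(PINYIN.get(ch, ch) for ch in text)
```

With boundaries, 西安 and 洗按 produce the same string while 先 does not occur inside it. Without boundaries, all three collapse to xian, which is why the looser mode matches more.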

Text Process Usage

Here is an example of how to use the reduce_text_process and text_process functions:

from matcher_py import reduce_text_process, text_process
from matcher_py.extension_types import ProcessType

print(reduce_text_process(ProcessType.MatchDeleteNormalize, "hello, world!"))
print(text_process(ProcessType.MatchDelete, "hello, world!"))

Matcher Basic Usage

Here is an example of how to use the Matcher:

import msgspec

from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, ProcessType, RegexMatchType, SimMatchType

json_encoder = msgspec.json.Encoder()
matcher = Matcher(
    json_encoder.encode({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple(process_type = ProcessType.MatchFanjianDeleteNormalize),
                word_list=["hello", "world"],
                exemption_process_type=ProcessType.MatchNone,
                exemption_word_list=["word"],
            ),
            MatchTable(
                table_id=2,
                match_table_type=MatchTableType.Regex(
                  process_type = ProcessType.MatchFanjianDeleteNormalize,
                  regex_match_type=RegexMatchType.Regex
                ),
                word_list=["h[aeiou]llo"],
                exemption_process_type=ProcessType.MatchNone,
                exemption_word_list=[],
            )
        ],
        2: [
            MatchTable(
                table_id=3,
                match_table_type=MatchTableType.Similar(
                  process_type = ProcessType.MatchFanjianDeleteNormalize,
                  sim_match_type=SimMatchType.MatchLevenshtein,
                  threshold=0.5
                ),
                word_list=["halxo"],
                exemption_process_type=ProcessType.MatchNone,
                exemption_word_list=[],
            )
        ]
    })
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("word")
# Perform process as a list
result = matcher.process("hello")
assert result == [{'match_id': 1,
  'table_id': 2,
  'word_id': 0,
  'word': 'h[aeiou]llo',
  'similarity': 1.0},
 {'match_id': 1,
  'table_id': 1,
  'word_id': 0,
  'word': 'hello',
  'similarity': 1.0},
 {'match_id': 2,
  'table_id': 3,
  'word_id': 0,
  'word': 'halxo',
  'similarity': 0.6}]
# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1] == [{'match_id': 1,
  'table_id': 2,
  'word_id': 0,
  'word': 'h[aeiou]llo',
  'similarity': 1.0},
 {'match_id': 1,
  'table_id': 1,
  'word_id': 0,
  'word': 'hello',
  'similarity': 1.0},
 {'match_id': 1,
  'table_id': 1,
  'word_id': 1,
  'word': 'world',
  'similarity': 1.0}]
# Perform word matching as a string
result = matcher.word_match_as_string("hello")
assert result == """{"2":[{"match_id":2,"table_id":3,"word_id":0,"word":"halxo","similarity":0.6}],"1":[{"match_id":1,"table_id":2,"word_id":0,"word":"h[aeiou]llo","similarity":1.0},{"match_id":1,"table_id":1,"word_id":0,"word":"hello","similarity":1.0}]}"""
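
The similarity of 0.6 reported for halxo in the example above is consistent with a Levenshtein distance normalized by the longer string's length: hello and halxo differ by two substitutions over five characters. The sketch below reproduces that arithmetic. The normalization formula is an assumption inferred from the example output, not taken from the library's source, and both function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

Under this assumption, a table with threshold=0.5 accepts hello against halxo because 0.6 is not below the threshold.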

Simple Matcher Basic Usage

Here is an example of how to use the SimpleMatcher:

import msgspec

from matcher_py import SimpleMatcher
from matcher_py.extension_types import ProcessType

json_encoder = msgspec.json.Encoder()
simple_matcher = SimpleMatcher(
    json_encoder.encode(
        {
            ProcessType.MatchNone: {
                1: "hello&world",
                2: "word&word~hello"
            },
            ProcessType.MatchDelete: {
                3: "hallo"
            }
        }
    )
)
# Check if a text matches
assert simple_matcher.is_match("hello^&!#*#&!^#*()world")
# Perform simple processing
result = simple_matcher.process("hello,world,word,word,hallo")
assert result == [{'word_id': 1, 'word': 'hello&world'}, {'word_id': 3, 'word': 'hallo'}]

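Contributing

Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

License

matcher_py is licensed under the MIT OR Apache-2.0 license.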
