1 unstable release

0.9.1	Oct 15, 2020

#1652 in Encoding

MIT license

67KB
873 lines

Utf8Iterator

A Utf8Iterator wraps a UTF-8 decoder around an iterator for Read.

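Essentially, the Utf8Iterator converts a u8 iterator into a char iterator. The underlying iterator can be, for example, the byte iterator of a BufRead or a Cursor. It is meant to iterate over I/O, so it expects the inner iterator to be of type Iterator<Item = Result<u8, std::io::Error>>.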
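As an illustration of the expected input type, here is a minimal sketch using only the standard library. `Read::bytes()` on a `Cursor` or a `BufReader` already yields `Result<u8, std::io::Error>` items. The helper name `byte_source` is hypothetical and is not part of the crate.

```rust
use std::io::{self, BufReader, Cursor, Read};

/// Hypothetical helper (not part of rustf8): wraps any reader and exposes
/// it as an iterator of `io::Result<u8>`, the item type described above.
fn byte_source<R: Read>(reader: R) -> impl Iterator<Item = io::Result<u8>> {
    // BufReader avoids one read call per byte on unbuffered readers.
    BufReader::new(reader).bytes()
}

fn main() {
    // The two bytes 0xCE 0xBA are the UTF-8 encoding of U+03BA.
    let mut bytes = byte_source(Cursor::new(vec![0xce_u8, 0xba]));
    assert_eq!(bytes.next().unwrap().unwrap(), 0xce);
    assert_eq!(bytes.next().unwrap().unwrap(), 0xba);
    // The iterator ends with None once the reader is exhausted.
    assert!(bytes.next().is_none());
}
```

An iterator of this shape is what would be handed to `Utf8Iterator::new`.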

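The next() method returns an Option, where None indicates the end of the sequence. A Some value holds a Result containing either a char or an error, which describes either a UTF-8 decoding error or an I/O error from the underlying iterator. Decoding errors contain the malformed sequence.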
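The shape of that return value can be sketched with the standard library alone. The code below is not the crate's implementation: `next_char` and `DecodeError` are hypothetical names, validation is delegated to `std::str::from_utf8`, and, as a simplification, a byte that breaks a sequence is consumed together with it.

```rust
use std::io::{self, Cursor, Read};

/// Hypothetical error type for this sketch only.
#[derive(Debug)]
enum DecodeError {
    /// The collected bytes do not form a valid UTF-8 sequence.
    Invalid(Vec<u8>),
    /// The underlying byte iterator reported an I/O error.
    Io(io::Error),
}

/// Decodes one char from a byte iterator.
/// None = end of input, Some(Ok(c)) = decoded char, Some(Err(e)) = failure.
fn next_char<I>(bytes: &mut I) -> Option<Result<char, DecodeError>>
where
    I: Iterator<Item = io::Result<u8>>,
{
    // `?` returns None when the input is exhausted.
    let first = match bytes.next()? {
        Ok(b) => b,
        Err(e) => return Some(Err(DecodeError::Io(e))),
    };
    // The leading byte determines how many continuation bytes follow.
    let extra = match first {
        0x00..=0x7f => 0,
        0xc2..=0xdf => 1,
        0xe0..=0xef => 2,
        0xf0..=0xf4 => 3,
        _ => return Some(Err(DecodeError::Invalid(vec![first]))),
    };
    let mut seq = vec![first];
    for _ in 0..extra {
        match bytes.next() {
            Some(Ok(b)) => seq.push(b),
            Some(Err(e)) => return Some(Err(DecodeError::Io(e))),
            // Input ended in the middle of a sequence.
            None => return Some(Err(DecodeError::Invalid(seq))),
        }
    }
    // Let the standard library validate the collected sequence.
    match std::str::from_utf8(&seq) {
        Ok(s) => Some(Ok(s.chars().next().expect("exactly one char"))),
        Err(_) => Some(Err(DecodeError::Invalid(seq))),
    }
}

fn main() {
    // 'A', U+00E9, U+20AC, then a stray continuation byte 0x80.
    let input: Vec<u8> = vec![0x41, 0xc3, 0xa9, 0xe2, 0x82, 0xac, 0x80];
    let mut bytes = Cursor::new(input).bytes();
    assert_eq!(next_char(&mut bytes).unwrap().unwrap(), 'A');
    assert_eq!(next_char(&mut bytes).unwrap().unwrap(), '\u{e9}');
    assert_eq!(next_char(&mut bytes).unwrap().unwrap(), '\u{20ac}');
    // The error carries the offending bytes.
    assert!(matches!(
        next_char(&mut bytes),
        Some(Err(DecodeError::Invalid(ref v))) if v == &[0x80]
    ));
    assert!(next_char(&mut bytes).is_none());
}
```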

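Disclaimer

I wrote this crate as part of a learning project, not because there were no alternatives or because I set out to write something better. There are already Rust crates that decode UTF-8. This crate only makes sense if your hardware has very little memory, decoding directly from the I/O buffer is advantageous, and you really need to decode one character at a time.

Example

Basic usage

```rust
use rustf8::*;
use std::io::prelude::*;
use std::io::Cursor;

fn some_correct_utf_8_text() {
    // UTF-8 encoding of five Greek letters.
    let input: Vec<u8> = vec![
        0xce, 0xba, 0xe1, 0xbd, 0xb9, 0xcf, 0x83, 0xce, 0xbc, 0xce, 0xb5,
    ];
    let stream = Cursor::new(input);
    let iter = stream.bytes();
    let mut chiter = Utf8Iterator::new(iter);
    assert_eq!('κ', chiter.next().unwrap().unwrap());
    // 0xe1 0xbd 0xb9 decodes to U+1F79 (omicron with oxia).
    assert_eq!('\u{1f79}', chiter.next().unwrap().unwrap());
    assert_eq!('σ', chiter.next().unwrap().unwrap());
    assert_eq!('μ', chiter.next().unwrap().unwrap());
    assert_eq!('ε', chiter.next().unwrap().unwrap());
    assert!(chiter.next().is_none());
}
```

Error handling

```rust
// Assumed imports: the excerpt matches on unqualified error variants.
use rustf8::*;
use rustf8::Utf8IteratorError::*;
use std::io::{Bytes, Cursor};

// State, Token and state_machine() belong to the surrounding tokenizer
// and are not shown in this excerpt.
fn next_token(
    chiter: &mut Utf8Iterator<Bytes<Cursor<&str>>>,
    state: &mut (State, Token),
) -> Option<Token> {
    loop {
        let r = chiter.next();
        match r {
            Some(item) => match item {
                // A char was decoded: feed it to the state machine.
                Ok(ch) => {
                    *state = state_machine(chiter, ch, &state);
                    if let State::FinishedToken = state.0 {
                        return Some(state.1.clone());
                    }
                }
                // Every error variant carries the offending bytes.
                Err(e) => match e {
                    InvalidSequenceError(bytes) => {
                        panic!("Detected an invalid UTF-8 sequence! {:?}", bytes)
                    }
                    LongSequenceError(bytes) => {
                        panic!("UTF-8 sequence with more than 4 bytes! {:?}", bytes)
                    }
                    InvalidCharError(bytes) => panic!(
                        "UTF-8 sequence resulted in an invalid character! {:?}",
                        bytes
                    ),
                    IoError(ioe, bytes) => panic!(
                        "I/O error {:?} while decoding the sequence {:?} !",
                        ioe, bytes
                    ),
                },
            },
            // End of input: emit the pending token once, then stop.
            None => {
                if let State::Finalized = state.0 {
                    return None;
                } else {
                    state.0 = State::Finalized;
                    return Some(state.1.clone());
                }
            }
        }
    }
}
```

Errors

The Utf8Iterator identifies UTF-8 decoding errors and returns them as the enum Utf8IteratorError. The error also contains a Box<[u8]> holding the malformed sequence. Subsequent calls to next() are allowed and decode valid characters starting from the point after the malformed sequence.

The I/O error std::io::ErrorKind::Interrupted coming from the underlying iterator is transparently consumed by the next() method, so there is no need to handle it.

Panics

Panics if unget() is called twice without a call to next() in between.

Safety

This crate does not use unsafe {}. Once decoded, the values are converted with char::from_u32(), which should prevent invalid characters.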
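The role of `char::from_u32()` can be shown with the standard library alone. In this sketch, `assemble3` is a hypothetical helper (not taken from the crate) that combines the payload bits of a 3-byte sequence. The bytes ED A0 80 have a well-formed bit layout but yield the surrogate U+D800, which `char::from_u32()` rejects.

```rust
/// Hypothetical helper: combines the payload bits of a 3-byte sequence
/// (4 bits from the leading byte, 6 bits from each continuation byte).
fn assemble3(b0: u8, b1: u8, b2: u8) -> u32 {
    ((b0 as u32 & 0x0f) << 12) | ((b1 as u32 & 0x3f) << 6) | (b2 as u32 & 0x3f)
}

fn main() {
    // E1 BD B9 assembles to U+1F79, a valid Unicode scalar value.
    let good = assemble3(0xe1, 0xbd, 0xb9);
    assert_eq!(good, 0x1f79);
    assert_eq!(char::from_u32(good), Some('\u{1f79}'));

    // ED A0 80 assembles to U+D800, a surrogate: not a valid char.
    let surrogate = assemble3(0xed, 0xa0, 0x80);
    assert_eq!(surrogate, 0xd800);
    assert_eq!(char::from_u32(surrogate), None);

    // Anything above U+10FFFF is rejected as well.
    assert_eq!(char::from_u32(0x11_0000), None);
}
```

Because the conversion returns an Option, a decoder can map None to an error instead of ever constructing an invalid char.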

Dependencies

~280–740KB
~17K SLoC
thiserror
dev tempfile