finde-rs
finde-rs is a CLI tool written in Rust that indexes 'directories' using a multi-threaded crawler. It is designed to be generic; currently, a filesystem implementation exists. In the future, anything that can be decomposed into 'directories' (items to crawl) and 'files' (the contents of a 'directory') can be added by implementing the Resource trait:
```rust
pub enum Response<T> {
    DirFileResponse { dirs: Vec<T>, files: Vec<String> },
}

trait Resource<T: FromStr + Send + Sync>: Send + Sync {
    /// Return the directories and leaves for this resource
    fn get_dirs_and_leaves(&self, path: &T) -> Response<T>;
    /// Get the path representation of the resource.
    fn get_path(&self) -> Result<T>;
}
```
A filesystem implementation of this trait lives in fileresource.rs; it handles input paths beginning with '/'.
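As a rough illustration, a filesystem implementation might look like the sketch below. The trait and enum are re-declared to keep the sketch self-contained, `get_path`'s error type is simplified to `String` (the crate's `Result<T>` is presumably an error-crate alias), and the struct and field names are hypothetical; the real fileresource.rs may differ.

```rust
use std::fs;
use std::path::PathBuf;
use std::str::FromStr;

// Re-declared from the README excerpt so this sketch compiles on its own.
pub enum Response<T> {
    DirFileResponse { dirs: Vec<T>, files: Vec<String> },
}

trait Resource<T: FromStr + Send + Sync>: Send + Sync {
    fn get_dirs_and_leaves(&self, path: &T) -> Response<T>;
    // Simplified error type; the crate's Result<T> alias likely differs.
    fn get_path(&self) -> Result<T, String>;
}

// Hypothetical stand-in for the type in fileresource.rs.
struct FileResource {
    root: String, // e.g. "/usr/lib"
}

impl Resource<PathBuf> for FileResource {
    fn get_dirs_and_leaves(&self, path: &PathBuf) -> Response<PathBuf> {
        let mut dirs = Vec::new();
        let mut files = Vec::new();
        if let Ok(entries) = fs::read_dir(path) {
            for entry in entries.flatten() {
                let p = entry.path();
                if p.is_dir() {
                    dirs.push(p); // a sub-'directory': a further item to crawl
                } else if let Some(s) = p.to_str() {
                    files.push(s.to_string()); // a 'file': a leaf to index
                }
            }
        }
        Response::DirFileResponse { dirs, files }
    }

    fn get_path(&self) -> Result<PathBuf, String> {
        PathBuf::from_str(&self.root).map_err(|e| e.to_string())
    }
}

fn main() {
    let res = FileResource { root: "/tmp".to_string() };
    let root = res.get_path().unwrap();
    match res.get_dirs_and_leaves(&root) {
        Response::DirFileResponse { dirs, files } => {
            println!("{} dirs, {} files under {:?}", dirs.len(), files.len(), root)
        }
    }
}
```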
Dependencies
It uses crossbeam for channels, the threadpool crate for thread pooling, and tantivy for full-text indexing.
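In Cargo.toml terms, that amounts to roughly the following fragment (the version numbers here are illustrative, not taken from the crate):

```toml
[dependencies]
# Versions are illustrative; check the crate's Cargo.toml for the real ones.
crossbeam-channel = "0.4"   # channels between crawler, scheduler, and indexer
threadpool = "1.7"          # resizable pool of crawler threads
tantivy = "0.12"            # full-text index committed to disk
```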
It has three main components:
- The Filecrawler is responsible for spawning the indexer and scheduler threads. It creates a thread pool whose threads walk the directory tree, and it sets up the crossbeam channels used both to talk to the indexer and for the pool threads to communicate among themselves. On each 'walk', directories are sent to a channel that the other pool threads consume as sources for further crawling, while fully qualified file paths are sent to the indexer thread.
- The scheduler resizes the Filecrawler's thread pool based on the length of the pool's work channel, keeping the pool size between a minimum and a maximum bound.
- The indexer is a single-threaded tantivy indexer that reads fully qualified file paths from the channel fed by the Filecrawler threads. Once crawling is complete, it commits the index to disk.
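The channel wiring described above can be sketched as follows. To stay self-contained, this toy version substitutes std's mpsc channels and plain spawned threads for the crate's crossbeam channels and threadpool, crawls an in-memory tree instead of the real filesystem, and uses an atomic counter of outstanding directories to detect completion; all names and the termination strategy are illustrative, not lifted from the crate.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Crawl a toy in-memory "filesystem" and return the discovered file paths.
fn crawl() -> Vec<String> {
    // dir -> (sub-directories, files); stands in for fs::read_dir.
    let mut tree: HashMap<&'static str, (Vec<&'static str>, Vec<&'static str>)> = HashMap::new();
    tree.insert("/", (vec!["/a", "/b"], vec!["/root.txt"]));
    tree.insert("/a", (vec![], vec!["/a/x.txt", "/a/y.txt"]));
    tree.insert("/b", (vec![], vec!["/b/z.txt"]));
    let tree = Arc::new(tree);

    let (work_tx, work_rx) = mpsc::channel::<&'static str>(); // dirs waiting to be crawled
    let (file_tx, file_rx) = mpsc::channel::<String>();       // file paths for the "indexer"
    let pending = Arc::new(AtomicUsize::new(1));              // dirs enqueued but not finished
    work_tx.send("/").unwrap();

    let mut workers = Vec::new();
    while pending.load(Ordering::SeqCst) > 0 {
        if let Ok(dir) = work_rx.recv_timeout(Duration::from_millis(10)) {
            let (tree, work_tx, file_tx, pending) =
                (tree.clone(), work_tx.clone(), file_tx.clone(), pending.clone());
            workers.push(thread::spawn(move || {
                if let Some((dirs, files)) = tree.get(dir) {
                    for &d in dirs {
                        pending.fetch_add(1, Ordering::SeqCst);
                        work_tx.send(d).unwrap(); // feed sub-dirs back for further crawling
                    }
                    for f in files {
                        file_tx.send(f.to_string()).unwrap(); // ship paths to the indexer side
                    }
                }
                pending.fetch_sub(1, Ordering::SeqCst); // this dir is done
            }));
        }
    }
    for w in workers {
        w.join().unwrap();
    }
    drop(file_tx); // close the channel so the drain below terminates

    let mut found: Vec<String> = file_rx.iter().collect();
    found.sort();
    found
}

fn main() {
    println!("{:?}", crawl());
}
```

The real crate benefits from crossbeam's multi-consumer channels, which let pool threads pull work directly instead of going through a single dispatcher as this sketch does.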
Usage
```shell
>> git clone https://github.com/ronin13/finde-rs && cd finde-rs
>> cargo build --release
>> ./target/release/finde-rs --help
finde-rs 0.1.3
CLI finder tool

USAGE:
    finde-rs [FLAGS] [OPTIONS]

FLAGS:
    -h, --help
            Prints help information

    -q, --quiet
            Pass many times for less log output

    -V, --version
            Prints version information

    -v, --verbose
            Pass many times for more log output

            By default, it'll only report errors. Passing `-v` one time also prints warnings, `-vv` enables info
            logging, `-vvv` debug, and `-vvvv` trace.

OPTIONS:
    -I, --index-dir <index-dir>
            Root path to crawl from [default: /tmp/]

    -i, --initial-threads <initial-threads>
            Initial number of threads to spawn

    -m, --max-threads <max-threads>
            Maximum number of threads that threadpool can scale upto. Defaults to number of cpus

    -p, --path <path>
            Root path to crawl from [default: /usr/lib]
```
Run
```shell
>> ./target/release/finde-rs -p $HOME/repo -v -i 6 -m 12 --index-dir /tmp
2020-02-15 14:10:59,683 INFO [finde_rs] Crawling /Users/raghu/repo
2020-02-15 14:10:59,684 INFO [finde_rs::indexer] Starting indexer
2020-02-15 14:10:59,684 INFO [finde_rs::crawler] Waiting on upto 12 crawler threads
2020-02-15 14:10:59,684 INFO [finde_rs::indexer] Index directory created in /tmp/5ryH1
2020-02-15 14:10:59,684 INFO [tantivy::indexer::segment_updater] save metas
2020-02-15 14:10:59,687 INFO [finde_rs::indexer] Iterating over results
2020-02-15 14:10:59,785 INFO [finde_rs::scheduler] Updating number of threads to 7, length of work queue 3818, pool size 6
2020-02-15 14:10:59,886 INFO [finde_rs::scheduler] Updating number of threads to 8, length of work queue 6883, pool size 6
2020-02-15 14:10:59,988 INFO [finde_rs::scheduler] Updating number of threads to 9, length of work queue 11192, pool size 6
2020-02-15 14:11:00,089 INFO [finde_rs::scheduler] Updating number of threads to 10, length of work queue 12956, pool size 6
2020-02-15 14:11:00,190 INFO [finde_rs::scheduler] Updating number of threads to 11, length of work queue 12857, pool size 6
2020-02-15 14:11:00,290 INFO [finde_rs::scheduler] Updating number of threads to 12, length of work queue 12607, pool size 6
2020-02-15 14:11:04,834 INFO [finde_rs::scheduler] Updating number of threads to 6, length of work queue 0, pool size 6
2020-02-15 14:11:05,739 INFO [finde_rs::fileresource] Crawling done in ThreadId(5), leaving, bye!
2020-02-15 14:11:05,740 INFO [finde_rs::fileresource] Crawling done in ThreadId(4), leaving, bye!
2020-02-15 14:11:05,740 INFO [finde_rs::fileresource] Crawling done in ThreadId(2), leaving, bye!
2020-02-15 14:11:05,740 INFO [finde_rs::fileresource] Crawling done in ThreadId(7), leaving, bye!
2020-02-15 14:11:05,740 INFO [finde_rs::fileresource] Crawling done in ThreadId(6), leaving, bye!
2020-02-15 14:11:05,740 INFO [finde_rs::fileresource] Crawling done in ThreadId(3), leaving, bye!
2020-02-15 14:11:05,740 INFO [finde_rs::indexer] Commiting the index
2020-02-15 14:11:05,740 INFO [tantivy::indexer::index_writer] Preparing commit
2020-02-15 14:11:05,757 INFO [finde_rs::scheduler] No more threads to schedule, I am done. Bye!
2020-02-15 14:11:05,899 INFO [tantivy::indexer::segment_updater] Starting merge - [Seg("8cc31b4d"), Seg("97576eb1"), Seg("2b7bcba3"), Seg("f1bbcb09"), Seg("4c3cf582"), Seg("699c0c3b"), Seg("4e08a0dd"), Seg("1e6b5009")]
2020-02-15 14:11:05,904 INFO [tantivy::indexer::index_writer] Prepared commit 500530
2020-02-15 14:11:05,904 INFO [tantivy::indexer::prepared_commit] committing 500530
2020-02-15 14:11:05,904 INFO [tantivy::indexer::segment_updater] save metas
2020-02-15 14:11:05,905 INFO [tantivy::indexer::segment_updater] Running garbage collection
2020-02-15 14:11:05,905 INFO [tantivy::directory::managed_directory] Garbage collect
2020-02-15 14:11:05,905 INFO [finde_rs::indexer] Index created in "/tmp/"
2020-02-15 14:11:05,905 INFO [finde_rs::indexer] Index has 12 segments
2020-02-15 14:11:05,906 INFO [finde_rs] Finished crawling /Users/raghu/repo, took 6s
./target/release/finde-rs -p $HOME/repo -v -i 6 -m 12 --index-dir 12.81s user 26.84s system 636% cpu 6.232 total
```
Tests
```shell
>> cargo test
   Compiling finde-rs v0.1.1 (/Users/raghu/repo/finde-rs)
    Finished test [unoptimized] target(s) in 1.22s
     Running target/debug/deps/finde_rs-c62a74cfdff79a3e

running 3 tests
test scheduler::test::test_scale_with_bounds ... ok
test crawler::test::test_root_from_disconnected_channel ... ok
test crawler::test::test_root_from_channel ... ok

test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
```
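The bounded scaling exercised by test_scale_with_bounds presumably resembles a clamp like the following hypothetical sketch; the function name and the exact rule are guesses chosen to be consistent with the scheduler log lines above, not the crate's actual code.

```rust
/// Hypothetical sketch of the scheduler's rule: grow the pool by one thread
/// per tick while work is queued, fall back to the minimum once the queue
/// drains, and always stay within [min, max].
fn scale_with_bounds(current: usize, queue_len: usize, min: usize, max: usize) -> usize {
    let desired = if queue_len > 0 { current + 1 } else { min };
    desired.clamp(min, max)
}

fn main() {
    // Mirrors the run above: a pool of 6 with a backlog grows to 7 ...
    assert_eq!(scale_with_bounds(6, 3818, 6, 12), 7);
    // ... saturates at the maximum of 12 ...
    assert_eq!(scale_with_bounds(12, 12607, 6, 12), 12);
    // ... and drops back to the minimum of 6 once the queue is empty.
    assert_eq!(scale_with_bounds(12, 0, 6, 12), 6);
    println!("ok");
}
```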
Linting
```shell
>> cargo clippy
```