4 releases

0.1.4 Mar 7, 2020
0.1.3 Feb 15, 2020
0.1.2 Feb 11, 2020
0.1.1 Feb 5, 2020

#14 in #threadpool

MIT license

27KB
439 lines of code


finde-rs

This is a CLI tool written in Rust that indexes 'directories' using a multi-threaded crawler. It is designed to be generic: currently a filesystem implementation exists, and in the future anything that can be decomposed into 'directories' (items to crawl) and 'files' (the contents of a 'directory') can be added by implementing the Resource trait:


use std::str::FromStr;

pub enum Response<T> {
    DirFileResponse { dirs: Vec<T>, files: Vec<String> },
}

trait Resource<T: FromStr + Send + Sync>: Send + Sync {
    /// Return the directories and leaves (files) for this resource.
    fn get_dirs_and_leaves(&self, path: &T) -> Response<T>;

    /// Get the path representation of the resource.
    /// (`Result<T>` is the crate's own result alias.)
    fn get_path(&self) -> Result<T>;
}

An implementation of this trait for the filesystem lives in fileresource.rs and is used for input patterns beginning with '/'.
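A minimal sketch of what such a filesystem implementation could look like, using std::fs::read_dir (simplified and hypothetical, not the actual fileresource.rs; `Result<T, String>` stands in for the crate's own result alias):

```rust
use std::fs;
use std::path::PathBuf;
use std::str::FromStr;

// Simplified copies of the shapes shown above.
pub enum Response<T> {
    DirFileResponse { dirs: Vec<T>, files: Vec<String> },
}

trait Resource<T: FromStr + Send + Sync>: Send + Sync {
    fn get_dirs_and_leaves(&self, path: &T) -> Response<T>;
    fn get_path(&self) -> Result<T, String>;
}

// Hypothetical filesystem resource rooted at a path string.
struct FileResource {
    root: String,
}

impl Resource<PathBuf> for FileResource {
    fn get_dirs_and_leaves(&self, path: &PathBuf) -> Response<PathBuf> {
        let mut dirs = Vec::new();
        let mut files = Vec::new();
        // Subdirectories become further work; everything else is a leaf.
        if let Ok(entries) = fs::read_dir(path) {
            for entry in entries.flatten() {
                let p = entry.path();
                if p.is_dir() {
                    dirs.push(p);
                } else {
                    files.push(p.to_string_lossy().into_owned());
                }
            }
        }
        Response::DirFileResponse { dirs, files }
    }

    fn get_path(&self) -> Result<PathBuf, String> {
        PathBuf::from_str(&self.root).map_err(|e| e.to_string())
    }
}

fn main() {
    let res = FileResource { root: "/tmp".to_string() };
    let root = res.get_path().expect("valid root path");
    let Response::DirFileResponse { dirs, files } = res.get_dirs_and_leaves(&root);
    println!("{} dirs, {} files under {}", dirs.len(), files.len(), root.display());
}
```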

Dependencies

It uses crossbeam for channels, the threadpool crate for the thread pool, and tantivy for full-text indexing.

It has three main components:

  • Filecrawler is responsible for spawning the indexer and scheduler threads. It creates a thread pool whose threads walk the directory tree, and it sets up the crossbeam channels used to communicate with the indexer and between the pool threads. On each 'walk', directories are sent to a channel that other pool threads consume as sources for further crawling, while fully qualified file paths are sent to the indexer thread.
  • The scheduler resizes the Filecrawler's thread pool based on the length of the pool's channel, keeping the pool size between a minimum and a maximum bound.
  • The indexer is a single-threaded tantivy indexer that reads fully qualified file paths from the channel fed by the Filecrawler threads. Once crawling completes, it commits the index to disk.
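The flow described above can be sketched with plain std primitives (finde-rs itself uses crossbeam channels and a threadpool; the `expand` closure, `run_pipeline` function, and directory names here are purely illustrative):

```rust
use std::sync::mpsc;
use std::thread;

// Directories flow back into the work channel for further crawling;
// fully qualified file paths flow to a dedicated indexer thread.
fn run_pipeline() -> usize {
    let (work_tx, work_rx) = mpsc::channel::<String>();
    let (index_tx, index_rx) = mpsc::channel::<String>();

    // Seed the work queue with the root 'directory'.
    work_tx.send("root".to_string()).unwrap();

    // Indexer thread: consumes file paths until the channel closes.
    let indexer = thread::spawn(move || index_rx.iter().count());

    // Hypothetical stand-in for Resource::get_dirs_and_leaves:
    // "root" has two subdirectories, each holding one file.
    let expand = |dir: &str| -> (Vec<String>, Vec<String>) {
        if dir == "root" {
            (vec!["root/a".into(), "root/b".into()], vec![])
        } else {
            (vec![], vec![format!("{}/file.txt", dir)])
        }
    };

    // A single crawler loop (the real tool runs many pool threads).
    while let Ok(dir) = work_rx.try_recv() {
        let (dirs, files) = expand(&dir);
        for d in dirs {
            work_tx.send(d).unwrap(); // source for further crawling
        }
        for f in files {
            index_tx.send(f).unwrap(); // to the indexer
        }
    }

    drop(index_tx); // close the channel so the indexer exits
    indexer.join().unwrap()
}

fn main() {
    println!("indexed {} files", run_pipeline()); // prints "indexed 2 files"
}
```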

Usage


>> git clone https://github.com/ronin13/finde-rs && cd finde-rs
>> cargo build --release

>>./target/release/finde-rs --help
finde-rs 0.1.3
CLI finder tool

USAGE:
    finde-rs [FLAGS] [OPTIONS]

FLAGS:
    -h, --help
            Prints help information

    -q, --quiet
            Pass many times for less log output

    -V, --version
            Prints version information

    -v, --verbose
            Pass many times for more log output

            By default, it'll only report errors. Passing `-v` one time also prints warnings, `-vv` enables info
            logging, `-vvv` debug, and `-vvvv` trace.

OPTIONS:
    -I, --index-dir <index-dir>
            Root path to crawl from [default: /tmp/]

    -i, --initial-threads <initial-threads>
            Initial number of threads to spawn

    -m, --max-threads <max-threads>
            Maximum number of threads that threadpool can scale upto. Defaults to number of cpus

    -p, --path <path>
            Root path to crawl from [default: /usr/lib]


Run


>>./target/release/finde-rs -p $HOME/repo -v -i 6 -m 12 --index-dir /tmp
2020-02-15 14:10:59,683 INFO  [finde_rs] Crawling /Users/raghu/repo
2020-02-15 14:10:59,684 INFO  [finde_rs::indexer] Starting indexer
2020-02-15 14:10:59,684 INFO  [finde_rs::crawler] Waiting on upto 12 crawler threads
2020-02-15 14:10:59,684 INFO  [finde_rs::indexer] Index directory created in /tmp/5ryH1
2020-02-15 14:10:59,684 INFO  [tantivy::indexer::segment_updater] save metas
2020-02-15 14:10:59,687 INFO  [finde_rs::indexer] Iterating over results
2020-02-15 14:10:59,785 INFO  [finde_rs::scheduler] Updating number of threads to 7, length of work queue 3818, pool size 6
2020-02-15 14:10:59,886 INFO  [finde_rs::scheduler] Updating number of threads to 8, length of work queue 6883, pool size 6
2020-02-15 14:10:59,988 INFO  [finde_rs::scheduler] Updating number of threads to 9, length of work queue 11192, pool size 6
2020-02-15 14:11:00,089 INFO  [finde_rs::scheduler] Updating number of threads to 10, length of work queue 12956, pool size 6
2020-02-15 14:11:00,190 INFO  [finde_rs::scheduler] Updating number of threads to 11, length of work queue 12857, pool size 6
2020-02-15 14:11:00,290 INFO  [finde_rs::scheduler] Updating number of threads to 12, length of work queue 12607, pool size 6
2020-02-15 14:11:04,834 INFO  [finde_rs::scheduler] Updating number of threads to 6, length of work queue 0, pool size 6
2020-02-15 14:11:05,739 INFO  [finde_rs::fileresource] Crawling done in ThreadId(5), leaving, bye!
2020-02-15 14:11:05,740 INFO  [finde_rs::fileresource] Crawling done in ThreadId(4), leaving, bye!
2020-02-15 14:11:05,740 INFO  [finde_rs::fileresource] Crawling done in ThreadId(2), leaving, bye!
2020-02-15 14:11:05,740 INFO  [finde_rs::fileresource] Crawling done in ThreadId(7), leaving, bye!
2020-02-15 14:11:05,740 INFO  [finde_rs::fileresource] Crawling done in ThreadId(6), leaving, bye!
2020-02-15 14:11:05,740 INFO  [finde_rs::fileresource] Crawling done in ThreadId(3), leaving, bye!
2020-02-15 14:11:05,740 INFO  [finde_rs::indexer] Commiting the index
2020-02-15 14:11:05,740 INFO  [tantivy::indexer::index_writer] Preparing commit
2020-02-15 14:11:05,757 INFO  [finde_rs::scheduler] No more threads to schedule, I am done. Bye!
2020-02-15 14:11:05,899 INFO  [tantivy::indexer::segment_updater] Starting merge  - [Seg("8cc31b4d"), Seg("97576eb1"), Seg("2b7bcba3"), Seg("f1bbcb09"), Seg("4c3cf582"), Seg("699c0c3b"), Seg("4e08a0dd"), Seg("1e6b5009")]
2020-02-15 14:11:05,904 INFO  [tantivy::indexer::index_writer] Prepared commit 500530
2020-02-15 14:11:05,904 INFO  [tantivy::indexer::prepared_commit] committing 500530
2020-02-15 14:11:05,904 INFO  [tantivy::indexer::segment_updater] save metas
2020-02-15 14:11:05,905 INFO  [tantivy::indexer::segment_updater] Running garbage collection
2020-02-15 14:11:05,905 INFO  [tantivy::directory::managed_directory] Garbage collect
2020-02-15 14:11:05,905 INFO  [finde_rs::indexer] Index created in "/tmp/"
2020-02-15 14:11:05,905 INFO  [finde_rs::indexer] Index has 12 segments
2020-02-15 14:11:05,906 INFO  [finde_rs] Finished crawling /Users/raghu/repo, took 6s
./target/release/finde-rs -p $HOME/repo -v -i 6 -m 12 --index-dir   12.81s user 26.84s system 636% cpu 6.232 total
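The scaling decisions visible in the log above suggest a simple bounded rule: grow the target by one while the work queue is non-empty, fall back to the minimum once it drains, and clamp between the bounds. A hypothetical sketch (names and logic are illustrative, not finde-rs's actual scheduler code):

```rust
// last_target: the thread count the scheduler asked for last time.
// queue_len:   current length of the pool's work channel.
fn scale(last_target: usize, queue_len: usize, min: usize, max: usize) -> usize {
    let target = if queue_len > 0 { last_target + 1 } else { min };
    target.clamp(min, max)
}

fn main() {
    // Mirrors the run above: initial pool of 6, bounds 6..=12.
    println!("{}", scale(6, 3818, 6, 12));  // prints 7
    println!("{}", scale(12, 12607, 6, 12)); // prints 12 (clamped at max)
    println!("{}", scale(12, 0, 6, 12));     // prints 6 (queue drained)
}
```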


Tests

>>cargo test
   Compiling finde-rs v0.1.1 (/Users/raghu/repo/finde-rs)
    Finished test [unoptimized] target(s) in 1.22s
     Running target/debug/deps/finde_rs-c62a74cfdff79a3e

running 3 tests
test scheduler::test::test_scale_with_bounds ... ok
test crawler::test::test_root_from_disconnected_channel ... ok
test crawler::test::test_root_from_channel ... ok

test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out


Linting

cargo clippy

Dependencies

~22–33MB
~463K SLoC