5 versions
Version | Released |
---|---|
0.1.4 | Dec 9, 2023 |
0.1.3 | Nov 27, 2023 |
0.1.2 | Nov 21, 2023 |
0.1.1 | Nov 21, 2023 |
0.1.0 | Nov 20, 2023 |
#1081 in Text processing
40 downloads per month
6MB
788 lines
PDF Seekers
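A simple parser and information extractor with keyword-based search for PDF documents (powered by Rust).
Key Features
- Index a single PDF file or a directory containing multiple PDF files
- Search for keywords across multiple PDF files to retrieve relevant information
- Get the number of pages in a PDF file, the page numbers that contain the search term, and the content surrounding the search term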
Python
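Installation
Install the latest pypdf-seekers version with: pip install pypdf-seekers
Releases are currently frequent (weekly or every few days), so it is a good idea to update pypdf-seekers regularly to pick up the latest bug fixes and features.
Usage Example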
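Because releases are frequent, it can help to check which version is installed before deciding whether to upgrade. The sketch below uses only the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical and not part of pypdf-seekers.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "pypdf-seekers") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is an illustrative helper, not a pypdf-seekers API.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("pypdf-seekers is not installed; run: pip install pypdf-seekers")
    else:
        print(f"pypdf-seekers {found} is installed; "
              "upgrade with: pip install --upgrade pypdf-seekers")
```

Comparing the reported version with the version list on the package index shows whether an upgrade is due.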
>>> import pypdf_seekers as ps
>>>
>>> data_dir = "data"
>>> cache_path = None
>>> log_level = "debug"
>>> search_term = "convolutional"
>>>
>>> ps.indexing_contents(data_dir, cache_path, log_level)
2023-12-09 14:37:31 | INFO | Starting indexing operation...
2023-12-09 14:37:31 | DEBUG | src\lib.rs:69 - Input parameters:
2023-12-09 14:37:31 | DEBUG | src\lib.rs:70 - file_or_directory: data
2023-12-09 14:37:31 | DEBUG | src\lib.rs:71 - cache_path: None
2023-12-09 14:37:31 | DEBUG | src\lib.rs:72 - log_level: Some("debug")
2023-12-09 14:37:31 | INFO | Received `data` which is directory.
2023-12-09 14:37:31 | INFO | Read all file names successfully in directory `data`
2023-12-09 14:37:31 | INFO | Read all processed file names successfully in `D:\github-repos\pdf-seekers/.cache/track_dir/_SUCCESS.txt`
2023-12-09 14:37:31 | INFO | data/fast_rcnn.pdf - Indexing started...
2023-12-09 14:37:31 | INFO | data/fast_rcnn.pdf - File read successfully.
2023-12-09 14:37:31 | INFO | Index directory created successfully at `D:\github-repos\pdf-seekers/.cache/index_dir`.
2023-12-09 14:37:31 | DEBUG | src\index_operations.rs:38 - Is index directory `D:\github-repos\pdf-seekers/.cache/index_dir` empty? -> true
2023-12-09 14:37:31 | INFO | D:\github-repos\pdf-seekers/.cache/index_dir - Directory is empty.
2023-12-09 14:37:31 | INFO | Index writer created successfully for `D:\github-repos\pdf-seekers/.cache/index_dir` directory.
2023-12-09 14:37:31 | INFO | data/fast_rcnn.pdf - Indexing completed successfully.
2023-12-09 14:37:31 | INFO | data/yolo.pdf - Indexing started...
2023-12-09 14:37:31 | INFO | data/yolo.pdf - File read successfully.
2023-12-09 14:37:31 | INFO | Index directory created successfully at `D:\github-repos\pdf-seekers/.cache/index_dir`.
2023-12-09 14:37:31 | DEBUG | src\index_operations.rs:38 - Is index directory `D:\github-repos\pdf-seekers/.cache/index_dir` empty? -> false
2023-12-09 14:37:31 | INFO | D:\github-repos\pdf-seekers/.cache/index_dir - Directory content read successfully.
2023-12-09 14:37:31 | INFO | Index writer created successfully for `D:\github-repos\pdf-seekers/.cache/index_dir` directory.
2023-12-09 14:37:31 | INFO | data/yolo.pdf - Indexing completed successfully.
>>>
>>> docs = ps.search_term_in_file(data_dir, search_term, cache_path, log_level)
2023-12-09 14:38:32 | INFO | Starting searching operation...
2023-12-09 14:38:32 | DEBUG | src\lib.rs:193 - Input parameters:
2023-12-09 14:38:32 | DEBUG | src\lib.rs:194 - file_or_directory: data
2023-12-09 14:38:32 | DEBUG | src\lib.rs:195 - search_term: convolutional
2023-12-09 14:38:32 | DEBUG | src\lib.rs:196 - cache_path: None
2023-12-09 14:38:32 | DEBUG | src\lib.rs:197 - log_level: Some("debug")
2023-12-09 14:38:32 | INFO | Received `data` which is directory.
2023-12-09 14:38:32 | INFO | Index directory created successfully at `D:\github-repos\pdf-seekers/.cache/index_dir`.
2023-12-09 14:38:32 | DEBUG | src\index_operations.rs:38 - Is index directory `D:\github-repos\pdf-seekers/.cache/index_dir` empty? -> false
2023-12-09 14:38:32 | INFO | D:\github-repos\pdf-seekers/.cache/index_dir - Directory content read successfully.
2023-12-09 14:38:32 | INFO | Index writer created successfully for `D:\github-repos\pdf-seekers/.cache/index_dir`.
2023-12-09 14:38:32 | INFO | Retrieved matched documents successfully for `convolutional` search term.
2023-12-09 14:38:32 | INFO | Read all file names successfully in directory `data`
2023-12-09 14:38:32 | INFO | data/fast_rcnn.pdf: Metadata extracted successfully.
2023-12-09 14:38:32 | INFO | data/yolo.pdf: Metadata extracted successfully.
>>>
>>> for doc in docs:
...     doc.show()
...
==================================================
Document Name: data/fast_rcnn.pdf
Number of pages: 9
Search Results:
[Page: 1] method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efciently classify ob- ject proposals using deep convolutional networks. Com- pared to previous work, Fast R-CNN employs several in- novations to improve training and testing speed while also
[Page: 2] are also written to disk. But unlike R-CNN, the ne-tuning al- gorithm proposed in [ 11 ] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurpris- ingly, this limitation (xed convolutional layers) limits the accuracy of very deep
[Page: 9] 2009. 2 [5] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efcient evaluation. In NIPS , 2014. 4 [6] D. Erhan, C. Szegedy, A. Toshev,
==================================================
Document Name: data/yolo.pdf
Number of pages: 10
Search Results:
[Page: 1] is simple and straightforward. Our system (1) resizes the input image to 448 448 , (2) runs a single convolutional net- work on the image, and (3) thresholds the resulting detections by the model’s condence. methods to rst generate potential
[Page: 2] Our nal prediction is a 7 7 30 tensor. 2.1. Network Design We implement this model as a convolutional neural net- work and evaluate it on the P ASCAL VOC detection dataset [ 9 ]. The initial convolutional layers
[Page: 3] Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 1 convolutional layers reduce the features space from preceding layers.
[Page: 4] a set of robust features from input images (Haar [ 25 ], SIFT [ 23 ], HOG [ 4 ], convolutional features [ 6 ]). Then, classiers [ 36 , 21 , 13 , 10 ] or localizers
[Page: 5] 14 ]. YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same
[Page: 9] [6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional acti- vation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 , 2013. 4 [7] J. Dong, Q. Chen,
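The report printed by `doc.show()` follows a regular layout: a `Document Name:` line, a `Number of pages:` line, and one `[Page: N]` line per hit. If you have that text in a string, a small parser can turn it into structured data. This is a minimal sketch that assumes the layout stays exactly as shown above. `ParsedDocument` and `parse_report` are hypothetical names and not part of pypdf-seekers, and the sketch does not cover how to capture the printed text.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Matches result lines such as "[Page: 3] some surrounding text".
PAGE_LINE = re.compile(r"^\[Page:\s*(\d+)\]\s*(.*)$")


@dataclass
class ParsedDocument:
    """Hypothetical container for one document section of the report."""
    name: str
    page_count: int = 0
    hits: Dict[int, List[str]] = field(default_factory=dict)


def parse_report(text: str) -> List[ParsedDocument]:
    """Parse report text in the layout shown above into ParsedDocument items."""
    documents: List[ParsedDocument] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Document Name:"):
            documents.append(ParsedDocument(name=line.split(":", 1)[1].strip()))
        elif not documents:
            continue  # ignore anything before the first document header
        elif line.startswith("Number of pages:"):
            documents[-1].page_count = int(line.split(":", 1)[1])
        else:
            match = PAGE_LINE.match(line)
            if match:
                page = int(match.group(1))
                documents[-1].hits.setdefault(page, []).append(match.group(2))
    return documents


if __name__ == "__main__":
    sample = (
        "==================================================\n"
        "Document Name: data/fast_rcnn.pdf\n"
        "Number of pages: 9\n"
        "Search Results:\n"
        "[Page: 1] method (Fast R-CNN) for object detection.\n"
        "[Page: 2] are also written to disk.\n"
    )
    for doc in parse_report(sample):
        print(doc.name, doc.page_count, sorted(doc.hits))
```

Hits are keyed by page number, so a page with several matches keeps all of its snippets.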
Rust
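You can take the latest release from crates.io, or, if you want the newest features or performance improvements, point your dependency at the main branch of this repository.
Run the following Cargo command in your project directory: cargo add pdf_seekers
Or add the following line to your Cargo.toml: pdf_seekers = "0.1.4"
Usage Example
cargo run -- --action ACTION --file-or-directory FILE_OR_DIRECTORY
Options
- -a, --action: Action to perform [index, search]
- -f, --file-or-directory: Path to a single PDF file, or to a directory containing multiple PDF files, to index or search
- -s, --search-term: Keyword to search for in the PDF files (required only when the action is search)
- -c, --cache-path: Path to the directory where all index files, log files, and tracking files are stored. If omitted, the cache directory is created in the current working directory
- -l, --log-level: Verbosity of the logs. Defaults to INFO. Allowed values are INFO, WARN, DEBUG, ERROR, TRACE, OFF
- -h, --help: Print help
- -V, --version: Print version
Index Command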
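When scripting the CLI, it can be convenient to assemble the argument list programmatically and check the documented constraints first. The sketch below builds a `cargo run` invocation from the long flags listed above. `build_cli_args` is a hypothetical helper. The case-insensitive log-level check is an assumption, based on the example commands passing `debug` in lowercase.

```python
from typing import List, Optional

ACTIONS = ("index", "search")
LOG_LEVELS = ("INFO", "WARN", "DEBUG", "ERROR", "TRACE", "OFF")


def build_cli_args(
    action: str,
    file_or_directory: str,
    search_term: Optional[str] = None,
    cache_path: Optional[str] = None,
    log_level: Optional[str] = None,
) -> List[str]:
    """Build an argument list for `cargo run` (illustrative helper only)."""
    if action not in ACTIONS:
        raise ValueError(f"action must be one of {ACTIONS}, got {action!r}")
    if action == "search" and not search_term:
        raise ValueError("search_term is required when action is 'search'")

    args = ["cargo", "run", "--",
            "--action", action,
            "--file-or-directory", file_or_directory]
    if action == "search":
        args += ["--search-term", search_term]
    if cache_path is not None:
        args += ["--cache-path", cache_path]
    if log_level is not None:
        # Assumption: levels are matched case-insensitively by the CLI.
        if log_level.upper() not in LOG_LEVELS:
            raise ValueError(f"log_level must be one of {LOG_LEVELS}")
        args += ["--log-level", log_level]
    return args


if __name__ == "__main__":
    print(build_cli_args("search", "data",
                         search_term="convolutional", log_level="debug"))
```

The resulting list can be passed to `subprocess.run` from the crate's project directory.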
$ cargo run -- -a index -f data -l debug
2023-12-09 14:42:00 | INFO | Starting indexing operation...
2023-12-09 14:42:00 | DEBUG | src\lib.rs:68 - Input parameters:
2023-12-09 14:42:00 | DEBUG | src\lib.rs:69 - file_or_directory: data
2023-12-09 14:42:00 | DEBUG | src\lib.rs:70 - cache_path: None
2023-12-09 14:42:00 | DEBUG | src\lib.rs:71 - log_level: Some("debug")
2023-12-09 14:42:00 | INFO | Received `data` which is directory.
2023-12-09 14:42:00 | INFO | Read all file names successfully in directory `data`
2023-12-09 14:42:00 | INFO | Read all processed file names successfully in `D:\github-repos\pdf-seekers/.cache/track_dir/_SUCCESS.txt`
2023-12-09 14:42:00 | INFO | data/fast_rcnn.pdf - Indexing started...
2023-12-09 14:42:00 | INFO | data/fast_rcnn.pdf - File read successfully.
2023-12-09 14:42:00 | INFO | Index directory created successfully at `D:\github-repos\pdf-seekers/.cache/index_dir`.
2023-12-09 14:42:00 | DEBUG | src\index_operations.rs:38 - Is index directory `D:\github-repos\pdf-seekers/.cache/index_dir` empty? -> true
2023-12-09 14:42:00 | INFO | D:\github-repos\pdf-seekers/.cache/index_dir - Directory is empty.
2023-12-09 14:42:00 | INFO | Index writer created successfully for `D:\github-repos\pdf-seekers/.cache/index_dir` directory.
2023-12-09 14:42:01 | INFO | data/fast_rcnn.pdf - Indexing completed successfully.
2023-12-09 14:42:01 | INFO | data/yolo.pdf - Indexing started...
2023-12-09 14:42:01 | INFO | data/yolo.pdf - File read successfully.
2023-12-09 14:42:01 | INFO | Index directory created successfully at `D:\github-repos\pdf-seekers/.cache/index_dir`.
2023-12-09 14:42:01 | DEBUG | src\index_operations.rs:38 - Is index directory `D:\github-repos\pdf-seekers/.cache/index_dir` empty? -> false
2023-12-09 14:42:01 | INFO | D:\github-repos\pdf-seekers/.cache/index_dir - Directory content read successfully.
2023-12-09 14:42:01 | INFO | Index writer created successfully for `D:\github-repos\pdf-seekers/.cache/index_dir` directory.
2023-12-09 14:42:01 | INFO | data/yolo.pdf - Indexing completed successfully.
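The log lines above share a `timestamp | LEVEL | message` layout, which makes them easy to post-process, for example to list the files that finished indexing. The sketch below assumes that layout and the exact `Indexing completed successfully.` wording seen in the sample output. `LogRecord`, `parse_log_line`, and `indexed_files` are hypothetical names and not part of the crate.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: datetime
    level: str
    message: str


def parse_log_line(line: str) -> Optional[LogRecord]:
    """Split one 'timestamp | LEVEL | message' line; return None if malformed."""
    parts = line.strip().split(" | ", 2)
    if len(parts) != 3:
        return None
    stamp, level, message = parts
    try:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return LogRecord(when, level.strip(), message)


def indexed_files(lines: Iterable[str]) -> List[str]:
    """Return the file paths whose indexing was reported as completed."""
    suffix = " - Indexing completed successfully."
    done = []
    for line in lines:
        record = parse_log_line(line)
        if record and record.level == "INFO" and record.message.endswith(suffix):
            done.append(record.message[: -len(suffix)])
    return done


if __name__ == "__main__":
    sample = [
        "2023-12-09 14:42:00 | INFO | data/fast_rcnn.pdf - Indexing started...",
        "2023-12-09 14:42:01 | INFO | data/fast_rcnn.pdf - Indexing completed successfully.",
        "2023-12-09 14:42:01 | INFO | data/yolo.pdf - Indexing completed successfully.",
    ]
    print(indexed_files(sample))
```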
Search Command
$ cargo run -- -a search -f data -s convolutional -l debug
2023-12-09 14:42:34 | INFO | Starting searching operation...
2023-12-09 14:42:34 | DEBUG | src\lib.rs:191 - Input parameters:
2023-12-09 14:42:34 | DEBUG | src\lib.rs:192 - file_or_directory: data
2023-12-09 14:42:34 | DEBUG | src\lib.rs:193 - search_term: convolutional
2023-12-09 14:42:34 | DEBUG | src\lib.rs:194 - cache_path: None
2023-12-09 14:42:34 | DEBUG | src\lib.rs:195 - log_level: Some("debug")
2023-12-09 14:42:34 | INFO | Received `data` which is directory.
2023-12-09 14:42:34 | INFO | Index directory created successfully at `D:\github-repos\pdf-seekers/.cache/index_dir`.
2023-12-09 14:42:34 | DEBUG | src\index_operations.rs:38 - Is index directory `D:\github-repos\pdf-seekers/.cache/index_dir` empty? -> false
2023-12-09 14:42:34 | INFO | D:\github-repos\pdf-seekers/.cache/index_dir - Directory content read successfully.
2023-12-09 14:42:34 | INFO | Index writer created successfully for `D:\github-repos\pdf-seekers/.cache/index_dir`.
2023-12-09 14:42:34 | DEBUG | src\search_operations.rs:56 - Index reader object created successfully.
2023-12-09 14:42:34 | DEBUG | src\search_operations.rs:60 - Index searcher object created successfully.
2023-12-09 14:42:34 | DEBUG | src\search_operations.rs:68 - Query parser created successfully for `content` field.
2023-12-09 14:42:34 | DEBUG | src\search_operations.rs:75 - Query parsing completed successfully for query string -> convolutional
2023-12-09 14:42:34 | DEBUG | src\search_operations.rs:82 - Top 10 matched documents retrived from search.
2023-12-09 14:42:34 | INFO | Retrieved matched documents successfully for `convolutional` search term.
2023-12-09 14:42:34 | INFO | Read all file names successfully in directory `data`
2023-12-09 14:42:34 | INFO | data/fast_rcnn.pdf: Metadata extracted successfully.
2023-12-09 14:42:35 | INFO | data/yolo.pdf: Metadata extracted successfully.
==================================================
Document Name: data/fast_rcnn.pdf
Number of pages: 9
Search Results:
[Page: 1] method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efciently classify ob- ject proposals using deep convolutional networks. Com- pared to previous work, Fast R-CNN employs several in- novations to improve training and testing speed while also
[Page: 2] are also written to disk. But unlike R-CNN, the ne-tuning al- gorithm proposed in [ 11 ] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurpris- ingly, this limitation (xed convolutional layers) limits the accuracy of very deep
[Page: 9] 2009. 2 [5] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efcient evaluation. In NIPS , 2014. 4 [6] D. Erhan, C. Szegedy, A. Toshev, and D.
==================================================
Document Name: data/yolo.pdf
Number of pages: 10
Search Results:
[Page: 1] is simple and straightforward. Our system (1) resizes the input image to 448 448 , (2) runs a single convolutional net- work on the image, and (3) thresholds the resulting detections by the model’s condence. methods to rst generate potential
[Page: 2] Our nal prediction is a 7 7 30 tensor. 2.1. Network Design We implement this model as a convolutional neural net- work and evaluate it on the P ASCAL VOC detection dataset [ 9 ]. The initial convolutional layers
[Page: 3] Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 1 convolutional layers reduce the features space from preceding layers.
[Page: 4] a set of robust features from input images (Haar [ 25 ], SIFT [ 23 ], HOG [ 4 ], convolutional features [ 6 ]). Then, classiers [ 36 , 21 , 13 , 10 ] or localizers
[Page: 5] 14 ]. YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same
[Page: 9] [6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional acti- vation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 , 2013. 4 [7] J. Dong, Q. Chen,
Official Repository
Visit the official PDF Seekers repository for more information.
Dependencies
~39MB
~603K SLoC