#nlp #text #summarize #cli #summarization

bin+lib pithy

Ultra-fast, spookily accurate text summarizer that works on any language

7 releases

0.1.7 Feb 8, 2022
0.1.6 Feb 4, 2022

#1084 in Text processing

26 downloads per month

MIT license

41KB
766

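pithy 0.1.0 - an absurdly fast, strangely accurate summariser

An important thing to note is that pithy is more of a highlighter than a summariser: it returns sentences taken verbatim from the text. It just so happens that the most important sentences in a text often make a good summary. You can tune which sentences it picks with the --density flag.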
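
To make the highlighter idea concrete, here is a minimal sketch of extractive selection in Python. It is not pithy's actual algorithm, which this page does not describe. The function name `highlight` and the scoring rule (average word frequency) are assumptions chosen only for illustration.

```python
import re
from collections import Counter


def highlight(text, sentences=3, separator=". "):
    """Return the `sentences` highest-scoring sentences in their original order."""
    parts = [p.strip() for p in text.split(separator) if p.strip()]
    # Count how often each lowercase word occurs across the whole text.
    counts = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        # Hypothetical scoring rule: average word frequency, so that long
        # sentences are not favoured by length alone.
        return sum(counts[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the best, then restore document order.
    ranked = sorted(range(len(parts)), key=lambda i: score(parts[i]), reverse=True)
    return [parts[i] for i in sorted(ranked[:sentences])]
```

Every returned sentence is copied unchanged from the input, which is what separates a highlighter from a summariser that rewrites text.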

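Here are some examples of pithy's output:

https://plato.stanford.edu/entries/chinese-room/, the Chinese Room Argument

  • The narrow conclusion of the argument is that programming a digital computer may make it appear to understand language but could not produce real understanding.

https://www.gutenberg.org/files/55/55-0.txt, The Wonderful Wizard of Oz

  • Dorothy did not know what to say to this, for all the people seemed to think her a witch, and she knew very well she was only an ordinary little girl who had come by the chance of a cyclone into a strange land.

https://archive.org/stream/ProgrammingRust1stEdition1491927283/Programming%20Rust%201st%20Edition%201491927283_djvu.txt, "Programming Rust, 1st Edition"

  • Ironically, the dominant systems programming languages, C and C++, are not type safe, while most other popular languages are. Given that C and C++ are meant to be used to implement the foundations of a system, entrusted with implementing security boundaries and handling untrusted data, type safety would seem like an especially valuable quality for them to have. This is the decades-old tension Rust aims to resolve: it is both type safe and a systems programming language

https://www.gutenberg.org/cache/epub/5827/pg5827.txt, Bertrand Russell's The Problems of Philosophy

  • It is chiefly in this sense that Berkeley denies matter; that is to say, he does not deny that the sense-data which we commonly take as signs of the existence of the table are really signs of the existence of something independent of us, but he does deny that this something is non-mental, that is, that it is neither mind nor ideas entertained by some mind. He admits that there must be something which continues to exist when we go out of the room or shut our eyes, and that what we call seeing the table does really give us reason for believing in something which persists even when we are not seeing it. But he thinks that this something cannot be radically different in nature from what we see, and cannot be independent of seeing altogether, though it must be independent of our seeing.
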
Quick example:
pithy -f your_file_here.txt --sentences 4

--help

Print this help message

-f

The file pithy will read from. Required.

--sentences

The number of sentences for pithy to return. Defaults to 3.

--approximate

Will return a decent approximation of the summary. Good
for extremely long texts where you don't care about precision.

--bias

Slash-separated (i.e. "/") list of words to bias the summary towards.
If you are using pithy on a large text, increase the chunk_size to
2500-5000 to get relevant results. Note that this doesn't work in
approximate mode.

--bias_strength

The strength of the bias, must be an integer. Defaults to 6.
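
As a rough illustration of how a slash-separated bias list and an integer strength could combine, here is a small Python sketch. The function `biased_score` and the additive rule are assumptions; the page does not say how pithy applies the bias internally.

```python
import re


def biased_score(sentence, base_score, bias="", bias_strength=6):
    """Add `bias_strength` to the base score for each bias word the sentence contains."""
    # "rust/memory/safety" -> {"rust", "memory", "safety"}
    bias_words = {w.strip().lower() for w in bias.split("/") if w.strip()}
    words = set(re.findall(r"\w+", sentence.lower()))
    # Hypothetical rule: every matching bias word adds a fixed bonus.
    return base_score + bias_strength * len(bias_words & words)
```

With the default strength of 6, a sentence that matches two bias words would gain 12 points under this assumed rule, while sentences with no match keep their base score.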

--by_section

If set, pithy splits the text into sections, and each section is
summarized separately. Defaults to false.

--chunk_size

The number of sentences to read at a time. Defaults to 500 
if unspecified.
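
The chunking described above can be pictured with the following Python sketch. The helper name `chunk_sentences` is hypothetical, and the sketch only shows how a list of sentences might be grouped; what pithy does with each chunk afterwards is not covered here.

```python
def chunk_sentences(sentences, chunk_size=500):
    """Split a list of sentences into consecutive chunks of at most `chunk_size`."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Step through the list in strides of `chunk_size`; the last chunk may be shorter.
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]
```

Under this scheme a text of 1200 sentences with the default size yields chunks of 500, 500 and 200 sentences.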

--force_all

If set, pithy reads the text all at once. Can be quite 
slow once you go past the 7k mark. Defaults to false.

--force_chunk

If set, regardless of how large the text is, pithy splits it
into chunks. Should be used in combination with chunk_size 
and by_section.

--ngrams

If set, pithy uses ngrams rather than words. 
It's usually crap, but you might use it as a last resort 
for non-spaced languages that you can't pre-tokenise. 
Defaults to false.

--min_length

The minimum sentence length before filtering. Defaults to 30.

--max_length

The maximum sentence length before filtering. Defaults to 1500.

--separator

The separator used to split the text into sentences. 
Defaults to '. '. You can type newline to separate by newlines.
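
A minimal Python sketch of how the separator and the two length bounds might interact is shown below. The function `split_sentences` is hypothetical, and it assumes that sentence length is measured in characters, which the help text does not state.

```python
def split_sentences(text, separator=". ", min_length=30, max_length=1500):
    """Split `text` on `separator` and drop sentences outside the length bounds."""
    # The help text says the word "newline" selects splitting on line breaks.
    sep = "\n" if separator == "newline" else separator
    pieces = (p.strip() for p in text.split(sep))
    # Assumption: length is counted in characters, inclusive at both ends.
    return [p for p in pieces if min_length <= len(p) <= max_length]
```

For texts in languages that do not end sentences with '. ', choosing a different separator is what makes the remaining options meaningful.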

--clean_whitespace

If set, removes sentences with excessive whitespace. Useful for 
pdfs and copy-pastes from websites.

--clean_nonalphabetic

If set, removes sentences with too many non-alphabetic characters.

--clean_caps

If set, removes sentences with too many capital letters. Useful 
if the text contains a lot of references or indices.
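
The three cleaning options can be thought of as ratio checks on each sentence. The Python sketch below combines them in one function for brevity, although in pithy each is a separate flag. The names `share` and `keep_sentence` and all three thresholds are made up for illustration; the page does not give pithy's real cut-offs.

```python
def share(sentence, predicate):
    """Fraction of characters in `sentence` for which `predicate` is true."""
    return sum(1 for ch in sentence if predicate(ch)) / len(sentence) if sentence else 0.0


def keep_sentence(sentence, max_whitespace=0.3, max_nonalpha=0.4, max_caps=0.3):
    """Return True if the sentence passes all three (hypothetical) ratio checks."""
    whitespace = share(sentence, str.isspace)
    # Whitespace is excluded here so ordinary spaces do not count as "non-alphabetic".
    nonalpha = share(sentence, lambda ch: not ch.isalpha() and not ch.isspace())
    caps = share(sentence, str.isupper)
    return whitespace <= max_whitespace and nonalpha <= max_nonalpha and caps <= max_caps
```

Checks like these target the kinds of lines the help text mentions: padded fragments from PDFs, numeric index entries, and all-caps reference lines.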

--length_penalty

The length penalty. Defaults to 1.5. Decrease to make pithy favour longer
sentences, increase to favour shorter sentences.

--density

Experimental setting. Defaults to 3. Setting it lower 
seems to bias pithy's summaries towards more common words, 
setting it higher seems to bias summaries towards rarer 
but more informative words.

--nocontext

If set, the context surrounding sentences isn't provided. 
Defaults to false.

--relevance

If set, the sentences are sorted by their relevance rather 
than their order in the original text. Defaults to false.

--nobar

If set, the progress bar is not printed. Defaults to false because
progress bars are cool.

Dependencies

~4–16MB
~153K SLoC