Version 0.1.2 (unstable), released March 19, 2022
bib-unifier
A compiled Rust program for unifying a set of .bib files.
This project is still in early development and should not be used in production.
Usage
$ bib_unifier --help
bib_unifier 0.1.1
Ariel Jonathan Roffé <[email protected]>
Unifies a set of .bib files into a single file, deleting repetitions
USAGE:
bib_unifier [OPTIONS] <PATH>
ARGS:
<PATH> Directory where the .bib files are located
OPTIONS:
-o, --output <PATH>
Path (directory + filename) to the desired output file
-s, --silent
If present, will not ask for input regarding which repeated entry to keep
-t, --threshold <SIMILARITY_THRESHOLD>
Value between 0 and 1 to compare entry titles [default: 1]
-a, --algorithm <ALGORITHM>
Algorithm to use to compare similarity [default: levenshtein] [possible values:
levenshtein, damerau-levenshtein, jaro, jaro-winkler, sorensen-dice]
-b, --biblatex
Default format for entries is bibtex. Setting this flag changes it to biblatex
-h, --help
Print help information
-V, --version
Print version information
Examples
Simplest usage
$ bib_unifier bib_files/test_files -s
Unifiying bibliography...
Found 3 repetitions in the bibliography.
Unified bibliography was written to "bib_files/test_files/[bib_unifier]bibliography.bib".
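The program looks for any .bib files in the specified directory, reads them, eliminates repetitions among them, and merges them into a single output file.
The .bib files must be well-formed; otherwise the program exits with an error.
Changing the output file
By default, the output file is named "[bib_unifier]bibliography.bib" and is placed in the input directory.
Note that the program is set to ignore files whose names begin with "[bib_unifier]". That way, if you run the program again (with the same or different parameters), it will not take previously generated output as new input.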
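The file selection described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the helper names `is_input_bib` and `collect_bib_files` are hypothetical, and the sketch assumes that only files directly inside the directory are considered and that the extension is lowercase.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Prefix that marks files generated by a previous run (taken from the text above).
const OUTPUT_PREFIX: &str = "[bib_unifier]";

/// Returns true if `path` looks like an input file: it has a `.bib` extension
/// and its file name does not start with the output prefix.
fn is_input_bib(path: &Path) -> bool {
    let is_bib = path.extension().and_then(|e| e.to_str()) == Some("bib");
    let is_generated = path
        .file_name()
        .and_then(|n| n.to_str())
        .map_or(false, |n| n.starts_with(OUTPUT_PREFIX));
    is_bib && !is_generated
}

/// Hypothetical helper: lists the input `.bib` files directly inside `dir`.
/// Assumption: subdirectories are not searched.
fn collect_bib_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_file() && is_input_bib(&path) {
            files.push(path);
        }
    }
    // Sort so that the processing order does not depend on the file system.
    files.sort();
    Ok(files)
}
```

Calling `collect_bib_files(Path::new("bib_files/test_files"))` on a directory that already contains "[bib_unifier]bibliography.bib" would skip that file and return only the remaining .bib files.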
If you wish to change the output path, use the -o or --output flag:
$ bib_unifier bib_files/test_files -s -o bib_files/test_files/output.bib
Unifiying bibliography...
Found 5 repetitions in the bibliography.
Unified bibliography was written to "bib_files/test_files/output.bib".
If the specified output file already exists, it is overwritten; otherwise, it is created.
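A minimal sketch of this output behavior, under stated assumptions: `resolve_output` and `write_bibliography` are hypothetical names, and the real program may build the path differently. The only standard-library fact relied on is that `std::fs::write` creates a missing file and replaces the contents of an existing one.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Default output file name, as stated in the text above.
const DEFAULT_OUTPUT_NAME: &str = "[bib_unifier]bibliography.bib";

/// Hypothetical helper: picks the output path from the input directory and
/// the optional value passed to `-o` / `--output`.
fn resolve_output(input_dir: &Path, output: Option<&Path>) -> PathBuf {
    match output {
        Some(path) => path.to_path_buf(),
        None => input_dir.join(DEFAULT_OUTPUT_NAME),
    }
}

/// Writes the unified bibliography. `fs::write` creates the file if it is
/// missing and overwrites it if it already exists.
fn write_bibliography(path: &Path, contents: &str) -> io::Result<()> {
    fs::write(path, contents)
}
```

With no `-o` value, `resolve_output(Path::new("bib_files/test_files"), None)` yields the default path inside the input directory.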
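Choosing which entry to keep
The examples above use the -s (silent) flag. If you remove it, whenever the program finds two repeated entries that differ in at least one field, it asks which one you want to keep. With the -s flag, the program always keeps the first variant it encounters.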
$ bib_unifier bib_files/test_files
Unifiying bibliography...
The following entries have the same title:
1- @article{humberstone1996,
author = {Lloyd Humberstone},
ISSN = {00223611, 15730433},
journal = {Journal of Philosophical Logic},
number = {5},
pages = {451--461},
publisher = {Springer},
title = {Valuational Semantics of Rule Derivability},
volume = {25},
year = {1996},
}
2- @article{humberstone1996rep,
author = {Lloyd Humberstone},
ISSN = {00223611, 15730433},
journal = {Journal of Philosophical Logic},
number = {5},
pages = {451--461},
publisher = {Springer},
title = {Valuational Semantics of Rule Derivability},
volume = {25},
year = {1996},
}
Do you wish to keep the first (1), the second (2) or both (3)?
Enter your choice:
[...]
Found 5 repetitions in the bibliography.
Unified bibliography was written to "bib_files/test_files/[bib_unifier]bibliography.bib".
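For every pair of repeated entries it finds, the program asks which one you want to keep, provided they are not identical in key and all fields. If they are identical, it does not ask and keeps a single copy.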
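The decision rule just described (identical entries are merged silently, -s keeps the first variant, otherwise the user is asked) might look like the sketch below. The `Entry` and `Choice` types, the `entry` constructor, and the `resolve` function are all hypothetical names introduced for illustration; the prompt is passed in as a closure so the logic stays independent of terminal input.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified entry: a citation key plus its fields.
/// A BTreeMap is used so that field order does not affect equality.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    key: String,
    fields: BTreeMap<String, String>,
}

/// Convenience constructor for the sketch.
fn entry(key: &str, fields: &[(&str, &str)]) -> Entry {
    Entry {
        key: key.to_string(),
        fields: fields
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
    }
}

/// The three answers offered by the prompt shown above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Choice {
    First,
    Second,
    Both,
}

/// Decides what to do with two entries already flagged as repetitions.
/// `ask` stands in for the interactive prompt and is only called when needed.
fn resolve(first: &Entry, second: &Entry, silent: bool, ask: impl FnOnce() -> Choice) -> Choice {
    if first == second {
        // Identical in key and all fields: keep a single copy without asking.
        Choice::First
    } else if silent {
        // -s / --silent: always keep the first variant encountered.
        Choice::First
    } else {
        ask()
    }
}
```

For the two Humberstone entries above, which differ only in their keys, `resolve` with `silent = false` would call `ask`, while with `silent = true` it would return `Choice::First`.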
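Entries are detected as repeated when they have:
- the same key (in which case keeping both renames the second key to "originalkey_1", and so on)
- the same doi (if present)
- the same title
- similar titles (see below)
The program performs these checks in this order.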
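The ordered checks listed above can be sketched as a single function that returns the first criterion that matches. This is an illustration under assumptions, not the project's code: the `Entry` struct, the `Match` enum, and both function names are hypothetical, the similarity metric is passed in as a parameter, a score equal to the threshold is assumed to count as a match, and the suffix is assumed to continue as "_2", "_3" when "_1" is already taken.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified entry with only the fields used for matching.
struct Entry {
    key: String,
    doi: Option<String>,
    title: String,
}

/// Which criterion flagged two entries as repetitions.
#[derive(Debug, PartialEq)]
enum Match {
    SameKey,
    SameDoi,
    SameTitle,
    SimilarTitle,
}

/// Applies the checks in the documented order: key, doi, title, similar title.
fn classify(
    a: &Entry,
    b: &Entry,
    threshold: f64,
    similarity: impl Fn(&str, &str) -> f64,
) -> Option<Match> {
    if a.key == b.key {
        return Some(Match::SameKey);
    }
    // The doi check only applies when both entries have one.
    if let (Some(x), Some(y)) = (&a.doi, &b.doi) {
        if x == y {
            return Some(Match::SameDoi);
        }
    }
    if a.title == b.title {
        return Some(Match::SameTitle);
    }
    // Assumption: scores at or above the threshold count as similar.
    if similarity(&a.title, &b.title) >= threshold {
        return Some(Match::SimilarTitle);
    }
    None
}

/// Returns `original` if it is free, otherwise "original_1", "original_2", ...
fn unique_key(original: &str, taken: &HashSet<String>) -> String {
    if !taken.contains(original) {
        return original.to_string();
    }
    let mut n = 1;
    loop {
        let candidate = format!("{}_{}", original, n);
        if !taken.contains(&candidate) {
            return candidate;
        }
        n += 1;
    }
}
```

If "Carnap1942" is already in use and both entries are kept, `unique_key("Carnap1942", &taken)` produces "Carnap1942_1", which matches the renamed keys visible in the program output in this README.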
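Using a similarity threshold
By default, when looking at entry titles, the program checks whether they are exactly identical to decide whether two entries may be the same. Sometimes, however, two entries that are in fact the same have slightly different titles. For those cases, you can use a similarity threshold.
By default, the threshold is set to 1. A number greater than zero and less than one makes the program treat entries as possible repetitions even when their titles are not identical. To this end, it implements several string similarity metrics:
- Normalized Levenshtein edit distance (default)
- Normalized Damerau-Levenshtein distance
- Jaro and Jaro-Winkler distance
- Sørensen-Dice coefficient
All values lie between 0 and 1, where 1 means most similar and 0 least similar. If you are unsure which metric to use, just keep the default.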
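To give a feel for how a score between 0 and 1 arises, here is a standalone sketch of a normalized Levenshtein similarity. It assumes the common normalization of dividing the edit distance by the length of the longer string and subtracting from 1; the program's own normalization is not documented here and may differ. The function names are chosen for this sketch only.

```rust
/// Levenshtein edit distance, counted in Unicode scalar values (chars).
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of `a`
    // and the first j characters of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Similarity in [0, 1]: 1 means identical, 0 means nothing in common.
/// Assumption: the distance is normalized by the longer string's length.
fn normalized_levenshtein(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0; // two empty strings are identical
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}
```

Under this normalization, "An Introduction to Semantics" and "Introduction to Semantics" are 3 edits apart over 28 characters, giving roughly 0.89, and "What is a paraconsistent logic?" versus "What is a Paraconsistent Logic?" differ in 2 of 31 characters, giving roughly 0.94. Both pairs would pass a threshold of 0.7 but not the default of 1.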
You can set the similarity threshold with the -t or --threshold flag, and the metric with the -a or --algorithm flag (see the --help output above for the available options).
$ bib_unifier bib_files/test_files -t 0.7
Unifiying bibliography...
[...]
The following entries have the similar titles:
1- @incollection{BPS2018-WIAPL_1,
address = {Dordrecht},
author = {Barrio, Eduardo and Pailos, Federico and Szmuc, Damian},
booktitle = {{Between Consistency and Inconsistency}},
editor = {Walter Carnielli and Jacek Malinowski},
pages = {89--108},
publisher = {Springer},
title = {{What is a paraconsistent logic?}},
series = {Trends in Logic},
year = {2018},
}
2- @incollection{BPS2018-WIAPL,
address = {Dordrecht},
author = {Barrio, Eduardo and Pailos, Federico and Szmuc, Damian},
booktitle = {{Between Consistency and Inconsistency}},
editor = {Walter Carnielli and Jacek Malinowski},
pages = {89--108},
publisher = {Springer},
series = {Trends in Logic},
title = {{What is a Paraconsistent Logic?}},
year = {2018},
}
Do you wish to keep the first (1), the second (2) or both (3)?
Enter your choice: 1
The following entries have the similar titles:
1- @book{Carnap1942_1,
author = {Rudolf Carnap},
publisher = {Harvard University Press},
series = {Studies in Semantics},
title = {An Introduction to Semantics},
year = {1942},
}
2- @book{Carnap1942,
author = {Rudolf Carnap},
publisher = {Harvard University Press},
series = {Studies in Semantics},
title = {Introduction to Semantics},
year = {1942},
}
[...]
Found 7 repetitions in the bibliography.
Unified bibliography was written to "bib_files/test_files/[bib_unifier]bibliography.bib".
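Note that title comparison is case-sensitive (the BPS case is found with a similarity threshold of 0.7, but not with 1).
Bibtex vs. biblatex format
If you include the -b or --biblatex flag, entries are printed and saved in a slightly different format. For example: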
$ bib_unifier bib_files/test_files -b
Unifiying bibliography...
The following entries have the same title:
1- @article{humberstone1996,
author = {Lloyd Humberstone},
ISSN = {00223611, 15730433},
journaltitle = {Journal of Philosophical Logic},
number = {5},
pages = {451--461},
publisher = {Springer},
title = {Valuational Semantics of Rule Derivability},
volume = {25},
year = {1996},
}
[...]
Note that it uses 'journaltitle' instead of 'journal'. There are a few other minor formatting differences; run the program with both options and see which you prefer.
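The one difference named above, 'journal' versus 'journaltitle', could be handled by a small field-name mapping like the sketch below. The function `field_name_for` is hypothetical, and mapping 'journaltitle' back to 'journal' in bibtex mode is an assumption; the other formatting differences are not documented here, so they are left out.

```rust
/// Hypothetical helper: returns the field name to print for the chosen format.
/// Only the journal/journaltitle difference mentioned in the text is covered.
fn field_name_for(name: &str, biblatex: bool) -> &str {
    match (name, biblatex) {
        // With -b / --biblatex, 'journal' is written as 'journaltitle'.
        ("journal", true) => "journaltitle",
        // Assumption: in bibtex mode the reverse mapping is applied.
        ("journaltitle", false) => "journal",
        // Every other field name is left unchanged.
        _ => name,
    }
}
```

For example, `field_name_for("journal", true)` gives "journaltitle", while `field_name_for("title", true)` leaves "title" untouched.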
Credits and license
Ariel Jonathan Roffé (CONICET, UBA)
This project is distributed under the MIT license (see the corresponding file).