7个版本

使用旧的Rust 2015

0.1.6	2017年9月3日
0.1.5	2017年3月31日
0.1.3	2017年2月20日

#16 in #subtitle

31 每月下载量

GPL-3.0 许可证

580KB
2K SLoC

简介

aligner 是一个Rust库和命令行工具，用于根据第二个“正确”的字幕校正字幕。它将确定偏移量以及在哪里添加或删除广告中断以获得最佳对齐。它不使用任何语言信息，因此它甚至适用于基于图像的字幕，如VobSub。

用法

最基本的命令是

$ aligner reference_subtitle.ssa incorrect_subtitle.srt output.srt

您还可以调整算法尝试避免添加或删除中断的程度

# split-penalty is a value between 0 and 100
$ aligner reference_subtitle.ssa incorrect_subtitle.srt output.srt --split-penalty 2.6

目前支持 .srt、.ssa/.ass 和 .idx 文件。

如何编译二进制文件

安装 Rust和Cargo 然后运行

# this will create ~/.cargo/bin/aligner
$ cargo install aligner

如何使用库

将其添加到您的 Cargo.toml

[dependencies]
aligner = "~0.1.6"

文档

Crates.io

算法

算法的核心是对对齐的评分。对于每一对字幕（一个来自参考字幕，另一个来自不正确的字幕），评分是

重叠时间(refsub,incsub) / 最大(长度(refsub), 长度(incsub))

此评分的最大值为1，当且仅当 refsub = incsub。对齐的总评分（大部分）是所有可能对的所有评分的总和。通过移动 incsubs，我们可能会获得更好的对齐。作为基本约束，incsubs 的顺序将不会改变。因此，如果我们要连续的字幕 start(incsubN) <= start(incsubN+1)，校正后的 'incsubs 仍然具有 start('incsubN) <= start('incsubN+1)。

如果仅使用此公式，算法可能会为每行字幕创建不同的偏移量。为了避免这种情况，我们必须使用split_penalty值。对于连续的每个字幕，其中start(incsubN) - start(incsubN+1) = start('incsubN) - start('incsubN+1)，我们将在总评分中添加另一个split_penalty。这样，每增加一次分割，我们就会失去split_penalty的评分。

该算法计算产生所有可能评分最大值的对齐方式。该算法由动态规划原理驱动。

为了简化问题，我们假设start(sub)是字幕行sub的起始时间戳（以毫秒为单位），并且0 <= start(sub)为真。假设get_rating(t, n)计算了前n个错误字幕行的最佳评分/对齐，并添加了额外的约束0 <= start(sub) <= t，对于这些n个字幕中的每一个。

当然，我们可以简单地将get_rating(t, 0) = 0设置为，因为我们没有错误字幕需要对齐，所以评分为零（独立于t）。现在我们处理以下情况 get_rating(0, 1)。我们可以简单地计算与每个参考字幕的重叠评分（第一个错误字幕从“零时间点”开始）并将这些值相加。使用 get_rating(1, 1) 事情变得有趣。我们可以有 start(sub) = 0 或者 start(sub) = 1。幸运的是，我们已经有了 get_rating(0, 1)，所以我们只需要 start(sub) = 1 的评分。这可以通过将所有重叠评分相加来计算。类似地，我们可以通过取 get_rating(1, 1) 和 start(sub) = 2 的评分的最大值来计算 get_rating(2, 1)。在这方面，我们可以从 get_rating(t, 1) 创建 get_rating(t+1, 1)。我们还可以加快计算重叠评分的速度，因为字幕行将只从 start(sub) = t 移动到 start(sub) = t + 1 的 1ms。字幕将失去段 [t, t + 1] 的评分，并在另一侧获得段 [t + length(sub), t + 1 + length(sub)] 的重叠评分。通过为每个 t 创建参考字幕的查找表，这个过程具有 O(1) 的时间复杂度。我们在什么时候停止呢？当 t 变得很大时，评分就不会再变化了。最迟当 start(sub) 大于任何参考字幕行时，因为在那之后，重叠的评分将始终为零。我们将 max_t 称为所有错误字幕都移至参考字幕之后的时间点。此时最佳总评分为 get_rating(max_t, number_of_incorrect_subtitles)。 Now we have all get_rating(0, 1) to get_rating(max_t, 1). To compute get_rating(0, 2), which means that 0 <= start(sub0) <= 0 and 0 <= start(sub1) <= 0. We already have the rating for sub0 in form of get_rating(0, 1). We only need to add the overlapping rating for sub1. To get get_rating(1, 2) we can either use get_rating(0, 2) (we leave sub1 where it is), or move start(sub1) to 1, which allows start(sub0) to be in 0 <= start(sub0) <= start(sub1) = 1. The best rating for sub0 for that range has been computed with get_rating(1, 1), we only need to add the overlapping rating for start(sub1) == 1. We proceed similarly to get get_rating(t+1, 2): leave the sub1 like it was for get_rating(t,2) or reposition the subtitle to start(sub1) == t+1 and use get_rating(t+1,1) + overlapping_rating(sub1,t+1). 按照相同的原则，我们继续处理 subN 初始化 get_rating(0, n) = get_rating(0, n - 1) + overlapping_rating(subN, 0) 为 get_rating(t+1, n) 选择最大值 get_rating(t, n)，这意味着“保留 subN” get_rating(t+1, n-1) + overlapping_rating(t+1, subN)，这意味着重新定位 subN Until now we didn't use the split_penalty. We need to add the split penalty when start(subN) - start(subN+1) is a specific value (the original distance diff(N)). The trick here is seeing that we only need to consider the "repostion choice". The only time get_rating(_, n-1) is consulted after the inital phase is when subN gets repositioned. subN will then start at t+1 and we consult get_rating(t+1, n-1). So if subN-1 were positioned at t+1-diff(N-1) for get_rating(t+1-diff(N-1), n-1) we'd be able to get the split_penalty. This is exactly the thing we will do when we are in a phase n: We will not only have the "leave choice" or "reposition choice" but also the "nosplit choice". If we compute get_rating(t, n), we can also compare the two other values with get_rating(t-diffN, n-1) + overlapping_rating(t-diffN, subN) + split_penalty. The get_rating(t-diffN, n-1) + overlapping_rating(t-diffN, subN) is again the best rating where start(subN) = t-diffN. We are allowed to add the split_penalty because in the next phase n+1, subN+1 will start at t when get_rating(t-diffN, n) is looked up. So the final rating algorithm is 将 get_rating(t, 0) 初始化为 0 初始化 get_rating(0, n) = get_rating(0, n - 1) + overlapping_rating(subN, 0) 为 get_rating(t+1, n) 选择最大值 get_rating(t, n)，这意味着“保留 subN” get_rating(t+1, n-1) + overlapping_rating(t+1, subN)，这意味着重新定位 subN 和 get_rating(t+1-diffN, n-1) + overlapping_rating(t+1-diffN, subN) + split_penalty，这意味着对 subN 进行无分割重新定位为了得到最终的排列，我们为每个阶段保存n和t+1，其中subN被定位的位置（可以是t+1，t+1-diffN或上一个位置）。如果我们为n = number_of_incorrect_subtitles和t = max_t查找这个值，我们就知道最后一个字幕subN应该在哪个位置。然后我们就知道start(subN)。然后使用get_rating(start(subN), n-1)计算所有之前字幕的最佳排列。因此，我们使用n' = n-1和t' = start(subN)在该表中查找subN-1的位置。这样我们就得到了所有不正确字幕的所有修正位置，任务就完成了！尽管这个算法可以工作（并且在aligner的早期版本中实现了），但它既不快也不节省空间。让我们以一个45分钟= 2700000毫秒< max_t的字幕文件为例，它大约有n = 900个字幕（这些是现实中的值）。我们构建了一个包含max_t * n = 2430000000 个条目的表。在阶段n-1之后，我们可以丢弃阶段n的评分，但我们总是需要存储字幕subN的位置。假设我们需要4个字节来存储它们：那么我们就有了一个包含2430000000 * 4字节= 9720000000字节= 9 GB数据的表放在RAM中！！！即使是填充零也可能需要一些明显的时间。但事实证明，我们可以使用delta encoding将这个表压缩到小于2 MB（大多数情况下；可能是我看到过的最好的压缩之一）。其经验基础是，从t到t+1的选择几乎从未改变（一个阶段大约是10到1000次）。如果我们总是选择 "保留选择"，则位置对于每个t+1总是t+1（增加1） "nosplit-reposition选择"，则位置总是t+1-diffN（增加1） "重定位选择"，位置不会从 t - 1 (常量) 改变。因此，如果我们将值存储在一个 (start, delta, length) 元组中，其中第一个未压缩的值是 start + 0 * delta，第二个是 start + 1 * delta，第三个是 start + 2 * delta，以此类推，最后一个值是 start + length * delta，我们可以将整个阶段压缩成几个字节。同样的事情也适用于评分。不深入细节：如果我们取一个错误字幕与参考字幕的重叠评分，并将错误字幕从最左边“移动”到最右边，我们将有五个压缩值的段（首先评分将为零，然后线性上升，然后保持不变，然后线性下降，最后再次为零）。然后可以对评分段而不是单个 t 进行比较/选择。这至少可以提高一个数量级的速度。

依赖关系 ~22MB ~257K SLoC arrayvec 0.4 clap 2.22 encoding 0.2.33 error-chain 0.10.0 pbr subparse 0.3 dev rand 0.3