1 unstable release

0.1.0	Nov 5, 2022

#334 in Machine learning

MIT license

1.5MB
4.5K SLoC

A preprint of Stitch is available here.

Tutorial coming soon!

Stitch

Quickstart

Run cargo run --release --bin=compress -- data/cogsci/nuts-bolts.json --max-arity=3 --iterations=10

Within a second, this should produce output similar to the following:

=======Compression Summary=======
Found 10 inventions
Cost Improvement: (11.93x better) 1919558 -> 160946
fn_0 (1.78x wrt orig): utility: 837792 | final_cost: 1079238 | 1.78x | uses: 320 | body: [fn_0 arity=2: (T (repeat (T l (M 1 0 -0.5 (/ 0.5 (tan (/ pi #1))))) #1 (M 1 (/ (* 2 pi) #1) 0 0)) (M #0 0 0 0))]
fn_1 (3.81x wrt orig): utility: 572767 | final_cost: 503538 | 2.14x | uses: 190 | body: [fn_1 arity=3: (repeat (T (T #2 (M 0.5 0 0 0)) (M 1 0 (* #1 (cos (/ pi 4))) (* #1 (sin (/ pi 4))))) #0 (M 1 (/ (* 2 pi) #0) 0 0))]
fn_2 (6.06x wrt orig): utility: 185436 | final_cost: 316890 | 1.59x | uses: 168 | body: [fn_2 arity=1: (T (T c (M 2 0 0 0)) (M #0 0 0 0))]
fn_3 (7.18x wrt orig): utility: 48984 | final_cost: 267198 | 1.19x | uses: 82 | body: [fn_3 arity=2: (C #1 (T r (M #0 0 0 0)))]
fn_4 (8.29x wrt orig): utility: 35046 | final_cost: 231646 | 1.15x | uses: 88 | body: [fn_4 arity=2: (C (fn_0 4 #1) (fn_0 #0 6))]
fn_5 (9.04x wrt orig): utility: 18885 | final_cost: 212456 | 1.09x | uses: 95 | body: [fn_5 arity=3: (C #2 (fn_1 #1 1.5 #0))]
fn_6 (9.93x wrt orig): utility: 18885 | final_cost: 193266 | 1.10x | uses: 95 | body: [fn_6 arity=3: (C #2 (fn_1 #1 3 #0))]
fn_7 (10.53x wrt orig): utility: 10604 | final_cost: 182358 | 1.06x | uses: 54 | body: [fn_7 arity=2: (C #1 (fn_0 #0 6))]
fn_8 (11.20x wrt orig): utility: 10503 | final_cost: 171450 | 1.06x | uses: 36 | body: [fn_8 arity=2: (C (fn_0 4 #1) (fn_2 #0))]
fn_9 (11.93x wrt orig): utility: 10202 | final_cost: 160946 | 1.07x | uses: 52 | body: [fn_9 arity=0: (fn_4 4.25 6)]
Time: 227ms
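
For scripting, each per-abstraction line of this summary can be split into fields with a regular expression. The sketch below is not part of Stitch: the helper name parse_summary_line and the pattern are hypothetical, and they assume the line layout stays exactly as in the sample output above. The JSON written to out/out.json is probably the sturdier thing to consume when it is available.

```python
import re

# One line copied verbatim from the sample Compression Summary above.
LINE = ("fn_3 (7.18x wrt orig): utility: 48984 | final_cost: 267198 | 1.19x "
        "| uses: 82 | body: [fn_3 arity=2: (C #1 (T r (M #0 0 0 0)))]")

# Hypothetical pattern; it assumes the summary layout shown in the sample output.
SUMMARY_LINE = re.compile(
    r"^(?P<name>fn_\d+) \((?P<total>[\d.]+)x wrt orig\): "
    r"utility: (?P<utility>\d+) \| final_cost: (?P<final_cost>\d+) \| "
    r"(?P<step>[\d.]+)x \| uses: (?P<uses>\d+) \| "
    r"body: \[fn_\d+ arity=(?P<arity>\d+): (?P<body>.*)\]$"
)


def parse_summary_line(text):
    """Return the fields of one summary line as a dict, or None if it does not match."""
    match = SUMMARY_LINE.match(text)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "name": fields["name"],
        "total_ratio": float(fields["total"]),   # compression vs. the original corpus
        "step_ratio": float(fields["step"]),     # compression vs. the previous step
        "utility": int(fields["utility"]),
        "final_cost": int(fields["final_cost"]),
        "uses": int(fields["uses"]),
        "arity": int(fields["arity"]),
        "body": fields["body"],
    }


parsed = parse_summary_line(LINE)
print(parsed)
```

Lines that do not match the pattern (the header, the "Found 10 inventions" line, the timing line) simply return None, so the helper can be mapped over every line of the output and the None results filtered out.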

A brief guide to reading this output

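fn_0 is the automatically generated name of the abstraction.
(1.78x wrt orig) means that the programs compressed using fn_0 are 1.78x smaller than the original programs. The other 1.78x later in the line is the compression ratio relative to the previous step (for the first step the two are identical).
utility: 837792 is a measure of how much smaller the programs become when rewritten with the new primitive (divide by 100 to get the approximate number of primitives removed).
uses: 320 means the abstraction is used in 320 places in the set of programs.
Note that in these abstractions #i is used for abstraction variables and $i is used for original program variables.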
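
The two ratios described above can be reproduced from the costs printed in the sample summary. The following is a minimal sketch rather than anything Stitch itself provides: the function name compression_ratios is hypothetical, and the numbers are copied from the first three lines of the sample output.

```python
# Costs copied from the sample output: the original corpus cost, then the
# final_cost reported after each of the first three abstractions.
ORIGINAL_COST = 1919558
FINAL_COSTS = {"fn_0": 1079238, "fn_1": 503538, "fn_2": 316890}


def compression_ratios(original, costs):
    """Return (name, ratio vs. original, ratio vs. previous step) per abstraction."""
    rows = []
    previous = original
    for name, cost in costs.items():
        rows.append((name, round(original / cost, 2), round(previous / cost, 2)))
        previous = cost
    return rows


rows = compression_ratios(ORIGINAL_COST, FINAL_COSTS)
for name, total, step in rows:
    print(f"{name}: {total:.2f}x wrt orig, {step:.2f}x wrt previous step")

# Rule of thumb from the guide: utility / 100 approximates primitives removed.
approx_primitives_removed = 837792 / 100
print(f"fn_0 removes roughly {approx_primitives_removed:.0f} primitives")
```

For fn_0 the two ratios coincide because the previous step is the original corpus. From fn_1 onward they diverge: 3.81x against the original, but 2.14x against the corpus already compressed with fn_0.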

Common command-line arguments

--max-arity=2 or -a2 controls the maximum arity of the abstractions found (default is 2).
--iterations=10 or -i10 controls the number of compression iterations to run. Each iteration produces one abstraction, which may build on earlier abstractions.
--threads=10 or -t10 is a quick way to gain performance through multithreading (default is 1).

All command-line arguments

From cargo run --release --bin=compress -- --help:

    ARGS:
        <FILE>    json file to read compression input programs from

    OPTIONS:
        -a, --max-arity <MAX_ARITY>
                max arity of abstractions to find (will find all from 0 to this number inclusive) [default: 2]
            --args-from-json
                extracts argument values from the json; specifically assumes a key value pair like "stitch_args": "data/dc/logo_iteration_1_stitchargs.json -a3 -t8 --fmt=dreamcoder --dreamcoder-drop-last --no-mismatch-check", in the toplevel dictionary of the json. All other commandline args get discarded when you specify this option
        -b, --batch <BATCH>
                how many worklist items a thread will take at once [default: 1]
            --dreamcoder-comparison
                anything related to running a dreamcoder comparison
            --dynamic-batch
                threads will autoadjust how large their batches are based on the worklist size
            --fmt <FMT>
                the format of the input file, e.g. 'programs-list' for a simple JSON array of programs or 'dreamcoder' for a JSON in the style expected by the original dreamcoder codebase. See [formats.rs] for options or to add new ones [default: programs-list] [possible values: dreamcoder, programs-list, split-programs-list]
            --follow-track
                for debugging: prunes all branches except the one that leads to the `--track` abstraction
        -h, --help
                Print help information
            --hole-choice <HOLE_CHOICE>
                Method for choosing hole to expand at each step, doesn't have a huge effect [default: depth-first] [possible values: random, breadth-first, depth-first, max-largest-subset, high-entropy, low-entropy, max-cost, min-cost, many-groups, few-groups, few-apps]
        -i, --iterations <ITERATIONS>
                Number of iterations to run compression for (number of inventions to find) [default: 3]
        -n, --inv-candidates <INV_CANDIDATES>
                Number of invention candidates compression_step should return in a *single* step. Note that these will be the top n optimal candidates modulo subsumption pruning (and the top-1 is guaranteed to be globally optimal) [default: 1]
            --no-mismatch-check
                disables the safety check for the utility being correct; you only want to do this if you truly dont mind unsoundness for a minute
            --no-opt
                disable all optimizations
            --no-opt-arity-zero
                disable the arity zero priming optimization
            --no-opt-force-multiuse
                disable the force multiuse pruning optimization
            --no-opt-free-vars
                disable the free variable pruning optimization
            --no-opt-single-task
                disable the single task pruning optimization
            --no-opt-single-use
                disable the single structurally hashed subtree match pruning
            --no-opt-upper-bound
                disable the upper bound pruning optimization
            --no-opt-useless-abstract
                disable the useless abstraction pruning optimization
            --no-other-util
                makes it so utility is based purely on corpus size without adding in the abstraction size
            --no-stats
                Disable stat logging - note that stat logging in multithreading requires taking a mutex so it can be a source of slowdown in the massively multithreaded case, hence this flag to disable it
            --no-top-lambda
                makes it so inventions cant start with a lambda at the top
        -o, --out <OUT>
                json output file [default: out/out.json]
            --print-stats <PRINT_STATS>
                print stats this often (0 means never) [default: 0]
        -r, --show-rewritten
                print out programs rewritten under abstraction
            --rewrite-check
                whenever you finish an invention do a full rewrite to check that rewriting doesnt raise a cost mismatch exception
            --save-rewritten <SAVE_REWRITTEN>
                saves the rewritten frontiers in an input-readable format
            --shuffle
                shuffle order of set of inventions
        -t, --threads <THREADS>
                number of threads (no parallelism if set to 1) [default: 1]
            --track <TRACK>
                for debugging: pattern or abstraction to track
            --truncate <TRUNCATE>
                truncate set of inventions to include only this many (happens after shuffle if shuffle is also specified)
            --utility-by-rewrite
                calculate utility exhaustively by performing a full rewrite; mainly used when cost mismatches are happening and we need something slow but accurate
            --verbose-best
                prints whenever a new best abstraction is found
            --verbose-worklist
                prints every worklist item as it is processed (will slow things down a ton due to rendering out expressins)

Disabling optimizations

Run cargo run --release --bin=compress -- data/cogsci/nuts-bolts.json --no-opt to disable all optimizations, or see the other command-line arguments beginning with --no-opt- to disable specific optimizations.

Python bindings

Initial Python bindings are currently available.

Run ./gen_bindings_osx.sh or ./gen_bindings_linux.sh, depending on your operating system, to build the bindings (they will be added to bindings/). If this command does not work, let me know or open an issue! It may vary by operating system, and the current commands may be overfit to my machine.
Add the stitch/bindings/ folder to your $PYTHONPATH, for example by adding export PYTHONPATH="$PYTHONPATH:path/to/stitch/bindings/" to your ~/.bashrc or to whatever is specific to your shell / venv. This puts the stitch.so file on your Python path, which lets you import it.
Start python and try import stitch (nothing should be printed if it succeeds).
As a simple example, running the Python code import stitch,json; result = json.loads(stitch.compression(["(a a a)", "(b b b)"], iterations=1, max_arity=2)); print("Result:", result) should find the (#0 #0 #0) abstraction.
Note that the result is currently a large Python dictionary similar to Stitch's usual out/out.json output.
More keyword arguments are available. The full list is in examples/stitch.rs, which is where the bindings live, because keeping them in examples/ is a workaround for getting the project to produce a cdylib for the Python bindings. Basically, anything you can find in cargo run --release --bin=compress -- --help is included.

Details

--save-baseline=main saves a named baseline (if one already exists, it first compares against the previous version and then overwrites it).
--load-baseline=feature runs no benchmarks and only loads the file, as if it were the result you had just generated.
--baseline=master overrides which baseline we compare against.
--bench=compress_bench avoids the verbose "unrecognized option" errors.

Flamegraph

If you have not installed it yet: cargo install flamegraph

cargo flamegraph --root --open --deterministic --output=out/flamegraph.svg --bin=compress -- data/cogsci/nuts-bolts.json

Acknowledgments

This work was supported by the National Science Foundation (NSF) under Grant No. 1918839, "Understanding the World Through Code". http://www.neurosymbolic.org/

This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under the program Symbiotic Design for Cyber Physical Systems (SDCPS), Contract FA8750-20-C-0542 (Systemic Generative Engineering). The views, opinions, and/or findings expressed are those of the author(s) and do not necessarily reflect the views of DARPA.

Dependencies

~5–12MB
~127K SLoC

chrono
clap 3.1+derive
colorful
egg 0.7.1+serde-1
itertools 0.10.3
lambdas
lazy_static
ordered-float 3.0
parking_lot
python? pyo3 0.16.1+extension-module
rand 0.8.4
rustc-hash
serde
serde_json+preserve_order
symbolic_expressions