1 unstable release

0.1.0 Nov 5, 2022

#334 in Machine learning

MIT license

1.5MB
4.5K SLoC

Rust 3K SLoC // 0.1% comments Python 1K SLoC // 0.1% comments Shell 148 SLoC // 0.2% comments

The preprint for Stitch is available here.

Tutorial coming soon!

Stitch

Quickstart

Run cargo run --release --bin=compress -- data/cogsci/nuts-bolts.json --max-arity=3 --iterations=10

In under a second this should produce output similar to the following:

=======Compression Summary=======
Found 10 inventions
Cost Improvement: (11.93x better) 1919558 -> 160946
fn_0 (1.78x wrt orig): utility: 837792 | final_cost: 1079238 | 1.78x | uses: 320 | body: [fn_0 arity=2: (T (repeat (T l (M 1 0 -0.5 (/ 0.5 (tan (/ pi #1))))) #1 (M 1 (/ (* 2 pi) #1) 0 0)) (M #0 0 0 0))]
fn_1 (3.81x wrt orig): utility: 572767 | final_cost: 503538 | 2.14x | uses: 190 | body: [fn_1 arity=3: (repeat (T (T #2 (M 0.5 0 0 0)) (M 1 0 (* #1 (cos (/ pi 4))) (* #1 (sin (/ pi 4))))) #0 (M 1 (/ (* 2 pi) #0) 0 0))]
fn_2 (6.06x wrt orig): utility: 185436 | final_cost: 316890 | 1.59x | uses: 168 | body: [fn_2 arity=1: (T (T c (M 2 0 0 0)) (M #0 0 0 0))]
fn_3 (7.18x wrt orig): utility: 48984 | final_cost: 267198 | 1.19x | uses: 82 | body: [fn_3 arity=2: (C #1 (T r (M #0 0 0 0)))]
fn_4 (8.29x wrt orig): utility: 35046 | final_cost: 231646 | 1.15x | uses: 88 | body: [fn_4 arity=2: (C (fn_0 4 #1) (fn_0 #0 6))]
fn_5 (9.04x wrt orig): utility: 18885 | final_cost: 212456 | 1.09x | uses: 95 | body: [fn_5 arity=3: (C #2 (fn_1 #1 1.5 #0))]
fn_6 (9.93x wrt orig): utility: 18885 | final_cost: 193266 | 1.10x | uses: 95 | body: [fn_6 arity=3: (C #2 (fn_1 #1 3 #0))]
fn_7 (10.53x wrt orig): utility: 10604 | final_cost: 182358 | 1.06x | uses: 54 | body: [fn_7 arity=2: (C #1 (fn_0 #0 6))]
fn_8 (11.20x wrt orig): utility: 10503 | final_cost: 171450 | 1.06x | uses: 36 | body: [fn_8 arity=2: (C (fn_0 4 #1) (fn_2 #0))]
fn_9 (11.93x wrt orig): utility: 10202 | final_cost: 160946 | 1.07x | uses: 52 | body: [fn_9 arity=0: (fn_4 4.25 6)]
Time: 227ms

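A brief guide to reading this output

  • fn_0 is the auto-generated name of the abstraction.
  • (1.78x wrt orig) means that the programs compressed using fn_0 are 1.78x smaller than the original programs. The other 1.78x later in the line is the compression ratio relative to the previous step (for the first step the two are the same).
  • utility: 837792 is a measure of how much smaller the programs become when rewritten with the new primitive (divide by 100 for the approximate number of primitives removed).
  • uses: 320 means the abstraction is used in 320 places across the set of programs.
  • Note that within these abstractions #i denotes an abstraction variable, while $i denotes a variable of the original program.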
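The two ratios on each summary line can be recomputed from the costs the tool prints. The following is a minimal sketch, assuming the line layout shown in the sample output above; the helper names parse_summary and ratios are hypothetical and not part of Stitch, and the abstraction bodies are truncated to [...] for brevity.

```python
import re

# Lines copied from the sample output above (bodies truncated for brevity).
SUMMARY = """\
Cost Improvement: (11.93x better) 1919558 -> 160946
fn_0 (1.78x wrt orig): utility: 837792 | final_cost: 1079238 | 1.78x | uses: 320 | body: [...]
fn_1 (3.81x wrt orig): utility: 572767 | final_cost: 503538 | 2.14x | uses: 190 | body: [...]
"""


def parse_summary(text):
    """Return (original_cost, [(name, utility, final_cost, uses), ...])."""
    original = int(re.search(r"(\d+) -> \d+", text).group(1))
    rows = [
        (m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))
        for m in re.finditer(
            r"^(fn_\d+) .*?utility: (\d+) \| final_cost: (\d+) \| .*?uses: (\d+)",
            text,
            re.MULTILINE,
        )
    ]
    return original, rows


def ratios(original, rows):
    """For each row, compute the ratio vs. the original cost and vs. the previous step."""
    out, prev = [], original
    for name, _utility, final_cost, _uses in rows:
        out.append((name, original / final_cost, prev / final_cost))
        prev = final_cost
    return out


original, rows = parse_summary(SUMMARY)
for name, vs_orig, vs_prev in ratios(original, rows):
    print(f"{name}: {vs_orig:.2f}x wrt orig, {vs_prev:.2f}x wrt previous step")
```

For fn_1 this gives 1919558 / 503538 for the "wrt orig" figure and 1079238 / 503538 for the per-step figure, which is why the two numbers differ after the first abstraction.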

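Common command-line arguments

  • --max-arity=2 or -a2 controls the maximum arity of the abstractions found (default: 2).
  • --iterations=10 or -i10 controls how many compression iterations to run (default: 3). Each iteration produces one abstraction, which can build on abstractions from earlier iterations.
  • --threads=10 or -t10 is a quick way to improve performance through multithreading (default: 1).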
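When scripting runs, it can help to assemble the command from these three options in one place. This is a minimal sketch, not part of Stitch: compress_command is a hypothetical helper, its defaults mirror the defaults listed in the --help output, and it only builds and prints the command rather than running it.

```python
import shlex


def compress_command(input_json, max_arity=2, iterations=3, threads=1):
    """Build the argv list for the compress binary using the three common flags."""
    return [
        "cargo", "run", "--release", "--bin=compress", "--",
        input_json,
        f"--max-arity={max_arity}",
        f"--iterations={iterations}",
        f"--threads={threads}",
    ]


cmd = compress_command("data/cogsci/nuts-bolts.json", max_arity=3, iterations=10)
print(shlex.join(cmd))
```

To actually run it, the list can be passed to subprocess.run(cmd, check=True) from the repository root, assuming cargo is installed.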

All command-line arguments

cargo run --release --bin=compress -- --help

ARGS:
    <FILE>    json file to read compression input programs from

OPTIONS:
    -a, --max-arity <MAX_ARITY>
            max arity of abstractions to find (will find all from 0 to this number inclusive)
            [default: 2]

        --args-from-json
            extracts argument values from the json; specifically assumes a key value pair like
            "stitch_args": "data/dc/logo_iteration_1_stitchargs.json -a3 -t8 --fmt=dreamcoder
            --dreamcoder-drop-last --no-mismatch-check", in the toplevel dictionary of the json. All
            other commandline args get discarded when you specify this option

    -b, --batch <BATCH>
            how many worklist items a thread will take at once [default: 1]

        --dreamcoder-comparison
            anything related to running a dreamcoder comparison

        --dynamic-batch
            threads will autoadjust how large their batches are based on the worklist size

        --fmt <FMT>
            the format of the input file, e.g. 'programs-list' for a simple JSON array of programs
            or 'dreamcoder' for a JSON in the style expected by the original dreamcoder codebase.
            See [formats.rs] for options or to add new ones [default: programs-list] [possible
            values: dreamcoder, programs-list, split-programs-list]

        --follow-track
            for debugging: prunes all branches except the one that leads to the `--track`
            abstraction

    -h, --help
            Print help information

        --hole-choice <HOLE_CHOICE>
            Method for choosing hole to expand at each step, doesn't have a huge effect [default:
            depth-first] [possible values: random, breadth-first, depth-first, max-largest-subset,
            high-entropy, low-entropy, max-cost, min-cost, many-groups, few-groups, few-apps]

    -i, --iterations <ITERATIONS>
            Number of iterations to run compression for (number of inventions to find) [default: 3]

    -n, --inv-candidates <INV_CANDIDATES>
            Number of invention candidates compression_step should return in a *single* step. Note
            that these will be the top n optimal candidates modulo subsumption pruning (and the top-
            1  is guaranteed to be globally optimal) [default: 1]

        --no-mismatch-check
            disables the safety check for the utility being correct; you only want to do this if you
            truly dont mind unsoundness for a minute

        --no-opt
            disable all optimizations

        --no-opt-arity-zero
            disable the arity zero priming optimization

        --no-opt-force-multiuse
            disable the force multiuse pruning optimization

        --no-opt-free-vars
            disable the free variable pruning optimization

        --no-opt-single-task
            disable the single task pruning optimization

        --no-opt-single-use
            disable the single structurally hashed subtree match pruning

        --no-opt-upper-bound
            disable the upper bound pruning optimization

        --no-opt-useless-abstract
            disable the useless abstraction pruning optimization

        --no-other-util
            makes it so utility is based purely on corpus size without adding in the abstraction
            size

        --no-stats
            Disable stat logging - note that stat logging in multithreading requires taking a mutex
            so it can be a source of slowdown in the massively multithreaded case, hence this flag
            to disable it

        --no-top-lambda
            makes it so inventions cant start with a lambda at the top

    -o, --out <OUT>
            json output file [default: out/out.json]

        --print-stats <PRINT_STATS>
            print stats this often (0 means never) [default: 0]

    -r, --show-rewritten
            print out programs rewritten under abstraction

        --rewrite-check
            whenever you finish an invention do a full rewrite to check that rewriting doesnt raise
            a cost mismatch exception

        --save-rewritten <SAVE_REWRITTEN>
            saves the rewritten frontiers in an input-readable format

        --shuffle
            shuffle order of set of inventions

    -t, --threads <THREADS>
            number of threads (no parallelism if set to 1) [default: 1]

        --track <TRACK>
            for debugging: pattern or abstraction to track

        --truncate <TRUNCATE>
            truncate set of inventions to include only this many (happens after shuffle if shuffle
            is also specified)

        --utility-by-rewrite
            calculate utility exhaustively by performing a full rewrite; mainly used when cost
            mismatches are happening and we need something slow but accurate

        --verbose-best
            prints whenever a new best abstraction is found

        --verbose-worklist
            prints every worklist item as it is processed (will slow things down a ton due to
            rendering out expressins)
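The help text describes the default programs-list format as a simple JSON array of programs. A minimal sketch of producing such an input file follows; the two toy programs are the ones used in the Python bindings example, the file name programs.json is arbitrary, and write_programs_list is a hypothetical helper.

```python
import json
import os
import tempfile

# Toy corpus: each program is one s-expression string.
programs = ["(a a a)", "(b b b)"]


def write_programs_list(programs, path):
    """Write the programs as a plain JSON array, the shape 'programs-list' is described as expecting."""
    with open(path, "w") as f:
        json.dump(programs, f, indent=2)
    return path


path = write_programs_list(programs, os.path.join(tempfile.mkdtemp(), "programs.json"))
print("wrote", path)
# Command to run from the repository root (printed here, not executed):
print("cargo run --release --bin=compress -- " + path + " --fmt=programs-list")
```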

Disabling optimizations

cargo run --release --bin=compress -- data/cogsci/nuts-bolts.json --no-opt

Alternatively, see the other command-line arguments beginning with --no-opt- to disable specific optimizations.
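One way to see what each optimization contributes is to run the same input once per --no-opt-* flag and compare the reported times. The sketch below only generates the commands; the flag names are copied from the --help output above, while ablation_commands is a hypothetical helper and the comparison workflow itself is an assumption, not something the project prescribes.

```python
# Flags copied from the --help output; each disables a single optimization.
NO_OPT_FLAGS = [
    "--no-opt-arity-zero",
    "--no-opt-force-multiuse",
    "--no-opt-free-vars",
    "--no-opt-single-task",
    "--no-opt-single-use",
    "--no-opt-upper-bound",
    "--no-opt-useless-abstract",
]

BASE = ["cargo", "run", "--release", "--bin=compress", "--", "data/cogsci/nuts-bolts.json"]


def ablation_commands(flags=NO_OPT_FLAGS):
    """Return a baseline command, one command per disabled optimization, and one with --no-opt."""
    variants = [[]] + [[flag] for flag in flags] + [["--no-opt"]]
    return [BASE + extra for extra in variants]


for cmd in ablation_commands():
    print(" ".join(cmd))
```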

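Python bindings

Initial Python bindings are currently available.

  • Run ./gen_bindings_osx.sh or ./gen_bindings_linux.sh, depending on your operating system, to build the bindings (they will be added to bindings/).
    • If this command doesn't work, let me know or open an issue! It probably varies by operating system, and the current command may be overfit to my machine.
  • Add the stitch/bindings/ folder to your $PYTHONPATH, for example by adding export PYTHONPATH="$PYTHONPATH:path/to/stitch/bindings/" to your ~/.bashrc or to your specific shell / venv. This puts the stitch.so file on your Python path, which lets you import it.
  • Launch python and try import stitch (it should print nothing if it succeeds).
  • As a simple example, running the Python code import stitch,json; result = json.loads(stitch.compression(["(a a a)", "(b b b)"], iterations=1, max_arity=2)); print("Result:", result) should find the (#0 #0 #0) abstraction.
  • Note that, for now, the parsed result is a large Python dictionary similar to stitch's regular out/out.json output.
  • More keyword arguments are available (the full list is in examples/stitch.rs, which is where the bindings live, since keeping them in examples/ is a workaround for getting the project to build a cdylib for the Python bindings). Basically anything you can find in cargo run --release --bin=compress -- --help is included.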
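The one-line example above can be expanded into a small script that also copes with the bindings not being on the path yet. This is a minimal sketch: it uses only the stitch.compression call and the iterations and max_arity keyword arguments shown above, the compress wrapper is a hypothetical helper, and because the keys of the result dictionary are not documented here, it only lists them.

```python
import json

try:
    import stitch  # importable only once the bindings are built and on $PYTHONPATH
except ImportError:
    stitch = None


def compress(programs, **kwargs):
    """Run stitch on a list of program strings; return the parsed result, or None without bindings."""
    if stitch is None:
        return None
    return json.loads(stitch.compression(programs, **kwargs))


result = compress(["(a a a)", "(b b b)"], iterations=1, max_arity=2)
if result is None:
    print("stitch bindings not found; build them and check $PYTHONPATH")
else:
    # The exact layout of the result is not assumed here, so just list the top-level keys.
    print("Top-level keys:", sorted(result))
```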

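Details

  • --save-baseline=main saves a named baseline (comparing against its previous version if one exists, then overwriting it).
  • --load-baseline=feature means no benchmarks are run; the file is simply loaded as if it were the result you had just generated.
  • --baseline=master overrides which baseline we compare against.
  • --bench=compress_bench avoids the verbose "unrecognized option" error (see here).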


Flamegraph

If you haven't installed it yet: cargo install flamegraph

cargo flamegraph --root --open --deterministic --output=out/flamegraph.svg --bin=compress -- data/cogsci/nuts-bolts.json

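Acknowledgments

This work was supported by the National Science Foundation (NSF) under grant no. 1918839, "Understanding the World Through Code": http://www.neurosymbolic.org/

This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under the Symbiotic Design for Cyber Physical Systems (SDCPS) program, contract FA8750-20-C-0542 (Systemic Generative Engineering). The views, opinions, and/or findings expressed are those of the authors and do not necessarily reflect the views of DARPA.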

Dependencies

~5–12MB
~127K SLoC