4 releases (breaking changes)

0.4.0 Jan 3, 2022
0.3.0 Sep 15, 2021
0.2.0 Sep 11, 2021
0.1.0 May 20, 2019

MIT license

htmlq

Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.

Installation

Cargo

cargo install htmlq

Homebrew

brew install htmlq

Usage

$ htmlq -h
htmlq 0.4.0
Michael Maclean <[email protected]>
Runs CSS selectors on HTML

USAGE:
    htmlq [FLAGS] [OPTIONS] [--] [selector]...

FLAGS:
    -B, --detect-base          Try to detect the base URL from the <base> tag in the document. If not found, default to
                               the value of --base, if supplied
    -h, --help                 Prints help information
    -w, --ignore-whitespace    When printing text nodes, ignore those that consist entirely of whitespace
    -p, --pretty               Pretty-print the serialised output
    -t, --text                 Output only the contents of text nodes inside selected elements
    -V, --version              Prints version information

OPTIONS:
    -a, --attribute <attribute>         Only return this attribute (if present) from selected elements
    -b, --base <base>                   Use this URL as the base for links
    -f, --filename <FILE>               The input file. Defaults to stdin
    -o, --output <FILE>                 The output file. Defaults to stdout
    -r, --remove-nodes <SELECTOR>...    Remove nodes matching this expression before output. May be specified multiple
                                        times

ARGS:
    <selector>...    The CSS expression to select [default: html]
$
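
The options listed above can also be combined to work on a local file rather than a pipe. The following is a minimal sketch, assuming htmlq is on your PATH; the file paths and the sample markup are made up for illustration and are not part of the original documentation.

```shell
#!/bin/sh
# Hypothetical sample document; the path and markup are illustrative only.
sample="${TMPDIR:-/tmp}/htmlq-sample.html"
result="${TMPDIR:-/tmp}/htmlq-sample-out.html"

cat > "$sample" <<'EOF'
<html>
  <body>
    <div id="get-help">
      <h4>Get help!</h4>
      <a href="/learn">Learn</a>
    </div>
  </body>
</html>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # --filename reads from a file instead of stdin;
    # --output writes the matches to a file instead of stdout.
    htmlq --filename "$sample" --output "$result" '#get-help'
    cat "$result"
else
    echo "htmlq not found; install it with 'cargo install htmlq'" >&2
fi
```

Leaving out --filename and --output gives the stdin/stdout behaviour used in the examples that follow.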

Examples

Using with cURL to find part of a page by ID

$ curl --silent https://rust-lang.net.cn/ | htmlq '#get-help'
<div class="four columns mt3 mt0-l" id="get-help">
        <h4>Get help!</h4>
        <ul>
          <li><a href="https://doc.rust-lang.net.cn">Documentation</a></li>
          <li><a href="https://users.rust-lang.org">Ask a Question on the Users Forum</a></li>
          <li><a href="http://ping.rust-lang.org">Check Website Status</a></li>
        </ul>
        <div class="languages">
            <label class="hidden" for="language-footer">Language</label>
            <select id="language-footer">
                <option title="English (US)" value="en-US">English (en-US)</option>
<option title="French" value="fr">Français (fr)</option>
<option title="German" value="de">Deutsch (de)</option>

            </select>
        </div>
      </div>
$ curl --silent https://rust-lang.net.cn/ | htmlq --attribute href a
/
/tools/install
/learn
/tools
/governance
/community
https://blog.rust-lang.net.cn/
/learn/get-started
https://blog.rust-lang.net.cn/2019/04/25/Rust-1.34.1.html
https://blog.rust-lang.net.cn/2018/12/06/Rust-1.31-and-rust-2018.html
[...]
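
Several of the links printed above are relative. The help output lists --base and --detect-base for supplying a base URL for links, so a sketch along the following lines may be useful when absolute URLs are wanted. The sample file, its markup, and the example.com URLs are assumptions for illustration, and the exact form of the rewritten output is not shown because it is not documented here.

```shell
#!/bin/sh
# Hypothetical sample document with one relative and one absolute link.
sample="${TMPDIR:-/tmp}/htmlq-links.html"

cat > "$sample" <<'EOF'
<html>
  <body>
    <a href="/learn">Learn</a>
    <a href="https://blog.example.com/">Blog</a>
  </body>
</html>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # --base names the URL to use as the base for links.
    htmlq --filename "$sample" --attribute href --base 'https://example.com/' a

    # --detect-base looks for a <base> tag in the document first and
    # falls back to the value of --base if none is found.
    htmlq --filename "$sample" --attribute href \
        --detect-base --base 'https://example.com/' a
else
    echo "htmlq not found; install it with 'cargo install htmlq'" >&2
fi
```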

Getting the text content of a post

$ curl --silent https://nixos.org/nixos/about.html | htmlq  --text .main

          About NixOS

NixOS is a GNU/Linux distribution that aims to
improve the state of the art in system configuration management.  In
existing distributions, actions such as upgrades are dangerous:
upgrading a package can cause other packages to break, upgrading an
entire system is much less reliable than reinstalling from scratch,
you can’t safely test what the results of a configuration change will
be, you cannot easily undo changes to the system, and so on.  We want
to change that.  NixOS has many innovative features:

[...]
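
The output above keeps the blank lines that come from whitespace-only text nodes. The help output describes --ignore-whitespace as skipping such nodes when printing text, so the two invocations below can be compared side by side. This is a small sketch on a made-up local file, not on the NixOS page itself.

```shell
#!/bin/sh
# Hypothetical sample document; indentation and the blank line inside .main
# produce whitespace-only text nodes.
sample="${TMPDIR:-/tmp}/htmlq-text.html"

cat > "$sample" <<'EOF'
<html>
  <body>
    <div class="main">
      <h1>About</h1>

      <p>First paragraph.</p>
    </div>
  </body>
</html>
EOF

if command -v htmlq >/dev/null 2>&1; then
    echo '--- with --text only ---'
    htmlq --filename "$sample" --text .main

    echo '--- with --text and --ignore-whitespace ---'
    htmlq --filename "$sample" --text --ignore-whitespace .main
else
    echo "htmlq not found; install it with 'cargo install htmlq'" >&2
fi
```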

Remove a node before output

There's a big SVG image in this page that I don't need, so here's how to remove it.

$ curl --silent https://nixos.org/ | ./target/debug/htmlq '.whynix' --remove-nodes svg
<ul class="whynix">
      <li>

        <h2>Reproducible</h2>
        <p>
          Nix builds packages in isolation from each other. This ensures that they
          are reproducible and don't have undeclared dependencies, so <strong>if a
            package works on one machine, it will also work on another</strong>.
        </p>
      </li>
      <li>

        <h2>Declarative</h2>
        <p>
          Nix makes it <strong>trivial to share development and build
            environments</strong> for your projects, regardless of what programming
          languages and tools you’re using.
        </p>
      </li>
      <li>

        <h2>Reliable</h2>
        <p>
          Nix ensures that installing or upgrading one package <strong>cannot
            break other packages</strong>. It allows you to <strong>roll back to
            previous versions</strong>, and ensures that no package is in an
          inconsistent state during an upgrade.
        </p>
      </li>
    </ul>
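
According to the help output, --remove-nodes accepts more than one selector and may be given several times. Because it takes a list of values, a selector placed directly after it would presumably be read as another node to remove, which is likely why the usage line shows an optional `--` before the selector. The sketch below shows both ways of writing this on a made-up local file; the markup is an assumption for illustration.

```shell
#!/bin/sh
# Hypothetical sample document containing two kinds of unwanted nodes.
sample="${TMPDIR:-/tmp}/htmlq-remove.html"

cat > "$sample" <<'EOF'
<html>
  <body>
    <ul class="whynix">
      <li>
        <svg width="10" height="10"><circle r="5"></circle></svg>
        <h2>Reproducible</h2>
        <script>console.log("unwanted");</script>
      </li>
    </ul>
  </body>
</html>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # Selector first, then one --remove-nodes per node type to drop.
    htmlq --filename "$sample" '.whynix' --remove-nodes svg --remove-nodes script

    # Same intent with a single --remove-nodes list; "--" ends the list
    # so that '.whynix' is treated as the selector.
    htmlq --filename "$sample" --remove-nodes svg script -- '.whynix'
else
    echo "htmlq not found; install it with 'cargo install htmlq'" >&2
fi
```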

Pretty print HTML

(This is a work in progress)

$ curl --silent https://mgdm.net | htmlq --pretty '#posts'
<section id="posts">
  <h2>I write about...
  </h2>
  <ul class="post-list">
    <li>
      <time datetime="2019-04-29 00:%i:1556496000" pubdate="">
        29/04/2019</time><a href="/weblog/nettop/">
        <h3>Debugging network connections on macOS with nettop
        </h3></a>
      <p>Using nettop to find out what network connections a program is trying to make.
      </p>
    </li>
[...]

Syntax highlighting with bat

$ curl --silent example.com | htmlq 'body' | bat --language html
[Image: syntax-highlighted output]

Dependencies

~4–11MB
~127K SLoC