5 versions

0.1.4	Feb 21, 2021
0.1.3	Jan 22, 2021
0.1.2	Jan 18, 2021
0.1.1	Jan 18, 2021
0.1.0	Jan 18, 2021

#1614 in Text processing

MIT license

9KB
124 lines

web-grep

What is this?

Grep for HTML or XML.

$ echo '<a>Hello</a>' | web-grep '<a>{}</a>'
Hello

$ echo '<a>Hello</a>' | web-grep '<a>{html}</a>' --json
{"html":"Hello"}

# List the innerHTML of every <p>
$ cat << EOM | web-grep '<p>{}</p>'
<body>
  <p>hello</p>
  <div>
    <p>world</p>
  </div>
</body>
EOM
hello
world

# Filter by attribute
$ cat << EOM | web-grep '<p class=here>{}</p>'
<body>
  <p class="not-here">hello</p>
  <div>
    <p class="here">world</p>
  </div>
</body>
EOM
world

# The placeholder {} can also stand for an attribute value
$ cat << EOM | web-grep '<p class={}>world</p>'
<body>
  <p class="not-here">hello</p>
  <div>
    <p class="here">world</p>
  </div>
</body>
EOM
here

How do I use it?

This is just a command-line interface for the excellent library tanakh/easy-scraper.

Install

Install cargo
- Recommended: install rustup
Then,
- cargo install web-grep

Usage

$ web-grep <QUERY> [INPUT]

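QUERY is an HTML (XML) pattern.

A pattern is a valid HTML structure with placeholders for innerHTML or attribute values. web-grep provides several kinds of placeholder to cover different cases.

Placeholders

Anonymous placeholder `{}`

If the pattern needs exactly one placeholder, use {}.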

<p>{}</p>

<p class="here">
    <q>{}</q>
</p>

web-grep prints every text matched by {}.

$ echo "<p>1</p><p>2</p><p>3</p>" | web-grep "<p>{}</p>"
1
2
3
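
Since each match is printed on its own line, the output can be passed to ordinary line-oriented tools. The following is a minimal sketch, not part of web-grep: the `printf` line stands in for the web-grep command above (its three output lines are copied from that example so the snippet runs without web-grep installed), and the variable names `matches`, `count`, and `last` are made up for illustration.

```shell
# Stand-in for: echo "<p>1</p><p>2</p><p>3</p>" | web-grep "<p>{}</p>"
# (output lines copied from the example above; this is an assumption, not a live call)
matches=$(printf '1\n2\n3\n')

# Count the matches: one match per line, so the line count is the match count.
count=$(printf '%s\n' "$matches" | awk 'END { print NR }')

# Pick the last match.
last=$(printf '%s\n' "$matches" | tail -n 1)

printf 'count=%s last=%s\n' "$count" "$last"
```

With the sample lines above this prints `count=3 last=3`.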

Numbered placeholder `{n}`

<a href="{1}">{2}</a>

web-grep prints the texts matched by {1}, {2}, ... in numeric order, separated by \t.

$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>"
fuga	hoge

The separator can be set with -F.

$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>" -F ' '
fuga hoge
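
With the default tab separator, each output line can be split into shell variables with `read`. The following is a minimal sketch under stated assumptions, not part of web-grep: the `printf` line stands in for the command `echo '<a href=hoge>fuga</a>' | web-grep "<a href={2}>{1}</a>"` (its output is copied from the example above so the snippet runs without web-grep installed), and the names `text` and `href` are chosen only for illustration.

```shell
# Stand-in for the tab-separated web-grep output shown above (assumption, not a live call).
sample_output=$(printf 'fuga\thoge\n')

# A literal tab character, built portably for use as the field separator.
tab=$(printf '\t')

# Read each line and split it on tabs: the first field is {1}, the second is {2}.
result=$(printf '%s\n' "$sample_output" | while IFS="$tab" read -r text href; do
  printf 'text=%s href=%s\n' "$text" "$href"
done)

printf '%s\n' "$result"
```

With the sample line above this prints `text=fuga href=hoge`. Because tab counts as whitespace for `IFS`, empty fields would be collapsed; if matched text may be empty, a tool such as `awk -F '\t'` is likely a safer choice.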

Named placeholder `{xxx}`

<a href="{href}">{innerHTML}</a>

Use --json to format the output as JSON.

$ echo '<a href=hoge>fuga</a>' | web-grep "<a href={href}>{html}</a>" --json
{"href":"hoge","html":"fuga"}

Dependencies

~7–14MB
~160K SLoC