18个版本 (10个破坏性更新)

0.13.1	2023年3月31日
0.10.0	2022年10月10日
0.8.1	2022年7月17日
0.5.1	2022年3月15日

#412 in Web编程

每月下载 78次

MPL-2.0 和 GPL-3.0 许可协议

300KB
7.5K SLoC

fetcher

fetcher使您能够将任何信息收集（如新闻或博客文章）自动化到您最舒适的地方。它抓取您选择的任何内容（从可用的来源列表中）；处理它（过滤和解析），并将其发送到您想要的地方。将其想象成类似于IFTTT，但本地托管，并且仅支持传输文本。

在我发现IFTTT开始收费后，我便写了自己的fetcher，更不用说那时我已经对它不满意了。我想在同一个地方接收新闻和通知，而无需搜索。例如，假设我想阅读某人的推文，但只有当它们包含特定字符串时。在此之前，我必须通过Twitter应用在我的手机上接收该用户的所有通知，并阅读每一个以找到与我相关的那些。这真的很糟糕。在寻找一些本地托管的IFTTT替代品后，我发现没有一个是轻量级的且对我有用的，因此我决定自己写一个。

fetcher是一个二进制文件，也是一个库crate，因此它可以以编程方式完成它通常可以完成的所有事情。

如果您想要添加特定功能，请随时贡献。

安装

从crates.io下载和安装

cargo install fetcher

或手动构建

git clone -b main --single-branch https://github.com/SergeyKasmy/fetcher.git
cd fetcher
cargo build --release

最终的二进制文件将位于target/release/fetcher，然后您可以将其复制到~/.local/bin或任何包含在您的$PATH中的目录。

设置

fetcher 的主要执行单元是任务。一个任务由一个或多个每设定时间间隔或每天特定时间重新运行的任务组成。一个任务包含一个数据源（从哪里获取数据），（一个或多个）处理数据的行为（修改、过滤、删除已读的），以及一个数据目的地（数据最终被发送到的地方）。要创建一个任务，在 $XDG_CONFIG_HOME/fetcher/jobs 或 /etc/xdg/fetcher/jobs 中创建一个名为 foo.yml 的文件，其中 foo 是您希望任务具有的名称。一个合适的任务配置文件看起来可能像这样

refresh: 
  every: 30m
tasks:
  news:
    read_filter_type: newer_than_read
    source:
      twitter: '<your_twitter_handle>'
    process:
      - read_filter # leave out only entries newer than the last one read
      - contains:
          body: '[Hh]ello'
      - set:
         title: New tweet from somebody
      - shorten:
          body: 50
    sink:
      discord:
        user: <your_user_id>

这个任务每30分钟运行一次，有一个名为 "news" 的单一任务。

此任务获取 @<你的推特用户名> 的 Twitter 时间线。
删除所有已读的推文（使用 newer_than_read 策略）
仅保留包含 "Hello" 或 "hello" 的推文
将标题设置为 "某人新推文"。
如果正文超过50个字符，则将其缩短到50个字符。
并将所有剩余的推文通过DM发送到 Discord 用户 <你的用户ID>。

运行

使用 fetcher run 运行 fetcher。这将运行在所有配置位置中找到的所有任务。fetcher 首先在 $XDG_CONFIG_HOME/fetcher/jobs 中搜索所有 .yml 任务，然后在 /etc/xdg/fetcher/jobs 中搜索。

您可以使用 JSON 在命令行中手动指定任务，当使用 fetcher run-manual 运行时。

有关详细信息，请参阅 fetcher --help

要设置登录凭证，以保存模式运行 fetcher（fetcher save），然后跟一个服务名称，可以是以下这些之一

google-oauth2
twitter
telegram
email-password

完成提示后，您将能够自动使用这些服务而无需额外授权。

所有可用配置选项

注意：带有 X 的选项是互斥的，并且不打算同时运行，而是列出所有可用选项的用法。

注意：带有 O 的选项是可选的

disabled: true # O
read_filter_type: newer_than_read # XO. either: 
                                  # * keep only the last read entry and filter out all "older" than it
                                  # * notify when the entry is updated
read_filter_type: not_present_in_read_list # XO. keep a list of all items read and filter out all that are present in it
tasks:
  foo:
    tag: <string> # mark the message with a tag. That is usually a hashtag on top of the message or some kind of subscript in it. If a job has multiple tasks, it is automatically set to the task's name
    source:
      string: <string> # X. set the body of an entry to set string
      http: # X
        - <url> # get the contents of a web page
        - <url> # or several ones. Note: they have compatible contents and IDs to be able to work with the processing and read filtering logic. If they do not, just create a different task
        - post: # send a POST request (instead of a GET request)
            url: <url>
            body: <string> # with its body set to <string>
      twitter: # X
        - <twitter_handle> # get the feed of a tweeter account
        - <twitter_handle> # or several
      file: # X
        - <path> # get the contents of a file
        - <path> # or several
      reddit: # X
        <subreddit_name>:
          sort: <new|rising|hot> # X
          sort: # X
            top: <today|thisweek|thismonth|thisyear|alltime> 
          score_threshhold: <int> # O. Ignore posts with score lower than the threshhold
        <subreddit_name>:	# can be specified multiple times
          ...
      exec: # X
        - <cmd> # exec this command and use its output
        - <cmd> # or several commands
      email: # X
        auth: <google_oauth2|password> # how to authenticate with the IMAP server. `password` is insecure. `google_oauth2` can only be used with Gmail
        imap: <url> # URL of the IMAP server. Used only with `auth: password`. With `auth: google_oauth2` `imap.gmail.com` is used automatically
        email: <address> # email address to authenticate with
        filters: # O
          sender: <email_address>  # O. Ignore all email not sent from this address
          subjects: # O
            - <string> # ignore all emails not containing this string
            - <string> # or several
          exclude_subjects: # O
            - <string> # ignore all emails containing this string
            - <string> # or several
        view_mode: <read_only|mark_as_read|delete>  # how to view the inbox.
                                                    # * read_only: doesn't modify the inbox in any way (but will get the same emails over and over again with no way to check which are read. Should be used with a `read_filter`)
                                                    # * mark_as_read: mark read emails as read
                                                    # * delete: move the emails to the trash bin. Exact behavior depends on the email provider in question. Gmail archives the emails by default instead
    process:  # all actions are optional, so don't need to be marked with O
      - read_filter # filter out already read entries using `read_filter_type` stradegy
      - take: # take `num` entries from either the newest or the oldest and ignore the rest
          <from_newest|from_oldest>: <int>
      - contains: # filter out all entries that don't match
          <field>: <regex> # regular expression to match the contents of the <field> against
          <field>: <regex> # can be specified several times
      - feed # parse the entries as an RSS/Atom feeds
      - html: # parse the entries as HTML. All queries use the same format, except for `item_query`
          item: # O. Item is a unit of information. For example, articles in a blog or goods in an online store search are items. If the entire page is the "item", then this should be ignored
            query:
              # either of tag|class|attr can be used any number of times. They specify a narrowing down traversal of the HTML that specifies an item. Refer to [docs.rs of ElementDataQuery](https://docs.rs/fetcher-core/latest/fetcher_core/action/transform/entry/html/query/struct.ElementDataQuery.html) for more details
              - tag: <string>
              - class: <string>
              - attr:
                  <attr>: <value> # look for match of `<html_attribute>` inside `<attr>`, i.e. `href: THIS` will match for the contents of <a href="THIS">foo</a>
                ignore: # any of these can also include an `ignore` field that, in case several HTML tags matched the query, will ignore ones that match ~this~ ignore query
                  - tag: <string>
                  - class: <string>
                  - attr:
                      <attr>: <value>
          title: # O. A query to get the title of the entry from. Seaches inside the item found in "item query" if it set, the entire page otherwise
            optional: <bool> # defines what happens when this query doesn't match anything. if 'true', the title should be left empty, if 'false', the entire task will fail. `false` by default
            query:
              ... # the same as `item` above
            data_location: text # X. Where to exact data from. `text` extracts the text of an HTML tag, i.e. `<a href="https://example.com">THIS</a>`
            data_location: 
              attr: <string> # X. While `attr` extracts the contents of the attribute, i.e. `attr: href` extracts `<a href="THIS">and not this!</a>`
            regex: # O. match the resulting data got from this query against a regex
              re: <regex> # the regex to match against
              replace_with: <string> # replace the matched regex with this string. Supports referencing capture groups from `re`.
              # Example
              #   regex:
              #     re: '/.*/.*'
              #     replace_with: `Hello, $1!`
              # This regex extracts the data from `/HERE/not here/or here` and replaces the entire title with "Hello, HERE!"
          text: # O. Query for the main content of the message. 
            - ... # Same as `title` but is an array. This makes it possible to extract text from several different places and concatenate it into a single message body.
            - ...
          id: # O. Query for the ID of the item.
            ... # same as `title`
          link: # O. Query for the URL of the item. The entry 
            ... # same as `title`
          img: # O. "Query for the attached pictures of the item.
            ... # same as `title`
      - http # fetch a page from the link field of the message. Allows recursive web parsing.
      - json: # very similar to `html`
          item: # O. "Item query". Item is a unit of information. For example, articles in a blog or goods in an online store search are items. If the entire JSON is the "item", then this should be ignored
            query: # query that should be matched one by one to traverse the JSON and find the item
              - <string> # matches a JSON key
              - <int> # matches an item of an array or a map
          title: # O. A query to get the title of the entry from. Seaches inside the item found in "item query" if it set, the entire JSON otherwise
            optional: <bool> # defines what happens when this query doesn't match anything. if 'true', the title should be left empty, if 'false', the entire task will fail. `false` by default
            query:
              ... # the same as `itemq.query` above
            regex: # O. match the resulting data got from this query against a regex
              re: <regex> # the regex to match against
              replace_with: <string> # replace the matched regex with this string. Supports referencing capture groups from `re`.
              # Example
              #   regex:
              #     re: '/.*/.*'
              #     replace_with: `Hello, $1!`
              # This regex extracts the data from `/HERE/not here/or here` and replaces the entire title with "Hello, HERE!"
          text: # O. "Text query". Query for the main content of the message. 
            - ... # Same as `title` but is an array. This makes it possible to extract text from several different places and concatenate it into a single message body.
            - ...
          id: # O. "ID query". Query for the ID of the item.
            ... # same as `title`
          link: # O. "Link query". Query for the URL of the item. The entry 
            ... # same as `title`
          img: # O. "Image query". Query for the attached pictures of the item.
            ... # same as `title`
      - use:  # copy the data of a field to a different field of a message
          <field>:  # the field to copy the data from
            as: <field> # the field to copy the data to
          <field>:  # can be specified multiple times
            as: <field>
        # Example: 
        #   use:
        #     title:
        #       as: body
        # This will use the title of the message as the body of the message, i.e. they will be the same
      - set: # set a field to a specified string
          <field>: <string> # set <field> to <string>
          <field>: <string> # can be specified multiple times
          <field>: 
            - <string> # or even as an array, in which case it will choose a random one each time
            - <string>
      - shorten: # limit the length of a field to a specified maximum amount of charachers
          <field>: <int> # limit <field> to <int> max charachers
          <field>: <int> # can be specified multiple times
      - trim: <field> # remove leftover whitespace to the left and to the right of every line in the <field>
      - replace: # replace the contents of a field
          re: <regex> # replace the first regex match
          field: <field> # in the field
          with: <string> # with this string
      - extract: # extract text using a regex
          from_field: <field> # extract text from this field and replace the contents of the field with it
          re: <regex> # the regex that specifies capture groups that will be concatenated and become the new contents of the field
          passthrough_if_not_found: <bool> # what to do if the regex didn't match. If `true`, the value of the field `from_field` should remain the same, if `false`, the task will be aborted
      - remove_html: # remove any HTML tags in <field> and trim any remaining whitespace
          in: <field> # X. either in one field
          in:         # X. or in several at once
            - <field>
            - <field>
      # debug related actions:
      - caps # make the message title uppercase
      - debug_print # debug print the entire contents of the entry
sink:
  discord: # X. Send as a discord message
    user: <user_id> # X. The user to DM to. This is not a handle (i.e. not User#1234) but rather the ID (see below). 
    channel: <channel_id> # X. The channel to send messages to
    # The ID of a user or a channel can be gotten after enabling developer settings in Discord (under Settings -> Advanced) and rightclicking on a user/channel and selecting "Copy ID"
  telegram: # X
    chat_id: <chat_id>  # Either the private chat (group/channel) ID that can be gotten using bots or the public handle of a chat. DM aren't supported yet.
    link_location: <prefer_title|bottom>  # O. Where to put the link. Either as try to put it in the title if it's present, or a separate "Link" button under the message
  exec: <cmd> # X. Start a process and write the body of the message to its stdin
  stdout # X. Just print to stdout. Isn't really useful but it is the default when run with --dry-run

依赖项

~43–60MB
~1M SLoC