#web-scraping #http #web #robots-txt #honeypot #markov-chain #crawler

app pandoras_pot

A honeypot designed to send huge amounts of data to rude web crawlers

26 releases (5 breaking)

0.6.3 Aug 5, 2024
0.6.2 Jul 27, 2024
0.5.5 Jun 23, 2024
0.5.4 Mar 23, 2024

#104 in Network programming


531 downloads per month

AGPL-3.0-only

66KB
1K SLoC

🔥pandoras_pot🍯

Unleash unfathomable curses on unsuspecting bots... in Rust!


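Summary

Inspired by HellPot, pandoras_pot is an HTTP honeypot that aims to bring even more misery to unruly web crawlers that do not respect your robots.txt.

The goal of pandoras_pot is to send as much data as possible to unwanted incoming connections without exhausting the resources of your web server, which can probably do more meaningful things with them.

To make sure bots do not detect pandoras_pot, it generates random data that looks like a website (to a bot), really fast. Like crazy fast. Blazingly fast, even. Hopefully.

pandoras_pot supports multiple generation modes, selected through its configuration. For example, it can generate random strings as data, or use a Markov chain to generate "real" sentences. Cool!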
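To make the Markov chain mode a bit more concrete, here is a minimal sketch of a word-level chain using only the standard library. It is not the generator pandoras_pot actually ships; `XorShift`, `build_chain`, and `generate` are hypothetical names, and the tiny xorshift PRNG only stands in for a real random number source.

```rust
use std::collections::HashMap;

/// Tiny xorshift64 PRNG so the sketch needs no external crates (assumption:
/// a real generator would use a proper RNG). The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Maps each word to the list of words that followed it in the source text.
fn build_chain(text: &str) -> HashMap<&str, Vec<&str>> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut chain: HashMap<&str, Vec<&str>> = HashMap::new();
    for pair in words.windows(2) {
        chain.entry(pair[0]).or_default().push(pair[1]);
    }
    chain
}

/// Walks the chain from `start`, emitting at most `len` words. Stops early
/// if the current word never had a follower in the source text.
fn generate(
    chain: &HashMap<&str, Vec<&str>>,
    start: &str,
    len: usize,
    rng: &mut XorShift,
) -> String {
    let mut out = vec![start.to_string()];
    let mut current = start.to_string();
    while out.len() < len {
        match chain.get(current.as_str()) {
            Some(followers) => {
                let idx = (rng.next_u64() % followers.len() as u64) as usize;
                current = followers[idx].to_string();
                out.push(current.clone());
            }
            None => break,
        }
    }
    out.join(" ")
}

fn main() {
    let source = "the bot reads the page and the page feeds the bot";
    let chain = build_chain(source);
    let mut rng = XorShift(0x2545_F491_4F6C_DD1D);
    println!("{}", generate(&chain, "the", 12, &mut rng));
}
```

Because followers are stored with repetition, words that follow each other often in the source text are picked more often, which is what makes the output look sentence-like.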

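Features

  • Blazingly fast
  • Written in Rust
  • TOML configuration format, see the example below (but the defaults work without any config!)
  • Optional health port for reverse proxy health checks
  • Multiple generation modes, and it is easy to add more! Send plain random data, text generated with a Markov chain, or a static file!
  • Configurable abuse protection (max number of concurrent generating connections, time and size limits)
  • Did I mention it is written in Rust?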

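Setup

Web and reverse proxy

The most likely use case is to run another server as a reverse proxy and pick some paths that should be forwarded to pandoras_pot, such as /wp-login.php, /.git/config, and /.env.

Note that the URIs you use should be set to Disallow in your /robots.txt, otherwise you might get into trouble with things like googlebot, which will hate your weird "page of death". For the paths above, you might have a robots.txt that looks like this: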

User-agent: *
Disallow: /wp-login.php
Disallow: /.git
Disallow: /.env
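As a rough way to sanity-check that every honeypot path is covered by such a file, the sketch below does plain prefix matching against the `Disallow` rules listed under `User-agent: *`. It is a simplification: it ignores `Allow` rules, wildcards, and agent-specific groups, and `disallowed_prefixes` and `is_disallowed` are hypothetical helper names, not part of pandoras_pot.

```rust
/// Collects the `Disallow` prefixes that apply to all user agents (`*`).
/// Simplified on purpose: no `Allow`, no wildcards, no grouped agents.
fn disallowed_prefixes(robots_txt: &str) -> Vec<String> {
    let mut applies = false;
    let mut prefixes = Vec::new();
    for line in robots_txt.lines() {
        if let Some((key, value)) = line.trim().split_once(':') {
            let value = value.trim();
            match key.trim().to_ascii_lowercase().as_str() {
                "user-agent" => applies = value == "*",
                "disallow" if applies && !value.is_empty() => {
                    prefixes.push(value.to_string())
                }
                _ => {}
            }
        }
    }
    prefixes
}

/// True if `path` starts with any of the disallowed prefixes.
fn is_disallowed(prefixes: &[String], path: &str) -> bool {
    prefixes.iter().any(|p| path.starts_with(p.as_str()))
}

fn main() {
    let robots = "User-agent: *\nDisallow: /wp-login.php\nDisallow: /.git\nDisallow: /.env\n";
    let prefixes = disallowed_prefixes(robots);
    for path in ["/wp-login.php", "/.git/config", "/.env", "/index.html"] {
        println!("{path}: disallowed = {}", is_disallowed(&prefixes, path));
    }
}
```

Any path you forward to pandoras_pot should report `disallowed = true` here; a path that does not is one a well-behaved crawler could legitimately stumble into.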

Common reverse proxies include nginx, httpd (Apache), and Caddy.

In Caddy, you can add the following to match the /robots.txt we created above:

(pandorust) {
    @pandorust_paths {
        path /wp-login.php /.git* /.env*
    }
    handle @pandorust_paths {
        reverse_proxy localhost:6669 # Or whatever you run pandoras_pot on
    }
}

# ...

example.com {
    # ...
    # Your actual website
    # ...

    import pandorust
}

Then you can simply run (if you installed it with cargo install pandoras_pot)

pandoras_pot

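Done!

Using Docker

The easiest way to set up pandoras_pot is with docker. You can pass a config file using docker's --build-arg CONFIG=<path to your config> flag (but the file must be available in the build context).

First, clone the repository by running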

git clone git@github.com:ginger51011/pandoras_pot.git
cd pandoras_pot

Then you can build an image and deploy it, here named and tagged pandoras_pot, and make it available on port 6669 of localhost

docker build -t pandoras_pot . # You can add --build-arg CONFIG=<...> here
docker run --name=pandoras_pot --restart=always -p 6669:8080 -d pandoras_pot

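systemd service

You can also easily set up a systemd service. This requires you to install Rust, but is lighter than running a docker image and makes reloading the config easier. In this example I will set up a new user, pandora-user, but you can use any user you want (though we will lock down pandora-user).

Note: apart from cloning and building pandoras_pot, most commands here require root privileges.

First, clone the repository and build pandoras_pot (after installing Rust)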

git clone git@github.com:ginger51011/pandoras_pot.git
cd pandoras_pot
cargo build --release

# Move the binary to a better place
cp ./target/release/pandoras_pot /usr/bin/

Then we create a user that will run the process; this user is not root and cannot even log in

adduser --disabled-password --gecos '' --shell /sbin/nologin --no-create-home --home /iamadirandidontexist 'pandora-user'

Then we create a directory to hold our config (and data files for some generators, etc.)

mkdir /etc/pandoras_pot

# Ensure the config file exists; you can copy the default one in this README
# into this file
touch /etc/pandoras_pot/config.toml

# Optionally you can create your data file here. You need to point to it from
# the config.

# Make pandora-user the owner of this dir
chown -R pandora-user:pandora-user /etc/pandoras_pot

Now we create the actual service. If you have followed the examples here, you can copy-paste this directly into a new file at /etc/systemd/system/pandorad.service

[Unit]
Description=Pandora's Pot "service"
After=network.target
StartLimitIntervalSec=0

[Service]
# Change to another user/group if needed
User=pandora-user
Group=pandora-user

Restart=always
RestartSec=1

WorkingDirectory=/etc/pandoras_pot/

# Requires that the file /etc/pandoras_pot/config.toml exists; you can also
# remove config.toml to use plain default settings.
ExecStart=/usr/bin/pandoras_pot config.toml

###
## Hardening; this is optional and can be commented out, but is generally
## good practice. Some might prevent pandoras_pot from functioning, see below.
##
## Other settings may exist and be suitable.
##
## For more info, see systemd.exec(5)
##
MemoryDenyWriteExecute=yes
NoNewPrivileges=yes
PrivateDevices=yes
PrivateTmp=yes
PrivateUsers=yes
ProtectClock=yes
ProtectControlGroups=yes
ProtectHostname=yes
ProtectKernelLogs=yes
ProtectKernelModules=yes
ProtectKernelTunables=yes
RestrictNamespaces=yes
RestrictSUIDSGID=yes

# These might prevent pandoras_pot from writing to a log file if ReadWritePaths is misconfigured.
ProtectHome=yes
ProtectSystem=strict

# This should point to the output log file; this is the default value.
# It should be the same as `logging.output_path` in the config.toml.
# A sane alternative is `/var/log/pandoras.log`.
ReadWritePaths=/etc/pandoras_pot/pandoras.log

##
## End of hardening
###

[Install]
WantedBy=multi-user.target

Then you need to reload the systemd daemon, and enable and start your service

systemctl daemon-reload
systemctl enable pandorad.service
systemctl start pandorad.service

You can check that everything is working

systemctl status pandorad.service

Done!

Configuration

pandoras_pot uses toml as its configuration format. If you are not using docker, you can pass the config as an argument like this

pandoras_pot <path-to-config>

or place it in a file at $HOME/.config/pandoras_pot/config.toml.

Below is an example file

[http]
# Make sure this matches your Dockerfile's "EXPOSE" if using Docker
port = "8080"
# Routes to send misery to. Is overridden by `http.catch_all`
routes = ["/wp-login.php", "/.env"]
# If all routes are to be served.
catch_all = true
# How many connections that can be made over `http.rate_limit_period` seconds. Will
# not set any limit if set to 0.
rate_limit = 0
# Amount of seconds that `http.rate_limit` checks on. Does nothing if rate limit is set
# to 0.
rate_limit_period = 300 # 5 minutes
# Enables `http.health_port` to be used for health checks (to see if
# `pandoras_pot` is running). Useful if you want to use your chad gaming PC
# that might not always be up and running to back up an instance running on
# your RPi 3 web server.
health_port_enabled = false
# Port to be used for health checks. Should probably not be accessible from the
# outside. Has no effect if `http.health_port_enabled` is `false`.
health_port = "8081"
# The `Content-Type` header set in responses.
content_type = "text/html; charset=utf-8"

[generator]
# The size of each generated chunk in bytes. Has a big impact on performance, so
# play around a bit! Note that if this is set too low (like 10 bytes), `pandoras_pot`
# will refuse to run.
chunk_size = 16384 # 1024 * 16
# The type of generator to be used
type = { name = "random" }

# For generator.type it is also possible to set a markov chain generator, using
# a text file as a source of data. Then you can use this (but uncommented, duh):
# type = { name = "markov_chain", data = "<path to some text file>" }

# Another alternative is a static generator, that always outputs the full contents
# of a file. Does not respect chunking.
# type = { name = "static", data = "<path to some file>" }

# The max amount of simultaneous generators that can produce output.
# Useful for preventing abuse. `0` means no limit.
max_concurrent = 100

# The amount of time in seconds a generator can be active before
# it stops sending. `0` means no limit.
time_limit = 0

# The amount of data in bytes that a generator can
# send before it stops sending. `0` means no limit.
size_limit = 0

# How many chunks should be buffered for each connection. Higher values mean
# more memory usage, but may lead to increased performance. Must be >= 1.
chunk_buffer = 20

# Prefix that will be used for the first message to an incoming connection.
# Usually used to set an HTML prefix. Can be set to "" to disable.
#
# Example usage: Set to "{" for a static generator using a JSON file to make
# output look like a valid stream of JSON that will eventually end (it won't).
prefix = "<!DOCTYPE html><html><body>"

[logging]
# Output file for logs.
output_path = "pandoras.log"

# If pretty logs should be written to standard output.
print_pretty_logs = true

# If no logs at all should be printed to stdout. Overrides other stdout logging
# settings.
no_stdout = false
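The "`0` means no limit" convention used by `generator.time_limit` and `generator.size_limit` can be illustrated with a small sketch. This is only an assumed reading of the comments in the example config, not pandoras_pot's actual code; the `Limits` struct and `should_stop` method are hypothetical.

```rust
use std::time::Duration;

/// Hypothetical mirror of `generator.time_limit` (seconds) and
/// `generator.size_limit` (bytes) from the example config.
struct Limits {
    time_limit_secs: u64,
    size_limit_bytes: u64,
}

impl Limits {
    /// Returns true once either limit has been reached.
    /// A limit of 0 is treated as "no limit" and never triggers.
    fn should_stop(&self, elapsed: Duration, bytes_sent: u64) -> bool {
        let timed_out =
            self.time_limit_secs != 0 && elapsed.as_secs() >= self.time_limit_secs;
        let too_big =
            self.size_limit_bytes != 0 && bytes_sent >= self.size_limit_bytes;
        timed_out || too_big
    }
}

fn main() {
    // Defaults from the example config: both limits disabled.
    let unlimited = Limits { time_limit_secs: 0, size_limit_bytes: 0 };
    println!("{}", unlimited.should_stop(Duration::from_secs(3600), u64::MAX));

    // Stop after 60 seconds or 1 GiB, whichever comes first.
    let capped = Limits { time_limit_secs: 60, size_limit_bytes: 1 << 30 };
    println!("{}", capped.should_stop(Duration::from_secs(5), 2 << 30));
}
```

With both values left at 0, a connection is only ever bounded by `generator.max_concurrent` and by the client giving up.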

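Measuring output

You can easily measure how fast your setup sends data using curl. Note that measuring against localhost may be unreliable, since it does not show what an outsider would see. A better option is probably to run the measurement from another machine.

This example assumes you have enabled http.catch_all; otherwise you should add a valid route.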

curl localhost:8080/ >> /dev/null
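If you would rather get a number than watch curl's progress meter, the same measurement can be sketched in Rust with the standard library only. `drain` and `mib_per_sec` are hypothetical helpers; the demo reads from `std::io::repeat` as a stand-in, and the assumption is that you would swap in a `std::net::TcpStream` connected to your instance after sending a plain HTTP GET request.

```rust
use std::io::{self, Read};
use std::time::{Duration, Instant};

/// Reads from `source` until at least `max_bytes` have arrived or the stream
/// ends, returning the number of bytes read and the elapsed time.
fn drain<R: Read>(mut source: R, max_bytes: u64) -> io::Result<(u64, Duration)> {
    let started = Instant::now();
    let mut buf = [0u8; 16 * 1024];
    let mut total = 0u64;
    while total < max_bytes {
        let n = source.read(&mut buf)?;
        if n == 0 {
            break; // End of stream.
        }
        total += n as u64;
    }
    Ok((total, started.elapsed()))
}

/// Converts a byte count and duration into MiB per second.
fn mib_per_sec(bytes: u64, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        0.0
    } else {
        bytes as f64 / (1024.0 * 1024.0) / secs
    }
}

fn main() {
    // Stand-in source. For a real measurement, replace this with a
    // `TcpStream` to your instance after writing an HTTP GET request to it.
    let source = io::repeat(b'x');
    let (bytes, elapsed) = drain(source, 64 * 1024 * 1024).expect("read failed");
    println!(
        "read {bytes} bytes in {elapsed:?} ({:.1} MiB/s)",
        mib_per_sec(bytes, elapsed)
    );
}
```

Capping the read with `max_bytes` matters here, since with the default `size_limit = 0` the server never stops sending on its own.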

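Support

I do not accept any donations. But if you find any of the software I write for fun useful, please consider donating to a highly effective charity that saves or improves the most lives per $CURRENCY spent.

GiveWell.org is an excellent site that helps you donate to the world's most effective charities. Alternatives that list the current best charities include Founders Pledge and, for animal welfare, Animal Charity Evaluators.

  • Residents of Sweden can make tax-deductible donations to GiveWell via Ge Effektivt.
  • Residents of Norway can do the same via Gi Effektivt.

This list is not exhaustive; your country may have an equivalent.

Dependencies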

~13–24MB
~337K SLoC