Polars: Rust data encoding library // Lib.rs

96 releases (42 breaking)

0.42.0	Aug 14, 2024
0.41.3	Jul 2, 2024
0.41.2	Jun 24, 2024
0.38.3	Mar 18, 2024
0.2.0	Jul 30, 2020

#12 in Encoding

173,289 downloads per month
Used in 232 crates (181 directly)

MIT license

6MB
139K SLoC

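Documentation: Python - Rust - Node.js - R | StackOverflow: Python - Rust - Node.js - R | User guide | Discord

Polars: Blazingly fast DataFrames in Rust, Python, Node.js, R, and SQL

Polars is a DataFrame interface on top of an OLAP query engine implemented in Rust, using the Apache Arrow columnar format as its memory model.

Lazy | eager execution
Multi-threaded
SIMD
Query optimization
Powerful expression API
Hybrid streaming (larger-than-RAM datasets)
Rust | Python | NodeJS | R | ...

To learn more, read the user guide.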

Python

>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "A": [1, 2, 3, 4, 5],
...         "fruits": ["banana", "banana", "apple", "apple", "banana"],
...         "B": [5, 4, 3, 2, 1],
...         "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
...     }
... )

# embarrassingly parallel execution & very expressive query language
>>> df.sort("fruits").select(
...     "fruits",
...     "cars",
...     pl.lit("fruits").alias("literal_string_fruits"),
...     pl.col("B").filter(pl.col("cars") == "beetle").sum(),
...     pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),
...     pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),
...     pl.col("A").reverse().over("fruits").alias("rev_A_by_fruits"),
...     pl.col("A").sort_by("B").over("fruits").alias("sort_A_by_B_by_fruits"),
... )
shape: (5, 8)
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ fruits   ┆ cars     ┆ literal_stri ┆ B   ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr ┆ sort_A_by_B │
│ ---      ┆ ---      ┆ ng_fruits    ┆ --- ┆ rs          ┆ uits        ┆ uits        ┆ _by_fruits  │
│ str      ┆ str      ┆ ---          ┆ i64 ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│          ┆          ┆ str          ┆     ┆ i64         ┆ i64         ┆ i64         ┆ i64         │
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           ┆ 4           │
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           ┆ 3           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           ┆ 5           │
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           ┆ 2           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           ┆ 1           │
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┴─────────────┘

SQL

>>> df = pl.scan_csv("docs/data/iris.csv")
>>> ## OPTION 1
>>> # run SQL queries on frame-level
>>> df.sql("""
... SELECT species,
...   AVG(sepal_length) AS avg_sepal_length
... FROM self
... GROUP BY species
... """).collect()
shape: (3, 2)
┌────────────┬──────────────────┐
│ species    ┆ avg_sepal_length │
│ ---        ┆ ---              │
│ str        ┆ f64              │
╞════════════╪══════════════════╡
│ Virginica  ┆ 6.588            │
│ Versicolor ┆ 5.936            │
│ Setosa     ┆ 5.006            │
└────────────┴──────────────────┘
>>> ## OPTION 2
>>> # use pl.sql() to operate on the global context
>>> df2 = pl.LazyFrame({
...    "species": ["Setosa", "Versicolor", "Virginica"],
...    "blooming_season": ["Spring", "Summer", "Fall"]
... })
>>> pl.sql("""
... SELECT df.species,
...     AVG(df.sepal_length) AS avg_sepal_length,
...     df2.blooming_season
... FROM df
... LEFT JOIN df2 ON df.species = df2.species
... GROUP BY df.species, df2.blooming_season
... """).collect()

You can also run SQL commands directly from your terminal using the Polars CLI:

# run an inline SQL query
> polars -c "SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/data/iris.csv') GROUP BY species;"

# run interactively
> polars
Polars CLI v0.3.0
Type .help for help.

> SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/data/iris.csv') GROUP BY species;

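Refer to the Polars CLI repository for more information.

Performance 🚀🚀

Blazingly fast

Polars is very fast. In fact, it is one of the best-performing solutions available. See the TPC-H benchmark results.

Lightweight

Polars is also very lightweight. It comes with zero required dependencies, and this shows in the import times: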

polars: 70ms
numpy: 104ms
pandas: 520ms
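
Import times like these depend on the machine and the installed versions. As a rough sketch of how one might reproduce the comparison locally, the script below times each import in a fresh interpreter and subtracts interpreter startup. The helper names `run_seconds` and `measure_import_seconds` are hypothetical and not part of Polars, and packages that are not installed are skipped.

```python
import importlib.util
import subprocess
import sys
import time


def run_seconds(code: str) -> float:
    """Wall-clock time to run `code` in a fresh Python interpreter."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code], check=True)
    return time.perf_counter() - start


def measure_import_seconds(module_name: str) -> float:
    """Approximate import time, net of interpreter startup (hypothetical helper)."""
    baseline = run_seconds("pass")
    total = run_seconds(f"import {module_name}")
    # Timing noise can make the difference slightly negative; clamp at zero.
    return max(total - baseline, 0.0)


for name in ("polars", "numpy", "pandas"):
    if importlib.util.find_spec(name) is None:
        print(f"{name}: not installed, skipped")
        continue
    print(f"{name}: {measure_import_seconds(name) * 1000:.0f}ms")
```

A single run is noisy; repeating the measurement and taking the minimum would give a steadier number.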

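Handles larger-than-RAM data

If you have data that does not fit into memory, Polars' query engine can process your query (or parts of it) in a streaming fashion. This drastically reduces memory requirements, so you might be able to process a 250GB dataset on your laptop. Call collect(streaming=True) to run the query in streaming mode. (This might be a little slower, but it is still very fast!)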

Setup

Python

Install the latest Polars version with:

pip install polars

We also have a conda package (conda install -c conda-forge polars); however, pip is the preferred way to install Polars.

Install Polars with all optional dependencies:

pip install 'polars[all]'

You can also install a subset of the optional dependencies:

pip install 'polars[numpy,pandas,pyarrow]'

See the user guide for more details on optional dependencies.

To see the current Polars version and a full list of its optional dependencies, run:

pl.show_versions()

Releases currently happen quite often (weekly or every few days), so updating Polars regularly to get the latest bug fixes and features is a good idea.

Rust

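You can take the latest release from crates.io, or, if you want the newest features and performance improvements, point to the main branch of this repository: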

polars = { git = "https://github.com/pola-rs/polars", rev = "<optional git tag>" }

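Requires Rust version >=1.71.

Contributing

Want to contribute? Read our contributing guide.

Python: compile Polars from source

If you want a bleeding-edge release or maximal performance, you should compile Polars from source.

This can be done by going through the following steps in sequence:

Install the latest Rust compiler
Install maturin: pip install maturin
cd py-polars and choose one of the following:
- make build-release, fastest binary, very long compile times
- make build-opt, fast binary with debug symbols, long compile times
- make build-debug-opt, medium-speed binary with debug assertions and symbols, medium compile times
- make build, slow binary with debug assertions and symbols, fast compile times
Append -native (e.g. make build-release-native) to enable further optimizations specific to your CPU. Note that this produces a non-portable binary/wheel.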

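Note that the Rust crate implementing the Python bindings is called py-polars to distinguish it from the wrapped Rust crate polars itself. However, both the Python package and the Python module are named polars, so you can pip install polars and import polars.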

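Use custom Rust functions in Python

Extending Polars with UDFs (user-defined functions) compiled in Rust is easy. We expose PyO3 extensions for the DataFrame and Series data structures. See https://github.com/pola-rs/pyo3-polars for more information.

Going big...

Do you expect more than 2^32 (~4.2 billion) rows? Compile Polars with the bigidx feature flag or, for Python users, install pip install polars-u64-idx.

Don't use this unless you hit the row boundary, as the default build of Polars is faster and consumes less memory.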
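
The row boundary is simple arithmetic. The sketch below spells it out; the constant `DEFAULT_MAX_ROWS` and the helper `needs_big_index` are hypothetical names used only for illustration and are not part of the Polars API.

```python
# Row limit quoted above for the default build (2^32, about 4.2 billion).
DEFAULT_MAX_ROWS = 2**32


def needs_big_index(expected_rows: int) -> bool:
    """Return True when the expected row count exceeds the 2^32 boundary.

    Hypothetical helper: it only encodes the rule of thumb from the text.
    """
    return expected_rows > DEFAULT_MAX_ROWS


print(f"{DEFAULT_MAX_ROWS:,}")
for rows in (1_000_000, 5_000_000_000):
    build = "polars-u64-idx" if needs_big_index(rows) else "polars"
    print(f"{rows:,} rows -> {build}")
```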

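Legacy

Do you want Polars to run on an old CPU (e.g. one dating from before 2011), or on an x86-64 build of Python on Apple Silicon under Rosetta? Install pip install polars-lts-cpu. This version of Polars is compiled without AVX target features.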
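
If you are unsure whether a machine supports AVX, one rough way to check on Linux is to read the CPU flags the kernel reports. This is a sketch under assumptions: it relies on the `/proc/cpuinfo` format used on x86 Linux, the helper `avx_flags` is a hypothetical name, and it only lists AVX-family flags without deciding which build you need.

```python
from pathlib import Path


def avx_flags(cpuinfo_text: str) -> set[str]:
    """Collect AVX-family flags from /proc/cpuinfo-style text (hypothetical helper)."""
    found: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 Linux lists CPU capabilities on lines whose key is "flags".
        if key.strip() == "flags":
            found.update(flag for flag in value.split() if flag.startswith("avx"))
    return found


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():
    flags = avx_flags(cpuinfo.read_text())
    print(sorted(flags) if flags else "no AVX flags reported; consider polars-lts-cpu")
else:
    print("/proc/cpuinfo not available; check the CPU model manually")
```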

Sponsors

Dependencies

~8–47MB
~719K SLoC