Switch from pandas to Polars to Speed Up ETL

Using Polars with its optional GPU engine (powered by RAPIDS cuDF) can further reduce transformation time in ETL pipelines.
Overview
Why pandas slows down
Why Polars is faster
Controlling performance in Polars
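By default, Polars uses all available CPU cores. If you're running in a shared environment or want to limit resource usage, set the POLARS_MAX_THREADS environment variable before Polars is first imported:
import os
# Must be set before the first `import polars`; the thread pool is sized at import time.
os.environ["POLARS_MAX_THREADS"] = "4"
import polars as pl
# Confirm the size of the thread pool Polars actually created.
print(pl.thread_pool_size())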
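On a shared machine, a hard-coded "4" may not suit every host. As a minimal sketch, the cap could instead be derived from the number of cores the process can see. The helper name `cap_polars_threads` and the 50% default are illustrative assumptions, not part of Polars.

```python
import os


def cap_polars_threads(fraction: float = 0.5) -> int:
    """Set POLARS_MAX_THREADS to a share of the visible cores and return the cap.

    Hypothetical helper: the name and the default fraction are assumptions.
    """
    total = os.cpu_count() or 1          # cpu_count() can return None
    cap = max(1, int(total * fraction))  # never drop below one thread
    os.environ["POLARS_MAX_THREADS"] = str(cap)
    return cap


# Call this before the first `import polars` anywhere in the process.
cap = cap_polars_threads(0.5)
print(f"Polars thread cap set to {cap}")
```

Because the variable is only read when Polars is first imported, the call has to come before any module that imports Polars.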
Working examples
Using pandas:
import pandas as pd
df = pd.read_csv("large_dataset.csv")
df["amount"] = df["amount"].fillna(0)
grouped = df.groupby("category")["amount"].sum()
print(grouped)
Using Polars:
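import polars as pl
# Read the CSV eagerly into a Polars DataFrame.
df = pl.read_csv("large_dataset.csv")
# Replace missing amounts with 0.
df = df.with_columns([
    pl.col("amount").fill_null(0)
])
# Current Polars releases spell this group_by; the older groupby name was removed.
grouped = df.group_by("category").agg([
    pl.col("amount").sum()
])
print(grouped)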
When to switch
Conclusion
References
Polars Team. (2024). Performance Tips. Polars User Guide. Retrieved from https://pola-rs.github.io/polars-book/user-guide/performance/