Introduction to Polars for 100 GB data processing
With the rapid advancement of AI and ML, ever-larger datasets need to be preprocessed. Pandas is the default library for data preprocessing, but it has limitations when handling large datasets. Polars, by contrast, is well suited to complex and large data, and its GPU support makes it a strong choice for massive datasets.
In this guide, we will cover why to use Polars, how to set up the Polars GPU engine, how to run advanced SQL queries, and how to do visualization with the Polars library.
Why use Polars?
Polars is a fast DataFrame library powered by an OLAP query engine, designed for efficient data handling on a single machine. Its query engine can use NVIDIA GPUs for higher performance through its GPU engine (powered by RAPIDS cuDF).
Designed to make processing 10–100+ GB of data feel interactive with just a single GPU, this engine is built directly into the Polars Lazy API: pass engine="gpu" to the collect operation.
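For example, a minimal sketch, assuming a GPU-enabled Polars install and a local Parquet file:

import polars as pl

# Any lazy query can run on the GPU by passing engine="gpu" to collect()
df = pl.scan_parquet("transactions.parquet").head().collect(engine="gpu")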
Setting up Polars GPU engine
To get started, you need Polars version 1.5 or later installed on your computer.
To use the built-in data visualization capabilities of Polars, you’ll need to install a few additional dependencies. We’ll also install pynvml to help us determine which dataset size to use.
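A sketch of the install commands, assuming a CUDA-capable environment (the GPU extra is published on NVIDIA's package index; exact extras may vary by release):

# Polars with the GPU engine (RAPIDS cuDF backend)
!pip install polars[gpu] --extra-index-url=https://pypi.nvidia.com

# Plotting dependencies used later, plus pynvml for querying GPU memory
!pip install hvplot pynvml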
Data loading and testing with CPU vs GPU
Loading data: We are using a 22 GB Kaggle dataset. To speed up the download, we will fetch a copy of this dataset from a GCS bucket hosted by NVIDIA; this should take about 30 seconds.
import pynvml

# Query the first GPU's name and total memory to decide which dataset copy to download
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print(pynvml.nvmlDeviceGetName(handle))
mem = pynvml.nvmlDeviceGetMemoryInfo(handle).total / 1e9  # total GPU memory in GB

if mem < 24:
    # GPUs with less than 24 GB of memory get a smaller copy of the dataset
    !wget https://storage.googleapis.com/rapidsai/polars-demo/transactions-t4-20.parquet -O transactions.parquet
else:
    !wget https://storage.googleapis.com/rapidsai/polars-demo/transactions.parquet -O transactions.parquet

# Rainfall dataset used later for the multi-dataset join and visualization
!wget https://storage.googleapis.com/rapidsai/polars-demo/rainfall_data_2010_2020.csv
Now, to read the Parquet file, we import the libraries and look at the schema of the dataset.
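A minimal sketch, assuming the file downloaded above sits in the working directory:

import polars as pl

# Lazily scan the Parquet file; nothing is read into memory until collect() runs
transactions = pl.scan_parquet("transactions.parquet")

# Inspect the schema without materializing any data
transactions.collect_schema()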
Reducing data processing time
Polars can switch between the CPU and GPU engines, so you can run a small query on the CPU and use the GPU engine for a complex one. We can observe the difference in time taken by the two engines.
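Below is a minimal sketch of this comparison, assuming the transactions frame scanned above; the aggregation mirrors the SQL query used later, and pl.GPUEngine configures the GPU engine explicitly:

# raise_on_fail=True raises an error instead of silently falling back to the CPU
gpu_engine = pl.GPUEngine(device=0, raise_on_fail=True)

query = (
    transactions
    .group_by("CUST_ID")
    .agg(pl.col("AMOUNT").sum().alias("sum_amt"))
    .sort("sum_amt", descending=True)
    .head(5)
)

%time query.collect()                    # default CPU engine
%time query.collect(engine=gpu_engine)   # GPU engine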
As we can see, the CPU run takes 7.22 seconds of wall time, whereas the GPU-accelerated run returns a result in only 497 milliseconds, i.e., about a 93% reduction in processing time (roughly a 14x speedup).
Advanced Use: SQL Queries and Multiple Datasets
Polars also supports SQL-like queries, making it easy for users familiar with SQL to perform complex analyses without switching languages. You can also work with multiple datasets, performing joins and group-by operations, and see even more pronounced speedups on GPUs (a join example appears in the visualization section below).
query = """
SELECT CUST_ID, SUM(AMOUNT) as sum_amt
FROM transactions
GROUP BY CUST_ID
ORDER BY sum_amt desc
LIMIT 5
"""
%time pl.sql(query).collect()
%time pl.sql(query).collect(engine=gpu_engine)
Visualization with Polars
Polars also integrates with plotting libraries such as hvPlot, so once the heavy preprocessing is accelerated on the GPU, you can visualize large datasets quickly and efficiently.
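The plot below assumes a DataFrame res that combines monthly transaction totals with the rainfall data. A hypothetical sketch of how res might be built, assuming the transactions data carries YEAR and MONTH columns matching the rainfall CSV:

rainfall = pl.scan_csv("rainfall_data_2010_2020.csv")

# Hypothetical join: aggregate transactions per month, then attach rainfall readings
res = (
    transactions
    .group_by("YEAR", "MONTH", "EXP_TYPE")
    .agg(pl.col("AMOUNT").sum())
    .join(rainfall, on=["YEAR", "MONTH"])
    .sort("YEAR", "MONTH")
    .collect(engine=gpu_engine)
)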
import hvplot.polars  # registers the .hvplot accessor on Polars DataFrames

(
    res
    .with_columns(
        # Build a proper date column from the year and month
        pl.date(pl.col("YEAR"), pl.col("MONTH"), 1).alias("date-month"),
        # Scale rainfall so both series are visible on a shared axis
        pl.col("Rainfall (inches)") * 100,
    )
    .hvplot.line(
        x="date-month", y=["AMOUNT", "Rainfall (inches)"],
        by=["EXP_TYPE"],
        rot=45,
    )
)
Final Thoughts
If you’re looking to speed up data processing and analysis, especially with very large datasets, try Polars with GPU support. Because it can switch between CPU and GPU seamlessly, you can work with large data while keeping setup complexity low. To learn more about the Polars GPU engine, visit https://rapids.ai/polars-gpu-engine/.