Introduction
Handling large datasets locally on a laptop is not easy. As a workaround, data scientists either work on a sample of the data or spin up Spark clusters, and Spark clusters don't come cheap. I recently migrated a data science project from Spark to in-memory Python computation using Polars and saved a few hundred dollars in weekly cloud costs. Yay!
Polars is much like pandas (it reads, manipulates, and writes datasets) but executes far faster in memory. I've been using pandas for several years, so learning Polars wasn't difficult. I found the Polars API quite intuitive, and as a data scientist you'll find it useful in day-to-day data analysis and machine learning projects that involve big datasets.
The transition from Polars back to pandas is also easy: calling .to_pandas() on a Polars dataframe gives you a pandas dataframe again.
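As a quick, minimal sketch of that interop (assuming pandas and pyarrow are installed):

import polars as pl
import pandas as pd

pl_df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Polars -> pandas (uses pyarrow under the hood)
pd_df = pl_df.to_pandas()

# pandas -> Polars
pl_df_again = pl.from_pandas(pd_df)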
Some key features of the polars library are:
- It is written in Rust and built on the Apache Arrow columnar format.
- It supports lazy execution of queries, which lets it handle datasets larger than the RAM on your laptop (see the sketch after this list).
- Its API is very similar to pandas and PySpark, so the learning curve is not at all steep.
- If you are more comfortable with SQL, you can register a Polars dataframe as a local SQL table and query it with ordinary SQL (also shown below).
- It is heavily optimised for parallel computation.
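Here is a minimal sketch of the lazy and SQL workflows mentioned above. The file path sales.csv and the column names are made-up placeholders, and SQLContext is only available in fairly recent Polars releases:

import polars as pl

# lazy execution: scan_csv only builds a query plan, nothing is read from disk
# until .collect() is called, so filters and projections get pushed down
lazy_query = (
    pl.scan_csv("sales.csv")              # placeholder file path
    .filter(pl.col("amount") > 100)
    .groupby("region")                    # called group_by in newer Polars versions
    .agg(pl.col("amount").sum())
)
result = lazy_query.collect()

# SQL: register the dataframe under a table name and query it with plain SQL
ctx = pl.SQLContext(sales=result)
ctx.execute("SELECT region, amount FROM sales ORDER BY amount DESC", eager=True)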
In this tutorial, I’ll show how to do some commonly used dataframe calculations and transformations using polars in Python.
Table of Contents
- How to get the count of NA values in a polars dataframe?
- How to run groupby functions?
- How to set the data types correctly?
- How to find the number of groups in a column?
- How to do one hot encoding in a polars dataframe?
- How to create categories/bins using a numeric column?
- How to create a pivot table?
- How to create a column where each row contains multiple values?
- How to convert dataframe columns into a python dict?
- How to convert a column containing lists into separate rows?
- How to perform string related functions using polars?
- How to perform datetime functions using polars?
- How to read parquet data from S3 with polars?
Let's organise the imports we require for this tutorial, and create a sample dataset that we'll use throughout.
import polars as pl
import boto3
import datetime
# set the configs
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_width_chars(100)
pl.Config.set_tbl_rows(200)
pl.Config.set_tbl_hide_dataframe_shape(True)
# create a polars dataframe
df = pl.DataFrame({
    "group": ["one", "one", "one", "two", "two", "three", "three"],
    "grade": ["1", "98", "2", "3", "99", "23", "11"],
    "score": [45, 49, 76, 83, 69, 90, 80],
    "date1": pl.date_range(datetime.date(2023, 1, 1), datetime.date(2023, 1, 7), eager=True),
    "date2": pl.date_range(datetime.date(2023, 2, 1), datetime.date(2023, 2, 7), eager=True),
    "rating": [4.5, 3.5, 4.0, 1.0, 1.2, 9.4, 9.1],
    "zone": [None, float('nan'), 1.0, 2.0, 4.0, 5.0, 1.0],
    "class": ['x1', None, None, 'c1', 'c2', 'x2', 'j1'],
})
df.head(5)
shape: (5, 8)

group | grade | score | date1 | date2 | rating | zone | class |
---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str |
"one" | "1" | 45 | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" |
"one" | "98" | 49 | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null |
"one" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null |
"two" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" |
"two" | "99" | 69 | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" |
df.glimpse()
Rows: 7
Columns: 8
$ group <str> one, one, one, two, two, three, three
$ grade <str> 1, 98, 2, 3, 99, 23, 11
$ score <i64> 45, 49, 76, 83, 69, 90, 80
$ date1 <date> 2023-01-01, 2023-01-02, 2023-01-03, 2023-01-04, 2023-01-05, 2023-01-06, 2023-01-07
$ date2 <date> 2023-02-01, 2023-02-02, 2023-02-03, 2023-02-04, 2023-02-05, 2023-02-06, 2023-02-07
$ rating <f64> 4.5, 3.5, 4.0, 1.0, 1.2, 9.4, 9.1
$ zone <f64> None, nan, 1.0, 2.0, 4.0, 5.0, 1.0
$ class <str> x1, None, None, c1, c2, x2, j1
df.describe()
shape: (9, 9)

describe | group | grade | score | date1 | date2 | rating | zone | class |
---|---|---|---|---|---|---|---|---|
str | str | str | f64 | str | str | f64 | f64 | str |
"count" | "7" | "7" | 7.0 | "7" | "7" | 7.0 | 7.0 | "7" |
"null_count" | "0" | "0" | 0.0 | "0" | "0" | 0.0 | 1.0 | "2" |
"mean" | null | null | 70.285714 | null | null | 4.671429 | NaN | null |
"std" | null | null | 17.182494 | null | null | 3.39986 | NaN | null |
"min" | "one" | "1" | 45.0 | "2023-01-01" | "2023-02-01" | 1.0 | 1.0 | "c1" |
"max" | "two" | "99" | 90.0 | "2023-01-07" | "2023-02-07" | 9.4 | 5.0 | "x2" |
"median" | null | null | 76.0 | null | null | 4.0 | 3.0 | null |
"25%" | null | null | 49.0 | null | null | 1.2 | 1.0 | null |
"75%" | null | null | 83.0 | null | null | 9.1 | 5.0 | null |
Q. How to get the count of NA values in a polars dataframe?
df.select(pl.all().is_null().sum())
shape: (1, 8)

group | grade | score | date1 | date2 | rating | zone | class |
---|---|---|---|---|---|---|---|
u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 |
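One thing to keep in mind: is_null() only counts null values. In Polars, a floating-point NaN (like the one in the zone column) is not null; as a quick sketch, you can count it separately with is_nan():

# nulls vs NaNs in the float column "zone": is_null() catches None/null,
# while is_nan() only applies to floating-point columns
df.select([
    pl.col("zone").is_null().sum().alias("zone_nulls"),
    pl.col("zone").is_nan().sum().alias("zone_nans"),
])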
Q. How to run groupby functions?
groupby is probably one of the most commonly used data manipulation functions. Polars offers the .groupby and .over functions to calculate aggregated metrics over a group.
# method 1: find the top row in each group after sorting by score (descending)
df.groupby("group").agg(pl.all().sort_by("score", descending=True).first())
shape: (3, 8)

group | grade | score | date1 | date2 | rating | zone | class |
---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str |
"one" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null |
"two" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" |
"three" | "23" | 90 | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" |
# method 2: keep the row with the maximum score in each group
df.filter(pl.col("score") == pl.col("score").max().over(["group"]))
shape: (3, 8)

group | grade | score | date1 | date2 | rating | zone | class |
---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str |
"one" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null |
"two" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" |
"three" | "23" | 90 | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" |
# calculate the total score over each group and add it as a column in the dataframe
# (note: .sum() is used here, so mean_over_grp actually holds the group total;
#  swap in .mean() if you want the group mean)
df = df.with_columns(mean_over_grp=pl.col("score").sum().over("group"))
df.head(5)
shape: (5, 9)

group | grade | score | date1 | date2 | rating | zone | class | mean_over_grp |
---|---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str | i64 |
"one" | "1" | 45 | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | 170 |
"one" | "98" | 49 | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | 170 |
"one" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | 170 |
"two" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | 152 |
"two" | "99" | 69 | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | 152 |
# convert a column into a list per group
# (inside .agg, pl.col("score") on its own also returns the group's values as a list)
df.groupby("group", maintain_order=True).agg(score_list=pl.col("score").apply(list))
shape: (3, 2)

group | score_list |
---|---|
str | list[i64] |
"one" | [45, 49, 76] |
"two" | [83, 69] |
"three" | [90, 80] |
# aggregate calculations for a group using multiple columns
df.groupby('group').agg(
mean_score = pl.col('score').mean(),
n_rows = pl.count(),
cls_c1_count = (pl.col('class') == 'c1').sum(),
min_date = pl.col('date1').min()
)
shape: (3, 5)

group | mean_score | n_rows | cls_c1_count | min_date |
---|---|---|---|---|
str | f64 | u32 | u32 | date |
"two" | 76.0 | 2 | 1 | 2023-01-04 |
"one" | 56.666667 | 3 | 0 | 2023-01-01 |
"three" | 85.0 | 2 | 0 | 2023-01-06 |
# method 1: calculate counts per group
df.groupby('group').count()
shape: (3, 2)

group | count |
---|---|
str | u32 |
"two" | 2 |
"one" | 3 |
"three" | 2 |
# method 2: calculate counts per group
df['group'].value_counts()
shape: (3, 2)

group | counts |
---|---|
str | u32 |
"two" | 2 |
"one" | 3 |
"three" | 2 |
Q. How to set the data types correctly?
The grade column holds integers, but they are encoded as strings in the dataframe. Let's fix the data type.
## convert string to integer
df = df.with_columns(pl.col('grade').cast(pl.Int16))
# similarly to convert an integer to string you can do:
df = df.with_columns(pl.col('score').cast(pl.Utf8))
df['grade'].head(3)
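Casting also covers dates. A minimal sketch for parsing a string column into a proper Date dtype (the date_str column here is a made-up example, not part of the sample dataframe):

# hypothetical example: parse ISO-formatted strings into a Date column
dates = pl.DataFrame({"date_str": ["2023-01-01", "2023-01-02"]})
dates = dates.with_columns(
    pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d").alias("date_parsed")
)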
Q. How to find the number of groups in a column?
In pandas, the ngroups attribute of a groupby object returns the number of groups. We can achieve the same thing with n_unique in polars.
# find unique groups in group col
df.select(pl.col("group").n_unique()).item()
# find unique groups in group and class together
df.select(pl.struct(["group", "class"]).n_unique()).item()
Q. How to do one hot encoding in a polars dataframe?
Converting a categorical column into several columns of 1s and 0s is a commonly used feature-engineering technique.
# create dummy (one hot encoded) columns from the group column;
# note: df is not reassigned here, so later examples can still use the original group column
df.to_dummies(columns=['group'])
shape: (7, 11)

group_one | group_three | group_two | grade | score | date1 | date2 | rating | zone | class | mean_over_grp |
---|---|---|---|---|---|---|---|---|---|---|
u8 | u8 | u8 | i16 | str | date | date | f64 | f64 | str | i64 |
1 | 0 | 0 | 1 | "45" | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | 170 |
1 | 0 | 0 | 98 | "49" | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | 170 |
1 | 0 | 0 | 2 | "76" | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | 170 |
0 | 0 | 1 | 3 | "83" | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | 152 |
0 | 0 | 1 | 99 | "69" | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | 152 |
0 | 1 | 0 | 23 | "90" | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" | 170 |
0 | 1 | 0 | 11 | "80" | 2023-01-07 | 2023-02-07 | 9.1 | 1.0 | "j1" | 170 |
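to_dummies can also encode several columns at once; a small sketch (the separator argument controls the delimiter between the original column name and the category value):

# one hot encode two columns in a single call
df.to_dummies(columns=["group", "class"], separator="_")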
Q. How to create categories/bins using a numeric column?
Here, we'd like to bucket a numeric column into bins based on percentiles within each group.
# note: this uses the older Series.qcut API (which returns a dataframe with a "category" column);
# newer Polars versions expose qcut directly as an expression, e.g. pl.col("score").qcut(...)
df = df.with_columns(
    score_bin=pl.col("score")
    .apply(
        lambda s: s.qcut(
            [0.2, 0.4, 0.6, 0.8], labels=["1", "2", "3", "4", "5"], maintain_order=True
        )["category"]
    )
    .over(["group"])
)
df
shape: (7, 10)

group | grade | score | date1 | date2 | rating | zone | class | mean_over_grp | score_bin |
---|---|---|---|---|---|---|---|---|---|
str | i16 | str | date | date | f64 | f64 | str | i64 | cat |
"one" | 1 | "45" | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | 170 | "1" |
"one" | 98 | "49" | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | 170 | "3" |
"one" | 2 | "76" | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | 170 | "5" |
"two" | 3 | "83" | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | 152 | "5" |
"two" | 99 | "69" | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | 152 | "1" |
"three" | 23 | "90" | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" | 170 | "5" |
"three" | 11 | "80" | 2023-01-07 | 2023-02-07 | 9.1 | 1.0 | "j1" | 170 | "1" |
Q. How to create a pivot table?
df.pivot(values="rating", index="score_bin", columns="group", aggregate_function="min")
shape: (3, 4)

score_bin | one | two | three |
---|---|---|---|
cat | f64 | f64 | f64 |
"1" | 4.5 | 1.2 | 9.1 |
"3" | 3.5 | null | null |
"5" | 4.0 | 1.0 | 9.4 |
Q. How to create a column where each row contains multiple values?
This is a common use case where we need to merge two columns into a single column whose rows each hold both values. In polars this is done with a struct.
df = df.with_columns(merged_cat = pl.struct(["group","score_bin"]))
df
shape: (7, 11)

group | grade | score | date1 | date2 | rating | zone | class | mean_over_grp | score_bin | merged_cat |
---|---|---|---|---|---|---|---|---|---|---|
str | i16 | str | date | date | f64 | f64 | str | i64 | cat | struct[2] |
"one" | 1 | "45" | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | 170 | "1" | {"one","1"} |
"one" | 98 | "49" | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | 170 | "3" | {"one","3"} |
"one" | 2 | "76" | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | 170 | "5" | {"one","5"} |
"two" | 3 | "83" | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | 152 | "5" | {"two","5"} |
"two" | 99 | "69" | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | 152 | "1" | {"two","1"} |
"three" | 23 | "90" | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" | 170 | "5" | {"three","5"} |
"three" | 11 | "80" | 2023-01-07 | 2023-02-07 | 9.1 | 1.0 | "j1" | 170 | "1" | {"three","1"} |
# let's check the first row at index 0
df.select(pl.col('merged_cat').take(0)).to_numpy()
array([[{'group': 'one', 'score_bin': '1'}]], dtype=object)
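To go the other way and pull the individual fields back out of the struct column, a minimal sketch:

# extract the original fields from the struct column
df.select([
    pl.col('merged_cat').struct.field('group'),
    pl.col('merged_cat').struct.field('score_bin'),
])
# df.unnest('merged_cat') similarly expands the struct into separate columns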
# let's create a column with a list as the output
df=df.with_columns(
merged_list = pl.struct(["group","class"]).apply(lambda x: [x['group'], x['class']])
)
df
shape: (7, 12)

group | grade | score | date1 | date2 | rating | zone | class | mean_over_grp | score_bin | merged_cat | merged_list |
---|---|---|---|---|---|---|---|---|---|---|---|
str | i16 | str | date | date | f64 | f64 | str | i64 | cat | struct[2] | list[str] |
"one" | 1 | "45" | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | 170 | "1" | {"one","1"} | ["one", "x1"] |
"one" | 98 | "49" | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | 170 | "3" | {"one","3"} | ["one", null] |
"one" | 2 | "76" | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | 170 | "5" | {"one","5"} | ["one", null] |
"two" | 3 | "83" | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | 152 | "5" | {"two","5"} | ["two", "c1"] |
"two" | 99 | "69" | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | 152 | "1" | {"two","1"} | ["two", "c2"] |
"three" | 23 | "90" | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" | 170 | "5" | {"three","5"} | ["three", "x2"] |
"three" | 11 | "80" | 2023-01-07 | 2023-02-07 | 9.1 | 1.0 | "j1" | 170 | "1" | {"three","1"} | ["three", "j1"] |
Q. How to convert dataframe columns into a python dict?
Converting two columns into key:value pairs. Note that iter_rows yields the rows in order, so if a key appears more than once, the last row wins, which is why only three pairs survive below.
dict(df.select(pl.col(["score_bin", "score"])).iter_rows())
{'1': '80', '3': '49', '5': '90'}
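If you want every value rather than just the last one, one option (a sketch using the same older groupby API as above) is to aggregate each group into a list first:

# build a dict mapping each group to the list of its scores
dict(df.groupby('group', maintain_order=True).agg(pl.col('score')).iter_rows())
# -> something like {'one': [...], 'two': [...], 'three': [...]}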
Q. How to convert a column containing lists into separate rows?
Similar to pandas, polars also provides the .explode function to achieve this.
# using a separate variable here so the sample df from earlier stays intact for the following sections
df_lists = pl.DataFrame(
    {
        "col1": [["X", "Y", "Z"], ["F", "G"], ["P"]],
        "col2": [["A", "B", "C"], ["C"], ["D", "E"]],
    }
).with_row_count()
df_lists.explode(["col1"]).explode(["col2"]).head(4)
shape: (4, 3)

row_nr | col1 | col2 |
---|---|---|
u32 | str | str |
0 | "X" | "A" |
0 | "X" | "B" |
0 | "X" | "C" |
0 | "Y" | "A" |
Q. How to perform string related functions using polars?
Similar to pandas, polars also provides string methods via the <col_name>.str.<method_name> pattern.
# add a suffix to every column name except one
df.select(pl.all().exclude(['score']).suffix('_pre')).head(3)
shape: (3, 7)

group_pre | grade_pre | date1_pre | date2_pre | rating_pre | zone_pre | class_pre |
---|---|---|---|---|---|---|
str | str | date | date | f64 | f64 | str |
"one" | "1" | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" |
"one" | "98" | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null |
"one" | "2" | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null |
# convert a column to uppercase
df.with_columns(pl.col('group').str.to_uppercase()).head()
shape: (5, 8)

group | grade | score | date1 | date2 | rating | zone | class |
---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str |
"ONE" | "1" | 45 | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" |
"ONE" | "98" | 49 | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null |
"ONE" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null |
"TWO" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" |
"TWO" | "99" | 69 | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" |
# concatenate a list column into a single string
# (the .arr namespace used here is called .list in newer Polars versions, e.g. .list.join)
(df
.with_columns(
    merged_list = pl.struct(["group","class"]).apply(lambda x: [x['group'], x['class']])
).with_columns(
    m_con=pl.col("merged_list").arr.join("_"),
    v4=pl.col("merged_list").arr.concat(pl.col("merged_list")),  # concatenate two list columns into a bigger list
))
shape: (7, 11)

group | grade | score | date1 | date2 | rating | zone | class | merged_list | m_con | v4 |
---|---|---|---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str | list[str] | str | list[str] |
"one" | "1" | 45 | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | ["one", "x1"] | "one_x1" | ["one", "x1", … "x1"] |
"one" | "98" | 49 | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | ["one", null] | "one_null" | ["one", null, … null] |
"one" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | ["one", null] | "one_null" | ["one", null, … null] |
"two" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | ["two", "c1"] | "two_c1" | ["two", "c1", … "c1"] |
"two" | "99" | 69 | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | ["two", "c2"] | "two_c2" | ["two", "c2", … "c2"] |
"three" | "23" | 90 | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" | ["three", "x2"] | "three_x2" | ["three", "x2", … "x2"] |
"three" | "11" | 80 | 2023-01-07 | 2023-02-07 | 9.1 | 1.0 | "j1" | ["three", "j1"] | "three_j1" | ["three", "j1", … "j1"] |
Q. How to perform datetime functions using polars?
Similar to pandas, polars also provides datetime methods via the <col_name>.dt.<method_name> pattern.
df.with_columns(day = pl.col('date1').dt.day(),
month= pl.col('date1').dt.month(),
year= pl.col('date1').dt.year(),
year_mon = pl.col('date1').cast(pl.Utf8).str.slice(0, 7),
)
shape: (7, 12)

group | grade | score | date1 | date2 | rating | zone | class | day | month | year | year_mon |
---|---|---|---|---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str | u32 | u32 | i32 | str |
"one" | "1" | 45 | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | 1 | 1 | 2023 | "2023-01" |
"one" | "98" | 49 | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | 2 | 1 | 2023 | "2023-01" |
"one" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | 3 | 1 | 2023 | "2023-01" |
"two" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | 4 | 1 | 2023 | "2023-01" |
"two" | "99" | 69 | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | 5 | 1 | 2023 | "2023-01" |
"three" | "23" | 90 | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" | 6 | 1 | 2023 | "2023-01" |
"three" | "11" | 80 | 2023-01-07 | 2023-02-07 | 9.1 | 1.0 | "j1" | 7 | 1 | 2023 | "2023-01" |
Q. How to read parquet data from S3 with polars?
I couldn't find a straightforward way to do this, so I wrote a small utility function that reads multiple parquet files from an S3 prefix.
def read_with_polars(s3_path: str):
    # split "s3://bucket/key/prefix" into bucket and key prefix
    bucket, _, key = s3_path.replace("s3://", "").partition("/")
    s3 = boto3.resource("s3")
    s3bucket = s3.Bucket(bucket)
    # list every parquet file under the prefix (skip directory markers and other files)
    paths = [
        f"s3://{bucket}/{obj.key}"
        for obj in s3bucket.objects.filter(Prefix=key)
        if obj.key.endswith(".parquet")
    ]
    print(f"found {len(paths)} parquet files")
    # read each file and vertically concatenate them into one dataframe
    return pl.concat(
        [
            pl.read_parquet(path, use_pyarrow=True, memory_map=True)
            for path in paths
        ]
    )
df = read_with_polars('s3://data-science-dev/polars/data/')
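For what it's worth, more recent Polars releases have grown native cloud-storage support, so the helper above may not be needed; a hedged sketch, assuming AWS credentials are already available in the environment:

# newer Polars versions can often scan S3 parquet files directly (glob patterns included)
# df = pl.scan_parquet("s3://data-science-dev/polars/data/*.parquet").collect()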
Summary
Overall, my experience with Polars has been quite satisfying. Having deployed a Python project that uses Polars, I feel more confident about reaching for it when handling large datasets. I'll probably keep using pandas as long as it doesn't slow me down significantly. I also tried training an XGBoost model directly on Polars dataframes, and it worked nicely without any need to convert them to pandas.