Introduction
Handling large datasets locally on a laptop is not easy. As a workaround, data scientists either work on a sample of the data or spin up Spark clusters, and Spark clusters don't come cheap. I recently migrated a data science project from Spark to in-memory Python computation using Polars and saved a few hundred dollars in weekly cloud costs. Yay!
Polars is much like pandas (it reads, manipulates, and writes datasets) but executes far faster in memory. I've been using pandas for several years, so learning Polars wasn't difficult. I found the Polars API quite intuitive, and as a data scientist you'll find it useful in day-to-day data analysis and machine learning projects that involve big datasets.
The transition from Polars back to pandas is also easy: calling .to_pandas() on a Polars dataframe gives you a pandas dataframe again.
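As a quick, minimal sketch of that interop (assuming pandas and pyarrow are installed):

import polars as pl
import pandas as pd

pl_df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Polars -> pandas (uses pyarrow under the hood)
pd_df = pl_df.to_pandas()

# pandas -> Polars
pl_df_again = pl.from_pandas(pd_df)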
Some key features of the polars library are:
- It is written in Rust and built on the Apache Arrow columnar format.
- It supports lazy execution of queries, which lets it handle datasets larger than the RAM on your laptop (see the sketch after this list).
- Its API is very similar to pandas and PySpark, so the learning curve is not at all steep.
- If you are more comfortable with SQL, you can register a Polars dataframe as a local SQL table and query it with ordinary SQL (also shown below).
- It is heavily optimised for parallel computation.
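Here is a minimal sketch of the lazy and SQL workflows mentioned above. The file path sales.csv and the column names are made-up placeholders, and SQLContext is only available in fairly recent Polars releases:

import polars as pl

# lazy execution: scan_csv only builds a query plan, nothing is read from disk
# until .collect() is called, so filters and projections get pushed down
lazy_query = (
    pl.scan_csv("sales.csv")              # placeholder file path
    .filter(pl.col("amount") > 100)
    .groupby("region")                    # called group_by in newer Polars versions
    .agg(pl.col("amount").sum())
)
result = lazy_query.collect()

# SQL: register the dataframe under a table name and query it with plain SQL
ctx = pl.SQLContext(sales=result)
ctx.execute("SELECT region, amount FROM sales ORDER BY amount DESC", eager=True)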
In this tutorial, I’ll show how to do some commonly used dataframe calculations and transformations using polars in Python.
Table of Contents
- How to get the count of NA values in a polars dataframe?
- How to run groupby functions?
- How to set the data types correctly?
- How to find the number of groups in a column?
- How to do one hot encoding in a polars dataframe?
- How to create categories/bins using a numeric column?
- How to create a pivot table?
- How to create a column where each row contains multiple values?
- How to convert dataframe columns into a python dict?
- How to convert a column containing lists into separate rows?
- How to perform string related functions using polars?
- How to perform datetime functions using polars?
- How to read parquet data from S3 with polars?
Let's organise the imports we require for this tutorial, and create a sample dataset that we'll use throughout.
import polars as pl
import boto3
import datetime
# set the configs
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_width_chars(100)
pl.Config.set_tbl_rows(200)
pl.Config.set_tbl_hide_dataframe_shape(True)
# create a polars dataframe
df = pl.DataFrame({
    "group": ["one", "one", "one", "two", "two", "three", "three"],
    "grade": ["1", "98", "2", "3", "99", "23", "11"],
    "score": [45, 49, 76, 83, 69, 90, 80],
    "date1": pl.date_range(datetime.date(2023, 1, 1), datetime.date(2023, 1, 7), eager=True),
    "date2": pl.date_range(datetime.date(2023, 2, 1), datetime.date(2023, 2, 7), eager=True),
    "rating": [4.5, 3.5, 4.0, 1.0, 1.2, 9.4, 9.1],
    "zone": [None, float('nan'), 1.0, 2.0, 4.0, 5.0, 1.0],
    "class": ['x1', None, None, 'c1', 'c2', 'x2', 'j1'],
})
df.head(5)
shape: (5, 8)

group | grade | score | date1 | date2 | rating | zone | class |
---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str |
"one" | "1" | 45 | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" |
"one" | "98" | 49 | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null |
"one" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null |
"two" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" |
"two" | "99" | 69 | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" |
df.glimpse()
Rows: 7
Columns: 8
$ group <str> one, one, one, two, two, three, three
$ grade <str> 1, 98, 2, 3, 99, 23, 11
$ score <i64> 45, 49, 76, 83, 69, 90, 80
$ date1 <date> 2023-01-01, 2023-01-02, 2023-01-03, 2023-01-04, 2023-01-05, 2023-01-06, 2023-01-07
$ date2 <date> 2023-02-01, 2023-02-02, 2023-02-03, 2023-02-04, 2023-02-05, 2023-02-06, 2023-02-07
$ rating <f64> 4.5, 3.5, 4.0, 1.0, 1.2, 9.4, 9.1
$ zone <f64> None, nan, 1.0, 2.0, 4.0, 5.0, 1.0
$ class <str> x1, None, None, c1, c2, x2, j1
df.describe()
shape: (9, 9)

describe | group | grade | score | date1 | date2 | rating | zone | class |
---|---|---|---|---|---|---|---|---|
str | str | str | f64 | str | str | f64 | f64 | str |
"count" | "7" | "7" | 7.0 | "7" | "7" | 7.0 | 7.0 | "7" |
"null_count" | "0" | "0" | 0.0 | "0" | "0" | 0.0 | 1.0 | "2" |
"mean" | null | null | 70.285714 | null | null | 4.671429 | NaN | null |
"std" | null | null | 17.182494 | null | null | 3.39986 | NaN | null |
"min" | "one" | "1" | 45.0 | "2023-01-01" | "2023-02-01" | 1.0 | 1.0 | "c1" |
"max" | "two" | "99" | 90.0 | "2023-01-07" | "2023-02-07" | 9.4 | 5.0 | "x2" |
"median" | null | null | 76.0 | null | null | 4.0 | 3.0 | null |
"25%" | null | null | 49.0 | null | null | 1.2 | 1.0 | null |
"75%" | null | null | 83.0 | null | null | 9.1 | 5.0 | null |
Q. How to get the count of NA values in a polars dataframe?
df.select(pl.all().is_null().sum())
shape: (1, 8)

group | grade | score | date1 | date2 | rating | zone | class |
---|---|---|---|---|---|---|---|
u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 |
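One thing to keep in mind: is_null() only counts null values. In Polars, a floating-point NaN (like the one in the zone column) is not null; as a quick sketch, you can count it separately with is_nan():

# nulls vs NaNs in the float column "zone": is_null() catches None/null,
# while is_nan() only applies to floating-point columns
df.select([
    pl.col("zone").is_null().sum().alias("zone_nulls"),
    pl.col("zone").is_nan().sum().alias("zone_nans"),
])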
Q. How to run groupby functions?
groupby is probably one of the most commonly used data manipulation functions. Polars offers the .groupby and .over functions to calculate aggregated metrics over a group.
# method 1: find the top row in each group after sorting by score (descending)
df.groupby("group").agg(pl.all().sort_by("score", descending=True).first())
shape: (3, 8)

group | grade | score | date1 | date2 | rating | zone | class |
---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str |
"one" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null |
"two" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" |
"three" | "23" | 90 | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" |
# method 2: keep the row with the maximum score in each group
df.filter(pl.col("score") == pl.col("score").max().over(["group"]))
shape: (3, 8)

group | grade | score | date1 | date2 | rating | zone | class |
---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str |
"one" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null |
"two" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" |
"three" | "23" | 90 | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" |
# calculate the total score over each group and add it as a column in the dataframe
# (note: .sum() is used here, so mean_over_grp actually holds the group total;
#  swap in .mean() if you want the group mean)
df = df.with_columns(mean_over_grp=pl.col("score").sum().over("group"))
df.head(5)
shape: (5, 9)

group | grade | score | date1 | date2 | rating | zone | class | mean_over_grp |
---|---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str | i64 |
"one" | "1" | 45 | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | 170 |
"one" | "98" | 49 | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | 170 |
"one" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | 170 |
"two" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | 152 |
"two" | "99" | 69 | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | 152 |
# convert a column into a list per group
# (inside .agg, pl.col("score") on its own also returns the group's values as a list)
df.groupby("group", maintain_order=True).agg(score_list=pl.col("score").apply(list))
shape: (3, 2)

group | score_list |
---|---|
str | list[i64] |
"one" | [45, 49, 76] |
"two" | [83, 69] |
"three" | [90, 80] |
# aggregate calculations for a group using multiple columns
df.groupby('group').agg(
mean_score = pl.col('score').mean(),
n_rows = pl.count(),
cls_c1_count = (pl.col('class') == 'c1').sum(),
min_date = pl.col('date1').min()
)
shape: (3, 5)

group | mean_score | n_rows | cls_c1_count | min_date |
---|---|---|---|---|
str | f64 | u32 | u32 | date |
"two" | 76.0 | 2 | 1 | 2023-01-04 |
"one" | 56.666667 | 3 | 0 | 2023-01-01 |
"three" | 85.0 | 2 | 0 | 2023-01-06 |
# method 1: calculate counts per group
df.groupby('group').count()
shape: (3, 2)

group | count |
---|---|
str | u32 |
"two" | 2 |
"one" | 3 |
"three" | 2 |
# method 2: calculate counts per group
df['group'].value_counts()
shape: (3, 2)

group | counts |
---|---|
str | u32 |
"two" | 2 |
"one" | 3 |
"three" | 2 |
Q. How to set the data types correctly?
The grade column holds integers, but they are encoded as strings in the dataframe. Let's fix the data type.
## convert string to integer
df = df.with_columns(pl.col('grade').cast(pl.Int16))
# similarly to convert an integer to string you can do:
df = df.with_columns(pl.col('score').cast(pl.Utf8))
df['grade'].head(3)
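Casting also covers dates. A minimal sketch for parsing a string column into a proper Date dtype (the date_str column here is a made-up example, not part of the sample dataframe):

# hypothetical example: parse ISO-formatted strings into a Date column
dates = pl.DataFrame({"date_str": ["2023-01-01", "2023-01-02"]})
dates = dates.with_columns(
    pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d").alias("date_parsed")
)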
Q. How to find the number of groups in a column?
In pandas, the ngroups attribute of a groupby object returns the number of groups. We can achieve the same thing with n_unique in polars.
# find unique groups in group col
df.select(pl.col("group").n_unique()).item()
# find unique groups in group and class together
df.select(pl.struct(["group", "class"]).n_unique()).item()
Q. How to do one hot encoding in a polars dataframe?
Converting a categorical column into several columns of 1s and 0s is a commonly used feature-engineering technique.
# create dummy (one hot encoded) columns from the group column;
# note: df is not reassigned here, so later examples can still use the original group column
df.to_dummies(columns=['group'])
shape: (7, 11)

group_one | group_three | group_two | grade | score | date1 | date2 | rating | zone | class | mean_over_grp |
---|---|---|---|---|---|---|---|---|---|---|
u8 | u8 | u8 | i16 | str | date | date | f64 | f64 | str | i64 |
1 | 0 | 0 | 1 | "45" | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | 170 |
1 | 0 | 0 | 98 | "49" | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | 170 |
1 | 0 | 0 | 2 | "76" | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | 170 |
0 | 0 | 1 | 3 | "83" | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | 152 |
0 | 0 | 1 | 99 | "69" | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | 152 |
0 | 1 | 0 | 23 | "90" | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" | 170 |
0 | 1 | 0 | 11 | "80" | 2023-01-07 | 2023-02-07 | 9.1 | 1.0 | "j1" | 170 |
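to_dummies can also encode several columns at once; a small sketch (the separator argument controls the delimiter between the original column name and the category value):

# one hot encode two columns in a single call
df.to_dummies(columns=["group", "class"], separator="_")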
Q. How to create categories/bins using a numeric column?
Here, we'd like to bucket a numeric column into bins based on percentiles within each group.
# note: this uses the older Series.qcut API (which returns a dataframe with a "category" column);
# newer Polars versions expose qcut directly as an expression, e.g. pl.col("score").qcut(...)
df = df.with_columns(
    score_bin=pl.col("score")
    .apply(
        lambda s: s.qcut(
            [0.2, 0.4, 0.6, 0.8], labels=["1", "2", "3", "4", "5"], maintain_order=True
        )["category"]
    )
    .over(["group"])
)
df
shape: (7, 10)

group | grade | score | date1 | date2 | rating | zone | class | mean_over_grp | score_bin |
---|---|---|---|---|---|---|---|---|---|
str | i16 | str | date | date | f64 | f64 | str | i64 | cat |
"one" | 1 | "45" | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | 170 | "1" |
"one" | 98 | "49" | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | 170 | "3" |
"one" | 2 | "76" | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | 170 | "5" |
"two" | 3 | "83" | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | 152 | "5" |
"two" | 99 | "69" | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | 152 | "1" |
"three" | 23 | "90" | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" | 170 | "5" |
"three" | 11 | "80" | 2023-01-07 | 2023-02-07 | 9.1 | 1.0 | "j1" | 170 | "1" |
Q. How to create a pivot table?
df.pivot(values="rating", index="score_bin", columns="group", aggregate_function="min")
shape: (3, 4)

score_bin | one | two | three |
---|---|---|---|
cat | f64 | f64 | f64 |
"1" | 4.5 | 1.2 | 9.1 |
"3" | 3.5 | null | null |
"5" | 4.0 | 1.0 | 9.4 |
Q. How to create a column where each row contains multiple values?
This is a common use case where we need to merge two columns into a single column whose rows each hold both values. In polars this is done with a struct.
df = df.with_columns(merged_cat = pl.struct(["group","score_bin"]))
df
shape: (7, 11)

group | grade | score | date1 | date2 | rating | zone | class | mean_over_grp | score_bin | merged_cat |
---|---|---|---|---|---|---|---|---|---|---|
str | i16 | str | date | date | f64 | f64 | str | i64 | cat | struct[2] |
"one" | 1 | "45" | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | 170 | "1" | {"one","1"} |
"one" | 98 | "49" | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | 170 | "3" | {"one","3"} |
"one" | 2 | "76" | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | 170 | "5" | {"one","5"} |
"two" | 3 | "83" | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | 152 | "5" | {"two","5"} |
"two" | 99 | "69" | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | 152 | "1" | {"two","1"} |
"three" | 23 | "90" | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" | 170 | "5" | {"three","5"} |
"three" | 11 | "80" | 2023-01-07 | 2023-02-07 | 9.1 | 1.0 | "j1" | 170 | "1" | {"three","1"} |
# let's check the first row at index 0
df.select(pl.col('merged_cat').take(0)).to_numpy()
array([[{'group': 'one', 'score_bin': '1'}]], dtype=object)
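To go the other way and pull the individual fields back out of the struct column, a minimal sketch:

# extract the original fields from the struct column
df.select([
    pl.col('merged_cat').struct.field('group'),
    pl.col('merged_cat').struct.field('score_bin'),
])
# df.unnest('merged_cat') similarly expands the struct into separate columns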
# let's create a column with a list as the output
df=df.with_columns(
merged_list = pl.struct(["group","class"]).apply(lambda x: [x['group'], x['class']])
)
df
shape: (7, 12)

group | grade | score | date1 | date2 | rating | zone | class | mean_over_grp | score_bin | merged_cat | merged_list |
---|---|---|---|---|---|---|---|---|---|---|---|
str | i16 | str | date | date | f64 | f64 | str | i64 | cat | struct[2] | list[str] |
"one" | 1 | "45" | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | 170 | "1" | {"one","1"} | ["one", "x1"] |
"one" | 98 | "49" | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | 170 | "3" | {"one","3"} | ["one", null] |
"one" | 2 | "76" | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | 170 | "5" | {"one","5"} | ["one", null] |
"two" | 3 | "83" | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | 152 | "5" | {"two","5"} | ["two", "c1"] |
"two" | 99 | "69" | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | 152 | "1" | {"two","1"} | ["two", "c2"] |
"three" | 23 | "90" | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" | 170 | "5" | {"three","5"} | ["three", "x2"] |
"three" | 11 | "80" | 2023-01-07 | 2023-02-07 | 9.1 | 1.0 | "j1" | 170 | "1" | {"three","1"} | ["three", "j1"] |
Q. How to convert dataframe columns into a python dict?
Converting two columns into key:value pairs. Note that iter_rows yields the rows in order, so if a key appears more than once, the last row wins, which is why only three pairs survive below.
dict(df.select(pl.col(["score_bin", "score"])).iter_rows())
{'1': '80', '3': '49', '5': '90'}
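If you want every value rather than just the last one, one option (a sketch using the same older groupby API as above) is to aggregate each group into a list first:

# build a dict mapping each group to the list of its scores
dict(df.groupby('group', maintain_order=True).agg(pl.col('score')).iter_rows())
# -> something like {'one': [...], 'two': [...], 'three': [...]}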
Q. How to convert a column containing lists into separate rows?
Similar to pandas, polars also provides the .explode function to achieve this.
# using a separate variable here so the sample df from earlier stays intact for the following sections
df_lists = pl.DataFrame(
    {
        "col1": [["X", "Y", "Z"], ["F", "G"], ["P"]],
        "col2": [["A", "B", "C"], ["C"], ["D", "E"]],
    }
).with_row_count()
df_lists.explode(["col1"]).explode(["col2"]).head(4)
shape: (4, 3)

row_nr | col1 | col2 |
---|---|---|
u32 | str | str |
0 | "X" | "A" |
0 | "X" | "B" |
0 | "X" | "C" |
0 | "Y" | "A" |
Q. How to perform string related functions using polars?
Similar to pandas, polars also provides string methods via the <col_name>.str.<method_name> pattern.
# add a suffix to every column name except one
df.select(pl.all().exclude(['score']).suffix('_pre')).head(3)
shape: (3, 7)

group_pre | grade_pre | date1_pre | date2_pre | rating_pre | zone_pre | class_pre |
---|---|---|---|---|---|---|
str | str | date | date | f64 | f64 | str |
"one" | "1" | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" |
"one" | "98" | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null |
"one" | "2" | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null |
# convert a column to uppercase
df.with_columns(pl.col('group').str.to_uppercase()).head()
shape: (5, 8)

group | grade | score | date1 | date2 | rating | zone | class |
---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str |
"ONE" | "1" | 45 | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" |
"ONE" | "98" | 49 | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null |
"ONE" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null |
"TWO" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" |
"TWO" | "99" | 69 | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" |
# concatenate a list column into a single string
# (the .arr namespace used here is called .list in newer Polars versions, e.g. .list.join)
(df
.with_columns(
    merged_list = pl.struct(["group","class"]).apply(lambda x: [x['group'], x['class']])
).with_columns(
    m_con=pl.col("merged_list").arr.join("_"),
    v4=pl.col("merged_list").arr.concat(pl.col("merged_list")),  # concatenate two list columns into a bigger list
))
shape: (7, 11)

group | grade | score | date1 | date2 | rating | zone | class | merged_list | m_con | v4 |
---|---|---|---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str | list[str] | str | list[str] |
"one" | "1" | 45 | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | ["one", "x1"] | "one_x1" | ["one", "x1", … "x1"] |
"one" | "98" | 49 | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | ["one", null] | "one_null" | ["one", null, … null] |
"one" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | ["one", null] | "one_null" | ["one", null, … null] |
"two" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | ["two", "c1"] | "two_c1" | ["two", "c1", … "c1"] |
"two" | "99" | 69 | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | ["two", "c2"] | "two_c2" | ["two", "c2", … "c2"] |
"three" | "23" | 90 | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" | ["three", "x2"] | "three_x2" | ["three", "x2", … "x2"] |
"three" | "11" | 80 | 2023-01-07 | 2023-02-07 | 9.1 | 1.0 | "j1" | ["three", "j1"] | "three_j1" | ["three", "j1", … "j1"] |
Q. How to perform datetime functions using polars?
Similar to pandas, polars also provides datetime methods via the <col_name>.dt.<method_name> pattern.
df.with_columns(day = pl.col('date1').dt.day(),
month= pl.col('date1').dt.month(),
year= pl.col('date1').dt.year(),
year_mon = pl.col('date1').cast(pl.Utf8).str.slice(0, 7),
)
shape: (7, 12)

group | grade | score | date1 | date2 | rating | zone | class | day | month | year | year_mon |
---|---|---|---|---|---|---|---|---|---|---|---|
str | str | i64 | date | date | f64 | f64 | str | u32 | u32 | i32 | str |
"one" | "1" | 45 | 2023-01-01 | 2023-02-01 | 4.5 | null | "x1" | 1 | 1 | 2023 | "2023-01" |
"one" | "98" | 49 | 2023-01-02 | 2023-02-02 | 3.5 | NaN | null | 2 | 1 | 2023 | "2023-01" |
"one" | "2" | 76 | 2023-01-03 | 2023-02-03 | 4.0 | 1.0 | null | 3 | 1 | 2023 | "2023-01" |
"two" | "3" | 83 | 2023-01-04 | 2023-02-04 | 1.0 | 2.0 | "c1" | 4 | 1 | 2023 | "2023-01" |
"two" | "99" | 69 | 2023-01-05 | 2023-02-05 | 1.2 | 4.0 | "c2" | 5 | 1 | 2023 | "2023-01" |
"three" | "23" | 90 | 2023-01-06 | 2023-02-06 | 9.4 | 5.0 | "x2" | 6 | 1 | 2023 | "2023-01" |
"three" | "11" | 80 | 2023-01-07 | 2023-02-07 | 9.1 | 1.0 | "j1" | 7 | 1 | 2023 | "2023-01" |
Q. How to read parquet data from S3 with polars?
I couldn't find a straightforward way to do this, so I wrote a small utility function that reads multiple parquet files from an S3 prefix.
def read_with_polars(s3_path: str):
    # split "s3://bucket/key/prefix" into bucket and key prefix
    bucket, _, key = s3_path.replace("s3://", "").partition("/")
    s3 = boto3.resource("s3")
    s3bucket = s3.Bucket(bucket)
    # list every parquet file under the prefix (skip directory markers and other files)
    paths = [
        f"s3://{bucket}/{obj.key}"
        for obj in s3bucket.objects.filter(Prefix=key)
        if obj.key.endswith(".parquet")
    ]
    print(f"found {len(paths)} parquet files")
    # read each file and vertically concatenate them into one dataframe
    return pl.concat(
        [
            pl.read_parquet(path, use_pyarrow=True, memory_map=True)
            for path in paths
        ]
    )
df = read_with_polars('s3://data-science-dev/polars/data/')
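For what it's worth, more recent Polars releases have grown native cloud-storage support, so the helper above may not be needed; a hedged sketch, assuming AWS credentials are already available in the environment:

# newer Polars versions can often scan S3 parquet files directly (glob patterns included)
# df = pl.scan_parquet("s3://data-science-dev/polars/data/*.parquet").collect()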
Summary
Overall, my experience with Polars has been quite satisfying. Having deployed a Python project that uses Polars, I feel more confident about reaching for it when handling large datasets. I'll probably keep using pandas as long as it doesn't slow me down significantly. I also tried training an XGBoost model directly on Polars dataframes, and it worked nicely without any need to convert them to pandas.