
Introduction

Handling large datasets locally on a laptop is not easy. As a workaround, data scientists either work on a sample of the data or spin up Spark clusters, and Spark clusters don't come cheap. I recently migrated a data science project from Spark to in-memory computation in Python using polars and saved a few hundred dollars in weekly cloud costs. Yay!

Polars serves the same purpose as pandas (reading, manipulating, and writing datasets) but executes much faster in memory. I've been using pandas for several years, so learning polars wasn't difficult. I found the polars API quite intuitive, and as a data scientist you'll find it useful in day-to-day data analysis and machine learning projects involving big datasets.

The transition from polars back to pandas is super easy: calling .to_pandas() on a polars dataframe returns a pandas one.
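
For example (a tiny sketch, assuming df is a polars dataframe and that pandas and pyarrow are installed):

pdf = df.to_pandas()         # polars -> pandas
df_pl = pl.from_pandas(pdf)  # pandas -> polars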

Some key features of the polars library are:

  • It is written in Rust and built on the Apache Arrow columnar format.
  • It allows lazy execution of commands, which lets it handle datasets larger than the RAM on your laptop (see the sketch after this list).
  • Its API is very similar to pandas and PySpark, so the learning curve is not steep at all.
  • If you are more comfortable with SQL, you can register a polars dataframe as a local SQL table and query it with ordinary SQL (also sketched below).
  • It is heavily optimised for parallel computation.
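
To make the lazy-execution and SQL bullets concrete, here is a minimal sketch. The CSV path and column names are hypothetical, and SQLContext requires a reasonably recent polars version (its API has shifted slightly between releases):

import polars as pl

# lazy execution: build a query plan without loading the file into memory
lazy_df = pl.scan_csv("big_file.csv")   # nothing is read yet
result = (
    lazy_df
    .filter(pl.col("score") > 50)       # these steps only extend the plan
    .groupby("group")
    .agg(pl.col("score").mean())
    .collect()                          # the optimised plan executes here
)

# SQL: register a dataframe as a table and query it with plain SQL
ctx = pl.SQLContext()
ctx.register("tbl", result)
out = ctx.execute("SELECT * FROM tbl", eager=True)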

In this tutorial, I’ll show how to do some commonly used dataframes calculations/transformations using polars in Python.

Table of Contents

  1. How to get the count of NA values in a polars dataframe?
  2. How to run groupby functions?
  3. How to set the data types correctly?
  4. How to find number of groups in a column?
  5. How to do one hot encoding in a polars dataframe?
  6. How to create categories/bins using a numeric column?
  7. How to create a pivot table?
  8. How to create a column where each row contains multiple values?
  9. How to convert dataframe columns into a python dict?
  10. How to convert a column containing list into separate rows?
  11. How to perform string related functions using polars?
  12. How to perform datetime functions using polars?
  13. How to read parquet data from S3 with polars?

Let's organise the imports we require for this tutorial. Also, let's create a sample dataset which we are going to use throughout.

import polars as pl
import boto3
import datetime
# set the configs
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_width_chars(100)
pl.Config.set_tbl_rows(200)
pl.Config.set_tbl_hide_dataframe_shape(True)  
# create a polars dataframe
df = pl.DataFrame({
    "group": ["one", "one", "one", "two", "two", "three", "three"],
    "grade": ["1", "98", "2", "3", "99", "23", "11"],
    "score": [45,49,76,83,69,90,80],
    "date1": pl.date_range(datetime.date(2023, 1, 1), datetime.date(2023, 1, 7)),
    "date2": pl.date_range(datetime.date(2023, 2, 1), datetime.date(2023, 2, 7)),
    "rating": [4.5, 3.5, 4, 1.0, 1.2, 9.4, 9.1],
    "zone": [None, float('nan'), 1, 2, 4, 5, 1],
    "class": ['x1', float('nan'), None, 'c1', 'c2', 'x2', 'j1'],
})

df.head()
shape: (5, 8)
group    grade  score  date1       date2       rating  zone  class
str      str    i64    date        date        f64     f64   str
"one"    "1"    45     2023-01-01  2023-02-01  4.5     null  "x1"
"one"    "98"   49     2023-01-02  2023-02-02  3.5     NaN   null
"one"    "2"    76     2023-01-03  2023-02-03  4.0     1.0   null
"two"    "3"    83     2023-01-04  2023-02-04  1.0     2.0   "c1"
"two"    "99"   69     2023-01-05  2023-02-05  1.2     4.0   "c2"
df.glimpse()
Rows: 7
Columns: 8
$ group   <str> one, one, one, two, two, three, three
$ grade   <str> 1, 98, 2, 3, 99, 23, 11
$ score   <i64> 45, 49, 76, 83, 69, 90, 80
$ date1  <date> 2023-01-01, 2023-01-02, 2023-01-03, 2023-01-04, 2023-01-05, 2023-01-06, 2023-01-07
$ date2  <date> 2023-02-01, 2023-02-02, 2023-02-03, 2023-02-04, 2023-02-05, 2023-02-06, 2023-02-07
$ rating  <f64> 4.5, 3.5, 4.0, 1.0, 1.2, 9.4, 9.1
$ zone    <f64> None, nan, 1.0, 2.0, 4.0, 5.0, 1.0
$ class   <str> x1, None, None, c1, c2, x2, j1
df.describe()
shape: (9, 9)
describe      group  grade  score      date1         date2         rating    zone  class
str           str    str    f64        str           str           f64       f64   str
"count"       "7"    "7"    7.0        "7"           "7"           7.0       7.0   "7"
"null_count"  "0"    "0"    0.0        "0"           "0"           0.0       1.0   "2"
"mean"        null   null   70.285714  null          null          4.671429  NaN   null
"std"         null   null   17.182494  null          null          3.39986   NaN   null
"min"         "one"  "1"    45.0       "2023-01-01"  "2023-02-01"  1.0       1.0   "c1"
"max"         "two"  "99"   90.0       "2023-01-07"  "2023-02-07"  9.4       5.0   "x2"
"median"      null   null   76.0       null          null          4.0       3.0   null
"25%"         null   null   49.0       null          null          1.2       1.0   null
"75%"         null   null   83.0       null          null          9.1       5.0   null

Q. How to get the count of NA values in a polars dataframe?

df.select(pl.all().is_null().sum())
shape: (1, 8)
group  grade  score  date1  date2  rating  zone  class
u32    u32    u32    u32    u32    u32     u32   u32
0      0      0      0      0      0       1     2
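
Note that polars treats null and NaN as different things: the zone column holds one of each, and is_null() only counted the null. To also count NaNs, you can check the float columns with something like this (a small sketch):

df.select(pl.col(pl.Float64).is_nan().sum())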

Q. How to run groupby functions?

groupby is probably one of the most commonly used data manipulation functions. Polars offers the .groupby and .over functions to calculate aggregated metrics over groups.

# method 1: find the first row in each group, sorted by score descending
df.groupby("group").agg(pl.all().sort_by("score", descending=True).first())
shape: (3, 8)
group    grade  score  date1       date2       rating  zone  class
str      str    i64    date        date        f64     f64   str
"one"    "2"    76     2023-01-03  2023-02-03  4.0     1.0   null
"two"    "3"    83     2023-01-04  2023-02-04  1.0     2.0   "c1"
"three"  "23"   90     2023-01-06  2023-02-06  9.4     5.0   "x2"
# method 2: keep the row(s) with the max score in each group using a window function
df.filter(pl.col("score") == pl.col("score").max().over(["group"]))
shape: (3, 8)
group    grade  score  date1       date2       rating  zone  class
str      str    i64    date        date        f64     f64   str
"one"    "2"    76     2023-01-03  2023-02-03  4.0     1.0   null
"two"    "3"    83     2023-01-04  2023-02-04  1.0     2.0   "c1"
"three"  "23"   90     2023-01-06  2023-02-06  9.4     5.0   "x2"
# calculate the total score per group and add it as a column in the dataframe
df = df.with_columns(sum_over_grp=pl.col("score").sum().over("group"))
df.head(5)
shape: (5, 9)
group  grade  score  date1       date2       rating  zone  class  sum_over_grp
str    str    i64    date        date        f64     f64   str    i64
"one"  "1"    45     2023-01-01  2023-02-01  4.5     null  "x1"   170
"one"  "98"   49     2023-01-02  2023-02-02  3.5     NaN   null   170
"one"  "2"    76     2023-01-03  2023-02-03  4.0     1.0   null   170
"two"  "3"    83     2023-01-04  2023-02-04  1.0     2.0   "c1"   152
"two"  "99"   69     2023-01-05  2023-02-05  1.2     4.0   "c2"   152
# collect a column into a list per group (an unaggregated column becomes a list in agg)
df.groupby("group", maintain_order=True).agg(score_list=pl.col("score"))
shape: (3, 2)
group    score_list
str      list[i64]
"one"    [45, 49, 76]
"two"    [83, 69]
"three"  [90, 80]
# aggregate calculations for a group using multiple columns
df.groupby('group').agg(
    mean_score = pl.col('score').mean(), 
    n_rows = pl.count(), 
    cls_c1_count = (pl.col('class') == 'c1').sum(),
    min_date = pl.col('date1').min()
)
shape: (3, 5)
group    mean_score  n_rows  cls_c1_count  min_date
str      f64         u32     u32           date
"two"    76.0        2       1             2023-01-04
"one"    56.666667   3       0             2023-01-01
"three"  85.0        2       0             2023-01-06
# method 1: calculate counts per group
df.groupby('group').count()
shape: (3, 2)
group    count
str      u32
"two"    2
"one"    3
"three"  2
# method 2: calculate counts per group
df['group'].value_counts()
shape: (3, 2)
group    counts
str      u32
"two"    2
"one"    3
"three"  2

Q. How to set the data types correctly?

The grade column contains integers encoded as strings. Let's fix the data type.

# convert string to integer
df = df.with_columns(pl.col('grade').cast(pl.Int16))

# similarly to convert an integer to string you can do:
df = df.with_columns(pl.col('score').cast(pl.Utf8))

df['grade'].head(3)
shape: (3,)
Series: 'grade' [i16]
[
    1
    98
    2
]

Q. How to find number of groups in a column?

In pandas, the groupby(...).ngroups attribute returns the number of unique groups. We can achieve the same in polars using n_unique.

# find unique groups in group col
df.select(pl.col("group").n_unique()).item()
3
# find unique groups in group and class together
df.select(pl.struct(["group", "class"]).n_unique()).item()
6

Q. How to do One Hot Encoding in a polars dataframe?

Converting a categorical column into several columns of 1s and 0s is a commonly used feature engineering technique.

# one hot encode the group column (displayed without overwriting df,
# so the group column stays available for the examples below)
df.to_dummies('group')
shape: (7, 11)
group_one  group_three  group_two  grade  score  date1       date2       rating  zone  class  sum_over_grp
u8         u8           u8         i16    str    date        date        f64     f64   str    i64
1          0            0          1      "45"   2023-01-01  2023-02-01  4.5     null  "x1"   170
1          0            0          98     "49"   2023-01-02  2023-02-02  3.5     NaN   null   170
1          0            0          2      "76"   2023-01-03  2023-02-03  4.0     1.0   null   170
0          0            1          3      "83"   2023-01-04  2023-02-04  1.0     2.0   "c1"   152
0          0            1          99     "69"   2023-01-05  2023-02-05  1.2     4.0   "c2"   152
0          1            0          23     "90"   2023-01-06  2023-02-06  9.4     5.0   "x2"   170
0          1            0          11     "80"   2023-01-07  2023-02-07  9.1     1.0   "j1"   170

Q. How to create categories/bins using a numeric column?

Here, we'd like to bucket a numeric column into percentile-based bins within each group.

df = df.with_columns(
    score_bin=pl.col("score")
    .apply(
        # in this polars version, Series.qcut returns a dataframe;
        # we keep its "category" column as the bin label
        lambda s: s.qcut(
            [0.2, 0.4, 0.6, 0.8], labels=["1", "2", "3", "4", "5"], maintain_order=True
        )["category"]
    )
    .over(["group"])
)
df
shape: (7, 10)
group    grade  score  date1       date2       rating  zone  class  sum_over_grp  score_bin
str      i16    str    date        date        f64     f64   str    i64           cat
"one"    1      "45"   2023-01-01  2023-02-01  4.5     null  "x1"   170           "1"
"one"    98     "49"   2023-01-02  2023-02-02  3.5     NaN   null   170           "3"
"one"    2      "76"   2023-01-03  2023-02-03  4.0     1.0   null   170           "5"
"two"    3      "83"   2023-01-04  2023-02-04  1.0     2.0   "c1"   152           "5"
"two"    99     "69"   2023-01-05  2023-02-05  1.2     4.0   "c2"   152           "1"
"three"  23     "90"   2023-01-06  2023-02-06  9.4     5.0   "x2"   170           "5"
"three"  11     "80"   2023-01-07  2023-02-07  9.1     1.0   "j1"   170           "1"

Q. How to create a pivot table?

df.pivot(values="rating", index="score_bin", columns="group", aggregate_function="min")
shape: (3, 4)
score_bin  one  two   three
cat        f64  f64   f64
"1"        4.5  1.2   9.1
"3"        3.5  null  null
"5"        4.0  1.0   9.4

Q. How to create a column where each row contains multiple values?

This is a common use case where we need to merge two columns into a single column whose rows hold multiple values.

df = df.with_columns(merged_cat = pl.struct(["group","score_bin"]))
df
shape: (7, 11)
group    grade  score  date1       date2       rating  zone  class  sum_over_grp  score_bin  merged_cat
str      i16    str    date        date        f64     f64   str    i64           cat        struct[2]
"one"    1      "45"   2023-01-01  2023-02-01  4.5     null  "x1"   170           "1"        {"one","1"}
"one"    98     "49"   2023-01-02  2023-02-02  3.5     NaN   null   170           "3"        {"one","3"}
"one"    2      "76"   2023-01-03  2023-02-03  4.0     1.0   null   170           "5"        {"one","5"}
"two"    3      "83"   2023-01-04  2023-02-04  1.0     2.0   "c1"   152           "5"        {"two","5"}
"two"    99     "69"   2023-01-05  2023-02-05  1.2     4.0   "c2"   152           "1"        {"two","1"}
"three"  23     "90"   2023-01-06  2023-02-06  9.4     5.0   "x2"   170           "5"        {"three","5"}
"three"  11     "80"   2023-01-07  2023-02-07  9.1     1.0   "j1"   170           "1"        {"three","1"}
# let's check the first row at index 0
df.select(pl.col('merged_cat').take(0)).to_numpy()
array([[{'group': 'one', 'score_bin': '1'}]], dtype=object)
# let's create a column with a list as output
df=df.with_columns(
    merged_list = pl.struct(["group","class"]).apply(lambda x: [x['group'], x['class']])
)
df
shape: (7, 12)
group    grade  score  date1       date2       rating  zone  class  sum_over_grp  score_bin  merged_cat     merged_list
str      i16    str    date        date        f64     f64   str    i64           cat        struct[2]      list[str]
"one"    1      "45"   2023-01-01  2023-02-01  4.5     null  "x1"   170           "1"        {"one","1"}    ["one", "x1"]
"one"    98     "49"   2023-01-02  2023-02-02  3.5     NaN   null   170           "3"        {"one","3"}    ["one", null]
"one"    2      "76"   2023-01-03  2023-02-03  4.0     1.0   null   170           "5"        {"one","5"}    ["one", null]
"two"    3      "83"   2023-01-04  2023-02-04  1.0     2.0   "c1"   152           "5"        {"two","5"}    ["two", "c1"]
"two"    99     "69"   2023-01-05  2023-02-05  1.2     4.0   "c2"   152           "1"        {"two","1"}    ["two", "c2"]
"three"  23     "90"   2023-01-06  2023-02-06  9.4     5.0   "x2"   170           "5"        {"three","5"}  ["three", "x2"]
"three"  11     "80"   2023-01-07  2023-02-07  9.1     1.0   "j1"   170           "1"        {"three","1"}  ["three", "j1"]

Q. How to convert dataframe columns into a python dict?

Converting two columns into a key:value mapping. Note that duplicate keys are overwritten by later rows, which is why only the last score per bin survives below.

dict(df.select(pl.col(["score_bin", "score"])).iter_rows())
{'1': '80', '3': '49', '5': '90'}

Q. How to convert a column containing list into separate rows?

Similar to pandas, polars also provides an .explode function to achieve this. Let's use a small separate dataframe for the demonstration.

df2 = pl.DataFrame(
    {
        "col1": [["X", "Y", "Z"], ["F", "G"], ["P"]],
        "col2": [["A", "B", "C"], ["C"], ["D", "E"]],
    }
).with_row_count()
df2.explode(["col1"]).explode(["col2"]).head(4)
shape: (4, 3)
row_nr  col1  col2
u32     str   str
0       "X"   "A"
0       "X"   "B"
0       "X"   "C"
0       "Y"   "A"

Q. How to perform string related functions using polars?

Similar to pandas, polars provides string methods via the <col_name>.str.<method_name> format.

# add a suffix to all column names except one
df.select(pl.all().exclude(['score']).suffix('_pre')).head(3)
shape: (3, 7)
group_pre  grade_pre  date1_pre   date2_pre   rating_pre  zone_pre  class_pre
str        str        date        date        f64         f64       str
"one"      "1"        2023-01-01  2023-02-01  4.5         null      "x1"
"one"      "98"       2023-01-02  2023-02-02  3.5         NaN       null
"one"      "2"        2023-01-03  2023-02-03  4.0         1.0       null
# convert a column to uppercase
df.with_columns(pl.col('group').str.to_uppercase()).head()
shape: (5, 8)
group  grade  score  date1       date2       rating  zone  class
str    str    i64    date        date        f64     f64   str
"ONE"  "1"    45     2023-01-01  2023-02-01  4.5     null  "x1"
"ONE"  "98"   49     2023-01-02  2023-02-02  3.5     NaN   null
"ONE"  "2"    76     2023-01-03  2023-02-03  4.0     1.0   null
"TWO"  "3"    83     2023-01-04  2023-02-04  1.0     2.0   "c1"
"TWO"  "99"   69     2023-01-05  2023-02-05  1.2     4.0   "c2"
# concatenate list elements into a string
# note: in newer polars versions the .arr namespace below is called .list
(df
.with_columns(
    merged_list = pl.struct(["group","class"]).apply(lambda x: [x['group'], x['class']])
).with_columns(
    m_con=pl.col("merged_list").arr.join("_"),
    v4=pl.col("merged_list").arr.concat(pl.col("merged_list")), # concatenate two list columns into a bigger list
))
shape: (7, 11)
group    grade  score  date1       date2       rating  zone  class  merged_list      m_con       v4
str      str    i64    date        date        f64     f64   str    list[str]        str         list[str]
"one"    "1"    45     2023-01-01  2023-02-01  4.5     null  "x1"   ["one", "x1"]    "one_x1"    ["one", "x1", … "x1"]
"one"    "98"   49     2023-01-02  2023-02-02  3.5     NaN   null   ["one", null]    "one_null"  ["one", null, … null]
"one"    "2"    76     2023-01-03  2023-02-03  4.0     1.0   null   ["one", null]    "one_null"  ["one", null, … null]
"two"    "3"    83     2023-01-04  2023-02-04  1.0     2.0   "c1"   ["two", "c1"]    "two_c1"    ["two", "c1", … "c1"]
"two"    "99"   69     2023-01-05  2023-02-05  1.2     4.0   "c2"   ["two", "c2"]    "two_c2"    ["two", "c2", … "c2"]
"three"  "23"   90     2023-01-06  2023-02-06  9.4     5.0   "x2"   ["three", "x2"]  "three_x2"  ["three", "x2", … "x2"]
"three"  "11"   80     2023-01-07  2023-02-07  9.1     1.0   "j1"   ["three", "j1"]  "three_j1"  ["three", "j1", … "j1"]

Q. How to perform datetime functions using polars?

Similar to pandas, polars also provides datetime methods via the <col_name>.dt.<method_name> format.

df.with_columns(day=pl.col('date1').dt.day(),
                month=pl.col('date1').dt.month(),
                year=pl.col('date1').dt.year(),
                year_mon=pl.col('date1').cast(pl.Utf8).str.slice(0, 7),
               )
shape: (7, 12)
group    grade  score  date1       date2       rating  zone  class  day  month  year  year_mon
str      str    i64    date        date        f64     f64   str    u32  u32    i32   str
"one"    "1"    45     2023-01-01  2023-02-01  4.5     null  "x1"   1    1      2023  "2023-01"
"one"    "98"   49     2023-01-02  2023-02-02  3.5     NaN   null   2    1      2023  "2023-01"
"one"    "2"    76     2023-01-03  2023-02-03  4.0     1.0   null   3    1      2023  "2023-01"
"two"    "3"    83     2023-01-04  2023-02-04  1.0     2.0   "c1"   4    1      2023  "2023-01"
"two"    "99"   69     2023-01-05  2023-02-05  1.2     4.0   "c2"   5    1      2023  "2023-01"
"three"  "23"   90     2023-01-06  2023-02-06  9.4     5.0   "x2"   6    1      2023  "2023-01"
"three"  "11"   80     2023-01-07  2023-02-07  9.1     1.0   "j1"   7    1      2023  "2023-01"

Q. How to read parquet data from S3 with polars?

I couldn't find a straightforward way to do this, so I wrote a small utility function that reads multiple parquet files under an S3 prefix.

def read_with_polars(s3_path: str):
    # split "s3://bucket/key" into bucket name and key prefix
    bucket, _, key = s3_path.replace("s3://", "").partition("/")
    s3 = boto3.resource("s3")
    s3bucket = s3.Bucket(bucket)
    # list every object under the prefix (you may want to keep only ".parquet" keys)
    paths = [
        f"s3://{bucket}/{obj.key}"
        for obj in s3bucket.objects.filter(Prefix=key)
    ]
    print(len(paths))
    # read each file and concatenate everything into one dataframe
    return pl.concat(
        [
            pl.read_parquet(path, use_pyarrow=True, memory_map=True)
            for path in paths
        ]
    )
df = read_with_polars('s3://data-science-dev/polars/data/')
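
As a hedged aside, more recent polars versions can often read an S3 glob directly (with fsspec/s3fs installed), which may let you skip the helper entirely:

df = pl.read_parquet("s3://data-science-dev/polars/data/*.parquet")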

Summary

Overall, my experience with polars has been quite satisfying. Having deployed a Python project using polars, I feel more confident about reaching for it when handling large datasets. I will probably keep using pandas where it doesn't slow things down significantly. I also tried building an XGBoost model on polars dataframes, and it worked nicely without any need to convert them to pandas dataframes.
