
What is the Difference Between Polars concat(), vstack(), and extend()?

On the surface, Polars concat(), vstack(), and extend() seem very similar. All three can vertically combine two DataFrames, much like a SQL UNION ALL (no rows are deduplicated). However, the way each one achieves this is quite different, and the choice can have significant performance implications. This post walks through the differences.

Sample DataFrame

We will use the two DataFrames below for the examples in this post.

import polars as pl

# create first dataframe
df1 = pl.DataFrame(
    {
        "employee_id": [12, 34],
        "sales": [541, 450],
    }
)
df1

shape: (2, 2)
┌─────────────┬───────┐
│ employee_id ┆ sales │
│ ---         ┆ ---   │
│ i64         ┆ i64   │
╞═════════════╪═══════╡
│ 12          ┆ 541   │
│ 34          ┆ 450   │
└─────────────┴───────┘

# create second dataframe
df2 = pl.DataFrame(
    {
        "employee_id": [65, 33],
        "sales": [675, 112],
    }
)
df2

shape: (2, 2)
┌─────────────┬───────┐
│ employee_id ┆ sales │
│ ---         ┆ ---   │
│ i64         ┆ i64   │
╞═════════════╪═══════╡
│ 65          ┆ 675   │
│ 33          ┆ 112   │
└─────────────┴───────┘

Using n_chunks()

In this tutorial we will use the DataFrame method n_chunks(), which returns the number of memory "chunks" backing the DataFrame's columns. By passing strategy="all" to n_chunks(), we can see the chunk count for each column.

Looking at vstack()

vstack() vertically stacks two DataFrames, similar to a SQL UNION ALL of two tables. vstack() does not copy any data from either DataFrame; it creates a new DataFrame by linking to the existing memory chunks.

Since no data is copied, vstack() is a very lightweight operation. The tradeoff is that with more than one memory chunk per column, later operations on the result can be slower.

Below is an example using vstack(). We can also use n_chunks() to see that each column of the resulting DataFrame now spans two memory chunks.

# create df3 by vertically stacking two dataframes
df3 = df1.vstack(df2)
df3

shape: (4, 2)
┌─────────────┬───────┐
│ employee_id ┆ sales │
│ ---         ┆ ---   │
│ i64         ┆ i64   │
╞═════════════╪═══════╡
│ 12          ┆ 541   │
│ 34          ┆ 450   │
│ 65          ┆ 675   │
│ 33          ┆ 112   │
└─────────────┴───────┘

# check number of chunks in our new dataframe
df3.n_chunks(strategy="all")

[2, 2]

Looking at extend()

extend() also vertically stacks two DataFrames, similar to a SQL UNION ALL of two tables. But unlike vstack(), extend() copies the data from the second DataFrame and appends it to the first DataFrame's own memory locations, modifying the existing DataFrame in place.

Copying the second DataFrame requires more processing time than vstack(), but we are left with a DataFrame that has a single memory chunk per column.

Below is an example using extend(). We can also use n_chunks() to confirm that our DataFrame still uses only a single memory chunk per column.

# print df1 before extend()
df1

shape: (2, 2)
┌─────────────┬───────┐
│ employee_id ┆ sales │
│ ---         ┆ ---   │
│ i64         ┆ i64   │
╞═════════════╪═══════╡
│ 12          ┆ 541   │
│ 34          ┆ 450   │
└─────────────┴───────┘

# use extend() to append df2 to df1
df1.extend(df2)

shape: (4, 2)
┌─────────────┬───────┐
│ employee_id ┆ sales │
│ ---         ┆ ---   │
│ i64         ┆ i64   │
╞═════════════╪═══════╡
│ 12          ┆ 541   │
│ 34          ┆ 450   │
│ 65          ┆ 675   │
│ 33          ┆ 112   │
└─────────────┴───────┘

# confirm that df1 still only uses 1 memory chunk
df1.n_chunks(strategy="all")

[1, 1]

Looking at concat()

The concat() function is more flexible than vstack() and extend(). concat() can merge one or more DataFrames into a single DataFrame. By default, concat() merges DataFrames vertically (adding rows), similar to a SQL UNION ALL. But concat() can also append DataFrames horizontally (adding columns).

By default concat() acts like vstack(), linking to additional memory chunks for each appended DataFrame column. However, concat() allows us to pass the parameter rechunk=True, which copies the data from all input DataFrames into a single memory chunk per column, allowing for faster subsequent queries.

Below is an example using concat(). We will also see how the number of memory chunks in the result changes when using rechunk=True.

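# create df3 by using vertical concatenation
# (df1 is assumed to be re-created from the sample data first, since extend() above modified it in place)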
df3 = pl.concat([df1, df2])
df3

shape: (4, 2)
┌─────────────┬───────┐
│ employee_id ┆ sales │
│ ---         ┆ ---   │
│ i64         ┆ i64   │
╞═════════════╪═══════╡
│ 12          ┆ 541   │
│ 34          ┆ 450   │
│ 65          ┆ 675   │
│ 33          ┆ 112   │
└─────────────┴───────┘

# view number of memory chunks
df3.n_chunks(strategy="all")

[2, 2]

# view number of memory chunks using rechunk=True
pl.concat([df1, df2], rechunk=True).n_chunks(strategy="all")

[1, 1]

When to Use vstack()

vstack() is a good fit when appending many times before running any operations. An example is reading multiple files and appending them into a single DataFrame. If needed, each column can be rechunked into a single memory location once, on the final DataFrame.

Another reason to use vstack() is when only lightweight operations, such as shape, head(), or tail(), will be run on the result after appending.

When to Use extend()

Using extend() makes sense when we need to query the DataFrame after appending and we would like to alter the DataFrame in place. Since extend() copies data from one DataFrame into another, it works well for appending small DataFrames to larger ones. It is still possible for extend() to trigger a reallocation of memory, but this only occurs when needed.

When to Use concat()

concat() is useful when appending multiple DataFrames at once, with the option of rechunking in the same call. concat() also has additional functionality not covered in this post.