What is the Difference Between Polars concat(), vstack(), and extend()?

On the surface, Polars concat(), vstack(), and extend() seem very similar. All three can vertically combine two DataFrames, much like a SQL UNION ALL (rows are appended and duplicates are kept). However, the way each one achieves this is quite different and can have significant performance implications. We will discuss the differences in this post.

Sample DataFrame

We will use the two DataFrames below for the examples in this post.

import polars as pl

# create first dataframe
df1 = pl.DataFrame(
    {
        "employee_id": [12, 34],
        "sales": [541, 450],
    }
)
df1

shape: (2, 2)
┌─────────────┬───────┐
│ employee_id ┆ sales │
│ ---         ┆ ---   │
│ i64         ┆ i64   │
╞═════════════╪═══════╡
│ 12          ┆ 541   │
│ 34          ┆ 450   │
└─────────────┴───────┘

# create second dataframe
df2 = pl.DataFrame(
    {
        "employee_id": [65, 33],
        "sales": [675, 112],
    }
)
df2

shape: (2, 2)
┌─────────────┬───────┐
│ employee_id ┆ sales │
│ ---         ┆ ---   │
│ i64         ┆ i64   │
╞═════════════╪═══════╡
│ 65          ┆ 675   │
│ 33          ┆ 112   │
└─────────────┴───────┘

Using `n_chunks()`

In this tutorial we will use the DataFrame method n_chunks(), which reports how many memory "chunks" back the DataFrame's columns. By default (strategy="first") it returns the chunk count of the first column only. By passing strategy="all" to n_chunks() we get the chunk count for every column.

Looking at `vstack()`

vstack() will vertically stack two DataFrames, similar to a SQL UNION ALL of two tables. vstack() does not copy any data from either DataFrame; it creates a new DataFrame by linking to the existing memory chunks.

Since we are not copying any data, vstack() is a very lightweight operation. The tradeoff is that once a column is spread over more than one memory chunk, later operations on it can be slower.

Below is an example using vstack(). We can also use n_chunks() to see that each column of the new DataFrame now spans two memory chunks.

# create df3 by vertically stacking two dataframes
df3 = df1.vstack(df2)
df3

shape: (4, 2)
┌─────────────┬───────┐
│ employee_id ┆ sales │
│ ---         ┆ ---   │
│ i64         ┆ i64   │
╞═════════════╪═══════╡
│ 12          ┆ 541   │
│ 34          ┆ 450   │
│ 65          ┆ 675   │
│ 33          ┆ 112   │
└─────────────┴───────┘

# check number of chunks in our new dataframe
df3.n_chunks(strategy="all")

[2, 2]

Looking at `extend()`

extend() will also vertically stack two DataFrames, similar to a SQL UNION ALL of two tables. But unlike vstack(), extend() copies the data from the second DataFrame and appends it to the first DataFrame's own memory locations, modifying the first DataFrame in place.

Copying the second DataFrame requires more processing time than vstack(), but in the end we are left with a DataFrame that has a single memory chunk per column.

Below is an example using extend(). We can also use n_chunks() to confirm that our DataFrame still uses only a single memory chunk per column.

# print df1 before extend()
df1

shape: (2, 2)
┌─────────────┬───────┐
│ employee_id ┆ sales │
│ ---         ┆ ---   │
│ i64         ┆ i64   │
╞═════════════╪═══════╡
│ 12          ┆ 541   │
│ 34          ┆ 450   │
└─────────────┴───────┘

# use extend() to append df2 to df1
df1.extend(df2)

shape: (4, 2)
┌─────────────┬───────┐
│ employee_id ┆ sales │
│ ---         ┆ ---   │
│ i64         ┆ i64   │
╞═════════════╪═══════╡
│ 12          ┆ 541   │
│ 34          ┆ 450   │
│ 65          ┆ 675   │
│ 33          ┆ 112   │
└─────────────┴───────┘

# confirm that df1 still only uses 1 memory chunk
df1.n_chunks(strategy="all")

[1, 1]

Looking at `concat()`

The concat() function is more flexible than vstack() and extend(). concat() can merge one or more DataFrames into a single DataFrame. By default concat() merges DataFrames vertically (adding rows), similar to a SQL UNION ALL. But concat() can also combine DataFrames horizontally (adding columns).

By default concat() acts like vstack(), linking to an additional memory chunk for each appended DataFrame column. However, concat() also allows us to pass the parameter rechunk=True, which copies the data from all DataFrames into a single memory chunk per column, allowing for faster subsequent queries.

Below is an example using concat(). We will also see how the number of memory chunks in the result changes when using rechunk=True.

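# recreate df1, since extend() above modified it in place
df1 = pl.DataFrame({"employee_id": [12, 34], "sales": [541, 450]})

# create df3 by using vertical concatenation
df3 = pl.concat([df1, df2])
df3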

shape: (4, 2)
┌─────────────┬───────┐
│ employee_id ┆ sales │
│ ---         ┆ ---   │
│ i64         ┆ i64   │
╞═════════════╪═══════╡
│ 12          ┆ 541   │
│ 34          ┆ 450   │
│ 65          ┆ 675   │
│ 33          ┆ 112   │
└─────────────┴───────┘

# view number of memory chunks
df3.n_chunks(strategy="all")

[2, 2]

# view number of memory chunks using rechunk=True
pl.concat([df1, df2], rechunk=True).n_chunks(strategy="all")

[1, 1]

When to Use `vstack()`

vstack() is often used when appending multiple times before running any queries. An example of this is reading multiple files and appending them into a single DataFrame. If needed, each column can then be rechunked into a single memory location once, on the final DataFrame.

Another reason to use vstack() is when only lightweight operations will be run on the result after appending, for example shape, head(), or tail().

When to Use `extend()`

Using extend() makes sense when we need to query the DataFrame after appending and we would like to alter the DataFrame in place. Since extend() copies data from one DataFrame into another, it works well for appending small DataFrames to larger ones. It is still possible for extend() to trigger a reallocation of memory, but this only occurs when needed.

When to Use `concat()`

concat() is useful when appending multiple DataFrames at once, with the option of rechunking in the same command. concat() also has additional functionality not covered in this post.
