Repeat rows in DataFrame with respect to column

Question

I have a Pandas DataFrame that looks like this:

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})

df
    col1    col2    col3
0      1       4       7
1      2       5       8
2      3       6       9

I would like to create a Pandas DataFrame like this:

df_new
    col1    col2    col3
0      1       4       7
1      1       5       8
2      1       6       9
3      2       4       7
4      2       5       8
5      2       6       9
6      3       4       7
7      3       5       8
8      3       6       9

Is there built-in or combination of built-in Pandas methods that can achieve this?

Even if there are duplicates in df, I would like the output to be the same format. In other words:

df
    col1    col2    col3
0      1       4       7
1      2       5       8
2      2       6       8

df_new
    col1    col2    col3
0      1       4       7
1      1       5       8
2      1       6       8
3      2       4       7
4      2       5       8
5      2       6       8
6      2       4       7
7      2       5       8
8      2       6       8

Thanks in advance for any suggestions!

Can you include some more details about your scaling logic? It looks like you're taking the Cartesian product between col1 and the rest of the columns which would just be a standard cross join between col1 and [col2, col3] which would be significantly faster than manual scaling. Something like df[['col1']].merge(df[['col2', 'col3']], how='cross').reset_index(drop=True) Is this correct or do you actually just need to scale your DataFrame a fixed number of times based on column values or otherwise? — 2 days ago
@HenryEcker. Great use of how='cross'! It could also be to_tile=['col1']; df[to_tile].merge(df[df.columns.difference(to_tile)], how='cross').reset_index(drop=True) — 2 days ago

score 6 · Accepted Answer · 2023-05-26 08:56:29Z

6

I would also have gone for a cross merge as suggested by @Henry in comments:

out = df[['col1']].merge(df[['col2', 'col3']], how='cross').reset_index(drop=True)

Output:

   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9

Comparison of the different approaches:

Note that @sammywemmy’s approach behaves differently when rows are duplicated, which leads to a non comparable timing.

edited 2 days ago

answered 2 days ago

mozway

178k12 gold badges33 silver badges73 bronze badges

11

Interesting timing data. It kind of looks like the other solutions might be going to catch mine as length increases…

– Nick

2 days ago
for the nick_concat solution, I suspect the copies affect the speed (a copy for drop, a copy for assign, a copy for reset_index).

– sammywemmy

2 days ago
@Nick I added some more rows (it previously crashed on another computer as the combinations are memory expensive). Looks like they converge with still a difference

– mozway

2 days ago
1

@sammywemmy regarding your approach, it doesn't provide the same output as the others when there are duplicated rows (thus the different timing)

– mozway

2 days ago
1

@mozway Cool stuff. Interesting to see itertools as the slowest option. I love questions that get such a variety of responses and someone doing the timing.

– Nick

2 days ago

|
Show 6 more comments

score 6 · Accepted Answer · 2023-05-26 04:26:55Z

I would love to see a more pythonic or a ‘pandas-exclusive’ answer, but this one also works good!

import pandas as pd
import numpy as np

n=3

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})

# Edited and added this new method.
df2 = pd.DataFrame({df.columns[0]:np.repeat(df['col1'].values, n)})
df2[df.columns[1:]] = df.iloc[:,1:].apply(lambda x: np.tile(x, n))

""" Old method.
for col in df.columns[1:]:
   df2[col] = np.tile(df[col].values, n)

"""
print(df2)

NickNick 138k22 gold badges57 silver badges95 bronze badges · Accepted Answer · 2023-05-26 04:31:04Z

You could concatenate copies of the dataframe, with col1 replaced in each copy by each of the values in col1:

out = df.drop('col1', axis=1)
out = pd.concat([out.assign(col1=c1) for c1 in df['col1']]).reset_index(drop=True)

Output:

   col2  col3  col1
0     4     7     1
1     5     8     1
2     6     9     1
3     4     7     2
4     5     8     2
5     6     9     2
6     4     7     3
7     5     8     3
8     6     9     3

If you prefer, you can then re-order the columns back to the original using

out = out[['col1', 'col2', 'col3']]

score 5 · Accepted Answer · 2023-05-26 06:20:40Z

You can use np.repeat and np.tile to get the expected output:

import numpy as np

N = 3
cols_to_repeat = ['col1']  # 1, 1, 1, 2, 2, 2
cols_to_tile = ['col2', 'col3']  # 1, 2, 1, 2, 1, 2

data = np.concatenate([np.tile(df[cols_to_tile].values.T, N).T,
                       np.repeat(df[cols_to_repeat].values, N, axis=0)], axis=1)
out = pd.DataFrame(data, columns=cols_to_tile + cols_to_repeat)[df.columns]

Output:

>>> out
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9

You can create a generic function:

def repeat(df: pd.DataFrame, to_repeat: list[str], to_tile: list[str]=None) -> pd.DataFrame:
    to_tile = to_tile if to_tile else df.columns.difference(to_repeat).tolist()

    assert df.columns.difference(to_repeat + to_tile).empty, "all columns should be repeated or tiled"

    data = np.concatenate([np.tile(df[to_tile].values.T, N).T,
                           np.repeat(df[to_repeat].values, N, axis=0)], axis=1)

    return pd.DataFrame(data, columns=to_tile + to_repeat)[df.columns]

repeat(df, ['col1'])

Usage:

>>> repeat(df, ['col1'])
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9

PaulSPaulS 20.3k2 gold badges9 silver badges25 bronze badges · Accepted Answer · 2023-05-26 07:05:41Z

Another possible solution, which is based on itertools.product:

from itertools import product

pd.DataFrame([[x, y[0], y[1]] for x, y in 
              product(df['col1'], zip(df['col2'], df['col3']))], 
             columns=df.columns)

Output:

   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9

score 5 · Accepted Answer · 2023-05-26 12:57:48Z

One option is with complete from pyjanitor:

# pip install pyjanitor
import janitor 
import pandas as pd

df.complete('col1', ('col2','col3'))
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9

complete primarily is for exposing missing rows – the output above just happens to be a nice side effect. A more appropriate, albeit quite verbose option is expand_grid:

# pip install pyjanitor
import janitor as jn
import pandas as pd

others = {'df1':df.col1, 'df2':df[['col2','col3']]}
jn.expand_grid(others=others).droplevel(axis=1,level=0)
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     8
3     2     4     7
4     2     5     8
5     2     6     8
6     2     4     7
7     2     5     8
8     2     6     8

It's nice and short for the non-duplicates case but unfortunately doesn't handle the other case correctly. — 2 days ago

Repeat rows in DataFrame with respect to column