Expand pandas DataFrame column into multiple rows
Original question: http://stackoverflow.com/questions/38203352/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license and attribute it to the original authors on Stack Overflow.
Asked by gozzilli
If I have a DataFrame such that:
pd.DataFrame( {"name" : "John",
"days" : [[1, 3, 5, 7]]
})
gives this structure:
           days  name
0  [1, 3, 5, 7]  John
How do I expand it to the following?
   days  name
0     1  John
1     3  John
2     5  John
3     7  John
Accepted answer by unutbu
You could use df.itertuples to iterate through each row, and use a list comprehension to reshape the data into the desired form:
import pandas as pd
df = pd.DataFrame( {"name" : ["John", "Eric"],
"days" : [[1, 3, 5, 7], [2,4]]})
result = pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])
print(result)
yields
   0     1
0  1  John
1  3  John
2  5  John
3  7  John
4  2  Eric
5  4  Eric
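The result above has the default column labels 0 and 1. A minimal sketch of the same list comprehension with explicit labels follows; the labels "days" and "name" are chosen here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

# Same list comprehension as in the answer, with column labels passed
# explicitly (the labels are an assumption made for this example).
result = pd.DataFrame(
    [(d, tup.name) for tup in df.itertuples() for d in tup.days],
    columns=["days", "name"],
)
print(result)
```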
Divakar's solution, using_repeat, is fastest:
In [48]: %timeit using_repeat(df)
1000 loops, best of 3: 834 μs per loop
In [5]: %timeit using_itertuples(df)
100 loops, best of 3: 3.43 ms per loop
In [7]: %timeit using_apply(df)
1 loop, best of 3: 379 ms per loop
In [8]: %timeit using_append(df)
1 loop, best of 3: 3.59 s per loop
Here is the setup used for the above benchmark:
import numpy as np
import pandas as pd

N = 10**3
df = pd.DataFrame({"name": np.random.choice(list('ABCD'), size=N),
                   "days": [np.random.randint(10, size=np.random.randint(5))
                            for i in range(N)]})

def using_itertuples(df):
    # one output row per (row, list element) pair
    return pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])

def using_repeat(df):
    # repeat each name once per element of its list, then flatten the lists
    lens = [len(item) for item in df['days']]
    return pd.DataFrame({"name": np.repeat(df['name'].values, lens),
                         "days": np.concatenate(df['days'].values)})

def using_apply(df):
    return (df.apply(lambda x: pd.Series(x.days), axis=1)
              .stack()
              .reset_index(level=1, drop=True)
              .to_frame('day')
              .join(df['name']))

def using_append(df):
    # DataFrame.append was removed in pandas 2.0, so this version needs pandas < 2.0
    df2 = pd.DataFrame(columns=df.columns)
    for i, r in df.iterrows():
        for e in r.days:
            new_r = r.copy()
            new_r.days = e
            df2 = df2.append(new_r)
    return df2
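The timings above were taken with IPython's %timeit. As a rough sketch of how two of the candidates could be timed in a plain Python script, the standard-library timeit module can be used instead. The seeded generator, the number of repetitions and the column labels passed to using_itertuples are assumptions made for this example, and the numbers it prints will differ from those quoted above.

```python
import timeit

import numpy as np
import pandas as pd

N = 10**3
rng = np.random.default_rng(0)  # seeded generator: an assumption, for repeatability
df = pd.DataFrame({"name": rng.choice(list("ABCD"), size=N),
                   "days": [rng.integers(10, size=rng.integers(5)) for _ in range(N)]})

def using_repeat(df):
    lens = [len(item) for item in df["days"]]
    return pd.DataFrame({"name": np.repeat(df["name"].values, lens),
                         "days": np.concatenate(df["days"].values)})

def using_itertuples(df):
    return pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days],
                        columns=["days", "name"])

# Average wall-clock time over 10 calls of each function.
for func in (using_repeat, using_itertuples):
    seconds = timeit.timeit(lambda: func(df), number=10) / 10
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```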
Answer by Divakar
Here's an approach using NumPy:
lens = [len(item) for item in df['days']]
df_out = pd.DataFrame( {"name" : np.repeat(df['name'].values,lens),
"days" : np.hstack(df['days'])
})
As pointed out in @unutbu's solution, np.concatenate(df['days'].values) would be faster than np.hstack(df['days']).
It uses a list comprehension to extract the length of each 'days' element, which should add very little to the runtime.
Sample run:
>>> df
           days  name
0  [1, 3, 5, 7]  John
1        [2, 4]  Eric
>>> lens = [len(item) for item in df['days']]
>>> pd.DataFrame( {"name" : np.repeat(df['name'].values,lens),
... "days" : np.hstack(df['days'])
... })
   days  name
0     1  John
1     3  John
2     5  John
3     7  John
4     2  Eric
5     4  Eric
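For comparison, here is a minimal sketch of the same idea using the np.concatenate variant mentioned above. Taking the lengths with Series.str.len() instead of the list comprehension is a substitution made for this example, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

# Length of each list in the 'days' column (an alternative to the list comprehension).
lens = df["days"].str.len().values

# Repeat each name once per list element, and flatten all lists into one array.
df_out = pd.DataFrame({"name": np.repeat(df["name"].values, lens),
                       "days": np.concatenate(df["days"].values)})
print(df_out)
```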
Answer by jeremycg
A 'native' pandas solution: we unstack the column into a series, then join it back based on the index:
import pandas as pd
# x is the DataFrame from the question
# make an unstacked series, x2, holding one list element per entry
x2 = x.days.apply(lambda x: pd.Series(x)).unstack()
# drop the days column, then join the remaining columns to the x2 series on the index
x.drop('days', axis = 1).join(pd.DataFrame(x2.reset_index(level=0, drop=True)))
Answer by philshem
Since pandas 0.25 you can use the explode() method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html
import pandas as pd
df = pd.DataFrame( {"name" : "John",
"days" : [[1, 3, 5, 7]]})
print(df.explode('days'))
prints
   name days
0  John    1
0  John    3
0  John    5
0  John    7
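Note that explode() repeats the original index label for every new row, and the exploded column comes back with object dtype. A small sketch of tidying both follows; the two-row frame and the choice of int64 are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

exploded = df.explode("days")  # index is 0, 0, 0, 0, 1, 1

# Renumber the rows and convert the exploded column to integers.
tidy = exploded.reset_index(drop=True).astype({"days": "int64"})
print(tidy)
```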
Answer by MaxU
Another solution:
In [139]: (df.apply(lambda x: pd.Series(x.days), axis=1)
   .....:    .stack()
   .....:    .reset_index(level=1, drop=1)
   .....:    .to_frame('day')
   .....:    .join(df['name'])
   .....: )
Out[139]:
   day  name
0    1  John
0    3  John
0    5  John
0    7  John
Answer by AnnetteC
Probably something like this:
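# DataFrame.append was removed in pandas 2.0, so this loop needs pandas < 2.0
df2 = pd.DataFrame(columns = df.columns)
for i,r in df.iterrows():
    for e in r.days:
        # copy the row and replace the list with a single element
        new_r = r.copy()
        new_r.days = e
        df2 = df2.append(new_r)
df2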
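DataFrame.append was removed in pandas 2.0, so the loop above only runs on older versions. One possible adaptation, sketched here as an assumption rather than as the original author's method, collects the copied rows in a plain list and builds the DataFrame once at the end:

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

rows = []
for i, r in df.iterrows():
    for e in r["days"]:
        new_r = r.copy()      # copy the whole row
        new_r["days"] = e     # replace the list with one of its elements
        rows.append(new_r)

# Build the result in one step instead of appending row by row.
df2 = pd.DataFrame(rows)
print(df2)
```

As in the original loop, each new row keeps the index label of the row it came from.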
Answer by Gautham Kumaran
Thanks to Divakar's solution, I wrote it as a wrapper function to flatten a column, handling np.nan and DataFrames with multiple columns:
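import numpy as np
import pandas as pd

def flatten_column(df, column_name):
    # a NaN entry counts as a single element so that its row is kept
    repeat_lens = [len(item) if item is not np.nan else 1 for item in df[column_name]]
    df_columns = list(df.columns)
    df_columns.remove(column_name)
    # repeat the other columns once per element of the list column
    expanded_df = pd.DataFrame(np.repeat(df.drop(column_name, axis=1).values, repeat_lens, axis=0), columns=df_columns)
    flat_column_values = np.hstack(df[column_name].values)
    expanded_df[column_name] = flat_column_values
    # np.hstack turns NaN into the string 'nan' when the lists hold strings
    expanded_df[column_name] = expanded_df[column_name].replace('nan', np.nan)
    return expanded_df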
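On pandas 0.25 or later, the explode() method mentioned in another answer appears to cover the same two cases: it leaves NaN entries in place and carries the other columns along. A minimal sketch, in which the sample frame and its "age" column are made up for illustration:

```python
import numpy as np
import pandas as pd

# Sample data is hypothetical: one row has NaN instead of a list.
df = pd.DataFrame({"name": ["John", "Eric", "Ann"],
                   "age": [30, 25, 41],
                   "days": [[1, 3], np.nan, [2]]})

# Each list element gets its own row; the NaN row is kept as a single row.
flat = df.explode("days").reset_index(drop=True)
print(flat)
```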