Apply function to pandas DataFrame that can return multiple rows
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, keep a link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/13050003/
Apply function to pandas DataFrame that can return multiple rows
Asked by btel
I am trying to transform a DataFrame such that some of the rows will be replicated a given number of times. For example:
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})

  class  count
0     A      1
1     B      0
2     C      2
should be transformed to:
  class
0     A
1     C
2     C
This is the reverse of aggregation with the count function (sketched just below). Is there an easy way to achieve it in pandas (without using for loops or list comprehensions)?
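For reference, the aggregation being reversed here would look something like this sketch (the expanded variable name is just for illustration; note that a class with a count of 0, like B, cannot be recovered this way):

expanded = pd.DataFrame({'class': ['A', 'C', 'C']})
expanded.groupby('class').size().reset_index(name='count')

  class  count
0     A      1
1     C      2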
One possibility might be to allow the DataFrame.applymap function to return multiple rows (akin to the apply method of GroupBy). However, I do not think it is possible in pandas now.
Answered by Wes McKinney
You could use groupby:
from pandas import DataFrame

def f(group):
    row = group.irow(0)
    return DataFrame({'class': [row['class']] * row['count']})

df.groupby('class', group_keys=False).apply(f)
so you get
In [25]: df.groupby('class', group_keys=False).apply(f)
Out[25]:
  class
0     A
0     C
1     C
You can fix the index of the result however you like.
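Note that irow was deprecated and later removed from pandas, so this exact snippet no longer runs on recent versions. A minimal sketch of the same groupby idea using iloc instead (my adaptation, not part of the original answer):

import pandas as pd

def f(group):
    row = group.iloc[0]  # first row of the group, replacing the old irow(0)
    return pd.DataFrame({'class': [row['class']] * row['count']})

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
df.groupby('class', group_keys=False).apply(f).reset_index(drop=True)

  class
0     A
1     C
2     C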
Answered by Radical Edward
I know this is an old question, but I was having trouble getting Wes' answer to work for multiple columns in the dataframe, so I made his code a bit more generic. Thought I'd share in case anyone else stumbles on this question with the same problem.
You basically just specify which column has the counts in it, and you get an expanded dataframe in return.
import pandas as pd

df = pd.DataFrame({'class 1': ['A', 'B', 'C', 'A'],
                   'class 2': [  1,   2,   3,   1],
                   'count':   [  3,   3,   3,   1]})

print df, "\n"

def f(group, *args):
    row = group.irow(0)
    Dict = {}
    row_dict = row.to_dict()
    for item in row_dict: Dict[item] = [row[item]] * row[args[0]]
    return pd.DataFrame(Dict)

def ExpandRows(df, WeightsColumnName):
    df_expand = df.groupby(df.columns.tolist(), group_keys=False).apply(f, WeightsColumnName).reset_index(drop=True)
    return df_expand

df_expanded = ExpandRows(df, 'count')
print df_expanded
Returns:
  class 1  class 2  count
0       A        1      3
1       B        2      3
2       C        3      3
3       A        1      1

  class 1  class 2  count
0       A        1      1
1       A        1      3
2       A        1      3
3       A        1      3
4       B        2      3
5       B        2      3
6       B        2      3
7       C        3      3
8       C        3      3
9       C        3      3
With regards to speed, my base df is 10 columns by ~6k rows, and when expanded to ~100,000 rows it takes ~7 seconds. I'm not sure in this case whether grouping is necessary or wise, since it groups on all of the columns, but hey, it's only 7 seconds.
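If the groupby overhead is a concern, the same expansion can be done without grouping at all. Below is a sketch of that idea (my adaptation, not the code from this answer); the ExpandRowsNoGroupby name is just illustrative, and it assumes the weights column holds non-negative integers:

import numpy as np
import pandas as pd

def ExpandRowsNoGroupby(df, WeightsColumnName):
    # illustrative helper, not from the original answer:
    # repeat each positional index according to its weight, then take those rows
    idx = np.repeat(np.arange(len(df)), df[WeightsColumnName])
    return df.iloc[idx].reset_index(drop=True)

df_expanded = ExpandRowsNoGroupby(df, 'count')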
Answered by Ted Petrou
This question is very old and the answers do not reflect pandas' modern capabilities. You can use iterrows to loop over every row and then use the DataFrame constructor to create new DataFrames with the correct number of rows. Finally, use pd.concat to concatenate all the rows together.
pd.concat([pd.DataFrame(data=[row], index=range(row['count']))
           for _, row in df.iterrows()], ignore_index=True)
  class  count
0     A      1
1     C      2
2     C      2
This has the benefit of working with any size DataFrame.
Answered by Shay
There is an even simpler and significantly more efficient solution. I had to make a similar modification for a table of about 3.5M rows, and the previously suggested solutions were extremely slow.
A better way is to use numpy's repeat function to generate a new index in which each row index is repeated multiple times according to its given count, and to use iloc to select rows of the original table according to this index:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
spread_ixs = np.repeat(range(len(df)), df['count'])
spread_ixs
array([0, 2, 2])
df.iloc[spread_ixs, :].drop(columns='count').reset_index(drop=True)
  class
0     A
1     C
2     C
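A closely related one-liner (a sketch of my own, not part of the original answer) lets pandas build the repeated index via Index.repeat instead of constructing the array by hand; it assumes a pandas version that supports drop(columns=...):

df.loc[df.index.repeat(df['count'])].drop(columns='count').reset_index(drop=True)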
Answered by Matti John
repeated_items = [list(row[1]*row[2]) for row in df.itertuples()]
will create a nested list (note that row[1]*row[2] relies on the class values being strings, since it repeats the string row[2] times):
[['A'], [], ['C', 'C']]
which you can then iterate over with list comprehensions to create a new data frame:
new_df = pd.DataFrame({"class":[j for i in repeated_items for j in i]})
Of course, you can do it in a single line as well if you want:
new_df = pd.DataFrame({"class":[j for i in [list(row[1]*row[2]) for row in df.itertuples()] for j in i]})

