Apply function to pandas DataFrame that can return multiple rows
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, keep a link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/13050003/
Apply function to pandas DataFrame that can return multiple rows
Asked by btel
I am trying to transform a DataFrame such that some of the rows will be replicated a given number of times. For example:
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})

  class  count
0     A      1
1     B      0
2     C      2
should be transformed to:
  class
0     A
1     C
2     C
This is the reverse of aggregation with the count function (sketched just below). Is there an easy way to achieve it in pandas (without using for loops or list comprehensions)?
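For reference, the aggregation being reversed here would look something like this sketch (the expanded variable name is just for illustration; note that a class with a count of 0, like B, cannot be recovered this way):

expanded = pd.DataFrame({'class': ['A', 'C', 'C']})
expanded.groupby('class').size().reset_index(name='count')

  class  count
0     A      1
1     C      2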
One possibility might be to allow the DataFrame.applymap function to return multiple rows (akin to the apply method of GroupBy). However, I do not think it is possible in pandas now.
Answered by Wes McKinney
You could use groupby:
from pandas import DataFrame

def f(group):
    row = group.irow(0)
    return DataFrame({'class': [row['class']] * row['count']})

df.groupby('class', group_keys=False).apply(f)
so you get
In [25]: df.groupby('class', group_keys=False).apply(f)
Out[25]:
  class
0     A
0     C
1     C
You can fix the index of the result however you like.
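Note that irow was deprecated and later removed from pandas, so this exact snippet no longer runs on recent versions. A minimal sketch of the same groupby idea using iloc instead (my adaptation, not part of the original answer):

import pandas as pd

def f(group):
    row = group.iloc[0]  # first row of the group, replacing the old irow(0)
    return pd.DataFrame({'class': [row['class']] * row['count']})

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
df.groupby('class', group_keys=False).apply(f).reset_index(drop=True)

  class
0     A
1     C
2     C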
Answered by Radical Edward
I know this is an old question, but I was having trouble getting Wes' answer to work for multiple columns in the dataframe, so I made his code a bit more generic. Thought I'd share in case anyone else stumbles on this question with the same problem.
You basically just specify which column has the counts in it, and you get an expanded dataframe in return.
import pandas as pd

df = pd.DataFrame({'class 1': ['A', 'B', 'C', 'A'],
                   'class 2': [  1,   2,   3,   1],
                   'count':   [  3,   3,   3,   1]})

print df, "\n"

def f(group, *args):
    row = group.irow(0)
    Dict = {}
    row_dict = row.to_dict()
    for item in row_dict: Dict[item] = [row[item]] * row[args[0]]
    return pd.DataFrame(Dict)

def ExpandRows(df, WeightsColumnName):
    df_expand = df.groupby(df.columns.tolist(), group_keys=False).apply(f, WeightsColumnName).reset_index(drop=True)
    return df_expand

df_expanded = ExpandRows(df, 'count')
print df_expanded
Returns:
  class 1  class 2  count
0       A        1      3
1       B        2      3
2       C        3      3
3       A        1      1

  class 1  class 2  count
0       A        1      1
1       A        1      3
2       A        1      3
3       A        1      3
4       B        2      3
5       B        2      3
6       B        2      3
7       C        3      3
8       C        3      3
9       C        3      3
With regards to speed, my base df is 10 columns by ~6k rows, and when expanded to ~100,000 rows it takes ~7 seconds. I'm not sure in this case whether grouping is necessary or wise, since it groups on all of the columns, but hey, it's only 7 seconds.
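If the groupby overhead is a concern, the same expansion can be done without grouping at all. Below is a sketch of that idea (my adaptation, not the code from this answer); the ExpandRowsNoGroupby name is just illustrative, and it assumes the weights column holds non-negative integers:

import numpy as np
import pandas as pd

def ExpandRowsNoGroupby(df, WeightsColumnName):
    # illustrative helper, not from the original answer:
    # repeat each positional index according to its weight, then take those rows
    idx = np.repeat(np.arange(len(df)), df[WeightsColumnName])
    return df.iloc[idx].reset_index(drop=True)

df_expanded = ExpandRowsNoGroupby(df, 'count')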
Answered by Ted Petrou
This question is very old and the answers do not reflect pandas' modern capabilities. You can use iterrows to loop over every row and then use the DataFrame constructor to create new DataFrames with the correct number of rows. Finally, use pd.concat to concatenate all the rows together.
pd.concat([pd.DataFrame(data=[row], index=range(row['count']))
           for _, row in df.iterrows()], ignore_index=True)
  class  count
0     A      1
1     C      2
2     C      2
This has the benefit of working with any size DataFrame.
Answered by Shay
There is an even simpler and significantly more efficient solution. I had to make a similar modification for a table of about 3.5M rows, and the previously suggested solutions were extremely slow.
A better way is to use numpy's repeat function to generate a new index in which each row index is repeated multiple times according to its given count, and to use iloc to select rows of the original table according to this index:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
spread_ixs = np.repeat(range(len(df)), df['count'])
spread_ixs
array([0, 2, 2])
df.iloc[spread_ixs, :].drop(columns='count').reset_index(drop=True)
  class
0     A
1     C
2     C
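A closely related one-liner (a sketch of my own, not part of the original answer) lets pandas build the repeated index via Index.repeat instead of constructing the array by hand; it assumes a pandas version that supports drop(columns=...):

df.loc[df.index.repeat(df['count'])].drop(columns='count').reset_index(drop=True)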
Answered by Matti John
repeated_items = [list(row[1]*row[2]) for row in df.itertuples()]
will create a nested list (note that row[1]*row[2] relies on the class values being strings, since it repeats the string row[2] times):
[['A'], [], ['C', 'C']]
which you can then iterate over with list comprehensions to create a new data frame:
new_df = pd.DataFrame({"class":[j for i in repeated_items for j in i]})
Of course, you can do it in a single line as well if you want:
new_df = pd.DataFrame({"class":[j for i in [list(row[1]*row[2]) for row in df.itertuples()] for j in i]})

