pandas: find the set of column indices for non-zero values in each row of a DataFrame

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/32768555/

Time: 2020-09-13 23:55:44  Source: igfitidea

find the set of column indices for non-zero values in each row in pandas' data frame

python pandas

Asked by Qiang Li

Is there a good way to find the set of column indices for the non-zero values in each row of a pandas DataFrame? Do I have to traverse the data frame row by row?

For example, given the data frame:

c1  c2  c3  c4 c5 c6 c7 c8  c9
 1   1   0   0  0  0  0  0   0
 1   0   0   0  0  0  0  0   0
 0   1   0   0  0  0  0  0   0
 1   0   0   0  0  0  0  0   0
 0   1   0   0  0  0  0  0   0
 0   0   0   0  0  0  0  0   0
 0   2   1   1  1  1  1  0   2
 1   5   5   0  0  1  0  4   6
 4   3   0   1  1  1  1  5  10
 3   5   2   4  1  2  2  1   3
 6   4   0   1  0  0  0  0   0
 3   9   1   0  1  0  2  1   0

The expected output is:

['c1','c2']
['c1']
['c2']
...

Accepted answer by Younggun Kim

It seems you have to traverse the DataFrame row by row.

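cols = df.columns
# boolean mask of positive entries (use df != 0 if negative values can occur)
bt = df.apply(lambda x: x > 0)
# for each row, keep the column labels where the mask is True
bt.apply(lambda x: list(cols[x.values]), axis=1)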

and you will get:

0                                 [c1, c2]
1                                     [c1]
2                                     [c2]
3                                     [c1]
4                                     [c2]
5                                       []
6             [c2, c3, c4, c5, c6, c7, c9]
7                 [c1, c2, c3, c6, c8, c9]
8         [c1, c2, c4, c5, c6, c7, c8, c9]
9     [c1, c2, c3, c4, c5, c6, c7, c8, c9]
10                            [c1, c2, c4]
11                [c1, c2, c3, c5, c7, c8]
dtype: object
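For a self-contained run, the approach above can be reproduced on the question's sample data. This is only a minimal sketch: the names `mask` and `nonzero_cols` are hypothetical, and `!= 0` is used in place of `> 0` on the assumption that negative values should also count as non-zero.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    [
        [1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 2, 1, 1, 1, 1, 1, 0, 2],
        [1, 5, 5, 0, 0, 1, 0, 4, 6],
        [4, 3, 0, 1, 1, 1, 1, 5, 10],
        [3, 5, 2, 4, 1, 2, 2, 1, 3],
        [6, 4, 0, 1, 0, 0, 0, 0, 0],
        [3, 9, 1, 0, 1, 0, 2, 1, 0],
    ],
    columns=["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9"],
)

cols = df.columns
# True wherever the value is non-zero (assumption: negatives count too)
mask = df != 0
# For each row, select the column labels where the mask is True
nonzero_cols = mask.apply(lambda row: list(cols[row.values]), axis=1)
print(nonzero_cols)
```

The result is a Series of lists indexed like `df`, so an all-zero row shows up as an empty list rather than being dropped.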

If performance matters, try passing raw=True when creating the boolean DataFrame, as below:

%timeit df.apply(lambda x: x > 0, raw=True).apply(lambda x: list(cols[x.values]), axis=1)
1000 loops, best of 3: 812 μs per loop

This gives a noticeable speedup. For comparison, here is the result with raw=False (the default):

%timeit df.apply(lambda x: x > 0).apply(lambda x: list(cols[x.values]), axis=1)
100 loops, best of 3: 2.59 ms per loop
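If the row-wise apply is still too slow, one possible alternative is to build the boolean mask once as a NumPy array and index the column labels in a plain list comprehension. The following is a sketch under stated assumptions, not something from the answers: the small frame `df` is made-up illustration data, the names `labels`, `mask` and `result` are hypothetical, and no timing is claimed for it.

```python
import pandas as pd

# Hypothetical example data (not the question's frame)
df = pd.DataFrame({"c1": [1, 0, 0], "c2": [1, 2, 0], "c3": [0, -3, 0]})

# Column labels as a NumPy array so they can be indexed with a boolean row
labels = df.columns.to_numpy()
# 2-D boolean array: True where the value is non-zero
mask = df.to_numpy() != 0

# One list of labels per row; all-zero rows give an empty list
result = pd.Series([labels[row].tolist() for row in mask], index=df.index)
print(result)
```

This still loops over the rows in Python, but it avoids constructing a Series object for every row, which is where apply with axis=1 tends to spend its time.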

Answer by Dickster

How about this approach?

import numpy as np

# create a True / False data frame
df_boolean = df > 0

# a little helper that uses boolean indexing internally
def bar(x, columns):
    return ','.join(columns[x.values])

# apply with axis=1, i.e. once per row
df_boolean['result'] = df_boolean.apply(lambda x: bar(x, df_boolean.columns), axis=1)

# filter out the empty "rows" and grab the result column
df_result = df_boolean[df_boolean['result'] != '']['result']

# append an axis, just so each line will be output as a list
lst_result = df_result.values[:, np.newaxis]

print('\n'.join(str(myelement) for myelement in lst_result))

and this produces:

['c1,c2']
['c1']
['c2']
['c1']
['c2']
['c2,c3,c4,c5,c6,c7,c9']
['c1,c2,c3,c6,c8,c9']
['c1,c2,c4,c5,c6,c7,c8,c9']
['c1,c2,c3,c4,c5,c6,c7,c8,c9']
['c1,c2,c4']
['c1,c2,c3,c5,c7,c8']

Answer by Andy Hayden

A potentially better data structure (rather than a Series of lists) is the stacked form:

In [11]: res = df[df!=0].stack()

In [12]: res
Out[12]:
0   c1     1
    c2     1
1   c1     1
2   c2     1
3   c1     1
...

You can then look up each original row by its label:

In [13]: res.loc[0]
Out[13]:
c1    1
c2    1
dtype: float64

In [14]: res.loc[0].index
Out[14]: Index(['c1', 'c2'], dtype='object')
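To collect the column labels for every row from the stacked result in one pass, the row level of the MultiIndex can be grouped. This is a minimal sketch on made-up data: the frame `df` and the name `by_row` are hypothetical, and the explicit `dropna()` is added on the assumption that `stack()` may not drop the masked NaN entries in every pandas version.

```python
import pandas as pd

# Hypothetical example data (not the question's frame)
df = pd.DataFrame({"c1": [1, 0, 0], "c2": [1, 2, 0], "c3": [0, 3, 0]})

# Mask zeros to NaN, stack into a (row, column) MultiIndex, drop the NaN entries
res = df[df != 0].stack().dropna()

# Level 0 is the original row label, level 1 is the column label
by_row = {
    row: list(group.index.get_level_values(1))
    for row, group in res.groupby(level=0)
}
print(by_row)
```

Rows that contain only zeros have no entries left after stacking, so they do not appear in `by_row` at all, unlike the Series-of-lists approach, which gives them an empty list.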


Note: I thought you used to be able to return a list in an apply (to create a DataFrame which has list elements), but this no longer appears to be the case.