pandas: factorize on multiple columns

Question

Asked by ChrisB

The pandas factorize function assigns each unique value in a Series a sequential, 0-based index, and calculates which index each Series entry belongs to.
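
For reference, a minimal sketch of the single-column behaviour described above, applied to column x of the example frame used later in the question. The variable names codes and uniques are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# factorize returns a pair (codes, uniques): codes[i] is the position
# in uniques of the i-th value of the input.
codes, uniques = pd.factorize(df['x'])

print(codes)  # [0 0 1 1 0 0]
unique_values = uniques.tolist()  # the distinct values, in order of first appearance
```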

I'd like to accomplish the equivalent of pandas.factorize on multiple columns:

import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
pd.factorize(df)[0] # would like [0, 1, 2, 2, 1, 0]

That is, I want to determine each unique tuple of values in several columns of a data frame, assign a sequential index to each, and compute which index each row in the data frame belongs to.

factorize only works on single columns. Is there a multi-column equivalent in pandas?

Answer 1

Answered by HYRY

You need to create an ndarray of tuples first. pandas.lib.fast_zip can do this very quickly in a Cython loop.

import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
# Note: pd.lib.fast_zip is only available in older pandas releases
print(pd.factorize(pd.lib.fast_zip([df.x, df.y]))[0])

The output is:

[0 1 2 2 1 0]
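
Since pandas.lib is not exposed in recent pandas releases, here is a possible alternative sketch that avoids it. It swaps in a different technique, groupby(...).ngroup(), rather than fast_zip, and it assumes a pandas version that provides GroupBy.ngroup.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# sort=False keeps the groups in order of first appearance, so the
# group numbers match what factorize would assign to each (x, y) pair.
codes = df.groupby(['x', 'y'], sort=False).ngroup().to_numpy()

print(codes)  # [0 1 2 2 1 0]
```

With the default sort=True, the groups would instead be numbered in sorted key order, which happens to coincide for this particular frame but not in general.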

Answer 2

Answered by user2179627

I am not sure if this is an efficient solution. There might be better solutions for this.

arr = []  # holds the unique rows of the dataframe, in order of first appearance
for i in range(len(df)):  # iterate by position, since iloc is positional
    row = list(df.iloc[i])
    if row not in arr:
        arr.append(row)

Printing arr gives:

>>> print(arr)
[[1, 1], [1, 2], [2, 2]]

To hold the indices, I would declare an ind list:

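ind = []
for i in range(len(df)):  # iterate by position, since iloc is positional
    # look up the position of this row among the unique rows
    ind.append(arr.index(list(df.iloc[i])))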

Printing ind gives:

>>> print(ind)
[0, 1, 2, 2, 1, 0]
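
On the efficiency concern raised in this answer: both the membership test and arr.index scan the list of unique rows for every row of the frame. One possible variant, sketched here with a dictionary keyed by row tuples (the name seen is illustrative), does a single pass with constant-time lookups.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

seen = {}  # maps each unique (x, y) tuple to its sequential id
ind = []
for row in df.itertuples(index=False, name=None):
    # setdefault stores the next free id the first time a tuple
    # appears and returns the stored id on every later occurrence.
    ind.append(seen.setdefault(row, len(seen)))

print(ind)  # [0, 1, 2, 2, 1, 0]
```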

Answer 3

Answered by waitingkuo

You can use drop_duplicates to drop the duplicated rows:

In [23]: df.drop_duplicates()
Out[23]: 
   x  y
0  1  1
1  1  2
2  2  2

EDIT

To achieve your goal, you can join your original df to the de-duplicated one:

In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
Out[46]: 
   x  y  index
0  1  1      0
1  1  2      1
2  2  2      2
3  2  2      2
4  1  2      1
5  1  1      0
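
Note that the index column in the join above holds the row label of each tuple's first occurrence. It forms a 0-based sequence here only because the first three rows are all distinct. A possible variant, sketched below on a hypothetical frame where that is not the case, assigns explicit sequential ids to the de-duplicated rows before merging. The column name group_id is an assumption for illustration.

```python
import pandas as pd

# Hypothetical data: the second row repeats the first, so first-occurrence
# row labels would be 0, 2, 3 rather than 0, 1, 2.
df = pd.DataFrame({'x': [1, 1, 1, 2], 'y': [1, 1, 2, 2]})

# Unique rows in order of first appearance, renumbered from 0
uniq = df.drop_duplicates().reset_index(drop=True)
uniq['group_id'] = range(len(uniq))

# A left merge keeps the row order of df
result = df.merge(uniq, on=['x', 'y'], how='left')

print(result['group_id'].tolist())  # [0, 0, 1, 2]
```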

Answer 4

Answered by David Hagar

import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
# Turn each row into a tuple, then factorize the resulting Series of tuples
tuples = df[['x', 'y']].apply(tuple, axis=1)
df['newID'] = pd.factorize(tuples)[0]
