Multi-column factorize in pandas
Original source: http://stackoverflow.com/questions/16453465/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by ChrisB
The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to.
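For reference, here is a minimal sketch of the single-column behaviour described above. The sample series `s` is made up for illustration and is not part of the original question.

```python
import pandas as pd

# Hypothetical sample data, chosen only to illustrate the behaviour
s = pd.Series(['b', 'a', 'b', 'c', 'a'])

# factorize returns (codes, uniques): codes[i] is the position of s[i] in uniques,
# and uniques are listed in order of first appearance
codes, uniques = pd.factorize(s)

print(codes)          # [0 1 0 2 1]
print(list(uniques))  # ['b', 'a', 'c']
```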
I'd like to accomplish the equivalent of pandas.factorize on multiple columns:
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
pd.factorize(df)[0] # would like [0, 1, 2, 2, 1, 0]
That is, I want to determine each unique tuple of values in several columns of a data frame, assign a sequential index to each, and compute which index each row in the data frame belongs to.
factorize only works on single columns. Is there a multi-column equivalent function in pandas?
Answered by HYRY
You first need to create an ndarray of tuples; pandas.lib.fast_zip can do this very quickly in a Cython loop.
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
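# fast_zip builds an object ndarray of (x, y) tuples; pd.lib is the module path used by older pandas releases
print(pd.factorize(pd.lib.fast_zip([df.x, df.y]))[0])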
The output is:
[0 1 2 2 1 0]
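If `pd.lib` is not available in your pandas version, one possible alternative that relies only on public API is `groupby(...).ngroup()`. This is a swapped-in technique, not the method from the answer above, and it assumes a pandas release that provides `ngroup`.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# sort=False numbers the groups in order of first appearance,
# which mirrors the ordering factorize uses by default
codes = df.groupby(['x', 'y'], sort=False).ngroup().tolist()

print(codes)  # [0, 1, 2, 2, 1, 0]
```

With the default `sort=True`, the groups would instead be numbered in sorted key order, which happens to give the same result for this particular data.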
Answered by user2179627
I am not sure if this is an efficient solution. There might be better solutions for this.
arr = []  # holds the unique rows of the dataframe, in order of first appearance
for i in range(len(df)):  # iloc is positional, so iterate over positions
    row = df.iloc[i].tolist()
    if row not in arr:
        arr.append(row)
Printing arr then gives:
>>> print(arr)
[[1, 1], [1, 2], [2, 2]]
To hold the indices, I would declare a list called ind:
ind = []  # ind[i] is the position in arr of the i-th row's values
for i in range(len(df)):
    ind.append(arr.index(df.iloc[i].tolist()))
Printing ind gives:
>>> print(ind)
[0, 1, 2, 2, 1, 0]
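The efficiency concern is reasonable: both `row not in arr` and `arr.index(...)` scan the list on every row. A rough sketch of the same idea using a dictionary lookup instead is shown below; the names `seen` and the use of `itertuples` are my own choices for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

seen = {}  # maps each unique row tuple to its sequential code (hypothetical name)
ind = []   # code for every row, in row order

# name=None yields plain tuples, which are hashable and can be used as dict keys
for row in df.itertuples(index=False, name=None):
    # setdefault stores len(seen) for a new tuple and returns the stored code otherwise
    ind.append(seen.setdefault(row, len(seen)))

print(ind)  # [0, 1, 2, 2, 1, 0]
```

The dictionary keys, in insertion order, play the role of `arr` in the list-based version.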
Answered by waitingkuo
You can use drop_duplicates to drop the duplicated rows:
In [23]: df.drop_duplicates()
Out[23]:
   x  y
0  1  1
1  1  2
2  2  2
EDIT
To achieve your goal, you can join the original df to the de-duplicated one:
In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
Out[46]:
   x  y  index
0  1  1      0
1  1  2      1
2  2  2      2
3  2  2      2
4  1  2      1
5  1  1      0
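Note that the index column produced this way holds the row label of each tuple's first occurrence, which only coincides with a sequential 0-based code when the unique rows all appear first, as they do here. A hedged sketch of a variant that always yields sequential codes follows; the extra row (3, 3) and the column name `code` are assumptions added for illustration.

```python
import pandas as pd

# Same data as above plus one hypothetical extra row (3, 3) that first appears at label 6
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1, 3], 'y': [1, 2, 2, 2, 2, 1, 3]})

# Unique rows in order of first appearance, renumbered 0..n-1
uniques = df.drop_duplicates().reset_index(drop=True)
uniques['code'] = range(len(uniques))  # 'code' is a made-up column name

# A left merge keeps the row order of df and attaches each row's code
result = df.merge(uniques, on=['x', 'y'], how='left')
codes = result['code'].tolist()

print(codes)  # [0, 1, 2, 2, 1, 0, 3]
```

With the join on the original index, the last row would have received 6 rather than 3.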
Answered by David Hagar
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
# build one (x, y) tuple per row, then factorize the resulting Series of tuples
tuples = df[['x', 'y']].apply(tuple, axis=1)
df['newID'] = pd.factorize(tuples)[0]

