pandas 为列熊猫数据框分配唯一 id
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/33283086/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Assign unique id to columns pandas data frame
提问by emax
Hello I have the following dataframe
您好,我有以下数据框
df =
A B
John Tom
Homer Bart
Tom Maggie
Lisa John
I would like to assign to each name a unique ID and returns
我想为每个名字分配一个唯一的 ID 并返回
df =
A B C D
John Tom 0 1
Homer Bart 2 3
Tom Maggie 1 4
Lisa John 5 0
What I have done is the following:
我所做的是以下内容:
LL1 = pd.concat([df.a,df.b],ignore_index=True)
LL1 = pd.DataFrame(LL1)
LL1.columns=['a']
nameun = pd.unique(LL1.a.ravel())
LLout['c'] = 0
LLout['d'] = 0
NN = list(nameun)
for i in range(1,len(LLout)):
LLout.c[i] = NN.index(LLout.a[i])
LLout.d[i] = NN.index(LLout.b[i])
But since I have a very large dataset this process is very slow.
但由于我有一个非常大的数据集,这个过程非常缓慢。
回答by Andy Hayden
Here's one way. First get the array of unique names:
这是一种方法。首先获取唯一名称的数组:
In [11]: df.values.ravel()
Out[11]: array(['John', 'Tom', 'Homer', 'Bart', 'Tom', 'Maggie', 'Lisa', 'John'], dtype=object)
In [12]: pd.unique(df.values.ravel())
Out[12]: array(['John', 'Tom', 'Homer', 'Bart', 'Maggie', 'Lisa'], dtype=object)
and make this a Series, mapping names to their respective numbers:
并将其设为系列,将名称映射到各自的编号:
In [13]: names = pd.unique(df.values.ravel())
In [14]: names = pd.Series(np.arange(len(names)), names)
In [15]: names
Out[15]:
John 0
Tom 1
Homer 2
Bart 3
Maggie 4
Lisa 5
dtype: int64
Now use applymapand names.getto lookup these numbers:
现在使用applymap和names.get查找这些数字:
In [16]: df.applymap(names.get)
Out[16]:
A B
0 0 1
1 2 3
2 1 4
3 5 0
and assign it to the correct columns:
并将其分配给正确的列:
In [17]: df[["C", "D"]] = df.applymap(names.get)
In [18]: df
Out[18]:
A B C D
0 John Tom 0 1
1 Homer Bart 2 3
2 Tom Maggie 1 4
3 Lisa John 5 0
Note: This assumes that all the values are names to begin with, you may want to restrict this to some columns only:
注意:这假设所有值都是以名称开头的,您可能只想将其限制为某些列:
df[['A', 'B']].values.ravel()
...
df[['A', 'B']].applymap(names.get)
回答by DSM
(Note: I'm assuming you don't care about the precise details of the mapping -- which number John becomes, for example -- but only that there is one.)
(注意:我假设您不关心映射的精确细节——例如,约翰变成了哪个数字——但只关心有一个。)
Method #1: you could use a Categoricalobject as an intermediary:
方法#1:你可以使用一个Categorical对象作为中介:
>>> ranked = pd.Categorical(df.stack()).codes.reshape(df.shape)
>>> df.join(pd.DataFrame(ranked, columns=["C", "D"]))
A B C D
0 John Tom 2 5
1 Homer Bart 1 0
2 Tom Maggie 5 4
3 Lisa John 3 2
It feels like you should be able to treat a Categorical as providing an encoding dictionary somehow (whether directly or by generating a Series) but I can't see a convenient way to do it.
感觉就像您应该能够将 Categorical 视为以某种方式提供编码字典(无论是直接还是通过生成系列),但我看不到一种方便的方法来做到这一点。
Method #2: you could use rank("dense"), which generates an increasing number for each value in order:
方法#2:您可以使用rank("dense"),它按顺序为每个值生成一个递增的数字:
>>> ranked = df.stack().rank("dense").reshape(df.shape).astype(int)-1
>>> df.join(pd.DataFrame(ranked, columns=["C", "D"]))
A B C D
0 John Tom 2 5
1 Homer Bart 1 0
2 Tom Maggie 5 4
3 Lisa John 3 2

