pandas: assign a unique ID to each name across DataFrame columns

Question

Asked by emax

Hello, I have the following DataFrame:

df = 
A      B   
John   Tom
Homer  Bart
Tom    Maggie
Lisa   John

I would like to assign each name a unique ID and get back:

df = 
A      B         C    D
John   Tom       0    1
Homer  Bart      2    3
Tom    Maggie    1    4
Lisa   John      5    0

This is what I have done so far:

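import pandas as pd

# Stack both name columns into a single Series
LL1 = pd.concat([df.A, df.B], ignore_index=True)
# Unique names, in order of first appearance in the stacked Series
nameun = pd.unique(LL1)
NN = list(nameun)
df['C'] = 0
df['D'] = 0
# Look up each name's position in the list, one row at a time
for i in range(len(df)):
    df.loc[i, 'C'] = NN.index(df.loc[i, 'A'])
    df.loc[i, 'D'] = NN.index(df.loc[i, 'B'])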

But since I have a very large dataset, this process is very slow.

Answer 1

Answered by Andy Hayden

Here's one way. First get the array of unique names:

In [11]: df.values.ravel()
Out[11]: array(['John', 'Tom', 'Homer', 'Bart', 'Tom', 'Maggie', 'Lisa', 'John'], dtype=object)

In [12]: pd.unique(df.values.ravel())
Out[12]: array(['John', 'Tom', 'Homer', 'Bart', 'Maggie', 'Lisa'], dtype=object)

Then turn this into a Series that maps each name to its number:

In [13]: names = pd.unique(df.values.ravel())

In [14]: names = pd.Series(np.arange(len(names)), names)

In [15]: names
Out[15]:
John      0
Tom       1
Homer     2
Bart      3
Maggie    4
Lisa      5
dtype: int64

Now use applymap and names.get to look up these numbers:

In [16]: df.applymap(names.get)
Out[16]:
   A  B
0  0  1
1  2  3
2  1  4
3  5  0

and assign the result to the new columns:

In [17]: df[["C", "D"]] = df.applymap(names.get)

In [18]: df
Out[18]:
       A       B  C  D
0   John     Tom  0  1
1  Homer    Bart  2  3
2    Tom  Maggie  1  4
3   Lisa    John  5  0

Note: This assumes that every value in the DataFrame is a name. If that is not the case, you may want to restrict the operation to the name columns only:

df[['A', 'B']].values.ravel()
...
df[['A', 'B']].applymap(names.get)
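The steps in this answer can be put together as a short script. This is only a sketch under a few assumptions: the list name_cols is a hypothetical helper, not part of the original answer, and the lookup is done per column with Series.map instead of applymap, because DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map.

```python
import numpy as np
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa"],
    "B": ["Tom", "Bart", "Maggie", "John"],
})

# Hypothetical helper: the columns that actually hold names
name_cols = ["A", "B"]

# Unique names, in row-major order of first appearance
unique_names = pd.unique(df[name_cols].to_numpy().ravel())

# Series indexed by name, with the ID as its value
names = pd.Series(np.arange(len(unique_names)), index=unique_names)

# Series.map with a Series argument looks each value up in that Series' index
df["C"] = df["A"].map(names)
df["D"] = df["B"].map(names)

print(df)
```

Because the lookup is a vectorized map rather than a Python loop with list.index, it should scale much better on a large frame.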

Answer 2

Answered by DSM

(Note: I'm assuming you don't care about the precise details of the mapping -- which number John becomes, for example -- but only that there is one.)

Method #1: you could use a Categorical object as an intermediary:

>>> ranked = pd.Categorical(df.stack()).codes.reshape(df.shape)
>>> df.join(pd.DataFrame(ranked, columns=["C", "D"]))
       A       B  C  D
0   John     Tom  2  5
1  Homer    Bart  1  0
2    Tom  Maggie  5  4
3   Lisa    John  3  2

It feels like you should be able to treat a Categorical as providing an encoding dictionary somehow (whether directly or by generating a Series) but I can't see a convenient way to do it.
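One possible way to get such an encoding dictionary, offered as a sketch and not as part of the original answer, is to enumerate the Categorical's categories attribute. The variable names cat and encoding are illustrative only.

```python
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa"],
    "B": ["Tom", "Bart", "Maggie", "John"],
})

# Build the Categorical from all names, flattened row by row
cat = pd.Categorical(df[["A", "B"]].to_numpy().ravel())

# categories holds the sorted distinct names; a name's position is its code
encoding = {name: code for code, name in enumerate(cat.categories)}

# codes lines up with the flattened input, so it can be reshaped back
codes = cat.codes.reshape(df.shape)

print(encoding)
print(codes)
```

The codes follow the sorted order of the names, so they match the output shown for Method #1 and not the numbering requested in the question.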

Method #2: you could use rank(method="dense"), which assigns each distinct value an increasing number in sorted order:

>>> # to_numpy() is used because a Series has no reshape method in recent pandas
>>> ranked = df.stack().rank(method="dense").to_numpy().reshape(df.shape).astype(int) - 1
>>> df.join(pd.DataFrame(ranked, columns=["C", "D"]))
       A       B  C  D
0   John     Tom  2  5
1  Homer    Bart  1  0
2    Tom  Maggie  5  4
3   Lisa    John  3  2
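If the IDs should follow the order of first appearance, as in the question's expected output, pd.factorize is another option. It is not mentioned in either answer; the sketch below assumes the same example frame and returns both the codes and the array of unique names that they index into.

```python
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa"],
    "B": ["Tom", "Bart", "Maggie", "John"],
})

# Flatten row by row so IDs follow the order in which names first appear
flat = df[["A", "B"]].to_numpy().ravel()

# codes[i] is the ID of flat[i]; uniques[code] recovers the name
codes, uniques = pd.factorize(flat)

# Reshape the codes back to the frame's shape and attach them as C and D
ids = pd.DataFrame(codes.reshape(df.shape), columns=["C", "D"], index=df.index)
result = df.join(ids)

print(result)
```

Here uniques doubles as the decoding table, since uniques[k] is the name that received ID k.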
