pandas.factorize on an entire data frame

Question

Asked by clstaudt

pandas.factorize encodes input values as an enumerated type or categorical variable.

But how can I easily and efficiently convert many columns of a data frame? What about the reverse mapping step?

Example: This data frame contains columns with string values such as "type 2" which I would like to convert to numerical values - and possibly translate them back later.

Answer 1

Answered by jezrael

You can use apply if you need to factorize each column separately:

import pandas as pd

df = pd.DataFrame({'A':['type1','type2','type2'],
                   'B':['type1','type2','type3'],
                   'C':['type1','type3','type3']})

print (df)
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3

print (df.apply(lambda x: pd.factorize(x)[0]))
   A  B  C
0  0  0  0
1  1  1  1
2  1  2  1
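The lambda above keeps only the codes and discards the second value returned by pd.factorize, which is what the reverse mapping needs. A minimal sketch, assuming you want to keep one lookup array per column (the names codes, uniques, encoded and decoded are illustrative, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

codes = {}    # column name -> integer codes
uniques = {}  # column name -> labels, positioned by code

for col in df.columns:
    # pd.factorize returns (codes, uniques); keep both per column
    codes[col], uniques[col] = pd.factorize(df[col])

encoded = pd.DataFrame(codes, index=df.index)

# Reverse step: use each code as a position into that column's labels
decoded = pd.DataFrame(
    {col: uniques[col].to_numpy()[encoded[col].to_numpy()]
     for col in encoded.columns},
    index=df.index,
)
```

Because each column is factorized on its own, the same code can stand for different strings in different columns (code 1 is 'type2' in A but 'type3' in C), so the per-column lookup has to be kept alongside the encoded frame.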

If the same string value must map to the same number in every column:

print (df.stack().rank(method='dense').unstack())
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0

If you need to apply the function only for some columns, use a subset:

df[['B','C']] = df[['B','C']].stack().rank(method='dense').unstack()
print (df)
       A    B    C
0  type1  1.0  1.0
1  type2  2.0  3.0
2  type2  3.0  3.0

Solution with factorize:

stacked = df[['B','C']].stack()
df[['B','C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
print (df)
       A  B  C
0  type1  0  0
1  type2  1  2
2  type2  2  2
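When factorize is applied to the stacked values, a single set of labels covers every column, so one lookup array is enough to go back. A hedged sketch on the full example frame (the variable names labels, flat, encoded and decoded are assumptions made for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

stacked = df.stack()

# One shared encoding: codes follow order of first appearance
codes, uniques = stacked.factorize()
encoded = pd.Series(codes, index=stacked.index).unstack()

# Reverse step: index the shared label array with the codes
labels = uniques.to_numpy()
flat = encoded.stack()
decoded = pd.Series(labels[flat.to_numpy()], index=flat.index).unstack()
```

Unlike rank(method='dense'), which numbers values by sort order starting at 1.0, factorize numbers them by first appearance starting at 0 and returns integers.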

Translating them back is possible via map with a dict built from the original string-valued df; duplicates need to be removed first with drop_duplicates:

vals = df.stack().drop_duplicates().values
b = [x for x in df.stack().drop_duplicates().rank(method='dense')]

d1 = dict(zip(b, vals))
print (d1)
{1.0: 'type1', 2.0: 'type2', 3.0: 'type3'}

df1 = df.stack().rank(method='dense').unstack()
print (df1)
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0

print (df1.stack().map(d1).unstack())
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3

Answer 2

Answered by Gabe F.

I also found this answer quite helpful: https://stackoverflow.com/a/20051631/4643212

I was trying to take values from an existing column in a Pandas DataFrame (a list of IP addresses named 'SrcIP') and map them to numerical values in a new column (named 'ID' in this example).

Solution:

df['ID'] = pd.factorize(df.SrcIP)[0]

Result:

        SrcIP | ID    
192.168.1.112 |  0  
192.168.1.112 |  0  
192.168.4.118 |  1 
192.168.1.112 |  0
192.168.4.118 |  1
192.168.5.122 |  2
192.168.5.122 |  2
...
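For the single-column case, the reverse lookup comes from the second return value of pd.factorize. A small sketch, assuming a frame holding just the seven SrcIP values shown above (the variable recovered is a hypothetical name):

```python
import pandas as pd

df = pd.DataFrame({'SrcIP': ['192.168.1.112', '192.168.1.112',
                             '192.168.4.118', '192.168.1.112',
                             '192.168.4.118', '192.168.5.122',
                             '192.168.5.122']})

# codes: one integer per row; uniques: the distinct addresses,
# positioned so that uniques[code] is the original value
codes, uniques = pd.factorize(df['SrcIP'])
df['ID'] = codes

# Reverse step: look each ID up in the array of distinct addresses
recovered = uniques.to_numpy()[df['ID'].to_numpy()]
```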

Answer 3

Answered by tbrittoborges

I would like to redirect my answer: https://stackoverflow.com/a/32011969/1694714

Old answer

Another readable solution for this problem, when you want to keep the categories consistent across the resulting DataFrame, is to use replace:

def categorise(df):
    categories = {k: v for v, k in enumerate(df.stack().unique())}
    return df.replace(categories)
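The same dictionary can be inverted to translate the codes back. The sketch below is a variation on the function above, not the answer's own code: it returns the inverse dictionary as well, and it swaps replace for a per-column map, which is an assumption made here to keep the result as plain integers (the name categorise_with_inverse is hypothetical):

```python
import pandas as pd


def categorise_with_inverse(df):
    # Shared mapping: string -> code, in order of first appearance
    categories = {k: v for v, k in enumerate(df.stack().unique())}
    # Per-column map used here instead of df.replace
    encoded = df.apply(lambda col: col.map(categories))
    # Inverse mapping: code -> string, for the reverse step
    inverse = {v: k for k, v in categories.items()}
    return encoded, inverse


df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

encoded, inverse = categorise_with_inverse(df)
decoded = encoded.apply(lambda col: col.map(inverse))
```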

This performs slightly worse than the example by @jezrael, but it is easier to read. It might also scale better for bigger datasets. I can do some proper testing if anyone is interested.
