pandas.factorize on an entire data frame
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/39390160/
Asked by clstaudt
pandas.factorize encodes input values as an enumerated type or categorical variable.
But how can I easily and efficiently convert many columns of a data frame? What about the reverse mapping step?
Example: This data frame contains columns with string values such as "type 2", which I would like to convert to numerical values and possibly translate back later.
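For reference, a minimal sketch of what pd.factorize returns for a single column: one integer code per value, plus the unique values those codes refer to. The Series `s` and its contents are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical column of string labels (illustrative data only).
s = pd.Series(['type1', 'type2', 'type2', 'type3'])

# factorize returns integer codes and the unique values
# in order of first appearance.
codes, uniques = pd.factorize(s)

print(codes.tolist())   # [0, 1, 1, 2]
print(list(uniques))    # ['type1', 'type2', 'type3']

# Looking the codes up in the uniques restores the original values.
restored = pd.Series(uniques.take(codes).to_numpy(), index=s.index)
```

The second return value is what makes a reverse mapping possible, so it is worth keeping whenever the codes need to be translated back.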
Answered by jezrael
You can use apply if you need to factorize each column separately:
import pandas as pd
df = pd.DataFrame({'A':['type1','type2','type2'],
                   'B':['type1','type2','type3'],
                   'C':['type1','type3','type3']})
print (df)
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3
# factorize returns (codes, uniques); [0] keeps only the codes
print (df.apply(lambda x: pd.factorize(x)[0]))
   A  B  C
0  0  0  0
1  1  1  1
2  1  2  1
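The apply call above discards the unique values, so the codes cannot be translated back afterwards. A possible sketch that keeps them per column is shown here; the names `codes`, `uniques`, `encoded` and `decoded` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# Factorize each column on its own and remember its unique values.
codes, uniques = {}, {}
for col in df.columns:
    codes[col], uniques[col] = pd.factorize(df[col])

encoded = pd.DataFrame(codes, index=df.index)

# Reverse step: look each code up in that column's own uniques.
decoded = pd.DataFrame(
    {col: uniques[col].take(encoded[col].to_numpy()).to_numpy()
     for col in encoded.columns},
    index=df.index,
)
```

Note that with per-column factorization the same code can mean different strings in different columns: here code 1 stands for 'type2' in column B but for 'type3' in column C.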
If you need the same string value to get the same number in every column:
print (df.stack().rank(method='dense').unstack())
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0
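The dense rank numbers the strings in sorted order and returns floats starting at 1.0. If zero-based integer codes in sorted order are preferred, one possible variant is pd.factorize with sort=True on all cells at once. This sketch swaps stack/rank for NumPy's ravel/reshape, and the frame is a reordered example (an assumption for illustration) so that order of appearance and sort order differ.

```python
import pandas as pd

# Illustrative frame in which 'type3' appears before 'type1'.
df = pd.DataFrame({'A': ['type3', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3']})

# Flatten all cells row by row, then factorize them together so that
# equal strings get equal codes in every column.
flat = df.to_numpy().ravel()
codes, uniques = pd.factorize(flat, sort=True)

# Put the codes back into the original shape and labels.
shared = pd.DataFrame(codes.reshape(df.shape),
                      index=df.index, columns=df.columns)
print(shared)
```

Without sort=True the codes would follow the order of first appearance instead, so 'type3' would receive code 0 in this frame.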
If you need to apply the function only for some columns, use a subset:
df[['B','C']] = df[['B','C']].stack().rank(method='dense').unstack()
print (df)
       A    B    C
0  type1  1.0  1.0
1  type2  2.0  3.0
2  type2  3.0  3.0
Solution with factorize:
stacked = df[['B','C']].stack()
df[['B','C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
print (df)
       A  B  C
0  type1  0  0
1  type2  1  2
2  type2  2  2
Translating them back is possible via map with a dict, where you need to remove duplicates with drop_duplicates (df here is the original frame with string values in every column):
# unique strings, in order of first appearance
vals = df.stack().drop_duplicates().values
# dense rank of each unique string
b = [x for x in df.stack().drop_duplicates().rank(method='dense')]
# dict from rank back to the original string
d1 = dict(zip(b, vals))
print (d1)
{1.0: 'type1', 2.0: 'type2', 3.0: 'type3'}
df1 = df.stack().rank(method='dense').unstack()
print (df1)
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0
print (df1.stack().map(d1).unstack())
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3
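As an alternative to the rank-based dictionary, the reverse mapping can also be built from the second return value of factorize. The following is a sketch under the assumption that only columns B and C are encoded; `cols`, `lookup`, `encoded` and `decoded` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

cols = ['B', 'C']
stacked = df[cols].stack()

# Keep both return values: the codes and the unique values they refer to.
codes, uniques = stacked.factorize()
encoded = pd.Series(codes, index=stacked.index).unstack()

# Reverse mapping: code i corresponds to uniques[i].
lookup = dict(enumerate(uniques))
decoded = encoded.stack().map(lookup).unstack()
```

Because the dictionary comes straight from factorize, no separate drop_duplicates or rank step is needed for the way back.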
Answered by Gabe F.
I also found this answer quite helpful: https://stackoverflow.com/a/20051631/4643212
I was trying to take values from an existing column in a Pandas DataFrame (a list of IP addresses named 'SrcIP') and map them to numerical values in a new column (named 'ID' in this example).
Solution:
df['ID'] = pd.factorize(df.SrcIP)[0]
Result:
SrcIP | ID
192.168.1.112 | 0
192.168.1.112 | 0
192.168.4.118 | 1
192.168.1.112 | 0
192.168.4.118 | 1
192.168.5.122 | 2
192.168.5.122 | 2
...
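One detail worth knowing when building an ID column this way: by default pd.factorize gives missing values the code -1 and leaves them out of the uniques. A small sketch with made-up addresses (the missing entry is an assumption for illustration):

```python
import pandas as pd

# Hypothetical data: one source address is missing.
df = pd.DataFrame({'SrcIP': ['192.168.1.112', None,
                             '192.168.4.118', '192.168.1.112']})

codes, uniques = pd.factorize(df['SrcIP'])
df['ID'] = codes

print(df['ID'].tolist())   # [0, -1, 1, 0]
print(list(uniques))       # ['192.168.1.112', '192.168.4.118']
```

Rows with ID -1 therefore have no entry in the uniques and need separate handling when mapping IDs back to addresses.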
Answered by tbrittoborges
I would like to redirect readers to my other answer: https://stackoverflow.com/a/32011969/1694714
Old answer
Another readable solution, when you want to keep the categories consistent across the resulting DataFrame, is to use replace:
def categorise(df):
    categories = {k: v for v, k in enumerate(df.stack().unique())}
    return df.replace(categories)
It performs slightly worse than the example by @jezrael, but is easier to read. It might also scale better for bigger datasets. I can do some proper testing if anyone is interested.
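The dictionary built inside categorise can also be inverted to translate the codes back. The sketch below builds the same dictionary but applies it with Series.map column by column rather than with replace; that substitution, and the names `inverse`, `encoded` and `decoded`, are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# Same dictionary as in categorise: string -> integer code.
categories = {k: v for v, k in enumerate(df.stack().unique())}
# Inverted dictionary for the way back: integer code -> string.
inverse = {v: k for k, v in categories.items()}

# Illustrative variant using Series.map per column
# (not the original replace call).
encoded = df.apply(lambda col: col.map(categories))
decoded = encoded.apply(lambda col: col.map(inverse))
```

Returning the dictionary alongside the encoded frame would be one way to make the original function reversible.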