Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/15124439/

Time: 2020-09-13 20:41:01  Source: igfitidea

Closest equivalent of a factor variable in Python Pandas

python r pandas

Asked by Amelio Vazquez-Reina

What is the closest equivalent to an R factor variable in Python pandas?

Accepted answer by sriramn

This question seems to be from a year back, but since it is still open, here's an update. pandas has introduced a categorical dtype, and it operates very similarly to factors in R. Please see this link for more information:

http://pandas-docs.github.io/pandas-docs-travis/categorical.html

Reproducing a snippet from the link above showing how to create a "factor" variable in pandas.

In [1]: from pandas import Series

In [2]: s = Series(["a","b","c","a"], dtype="category")

In [3]: s
Out[3]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a < b < c]
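
As a rough sketch of how the categorical dtype lines up with an R factor: the categories play the role of levels, and the integer codes play the role of the underlying integer representation. The sample data and variable names below are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Made-up sample data for illustration
s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Categories are roughly the analogue of R's levels()
levels = list(s.cat.categories)

# Codes are the integer representation (0-based, unlike R's 1-based factors)
codes = list(s.cat.codes)

# An ordered categorical is roughly the analogue of an ordered factor in R
sizes = pd.Series(
    pd.Categorical(
        ["low", "high", "medium", "low"],
        categories=["low", "medium", "high"],
        ordered=True,
    )
)
# Comparisons respect the declared category order, not alphabetical order
is_above_low = list(sizes > "low")

print(levels)
print(codes)
print(is_above_low)
```

Declaring the categories explicitly, as in `sizes`, is one way to control the level order instead of relying on the inferred one.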

Answer by badgley

If you're looking to do modeling etc., there are lots of goodies for factors within the patsy library. I will admit to having struggled with this myself. I found these slides helpful. Wish I could give a better example, but this is as far as I've gotten myself.

Answer by jpcsoup

If you're looking to map a categorical variable to a number as R does, Pandas implemented a function that will give you just that: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html

import pandas as pd

df = pd.read_csv('path_to_your_file')
df['new_factor'], _ = pd.factorize(df['old_categorical'], sort=True)

This function returns both the integer codes and the unique values that those codes index into. If you only need the codes for a column assignment, you'll have to throw the unique values away as above.
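
To make the two return values concrete, here is a small sketch on made-up data (the `colors` series is hypothetical). It keeps both outputs instead of discarding the second one, and uses the unique values to map the codes back to the original labels.

```python
import pandas as pd

# Made-up sample data for illustration
colors = pd.Series(["red", "green", "blue", "green"])

# codes: one integer per row; uniques: the distinct values, sorted because sort=True
codes, uniques = pd.factorize(colors, sort=True)

# Looking the codes up in uniques recovers the original labels
restored = list(uniques.take(codes))

print(list(codes))
print(list(uniques))
print(restored)
```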

If you want a homegrown solution, you can use a combination of a set and a dictionary within a function. This method is a bit easier to apply over multiple columns, but note that None, NaN, etc. will be included as a category in this method:

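def factor(var):
    # Map each distinct value (including None/NaN) to an arbitrary integer code
    codes = {value: code for code, value in enumerate(set(var))}
    return [codes[x] for x in var]


# Single column: pass the whole column; Series.apply would call factor on each element instead
df['new_factor1'] = factor(df['old_categorical1'])
# Several columns: DataFrame.apply passes each column to factor in turn
df[['new_factor2', 'new_factor3']] = df[['old_categorical2', 'old_categorical3']].apply(factor)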
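
The caveat about missing values can be seen in a small comparison. This is only a sketch on made-up data: `factor` is restated here so the snippet runs on its own, and `col` is a hypothetical column. The set-based approach gives None its own code, whereas `pd.factorize` marks it with -1 by default.

```python
import pandas as pd


def factor(var):
    # Same idea as the homegrown function above: distinct values -> arbitrary integer codes
    codes = {value: code for code, value in enumerate(set(var))}
    return [codes[x] for x in var]


# Made-up column with a missing value
col = pd.Series(["x", None, "y", "x"], dtype=object)

# Homegrown: None is treated as just another category
homegrown = factor(col)

# Built-in: missing values get the sentinel code -1
builtin, _ = pd.factorize(col)

print(homegrown)  # the exact code numbers depend on set iteration order
print(list(builtin))
```

Because the homegrown codes come from iterating a set, the numbers assigned to each value are not guaranteed to be stable between runs; only the grouping is.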

Answer by Dan Krahenbuhl

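import numpy as np
import matplotlib.pyplot as plt

# C: NumPy array containing category data
# V: NumPy array containing numerical data (same length as C)

# Group the numerical values by category
H = np.unique(C)
mydict = {}
for h in H:
    mydict[h] = V[C == h]

# Draw one box per category
plt.boxplot(list(mydict.values()), labels=list(mydict.keys()))
plt.show()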