Closest equivalent to an R factor variable in Python pandas

Question

Asked by Amelio Vazquez-Reina

What is the closest equivalent to an R factor variable in Python pandas?

Answer 1

Accepted answer by sriramn

This question is about a year old, but since it is still open, here is an update. pandas has introduced a categorical dtype, and it behaves very much like factors in R. See this link for more information:

http://pandas-docs.github.io/pandas-docs-travis/categorical.html

Reproducing a snippet from the link above showing how to create a "factor" variable in pandas.

In [1]: import pandas as pd

In [2]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [3]: s
Out[3]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a < b < c]
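
To make the comparison with R factors more concrete, here is a minimal sketch of a categorical with explicit, ordered levels. The example values and the names `levels` and `f` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
s = pd.Series(["low", "high", "medium", "low"])

# Explicit, ordered categories play the role of R's factor levels
levels = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
f = s.astype(levels)

print(list(f.cat.categories))  # the "levels": ['low', 'medium', 'high']
print(list(f.cat.codes))       # integer codes: [0, 2, 1, 0]
print(f.min(), f.max())        # low high (ordering follows the category order)
```

Note that the codes are 0-based, whereas R's underlying factor integers are 1-based.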

Answer 2

Answered by badgley

If you're looking to do modeling etc., the patsy library has lots of goodies for working with factors. I will admit to having struggled with this myself. I found these slides helpful. I wish I could give a better example, but this is as far as I've gotten myself.

Answer 3

Answered by jpcsoup

If you're looking to map a categorical variable to integer codes as R does, pandas provides a function that does just that: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html

import pandas as pd

df = pd.read_csv('path_to_your_file')
df['new_factor'], _ = pd.factorize(df['old_categorical'], sort=True)

This function returns both the integer codes and an index of the unique values. If you're just assigning a column, discard the latter, as above.
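
As a small illustration of the two return values, the sketch below runs `pd.factorize` on a made-up column (the values and the name `col` are assumptions for demonstration). It also shows that missing values get the code -1 instead of a category of their own.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a repeated value and a missing entry
col = pd.Series(["b", "a", np.nan, "b", "c"])

# sort=True orders the unique values before assigning codes
codes, uniques = pd.factorize(col, sort=True)

print(codes.tolist())  # [1, 0, -1, 1, 2]  (-1 marks the missing value)
print(list(uniques))   # ['a', 'b', 'c']
```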

If you want a homegrown solution, you can use a combination of a set and a dictionary within a function. This method is a bit easier to apply over multiple columns, but note that None, NaN, etc. will be included as a category in this method:

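def factor(var):
    # Map each distinct value to an integer code. Set order is arbitrary,
    # so the codes are not guaranteed to be stable across runs.
    codes = {value: code for code, value in enumerate(set(var))}
    return [codes[x] for x in var]


# Single column: call the function directly (Series.apply would run it per element)
df['new_factor1'] = factor(df['old_categorical1'])
# Several columns: DataFrame.apply passes each column to the function in turn
df[['new_factor2', 'new_factor3']] = df[['old_categorical2', 'old_categorical3']].apply(factor)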
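
If the goal is mainly to encode several columns at once, one alternative sketch is to apply `pd.factorize` per column with `DataFrame.apply`. The frame, column names, and the name `coded` below are hypothetical. Unlike the set-based function, this gives sorted, reproducible codes and encodes missing values as -1.

```python
import pandas as pd

# Hypothetical frame with two categorical columns
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["S", "L", "M", "S"],
})

cols = ["color", "size"]
# Factorize each column independently; [0] keeps the codes and drops the uniques
coded = df[cols].apply(lambda c: pd.factorize(c, sort=True)[0])

print(coded["color"].tolist())  # [2, 0, 2, 1]  (blue=0, green=1, red=2)
print(coded["size"].tolist())   # [2, 0, 1, 2]  (L=0, M=1, S=2)
```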

Answer 4

Answered by Dan Krahenbuhl

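import numpy as np
import matplotlib.pyplot as plt

# C: NumPy array containing category data
# V: NumPy array containing numerical data, same length as C

# Group the numerical values by category
H = np.unique(C)
mydict = {}
for h in H:
    mydict[h] = V[C == h]

# One box per category (newer Matplotlib versions name this parameter tick_labels)
plt.boxplot(list(mydict.values()), labels=list(mydict.keys()))
plt.show()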
