How to map numeric data into categories / bins in a Pandas dataframe

Question

Asked by kiltannen

I've just started coding in python, and my general coding skills are fairly rusty :( so please be a bit patient

I have a pandas dataframe:

It has around 3 million rows. There are 3 kinds of age_units: Y, D, W for Years, Days & Weeks. Any individual over 1 year old has an age unit of Y, and the first grouping I want is <2y old, so all I have to test for in Age Units is Y...

I want to create a new column AgeRange and populate with the following ranges:

<2
2 - 18
18 - 35
35 - 65
65+

so I wrote a function

def agerange(values):
    for i in values:
        if complete.Age_units == 'Y':
            if complete.Age > 1 AND < 18 return '2-18'
            elif complete.Age > 17 AND < 35 return '18-35'
            elif complete.Age > 34 AND < 65 return '35-65'
            elif complete.Age > 64 return '65+'
        else return '< 2'

I thought that if I passed in the dataframe as a whole I would get back what I needed, and could then create the column I wanted, something like this:

agedetails['age_range'] = ageRange(agedetails)

BUT when I try to run the first code to create the function I get:

  File "<ipython-input-124-cf39c7ce66d9>", line 4
    if complete.Age > 1 AND complete.Age < 18 return '2-18'
                          ^
SyntaxError: invalid syntax

Clearly it is not accepting the AND - but I thought I heard in class I could use AND like this? I must be mistaken but then what would be the right way to do this?

So after getting that error, I'm not even sure whether passing in a dataframe like this will throw an error as well. I am guessing probably yes. In which case - how would I make that work as well?

I am looking to learn the best method, but part of the best method for me is keeping it simple even if that means doing things in a couple of steps...

Answer 1

Answered by jpp

With Pandas, you should avoid row-wise operations, as these usually involve an inefficient Python-level loop. Here are a couple of alternatives.
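Before turning to those, a note on the syntax error itself: Python's boolean operator is lowercase `and`, each side of it must be a complete comparison, and every `if`/`elif`/`else` line needs a colon before its body. A minimal sketch of a corrected scalar function is shown here for reference only. The name `age_range` and its two-argument signature are hypothetical, and it assumes that any age recorded in days or weeks belongs in the '<2' group.

```python
def age_range(age, unit):
    """Return the age-range label for a single (age, unit) pair."""
    # Ages recorded in days (D) or weeks (W) are assumed to be under 2 years.
    if unit != 'Y' or age < 2:
        return '<2'
    elif age > 1 and age < 18:   # lowercase `and`, full comparison on both sides
        return '2-18'
    elif 17 < age < 35:          # chained comparison, equivalent to using `and`
        return '18-35'
    elif 34 < age < 65:
        return '35-65'
    else:
        return '65+'


# Hypothetical sample values, applied one pair at a time.
ages = [10, 1, 2, 17, 18, 64, 65]
units = ['D', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y']
labels = [age_range(a, u) for a, u in zip(ages, units)]
```

Calling a function like this once per row works, but on roughly 3 million rows it is the slow Python-level loop mentioned above.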

Pandas: `pd.cut`

As @JonClements suggests, you can use pd.cut for this. The benefit is that your new column becomes a Categorical.

You only need to define your boundaries (including np.inf) and category names, then apply pd.cut to the desired numeric column.

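import numpy as np
import pandas as pd

bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# right=False makes each bin include its lower edge: [0, 2), [2, 18), ...
df['AgeRange'] = pd.cut(df['Age'], bins, labels=names, right=False)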

print(df.dtypes)

# Age             int64
# Age_units      object
# AgeRange     category
# dtype: object
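Two details are worth checking on real data: which bin a boundary age such as 2 or 18 lands in, and what happens to ages recorded in days or weeks. The sketch below uses hypothetical sample data and assumes that any row whose Age_units is not 'Y' should be treated as 0 years, so that it falls in '<2'. It also contrasts `right=False` (left-closed bins) with the `pd.cut` default of `right=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; rows in days (D) or weeks (W) are infants.
df = pd.DataFrame({
    'Age':       [10, 3, 1, 2, 17, 18, 35, 65, 99],
    'Age_units': ['D', 'W', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y'],
})

bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# Assumption: anything not recorded in years counts as 0 years.
age_years = df['Age'].where(df['Age_units'] == 'Y', 0)

# Left-closed bins [0, 2), [2, 18), ...: an age of 2 maps to '2-18'.
df['AgeRange'] = pd.cut(age_years, bins, labels=names, right=False)

# Default right-closed bins (0, 2], (2, 18], ...: 0 falls outside every
# bin and becomes NaN, and an age of 2 maps to '<2'.
default = pd.cut(age_years, bins, labels=names)
```

With the default right-closed bins, passing `include_lowest=True` would keep 0 in the first bin, but boundary ages would still fall into the lower group.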

NumPy: `np.digitize`

np.digitize provides another clean solution. The idea is to define your boundaries and names, create a dictionary, then apply np.digitize to your Age column. Finally, use your dictionary to map the bin indices to your category names.

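Note that a value lying exactly on a boundary is assigned to the bin that starts at that boundary, i.e. each boundary is the inclusive lower bound of its bin. An age of 18 therefore maps to '18-35'.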

import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [99, 53, 71, 84, 84],
                   'Age_units': ['Y', 'Y', 'Y', 'Y', 'Y']})

bins = [0, 2, 18, 35, 65]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# np.digitize returns 1-based bin indices, so number the names from 1
d = dict(enumerate(names, 1))

# look up the label for each bin index
df['AgeRange'] = np.vectorize(d.get)(np.digitize(df['Age'], bins))

Result

   Age Age_units AgeRange
0   99         Y      65+
1   53         Y    35-65
2   71         Y      65+
3   84         Y      65+
4   84         Y      65+
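To see what np.digitize returns before the names are mapped, the intermediate bin indices can be inspected directly. The sketch below uses a hypothetical `ages` array chosen to hit the boundaries, and it swaps the dictionary lookup for plain array indexing (`idx - 1`) as one possible alternative to `np.vectorize(d.get)`. It assumes that no age is negative: a negative value would get index 0, and `idx - 1` would then wrap around to the last label.

```python
import numpy as np

bins = [0, 2, 18, 35, 65]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# Hypothetical ages, including values exactly on the bin boundaries.
ages = np.array([0, 1, 2, 17, 18, 64, 65, 99])

# 1-based index i such that bins[i-1] <= age < bins[i];
# ages >= 65 get index len(bins) == 5.
idx = np.digitize(ages, bins)

# Shift to 0-based positions and index into the label array.
# Assumes all ages are >= 0, so idx is never 0.
labels = np.array(names)[idx - 1]
```

Here `idx` comes out as 1, 1, 2, 2, 3, 4, 5, 5, which shows that the boundary ages 2, 18 and 65 each open a new bin.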
