Python Pandas - Changing some column types to categories

Source: Stack Overflow, http://stackoverflow.com/questions/28910851/
License: CC BY-SA 4.0. You are free to use and share this content, but you must attribute it to the original authors on Stack Overflow and keep the same license.

Time: 2020-08-19 03:53:46  Source: igfitidea

Tags: python, numpy, pandas, multiple-columns, categories

Asked by gincard

I have fed the following CSV file into iPython Notebook:

public = pd.read_csv("categories.csv")
public

I've also imported pandas as pd, numpy as np and matplotlib.pyplot as plt. The following data types are present (this is a summary; there are about 100 columns):

In [36]:   public.dtypes
Out[36]:   parks          object
           playgrounds    object
           sports         object
           roading        object
           resident       int64
           children       int64

I want to change 'parks', 'playgrounds', 'sports' and 'roading' to categories, leaving the remainder as int64. These columns hold Likert scale responses, although each column has a different type of Likert response (e.g. one has "strongly agree", "agree" etc., another has "very important", "important" etc.).

I was able to create a separate dataframe - public1 - and change one of the columns to a category type using the following code:

public1 = {'parks': public.parks}
public1 = public1['parks'].astype('category')

However, when I tried to change several columns at once using this code, I was unsuccessful:

public1 = {'parks': public.parks,
           'playgrounds': public.parks}
public1 = public1['parks', 'playgrounds'].astype('category')

Notwithstanding this, I don't want to create a separate dataframe with just the categories columns. I would like them changed in the original dataframe.

I tried numerous ways to achieve this, then tried the code here: Pandas: change data type of columns...

public[['parks', 'playgrounds', 'sports', 'roading']] = public[['parks', 'playgrounds', 'sports', 'roading']].astype('category')

and got the following error:

 NotImplementedError: > 1 ndim Categorical are not supported at this time

Is there a way to change 'parks', 'playgrounds', 'sports' and 'roading' to categories (so the Likert scale responses can then be analysed), leaving 'resident' and 'children' (and the 94 other columns, which are strings, ints and floats) untouched? Or is there a better way to do this? If anyone has any suggestions and/or feedback I would be most grateful... I am slowly going bald ripping my hair out!

Many thanks in advance.

edited to add - I am using Python 2.7.

Accepted answer by unutbu

Sometimes, you just have to use a for-loop:

for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
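
For anyone who wants to try the loop without the original CSV, here is a minimal sketch. The small `public` frame and its values are made up for illustration, and the ordered `likert` dtype at the end is an optional extra that is not part of the answer above; it assumes you want the Likert responses to sort and compare in a meaningful order.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `public` frame (made-up values).
public = pd.DataFrame({
    'parks': ['agree', 'strongly agree', 'disagree', 'agree'],
    'playgrounds': ['agree', 'agree', 'disagree', 'strongly agree'],
    'resident': [1, 2, 3, 1],
})

# The loop from the answer: convert each listed column, leave the rest alone.
for col in ['parks', 'playgrounds']:
    public[col] = public[col].astype('category')

print(public.dtypes)

# Optional extra (an assumption, not from the answer): declare an explicit
# order for one Likert column so that min/max/sorting follow the scale.
likert = pd.api.types.CategoricalDtype(
    categories=['disagree', 'agree', 'strongly agree'], ordered=True)
public['parks'] = public['parks'].astype(likert)
print(public['parks'].min())
```

Each column needs its own category list if the scales differ, which is one reason a per-column loop is a natural fit here.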

Answer by Kevin

As of pandas 0.19.0, What's New describes that read_csv supports parsing Categorical columns directly. This answer applies only if you're starting from read_csv; otherwise, I think unutbu's answer is still best. Example on 10,000 records:

import pandas as pd
import numpy as np

# Generate random data, four category-like columns, two int columns
N=10000
categories = pd.DataFrame({
            'parks' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
            'playgrounds' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
            'sports' : np.random.choice(['important', 'very important', 'not important'], size=N),
            'roading' : np.random.choice(['important', 'very important', 'not important'], size=N),
            'resident' : np.random.choice([1, 2, 3], size=N),
            'children' : np.random.choice([0, 1, 2, 3], size=N)
                       })
categories.to_csv('categories_large.csv', index=False)

<0.19.0 (or >=0.19.0 without specifying dtype)

pd.read_csv('categories_large.csv').dtypes # inspect default dtypes

children        int64
parks          object
playgrounds    object
resident        int64
roading        object
sports         object
dtype: object

>=0.19.0

For mixed dtypes, parsing as Categorical can be implemented by passing a dictionary dtype={'colname': 'category', ...} in read_csv.

pd.read_csv('categories_large.csv', dtype={'parks': 'category',
                                           'playgrounds': 'category',
                                           'sports': 'category',
                                           'roading': 'category'}).dtypes
children          int64
parks          category
playgrounds    category
resident          int64
roading        category
sports         category
dtype: object
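
If you just want to see the `dtype` dictionary in action without writing a file first, the sketch below reads from an in-memory string instead. The CSV text and the `csv_text` / `category_cols` names are made up for illustration; only the `dtype={...: 'category'}` argument comes from the answer.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for categories.csv.
csv_text = """parks,playgrounds,resident
agree,disagree,1
strongly agree,agree,2
agree,agree,3
"""

# Only the listed columns are parsed as categories; the rest are inferred.
category_cols = {'parks': 'category', 'playgrounds': 'category'}
public = pd.read_csv(io.StringIO(csv_text), dtype=category_cols)

print(public.dtypes)
print(list(public['parks'].cat.categories))
```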

Performance

A slight speed-up (local jupyter notebook), as mentioned in the release notes.

%%timeit
# unutbu's answer
public = pd.read_csv('categories_large.csv')
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
10 loops, best of 3: 20.1 ms per loop

%%timeit
# parsed during read_csv
category_cols = {item: 'category' for item in ['parks', 'playgrounds', 'sports', 'roading']}
public = pd.read_csv('categories_large.csv', dtype=category_cols)
100 loops, best of 3: 14.3 ms per loop

Answer by Derek Kaknes

You can use the pandas.DataFrame.apply method along with a lambda expression to solve this. In your example you could use

df[['parks', 'playgrounds', 'sports']].apply(lambda x: x.astype('category'))

I don't know of a way to execute this in place (the call above returns a new DataFrame), so typically I'll end up with something like this:

df[df.select_dtypes(['object']).columns] = df.select_dtypes(['object']).apply(lambda x: x.astype('category'))

Obviously you can replace .select_dtypes with explicit column names if you don't want to select all of a certain datatype (although in your example it seems like you wanted all object types).
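
Here is a small self-contained sketch of the select_dtypes plus apply pattern. The frame is made up for illustration, and the string columns are created with an explicit `dtype=object` only so that `select_dtypes(include=['object'])` picks them up regardless of how a given pandas version infers string columns by default.

```python
import pandas as pd

# Hypothetical data; dtype=object is set explicitly (an assumption made
# so the selection below behaves the same across pandas versions).
df = pd.DataFrame({
    'parks': pd.Series(['agree', 'disagree', 'agree'], dtype=object),
    'sports': pd.Series(['important', 'very important', 'important'],
                        dtype=object),
    'resident': [1, 2, 3],
})

# Pick the object columns once, convert them, and assign the result back.
obj_cols = df.select_dtypes(include=['object']).columns
df[obj_cols] = df[obj_cols].apply(lambda s: s.astype('category'))

print(df.dtypes)
```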

Answer by NikoTumi

I found that using a for loop works well.

for col in ['col_variable_name_1', 'col_variable_name_2']:  # etc.
    dataframe_name[col] = dataframe_name[col].astype(float)  # use 'category' for the question's case

Answer by rsc05

Jupyter Notebook

In my case, I had a big DataFrame with many object columns that I wanted to convert to category.

So I selected the object columns, filled any NA values with the string 'Missing', converted them, and then saved the result back into the original DataFrame:

# Convert object columns to categories
obj_df = df.select_dtypes(include=['object']).copy()
obj_df = obj_df.fillna('Missing')
for col in obj_df:
    obj_df[col] = obj_df[col].astype('category')
df[obj_df.columns] = obj_df[obj_df.columns]
df.head()

I hope this might be a helpful resource for later reference.
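
To illustrate what the fillna step changes, here is a minimal sketch on made-up data: without it, missing values stay NaN and are not a category of their own; with it, 'Missing' becomes a regular category. The frame and the `without_fill` / `with_fill` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing response.
df = pd.DataFrame({
    'parks': pd.Series(['agree', np.nan, 'disagree'], dtype=object),
    'resident': [1, 2, 3],
})

# Converting directly keeps the gap as NaN, outside the category list.
without_fill = df['parks'].astype('category')

# Filling first turns the gap into an explicit 'Missing' category.
with_fill = df['parks'].fillna('Missing').astype('category')

print(list(without_fill.cat.categories))
print(list(with_fill.cat.categories))
```

Whether an explicit 'Missing' level is desirable depends on the analysis; for Likert data you may prefer to keep NaN so that non-responses are not counted as a response.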

Answer by Maximilian Peters

No need for loops; Pandas can do it directly now. Just pass a list of the columns you want to convert and Pandas will convert them all.

cols = ['parks', 'playgrounds', 'sports', 'roading']
public[cols] = public[cols].astype('category')


df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['c', 'd', 'e']})

>>     a  b
>>  0  a  c
>>  1  b  d
>>  2  c  e

df.dtypes
>> a    object
>> b    object
>> dtype: object

df[df.columns] = df[df.columns].astype('category')
df.dtypes
>> a    category
>> b    category
>> dtype: object
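
A closely related variant, sketched here on made-up data, is to pass a column-to-dtype dictionary to DataFrame.astype. This returns a new frame rather than assigning into column slices; the `public` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame.
public = pd.DataFrame({
    'parks': ['agree', 'disagree', 'agree'],
    'playgrounds': ['agree', 'agree', 'disagree'],
    'resident': [1, 2, 3],
})

# Map only the chosen columns to 'category'; unlisted columns keep their dtype.
cols = ['parks', 'playgrounds']
public = public.astype({col: 'category' for col in cols})

print(public.dtypes)
```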

Answer by liangli

To make things easier. NO APPLY. NO MAP. NO LOOP.

    cols=data.select_dtypes(exclude='int').columns.to_list()
    data[cols]=data[cols].astype('category')
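
One thing to watch with excluding only 'int' is that float columns would be selected and converted as well. A hedged variant, assuming you want to leave every numeric column untouched, is to exclude 'number' instead. The `data` frame and its 'income' column are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with a string, an int and a float column.
data = pd.DataFrame({
    'parks': ['agree', 'disagree', 'agree'],
    'resident': [1, 2, 3],
    'income': [1.5, 2.0, 3.25],
})

# Exclude all numeric dtypes (ints and floats), then convert what is left.
cols = data.select_dtypes(exclude='number').columns.to_list()
data[cols] = data[cols].astype('category')

print(cols)
print(data.dtypes)
```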