Python Pandas - Changing some column types to categories
Original question: http://stackoverflow.com/questions/28910851/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute the content to the original authors on Stack Overflow, not to this page.
Asked by gincard
I have fed the following CSV file into iPython Notebook:
public = pd.read_csv("categories.csv")
public
I've also imported pandas as pd, numpy as np and matplotlib.pyplot as plt. The following data types are present (this is a summary; there are about 100 columns):
In [36]: public.dtypes
Out[37]:
parks          object
playgrounds    object
sports         object
roading        object
resident        int64
children        int64
I want to change 'parks', 'playgrounds', 'sports' and 'roading' to categories, leaving the remainder as int64. These columns hold Likert scale responses, though each column uses a different scale (e.g. one has "strongly agree", "agree" etc., another has "very important", "important" etc.).
I was able to create a separate dataframe - public1 - and change one of the columns to a category type using the following code:
public1 = {'parks': public.parks}
public1 = public1['parks'].astype('category')
However, when I tried to change several columns at once using this code, I was unsuccessful:
public1 = {'parks': public.parks,
'playgrounds': public.parks}
public1 = public1['parks', 'playgrounds'].astype('category')
Notwithstanding this, I don't want to create a separate dataframe with just the categories columns. I would like them changed in the original dataframe.
I tried numerous ways to achieve this, then tried the code here: Pandas: change data type of columns...
public[['parks', 'playgrounds', 'sports', 'roading']] = public[['parks', 'playgrounds', 'sports', 'roading']].astype('category')
and got the following error:
NotImplementedError: > 1 ndim Categorical are not supported at this time
Is there a way to change 'parks', 'playgrounds', 'sports' and 'roading' to categories (so the Likert scale responses can then be analysed), leaving 'resident' and 'children' (and the 94 other columns, which are strings, ints and floats) untouched? Or is there a better way to do this? If anyone has any suggestions and/or feedback I would be most grateful... I am slowly going bald ripping my hair out!
Many thanks in advance.
edited to add - I am using Python 2.7.
Accepted answer by unutbu
Sometimes, you just have to use a for-loop:
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
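Since each Likert column in the question uses its own scale, the same loop can also attach an explicit order per column. The following is only a sketch: the sample data and the `scales` mapping (including the order of the levels) are assumptions for illustration, not taken from the asker's file.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's CSV.
public = pd.DataFrame({
    "parks": ["agree", "strongly agree", "disagree", "agree"],
    "roading": ["important", "very important", "not important", "important"],
    "resident": [1, 2, 3, 1],
})

# Assumed orderings, lowest to highest; adjust to the real survey wording.
scales = {
    "parks": ["disagree", "agree", "strongly agree"],
    "roading": ["not important", "important", "very important"],
}

for col, levels in scales.items():
    # Any value not listed in `levels` would become NaN after the cast.
    likert_dtype = pd.CategoricalDtype(categories=levels, ordered=True)
    public[col] = public[col].astype(likert_dtype)

print(public.dtypes)
# Sorting now follows the scale order rather than alphabetical order.
print(public.sort_values("parks")["parks"].tolist())
```

Columns that are not named in `scales` (here `resident`) are left untouched.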
Answer by Kevin
As of pandas 0.19.0, What's New describes that read_csv supports parsing Categorical columns directly. This answer applies only if you're starting from read_csv; otherwise, I think unutbu's answer is still best. Example on 10,000 records:
import pandas as pd
import numpy as np
# Generate random data: four category-like columns, two int columns
N = 10000
categories = pd.DataFrame({
    'parks': np.random.choice(['strongly agree', 'agree', 'disagree'], size=N),
    'playgrounds': np.random.choice(['strongly agree', 'agree', 'disagree'], size=N),
    'sports': np.random.choice(['important', 'very important', 'not important'], size=N),
    'roading': np.random.choice(['important', 'very important', 'not important'], size=N),
    'resident': np.random.choice([1, 2, 3], size=N),
    'children': np.random.choice([0, 1, 2, 3], size=N)
})
categories.to_csv('categories_large.csv', index=False)
<0.19.0 (or >=0.19.0 without specifying dtype)
pd.read_csv('categories_large.csv').dtypes # inspect default dtypes
children        int64
parks          object
playgrounds    object
resident        int64
roading        object
sports         object
dtype: object
>=0.19.0
For mixed dtypes, parsing as Categorical can be implemented by passing a dictionary dtype={'colname': 'category', ...} to read_csv.
pd.read_csv('categories_large.csv', dtype={'parks': 'category',
'playgrounds': 'category',
'sports': 'category',
'roading': 'category'}).dtypes
children          int64
parks          category
playgrounds    category
resident          int64
roading        category
sports         category
dtype: object
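To try the `dtype` mapping without writing a file to disk, a small in-memory CSV is enough. This is a minimal sketch: the CSV text and the `category_cols` list are made up for illustration.

```python
import io

import pandas as pd

# Made-up CSV content; a real file path works the same way.
csv_text = """parks,sports,resident
agree,important,1
strongly agree,very important,2
disagree,important,3
"""

category_cols = ["parks", "sports"]
public = pd.read_csv(
    io.StringIO(csv_text),
    # Only the listed columns are parsed as category; the rest are inferred.
    dtype={col: "category" for col in category_cols},
)

print(public.dtypes)
```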
Performance
A slight speed-up (local jupyter notebook), as mentioned in the release notes.
# unutbu's answer
%%timeit
public = pd.read_csv('categories_large.csv')
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
10 loops, best of 3: 20.1 ms per loop
# parsed during read_csv
%%timeit
category_cols = {item: 'category' for item in ['parks', 'playgrounds', 'sports', 'roading']}
public = pd.read_csv('categories_large.csv', dtype=category_cols)
100 loops, best of 3: 14.3 ms per loop
Answer by Derek Kaknes
You can use the pandas.DataFrame.apply method along with a lambda expression to solve this. In your example you could use:
df[['parks', 'playgrounds', 'sports']].apply(lambda x: x.astype('category'))
I don't know of a way to execute this in place, so typically I'll end up with something like this:
df[df.select_dtypes(['object']).columns] = df.select_dtypes(['object']).apply(lambda x: x.astype('category'))
Obviously you can replace .select_dtypes with explicit column names if you don't want to select all of a certain datatype (although in your example it seems like you wanted all object types).
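A self-contained version of the select_dtypes plus apply pattern might look like the sketch below. The sample frame is hypothetical, and the string columns are created with an explicit object dtype so that the `include=["object"]` filter matches them regardless of how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical data; dtype=object is set explicitly for the string columns.
df = pd.DataFrame({
    "parks": pd.Series(["agree", "disagree", "agree"], dtype=object),
    "sports": pd.Series(["important", "important", "very important"], dtype=object),
    "children": [0, 2, 1],
})

# Pick every object column, convert each one, and assign the result back.
obj_cols = df.select_dtypes(include=["object"]).columns
df[obj_cols] = df[obj_cols].apply(lambda s: s.astype("category"))

print(df.dtypes)
```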
Answer by NikoTumi
I found that using a for loop works well.
for col in ['col_variable_name_1', 'col_variable_name_2']:  # add further column names as needed
    dataframe_name[col] = dataframe_name[col].astype(float)  # pass 'category' instead of float for the question's case
Answer by rsc05
Jupyter Notebook
In my case, I had a big DataFrame with many object columns that I wanted to convert to category.
Therefore, I selected the object columns, filled any NA values with 'Missing', and then saved the result back into the original DataFrame, as in:
# Convert object columns to categories
obj_df = df.select_dtypes(include=['object']).copy()
obj_df = obj_df.fillna('Missing')
for col in obj_df:
    obj_df[col] = obj_df[col].astype('category')
df[obj_df.columns] = obj_df[obj_df.columns]
df.head()
I hope this might be a helpful resource for later reference.
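Whether to fill NA values first depends on whether "Missing" should count as a response level of its own. The sketch below, on made-up data, compares the two outcomes: with `fillna` the placeholder becomes a regular category, and without it the missing value stays NaN and is not listed among the categories.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing response.
df = pd.DataFrame({
    "parks": pd.Series(["agree", np.nan, "disagree"], dtype=object),
    "resident": [1, 2, 3],
})

# Variant from the answer: the placeholder becomes its own category.
filled = df["parks"].fillna("Missing").astype("category")

# Variant without filling: NaN is kept as a missing value.
kept = df["parks"].astype("category")

print(list(filled.cat.categories))
print(list(kept.cat.categories))
print(kept.isna().sum())
```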
Answer by Maximilian Peters
No need for loops, Pandas can do it directly now, just pass a list of columns you want to convert and Pandas will convert them all.
cols = ['parks', 'playgrounds', 'sports', 'roading']
public[cols] = public[cols].astype('category')
df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['c', 'd', 'e']})
>> a b
>> 0 a c
>> 1 b d
>> 2 c e
df.dtypes
>> a object
>> b object
>> dtype: object
df[df.columns] = df[df.columns].astype('category')
df.dtypes
>> a category
>> b category
>> dtype: object
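A closely related variant, which is not part of the answer above, is to pass a column-to-dtype mapping to `DataFrame.astype`; columns missing from the mapping keep their dtype. A minimal sketch with a made-up frame (the extra `n` column is an assumption added to show an untouched column):

```python
import pandas as pd

df = pd.DataFrame({"a": ["a", "b", "c"], "b": ["c", "d", "e"], "n": [1, 2, 3]})

# Only the columns named in the mapping are converted.
df = df.astype({"a": "category", "b": "category"})

print(df.dtypes)
```

Note that this returns a new DataFrame, so the result has to be assigned back.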
Answer by liangli
To make things easier. NO APPLY. NO MAP. NO LOOP.
cols=data.select_dtypes(exclude='int').columns.to_list()
data[cols]=data[cols].astype('category')
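Note that excluding only 'int' also selects float columns, which the question wanted left alone. One possible adjustment, shown here as a sketch on made-up data, is to exclude every numeric dtype with `exclude="number"`. This still converts any other non-numeric column (booleans or datetimes, for example), so it assumes the remaining columns are all meant to become categories.

```python
import pandas as pd

# Hypothetical frame with a text column, an int column and a float column.
data = pd.DataFrame({
    "parks": ["agree", "disagree", "agree"],
    "resident": [1, 2, 3],
    "score": [0.5, 1.5, 2.5],
})

# Select everything that is not numeric, then convert in one assignment.
cols = data.select_dtypes(exclude="number").columns.to_list()
data[cols] = data[cols].astype("category")

print(cols)
print(data.dtypes)
```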