Pandas: cast all object columns to category

Question

Asked by Georg Heiler

I want an elegant function to cast all object columns in a pandas data frame to categories.

df[x] = df[x].astype("category") performs the type cast, and df.select_dtypes(include=['object']) sub-selects all object columns. However, this drops the other columns, so a manual merge is required. Is there a solution that "just works in place", or that does not require a manual cast?

edit

I am looking for something similar to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.convert_objects.html, but for conversion to categorical data.

Answer 1

Answered by piRSquared

Use apply and pd.Series.astype with dtype='category'.

Consider the pd.DataFrame df:

import pandas as pd

df = pd.DataFrame(dict(
        A=[1, 2, 3, 4],
        B=list('abcd'),
        C=[2, 3, 4, 5],
        D=list('defg')
    ))
df

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A    4 non-null int64
B    4 non-null object
C    4 non-null int64
D    4 non-null object
dtypes: int64(2), object(2)
memory usage: 200.0+ bytes

Let's use select_dtypes to pick out all 'object' columns for conversion, then recombine them with a second select_dtypes call that excludes them.

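# The original answer ended with .reindex_axis(df.columns, axis=1); that method
# is gone from newer pandas, so .reindex is used here to restore the column order.
df = pd.concat([
        df.select_dtypes(exclude=['object']),
        df.select_dtypes(include=['object']).apply(pd.Series.astype, dtype='category')
        ], axis=1).reindex(df.columns, axis=1)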

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A    4 non-null int64
B    4 non-null category
C    4 non-null int64
D    4 non-null category
dtypes: category(2), int64(2)
memory usage: 208.0 bytes
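
For reuse, the same conversion can be wrapped in a small helper. This is a minimal sketch and not part of the original answer: the function name object_columns_to_category is made up for illustration, and it relies on DataFrame.astype accepting a column-to-dtype mapping, which leaves the other columns and the column order untouched. The string columns are built with an explicit dtype=object so the example does not depend on how a given pandas version infers string columns.

```python
import pandas as pd


def object_columns_to_category(frame):
    """Return a copy of frame with every object-dtype column cast to category.

    Hypothetical helper name, used here for illustration only.
    """
    object_cols = frame.select_dtypes(include=['object']).columns
    # astype with a {column: dtype} mapping only converts the listed columns
    return frame.astype({col: 'category' for col in object_cols})


df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': pd.Series(list('abcd'), dtype=object),
    'C': [2, 3, 4, 5],
    'D': pd.Series(list('defg'), dtype=object),
})

converted = object_columns_to_category(df)
print(converted.dtypes)
```

The helper returns a new frame rather than modifying df, so the original object columns stay available if the cast needs to be undone.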

Answer 2

Answered by KG in Chicago

I think that this is a more elegant way:

df = pd.DataFrame(dict(
        A=[1, 2, 3, 4],
        B=list('abcd'),
        C=[2, 3, 4, 5],
        D=list('defg')
    ))

df.info()

df.loc[:, df.dtypes == 'object'] =\
    df.select_dtypes(['object'])\
    .apply(lambda x: x.astype('category'))

df.info()
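
One caveat, offered as a hedged note rather than something from the answer: newer pandas releases (2.0 and later, as far as the release notes indicate) try to set values in place when assigning through df.loc[:, ...], so the snippet above may leave the columns as object. A sketch that sidesteps this by assigning to a list of column labels is shown here; obj_cols is only an illustrative variable name, and the object columns are created with an explicit dtype=object so the result does not depend on string-dtype inference.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': pd.Series(list('abcd'), dtype=object),
    'C': [2, 3, 4, 5],
    'D': pd.Series(list('defg'), dtype=object),
})

# Labels of the object-dtype columns
obj_cols = df.select_dtypes(include=['object']).columns

# Assigning to a list of column labels replaces those columns, dtype included
df[obj_cols] = df[obj_cols].astype('category')

df.info()
```

The other columns and the column order are unchanged, so no concat or reindex step is needed.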

Answer 3

Answered by a Data Head

Wish I could add this as a comment, but can't.

The accepted answer doesn't work with pandas version 0.25 and higher. Use .reindex instead of reindex_axis. See here for more information: https://github.com/scikit-hep/root_pandas/issues/82

Answer 4

Answered by Anton Golubev

Often the order of categories has meaning. For example, T-shirt sizes 'S', 'M', 'L', 'XL' are ordered categories (ordinals, in SPSS terms). If you are interested in creating ordered categories from strings, you can use this code:

df = pd.concat([
        df.select_dtypes([], ['object']),
        df.select_dtypes(['object']).apply(pd.Categorical, ordered=True)
        ], axis=1).reindex(df.columns, axis=1)

In the resulting DataFrame, the categorical columns can be sorted by value the same way you would sort strings.
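
Note that pd.Categorical(..., ordered=True) without explicit categories takes the sorted unique values as the order, which for strings is alphabetical and would not give S < M < L < XL. A minimal sketch with an explicit order follows, using the T-shirt sizes mentioned above; the column name size and the sample values are invented for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Explicit category order: S < M < L < XL
size_dtype = CategoricalDtype(categories=['S', 'M', 'L', 'XL'], ordered=True)

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'size': pd.Series(['L', 'S', 'XL', 'M'], dtype=object)})
df['size'] = df['size'].astype(size_dtype)

# Sorting follows the declared category order, not the alphabet
sorted_sizes = df.sort_values('size')['size'].tolist()
print(sorted_sizes)

# Ordered categoricals also support min/max and comparisons
largest = df['size'].max()
print(largest)
```

Values that are not listed in categories become NaN on conversion, so the list should cover every value that occurs in the column.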
