Pandas dataframe: encoding a categorical variable with thousands of unique values

Question

Asked by roqds

I have a dataframe with data on schools in a few thousand cities. The school is the row identifier and the city is encoded as follows:

school city          category   capacity
1      azez6576sebd  45         23
2      dsqozbc765aj  12         236
3      sqdqsd12887s  8          63 
4      azez6576sebd  7          234 
...

How can I convert the city variable to numeric, given that I have a few thousand cities? I guess one-hot encoding is not appropriate, as I would end up with too many columns. What is the general approach for converting a categorical variable with thousands of levels to numeric?

Thank you.

Answer 1

Answered by YOBEN_S

You can use the category dtype in pandas; the sklearn equivalent is LabelEncoder.

df.city=df.city.astype('category').cat.codes
df
Out[385]: 
   school  city  category  capacity
0       1     0        45        23
1       2     1        12       236
2       3     2         8        63
3       4     0         7       234
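
If you also need to map the integer codes back to city names later, one option is to keep the category labels around. This is a minimal sketch that rebuilds the sample data from the question; the names `city_code` and `code_to_city` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "school": [1, 2, 3, 4],
    "city": ["azez6576sebd", "dsqozbc765aj", "sqdqsd12887s", "azez6576sebd"],
})

# Convert to category dtype once so codes and labels stay in sync.
city_cat = df["city"].astype("category")
df["city_code"] = city_cat.cat.codes

# Lookup table from integer code back to the original city label.
code_to_city = dict(enumerate(city_cat.cat.categories))

print(df["city_code"].tolist())
print(code_to_city[0])
```

Keep in mind that the codes are arbitrary integers, so a model may read an ordering into them that the cities do not have.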

Answer 2

Answered by cs95

A few thousand columns is still manageable in the context of ML classifiers, although you'd want to watch out for the curse of dimensionality.

That aside, you wouldn't want a get_dummies call to result in a memory blowout, so you could generate a SparseDataFrame instead:

v = pd.get_dummies(df.set_index('school').city, sparse=True)
v

        azez6576sebd  dsqozbc765aj  sqdqsd12887s
school                                          
1                  1             0             0
2                  0             1             0
3                  0             0             1
4                  1             0             0

type(v)
pandas.core.sparse.frame.SparseDataFrame

You can generate a sparse matrix using SparseDataFrame.to_coo:

v.to_coo()

<4x3 sparse matrix of type '<class 'numpy.uint8'>'
    with 4 stored elements in COOrdinate format>
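
Note that SparseDataFrame was removed in pandas 1.0. On newer versions, `get_dummies(..., sparse=True)` returns a regular DataFrame whose columns are sparse, and the sparse-specific methods live under the `.sparse` accessor. A rough equivalent of the code above, rebuilt on the sample data from the question, might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "school": [1, 2, 3, 4],
    "city": ["azez6576sebd", "dsqozbc765aj", "sqdqsd12887s", "azez6576sebd"],
})

# sparse=True stores each dummy column as a sparse array,
# so only the non-fill entries are kept in memory.
dummies = pd.get_dummies(df.set_index("school")["city"], sparse=True)

print(dummies.shape)           # (rows, distinct cities)
print(dummies.sparse.density)  # fraction of entries actually stored

# With SciPy installed, dummies.sparse.to_coo() returns a scipy.sparse COO matrix.
```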

Answer 3

Answered by aamir23

An effective approach, used in production ML systems and Kaggle competitions, is to represent each category by statistics of the target, which act as a kind of embedding. For a binary target variable, you can calculate the following for each distinct categorical value:

1) Number of positive labels
2) Number of negative labels
3) Ratio
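
The three statistics above can be sketched with a pandas groupby. The data and the `target` column here are made up for illustration (the question does not include a label), and the ratio is taken as the share of positive labels, which is only one possible reading.

```python
import pandas as pd

# Hypothetical data: "target" is an assumed binary label, not part of the question.
df = pd.DataFrame({
    "city": ["a", "b", "a", "c", "a", "b"],
    "target": [1, 0, 0, 1, 1, 0],
})

# Per-city counts of positive labels and of all rows.
stats = df.groupby("city")["target"].agg(n_pos="sum", n_total="count")
stats["n_neg"] = stats["n_total"] - stats["n_pos"]
# Ratio is taken here as the share of positive labels (an assumption).
stats["pos_ratio"] = stats["n_pos"] / stats["n_total"]

# Attach the per-city statistics to every row as numeric features.
encoded = df.join(stats[["n_pos", "n_neg", "pos_ratio"]], on="city")
print(encoded)
```

In practice these statistics should be computed on the training split only and then applied to validation and test rows; otherwise the target leaks into the features.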

Here's a video explaining it - Large-Scale Learning - Dr. Mikhail Bilenko

Hash encoders are also suitable for your situation, where the 'city' column has a few thousand distinct values. This method scales well, because the number of output columns is fixed in advance regardless of how many cities there are. You need to specify how many output columns you want.
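
To illustrate the idea, here is a hand-rolled sketch of the hashing trick. It is not the implementation used by any particular library; `hash_encode`, `n_components`, and the `hash_i` column names are all illustrative. Each city is hashed to one of `n_components` buckets, so distinct cities can collide in the same column.

```python
import hashlib

import pandas as pd


def hash_encode(values: pd.Series, n_components: int = 8) -> pd.DataFrame:
    """Map each category to one of n_components indicator columns via a stable hash."""

    def bucket(value) -> int:
        # md5 is used only because it is stable across runs, unlike Python's hash().
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        return int(digest, 16) % n_components

    buckets = values.map(bucket)
    columns = [f"hash_{i}" for i in range(n_components)]
    # One indicator column per bucket; each row has exactly one 1.
    data = {col: (buckets == i).astype("int8") for i, col in enumerate(columns)}
    return pd.DataFrame(data, index=values.index)


cities = pd.Series(["azez6576sebd", "dsqozbc765aj", "sqdqsd12887s", "azez6576sebd"])
encoded = hash_encode(cities, n_components=8)
print(encoded)
```

A larger `n_components` reduces collisions at the cost of more columns.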

Another option for supervised learning is a Target Encoder or James-Stein encoder. This technique replaces each category with the average value of the target over the rows with that category. But if your dataset isn't very large and you have only a few examples per category, this method may not be very useful.
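
One common way to soften the small-sample problem is to shrink each category mean toward the global mean. The sketch below uses simple additive smoothing, which is not the James-Stein estimator itself; the function name, the `weight` parameter, and the sample data are assumptions made for illustration.

```python
import pandas as pd


def smoothed_target_encode(df: pd.DataFrame, cat_col: str, target_col: str,
                           weight: float = 10.0) -> pd.Series:
    """Replace each category with its target mean, shrunk toward the global mean."""
    global_mean = df[target_col].mean()
    agg = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    # Categories with few rows are pulled more strongly toward the global mean.
    smoothed = (agg["count"] * agg["mean"] + weight * global_mean) / (agg["count"] + weight)
    return df[cat_col].map(smoothed)


# Hypothetical data: "target" is an assumed label, not part of the question.
df = pd.DataFrame({
    "city": ["a", "a", "a", "b"],
    "target": [1.0, 1.0, 0.0, 0.0],
})
df["city_te"] = smoothed_target_encode(df, "city", "target", weight=2.0)
print(df)
```

As with any target-based encoding, the means should be fitted on training data only to avoid leaking the target.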

Here's a helpful blog post that I referred to: Encoding Categorical Variables
