pandas: How to convert categorical data to numerical data?
Original question: http://stackoverflow.com/questions/51311831/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
How to convert categorical data to numerical data?
Asked by stone rock
I have a feature, city, which is categorical data (i.e. strings). Instead of hardcoding the conversion with replace(), is there a smarter approach?
train['city'].unique()
Output: ['city_149', 'city_83', 'city_16', 'city_64', 'city_100', 'city_21',
'city_114', 'city_103', 'city_97', 'city_160', 'city_65',
'city_90', 'city_75', 'city_136', 'city_159', 'city_67', 'city_28',
'city_10', 'city_73', 'city_76', 'city_104', 'city_27', 'city_30',
'city_61', 'city_99', 'city_41', 'city_142', 'city_9', 'city_116',
'city_128', 'city_74', 'city_69', 'city_1', 'city_176', 'city_40',
'city_123', 'city_152', 'city_165', 'city_89', 'city_36', .......]
What I was trying:
train.replace(['city_149', 'city_83', 'city_16', 'city_64', 'city_100', 'city_21',
'city_114', 'city_103', 'city_97', 'city_160', 'city_65',
'city_90', 'city_75', 'city_136', 'city_159', 'city_67', 'city_28',
'city_10', 'city_73', 'city_76', 'city_104', 'city_27', 'city_30',
'city_61', 'city_99', 'city_41', 'city_142', 'city_9', 'city_116',
'city_128', 'city_74', 'city_69', 'city_1', 'city_176', 'city_40',
'city_123', 'city_152', 'city_165', 'city_89', 'city_36', .......], [1,2,3,4,5,6,7,8,9....], inplace=True)
Is there a better way to convert the data to numerical values? There are 123 unique values, so I would have to hard-code the numbers 1, 2, 3, 4, ..., 123 to convert them. Please suggest a better way to convert this column to numerical values.
Answered by sacuL
Try pd.factorize():
train['city'] = pd.factorize(train.city)[0]
Or use astype('category'):
train['city'] = train['city'].astype('category').cat.codes
For example:
>>> train
city
0 city_151
1 city_149
2 city_151
3 city_149
4 city_149
5 city_149
6 city_151
7 city_151
8 city_150
9 city_151
Using factorize:
train['city'] = pd.factorize(train.city)[0]
>>> train
city
0 0
1 1
2 0
3 1
4 1
5 1
6 0
7 0
8 2
9 0
Or astype('category'):
train['city'] = train['city'].astype('category').cat.codes
>>> train
city
0 2
1 0
2 2
3 0
4 0
5 0
6 2
7 2
8 1
9 2
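The two approaches number the cities differently: factorize assigns codes in order of first appearance, while cat.codes follows the sorted order of the categories. A minimal sketch on a shortened version of the sample data above, which also shows one way to recover the original labels (the variable names are illustrative, not from the answer):

```python
import pandas as pd

train = pd.DataFrame(
    {'city': ['city_151', 'city_149', 'city_151', 'city_149', 'city_150']}
)

# factorize: codes follow order of first appearance; uniques holds the labels
codes, uniques = pd.factorize(train['city'])
print(codes.tolist())   # [0, 1, 0, 1, 2]
print(list(uniques))    # ['city_151', 'city_149', 'city_150']

# category dtype: codes follow the sorted order of the categories
cat = train['city'].astype('category')
print(cat.cat.codes.tolist())     # [2, 0, 2, 0, 1]
print(list(cat.cat.categories))   # ['city_149', 'city_150', 'city_151']

# The factorize codes can be mapped back to the original strings
restored = uniques[codes]
```

Keeping uniques (or cat.cat.categories) around is what makes the encoding reversible later.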
Answered by iDrwish
You can accomplish this via mapping:
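# Map each unique city to an integer 1..N (N is 123 for this data)
cities = train['city'].unique()
value_mapper = dict(zip(cities, np.arange(1, len(cities) + 1)))
train['city'] = train['city'].map(value_mapper)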
Or, more idiomatically, use pandas categorical data:
pd.Categorical(train['city']).codes
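If the same encoding has to be applied to a second DataFrame, one option is to fix the category list from the training data and reuse it. A sketch, assuming a hypothetical test DataFrame that is not part of the original question; values not seen in train receive the code -1:

```python
import pandas as pd

train = pd.DataFrame({'city': ['city_149', 'city_83', 'city_16', 'city_83']})
# Hypothetical second dataset, including one city absent from train
test = pd.DataFrame({'city': ['city_16', 'city_999', 'city_149']})

# Fix the category list from the training data (order of first appearance)
categories = train['city'].unique()

train_codes = pd.Categorical(train['city'], categories=categories).codes
test_codes = pd.Categorical(test['city'], categories=categories).codes

print(train_codes.tolist())  # [0, 1, 2, 1]
print(test_codes.tolist())   # [2, -1, 0]
```

Passing categories explicitly keeps the codes consistent across both frames, which calling pd.Categorical on each frame separately would not guarantee.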
Answered by Void Star
If your values always have an underscore before the integer, a list comprehension might work for you:
data = [int(x.split('_')[-1]) for x in train['city']]
The comprehension loops over each x in train['city'], splits x into underscore-delimited parts, and converts the last part to an integer. This also works if you have more than one underscore, like foo_bar_5.
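The same idea can also be written with pandas string methods instead of a Python-level loop. A sketch, assuming every value ends in an underscore followed by digits (the city_num column name is illustrative):

```python
import pandas as pd

train = pd.DataFrame({'city': ['city_149', 'city_83', 'foo_bar_5']})

# Split once from the right, keep the last piece, and convert it to int
train['city_num'] = train['city'].str.rsplit('_', n=1).str[-1].astype(int)

print(train['city_num'].tolist())  # [149, 83, 5]
```

Unlike factorize, this keeps each city's own number rather than assigning consecutive codes.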