Note: this page reproduces a popular Stack Overflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/40739152/

Time: 2020-09-14 02:29:02  Source: igfitidea

How to use sklearn FeatureHasher?

Tags: python, pandas, scikit-learn

Asked by KillerSnail

I have a dataframe like this:

import pandas as pd
test = pd.DataFrame({'type': ['a', 'b', 'a', 'c', 'b'], 'model': ['bab', 'ba', 'ba', 'ce', 'bw']})

How do I use the sklearn FeatureHasher on it?

I tried:

from sklearn.feature_extraction import FeatureHasher 
FH = FeatureHasher()
train = FH.transform(test.type)

but it doesn't accept that. It seems to want a string or a list, so I tried:

FH.transform(test.to_dict(orient='list'))

but that doesn't work either. I get:

AttributeError: 'str' object has no attribute 'items'

thanks

Answered by Julien Marrec

You need to specify the input type when initializing your instance of FeatureHasher:

In [1]:
from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=5, input_type='string')
f = h.transform(test.type)
f.toarray()

Out[1]:
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0., -1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0., -1.,  0.,  0.],
       [ 0., -1.,  0.,  0.,  0.]])
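
One caveat about the call above: with input_type='string', each sample is meant to be an iterable of strings. As far as I know, recent scikit-learn releases reject a bare string per sample (which is what test.type supplies) with a ValueError. A minimal sketch, assuming such a version, that wraps each value in a one-element list; the `samples` and `dense` names are mine, not part of the original answer:

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

test = pd.DataFrame({'type': ['a', 'b', 'a', 'c', 'b'],
                     'model': ['bab', 'ba', 'ba', 'ce', 'bw']})

h = FeatureHasher(n_features=5, input_type='string')

# Each sample is a list of string tokens; here, one token per row.
samples = [[value] for value in test['type']]
f = h.transform(samples)

# Dense view: 5 rows x 5 hashed columns, one non-zero entry (+1 or -1) per row.
dense = f.toarray()
print(dense)
```

Rows with the same type value hash to the same column, so identical categories produce identical rows. The exact columns and signs depend on the hash, so they are not shown here.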

Note that this assumes the value of each of these features is 1, according to the FeatureHasher documentation:

input_type : string, optional, default “dict”

  • Either “dict” (the default) to accept dictionaries over (feature_name, value);
  • “pair” to accept pairs of (feature_name, value);
  • or “string” to accept single strings. feature_name should be a string, while value should be a number. In the case of “string”, a value of 1 is implied.

The feature_name is hashed to find the appropriate column for the feature. The value's sign might be flipped in the output (but see non_negative, below).
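
The default "dict" input type also explains the AttributeError in the question. to_dict(orient='list') returns a single dict of columns, and iterating over it yields the column names (strings) rather than one dict per row. A hedged sketch of the per-row alternative, assuming a scikit-learn version in which string values are hashed as "name=value" with an implied value of 1, and in which alternate_sign has taken over the role of the non_negative option quoted above (both are my understanding of recent releases, not something the original answer states):

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

test = pd.DataFrame({'type': ['a', 'b', 'a', 'c', 'b'],
                     'model': ['bab', 'ba', 'ba', 'ce', 'bw']})

# One dict per row, e.g. {'type': 'a', 'model': 'bab'}.
records = test.to_dict(orient='records')

# alternate_sign=False keeps every hashed value non-negative.
# n_features=8 is an arbitrary choice for illustration.
h = FeatureHasher(n_features=8, input_type='dict', alternate_sign=False)
hashed = h.transform(records).toarray()
print(hashed)
```

Each row then carries two hashed features, one per column. With so few output columns, collisions are likely, in which case two features add up in the same column.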
