pandas: How to use sklearn FeatureHasher?
Original source: http://stackoverflow.com/questions/40739152/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to use sklearn FeatureHasher?
Asked by KillerSnail
I have a dataframe like this:
import pandas as pd
test = pd.DataFrame({'type': ['a', 'b', 'a', 'c', 'b'], 'model': ['bab', 'ba', 'ba', 'ce', 'bw']})
How do I use the sklearn FeatureHasher on it?
I tried:
from sklearn.feature_extraction import FeatureHasher
FH = FeatureHasher()
train = FH.transform(test.type)
But it doesn't accept that. It seems to want a string or a list, so I tried:
FH.transform(test.to_dict(orient='list'))
But that doesn't work either. I get:
AttributeError: 'str' object has no attribute 'items'
thanks
Answered by Julien Marrec
You need to specify the input type when initializing your instance of FeatureHasher:
In [1]:
from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=5, input_type='string')
f = h.transform(test.type)
f.toarray()
Out[1]:
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0., -1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0., -1.,  0.,  0.],
       [ 0., -1.,  0.,  0.,  0.]])
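As a self-contained variant of the snippet above, here is a minimal sketch that rebuilds the question's dataframe and hashes it. Two things in it are assumptions rather than part of the original answer: each value is wrapped in a one-element list (with input_type='string' every sample is meant to be an iterable of strings, and some newer scikit-learn releases reject a bare string as a sample), and the second half hashes both columns by building hypothetical `column=value` tokens, with `n_features=16` picked arbitrarily.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

test = pd.DataFrame({'type': ['a', 'b', 'a', 'c', 'b'],
                     'model': ['bab', 'ba', 'ba', 'ce', 'bw']})

# One column: wrap each value in a one-element list so that every sample
# is an iterable of strings rather than a bare string.
h = FeatureHasher(n_features=5, input_type='string')
type_hashed = h.transform([[value] for value in test['type']]).toarray()

# Both columns (assumed extension, not from the original answer): prefix each
# value with its column name so equal strings in different columns stay distinct.
tokens = [[f'{col}={val}' for col, val in row.items()]
          for row in test.to_dict(orient='records')]
h_all = FeatureHasher(n_features=16, input_type='string')
all_hashed = h_all.transform(tokens).toarray()

print(type_hashed.shape, all_hashed.shape)  # (5, 5) (5, 16)
```

Rows with the same `type` value hash to identical rows in `type_hashed`. With so few output columns, different values can collide in the same column, so a larger `n_features` is usually preferable on real data.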
Note that this assumes the value of each of these features is 1, according to the FeatureHasher documentation (emphasis mine):
input_type : string, optional, default “dict”
- Either “dict” (the default) to accept dictionaries over (feature_name, value);
- “pair” to accept pairs of (feature_name, value);
- or “string” to accept single strings. feature_name should be a string, while value should be a number. In the case of “string”, a value of 1 is implied.
The feature_name is hashed to find the appropriate column for the feature. The value's sign might be flipped in the output (but see non_negative, below).
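To make the three input types concrete, the sketch below feeds the same two features through "dict", "pair", and "string" and checks that they produce the same matrix, which is what "a value of 1 is implied" amounts to. The feature names and `n_features=8` are illustrative choices. The last part assumes a scikit-learn version that has the `alternate_sign` parameter (it took over the role of the `non_negative` option mentioned in the quoted text in later releases).

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# 'dict': one {feature_name: value} mapping per sample.
as_dict = FeatureHasher(n_features=8, input_type='dict').transform(
    [{'a': 1}, {'b': 1}]).toarray()

# 'pair': one iterable of (feature_name, value) pairs per sample.
as_pair = FeatureHasher(n_features=8, input_type='pair').transform(
    [[('a', 1)], [('b', 1)]]).toarray()

# 'string': one iterable of feature names per sample; the value 1 is implied.
as_string = FeatureHasher(n_features=8, input_type='string').transform(
    [['a'], ['b']]).toarray()

print(np.array_equal(as_dict, as_pair), np.array_equal(as_pair, as_string))

# Assumption: alternate_sign is available; False disables the sign flipping.
unsigned = FeatureHasher(n_features=8, input_type='string',
                         alternate_sign=False).transform([['a'], ['b']]).toarray()
print((unsigned >= 0).all())
```

Both checks print True: the feature name alone determines the column, and the input type only changes how the value is supplied.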