Combining bag of words and other features in one model using sklearn and pandas
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/30653642/
Combining bag of words and other features in one model using sklearn and pandas
Asked by Jeremy
I am trying to model the score that a post receives, based on both the text of the post and other features (time of day, length of post, etc.).
I am wondering how to best combine these different types of features into one model. Right now, I have something like the following (stolen from here and here).
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

def features(p):
    # Tokenize the message and count terms alongside the two numeric features
    terms = vectorizer(p[0])
    d = {'feature_1': p[1], 'feature_2': p[2]}
    for t in terms:
        d[t] = d.get(t, 0) + 1
    return d
posts = pd.read_csv('path/to/csv')
# Create vectorizer for function to use
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2)).build_tokenizer()
y = posts["score"].values.astype(np.float32)
vect = DictVectorizer()
# This is the part I want to fix
temp = zip(list(posts.message), list(posts.feature_1), list(posts.feature_2))
tokenized = map(lambda x: features(x), temp)
X = vect.fit_transform(tokenized)
It seems very silly to extract all of the features I want out of the pandas dataframe, just to zip them all back together. Is there a better way of doing this step?
The CSV looks something like the following:
ID,message,feature_1,feature_2
1,'This is the text',4,7
2,'This is more text',3,2
...
Answered by khammel
You could do everything with your map and lambda:
tokenized = map(lambda msg, ft1, ft2: features([msg, ft1, ft2]), posts.message, posts.feature_1, posts.feature_2)
This skips your interim temp step and iterates through the three columns directly.
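On Python 3, map returns a lazy iterator rather than a list, so if you want to inspect or reuse tokenized you may prefer to materialize it first. A minimal sketch of the same step, using the features function and vect from the question:

# Python 3: an equivalent list comprehension that yields a concrete list of dicts
tokenized = [features([msg, ft1, ft2])
             for msg, ft1, ft2 in zip(posts.message, posts.feature_1, posts.feature_2)]
X = vect.fit_transform(tokenized)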
Another solution would be to convert the messages into their CountVectorizer sparse matrix and join this matrix with the feature values from the posts dataframe (this skips having to construct a dict and produces a sparse matrix similar to what you would get with DictVectorizer):
import pandas as pd
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

posts = pd.read_csv('post.csv')
# Vectorize the messages directly; no separate tokenizer or dict step is needed
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
y = posts["score"].values.astype(np.float32)
# Stack the sparse bag-of-words matrix with the numeric feature columns
X = sp.hstack((vectorizer.fit_transform(posts.message),
               posts[['feature_1', 'feature_2']].values), format='csr')
X_columns = vectorizer.get_feature_names() + posts[['feature_1', 'feature_2']].columns.tolist()
posts
Out[38]:
   ID              message  feature_1  feature_2  score
0   1   'This is the text'          4          7     10
1   2  'This is more text'          3          2      9
2   3   'More random text'          3          2      9
X_columns
Out[39]:
[u'is',
u'is more',
u'is the',
u'more',
u'more random',
u'more text',
u'random',
u'random text',
u'text',
u'the',
u'the text',
u'this',
u'this is',
'feature_1',
'feature_2']
X.toarray()
Out[40]:
array([[1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 4, 7],
[1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 3, 2],
[0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 3, 2]])
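The resulting sparse X and the target y can then be passed straight to a scikit-learn estimator. A minimal sketch, assuming a plain ridge regression is an acceptable model for the score:

from sklearn.linear_model import Ridge

# Ridge accepts scipy sparse matrices directly, so the stacked X needs no densification
model = Ridge()
model.fit(X, y)
predicted_scores = model.predict(X)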
Additionally, sklearn-pandas has DataFrameMapper, which does what you're looking for too:
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    (['feature_1', 'feature_2'], None),
    ('message', CountVectorizer(binary=True, ngram_range=(1, 2)))
])
X = mapper.fit_transform(posts)
X
Out[71]:
array([[4, 7, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
[3, 2, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1],
[3, 2, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0]])
Note: X is not sparse when using this last method.
X_columns = mapper.features[0][0] + mapper.features[1][1].get_feature_names()
X_columns
Out[76]:
['feature_1',
'feature_2',
u'is',
u'is more',
u'is the',
u'more',
u'more random',
u'more text',
u'random',
u'random text',
u'text',
u'the',
u'the text',
u'this',
u'this is']
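Since DataFrameMapper behaves like an ordinary transformer, it can also be chained with an estimator so that the raw dataframe goes straight into fit and predict. A sketch, assuming the mapper defined above and, again, a ridge regression as a placeholder model:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

# Chain the mapper with a regressor; the pipeline is fit on the raw dataframe
pipeline = Pipeline([
    ('featurize', mapper),
    ('regress', Ridge()),
])
pipeline.fit(posts, posts['score'])
predicted_scores = pipeline.predict(posts)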

