Use FeatureUnion in scikit-learn to combine two pandas columns for tfidf
Source: http://stackoverflow.com/questions/34710281/ (Stack Overflow, licensed under CC BY-SA 4.0; attribute it to the original authors)
Asked by BLodge
While using this as a model for spam classification, I'd like to add the Subject as an additional feature alongside the body.
I have all of my features in a pandas dataframe. For example, the subject is df['Subject'], the body is df['body_text'] and the spam/ham label is df['ham/spam']
I receive the following error: TypeError: 'FeatureUnion' object is not iterable
How can I use both df['Subject'] and df['body_text'] as features all while running them through the pipeline function?
import numpy
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Attempt to combine the two text columns
features = df[['Subject', 'body_text']].values
combined_2 = FeatureUnion(list(features))

pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())])

pipeline.fit(combined_2, df['ham/spam'])

# 6-fold cross-validation
k_fold = KFold(n_splits=6)
scores = []
confusion = numpy.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold.split(df):
    train_text = combined_2.iloc[train_indices]
    train_y = df.iloc[train_indices]['ham/spam'].values

    test_text = combined_2.iloc[test_indices]
    test_y = df.iloc[test_indices]['ham/spam'].values

    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)
    prediction_prob = pipeline.predict_proba(test_text)

    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label='spam')
    scores.append(score)
Answered by David Maust
FeatureUnion was not meant to be used that way. Instead, it takes feature extractors / vectorizers (two in this case) and applies them to the input. It does not take data in its constructor, which is what the code in the question attempts.
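As a minimal sketch of the intended usage, the example below builds a FeatureUnion from two named vectorizers and only then passes data to fit_transform. The sample documents and the step names "words" and "chars" are made up for illustration and are not from the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical sample documents, for illustration only.
docs = ["free money now", "meeting at noon", "free lunch at noon"]

# The constructor takes (name, transformer) pairs, not data.
union = FeatureUnion([
    ("words", CountVectorizer()),                 # word counts
    ("chars", CountVectorizer(analyzer="char")),  # character counts
])

# The data goes to fit_transform; every transformer sees the same input
# and the resulting matrices are concatenated column-wise.
features = union.fit_transform(docs)

n_words = len(union.transformer_list[0][1].vocabulary_)
n_chars = len(union.transformer_list[1][1].vocabulary_)
print(features.shape, n_words, n_chars)
```

The output has one row per document and as many columns as the two vocabularies combined.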
CountVectorizer expects a sequence of strings. The easiest way to provide that is to concatenate the strings together. That passes the text from both columns to the same CountVectorizer.
combined_2 = df['Subject'] + ' ' + df['body_text']
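A possible end-to-end sketch of this concatenation approach follows. The toy DataFrame is invented for illustration (only the column names come from the question), and the fillna("") calls are an added assumption to guard against missing values, which would otherwise make the concatenated string NaN.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data using the column names from the question.
df = pd.DataFrame({
    "Subject": ["Win cash now", "Lunch tomorrow", "Cheap pills", "Project update"],
    "body_text": ["Click here to claim your prize", "Are you free at noon?",
                  "Buy now, limited offer", "Status report attached"],
    "ham/spam": ["spam", "ham", "spam", "ham"],
})

# One string per row, built from both columns.
combined_2 = df["Subject"].fillna("") + " " + df["body_text"].fillna("")

pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf_transformer", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])

# The pipeline receives the Series of strings, not a FeatureUnion.
pipeline.fit(combined_2, df["ham/spam"])
predictions = pipeline.predict(combined_2)
```

With this approach, subject and body words share one vocabulary, so the classifier cannot tell whether a term came from the subject or from the body.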
An alternative method would be to run CountVectorizer, and optionally TfidfTransformer, individually on each column, and then stack the results.
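import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Vectorize each column with its own vocabulary
subject_vectorizer = CountVectorizer(...)
subject_vectors = subject_vectorizer.fit_transform(df['Subject'])
body_vectorizer = CountVectorizer(...)
body_vectors = body_vectorizer.fit_transform(df['body_text'])

# Stack the two sparse matrices side by side
combined_2 = sp.hstack([subject_vectors, body_vectors], format='csr')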
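To make the stacking concrete, here is a small runnable sketch on invented data, using default CountVectorizer settings as a stand-in for the elided options. It also shows one way new rows might be handled: reuse the fitted vectorizers with transform so the columns line up with the training matrix.

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data using the column names from the question.
df = pd.DataFrame({
    "Subject": ["Win cash now", "Lunch tomorrow"],
    "body_text": ["Click here to claim your prize", "Are you free at noon"],
})

# Each column gets its own vectorizer and therefore its own vocabulary.
subject_vectorizer = CountVectorizer()
subject_vectors = subject_vectorizer.fit_transform(df["Subject"])
body_vectorizer = CountVectorizer()
body_vectors = body_vectorizer.fit_transform(df["body_text"])

# Column-wise concatenation: subject features first, then body features.
combined_2 = sp.hstack([subject_vectors, body_vectors], format="csr")

# For unseen rows, call transform (not fit_transform) on the fitted vectorizers.
new = pd.DataFrame({"Subject": ["Cash prize"], "body_text": ["Claim now"]})
new_combined = sp.hstack(
    [subject_vectorizer.transform(new["Subject"]),
     body_vectorizer.transform(new["body_text"])],
    format="csr",
)
print(combined_2.shape, new_combined.shape)
```

The drawback of doing this by hand is that both vectorizers must be tracked and applied in the same order at prediction time.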
A third option is to implement your own transformer that would extract a dataframe column.
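from sklearn.base import BaseEstimator, TransformerMixin

# BaseEstimator adds get_params/set_params, so the transformer can be cloned
class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn; the column name is fixed at construction
        return self

    def transform(self, X, y=None):
        # Return the selected column of the dataframe
        return X[self.column]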
In that case you could use FeatureUnion on two pipelines, each containing your custom transformer followed by a CountVectorizer.
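from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union

# One pipeline per column: extract the column, then count terms
subj_pipe = make_pipeline(
    DataFrameColumnExtracter('Subject'),
    CountVectorizer()
)

body_pipe = make_pipeline(
    DataFrameColumnExtracter('body_text'),
    CountVectorizer()
)

feature_union = make_union(subj_pipe, body_pipe)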
This feature union of pipelines will take the dataframe and each pipeline will process its column. It will produce the concatenation of term count matrices from the two columns given.
sparse_matrix_of_counts = feature_union.fit_transform(df)
This feature union can also be added as the first step in a larger pipeline.
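A rough sketch of such a larger pipeline is given below. It repeats the column extractor so that the block is self-contained, adds BaseEstimator as a base class (an assumption made here for compatibility with cloning), and uses an invented toy DataFrame. The TfidfTransformer and MultinomialNB steps are taken from the question's pipeline.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, make_union


class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):
    """Select a single column from a DataFrame."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.column]


# Hypothetical toy data using the column names from the question.
df = pd.DataFrame({
    "Subject": ["Win cash now", "Lunch tomorrow", "Cheap pills", "Project update"],
    "body_text": ["Click here to claim your prize", "Are you free at noon",
                  "Buy now limited offer", "Status report attached"],
    "ham/spam": ["spam", "ham", "spam", "ham"],
})

# Each inner pipeline pulls out one column and counts its terms.
feature_union = make_union(
    make_pipeline(DataFrameColumnExtracter("Subject"), CountVectorizer()),
    make_pipeline(DataFrameColumnExtracter("body_text"), CountVectorizer()),
)

# The union is the first step; tf-idf weighting and the classifier follow.
model = make_pipeline(feature_union, TfidfTransformer(), MultinomialNB())

# The whole DataFrame is passed in; the extractors pick their own columns.
model.fit(df, df["ham/spam"])
predictions = model.predict(df)
```

Because the column selection lives inside the model, the same object can be handed to cross-validation utilities with the DataFrame as input.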