
Disclaimer: this page is a mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must likewise follow the CC BY-SA license and attribute it to the original authors (not me), citing the original StackOverflow address: http://stackoverflow.com/questions/30927610/

Date: 2020-09-13 23:29:48  Source: igfitidea

How to vectorize a data frame with several text columns in scikit learn without losing track of the origin columns

python, numpy, pandas, machine-learning, scikit-learn

Asked by Eric L

I have several pandas data series, and want to train this data to map to an output, df["output"].


So far I have merged the series into one, and separated each by commas.


import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("sourcedata.csv")
sample = df["catA"] + "," + df["catB"] + "," + df["catC"]

def my_tokenizer(s):
    return s.split(",")

vect = CountVectorizer(analyzer='word', tokenizer=my_tokenizer, ngram_range=(1, 3), min_df=1)
train = vect.fit_transform(sample.values)

lf = LogisticRegression()
lfit = lf.fit(train, df["output"])
pred = lambda x: lfit.predict_proba(vect.transform([x]))

The problem is that this is a bag-of-words approach and doesn't consider:
- the unique order within each category ("orange banana" is different from "banana orange")
- that text in one category has different significance than in another ("US" in one category could mean country of origin vs. destination)


For example, the entire string could be:
pred("US, Chiquita Banana, China")
Category A: Country of origin
Category B: Company & Type of Fruit (order does matter)
Category C: Destination


The way I am doing it currently ignores any type of ordering, and also generates extra spaces in my feature names for some reason (which messes things up more):


In [1242]: vect.get_feature_names()[0:10]
Out[1242]:
[u'',
 u' ',
 u'  ',
 u'   ',
 u'    ',
 u'     ',
 u'   US',
 u'   CA',
 u'   UK']
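(For context, a minimal illustration of one way such space-laden tokens can arise: a comma-only split keeps any whitespace surrounding the values, and word n-grams are then built from those padded tokens. The sample string here is made up.)

```python
def my_tokenizer(s):
    # splitting only on "," keeps surrounding whitespace in each token
    return s.split(",")

print(my_tokenizer("US , Chiquita Banana , China"))
# ['US ', ' Chiquita Banana ', ' China']
```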

Any suggestions are welcome!! Thanks a lot


Answered by maxymoo

OK, first let's prepare your data set, by selecting the relevant columns and removing leading and trailing spaces using strip:


sample = df[['catA','catB','catC']]
sample = sample.apply(lambda col: col.str.strip())

From here you have a couple of options for how to vectorize this for a training set. If you have a smallish number of levels across all of your features (say less than 1000 in total), you can simply treat them as categorical variables and set train = pd.get_dummies(sample) to convert them to binary indicator variables. After this your data will look something like this:


catA_US   catA_CA ... cat_B_chiquita_banana   cat_B_morningstar_tomato ... catC_China ...
1         0           1                       0                            1   
...

Notice that the variable names start with their origin column, so this makes sure that the model will know where they come from. Also you're using exact strings, so word order in the second column will be preserved.

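A minimal sketch of what that looks like (the column values here are made up):

```python
import pandas as pd

sample = pd.DataFrame({
    "catA": ["US", "CA"],
    "catB": ["Chiquita Banana", "Morningstar Tomato"],
    "catC": ["China", "UK"],
})

# each indicator column is prefixed with the name of the column it came from
train = pd.get_dummies(sample)
print(sorted(train.columns))
# ['catA_CA', 'catA_US', 'catB_Chiquita Banana', 'catB_Morningstar Tomato', 'catC_China', 'catC_UK']
```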

If you have too many levels for this to work, or you want to consider the individual words in catB as well as the bigrams, you could apply your CountVectorizer separately to each column, and then use hstack to concatenate the resulting output matrices:


import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(1, 3))
# note: fit_transform is re-fit on each column, so every column gets its own vocabulary
train = sp.hstack(sample.apply(lambda col: vect.fit_transform(col)))

Answered by jcaine

Try mapping your dataframe to a list of dictionaries (where each entry represents a column) that represent your data and then write a custom tokenizer function that accepts a dictionary as input and outputs a list of features.


In the example below, I create a custom tokenizer that iterates through each of your columns so you can do whatever you want with them inside the function before appending them to your tokens list. The data is then converted into a list of dictionaries using Pandas.


def my_tokenizer(d):
    # create empty list to store tokens
    tokens = []

    # do something with catA data
    tokens.append(d['catA'])

    # do something with catB data
    tokens.append(d['catB'].lower())

    return tokens

sample = df[['catA','catB','catC']]
# lowercase=False so CountVectorizer's string preprocessing doesn't try to call .lower() on the dicts
vect = CountVectorizer(tokenizer=my_tokenizer, lowercase=False)
train = vect.fit_transform(sample.to_dict(orient='records'))