Tokenizing using Pandas and spaCy

Disclaimer: this page is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and link to the source: StackOverflow, original question: http://stackoverflow.com/questions/46981137/

Date: 2020-09-14 04:42:28  Source: igfitidea

Tokenizing using Pandas and spaCy

Tags: python, python-3.x, pandas, tokenize, spacy

Asked by LMGagne

I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some NLP (clustering, classification) on 5 text columns (multiple sentences of text per 'cell') and have been using pandas to organize/build the dataset. I'm hoping to use spaCy for all the NLP but can't quite figure out how to tokenize the text in my columns. I've read a bunch of the spaCy documentation and googled around, but all the examples I've found are for a single sentence or word, not 75K rows in a pandas df.


I've tried things like: df['new_col'] = [token for token in (df['col'])]


but would definitely appreciate some help/resources.


full (albeit messy) code available here


Answered by Peter

I've never used spaCy (nltk has always gotten the job done for me), but from glancing at the documentation it looks like this should work:


import spacy
nlp = spacy.load('en')

df['new_col'] = df['text'].apply(lambda x: nlp(x))

Note that nlp by default runs the entire spaCy pipeline, which includes part-of-speech tagging, parsing, and named entity recognition. You can significantly speed up your code by using nlp.tokenizer(x) instead of nlp(x), or by disabling parts of the pipeline when you load the model, e.g. nlp = spacy.load('en', parser=False, entity=False).
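(The load-time flags in that last snippet reflect an older spaCy API; more recent releases use a disable=[...] argument instead. For tokenization alone, no statistical model is needed at all. A minimal sketch of the tokenizer-only approach applied to a DataFrame column — the column name 'text' and the sample data are hypothetical, chosen for illustration:)

```python
import pandas as pd
import spacy

# Hypothetical stand-in for the asker's DataFrame.
df = pd.DataFrame({'text': ['This is a sentence.', 'Another one here.']})

# spacy.blank('en') builds a bare English pipeline containing only the
# tokenizer, so no model download is required.
nlp = spacy.blank('en')

# Tokenize each cell; each entry in the new column becomes a list of strings.
df['tokens'] = df['text'].apply(lambda x: [t.text for t in nlp.tokenizer(x)])

print(df['tokens'].iloc[0])  # → ['This', 'is', 'a', 'sentence', '.']
```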

请注意,nlp默认情况下运行整个 SpaCy 管道,其中包括词性标记、解析和命名实体识别。您可以通过使用nlp.tokenizer(x)代替nlp(x)或在加载模型时禁用部分管道来显着加快代码速度。例如nlp = spacy.load('en', parser=False, entity=False)
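(For tens of thousands of rows, spaCy also offers nlp.pipe, which streams texts through the pipeline in batches rather than making one call per row; this is typically faster than apply. A hedged sketch — again the 'text' column and sample data are assumptions, not from the original post:)

```python
import pandas as pd
import spacy

# Hypothetical stand-in data.
df = pd.DataFrame({'text': ['First document here.', 'Second document here.']})

# Tokenizer-only pipeline; swap in a loaded model if tagging/parsing is needed.
nlp = spacy.blank('en')

# nlp.pipe processes the whole column as a batched stream of Doc objects.
docs = list(nlp.pipe(df['text']))
df['tokens'] = [[t.text for t in doc] for doc in docs]

print(df['tokens'].iloc[0])  # → ['First', 'document', 'here', '.']
```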