Tokenizing using Pandas and spaCy

Disclaimer: this page is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and link to the source: StackOverflow, original question: http://stackoverflow.com/questions/46981137/

Date: 2020-09-14 04:42:28  Source: igfitidea

Tokenizing using Pandas and spaCy

Tags: python, python-3.x, pandas, tokenize, spacy

Asked by LMGagne

I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some NLP (clustering, classification) on 5 text columns (multiple sentences of text per 'cell') and have been using pandas to organize/build the dataset. I'm hoping to use spaCy for all the NLP but can't quite figure out how to tokenize the text in my columns. I've read a bunch of the spaCy documentation and googled around, but all the examples I've found are for a single sentence or word, not 75K rows in a pandas df.


I've tried things like: df['new_col'] = [token for token in (df['col'])]


but would definitely appreciate some help/resources.


full (albeit messy) code available here


Answered by Peter

I've never used spaCy (nltk has always gotten the job done for me), but from glancing at the documentation it looks like this should work:


import spacy
nlp = spacy.load('en')

df['new_col'] = df['text'].apply(lambda x: nlp(x))

Note that nlp by default runs the entire spaCy pipeline, which includes part-of-speech tagging, parsing, and named entity recognition. You can significantly speed up your code by using nlp.tokenizer(x) instead of nlp(x), or by disabling parts of the pipeline when you load the model, e.g. nlp = spacy.load('en', parser=False, entity=False).
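(The load-time flags in that last snippet reflect an older spaCy API; more recent releases use a disable=[...] argument instead. For tokenization alone, no statistical model is needed at all. A minimal sketch of the tokenizer-only approach applied to a DataFrame column — the column name 'text' and the sample data are hypothetical, chosen for illustration:)

```python
import pandas as pd
import spacy

# Hypothetical stand-in for the asker's DataFrame.
df = pd.DataFrame({'text': ['This is a sentence.', 'Another one here.']})

# spacy.blank('en') builds a bare English pipeline containing only the
# tokenizer, so no model download is required.
nlp = spacy.blank('en')

# Tokenize each cell; each entry in the new column becomes a list of strings.
df['tokens'] = df['text'].apply(lambda x: [t.text for t in nlp.tokenizer(x)])

print(df['tokens'].iloc[0])  # → ['This', 'is', 'a', 'sentence', '.']
```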

请注意,nlp默认情况下运行整个 SpaCy 管道,其中包括词性标记、解析和命名实体识别。您可以通过使用nlp.tokenizer(x)代替nlp(x)或在加载模型时禁用部分管道来显着加快代码速度。例如nlp = spacy.load('en', parser=False, entity=False)
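(For tens of thousands of rows, spaCy also offers nlp.pipe, which streams texts through the pipeline in batches rather than making one call per row; this is typically faster than apply. A hedged sketch — again the 'text' column and sample data are assumptions, not from the original post:)

```python
import pandas as pd
import spacy

# Hypothetical stand-in data.
df = pd.DataFrame({'text': ['First document here.', 'Second document here.']})

# Tokenizer-only pipeline; swap in a loaded model if tagging/parsing is needed.
nlp = spacy.blank('en')

# nlp.pipe processes the whole column as a batched stream of Doc objects.
docs = list(nlp.pipe(df['text']))
df['tokens'] = [[t.text for t in doc] for doc in docs]

print(df['tokens'].iloc[0])  # → ['First', 'document', 'here', '.']
```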