Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me): Stack Overflow.
Original source: http://stackoverflow.com/questions/46981137/
Tokenizing using Pandas and spaCy
Asked by LMGagne
I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some NLP (clustering, classification) on 5 text columns (multiple sentences of text per 'cell') and have been using pandas to organize/build the dataset. I'm hoping to use spaCy for all the NLP but can't quite figure out how to tokenize the text in my columns. I've read a bunch of the spaCy documentation and googled around, but all the examples I've found are for a single sentence or word, not 75K rows in a pandas df.
I've tried things like:
df['new_col'] = [token for token in (df['col'])]
but would definitely appreciate some help/resources.
Answered by Peter
I've never used spaCy (NLTK has always gotten the job done for me), but from glancing at the documentation it looks like this should work:
import spacy
nlp = spacy.load('en')
df['new_col'] = df['text'].apply(lambda x: nlp(x))
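For anyone who wants to try this end to end, here is a minimal self-contained sketch. It is not part of the original answer and rests on a few assumptions: the model name en_core_web_sm is what recent spaCy releases use in place of the 'en' shortcut, the fallback to spacy.blank("en") (a tokenizer-only pipeline) is there only so the sketch runs when no model is installed, and the DataFrame contents and column names are made up for illustration.

```python
import pandas as pd
import spacy

# Assumption: on recent spaCy versions the model is loaded by its full name.
# If it is not installed, fall back to a blank English pipeline, which still
# contains the tokenizer and needs no download.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

# Hypothetical data standing in for one of the text columns.
df = pd.DataFrame({"text": ["Hello world.", "spaCy tokenizes text quickly."]})

# Each cell becomes a spaCy Doc object, as in the one-liner above.
df["new_col"] = df["text"].apply(nlp)

# A plain list of token strings per row is often easier to use downstream.
df["tokens"] = df["new_col"].apply(lambda doc: [token.text for token in doc])

print(df["tokens"].tolist())
```

Keeping Doc objects in a column preserves everything spaCy computed, while the token lists are lighter and simpler to pass to other libraries; which one is preferable depends on what the later clustering or classification step needs.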
Note that nlp by default runs the entire spaCy pipeline, which includes part-of-speech tagging, parsing and named entity recognition. You can significantly speed up your code by using nlp.tokenizer(x) instead of nlp(x), or by disabling parts of the pipeline when you load the model, e.g. nlp = spacy.load('en', parser=False, entity=False).
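The two speed-ups mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the answerer's code: it uses spacy.blank("en") so that it runs without a downloaded model, the sample DataFrame is hypothetical, and the batch size is an arbitrary choice. In spaCy 2 and later, the parser=False, entity=False keywords were replaced by a disable argument, shown here only as a comment because it assumes the en_core_web_sm model is installed.

```python
import pandas as pd
import spacy

# Tokenizer-only pipeline, used here so the sketch needs no model download.
# With a trained model, the newer way to skip components would be:
#   nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
nlp = spacy.blank("en")

# Hypothetical data standing in for one of the text columns.
df = pd.DataFrame({"text": ["First row of text.", "Second row, a bit longer."]})

# Option 1: call only the tokenizer on each cell.
df["tokens"] = df["text"].apply(lambda x: [t.text for t in nlp.tokenizer(x)])

# Option 2: stream the whole column through nlp.pipe, which processes the
# texts in batches instead of one call per row.
docs = nlp.pipe(df["text"], batch_size=1000)
df["tokens_pipe"] = [[t.text for t in doc] for doc in docs]

print(df[["tokens", "tokens_pipe"]])
```

Both options produce the same token lists here. With a full model loaded, nlp.pipe still runs every enabled component, so it is usually combined with disabling the components you do not need.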