Python: Extracting all Nouns from a text file using nltk

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). StackOverflow source: http://stackoverflow.com/questions/33587667/

Extracting all Nouns from a text file using nltk

Tags: python, nltk

Asked by Rakesh Adhikesavan

Is there a more efficient way of doing this? My code reads a text file and extracts all Nouns.

import nltk

fileName = "my_text.txt"  # placeholder path -- substitute your own file
File = open(fileName)  # open file
lines = File.read()  # read the whole file into one string
sentences = nltk.sent_tokenize(lines)  # tokenize into sentences
nouns = []  # empty list to hold all nouns

for sentence in sentences:
    for word, pos in nltk.pos_tag(nltk.word_tokenize(sentence)):
        if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS'):
            nouns.append(word)

How do I reduce the time complexity of this code? Is there a way to avoid using the nested for loops?

Thanks in advance!

Accepted answer by Aziz Alto

If you are open to options other than NLTK, check out TextBlob. It extracts all nouns and noun phrases easily:

>>> from textblob import TextBlob
>>> txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the inter
actions between computers and human (natural) languages."""
>>> blob = TextBlob(txt)
>>> print(blob.noun_phrases)
[u'natural language processing', 'nlp', u'computer science', u'artificial intelligence', u'computational linguistics']
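
If you want individual nouns rather than noun phrases, TextBlob also exposes part-of-speech tags; a minimal sketch reusing the blob above (the NN-prefix filter is an assumption mirroring the tag list in the question):

>>> nouns = [word for (word, pos) in blob.tags if pos.startswith('NN')]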

Answer by Will Angley

I'm not an NLP expert, but I think you're pretty close already, and there likely isn't a way to get better time complexity than these outer loops.

Recent versions of NLTK have a built-in function that does what you're doing by hand, nltk.tag.pos_tag_sents, and it returns a list of lists of tagged words too.

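A minimal sketch of how that could replace the hand-rolled loop (assuming the same lines string as in the question):

import nltk

# tokenize each sentence, then tag all sentences in one batched call
sentences = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(lines)]
tagged_sents = nltk.tag.pos_tag_sents(sentences)
nouns = [word for sent in tagged_sents for (word, pos) in sent if pos.startswith('NN')]
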
Answer by Boa

import nltk

lines = 'lines is some string of words'
# function to test if something is a noun
is_noun = lambda pos: pos[:2] == 'NN'
# do the nlp stuff
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)] 

print(nouns)
>>> ['lines', 'string', 'words']

Useful tip: building a list with a comprehension is often faster than appending elements one at a time with the .insert() or .append() method inside a for loop, as in the sketch below.

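As a sketch, here are the question's nested loops as a single comprehension, reusing the is_noun helper above and assuming the question's lines variable (same work, slightly less call overhead):

nouns = [word
         for sentence in nltk.sent_tokenize(lines)
         for (word, pos) in nltk.pos_tag(nltk.word_tokenize(sentence))
         if is_noun(pos)]
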
Answer by alexis

Your code has no redundancy: You read the file once and visit each sentence, and each tagged word, exactly once. No matter how you write your code (e.g., using comprehensions), you will only be hiding the nested loops, not skipping any processing.

The only potential for improvement is in its space complexity: Instead of reading the whole file at once, you could read it in increments. But since you need to process a whole sentence at a time, it's not as simple as reading and processing one line at a time; so I wouldn't bother unless your files are whole gigabytes long; for short files it's not going to make any difference.

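For completeness, a rough sketch of what incremental reading could look like; this is an illustration under the assumption that no single sentence outgrows the rolling buffer, not a recommendation:

import nltk

nouns = []
buffer = ''
with open(fileName) as f:  # fileName as in the question
    for line in f:
        buffer += line
        sentences = nltk.sent_tokenize(buffer)
        # the last chunk may be an incomplete sentence; keep it buffered
        for sentence in sentences[:-1]:
            for word, pos in nltk.pos_tag(nltk.word_tokenize(sentence)):
                if pos.startswith('NN'):
                    nouns.append(word)
        buffer = sentences[-1] if sentences else ''
# flush whatever remains in the buffer
for word, pos in nltk.pos_tag(nltk.word_tokenize(buffer)):
    if pos.startswith('NN'):
        nouns.append(word)
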
In short, your loops are fine. There's a thing or two in your code that you could clean up (e.g. the if clause that matches the POS tags), but it's not going to change anything efficiency-wise.

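For instance, the tag check could be a single membership test, equivalent to the chain of or comparisons in the question:

if pos in ('NN', 'NNP', 'NNS', 'NNPS'):
    nouns.append(word)
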
Answer by Samuel Nde

You can achieve good results using nltk, TextBlob, spaCy, or any of the many other libraries out there. These libraries will all do the job, but with different degrees of efficiency.

import nltk
from textblob import TextBlob
import spacy
nlp = spacy.load('en')  # 'en' model shortcut; newer spaCy releases use the full name 'en_core_web_sm'
nlp1 = spacy.load('en_core_web_lg')  # large model, loaded here but not used in the snippets below

txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""

On my Windows 10 HP laptop (i5, 2 cores, 4 logical processors, 8 GB RAM), running in a Jupyter notebook, I ran some comparisons, and here are the results.

For TextBlob:

%%time
print([w for (w, pos) in TextBlob(txt).pos_tags if pos[0] == 'N'])

And the output is

>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 8.01 ms #average over 20 iterations

For nltk:

%%time
print([word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if pos[0] == 'N'])

And the output is

>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 7.09 ms #average over 20 iterations

For spaCy:

%%time
print([token.text for token in nlp(txt) if token.pos_ == 'NOUN'])  # iterating a Doc yields tokens, not entities

And the output is

>>> ['language', 'processing', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 30.19 ms #average over 20 iterations

It seems nltk and TextBlob are reasonably fast, and this is to be expected since they store nothing else about the input text, txt. spaCy is way slower. One more thing: spaCy missed the noun NLP, while nltk and TextBlob got it. I would shoot for nltk or TextBlob unless there is something else I wish to extract from the input txt.

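One likely explanation for the missed NLP (my reading, not part of the original benchmark): spaCy tags proper nouns as PROPN rather than NOUN, so including both tags is closer to the NN* filter used with nltk and TextBlob:

print([token.text for token in nlp(txt) if token.pos_ in ('NOUN', 'PROPN')])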

Check out a quick start into spaCy here.
Check out some basics about TextBlob here.
Check out nltk HowTos here.

Answer by Amit Ghosh

import nltk
lines = 'lines is some string of words'
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if(pos[:2] == 'NN')]
print(nouns)

Just simplified a bit more.