Extracting all Nouns from a text file using nltk

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/33587667/
Asked by Rakesh Adhikesavan
Is there a more efficient way of doing this? My code reads a text file and extracts all Nouns.
import nltk

File = open(fileName)                   #open file
lines = File.read()                     #read all lines
sentences = nltk.sent_tokenize(lines)   #tokenize sentences
nouns = []                              #empty array to hold all nouns

for sentence in sentences:
    for word, pos in nltk.pos_tag(nltk.word_tokenize(str(sentence))):
        if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS'):
            nouns.append(word)
How do I reduce the time complexity of this code? Is there a way to avoid using the nested for loops?
Thanks in advance!
Accepted answer by Aziz Alto
If you are open to options other than NLTK, check out TextBlob. It extracts all nouns and noun phrases easily:
>>> from textblob import TextBlob
>>> txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the inter
actions between computers and human (natural) languages."""
>>> blob = TextBlob(txt)
>>> print(blob.noun_phrases)
[u'natural language processing', 'nlp', u'computer science', u'artificial intelligence', u'computational linguistics']
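Since the question asks for nouns rather than noun phrases, the same blob can also be filtered by POS tag; a small sketch using TextBlob's tags property (my addition, not part of the original answer):

>>> # plain nouns (rather than noun phrases) come from the blob's POS tags
>>> nouns = [word for (word, pos) in blob.tags if pos.startswith('NN')]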
Answered by Will Angley
I'm not an NLP expert, but I think you're pretty close already, and there likely isn't a way to improve on the time complexity of these outer loops: every sentence and every word has to be visited once.
Recent versions of NLTK have a built-in function that does what you're doing by hand, nltk.tag.pos_tag_sents, and it returns a list of lists of tagged words too.
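A rough sketch of what that might look like (my own illustration, reusing the question's fileName variable; untested):

import nltk

lines = open(fileName).read()

# word-tokenize each sentence, then tag every sentence in a single call;
# pos_tag_sents returns a list of lists of (word, pos) pairs
sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(lines)]
tagged_sents = nltk.tag.pos_tag_sents(sentences)

nouns = [word for sent in tagged_sents
              for (word, pos) in sent
              if pos in ('NN', 'NNP', 'NNS', 'NNPS')]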
Answered by Boa
import nltk

lines = 'lines is some string of words'

# function to test if something is a noun
is_noun = lambda pos: pos[:2] == 'NN'

# do the nlp stuff
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]

print(nouns)
>>> ['lines', 'string', 'words']
Useful tip: building a list with a comprehension is often faster than appending elements one at a time with .append() or .insert() inside a for loop.
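If you want to check that claim on your own machine, here is a quick micro-benchmark with the standard timeit module (my addition; exact numbers will vary):

import timeit

loop_version = '''
result = []
for i in range(1000):
    result.append(i)
'''
comp_version = 'result = [i for i in range(1000)]'

# time both variants; the comprehension usually wins by a modest margin
print(timeit.timeit(loop_version, number=10000))
print(timeit.timeit(comp_version, number=10000))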
Answered by alexis
Your code has no redundancy: You read the file once and visit each sentence, and each tagged word, exactly once. No matter how you write your code (e.g., using comprehensions), you will only be hiding the nested loops, not skipping any processing.
The only potential for improvement is in its space complexity: Instead of reading the whole file at once, you could read it in increments. But since you need to process a whole sentence at a time, it's not as simple as reading and processing one line at a time; so I wouldn't bother unless your files are whole gigabytes long; for short files it's not going to make any difference.
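If memory ever did become a concern, one possible shape for incremental reading (my sketch, not something this answer recommends) is to buffer chunks and hold back the last, possibly incomplete sentence:

import nltk

def iter_sentences(path, chunk_size=65536):
    # Yield sentences from a file without reading it all into memory.
    buffer = ''
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buffer += chunk
            sentences = nltk.sent_tokenize(buffer)
            # the last "sentence" may be cut off mid-chunk, so keep it buffered
            for sentence in sentences[:-1]:
                yield sentence
            buffer = sentences[-1] if sentences else ''
    if buffer:
        yield buffer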
In short, your loops are fine. There are a thing or two in your code that you could clean up (e.g., the if clause that matches the POS tags), but it's not going to change anything efficiency-wise.
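For example, the chain of == comparisons in the question could be condensed to a membership test (same behavior, just tidier):

# equivalent to pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS'
if pos in ('NN', 'NNP', 'NNS', 'NNPS'):
    nouns.append(word)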
Answered by Samuel Nde
You can achieve good results using nltk, TextBlob, SpaCy or any of the many other libraries out there. These libraries will all do the job, but with different degrees of efficiency.
import nltk
from textblob import TextBlob
import spacy

nlp = spacy.load('en')                # spaCy's default English model
nlp1 = spacy.load('en_core_web_lg')   # large English model, loaded for comparison

txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
On my Windows 10 HP laptop (i5, 2 cores / 4 logical processors, 8 GB RAM), in a Jupyter notebook, I ran some comparisons and here are the results.
For TextBlob:
%%time
print([w for (w, pos) in TextBlob(txt).pos_tags if pos[0] == 'N'])
And the output is
>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 8.01 ms #average over 20 iterations
For nltk:
%%time
print([word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if pos[0] == 'N'])
And the output is
>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 7.09 ms #average over 20 iterations
For spacy:
%%time
print([token.text for token in nlp(txt) if token.pos_ == 'NOUN'])
And the output is
>>> ['language', 'processing', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 30.19 ms #average over 20 iterations
It seems nltk and TextBlob are reasonably faster, and this is to be expected since they store nothing else about the input text txt. SpaCy is way slower. One more thing: SpaCy missed the noun NLP while nltk and TextBlob got it. I would shoot for nltk or TextBlob unless there is something else I wish to extract from the input txt.
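A likely reason SpaCy "missed" NLP: spaCy uses the Universal POS scheme, where proper nouns are tagged PROPN rather than NOUN, so the token.pos_ == 'NOUN' filter drops them. A sketch that keeps both (my addition; worth verifying against your own model):

# spaCy separates common nouns (NOUN) from proper nouns (PROPN)
nouns = [token.text for token in nlp(txt) if token.pos_ in ('NOUN', 'PROPN')]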
Check out a quick start into spacy here.
Check out some basics about TextBlob here.
Check out nltk HowTos here.
Answered by Amit Ghosh
import nltk

lines = 'lines is some string of words'

tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if pos[:2] == 'NN']

print(nouns)
Just simplified a bit more.