Extracting all Nouns from a text file using nltk

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/33587667/
Asked by Rakesh Adhikesavan
Is there a more efficient way of doing this? My code reads a text file and extracts all Nouns.
import nltk

File = open(fileName)                   #open file
lines = File.read()                     #read all lines
sentences = nltk.sent_tokenize(lines)   #tokenize sentences
nouns = []                              #empty array to hold all nouns

for sentence in sentences:
    for word, pos in nltk.pos_tag(nltk.word_tokenize(str(sentence))):
        if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS'):
            nouns.append(word)
How do I reduce the time complexity of this code? Is there a way to avoid using the nested for loops?
Thanks in advance!
Accepted answer by Aziz Alto
If you are open to options other than NLTK, check out TextBlob. It extracts all nouns and noun phrases easily:
>>> from textblob import TextBlob
>>> txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the inter
actions between computers and human (natural) languages."""
>>> blob = TextBlob(txt)
>>> print(blob.noun_phrases)
[u'natural language processing', 'nlp', u'computer science', u'artificial intelligence', u'computational linguistics']
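Since the question asks for nouns rather than noun phrases, the same blob can also be filtered by POS tag; a small sketch using TextBlob's tags property (my addition, not part of the original answer):

>>> # plain nouns (rather than noun phrases) come from the blob's POS tags
>>> nouns = [word for (word, pos) in blob.tags if pos.startswith('NN')]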
Answered by Will Angley
I'm not an NLP expert, but I think you're pretty close already, and there likely isn't a way to improve on the time complexity of these outer loops: every sentence and every word has to be visited once.
Recent versions of NLTK have a built-in function that does what you're doing by hand, nltk.tag.pos_tag_sents, and it returns a list of lists of tagged words too.
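A rough sketch of what that might look like (my own illustration, reusing the question's fileName variable; untested):

import nltk

lines = open(fileName).read()

# word-tokenize each sentence, then tag every sentence in a single call;
# pos_tag_sents returns a list of lists of (word, pos) pairs
sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(lines)]
tagged_sents = nltk.tag.pos_tag_sents(sentences)

nouns = [word for sent in tagged_sents
              for (word, pos) in sent
              if pos in ('NN', 'NNP', 'NNS', 'NNPS')]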
Answered by Boa
import nltk

lines = 'lines is some string of words'

# function to test if something is a noun
is_noun = lambda pos: pos[:2] == 'NN'

# do the nlp stuff
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]

print(nouns)
>>> ['lines', 'string', 'words']
Useful tip: building a list with a comprehension is often faster than appending elements one at a time with .append() or .insert() inside a for loop.
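If you want to check that claim on your own machine, here is a quick micro-benchmark with the standard timeit module (my addition; exact numbers will vary):

import timeit

loop_version = '''
result = []
for i in range(1000):
    result.append(i)
'''
comp_version = 'result = [i for i in range(1000)]'

# time both variants; the comprehension usually wins by a modest margin
print(timeit.timeit(loop_version, number=10000))
print(timeit.timeit(comp_version, number=10000))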
Answered by alexis
Your code has no redundancy: You read the file once and visit each sentence, and each tagged word, exactly once. No matter how you write your code (e.g., using comprehensions), you will only be hiding the nested loops, not skipping any processing.
The only potential for improvement is in its space complexity: Instead of reading the whole file at once, you could read it in increments. But since you need to process a whole sentence at a time, it's not as simple as reading and processing one line at a time; so I wouldn't bother unless your files are whole gigabytes long; for short files it's not going to make any difference.
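If memory ever did become a concern, one possible shape for incremental reading (my sketch, not something this answer recommends) is to buffer chunks and hold back the last, possibly incomplete sentence:

import nltk

def iter_sentences(path, chunk_size=65536):
    # Yield sentences from a file without reading it all into memory.
    buffer = ''
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buffer += chunk
            sentences = nltk.sent_tokenize(buffer)
            # the last "sentence" may be cut off mid-chunk, so keep it buffered
            for sentence in sentences[:-1]:
                yield sentence
            buffer = sentences[-1] if sentences else ''
    if buffer:
        yield buffer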
In short, your loops are fine. There are a thing or two in your code that you could clean up (e.g., the if clause that matches the POS tags), but it's not going to change anything efficiency-wise.
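For example, the chain of == comparisons in the question could be condensed to a membership test (same behavior, just tidier):

# equivalent to pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS'
if pos in ('NN', 'NNP', 'NNS', 'NNPS'):
    nouns.append(word)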
Answered by Samuel Nde
You can achieve good results using nltk, TextBlob, SpaCy or any of the many other libraries out there. These libraries will all do the job, but with different degrees of efficiency.
import nltk
from textblob import TextBlob
import spacy

nlp = spacy.load('en')                # spaCy's default English model
nlp1 = spacy.load('en_core_web_lg')   # large English model, loaded for comparison

txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
On my Windows 10 HP laptop (i5, 2 cores / 4 logical processors, 8 GB RAM), in a Jupyter notebook, I ran some comparisons and here are the results.
For TextBlob:
%%time
print([w for (w, pos) in TextBlob(txt).pos_tags if pos[0] == 'N'])
And the output is
>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 8.01 ms #average over 20 iterations
For nltk:
%%time
print([word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if pos[0] == 'N'])
And the output is
>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 7.09 ms #average over 20 iterations
For spacy:
%%time
print([token.text for token in nlp(txt) if token.pos_ == 'NOUN'])
And the output is
>>> ['language', 'processing', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
Wall time: 30.19 ms #average over 20 iterations
It seems nltk and TextBlob are reasonably faster, and this is to be expected since they store nothing else about the input text txt. SpaCy is way slower. One more thing: SpaCy missed the noun NLP while nltk and TextBlob got it. I would shoot for nltk or TextBlob unless there is something else I wish to extract from the input txt.
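A likely reason SpaCy "missed" NLP: spaCy uses the Universal POS scheme, where proper nouns are tagged PROPN rather than NOUN, so the token.pos_ == 'NOUN' filter drops them. A sketch that keeps both (my addition; worth verifying against your own model):

# spaCy separates common nouns (NOUN) from proper nouns (PROPN)
nouns = [token.text for token in nlp(txt) if token.pos_ in ('NOUN', 'PROPN')]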
Check out a quick start into spacy here.
Check out some basics about TextBlob here.
Check out nltk HowTos here.
Answered by Amit Ghosh
import nltk

lines = 'lines is some string of words'

tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if pos[:2] == 'NN']

print(nouns)
Just simplified a bit more.