Python FreqDist 与 NLTK

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4634787/

Date: 2020-08-18 16:38:50  Source: igfitidea

FreqDist with NLTK

python, nlp, nltk

Asked by afg102

NLTK in Python has a function FreqDist which gives you the frequency of words within a text. I am trying to pass my text as an argument, but the result is of the form:

[' ', 'e', 'a', 'o', 'n', 'i', 't', 'r', 's', 'l', 'd', 'h', 'c', 'y', 'b', 'u', 'g', '\n', 'm', 'p', 'w', 'f', ',', 'v', '.', "'", 'k', 'B', '"', 'M', 'H', '9', 'C', '-', 'N', 'S', '1', 'A', 'G', 'P', 'T', 'W', '[', ']', '(', ')', '0', '7', 'E', 'J', 'O', 'R', 'j', 'x']

whereas in the example on the NLTK website the result was whole words, not just letters. I'm doing it this way:

file_y = open(fileurl)
p = file_y.read()
fdist = FreqDist(p)
vocab = list(fdist.keys())  # list() needed on Python 3, where keys() is a view
vocab[:100]

Do you know what I'm doing wrong? Thanks!

Answered by Alex Brasetvik

FreqDist expects an iterable of tokens. A string is iterable: its iterator yields every character.
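To see what goes wrong, note that recent NLTK versions implement FreqDist as a subclass of collections.Counter, so counting a raw string tallies its characters. A minimal sketch using Counter directly (standing in for FreqDist so it runs without NLTK installed):

```python
from collections import Counter

# Iterating a string yields characters, so counting a raw
# string produces per-character frequencies.
char_counts = Counter("banana")
print(char_counts.most_common(2))  # [('a', 3), ('n', 2)]
```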

Pass your text to a tokenizer first, and pass the tokens to FreqDist.
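The tokenize-then-count flow can be sketched as follows; here re.findall is a crude stand-in for a real tokenizer such as nltk.tokenize.word_tokenize, just to keep the example self-contained:

```python
import re
from collections import Counter  # FreqDist counts the same way

text = "the cat sat on the mat"
tokens = re.findall(r"\w+", text)  # crude whitespace/punctuation-aware split
fdist = Counter(tokens)            # with NLTK installed: FreqDist(tokens)
print(fdist.most_common(1))        # [('the', 2)]
```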

Answered by Tim McNamara

NLTK's FreqDist accepts any iterable. As a string is iterated character by character, it is pulling things apart in the way that you're experiencing.

In order to count words, you need to feed FreqDist words. How do you do that? Well, you might think (as others have suggested in the answers to your question) to feed the whole file to nltk.tokenize.word_tokenize.

>>> # first, let's import the dependencies
>>> import nltk
>>> from nltk.probability import FreqDist

>>> # wrong :(
>>> words = nltk.tokenize.word_tokenize(p)
>>> fdist = FreqDist(words)

word_tokenize builds word models from sentences. It needs to be fed one sentence at a time. It will do a relatively poor job when given whole paragraphs or even documents.

So, what to do? Easy, add in a sentence tokenizer!

>>> fdist = FreqDist()
>>> for sentence in nltk.tokenize.sent_tokenize(p):
...     for word in nltk.tokenize.word_tokenize(sentence):
...         fdist[word] += 1

One thing to bear in mind is that there are many ways to tokenize text. The modules nltk.tokenize.sent_tokenize and nltk.tokenize.word_tokenize simply pick a reasonable default for relatively clean English text. There are several other options to choose from, which you can read about in the API documentation.
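For instance, nltk.tokenize.RegexpTokenizer lets you define the token pattern yourself. A sketch of the same idea using only the standard-library re module, with an illustrative pattern of my own (not NLTK's default):

```python
import re

text = "U.S.A. costs $12.40, isn't it?"
# Keep abbreviations, currency amounts and contractions together --
# roughly what nltk.tokenize.RegexpTokenizer would do with this pattern.
pattern = r"[A-Z]\.(?:[A-Z]\.)+|\$?\d+(?:\.\d+)?|\w+(?:'\w+)?"
tokens = re.findall(pattern, text)
print(tokens)  # ['U.S.A.', 'costs', '$12.40', "isn't", 'it']
```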

Answered by Eran Kampf

FreqDist runs on an array of tokens. You're sending it an array of characters (a string); you should have tokenized the input first:

words = nltk.tokenize.word_tokenize(p)
fdist = FreqDist(words)

Answered by Aakash Anuj

You simply have to use it like this:

import nltk
from nltk.probability import FreqDist

sentence='''This is my sentence'''
tokens = nltk.tokenize.word_tokenize(sentence)
fdist=FreqDist(tokens)

The variable fdist is of type nltk.probability.FreqDist and contains the frequency distribution of the words.
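Once built, the distribution can be queried like a dictionary. A sketch of the common lookups, again using collections.Counter (which FreqDist subclasses in NLTK 3) so it runs standalone:

```python
from collections import Counter

fdist = Counter("this is my sentence".split())
print(fdist["is"])           # 1 -- frequency of a single word
print(fdist.most_common(2))  # the two most frequent tokens
print(sum(fdist.values()))   # 4 -- total number of tokens counted
```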

Answered by Musadiq

Your_string = "here is my string"
tokens = Your_string.split()

Do it this way, and then use the NLTK functions.

This will give you word tokens rather than individual characters.

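Note that str.split only breaks on whitespace, so punctuation stays glued to neighbouring words; handling that is the main thing a real tokenizer like word_tokenize does better. A quick illustration:

```python
from collections import Counter

text = "Hello, world. Hello again."
tokens = text.split()
print(tokens)  # ['Hello,', 'world.', 'Hello', 'again.']

# Punctuation sticks to the words, so counts split apart:
counts = Counter(tokens)
print(counts["Hello"])  # 1, not 2 -- 'Hello,' is a different token
```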