Sorting word frequency counts using Python

Question

Asked by AlgoMan

I have to count word frequencies in a text using Python. I thought of keeping the words in a dictionary, with a count for each word.

Now, if I have to sort the words by number of occurrences, can I do it with the same dictionary, instead of building a new dictionary that has the count as the key and a list of words as the value?

Answer 1

Accepted answer by Frédéric Hamidi

You can use the same dictionary:

>>> d = { "foo": 4, "bar": 2, "quux": 3 }
>>> sorted(d.items(), key=lambda item: item[1])

The second line prints:

[('bar', 2), ('quux', 3), ('foo', 4)]

If you only want a sorted word list, do:

>>> [pair[0] for pair in sorted(d.items(), key=lambda item: item[1])]

That line prints:

['bar', 'quux', 'foo']
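
To list the most frequent words first, the same sorted() call can be reversed. The following is a minimal sketch using the dictionary from the example above; breaking ties alphabetically is an assumption added for illustration, not something the answer specifies.

```python
# Word counts from the example above
d = {"foo": 4, "bar": 2, "quux": 3}

# Highest count first
by_count_desc = sorted(d.items(), key=lambda item: item[1], reverse=True)
print(by_count_desc)  # [('foo', 4), ('quux', 3), ('bar', 2)]

# Highest count first, with ties broken alphabetically by word (assumed rule)
by_count_then_word = sorted(d.items(), key=lambda item: (-item[1], item[0]))
print(by_count_then_word)  # [('foo', 4), ('quux', 3), ('bar', 2)]
```

Negating the count in the key tuple keeps the word itself in ascending order, which reverse=True alone would not do.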

Answer 2

Answered by jathanism

WARNING: This example requires Python 2.7 or higher.

The Counter object in Python's standard collections module is exactly what you're looking for. Counting words is even the first example in the documentation:

>>> # Tally occurrences of words in a list
>>> from collections import Counter
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})

As noted in the comments, Counter takes an iterable, so the above example is merely for illustration and is equivalent to:

>>> mywords = ['red', 'blue', 'red', 'green', 'blue', 'blue']
>>> cnt = Counter(mywords)
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
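
Since the question asks for the words sorted by number of occurrences, it may help to see Counter.most_common(), which returns (word, count) pairs with the most frequent first. This is a short sketch that reuses the word list from the example above.

```python
from collections import Counter

mywords = ['red', 'blue', 'red', 'green', 'blue', 'blue']
cnt = Counter(mywords)

# All (word, count) pairs, most frequent first
ranked = cnt.most_common()
print(ranked)  # [('blue', 3), ('red', 2), ('green', 1)]

# Only the two most frequent words
top_two = cnt.most_common(2)
print(top_two)  # [('blue', 3), ('red', 2)]
```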

Answer 3

Answered by user470379

>>> d = {'a': 3, 'b': 1, 'c': 2, 'd': 5, 'e': 0}
>>> l = list(d.items())  # items() is a view in Python 3, so copy it into a list first
>>> l.sort(key=lambda item: item[1])
>>> l
[('e', 0), ('b', 1), ('c', 2), ('a', 3), ('d', 5)]

Answer 4

Answered by Gani Simsek

I didn't know there was a Counter object for this kind of task. Here's how I did it back then, similar to your approach. You can do the sorting on a representation of the same dictionary.

# Takes a list and returns a list of (word, count) pairs sorted by descending count
def countWords(a_list):
    words = {}
    for item in a_list:
        # list.count() rescans the whole list for every word
        words[item] = a_list.count(item)
    return sorted(words.items(), key=lambda item: item[1], reverse=True)

An example (words with equal counts may appear in a different order, depending on the Python version):

>>> countWords("the quick red fox jumped over the lazy brown dog".split())
[('the', 2), ('brown', 1), ('lazy', 1), ('jumped', 1), ('over', 1), ('fox', 1), ('dog', 1), ('quick', 1), ('red', 1)]
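
The function above calls a_list.count() once per element, so its running time grows with the square of the list length. Below is a sketch of a single-pass variant; the name count_words_single_pass is hypothetical and not part of the original answer.

```python
def count_words_single_pass(a_list):
    """Return (word, count) pairs sorted by descending count."""
    words = {}
    for item in a_list:
        # Increment the running count; unseen words start at 0
        words[item] = words.get(item, 0) + 1
    return sorted(words.items(), key=lambda item: item[1], reverse=True)


result = count_words_single_pass(
    "the quick red fox jumped over the lazy brown dog".split()
)
print(result[0])  # ('the', 2)
```

Each word is looked up and updated once, so the counting step is linear in the number of words; the final sort dominates for large vocabularies.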

Answer 5

Answered by martineau

You could use Counter and defaultdict from the Python 2.7 collections module in a two-step process. First, use Counter to create a dictionary where each word is a key with its associated frequency count. This is fairly trivial.

Second, defaultdict can be used to create an inverted dictionary where the keys are the frequencies of occurrence and the associated values are lists of the word or words that were encountered that many times. Here's what I mean:

from collections import Counter, defaultdict

wordlist = ['red', 'yellow', 'blue', 'red', 'green', 'blue', 'blue', 'yellow']

# invert a temporary Counter(wordlist) dictionary so that the keys are
# frequencies of occurrence and the values are lists of the words encountered
freqword = defaultdict(list)
for word, freq in Counter(wordlist).items():
    freqword[freq].append(word)

# print in order of occurrence (with sorted list of words)
for freq in sorted(freqword):
    print('count {}: {}'.format(freq, sorted(freqword[freq])))

Output:

count 1: ['green']
count 2: ['red', 'yellow']
count 3: ['blue']
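
The inverted dictionary can also be read from the highest frequency down, or queried for the words tied for first place. This is a small sketch that rebuilds the same freqword mapping; the variable most_frequent is a name introduced here for illustration.

```python
from collections import Counter, defaultdict

wordlist = ['red', 'yellow', 'blue', 'red', 'green', 'blue', 'blue', 'yellow']

# Same inversion as above: frequency -> list of words with that frequency
freqword = defaultdict(list)
for word, freq in Counter(wordlist).items():
    freqword[freq].append(word)

# Print the highest frequency first
for freq in sorted(freqword, reverse=True):
    print('count {}: {}'.format(freq, sorted(freqword[freq])))

# Words tied for the highest frequency
most_frequent = sorted(freqword[max(freqword)])
print(most_frequent)  # ['blue']
```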

Answer 6

Answered by Fruitful

I have just written a similar program, with help from people on Stack Overflow:

from string import punctuation
from operator import itemgetter

N = 100
words = {}

words_gen = (word.strip(punctuation).lower() for line in open("poi_run.txt")
                                             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.items(), key=itemgetter(1), reverse=True)[:N]

for word, frequency in top_words:
    print ("%s %d" % (word, frequency))

Answer 7

Answered by Russell Asher

Finding the frequency of these items is easier than you are making it. If you have all the words in a list (which is easy to do with the string split function), then:

# Python 3 version of the original pseudocode
inputString = "the quick red fox jumped over the lazy brown dog"  # example input; substitute your own text

listOfWords = inputString.split()  # splits the text into words on whitespace
setOfWords = set(listOfWords)      # gives you all the unique words (no duplicates)

for word in setOfWords:  # count how many times each unique word appears in the list
    print(word + " appears: " + str(listOfWords.count(word)) + " times")

Answer 8

Answered by prisco.napoli

I wrote a similar program a few days ago. It takes two arguments: a filename (required) and N (optional), the number of most frequent words to print.

from collections import Counter
import re
import sys

if sys.version_info < (2, 7):
    sys.exit("Must use Python 2.7 or greater")

if len(sys.argv) < 2:
    sys.exit('Usage: python %s filename N' % sys.argv[0])

# N is optional; 0 means "show every word"
n = 0
if len(sys.argv) > 2:
    try:
        n = int(sys.argv[2])
        if n <= 0:
            raise ValueError
    except ValueError:
        sys.exit("Invalid value for N: %s.\nN must be an integer greater than 0" % sys.argv[2])

filename = sys.argv[1]
try:
    with open(filename, "r") as input_text:
        wordcounter = Counter()
        for line in input_text:
            # Count runs of word characters, ignoring case
            wordcounter.update(re.findall(r"\w+", line.lower()))
    if n == 0:
        n = len(wordcounter)

    for word, frequency in wordcounter.most_common(n):
        print("%s %d" % (word, frequency))

except IOError:
    sys.exit("Cannot open file: %s" % filename)
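
The counting step of the script can be pulled out into a function so that it can be tried without a file or command-line arguments. This is a sketch under that assumption; count_words and sample_lines are hypothetical names, not part of the original program.

```python
from collections import Counter
import re


def count_words(lines):
    """Count lowercased word tokens across an iterable of text lines."""
    counter = Counter()
    for line in lines:
        # Same tokenization as the script: runs of word characters
        counter.update(re.findall(r"\w+", line.lower()))
    return counter


# Any iterable of strings works, including an open file object
sample_lines = ["The cat sat.", "The dog sat, too."]
counts = count_words(sample_lines)
print(counts["the"], counts["sat"], counts["too"])  # 2 2 1
```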

Answer 9

Answered by Clay

If you are going to require additional text processing, it may be worth importing nltk (Natural Language Toolkit) into your project. Here's an example, using JFK's inaugural address:

import nltk

speech_text = """Vice President Johnson, Mr. Speaker, Mr. Chief Justice, President Eisenhower, Vice President Nixon, President Truman, reverend clergy, fellow citizens: We observe today not a victory of party, but a celebration of freedom — symbolizing an end, as well as a beginning — signifying renewal, as well as change. For I have sworn before you and Almighty God the same solemn oath our forebears prescribed nearly a century and three-quarters ago. The world is very different now. For man holds in his mortal hands the power to abolish all forms of human poverty and all forms of human life. And yet the same revolutionary beliefs for which our forebears fought are still at issue around the globe — the belief that the rights of man come not from the generosity of the state, but from the hand of God. We dare not forget today that we are the heirs of that first revolution. Let the word go forth from this time and place, to friend and foe alike, that the torch has been passed to a new generation of Americans — born in this century, tempered by war, disciplined by a hard and bitter peace, proud of our ancient heritage, and unwilling to witness or permit the slow undoing of those human rights to which this nation has always been committed, and to which we are committed today at home and around the world. Let every nation know, whether it wishes us well or ill, that we shall pay any price, bear any burden, meet any hardship, support any friend, oppose any foe, to assure the survival and the success of liberty. This much we pledge — and more. To those old allies whose cultural and spiritual origins we share, we pledge the loyalty of faithful friends. United there is little we cannot do in a host of cooperative ventures. Divided there is little we can do — for we dare not meet a powerful challenge at odds and split asunder.
To those new states whom we welcome to the ranks of the free, we pledge our word that one form of colonial control shall not have passed away merely to be replaced by a far more iron tyranny. We shall not always expect to find them supporting our view. But we shall always hope to find them strongly supporting their own freedom — and to remember that, in the past, those who foolishly sought power by riding the back of the tiger ended up inside. To those people in the huts and villages of half the globe struggling to break the bonds of mass misery, we pledge our best efforts to help them help themselves, for whatever period is required — not because the Communists may be doing it, not because we seek their votes, but because it is right. If a free society cannot help the many who are poor, it cannot save the few who are rich. To our sister republics south of our border, we offer a special pledge: to convert our good words into good deeds, in a new alliance for progress, to assist free men and free governments in casting off the chains of poverty. But this peaceful revolution of hope cannot become the prey of hostile powers. Let all our neighbors know that we shall join with them to oppose aggression or subversion anywhere in the Americas. And let every other power know that this hemisphere intends to remain the master of its own house. To that world assembly of sovereign states, the United Nations, our last best hope in an age where the instruments of war have far outpaced the instruments of peace, we renew our pledge of support — to prevent it from becoming merely a forum for invective, to strengthen its shield of the new and the weak, and to enlarge the area in which its writ may run. Finally, to those nations who would make themselves our adversary, we offer not a pledge but a request: that both sides begin anew the quest for peace, before the dark powers of destruction unleashed by science engulf all humanity in planned or accidental self-destruction.
We dare not tempt them with weakness. For only when our arms are sufficient beyond doubt can we be certain beyond doubt that they will never be employed. But neither can two great and powerful groups of nations take comfort from our present course — both sides overburdened by the cost of modern weapons, both rightly alarmed by the steady spread of the deadly atom, yet both racing to alter that uncertain balance of terror that stays the hand of mankind's final war. So let us begin anew — remembering on both sides that civility is not a sign of weakness, and sincerity is always subject to proof. Let us never negotiate out of fear, but let us never fear to negotiate. Let both sides explore what problems unite us instead of belaboring those problems which divide us. Let both sides, for the first time, formulate serious and precise proposals for the inspection and control of arms, and bring the absolute power to destroy other nations under the absolute control of all nations. Let both sides seek to invoke the wonders of science instead of its terrors. Together let us explore the stars, conquer the deserts, eradicate disease, tap the ocean depths, and encourage the arts and commerce. Let both sides unite to heed, in all corners of the earth, the command of Isaiah — to “undo the heavy burdens, and [to] let the oppressed go free.” And, if a beachhead of cooperation may push back the jungle of suspicion, let both sides join in creating a new endeavor — not a new balance of power, but a new world of law — where the strong are just, and the weak secure, and the peace preserved. All this will not be finished in the first one hundred days. Nor will it be finished in the first one thousand days; nor in the life of this Administration; nor even perhaps in our lifetime on this planet. But let us begin. In your hands, my fellow citizens, more than mine, will rest the final success or failure of our course.
Since this country was founded, each generation of Americans has been summoned to give testimony to its national loyalty. The graves of young Americans who answered the call to service surround the globe. Now the trumpet summons us again — not as a call to bear arms, though arms we need — not as a call to battle, though embattled we are — but a call to bear the burden of a long twilight struggle, year in and year out, “rejoicing in hope; patient in tribulation,” a struggle against the common enemies of man: tyranny, poverty, disease, and war itself. Can we forge against these enemies a grand and global alliance, North and South, East and West, that can assure a more fruitful life for all mankind? Will you join in that historic effort? In the long history of the world, only a few generations have been granted the role of defending freedom in its hour of maximum danger. I do not shrink from this responsibility — I welcome it. I do not believe that any of us would exchange places with any other people or any other generation. The energy, the faith, the devotion which we bring to this endeavor will light our country and all who serve it. And the glow from that fire can truly light the world. And so, my fellow Americans, ask not what your country can do for you; ask what you can do for your country. My fellow citizens of the world, ask not what America will do for you, but what together we can do for the freedom of man. Finally, whether you are citizens of America or citizens of the world, ask of us here the same high standards of strength and sacrifice which we ask of you. With a good conscience our only sure reward, with history the final judge of our deeds, let us go forth to lead the land we love, asking His blessing and His help, but knowing that here on earth God's work must truly be our own."""

# Tokenize the words
all_words = speech_text.lower().split()

# Create a frequency distribution
freq = nltk.FreqDist(all_words)

# Show the top 10 words in the list, with counts
# (in NLTK 3, FreqDist is a Counter subclass, so most_common() is available)
freq.most_common(10)

Out[5]: 
[('the', 86),
 ('of', 66),
 ('to', 42),
 ('and', 40),
 ('we', 30),
 ('a', 29),
 ('in', 24),
 ('our', 21),
 ('not', 19),
 ('that', 19)]

# Show the top 10 keys in the frequency dictionary
[word for word, count in freq.most_common(10)]

Out[6]: ['the', 'of', 'to', 'and', 'we', 'a', 'in', 'our', 'not', 'that']

# Those frequent words aren't very interesting... let's strip common words
# (the stop-word list may need a one-time nltk.download('stopwords'))
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
clean_words = [w for w in all_words if w not in stop_words]
freq_clean  = nltk.FreqDist(clean_words)

# This is a little more interesting (the first entry is skipped, as in the original slice)
freq_clean.most_common(10)[1:]
[('let', 16),
 ('us', 11),
 ('new', 7),
 ('sides', 7),
 ('pledge', 6),
 ('ask', 5),
 ('shall', 5),
 ('always', 4),
 ('call', 4)]

NLTK will allow you to do all manner of other interesting analysis with text, too, should the need arise. Here's a quick example of how you would find the top 10 bigrams (ranked by pointwise mutual information) among those that occur at least 3 times in the text:

bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder   = nltk.collocations.BigramCollocationFinder.from_words(all_words)
bigram_finder.apply_freq_filter(3)
bigram_finder.nbest(bigram_measures.pmi, 10)

Out[28]: 
[('my', 'fellow'),
 ('both', 'sides'),
 ('can', 'do'),
 ('dare', 'not'),
 ('let', 'us'),
 ('we', 'dare'),
 ('do', 'for'),
 ('let', 'both'),
 ('we', 'shall'),
 ('a', 'call')]
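
For readers without NLTK installed, raw bigram counts can be approximated with the standard library alone. The sketch below ranks bigrams by plain frequency rather than by PMI, so it is a simpler technique than the NLTK call above, and the sample sentence is made up for illustration.

```python
from collections import Counter

# Hypothetical sample text, already lowercased
words = "let us begin anew let us never fear let both sides begin".split()

# Pair each word with its successor to form bigrams
bigram_counts = Counter(zip(words, words[1:]))

# Keep bigrams seen at least twice, most frequent first
frequent = [(pair, n) for pair, n in bigram_counts.most_common() if n >= 2]
print(frequent)  # [(('let', 'us'), 2)]
```

The `n >= 2` threshold plays the same role as apply_freq_filter() in the NLTK example, just with a lower cutoff to suit the short sample.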

Refer to the NLTK documentation for more information and examples of how to, for instance, quickly create a plot of the most frequent terms in your text.

Answer 10

Answered by user3443599

There are a few steps involved in this problem:

Clean up the punctuation.

Sort the words by frequency.

import operator
import string

def wordCount(text):
    # Lowercase, strip punctuation, and split into words
    words = text.lower().translate(str.maketrans('', '', string.punctuation)).split()
    d = {}
    for word in words:
        if word not in d:
            d[word] = 1
        else:
            d[word] = d[word] + 1
    # Sort (word, count) pairs by count, highest first
    return sorted(d.items(), key=operator.itemgetter(1), reverse=True)

for key, val in wordCount("Hello, number of transaction which happened, for,"):
    print(key, val)
