Python: counting the letters in a text file

Question

Asked by user2752551

I am a beginner Python programmer and I am trying to write a program that counts how many times each letter occurs in a text file. Here is what I've got so far:

import string 
text = open('text.txt')
letters = string.ascii_lowercase
for i in text:
  text_lower = i.lower()
  text_nospace = text_lower.replace(" ", "")
  text_nopunctuation = text_nospace.strip(string.punctuation)
  for a in letters:
    if a in text_nopunctuation:
      num = text_nopunctuation.count(a)
      print(a, num)

If the text file contains hello bob, I want the output to be:

b 2
e 1
h 1
l 2
o 2

My problem is that it doesn't work properly when the text file contains more than one line of text or has punctuation.

Answer 1

Answered by moliware

Use collections.Counter:

from collections import Counter
text = 'aaaaabbbbbccccc'
c = Counter(text)
print(c)

It prints:

Counter({'a': 5, 'b': 5, 'c': 5})

Your text variable should be:

import string
text = open('text.txt').read()
# Keep only the ASCII letters, lowercased.
text = ''.join(ch for ch in text.lower() if ch in string.ascii_lowercase)

To get the output you need:

for letter, repetitions in sorted(c.items()):
    print(letter, repetitions)

In my example it prints:

a 5
b 5
c 5

For more information, see the Counter documentation.

Answer 2

Answered by elyase

This is a very readable way to accomplish what you want using Counter:

from string import ascii_lowercase
from collections import Counter

with open('text.txt') as f:
    print(Counter(letter for line in f
                  for letter in line.lower()
                  if letter in ascii_lowercase))

You can iterate over the resulting dict to print it in the format that you want.
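
As a minimal sketch of that last step, assuming the goal is the alphabetical "letter count" layout from the question, the Counter can be sorted by key before printing. The helper names count_letters and format_counts are illustrative only, and io.StringIO stands in for the opened text.txt.

```python
import io
from collections import Counter
from string import ascii_lowercase


def count_letters(lines):
    """Count ASCII letters, ignoring case, over an iterable of lines."""
    return Counter(letter for line in lines
                   for letter in line.lower()
                   if letter in ascii_lowercase)


def format_counts(counts):
    """Return 'letter count' strings in alphabetical order."""
    return ["{} {}".format(letter, n) for letter, n in sorted(counts.items())]


# io.StringIO stands in for open('text.txt'); both yield lines when iterated.
f = io.StringIO("Hello,\nBob!\n")
for row in format_counts(count_letters(f)):
    print(row)
```

Because the letters are filtered character by character, punctuation and line breaks anywhere in the input are ignored, not only at the ends of a line.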

Answer 3

Answered by elyase

Using re:

import re
from string import ascii_lowercase

context, m = 'some file to search or text', {}
for letter in ascii_lowercase:
    # findall returns every non-overlapping match, so its length is the count.
    m[letter] = len(re.findall(letter, context))
    print('{0} -> {1}'.format(letter, m[letter]))

Nonetheless, it is much more elegant and clean with Counter.
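
For comparison, here is a sketch of how re and Counter might be combined so the text is matched in a single pass. The function name count_letters_re is made up for this example, and it assumes only the ASCII letters a-z are of interest.

```python
import re
from collections import Counter


def count_letters_re(text):
    """Count ASCII letters in text with one regex pass, ignoring case."""
    return Counter(re.findall(r'[a-z]', text.lower()))


counts = count_letters_re('some file to search or text')
for letter in sorted(counts):
    print('{0} -> {1}'.format(letter, counts[letter]))
```

Unlike the loop over all 26 letters, this prints only the letters that actually occur in the text.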

Answer 4

Answered by no1

import string

freqs = {}
with open('text.txt') as fp:
    for line in fp:
        for char in line.lower():
            # Count only ASCII letters; skip digits, spaces and punctuation.
            if char in string.ascii_lowercase:
                freqs[char] = freqs.get(char, 0) + 1

print(freqs)

Answer 5

Answered by tobias_k

Just for the sake of completeness, if you want to do it without using Counter, here's another very short way, using a generator expression and the dict builtin:

from string import ascii_lowercase as letters
with open("text.txt") as f:
    text = f.read().lower()
    print(dict((l, text.count(l)) for l in letters))

f.read() reads the content of the entire file into the text variable (which might be a bad idea if the file is really large); then a generator expression produces (letter, count in text) tuples, and dict converts them to a dictionary. With Python 2.7+ you can also use {l: text.count(l) for l in letters}, which is even shorter and a bit more readable.

Note, however, that this will search the text multiple times, once for each letter, whereas Counter scans it only once and updates the counts for all the letters in one go.
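
A small sketch of that trade-off, using a short sample string in place of the file contents (the variable names by_count and by_counter are illustrative): both approaches agree on the letter counts, but the first one scans the text once per letter.

```python
from collections import Counter
from string import ascii_lowercase as letters

text = "Hello, Bob!".lower()

# Approach 1: one str.count() call per letter, so the text is scanned 26 times.
by_count = {l: text.count(l) for l in letters}

# Approach 2: a single pass; Counter tallies every character it sees,
# including punctuation and spaces.
by_counter = Counter(text)

# Both agree on every letter; Counter returns 0 for letters it never saw.
assert all(by_count[l] == by_counter[l] for l in letters)
print({l: n for l, n in by_count.items() if n})
```

Note that the Counter here also holds non-letter characters, so they still need to be filtered out when printing.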

Answer 6

Answered by jfs

You could split the problem into two simpler tasks:

#!/usr/bin/env python
import fileinput # accept input from stdin and/or files specified at command-line
from collections import Counter
from itertools import chain
from string import ascii_lowercase

# 1. count frequencies of all characters (bytes on Python 2)
freq = Counter(chain.from_iterable(fileinput.input())) # read one line at a time

# 2. print frequencies of ascii letters
for c in ascii_lowercase:
    n = freq[c] + freq[c.upper()] # merge lower- and upper-case occurrences
    if n != 0:
        print(c, n)

Answer 7

Answered by Public Person

import sys

def main():
    try:
        # Note: this counts every character in the file, not only letters.
        with open(sys.argv[1], 'r') as f:
            print("Count of all characters:", len(f.read()))
    except IndexError:
        print("You forgot to pass a file name as an argument!")
    except IOError:
        print("There is no such file in this folder!")

main()

python file.py countlettersfile.txt

Answer 8

Answered by Maxim Egorushkin

Yet another way:

import sys
from collections import defaultdict

read_chunk_size = 65536

freq = defaultdict(int)
# Read stdin chunk by chunk; read() returns '' at the end of the input.
for chunk in iter(lambda: sys.stdin.read(read_chunk_size), ''):
    for c in chunk:
        freq[ord(c.lower())] += 1

for symbol, count in sorted(freq.items(), key=lambda kv: kv[1], reverse=True):
    print(chr(symbol), count)

It outputs the symbols from the most frequent to the least frequent.

The character counting loop runs in O(n) time but needs only O(1) memory for the input, so it can handle arbitrarily large files because it reads the input in read_chunk_size chunks.
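
As a sketch of the chunked-reading idea as a reusable function, assuming any text stream with a read(size) method: the name count_chars_chunked is hypothetical, and io.StringIO stands in for sys.stdin or an open file. A deliberately tiny chunk size shows that chunk boundaries do not change the result.

```python
import io
from collections import Counter


def count_chars_chunked(stream, chunk_size=65536):
    """Count lowercased characters from a text stream, one chunk at a time."""
    freq = Counter()
    # iter(callable, sentinel) keeps calling read() until it returns ''.
    for chunk in iter(lambda: stream.read(chunk_size), ''):
        freq.update(chunk.lower())
    return freq


# io.StringIO stands in for sys.stdin or an open text file.
freq = count_chars_chunked(io.StringIO("Hello Bob"), chunk_size=4)
for symbol, count in freq.most_common(3):
    print(repr(symbol), count)
```

Only one chunk is held in memory at a time, while the Counter itself grows with the number of distinct characters, not with the file size.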
