Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/41504428/
Find the number of characters in a file using Python
Asked by S.Soopra
Here is the question:
I have a file with these words:
hey how are you
I am fine and you
Yes I am fine
I am asked to find the number of words, lines, and characters.
Below is my program, but the character count without spaces is not correct.
The number of words and the number of lines are correct. What is the mistake in the loop?
fname = input("Enter the name of the file:")
infile = open(fname, 'r')
lines = 0
words = 0
characters = 0
for line in infile:
    wordslist = line.split()
    lines = lines + 1
    words = words + len(wordslist)
    characters = characters + len(line)
print(lines)
print(words)
print(characters)
The output is:
lines=3(Correct)
words=13(correct)
characters=47
I've looked at multiple answers on the site, but I am confused because I haven't learned some of the other Python functions they use. How do I correct the code while keeping it as simple and basic as the loop I've written?
The number of characters without spaces is 35, and with spaces it is 45. If possible, I want to find the number of characters without spaces. A loop that counts the characters with spaces would also be fine.
Answered by Mike Müller
Sum up the lengths of all words in a line:
characters += sum(len(word) for word in wordslist)
The whole program:
with open('my_words.txt') as infile:
    lines = 0
    words = 0
    characters = 0
    for line in infile:
        wordslist = line.split()
        lines = lines + 1
        words = words + len(wordslist)
        characters += sum(len(word) for word in wordslist)
print(lines)
print(words)
print(characters)
Output:
3
13
35
This:
(len(word) for word in wordslist)
is a generator expression. It is essentially a loop in one line that produces the length of each word. We feed these lengths directly to sum:
sum(len(word) for word in wordslist)
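To make the "loop in one line" idea concrete, here is a minimal sketch that computes the same total twice, once with an explicit loop and once with the generator expression. The sample line is taken from the question's file; the names total_loop and total_genexp are only illustrative.

```python
# One line from the question's file, including its line ending.
line = "hey how are you\n"
wordslist = line.split()  # ['hey', 'how', 'are', 'you']

# Explicit loop: add up the length of each word.
total_loop = 0
for word in wordslist:
    total_loop += len(word)

# Generator expression: the same lengths, fed straight to sum().
total_genexp = sum(len(word) for word in wordslist)

print(total_loop, total_genexp)  # 12 12
```

Both forms visit every word once; the generator expression simply avoids the separate accumulator variable.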
Improved version
This version takes advantage of enumerate, so you save two lines of code while keeping the readability:
with open('my_words.txt') as infile:
    words = 0
    characters = 0
    for lineno, line in enumerate(infile, 1):
        wordslist = line.split()
        words += len(wordslist)
        characters += sum(len(word) for word in wordslist)
print(lineno)
print(words)
print(characters)
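As a small illustration of what enumerate(infile, 1) yields, the sketch below runs it over an in-memory stand-in for the file (io.StringIO is used here only so the example needs no file on disk). Initialising lineno to 0 beforehand is an extra precaution that is not in the answer: with an empty file the loop body never runs, and lineno would otherwise be undefined.

```python
import io

# In-memory stand-in for the question's file (an assumption for this sketch).
infile = io.StringIO("hey how are you\nI am fine and you\nYes I am fine\n")

lineno = 0  # stays 0 if the input has no lines at all
numbered = []
for lineno, line in enumerate(infile, 1):
    # enumerate pairs each line with a counter that starts at 1
    numbered.append((lineno, line.rstrip("\n")))

print(numbered[0])  # (1, 'hey how are you')
print(lineno)       # 3, the number of the last line seen
```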
This line:
with open('my_words.txt') as infile:
opens the file with the promise to close it as soon as you leave the indented block. It is always good practice to close a file after you are done using it.
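The closing behaviour can be checked with the file object's closed attribute. This is a minimal sketch that writes a throwaway temporary file first so that it is self-contained; the file contents and the variable names are assumptions made for the example.

```python
import os
import tempfile

# Create a small temporary file so the example does not depend on my_words.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hey how are you\n")
    path = tmp.name

with open(path) as infile:
    inside = infile.closed        # False: the file is still open here
    first_line = infile.readline()

outside = infile.closed           # True: closed on leaving the block
print(inside, outside)  # False True

os.remove(path)  # clean up the temporary file
```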
Answered by Solo
Remember that each line (except possibly the last) ends with a line separator: "\r\n" on Windows or "\n" on Linux and Mac.
Thus, exactly two extra characters are counted in this case (one line separator after each of the first two lines), which gives 47 and not 45.
A nice way to overcome this could be to use:
import os
fname = input("enter the name of the file:")
infile = open(fname, 'r')
lines = 0
words = 0
characters = 0
for line in infile:
    line = line.strip(os.linesep)
    wordslist = line.split()
    lines = lines + 1
    words = words + len(wordslist)
    characters = characters + len(line)
print(lines)
print(words)
print(characters)
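The two extra characters can be reproduced without a file. The sketch below assumes the question's text ends without a trailing newline (which is what the count of 47 implies) and uses rstrip("\n") rather than the strip(os.linesep) call above, since io.StringIO hands back lines ending in "\n".

```python
import io

# The question's three lines; no newline after the last one (assumed).
sample = "hey how are you\nI am fine and you\nYes I am fine"

with_newlines = 0
without_newlines = 0
for line in io.StringIO(sample):
    with_newlines += len(line)                  # counts the '\n' as well
    without_newlines += len(line.rstrip("\n"))  # drops the line ending first

print(with_newlines, without_newlines)  # 47 45
```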
Answered by csl
To count the characters, you should count the characters of each individual word. So you could have another loop that counts characters:
for word in wordslist:
    characters += len(word)
That ought to do it. You may also want to strip newline characters on the right when building wordslist, with something like wordslist = line.rstrip().split() perhaps.
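A quick check suggests the extra rstrip() is harmless but optional: split() with no arguments already discards leading and trailing whitespace, including the newline. The sample line comes from the question; the variable names are illustrative.

```python
line = "I am fine and you\n"

plain = line.split()              # split() alone ignores the trailing '\n'
stripped = line.rstrip().split()  # the variant suggested above

# Inner loop that adds up the characters of each word.
characters = 0
for word in plain:
    characters += len(word)

print(plain == stripped)  # True
print(characters)         # 13
```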
Answered by Jared Smith
This is too long for a comment.
Python 2 or 3? Because it really matters. Try out the following in your REPL for both:
Python 2.7.12
>>> len("taña")
5
Python 3.5.2
>>> len("taña")
4
Huh? The answer lies in Unicode. That ñ is an 'n' with a diacritical mark (a tilde). It is 1 character, but not 1 byte. So unless you're working with plain ASCII text, you'd better specify which version of Python your character counting function is for.
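The character-versus-byte distinction can also be seen inside Python 3 alone. This sketch assumes the non-ASCII character is ñ (U+00F1), written as an escape so the example does not depend on the source file's encoding. It also shows that a decomposed form, where the tilde is a separate combining character, counts as one more code point.

```python
import unicodedata

word = "ta\u00f1a"  # 'n' with tilde as a single precomposed code point

code_points = len(word)                 # what Python 3's len() counts
utf8_bytes = len(word.encode("utf-8"))  # what Python 2's len() saw for a byte string

# NFD normalisation splits the letter into 'n' plus a combining tilde.
decomposed = unicodedata.normalize("NFD", word)

print(code_points, utf8_bytes, len(decomposed))  # 4 5 5
```

So even in Python 3, len() counts code points rather than what a reader would call characters; normalising the text first (for example to NFC) is one way to get consistent counts.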
Answered by barrios
I found this solution very simple and readable:
with open("filename", 'r') as file:
    text = file.read().strip().split()
    len_chars = sum(len(word) for word in text)
    print(len_chars)
Answered by Tagc
How's this? It uses a regular expression to match all non-whitespace characters and returns the number of matches within a string.
import re

DATA = """
hey how are you
I am fine and you
Yes I am fine
"""


def get_char_count(s):
    return len(re.findall(r'\S', s))


if __name__ == '__main__':
    print(get_char_count(DATA))
Output
35
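For readers who have not met regular expressions yet, the pattern \S simply means "any character that is not whitespace". The sketch below compares it with a plain test using str.isspace() on the question's text; the variable names are illustrative.

```python
import re

text = "hey how are you\nI am fine and you\nYes I am fine\n"

# Count every match of "one non-whitespace character".
by_regex = len(re.findall(r"\S", text))

# The same count without a regular expression.
by_isspace = sum(1 for ch in text if not ch.isspace())

print(by_regex, by_isspace)  # 35 35
```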
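Answered by Loaf
It is probably counting newline characters. Subtract the number of newline characters from characters: that is lines - 1 when the last line has no trailing newline (47 - 2 = 45 in the question), or lines when it does.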
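The subtraction can be sketched on the question's text directly. Counting the newline characters with str.count avoids having to guess whether the last line ends with one; the sample assumes it does not, matching the reported total of 47.

```python
# The question's text, with no newline after the last line (assumed).
text = "hey how are you\nI am fine and you\nYes I am fine"

total = len(text)            # every character, newlines included
newlines = text.count("\n")  # how many of those are line endings

print(total, newlines, total - newlines)  # 47 2 45
```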
Answered by Rahul
Here is the code:
# read the file as UTF-8 text so that len() counts characters, not bytes
with open(fname, encoding='utf8') as fp:
    chars = fp.read()
print(len(chars))
Check the output. I just tested it.
Answered by Rahul
A more Pythonic solution than the others:
with open('foo.txt') as f:
    text = f.read().splitlines()  # list of lines
lines = len(text)  # length of the list = number of lines
words = sum(len(line.split()) for line in text)  # split each line on spaces, sum up the lengths of the lists of words
characters = sum(len(line) for line in text)  # sum up the length of each line
print(lines)
print(words)
print(characters)
The other answers here are manually doing what str.splitlines() does. There's no reason to reinvent the wheel.
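What str.splitlines() does can be seen on a short string. This sketch uses the question's text held in memory rather than foo.txt; note that the resulting count of 45 still includes the spaces inside each line.

```python
text = "hey how are you\nI am fine and you\nYes I am fine\n"

# splitlines() cuts at line boundaries and drops the line endings.
lines = text.splitlines()

# keepends=True leaves each line ending attached instead.
lines_with_endings = text.splitlines(keepends=True)

characters = sum(len(line) for line in lines)

print(lines)       # ['hey how are you', 'I am fine and you', 'Yes I am fine']
print(characters)  # 45
```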
Answered by Nizam Mohamed
Simply skip unwanted characters while calling len,
# lines read in Python 3 text mode end in '\n' on every platform
characters = characters + len([c for c in line if c not in ('\n', ' ')])
or sum the count,
characters = characters + sum(1 for c in line if c not in ('\n', ' '))
or build a str from the wordslist and take len,
characters = characters + len(''.join(wordslist))
or sum the characters in the wordslist. I think this is the fastest.
characters = characters + sum(1 for word in wordslist for char in word)
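The four variants can be checked against each other on a single line. This is only a consistency sketch: it compares individual characters with "\n" and " ", assumes the list of words is called wordslist as in the question, and makes no claim about which variant is fastest.

```python
line = "Yes I am fine\n"
wordslist = line.split()

# 1. Filter out newline and space, then take the length of the list.
by_filter_len = len([c for c in line if c not in ("\n", " ")])

# 2. Same filter, but add 1 per kept character instead of building a list.
by_filter_sum = sum(1 for c in line if c not in ("\n", " "))

# 3. Glue the words together and measure the resulting string.
by_join = len("".join(wordslist))

# 4. Add 1 for every character of every word.
by_nested = sum(1 for word in wordslist for char in word)

print(by_filter_len, by_filter_sum, by_join, by_nested)  # 10 10 10 10
```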