Python: Count how many times a word occurs in a file
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original address and authors, and attribute it to the original authors on StackOverflow (not to the compiler of this page).
Original question: http://stackoverflow.com/questions/22849662/
Asked by bw61293
I have a file that contains a city name followed by a state name on each line. I am supposed to count how many times a state name occurs and return the value.
For example, if my file contained:
Los Angeles California
San Diego California
San Francisco California
Albany New York
Buffalo New York
Orlando Florida
I am supposed to return how many times each state name occurs. I have this for California:
for line in f:
    California_count=line.find("California")
    if California_count!=-1:
        total=line.count("California")
print(total)
This only gives me the value 1, which I assume is because the word occurs once per line. How do I get it to return 3 instead of 1?
Accepted answer by Bruno Gelb
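total = 0
with open('input.txt') as f:
    for line in f:
        # find() returns -1 when the word is absent from the line
        found = line.find('California')
        # index 0 is also skipped: a line starting with the word is not counted
        if found != -1 and found != 0:
            total += 1
print(total)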
Output:
3
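Since the question asks to return the value, the same idea can be wrapped in a function. This is only a minimal sketch, not part of the original answer: it assumes every line ends with the state name, and the helper name count_state and the sample list are made up for illustration.

```python
def count_state(lines, state):
    """Count the lines whose text ends with the given state name."""
    # rstrip() drops the trailing newline and any padding before comparing.
    return sum(1 for line in lines if line.rstrip().endswith(state))


# Sample lines standing in for the file from the question.
sample = [
    "Los Angeles California\n",
    "San Diego California\n",
    "San Francisco California\n",
    "Albany New York\n",
    "Buffalo New York\n",
    "Orlando Florida\n",
]
print(count_state(sample, "California"))  # 3
```

An open file object can be passed in place of the list, since iterating over a file also yields one line at a time.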
Answer by m.wasowski
Use a dictionary for storing counters:
# the city column is padded to 14 characters, so the state starts at index 14
data = """Los Angeles   California
San Diego     California
San Francisco California
Albany        New York
Buffalo       New York
Orlando       Florida""".splitlines()
counters = {}
for line in data:
    city, state = line[:14], line[14:]
    # city, state = line.split('\t') # if separated by tabulator
    if state not in counters:
        counters[state] = 1
    else:
        counters[state] += 1
print(counters)
# {'California': 3, 'New York': 2, 'Florida': 1}
You can simplify it by using collections.defaultdict:
from collections import defaultdict
# missing keys start at int() == 0, so no membership check is needed
counter = defaultdict(int)
for line in data:
    city, state = line[:14], line[14:]
    counter[state] += 1
print(counter)
# defaultdict(<class 'int'>, {'California': 3, 'New York': 2, 'Florida': 1})
or using collections.Counter and a generator expression:
from collections import Counter
states = Counter(line[14:] for line in data)
# Counter({'California': 3, 'New York': 2, 'Florida': 1})
Answer by Nate Mara
Assuming that the spaces in your post are meant to be tabs, the following code will give you a dict containing the counts for all of the states in the file.
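#!/usr/bin/env python3
counts = {}
with open('states.txt', 'r') as statefile:
    for i in statefile:
        # the state is the second tab-separated field; rstrip() drops the newline
        state = i.split('\t')[1].rstrip()
        if state not in counts:
            # first occurrence counts as 1, not 0
            counts[state] = 1
        else:
            counts[state] += 1
print(counts)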
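If the separator turns out to be ordinary spaces rather than tabs, splitting is unreliable because names such as "New York" contain a space themselves. One possible workaround, sketched here under the assumption that the set of state names is known in advance, is to match the end of each line against that list. The names KNOWN_STATES and count_states are hypothetical and not from the original answer.

```python
from collections import Counter

# Hypothetical list of state names; extend it to cover the states in your file.
KNOWN_STATES = ["California", "New York", "Florida"]


def count_states(lines, known_states=KNOWN_STATES):
    """Count, per state, the lines that end with that state's name."""
    # Try longer names first so "West Virginia" is not mistaken for "Virginia".
    ordered = sorted(known_states, key=len, reverse=True)
    counts = Counter()
    for line in lines:
        text = line.rstrip()
        for state in ordered:
            if text.endswith(" " + state):
                counts[state] += 1
                break  # at most one state per line
    return counts


sample = [
    "Los Angeles California",
    "San Diego California",
    "San Francisco California",
    "Albany New York",
    "Buffalo New York",
    "Orlando Florida",
]
print(dict(count_states(sample)))
# {'California': 3, 'New York': 2, 'Florida': 1}
```

Lines that end with a name missing from the list are silently skipped, so the list needs to be complete for the totals to be trusted.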
Answer by Denis
Alternatively, you could just use the re module and a regular expression:
import re
states = """
Los Angeles California
San Diego California
San Francisco California
Albany New York
Buffalo New York
Orlando Florida
"""
# findall() returns a list of every match, so its length is the total count
found = re.findall('[cC]alifornia', states)
total = len(found)
print(total)
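The pattern above matches the letters anywhere, including inside a longer word such as "Californian". A hedged variation, assuming whole-word and fully case-insensitive matching is what is wanted, uses word boundaries and re.IGNORECASE. The function name count_phrase and the sample text are illustrative only.

```python
import re


def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of phrase in text."""
    # re.escape guards against regex metacharacters in the phrase;
    # \b keeps "California" from matching inside a longer word.
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


text = "Los Angeles California\nSan Diego california\nCalifornian sunshine\n"
print(count_phrase(text, "California"))  # 2
```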
Answer by Lachlan Moore
I believe the accepted answer to this common problem covers what bw61293 asked for because of the format of his text file, but it is not a general solution for all text files.
He asked to "Count how many times a word occurs in a file", yet the accepted answer can count the word 'California' only once per line. If the word appears twice on a line, it is still counted once. This works for the given format, but it is not a general solution if, say, the file were a book.
A fix to the accepted answer is below, using nltk to break each line into a list of words. Make sure to install the nltk library first with 'pip install nltk' in Command Prompt, and beware that it is a big library. If you use Anaconda, run 'conda install -c anaconda nltk' instead. I used the TweetTokenizer because, among other reasons, an apostrophe in a word like "don't" would otherwise split it into the list ['don', "'t"], whereas the TweetTokenizer returns ["don't"]. I also made the search case-insensitive by lowercasing the tokens with .lower() before calling .count(). I hope this helps people who want a more general solution to the question "Count how many times a word occurs in a file".
I am new to StackOverflow, so please give feedback on improvements to my code or to what I have written for my first comment ever!
UPDATE: I made an error; the code below is now fixed. (Keep in mind this is a case-insensitive search. If you want it case-sensitive, remove the .lower() from the list comprehension.) I also promise to post an answer that does not use nltk when I have enough time.
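from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
total = 0
with open('input.txt') as f:
    for line in f:
        # split the line into word tokens, keeping contractions like "don't" whole
        LineList = tknzr.tokenize(line)
        # lowercase every token so the comparison is case-insensitive
        LineLower = [x.lower() for x in LineList]
        # count() returns 0 when the word is absent, so it can be added directly
        total += LineLower.count('california')
print(total)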
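For readers who would rather avoid the nltk dependency, a rough standard-library approximation is possible. The sketch below is not the follow-up promised in this answer and is not equivalent to TweetTokenizer: it assumes ASCII letters and straight apostrophes only, and the names WORD_RE and word_counts are made up for illustration.

```python
import re
from collections import Counter

# A word is a run of letters, optionally joined by internal apostrophes,
# so "don't" stays in one piece.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def word_counts(lines):
    """Return a Counter of lowercased words across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in WORD_RE.findall(line))
    return counts


lines = ["California, california! Don't stop.\n", "I don't live in California.\n"]
counts = word_counts(lines)
print(counts["california"])  # 3
print(counts["don't"])       # 2
```

Because a Counter holds every word, one pass over the file answers the question for any word, and a word that never appears simply reports 0.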