Easiest way to ignore blank lines when reading a file in Python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original link: http://stackoverflow.com/questions/4842057/
Asked by Ambrosio
I have some code that reads a file of names and creates a list:
names_list = open("names", "r").read().splitlines()
Each name is separated by a newline, like so:
Allman
Atkinson
Behlendorf
I want to ignore any lines that contain only whitespace. I know I can do this by creating a loop, checking each line I read, and adding it to a list if it's not blank.
I was just wondering if there was a more Pythonic way of doing it?
Accepted answer by aaronasterling
I would stack generator expressions:
with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in) # All lines including the blank ones
    lines = (line for line in lines if line) # Non-blank lines
Now, lines is all of the non-blank lines. This will save you from having to call strip on each line twice. If you want a list of lines, then you can just do:
with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in)
    lines = list(line for line in lines if line) # Non-blank lines in a list
You can also do it in a one-liner (excluding the with statement), but it's no more efficient and it's harder to read:
with open(filename) as f_in:
    lines = list(line for line in (l.strip() for l in f_in) if line)
Update:
I agree that this is ugly because of the repetition of tokens. You could just write a generator if you prefer:
def nonblank_lines(f):
    for l in f:
        line = l.rstrip()
        if line:
            yield line
Then call it like:
with open(filename) as f_in:
    for line in nonblank_lines(f_in):
        pass  # Stuff
Update 2:
with open(filename) as f_in:
    lines = filter(None, (line.rstrip() for line in f_in))
and on CPython (with deterministic reference counting)
lines = filter(None, (line.rstrip() for line in open(filename)))
In Python 2, use itertools.ifilter if you want a generator; in Python 3, just pass the whole thing to list() if you want a list.
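For instance, a minimal sketch of the Python 3 version of this approach (assuming the same filename as above; in Python 3, filter returns a lazy iterator, so wrap it in list() only if you actually need a list):

with open(filename) as f_in:
    # Lazy iterator over right-stripped, non-blank lines (Python 3)
    non_blank = filter(None, (line.rstrip() for line in f_in))
    # Materialize into a list while the file is still open
    lines = list(non_blank)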
Answered by Felix Kling
You could use a list comprehension:
with open("names", "r") as f:
names_list = [line.strip() for line in f if line.strip()]
Updated: Removed the unnecessary readlines().
To avoid calling line.strip() twice, you can use a generator:
names_list = [l for l in (line.strip() for line in f) if l]
Answered by eyquem
When text must be processed just to extract data from it, I always think of regexes first, because:
as far as I know, regexes were invented for exactly that
iterating over lines seems clumsy to me: it essentially amounts to searching for the newlines and then searching for the data to extract in each line; that makes two searches instead of one direct search with a regex
bringing regexes into play is easy; only writing the regex string to be compiled into a regex object is sometimes hard, but in that case a treatment based on iterating over lines would be complicated too
For the problem discussed here, a regex solution is fast and easy to write:
import re
names = re.findall('\S+',open(filename).read())
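As a quick illustration of what this regex does (a small example added here, not part of the original answer; shown with Python 3 print()): it extracts every run of non-whitespace characters, so it agrees with the line-based solutions only when each line holds a single token:

import re

sample = "Allman\n\nAtkinson\n   \nBehlendorf\n"
print(re.findall(r'\S+', sample))           # ['Allman', 'Atkinson', 'Behlendorf']

# Caveat: a line with internal spaces is split into several tokens,
# unlike the strip-based, line-oriented solutions above.
print(re.findall(r'\S+', "Van Helsing\n"))  # ['Van', 'Helsing']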
I compared the speeds of several solutions:
import re
from time import clock
A,AA,B1,B2,BS,reg = [],[],[],[],[],[]
D,Dsh,C1,C2 = [],[],[],[]
F1,F2,F3 = [],[],[]
def nonblank_lines(f):
    for l in f:
        line = l.rstrip()
        if line: yield line

def short_nonblank_lines(f):
    for l in f:
        line = l[0:-1]
        if line: yield line
for essays in xrange(50):
    te = clock()
    with open('raa.txt') as f:
        names_listA = [line.strip() for line in f if line.strip()] # Felix Kling
    A.append(clock()-te)
    te = clock()
    with open('raa.txt') as f:
        names_listAA = [line[0:-1] for line in f if line[0:-1]] # Felix Kling with line[0:-1]
    AA.append(clock()-te)
    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        namesB1 = [ name for name in (l.strip() for l in f_in) if name ] # aaronasterling without list()
    B1.append(clock()-te)
    te = clock()
    with open('raa.txt') as f_in:
        namesB2 = [ name for name in (l[0:-1] for l in f_in) if name ] # aaronasterling without list() and with line[0:-1]
    B2.append(clock()-te)
    te = clock()
    with open('raa.txt') as f_in:
        namesBS = [ name for name in f_in.read().splitlines() if name ] # a list comprehension with read().splitlines()
    BS.append(clock()-te)
    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f:
        xreg = re.findall('\S+',f.read()) # eyquem
    reg.append(clock()-te)
    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesC1 = list(line for line in (l.strip() for l in f_in) if line) # aaronasterling
    C1.append(clock()-te)
    te = clock()
    with open('raa.txt') as f_in:
        linesC2 = list(line for line in (l[0:-1] for l in f_in) if line) # aaronasterling with line[0:-1]
    C2.append(clock()-te)
    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        yD = [ line for line in nonblank_lines(f_in) ] # aaronasterling update
    D.append(clock()-te)
    te = clock()
    with open('raa.txt') as f_in:
        yDsh = [ name for name in short_nonblank_lines(f_in) ] # nonblank_lines with line[0:-1]
    Dsh.append(clock()-te)
    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesF1 = filter(None, (line.rstrip() for line in f_in)) # aaronasterling update 2
    F1.append(clock()-te)
    te = clock()
    with open('raa.txt') as f_in:
        linesF2 = filter(None, (line[0:-1] for line in f_in)) # aaronasterling update 2 with line[0:-1]
    F2.append(clock()-te)
    te = clock()
    with open('raa.txt') as f_in:
        linesF3 = filter(None, f_in.read().splitlines()) # aaronasterling update 2 with read().splitlines()
    F3.append(clock()-te)
print 'names_listA == names_listAA==namesB1==namesB2==namesBS==xreg\n is ',\
names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
print 'names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3\n is ',\
names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3,'\n\n\n'
def displ((fr,it,what)): print fr + str( min(it) )[0:7] + ' ' + what
map(displ,(('* ', A, '[line.strip() for line in f if line.strip()] * Felix Kling\n'),
(' ', B1, ' [name for name in (l.strip() for l in f_in) if name ] aaronasterling without list()'),
('* ', C1, 'list(line for line in (l.strip() for l in f_in) if line) * aaronasterling\n'),
('* ', reg, 're.findall("\S+",f.read()) * eyquem\n'),
('* ', D, '[ line for line in nonblank_lines(f_in) ] * aaronasterling update'),
(' ', Dsh, '[ line for line in short_nonblank_lines(f_in) ] nonblank_lines with line[0:-1]\n'),
('* ', F1 , 'filter(None, (line.rstrip() for line in f_in)) * aaronasterling update 2\n'),
(' ', B2, ' [name for name in (l[0:-1] for l in f_in) if name ] aaronasterling without list() and with line[0:-1]'),
(' ', C2, 'list(line for line in (l[0:-1] for l in f_in) if line) aaronasterling with line[0:-1]\n'),
(' ', AA, '[line[0:-1] for line in f if line[0:-1] ] Felix Kling with line[0:-1]\n'),
(' ', BS, '[name for name in f_in.read().splitlines() if name ] a list comprehension with read().splitlines()\n'),
(' ', F2 , 'filter(None, (line[0:-1] for line in f_in)) aaronasterling update 2 with line[0:-1]'),
(' ', F3 , 'filter(None, f_in.read().splitlines() aaronasterling update 2 with read().splitlines()'))
)
The regex solution is straightforward and neat, though it isn't among the fastest ones. aaronasterling's solution with filter() is surprisingly fast for me (I wasn't aware of this particular speed of filter()), and the optimized solutions go down to about 27% of the slowest time. I wonder what makes the filter/splitlines combination so fast:
names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
is True
names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3
is True
* 0.08266 [line.strip() for line in f if line.strip()] * Felix Kling
0.07535 [name for name in (l.strip() for l in f_in) if name ] aaronasterling without list()
* 0.06912 list(line for line in (l.strip() for l in f_in) if line) * aaronasterling
* 0.06612 re.findall("\S+",f.read()) * eyquem
* 0.06486 [ line for line in nonblank_lines(f_in) ] * aaronasterling update
0.05264 [ line for line in short_nonblank_lines(f_in) ] nonblank_lines with line[0:-1]
* 0.05451 filter(None, (line.rstrip() for line in f_in)) * aaronasterling update 2
0.04689 [name for name in (l[0:-1] for l in f_in) if name ] aaronasterling without list() and with line[0:-1]
0.04582 list(line for line in (l[0:-1] for l in f_in) if line) aaronasterling with line[0:-1]
0.04171 [line[0:-1] for line in f if line[0:-1] ] Felix Kling with line[0:-1]
0.03265 [name for name in f_in.read().splitlines() if name ] a list comprehension with read().splitlines()
0.03638 filter(None, (line[0:-1] for line in f_in)) aaronasterling update 2 with line[0:-1]
0.02198 filter(None, f_in.read().splitlines() aaronasterling update 2 with read().splitlines()
But this problem is a particular one, the simplest of all: only one name on each line. So the solutions are just games with lines, splittings and [0:-1] cuts.
By contrast, the regex doesn't care about lines; it straightforwardly finds the desired data. I consider this a more natural way of solving the problem, applicable from the simplest to the more complex cases, and hence often the preferable way for processing text.
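For reference, a standalone sketch of the winning filter/splitlines combination from the benchmark above (rewritten here for Python 3, where filter is lazy and must be wrapped in list(); note that splitlines() only removes the line endings, so this variant skips truly empty lines but keeps lines made of spaces or tabs):

with open('raa.txt') as f_in:
    # read().splitlines() drops the newline characters;
    # filter(None, ...) then removes the resulting empty strings.
    names = list(filter(None, f_in.read().splitlines()))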
EDIT
I forgot to say that I use Python 2.7, and I measured the above times with a file containing the following list of names repeated 500 times:
SMITH
JONES
WILLIAMS
TAYLOR
BROWN
DAVIES
EVANS
WILSON
THOMAS
JOHNSON
ROBERTS
ROBINSON
THOMPSON
WRIGHT
WALKER
WHITE
EDWARDS
HUGHES
GREEN
HALL
LEWIS
HARRIS
CLARKE
PATEL
JACKSON
WOOD
TURNER
MARTIN
COOPER
HILL
WARD
MORRIS
MOORE
CLARK
LEE
KING
BAKER
HARRISON
MORGAN
ALLEN
JAMES
SCOTT
PHILLIPS
WATSON
DAVIS
PARKER
PRICE
BENNETT
YOUNG
GRIFFITHS
MITCHELL
KELLY
COOK
CARTER
RICHARDSON
BAILEY
COLLINS
BELL
SHAW
MURPHY
MILLER
COX
RICHARDS
KHAN
MARSHALL
ANDERSON
SIMPSON
ELLIS
ADAMS
SINGH
BEGUM
WILKINSON
FOSTER
CHAPMAN
POWELL
WEBB
ROGERS
GRAY
MASON
ALI
HUNT
HUSSAIN
CAMPBELL
MATTHEWS
OWEN
PALMER
HOLMES
MILLS
BARNES
KNIGHT
LLOYD
BUTLER
RUSSELL
BARKER
FISHER
STEVENS
JENKINS
MURRAY
DIXON
HARVEY
Answered by eyquem
@S.Lott
The following code processes lines one at a time and produces a result that isn't memory-hungry:
filename = 'english names.txt'
with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in)
    lines = (line for line in lines if line)
    the_strange_sum = 0
    for l in lines:
        the_strange_sum += 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'.find(l[0])
print the_strange_sum
So the generator expression (line.rstrip() for line in f_in) is just as acceptable as the nonblank_lines() function.
Answered by Sean
If you want, you can just put what you had in a list comprehension:
names_list = [line for line in open("names.txt", "r").read().splitlines() if line]
or
all_lines = open("names.txt", "r").read().splitlines()
names_list = [name for name in all_lines if name]
splitlines() has already removed the line endings.
I don't think those are as clear as just looping explicitly though:
names_list = []
with open('names.txt', 'r') as _:
    for line in _:
        line = line.strip()
        if line:
            names_list.append(line)
Edit:
filter, though, looks quite readable and concise:
names_list = filter(None, open("names.txt", "r").read().splitlines())
Answered by Rocketq
What about the LineSentence module from gensim? It will ignore such lines:
Bases: object
Simple format: one sentence = one line; words already preprocessed and separated by whitespace.
source can be either a string or a file object. Clip the file to the first limit lines (or not clipped if limit is None, the default).
from gensim.models.word2vec import LineSentence
text = LineSentence('text.txt')
Answered by a_r
I guess there is a simple solution which I recently used after going through so many answers here.
with open(file_name) as f_in:
    for line in f_in:
        if len(line.split()) == 0:
            continue
This just does the same work, ignoring all empty lines.
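If the goal is to collect the non-blank lines rather than just skip them, here is a minimal variant of the same idea (names_list is a name introduced here for illustration):

names_list = []
with open(file_name) as f_in:
    for line in f_in:
        stripped = line.strip()
        if not stripped:      # skip empty and whitespace-only lines
            continue
        names_list.append(stripped)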
Answered by Bharel
Why are you all doing it the hard way?
with open("myfile") as myfile:
nonempty = filter(str.rstrip, myfile)
Convert nonempty into a list if you have the urge to do so, although I highly suggest keeping nonempty as a lazy iterator, as it is in Python 3.x.
In Python 2.x you may use itertools.ifilter to do your bidding instead.
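A rough sketch of both variants (the stripping step is an addition here, since filter keeps the matching lines unchanged, trailing newlines included):

# Python 3: filter() is lazy; strip and materialize as needed.
with open("myfile") as myfile:
    names = [line.strip() for line in filter(str.rstrip, myfile)]

# Python 2 equivalent with itertools.ifilter:
# from itertools import ifilter
# with open("myfile") as myfile:
#     names = [line.strip() for line in ifilter(str.rstrip, myfile)]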

