Using Python to write specific lines from one file to another file
Original question: http://stackoverflow.com/questions/12755587/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by Andreanna
I have ~200 short text files (50kb) that all have a similar format. I want to find a line in each of those files that contains a certain string and then write that line plus the next three lines (but not the rest of the lines in the file) to another text file. I am trying to teach myself Python in order to do this and have written a very simple and crude little script to try this out. I am using version 2.6.5, and running the script from the Mac terminal:
#!/usr/bin/env python
f = open('Test.txt')
Lines=f.readlines()
searchquery = 'am\n'
i=0
while i < 500:
    if Lines[i] == searchquery:
        print Lines[i:i+3]
        i = i+1
    else:
        i = i+1
f.close()
This more or less works and prints the output to the screen. But I would like to print the lines to a new file instead, so I tried something like this:
f1 = open('Test.txt')
f2 = open('Output.txt', 'a')
Lines=f1.readlines()
searchquery = 'am\n'
i=0
while i < 500:
if Lines[i] == searchquery:
    f2.write(Lines[i])
    f2.write(Lines[i+1])
    f2.write(Lines[i+2])
    i = i+1
else:
    i = i+1
f1.close()
f2.close()
However, nothing is written to the file. I also tried
from __future__ import print_function
print(Lines[i], file='Output.txt')
and can't get that to work, either. If anyone can explain what I'm doing wrong or offer some suggestions about what I should try instead I would be really grateful. Also, if you have any suggestions for making the search better I would appreciate those as well. I have been using a test file where the string I want to find is the only text on the line, but in my real files the string that I need is still at the beginning of the line but followed by a bunch of other text, so I think the way I have things set up now won't really work, either.
Thanks, and sorry if this is a super basic question!
Accepted answer by Lukas Graf
As pointed out by @ajon, I don't think there's anything fundamentally wrong with your code except the indentation. With the indentation fixed it works for me. However, there are a couple of opportunities for improvement.
1) In Python, the standard way of iterating over things is by using a for loop. When using a for loop, you don't need to define loop counter variables and keep track of them yourself in order to iterate over things. Instead, you write something like this
for line in lines:
    print line
to iterate over all the items in a list of strings and print them.
2) In most cases this is what your for loops will look like. However, there are situations where you actually do want to keep track of the loop count. Your case is such a situation, because you not only need that one line but also the next three, and therefore need to use the counter for indexing (lst[i]). For that there's enumerate(), which yields each item together with its index, so you can loop over both at once.
for i, line in enumerate(lines):
    print i
    print line
    print lines[i+7]
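To make the enumerate() idea a bit more concrete, here is a rough sketch in Python 3 syntax (the question uses 2.6, so treat the exact syntax as an assumption). The helper name matches_with_following and the sample list are made up for illustration. It uses a slice rather than lines[i+1], because a slice that runs past the end of the list just comes back shorter instead of raising IndexError.

```python
def matches_with_following(lines, searchquery, extra=3):
    """Collect every line equal to searchquery plus up to `extra` lines after it."""
    found = []
    for i, line in enumerate(lines):
        if line == searchquery:
            # Slicing past the end of a list is safe: the result is just shorter.
            found.extend(lines[i:i + 1 + extra])
    return found


# Hypothetical sample data standing in for f.readlines()
sample = ["intro\n", "am\n", "one\n", "two\n", "three\n", "four\n"]
print(matches_with_following(sample, "am\n"))
```

Whether you want a short result or an error when the file ends early is a design choice; the slice silently gives you whatever is left.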
If you were to manually keep track of the loop counter as in your example, there are two things:
3) That i = i+1 should be moved out of the if and else blocks. You're doing it in both cases, so put it after the if/else. In your case the else block then doesn't do anything any more, and can be eliminated:
while i < 500:
    if Lines[i] == searchquery:
        f2.write(Lines[i])
        f2.write(Lines[i+1])
        f2.write(Lines[i+2])
    i = i+1
4) Now, this will cause an IndexError with files shorter than 500 lines. Instead of hard-coding a loop count of 500, you should use the actual length of the sequence you're iterating over. len(lines) will give you that length. But instead of using a while loop, use a for loop and range(len(lst)) to iterate over the indices from zero to len(lst) - 1.
for i in range(len(lst)):
    print lst[i]
5) open() can be used as a context manager that takes care of closing files for you. Context managers are a rather advanced concept but are pretty simple to use if they're already provided for you. By doing something like this
with open('test.txt', 'w') as f:
    f.write('foo')
the file will be opened and accessible to you as f inside that with block. After you leave the block the file will be automatically closed, so you can't end up forgetting to close the file.
In your case you're opening two files. This can be done by just using two with statements and nesting them
with open('one.txt', 'w') as f1:
    with open('two.txt', 'w') as f2:
        f1.write('foo')
        f2.write('bar')
or, in Python 2.7 / Python 3.x, by combining two context managers in a single with statement:
with open('one.txt', 'w') as f1, open('two.txt', 'a') as f2:
    f1.write('foo')
    f2.write('bar')
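If you want to convince yourself that the with block really closes the file, a small experiment like the following should show it. This is only a sketch in Python 3 syntax; the scratch file demo.txt in a temporary directory is an assumption so that no real data is touched.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:
    f.write("foo\n")
    inside = f.closed   # checked while still inside the block

after = f.closed        # checked after the block has been left

with open(path) as f:
    content = f.read()

print(inside, after, repr(content))
```

The closed attribute is False inside the block and True once the block has been left, and the data written inside it is on disk by then.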
6) Depending on the operating system the file was created on, line endings are different. On UNIX-like platforms it's \n, Macs before OS X used \r, and Windows uses \r\n. So that Lines[i] == searchquery will not match for Mac or Windows line endings. file.readline() can deal with all three, but because it keeps whatever line endings were there at the end of the line, the comparison will fail. This is solved by using str.strip(), which strips all whitespace from the beginning and the end of the string, and comparing the result to a search pattern without the line ending:
searchquery = 'am'
# ...
if line.strip() == searchquery:
    # ...
(Reading the file using file.read() and using str.splitlines() would be another alternative.)
But, since you mentioned your search string actually appears at the beginning of the line, let's do that, by using str.startswith():
if line.startswith(searchquery):
    # ...
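To see how the three comparisons differ, here is a small side-by-side sketch (Python 3 syntax; the sample strings are made up and held in memory, so the line endings stay exactly as written):

```python
searchquery = "am"

# Hypothetical lines with different endings and surrounding text
lines = ["am\n", "am\r\n", "am I late?\n", "  am\n", "ham\n"]

# Exact comparison only matches the one line ending in a bare \n
exact = [line for line in lines if line == searchquery + "\n"]

# strip() removes leading/trailing whitespace, including \r and \n
stripped = [line for line in lines if line.strip() == searchquery]

# startswith() ignores whatever follows the search string
prefixed = [line for line in lines if line.startswith(searchquery)]

print(exact)
print(stripped)
print(prefixed)
```

Note that startswith() does not skip leading whitespace, so the indented "  am" line only matches the strip() variant. If your real files can have leading spaces, line.lstrip().startswith(searchquery) might be closer to what you want.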
7) The official style guide for Python, PEP8, recommends using CamelCase for classes and lowercase_underscore for pretty much everything else (variables, functions, attributes, methods, modules, packages). So instead of Lines use lines. This is definitely a minor point compared to the others, but still worth getting right early on.
So, considering all those things I would write your code like this:
searchquery = 'am'
with open('Test.txt') as f1:
    with open('Output.txt', 'a') as f2:
        lines = f1.readlines()
        for i, line in enumerate(lines):
            if line.startswith(searchquery):
                f2.write(line)
                f2.write(lines[i + 1])
                f2.write(lines[i + 2])
As @TomK pointed out, all this code assumes that if your search string matches, there are at least two lines following it. If you can't rely on that assumption, dealing with that case by using a try...except block like @poorsod suggested is the right way to go.
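One way the try...except idea could look is sketched below. This is not @poorsod's code (that answer isn't reproduced here), just an illustration in Python 3 syntax; the function name copy_matches and the in-memory io.StringIO output are assumptions made so the example runs without touching any files.

```python
import io


def copy_matches(lines, out, searchquery, following=2):
    """Write each matching line plus up to `following` lines after it to `out`."""
    for i, line in enumerate(lines):
        if line.startswith(searchquery):
            out.write(line)
            try:
                for offset in range(1, following + 1):
                    out.write(lines[i + offset])
            except IndexError:
                # The match was too close to the end of the input;
                # keep whatever has already been written and move on.
                pass


# Hypothetical input; the second match has only one line after it
sample = ["a\n", "am 1\n", "b\n", "c\n", "am 2\n", "d\n"]
out = io.StringIO()
copy_matches(sample, out, "am")
print(out.getvalue(), end="")
```

With a real file you would pass the list from f1.readlines() and the open output file f2 instead of the sample list and the StringIO object.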
Answer by ajon
I think your problem is the indentation in the bottom script.
You need to indent everything from if Lines[i] through the final i = i+1, like this:
while i < 500:
    if Lines[i] == searchquery:
        f2.write(Lines[i])
        f2.write(Lines[i+1])
        f2.write(Lines[i+2])
        i = i+1
    else:
        i = i+1
Answer by TomK
ajon has the right answer, but as long as you are looking for guidance, your solution doesn't take advantage of the high-level constructs that Python can offer. How about:
searchquery = 'am\n'
with open('Test.txt') as f1:
    with open('Output.txt', 'a') as f2:
        Lines = f1.readlines()
        try:
            # list.index raises ValueError if searchquery is not in the list
            i = Lines.index(searchquery)
            for iline in range(i, i+3):
                f2.write(Lines[iline])
        except ValueError:
            print "not in file"
The two "with" statements will automatically close the files at the end, even if an exception happens.
A still better solution would be to avoid reading in the whole file at once (who knows how big it could be?) and, instead, process line by line, using iteration on a file object:
with open('Test.txt') as f1:
    with open('Output.txt', 'a') as f2:
        for line in f1:
            if line == searchquery:
                f2.write(line)
                # pull the next two lines from the same file iterator
                f2.write(f1.next())
                f2.write(f1.next())
All of these assume that there are at least two additional lines beyond your target line.
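If you can't count on those extra lines being there, one possible variation is to take them with itertools.islice, which simply stops early when the file runs out. The sketch below uses Python 3 syntax and in-memory io.StringIO objects in place of real files; the function name copy_matches_streaming is made up for illustration.

```python
import io
from itertools import islice


def copy_matches_streaming(source, out, searchquery, following=2):
    """Stream through `source`, copying matching lines and what follows them."""
    for line in source:
        if line.startswith(searchquery):
            out.write(line)
            # islice takes at most `following` more lines from the same
            # iterator and yields fewer if the input ends first.
            out.writelines(islice(source, following))


# Hypothetical input; the last match is followed by only one line
source = io.StringIO("a\nam 1\nb\nc\nd\nam 2\ne\n")
out = io.StringIO()
copy_matches_streaming(source, out, "am")
print(out.getvalue(), end="")
```

This relies on the file object being its own iterator, so the for loop and islice share one position in the file. Lines consumed by islice are not tested for a match themselves.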
Answer by whardier
Have you tried using something other than 'Output.txt' to avoid any filesystem related issues as the problem?
What about an absolute path, to avoid any funky unforeseen problems while diagnosing this?
This advice is simply from a diagnostic standpoint. Also check out the OS X dtrace and dtruss tools.
Answer by computerist
Writing line by line can be slow when working with large data. You can accelerate the read/write operations by reading/writing a bunch of lines at once.
from itertools import islice
f1 = open('Test.txt')
f2 = open('Output.txt', 'a')
bunch = 500
# read up to `bunch` lines in one go and write them together
lines = list(islice(f1, bunch))
f2.writelines(lines)
f1.close()
f2.close()
If your lines are very long, then depending on your system you may not be able to hold 500 lines in a list. If that's the case, you should reduce the bunch size and use as many read/write steps as needed to write the whole thing.
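The repeated read/write steps could be wrapped in a loop along these lines. Again this is only a sketch in Python 3 syntax: the function name copy_in_bunches is made up, and io.StringIO objects stand in for the real input and output files.

```python
import io
from itertools import islice


def copy_in_bunches(source, out, bunch=500):
    """Copy all lines from `source` to `out`, at most `bunch` lines per step."""
    copied = 0
    while True:
        lines = list(islice(source, bunch))
        if not lines:
            # An empty list means the input is exhausted
            break
        out.writelines(lines)
        copied += len(lines)
    return copied


# Hypothetical input of 1234 short lines
source = io.StringIO("".join("line %d\n" % n for n in range(1234)))
out = io.StringIO()
print(copy_in_bunches(source, out, bunch=500))
```

Only one bunch of lines is held in memory at a time, so the bunch size is the knob to turn if memory is tight.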

