Getting one line in a huge file with bash
Original source: http://stackoverflow.com/questions/2794049/
Notice: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by JavaRocky
How can I get a particular line in a 3 GB text file? All the lines:
- have the same length, and
- are delimited by \n.
I need to be able to get any line on demand.
How can this be done? Only one line needs to be returned.
Answered by camh
If all the lines have the same length, the best way by far will be to use dd(1) and give it a skip parameter.
Let the block size be the length of each line (including the newline), then you can do:
$ dd if=filename bs=<line-length> skip=<line_no - 1> count=1 2>/dev/null
The idea is to seek past all the previous lines (skip=<line_no - 1>) and read a single line (count=1). Because the block size is set to the line length (bs=<line-length>), each block is effectively a single line. Redirect stderr so you don't get the annoying stats at the end.
That should be much more efficient than streaming all the preceding lines through a program that reads them only to throw them away, since dd will seek to the position you want in the file and read only one line of data from it.
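The dd command needs the line length up front. As a minimal sketch of the same seek-and-read idea in Python, the function below measures the length from the first line and then jumps directly to the requested line. The name read_fixed_line is made up for illustration, and the sketch assumes every line really does have the same byte length.

```python
import os
import tempfile


def read_fixed_line(path, line_no):
    """Return line `line_no` (1-based) from a file whose lines all have the same byte length."""
    with open(path, "rb") as f:
        # The first line, newline included, gives the block size (dd's bs=).
        line_len = len(f.readline())
        if line_len == 0:
            raise ValueError("file is empty")
        # Skip line_no - 1 whole blocks (dd's skip=) and read one block (dd's count=1).
        f.seek(line_len * (line_no - 1))
        return f.read(line_len)


if __name__ == "__main__":
    # Tiny demo file with three 6-byte lines (5 characters plus the newline).
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"alpha\nbravo\ndelta\n")
    try:
        print(read_fixed_line(tmp.name, 2))
    finally:
        os.remove(tmp.name)
```

The file is opened in binary mode so that offsets are plain byte counts. A line number past the end of the file yields an empty result rather than an error.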
Answered by Paul Creasey
Answered by paxdiablo
If it's not a fixed-record-length file and you don't do some sort of indexing on the line starts, your best bet is to just use:
head -n N filespec | tail -1
where N is the line number you want.
Unfortunately, this isn't going to be the best-performing piece of code for a 3 GB file, but there are ways to make it better.
If the file doesn't change too often, you may want to consider indexing it. By that I mean having another file with the line offsets in it as fixed-length records.
So the file:
0000000000
0000000017
0000000092
0000001023
would give you a fast way to locate each line. Just multiply the desired line number, minus one, by the index record size and seek to there in the index file.
Then use the value at that location to seek in the main file so you can read until the next newline character.
So for line 3, you would seek to 22 in the index file: (3 - 1) times the index record length of 11, which is 10 characters plus one more for the newline. Reading the value there, 0000000092, would give you the offset to use into the main file.
Of course, that's not so useful if the file changes frequently although, if you can control what happens when things get appended, you can still add offsets to the index efficiently. If you don't control that, you'll have to re-index whenever the last-modified date of the index is earlier than that of the main file.
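The indexing scheme described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names build_index and read_indexed_line and the constant RECORD_LEN are hypothetical, line numbers are taken as 1-based, and offsets are assumed to fit in 10 decimal digits, as in the example index.

```python
RECORD_LEN = 11  # 10 zero-padded digits plus the newline, as in the example index


def build_index(data_path, index_path):
    """Write one fixed-length record per line of data_path: the byte offset where that line starts."""
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        offset = 0
        for line in data:
            index.write(b"%010d\n" % offset)
            offset += len(line)


def read_indexed_line(data_path, index_path, line_no):
    """Return line `line_no` (1-based) of data_path using the offsets stored in index_path."""
    with open(index_path, "rb") as index:
        # Records are fixed-length, so the wanted record starts at a computable position.
        index.seek(RECORD_LEN * (line_no - 1))
        record = index.read(RECORD_LEN)
    if len(record) < RECORD_LEN:
        raise IndexError("line number out of range")
    with open(data_path, "rb") as data:
        # Jump to the stored offset and read up to the next newline.
        data.seek(int(record.decode("ascii")))
        return data.readline()
```

Building the index still costs one full pass over the main file, so this only pays off when many lookups follow each rebuild.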
And, based on your update:
Update: If it matters, all the lines have the same length.
With that extra piece of information, you don't need the index - you can just seek immediately to the right location in the main file by multiplying the record length by the record number (assuming the values fit into your data types).
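So something like the following (the pseudo-code written out as Python, with recnum counted from zero):
def getline(fhandle, reclen, recnum):
    # fhandle must be opened in binary mode so positions are byte offsets.
    # Seek to the start of record recnum.
    fhandle.seek(reclen * recnum)
    # Read exactly one record, newline included, and return it.
    return fhandle.read(reclen)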
Answered by Jamie
An awk alternative, where 3 is the line number.
awk 'NR == 3 {print; exit}' file.txt
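For comparison, the same scan-and-stop-early idea can be sketched in Python. This is an illustrative equivalent, not part of the original answer: the function name read_line_streaming is hypothetical, and like the awk command it still has to read every line before the one requested.

```python
import itertools


def read_line_streaming(path, line_no):
    """Return line `line_no` (1-based) by scanning from the start; stops reading once it is found."""
    with open(path, "rb") as f:
        # islice skips the first line_no - 1 lines and yields at most one more.
        for line in itertools.islice(f, line_no - 1, line_no):
            return line
    return None  # the file has fewer than line_no lines
```

Unlike the seek-based approaches, this works for lines of varying length, at the cost of time proportional to the line number.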
Answered by Paused until further notice.
Use q with sed to make the search stop after the line has been printed.
sed -n '11723{p;q}' filename
Python (minimal error checking):
#!/usr/bin/env python3
import sys
# by Dennis Williamson - 2010-05-08
# for http://stackoverflow.com/questions/2794049/getting-one-line-in-a-huge-file-with-bash
# seeks the requested line in a file with a fixed line length
# Usage: ./lineseek.py LINE FILE
# Example: ./lineseek.py 11723 data.txt
EXIT_SUCCESS = 0
EXIT_NOT_FOUND = 1
EXIT_OPT_ERR = 2
EXIT_FILE_ERR = 3
EXIT_DATA_ERR = 4
# could use a try block here
seekline = int(sys.argv[1])
file = sys.argv[2]
try:
    if file == '-':
        # stdin must be a seekable file (e.g. "< data.txt"), not a pipe
        handle = sys.stdin.buffer
    else:
        # binary mode so that tell() and seek() work in plain byte offsets
        handle = open(file, 'rb')
except OSError:
    print("File Open Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)
try:
    # measure the record length from the first line
    line = handle.readline()
    lineend = handle.tell()
    linelen = len(line)
except OSError:
    print("File I/O Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)
# it would be really weird if this happened
if lineend != linelen:
    print("Line length inconsistent", file=sys.stderr)
    sys.exit(EXIT_DATA_ERR)
try:
    # jump straight to the start of the requested (1-based) line
    handle.seek(linelen * (seekline - 1))
    line = handle.readline()
except OSError:
    print("File I/O Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)
if len(line) != linelen:
    print("Line length inconsistent", file=sys.stderr)
    sys.exit(EXIT_DATA_ERR)
# write the raw bytes; the line already ends with a newline
sys.stdout.buffer.write(line)
Argument validation should be a lot better and there is room for many other improvements.
Answered by Eld
A quick Perl one-liner would work well for this too...
$ perl -ne 'if (YOURLINENUMBER..YOURLINENUMBER) {print $_; last;}' /path/to/your/file

