bash pull certain lines from a file

Disclaimer: this page is a Chinese/English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/6821360/

Date: 2020-09-18 00:26:45  Source: igfitidea


Tags: bash, file-io

Asked by mike

I was wondering if there is a more efficient way to get this task done. I am working with files with the number of lines ranging from a couple hundred thousand to a couple million. Say I know that lines 100,000 - 125,000 are the lines that contain the data I am looking for. I would like to know if there is a quick way to pull just these desired lines from the file. Right now I am using a loop with grep like this:


 for ((i=$start_fid; i<=$end_fid; i++))
  do
    grep "^$i " fulldbdir_new >> new_dbdir${bscnt}
  done

This works fine, it's just taking longer than I would like. And the lines contain more than just numbers: each line has about 10 fields, the first being a sequential integer that appears only once per file.


I am comfortable writing in C if necessary.


Answered by Costa

sed can do the job...


sed -n '100000,125000p' input


EDIT: As per glenn jackman's suggestion, this can be adjusted for efficiency so that sed quits as soon as the range has been printed...


sed -n '100000,125000p; 125001q' input

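The same command can be parameterized with shell variables, which is handy if the range changes between runs. A minimal sketch (the variable names, the `seq`-generated sample input, and the `out` file are made up for illustration):

```shell
# Hypothetical start/end bounds standing in for the fixed numbers above
start=100000
end=125000
seq 1 200000 > input                       # synthetic 200,000-line sample input
# Print the inclusive range, then quit on the first line past it
sed -n "${start},${end}p; $((end + 1))q" input > out
wc -l < out                                # range is inclusive: 25001 lines
```

Note the double quotes around the sed script so the shell expands the variables, and `$((end + 1))` for the quit address.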

Answered by mhyfritz

I'd use awk:


awk 'NR >= 100000; NR == 125000 {exit}' file

For big numbers you can also use E notation:


awk 'NR >= 1e5; NR == 1.25e5 {exit}' file
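The bounds can also be passed in as awk variables instead of being hard-coded. A sketch (the `s`/`e` variable names and the `seq`-generated sample file are made up for illustration):

```shell
seq 1 200000 > file                        # synthetic 200,000-line sample file
# First pattern prints every line from s onward; second exits at line e,
# after the print has already fired for that line, so the range is inclusive.
awk -v s=100000 -v e=125000 'NR >= s; NR == e { exit }' file > out
head -n 1 out                              # first line of the range: 100000
tail -n 1 out                              # last line of the range: 125000
```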

EDIT: @glenn jackman's suggestion (cf. comment)


Answered by gpojd

You can try a combination of tail and head to get the correct lines.


head -n 125000 file_name | tail -n 25001 | grep "^$i "
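The `tail` count above is `end - start + 1` because the range is inclusive. With shell variables the arithmetic becomes explicit; a sketch (the variable names and the `seq`-generated sample file are made up for illustration):

```shell
start=100000
end=125000
seq 1 200000 > file_name                   # synthetic 200,000-line sample file
# head stops reading at line $end; tail keeps the last (end - start + 1) lines
head -n "$end" file_name | tail -n "$((end - start + 1))" > out
wc -l < out                                # 25001 lines
```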

Don't forget perl either.


perl -ne 'print if $. >= 100000 && $. <= 125000' file_name | grep "^$i "

or some faster perl:


perl -ne 'print if $. >= 100000; exit if $. == 125000' file_name | grep "^$i "

Also, instead of a for loop you might want to look into using GNU parallel.


Answered by Ole Tange

The answers so far read the first 100000 lines and discard them. As disk I/O is often the limiting factor these days, it might be nice to have a solution that does not have to read the unwanted lines at all.


If the first 100000 lines are always the same total length (approximately), then you might compute how far to seek into the file to get to approximately line 100000 and then read the next 25000 lines. Maybe read a bit more before and after to make sure you have all the 25000 lines.


You would not know exactly what line you were at, though, which may or may not be important for you.


Assume the average line length of the first 100000 lines is 130 bytes; then you would get something like this:


 # skip=130 blocks of bs=100000 bytes skips 13,000,000 bytes = 100000 lines x 130 bytes
 dd if=the_file skip=130 bs=100000 | head -n 25000

You would have to throw away the first line, as it is likely to be only half a line.

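To make the byte arithmetic concrete, here is a sketch on a synthetic file whose lines are exactly 130 bytes (129 characters plus newline), so the seek lands precisely on a line boundary; on real data the line length is only an estimate and the first line must be discarded as described above. The file name and line counts are made up for illustration:

```shell
# Build a 200,000-line file with fixed 130-byte lines (129 chars + newline)
awk 'BEGIN { for (i = 1; i <= 200000; i++) printf "%-129d\n", i }' > the_file
# bs=130 skip=100000 seeks past 13,000,000 bytes = the first 100,000 lines;
# tail -n +2 then drops the first (potentially partial) line, per the advice above
dd if=the_file bs=130 skip=100000 2>/dev/null | tail -n +2 | head -n 25000 > out
```

Because the sample lines are fixed-width, the first surviving line here is line 100002 (line 100001 was dropped as the "possibly partial" one); with variable-length real data you would pad the read a little on both sides and trim afterwards.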