bash pull certain lines from a file

Disclaimer: this page is a Chinese/English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/6821360/

Date: 2020-09-18 00:26:45  Source: igfitidea


Tags: bash, file-io

Asked by mike

I was wondering if there is a more efficient way to get this task done. I am working with files with the number of lines ranging from a couple hundred thousand to a couple million. Say I know that lines 100,000 - 125,000 are the lines that contain the data I am looking for. I would like to know if there is a quick way to pull just these desired lines from the file. Right now I am using a loop with grep like this:


 for ((i=$start_fid; i<=$end_fid; i++))
  do
    grep "^$i " fulldbdir_new >> new_dbdir${bscnt}
  done

This works fine, it's just taking longer than I would like. And the lines contain more than just numbers: each line has about 10 fields, the first being a sequential integer that appears only once per file.


I am comfortable writing in C if necessary.


Answered by Costa

sed can do the job...


sed -n '100000,125000p' input


EDIT: As per glenn jackman's suggestion, this can be adjusted for efficiency so that sed quits as soon as the range has been printed...


sed -n '100000,125000p; 125001q' input

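The same command can be parameterized with shell variables, which is handy if the range changes between runs. A minimal sketch (the variable names, the `seq`-generated sample input, and the `out` file are made up for illustration):

```shell
# Hypothetical start/end bounds standing in for the fixed numbers above
start=100000
end=125000
seq 1 200000 > input                       # synthetic 200,000-line sample input
# Print the inclusive range, then quit on the first line past it
sed -n "${start},${end}p; $((end + 1))q" input > out
wc -l < out                                # range is inclusive: 25001 lines
```

Note the double quotes around the sed script so the shell expands the variables, and `$((end + 1))` for the quit address.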

Answered by mhyfritz

I'd use awk:


awk 'NR >= 100000; NR == 125000 {exit}' file

For big numbers you can also use E notation:


awk 'NR >= 1e5; NR == 1.25e5 {exit}' file
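The bounds can also be passed in as awk variables instead of being hard-coded. A sketch (the `s`/`e` variable names and the `seq`-generated sample file are made up for illustration):

```shell
seq 1 200000 > file                        # synthetic 200,000-line sample file
# First pattern prints every line from s onward; second exits at line e,
# after the print has already fired for that line, so the range is inclusive.
awk -v s=100000 -v e=125000 'NR >= s; NR == e { exit }' file > out
head -n 1 out                              # first line of the range: 100000
tail -n 1 out                              # last line of the range: 125000
```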

EDIT: @glenn jackman's suggestion (cf. comment)


Answered by gpojd

You can try a combination of tail and head to get the correct lines.


head -n 125000 file_name | tail -n 25001 | grep "^$i "
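The `tail` count above is `end - start + 1` because the range is inclusive. With shell variables the arithmetic becomes explicit; a sketch (the variable names and the `seq`-generated sample file are made up for illustration):

```shell
start=100000
end=125000
seq 1 200000 > file_name                   # synthetic 200,000-line sample file
# head stops reading at line $end; tail keeps the last (end - start + 1) lines
head -n "$end" file_name | tail -n "$((end - start + 1))" > out
wc -l < out                                # 25001 lines
```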

Don't forget perl either.


perl -ne 'print if $. >= 100000 && $. <= 125000' file_name | grep "^$i "

or some faster perl:


perl -ne 'print if $. >= 100000; exit if $. == 125000' file_name | grep "^$i "

Also, instead of a for loop you might want to look into using GNU parallel.


Answered by Ole Tange

The answers so far read the first 100000 lines and discard them. As disk I/O is often the limiting factor these days, it might be nice to have a solution that does not have to read the unwanted lines at all.


If the first 100000 lines are always the same total length (approximately), then you might compute how far to seek into the file to get to approximately line 100000 and then read the next 25000 lines. Maybe read a bit more before and after to make sure you have all the 25000 lines.


You would not know exactly what line you were at, though, which may or may not be important for you.


Assume the average line length of the first 100000 lines is 130 bytes; then you would get something like this:


 # skip=130 blocks of bs=100000 bytes skips 13,000,000 bytes = 100000 lines x 130 bytes
 dd if=the_file skip=130 bs=100000 | head -n 25000

You would have to throw away the first line, as it is likely to be only half a line.

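To make the byte arithmetic concrete, here is a sketch on a synthetic file whose lines are exactly 130 bytes (129 characters plus newline), so the seek lands precisely on a line boundary; on real data the line length is only an estimate and the first line must be discarded as described above. The file name and line counts are made up for illustration:

```shell
# Build a 200,000-line file with fixed 130-byte lines (129 chars + newline)
awk 'BEGIN { for (i = 1; i <= 200000; i++) printf "%-129d\n", i }' > the_file
# bs=130 skip=100000 seeks past 13,000,000 bytes = the first 100,000 lines;
# tail -n +2 then drops the first (potentially partial) line, per the advice above
dd if=the_file bs=130 skip=100000 2>/dev/null | tail -n +2 | head -n 25000 > out
```

Because the sample lines are fixed-width, the first surviving line here is line 100002 (line 100001 was dropped as the "possibly partial" one); with variable-length real data you would pad the read a little on both sides and trim afterwards.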