Bash: select random lines from a file

Question

Asked by user121196

In a Bash script, I want to pick N random lines from an input file and write them to another file.

How can this be done?

Answer 1

Answered by dogbane

Use shuf with the -n option, as shown below, to get N random lines:

shuf -n N input > output
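
Where shuf is not available, one way to get the same kind of result in a single pass is reservoir sampling. The following is a minimal Python sketch of that technique, not a description of how shuf itself is implemented; the function name reservoir_sample and its parameters are made up for illustration.

```python
import random


def reservoir_sample(lines, n, rng=random):
    """Return up to n items chosen uniformly at random from an iterable.

    Only n items are held in memory at any time, so the input can be a
    file object that is read line by line.
    """
    sample = []
    for index, line in enumerate(lines):
        if index < n:
            # Fill the reservoir with the first n items.
            sample.append(line)
        else:
            # Keep the new item with probability n / (index + 1),
            # overwriting a uniformly chosen earlier pick.
            slot = rng.randint(0, index)
            if slot < n:
                sample[slot] = line
    return sample
```

To mimic the command above, open the input file, pass the file object and N to reservoir_sample, and write the returned lines to the output file with writelines. The lines come back in reservoir order rather than shuffled, so call random.shuffle on the result if the order matters.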

Answer 2

Answered by user881480

Sort the file randomly and pick the first 100 lines:

$ sort -R input | head -n 100 >output
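
One thing to keep in mind: GNU sort -R orders lines by a random hash of their keys, so identical lines end up next to each other in the output. If the input has duplicates and that matters, a plain in-memory shuffle avoids it. Here is a small Python sketch of "shuffle, then take the first N", assuming the file fits in memory; first_n_shuffled and the seed parameter are illustrative names, not part of any existing tool.

```python
import random


def first_n_shuffled(lines, n, seed=None):
    """Shuffle a copy of `lines` and return its first n entries.

    The input sequence is left untouched. Passing a seed makes the
    result repeatable, which is handy for testing.
    """
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled[:n]
```

For the case above, read the input with readlines(), call first_n_shuffled(lines, 100), and write the result to the output file.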

Answer 3

Answered by Stein van Broekhoven

Well, according to a comment on the shuf answer, the commenter shuffed 78 000 000 000 lines in under a minute.

Challenge accepted...

First I needed a file with 78 000 000 000 lines:

seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt

This gives me a file with 78 billion newlines ;-)

Now for the shuf part:

$ time shuf -n 10 lines_78000000000.txt










shuf -n 10 lines_78000000000.txt  2171.20s user 22.17s system 99% cpu 36:35.80 total

The bottleneck was the CPU: shuf does not use multiple threads, so it pinned one core at 100% while the other 15 sat idle.

Python is what I regularly use, so that's what I'll use to make this faster:

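#!/usr/bin/env python3
import random

f = open("lines_78000000000.txt", "rt")

# Count lines by reading 64 KiB chunks and counting newline characters.
count = 0
while True:
  buffer = f.read(65536)
  if not buffer:
    break
  count += buffer.count('\n')

# Caveat: readline(size) reads at most `size` characters of the next line;
# it does not jump to line number `size`. The file is already at EOF here,
# so each call returns an empty string and nothing is printed.
for i in range(10):
  f.readline(random.randint(1, count))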

This got it down to just under a minute:

$ time ./shuf.py         










./shuf.py  42.57s user 16.19s system 98% cpu 59.752 total

I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe drive, which gives me plenty of read and write speed.

I know it can get faster, but I'll leave some room for others to give it a try.
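
For anyone picking this up: readline(size) in the Python script above caps the number of characters read rather than seeking to a given line, so that script mostly measures the counting pass. Below is a hedged sketch of a two-pass version that does return the chosen lines. The names count_lines and pick_random_lines are made up for illustration, and no timing is claimed for it; the second pass walks the file line by line and will likely be much slower than the chunked count on a file this large.

```python
import random


def count_lines(path, chunk_size=65536):
    """Count newline bytes by reading the file in fixed-size chunks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def pick_random_lines(path, n, rng=random):
    """Return n distinct randomly chosen lines (fewer if the file is shorter).

    Assumption: every line ends with a newline. A final line without one
    is not counted in pass 1 and is therefore never selected.
    """
    total = count_lines(path)
    # Choose the 0-based line numbers up front, without replacement.
    wanted = set(rng.sample(range(total), min(n, total)))
    picked = []
    with open(path, "rb") as f:
        for number, line in enumerate(f):
            if number in wanted:
                picked.append(line.decode(errors="replace").rstrip("\n"))
                if len(picked) == len(wanted):
                    break  # every chosen line has been found
    return picked
```

Calling pick_random_lines("lines_78000000000.txt", 10) would return ten lines in file order; print them or write them to an output file as needed.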

Line counter source: Luther Blissett
