Python: randomly mix the lines of a 3 million-line file

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/4618298/

Time: 2020-08-18 16:32:53  Source: igfitidea

Randomly mix lines of 3 million-line file

python random vim

Asked by Nigu

Everything is in the title. I'm wondering if anyone knows a quick way, with reasonable memory demands, of randomly mixing all the lines of a 3-million-line file. I guess it is not possible with a simple vim command, so any simple script using Python would do. I tried with Python by using a random number generator, but did not manage to find a simple way out.

Accepted answer by S.Lott

import random
with open('the_file','r') as source:
    data = [ (random.random(), line) for line in source ]
data.sort()
with open('another_file','w') as target:
    for _, line in data:
        target.write( line )

That should do it. 3 million lines will fit into most machines' memory unless the lines are HUGE (over 512 characters).

Answer by John Kugelman

Takes only a few seconds in Python:

>>> import random
>>> lines = open('3mil.txt').readlines()
>>> random.shuffle(lines)
>>> open('3mil.txt', 'w').writelines(lines)
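
As a script rather than a REPL session, the same idea might look like the sketch below. The function name `shuffle_file` and the `seed` parameter are made up for illustration and are not part of the answer. It also handles one detail the one-liner skips: if the last line of the input has no trailing newline, it would otherwise be glued to whichever line ends up after it.

```python
import random


def shuffle_file(src_path, dst_path, seed=None):
    """Shuffle the lines of src_path into dst_path, holding all lines in memory.

    `seed` is a hypothetical convenience for reproducible runs; None lets
    Python seed the generator itself.
    """
    rng = random.Random(seed)
    with open(src_path, "r") as source:
        lines = source.readlines()
    # A final line without a trailing newline would fuse with its neighbour
    # once it moves away from the end of the file.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    rng.shuffle(lines)
    with open(dst_path, "w") as target:
        target.writelines(lines)
    return len(lines)
```

Writing to a separate output file, as here, means the input survives if the run is interrupted halfway through.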

Answer by fuzzyTew

On many systems the sort shell command takes -R to randomize its input.

Answer by S.Lott

Here's another version

At the shell, use this.

python decorate.py | sort | python undecorate.py

decorate.py

import sys
import random
for line in sys.stdin:
    sys.stdout.write( "{0}|{1}".format( random.random(), line ) )

undecorate.py

import sys
for line in sys.stdin:
    _, _, data= line.partition("|")
    sys.stdout.write( data )

Uses almost no memory.
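
To see that the decorate and undecorate steps round-trip a line unchanged, here is a small in-process sketch of the same idea. The function names are invented for this illustration, and `sorted()` stands in for the shell `sort`, so unlike the pipeline this version does hold everything in memory.

```python
import random


def decorate(line, rng=random):
    # Prefix the line with a random key and a "|" separator.
    return "{0}|{1}".format(rng.random(), line)


def undecorate(decorated):
    # partition() splits on the FIRST "|" only, so any "|" inside the
    # original line is left untouched.
    _, _, data = decorated.partition("|")
    return data


def shuffle_via_keys(lines, rng=random):
    # In-process stand-in for `decorate.py | sort | undecorate.py`.
    return [undecorate(d) for d in sorted(decorate(l, rng) for l in lines)]
```

Because the keys are independent random numbers, sorting on them as text still yields a random order, and the separator is safe because the formatted key never contains a "|".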

Answer by sleepynate

This is the same as Mr. Kugelman's, but using vim's built-in python interface:

:py import vim, random as r; cb = vim.current.buffer ; l = cb[:] ; r.shuffle(l) ; cb[:] = l

Answer by Lennart Regebro

If you do not want to load everything into memory and sort it there, you have to store the lines on disk while doing random sorting. That will be very slow.

Here is a very simple, stupid and slow version. Note that this may take a surprising amount of disk space, and it will be very slow. I ran it with 300,000 lines, and it took several minutes. 3 million lines could very well take an hour. So: Do it in memory. Really. It's not that big.

import os
import tempfile
import shutil
import random
tempdir = tempfile.mkdtemp()
print(tempdir)

files = []
# Split the lines:
with open('/tmp/sorted.txt', 'rt') as infile:
    counter = 0    
    for line in infile:
        outfilename = os.path.join(tempdir, '%09i.txt' % counter)
        with open(outfilename, 'wt') as outfile:
            outfile.write(line)
        counter += 1
        files.append(outfilename)

with open('/tmp/random.txt', 'wt') as outfile:
    while files:
        index = random.randint(0, len(files) - 1)
        filename = files.pop(index)
        with open(filename, 'rt') as infile:
            outfile.write(infile.read())

shutil.rmtree(tempdir)

Another version would be to store the lines in an SQLite database and pull the lines randomly from that database. That is probably going to be faster than this.
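
A minimal sketch of that SQLite idea, using Python's standard `sqlite3` module, could look like this. The function name, the table name `lines`, and the `db_path` parameter are assumptions made for the example, and `db_path` is expected to point at a fresh database file. Whether it actually beats the one-file-per-line version has not been measured here.

```python
import sqlite3


def shuffle_with_sqlite(src_path, dst_path, db_path):
    """Copy src_path to dst_path with lines in random order, staging them in SQLite.

    db_path is assumed to name a new database file (hypothetical parameter).
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE lines (content TEXT)")
        with open(src_path, "r") as source:
            # The generator feeds rows one at a time, so the whole file is
            # never held in a Python list.
            conn.executemany(
                "INSERT INTO lines (content) VALUES (?)",
                ((line,) for line in source),
            )
        conn.commit()
        with open(dst_path, "w") as target:
            # SQLite does the random ordering; rows are streamed back out.
            for (content,) in conn.execute(
                "SELECT content FROM lines ORDER BY RANDOM()"
            ):
                target.write(content)
    finally:
        conn.close()
```

The database file can be deleted once the output has been written.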

Answer by Drag0

I just tried this on a file with 4.3M lines, and the fastest thing was the 'shuf' command on Linux. Use it like this:

shuf huge_file.txt -o shuffled_lines_huge_file.txt

It took 2-3 seconds to finish.

Answer by Aziz Alto

Here is another way using random.choice. This may provide some gradual memory relief as well, but with a worse Big-O :)

from random import choice

with open('data.txt', 'r') as r:
    lines = r.readlines()

with open('shuffled_data.txt', 'w') as w:
    while lines:
        l = choice(lines)
        lines.remove(l)
        w.write(l)
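
The worse Big-O comes mostly from `lines.remove(l)`, which scans the list on every iteration. One possible variant, sketched here with an invented function name, keeps the pick-and-shrink behaviour but removes each item in constant time by swapping it to the end of the list before popping it.

```python
import random


def drain_in_random_order(lines, rng=random):
    """Yield every item of `lines` once, in random order, emptying the list."""
    while lines:
        i = rng.randrange(len(lines))
        # Move the chosen item to the end, then pop it: O(1) per item,
        # whereas list.remove() has to search the list each time.
        lines[i], lines[-1] = lines[-1], lines[i]
        yield lines.pop()
```

It would be used as `for l in drain_in_random_order(lines): w.write(l)` in place of the `while` loop, and the list still shrinks as lines are written out.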

Answer by sergio

The following Vimscript can be used to swap lines:

function! Random()                                                       
  let nswaps = 100                                                       
  let firstline = 1                                                     
  let lastline = 10                                                      
  let i = 0                                                              
  while i <= nswaps                                                      
    exe "let line = system('shuf -i ".firstline."-".lastline." -n 1')[:-2]"
    exe line.'d'                                                         
    exe "let line = system('shuf -i ".firstline."-".lastline." -n 1')[:-2]"
    exe "normal! " . line . 'Gp'                                         
    let i += 1                                                           
  endwhile                                                               
endfunction

Select the function in visual mode and type :@" then execute it with :call Random()

Answer by Kumaresp

This will do the trick. My solution doesn't even use random, and it will also remove duplicates.

import sys
lines= list(set(open(sys.argv[1]).readlines()))
# Each line already ends with a newline, so join without a separator.
sys.stdout.write(''.join(lines))

in the shell

python shuffler.py nameoffilestobeshuffled.txt > shuffled.txt