Read a small random sample from a big CSV file into a Python data frame

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/22258491/

Tags: python, pandas, random, io, import-from-csv

Asked by P.Escondido

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

Answered by dlm

Assuming no header in the CSV file:

import pandas
import random

n = 1000000 #number of records in file
s = 10000 #desired sample size
filename = "data.txt"
skip = sorted(random.sample(xrange(n),n-s))
df = pandas.read_csv(filename, skiprows=skip)

It would be better if read_csv had a keeprows argument, or if skiprows took a callback function instead of a list. (As exp1orer's answer below notes, skiprows does accept a callable since pandas v0.20.0.)

With header and unknown file length:

import pandas
import random

filename = "data.txt"
n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
s = 10000 #desired sample size
skip = sorted(random.sample(xrange(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)

Note: For Python3.x, use range() instead of xrange().

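For reference, a Python 3 version of the first snippet (same logic, just with range()) might look like this:

import pandas
import random

n = 1000000  # total number of records in the file
s = 10000    # desired sample size
filename = "data.txt"
# skip n-s randomly chosen row indices, so read_csv keeps the remaining s rows
skip = sorted(random.sample(range(n), n - s))
df = pandas.read_csv(filename, skiprows=skip)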

Answered by Joran Beasley

from random import randint

class magic_checker:
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self, x):
        # iter()'s sentinel comparison lands here; stop after target_count reads
        self.count += 1
        return self.count >= self.target

min_target = 100000
max_target = min_target*2
nlines = randint(100, 1000)
seek_target = randint(min_target, max_target)
with open("big.csv") as f:
    f.seek(seek_target)
    f.readline()  # discard this (probably partial) line
    # read consecutive lines from the random position until the sentinel fires
    rand_lines = list(iter(lambda: f.readline(), magic_checker(nlines)))

# do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))

Something like that should work, I think.

Answered by Vagner Guedes

No pandas!

import random
from os import fstat
from sys import exit

# Open in binary mode so the relative seeks below work in Python 3
f = open('/usr/share/dict/words', 'rb')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip the rest of the current line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip bytes again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print("   min_bytes_to_skip")
            print("   max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline())

print(sampled_lines)

You'll end up with a sampled_lines list. What kind of statistics do you mean?

Answered by queise

The following code first reads the header, and then a random sample of the other lines:

import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
# choose data-row indices (1..nlinesfile) to skip; row 0, the header, is always kept
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)

Answered by desktable

Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.

Say you want m samples. The algorithm keeps the first m samples as-is. Then, when it sees the i-th sample (i > m), it uses that sample, with probability m/i, to replace a uniformly chosen member of the current selection.

By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples. This is the classic reservoir sampling invariant: by induction, after the i-th sample is processed, each of the first i samples sits in the reservoir with probability m/i.

See code below:

import random

n_samples = 10
samples = []

with open("data.txt") as f:  # any iterable of lines works here
    for i, line in enumerate(f):
        if i < n_samples:
            # keep the first n_samples lines unconditionally
            samples.append(line)
        elif random.random() < n_samples * 1. / (i + 1):
            # replace a uniformly chosen kept line with probability n_samples/(i+1)
            samples[random.randint(0, n_samples - 1)] = line
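
If you then want the sampled lines in a data frame, one option (a sketch; the io.StringIO step and the header=None assumption are mine, not part of the original answer) is:

import io
import pandas as pd

# samples is the list of raw CSV lines collected by the loop above
df = pd.read_csv(io.StringIO("".join(samples)), header=None)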

Answered by Bar

This is not in Pandas, but it achieves the same result much faster through bash, while not reading the entire file into memory:

shuf -n 100000 data/original.tsv > data/sample.tsv

The shuf command will shuffle the input, and the -n argument indicates how many lines we want in the output.

Relevant question: https://unix.stackexchange.com/q/108581

Benchmark on a 7M-line CSV (the 2008.csv file used below):

Top answer:

import random
import pandas

def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
    s = 100000 #desired sample size
    skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")

Timing for pandas:

%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s

While using shuf:

time shuf -n 100000 2008.csv > temp.csv

real    0m1.583s
user    0m1.445s
sys     0m0.136s

So shuf is about 12x faster and, importantly, does not read the whole file into memory.

Answered by exp1orer

@dlm's answer is great, but since v0.20.0, skiprows does accept a callable. The callable receives the row number as its argument.

If you can specify what percent of lines you want, rather than how many lines, you don't even need to get the file size, and you only read through the file once. Assuming a header on the first row:

import pandas as pd
import random
p = 0.01  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
         filename,
         header=0, 
         skiprows=lambda i: i>0 and random.random() > p
)

Or, if you want to take every nth line (a systematic rather than random sample):

n = 100  # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
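
If you need an exact sample size rather than an expected percentage, a sketch (my combination, not part of the original answer) that pairs dlm's line count with the skiprows callable is:

import pandas as pd
import random

filename = "data.txt"
s = 10000  # desired sample size
n = sum(1 for _ in open(filename)) - 1  # number of data rows, excluding the header
keep = set(random.sample(range(1, n + 1), s))  # 1-based data-row numbers to keep
df = pd.read_csv(filename, header=0, skiprows=lambda i: i > 0 and i not in keep)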

Answered by Zhongjun 'Mark' Jin

Use the subsample command-line tool:

pip install subsample
subsample -n 1000 file.csv > file_1000_sample.csv

Answered by u5675325

You can also create a sample of 10000 records before bringing it into the Python environment.

Using Git Bash (Windows 10), I just ran the following command to produce the sample:

shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv

Note: if your CSV has a header, this is not the best solution, since shuf treats the header row like any other line; you would need to set the header aside first (for example, keep the first line with head and pipe the rest through shuf).

Answered by Newt

For example, if you have loan.csv, you can use this one-liner to easily load a specified number of random rows. Note that read_csv loads the entire file into memory first, so this only works when the file fits in RAM:

data = pd.read_csv('loan.csv').sample(10000, random_state=44)
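
If the file does not fit in memory, a chunked variant (a sketch under that assumption; the chunksize and sampling fraction below are illustrative, not from the original answer) samples from each chunk and concatenates the pieces:

import pandas as pd

pieces = []
for chunk in pd.read_csv('loan.csv', chunksize=100000):
    # keep roughly 1% of each chunk; tune frac to hit the desired sample size
    pieces.append(chunk.sample(frac=0.01, random_state=44))
df = pd.concat(pieces, ignore_index=True)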