Read a small random sample from a big CSV file into a Python data frame

Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/22258491/

Tags: python, pandas, random, io, import-from-csv

Asked by P.Escondido

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

Answered by dlm

Assuming no header in the CSV file:

import pandas
import random

n = 1000000 #number of records in file
s = 10000 #desired sample size
filename = "data.txt"
skip = sorted(random.sample(xrange(n),n-s))
df = pandas.read_csv(filename, skiprows=skip)

It would be better if read_csv had a keeprows argument, or if skiprows took a callback function instead of a list. (As exp1orer's answer below notes, skiprows does accept a callable since pandas v0.20.0.)

With header and unknown file length:

import pandas
import random

filename = "data.txt"
n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
s = 10000 #desired sample size
skip = sorted(random.sample(xrange(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)

Note: For Python3.x, use range() instead of xrange().

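For reference, a Python 3 version of the first snippet (same logic, just with range()) might look like this:

import pandas
import random

n = 1000000  # total number of records in the file
s = 10000    # desired sample size
filename = "data.txt"
# skip n-s randomly chosen row indices, so read_csv keeps the remaining s rows
skip = sorted(random.sample(range(n), n - s))
df = pandas.read_csv(filename, skiprows=skip)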

Answered by Joran Beasley

from random import randint

class magic_checker:
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self, x):
        # iter()'s sentinel comparison lands here; stop after target_count reads
        self.count += 1
        return self.count >= self.target

min_target = 100000
max_target = min_target*2
nlines = randint(100, 1000)
seek_target = randint(min_target, max_target)
with open("big.csv") as f:
    f.seek(seek_target)
    f.readline()  # discard this (probably partial) line
    # read consecutive lines from the random position until the sentinel fires
    rand_lines = list(iter(lambda: f.readline(), magic_checker(nlines)))

# do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))

Something like that should work, I think.

Answered by Vagner Guedes

No pandas!

import random
from os import fstat
from sys import exit

# Open in binary mode so the relative seeks below work in Python 3
f = open('/usr/share/dict/words', 'rb')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip the rest of the current line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip bytes again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print("   min_bytes_to_skip")
            print("   max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline())

print(sampled_lines)

You'll end up with a sampled_lines list. What kind of statistics do you mean?

Answered by queise

The following code first reads the header, and then a random sample of the other lines:

import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
# choose data-row indices (1..nlinesfile) to skip; row 0, the header, is always kept
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)

Answered by desktable

Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.

Say you want m samples. The algorithm keeps the first m samples as-is. Then, when it sees the i-th sample (i > m), it uses that sample, with probability m/i, to replace a uniformly chosen member of the current selection.

By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples. This is the classic reservoir sampling invariant: by induction, after the i-th sample is processed, each of the first i samples sits in the reservoir with probability m/i.

See code below:

import random

n_samples = 10
samples = []

with open("data.txt") as f:  # any iterable of lines works here
    for i, line in enumerate(f):
        if i < n_samples:
            # keep the first n_samples lines unconditionally
            samples.append(line)
        elif random.random() < n_samples * 1. / (i + 1):
            # replace a uniformly chosen kept line with probability n_samples/(i+1)
            samples[random.randint(0, n_samples - 1)] = line
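
If you then want the sampled lines in a data frame, one option (a sketch; the io.StringIO step and the header=None assumption are mine, not part of the original answer) is:

import io
import pandas as pd

# samples is the list of raw CSV lines collected by the loop above
df = pd.read_csv(io.StringIO("".join(samples)), header=None)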

Answered by Bar

This is not in Pandas, but it achieves the same result much faster through bash, while not reading the entire file into memory:

shuf -n 100000 data/original.tsv > data/sample.tsv

The shuf command will shuffle the input, and the -n argument indicates how many lines we want in the output.

Relevant question: https://unix.stackexchange.com/q/108581

Benchmark on a 7M-line CSV (the 2008.csv file used below):

Top answer:

import random
import pandas

def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
    s = 100000 #desired sample size
    skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")

Timing for pandas:

%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s

While using shuf:

time shuf -n 100000 2008.csv > temp.csv

real    0m1.583s
user    0m1.445s
sys     0m0.136s

So shuf is about 12x faster and, importantly, does not read the whole file into memory.

Answered by exp1orer

@dlm's answer is great, but since v0.20.0, skiprows does accept a callable. The callable receives the row number as its argument.

If you can specify what percent of lines you want, rather than how many lines, you don't even need to get the file size, and you only read through the file once. Assuming a header on the first row:

import pandas as pd
import random
p = 0.01  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
         filename,
         header=0, 
         skiprows=lambda i: i>0 and random.random() > p
)

Or, if you want to take every nth line (a systematic rather than random sample):

n = 100  # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
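
If you need an exact sample size rather than an expected percentage, a sketch (my combination, not part of the original answer) that pairs dlm's line count with the skiprows callable is:

import pandas as pd
import random

filename = "data.txt"
s = 10000  # desired sample size
n = sum(1 for _ in open(filename)) - 1  # number of data rows, excluding the header
keep = set(random.sample(range(1, n + 1), s))  # 1-based data-row numbers to keep
df = pd.read_csv(filename, header=0, skiprows=lambda i: i > 0 and i not in keep)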

Answered by Zhongjun 'Mark' Jin

Use the subsample command-line tool:

pip install subsample
subsample -n 1000 file.csv > file_1000_sample.csv

Answered by u5675325

You can also create a sample of 10000 records before bringing it into the Python environment.

Using Git Bash (Windows 10), I just ran the following command to produce the sample:

shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv

Note: if your CSV has a header, this is not the best solution, since shuf treats the header row like any other line; you would need to set the header aside first (for example, keep the first line with head and pipe the rest through shuf).

Answered by Newt

For example, if you have loan.csv, you can use this one-liner to easily load a specified number of random rows. Note that read_csv loads the entire file into memory first, so this only works when the file fits in RAM:

data = pd.read_csv('loan.csv').sample(10000, random_state=44)
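
If the file does not fit in memory, a chunked variant (a sketch under that assumption; the chunksize and sampling fraction below are illustrative, not from the original answer) samples from each chunk and concatenates the pieces:

import pandas as pd

pieces = []
for chunk in pd.read_csv('loan.csv', chunksize=100000):
    # keep roughly 1% of each chunk; tune frac to hit the desired sample size
    pieces.append(chunk.sample(frac=0.01, random_state=44))
df = pd.concat(pieces, ignore_index=True)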