pandas 在 Python 中顺序读取巨大的 CSV 文件
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/42900757/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Sequentially read huge CSV file in python
提问 by Ulderique Demoitre
I have a 10gb CSV file that contains some information that I need to use.
我有一个 10gb 的 CSV 文件,其中包含我需要使用的一些信息。
As I have limited memory on my PC, I cannot read the whole file into memory in one single batch. Instead, I would like to iteratively read only some rows of this file.
由于我的 PC 内存有限,我无法一次性将整个文件读入内存。相反,我只想迭代地每次读取此文件的一部分行。
Say that on the first iteration I want to read the first 100 rows, on the second rows 101 to 200, and so on.
假设在第一次迭代时我想读取前 100 行,第二次迭代时读取第 101 到 200 行,依此类推。
Is there an efficient way to perform this task in Python? Can pandas provide something useful for this? Or are there better (in terms of memory and speed) methods?
有没有一种有效的方法可以在 Python 中执行此任务?pandas 能为此提供一些有用的东西吗?或者有更好的(在内存和速度方面)方法吗?
回答 by ASH
Here is the short answer.
这是简短的回答。
import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)  # filename and process() are placeholders for your file path and per-chunk logic
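For example, a minimal sketch of that pattern which filters each chunk and keeps only the matching rows (the file name 'data.csv', the column 'value' and the threshold are hypothetical placeholders):
例如,下面是该模式的一个最小示意:对每个块进行过滤,只保留符合条件的行(文件名 'data.csv'、列名 'value' 和阈值只是假设的占位符):

import pandas as pd

chunksize = 10 ** 6
filtered_parts = []
for chunk in pd.read_csv('data.csv', chunksize=chunksize):
    # keep only the rows we need, so only the filtered result accumulates in memory
    filtered_parts.append(chunk[chunk['value'] > 0])
result = pd.concat(filtered_parts, ignore_index=True)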
Here is the very long answer.
这是很长的答案。
To get started, you'll need to import pandas and sqlalchemy. The commands below will do that.
首先,您需要导入 pandas 和 sqlalchemy。下面的命令将做到这一点。
import pandas as pd
from sqlalchemy import create_engine
Next, set up a variable that points to your csv file. This isn't necessary but it does help in re-usability.
接下来,设置一个指向您的 csv 文件的变量。这不是必需的,但它确实有助于可重用性。
file = '/path/to/csv/file'
With these three lines of code, we are ready to start analyzing our data. Let's take a look at the 'head' of the csv file to see what the contents might look like.
有了这三行代码,我们就可以开始分析我们的数据了。让我们看一下 csv 文件的“头部”,看看内容可能是什么样子。
print(pd.read_csv(file, nrows=5))
This command uses pandas' “read_csv” command to read in only 5 rows (nrows=5) and then print those rows to the screen. This lets you understand the structure of the csv file and make sure the data is formatted in a way that makes sense for your work.
此命令使用 pandas 的“read_csv”命令仅读取 5 行(nrows=5),然后将这些行打印到屏幕上。这可以让您了解 csv 文件的结构,并确保数据的格式对您的工作有意义。
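As an optional extra check, you can also use that small sample to see which dtype pandas infers for each column; a minimal sketch reusing the file variable from above:
作为一个可选的额外检查,您还可以利用这个小样本查看 pandas 会为每一列推断出什么 dtype;下面是复用上面 file 变量的一个最小示意:

sample = pd.read_csv(file, nrows=5)
print(sample.dtypes)  # quick sanity check on the inferred column types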
Before we can actually work with the data, we need to do something with it so we can begin to filter it and work with subsets of the data. This is usually what I would use a pandas dataframe for, but with large data files we need to store the data somewhere else. In this case, we'll set up a local SQLite database, read the csv file in chunks and then write those chunks to SQLite.
在我们实际处理数据之前,我们需要对它做一些事情,以便我们可以开始过滤它、处理数据的子集。这通常是我使用 pandas 数据框的场景,但是对于大型数据文件,我们需要将数据存储在其他地方。在这种情况下,我们将建立一个本地 SQLite 数据库,以块的形式读取 csv 文件,然后将这些块写入 SQLite。
To do this, we'll first need to create the SQLite database using the following command.
为此,我们首先需要使用以下命令创建 SQLite 数据库。
csv_database = create_engine('sqlite:///csv_database.db')
Next, we need to iterate through the CSV file in chunks and store the data into SQLite.
接下来,我们需要分块遍历 CSV 文件并将数据存储到 SQLite 中。
chunksize = 100000
i = 0
j = 1
for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j
    i += 1
    df.to_sql('table', csv_database, if_exists='append')
    j = df.index[-1] + 1
With this code, we are setting the chunksize at 100,000 to keep the size of the chunks manageable, initializing a couple of counters (i=0, j=1) and then running a for loop. The for loop reads a chunk of data from the CSV file, removes spaces from the column names, then stores the chunk into the SQLite database (df.to_sql(...)).
使用这段代码,我们将块大小设置为 100,000 以保持每个块的大小可控,初始化两个计数器 (i=0, j=1),然后运行一个 for 循环。for 循环从 CSV 文件中读取一个数据块,删除列名中的空格,然后将该数据块存储到 SQLite 数据库中 (df.to_sql(...))。
This might take a while if your CSV file is sufficiently large, but the time spent waiting is worth it because you can now use pandas' SQL tools to pull data from the database without worrying about memory constraints.
如果您的 CSV 文件足够大,这可能需要一段时间,但等待的时间是值得的,因为您现在可以使用 pandas 的 SQL 工具从数据库中提取数据,而无需担心内存限制。
To access the data now, you can run commands like the following:
要立即访问数据,您可以运行如下命令:
df = pd.read_sql_query('SELECT * FROM "table"', csv_database)
Of course, using 'select * ...' will load all data into memory, which is the problem we are trying to get away from, so you should add filters to your select statements to narrow down the data. For example:
当然,使用 'select * ...' 会将所有数据加载到内存中,而这正是我们试图避免的问题,因此您应该在 select 语句中加入过滤条件来缩小数据范围。例如:
df = pd.read_sql_query('SELECT COL1, COL2 FROM "table" WHERE COL1 = SOMEVALUE', csv_database)
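If even the filtered result is too large for memory, read_sql_query also accepts a chunksize argument, so the query result can be streamed as well. A sketch, reusing the csv_database engine from above (COL1, COL2 and SOMEVALUE remain placeholders):
如果连过滤后的结果都放不进内存,read_sql_query 本身也接受 chunksize 参数,因此查询结果同样可以分块读取。下面是一个示意,复用上面的 csv_database 引擎(COL1、COL2 和 SOMEVALUE 仍是占位符):

for sql_chunk in pd.read_sql_query('SELECT COL1, COL2 FROM "table" WHERE COL1 = SOMEVALUE',
                                   csv_database, chunksize=100000):
    process(sql_chunk)  # placeholder for your own per-chunk processing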
回答 by Guillaume
You can use pandas.read_csv() with the chunksize parameter:
您可以使用 pandas.read_csv() 的 chunksize 参数:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
for chunk_df in pd.read_csv('yourfile.csv', chunksize=100):
    # each chunk_df contains a part of the whole CSV
    process(chunk_df)  # placeholder for whatever you do with each chunk
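If you specifically want the pattern from the question (rows 1-100 first, then rows 101-200, and so on), you can also keep the returned iterator and pull one chunk at a time with get_chunk(); a sketch, again with 'yourfile.csv' as a placeholder:
如果您确实想要问题中描述的模式(先读第 1-100 行,再读第 101-200 行,依此类推),也可以保留返回的迭代器,用 get_chunk() 每次取出一个块;下面是示意,'yourfile.csv' 仍是占位符:

import pandas as pd

reader = pd.read_csv('yourfile.csv', chunksize=100)
first_100 = reader.get_chunk()   # rows 1-100
next_100 = reader.get_chunk()    # rows 101-200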
回答 by G.Papas
This code may help you with this task. It steps through a large .csv file without consuming lots of memory, so you can run it on a standard laptop.
此代码可以帮助您完成此任务。它逐块遍历一个大的 .csv 文件并且不消耗大量内存,因此您可以在普通笔记本电脑上运行它。
import pandas as pd
import os
The chunksize here sets the number of rows of the csv file that will be read at a time.
这里的 chunksize 设定每次从 csv 文件中读取的行数。
chunksize2 = 2000
path = './'

data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1")
df2 = data2.get_chunk(chunksize2)
headers = list(df2.keys())
del data2

start_chunk = 0
data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1",
                    skiprows=chunksize2*start_chunk)
headers = []
for i, df2 in enumerate(data2):
    try:
        print('reading csv....')
        print(df2)
        print('header: ', list(df2.keys()))
        print('our header: ', headers)

        # Access chunks within data
        # for chunk in data:

        # You can now export all outcomes in new csv files
        file_name = 'export_csv_' + str(start_chunk + i) + '.csv'
        save_path = os.path.abspath(
            os.path.join(
                path, file_name
            )
        )
        print('saving ...')
        df2.to_csv(save_path, index=False)  # write this chunk to its own csv file
    except Exception:
        print('reach the end')
        break
回答 by Wojciech Moszczyński
The method of transferring a huge CSV into a database is good because we can easily use SQL queries. But we also have to take two things into account.
将巨大的 CSV 转入数据库的方法很好,因为我们可以轻松使用 SQL 查询。但我们还必须考虑两件事。
FIRST POINT: SQL is not made of rubber either; it will not be able to stretch your memory.
第一点:SQL 也不是橡皮筋,它无法把内存撑大。
For example, take this dataset converted to a db file:
例如,将下面这个数据集转换为 db 文件:
https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
For this db file, the following SQL query:
对于这个 db 文件,执行如下 SQL 查询:
pd.read_sql_query("SELECT * FROM 'table' LIMIT 600000", Mydatabase)
It can read no more than about 0.6 million records with 16 GB of PC RAM (operation time: 15.8 seconds). It may be a bit cheeky to add that reading directly from the csv file is slightly more efficient:
在 16 GB 内存的 PC 上,它最多只能读取约 60 万条记录(操作时间 15.8 秒)。略带调侃地补充一句:直接从 csv 文件读取还要更高效一点:
giga_plik = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
Abdul = pd.read_csv(giga_plik, nrows=1100000)
(operation time: 16.5 seconds)
(操作时间 16.5 秒)
SECOND POINT: To effectively use the date series in SQL data converted from CSV, we have to remember to store the dates in a suitable datetime form. So I propose adding this to ryguy72's code:
第二点:为了有效地使用从 CSV 转换而来的 SQL 数据中的日期序列,我们必须记得把日期存成合适的 datetime 形式。所以我建议在 ryguy72 的代码中添加以下内容:
df['ColumnWithQuasiDate'] = pd.to_datetime(df['Date'])
The complete code for the 311 file I pointed to above:
我上面提到的 311 文件的完整代码:
import time
import pandas as pd
from sqlalchemy import create_engine

start_time = time.time()
### sqlalchemy create_engine
plikcsv = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
WM_csv_datab7 = create_engine('sqlite:///C:/1/WM_csv_db77.db')
#----------------------------------------------------------------------
chunksize = 100000
i = 0
j = 1
## --------------------------------------------------------------------
for df in pd.read_csv(plikcsv, chunksize=chunksize, iterator=True, encoding='utf-8', low_memory=False):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    ## -----------------------------------------------------------------------
    df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])  # to datetimes
    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])
    ## --------------------------------------------------------------------------
    df.index += j
    i += 1
    df.to_sql('table', WM_csv_datab7, if_exists='append')
    j = df.index[-1] + 1

print(time.time() - start_time)
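Once the dates are stored in datetime form, you can filter on them directly in the SQL query; a sketch against the WM_csv_datab7 engine built above (the 2015 date range is just an example):
一旦日期以 datetime 形式存储,就可以直接在 SQL 查询中按日期过滤;下面是针对上面建立的 WM_csv_datab7 引擎的示意(2015 年这个日期范围只是举例):

df_2015 = pd.read_sql_query(
    'SELECT * FROM "table" '
    "WHERE CreatedDate >= '2015-01-01' AND CreatedDate < '2016-01-01'",
    WM_csv_datab7)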
At the end I would like to add: converting a csv file directly from the Internet into a db seems to me a bad idea. I propose downloading the base file first and converting it locally.
最后我想补充一点:直接通过 Internet 将 csv 文件转换为 db 在我看来是个坏主意。我建议先把源文件下载下来,再在本地进行转换。