pandas: How to read data into a Python dataframe without concatenating?
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the CC BY-SA license, cite the original URL and author information, and attribute it to the original author (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/39386458/
How to read data in Python dataframe without concatenating?
Asked by Geet
I want to read the file f (file size: 85 GB) in chunks into a dataframe. The following code was suggested:
import pandas as pd

chunksize = 5
TextFileReader = pd.read_csv(f, chunksize=chunksize)
However, this code gives me a TextFileReader, not a dataframe. Also, because of the memory limit, I don't want to concatenate these chunks to convert the TextFileReader into a dataframe. Please advise.
Answered by Sayali Sonawane
As you are trying to process an 85 GB CSV file, reading all the data by breaking it into chunks and converting them into a single dataframe will certainly hit the memory limit. You can try a different approach to solve this problem. In this case, you can apply filtering operations to your data. For example, if your dataset has 600 columns and you are interested in only 50 of them, read just those 50 columns from the file; this way you will save a lot of memory. Process your rows as you read them. If you need to filter the data first, use a generator function: yield makes a function a generator function, which means it won't do any work until you start looping over it.
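For example, a minimal sketch of this idea, reading only the columns you need and filtering rows inside a generator function (the file path, column names, and threshold below are hypothetical placeholders):

import pandas as pd

def filtered_chunks(path, threshold):
    # Generator function: yields only the rows that pass the filter,
    # one chunk at a time, so the full file never has to fit in memory.
    for chunk in pd.read_csv(path, usecols=['col1', 'col2'], chunksize=100000):
        yield chunk[chunk['col1'] > threshold]

# Process each filtered chunk as it is produced.
for part in filtered_chunks('huge_file.csv', 180):
    print(len(part))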
For more information about generator functions: Reading a huge .csv file
For efficient filtering, refer to: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
For processing a smaller dataset:
Approach 1: Convert the reader object to a dataframe directly:
full_data = pd.concat(TextFileReader, ignore_index=True)
It is necessary to pass the parameter ignore_index=True to concat in order to avoid duplicate indexes.
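Put together, a short sketch of this approach (the file path and chunksize are placeholders; note that the concatenated result must still fit in memory):

import pandas as pd

# Read the CSV in chunks, then concatenate all chunks into one dataframe.
TextFileReader = pd.read_csv('results.csv', chunksize=100000)
full_data = pd.concat(TextFileReader, ignore_index=True)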
Approach 2: Use the iterator or get_chunk to convert it into a dataframe.
By specifying a chunksize to read_csv, the return value will be an iterable object of type TextFileReader.
df = TextFileReader.get_chunk(3)  # read the next 3 rows into a dataframe
for chunk in TextFileReader:      # iterate over the remaining chunks
    print(chunk)
Source: http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking
df = pd.DataFrame(TextFileReader.get_chunk(1))
This will convert one chunk into a dataframe.
Checking the total number of chunks in TextFileReader:
number_of_chunks = 0
for chunk in TextFileReader:
    number_of_chunks = number_of_chunks + 1
print(number_of_chunks)
If the file is large, I would not recommend the second approach. For example, if the CSV file consists of 100,000 records, then chunksize=5 will create 20,000 chunks.
Answered by Yulia Perunovskaia
If you want to end up with a data frame as a result of working with chunks, you can do it this way. Initialize an empty data frame before you start iterating over the chunks. After you have done the filtering for each chunk, concatenate the result into that dataframe. As a result, you will get a dataframe filtered by the condition applied inside the for loop.
import pandas as pd

file = 'results.csv'
df_empty = pd.DataFrame()

with open(file) as fl:
    chunk_iter = pd.read_csv(fl, chunksize=100000)
    for chunk in chunk_iter:
        # Keep only the rows that satisfy the condition.
        chunk = chunk[chunk['column1'] > 180]
        df_empty = pd.concat([df_empty, chunk])