Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original URL: http://stackoverflow.com/questions/25943208/

Date: 2020-09-13 22:29:32  Source: igfitidea

Using Pandas read_csv() on an open file twice

python csv pandas

Asked by Grant Hulegaard

As I was experimenting with pandas, I noticed some odd behavior of pandas.read_csv and was wondering if someone with more experience could explain what might be causing it.

To start, here is my basic class definition for creating a new pandas.DataFrame from a .csv file:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath  # File path to the target .csv file.
        self.csvfile = open(filepath)  # Open file.
        self.csvdataframe = pd.read_csv(self.csvfile)

Now, this works pretty well, and instantiating the class in my __main__.py successfully creates a pandas DataFrame:

from dataMatrix import dataMatrix

testObject = dataMatrix('/path/to/csv/file')

But I noticed that this process automatically uses the first row of the .csv as the pandas.DataFrame.columns index. Instead, I decided to number the columns. Since I didn't want to assume I knew the number of columns beforehand, I took the approach of opening the file, loading it into a dataframe, counting the columns, and then reloading the dataframe with the proper number of columns using range().

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Re-load the .csv file, manually setting the column names to their 
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile, 
                                        names=range(self.numcolumns))

Keeping my processing in __main__.py the same, I got back a dataframe with the correct number of columns (500 in this case) with proper names (0...499), but it was otherwise empty (no row data).

Scratching my head, I decided to close self.csvfile and reload it like so:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)

        # Close the .csv file.         #<---- +++++++
        self.csvfile.close()           #<----  Added
        # Re-open file.                #<----  Block
        self.csvfile = open(filepath)  #<---- +++++++

        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile, 
                                        names=range(self.numcolumns))

Closing the file and re-opening it correctly returned a pandas.DataFrame with columns numbered 0...499 and all 255 subsequent rows of data.

My question is: why does closing the file and re-opening it make a difference?

Answered by unutbu

When you open a file with

open(filepath)

a file handle iterator is returned. An iterator is good for one pass through its contents. So

self.csvdataframe = pd.read_csv(self.csvfile)

reads the contents and exhausts the iterator. Subsequent calls to pd.read_csv think the iterator is empty.
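
The one-pass behavior can be seen without pandas at all. The sketch below is a minimal illustration using only the standard library: the handle keeps a read position, and after a full read that position sits at the end of the file. The sample data and the names sample_path, first_pass, second_pass, and reopened are made up for the example and do not come from the question.

```python
import os
import tempfile

# Write a tiny CSV to a temporary file (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b,c\n1,2,3\n4,5,6\n")
    sample_path = tmp.name

with open(sample_path) as handle:
    first_pass = handle.read()   # consumes everything; position is now at the end
    second_pass = handle.read()  # nothing left to read, so this is an empty string

# Opening the file again gives a fresh handle positioned at the start.
with open(sample_path) as handle:
    reopened = handle.read()

os.remove(sample_path)

print(repr(second_pass))
print(reopened == first_pass)
```

This mirrors what happened in the question: the second read on the same handle had nothing left to consume, while a freshly opened handle saw the whole file.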

Note that you could avoid this problem by just passing the file path to pd.read_csv:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)

        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(filepath,
                                        names=range(self.numcolumns))

pd.read_csv will then open (and close) the file for you.

PS. Another option is to reset the file handle to the beginning of the file by calling self.csvfile.seek(0), but using pd.read_csv(filepath, ...) is still easier.
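
For completeness, here is a rough sketch of the seek(0) variant on a single open handle. It assumes a small made-up CSV written to a temporary file; sample_path, numcolumns, and numbered are illustrative names, not part of the original code.

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV to a temporary file (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b,c\n1,2,3\n4,5,6\n")
    sample_path = tmp.name

with open(sample_path) as csvfile:
    # First pass: read once just to count the columns.
    numcolumns = len(pd.read_csv(csvfile).columns)
    # Rewind the handle; without this the second read would see no data.
    csvfile.seek(0)
    # Second pass: number the columns. Because names is given, the original
    # first line is read as an ordinary data row.
    numbered = pd.read_csv(csvfile, names=list(range(numcolumns)))

os.remove(sample_path)

print(numbered)
```

The file is still parsed twice here, so this only fixes the empty result, not the inefficiency.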

Even better, instead of calling pd.read_csv twice (which is inefficient), you could rename the columns like this:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file once.
        self.csvdataframe = pd.read_csv(filepath)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Replace the header-derived column names with 0..numcolumns-1.
        self.csvdataframe.columns = range(self.numcolumns)
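
One caveat with the rename approach, as far as I can tell: the first line of the file is still consumed as a header, so it does not show up as a data row, whereas the names=range(...) version keeps it as row 0. If the first line should be kept as data (an assumption about the file, not something stated in the question), passing header=None to pd.read_csv should give integer column labels in a single read. A small sketch with made-up data, where csv_text, renamed, and numbered are hypothetical names:

```python
import io

import pandas as pd

# Hypothetical sample data; io.StringIO stands in for a real file path.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"

# Rename approach: the first line becomes the header and is then relabeled.
renamed = pd.read_csv(io.StringIO(csv_text))
renamed.columns = range(len(renamed.columns))

# header=None: no line is treated as a header, so columns are labeled
# 0..n-1 and the first line stays in the data.
numbered = pd.read_csv(io.StringIO(csv_text), header=None)

print(len(renamed), len(numbered))
```

Which of the two is right depends on whether the first line of the file is really a header.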