Notice: this page reproduces a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/36020690/

Time: 2020-09-14 00:52:40  Source: igfitidea

Pandas equivalent of Python's readlines function

Tags: python, pandas

Asked by kilojoules

With Python's readlines() function I can retrieve a list of each line in a file:

with open('dat.csv', 'r') as dat:
    lines = dat.readlines()

I am working on a problem involving a very large file, and this method is producing a memory error. Is there a pandas equivalent to Python's readlines() function? The chunksize option of pd.read_csv() seems to append numbers to my lines, which is far from ideal.

Minimal example:

In [1]: lines = []

In [2]: for df in pd.read_csv('s.csv', chunksize = 100):
   ...:     lines.append(df)
   ...:     

In [3]: lines
Out[3]: 
[   hello here is a line
 0  here is another line
 1  here is my last line]

In [4]: with open('s.csv', 'r') as dat:
   ...:     lines = dat.readlines()
   ...:     

In [5]: lines
Out[5]: ['hello here is a line\n', 'here is another line\n', 'here is my last line\n']

In [6]: cat s.csv
hello here is a line
here is another line
here is my last line
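
A note on the output above: the numbers look like the DataFrame's row index rather than part of the data, and the first line of the file has been consumed as the column header. The following is a minimal sketch of that behaviour, assuming the same three lines as s.csv; the in-memory `text` buffer and the variable names are stand-ins introduced here, not part of the original question.

```python
import io
import pandas as pd

# Stand-in for s.csv from the question, kept in memory so the sketch is self-contained.
text = "hello here is a line\nhere is another line\nhere is my last line\n"

# Default behaviour: the first line is consumed as the column header.
with_header = pd.read_csv(io.StringIO(text))
print(list(with_header.columns))  # the first line of the file ends up here
print(len(with_header))           # only the remaining lines are counted as data rows

# header=None keeps every line as data. The 0, 1, 2 shown when the frame is
# printed are index labels, not part of the values themselves.
no_header = pd.read_csv(io.StringIO(text), header=None)
lines = no_header[0].tolist()
print(lines)
```

Pulling the values out of the column, as in the last step, leaves the index labels behind.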

Answered by Thanos

You should try to use the chunksize option of pd.read_csv(), as mentioned in some of the comments.

This will force pd.read_csv() to read in a defined number of lines at a time, instead of trying to read the entire file in one go. It would look like this:

>>> df = pd.read_csv(filepath, chunksize=1, header=None, encoding='utf-8')

In the above example, the file will be read one line at a time.

Now, in fact, according to the documentation of pandas.read_csv, it is not a pandas.DataFrame object that is being returned here, but a TextFileReader object instead.

  • chunksize : int, default None

Return TextFileReader object for iteration. See IO Tools docs for more information on iterator and chunksize.
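
To make the documentation excerpt concrete, here is a small sketch of what the returned object looks like in practice. It assumes a hypothetical three-line input held in memory; the variable names are illustrative only.

```python
import io
import pandas as pd

# Hypothetical sample data, held in memory so the sketch is self-contained.
text = "first line\nsecond line\nthird line\n"

reader = pd.read_csv(io.StringIO(text), header=None, chunksize=2)

# With chunksize set, the return value is an iterator, not a DataFrame.
is_dataframe = isinstance(reader, pd.DataFrame)

# Each iteration step yields an ordinary DataFrame with at most `chunksize` rows.
chunk_sizes = [len(chunk) for chunk in reader]
print(is_dataframe, chunk_sizes)
```

The last chunk is simply shorter when the number of rows is not a multiple of chunksize.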

Therefore, in order to complete the exercise, you would need to put this in a loop like this:

In [385]: cat data_sample.tsv
This is a new line
This is another line of text
And this is the last line of text in this file

In [386]: lines = []

In [387]: for line in pd.read_csv('./data_sample.tsv', encoding='utf-8', header=None, chunksize=1):
   .....:     lines.append(line.iloc[0,0])
   .....:     

In [388]: print(lines)
['This is a new line', 'This is another line of text', 'And this is the last line of text in this file']
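
The loop above can also be sketched with a larger chunk size, which avoids building one DataFrame per line. This is only an illustration under stated assumptions: `read_lines_in_chunks` is a hypothetical helper (not part of pandas), the sample text is held in memory, and the input is assumed to contain no commas, so each line lands entirely in column 0.

```python
import io
import pandas as pd


def read_lines_in_chunks(source, chunksize=1000):
    """Yield the first field of every row, reading `chunksize` rows at a time.

    Hypothetical helper for illustration; not part of pandas.
    """
    reader = pd.read_csv(source, header=None, chunksize=chunksize)
    for chunk in reader:
        # Column 0 holds the whole line only when the line contains no separator.
        yield from chunk[0].astype(str)


# Same three lines as data_sample.tsv, held in memory for the sketch.
sample = (
    "This is a new line\n"
    "This is another line of text\n"
    "And this is the last line of text in this file\n"
)

lines = list(read_lines_in_chunks(io.StringIO(sample), chunksize=2))
print(lines)
```

Unlike readlines(), this drops the trailing newline characters, and read_csv skips blank lines by default. Collecting everything into a list still holds the whole file in memory, so for a very large file it is probably better to process each line inside the loop instead of storing it.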

I hope this helps!
