Pandas equivalent of Python's readlines function

Question

Asked by kilojoules

With Python's readlines() function I can retrieve a list of each line in a file:

with open('dat.csv', 'r') as dat:
    lines = dat.readlines()

I am working on a problem involving a very large file, and this method produces a memory error. Is there a pandas equivalent to Python's readlines() function? The chunksize option of pd.read_csv() seems to add numbers to my lines, which is far from ideal.

Minimal example:

In [1]: lines = []

In [2]: for df in pd.read_csv('s.csv', chunksize=100):
   ...:     lines.append(df)
   ...:

In [3]: lines
Out[3]: 
[   hello here is a line
 0  here is another line
 1  here is my last line]

In [4]: with open('s.csv', 'r') as dat:
   ...:     lines = dat.readlines()
   ...:     

In [5]: lines
Out[5]: ['hello here is a line\n', 'here is another line\n', 'here is my last line\n']

In [6]: cat s.csv
hello here is a line
here is another line
here is my last line

Answer 1

Answered by Thanos

You should try the chunksize option of pd.read_csv(), as mentioned in some of the comments.

This forces pd.read_csv() to read a defined number of lines at a time instead of trying to read the entire file in one go. It would look like this:

>>> df = pd.read_csv(filepath, chunksize=1, header=None, encoding='utf-8')

In the above example the file will be read line by line.

In fact, according to the documentation of pandas.read_csv, what is returned here is not a pandas.DataFrame object but a TextFileReader object:

chunksize : int, default None
Return TextFileReader object for iteration. See IO Tools docs for more information on iterator and chunksize.
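
To make the quoted behaviour concrete, here is a minimal sketch showing that the object returned with chunksize set is an iterator whose items are ordinary DataFrames. The sample text, the temporary file, and the names chunk_types and chunk_shapes are made up for illustration; the lines contain no commas, so each one ends up in a single column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical three-line sample file (no commas, so one column per row).
sample = "first line\nsecond line\nthird line\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# With chunksize set, read_csv returns an iterable reader, not a DataFrame.
reader = pd.read_csv(path, header=None, chunksize=2)

chunk_types = []
chunk_shapes = []
for chunk in reader:
    # Each item produced by the reader is a DataFrame of up to 2 rows.
    chunk_types.append(type(chunk).__name__)
    chunk_shapes.append(chunk.shape)

reader.close()
os.remove(path)

print(chunk_types)
print(chunk_shapes)
```

With three lines and chunksize=2, the reader should yield one two-row chunk followed by one single-row chunk.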

Therefore, to complete the exercise, you need to iterate over it in a loop like this:

In [385]: cat data_sample.tsv
This is a new line
This is another line of text
And this is the last line of text in this file

In [386]: lines = []

In [387]: for line in pd.read_csv('./data_sample.tsv', encoding='utf-8', header=None, chunksize=1):
   .....:     lines.append(line.iloc[0, 0])
   .....:

In [388]: print(lines)
['This is a new line', 'This is another line of text', 'And this is the last line of text in this file']
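
A possible variation on the loop above, offered as a sketch rather than a definitive implementation: chunksize=1 builds one DataFrame per line, so a larger chunk size combined with a generator may be cheaper while still keeping only one chunk in memory. The helper name iter_lines, the default chunk size of 1000, and the temporary sample file are assumptions for illustration. It also assumes the lines contain no commas (otherwise pandas would split them into several columns) and uses dtype=str so numeric-looking lines stay as text.

```python
import os
import tempfile

import pandas as pd


def iter_lines(path, chunksize=1000):
    """Yield each line of a comma-free text file as a string, chunk by chunk."""
    reader = pd.read_csv(
        path, header=None, dtype=str, encoding="utf-8", chunksize=chunksize
    )
    for chunk in reader:
        # Each chunk is a DataFrame; column 0 holds the text of each line.
        for value in chunk.iloc[:, 0]:
            yield value


# Hypothetical sample file mirroring data_sample.tsv from the session above.
sample = (
    "This is a new line\n"
    "This is another line of text\n"
    "And this is the last line of text in this file\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# Collecting into a list is only for demonstration; on a very large file,
# process each yielded line inside the loop instead.
collected = list(iter_lines(sample_path, chunksize=2))
os.remove(sample_path)
print(collected)
```

Note that pandas skips blank lines by default, so the result is not guaranteed to match readlines() on files that contain them.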

I hope this helps!
