Pandas equivalent of Python's readlines function
Source: http://stackoverflow.com/questions/36020690/
Note: this page reproduces a popular Stack Overflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the source URL and author information, and attribute it to the original authors (not me): Stack Overflow
Asked by kilojoules
With Python's readlines() function I can retrieve a list of each line in a file:
with open('dat.csv', 'r') as dat:
    lines = dat.readlines()
I am working on a problem involving a very large file, and this method produces a memory error. Is there a pandas equivalent to Python's readlines() function? The chunksize option of pd.read_csv() seems to append numbers to my lines, which is far from ideal.
Minimal example:
In [1]: lines = []
In [2]: for df in pd.read_csv('s.csv', chunksize=100):
   ...:     lines.append(df)
   ...:
In [3]: lines
Out[3]:
[   hello here is a line
0  here is another line
1  here is my last line]
In [4]: with open('s.csv', 'r') as dat:
   ...:     lines = dat.readlines()
   ...:
In [5]: lines
Out[5]: ['hello here is a line\n', 'here is another line\n', 'here is my last line\n']
In [6]: cat s.csv
hello here is a line
here is another line
here is my last line
Answered by Thanos
You should try the chunksize option of pd.read_csv(), as mentioned in some of the comments.
This forces pd.read_csv() to read a defined number of lines at a time, instead of trying to read the entire file in one go. It would look like this:
>> df = pd.read_csv(filepath, chunksize=1, header=None, encoding='utf-8')
In the above example the file will be read line by line.
Now, in fact, according to the documentation of pandas.read_csv, it is not a pandas.DataFrame object that is being returned here, but a TextFileReader object instead.
- chunksize : int, default None
Return TextFileReader object for iteration. See IO Tools docs for more information on iterator and chunksize.
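To make the quoted documentation concrete, here is a minimal sketch (not part of the original answer) of what read_csv returns when chunksize is set. The in-memory sample and the variable names are assumptions for illustration only. It also suggests why the question's output showed extra numbers: each chunk is a DataFrame, so its printed form includes the row index, and without header=None the first line is consumed as the column header.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for a text file on disk.
sample = io.StringIO("first line\nsecond line\nthird line\n")

# With chunksize set, read_csv returns an iterator rather than a DataFrame.
reader = pd.read_csv(sample, header=None, chunksize=2)
reader_type = type(reader).__name__

chunk_shapes = []
for chunk in reader:
    # Each chunk is an ordinary DataFrame holding at most `chunksize` rows.
    chunk_shapes.append(chunk.shape)

print(reader_type)
print(chunk_shapes)
```

Three input lines with chunksize=2 give two chunks, the second one shorter than the first.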
Therefore, in order to complete the exercise, you would need to put this in a loop like this:
In [385]: cat data_sample.tsv
This is a new line
This is another line of text
And this is the last line of text in this file
In [386]: lines = []
In [387]: for line in pd.read_csv('./data_sample.tsv', encoding='utf-8', header=None, chunksize=1):
   .....:     lines.append(line.iloc[0, 0])
   .....:
In [388]: print(lines)
['This is a new line', 'This is another line of text', 'And this is the last line of text in this file']
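Reading one row per chunk works, but it builds a DataFrame for every single line, which is likely to be slow on a very large file. The following is a sketch under stated assumptions, not the original answer's method: it reads larger chunks and yields the lines one at a time. The helper name iter_lines, the chunk size of 1000, and the "\x01" separator are all hypothetical choices. The separator is assumed never to occur in the data, so that lines containing commas are not split into several columns. The reader is used as a context manager, which needs pandas 1.2 or later.

```python
import csv
import os
import tempfile

import pandas as pd


def iter_lines(path, chunksize=1000, sep="\x01"):
    """Yield each line of `path` as a string, reading `chunksize` rows at a time.

    `sep` is an assumed placeholder separator that must not occur in the file.
    """
    with pd.read_csv(
        path,
        header=None,             # do not treat the first line as a header
        sep=sep,                 # keep every line in a single column
        quoting=csv.QUOTE_NONE,  # leave quote characters untouched
        dtype=str,               # do not convert "42" into a number
        keep_default_na=False,   # do not turn strings such as "NA" into NaN
        encoding="utf-8",
        chunksize=chunksize,
    ) as reader:
        for chunk in reader:
            # Column 0 of each chunk holds the raw lines of that chunk.
            yield from chunk.iloc[:, 0]


# Usage on a small hypothetical sample file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write('hello, here is a line\nhere is "another" line\n42\n')
    lines = list(iter_lines(path, chunksize=2))

print(lines)
```

Only one chunk is held in memory at a time, so memory use is bounded by chunksize rather than by the size of the file. Note that read_csv skips blank lines by default, and unlike readlines() the returned strings carry no trailing newline.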
I hope this helps!