Original URL: http://stackoverflow.com/questions/47586614/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to read a specific line number in a csv with pandas
Asked by Guido Muscioni
I have a huge dataset and I am trying to read it line by line. For now, I am reading the dataset using pandas:
df = pd.read_csv("mydata.csv", sep=",", nrows=1)
This function allows me to read only the first line, but how can I read the second, the third one and so on? (I would like to use pandas.)
EDIT: To make it clearer, I need to read one line at a time because the dataset is 20 GB and I cannot keep all of it in memory.
Answered by Guido Muscioni
According to the pandas documentation, the read_csv function has a parameter for this:
skiprows
If a list is assigned to this parameter, read_csv skips the lines whose indices appear in the list:
skiprows = [0,1]
This will skip the first and second lines. Thus, a combination of nrows and skiprows makes it possible to read each line in the dataset separately.
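As a rough sketch of how nrows and skiprows can be combined to fetch one specific row: the helper name read_row and its 0-based row numbering are assumptions made for illustration, and the file is assumed to have a header on line 0. Note that a list such as [0, 1] would also skip that header line, so the sketch starts the skipped range at 1 instead.

```python
import pandas as pd


def read_row(path, k):
    """Return data row k (0-based, header not counted) as a one-row DataFrame.

    Hypothetical helper; assumes line 0 of the file is the header.
    """
    # Data row k sits on file line k + 1, so skip file lines 1..k,
    # keep line 0 as the header, and read exactly one row after that.
    return pd.read_csv(path, skiprows=range(1, k + 1), nrows=1)
```

For example, read_row("mydata.csv", 2) would return the third data row. Each call still has to scan the file from the beginning up to the requested line, so calling it once per row on a 20 GB file is likely to be slow.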
Answered by Davidvs
One way could be to read your file part by part and store each part, for example:
df1 = pd.read_csv("mydata.csv", nrows=10000)
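The next call skips the first 10000 data rows, which you already read and stored in df1, and stores the next 10000 rows in df2. Assuming the file has a header line, passing a range that starts at 1 keeps line 0 (the header), so every part gets the same column names.
df2 = pd.read_csv("mydata.csv", skiprows=range(1, 10001), nrows=10000)
# In general, for the n-th part (n = 1, 2, 3, ...):
dfn = pd.read_csv("mydata.csv", skiprows=range(1, (n - 1) * 10000 + 1), nrows=10000)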
Maybe there is a way to introduce this idea into a for or while loop.
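One possible way to put this into a loop is the chunksize parameter of read_csv, which returns an iterator of DataFrames instead of a single DataFrame. This swaps the manual skiprows arithmetic for pandas' own chunked reader. The function name iter_chunks and the default of 10000 rows are assumptions for this sketch.

```python
import pandas as pd


def iter_chunks(path, rows_per_chunk=10000):
    """Yield successive DataFrames of at most rows_per_chunk rows each.

    Hypothetical helper; the chunk size is an illustrative default.
    """
    # With chunksize set, read_csv returns an iterator, so only one
    # chunk needs to be held in memory at a time.
    for chunk in pd.read_csv(path, chunksize=rows_per_chunk):
        yield chunk
```

Setting rows_per_chunk=1 would give one row per iteration, which is what the question asks for, although larger chunks are usually much faster. The file is read in a single pass, so earlier rows are not rescanned for each chunk.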
Answered by Aymen
You are using nrows = 1, which means "Number of rows of file to read. Useful for reading pieces of large files".
So you are telling it to read only the first row and stop.
You should just remove the argument to read the whole CSV file into a DataFrame and then go through it line by line.
See the documentation for more details on usage: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
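A minimal sketch of this approach, assuming the file is small enough to fit in memory (which the question says is not the case for the 20 GB dataset). The function name iter_rows is hypothetical.

```python
import pandas as pd


def iter_rows(path):
    """Load the whole CSV, then yield its rows one at a time.

    Hypothetical helper; only suitable when the file fits in memory.
    """
    # No nrows argument, so the entire file is read into one DataFrame.
    df = pd.read_csv(path)
    # itertuples yields one namedtuple per row; index=False drops the index.
    for row in df.itertuples(index=False):
        yield row
```

Each yielded row exposes the columns as attributes, for example row.a for a column named a, provided the column names are valid Python identifiers.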