pandas: How to read a specific line number in a CSV file with pandas

Question

Asked by Guido Muscioni

I have a huge dataset and I am trying to read it line by line. For now, I am reading the dataset using pandas:

import pandas as pd
df = pd.read_csv("mydata.csv", sep=",", nrows=1)

This function allows me to read only the first line, but how can I read the second, the third one and so on? (I would like to use pandas.)

EDIT: To make it more clear, I need to read one line at a time as the dataset is 20 GB and I cannot keep all the stuff in memory.

Answer 1

Answered by Guido Muscioni

According to the pandas documentation, the read_csv function has a parameter for this:

skiprows

If a list is assigned to this parameter, read_csv skips the lines whose (0-based) indices appear in the list:

skiprows = [0,1]

This skips the first and the second line of the file. Thus a combination of nrows and skiprows allows you to read each line in the dataset separately.
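
As a rough sketch of how the two parameters might be combined to fetch a single row: the helper name read_row and the in-memory sample data are made up for illustration, and the sketch assumes the file has a header on line 0 (so data row i sits on file line i + 1). Starting the skipped range at 1 keeps the header, so the column names survive.

```python
import io
import pandas as pd

def read_row(source, row_index):
    """Return data row `row_index` (0-based, header not counted) as a one-row DataFrame."""
    # File line 0 is the header, so data row i lives on file line i + 1.
    # Skip file lines 1..row_index, keep the header, then read exactly one row.
    return pd.read_csv(source, skiprows=range(1, row_index + 1), nrows=1)

# Tiny in-memory stand-in for "mydata.csv" (hypothetical data)
sample = "a,b\n1,2\n3,4\n5,6\n"
first = read_row(io.StringIO(sample), 0)
third = read_row(io.StringIO(sample), 2)
print(third)
```

Keep in mind that every call has to scan the file from the beginning up to the requested line, so fetching rows one by one this way gets slower the further into the file you go.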

Answer 2

Answered by Davidvs

One way could be to read your file part by part and store each part, for example:

df1 = pd.read_csv("mydata.csv", nrows=10000)

The next call skips the first 10000 rows, which you already read and stored in df1, and stores the next 10000 rows in df2.

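# range(1, ...) keeps file line 0 (the header) and skips only the data rows already read
df2 = pd.read_csv("mydata.csv", skiprows=range(1, 10001), nrows=10000)
# General form for the n-th block (n = 1, 2, 3, ...)
dfn = pd.read_csv("mydata.csv", skiprows=range(1, (n - 1) * 10000 + 1), nrows=10000)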

Maybe there is a way to introduce this idea into a for or while loop.
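
One possible way to get such a loop, as a sketch rather than the approach described above: instead of repeating skiprows/nrows calls, read_csv also accepts a chunksize parameter, in which case it returns an iterator that yields one DataFrame per block. The helper name iter_chunks and the sample data are assumptions for illustration.

```python
import io
import pandas as pd

def iter_chunks(source, rows_per_chunk=10000):
    """Yield successive DataFrames of at most `rows_per_chunk` rows each."""
    # With chunksize set, read_csv returns an iterator instead of one DataFrame,
    # so only the current chunk needs to be held in memory.
    for chunk in pd.read_csv(source, chunksize=rows_per_chunk):
        yield chunk

# Tiny in-memory stand-in for "mydata.csv" (hypothetical data)
sample = "a,b\n1,2\n3,4\n5,6\n"
chunk_sizes = [len(chunk) for chunk in iter_chunks(io.StringIO(sample), rows_per_chunk=2)]
total = sum(chunk["a"].sum() for chunk in iter_chunks(io.StringIO(sample), rows_per_chunk=2))
print(chunk_sizes, total)
```

Setting rows_per_chunk=1 would hand back one row at a time, which matches the question literally, though larger chunks are usually much faster because the file is parsed in a single pass with less per-row overhead.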

Answer 3

Answered by Aymen

You are using nrows = 1, which means "Number of rows of file to read. Useful for reading pieces of large files".

So you are telling it to read only the first row and stop.

You should just remove the argument so that the whole CSV file is read into a DataFrame, and then go through it line by line.
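
A minimal sketch of that suggestion, assuming the file is small enough to fit in memory (which may not hold for the 20 GB dataset in the question). The sample data is hypothetical, and itertuples is just one of several ways to walk over the rows.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for "mydata.csv" (hypothetical data)
sample = "a,b\n1,2\n3,4\n"
df = pd.read_csv(io.StringIO(sample))  # no nrows: the whole file is loaded

# itertuples yields one namedtuple per row; index=False leaves out the row label.
rows = [(row.a, row.b) for row in df.itertuples(index=False)]
print(rows)
```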

See the documentation for more details on usage: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
