pandas read_csv and keep only certain rows (Python)

Question

Asked by dleal

I am aware of the skiprows argument, which lets you pass a list of the row indices to skip. However, what I have is the indices of the rows I want to keep.

Say that my CSV file looks like this for millions of rows:

The only indices I would like to load are 2 and 3, so

index_list = [2,3]

The input for the skiprows argument would be [0,1,4]. However, I only have [2,3] available.

I am trying something like:

pd.read_csv(path, skiprows = ~index_list)

but no luck. Any suggestions?

Thanks, I appreciate all the help.

Answer 1

I think you would need to find the number of lines first, like this.

with open(path) as f:
    num_lines = sum(1 for _ in f)

Then you would build the list of rows to skip, which is every line number that is not in index_list:

to_exclude = [i for i in range(num_lines) if i not in index_list]

and then load your data:

pd.read_csv(path, skiprows = to_exclude)
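
Putting the three steps together, here is a minimal sketch of how this might look end to end. The helper name read_selected_lines and the sample data are made up for illustration. One thing to watch: skiprows counts 0-based line numbers in the file, so line 0 is the header and has to be kept explicitly if you want the column names.

```python
import os
import tempfile

import pandas as pd


def read_selected_lines(path, lines_to_keep):
    """Read only the given 0-based file lines by skipping all the others.

    Hypothetical helper, not part of pandas.
    """
    keep = set(lines_to_keep)  # set lookups stay fast for millions of rows
    with open(path) as f:
        num_lines = sum(1 for _ in f)
    # Every line number that is not wanted goes into the skip list.
    to_exclude = [i for i in range(num_lines) if i not in keep]
    return pd.read_csv(path, skiprows=to_exclude)


# Hypothetical sample file: a header line followed by five data lines.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n0,10\n1,11\n2,12\n3,13\n4,14\n")
    sample_path = tmp.name

# Line 0 is the header, so keep it along with file lines 3 and 4.
df_listed = read_selected_lines(sample_path, [0, 3, 4])
os.remove(sample_path)
```

This reads the file twice (once to count lines, once to parse), which may be noticeable on very large files.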

Answer 2

You can pass a lambda function as the skiprows argument. For example:

rows_to_keep = [2,3]
pd.read_csv(path, skiprows = lambda x: x not in rows_to_keep)
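
For a self-contained way to try this out, here is a small sketch that reads from an in-memory string instead of a file. The sample data and the name lines_to_keep are assumptions for illustration. The callable receives 0-based line numbers of the file, so the header line (0) is included in the set to keep the column names.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the file on disk.
csv_text = "a,b\n0,10\n1,11\n2,12\n3,13\n4,14\n"

# 0-based file line numbers to keep; line 0 is the header.
lines_to_keep = {0, 3, 4}

# The callable returns True for every line that should be skipped.
df_kept = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i not in lines_to_keep,
)
```

Using a set rather than a list for lines_to_keep keeps each membership check cheap, since the callable is evaluated once per line.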

You can read more about it in the pandas read_csv documentation.