Python Pandas read_csv skip rows but keep header
Original question: http://stackoverflow.com/questions/27325652/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow and keep the same license.
Asked by mcd
I'm having trouble figuring out how to skip n rows in a CSV file but keep the header, which is the first row.
What I want to do is iterate over the file but keep the header from the first row. skiprows makes the header the first row after the skipped rows. What is the best way of doing this?
data = pd.read_csv('test.csv', sep='|', header=0, skiprows=10, nrows=10)
Accepted answer by Alex Riley
You can pass a list of row numbers to skiprows instead of an integer.
By giving the function the integer 10, you're just skipping the first 10 lines, so the header is taken from the first line after them.
To keep row 0 (as the header) and skip rows 1 through 9, so that the data starts at row 10, you can write:
pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
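To make the difference concrete, here is a minimal sketch that compares the integer form with the range form. It assumes a hypothetical in-memory stand-in for test.csv (the column names id and value are made up for illustration), so it runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for test.csv: a header line plus 20 data rows, '|'-separated.
# Data line i holds the id i, so line numbers and ids match.
lines = ["id|value"] + [f"{i}|{i * i}" for i in range(1, 21)]
text = "\n".join(lines)

# An integer skips the first 10 lines (0-9), so line 10 ends up as the header.
wrong = pd.read_csv(io.StringIO(text), sep="|", header=0, skiprows=10, nrows=10)

# A range skips lines 1-9 only, so line 0 stays the header and data starts at line 10.
right = pd.read_csv(io.StringIO(text), sep="|", skiprows=range(1, 10), nrows=10)

print(list(wrong.columns))   # header taken from a data line
print(list(right.columns))   # header taken from line 0
print(right["id"].tolist())  # ids of the 10 rows that were read
```

With the integer form the column names come from a data line, while the range form keeps the original names.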
Other ways to skip rows using read_csv
The two main ways to control which rows read_csv uses are the header and skiprows parameters.
Suppose we have the following CSV file with one column:
a
b
c
d
e
f
In each of the examples below, this file is f = io.StringIO("\n".join("abcdef")), re-created before each call because read_csv consumes the buffer.
Read all lines as values (no header, column names default to integers):
>>> pd.read_csv(f, header=None)
   0
0  a
1  b
2  c
3  d
4  e
5  f
Use a particular row as the header (skip all lines before that):
>>> pd.read_csv(f, header=3)
   d
0  e
1  f
Use multiple rows as the header, creating a MultiIndex (skip all lines before the last specified header line):
>>> pd.read_csv(f, header=[2, 4])
   c
   e
0  f
Skip N rows from the start of the file (the first row that's not skipped is the header):
>>> pd.read_csv(f, skiprows=3)
   d
0  e
1  f
Skip one or more rows by giving the row indices (the first row that's not skipped is the header):
>>> pd.read_csv(f, skiprows=[2, 4])
   a
0  b
1  d
2  f
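The single-header examples above can be reproduced as a script. This is a sketch only: make_file is a hypothetical helper introduced here so that every call gets a fresh buffer, and the MultiIndex case is left out to keep it short.

```python
import io
import pandas as pd

def make_file():
    # Hypothetical helper: build a fresh in-memory file for every call,
    # because read_csv reads the buffer to the end.
    return io.StringIO("\n".join("abcdef"))

no_header = pd.read_csv(make_file(), header=None)          # every line is data
header_row_3 = pd.read_csv(make_file(), header=3)          # line 3 ('d') is the header
skip_first_3 = pd.read_csv(make_file(), skiprows=3)        # lines 0-2 skipped
skip_2_and_4 = pd.read_csv(make_file(), skiprows=[2, 4])   # lines 2 and 4 skipped

print(no_header[0].tolist())
print(list(header_row_3.columns), header_row_3["d"].tolist())
print(list(skip_first_3.columns), skip_first_3["d"].tolist())
print(list(skip_2_and_4.columns), skip_2_and_4["a"].tolist())
```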
Answer by Prateek Khatri
To expand on @AlexRiley's answer, the skiprows argument takes a list of numbers that determines which rows to skip. So:
pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
is the same as:
pd.read_csv('test.csv', sep='|', skiprows=[1,2,3,4,5,6,7,8,9])
The best way to go about ignoring specific rows would be to create your ignore list (either manually or with a function like range, which produces a sequence of integers) and pass it to skiprows.
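As a quick check of that equivalence, the sketch below reads the same hypothetical in-memory data (made-up columns id and value) with a range, with an explicit list, and with a callable, which skiprows also accepts: the callable receives each row index and returns True for rows to skip.

```python
import io
import pandas as pd

# Hypothetical '|'-separated data: a header line plus 20 data rows.
text = "\n".join(["id|value"] + [f"{i}|{i * i}" for i in range(1, 21)])

with_range = pd.read_csv(io.StringIO(text), sep="|", skiprows=range(1, 10))
with_list = pd.read_csv(io.StringIO(text), sep="|",
                        skiprows=[1, 2, 3, 4, 5, 6, 7, 8, 9])
# Callable form: skip row indices 1 through 9, keep everything else.
with_callable = pd.read_csv(io.StringIO(text), sep="|",
                            skiprows=lambda i: 1 <= i < 10)

print(with_range.equals(with_list))
print(with_range.equals(with_callable))
```

The callable form can be convenient when the rows to skip follow a rule rather than a fixed list.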
Answer by JohnM
If you're iterating through a long CSV file, you can use the chunksize argument. If for some reason you need to step through it manually, you can try the following, as long as you know how many iterations you need:
for i in range(num_iters):
    # Keep row 0 as the header, skip the data rows already read, then read the next 10.
    chunk = pd.read_csv('test.csv', sep='|', header=0,
                        skiprows=range(1, i*10 + 1), nrows=10)
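For the chunksize route mentioned above, a minimal sketch might look like the following. It assumes hypothetical in-memory data with made-up columns id and value; with a real file you would pass the path instead of the StringIO object.

```python
import io
import pandas as pd

# Hypothetical '|'-separated data: a header line plus 25 data rows.
text = "\n".join(["id|value"] + [f"{i}|{i * i}" for i in range(1, 26)])

chunks = []
# With chunksize set, read_csv returns an iterator that yields DataFrames
# of up to 10 rows each, all sharing the header from the first line.
for chunk in pd.read_csv(io.StringIO(text), sep="|", chunksize=10):
    chunks.append(chunk)

print([len(c) for c in chunks])
print([list(c.columns) for c in chunks])
```

This reads the file once, whereas the manual loop reopens and rescans it on every iteration.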
Answer by Zakir
Great answers already. I feel the need to add the generalized form here. Consider this scenario:
Say your xls/csv has junk in the top 2 rows (rows 0 and 1). Row 2 (the 3rd row) is the real header, and you want to load 10 rows starting from row 50 (i.e. the 51st row). Here's the snippet:
pd.read_csv('test.csv', header=2, skiprows=range(3, 50), nrows=10)
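The scenario can be checked with a small in-memory file. This is a sketch under stated assumptions: the junk lines, the column name n, and the data values are all made up, and each data line holds its own line number so the result is easy to read.

```python
import io
import pandas as pd

# Hypothetical file: 2 junk lines, the real header on line 2,
# then data lines 3..99, each holding its own line number.
lines = ["junk", "more junk", "n"] + [str(i) for i in range(3, 100)]
text = "\n".join(lines)

# skiprows removes lines 3-49 first; header=2 then picks the third remaining
# line ('n') as the header, and nrows=10 reads lines 50-59.
df = pd.read_csv(io.StringIO(text), header=2, skiprows=range(3, 50), nrows=10)

print(list(df.columns))
print(df["n"].tolist())
```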
Answer by Lawrence Jacob
If you need to skip specific rows, say the first 3 rows (0, 1, 2) and then 2 more rows (4, 5), you can use the following, which keeps row 3 as the header:
df = pd.read_csv(file_in, delimiter='\t', skiprows=[0,1,2,4,5], encoding='utf-16', usecols=cols)
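A small sketch of the same call on in-memory data is shown below. The tab-separated content and the column names name, score, and note are hypothetical, and the encoding argument is dropped because the data is already a Python string rather than a UTF-16 file.

```python
import io
import pandas as pd

# Hypothetical tab-separated file: junk on lines 0-2, header on line 3,
# two more unwanted lines (4 and 5), then the real data.
lines = ["junk 0", "junk 1", "junk 2", "name\tscore\tnote",
         "skip 4", "skip 5", "ann\t1\tx", "bob\t2\ty"]
text = "\n".join(lines)

# Line 3 is the first line that is not skipped, so it becomes the header.
df = pd.read_csv(io.StringIO(text), delimiter="\t",
                 skiprows=[0, 1, 2, 4, 5], usecols=["name", "score"])

print(list(df.columns))
print(df["name"].tolist(), df["score"].tolist())
```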