Python Pandas read_csv skip rows but keep header

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/27325652/

Time: 2020-08-19 01:39:58  Source: igfitidea

Tags: python, csv, pandas

Asked by mcd

I'm having trouble figuring out how to skip n rows in a csv file but keep the header, which is the first row.

What I want to do is iterate over the file but keep the header from the first row. skiprows makes the header the first row after the skipped rows. What is the best way of doing this?

data = pd.read_csv('test.csv', sep='|', header=0, skiprows=10, nrows=10)

Accepted answer by Alex Riley

You can pass a list of row numbers to skiprows instead of an integer.

By giving the function the integer 10, you're just skipping the first 10 lines of the file, including the header line.

To keep row 0 (as the header) and skip rows 1 through 9, so that the data starts at row 10, you can write:

pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
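
As a quick check of the call above, here is a minimal sketch that uses an in-memory buffer in place of test.csv. The column names id and value and the 30 generated data lines are assumptions made up for illustration; only the read_csv arguments come from the question and answer.

```python
import io
import pandas as pd

# Hypothetical stand-in for test.csv: one header line plus 30 data lines,
# arranged so that the data line with id == n sits on file line n.
lines = ["id|value"] + [f"{i}|{i * 10}" for i in range(1, 31)]
csv_text = "\n".join(lines)

# Keep file line 0 as the header, skip file lines 1..9, then read 10 rows.
data = pd.read_csv(io.StringIO(csv_text), sep="|", header=0,
                   skiprows=range(1, 10), nrows=10)

print(list(data.columns))
print(data["id"].tolist())
```

The header still comes from the first line of the file, and the first row of data is file line 10.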


Other ways to skip rows using read_csv

The two main ways to control which rows read_csv uses are the header and skiprows parameters.

Suppose we have the following CSV file with one column:

a
b
c
d
e
f

In each of the examples below, this file is f = io.StringIO("\n".join("abcdef")), re-created before each call because read_csv consumes the buffer.

  • Read all lines as values (no header; column names default to integers):

    >>> pd.read_csv(f, header=None)
       0
    0  a
    1  b
    2  c
    3  d
    4  e
    5  f
    
  • Use a particular row as the header (skip all lines before that):

    >>> pd.read_csv(f, header=3)
       d
    0  e
    1  f
    
  • Use multiple rows as the header, creating a MultiIndex (skip all lines before the last specified header line):

    >>> pd.read_csv(f, header=[2, 4])
       c
       e
    0  f
    
  • Skip N rows from the start of the file (the first row that's not skipped is the header):

    >>> pd.read_csv(f, skiprows=3)
       d
    0  e
    1  f
    
  • Skip one or more rows by giving the row indices (the first row that's not skipped is the header):

    >>> pd.read_csv(f, skiprows=[2, 4])
       a
    0  b
    1  d
    2  f
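
The examples in the list above can be reproduced with a short script. This is a sketch only: the helper make_file is a hypothetical name introduced here, and the callable form of skiprows on the last line is not part of the original answer. It is a pandas feature (a function that receives each row index and returns True for rows to skip), shown as an equivalent of the list form.

```python
import io
import pandas as pd

def make_file():
    # Hypothetical helper: build a fresh buffer for every call,
    # since read_csv consumes the buffer it is given.
    return io.StringIO("\n".join("abcdef"))

# header=3 and skiprows=3 both make line 3 ("d") the header here.
by_header = pd.read_csv(make_file(), header=3)
by_skip = pd.read_csv(make_file(), skiprows=3)

# Skipping lines 2 and 4 by index, as a list and as a callable.
by_list = pd.read_csv(make_file(), skiprows=[2, 4])
by_func = pd.read_csv(make_file(), skiprows=lambda i: i in (2, 4))

print(by_header.equals(by_skip))
print(by_list.equals(by_func))
```

For this one-column file, header=3 and skiprows=3 give the same frame; they differ in intent (naming the header line versus dropping leading lines).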
    

Answer by Prateek Khatri

To expand on @AlexRiley's answer, the skiprows argument takes a list of numbers that determines which rows to skip. So:

pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))

is the same as:

pd.read_csv('test.csv', sep='|', skiprows=[1,2,3,4,5,6,7,8,9])

The best way to go about ignoring specific rows would be to create your ignore list (either manually or with a function like range that produces a sequence of integers) and pass it to skiprows.

Answer by JohnM

If you're iterating through a long csv file, you can use the chunksize argument. If for some reason you need to step through it manually, you can try the following, as long as you know how many iterations you need to go through:

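import pandas as pd

for i in range(num_iters):
    # Keep line 0 as the header, skip the i*10 data rows already read, then read the next 10.
    chunk = pd.read_csv('test.csv', sep='|', header=0,
                        skiprows=range(1, i*10 + 1), nrows=10)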
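
The chunksize route mentioned above might look like the following sketch. The in-memory data (columns id and value, 25 rows) is an assumption made up for illustration. With chunksize set, read_csv returns an iterator of DataFrames, and every chunk carries the header from the first line.

```python
import io
import pandas as pd

# Hypothetical stand-in for a long csv file: a header plus 25 data lines.
lines = ["id|value"] + [f"{i}|{i * 10}" for i in range(1, 26)]
buffer = io.StringIO("\n".join(lines))

chunks = []
# Each iteration yields a DataFrame of up to 10 rows with the same columns.
for chunk in pd.read_csv(buffer, sep="|", chunksize=10):
    chunks.append(chunk)

sizes = [len(c) for c in chunks]
print(sizes)
print(list(chunks[1].columns))
```

This reads the file once, and you do not need to know the number of iterations in advance.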

Answer by Zakir

Great answers already. I somehow feel the need to add the generalized form here. Consider this scenario:

Say your xls/csv has junk in the top 2 rows (rows #0 and #1). Row #2 (the 3rd row) is the real header, and you want to load 10 rows starting from row #50 (i.e. the 51st row). Here's the snippet:

pd.read_csv('test.csv', header=2, skiprows=range(3, 50), nrows=10)
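
To see this on concrete data, here is a minimal sketch with a generated file. The junk lines, the column names id and value, and the row count are all assumptions for illustration. The data is arranged so that the row with id == n sits on file line n.

```python
import io
import pandas as pd

# Hypothetical file: two junk lines, the real header on line 2,
# then data lines 3..69 where the id equals the file line number.
lines = ["junk,junk", "junk,junk", "id,value"]
lines += [f"{n},{n * 2}" for n in range(3, 70)]
buffer = io.StringIO("\n".join(lines))

# header=2 picks line 2 as the header (lines 0 and 1 are discarded),
# skiprows drops lines 3..49, and nrows limits the result to 10 rows.
df = pd.read_csv(buffer, header=2, skiprows=range(3, 50), nrows=10)

print(list(df.columns))
print(df["id"].tolist())
```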

Answer by Lawrence Jacob

If you need to skip/drop specific rows, say the first 3 rows (i.e. 0, 1, 2) and then 2 more rows (i.e. 4, 5), you can use the following to retain the header row, which here is row 3, the first row that is not skipped:

df = pd.read_csv(file_in, delimiter='\t', skiprows=[0,1,2,4,5], encoding='utf-16', usecols=cols)
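
A small in-memory sketch of the same pattern follows. The file contents and the column names name, score and note are invented for illustration, and the encoding argument is left out because the data is already a Python string rather than a UTF-16 file.

```python
import io
import pandas as pd

# Hypothetical tab-separated file: lines 0-2 are junk, line 3 is the header,
# lines 4-5 are extra non-data lines, and the data starts on line 6.
rows = [
    "junk", "junk", "junk",
    "name\tscore\tnote",
    "units\t-\t-",
    "junk\tjunk\tjunk",
    "ann\t1\tx",
    "bob\t2\ty",
]
buffer = io.StringIO("\n".join(rows))

# Line 3 is the first line that is not skipped, so it becomes the header.
df = pd.read_csv(buffer, delimiter="\t", skiprows=[0, 1, 2, 4, 5],
                 usecols=["name", "score"])

print(df)
```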