Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/15017072/

Date: 2020-08-18 13:07:07  Source: igfitidea

pandas read_csv and filter columns with usecols

Tags: python, pandas, csv, csv-import

Asked by chip

I have a csv file which isn't coming in correctly with pandas.read_csv when I filter the columns with usecols and use multiple indexes.

import pandas as pd
csv = r"""dummy,date,loc,x
   bar,20090101,a,1
   bar,20090102,a,3
   bar,20090103,a,5
   bar,20090101,b,1
   bar,20090102,b,3
   bar,20090103,b,5"""

with open('foo.csv', 'w') as f:
    f.write(csv)

df1 = pd.read_csv('foo.csv',
        header=0,
        names=["dummy", "date", "loc", "x"], 
        index_col=["date", "loc"], 
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"])
print(df1)

# Ignore the dummy columns
df2 = pd.read_csv('foo.csv', 
        index_col=["date", "loc"], 
        usecols=["date", "loc", "x"], # <----------- Changed
        parse_dates=["date"],
        header=0,
        names=["dummy", "date", "loc", "x"])
print(df2)

I expect df1 and df2 to be the same except for the missing dummy column, but the columns come in mislabeled. Also, the date is not getting parsed as a date.

In [118]: %run test.py
               dummy  x
date       loc
2009-01-01 a     bar  1
2009-01-02 a     bar  3
2009-01-03 a     bar  5
2009-01-01 b     bar  1
2009-01-02 b     bar  3
2009-01-03 b     bar  5
              date
date loc
a    1    20090101
     3    20090102
     5    20090103
b    1    20090101
     3    20090102
     5    20090103

Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step, but I'm trying to understand what is going wrong. I'm using pandas 0.10.1.

edit: fixed bad header usage.

Accepted answer by Mack

The answer by @chip completely misses the point of two keyword arguments.

  • names is only necessary when there is no header and you want to specify other arguments using column names rather than integer indices.
  • usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.
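
To illustrate the first point, here is a minimal sketch of the case where names is actually needed: a file with no header row. It assumes Python 3 and a recent pandas; the data and the variable names raw and df_named are made up for illustration and are not from the original question.

```python
import io
import pandas as pd

# Hypothetical data with NO header row, so names= has to supply the labels.
raw = "bar,20090101,a,1\nbar,20090102,a,3\nbar,20090103,b,5\n"

df_named = pd.read_csv(
    io.StringIO(raw),
    header=None,                          # the file has no header line
    names=["dummy", "date", "loc", "x"],  # labels for all four columns
    usecols=["date", "loc", "x"],         # dummy is filtered out while parsing
    index_col=["date", "loc"],
    parse_dates=["date"],
)
print(df_named)
```

When the file already has a header row, as in the question, header=0 alone supplies the column names and names can be left out.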

This solution corrects those oddities:

import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
        header=0,
        index_col=["date", "loc"], 
        usecols=["date", "loc", "x"],
        parse_dates=["date"])

Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5

Answer by Theodros Zelleke

This code achieves what you want, though it is weird and certainly buggy:

I observed that it works when:

a) you specify index_col relative to the number of columns you really use, so it's three columns in this example, not four (you drop dummy and start counting from then onwards)

b) same for parse_dates

c) not so for usecols ;) for obvious reasons

d) here I adapted the names to mirror this behaviour

import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

csv = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

df = pd.read_csv(StringIO(csv),
        index_col=[0,1],
        usecols=[1,2,3], 
        parse_dates=[0],
        header=0,
        names=["date", "loc", "", "x"])

print(df)

which prints

                x
date       loc   
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5

Answer by Mohan

Import csv first and use csv.DictReader; it's easy to process...
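
The answer gives no code, so the following is only a sketch of what it might mean, using the standard-library csv module on the question's data. The names raw, wanted and rows are illustrative assumptions. Note that csv.DictReader leaves every value as a string, so dates and numbers would still need converting by hand.

```python
import csv
import io

raw = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

# DictReader uses the first line as field names and yields one dict per row.
reader = csv.DictReader(io.StringIO(raw))

# Keep only the wanted columns; the dummy column is simply never copied.
wanted = ("date", "loc", "x")
rows = [{key: row[key] for key in wanted} for row in reader]

print(rows[0])
```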

Answer by chip

If your csv file contains extra data, columns can be deleted from the DataFrame after import.

import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
        index_col=["date", "loc"], 
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"],
        header=0,
        names=["dummy", "date", "loc", "x"])
del df['dummy']

Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5
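
For comparison, here is a rough sketch that reads the same data both ways, dropping the column after import and filtering it with usecols during parsing, and then checks whether the two frames match. It assumes Python 3 and a recent pandas (not the 0.10.1 from the question); raw, common, df_dropped and df_filtered are illustrative names.

```python
from io import StringIO
import pandas as pd

raw = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

# Arguments shared by both reads.
common = dict(header=0, index_col=["date", "loc"], parse_dates=["date"])

# Approach 1: read every column, then drop the unwanted one.
df_dropped = pd.read_csv(StringIO(raw), **common).drop(columns="dummy")

# Approach 2: let usecols filter the column out during parsing.
df_filtered = pd.read_csv(StringIO(raw), usecols=["date", "loc", "x"], **common)

# Prints whether the two approaches produced identical frames.
print(df_dropped.equals(df_filtered))
```

The difference is mainly one of cost: with usecols the unwanted column is never loaded into the DataFrame, which matters more for wide or large files.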

Answer by Auday Berro

You just have to add the index_col=False parameter:

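import pandas as pd

# Assumes foo.csv from the question exists in the working directory.
# The original snippet passed index_col twice, which is a SyntaxError;
# only index_col=False is kept here, since that is what the answer proposes.
df1 = pd.read_csv('foo.csv',
     header=0,
     index_col=False,  # do not use any column as the index
     names=["dummy", "date", "loc", "x"],
     usecols=["dummy", "date", "loc", "x"],
     parse_dates=["date"])
print(df1)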