pandas: IndexError when trying to read a table with pandas

Question

Asked by Weston

Update: This is a duplicate of "usecols with parse_dates and names", but this question was answered first.

I can't get this code to work for the life of me. As soon as I take out the names parameter it works fine, but that is just silly.

From a space delimited file I want to:

skip the header section
import selected columns
name the columns
parse two columns as a date
use parsed date as index

This almost works:

import pandas as pd
columns = [4, 5, 10, 11, 15, 16, 17, 26, 28, 29]
names = ["DATE","TIME","DLAT", "DLON", "SLAT", "SLON", "SHGT", "HGT", "N", "E"]
ppp_data = pd.read_table(
    filename,
    delim_whitespace=True, # space delimited
    skiprows=8, # skip header rows
    header=None, # don't use first row as column names
    usecols=columns, # only use selected columns
    names=names, # use names for selected columns
    parse_dates=[[4,5]], # join date and time columns and parse as date
    index_col=0, # use parsed date (now column 0) as index
)
print ppp_data

But here is the stack trace I'm getting:

Traceback (most recent call last):
  File "plot_squat_test_pandas.py", line 30, in <module>
    index_col=0,
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 205, in _read
    return parser.read()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas/parser.c:6745)
  File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:6964)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows     (pandas/parser.c:7780)
  File "parser.pyx", line 865, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:8512)
  File "parser.pyx", line 1105, in pandas.parser.TextReader._get_column_name (pandas/parser.c:11684)
IndexError: list index out of range

If I comment out the names=names parameter, it works fine:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 86281 entries, 2013-10-30 00:00:00 to 2013-10-30 23:59:59
Data columns (total 8 columns):
10    86281  non-null values
11    86281  non-null values
15    86281  non-null values
16    86281  non-null values
17    86281  non-null values
26    86281  non-null values
28    86281  non-null values
29    86281  non-null values

What am I missing? Or is this an issue with pandas, and I should go file a bug report?

I'm using Python 2.7.3. The stack trace above is from the stable pandas release 0.12.0. I've also tried the development version 0.13.0rc1-119-g2485e09 and got the same result (with different line numbers).

Answer 1

Accepted answer by Weston

This is a bug in versions of pandas up to and including the current development version 0.13.0rc1-119-g2485e09. There are two workarounds.

Workaround 1

Including the last column of the table in both usecols and names will suppress the IndexError:

from StringIO import StringIO
import pandas as pd

data = """2013-10-11 11:53:49,1,2,3,4
2013-10-11 11:53:50,1,2,3,4
2013-10-11 11:53:51,1,2,3,4"""

df = pd.read_csv(
    StringIO(data),
    header=None,
    usecols=[0,2,4],
    names=["DATE","COL2","COL4"],
    parse_dates=["DATE"],
    index_col=0,
)
print df
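
The snippet above is written for Python 2. A Python 3 port might look like the sketch below. It assumes a pandas release in which names may list only the selected columns, and it refers to the index column by label (index_col="DATE") rather than by position, which is a choice made here and not part of the original answer.

```python
import io

import pandas as pd

data = """2013-10-11 11:53:49,1,2,3,4
2013-10-11 11:53:50,1,2,3,4
2013-10-11 11:53:51,1,2,3,4"""

df = pd.read_csv(
    io.StringIO(data),
    header=None,                     # the data has no header row
    usecols=[0, 2, 4],               # includes the last column of the table
    names=["DATE", "COL2", "COL4"],  # one name per selected column
    parse_dates=["DATE"],            # parse the DATE column as datetimes
    index_col="DATE",                # use the parsed dates as the index
)
print(df)
```

Whether the IndexError from the question still appears depends on the installed pandas version; the sketch only shows the shape of the call.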

Workaround 2

Alternatively, you can rename the columns after the fact, as in this question:

ppp_data.rename(columns=dict(zip(columns[2:], names[2:])), inplace=True)
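
As a self-contained illustration of the rename approach, here is a minimal sketch that reuses the sample data from Workaround 1. The variables selected and labels are hypothetical names introduced for this example. The file is read without names, so the remaining columns keep their integer positions as labels and are renamed afterwards.

```python
import io

import pandas as pd

data = """2013-10-11 11:53:49,1,2,3,4
2013-10-11 11:53:50,1,2,3,4
2013-10-11 11:53:51,1,2,3,4"""

selected = [0, 2, 4]               # positions of the columns to keep
labels = ["DATE", "COL2", "COL4"]  # one label per selected column

df = pd.read_csv(
    io.StringIO(data),
    header=None,       # columns are labelled by their integer position
    usecols=selected,
    parse_dates=[0],   # parse column 0 as datetimes
    index_col=0,       # and use it as the index
)

# Column 0 became the index, so only the remaining columns are renamed.
df = df.rename(columns=dict(zip(selected[1:], labels[1:])))
df.index.name = labels[0]
print(df)
```

Because the index column is no longer among the data columns, the first entry of each list is skipped when building the rename mapping.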

Answer 2

Answer by unutbu

names has 10 elements:

In [1]: len(["DATE","TIME","DLAT", "DLON", "SLAT", "SLON", "SHGT", "HGT", "N", "E"])
Out[1]: 10

But when you omit the names parameter, read_table parses only 8 columns:

Data columns (total 8 columns):

Therefore, if the desired DataFrame has 8 columns and a single index, then names may have 9 (or 8) elements.

Note that

parse_dates=[[4,5]],

combines columns 4 and 5 into one column. So even though the raw data has 10 columns, what remains is 8 columns and an index. If you give names 9 elements, the first element is used to name the index.
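
The column arithmetic in this answer can be checked with a few lines of plain Python. This is only a counting sketch using the usecols list from the question. It does not call pandas, and the variable names are introduced here for illustration.

```python
columns = [4, 5, 10, 11, 15, 16, 17, 26, 28, 29]  # usecols from the question
date_parts = [4, 5]                               # merged by parse_dates=[[4, 5]]

# The two date parts collapse into a single combined column.
after_merge = len(columns) - len(date_parts) + 1

# index_col=0 then moves that combined column into the index.
data_columns = after_merge - 1

print(after_merge)   # 9
print(data_columns)  # 8
```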
