pandas read_csv index_col=None not working with delimiters at the end of each line

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/12960574/

Time: 2020-08-18 12:16:24  Source: igfitidea

python, pandas

Asked by Rich

I am going through the 'Python for Data Analysis' book and having trouble in the 'Example: 2012 Federal Election Commission Database' section when reading the data into a DataFrame. The trouble is that one of the columns of data is always set as the index column, even when the index_col argument is set to None.

Here is the link to the data: http://www.fec.gov/disclosurep/PDownload.do

Here is the loading code (to save time in the checking, I set the nrows=10):

import pandas as pd
fec = pd.read_csv('P00000001-ALL.csv',nrows=10,index_col=None)

To keep it short I am excluding the data column outputs, but here is my output (please note the Index values):

In [20]: fec

Out[20]:
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, C00410118 to C00410118
Data columns:
...
dtypes: float64(4), int64(3), object(11)

And here is the book's output (again with data columns excluded):

In [13]: fec = read_csv('P00000001-ALL.csv')
In [14]: fec
Out[14]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1001731 entries, 0 to 1001730
...
dtypes: float64(1), int64(1), object(14)

The Index values in my output are actually the first column of data in the file, which shifts all the rest of the data to the left by one column. Would anyone know how to prevent this column of data from being used as the index? I would like the index to be plain integers increasing by 1.

I am fairly new to Python and pandas, so I apologize for any inconvenience. Thanks.

Accepted answer by craigts

Quick Answer

Use index_col=False instead of index_col=None when there are delimiters at the end of each line. This turns off index column inference and discards the last column.
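
As a minimal sketch of that behaviour, the snippet below reads a small in-memory sample instead of the FEC file. The sample text and the variable names (sample, inferred, fixed) are made up for illustration; only pd.read_csv and its index_col argument come from the answer.

```python
import io

import pandas as pd

# Hypothetical sample: 3 header names, but every data row ends with a comma,
# so each data row has 4 fields (the last one is empty).
sample = "a,b,c\n4,apple,bat,\n8,orange,cow,\n"

# Default behaviour: pandas infers that the first column is the index,
# which shifts the remaining values one column to the left.
inferred = pd.read_csv(io.StringIO(sample))
print(inferred)

# index_col=False disables the inference and drops the empty trailing column.
fixed = pd.read_csv(io.StringIO(sample), index_col=False)
print(fixed)
```

In the first frame the values 4 and 8 end up in the index and column c is empty; in the second frame they stay in column a and the index is the usual 0, 1.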

More Detail

After looking at the data, there is a comma at the end of each line. And this quote (the documentation has been edited since the time this post was created):

index_col: column number, column name, or list of column numbers/names, to use as the index (row labels) of the resulting DataFrame. By default, it will number the rows without using any column, unless there is one more data column than there are headers, in which case the first column is taken as the index.

from the documentation shows that pandas believes you have n headers and n+1 data columns and is treating the first column as the index.
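
One way to check for this mismatch before loading a file is to count the fields per line with the standard csv module. This is only a sketch on a made-up in-memory sample (the names sample, header_len, data_lens and trailing_empty are illustrative); for a real file you would open it and inspect the first few rows instead.

```python
import csv
import io

# Hypothetical sample with a trailing comma on every data line.
sample = "a,b,c\n4,apple,bat,\n8,orange,cow,\n"

rows = list(csv.reader(io.StringIO(sample)))

header_len = len(rows[0])                 # number of header names (n)
data_lens = {len(r) for r in rows[1:]}    # distinct field counts in data rows

# A trailing delimiter shows up as an extra, empty last field.
trailing_empty = all(r[-1] == "" for r in rows[1:])

print(header_len, data_lens, trailing_empty)
```

If the data rows consistently have one more field than the header and that last field is always empty, the file has trailing delimiters rather than a genuine unnamed index column.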



EDIT 10/20/2014 - More information

I found another valuable entry that is specifically about trailing delimiters and how to simply ignore them:

If a file has one more column of data than the number of column names, the first column will be used as the DataFrame's row names: ...

Ordinarily, you can achieve this behavior using the index_col option.

There are some exception cases when a file has been prepared with delimiters at the end of each data line, confusing the parser. To explicitly disable the index column inference and discard the last column, pass index_col=False: ...

Answer by ZaxR

Re: craigts's response: if passing False or None for index_col does not do what you want, for example when you are trying to get rid of a range index, you can instead pass an integer to specify the column you want to use as the index. For example:

df = pd.read_csv('file.csv', index_col=0)

The above will set the first column as the index (and not add a range index in my "common case").

Update

Given the popularity of this answer, I thought I'd add some context and a demo:

# Setting up the dummy data
In [1]: df = pd.DataFrame({"A":[1, 2, 3], "B":[4, 5, 6]})

In [2]: df
Out[2]:
   A  B
0  1  4
1  2  5
2  3  6

In [3]: df.to_csv('file.csv', index=False)
# Contents of file.csv:
A,B
1,4
2,5
3,6

Reading without index_col or with None/False will all result in a range index:

In [4]: pd.read_csv('file.csv')
Out[4]:
   A  B
0  1  4
1  2  5
2  3  6

# Note that this is the default behavior, so the same as In [4]
In [5]: pd.read_csv('file.csv', index_col=None)
Out[5]:
   A  B
0  1  4
1  2  5
2  3  6

In [6]: pd.read_csv('file.csv', index_col=False)
Out[6]:
   A  B
0  1  4
1  2  5
2  3  6
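
The three calls above can also be compared programmatically. The following is a rough sketch that uses an in-memory copy of the same data rather than file.csv; csv_text and the frames dictionary are illustrative names, not part of the original demo.

```python
import io

import pandas as pd

# Same contents as the demo's file.csv, kept in memory for the sketch.
csv_text = "A,B\n1,4\n2,5\n3,6\n"

frames = {
    "default": pd.read_csv(io.StringIO(csv_text)),
    "None": pd.read_csv(io.StringIO(csv_text), index_col=None),
    "False": pd.read_csv(io.StringIO(csv_text), index_col=False),
}

for label, frame in frames.items():
    # Each variant keeps both columns and gets a plain range index.
    print(label, type(frame.index).__name__, list(frame.columns))
```

For a file without trailing delimiters, None and False behave the same; the difference only matters when the data rows have more fields than the header.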

However, if we specify that "A" (the 0th column) is actually the index, we can avoid the range index:

In [7]: pd.read_csv('file.csv', index_col=0)
Out[7]:
   B
A
1  4
2  5
3  6
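
The documentation quoted earlier says index_col also accepts a column name, so the same result should be reachable by name instead of position. A small sketch under that assumption, again using an in-memory copy of the demo data (csv_text, by_position and by_name are illustrative names):

```python
import io

import pandas as pd

# Same contents as the demo's file.csv, kept in memory for the sketch.
csv_text = "A,B\n1,4\n2,5\n3,6\n"

# Select the index column by position ...
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

# ... or by its header name.
by_name = pd.read_csv(io.StringIO(csv_text), index_col="A")

print(by_name)
print(by_position.equals(by_name))
```

Using the name can be easier to read and keeps working if the column order in the file changes, while the integer form works even when you do not know the header in advance.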