pandas.read_csv: how to skip comment lines
Original question: http://stackoverflow.com/questions/18366797/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me) and keep the same license.
Asked by mathtick
I think I misunderstand the intention of read_csv. If I have a file 'j' like this:
# notes
a,b,c
# more notes
1,2,3
How can I use pandas.read_csv to read this file, skipping any lines commented with '#'? The help says that commenting out whole lines is not supported, but it suggests that an empty line should be returned. Instead I get an error:
df = pandas.read_csv('j', comment='#')
CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3
I'm currently on:
In [15]: pandas.__version__
Out[15]: '0.12.0rc1'
On version '0.12.0-199-g4c8ad82':
In [43]: df = pandas.read_csv('j', comment='#', header=None)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3
Accepted answer by hlin117
I believe that as of recent releases of pandas (version 0.16.0), you can pass the comment='#' parameter to pd.read_csv, and this should skip commented-out lines.
These GitHub issues show that you can do this:
See the documentation on read_csv: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
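As a minimal sketch of what this answer describes, the snippet below parses the question's sample data with comment='#'. It assumes a reasonably recent pandas release in which fully commented lines are skipped, and it uses io.StringIO in place of the file 'j' so that it runs on its own; the variable names are only illustrative.

```python
from io import StringIO

import pandas as pd

# Same content as the file 'j' from the question.
s = "# notes\na,b,c\n# more notes\n1,2,3\n"

# With comment='#', lines that start with '#' are ignored entirely,
# so 'a,b,c' becomes the header and '1,2,3' the only data row.
df = pd.read_csv(StringIO(s), comment="#")
print(df)
```

If your pandas version still raises the tokenizing error shown in the question, one of the workarounds in the other answers may be needed instead.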
Answer by Andy Hayden
One workaround is to specify skiprows to ignore the first few lines:
In [11]: s = '# notes\na,b,c\n# more notes\n1,2,3'
In [12]: pd.read_csv(StringIO(s), sep=',', comment='#', skiprows=1)
Out[12]:
     a   b   c
0  NaN NaN NaN
1    1   2   3
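A related sketch, not part of the original answer: skiprows also accepts a list of 0-indexed line numbers, so when the positions of the comment lines are known in advance they can be skipped without relying on the comment parameter at all. The positions [0, 2] below are specific to this sample data and are an assumption of the example.

```python
from io import StringIO

import pandas as pd

s = "# notes\na,b,c\n# more notes\n1,2,3"

# Skip line 0 ('# notes') and line 2 ('# more notes') by position.
# This only works when the comment positions are known beforehand.
df = pd.read_csv(StringIO(s), sep=",", skiprows=[0, 2])
print(df)
```

The obvious drawback is that the line numbers are hard-coded, so this suits files with a fixed layout rather than arbitrary commented input.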
Otherwise read_csv gets a little confused:
In [13]: pd.read_csv(StringIO(s), sep=',', comment='#')
Out[13]:
        Unnamed: 0
a   b            c
NaN NaN        NaN
1   2            3
This seems to be the case in 0.12.0; I've filed a bug report.
As Viktor points out, you can use dropna to remove the NaN rows after the fact (there is a recent open issue to have commented lines be ignored completely):
In [14]: pd.read_csv(StringIO(s2), comment='#', sep=',').dropna(how='all')
Out[14]:
   a  b  c
1  1  2  3
Note: the default index will "give away" the fact there was missing data.
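To illustrate the note about the index, here is a small sketch that builds a frame with an all-NaN first row directly (standing in for what the parser produced in 0.12.0) rather than parsing it, so the result does not depend on the pandas version. Calling reset_index(drop=True) afterwards is one possible way to hide the gap; it is a suggestion of this example, not something the original answer proposes.

```python
import numpy as np
import pandas as pd

# Stand-in for the parsed result: row 0 is the all-NaN row left by a comment line.
df = pd.DataFrame({"a": [np.nan, 1.0], "b": [np.nan, 2.0], "c": [np.nan, 3.0]})

# Drop rows where every value is NaN; the surviving row keeps its old label 1.
cleaned = df.dropna(how="all")
print(cleaned.index.tolist())

# Renumber from 0 so the index no longer reveals the dropped row.
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())
```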
Answer by Finn Årup Nielsen
I am on pandas version 0.13.1 and this comments-in-CSV problem still bothers me.
Here is my present workaround:
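import pandas as pd
from io import StringIO

def read_csv(filename, comment='#', sep=','):
    # Drop every line that starts with the comment marker, then parse the rest.
    with open(filename) as f:
        lines = "".join(line for line in f if not line.startswith(comment))
    return pd.read_csv(StringIO(lines), sep=sep)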
Otherwise, with pd.read_csv(filename, comment='#'), I get:
pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 16, saw 3.
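For completeness, here is a self-contained sketch of how a pre-filtering workaround like the one above might be exercised. The function name read_csv_skip_comments and the temporary file are hypothetical and exist only for this example; the sample content is the file from the question.

```python
import os
import tempfile
from io import StringIO

import pandas as pd


def read_csv_skip_comments(filename, comment="#", sep=","):
    # Keep only the lines that do not start with the comment marker.
    with open(filename) as f:
        kept = [line for line in f if not line.startswith(comment)]
    return pd.read_csv(StringIO("".join(kept)), sep=sep)


# Write the sample data from the question to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("# notes\na,b,c\n# more notes\n1,2,3\n")
    path = tmp.name

try:
    df = read_csv_skip_comments(path)
finally:
    os.remove(path)

print(df)
```

Note that this filter only removes lines whose first character is the comment marker; lines with leading whitespace before '#' or trailing inline comments are passed through unchanged.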