pandas read csv without header (which might be there)

Question

Asked by marscher

I'm trying to read a .csv file in chunks (python engine) and skip the header (or any lines starting with a comment character). It is not known a priori whether the file has a header, so I cannot simply skip the first line, since it might already be a data row.

Setting header=None does not solve the problem. If I invoke get_chunk and want the row values, I still get the header or comment line.

The desired output would be the same as that of numpy.loadtxt().

The code below demonstrates what's going on:

import numpy as np
from pandas.io.parsers import TextFileReader
fn = '/tmp/test.csv'
np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no sense")
print(np.loadtxt(fn).shape) # output (100,3)

reader = TextFileReader(fn, chunksize=10, header=None)
reader.get_chunk().values

# output
array([['#', 'makes', 'no', 'sense'],
       ['0.000000000000000000e+00', '1.000000000000000000e+00',
        '2.000000000000000000e+00', None],
       ['3.000000000000000000e+00', '4.000000000000000000e+00',
        '5.000000000000000000e+00', None],
       ['6.000000000000000000e+00', '7.000000000000000000e+00',
        '8.000000000000000000e+00', None],
       ['9.000000000000000000e+00', '1.000000000000000000e+01',
        '1.100000000000000000e+01', None],
       ['1.200000000000000000e+01', '1.300000000000000000e+01',
        '1.400000000000000000e+01', None],
       ['1.500000000000000000e+01', '1.600000000000000000e+01',
        '1.700000000000000000e+01', None],
       ['1.800000000000000000e+01', '1.900000000000000000e+01',
        '2.000000000000000000e+01', None],
       ['2.100000000000000000e+01', '2.200000000000000000e+01',
        '2.300000000000000000e+01', None],
       ['2.400000000000000000e+01', '2.500000000000000000e+01',
        '2.600000000000000000e+01', None]], dtype=object)

If I specify the comment char via

   reader = TextFileReader(fn, chunksize=10, header=None, comment='#')

I get an exception:

In [99]: reader = pandas.io.parsers.TextFileReader('/tmp/test.csv', chunksize=10, header=None, index_col=False, comment="#")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-99-64b1c0bce4ef> in <module>()
----> 1 reader = pandas.io.parsers.TextFileReader('/tmp/test.csv', chunksize=10, header=None, index_col=False, comment="#")

/home/marscher/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    560             self.options['has_index_names'] = kwds['has_index_names']
    561 
--> 562         self._make_engine(self.engine)
    563 
    564     def _get_options_with_defaults(self, engine):

/home/marscher/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_engine(self, engine)
    703             elif engine == 'python-fwf':
    704                 klass = FixedWidthFieldParser
--> 705             self._engine = klass(self.f, **self.options)
    706 
    707     def _failover_to_python(self):

/home/marscher/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, **kwds)
   1400         # Set self.data to something that can read lines.
   1401         if hasattr(f, 'readline'):
-> 1402             self._make_reader(f)
   1403         else:
   1404             self.data = f

/home/marscher/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_reader(self, f)
   1505                 self.pos += 1
   1506                 self.line_pos += 1
-> 1507                 sniffed = csv.Sniffer().sniff(line)
   1508                 dia.delimiter = sniffed.delimiter
   1509                 if self.encoding is not None:

/home/marscher/anaconda/lib/python2.7/csv.pyc in sniff(self, sample, delimiters)
    180 
    181         quotechar, doublequote, delimiter, skipinitialspace = \
--> 182                    self._guess_quote_and_delimiter(sample, delimiters)
    183         if not delimiter:
    184             delimiter, skipinitialspace = self._guess_delimiter(sample,

/home/marscher/anaconda/lib/python2.7/csv.pyc in _guess_quote_and_delimiter(self, data, delimiters)
    221                       '(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'):                            #  ".*?" (no delim, no space)
    222             regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
--> 223             matches = regexp.findall(data)
    224             if matches:
    225                 break

TypeError: expected string or buffer

Edit: this error is caused by not wrapping comment in a list.

Answer 1

Answered by Andrew B

I know this is super old, and I never figured out what's going on with your comment error (your clarification of the problem didn't fix it for me; I think it has something to do with calling a class rather than a function), but several modifications provide the output I think you're looking for.

First, if you tell the reader there is no header, it will interpret any header lines as data, and those lines then determine both the shape and the type of the data read in (e.g., numbers come back as strings). The reader can instead infer whether there is a header, so that the header is not read as data, leaving comments as a separate issue.

import numpy as np
from pandas.io.parsers import TextFileReader
fn = '/tmp/test.csv'
np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no sense")
np.loadtxt(fn).shape # output (100,3)

reader = TextFileReader(fn, chunksize=10, header='infer')
reader.get_chunk().values

#output, just inferring headers
array([[  0.,   1.,   2.,  nan],
   [  3.,   4.,   5.,  nan],
   [  6.,   7.,   8.,  nan],
   [  9.,  10.,  11.,  nan],
   [ 12.,  13.,  14.,  nan],
   [ 15.,  16.,  17.,  nan],
   [ 18.,  19.,  20.,  nan],
   [ 21.,  22.,  23.,  nan],
   [ 24.,  25.,  26.,  nan],
   [ 27.,  28.,  29.,  nan]])

The nan column comes from interpreting the commented line as a header (which it is, though it is also commented out): that line splits into four fields, while each data row has only three.
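The effect can be reproduced on a tiny in-memory sample. This is only a sketch: the `text` string is a stand-in for the first lines of the saved file, and `sep=r"\s+"` (split on whitespace) is an assumption added here so that the example does not depend on delimiter sniffing.

```python
import io

import pandas as pd

# Stand-in for the first lines of the file written by np.savetxt above.
text = "# makes no sense\n0.0 1.0 2.0\n3.0 4.0 5.0\n"

# No comment character is given, so "#" is read as an ordinary field and
# the first line becomes a four-name header.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header="infer")

print(df.columns.tolist())       # the four header fields
print(df.shape)                  # one more column than the data has
print(df["sense"].isna().all())  # the surplus column is filled with NaN
```

Each data row supplies only three values, so the fourth column has nothing to hold and is padded with NaN.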

You can get rid of the comment mark on the header by changing how you save the text.

np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no      sense",comments='')
reader = TextFileReader(fn, chunksize=10, header='infer')
reader.get_chunk().values
#output, without true header commented out
array([[  0.,   1.,   2.],
   [  3.,   4.,   5.],
   [  6.,   7.,   8.],
   [  9.,  10.,  11.],
   [ 12.,  13.,  14.],
   [ 15.,  16.,  17.],
   [ 18.,  19.,  20.],
   [ 21.,  22.,  23.],
   [ 24.,  25.,  26.],
   [ 27.,  28.,  29.]])

This eliminates the problem with the commented-out header, but it doesn't help to infer the correct shape in general, nor does it help if the file contains real comments that you also want to ignore.
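To see what the comments argument of np.savetxt changes in the file itself, the first line can be read back directly. This is a small sketch; the helper `first_line` and the temporary file name are made up for the illustration.

```python
import os
import tempfile

import numpy as np

data = np.arange(6).reshape(2, 3)
tmpdir = tempfile.mkdtemp()


def first_line(comments):
    """Write `data` with a header and return the first line of the file."""
    path = os.path.join(tmpdir, "demo.txt")
    np.savetxt(path, data, header="makes no sense", comments=comments)
    with open(path) as fh:
        return fh.readline().rstrip("\n")


commented = first_line("# ")  # "# " is the default prefix used by np.savetxt
plain = first_line("")        # an empty prefix leaves the header uncommented

print(repr(commented))
print(repr(plain))
```

With the default prefix the header line starts with the comment character, so a parser that is not told about comments sees an extra field; with an empty prefix the header has the same number of fields as the data.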

If you want to infer whether there is a header and also ignore any commented lines, the only way I have found is to call the read_csv function rather than the class.

import pandas
np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no sense")
reader = pandas.read_csv(fn,chunksize=10,header='infer',comment="#")
reader.get_chunk().values
#output, treating the header as a comment, so shape is decided by first data line
array([[ '3.000000000000000000e+00 4.000000000000000000e+00 5.000000000000000000e+00'],
   [ '6.000000000000000000e+00 7.000000000000000000e+00 8.000000000000000000e+00'],
   [ '9.000000000000000000e+00 1.000000000000000000e+01 1.100000000000000000e+01'],
   [ '1.200000000000000000e+01 1.300000000000000000e+01 1.400000000000000000e+01'],
   [ '1.500000000000000000e+01 1.600000000000000000e+01 1.700000000000000000e+01'],
   [ '1.800000000000000000e+01 1.900000000000000000e+01 2.000000000000000000e+01'],
   [ '2.100000000000000000e+01 2.200000000000000000e+01 2.300000000000000000e+01'],
   [ '2.400000000000000000e+01 2.500000000000000000e+01 2.600000000000000000e+01'],
   [ '2.700000000000000000e+01 2.800000000000000000e+01 2.900000000000000000e+01'],
   [ '3.000000000000000000e+01 3.100000000000000000e+01 3.200000000000000000e+01']], dtype=object)

#Or, without the commented out header
np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no sense",comments='')
reader = pandas.read_csv(fn,chunksize=10,header='infer',comment="#")
reader.get_chunk().values
#output, treating the header as a header to determine shape, but comments would also be ignored
array([[ '0.000000000000000000e+00 1.000000000000000000e+00 2.000000000000000000e+00'],
   [ '3.000000000000000000e+00 4.000000000000000000e+00 5.000000000000000000e+00'],
   [ '6.000000000000000000e+00 7.000000000000000000e+00 8.000000000000000000e+00'],
   [ '9.000000000000000000e+00 1.000000000000000000e+01 1.100000000000000000e+01'],
   [ '1.200000000000000000e+01 1.300000000000000000e+01 1.400000000000000000e+01'],
   [ '1.500000000000000000e+01 1.600000000000000000e+01 1.700000000000000000e+01'],
   [ '1.800000000000000000e+01 1.900000000000000000e+01 2.000000000000000000e+01'],
   [ '2.100000000000000000e+01 2.200000000000000000e+01 2.300000000000000000e+01'],
   [ '2.400000000000000000e+01 2.500000000000000000e+01 2.600000000000000000e+01'],
   [ '2.700000000000000000e+01 2.800000000000000000e+01 2.900000000000000000e+01']], dtype=object)
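Each row in the two outputs above is a single string because read_csv splits on commas by default, while np.savetxt writes space-separated values. The sketch below is one possible way to get numpy.loadtxt-like chunks under a few assumptions: the data is whitespace-delimited and purely numeric, and a header (if present) contains at least one non-numeric token. The helpers `first_data_line`, `looks_like_header`, and `read_chunks` are hypothetical names introduced here, not part of pandas.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def first_data_line(path, comment="#"):
    """Return the first non-blank line that does not start with `comment`."""
    with open(path) as fh:
        for line in fh:
            stripped = line.strip()
            if stripped and not stripped.startswith(comment):
                return stripped
    return ""


def looks_like_header(line):
    """Heuristic (an assumption): a header has at least one non-numeric token."""
    try:
        [float(tok) for tok in line.split()]
    except ValueError:
        return True
    return False


def read_chunks(path, chunksize=10, comment="#"):
    """Chunked reader that skips comment lines and an optional header row."""
    header = 0 if looks_like_header(first_data_line(path, comment)) else None
    return pd.read_csv(path, sep=r"\s+", comment=comment,
                       header=header, chunksize=chunksize)


# Usage with a commented header, as in the question.
path = os.path.join(tempfile.mkdtemp(), "test.csv")
np.savetxt(path, np.arange(300).reshape(100, 3), header="makes no sense")
chunk = read_chunks(path).get_chunk()
print(chunk.shape)
print(chunk.dtypes.tolist())
```

Peeking at the first data line before constructing the reader keeps the decision about the header separate from the parsing. The heuristic would misfire on a header made up entirely of numbers, so it should be adapted to the files at hand.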
