pandas: read a CSV file from HDFS as a DataFrame

Question

Asked by lordingtar

I'm using pydoop to read in a file from hdfs, and when I use:

import pydoop.hdfs as hd

# Open the file through pydoop's HDFS interface and dump its contents
with hd.open("/home/file.csv") as f:
    print(f.read())

It shows me the file in stdout.

Is there any way for me to read in this file as a DataFrame? I've tried using pandas' read_csv("/home/file.csv"), but it tells me that the file cannot be found. The exact code and error are:

>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
  File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist

Answer 1

Answered by hpaulj

I know next to nothing about hdfs, but I wonder if the following might work:

# Pass the open HDFS file handle to pandas instead of a path string
with hd.open("/home/file.csv") as f:
    df = pd.read_csv(f)

I assume read_csv works with a file handle, or in fact any iterable that will feed it lines. I know the numpy CSV readers do.
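As a rough check of that assumption, the sketch below passes an in-memory file-like object to read_csv instead of a path. The pandas documentation describes a file-like object as anything with a read() method, so io.BytesIO is used here as a stand-in for the handle returned by hd.open. The sample data is made up for illustration, and whether pydoop's handle behaves the same way in a given pydoop/pandas combination is an assumption that would need testing against a real cluster.

```python
import io

import pandas as pd

# Stand-in for the HDFS file handle: any object exposing read() should do.
# The CSV content below is invented purely for this illustration.
fake_hdfs_file = io.BytesIO(b"id,name\n1,alice\n2,bob\n")

# read_csv gets the open handle, not a path, so no filesystem lookup happens.
df = pd.read_csv(fake_hdfs_file)

print(df.shape)
```

If this works with a real hd.open handle, the result is an ordinary DataFrame and the rest of the pandas API applies as usual.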

pd.read_csv("/home/file.csv") would work if the regular Python file open works, i.e. if the file can be read as a regular local file:

# Plain built-in open: only succeeds if the path exists on the local filesystem
with open("/home/file.csv") as f:
    print(f.read())

But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig more into the hdfs documentation.
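One way to confirm that the path is simply absent from the local filesystem, which is where a bare path string is looked up, is a quick existence check like the sketch below. The helper name is_local_file is hypothetical and only wraps os.path.isfile.

```python
import os


def is_local_file(path):
    """Return True if `path` names a regular file on the local filesystem.

    Hypothetical helper for illustration; it knows nothing about HDFS.
    """
    return os.path.isfile(path)


# Path from the question. If the file only exists in HDFS, this prints False,
# which would match the IOError raised by pd.read_csv for the same path.
print(is_local_file("/home/file.csv"))
```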
