pandas: reading a CSV file from HDFS as a DataFrame

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/35642020/

Time: 2020-09-14 00:46:16  Source: igfitidea

Reading in csv file as dataframe from hdfs

Tags: python, hadoop, pandas, hdfs

Asked by lordingtar

I'm using pydoop to read in a file from hdfs, and when I use:

import pydoop.hdfs as hd
# Open the file through pydoop's HDFS client and print its contents
with hd.open("/home/file.csv") as f:
    print(f.read())

It shows me the file in stdout.

Is there any way for me to read this file in as a dataframe? I've tried using pandas' read_csv("/home/file.csv"), but it tells me that the file cannot be found. The exact code and error are:

>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
  File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist

Answered by hpaulj

I know next to nothing about hdfs, but I wonder if the following might work:

import pandas as pd
import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    df = pd.read_csv(f)

I assume read_csv works with a file handle, or in fact any iterable that will feed it lines. I know the numpy csv readers do.
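The file-handle assumption can be checked without an HDFS cluster. The sketch below uses an in-memory io.BytesIO object as a stand-in for whatever hd.open returns; the sample data and column names are made up for illustration. The pandas documentation describes the accepted input as a path or an object with a read() method, so the "any iterable" part of the guess is less certain than the file-handle part.

```python
import io

import pandas as pd

# Made-up sample data standing in for the contents of the remote file.
raw = b"id,name\n1,alpha\n2,beta\n"

# BytesIO exposes read(), like a file handle opened in binary mode.
handle = io.BytesIO(raw)

# read_csv accepts the open handle directly instead of a path string.
df = pd.read_csv(handle)
print(df)
```

If this works, the same call should work with any handle that behaves like a regular file object, which is what the suggestion above relies on.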

pd.read_csv("/home/file.csv") would work if the regular Python file open works, i.e. if the file can be read as a regular local file:

# Plain built-in open: only succeeds if the path exists on the local filesystem
with open("/home/file.csv") as f:
    print(f.read())

But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig more into the hdfs documentation.
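One way to keep the "local path" and "other protocol" cases apart is to pass the opener in explicitly. This is only a sketch: load_csv, fake_open and FAKE_STORE are hypothetical names invented here, and the in-memory dictionary merely imitates a non-local store so the code can run without Hadoop.

```python
import io

import pandas as pd


def load_csv(path, opener=None):
    """Read a CSV into a DataFrame (hypothetical helper).

    With no opener, pandas resolves `path` on the local filesystem.
    Otherwise `opener(path)` must return a file-like context manager.
    """
    if opener is None:
        return pd.read_csv(path)
    with opener(path) as f:
        return pd.read_csv(f)


# Stand-in for a non-local store; contents are made up for illustration.
FAKE_STORE = {"/home/file.csv": b"a,b\n1,2\n3,4\n"}


def fake_open(path):
    # BytesIO supports the `with` statement, like a real file handle.
    return io.BytesIO(FAKE_STORE[path])


df = load_csv("/home/file.csv", opener=fake_open)
print(df)
```

Assuming pydoop is installed and hd.open returns a file-like object, as the question's own snippet suggests, the call would become load_csv("/home/file.csv", opener=hd.open).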
