Reading in csv file as dataframe from hdfs

Note: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/35642020/
Asked by lordingtar
I'm using pydoop to read in a file from hdfs, and when I use:
import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
print f.read()
It shows me the file in stdout.
Is there any way for me to read in this file as a dataframe? I've tried using pandas' read_csv("/home/file.csv"), but it tells me that the file cannot be found. The exact code and error is:
>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
self._make_engine(self.engine)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist
Answered by hpaulj
I know next to nothing about hdfs, but I wonder if the following might work:
with hd.open("/home/file.csv") as f:
df = pd.read_csv(f)
I assume read_csv works with a file handle, or in fact any iterable that will feed it lines. I know the numpy csv readers do.
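The claim that read_csv accepts any file-like object (not just a path) is easy to check locally; in this sketch an io.StringIO stands in for the handle that hd.open would return:

```python
import io
import pandas as pd

# A StringIO behaves like an open text file; pandas' read_csv
# accepts it the same way it would accept a path or open handle.
buf = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")
df = pd.read_csv(buf)
print(df.shape)  # (2, 3)
```

If this works for a local stand-in, the same pattern should apply to the handle returned by hd.open.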
pd.read_csv("/home/file.csv") would work if the regular Python file open works - i.e. if it reads the file as a regular local file.
with open("/home/file.csv") as f:
print f.read()
But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig more into the hdfs documentation.
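If passing the hdfs handle to read_csv directly fails (some file-like objects don't support the seeking pandas' C parser may want), a fallback is to read the whole file into memory and wrap the bytes in io.BytesIO, which is fully seekable. Below, literal bytes stand in for what hd.open("/home/file.csv").read() would return:

```python
import io
import pandas as pd

# Stand-in for: raw = hd.open("/home/file.csv").read()
raw = b"a,b,c\n1,2,3\n4,5,6\n"

# BytesIO wraps the bytes in a seekable file-like object,
# which pandas can parse like any local file.
df = pd.read_csv(io.BytesIO(raw))
print(df["a"].tolist())  # [1, 4]
```

This trades memory for compatibility, so it is only reasonable for files that fit in RAM.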