Reading Files in HDFS (Hadoop filesystem) directories into a Pandas dataframe

Disclaimer: this page is an English rendering of a popular StackOverflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/16598043/


python, hadoop, pandas, hdfs

Asked by Setjmp

I am generating some delimited files from Hive queries into multiple HDFS directories. As the next step, I would like to read the files into a single pandas dataframe in order to apply standard non-distributed algorithms.

At some level a workable solution is trivial: use "hadoop dfs -copyToLocal" followed by local file system operations. However, I am looking for a particularly elegant way to load the data that I will incorporate into my standard practice.

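For reference, that baseline approach might look like the sketch below. The HDFS directory, the tab delimiter, and the use of a temporary scratch directory are all assumptions for illustration:

    import glob
    import shutil
    import subprocess
    import tempfile

    import pandas as pd

    # Hypothetical HDFS directory holding the delimited Hive output.
    HDFS_DIR = "/user/hive/warehouse/my_table"

    local_dir = tempfile.mkdtemp()
    try:
        # Copy every file in the HDFS directory down to local disk.
        subprocess.check_call(
            ["hadoop", "dfs", "-copyToLocal", HDFS_DIR + "/*", local_dir])
        # Read each delimited file and concatenate into a single dataframe.
        frames = [pd.read_csv(path, sep="\t", header=None)
                  for path in glob.glob(local_dir + "/*")]
        df = pd.concat(frames, ignore_index=True)
    finally:
        shutil.rmtree(local_dir)  # the clean-up step the question would rather avoid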

Some characteristics of an ideal solution:

  1. No need to create a local copy (who likes clean up?)
  2. Minimal number of system calls
  3. Few lines of Python code

Answered by Setjmp

It looks like the pydoop.hdfs module solves this problem while meeting a good subset of the goals:

http://pydoop.sourceforge.net/docs/tutorial/hdfs_api.html

I was not able to evaluate this, as pydoop has very strict requirements to compile and my Hadoop version is a bit dated.

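For completeness, here is an untested sketch of what the pydoop.hdfs route might look like; the directory path and tab delimiter are placeholder assumptions. Unlike the copyToLocal baseline, nothing is ever written to the local file system:

    import pandas as pd
    import pydoop.hdfs as hdfs

    # Hypothetical HDFS directory holding the delimited Hive output.
    HDFS_DIR = "/user/hive/warehouse/my_table"

    frames = []
    for path in hdfs.ls(HDFS_DIR):        # enumerate the files in the directory
        with hdfs.open(path, "rt") as f:  # file-like handle straight from HDFS
            frames.append(pd.read_csv(f, sep="\t", header=None))

    df = pd.concat(frames, ignore_index=True)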