Reading Files in HDFS (Hadoop filesystem) directories into a Pandas dataframe

Disclaimer: this page is an English rendering of a popular StackOverflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/16598043/


python, hadoop, pandas, hdfs

Asked by Setjmp

I am generating some delimited files from Hive queries into multiple HDFS directories. As the next step, I would like to read the files into a single pandas dataframe in order to apply standard non-distributed algorithms.

At some level a workable solution is trivial: use "hadoop dfs -copyToLocal" followed by local file system operations. However, I am looking for a particularly elegant way to load the data that I will incorporate into my standard practice.

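For reference, that baseline approach might look like the sketch below. The HDFS directory, the tab delimiter, and the use of a temporary scratch directory are all assumptions for illustration:

    import glob
    import shutil
    import subprocess
    import tempfile

    import pandas as pd

    # Hypothetical HDFS directory holding the delimited Hive output.
    HDFS_DIR = "/user/hive/warehouse/my_table"

    local_dir = tempfile.mkdtemp()
    try:
        # Copy every file in the HDFS directory down to local disk.
        subprocess.check_call(
            ["hadoop", "dfs", "-copyToLocal", HDFS_DIR + "/*", local_dir])
        # Read each delimited file and concatenate into a single dataframe.
        frames = [pd.read_csv(path, sep="\t", header=None)
                  for path in glob.glob(local_dir + "/*")]
        df = pd.concat(frames, ignore_index=True)
    finally:
        shutil.rmtree(local_dir)  # the clean-up step the question would rather avoid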

Some characteristics of an ideal solution:

  1. No need to create a local copy (who likes clean up?)
  2. Minimal number of system calls
  3. Few lines of Python code

Answered by Setjmp

It looks like the pydoop.hdfs module solves this problem while meeting a good subset of the goals:

http://pydoop.sourceforge.net/docs/tutorial/hdfs_api.html

I was not able to evaluate this, as pydoop has very strict requirements to compile and my Hadoop version is a bit dated.

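For completeness, here is an untested sketch of what the pydoop.hdfs route might look like; the directory path and tab delimiter are placeholder assumptions. Unlike the copyToLocal baseline, nothing is ever written to the local file system:

    import pandas as pd
    import pydoop.hdfs as hdfs

    # Hypothetical HDFS directory holding the delimited Hive output.
    HDFS_DIR = "/user/hive/warehouse/my_table"

    frames = []
    for path in hdfs.ls(HDFS_DIR):        # enumerate the files in the directory
        with hdfs.open(path, "rt") as f:  # file-like handle straight from HDFS
            frames.append(pd.read_csv(f, sep="\t", header=None))

    df = pd.concat(frames, ignore_index=True)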