How to save a file in hadoop with python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, include the original URL, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/26606128/
Asked by user3671459
Question:
I am starting to learn Hadoop; however, I need to save a lot of files into it using Python. I cannot seem to figure out what I am doing wrong. Can anyone help me with this?
Below is my code.
I think the HDFS_PATH is correct, as I didn't change it in the settings while installing.
The pythonfile.txt is on my desktop (as is the Python code I run through the command line).
Code:
import hadoopy
import os

hdfs_path = 'hdfs://localhost:9000/python'

def main():
    hadoopy.writetb(hdfs_path, [('pythonfile.txt', open('pythonfile.txt').read())])

main()
Output: When I run the above code, all I get is a directory in python itself.
iMac-van-Brian:desktop Brian$ $HADOOP_HOME/bin/hadoop dfs -ls /python
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
14/10/28 11:30:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r-- 1 Brian supergroup 236 2014-10-28 11:30 /python
Answered by Legato
I have a feeling that you're writing into a file called '/python', while you intend it to be the directory in which the file is stored.
What does
hdfs dfs -cat /python
show you?
If it shows the file contents, all you need to do is edit your hdfs_path to include the file name (you should delete /python first with -rm). Otherwise, use pydoop (pip install pydoop) and do this:
import pydoop.hdfs as hdfs
from_path = '/tmp/infile.txt'
to_path = 'hdfs://localhost:9000/python/outfile.txt'
hdfs.put(from_path, to_path)
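For the first option (keeping hadoopy and pointing hdfs_path at the file itself), a rough sketch might look like the following; the destination path below is an assumption based on the question's setup, and keep in mind that writetb stores the data as a typedbytes sequence file rather than a plain text copy.
import hadoopy

# Assumes the old /python file was removed first, e.g. with: hdfs dfs -rm /python
hdfs_path = 'hdfs://localhost:9000/python/pythonfile.txt'
hadoopy.writetb(hdfs_path, [('pythonfile.txt', open('pythonfile.txt').read())])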
Answered by M. Mashaye
I found this answer here:
import subprocess

def run_cmd(args_list):
    """
    Run a local system command and return (return code, stdout, stderr).
    """
    print('Running system command: {0}'.format(' '.join(args_list)))
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    s_output, s_err = proc.communicate()
    s_return = proc.returncode
    return s_return, s_output, s_err

# Run Hadoop ls command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-ls', 'hdfs_file_path'])
lines = out.decode().split('\n')  # decode the bytes output before splitting into lines

# Run Hadoop get command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-get', 'hdfs_file_path', 'local_path'])

# Run Hadoop put command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-put', 'local_file', 'hdfs_file_path'])

# Run Hadoop copyFromLocal command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-copyFromLocal', 'local_file', 'hdfs_file_path'])

# Run Hadoop copyToLocal command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-copyToLocal', 'hdfs_file_path', 'local_file'])

# Run Hadoop remove file command in Python
# (shell equivalent: hdfs dfs -rm -skipTrash /path/to/file/you/want/to/remove/permanently)
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', 'hdfs_file_path'])
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-skipTrash', 'hdfs_file_path'])

# rm -r
# HDFS command to remove the entire directory and all of its content from HDFS.
# Usage: hdfs dfs -rm -r <path>
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-r', 'hdfs_file_path'])
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-r', '-skipTrash', 'hdfs_file_path'])

# Check if a file exists in HDFS
# Usage: hadoop fs -test -[defsz] URI
# Options:
#   -d: if the path is a directory, return 0.
#   -e: if the path exists, return 0.
#   -f: if the path is a file, return 0.
#   -s: if the path is not empty, return 0.
#   -z: if the file is zero length, return 0.
# Example: hadoop fs -test -e filename
hdfs_file_path = '/tmpo'
cmd = ['hdfs', 'dfs', '-test', '-e', hdfs_file_path]
ret, out, err = run_cmd(cmd)
print(ret, out, err)
if ret:
    print('file does not exist')
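To tie this back to the original question, a hedged usage sketch with this helper could look like the following; the local file name and HDFS destination are assumptions carried over from the question.
# Upload the question's pythonfile.txt to an HDFS destination with -put
local_file = 'pythonfile.txt'
hdfs_dest = '/python/pythonfile.txt'
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-put', local_file, hdfs_dest])
if ret:
    print('put failed: {0}'.format(err))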
Answered by Jared Wilber
This is a pretty typical task for the subprocess module. The solution looks like this:
from subprocess import PIPE, Popen

put = Popen(["hadoop", "fs", "-put", <path/to/file>, <path/to/hdfs/file>], stdin=PIPE, bufsize=-1)
put.communicate()
Full Example
Let's assume you're on a server and have an authenticated connection to hdfs (e.g. you have already used a .keytab).
You just created a csv from a pandas.DataFrame and want to put it into hdfs.
You can then upload the file to hdfs as follows:
import os
import pandas as pd
from subprocess import PIPE, Popen
# define path to saved file
file_name = "saved_file.csv"
# create a pandas.DataFrame
sales = {'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'], 'Jan': [150, 200, 50]}
df = pd.DataFrame.from_dict(sales)
# save your pandas.DataFrame to csv (this could be anything, not necessarily a pandas.DataFrame)
df.to_csv(file_name)
# create path to your username on hdfs
hdfs_path = os.path.join(os.sep, 'user', '<your-user-name>', file_name)
# put csv into hdfs
put = Popen(["hadoop", "fs", "-put", file_name, hdfs_path], stdin=PIPE, bufsize=-1)
put.communicate()
The csv file will then exist at /user/<your-user-name>/saved_file.csv.
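If you want the script to confirm the upload, one hedged way is to reuse subprocess and list the destination path; a non-empty listing (and a zero return code) means the file is there.
# Optional check: list the uploaded file and print the result
check = Popen(["hadoop", "fs", "-ls", hdfs_path], stdout=PIPE, stderr=PIPE)
out, err = check.communicate()
print(out.decode())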
Note: If you created this file from a Python script called in Hadoop, the intermediate csv file may be stored on some random nodes. Since this file is (presumably) no longer needed, it's best practice to remove it so as not to pollute the nodes every time the script is called. You can simply add os.remove(file_name) as the last line of the above script to solve this issue.
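A minimal sketch of that cleanup, appended to the script above (with an extra guard, as an assumption, so the local copy is only deleted once the upload has finished successfully):
# Remove the intermediate csv only after the put command has completed successfully
if put.returncode == 0:
    os.remove(file_name)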

