How to save a file in hadoop with python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, include the original URL, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/26606128/
Asked by user3671459
Question:
I am starting to learn Hadoop; however, I need to save a lot of files into it using Python. I cannot seem to figure out what I am doing wrong. Can anyone help me with this?
Below is my code.
I think the HDFS_PATH is correct, as I didn't change it in the settings while installing.
The pythonfile.txt is on my desktop (as is the Python code I run through the command line).
Code:
import hadoopy
import os

hdfs_path = 'hdfs://localhost:9000/python'

def main():
    hadoopy.writetb(hdfs_path, [('pythonfile.txt', open('pythonfile.txt').read())])

main()
Output: When I run the above code, all I get is a directory in python itself.
iMac-van-Brian:desktop Brian$ $HADOOP_HOME/bin/hadoop dfs -ls /python
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
14/10/28 11:30:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r-- 1 Brian supergroup 236 2014-10-28 11:30 /python
Answered by Legato
I have a feeling that you're writing into a file called '/python', while you intend it to be the directory in which the file is stored.
What does
hdfs dfs -cat /python
show you?
If it shows the file contents, all you need to do is edit your hdfs_path to include the file name (you should delete /python first with -rm). Otherwise, use pydoop (pip install pydoop) and do this:
import pydoop.hdfs as hdfs
from_path = '/tmp/infile.txt'
to_path = 'hdfs://localhost:9000/python/outfile.txt'
hdfs.put(from_path, to_path)
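For the first option (keeping hadoopy and pointing hdfs_path at the file itself), a rough sketch might look like the following; the destination path below is an assumption based on the question's setup, and keep in mind that writetb stores the data as a typedbytes sequence file rather than a plain text copy.
import hadoopy

# Assumes the old /python file was removed first, e.g. with: hdfs dfs -rm /python
hdfs_path = 'hdfs://localhost:9000/python/pythonfile.txt'
hadoopy.writetb(hdfs_path, [('pythonfile.txt', open('pythonfile.txt').read())])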
Answered by M. Mashaye
I found this answer here:
import subprocess

def run_cmd(args_list):
    """
    Run a local system command and return (return code, stdout, stderr).
    """
    print('Running system command: {0}'.format(' '.join(args_list)))
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    s_output, s_err = proc.communicate()
    s_return = proc.returncode
    return s_return, s_output, s_err

# Run Hadoop ls command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-ls', 'hdfs_file_path'])
lines = out.decode().split('\n')  # decode the bytes output before splitting into lines

# Run Hadoop get command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-get', 'hdfs_file_path', 'local_path'])

# Run Hadoop put command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-put', 'local_file', 'hdfs_file_path'])

# Run Hadoop copyFromLocal command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-copyFromLocal', 'local_file', 'hdfs_file_path'])

# Run Hadoop copyToLocal command in Python
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-copyToLocal', 'hdfs_file_path', 'local_file'])

# Run Hadoop remove file command in Python
# (shell equivalent: hdfs dfs -rm -skipTrash /path/to/file/you/want/to/remove/permanently)
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', 'hdfs_file_path'])
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-skipTrash', 'hdfs_file_path'])

# rm -r
# HDFS command to remove the entire directory and all of its content from HDFS.
# Usage: hdfs dfs -rm -r <path>
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-r', 'hdfs_file_path'])
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-r', '-skipTrash', 'hdfs_file_path'])

# Check if a file exists in HDFS
# Usage: hadoop fs -test -[defsz] URI
# Options:
#   -d: if the path is a directory, return 0.
#   -e: if the path exists, return 0.
#   -f: if the path is a file, return 0.
#   -s: if the path is not empty, return 0.
#   -z: if the file is zero length, return 0.
# Example: hadoop fs -test -e filename
hdfs_file_path = '/tmpo'
cmd = ['hdfs', 'dfs', '-test', '-e', hdfs_file_path]
ret, out, err = run_cmd(cmd)
print(ret, out, err)
if ret:
    print('file does not exist')
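To tie this back to the original question, a hedged usage sketch with this helper could look like the following; the local file name and HDFS destination are assumptions carried over from the question.
# Upload the question's pythonfile.txt to an HDFS destination with -put
local_file = 'pythonfile.txt'
hdfs_dest = '/python/pythonfile.txt'
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-put', local_file, hdfs_dest])
if ret:
    print('put failed: {0}'.format(err))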
Answered by Jared Wilber
This is a pretty typical task for the subprocess module. The solution looks like this:
from subprocess import PIPE, Popen

put = Popen(["hadoop", "fs", "-put", <path/to/file>, <path/to/hdfs/file>], stdin=PIPE, bufsize=-1)
put.communicate()
Full Example
Let's assume you're on a server and have an authenticated connection to hdfs (e.g. you have already used a .keytab).
You just created a csv from a pandas.DataFrame and want to put it into hdfs.
You can then upload the file to hdfs as follows:
import os
import pandas as pd
from subprocess import PIPE, Popen
# define path to saved file
file_name = "saved_file.csv"
# create a pandas.DataFrame
sales = {'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'], 'Jan': [150, 200, 50]}
df = pd.DataFrame.from_dict(sales)
# save your pandas.DataFrame to csv (this could be anything, not necessarily a pandas.DataFrame)
df.to_csv(file_name)
# create path to your username on hdfs
hdfs_path = os.path.join(os.sep, 'user', '<your-user-name>', file_name)
# put csv into hdfs
put = Popen(["hadoop", "fs", "-put", file_name, hdfs_path], stdin=PIPE, bufsize=-1)
put.communicate()
The csv file will then exist at /user/<your-user-name>/saved_file.csv.
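If you want the script to confirm the upload, one hedged way is to reuse subprocess and list the destination path; a non-empty listing (and a zero return code) means the file is there.
# Optional check: list the uploaded file and print the result
check = Popen(["hadoop", "fs", "-ls", hdfs_path], stdout=PIPE, stderr=PIPE)
out, err = check.communicate()
print(out.decode())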
Note: If you created this file from a Python script called in Hadoop, the intermediate csv file may be stored on some random nodes. Since this file is (presumably) no longer needed, it's best practice to remove it so as not to pollute the nodes every time the script is called. You can simply add os.remove(file_name) as the last line of the above script to solve this issue.
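A minimal sketch of that cleanup, appended to the script above (with an extra guard, as an assumption, so the local copy is only deleted once the upload has finished successfully):
# Remove the intermediate csv only after the put command has completed successfully
if put.returncode == 0:
    os.remove(file_name)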

