bash: Batch rename in Hadoop

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/14736017/

Batch rename in hadoop

Tags: bash, hadoop, file-rename

Asked by beefyhalo

How can I rename all files in an HDFS directory to have a .lzo extension? .lzo.index files should not be renamed.

For example, this directory listing:

file0.lzo file0.lzo.index file0.lzo_copy_1

could be renamed to:

file0.lzo file0.lzo.index file0.lzo_copy_1.lzo

These files are lzo-compressed, and I need them to have the .lzo extension to be recognized by Hadoop.

Answered by mt_

If you don't want to write Java code for this, I think the command-line HDFS API is your best bet:

mv in Hadoop:

hadoop fs -mv URI [URI …] <dest>

You can get the paths using a small one-liner:

% hadoop fs -ls /user/foo/bar | awk '!/^d/ {print $NF}'

/user/foo/bar/blacklist
/user/foo/bar/books-eng
...

The awk removes directories from the output. Now you can put these files into a variable:

% files=$(hadoop fs -ls /user/foo/bar | awk '!/^d/ {print $NF}')

and rename each file:

% for f in $files; do hadoop fs -mv "$f" "$f.lzo"; done

You can also use awk to filter the files by other criteria. The following should exclude files that match the regex nolzo. It's untested, but this way you can write flexible filters:

% files=$(hadoop fs -ls /user/foo/bar | awk '!/^d|nolzo/ {print $NF}')
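To see what the awk filter keeps without touching a cluster, you can run it against a canned listing in the style of hadoop fs -ls output (the listing lines below are made-up examples, and $NF is assumed to be the path column):

```shell
# Feed an example "hadoop fs -ls"-style listing through the filter.
# Directory entries (mode starts with 'd') and names matching "nolzo"
# are dropped; $NF prints the last field, i.e. the path.
printf '%s\n' \
  "drwxr-xr-x   - foo foo  0 2020-09-18 04:29 /user/foo/bar/subdir" \
  "-rw-r--r--   3 foo foo 10 2020-09-18 04:29 /user/foo/bar/keep_me" \
  "-rw-r--r--   3 foo foo 10 2020-09-18 04:29 /user/foo/bar/nolzo_skip" |
awk '!/^d|nolzo/ {print $NF}'
```

Only /user/foo/bar/keep_me survives: the subdirectory is dropped by /^d/ and the nolzo file by the second alternative.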

Test whether it works by replacing the hadoop command with echo:

$ for f in $files; do echo "$f" "$f.lzo"; done
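Putting the steps together, here is a minimal dry-run sketch of the whole rename pass. It uses a static listing and echo in place of "hadoop fs -mv" so it runs without a cluster, and it additionally skips files that already end in .lzo or .lzo.index (the case patterns are an assumption matching the question's requirements, not part of the original answer):

```shell
# Dry run: print the rename commands that would be issued.
# The paths come from the question's example directory listing.
printf '%s\n' \
  "/user/foo/bar/file0.lzo" \
  "/user/foo/bar/file0.lzo.index" \
  "/user/foo/bar/file0.lzo_copy_1" |
while read -r f; do
  case "$f" in
    *.lzo|*.lzo.index) ;;                      # already suffixed: skip
    *) echo "hadoop fs -mv $f $f.lzo" ;;       # would rename
  esac
done
```

Replacing echo with the real command (and the printf with the hadoop fs -ls | awk pipeline) turns the dry run into the actual rename.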

Edit: updated the examples to use awk instead of sed for more reliable output.

The "right" way to do it is probably to use the HDFS Java API. However, using the shell is probably faster and more flexible for most jobs.

Answered by Robert

When I had to rename many files I was searching for an efficient solution, stumbled over this question, and saw thi-duong-nguyen's remark that renaming many files is very slow. I implemented a Java solution for batch rename operations which I can highly recommend, since it is orders of magnitude faster. The basic idea is to use org.apache.hadoop.fs.FileSystem's rename() method:


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://master:8020");
FileSystem dfs = FileSystem.get(conf);
dfs.rename(from, to);

where from and to are org.apache.hadoop.fs.Path objects. The easiest way is to create a list of files to be renamed (including their new names) and feed this list to the Java program.

I have published the complete implementation, which reads such a mapping from STDIN. It renamed 100 files in less than four seconds (the same time was required to rename 7000 files!), while the hdfs dfs -mv based approach described before required 4 minutes to rename 100 files.
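A "src dst" mapping of the kind such a batch-rename tool reads from STDIN can itself be generated with a short awk one-liner. This is a sketch under the assumption that the tool expects one space-separated source/destination pair per line (the example paths are made up):

```shell
# Turn a list of paths into "source destination" rename pairs,
# appending .lzo to each name. In practice the input would come
# from the hadoop fs -ls | awk pipeline shown in the first answer.
printf '%s\n' \
  "/user/foo/bar/file0.lzo_copy_1" \
  "/user/foo/bar/file1.lzo_copy_1" |
awk '{print $0, $0 ".lzo"}'
```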

Answered by Ameba Spugnosa

We created a utility to do bulk renaming of files in HDFS: https://github.com/tenaris/hdfs-rename. The tool is limited, but if you want, you can contribute to improve it with recursion, awk regex syntax, and so on.