Hadoop & Bash: delete filenames matching range

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must apply the same license and attribute it to the original authors (not the translator). Original: http://stackoverflow.com/questions/7733096/

Date: 2020-09-18 00:57:05 · Source: igfitidea

Hadoop & Bash: delete filenames matching range

Tags: bash, hadoop

Asked by volni

Say you have a list of files in HDFS with a common prefix and an incrementing suffix. For example,


part-1.gz, part-2.gz, part-3.gz, ..., part-50.gz

I only want to leave a few files in the directory, say 3. Any three files will do. The files will be used for testing, so the choice of files doesn't matter.


What's the simplest & fastest way to delete the 47 other files?

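Before running anything destructive against real data, the pipelines in the answers below can be rehearsed on a local stand-in directory. This is a hypothetical mock-up (plain `ls`/`rm` instead of `hadoop fs`; the `/tmp` path is invented for the demo):

```shell
# Hypothetical local mock-up of the 50-file HDFS directory (bash).
demo=/tmp/parts_demo
mkdir -p "$demo" && cd "$demo" || exit 1
touch part-{1..50}.gz

# Keep the first three names ls prints; delete the other 47.
ls part-*.gz | tail -n +4 | xargs rm

ls part-*.gz | wc -l   # 3
```

Because any three files will do, lexicographic order is good enough here (part-1.gz, part-10.gz and part-11.gz are the survivors).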

Answered by Donald Miner

A few options here:




Move three files manually over to a new folder, then delete the old folder.

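A local sketch of this first option, with plain `mkdir`/`mv`/`rm` standing in for the corresponding `hadoop fs` commands (the directory names are invented for the demo):

```shell
# Hypothetical stand-in: "old" plays the HDFS directory, "keep" the new folder.
src=/tmp/move_demo/old
keep=/tmp/move_demo/keep
mkdir -p "$src" "$keep" || exit 1
touch "$src"/part-{1..10}.gz

# Move any three files over, then delete the old folder wholesale.
ls "$src"/part-*.gz | head -n 3 | xargs -I{} mv {} "$keep"/
rm -r "$src"

ls "$keep" | wc -l   # 3
```

On HDFS the same steps would use `hadoop fs -mkdir`, `hadoop fs -mv` and `hadoop fs -rm -r`.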



Grab the file names with fs -ls, then pull the top n, then rm them. This is the most robust method, in my opinion.


hadoop fs -ls /path/to/files gives you ls output


hadoop fs -ls /path/to/files | grep 'part' | awk '{print $8}' prints out only the file names (adjust the grep accordingly to grab the files you want).


hadoop fs -ls /path/to/files | grep 'part' | awk '{print $8}' | head -n47 grabs the top 47


Throw this into a for loop and rm them:


for k in `hadoop fs -ls /path/to/files | grep part | awk '{print $8}' | head -n47`
do
   hadoop fs -rm $k
done


Instead of a for-loop, you could use xargs:


hadoop fs -ls /path/to/files | grep part | awk '{print $8}' | head -n47 | xargs hadoop fs -rm

Thanks to Keith for the inspiration


Answered by David W.

In Bash?


What files do you want to keep and why? What are their names? In the above example, you could do something like this:


$ rm !(part-[1-3].gz)

which will remove all files except part-1.gz, part-2.gz, and part-3.gz.

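One caveat: `!(...)` is bash's extended globbing, which only works after `shopt -s extglob` (it is off by default in non-interactive shells). A small bash demonstration, with a made-up `/tmp` directory:

```shell
shopt -s extglob   # required for the !(...) pattern below
demo=/tmp/extglob_demo
mkdir -p "$demo" && cd "$demo" || exit 1
touch part-{1..5}.gz

# Delete everything except part-1.gz, part-2.gz and part-3.gz.
rm !(part-[1-3].gz)

ls | wc -l   # 3
```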

You can also do something like this:


$ rm $(ls | sed -n '4,$p')

which will remove all except the first three files listed (sed -n '4,$p' prints from the fourth line onward).


You could also do this:


$ ls | sed -n '4,$p' | xargs rm

which is safer if you have hundreds and hundreds of files in the directory, since it avoids building one huge command line.


Answered by eswald

Do you need to keep the first three or the last three?


To remove all but the first three:


hadoop fs -ls | grep 'part-[0-9]*\.gz' | awk '{print $8}' | sort -g -k2 -t- | tail -n +4 | xargs -r -d '\n' hadoop fs -rm

To remove all but the last three:


hadoop fs -ls | grep 'part-[0-9]*\.gz' | awk '{print $8}' | sort -g -k2 -t- | head -n -3 | xargs -r -d '\n' hadoop fs -rm

Note that these commands don't depend on the actual number of files, nor on the existence of more than three, nor on the precise sorting of the original listing, but they do depend on the fact that the number is after a hyphen. The parameters to xargs aren't strictly necessary, but they may be helpful in certain situations.

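The numeric sort is the point here: once the suffix passes 9, lexicographic order interleaves part-10.gz before part-2.gz, while `sort -g -k2 -t-` orders by the number after the hyphen. A quick illustration on made-up local files:

```shell
demo=/tmp/sort_demo
mkdir -p "$demo" && cd "$demo" || exit 1
touch part-{1..12}.gz

ls part-*.gz | tail -n 1                     # part-9.gz  (lexicographic)
ls part-*.gz | sort -g -k2 -t- | tail -n 1   # part-12.gz (numeric suffix)
```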

Answered by Kent

awk:

ls part-*.gz | awk -F '[-.]' '$2>3{print "rm "$0}' | sh

Answered by Keith

ls part-*.gz | sed -e "1,3d" | xargs rm