bash 使用bash根据md5查找重复文件
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/19551908/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Finding duplicate files according to md5 with bash
提问by user2913020
I want to write an algorithm in bash that finds duplicate files.
我想用 bash 写一个能找出重复文件的算法。
How can I add a size option?
如何添加按文件大小比较的选项?
回答by Gilles Quenot
Don't reinvent the wheel, use the proper command:
不要重新发明轮子,使用正确的命令:
fdupes -r dir
See http://code.google.com/p/fdupes/ (packaged on some Linux distros)
请参阅 http://code.google.com/p/fdupes/(一些 Linux 发行版已提供该软件包)
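As a quick usage sketch (assuming a typical fdupes build; -r recurses and -S additionally prints the size of each duplicate set, which covers the size part of the question):

# Recurse into dir and show each group of duplicates together with its size.
fdupes -r -S dir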
回答by Alex Atkinson
find . -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |\
xargs -I{} -n1 find . -type f -size {}c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate
This is how you'd want to do it. This code locates dupes based on size first, then on the MD5 hash. Note the use of -size, in relation to your question. Enjoy. It assumes you want to search in the current directory; if not, change the find . to point at the directory (or directories) you'd like to search.
这就是你想要的做法。这段代码先按文件大小定位重复项,再按 MD5 哈希确认。请注意其中 -size 的使用,这正与你的问题相关。它假设你要在当前目录中搜索;如果不是,请把 find . 改成你想搜索的目录。
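For readability, the same pipeline can be written out with each stage commented, as in this sketch (it assumes GNU find, xargs, sort, uniq and md5sum, just like the original one-liner):

# 1. Print the size in bytes of every non-empty regular file.
# 2. Keep only sizes that occur more than once (candidate duplicates).
# 3. For each such size, find the files of exactly that size and hash them.
# 4. Sort by hash and print only the groups that share the same 32-char md5.
find . -not -empty -type f -printf "%s\n" \
  | sort -rn \
  | uniq -d \
  | xargs -I{} -n1 find . -type f -size {}c -print0 \
  | xargs -0 md5sum \
  | sort \
  | uniq -w32 --all-repeated=separate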
回答by Drake Clarris
find /path/to/folder1 /path/to/folder2 -type f -printf "%f %s\n" | sort | uniq -d
The find command looks in the two folders for files, prints only the file name (stripping leading directories) and the size, then sorts and shows only the duplicates. This does assume there are no newlines in the file names.
find 命令在两个文件夹中查找文件,只打印文件名(去掉前导目录)和文件大小,然后排序并只显示重复项。前提是文件名中不包含换行符。
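If file names might contain newlines (the caveat above), a NUL-delimited variant along these lines should cope; this is a sketch that assumes GNU find, sort -z and uniq --zero-terminated:

# Same idea, but records are NUL-terminated so newlines in names are harmless.
find /path/to/folder1 /path/to/folder2 -type f -printf "%f %s\0" \
  | sort -z \
  | uniq -dz \
  | tr '\0' '\n'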
回答by Ondra Žižka
Normally I use fdupes -r -S . for this. But when I search for duplicates among a smaller number of very large files, fdupes takes very long to finish, as (I guess) it computes a full checksum of each whole file.
通常我用 fdupes -r -S . 来做这件事。但当我在数量不多、体积很大的文件中查找重复时,fdupes 要很久才能完成,因为(我猜)它会对整个文件计算完整的校验和。
I've avoided that by comparing only the first 1 megabyte. It's not super-safe, and you have to check whether it's really a duplicate if you want to be 100% sure. But the chance of two different videos (my case) having the same first megabyte but different content further on is rather theoretical.
我通过只比较前 1 兆字节来避免这个问题。这并不十分保险,如果想 100% 确定,还需要再检查它是否真的是重复文件。但两个不同的视频(我的情况)前 1 兆字节相同而后面内容不同的可能性基本只存在于理论上。
So I have written this script. Another trick it uses to speed things up is that it stores the resulting hash for each path in a file. I rely on the fact that the files don't change.
所以我写了这个脚本。它用来提速的另一个技巧是把每个路径对应的哈希结果存到一个文件里。我依赖的前提是这些文件不会改变。
I paste this code into a console rather than running it as a script - for that it would need some more work, but it gives you the idea:
我是把这段代码粘贴到控制台里用的,而不是作为脚本运行 - 要那样用还需要再完善一些,但思路就是这样:
# Ensure the cache file exists so the greps below don't fail on the first run.
touch md5-partial.txt
find -type f -size +3M -print0 | while IFS= read -r -d '' i; do
  echo -n '.'
  # Skip paths that are already recorded in the cache file.
  if grep -q "$i" md5-partial.txt; then
    echo -n ':'  #-e "\n$i ---- Already counted, skipping."
    continue
  fi
  # Hash only the first 1 MB of the file.
  MD5=`dd bs=1M count=1 if="$i" status=none | md5sum`
  MD5=`echo $MD5 | cut -d' ' -f1`
  if grep "$MD5" md5-partial.txt; then echo -e "Duplicate: $i"; fi
  echo "$MD5" "$i" >> md5-partial.txt
done
## Show the duplicates
#sort md5-partial.txt | uniq --check-chars=32 -d -c | sort -b -n | cut -c 9-40 | xargs -I '{}' sh -c "grep '{}' md5-partial.txt && echo"
Another bash snippet which I use to determine the largest duplicate files:
另一个我用来找出最大的重复文件的 bash 片段:
## Show wasted space
if true; then
  sort md5-partial.txt | uniq --check-chars=32 -d -c | while IFS= read -r LINE; do
    HASH=`echo "$LINE" | cut -c 9-40`
    # The path starts after the 8-char count prefix, the 32-char hash and a space.
    FILE=`echo "$LINE" | cut -c 42-`
    ls -l "$FILE" | cut -c 26-34
  done
fi
Both these scripts have a lot of room for improvement, feel free to contribute - here is the gist :)
这两个脚本都还有很大的改进空间,欢迎贡献 - 这是 gist :)
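The same first-1-MiB idea can also be kept in a single self-contained loop, as in this sketch (the 3M threshold and 1 MiB prefix are carried over from the script above; head -c replaces dd, and an in-memory seen array replaces the md5-partial.txt cache):

#!/bin/bash
# Report files over 3M whose first 1 MiB hashes identically; candidates still
# need a full comparison (e.g. cmp) before being treated as true duplicates.
declare -A seen
while IFS= read -r -d '' f; do
    h=$(head -c 1M -- "$f" | md5sum | cut -d' ' -f1)
    if [[ -n "${seen[$h]:-}" ]]; then
        printf 'Possible duplicate:\n  %s\n  %s\n' "${seen[$h]}" "$f"
    else
        seen[$h]=$f
    fi
done < <(find . -type f -size +3M -print0)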
回答by Peacher Wu
This might be a late answer, but there are much faster alternatives to fdupes now.
这可能是一个迟到的回答,但现在已经有比 fdupes 快得多的替代工具。
- fslint/findup
- jdupes, which is supposed to be a faster replacement for fdupes
I have had the time to do a small test. For a folder with 54,000 files of a total size 17G, on a standard (8 vCPU/30G) Google Virtual Machine:
我有时间做一个小测试。对于包含 54,000 个文件、总大小为 17G 的文件夹,在标准 (8 vCPU/30G) 谷歌虚拟机上:
- fdupes takes 2m 47.082s
- findup takes 13.556s
- jdupes takes 0.165s

- fdupes 需要 2m 47.082s
- findup 需要 13.556s
- jdupes 需要 0.165s
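For reference, the invocations behind these timings would look roughly like this (a sketch: findup's install path varies by distro, and the exact flags depend on the installed versions):

# fdupes and jdupes share a similar command line; findup ships with fslint.
fdupes -r /path/to/folder
/usr/share/fslint/fslint/findup /path/to/folder
jdupes -r /path/to/folder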
However, my experience is that if your folder is too large, the time can still become very long (hours, if not days), since pairwise comparison (or sorting at best) and extremely memory-hungry operations soon become unbearably slow. Running a task like this on an entire disk is out of the question.
不过,以我的经验,如果文件夹太大,耗时仍然可能非常长(几小时甚至几天),因为两两比较(或者充其量是排序)以及极度消耗内存的操作很快就会慢得难以忍受。在整个磁盘上跑这样的任务是不现实的。
回答by jlaraval
If you can't use *dupes for any reason and the number of files is very high, the sort+uniq approach won't perform well. In this case you could use something like this:
如果你因为某些原因不能使用 *dupes 这类工具,而且文件数量非常多,那么 sort+uniq 的方式性能不会很好。这种情况下你可以用类似下面的做法:
find . -not -empty -type f -printf "%012s" -exec md5sum {} \; | awk 'x[substr($0, 1, 44)]++'
find will create a line for each file with the file size in bytes (I used 12 positions, but YMMV) followed by the md5 hash of the file (plus the name). awk will filter the results without needing them to be sorted first. The 44 stands for 12 (for the file size) + 32 (the length of the hash). If you need some explanation of the awk program, you can see the basics here.
find 会为每个文件生成一行,内容是以字节为单位的文件大小(我用了 12 位,具体可按需调整),后面跟着文件的 md5 哈希(以及文件名)。awk 无需事先排序就能过滤出结果。44 等于 12(文件大小)加 32(哈希长度)。如果你需要有关这个 awk 程序的说明,可以在此处查看基础知识。
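To make the 12 + 32 arithmetic concrete, here is a sketch of the same command with the duplicate test spelled out; the field layout (a 12-character padded size immediately followed by md5sum's 32-character hash) is taken from the explanation above:

# Characters 1-44 of each line form "<12-char size><32-char md5>", which
# identifies a (size, hash) pair; print a line only if that pair was seen before.
find . -not -empty -type f -printf "%012s" -exec md5sum {} \; \
  | awk '{ key = substr($0, 1, 44) } seen[key]++ { print }'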
回答by anubhava
You can make use of cmp to compare file size like this:
您可以使用 cmp 来比较文件大小,如下所示:

#!/bin/bash

folder1=""
folder2=""
log=~/log.txt

for i in "$folder1"/*; do
    # Use just the file's base name so it can be looked up in both folders.
    filename="${i##*/}"
    cmp --silent "$folder1/$filename" "$folder2/$filename" && echo "$filename" >> "$log"
done