bash 使用bash根据md5查找重复文件
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/19551908/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
Finding duplicate files according to md5 with bash
提问by user2913020
I want to write an algorithm in bash that finds duplicate files.
我想用 bash 写一个能找出重复文件的算法。
How can I add a size option?
如何添加按文件大小比较的选项?
回答by Gilles Quenot
Don't reinvent the wheel, use the proper command:
不要重新发明轮子,使用正确的命令:
fdupes -r dir
See http://code.google.com/p/fdupes/ (packaged on some Linux distros)
请参阅 http://code.google.com/p/fdupes/(一些 Linux 发行版已提供该软件包)
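As a quick usage sketch (assuming a typical fdupes build; -r recurses and -S additionally prints the size of each duplicate set, which covers the size part of the question):

# Recurse into dir and show each group of duplicates together with its size.
fdupes -r -S dir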
回答by Alex Atkinson
find . -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |\
xargs -I{} -n1 find . -type f -size {}c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate
This is how you'd want to do it. This code locates dupes based on size first, then on the MD5 hash. Note the use of -size, in relation to your question. Enjoy. It assumes you want to search in the current directory; if not, change the find . to point at the directory (or directories) you'd like to search.
这就是你想要的做法。这段代码先按文件大小定位重复项,再按 MD5 哈希确认。请注意其中 -size 的使用,这正与你的问题相关。它假设你要在当前目录中搜索;如果不是,请把 find . 改成你想搜索的目录。
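For readability, the same pipeline can be written out with each stage commented, as in this sketch (it assumes GNU find, xargs, sort, uniq and md5sum, just like the original one-liner):

# 1. Print the size in bytes of every non-empty regular file.
# 2. Keep only sizes that occur more than once (candidate duplicates).
# 3. For each such size, find the files of exactly that size and hash them.
# 4. Sort by hash and print only the groups that share the same 32-char md5.
find . -not -empty -type f -printf "%s\n" \
  | sort -rn \
  | uniq -d \
  | xargs -I{} -n1 find . -type f -size {}c -print0 \
  | xargs -0 md5sum \
  | sort \
  | uniq -w32 --all-repeated=separate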
回答by Drake Clarris
find /path/to/folder1 /path/to/folder2 -type f -printf "%f %s\n" | sort | uniq -d
The find command looks in the two folders for files, prints only the file name (stripping leading directories) and the size, then sorts and shows only the duplicates. This does assume there are no newlines in the file names.
find 命令在两个文件夹中查找文件,只打印文件名(去掉前导目录)和文件大小,然后排序并只显示重复项。前提是文件名中不包含换行符。
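If file names might contain newlines (the caveat above), a NUL-delimited variant along these lines should cope; this is a sketch that assumes GNU find, sort -z and uniq --zero-terminated:

# Same idea, but records are NUL-terminated so newlines in names are harmless.
find /path/to/folder1 /path/to/folder2 -type f -printf "%f %s\0" \
  | sort -z \
  | uniq -dz \
  | tr '\0' '\n'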
回答by Ondra Žižka
Normally I use fdupes -r -S . for this. But when I search for duplicates among a smaller number of very large files, fdupes takes very long to finish, as (I guess) it computes a full checksum of each whole file.
通常我用 fdupes -r -S . 来做这件事。但当我在数量不多、体积很大的文件中查找重复时,fdupes 要很久才能完成,因为(我猜)它会对整个文件计算完整的校验和。
I've avoided that by comparing only the first 1 megabyte. It's not super-safe, and you have to check whether it's really a duplicate if you want to be 100% sure. But the chance of two different videos (my case) having the same first megabyte but different content further on is rather theoretical.
我通过只比较前 1 兆字节来避免这个问题。这并不十分保险,如果想 100% 确定,还需要再检查它是否真的是重复文件。但两个不同的视频(我的情况)前 1 兆字节相同而后面内容不同的可能性基本只存在于理论上。
So I have written this script. Another trick it uses to speed things up is that it stores the resulting hash for each path in a file. I rely on the fact that the files don't change.
所以我写了这个脚本。它用来提速的另一个技巧是把每个路径对应的哈希结果存到一个文件里。我依赖的前提是这些文件不会改变。
I paste this code into a console rather than running it as a script - for that it would need some more work, but it gives you the idea:
我是把这段代码粘贴到控制台里用的,而不是作为脚本运行 - 要那样用还需要再完善一些,但思路就是这样:
# Ensure the cache file exists so the greps below don't fail on the first run.
touch md5-partial.txt
find -type f -size +3M -print0 | while IFS= read -r -d '' i; do
  echo -n '.'
  # Skip paths that are already recorded in the cache file.
  if grep -q "$i" md5-partial.txt; then
    echo -n ':'  #-e "\n$i ---- Already counted, skipping."
    continue
  fi
  # Hash only the first 1 MB of the file.
  MD5=`dd bs=1M count=1 if="$i" status=none | md5sum`
  MD5=`echo $MD5 | cut -d' ' -f1`
  if grep "$MD5" md5-partial.txt; then echo -e "Duplicate: $i"; fi
  echo "$MD5" "$i" >> md5-partial.txt
done
## Show the duplicates
#sort md5-partial.txt | uniq --check-chars=32 -d -c | sort -b -n | cut -c 9-40 | xargs -I '{}' sh -c "grep '{}' md5-partial.txt && echo"
Another bash snippet which I use to determine the largest duplicate files:
另一个我用来找出最大的重复文件的 bash 片段:
## Show wasted space
if true; then
  sort md5-partial.txt | uniq --check-chars=32 -d -c | while IFS= read -r LINE; do
    HASH=`echo "$LINE" | cut -c 9-40`
    # The path starts after the 8-char count prefix, the 32-char hash and a space.
    FILE=`echo "$LINE" | cut -c 42-`
    ls -l "$FILE" | cut -c 26-34
  done
fi
Both these scripts have a lot of room for improvement, feel free to contribute - here is the gist :)
这两个脚本都还有很大的改进空间,欢迎贡献 - 这是 gist :)
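The same first-1-MiB idea can also be kept in a single self-contained loop, as in this sketch (the 3M threshold and 1 MiB prefix are carried over from the script above; head -c replaces dd, and an in-memory seen array replaces the md5-partial.txt cache):

#!/bin/bash
# Report files over 3M whose first 1 MiB hashes identically; candidates still
# need a full comparison (e.g. cmp) before being treated as true duplicates.
declare -A seen
while IFS= read -r -d '' f; do
    h=$(head -c 1M -- "$f" | md5sum | cut -d' ' -f1)
    if [[ -n "${seen[$h]:-}" ]]; then
        printf 'Possible duplicate:\n  %s\n  %s\n' "${seen[$h]}" "$f"
    else
        seen[$h]=$f
    fi
done < <(find . -type f -size +3M -print0)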
回答by Peacher Wu
This might be a late answer, but there are much faster alternatives to fdupes now.
这可能是一个迟到的回答,但现在已经有比 fdupes 快得多的替代工具。
- fslint/findup
- jdupes, which is supposed to be a faster replacement for fdupes
I have had the time to do a small test. For a folder with 54,000 files of a total size 17G, on a standard (8 vCPU/30G) Google Virtual Machine:
我有时间做一个小测试。对于包含 54,000 个文件、总大小为 17G 的文件夹,在标准 (8 vCPU/30G) 谷歌虚拟机上:
- fdupes takes 2m 47.082s
- findup takes 13.556s
- jdupes takes 0.165s

- fdupes 需要 2m 47.082s
- findup 需要 13.556s
- jdupes 需要 0.165s
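For reference, the invocations behind these timings would look roughly like this (a sketch: findup's install path varies by distro, and the exact flags depend on the installed versions):

# fdupes and jdupes share a similar command line; findup ships with fslint.
fdupes -r /path/to/folder
/usr/share/fslint/fslint/findup /path/to/folder
jdupes -r /path/to/folder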
However, my experience is that if your folder is too large, the time can still become very long (hours, if not days), since pairwise comparison (or sorting at best) and extremely memory-hungry operations soon become unbearably slow. Running a task like this on an entire disk is out of the question.
不过,以我的经验,如果文件夹太大,耗时仍然可能非常长(几小时甚至几天),因为两两比较(或者充其量是排序)以及极度消耗内存的操作很快就会慢得难以忍受。在整个磁盘上跑这样的任务是不现实的。
回答by jlaraval
If you can't use *dupes for any reason and the number of files is very high, the sort+uniq approach won't perform well. In this case you could use something like this:
如果你因为某些原因不能使用 *dupes 这类工具,而且文件数量非常多,那么 sort+uniq 的方式性能不会很好。这种情况下你可以用类似下面的做法:
find . -not -empty -type f -printf "%012s" -exec md5sum {} \; | awk 'x[substr($0, 1, 44)]++'
find will create a line for each file with the file size in bytes (I used 12 positions, but YMMV) followed by the md5 hash of the file (plus the name). awk will filter the results without needing them to be sorted first. The 44 stands for 12 (for the file size) + 32 (the length of the hash). If you need some explanation of the awk program, you can see the basics here.
find 会为每个文件生成一行,内容是以字节为单位的文件大小(我用了 12 位,具体可按需调整),后面跟着文件的 md5 哈希(以及文件名)。awk 无需事先排序就能过滤出结果。44 等于 12(文件大小)加 32(哈希长度)。如果你需要有关这个 awk 程序的说明,可以在此处查看基础知识。
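To make the 12 + 32 arithmetic concrete, here is a sketch of the same command with the duplicate test spelled out; the field layout (a 12-character padded size immediately followed by md5sum's 32-character hash) is taken from the explanation above:

# Characters 1-44 of each line form "<12-char size><32-char md5>", which
# identifies a (size, hash) pair; print a line only if that pair was seen before.
find . -not -empty -type f -printf "%012s" -exec md5sum {} \; \
  | awk '{ key = substr($0, 1, 44) } seen[key]++ { print }'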
回答by anubhava
You can make use of cmp to compare file size like this:
您可以使用 cmp 来比较文件大小,如下所示:

#!/bin/bash

folder1=""
folder2=""
log=~/log.txt

for i in "$folder1"/*; do
    # Use just the file's base name so it can be looked up in both folders.
    filename="${i##*/}"
    cmp --silent "$folder1/$filename" "$folder2/$filename" && echo "$filename" >> "$log"
done