bash: How to find duplicate files with the same name but in different case in the same directory in Linux?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/2109056/
How to find duplicate files with same name but in different case that exist in same directory in Linux?
Asked by Camsoft
How can I return a list of files that are named duplicates i.e. have same name but in different case that exist in the same directory?
I don't care about the contents of the files. I just need to know the location and name of any files that have a duplicate of the same name.
Example duplicates:
/www/images/taxi.jpg
/www/images/Taxi.jpg
Ideally I need to search all files recursively from a base directory. In the above example it was /www/
Answered by Christoffer Hammarström
The other answer is great, but instead of the "rather monstrous" perl script I suggest
perl -pe 's!([^/]+)$!lc $1!e'
Which will lowercase just the filename part of the path.
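For illustration, here is what that substitution does to one of the paths from the question (a quick sketch using echo; only the filename component is lowercased):

$ echo '/www/images/Taxi.jpg' | perl -pe 's!([^/]+)$!lc $1!e'
/www/images/taxi.jpg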
Edit 1: In fact the entire problem can be solved with:
find . | perl -ne 's!([^/]+)$!lc $1!e; print if 1 == $seen{$_}++'
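A quick sketch of what this prints for the /www example from the question (note that the duplicate is reported in its lowercased form, because $_ has already been rewritten by the substitution):

$ find /www | perl -ne 's!([^/]+)$!lc $1!e; print if 1 == $seen{$_}++'
/www/images/taxi.jpg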
Edit 3: I found a solution using sed, sort and uniq that also will print out the duplicates, but it only works if there are no whitespaces in filenames:
find . | sed 's,\(.*\)/\(.*\)$,\1/\2\t\1/\L\2,' | sort | uniq -D -f 1 | cut -f 1
Edit 2: And here is a longer script that will print out the names. It takes a list of paths on stdin, as given by find. Not so elegant, but still:
#!/usr/bin/perl -w

use strict;
use warnings;

my %dup_series_per_dir;
while (<>) {
    my ($dir, $file) = m!(.*/)?([^/]+?)$!;
    push @{$dup_series_per_dir{$dir||'./'}{lc $file}}, $file;
}

for my $dir (sort keys %dup_series_per_dir) {
    my @all_dup_series_in_dir = grep { @{$_} > 1 } values %{$dup_series_per_dir{$dir}};
    for my $one_dup_series (@all_dup_series_in_dir) {
        print "$dir\{" . join(',', sort @{$one_dup_series}) . "}\n";
    }
}
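A minimal usage sketch, feeding it the example tree from the question (the file name dups_by_dir.pl is made up here; the script reads the list of paths from stdin):

$ find /www | perl dups_by_dir.pl
/www/images/{Taxi.jpg,taxi.jpg}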
Answered by paxdiablo
Try:
ls -1 | tr '[A-Z]' '[a-z]' | sort | uniq -c | grep -v " 1 "
Simple, really :-) Aren't pipelines wonderful beasts?
The ls -1 gives you the files one per line, the tr '[A-Z]' '[a-z]' converts all uppercase to lowercase, the sort sorts them (surprisingly enough), uniq -c removes subsequent occurrences of duplicate lines whilst giving you a count as well and, finally, the grep -v " 1 " strips out those lines where the count was one.
When I run this in a directory with one "duplicate" (I copied qq to qQ), I get:
2 qq
For the "this directory and every subdirectory" version, just replace ls -1
with find .
or find DIRNAME
if you want a specific directory starting point (DIRNAME
is the directory name you want to use).
对于“此目录和每个子目录”版本,只需替换ls -1
为find .
或者find DIRNAME
如果您想要一个特定的目录起点(DIRNAME
是您要使用的目录名称)。
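For instance, the recursive form of that pipeline, spelled out (the original answer only describes the substitution, so this exact command line is an assumption):

find . | tr '[A-Z]' '[a-z]' | sort | uniq -c | grep -v " 1 "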
This returns (for me):
2 ./.gconf/system/gstreamer/0.10/audio/profiles/mp3
2 ./.gconf/system/gstreamer/0.10/audio/profiles/mp3/%gconf.xml
2 ./.gnome2/accels/blackjack
2 ./qq
which are caused by:
pax> ls -1d .gnome2/accels/[bB]* .gconf/system/gstreamer/0.10/audio/profiles/[mM]* [qQ]?
.gconf/system/gstreamer/0.10/audio/profiles/mp3
.gconf/system/gstreamer/0.10/audio/profiles/MP3
.gnome2/accels/blackjack
.gnome2/accels/Blackjack
qq
qQ
Update:
Actually, on further reflection, the tr will lowercase all components of the path so that both of
/a/b/c
/a/B/c
will be considered duplicates even though they're in different directories.
If you only want duplicates within a single directory to show as a match, you can use the (rather monstrous):
perl -ne '
    chomp;
    @flds = split (/\//);
    $lstf = $flds[-1];
    $lstf =~ tr/A-Z/a-z/;
    for ($i = 0; $i ne $#flds; $i++) {
        print "$flds[$i]/";
    };
    print "$lstf\n";'
in place of:
tr '[A-Z]' '[a-z]'
What it does is to only lowercase the final portion of the pathname rather than the whole thing. In addition, if you only want regular files (no directories, FIFOs and so forth), use find -type f to restrict what's returned.
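Putting the two tweaks together (regular files only, and lowercasing only the last path component), one possible combined pipeline is the following sketch; it borrows the shorter perl -pe substitution from the first answer instead of the snippet above:

find . -type f | perl -pe 's!([^/]+)$!lc $1!e' | sort | uniq -c | grep -v " 1 "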
Answered by mpez0
I believe
ls | sort -f | uniq -i -d
is simpler, faster, and will give the same result
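For the /www/images directory from the question, this would be expected to print something like the following (a sketch; exactly which of the two spellings is echoed depends on how sort orders the pair):

$ cd /www/images && ls | sort -f | uniq -i -d
Taxi.jpg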
Answered by Alain
Following up on the response of mpez0, to detect recursively just replace "ls" by "find .". The only problem I see with this is that if it is a directory that is duplicated, then you get 1 entry for each file in that directory. Some human brain is required to interpret the output.
But anyway, you're not automatically deleting these files, are you?
find . | sort -f | uniq -i -d
Answered by user1639307
This is a nice little command line app called findsn that you get if you compile fslint; the deb package does not include it.
It will find any files with the same name, and it's lightning fast and can handle different case.
/findsn --help
find (files) with duplicate or conflicting names.
Usage: findsn [-A -c -C] [[-r] [-f] paths(s) ...]
If no arguments are supplied the $PATH is searched for any redundant or conflicting files.
-A reports all aliases (soft and hard links) to files.
If no path(s) specified then the $PATH is searched.
If only path(s) specified then they are checked for duplicate named files. You can qualify this with -C to ignore case in this search. Qualifying with -c is more restrictive as only files (or directories) in the same directory whose names differ only in case are reported. I.E. -c will flag files & directories that will conflict if transfered to a case insensitive file system. Note if -c or -C specified and no path(s) specified the current directory is assumed.
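So, going by the help text above, a case-clash check of the question's tree would presumably be run as:

findsn -c /www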
Answered by noclayto
Here is an example of how to find all duplicate jar files:
find . -type f -name "*.jar" -printf "%f\n" | sort -f | uniq -i -d
Replace *.jar with whatever duplicate file type you are looking for.
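The same pattern works for any extension; for example, to look for duplicate image files instead (again relying on GNU find's -printf):

find . -type f -name "*.jpg" -printf "%f\n" | sort -f | uniq -i -d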
Answered by crafter
Here's a script that worked for me (I am not the author). The original and discussion can be found here: http://www.daemonforums.org/showthread.php?t=4661
#! /bin/sh
# find duplicated files in directory tree
# comparing by file NAME, SIZE or MD5 checksum
# --------------------------------------------
# LICENSE(s): BSD / CDDL
# --------------------------------------------
# vermaden [AT] interia [DOT] pl
# http://strony.toya.net.pl/~vermaden/links.htm

__usage() {
  echo "usage: $( basename ${0} ) OPTION DIRECTORY"
  echo "  OPTIONS: -n   check by name (fast)"
  echo "           -s   check by size (medium)"
  echo "           -m   check by md5 (slow)"
  echo "           -N   same as '-n' but with delete instructions printed"
  echo "           -S   same as '-s' but with delete instructions printed"
  echo "           -M   same as '-m' but with delete instructions printed"
  echo "  EXAMPLE: $( basename ${0} ) -s /mnt"
  exit 1
}

__prefix() {
  case $( id -u ) in
    (0) PREFIX="rm -rf" ;;
    (*) case $( uname ) in
          (SunOS) PREFIX="pfexec rm -rf" ;;
          (*)     PREFIX="sudo rm -rf" ;;
        esac
        ;;
  esac
}

__crossplatform() {
  case $( uname ) in
    (FreeBSD)
      MD5="md5 -r"
      STAT="stat -f %z"
      ;;
    (Linux)
      MD5="md5sum"
      STAT="stat -c %s"
      ;;
    (SunOS)
      echo "INFO: supported systems: FreeBSD Linux"
      echo
      echo "Porting to Solaris/OpenSolaris"
      echo "  -- provide values for MD5/STAT in '$( basename ${0} ):__crossplatform()'"
      echo "  -- use digest(1) instead for md5 sum calculation"
      echo "     $ digest -a md5 file"
      echo "  -- pfexec(1) is already used in '$( basename ${0} ):__prefix()'"
      echo
      exit 1
      ;;
    (*)
      echo "INFO: supported systems: FreeBSD Linux"
      exit 1
      ;;
  esac
}

__md5() {
  __crossplatform
  :> ${DUPLICATES_FILE}
  DATA=$( find "${1}" -type f -exec ${MD5} {} ';' | sort -n )
  echo "${DATA}" \
    | awk '{print $1}' \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
        echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}
      done
  echo "${DATA}" \
    | awk '{print $1}' \
    | sort -n \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
        echo "count: ${COUNT} | md5: ${SUM}"
        grep ${SUM} ${DUPLICATES_FILE} \
          | cut -d ' ' -f 2-10000 2> /dev/null \
          | while read LINE
            do
              if [ -n "${PREFIX}" ]
              then
                echo "  ${PREFIX} \"${LINE}\""
              else
                echo "  ${LINE}"
              fi
            done
        echo
      done
  rm -rf ${DUPLICATES_FILE}
}

__size() {
  __crossplatform
  find "${1}" -type f -exec ${STAT} {} ';' \
    | sort -n \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SIZE=$( echo ${LINE} | awk '{print $2}' )
        SIZE_KB=$( echo ${SIZE} / 1024 | bc )
        echo "count: ${COUNT} | size: ${SIZE_KB}KB (${SIZE} bytes)"
        if [ -n "${PREFIX}" ]
        then
          find ${1} -type f -size ${SIZE}c -exec echo "  ${PREFIX} \"{}\"" ';'
        else
          # find ${1} -type f -size ${SIZE}c -exec echo "  {}  " ';' -exec du -h "  {}" ';'
          find ${1} -type f -size ${SIZE}c -exec echo "  {}  " ';'
        fi
        echo
      done
}

__file() {
  __crossplatform
  find "${1}" -type f \
    | xargs -n 1 basename 2> /dev/null \
    | tr '[A-Z]' '[a-z]' \
    | sort -n \
    | uniq -c \
    | sort -n -r \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && break
        FILE=$( echo ${LINE} | cut -d ' ' -f 2-10000 2> /dev/null )
        echo "count: ${COUNT} | file: ${FILE}"
        FILE=$( echo ${FILE} | sed -e s/'\['/'\\['/g -e s/'\]'/'\\]'/g )
        if [ -n "${PREFIX}" ]
        then
          find ${1} -iname "${FILE}" -exec echo "  ${PREFIX} \"{}\"" ';'
        else
          find ${1} -iname "${FILE}" -exec echo "  {}" ';'
        fi
        echo
      done
}

# main()

[ ${#} -ne 2  ] && __usage
[ ! -d "${2}" ] && __usage

DUPLICATES_FILE="/tmp/$( basename ${0} )_DUPLICATES_FILE.tmp"

case ${1} in
  (-n)           __file "${2}" ;;
  (-m)           __md5  "${2}" ;;
  (-s)           __size "${2}" ;;
  (-N) __prefix; __file "${2}" ;;
  (-M) __prefix; __md5  "${2}" ;;
  (-S) __prefix; __size "${2}" ;;
  (*)  __usage ;;
esac
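For example, assuming the script above is saved as duplicated_files.sh and made executable (the file name is an assumption here), a name-based scan of the question's tree would be:

./duplicated_files.sh -n /www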
If the find command is not working for you, you may have to change it. For example:

OLD : find "${1}" -type f | xargs -n 1 basename
NEW : find "${1}" -type f -printf "%f\n"

Answered by crafter
You can use:

find -type f -exec readlink -m {} \; | gawk 'BEGIN{FS="/";OFS="/"}{$NF=tolower($NF);print}' | uniq -c

Where:
find -type f : recursively print every file's full path.
-exec readlink -m {} \; : get the file's absolute path.
gawk 'BEGIN{FS="/";OFS="/"}{$NF=tolower($NF);print}' : convert the filename part to lower case.
uniq -c : collapse identical paths; -c outputs the count of duplicates.
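Note that uniq -c only counts adjacent identical lines, so if the find output is not already grouped you may want a sort in front of it, plus a filter for counts above one; a possible variant (a sketch, not part of the original answer):

find -type f -exec readlink -m {} \; | gawk 'BEGIN{FS="/";OFS="/"}{$NF=tolower($NF);print}' | sort | uniq -c | awk '$1 > 1'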
Answered by serg10
A little bit late to this one, but here's the version I went with:

find . -type f | awk -F/ '{print $NF}' | sort -f | uniq -i -d

Here we are using:
find - find all files under the current dir
awk - remove the file path part of the filename
sort - sort case insensitively
uniq - find the dupes from what makes it through the pipe
(Inspired by @mpez0 answer, and @SimonDowdles comment on @paxdiablo answer.)
Answered by fedorqui 'SO stop harming'
You can check duplicates in a given directory with GNU awk:
gawk 'BEGINFILE {if ((seen[tolower(FILENAME)]++)) print FILENAME; nextfile}' *

This uses BEGINFILE to perform some action before going on and reading a file. In this case, it keeps track of the names that have appeared in an array seen[] whose indexes are the names of the files in lowercase.
If a name has already appeared, no matter its case, it prints it. Otherwise, it just jumps to the next file.
See an example:
$ tree
.
├── bye.txt
├── hello.txt
├── helLo.txt
├── yeah.txt
└── YEAH.txt

0 directories, 5 files
$ gawk 'BEGINFILE {if ((a[tolower(FILENAME)]++)) print FILENAME; nextfile}' *
helLo.txt
YEAH.txt