bash: How to find duplicate files with the same name but in different case in the same directory in Linux?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/2109056/
How to find duplicate files with same name but in different case that exist in same directory in Linux?
Asked by Camsoft
How can I return a list of files that are named duplicates i.e. have same name but in different case that exist in the same directory?
I don't care about the contents of the files. I just need to know the location and name of any files that have a duplicate of the same name.
Example duplicates:
/www/images/taxi.jpg
/www/images/Taxi.jpg
Ideally I need to search all files recursively from a base directory. In the above example it was /www/
Answered by Christoffer Hammarström
The other answer is great, but instead of the "rather monstrous" perl script I suggest
perl -pe 's!([^/]+)$!lc $1!e'
Which will lowercase just the filename part of the path.
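For illustration, here is what that substitution does to one of the paths from the question (a quick sketch using echo; only the filename component is lowercased):

$ echo '/www/images/Taxi.jpg' | perl -pe 's!([^/]+)$!lc $1!e'
/www/images/taxi.jpg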
Edit 1: In fact the entire problem can be solved with:
find . | perl -ne 's!([^/]+)$!lc $1!e; print if 1 == $seen{$_}++'
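A quick sketch of what this prints for the /www example from the question (note that the duplicate is reported in its lowercased form, because $_ has already been rewritten by the substitution):

$ find /www | perl -ne 's!([^/]+)$!lc $1!e; print if 1 == $seen{$_}++'
/www/images/taxi.jpg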
Edit 3: I found a solution using sed, sort and uniq that also will print out the duplicates, but it only works if there are no whitespaces in filenames:
find . | sed 's,\(.*\)/\(.*\)$,\1/\2\t\1/\L\2,' | sort | uniq -D -f 1 | cut -f 1
Edit 2: And here is a longer script that will print out the names. It takes a list of paths on stdin, as given by find. Not so elegant, but still:
#!/usr/bin/perl -w

use strict;
use warnings;

my %dup_series_per_dir;
while (<>) {
    my ($dir, $file) = m!(.*/)?([^/]+?)$!;
    push @{$dup_series_per_dir{$dir||'./'}{lc $file}}, $file;
}

for my $dir (sort keys %dup_series_per_dir) {
    my @all_dup_series_in_dir = grep { @{$_} > 1 } values %{$dup_series_per_dir{$dir}};
    for my $one_dup_series (@all_dup_series_in_dir) {
        print "$dir\{" . join(',', sort @{$one_dup_series}) . "}\n";
    }
}
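A minimal usage sketch, feeding it the example tree from the question (the file name dups_by_dir.pl is made up here; the script reads the list of paths from stdin):

$ find /www | perl dups_by_dir.pl
/www/images/{Taxi.jpg,taxi.jpg}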
Answered by paxdiablo
Try:
ls -1 | tr '[A-Z]' '[a-z]' | sort | uniq -c | grep -v " 1 "
Simple, really :-) Aren't pipelines wonderful beasts?
The ls -1 gives you the files one per line, the tr '[A-Z]' '[a-z]' converts all uppercase to lowercase, the sort sorts them (surprisingly enough), uniq -c removes subsequent occurrences of duplicate lines whilst giving you a count as well and, finally, the grep -v " 1 " strips out those lines where the count was one.
When I run this in a directory with one "duplicate" (I copied qq to qQ), I get:
2 qq
For the "this directory and every subdirectory" version, just replace ls -1
with find .
or find DIRNAME
if you want a specific directory starting point (DIRNAME
is the directory name you want to use).
对于“此目录和每个子目录”版本,只需替换ls -1
为find .
或者find DIRNAME
如果您想要一个特定的目录起点(DIRNAME
是您要使用的目录名称)。
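For instance, the recursive form of that pipeline, spelled out (the original answer only describes the substitution, so this exact command line is an assumption):

find . | tr '[A-Z]' '[a-z]' | sort | uniq -c | grep -v " 1 "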
This returns (for me):
2 ./.gconf/system/gstreamer/0.10/audio/profiles/mp3
2 ./.gconf/system/gstreamer/0.10/audio/profiles/mp3/%gconf.xml
2 ./.gnome2/accels/blackjack
2 ./qq
which are caused by:
pax> ls -1d .gnome2/accels/[bB]* .gconf/system/gstreamer/0.10/audio/profiles/[mM]* [qQ]?
.gconf/system/gstreamer/0.10/audio/profiles/mp3
.gconf/system/gstreamer/0.10/audio/profiles/MP3
.gnome2/accels/blackjack
.gnome2/accels/Blackjack
qq
qQ
Update:
Actually, on further reflection, the tr will lowercase all components of the path so that both of
/a/b/c
/a/B/c
will be considered duplicates even though they're in different directories.
If you only want duplicates within a single directory to show as a match, you can use the (rather monstrous):
perl -ne '
    chomp;
    @flds = split (/\//);
    $lstf = $flds[-1];
    $lstf =~ tr/A-Z/a-z/;
    for ($i = 0; $i ne $#flds; $i++) {
        print "$flds[$i]/";
    };
    print "$lstf\n";'
in place of:
tr '[A-Z]' '[a-z]'
What it does is to only lowercase the final portion of the pathname rather than the whole thing. In addition, if you only want regular files (no directories, FIFOs and so forth), use find -type f to restrict what's returned.
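Putting the two tweaks together (regular files only, and lowercasing only the last path component), one possible combined pipeline is the following sketch; it borrows the shorter perl -pe substitution from the first answer instead of the snippet above:

find . -type f | perl -pe 's!([^/]+)$!lc $1!e' | sort | uniq -c | grep -v " 1 "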
Answered by mpez0
I believe
ls | sort -f | uniq -i -d
is simpler, faster, and will give the same result
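For the /www/images directory from the question, this would be expected to print something like the following (a sketch; exactly which of the two spellings is echoed depends on how sort orders the pair):

$ cd /www/images && ls | sort -f | uniq -i -d
Taxi.jpg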
Answered by Alain
Following up on the response of mpez0, to detect recursively just replace "ls" by "find .". The only problem I see with this is that if it is a directory that is duplicated, then you get 1 entry for each file in that directory. Some human brain is required to interpret the output.
But anyway, you're not automatically deleting these files, are you?
find . | sort -f | uniq -i -d
Answered by user1639307
This is a nice little command line app called findsn that you get if you compile fslint; the deb package does not include it.
It will find any files with the same name, and it's lightning fast and can handle different case.
/findsn --help
find (files) with duplicate or conflicting names.
Usage: findsn [-A -c -C] [[-r] [-f] paths(s) ...]
If no arguments are supplied the $PATH is searched for any redundant or conflicting files.
-A reports all aliases (soft and hard links) to files.
If no path(s) specified then the $PATH is searched.
If only path(s) specified then they are checked for duplicate named files. You can qualify this with -C to ignore case in this search. Qualifying with -c is more restrictive as only files (or directories) in the same directory whose names differ only in case are reported. I.E. -c will flag files & directories that will conflict if transfered to a case insensitive file system. Note if -c or -C specified and no path(s) specified the current directory is assumed.
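So, going by the help text above, a case-clash check of the question's tree would presumably be run as:

findsn -c /www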
Answered by noclayto
Here is an example of how to find all duplicate jar files:
find . -type f -name "*.jar" -printf "%f\n" | sort -f | uniq -i -d
Replace *.jar with whatever duplicate file type you are looking for.
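The same pattern works for any extension; for example, to look for duplicate image files instead (again relying on GNU find's -printf):

find . -type f -name "*.jpg" -printf "%f\n" | sort -f | uniq -i -d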
Answered by crafter
Here's a script that worked for me (I am not the author). The original and discussion can be found here: http://www.daemonforums.org/showthread.php?t=4661
#! /bin/sh
# find duplicated files in directory tree
# comparing by file NAME, SIZE or MD5 checksum
# --------------------------------------------
# LICENSE(s): BSD / CDDL
# --------------------------------------------
# vermaden [AT] interia [DOT] pl
# http://strony.toya.net.pl/~vermaden/links.htm

__usage() {
  echo "usage: $( basename ${0} ) OPTION DIRECTORY"
  echo "  OPTIONS: -n   check by name (fast)"
  echo "           -s   check by size (medium)"
  echo "           -m   check by md5 (slow)"
  echo "           -N   same as '-n' but with delete instructions printed"
  echo "           -S   same as '-s' but with delete instructions printed"
  echo "           -M   same as '-m' but with delete instructions printed"
  echo "  EXAMPLE: $( basename ${0} ) -s /mnt"
  exit 1
}

__prefix() {
  case $( id -u ) in
    (0) PREFIX="rm -rf" ;;
    (*) case $( uname ) in
          (SunOS) PREFIX="pfexec rm -rf" ;;
          (*)     PREFIX="sudo rm -rf" ;;
        esac
        ;;
  esac
}

__crossplatform() {
  case $( uname ) in
    (FreeBSD)
      MD5="md5 -r"
      STAT="stat -f %z"
      ;;
    (Linux)
      MD5="md5sum"
      STAT="stat -c %s"
      ;;
    (SunOS)
      echo "INFO: supported systems: FreeBSD Linux"
      echo
      echo "Porting to Solaris/OpenSolaris"
      echo "  -- provide values for MD5/STAT in '$( basename ${0} ):__crossplatform()'"
      echo "  -- use digest(1) instead for md5 sum calculation"
      echo "     $ digest -a md5 file"
      echo "  -- pfexec(1) is already used in '$( basename ${0} ):__prefix()'"
      echo
      exit 1
      ;;
    (*)
      echo "INFO: supported systems: FreeBSD Linux"
      exit 1
      ;;
  esac
}

__md5() {
  __crossplatform
  :> ${DUPLICATES_FILE}
  DATA=$( find "${1}" -type f -exec ${MD5} {} ';' | sort -n )
  echo "${DATA}" \
    | awk '{print $1}' \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
        echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}
      done
  echo "${DATA}" \
    | awk '{print $1}' \
    | sort -n \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
        echo "count: ${COUNT} | md5: ${SUM}"
        grep ${SUM} ${DUPLICATES_FILE} \
          | cut -d ' ' -f 2-10000 2> /dev/null \
          | while read LINE
            do
              if [ -n "${PREFIX}" ]
              then
                echo "  ${PREFIX} \"${LINE}\""
              else
                echo "  ${LINE}"
              fi
            done
        echo
      done
  rm -rf ${DUPLICATES_FILE}
}

__size() {
  __crossplatform
  find "${1}" -type f -exec ${STAT} {} ';' \
    | sort -n \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SIZE=$( echo ${LINE} | awk '{print $2}' )
        SIZE_KB=$( echo ${SIZE} / 1024 | bc )
        echo "count: ${COUNT} | size: ${SIZE_KB}KB (${SIZE} bytes)"
        if [ -n "${PREFIX}" ]
        then
          find ${1} -type f -size ${SIZE}c -exec echo "  ${PREFIX} \"{}\"" ';'
        else
          # find ${1} -type f -size ${SIZE}c -exec echo "  {}  " ';' -exec du -h "  {}" ';'
          find ${1} -type f -size ${SIZE}c -exec echo "  {}  " ';'
        fi
        echo
      done
}

__file() {
  __crossplatform
  find "${1}" -type f \
    | xargs -n 1 basename 2> /dev/null \
    | tr '[A-Z]' '[a-z]' \
    | sort -n \
    | uniq -c \
    | sort -n -r \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && break
        FILE=$( echo ${LINE} | cut -d ' ' -f 2-10000 2> /dev/null )
        echo "count: ${COUNT} | file: ${FILE}"
        FILE=$( echo ${FILE} | sed -e s/'\['/'\\['/g -e s/'\]'/'\\]'/g )
        if [ -n "${PREFIX}" ]
        then
          find ${1} -iname "${FILE}" -exec echo "  ${PREFIX} \"{}\"" ';'
        else
          find ${1} -iname "${FILE}" -exec echo "  {}" ';'
        fi
        echo
      done
}

# main()

[ ${#} -ne 2  ] && __usage
[ ! -d "${2}" ] && __usage

DUPLICATES_FILE="/tmp/$( basename ${0} )_DUPLICATES_FILE.tmp"

case ${1} in
  (-n)           __file "${2}" ;;
  (-m)           __md5  "${2}" ;;
  (-s)           __size "${2}" ;;
  (-N) __prefix; __file "${2}" ;;
  (-M) __prefix; __md5  "${2}" ;;
  (-S) __prefix; __size "${2}" ;;
  (*)  __usage ;;
esac
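For example, assuming the script above is saved as duplicated_files.sh and made executable (the file name is an assumption here), a name-based scan of the question's tree would be:

./duplicated_files.sh -n /www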
If the find command is not working for you, you may have to change it. For example:

OLD : find "${1}" -type f | xargs -n 1 basename
NEW : find "${1}" -type f -printf "%f\n"

Answered by crafter
You can use:

find -type f -exec readlink -m {} \; | gawk 'BEGIN{FS="/";OFS="/"}{$NF=tolower($NF);print}' | uniq -c

Where:
find -type f : recursively print every file's full path.
-exec readlink -m {} \; : get the file's absolute path.
gawk 'BEGIN{FS="/";OFS="/"}{$NF=tolower($NF);print}' : convert the filename part to lower case.
uniq -c : collapse identical paths; -c outputs the count of duplicates.
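Note that uniq -c only counts adjacent identical lines, so if the find output is not already grouped you may want a sort in front of it, plus a filter for counts above one; a possible variant (a sketch, not part of the original answer):

find -type f -exec readlink -m {} \; | gawk 'BEGIN{FS="/";OFS="/"}{$NF=tolower($NF);print}' | sort | uniq -c | awk '$1 > 1'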
Answered by serg10
A little bit late to this one, but here's the version I went with:

find . -type f | awk -F/ '{print $NF}' | sort -f | uniq -i -d

Here we are using:
find - find all files under the current dir
awk - remove the file path part of the filename
sort - sort case insensitively
uniq - find the dupes from what makes it through the pipe
(Inspired by @mpez0 answer, and @SimonDowdles comment on @paxdiablo answer.)
Answered by fedorqui 'SO stop harming'
You can check duplicates in a given directory with GNU awk:
gawk 'BEGINFILE {if ((seen[tolower(FILENAME)]++)) print FILENAME; nextfile}' *

This uses BEGINFILE to perform some action before going on and reading a file. In this case, it keeps track of the names that have appeared in an array seen[] whose indexes are the names of the files in lowercase.
If a name has already appeared, no matter its case, it prints it. Otherwise, it just jumps to the next file.
See an example:
$ tree
.
├── bye.txt
├── hello.txt
├── helLo.txt
├── yeah.txt
└── YEAH.txt

0 directories, 5 files
$ gawk 'BEGINFILE {if ((a[tolower(FILENAME)]++)) print FILENAME; nextfile}' *
helLo.txt
YEAH.txt