Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/13209284/

Time: 2020-10-31 11:53:23  Source: igfitidea

I need help comparing files in a directory recursively to find duplicates

Tags: java, file-io

Asked by Kevin Bigler

I'm working on a program that will (hopefully) compare all files in a given directory, identify duplicates, add them to a list, and then display the list to the user so they can confirm which files to delete before anything is removed. I'm seriously stuck. So far I've been able to list all the files recursively, and I've been experimenting with comparing them to find the duplicates. I'm quickly realizing that to accomplish what I want, I'll need to compare more than one file attribute. Not all of the files will be text files, and most of the example code I've found online only compares text. I'm trying to learn more about binary data, because comparing byte arrays and file names is the best I could come up with.

Specifically, I'm asking: which attributes would be best to compare in order to balance accuracy in finding duplicates against being able to handle a reasonably sized directory? And, if you don't mind, how could I implement that in my code?

Hopefully my question wasn't too terrible; I'd really appreciate any help I can get. Here's what I have. And yes, I found a couple of the methods and the second file here, in case you were wondering. P.S. I'm really sorry about any pointless variables I missed; I tried to clean up the code a little before posting it.

ListFilesInDir.java

import java.io.*;
import java.nio.file.Files;
import java.nio.file.attribute.*;
import java.security.*;
import java.util.*;

public final class ListFilesInDir {

static File startingDir;

static List<File> files;
static List<File> dirs;
static TreeMap<Integer, File> duplicates;
static ArrayList<Integer> usedIndexes = new ArrayList<Integer>();
static ArrayList<File> duplicateList = new ArrayList<File>();

static File out = new File("ListDuplicateFiles.txt");
static PrintWriter output;

static int key = 0;
static String tabString;
static TreeMap<Integer, File> tMap = new TreeMap<Integer, File>();

static int num1 = 0;
static int num2 = 0;
static File value1 = null;
static File value2 = null;
static String path1 = null;
static String name1 = null;
static String path2 = null;
static String name2 = null;

public static void main(String[] args) throws FileNotFoundException {
    new ListFilesInDir(args[0]);
}

public ListFilesInDir(String string) throws FileNotFoundException {
    startingDir = new File(string);
    dirs = new ArrayList<File>();
    duplicates = new TreeMap<Integer, File>();
    output = new PrintWriter(out);

    getFiles(startingDir);
    compareFiles();
    writeDuplicateList();
}

public void getFiles(File root) throws FileNotFoundException {
    System.out.println("Adding files to list...");
    ListFilesInDir.files = getFileList(root);
    for (File file : files) {
        if (!file.isFile()) {
            System.out.println("Adding DIR: " + key + " name: " + file);
            dirs.add(file);
        } else {
            System.out.println("Adding FILE: " + key + " name: " + file);
            tMap.put(key, file);
        }
        key++;
    }
    System.out.println(dirs.size());
    System.out.println("Complete");
}

public static void compareFiles() throws FileNotFoundException {
    System.out.println("Preparing to compare files...");
    // Visit each unordered pair once: starting num2 at num1 + 1 avoids
    // comparing a file with itself or checking the same pair twice.
    for (num1 = 0; num1 < files.size(); num1++) {
        value1 = files.get(num1);
        if (!value1.isFile()) {
            continue; // the list also contains directories; skip them
        }
        for (num2 = num1 + 1; num2 < files.size(); num2++) {
            value2 = files.get(num2);
            if (!value2.isFile()) {
                continue;
            }

            System.out.println(num1 + "|" + num2 + " : " + value1
                    + " - " + value2);
            // Entries are not removed from 'files' here: removing by index
            // while looping shifts the remaining elements and skips some.
            if (CompareBinaries.fileContentsEquals(value1, value2)) {
                // Keep the first copy and report the later one as the duplicate
                addDuplicate(num2, value2);
                System.out.println("added(binary): " + num2 + ":"
                        + value2);
            } else if (value1.getName().equalsIgnoreCase(
                    value2.getName())) {
                // Same name but different content: flagged by name only
                addDuplicate(num2, value2);
                System.out.println("added(name): " + num2 + ":"
                        + value2);
            }
        }
    }
    System.out.println("Complete");

}

public static void writeDuplicateList() {
    int printKey = 0;
    for (File file : duplicateList) {
        output.printf("%03d | %s\n", printKey, file);
        System.out.printf("%03d | %s\n", printKey, file);
        printKey++;
    }

    output.append(docsInfo());
    output.flush();
    output.close();

    System.out.println("\n"+files.size()+" files in "+startingDir.getAbsolutePath() +", "+duplicateList.size()+" duplicate files.");
}

static public String docsInfo() {
    String s = "\n\n" + files.size() + " files in "
            + startingDir.getAbsolutePath() + ", " + duplicates.size()
            + " duplicate files.";
    return s;
}

static public List<File> getFileList(File file)
        throws FileNotFoundException {
    List<File> result = getUnsortedFileList(file);
    Collections.sort(result);
    return result;
}

static private List<File> getUnsortedFileList(File file)
        throws FileNotFoundException {
    List<File> result = new ArrayList<File>();
    File[] filesAndDirs = file.listFiles();
    if (filesAndDirs == null) {
        // listFiles() returns null if 'file' is not a directory
        // or if an I/O error occurs
        return result;
    }

    for (File fileList : filesAndDirs) {
        result.add(fileList);
        if (!fileList.isFile()) {
            // Recurse into subdirectories
            List<File> deeperList = getUnsortedFileList(fileList);
            result.addAll(deeperList);
        }
    }
    return result;
}

static private void validateDir(File dir) throws FileNotFoundException {
    if (dir == null)
        throw new IllegalArgumentException("Directory is null!");
    if (!dir.exists())
        throw new FileNotFoundException("Directory doesn't exist: " + dir);
    if (!dir.isDirectory())
        throw new IllegalArgumentException(dir + " is not a directory!");
    if (!dir.canRead())
        throw new IllegalArgumentException("Directory cannot be read: "
                + dir);
}

public static void addDuplicate(int i, File file) throws FileNotFoundException {
    // Record each index only once
    if (!duplicates.containsKey(i)) {
        duplicates.put(i, file);
        duplicateList.add(file);
    }
}
}

CompareBinaries.java

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;


public class CompareBinaries {

private final static int BUFFSIZE = 1024;
private static byte buff1[] = new byte[BUFFSIZE];
private static byte buff2[] = new byte[BUFFSIZE];

public static boolean inputStreamEquals(InputStream is1, InputStream is2) {
    if(is1 == is2) return true;

    if(is1 == null && is2 == null) {
        System.out.println("both input streams are null");
        return true;
    }

    if(is1 == null || is2 == null) return false;
    try {
        int read1 = -1;
        int read2 = -1;

        do {
            int offset1 = 0;
            while (offset1 < BUFFSIZE
                    && (read1 = is1.read(buff1, offset1, BUFFSIZE - offset1)) >= 0) {
                offset1 += read1;
            }

            int offset2 = 0;
            while (offset2 < BUFFSIZE
                    && (read2 = is2.read(buff2, offset2, BUFFSIZE - offset2)) >= 0) {
                offset2 += read2;
            }
            if(offset1 != offset2) return false;
            if(offset1 != BUFFSIZE) {
                Arrays.fill(buff1, offset1, BUFFSIZE, (byte)0);
                Arrays.fill(buff2, offset2, BUFFSIZE, (byte)0);
            }
            if(!Arrays.equals(buff1, buff2)) return false;
        } while(read1 >= 0 && read2 >= 0);
        if(read1 < 0 && read2 < 0) return true; // both at EOF
        return false;

    } catch (Exception ei) {
        return false;
    }
}

public static boolean fileContentsEquals(File file1, File file2) {
    InputStream is1 = null;
    InputStream is2 = null;
    if(file1.length() != file2.length()) return false;

    try {
        is1 = new FileInputStream(file1);
        is2 = new FileInputStream(file2);

        return inputStreamEquals(is1, is2);

    } catch (Exception ei) {
        return false;
    } finally {
        try {
            if(is1 != null) is1.close();
            if(is2 != null) is2.close();
        } catch (Exception ei2) {}
    }
}

public static boolean fileContentsEquals(String fn1, String fn2) {
    return fileContentsEquals(new File(fn1), new File(fn2));
}

}


Accepted answer by thedayofcondor

You could use a hash function to compare two files. Two files (in different folders) can have the same name and attributes (e.g. length) but different content. For example, you can create a text file, copy it to a different folder, and change one letter in the copy.

A hash function does some clever maths on the file content and ends up with a number. Even a small difference in content will produce two very different numbers.

Take the MD5 hash function as an example: it produces a 16-byte number from a byte array of any length. While it is theoretically possible for two files to have the same MD5 but different content, the probability of that happening by chance is very low (whereas two files with the same name and size but different content is a relatively likely event).

The point is that you can build a table of the MD5 of each file's content. Each hash has to be calculated only once, and hashes are quick to compare. If the MD5 values are different, the files are different with 100% confidence. Only in the unlikely event that the MD5 values are the same do you have to resort to a byte-by-byte comparison to be 100% sure.
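
The table of hashes described above could be sketched roughly as follows. This is a minimal illustration assuming Java 8 or later; the names `Md5Index`, `md5Hex`, and `groupByMd5` are made up for this example and are not part of the question's code. Any digest that ends up with more than one path is only a candidate group of duplicates, to be confirmed with a byte-by-byte comparison such as `CompareBinaries.fileContentsEquals`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not from the original post.
public final class Md5Index {

    // Streams the file through MD5 so the whole file is never held in memory.
    public static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        // Render the 16 digest bytes as 32 lowercase hex characters.
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Builds the table: digest -> every file that produced that digest.
    public static Map<String, List<Path>> groupByMd5(List<Path> files)
            throws IOException, NoSuchAlgorithmException {
        Map<String, List<Path>> table = new HashMap<>();
        for (Path file : files) {
            table.computeIfAbsent(md5Hex(file), k -> new ArrayList<>()).add(file);
        }
        return table;
    }
}
```

Each file is read exactly once, so the cost grows with the total size of the directory rather than with the number of file pairs.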

Answer by Prafull Kumar

While working on my project recently, I found a good write-up on finding duplicate file names and directories using an SHA algorithm.

Take a look at it: https://jakut.is/2011/03/15/a-java-program-to-list-all/

It might be useful for you.

Answer by jboi

My suggestion: walk through one directory tree and compare it to the other directory tree by name. Then, for each matching pair, compare the file size and last-modification time and, if all of that is equal, do a straightforward byte-by-byte comparison.
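
A rough sketch of that check for a single pair of files, assuming Java 7 or later. The class `PairCheck` and its method names are hypothetical, chosen only for this illustration. The metadata comparison is cheap, so the contents are read only when size and last-modification time both match.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not from the original post.
public final class PairCheck {

    // Cheap metadata checks first; contents are read only if both pass.
    public static boolean looksAndIsEqual(Path a, Path b) throws IOException {
        BasicFileAttributes attrA = Files.readAttributes(a, BasicFileAttributes.class);
        BasicFileAttributes attrB = Files.readAttributes(b, BasicFileAttributes.class);
        if (attrA.size() != attrB.size()) {
            return false;
        }
        if (!attrA.lastModifiedTime().equals(attrB.lastModifiedTime())) {
            return false;
        }
        return sameBytes(a, b);
    }

    // Byte-by-byte comparison; stops at the first difference.
    public static boolean sameBytes(Path a, Path b) throws IOException {
        try (InputStream inA = new BufferedInputStream(Files.newInputStream(a));
             InputStream inB = new BufferedInputStream(Files.newInputStream(b))) {
            int x;
            do {
                x = inA.read();
                if (x != inB.read()) {
                    return false; // differing byte, or one stream ended early
                }
            } while (x != -1);
            return true; // both streams reached end of file together
        }
    }
}
```

Note that requiring equal modification times suits the two-tree scenario described here; copies made at different times inside one tree may be identical in content yet differ in timestamp, in which case `sameBytes` alone is the relevant check.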

There are two steps to implement this (I've added links to example code):

  1. Walk through both directories to get the full list. Java made this faster in Java 7 with Files.walkFileTree(). You walk through one directory tree and compare each entry to the other directory tree. I've posted here some example code for such a comparison (my example code should help you with this step, though it does not answer your question 100%).
  2. Compare two files to see whether they are equal. Several things can be compared:
    • File name. This is obvious, as it is needed to find the file in the second tree anyway.
    • File size and last modification time. These are part of the BasicFileAttributes that you get when walking the tree. See the example code for how to get them for the second file.
    • The content. As mentioned above, you can calculate some kind of CRC, MD5, or SHA. Either way, you'll read the full content of both files, so my suggestion here is to compare directly byte-by-byte, e.g. with [Arrays.equals()](http://docs.oracle.com/javase/7/docs/api/java/util/Arrays.html#equals(byte[], byte[]))
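
The two steps above could be sketched roughly like this, assuming Java 7 or later. The class `TreeCompare` and the method `findMatches` are hypothetical names for this illustration and are not the example code the answer links to. The walk matches files by relative path, uses the size from `BasicFileAttributes` as a cheap filter, and then compares content with `Arrays.equals()`.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not from the original post.
public final class TreeCompare {

    // Returns the files under 'left' that have an identical file at the
    // same relative path under 'right'.
    public static List<Path> findMatches(final Path left, final Path right)
            throws IOException {
        final List<Path> matches = new ArrayList<>();
        Files.walkFileTree(left, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                // Step 1: locate the counterpart by name (relative path).
                Path other = right.resolve(left.relativize(file));
                // Step 2: cheap size check first, then full content.
                // readAllBytes loads both files into memory, which is fine
                // for a sketch but costly for very large files.
                if (attrs.isRegularFile()
                        && Files.isRegularFile(other)
                        && attrs.size() == Files.size(other)
                        && Arrays.equals(Files.readAllBytes(file),
                                Files.readAllBytes(other))) {
                    matches.add(file);
                }
                return FileVisitResult.CONTINUE;
            }
        });
        return matches;
    }
}
```

For large files, a streaming comparison avoids holding both files in memory at once; the size check already rules out most non-matching pairs before any content is read.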