Linux: compute a single hash for a given folder & contents?
Disclaimer: this page is a translation of a popular StackOverFlow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/545387/
Asked by Ben L
Surely there must be a way to do this easily!
I've tried the Linux command-line apps such as sha1sum and md5sum, but they seem only able to compute hashes of individual files and output a list of hash values, one for each file.
I need to generate a single hash for the entire contents of a folder (not just the filenames).
I'd like to do something like
sha1sum /folder/of/stuff > singlehashvalue
Edit: to clarify, my files are at multiple levels in a directory tree; they're not all sitting in the same root folder.
Answered by Vatine
One possible way would be:
sha1sum path/to/folder/* | sha1sum
If there is a whole directory tree, you're probably better off using find and xargs. One possible command would be
find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
And, finally, if you also need to take account of permissions and empty directories:
(find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum;
find path/to/folder \( -type f -o -type d \) -print0 | sort -z | \
xargs -0 stat -c '%n %a') \
| sha1sum
The arguments to stat cause it to print the name of each file, followed by its octal permissions. The two finds run one after the other, doubling the amount of disk IO: the first finds all file names and checksums their contents; the second finds all file and directory names, printing each name and mode. The list of "file names and checksums", followed by "names and directories, with permissions", is then itself checksummed, producing a single smaller checksum.
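One wrinkle worth noting: sha1sum prints each file's path next to its hash, so hashing two different roots this way never matches even when the contents are identical. A minimal sketch working around that by hashing paths relative to each root (the hash_tree helper name is illustrative, assuming GNU findutils and coreutils):

hash_tree() {
    # Hash from inside the tree so paths are relative ("./...") and comparable
    (cd "$1" && find . -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum) \
        | awk '{print $1}'
}

[ "$(hash_tree /path/a)" = "$(hash_tree /path/b)" ] && echo "trees match"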
Answered by David Schmitt
- Use a file system intrusion detection tool like aide.
- Hash a tar ball of the directory:
  tar cvf - /path/to/folder | sha1sum
- Code something yourself, like vatine's one-liner:
  find /path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
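One caveat on the tar-ball option: the archive embeds metadata such as entry order, mtimes and ownership, so byte-identical trees can still hash differently across machines or runs. A sketch that pins this down, assuming GNU tar (1.28 or later for --sort=name):

tar --sort=name --mtime='UTC 2019-01-01' --owner=0 --group=0 --numeric-owner \
    -cf - /path/to/folder | sha1sum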
Answered by S.Lott
You can do tar -c /path/to/folder | sha1sum
Answered by Shumoapp
If you just want to check if something in the folder changed, I'd recommend this one:
ls -alR --full-time /folder/of/stuff | sha1sum
It will just give you a hash of the ls output, which contains the folders, sub-folders, their files, and their timestamps, sizes and permissions. Pretty much everything you would need to determine whether something has changed.
Please note that this command will not generate a hash for each file, but that is why it should be faster than using find.
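For that change-detection use case, a small sketch of how the hash might be stored and compared between runs (the /tmp/stuff.hash path is illustrative):

new=$(ls -alR --full-time /folder/of/stuff | sha1sum | awk '{print $1}')
old=$(cat /tmp/stuff.hash 2>/dev/null)
[ "$new" != "$old" ] && echo "folder changed"
printf '%s\n' "$new" > /tmp/stuff.hash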
Answered by six-k
A robust and clean approach
- First things first: don't hog the available memory! Hash a file in chunks rather than feeding in the entire file.
- Different approaches for different needs/purposes (all of the below, or pick whatever applies):
- Hash only the entry name of all entries in the directory tree
- Hash the file contents of all entries (leaving out the meta, like inode number, ctime, atime, mtime, size, etc.; you get the idea)
- For a symbolic link, its content is the referent name; hash it or choose to skip it
- Whether or not to follow (resolve) the symlink while hashing the contents of the entry
- If it's a directory, its contents are just directory entries. While traversing recursively they will be hashed eventually, but should the directory entry names of that level be hashed to tag this directory? That is helpful in use cases where the hash must identify a change quickly without traversing deeply to hash the contents, for example when a file's name changes but the rest of the contents remain the same and they are all fairly large files
- Handle large files well (again, mind the RAM)
- Handle very deep directory trees (mind the open file descriptors)
- Handle non-standard file names
- How to proceed with files that are sockets, pipes/FIFOs, block devices, char devices? Hash them as well?
- Don't update the access time of any entry while traversing, because that is a side effect and counter-productive (counter-intuitive?) for certain use cases
This is what I have on top of my head; anyone who has spent some time working on this in practice would have caught other gotchas and corner cases. A rough sketch of a few of these points follows below.
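Assuming GNU findutils, coreutils and bash, something along these lines hashes file contents (sha1sum already reads in chunks), hashes symlinks by their referent names rather than following them, and simply skips sockets, FIFOs and devices:

(
    find /path/to/folder -type f -print0 | sort -z | xargs -0 -r sha1sum
    find /path/to/folder -type l -print0 | sort -z | \
        while IFS= read -r -d '' link; do
            printf '%s -> %s\n' "$link" "$(readlink "$link")"
        done
) | sha1sum

The remaining points, such as access times, very deep trees and tagging directories by their entry names, are exactly where a dedicated tool earns its keep.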
Here's a tool, very light on memory, which addresses most of these cases. It might be a bit rough around the edges, but it has been quite helpful.
An example usage and output of dtreetrawl:
Usage:
  dtreetrawl [OPTION...] "/trawl/me" [path2,...]

Help Options:
  -h, --help                Show help options

Application Options:
  -t, --terse               Produce a terse output; parsable.
  -j, --json                Output as JSON
  -d, --delim=:             Character or string delimiter/separator for terse output (default ':')
  -l, --max-level=N         Do not traverse tree beyond N level(s)
  --hash                    Enable hashing (default is MD5).
  -c, --checksum=md5        Valid hashing algorithms: md5, sha1, sha256, sha512.
  -R, --only-root-hash      Output only the root hash. Blank line if --hash is not set
  -N, --no-name-hash        Exclude path name while calculating the root checksum
  -F, --no-content-hash     Do not hash the contents of the file
  -s, --hash-symlink        Include symbolic links' referent name while calculating the root checksum
  -e, --hash-dirent         Include hash of directory entries while calculating root checksum
A snippet of human-friendly output:
...
... //clipped
...
/home/lab/linux-4.14-rc8/CREDITS
        Base name                    : CREDITS
        Level                        : 1
        Type                         : regular file
        Referent name                :
        File size                    : 98443 bytes
        I-node number                : 290850
        No. directory entries        : 0
        Permission (octal)           : 0644
        Link count                   : 1
        Ownership                    : UID=0, GID=0
        Preferred I/O block size     : 4096 bytes
        Blocks allocated             : 200
        Last status change           : Tue, 21 Nov 17 21:28:18 +0530
        Last file access             : Thu, 28 Dec 17 00:53:27 +0530
        Last file modification       : Tue, 21 Nov 17 21:28:18 +0530
        Hash                         : 9f0312d130016d103aa5fc9d16a2437e

Stats for /home/lab/linux-4.14-rc8:
        Elapsed time     : 1.305767 s
        Start time       : Sun, 07 Jan 18 03:42:39 +0530
        Root hash        : 434e93111ad6f9335bb4954bc8f4eca4
        Hash type        : md5
        Depth            : 8
        Total,
                size            : 66850916 bytes
                entries         : 12484
                directories     : 763
                regular files   : 11715
                symlinks        : 6
                block devices   : 0
                char devices    : 0
                sockets         : 0
                FIFOs/pipes     : 0
Answered by six-k
If you just want to hash the contents of the files, ignoring the filenames then you can use
cat $FILES | md5sum
Make sure you have the files in the same order when computing the hash:
cat $(echo $FILES | sort) | md5sum
But you can't have directories in your list of files.
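Note that the unquoted $FILES also breaks on names containing spaces or newlines. A null-safe variant of the same content-only idea, assuming GNU tools:

find /folder/of/stuff -type f -print0 | sort -z | xargs -0 cat | md5sum

Bear in mind that plain concatenation ignores file boundaries, so moving bytes from the end of one file to the start of the next would go unnoticed.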
Answered by Kingdon
There is a python script for that:
http://code.activestate.com/recipes/576973-getting-the-sha-1-or-md5-hash-of-a-directory/
If you change the name of a file without changing its alphabetical order, the hash script will not detect it. But if you change the order of the files or the contents of any file, running the script will give you a different hash than before.
Answered by Hyman
Another tool to achieve this:
http://md5deep.sourceforge.net/
As it sounds: like md5sum, but recursive, plus other features.
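A possible invocation (flags as described in md5deep's documentation: -r recurses, -l prints relative paths), piped through sort so the combined hash does not depend on traversal order:

md5deep -r -l /folder/of/stuff | sort | md5sum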
Answered by haventchecked
I've written a Groovy script to do this:
import java.security.MessageDigest

public static String generateDigest(File file, String digest, int paddedLength){
    MessageDigest md = MessageDigest.getInstance(digest)
    md.reset()
    def files = []
    def directories = []
    if(file.isDirectory()){
        // Collect all files, plus directory names relative to the root
        file.eachFileRecurse(){sf ->
            if(sf.isFile()){
                files.add(sf)
            }
            else{
                directories.add(file.toURI().relativize(sf.toURI()).toString())
            }
        }
    }
    else if(file.isFile()){
        files.add(file)
    }
    // Sort both lists so the digest is deterministic regardless of traversal order
    files.sort({a, b -> return a.getAbsolutePath() <=> b.getAbsolutePath()})
    directories.sort()
    // Feed file contents into the digest in 8 KiB chunks to keep memory use low
    files.each(){f ->
        println file.toURI().relativize(f.toURI()).toString()
        f.withInputStream(){is ->
            byte[] buffer = new byte[8192]
            int read = 0
            while((read = is.read(buffer)) > 0){
                md.update(buffer, 0, read)
            }
        }
    }
    // Fold the relative directory names into the digest as well
    directories.each(){d ->
        println d
        md.update(d.getBytes())
    }
    byte[] digestBytes = md.digest()
    BigInteger bigInt = new BigInteger(1, digestBytes)
    return bigInt.toString(16).padLeft(paddedLength, '0')
}

println "\n${generateDigest(new File(args[0]), 'SHA-256', 64)}"
You can customize the usage to avoid printing each file, change the message digest, take out directory hashing, etc. I've tested it against the NIST test data and it works as expected. http://www.nsrl.nist.gov/testdata/
gary-macbook:Scripts garypaduana$ groovy dirHash.groovy /Users/garypaduana/.config
.DS_Store
configstore/bower-github.yml
configstore/insight-bower.json
configstore/update-notifier-bower.json
filezilla/filezilla.xml
filezilla/layout.xml
filezilla/lockfile
filezilla/queue.sqlite3
filezilla/recentservers.xml
filezilla/sitemanager.xml
gtk-2.0/gtkfilechooser.ini
a/
configstore/
filezilla/
gtk-2.0/
lftp/
menus/
menus/applications-merged/
79de5e583734ca40ff651a3d9a54d106b52e94f1f8c2cd7133ca3bbddc0c6758
Answered by Joao da Silva
Try to make it in two steps:
- create a file with hashes for all files in a folder
- hash this file
Like so:
# for FILE in `find /folder/of/stuff -type f | sort`; do sha1sum "$FILE" >> hashes; done
# sha1sum hashes
Or do it all at once:
# cat `find /folder/of/stuff -type f | sort` | sha1sum