Linux: compute a single hash for a given folder & contents?
Disclaimer: this page is a translation of a popular StackOverFlow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/545387/
Asked by Ben L
Surely there must be a way to do this easily!
I've tried the Linux command-line apps such as sha1sum and md5sum, but they seem only able to compute hashes of individual files and output a list of hash values, one for each file.
I need to generate a single hash for the entire contents of a folder (not just the filenames).
I'd like to do something like
sha1sum /folder/of/stuff > singlehashvalue
Edit: to clarify, my files are at multiple levels in a directory tree; they're not all sitting in the same root folder.
Answered by Vatine
One possible way would be:
sha1sum path/to/folder/* | sha1sum
If there is a whole directory tree, you're probably better off using find and xargs. One possible command would be
find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
And, finally, if you also need to take account of permissions and empty directories:
(find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum;
find path/to/folder \( -type f -o -type d \) -print0 | sort -z | \
xargs -0 stat -c '%n %a') \
| sha1sum
The arguments to stat cause it to print the name of each file, followed by its octal permissions. The two finds run one after the other, doubling the amount of disk IO: the first finds all file names and checksums their contents; the second finds all file and directory names, printing each name and mode. The list of "file names and checksums", followed by "names and directories, with permissions", is then itself checksummed, producing a single smaller checksum.
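One wrinkle worth noting: sha1sum prints each file's path next to its hash, so hashing two different roots this way never matches even when the contents are identical. A minimal sketch working around that by hashing paths relative to each root (the hash_tree helper name is illustrative, assuming GNU findutils and coreutils):

hash_tree() {
    # Hash from inside the tree so paths are relative ("./...") and comparable
    (cd "$1" && find . -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum) \
        | awk '{print $1}'
}

[ "$(hash_tree /path/a)" = "$(hash_tree /path/b)" ] && echo "trees match"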
Answered by David Schmitt
- Use a file system intrusion detection tool like aide.
- Hash a tar ball of the directory:
  tar cvf - /path/to/folder | sha1sum
- Code something yourself, like vatine's one-liner:
  find /path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
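One caveat on the tar-ball option: the archive embeds metadata such as entry order, mtimes and ownership, so byte-identical trees can still hash differently across machines or runs. A sketch that pins this down, assuming GNU tar (1.28 or later for --sort=name):

tar --sort=name --mtime='UTC 2019-01-01' --owner=0 --group=0 --numeric-owner \
    -cf - /path/to/folder | sha1sum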
Answered by S.Lott
You can do tar -c /path/to/folder | sha1sum
Answered by Shumoapp
If you just want to check if something in the folder changed, I'd recommend this one:
ls -alR --full-time /folder/of/stuff | sha1sum
It will just give you a hash of the ls output, which contains the folders, sub-folders, their files, and their timestamps, sizes and permissions. Pretty much everything you would need to determine whether something has changed.
Please note that this command will not generate a hash for each file, but that is why it should be faster than using find.
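For that change-detection use case, a small sketch of how the hash might be stored and compared between runs (the /tmp/stuff.hash path is illustrative):

new=$(ls -alR --full-time /folder/of/stuff | sha1sum | awk '{print $1}')
old=$(cat /tmp/stuff.hash 2>/dev/null)
[ "$new" != "$old" ] && echo "folder changed"
printf '%s\n' "$new" > /tmp/stuff.hash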
Answered by six-k
A robust and clean approach
- First things first: don't hog the available memory! Hash a file in chunks rather than feeding in the entire file.
- Different approaches for different needs/purposes (all of the below, or pick whatever applies):
- Hash only the entry name of all entries in the directory tree
- Hash the file contents of all entries (leaving out the meta, like inode number, ctime, atime, mtime, size, etc.; you get the idea)
- For a symbolic link, its content is the referent name; hash it or choose to skip it
- Whether or not to follow (resolve) the symlink while hashing the contents of the entry
- If it's a directory, its contents are just directory entries. While traversing recursively they will be hashed eventually, but should the directory entry names of that level be hashed to tag this directory? That is helpful in use cases where the hash must identify a change quickly without traversing deeply to hash the contents, for example when a file's name changes but the rest of the contents remain the same and they are all fairly large files
- Handle large files well (again, mind the RAM)
- Handle very deep directory trees (mind the open file descriptors)
- Handle non-standard file names
- How to proceed with files that are sockets, pipes/FIFOs, block devices, char devices? Hash them as well?
- Don't update the access time of any entry while traversing, because that is a side effect and counter-productive (counter-intuitive?) for certain use cases
This is what I have on top of my head; anyone who has spent some time working on this in practice would have caught other gotchas and corner cases. A rough sketch of a few of these points follows below.
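Assuming GNU findutils, coreutils and bash, something along these lines hashes file contents (sha1sum already reads in chunks), hashes symlinks by their referent names rather than following them, and simply skips sockets, FIFOs and devices:

(
    find /path/to/folder -type f -print0 | sort -z | xargs -0 -r sha1sum
    find /path/to/folder -type l -print0 | sort -z | \
        while IFS= read -r -d '' link; do
            printf '%s -> %s\n' "$link" "$(readlink "$link")"
        done
) | sha1sum

The remaining points, such as access times, very deep trees and tagging directories by their entry names, are exactly where a dedicated tool earns its keep.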
Here's a tool, very light on memory, which addresses most of these cases. It might be a bit rough around the edges, but it has been quite helpful.
An example usage and output of dtreetrawl:
Usage:
  dtreetrawl [OPTION...] "/trawl/me" [path2,...]

Help Options:
  -h, --help                Show help options

Application Options:
  -t, --terse               Produce a terse output; parsable.
  -j, --json                Output as JSON
  -d, --delim=:             Character or string delimiter/separator for terse output (default ':')
  -l, --max-level=N         Do not traverse tree beyond N level(s)
  --hash                    Enable hashing (default is MD5).
  -c, --checksum=md5        Valid hashing algorithms: md5, sha1, sha256, sha512.
  -R, --only-root-hash      Output only the root hash. Blank line if --hash is not set
  -N, --no-name-hash        Exclude path name while calculating the root checksum
  -F, --no-content-hash     Do not hash the contents of the file
  -s, --hash-symlink        Include symbolic links' referent name while calculating the root checksum
  -e, --hash-dirent         Include hash of directory entries while calculating root checksum
A snippet of human-friendly output:
...
... //clipped
...
/home/lab/linux-4.14-rc8/CREDITS
        Base name                    : CREDITS
        Level                        : 1
        Type                         : regular file
        Referent name                :
        File size                    : 98443 bytes
        I-node number                : 290850
        No. directory entries        : 0
        Permission (octal)           : 0644
        Link count                   : 1
        Ownership                    : UID=0, GID=0
        Preferred I/O block size     : 4096 bytes
        Blocks allocated             : 200
        Last status change           : Tue, 21 Nov 17 21:28:18 +0530
        Last file access             : Thu, 28 Dec 17 00:53:27 +0530
        Last file modification       : Tue, 21 Nov 17 21:28:18 +0530
        Hash                         : 9f0312d130016d103aa5fc9d16a2437e

Stats for /home/lab/linux-4.14-rc8:
        Elapsed time     : 1.305767 s
        Start time       : Sun, 07 Jan 18 03:42:39 +0530
        Root hash        : 434e93111ad6f9335bb4954bc8f4eca4
        Hash type        : md5
        Depth            : 8
        Total,
                size            : 66850916 bytes
                entries         : 12484
                directories     : 763
                regular files   : 11715
                symlinks        : 6
                block devices   : 0
                char devices    : 0
                sockets         : 0
                FIFOs/pipes     : 0
Answered by six-k
If you just want to hash the contents of the files, ignoring the filenames then you can use
cat $FILES | md5sum
Make sure you have the files in the same order when computing the hash:
cat $(echo $FILES | sort) | md5sum
But you can't have directories in your list of files.
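Note that the unquoted $FILES also breaks on names containing spaces or newlines. A null-safe variant of the same content-only idea, assuming GNU tools:

find /folder/of/stuff -type f -print0 | sort -z | xargs -0 cat | md5sum

Bear in mind that plain concatenation ignores file boundaries, so moving bytes from the end of one file to the start of the next would go unnoticed.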
Answered by Kingdon
There is a python script for that:
http://code.activestate.com/recipes/576973-getting-the-sha-1-or-md5-hash-of-a-directory/
If you change the name of a file without changing its alphabetical order, the hash script will not detect it. But if you change the order of the files or the contents of any file, running the script will give you a different hash than before.
Answered by Hyman
Another tool to achieve this:
http://md5deep.sourceforge.net/
As it sounds: like md5sum, but recursive, plus other features.
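A possible invocation (flags as described in md5deep's documentation: -r recurses, -l prints relative paths), piped through sort so the combined hash does not depend on traversal order:

md5deep -r -l /folder/of/stuff | sort | md5sum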
Answered by haventchecked
I've written a Groovy script to do this:
import java.security.MessageDigest

public static String generateDigest(File file, String digest, int paddedLength){
    MessageDigest md = MessageDigest.getInstance(digest)
    md.reset()
    def files = []
    def directories = []
    if(file.isDirectory()){
        // Collect all files, plus directory names relative to the root
        file.eachFileRecurse(){sf ->
            if(sf.isFile()){
                files.add(sf)
            }
            else{
                directories.add(file.toURI().relativize(sf.toURI()).toString())
            }
        }
    }
    else if(file.isFile()){
        files.add(file)
    }
    // Sort both lists so the digest is deterministic regardless of traversal order
    files.sort({a, b -> return a.getAbsolutePath() <=> b.getAbsolutePath()})
    directories.sort()
    // Feed file contents into the digest in 8 KiB chunks to keep memory use low
    files.each(){f ->
        println file.toURI().relativize(f.toURI()).toString()
        f.withInputStream(){is ->
            byte[] buffer = new byte[8192]
            int read = 0
            while((read = is.read(buffer)) > 0){
                md.update(buffer, 0, read)
            }
        }
    }
    // Fold the relative directory names into the digest as well
    directories.each(){d ->
        println d
        md.update(d.getBytes())
    }
    byte[] digestBytes = md.digest()
    BigInteger bigInt = new BigInteger(1, digestBytes)
    return bigInt.toString(16).padLeft(paddedLength, '0')
}

println "\n${generateDigest(new File(args[0]), 'SHA-256', 64)}"
You can customize the usage to avoid printing each file, change the message digest, take out directory hashing, etc. I've tested it against the NIST test data and it works as expected. http://www.nsrl.nist.gov/testdata/
gary-macbook:Scripts garypaduana$ groovy dirHash.groovy /Users/garypaduana/.config
.DS_Store
configstore/bower-github.yml
configstore/insight-bower.json
configstore/update-notifier-bower.json
filezilla/filezilla.xml
filezilla/layout.xml
filezilla/lockfile
filezilla/queue.sqlite3
filezilla/recentservers.xml
filezilla/sitemanager.xml
gtk-2.0/gtkfilechooser.ini
a/
configstore/
filezilla/
gtk-2.0/
lftp/
menus/
menus/applications-merged/
79de5e583734ca40ff651a3d9a54d106b52e94f1f8c2cd7133ca3bbddc0c6758
Answered by Joao da Silva
Try to make it in two steps:
- create a file with hashes for all files in a folder
- hash this file
Like so:
# for FILE in `find /folder/of/stuff -type f | sort`; do sha1sum "$FILE" >> hashes; done
# sha1sum hashes
Or do it all at once:
# cat `find /folder/of/stuff -type f | sort` | sha1sum