Original source: http://stackoverflow.com/questions/1602378/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to calculate a hash for a string (url) in bash for wget caching
Asked by Bambax
I'm building a little tool that will download files using wget, reading the urls from different files. The same url may be present in different files; the url may even be present in one file several times. It would be inefficient to download a page several times (every time its url is found in the list(s)).
Thus, the simple approach is to save the downloaded file and to instruct wget not to download it again if it is already there.
That would be very straightforward; however the urls are very long (many many GET parameters) and therefore cannot be used as such for filenames (wget gives the error 'Cannot write to... [] file name too long').
So, I need to rename the downloaded files. But for the caching mechanism to work, the renaming scheme needs to implement "one url <=> one name": if a given url can have multiple names, the caching does not work (i.e., if I simply number the files in the order they are found, wget has no way to identify which urls have already been downloaded).
The simplest renaming scheme would be to calculate an md5 hash of the filename (and not of the file itself, which is what md5sum does); that would ensure the filename is unique and that a given url always results in the same name.
It's possible to do this in Perl, etc., but can it be done directly in bash or using a system utility (RedHat)?
Answered by Epsilon Prime
Sounds like you want the md5sum system utility.
URLMD5=`/bin/echo "$URL" | /usr/bin/md5sum | /bin/cut -f1 -d" "`
If you want to only create the hash on the filename, you can get that quickly with sed:
FILENAME=`echo "$URL" | /bin/sed -e 's#.*/##'`
URLMD5=`/bin/echo "$FILENAME" | /usr/bin/md5sum | /bin/cut -f1 -d" "`
Note that, depending on your distribution, the path to cut may be /usr/bin/cut.
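To connect the hash back to the original caching problem, here is a minimal sketch, assuming GNU md5sum and wget are on the PATH. The function names url_to_name and fetch_cached and the default directory name cache are hypothetical, chosen only for illustration. Unlike the echo-based commands above, it uses printf '%s', so the hash covers the url alone with no trailing newline.

```shell
#!/usr/bin/env bash

# Hypothetical helper: map a url to a fixed-length (32 hex chars) name.
url_to_name() {
    # printf '%s' writes the url without a trailing newline
    printf '%s' "$1" | md5sum | cut -d' ' -f1
}

# Hypothetical helper: download a url only if its cache file is missing.
fetch_cached() {
    local url=$1 dir=${2:-cache} file
    file="$dir/$(url_to_name "$url")"
    mkdir -p "$dir"
    if [ -e "$file" ]; then
        echo "cached: $file"
    else
        # wget -O creates the file even on failure, so clean up in that case
        wget -q -O "$file" "$url" || { rm -f "$file"; return 1; }
    fi
}
```

Calling fetch_cached "$URL" for each url read from the lists would then fetch every distinct url at most once, however often it appears.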
Answered by user1043466
I don't have the rep to comment on the answer, but there's one clarification to Epsilon Prime's answer: by default, echo will print a newline at the end of the text. If you want the md5 sums to match up with what will be generated by any other tool (eg php, Java's md5, etc) you need to call
echo -n "$url"
which will suppress the newline.
Answered by kdauria
Other options on my Ubuntu (Precise) box:
echo -n "$STRING" | sha512sum
echo -n "$STRING" | sha256sum
echo -n "$STRING" | sha224sum
echo -n "$STRING" | sha384sum
echo -n "$STRING" | sha1sum
echo -n "$STRING" | shasum
Other options on my Mac:
echo -n "$STRING" | shasum -a 512
echo -n "$STRING" | shasum -a 256
- etc.
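Since the available tool differs between the two systems, a script that must run on both can pick whichever is present. This is a rough sketch; the function name sha256_of is hypothetical, and it assumes at least one of sha256sum or shasum is installed.

```shell
# Hypothetical helper: print the SHA-256 hex digest of a string.
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        # the Linux tool listed above
        printf '%s' "$1" | sha256sum | cut -d' ' -f1
    else
        # the form shown above for the Mac
        printf '%s' "$1" | shasum -a 256 | cut -d' ' -f1
    fi
}

sha256_of "hello"   # 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
```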
Answered by Kaleb Pederson
Newer versions of Bash provide associative arrays as well as indexed arrays. Something like this might work for you:
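declare -A myarray                 # associative arrays need Bash 4 or newer
myarray["url1"]="url1_content"
myarray["url2"]=""
# Quote the expansion so empty values and values with spaces are tested correctly
if [ -n "${myarray[url1]}" ]; then
    echo "Cached"
fi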
When a file with the same name already exists, wget will typically save the new download as filename.html.1, .2, etc., so you could use the associative array to store a list of which urls have been downloaded and what the actual filename was.
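As a rough sketch of that idea, assuming Bash 4 or newer: the array name downloaded, the helper remember, the example.com urls and the page.N filenames are all hypothetical, and the actual download is replaced by an echo so the snippet runs without network access.

```shell
#!/usr/bin/env bash
# Requires Bash 4 or newer for associative arrays.
declare -A downloaded    # maps url -> local filename

# Hypothetical helper: record a url; return 1 if it was already recorded.
remember() {
    local url=$1 file=$2
    if [[ -n ${downloaded[$url]+set} ]]; then
        return 1
    fi
    downloaded[$url]=$file
}

n=0
for url in "http://example.com/a" "http://example.com/b" "http://example.com/a"; do
    if remember "$url" "page.$n"; then
        echo "would download $url as page.$n"   # a real script would call wget here
        n=$((n + 1))
    else
        echo "already have $url as ${downloaded[$url]}"
    fi
done
```

The array lives only in memory, so this deduplicates urls within a single run of the script; a cache that should survive between runs still needs a stable filename on disk.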