Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/1602378/

Date: 2020-09-09 18:37:20  Source: igfitidea

How to calculate a hash for a string (url) in bash for wget caching

Tags: bash, md5, wget

Asked by Bambax

I'm building a little tool that will download files using wget, reading the urls from different files. The same url may be present in different files; the url may even be present in one file several times. It would be inefficient to download a page several times (every time its url is found in the list(s)).

Thus, the simple approach is to save the downloaded file and to instruct wget not to download it again if it is already there.

That would be very straightforward; however the urls are very long (many many GET parameters) and therefore cannot be used as such for filenames (wget gives the error 'Cannot write to... [] file name too long').

So, I need to rename the downloaded files. But for the caching mechanism to work, the renaming scheme needs to implement "one url <=> one name": if a given url can have multiple names, the caching does not work (i.e., if I simply number the files in the order they are found, wget cannot identify which urls have already been downloaded).

The simplest renaming scheme would be to calculate an md5 hash of the filename (and not of the file itself, which is what md5sum does); that would ensure the filename is unique and that a given url always results in the same name.

It's possible to do this in Perl, etc., but can it be done directly in bash or using a system utility (RedHat)?

Answered by Epsilon Prime

Sounds like you want the md5sum system utility.

URLMD5=$(/bin/echo "$URL" | /usr/bin/md5sum | /bin/cut -f1 -d" ")

If you want to only create the hash on the filename, you can get that quickly with sed:

FILENAME=$(echo "$URL" | /bin/sed -e 's#.*/##')
URLMD5=$(/bin/echo "$FILENAME" | /usr/bin/md5sum | /bin/cut -f1 -d" ")

Note that, depending on your distribution, the path to cut may be /usr/bin/cut.
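
To connect the hash to the caching use case in the question, a minimal sketch might look like the following. The function names `cache_name` and `cached_fetch` and the `CACHE_DIR` variable are hypothetical, not part of the answer; the sketch assumes `md5sum` and `wget` are on the PATH, and it uses `printf '%s'` instead of `echo` so that only the bytes of the url are hashed.

```shell
#!/bin/bash
# Hypothetical cache directory; override by setting CACHE_DIR beforehand.
CACHE_DIR="${CACHE_DIR:-./wget-cache}"

# Print the cache file path for a url: the MD5 of the url string itself.
cache_name() {
    local hash
    hash=$(printf '%s' "$1" | md5sum | cut -d' ' -f1)
    printf '%s/%s\n' "$CACHE_DIR" "$hash"
}

# Download a url only if its cache file does not exist yet.
cached_fetch() {
    local url="$1" file
    file=$(cache_name "$url")
    if [ -e "$file" ]; then
        echo "cached: $file"
        return 0
    fi
    mkdir -p "$CACHE_DIR" || return 1
    # Write to a temporary name first so that a failed download is not
    # mistaken for a cached copy on the next run.
    if wget -q -O "$file.part" "$url"; then
        mv "$file.part" "$file"
    else
        rm -f "$file.part"
        return 1
    fi
}
```

With a hypothetical `urls.txt` holding one url per line, this could be driven by `while IFS= read -r url; do cached_fetch "$url"; done < urls.txt`. Because the name depends only on the url, a repeated url maps to the same file and is skipped.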

Answered by user1043466

I don't have the rep to comment on the answer, but there's one clarification to Epsilon Prime's answer: by default, echo will print a newline at the end of the text. If you want the md5 sums to match up with what will be generated by any other tool (eg php, Java's md5, etc) you need to call

echo -n "$url"

which will suppress the newline.

Answered by kdauria

Other options on my Ubuntu (Precise) box:

  • echo -n "$STRING" | sha512sum
  • echo -n "$STRING" | sha256sum
  • echo -n "$STRING" | sha224sum
  • echo -n "$STRING" | sha384sum
  • echo -n "$STRING" | sha1sum
  • echo -n "$STRING" | shasum

Other options on my Mac:

  • echo -n "$STRING" | shasum -a 512
  • echo -n "$STRING" | shasum -a 256
  • etc.
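
Since the available tool differs between systems, a script can pick whichever one is installed. This is only a rough sketch: the function name `sha256_of` is hypothetical, and it checks just for `sha256sum` and `shasum`, the two commands listed above.

```shell
# Print the SHA-256 hex digest of a string using whichever tool exists.
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        printf '%s' "$1" | sha256sum | cut -d' ' -f1
    elif command -v shasum >/dev/null 2>&1; then
        printf '%s' "$1" | shasum -a 256 | cut -d' ' -f1
    else
        echo "no SHA-256 tool found" >&2
        return 1
    fi
}

sha256_of "hello"
# 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
```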

Answered by Kaleb Pederson

Newer versions of Bash (4.0 and later) provide associative arrays as well as indexed arrays. Something like this might work for you:

declare -A myarray
myarray["url1"]="url1_content"
myarray["url2"]=""

if [ -n "${myarray["url1"]}" ]; then
    echo "Cached"
fi

wget will typically rename the files with a filename.html.1, .2, etc., so you could use the associative array to store a list of which one has been downloaded and what the actual filename was.
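
Along the same lines, an associative array can be used to skip urls that were already seen during the current run. This is a minimal sketch assuming Bash 4.0 or later; the function name `unique_urls` and the array name `seen` are hypothetical, and the record only lasts for the lifetime of the script, not across runs.

```shell
#!/bin/bash
# Read urls from stdin and print each distinct url once, in first-seen order.
unique_urls() {
    local -A seen=()
    local url
    while IFS= read -r url; do
        [ -n "$url" ] || continue            # skip blank lines
        if [ -z "${seen["$url"]+x}" ]; then  # not recorded yet
            seen["$url"]=1
            printf '%s\n' "$url"
        fi
    done
}
```

For example, with hypothetical list files, `cat list1.txt list2.txt | unique_urls` would emit each url once, and that output could then be fed to wget.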
