Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original: http://stackoverflow.com/questions/6250698/

Date: 2020-09-09 20:36:57  Source: igfitidea

How to decode URL-encoded string in shell?

Tags: bash, shell, awk, sed, urldecode

Asked by user785717

I have a file with a list of user-agents which are encoded. E.g.:

Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en

I want a shell script which can read this file and write to a new file with decoded strings.

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

I have been trying to use this example to get it going but it is not working so far.

$ echo -e "$(echo "%31+%32%0A%33+%34" | sed 'y/+/ /; s/%/\x/g')"

My script looks like:

#!/bin/bash
for f in *.log; do
  echo -e "$(cat $f | sed 'y/+/ /; s/%/\x/g')" > y.log
done

Answered by guest

Here is a simple one-line solution.

$ function urldecode() { : "${*//+/ }"; echo -e "${_//%/\x}"; }

It may look like Perl :) but it is pure bash: no awk, no sed, no overhead. It uses the : builtin, special parameters, pattern substitution, and the echo builtin's -e option to translate hex codes into characters. See bash's man page for further details. You can use this function as a separate command:

$ urldecode https%3A%2F%2Fgoogle.com%2Fsearch%3Fq%3Durldecode%2Bbash
https://google.com/search?q=urldecode+bash

or in variable assignments, like so:

$ x="http%3A%2F%2Fstackoverflow.com%2Fsearch%3Fq%3Durldecode%2Bbash"
$ y=$(urldecode "$x")
$ echo "$y"
http://stackoverflow.com/search?q=urldecode+bash
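
The one-liner above rewrites every % to \x and lets echo -e interpret the result, so a stray % or a literal backslash in the input is touched as well. A minimal sketch of a stricter pure-bash variant follows. The name urldecode_strict is hypothetical (it is not from the answer), and the sketch assumes a reasonably recent bash that supports printf -v and the =~ operator:

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode only well-formed %XX escapes, one character at a time.
urldecode_strict() {
    local src=$1 out='' char
    local i=0 n=${#src}
    while (( i < n )); do
        char=${src:i:1}
        if [[ $char == '%' && ${src:i+1:2} =~ ^[0-9A-Fa-f]{2}$ ]]; then
            # Turn the two hex digits into the byte they encode.
            printf -v char '%b' "\\x${src:i+1:2}"
            i=$(( i + 2 ))
        elif [[ $char == '+' ]]; then
            char=' '            # '+' stands for a space in form-encoded data
        fi
        out+=$char
        i=$(( i + 1 ))
    done
    printf '%s\n' "$out"
}

# prints https://google.com/search?q=urldecode+bash
urldecode_strict 'https%3A%2F%2Fgoogle.com%2Fsearch%3Fq%3Durldecode%2Bbash'
```

A % that is not followed by two hex digits is copied through unchanged, and backslashes in the input are never interpreted, at the cost of a slower character-by-character loop.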

Answered by Steven Penny

GNU awk

#!/usr/bin/awk -nf
# -n (--non-decimal-data) makes gawk treat strings such as "0x21" as hexadecimal numbers
@include "ord"
BEGIN {
  RS = "%.."    # every %XX escape acts as a record separator
}
{
  # RT holds the escape that ended this record; chr() turns it into a character
  printf RT ? $0 chr("0x" substr(RT, 2)) : $0
}

Or

#!/bin/sh
awk -niord '{printf RT?$0chr("0x"substr(RT,2)):$0}' RS=%..

Using awk printf to urldecode text

Answered by brendan

With BASH, to read the percent-encoded URL from standard input and decode it:

while read; do echo -e ${REPLY//%/\\x}; done

Press CTRL-D to signal the end of file (EOF) and quit gracefully.

You can decode the contents of a file by redirecting the file to standard input:

while read; do echo -e ${REPLY//%/\\x}; done < file

You can also decode input from a pipe, for example:

echo 'a%21b' | while read; do echo -e ${REPLY//%/\\x}; done

  • The read builtin command reads standard input until it sees a line feed character. It sets a variable called REPLY equal to the line of text it just read.
  • ${REPLY//%/\\x} replaces all instances of '%' with '\x'.
  • echo -e interprets \xNN as the ASCII character with hexadecimal value NN.
  • while repeats this loop until the read command fails, e.g. when EOF has been reached.

The above does not change '+' to ' '. To change '+' to ' ' as well, as in guest's answer:

while read; do : "${REPLY//%/\\x}"; echo -e ${_//+/ }; done

  • : is a BASH builtin command. Here it just takes in a single argument and does nothing with it.
  • The double quotes make everything inside a single parameter.
  • _ is a special parameter that is equal to the last argument of the previous command, after argument expansion. Here it is the value of REPLY with all instances of '%' replaced with '\x'.
  • ${_//+/ } replaces all instances of '+' with ' '.

This uses only BASH and doesn't start any other process, similar to guest's answer.
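
The read loops above can be wrapped up to cover the original task of decoding a whole log file into a new file. This is only a sketch: decode_file and its two arguments are hypothetical names, IFS= and read -r are added so that leading whitespace and backslashes are not altered by read itself, and printf '%b' stands in for echo -e. A literal backslash sequence such as \n in the input would still be expanded by %b.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode each line of "$1" and write the result to "$2".
decode_file() {
    local src=$1 dst=$2 line
    # "|| [[ -n $line ]]" also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        line=${line//+/ }                  # '+' -> space
        printf '%b\n' "${line//%/\\x}"     # '%XX' -> '\xXX', expanded by %b
    done < "$src" > "$dst"
}
```

It could be called as decode_file access.log access.decoded.log, where both file names are placeholders.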

Answered by Jay

If you are a Python developer, this may be preferable:

For Python 3.x (default):

echo -n "%21%20" | python3 -c "import sys; from urllib.parse import unquote; print(unquote(sys.stdin.read()));"

For Python 2.x (deprecated):

echo -n "%21%20" | python -c "import sys, urllib as ul; print ul.unquote(sys.stdin.read());"

urllib is really good at handling URL parsing.

Answered by user785717

This is what seems to be working for me.

#!/bin/bash
# Decode stdin: '+' becomes a space, each %XX becomes \xXX for echo -e to expand.
urldecode(){
  echo -e "$(sed 's/+/ /g;s/%\(..\)/\\x\1/g;')"
}

# Write a decoded copy of every log to the processed directory.
for f in /opt/logs/*.log; do
    name=${f##/*/}
    cat "$f" | urldecode > "/opt/logs/processed/$HOSTNAME.$name"
done

Replacing '+'s with spaces, and % signs with '\x' escapes, and letting echo interpret the \x escapes using the '-e' option was not working. For some reason, the cat command was printing the % sign as its own encoded form %25. So sed was simply replacing %25 with \x25. When the -e option was used, it was simply evaluating \x25 as % and the output was same as the original.

Trace:

Original: Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en

sed: Mozilla\x252F5.0\x2520\x2528Macintosh\x253B\x2520U\x253B\x2520Intel\x2520Mac\x2520OS\x2520X\x252010.6\x253B\x2520en

echo -e: Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en

Fix: Basically ignore the 2 characters after the % in sed.

sed: Mozilla\x2F5.0\x20\x28Macintosh\x3B\x20U\x3B\x20Intel\x20Mac\x20OS\x20X\x2010.6\x3B\x20en

echo -e: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

Not sure what complications this would result in after extensive testing, but it works for now.

Answered by Stephane Chazelas

perl -pi.back -e 'y/+/ /;s/%([\da-f]{2})/pack H2,$1/gie' ./*.log

-i updates the files in-place (some sed implementations have borrowed that from perl), with .back as the backup extension.

s/x/y/e substitutes x with the evaluation of the y perl code.

The perl code in this case uses pack to pack the hex number captured in $1 (first parentheses pair in the regexp) as the corresponding character.

An alternative to pack is to use chr(hex($1)):

perl -pi.back -e 'y/+/ /;s/%([\da-f]{2})/chr hex $1/gie' ./*.log

If available, you could also use uri_unescape() from URI::Escape:

perl -pi.back -MURI::Escape -e 'y/+/ /;$_=uri_unescape$_' ./*.log

Answered by Janus Troelsen

Bash script for doing it in native Bash (original source):

LANG=C

urlencode() {
    local l=${#1}
    for (( i = 0 ; i < l ; i++ )); do
        local c=${1:i:1}
        case "$c" in
            [a-zA-Z0-9.~_-]) printf "$c" ;;
            ' ') printf + ;;
            *) printf '%%%.2X' "'$c"
        esac
    done
}

urldecode() {
    local data=${1//+/ }
    printf '%b' "${data//%/\x}"
}

If you want to urldecode file content, just put the file content as an argument.

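Here's a test that will halt if the decoded encoded file content differs (if it runs for a few seconds, the script probably works correctly):

while true
  do cat /dev/urandom | tr -d '\0' | head -c1000 > /tmp/tmp;
     A="$(cat /tmp/tmp; printf x)"
     A=${A%x}
     A=$(urlencode "$A")
     urldecode "$A" > /tmp/tmp2
     cmp /tmp/tmp /tmp/tmp2
     if [ $? != 0 ]
       then break
     fi
done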
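
The random test described above only stops on a mismatch, so it keeps running for as long as the functions are correct. A bounded round-trip check over a fixed list of samples can be sketched as follows. The function roundtrip_ok and the sample strings are hypothetical; urlencode and urldecode are restated here (with printf '%s' in place of printf "$c") so that the block runs on its own, and only ASCII input is assumed:

```shell
#!/usr/bin/env bash
# Restated from the answer so this block is self-contained.
urlencode() {
    local LC_ALL=C s=$1 c i
    for (( i = 0; i < ${#s}; i++ )); do
        c=${s:i:1}
        case $c in
            [a-zA-Z0-9.~_-]) printf '%s' "$c" ;;
            ' ') printf '+' ;;
            *) printf '%%%.2X' "'$c" ;;   # "'c" makes printf use the character code
        esac
    done
}

urldecode() {
    local data=${1//+/ }
    printf '%b' "${data//%/\\x}"
}

# Hypothetical helper: succeed only if every sample survives encode + decode.
roundtrip_ok() {
    local sample
    for sample in "$@"; do
        [[ $(urldecode "$(urlencode "$sample")") == "$sample" ]] || return 1
    done
}

if roundtrip_ok 'a b&c=d/e?f' '1+1=2' '100%' 'back\slash'; then
    echo 'round trip ok'
else
    echo 'round trip FAILED'
fi
```

Because command substitution strips trailing newlines, samples that end in a newline would need the printf x trick used in the original test.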

Answered by Oleg Bondar'

If you have PHP installed on your server, you can very easily "cat" or even "tail" any file that contains URL-encoded strings.

tail -f nginx.access.log | php -R 'echo urldecode($argn)."\n";'

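Answered by Johnsyweb

As @barti_ddu said in the comments, \x "should be [double-]escaped".

% echo -e "$(echo "Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en" | sed 'y/+/ /; s/%/\\x/g')"
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

Rather than mixing up Bash and sed, I would do this all in Python. Here's a rough cut of how:

#!/usr/bin/env python3

import glob
import os
from urllib.parse import unquote

# Write a decoded copy of each *.log file in the current directory to <name>.new.
for logfile in glob.glob(os.path.join('.', '*.log')):
    with open(logfile) as current:
        new_log_filename = logfile + '.new'
        with open(new_log_filename, 'w') as new_log_file:
            for url in current:
                unquoted = unquote(url.strip())
                new_log_file.write(unquoted + '\n')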

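Answered by Stephane Chazelas

With GNU awk:

gawk -vRS='%[0-9a-fA-F]{2}' 'RT{sub("%","0x",RT);RT=sprintf("%c",strtonum(RT))}
                             {gsub(/\+/," ");printf "%s", $0 RT}'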