Original question: http://stackoverflow.com/questions/1199613/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Extract filename and path from URL in bash script
Asked by Arek
In my bash script I need to extract just the path from a given URL. For example, from a variable containing the string:
http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth
I want to extract to some other variable only the:
/one/more/dir/file.exe
part. Of course login, password, filename and parameters are optional.
Since I am new to sed and awk, I am asking for your help. Please advise me on how to do it. Thank you!
Answered by JESii
There are built-in functions in bash to handle this, e.g., the string pattern-matching operators:
- '#' remove minimal matching prefixes
- '##' remove maximal matching prefixes
- '%' remove minimal matching suffixes
- '%%' remove maximal matching suffixes
For example:
FILE=/home/user/src/prog.c
echo ${FILE#/*/} # ==> user/src/prog.c
echo ${FILE##/*/} # ==> prog.c
echo ${FILE%/*} # ==> /home/user/src
echo ${FILE%%/*} # ==> (empty string)
echo ${FILE%.c} # ==> /home/user/src/prog
All this comes from the excellent book "A Practical Guide to Linux Commands, Editors, and Shell Programming" by Mark G. Sobell (http://www.sobell.com/).
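The same operators can also split the path from the question into its directory and file name. This is a minimal sketch, assuming the path has already been isolated in a variable; the names path, file_name, dir_name and extension are illustrative and not part of the answer.

```bash
#!/bin/bash
path='/one/more/dir/file.exe'

file_name=${path##*/}        # remove the longest prefix ending in "/"   -> file.exe
dir_name=${path%/*}          # remove the shortest suffix starting at "/" -> /one/more/dir
extension=${file_name##*.}   # remove the longest prefix ending in "."   -> exe

echo "$dir_name" "$file_name" "$extension"
```

If the path has no dot in its last component, extension ends up equal to file_name, so a real script would probably want to check for that case.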
Answered by saeedgnu
In bash:
URL='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
URL_NOPRO=${URL:7}
URL_REL=${URL_NOPRO#*/}
echo "/${URL_REL%%\?*}"
This works only if the URL starts with http:// or with another protocol prefix of the same length. Otherwise, it's probably easier to use a regex with sed, grep or cut ...
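One way to lift the fixed-length restriction while keeping the same substring approach is to compute the offset from the scheme itself. The sketch below assumes the URL contains "://" and at least one slash after the host; path_after_scheme is a hypothetical helper name.

```bash
#!/bin/bash
# path_after_scheme is a hypothetical helper, not part of the answer above.
path_after_scheme() {
    local url=$1
    local scheme=${url%%://*}                     # e.g. "http", "https", "ftp"
    local no_proto=${url:$(( ${#scheme} + 3 ))}   # skip the scheme plus "://"
    local rel=${no_proto#*/}                      # drop "login:password@host/"
    echo "/${rel%%\?*}"                           # drop "?query", restore the leading "/"
}

path_after_scheme 'https://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
path_after_scheme 'ftp://example.com/one/more/dir/file.exe'
```

Both calls should print the same path, since the offset now follows the length of the scheme instead of being hard-coded to 7.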
Answered by Jim
This uses bash and cut as another way of doing this. It's ugly, but it works (at least for the example). Sometimes I like to use what I call cut sieves to whittle down to the information that I am actually looking for.
Note: Performance-wise, this may be a problem.
Given those caveats:
First, let's echo the line:
echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
Which gives us:
http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth
Then let's cut the line at the @ as a convenient way to strip out the http://login:password part:
echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2
That gives us this:
example.com/one/more/dir/file.exe?a=sth&b=sth
To get rid of the hostname, let's do another cut and use / as the delimiter, while asking cut to give us the second field and everything after it (essentially, to the end of the line). It looks like this:
echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2 | \
cut -d/ -f2-
Which, in turn, results in:
one/more/dir/file.exe?a=sth&b=sth
And finally, we want to strip off all the parameters from the end. Again, we'll use cut, this time with ? as the delimiter, and tell it to give us just the first field. That brings us to the end and looks like this:
echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2 | \
cut -d/ -f2- | \
cut -d'?' -f1
And the output is:
one/more/dir/file.exe
This is just another way to do it: you whittle away the data you don't need, interactively, until you arrive at what you do need.
If I wanted to stuff this into a variable in a script, I'd do something like this:
#!/bin/bash
url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
file_path=$(echo "${url}" | cut -d@ -f2 | cut -d/ -f2- | cut -d'?' -f1)
echo "${file_path}"
Hope it helps.
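The pipeline above relies on the @ being present: when the URL has no login part, cut -d@ -f2 passes the line through unchanged and the host name is no longer removed. A possible variant, sketched here under the assumption that the URL has the form scheme://host/path, counts slash-separated fields instead. The name url_path_cut is hypothetical.

```bash
#!/bin/bash
# url_path_cut is a hypothetical helper, not part of the answer above.
url_path_cut() {
    # Splitting on "/" gives: 1="http:", 2="", 3=host (with optional login), 4...=path.
    # Keep field 4 onwards, then drop everything from the first "?".
    printf '/%s\n' "$(printf '%s\n' "$1" | cut -d/ -f4- | cut -d'?' -f1)"
}

url_path_cut 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
url_path_cut 'http://example.com/one/more/dir/file.exe'
```

This version also puts the leading slash back, which the original pipeline drops.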
Answered by kenorb
url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
GNU grep
$ grep -Po '\w\K/\w+[^?]+' <<<$url
/one/more/dir/file.exe
BSD grep
$ grep -o '\w/\w\+[^?]\+' <<<$url | tail -c+2
/one/more/dir/file.exe
ripgrep
$ rg -o '\w(/\w+[^?]+)' -r '$1' <<<$url
/one/more/dir/file.exe
To get other parts of URL, check: Getting parts of a URL (Regex).
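If no external grep is wanted, a similar regex match can be done inside bash with the [[ =~ ]] operator. This is a different technique from the grep commands above, shown only as a sketch; the pattern in re is an assumption about what a scheme://host/path URL looks like, not a full URL grammar.

```bash
#!/bin/bash
url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

# Scheme, then "://", then the authority (login and host), then the captured path.
re='^[^:/?#]+://[^/?#]*(/[^?#]*)'

if [[ $url =~ $re ]]; then
    path=${BASH_REMATCH[1]}   # first capture group: the path without the query
else
    path=/                    # no path component found
fi
echo "$path"
```

The pattern is kept in a variable because quoting it inline on the right-hand side of =~ would make bash treat it as a literal string.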
Answered by Hirofumi Saito
If you have gawk:
$ echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
gawk '$0=gensub(/http:\/\/[^/]+(\/[^?]+)\?.*/,"\\1",1)'
or
$ echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
gawk -F'(http://[^/]+|?)' '$0=$2'
GNU awk can use a regular expression as the field separator (FS).
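Where gawk is not available, a similar result can probably be had with plain awk by deleting the unwanted parts with sub() rather than relying on gensub. This is a sketch under the assumption that the URL has the form scheme://host/path; url_path_awk is a hypothetical helper name.

```bash
#!/bin/bash
# url_path_awk is a hypothetical helper, not part of the answer above.
url_path_awk() {
    printf '%s\n' "$1" | awk '{
        sub("^[^:]*://[^/]*", "")     # remove scheme, login and host
        sub("[?].*", "")              # remove the query string
        print ($0 == "" ? "/" : $0)   # an empty remainder means the root path
    }'
}

url_path_awk 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
```

The regular expressions are passed as strings, which avoids having to escape the slashes inside an awk regex literal.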
Answered by caldfir
Using only bash builtins:
path="/${url#*://*/}" && [[ "/${url}" == "${path}" ]] && path="/"
What this does is:
- remove the prefix *://*/ (so this would be your protocol and hostname+port)
- check if we actually succeeded in removing anything; if not, then this implies there was no third slash (assuming this is a well-formed URL)
- if there was no third slash, then the path is just /
Note: the quotation marks aren't actually needed here, but I find it easier to read with them in.
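The one-liner leaves any query string attached to the path. The sketch below wraps it in a function and adds one extra expansion to drop the query; both the wrapper name url_to_path and the query-stripping step are additions for illustration, not part of the answer.

```bash
#!/bin/bash
# url_to_path is a hypothetical wrapper around the one-liner above.
url_to_path() {
    local url=$1 path
    path="/${url#*://*/}" && [[ "/${url}" == "${path}" ]] && path="/"
    echo "${path%%\?*}"   # added step: drop a trailing "?query", if any
}

url_to_path 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
url_to_path 'http://example.com'
```

The second call exercises the "no third slash" branch described in the list, so it should print just a slash.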
Answered by Urhixidur
The Perl snippet is intriguing, and since Perl is present in most Linux distros, quite useful, but... it doesn't do the job completely. Specifically, there is a problem in translating the URL/URI format from UTF-8 into path Unicode. Let me give an example of the problem. The original URI may be:
file:///home/username/Music/Jean-Michel%20Jarre/M%C3%A9tamorphoses/01%20-%20Je%20me%20souviens.mp3
The corresponding path would be:
/home/username/Music/Jean-Michel Jarre/Métamorphoses/01 - Je me souviens.mp3
%20 became a space, and %C3%A9 became 'é'. Is there a Linux command, bash feature, or Perl script that can handle this transformation, or do I have to write a humongous series of sed substring substitutions? What about the reverse transformation, from path to URL/URI?
(Follow-up)
Looking at http://search.cpan.org/~gaas/URI-1.54/URI.pm, I first saw the as_iri method, but that was apparently missing from my Linux (or is not applicable, somehow). It turns out the solution is to replace the "->path" part with "->file". You can then break that down further using basename and dirname, etc. The solution is thus:
path=$( echo "$url" | perl -MURI -le 'chomp($url = <>); print URI->new($url)->file' )
Oddly, using "->dir" instead of "->file" does NOT extract the directory part: rather, it formats the URI so it can be used as an argument to mkdir and the like.
(Further follow-up)
Is there any reason why the line cannot be shortened to this?
path=$( echo "$url" | perl -MURI -le 'print URI->new(<>)->file' )
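For the percent-decoding part of the question, a bash-only alternative to the Perl module is to rewrite each %XX as \xXX and let the printf builtin expand it. This is a sketch, not a full URI decoder: it assumes the input contains no literal backslashes (printf %b would interpret them too), and url_decode is a hypothetical helper name.

```bash
#!/bin/bash
# url_decode is a hypothetical helper, not part of the answer above.
url_decode() {
    local encoded=$1
    # Turn every "%XX" into "\xXX", then let printf %b expand the escapes into bytes.
    printf '%b\n' "${encoded//%/\\x}"
}

uri='file:///home/username/Music/Jean-Michel%20Jarre/M%C3%A9tamorphoses/01%20-%20Je%20me%20souviens.mp3'
url_decode "${uri#file://}"
```

The decoded bytes are emitted as-is, so the 'é' displays correctly only in a UTF-8 locale.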
Answered by sed
How about this?
echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
sed 's|.*://[^/]*/\([^?]*\)?.*|/\1|g'
- .*://[^/]*/ : http://login:password@example.com/
- \([^?]*\) : one/more/dir/file.exe
- ?.* : ?a=sth&b=sth
- /\1 : /one/more/dir/file.exe
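The expression above requires a literal ? in the input, so a URL without a query string passes through unchanged. A variant that treats the query as optional might look like the following sketch, which uses sed -E (extended regular expressions) and a hypothetical helper name url_path_sed.

```bash
#!/bin/bash
# url_path_sed is a hypothetical helper, not part of the answer above.
url_path_sed() {
    # scheme "://" host, an optional "/", then capture everything up to the first "?".
    printf '%s\n' "$1" | sed -E 's|^[^:]*://[^/]*/?([^?]*).*|/\1|'
}

url_path_sed 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
url_path_sed 'http://example.com/one/more/dir/file.exe'
url_path_sed 'http://example.com'
```

The last call has no path at all, in which case the capture group is empty and only the slash is printed.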
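Answered by ghostdog74
gawk
echo "http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth" | awk -F"/" '
{
$1=$2=$3=""
gsub(/\?.*/,"",$NF)
print substr($0,3)
}' OFS="/"
output
# ./test.sh
/one/more/dir/file.exe
Answered by glenn jackman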
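Best bet is to find a language that has a URL parsing library:
url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
path=$( echo "$url" | ruby -ruri -e 'puts URI.parse(gets.chomp).path' )
or
path=$( echo "$url" | perl -MURI -le 'chomp($url = <>); print URI->new($url)->path' )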