Disclaimer: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/1199613/

Date: 2020-09-09 18:21:45  Source: igfitidea

Extract filename and path from URL in bash script

Tags: bash, url, parsing

Asked by Arek

In my bash script I need to extract just the path from a given URL. For example, from a variable containing the string:

http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth

I want to extract only this part into another variable:

/one/more/dir/file.exe

Of course, the login, password, filename and parameters are optional.

Since I am new to sed and awk, I am asking for help. Please advise me how to do it. Thank you!

Answered by JESii

Bash has built-in parameter-expansion operators that handle this, e.g. the string pattern-matching operators:

  1. '#' remove minimal matching prefixes
  2. '##' remove maximal matching prefixes
  3. '%' remove minimal matching suffixes
  4. '%%' remove maximal matching suffixes

For example:

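FILE=/home/user/src/prog.c
echo "${FILE#/*/}"   # ==> user/src/prog.c   (shortest prefix matching /*/ removed)
echo "${FILE##/*/}"  # ==> prog.c            (longest prefix matching /*/ removed)
echo "${FILE%/*}"    # ==> /home/user/src    (shortest suffix matching /* removed)
echo "${FILE%%/*}"   # ==> (empty string)    (longest suffix matching /* removed)
echo "${FILE%.c}"    # ==> /home/user/src/prog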
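
Applied to the URL from the question, the same operators might be combined as in the sketch below. The variable names (url, rest, path, dir, file) are illustrative and not from the original answer, and the sketch assumes the URL has the form scheme://host/path with at least one slash after the host.

```shell
#!/bin/bash
url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

rest=${url#*://}      # shortest prefix ending in "://" removed (the scheme)
rest=${rest#*/}       # credentials and host removed, up to the first slash
path=/${rest%%\?*}    # longest suffix starting at "?" removed (the query)

dir=${path%/*}        # directory part of the path
file=${path##*/}      # filename part of the path

echo "$path"   # /one/more/dir/file.exe
echo "$dir"    # /one/more/dir
echo "$file"   # file.exe
```

If the URL has no path at all (for example http://example.com), the second step removes nothing, so this sketch would need an extra check for that case.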

All this is from the excellent book "A Practical Guide to Linux Commands, Editors, and Shell Programming" by Mark G. Sobell (http://www.sobell.com/).

Answered by saeedgnu

In bash:

URL='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
URL_NOPRO=${URL:7}               # drop the first 7 characters ("http://")
URL_REL=${URL_NOPRO#*/}          # drop everything up to the first slash
echo "/${URL_REL%%\?*}"          # drop the query string

This works only if the URL starts with http:// or a protocol of the same length. Otherwise, it's probably easier to use a regex with sed, grep or cut...
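
One possible way around the fixed-length limitation is to strip the scheme by pattern instead of by character count. The sketch below is an assumption-laden variant, not part of the original answer: extract_path is a hypothetical helper name, and it assumes every input has a slash after the host.

```shell
#!/bin/bash
# Sketch: same idea as above, without hard-coding the protocol length.
extract_path() {                    # hypothetical helper name
    local rest
    rest=${1#*://}                  # strip "scheme://" whatever its length
    rest=${rest#*/}                 # strip credentials, host and port
    printf '/%s\n' "${rest%%\?*}"   # strip the query string, restore leading slash
}

extract_path 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
extract_path 'https://example.com/a/b.txt'
extract_path 'ftp://example.com/pub/file.tar.gz?x=1'
```

The three calls print /one/more/dir/file.exe, /a/b.txt and /pub/file.tar.gz respectively.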

Answered by Jim

This uses bash and cut as another way of doing this. It's ugly, but it works (at least for the example). Sometimes I like to use what I call cut sieves to whittle the input down to the information I am actually looking for.

Note: Performance-wise, this may be a problem.

Given those caveats:

First, let's echo the line:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

Which gives us:

http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth

Then let's cut the line at the @ as a convenient way to strip out the http://login:password:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2

That gives us this:

example.com/one/more/dir/file.exe?a=sth&b=sth

To get rid of the hostname, let's do another cut, using / as the delimiter and asking cut to give us the second field and everything after it (essentially, to the end of the line). It looks like this:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2 | \
cut -d/ -f2-

Which, in turn, results in:

one/more/dir/file.exe?a=sth&b=sth

And finally, we want to strip off all the parameters from the end. Again we'll use cut, this time with ? as the delimiter, and tell it to give us just the first field. That brings us to the end and looks like this:

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
cut -d@ -f2 | \
cut -d/ -f2- | \
cut -d'?' -f1

And the output is:

one/more/dir/file.exe

This is just another way to do it: an interactive approach for whittling away the data you don't need until you are left with what you do need.

If I wanted to stuff this into a variable in a script, I'd do something like this:

#!/bin/bash

url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
# Quote the expansions and the ? delimiter so the shell does not glob or split them.
file_path=$(echo "${url}" | cut -d@ -f2 | cut -d/ -f2- | cut -d'?' -f1)
echo "${file_path}"
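
One caveat with the sieve above is that cut passes a line through unchanged when the delimiter is absent, so a URL without login:password@ would keep its scheme and host. A possible variant, sketched here under the assumptions that the URL starts with scheme:// and that the password contains no slash, skips the first three /-delimited fields instead. The name url_path is hypothetical.

```shell
#!/bin/bash
# Sketch: a cut sieve that does not rely on '@' being present.
url_path() {    # hypothetical helper name
    # Fields 1-3 of a '/'-delimited URL are "scheme:", "" and the authority,
    # so field 4 onward is everything after the host.
    printf '/%s\n' "$(printf '%s\n' "$1" | cut -d/ -f4- | cut -d'?' -f1)"
}

url_path 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'
url_path 'http://example.com/one/more/dir/file.exe'
```

Both calls print /one/more/dir/file.exe, with the leading slash the question asked for.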

Hope it helps.

Answered by kenorb

url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"

GNU grep

$ grep -Po '\w\K/\w+[^?]+' <<<"$url"
/one/more/dir/file.exe

BSD grep

$ grep -o '\w/\w\+[^?]\+' <<<"$url" | tail -c+2
/one/more/dir/file.exe

ripgrep

$ rg -o '\w(/\w+[^?]+)' -r '$1' <<<"$url"
/one/more/dir/file.exe


To get other parts of the URL, see: Getting parts of a URL (Regex).
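
A similar regex extraction can also be attempted without any external tool, using bash's own [[ =~ ]] operator. This is a swapped-in technique rather than part of the answer above, and the pattern in re is an assumption that only covers URLs of the form scheme://authority/path.

```shell
#!/bin/bash
url='http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth'

# scheme, "://", authority (no slash), then capture the path up to "?" or "#"
re='^[^:/?#]+://[^/?#]*(/[^?#]*)?'

if [[ $url =~ $re ]]; then
    # BASH_REMATCH[1] holds the captured path; fall back to "/" if it is empty
    path=${BASH_REMATCH[1]:-/}
    echo "$path"   # /one/more/dir/file.exe
fi
```

Keeping the pattern in a variable avoids quoting problems on the right-hand side of =~.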

Answered by Hirofumi Saito

If you have gawk:

$ echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
  gawk '$0=gensub(/http:\/\/[^/]+(\/[^?]+)\?.*/,"\\1",1)'

or

$ echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
  gawk -F'(http://[^/]+|[?])' '$0=$2'

GNU awk can use a regular expression as the field separator (FS).

Answered by caldfir

Using only bash builtins:

path="/${url#*://*/}" && [[ "/${url}" == "${path}" ]] && path="/"

What this does is:

  1. remove the prefix *://*/ (so this would be your protocol and hostname+port)
  2. check if we actually succeeded in removing anything - if not, then this implies there was no third slash (assuming this is a well-formed URL)
  3. if there was no third slash, then the path is just /

Note: the quotation marks aren't actually needed here, but I find it easier to read with them in.
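
The three steps above can be exercised with a small wrapper, sketched below. The function name url_to_path is hypothetical, and note that this approach leaves any query string attached to the path.

```shell
#!/bin/bash
url_to_path() {   # hypothetical wrapper name
    local url=$1 path
    path="/${url#*://*/}"                       # step 1: strip protocol and host
    [[ "/${url}" == "${path}" ]] && path="/"    # steps 2 and 3: nothing stripped, so no path
    printf '%s\n' "$path"
}

url_to_path 'http://example.com'                 # /
url_to_path 'http://example.com:8080/a/b?x=1'    # /a/b?x=1
```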

Answered by Urhixidur

The Perl snippet is intriguing, and since Perl is present in most Linux distros, quite useful, but... it doesn't do the job completely. Specifically, there is a problem in translating the URL/URI format from UTF-8 into a Unicode path. Let me give an example of the problem. The original URI may be:

file:///home/username/Music/Jean-Michel%20Jarre/M%C3%A9tamorphoses/01%20-%20Je%20me%20souviens.mp3

The corresponding path would be:

/home/username/Music/Jean-Michel Jarre/Métamorphoses/01 - Je me souviens.mp3

%20 became a space, %C3%A9 became 'é'. Is there a Linux command, bash feature, or Perl script that can handle this transformation, or do I have to write a humongous series of sed substring substitutions? What about the reverse transformation, from path to URL/URI?

(Follow-up)

Looking at http://search.cpan.org/~gaas/URI-1.54/URI.pm, I first saw the as_iri method, but that was apparently missing from my Linux (or is not applicable, somehow). It turns out the solution is to replace the "->path" part with "->file". You can then break that down further using basename, dirname, etc. The solution is thus:

path=$( echo "$url" | perl -MURI -le 'chomp($url = <>); print URI->new($url)->file' )

Oddly, using "->dir" instead of "->file" does NOT extract the directory part: rather, it formats the URI so it can be used as an argument to mkdir and the like.

(Further follow-up)

Any reason why the line cannot be shortened to this?

path=$( echo "$url" | perl -MURI -le 'print URI->new(<>)->file' )
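
On the question of whether a bash feature can handle the %XX decoding: one commonly used idiom turns each %XX into \xXX and lets printf interpret the escapes. The sketch below is an assumption-based illustration, not part of the original answer. The name urldecode is hypothetical, it relies on bash's printf builtin, and it would misbehave on input that already contains backslashes. It does not cover the reverse transformation.

```shell
#!/bin/bash
urldecode() {   # hypothetical helper name
    local encoded=$1
    # Replace every "%" with "\x", then let printf %b expand the \xXX escapes.
    printf '%b\n' "${encoded//%/\\x}"
}

uri='file:///home/username/Music/Jean-Michel%20Jarre/M%C3%A9tamorphoses/01%20-%20Je%20me%20souviens.mp3'
urldecode "${uri#file://}"
```

With the example URI, this should print the path shown earlier in this answer, with spaces and 'é' restored.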

Answered by sed

How about this?

echo 'http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth' | \
sed 's|.*://[^/]*/\([^?]*\)?.*|/\1|g'

Answered by ghostdog74

gawk

echo "http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth" | awk -F"/" '
{
 $1=$2=$3=""
 gsub(/\?.*/,"",$NF)
 print substr($0,3)
}' OFS="/"

output

# ./test.sh
/one/more/dir/file.exe

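Answered by glenn jackman

Best bet is to find a language that has a URL parsing library:

url="http://login:password@example.com/one/more/dir/file.exe?a=sth&b=sth"
path=$( echo "$url" | ruby -ruri -e 'puts URI.parse(gets.chomp).path' )

or

path=$( echo "$url" | perl -MURI -le 'chomp($url = <>); print URI->new($url)->path' )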