Extract substring in Bash

Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/428109/

Time: 2020-09-09 00:21:48  Source: igfitidea

Tags: string, bash, shell, substring

Asked by Berek Bryan

Given a filename in the form someletters_12345_moreleters.ext, I want to extract the 5 digits and put them into a variable.

So to emphasize the point, I have a filename with x number of characters then a five digit sequence surrounded by a single underscore on either side then another set of x number of characters. I want to take the 5 digit number and put that into a variable.

I am very interested in the number of different ways that this can be accomplished.

Accepted answer by FerranB

Use cut:

echo 'someletters_12345_moreleters.ext' | cut -d'_' -f 2

More generic:

INPUT='someletters_12345_moreleters.ext'
# Quote the expansions so whitespace or glob characters in the name survive.
SUBSTRING=$(echo "$INPUT" | cut -d'_' -f 2)
echo "$SUBSTRING"

Answer by JB.

If x is constant, the following parameter expansion performs substring extraction:

b=${a:12:5}

where 12 is the offset (zero-based) and 5 is the length.

If the underscores around the digits are the only ones in the input, you can strip off the prefix and suffix (respectively) in two steps:

tmp=${a#*_}   # remove prefix ending in "_"
b=${tmp%_*}   # remove suffix starting with "_"

If there are other underscores, it's probably feasible anyway, albeit more tricky. If anyone knows how to perform both expansions in a single expression, I'd like to know too.

Both solutions presented are pure bash, with no process spawning involved, hence very fast.
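
As a quick check, here is a sketch that runs both expansions on the sample filename from the question. The names a, tmp and b come from the answer; fixed and stripped are names introduced here for illustration only.

```shell
#!/usr/bin/env bash
# Sample filename from the question.
a='someletters_12345_moreleters.ext'

# Fixed offset: "someletters_" is 12 characters long,
# so the digits start at zero-based index 12.
fixed=${a:12:5}

# Two-step strip: drop the shortest prefix ending in "_",
# then the shortest suffix starting with "_".
tmp=${a#*_}
stripped=${tmp%_*}

echo "$fixed $stripped"
```

Both variables end up holding the same five digits here. With extra underscores in the name the two-step variant would keep more than the digits, which is the trickier case the answer mentions.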

Answer by Johannes Schaub - litb

Generic solution where the number can be anywhere in the filename, using the first such sequence:

number=$(echo "$filename" | grep -Eo '[[:digit:]]{5}' | head -n1)

Another solution to extract exactly a part of a variable:

number=${filename:offset:length}
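
The offset does not have to be hard-coded. A minimal sketch, assuming the digits always follow the first underscore and are five characters long (prefix and offset are names introduced here, not part of the answer):

```shell
#!/usr/bin/env bash
filename='someletters_12345_moreleters.ext'

# Everything before the first "_": remove the longest suffix starting with "_".
prefix=${filename%%_*}

# The digits start one character past that underscore.
offset=$(( ${#prefix} + 1 ))
number=${filename:offset:5}

echo "$number"
```

This keeps everything inside the shell while coping with a prefix of varying length.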

If your filename always has the format stuff_digits_... you can use awk:

number=$(echo "$filename" | awk -F _ '{ print $2 }')

Yet another solution, which removes everything except digits:

number=$(echo "$filename" | tr -cd '[:digit:]')

Answer by brown.2179

Just try cut -c startIndx-stopIndx
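
For the sample filename from the question, that might look like the sketch below. Note that cut -c counts characters from 1 and the range is inclusive, so the digits sit at positions 13 to 17. This only works when the length of the prefix is fixed.

```shell
filename='someletters_12345_moreleters.ext'

# Characters 13 through 17 (1-based, inclusive) are the five digits.
number=$(printf '%s\n' "$filename" | cut -c 13-17)

echo "$number"
```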

Answer by jperelli

In case someone wants more rigorous information, you can also search for it in man bash like this:

$ man bash [press return key]
/substring  [press return key]
[press "n" key]
[press "n" key]
[press "n" key]
[press "n" key]

Result:

${parameter:offset}
       ${parameter:offset:length}
              Substring Expansion.  Expands to  up  to  length  characters  of
              parameter  starting  at  the  character specified by offset.  If
              length is omitted, expands to the substring of parameter  start‐
              ing at the character specified by offset.  length and offset are
              arithmetic expressions (see ARITHMETIC  EVALUATION  below).   If
              offset  evaluates  to a number less than zero, the value is used
              as an offset from the end of the value of parameter.  Arithmetic
              expressions  starting  with  a - must be separated by whitespace
              from the preceding : to be distinguished from  the  Use  Default
              Values  expansion.   If  length  evaluates to a number less than
              zero, and parameter is not @ and not an indexed  or  associative
              array,  it is interpreted as an offset from the end of the value
              of parameter rather than a number of characters, and the  expan‐
              sion is the characters between the two offsets.  If parameter is
              @, the result is length positional parameters beginning at  off‐
              set.   If parameter is an indexed array name subscripted by @ or
              *, the result is the length members of the array beginning  with
              ${parameter[offset]}.   A  negative  offset is taken relative to
              one greater than the maximum index of the specified array.  Sub‐
              string  expansion applied to an associative array produces unde‐
              fined results.  Note that a negative offset  must  be  separated
              from  the  colon  by  at least one space to avoid being confused
              with the :- expansion.  Substring indexing is zero-based  unless
              the  positional  parameters are used, in which case the indexing
              starts at 1 by default.  If offset  is  0,  and  the  positional
              parameters are used, $0 is prefixed to the list.

Answer by nicerobot

Here's how I'd do it:

FN=someletters_12345_moreleters.ext
[[ ${FN} =~ _([[:digit:]]{5})_ ]] && NUM=${BASH_REMATCH[1]}

Explanation:

Bash-specific:

  • [[ ]] indicates a conditional expression
  • =~ indicates the condition is a regular expression
  • && chains the commands if the prior command was successful

Regular Expressions (RE): _([[:digit:]]{5})_

  • _ are literals to demarcate/anchor matching boundaries for the string being matched
  • () create a capture group
  • [[:digit:]] is a character class; I think it speaks for itself
  • {5} means exactly five of the prior character, class (as in this example), or group must match

In English, you can think of it behaving like this: the FN string is iterated character by character until we see an _, at which point the capture group is opened and we attempt to match five digits. If that matching is successful to this point, the capture group saves the five digits traversed. If the next character is an _, the condition is successful, the capture group is made available in BASH_REMATCH, and the next NUM= statement can execute. If any part of the matching fails, saved details are disposed of and character-by-character processing continues after the _. E.g. if FN were _1 _12 _123 _1234 _12345_, there would be four false starts before it found a match.
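
The false-start example can be tried directly. A minimal sketch, assuming a bash with =~ support (3.0 or later); the if/else wrapper and the messages are added here for illustration and are not part of the answer:

```shell
#!/usr/bin/env bash
# Input with four "false starts" before the real match.
FN='_1 _12 _123 _1234 _12345_'

if [[ $FN =~ _([[:digit:]]{5})_ ]]; then
    NUM=${BASH_REMATCH[1]}   # first capture group: the five digits
    echo "matched: $NUM"
else
    echo "no match" >&2
fi
```

BASH_REMATCH[0] holds the whole matched text including the underscores, while index 1 holds only the capture group.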

Answer by user1338062

I'm surprised this pure bash solution didn't come up:

a="someletters_12345_moreleters.ext"
IFS="_"
set -- $a   # split $a on "_" into the positional parameters
echo $2
# prints 12345

You probably want to reset IFS to the value it had before, or unset IFS afterwards!
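
One way to avoid touching the global IFS at all is to scope the change to a single read. A minimal sketch (before, digits and after are placeholder names introduced here, and the here-string <<< is bash-specific):

```shell
#!/usr/bin/env bash
a='someletters_12345_moreleters.ext'

# The IFS assignment applies only to this one read command.
IFS=_ read -r before digits after <<< "$a"

echo "$digits"
```

Because the assignment is a prefix to read, the shell's own IFS keeps its previous value and the positional parameters are left alone.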

Answer by PEZ

Building on jor's answer (which doesn't work for me):

substring=$(expr "$filename" : '.*_\([^_]*\)_.*')

Answer by fedorqui 'SO stop harming'

Following the requirements

I have a filename with x number of characters then a five digit sequence surrounded by a single underscore on either side then another set of x number of characters. I want to take the 5 digit number and put that into a variable.

I found some grep ways that may be useful:

$ echo "someletters_12345_moreleters.ext" | grep -Eo "[[:digit:]]+"
12345

or better

$ echo "someletters_12345_moreleters.ext" | grep -Eo "[[:digit:]]{5}"
12345

And then with -Po syntax (Perl-compatible regular expressions, a GNU grep option):

$ echo "someletters_12345_moreleters.ext" | grep -Po '(?<=_)\d+'
12345

Or if you want to make it fit exactly 5 characters:

$ echo "someletters_12345_moreleters.ext" | grep -Po '(?<=_)\d{5}'
12345

Finally, to store the result in a variable, just use the var=$(command) syntax.
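
For example, a sketch that stores the five digits in a variable and checks that something was actually found. The name number is chosen here for illustration; grep -o is widely available (GNU and BSD grep) but is not required by POSIX.

```shell
name='someletters_12345_moreleters.ext'

# Command substitution captures grep's output; head keeps only the first match.
number=$(printf '%s\n' "$name" | grep -Eo '[[:digit:]]{5}' | head -n 1)

if [ -n "$number" ]; then
    echo "found: $number"
else
    echo "no 5-digit run in $name" >&2
fi
```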

Answer by fedorqui 'SO stop harming'

If we focus on the concept of:
"A run of (one or several) digits"

We could use several external tools to extract the numbers.
We could quite easily erase all other characters, with either sed or tr:

name='someletters_12345_moreleters.ext'

echo "$name" | sed 's/[^0-9]*//g'    # 12345
echo "$name" | tr -c -d 0-9          # 12345

But if $name contains several runs of numbers, the above will fail:

If "name=someletters_12345_moreleters_323_end.ext", then:

echo "$name" | sed 's/[^0-9]*//g'    # 12345323
echo "$name" | tr -c -d 0-9          # 12345323

We need to use regular expressions (regex).
To select only the first run (12345 not 323) in sed and perl:

echo "$name" | sed 's/[^0-9]*\([0-9]\{1,\}\).*$/\1/'
perl -e 'my ($num) = $ARGV[0] =~ /(\d+)/; print "$num\n";' "$name"

But we could as well do it directly in bash (1):

regex='[^0-9]*([0-9]{1,}).*$'
[[ $name =~ $regex ]] && echo "${BASH_REMATCH[1]}"

This allows us to extract the FIRST run of digits of any length
surrounded by any other text/characters.

Note: regex='[^0-9]*([0-9]{5,5}).*$' will capture exactly five consecutive digits (from a longer run it takes the first five). :-)

(1): faster than calling an external tool for each short text. Not faster than doing all processing inside sed or awk for large files.
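
Wrapped in a function, the bash-only variant might look like the sketch below. first_digits is a hypothetical helper name, not something from the answer, and the shorter pattern ([0-9]+) is an assumption made here: it relies on =~ returning the leftmost match, so no leading or trailing context is needed.

```shell
#!/usr/bin/env bash
# Print the first run of digits found in $1.
# Returns non-zero (and prints nothing) if $1 contains no digit.
first_digits() {
    local re='([0-9]+)'
    [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

num=$(first_digits 'someletters_12345_moreleters_323_end.ext')
echo "$num"
```

The non-zero return status lets a caller distinguish "no digits" from an empty result, for example with if num=$(first_digits "$name"); then ... fi.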