Original question: http://stackoverflow.com/questions/6973088/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not to this page): Stack Overflow
Longest common prefix of two strings in bash
Asked by con-f-use
I have two strings. For the sake of the example they are set like this:
string1="test toast"
string2="test test"
What I want is to find the overlap starting at the beginning of the strings. By overlap I mean the string "test t" in the example above.
# So I look for the command
command "$string1" "$string2"
# that outputs:
"test t"
If the strings were string1="atest toast"; string2="test test", they would have no overlap, since the check starts from the beginning and string1 starts with an "a".
Accepted answer by jfg956
In sed, assuming the strings don't contain any newline characters:
string1="test toast"
string2="test test"
printf "%s\n%s\n" "$string1" "$string2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'
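For reuse, the one-liner can be wrapped in a small function. This is only a sketch: the name `common_prefix_sed` is made up for illustration, and it keeps the same assumption that neither string contains a newline. The greedy `\(.*\)` captures the longest start of the first line that the back-reference `\1` can also match at the start of the second line.

```bash
# Hypothetical helper (the name is not from the original answer).
# Prints the longest common prefix of its two arguments, followed by a newline.
common_prefix_sed() {
  # N joins both input lines in the pattern space, separated by a newline.
  # \(.*\) is greedy, so sed backtracks to the longest prefix that \1
  # can also match right after the newline.
  printf '%s\n%s\n' "$1" "$2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'
}

common_prefix_sed "test toast" "test test"    # prints: test t
```

When the strings share no prefix, the captured group is empty and only an empty line is printed.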
Answer by ack
An improved version of the sed example; this one finds the common prefix of N strings (N>=0):
string1="test toast"
string2="test test"
string3="teaser"
{ echo "$string1"; echo "$string2"; echo "$string3"; } | sed -e 'N;s/^\(.*\).*\n\1.*$/\1\n\1/;D'
If the strings are stored in an array, they can be piped to sed with printf:
strings=("test toast" "test test" "teaser")
printf "%s\n" "${strings[@]}" | sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'
You can also use a here-string:
strings=("test toast" "test test" "teaser")
oIFS=$IFS
IFS=$'\n'
<<<"${strings[*]}" sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'
IFS=$oIFS
# for a local IFS:
(IFS=$'\n'; sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}' <<<"${strings[*]}")
The here-string (as with all redirections) can go anywhere within a simple command.
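A minimal illustration of that remark, using `tr` as a stand-in command (the variable names `a`, `b` and `c` are only for this example): the same here-string is placed after, before, and in the middle of the command's words, and all three forms behave identically.

```bash
s="test toast"

a=$(tr 'a-z' 'A-Z' <<<"$s")     # redirection after the arguments
b=$(<<<"$s" tr 'a-z' 'A-Z')     # redirection before the command name
c=$(tr <<<"$s" 'a-z' 'A-Z')     # redirection between command name and arguments

printf '%s\n' "$a" "$b" "$c"    # prints TEST TOAST three times
```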
Answer by Eugene Yarmash
Yet another variant, using GNU grep:
$ string1="test toast"
$ string2="test test"
$ grep -zPo '(.*).*\n\K\1' <<< "$string1"$'\n'"$string2"
test t
Answer by Gilles 'SO- stop being evil'
This can be done entirely inside bash. Although doing string manipulation in a loop in bash is slow, there is a simple algorithm that is logarithmic in the number of shell operations, so pure bash is a viable option even for long strings.
longest_common_prefix () {
  local prefix= n
  ## Truncate the two strings to the minimum of their lengths
  if [[ ${#1} -gt ${#2} ]]; then
    set -- "${1:0:${#2}}" "$2"
  else
    set -- "$1" "${2:0:${#1}}"
  fi
  ## Binary search for the first differing character, accumulating the common prefix
  while [[ ${#1} -gt 1 ]]; do
    n=$(((${#1}+1)/2))
    ## The right-hand side is quoted so that it is compared literally, not as a glob pattern
    if [[ ${1:0:$n} == "${2:0:$n}" ]]; then
      prefix=$prefix${1:0:$n}
      set -- "${1:$n}" "${2:$n}"
    else
      set -- "${1:0:$n}" "${2:0:$n}"
    fi
  done
  ## Add the one remaining character, if common
  if [[ $1 = "$2" ]]; then prefix=$prefix$1; fi
  printf %s "$prefix"
}
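To get a feel for the logarithmic behaviour, the sketch below repeats the same halving loop with a counter added. The names `lcp_counted`, `steps` and `result` are made up for this illustration; globals are used instead of printing so that the counter survives the call.

```bash
# Illustration only: the halving loop from above plus an iteration counter.
# Sets the (hypothetical) globals `result` and `steps`.
lcp_counted () {
  local prefix= n
  steps=0
  # Truncate both strings to the shorter length
  if [[ ${#1} -gt ${#2} ]]; then
    set -- "${1:0:${#2}}" "$2"
  else
    set -- "$1" "${2:0:${#1}}"
  fi
  while [[ ${#1} -gt 1 ]]; do
    steps=$((steps+1))
    n=$(((${#1}+1)/2))
    if [[ ${1:0:$n} == "${2:0:$n}" ]]; then
      # First halves agree: keep them and continue in the second halves
      prefix=$prefix${1:0:$n}
      set -- "${1:$n}" "${2:$n}"
    else
      # First halves differ: the mismatch is inside them
      set -- "${1:0:$n}" "${2:0:$n}"
    fi
  done
  if [[ $1 = "$2" ]]; then prefix=$prefix$1; fi
  result=$prefix
}

# Two 1001-character strings that differ only in the last character
a=$(printf 'x%.0s' {1..1000})
lcp_counted "${a}A" "${a}B"
echo "${#result} characters in common, found in $steps loop iterations"
```

For inputs of this size the loop runs roughly ten times, since the compared length is halved on every pass, whereas a character-by-character scan would need about a thousand comparisons.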
The standard toolbox includes cmp to compare binary files. By default, it reports the byte offset of the first differing byte. There is a special case when one string is a prefix of the other: cmp produces a different message on STDERR; an easy way to deal with this is to take whichever string is shorter.
longest_common_prefix () {
  local LC_ALL=C offset prefix
  offset=$(export LC_ALL; cmp <(printf %s "$1") <(printf %s "$2") 2>/dev/null)
  if [[ -n $offset ]]; then
    ## cmp printed "... differ: byte N, line M"; extract N
    offset=${offset%,*}; offset=${offset##* }
    prefix=${1:0:$((offset-1))}
  else
    ## No difference reported: one string is a prefix of the other, so take the shorter
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}
Note that cmp operates on bytes, but bash's string manipulation operates on characters. This makes a difference in multibyte locales, for example locales using the UTF-8 character set. The function above prints the longest prefix of a byte string. To handle character strings with this method, we can first convert the strings to a fixed-width encoding. Assuming the locale's character set is a subset of Unicode, UTF-32 fits the bill.
longest_common_prefix () {
  ## Use LC_ALL if it is set, otherwise LC_CTYPE, as the character type locale
  local offset prefix LC_CTYPE="${LC_ALL:-$LC_CTYPE}"
  offset=$(unset LC_ALL; LC_MESSAGES=C cmp <(printf %s "$1" | iconv -t UTF-32) \
                                           <(printf %s "$2" | iconv -t UTF-32) 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
    prefix=${1:0:$((offset/4-1))}
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}
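The byte versus character distinction discussed above can be checked with a quick experiment. This is a sketch that assumes an iconv supporting the UTF-32LE encoding name (chosen here, unlike in the function above, to avoid a byte order mark); the variable names are illustrative.

```bash
# "aé" written as explicit UTF-8 bytes: 'a' is 1 byte, 'é' is 2 bytes
s=$'a\xc3\xa9'

utf8_bytes=$(printf %s "$s" | wc -c)
utf32_bytes=$(printf %s "$s" | iconv -f UTF-8 -t UTF-32LE | wc -c)

# $(( ... )) strips any padding that wc may add around the number
echo "UTF-8: $((utf8_bytes)) bytes, UTF-32LE: $((utf32_bytes)) bytes"
# prints: UTF-8: 3 bytes, UTF-32LE: 8 bytes
```

In UTF-8 the two characters occupy a variable number of bytes, so a byte offset from cmp does not map directly to a character index; in UTF-32 every character takes exactly four bytes, which is why the function divides the offset by 4.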
Answer by Hubbitus
A short grep variant (idea borrowed from the sed one):
$ echo -e "String1\nString2" | grep -zoP '^(.*)(?=.*?\n\1)'
String
This assumes the strings contain no newline characters, but it can easily be adapted to use any delimiter.
Update 2016-10-24: modern versions of grep may complain "grep: unescaped ^ or $ not supported with -Pz"; in that case just use \A instead of ^:
$ echo -e "String1\nString2" | grep -zoP '\A(.*)(?=.*?\n\1)'
String
Answer by jfg956
Without sed, using the cmp utility to get the index of the first differing character, and process substitution to pass the two strings to cmp:
string1="test toast"
string2="test test"
first_diff_char=$(cmp <( echo "$string1" ) <( echo "$string2" ) | cut -d " " -f 5 | tr -d ",")
echo "${string1:0:$((first_diff_char-1))}"
Answer by Tanktalus
OK, in bash:
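#!/bin/bash
s="$1"
t="$2"
l=1
# Grow l while the first l characters of s are a prefix of t.
# The length check stops the loop when s itself is a prefix of t.
while [ "$l" -le "${#s}" ] && [ "${t#"${s:0:$l}"}" != "$t" ]
do
  (( l = l + 1 ))
done
(( l = l - 1 ))
echo "${s:0:$l}"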
It's the same algorithm as in other languages, but using pure bash functionality. And, might I say, a bit uglier, too :-)
Answer by Tanktalus
It's probably simpler in another language. Here's my solution:
common_bit=$(perl -le '($s,$t)=@ARGV;for(split//,$s){last unless $t=~/^\Q$z$_/;$z.=$_}print $z' "$string1" "$string2")
If this weren't a one-liner, I'd use longer variable names, more whitespace, more braces, etc. I'm also sure there's a faster way, even in perl, but, again, it's a trade-off between speed and space: this uses less space on what is already a long one-liner.
Answer by chad
Yet another way, using Bash only.
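string1="test toast"
string2="test test"
len=${#string1}
# Advance i while the characters match; stop at the first mismatch.
for ((i=0; i<len; i++)); do
  if [[ "${string1:i:1}" != "${string2:i:1}" ]]; then
    break
  fi
done
# Printing after the loop also covers the case where string1 is a prefix of string2.
echo "${string1:0:i}"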
Answer by Karoly Horvath
Man, this is tough. It's an extremely trivial task, yet I don't know how to do this with the shell :)
Here is an ugly solution:
echo "$2" | awk 'BEGIN{FS=""} { n=1; while (n<=NF && $n == substr(test,n,1)) { printf("%s",$n); n++ } print "" }' test="$1"

