bash: Longest common prefix of two strings in bash

Question

Asked by con-f-use

I have two strings. For the sake of the example they are set like this:

string1="test toast"
string2="test test"

What I want is to find the overlap starting at the beginning of the strings. By overlap I mean the string "test t" in my example above.

# So I look for the command 
command "$string1" "$string2"
# that outputs:
"test t"

If the strings were string1="atest toast"; string2="test test", they would have no overlap, since the check starts from the beginning and string1 starts with an "a".

Answer 1

Accepted answer by jfg956

In sed, assuming the strings don't contain any newline characters:

string1="test toast"
string2="test test"
printf "%s\n%s\n" "$string1" "$string2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'
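
To see why the expression works: N appends the second line to the pattern space, the greedy \(.*\) captures as much of the first string as possible, and the backreference \1 forces that same text to appear right after the newline, that is, at the start of the second string. Below is a minimal sketch that wraps the command in a function. The name lcp_sed is made up for this illustration, and the sketch keeps the answer's assumption that neither string contains a newline.

```shell
#!/usr/bin/env bash
# lcp_sed is a hypothetical helper name, not part of the original answer.
# Assumes neither argument contains a newline character.
lcp_sed () {
  # N joins both lines into one pattern space; \(.*\) is matched greedily,
  # and the backreference \1 requires the same text to begin the second line.
  printf '%s\n%s\n' "$1" "$2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'
}

lcp_sed "test toast" "test test"    # prints: test t
lcp_sed "atest toast" "test test"   # prints an empty line (no common prefix)
```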

Answer 2

Answered by ack

An improved version of the sed example, this finds the common prefix of N strings (N>=0):

string1="test toast"
string2="test test"
string3="teaser"
{ echo "$string1"; echo "$string2"; echo "$string3"; } | sed -e 'N;s/^\(.*\).*\n\1.*$/\1\n\1/;D'

If the strings are stored in an array, they can be piped to sed with printf:

strings=("test toast" "test test" "teaser")
printf "%s\n" "${strings[@]}" | sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'

You can also use a here-string:

strings=("test toast" "test test" "teaser")
oIFS=$IFS
IFS=$'\n'
<<<"${strings[*]}" sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'
IFS=$oIFS
# for a local IFS:
(IFS=$'\n'; sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}' <<<"${strings[*]}")

The here-string (as with all redirections) can go anywhere within a simple command.
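
As a small illustration of that point, the sketch below runs the same tr command with the here-string in three different positions. The input text and the variable names a, b, and c are arbitrary choices for this example.

```shell
#!/usr/bin/env bash
# Three equivalent placements of the same here-string redirection.
a=$(<<<"hello world" tr 'a-z' 'A-Z')   # before the command name
b=$(tr <<<"hello world" 'a-z' 'A-Z')   # between the command name and its arguments
c=$(tr 'a-z' 'A-Z' <<<"hello world")   # after the arguments (the usual style)

printf '%s\n' "$a" "$b" "$c"   # each line reads: HELLO WORLD
```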

Answer 3

Answered by Eugene Yarmash

Yet another variant, using GNU grep:

$ string1="test toast"
$ string2="test test"
$ grep -zPo '(.*).*\n\K\1' <<< "$string1"$'\n'"$string2"
test t

Answer 4

Answered by Gilles 'SO- stop being evil'

This can be done entirely inside bash. Although doing string manipulation in a loop in bash is slow, there is a simple algorithm that needs only a logarithmic number of shell operations, so pure bash is a viable option even for long strings.

longest_common_prefix () {
  local prefix= n
  ## Truncate the two strings to the minimum of their lengths
  if [[ ${#1} -gt ${#2} ]]; then
    set -- "${1:0:${#2}}" "$2"
  else
    set -- "$1" "${2:0:${#1}}"
  fi
  ## Binary search for the first differing character, accumulating the common prefix
  while [[ ${#1} -gt 1 ]]; do
    n=$(((${#1}+1)/2))
    ## Quote the right-hand side so it is compared literally, not as a glob pattern
    if [[ ${1:0:$n} == "${2:0:$n}" ]]; then
      prefix=$prefix${1:0:$n}
      set -- "${1:$n}" "${2:$n}"
    else
      set -- "${1:0:$n}" "${2:0:$n}"
    fi
  done
  ## Add the one remaining character, if common
  if [[ $1 == "$2" ]]; then prefix=$prefix$1; fi
  printf %s "$prefix"
}

The standard toolbox includes cmp to compare binary files. By default, it reports the byte offset of the first differing byte. There is a special case when one string is a prefix of the other: cmp produces a different message on STDERR; an easy way to deal with this is to take whichever string is shorter.

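longest_common_prefix () {
  local LC_ALL=C offset prefix
  offset=$(export LC_ALL; cmp <(printf %s "$1") <(printf %s "$2") 2>/dev/null)
  if [[ -n $offset ]]; then
    ## cmp reported a difference: extract the 1-based byte offset from its message
    offset=${offset%,*}; offset=${offset##* }
    prefix=${1:0:$((offset-1))}
  else
    ## No difference reported: one string is a prefix of the other, so take the shorter one
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}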

Note that cmp operates on bytes, but bash's string manipulation operates on characters. This makes a difference in multibyte locales, for example locales using the UTF-8 character set. The function above prints the longest prefix of a byte string. To handle character strings with this method, we can first convert the strings to a fixed-width encoding. Assuming the locale's character set is a subset of Unicode, UTF-32 fits the bill.

longest_common_prefix () {
  ## Keep the caller's character set for iconv and for bash's substring expansion
  local offset prefix LC_CTYPE="${LC_ALL:-$LC_CTYPE}"
  offset=$(unset LC_ALL; LC_MESSAGES=C cmp <(printf %s "$1" | iconv -t UTF-32) \
                                           <(printf %s "$2" | iconv -t UTF-32) 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
    ## Each character is 4 bytes; this assumes iconv emits a 4-byte BOM first
    prefix=${1:0:$(((offset-1)/4-1))}
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}
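
The byte versus character distinction described above can be observed directly. The following is only a sketch: the sample string is arbitrary, and the character count it prints depends on the locale the script runs in, so only the byte counts are annotated.

```shell
#!/usr/bin/env bash
# "é" encoded in UTF-8 is two bytes (0xC3 0xA9), so this string is 5 bytes long.
s=$'caf\xc3\xa9'

# ${#s} counts characters according to the current locale,
# so this value depends on the environment (4 in a UTF-8 locale).
char_len=${#s}

# Forcing the C locale inside a subshell makes bash count bytes instead.
byte_len=$(LC_ALL=C; printf '%s' "${#s}")

# wc -c always counts bytes, the unit in which cmp reports offsets.
wc_len=$(printf '%s' "$s" | wc -c)

echo "characters (locale-dependent): $char_len"
echo "bytes via LC_ALL=C: $byte_len"     # 5
echo "bytes via wc -c: $((wc_len))"      # 5
```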

Answer 5

Answered by Hubbitus

A short grep variant (idea borrowed from the sed one):

$ echo -e "String1\nString2" | grep -zoP '^(.*)(?=.*?\n\1)'
String

This assumes the strings contain no newline character, but it can easily be adapted to use any delimiter.

Update (2016-10-24): modern versions of grep may complain with "grep: unescaped ^ or $ not supported with -Pz"; in that case, use \A instead of ^:

$ echo -e "String1\nString2" | grep -zoP '\A(.*)(?=.*?\n\1)'
String

Answer 6

Answered by jfg956

Without sed, using the cmp utility to get the index of the first differing character, and using process substitution to pass the two strings to cmp:

string1="test toast"
string2="test test"
first_diff_char=$(cmp <( echo "$string1" ) <( echo "$string2" ) | cut -d " " -f 5 | tr -d ",")
echo "${string1:0:$((first_diff_char-1))}"

Answer 7

Answered by Tanktalus

Ok, in bash:

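#!/bin/bash

s="$1"
t="$2"
l=1

# Grow the candidate prefix of s while it is still a prefix of t.
# The length check stops the loop when s itself is a prefix of t;
# the inner quotes make the prefix match literally rather than as a glob.
while [ "$l" -le "${#s}" ] && [ "${t#"${s:0:$l}"}" != "$t" ]
do
  (( l = l + 1 ))
done
(( l = l - 1 ))

echo "${s:0:$l}"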

It's the same algorithm as in other languages, but pure bash functionality. And, might I say, a bit uglier, too :-)

Answer 8

Answered by Tanktalus

It's probably simpler in another language. Here's my solution:

common_bit=$(perl -le '($s,$t)=@ARGV;for(split//,$s){last unless $t=~/^\Q$z$_/;$z.=$_}print $z' "$string1" "$string2")

If this weren't a one-liner, I'd use longer variable names, more whitespace, more braces, etc. I'm also sure there's a faster way, even in perl, but, again, it's a trade-off between speed and space: this uses less space on what is already a long one-liner.

Answer 9

Answered by chad

Yet another way, using only Bash.

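string1="test toast"
string2="test test"
len=${#string1}

# Advance i while the characters at position i match
for ((i=0; i<len; i++)); do
   [[ "${string1:i:1}" == "${string2:i:1}" ]] || break
done
# i is now the length of the common prefix; printing after the loop
# also covers the case where string1 is a prefix of string2
echo "${string1:0:i}"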
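
The same character-by-character scan extends to any number of strings by carrying the running prefix from one argument to the next. What follows is a rough sketch in pure bash, not taken from any of the answers; the function name lcp_all is invented for this example.

```shell
#!/usr/bin/env bash
# lcp_all is a hypothetical name chosen for this sketch.
# Prints the longest common prefix of all its arguments.
lcp_all () {
  local prefix=${1-} s i
  (( $# > 0 )) && shift
  for s in "$@"; do
    i=0
    # Stop at the end of either string or at the first mismatching character.
    while (( i < ${#prefix} && i < ${#s} )) && [[ ${prefix:i:1} == "${s:i:1}" ]]; do
      i=$((i+1))
    done
    # Shrink the running prefix to the part shared with this string.
    prefix=${prefix:0:i}
  done
  printf '%s\n' "$prefix"
}

lcp_all "test toast" "test test"            # prints: test t
lcp_all "test toast" "test test" "teaser"   # prints: te
```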

Answer 10

Answered by Karoly Horvath

Man, this is tough. It's an extremely trivial task, yet I don't know how to do this with the shell :)

Here is an ugly solution:

echo "$1" | awk 'BEGIN{FS=""} { n=1; while (n<=NF && $n == substr(test,n,1)) { printf("%s",$n); n++ } print "" }' test="$2"
