Notice: this page is a side-by-side Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/35212701/

Time: 2020-09-18 14:12:12  Source: igfitidea

How can I display unique words contained in a Bash string?

linux, string, bash, duplicates

Asked by Todd Partridge

I have a string that has duplicate words. I would like to display only the unique words. The string is:

variable="alpha bravo charlie alpha delta echo charlie"

I know several tools that can do this together. This is what I figured out:

echo $variable | tr " " "\n" | sort -u | tr "\n" " "

What is a more effective way to do this?

Accepted answer by Todd A. Jacobs

Use a Bash Substitution Expansion

The following shell parameter expansion will substitute spaces with newlines, and then pass the results into the sort utility to return only the unique words.

$ echo -e "${variable// /\n}" | sort -u
alpha
bravo
charlie
delta
echo

This has the side-effect of sorting your words: uniq only detects duplicates that are adjacent, so the input has to be sorted first, and sort -u does both steps at once. If that's not what you want, I also posted a Ruby solution that preserves the original word order.
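
A minimal sketch of why the sorting step matters: uniq on its own only collapses adjacent duplicates, so repeated words that are not next to each other survive. The variable names below are illustrative and not part of the original answer.

```shell
words='alpha bravo alpha'

# uniq alone: the two "alpha" entries are not adjacent, so both survive.
uniq_only=$(printf '%s\n' "$words" | tr ' ' '\n' | uniq | xargs)

# sort first, then uniq: duplicates become adjacent and are collapsed.
sort_then_uniq=$(printf '%s\n' "$words" | tr ' ' '\n' | sort | uniq | xargs)

# sort -u performs both steps in one command.
sort_u=$(printf '%s\n' "$words" | tr ' ' '\n' | sort -u | xargs)

printf '%s\n' "$uniq_only" "$sort_then_uniq" "$sort_u"
```

The first line printed still contains "alpha" twice; the other two do not.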

Rejoining Words

If, as one commenter pointed out, you're trying to reassemble your unique words back into a single line, you can use command substitution to do this. For example:

$ echo $(echo -e "${variable// /\n}" | sort -u)
alpha bravo charlie delta echo

The lack of quotes around the command substitution is intentional. If you quote it, the newlines will be preserved because Bash won't do word-splitting. Unquoted, the shell splits the output into words and echo joins them with single spaces, so the result comes back as a single line, however unintuitive that may seem.
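
The difference between the quoted and unquoted forms can be sketched as follows, assuming the default IFS (space, tab, newline). The variable names are made up for illustration.

```shell
# Three words, one per line, as sort -u would emit them.
deduped=$(printf 'alpha\nbravo\ncharlie\n')

# Unquoted: the shell splits on whitespace and echo joins the words with spaces.
unquoted=$(echo $deduped)

# Quoted: no word splitting, so the embedded newlines are kept.
quoted=$(echo "$deduped")

printf '[%s]\n' "$unquoted" "$quoted"
```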

Answer by jyvet

You may use xargs:

echo "$variable" | xargs -n 1 | sort -u | xargs
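
As a rough illustration of what each stage of that pipeline does: xargs -n 1 runs its default command (echo) once per word, putting one word on each line, and the final xargs with no arguments joins the lines back into one. Note that xargs also interprets quotes and backslashes in its input, so this sketch assumes plain words. The variable names are illustrative.

```shell
variable="alpha bravo charlie alpha delta echo charlie"

# Stage 1: one word per line.
one_per_line=$(echo "$variable" | xargs -n 1)

# Stages 2 and 3: drop duplicates (sorted), then rejoin into a single line.
joined=$(echo "$variable" | xargs -n 1 | sort -u | xargs)

printf '%s\n' "$joined"
```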

Answer by Todd A. Jacobs

Preserve Input Order with a Ruby One-Liner

I posted a Bash-specific answer already, but if you want to return only unique words while preserving the word order of the original string, then you can use the following Ruby one-liner:

$ echo "$variable" | ruby -ne 'puts $_.split.uniq'
alpha
bravo
charlie
delta
echo

This will split the input string on whitespace, and then return unique elements from the resulting array.

Unlike the sort or uniq utilities, Ruby doesn't need the words to be sorted to detect duplicates. This may be a better solution if you don't want your results to be sorted, although given your input sample it makes no practical difference for the posted example.

Rejoining Words

If, as one commenter pointed out, you're then trying to reassemble the words back into a single line after deduplication, you can do that too. For that, we just append the Array#join method:

$ echo "$variable" | ruby -ne 'puts $_.split.uniq.join(" ")'
alpha bravo charlie delta echo

Answer by mklement0

Note: This solution assumes that all unique words should be output in the order they're encountered in the input. By contrast, the OP's own solution attempt outputs a sorted list of unique words.

A simple Awk-only solution (POSIX-compliant) that is efficient because it avoids a pipeline (which invariably involves subshells):

awk -v RS=' ' '{ if (!seen[$1]++) { printf "%s%s",sep,$1; sep=" " } }' <<<"$variable"

# The above prints without a trailing \n, as in the OP's own solution.
# To add a trailing newline, append `END { print "" }` to the end
# of the Awk script (a bare `print` would output the last record again).
  • Note how $variable is double-quoted to protect it from accidental shell expansions, notably pathname expansion (globbing), and how it is provided to Awk via a here-string (<<<).

  • -v RS=' ' tells Awk to split the input into records by a single space.

    • Note that the last word will have the input line's trailing newline included, which is why we don't use $0 (the entire record) but $1, the record's first field, which has the newline stripped due to Awk's default field-splitting behavior.
  • seen[$1]++ is a common Awk idiom that either creates an entry for $1, the input word, in associative array seen, if it doesn't exist yet, or increments its occurrence count.

  • !seen[$1]++ therefore only returns true for the first occurrence of a given word (where seen[$1] is implicitly zero/the empty string; the ++ is a post-increment, and therefore doesn't take effect until after the condition is evaluated).

  • {printf "%s%s",sep,$1; sep=" "} prints the word at hand, $1, preceded by separator sep, which is implicitly the empty string for the first word, but a single space for subsequent words, due to setting sep to " " immediately after.
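
The points above can be pulled together in a small wrapper. This is only a sketch: the function name unique_in_order is hypothetical, the NF guard (which skips empty records produced by doubled spaces) is an addition, and the input is piped in with printf rather than a here-string purely so the snippet also runs in a plain POSIX sh.

```shell
# Hypothetical helper: print the unique words of $1 in first-seen order,
# separated by single spaces and followed by a newline.
unique_in_order() {
  printf '%s\n' "$1" | awk -v RS=' ' '
    # NF skips empty records; $1 is the word with any trailing newline stripped.
    NF && !seen[$1]++ { printf "%s%s", sep, $1; sep = " " }
    # Emit the trailing newline explicitly instead of re-printing the last record.
    END { printf "\n" }
  '
}

unique_in_order "alpha bravo charlie alpha delta echo charlie"
```

Because the post-increment only takes effect after the test, the first sighting of each word passes the condition and every later one fails it.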

Here's a more flexible variant that handles any run of whitespace between input words; it works with GNU Awk and Mawk [1]:

awk -v RS='[[:space:]]+' '{if (!seen[$0]++){printf "%s%s",sep,$0; sep=" "}}' <<<"$variable"
  • -v RS='[[:space:]]+' tells Awk to split the input into records by any mix of spaces, tabs, and newlines.

[1] Unfortunately, BSD/OSX Awk (in strict compliance with the POSIX spec) doesn't support using regular expressions or even multi-character literals as RS, the input record separator.

Answer by dawg

You can use awk:

$ echo "$variable" | awk '{for(i=1;i<=NF;i++){if (!seen[$i]++) printf "%s ", $i}}'
alpha bravo charlie delta echo 

If you do not want the trailing space and want a trailing newline, you can do:

$ echo "$variable" | awk '{for(i=1;i<=NF;i++){if (!seen[$i]++) j = (j == "" ? $i : j " " $i)}} END{print j}'
alpha bravo charlie delta echo

Answer by anubhava

Using associative arrays in BASH 4+ you can simplify this:

variable="alpha bravo charlie alpha delta echo charlie"

# declare an associative array
declare -A unq

# read sentence into an indexed array
read -ra arr <<< "$variable"

# iterate each word and populate associative array with word as key
for w in "${arr[@]}"; do
   unq["$w"]=1
done

# print unique results (the key order of an associative array is unspecified)
printf "%s\n" "${!unq[@]}"
delta
bravo
echo
alpha
charlie

## if you want results in same order as original string
for w in "${arr[@]}"; do
   [[ ${unq["$w"]} ]] && echo "$w" && unset unq["$w"]
done
alpha
bravo
charlie
delta
echo

Answer by evil otto

pure, ugly bash:

# Each word must be usable as part of a variable name (letters, digits, underscores).
for x in $variable; do
    # un__<word> is set once <word> has been printed
    if [ "$(eval echo "\$un__$x")" = "" ]; then
         # the trailing space separates the printed words
         echo -n "$x "
         eval un__$x=1
         __usv="$__usv un__$x"
    fi
done
unset $__usv
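
The eval-based loop only works when every word is a valid variable-name suffix. As a hedged alternative, here is a sketch that tracks seen words in a space-delimited string and tests membership with case, which needs neither eval nor Bash 4 arrays. The function name unique_pure and the variables seen, out, and w are illustrative, and the approach is quadratic in the number of words, so it suits short strings only.

```shell
# Hypothetical helper: pure-shell, order-preserving de-duplication.
# Note: seen, out and w are plain (global) variables in POSIX sh.
unique_pure() {
  seen=' '   # space-delimited list of words already emitted
  out=''
  # $1 is deliberately unquoted so the shell splits it into words;
  # this also exposes it to globbing, so it assumes words without * ? [.
  for w in $1; do
    case "$seen" in
      *" $w "*) ;;                       # already seen: skip
      *) seen="$seen$w "                 # remember the word
         out="${out:+$out }$w" ;;        # append, space-separated
    esac
  done
  printf '%s\n' "$out"
}

unique_pure "alpha bravo charlie alpha delta echo charlie"
```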