
Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/13583365/


Bash- is it possible to use -uniq for only one column of a line?

Tags: bash, sorting, uniq

Asked by teutara

    1.gui  Qxx  16
    2.gu   Qxy  23
    3.guT  QWS  18
    4.gui  Qxr  21

I want to sort a file by the value in the 3rd column, so I use:


sort -rnk3 myfile

2.gu   Qxy  23
4.gui  Qxr  21
3.guT  QWS  18
1.gui  Qxx  16
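
For reference, here is what those sort flags do (standard coreutils options, annotated as a sketch):

    # -r   reverse the result (descending order)
    # -n   compare numerically instead of lexically
    # -k3  use the 3rd field as the sort key
    sort -rnk3 myfile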

Now I need the output to be as follows (the line starting with 1.gui is out, because the line with 4.gui has a greater value):


2.gu   Qxy  23
4.gui  Qxr  21
3.guT  QWS  18

I can not use head because I have millions of rows and I do not know where to cut. I could not figure out a way to use uniq, because it treats a line as a whole, and since I can not tell uniq to look only at the first column, it counts every line as unique and outputs it, which is normal. I know uniq can ignore a number of characters, but as you can see from the example, the first column might have a varying character count. A quick demonstration of the problem is shown below.

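To make the limitation concrete, this is roughly what plain uniq gives on the sorted output: it compares whole lines, so nothing is removed (and its -s N option can only skip a fixed number of leading characters):

    $ sort -rnk3 myfile | uniq
    2.gu   Qxy  23
    4.gui  Qxr  21
    3.guT  QWS  18
    1.gui  Qxx  16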

Please advise.


Answered by Guru

Try this:


sort -rnk3 myfile | awk -F"[. ]" '!a[$2]++'

awk removes the duplicates based on the 2nd column. This is actually a well-known awk idiom for removing duplicates. An array is maintained in which the 2nd field of each record is recorded. Before a record is printed, its 2nd field is checked against the array: if it is not present, the record is printed; otherwise it is discarded as a duplicate. This is achieved using the ++ operator. The first time a value is encountered, the post-fix ++ returns 0 (and then increments), so the negation is true and the record is printed. Subsequent occurrences return a value greater than 0, which becomes false when negated.

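For readers new to the idiom, here is a longhand sketch that behaves the same as '!a[$2]++':

    sort -rnk3 myfile | awk -F"[. ]" '{
        if (!($2 in a))   # 2nd field not seen in any earlier record
            print         # so print the whole line
        a[$2]++           # and remember it for subsequent records
    }'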

Answered by Chris Seymour

Here you go:


sort -rnk3 file | awk -F'[. ]' '{ if (a[$2]++ == 0) print }'

2.gu   Qxy  23
4.gui  Qxr  21
3.guT  QWS  18

This uses awk to check for duplicate values in the second field, where the field separator is either a whitespace or a period. This is what it treats the second field as:


$ awk -F'[. ]' '{ print $2 }' file

gu
gui
guT
gui

In awk the variable $0 represents the whole line, $1 represents the first field, and so on.


In awk -F'[. ]' '{ if (a[$2]++ == 0) print }', the -F option lets you specify the field separator; in this case it is either a whitespace or a period.


Answered by Ziferius

So I found this via the all-powerful and amazing Google. My little script builds off @sudo_O's answer, in that it shows you all the duplicate lines found, not a file without duplicates.


The text in which I was finding all the duplicates in the 3rd column (port) was in a file called master.txt:


awk '{ if (a[$3]++ > 0) print }' master.txt | while read site thread port
do
  grep "$port" master.txt
done
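
As an alternative (a sketch, assuming the port really is the 3rd whitespace-separated column of master.txt), the same report can be produced in awk alone by reading the file twice, which also avoids grep matching the port string in other columns:

    # pass 1 (NR == FNR): count each port; pass 2: print lines whose port repeats
    awk 'NR == FNR { count[$3]++; next } count[$3] > 1' master.txt master.txt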