Declaration: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/13583365/
Bash- is it possible to use -uniq for only one column of a line?
Asked by teutara
    1.gui  Qxx  16
    2.gu   Qxy  23
    3.guT  QWS  18
    4.gui  Qxr  21
I want to sort a file by the value in the 3rd column, so I use:
sort -rnk3 myfile
2.gu   Qxy  23
4.gui  Qxr  21
3.guT  QWS  18
1.gui  Qxx  16
Now I want the output to be: (the line starting with 1.gui is dropped because the line with 4.gui has a greater value)

2.gu   Qxy  23
4.gui  Qxr  21
3.guT  QWS  18
I can not use head because I have millions of rows and I do not know where to cut. I could not figure out a way to use uniq because it treats a line as a whole; since I can not tell uniq to look only at the first column, it counts each line as unique and outputs it, which is normal. I know uniq can ignore a number of characters, but as you can see from the example, the first column might have a varying character count.
Please advise.
Answered by Guru
Try this:
sort -rnk3 myfile | awk -F"[. ]" '!a[$2]++'
awk removes the duplicates based on the 2nd field. This is actually a well-known awk idiom for removing duplicates. An array is maintained, keyed by the 2nd field. Before each record is printed, its 2nd field is looked up in the array: if it is not present yet, the record is printed; otherwise it is discarded as a duplicate. This is achieved using the ++. The first time a key is encountered, the postfix ++ still evaluates to 0 (which, negated, is true), so the record prints; subsequent occurrences yield a positive count, which negated becomes false.
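Putting the idiom together with the sample data from the question (a minimal sketch; the /tmp path is an assumption for illustration):

```shell
# Recreate the sample file from the question (illustrative temp path).
cat > /tmp/myfile <<'EOF'
1.gui  Qxx  16
2.gu   Qxy  23
3.guT  QWS  18
4.gui  Qxr  21
EOF

# Sort by column 3, reverse numeric, then keep only the first line seen
# for each value of the 2nd '[. ]'-separated field (gu, gui, guT, ...).
sort -rnk3 /tmp/myfile | awk -F'[. ]' '!a[$2]++'
# → 2.gu   Qxy  23
#   4.gui  Qxr  21
#   3.guT  QWS  18
```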
Answered by Chris Seymour
Here you go:
sort -rnk3 file | awk -F'[. ]' '{ if (a[$2]++ == 0) print }'
2.gu   Qxy  23
4.gui  Qxr  21
3.guT  QWS  18
This uses awk to check for duplicate values in the second field, where the field separator is either a whitespace or a period. So this is what it treats the second field as:
$ awk -F'[. ]' '{ print $2 }' file
gu
gui
guT
gui
In awk the variable $0 represents the whole line, $1 represents the first field, and so on.
In awk -F'[. ]' '{ if (a[$2]++ == 0) print }', the -F option lets you specify the field separator; in this case it is either a whitespace or a period.
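The postfix semantics of `a[$2]++` can be seen directly in a one-liner (a minimal sketch; `a` is just an arbitrary array name, and the input keys are invented for illustration):

```shell
# For each key, the counter starts at 0; postfix ++ returns the old
# value, so the first occurrence of a key prints 0, the second prints 1.
printf 'gui\ngu\ngui\n' | awk '{ print $1, a[$1]++ }'
# → gui 0
#   gu 0
#   gui 1
```

This is why both `!a[$2]++` and `a[$2]++ == 0` select exactly the first line for each key.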
Answered by Ziferius
So I found this via the all-powerful and amazing Google. My little script builds off @sudo_O's answer, in that it shows you all the duplicate lines found, not a file without duplicates.
The text in which I was finding all duplicates in the 3rd column (port) was in a file called master.txt:
awk '{if (a[$3]++ > 0) print}' master.txt | while read site thread port
do
  grep $port master.txt
done
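The same result (every line whose 3rd-column value occurs more than once) can be produced in a single awk invocation by reading the file twice with the NR==FNR idiom, avoiding one grep process per duplicate. This is a sketch under assumed sample data, not the original master.txt:

```shell
# Build a small stand-in for master.txt (invented sample data).
cat > /tmp/master.txt <<'EOF'
siteA  t1  8080
siteB  t2  9090
siteC  t3  8080
EOF

# Pass 1 (NR==FNR): count each port. Pass 2: print lines whose
# port appeared more than once, in original file order.
awk 'NR==FNR { count[$3]++; next } count[$3] > 1' /tmp/master.txt /tmp/master.txt
# → siteA  t1  8080
#   siteC  t3  8080
```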

