Linux 有没有办法按列'uniq'?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1915636/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
Is there a way to 'uniq' by column?
提问by Eno
I have a .csv file like this:
我有一个这样的 .csv 文件:
[email protected],2009-11-27 01:05:47.893000000,example.net,127.0.0.1
[email protected],2009-11-27 00:58:29.793000000,example.net,255.255.255.0
[email protected],2009-11-27 00:58:29.646465785,example.net,256.255.255.0
...
I have to remove duplicate e-mails (the entire line) from the file (i.e. one of the lines containing [email protected] in the above example). How do I use uniq on only field 1 (separated by commas)? According to man, uniq doesn't have options for columns.
我必须从文件中删除重复的电子邮件(整行)(即上面示例中包含 [email protected] 的其中一行)。如何仅对字段 1(以逗号分隔)使用 uniq?根据 man,uniq 没有针对列的选项。
I tried something with sort | uniq but it doesn't work.
我用 sort | uniq 尝试了一些东西,但它不起作用。
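Presumably, plain sort | uniq does not help here because uniq compares entire lines, and the timestamp/netmask columns differ between the lines that share an e-mail, so no line is ever equal to another. A minimal illustration (assuming the three sample lines above are saved as file.csv; this is not part of the original question):
普通的 sort | uniq 在这里不起作用,大概是因为 uniq 比较的是整行,而共享同一邮箱的那些行在时间戳/掩码列上并不相同,所以没有任何两行完全相等。一个最小的示意(假设上面的三行示例保存为 file.csv;这不是原问题的一部分):
sort file.csv | uniq | wc -l   # still prints 3: every full line is unique
wc -l < file.csv               # 3 as well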
采纳答案by Carl Smotricz
sort -u -t, -k1,1 file
- -u for unique
- -t, so comma is the delimiter
- -k1,1 for the key field 1
- -u 表示只保留唯一行(去重)
- -t, 表示以逗号作为分隔符
- -k1,1 表示以字段 1 作为排序键
Test result:
测试结果:
[email protected],2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
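As a side note, the sample file has no header row, but if a real CSV did, one way to keep the header while de-duplicating the rest might be (just a sketch; file.csv and deduped.csv are placeholder names):
顺带一提:示例文件没有表头行,但如果真实的 CSV 有表头,想在去重的同时保留表头,大致可以这样写(只是一个草图;file.csv 和 deduped.csv 是占位文件名):
( head -n 1 file.csv; tail -n +2 file.csv | sort -u -t, -k1,1 ) > deduped.csv   # keep line 1 as-is, de-duplicate the rest on field 1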
回答by Steve B.
well, simpler than isolating the column with awk, if you need to remove everything with a certain value for a given file, why not just do grep -v:
好吧,比用 awk 隔离列更简单,如果您需要删除给定文件具有特定值的所有内容,为什么不直接执行 grep -v:
e.g. to delete everything with the value "col2" in the second position of a line like col1,col2,col3,col4:
例如,对于形如 col1,col2,col3,col4 的行,要删除第二列值为"col2"的所有行:
grep -v ',col2,' file > file_minus_offending_lines
If this isn't good enough, because some lines may get improperly stripped by possibly having the matching value show up in a different column, you can do something like this:
如果这还不够好,因为某些行可能会因匹配值显示在不同的列中而被不正确地剥离,您可以执行以下操作:
awk to isolate the offending column: e.g.
awk 以隔离违规列:例如
awk -F, '{print $2 "|" $line}'
the -F sets the field delimiter to ",", $2 means column 2, followed by some custom delimiter and then the entire line. You can then filter by removing lines that begin with the offending value:
-F 将字段分隔符设置为",",$2 表示第 2 列,后面跟着一个自定义分隔符,然后是整行。然后,您可以通过删除以违规值开头的行来进行过滤:
awk -F, '{print $2 "|" $line}' | grep -v ^BAD_VALUE
and then strip out the stuff before the delimiter:
然后去掉分隔符之前的内容:
awk -F, '{print $2 "|" $line}' | grep -v ^BAD_VALUE | sed 's/.*|//g'
(note - the sed command is sloppy because it doesn't include escaping values. Also the sed pattern should really be something like "[^|]+" (i.e. anything that is not the delimiter). But hopefully this is clear enough.)
(注意:sed 命令比较粗糙,因为它没有处理需要转义的值。另外,sed 模式其实应该写成类似"[^|]+"(即任何不是分隔符的内容)。但希望这已经足够清楚了。)
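A hedged alternative to the whole pipeline above: since awk already sees the columns, it can filter on the exact field value instead of a substring match, which avoids the false-positive problem entirely ("col2" and "file" are the same placeholders as above):
对上面整条流水线的一个替代思路(仅供参考):既然 awk 已经能拿到各列,就可以直接按字段的精确值过滤,而不是做子串匹配,从而完全避免误删的问题("col2"和"file"仍是上文的占位符):
awk -F, '$2 != "col2"' file > file_minus_offending_lines   # keep only lines whose 2nd field is not exactly "col2"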
回答by Mikael S
By sorting the file with sort first, you can then apply uniq.
通过先用 sort 对文件进行排序,然后就可以应用 uniq。
It seems to sort the file just fine:
它似乎对文件进行了排序:
$ cat test.csv
[email protected],2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
[email protected],2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
$ sort test.csv
[email protected],2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
[email protected],2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
$ sort test.csv | uniq
[email protected],2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
[email protected],2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
You could also do some AWK magic:
你也可以做一些 AWK 魔法:
$ awk -F, '{ lines[$1] = $0 } END { for (l in lines) print lines[l] }' test.csv
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
[email protected],2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
[email protected],2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
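The one-liner above is not explained in the answer; roughly, it works like this (an annotated sketch of the same command; note that it keeps the last line seen for each e-mail and prints in arbitrary hash order):
上面这条单行命令在答案里没有解释;它大致是这样工作的(同一条命令的注释版草图;注意它保留的是每个邮箱最后出现的那一行,并且按哈希的任意顺序输出):
awk -F, '
  { lines[$1] = $0 }                        # index by field 1 (the e-mail); later lines overwrite earlier ones
  END { for (l in lines) print lines[l] }   # one line per e-mail, in whatever order awk iterates the array
' test.csv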
回答by Carsten C.
or if you want to use uniq:
或者,如果你想使用 uniq:
<mycvs.cvs tr -s ',' ' ' | awk '{print $3" "$2" "$1}' | uniq -c -f2
gives:
给出:
1 01:05:47.893000000 2009-11-27 [email protected]
2 00:58:29.793000000 2009-11-27 [email protected]
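What each stage of that pipeline does (my reading of it; note that uniq only collapses adjacent duplicates, so the input must already be grouped by e-mail):
这条流水线各阶段的作用(这是我的理解;注意 uniq 只合并相邻的重复行,所以输入必须已经按邮箱分组):
# tr   : squeeze commas into spaces (this also splits the date and the time into two fields)
# awk  : reorder to "time date e-mail", dropping the remaining fields
# uniq : -f2 skips the first two fields, so lines are compared on the e-mail only; -c prefixes a count
tr -s ',' ' ' < mycvs.cvs |
  awk '{print $3" "$2" "$1}' |
  uniq -c -f2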
回答by ghostdog74
awk -F"," '!_[$1]++' file
- -F sets the field separator.
- $1 is the first field.
- _[val] looks up val in the hash _ (a regular variable).
- ++ increment, and return old value.
- ! returns logical not.
- there is an implicit print at the end.
- -F 设置字段分隔符。
- $1 是第一个字段。
- _[val] 在散列 _(一个普通变量)中查找 val。
- ++ 递增,并返回旧值。
- ! 返回逻辑非。
- 最后有一个隐式的 print。
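Spelled out with the implicit parts made explicit, the one-liner is equivalent to something like this (just a sketch, using a friendlier array name):
把省略的部分写全,这条单行命令大致等价于下面这样(只是示意,换了一个更易读的数组名):
awk -F"," '!seen[$1]++ { print $0 }' file   # print a line only the first time its field 1 (the e-mail) is seen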
回答by Prakash
To consider multiple columns.
考虑多列的情况。
Sort and give unique list based on column 1 and column 3:
根据第 1 列和第 3 列排序并给出唯一列表:
sort -u -t : -k 1,1 -k 3,3 test.txt
- -t : colon is separator
- -k 1,1 -k 3,3 based on column 1 and column 3
- -t : 冒号是分隔符
- -k 1,1 -k 3,3 基于第 1 列和第 3 列
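For the comma-separated file from the original question, the same idea would presumably be written with -t, instead of -t : (untested sketch):
对于原问题中以逗号分隔的文件,同样的思路大概要写成 -t, 而不是 -t :(未经测试的示意):
sort -u -t, -k1,1 -k3,3 file.csv   # one line per unique combination of field 1 (e-mail) and field 3 (domain)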
回答by Sumukh
If you want to retain the last one of the duplicates you could use
如果您想保留重复项中的最后一条,可以使用
tac a.csv | sort -u -t, -r -k1,1 | tac
Which was my requirement
这是我的要求
here tac will reverse the file line by line
这里,tac 会将文件逐行反转
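The way I read this trick: reversing the file makes the last occurrence of each e-mail come first, sort -u then keeps one line per e-mail, and the second tac restores the original direction (annotated sketch of the same command; exactly which tied line is kept may depend on the sort implementation):
我对这个技巧的理解是:先把文件反转,让每个邮箱最后出现的那一行排到前面,sort -u 对每个邮箱只保留一行,最后再用 tac 把方向还原(同一条命令的注释版示意;具体保留哪一行可能取决于 sort 的实现):
tac a.csv |                 # reverse the file so the last occurrence of each e-mail comes first
  sort -u -t, -r -k1,1 |    # keep one line per e-mail (field 1), reverse-sorted on that key
  tac                       # reverse again to restore the original direction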
回答by NOYB
Here is a very nifty way.
这是一个非常漂亮的方法。
First format the content such that the column to be compared for uniqueness is a fixed width. One way of doing this is to use awk printf with a field/column width specifier ("%15s").
首先格式化内容,以便要比较唯一性的列是固定宽度。一种方法是将 awk printf 与字段/列宽度说明符(“%15s”)一起使用。
Now the -f and -w options of uniq can be used to skip preceding fields/columns and to specify the comparison width (column(s) width).
现在 uniq 的 -f 和 -w 选项可用于跳过前面的字段/列并指定比较宽度(列宽度)。
Here are three examples.
下面是三个例子。
In the first example...
在第一个例子中...
1) Temporarily make the column of interest a fixed width greater than or equal to the field's max width.
1) 临时使感兴趣的列的固定宽度大于或等于字段的最大宽度。
2) Use -f uniq option to skip the prior columns, and use the -w uniq option to limit the width to the tmp_fixed_width.
2) 使用 -f uniq 选项跳过前面的列,并使用 -w uniq 选项将宽度限制为 tmp_fixed_width。
3) Remove trailing spaces from the column to "restore" its width (assuming there were no trailing spaces beforehand).
3)从列中删除尾随空格以"恢复"其宽度(假设之前没有尾随空格)。
printf "%s" "$str" \
| awk '{ tmp_fixed_width=15; uniq_col=8; w=tmp_fixed_width-length($uniq_col); for (i=0;i<w;i++) { $uniq_col=$uniq_col" " }; printf "%s\n", $0 }' \
| uniq -f 7 -w 15 \
| awk '{ uniq_col=8; gsub(/ */, "", $uniq_col); printf "%s\n", $0 }'
In the second example...
在第二个例子中...
Create a new uniq column 1. Then remove it after the uniq filter has been applied.
创建一个新的 uniq 列 1. 然后在应用 uniq 过滤器后将其删除。
printf "%s" "$str" \
| awk '{ uniq_col_1=4; printf "%15s %s\n", $uniq_col_1, $0 }' \
| uniq -f 0 -w 15 \
| awk '{ $1=""; gsub(/^ */, "", $0); printf "%s\n", $0 }'
The third example is the same as the second, but for multiple columns.
第三个示例与第二个示例相同,但适用于多列。
printf "%s" "$str" \
| awk '{ uniq_col_1=4; uniq_col_2=8; printf "%5s %15s %s\n", $uniq_col_1, $uniq_col_2, $0 }' \
| uniq -f 0 -w 5 \
| uniq -f 1 -w 15 \
| awk '{ $1=$2=""; gsub(/^ */, "", $0); printf "%s\n", $0 }'
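A small usage note (my assumption; the original answer never shows it): $str presumably holds the entire input read into a shell variable, e.g.:
一个小的使用说明(这是我的猜测,原答案并未给出):$str 应该是读入 shell 变量中的整个输入内容,例如:
str=$(cat file.csv)   # hypothetical: load the CSV into $str before running the pipelines above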