Linux: how to count the number of unique values of a field in a tab-delimited text file?

Question

Asked by sfactor

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,

Red     Ball 1 Sold
Blue    Bat  5 OnSale
...............

So the first column holds colors, and I want to know how many distinct values there are in that column. I want to be able to do that for each column.

I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.

Addendum: Thanks everyone for the help, can I ask one more thing? What if I wanted a count of these unique values as well?

I guess I didn't put the second part clearly enough. What I want is a count for "each" of these unique values, not just the number of unique values. For instance, in the first column I want to know how many Red, Blue, Green etc. coloured objects there are.

Answer 1

Accepted answer by codaddict

You can make use of the cut, sort and uniq commands as follows:

cat input_file | cut -f 1 | sort | uniq

This gets the unique values in field 1; replacing 1 with 2 gives the unique values in field 2.

Avoiding UUOC (useless use of cat) :)

cut -f 1 input_file | sort | uniq

EDIT:

To count the number of unique values, you can add the wc command to the chain:

cut -f 1 input_file | sort | uniq | wc -l
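
For the follow-up in the question (a count for each distinct value rather than the number of distinct values), `uniq -c` prefixes every value with its number of occurrences. A minimal sketch, assuming a hypothetical sample file `sample.tsv` that is created here purely for illustration:

```shell
# Build a small tab-delimited sample file (hypothetical data, for illustration only).
printf 'Red\tBall\t1\tSold\nBlue\tBat\t5\tOnSale\nRed\tBat\t2\tSold\n' > sample.tsv

# Count how often each value occurs in column 1, most frequent first.
per_value=$(cut -f 1 sample.tsv | sort | uniq -c | sort -rn)
echo "$per_value"

# Number of distinct values in column 1.
distinct=$(cut -f 1 sample.tsv | sort -u | wc -l)
echo "$distinct"
```

The input has to be sorted before `uniq -c`, because uniq only collapses adjacent identical lines.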

Answer 2

Answered by Jon Freedman

You can use awk, sort and uniq to do this, for example to list all the unique values in the first column:

awk < test.txt '{print $1}' | sort | uniq

As posted elsewhere, if you want to count the number of instances of something you can pipe the unique list into wc -l

Answer 3

Answered by Douglas Leeder

Assuming the data file is actually tab separated, not space aligned:

<test.tsv awk '{print $4}' | sort | uniq

Here the fields are numbered as follows, so $4 is the last column:

$1 - Red
$2 - Ball
$3 - 1
$4 - Sold
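
One caveat worth illustrating: awk splits on any run of whitespace by default, so a tab-separated file whose values contain spaces needs an explicit separator. A small sketch, assuming a hypothetical file `spaced.tsv` whose first value is "Dark Red":

```shell
# Hypothetical sample: the first field of line 1 contains a space.
printf 'Dark Red\tBall\t1\tSold\nBlue\tBat\t5\tOnSale\n' > spaced.tsv

# Default splitting breaks "Dark Red" into two fields, so $4 is "1" on line 1.
default_split=$(awk '{print $4}' spaced.tsv | sort | uniq)

# With -F'\t' only tabs separate fields, so $4 is always the status column.
tab_split=$(awk -F'\t' '{print $4}' spaced.tsv | sort | uniq)

echo "default: $default_split"
echo "tab:     $tab_split"
```

With `-F'\t'` the field numbers stay aligned with the columns even when a value contains spaces.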

Answer 4

Answered by stacker

# COLUMN is integer column number
# INPUT_FILE is input file name

cut -f "${COLUMN}" < "${INPUT_FILE}" | sort -u | wc -l
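
Since the question asks for this for every column, the one-liner above can be wrapped in a loop. A rough sketch, assuming a hypothetical sample file `items.tsv` and taking the column count from the first line; the `summary` variable is only there to collect the results for display:

```shell
# Hypothetical sample data, for illustration only.
printf 'Red\tBall\t1\tSold\nBlue\tBat\t5\tOnSale\nRed\tBat\t2\tSold\n' > items.tsv
INPUT_FILE=items.tsv

# Number of columns, taken from the first line.
ncols=$(head -n 1 "$INPUT_FILE" | awk -F'\t' '{print NF}')

summary=""
COLUMN=1
while [ "$COLUMN" -le "$ncols" ]; do
    # Distinct values in this column.
    n=$(cut -f "$COLUMN" "$INPUT_FILE" | sort -u | wc -l)
    # $((n)) strips the padding some wc implementations add.
    summary="$summary$COLUMN:$((n)) "
    COLUMN=$((COLUMN + 1))
done
echo "$summary"
```

Each entry in the output has the form column:distinct-count.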

Answer 5

Answered by Mike

cat test.csv | awk '{ a[$1]++ } END { for (n in a) print n, a[n] } '
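
A possible variation on the same idea, with an explicit tab separator, a selectable column and the output ordered by count. This is only a sketch: the file `colors.tsv` and the variable name `COLUMN` are assumptions made for the example.

```shell
# Hypothetical sample data, for illustration only.
printf 'Red\tBall\nBlue\tBat\nRed\tBat\n' > colors.tsv
COLUMN=1   # 1-based column number (assumed variable name)

counts=$(awk -F'\t' -v col="$COLUMN" '
    { seen[$col]++ }                                        # tally each value of the chosen column
    END { for (v in seen) printf "%s\t%d\n", v, seen[v] }   # value, tab, count
' colors.tsv | sort -t "$(printf '\t')" -k2,2nr)            # highest count first
echo "$counts"
```

The order in which awk walks an array with `for (v in seen)` is unspecified, which is why the result is piped through sort.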

Answer 6

Answered by peak

Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides the synopsis for each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed tr wc cut sort uniq.

#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file

FILE="$1"

# Count the tabs in the first line; cols ends up one past the last column number
cols=$(sed -n 1p "$FILE" | tr -cd '\t' | wc -c)
cols=$((cols + 2 ))
i=0
for ((i=1; i < $cols; i++))
do
  echo Column $i ::
  cut -f $i < "$FILE" | sort | uniq -c
  echo
done

Answer 7

Answered by Amin.A

This script outputs the unique values in each column of a given file, together with how often each one occurs. It assumes that the first line of the file is a header line. There is no need to define the number of fields. Simply save the script in a bash file (.sh) and pass the tab-delimited file as a parameter to the script.

Code

#!/bin/bash

# Note: arrays of arrays (arr[x][y]) require GNU awk.
awk '
(NR==1){
    # Header line: remember the name of every column
    for(fi=1; fi<=NF; fi++)
        fname[fi]=$fi;
}
(NR!=1){
    # Data lines: count each value per column name
    for(fi=1; fi<=NF; fi++)
        arr[fname[fi]][$fi]++;
}
END{
    # One output line per column: name, then value_count pairs
    for(fi=1; fi<=NF; fi++){
        out=fname[fi];
        for (item in arr[fname[fi]])
            out=out"\t"item"_"arr[fname[fi]][item];
        print(out);
    }
}
' "$1"

Execution Example:

bash> ./script.sh <path to tab-delimited file>

Output Example

isRef    A_15      C_42     G_24     T_18
isCar    YEA_10    NO_40    NA_50
isTv     FALSE_33  TRUE_66
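
The awk script above uses arrays of arrays, which is a GNU awk extension. Where only a POSIX awk is available, roughly the same tally can be kept with composite keys joined by SUBSEP. This is a sketch, not the author's script: it assumes a hypothetical sample file `flags.tsv` and prints one line per column/value pair instead of one line per column.

```shell
# Hypothetical sample with a header line, for illustration only.
printf 'isRef\tisCar\nA\tYEA\nC\tNO\nA\tNO\n' > flags.tsv

report=$(awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }   # remember header names
    { for (i = 1; i <= NF; i++) count[i, $i]++ }               # composite key: column index, value
    END {
        for (key in count) {
            split(key, part, SUBSEP)                           # part[1] = column index, part[2] = value
            print name[part[1]] "\t" part[2] "_" count[key]
        }
    }
' flags.tsv | sort)
echo "$report"
```

The trailing sort groups the lines by column name, since awk does not guarantee any iteration order.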
