An efficient way to transpose a file in Bash

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/1729824/

Date: 2020-09-09 18:41:19  Source: igfitidea


Tags: bash, parsing, unix, transpose

Asked by Federico Giorgi

I have a huge tab-separated file formatted like this


X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

I would like to transpose it in an efficient way using only bash commands (I could write a Perl script of ten or so lines to do that, but it should be slower to execute than the native bash functions). So the output should look like


X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11

I thought of a solution like this


cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done

But it's slow and doesn't seem like the most efficient solution. I've seen a solution for vi in this post, but it's still too slow. Any thoughts/suggestions/brilliant ideas? :-)


Answered by ghostdog74

awk '
{ 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {    
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}' file

output


$ more file
0 1 2
3 4 5
6 7 8
9 10 11

$ ./shell.sh
0 3 6 9
1 4 7 10
2 5 8 11

Performance against the Perl solution by Jonathan on a 10,000-line file


$ head -5 file
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
1 0 1 2

$  wc -l < file
10000

$ time perl test.pl file >/dev/null

real    0m0.480s
user    0m0.442s
sys     0m0.026s

$ time awk -f test.awk file >/dev/null

real    0m0.382s
user    0m0.367s
sys     0m0.011s

$ time perl test.pl file >/dev/null

real    0m0.481s
user    0m0.431s
sys     0m0.022s

$ time awk -f test.awk file >/dev/null

real    0m0.390s
user    0m0.370s
sys     0m0.010s

EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).


Maybe this version, with some more explicit variable names, will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator, which the OP had originally asked for, so it handles empty fields, and it coincidentally prettifies the output a bit for this particular case.


$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (rowNr=1;rowNr<=NF;rowNr++) {
        cell[rowNr,NR] = $rowNr
    }
    maxRows = (NF > maxRows ? NF : maxRows)
    maxCols = NR
}
END {
    for (rowNr=1;rowNr<=maxRows;rowNr++) {
        for (colNr=1;colNr<=maxCols;colNr++) {
            printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11

The above solutions will work in any awk (except old, broken awk of course - there YMMV).


The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:


$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
ENDFILE {
    print ""
    if (ARGIND < NF) {
        ARGV[ARGC] = FILENAME
        ARGC++
    }
}
$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11

which uses almost no memory but reads the input file once per number of fields on a line, so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on each line, and it uses GNU awk for ENDFILE and ARGIND, but any awk can do the same with tests on FNR==1 and END.

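For reference, here is a hypothetical portable sketch of the same low-memory idea (my assumption, not part of the original answer): instead of GNU awk's ENDFILE/ARGIND, the shell drives one plain POSIX awk pass per output row, selecting a single field each time.

```shell
# Sketch: one awk pass per output row, so no GNU extensions are needed.
# The sample input file name (in.tsv) is made up for illustration.
printf 'X\tcolumn1\tcolumn2\tcolumn3\nrow1\t0\t1\t2\nrow2\t3\t4\t5\n' > in.tsv
cols=$(head -n 1 in.tsv | awk -F'\t' '{print NF}')
i=1
while [ "$i" -le "$cols" ]; do
    # print field $f of every line as one tab-separated output line
    awk -v f="$i" 'BEGIN{FS=OFS="\t"}
        {printf "%s%s", (FNR>1 ? OFS : ""), $f}
        END{print ""}' in.tsv
    i=$((i + 1))
done
```

Like the GNU awk version above, this trades many reads of the input for near-zero memory use, and it assumes every line has the same number of fields.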

Answered by nisetama

Another option is to use rs:


rs -c' ' -C' ' -T

-c changes the input column separator, -C changes the output column separator, and -T transposes rows and columns. Do not use -t instead of -T, because it uses an automatically calculated number of rows and columns that is not usually correct. rs, which is named after the reshape function in APL, comes with the BSDs and OS X, but it should be available from package managers on other platforms.


A second option is to use Ruby:


ruby -e'puts readlines.map(&:split).transpose.map{|x|x*" "}'

A third option is to use jq:


jq -R .|jq -sr 'map(./" ")|transpose|map(join(" "))[]'

jq -R . prints each input line as a JSON string literal, -s (--slurp) creates an array for the input lines after parsing each line as JSON, and -r (--raw-output) outputs the contents of strings instead of JSON string literals. The / operator is overloaded to split strings.


Answered by Stephan202

A Python solution:


python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output

The above is based on the following:


import sys

for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
    print(' '.join(c))

This code does assume that every line has the same number of columns (no padding is performed).

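If padding is needed, a small variation (my sketch, not part of the original answer) can use itertools.zip_longest to fill short rows with empty strings instead of letting zip truncate the columns:

```shell
# Sketch: pad ragged rows with empty strings via zip_longest instead of zip.
printf '1 2 3\n4 5\n' | python3 -c "
import sys, itertools
rows = [l.split() for l in sys.stdin if l.strip()]
for col in itertools.zip_longest(*rows, fillvalue=''):
    print(' '.join(col).rstrip())
"
```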

Answered by flying sheep

The transpose project on SourceForge is a coreutils-like C program for exactly that.


gcc transpose.c -o transpose
./transpose -t input > output  # works with stdin, too

Answered by Fritz G. Mehner

Pure bash, no additional processes. A nice exercise:


declare -a array=( )                      # we build a 1-D-array

read -a line < "$1"                       # read the headline

COLS=${#line[@]}                          # save number of columns

index=0
while read -a line ; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"

for (( ROW = 0; ROW < COLS; ROW++ )); do
  for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
    printf "%s\t" ${array[$COUNTER]}
  done
  printf "\n" 
done

Answered by pixelbeat

Have a look at GNU datamash, which can be used like datamash transpose. A future version will also support cross tabulation (pivot tables).

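As a quick illustration (my sketch, assuming GNU datamash is installed): the default field separator is a tab, so -t is needed for space-separated input like the OP's sample.

```shell
# Assumes GNU datamash is available; -t' ' selects a space separator
# (the default is a tab).
printf '1 2\n3 4\n' | datamash -t' ' transpose
```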

Answered by Jonathan Leffler

Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.


#!/bin/perl -w
#
# SO 1729824

use strict;

my(%data);          # main storage
my($maxcol) = 0;
my($rownum) = 0;
while (<>)
{
    my(@row) = split /\s+/;
    my($colnum) = 0;
    foreach my $val (@row)
    {
        $data{$rownum}{$colnum++} = $val;
    }
    $rownum++;
    $maxcol = $colnum if $colnum > $maxcol;
}

my $maxrow = $rownum;
for (my $col = 0; $col < $maxcol; $col++)
{
    for (my $row = 0; $row < $maxrow; $row++)
    {
        printf "%s%s", ($row == 0) ? "" : "\t",
                defined $data{$row}{$col} ? $data{$row}{$col} : "";
    }
    print "\n";
}

With the sample data size, the performance difference between perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), perl slightly outperformed awk - 0.026s vs 0.042s. Neither is likely to be a problem.




Representative timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit) on MacOS X 10.5.8, on a file containing 10,000 lines with 5 columns per line:


Osiris JL: time gawk -f tr.awk xxx  > /dev/null

real    0m0.367s
user    0m0.279s
sys 0m0.085s
Osiris JL: time perl -f transpose.pl xxx > /dev/null

real    0m0.138s
user    0m0.128s
sys 0m0.008s
Osiris JL: time awk -f tr.awk xxx  > /dev/null

real    0m1.891s
user    0m0.924s
sys 0m0.961s
Osiris-2 JL: 

Note that gawk is vastly faster than awk on this machine, but still slower than perl. Clearly, your mileage will vary.


Answered by Paused until further notice.

If you have sc installed, you can do:


psc -r < inputfile | sc -W% - > outputfile

Answered by nelaaro

There is a purpose-built utility for this: the GNU datamash utility.

apt install datamash  

datamash transpose < yourfile

Taken from https://www.gnu.org/software/datamash/ and http://www.thelinuxrain.com/articles/transposing-rows-and-columns-3-methods


Answered by Guilherme Freitas

Assuming all your rows have the same number of fields, this awk program solves the problem:


{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}

In words, as you loop over the rows, for every field f grow a ':'-separated string col[f] containing the elements of that field. After you are done with all the rows, print each one of those strings on a separate line. You can then replace ':' with the separator you want (say, a space) by piping the output through tr ':' ' '.


Example:

例子:

$ printf '1 2 3\n4 5 6\n'
1 2 3
4 5 6

$ printf '1 2 3\n4 5 6\n' | awk '{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}' | tr ':' ' '
 1 4
 2 5
 3 6
 3 6