Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/5265839/

Time: 2020-08-05 03:12:20  Source: igfitidea

Split delimited file into smaller files by column

Tags: linux, bash, unix, split, cut

Asked by Stephen Turner

I'm familiar with the split command in Linux. If I have a file that's 100 lines long,

split -l 5 myfile.txt

...will split myfile.txt into 20 files, each having 5 lines, and will write them to file.

My question is, I want to do this by column. Given a file with 100 columns, tab delimited, is there a similar command to split this file into 20 smaller files, each having 5 columns and all the rows?

I'm aware of how to use cut, but I'm hoping there's a simple UNIX command I've never heard of that will accomplish this without wrapping cut with perl or something.

Thanks in advance.

Accepted answer by SiegeX

#!/bin/bash

# Require exactly two arguments: the input file and the columns per split.
(($# == 2)) || { echo -e "\nUsage: $0 <file to split> <# columns in each split>\n\n"; exit; }

infile="$1"

inc=$2
# Number of fields on the first line of the input.
ncol=$(awk 'NR==1{print NF}' "$infile")

((inc < ncol)) || { echo -e "\nSplit size >= number of columns\n\n"; exit; }

# Cut successive groups of $inc columns into infile.0, infile.1, ...
for((i=0, start=1, end=$inc; i < ncol/inc + 1; i++, start+=inc, end+=inc)); do
  cut -f$start-$end "$infile" > "${infile}.$i"
done
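One detail worth noting: with the loop bound `ncol/inc + 1`, a column count that is an exact multiple of the split size (100 columns, 5 per file) appears to produce one extra output file with no column data. A minimal sketch of a variant that rounds the file count up instead is below. It is not SiegeX's code; the function name `split_cols` is hypothetical, and it assumes tab-delimited input.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer): split a tab-delimited
# file into groups of N columns with cut, rounding the number of files up.
split_cols() {
  local infile=$1 inc=$2 ncol nfiles i start end
  # Count the tab-separated fields on the first line.
  ncol=$(awk -F'\t' 'NR==1{print NF; exit}' "$infile")
  # Ceiling division: number of output files needed.
  nfiles=$(( (ncol + inc - 1) / inc ))
  for (( i = 0; i < nfiles; i++ )); do
    start=$(( i * inc + 1 ))
    end=$(( start + inc - 1 ))
    cut -f"$start-$end" "$infile" > "$infile.$i"
  done
}
```

Calling `split_cols myfile.txt 5` on a 100-column file would then write myfile.txt.0 through myfile.txt.19.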

Answer by nhed

# do something smarter with output files (& clear on start)
XIFS="${IFS}"
IFS=$'\t'
while read -a LINE; do 
  for (( i=0; i< ${#LINE[@]}; i++ )); do
    echo "${LINE[$i]}" >> /tmp/outfile${i}
  done
done < infile
IFS="${XIFS}"

Try the above ... using file name 'infile'

Note the saving and restoring of the IFS (does anyone have a better idea? a subshell?)

Also note that this appends to the output files, so before running it a second time you would want to delete the prior run's outputs.
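On the IFS question: one common alternative to saving and restoring IFS is to set it only for the `read` command itself, so the change never leaks into the rest of the script. A minimal sketch under that assumption follows; the function name `split_each_column` and the `prefix` argument are hypothetical, and it also clears earlier outputs before appending. Because tab counts as IFS whitespace, consecutive tabs (empty fields) are collapsed, so this only suits files without empty columns.

```shell
#!/bin/bash
# Sketch: write each tab-separated column of a file to its own output file.
# The function name and the prefix argument are hypothetical.
split_each_column() {
  local infile=$1 prefix=$2 i
  local -a LINE
  # Clear outputs from earlier runs; ${prefix:?} aborts if prefix is empty.
  rm -f -- "${prefix:?}"*
  # IFS is set only for the read command, so nothing needs restoring.
  while IFS=$'\t' read -r -a LINE; do
    for (( i = 0; i < ${#LINE[@]}; i++ )); do
      printf '%s\n' "${LINE[i]}" >> "$prefix$i"
    done
  done < "$infile"
}
```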

Answer by Stephen Turner

Thanks for the help. I hoped there would be a unix command similar to split, but I ended up wrapping the cut command with perl, via SiegeX's suggestion.

#!/usr/bin/perl

chomp(my $pwd = `pwd`);
my $help = "\nUsage: $0 <file to split> <# columns in each split>\n\n";
die $help if @ARGV!=2;

$infile = $ARGV[0];
# Number of columns, taken as the word count of the first line.
chomp($ncol = `head -n 1 $infile | wc -w`);

$start=1;
$inc = $ARGV[1];
$end = $start+$inc-1;

die "\nSplit size >= number of columns\n\n" if $inc>=$ncol;

# Cut successive groups of $inc columns into infile.1, infile.2, ...
for($i=1 ; $i<$ncol/$inc +1 ; $i++) {
    if ($end>$ncol) {$end=$ncol;}
    `cut -f $start-$end $infile > $infile.$i`;
    $start += $inc;
    $end += $inc;
}

Answer by drio

Here is my solution:

First an input generator:

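#!/usr/bin/env ruby
#
def usage(e)
  puts "Usage #{__FILE__} <n_rows> <n_cols>"
  exit e
end

usage 1 unless ARGV.size == 2

rows, cols = ARGV.map{|e| e.to_i}
(1..rows).each do |l|
  (1..cols).each {|c| printf "%s ", c }
  puts ""
end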

The split tool:

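#!/usr/bin/env ruby
#

def usage(e)
  puts "Usage #{__FILE__} <column_start> <column_end>"
  exit e
end

usage 1 unless ARGV.size == 2

c_start, c_end = ARGV.map{|e| e.to_i}
i = 0
buffer = []
$stdin.each_line do |l|
  i += 1
  buffer << l.split[c_start..c_end].join(" ")
  $stderr.printf "\r%d", i if i % 100000 == 0
end
$stderr.puts ""
buffer.each {|l| puts l}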

Notice that the split tool prints the number of the line it is processing to stderr, so you can get an idea of how fast it is going.

Also, I am assuming that the separator is a space.

Example of how to run it:

$ time ./gen.data.rb 1000 10 | ./split.rb 0 4 > ./out

This generates 1000 lines with 10 columns each and extracts the first 5 columns. I use time(1) to measure the running time.

We can use a little one-liner to do the splitting you requested (sequentially). It is very easy to run it in parallel on a single node (see the bash builtin command wait) or to send the jobs to a cluster.

$ ruby -e '(0..103).each {|i| puts "cat input.txt | ./split.rb #{i-4} #{i} > out.#{i/4}" if i % 4 == 0 && i > 0}' | /bin/bash

Which basically generates:

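cat input.txt | ./split.rb 0 4 > out.1
cat input.txt | ./split.rb 4 8 > out.2
cat input.txt | ./split.rb 8 12 > out.3
cat input.txt | ./split.rb 12 16 > out.4
cat input.txt | ./split.rb 16 20 > out.5
cat input.txt | ./split.rb 20 24 > out.6
cat input.txt | ./split.rb 24 28 > out.7
cat input.txt | ./split.rb 28 32 > out.8
cat input.txt | ./split.rb 32 36 > out.9
cat input.txt | ./split.rb 36 40 > out.10
cat input.txt | ./split.rb 40 44 > out.11
cat input.txt | ./split.rb 44 48 > out.12
cat input.txt | ./split.rb 48 52 > out.13
cat input.txt | ./split.rb 52 56 > out.14
cat input.txt | ./split.rb 56 60 > out.15
cat input.txt | ./split.rb 60 64 > out.16
cat input.txt | ./split.rb 64 68 > out.17
cat input.txt | ./split.rb 68 72 > out.18
cat input.txt | ./split.rb 72 76 > out.19
cat input.txt | ./split.rb 76 80 > out.20
cat input.txt | ./split.rb 80 84 > out.21
cat input.txt | ./split.rb 84 88 > out.22
cat input.txt | ./split.rb 88 92 > out.23
cat input.txt | ./split.rb 92 96 > out.24
cat input.txt | ./split.rb 96 100 > out.25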

This output gets piped to bash.

Be careful with the number of processes (or jobs) you compute in parallel because it will flood your storage (unless you have independent storage volumes).
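As a minimal sketch of the parallel idea (not drio's code): the function below starts one `cut` job per column group in the background and uses the bash builtin `wait` to cap how many run at once. It assumes tab-delimited input and uses `cut` rather than split.rb; `parallel_split`, `max_jobs` and the output naming are hypothetical.

```shell
#!/bin/bash
# Sketch: run one cut job per column group in the background, at most
# max_jobs at a time. Function and parameter names are hypothetical.
parallel_split() {
  local infile=$1 inc=$2 max_jobs=$3 ncol start i=0
  # Count the tab-separated fields on the first line.
  ncol=$(awk -F'\t' 'NR==1{print NF; exit}' "$infile")
  for (( start = 1; start <= ncol; start += inc )); do
    cut -f"$start-$(( start + inc - 1 ))" "$infile" > "$infile.$i" &
    i=$(( i + 1 ))
    # After launching a full batch, wait for it to finish before continuing.
    if (( i % max_jobs == 0 )); then wait; fi
  done
  wait   # wait for any jobs still running
}
```

A small `max_jobs` keeps the number of concurrent readers of the input file low, which is the storage concern raised above.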

Hope that helps. Let us know how fast it runs for you.

-drd

Answer by zzapper

If you only need a QAD (Quick & Dirty) solution, here is one for my case: a fixed 8-column, ;-separated CSV:

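#!/bin/bash
# delimiter is ;
cut -d';' -f1 "$1" > "$1.1"
cut -d';' -f2 "$1" > "$1.2"
cut -d';' -f3 "$1" > "$1.3"
cut -d';' -f4 "$1" > "$1.4"
cut -d';' -f5 "$1" > "$1.5"
cut -d';' -f6 "$1" > "$1.6"
cut -d';' -f7 "$1" > "$1.7"
cut -d';' -f8 "$1" > "$1.8"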
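The same quick-and-dirty idea can be written as a loop so the column count is not hard-coded. This is only a sketch; `split_columns_by_delim` and its `delim` and `ncols` parameters are hypothetical names, not part of zzapper's answer.

```shell
#!/bin/bash
# Sketch: one output file per column of a delimited file.
# split_columns_by_delim, delim and ncols are hypothetical names.
split_columns_by_delim() {
  local infile=$1 delim=$2 ncols=$3 i
  for (( i = 1; i <= ncols; i++ )); do
    # Column i goes to <infile>.<i>, as in the fixed 8-column version.
    cut -d"$delim" -f"$i" "$infile" > "$infile.$i"
  done
}
```

For the 8-column case this would be called as `split_columns_by_delim data.csv ';' 8`.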

Answer by Erik

Split can actually do what you desire, with a little bit of preprocessing:

sed -E $'s/(([^\t]+\t){4}[^\t]+)\t/\\1\\n/g' myfile.txt | split -nr/20

This will write out twenty files with an x prefix (in my version of split). You can verify this worked with:

paste x* | cmp - myfile.txt

Essentially, this uses sed to split each line into twenty lines, and then uses split with round-robin chunks to write each line to the appropriate file. To use a different delimiter, replace the tabs in the expression. The number 4 should be the number of columns per file minus 1, and the 20 at the end of the split command is the number of files. Additional parameters to split can be used to modify the filenames that are written. This example uses bash's escape expansion ($'...') to write tabs into the sed expression and a version of sed that supports the + operator, but the same effect can be achieved in other ways if these aren't available on your system.

I got a variant of this solution from Reuti on the coreutils mailing list.
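A minimal sketch of the same sed and split pipeline with the group size, file count and filename prefix as parameters is below. It assumes GNU sed and GNU split, no empty fields, and a column count equal to columns-per-file times number of files; `split_round_robin` and its parameter names are hypothetical.

```shell
#!/bin/bash
# Sketch: the sed + split approach with parameters instead of fixed numbers.
# Assumes GNU sed and GNU split. Names are hypothetical.
split_round_robin() {
  local infile=$1 cols_per_file=$2 nfiles=$3 prefix=$4
  local k=$(( cols_per_file - 1 ))
  local tab=$'\t'
  # Replace every cols_per_file-th tab with a newline, then deal the
  # resulting lines out to nfiles files in round-robin order.
  sed -E "s/(([^$tab]+$tab){$k}[^$tab]+)$tab/\\1\\n/g" "$infile" |
    split -n "r/$nfiles" - "$prefix"
}
```

With `split_round_robin myfile.txt 5 20 x`, the result can be checked the same way as before, with `paste x* | cmp - myfile.txt`.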

Answer by kvantour

There is no single command that directly splits a file column-wise. However, you can do it with AWK in a straightforward manner:

The following splits input_file into output files containing NUMBER columns each:

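awk 'BEGIN{FS="\t"; m=NUMBER }
     { for(i=1;i<=NF;++i) {
          s = (i%m==1 ? $i : s FS $i);
          if (i%m==0 || i==NF) {print s > (sprintf("out.%0.5d",int(i/m)+(i%m!=0)))}
     }}' input_file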

The following splits input_file into CHUNKS output files:

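awk 'BEGIN{FS="\t"; n=CHUNKS}
     # columns per file = NF/n, rounded up
     (NR==1){ m=int(NF/n)+(NF%n!=0) }
     { for(i=1;i<=NF;++i) {
          s = (i%m==1 ? $i : s FS $i);
          if (i%m==0 || i==NF) {print s > (sprintf("out.%0.5d",int(i/m)+(i%m!=0)))}
     }}' input_file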
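The columns-per-file awk approach can also be wrapped in a shell function that passes the number in with `-v`, so the program text does not need editing. This is a sketch rather than kvantour's exact code: `split_awk` and `prefix` are hypothetical names, and the grouping test is written as `(i - 1) % m == 0` so that one column per file also works. Some awk implementations limit the number of simultaneously open output files, which may matter for very wide inputs.

```shell
#!/bin/bash
# Sketch: pass the column count to awk with -v instead of editing the
# program text. split_awk and prefix are hypothetical names.
split_awk() {
  local infile=$1 m=$2 prefix=$3
  awk -F'\t' -v m="$m" -v prefix="$prefix" '
    { for (i = 1; i <= NF; ++i) {
        # Start a new group at the first column of each block of m columns.
        s = ((i - 1) % m == 0 ? $i : s FS $i)
        # At the end of a block (or of the line), write the group out.
        if (i % m == 0 || i == NF)
          print s > (sprintf("%s.%05d", prefix, int((i - 1) / m) + 1))
      } }' "$infile"
}
```

For example, `split_awk input_file 5 out` would write out.00001, out.00002 and so on, each holding five columns.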