Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/9476018/

Date: 2020-09-18 01:40:12  Source: igfitidea

Split text file into parts based on a pattern taken from the text file

linux bash text

Asked by a different ben

I have many text files of fixed-width data, e.g.:

$ head model-q-060.txt 
% x                      y                        
15.0                     0.0                      
15.026087                -1.0                     
15.052174                -2.0                     
15.07826                 -3.0                     
15.104348                -4.0                     
15.130435                -5.0                     
15.156522                -6.0                     
15.182609                -6.9999995               
15.208695                -8.0  

The data comprise 3 or 4 runs of a simulation, all stored in the one text file, with no separator between runs. In other words, there is no empty line or anything, e.g. if there were only 3 'records' per run it would look like this for 3 runs:

$ head model-q-060.txt 
% x                      y                        
15.0                     0.0                      
15.026087                -1.0                     
15.052174                -2.0                     
15.0                     0.0                      
15.038486                -1.0                     
15.066712                -2.0                     
15.0                     0.0                      
15.041089                -1.0                     
15.087612                -2.0                     

It's a COMSOL Multiphysics output file for those interested. Visually you can tell where the new run data begin, as the first x-value is repeated (actually the entire second line might be the same for all of them). So I need to firstly open the file and get this x-value, save it, then use it as a pattern to match with awk or csplit. I am struggling to work this out!

csplit will do the job:

$ csplit -z -f 'temp' -b '%02d.txt' model-q-060.txt '/^15\.0\s/' '{*}'

but I have to know the pattern to split on. This question is similar but each of my text files might have a different pattern to match: Split files based on file content and pattern matching.

Ben.

Accepted answer by Jim Garrison

Here's a simple awk script that will do what you want:

BEGIN { fn=0 }
NR==1 { next }
NR==2 { delim=$1 }
$1 == delim {
    f=sprintf("test%02d.txt",fn++);
    print "Creating " f
}
{ print $0 > f }

  1. initialize output file number
  2. ignore the first line
  3. extract the delimiter from the second line
  4. for every input line whose first token matches the delimiter, set up the output file name
  5. for all lines, write to the current output file

Answer by icyrock.com

This should do the job - test somewhere you don't have a lot of temp*.txt files: :)

rm -f temp*.txt

cat > f1.txt <<EOF
% x                      y                        
15.0                     0.0                      
15.026087                -1.0                     
15.052174                -2.0                     
15.0                     0.0                      
15.038486                -1.0                     
15.066712                -2.0                     
15.0                     0.0                      
15.041089                -1.0                     
15.087612                -2.0    
EOF

# take the first field of line 2 and escape its dots for use in a regex
first=$(awk 'NR==2{print $1}' f1.txt | sed 's/\./\\./g')
echo "--- Splitting by: $first"

# quote the pattern so the shell does not strip the backslashes
csplit -z -f temp -b %02d.txt f1.txt "/^${first}\s/" '{*}'

for i in temp*.txt; do
  echo "---- $i"
  cat "$i"
done

The output of the above is:

--- Splitting by: 15\.0
51
153
153
136
---- temp00.txt
% x                      y                        
---- temp01.txt
15.0                     0.0                      
15.026087                -1.0                     
15.052174                -2.0                     
---- temp02.txt
15.0                     0.0                      
15.038486                -1.0                     
15.066712                -2.0                     
---- temp03.txt
15.0                     0.0                      
15.041089                -1.0                     
15.087612                -2.0    

Of course, you will run into trouble if the value taken from the second line (15.0 in the above example) also appears at the start of a line within a run - solving that would be a tad harder - exercise left for the reader...

Answer by Blackle Mori

If the number of lines per run is constant, you could use this:

# number_of_runs is a placeholder: substitute the actual number of runs
cat your_file.txt | grep -P "^\d" | \
   split --lines=$(expr \( $(wc -l "your_file.txt" | \
   awk '{print $1}') - 1 \) / number_of_runs)
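
Since the question mentions many such files, the awk approach from the accepted answer could be wrapped in a small shell function and applied to each file in turn. The following is only a sketch under stated assumptions: the function name `split_runs`, the input glob `model-*.txt`, and the output naming `split-<input name>-NN.txt` are all made up for illustration and are not part of the original answers. It also closes each output file before opening the next, which the original script does not do.

```shell
#!/usr/bin/env bash
# Sketch only: split_runs, the model-*.txt glob and the split-*-NN.txt
# output names are assumptions, not taken from the original answers.

split_runs() {
    local infile=$1
    local stem=${infile%.txt}            # input name without the .txt suffix
    awk -v stem="$stem" '
        NR == 1 { next }                 # skip the "% x y" header line
        NR == 2 { delim = $1 }           # first x-value marks the start of each run
        $1 == delim {                    # a new run begins on this line
            if (out != "") close(out)    # finish the previous output file
            out = sprintf("split-%s-%02d.txt", stem, n++)
        }
        { print > out }                  # write the line to the current run file
    ' "$infile"
}

for f in model-*.txt; do
    [ -e "$f" ] || continue              # the glob matched nothing
    split_runs "$f"
done
```

Because the output names start with `split-`, they do not match the `model-*.txt` glob, so running the loop a second time does not try to split its own output. As with the answers above, this relies on the first x-value not recurring at the start of a line inside a run.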