Splitting CSV file and excluding column in output using bash, sed or awk

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/10217663/

Time: 2020-09-18 02:04:18  Source: igfitidea

linux bash csv sed awk

Asked by general exception

I have a CSV file which contains data like the following:-

1,275,,,275,17.3,0,"2011-05-09 20:21:45"
2,279,,,279,17.3,0,"2011-05-10 20:21:52"
3,276,,,276,17.3,0,"2011-05-11 20:21:58"
4,272,,,272,17.3,0,"2011-05-12 20:22:04"
5,272,,,272,17.3,0,"2011-05-13 20:22:10"
6,278,,,278,17.3,0,"2011-05-13 20:24:08"
7,270,,,270,17.3,0,"2011-05-13 20:24:14"
8,269,,,269,17.3,0,"2011-05-14 20:24:20"
9,278,,,278,17.3,0,"2011-05-14 20:24:26"

This file contains 4432986 rows of data.

I wish to split the file into new files, naming each one after the date in its last column.

Therefore, based on the data above, I would want 6 new files, each containing the rows for one day.

I would like the files named in YYYY_MM_DD format.

I would also like to ignore the first column in the output data.

So file 2011_05_13 would contain the following rows, with the first column excluded:-

272,,,272,17.3,0,"2011-05-13 20:22:10"
278,,,278,17.3,0,"2011-05-13 20:24:08"
270,,,270,17.3,0,"2011-05-13 20:24:14"

I am planning on doing this on a Linux box, so anything using standard Linux utilities (sed, awk, etc.) would be cool.

Answered by Steve

Here's a one-liner for you in awk:

awk -F "," '{ split ($8,array," "); sub ("\"","",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt

Desired output achieved, although perhaps some of this code could be made more succinct. HTH.

EDIT:

Read code from left to right:

  • -F ","
    This sets the field delimiter to a comma.

  • split ($8,array," ")
    This splits the eighth column on the space and puts the pieces in an array called array.

  • sub ("\"","",array[1])
    We take the first array element (the piece that is going to become our output file name) and substitute out the leading " symbol (we need to escape the " symbol, so we put the \ character in front).

  • sub (NR,"",$0)
    This removes the row number from the beginning of the line (NR is the current row number and $0 is of course the whole line of input before delimitation). It works here because the first column of each row is the same as its row number.

  • sub (",","",$0)
    This removes the comma after the row number.

  • Now that we have a clean filename and a clean row of data we can write $0 to array[1]: print $0 > array[1].
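
The steps above can be laid out as a commented, multi-line awk program. This is only a sketch run against the first seven sample rows from the question in a scratch directory; the names workdir and sample.csv are made up for the demonstration. The rows are numbered from 1, which the sub (NR,"",$0) step depends on.

```shell
workdir=$(mktemp -d)            # scratch directory (hypothetical name)
cd "$workdir" || exit 1

# First seven rows of the sample data from the question.
cat > sample.csv <<'EOF'
1,275,,,275,17.3,0,"2011-05-09 20:21:45"
2,279,,,279,17.3,0,"2011-05-10 20:21:52"
3,276,,,276,17.3,0,"2011-05-11 20:21:58"
4,272,,,272,17.3,0,"2011-05-12 20:22:04"
5,272,,,272,17.3,0,"2011-05-13 20:22:10"
6,278,,,278,17.3,0,"2011-05-13 20:24:08"
7,270,,,270,17.3,0,"2011-05-13 20:24:14"
EOF

awk -F "," '{
    split($8, array, " ")       # array[1] is the date with a leading quote, array[2] the time
    sub("\"", "", array[1])     # drop the leading quote, leaving e.g. 2011-05-13
    sub(NR, "", $0)             # remove the first match of the row number
    sub(",", "", $0)            # remove the comma that followed it
    print $0 > array[1]         # write the row to the file named after the date
}' sample.csv

cat 2011-05-13
```

The final cat prints the three rows dated 2011-05-13, each without its first column.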

FIX:

So if you'd prefer an underscore instead of a hyphen, all we need to fix is array[1]. I've just added in a global substitution: gsub ("-","_",array[1]).

The updated code is:

awk -F "," '{ split ($8,array," "); sub ("\"","",array[1]); gsub ("-","_",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt
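
The sub (NR,"",$0) step works only while the first column is identical to the row number. As a hedged sketch of a variant that does not depend on that (an alternative, not the answer's own code), the first column can be removed with a regular expression and the file name taken from fixed character positions. It assumes the timestamp is always the last field and is wrapped in double quotes; workdir and sample.csv are hypothetical names.

```shell
workdir=$(mktemp -d)            # scratch directory (hypothetical name)
cd "$workdir" || exit 1

# The first column deliberately does not match the row number here.
cat > sample.csv <<'EOF'
105,272,,,272,17.3,0,"2011-05-13 20:22:10"
106,278,,,278,17.3,0,"2011-05-13 20:24:08"
108,269,,,269,17.3,0,"2011-05-14 20:24:20"
EOF

awk -F "," '{
    name = substr($NF, 2, 10)   # characters 2-11 of the last field: 2011-05-13
    gsub("-", "_", name)        # turn it into 2011_05_13
    sub(/^[^,]*,/, "")          # drop everything up to and including the first comma
    print > name                # write the shortened row to the per-day file
}' sample.csv

cat 2011_05_13
```

The file name is computed before the row is shortened, so the order of the statements matters.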

HTH.

Answered by anubhava

You can use this awk command:

awk -F, 'BEGIN{OFS=",";} {dt=$8; gsub(/^"| .*"$/,"", dt);
   $1=""; sub(/^,/, "", $0); print $0 > dt}' input.txt
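
To see what the gsub pattern in this command does on its own, here is a small sketch that applies it to a single timestamp field. The alternation removes a leading double quote and, separately, everything from the first space through the closing quote, so only the date is left. The variable name result is just for the demonstration.

```shell
# Feed one quoted timestamp through the same gsub call and capture the result.
result=$(printf '%s\n' '"2011-05-13 20:22:10"' |
    awk '{ dt = $0; gsub(/^"| .*"$/, "", dt); print dt }')
echo "$result"    # prints 2011-05-13
```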

Answered by Donald Miner

A scripting language (perl/python) is likely your best choice here, but I liked the challenge of doing this in bash, so here it is.

 cat bigfile.txt | while read -r LINE;
  do echo "$LINE" >> "$(echo "$LINE" | cut -d, -f8 | cut -c2-11).txt" ;
 done

Basically, what this does is read the file line by line in the while loop, then append each line to a file named after its date.

The date is pulled out with a combination of two cut commands. The first cut pulls the last column (column 8) off using a comma delimiter (-d,), then the second cut pulls just the date by skipping the leading " and keeping characters 2 through 11.

Now, to tackle the removal of the first column:

 cat bigfile.txt | sed 's/^[^,]*,//'

This regular expression just removes everything up to and including the first comma. Note that sed does not support the non-greedy .*? form, which is why the first column is matched with [^,]*.

So, we'll replace the beginning of our while loop with this. With the first column gone, the date is now in column 7, leaving us with:

 cat bigfile.txt | sed 's/^[^,]*,//' | while read -r LINE;
  do echo "$LINE" >> "$(echo "$LINE" | cut -d, -f7 | cut -c2-11).txt" ;
 done
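
The loop in this answer starts several processes for every input line, which adds up over millions of rows. As a sketch of the same idea using only the shell's own parameter expansion (an alternative, not part of the original answer), assuming every line ends with a quoted timestamp; bigfile.txt here is a three-line sample created for the demonstration and workdir is a hypothetical scratch directory:

```shell
workdir=$(mktemp -d)            # scratch directory (hypothetical name)
cd "$workdir" || exit 1

cat > bigfile.txt <<'EOF'
5,272,,,272,17.3,0,"2011-05-13 20:22:10"
6,278,,,278,17.3,0,"2011-05-13 20:24:08"
8,269,,,269,17.3,0,"2011-05-14 20:24:20"
EOF

while IFS= read -r line; do
    rest=${line#*,}             # remove the shortest prefix ending in a comma (first column)
    stamp=${rest##*,\"}         # keep what follows the last ," : 2011-05-13 20:22:10"
    day=${stamp%% *}            # cut at the first space: 2011-05-13
    printf '%s\n' "$rest" >> "$day.txt"
done < bigfile.txt

cat 2011-05-13.txt
```

Because the files are opened in append mode, running the loop twice in the same directory would duplicate the rows.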

Answered by Demosthenex

This monstrosity grabs all the unique dates and then greps for those keys in the original file, saving the matches to files named by that key. Yes, useless use of cat, but I was trying to atomize the actions.

cat records.txt \
| cut -f8 -d, \
| cut -f1 -d ' ' \
| tr -d '"' \
| sort -u \
| while read DATE ; do \
    cat records.txt \
    | cut -f2- -d, \
    | grep -E ",\"${DATE} [0-9]{2}:[0-9]{2}:[0-9]{2}\"" \
    > "${DATE}.txt"
done

Answered by allenhwkim

It must be simple

$ sed 's/^[0-9]*,//' your_gigantic_data.csv

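Answered by potong

This might work for you:

sed 's/^[^,]*,\(.*"\(....\)-\(..\)-\(..\).*\)/echo \1 >>\2_\3_\4.csv/' file | sh

or GNU sed:

sed 's/^[^,]*,\(.*"\(....\)-\(..\)-\(..\).*\)/echo \1 >>\2_\3_\4.csv/e' file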
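
To see what this kind of sed substitution hands to the shell, the command can be run on a single row without piping the result to sh. The sketch below assumes the replacement uses the four capture groups in order (the whole shortened row, then year, month and day); the variable name generated is just for the demonstration.

```shell
# Run the substitution on one row and inspect the generated shell command.
generated=$(printf '%s\n' '5,272,,,272,17.3,0,"2011-05-13 20:22:10"' |
    sed 's/^[^,]*,\(.*"\(....\)-\(..\)-\(..\).*\)/echo \1 >>\2_\3_\4.csv/')
echo "$generated"
# prints: echo 272,,,272,17.3,0,"2011-05-13 20:22:10" >>2011_05_13.csv
```

When sh executes a line like this, it interprets the double quotes itself, so they do not reach the output file unless the echoed text is wrapped in an extra layer of quoting.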