How to extract one column of a csv file
Notice: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow.
Original question: http://stackoverflow.com/questions/19602181/
Asked by user788171
If I have a csv file, is there a quick bash way to print out the contents of only a single column? It is safe to assume that each row has the same number of columns, but each column's content may have a different length.
Answered by synthesizerpatel
You could use awk for this. Change '$2' to the nth column you want.
awk -F "\"*,\"*" '{print $2}' textfile.csv
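To see what that field separator does in practice, here is a minimal sketch. The sample rows and the temporary file are assumptions made up for illustration. The separator matches a comma together with any double quotes around it, so quotes between columns disappear, but the quote at the very start of a line is not next to a comma and stays attached to the first field.

```shell
#!/bin/bash
# Hypothetical sample data written to a temporary file (not from the question).
sample=$(mktemp)
printf '%s\n' '"alice","30","paris"' '"bob","25","rome"' > "$sample"

# Same separator as in the answer: a comma plus any adjacent double quotes.
second=$(awk -F "\"*,\"*" '{print $2}' "$sample")
first=$(awk -F "\"*,\"*" '{print $1}' "$sample")
rm -f "$sample"

printf '%s\n' "$second"   # prints 30 and 25: the quotes were consumed by the separator
printf '%s\n' "$first"    # prints "alice and "bob: the leading quote is kept
```

So this one-liner is convenient for fully quoted files with no commas inside the fields, and the first and last columns may need their outer quote stripped separately.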
Answered by madrag
Yes. The following will print the 3rd column:
cat mycsv.csv | cut -d ',' -f3
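A short sketch of where cut works and where it does not. The two input lines are invented for illustration. cut splits on every comma and knows nothing about CSV quoting, so a comma inside a quoted field shifts the column numbers.

```shell
#!/bin/bash
# Plain data: the third comma-separated field is what you expect.
plain=$(printf '%s\n' 'x,y,z' | cut -d ',' -f3)

# Quoted data (hypothetical row): the comma inside "b,c" is treated as a delimiter.
quoted=$(printf '%s\n' 'a,"b,c",d' | cut -d ',' -f3)

echo "$plain"    # z
echo "$quoted"   # c"
```

For files that may contain quoted commas, a CSV-aware tool such as the ones in the other answers is the safer choice.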
Answered by Samar
The simplest way I was able to get this done was to just use csvtool. I had other use cases for csvtool as well, and it handles quotes and delimiters appropriately if they appear within the column data itself.
csvtool format '%(2)\n' input.csv
Replacing 2 with the column number will extract the data of the column you are looking for.
Answered by cevaris
I landed here looking to extract a column from a tab-separated file, so I thought I would add this:
cat textfile.tsv | cut -f2 -s
Here -f2 extracts the second column (fields are numbered from 1, not 0), and -s suppresses lines that contain no delimiter.
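The effect of -s is easiest to see side by side. This is a small sketch with made-up data in a temporary file: one line has no tab at all, and cut prints such lines whole unless -s is given.

```shell
#!/bin/bash
# Hypothetical tab-separated data with one line that contains no tab.
tsv=$(mktemp)
printf 'id\tname\n1\talice\n# comment without tabs\n2\tbob\n' > "$tsv"

with_s=$(cut -f2 -s "$tsv")      # lines without a tab are dropped
without_s=$(cut -f2 "$tsv")      # lines without a tab are passed through unchanged
rm -f "$tsv"

printf '%s\n' "$with_s"
printf '%s\n' '---'
printf '%s\n' "$without_s"
```

Tab is the default delimiter for cut, which is why no -d option is needed here.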
Answered by router
Many answers to this question are great, and some have even looked into the corner cases. I would like to add a simple answer for daily use. Note that it does not handle those corner cases itself (such as escaped commas or commas inside quotes).
FS (Field Separator) is the variable whose value defaults to a single space, so by default awk splits every line on whitespace.
Using BEGIN (executed before any input is read), we can set this variable to anything we want:
awk 'BEGIN {FS = ","}; {print $3}'
The above code prints the 3rd column of CSV data read from standard input; append a file name to read from a file instead.
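If the column number changes often, the same idea can be wrapped in a small function. This is a sketch only: the function name csv_col and the variable col are made up for illustration, and it shares the limitation above (no support for quoted or escaped commas).

```shell
#!/bin/bash
# csv_col N [FILE...]: print column N of simple comma-separated input.
# With no FILE arguments, awk reads standard input.
csv_col() {
    local col=$1
    shift
    awk -v col="$col" 'BEGIN {FS = ","} {print $col}' "$@"
}

third=$(printf 'a,b,c\n1,2,3\n' | csv_col 3)
printf '%s\n' "$third"   # prints c and 3
```

Passing the number with -v keeps the awk program in single quotes, so the shell never expands $col itself.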
Answered by jarno
You could use GNU Awk; see the corresponding article in its user guide.
As an improvement to the solution presented in the article (in June 2015), the following gawk command allows double quotes inside double-quoted fields; a double quote is marked by two consecutive double quotes ("") there. Furthermore, this allows empty fields, but even this cannot handle multiline fields. The following example prints the 3rd column (via c=3) of textfile.csv:
#!/bin/bash
gawk -- '
BEGIN{
    FPAT="([^,\"]*)|(\"((\"\")*[^\"]*)*\")"
}
{
    if (substr($c, 1, 1) == "\"") {
        $c = substr($c, 2, length($c) - 2) # Get the text within the two quotes
        gsub("\"\"", "\"", $c) # Normalize double quotes
    }
    print $c
}
' c=3 < <(dos2unix <textfile.csv)
Note the use of dos2unix to convert possible DOS-style line breaks (CRLF, i.e. "\r\n") and UTF-16 encoding (with byte order mark) to "\n" and UTF-8 (without byte order mark), respectively. Standard CSV files use CRLF as the line break; see Wikipedia.
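The reason for the line-break conversion can be shown without gawk or dos2unix. The sketch below swaps in tr -d '\r' as a stand-in, which is an assumption on my part: it only removes carriage returns and, unlike dos2unix, does nothing about UTF-16 or byte order marks. Without the conversion, the last column of every row silently carries a trailing "\r".

```shell
#!/bin/bash
# Hypothetical CRLF-terminated input, as produced by many spreadsheet exports.
raw=$(printf 'a,b,c\r\n1,2,3\r\n' | cut -d ',' -f3)
clean=$(printf 'a,b,c\r\n1,2,3\r\n' | tr -d '\r' | cut -d ',' -f3)

# The carriage returns are invisible on a terminal, so compare lengths instead.
echo "${#raw}"     # 5: c, \r, newline, 3, \r
echo "${#clean}"   # 3: c, newline, 3
```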
If the input may contain multiline fields, you can use the following script. Note the use of a special string for separating records in the output (since the default separator, newline, could occur within a record). Again, the following example prints the 3rd column (via c=3) of textfile.csv:
#!/bin/bash
gawk -- '
BEGIN{
    RS="\0" # Read the whole input file as one record;
    # assume there is no null character in input.
    FS="" # Suppose this setting eases internal splitting work.
    ORS="\n####\n" # Use a special output separator to show borders of a record.
}
{
    nof=patsplit($0, a, /([^,"\n]*)|("(("")*[^"]*)*")/, seps)
    field=0;
    for (i=1; i<=nof; i++){
        field++
        if (field==c) {
            if (substr(a[i], 1, 1) == "\"") {
                a[i] = substr(a[i], 2, length(a[i]) - 2) # Get the text within
                # the two quotes.
                gsub(/""/, "\"", a[i]) # Normalize double quotes.
            }
            print a[i]
        }
        if (seps[i]!=",") field=0
    }
}
' c=3 < <(dos2unix <textfile.csv)
There is another approach to the problem. csvquote can output the contents of a CSV file modified so that special characters within fields are transformed, which lets the usual Unix text processing tools select a certain column. For example, the following code outputs the third column:
csvquote textfile.csv | cut -d ',' -f 3 | csvquote -u
csvquote can be used to process arbitrarily large files.
Answered by drldcsta
The other answers work well, but since you asked for a solution using just the bash shell, you can do this:
AirBoxOmega:~ d$ cat > file #First we'll create a basic CSV
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
And then you can pull out columns (the first in this example) like so:
AirBoxOmega:~ d$ while IFS=, read -a csv_line;do echo "${csv_line[0]}";done < file
a
1
a
1
a
1
a
1
a
1
a
1
So there's a couple of things going on here:
while IFS=, - this says to use a comma as the IFS (Internal Field Separator), which is what the shell uses to know what separates fields (blocks of text). So with IFS=, the text "a,b" is split the same way "a b" would be with IFS=" " (a space is part of the default IFS).
read -a csv_line; - this reads in each line, one at a time, and creates an array called csv_line whose elements are that line's fields, then hands it to the "do" section of our while loop.
do echo "${csv_line[0]}";done < file - now we're in the "do" phase, and we echo the 0th element of the array csv_line. This action is repeated for every line of the file. The < file part just tells the while loop where to read from. NOTE: remember, in bash, arrays are 0-indexed, so the first column is the 0th element.
So there you have it, pulling out a column from a CSV in the shell. The other solutions are probably more practical, but this one is pure bash.
Answered by Raj Velayudhan
[dumb@one pts]$ cat > file #First we'll create a basic CSV
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
[dumb@one pts]$ awk -F , '{print $1}' file
a
1
a
1
Answered by Darth Egregious
I needed proper CSV parsing, not cut / awk and prayer. I'm trying this on a mac without csvtool, but macs do come with ruby, so you can do the following (data[34] is the 35th column, since Ruby arrays are zero-indexed; change the index to the column you need):
echo "require 'csv'; CSV.read('new.csv').each {|data| puts data[34]}" | ruby
Answered by exussum
csvtool col 2 file.csv
where 2 is the column you are interested in.
You can also do
csvtool col 1,2 file.csv
to do multiple columns.
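Where csvtool is not installed, a rough stand-in for the multiple-column case is cut with a field list. This is a swapped-in technique, not csvtool itself, and it is only safe for simple data with no quoted or embedded commas. The sample rows are made up.

```shell
#!/bin/bash
# Select columns 1 and 2 of simple comma-separated input (hypothetical rows).
pair=$(printf 'a,b,c\n1,2,3\n' | cut -d ',' -f1,2)
printf '%s\n' "$pair"   # prints a,b and 1,2
```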