Original question: http://stackoverflow.com/questions/1560393/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
Bash shell scripting - csv parsing
Asked by tinkertime
I am trying to parse a CSV file containing potentially 100k+ lines. Here are the criteria I have:
- The index of the identifier
- The identifier value
I would like to retrieve all lines in the CSV that have the given value in the given index (delimited by commas).
Any ideas, with special consideration for performance?
Accepted answer by unwind
First prototype using plain old grep and cut:
grep "${VALUE}" inputfile.csv | cut -d, -f"${INDEX}"
If that's fast enough and gives the proper output, you're done.
Answer by Andrey Vlasovskikh
Answer by FRV
See this YouTube video: BASH scripting lesson 10, working with CSV files
CSV file:
Bob Brown;Manager;16581;Main
Sally Seaforth;Director;4678;HOME
Bash script:
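#!/bin/bash
# Split each line on semicolons and print the fields as a labeled record
OLDIFS=$IFS
IFS=";"
while read user job uid location
do
    echo -e "$user \
======================\n\
Role :\t $job\n\
ID :\t $uid\n\
SITE :\t $location\n"
done < "$1"   # assumption: the CSV path is given as the first argument
IFS=$OLDIFS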
Output:
Bob Brown ======================
Role : Manager
ID : 16581
SITE : Main
Sally Seaforth ======================
Role : Director
ID : 4678
SITE : HOME
Answer by bobince
CSV isn't quite that simple. Depending on the limits of the data you have, you might have to worry about quoted values (which may contain commas and newlines) and escaping quotes.
So if your data are restricted enough that you can get away with simple comma-splitting, a shell script can do that easily. If, on the other hand, you need to parse CSV 'properly', bash would not be my first choice. Instead I'd look at a higher-level scripting language, for example Python with a csv.reader.
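For the restricted case described above, where no field is quoted or contains a comma or newline, a plain comma split can be sketched in bash roughly as follows. The function name filter_simple_csv and its argument order are made up for illustration, and the index is taken to be 1-based.

```shell
#!/usr/bin/env bash
# filter_simple_csv INDEX VALUE  (hypothetical helper, reads CSV on stdin)
# Assumption: no quoted fields, no embedded commas or newlines.
filter_simple_csv() {
    local index=$1 value=$2 line
    local -a fields
    while IFS= read -r line || [ -n "$line" ]; do
        # Split the line on commas into an array
        IFS=, read -r -a fields <<< "$line"
        # INDEX is 1-based; bash arrays are 0-based
        if [ "${fields[index-1]:-}" = "$value" ]; then
            printf '%s\n' "$line"
        fi
    done
}

# Example: print the lines whose 2nd field is exactly "bar"
printf 'a,foo,1\nb,bar,2\nc,barn,3\n' | filter_simple_csv 2 bar
```

A read loop like this is likely to be much slower than awk on 100k+ lines, so it mainly serves to show the splitting logic.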
Answer by Hai Vu
In a CSV file, each field is separated by a comma. The problem is, a field itself might have an embedded comma:
Name,Phone
"Woo, John",425-555-1212
You really need a library package that offers robust CSV support instead of relying on the comma as a field separator. I know that scripting languages such as Python have such support. However, I am comfortable with the Tcl scripting language, so that is what I use. Here is a simple Tcl script which does what you are asking for:
#!/usr/bin/env tclsh
package require csv
package require Tclx

# Parse the command line parameters
lassign $argv fileName columnNumber expectedValue

# Subtract 1 from columnNumber because Tcl's list index starts with a
# zero instead of a one
incr columnNumber -1

for_file line $fileName {
    set columns [csv::split $line]
    set columnValue [lindex $columns $columnNumber]
    if {$columnValue == $expectedValue} {
        puts $line
    }
}
Save this script to a file called csv.tcl and invoke it as:
$ tclsh csv.tcl filename indexNumber expectedValue
Explanation
The script reads the CSV file line by line and stores each line in the variable $line, then splits it into a list of columns (variable $columns). Next, it picks out the specified column and assigns it to the $columnValue variable. If that value matches the expected value, the script prints the original line.
Answer by Nate Kohl
Using awk:
export INDEX=2
export VALUE=bar
awk -F, '$'$INDEX' ~ /^'$VALUE'$/ {print}' inputfile.csv
Edit: As per Dennis Williamson's excellent comment, this could be written much more cleanly (and safely) by defining awk variables using the -v switch:
# col is the 1-based field number; it cannot be called index, which is an awk built-in function
awk -F, -v col="$INDEX" -v value="$VALUE" '$col == value {print}' inputfile.csv
Jeez...with variables, and everything, awk is almost a real programming language...
Answer by ghostdog74
index=1
value=2
awk -F"," -v i=$index -v v=$value '$(i)==v' file
Answer by D Bro
For situations where the data does not contain any special characters, the solutions suggested by Nate Kohl and ghostdog74 are good.
If the data contains commas or newlines inside the fields, awk may not properly count the field numbers and you'll get incorrect results.
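The miscount is easy to reproduce with the sample line from Hai Vu's answer. In this small sketch, count_fields_naive is a made-up helper that simply reports how many fields awk sees when splitting on bare commas:

```shell
#!/usr/bin/env bash
# count_fields_naive: hypothetical helper; prints awk's field count (NF)
# for each stdin line when the separator is a bare comma.
count_fields_naive() {
    awk -F, '{ print NF }'
}

# The record has two CSV fields, but the comma inside the quotes
# is treated as a separator as well, so awk reports 3.
printf '%s\n' '"Woo, John",425-555-1212' | count_fields_naive
```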
You can still use awk, with some help from a program I wrote called csvquote (available at https://github.com/dbro/csvquote):
# col is the 1-based field number (index is an awk built-in function name and cannot be used as a variable)
csvquote inputfile.csv | awk -F, -v col="$INDEX" -v value="$VALUE" '$col == value {print}' | csvquote -u
This program finds special characters inside quoted fields, and temporarily replaces them with nonprinting characters which won't confuse awk. Then they get restored after awk is done.
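The substitution idea can be sketched with awk and tr. This is not csvquote's actual implementation, only an illustration of the technique: the function names are invented, the ASCII unit separator (octal 037) is an arbitrary choice of placeholder, and only embedded commas are handled, not embedded newlines.

```shell
#!/usr/bin/env bash
# protect_quoted_commas: hypothetical helper. Replaces each comma that
# sits inside double quotes with the unit separator character (octal 037).
protect_quoted_commas() {
    awk '{
        out = ""; in_quotes = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "\"") in_quotes = !in_quotes   # toggle on every quote
            if (c == "," && in_quotes) c = "\037"   # hide the comma
            out = out c
        }
        print out
    }'
}

# restore_commas: hypothetical helper. Turns the placeholders back into commas.
restore_commas() {
    tr '\037' ','
}

# Select rows whose 2nd field is 425-555-1212, then restore the commas
printf '%s\n' 'Name,Phone' '"Woo, John",425-555-1212' |
    protect_quoted_commas |
    awk -F, '$2 == "425-555-1212"' |
    restore_commas
```

With the commas hidden, awk sees two fields on the quoted line, so the field number lines up with the CSV column again.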
Answer by mob
A sed or awk solution would probably be shorter, but here's one for Perl:
perl -F/,/ -ane 'print if $F[<INDEX>] eq "<VALUE>"'
where <INDEX> is 0-based (0 for first column, 1 for 2nd column, etc.)
Answer by Thomas Guyot-Sionnest
I was looking for an elegant solution that supports quoting and wouldn't require installing anything fancy on my VMware vMA appliance. Turns out this simple Python script does the trick! (I named the script csv2tsv.py, since it converts CSV into tab-separated values, or TSV.)
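#!/usr/bin/env python
import sys, csv

# Read CSV from stdin; the csv module takes care of quoted fields
with sys.stdin as f:
    reader = csv.reader(f)
    for row in reader:
        # Write each field followed by a tab, then end the line
        for col in row:
            sys.stdout.write(col + '\t')
        sys.stdout.write('\n')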
Tab-separated values can be split easily with the cut command (no delimiter needs to be specified, tab is the default). Here's a sample usage/output:
> esxcli -h $VI_HOST --formatter=csv network vswitch standard list |csv2tsv.py|cut -f12
Uplinks
vmnic4,vmnic0,
vmnic5,vmnic1,
vmnic6,vmnic2,
In my scripts I'm actually going to parse tsv output line by line and use read or cut to get the fields I need.
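A line-by-line loop over the TSV output might look like the sketch below. The helper name print_tsv_field is made up, and the loop assumes no field is empty: bash treats tab as IFS whitespace, so read collapses consecutive tabs, whereas cut -f keeps empty fields in place.

```shell
#!/usr/bin/env bash
# print_tsv_field N: hypothetical helper; prints the Nth (1-based)
# tab-separated field of every line read from stdin.
# Assumption: no empty fields (read collapses runs of tabs).
print_tsv_field() {
    local n=$1 line
    local -a fields
    while IFS= read -r line; do
        # Split on tabs only, so spaces inside a field are preserved
        IFS=$'\t' read -r -a fields <<< "$line"
        printf '%s\n' "${fields[n-1]:-}"
    done
}

# Example with made-up data: print the 2nd column
printf 'vSwitch0\tvmnic4,vmnic0,\nvSwitch1\tvmnic5,vmnic1,\n' | print_tsv_field 2
```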