Bash shell script - CSV parsing

Question

Asked by tinkertime

I am trying to parse a CSV containing potentially 100k+ lines. Here are the criteria I have:

The index of the identifier
The identifier value

I would like to retrieve all lines in the CSV that have the given value in the given index (delimited by commas).

Any ideas, with special consideration for performance?

Answer 1

Accepted answer by unwind

First prototype using plain old grep and cut:

grep "${VALUE}" inputfile.csv | cut -d, -f"${INDEX}"

If that's fast enough and gives the proper output, you're done.
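
One thing to watch with this prototype: grep matches the value anywhere on the line, and cut then prints only the selected column rather than the whole line. Below is a minimal sketch of a grep-only variant that keeps whole lines and anchors the match to one column. The function name filter_by_field and its argument order are made up for illustration, and it assumes there are no quoted fields and that the value contains no regex metacharacters:

```shell
# Hypothetical helper: print whole lines whose INDEX-th (1-based)
# comma-separated field is exactly VALUE.
# Assumptions: no quoted fields, VALUE has no regex metacharacters.
filter_by_field() (
    index=$1; value=$2; file=$3
    # Skip (index - 1) fields, then require VALUE followed by a comma or end of line.
    grep -E "^([^,]*,){$((index - 1))}${value}(,|\$)" "$file"
)
```

For example, filter_by_field 2 bar inputfile.csv would print only the lines whose second field is exactly bar.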

Answer 2

Answered by Andrey Vlasovskikh

As an alternative to cut- or awk-based one-liners, you could use the specialized csvtool, aka ocaml-csv:

$ csvtool -t ',' col "$index" - < csvfile | grep "$value"

According to the docs, it handles escaping, quoting, etc.

Answer 3

Answered by FRV

See this YouTube video: BASH scripting lesson 10 working with CSV files

CSV file:

Bob Brown;Manager;16581;Main
Sally Seaforth;Director;4678;HOME

Bash script:

#!/bin/bash
# Assumption: the CSV file path is passed as the first argument.
OLDIFS=$IFS
IFS=";"
while read -r user job uid location
 do

    echo -e "$user \
    ======================\n\
    Role :\t $job\n\
    ID :\t $uid\n\
    SITE :\t $location\n"
 done < "$1"
 IFS=$OLDIFS

Output:

Bob Brown     ======================
    Role :   Manager
    ID :     16581
    SITE :   Main

Sally Seaforth     ======================
    Role :   Director
    ID :     4678
    SITE :   HOME

Answer 4

Answered by bobince

CSV isn't quite that simple. Depending on the limits of the data you have, you might have to worry about quoted values (which may contain commas and newlines) and escaping quotes.

So if your data are restricted enough that you can get away with simple comma-splitting, a shell script can do that easily. If, on the other hand, you need to parse CSV 'properly', bash would not be my first choice. Instead I'd look at a higher-level scripting language, for example Python with a csv.reader.

Answer 5

Answered by Hai Vu

In a CSV file, each field is separated by a comma. The problem is, a field itself might have an embedded comma:

Name,Phone
"Woo, John",425-555-1212

You really need a library package that offers robust CSV support instead of relying on a comma as the field separator. I know that scripting languages such as Python have such support. However, I am comfortable with the Tcl scripting language, so that is what I use. Here is a simple Tcl script which does what you are asking for:

#!/usr/bin/env tclsh

package require csv 
package require Tclx

# Parse the command line parameters
lassign $argv fileName columnNumber expectedValue

# Subtract 1 from columnNumber because Tcl's list index starts with a
# zero instead of a one
incr columnNumber -1

for_file line $fileName {
    set columns [csv::split $line]
    set columnValue [lindex $columns $columnNumber]
    if {$columnValue == $expectedValue} {
        puts $line
    }   
}

Save this script to a file called csv.tcl and invoke it as:

$ tclsh csv.tcl filename indexNumber expectedValue

Explanation

The script reads the CSV file line by line, storing each line in the variable $line, then splits it into a list of columns (variable $columns). Next, it picks out the specified column and assigns it to the $columnValue variable. If that value matches the expected value, the script prints the original line.

Answer 6

Answered by Nate Kohl

Using awk:

export INDEX=2
export VALUE=bar

awk -F, '$'$INDEX' ~ /^'$VALUE'$/ {print}' inputfile.csv

Edit: As per Dennis Williamson's excellent comment, this could be much more cleanly (and safely) written by defining awk variables using the -v switch (the variable is called idx because index is a built-in awk function and cannot be used as a variable name):

awk -F, -v idx="$INDEX" -v value="$VALUE" '$idx == value {print}' inputfile.csv

Jeez...with variables, and everything, awk is almost a real programming language...

Answer 7

Answered by ghostdog74

index=1
value=2
awk -F"," -v i=$index -v v=$value '$(i)==v' file

Answer 8

Answered by D Bro

For situations where the data does not contain any special characters, the solution suggested by Nate Kohl and ghostdog74 is good.

If the data contains commas or newlines inside the fields, awk may not properly count the field numbers and you'll get incorrect results.
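
To see the miscount concretely, here is a small sketch that prints how many fields awk finds on each line when splitting on commas. The helper name count_fields is hypothetical, and the sample input is the two-line file from Hai Vu's answer:

```shell
# Hypothetical helper: print the number of comma-separated fields
# awk sees on each line of stdin.
count_fields() {
    awk -F, '{ print NF }'
}

# Header line, then a record whose first field contains a quoted comma.
printf 'Name,Phone\n"Woo, John",425-555-1212\n' | count_fields
```

The second line has two CSV fields, but awk reports three because it also splits on the comma inside the quotes.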

You can still use awk, with some help from a program I wrote called csvquote (available at https://github.com/dbro/csvquote):

csvquote inputfile.csv | awk -F, -v idx="$INDEX" -v value="$VALUE" '$idx == value {print}' | csvquote -u

This program finds special characters inside quoted fields, and temporarily replaces them with nonprinting characters which won't confuse awk. Then they get restored after awk is done.

Answer 9

Answered by mob

A sed or awk solution would probably be shorter, but here's one for Perl:

perl -F/,/ -lane 'print if $F[<INDEX>] eq "<VALUE>"'

where <INDEX> is 0-based (0 for first column, 1 for 2nd column, etc.)

Answer 10

Answered by Thomas Guyot-Sionnest

I was looking for an elegant solution that supports quoting and wouldn't require installing anything fancy on my VMware vMA appliance. Turns out this simple Python script does the trick! (I named the script csv2tsv.py, since it converts CSV into tab-separated values - TSV)

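#!/usr/bin/env python

import csv
import sys

# Read CSV from stdin and write each row as tab-separated fields.
# sys.stdout.write behaves the same under Python 2 and Python 3.
for row in csv.reader(sys.stdin):
    sys.stdout.write('\t'.join(row) + '\n')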

Tab-separated values can be split easily with the cut command (no delimiter needs to be specified, tab is the default). Here's a sample usage/output:

> esxcli -h $VI_HOST --formatter=csv network vswitch standard list |csv2tsv.py|cut -f12
Uplinks
vmnic4,vmnic0,
vmnic5,vmnic1,
vmnic6,vmnic2,

In my scripts I'm actually going to parse tsv output line by line and use read or cut to get the fields I need.
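
A minimal sketch of the read-based approach, assuming the TSV arrives on stdin. The helper name second_field is hypothetical. One caveat: tab counts as IFS whitespace, so read collapses runs of tabs, and an empty field will shift the columns that follow it; cut -f does not have that problem.

```shell
# Hypothetical helper: print the second tab-separated field of each line on stdin.
# Caveat: consecutive tabs (empty fields) are collapsed by read.
second_field() {
    tab=$(printf '\t')
    # first and rest are read only to position the second field.
    while IFS="$tab" read -r first second rest; do
        printf '%s\n' "$second"
    done
}
```

It would be used at the end of a pipeline, for example: csv2tsv.py < input.csv | second_field (input.csv is a placeholder name).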
