Fastest way to convert a tab-delimited file to CSV in Linux

Question

Asked by andrewj

I have a tab-delimited file that has over 200 million lines. What's the fastest way in Linux to convert this to a CSV file? The file has multiple lines of header information which I'll need to strip out down the road, but the number of header lines is known. I have seen suggestions for sed and gawk, but I wonder if there is a "preferred" choice.

Just to clarify, there are no embedded tabs in this file.

Answer 1

Accepted answer by Mark Rushakoff

If all you need to do is translate all tab characters to comma characters, tr is probably the way to go.

The blank space here is a literal tab:

$ echo "hello   world" | tr "\t" ","
hello,world

Of course, if you have embedded tabs inside string literals in the file, this will incorrectly translate those as well; but embedded literal tabs would be fairly uncommon.
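
For comparison, the same byte-for-byte translation can be sketched in Python with bytes.translate. This is only an illustration of what tr does, not a claim that it is faster; the function name translate_tabs and the 1 MiB chunk size are assumptions made for the example. It shares tr's limitation: every tab becomes a comma, with no quoting.

```python
import io

# Translation table mapping the tab byte to the comma byte, like tr "\t" ",".
TAB_TO_COMMA = bytes.maketrans(b"\t", b",")

def translate_tabs(src, dst, chunk_size=1 << 20):
    """Copy binary stream src to dst, replacing every tab byte with a comma."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # end of input
            break
        dst.write(chunk.translate(TAB_TO_COMMA))

# Small in-memory demonstration (hypothetical data).
demo_out = io.BytesIO()
translate_tabs(io.BytesIO(b"hello\tworld\n"), demo_out)
print(demo_out.getvalue())  # prints: b'hello,world\n'
```

On real files, the same function could be called with sys.stdin.buffer and sys.stdout.buffer so that it works as a shell filter.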

Answer 2

Answered by Ignacio Vazquez-Abrams

If you're worried about embedded commas then you'll need to use a slightly more intelligent method. Here's a Python script that takes TSV lines from stdin and writes CSV lines to stdout:

import sys
import csv

tabin = csv.reader(sys.stdin, dialect=csv.excel_tab)
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in tabin:
  commaout.writerow(row)

Run it from a shell as follows:

python script.py < input.tsv > output.csv
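
Since the question mentions a known number of header lines, here is a minimal sketch of the same csv-module approach wrapped in a function that drops them first. The name tsv_to_csv, the skip_header_lines parameter, and the choice of "\n" as line terminator (the excel dialect defaults to "\r\n") are assumptions for the example, not part of the script above.

```python
import csv
import io

def tsv_to_csv(src, dst, skip_header_lines=0):
    """Convert TSV text from src to CSV on dst, dropping leading header lines."""
    for _ in range(skip_header_lines):
        src.readline()  # discard a header line without parsing it
    reader = csv.reader(src, dialect=csv.excel_tab)
    # lineterminator="\n" is a choice made here for Unix-style output.
    writer = csv.writer(dst, dialect=csv.excel, lineterminator="\n")
    writer.writerows(reader)

# In-memory demonstration with two header lines and one embedded comma.
demo_in = io.StringIO("# header 1\n# header 2\na\tb,c\td\n", newline="")
demo_out = io.StringIO(newline="")
tsv_to_csv(demo_in, demo_out, skip_header_lines=2)
print(demo_out.getvalue(), end="")  # prints: a,"b,c",d
```

The field containing a comma is quoted on output, which is the case a plain tab-to-comma translation gets wrong.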

Answer 3

Answered by ghostdog74

Assuming you don't want to change the header and you don't have embedded tabs:

# cat file
header  header  header
one     two     three

$ awk 'NR>1{$1=$1}1' OFS="," file
header  header  header
one,two,three

NR>1 skips the first header line. You mentioned that you know how many header lines there are, so use the correct number for your own case. With this, you also do not need to call any other external commands; a single awk command does the job.

Another way, if you have blank columns and you care about preserving them:

awk 'NR>1{gsub("\t",",")}1' file

Using sed:

sed '2,$y/\t/,/' file #skip 1 line header and translate (same as tr)

Answer 4

Answered by Will Hartung

sed -e 's/"/""/g' -e 's/<tab>/","/g' -e 's/^/"/' -e 's/$/"/' infile > outfile

Damn the critics, quote everything; CSV doesn't care.

<tab> is the actual tab character. \t didn't work for me. In bash, press Ctrl-V and then Tab to enter it.

Answer 5

Answered by pabs

perl -lpe 's/"/""/g; s/^|$/"/g; s/\t/","/g' < input.tab > output.csv

Perl is generally faster at this sort of thing than sed, awk, or Python.
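
For readers who prefer the csv module, a rough Python counterpart of the quote-everything approach can be sketched as follows. The function name tsv_to_quoted_csv is made up for the example, and reading with QUOTE_NONE is an assumption that quote characters in the input are ordinary data, as the one-liner treats them. No speed claim is intended.

```python
import csv
import io

def tsv_to_quoted_csv(src, dst):
    """Write every field of a TSV stream as a quoted CSV field."""
    # QUOTE_NONE on input: quote characters are treated as plain data.
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    # QUOTE_ALL on output: wrap every field in quotes and double embedded quotes.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(reader)

demo_out = io.StringIO(newline="")
tsv_to_quoted_csv(io.StringIO('say "hi"\tx\n', newline=""), demo_out)
print(demo_out.getvalue(), end="")  # prints: "say ""hi""","x"
```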

Answer 6

Answered by coderofsalvation

The following awk one-liner supports quoting and quote-escaping:

printf "flop\tflap\"" | awk -F '\t' '{ gsub(/"/,"\"\""); for(i = 1; i <= NF; i++) { printf "\"%s\"",$i; if( i < NF ) printf "," }; printf "\n" }'

gives

"flop","flap"""

Answer 7

Answered by jtlai

@ignacio-vazquez-abrams's Python solution is great! For people who want to parse delimiters other than tab, the library actually allows you to set an arbitrary delimiter. Here is my modified version to handle pipe-delimited files:

import sys
import csv

pipein = csv.reader(sys.stdin, delimiter='|')
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in pipein:
  commaout.writerow(row)

Answer 8

Answered by Gopal Kumar

If you want to convert the whole tsv file into a csv file:
```
$ cat data.tsv | tr "\t" "," > data.csv
```
If you want to omit some fields:
```
$ cat data.tsv | cut -f1,2,3 | tr "\t" "," > data.csv
```
The above command converts data.tsv to data.csv, keeping only the first three fields.
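
The field-selection step can also be sketched with Python's csv module, which additionally quotes any kept field that contains a comma. The function name tsv_columns_to_csv and the zero-based keep tuple are assumptions for the example; rows shorter than the requested columns simply yield fewer fields.

```python
import csv
import io

def tsv_columns_to_csv(src, dst, keep=(0, 1, 2)):
    """Write only the columns listed in keep (zero-based) as CSV."""
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        # Skip indices that do not exist in a short row.
        writer.writerow([row[i] for i in keep if i < len(row)])

demo_out = io.StringIO(newline="")
tsv_columns_to_csv(io.StringIO("a\tb\tc\td\n", newline=""), demo_out)
print(demo_out.getvalue(), end="")  # prints: a,b,c
```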

Answer 9

Answered by Mian Asbat Ahmad

I think it is better not to cat the file, because it may cause problems with a large file. A better way may be:

$ tr ',' '\t' < csvfile.csv > tabdelimitedFile.txt

The command reads its input from csvfile.csv and stores the tab-separated result in tabdelimitedFile.txt.

Answer 10

Answered by mloughran

You can also use xsv for this:

xsv input -d '\t' input.tsv > output.csv

In my test on a 300MB tsv file, it was roughly 5x faster than the python solution (2.5s vs. 14s).
