How to convert tab separated, pipe separated to CSV file format in Python
Original URL: http://stackoverflow.com/questions/1366775/
Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must cite the original URL and attribute it to the original authors (not me): StackOverflow
Asked by
I have a text file (.txt) which could be in tab-separated or pipe-separated format, and I need to convert it to CSV. I am using Python 2.6. Can anyone suggest how to identify the delimiter in a text file, read the data, and then convert it into a comma-separated file?
Thanks in advance
Answered by
I fear that you can't identify the delimiter without knowing what it is. The problem with CSV is that, quoting ESR:
the Microsoft version of CSV is a textbook example of how not to design a textual file format.
The delimiter needs to be escaped in some way if it can appear in fields. Without knowing how the escaping is done, automatically identifying the delimiter is difficult. Escaping could be done the UNIX way, using a backslash '\', or the Microsoft way, using quotes, which then must be escaped too. This is not a trivial task.
So my suggestion is to get full documentation from whoever generates the file you want to convert. Then you can use one of the approaches suggested in the other answers, or some variant.
Edit:
Python provides csv.Sniffer, which can help you deduce the format of your DSV. If your input looks like this (note the quoted delimiter in the first field of the second row):
a|b|c
"a|b"|c|d
foo|"bar|baz"|qux
You can do this:
import csv

csvfile = open("csvfile.csv")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.DictReader(csvfile, dialect=dialect)
for row in reader:
    print row,
# => {'a': 'a|b', 'c': 'd', 'b': 'c'} {'a': 'foo', 'c': 'qux', 'b': 'bar|baz'}
# write records using other dialect
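The last comment in the snippet above leaves the writing step open. Here is a minimal sketch of the whole round trip, written in Python 3 syntax (the snippet above targets Python 2.6). The function name convert_to_csv, the sample_size parameter, and the choice to restrict the sniffer to tab and pipe are assumptions made for illustration, not part of the original answer.

```python
import csv
import io


def convert_to_csv(source, destination, sample_size=1024):
    """Copy rows from a tab- or pipe-delimited text stream to a CSV stream.

    Both arguments are text-mode file objects; source must be seekable.
    Returns the delimiter that the sniffer detected.
    """
    sample = source.read(sample_size)
    source.seek(0)
    # Assumption: only the two delimiters named in the question are candidates.
    dialect = csv.Sniffer().sniff(sample, delimiters="\t|")
    reader = csv.reader(source, dialect=dialect)
    # The default writer dialect is comma-separated with minimal quoting.
    writer = csv.writer(destination)
    for row in reader:
        writer.writerow(row)
    return dialect.delimiter


if __name__ == "__main__":
    src = io.StringIO('a|b|c\n"a|b"|c|d\nfoo|"bar|baz"|qux\n')
    dst = io.StringIO()
    convert_to_csv(src, dst)
    print(dst.getvalue())
```

With real files, both would typically be opened with newline="" as the csv module documentation recommends. csv.Sniffer raises csv.Error when it cannot determine a delimiter, so a caller may want to catch that and fall back to another heuristic.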
Answered by Mauro Bianchi
Your strategy could be the following:
- Parse the file with both a tab-separated csv reader and a pipe-separated csv reader.
- Calculate some statistics on the resulting rows to decide which result set is the one you want to write. One idea is to count the total number of fields in each of the two record sets (expecting that tabs and pipes are not common inside the data). Another (if your data is strongly structured and you expect the same number of fields on each line) is to measure the standard deviation of the number of fields per line and take the record set with the smallest standard deviation.
The following example uses the simpler statistic (total number of fields):
import csv

piperows = []
tabrows = []

# parsing | delimiter
f = open("file", "rb")
readerpipe = csv.reader(f, delimiter="|")
for row in readerpipe:
    piperows.append(row)
f.close()

# parsing TAB delimiter
f = open("file", "rb")
readertab = csv.reader(f, delimiter="\t")
for row in readertab:
    tabrows.append(row)
f.close()

# in this example, we use the total number of fields as the indicator
# (but it's not guaranteed to work! it depends on the nature of your data)
# count total fields
totfieldspipe = sum(len(row) for row in piperows)
totfieldstab = sum(len(row) for row in tabrows)

if totfieldspipe > totfieldstab:
    yourrows = piperows
else:
    yourrows = tabrows

# the var yourrows contains the rows, now just write them in any format you like
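The second statistic suggested above, the standard deviation of the field count per line, could look roughly like the sketch below (Python 3 syntax). The names field_counts and pick_delimiter are made up for illustration. The tie-break on the mean field count is an assumption of this sketch rather than part of the answer: a delimiter that never occurs in the file gives one field per row and therefore a standard deviation of zero, so the smallest deviation alone cannot separate it from a correct, perfectly regular parse.

```python
import csv
import io
import statistics


def field_counts(text, delimiter):
    """Return the number of fields in each non-empty row of text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [len(row) for row in reader if row]


def pick_delimiter(text, candidates=("\t", "|")):
    """Pick the candidate whose rows have the most uniform field count.

    Ties are broken in favour of the higher mean field count (an
    assumption of this sketch, see the note above the code).
    """
    best = None
    for delimiter in candidates:
        counts = field_counts(text, delimiter)
        if not counts:
            continue
        spread = statistics.pstdev(counts)
        mean = statistics.mean(counts)
        key = (spread, -mean)  # smaller spread first, then larger mean
        if best is None or key < best[0]:
            best = (key, delimiter)
    if best is None:
        raise ValueError("no rows to analyse")
    return best[1]


if __name__ == "__main__":
    print(repr(pick_delimiter("a|b|c\n1|2|3\n4|5|6\n")))
```

As with the total-field count, this is a heuristic: it depends on the data being regular and on the wrong delimiter being rare inside fields.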
Answered by S.Lott
Like this:
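from __future__ import with_statement
import csv
import re

# input and output are assumed to hold the source and destination file paths
with open( input, "r" ) as source:
    with open( output, "wb" ) as destination:
        writer= csv.writer( destination )
        for line in source:
            # strip the line ending so it does not end up in the last field
            writer.writerow( re.split( '[\t|]', line.rstrip( '\r\n' ) ) )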
Answered by ghostdog74
for line in open("file"):
    line=line.strip()
    if "|" in line:
        print ','.join(line.split("|"))
    else:
        print ','.join(line.split("\t"))
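One caveat with joining the fields on ',' directly: a field that itself contains a comma or a quote produces a malformed row. The sketch below (Python 3 syntax) applies the same per-line test but leaves the quoting to csv.writer. The function name lines_to_csv is made up for illustration. Unlike the snippet above, it strips only the line ending, on the assumption that a trailing tab marks an empty last field that should be kept.

```python
import csv
import io


def lines_to_csv(lines, destination):
    """Split each line on '|' if present, otherwise on tab, and write CSV."""
    writer = csv.writer(destination)
    for line in lines:
        # Strip only the line ending; leading or trailing tabs mark empty fields.
        line = line.rstrip("\r\n")
        delimiter = "|" if "|" in line else "\t"
        writer.writerow(line.split(delimiter))


if __name__ == "__main__":
    out = io.StringIO()
    lines_to_csv(["a|b,c|d\n", "x\ty\t\n"], out)
    print(out.getvalue())
```

This still does not handle a delimiter that appears inside a quoted field; for that, the file has to go through csv.reader with a known delimiter.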
Answered by quamrana
I would suggest taking some of the example code from the existing answers, or perhaps better, using Python's csv module, and changing it to first assume tab-separated input, then pipe-separated input, producing two comma-separated output files. Then visually examine both files to determine which one you want and pick that.
If you actually have lots of files, then you need to find a way to detect which file is which.
One of the examples has this:
if "|" in line:
This may be enough: if the first line of a file contains a pipe, then assume the whole file is pipe-separated; otherwise assume a tab-separated file.
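A minimal sketch of that first-line check, in Python 3 syntax, might look like this. The names delimiter_from_first_line and convert are hypothetical, and the sketch assumes the source is a seekable text-mode file object.

```python
import csv
import io


def delimiter_from_first_line(source):
    """Guess the delimiter from the first line, then rewind the stream."""
    first_line = source.readline()
    source.seek(0)
    return "|" if "|" in first_line else "\t"


def convert(source, destination):
    """Rewrite the whole stream as CSV using the guessed delimiter."""
    delimiter = delimiter_from_first_line(source)
    reader = csv.reader(source, delimiter=delimiter)
    csv.writer(destination).writerows(reader)
    return delimiter


if __name__ == "__main__":
    out = io.StringIO()
    convert(io.StringIO("id|name\n1|Ann\n"), out)
    print(out.getvalue())
```

The guess is wrong for a tab-separated file whose first line happens to contain a pipe inside a field, which is why an easily recognised header line helps.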
Alternatively, fix the file to contain a key field in the first line that is easily identified, or perhaps the first line already contains column headers that can be detected.