Original source: http://stackoverflow.com/questions/16312104/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Can I import a CSV file and automatically infer the delimiter?
Asked by rom
I want to import two kinds of CSV files: some use ";" as the delimiter and others use ",". So far I have been switching between the following two lines:
reader=csv.reader(f,delimiter=';')
or
reader=csv.reader(f,delimiter=',')
Is it possible not to specify the delimiter and to let the program check for the right delimiter?
The solutions below (Blender and sharth) seem to work well for comma-separated files (generated with LibreOffice) but not for semicolon-separated files (generated with MS Office). Here are the first lines of one semicolon-separated file:
ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
Accepted answer by rom
To solve the problem, I have created a function which reads the first line of a file (header) and detects the delimiter.
def detectDelimiter(csvFile):
    with open(csvFile, 'r') as myCsvfile:
        header = myCsvfile.readline()
        if header.find(";") != -1:
            return ";"
        if header.find(",") != -1:
            return ","
    # default delimiter (MS Office export)
    return ";"
Answer by Bill Lynch
The csv module seems to recommend using the csv Sniffer for this problem.
They give the following example, which I've adapted for your case.
import csv

with open('example.csv', 'rb') as csvfile:  # python 3: 'r',newline=""
    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,")
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...
Let's try it out.
[9:13am][wlynch@watermelon /tmp] cat example
#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)
        for line in reader:
            print line

def main():
    print 'Comma Version:'
    parse('comma_separated.csv')
    print
    print 'Semicolon Version:'
    parse('semicolon_separated.csv')
    print
    print 'An example from the question (kingdom.csv)'
    parse('kingdom.csv')

if __name__ == '__main__':
    main()
And our sample inputs:
[9:13am][wlynch@watermelon /tmp] cat comma_separated.csv
test,box,foo
round,the,bend
[9:13am][wlynch@watermelon /tmp] cat semicolon_separated.csv
round;the;bend
who;are;you
[9:22am][wlynch@watermelon /tmp] cat kingdom.csv
ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
And if we execute the example program:
[9:14am][wlynch@watermelon /tmp] ./example
Comma Version:
['test', 'box', 'foo']
['round', 'the', 'bend']
Semicolon Version:
['round', 'the', 'bend']
['who', 'are', 'you']
An example from the question (kingdom.csv)
['ReleveAnnee', 'ReleveMois', 'NoOrdre', 'TitreRMC', 'AdopCSRegleVote', 'AdopCSAbs', 'AdoptCSContre', 'NoCELEX', 'ProposAnnee', 'ProposChrono', 'ProposOrigine', 'NoUniqueAnnee', 'NoUniqueType', 'NoUniqueChrono', 'PropoSplittee', 'Suite2LecturePE', 'Council PATH', 'Notes']
['1999', '1', '1', '1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC', 'U', '', '', '31999D0083', '1998', '577', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']
['1999', '1', '2', '1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes', 'U', '', '', '31999D0081', '1998', '184', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']
It's also probably worth noting which version of Python I'm using.
[9:20am][wlynch@watermelon /tmp] python -V
Python 2.7.2
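For readers on Python 3, the Sniffer approach above might look roughly like the following. This is only a sketch: the helper name read_rows and its candidates parameter are made up for illustration, and it works on any seekable text stream (for a real file, open it with 'r' and newline="" as the comment in the earlier snippet suggests).

```python
import csv
import io


def read_rows(text_stream, candidates=";,"):
    """Sniff the delimiter from the start of a text stream, then parse all rows.

    read_rows and candidates are hypothetical names, not part of the csv module.
    """
    sample = text_stream.read(1024)          # text, not bytes, on Python 3
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    text_stream.seek(0)                      # rewind so the reader sees every row
    return list(csv.reader(text_stream, dialect))


if __name__ == "__main__":
    # In-memory stand-ins for comma_separated.csv and semicolon_separated.csv
    print(read_rows(io.StringIO("test,box,foo\nround,the,bend\n")))
    print(read_rows(io.StringIO("round;the;bend\nwho;are;you\n")))
```

Note that csv.Sniffer().sniff raises csv.Error when it cannot settle on a delimiter, so callers may want to catch that and fall back to a default.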
Answer by twalberg
I don't think there can be a perfectly general solution to this (one of the reasons I might use , as a delimiter is that some of my data fields need to be able to include ;). A simple heuristic would be to read the first line (or more), count how many , and ; characters it contains (possibly ignoring those inside quotes, if whatever creates your .csv files quotes entries properly and consistently), and guess that the more frequent of the two is the right delimiter.
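The counting heuristic described in this answer could be sketched as follows in Python 3. The function name guess_delimiter and its parameters are illustrative assumptions, and the quote handling is deliberately simple: it only toggles on the quote character and assumes quotes are balanced.

```python
def guess_delimiter(line, candidates=(",", ";"), quotechar='"'):
    """Return the candidate that occurs most often outside quoted sections.

    Returns None if no candidate appears outside quotes. On a tie, the
    candidate listed first in `candidates` wins.
    """
    counts = dict.fromkeys(candidates, 0)
    in_quotes = False
    for ch in line:
        if ch == quotechar:
            # A doubled quote ("") toggles twice, so the state is unchanged
            in_quotes = not in_quotes
        elif not in_quotes and ch in counts:
            counts[ch] += 1
    best = max(candidates, key=lambda c: counts[c])
    return best if counts[best] > 0 else None


if __name__ == "__main__":
    print(guess_delimiter('a;b;"c,d,e,f";g'))
    print(guess_delimiter("test,box,foo"))
```

Like any heuristic of this kind, it can guess wrong on a line whose unquoted fields contain the other character more often than the real delimiter.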
Answer by Andrew Basile
Given a project that deals with both , (comma) and | (vertical bar) delimited CSV files, which are well formed, I tried the following (as given at https://docs.python.org/2/library/csv.html#csv.Sniffer):
dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',|')
However, on a |-delimited file, the "Could not determine delimiter" exception was raised. It seemed reasonable to speculate that the sniff heuristic works best if each line has the same number of delimiters (not counting whatever might be enclosed in quotes). So, instead of reading the first 1024 bytes of the file, I tried reading the first two lines in their entirety:
temp_lines = csvfile.readline() + '\n' + csvfile.readline()
dialect = csv.Sniffer().sniff(temp_lines, delimiters=',|')
So far, this is working well for me.
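A Python 3 sketch of the same idea, sniffing only whole lines, might look like this. The helper name sniff_first_lines, the n_lines parameter, and the fallback to a default delimiter are assumptions added for illustration; they are not part of this answer or of the csv module.

```python
import csv
import io


def sniff_first_lines(text_stream, candidates=",|", n_lines=2, default=","):
    """Sniff the delimiter from the first n_lines complete lines of a stream."""
    # readline() keeps the trailing newline, so the lines can be joined directly
    sample = "".join(text_stream.readline() for _ in range(n_lines))
    text_stream.seek(0)  # rewind so the caller can parse from the start
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Raised when the Sniffer cannot determine a delimiter; the fallback
        # value is an assumption of this sketch
        return default


if __name__ == "__main__":
    print(sniff_first_lines(io.StringIO("a|b|c\n1|2|3\n4|5|6\n")))
```

The returned delimiter can then be passed to csv.reader(text_stream, delimiter=...).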
Answer by Vladir Parrado Cruz
If you're using DictReader, you can do this:
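#!/usr/bin/env python
import csv

def parse(filename):
    # Text mode with newline='' is what the csv module expects on Python 3;
    # with 'rb' the Sniffer would receive bytes and fail
    with open(filename, 'r', newline='') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.DictReader(csvfile, dialect=dialect)
        for line in reader:
            print(line['ReleveAnnee'])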
I used this with Python 3.5 and it worked.

