Original question: http://stackoverflow.com/questions/3952132/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverFlow
How do you dynamically identify unknown delimiters in a data file?
Asked by Greg Gauthier
I have three input data files. Each uses a different delimiter for the data contained therein. Data file one looks like this:
apples | bananas | oranges | grapes
Data file two looks like this:
quarter, dime, nickel, penny
Data file three looks like this:
horse cow pig chicken goat
(the change in the number of columns is also intentional)
The thought I had was to count the number of non-alpha characters, and presume that the highest count was the separator character. However, the files with non-space separators also have spaces before and after the separators, so the spaces win on all three files. Here's my code:
def count_chars(s):
    # Count how often each candidate separator occurs in s.
    valid_seps = [' ', '|', ',', ';', '\t']
    cnt = {}
    for c in s:
        if c in valid_seps:
            cnt[c] = cnt.get(c, 0) + 1
    return cnt

infile = 'pipe.txt'  # or 'comma.txt' or 'space.txt'
records = open(infile, 'r').read()
print(count_chars(records))
It will print a dictionary with the counts of all the acceptable characters. In each case, the space always wins, so I can't rely on that to tell me what the separator is.
But I can't think of a better way to do this.
Any suggestions?
Accepted answer by JoshD
If you're using Python, I'd suggest just calling re.split on the line with all valid expected separators:
>>> import re
>>> l = "big long list of space separated words"
>>> re.split(r'[ ,|;"]+', l)
['big', 'long', 'list', 'of', 'space', 'separated', 'words']
The only issue would be if one of the files used a separator as part of the data.
If you must identify the separator, your best bet is to count every candidate character except the space. If almost none of them occur, the separator is probably the space; otherwise, it is the candidate with the highest count.
Unfortunately, there's really no way to be sure. You may have space separated data filled with commas, or you may have | separated data filled with semicolons. It may not always work.
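For what it's worth, the counting heuristic described above could be sketched roughly like this. The candidate list is borrowed from the question's valid_seps, and the function name guess_separator and the min_count threshold are my own assumptions, not something from the original answer. It is a guess, not a guarantee, for the reasons already given.

```python
from collections import Counter

# Candidate separators other than the space (assumption: same set as the
# question's valid_seps, minus ' ').
CANDIDATES = "|,;\t"


def guess_separator(text, min_count=1):
    """Return the most frequent non-space candidate in text.

    Falls back to a space when no candidate occurs at least min_count times.
    """
    counts = Counter(c for c in text if c in CANDIDATES)
    if not counts:
        return " "
    sep, count = counts.most_common(1)[0]
    return sep if count >= min_count else " "


for sample in ("apples | bananas | oranges | grapes",
               "quarter, dime, nickel, penny",
               "horse cow pig chicken goat"):
    sep = guess_separator(sample)
    # Strip the padding that surrounds the non-space separators.
    fields = [field.strip() for field in sample.split(sep)]
    print(repr(sep), fields)
```

Raising min_count makes the function more reluctant to pick a non-space separator, which may help when space-separated data contains the occasional stray comma.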
Answer by eumiro
How about trying the Sniffer class from Python's standard csv module: http://docs.python.org/library/csv.html#csv.Sniffer
import csv
sniffer = csv.Sniffer()
dialect = sniffer.sniff('quarter, dime, nickel, penny')
print(dialect.delimiter)
# dialect.delimiter is ','
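When the separators are padded with spaces, the sniffer may settle on the space instead of the real delimiter. One possible workaround, sketched here under my own assumptions, is to pass the optional delimiters argument of sniff() so that the space is not a candidate, and to treat "could not decide" as meaning space-separated. The helper name sniff_delimiter and the candidate string are hypothetical.

```python
import csv


def sniff_delimiter(sample, candidates="|,;\t"):
    """Guess the delimiter with csv.Sniffer, considering only `candidates`.

    Assumption: if the sniffer cannot decide (csv.Error), the data is
    treated as space-separated.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return " "


for sample in ("apples | bananas | oranges | grapes",
               "quarter, dime, nickel, penny",
               "horse cow pig chicken goat"):
    print(repr(sniff_delimiter(sample)))
```

This still shares the weakness of any guess: data that happens to contain one of the candidate characters can mislead it.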
Answer by Greg Gauthier
I ended up going with the regex, because of the problem of spaces. Here's my finished code, in case anyone's interested, or could use anything else in it. On a tangential note, it would be neat to find a way to dynamically identify column order, but I realize that's a bit more tricky. In the meantime, I'm falling back on old tricks to sort that out.
import glob
import os
import re

# Fragment of a method: self._input_dir and self._file_mask are set elsewhere.
for infile in glob.glob(os.path.join(self._input_dir, self._file_mask)):
    # couldn't quite figure out a way to make this a single block
    # (rather than three separate if/elifs). But you can see the split is
    # generalized already, so if anyone can come up with a better way,
    # I'm all ears!! :)
    for row in open(infile, 'r').readlines():
        if infile.find('comma') > -1:
            datefmt = "%m/%d/%Y"
            last, first, gender, color, dobraw = \
                [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
        elif infile.find('space') > -1:
            datefmt = "%m-%d-%Y"
            last, first, unused, gender, dobraw, color = \
                [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
        elif infile.find('pipe') > -1:
            datefmt = "%m-%d-%Y"
            last, first, unused, gender, color, dobraw = \
                [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
            # There is also a way to do this with csv.Sniffer, but the
            # spaces around the pipe delimiter also confuse sniffer, so
            # I couldn't use it.
        else:
            raise ValueError(infile + " is not an acceptable input file.")
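The three branches in the code above differ only in the date format and the order of the fields, so one way to fold them into a single block might be a lookup table keyed on the file-name keyword. This is only a sketch: LAYOUTS, SPLIT_RE and parse_row are hypothetical names, and unlike the original it strips the row before splitting and checks the field count.

```python
import re

# Hypothetical lookup table: file-name keyword -> (date format, field order).
# The field orders mirror the three if/elif branches.
LAYOUTS = {
    "comma": ("%m/%d/%Y", ("last", "first", "gender", "color", "dobraw")),
    "space": ("%m-%d-%Y", ("last", "first", "unused", "gender", "dobraw", "color")),
    "pipe": ("%m-%d-%Y", ("last", "first", "unused", "gender", "color", "dobraw")),
}

# Same character class as the re.split calls in the original code.
SPLIT_RE = re.compile(r'[ ,|;"\t]+')


def parse_row(infile, row):
    """Split one row and label its fields according to the file's layout."""
    for keyword, (datefmt, names) in LAYOUTS.items():
        if keyword in infile:
            # Strip first so a trailing newline or space adds no empty field.
            values = [x.strip() for x in SPLIT_RE.split(row.strip())]
            if len(values) != len(names):
                raise ValueError("unexpected field count in " + infile)
            return datefmt, dict(zip(names, values))
    raise ValueError(infile + " is not an acceptable input file.")
```

Supporting a fourth file type would then mean adding one entry to the table rather than another elif branch.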

