Source: http://stackoverflow.com/questions/15026698/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?
Asked by Roman
I need to create a data frame by reading in data from a file, using the read_csv method. However, the separators are not very regular: some columns are separated by tabs (\t), others are separated by spaces. Moreover, some columns can be separated by 2 or 3 or more spaces, or even by a combination of spaces and tabs (for example 3 spaces, two tabs and then 1 space).
Is there a way to tell pandas to treat these files properly?
By the way, I do not have this problem if I use plain Python. I use:
for line in open(file_name):
    fld = line.split()  # no argument: split on any run of whitespace
And it works perfectly. It does not care if there are 2 or 3 spaces between the fields. Even combinations of spaces and tabs do not cause any problem. Can pandas do the same?
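For reference, here is a minimal sketch of the plain-Python behaviour described in the question. The sample line is made up for illustration (3 spaces, two tabs and 1 space between the first two fields); it is not taken from the asker's file.

```python
# Hypothetical sample line: fields separated by mixed runs of spaces and tabs.
line = "alpha   \t\t beta\tgamma  delta\n"

# str.split() with no argument splits on any run of whitespace
# and ignores leading/trailing whitespace, including the newline.
fields = line.split()
print(fields)
```

This is the behaviour the answers below try to reproduce with read_csv.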
Accepted answer by DSM
From the documentation, you can use either a regex or delim_whitespace:
>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print(repr(line))
...
'a\t b\tc 1 2\n'
'd\t e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
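The session above can be reproduced without a file on disk. The following is a rough sketch, assuming an in-memory string as a stand-in for whitespace.csv (the exact spacing in it is invented). It uses only the regex form, since, as far as I know, pandas 2.2 deprecated delim_whitespace in favour of sep=r"\s+".

```python
import io

import pandas as pd

# Hypothetical stand-in for whitespace.csv: tabs and runs of spaces are mixed.
raw = "a\t  b\tc 1   2\nd\t  e\tf 3 \t 4\n"

# r"\s+" treats any run of whitespace (spaces and/or tabs) as one separator.
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+")
print(df)
```

The two numeric columns are parsed as integers, while the first three stay strings.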
Answer by Peaceful
>>> pd.read_csv("whitespace.csv", header=None, sep=r"\s+|\t+|\s+\t+|\t+\s+")
would use any combination of any number of spaces and tabs as the separator.
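Since \s already matches tabs, the extra alternatives in this pattern should not change the result compared with a bare \s+. Here is a small check using Python's re module on a made-up line. It only exercises the regex itself, not pandas, so treat the equivalence inside read_csv as an expectation rather than something demonstrated here.

```python
import re

# Hypothetical line with mixed runs of spaces and tabs.
line = "a\t  b\tc 1 \t 2"

long_pattern = r"\s+|\t+|\s+\t+|\t+\s+"
short_pattern = r"\s+"

# The first alternative, \s+, greedily consumes the whole whitespace run,
# so the later alternatives never get to match anything different.
fields_long = re.split(long_pattern, line)
fields_short = re.split(short_pattern, line)
print(fields_long)
print(fields_short)
```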
Answer by yoonghm
We may consider this pattern, which handles every combination of spaces and tabs (zero or more of each) around a comma:
pd.read_csv("whitespace.csv", header = None, sep = "[ \t]*,[ \t]*")
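Note that this pattern requires a comma in every separator, so it suits comma-separated data with stray padding rather than the purely whitespace-separated file from the question. Below is a minimal sketch with invented sample data. Passing engine="python" is my addition: regex separators are handled by the Python parsing engine, and naming it explicitly avoids a fallback warning.

```python
import io

import pandas as pd

# Hypothetical comma-separated data with spaces and tabs around the commas.
raw = "a , b\t,c\nd,  e ,\tf\n"

# The separator is a comma plus any surrounding spaces or tabs.
df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    sep=r"[ \t]*,[ \t]*",
    engine="python",
)
print(df)
```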
Answer by Gerben
Pandas has two CSV readers; only one is flexible regarding redundant leading whitespace:
pd.read_csv("whitespace.csv", skipinitialspace=True)
while the other is not:
pd.DataFrame.from_csv("whitespace.csv")
Neither is flexible out of the box regarding trailing whitespace; see the answers with regular expressions. Avoid delim_whitespace, as it also accepts plain spaces (without , or \t) as separators.
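To illustrate the leading versus trailing whitespace point, here is a small sketch on made-up data. The column names and values are hypothetical, and the final .str.strip() call is one possible cleanup I am suggesting, not something from the answer.

```python
import io

import pandas as pd

# Hypothetical data: the second field has both leading and trailing spaces.
raw = "name,value\nfoo,   bar  \n"

# Default behaviour: whitespace after the delimiter is kept in the field.
df_default = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops spaces right after the delimiter,
# but trailing spaces in the field are still kept.
df_skip = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(repr(df_default.loc[0, "value"]))
print(repr(df_skip.loc[0, "value"]))

# One way to remove the remaining trailing whitespace after loading.
cleaned = df_skip["value"].str.strip()
print(repr(cleaned[0]))
```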

