Python: how to make the separator in pandas read_csv more flexible with respect to whitespace, for irregular separators?

Question

Asked by Roman

I need to create a data frame by reading in data from a file, using the read_csv method. However, the separators are not very regular: some columns are separated by tabs (\t), others are separated by spaces. Moreover, some columns can be separated by 2 or 3 or more spaces, or even by a combination of spaces and tabs (for example 3 spaces, two tabs and then 1 space).

Is there a way to tell pandas to treat these files properly?

By the way, I do not have this problem if I use plain Python. I use:

with open(file_name) as f:
    for line in f:
        fld = line.split()

And it works perfectly. It does not care if there are 2 or 3 spaces between the fields. Even combinations of spaces and tabs do not cause any problem. Can pandas do the same?

Answer 1

Accepted answer by DSM

From the documentation, you can use either a regex or delim_whitespace (note that delim_whitespace is deprecated since pandas 2.2 in favor of sep=r"\s+"):

>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print(repr(line))
...     
'a\t  b\tc 1 2\n'
'd\t  e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
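For readers who want to try this without creating a file, here is a minimal sketch of the regex approach. It assumes the same two lines of data held in a string and wrapped in io.StringIO; the variable names raw and df are only illustrative.

```python
import io

import pandas as pd

# Illustrative sample: fields separated by a mix of tabs and spaces,
# mirroring the contents of whitespace.csv shown above.
raw = "a\t  b\tc 1 2\nd\t  e\tf 3 4\n"

# sep=r"\s+" treats any run of whitespace (spaces, tabs, or both)
# as a single separator.
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+")

print(df.shape)
print(df)
```

The raw string prefix (r"...") keeps Python from interpreting \s as a string escape before the pattern reaches pandas.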

Answer 2

Answered by Peaceful

>>> pd.read_csv("whitespace.csv", header=None, sep=r"\s+|\t+|\s+\t+|\t+\s+")

would use any combination of any number of spaces and tabs as the separator. Since \s already matches tabs, the pattern is equivalent to the shorter r"\s+".
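One way to see what such a pattern matches is to try it with Python's re module on a single line. This is only a sketch: re.split is used here as a stand-in for regex-based splitting and is not pandas' internal code, and the sample line is made up.

```python
import re

long_pattern = r"\s+|\t+|\s+\t+|\t+\s+"
short_pattern = r"\s+"

# Made-up line: "x" and "y" are separated by 3 spaces, two tabs and
# then 1 space; the other fields by a tab and by two spaces.
line = "x   \t\t y\tz  w"

fields_long = re.split(long_pattern, line)
fields_short = re.split(short_pattern, line)

print(fields_long)
print(fields_short)
```

Both patterns split the line into the same four fields, because the first alternative \s+ already consumes any run of spaces and tabs.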

Answer 3

Answered by yoonghm

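For comma-separated data, we may consider the following pattern, which handles every combination of spaces and tabs around each comma, including zero occurrences. Note that it requires a comma, so it will not split fields separated by whitespace alone.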

pd.read_csv("whitespace.csv", header = None, sep = "[ \t]*,[ \t]*")
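A small sketch of this pattern on in-memory data follows. The sample text and column names are invented for illustration, and engine="python" is passed explicitly on the assumption that a general regex separator is handled by the Python parsing engine rather than the C one.

```python
import io

import pandas as pd

# Invented sample: commas padded with varying spaces and tabs.
raw = "name , score\nalice ,\t10\nbob,  20\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=r"[ \t]*,[ \t]*",  # a comma plus any surrounding spaces/tabs
    engine="python",       # request the regex-capable engine explicitly
)

print(list(df.columns))
print(df)
```

Because the whitespace around each comma is part of the separator, it is removed from both the header names and the values.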

Answer 4

Answered by Gerben

Pandas has two csv readers, and only one is flexible regarding redundant leading whitespace:

pd.read_csv("whitespace.csv", skipinitialspace=True)

while the other is not (note that DataFrame.from_csv was deprecated in pandas 0.21 and removed in 1.0):

pd.DataFrame.from_csv("whitespace.csv")

Neither is flexible out of the box regarding trailing whitespace; see the answers that use regular expressions. Avoid delim_whitespace for comma- or tab-delimited files, as it also treats bare spaces (without , or \t) as separators.
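The following sketch illustrates skipinitialspace on invented comma-separated data. Since trailing whitespace may remain in string fields, it is stripped afterwards with the string accessor; this post-processing step is an assumption about what the reader wants and is not part of the original answer.

```python
import io

import pandas as pd

# Invented sample: extra spaces after each comma, and trailing
# spaces after "Oslo".
raw = "name,   city\nalice,   Oslo  \nbob, Rome\n"

# skipinitialspace=True drops spaces that follow the delimiter.
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Strip any whitespace still attached to the string values.
df["city"] = df["city"].str.strip()

print(list(df.columns))
print(df["city"].tolist())
```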
