Python: reading space-separated data with Pandas

Notice: this page is a side-by-side Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/22809061/

Date: 2020-08-19 01:44:41  Source: igfitidea

Read Space-separated Data with Pandas

python, pandas

Asked by Tengis

I used to read my data with numpy.loadtxt(). However, I recently found out on SO that pandas.read_csv() is much faster.

To read these data I use:

pd.read_csv(filename, sep=' ',header=None)

The problem I am running into is that, in my case, the separator can vary from a single space to x spaces or even a tab.

Here is what my data could look like:

56.00     101.85 52.40 101.85 56.000000 101.850000 1
56.00 100.74 50.60 100.74 56.000000 100.740000 2
56.00 100.74 52.10 100.74 56.000000 100.740000 3
56.00 102.96 52.40 102.96 56.000000 102.960000 4
56.00 100.74 55.40 100.74 56.000000 100.740000 5

That leads to results like:

     0       1     2       3     4       5   6       7   8
0   56     NaN   NaN  101.85  52.4  101.85  56  101.85   1
1   56  100.74  50.6  100.74  56.0  100.74   2     NaN NaN
2   56  100.74  52.1  100.74  56.0  100.74   3     NaN NaN
3   56  102.96  52.4  102.96  56.0  102.96   4     NaN NaN
4   56  100.74  55.4  100.74  56.0  100.74   5     NaN NaN

I should mention that my data files are larger than 100 MB, so I cannot preprocess or clean the data first. Any ideas on how to fix this?

Accepted answer by EdChum

Your original line:

pd.read_csv(filename, sep=' ',header=None)

specifies the separator as a single space. Because your CSVs can contain spaces or tabs, you can pass a regular expression to the sep param like so:

pd.read_csv(filename, sep=r'\s+', header=None)

This defines the separator as one or more whitespace characters (spaces or tabs). There is a handy cheatsheet that lists regular expressions.
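
To make the effect of the regular-expression separator concrete, here is a minimal sketch that parses a few rows from the question. It reads from an in-memory string via io.StringIO instead of a file, and it swaps one space for a tab in the second row. Both are assumptions made for illustration only and are not part of the original data or code.

```python
import io

import pandas as pd

# Sample rows taken from the question. The tab in the second row is an
# assumption added here to mimic a file with mixed separators.
raw = (
    "56.00     101.85 52.40 101.85 56.000000 101.850000 1\n"
    "56.00\t100.74 50.60 100.74 56.000000 100.740000 2\n"
    "56.00 100.74 52.10 100.74 56.000000 100.740000 3\n"
)

# sep=r"\s+" treats any run of spaces and/or tabs as a single separator,
# so no empty (NaN) columns are created between the values.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

print(df.shape)  # (3, 7)
print(df)
```

With a real file, the io.StringIO(raw) argument would simply be replaced by the filename, as in the answer's one-liner.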
