Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original question on Stack Overflow: http://stackoverflow.com/questions/26551662/

import text to pandas with multiple delimiters

Tags: python, import, pandas, delimited-text

Asked by CastleH

I have some data that looks like this:

c stuff
c more header
c begin data         
 1 1:.5
 1 2:6.5
 1 3:5.3

I want to import it into a 3 column data frame, with columns e.g.

a , b, c
1,  1, 0.5
etc

I have been trying to read in the data as 2 columns split on ':', and then to split the first column on ' '. However I'm finding it irksome. Is there a better way to sort it out on import directly?

currently:

import pandas as pd
data1 = pd.read_csv(file_loc, skiprows=3, delimiter=':', names=['AB', 'C'])
# split 'AB' on the first space; note that DataFrame takes columns=, not names=
data2 = pd.DataFrame(data1.AB.str.split(' ', n=1).tolist(), columns=['A', 'B'])

However this is further complicated by the fact my data has a leading space...

I feel like this should be a simple task, but currently I'm thinking of reading it line by line and using some find replace to sanitise the data before importing.
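
For reference, the line-by-line clean-up idea could look roughly like the sketch below. It is only an illustration: the `raw` string stands in for the real file, and the `sanitise` helper and its `n_header` parameter are hypothetical names, not part of pandas.

```python
import io
import re

import pandas as pd

# Hypothetical in-memory copy of the sample data shown above.
raw = (
    "c stuff\n"
    "c more header\n"
    "c begin data\n"
    " 1 1:.5\n"
    " 1 2:6.5\n"
    " 1 3:5.3\n"
)


def sanitise(text, n_header=3):
    """Drop the header lines, trim each row, and replace ' ' / ':' with commas."""
    rows = text.splitlines()[n_header:]
    cleaned = (re.sub(r"[ :]+", ",", row.strip()) for row in rows if row.strip())
    return "\n".join(cleaned)


# The cleaned text is ordinary comma-separated data, so the default parser works.
df = pd.read_csv(io.StringIO(sanitise(raw)), names=["a", "b", "c"])
print(df)
```

This works, but it means holding and rewriting the whole file before pandas ever sees it, which is the extra step I would like to avoid.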

Accepted answer by DSM

One way might be to use the regex separators permitted by the python engine. For example:

>>> !cat castle.dat
c stuff
c more header
c begin data         
 1 1:.5
 1 2:6.5
 1 3:5.3
>>> df = pd.read_csv('castle.dat', skiprows=3, names=['a', 'b', 'c'], 
                     sep=' |:', engine='python')
>>> df
   a  b    c
0  1  1  0.5
1  1  2  6.5
2  1  3  5.3
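
The same idea can be tried without a file on disk. The sketch below is a variation rather than the exact call above: it feeds the sample through `io.StringIO` (an assumption made so the snippet is self-contained) and uses the pattern `\s+|:` instead of `' |:'`, on the assumption that the real data might contain runs of several spaces, which a single-space pattern would split into empty fields.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for castle.dat.
castle = io.StringIO(
    "c stuff\n"
    "c more header\n"
    "c begin data\n"
    " 1 1:.5\n"
    " 1 2:6.5\n"
    " 1 3:5.3\n"
)

# A multi-character separator is treated as a regular expression,
# which requires the python parsing engine.
df = pd.read_csv(
    castle,
    skiprows=3,               # skip the three 'c ...' header lines
    names=["a", "b", "c"],
    sep=r"\s+|:",             # split on any run of whitespace or on a colon
    engine="python",
)
print(df)
```

The python engine is slower than the default C engine, so for very large files it may be worth comparing this against cleaning the text first.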