python pandas read_csv delimiter in column data

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/30898935/

Time: 2020-08-19 09:08:05  Source: igfitidea

python csv python-3.x pandas delimiter

Asked by Thomas Pazur

I have a CSV file like this:

12012;My Name is Mike. What is your's?;3;0 
1522;In my opinion: It's cool; or at least not bad;4;0
21427;Hello. I like this feature!;5;1

I want to get this data into a pandas.DataFrame. But read_csv(sep=";") throws an exception because of the semicolon in the user-generated message column in line 2 (In my opinion: It's cool; or at least not bad). All remaining columns always have numeric dtypes.

What is the most convenient way to handle this?
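To see where the parse goes wrong, it can help to count how many fields each line splits into. The following is a minimal sketch using only the standard library; the in-memory `sample` string stands in for the real file and is an assumption made for illustration.

```python
import csv
import io

# In-memory stand-in for the file from the question (assumption for illustration).
sample = (
    "12012;My Name is Mike. What is your's?;3;0\n"
    "1522;In my opinion: It's cool; or at least not bad;4;0\n"
    "21427;Hello. I like this feature!;5;1\n"
)

# Count how many fields each line yields when split on ";".
field_counts = [len(row) for row in csv.reader(io.StringIO(sample), delimiter=";")]
print(field_counts)
```

The second line yields one field more than the others, and that inconsistency is what a strict tabular parser cannot reconcile.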

Answered by DSM

Dealing with unquoted delimiters is always a nuisance. In this case, since the broken text is known to be surrounded by three correctly encoded columns (one before it and two after), we can recover. To be honest, I'd just use the standard Python csv reader and build a DataFrame from the result:

import csv
import pandas as pd

with open("semi.dat", "r", newline="") as fp:
    reader = csv.reader(fp, delimiter=";")
    # Keep the first column and the last two; re-join whatever was split in between.
    rows = [x[:1] + [';'.join(x[1:-2])] + x[-2:] for x in reader]
    df = pd.DataFrame(rows)

which produces:

       0                                              1  2  3
0  12012               My Name is Mike. What is your's?  3  0
1   1522  In my opinion: It's cool; or at least not bad  4  0
2  21427                    Hello. I like this feature!  5  1
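The slicing step can be looked at in isolation. The sketch below is not part of the original answer: `repair_row` is a hypothetical helper name, and the in-memory `sample` string is assumed in place of the file. It also converts the three numeric fields with `int()`, since `csv.reader` returns every field as a string.

```python
import csv
import io

# In-memory stand-in for the file (assumption for illustration).
sample = (
    "12012;My Name is Mike. What is your's?;3;0\n"
    "1522;In my opinion: It's cool; or at least not bad;4;0\n"
    "21427;Hello. I like this feature!;5;1\n"
)

def repair_row(fields):
    """Re-join the over-split message column of one parsed row.

    Hypothetical helper. Assumes one leading id column and two trailing
    numeric columns; everything in between belongs to the message.
    """
    message = ";".join(fields[1:-2])
    return [int(fields[0]), message, int(fields[-2]), int(fields[-1])]

rows = [repair_row(r) for r in csv.reader(io.StringIO(sample), delimiter=";")]
for row in rows:
    print(row)
```

This only works because the number of well-formed columns on each side of the message is fixed. With two free-text columns there would be no way to tell where one ends and the other begins.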

Then we can immediately save it and get a correctly quoted file:

In [67]: df.to_csv("fixedsemi.dat", sep=";", header=None, index=False)

In [68]: more fixedsemi.dat
12012;My Name is Mike. What is your's?;3;0
1522;"In my opinion: It's cool; or at least not bad";4;0
21427;Hello. I like this feature!;5;1

In [69]: df2 = pd.read_csv("fixedsemi.dat", sep=";", header=None)

In [70]: df2
Out[70]: 
       0                                              1  2  3
0  12012               My Name is Mike. What is your's?  3  0
1   1522  In my opinion: It's cool; or at least not bad  4  0
2  21427                    Hello. I like this feature!  5  1
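The quoting seen in fixedsemi.dat matches what the standard csv module does under minimal quoting, which is also the default quoting mode of `to_csv`. As a rough illustration without pandas, the sketch below writes the repaired rows to an in-memory buffer (an assumption, in place of a real file) and reads them back.

```python
import csv
import io

# Repaired rows as produced by the re-joining step (all fields are strings).
rows = [
    ["12012", "My Name is Mike. What is your's?", "3", "0"],
    ["1522", "In my opinion: It's cool; or at least not bad", "4", "0"],
    ["21427", "Hello. I like this feature!", "5", "1"],
]

# Write with minimal quoting: only fields containing the delimiter,
# the quote character, or a line break get wrapped in quotes.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerows(rows)
written = buffer.getvalue()
print(written)

# Reading the quoted text back yields four fields per row again.
reread = list(csv.reader(io.StringIO(written), delimiter=";"))
```

Only the message containing a semicolon is quoted, so the file stays readable while parsing unambiguously.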