Pandas pd.read_csv does not work for simple sep=','

Question

Asked by Kakalukia

Good afternoon, everybody.

I know this is quite an easy question, but I simply do not understand why it does not work the way I expected.

The task is as follows:

I have a file data.csv presented in this format:

id,"feature_1","feature_2","feature_3"
00100429,"PROTO","Proprietary","Phone"
00100429,"PROTO","Proprietary","Phone"

The goal is to import this data using pandas. I know that pandas read_csv uses a comma separator by default, so I simply imported it as follows:

data = pd.read_csv('data.csv')

The result was the same as the raw text I showed at the beginning, with no change at all: a single column containing everything.

I tried many other separators using regex, and the only one that brought any improvement was:

data = pd.read_csv('data.csv',sep="\,",engine='python')

On the one hand, it finally separated all the columns; on the other hand, the data is not presented in a form that is convenient to use. In particular:

"id         ""feature_1""   ""feature_2""   ""feature_3"""
"00100429   ""PROTO""       ""Proprietary"" ""Phone"""

Therefore, I think there must be a mistake somewhere, because the data seems to be fine.

So the question is: how do I import a csv file with separated columns and no triple quote symbols?

Thank you.

Answer 1

Accepted answer by dataLeo

Here's my quick solution for your problem:

import numpy as np
import pandas as pd

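### Read the file with header=None so the header line becomes row 0, then strip all double quotes.
### A raw string avoids the invalid "\," escape; a regex separator needs engine='python' anyway.
df = pd.read_csv('file.csv', sep=r'\,', header=None, engine='python').apply(lambda x: x.str.replace('"', '', regex=False))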
df

    0              1           2       3
0   id      feature_1   feature_2   feature_3
1   00100429    PROTO   Proprietary Phone
2   00100429    PROTO   Proprietary Phone

### Putting column names back and dropping the first row.
df.columns = df.iloc[0]
df.drop(index=0, inplace=True)
df

## You can reset the index 
        id  feature_1   feature_2   feature_3
1   00100429    PROTO   Proprietary Phone
2   00100429    PROTO   Proprietary Phone

### Convert the `id` column back to `int` (change according to your needs).
### np.int was removed in NumPy 1.24, so the builtin int is used instead.

df['id'] = df['id'].astype(int)
df['id'].dtype

dtype('int64')

Answer 2

Answer by Karn Kumar

It should work without any issue using the default sep, unless there is something really wrong with the CSV file you have. Simulating your data sample, it works fine for me:

As per your data sample, you don't need to escape the comma with \ for comma-delimited values.

>>> import pandas as pd
>>> data = pd.read_csv("sample.csv", sep=",")
>>> data
       id feature_1    feature_2 feature_3
0  100429     PROTO  Proprietary     Phone
1  100429     PROTO  Proprietary     Phone
>>> pd.__version__
'0.23.3'
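To reproduce this without a file on disk, here is a minimal sketch that feeds the sample rows from the question through io.StringIO. It assumes the real data.csv contains exactly these lines. The dtype={"id": str} argument is not part of the answer above; it is included only to show one way of keeping the leading zeros that the default integer inference drops.

```python
import io
import pandas as pd

# The three sample lines from the question, held in memory instead of a file.
sample = (
    'id,"feature_1","feature_2","feature_3"\n'
    '00100429,"PROTO","Proprietary","Phone"\n'
    '00100429,"PROTO","Proprietary","Phone"\n'
)

# Default settings: comma separator, double quote as the quote character.
data = pd.read_csv(io.StringIO(sample))

# Same read, but keep `id` as text so the leading zeros survive.
data_str = pd.read_csv(io.StringIO(sample), dtype={"id": str})

print(data.columns.tolist())
print(data["id"].tolist())
print(data_str["id"].tolist())
```

If this works on your machine but the real file still loads as a single column, the file on disk probably differs from the sample shown in the question.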

The problem here, as I noticed, is sep="\,".

Alternatively, try:

Here skipinitialspace=True "deals with the spaces after the comma-delimiter".
quotechar='"': string (length 1). The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.

So, in that case, it is worth trying:

>>> data1 = pd.read_csv("sample.csv", skipinitialspace = True, quotechar = '"')
>>> data1
       id feature_1    feature_2 feature_3
0  100429     PROTO  Proprietary     Phone
1  100429     PROTO  Proprietary     Phone
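The effect of skipinitialspace is easier to see on data that actually has a space after each comma. The sketch below uses a hypothetical variant of the sample with such spaces (the file in the question may not have them) and compares the two reads.

```python
import io
import pandas as pd

# Hypothetical variant of the sample: a space follows every comma.
spaced = (
    'id, "feature_1", "feature_2", "feature_3"\n'
    '00100429, "PROTO", "Proprietary", "Phone"\n'
)

# Without skipinitialspace, each field starts with a space, so the
# quote character is not at the start of the field.
plain = pd.read_csv(io.StringIO(spaced))

# With skipinitialspace=True the leading space is dropped first and
# the quotes are then handled as usual.
clean = pd.read_csv(io.StringIO(spaced), skipinitialspace=True, quotechar='"')

print(plain.columns.tolist())
print(clean.columns.tolist())
print(clean.iloc[0].tolist())
```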

Note from the pandas docs:

Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions, will force use of the python parsing engine and will ignore quotes in the data.
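The note above can be illustrated with a small sketch. It reuses the sample rows from the question in memory (an assumption about the file content) and reads them with the two-character separator, so that the quote characters end up inside the column names and values.

```python
import io
import pandas as pd

sample = (
    'id,"feature_1","feature_2","feature_3"\n'
    '00100429,"PROTO","Proprietary","Phone"\n'
    '00100429,"PROTO","Proprietary","Phone"\n'
)

# A separator longer than one character is treated as a regular expression,
# so each line is split on commas with no quote handling at all.
raw = pd.read_csv(io.StringIO(sample), sep=r"\,", engine="python")

print(raw.columns.tolist())
print(raw.iloc[0].tolist())
```

This is why the quote-stripping step becomes necessary once a regex separator is used, while the plain sep="," read does not need it.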

Answer 3

Answer by Shadab Hussain

Here's just an alternative to dataLeo's answer:

import pandas as pd
import numpy as np

Read the file into a DataFrame, then remove all the double quotes from the row values:

# dtype=str keeps every column as text, so .str works on all of them
df = pd.read_csv("file.csv", sep=r"\,", engine="python", dtype=str).apply(lambda x: x.str.replace('"', '', regex=False))
df

    "id"   "feature_1"  "feature_2" "feature_3"
0   00100429    PROTO   Proprietary Phone
1   00100429    PROTO   Proprietary Phone

Remove all the double quotes from the column names:

df.columns = df.columns.str.replace('"', '')
df

      id    feature_1   feature_2   feature_3
0   00100429    PROTO   Proprietary Phone
1   00100429    PROTO   Proprietary Phone

Convert the `id` column back to `int` (change according to your needs):

df.id = df.id.astype('int')
np.result_type(df.id)

dtype('int32')
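As a possible alternative to stripping quotes after the fact, the sketch below handles one guess at the root cause. The output in the question (every row wrapped in one pair of quotes, with the inner quotes doubled) would be consistent with a file in which each whole line is itself a single quoted CSV field. That is an assumption, not something the question confirms, and the `raw` string below is a hypothetical reconstruction of such a file. Under that assumption, a first pass with the standard csv module unwraps each line, and a second pass parses the result as ordinary CSV.

```python
import csv
import io
import pandas as pd

# Hypothetical file content: each line is one quoted field with doubled inner quotes.
raw = (
    '"id,""feature_1"",""feature_2"",""feature_3"""\n'
    '"00100429,""PROTO"",""Proprietary"",""Phone"""\n'
    '"00100429,""PROTO"",""Proprietary"",""Phone"""\n'
)

# First pass: csv.reader removes the outer quotes and un-doubles the inner ones,
# leaving exactly one field per line.
inner_lines = [row[0] for row in csv.reader(io.StringIO(raw)) if row]

# Second pass: the unwrapped lines are ordinary comma-separated text.
# dtype={"id": str} keeps the leading zeros; drop it to get integers instead.
df = pd.read_csv(io.StringIO("\n".join(inner_lines)), dtype={"id": str})

print(inner_lines[0])
print(df)
```

For a real file, the in-memory `raw` string would be replaced by an open file handle, for example open("data.csv", newline="").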
