Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/53455947/

Date: 2020-09-14 06:10:18  Source: igfitidea

Pandas pd.read_csv does not work for simple sep=','

Tags: python, pandas, csv

Asked by Kakalukia

Good afternoon, everybody.

I know this is quite an easy question, but I simply do not understand why it does not work the way I expected.

The task is as follows:

I have a file data.csv in this format:

id,"feature_1","feature_2","feature_3"
00100429,"PROTO","Proprietary","Phone"
00100429,"PROTO","Proprietary","Phone"

The goal is to import this data using pandas. I know that by default pandas read_csv uses a comma separator, so I just imported it as follows:

data = pd.read_csv('data.csv')

The result I got is the same as what I presented at the beginning, with no change at all: a single column that contains everything.

I tried many other separators using regex, and the only one that made any improvement was:

data = pd.read_csv('data.csv',sep="\,",engine='python')

On the one hand it finally separated all the columns; on the other hand, the way the data is presented is not convenient to use. In particular:

"id         ""feature_1""   ""feature_2""   ""feature_3"""
"00100429   ""PROTO""       ""Proprietary"" ""Phone"""

Therefore, I think there must be a mistake somewhere, because the data seems to be fine.

So the question is: how do I import a CSV file with separated columns and no triple quote symbols?

Thank you.

Accepted answer by dataLeo

Here's my quick solution for your problem:

import numpy as np
import pandas as pd

### Read the file with no header row (the header line becomes row 0), then remove every double quote
df = pd.read_csv('file.csv', sep=r'\,', header=None, engine='python').apply(lambda x: x.str.replace('"', '', regex=False))
df

    0              1           2       3
0   id      feature_1   feature_2   feature_3
1   00100429    PROTO   Proprietary Phone
2   00100429    PROTO   Proprietary Phone

### Putting column names back and dropping the first row.
df.columns = df.iloc[0]
df.drop(index=0, inplace=True)
df

## You can reset the index 
        id  feature_1   feature_2   feature_3
1   00100429    PROTO   Proprietary Phone
2   00100429    PROTO   Proprietary Phone

### Converting `id` column datatype back to `int` (change according to your needs)

df.id = df.id.astype(int)
np.result_type(df.id)

dtype('int64')
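
The steps above can be tried end to end without a file on disk. This is only a sketch: the contents of raw are an assumption, reconstructed from the output shown in the question (each line wrapped in outer quotes, inner quotes doubled), and io.StringIO stands in for the real file.

```python
import io
import pandas as pd

# Assumed file contents (a reconstruction, not the asker's actual file):
# every line is wrapped in outer quotes and the inner quotes are doubled.
raw = (
    '"id,""feature_1"",""feature_2"",""feature_3"""\n'
    '"00100429,""PROTO"",""Proprietary"",""Phone"""\n'
    '"00100429,""PROTO"",""Proprietary"",""Phone"""\n'
)

# Default parsing sees each line as a single quoted field, so one column comes back.
one_col = pd.read_csv(io.StringIO(raw))

# A regex separator makes the python engine split on every comma and ignore quotes.
df = pd.read_csv(io.StringIO(raw), sep=r"\,", header=None, engine="python")
df = df.apply(lambda col: col.str.replace('"', "", regex=False))

# Promote the first row to column names and drop it from the data.
df.columns = df.iloc[0]
df = df.drop(index=0).reset_index(drop=True)
df.columns.name = None

print(one_col.shape)
print(df)
```

If the real file has no outer quotes, the default pd.read_csv call should already split the columns and none of this cleanup is needed.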

Answer by Karn Kumar

It should work without any issue with sep unless something is really wrong with your CSV file. Simulating your data example, it works fine for me.

As per your data sample, you don't need the escape character \ for comma-delimited values:

>>> import pandas as pd
>>> data = pd.read_csv("sample.csv", sep=",")
>>> data
       id feature_1    feature_2 feature_3
0  100429     PROTO  Proprietary     Phone
1  100429     PROTO  Proprietary     Phone
>>> pd.__version__
'0.23.3'

The problem, as I noticed, is sep="\,".

Alternatively, try:

  • skipinitialspace=True: this "deals with the spaces after the comma-delimiter".

  • quotechar='"': string (length 1). The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.

So, in that case, it is worth trying:

>>> data1 = pd.read_csv("sample.csv", skipinitialspace = True, quotechar = '"')
>>> data1
       id feature_1    feature_2 feature_3
0  100429     PROTO  Proprietary     Phone
1  100429     PROTO  Proprietary     Phone
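
In the output above, id appears as 100429: the leading zeros of 00100429 are lost because the column is inferred as an integer. If the zeros matter, one option is the dtype parameter of read_csv. The sketch below assumes the id values should stay as text, which may not fit your data, and uses io.StringIO in place of sample.csv.

```python
import io
import pandas as pd

# The sample from the question, held in memory instead of in sample.csv.
sample = (
    'id,"feature_1","feature_2","feature_3"\n'
    '00100429,"PROTO","Proprietary","Phone"\n'
    '00100429,"PROTO","Proprietary","Phone"\n'
)

# dtype={"id": str} keeps the id column as text (an assumption about what you want).
data2 = pd.read_csv(
    io.StringIO(sample),
    skipinitialspace=True,
    quotechar='"',
    dtype={"id": str},
)
print(data2)
```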

Note from the pandas documentation:

Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions, will force use of the python parsing engine and will ignore quotes in the data.
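
A small comparison makes the effect of that note visible. This is a sketch on the sample from the question, read from memory with io.StringIO; the variable names are made up for illustration.

```python
import io
import pandas as pd

sample = (
    'id,"feature_1","feature_2","feature_3"\n'
    '00100429,"PROTO","Proprietary","Phone"\n'
)

# Single-character separator: quotes are parsed and removed from names and values.
plain = pd.read_csv(io.StringIO(sample), sep=",")

# Two-character separator: treated as a regex, so the quotes stay in the data.
regex = pd.read_csv(io.StringIO(sample), sep=r"\,", engine="python")

print(list(plain.columns))
print(list(regex.columns))
```

With sep=",", the column names and values come back clean; with sep=r"\,", they still carry their double quotes, which is why the other answers need an extra cleanup step.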

Answer by Shadab Hussain

Here's just an alternative to dataLeo's answer:

import pandas as pd
import numpy as np

Read the file into a DataFrame, then remove all the double quotes from the row values:

df = pd.read_csv("file.csv", sep=r"\,", engine="python").apply(lambda x: x.str.replace('"', '', regex=False))
df

    "id"   "feature_1"  "feature_2" "feature_3"
0   00100429    PROTO   Proprietary Phone
1   00100429    PROTO   Proprietary Phone

Remove all the double quotes from the column names:

df.columns = df.columns.str.replace('\"', '')
df

      id    feature_1   feature_2   feature_3
0   00100429    PROTO   Proprietary Phone
1   00100429    PROTO   Proprietary Phone

Convert the id column's datatype back to int (change according to your needs):

df.id = df.id.astype('int')
np.result_type(df.id)

dtype('int32')
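
Another possible variant avoids the regex separator by passing quoting=csv.QUOTE_NONE, so that quote characters are treated as ordinary text. This is a sketch, not part of the answers above, and the contents of raw are an assumption reconstructed from the output in the question.

```python
import csv
import io
import pandas as pd

# Assumed file contents (a reconstruction): outer quotes around each line,
# inner quotes doubled.
raw = (
    '"id,""feature_1"",""feature_2"",""feature_3"""\n'
    '"00100429,""PROTO"",""Proprietary"",""Phone"""\n'
    '"00100429,""PROTO"",""Proprietary"",""Phone"""\n'
)

# QUOTE_NONE disables quote handling, so a plain comma separator splits every field.
df = pd.read_csv(io.StringIO(raw), sep=",", quoting=csv.QUOTE_NONE)

# Strip the leftover double quotes from the column names and from the values.
df.columns = df.columns.str.replace('"', "", regex=False)
df = df.apply(lambda col: col.str.replace('"', "", regex=False))

# Convert id back to an integer (change according to your needs).
df["id"] = df["id"].astype(int)
print(df)
```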