Pandas pd.read_csv does not work for simple sep=','
Original source: http://stackoverflow.com/questions/53455947/
Warning: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Asked by Kakalukia
Good afternoon, everybody.
I know this is quite an easy question, but I simply do not understand why it does not work the way I expected.
The task is as follows:
I have a file data.csv in this format:
id,"feature_1","feature_2","feature_3"
00100429,"PROTO","Proprietary","Phone"
00100429,"PROTO","Proprietary","Phone"
The goal is to import this data using pandas. I know that by default pandas read_csv uses a comma separator, so I just imported it as follows:
data = pd.read_csv('data.csv')
The result I got is the one I presented at the beginning, with no change at all: a single column that contains everything.
I tried many other separators using regex, and the only one that made some sort of improvement was:
data = pd.read_csv('data.csv',sep="\,",engine='python')
On the one hand, it finally separated all the columns; on the other hand, the way the data is presented is not convenient to use. In particular:
"id ""feature_1"" ""feature_2"" ""feature_3"""
"00100429 ""PROTO"" ""Proprietary"" ""Phone"""
Therefore, I think there must be a mistake somewhere, because the data seems to be fine.
So the question is: how do I import a csv file with separated columns and no triple quote symbols?
Thank you.
Accepted answer by dataLeo
Here's my quick solution for your problem -
import numpy as np
import pandas as pd
### Read the file, treating the header as the first data row, then remove all double quotes
### (a regex separator needs engine='python'; regex=False makes the quote a literal match)
df = pd.read_csv('file.csv', sep=r'\,', engine='python', header=None).apply(lambda x: x.str.replace('"', '', regex=False))
df
0 1 2 3
0 id feature_1 feature_2 feature_3
1 00100429 PROTO Proprietary Phone
2 00100429 PROTO Proprietary Phone
### Put the column names back and drop the first row
df.columns = df.iloc[0]
df.drop(index=0, inplace=True)
df
## You can reset the index
id feature_1 feature_2 feature_3
1 00100429 PROTO Proprietary Phone
2 00100429 PROTO Proprietary Phone
### Convert the `id` column datatype back to `int` (change according to your needs)
### np.int was removed in NumPy 1.24, so the builtin int is used here
df['id'] = df['id'].astype(int)
np.result_type(df.id)
dtype('int64')
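One possible explanation for the single-column result, offered only as an assumption since the raw file is not shown, is that every line is wrapped in an outer pair of quotes with the inner quotes doubled. Under that assumption, a sketch that avoids the regex separator is to disable quote handling with quoting=csv.QUOTE_NONE and strip the quote characters afterwards. The sample text in raw is made up to match that guess.

```python
import csv
import io

import pandas as pd

# Hypothetical file content: each line wrapped in outer quotes, inner quotes doubled.
raw = (
    '"id,""feature_1"",""feature_2"",""feature_3"""\n'
    '"00100429,""PROTO"",""Proprietary"",""Phone"""\n'
    '"00100429,""PROTO"",""Proprietary"",""Phone"""\n'
)

# Default parsing sees one quoted field per line, hence a single column.
one_col = pd.read_csv(io.StringIO(raw))

# QUOTE_NONE treats quotes as ordinary characters, so the commas split the fields.
df = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE, dtype=str)

# Strip the leftover quote characters from the header and from every value.
df.columns = df.columns.str.replace('"', '', regex=False)
df = df.apply(lambda col: col.str.replace('"', '', regex=False))

print(one_col.shape)
print(df)
```

Reading everything as str keeps the leading zeros in id; convert afterwards if an integer is wanted.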
Answer by Karn Kumar
It should work with sep without any issue unless there is something really wrong with your CSV file. Simulating your data example, it works fine for me:
As per your data sample, you don't need the escape character \ for comma-delimited values.
>>> import pandas as pd
>>> data = pd.read_csv("sample.csv", sep=",")
>>> data
id feature_1 feature_2 feature_3
0 100429 PROTO Proprietary Phone
1 100429 PROTO Proprietary Phone
>>> pd.__version__
'0.23.3'
There is a problem here, as I noticed: sep="\,"
Alternatively, try these options:
skipinitialspace=True - this "deals with the spaces after the comma-delimiter"
quotechar='"' : string (length 1). The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
So in that case it is worth trying:
>>> data1 = pd.read_csv("sample.csv", skipinitialspace = True, quotechar = '"')
>>> data1
id feature_1 feature_2 feature_3
0 100429 PROTO Proprietary Phone
1 100429 PROTO Proprietary Phone
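To see what those two options do, here is a small sketch on made-up data (the column names and values in raw are invented for illustration): there is a space after each comma, and one quoted field contains a comma of its own.

```python
import io

import pandas as pd

# Made-up sample: spaces after the commas and a quoted value containing a comma.
raw = 'id, "name", "note"\n1, "PROTO", "Phone, mobile"\n'

# skipinitialspace drops the space after each delimiter, so the quote character
# is seen at the start of the field and the quoted comma is not treated as a separator.
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True, quotechar='"')

print(df.columns.tolist())
print(df.loc[0, "note"])
```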
Note from the Pandas docs:
Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions, will force use of the python parsing engine and will ignore quotes in the data.
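The effect described in that note can be checked with a short sketch. The data in raw is a shortened copy of the sample from the question; reading everything as str is my own choice so that the two results are easy to compare.

```python
import io

import pandas as pd

raw = 'id,"feature_1","feature_2"\n00100429,"PROTO","Phone"\n'

# Single-character separator: the quotes are parsed and removed.
plain = pd.read_csv(io.StringIO(raw), sep=",", dtype=str)

# Two-character separator "\,": treated as a regular expression, python engine,
# and the quote characters are left inside the names and values.
regex = pd.read_csv(io.StringIO(raw), sep=r"\,", engine="python", dtype=str)

print(plain.columns.tolist())
print(regex.columns.tolist())
```

This is why sep="\," splits the columns but leaves the quote characters behind, while a plain sep="," handles them.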
Answer by Shadab Hussain
Here's just an alternative to dataLeo's answer -
import pandas as pd
import numpy as np
Read the file into a dataframe, then remove all double quotes from the row values
df = pd.read_csv("file.csv", sep=r"\,", engine="python").apply(lambda x: x.str.replace('"', '', regex=False))
df
"id" "feature_1" "feature_2" "feature_3"
0 00100429 PROTO Proprietary Phone
1 00100429 PROTO Proprietary Phone
Remove all double quotes from the column names
df.columns = df.columns.str.replace('"', '', regex=False)
df
id feature_1 feature_2 feature_3
0 00100429 PROTO Proprietary Phone
1 00100429 PROTO Proprietary Phone
Convert the id column datatype back to int (change according to your needs)
df.id = df.id.astype('int')
np.result_type(df.id)
dtype('int32')
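Converting id to int drops the leading zeros (00100429 becomes 100429). If those zeros matter, which is an assumption since the question does not say, one option is to ask read_csv to keep that column as text through its dtype parameter. A minimal sketch on a shortened copy of the sample data:

```python
import io

import pandas as pd

raw = 'id,"feature_1"\n00100429,"PROTO"\n'

# Default type inference turns the id into an integer and loses the leading zeros.
as_int = pd.read_csv(io.StringIO(raw))

# Forcing the column to str keeps the value exactly as written in the file.
as_str = pd.read_csv(io.StringIO(raw), dtype={"id": str})

print(as_int.loc[0, "id"])
print(as_str.loc[0, "id"])
```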