pandas: reading CSV data with double quotes

Question

Asked by Gerrit

I am trying to use the pandas library to read a large dataset in .csv format that is updated automatically. The problem is that in my data, the first column is a string without double quotation marks, and the other columns are strings with double quotation marks. It is not possible for me to adjust the .csv file manually.

A simplified dataset would look like this:

A,"B","C","D"
comp_a,"tree","house","door"
comp_b,"truck","red","blue"

I need the data to be stored as separate columns without the quotation marks like this:

A B C D
comp_a tree house door
comp_b truck red blue

I tried using

import pandas as pd
df_csv = pd.read_csv(path_to_file, delimiter=',')

which gives me the complete header as a single variable for the last column

A,"B","C","D"
comp_a "tree" "house" "door"
comp_b "truck" "red" "blue"

The closest result to the one I need came from using the following:

df_csv = pd.read_csv(path_to_file, delimiter=',', quoting=3)

which correctly recognizes each column, but adds in a bunch of extra double quotes.

"A ""B"" ""C"" ""D"""
"comp_a ""tree"" ""house"" ""door"""
"comp_b ""truck"" ""red"" ""blue"""

Setting quoting to a value from 0 to 2 just reads an entire row as a single column.
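
For reference, the integers passed to quoting are the constants defined in Python's standard csv module, so quoting=3 is the same as csv.QUOTE_NONE. A small sketch that lists them (the dictionary name quoting_names is only illustrative):

```python
import csv

# Map each integer accepted by the quoting argument to its csv constant name.
quoting_names = {
    csv.QUOTE_MINIMAL: "QUOTE_MINIMAL",
    csv.QUOTE_ALL: "QUOTE_ALL",
    csv.QUOTE_NONNUMERIC: "QUOTE_NONNUMERIC",
    csv.QUOTE_NONE: "QUOTE_NONE",
}

for value, name in sorted(quoting_names.items()):
    print(value, name)
```

Using the named constant in the read_csv call tends to be easier to read than the bare integer.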

Does anyone know how I can remove all quotation marks when reading the .csv file?

Answer 1

Answered by Federico Gentile

Just load the data with pd.read_csv() and then use .replace('"','', regex=True)

In one line it would be:

# header=None keeps the header line as row 0, so its quotes are stripped as well
df = pd.read_csv(filename, sep=',', header=None).replace('"','', regex=True)

To set the column names:

df.columns = df.iloc[0]

And drop row 0:

df = df.drop(index=0).reset_index(drop=True)
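
Put together, the steps above might look like the following sketch. It assumes the quotes survive parsing (forced here with csv.QUOTE_NONE) and that the header is read as an ordinary row via header=None; io.StringIO and the variable raw are stand-ins for the real file, using the sample data from the question.

```python
import csv
import io

import pandas as pd

# Sample data from the question; io.StringIO stands in for the real file.
raw = (
    'A,"B","C","D"\n'
    'comp_a,"tree","house","door"\n'
    'comp_b,"truck","red","blue"\n'
)

# header=None keeps the header line as row 0; QUOTE_NONE leaves the quote
# characters in the cells so that they can be stripped explicitly.
df = pd.read_csv(
    io.StringIO(raw), sep=',', header=None, quoting=csv.QUOTE_NONE, dtype=str
)
df = df.replace('"', '', regex=True)

# Promote row 0 to the header, then drop it from the data.
df.columns = df.iloc[0]
df = df.drop(index=0).reset_index(drop=True)
df.columns.name = None  # clear the leftover label taken from row 0

print(df)
```

Reading with dtype=str keeps every cell a string, which suits the question's data; numeric columns would need converting afterwards.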

Answer 2

Answered by Nihal

You can remove the " characters after read_csv and then save the file again using df_csv.to_csv('fname')

# apply returns a new DataFrame, so assign the result back
df_csv = df_csv.apply(lambda col: col.str.replace('"', '', regex=False))
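
A fuller sketch of this approach, under the assumption that every column holds strings (the .str accessor fails on numeric columns). The column labels are cleaned separately because apply only touches the cell values. io.StringIO replaces both the input file and the output file here, and cleaned_text is an illustrative name.

```python
import csv
import io

import pandas as pd

# Sample data from the question; io.StringIO stands in for the real file.
raw = (
    'A,"B","C","D"\n'
    'comp_a,"tree","house","door"\n'
    'comp_b,"truck","red","blue"\n'
)

# QUOTE_NONE keeps the quote characters so there is something to remove.
df_csv = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE, dtype=str)

# Strip the quotes from every cell, one column at a time.
df_csv = df_csv.apply(lambda col: col.str.replace('"', '', regex=False))

# The header labels are not covered by apply, so clean them as well.
df_csv.columns = df_csv.columns.str.replace('"', '', regex=False)

# Write the cleaned table back out; a buffer is used instead of a file name.
out = io.StringIO()
df_csv.to_csv(out, index=False)
cleaned_text = out.getvalue()
print(cleaned_text)
```

Passing index=False keeps the row numbers out of the saved file, so it can be read back with a plain read_csv call.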

Answer 3

Answered by Kay Wittig

Consider your data in a file data.csv like this:

$> more data.csv 
A,"B","C","D"
comp_a,"tree","house","door"
comp_b,"truck","red","blue"

Perhaps a newer pandas version would solve your problem by itself, e.g. with pd.__version__ == '0.23.1':

In [1]: import pandas as pd

In [2]: pd.read_csv('data.csv')
Out[2]: 
        A      B      C     D
0  comp_a   tree  house  door
1  comp_b  truck    red  blue

Otherwise, apply a replace to the result:

# regex=True is needed so that quotes inside longer strings are removed too
pd.read_csv('data.csv').replace('"', '', regex=True)
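
Both branches can be combined into one sketch: read with the default settings first and strip quotes only if any are still present. The sample text, io.StringIO, and the has_quotes flag are assumptions for illustration; with the sample data from the question the default parser already removes the quotes, so the fallback is not triggered.

```python
import io

import pandas as pd

# Sample data from the question; io.StringIO stands in for data.csv.
raw = (
    'A,"B","C","D"\n'
    'comp_a,"tree","house","door"\n'
    'comp_b,"truck","red","blue"\n'
)

# Default settings treat the double quote as the CSV quote character.
df = pd.read_csv(io.StringIO(raw))

# Check whether any quote characters survived parsing.
has_quotes = (
    df.astype(str)
    .apply(lambda col: col.str.contains('"', regex=False))
    .any()
    .any()
)

if has_quotes:
    # Fallback: strip quotes from the cell values and the column labels.
    df = df.replace('"', '', regex=True)
    df.columns = df.columns.str.replace('"', '', regex=False)

print(df)
```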
