Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/51359010/

Time: 2020-09-14 05:48:25  Source: igfitidea

pandas data with double quote

python, pandas, double-quotes

Asked by Gerrit

I am trying to use the pandas library to read a large dataset in .csv format that is updated automatically. The problem is that in my data, the first column is a string without double quotation marks, and the other columns are strings with double quotation marks. It is not possible for me to adjust the .csv file manually.

A simplified dataset would look like this

A,"B","C","D"
comp_a,"tree","house","door"
comp_b,"truck","red","blue"

I need the data to be stored as separate columns without the quotation marks like this:

A B C D
comp_a tree house door
comp_b truck red blue

I tried using

import pandas as pd
df_csv = pd.read_csv(path_to_file, delimiter=',')

which gives me the complete header as a single label for the last column:

A,"B","C","D"
comp_a "tree" "house" "door"
comp_b "truck" "red" "blue"

The closest I got to the result I need was with the following:

df_csv = pd.read_csv(path_to_file, delimiter=',', quoting=3)

which correctly recognizes each column, but adds in a bunch of extra double quotes.

"A ""B"" ""C"" ""D"""
"comp_a ""tree"" ""house"" ""door"""
"comp_b ""truck"" ""red"" ""blue"""

Setting quoting to a value from 0 to 2 just reads an entire row as a single column.

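For reference, the integers passed to quoting are the quoting constants of Python's standard csv module, so quoting=3 is csv.QUOTE_NONE (quote characters are treated as ordinary data). A small sketch that lists the mapping; the dictionary name QUOTING_MODES is only illustrative:

```python
import csv

# Map each integer accepted by the quoting parameter to its csv-module name.
QUOTING_MODES = {
    csv.QUOTE_MINIMAL: "QUOTE_MINIMAL",        # 0, the read_csv default
    csv.QUOTE_ALL: "QUOTE_ALL",                # 1
    csv.QUOTE_NONNUMERIC: "QUOTE_NONNUMERIC",  # 2
    csv.QUOTE_NONE: "QUOTE_NONE",              # 3, quotes are kept as literal text
}

for value, name in sorted(QUOTING_MODES.items()):
    print(value, name)
```

Writing quoting=csv.QUOTE_NONE instead of quoting=3 makes the intent easier to read.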

Does anyone know how I can remove all quotation marks when reading the .csv file?

Answered by Federico Gentile

Just load the data with pd.read_csv() and then use .replace('"', '', regex=True)

In one line it would be:

df = pd.read_csv(filename, sep=',').replace('"','', regex=True)

To set the column names:

df.columns = df.iloc[0]

And drop row 0:

df = df.drop(index=0).reset_index(drop=True)
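
A minimal end-to-end sketch of this approach, assuming the quotes survive parsing as literal characters (forced here with quoting=csv.QUOTE_NONE, as in the question's quoting=3 attempt). The sample data is inlined through io.StringIO in place of a real file, and the variable name raw is only illustrative. Because the header is parsed normally in this sketch, the quotes are stripped from the column labels directly rather than promoting row 0:

```python
import csv
import io

import pandas as pd

# Sample data from the question, inlined instead of read from disk.
raw = (
    'A,"B","C","D"\n'
    'comp_a,"tree","house","door"\n'
    'comp_b,"truck","red","blue"\n'
)

# QUOTE_NONE keeps the double quotes as part of every value and header label.
df = pd.read_csv(io.StringIO(raw), sep=",", quoting=csv.QUOTE_NONE)

# Remove the quote character from all cell values ...
df = df.replace('"', "", regex=True)
# ... and from the column labels, which DataFrame.replace does not touch.
df.columns = df.columns.str.replace('"', "", regex=False)

print(df)
```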

Answered by Nihal

You can replace the " characters after read_csv and save the file again using df_csv.to_csv('fname')

df_csv = df_csv.apply(lambda x: x.str.replace('"', ''))
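
One caveat with this pattern is that the .str accessor raises an error on non-string columns. A hedged sketch that skips such columns and writes the cleaned result back out; the helper strip_quotes, the numeric column n, and the in-memory buffer (standing in for a real file name) are assumptions made for illustration:

```python
import io

import pandas as pd

# Hypothetical frame with quoted strings and one numeric column.
df_csv = pd.DataFrame(
    {"A": ["comp_a", "comp_b"], "B": ['"tree"', '"truck"'], "n": [1, 2]}
)


def strip_quotes(col):
    # Only string columns support .str; leave every other column unchanged.
    if pd.api.types.is_string_dtype(col):
        return col.str.replace('"', "", regex=False)
    return col


cleaned = df_csv.apply(strip_quotes)

# Write to an in-memory buffer; pass a file name to to_csv to save to disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
print(buffer.getvalue())
```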

Answered by Kay Wittig

Consider your data in a file data.csv like this:

$> more data.csv 
A,"B","C","D"
comp_a,"tree","house","door"
comp_b,"truck","red","blue"

Perhaps a newer pandas version would solve your problem by itself, e.g. with pd.__version__ == '0.23.1':

In [1]: import pandas as pd

In [2]: pd.read_csv('data.csv')
Out[2]: 
        A      B      C     D
0  comp_a   tree  house  door
1  comp_b  truck    red  blue
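
The same check can be reproduced without creating a file. This sketch inlines the sample through io.StringIO (the variable name raw is illustrative) and relies on the default read_csv settings, under which a pair of double quotes enclosing a field is consumed by the parser:

```python
import io

import pandas as pd

# Same content as data.csv above, inlined for a self-contained check.
raw = (
    'A,"B","C","D"\n'
    'comp_a,"tree","house","door"\n'
    'comp_b,"truck","red","blue"\n'
)

# Default settings: comma separator, first row as header, quotes stripped.
df = pd.read_csv(io.StringIO(raw))
print(df)
```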

Otherwise, apply a replace to the result:

pd.read_csv('data.csv').replace('"', '', regex=True)
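
One detail of DataFrame.replace worth keeping in mind: with a plain string it only matches cells whose entire value equals that string, while regex=True makes it substitute inside each string. A small sketch on a hypothetical one-column frame:

```python
import pandas as pd

# Hypothetical frame: one quoted word and one cell that is just a quote.
df = pd.DataFrame({"B": ['"tree"', '"']})

# Without regex=True only the cell that is exactly '"' is replaced.
whole = df.replace('"', "")

# With regex=True every occurrence of the quote character is removed.
sub = df.replace('"', "", regex=True)

print(whole["B"].tolist())
print(sub["B"].tolist())
```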