Original question: http://stackoverflow.com/questions/32742976/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
How to read a column of csv as dtype list using pandas?
Asked by nachiappanpl
I have a CSV file with 3 columns, in which every row of Col3 holds a list of values, as the following table shows:
Col1,Col2,Col3
1,a1,"['Proj1', 'Proj2']"
2,a2,"['Proj3', 'Proj2']"
3,a3,"['Proj4', 'Proj1']"
4,a4,"['Proj3', 'Proj4']"
5,a5,"['Proj5', 'Proj2']"
Whenever I read this CSV, Col3 is read as a str object and not as a list. I tried to change the dtype of that column to list but got an AttributeError, as shown below:
df = pd.read_csv("inputfile.csv")
df.Col3.dtype = list
AttributeError Traceback (most recent call last)
<ipython-input-19-6f9ec76b1b30> in <module>()
----> 1 df.Col3.dtype = list
C:\Python27\lib\site-packages\pandas\core\generic.pyc in __setattr__(self, name, value)
1953 object.__setattr__(self, name, value)
1954 except (AttributeError, TypeError):
-> 1955 object.__setattr__(self, name, value)
1956
1957 #----------------------------------------------------------------------
AttributeError: can't set attribute
It would be great if you could guide me on how to go about this.
Answered by Padraic Cunningham
You could use the ast module from the standard library:
from ast import literal_eval
df.Col3 = df.Col3.apply(literal_eval)
print(df.Col3[0][0])
Proj1
You can also do it when you create the DataFrame from the CSV, using converters:
df = pd.read_csv("in.csv",converters={"Col3": literal_eval})
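To make the effect of the converter concrete, here is a minimal, self-contained sketch. It assumes the sample rows from the question and uses io.StringIO in place of a file on disk; the variable names csv_text, raw and parsed are only illustrative.

```python
import io
from ast import literal_eval

import pandas as pd

# Sample data copied from the question; io.StringIO stands in for the real file.
csv_text = '''Col1,Col2,Col3
1,a1,"['Proj1', 'Proj2']"
2,a2,"['Proj3', 'Proj2']"
'''

# Without a converter, every cell of Col3 is a plain string.
raw = pd.read_csv(io.StringIO(csv_text))

# With a converter, each cell is parsed into a Python list while the file is read.
parsed = pd.read_csv(io.StringIO(csv_text), converters={"Col3": literal_eval})

print(type(raw.loc[0, "Col3"]))
print(type(parsed.loc[0, "Col3"]))
print(parsed.loc[1, "Col3"])
```

The column still has dtype object afterwards; pandas has no list dtype, it just stores a Python list in each cell.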
If you are sure the format is the same for all strings, stripping and splitting will be a lot faster:
df = pd.read_csv("in.csv",converters={"Col3": lambda x: x.strip("[]").split(", ")})
But each element will still be wrapped in its own single quotes (for example "'Proj1'" instead of 'Proj1').
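The difference between the two approaches can be checked on a single cell without pandas. This is only a small illustration; the names cell, split_result and eval_result are made up for the example.

```python
from ast import literal_eval

cell = "['Proj1', 'Proj2']"

# Fast path from the answer: drop the brackets, then split on ", ".
split_result = cell.strip("[]").split(", ")

# Reference result from literal_eval.
eval_result = literal_eval(cell)

print(split_result)                 # elements keep their single quotes
print(eval_result)                  # elements are clean strings
print(split_result == eval_result)  # the two results differ
```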
Answered by 5norre
Adding a replace to Cunningham's answer removes the leftover single quotes:
df = pd.read_csv("in.csv",converters={"Col3": lambda x: x.strip("[]").replace("'","").split(", ")})
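The same logic can be written as a named function, which is easier to test and lets you handle the empty-list case. The sketch below is an assumption-laden variant, not part of the original answer: split_list_cell is a hypothetical helper, and it assumes no value contains an apostrophe or the separator ", ".

```python
def split_list_cell(cell):
    """Hypothetical helper: parse "['a', 'b']" without ast.literal_eval."""
    inner = cell.strip("[]").replace("'", "")
    # "".split(", ") returns [''], so an empty list needs its own branch.
    return inner.split(", ") if inner else []


print(split_list_cell("['Proj1', 'Proj2']"))
print(split_list_cell("[]"))
```

It can be passed to read_csv in the same way as the lambda, for example converters={"Col3": split_list_cell}.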
Answered by Ricardo
I have a different approach, which also works for string representations of data types other than lists.
You can use the json library and apply json.loads() to the desired column. For example:
import json
df.my_column = df.my_column.apply(json.loads)
For this to work, however, the strings inside each list must be enclosed in double quotes, as JSON requires. Single-quoted data such as "['Proj1', 'Proj2']" from the question will not parse.
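The quoting requirement is easy to verify with the standard library alone. In this small sketch the variable names are illustrative, and the single-quoted string is the format used in the question.

```python
import json

double_quoted = '["Proj1", "Proj2"]'
single_quoted = "['Proj1', 'Proj2']"

# Valid JSON: parsed into a Python list.
print(json.loads(double_quoted))

# Single quotes are not valid JSON, so json.loads raises JSONDecodeError.
try:
    json.loads(single_quoted)
    single_ok = True
except json.JSONDecodeError:
    single_ok = False

print(single_ok)
```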
Answered by cs95
@Padraic Cunningham's answer will not work if you have to parse lists of strings that do not have quotes. For example, literal_eval will successfully parse "['a', 'b', 'c']", but not "[a, b, c]". To load strings like this, use the PyYAML library.
import io
import pandas as pd
data = '''
A,B,C
"[1, 2, 3]",True,"[a, b, c]"
"[4, 5, 6]",False,"[d, e, f]"
'''
df = pd.read_csv(io.StringIO(data), sep=',')
df
A B C
0 [1, 2, 3] True [a, b, c]
1 [4, 5, 6] False [d, e, f]
df['C'].tolist()
# ['[a, b, c]', '[d, e, f]']
import yaml
df[['A', 'C']] = df[['A', 'C']].applymap(yaml.safe_load)
df['C'].tolist()
# [['a', 'b', 'c'], ['d', 'e', 'f']]
yaml can be installed using pip install pyyaml.
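The literal_eval failure described above can be reproduced with the standard library. The sketch below also shows one dependency-free fallback for cases where installing PyYAML is not an option. Note that parse_cell is a hypothetical helper, not part of the original answer, and unlike yaml.safe_load it keeps every unquoted item as a string.

```python
from ast import literal_eval


def parse_cell(cell):
    """Hypothetical fallback parser; not part of the original answer."""
    try:
        # Works for properly quoted or numeric lists such as "[1, 2, 3]".
        return literal_eval(cell)
    except (ValueError, SyntaxError):
        # Unquoted items such as "[a, b, c]" end up here; keep them as strings.
        inner = cell.strip("[]").strip()
        return [item.strip() for item in inner.split(",")] if inner else []


print(parse_cell("[1, 2, 3]"))
print(parse_cell("[a, b, c]"))
```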
Answered by theletz
If you control how the file is written, you can use DataFrame.to_parquet and pd.read_parquet instead of CSV.
Parquet stores the column with its list type, so it does not have to be parsed from a string when you read it back.

