pandas - convert string into list of strings
Source: http://stackoverflow.com/questions/45758646/
Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors on StackOverflow.
Asked by Fabio Lamanna
I have this 'file.csv' file to read with pandas:
Title|Tags
T1|"[Tag1,Tag2]"
T1|"[Tag1,Tag2,Tag3]"
T2|"[Tag3,Tag1]"
using
df = pd.read_csv('file.csv', sep='|')
the output is:
Title Tags
0 T1 [Tag1,Tag2]
1 T1 [Tag1,Tag2,Tag3]
2 T2 [Tag3,Tag1]
I know that the column Tags is a full string, since:
In [64]: df['Tags'][0][0]
Out[64]: '['
I need to read it as a list of strings like ["Tag1","Tag2"]. I tried the solution provided in this question, but had no luck, since the [ and ] characters mess things up.
The expected output should be:
In [64]: df['Tags'][0][0]
Out[64]: 'Tag1'
Answered by Mike Müller
You can split the string manually:
>>> df['Tags'] = df.Tags.apply(lambda x: x[1:-1].split(','))
>>> df.Tags[0]
['Tag1', 'Tag2']
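For anyone who wants to try this without a file on disk, here is a minimal self-contained sketch of the same idea. It is not part of the original answer: the `parse_tags` helper is a made-up name, and trimming whitespace and mapping an empty "[]" to an empty list are assumptions about what you might want.

```python
import io

import pandas as pd

# Same content as the question's file.csv, kept in memory for the example.
csv_text = '''Title|Tags
T1|"[Tag1,Tag2]"
T1|"[Tag1,Tag2,Tag3]"
T2|"[Tag3,Tag1]"
'''

df = pd.read_csv(io.StringIO(csv_text), sep='|')


def parse_tags(text):
    """Hypothetical helper: drop the brackets, split on commas, trim spaces."""
    inner = text.strip().strip('[]')
    # An empty "[]" becomes an empty list instead of [''].
    return [item.strip() for item in inner.split(',')] if inner else []


df['Tags'] = df['Tags'].apply(parse_tags)
print(df['Tags'][0])     # ['Tag1', 'Tag2']
print(df['Tags'][0][0])  # Tag1
```

A plain `x[1:-1].split(',')` gives the same result on the question's data; the extra handling only matters if the real file contains spaces or empty lists.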
Answered by YOBEN_S
Or
df.Tags=df.Tags.str[1:-1].str.split(',').tolist()
Answered by Scott Boston
You can convert the string to a list using strip and split.
df_out = df.assign(Tags=df.Tags.str.strip('[]').str.split(','))
df_out.Tags[0][0]
Output:
'Tag1'
Answered by RHSmith159
I think you could use the json module.
import json
import pandas as pd
df = pd.read_csv('file.csv', sep='|')
df['Tags'] = df['Tags'].apply(lambda x: json.loads(x))
This loads your dataframe as before, then applies a lambda function to each item in the Tags column. The lambda function calls json.loads(), which converts the string representation of the list to an actual list. Note that json.loads() only accepts valid JSON, so the items must be quoted, as in ["Tag1","Tag2"]; the unquoted [Tag1,Tag2] from the question raises json.JSONDecodeError.
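If you are not sure whether every row is valid JSON, one option is to try json.loads() first and fall back to a manual split. This is only a sketch: the `try_json_list` name and the fallback behaviour are assumptions for illustration, not something the original answer proposed.

```python
import json


def try_json_list(text):
    """Hypothetical helper: use json.loads, else fall back to a manual split."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Unquoted items such as [Tag1,Tag2] are not valid JSON.
        return text.strip('[]').split(',')


print(try_json_list('["Tag1","Tag2"]'))  # ['Tag1', 'Tag2']
print(try_json_list('[Tag1,Tag2]'))      # ['Tag1', 'Tag2']
```

It could then be applied with df['Tags'].apply(try_json_list) in place of the lambda.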
Answered by Veggiet
Your df['Tags'] column appears to hold plain strings. If you print its values as a list you should get ["[Tag1,Tag2]","[Tag1,Tag2,Tag3]","[Tag3,Tag1]"]. This is why, when you index the first element of the first element, you actually get the first single character of the string rather than what you want.
You need to parse that string afterward, for example with something like:
df['Tags'][0] = df['Tags'][0].split(',')
But as you saw in your cited example, this will give you a list whose first element looks like:
in: df['Tags'][0][0]
out: '[Tag1'
What you need is a way to parse the string while stripping out several different characters. You can use a simple regular expression to do this. Something like:
import re
df['Tags'][0] = re.findall(r"[\w']+", df['Tags'][0])
print(df['Tags'][0][0])
will print:
Tag1
Using the other answer involving pandas converters, you might write a converter like this:
def clean(seq_string):
    return re.findall(r"[\w']+", seq_string)
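As a rough sketch of how such a converter could be wired in, the function can be passed through the `converters` argument of pd.read_csv, which maps a column name to a function applied to each raw cell. The in-memory CSV below stands in for the question's file.csv and is only there to make the example runnable.

```python
import io
import re

import pandas as pd


def clean(seq_string):
    # Collect every run of word characters (and apostrophes) as one tag.
    return re.findall(r"[\w']+", seq_string)


# Stand-in for the question's file.csv.
csv_text = '''Title|Tags
T1|"[Tag1,Tag2]"
T1|"[Tag1,Tag2,Tag3]"
T2|"[Tag3,Tag1]"
'''

# The converter runs on the Tags column while the file is being read.
df = pd.read_csv(io.StringIO(csv_text), sep='|', converters={'Tags': clean})
print(df['Tags'][0][0])  # Tag1
```

With this approach the column already contains lists when read_csv returns, so no separate apply step is needed.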
Regular expressions can be quite powerful, but also unpredictable if you're not sure about the content of your input strings. The expression used here, r"[\w']+", matches runs of word characters (letters, digits and underscores) and apostrophes; re.findall treats everything else as a separator between list items.
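To make the "unpredictable" part concrete, here is a small sketch comparing the question's input with a made-up one. The second string is invented for illustration and does not come from the question.

```python
import re

pattern = r"[\w']+"

# Input shaped like the question's data: each tag comes back whole.
clean_tags = re.findall(pattern, "[Tag1,Tag2]")
print(clean_tags)  # ['Tag1', 'Tag2']

# Invented input with hyphens and plus signs: those characters act as
# separators too, so tags get split or truncated.
messy_tags = re.findall(pattern, "[data-science,C++]")
print(messy_tags)  # ['data', 'science', 'C']
```

If your tags can contain characters like these, splitting on commas after stripping the brackets is likely the safer choice.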