Assign value to a pandas dataframe column based on string condition
Source: http://stackoverflow.com/questions/36701689/ (StackOverflow, licensed under CC BY-SA 4.0; if you reuse this content you must attribute it to the original authors)
Asked by haimen
Suppose I have a dataframe,
data
id URL
1 www.pandora.com
2 m.jcpenney.com
3 www.youtube.com
4 www.facebook.com
I want to create a new column based on whether the URL contains a particular word. For example, if it contains 'youtube', I want the column value to be Youtube. So I tried the following:
data['test'] = 'other'
so once we do that we have,
data['test']
other
other
other
other
then I tried this,
data[data['URL'].str.contains("youtub") == True]['test'] = 'Youtube'
data[data['URL'].str.contains("face") == True]['test'] = 'Facebook'
Though this runs without any error, the value of the test column doesn't change. It still has other for all the rows. When I run these statements, ideally the 3rd row alone should change to 'Youtube' and the 4th to 'Facebook'. But nothing changes. Can anybody tell me what mistake I am making here?
Answered by jezrael
I think you can use loc with a boolean mask created by str.contains:
print(data['URL'].str.contains("youtub"))
0 False
1 False
2 True
3 False
Name: URL, dtype: bool
data.loc[data['URL'].str.contains("youtub"),'test'] = 'Youtube'
data.loc[data['URL'].str.contains("face"),'test'] = 'Facebook'
print(data)
id URL test
0 1 www.pandora.com NaN
1 2 m.jcpenney.com NaN
2 3 www.youtube.com Youtube
3 4 www.facebook.com Facebook
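The reason the attempt in the question has no effect is that data[mask]['test'] = ... first evaluates data[mask], which produces a new object, and then assigns into that temporary object rather than into data. A single loc call selects the rows and the column in one step, so the assignment reaches the original frame. Below is a minimal self-contained sketch of that, assuming the four-row frame from the question and keeping the 'other' default from the question (the answer's output above shows NaN because no default was set there).

```python
import pandas as pd

# Sample frame rebuilt from the question.
data = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'URL': ['www.pandora.com', 'm.jcpenney.com',
            'www.youtube.com', 'www.facebook.com'],
})

# Default label for every row, as in the question.
data['test'] = 'other'

# One .loc call picks the rows (boolean mask) and the column together,
# so the assignment is written into `data` itself, not a temporary copy.
data.loc[data['URL'].str.contains('youtub'), 'test'] = 'Youtube'
data.loc[data['URL'].str.contains('face'), 'test'] = 'Facebook'

print(data)
```

Rows that match neither mask keep the 'other' default.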
Answered by MaxU
I would do it in one shot:
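# Map a regex that matches the whole URL to the label that replaces it.
replacements = {
    r'.*youtube.*': 'Youtube',
    r'.*face.*': 'Facebook',
    r'.*pandora.*': 'Pandora'
}
df['text'] = df.URL.replace(replacements, regex=True)
# Anything that still contains a dot was not replaced, so label it 'other'.
df.loc[df.text.str.contains(r'\.'), 'text'] = 'other'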
print(df)
Output:
URL text
id
1 www.pandora.com Pandora
2 m.jcpenney.com other
3 www.youtube.com Youtube
4 www.facebook.com Facebook
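The approach above relies on unmatched URLs still containing a dot. As a possible alternative (not part of the original answer), numpy.select can pick a label from a list of conditions and fall back to a default directly. This is only a sketch, assuming the same four URLs indexed by id; the names conditions and labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the output above, indexed by id.
df = pd.DataFrame(
    {'URL': ['www.pandora.com', 'm.jcpenney.com',
             'www.youtube.com', 'www.facebook.com']},
    index=pd.Index([1, 2, 3, 4], name='id'),
)

# One boolean mask per label; where several match, the first one wins.
conditions = [
    df['URL'].str.contains('youtube'),
    df['URL'].str.contains('face'),
    df['URL'].str.contains('pandora'),
]
labels = ['Youtube', 'Facebook', 'Pandora']

# Rows matching no condition receive the default value.
df['text'] = np.select(conditions, labels, default='other')

print(df)
```

Here the default is stated explicitly, so the result does not depend on what characters the labels or URLs contain.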
Answered by Alexander
Given that you probably want to check whether the host name matches (rather than any word in the URL), you could split the string on the dot and check whether the second item (the host name) is in your list.
targets = ['pandora', 'youtube', 'facebook']
data['target_url'] = [url[1] if url[1] in targets else None
for url in data.URL.str.split('.')]
data
id URL target_url
0 1 www.pandora.com pandora
1 2 m.jcpenney.com None
2 3 www.youtube.com youtube
3 4 www.facebook.com facebook
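Indexing url[1] raises an error if a URL has no dot or if a value is missing. The following sketch wraps the same split-and-check idea in a small helper that guards against those cases. The function name host_label is hypothetical, and the frame is rebuilt from the question's sample data.

```python
import pandas as pd

# Sample frame rebuilt from the question.
data = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'URL': ['www.pandora.com', 'm.jcpenney.com',
            'www.youtube.com', 'www.facebook.com'],
})

targets = {'pandora', 'youtube', 'facebook'}

def host_label(url):
    """Return the second dot-separated part if it is a target, else None.

    Hypothetical helper: non-string values and URLs without a dot
    yield None instead of raising.
    """
    if not isinstance(url, str):
        return None
    parts = url.split('.')
    if len(parts) > 1 and parts[1] in targets:
        return parts[1]
    return None

data['target_url'] = data['URL'].map(host_label)

print(data)
```

Like the original, this assumes the host name is always the second dot-separated part, which does not hold for URLs such as youtube.com without a leading subdomain.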