Assign value to a pandas dataframe column based on string condition

Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/36701689/

python pandas

Asked by haimen

Suppose I have a dataframe,

data
id  URL
1   www.pandora.com
2   m.jcpenney.com
3   www.youtube.com
4   www.facebook.com
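
To reproduce the problem, the frame listed above could be built roughly as follows. This is only a sketch: the variable name data and the column names id and URL come from the listing, while the construction itself is an assumption about how the frame was created.

```python
import pandas as pd

# Rebuild the four-row example frame from the listing above
data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "URL": [
        "www.pandora.com",
        "m.jcpenney.com",
        "www.youtube.com",
        "www.facebook.com",
    ],
})
print(data)
```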

I want to create a new column based on whether the URL contains a particular word. For example, if it contains 'youtube', I want the column value to be 'youtube'. So I tried the following,

data['test'] = 'other'

so once we do that we have,

data['test']
other
other
other
other

then I tried this,

data[data['URL'].str.contains("youtub") == True]['test'] = 'Youtube'
data[data['URL'].str.contains("face") == True]['test'] = 'Facebook'

Though this runs without any error, the value of the test column doesn't change. It still has 'other' for all the rows. When I run these statements, ideally the 3rd row alone should change to 'Youtube' and the 4th to 'Facebook'. But nothing changes. Can anybody tell me what mistake I am making here?

Answered by jezrael

I think you can use loc with a boolean mask created by contains:

print(data['URL'].str.contains("youtub"))
0    False
1    False
2     True
3    False
Name: URL, dtype: bool

data.loc[data['URL'].str.contains("youtub"),'test'] = 'Youtube'
data.loc[data['URL'].str.contains("face"),'test'] = 'Facebook'
print(data)
   id               URL      test
0   1   www.pandora.com       NaN
1   2    m.jcpenney.com       NaN
2   3   www.youtube.com   Youtube
3   4  www.facebook.com  Facebook
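
The output above shows NaN for the unmatched rows because the test column did not exist before the loc assignments. A minimal sketch, assuming the same four-row frame as in the question and the 'other' default the asker set first, might look like this. The original attempt most likely had no effect because data[mask] returns a new object, so data[mask]['test'] = ... writes to a temporary copy rather than to data.

```python
import pandas as pd

# Same four rows as in the question (construction is an assumption)
data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "URL": ["www.pandora.com", "m.jcpenney.com",
            "www.youtube.com", "www.facebook.com"],
})

# Start every row at the default label, as in the question
data["test"] = "other"

# A single .loc call selects rows and column together,
# so the assignment is applied to `data` itself
data.loc[data["URL"].str.contains("youtub"), "test"] = "Youtube"
data.loc[data["URL"].str.contains("face"), "test"] = "Facebook"

print(data["test"].tolist())
# ['other', 'other', 'Youtube', 'Facebook']
```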

Answered by MaxU

I would do it in one shot:

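# Map a regex that matches the whole URL to its replacement label
replacements = {
  r'.*youtube.*': 'Youtube',
  r'.*face.*': 'Facebook',
  r'.*pandora.*': 'Pandora'
}

# Replace every URL that matches one of the patterns
df['text'] = df.URL.replace(replacements, regex=True)
# Anything that still contains a dot was not matched, so label it 'other'
df.loc[df.text.str.contains(r'\.'), 'text'] = 'other'
print(df)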

Output:

                 URL      text
id
1    www.pandora.com   Pandora
2     m.jcpenney.com     other
3    www.youtube.com   Youtube
4   www.facebook.com  Facebook
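
A related way to get the same labels in one pass, without the "still contains a dot" fallback, is numpy.select. This is not MaxU's method, only a sketch of a different technique; it assumes a frame named df indexed by id, as the output above suggests, and the names conditions and labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same data as above, indexed by id (construction is an assumption)
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "URL": ["www.pandora.com", "m.jcpenney.com",
            "www.youtube.com", "www.facebook.com"],
}).set_index("id")

# Each condition is a boolean mask; for a given row the first True one wins
conditions = [
    df["URL"].str.contains("youtube"),
    df["URL"].str.contains("face"),
    df["URL"].str.contains("pandora"),
]
labels = ["Youtube", "Facebook", "Pandora"]

# Rows that match none of the conditions receive the default label
df["text"] = np.select(conditions, labels, default="other")
print(df["text"].tolist())
# ['Pandora', 'other', 'Youtube', 'Facebook']
```

The default argument plays the role of the 'other' fallback, so a label that happens to contain a dot would not be overwritten by mistake.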

Answered by Alexander

Given that you probably want to check if the host name matches (rather than any word in the URL), you could split the string on the dot and check if the second item (host name) is in your list.

targets = ['pandora', 'youtube', 'facebook']
data['target_url'] = [url[1] if url[1] in targets else None 
                      for url in data.URL.str.split('.')]

data
   id               URL target_url
0   1   www.pandora.com    pandora
1   2    m.jcpenney.com       None
2   3   www.youtube.com    youtube
3   4  www.facebook.com   facebook
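
The list comprehension above assumes that every URL has at least two dot-separated parts and that the host name is always the second one; a URL such as youtube.com would yield com instead. One possible alternative, sketched here under the assumption of the same four-row frame, uses Series.str.extract with an alternation built from targets. Note that this matches the word anywhere in the URL, so it is a different trade-off rather than a drop-in replacement, and unmatched rows come back as missing values (NaN) rather than None.

```python
import re
import pandas as pd

# Same four rows as in the question (construction is an assumption)
data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "URL": ["www.pandora.com", "m.jcpenney.com",
            "www.youtube.com", "www.facebook.com"],
})

targets = ["pandora", "youtube", "facebook"]

# Build one capture group such as (pandora|youtube|facebook);
# re.escape guards against regex metacharacters in a target
pattern = "(" + "|".join(re.escape(t) for t in targets) + ")"

# expand=False returns a Series holding the first captured match per row
data["target_url"] = data["URL"].str.extract(pattern, expand=False)
print(data)
```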