Note: this page reproduces a popular StackOverflow question and its answer under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/54508137/

Time: 2020-09-14 06:18:37  Source: igfitidea

Check if Pandas DataFrame cell contains certain string

python pandas

Asked by turtle101

Suppose I have the following Pandas DataFrame:

         a        b             
0        NAN      BABA UN EQUITY
1        NAN      2018  
2        NAN      2017
3        NAN      2016
4        NAN      NAN
5        NAN      700 HK EQUITY
6        NAN      2018  
7        NAN      2017
8        NAN      2016
9        NAN      NAN 

I want to check each cell in column b to see if it contains the string EQUITY. If it does, I want to replace the cells in column a, from the next row down to the next row that is all NAN, with that string, to get the edited DataFrame as follows:

         a                 b             
0        NAN               BABA UN EQUITY
1        BABA UN EQUITY    2018  
2        BABA UN EQUITY    2017
3        BABA UN EQUITY    2016
4        NAN               NAN
5        NAN               700 HK EQUITY
6        700 HK EQUITY     2018  
7        700 HK EQUITY     2017
8        700 HK EQUITY     2016
9        NAN               NAN         

My actual DataFrame is much larger than the above, but the format is similar. I'm very new to Pandas, but I think I can figure out the text replacement part by using sheet.loc and replacing the cell values in a loop.

However, I am having trouble figuring out how to check whether a cell contains EQUITY. It seems that str.contains is what I should be using, but it's not clear to me how to do that.

Thanks!

Answered by unutbu

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN'],
 'b': ['BABA UN EQUITY', '2018', '2017', '2016', 'NAN', '700 HK EQUITY', '2018', '2017', '2016', 'NAN']})

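# Make sure that all NaN values are `np.nan` not `'NAN'` (strings)
df = df.replace('NAN', np.nan)
# True where b contains EQUITY, and also where b is NaN (na=True)
mask = df['b'].str.contains(r'EQUITY', na=True)
# Copy b into a on the masked rows
df.loc[mask, 'a'] = df['b']
# Carry each copied value down into the rows below it
df['a'] = df['a'].ffill()
# Blank out the masked rows themselves again
df.loc[mask, 'a'] = np.nan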

yields

                a               b
0             NaN  BABA UN EQUITY
1  BABA UN EQUITY            2018
2  BABA UN EQUITY            2017
3  BABA UN EQUITY            2016
4             NaN             NaN
5             NaN   700 HK EQUITY
6   700 HK EQUITY            2018
7   700 HK EQUITY            2017
8   700 HK EQUITY            2016
9             NaN             NaN


One slightly tricky bit above is how mask is defined. Notice that str.contains returns a Series which contains not only True and False values, but also NaN:

In [114]: df['b'].str.contains(r'EQUITY')
Out[114]: 
0     True
1    False
2    False
3    False
4      NaN
5     True
6    False
7    False
8    False
9      NaN
Name: b, dtype: object
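
The NaN entries matter because a result like this should not be usable as a boolean mask as it stands. A minimal sketch, assuming a shortened three-row stand-in for column b (the names b, raw and outcome, and the explicit object dtype, are illustrative and not part of the answer):

```python
import numpy as np
import pandas as pd

# Shortened stand-in for column b; object dtype is an assumption made here.
b = pd.Series(["BABA UN EQUITY", "2018", np.nan], dtype=object)

# Without `na=...`, the missing cell stays missing in the result.
raw = b.str.contains(r"EQUITY")

try:
    b[raw]  # boolean indexing with a mask that still holds NaN
    outcome = "indexing succeeded"
except ValueError as exc:
    outcome = f"ValueError: {exc}"

print(raw.tolist())
print(outcome)
```

Passing na=True or na=False to str.contains avoids this by turning the result into a plain boolean Series.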

str.contains(..., na=True) is used to make the NaNs be treated as True:

In [116]: df['b'].str.contains(r'EQUITY', na=True)
Out[116]: 
0     True
1    False
2    False
3    False
4     True
5     True
6    False
7    False
8    False
9     True
Name: b, dtype: bool
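
For comparison, na=False would treat the missing cells as non-matches instead. A small sketch on the same column values (the variable names are illustrative) that contrasts the two settings:

```python
import numpy as np
import pandas as pd

b = pd.Series(['BABA UN EQUITY', '2018', '2017', '2016', np.nan,
               '700 HK EQUITY', '2018', '2017', '2016', np.nan])

with_nan_true = b.str.contains(r'EQUITY', na=True)
with_nan_false = b.str.contains(r'EQUITY', na=False)

# The two masks disagree only on the rows where b is missing.
separator_rows = b.index[with_nan_true & ~with_nan_false].tolist()
print(separator_rows)
```

Choosing na=True here means the all-NaN separator rows are part of mask, so the final step can blank them out again after the forward fill.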


Once you have mask, the idea is simple: copy the values from b into a wherever mask is True:

df.loc[mask, 'a'] = df['b']

Forward-fill the NaN values in a:

df['a'] = df['a'].ffill()
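
As a quick illustration of what ffill does on its own, here is a toy Series (not taken from the question) in which every NaN takes the last non-missing value above it, while a leading NaN has nothing to copy from and stays missing:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 'x', np.nan, np.nan, 'y', np.nan])
filled = s.ffill()
print(filled.tolist())
```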

Replace the values in a with NaN wherever mask is True:

df.loc[mask, 'a'] = np.nan
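
The three steps can also be wrapped in a small helper. This is only a sketch under stated assumptions: the function name fill_from_label_rows and its parameters are hypothetical, and it assumes the target column holds nothing worth keeping (as in the question), so it simply starts that column from scratch.

```python
import numpy as np
import pandas as pd


def fill_from_label_rows(df, source='b', target='a', pattern=r'EQUITY'):
    """Return a copy of df whose `target` column carries the preceding label."""
    out = df.copy()
    # Assumption: `target` is empty, so start from an all-NaN object column
    # that is able to hold strings.
    out[target] = pd.Series(np.nan, index=out.index, dtype=object)
    # True on label rows and on missing (separator) rows.
    mask = out[source].str.contains(pattern, na=True)
    out.loc[mask, target] = out.loc[mask, source]  # copy labels across
    out[target] = out[target].ffill()              # carry them downward
    out.loc[mask, target] = np.nan                 # blank the masked rows
    return out


df = pd.DataFrame({'b': ['BABA UN EQUITY', '2018', '2017', '2016', np.nan,
                         '700 HK EQUITY', '2018', '2017', '2016', np.nan]})
result = fill_from_label_rows(df)
print(result)
```

Working on a copy leaves the caller's DataFrame untouched, which makes the helper easier to try out on a large sheet before overwriting anything.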