Check if Pandas DataFrame cell contains certain string
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author, and attribute it to the original authors (not me): StackOverFlow
Original: http://stackoverflow.com/questions/54508137/
Asked by turtle101
Suppose I have the following Pandas DataFrame:
     a               b
0  NAN  BABA UN EQUITY
1  NAN            2018
2  NAN            2017
3  NAN            2016
4  NAN             NAN
5  NAN   700 HK EQUITY
6  NAN            2018
7  NAN            2017
8  NAN            2016
9  NAN             NAN
I want to check each cell in column b to see if it contains the string EQUITY. If it does, I want to replace the cells in column a, from the next row down to the next row that is all NAN, with that string, to get the edited DataFrame as follows:
                a               b
0             NAN  BABA UN EQUITY
1  BABA UN EQUITY            2018
2  BABA UN EQUITY            2017
3  BABA UN EQUITY            2016
4             NAN             NAN
5             NAN   700 HK EQUITY
6   700 HK EQUITY            2018
7   700 HK EQUITY            2017
8   700 HK EQUITY            2016
9             NAN             NAN
My actual DataFrame is much larger than the above, but the format is similar. I'm very new to Pandas, but I think I can figure out the text replacement part by using sheet.loc and replacing the cell values in a loop.
However, I am having trouble figuring out how to check whether a cell contains EQUITY. It seems that str.contains is what I should be using, but it's not clear to me how to do that.
Thanks!
Answered by unutbu
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN', 'NAN'],
                   'b': ['BABA UN EQUITY', '2018', '2017', '2016', 'NAN', '700 HK EQUITY', '2018', '2017', '2016', 'NAN']})
# Make sure that all NaN values are `np.nan` not `'NAN'` (strings)
df = df.replace('NAN', np.nan)

# True where b contains 'EQUITY' or is NaN
mask = df['b'].str.contains(r'EQUITY', na=True)
df.loc[mask, 'a'] = df['b']   # copy b into a on the masked rows
df['a'] = df['a'].ffill()     # carry each value down to the rows below
df.loc[mask, 'a'] = np.nan    # reset the masked rows to NaN
yields
                a               b
0             NaN  BABA UN EQUITY
1  BABA UN EQUITY            2018
2  BABA UN EQUITY            2017
3  BABA UN EQUITY            2016
4             NaN             NaN
5             NaN   700 HK EQUITY
6   700 HK EQUITY            2018
7   700 HK EQUITY            2017
8   700 HK EQUITY            2016
9             NaN             NaN
One slightly tricky bit above is how mask is defined. Notice that str.contains returns a Series which contains not only True and False values, but also NaN:
In [114]: df['b'].str.contains(r'EQUITY')
Out[114]:
0     True
1    False
2    False
3    False
4      NaN
5     True
6    False
7    False
8    False
9      NaN
Name: b, dtype: object
str.contains(..., na=True) is used to make the NaNs be treated as True:
In [116]: df['b'].str.contains(r'EQUITY', na=True)
Out[116]:
0     True
1    False
2    False
3    False
4     True
5     True
6    False
7    False
8    False
9     True
Name: b, dtype: bool
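To see the effect of the na argument in isolation, here is a minimal sketch on a three-row Series. The data is made up for illustration (one header, one data row, one blank row), and the variable names as_true and as_false are not from the original answer.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the original post): a header, a data row, a blank row.
s = pd.Series(["BABA UN EQUITY", "2018", np.nan], dtype=object)

as_true = s.str.contains("EQUITY", na=True)    # missing values count as matches
as_false = s.str.contains("EQUITY", na=False)  # missing values count as non-matches

print(as_true.tolist())   # [True, False, True]
print(as_false.tolist())  # [True, False, False]
```

Either setting gives a plain boolean mask that can be passed to df.loc. The answer uses na=True because the blank separator rows need to be selected again in the last step.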
Once you have mask, the idea is simple: copy the values from b into a wherever mask is True:
df.loc[mask, 'a'] = df['b']
Forward-fill the NaN values in a:
df['a'] = df['a'].ffill()
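As a quick illustration of what ffill does on its own, the sketch below uses a made-up column (the values "X" and "Y" are placeholders, not data from the question). Each missing value takes the last non-missing value above it.

```python
import numpy as np
import pandas as pd

# Illustrative column (values are made up for the example).
a = pd.Series(["X", np.nan, np.nan, "Y", np.nan], dtype=object)

# Each NaN is replaced by the most recent non-NaN value above it.
filled = a.ffill()
print(filled.tolist())  # ['X', 'X', 'X', 'Y', 'Y']
```

Note that the fill does not stop at blank separator rows, so those rows also receive a value at this point.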
Replace the values in a with NaN wherever mask is True:
df.loc[mask, 'a'] = np.nan
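The three steps can be wrapped into one helper for reuse. This is only a sketch: the function name fill_group_labels and its key parameter are hypothetical, and the frame is built with np.nan directly (as object columns) rather than with 'NAN' strings, so the replace step is not needed.

```python
import numpy as np
import pandas as pd


def fill_group_labels(df, key="EQUITY"):
    """Hypothetical helper: label the rows under each header row in column 'b'."""
    out = df.copy()
    # Header rows (containing `key`) and blank rows (NaN) are both True.
    mask = out["b"].str.contains(key, na=True)
    out.loc[mask, "a"] = out["b"]  # copy header strings into 'a'
    out["a"] = out["a"].ffill()    # carry each header down its block
    out.loc[mask, "a"] = np.nan    # blank out header and separator rows
    return out


# Same layout as the question's example, with real NaN values.
df = pd.DataFrame({
    "a": pd.Series([np.nan] * 10, dtype=object),
    "b": pd.Series(["BABA UN EQUITY", "2018", "2017", "2016", np.nan,
                    "700 HK EQUITY", "2018", "2017", "2016", np.nan],
                   dtype=object),
})
result = fill_group_labels(df)
print(result)
```

With na=False instead, the blank rows 4 and 9 would not be part of mask, so they would keep the forward-filled string rather than being reset to NaN.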