pandas: filter dataframe rows based on length of column values
Original question: http://stackoverflow.com/questions/45089650/
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow and keep the same license.
Asked by D.prd
I have a pandas dataframe as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [np.nan, 1], ['test string1', 5]], columns=['A', 'B'])
df
              A  B
0             1  2
1           NaN  1
2  test string1  5
I am using pandas 0.20. What is the most efficient way to remove every row in which any column value has length > 10?
len('test string1')  # 12
So for the example above, I expect the following output:
df
     A  B
0    1  2
1  NaN  1
Answered by Zero
To filter based on column A only:
In [865]: df[~(df.A.str.len() > 10)]
Out[865]:
     A  B
0    1  2
1  NaN  1
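The negated form matters here. As a minimal sketch, assuming the same example frame as in the question: `.str.len()` yields NaN for values that are not strings, and any comparison with NaN is False, so writing `<= 10` directly would drop the non-string rows as well. The variable names below are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [np.nan, 1], ['test string1', 5]], columns=['A', 'B'])

# .str.len() gives NaN for the integer and for NaN, and 12 for the string.
lengths = df['A'].str.len()

# NaN > 10 is False, so negating keeps the non-string rows.
keep_negated = ~(lengths > 10)

# NaN <= 10 is also False, so this mask drops the non-string rows too.
keep_direct = lengths <= 10

print(df[keep_negated])
print(df[keep_direct])
```

With this data the negated mask keeps rows 0 and 1, while the direct mask keeps nothing.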
To filter based on all columns:
In [866]: df[~df.applymap(lambda x: len(str(x)) > 10).any(axis=1)]
Out[866]:
     A  B
0    1  2
1  NaN  1
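On recent pandas versions `applymap` is deprecated (since 2.1) in favour of `DataFrame.map`. The following is a sketch of the same filter that picks whichever method is available; the `elementwise` and `too_long` names are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [np.nan, 1], ['test string1', 5]], columns=['A', 'B'])

# DataFrame.map exists from pandas 2.1 on; older versions only offer applymap.
elementwise = df.map if hasattr(df, 'map') else df.applymap

# True wherever the string form of a cell is longer than 10 characters.
too_long = elementwise(lambda x: len(str(x)) > 10)

# Keep only the rows in which no cell is too long.
result = df[~too_long.any(axis=1)]
print(result)
```

Note that this measures the string form of every cell, so a float with many digits would also count as "long".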
Answered by Elizabeth
I had to cast to a string for Diego's answer to work:
df = df[df['A'].apply(lambda x: len(str(x)) <= 10)]
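A small sketch of why the cast is needed, assuming the example frame from the question: column A holds an integer and a NaN, and calling `len()` on either raises TypeError. The `failed` flag is only there to make the behaviour visible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [np.nan, 1], ['test string1', 5]], columns=['A', 'B'])

# Without the cast, len() is called on the integer 1 and raises TypeError.
try:
    df['A'].apply(lambda x: len(x) <= 10)
    failed = False
except TypeError:
    failed = True

# With the cast, every value is measured through its string form.
mask = df['A'].apply(lambda x: len(str(x)) <= 10)

print(failed)
print(df[mask])
```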
Answered by MaxU
In [42]: df
Out[42]:
              A  B                         C          D
0             1  2                         2 2017-01-01
1           NaN  1                       NaN 2017-01-02
2  test string1  5  test string1test string1 2017-01-03
In [43]: df.dtypes
Out[43]:
A object
B int64
C object
D datetime64[ns]
dtype: object
In [44]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[44]:
     A  B    C          D
0    1  2    2 2017-01-01
1  NaN  1  NaN 2017-01-02
Explanation:
df.select_dtypes(['object']) selects only the columns of object (str) dtype:
In [45]: df.select_dtypes(['object'])
Out[45]:
              A                         C
0             1                         2
1           NaN                       NaN
2  test string1  test string1test string1
In [46]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10))
Out[46]:
       A      C
0  False  False
1  False  False
2   True   True
Now we can "aggregate" it row-wise, so that a row is flagged if any of its columns is True:
In [47]: df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)
Out[47]:
0 False
1 False
2 True
dtype: bool
Finally, we can select only those rows where the value is False:
In [48]: df.loc[~df.select_dtypes(['object']).apply(lambda x: x.str.len().gt(10)).any(axis=1)]
Out[48]:
     A  B    C          D
0    1  2    2 2017-01-01
1  NaN  1  NaN 2017-01-02
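The steps above can be collected into one small function. This is only a sketch: the name `drop_long_string_rows` and the `max_len` parameter are hypothetical, and the four-column frame is rebuilt here from the output shown in this answer. It assumes every object column holds at least some strings; object columns without any strings may need separate handling.

```python
import numpy as np
import pandas as pd

def drop_long_string_rows(frame, max_len=10):
    """Drop rows in which any object-dtype column holds a string longer than max_len."""
    # Step 1: look only at object columns, where strings live.
    text_cols = frame.select_dtypes(['object'])
    # Step 2: per column, flag strings that are too long (non-strings give False).
    # Step 3: a row is flagged if any of its columns is flagged.
    too_long = text_cols.apply(lambda col: col.str.len().gt(max_len)).any(axis=1)
    # Step 4: keep the rows that were not flagged.
    return frame.loc[~too_long]

df = pd.DataFrame({
    'A': [1, np.nan, 'test string1'],
    'B': [2, 1, 5],
    'C': [2, np.nan, 'test string1test string1'],
    'D': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03']),
})

result = drop_long_string_rows(df)
print(result)
```

Because only object columns are inspected, the integer and datetime columns never take part in the length check.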
Answered by Diego Aguado
Use the apply function of Series to keep only the rows whose value in column A has length at most 10:
df = df[df['A'].apply(lambda x: len(x) <= 10)]
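If column A mixes strings with numbers or NaN, one option is to measure only real strings instead of casting everything. The sketch below uses a hypothetical helper, `is_short`, and a made-up frame to show how that differs from the `len(str(x))` variant: a float with many digits is kept by the type check but dropped by the cast.

```python
import pandas as pd

def is_short(value, max_len=10):
    # Hypothetical helper: anything that is not a str (ints, floats, NaN) counts as short.
    return not isinstance(value, str) or len(value) <= max_len

# Illustrative data, not from the original question.
df = pd.DataFrame({'A': [1, 3.14159265358979, 'test string1', 'short']})

# Only real strings are measured.
kept_by_type_check = df[df['A'].apply(is_short)]

# Every value is measured through its string form, including the float.
kept_by_cast = df[df['A'].apply(lambda x: len(str(x)) <= 10)]

print(kept_by_type_check)
print(kept_by_cast)
```

Which behaviour is right depends on whether long numeric values should count as "long" in your data.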