Python: Deleting rows from a pandas DataFrame based on a conditional expression involving len(string) gives KeyError

Question

Asked by sjs

I have a pandas DataFrame and I want to delete rows from it where the length of the string in a particular column is greater than 2.

I expect to be able to do this (per this answer):

df[(len(df['column name']) < 2)]

but I just get the error:

KeyError: u'no item named False'

What am I doing wrong?

(Note: I know I can use df.dropna() to get rid of rows that contain any NaN, but I didn't see how to remove rows based on a conditional expression.)

Answer 1

Accepted answer by BrenBarn

When you do len(df['column name']) you are just getting one number, namely the number of rows in the DataFrame (i.e., the length of the column itself). Comparing that number with 2 yields a single True or False, which pandas then tries to look up as a column label, hence the KeyError. If you want to apply len to each element in the column, use df['column name'].map(len). So try:

df[df['column name'].map(len) < 2]
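
To make the difference concrete, here is a minimal sketch. The sample data is made up for illustration, and the column name simply mirrors the placeholder used in the question.

```python
import pandas as pd

# Hypothetical sample data; 'column name' mirrors the question's placeholder.
df = pd.DataFrame({"column name": ["a", "bcd", "e", "fg"]})

# len() on a column returns the number of rows: one integer, not one per row.
n_rows = len(df["column name"])

# n_rows < 2 is therefore the plain bool False, and df[False] is treated as
# a lookup of a column labelled False, which does not exist.
try:
    df[n_rows < 2]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# map(len) applies len to every element, so the comparison yields one
# boolean per row, which is what boolean indexing expects.
short = df[df["column name"].map(len) < 2]
print(lookup_failed)
print(short)
```

Here short holds only the rows whose string is a single character.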

Answer 2

Answered by User

To directly answer this question's original title, "How to delete rows from a pandas DataFrame based on a conditional expression" (which I understand is not necessarily the OP's problem but could help other users coming across this question), one way to do this is to use the drop method:

df = df.drop(some labels)

df = df.drop(df[<some boolean condition>].index)

Example

To remove all rows where column 'score' is < 50:

df = df.drop(df[df.score < 50].index)

In-place version (as pointed out in the comments):

df.drop(df[df.score < 50].index, inplace=True)
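
A small sketch of both variants on made-up data (the column names and values are assumptions for illustration). It also shows that the in-place call returns None, so its result should not be assigned back to df.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"name": ["ann", "bob", "cy", "dee"],
                   "score": [72, 45, 50, 18]})

# Index labels of the rows to remove (score below 50).
to_drop = df[df.score < 50].index

# Variant 1: drop returns a new DataFrame, the original is untouched.
kept = df.drop(to_drop)

# Variant 2: inplace=True modifies the object and returns None.
df_inplace = df.copy()
result = df_inplace.drop(to_drop, inplace=True)

print(kept)
print(result)
```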

Multiple conditions

(see Boolean Indexing)

The operators are: | for or, & for and, and ~ for not. Each condition must be wrapped in parentheses, because these operators bind more tightly than comparisons such as < and >.

To remove all rows where column 'score' is < 50 and > 20:

df = df.drop(df[(df.score < 50) & (df.score > 20)].index)
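
The following sketch, on made-up data, shows the combined condition in action, an equivalent form that negates the mask with ~ instead of calling drop, and what happens when the parentheses are left out.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"score": [10, 25, 40, 50, 80]})

# Rows with 20 < score < 50; each comparison is parenthesised.
between = (df.score < 50) & (df.score > 20)

# Remove those rows with drop ...
kept = df.drop(df[between].index)

# ... or keep everything that does NOT match, using ~ for "not".
kept_alt = df[~between]

# Without parentheses, & is evaluated before < and >, so the expression
# becomes a chained comparison on Series and raises ValueError.
try:
    df.score < 50 & df.score > 20
    raised = False
except ValueError:
    raised = True

print(kept)
print(raised)
```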

Answer 3

Answered by Kabard

You can reassign the DataFrame to a filtered version of itself:

df = df[df.score > 50]

This is faster than drop:

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test[test.x < 0]
# 54.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test.drop(test[test.x > 0].index, inplace=True)
# 201 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test.drop(test[test.x > 0].index)
# 194 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
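
Apart from speed, the two approaches give the same rows only when the index labels are unique. The sketch below uses made-up data to check the equivalence and to show the difference with duplicate labels, where drop removes every row carrying a matching label.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a default (unique) integer index.
rng = np.random.default_rng(0)
test = pd.DataFrame({"x": rng.standard_normal(1000)})

filtered = test[test.x < 0]
dropped = test.drop(test[test.x >= 0].index)
same = filtered.equals(dropped)

# Hypothetical data with a duplicated index label (0 appears twice).
dup = pd.DataFrame({"x": [-1.0, 2.0, -3.0]}, index=[0, 0, 1])

# The boolean mask works row by row.
by_mask = dup[dup.x < 0]

# drop works on labels: label 0 is dropped wherever it occurs,
# including the row with x = -1.0.
by_drop = dup.drop(dup[dup.x >= 0].index)

print(same)
print(by_mask)
print(by_drop)
```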

Answer 4

Answered by YOBEN_S

In pandas you can use str.len with your boundary and use the Boolean result to filter:

df[df['column name'].str.len().lt(2)]
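
One practical difference from map(len) shows up with missing values. In this sketch (made-up data), str.len leaves the missing entry as missing and the comparison treats it as not matching, while map(len) raises a TypeError.

```python
import pandas as pd

# Hypothetical sample data containing a missing value.
df = pd.DataFrame({"column name": ["a", None, "bcd", ""]})

# str.len() returns a missing value for the missing entry ...
lengths = df["column name"].str.len()

# ... and that row does not pass the < 2 filter.
short = df[lengths.lt(2)]

# map(len) calls len() on the missing value itself and fails.
try:
    df["column name"].map(len)
    map_failed = False
except TypeError:
    map_failed = True

print(short)
print(map_failed)
```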

Answer 5

Answered by jayanti prasad

If you want to drop rows of a data frame on the basis of some complicated condition on a column value, writing it in the way shown above can get complicated. I have the following simpler solution, which works for an arbitrary function of the column value. Suppose the condition is on the column with header 'name'; first get that column as a list:

text_data = df['name'].tolist()

Now apply some function to every element of the list and put the result in a pandas Series:

text_length = pd.Series([func(t) for t in text_data])

In my case I was just trying to get the number of tokens:

text_length = pd.Series([len(t.split()) for t in text_data])

Now add one extra column with the above series to the data frame:

df = df.assign(text_length=text_length.values)

Now we can apply a condition on the new column, such as:

df = df[df.text_length > 10]

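Putting the steps together in a function:

import pandas as pd

def pass_filter(df, label, length, pass_type):
    # Count whitespace-separated tokens in each entry of column `label`.
    text_data = df[label].tolist()
    text_length = pd.Series([len(t.split()) for t in text_data])
    # Use .values so the counts are assigned by position, not by index label.
    df = df.assign(text_length=text_length.values)
    # 'high' keeps rows with more than `length` tokens, 'low' keeps rows with fewer.
    if pass_type == 'high':
        df = df[df.text_length > length]
    if pass_type == 'low':
        df = df[df.text_length < length]
    # Remove the helper column before returning.
    df = df.drop(columns=['text_length'])
    return df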
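
For the specific case of counting tokens, the same filter can probably be written without the temporary list and helper column by chaining pandas string methods. This is an alternative sketch, not the answer's own method, and the sample data is made up.

```python
import pandas as pd

# Hypothetical sample data with a non-default index.
df = pd.DataFrame(
    {"name": ["one", "two words", "three short words", "a b c d"]},
    index=[10, 11, 12, 13],
)

# str.split() splits each entry on whitespace into a list of tokens;
# str.len() then gives the length of each list, i.e. the token count.
token_count = df["name"].str.split().str.len()

# Keep rows with more than two tokens.
long_rows = df[token_count > 2]
print(long_rows)
```

Because token_count shares the index of df, the mask lines up by label and no .values workaround is needed.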

Answer 6

Answered by Zakir

I will expand on @User's generic solution to provide a drop-free alternative. This is for folks directed here by the question's title (not the OP's problem).

Say you want to delete all rows with negative values (strictly, the condition below keeps only rows in which every value is positive). A one-liner solution is:

df = df[(df > 0).all(axis=1)]

Step-by-step explanation:

Let's generate a 5x5 random normal distribution data frame

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'))
      A         B         C         D         E
0  1.764052  0.400157  0.978738  2.240893  1.867558
1 -0.977278  0.950088 -0.151357 -0.103219  0.410599
2  0.144044  1.454274  0.761038  0.121675  0.443863
3  0.333674  1.494079 -0.205158  0.313068 -0.854096
4 -2.552990  0.653619  0.864436 -0.742165  2.269755

Let the condition be deleting negatives. A boolean DataFrame marking where the condition holds:

df > 0
      A     B      C      D      E
0   True  True   True   True   True
1  False  True  False  False   True
2   True  True   True   True   True
3   True  True  False   True  False
4  False  True   True  False   True

A boolean Series marking the rows that satisfy the condition. Note that if any element in a row fails the condition, the row is marked False:

(df > 0).all(axis=1)
0     True
1    False
2     True
3    False
4    False
dtype: bool

Finally, filter the rows of the data frame based on the condition:

df[(df > 0).all(axis=1)]
      A         B         C         D         E
0  1.764052  0.400157  0.978738  2.240893  1.867558
2  0.144044  1.454274  0.761038  0.121675  0.443863

You can assign it back to df to actually delete the rows, rather than just filtering as done above:
df = df[(df > 0).all(axis=1)]

This can easily be extended to filter out rows containing NaNs (missing values):
df = df[(~df.isnull()).all(axis=1)]
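
As a quick check on made-up data, the mask-based expression gives the same result as the built-in df.dropna() mentioned in the question, and df.notna() can replace the ~df.isnull() negation.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values in both columns.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": ["x", "y", None]})

# Mask-based filtering: keep rows where no entry is null.
via_mask = df[(~df.isnull()).all(axis=1)]

# Same mask written with notna() instead of negating isnull().
via_notna = df[df.notna().all(axis=1)]

# Built-in shortcut: drop rows containing any missing value.
via_dropna = df.dropna()

print(via_mask)
```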

This can also be simplified for cases like "delete all rows where column E is negative":

df = df[(df.E>0)]

I would like to end with some profiling stats showing that @User's drop solution is slower than raw column-based filtering:

%timeit df_new = df[(df.E>0)]
345 μs ± 10.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dft.drop(dft[dft.E < 0].index, inplace=True)
890 μs ± 94.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A column is basically a Series, i.e. a NumPy array under the hood, so it can be indexed at very little cost. For folks interested in how the underlying memory organization plays into execution speed, here is a great link on speeding up pandas:
