Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/42895061/

Time: 2020-09-14 03:14:00  Source: igfitidea

How to remove a row from pandas dataframe based on the length of the column values?

python pandas dataframe string-length

Asked by everestial007

In the following pandas.DataFrame:

df = 
    alfa    beta   ceta
    a,b,c   c,d,e  g,e,h
    a,b     d,e,f  g,h,k
    j,k     c,k,l  f,k,n

How do I drop the rows in which the column value for alfa has more than 2 elements? I know this can be done using a length function, but I have not found a specific answer.

df = df[['alfa'].str.split(',').map(len) < 3]
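
The attempt above fails because `['alfa']` is a plain Python list, which has no `.str` accessor; the column has to be selected from the frame first. A minimal sketch of the corrected filter, assuming the sample data from the question (the frame is rebuilt here only so the snippet runs on its own):

```python
import pandas as pd

# Sample frame from the question, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "alfa": ["a,b,c", "a,b", "j,k"],
    "beta": ["c,d,e", "d,e,f", "c,k,l"],
    "ceta": ["g,e,h", "g,h,k", "f,k,n"],
})

# Select the column from the frame, split each string on commas,
# and keep the rows whose resulting list has fewer than 3 items.
mask = df["alfa"].str.split(",").map(len) < 3
filtered = df[mask]
print(filtered)
```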

Answered by Stephen Rauch

You can apply that test to each value of the column in turn using apply() (here pandas.Series.apply(), since df['alfa'] is a Series):

print(df[df['alfa'].apply(lambda x: len(x.split(',')) < 3)])

Gives:

  alfa   beta   ceta
1  a,b  d,e,f  g,h,k
2  j,k  c,k,l  f,k,n

Answered by piRSquared

This is the numpy version of @NickilMaveli's answer.

# np.char.count is the public name; the np.core path is deprecated in newer NumPy
mask = np.char.count(df.alfa.values.astype(str), ',') <= 1
pd.DataFrame(df.values[mask], df.index[mask], df.columns)

  alfa   beta   ceta
1  a,b  d,e,f  g,h,k
2  j,k  c,k,l  f,k,n
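
A self-contained sketch of this NumPy approach, assuming the sample data from the question. It calls the element-wise string count through `np.char.count`; the variable name `result` is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame from the question, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "alfa": ["a,b,c", "a,b", "j,k"],
    "beta": ["c,d,e", "d,e,f", "c,k,l"],
    "ceta": ["g,e,h", "g,h,k", "f,k,n"],
})

# Count commas in each string; at most one comma means at most two elements.
mask = np.char.count(df["alfa"].to_numpy().astype(str), ",") <= 1

# Rebuild the frame from the masked values, index, and columns.
result = pd.DataFrame(df.to_numpy()[mask], index=df.index[mask], columns=df.columns)
print(result)
```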


naive timing

[image: timing comparison]

Answered by mikkokotila

Here is an option that is the easiest to remember while still embracing the DataFrame, which is the "bleeding heart" of Pandas:

1) Create a new column in the dataframe holding the number of comma-separated elements (calling .str.len() directly on the strings would count characters, not elements):

df['length'] = df.alfa.str.split(',').str.len()

2) Index using the new column:

df = df[df.length < 3]

For comparison with the timings above: they are not really relevant here because the data is very small, and speed usually matters less than how likely you are to remember how to do something without interrupting your workflow:

step 1:

%timeit df['length'] = df.alfa.str.len()

359 μs ± 6.83 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

step 2:

df = df[df.length < 3]

627 μs ± 76.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The good news is that as the data grows, the time does not grow proportionally. For example, doing the same operation on 30,000 rows of data takes about 3 ms (so 10,000x the data for roughly 3x the time). A Pandas DataFrame is like a train: it takes energy to get it going, so it is not great for small things in an absolute comparison, but that hardly matters, since small data is fast anyway.
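
One way to check the scaling claim on your own machine is a small `timeit` sketch like the one below. The helper name `time_filter` and the row counts are illustrative assumptions, and no particular timings are implied, since they depend on hardware and library versions.

```python
import timeit

import pandas as pd


def time_filter(n_rows, repeats=20):
    """Return the average seconds per run of the two-step filter on n_rows rows."""
    # Build a frame by repeating the three sample values from the question.
    base = ["a,b,c", "a,b", "j,k"]
    df = pd.DataFrame({"alfa": base * (n_rows // 3)})

    def run():
        # Count elements per row, then keep rows with fewer than 3 elements.
        length = df["alfa"].str.split(",").str.len()
        return df[length < 3]

    return timeit.timeit(run, number=repeats) / repeats


for n in (3, 30_000):
    print(f"{n:>6} rows: {time_filter(n) * 1e3:.3f} ms per run")
```

Much of the cost on tiny frames is fixed per-call overhead, which is why the small case looks slow relative to its size.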

Answered by Craig

How's this?

df = df[df['alfa'].str.split(',', expand=True)[2].isnull()]

Using expand=True creates a new dataframe with one column for each item in the list. If the list has three or more items, then the third column will have a non-null value.

One problem with this approach is that if none of the lists have three or more items, selecting column [2] will cause a KeyError. Based on this, it's safer to use the solution posted by @Stephen Rauch.
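
One possible way around the KeyError, offered as a sketch and not as part of the original answer: reindex the expanded frame so that column 2 always exists. The frame `short_df` below is made-up data in which no value has three elements.

```python
import pandas as pd

# Hypothetical data: no value in 'alfa' has three or more elements.
short_df = pd.DataFrame({"alfa": ["a,b", "j,k", "x"]})

# expand=True yields one column per piece; here only columns 0 and 1 exist.
parts = short_df["alfa"].str.split(",", expand=True)

# reindex adds any missing columns (filled with NaN), so [2] cannot raise KeyError.
parts = parts.reindex(columns=range(3))

# A null third column means the row had fewer than three elements.
result = short_df[parts[2].isnull()]
print(result)
```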

Answered by Nickil Maveli

There are at least two ways to subset the given DF:

1) Split on the comma separator and then compute the length of the resulting list:

df[df['alfa'].str.split(",").str.len().lt(3)]

2) Count the number of commas and add 1 to the result to account for the last element:

df[df['alfa'].str.count(",").add(1).lt(3)] 

Both produce:

  alfa   beta   ceta
1  a,b  d,e,f  g,h,k
2  j,k  c,k,l  f,k,n
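
As a quick check that the two variants select the same rows, here is a small sketch assuming the sample data from the question (the names `by_split` and `by_count` are illustrative):

```python
import pandas as pd

# Sample frame from the question, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "alfa": ["a,b,c", "a,b", "j,k"],
    "beta": ["c,d,e", "d,e,f", "c,k,l"],
    "ceta": ["g,e,h", "g,h,k", "f,k,n"],
})

# Variant 1: split on commas and measure the length of each resulting list.
by_split = df["alfa"].str.split(",").str.len().lt(3)

# Variant 2: count commas and add 1 to get the number of elements.
by_count = df["alfa"].str.count(",").add(1).lt(3)

# Both masks should agree row for row.
assert (by_split == by_count).all()
print(df[by_split])
```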