pandas element-wise comparison and create selection
Original question: http://stackoverflow.com/questions/37406619/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow (not to this page).
Asked by mati
In a DataFrame, I would like to compare the elements of a column with a value and put the elements that pass the comparison into a new column.
import pandas
df = pandas.DataFrame([{'A': 3, 'B': 10},
                       {'A': 2, 'B': 30},
                       {'A': 1, 'B': 20},
                       {'A': 2, 'B': 15},
                       {'A': 2, 'B': 100}])
df['C'] = [x for x in df['B'] if x > 18]
I can't figure out what's wrong and why I get:
ValueError: Length of values does not match length of index
Accepted answer by Saranya Krishnamurthy
As Darren mentioned, all columns in a DataFrame must have the same length.
When you evaluate [x for x in df['B'] if x > 18], you get only three values: [30, 20, 100]. But the DataFrame has five rows in its index. That is why you get the "Length of values does not match length of index" error.
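To see the mismatch directly, one can compare the length of the filtered list with the number of rows. This is a minimal sketch using the DataFrame from the question; the variable name `filtered` is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame([{'A': 3, 'B': 10},
                   {'A': 2, 'B': 30},
                   {'A': 1, 'B': 20},
                   {'A': 2, 'B': 15},
                   {'A': 2, 'B': 100}])

# The comprehension drops every value that fails the test,
# so the result is shorter than the DataFrame.
filtered = [x for x in df['B'] if x > 18]

print(filtered)
print(len(filtered))  # 3
print(len(df))        # 5
```

Assigning a plain list to a column requires one value per row, which is why a three-element list cannot become a column of a five-row DataFrame.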
You can change your code as follows:
df['C'] = [x if x > 18 else None for x in df['B']]
print(df)
You will get:
   A    B      C
0  3   10    NaN
1  2   30   30.0
2  1   20   20.0
3  2   15    NaN
4  2  100  100.0
Answer by jezrael
I think you can use loc with boolean indexing:
print(df)
   A    B
0  3   10
1  2   30
2  1   20
3  2   15
4  2  100
print(df['B'] > 18)
0    False
1     True
2     True
3    False
4     True
Name: B, dtype: bool
df.loc[df['B'] > 18, 'C'] = df['B']
print(df)
   A    B      C
0  3   10    NaN
1  2   30   30.0
2  1   20   20.0
3  2   15    NaN
4  2  100  100.0
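The right-hand side of that assignment is the full five-row column, yet no length error occurs: when a Series is assigned through `loc`, pandas aligns it on index labels and uses only the rows selected by the mask. A small sketch that makes the mask explicit (the name `mask` is introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})

mask = df['B'] > 18  # boolean Series, one entry per row

# Only rows where mask is True receive a value; the rest of the
# new column C is filled with NaN.
df.loc[mask, 'C'] = df.loc[mask, 'B']

print(df)
```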
If you need to select rows by condition, use boolean indexing:
print(df[df['B'] > 18])
   A    B
1  2   30
2  1   20
4  2  100
If you need something faster, use where:
df['C'] = df.B.where(df['B'] > 18)
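`Series.where` keeps the values where the condition holds and substitutes a replacement elsewhere (NaN by default). As a hedged sketch, assuming a fill value of 0 is acceptable for your data (the column name `D` is made up for this example):

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})

# Default: rows failing the condition become NaN, so C is a float column.
df['C'] = df['B'].where(df['B'] > 18)

# With an explicit replacement value, no NaN is introduced.
df['D'] = df['B'].where(df['B'] > 18, 0)

print(df)
```

Whether 0 or NaN is the better placeholder depends on what later code does with the column, since NaN is skipped by most aggregations while 0 is not.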
Timings (len(df)=50k):
In [1367]: %timeit (a(df))
The slowest run took 8.34 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.14 ms per loop
In [1368]: %timeit (b(df1))
100 loops, best of 3: 15.5 ms per loop
In [1369]: %timeit (c(df2))
100 loops, best of 3: 2.93 ms per loop
Code for timings:
import pandas as pd
df = pd.DataFrame([{'A': 3, 'B': 10},
                   {'A': 2, 'B': 30},
                   {'A': 1, 'B': 20},
                   {'A': 2, 'B': 15},
                   {'A': 2, 'B': 100}])
print(df)
# Repeat the five rows 10,000 times to get 50,000 rows
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()
# a: Series.where
def a(df):
    df['C'] = df.B.where(df['B'] > 18)
    return df
# b: list comprehension
def b(df1):
    df1['C'] = [x if x > 18 else None for x in df1['B']]
    return df1
# c: loc with a boolean mask
def c(df2):
    df2.loc[df2['B'] > 18, 'C'] = df2['B']
    return df2
print(a(df))
print(b(df1))
print(c(df2))
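Outside IPython, the standard-library `timeit` module gives a rough equivalent of `%timeit`. The sketch below is one way the comparison might be reproduced, not the code behind the numbers above; the function names `use_where`, `use_listcomp`, and `use_loc` are made up here, and absolute timings will vary with the machine and the pandas version.

```python
import timeit
import pandas as pd

base = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})
big = pd.concat([base] * 10000).reset_index(drop=True)  # 50,000 rows

def use_where(frame):
    frame['C'] = frame['B'].where(frame['B'] > 18)
    return frame

def use_listcomp(frame):
    frame['C'] = [x if x > 18 else None for x in frame['B']]
    return frame

def use_loc(frame):
    frame.loc[frame['B'] > 18, 'C'] = frame['B']
    return frame

for func in (use_where, use_listcomp, use_loc):
    frame = big.copy()  # each function works on its own copy
    seconds = timeit.timeit(lambda: func(frame), number=10)
    print(f"{func.__name__}: {seconds / 10 * 1000:.2f} ms per call")
```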
Answer by Darren Cook
All columns in a DataFrame have to be the same length. Because you are filtering away some values, you are trying to insert fewer values into column C than there are in columns A and B.
So your two options are to start a new DataFrame for C:
dfC = [x for x in df['B'] if x > 18]
or put some dummy value in the column for when x is not greater than 18. E.g.:
import numpy as np
df['C'] = np.where(df['B'] > 18, True, False)
Or even:
df['C'] = np.where(df['B'] > 18, 'Yay', 'Nay')
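Both options can also be written so that the result carries the original values. The sketch below is one possible reading of the two options, not necessarily what the answer's author had in mind: the first builds an actual, shorter DataFrame (the list comprehension earlier yields a plain Python list), and the second uses `np.where` with NaN as an assumed placeholder.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})

# Option 1: a separate DataFrame holding only the rows that pass.
# The original index labels (1, 2, 4) are preserved.
dfC = df.loc[df['B'] > 18, ['B']].rename(columns={'B': 'C'})

# Option 2: np.where(condition, value_if_true, value_if_false) returns
# one entry per row, so the length always matches the index.
df['C'] = np.where(df['B'] > 18, df['B'], np.nan)

print(dfC)
print(df)
```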
P.S. Also take a look at "Pandas conditional creation of a series/dataframe column" for other ways to do this.