Original source: http://stackoverflow.com/questions/42212496/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Assign (add) a new column to a dask dataframe based on values of 2 existing columns - involves a conditional statement
Asked by ML_Passion
I would like to add a new column to an existing dask dataframe based on the values of two existing columns. The computation involves a conditional statement that checks for nulls:
DataFrame definition
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, "", 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)
Method-1 tried
def funcUpdate(row):
    if row['y'].isnull():
        return row['y']
    else:
        return round((1 + row['x'])/(1+ 1/row['y']),4)
ddf = ddf.assign(z= ddf.apply(funcUpdate, axis=1 , meta = ddf))
It gives an error:
TypeError: Column assignment doesn't support type DataFrame
Method-2
ddf = ddf.assign(z = ddf.apply(lambda col: col.y if col.y.isnull() else round((1 + col.x)/(1+ 1/col.y),4),axis = 1, meta = ddf))
Any idea how this should be done?
Answered by MRocklin
You can either use fillna (fast) or apply (slow but flexible).
Fillna
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)
ddf['z'] = ddf.y.fillna((100 + ddf.x))
>>> df
   x      y
0  1  0.200
1  2    NaN
2  3  0.345
3  4  0.400
4  5  0.150
>>> ddf.compute()
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150
Of course, in this case, because your function returns y when y is null, the result will be null as well. I'm assuming that you didn't intend this, so I changed the output slightly.
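If the formula from the question is what is actually wanted, it can be computed in the same vectorized style, because NaN propagates through arithmetic and no explicit null check is needed. The following is a minimal sketch on plain pandas for brevity. It assumes the same column expression carries over unchanged to the dask dataframe ddf, and the column name z follows the question.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [0.2, None, 0.345, 0.40, 0.15]})

# Vectorized form of the question's formula. Where y is NaN the
# arithmetic yields NaN, so the null rows stay null automatically.
df['z'] = ((1 + df.x) / (1 + 1 / df.y)).round(4)

print(df)
```

This avoids the row-by-row function call entirely, which is the main reason the vectorized route is the fast one.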
Use apply
As any Pandas expert will tell you, using apply comes with a 10x to 100x slowdown penalty. Please beware.
That being said, the flexibility is useful. Your example almost works, except that you are providing improper metadata. You are telling apply that the function produces a dataframe, when in fact I think that your function was intended to produce a series. You can have Dask guess the meta information for you (although it will complain) or you can specify the dtype explicitly. Both options are shown in the example below:
In [1]: import pandas as pd
   ...: import dask.dataframe as dd
   ...: df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
   ...: ddf = dd.from_pandas(df, npartitions=2)
   ...:
In [2]: def func(row):
   ...:     if pd.isnull(row['y']):
   ...:         return row['x'] + 100
   ...:     else:
   ...:         return row['y']
   ...:
In [3]: ddf['z'] = ddf.apply(func, axis=1)
/home/mrocklin/Software/anaconda/lib/python3.4/site-packages/dask/dataframe/core.py:2553: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
warnings.warn(msg)
In [4]: ddf.compute()
Out[4]:
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150
In [5]: ddf['z'] = ddf.apply(func, axis=1, meta=float)
In [6]: ddf.compute()
Out[6]:
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150
Answered by Calvin Smythe
I do not have any experience with dask, but the boolean test in funcUpdate will not catch that second element as null. In pandas, null values are None or NaN, not the empty string "".
def funcUpdate(row):
    try:
        return round((1 + row['x'])/(1+ 1/row['y']),4)
    # "" or None raises TypeError; a zero y raises ZeroDivisionError
    except (TypeError, ZeroDivisionError):
        return row['y']
This is a possible workaround, but you would need to validate the data beforehand.
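One way to do that validation, sketched on the question's data, is to coerce y to numeric so that the empty string becomes NaN, after which null checks behave as expected. pd.to_numeric with errors='coerce' is a standard pandas call. Running it on the pandas frame before building the dask dataframe is an assumption about where the cleanup would happen, and the names nulls_before and nulls_after are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [0.2, "", 0.345, 0.40, 0.15]})

# The empty string is not a pandas null, so this check misses it.
nulls_before = pd.isnull(df['y']).tolist()

# Coerce non-numeric entries (here, "") to NaN; y becomes a float column.
df['y'] = pd.to_numeric(df['y'], errors='coerce')
nulls_after = pd.isnull(df['y']).tolist()

print(nulls_before)
print(nulls_after)
```

After this step the column is float64, so either of the approaches from the other answer can be applied without the try/except.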