pandas: Assign (add) a new column to a dask dataframe based on the values of 2 existing columns - involves a conditional statement

Question

Asked by ML_Passion

I would like to add a new column to an existing dask dataframe. Its values are computed from 2 existing columns, and the computation involves a conditional statement that checks for nulls:

DataFrame definition

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, "", 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)

Method-1 tried

def funcUpdate(row):
    if row['y'].isnull():
        return row['y']
    else:
        return  round((1 + row['x'])/(1+ 1/row['y']),4)

ddf = ddf.assign(z= ddf.apply(funcUpdate, axis=1 , meta = ddf))

It gives an error:

TypeError: Column assignment doesn't support type DataFrame

Method-2

ddf = ddf.assign(z = ddf.apply(lambda col: col.y if col.y.isnull() else  round((1 + col.x)/(1+ 1/col.y),4),axis = 1, meta = ddf))

Any idea how this should be done?

Answer 1

Answered by MRocklin

You can either use fillna (fast) or apply (slow but flexible).

Fillna

import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)

ddf['z'] = ddf.y.fillna((100 + ddf.x))

>>> df

   x      y
0  1  0.200
1  2    NaN
2  3  0.345
3  4  0.400
4  5  0.150

>>> ddf.compute()

   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150

Of course, in your case the function returns y when y is null, so the result would be null as well. I'm assuming that you didn't intend this, so I changed the output slightly: null values of y are replaced with 100 + x.

Use apply

As any pandas expert will tell you, using apply comes with a 10x to 100x slowdown penalty. Please beware.

That being said, the flexibility is useful. Your example almost works, except that you are providing improper metadata. Passing meta=ddf tells apply that the function produces a dataframe, when in fact I think your function was intended to produce a series; that is why the column assignment fails. You can have Dask guess the meta information for you (although it will complain) or you can specify the dtype explicitly. Both options are shown in the example below:

In [1]: import pandas as pd
   ...: 
   ...: import dask.dataframe as dd
   ...: df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
   ...: ddf = dd.from_pandas(df, npartitions=2)
   ...: 

In [2]: def func(row):
   ...:     if pd.isnull(row['y']):
   ...:         return row['x'] + 100
   ...:     else:
   ...:         return row['y']
   ...:     

In [3]: ddf['z'] = ddf.apply(func, axis=1)
/home/mrocklin/Software/anaconda/lib/python3.4/site-packages/dask/dataframe/core.py:2553: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)

In [4]: ddf.compute()
Out[4]: 
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150

In [5]: ddf['z'] = ddf.apply(func, axis=1, meta=float)

In [6]: ddf.compute()
Out[6]: 
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150

Answer 2

Answered by Calvin Smythe

I do not have any experience with dask, but the null check in funcUpdate will not catch the 2nd element. pandas treats None and NaN as null, not the empty string "".

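def funcUpdate(row):
    try:
        return  round((1 + row['x'])/(1+ 1/row['y']),4)
    # 1/"" raises TypeError; a Python float 0.0 raises ZeroDivisionError
    except (TypeError, ZeroDivisionError):
        return row['y']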

This is a possible workaround, but you would need to validate the data beforehand.
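
One way to do that validation, sketched here on the pandas frame from the question before it is handed to dask, is to coerce the column to numeric so that the empty string becomes NaN. Using pd.to_numeric with errors='coerce' is an assumption about what "validation" should mean for this data; it silently turns every non-numeric value into NaN.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, "", 0.345, 0.40, 0.15]})

# The empty string is not considered null, and it makes y an object column
empty_is_null = pd.isnull("")
dtype_before = df['y'].dtype

# Coerce anything that cannot be parsed as a number to NaN
df['y'] = pd.to_numeric(df['y'], errors='coerce')

print(empty_is_null, dtype_before, df['y'].dtype)
print(df['y'].isnull().tolist())
```

After this step the cleaned frame can be passed to dd.from_pandas as in the question, and null checks on y will see the 2nd element.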
