How to replace all non-NaN entries of a dataframe with 1 and all NaN with 0

Note: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/37543647/

Time: 2020-09-14 01:19:22  Source: igfitidea

python pandas dataframe

Asked by Anirban De

I have a dataframe with 71 columns and 30597 rows. I want to replace all non-nan entries with 1 and the nan values with 0.

Initially I tried a for loop over each value of the dataframe, which took too much time.

Then I used data_new=data.subtract(data), which was meant to subtract every value of the dataframe from itself so that all the non-null values would become 0. But an error occurred because the dataframe contains multiple string entries.

Answered by fmarc

You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise, and cast it to integer, giving you 0 where the DataFrame is NaN and 1 otherwise:

newdf = df.notnull().astype('int')

If you really want to write into your original DataFrame, this will work:

df.loc[~df.isnull()] = 1  # not nan
df.loc[df.isnull()] = 0   # nan
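
Regarding the string entries mentioned in the question, here is a minimal sketch of the notnull() cast on a frame with mixed types. The column names and values are made up for illustration. Because notnull() only tests for missing values, no arithmetic is attempted on the strings:

```python
import numpy as np
import pandas as pd

# Hypothetical data mixing numbers, strings and missing values
df = pd.DataFrame({
    "num": [1.5, np.nan, 3.0],
    "txt": ["a", None, "c"],
})

# notnull() only checks for missing values, so strings are not a problem
newdf = df.notnull().astype("int")
print(newdf)
```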

Answered by jezrael

Use notnull and cast the boolean result to int with astype:

print ((df.notnull()).astype('int'))

Sample:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1,np.nan,3]})
print (df)
     a    b
0  NaN  1.0
1  4.0  NaN
2  NaN  3.0

print (df.notnull())
       a      b
0  False   True
1   True  False
2  False   True

print ((df.notnull()).astype('int'))
   a  b
0  0  1
1  1  0
2  0  1

Answered by tnknepp

I do a lot of data analysis and am interested in finding new/faster methods of carrying out operations. I had never come across jezrael's method, so I was curious to compare it with my usual method (i.e. replace by indexing). NOTE: This is not an answer to the OP's question, rather it is an illustration of the efficiency of jezrael's method. Since this is NOT an answer I will remove this post if people do not find it useful (and after being downvoted into oblivion!). Just leave a comment if you think I should remove it.

I created a moderately sized dataframe and did multiple replacements using both the df.notnull().astype(int) method and simple indexing (how I would normally do this). It turns out that the latter is slower by approximately five times. Just an fyi for anyone doing larger-scale replacements.

from __future__ import division, print_function

import numpy as np
import pandas as pd
import datetime as dt


# create dataframe with randomly placed NaNs
data = np.ones( (100,100) )  # shape must be integers, not floats such as 1e2
data.ravel()[np.random.choice(data.size,data.size//10,replace=False)] = np.nan  # integer sample size

df = pd.DataFrame(data=data)

trials = np.arange(100)


d1 = dt.datetime.now()

for r in trials:
    new_df = df.notnull().astype(int)

print( (dt.datetime.now()-d1).total_seconds()/trials.size )


# create a dummy copy of df.  I use a dummy copy here to prevent biasing the 
# time trial with dataframe copies/creations within the upcoming loop
df_dummy = df.copy()

d1 = dt.datetime.now()

for r in trials:
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull()==False] = 1

print( (dt.datetime.now()-d1).total_seconds()/trials.size )

This yields times of 0.142 s and 0.685 s respectively. It is clear who the winner is.
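
For anyone who wants to repeat the comparison on a current Python and pandas, here is a rough sketch using timeit. The helper names by_cast and by_indexing, the seed, and the number of repetitions are my own choices, and the absolute timings will differ from the figures above depending on machine and library versions:

```python
import timeit

import numpy as np
import pandas as pd

# 100 x 100 frame of ones with 10% NaNs at random positions (assumed setup)
rng = np.random.default_rng(0)
data = np.ones((100, 100))
data.ravel()[rng.choice(data.size, data.size // 10, replace=False)] = np.nan
df = pd.DataFrame(data)


def by_cast():
    # Boolean mask cast to integers
    return df.notnull().astype(int)


def by_indexing():
    # Replacement by boolean indexing; note the copy is part of the timing here
    out = df.copy()
    out[df.isnull()] = 0
    out[df.notnull()] = 1
    return out


# Both approaches should agree on the values before their speed is compared
assert (by_cast().to_numpy() == by_indexing().to_numpy()).all()

n = 20
t_cast = timeit.timeit(by_cast, number=n) / n
t_index = timeit.timeit(by_indexing, number=n) / n
print(f"cast: {t_cast:.6f} s per call, indexing: {t_index:.6f} s per call")
```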

Answered by DainDwarf

There is a method .fillna() on DataFrames which does what you need. For example:

df = df.fillna(0)  # Replace all NaN values with zero, returning the modified DataFrame

or

df.fillna(0, inplace=True)   # Replace all NaN values with zero, updating the DataFrame directly
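
Note that fillna(0) leaves the non-NaN values as they are, so on its own it covers only the NaN half of the question. As a sketch of one way to get both halves in a single step, the example below uses np.where, which is my substitution rather than part of this answer, on the small sample frame from the earlier answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 4, np.nan], "b": [1, np.nan, 3]})

# fillna(0): NaN becomes 0, every other value is left unchanged
filled = df.fillna(0)

# np.where picks 0 where the original value is NaN and 1 everywhere else
flags = pd.DataFrame(
    np.where(df.isnull(), 0, 1),
    index=df.index,
    columns=df.columns,
)
print(filled)
print(flags)
```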

Answered by tompiler

I'd advise making a new column rather than just replacing. You can always delete the previous column if necessary, but it's always helpful to keep the source of a column that was populated via an operation on another.

e.g. if df['col1'] is the existing column

df['col2'] = df['col1'].apply(lambda x: 1 if not pd.isnull(x) else np.nan)

where col2 is the new column. This should also work if col1 has string entries.
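
The line above writes NaN, not 0, for the missing entries. As a small sketch of a variant that produces the 1/0 flags asked for in the question, here is the same element-wise idea next to a vectorized equivalent. The sample values and the extra column col3 are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical source column with a string, a number and two kinds of missing value
df = pd.DataFrame({"col1": ["x", np.nan, 3.5, None]})

# Element-wise version, writing 0 instead of NaN for missing values
df["col2"] = df["col1"].apply(lambda x: 0 if pd.isnull(x) else 1)

# Vectorized equivalent
df["col3"] = df["col1"].notnull().astype(int)
print(df)
```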

Answered by Xin Niu

Regarding fmarc's answer:

df.loc[~df.isnull()] = 1  # not nan
df.loc[df.isnull()] = 0   # nan

The code above does not work for me, but the code below does.

df[~df.isnull()] = 1  # not nan
df[df.isnull()] = 0   # nan

This is with pandas 0.25.3.

And if you want to just change values in specific columns, you may need to create a temp dataframe and assign it to the columns of the original dataframe:

change_col = ['a', 'b']
tmp = df[change_col].copy()  # work on an explicit copy to avoid SettingWithCopyWarning
tmp[tmp.isnull()]='xxx'
df[change_col]=tmp
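
For the 1/0 replacement from the question, a temporary frame may not be needed. Here is a minimal sketch, assuming a small made-up frame with an extra column c that should stay untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; only columns "a" and "b" should be converted
df = pd.DataFrame({
    "a": [np.nan, 4.0, np.nan],
    "b": [1.0, np.nan, 3.0],
    "c": [np.nan, 8.0, 9.0],
})

change_col = ["a", "b"]
# Flag only the selected columns; column "c" keeps its original values
df[change_col] = df[change_col].notnull().astype(int)
print(df)
```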

Answered by afuc func

Use df.fillna(0) to fill NaN with 0.

Answered by arshad

Here is a suggestion for a particular column: replace the values in that column with 0 where they are NaN and with 1 where a value is present.

The line below replaces the NaN values in your column with 0:

df["YourColumnName"] = df["YourColumnName"].fillna(0)  # assign back instead of a chained inplace call

Now the remaining non-NaN values are replaced by 1 with the code below:

df["YourColumnName"]=df["YourColumnName"].apply(lambda x: 1 if x!=0 else 0)

The same can be applied to the whole dataframe by not specifying a column name.
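
One caveat with the two-step approach: a value that was already 0 in the column cannot be told apart from a filled NaN afterwards. The sketch below illustrates this on a made-up column and compares it with testing for missing values directly:

```python
import numpy as np
import pandas as pd

# Hypothetical column that contains a real 0 as well as a NaN
df = pd.DataFrame({"YourColumnName": [0.0, np.nan, 7.0]})

# Two-step approach: fill NaN with 0, then map every non-zero value to 1
two_step = df["YourColumnName"].fillna(0).apply(lambda x: 1 if x != 0 else 0)

# Testing for missing values directly keeps the real 0 distinct from NaN
direct = df["YourColumnName"].notnull().astype(int)

print(two_step.tolist())  # [0, 0, 1]
print(direct.tolist())    # [1, 0, 1]
```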
