Note: this page reproduces a popular Stack Overflow question and its answers, licensed under CC BY-SA 4.0. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/42636765/

Date: 2020-08-19 21:59:02  Source: igfitidea

How to set all the values of an existing Pandas DataFrame to zero?

Tags: python, pandas, dataframe

Asked by manocormen

I currently have an existing Pandas DataFrame with a date index, and columns each with a specific name.

As for the data cells, they are filled with various float values.

I would like to copy my DataFrame, but replace all these values with zero.

The objective is to reuse the structure of the DataFrame (dimensions, index, column names), but clear all the current values by replacing them with zeroes.

The way I'm currently achieving this is as follows:

df[df > 0] = 0

However, this would not replace any negative value in the DataFrame.
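
To make the limitation concrete, here is a minimal sketch on a small, made-up frame (the column names and values are hypothetical, not taken from the question). It shows that the boolean mask only touches cells that are strictly positive:

```python
import pandas as pd

# Hypothetical two-column frame containing one negative value.
df = pd.DataFrame({"a": [1.5, -2.0], "b": [0.0, 3.0]})

# The mask selects only cells greater than zero, so -2.0 is left alone.
df[df > 0] = 0

print(df)
```

The negative entry in column `a` survives the assignment, which is why a mask-based approach is not a general way to clear a frame.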

Isn't there a more general approach to filling an entire existing DataFrame with a single common value?

Thank you in advance for your help.

Answered by BallpointBen

The absolute fastest way, which also preserves dtypes, is the following:

for col in df.columns:
    df[col].values[:] = 0

This directly writes to the underlying numpy array of each column. I doubt any other method will be faster than this, as this allocates no additional storage and doesn't pass through pandas's dtype handling. You can also use np.issubdtype to only zero out numeric columns. This is probably what you want if you have a mixed-dtype DataFrame, but of course it's not necessary if your DataFrame is already entirely numeric.

for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0

For small DataFrames, the subtype check is somewhat costly. However, the cost of zeroing a non-numeric column is substantial, so if you're not sure whether your DataFrame is entirely numeric, you should probably include the issubdtype check.
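
As a related sketch (not the in-place method above), the same "numeric columns only" idea can be written so that the original frame is left untouched. The helper name `zero_numeric_columns` and the sample frame are hypothetical, and the code assumes plain NumPy-backed numeric dtypes. It allocates a new zero array per column rather than writing through `.values`, so it will not match the timings reported here. It may still be a reasonable fallback where writing through `.values` is not possible, for example if pandas' Copy-on-Write mode hands back a read-only array.

```python
import numpy as np
import pandas as pd


def zero_numeric_columns(df):
    """Return a copy of df whose numeric columns are zero, keeping each dtype."""
    out = df.copy()
    # select_dtypes(include="number") picks the numeric columns;
    # the string column in the example below is skipped.
    for col in out.select_dtypes(include="number").columns:
        # Build a zero array with the column's own dtype so int stays int
        # and float stays float.
        out[col] = np.zeros(len(out), dtype=out[col].dtype)
    return out


# Hypothetical mixed-dtype frame.
df = pd.DataFrame({"i": [1, 2, 3], "f": [0.5, -1.5, 2.0], "s": ["A", "B", "C"]})
result = zero_numeric_columns(df)

print(result)
print(result.dtypes)
```

The trade-off is extra memory for the copy in exchange for not mutating the caller's data.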



Timing comparisons

Setup

import pandas as pd
import numpy as np

def make_df(n, only_numeric):
    series = [
        pd.Series(range(n), name="int", dtype=int),
        pd.Series(range(n), name="float", dtype=float),
    ]
    if only_numeric:
        series.extend(
            [
                pd.Series(range(n, 2 * n), name="int2", dtype=int),
                pd.Series(range(n, 2 * n), name="float2", dtype=float),
            ]
        )
    else:
        series.extend(
            [
                pd.date_range(start="1970-1-1", freq="min", periods=n, name="dt")
                .to_series()
                .reset_index(drop=True),
                pd.Series(
                    [chr((i % 26) + 65) for i in range(n)],
                    name="string",
                    dtype="object",
                ),
            ]
        )

    return pd.concat(series, axis=1)

>>> make_df(5, True)
   int  float  int2  float2
0    0    0.0     5     5.0
1    1    1.0     6     6.0
2    2    2.0     7     7.0
3    3    3.0     8     8.0
4    4    4.0     9     9.0

>>> make_df(5, False)
   int  float                  dt string
0    0    0.0 1970-01-01 00:00:00      A
1    1    1.0 1970-01-01 00:01:00      B
2    2    2.0 1970-01-01 00:02:00      C
3    3    3.0 1970-01-01 00:03:00      D
4    4    4.0 1970-01-01 00:04:00      E

Small DataFrame

n = 10_000                                                                                  

# Numeric df, no issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    df[col].values[:] = 0
36.1 μs ± 510 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Numeric df, with issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
53 μs ± 645 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Non-numeric df, no issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    df[col].values[:] = 0
113 μs ± 391 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Non-numeric df, with issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
39.4 μs ± 1.91 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Large DataFrame

n = 10_000_000                                                                             

# Numeric df, no issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    df[col].values[:] = 0
38.7 ms ± 151 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Numeric df, with issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
39.1 ms ± 556 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Non-numeric df, no issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    df[col].values[:] = 0
99.5 ms ± 748 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Non-numeric df, with issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
17.8 ms ± 228 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


I'd previously suggested the answer below, but I now consider it harmful — it's significantly slower than the above answers and is harder to reason about. Its only advantage is being nicer to write.

The cleanest way is to use a bare colon to reference the entire dataframe.

df[:] = 0

Unfortunately the dtype situation is a bit fuzzy because every column in the resulting dataframe will have the same dtype. If every column of df was originally float, the new dtypes will still be float. But if a single column was int or object, it seems that the new dtypes will all be int.

Answered by Joe T. Boka

You can use the replace function:

df2 = df.replace(df, 0)

Answered by Psidom

Since you are trying to make a copy, it might be better to simply create a new data frame with values as 0, and columns and index from the original data frame:

pd.DataFrame(0, columns=df.columns, index=df.index)
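
One thing to keep in mind with this constructor is that the scalar 0 produces integer columns, whatever the dtypes of the original frame were. If the original dtypes matter, a possible follow-up is to cast the new frame back with `astype`. The sketch below uses a made-up frame with a date index (the column names are hypothetical) and assumes all columns are numeric, since casting 0 to other dtypes may not give a meaningful "empty" value.

```python
import pandas as pd

# Hypothetical frame with a date index, one float column and one int column.
df = pd.DataFrame(
    {"price": [1.5, -2.0, 3.25], "volume": [4, 5, 6]},
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)

# New frame of zeros that reuses the index and column labels.
zeros = pd.DataFrame(0, index=df.index, columns=df.columns)

# Cast column by column so the float column is float again.
zeros = zeros.astype(df.dtypes.to_dict())

print(zeros)
print(zeros.dtypes)
```

The original `df` is not modified, which fits the stated goal of copying the structure rather than clearing the data in place.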

Answered by hannah413

FYI, the accepted answer from BallpointBen was almost 2 orders of magnitude faster for me than the .replace() operation offered by Joe T. Boka. Both are helpful. Thanks!

To be clear, the fast way described by BallpointBen is:

for col in df.columns: df[col].values[:] = 0

*I would have posted this as a comment, but I don't have enough street cred/reputation yet, since I have only been lurking for years. I used timeit.timeit() for the comparison.
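
For readers who want to run a comparison of their own outside IPython, here is a minimal sketch of a `timeit.timeit()` harness. It is not the comparison described in this answer: the two helper functions, the frame size, and the repeat count are all assumptions chosen for illustration, and both helpers return a new frame instead of writing in place. Absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def zeros_like_frame(df):
    # Constructor approach: new frame of zeros with the same labels.
    return pd.DataFrame(0, index=df.index, columns=df.columns)


def zero_columns(df):
    # Per-column approach on a copy: each column becomes a zero array
    # of its own dtype.
    out = df.copy()
    for col in out.columns:
        out[col] = np.zeros(len(out), dtype=out[col].dtype)
    return out


# Hypothetical all-float test frame.
df = pd.DataFrame(np.random.rand(10_000, 4), columns=list("abcd"))

runs = 20
for func in (zeros_like_frame, zero_columns):
    # timeit.timeit returns the total time in seconds for `runs` calls.
    total = timeit.timeit(lambda: func(df), number=runs)
    print(f"{func.__name__}: {total / runs * 1e6:.1f} µs per call")
```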
