How to set all the values of an existing Pandas DataFrame to zero?
Source: http://stackoverflow.com/questions/42636765/
Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the source address, and attribute it to the original authors (not me): Stack Overflow.
Asked by manocormen
I currently have an existing Pandas DataFrame with a date index, and columns each with a specific name.
As for the data cells, they are filled with various float values.
I would like to copy my DataFrame, but replace all these values with zero.
The objective is to reuse the structure of the DataFrame (dimensions, index, column names), but clear all the current values by replacing them with zeroes.
The way I'm currently achieving this is as follows:
df[df > 0] = 0
However, this does not replace any negative values in the DataFrame.
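To make the limitation concrete, here is a minimal sketch on a made-up two-column frame (the column names and values are purely illustrative, not from the original question). The boolean mask only selects positive cells, so negative ones are left alone:

```python
import pandas as pd

# Illustrative data only: a small float frame with positive and negative values.
df = pd.DataFrame({"a": [1.0, -2.0], "b": [3.0, -4.0]})

masked = df.copy()
masked[masked > 0] = 0  # only cells where the mask is True are overwritten

print(masked)  # the negative entries are still present
```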
Isn't there a more general approach to filling an entire existing DataFrame with a single common value?
Thank you in advance for your help.
Answer by BallpointBen
The absolute fastest way, which also preserves dtypes, is the following:
for col in df.columns:
    df[col].values[:] = 0
This directly writes to the underlying numpy array of each column. I doubt any other method will be faster than this, as this allocates no additional storage and doesn't pass through pandas's dtype handling. You can also use np.issubdtype to only zero out numeric columns. This is probably what you want if you have a mixed dtype DataFrame, but of course it's not necessary if your DataFrame is already entirely numeric.
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
For small DataFrames, the subtype check is somewhat costly. However, the cost of zeroing a non-numeric column is substantial, so if you're not sure whether your DataFrame is entirely numeric, you should probably include the issubdtype check.
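A caveat worth hedging: as far as I know, newer pandas versions with Copy-on-Write enabled hand back a read-only array from .values, in which case the in-place write above would raise instead of zeroing the column. If that applies to your environment (an assumption, check your version), the following sketch trades some speed for portability. It is a different technique from the one above: it works on a copy and rebuilds each numeric column as zeros of the same dtype. The helper name zeros_like_frame is hypothetical, and the sketch assumes unique column names.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def zeros_like_frame(df):
    """Hypothetical helper: copy df and zero its numeric columns, keeping dtypes."""
    out = df.copy()
    for col in out.columns:  # assumes column names are unique
        if is_numeric_dtype(out[col].dtype):
            # Build a zero-filled column with the same index and dtype.
            out[col] = pd.Series(0, index=out.index, dtype=out[col].dtype)
    return out


# Illustrative data only: mixed int, float, and string columns on a date index.
df = pd.DataFrame(
    {"a": [1, -2, 3], "b": [0.5, -1.5, 2.5], "s": ["x", "y", "z"]},
    index=pd.date_range("2020-01-01", periods=3),
)
zeroed = zeros_like_frame(df)
print(zeroed.dtypes)
```

Because it copies the data first, this will not match the in-place timings reported in this answer; it is meant as a fallback, not a replacement.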
Timing comparisons
Setup
import pandas as pd
import numpy as np

def make_df(n, only_numeric):
    # Two numeric columns are always present.
    series = [
        pd.Series(range(n), name="int", dtype=int),
        pd.Series(range(n), name="float", dtype=float),
    ]
    if only_numeric:
        # Add two more numeric columns.
        series.extend(
            [
                pd.Series(range(n, 2 * n), name="int2", dtype=int),
                pd.Series(range(n, 2 * n), name="float2", dtype=float),
            ]
        )
    else:
        # Add a datetime column and an object (string) column.
        series.extend(
            [
                # "min" is the minute alias; the older "T" alias is deprecated in recent pandas.
                pd.date_range(start="1970-1-1", freq="min", periods=n, name="dt")
                .to_series()
                .reset_index(drop=True),
                pd.Series(
                    [chr((i % 26) + 65) for i in range(n)],
                    name="string",
                    dtype="object",
                ),
            ]
        )
    return pd.concat(series, axis=1)
>>> make_df(5, True)
int float int2 float2
0 0 0.0 5 5.0
1 1 1.0 6 6.0
2 2 2.0 7 7.0
3 3 3.0 8 8.0
4 4 4.0 9 9.0
>>> make_df(5, False)
int float dt string
0 0 0.0 1970-01-01 00:00:00 A
1 1 1.0 1970-01-01 00:01:00 B
2 2 2.0 1970-01-01 00:02:00 C
3 3 3.0 1970-01-01 00:03:00 D
4 4 4.0 1970-01-01 00:04:00 E
Small DataFrame
n = 10_000
# Numeric df, no issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    df[col].values[:] = 0
36.1 μs ± 510 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Numeric df, yes issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
53 μs ± 645 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Non-numeric df, no issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    df[col].values[:] = 0
113 μs ± 391 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Non-numeric df, yes issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
39.4 μs ± 1.91 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Large DataFrame
n = 10_000_000
# Numeric df, no issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    df[col].values[:] = 0
38.7 ms ± 151 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Numeric df, yes issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
39.1 ms ± 556 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Non-numeric df, no issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    df[col].values[:] = 0
99.5 ms ± 748 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Non-numeric df, yes issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
17.8 ms ± 228 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I'd previously suggested the answer below, but I now consider it harmful: it's significantly slower than the approach above and is harder to reason about. Its only advantage is being nicer to write.
The cleanest way is to use a bare colon to reference the entire dataframe.
df[:] = 0
Unfortunately the dtype situation is a bit fuzzy because every column in the resulting dataframe will have the same dtype. If every column of df was originally float, the new dtypes will still be float. But if a single column was int or object, it seems that the new dtypes will all be int.
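Since the resulting dtypes are described as fuzzy, and I would expect them to depend on the pandas version, it may be safest to check on your own installation. A minimal sketch, using a made-up mixed int/float frame, that prints the dtypes before and after the assignment:

```python
import pandas as pd

# Illustrative data only: one int column and one float column.
df = pd.DataFrame({"i": [1, 2], "f": [1.5, 2.5]})

before = df.dtypes.copy()
df[:] = 0  # bare-colon assignment over the whole frame
after = df.dtypes

# Side-by-side view of the column dtypes; results may differ between pandas versions.
print(pd.DataFrame({"before": before, "after": after}))
```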
Answer by Psidom
Since you are trying to make a copy, it might be better to simply create a new data frame with values as 0, and columns and index from the original data frame:
pd.DataFrame(0, columns=df.columns, index=df.index)
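One thing to watch with this approach: the new frame takes its dtype from the scalar 0, not from the original columns. If the original dtypes matter, a possible follow-up (my addition, not part of the answer above) is to cast back with astype. A small sketch on an illustrative frame:

```python
import pandas as pd

# Illustrative data only: int and float columns on a date index.
df = pd.DataFrame(
    {"i": [1, 2, 3], "f": [0.1, 0.2, 0.3]},
    index=pd.date_range("2021-01-01", periods=3),
)

# Same shape, index, and column names as df, filled with zeros.
zeros = pd.DataFrame(0, columns=df.columns, index=df.index)

# Optional: restore the original per-column dtypes.
zeros_typed = zeros.astype(df.dtypes.to_dict())

print(zeros_typed.dtypes)
```

The cast only makes sense for columns whose dtype can represent 0, so a frame with string or datetime columns would need those columns handled separately.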
Answer by hannah413
FYI, the accepted answer from BallpointBen was almost 2 orders of magnitude faster for me than the .replace() operation offered by Joe T Boka. Both are helpful. Thanks!
To be clear, the fast way described by BallpointBen is:
for col in df.columns:
    df[col].values[:] = 0
*I would have posted this as a comment, but I don't have enough street cred/reputation yet since I have been lurking for years. I used timeit.timeit() for the comparison.
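For anyone wanting to run a similar comparison, here is a rough sketch of how timeit.timeit() can be used for it. The two functions compared (the constructor approach and slice assignment on a copy) are just stand-ins I picked from this thread, the frame size and repeat count are arbitrary, and this is not a reconstruction of the exact comparison described above.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative data only: a small random float frame.
df = pd.DataFrame(np.random.rand(1000, 4), columns=list("abcd"))


def via_constructor():
    # Build a fresh zero-filled frame with the same index and columns.
    return pd.DataFrame(0, columns=df.columns, index=df.index)


def via_slice_assignment():
    # Copy the frame, then overwrite every cell with zero.
    out = df.copy()
    out[:] = 0
    return out


# timeit.timeit returns the total seconds for `number` calls.
t_constructor = timeit.timeit(via_constructor, number=100)
t_slice = timeit.timeit(via_slice_assignment, number=100)
print(f"constructor: {t_constructor:.4f}s, slice assignment: {t_slice:.4f}s")
```

Absolute numbers will vary with hardware, frame size, and pandas version, so only relative differences on your own machine are meaningful.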