Original source: http://stackoverflow.com/questions/49259580/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Replace values in a pandas series via dictionary efficiently
Asked by jpp
How to replace values in a Pandas series s via a dictionary d has been asked and re-asked many times.
The recommended method (1, 2, 3, 4) is to either use s.replace(d) or, occasionally, use s.map(d) if all your series values are found in the dictionary keys.
However, s.replace is often unreasonably slow, commonly 5-10x slower than a simple list comprehension.
The alternative, s.map(d), has good performance, but is only recommended when all series values are found among the dictionary keys.
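To illustrate why s.map(d) is only recommended when every value is a key, here is a minimal sketch on a toy series. The series and dictionary are made up for illustration and are not part of the original question.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
d = {1: 10, 2: 20}  # keys 3 and 4 are deliberately missing

replaced = s.replace(d)  # values without a matching key are left unchanged
mapped = s.map(d)        # values without a matching key become NaN

print(replaced.tolist())  # [10, 20, 3, 4]
print(mapped.tolist())    # [10.0, 20.0, nan, nan]
```

Note that the NaN values also force the mapped result to a float dtype.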
Why is s.replace so slow and how can performance be improved?
import pandas as pd, numpy as np
df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()
##### TEST 1 #####
d = {i: i+1 for i in range(1000)}
%timeit df['A'].replace(d) # 1.98s
%timeit [d[i] for i in lst] # 134ms
##### TEST 2 #####
d = {i: i+1 for i in range(10)}
%timeit df['A'].replace(d) # 20.1ms
%timeit [d.get(i, i) for i in lst] # 243ms
Note: This question is not marked as a duplicate because it is looking for specific advice on when to use different methods given different datasets. This is explicit in the answer and is an aspect not usually addressed in other questions.
Answered by jpp
One trivial solution is to choose a method dependent on an estimate of how completely values are covered by dictionary keys.
General case
- Use df['A'].map(d) if all values are mapped; or
- Use df['A'].map(d).fillna(df['A']).astype(int) if more than 5% of values are mapped.
Few, e.g. < 5%, values in d
- Use df['A'].replace(d)
The "crossover point" of ~5% is specific to the benchmarking below and may differ for other data.
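The selection rule above could be wrapped in a small helper. This is only a sketch: the name replace_via_dict and the threshold parameter are hypothetical, the 5% default is simply the crossover point from this answer's benchmark, and the astype(int) step assumes integer data and integer dictionary values, as in the benchmark.

```python
import pandas as pd

def replace_via_dict(s, d, threshold=0.05):
    """Hypothetical helper: choose map() or replace() by key coverage."""
    # Fraction of series values that appear among the dictionary keys.
    coverage = s.isin(list(d.keys())).mean()
    if coverage == 1.0:
        # Every value is mapped, so a plain map() is safe.
        return s.map(d)
    if coverage > threshold:
        # map() leaves NaN for unmapped values; restore the originals.
        return s.map(d).fillna(s).astype(int)
    # Few values are mapped, where replace() was faster in the benchmark.
    return s.replace(d)

s = pd.Series(range(10))
print(replace_via_dict(s, {0: 100}).tolist())
```

Estimating coverage with isin() has its own cost, so on very large data it may be worth estimating it from a sample instead.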
Interestingly, a simple list comprehension generally underperforms map in either scenario.
Benchmarking
import pandas as pd, numpy as np
df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()
##### TEST 1 - Full Map #####
d = {i: i+1 for i in range(1000)}
%timeit df['A'].replace(d) # 1.98s
%timeit df['A'].map(d) # 84.3ms
%timeit [d[i] for i in lst] # 134ms
##### TEST 2 - Partial Map #####
d = {i: i+1 for i in range(10)}
%timeit df['A'].replace(d) # 20.1ms
%timeit df['A'].map(d).fillna(df['A']).astype(int) # 111ms
%timeit [d.get(i, i) for i in lst] # 243ms
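The %timeit lines above need IPython. Outside IPython, a rough equivalent of the full-map test can be sketched with the standard timeit module. The smaller row count and the candidates dictionary are assumptions made here to keep the run short, so the timings will not match the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # fewer rows than the 1,000,000 used above, to keep the run short
df = pd.DataFrame({'A': np.random.randint(0, 1000, n)})
lst = df['A'].tolist()
d = {i: i + 1 for i in range(1000)}  # full map: every value is a key

candidates = {
    'replace': lambda: df['A'].replace(d),
    'map': lambda: df['A'].map(d),
    'list comprehension': lambda: [d[i] for i in lst],
}

timings = {}
for name, func in candidates.items():
    # Best of three single runs, in seconds.
    timings[name] = min(timeit.repeat(func, number=1, repeat=3))
    print(f'{name}: {timings[name] * 1000:.1f} ms')
```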
Explanation
s.replace is so slow because it does much more than simply map a dictionary. It handles some edge cases and arguably rare situations, which typically merit more care in any case.
This is an excerpt from replace() in pandas\generic.py.
items = list(compat.iteritems(to_replace))
keys, values = zip(*items)
are_mappings = [is_dict_like(v) for v in values]

if any(are_mappings):
    # handling of nested dictionaries
else:
    to_replace, value = keys, values

return self.replace(to_replace, value, inplace=inplace,
                    limit=limit, regex=regex)
There appear to be many steps involved:
- Converting dictionary to a list.
- Iterating through list and checking for nested dictionaries.
- Feeding an iterator of keys and values into a replace function.
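The three steps listed above can be retraced by hand with public pandas calls. This is an illustrative sketch on made-up data, not the library's internal code: isinstance(v, dict) stands in for the internal is_dict_like check.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
d = {1: 10, 2: 20}

# Step 1: convert the dictionary to a list of (key, value) pairs.
items = list(d.items())
keys, values = zip(*items)

# Step 2: check each value for a nested dictionary.
are_mappings = [isinstance(v, dict) for v in values]
print(any(are_mappings))  # False: the dictionary is flat

# Step 3: pass keys and values to replace() as two parallel sequences.
via_lists = s.replace(list(keys), list(values))
via_dict = s.replace(d)

print(via_lists.equals(via_dict))  # True
```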
This can be compared to much leaner code from map() in pandas\series.py:
if isinstance(arg, (dict, Series)):
    if isinstance(arg, dict):
        arg = self._constructor(arg, index=arg.keys())

    indexer = arg.index.get_indexer(values)
    new_values = algos.take_1d(arg._values, indexer)
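The same indexer-and-take idea can be imitated with public API calls. This is a sketch on made-up data, assuming unique dictionary keys. The names lookup and taken are illustrative, and np.where stands in for the NaN filling that the internal take_1d performs for positions of -1.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4])
d = {1: 10, 2: 20}

lookup = pd.Series(d)  # the dictionary keys become the index
# Position of each series value in the index, or -1 if it is absent.
indexer = lookup.index.get_indexer(s.values)

# Take the mapped values by position, then fill absent positions with NaN.
taken = lookup.to_numpy()[indexer].astype(float)
new_values = np.where(indexer == -1, np.nan, taken)

print(indexer.tolist())     # [0, 1, -1, -1]
print(new_values.tolist())  # [10.0, 20.0, nan, nan]
```

The whole lookup is a single vectorised index operation, which is consistent with map being much faster than replace when most values are mapped.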