Notice: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/49259580/

Time: 2020-09-14 05:19:16  Source: igfitidea

Replace values in a pandas series via dictionary efficiently

Tags: python, python-3.x, pandas, dictionary, dataframe

Asked by jpp

How to replace values in a Pandas series s via a dictionary d has been asked and re-asked many times.

The recommended method (1, 2, 3, 4) is to either use s.replace(d) or, occasionally, use s.map(d) if all your series values are found in the dictionary keys.

However, s.replace is often unreasonably slow, frequently 5-10x slower than a simple list comprehension.

The alternative, s.map(d), has good performance, but is only recommended when all series values are found in the dictionary keys.
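
To illustrate why map is usually reserved for full coverage, here is a minimal sketch on a toy series (the data and names are made up for illustration). Values missing from the dictionary become NaN under map, while replace leaves them untouched.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
d = {1: 10, 2: 20}  # 3 and 4 are deliberately left unmapped

mapped = s.map(d)        # unmapped values become NaN, so the dtype becomes float64
replaced = s.replace(d)  # unmapped values are left unchanged

print(mapped.tolist())    # [10.0, 20.0, nan, nan]
print(replaced.tolist())  # [10, 20, 3, 4]
```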

Why is s.replace so slow and how can performance be improved?

import pandas as pd, numpy as np

df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()

##### TEST 1 #####

d = {i: i+1 for i in range(1000)}

%timeit df['A'].replace(d)                          # 1.98s
%timeit [d[i] for i in lst]                         # 134ms

##### TEST 2 #####

d = {i: i+1 for i in range(10)}

%timeit df['A'].replace(d)                          # 20.1ms
%timeit [d.get(i, i) for i in lst]                  # 243ms

Note: This question is not marked as a duplicate because it is looking for specific advice on when to use different methods given different datasets. This is explicit in the answer and is an aspect not usually addressed in other questions.

Answered by jpp

One trivial solution is to choose a method dependent on an estimate of how completely values are covered by dictionary keys.

General case

  • Use df['A'].map(d) if all values are mapped; or
  • Use df['A'].map(d).fillna(df['A']).astype(int) if >5% of values are mapped.

Few values, e.g. < 5%, found in d

  • Use df['A'].replace(d)

The "crossover point" of ~5% is specific to the benchmarking below.

Interestingly, a simple list comprehension generally underperforms map in either scenario.
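
The selection rule above could be wrapped in a small helper. The following is only a sketch: replace_via_dict and its threshold parameter are hypothetical names, not part of pandas. It assumes the dictionary values share the dtype of the series and that no key maps to NaN, and the 5% default simply mirrors the crossover seen in this benchmark. Note that estimating coverage with isin has its own cost.

```python
import pandas as pd

def replace_via_dict(s, d, threshold=0.05):
    """Hypothetical helper: pick a replacement strategy from key coverage."""
    # Fraction of values in s that appear among the keys of d
    coverage = s.isin(list(d)).mean()
    if coverage == 1.0:
        # Every value is mapped, so plain map is safe
        return s.map(d)
    if coverage > threshold:
        # Partial coverage: map, then restore unmapped values and the dtype
        return s.map(d).fillna(s).astype(s.dtype)
    # Sparse coverage: replace performed best in this benchmark
    return s.replace(d)
```

For example, replace_via_dict(df['A'], d) would use map for the full dictionary from Test 1 and fall back to replace when only about 1% of values are covered.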

Benchmarking

import pandas as pd, numpy as np

df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()

##### TEST 1 - Full Map #####

d = {i: i+1 for i in range(1000)}

%timeit df['A'].replace(d)                          # 1.98s
%timeit df['A'].map(d)                              # 84.3ms
%timeit [d[i] for i in lst]                         # 134ms

##### TEST 2 - Partial Map #####

d = {i: i+1 for i in range(10)}

%timeit df['A'].replace(d)                          # 20.1ms
%timeit df['A'].map(d).fillna(df['A']).astype(int)  # 111ms
%timeit [d.get(i, i) for i in lst]                  # 243ms
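
The %timeit lines above need IPython or Jupyter. Outside those environments, a rough equivalent of the full-map test can be sketched with the standard timeit module. The series is shrunk to 100,000 rows here (an assumption to keep the run short), so absolute numbers will not match those quoted above and will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randint(0, 1000, 100_000)})
lst = df['A'].tolist()
d = {i: i + 1 for i in range(1000)}

candidates = {
    'replace': lambda: df['A'].replace(d),
    'map': lambda: df['A'].map(d),
    'list comprehension': lambda: [d[i] for i in lst],
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, one call each, in seconds
    timings[name] = min(timeit.repeat(func, number=1, repeat=3))
    print(f'{name}: {timings[name] * 1000:.1f} ms')
```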

Explanation

The reason why s.replace is so slow is that it does much more than simply map a dictionary. It deals with some edge cases and arguably rare situations, which typically merit more care in any case.

This is an excerpt from replace() in pandas\generic.py.

items = list(compat.iteritems(to_replace))
keys, values = zip(*items)
are_mappings = [is_dict_like(v) for v in values]

if any(are_mappings):
    # handling of nested dictionaries (elided in this excerpt)
    ...
else:
    to_replace, value = keys, values

return self.replace(to_replace, value, inplace=inplace,
                    limit=limit, regex=regex)

There appear to be many steps involved:

  • Converting dictionary to a list.
  • Iterating through list and checking for nested dictionaries.
  • Feeding an iterator of keys and values into a replace function.
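
The preprocessing steps listed above can be mimicked in plain Python. This is an illustrative sketch only: unpack_replacements is a hypothetical name, and isinstance(v, dict) stands in for the is_dict_like check in the excerpt. It covers just the unpacking shown there, not the work done afterwards by the inner replace call.

```python
def unpack_replacements(to_replace):
    """Illustrative only: mimic the dictionary unpacking shown in the excerpt."""
    items = list(to_replace.items())  # dictionary -> list of (key, value) pairs
    keys, values = zip(*items)        # split the pairs into two tuples
    # Check every value for a nested dictionary
    has_nested = any(isinstance(v, dict) for v in values)
    return keys, values, has_nested

keys, values, has_nested = unpack_replacements({1: 10, 2: 20})
print(keys, values, has_nested)  # (1, 2) (10, 20) False
```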

This can be compared to much leaner code from map() in pandas\series.py:

if isinstance(arg, (dict, Series)):
    if isinstance(arg, dict):
        arg = self._constructor(arg, index=arg.keys())

    indexer = arg.index.get_indexer(values)
    new_values = algos.take_1d(arg._values, indexer)
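
The same lookup idea can be approximated with public APIs. This is a sketch under assumptions, not the pandas implementation: Index.get_indexer returns the position of each value in the lookup index (or -1 when absent), and NumPy indexing plus np.where stands in for the internal algos.take_1d call.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, 5])
d = {0: 10, 1: 11, 2: 12}

lookup = pd.Series(d)                             # dict -> Series indexed by key
indexer = lookup.index.get_indexer(s.to_numpy())  # position of each value, -1 if absent
taken = lookup.to_numpy()[indexer]                # positional take; a -1 picks the last item for now
# Overwrite the positions that had no match with NaN
result = pd.Series(np.where(indexer == -1, np.nan, taken), index=s.index)

print(result.tolist())  # [10.0, 11.0, 12.0, nan]
```

The whole lookup is a single vectorised index operation followed by a positional take, which is consistent with map being much faster than replace when most values are covered.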