Python: Remap values in a pandas column with a dict

Note: this page is a Chinese/English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/20250771/

Date: 2020-08-18 20:00:37  Source: igfitidea

Remap values in pandas column with a dict

python dictionary pandas remap

Asked by TheChymera

I have a dictionary which looks like this: di = {1: "A", 2: "B"}

I would like to apply it to the "col1" column of a dataframe similar to:

     col1   col2
0       w      a
1       1      2
2       2    NaN

to get:

     col1   col2
0       w      a
1       A      2
2       B    NaN

How can I best do this? For some reason googling terms relating to this only shows me links about how to make columns from dicts and vice-versa :-/

Accepted answer by DSM

You can use .replace. For example:

>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
  col1 col2
0    w    a
1    1    2
2    2  NaN
>>> df.replace({"col1": di})
  col1 col2
0    w    a
1    A    2
2    B  NaN

or directly on the Series, i.e. df["col1"].replace(di, inplace=True).
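
In recent pandas versions, calling replace with inplace=True on a selected column may emit a warning or leave df unchanged, so assigning the result back is a safer spelling. A minimal sketch, assuming the same df and di as in the example above:

```python
import numpy as np
import pandas as pd

# Same data as in the example above
df = pd.DataFrame({'col1': ['w', 1, 2], 'col2': ['a', 2, np.nan]})
di = {1: "A", 2: "B"}

# Assign the result back instead of relying on inplace=True
df['col1'] = df['col1'].replace(di)
print(df['col1'].tolist())
```

Values without a key in di (here 'w') are left as they are.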

Answer by unutbu

There is a bit of ambiguity in your question. There are at least two interpretations (and a third listed for completeness):

  1. the keys in di refer to index values
  2. the keys in di refer to df['col1'] values
  3. the keys in di refer to index locations (not the OP's question, but thrown in for fun.)

Below is a solution for each case.

Case 1: If the keys of di are meant to refer to index values, then you could use the update method:

df['col1'].update(pd.Series(di))

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1,2,0])
#   col1 col2
# 1    w    a
# 2   10   30
# 0   20  NaN

di = {0: "A", 2: "B"}

# The value at the 0-index is mapped to 'A', the value at the 2-index is mapped to 'B'
df['col1'].update(pd.Series(di))
print(df)

yields

  col1 col2
1    w    a
2    B   30
0    A  NaN

I've modified the values from your original post so it is clearer what update is doing. Note how the keys in di are associated with index values. The order of the index values -- that is, the index locations -- does not matter.
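
The update call above acts on a column selected from df; depending on the pandas version and its copy-on-write settings, that change may not propagate back to df. A minimal sketch of the same index-label mapping written with .loc instead, assuming every key in di exists in the index (a missing label would raise a KeyError):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1, 2, 0])
di = {0: "A", 2: "B"}

# .loc selects rows by index label, so the keys of di are used as labels
df.loc[list(di.keys()), 'col1'] = list(di.values())
print(df)
```

The row labelled 0 receives 'A' and the row labelled 2 receives 'B', regardless of where those rows sit in the frame.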



Case 2: If the keys in di refer to df['col1'] values, then @DanAllan and @DSM show how to achieve this with replace:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1,2,0])
print(df)
#   col1 col2
# 1    w    a
# 2   10   30
# 0   20  NaN

di = {10: "A", 20: "B"}

# The values 10 and 20 are replaced by 'A' and 'B'
df['col1'].replace(di, inplace=True)
print(df)

yields

  col1 col2
1    w    a
2    A   30
0    B  NaN

Note how in this case the keys in di were changed to match values in df['col1'].



Case 3: If the keys in di refer to index locations, then you could use

df['col1'].put(di.keys(), di.values())

since

df = pd.DataFrame({'col1':['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1,2,0])
di = {0: "A", 2: "B"}

# The values at the 0 and 2 index locations are replaced by 'A' and 'B'
df['col1'].put(di.keys(), di.values())
print(df)

yields

  col1 col2
1    A    a
2   10   30
0    B  NaN

Here, the first and third rows were altered, because the keys in di are 0 and 2, which with Python's 0-based indexing refer to the first and third locations.
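
Series.put is no longer available in recent pandas releases, so the snippet above may fail with an AttributeError. A minimal sketch of the same position-based assignment using iloc, which is a substitute technique rather than the answer's original method:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1, 2, 0])
di = {0: "A", 2: "B"}

# Row positions come from the keys of di; the column position is looked up by name
rows = list(di.keys())
col = df.columns.get_loc('col1')

# iloc addresses cells by 0-based position and ignores the index labels
df.iloc[rows, col] = list(di.values())
print(df)
```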

Answer by JohnE

map can be much faster than replace

If your dictionary has more than a couple of keys, using map can be much faster than replace. There are two versions of this approach, depending on whether your dictionary exhaustively maps all possible values (and also whether you want non-matches to keep their values or be converted to NaNs):

Exhaustive Mapping

In this case, the form is very simple:

df['col1'].map(di)       # note: if the dictionary does not exhaustively map all
                         # entries then non-matched entries are changed to NaNs

Although map most commonly takes a function as its argument, it can alternatively take a dictionary or Series; see the documentation for pandas.Series.map.

Non-Exhaustive Mapping

If you have a non-exhaustive mapping and wish to retain the existing values for non-matches, you can add fillna:

df['col1'].map(di).fillna(df['col1'])

as in @jpp's answer here: Replace values in a pandas series via dictionary efficiently
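
To make the difference between the two forms concrete, here is a small sketch on a three-element Series (the data is made up for illustration and mirrors the question's col1):

```python
import pandas as pd

s = pd.Series(['w', 1, 2], name='col1')
di = {1: "A", 2: "B"}

# Plain map: 'w' has no key in di, so it becomes NaN
exhaustive = s.map(di)

# map + fillna: non-matches fall back to the original values
non_exhaustive = s.map(di).fillna(s)

print(exhaustive.tolist())
print(non_exhaustive.tolist())
```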

Benchmarks

Using the following data with pandas version 0.23.1:

di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H" }
df = pd.DataFrame({ 'col1': np.random.choice( range(1,9), 100000 ) })

and testing with %timeit, it appears that map is approximately 10x faster than replace.

Note that your speedup with map will vary with your data. The largest speedup appears to be with large dictionaries and exhaustive replaces. See @jpp's answer (linked above) for more extensive benchmarks and discussion.
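
Outside IPython, where %timeit is not available, the comparison can be reproduced with the standard timeit module. This is only a sketch: the number of runs (5) is an arbitrary choice, and the timings will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H"}
df = pd.DataFrame({'col1': np.random.choice(range(1, 9), 100000)})

# Sanity check: both approaches should yield the same values
same_values = df['col1'].map(di).tolist() == df['col1'].replace(di).tolist()

# Time each approach over a handful of runs
t_map = timeit.timeit(lambda: df['col1'].map(di), number=5)
t_replace = timeit.timeit(lambda: df['col1'].replace(di), number=5)

print(f"map:     {t_map:.4f} s for 5 runs")
print(f"replace: {t_replace:.4f} s for 5 runs")
```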

Answer by Nico Coallier

Adding to this question, in case you ever have more than one column to remap in a dataframe:

def remap(data, dict_labels):
    """
    Take a dictionary of labels, dict_labels, and replace the values
    (previously label-encoded) in each listed column with their strings.

    ex: dict_labels = {'col1': {1: 'A', 2: 'B'}}

    """
    for field, values in dict_labels.items():
        print("I am remapping %s" % field)
        data.replace({field: values}, inplace=True)
    print("DONE")

    return data
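
DataFrame.replace also accepts a nested dictionary covering several columns in one call, so the loop can be skipped when no progress messages are needed. A minimal sketch with made-up data; the column names and labels are assumptions for illustration only:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1], 'col2': [0, 1, 1]})

# Outer keys are column names, inner dictionaries map old values to new ones
dict_labels = {'col1': {1: 'A', 2: 'B'}, 'col2': {0: 'no', 1: 'yes'}}

remapped = df.replace(dict_labels)
print(remapped)
```

This returns a new DataFrame and leaves df untouched.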

Hope it can be useful to someone.

Cheers

Answer by Amir Imani

A more native pandas approach is to apply a replace function as below:

import re

def multiple_replace(mapping, text):
  # Create a regular expression from the dictionary keys (keys must be strings)
  regex = re.compile("(%s)" % "|".join(map(re.escape, mapping.keys())))

  # For each match, look up the corresponding value in the dictionary
  return regex.sub(lambda mo: mapping[mo.group(0)], text)

Once you have defined the function, you can apply it to your dataframe.

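di = {"1": "A", "2": "B"}  # keys must be strings to build the regular expression
# The regex works on text, so the column is converted to strings first;
# note that it replaces substrings, so "12" would become "AB"
df['col1'] = df['col1'].astype(str).apply(lambda text: multiple_replace(di, text))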

Answer by wordsforthewise

DSM has the accepted answer, but the coding doesn't seem to work for everyone. Here is one that works with the current version of pandas (0.23.4 as of 8/2018):

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2, 3, 1],
            'col2': ['negative', 'positive', 'neutral', 'neutral', 'positive']})

conversion_dict = {'negative': -1, 'neutral': 0, 'positive': 1}
df['converted_column'] = df['col2'].replace(conversion_dict)

print(df.head())

You'll see it looks like:

   col1      col2  converted_column
0     1  negative                -1
1     2  positive                 1
2     2   neutral                 0
3     3   neutral                 0
4     1  positive                 1

See the docs for pandas.DataFrame.replace for details.

Answer by U10-Forward

Or do apply:

df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))

Demo:

>>> df['col1']=df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))
>>> df
  col1 col2
0    w    a
1    A    2
2    B  NaN

Answer by dorien

A nice complete solution that keeps a map of your class labels:

labels = features['col1'].unique()
labels_dict = dict(zip(labels, range(len(labels))))
features = features.replace({"col1": labels_dict})

This way, you can at any point refer to the original class label from labels_dict.
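
Going back from the integer codes to the original labels needs the inverted dictionary. A minimal sketch with made-up labels; inverse_dict is a hypothetical name, and map is used here in place of replace:

```python
import pandas as pd

features = pd.DataFrame({'col1': ['cat', 'dog', 'cat', 'bird']})

# Build the label -> code dictionary in order of first appearance
labels = features['col1'].unique()
labels_dict = dict(zip(labels, range(len(labels))))
encoded = features['col1'].map(labels_dict)

# Invert the dictionary to go from integer codes back to labels
inverse_dict = {code: label for label, code in labels_dict.items()}
decoded = encoded.map(inverse_dict)

print(encoded.tolist())
print(decoded.tolist())
```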

Answer by louisD

As an extension to what has been proposed by Nico Coallier (apply to multiple columns) and U10-Forward (using the apply style of methods), and summarising it into a one-liner, I propose:

df.loc[:,['col1','col2']].transform(lambda x: x.map(lambda x: {1: "A", 2: "B"}.get(x,x)))

The .transform() method processes each column as a Series, so the function receives one column at a time rather than the whole DataFrame.

Consequently you can apply the Series method map().

Finally, and I discovered this behaviour thanks to U10, you can use the whole Series in the .get() expression, unless I have misunderstood its behaviour and it processes the Series sequentially instead of elementwise.
The .get(x,x) accounts for the values you did not mention in your mapping dictionary, which would otherwise be converted to NaN by the .map() method.
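
A short sketch of that last point, on made-up data: passing the dictionary straight to map turns unmapped values into NaN, while the inner lambda calls di.get(v, v) once per element and falls back to the value itself.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['w', 1, 2], 'col2': ['a', 2, 3]})
di = {1: "A", 2: "B"}
cols = ['col1', 'col2']

# Dictionary passed directly: values without a key become NaN
with_nan = df[cols].transform(lambda s: s.map(di))

# di.get(v, v) runs once per element and keeps values that have no key
kept = df[cols].transform(lambda s: s.map(lambda v: di.get(v, v)))

print(with_nan)
print(kept)
```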

Answer by ALollz

Given that map is faster than replace (@JohnE's solution), you need to be careful with non-exhaustive mappings where you intend to map specific values to NaN. The proper method in this case requires that you mask the Series when you .fillna, else you undo the mapping to NaN.

import pandas as pd
import numpy as np

d = {'m': 'Male', 'f': 'Female', 'missing': np.nan}
df = pd.DataFrame({'gender': ['m', 'f', 'missing', 'Male', 'U']})


keep_nan = [k for k,v in d.items() if pd.isnull(v)]
s = df['gender']

df['mapped'] = s.map(d).fillna(s.mask(s.isin(keep_nan)))


    gender  mapped
0        m    Male
1        f  Female
2  missing     NaN
3     Male    Male
4        U       U