Note: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same CC BY-SA license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/44318844/


pandas dataframe index match

Tags: python, pandas, dataframe, indexing

Asked by A. Martin

I'm wondering if there is a more efficient way to do the "index & match" type of lookup that is popular in Excel. For example, given two pandas DataFrames, update df_1 with information found in df_2:

import pandas as pd

df_1 = pd.DataFrame({'num_a':[1, 2, 3, 4, 5],
                     'num_b':[2, 4, 1, 2, 3]})    
df_2 = pd.DataFrame({'num':[1, 2, 3, 4, 5],
                     'name':['a', 'b', 'c', 'd', 'e']})

I'm working with data sets that have ~80,000 rows in both df_1 and df_2 and my goal is to create two new columns in df_1, "name_a" and "name_b".

Below is the most efficient method that I could come up with. There has to be a better way!

name_a = []
name_b = []
for i in range(len(df_1)):

    name_a.append(df_2.name.iloc[df_2[
                  df_2.num == df_1.num_a.iloc[i]].index[0]])
    name_b.append(df_2.name.iloc[df_2[
                  df_2.num == df_1.num_b.iloc[i]].index[0]])

df_1['name_a'] = name_a
df_1['name_b'] = name_b

Resulting in:

>>> df_1.head()
   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c

Answered by piRSquared

High Level

  • Create a dictionary to use in a replace
  • replace, rename columns, and join


m = dict(zip(
    df_2.num.values.tolist(),
    df_2.name.values.tolist()
))

df_1.join(
    df_1.replace(m).rename(
        columns=lambda x: x.replace('num', 'name')
    )
)

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      5      c


Breakdown

replace with a dictionary should be pretty quick. There are a bunch of ways to build a dictionary from df_2. As a matter of fact, we could have used a pd.Series. I chose to build with dict and zip because I find that it's faster.

Building m

Option 1

m = df_2.set_index('num').name

Option 2

m = df_2.set_index('num').name.to_dict()

Option 3

m = dict(zip(df_2.num, df_2.name))

Option 4 (My Choice)

m = dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))

m build times (Options 1 to 4, in order)

%timeit df_2.set_index('num').name
%timeit df_2.set_index('num').name.to_dict()
%timeit dict(zip(df_2.num, df_2.name))
%timeit dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))

1000 loops, best of 3: 325 μs per loop
1000 loops, best of 3: 376 μs per loop
10000 loops, best of 3: 32.9 μs per loop
100000 loops, best of 3: 10.4 μs per loop
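The %timeit lines are IPython magics. Outside IPython, a comparable check could be sketched with the standard timeit module, as below. The frame size follows the ~80,000 rows mentioned in the question; the synthetic 'name_<i>' values and the builders dictionary are assumptions made for illustration, and the printed timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic lookup table at roughly the size mentioned in the question.
# The 'name_<i>' values are made up for this sketch.
n = 80_000
df_2 = pd.DataFrame({'num': range(1, n + 1),
                     'name': [f'name_{i}' for i in range(1, n + 1)]})

# The dictionary-producing options from the answer, wrapped as callables.
builders = {
    'set_index + to_dict': lambda: df_2.set_index('num')['name'].to_dict(),
    'dict(zip(columns))': lambda: dict(zip(df_2['num'], df_2['name'])),
    'dict(zip(lists))': lambda: dict(zip(df_2['num'].values.tolist(),
                                         df_2['name'].values.tolist())),
}

reference = builders['dict(zip(lists))']()

for label, build in builders.items():
    # Every option should produce the same mapping.
    assert build() == reference
    # Best of 3 repeats, 5 calls each, reported per call.
    seconds = min(timeit.repeat(build, number=5, repeat=3)) / 5
    print(f'{label}: {seconds * 1e3:.2f} ms per call')
```

Checking that all options yield the same mapping before timing them guards against comparing builders that do different work.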


Replacing num

Again, we have choices. Here are a few, along with their timings.

%timeit df_1.replace(m)
%timeit df_1.applymap(lambda x: m.get(x, x))
%timeit df_1.stack().map(lambda x: m.get(x, x)).unstack()

1000 loops, best of 3: 792 μs per loop
1000 loops, best of 3: 959 μs per loop
1000 loops, best of 3: 925 μs per loop

I choose...

df_1.replace(m)

  num_a num_b
0     a     b
1     b     d
2     c     a
3     d     b
4     5     c

Rename columns

df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name'))

  name_a name_b   <-- note the column name change
0      a      b
1      b      d
2      c      a
3      d      b
4      5      c

Join

df_1.join(df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name')))

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      5      c
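The three steps can also be wrapped in a small helper, sketched below. The function name add_names and its key/value parameters are hypothetical and not part of the original answer. The sketch assumes every column of the left frame is a lookup column whose name contains the key string.

```python
import pandas as pd

df_1 = pd.DataFrame({'num_a': [1, 2, 3, 4, 5],
                     'num_b': [2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num': [1, 2, 3, 4, 5],
                     'name': ['a', 'b', 'c', 'd', 'e']})


def add_names(left, lookup, key='num', value='name'):
    """Hypothetical helper: replace, rename columns, and join in one call."""
    # Build the mapping from plain Python lists (Option 4 in the answer).
    m = dict(zip(lookup[key].values.tolist(), lookup[value].values.tolist()))
    # Replace the numbers with names, then rename num_* columns to name_*.
    names = left.replace(m).rename(columns=lambda c: c.replace(key, value))
    # Attach the renamed columns next to the original ones.
    return left.join(names)


result = add_names(df_1, df_2)
print(result)
```

Note that replace leaves any value without a key in m unchanged, so an unmatched number would appear as-is in the name columns.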

Answered by T. Ray

I think there's a more straightforward solution than those already offered. Since you mentioned Excel, this is a basic vlookup. You can simulate this in pandas by using Series.map.

name_map = dict(df_2.set_index('num').name)

df_1['name_a'] = df_1.num_a.map(name_map)
df_1['name_b'] = df_1.num_b.map(name_map)

df_1

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c

All we do is convert df_2 to a dict with 'num' as the keys. The map function looks up each value from a df_1 column in the dict and returns the corresponding letter. No complicated indexing required.
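One detail worth knowing about this approach is that Series.map returns NaN for values that have no entry in the mapping. The sketch below applies the same lookup to both columns in a loop and shows one way to handle a miss. The unmatched value 9 and the 'unknown' placeholder are assumptions added for illustration. It also passes a Series (indexed by num) instead of a dict, which map accepts as well.

```python
import pandas as pd

# Same shape as the question's data, except that 9 has no match in df_2.
df_1 = pd.DataFrame({'num_a': [1, 2, 3, 4, 9],
                     'num_b': [2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num': [1, 2, 3, 4, 5],
                     'name': ['a', 'b', 'c', 'd', 'e']})

# A Series indexed by 'num' works as the lookup table for Series.map.
name_map = df_2.set_index('num')['name']

# Create name_a and name_b from num_a and num_b.
for col in ['num_a', 'num_b']:
    df_1[col.replace('num', 'name')] = df_1[col].map(name_map)

# Unmatched values come back as NaN; fill them if a placeholder is preferred.
filled = df_1['name_a'].fillna('unknown')

print(df_1)
print(filled.tolist())
```

This differs from DataFrame.replace, which leaves unmatched values as they were rather than turning them into NaN.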

Answered by David Jaimes

Just try a conditional statement:

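import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'num_a':[1, 2, 3, 4, 5],
                     'num_b':[2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num':[1, 2, 3, 4, 5],
                     'name':['a', 'b', 'c', 'd', 'e']})
# num_a lists 1..5 in the same order as df_2.num, so the names align row by row
df_1["name_a"] = df_2["name"]
# num_b - 1 is the row label of the matching name (relies on num being 1..n in order)
df_1["name_b"] = np.array(df_1["name_a"][df_1["num_b"]-1])
print(df_1)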

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c