pandas DataFrame index match

Question

Asked by A. Martin

I'm wondering if there is a more efficient way to do an "index & match" type lookup like the one that is popular in Excel. For example, given two pandas DataFrames, update df_1 with information found in df_2:

import pandas as pd

df_1 = pd.DataFrame({'num_a':[1, 2, 3, 4, 5],
                     'num_b':[2, 4, 1, 2, 3]})    
df_2 = pd.DataFrame({'num':[1, 2, 3, 4, 5],
                     'name':['a', 'b', 'c', 'd', 'e']})

I'm working with data sets that have ~80,000 rows in both df_1 and df_2 and my goal is to create two new columns in df_1, "name_a" and "name_b".

Below is the most efficient method that I could come up with. There has to be a better way!

name_a = []
name_b = []
for i in range(len(df_1)):

    name_a.append(df_2.name.iloc[df_2[
                  df_2.num == df_1.num_a.iloc[i]].index[0]])
    name_b.append(df_2.name.iloc[df_2[
                  df_2.num == df_1.num_b.iloc[i]].index[0]])

df_1['name_a'] = name_a
df_1['name_b'] = name_b

Resulting in:

>>> df_1.head()
   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c

Answer 1

Answered by piRSquared

High Level

Create a dictionary to use in a replace.
Then replace, rename the columns, and join.

m = dict(zip(
    df_2.num.values.tolist(),
    df_2.name.values.tolist()
))

df_1.join(
    df_1.replace(m).rename(
        columns=lambda x: x.replace('num', 'name')
    )
)

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      5      c

Breakdown

replace with a dictionary should be pretty quick. There are a bunch of ways to build a dictionary from df_2. As a matter of fact, we could have used a pd.Series. I chose to build it with dict and zip because I find that it's faster.

Building m

Option 1

m = df_2.set_index('num').name

Option 2

m = df_2.set_index('num').name.to_dict()

Option 3

m = dict(zip(df_2.num, df_2.name))

Option 4 (My Choice)

m = dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))

m build times

%timeit df_2.set_index('num').name
%timeit df_2.set_index('num').name.to_dict()
%timeit dict(zip(df_2.num, df_2.name))
%timeit dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))

1000 loops, best of 3: 325 μs per loop
1000 loops, best of 3: 376 μs per loop
10000 loops, best of 3: 32.9 μs per loop
100000 loops, best of 3: 10.4 μs per loop
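
The build times listed here are for the five-row example frame. A rough way to re-check the ranking at the question's scale (about 80,000 rows) is sketched below with the standard timeit module. The synthetic df_2, the builders dict, and the repeat count are assumptions made for illustration, and the absolute numbers will vary with machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for df_2 at roughly the question's size (assumption).
n = 80_000
df_2 = pd.DataFrame({
    "num": np.arange(1, n + 1),
    "name": [f"name_{i}" for i in range(1, n + 1)],
})

# The four ways of building m, wrapped as zero-argument callables.
builders = {
    "option 1 (Series)": lambda: df_2.set_index("num")["name"],
    "option 2 (to_dict)": lambda: df_2.set_index("num")["name"].to_dict(),
    "option 3 (zip columns)": lambda: dict(zip(df_2["num"], df_2["name"])),
    "option 4 (zip lists)": lambda: dict(
        zip(df_2["num"].values.tolist(), df_2["name"].values.tolist())
    ),
}

repeats = 3  # small repeat count to keep the run short (assumption)
for label, build in builders.items():
    seconds = timeit.timeit(build, number=repeats) / repeats
    print(f"{label}: {seconds * 1e3:.2f} ms per call")

# Sanity check: every option should describe the same num -> name mapping.
# .items() exists on both dict and Series, so one expression covers all four.
results = {label: dict(build().items()) for label, build in builders.items()}
```

Option 1 returns a Series rather than a dict, so its build time is not directly comparable with the other three; it is included only because the answer lists it.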

Replacing num

Again, we have choices. Here are a few, with their timings.

%timeit df_1.replace(m)
%timeit df_1.applymap(lambda x: m.get(x, x))
%timeit df_1.stack().map(lambda x: m.get(x, x)).unstack()

1000 loops, best of 3: 792 μs per loop
1000 loops, best of 3: 959 μs per loop
1000 loops, best of 3: 925 μs per loop
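
As a quick check that the three candidates timed above are interchangeable on the example data, the sketch below runs each one and keeps the results side by side. The helper name lookup and the result variable names are made up for illustration. DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so the sketch uses whichever one the installed version provides.

```python
import pandas as pd

df_1 = pd.DataFrame({"num_a": [1, 2, 3, 4, 5],
                     "num_b": [2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({"num": [1, 2, 3, 4, 5],
                     "name": ["a", "b", "c", "d", "e"]})
m = dict(zip(df_2["num"].tolist(), df_2["name"].tolist()))


def lookup(x):
    """Return the name for x, or x itself when it has no entry in m."""
    return m.get(x, x)


# Pick the element-wise method available in this pandas version.
elementwise = df_1.map if hasattr(pd.DataFrame, "map") else df_1.applymap

via_replace = df_1.replace(m)
via_elementwise = elementwise(lookup)
via_stack = df_1.stack().map(lookup).unstack()

print(via_replace)
```

Note that all three leave a value untouched when it has no key in m, so an unmatched number stays a number rather than becoming a missing value.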

I choose...

df_1.replace(m)

  num_a num_b
0     a     b
1     b     d
2     c     a
3     d     b
4     5     c

Rename columns

df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name'))

  name_a name_b   <-- note the column name change
0      a      b
1      b      d
2      c      a
3      d      b
4      5      c

Join

df_1.join(df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name')))

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      5      c

Answer 2

Answered by T. Ray

I think there's a more straightforward solution than those already offered. Since you mentioned Excel, this is a basic VLOOKUP. You can simulate this in pandas by using Series.map.

name_map = dict(df_2.set_index('num').name)

df_1['name_a'] = df_1.num_a.map(name_map)
df_1['name_b'] = df_1.num_b.map(name_map)

df_1

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c

All we do is convert df_2 to a dict with 'num' as the keys. The map function looks up each value from a df_1 column in the dict and returns the corresponding letter. No complicated indexing required.
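
One behavior worth knowing with this approach: when a value has no key in the dict, Series.map returns NaN for it, much as VLOOKUP returns #N/A. A minimal sketch follows; the value 99 and the "unknown" placeholder are made up for illustration and are not part of the original data.

```python
import pandas as pd

df_2 = pd.DataFrame({"num": [1, 2, 3, 4, 5],
                     "name": ["a", "b", "c", "d", "e"]})
name_map = df_2.set_index("num")["name"].to_dict()

# 99 is a hypothetical value with no match in df_2.
nums = pd.Series([1, 3, 99])

mapped = nums.map(name_map)               # unmatched values become NaN
mapped_filled = mapped.fillna("unknown")  # replace NaN with a placeholder

print(mapped)
print(mapped_filled)
```

If unmatched values should be kept as they are instead, mapping with a function such as lambda x: name_map.get(x, x) is one option.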

Answer 3

Answered by David Jaimes

Just try a conditional statement:

import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'num_a':[1, 2, 3, 4, 5],
                     'num_b':[2, 4, 1, 2, 3]})    
df_2 = pd.DataFrame({'num':[1, 2, 3, 4, 5],
                     'name':['a', 'b', 'c', 'd', 'e']})
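# Positional shortcut: df_2 is in the same row order as df_1.num_a (1..5),
# so the names can be copied across directly.
df_1["name_a"] = df_2["name"]
# num_b - 1 is used as a row label; this relies on num being 1..n in order.
df_1["name_b"] = np.array(df_1["name_a"][df_1["num_b"]-1])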
print(df_1)

   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c
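
This shortcut depends on df_2.num running 1..n in row order, so that num - 1 happens to be the row position. The sketch below shows how a positional lookup goes wrong once df_2 is in a different order, next to a label-based lookup that does not care about row order. The shuffled ordering and the variable names are hypothetical, chosen only to illustrate the point.

```python
import pandas as pd

df_1 = pd.DataFrame({"num_a": [1, 2, 3, 4, 5],
                     "num_b": [2, 4, 1, 2, 3]})
# Same lookup table as before, but with its rows in a different order
# (hypothetical ordering, for illustration only).
df_2 = pd.DataFrame({"num": [3, 1, 5, 2, 4],
                     "name": ["c", "a", "e", "b", "d"]})

# Positional lookup: treats num - 1 as a row position, which is now wrong.
positional = df_2["name"].to_numpy()[df_1["num_b"].to_numpy() - 1]

# Label-based lookup: index the names by num, then select by value.
names_by_num = df_2.set_index("num")["name"]
by_label = names_by_num.reindex(df_1["num_b"]).to_numpy()

print(positional)
print(by_label)
```

The label-based version requires the num values in df_2 to be unique; reindex raises an error on a duplicated index.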
