pandas: Add values in a new column based on the index in Python

Question

Asked by ArnJac

I'm just getting into pandas and I am trying to add a new column to an existing dataframe.

I have two dataframes, where the index of one dataframe links to a column in the other. Where these values are equal, I need to put the value of another column of the source dataframe into a new column of the destination dataframe.

The code section below illustrates what I mean. The commented part is what I need as an output.

I guess I need the .loc[] function.

Another, minor, question: is it bad practice to have non-unique indexes?

import pandas as pd

d = {'key':['a',  'b', 'c'], 
     'bar':[1, 2, 3]}

d2 = {'key':['a', 'a', 'b'],
      'other_data':['10', '20', '30']}

df = pd.DataFrame(d)
df2 = pd.DataFrame(data = d2)
df2 = df2.set_index('key')

print(df2)

##    other_data  new_col
##key           
##a            10   1
##a            20   1
##b            30   2

Answer 1

Answered by jezrael

Use rename to relabel the index with a Series:

df2['new'] = df2.rename(index=df.set_index('key')['bar']).index
print (df2)

    other_data  new
key                
a           10    1
a           20    1
b           30    2

Or map:

df2['new'] = df2.index.to_series().map(df.set_index('key')['bar'])
print (df2)

    other_data  new
key                
a           10    1
a           20    1
b           30    2
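
As a variation on the map approach, the index can also be mapped directly, without the intermediate to_series() call. The sketch below rebuilds the two frames from the question; the name lookup is introduced here for illustration only and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'bar': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'a', 'b'],
                    'other_data': ['10', '20', '30']}).set_index('key')

# Hypothetical helper name: a Series that maps each key to its 'bar' value
lookup = df.set_index('key')['bar']

# Index.map returns a new Index; assigning it fills the column by position
df2['new'] = df2.index.map(lookup)
print(df2)
```

Because the mapped result is assigned by position, no alignment on the duplicated index labels is needed.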

For better performance, it is best to avoid duplicates in the index. Also, some functions, such as reindex, fail on an index that contains duplicates.
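
To check for duplicates before relying on operations such as reindex, Index.is_unique may help. The following is a minimal sketch using a frame shaped like the question's df2; the flag reindex_failed exists only for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({'other_data': ['10', '20', '30']},
                   index=pd.Index(['a', 'a', 'b'], name='key'))

# False here, because the label 'a' appears twice
print(df2.index.is_unique)

try:
    df2.reindex(['a', 'b'])
    reindex_failed = False
except ValueError as exc:
    # pandas refuses to reindex an axis that has duplicate labels
    reindex_failed = True
    print(exc)
```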

Answer 2

Answered by piRSquared

You can use join

df2.join(df.set_index('key'))

    other_data  bar
key                
a           10    1
a           20    1
b           30    2

One way to rename the column in the process:

df2.join(df.set_index('key').bar.rename('new'))

    other_data  new
key                
a           10    1
a           20    1
b           30    2
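
One thing worth knowing about join is that it performs a left join by default, so index labels in df2 that have no match in df are kept and receive NaN. The sketch below assumes a made-up key 'z' that does not occur in the question's data, purely to show that behaviour.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'bar': [1, 2, 3]})
# 'z' is a hypothetical key with no counterpart in df
df2 = pd.DataFrame({'key': ['a', 'a', 'z'],
                    'other_data': ['10', '20', '30']}).set_index('key')

new_col = df.set_index('key')['bar'].rename('new')

# Default how='left': every row of df2 is kept, 'z' gets NaN
joined = df2.join(new_col)
print(joined)

# how='inner' keeps only the rows whose key exists in both frames
inner = df2.join(new_col, how='inner')
print(inner)
```

Note that the NaN forces the new column to a float dtype in the left-join result.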

Answer 3

Answered by Bharath

With the help of .loc:

# Select the 'bar' column explicitly so that a Series is assigned
df2['new'] = df.set_index('key').loc[df2.index, 'bar']

Output:

    other_data  new
key                
a           10    1
a           20    1
b           30    2

Answer 4

Answered by Brad Solomon

Another, minor, question: is it bad practice to have non-unique indexes?

It is not great practice, but depends on your needs and can be okay in some circumstances.

Issue 1: join operations

A good place to start is to think about what makes an Index different from a standard DataFrame column. This raises the question: if your Index has duplicate values, does it really need to be specified as an Index, or could it just be another column in a RangeIndex-ed DataFrame? If you've ever used SQL or any other DBMS and want to mimic join operations in pandas with functions such as .join or .merge, you'll lose the functionality of a primary key if you have duplicate index values. For every label that is duplicated on both sides, a merge returns all pairings of the matching rows, which is basically a Cartesian product and probably not what you're looking for.

For example:

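import numpy as np
import pandas as pd

# Every label in 'abcde' appears twice in the index
df = pd.DataFrame(np.random.randn(10,2),
                  index=2*list('abcde'))
df2 = df.rename(columns={0: 'a', 1 : 'b'})
# Each duplicated label produces 2 x 2 = 4 rows in the merged result
print(df.merge(df2, left_index=True, right_index=True).head(7))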
         0        1        a        b
a  0.73737  1.49073  0.73737  1.49073
a  0.73737  1.49073 -0.25562 -2.79859
a -0.25562 -2.79859  0.73737  1.49073
a -0.25562 -2.79859 -0.25562 -2.79859
b -0.93583  1.17583 -0.93583  1.17583
b -0.93583  1.17583 -1.77153 -0.69988
b -1.77153 -0.69988 -0.93583  1.17583
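
If the row multiplication shown above would be a bug in your pipeline, one possible safeguard is the validate argument of merge, which raises an error instead of silently expanding the result. This is a sketch under the assumption that a one-to-one relationship is what you expect; the names left, right and caught are illustrative only.

```python
import numpy as np
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame(np.random.randn(10, 2), index=2 * list('abcde'))
right = left.rename(columns={0: 'a', 1: 'b'})

# 5 labels, each duplicated on both sides: 5 * (2 * 2) = 20 rows
merged = left.merge(right, left_index=True, right_index=True)
print(len(left), len(merged))

try:
    # validate checks the merge keys before merging
    left.merge(right, left_index=True, right_index=True,
               validate='one_to_one')
    caught = False
except MergeError as exc:
    caught = True
    print(exc)
```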

Issue 2: performance

Unique-valued indices make certain operations efficient, as explained in this post.

When the index is unique, pandas uses a hash table to map a key to its value, which is O(1). When the index is non-unique but sorted, pandas uses binary search, which is O(log N). When the index is non-unique and randomly ordered, pandas needs to check all the keys in the index, which is O(N).
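
The three cases can be observed indirectly through Index.get_loc, whose return type depends on whether the index is unique and whether it is sorted. The sketch below does not measure timings; it only shows which kind of result each case produces, using three small made-up indexes.

```python
import pandas as pd

unique_idx = pd.Index(['a', 'b', 'c'])
sorted_dup_idx = pd.Index(['a', 'a', 'b'])
unsorted_dup_idx = pd.Index(['a', 'b', 'a'])

# Unique index: a single integer position
pos = unique_idx.get_loc('b')

# Non-unique but sorted index: a slice covering the matching run
run = sorted_dup_idx.get_loc('a')

# Non-unique and unsorted index: a boolean mask over every entry
mask = unsorted_dup_idx.get_loc('a')

print(pos, run, mask)
print(sorted_dup_idx.is_unique, sorted_dup_idx.is_monotonic_increasing)
```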

A word on `.loc`

Using .loc will return all instances of the label. This can be a blessing or a curse depending on what your objective is. For example,

df = pd.DataFrame(np.random.randn(10,2),
                  index=2*list('abcde'))
print(df.loc['a'])
         0        1
a  0.73737  1.49073
a -0.25562 -2.79859
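
A practical consequence is that the type returned by .loc depends on whether the label is duplicated. The following sketch uses a small made-up frame to show this, along with one way to get a consistent type by passing the label inside a list.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'a', 'b'])

# Duplicated label: all matching rows come back as a DataFrame
dup = df.loc['a']

# Unique label: a single row comes back as a Series
single = df.loc['b']

# A list of labels always yields a DataFrame, duplicated or not
always_frame = df.loc[['b']]

print(type(dup).__name__, type(single).__name__, type(always_frame).__name__)
```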

Answer 5

Answered by Zero

Using combine_first

In [442]: df2.combine_first(df.set_index('key')).dropna()
Out[442]:
     bar other_data
key
a    1.0         10
a    1.0         20
b    2.0         30

Or, using map

In [461]: df2.assign(bar=df2.index.to_series().map(df.set_index('key')['bar']))
Out[461]:
    other_data  bar
key
a           10    1
a           20    1
b           30    2
