pandas: adding values in a new column based on indexes with pandas in Python

Disclaimer: this page is a side-by-side Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/45636105/

Time: 2020-09-14 04:13:52  Source: igfitidea

Adding values in a new column based on indexes with pandas in Python

python, pandas

Asked by ArnJac

I'm just getting into pandas and I am trying to add a new column to an existing dataframe.

I have two dataframes where the index of one dataframe links to a column in the other dataframe. Where these values are equal, I need to put the value of another column of the source dataframe into a new column of the destination dataframe.

The code section below illustrates what I mean. The commented part is what I need as an output.

I guess I need the .loc[] function.

Another, minor, question: is it bad practice to have non-unique indexes?

import pandas as pd

d = {'key':['a',  'b', 'c'], 
     'bar':[1, 2, 3]}

d2 = {'key':['a', 'a', 'b'],
      'other_data':['10', '20', '30']}

df = pd.DataFrame(d)
df2 = pd.DataFrame(data = d2)
df2 = df2.set_index('key')

print(df2)

# Desired output (new_col is the column to add):
##      other_data  new_col
## key
## a            10        1
## a            20        1
## b            30        2

Answered by jezrael

Use rename to map the index through a Series:

df2['new'] = df2.rename(index=df.set_index('key')['bar']).index
print (df2)

    other_data  new
key                
a           10    1
a           20    1
b           30    2

Or map:

df2['new'] = df2.index.to_series().map(df.set_index('key')['bar'])
print (df2)

    other_data  new
key                
a           10    1
a           20    1
b           30    2
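
The two approaches above differ when a label in df2's index has no match in df. A minimal sketch, assuming a made-up extra label 'z' that is not in the question's data: rename leaves unmatched labels as they are, while map returns NaN for them (which also turns the result into floats).

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'bar': [1, 2, 3]})
# 'z' is a hypothetical label with no counterpart in df (not in the question)
df2 = pd.DataFrame({'key': ['a', 'a', 'b', 'z'],
                    'other_data': ['10', '20', '30', '40']}).set_index('key')

lookup = df.set_index('key')['bar']  # Series mapping key -> bar

# rename: labels missing from the mapping are left unchanged
via_rename = df2.rename(index=lookup).index
# map: labels missing from the mapping become NaN
via_map = df2.index.to_series().map(lookup)

print(list(via_rename))
print(via_map.isna().tolist())
```

If unmatched labels are possible in your data, map is probably the safer choice, since a NaN is easier to detect than a leftover label.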

If you want better performance, it is best to avoid duplicates in the index. Also, some functions such as reindex fail on an index with duplicates.
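
As a rough illustration of the reindex point, the sketch below rebuilds the question's df2 and checks its index for duplicates before calling reindex. The exact error message varies between pandas versions, so only the exception type is relied on here.

```python
import pandas as pd

df2 = pd.DataFrame({'other_data': ['10', '20', '30']},
                   index=pd.Index(['a', 'a', 'b'], name='key'))

# Cheap checks for duplicate labels
print(df2.index.is_unique)
print(df2.index.duplicated().tolist())

# reindex cannot decide which 'a' row to use, so it raises
try:
    df2.reindex(['a', 'b', 'c'])
    reindex_failed = False
except ValueError as exc:
    reindex_failed = True
    print('reindex raised:', exc)
```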

Answered by piRSquared

You can use join

df2.join(df.set_index('key'))

    other_data  bar
key                
a           10    1
a           20    1
b           30    2


One way to rename the column in the process

df2.join(df.set_index('key').bar.rename('new'))

    other_data  new
key                
a           10    1
a           20    1
b           30    2
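
Note that join returns a new DataFrame and leaves df2 untouched, so the result has to be assigned to something. The sketch below shows that, and also a merge-based variant that uses the validate argument to check that the lookup side has unique keys. The variable names joined and checked are only illustrative, and the merge variant is an alternative, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'bar': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'a', 'b'],
                    'other_data': ['10', '20', '30']}).set_index('key')

# join does not modify df2 in place; keep the returned frame
joined = df2.join(df.set_index('key'))
print('bar' in df2.columns)   # df2 itself is unchanged
print(joined)

# Equivalent left merge on the indexes; validate='many_to_one' raises
# pandas.errors.MergeError if the right-hand keys are not unique
checked = df2.merge(df.set_index('key'), left_index=True, right_index=True,
                    how='left', validate='many_to_one')
print(checked)
```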

Answered by Bharath

With the help of .loc

df2['new'] = df.set_index('key').loc[df2.index]

Output:

    other_data  new
key                
a           10    1
a           20    1
b           30    2
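
A slightly more explicit variant of the same idea, as a sketch rather than the answer's exact code: select the 'bar' column by name and assign the raw values, so the assignment does not depend on aligning two duplicated indexes. The name lookup is an assumption introduced here.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'bar': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'a', 'b'],
                    'other_data': ['10', '20', '30']}).set_index('key')

lookup = df.set_index('key')['bar']  # unique index: one bar value per key

# .loc with a list of labels returns one row per requested label,
# so repeated labels in df2.index simply repeat the matching value
df2['new'] = lookup.loc[df2.index].to_numpy()
print(df2)
```

In recent pandas versions, .loc raises a KeyError if any label in df2.index is missing from the lookup, so this variant assumes every key has a match.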

Answered by Brad Solomon

Another, minor, question: is it bad practice to have non-unique indexes?

It is not great practice, but depends on your needs and can be okay in some circumstances.

Issue 1: join operations

A good place to start is to think about what makes an Index different from a standard DataFrame column. This raises the question: if your Index has duplicate values, does it really need to be specified as an Index, or could it just be another column in a RangeIndex-ed DataFrame? If you've ever used SQL or any other DBMS and want to mimic join operations in pandas with functions such as .join or .merge, you'll lose the functionality of a primary key if you have duplicate index values. A merge will give you what is basically a cartesian product within each duplicated label, which is probably not what you're looking for.

For example:

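import numpy as np
import pandas as pd

# 10 rows of random data; each label 'a'..'e' appears twice in the index
df = pd.DataFrame(np.random.randn(10,2),
                  index=2*list('abcde'))
df2 = df.rename(columns={0: 'a', 1 : 'b'})
print(df.merge(df2, left_index=True, right_index=True).head(7))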
         0        1        a        b
a  0.73737  1.49073  0.73737  1.49073
a  0.73737  1.49073 -0.25562 -2.79859
a -0.25562 -2.79859  0.73737  1.49073
a -0.25562 -2.79859 -0.25562 -2.79859
b -0.93583  1.17583 -0.93583  1.17583
b -0.93583  1.17583 -1.77153 -0.69988
b -1.77153 -0.69988 -0.93583  1.17583
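
To make the size of that blow-up concrete, here is a small sketch along the same lines. The names left and right and the fixed seed are assumptions for illustration. Each label appears twice on both sides, so each label produces 2 x 2 = 4 rows and the 10-row frames merge into 20 rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only to make the sketch repeatable
left = pd.DataFrame(rng.standard_normal((10, 2)), index=2 * list('abcde'))
right = left.rename(columns={0: 'a', 1: 'b'})

merged = left.merge(right, left_index=True, right_index=True)

print(len(left), len(merged))
# Number of merged rows per index label
print(merged.index.value_counts().sort_index())
```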

Issue 2: performance

Unique-valued indices make certain operations efficient, as explained in this post.

When the index is unique, pandas uses a hash table to map keys to values, O(1). When the index is non-unique and sorted, pandas uses binary search, O(log N). When the index is in random order, pandas needs to check all the keys in the index, O(N).
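
The three cases can be told apart from the index itself. The sketch below (index names are made up for illustration) prints the is_unique and is_monotonic_increasing flags and shows that Index.get_loc returns a different kind of result in each case: a single position, a slice, or a boolean mask. It does not measure timings.

```python
import pandas as pd

unique_idx = pd.Index(list('abcde'))                 # a b c d e
sorted_dupes = pd.Index(sorted(2 * list('abcde')))   # a a b b c c d d e e
unsorted_dupes = pd.Index(2 * list('abcde'))         # a b c d e a b c d e

for name, idx in [('unique', unique_idx),
                  ('sorted duplicates', sorted_dupes),
                  ('unsorted duplicates', unsorted_dupes)]:
    loc = idx.get_loc('a')
    print(name, idx.is_unique, idx.is_monotonic_increasing, type(loc).__name__)
```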

A word on .loc

Using .loc will return all instances of the label. This can be a blessing or a curse depending on what your objective is. For example,

df = pd.DataFrame(np.random.randn(10,2),
                  index=2*list('abcde'))
print(df.loc['a'])
         0        1
a  0.73737  1.49073
a -0.25562 -2.79859
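
One practical consequence is that the type of the result changes with the index. A small sketch with made-up frames dup and uniq: with a duplicated label, .loc returns a DataFrame containing every matching row, while with a unique label it returns a single row as a Series.

```python
import numpy as np
import pandas as pd

# Deterministic data instead of random numbers, so the rows are easy to follow
dup = pd.DataFrame(np.arange(20).reshape(10, 2), index=2 * list('abcde'))
uniq = pd.DataFrame(np.arange(10).reshape(5, 2), index=list('abcde'))

rows = dup.loc['a']   # two matching rows -> DataFrame
row = uniq.loc['a']   # one matching row  -> Series

print(type(rows).__name__, rows.shape)
print(type(row).__name__, row.tolist())
```

Code that expects a single row should therefore check the index for uniqueness first, or handle both return types.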

Answered by Zero

Using combine_first

In [442]: df2.combine_first(df.set_index('key')).dropna()
Out[442]:
     bar other_data
key
a    1.0         10
a    1.0         20
b    2.0         30

Or, using map

In [461]: df2.assign(bar=df2.index.to_series().map(df.set_index('key')['bar']))
Out[461]:
    other_data  bar
key
a           10    1
a           20    1
b           30    2