pandas: efficiently concatenating multiple pandas Series

Question

Asked by Kane Chew

I understand that I can use combine_first to merge two series:

import pandas as pd

series1 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series2 = pd.Series([1,2,3,4,5],index=['f','g','h','i','j'])
series3 = pd.Series([1,2,3,4,5],index=['k','l','m','n','o'])

Combine1 = series1.combine_first(series2)
print(Combine1)

Output:

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
f    1.0
g    2.0
h    3.0
i    4.0
j    5.0
dtype: float64

What if I need to merge 3 or more series?

I understand that the following code, print(series1 + series2 + series3), yields:

a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
f   NaN
...
dtype: float64

Can I merge multiple series efficiently without calling combine_first multiple times?

Thanks

Answer 1

Answered by cs95

Combine Series with Non-Overlapping Indexes

To combine series vertically, use pd.concat.

# Setup
import pandas as pd

series_list = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('fghij')),
    pd.Series(range(1, 6), index=list('klmno'))
]

pd.concat(series_list)

a    1
b    2
c    3
d    4
e    5
f    1
g    2
h    3
i    4
j    5
k    1
l    2
m    3
n    4
o    5
dtype: int64
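
If you would rather have pd.concat fail loudly when the indexes turn out not to be disjoint after all, its verify_integrity flag can do the check for you. The following is only a minimal sketch; the helper name concat_disjoint is made up for illustration and is not part of pandas.

```python
import pandas as pd


def concat_disjoint(series_list):
    """Concatenate Series vertically, raising ValueError on duplicate labels.

    concat_disjoint is a hypothetical helper name, not a pandas function.
    """
    return pd.concat(series_list, verify_integrity=True)


disjoint = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('fghij')),
]
combined = concat_disjoint(disjoint)
print(len(combined))

overlapping = [
    pd.Series([1, 2], index=['a', 'b']),
    pd.Series([3, 4], index=['b', 'c']),
]
try:
    concat_disjoint(overlapping)
    overlap_detected = False
except ValueError:
    # pandas reports the overlapping index values in the error message
    overlap_detected = True
print(overlap_detected)
```

The check costs an extra pass over the index, so it is probably best kept for cases where an unexpected overlap would silently corrupt later results.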

Combine with Overlapping Indexes

series_list = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('kbmdf'))
]

If the Series have overlapping indices, you can either add up the values that share a key,

pd.concat(series_list, axis=1, sort=False).sum(axis=1)

a     2.0
b     6.0
c     6.0
d    12.0
e    10.0
k     1.0
m     3.0
f     5.0
dtype: float64
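
As an alternative sketch (not taken from the answer above), the same totals can be obtained by stacking the series vertically and grouping on the index. Since no NaN padding is introduced this way, the integer values should not need to be upcast to float; note that groupby sorts the labels by default, so the row order differs from the output above.

```python
import pandas as pd

series_list = [
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('abcde')),
    pd.Series(range(1, 6), index=list('kbmdf')),
]

# Stack everything into one long Series, then sum the rows sharing a label.
stacked = pd.concat(series_list)
totals = stacked.groupby(level=0).sum()
print(totals)
```

Swapping .sum() for another aggregation such as .mean() or .max() changes how overlapping labels are resolved without changing the rest of the code.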

Alternatively, drop the duplicated index labels if you only want to keep the first or last value for each label.

res = pd.concat(series_list, axis=0)
# keep first value
res[~res.index.duplicated(keep='first')]
# keep last value
res[~res.index.duplicated(keep='last')]
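
In the series_list above, the overlapping labels happen to carry the same values, so keep='first' and keep='last' are hard to tell apart. The sketch below uses two small made-up series (first and second are illustrative names only) whose overlapping labels disagree, to show which value survives in each case.

```python
import pandas as pd

# Hypothetical data: labels 'b' and 'c' appear in both Series with different values.
first = pd.Series([1, 2, 3], index=list('abc'))
second = pd.Series([20, 30, 40], index=list('bcd'))

res = pd.concat([first, second], axis=0)

# Keep the value from the earliest Series that contains each label.
keep_first = res[~res.index.duplicated(keep='first')]
# Keep the value from the latest Series that contains each label.
keep_last = res[~res.index.duplicated(keep='last')]

print(keep_first.to_dict())
print(keep_last.to_dict())
```

With keep='first', b and c take the values 2 and 3 from the first series; with keep='last', they take 20 and 30 from the second.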

Answer 2

Answered by miradulo

Presuming that you were relying on combine_first to prioritize the values of the series in order, which is what combine_first is meant for, you can succinctly chain multiple calls to it with reduce and a lambda expression.

from functools import reduce
l_series = [series1, series2, series3]
reduce(lambda s1, s2: s1.combine_first(s2), l_series)

Of course, if the indices are unique, as in your current example, you can simply use pd.concat instead.

Demo

series1 = pd.Series(list(range(5)),index=['a','b','c','d','e'])
series2 = pd.Series(list(range(5, 10)),index=['a','g','h','i','j'])
series3 = pd.Series(list(range(10, 15)),index=['k','b','m','c','o'])

from functools import reduce
l_series = [series1, series2, series3]
print(reduce(lambda s1, s2: s1.combine_first(s2), l_series))

# a     0.0
# b     1.0
# c     2.0
# d     3.0
# e     4.0
# g     6.0
# h     7.0
# i     8.0
# j     9.0
# k    10.0
# m    12.0
# o    14.0
# dtype: float64
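
For comparison, here is a sketch of a single-pass alternative to the chained calls: concatenate once and keep the first occurrence of each label. This is only equivalent under the assumption that none of the input series contain NaN, because combine_first also fills missing values in earlier series from later ones, which the duplicate filter does not do.

```python
from functools import reduce

import pandas as pd

series1 = pd.Series(list(range(5)), index=['a', 'b', 'c', 'd', 'e'])
series2 = pd.Series(list(range(5, 10)), index=['a', 'g', 'h', 'i', 'j'])
series3 = pd.Series(list(range(10, 15)), index=['k', 'b', 'm', 'c', 'o'])
l_series = [series1, series2, series3]

# Chained combine_first, as in the demo above.
chained = reduce(lambda s1, s2: s1.combine_first(s2), l_series).sort_index()

# Single concat, then keep the first value seen for each label.
stacked = pd.concat(l_series)
single_pass = stacked[~stacked.index.duplicated(keep='first')].sort_index()

# Compare labels and values; dtypes are normalized to float before comparing.
same_labels = list(chained.index) == list(single_pass.index)
same_values = bool(
    (chained.astype(float).to_numpy() == single_pass.astype(float).to_numpy()).all()
)
print(same_labels, same_values)
```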

Answer 3

Answered by White

I agree with what @codespeed pointed out in his answer.

I think it depends on what you need. If the series indexes are known not to overlap, concat is the better option. (The original question has no overlapping index labels, so concat fits there.)

If the indexes do overlap, you need to decide how to handle the overlap and which value should win. (As in the example provided by codespeed: if the same index label maps to different values, be careful with combine_first, because the result depends on the order of the calls.)

For example (note that series3 is the same as series1, and series4 is the same as series2):

import pandas as pd
import numpy as np


series1 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series2 = pd.Series([2,3,4,4,5],index=['a','b','c','i','j'])
series3 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
series4 = pd.Series([2,3,4,4,5],index=['a','b','c','i','j'])


print(series1.combine_first(series2))



a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
i    4.0
j    5.0
dtype: float64



print(series4.combine_first(series3))



a    2.0
b    3.0
c    4.0
d    4.0
e    5.0
i    4.0
j    5.0
dtype: float64
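
Before picking an order for combine_first, it may help to list the labels on which the two series actually disagree. The following is a minimal sketch using Index.intersection; the variable names shared and conflicts are chosen here for illustration.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
series2 = pd.Series([2, 3, 4, 4, 5], index=['a', 'b', 'c', 'i', 'j'])

# Labels present in both Series.
shared = series1.index.intersection(series2.index)

# Of those, the labels where the two Series hold different values.
differs = (series1[shared] != series2[shared]).to_numpy()
conflicts = shared[differs]

print(list(shared))
print(list(conflicts))
```

If conflicts is empty, the order of the combine_first calls does not affect the values; otherwise the series passed first wins on those labels.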

Answer 4

Answered by Vaishali

You would use combine_first if you want one series's values prioritized over the other's. It's usually used to fill the missing values in the first series. I am not sure what the expected output is in your example, but it looks like you can use concat:

pd.concat([series1, series2, series3])

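With the three series from the question, this returns a single Series with all fifteen entries, labelled a through o, in the order the series were passed.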
