Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, link to the original, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/28174752/

Time: 2020-08-19 02:51:55  Source: igfitidea

pandas merge dataframe with NaN (or "unknown") for missing values

python pandas dataframe

Asked by Kevin Thompson

I have 2 dataframes, one of which has supplemental information for some (but not all) of the rows in the other.

import pandas as pd

names = pd.DataFrame({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],
                      'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank'],
                     'classification':['thief','thief','good','thief']})

I would like to take the classification column from the info dataframe above and add it to the names dataframe above. However, when I do combined = pd.merge(names, info), the resulting dataframe is only 4 rows long. All of the rows that do not have supplemental info are dropped.

Ideally, the classification for those rows would be set to "unknown", resulting in a dataframe where some people are thieves, some are good, and the rest are unknown.

EDIT: One of the first answers I received suggested using an outer merge, which seems to do some weird things. Here is a code sample:

names = pd.DataFrame({'names':['bob','frank','bob','bob','bob''james','tim','ricardo','mike','mark','joan','joe'],
                      'position':['dev','dev','dev','dev','dev','dev''sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank','joe','bill'],
                     'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna("unknown")

The strange thing is that in the output I get a row where the resulting name is "bobjames" and another where position is "devsys". Finally, even though bill does not appear in the names dataframe, it shows up in the resulting dataframe. So I really need a way to say: look up a value in this other dataframe and, if you find something, tack on those columns.

Answered by EdChum

I think you want to perform an outer merge:

In [60]: pd.merge(names, info, how='outer')
Out[60]:
     names position classification
0      bob      dev            NaN
1    frank      dev          thief
2    james      dev            NaN
3      tim      sys           good
4  ricardo      sys            NaN
5     mike      sys            NaN
6     mark      sup          thief
7     joan      sup            NaN
8      joe      sup          thief

There is a section of the docs showing the types of merge you can perform: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
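
To make the difference between the merge types concrete, here is a minimal sketch that counts the rows each how= option returns. It assumes the names data from the question plus an info frame that also contains bill (taken from the question's edit); the row_counts dictionary is just an illustrative helper, not something from the original post.

```python
import pandas as pd

names = pd.DataFrame({
    "names": ["bob", "frank", "james", "tim", "ricardo", "mike", "mark", "joan", "joe"],
    "position": ["dev", "dev", "dev", "sys", "sys", "sys", "sup", "sup", "sup"],
})
# 'bill' exists only in info, which makes the join types easy to tell apart.
info = pd.DataFrame({
    "names": ["joe", "mark", "tim", "frank", "bill"],
    "classification": ["thief", "thief", "good", "thief", "thief"],
})

# Count how many rows each join type produces (row_counts is a hypothetical helper).
row_counts = {}
for how in ("inner", "left", "right", "outer"):
    merged = pd.merge(names, info, on="names", how=how)
    row_counts[how] = len(merged)
    print(how, len(merged))
```

With this data, inner keeps only the 4 names found in both frames, left keeps all 9 rows of names, right keeps the 5 rows of info, and outer keeps all 10 distinct names. If bill should not appear in the result, left is the option to look at rather than outer.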

Answered by oxtay

In case you are still looking for an answer for this:

The "strange" things that you described are due to some minor errors in your code. The first (the appearance of "bobjames" and "devsys") happens because there is no comma between those two values in your source dataframes, so Python joins the two adjacent string literals into one. The second is because pandas doesn't care about the name of your dataframe but cares about the name of your columns when merging (you have a dataframe called "names" but your columns are also called "names"), and an outer merge keeps the keys from both sides, which is why bill shows up. Otherwise, it seems that the merge is doing exactly what you are looking for:

import pandas as pd
names = pd.DataFrame({'names':['bob','frank','bob','bob','bob', 'james','tim','ricardo','mike','mark','joan','joe'], 
                      'position':['dev','dev','dev','dev','dev','dev', 'sys','sys','sys','sup','sup','sup']})

info = pd.DataFrame({'names':['joe','mark','tim','frank','joe','bill'],
                     'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna('unknown', inplace=True)

which will result in:

      names position classification
0       bob      dev        unknown
1       bob      dev        unknown
2       bob      dev        unknown
3       bob      dev        unknown
4     frank      dev          thief
5     james      dev        unknown
6       tim      sys           good
7   ricardo      sys        unknown
8      mike      sys        unknown
9      mark      sup          thief
10     joan      sup        unknown
11      joe      sup          thief
12      joe      sup           good
13     bill  unknown          thief
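
The "bobjames" effect can be reproduced without pandas at all. A minimal sketch, using a shortened list made up for illustration (with_typo and fixed are hypothetical names, not from the question):

```python
# Python joins adjacent string literals into a single string,
# so a missing comma silently merges two list items into one.
with_typo = ['bob', 'bob' 'james', 'tim']  # comma missing after the second 'bob'
fixed = ['bob', 'bob', 'james', 'tim']

print(with_typo)       # ['bob', 'bobjames', 'tim']
print(len(with_typo))  # 3
print(len(fixed))      # 4
```

Because the same slip was made in both lists in the question, the two columns still ended up the same length, which is probably why the DataFrame was built without an error.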

Answered by Lucas Aimaretto

Think of it as an SQL join operation. You need a left outer join [1].

import pandas as pd

names = pd.DataFrame({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],
                      'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})

info = pd.DataFrame({'names':['joe','mark','tim','frank'],
                     'classification':['thief','thief','good','thief']})

Since there are names for which there is no classification, a left outer join will do the job.

a = pd.merge(names, info, how='left', on='names')

The result is ...

>>> a
     names position classification
0      bob      dev            NaN
1    frank      dev          thief
2    james      dev            NaN
3      tim      sys           good
4  ricardo      sys            NaN
5     mike      sys            NaN
6     mark      sup          thief
7     joan      sup            NaN
8      joe      sup          thief

... which is fine. All the NaN results are correct if you look at both tables.
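
If you want the "unknown" label from the question instead of NaN, one possible follow-up is to fill only the classification column after the left merge. This is a sketch rather than the one right way to do it; the extra bill row in info is borrowed from the question's edit to show that a left join leaves him out, and combined is simply the variable name the question used.

```python
import pandas as pd

names = pd.DataFrame({
    "names": ["bob", "frank", "james", "tim", "ricardo", "mike", "mark", "joan", "joe"],
    "position": ["dev", "dev", "dev", "sys", "sys", "sys", "sup", "sup", "sup"],
})
# 'bill' is not in names, so a left join should not bring him in.
info = pd.DataFrame({
    "names": ["joe", "mark", "tim", "frank", "bill"],
    "classification": ["thief", "thief", "good", "thief", "thief"],
})

# Keep every row of names; unmatched rows get NaN in classification.
combined = pd.merge(names, info, how="left", on="names")

# Fill only the column that came from info, leaving other columns untouched.
combined["classification"] = combined["classification"].fillna("unknown")
print(combined)
```

Filling a single column avoids writing "unknown" into unrelated columns, which is what happened to bill's position in the outer-merge output of the earlier answer. Note that if info held several rows for the same name (as with joe in the question's edit), a left merge would still return one output row per match.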

Cheers!

[1] - http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
