Original question: http://stackoverflow.com/questions/28174752/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not the republisher) on Stack Overflow.
pandas merge dataframe with NaN (or "unknown") for missing values
Asked by Kevin Thompson
I have 2 dataframes, one of which has supplemental information for some (but not all) of the rows in the other.
import pandas as pd

names = pd.DataFrame({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],
                      'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank'],
                     'classification':['thief','thief','good','thief']})
I would like to take the classification column from the info dataframe above and add it to the names dataframe above. However, when I do combined = pd.merge(names, info), the resulting dataframe is only 4 rows long. All of the rows that do not have supplemental info are dropped.
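For context, pd.merge defaults to how='inner', which keeps only the keys present in both frames. That is why only 4 rows survive here. A minimal sketch using the data above (the variable names default_merge and explicit_inner are just illustrative):

```python
import pandas as pd

names = pd.DataFrame({
    'names': ['bob', 'frank', 'james', 'tim', 'ricardo', 'mike', 'mark', 'joan', 'joe'],
    'position': ['dev', 'dev', 'dev', 'sys', 'sys', 'sys', 'sup', 'sup', 'sup'],
})
info = pd.DataFrame({
    'names': ['joe', 'mark', 'tim', 'frank'],
    'classification': ['thief', 'thief', 'good', 'thief'],
})

# With no arguments, merge joins on the shared column ('names') using an inner join.
default_merge = pd.merge(names, info)
explicit_inner = pd.merge(names, info, how='inner', on='names')

print(len(default_merge))                    # 4
print(default_merge.equals(explicit_inner))  # True
```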
Ideally, the missing values in that column would be set to unknown, resulting in a dataframe where some people are thieves, some are good, and the rest are unknown.
EDIT: One of the first answers I received suggested using an outer merge, which seems to do some weird things. Here is a code sample:
names = pd.DataFrame({'names':['bob','frank','bob','bob','bob''james','tim','ricardo','mike','mark','joan','joe'],
                      'position':['dev','dev','dev','dev','dev','dev''sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank','joe','bill'],
                     'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna("unknown")
The strange thing is that in the output I get a row where the resulting name is "bobjames" and another where the position is "devsys". Also, even though bill does not appear in the names dataframe, he shows up in the resulting dataframe. So what I really need is a way to say: look up a value in this other dataframe, and if you find something, tack on those columns.
Answered by EdChum
I think you want to perform an outer merge:
In [60]:
pd.merge(names, info, how='outer')
Out[60]:
     names position classification
0      bob      dev            NaN
1    frank      dev          thief
2    james      dev            NaN
3      tim      sys           good
4  ricardo      sys            NaN
5     mike      sys            NaN
6     mark      sup          thief
7     joan      sup            NaN
8      joe      sup          thief
There is a section in the docs showing the types of merges you can perform: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
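As a rough illustration of those merge types, the sketch below runs each how option on the sample data and counts the resulting rows. The extra 'bill' row in info is borrowed from the question's edit, and row_counts is a hypothetical helper name. Passing indicator=True adds a _merge column that records whether each row came from the left frame, the right frame, or both.

```python
import pandas as pd

names = pd.DataFrame({
    'names': ['bob', 'frank', 'james', 'tim', 'ricardo', 'mike', 'mark', 'joan', 'joe'],
    'position': ['dev', 'dev', 'dev', 'sys', 'sys', 'sys', 'sup', 'sup', 'sup'],
})
# 'bill' is deliberately absent from `names` (taken from the question's edit).
info = pd.DataFrame({
    'names': ['joe', 'mark', 'tim', 'frank', 'bill'],
    'classification': ['thief', 'thief', 'good', 'thief', 'thief'],
})

row_counts = {}
for how in ('inner', 'left', 'right', 'outer'):
    merged = pd.merge(names, info, how=how, on='names', indicator=True)
    row_counts[how] = len(merged)

print(row_counts)  # {'inner': 4, 'left': 9, 'right': 5, 'outer': 10}

# In the outer merge, 'bill' is the only row that exists solely in `info`.
outer = pd.merge(names, info, how='outer', on='names', indicator=True)
print(outer.loc[outer['_merge'] == 'right_only', 'names'].tolist())  # ['bill']
```

Only how='left' keeps every row of names without pulling in people who appear only in info.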
Answered by oxtay
In case you are still looking for an answer for this:
The "strange" things that you described are due to some minor errors in your code. For example, the first (appearance of "bobjames" and "devsys") is due to the fact that you don't have a comma between those two values in your source dataframes. And the second is because pandas doesn't care about the name of your dataframe but cares about the name of your columns when merging (you have a dataframe called "names" but also your columns are called "names"). Otherwise, it seems that the merge is doing exactly what you are looking for:
import pandas as pd
names = pd.DataFrame({'names':['bob','frank','bob','bob','bob', 'james','tim','ricardo','mike','mark','joan','joe'],
'position':['dev','dev','dev','dev','dev','dev', 'sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank','joe','bill'],
'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna('unknown', inplace=True)
which will result in:
      names position classification
0       bob      dev        unknown
1       bob      dev        unknown
2       bob      dev        unknown
3       bob      dev        unknown
4     frank      dev          thief
5     james      dev        unknown
6       tim      sys           good
7   ricardo      sys        unknown
8      mike      sys        unknown
9      mark      sup          thief
10     joan      sup        unknown
11      joe      sup          thief
12      joe      sup           good
13     bill  unknown          thief
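The missing-comma effect described in this answer is easy to reproduce in plain Python, because adjacent string literals are concatenated into a single string. A small sketch (the list names are made up for illustration):

```python
# Adjacent string literals are joined by Python itself,
# so a missing comma silently merges two list items into one.
without_comma = ['bob', 'bob' 'james', 'tim']
with_comma = ['bob', 'bob', 'james', 'tim']

print(without_comma)                         # ['bob', 'bobjames', 'tim']
print(len(without_comma), len(with_comma))   # 3 4
```

No error is raised, which is why the stray "bobjames" and "devsys" values only show up once the merged output is inspected.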
Answered by Lucas Aimaretto
Think of it as an SQL join operation. You need a left-outer join [1].
names = pd.DataFrame({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank'],'classification':['thief','thief','good','thief']})
Since there are names for which there is no classification, a left-outer join will do the job.
a = pd.merge(names, info, how='left', on='names')
The result is ...
>>> a
     names position classification
0      bob      dev            NaN
1    frank      dev          thief
2    james      dev            NaN
3      tim      sys           good
4  ricardo      sys            NaN
5     mike      sys            NaN
6     mark      sup          thief
7     joan      sup            NaN
8      joe      sup          thief
... which is fine. All the NaN results are ok if you look at both the tables.
Cheers!
[1] - http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
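Putting the pieces together for the original goal ("unknown" instead of NaN), one possible sketch is a left merge followed by fillna on just the classification column. This assumes the names in info are unique, as in the first example; otherwise rows in names would be repeated once per match. The variable name combined is only illustrative.

```python
import pandas as pd

names = pd.DataFrame({
    'names': ['bob', 'frank', 'james', 'tim', 'ricardo', 'mike', 'mark', 'joan', 'joe'],
    'position': ['dev', 'dev', 'dev', 'sys', 'sys', 'sys', 'sup', 'sup', 'sup'],
})
info = pd.DataFrame({
    'names': ['joe', 'mark', 'tim', 'frank'],
    'classification': ['thief', 'thief', 'good', 'thief'],
})

# A left merge keeps every row of `names`, in its original order.
combined = pd.merge(names, info, how='left', on='names')

# Fill only the looked-up column, so other columns are never touched.
combined['classification'] = combined['classification'].fillna('unknown')

print(combined)
```

Restricting fillna to one column avoids writing "unknown" into columns such as position, which is what happened to the bill row in the outer-merge output.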