Pandas group by: Include all rows, even the ones with empty column values
Note: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/46875065/
Asked by Apolo Radomer
I am using Pandas and trying to test something to fully understand some functionalities.
I am grouping and aggregating my data after I load everything from a csv using the following code:
s = df.groupby(['ID','Site']).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
print(s)
and it works with the following file:
but it does not work with this file:
For the second file, I am getting the data only for the 56311 ID. The reason is that some columns have empty values. But that should not matter. I have not found anything relevant about that. I have only found how to exclude the null columns.
Apart from this issue, what are the main things I should take into account before grouping? Is there any chance that rows will be excluded because of, for example, a format (date or number)?
Accepted answer by jezrael
The problem is that if there are NaNs in the columns passed to the by parameter, those groups are removed.
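As an aside for readers on newer pandas (1.1 and later, which is an assumption about your environment): groupby there accepts a dropna argument, and passing dropna=False keeps NaN keys as their own group. On older versions the placeholder workaround described next still applies. A minimal sketch on made-up data that only mirrors the column names from the question:

```python
import numpy as np
import pandas as pd

# Illustrative data only; column names follow the question.
df = pd.DataFrame({
    'ID': list('aaabbb'),
    'Site': [np.nan, 'a', np.nan, 'b', 'b', 'a'],
    'Start Date': pd.date_range('2017-01-01', periods=6),
    'End Date': pd.date_range('2017-11-11', periods=6),
    'Value': [7, 3, 6, 9, 2, 1],
})

# Default behaviour: rows whose grouping key is NaN are dropped.
default = df.groupby(['ID', 'Site']).agg({'Value': 'sum'})

# dropna=False (pandas >= 1.1) keeps NaN as a group key.
kept = (df.groupby(['ID', 'Site'], dropna=False)
          .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
          .reset_index())
print(kept)
```

Comparing len(default) with len(kept) shows whether any NaN-keyed groups were being dropped.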
So you need to replace NaN with some value that does not occur in the Site column, and after the groupby replace it back to NaN.
Thanks to Zero for simplifying the solution with fillna in groupby:
df1 = (df.groupby([df['ID'], df['Site'].fillna('tmp')])
         .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
         .reset_index()
         .replace({'Site': {'tmp': np.nan}}))
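The placeholder only works if it does not already occur in Site; otherwise real rows would be merged with the NaN rows and then turned into NaN. A small sketch of a guard one might add, where the placeholder variable and the sample frame are hypothetical and not part of the original answer:

```python
import numpy as np
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    'ID': list('aaabbb'),
    'Site': [np.nan, 'a', np.nan, 'b', 'b', 'a'],
    'Value': [7, 3, 6, 9, 2, 1],
})

placeholder = 'tmp'  # hypothetical sentinel; any value absent from Site works

# Guard: refuse to continue if the sentinel collides with a real Site value.
if df['Site'].eq(placeholder).any():
    raise ValueError(f"{placeholder!r} already occurs in 'Site'; pick another value")

df1 = (df.groupby([df['ID'], df['Site'].fillna(placeholder)])
         .agg({'Value': 'sum'})
         .reset_index()
         .replace({'Site': {placeholder: np.nan}}))
print(df1)
```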
If you need NaNs in the MultiIndex:
s = (df.groupby([df['ID'], df['Site'].fillna('tmp')])
       .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
       .rename(index={'tmp': np.nan}))
Sample:
import numpy as np
import pandas as pd

# Sample frame with NaN in the Site grouping column
df = pd.DataFrame({'A': list('abcdef'),
                   'Site': [np.nan, 'a', np.nan, 'b', 'b', 'a'],
                   'Start Date': pd.date_range('2017-01-01', periods=6),
                   'End Date': pd.date_range('2017-11-11', periods=6),
                   'Value': [7, 3, 6, 9, 2, 1],
                   'ID': list('aaabbb')})
# Output as posted in the original answer; column order may differ in newer pandas versions.
print(df)

   A   End Date ID Site Start Date  Value
0  a 2017-11-11  a  NaN 2017-01-01      7
1  b 2017-11-12  a    a 2017-01-02      3
2  c 2017-11-13  a  NaN 2017-01-03      6
3  d 2017-11-14  b    b 2017-01-04      9
4  e 2017-11-15  b    b 2017-01-05      2
5  f 2017-11-16  b    a 2017-01-06      1

df1 = (df.groupby([df['ID'], df['Site'].fillna('tmp')])
         .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
         .reset_index()
         .replace({'Site': {'tmp': np.nan}}))
print(df1)

  ID Site   End Date Start Date  Value
0  a    a 2017-11-12 2017-01-02      3
1  a  NaN 2017-11-13 2017-01-01     13
2  b    a 2017-11-16 2017-01-06      1
3  b    b 2017-11-15 2017-01-04     11

s = (df.groupby([df['ID'], df['Site'].fillna('tmp')])
       .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
       .rename(index={'tmp': np.nan}))
print(s)

          End Date Start Date  Value
ID Site
a  a    2017-11-12 2017-01-02      3
   NaN  2017-11-13 2017-01-01     13
b  a    2017-11-16 2017-01-06      1
   b    2017-11-15 2017-01-04     11
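
On the question's second point (whether a date or number format can cause rows to be excluded): the answer above only covers missing values in the grouping keys. Values that were not parsed as dates or numbers can still distort min, max and sum, so it may be worth checking them before grouping. A rough sketch of such checks, assuming the column names from the question and a made-up frame with one bad value per column:

```python
import numpy as np
import pandas as pd

# Made-up data as it might look right after read_csv without parsing options.
df = pd.DataFrame({
    'ID': [1, 1, 2],
    'Site': ['x', np.nan, 'y'],
    'Start Date': ['2017-01-01', 'not a date', '2017-01-03'],
    'Value': ['7', '3', 'n/a'],
})

# 1. Count rows with a missing grouping key, per key column.
missing_keys = df[['ID', 'Site']].isna().sum()

# 2. Coerce to proper dtypes; unparseable entries become NaT / NaN
#    instead of silently staying as strings.
dates = pd.to_datetime(df['Start Date'], errors='coerce')
values = pd.to_numeric(df['Value'], errors='coerce')

print(missing_keys)
print(dates.isna().sum(), values.isna().sum())
```

Non-zero counts here point at rows worth inspecting before the groupby is run.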