Pandas group by: include all rows, even those with an empty column value

Question

Asked by Apolo Radomer

I am using Pandas and trying to test something to fully understand some functionalities.

I am grouping and aggregating my data after I load everything from a csv using the following code:

s = df.groupby(['ID','Site']).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
print(s)

and it works with the following file:

but it does not work with this file:

For the second file, I am getting the data only for the 56311 ID. The reason is that some columns have empty values. But that should not matter. I have not found anything relevant about that. I have only found how to exclude the null columns.

Apart from this issue, what are the main things that I should take into account before grouping? Is there any chance that rows will be excluded because of, for example, a format (date or number)?

Answer 1

Accepted answer by jezrael

The problem is that if the columns passed in the by parameter contain NaNs, the corresponding groups are dropped.
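
A minimal sketch of that default behaviour, using a small made-up frame (the data and the name out are illustrative assumptions, not the asker's file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the three rows have a missing Site.
df = pd.DataFrame({'ID': ['a', 'a', 'b'],
                   'Site': [np.nan, 'x', np.nan],
                   'Value': [7, 3, 6]})

# By default, rows whose grouping key contains NaN are left out,
# so only the ('a', 'x') group survives.
out = df.groupby(['ID', 'Site'])['Value'].sum()
print(out)
```

Here ID b disappears entirely because its only row has no Site, which matches the symptom of seeing data for just one ID.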

So you need to replace NaN with some value that does not occur in the Site column and, after the groupby, replace it back with NaN:

Thanks to Zero for simplifying the solution by using fillna inside groupby:

df1= (df.groupby([df['ID'],df['Site'].fillna('tmp')])
        .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
        .reset_index()
        .replace({'Site':{'tmp': np.nan}}))

If you need the NaNs in the MultiIndex:

s = (df.groupby([df['ID'],df['Site'].fillna('tmp')])
       .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
       .rename(index={'tmp':np.nan}))

Sample:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':list('abcdef'),
                   'Site':[np.nan,'a',np.nan,'b','b','a'],
                   'Start Date':pd.date_range('2017-01-01', periods=6),
                   'End Date':pd.date_range('2017-11-11', periods=6),
                   'Value':[7,3,6,9,2,1],
                   'ID':list('aaabbb')})

print (df)
   A   End Date ID Site Start Date  Value
0  a 2017-11-11  a  NaN 2017-01-01      7
1  b 2017-11-12  a    a 2017-01-02      3
2  c 2017-11-13  a  NaN 2017-01-03      6
3  d 2017-11-14  b    b 2017-01-04      9
4  e 2017-11-15  b    b 2017-01-05      2
5  f 2017-11-16  b    a 2017-01-06      1

df1= (df.groupby([df['ID'],df['Site'].fillna('tmp')])
        .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
        .reset_index()
        .replace({'Site':{'tmp': np.nan}}))

print (df1)
  ID Site   End Date Start Date  Value
0  a    a 2017-11-12 2017-01-02      3
1  a  NaN 2017-11-13 2017-01-01     13
2  b    a 2017-11-16 2017-01-06      1
3  b    b 2017-11-15 2017-01-04     11

s = (df.groupby([df['ID'],df['Site'].fillna('tmp')])
       .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
       .rename(index={'tmp':np.nan}))

print (s)
          End Date Start Date  Value
ID Site                             
a  a    2017-11-12 2017-01-02      3
   NaN  2017-11-13 2017-01-01     13
b  a    2017-11-16 2017-01-06      1
   b    2017-11-15 2017-01-04     11
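
As a possible alternative to the placeholder approach, newer pandas releases (1.1 and later, which is an assumption about the installed version) accept a dropna argument in groupby. The sketch below reuses the sample data from this answer; the name s2 is only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Site': [np.nan, 'a', np.nan, 'b', 'b', 'a'],
                   'Start Date': pd.date_range('2017-01-01', periods=6),
                   'End Date': pd.date_range('2017-11-11', periods=6),
                   'Value': [7, 3, 6, 9, 2, 1],
                   'ID': list('aaabbb')})

# dropna=False (assumes pandas >= 1.1) keeps groups whose key is NaN,
# so no temporary 'tmp' value is needed.
s2 = (df.groupby(['ID', 'Site'], dropna=False)
        .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'}))
print(s2)
```

On older pandas versions the argument does not exist, so the fillna and replace approach above remains the fallback.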
