Notice: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/50859987/


Efficient way to get group names in pandas

Tags: python, python-3.x, pandas, csv, processing-efficiency

Asked by swopnilnep

I have a .csv file with around 300,000 rows. I have set it to group by a particular column, with each group having around 140 members (2138 total groups).

I am trying to generate a numpy array of the group names. So far I have used a for loop to collect the names, but it takes a while to process.

import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
grouped = df.groupby('col1')

# Collect each group's name by iterating over the groupby object
group_names = []
for name, group in grouped:
    group_names.append(name)
group_names = np.array(group_names, dtype=object)

I am wondering if there is a more efficient way to do this, whether by using a pandas module or directly converting the names into a numpy array.

Answered by EdChum

groupby objects have a .groups attribute:

groups = df.groupby('col1').groups

This returns a dict mapping each group name to the index labels of the rows in that group.

Example:

In[257]:
df = pd.DataFrame({'a':list('aabcccc'), 'b':np.random.randn(7)})
groups = df.groupby('a').groups
groups

Out[257]: 
{'a': Int64Index([0, 1], dtype='int64'),
 'b': Int64Index([2], dtype='int64'),
 'c': Int64Index([3, 4, 5, 6], dtype='int64')}

In[258]:
groups.keys()

Out[258]: dict_keys(['a', 'b', 'c'])
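
To get from this dict to the numpy array the question asks for, one option is to turn the keys into a list first. This is a minimal sketch rather than part of the original answer; it reuses the toy frame above, with fixed values in column b (an assumption, made only so the example is reproducible).

```python
import numpy as np
import pandas as pd

# Toy frame from the example above; column 'b' uses fixed values
# instead of random ones (an assumption for reproducibility)
df = pd.DataFrame({'a': list('aabcccc'), 'b': np.arange(7.0)})
groups = df.groupby('a').groups

# The dict keys are the group names. list() is needed because
# np.array() on a dict_keys view would wrap the whole view as one
# 0-d object instead of building a 1-d array of names.
group_names = np.array(list(groups.keys()), dtype=object)
print(group_names)
```

This avoids the explicit Python loop, although building the full name-to-labels dict still does more work than is strictly needed if only the names are wanted.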

Answered by sacuL

The fastest way would most likely be to use unique on the column you are grouping by, which gives you all the unique values. The output will be an array of your group names.

group_names = df.col1.unique()
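
One caveat when swapping the loop for unique: unique returns values in order of first appearance and keeps NaN, whereas groupby by default sorts the group keys and drops NaN keys. The sketch below is not from the original answer; it uses a small made-up frame (the values in col1 are hypothetical) to show one way of making the two results line up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: unsorted keys plus one missing value
df = pd.DataFrame({'col1': ['b', 'a', 'b', np.nan, 'c', 'a']})

# Names via the original loop: sorted, NaN key dropped
loop_names = [name for name, _ in df.groupby('col1')]

# unique(): order of first appearance, NaN kept
fast_names = df['col1'].unique()

# Drop NaN and sort so the result matches the groupby names
matching = np.sort(df['col1'].dropna().unique())

print(loop_names)
print(list(matching))
```

If the order of the names does not matter and the column has no missing values, the plain unique call above is enough.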