Original question: http://stackoverflow.com/questions/28495905/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
pandas - groupby by partial string
Asked by Fabio Lamanna
I would like to group a DataFrame by partial substrings. This is a sample .csv file:
GridCode,Key
1000,Colour
1000,Colours
1001,Behaviours
1001,Behaviour
1002,Favourite
1003,COLORS
1004,Honours
So far I have imported the file with df = pd.read_csv('sample.csv') and converted all the strings to lowercase with df['Key'] = df['Key'].str.lower(). The first thing I tried was to group by GridCode and Key with:
g = df.groupby([df['GridCode'],df['Key']]).size()
then unstack and fill:
d = g.unstack().fillna(0)
and the resulting DataFrame is:
Key       behaviour  behaviours  colors  colour  colours  favourite  honours
GridCode
1000              0           0       0       1        1          0        0
1001              1           1       0       0        0          0        0
1002              0           0       0       0        0          1        0
1003              0           0       1       0        0          0        0
1004              0           0       0       0        0          0        1
What I would like to do now is count, per GridCode, only the strings that contain the substring 'our' (in this case, every Key except colors), and put those counts in a new column named after the substring. The expected result would look like this:
Key 'our'
GridCode
1000 2
1001 2
1002 1
1003 0
1004 1
I also tried to mask the DataFrame with mask = df['Key'].str.contains('our') and then df1 = df[mask], but I can't figure out how to make a new column with the new groupby counts. Any help would be really appreciated.
Answered by behzad.nouri
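>>> import re  # for the re.IGNORECASE flag
>>> # pass the flag by keyword: the second positional argument of str.contains is `case`, not `flags`
>>> df['Key'].str.contains('our', flags=re.IGNORECASE).groupby(df['GridCode']).sum()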
GridCode
1000 2
1001 2
1002 1
1003 0
1004 1
Name: Key, dtype: float64
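For readers who want to reproduce this end to end, here is a minimal, self-contained sketch. It assumes the sample data from the question is read from an in-memory string rather than from sample.csv, and the variable names (has_our, counts, result, mask_counts) are illustrative only. It also shows one possible way to finish the masking route from the question, which on its own would drop GridCode 1003 because none of its rows match.

```python
import io
import re

import pandas as pd

# Sample data copied from the question (read from a string instead of a file).
csv_text = """GridCode,Key
1000,Colour
1000,Colours
1001,Behaviours
1001,Behaviour
1002,Favourite
1003,COLORS
1004,Honours
"""
df = pd.read_csv(io.StringIO(csv_text))

# Boolean Series: True where Key contains 'our', ignoring case.
has_our = df['Key'].str.contains('our', flags=re.IGNORECASE)

# Summing booleans per GridCode counts the matches; a group with no
# match (1003 here) stays in the result with a count of 0.
counts = has_our.groupby(df['GridCode']).sum().astype(int)

# Wrap the counts in a one-column DataFrame named after the substring.
result = counts.to_frame(name='our')

# Masking route from the question: filtering first removes 1003 entirely,
# so reindex against every GridCode to bring it back with a 0.
mask_counts = (
    df[has_our]
    .groupby('GridCode')
    .size()
    .reindex(counts.index, fill_value=0)
)

print(result)
```

Both result['our'] and mask_counts hold the counts 2, 2, 1, 0, 1 for GridCode 1000 to 1004, which matches the expected result in the question. The boolean-sum form needs no reindexing step, which is probably why the answer prefers it.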
Also, instead of
df.groupby([df['GridCode'],df['Key']])
it is better to do:
df.groupby(['GridCode', 'Key'])
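As a quick check of that advice, the sketch below builds the lowercased sample data directly (an assumption made to keep the example self-contained) and compares the two spellings. The names by_series, by_labels, same and table are illustrative. It also uses unstack(fill_value=0), which is one way to fill the missing combinations without a separate fillna(0) step.

```python
import pandas as pd

# Lowercased sample data from the question, built inline for this sketch.
df = pd.DataFrame({
    'GridCode': [1000, 1000, 1001, 1001, 1002, 1003, 1004],
    'Key': ['colour', 'colours', 'behaviours', 'behaviour',
            'favourite', 'colors', 'honours'],
})

# Grouping by the Series objects themselves, as in the question.
by_series = df.groupby([df['GridCode'], df['Key']]).size()

# Grouping by column labels, as the answer recommends.
by_labels = df.groupby(['GridCode', 'Key']).size()

# True when both spellings produce the same counts and index.
same = by_series.equals(by_labels)

# Pivot Key into columns; absent (GridCode, Key) pairs become 0.
table = by_labels.unstack(fill_value=0)

print(same)
print(table)
```

The label-based form is shorter and does not depend on the extra Series sharing the DataFrame's index.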

