Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/48979604/

Time: 2020-08-19 18:56:51  Source: igfitidea

Pandas, for each unique value in one column, get unique values in another column

Tags: python, pandas

Asked by Parseltongue

I have a dataframe where each row contains various meta-data pertaining to a single Reddit comment (e.g. author, subreddit, comment text).

I want to do the following: for each author, collect all the subreddits they have commented in, and turn this into a pandas dataframe where each row holds one author and the list of unique subreddits they comment in.

I am currently trying some combination of the following, but can't get it down:

Attempt 1:

group = df['subreddit'].groupby(df['author']).unique()
list(group) 

Attempt 2:

from collections import defaultdict
subreddit_dict  = defaultdict(list)

for index, row in df.iterrows():
    author = row['author']
    subreddit = row['subreddit']
    subreddit_dict[author].append(subreddit)

for key, value in subreddit_dict.items():
    subreddit_dict[key] = set(value)

subreddit_df = pd.DataFrame.from_dict(subreddit_dict, 
                            orient = 'index')

Answered by sacuL

Here are two strategies to do it. No doubt, there are other ways.

Assuming your dataframe looks something like this (obviously with more columns):

import pandas as pd

df = pd.DataFrame({'author':['a', 'a', 'b'], 'subreddit':['sr1', 'sr2', 'sr2']})

>>> df
  author subreddit
0      a       sr1
1      a       sr2
2      b       sr2
...

Solution 1: groupby

More straightforward than solution 2, and similar to your first attempt:

group = df.groupby('author')

df2 = group.apply(lambda x: x['subreddit'].unique())

# Alternatively, same thing as a one liner:
# df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())

Result:

>>> df2
author
a    [sr1, sr2]
b         [sr2]

The author is the index, and the single column holds the list of all subreddits they are active in (this is how I interpreted the output you described).

If you want each subreddit in its own column instead, which may be more usable depending on what you plan to do with the data, you can do this afterwards:

df2 = df2.apply(pd.Series)

Result:

>>> df2
          0    1
author          
a       sr1  sr2
b       sr2  NaN
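
For the list-per-author form shown earlier in this solution, a slightly more direct variant is to select the column before aggregating, so no lambda is needed. This is only a sketch on the same sample data; the names `per_author` and `result` and the column name `subreddits` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# Select the column first, then take the unique values within each group.
# The result is a Series indexed by author, holding one NumPy array per author.
per_author = df.groupby('author')['subreddit'].unique()

# Convert each array to a plain list and move the author index into a column.
result = per_author.apply(list).reset_index(name='subreddits')
print(result)
```

Keeping the author as a regular column rather than the index can make the frame easier to merge or export later.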

Solution 2: Iterate through dataframe

You can make a new dataframe with all unique authors:

df2 = pd.DataFrame({'author':df.author.unique()})

And then just get the list of all unique subreddits they are active in, assigning it to a new column:

df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']])) 
    for _, x in df2.iterrows()]

This gives you the following (the order inside each list may vary, because a set is unordered):

>>> df2
  author  subreddits
0      a  [sr2, sr1]
1      b       [sr2]
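
If the row-by-row loop becomes slow on a large frame, or the arbitrary ordering produced by `set` is a problem, the same table can probably be built without `iterrows`. The following is a minimal sketch on the same sample data, not part of the original answer; `pairs` is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# Drop repeated (author, subreddit) pairs; the first occurrence of each is kept,
# so the order of appearance is preserved.
pairs = df[['author', 'subreddit']].drop_duplicates()

# Collect the remaining subreddits of each author into a list.
# sort=False keeps authors in order of first appearance, like df.author.unique().
df2 = pairs.groupby('author', sort=False)['subreddit'].agg(list).reset_index()
df2 = df2.rename(columns={'subreddit': 'subreddits'})
print(df2)
```

Because duplicates are removed before grouping, each list contains every subreddit once, in the order the author first commented there.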

Answered by YOBEN_S

Using sacuL's sample data:

df['subreddit'].groupby(df['author']).unique().apply(pd.Series)
Out[370]: 
          0    1
author          
a       sr1  sr2
b       sr2  NaN
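
`apply(pd.Series)` builds one Series per row, which can be slow on large frames. As a hedged alternative, the wide table can be built with a single `DataFrame` constructor call from the per-author lists. This is a sketch on the same sample data; `unique_subs` and `wide` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# One array of unique subreddits per author, indexed by author.
unique_subs = df.groupby('author')['subreddit'].unique()

# Build the wide frame in one call from plain lists.
# Rows with fewer subreddits are padded with missing values.
wide = pd.DataFrame([list(subs) for subs in unique_subs],
                    index=unique_subs.index)
print(wide)
```

The columns are labelled by position (0, 1, ...), as in the output above, so they carry no meaning beyond the order in which each subreddit first appears for that author.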