Python Pandas: for each unique value in one column, get the unique values in another column

Question

Asked by Parseltongue

I have a dataframe where each row contains various metadata pertaining to a single Reddit comment (e.g. author, subreddit, comment text).

I want to do the following: for each author, grab a list of all the subreddits they have comments in, and transform this data into a pandas dataframe where each row corresponds to an author and holds a list of all the unique subreddits they comment in.

I am currently trying some combination of the following, but can't get it to work:

Attempt 1:

group = df['subreddit'].groupby(df['author']).unique()
list(group)

Attempt 2:

from collections import defaultdict

import pandas as pd

subreddit_dict = defaultdict(list)

for index, row in df.iterrows():
    author = row['author']
    subreddit = row['subreddit']
    subreddit_dict[author].append(subreddit)

for key, value in subreddit_dict.items():
    subreddit_dict[key] = set(value)

subreddit_df = pd.DataFrame.from_dict(subreddit_dict, 
                            orient = 'index')

Answer 1

Answered by sacuL

Here are two strategies to do it. No doubt, there are other ways.

Assuming your dataframe looks something like this (obviously with more columns):

import pandas as pd

df = pd.DataFrame({'author':['a', 'a', 'b'], 'subreddit':['sr1', 'sr2', 'sr2']})

>>> df
  author subreddit
0      a       sr1
1      a       sr2
2      b       sr2
...

Solution 1: groupby

More straightforward than solution 2, and similar to your first attempt:

group = df.groupby('author')

df2 = group.apply(lambda x: x['subreddit'].unique())

# Alternatively, same thing as a one liner:
# df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())

Result:

>>> df2
author
a    [sr1, sr2]
b         [sr2]

The author is the index, and the single column is the list of all subreddits they are active in (this is how I understood the output you described).
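
If you would rather have the author as an ordinary column, so that each row is one author plus their array of unique subreddits, a minimal sketch is to select the column before aggregating and then reset the index. The names `per_author` and `result` are made up for this illustration; the sample data is the same as above.

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# Select the column first, so no lambda is needed:
# this yields a Series indexed by author, holding arrays of unique values.
per_author = df.groupby('author')['subreddit'].unique()

# Move the author out of the index into a regular column.
result = per_author.reset_index()
print(result)
```

Selecting `'subreddit'` before calling `unique()` also avoids passing the grouping column into a lambda, which keeps the call shorter and is usually faster than `apply` on the whole group.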

If you want each subreddit in a separate column, which might be more usable depending on what you do with it next, you can follow up with this:

df2 = df2.apply(pd.Series)

Result:

>>> df2
          0    1
author          
a       sr1  sr2
b       sr2  NaN

Solution 2: Iterate through the dataframe

You can make a new dataframe with all unique authors:

df2 = pd.DataFrame({'author':df.author.unique()})

And then just get the list of all unique subreddits they are active in, assigning it to a new column:

df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']])) 
    for _, x in df2.iterrows()]

This gives you this:

>>> df2
  author  subreddits
0      a  [sr2, sr1]
1      b       [sr2]
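
Note that `set` does not keep the original order, which is why `a` may show `[sr2, sr1]` here. If order matters, or if the row-by-row loop is too slow, one possible loop-free variant (a sketch under the same sample data, with the hypothetical names `pairs` and `result_lists`) is to drop duplicate author/subreddit pairs first and then collect what is left into lists:

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# Keep one row per (author, subreddit) pair, in order of first appearance.
pairs = df[['author', 'subreddit']].drop_duplicates()

# Collect each author's remaining subreddits into a Python list.
# sort=False keeps authors in the order they first appear.
result_lists = (
    pairs.groupby('author', sort=False)['subreddit']
         .agg(list)
         .rename('subreddits')
         .reset_index()
)
print(result_lists)
```

This produces the same two-column layout as above (`author`, `subreddits`), with real Python lists in the second column.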

Answer 2

Answered by YOBEN_S

Using sacuL's sample data:

df['subreddit'].groupby(df['author']).unique().apply(pd.Series)
Out[370]: 
          0    1
author          
a       sr1  sr2
b       sr2  NaN
