Pandas, for each unique value in one column, get unique values in another column
Source: http://stackoverflow.com/questions/48979604/
Note: this is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow and share it under the same license.
Asked by Parseltongue
I have a dataframe where each row contains various meta-data pertaining to a single Reddit comment (e.g. author, subreddit, comment text).
I want to do the following: for each author, I want to grab a list of all the subreddits they have comments in, and transform this data into a pandas dataframe where each row corresponds to an author, and a list of all the unique subreddits they comment in.
I am currently trying some combination of the following, but can't get it down:
Attempt 1:
group = df['subreddit'].groupby(df['author']).unique()
list(group)
Attempt 2:
from collections import defaultdict
subreddit_dict = defaultdict(list)

for index, row in df.iterrows():
    author = row['author']
    subreddit = row['subreddit']
    subreddit_dict[author].append(subreddit)

for key, value in subreddit_dict.items():
    subreddit_dict[key] = set(value)

subreddit_df = pd.DataFrame.from_dict(subreddit_dict,
                                      orient='index')
Answered by sacuL
Here are two strategies to do it. No doubt, there are other ways.
Assuming your dataframe looks something like this (obviously with more columns):
df = pd.DataFrame({'author':['a', 'a', 'b'], 'subreddit':['sr1', 'sr2', 'sr2']})
>>> df
  author subreddit
0      a       sr1
1      a       sr2
2      b       sr2
...
Solution 1: groupby
More straightforward than solution 2, and similar to your first attempt:
group = df.groupby('author')
df2 = group.apply(lambda x: x['subreddit'].unique())
# Alternatively, same thing as a one liner:
# df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())
Result:
>>> df2
author
a    [sr1, sr2]
b         [sr2]
The author is the index, and the single column holds the list of all subreddits they are active in (this is how I interpreted your description of the output you want).
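As a minimal sketch of the same idea without `apply`, assuming the two-column sample frame above: selecting the `subreddit` column on the grouped object and calling `unique()` should give the same per-author result, and `reset_index()` moves the author out of the index so there is one ordinary row per author. The column name `subreddits` is chosen here for illustration only.

```python
import pandas as pd

# Sample data from the answer above
df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# One array of unique subreddits per author, in order of first appearance
per_author = df.groupby('author')['subreddit'].unique()

# Convert each array to a plain list and move the author index into a column;
# 'subreddits' is a name chosen for this sketch
result = per_author.apply(list).reset_index(name='subreddits')
print(result)
```

Converting the arrays to lists is optional; it just makes the cells easier to compare or serialize later.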
If you want each subreddit in its own column, which may be more usable depending on what you plan to do with it, you can run this afterwards:
df2 = df2.apply(pd.Series)
Result:
>>> df2
          0    1
author
a       sr1  sr2
b       sr2  NaN
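`apply(pd.Series)` builds one Series per row, which can become slow on large frames. A possible alternative, sketched here under the assumption that the sample data above is used, is to pass the per-author lists straight to the `pd.DataFrame` constructor. The name `wide` is hypothetical.

```python
import pandas as pd

# Sample data from the answer above
df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# Series of unique-subreddit arrays, indexed by author
df2 = df.groupby('author')['subreddit'].unique()

# Build the wide frame in one step; shorter rows are padded with missing values
wide = pd.DataFrame([list(v) for v in df2], index=df2.index)
print(wide)
```

The result has the same shape as the `apply(pd.Series)` output, with integer column labels and a missing value where an author has fewer subreddits than the longest list.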
Solution 2: Iterate through dataframe
You can make a new dataframe with all unique authors:
df2 = pd.DataFrame({'author':df.author.unique()})
And then just get the list of all unique subreddits they are active in, assigning it to a new column:
df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']]))
                     for _, x in df2.iterrows()]
This gives you:
>>> df2
  author  subreddits
0      a  [sr2, sr1]
1      b       [sr2]
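Because `set` does not guarantee an order, the lists above can come out in a different order than the rows appear in (note `[sr2, sr1]` for author `a`). If the order of first appearance matters, one option is to drop duplicate pairs first and then collect each group into a list. This is a sketch on the same sample data, not part of the original answer, and it also avoids the row-by-row loop.

```python
import pandas as pd

# Sample data from the answer above
df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# Keep only the first occurrence of each (author, subreddit) pair
pairs = df[['author', 'subreddit']].drop_duplicates()

# Collect the remaining subreddits per author, keeping authors in row order
df2 = (pairs.groupby('author', sort=False)['subreddit']
            .agg(list)
            .reset_index()
            .rename(columns={'subreddit': 'subreddits'}))
print(df2)
```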
Answered by YOBEN_S
Using sacuL's sample data:
df['subreddit'].groupby(df['author']).unique().apply(pd.Series)
Out[370]:
          0    1
author
a       sr1  sr2
b       sr2  NaN