Pandas DataFrame - Combining one column's values with same index into list

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/31796973/

Time: 2020-09-13 23:43:33  Source: igfitidea

python pandas

Asked by grish

I've been at this issue for a while to no avail. This is almost a duplicate of at least one other question on here, but I can't quite figure out how to do exactly what I'm looking for from related answers online.

I have a Pandas DataFrame (we'll call it df) that looks something like:

Name    Value        Value2
'A'     '8.8.8.8'    'x'
'B'     '6.6.6.6'    'y'
'A'     '6.6.6.6'    'x'
'A'     '8.8.8.8'    'x'

Where Name is the index. I want to convert this to something that looks like:

Name    Value                     Value2
'A'     ['8.8.8.8', '6.6.6.6']    'x'
'B'     ['6.6.6.6']               'y'

So, basically, every Value that corresponds to the same index should be combined into a list (or a set, or a tuple), and that list should become the Value for the corresponding index. And, as shown, Value2 is the same between like-indexed rows, so it should just stay the same in the end.

All I've done (successfully) is figure out how to make each element in the Value column into a list with:

df['Value'] = pd.Series([[val] for val in df['Value']])

In the question I linked at the start of this post, the recommended way to combine columns with duplicate indices offers a solution using df.groupby(df.index).sum(). I know I need something besides df.index as an argument to groupby since the Value column is treated as special, and I'm not sure what to put in place of sum() since that's not quite what I'm looking for.

Hopefully it's clear what I'm looking for; let me know if there's anything I can elaborate on. I've also tried simply looping through the DataFrame myself, finding rows with the same index, combining the Values into a list and updating df accordingly. After trying to get this method to work for a bit, I thought I'd look for a more Pandas-esque way of handling this problem.



Edit: As a follow-up to dermen's answer, that solution kind of worked. The Values did seem to concatenate correctly into a list. One thing I realized was that the unique function returns a Series, as opposed to a DataFrame. Also, I do have more columns in the actual setup than just Name, Value, and Value2. But I think I was able to get around both of the issues successfully with the following:

gb = df.groupby(tuple(df.columns.difference(['Value'])))
result = pd.DataFrame(gb['Value'].unique(), columns=df.columns)

Where the first line passes groupby the list of columns minus the Value column, and the second line converts the Series returned by unique into a DataFrame with the same columns as df.
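
The idea of grouping by every column except Value can be sketched as follows. This is only an illustration under assumptions: the sample frame (including the extra Value3 column) is hypothetical, Name is assumed to already be a regular column, and the keys are passed as a list and the result is built with reset_index() rather than the pd.DataFrame(...) constructor used above.

```python
import pandas as pd

# Hypothetical sample data mirroring the question, plus one extra column (Value3)
df = pd.DataFrame(
    {
        "Name": ["A", "B", "A", "A"],
        "Value": ["8.8.8.8", "6.6.6.6", "6.6.6.6", "8.8.8.8"],
        "Value2": ["x", "y", "x", "x"],
        "Value3": [1, 2, 1, 1],
    }
)

# Group by every column except 'Value'; the keys are given as a list
keys = df.columns.difference(["Value"]).tolist()

# unique() yields a Series indexed by the group keys;
# reset_index() turns that Series into a DataFrame
result = df.groupby(keys)["Value"].unique().reset_index()
```

With this shape, each row of result holds one combination of the non-Value columns and the distinct Values seen for it.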

But I think with all of that in place (unless anyone sees an issue with this), almost everything works as intended. There does seem to be something that's a bit off here, though. When I try to output this to a file with to_csv, there are duplicate headers across the top (but only certain headers are duplicated, and there's no real pattern as to which, as far as I can tell). Also, the Value lists are truncated, which is probably a simpler issue to fix. The csv output currently looks like:

Name    Value                   Value2    Name    Value2
'A'     ['8.8.8.8' '7.7.7.7'    'x'                     
'B'     ['6.6.6.6']             'y'

The above looks weird, but that is exactly how it looks in the output. Note that, contrary to the example presented at the start of this post, there are assumed to be more than 2 Values for A (so that I can illustrate this point). When I do this with the actual data, the Value lists get cut off after the first 4 elements.

Accepted answer by dermen

I think you are looking to use pandas.Series.unique. First, make the 'Name' index a column:

df
#     Value2  Value
#Name              
#A         x    8.8
#B         y    6.6
#A         x    6.6
#A         x    8.8

df.reset_index(inplace=True)
#  Name Value2  Value
#0    A      x    8.8
#1    B      y    6.6
#2    A      x    6.6
#3    A      x    8.8

Next, call groupby and then call the unique function on the 'Value' series:

gb = df.groupby(['Name', 'Value2'])  # pass the keys as a list, not a tuple
result = gb['Value'].unique()
result = result.reset_index()  # lastly, reset the index (re-assign, since unique() returns a Series)
#  Name Value2       Value
#0    A      x  [8.8, 6.6]
#1    B      y       [6.6]

Finally, if you want 'Name' as the index again, just do:

result.set_index( 'Name', inplace=True)
#     Value2       Value
#Name                   
#A         x  [8.8, 6.6]
#B         y       [6.6]
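
Put together, the steps in this answer might look like the sketch below. It uses the string data from the question rather than the shortened numeric values shown here, and the .apply(list) step is an optional addition (not part of the original answer) for anyone who prefers plain Python lists over the arrays that unique() returns.

```python
import pandas as pd

# Sample data from the question, with 'Name' as the index
df = pd.DataFrame(
    {
        "Value": ["8.8.8.8", "6.6.6.6", "6.6.6.6", "8.8.8.8"],
        "Value2": ["x", "y", "x", "x"],
    },
    index=pd.Index(["A", "B", "A", "A"], name="Name"),
)

result = (
    df.reset_index()                        # make 'Name' a regular column
      .groupby(["Name", "Value2"])["Value"]
      .unique()                             # distinct values per group
      .apply(list)                          # optional: arrays -> Python lists
      .reset_index()                        # Series -> DataFrame
      .set_index("Name")                    # restore 'Name' as the index
)
```

Chaining the calls avoids inplace=True altogether, so there is no intermediate object that is still a Series when a DataFrame is expected.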

UPDATE

As a follow up, make sure you re-assign result after resetting the index

result = gb['Value'].unique()
type(result)
#pandas.core.series.Series

result = result.reset_index()
type(result)
#pandas.core.frame.DataFrame

Saving as CSV (or rather, TSV)

You don't want to use CSV here because there are commas in the Value column entries. Rather, save as TSV; you still use the same method, to_csv, and just change the sep arg:

result.to_csv( 'result.txt', sep='\t')
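
If the list cells themselves are awkward in the output file, one option (an assumption on my part, not something the original answer does) is to join each group's values into a single delimited string before writing. The sketch below uses a hypothetical result frame, ';' as an arbitrary delimiter, and an in-memory buffer in place of result.txt.

```python
import io

import pandas as pd

# Hypothetical frame shaped like the grouped result in this answer
result = pd.DataFrame(
    {
        "Name": ["A", "B"],
        "Value2": ["x", "y"],
        "Value": [["8.8.8.8", "6.6.6.6"], ["6.6.6.6"]],
    }
)

# Join each group's values into one string; ';' is an arbitrary choice
out = result.assign(
    Value=result["Value"].apply(lambda vals: ";".join(map(str, vals)))
)

# Write tab-separated text; the buffer stands in for 'result.txt'
buffer = io.StringIO()
out.to_csv(buffer, sep="\t", index=False)
tsv_text = buffer.getvalue()
```

Each cell is then an ordinary string, so what lands in the file no longer depends on how a list or array happens to be printed.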

If I load result.txt in Excel as a TSV, I get:

[screenshot of result.txt opened in Excel]