Original source: http://stackoverflow.com/questions/47683642/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to create a square dataframe/matrix given 3 columns - Python
Asked by WolVes
I am struggling to figure out how to build a square matrix from data in a format like:
a a 0
a b 3
a c 4
a d 12
b a 3
b b 0
b c 2
...
into something like:
    a    b    c    d    e
a   0    3    4   12  ...
b   3    0    2    7  ...
c   4    3    0  ...
d  12  ...
e  ...
in pandas. I developed a method which I think works, but it takes forever to run because, for every value, it iterates through each column and row from the beginning using for loops. I feel like I'm definitely reinventing the wheel here. It also isn't realistic for my dataset, given how many columns and rows there are. Is there something similar to R's cast function in Python that can do this significantly faster?
Answered by unutbu
You could use df.pivot:
import pandas as pd
df = pd.DataFrame([['a', 'a', 0],
                   ['a', 'b', 3],
                   ['a', 'c', 4],
                   ['a', 'd', 12],
                   ['b', 'a', 3],
                   ['b', 'b', 0],
                   ['b', 'c', 2]], columns=['X','Y','Z'])
print(df.pivot(index='X', columns='Y', values='Z'))
yields
Y    a    b    c     d
X
a  0.0  3.0  4.0  12.0
b  3.0  0.0  2.0   NaN
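The result above is not square, because the sample data only contains rows for 'a' and 'b'. A minimal sketch of one way to square it up, assuming the matrix should cover every label that appears in either of the first two columns: reindex both axes over the union of the labels. The names wide, labels, and square are illustrative and not part of the original answer.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame([['a', 'a', 0],
                   ['a', 'b', 3],
                   ['a', 'c', 4],
                   ['a', 'd', 12],
                   ['b', 'a', 3],
                   ['b', 'b', 0],
                   ['b', 'c', 2]], columns=['X', 'Y', 'Z'])

wide = df.pivot(index='X', columns='Y', values='Z')

# Assumption: the square matrix should cover every label that appears
# in either column, so take the sorted union of row and column labels.
labels = sorted(set(df['X']) | set(df['Y']))

# reindex adds the missing rows/columns and fills them with NaN.
square = wide.reindex(index=labels, columns=labels)
print(square)
```

Pairs that never occur in the input stay NaN; whether to fill them (for example with 0) depends on what a missing pair means in your data.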
Here, index='X' tells df.pivot to use the column labeled 'X' as the index, and columns='Y' tells it to use the column labeled 'Y' as the column index.
See the docs for more on pivot and other reshaping methods.
Alternatively, you could use pd.crosstab:
print(pd.crosstab(index=df.iloc[:,0], columns=df.iloc[:,1],
                  values=df.iloc[:,2], aggfunc='sum'))
Unlike df.pivot, which expects each (a1, a2) pair of row and column labels to be unique, pd.crosstab (with aggfunc='sum') will aggregate duplicate pairs by summing the associated values. Although there are no duplicate pairs in your posted example, specifying how duplicates are supposed to be aggregated is required when the values parameter is used.
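To make the difference concrete, here is a small sketch using hypothetical data in which the pair ('a', 'b') appears twice. The frame dup and its values are invented for illustration and are not from the original question.

```python
import pandas as pd

# Hypothetical data in which the pair ('a', 'b') appears twice.
dup = pd.DataFrame([['a', 'a', 0],
                    ['a', 'b', 3],
                    ['a', 'b', 5],
                    ['b', 'a', 3],
                    ['b', 'b', 0]], columns=['X', 'Y', 'Z'])

# df.pivot refuses to reshape when an (index, column) pair is repeated.
try:
    dup.pivot(index='X', columns='Y', values='Z')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot raised ValueError:', err)

# pd.crosstab with aggfunc='sum' adds the duplicate values together,
# so the ('a', 'b') cell holds 3 + 5.
summed = pd.crosstab(index=dup['X'], columns=dup['Y'],
                     values=dup['Z'], aggfunc='sum')
print(summed)
```

If summing is not the right way to combine duplicates in your data, a different aggfunc (such as 'mean') can be passed instead.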
Also, whereas df.pivot is passed column labels, pd.crosstab is passed array-likes (such as whole columns of df). df.iloc[:, i] is the i-th column of df, counting from zero.
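As a quick check of the point about array-likes, the sketch below (on a shortened subset of the sample data) selects the same columns by position and by label and passes each to pd.crosstab. The variable names are illustrative only.

```python
import pandas as pd

# Shortened subset of the sample data, for illustration.
df = pd.DataFrame([['a', 'a', 0],
                   ['a', 'b', 3],
                   ['b', 'a', 3],
                   ['b', 'b', 0]], columns=['X', 'Y', 'Z'])

# Positional and label-based selection return the same column here.
same_column = df.iloc[:, 0].equals(df['X'])
print(same_column)

# pd.crosstab takes the columns themselves (Series), not their labels.
by_position = pd.crosstab(index=df.iloc[:, 0], columns=df.iloc[:, 1],
                          values=df.iloc[:, 2], aggfunc='sum')
by_label = pd.crosstab(index=df['X'], columns=df['Y'],
                       values=df['Z'], aggfunc='sum')
print(by_position.equals(by_label))
```

Selecting by label is usually easier to read when the column names are known; df.iloc is handy when only the column positions are.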