Disclaimer: this page reproduces a popular StackOverflow question (originally presented as a Chinese-English parallel translation), provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/31661604/

Time: 2020-09-13 23:41:16  Source: igfitidea

Efficiently create sparse pivot tables in pandas?

Tags: python, pandas, scipy, scikit-learn, sparse-matrix

Asked by neelshiv

I'm working on turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but that isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.

Alternatively, is there some kind of sparse pivot capability outside of pandas?

edit: here is an example of a non-sparse pivot

import pandas as pd
frame=pd.DataFrame()
frame['person']=['me','you','him','you','him','me']
frame['thing']=['a','a','b','c','d','d']
frame['count']=[1,1,1,1,1,1]

frame

  person thing  count
0     me     a      1
1    you     a      1
2    him     b      1
3    you     c      1
4    him     d      1
5     me     d      1

frame.pivot(index='person', columns='thing')

        count            
thing       a   b   c   d
person                   
him       NaN   1 NaN   1
me          1 NaN NaN   1
you         1 NaN   1 NaN

This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.

http://docs.scipy.org/doc/scipy/reference/sparse.html

Sparse matrices take up less space because values such as NaN or 0 are implied rather than stored. With a very large data set, this pivot can produce a matrix that ought to be sparse because most of its entries are NaN or 0. I was hoping to save a lot of memory by generating something sparse right off the bat, rather than creating a dense matrix and then converting it to sparse.

Accepted answer by Alnilam

The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)

row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)),
                           shape=(person_c.categories.size, thing_c.categories.size))

>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
     with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]], dtype=int64)


# Note: pd.SparseDataFrame was removed in pandas 1.0, so this call needs an older pandas.
dfs = pd.SparseDataFrame(sparse_matrix,
                         index=person_c.categories,
                         columns=thing_c.categories,
                         default_fill_value=0)
>>> dfs
        a   b   c   d
 him    0   1   0   1
  me    1   0   0   1
 you    1   0   1   0
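
On pandas 1.0 and later, where SparseDataFrame no longer exists, the same result can probably be reached with the `DataFrame.sparse.from_spmatrix` constructor. The following is a minimal sketch rather than part of the original answer; it rebuilds the example frame so that it runs on its own, and the names `person_cat` and `thing_cat` are illustrative. It relies on `astype("category")` inferring sorted categories for plain string labels.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Same example data as in the question.
frame = pd.DataFrame({
    "person": ["me", "you", "him", "you", "him", "me"],
    "thing": ["a", "a", "b", "c", "d", "d"],
    "count": [1, 1, 1, 1, 1, 1],
})

# Categorical codes give integer row/column positions in sorted label order.
person_cat = frame["person"].astype("category")
thing_cat = frame["thing"].astype("category")

sparse_matrix = csr_matrix(
    (
        frame["count"].to_numpy(),
        (person_cat.cat.codes.to_numpy(), thing_cat.cat.codes.to_numpy()),
    ),
    shape=(len(person_cat.cat.categories), len(thing_cat.cat.categories)),
)

# Wrap the scipy matrix in a DataFrame with sparse columns (pandas >= 1.0).
dfs = pd.DataFrame.sparse.from_spmatrix(
    sparse_matrix,
    index=person_cat.cat.categories,
    columns=thing_cat.cat.categories,
)
print(dfs)
```

The result is an ordinary DataFrame whose columns use a sparse dtype, so `dfs.sparse.to_dense()` can be used to inspect it as a regular frame.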

The main changes were:

  • .astype() no longer accepts "categorical". You have to create a CategoricalDtype object.
  • sort() doesn't work anymore; the built-in sorted() is used instead.

Other changes were more superficial:

  • using the category sizes instead of the lengths of the de-duplicated Series objects, just because I didn't want to make another object unnecessarily
  • the data input for csr_matrix (frame["count"]) doesn't need to be a list object
  • pandas SparseDataFrame accepts a scipy.sparse object directly now
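
To make the role of the CategoricalDtype object concrete, here is a small sketch (not from the original answer; the Series name `people` is illustrative) showing how an explicit dtype fixes which label maps to which integer code. These codes are what end up as row and column positions in the sparse matrix.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

people = pd.Series(["me", "you", "him", "you", "him", "me"])

# An explicit dtype fixes which label maps to which integer code.
person_c = CategoricalDtype(sorted(people.unique()), ordered=True)
codes = people.astype(person_c).cat.codes

print(list(person_c.categories))  # ['him', 'me', 'you']
print(codes.tolist())             # [1, 2, 0, 2, 0, 1]

# A label that is not among the categories becomes missing and gets code -1,
# which should not be passed on as a matrix index.
unknown = pd.Series(["me", "them"]).astype(person_c).cat.codes
print(unknown.tolist())           # [1, -1]
```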

Answer by khammel

Here is a method that creates a sparse scipy matrix based on the data and the indices of person and thing. person_u and thing_u are lists representing the unique entries for the rows and columns of the pivot you want to create. Note: this assumes that your count column already has the value you want in it.

# Note: written for an older pandas/Python setup; see the accepted answer for what changed.
from scipy.sparse import csr_matrix

person_u = list(sort(frame.person.unique()))
thing_u = list(sort(frame.thing.unique()))

data = frame['count'].tolist()
row = frame.person.astype('category', categories=person_u).cat.codes
col = frame.thing.astype('category', categories=thing_u).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))

>>> sparse_matrix 
<3x4 sparse matrix of type '<type 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()

matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]])
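
On the note about the count column already holding the value you want: when the same (row, column) pair appears more than once, the `csr_matrix((data, (row, col)))` constructor sums the duplicate entries, whereas `frame.pivot` raises an error on duplicates. The toy arrays below are made up for illustration and are not from the original answer.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two records share the position (0, 0); one record sits at (1, 1).
data = np.array([1, 1, 1])
row = np.array([0, 0, 1])
col = np.array([0, 0, 1])

m = csr_matrix((data, (row, col)), shape=(2, 2))
print(m.toarray())
# [[2 0]
#  [0 1]]
```

If summing is not the aggregation you want, the records would need to be aggregated (for example with a groupby) before building the matrix.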

Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse dataframe you can do the following:

dfs=pd.SparseDataFrame([ pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0) 
                              for i in np.arange(sparse_matrix.shape[0]) ], index=person_u, columns=thing_u, default_fill_value=0)

>>> dfs
     a  b  c  d
him  0  1  0  1
me   1  0  0  1
you  1  0  1  0

>>> type(dfs)
pandas.sparse.frame.SparseDataFrame

Answer by sbstn

I had a similar problem and stumbled over this post. The only difference was that I had two columns in the DataFrame that define the "row dimension" (i) of the output matrix. I thought this might be an interesting generalisation, so I used the grouper:

# function
import pandas as pd

from scipy.sparse import csr_matrix

def df_to_sm(data, vars_i, vars_j):
    grpr_i = data.groupby(vars_i).grouper

    idx_i = grpr_i.group_info[0]

    grpr_j = data.groupby(vars_j).grouper

    idx_j = grpr_j.group_info[0]

    data_sm = csr_matrix((data['val'].values, (idx_i, idx_j)),
                         shape=(grpr_i.ngroups, grpr_j.ngroups))

    return data_sm, grpr_i, grpr_j


# example
data = pd.DataFrame({'var_i_1' : ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
                     'var_i_2' : ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
                     'var_j_1' : ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
                     'val' : [1, 2, 3, 4, 5, 6]})

data_sm, _, _ = df_to_sm(data, ['var_i_1', 'var_i_2'], ['var_j_1'])

data_sm.todense()
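
The `.grouper` attribute and its `group_info` are pandas internals and may not be available in newer releases. A variant that sticks to the public `GroupBy.ngroup()` method and the `ngroups` attribute might look like the sketch below. The function name `df_to_sm_ngroup` and the `val` parameter are illustrative rather than part of the original answer, and the sketch assumes there are no missing values in the grouping columns.

```python
import pandas as pd
from scipy.sparse import csr_matrix

def df_to_sm_ngroup(data, vars_i, vars_j, val="val"):
    """Build a CSR matrix whose rows/columns are the groups of vars_i/vars_j."""
    gb_i = data.groupby(vars_i)
    gb_j = data.groupby(vars_j)

    # ngroup() labels every record with the number of its group (sorted key order).
    idx_i = gb_i.ngroup().to_numpy()
    idx_j = gb_j.ngroup().to_numpy()

    return csr_matrix((data[val].to_numpy(), (idx_i, idx_j)),
                      shape=(gb_i.ngroups, gb_j.ngroups))

# Same example data as above.
data = pd.DataFrame({'var_i_1': ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
                     'var_i_2': ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
                     'var_j_1': ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
                     'val': [1, 2, 3, 4, 5, 6]})

data_sm = df_to_sm_ngroup(data, ['var_i_1', 'var_i_2'], ['var_j_1'])
print(data_sm.toarray())
# [[0 3 2]
#  [0 1 0]
#  [4 5 0]
#  [0 0 6]]
```

Here the rows correspond to the sorted key pairs (a1, b1), (a1, b2), (a2, b1), (a3, b4) and the columns to c1, c2, c3.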