Pandas sparse DataFrame to sparse matrix, without generating a dense matrix in memory
Original question: http://stackoverflow.com/questions/31084942/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow
Asked by Jake0x32
Is there a way to convert from a pandas.SparseDataFrame to a scipy.sparse.csr_matrix, without generating a dense matrix in memory?
scipy.sparse.csr_matrix(df.values)
doesn't work, as it generates a dense matrix which is then cast to a csr_matrix.
Thanks in advance!
Accepted answer by hpaulj
The pandas docs talk about an experimental conversion to scipy sparse, SparseSeries.to_coo:
http://pandas-docs.github.io/pandas-docs-travis/sparse.html#interaction-with-scipy-sparse
================
Edit: this is a special-purpose conversion from a MultiIndex series, not from a data frame. See the other answers for that. Note the difference in dates.
============
As of 0.20.0, there is an sdf.to_coo() and a MultiIndex ss.to_coo(). Since a sparse matrix is inherently 2d, it makes sense to require a MultiIndex for the (effectively) 1d data series, while a dataframe can represent a table or 2d array.
When I first responded to this question, this sparse dataframe/series feature was experimental (June 2015).
Answered by T.C. Proctor
Pandas 0.20.0+:
As of pandas version 0.20.0, released May 5, 2017, there is a one-liner for this:
from scipy import sparse

def sparse_df_to_csr(df):
    return sparse.csr_matrix(df.to_coo())
This uses the new to_coo() method.
Earlier Versions:
Building on Victor May's answer, here's a slightly faster implementation, but it only works if the entire SparseDataFrame is sparse with all BlockIndex (note: if it was created with get_dummies, this will be the case).
Edit: I modified this so it will work with a non-zero fill value. CSR has no native non-zero fill value, so you will have to record it externally.
import numpy as np
import pandas as pd
from scipy import sparse
def sparse_BlockIndex_df_to_csr(df):
    columns = df.columns
    zipped_data = zip(*[(df[col].sp_values - df[col].fill_value,
                         df[col].sp_index.to_int_index().indices)
                        for col in columns])
    data, rows = map(list, zipped_data)
    cols = [np.ones_like(a) * i for (i, a) in enumerate(data)]
    data_f = np.concatenate(data)
    rows_f = np.concatenate(rows)
    cols_f = np.concatenate(cols)
    arr = sparse.coo_matrix((data_f, (rows_f, cols_f)),
                            df.shape, dtype=np.float64)
    return arr.tocsr()
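SparseDataFrame and its BlockIndex internals were removed in pandas 1.0, so the function above no longer runs on current pandas. A rough sketch of the same triplet-building idea on modern pandas, using each column's SparseArray, might look like the following (the function name is mine, and it assumes every column has a sparse dtype with fill value 0, since CSR has no native non-zero fill):

```python
import numpy as np
import pandas as pd
from scipy import sparse

def sparse_cols_to_csr(df):
    """Build COO triplets from each column's SparseArray, then convert to CSR.

    Assumption: every column has a sparse dtype with fill_value 0, so the
    stored values are exactly the nonzeros. No dense matrix is materialized.
    """
    data, rows, cols = [], [], []
    for j, name in enumerate(df.columns):
        sa = df[name].array                       # pandas.arrays.SparseArray
        idx = sa.sp_index.to_int_index().indices  # row positions of stored values
        data.append(np.asarray(sa.sp_values, dtype=np.float64))
        rows.append(np.asarray(idx))
        cols.append(np.full(len(idx), j))
    coo = sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(rows), np.concatenate(cols))),
        shape=df.shape,
    )
    return coo.tocsr()
```

As with the original, a non-zero fill value would have to be subtracted out and recorded externally before building the triplets.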
Answered by Victor May
The answer by @Marigold does the trick, but it is slow due to accessing all elements in each column, including the zeros. Building on it, I wrote the following quick-and-dirty code, which runs about 50x faster on a 1000x1000 matrix with a density of about 1%. My code also handles dense columns appropriately.
import numpy as np
import pandas as pd
from pandas._libs.sparse import BlockIndex  # private; location varies across old pandas versions
from scipy.sparse import coo_matrix

def sparse_df_to_array(df):
    num_rows = df.shape[0]
    data = []
    row = []
    col = []
    for i, col_name in enumerate(df.columns):
        if isinstance(df[col_name], pd.SparseSeries):
            column_index = df[col_name].sp_index
            if isinstance(column_index, BlockIndex):
                column_index = column_index.to_int_index()
            ix = column_index.indices
            data.append(df[col_name].sp_values)
            row.append(ix)
            col.append(len(df[col_name].sp_values) * [i])
        else:
            data.append(df[col_name].values)
            row.append(np.array(range(0, num_rows)))
            col.append(np.array(num_rows * [i]))
    data_f = np.concatenate(data)
    row_f = np.concatenate(row)
    col_f = np.concatenate(col)
    arr = coo_matrix((data_f, (row_f, col_f)), df.shape, dtype=np.float64)
    return arr.tocsr()
Answered by Claygirl
As of pandas version 0.25, SparseSeries and SparseDataFrame are deprecated. DataFrames now support sparse dtypes for columns with sparse data. Sparse methods are available through the .sparse accessor, so the conversion one-liner now looks like this:
sparse_matrix = scipy.sparse.csr_matrix(df.sparse.to_coo())
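For example, a minimal end-to-end sketch (assuming pandas >= 1.0, where sparse columns are created with SparseDtype):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Start from a mostly-zero frame and give every column a sparse dtype,
# so zeros are stored implicitly as the fill value.
dense = pd.DataFrame({"a": [0.0, 1.0, 0.0], "b": [0.0, 0.0, 2.0]})
df = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

# df.sparse.to_coo() yields a scipy COO matrix without densifying.
csr = sparse.csr_matrix(df.sparse.to_coo())
```

Only the two stored values ever exist in memory; the zeros are never materialized.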
Answered by Marigold
Here's a solution that fills the sparse matrix column by column (assuming you can fit at least one column into memory).
import pandas as pd
import numpy as np
from scipy.sparse import lil_matrix

def sparse_df_to_array(df):
    """Convert a sparse dataframe to the sparse csr_matrix used by
    scikit-learn."""
    arr = lil_matrix(df.shape, dtype=np.float32)
    for i, col in enumerate(df.columns):
        ix = df[col] != 0
        arr[np.where(ix)[0], i] = df.loc[ix, col]
    return arr.tocsr()
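On modern pandas, where SparseDataFrame is gone, the same column-by-column idea can be sketched for an ordinary mostly-zero DataFrame; the function name is mine, and the inner loop is deliberately simple rather than fast:

```python
import numpy as np
import pandas as pd
from scipy.sparse import lil_matrix

def columnwise_to_csr(df):
    """Fill a LIL matrix one column at a time, then convert to CSR.

    Only one dense column plus the accumulated nonzeros is ever held,
    so peak memory stays far below a full dense matrix.
    """
    arr = lil_matrix(df.shape, dtype=np.float64)
    for i, col in enumerate(df.columns):
        values = df[col].to_numpy()
        for r in np.flatnonzero(values):  # positions of the nonzeros
            arr[r, i] = values[r]
    return arr.tocsr()
```

LIL is chosen because it supports cheap incremental assignment; the final .tocsr() gives the format scikit-learn expects.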
Answered by Marc Garcia
EDIT: This method actually has a dense representation at some stage, so it doesn't solve the question.
You should be able to use the experimental .to_coo() method in pandas [1] in the following way:
df, idx_rows, idx_cols = df.stack().to_sparse().to_coo()
df = df.tocsr()
Instead of taking a DataFrame (rows/columns), this method takes a Series with rows and columns in a MultiIndex (this is why you need the .stack() method). The Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So, you need to use the .to_sparse() method before calling .to_coo().
The Series returned by .stack(), even if it's not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan when the type is np.float).

