Pandas sparse DataFrame to sparse matrix, without generating a dense matrix in memory
Original question: http://stackoverflow.com/questions/31084942/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverflow
Asked by Jake0x32
Is there a way to convert from a pandas.SparseDataFrame to a scipy.sparse.csr_matrix, without generating a dense matrix in memory?
scipy.sparse.csr_matrix(df.values)
doesn't work, as it generates a dense matrix which is then cast to a csr_matrix.
Thanks in advance!
Accepted answer by hpaulj
The pandas docs talk about an experimental conversion to scipy sparse, SparseSeries.to_coo:
http://pandas-docs.github.io/pandas-docs-travis/sparse.html#interaction-with-scipy-sparse
================
Edit: this is a special-purpose conversion from a MultiIndex series, not from a data frame. See the other answers for that. Note the difference in dates.
============
As of 0.20.0, there is an sdf.to_coo() and a MultiIndex ss.to_coo(). Since a sparse matrix is inherently 2d, it makes sense to require a MultiIndex for the (effectively) 1d data series, while a dataframe can represent a table or 2d array.
When I first responded to this question, this sparse dataframe/series feature was experimental (June 2015).
Answered by T.C. Proctor
Pandas 0.20.0+:
As of pandas version 0.20.0, released May 5, 2017, there is a one-liner for this:
from scipy import sparse

def sparse_df_to_csr(df):
    return sparse.csr_matrix(df.to_coo())
This uses the new to_coo() method.
Earlier Versions:
Building on Victor May's answer, here's a slightly faster implementation, but it only works if the entire SparseDataFrame is sparse with all BlockIndex (note: if it was created with get_dummies, this will be the case).
Edit: I modified this so it will work with a non-zero fill value. CSR has no native non-zero fill value, so you will have to record it externally.
import numpy as np
import pandas as pd
from scipy import sparse
def sparse_BlockIndex_df_to_csr(df):
    columns = df.columns
    zipped_data = zip(*[(df[col].sp_values - df[col].fill_value,
                         df[col].sp_index.to_int_index().indices)
                        for col in columns])
    data, rows = map(list, zipped_data)
    cols = [np.ones_like(a) * i for (i, a) in enumerate(data)]
    data_f = np.concatenate(data)
    rows_f = np.concatenate(rows)
    cols_f = np.concatenate(cols)
    arr = sparse.coo_matrix((data_f, (rows_f, cols_f)),
                            df.shape, dtype=np.float64)
    return arr.tocsr()
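SparseDataFrame and its BlockIndex internals were removed in pandas 1.0, so the function above no longer runs on current pandas. A rough sketch of the same triplet-building idea on modern pandas, using each column's SparseArray, might look like the following (the function name is mine, and it assumes every column has a sparse dtype with fill value 0, since CSR has no native non-zero fill):

```python
import numpy as np
import pandas as pd
from scipy import sparse

def sparse_cols_to_csr(df):
    """Build COO triplets from each column's SparseArray, then convert to CSR.

    Assumption: every column has a sparse dtype with fill_value 0, so the
    stored values are exactly the nonzeros. No dense matrix is materialized.
    """
    data, rows, cols = [], [], []
    for j, name in enumerate(df.columns):
        sa = df[name].array                       # pandas.arrays.SparseArray
        idx = sa.sp_index.to_int_index().indices  # row positions of stored values
        data.append(np.asarray(sa.sp_values, dtype=np.float64))
        rows.append(np.asarray(idx))
        cols.append(np.full(len(idx), j))
    coo = sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(rows), np.concatenate(cols))),
        shape=df.shape,
    )
    return coo.tocsr()
```

As with the original, a non-zero fill value would have to be subtracted out and recorded externally before building the triplets.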
Answered by Victor May
The answer by @Marigold does the trick, but it is slow due to accessing all elements in each column, including the zeros. Building on it, I wrote the following quick-and-dirty code, which runs about 50x faster on a 1000x1000 matrix with a density of about 1%. My code also handles dense columns appropriately.
import numpy as np
import pandas as pd
from pandas._libs.sparse import BlockIndex  # private; location varies across old pandas versions
from scipy.sparse import coo_matrix

def sparse_df_to_array(df):
    num_rows = df.shape[0]
    data = []
    row = []
    col = []
    for i, col_name in enumerate(df.columns):
        if isinstance(df[col_name], pd.SparseSeries):
            column_index = df[col_name].sp_index
            if isinstance(column_index, BlockIndex):
                column_index = column_index.to_int_index()
            ix = column_index.indices
            data.append(df[col_name].sp_values)
            row.append(ix)
            col.append(len(df[col_name].sp_values) * [i])
        else:
            data.append(df[col_name].values)
            row.append(np.array(range(0, num_rows)))
            col.append(np.array(num_rows * [i]))
    data_f = np.concatenate(data)
    row_f = np.concatenate(row)
    col_f = np.concatenate(col)
    arr = coo_matrix((data_f, (row_f, col_f)), df.shape, dtype=np.float64)
    return arr.tocsr()
Answered by Claygirl
As of pandas version 0.25, SparseSeries and SparseDataFrame are deprecated. DataFrames now support sparse dtypes for columns with sparse data. Sparse methods are available through the .sparse accessor, so the conversion one-liner now looks like this:
sparse_matrix = scipy.sparse.csr_matrix(df.sparse.to_coo())
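For example, a minimal end-to-end sketch (assuming pandas >= 1.0, where sparse columns are created with SparseDtype):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Start from a mostly-zero frame and give every column a sparse dtype,
# so zeros are stored implicitly as the fill value.
dense = pd.DataFrame({"a": [0.0, 1.0, 0.0], "b": [0.0, 0.0, 2.0]})
df = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

# df.sparse.to_coo() yields a scipy COO matrix without densifying.
csr = sparse.csr_matrix(df.sparse.to_coo())
```

Only the two stored values ever exist in memory; the zeros are never materialized.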
Answered by Marigold
Here's a solution that fills the sparse matrix column by column (assuming you can fit at least one column into memory).
import pandas as pd
import numpy as np
from scipy.sparse import lil_matrix

def sparse_df_to_array(df):
    """Convert a sparse dataframe to the sparse csr_matrix used by
    scikit-learn."""
    arr = lil_matrix(df.shape, dtype=np.float32)
    for i, col in enumerate(df.columns):
        ix = df[col] != 0
        arr[np.where(ix)[0], i] = df.loc[ix, col]
    return arr.tocsr()
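On modern pandas, where SparseDataFrame is gone, the same column-by-column idea can be sketched for an ordinary mostly-zero DataFrame; the function name is mine, and the inner loop is deliberately simple rather than fast:

```python
import numpy as np
import pandas as pd
from scipy.sparse import lil_matrix

def columnwise_to_csr(df):
    """Fill a LIL matrix one column at a time, then convert to CSR.

    Only one dense column plus the accumulated nonzeros is ever held,
    so peak memory stays far below a full dense matrix.
    """
    arr = lil_matrix(df.shape, dtype=np.float64)
    for i, col in enumerate(df.columns):
        values = df[col].to_numpy()
        for r in np.flatnonzero(values):  # positions of the nonzeros
            arr[r, i] = values[r]
    return arr.tocsr()
```

LIL is chosen because it supports cheap incremental assignment; the final .tocsr() gives the format scikit-learn expects.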
Answered by Marc Garcia
EDIT: This method actually has a dense representation at some stage, so it doesn't solve the question.
You should be able to use the experimental .to_coo() method in pandas [1] in the following way:
df, idx_rows, idx_cols = df.stack().to_sparse().to_coo()
df = df.tocsr()
Instead of taking a DataFrame (rows/columns), this method takes a Series with rows and columns in a MultiIndex (this is why you need the .stack() method). The Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So, you need to use the .to_sparse() method before calling .to_coo().
The Series returned by .stack(), even if it's not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan when the type is np.float).

