在 C++ 中创建稀疏数组的最佳方法是什么?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/4306/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
What is the best way to create a sparse array in C++?
提问by Ed.
I am working on a project that requires the manipulation of enormous matrices, specifically pyramidal summation for a copula calculation.
我正在从事一个需要处理巨大矩阵的项目,特别是用于 copula 计算的金字塔求和。
In short, I need to keep track of a relatively small number of values (usually a value of 1, and in rare cases more than 1) in a sea of zeros in the matrix (multidimensional array).
简而言之,我需要在矩阵(多维数组)的零海中跟踪相对较少的值(通常为 1,在极少数情况下大于 1)。
A sparse array allows the user to store a small number of values, and assume all undefined records to be a preset value. Since it is not physically possibly to store all values in memory, I need to store only the few non-zero elements. This could be several million entries.
稀疏数组允许用户存储少量值,并假设所有未定义的记录都是预设值。由于物理上不可能将所有值存储在内存中,因此我只需要存储少数非零元素。这可能是几百万个条目。
Speed is a huge priority, and I would also like to dynamically choose the number of variables in the class at runtime.
速度是重中之重,我还想在运行时动态选择类中的变量数量。
I currently work on a system that uses a binary search tree (b-tree) to store entries. Does anyone know of a better system?
我目前在一个使用二叉搜索树(b-tree)来存储条目的系统上工作。有谁知道更好的系统?
采纳答案by Mark Harrison
For C++, a map works well. Several million objects won't be a problem. 10 million items took about 4.4 seconds and about 57 meg on my computer.
对于 C++,地图效果很好。几百万个对象不会有问题。1000 万个项目在我的电脑上花费了大约 4.4 秒和大约 57 兆。
My test application is as follows:
我的测试应用程序如下:
#include <stdio.h>
#include <stdlib.h>
#include <map>
class triple {
public:
int x;
int y;
int z;
bool operator<(const triple &other) const {
if (x < other.x) return true;
if (other.x < x) return false;
if (y < other.y) return true;
if (other.y < y) return false;
return z < other.z;
}
};
int main(int, char**)
{
std::map<triple,int> data;
triple point;
int i;
for (i = 0; i < 10000000; ++i) {
point.x = rand();
point.y = rand();
point.z = rand();
//printf("%d %d %d %d\n", i, point.x, point.y, point.z);
data[point] = i;
}
return 0;
}
Now to dynamically choose the number of variables, the easiest solution is to represent index as a string, and then use string as a key for the map. For instance, an item located at [23][55] can be represented via "23,55" string. We can also extend this solution for higher dimensions; such as for three dimensions an arbitrary index will look like "34,45,56". A simple implementation of this technique is as follows:
现在要动态选择变量的数量,最简单的解决方案是将index表示为 string,然后使用 string 作为映射的键。例如,位于 [23][55] 的项目可以通过“23,55”字符串表示。我们还可以将这个解决方案扩展到更高的维度;例如对于三个维度,任意索引看起来像“34,45,56”。该技术的简单实现如下:
std::map data<string,int> data;
char ix[100];
sprintf(ix, "%d,%d", x, y); // 2 vars
data[ix] = i;
sprintf(ix, "%d,%d,%d", x, y, z); // 3 vars
data[ix] = i;
回答by Konrad Rudolph
As a general advice, a method using strings as indices is actually veryslow. A much more efficient but otherwise equivalent solution would be to use vectors/arrays. There's absolutely no need to write the indices in a string.
作为一般建议,使用字符串作为索引的方法实际上非常慢。一个更有效但等效的解决方案是使用向量/数组。绝对不需要在字符串中写入索引。
typedef vector<size_t> index_t;
struct index_cmp_t : binary_function<index_t, index_t, bool> {
bool operator ()(index_t const& a, index_t const& b) const {
for (index_t::size_type i = 0; i < a.size(); ++i)
if (a[i] != b[i])
return a[i] < b[i];
return false;
}
};
map<index_t, int, index_cmp_t> data;
index_t i(dims);
i[0] = 1;
i[1] = 2;
// … etc.
data[i] = 42;
However, using a map
in practice often isn't very efficient because of the implementation in terms of a balanced binary search tree. A better performing data structure in this case would be a hash table, as provided by std::unordered_map
.
然而,map
由于平衡二叉搜索树的实现,在实践中使用 a通常不是很有效。在这种情况下,性能更好的数据结构是哈希表,由std::unordered_map
.
回答by Nic Strong
Boost has a templated implementation of BLAS called uBLAS that contains a sparse matrix.
Boost 有一个名为 uBLAS 的 BLAS 模板化实现,它包含一个稀疏矩阵。
https://www.boost.org/doc/libs/release/libs/numeric/ublas/doc/index.htm
https://www.boost.org/doc/libs/release/libs/numeric/ublas/doc/index.htm
回答by Mat Noguchi
Small detail in the index comparison. You need to do a lexicographical compare, otherwise:
索引比较中的小细节。您需要进行字典比较,否则:
a= (1, 2, 1); b= (2, 1, 2);
(a<b) == (b<a) is true, but b!=a
Edit: So the comparison should probably be:
编辑:所以比较可能应该是:
return lhs.x<rhs.x
? true
: lhs.x==rhs.x
? lhs.y<rhs.y
? true
: lhs.y==rhs.y
? lhs.z<rhs.z
: false
: false
回答by Emile Cormier
Eigenis a C++ linear algebra library that has an implementationof a sparse matrix. It even supports matrix operations and solvers (LU factorization etc) that are optimized for sparse matrices.
Eigen是一个 C++ 线性代数库,它具有稀疏矩阵的实现。它甚至支持针对稀疏矩阵优化的矩阵运算和求解器(LU 分解等)。
回答by nlucaroni
Hash tables have a fast insertion and look up. You could write a simple hash function since you know you'd be dealing with only integer pairs as the keys.
哈希表具有快速插入和查找功能。您可以编写一个简单的散列函数,因为您知道只能处理整数对作为键。
回答by Validus Oculus
Complete list of solutions can be found in the wikipedia. For convenience, I have quoted relevant sections as follows.
完整的解决方案列表可以在维基百科中找到。为方便起见,我将相关部分引述如下。
https://en.wikipedia.org/wiki/Sparse_matrix#Dictionary_of_keys_.28DOK.29
https://en.wikipedia.org/wiki/Sparse_matrix#Dictionary_of_keys_.28DOK.29
Dictionary of keys (DOK)
键字典 (DOK)
DOK consists of a dictionary that maps (row, column)-pairs to the value of the elements. Elements that are missing from the dictionary are taken to be zero. The format is good for incrementally constructing a sparse matrix in random order, but poor for iterating over non-zero values in lexicographical order. One typically constructs a matrix in this format and then converts to another more efficient format for processing.[1]
DOK 包含一个将(行、列)对映射到元素值的字典。字典中缺失的元素被视为零。该格式适用于以随机顺序增量构建稀疏矩阵,但不适用于按字典顺序迭代非零值。人们通常以这种格式构造一个矩阵,然后转换为另一种更有效的格式进行处理。 [1]
List of lists (LIL)
列表列表 (LIL)
LIL stores one list per row, with each entry containing the column index and the value. Typically, these entries are kept sorted by column index for faster lookup. This is another format good for incremental matrix construction.[2]
LIL 每行存储一个列表,每个条目包含列索引和值。通常,这些条目按列索引排序,以便更快地查找。这是另一种适用于增量矩阵构建的格式。 [2]
Coordinate list (COO)
坐标列表 (COO)
COO stores a list of (row, column, value) tuples. Ideally, the entries are sorted (by row index, then column index) to improve random access times. This is another format which is good for incremental matrix construction.[3]
COO 存储(行、列、值)元组的列表。理想情况下,条目被排序(按行索引,然后是列索引)以提高随机访问时间。这是另一种适用于增量矩阵构建的格式。 [3]
Compressed sparse row (CSR, CRS or Yale format)
压缩稀疏行(CSR、CRS 或 Yale 格式)
The compressed sparse row (CSR) or compressed row storage (CRS) format represents a matrix M by three (one-dimensional) arrays, that respectively contain nonzero values, the extents of rows, and column indices. It is similar to COO, but compresses the row indices, hence the name. This format allows fast row access and matrix-vector multiplications (Mx).
压缩稀疏行 (CSR) 或压缩行存储 (CRS) 格式用三个(一维)数组表示矩阵 M,分别包含非零值、行的范围和列索引。它类似于 COO,但压缩行索引,因此得名。这种格式允许快速行访问和矩阵向量乘法 (Mx)。
回答by JSN
The best way to implement sparse matrices is to not to implement them - atleast not on your own. I would suggest to BLAS (which I think is a part of LAPACK) which can handle really huge matrices.
实现稀疏矩阵的最好方法是不要实现它们——至少不是你自己。我建议 BLAS(我认为它是 LAPACK 的一部分),它可以处理非常大的矩阵。
回答by Nicholas Jordan
Since only values with [a][b][c]...[w][x][y][z] are of consequence, we only store the indice themselves, not the value 1 which is just about everywhere - always the same + no way to hash it. Noting that the curse of dimensionality is present, suggest go with some established tool NIST or Boost, at least read the sources for that to circumvent needless blunder.
由于只有具有 [a][b][c]...[w][x][y][z] 的值才是重要的,我们只存储索引本身,而不是几乎无处不在的值 1 - 总是相同+无法对其进行哈希处理。注意到存在维度诅咒,建议使用一些成熟的工具 NIST 或 Boost,至少阅读源代码以规避不必要的错误。
If the work needs to capture the temporal dependence distributions and parametric tendencies of unknown data sets, then a Map or B-Tree with uni-valued root is probably not practical. We can store only the indice themselves, hashed if ordering ( sensibility for presentation ) can subordinate to reduction of time domain at run-time, for all 1 values. Since non-zero values other than one are few, an obvious candidate for those is whatever data-structure you can find readily and understand. If the data set is truly vast-universe sized I suggest some sort of sliding window that manages file / disk / persistent-io yourself, moving portions of the data into scope as need be. ( writing code that you can understand ) If you are under commitment to provide actual solution to a working group, failure to do so leaves you at the mercy of consumer grade operating systems that have the sole goal of taking your lunch away from you.
如果工作需要捕获未知数据集的时间依赖性分布和参数趋势,那么具有单值根的 Map 或 B-Tree 可能不切实际。对于所有 1 值,我们只能存储索引本身,如果排序(表示的敏感性)可以从属于运行时时域的减少,则进行散列。由于除 1 之外的非零值很少,因此这些值的明显候选者是您可以轻松找到并理解的任何数据结构。如果数据集真的是巨大的宇宙大小,我建议使用某种滑动窗口来自己管理文件 / 磁盘 / 持久化 io,根据需要将部分数据移动到范围内。(编写您可以理解的代码)如果您承诺为工作组提供实际解决方案,
回答by eold
Here is a relatively simple implementation that should provide a reasonable fast lookup (using a hash table) as well as fast iteration over non-zero elements in a row/column.
这是一个相对简单的实现,它应该提供合理的快速查找(使用哈希表)以及对行/列中非零元素的快速迭代。
// Copyright 2014 Leo Osvald
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_
#define UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_
#include <algorithm>
#include <limits>
#include <map>
#include <type_traits>
#include <unordered_map>
#include <utility>
#include <vector>
// A simple time-efficient implementation of an immutable sparse matrix
// Provides efficient iteration of non-zero elements by rows/cols,
// e.g. to iterate over a range [row_from, row_to) x [col_from, col_to):
// for (int row = row_from; row < row_to; ++row) {
// for (auto col_range = sm.nonzero_col_range(row, col_from, col_to);
// col_range.first != col_range.second; ++col_range.first) {
// int col = *col_range.first;
// // use sm(row, col)
// ...
// }
template<typename T = double, class Coord = int>
class SparseMatrix {
struct PointHasher;
typedef std::map< Coord, std::vector<Coord> > NonZeroList;
typedef std::pair<Coord, Coord> Point;
public:
typedef T ValueType;
typedef Coord CoordType;
typedef typename NonZeroList::mapped_type::const_iterator CoordIter;
typedef std::pair<CoordIter, CoordIter> CoordIterRange;
SparseMatrix() = default;
// Reads a matrix stored in MatrixMarket-like format, i.e.:
// <num_rows> <num_cols> <num_entries>
// <row_1> <col_1> <val_1>
// ...
// Note: the header (lines starting with '%' are ignored).
template<class InputStream, size_t max_line_length = 1024>
void Init(InputStream& is) {
rows_.clear(), cols_.clear();
values_.clear();
// skip the header (lines beginning with '%', if any)
decltype(is.tellg()) offset = 0;
for (char buf[max_line_length + 1];
is.getline(buf, sizeof(buf)) && buf[0] == '%'; )
offset = is.tellg();
is.seekg(offset);
size_t n;
is >> row_count_ >> col_count_ >> n;
values_.reserve(n);
while (n--) {
Coord row, col;
typename std::remove_cv<T>::type val;
is >> row >> col >> val;
values_[Point(--row, --col)] = val;
rows_[col].push_back(row);
cols_[row].push_back(col);
}
SortAndShrink(rows_);
SortAndShrink(cols_);
}
const T& operator()(const Coord& row, const Coord& col) const {
static const T kZero = T();
auto it = values_.find(Point(row, col));
if (it != values_.end())
return it->second;
return kZero;
}
CoordIterRange
nonzero_col_range(Coord row, Coord col_from, Coord col_to) const {
CoordIterRange r;
GetRange(cols_, row, col_from, col_to, &r);
return r;
}
CoordIterRange
nonzero_row_range(Coord col, Coord row_from, Coord row_to) const {
CoordIterRange r;
GetRange(rows_, col, row_from, row_to, &r);
return r;
}
Coord row_count() const { return row_count_; }
Coord col_count() const { return col_count_; }
size_t nonzero_count() const { return values_.size(); }
size_t element_count() const { return size_t(row_count_) * col_count_; }
private:
typedef std::unordered_map<Point,
typename std::remove_cv<T>::type,
PointHasher> ValueMap;
struct PointHasher {
size_t operator()(const Point& p) const {
return p.first << (std::numeric_limits<Coord>::digits >> 1) ^ p.second;
}
};
static void SortAndShrink(NonZeroList& list) {
for (auto& it : list) {
auto& indices = it.second;
indices.shrink_to_fit();
std::sort(indices.begin(), indices.end());
}
// insert a sentinel vector to handle the case of all zeroes
if (list.empty())
list.emplace(Coord(), std::vector<Coord>(Coord()));
}
static void GetRange(const NonZeroList& list, Coord i, Coord from, Coord to,
CoordIterRange* r) {
auto lr = list.equal_range(i);
if (lr.first == lr.second) {
r->first = r->second = list.begin()->second.end();
return;
}
auto begin = lr.first->second.begin(), end = lr.first->second.end();
r->first = lower_bound(begin, end, from);
r->second = lower_bound(r->first, end, to);
}
ValueMap values_;
NonZeroList rows_, cols_;
Coord row_count_, col_count_;
};
#endif /* UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_ */
For simplicity, it's immutable
, but you can can make it mutable; be sure to change std::vector
to std::set
if you want a reasonable efficient "insertions" (changing a zero to a non-zero).
为简单起见,它是immutable
,但您可以使其可变;如果您想要合理有效的“插入”(将零更改为非零),请务必更改std::vector
为std::set
。