Closed 11 years ago.
Possible Duplicate:
C#. Need to optimise counting positive and negative values
I need to maximize speed of the following functionality:
a. A value comes in. Each value has 2 properties: an int value and a long timestamp in ticks.
b. I need to count the previously stored values that are younger than 1 ms (relative to the current value).
c. Negative and positive values need to be counted separately.
d. I only need to know whether there are 10 negative or 10 positive values. I don't need to keep any other knowledge of the values.
My idea is to implement 2 ring arrays, one for positives and one for negatives, replacing expired entries with 0 and keeping track of the pos/neg counts as values come in.
Any thoughts?
Maintaining 2 buffers to keep the positives separated from the negatives sounds like a pain and inefficient.
You could instead have a single buffer with all the values, and use std::accumulate to count up the positives and negatives. If you start with a collection of all the tuples (each of which has an age and a value), you could begin by sorting the collection according to age, finding the last element that is <= 1 ms old, and then using accumulate from begin() to that point. Here's some code that demonstrates that last bit:
#include <algorithm>
#include <numeric>
#include <iterator>
#include <vector>
#include <cstdlib>
#include <ctime>
using namespace std;
struct Counter
{
Counter(unsigned pos=0, unsigned neg=0) : pos_(pos), neg_(neg) {};
unsigned pos_, neg_;
Counter& operator+(int n)
{
if( n < 0 )
++neg_;
else if( n > 0 )
++pos_;
return * this;
}
};
int main()
{
srand((unsigned)time(0));
vector<int> vals;
generate_n(back_inserter(vals), 1000, []()
{
return (rand() / (RAND_MAX/40)) - 20;
});
Counter cnt = accumulate(vals.begin(), vals.end(), Counter());
}
If sorting the collection by age and then searching the sorted results for the last eligible entry sounds too inefficient, you could use for_each_if instead of accumulate and simply iterate over the whole collection once. for_each_if isn't part of the Standard Library, but it's easy enough to write. If you don't want to muck about with writing your own for_each_if, that's fine, too. You could simply tweak the accumulator a bit so that it doesn't accumulate elements which are too old:
#include <algorithm>
#include <numeric>
#include <iterator>
#include <vector>
#include <cstdlib>
#include <ctime>
using namespace std;
struct Tuple
{
int val_;
unsigned age_;
};
struct Counter
{
Counter(unsigned pos=0, unsigned neg=0) : pos_(pos), neg_(neg) {};
unsigned pos_, neg_;
Counter& operator+(const Tuple& tuple)
{
if( tuple.age_ > 1 )
return * this;
if( tuple.val_ < 0 )
++neg_;
else if( tuple.val_ > 0 )
++pos_;
return * this;
}
};
int main()
{
srand((unsigned)time(0));
vector<Tuple> tuples;
generate_n(back_inserter(tuples), 1000, []() -> Tuple
{
Tuple retval;
retval.val_ = (rand() / (RAND_MAX/40)) - 20;
retval.age_ = (rand() / (RAND_MAX/5));
return retval;
});
Counter cnt = accumulate(tuples.begin(), tuples.end(), Counter());
}
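For comparison, here is a minimal sketch of the running-count idea from the question, using a single std::deque instead of two ring arrays. It assumes that timestamps arrive in non-decreasing order and that the newly added value counts toward the totals. Sample, WindowCounter and window_ticks are hypothetical names chosen for illustration, not anything from the question.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical value type: a timestamp in ticks plus the int payload.
struct Sample { std::int64_t ticks; int value; };

class WindowCounter {
public:
    explicit WindowCounter(std::int64_t window_ticks) : window_(window_ticks) {}

    // Drops samples that have left the window, then records the new one.
    // Returns true when either sign has reached `threshold` samples.
    bool add(const Sample& s, int threshold = 10) {
        while (!buf_.empty() && s.ticks - buf_.front().ticks >= window_) {
            adjust(buf_.front().value, -1);  // undo the expired sample's contribution
            buf_.pop_front();
        }
        buf_.push_back(s);
        adjust(s.value, +1);
        return pos_ >= threshold || neg_ >= threshold;
    }

    int positives() const { return pos_; }
    int negatives() const { return neg_; }

private:
    // Zero values are stored but counted as neither positive nor negative.
    void adjust(int value, int delta) {
        if (value > 0) pos_ += delta;
        else if (value < 0) neg_ += delta;
    }

    std::deque<Sample> buf_;
    std::int64_t window_;
    int pos_ = 0;
    int neg_ = 0;
};
```

Each sample is pushed once and popped once, so the cost per incoming value is amortised constant, and nothing is ever sorted or rescanned.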
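I would store the values in a max-heap keyed by timestamp, so the youngest values are at the top of the heap. (A min-heap keyed by timestamp would put the oldest value at the top instead.) The integer value is auxiliary data at each node. You could then implement the counting with a recursive function that traverses the heap and stops descending as soon as it reaches a node that is too old, since everything below that node is older still. You'd pass the running totals of negative and positive values back up the recursive calls.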
It would look something like this, in Python-like pseudocode with types:
def young_pos_and_neg(Time currtime, HeapNode p):
    if (p is not None and currtime - p.time < 1):
        posleft, negleft = young_pos_and_neg(currtime, p.leftChild())
        posright, negright = young_pos_and_neg(currtime, p.rightChild())
        totpos = posleft + posright
        totneg = negleft + negright
        if (p.intValue < 0):
            return totpos, totneg + 1
        elif (p.intValue > 0):
            return totpos + 1, totneg
        else:
            return totpos, totneg
    else:
        return 0, 0
If you call this on the heap root before inserting the new value - but with the new value's timestamp as the currtime argument - you will get a count of each. It may not be the fastest possible way, but it's pretty simple and elegant. In C++ you could replace the tuple return value with a struct.
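A rough C++ version of the heap traversal might look like the sketch below. It assumes the heap is stored in a std::vector and maintained with std::push_heap, so the children of index i sit at 2i+1 and 2i+2. Node, Counts, older and heap_insert are hypothetical names, and the window length is passed in rather than hard-coded.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node { std::int64_t time; int value; };
struct Counts { int pos = 0; int neg = 0; };

// Comparator for the std heap functions: with "less by time" the element with
// the largest timestamp (the youngest value) ends up at index 0.
inline bool older(const Node& a, const Node& b) { return a.time < b.time; }

void heap_insert(std::vector<Node>& heap, Node n) {
    heap.push_back(n);
    std::push_heap(heap.begin(), heap.end(), older);
}

// A child is never younger than its parent, so once a node has expired
// its whole subtree can be skipped.
Counts young_pos_and_neg(const std::vector<Node>& heap, std::size_t i,
                         std::int64_t now, std::int64_t window) {
    Counts c;
    if (i >= heap.size() || now - heap[i].time >= window) return c;
    const Counts l = young_pos_and_neg(heap, 2 * i + 1, now, window);
    const Counts r = young_pos_and_neg(heap, 2 * i + 2, now, window);
    c.pos = l.pos + r.pos + (heap[i].value > 0 ? 1 : 0);
    c.neg = l.neg + r.neg + (heap[i].value < 0 ? 1 : 0);
    return c;
}
```

Note that this sketch never removes expired nodes, so the heap keeps growing; a real implementation would need some way to prune them.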
Consider this fairly easy algorithmic problem:
Given an array of (unsorted) numbers, find the length of the longest sequence of adjacent numbers that are increasing. For example, if we have {1,4,2,3,5}, we expect the result to be 3 since {2,3,5} gives the longest increasing sequence of adjacent/contiguous elements. Note that for non-empty arrays, such as {4,3,2,1}, the minimum result will be 1.
This works:
#include <algorithm>
#include <iostream>
#include <vector>
template <typename T, typename S>
T max_adjacent_length(const std::vector<S> &nums) {
if (nums.size() == 0) {
return 0;
}
T maxLength = 1;
T currLength = 1;
for (size_t i = 0; i < nums.size() - 1; i++) {
if (nums[i + 1] > nums[i]) {
currLength++;
} else {
currLength = 1;
}
maxLength = std::max(maxLength, currLength);
}
return maxLength;
}
int main() {
std::vector<double> nums = {1.2, 4.5, 3.1, 2.7, 5.3};
std::vector<int> ints = {4, 3, 2, 1};
std::cout << max_adjacent_length<int, double>(nums) << "\n"; // 2
std::cout << max_adjacent_length<int, int>(ints) << "\n"; // 1
return 0;
}
As an exercise for myself, I was wondering if there is/are STL algorithm(s) that achieve the same effect, thereby (ideally) avoiding the raw for-loop I have. The motivation behind doing this is to learn more about STL algorithms, and practice using abstracted algorithms to make my code more general and reusable.
Here are my ideas, but they don't quite achieve what I'd like.
std::adjacent_find achieves the pairwise comparisons and can be used to find the index of a non-increasing pair, but doesn't easily facilitate the ability to keep a current and maximum length and compare the two. It could be possible to have those state variables as part of my predicate function, but that seems a bit wrong since ideally you'd like your predicate function to not have any side effects, right?
std::adjacent_difference is interesting. One could use it to construct a vector of the differences between adjacent numbers. Then, starting from the second element, depending on if the difference is positive or negative, we could again track the maximum number of consecutive positive differences seen. This is actually quite close to achieving what we'd like. See the example code below:
#include <algorithm>
#include <numeric>
#include <vector>
template <typename T, typename S> T max_adjacent_length(std::vector<S> &nums) {
if (nums.size() == 0) {
return 0;
}
std::adjacent_difference(nums.begin(), nums.end(), nums.begin());
nums.erase(std::begin(nums)); // keep only differences
T maxLength = 1, currLength = 1;
for (auto n : nums) {
currLength = n > 0 ? (currLength + 1) : 1;
maxLength = std::max(maxLength, currLength);
}
return maxLength;
}
The problem here is that we lose the const-ness of nums if we want to compute the differences in place, or we have to sacrifice space and create a copy of nums, which is a no-no given that the original solution already has O(1) space complexity.
Is there an idea/solution that I have overlooked that achieves what I want in a succinct and readable manner?
In both your code snippets, you are iterating through a range (in the first version, with an index-based-loop, and in the second with a range-for loop). This is not really the kind of code you should be writing if you want to use the standard algorithms, which work with iterators into the range. Instead of thinking of a range as a collection of elements, if you start thinking in terms of pairs of iterators, choosing the right algorithms becomes easier.
For this problem, here's a reasonable way to write this code:
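auto max_adjacent_length = [](auto const & v)
{
    // std::ptrdiff_t matches what std::distance returns for standard containers,
    // so std::max sees two arguments of the same type on every platform
    std::ptrdiff_t max = 0;
    auto begin = v.begin();
    while (begin != v.end()) {
        // is_sorted_until compares with <, so equal neighbours extend a run
        auto next = std::is_sorted_until(begin, v.end());
        max = std::max(std::distance(begin, next), max);
        begin = next;
    }
    return max;
};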
Note that you were already on the right track in terms of picking a reasonable algorithm. This could be solved with adjacent_find as well, with just a little more work.
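As a sketch of what the adjacent_find variant could look like: the predicate std::greater_equal<> flags the first element whose successor is not strictly greater, which marks the end of the current run. The function name longest_increasing_run is made up for this example, and the sketch assumes a std::vector input.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

template <typename T>
std::ptrdiff_t longest_increasing_run(const std::vector<T>& v) {
    std::ptrdiff_t best = 0;
    auto first = v.begin();
    const auto last = v.end();
    while (first != last) {
        // Finds the first element that is >= its successor, i.e. the last
        // element of the current strictly increasing run (or `last` if none).
        auto run_end = std::adjacent_find(first, last, std::greater_equal<>());
        if (run_end != last) ++run_end;  // include the run's final element
        best = std::max<std::ptrdiff_t>(best, std::distance(first, run_end));
        first = run_end;
    }
    return best;
}
```

Unlike the is_sorted_until version, equal neighbours end a run here, which matches the strict comparison in the question's original loop.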
In C++, what is the fastest way in terms of run time (on multi-core processors), from an algorithm design viewpoint, to search a sorted array (or slice, or whatever data structure is faster for this purpose) for numbers in a given range (e.g. between 100 and 1000) and return only the first 10 matches? E.g. pseudocode in golang:
var listofnums []uint64
numcounter := 0
// slice of [1,2,3,4,5,31,32 .. 932536543]; this list has 1 billion numeric items.
// listofnums is already sorted each time an item is added, but we do not know the lower_bound or upper_bound of the item list.
// I know I can use binary search to find listofnums[i] where it is smallest at [i] too... I'm asking for suggestions.
for i := 0; i < len(listofnums); i++ {
    if listofnums[i] >= 1000 || numcounter == 10 {
        return
    }
    if listofnums[i] > 100 {
        fmt.Printf("%d\n", listofnums[i])
        numcounter++
    }
}
is this the fastest way? I saw bitmap structures in C++ but not sure if can be applied here.
I've come across this question, which is perfectly fine for veteran programmers to ask but I have no idea why it's down voted.
What is the fastest search method for array?
Can someone please not remove this question but let me rephrase it? Thanks in advance. I hope to find the most optimum way to return a range of numbers from a large array of numeric items.
If I understand your problem correctly, you need to find two positions in your array: the position of the first number greater than or equal to 100, and the position just past the last number less than or equal to 1000.
The functions std::lower_bound and std::upper_bound do binary searches designed to find such a range.
For arrays, in C++ we usually use a std::vector and denote the beginning and end of ranges using a pair of iterators.
So something like this may be what you need:
std::pair<std::vector<int>::iterator, std::vector<int>::iterator>
find_range(std::vector<int>& v, int min, int max)
{
auto begin = std::lower_bound(std::begin(v), std::end(v), min);
// start searching after the previously found value
auto end = std::upper_bound(begin, std::end(v), max);
return {begin, end};
}
You can iterate over that range like this:
auto range = find_range(v, 100, 1000);
for(auto i = range.first; i != range.second; ++i)
std::cout << *i << '\n';
You can create a new vector from the range (slow) like this:
std::vector<int> selection{range.first, range.second};
My first attempt.
Features:
logN time complexity
creates an array slice, no copying of data
second binary search minimises the search space on the basis of the first
possible improvements:
if n is small, the second binary search would be a pessimisation. Better to simply count forward up to n times.
#include <vector>
#include <cstdint>
#include <algorithm>
#include <iterator>
#include <iostream>
template <class Iter> struct range
{
range(Iter first, std::size_t size) : begin_(first), end_(first + size) {}
auto begin() const { return begin_; }
auto end() const { return end_; }
Iter begin_, end_;
};
template<class Iter> range(Iter, std::size_t) -> range<Iter>;
auto find_first_n_between(std::vector<std::int64_t>& vec,
std::size_t n,
std::int64_t from, std::int64_t to)
{
auto lower = std::lower_bound(begin(vec), end(vec), from);
auto upper = std::upper_bound(lower, end(vec), to);
auto size = std::min(n, std::size_t(std::distance(lower, upper)));
return range(lower, size);
}
int main()
{
std::vector<std::int64_t> vec { 1,2,3,4,5,6,7,8,15,17,18,19,20 };
auto slice = find_first_n_between(vec, 5, 6, 15);
std::copy(std::begin(slice), std::end(slice), std::ostream_iterator<std::int64_t>(std::cout, ", "));
}
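The improvement mentioned above (skipping the second binary search when n is small) could be sketched roughly as follows. The function name first_n_between and the choice to return a pair of const iterators are assumptions made for this example only.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Int64Iter = std::vector<std::int64_t>::const_iterator;

// Returns [first, last) covering at most n values v with from <= v <= to.
// `vec` must be sorted in ascending order.
std::pair<Int64Iter, Int64Iter>
first_n_between(const std::vector<std::int64_t>& vec, std::size_t n,
                std::int64_t from, std::int64_t to)
{
    // One binary search locates the start of the range.
    auto first = std::lower_bound(vec.begin(), vec.end(), from);
    auto last = first;
    // Linear scan: at most n steps, stopping early at the end of the data
    // or at the first value above `to`.
    for (std::size_t taken = 0; taken < n && last != vec.end() && *last <= to; ++taken)
        ++last;
    return {first, last};
}
```

For n = 10 this touches at most ten consecutive elements after the lower bound, which is likely cheaper than a second binary search over a billion-element tail, though only measurement on the real data can confirm that.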
I have a class that holds millions of items, and each item has a label of type int. I need to partition the items by label, so that at the end I return a vector<MyClass> with one entry per label. First, I sort all items by their label. Then, in a for loop, I compare each label with the previous one; if it is the same, I store the item in myclass_temp until label != previous_label. When label != previous_label, I add myclass_temp to the vector<MyClass> and clear myclass_temp. I think the code is self-explanatory.
The program works fine, but it is slow. Is there a way to speed it up? Since I sort the items at the beginning, I believe there should be a faster way to simply partition items with the same label.
My second question is how to calculate the big-O complexity of this algorithm and of any suggested faster solution.
Please feel free to correct my code.
vector <MyClass> PartitionByLabels(MyClass &myclass){
/// sort MyClass items based on label number
printf ("Sorting items by label number... \n");
std::sort(myclass.begin(), myclass.end(), compare_labels);
vector <MyClass> myClasses_vec;
MyClass myclass_temp;
int previous_label=0, label=0;
int total_items = 0;
/// partition myclass items based on similar labels
for (int i=0; i < myclass.size(); i++){
label = myclass[i].label;
if (label == previous_label){
myclass_temp.push_back(myclass[i]);
previous_label = label;
/// add the last similar items
if (i == myclass.size()-1){
myClasses_vec.push_back(myclass_temp);
total_items +=myclass_temp.size();
}
} else{
myClasses_vec.push_back(myclass_temp);
total_items +=myclass_temp.size();
myclass_temp.EraseItems();
myclass_temp.push_back(myclass[i]);
previous_label = label;
}
}
printf("Total number of items: %d \n", total_items);
return myClasses_vec;
}
This algorithm should do it. I removed the templates to make it easier to check on godbolt.
They should be easy enough to put back in.
The complexity of this method is that of std::sort: O(N log N).
#include <vector>
#include <algorithm>
#include <string>
#include <iterator>
struct thing
{
std::string label;
std::string value;
};
using MyClass = std::vector<thing>;
using Partitions = std::vector<MyClass>;
auto compare_labels = [](thing const& l, thing const& r) {
return l.label < r.label;
};
// pass by value - we need a copy anyway and we might get copy elision
Partitions PartitionByLabels(MyClass myclass){
/// sort MyClass items based on label number
std::sort(myclass.begin(), myclass.end(), compare_labels);
Partitions result;
auto first = myclass.begin();
auto last = myclass.end();
// because the range is sorted, we can partition it in linear time.
// choosing the correct algorithm is always the best optimisation
while (first != last)
{
auto next = std::find_if(first, last, [&first](auto const& x) { return x.label != first->label; });
// let's move the items - that should speed things up a little
// this is safe because we took a copy
result.push_back(MyClass(std::make_move_iterator(first),
std::make_move_iterator(next)));
first = next;
}
return result;
}
We can of course do better with unordered maps, if:
the label is hashable and equality-comparable
we don't need to order the output (if we did, we'd use a multimap instead)
The complexity of this method is linear time on average: O(N).
#include <vector>
#include <algorithm>
#include <string>
#include <iterator>
#include <unordered_map>
struct thing
{
std::string label;
std::string value;
};
using MyClass = std::vector<thing>;
using Partitions = std::vector<MyClass>;
// pass by const reference - the values are copied into the intermediate multimap
Partitions PartitionByLabels(MyClass const& myclass){
using object_type = MyClass::value_type;
using label_type = decltype(std::declval<object_type>().label);
using value_type = decltype(std::declval<object_type>().value);
std::unordered_multimap<label_type, value_type> inter;
for(auto&& x : myclass) {
inter.emplace(x.label, x.value);
}
Partitions result;
auto first = inter.begin();
auto last = inter.end();
while (first != last)
{
auto range = inter.equal_range(first->first);
MyClass tmp;
tmp.reserve(std::distance(range.first, range.second));
for (auto i = range.first ; i != range.second ; ++i) {
tmp.push_back(object_type{i->first, std::move(i->second)});
}
result.push_back(std::move(tmp));
first = range.second;
}
return result;
}
Why not create a map from ints to vectors and iterate through the original vector once, adding each item to TheMap[myclass[i].label]? With a hash map (std::unordered_map) this takes your average runtime from O(n + n*log(n)) to O(n).
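A minimal sketch of that map-of-vectors idea, assuming a hash map keyed by the int label. Item and partition_by_label are hypothetical names; the question does not show what its items actually contain beyond the label.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical item type: an int label plus some payload.
struct Item { int label; std::string value; };

std::vector<std::vector<Item>> partition_by_label(const std::vector<Item>& items) {
    std::unordered_map<int, std::vector<Item>> groups;
    for (const Item& item : items)
        groups[item.label].push_back(item);  // operator[] creates the bucket on first use

    // Move each bucket out of the map; the order of the groups is unspecified.
    std::vector<std::vector<Item>> result;
    result.reserve(groups.size());
    for (auto& entry : groups)
        result.push_back(std::move(entry.second));
    return result;
}
```

No sort is needed at all. If the groups have to come out ordered by label, swapping in std::map would do that at the cost of a logarithmic factor per insertion.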
The problem is to find an integer without its pair in a sequence of integers. Here's what I wrote so far; to me it looks like it should work, but it doesn't. Any help for a noob programmer?
using namespace std;
int lonelyinteger(vector < int > a, int _a_size) {
for (int i = 0; i < _a_size; i++)
{
bool flag = false;
for (int n = i + 1; n < _a_size; n++)
{
if (a.at(i) == a.at(n))
{
flag = true;
break;
}
}
if (flag == false)
{
return a.at(i);
}
}
return 0;
}
For the input 1 1 2 it outputs 1 while it's supposed to output 2.
For 0 0 1 2 1 it outputs 0, and here it has to be 2.
The problem is that your inner loop only checks from the index i and onward for a duplicate. In the case 1 1 2 the first loop encounters a[1] which is 1. After that index, there is no element that is equal to 1, so the function returns 1.
In general, there is a better solution to this problem. Instead of going through the vector twice, you can use a set to keep track of all the elements you have already encountered. For each element, check if the set already contains it. If not, add it to the set. Otherwise, remove it from the set. Anything remaining in the set will be unique within the vector.
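A small sketch of the set-based approach described above, assuming every value other than the lonely ones occurs exactly twice (a value seen three times would be reported as unpaired). The function name unpaired_values is invented for this example.

```cpp
#include <unordered_set>
#include <vector>

std::unordered_set<int> unpaired_values(const std::vector<int>& a) {
    std::unordered_set<int> seen;
    for (int x : a) {
        // insert() reports whether x was newly added; if it was already
        // present, this is its partner, so remove it again.
        auto result = seen.insert(x);
        if (!result.second)
            seen.erase(result.first);
    }
    return seen;  // whatever is left never found a partner
}
```

For the inputs from the question, 1 1 2 and 0 0 1 2 1, the returned set contains only 2 in both cases.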
All of the answers are good.
Now, assuming that the array cannot be sorted, here is a somewhat lazy approach using std::map that shows what can be done using the various algorithm functions.
#include <map>
#include <vector>
#include <iostream>
#include <algorithm>
using namespace std;
int lonelyinteger(const std::vector<int>& a)
{
typedef std::map<int, int> IntMap;
IntMap theMap;
// build the map
for_each(a.begin(), a.end(), [&](int n){ theMap[n]++; });
// find the first entry with a count of 1
return
find_if(theMap.begin(), theMap.end(),
[](const IntMap::value_type& pr){return pr.second == 1; })->first;
}
int main()
{
std::vector<int> TestVect = { 1, 1, 2 };
cout << lonelyinteger(TestVect);
}
Live example: http://ideone.com/0t89Ni
This code assumes that
the passed in vector is not empty,
the first item found with a count of 1 is the lonely value.
There is at least one "lonely value".
I also changed the signature to take a vector by reference and not send the count (since a vector knows its own size).
The code does not do any hand-coded loops, so that is one source of error removed.
Second, counting the number of times a number is seen is, more or less, done by the map, using operator[] to insert new entries and ++ to increase the count on the entry.
Last, the search for the first entry with only a count of 1 is done with std::find_if, again guaranteeing success (given that the data follows the assumptions made above).
So basically, without really trying hard, a solution can be written using algorithm functions and usage of the std::map associative container.
If your data will consist of multiple (or even no) "lonely" integers, the following changes can be made:
#include <map>
#include <vector>
#include <iostream>
#include <algorithm>
#include <iterator>
using namespace std;
std::vector<int> lonelyinteger(const std::vector<int>& a)
{
std::vector<int> retValue;
typedef std::map<int, int> IntMap;
IntMap theMap;
// build the map
for_each(a.begin(), a.end(), [&](int n){ theMap[n]++; });
// find all entries with a count of 1
for_each(theMap.begin(), theMap.end(),
[&](const IntMap::value_type& pr)
{if (pr.second == 1) retValue.push_back(pr.first); });
// return our answer
return retValue;
}
int main()
{
std::vector<int> TestVect = { 1, 1, 2, 3, 5, 0, 2, 8 };
std::vector<int> ans = lonelyinteger(TestVect);
copy(ans.begin(), ans.end(), ostream_iterator<int>(cout," "));
}
Live example: http://ideone.com/40NY4k
Note that we now retrieve every entry with a count of 1 and store it in a vector that will be returned.
A simple answer might be to just sort the list and then look for an element whose value differs from both the element before it and the element after it.
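Sketched in code, the sort-and-compare-neighbours idea might look like this. The name lonely_by_sorting is made up, and the function takes its argument by value so that the caller's vector is left unsorted.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns every value that occurs exactly once, in ascending order.
std::vector<int> lonely_by_sorting(std::vector<int> a) {
    std::sort(a.begin(), a.end());
    std::vector<int> lonely;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // After sorting, duplicates are adjacent, so checking both
        // neighbours is enough to decide whether a[i] is unique.
        const bool same_as_prev = i > 0 && a[i] == a[i - 1];
        const bool same_as_next = i + 1 < a.size() && a[i] == a[i + 1];
        if (!same_as_prev && !same_as_next)
            lonely.push_back(a[i]);
    }
    return lonely;
}
```

This costs O(n log n) for the sort plus one linear pass, and unlike a toggle-based approach it also copes with values that appear three or more times.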
Your problem is that the last item of any given value in the list has no subsequent duplicates, and you are treating "no subsequent duplicates" as if it meant "no duplicates at all" (which is false).
If you don't want to remove values your inner loop has seen and earlier identified as a duplicate of a "previous" value, loop over all values in the inner loop, ignoring the match with itself.
I am working on a project that requires the manipulation of enormous matrices, specifically pyramidal summation for a copula calculation.
In short, I need to keep track of a relatively small number of values (usually a value of 1, and in rare cases more than 1) in a sea of zeros in the matrix (multidimensional array).
A sparse array allows the user to store a small number of values, and assume all undefined records to be a preset value. Since it is not physically possible to store all values in memory, I need to store only the few non-zero elements. This could be several million entries.
Speed is a huge priority, and I would also like to dynamically choose the number of variables in the class at runtime.
I currently work on a system that uses a binary search tree (b-tree) to store entries. Does anyone know of a better system?
For C++, a map works well. Several million objects won't be a problem. 10 million items took about 4.4 seconds and about 57 meg on my computer.
My test application is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <map>
class triple {
public:
int x;
int y;
int z;
bool operator<(const triple &other) const {
if (x < other.x) return true;
if (other.x < x) return false;
if (y < other.y) return true;
if (other.y < y) return false;
return z < other.z;
}
};
int main(int, char**)
{
std::map<triple,int> data;
triple point;
int i;
for (i = 0; i < 10000000; ++i) {
point.x = rand();
point.y = rand();
point.z = rand();
//printf("%d %d %d %d\n", i, point.x, point.y, point.z);
data[point] = i;
}
return 0;
}
Now to dynamically choose the number of variables, the easiest solution is to represent index as a string, and then use string as a key for the map. For instance, an item located at [23][55] can be represented via "23,55" string. We can also extend this solution for higher dimensions; such as for three dimensions an arbitrary index will look like "34,45,56". A simple implementation of this technique is as follows:
std::map<std::string, int> data;
char ix[100];
snprintf(ix, sizeof ix, "%d,%d", x, y); // 2 vars
data[ix] = i;
snprintf(ix, sizeof ix, "%d,%d,%d", x, y, z); // 3 vars
data[ix] = i;
The accepted answer recommends using strings to represent multi-dimensional indices.
However, constructing strings is needlessly wasteful for this. If the size isn’t known at compile time (and thus std::tuple doesn’t work), std::vector works well as an index, both with hash maps and ordered trees. For std::map, this is almost trivial:
#include <vector>
#include <map>
using index_type = std::vector<int>;
template <typename T>
using sparse_array = std::map<index_type, T>;
For std::unordered_map (or similar hash table-based dictionaries) it’s slightly more work, since std::vector does not specialise std::hash:
#include <cstddef>
#include <functional>
#include <numeric>
#include <unordered_map>
#include <vector>
using index_type = std::vector<int>;
struct index_hash {
    std::size_t operator()(index_type const& i) const noexcept {
        // Like boost::hash_combine; there might be some caveats, see
        // <https://stackoverflow.com/a/50978188/1968>
        auto const hash_combine = [](std::size_t seed, int x) {
            return seed ^ (std::hash<int>()(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2));
        };
        // Start from a std::size_t seed so the accumulator is not truncated to int;
        // this also keeps an empty index well defined.
        return std::accumulate(i.begin(), i.end(), std::size_t{0}, hash_combine);
    }
};
template <typename T>
using sparse_array = std::unordered_map<index_type, T, index_hash>;
Either way, the usage is the same:
int main() {
using i = index_type;
auto x = sparse_array<int>();
x[i{1, 2, 3}] = 42;
x[i{4, 3, 2}] = 23;
std::cout << x[i{1, 2, 3}] + x[i{4, 3, 2}] << '\n'; // 65
}
Boost has a templated implementation of BLAS called uBLAS that contains a sparse matrix.
https://www.boost.org/doc/libs/release/libs/numeric/ublas/doc/index.htm
Eigen is a C++ linear algebra library that has an implementation of a sparse matrix. It even supports matrix operations and solvers (LU factorization etc) that are optimized for sparse matrices.
A complete list of solutions can be found on Wikipedia. For convenience, I have quoted the relevant sections below.
https://en.wikipedia.org/wiki/Sparse_matrix#Dictionary_of_keys_.28DOK.29
Dictionary of keys (DOK)
DOK consists of a dictionary that maps (row, column)-pairs to the
value of the elements. Elements that are missing from the dictionary
are taken to be zero. The format is good for incrementally
constructing a sparse matrix in random order, but poor for iterating
over non-zero values in lexicographical order. One typically
constructs a matrix in this format and then converts to another more
efficient format for processing.[1]
List of lists (LIL)
LIL stores one list per row, with each entry containing the column
index and the value. Typically, these entries are kept sorted by
column index for faster lookup. This is another format good for
incremental matrix construction.[2]
Coordinate list (COO)
COO stores a list of (row, column, value) tuples. Ideally, the entries
are sorted (by row index, then column index) to improve random access
times. This is another format which is good for incremental matrix
construction.[3]
Compressed sparse row (CSR, CRS or Yale format)
The compressed sparse row (CSR) or compressed row storage (CRS) format
represents a matrix M by three (one-dimensional) arrays, that
respectively contain nonzero values, the extents of rows, and column
indices. It is similar to COO, but compresses the row indices, hence
the name. This format allows fast row access and matrix-vector
multiplications (Mx).
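To make the CSR description a bit more concrete, here is a small sketch of the three arrays and the matrix-vector product they enable. CsrMatrix, its field names and multiply are illustrative choices, not the API of any particular library.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal CSR container.
struct CsrMatrix {
    std::size_t rows = 0, cols = 0;
    std::vector<double> values;        // nonzero values, stored row by row
    std::vector<std::size_t> col_idx;  // column index of each entry in `values`
    std::vector<std::size_t> row_ptr;  // row r occupies [row_ptr[r], row_ptr[r + 1]); size is rows + 1
};

// Computes y = M * x, visiting only the stored (nonzero) entries.
std::vector<double> multiply(const CsrMatrix& m, const std::vector<double>& x) {
    std::vector<double> y(m.rows, 0.0);
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t k = m.row_ptr[r]; k < m.row_ptr[r + 1]; ++k)
            y[r] += m.values[k] * x[m.col_idx[k]];
    return y;
}
```

For the 3x3 matrix with rows (1 0 2), (0 0 3), (4 5 0), the arrays would be values = {1, 2, 3, 4, 5}, col_idx = {0, 2, 2, 0, 1} and row_ptr = {0, 2, 3, 5}.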
Small detail in the index comparison. You need to do a lexicographical compare, otherwise:
a= (1, 2, 1); b= (2, 1, 2);
(a<b) == (b<a) is true, but b!=a
Edit: So the comparison should probably be:
return lhs.x<rhs.x
? true
: lhs.x==rhs.x
? lhs.y<rhs.y
? true
: lhs.y==rhs.y
? lhs.z<rhs.z
: false
: false
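The same lexicographical comparison can be written more compactly with std::tie, which is a common idiom for this. The sketch below uses a hypothetical Index3 struct as a stand-in for the index type under discussion.

```cpp
#include <tuple>

// Hypothetical stand-in for a three-component index.
struct Index3 { int x, y, z; };

inline bool operator<(const Index3& lhs, const Index3& rhs) {
    // std::tie builds tuples of references; std::tuple's operator<
    // compares them lexicographically: x first, then y, then z.
    return std::tie(lhs.x, lhs.y, lhs.z) < std::tie(rhs.x, rhs.y, rhs.z);
}
```

With a = (1, 2, 1) and b = (2, 1, 2), this gives a < b and not b < a, which is what an ordered map needs.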
Hash tables have a fast insertion and look up. You could write a simple hash function since you know you'd be dealing with only integer pairs as the keys.
The best way to implement sparse matrices is not to implement them, at least not on your own. I would suggest looking at BLAS (the library that LAPACK builds on), which can handle really huge matrices.
Since only values with [a][b][c]...[w][x][y][z] are of consequence, we only store the indices themselves, not the value 1, which is just about everywhere: always the same, and there is no way to hash it. Noting that the curse of dimensionality is present, I suggest going with some established tool such as NIST or Boost, or at least reading their sources to avoid needless blunders.
If the work needs to capture the temporal dependence distributions and parametric tendencies of unknown data sets, then a Map or B-Tree with uni-valued root is probably not practical. We can store only the indices themselves, hashed if ordering (sensibility for presentation) can be subordinated to reducing time at run-time, for all 1 values. Since non-zero values other than one are few, an obvious candidate for those is whatever data structure you can readily find and understand. If the data set is truly vast, universe-sized, I suggest some sort of sliding window where you manage file / disk / persistent I/O yourself, moving portions of the data into scope as need be (writing code that you can understand). If you are under commitment to provide an actual solution to a working group, failure to do so leaves you at the mercy of consumer-grade operating systems that have the sole goal of taking your lunch away from you.
Here is a relatively simple implementation that should provide reasonably fast lookup (using a hash table) as well as fast iteration over non-zero elements in a row/column.
// Copyright 2014 Leo Osvald
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_
#define UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_
#include <algorithm>
#include <limits>
#include <map>
#include <type_traits>
#include <unordered_map>
#include <utility>
#include <vector>
// A simple time-efficient implementation of an immutable sparse matrix
// Provides efficient iteration of non-zero elements by rows/cols,
// e.g. to iterate over a range [row_from, row_to) x [col_from, col_to):
// for (int row = row_from; row < row_to; ++row) {
// for (auto col_range = sm.nonzero_col_range(row, col_from, col_to);
// col_range.first != col_range.second; ++col_range.first) {
// int col = *col_range.first;
// // use sm(row, col)
// ...
// }
template<typename T = double, class Coord = int>
class SparseMatrix {
struct PointHasher;
typedef std::map< Coord, std::vector<Coord> > NonZeroList;
typedef std::pair<Coord, Coord> Point;
public:
typedef T ValueType;
typedef Coord CoordType;
typedef typename NonZeroList::mapped_type::const_iterator CoordIter;
typedef std::pair<CoordIter, CoordIter> CoordIterRange;
SparseMatrix() = default;
// Reads a matrix stored in MatrixMarket-like format, i.e.:
// <num_rows> <num_cols> <num_entries>
// <row_1> <col_1> <val_1>
// ...
// Note: the header (lines starting with '%') is ignored.
template<class InputStream, size_t max_line_length = 1024>
void Init(InputStream& is) {
rows_.clear(), cols_.clear();
values_.clear();
// skip the header (lines beginning with '%', if any)
decltype(is.tellg()) offset = 0;
for (char buf[max_line_length + 1];
is.getline(buf, sizeof(buf)) && buf[0] == '%'; )
offset = is.tellg();
is.seekg(offset);
size_t n;
is >> row_count_ >> col_count_ >> n;
values_.reserve(n);
while (n--) {
Coord row, col;
typename std::remove_cv<T>::type val;
is >> row >> col >> val;
values_[Point(--row, --col)] = val;
rows_[col].push_back(row);
cols_[row].push_back(col);
}
SortAndShrink(rows_);
SortAndShrink(cols_);
}
const T& operator()(const Coord& row, const Coord& col) const {
static const T kZero = T();
auto it = values_.find(Point(row, col));
if (it != values_.end())
return it->second;
return kZero;
}
CoordIterRange
nonzero_col_range(Coord row, Coord col_from, Coord col_to) const {
CoordIterRange r;
GetRange(cols_, row, col_from, col_to, &r);
return r;
}
CoordIterRange
nonzero_row_range(Coord col, Coord row_from, Coord row_to) const {
CoordIterRange r;
GetRange(rows_, col, row_from, row_to, &r);
return r;
}
Coord row_count() const { return row_count_; }
Coord col_count() const { return col_count_; }
size_t nonzero_count() const { return values_.size(); }
size_t element_count() const { return size_t(row_count_) * col_count_; }
private:
typedef std::unordered_map<Point,
typename std::remove_cv<T>::type,
PointHasher> ValueMap;
struct PointHasher {
size_t operator()(const Point& p) const {
return p.first << (std::numeric_limits<Coord>::digits >> 1) ^ p.second;
}
};
static void SortAndShrink(NonZeroList& list) {
for (auto& it : list) {
auto& indices = it.second;
indices.shrink_to_fit();
std::sort(indices.begin(), indices.end());
}
// insert a sentinel vector to handle the case of all zeroes
if (list.empty())
list.emplace(Coord(), std::vector<Coord>(Coord()));
}
static void GetRange(const NonZeroList& list, Coord i, Coord from, Coord to,
CoordIterRange* r) {
auto lr = list.equal_range(i);
if (lr.first == lr.second) {
r->first = r->second = list.begin()->second.end();
return;
}
auto begin = lr.first->second.begin(), end = lr.first->second.end();
r->first = std::lower_bound(begin, end, from);
r->second = std::lower_bound(r->first, end, to);
}
ValueMap values_;
NonZeroList rows_, cols_;
Coord row_count_, col_count_;
};
#endif /* UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_ */
For simplicity, it's immutable, but you can make it mutable; be sure to change std::vector to std::set if you want reasonably efficient insertions (changing a zero to a non-zero).
I would suggest doing something like:
typedef std::tuple<int, int, int> coord_t;
typedef boost::hash<coord_t> coord_hash_t;
typedef std::unordered_map<coord_t, int, coord_hash_t> sparse_array_t;
sparse_array_t the_data;
the_data[ { x, y, z } ] = 1; /* list-initialization is cool */
for( const auto& element : the_data ) {
    int xx, yy, zz, val;
    std::tie( xx, yy, zz ) = element.first; /* unpack the coordinate key */
    val = element.second;
    /* ... */
}
To help keep your data sparse, you might want to write a subclass of unordered_map whose iterators automatically skip over (and erase) any items with a value of 0.
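One simpler way to get a similar effect, sketched here as an alternative rather than as the subclass described above, is a thin wrapper that never stores zeros in the first place. SparseArray, get, set and stored are hypothetical names, and the sketch uses std::map with a std::vector<int> index purely to avoid needing a hash function.

```cpp
#include <cstddef>
#include <map>
#include <vector>

class SparseArray {
public:
    using Index = std::vector<int>;

    // Missing entries read as the default value 0.
    int get(const Index& i) const {
        auto it = data_.find(i);
        return it == data_.end() ? 0 : it->second;
    }

    // Writing 0 erases the entry, so the container only ever holds non-zeros.
    void set(const Index& i, int value) {
        if (value == 0)
            data_.erase(i);
        else
            data_[i] = value;
    }

    std::size_t stored() const { return data_.size(); }

private:
    std::map<Index, int> data_;
};
```

Because reads go through find() rather than operator[], looking up an absent index does not silently insert a zero entry.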