With Eigen: given a large sparse matrix, what would be the quickest way to determine whether the value at a given position is null or not?
It must be said that I use this sparse matrix implementation to store non-numeric values. It's related to this question.
At present, what I have is (with T my data type):
Eigen::SparseMatrix<T> mat(n,n); // n could be 1000, 10000, ...
// fill
T e1( /* some values */ );
mat.insert( 3, 4 ) = e1;
...
Checking is done with:
T elem = mat.coeff( row, col );
if( elem == T() ) // use default constructor
std::cout << "empty !\n";
But this doesn't satisfy me, as it implies both a copy and a deep comparison. What I would like could be something like:
if( mat.isNull( row, col ) )
std::cout << "empty !\n";
Is there a way I can do this? Or fetch a pointer that returns nullptr if the element is empty?
I checked the tutorial pages, the quick reference, and the relevant manual pages but didn't find anything.
If this turns out not to be possible, then I am thinking about wrapping the matrix in a class along with a map holding all the filled positions. That way I could quickly find out whether a given (row, col) has a value.
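For illustration, a minimal sketch of that wrapper idea, assuming writes go through coeffRef() and a std::set tracks the filled positions (the class name and interface are made up):

#include <set>
#include <utility>
#include <Eigen/SparseCore>

template <typename T>
class TrackedSparse {
    Eigen::SparseMatrix<T> mat_;
    std::set<std::pair<int, int>> filled_;      // every (row, col) that holds a value
public:
    explicit TrackedSparse(int n) : mat_(n, n) {}
    void set(int row, int col, const T& v) {
        mat_.coeffRef(row, col) = v;            // inserts or overwrites the coefficient
        filled_.insert({row, col});
    }
    bool isNull(int row, int col) const {
        return filled_.count({row, col}) == 0;  // set lookup, no copy or deep compare
    }
};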
Addendum: I have Eigen 3.2 but could switch to 3.3 if required.
Edit: the linked question asks how to get the set of non-null elements. What I am asking is "how do I know if the value at (row, col) is null?". Not a dupe AFAIK.
Derived from https://eigen.tuxfamily.org/dox-devel/group__TutorialSparse.html, you can define a null-checking function like this:
template <typename T>
bool isNull(const Eigen::SparseMatrix<T>& mat, int row, int col)
{
    // scan the stored entries of the given column
    // (the default storage order is column-major)
    for (typename Eigen::SparseMatrix<T>::InnerIterator it(mat, col); it; ++it) {
        if (it.row() == row) return false;
    }
    return true;
}
This way you don't copy anything. You don't perform deep comparison either.
Depending on the number of elements per column, the most efficient solution might be a binary search, the same way coeffRef() does it. You can in fact copy that function, replace all returns with true or false, and integrate it into SparseMatrix using the plugin mechanism.
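For reference, a minimal sketch of that binary-search idea over the raw index arrays, assuming the default column-major layout and compressed storage (call makeCompressed() first); isNullBinary is a made-up name:

#include <algorithm>
#include <Eigen/SparseCore>

template <typename T>
bool isNullBinary(const Eigen::SparseMatrix<T>& mat, int row, int col)
{
    // outerIndexPtr()[col] .. outerIndexPtr()[col + 1] delimits the row
    // indices stored for this column, sorted in ascending order
    const auto* inner = mat.innerIndexPtr();
    const auto* begin = inner + mat.outerIndexPtr()[col];
    const auto* end   = inner + mat.outerIndexPtr()[col + 1];
    return !std::binary_search(begin, end, row);
}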
Most of the time it is inefficient to access individual elements of a sparse matrix, though.
I have 2 structs, one simply has 2 values:
struct combo {
int output;
int input;
};
And another that orders the combo elements by their input value, so that input and output stay paired:
struct organize {
bool operator()(combo const &a, combo const &b)
{
return a.input < b.input;
}
};
Using this:
sort(myVector.begin(), myVector.end(), organize());
What I'm trying to do with this is iterate through the input values and check whether each element is equal to another input 'in'.
If it is equal, I want to take the output value at the same index where the match was found in input, and push it into another temp vector.
I originally went with a simpler solution (when I wasn't using structs and simply had 2 vectors, one input and one output) and had this in a function called copy:
for (std::size_t i = 0; i < input.size(); ++i) {
    if (input[i] == in) {
        temp.push_back(output[i]);
    }
}
Now this code did work exactly how I needed, but it is simply too slow. It can handle 10 or 100 inputs fine, but around 1,000 it begins to slow down, taking an extra 5 seconds or so; at 10,000 it takes minutes, and you can forget about 100,000 or 1,000,000+ inputs.
So I asked on here how to speed it up (just the loop), and somebody suggested sorting the input vector. I did that, implemented their suggestion of using lower/upper bound, and changed my loop to this:
auto lowerIt = std::lower_bound(input.begin(), input.end(), in);
auto upperIt = std::upper_bound(input.begin(), input.end(), in);
for (auto it = lowerIt; it != upperIt; ++it)
{
    temp.push_back(output[it - input.begin()]);
}
And it worked; it was much faster. I would still like it to handle 1,000,000+ inputs in seconds, but I'm not sure how to do that yet.
I then realized that I can't just sort the input vector on its own. What if the inputs are something like:
input.push_back(10);
input.push_back(-1);
output.push_back(1);
output.push_back(2);
Well then we have 10 in input corresponding to 1 in output, and -1 corresponding to 2. Sorting input smallest to largest on its own would break that correspondence.
So I found a way to sort the input while keeping each element paired with its output. No matter how you reorder input, each value still matches the output it was added with.
My issue is, I have no clue how to apply the same lower/upper bound approach to just the input member. I can't seem to refer to just the input field of myVector; I've tried something like:
std::vector<combo>::iterator it = myVector.input.begin();
But I get an error saying there is no member 'input'.
How can I iterate through just input so I can apply the upper/lower bound iterator to this new way with the structs?
I explained everything so everyone could get the best idea of what I have and what I'm trying to do; maybe somebody can also point me in a completely different direction that is fast enough to handle those millions of inputs. Keep in mind I'd prefer to stick with vectors, because not doing so would mean changing 2 other files to work with things that aren't vectors or lists.
Thank you!
I think that if you sort it smallest to largest (x is an integer after all), you should be able to use std::adjacent_find to find duplicates in the array and process them properly. For the performance issues, you might consider using reserve to preallocate space for your large vector, so that your push_back operations don't have to reallocate memory as often.
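Tying this back to the struct version: a minimal sketch of what the lookup could look like on the sorted vector of combos, using std::equal_range with the question's organize comparator (equal_range rather than adjacent_find) together with reserve; the probe value is illustrative:

#include <algorithm>
#include <vector>

std::sort(myVector.begin(), myVector.end(), organize());

std::vector<int> temp;
temp.reserve(myVector.size());           // avoid repeated reallocations

combo probe{0, in};                      // only the input field matters for ordering
auto range = std::equal_range(myVector.begin(), myVector.end(), probe, organize());
for (auto it = range.first; it != range.second; ++it)
    temp.push_back(it->output);          // each output travels with its input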
I currently have a solution to this problem, but I feel it's not as efficient as it could be, so I want to see if there is a faster method.
I have two arrays (std::vectors, for example). Both contain only unique integer values that are sorted but sparse in value, e.g. 1, 4, 12, 13... What I want to ask: is there a fast way to find an INDEX into one of the arrays where the values are the same? For example, array1 has values 1, 4, 12, 13 and array2 has values 2, 12, 14, 16; the first matching value is 12, at index 1 in array2. The index into the array is what is important, as I have other arrays that contain data that will use this matching index.
I am not confined to using arrays; maps are possible too. I am only comparing the two arrays once; they will not be reused again after the first matching pass. There can be a small to large number of values (300,000+) in either array, and they DO NOT always have the same number of values (that would make things much easier).
The worst case is a linear search for each element, O(N^2). Using a map would get me a better O(log N) per lookup, but I would still have to convert one array into a map of value/index pairs.
What I currently have, to avoid any container-type conversions, is this: loop over the smaller of the two arrays. Compare the current element of the small array (array1) with the current element of the large array (array2). If the array1 element's value is larger than the array2 element's value, increment the index of array2 until it is no longer larger (a while loop). Then, if the array1 element's value is smaller than the array2 element's, go to the next loop iteration and begin again. Otherwise they must be equal, and I have my indices into both arrays for the matching value.
So with this loop I am at best O(N) if all values have matches, and at worst O(2N) if none match. So I am wondering if there is something faster out there? It's hard to know for sure how often the two arrays will match, but I would lean more toward the arrays mostly having matches than not.
I hope I explained the problem well enough and I appreciate any feedback or tips on improving this.
Code example:
std::vector<int> array1 = {4,6,12,34};
std::vector<int> array2 = {1,3,6,34,40};
for (unsigned int i = 0, z = 0; i < array1.size(); i++)
{
    int value1 = array1[i];
    while (z < array2.size() && value1 > array2[z]) // bounds check must come first
        z++;
    if (z >= array2.size())
        break; // reached end of array2
    if (value1 < array2[z])
        continue;
    // we have a match: indices i and z hold the same value
}
The result will be matching indices [1, 3] for array1 and [2, 3] for array2.
I wrote an implementation of this function using an algorithm that performs better on sparse distributions than the trivial linear merge.
For distributions that are similar†, it has O(n) complexity, but for ranges where the distributions differ greatly, it should perform sublinearly, approaching O(log n) in optimal cases. However, I wasn't able to prove that the worst case is better than O(n log n); on the other hand, I haven't been able to find a case that bad either.
I templated it so that any type of range can be used, such as sub-ranges or raw arrays. Technically it works with non-random-access iterators as well, but the complexity is much greater then, so it's not recommended. I think it should be possible to modify the algorithm to fall back to linear search in that case, but I haven't bothered.
† By similar distribution, I mean that the pair of arrays have many crossings. By crossing, I mean a point where you would switch from one array to another if you were to merge the two arrays together in sorted order.
#include <algorithm>
#include <iterator>
#include <utility>
// helper structure for the search
template<class Range, class Out>
struct search_data {
// is there any clearer way to get an iterator type that might be
// either a Range::const_iterator or a const T*?
using iterator = decltype(std::cbegin(std::declval<Range&>()));
iterator curr;
const iterator begin, end;
Out out;
};
template<class Range, class Out>
auto init_search_data(const Range& range, Out out) {
return search_data<Range, Out>{
std::begin(range),
std::begin(range),
std::end(range),
out,
};
}
template<class Range, class Out1, class Out2>
void match_indices(const Range& in1, const Range& in2, Out1 out1, Out2 out2) {
auto search_data1 = init_search_data(in1, out1);
auto search_data2 = init_search_data(in2, out2);
// initial order is arbitrary
auto lesser = &search_data1;
auto greater = &search_data2;
// if either range is exhausted, we are finished
while(lesser->curr != lesser->end
&& greater->curr != greater->end) {
// difference of first values in each range
auto delta = *greater->curr - *lesser->curr;
if(!delta) { // matching value was found
// store both results and increment the iterators
*lesser->out++ = std::distance(lesser->begin, lesser->curr++);
*greater->out++ = std::distance(greater->begin, greater->curr++);
continue; // then start a new iteration
}
if(delta < 0) { // set the order of ranges by their first value
std::swap(lesser, greater);
delta = -delta; // delta is always positive after this
}
// next crossing cannot be farther than the delta
// this assumption has following pre-requisites:
// range is sorted, values are integers, values in the range are unique
auto range_left = std::distance(lesser->curr, lesser->end);
auto upper_limit =
std::min(range_left, static_cast<decltype(range_left)>(delta));
// exponential search for a sub range where the value at upper bound
// is greater than target, and value at lower bound is lesser
auto target = *greater->curr;
auto lower = lesser->curr;
auto upper = std::next(lower, upper_limit);
for(int i = 1; i < upper_limit; i *= 2) {
auto guess = std::next(lower, i);
if(*guess >= target) {
upper = guess;
break;
}
lower = guess;
}
// skip all values in lesser,
// that are less than the least value in greater
lesser->curr = std::lower_bound(lower, upper, target);
}
}
#include <iostream>
#include <vector>
int main() {
std::vector<int> array1 = {4,6,12,34};
std::vector<int> array2 = {1,3,6,34};
std::vector<std::size_t> indices1;
std::vector<std::size_t> indices2;
match_indices(array1, array2,
std::back_inserter(indices1),
std::back_inserter(indices2));
std::cout << "indices in array1: ";
for(std::vector<int>::size_type i : indices1)
std::cout << i << ' ';
std::cout << "\nindices in array2: ";
for(std::vector<int>::size_type i : indices2)
std::cout << i << ' ';
std::cout << std::endl;
}
Since the arrays are already sorted you can just use something very much like the merge step of mergesort. This just looks at the head element of each array, and discards the lower element (the next element becomes the head). Stop when you find a match (or when either array becomes exhausted, indicating no match).
This is O(n) and the fastest you can do for arbitrary distributions. With certain clustered distributions a "skip ahead" approach could be used rather than always looking at the next element. This could result in better than O(n) running times for certain distributions. For example, given the arrays 1,2,3,4,5 and 10,11,12,13,14 an algorithm could determine there were no matches to be found in as few as one comparison (5 < 10).
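A minimal sketch of that merge step, stopping at the first match (it reuses the question's array1 and array2):

#include <cstddef>

std::size_t i = 0, j = 0;
while (i < array1.size() && j < array2.size()) {
    if (array1[i] < array2[j])
        ++i;                             // discard the lower head element
    else if (array2[j] < array1[i])
        ++j;
    else
        break;                           // array1[i] == array2[j]: match at (i, j)
}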
What is the range of the stored numbers?
I mean, you say that the numbers are integers, sorted, and sparse (i.e. non-sequential), and that there may be more than 300,000 of them, but what is their actual range?
The reason that I ask is that, if there is a reasonably small upper limit, u (say, u = 500,000), the fastest and most expedient solution might be to just use the values as indices. Yes, you might be wasting memory, but is 4*u really a lot of memory? This depends on your application and your target platform (i.e. if this is for a memory-constrained embedded system, it's less likely to be a good idea than if you have a laptop with 32GiB RAM).
Of course, if the values are more-or-less evenly spread over 0 to 2^31-1, this crude idea isn't attractive, but maybe there are properties of the input values that you can exploit other than simply the range. You might be able to hand-write a fairly simple hash function.
Another thing worth considering is whether you actually need to retrieve the index quickly, or whether it helps just to be able to tell quickly whether a value exists in the other array. Whether or not a value exists requires only one bit, so you could have a bitmap of the range of the input values using 32x less memory (i.e. mask off the 5 LSBs of a value and use them as a bit position, then shift the value 5 places right and use that as an array index).
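A minimal sketch of that bitmap idea, reusing the question's array1 and array2 and assuming non-negative values below a known bound (max_value here is this answer's example figure, not something given in the question):

#include <cstddef>
#include <cstdint>
#include <vector>

const int max_value = 500000;            // assumed upper bound on the values
std::vector<std::uint32_t> bitmap((max_value >> 5) + 1, 0);
for (int v : array1)
    bitmap[v >> 5] |= 1u << (v & 31);    // 5 LSBs pick the bit, the rest the word
for (std::size_t j = 0; j < array2.size(); ++j)
    if (bitmap[array2[j] >> 5] & (1u << (array2[j] & 31))) {
        // array2[j] also occurs in array1; j is the matching index
    }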
Finally, a hybrid approach might be worth considering, where you decide how much memory you're prepared to use (say you decide 256KiB, which corresponds to 64Ki 4-byte integers) and then use that as a lookup table into much smaller sub-problems. Say you have 300,000 values whose LSBs are pretty evenly distributed. Then you could use the 16 LSBs as indices into a lookup table of lists that are (on average) only 4 or 5 elements long, which you can then search by other means. A couple of years ago, I worked on some simulation software that had ~200,000,000 cells, each with a cell id; some utility functionality used a binary search to identify cells by id. We were able to speed it up significantly and non-intrusively with this strategy. Not a perfect solution, but a great improvement. (If the LSBs are not evenly distributed, maybe that's a property that you can exploit, or maybe you can choose a range of bits that are, or do a bit of hashing.)
I guess the upshot is “consider some kind of hashing”, even the “identity hash” or simple masking/modulo with a little “your solution doesn't have to be perfectly general” on the side and some “your solution doesn't have to be perfectly space efficient” sauce on top.
This question is actually quite simple, yet I would like to hear some ideas before jumping into coding. Given a file with a word on each line, calculate the n most frequent words.
The first and unfortunately only thing that pops up in my mind is to use a std::map. I know fellow C++'ers will say that unordered_map would be much more reasonable.
I would like to know if anything could be added on the algorithm side, or whether this is just a 'whoever picks the best data structure wins' type of question. I've searched the internet and read that a hash table and a priority queue might provide an algorithm with O(n) running time, but I assume it would be too complex to implement.
Any ideas?
The best data structure to use for this task is a Trie:
http://en.wikipedia.org/wiki/Trie
It will outperform a hash table for counting strings.
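For illustration, a minimal sketch of a counting trie for lowercase ASCII words (a toy version of the idea, not a tuned implementation):

#include <array>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child{};
    int count = 0;                       // occurrences of the word ending at this node
};

void addWord(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& next = node->child[c - 'a'];   // assumes 'a'..'z' only
        if (!next) next = std::make_unique<TrieNode>();
        node = next.get();
    }
    ++node->count;
}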
There are many different approaches to this question. It would ultimately depend on the scenario and other factors, such as the size of the file (if the file has a billion lines, a hash map would not be an efficient way to do it). Here are some things you can do depending on your problem:
If you know that the number of unique words is very limited, you can use a TreeMap, or in your case std::map.
If the number of words is very large, then you can build a trie and keep counts of the various words in another data structure. This could be a heap (min/max depends on what you want to do) of size n, as sketched below. That way you don't need to store all the words, just the necessary ones.
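A minimal sketch of that counting-plus-heap step, assuming an std::unordered_map has already been filled with word counts (topN is a made-up helper):

#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, int>>
topN(const std::unordered_map<std::string, int>& counts, std::size_t n) {
    using Entry = std::pair<int, std::string>;           // (count, word)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto& kv : counts) {
        heap.emplace(kv.second, kv.first);
        if (heap.size() > n) heap.pop();                 // evict the current minimum
    }
    std::vector<std::pair<std::string, int>> result;
    for (; !heap.empty(); heap.pop())
        result.push_back({heap.top().second, heap.top().first});
    return result;                                       // ascending by count
}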
I would not start with std::map (or unordered_map) if I had much choice (though I don't know what other constraints may apply).
You have two data items here, and you use one as the key part of the time, but the other as the key another part of the time. For that, you probably want something like a Boost Bimap or possibly Boost MultiIndex.
Here's the general idea using Bimap:
#include <boost/bimap.hpp>
#include <boost/bimap/list_of.hpp>
#include <iostream>
#define elements(array) ((sizeof(array)/sizeof(array[0])))
class uint_proxy {
unsigned value;
public:
uint_proxy() : value(0) {}
uint_proxy& operator++() { ++value; return *this; }
unsigned operator++(int) { return value++; }
operator unsigned() const { return value; }
};
int main() {
int b[]={2,4,3,5,2,6,6,3,6,4};
boost::bimap<int, boost::bimaps::list_of<uint_proxy> > a;
// walk through array, counting how often each number occurs:
for (int i=0; i<elements(b); i++)
++a.left[b[i]];
// print out the most frequent:
std::cout << a.right.rbegin()->second;
}
For the moment, I've only printed out the most frequent number, but iterating N times to print out the N most frequent is pretty trivial.
If you are just interested in the top N most frequent words, and you don't need it to be exact, then there is a very clever structure you can use. I heard of this by way of Udi Manber; it works as follows:
You create an array of N elements, where each element tracks a value and a count; you also keep a counter that indexes into this array. Additionally, you have a map from value to index into that array.
Every time you update your structure with a value (like a word from a stream of text), you first check your map to see if that value is already in your array; if it is, you increment the count for that value. If it is not, then you decrement the count of whatever element your counter is pointing at and then increment the counter.
This sounds simple, and nothing about the algorithm suggests that it will yield anything useful, but for typical real data it tends to do very well. Normally, if you wish to track the top N things, you might want to build this structure with a capacity of 10*N, since there will be a lot of empty values in it. Using the King James Bible as input, here is what this structure lists as the most frequent words (in no particular order):
0 : in
1 : And
2 : shall
3 : of
4 : that
5 : to
6 : he
7 : and
8 : the
9 : I
And here are the top ten most frequent words (in order):
0 : the , 62600
1 : and , 37820
2 : of , 34513
3 : to , 13497
4 : And , 12703
5 : in , 12216
6 : that , 11699
7 : he , 9447
8 : shall , 9335
9 : unto , 8912
You can see that it got 9 of the top 10 words correct, and it did so using space for only 50 elements. Depending on your use case, the savings on space here may be very useful. It is also very fast.
Here is the implementation of topN that I used, written in Go:
type Event string
type TopN struct {
events []Event
counts []int
current int
mapped map[Event]int
}
func makeTopN(N int) *TopN {
return &TopN{
counts: make([]int, N),
events: make([]Event, N),
current: 0,
mapped: make(map[Event]int, N),
}
}
func (t *TopN) RegisterEvent(e Event) {
if index, ok := t.mapped[e]; ok {
t.counts[index]++
} else {
if t.counts[t.current] == 0 {
t.counts[t.current] = 1
t.events[t.current] = e
t.mapped[e] = t.current
} else {
t.counts[t.current]--
if t.counts[t.current] == 0 {
delete(t.mapped, t.events[t.current])
}
}
}
t.current = (t.current + 1) % len(t.counts)
}
Given a file with a word on each line, calculate the n most frequent words.
...
I've searched the internet and read that a hash table and a priority queue might provide an algorithm with O(n)
If you meant that the two *n*s are the same, then no, this is not possible. However, if you just meant time linear in the size of the input file, then a trivial implementation with a hash table will do what you want.
There might be probabilistic approximate algorithms with sublinear memory.
I have a std::map that I'm using to store values for x and y coordinates. My data is very sparse, so I don't want to use arrays or vectors, which would result in a massive waste of memory. My data ranges from -250000 to 250000, but I'll only have a few thousand points at the most.
Currently I'm creating a std::string with the two coordinates (i.e. "12x45") and using it as a key. This doesn't seem like the best way to do it.
My other thoughts were to use an int64 and shove the two int32s into it and use it as a key.
Or to use a class with the two coordinates. What are the requirements on a class that is to be used as the key?
What is the best way to do this? I'd rather not use a map of maps.
Use std::pair<int32,int32> for the key:
std::map<std::pair<int,int>, int> myMap;
myMap[std::make_pair(10,20)] = 25;
std::cout << myMap[std::make_pair(10,20)] << std::endl;
I usually solve this kind of problem like this:
struct Point {
int x;
int y;
};
inline bool operator<(const Point& p1, const Point& p2) {
if (p1.x != p2.x) {
return p1.x < p2.x;
} else {
return p1.y < p2.y;
}
}
Boost has a map container that uses one or more indices.
Multi Index Map
What are the requirements on a class that is to be used as the key?
The map needs to be able to tell whether one key's value is less than another key's value: by default this means that (key1 < key2) must be a valid boolean expression, i.e. that the key type should implement the 'less than' operator.
The map template also accepts a comparison function object of type key_compare (supplied as a template parameter, with an instance optionally passed to the constructor), so the comparison can alternatively be implemented by this external function object instead of being baked into whatever type your key is.
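A minimal sketch of that alternative, ordering a plain Point struct through a separate comparator rather than an operator< on the type (the names are illustrative):

#include <map>
#include <tuple>

struct Point { int x, y; };

struct PointLess {
    bool operator()(const Point& a, const Point& b) const {
        // lexicographic: compare x first, then y
        return std::tie(a.x, a.y) < std::tie(b.x, b.y);
    }
};

std::map<Point, int, PointLess> values;  // the comparator replaces operator<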
This will stuff multiple integer keys into a large integer, in this case an _int64. It compares as an _int64, AKA long long (the ugliest type declaration ever; short short short short would only be slightly less elegant; 10 years ago it was called vlong, which was much better, so much for "progress"), so no comparison function is needed.
#define ULNG unsigned long
#define BYTE unsigned char
#define LLNG long long
#define ULLNG unsigned long long
// --------------------------------------------------------------------------
ULLNG PackGUID(ULNG SN, ULNG PID, BYTE NodeId) {
    ULLNG CompKey = 0;
    PID = (PID << 8) + NodeId;         // pack the node id into the low byte
    CompKey = ((ULLNG)SN << 32) + PID; // serial number in the high 32 bits
    return CompKey;
}
Having provided this answer, I doubt this is going to work for you, as you need two separate and distinct keys to navigate with in 2 dimensions, X and Y.
On the other hand, if you already have the XY coordinate and just want to associate a value with that key, then this works spectacularly, because an _int64 compare takes the same time as any other integer compare on Intel x86 chips - 1 clock. In this case, the compare on this synthetic key is 3x as fast as on a triple compound key.
If using this to create a sparsely populated spreadsheet, I would recommend using 2 distinct trees, one nested inside the other. Make the Y dimension "the boss", and search Y space first to resolution before proceeding to the X dimension. Spreadsheets are taller than they are wide, and you always want the first dimension of any compound key to have the largest number of unique values.
This arrangement would create a map for the Y dimension that has a map for the X dimension as its data. When you get to a leaf in the Y dimension, you start searching its X dimension for the column in the spreadsheet.
If you want to create a very powerful spreadsheet system, add a Z dimension in the same way and use that for, as an example, organizational units. This is the basis for a very powerful budgeting/forecasting/accounting system, one which allows admin units to have lots of gory detail accounts to track admin expenses and such, without having those accounts take up space for line units, which have their own kinds of detail to track.
I think for your use case, std::pair, as suggested in David Norman's answer, is the best solution. However, since C++11 you can also use std::tuple. Tuples are useful if you have more than two keys, for example if you have 3D coordinates (i.e. x, y, and z). Then you don't have to nest pairs or define a comparator for a struct. But for your specific use case, the code could be written as follows:
#include <iostream>
#include <map>
#include <tuple>

int main() {
using tup_t = std::tuple<int, int>;
std::map<tup_t, int> m;
m[std::make_tuple(78, 26)] = 476;
tup_t t = { 12, 45 }; m[t] = 102;
for (auto const &kv : m)
std::cout << "{ " << std::get<0>(kv.first) << ", "
<< std::get<1>(kv.first) << " } => " << kv.second << std::endl;
return 0;
}
Output:
{ 12, 45 } => 102
{ 78, 26 } => 476
Note: Since C++17, working with tuples has become easier, especially if you want to access multiple elements simultaneously.
For example, if you use structured binding, you can print the tuple as follows:
for (auto const &[k, v] : m) {
auto [x, y] = k;
std::cout << "{ " << x << ", " << y << " } => " << v << std::endl;
}
Code on Coliru
Use std::pair. Better yet, use QHash<QPair<int,int>,int> if you have many such mappings.
Hope you will find it useful:
map<int, map<int, int>> troyka = { {4, {{5,6}} } };
troyka[4][5] = 7;
An alternative to the top answer that is slightly less performant but allows for easier indexing:
std::map<int, std::map<int,int>> myMap;
myMap[10][20] = 25;
std::cout << myMap[10][20] << std::endl;
First and foremost, ditch the string and use 2 ints, which you may well have done by now. Kudos for figuring out that a tree is the best way to implement a sparse matrix; sparse matrices seem to be a magnet for bad implementations.
FYI, a triple compound key works too, and I assume a pair of pairs as well.
It makes for some ugly subscripting though, so a little macro magic will make your life easier. I left this one general purpose, but type-casting the arguments in the macro is a good idea if you create macros for specific maps. TresKeys12 is tested and running fine. QuadKeys should also work.
NOTE: As long as your key parts are basic data types you DON'T need to write anything more. AKA, no need to fret about comparison functions. The STL has you covered. Just code it up and let it rip.
using namespace std; // save some typing
#define DosKeys(x,y)      std::make_pair(x,y)
#define TresKeys12(x,y,z) std::make_pair(x,std::make_pair(y,z))
#define TresKeys21(x,y,z) std::make_pair(std::make_pair(x,y),z)
#define QuadKeys(w,x,y,z) std::make_pair(std::make_pair(w,x),std::make_pair(y,z))
map<pair<INT, pair<ULLNG, ULLNG>>, pIC_MESSAGE> MapMe;
MapMe[TresKeys12(Part1, Part2, Part3)] = new fooObject;
If someone wants to impress me, show me how to make a comparison operator for TresKeys that doesn't rely on nesting pairs, so I can use a single struct with 3 members and a comparison function.
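For what it's worth, a minimal sketch of exactly that, using std::tie to get lexicographic ordering over a single 3-member struct (the struct name is made up):

#include <tuple>

struct TresKey {
    int x, y, z;
};

inline bool operator<(const TresKey& a, const TresKey& b) {
    // std::tie builds tuples of references; tuples compare lexicographically
    return std::tie(a.x, a.y, a.z) < std::tie(b.x, b.y, b.z);
}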
PS: TresKeys12 gave me problems with a map declared as pair<pair<x,y>,z>, since it produces pair<x,pair<y,z>>, and those two don't play nice. Not a problem for DosKeys or QuadKeys. If it's a hot summer Friday though, you may find an unexpected side-effect of typing in DosEquis ... err.. DosKeys a bunch of times is a thirst for Mexican beer. Caveat emptor. As Sheldon Cooper says, "What's life without whimsy?".