Is there a way to condense a vector (C++)?

I have a sparsely populated vector that I populated via hashing, so elements are scattered randomly in the vector. Now what I want to do is iterate over every element in that vector. What I had in mind was essentially condensing the vector to fit the number of elements present, removing any empty spaces. Is there a way I can do this?

Either you save the additional information you need while inserting the elements (e.g. links to the previous/next element, as in a linked list), or you make one pass over all the elements and remove the unnecessary ones.
The first solution costs you some space (approx. 8 bytes / entry), the second costs you one pass over all elements. Depending on the scenario, one or both possibilities might not be useful.
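The second option (one pass that drops the empty slots) might look roughly like the sketch below. It assumes C++17 and that an empty slot is represented by std::nullopt; the Slot alias and the condense name are made up for illustration and do not come from the question.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Hypothetical slot type: an unused hash slot is std::nullopt.
using Slot = std::optional<int>;

// Drops every empty slot in a single pass and keeps the remaining
// elements in their original relative order (erase-remove idiom).
void condense(std::vector<Slot>& table) {
    table.erase(std::remove_if(table.begin(), table.end(),
                               [](const Slot& s) { return !s.has_value(); }),
                table.end());
    table.shrink_to_fit();  // non-binding request to release spare capacity
}
```

Note that after condensing, positions no longer correspond to hash values, so the result is only good for iteration, not for hashed lookup.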

You can condense using a version of run-length encoding.
You go over the original vector and create a new "condensed" vector which contains alternating values - a value from the original and a count of the empty spaces to the next value. For example this:
3 - - - - 4 - - 7 3 - - - 9 -
turns to this:
3 4 4 2 7 0 3 3 9 1
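A minimal sketch of that encoding, again assuming empty slots are std::nullopt (C++17). The function name is hypothetical, and gaps before the first value are simply dropped here, which the description above does not specify.

```cpp
#include <optional>
#include <vector>

// Produces alternating (value, number of empty slots that follow it) entries.
// Assumption of this sketch: empty slots before the first value are ignored.
std::vector<int> condense_rle(const std::vector<std::optional<int>>& sparse) {
    std::vector<int> out;
    for (const auto& slot : sparse) {
        if (slot) {
            out.push_back(*slot);  // the value itself
            out.push_back(0);      // gap counter for this value
        } else if (!out.empty()) {
            ++out.back();          // one more empty slot after the last value
        }
    }
    return out;
}
```

Fed the example above, this yields 3 4 4 2 7 0 3 3 9 1.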

Related

removing elements from a vector with O(1) runtime

"Write a function which takes as an input an object of vector type
and removes the element at rank k in constant time, O(1). Assume that the order of elements does not matter."
I thought I had an idea about this. But when I tried using .erase(), I looked up its big-O complexity and found out it is O(n), i.e. linear. I can't think of any other way at the moment. I don't want any code, but I think pseudocode would at least point me in the right direction, if anyone can help.
Assume that the order of elements does not matter.
This is what you need to pay attention to.
Suppose you have a vector
0 1 2 3 4 5 6
and you want to remove the 3. You can turn this into
0 1 2 6 4 5
in O(1) without any issues.
Actually, there is a way to do it. Here is the pseudocode:
If the element you are trying to remove is the last element in the vector, remove it, done.
Read the last element of the vector and write it over the element-to-be-removed.
Remove the last element of the vector.
You can swap and pop_back in constant time.
std::swap(vec.back(), vec[rank]);
vec.pop_back();
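Wrapped into a function, the same idea could look like the sketch below. The name unordered_erase and the choice to silently ignore an out-of-range rank are assumptions made for illustration.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Removes the element at index `rank` in O(1) by overwriting it with the
// last element and then shrinking the vector. Element order is not preserved.
template <typename T>
void unordered_erase(std::vector<T>& vec, std::size_t rank) {
    if (rank >= vec.size()) return;         // assumption: ignore bad ranks
    if (rank != vec.size() - 1) {
        vec[rank] = std::move(vec.back());  // fill the hole with the last element
    }
    vec.pop_back();                         // drop the now-redundant last slot
}
```

Moving instead of swapping avoids one extra assignment, since the removed value is discarded anyway.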

Data structure with fast contiguous ranges retrieval

Imagine a data structure that manages some contiguous container and allows quick retrieval of the contiguous ranges of indices within this array that contain data (and probably the free ranges too). Let's call these ranges "blocks". Each block knows its head and tail index:
struct Block
{
    size_t begin;
    size_t end;
};
When we manipulate the array, our data structure updates the blocks:
array view / operation     blocks [begin, end]
--------------------------------------------------------------
0 1 2 3 4 5 6 7 8 9        [0, 9]
pop 2                      block 1 split
0 1 _ 3 4 5 6 7 8 9        [0, 1] [3, 9]
pop 7, 8                   block 2 split
0 1 _ 3 4 5 6 _ _ 9        [0, 1] [3, 6] [9, 9]
push 7                     changed end of block 2
0 1 _ 3 4 5 6 7 _ 9        [0, 1] [3, 7] [9, 9]
push 5                     error: already in
0 1 _ 3 4 5 6 7 _ 9        [0, 1] [3, 7] [9, 9]
push 2                     blocks 1, 2 merged
0 1 2 3 4 5 6 7 _ 9        [0, 7] [9, 9]
Even before profiling, we know that block retrieval speed will be the cornerstone of application performance.
Basically usage is:
very often retrieval of contiguous blocks
quite rare insertions/deletions
most time we want number of blocks be minimal (prevent fragmentation)
What we have already tried:
std::vector<bool> + std::list<Block*>. On every change: write true/false to vector, then traverse it in for loop and re-generate list. On every query of blocks return list. Slower than we wanted.
std::list<Block*> update list directly, so no traversing. Return list. Much code to debug/test.
Questions:
Does this data structure have a generic name?
Are there existing implementations of such a data structure (debugged and tested)?
If not, what can you advise for a fast and robust implementation of such a data structure?
Sorry if my explanation is not quite clear.
Edit
A typical application for this container is managing buffers: either system or GPU memory. In the case of the GPU, we can store huge amounts of data in a single vertex buffer and then update/invalidate some regions. On each draw call we must know the first and last index of each valid block in the buffer to draw (very often, tens to hundreds of times per second), and sometimes (once a second) we must insert/remove blocks of data.
Another application is a custom "block memory allocator". For that purpose, similar data structure implemented in "Alexandrescu A. - Modern C++ Design" book via intrusive linked list. I'm looking for better options.
What I see here is a simple binary tree.
You have pairs (blocks) with begin and end indices, that is, pairs (a,b) where a <= b. So the set of blocks can easily be ordered and stored in a binary search tree.
Searching for the block which corresponds to a given number is easy (just the typical binary tree search). So when you delete a number from the array, you need to search for the block that corresponds to the number and split it into two new blocks. Note that all blocks are leaves; the internal nodes are the intervals which their two child nodes form.
Insertion, on the other hand, means searching for the block and testing its siblings to see whether they have to be collapsed. This should be done recursively up through the tree.
You may want to try a tree like structure, either a simple red-black tree or a B+ tree.
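As a rough sketch of the tree-based idea, the blocks can be kept in a std::map (typically implemented as a red-black tree) keyed by each block's begin index, with the inclusive end index as the value. The class name BlockSet and its push/pop interface are invented here to mirror the question's example; they are not taken from an existing library.

```cpp
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical interval set: key = block begin, value = inclusive block end.
class BlockSet {
public:
    // Marks index i as occupied. Returns false if i was already occupied.
    bool push(std::size_t i) {
        auto next = blocks_.upper_bound(i);  // first block starting after i
        auto prev = (next == blocks_.begin()) ? blocks_.end() : std::prev(next);
        if (prev != blocks_.end() && i <= prev->second) return false;
        const bool joins_prev = prev != blocks_.end() && prev->second + 1 == i;
        const bool joins_next = next != blocks_.end() && next->first == i + 1;
        if (joins_prev && joins_next) {      // i fills a one-element hole: merge
            prev->second = next->second;
            blocks_.erase(next);
        } else if (joins_prev) {             // extend the previous block's end
            prev->second = i;
        } else if (joins_next) {             // extend the next block's begin
            const std::size_t end = next->second;
            blocks_.erase(next);
            blocks_.emplace(i, end);
        } else {                             // isolated single-element block
            blocks_.emplace(i, i);
        }
        return true;
    }

    // Marks index i as free. Returns false if i was not occupied.
    bool pop(std::size_t i) {
        auto next = blocks_.upper_bound(i);
        if (next == blocks_.begin()) return false;
        auto it = std::prev(next);           // last block starting at or before i
        if (i > it->second) return false;    // i lies in a hole
        const std::size_t begin = it->first;
        const std::size_t end = it->second;
        blocks_.erase(it);
        if (begin < i) blocks_.emplace(begin, i - 1);  // left remainder
        if (i < end) blocks_.emplace(i + 1, end);      // right remainder
        return true;
    }

    // Blocks in ascending order of their begin index.
    const std::map<std::size_t, std::size_t>& blocks() const { return blocks_; }

private:
    std::map<std::size_t, std::size_t> blocks_;
};
```

Both operations are O(log n) in the number of blocks. Since retrieval is far more frequent than modification in the described use case, it may be worth copying the blocks into a contiguous vector after each change and handing that out on queries.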
Your first solution (vector of bools + list of blocks) seems like a good direction, but note that you don't need to regenerate the list completely from scratch (or go over the entire vector) - you just need to traverse the list until you find where the newly changed index should be fixed, and split/merge the appropriate blocks on the list.
If the list traversal proves too long, you could instead implement a vector of blocks, where each block is mapped to its start index, and each hole has a block saying where the hole ends. You can traverse this vector as fast as a list, since you always jump to the next block (one O(1) lookup to determine the end of the block, another O(1) lookup to determine the beginning of the next block). The benefit, however, is that you can also access indices directly (for push/pop) and figure out their enclosing block with a binary search.
To make it work, you'll have to do some maintenance work on the "holes" (merge and split them like real blocks), but that should also be O(1) on any insertion/deletion. The important part is that there's always a single hole between blocks, and vice-versa
Why are you using a list of blocks? Do you need stable iterators AND stable references? boost::stable_vector may help. If you don't need stable references, maybe what you want is to write a wrapper container that contains a std::vector of blocks, a secondary map of size blocks.capacity() from iterator index (which is kept inside returned iterators) to the real offset in the blocks vector, and a list of currently unused iterator indices.
Whenever you erase members from blocks, you repack blocks and shuffle the map accordingly for increased cache coherence, and when you want to insert, just push_back to blocks.
With block packing, you get cache coherence when iterating at the cost of deletion speed. And maintain relatively fast insert times.
Alternatively, if you need stable references and iterators, or if the size of the container is very large, at the cost of some access speed, iteration speed, and cache coherency, you wrap each entry in the vector in a simple structure that contains the real entry and an offset to the next valid, or just store pointers in the vector and have them at null on deletion.

Listing specific subsets using STL

Say I have a range of numbers, say {2,3,4,5}, stored in this order in a std::vector v, and I want to list all possible subsets which end with 5 using the STL, that is:
2 3 4 5
2 3 5
2 4 5
3 4 5
2 5
3 5
4 5
5
( I hope i don't forget any:) )
I tried using while(next_permutation(v.begin(),v.end())) but didn't come up with the wanted result :)
Does anyone have an idea?
PS : those who have done the archives of google code jam 2010 may recognize this :)
Let's focus on the problem of printing all subsets. As you know, a vector of n elements has 2^n possible subsets. It's no coincidence that an n-bit integer can hold 2^n distinct values. If you consider each integer as a vector of bits, then iterating over all possible values gives all possible subsets of bits. So we get the subsets for free by iterating over an integer!
Assuming vector has not more than 32 elements (over 4 billion possible subsets!), this piece of code will print all subset of vector v (excluding empty one):
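// Needs <cstdint>, <iostream> and <vector>.
// A 64-bit mask keeps the shift well defined even when v.size() == 32.
for (std::uint64_t mask = 1; mask < (std::uint64_t{1} << v.size()); ++mask)
{
    std::vector<int>::const_iterator it = v.begin();
    // Walk the bits of the mask; bit i selects element i of v.
    for (std::uint64_t m = mask; m; (m >>= 1), ++it)
    {
        if (m & 1) std::cout << *it << " ";
    }
    std::cout << std::endl;
}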
I just create all possible bit masks for size of vector, and iterate through every bit; if it's set, I print appropriate element.
Now applying the rule of ending with some specific number is piece of cake (by checking additional condition while looping through masks). Preferably, if there is only one 5 in your vector, you could swap it to the end and print all subsets of vector without last element.
I'm effectively using std::vector, const_iterator and std::cout, so you might think about it as being solved using STL. If I come up with something more STLish, I'll let you know (well, but how, it's just iterating). You can use this function as a benchmark for your STL solutions though ;-)
EDIT: As pointed out by Jørgen Fogh, it doesn't solve your subset blues if you want to operate on large vectors. Actually, if you would like to print all subsets for 32 elements it would generate terabytes of data. You could use 64-bit integer if you feel limited by constant 32, but you wouldn't even end iterating through all the numbers. If your problem is just answering how many are desired subsets, you definitely need another approach. And STL won't be much helpful also ;-)
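To illustrate the "swap it to the end" variant mentioned above, here is a small sketch that collects every subset containing the last element of v, keeping the original order. The function name and the decision to return the subsets rather than print them are choices made for this example only, as is the cap of 32 elements.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns all subsets of v that end with v.back().
// Assumption of this sketch: inputs with more than 32 elements are rejected,
// since enumerating them is impractical anyway.
std::vector<std::vector<int>> subsets_ending_with_last(const std::vector<int>& v) {
    std::vector<std::vector<int>> result;
    if (v.empty() || v.size() > 32) return result;
    const std::size_t rest = v.size() - 1;  // elements that are optional
    for (std::uint64_t mask = 0; mask < (std::uint64_t{1} << rest); ++mask) {
        std::vector<int> subset;
        for (std::size_t bit = 0; bit < rest; ++bit) {
            if (mask & (std::uint64_t{1} << bit)) subset.push_back(v[bit]);
        }
        subset.push_back(v.back());         // the mandatory last element
        result.push_back(std::move(subset));
    }
    return result;
}
```

For {2,3,4,5} this produces the 8 subsets listed in the question, starting with {5} (mask 0) and ending with {2,3,4,5} (mask 7).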
As you can use any container I would use std::set because it is next to what we want to represent.
Now your task is to find all subsets ending with 5 so we take our initial set and remove 5 from it.
Now we want to have all subsets of this new set and append 5 to them at the end.
void subsets(std::set<std::set<int>> &sets, std::set<int> initial)
{
    if(initial.empty())
        return;
    sets.insert(initial);//save the current set in the set of sets
    std::set<int>::iterator i = initial.begin();
    for(; i != initial.end(); i++)//for each item in the set
    {
        std::set<int> new_set(initial);//copy the set
        new_set.erase(new_set.find(*i));//remove the current item
        subsets(sets, new_set);//recursion ...
    }
}
sets is a set that contains all subsets you want.
initial is the set that you want to have the subsets of.
Finally call this with subsets(all_subsets, initial_list_without_5);
This should create the subsets and finally you can append 5 to all of them. Btw don't forget the empty set :)
Also note that creating and erasing all these sets is not very efficient. If you want it faster the final set should get pointers to sets and new_set should be allocated dynamically...
tomasz describes a solution which is workable as long as n <= 32, although it will take a very long time to print 2^32 different subsets. Since the bounds for the large dataset are 2 <= n <= 500, generating all the subsets is definitely not the way to go. You need to come up with some clever way to avoid having to generate them. In fact, this is the whole point of the problem.
You can probably find solutions by googling the problem if you want. My hint is that you need to look at the structure of the sets and avoid generating them at all. You should only calculate how many there are.
use permutation to create a vector of vectors. Then use std::partition with a function to sort it into the vectors that end with 5 and those that don't.

remove elements: but which container to prefer

I am keeping the nonzeros of a sparse matrix representation in some triplets, known in the numerical community as Compressed Sparse Row storage, entries are stored row-wise, for instance a 4x4 matrix is represented as
r:0 0 1 1 2 2 3 3 3
c:0 3 2 3 2 3 1 2 3
v:1 5 2 2 4 1 5 4 5
so 'r' gives row indices, 'c' gives column indices and 'v' are the values associated to the 2 indices above that value.
I would like to delete some rows and columns from my matrix representation, say rows and columns 1 and 3. So I should remove the 1s and 3s from the 'r' and 'c' arrays. I am also trying to learn more about the performance of the STL containers. As a first try, I created a multimap and deleted the items by looping over them with multimap's find method. This removes the found keys but might leave some of the searched values in the 'c' array, so I then swapped the key/value pairs and did the same operation on this second map. This did not seem like a very good solution to me, although it seems to be pretty fast (on a problem with 50000 entries). So the question is: what would be the most efficient way to do this with standard containers?
You could use a map (between a pair of rows and columns) and the value, something like map<pair<int,int>, int>
If you then want to delete a row, you iterate over the elements and erase those with the to-be deleted row. The same can be done for columns.
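A sketch of this map-based approach follows; the alias SparseMap and the two helper names are hypothetical. Because std::pair keys are ordered by row first, a whole row is a contiguous key range and can be erased without scanning the map, whereas a column still needs a full pass.

```cpp
#include <limits>
#include <map>
#include <utility>

// Hypothetical alias: (row, column) -> value.
using SparseMap = std::map<std::pair<int, int>, int>;

// Erases all entries of one row. Keys sort by (row, column), so the row
// occupies the contiguous key range [(row, INT_MIN), (row, INT_MAX)].
void erase_row(SparseMap& m, int row) {
    m.erase(m.lower_bound({row, std::numeric_limits<int>::min()}),
            m.upper_bound({row, std::numeric_limits<int>::max()}));
}

// Erases all entries of one column by visiting every element.
void erase_col(SparseMap& m, int col) {
    for (auto it = m.begin(); it != m.end();) {
        if (it->first.second == col) {
            it = m.erase(it);  // erase returns the iterator to the next element
        } else {
            ++it;
        }
    }
}
```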
How are you accessing the matrix? Do you look up particular rows/columns and do things with them that way, or do you use the whole matrix at a time for operations like matrix-vector multiplications or factorization routines? If you're not normally indexing by row/column, then it may be more efficient to store your data in std::vector containers.
Your deletion operation is then a matter of iterating straight through the container, sliding down subsequent elements in place of the entries you wish to delete. Obviously, there are tradeoffs involved here. Your map/multimap approach will take something like O(k log n) time to delete k entries, but whole-matrix operations in that representation will be very inefficient (though hopefully still O(n) and not O(n log n)).
Using the array representation, deleting a single row or column would take O(n) time, but you could delete an arbitrary number of rows or columns in the same single pass, by keeping their indices in a pair of hash tables or splay trees and doing a lookup for each entry. After the deletion scan, you could either resize the vectors down to the number of elements you have left, which saves memory but might entail a copy, or just keep an explicit count of how many entries are valid, trading dead memory for saving time.
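The single-pass deletion on the array representation could be sketched as follows. The Triplets struct and the function name are assumptions for illustration, std::unordered_set stands in for the hash tables mentioned above, and the indices of the surviving rows and columns are not renumbered.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical holder for the three parallel arrays from the question.
struct Triplets {
    std::vector<int> r;     // row index of each nonzero
    std::vector<int> c;     // column index of each nonzero
    std::vector<double> v;  // value of each nonzero
};

// Removes every entry whose row is in `rows` or whose column is in `cols`,
// sliding the surviving entries down in place during one scan.
void drop_rows_cols(Triplets& m,
                    const std::unordered_set<int>& rows,
                    const std::unordered_set<int>& cols) {
    std::size_t kept = 0;
    for (std::size_t i = 0; i < m.r.size(); ++i) {
        if (rows.count(m.r[i]) != 0 || cols.count(m.c[i]) != 0) continue;
        m.r[kept] = m.r[i];
        m.c[kept] = m.c[i];
        m.v[kept] = m.v[i];
        ++kept;
    }
    // Shrink to the surviving entries; capacity is left untouched.
    m.r.resize(kept);
    m.c.resize(kept);
    m.v.resize(kept);
}
```

For the 4x4 example in the question, deleting rows and columns 1 and 3 leaves only the entries (0,0) = 1 and (2,2) = 4.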

Ideas for specific data structure in c++

I need to do some task.
There are numbers given in two rows, and they act like pairs of integers (a, b). I have to find the 5 largest numbers of the a-row and then, among those 5 pairs, select the one with the maximum in the b-row. Example:
1 4
5 2
3 3
7 5
6 6
2 9
3 1
In this example, the pair i need is (6,6) because 6 (a) is in the top 5 of the a[i] numbers and 6 (b) is the maximum in the b section of those 5 pairs.
I was thinking of doing this with vectors and my own defined structures, also use some temp arrays but i don't know if that's the right thing to do maybe there is simpler way to do this.
Any ideas ?
EDIT: I also need the index number of the pair (in this case 5, since it is the fifth pair).
A priority queue holding pairs that does its order evaluations based on the first element of the pair would be appropriate. You could insert all the pairs and then extract the top 5. Then just iterate on that list of pairs looking for the max of the second element of each pair.
edit
I should say that it is a decent solution only if you can accept a runtime on the order of O(n * lg n)
Alternate approaches:
Push triples (first, second, index) into a vector and then std::partial_sort the first 5 items using a descending-order functor on the first element. Then use std::max_element with a second functor to find the max of the second element among those 5, and grab its index. partial_sort costs roughly O(n log k) for the top k items, so with k fixed at 5 this should run in linear time, O(n), rather than the O(n log n) of a full sort.
Similarly, have the vector only contain pairs and sort a second vector containing indexes into the first one.
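The partial_sort approach might look like the sketch below (C++17). The Entry struct, the function name, and the use of 1-based indices are assumptions chosen to match the question's example.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical triple: the pair (a, b) plus its 1-based position in the input.
struct Entry {
    int a;
    int b;
    std::size_t index;
};

// Among the k entries with the largest a, returns the one with the largest b.
// Takes the vector by value because partial_sort reorders it.
std::optional<Entry> best_of_top(std::vector<Entry> entries, std::size_t k = 5) {
    if (entries.empty() || k == 0) return std::nullopt;
    k = std::min(k, entries.size());
    const auto middle = entries.begin() + static_cast<std::ptrdiff_t>(k);
    // Move the k entries with the largest a to the front, in descending order.
    std::partial_sort(entries.begin(), middle, entries.end(),
                      [](const Entry& x, const Entry& y) { return x.a > y.a; });
    // Pick the largest b among those k entries.
    return *std::max_element(entries.begin(), middle,
                             [](const Entry& x, const Entry& y) { return x.b < y.b; });
}
```

With the seven pairs from the question this selects (6, 6) at index 5. How ties in a are broken at the cut-off is left unspecified in this sketch.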