How to efficiently add and remove vector values in C++

I am trying to efficiently calculate the averages from a vector.
I have a matrix (vector of vectors) where:
row: the days I am going back (250)
column: the types of things I am calculating the average of (10,000 different things)
Currently I am using .push_back() to append the newest value and then erase() to remove the oldest value from the front. Since erasing at the front shifts every remaining element, and this happens for each of the 10,000 series, my code is very slow.
I am thinking of a method based on substitution (overwriting in place), but I have a hard time implementing the idea, because the values have an order (i.e. I need to remove the oldest value, and the value I add / substitute will be the newest).
Below is my code so far.
Any ideas for a solution or guides for the right direction will be much appreciated.
//declaration and initialization: 250 rows (days), each with 10,000 columns (series)
vector<vector<float>> vectorOne(250, vector<float>(10000, 0));
//This is the slow method
vectorOne[column].push_back(1); //add newest value
vectorOne[column].erase(vectorOne[column].begin()); //remove oldest value: O(n), shifts every element

You probably need a different data structure.
The problem sounds like a queue. You add to the end and take from the front. With real queues, everyone then shuffles up a step. With computer queues, we can use a circular buffer (you do need to be able to get a reasonable bound on maximum queue length).
I suggest implementing your own on top of a plain C array first, then using the STL version when you've understood the principle.
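For the rolling-average use case in the question, a circular buffer pairs nicely with a running sum, so each update is O(1) and the average never has to touch the other 249 values. Here is a minimal sketch; the RollingWindow name and interface are mine, not a standard type:

#include <cstddef>
#include <vector>

// A fixed-capacity circular buffer that maintains a running sum, so pushing
// a new value, evicting the oldest and reading the average are all O(1).
class RollingWindow {
public:
    explicit RollingWindow(std::size_t capacity) : buf_(capacity, 0.0f) {}

    void push(float value) {
        sum_ += value - buf_[head_];        // swap the oldest value out of the sum
        buf_[head_] = value;                // overwrite the oldest slot in place
        head_ = (head_ + 1) % buf_.size();
        if (count_ < buf_.size()) ++count_;
    }

    float average() const { return count_ ? sum_ / count_ : 0.0f; }

private:
    std::vector<float> buf_;
    std::size_t head_ = 0;   // oldest element, i.e. the next slot to overwrite
    std::size_t count_ = 0;  // number of valid values, until the buffer fills
    float sum_ = 0.0f;       // running sum of the values currently in the window
};

One such window per series, e.g. std::vector<RollingWindow> windows(10000, RollingWindow(250));, would replace the vector-of-vectors entirely.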

Related

Record all optimal sequence alignments when calculating Levenshtein distance in Julia

I'm working on the Levenshtein distance with the Wagner–Fischer algorithm in Julia.
It is easy to get the optimal value, but a little harder to get the optimal operation sequence (insertions, deletions, substitutions) while backtracking from the bottom-right corner of the matrix.
I can record the pointer information for each d[i][j], but a cell may point back in up to 3 directions: d[i-1][j-1] for substitution, d[i-1][j] for deletion and d[i][j-1] for insertion. So I'm trying to get all combinations of operation sets that give the optimal Levenshtein distance.
It seems that I can store one operation set in one array, but I don't know the total number of combinations or their lengths in advance, so it is hard to define an array to store the operation sets during enumeration. How can I generate new arrays while keeping the former ones? Or should I use a DataFrame?
If you implement the Wagner-Fischer algorithm, at some point you choose the minimum over three alternatives (see the Wikipedia pseudo-code). At that point, you save the chosen alternative in another matrix, using a statement like:
c[i,j] = indmin([d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+cost])
# indmin returns the index of the minimum element in a collection;
# cost is 0 if the current characters match and 1 otherwise.
Now c[i,j] contains 1, 2 or 3, corresponding to deletion, insertion or substitution respectively.
At the end of the calculation you have the final d matrix element achieving the minimum distance; you then follow the c matrix backwards and read off the action at each step. Keeping track of i and j lets you recover the exact substitution, by looking at which element was in string1 at i and in string2 at j at the current step. A matrix like c cannot be avoided, because at the end of the algorithm the information about the intermediate choices (made by min) would otherwise be lost.
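As a rough illustration of that backtrack, here is a sketch in C++ (rather than Julia) that recovers a single optimal sequence; it assumes d and c were filled in as described above, with c[i][j] holding 1, 2 or 3:

#include <cstddef>
#include <string>
#include <vector>

// Walk the direction matrix c from (i, j) back to (0, 0), collecting one
// optimal edit sequence: 'D' = deletion, 'I' = insertion, 'S' = substitution.
std::string traceback(const std::vector<std::vector<int>>& c,
                      std::size_t i, std::size_t j) {
    std::string ops;                                 // collected back-to-front
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 && c[i][j] == 3)         { ops += 'S'; --i; --j; }
        else if (i > 0 && (j == 0 || c[i][j] == 1)) { ops += 'D'; --i; }
        else                                        { ops += 'I'; --j; }
    }
    return std::string(ops.rbegin(), ops.rend());    // reverse into forward order
}

To enumerate all optimal alignments, as the question asks, store at each cell the set of directions that attain the minimum rather than a single index, and branch recursively on each stored direction during the backtrack.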
I'm not sure that I got your question, but anyway: vectors in Julia are dynamic data structures, so you are always able to grow them using the appropriate functions, e.g. push!(), append!(), prepend!(); it is also possible to reshape() the resulting vector into an array of the desired size.
But one particular approach for the above case could be built on a sparse() matrix:
import Base.zero
Base.zero(::Type{ASCIIString}) = ""

module GSparse
export insertion, deletion, substitute, result

s = sparse(ASCIIString[])

# deletion: grow by one column only
deletion(newval::ASCIIString) = begin
    global s
    s.n += 1
    push!(s.colptr, last(s.colptr))
    s[s.m, s.n] = newval
end

# insertion: grow by one row only
insertion(newval::ASCIIString) = begin
    global s
    s.m += 1
    s[s.m, s.n] = newval
end

# substitution: grow by one row and one column
substitute(newval::ASCIIString) = begin
    global s
    s.n += 1
    s.m += 1
    push!(s.colptr, last(s.colptr))
    s[s.m, s.n] = newval
end

# collect the stored strings in reverse fill order
result() = begin
    global s
    ret = zeros(ASCIIString, size(s))
    le = length(s)
    for i = 1:le
        ret[le - i + 1] = s[i]
    end
    ret
end
end
using GSparse
insertion("test");
insertion("testo");
insertion("testok");
substitute("1estok");
deletion("1stok");
result()
I like the approach because for large texts you could have many zero elements. Also, I fill the data structure in a forward way and create the result by reversing.

"Tricks" to filling a "rolling" time-series array absent brute-force pushback of all values each iteration

My application is financial, in C++, using Visual Studio 2003.
I'm trying to maintain an array of the last (x) values for an observation, and as each new value arrives, I have a loop that first pushes all of the other values back and then adds the new value at the front.
It's computationally intensive, and I've been trying to be clever and come up with some way around this. I've probably either stated an oxymoronic problem or reached the limit of my intellect, or both.
Here's an idea I have:
Suppose it's 60 seconds of data, and new data arrive each second. Suppose we keep an integer between 0 and 59 that serves to index an element of the array. Each second, when the data arrives, we first increment the integer (wrapping 59 back to 0) and then overwrite the element of the array at that index with the new data. Then, in our calculations, we refer to the same integer as the base, work backwards to zero, then wrap to 59 and continue back down. The formulas in the math would be a bit more tedious to write. But my application does a lot of these push-back/fills of arrays, each second for several data points, with each array holding 3600 elements per data series (one hour of seconds).
Does the board think this is a good idea? Or am I being silly here?
What you're describing is nothing more than a circular buffer. There's an implementation in Boost, and probably in other libraries as well, and a good description of the algorithm on Wikipedia (http://en.wikipedia.org/wiki/Circular_buffer). And yes, it's a good solution for the problem you describe.
You could use modulo as you suggested (hint: x % y is the syntax for "x" modulo "y"), or you could maintain two buffers and swap which one holds the current data to be read and which holds the stale data to be overwritten. For copying large quantities of plain-old-data (POD), you should take a look at memcpy. Of course, in quantitative trading, people will do all sorts of things to get a speed edge, including custom hardware that lets them bypass several layers of copying.
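To make the modulo hint concrete, here is a small sketch of the scheme the question describes; the names are illustrative:

#include <cstddef>

const std::size_t N = 3600;   // one hour of one-second samples
double samples[N] = {0};
std::size_t head = 0;         // index of the newest sample

void onNewSample(double x) {
    head = (head + 1) % N;    // advance the base index, wrapping 3599 -> 0
    samples[head] = x;        // overwrite the oldest sample in place
}

double secondsAgo(std::size_t k) {         // sample from k seconds ago, k < N
    return samples[(head + N - k) % N];    // + N keeps the index non-negative
}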
Are you sure you are talking about arrays? They don't have a "push" operation - it sounds more like a std::vector.
Anyway, here is a solution for what I think you want:
If I understood it right, you want a collection of 3600 elements where, each second, the oldest element drops off and a new element is added.
So you could use a linked-list queue for that task; both the push and the pop are performed in O(1).
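If you'd rather not manage indices yourself, the standard containers already give you O(1) amortized operations at both ends. A sketch of a fixed-size window on std::deque (a std::list-backed std::queue works too, at the cost of poorer locality):

#include <cstddef>
#include <deque>

std::deque<double> window;          // newest value at the back
const std::size_t kMaxLen = 3600;

void onNewValue(double x) {
    window.push_back(x);            // O(1) amortized
    if (window.size() > kMaxLen)
        window.pop_front();         // O(1): drop the oldest value
}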

3D-Grid of bins: nested std::vector vs std::unordered_map

Pros, I need some performance opinions on the following:
1st Question:
I want to store objects in a 3D grid structure. Overall it will be ~33% filled, i.e. 2 out of 3 grid points will be empty.
Maybe Option A)
vector<vector<vector<deque<Obj>>>> grid; // (SizeX, SizeY, SizeZ)
grid[x][y][z].push_back(someObj);
This way I'd have a lot of empty deques, but accessing one of them would be fast, wouldn't it?
The Other Option B) would be
std::unordered_map<Pos3D, deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
where I add & delete deques as data is added/deleted. Probably less memory used, but maybe slower? What do you think?
2nd Question (follow up)
What if I had multiple containers at each position? Say 3 buckets for 3 different entities, say object types ObjA, ObjB, ObjC per grid point, then my data essentially becomes 4D?
Using Option 1B I could just extend Pos3D to include the bucket number to account for even more sparse data.
Possible queries I want to optimize for:
- Give me all Objects out of ObjA-buckets from the entire structure
- Give me all Objects out of ObjB-buckets for a set of grid positions
- Which is the nearest non-empty ObjC-bucket to position x,y,z?
PS:
I had also thought about a tree-based data structure before, after reading about nearest-neighbour approaches. Since my data is so regular, I thought I'd skip all the tree-building subdivision of cells into smaller pieces and just make a static 3D grid of the final leaves. That's how I came to ask about the best way to store this grid here.
A question associated with this: if I have a map<int, Obj>, is there a fast way to ask for "all objects with keys between 780 and 790"? Or is the fastest way building the above-mentioned tree?
EDIT
I ended up going with a 3D boost::multi_array with Fortran ordering. It's a little like the chunks that games like Minecraft use, which in turn is a little like a kd-tree with a fixed leaf size and a fixed number of leaves. It works pretty fast now, so I'm happy with this approach.
Answer to 1st question
As @Joachim pointed out, this depends on whether you prefer fast access or small data. Roughly, this corresponds to your options A and B.
A) If you want fast access, go with a multidimensional std::vector, or a plain array if you will. std::vector brings easier maintenance at minimal overhead, so I'd prefer that. It consumes O(N^3) space, where N is the number of grid points along one dimension. To get the best performance when iterating over the data, make the innermost loop run over the innermost (last) index, so that memory is accessed sequentially.
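For concreteness, a traversal sketch where the innermost index varies fastest; Obj is replaced by int just to keep the example self-contained:

#include <cstddef>
#include <deque>
#include <vector>

using Grid = std::vector<std::vector<std::vector<std::deque<int>>>>;

long long countObjects(const Grid& grid) {
    long long n = 0;
    for (std::size_t x = 0; x < grid.size(); ++x)               // outermost index
        for (std::size_t y = 0; y < grid[x].size(); ++y)
            for (std::size_t z = 0; z < grid[x][y].size(); ++z) // varies fastest
                n += static_cast<long long>(grid[x][y][z].size());
    return n;
}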
B) If you instead wish to keep things as small as possible, use a hash map, and use one that is optimized for space. That results in O(M) space, with M being the number of stored elements. There are benchmarks comparing several hash maps; I have had good experiences with google::sparse_hash_map, which has the smallest constant overhead I have seen so far. Plus, it is easy to add to your build system.
If you need a mixture of speed and small data or don't know the size of each dimension in advance, use a hash map as well.
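A sketch of what option B can look like. Pos3D and Pos3DHash follow the names in the question, while the hash mixing itself is only an illustrative choice; with operator== defined, the default std::equal_to can stand in for a separate Pos3DEqual functor:

#include <cstddef>
#include <deque>
#include <functional>
#include <unordered_map>

struct Pos3D {
    int x, y, z;
    bool operator==(const Pos3D& o) const {
        return x == o.x && y == o.y && z == o.z;
    }
};

struct Pos3DHash {
    std::size_t operator()(const Pos3D& p) const {
        std::size_t h = std::hash<int>()(p.x);  // combine the three coordinates;
        h = h * 31 + std::hash<int>()(p.y);     // any reasonable mixer will do
        h = h * 31 + std::hash<int>()(p.z);
        return h;
    }
};

struct Obj {};  // stand-in for the question's object type

std::unordered_map<Pos3D, std::deque<Obj>, Pos3DHash> sparseGrid;
// sparseGrid[{1, 2, 3}].push_back(Obj{});  // the deque is created on first insert
// sparseGrid.erase({1, 2, 3});             // drop a bucket once it empties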
Answer to 2nd question
I'd say your data is 4D if you have a variable number of elements along the 4th dimension, or a fixed, large number of elements. With option 1B) you'd indeed add the bucket index; for 1A) you'd add another nested vector.
Which is the nearest non-empty ObjC-bucket to position x,y,z?
This operation is commonly called nearest-neighbor search. You want a kd-tree for that. There is libkdtree++ if you prefer small libraries; otherwise, FLANN might be an option. It is part of the Point Cloud Library, which accomplishes a lot of tasks on multidimensional data and could be worth a look as well.

Read a sparse matrix from a file

I'm using the Yale representation of a sparse matrix in a power iteration algorithm; everything goes well and fast.
But now I have a problem: my professor will send the sparse matrix in an unordered data file, and since the matrix is symmetric, only one of each pair of indices will be present.
The problem is that my implementation needs the elements to be inserted in order.
I tried some things to read the data and then insert it into my sparse matrix:
1) Using a dense matrix.
2) Using another sparse-matrix implementation, I tried with std::map.
3) A priority queue: I made an array of priority_queues and insert element (i,j) into priority_queue[i], so when I pop priority_queue[i] I get the lowest j-index of row i.
But I need something really fast and memory-efficient, because the largest matrix I'll use will be about 100k x 100k, and the approaches I tried were very slow, almost 200 times slower than the power iteration itself.
Any suggestions? Sorry for the poor English :(
The way many sparse loaders work is that you use an intermediate pure triples structure. I.e. whatever the file looks like, you load it into something like vector< tuple< row, column, value> >.
You then build the sparse structure from that. The reason is precisely what you're running into. Your sparse matrix structure likely has constraints, like you need to know the number of elements in each row/column, or the input needs to be sorted, etc. You can massage your triples array into whatever you need (i.e. by sorting it).
This also makes it trivial to solve your symmetry dilemma. For every triple in the source file, you insert both (row, column, value) and (column, row, value) into your intermediate structure.
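A sketch of the whole pipeline under some assumptions: 0-based indices, one "i j value" triple per line, and a known dimension n. The Yale/CSR construction at the end is the usual count-then-prefix-sum pass:

#include <algorithm>
#include <cstdio>
#include <tuple>
#include <vector>

struct Triple { int row, col; double val; };

void loadSymmetric(std::FILE* f, int n,
                   std::vector<int>& rowPtr,     // size n + 1 on return
                   std::vector<int>& colIdx,
                   std::vector<double>& vals) {
    std::vector<Triple> t;
    int i, j; double v;
    while (std::fscanf(f, "%d %d %lf", &i, &j, &v) == 3) {
        t.push_back({i, j, v});
        if (i != j) t.push_back({j, i, v});      // restore the mirrored entry
    }
    std::sort(t.begin(), t.end(), [](const Triple& a, const Triple& b) {
        return std::tie(a.row, a.col) < std::tie(b.row, b.col);
    });
    rowPtr.assign(n + 1, 0);
    for (const Triple& e : t) ++rowPtr[e.row + 1];            // count per row...
    for (int r = 0; r < n; ++r) rowPtr[r + 1] += rowPtr[r];   // ...then prefix-sum
    for (const Triple& e : t) {                 // already in row-major order
        colIdx.push_back(e.col);
        vals.push_back(e.val);
    }
}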
Another option is to simply write a script that will sort your professor's file.
FYI, in the sparse world the number of elements (nonzeros) is what matters, not the dimensions of the matrix. 100k-by-100k is a meaningless piece of information. That entire matrix could be totally empty, for example.

Inserting and removing elements from an array while maintaining the array to be sorted

I'm wondering whether somebody can help me with this problem. I'm using C/C++ to program and I need to do the following:
I am given a sorted array P (biggest first) containing floats. It is usually very large, sometimes holding correlation values from 10-megapixel images. I need to iterate through the array until it is empty. Within the loop there is additional processing taking place.
The gist of the problem is that at the start of the loop I need to remove the elements with the maximum value from the array, check certain conditions and, if they hold, reinsert the elements into the array after decreasing their value. However, I want the array to be efficiently sorted after the reinsertion.
Can somebody point me towards a way of doing this? I have tried the naive approach of re-sorting every time I insert, but that seems really wasteful.
Change the data structure. Repeatedly accessing the largest element, and then quickly inserting new values in such a way that you can still efficiently access the largest element repeatedly, is a job for a heap, which may be fairly easily created from your array in C++.
BTW, please don't talk about "C/C++". There is no such language. You're instead making vague implications about the style in which you're writing things, most of which will strike experienced programmers as bad.
I would look into std::priority_queue (http://www.cplusplus.com/reference/stl/priority_queue/), as it is designed to do just this.
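A sketch of the remove-max / maybe-reinsert loop on std::priority_queue, which is a max-heap by default; shouldReinsert() and decrease() are placeholders for the question's condition check and value adjustment:

#include <queue>
#include <vector>

bool shouldReinsert(float v) { return v > 0.5f; }  // placeholder condition
float decrease(float v) { return v - 0.1f; }       // placeholder adjustment

void processAll(const std::vector<float>& P) {
    std::priority_queue<float> heap(P.begin(), P.end());  // O(n) heap build
    while (!heap.empty()) {
        float top = heap.top();
        heap.pop();                        // O(log n)
        // ... additional per-element processing goes here ...
        if (shouldReinsert(top))
            heap.push(decrease(top));      // O(log n), heap order is preserved
    }
}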
You could use a binary search to determine where to insert the changed value after you removed it from the array. Note that inserting or removing at the front or somewhere in the middle is not very efficient either, as it requires moving all items with a higher index up or down, respectively.
ISTM that you should rather put your changed items into a new array and sort that once, after you finished iterating over the original array. If memory is a problem, and you really have to do things in place, change the values in place and only sort once.
I can't think of a better way to do this. Keeping the array sorted all the time seems rather inefficient.
Since the array is already sorted, you can use a binary search to find the location to insert the updated value. C++ provides std::lower_bound and std::upper_bound for this purpose (pass std::greater as the comparator, since the array is sorted in descending order); C provides bsearch. Just shift all the existing values up by one location in the array and store the new value at the newly cleared spot.
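A sketch of that reinsertion on a std::vector; the key detail is passing std::greater so the comparator matches the descending order, which std::lower_bound requires:

#include <algorithm>
#include <functional>
#include <vector>

void reinsert(std::vector<float>& P, float newVal) {
    auto pos = std::lower_bound(P.begin(), P.end(), newVal,
                                std::greater<float>());  // O(log n) search
    P.insert(pos, newVal);  // O(n): everything after pos shifts by one
}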
Here's some pseudocode that may work decently if you aren't decreasing the removed values by much:
For example, say you're processing the element with the maximum value in the array, and say the array is sorted in descending order (largest first).
Remove array[0].
Let newVal = array[0] - adjustment, where adjustment is the amount you're decreasing the value by.
Now loop through, adjusting only the values you need to:
Pseudocode (assume the array holds size valid elements, sorted descending):
i = 0
while (i + 1 < size && array[i + 1] > newVal) {
    array[i] = array[i + 1];   // shift larger values toward the front
    i++;
}
array[i] = newVal;             // place the new value in its sorted slot
Again, if you're not decreasing the removed values by a large amount (relative to the values in the array), this could work fairly efficiently.
Of course, the generally better alternative is to use a more appropriate data structure, such as a heap.
Maybe using another temporary array could help. This way you can first sort the "changed" elements alone, and after that do a regular O(n) merge of the two sorted sub-arrays into the temp array, then copy everything back to the original array.
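A sketch of that merge step with the standard library, assuming both runs are kept in descending order to match the original array; remaining and changed are illustrative names:

#include <algorithm>
#include <functional>
#include <vector>

std::vector<float> mergeBack(const std::vector<float>& remaining,
                             std::vector<float>& changed) {
    // sort only the changed values, descending to match the original order
    std::sort(changed.begin(), changed.end(), std::greater<float>());
    std::vector<float> out(remaining.size() + changed.size());
    std::merge(remaining.begin(), remaining.end(),
               changed.begin(), changed.end(),
               out.begin(), std::greater<float>());  // one O(n) pass
    return out;
}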