Redundant static data - c++

This question applies to any type of static data. I'm only using int to keep the example simple.
I am reading in a large XML data file containing ints and storing them in a vector<int>. For the particular data I'm using, it's very common for the same value to be repeated consecutively many times.
<Node value="4" count="4000">
The count attribute means that the value is to be repeated that many times:
for (int i = 0; i < 4000; i++)
    vec.push_back(4);
It seems like a waste of memory to store the same value repeatedly when I already know that it is going to appear 4000 times in a row. However, I need to be able to index into the vector at any point.
For larger data objects, I know that I can just store pointers, but that would still involve storing 4000 identical pointers in the example above.
Is there any type of strategy to deal with an issue like this?

Use two vectors. The first vector contains the indices, the second one the actual values.
Fill in the indices vector such that values[i] holds the value for every position from indices[i-1] (or 0 when i is 0) up to, but not including, indices[i].
Then use binary search on the indices array to locate the position in the values array. Binary search is very efficient (O(log n)), and you will only use a fraction of the memory compared to the original approach.
If you assume the following data:
4000 ints with value "4"
followed by 200 ints with value "3"
followed by 5000 ints with value "10"
You would create an index vector and value vector and fill it like this:
indices = {4000, 4200, 9200}; // indices[i] = indices[i-1] + count of run i (the previous end is 0 for the first run)
values = {4,3,10};
As suggested in the other answers, you should probably wrap this in an operator[].
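A minimal sketch of how the two-vector idea could be wrapped behind operator[]. The class name RunLengthVector and its append() member are hypothetical, not from the answers; std::upper_bound does the binary search over the cumulative end indices. Access is read-only here, since writing through an index would require splitting runs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical run-length container: one stored value per run, plus the
// cumulative end index of each run (the "indices" vector from the answer).
template <typename T>
class RunLengthVector {
public:
    // Append `count` consecutive copies of `value`.
    void append(const T& value, std::size_t count) {
        if (count == 0) return;
        ends_.push_back(size() + count);
        values_.push_back(value);
    }

    // Logical number of elements, not the number of stored runs.
    std::size_t size() const { return ends_.empty() ? 0 : ends_.back(); }

    // Read-only random access in O(log runs).
    const T& operator[](std::size_t i) const {
        assert(i < size());
        // Find the first run whose end index is strictly greater than i.
        auto it = std::upper_bound(ends_.begin(), ends_.end(), i);
        return values_[static_cast<std::size_t>(it - ends_.begin())];
    }

private:
    std::vector<std::size_t> ends_;  // e.g. {4000, 4200, 9200}
    std::vector<T> values_;          // e.g. {4, 3, 10}
};
```

With the data from the example, calling append(4, 4000), append(3, 200) and append(10, 5000) stores three runs, and indexing position 4100 finds the second run.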

I would suggest to write a specific class instead of using vector.
Your class should just hold the number of times an item occurs in a list and compute the index in a smart way so you can easily retrieve an element based on the index.

Try to wrap your data into some objects with vector-like interface (operator[] and so on), so you can hide implementation detail (that is you are not actually storing 4000 numbers) yet provide similar interface.

Related

Accessing elements of a vector in a map of vectors

I want to create a map of vectors. However, I want the vector to be a private member variable, so that when I need to increase the size of the vector for a particular key in the map, it does so for all other keys in the map as well (would that work?). This will be a map of vectors (of ints) where the keys are strings. My question is how to access a particular element in the vector to change its value in C++. Something along the lines of map_name['word'].[3] = 2 if I wanted to set the third value of the vector of "word" to 2.
I'm still having trouble figuring out how to make the size of each vector, for all the keys in the map, modifiable, so that I can increase the size of each vector at any point in the program. This is because the vector size is not known in advance, and iterating through each element in the map to change the vector size will take too long.
The pattern is recursive.
That is, when you do:
expression[key] = value;
your expression doesn't have to just be a variable name; it can be a more complex expression, such as map_name["word"].
So:
map_name["word"][3] = 2;
Regarding the first question: yes, it is possible. As mentioned in one of the comments, you can write your own class to do that.
And in the second question, you'll have to access an element of a vector which is an element of a map like this:
map1["abc"][1] = 2;
The '.' you added was unnecessary because you're accessing an element inside another element, just like a 2D array
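A small sketch of the chained access, assuming a std::map keyed by std::string. The alias WordTable and the helper setValue are hypothetical names. Note that map operator[] creates an empty vector for a key that is not present yet, so the vector has to be grown before an element is assigned; indexing past the end of a vector is undefined behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical alias for the map of vectors described in the question.
using WordTable = std::map<std::string, std::vector<int>>;

// Set element `pos` of the vector stored under `key`, growing it if needed.
void setValue(WordTable& table, const std::string& key, std::size_t pos, int value) {
    std::vector<int>& vec = table[key];  // creates an empty vector for a new key
    if (pos >= vec.size())
        vec.resize(pos + 1);             // new elements are value-initialized to 0
    vec[pos] = value;                    // same as table[key][pos] = value
}
```

Once the vector is large enough, the direct form table["word"][3] = 2; works as the answers describe. String keys need double quotes, not single quotes.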

Large Data Set for Processing, need to maintain original data set

Here's my problem definition: I have an array of seven million indices, each containing a label. So, for simplicity, here's an example array that I'm dealing with: [1 2 3 3 3 5 2 1 7].
I need to go through this array and every time I come across a label, input the location of the label into a "set" with all others of the same label. With the array being so large, I want to access only a specific label's location at any given point, so let's say, I want to access only the locations of 3 and process those locations and change them to 5's, but I want to do more than just one operation and not only that, I want to do it on all labels, separately. In a small array like in my example, it seems trivial just to stick with the array. However, with an array of seven million points, it is much more time expensive to complete the searching for said labels and then operate on them.
To clear up confusion, using my example, I want the example array to give me the following:
1 mapped to a set containing 0 and 7
2 mapped to a set containing 1 and 6
3 mapped to a set containing 2, 3, and 4
5 mapped to a set containing 5
Originally, I did my processing on the original array and simply operated on the array. This took roughly 30 seconds to determine the number of corresponding indices for each label (so I was able to determine that the size of 1 was two, the size of 2 was two, the size of 3 was three, etc.). However, this method did not produce the locations of said labels. Therefore, additional time was spent throughout the rest of my processing finding the locations of each label, although it was sped up by stopping the search once all the indices of the referenced label had been found.
Next, I used a map<int,set<int>>. This increased the build time to ~100 seconds and decreased the processing time later down the road, but not enough to justify the heavy increase.
I haven't implemented it yet, but as an additional step, I am planning to initialize an array of sets, with the array indices corresponding to the labels, and try it that way.
I have also tried hash_map, to no avail. unordered_set and unordered_map are not included in the STL in Visual Studio 2005, so I have not implemented the above with these two structures.
Key points:
I have pre-processed the array such that I know the maximum label, and that all labels are consecutive (there are no gaps between the minimum label and the maximum). However, they are randomly dispersed in the original array. This may prove useful in initialization of a set size data structure.
Order of the indices corresponding to the labels is not important. Order of the labels in their given data structure is also not important.
Edit:
For background, the array corresponds to a binary image, and I implemented binary sequential labeling to output an array of same size as the binary image of UINT16 with all binary blobs labeled. What I want to do now is to obtain a map of the points that make up each blob as efficiently as possible.
Why do you use such complicated data structures for that task? Just create a vector of vectors to store all the positions of each label and that's it. You can also avoid repeated vector memory allocation by first counting how much space you need for each label. Something like this:
// N is the number of elements in dataArray; L is the maximum label + 1.
vector<int> count(L);
for (size_t i = 0; i < N; ++i)
    ++count[dataArray[i]];        // how many positions each label has
vector< vector<int> > labels(L);
vector<int> last(L);              // next free slot for each label
for (size_t i = 0; i < L; ++i)
    labels[i].resize(count[i]);   // allocate each list exactly once
for (size_t i = 0; i < N; ++i) {
    labels[dataArray[i]][last[dataArray[i]]] = i;
    ++last[dataArray[i]];
}
It runs in O(N) time, which should be on the order of a second for your seven million integers.
I wouldn't necessarily use general purpose maps (or hash tables) for this.
My initial gut feeling is that I'd create a second array "positions" of seven million (or whatever N) locations, and a third array "last_position_for_index" corresponding to the range [min-label, max-label]. Note that this will almost certainly take less storage than any kind of map.
Initialize all the entries of last_position_for_index to some reserved value, and then you can just loop through your array with something like (untested):
for (std::size_t k = 0; k < N; ++k) {
    IndexType index = indices[k];
    positions[k] = last_position_for_index[index - min_label];
    last_position_for_index[index - min_label] = k;
}
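A hedged sketch of the loop above, extended with the traversal step that the answer leaves implicit: each label's positions form a chain that is walked backwards from the last occurrence. The names LabelChains, buildChains, positionsOf and kNone are hypothetical, and the reserved value is assumed to be the largest std::size_t.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reserved "no position" marker (an assumption; any value >= N would do).
const std::size_t kNone = static_cast<std::size_t>(-1);

struct LabelChains {
    std::vector<std::size_t> previous;  // previous[k]: earlier position with the same label
    std::vector<std::size_t> last;      // last[label - minLabel]: latest position of that label
};

// One pass over the data; O(N) time, two flat arrays of storage.
LabelChains buildChains(const std::vector<int>& labels, int minLabel, int maxLabel) {
    LabelChains c;
    c.previous.assign(labels.size(), kNone);
    c.last.assign(static_cast<std::size_t>(maxLabel - minLabel) + 1, kNone);
    for (std::size_t k = 0; k < labels.size(); ++k) {
        std::size_t slot = static_cast<std::size_t>(labels[k] - minLabel);
        assert(slot < c.last.size());
        c.previous[k] = c.last[slot];  // link to the previous occurrence
        c.last[slot] = k;              // this position is now the newest
    }
    return c;
}

// Collect every position of `label`, newest first.
std::vector<std::size_t> positionsOf(const LabelChains& c, int label, int minLabel) {
    std::vector<std::size_t> out;
    std::size_t k = c.last[static_cast<std::size_t>(label - minLabel)];
    for (; k != kNone; k = c.previous[k])
        out.push_back(k);
    return out;
}
```

For the example array [1 2 3 3 3 5 2 1 7], the chain for label 3 visits positions 4, 3 and 2. Order is reversed relative to the input, which the question says does not matter.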

What's the best way to read a matrix of unknown columns from standard input?

I know only the number of rows r in a matrix.
How do I read it into a multi-dimensional array arr[MAX][MAX]?
I thought of reading all the elements into a single array, counting the number of elements, and then adding them to arr in groups of count/r. Is there a simpler way?
You could use the fact that everything may as well go into contiguous memory so just keep pushing it at the end of std::vector<double>. At the end you know its length, and given that you know r, you now also know the number of columns.
If you really have nothing but the number of rows and a list of data values, just read the whole thing into a vector, and then divide the size of the vector by the number of rows to get the number of columns. You should, however, also know whether the data is stored row-wise or column-wise. On this depends how to index the vector (I would keep the data in the vector and access it through index calculation, most probably encapsulated in a nice little class).
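A minimal sketch of the "read everything, then divide" idea, assuming the data is stored row-wise and separated by whitespace. The class name Matrix and the helper parseMatrix are hypothetical; the element count is assumed to be an exact multiple of the row count.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wrapper: flat storage plus index calculation (row-major).
class Matrix {
public:
    // Reads numbers until the stream ends or a non-number is encountered.
    Matrix(std::istream& in, std::size_t rows) : rows_(rows), cols_(0) {
        double x;
        while (in >> x)
            data_.push_back(x);
        assert(rows_ > 0 && data_.size() % rows_ == 0);
        cols_ = data_.size() / rows_;  // the column count falls out of the total
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    double operator()(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;
};

// Convenience helper for data that is already in memory.
Matrix parseMatrix(const std::string& text, std::size_t rows) {
    std::istringstream in(text);
    return Matrix(in, rows);
}
```

Reading from standard input would pass std::cin (from the iostream header) as the stream. For column-wise data the index calculation would be c * rows_ + r instead.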

Merging two arrays of a class

I have two arrays of class Record. Class Record is defined like this:
class Record{
public:
    char* string; //the word string
    int count; //frequency word appears
};
And these are the two arrays defined (already initialized)
Record* recordarray1 = new Record[9000000]; //contains 9000000 unsorted Records
Record* recordarray2 = new Record[8000000]; //contains 8000000 unsorted Records
The purpose is to find strings that match between the two arrays and add them to a new array where their counts are added together; if a string is not in the other array, it is just added to the new array. To do this I have tried sorting the two arrays first (in alphabetical order by string), then comparing against recordarray2: if the string matches, advance recordarray2's index, otherwise advance recordarray1's index until you find one. If you don't find it, add it to the new array.
Unfortunately this method is WAY too slow; sorting alone takes 20+ seconds with STL sort. Is there a quicker standard method of sorting that I'm missing?
If I've understood correctly your algorithm should take O( nlogn + mlogm [sort both arrays] + n + m [to go through the arrays and compare]).
It may not be much of an optimization, but you could try sorting just one of the arrays and using binary search to check whether the elements of the other array are present. So now it should take O( n [to copy one array as the new array] + nlogn [to sort it] + mlogn [to binary search the elements of the second in the sorted new one] ).
HTH
Sorting objects might be expensive, so I would try to avoid this.
One faster way might be to create an index for each array using a std::hash_map with the string as the hash key and the array index as the value. You get two containers that can be iterated at the same time. The iterator for the lesser will be advanced until you find a match or the other points to a lesser value. This will lead you to a predictable iteration count.
A possible solution is to use unordered_map. The algorithm would be as follows:
1. Put the first array into the map, using the strings as keys and the counts as values.
2. For each member of the second array, check whether it is contained in the map.
   If it exists there:
     Put the record into the new array, combining the counts.
     Remove the record from the map.
   Else:
     Put the record into the new array.
3. Iterate through the remaining records in the map and put them into the new array.
The complexity of this algorithm is approximately O(n+m).
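The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the answerer's code: Record is redefined here with a std::string member named word (instead of char*), mergeRecords is a hypothetical name, and each word is assumed to appear at most once in the second array.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-in for the question's Record (std::string instead of char*).
struct Record {
    std::string word;
    int count;
};

std::vector<Record> mergeRecords(const std::vector<Record>& a,
                                 const std::vector<Record>& b) {
    // Step 1: index the first array by word.
    std::unordered_map<std::string, int> remaining;
    remaining.reserve(a.size());
    for (const Record& r : a)
        remaining[r.word] += r.count;

    // Step 2: walk the second array, combining counts on a match.
    std::vector<Record> merged;
    merged.reserve(a.size() + b.size());
    for (const Record& r : b) {
        auto it = remaining.find(r.word);
        if (it != remaining.end()) {
            merged.push_back({r.word, r.count + it->second});
            remaining.erase(it);  // matched, so it must not be emitted again
        } else {
            merged.push_back(r);
        }
    }

    // Step 3: whatever is left appeared only in the first array.
    for (const auto& entry : remaining)
        merged.push_back({entry.first, entry.second});
    return merged;
}
```

Hash lookups are O(1) on average, which is where the approximate O(n+m) bound comes from; the output order is unspecified for the records taken from the map.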
I feel that sorting is not needed. You can use the following algorithm:
1. Start with the first element of recordarray1 and put it into the new array.
2. Search for that element in recordarray2. If it is found, add its count to the entry in the new array. Also set recordarray2[N].count to a negative value, so that it will not be checked again in step 3. Repeat steps 1 and 2 for each remaining element of recordarray1.
3. Put all the elements of recordarray2 whose count is not negative into the new array. If a negative count is encountered, simply change it back to positive.
Note: This algorithm does not handle duplicate strings within the same array. Also, don't use string as a variable name, since it is also a type name (std::string).

Sort, pack and remap array of indexed values to minimize overlapping

Situation:
overview:
I have something like this:
std::vector<SomeType> values;
std::vector<int> indexes;
struct Range{
    int firstElement; //first element to be used in indexes array
    int numElements;  //number of elements to be used from indexes array
    int minIndex;     /*minimum index encountered between firstElement
                        and firstElement+numElements*/
    int maxIndex;     /*maximum index encountered between firstElement
                        and firstElement+numElements*/
    Range()
        :firstElement(0), numElements(0), minIndex(0), maxIndex(0){
    }
};
std::vector<Range> ranges;
I need to sort values, remap indexes, and recalculate ranges to minimize maxIndex-minIndex for each range.
details:
values is an array(okay, "vector") of some type (irrelevant which one). elements in values may be unique, but this is not guaranteed.
indexes is a vector of ints. Each element in "indexes" is an index that corresponds to some element in values. Elements in indexes are not unique; one value may repeat multiple times. And indexes.size() >= values.size().
Now, ranges correspond to a "chunk" of data from indexes. firstElement is the index of the first element to be used from indexes (i.e. used like this: indexes[range.firstElement]), numElements is (obviously) the number of elements to be used, minIndex is the minimum in (indexes[firstElement]...indexes[firstElement+numElements-1]) and maxIndex is the maximum in (indexes[firstElement]...indexes[firstElement+numElements-1]). Ranges never overlap.
I.e. for every two ranges a, b
((a.firstElement >= b.firstElement) && (a.firstElement < (b.firstElement+b.numElements))) == false
Obviously, when I do any operation on values (swap two elements, etc.), I need to update indexes (so they keep pointing to the same value) and recalculate the corresponding range, so that the range's minIndex and maxIndex are correct.
Now, I need to rearrange values in the way that will minimize Range.maxIndex - Range.minIndex. I do not need the "best" result after packing, having "probably the best" or "good" packing will be enough.
problem:
Remapping indexes and recalculating ranges is easy. The problem is that I'm not sure how to sort elements in values, because same index may be encountered in multiple ranges.
Any ideas about how to proceed?
restrictions:
Changing the container type is not allowed. Containers should be array-like. No maps, no lists.
But you're free to use whatever container you want during the sorting.
Also, no boost or external libraries - pure C++/STL. I really need only an algorithm.
additional info:
There is no greater/less comparison defined for SomeType - only equality/non-equality.
But there should be no need to ever compare two values, only indexes.
The goal of algorithm is to make sure that output of
for (int i = 0; i < indexes.size(); i++){
    print(values[indexes[i]]); //hypothetical print function
}
Will be identical before and after sorting, while also making sure that for each range
Range.maxIndex-Range.minIndex (after sorting) is as small as possible to achieve with reasonable effort.
I'm not looking for a "perfect" or "most optimal" solution, having a "probably perfect" or "probably most optimal" solution should be enough.
P.S. This is NOT a homework.
This is not an algorithm, just some thinking aloud. It will probably break if there are too many duplicates.
If there were no duplicates, you'd simply rearrange the values so the indexes are 0, 1, 2, and so on. So as a starting point, let's exclude the values that are referenced by more than one range and arrange the rest.
Since there are duplicates, you need to figure out where to stick them. Suppose the duplicate is referred to by ranges r1, r2, r3. Now, as long as you insert the duplicate between min([r1,r2,r3].minIndex)-1 and max([r1,r2,r3].maxIndex)+1, the sum of maxIndex-minIndex will be the same no matter where you insert it. Moving the insertion point to the left will reduce max-min for all ranges to the left, but increase it for all ranges to the right. So, I think the sensible thing to do is to insert the duplicate at the left edge (minIndex) of the rightmost range (the one with the largest minIndex) of r1, r2, r3. Repeat with all duplicates.
Okay, it looks like there is only one way to reliably solve this problem:
Make sure that no index is ever used by two ranges at once by duplicating values.
I.e. scan the entire array of indexes, and when you find an index (of a value) that is used in more than one range, add a copy of that value for each range, each with a unique index. After that the problem becomes trivial: you simply sort values so that the values array first contains the values used only by the first range, then the values for the 2nd range, and so on. This gives maximum packing.
Because in my app it is more important to minimize sum(ranges[i].maxIndex-ranges[i].minIndex) than to minimize the number of values, this approach works for me.
I do not think that there is other reliable way to solve the problem - it is quite easy to get situation when there are indexes used by every range, and in this case it will not be possible to "pack" data no matter what you do. Even allowing index to be used by two ranges at once will lead to problems - you can get ranges a, b and c where a and b, b and c, a and c will have common indexes. In this case it also won't be possible to pack the data.
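A rough sketch of the duplication approach described above, under stated assumptions: every element of indexes belongs to exactly one range, ranges do not overlap, and SomeType is copyable. The function name packByDuplicating is hypothetical, Range is redefined here to keep the block self-contained, and a temporary std::unordered_map is used only during the repacking (the question allows any container there).

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

struct Range {
    int firstElement = 0;  // first element to be used in indexes
    int numElements = 0;   // number of elements used from indexes
    int minIndex = 0;      // smallest value index referenced by the range
    int maxIndex = 0;      // largest value index referenced by the range
};

// Rebuilds `values` so that each range refers to its own contiguous block.
// A value shared by several ranges is copied once per range.
template <typename T>
void packByDuplicating(std::vector<T>& values,
                       std::vector<int>& indexes,
                       std::vector<Range>& ranges) {
    std::vector<T> packed;
    for (Range& r : ranges) {
        if (r.numElements == 0) continue;
        std::unordered_map<int, int> remap;  // old index -> new index, per range
        r.minIndex = static_cast<int>(packed.size());
        for (int k = r.firstElement; k < r.firstElement + r.numElements; ++k) {
            int oldIndex = indexes[k];
            auto it = remap.find(oldIndex);
            if (it == remap.end()) {
                // First use inside this range: give the value a fresh slot.
                it = remap.emplace(oldIndex, static_cast<int>(packed.size())).first;
                packed.push_back(values[oldIndex]);
            }
            indexes[k] = it->second;
        }
        r.maxIndex = static_cast<int>(packed.size()) - 1;
    }
    values.swap(packed);
}
```

After the call, maxIndex-minIndex for each range equals the number of distinct values it uses minus one, and values[indexes[i]] yields the same sequence as before. No comparison of SomeType values is needed, only of indexes.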