C++ - read 1000 floats and insert them into a vector of size 10 by keeping the lowest 10 numbers only - c++

So I am pretty new to c++ and I am not sure if there is a data structure already created to facilitate what I am trying to do (so I do not reinvent the wheel):
What I am trying to do
I am reading a file where I need to parse the file, do some calculations on every floating value on every row of the file, and return the top 10 results from the file in ascending order.
What am I trying to optimize
I am dealing with a 1k file and a 1.9 million row file so for each row, I will get a result that is of size 72 so in 1k row, I will need to allocate a vector of 72000 elements and for the 1.9 million rows ... well you get the idea.
What I have so far
I am currently working with a vector for the results which then I sort and resize it to 10.
const unsigned int vector_space = circularVector.size()*72;
//vector for the results
std::vector<ResultType> results;
results.reserve(vector_space);
but this is extremely inefficient.
*What I want to accomplish *
I want to only keep a vector of size 10, and whenever I perform a calculation, I will simply insert the value into the vector and remove the largest floating point that was in the vector, thus maintaining the top 10 results in ascending order.
Is there a structure already in c++ that will have such behavior?
Thanks!

EDIT: Changed to use the 10 lowest elements rather than the highest elements as the question now makes clear which is required
You can use a std::vector of 10 elements as a max heap, in which the elements are partially sorted such that the first element always contains the maximum value. Note that the following is all untested, but hopefully it should get you started.
// Create an empty vector to hold the highest values
std::vector<ResultType> results;
// Iterate over the first 10 entries in the file and put the results in the vector
for (... ; i < 10; i++) {
// Calculate the value of this row
ResultType r = ....
// Add it to the vector
results.push_back(r);
}
// Now that the vector is "full", turn it into a heap
std::make_heap(results.begin(), results.end());
// Iterate over all the remaining rows, adding values which are lower than the
// current maximum
for (i = 10; .....) {
// Calculate the value for this row
ResultType r = ....
// Compare it to the max element in the heap
if (r < results.front()) {
// Add the new element to the vector
results.push_back(r);
// Move the existing minimum to the back and "re-heapify" the rest
std::pop_heap(results.begin(), results.end());
// Remove the last element from the vector
results.pop_back();
}
}
// Finally, sort the results to put them all in order
// (using sort_heap just because we can)
std::sort_heap(results.begin(), results.end());

Yes. What you want is a priority queue or heap, defined so as to remove the lowest value. You just need to do such a remove if the size after the insertion is greater than 10. You should be able to do this with STL classes.

Just use std::set to do that, since in std::set all values are sorted from min to max.
void insert_value(std::set<ResultType>& myset, const ResultType& value){
myset.insert(value);
int limit = 10;
if(myset.size() > limit){
myset.erase(myset.begin());
}
}

I think MaxHeap will work for this problem.
1- Create a max heap of size 10.
2- Fill the heap with 10 elements for the first time.
3- For 11th element check it with the largest element i.e root/element at 0th index.
4- If 11th element is smaller; replace the root node with 11th element and heapify again.
Repeat the same steps until the whole file is parsed.

Related

Shrink QVector to last 10 elements & rearrange one item

I'm trying to find the most "efficient" way or, at least fast enough for 10k items vector to shrink it to last 10 items and move last selected item to the end of it.
I initially though of using this method for shrinking :
QVector<QModelIndex> newVec(listPrimary.end() - 10, listPrimary.end());
But that does not work, and I'm not sure how to use the Qt interators / std to get it to work...
And then once that's done do this test
if(newVec.contains(lastItem))
{
newVec.insert(newVec[vewVec.indexOf(newVec)],newVec.size());
}
else{
newVec.push_back(lastItem);
}
QVector Class has a method that does what you want:
QVector QVector::mid(int pos, int length = ...) const
Returns a sub-vector which contains elements from this vector, starting at position pos. If length is -1 (the default), all elements after pos are included; otherwise length elements (or all remaining elements if there are less than length elements) are included.
So as suggested in the comments, you can do something like this:
auto newVec = listPrimary.mid(listPrimary.size() - 10);
You do not have to pass length because its default value ensures that all elements after pos are included.

C++: Remove repeated numbers in a matrix

I want to remove numbers from a matrix that represents coordinates with the format 'x y z'. One example:
1.211 1.647 1.041
2.144 2.684 1.548
1.657 2.245 1.021
1.657 0.984 2.347
2.154 0.347 2.472
1.211 1.647 1.041
In this example the coordinates 1 and 6 are the same (x, y and z are the same) and I want to remove them but I do not want to remove cases with only one value equal as coordinates 3 and 4 for x-coordinate).
These values are in a text file and I want to print the coordinates without duplication in another file or even in the same one.
A very simple solution would be to treat each line as a string and use a set of strings. As you traverse the file line-wise, you check if the current line exists in the set and if not, you insert and print it.
Complexity: O(nlogn), extra memory needed: almost the same as your input file in the worst case
With the same complexity and exactly the worst case memory consumption as the previous solution, you can load the file in memory, sort it line-wise, and then easily skip duplicates while printing. The same can be done inside the file if you are allowed to re-order it, and this way you need very little extra memory, but be much slower.
If memory and storage is an issue (I'm assuming since you can't duplicate the file), you can use the simple method of comparing the current line with all previous lines before printing, with O(n^2) complexity but no extra memory. This however is a rather bad solution, since you have to read multiple times from the file, which can be really slow compared to the main memory.
How to do this if you want to preserve the order.
Read the coordinates into an array of structures like this
struct Coord
{
double x,y,z;
int pos;
bool deleted;
};
pos is the line number, deleted is set to false.
Sort the structs by whatever axis tends to show the greatest variation.
Run through the array comparing the value of the axis you were using in the sort from the previous item to the value in the current item. If the difference is less than a certain preset delta (.i.e. if you care about three digits after the decimal point you would look for a difference of 0.000999999 or so) you compare the remaining values and set deleted for any line where x,y,z are close enough.
for(int i=1;i<count;i++)
{
if(fabs(arr[i].x-arr[i-1].x)<0.001)
if(fabs(arr[i].y-arr[i-1].y)<0.001)
if(fabs(arr[i].z-arr[i-1].z)<0.001)
arr[i].deleted=true;
}
sort the array again, this time ascending by pos to restore the order.
Go through the array and output all items where deleted is false.
In c++, you can use the the power of STL to solve this problem. Use the map and store the three coordinates x, y and z as a key in the map. The mapped value to the key will store the count of that key.
Key_type = pair<pair<float,float>,float>
mapped_type = int
Create a map m with the above given key_type and mapped_type and insert all the rows into the map updating the count for each row. Let's assume n is the total number of rows.
for(i = 0; i < n; i++) {
m[make_pair(make_pair(x,y),z)]++;
}
Each insertion takes O(logn) and you have to insert n times. So,overall time complexity will be O(nlogn). Now, loop over all the rows of the matrix again and if the mapped_value of that row is 1, then it is unique.

Randomly select index from a STL vector from truth value

I have a vector that looks like:
vector<int> A = {0, 1, 1, 0, 0, 1, 0, 1};
I'd like to select a random index from the non-zero values of A. Using this example A, I want to randomly select an element from the array {1,2,5,7}.
Currently I do this by creating another array
vector<int> b;
for(int i=0;i<A.size();i++)
if(A[i])
b.push_back(i);
Once b is created, I find the index by using this answer:
get random element from container
Is there a more STL-like (or C++11) way of doing this, perhaps one that does not create an intermediate array? In this example A is small, but in my production code this selection process is in an inner-loop and A is non-static and thousands of elements long.
A great way to do this is Reservoir Sampling.
In short, you walk your array until you find the first non-zero value, and record that index as the first possible answer you might return.
Then, you continue to walk the array. Every time you find a non-zero value, you randomly might change which new index is your possible answer, with decreasing probability.
This algorithm also works great if you need M random index values from your array.
What's great about this, is that you walk each element only one time, and you don't need a separate memory structure to record the non-zero elements. It's O(N) in speed, and O(M) in memory, in your case it's O(1) in memory, since you only want 1 random value.
On the flip side, random number generators are traditionally quite slow. So, you might want to performance test this against any other ideas people come up with here, to see if the trade-off of speed-vs-memory is worth it for you.
With a single pass through the array, you can determine how many false (or true) values there are. If you are doing this kind of thing often, you can even write a class to keep track of this for you.
Regardless, you can then pick a random number i between 0 and num_false (or num_true). Then with another pass through the array, you can return the ith false (or true) index.
We can loop through each non-zero value and assign it a random number. The index with the largest random number is the one we select.
int value = 0;
int index = 0;
while(int i = 0; i < A.size(); i++) {
if(!A[i]) continue;
auto j = rand();
if(j > value) {
index = i;
value = j;
}
}
vector<int> A = {0,1,1,0,0,1,0,1};
random_shuffle(A.begin(),A.end());
auto it = find_if(A.begin(),A.end(),[](const int elem){return elem;});

Generate a new element different from 1000 elements of an array

I was asked this questions in an interview. Consider the scenario of punched cards, where each punched card has 64 bit pattern. I was suggested each card as an int since each int is a collection of bits.
Also, to be considered that I have an array which already contains 1000 such cards. I have to generate a new element everytime which is different from the previous 1000 cards. The integers(aka cards) in the array are not necessarily sorted.
Even more, how would that be possible the question was for C++, where does the 64 bit int comes from and how can I generate this new card from the array where the element to be generated is different from all the elements already present in the array?
There are 264 64 bit integers, a number that is so much
larger than 1000 that the simplest solution would be to just generate a
random 64 bit number, and then verify that it isn't in the table of
already generated numbers. (The probability that it is is
infinitesimal, but you might as well be sure.)
Since most random number generators do not generate 64 bit values, you
are left with either writing your own, or (much simpler), combining the
values, say by generating 8 random bytes, and memcpying them into a
uint64_t.
As for verifying that the number isn't already present, std::find is
just fine for one or two new numbers; if you have to do a lot of
lookups, sorting the table and using a binary search would be
worthwhile. Or some sort of a hash table.
I may be missing something, but most of the other answers appear to me as overly complicated.
Just sort the original array and then start counting from zero: if the current count is in the array skip it, otherwise you have your next number. This algorithm is O(n), where n is the number of newly generated numbers: both sorting the array and skipping existing numbers are constants. Here's an example:
#include <algorithm>
#include <iostream>
unsigned array[] = { 98, 1, 24, 66, 20, 70, 6, 33, 5, 41 };
unsigned count = 0;
unsigned index = 0;
int main() {
std::sort(array, array + 10);
while ( count < 100 ) {
if ( count > array[index] )
++index;
else {
if ( count < array[index] )
std::cout << count << std::endl;
++count;
}
}
}
Here's an O(n) algorithm:
int64 generateNewValue(list_of_cards)
{
return find_max(list_of_cards)+1;
}
Note: As #amit points out below, this will fail if INT64_MAX is already in the list.
As far as I'm aware, this is the only way you're going to get O(n). If you want to deal with that (fairly important) edge case, then you're going to have to do some kind of proper sort or search, which will take you to O(n log n).
#arne is almost there. What you need is a self-balancing interval tree, which can be built in O(n lg n) time.
Then take the top node, which will store some interval [i, j]. By the properties of an interval tree, both i-1 and j+1 are valid candidates for a new key, unless i = UINT64_MIN or j = UINT64_MAX. If both are true, then you've stored 2^64 elements and you can't possibly generate a new element. Store the new element, which takes O(lg n) worst-case time.
I.e.: init takes O(n lg n), generate takes O(lg n). Both are worst-case figures. The greatest thing about this approach is that the top node will keep "growing" (storing larger intervals) and merging with its successor or predecessor, so the tree will actually shrink in terms of memory use and eventually the time per operation decays to O(1). You also won't waste any numbers, so you can keep generating until you've got 2^64 of them.
This algorithm has O(N lg N) initialisation, O(1) query and O(N) memory usage. I assume you have some integer type which I will refer to as int64 and that it can represent the integers [0, int64_max].
Sort the numbers
Create a linked list containing intervals [u, v]
Insert [1, first number - 1]
For each of the remaining numbers, insert [prev number + 1, current number - 1]
Insert [last number + 1, int64_max]
You now have a list representing the numbers which are not used. You can simply iterate over them to generate new numbers.
I think the way to go is to use some kind of hashing. So you store your cards in some buckets based on lets say on MOD operation. Until you create some sort of indexing you are stucked with looping over the whole array.
IF you have a look on HashSet implementation in java you might get a clue.
Edit: I assume you wanted them to be random numbers, if you don't mind sequence MAX+1 below is good solution :)
You could build a binary tree of the already existing elements and traverse it until you find a node whose depth is not 64 and which has less than two child nodes. You can then construct a "missing" child node and have a new element. The should be fairly quick, in the order of about O(n) if I'm not mistaken.
bool seen[1001] = { false };
for each element of the original array
if the element is in the range 0..1000
seen[element] = true
find the index for the first false value in seen
Initialization:
Don't sort the list.
Create a new array 1000 long containing 0..999.
Iterate the list and, if any number is in the range 0..999, invalidate it in the new array by replacing the value in the new array with the value of the first item in the list.
Insertion:
Use an incrementing index to the new array. If the value in the new array at this index is not the value of the first element in the list, add it to the list, else check the value from the next position in the new array.
When the new array is used up, refill it using 1000..1999 and invalidating existing values as above. Yes, this is looping over the list, but it doesn't have to be done for each insertion.
Near O(1) until the list gets so large that occasionally iterating it for invalidation of the 'new' new array becomes significant. Maybe you could mitigate this by using a new array that grows, maybee always the size of the list?
Rgds,
Martin
Put them all into a hash table of size > 1000, and find the empty cell (this is the parking problem). Generate a key for that. This will of course work better for bigger table size. The table needs only 1-bit entries.
EDIT: this is the pigeonhole principle.
This needs "modulo tablesize" (or some other "semi-invertible" function) for a hash function.
unsigned hashtab[1001] = {0,};
unsigned long long long long numbers[1000] = { ... };
void init (void)
{
unsigned idx;
for (idx=0; idx < 1000; idx++) {
hashtab [ numbers[idx] % 1001 ] += 1; }
}
unsigned long long long long generate(void)
{
unsigned idx;
for (idx = 0; idx < 1001; idx++) {
if ( !hashtab [ idx] ) break; }
return idx + rand() * 1001;
}
Based on the solution here: question on array and number
Since there are 1000 numbers, if we consider their remainders with 1001, at least one remainder will be missing. We can pick that as our missing number.
So we maintain an array of counts: C[1001], which will maintain the number of integers with remainder r (upon dividing by 1001) in C[r].
We also maintain a set of numbers for which C[j] is 0 (say using a linked list).
When we move the window over, we decrement the count of the first element (say remainder i), i.e. decrement C[i]. If C[i] becomes zero we add i to the set of numbers. We update the C array with the new number we add.
If we need one number, we just pick a random element from the set of j for which C[j] is 0.
This is O(1) for new numbers and O(n) initially.
This is similar to other solutions but not quite.
How about something simple like this:
1) Partition the array into numbers equal and below 1000 and above
2) If all the numbers fit within the lower partition then choose 1001 (or any number greater than 1000) and we're done.
3) Otherwise we know that there must exist a number between 1 and 1000 that doesn't exist within the lower partition.
4) Create a 1000 element array of bools, or a 1000-element long bitfield, or whatnot and initialize the array to all 0's
5) For each integer in the lower partition, use its value as an index into the array/bitfield and set the corresponding bool to true (ie: do a radix sort)
6) Go over the array/bitfield and pick any unset value's index as the solution
This works in O(n) time, or since we've bounded everything by 1000, technically it's O(1), but O(n) time and space in general. There are three passes over the data, which isn't necessarily the most elegant approach, but the complexity remains O(n).
you can create a new array with the numbers that are not in the original array, then just pick one from this new array.
¿O(1)?

Fast way to pick randomly from a set, with each entry picked only once?

I'm working on a program to solve the n queens problem (the problem of putting n chess queens on an n x n chessboard such that none of them is able to capture any other using the standard chess queen's moves). I am using a heuristic algorithm, and it starts by placing one queen in each row and picking a column randomly out of the columns that are not already occupied. I feel that this step is an opportunity for optimization. Here is the code (in C++):
vector<int> colsleft;
//fills the vector sequentially with integer values
for (int c=0; c < size; c++)
colsleft.push_back(c);
for (int i=0; i < size; i++)
{
vector<int>::iterator randplace = colsleft.begin() + rand()%colsleft.size();
/* chboard is an integer array, with each entry representing a row
and holding the column position of the queen in that row */
chboard[i] = *randplace;
colsleft.erase(randplace);
}
If it is not clear from the code: I start by building a vector containing an integer for each column. Then, for each row, I pick a random entry in the vector, assign its value to that row's entry in chboard[]. I then remove that entry from the vector so it is not available for any other queens.
I'm curious about methods that could use arrays and pointers instead of a vector. Or <list>s? Is there a better way of filling the vector sequentially, other than the for loop? I would love to hear some suggestions!
The following should fulfill your needs:
#include <algorithm>
...
int randplace[size];
for (int i = 0; i < size; i ++)
randplace[i] = i;
random_shuffle(randplace, randplace + size);
You can do the same stuff with vectors, too, if you wish.
Source: http://gethelp.devx.com/techtips/cpp_pro/10min/10min1299.asp
Couple of random answers to some of your questions :):
As far as I know, there's no way to fill an array with consecutive values without iterating over it first. HOWEVER, if you really just need consecutive values, you do not need to fill the array - just use the cell indices as the values: a[0] is 0 and a[100] is 100 - when you get a random number, treat the number as the value.
You can implement the same with a list<> and remove cells you already hit, or...
For better performance, rather than removing cells, why not put an "already used" value in them (like -1) and check for that. Say you get a random number like 73, and a[73] contains -1, you just get a new random number.
Finally, describing item 3 reminded me of a re-hashing function. Perhaps you can implement your algorithm as a hash-table?
Your colsleft.erase(randplace); line is really inefficient, because erasing an element in the middle of the vector requires shifting all the ones after it. A more efficient approach that will satisfy your needs in this case is to simply swap the element with the one at index (size - i - 1) (the element whose index will be outside the range in the next iteration, so we "bring" that element into the middle, and swap the used one out).
And then we don't even need to bother deleting that element -- the end of the array will accumulate the "chosen" elements. And now we've basically implemented an in-place Knuth shuffle.