Inserting and removing elements from an array while maintaining the array to be sorted

Inserting and removing elements from an array while maintaining the array to be sorted - c++

I'm wondering whether somebody can help me with this problem. I'm using C/C++ to program and I need to do the following:
I am given a sorted array P (biggest first) containing floats. It usually has a very big size.. sometimes holding correlation values from 10 megapixel images. I need to iterate through the array until it is empty. Within the loop there is additional processing taking place.
The gist of the problem is that at the start of the loop, I need to remove the elements with the maximum value from the array, check certain conditions and if they hold, then I need to reinsert the elements into the array but after decreasing their value. However, I want the array to be efficiently sorted after the reinsertion.
Can somebody point me towards a way of doing this? I have tried the naive approach of re-sorting everytime I insert, but that seems really wasteful.

Change the data structure. Repeatedly accessing the largest element, and then quickly inserting new values, in such a way that you can still efficiently repeatedly access the largest element, is a job for a heap, which may be fairly easily created from your array in C++.
BTW, please don't talk about "C/C++". There is no such language. You're instead making vague implications about the style in which you're writing things, most of which will strike experienced programmers as bad.

I would look into the http://www.cplusplus.com/reference/stl/priority_queue/, as it is designed to do just this.

You could use a binary search to determine where to insert the changed value after you removed it from the array. Note that inserting or removing at the front or somewhere in the middle is not very efficient either, as it requires moving all items with a higher index up or down, respectively.
ISTM that you should rather put your changed items into a new array and sort that once, after you finished iterating over the original array. If memory is a problem, and you really have to do things in place, change the values in place and only sort once.
I can't think of a better way to do this. Keeping the array sorted all the time seems rather inefficient.

Since the array is already sorted, you can use a binary search to find the location to insert the updated value. C++ provides std::lower_bound or std::upper_bound for this purpose, C provides bsearch. Just shift all the existing values up by one location in the array and store the new value at the newly cleared spot.

Here's some pseudocode that may work decently if you aren't decreasing the removed values by much:
For example, say you're processing the element with the maximum value in the array, and say the array is sorted in descending order (largest first).
Remove array[0].
Let newVal = array[0] - adjustment, where adjustment is the amount you're decreasing the value by.
Now loop through, adjusting only the values you need to:
Pseudocode:
i = 0
while (newVal < array[i]) {
array[i] = array[i+1];
i++;
}
array[i] = newVal;
swap(array[i], array[i+1]);
Again, if you're not decreasing the removed values by a large amount (relative to the values in the array), this could work fairly efficiently.
Of course, the generally better alternative is to use a more appropriate data structure, such as a heap.

Maybe using another temporary array could help.
This way you can first sort the "changed" elements alone.
And after that just do a regular merge O(n) for the two sub-arrays to the temp array, and copy everything back to the original array.

Related

Optimal data structure (in C++) for random access and looping through elements

I have the following problem: I have a set of N elements (N being somewhere between several hundred and several thousand elements, let's say between 500 and 3000 elements). Out of these elements, small percentage will have some property "X", but the elements "gain" and "lose" this property in a semi-random fashion; so if I store them all in an array, and assign 1 to elements with property X, and zero otherwise, this array of N elements will have n 1's and the N-n zeros (n being small in the 20-50 range).
The problem is the following: these elements change very frequently in a semi-random way (meaning that any element can flip from 0 to 1 and vice versa, but the process that controls that is somewhat stable, so the total number "n" fluctuates a bit, but is reasonably stable in the 20-50 range); and I frequently need all the "X" elements of the set (in other words, indices of the array where value of the array is 1), to perform some task on them.
One simple and slow way to achieve this is to simply loop through the array and if index k has value 1, perform the task, but this is kinda slow because well over 95% of all the elements have value 1. The solution would be to put all the 1s into a different structure (with n elements) and then loop through that structure, instead of looping through all N elements. The question is what's the best structure to use?
Elements will flip from 0 to 1 and vice versa randomly (from several different threads), so there's no order there of any sort (time when element flipped from 0 to 1 is has nothing to do with time it will flip back), and when I loop through them (from another thread), I do not need to loop in any particular order (in other words, I just need to get them all, but it's nor relevant in which order).
Any suggestions what would be the optimal structure for this? "std::map" comes to mind, but since the keys of std::map are sorted (and I don't need that feature), the questions is if there is anything faster?
EDIT: To clarify, the array example is just one (slow) way to solve the problem. The essence of the problem is that out of one big set "S" with "N" elements, there is a continuously changing subset "s" of "n" elements (with n much smaller then N), and I need to loop though that set "s". Speed is of essence, both for adding/removing elements to "s", and for looping through them. So while suggestions like having 2 arrays and moving elements between them would be fast from iteration perspective, adding and removing elements to an array would be prohibitively slow. It sounds like some hash-based approach like std::set would work reasonably fast on both iteration and addition/removal fronts, the question is is there something better than that? Reading the documentation on "unordered_map" and "unordered_set" doesn't really clarify how much faster addition/removal of elements is relative to std::map and std::set, nor how much slower the iteration through them would be. Another thing to keep in mind is that I don't need a generic solution that works best in all cases, I need one that works best when N is in the 500-3000 range, and n is in the 20-50 range. Finally, the speed is really of essence; there are plenty slow ways of doing it, so I'm looking for the fastest way.

Since order doesn't appear to be important, you can use a single array and keep the elements with property X at the front. You will also need an index or iterator to the point in the array that is the transition from X set to unset.
To set X, increment the index/iterator and swap that element with the one you want to change.
To unset X, do the opposite: decrement the index/iterator and swap that element with the one you want to change.
Naturally with multiple threads you will need some sort of mutex to protect the array and index.
Edit: to keep a half-open range as iterators are normally used, you should reverse the order of the operations above: swap, then increment/decrement. If you keep an index instead of an iterator then the index does double duty as the count of the number of X.

N=3000 isn't really much. If you use a single bit for each of them, you have a structure smaller than 400 bytes. You can use std::bitset for that. If you use an unordered_set or a set however be mindful that you'll spend many more bytes for each of the n elements in your list: if you just allocate a pointer for each element in a 64bit architecture you'll use at least 8*50 = 400 bytes, much more than the bitset

#geza : perhaps I misunderstood what you meant by two arrays; I assume you meant something like have one std::vector (or something similar) in which I store all elements with property X, and another where I store the rest? In reality, I don't care about others, so I really need one array. Adding an element is obviously simple if I can just add it to the end of the array; now, correct me if I'm wrong here, but finding an element in that array is O(n) operation (since the array is unsorted), and then removing it from the array again requires shifting all the elements by one place, so this in average requires n/2 operations. If I use linked list instead of vector, then deleting an element is faster, but finding it still takes O(n). That's what I meant when I said it would be prohibitively slow; if I misunderstood you, please do clarify.
It sounds like std::unordered_set or std::unordered_map would be fastest in adding/deleting elements, since it's O(1) to find an element, but it's unclear to me how fast can one loop through all the keys; the documentation clearly states that iteration through keys of std::unordered_map is slower then iteration through keys of std::map, but it's not quantified in any way just how slow is "slower", and how fast is "faster".
And finally, to repeat one more time, I'm not interested in general solution, I'm interested in one for small "n". So if for example I have two solutions, one that's k_1*log(n), and second that's k_2*n^2, first one might be faster in principle (and for large n), but if k_1 >> k_2 (let's say for example k_1 = 1000 and k_2=2 and n=20), second one can still be faster for relatively small "n" (1000*log(20) is still larger than 2*20^2). So even if addition/deletion in std::unordered_map might be done in constant time O(1), for small "n" it still matters if that constant time is 1 nanosecond or 1 microsecond or 1 millisecond. So I'm really looking for suggestions that work best for small "n", not for in the asymptotic limit of large "n".

An alternative approach (in my opinion worth only if the number of element is increased at least tenfold) might be keeping a double index:
#include<algorithm>
#include<vector>
class didx {
// v == indexes[i] && v > 0 <==> flagged[v-1] == i
std::vector<ptrdiff_t> indexes;
std::vector<ptrdiff_t> flagged;
public:
didx(size_t size) : indexes(size) {}
// loop through flagged items using iterators
auto begin() { return flagged.begin(); }
auto end() { return flagged.end(); }
void flag(ptrdiff_t index) {
if(!isflagged(index)) {
flagged.push_back(index);
indexes[index] = flagged.size();
}
}
void unflag(ptrdiff_t index) {
if(isflagged(index)) {
// swap last item with item to be removed in "flagged", update indexes accordingly
// in "flagged" we swap last element with element at index to be removed
auto idx = indexes[index]-1;
auto last_element = flagged.back();
std::swap(flagged.back(),flagged[idx]);
std::swap(indexes[index],indexes[last_element]);
// remove the element, which is now last in "flagged"
flagged.pop_back();
indexes[index] = 0;
}
}
bool isflagged(ptrdiff_t index) {
return indexes[index] > 0;
}
};

Which is the better way to delete an array member?

I'm learning OOP, so I have to interact with arrays, not linked list. I have sorted data. The problem is to delete a member of the array (let's call it DL). The 1st method I came up with was overwrite data at i+1 to istarting at DL's index and decrease the amount of reading by 1. Later I found out that I can swap the DLwith the last member then decrease the counting variable by 1. However, I'll have to sort the data again. So which one is better?

If it needs to stay sorted, I'd say it's better to overwrite it by shifting every element after your target back one. Swapping it with the end element and then resorting would require more work, as a swap requires three actions:
1) Copying element one to a temp variable.
2) Copying element two to element one.
3) Copying the temp element to element two.
And this needs to be repeated multiple times in a sorting algorithm. And if you're working with an array of objects of a struct or class with multiple private data member each, the workload increases even more.
The overwrite takes fewer moves per iteration:
1) Copy i + 1 to i.
So, Id definitely go with overwriting, by moving all elements back one and decreasing count by one.
At any rate, it's probably just best to time both, with your specific data set, and see which one is faster. This is really simple to do by counting the milliseconds between start and finish of your implementation.

"Better" is a very subjective term and which one is more suitable (for whatever definition you choose) depends a great deal on the sort of data sets you're talking about (size, etc).
But I will mention this, the relative time complexities of array shuffle and most "regular" sorts are respectively O(n) and O(n log n).
That means the shuffle is likely to be faster in the vast majority of cases.

How do you implement a linked list using an array

Now, I know you must be telling to yourself, "Why the heck would anyone even do that?" But, it's something that will give us a really insightful knowledge about some primitive stuff. Kindly unleash your talent.

It is a valid question - college level data structure question. And so the answer can be found in many data structures books. http://books.google.co.in/books/about/Data_Structures_Using_C.html?id=X0Cd1Pr2W0gC

The wording of your question makes it seem that you are aware of the difference between linked lists and arrays. So I'm going to skip that part.
The main point to remember in the implementation is that while linked lists have pointers to the next element, in an array this will automatically be the next index. So, one way of implementing is to store all the data points of the linked list in the array. If you have to insert or delete an element, then you would first have to create a space in the array to place them at, or remove the extra space created. In a linked list you could have simply changed the pointers for one/two nodes and you would be done. However, we can't do that in an array since we can't manipulate the next pointers in the array. So, a simple idea is to shift every element to the left or right by one step depending upon your choice of operation. In case of insertion, insert that element in the space created by shifting right. In case of deletion, shift everything to the right of the element to be deleted to the left by one index. Note that this way every insertion and deletion will be O(n).
An idea avoid these repeated shifts in case of deletion could be to replace the element to be deleted by a pre-decided character, say ''. So, while traversing the array, a '' can be interpreted as an empty space. This will avoid left shifts in case of deletion. Also, when the array is full, we can traverse the entire array and remove all the '*' and shift the elements in one pass.
Take care to introduce checks about array bounds.

C++ Fixed Size Container to Store Most Recent Values

I would like to know what the most suitable data structure is for the following problem in C++
I am wanting to store 100 floats ordered by recency. So when I add (push) a new item the other elements are moved up one position. Every time an event is triggered I receive a value and then add it to my data structure.
When the number of elements reaches 100, I would like to remove (pop) the item at the end (the oldest).
I want to able to iterate over all the elements and perform some mathematical operations on them.
I have looked at all the standard C++ containers but none of them fulfill all my needs. What's the easiest way to achieve this with standard C++ code?

You want a circular buffer. You can use Boost's implementation or make your own by allocating an array, and keeping track of the beginning and end of the used range. This boils down to doing indexing modulo 100.

Without creating your own or using a library, std::vector is the most efficient standard data structure for this. Once it has reached its maximum size, there will be no more dynamic memory allocations. The cost of moving up 100 floats is trivial compared to the cost of dynamic memory allocations. (This is why std::list is a slow data structure for this). There is no push_front function for vector. Instead you have to use v.insert(v.begin(), f)
Of course this assumes what you are doing is performance-critical, which it probably isn't. In that case I would use std::deque for more convenient usage.

Just saw that you need to iterator over them. Use a list.
Your basic function would look something like this
void addToList(int value){
list100.push_back(value);
if(list100.size() > 100){
list100.pop_front();
}
}
Iterating over them is easy as well:
for(int val : list100){
sum += val;
}
// Average, or whatever you need to do
Obviously, if you're using something besides int, you'll need to change that. Although this adds a little bit more functionality than you need, it's very efficient since it's a doubly linked list.
http://www.cplusplus.com/reference/list/list/

You can use either std::array, std::dequeue, std::list or std::priority_queue

A MAP (std::map) should be able to solve your requirement. Use Key as the object and value as the current push number nPusheCount which gets incremented whenever you add an element to map.
when adding a new element to map, if you have less than 100 elements, just add the number to the MAP as key and nPushCount as the value.
If you have 100 elements already, check if the number exists in map already and do following:
If the number already exists in map, then add the number as key and nPushCount as value;
If doesnt, delete the number with lowest nPushCount as value and then add the desired number with updated nPushCount.

Algorithm for merging short lists into a long vector

I have a sparse matrix class whose non-zeros and corresponding column indices are stored, in row-order, in what are basically STL-vector-like containers. They may have unused capacity, like vectors; and to insert/remove elements, existing elements must be moved.
Say I have an operation, insert_erase_replace, or ier for short. ier can do the following, given a position p, a column index j, and a value v:
if v==0, ier removes the entry at p and left-shifts all subsequent entries.
if v!=0, and j is already present at p, ier replaces the cell contents at p with v.
if v!=0, and j is not present at p, ier inserts the entry v and column index j at p after right-shifting all subsequent entries.
So all of that is trivial.
Now let's say I have ier2, which does the same thing, except that it takes a list containing multiple column indices j and corresponding values v. It also has a size n, which indicates how many index/value pairs are present in the list. But because the vector only stores non-zeros, sometimes the actual insertion size is smaller than n.
Still trivial.
But now let's say I have ier3, which takes not just one list like ier2, but multiple lists. This represents editing a slice of the sparse matrix.
At some point, it becomes more efficient to iterate through the vectors, copying them piece by piece and inserting/replacing/erasing the list indices/values ier2-style as we arrive at each insertion point. And if the total insertion size would cause my vector to need a resize anyway, then we do that.
Given that my vector is much, much larger than the total length of the lists, is there an algorithm for efficiently merging the lists into the vector?
So far, here's what I have:
Each list passed to ier3 represents either a net deletion of entries (a left shift), a net replacement (no movement, therefore cheap), or a net insertion of entries (a right shift). There may also be some re-arrangement of elements in there, but the expensive parts are the net deletions and net insertions.
It's not hard to figure out an algorithm for efficiently doing ONLY net insertions or net deletions.
It's harder when either of the two may be happening.
The only thing I can think to do is to handle it in two passes:
Erase/replace
Insert/replace
We erase first because it makes it more likely that any insertions will require fewer copies.
Is this the right approach? Does anyone know of a better one?

Okay, so I'm going to suppose the intervals covered in each list in ier3 are disjoint and given to you in order. If it's meant for editing slices of a matrix, this seems reasonable. I'm also assuming you that you don't need to resize the vector, because that case is easily detectable and solvable.
Initialise a read pointer and a write pointer to the start of the vector you're editing. There'll be an instruction pointer into ie3 too, but I'll ignore that here for clarity's sake. You'll also need a queue. At each step, one of several things can happen:
Default: Neither read nor write are at a position detailed by ier3. In this case, add the element under read to the back of the queue and write the element at the front of the queue to the cell under write. Move both pointers forward one.
read is over a cell that needs to be deleted. In this case, simply move read forward one without adding anything to the queue.
read passes from one cell to the next such that an insertion should happen between them. In this case, add the insertion to the back of the queue and then continue with the next relevant case.
read is at a cell that needs to be modified. In this case, insert the modified cell at the back of the queue, write whatever's at the front of the queue to write, and step them both forwards.
read has arrived at the unused capacity of the vector. In which case just write whatever's left in the queue.
That's the basic outline, but a couple of optimizations can be made: first, if the queue's empty, step both pointers forward to the next position detailed by ie3 without doing anything. Second, minimize the buffer by doing extra writing steps whenever read is ahead of write and the queue is nonempty.

I'd go with your plan with a few important points highlighted.
The erase/replace step should start from the left and only move points within the affected range - it can leave a "gap". It should determine the size of the final vector. At the end of this step, use the determined size to shift the "tail" of the vector as needed, leaving the exact amount of space required for insertions free.
The insertions should start from the right and fill up the gap we left in step 1 by copying each point to it's final position.
This will never shift the main vector once and never copy any point (from the existing slice or insertion set) more than twice so it's essentially linear.
Other data structures might be helpful too - reserving space at both the front and end, or building it out of multiple sections so a resize doesn't force a full copy.
One further optimisation would be to allow some insertions during step 1. If you've erased some, completing any insertion you come across immediately until it balances will prevent you needing to move any points until you reach another erase.

Let n be the size of the list and m be the size of the vector. It sounds like ier does a binary search for j every time, so the searching part is O(n*log(m)).
Assuming the elements in the list are sorted, once you find the first element, it's faster to just navigate up the vector to find the next one. That way searching becomes O(log(m) + n) = O(n).
Also, do a dry pass first to count net deletions/insertions, and a second pass to actually apply the changes. I think these two passes will run faster than the two you describe.

I can suggest a different design for a sparse matrix that should help you achieve performance and a low memory footprint for large sparse matrices.
Instead of vector, why not use a 2D hash table. something like (no std:: for smaller code):
typedef unordered_map< unsigned /* index */, int /* value */ > col_type;
unordered_map< unsigned /* index */, col_type*>; // may need to define hash function for col_type
the outer class (sparse_matrix) searches in O(1) for a column. If not found, it allocates a new column.
Then the column type is searched for the column index in O(1) and either delete/replace or insert based on the original logic. It can see if the column is now empty and delete it from the 'row' hash map.
all basic operations add/delete/replace are O(1).
If you need a fast ordered iteration of the matrix, you can replace the unordered_map with 'map'. If the matrix is very sparse, the O(nlog(n)) complexity will be similar to the hash_map's.
BTW I used pointer to the col_type on purse, the outer hash map grows much (much!) faster this way.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js