Effective data structure for both deleteMin and search by key operations - C++

I have 100 sets of A objects, each set corresponding to a query point Qi, 1 <= i <= 100.
class A {
    int id;
    int distance;
    float x;
    float y;
};
In each iteration of my algorithm, I select one query point Qi and extract from the corresponding set the object having the minimum distance value. Then, I have to find this specific object in all 100 sets, searching with its id, and remove all those objects.
If I use a heap for each set of objects, it is cheap to extract the object with the minimum distance. However, I will not be able to find the same object in the other heaps by searching with its id, because each heap is organized by distance. Further, updating a heap is expensive.
Another option I have considered is using a map<id, (distance, x, y)> for each set. This way, searching by id (the find operation) is cheap. However, extracting the element with the minimum distance takes linear time, since it has to examine every element in the map.
Is there any data structure that I could use that is efficient for both the operations I need?
extract_min(distance)
find(id)
Thanks in advance!

std::map or boost::multi_index
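For the boost::multi_index option, here is a sketch of one container with two ordered views, one keyed on id and one on distance. It restates A as a struct so the members are public; the tag names and the ASet alias are mine:

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>

struct A { int id; int distance; float x; float y; };

struct by_id {};        // tag for the id view
struct by_distance {};  // tag for the distance view

typedef boost::multi_index::multi_index_container<
    A,
    boost::multi_index::indexed_by<
        boost::multi_index::ordered_unique<
            boost::multi_index::tag<by_id>,
            boost::multi_index::member<A, int, &A::id> >,
        boost::multi_index::ordered_non_unique<
            boost::multi_index::tag<by_distance>,
            boost::multi_index::member<A, int, &A::distance> >
    >
> ASet;

// extract_min: the first element of the distance view, O(1) to reach:
//   const A& m = *s.get<by_distance>().begin();
// find(id): O(log n) through the id view:
//   s.get<by_id>().find(some_id);
// Erasing through either view keeps both views consistent.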

You could use a tree map.

One simple approach is to have two maps for each data set. The first one contains all the data items sorted by id. The second would be a multimap and map distance to id so that you could easily figure out what id the lowest distance corresponds to. This one would be ordered by distance to make finding the min cheap (since it would use distance as the key). You could use map instead of multimap if you know that distances will always be unique.
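A minimal sketch of the two-map idea for one set (the type and function names are mine):

#include <map>

struct Info { int distance; float x, y; };

// One pair of maps per set: by_id owns the data, by_distance is a
// distance -> id view kept in sync with it.
std::map<int, Info> by_id;
std::multimap<int, int> by_distance;

// extract_min: the smallest distance sits at by_distance.begin().
int extract_min() {
    auto m = by_distance.begin();
    int id = m->second;
    by_distance.erase(m);
    by_id.erase(id);
    return id;
}

// find(id) + removal: look up the distance via by_id, then erase the
// matching (distance, id) entry from the multimap.
void erase_by_id(int id) {
    auto it = by_id.find(id);
    if (it == by_id.end()) return;
    auto r = by_distance.equal_range(it->second.distance);
    for (; r.first != r.second; ++r.first)
        if (r.first->second == id) { by_distance.erase(r.first); break; }
    by_id.erase(it);
}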

In addition to including a map as suggested by many above, you could replace your minimum heap with a structure whose extract-min has constant runtime complexity; your current version has O(log_2(n)) extract-min. Since the range of your distances is small, you could use a "Dial's array" (a bucket queue): the keys work like counting sort. Because an array slot may hold more than one item, and you don't care about the order of equal-distance items, you would use a doubly linked list as the array's item data type. The Andrew Goldberg and Tarjan papers on faster Dijkstra's algorithms discuss this in more detail.
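A minimal sketch of such a bucket structure, assuming distances are small non-negative integers with a known upper bound (all names here are mine):

#include <list>
#include <vector>

// Bucket queue (Dial's structure): buckets[d] holds the ids of all
// items currently at distance d.
struct BucketQueue {
    std::vector<std::list<int> > buckets;
    int cursor;  // no bucket below this index is non-empty

    explicit BucketQueue(int max_dist) : buckets(max_dist + 1), cursor(0) {}

    void push(int id, int distance) {
        buckets[distance].push_back(id);
        if (distance < cursor) cursor = distance;
    }

    // Amortized O(1): the cursor only moves forward across calls,
    // except when push() inserts below it. Assumes the queue is
    // non-empty when called.
    int pop_min() {
        while (buckets[cursor].empty()) ++cursor;
        int id = buckets[cursor].front();
        buckets[cursor].pop_front();
        return id;
    }
};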

Related

Count of previously smaller elements encountered in an input stream of integers?

Given an input stream of numbers ranging from 1 to 10^5 (non-repeating), we need to be able to tell, at each point, how many numbers smaller than the current one have been previously encountered.
I tried using a set in C++ to maintain the elements already encountered and then taking upper_bound on the set for the current number. But upper_bound gives me an iterator to the element, and then I have to iterate through the set or use std::distance, which is again linear in time.
Can I maintain some other data structure or follow some other algorithm in order to achieve this task more efficiently?
EDIT: Found an older question related to Fenwick trees that is helpful here. Btw, I have now solved this problem using segment trees, taking hints from @doynax's comment.
How to use Binary Indexed tree to count the number of elements that is smaller than the value at index?
Regardless of the container you are using, it is a very good idea to maintain the elements as a sorted set, so that at any point you can get an element's index (or iterator) and know how many elements come before it.
You need to implement your own binary search tree. Each node should store counters for the total number of nodes in its left and right subtrees.
Insertion into the binary tree takes O(log n). During the insertion, the counters of all ancestors of the new element are incremented, also in O(log n).
The number of elements smaller than the new element can then be derived from the stored counters in O(log n).
So the total running time is O(n log n).
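As the question's edit notes, a Fenwick (binary indexed) tree achieves the same O(log n) counts with much less code than a hand-rolled augmented BST. A minimal sketch, assuming the values are distinct and lie in [1, 10^5] as stated:

#include <cstdio>
#include <vector>

struct Fenwick {
    std::vector<int> t;
    explicit Fenwick(int n) : t(n + 1, 0) {}
    void add(int i) {                    // mark value i as seen
        for (; i < (int)t.size(); i += i & -i) ++t[i];
    }
    int prefix(int i) const {            // count of seen values <= i
        int s = 0;
        for (; i > 0; i -= i & -i) s += t[i];
        return s;
    }
};

int main() {
    Fenwick seen(100000);
    int stream[] = {5, 2, 9, 4};
    for (int v : stream) {
        std::printf("%d smaller values seen before %d\n",
                    seen.prefix(v - 1), v);
        seen.add(v);
    }
}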
Keep your table sorted at each step and use binary search. When you look up the number just given by the input stream, binary search finds the position where it belongs, and that index equals the count of numbers smaller than it. Since inserting into a sorted array shifts elements, this algorithm takes O(n^2) time overall.
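A sketch of this count-then-insert step on a sorted std::vector (the helper name is mine); the element shifting inside insert() is what makes the whole algorithm O(n^2):

#include <algorithm>
#include <vector>

// Returns how many previously seen values are smaller than v, then
// inserts v so the vector stays sorted for the next call.
int countAndInsert(std::vector<int>& sorted, int v) {
    std::vector<int>::iterator it =
        std::lower_bound(sorted.begin(), sorted.end(), v);
    int smaller = (int)(it - sorted.begin());
    sorted.insert(it, v);   // O(n) shift
    return smaller;
}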
What if you used insertion sort to store each number into a linked list? Then you can count the number of elements less than the new one when finding where to put it in the list.
It depends on whether you want to use std or not. In certain situations, some parts of std are inefficient. (For example, std::vector can be considered inefficient in some cases due to the amount of dynamic allocation that occurs.) It's a case-by-case type of thing.
One possible solution here might be to use a skip list (a relative of linked lists), as it is easier and more efficient to insert an element into a skip list than into an array.
The skip list approach lets you use a binary-search-style descent to insert each new element (one cannot binary search a normal linked list). If you track the length with an accumulator and the skip list is indexable (its links store widths), returning the number of larger elements is as simple as length - index.
One more note on this approach: std::set::insert() is already O(log n) without needing a hint, so whether the skip list actually buys you efficiency is already in question.

Instant sort when putting a new value into an array in C++

I have a dynamically allocated array containing structs with a key/value pair. I need to write an update(key, value) function that puts a new struct into the array or, if a struct with the same key is already in the array, updates its value. Insert and update are combined in one function.
The problem is:
Before adding a struct I need to check if struct with this key already existing.
I can go through all elements of the array and compare keys (very slow).
Or I can use binary search, but (!) the array must be sorted.
So I tried sorting the array on each update (sloooow), or sorting it whenever the binary search function is called..... which is, again, on every update.
Finally, I thought that there must be a way of inserting a struct into the array so that it lands in the right place and the array always stays sorted.
However, I couldn't think of such an algorithm, so I came here to ask for some help, because Google refuses to read my mind.
I need to make my code faster because my array accepts more than 50,000 structs and I'm using bubble sort (because I'm dumb).
Take a look at red-black trees: http://en.wikipedia.org/wiki/Red%E2%80%93black_tree
They ensure the data is always sorted, with O(log n) complexity for inserts.
A binary heap will not suffice: a heap has no guaranteed sort order; the only guarantee is that the top element is the min (or the max).
One possible approach is to use a different data structure. There is no genuine need to keep the structs ordered; you only need to detect whether a struct with the same key exists, so the cost of maintaining order in a balanced tree (for instance via std::map) is excessive. A more suitable data structure is a hash table, which C++11 provides in the standard library under the somewhat obscure name std::unordered_map (http://en.cppreference.com/w/cpp/container/unordered_map).
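A minimal sketch of the hash-table version; the payload type is a placeholder since the question doesn't show the struct:

#include <unordered_map>

struct Value { int data; };                // placeholder payload

std::unordered_map<int, Value> table;

// Insert-or-update in one call: operator[] default-constructs the
// entry if the key is absent, then we overwrite it either way.
void update(int key, const Value& value) {
    table[key] = value;
}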
If you insist on using an array, a possible approach might be to combine these algorithms:
Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter)
Partial sort (http://en.cppreference.com/w/cpp/algorithm/partial_sort)
Binary search
Maintain two ranges in the array: first a range that is already sorted, then a range that is not yet sorted. When you insert a struct, first check with the Bloom filter whether a matching struct might already exist. If the Bloom filter gives a negative answer, just insert the struct at the end of the array; the sorted range does not change, and the unsorted range grows by one.
If the Bloom filter gives a positive answer, apply the partial sort algorithm to make the entire array sorted, then use binary search to check whether such an object actually exists. If so, replace that element. After this the sorted range is the entire array and the unsorted range is empty.
If the binary search shows that the Bloom filter was wrong and no matching struct is there, just put the new struct at the end of the array. After that the sorted range is the entire array minus one element, and the unsorted range is that last element.
Each time you insert an element, binary search to find if it exists. If it doesn't exist, the binary search will give you the index at which you can insert it.
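A sketch of that insert-or-update with std::lower_bound on a sorted std::vector (the Item struct and the update signature are assumptions):

#include <algorithm>
#include <vector>

struct Item { int key; int value; };

// Insert-or-update into a vector kept sorted by key. lower_bound
// returns either the matching element or the insertion position.
void update(std::vector<Item>& v, int key, int value) {
    std::vector<Item>::iterator it = std::lower_bound(
        v.begin(), v.end(), key,
        [](const Item& a, int k) { return a.key < k; });
    if (it != v.end() && it->key == key)
        it->value = value;                  // key exists: update in place
    else
        v.insert(it, Item{key, value});     // keep the array sorted
}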
You could use std::set, which does not allow duplicate elements and places elements in sorted position. This assumes that you are storing the key and value in a struct, and not separately. In order for the sorting to work properly, you will need to define a comparison function for the structs.

Creating a fast dictionary

For a school task I have to create a graph and do some stuff with it. In the input, each vertex has an ID that is a number from 0 to 999'999'999. As I cannot create an array that large, I can't really use this ID as an index into an adjacency matrix.
My first solution was to create a separate ID, independent of the original one, and store the mapping in some kind of dictionary/map thing, but as I get 10'000 vertex records the lookup is probably bound to get slow. The algorithm has to be under O(n^2), and I already have a BFS and a toposort in there.
What would be the best solution in this case? As a side note, I can't use already established libraries (so no graph, map, vector, string classes, etc.), but I can code them myself if that is the best option.
What you want is a binary search tree, for lookups in O(log n) time, or a hash map, for lookups in ~O(1) time. Otherwise you can go the array route, in which case the size of your array would be the max value your ID can have (in your case, 10^9).
As @amit told you, check AVL/red-black trees and hash maps. There's no better way to do lookups in a graph below O(n) unless you can change the topology of the graph to turn it into a "search graph".
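Since standard containers are off-limits here, a hand-rolled hash table is small enough to write yourself. A minimal open-addressing sketch mapping a 9-digit vertex ID to a compact index, assuming at most 10'000 vertices (the capacity and hash constant are my choices; call init() once first):

const int CAP = 32768;            // power of two, comfortably > 2 * 10'000
long long keys[CAP];              // -1 marks an empty slot
int values[CAP];

void init() { for (int i = 0; i < CAP; ++i) keys[i] = -1; }

int slot(long long id) {
    int h = (int)((id * 2654435761LL) & (CAP - 1));  // multiplicative hash
    while (keys[h] != -1 && keys[h] != id)
        h = (h + 1) & (CAP - 1);                     // linear probing
    return h;
}

void put(long long id, int index) { int h = slot(id); keys[h] = id; values[h] = index; }
int  get(long long id) { int h = slot(id); return keys[h] == id ? values[h] : -1; }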
Why do you need to create an array of size 1 billion? You can simply create an adjacency matrix or adjacency list sized to the number of nodes.
Whether the number of vertices is constant or not, I'd suggest you go for an adjacency list. For example, if you have 10 nodes, you create an array of size 10, then for each node a list of its edges, as you can see in the link above.
Consider this graph: do you really think you need to have 10^10 elements in the adjacency list instead of 4?

Quickly get an element from vector2 given an element from vector1

I have two vectors, vector<DataPoint> data and vector<string> labels, where DataPoint is just a vector of floats: typedef vector<float> DataPoint. Each datapoint data[i] has its associated label labels[i].
Is there any way to quickly get the label of a given datapoint x? Something like string getLabel(DataPoint x){...} which is fast.
The best you can hope for when locating your DataPoint's index in data is O(log n) complexity (using a binary search), and only if your data vector is sorted. Otherwise it's a linear search, O(n).
The crux of the problem is that you have two vectors containing related data, which is always a pain to manage (and a strong hint of bad design). Better to replace both vectors with a vector<LabeledDataPoint> (a structure with two members: a DataPoint and a string).
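A minimal sketch of that refactor, using the struct name the answer suggests:

#include <string>
#include <vector>

typedef std::vector<float> DataPoint;

// Keeping each point and its label together means they can never
// get out of sync, unlike the two parallel vectors.
struct LabeledDataPoint {
    DataPoint point;
    std::string label;
};

std::vector<LabeledDataPoint> data;   // replaces data + labels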
A few notes: you can sort a vector with std::sort() and search a pre-sorted vector with std::binary_search(); std::unordered_map is the C++11 hash table; std::map is a binary tree, which you could use for insertion-time sorting with O(log2N) lookup, insertion and erase. Google any of them for docs.
With your existing data structures, if data is pre-sorted, then lookup is O(log2N) where N is data.size(), assuming that comparing unequal DataPoints need on average only examine the first float or two. Unsorted, it's O(N).
Clearly the performance issue is not having to look in labels after the common index is known: it's finding out what that index is, given a DataPoint object from outside the data vector.
If sorting is undesirable or O(log2N) is still too slow, you could consider putting the DataPoints into a hash table with their labels.
In the unlikely case that the performance issue is only due to your DataPoints regularly starting with the same leading sequence of floats, then (assuming no trivial solution like comparing from the back of the vector towards the front) you could create some kind of hash or sum of the elements to compare first, only doing a float-by-float comparison once that is already known to be equal.
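A sketch of the sorted-lookup idea using std::map, which works here because std::vector already defines lexicographic operator<. Note the lookup relies on exact float equality, so x must be one of the stored points; the empty-string fallback is my choice:

#include <map>
#include <string>
#include <vector>

typedef std::vector<float> DataPoint;

std::map<DataPoint, std::string> labelOf;   // O(log N) lookups

std::string getLabel(const DataPoint& x) {
    std::map<DataPoint, std::string>::const_iterator it = labelOf.find(x);
    return it != labelOf.end() ? it->second : "";
}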
Old Answer (it is about getting the values (DataPoint instances) easily):
Why don't you use a map, with labels as keys and DataPoints as values? This way you have the data associated, and depending on the map type you get different complexities: a tree map has O(log n) lookup, while a hash map has O(1) expected and O(n) worst case. Use whichever works better for you.
For more information on maps and their complexities, look here too: multiset, map and hash map complexity
UPDATE:
To get the label for each DataPoint, one idea is to create a separate class (for example DataContainer) that contains as private members your vector of DataPoint instances and a string that contains your label with the appropriate setters/getters.
#include <string>

class DataContainer {
private:
    DataPoint mDataPoint;
    std::string mLabel;
public:
    DataContainer(DataPoint dataPoint, std::string label)
        : mDataPoint(dataPoint), mLabel(label) {}

    void setDataPoint(DataPoint dataPoint) {
        mDataPoint = dataPoint;
    }
    void setLabel(std::string label) {
        mLabel = label;
    }
    DataPoint getDataPoint() {
        return mDataPoint;
    }
    // This getter does the job, with O(1) complexity.
    std::string getLabel() {
        return mLabel;
    }
};
This way, you can put your DataContainer in any structure you want (and I suggest a map if you similarly want to look entries up by key), setting the label on instantiation and getting it with the getter method with O(1) complexity.
As you can see, your question needs to be approached differently, and there are some ways to do it.

How to store a sequence of timestamped data?

I have an application that needs to store a sequence of voltage data; each entry is something like a pair {time, voltage}.
The time is not necessarily continuous: if the voltage doesn't move, I will not have any reading.
The problem is that I also need a function that looks up a timestamp, like getVoltageOfTimestamp(float2second(922.325)).
My solution is to have a deque that stores the pairs, and then every 30 seconds take a sample and store the index into a map,
std::map<interval_of_30_seconds, corresponding_index_of_deque>,
so inside getVoltageOfTimestamp(float2second(922.325)) I simply find the nearest interval_of_30_seconds to the desired time, move my deque pointer to the corresponding_index_of_deque, and iterate from there to find the correct voltage.
I am not sure whether there exists a more 'computer scientist' solution here; can anyone give me a clue?
You could use a binary search on your std::deque because the timestamps are in ascending order.
If you want to optimize for speed, you could also use a std::map<Timestamp, Voltage>. For finding an element, you can use upper_bound on the map and return the element before the one found by upper_bound. This approach uses more memory (because std::map<Timestamp, Voltage> has some overhead and it also allocates each entry separately).
Rather than use a separate map, you can do a binary search directly on the deque to find the closest timestamp. Given the complexity guarantees of std::map, a binary search will be just as efficient as a map lookup (both are O(log N)) and won't require the extra overhead.
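A sketch of that binary search, assuming the deque is kept in ascending time order and the voltage holds until the next reading (the names are mine):

#include <algorithm>
#include <deque>
#include <utility>

typedef std::pair<double, double> Sample;   // {time in seconds, voltage}

std::deque<Sample> samples;                 // ascending by time

bool timeLess(const Sample& s, double t) { return s.first < t; }

// Returns the voltage of the last sample at or before t. Assumes
// samples is non-empty and t is not before the first timestamp.
double getVoltageOfTimestamp(double t) {
    std::deque<Sample>::const_iterator it =
        std::lower_bound(samples.begin(), samples.end(), t, timeLess);
    if (it == samples.end() || it->first > t)
        --it;                               // step back to the sample <= t
    return it->second;
}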
Do you mind using C++0x concepts? If not, deque<tuple<Time, Voltage>> will do the job.
One way to improve over binary search is to exploit the sampling pattern of your data. Assuming your samples arrive roughly every 30 milliseconds, store the readings in a vector/list as you get them. In a second array, insert the current index of the first one every 30 seconds. Now, given a timestamp, go to the index array to find roughly where the element sits in the list, then go there and check the few elements preceding/succeeding it.
Hope this helps.