How to get the partition that a value belongs to in a partitioned interval? - C++

I have an interval that is partitioned into a large number of smaller partitions.
There aren't any gaps and there aren't any overlapping intervals.
E.g: (0;600) is separated into:
(0;10>
(10;25>
(25;100>
(100;125>
(125;550>
(550;600)
Now I have a large number of values and I need to get the partition id for each of them.
I can store an array of the values that partition this interval into smaller intervals.
But if all the values belong to the last partition, every lookup will have to pass through the whole array.
So I'm searching for a better way to store these intervals. I want something simple (circa 150 lines at most) and I don't want to use any library except std.

Since there are no "empty spaces" in your partitioning, the end of each partition is redundant (it's the same as the start of the next partition).
And since you have the partition list sorted, you can simply use binary search, with std::upper_bound.
See it in action.
Edit: Correction (upper_bound, not lower_bound).
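A minimal sketch with std::upper_bound, using the partition starts from the example above (the test values deliberately avoid the exact boundaries, where the choice between upper_bound and lower_bound depends on which end of each partition is closed):

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    // Start of each partition; the end of one partition is the start of the next.
    std::vector<int> starts = {0, 10, 25, 100, 125, 550};

    for (int value : {5, 17, 300, 580}) {
        // First start strictly greater than value; the partition containing
        // value is the one just before it.
        auto it = std::upper_bound(starts.begin(), starts.end(), value);
        std::size_t id = (it - starts.begin()) - 1;
        std::cout << value << " -> partition " << id << '\n';
    }
}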

You could just improve your search algorithm.
Put all the ranges in an array, and then use a binary search to find the right range.
It will cost O(log n), and it's really easy to implement.

Related

Optimum complexity when the search uses a key whose corresponding value is later used for sorting

Edit: the number of elements to be displayed can be user defined; it defaults to 10 but can be set to a very big number.
I have a file which I parse for words, then I need to count how many times each word appeared in the text and display the 10 words with the highest number of appearances (using C++).
I currently insert each parsed word into a std::map, the word is the key and the number of its appearances is the value. Each time I come across a word that is not part of the std::map I add it with the initial value of 1 and each time I come across a word that is part of the map I add 1 to its current value.
After I am done parsing the file I have a map with all the unique words in the text and the number of their appearances, but the map is not sorted by those counts.
At this point I can traverse the std::map and push its words into a priority queue (ordered with the minimum value at the top). Once the priority queue reaches its maximum capacity of 10 words, I check whether the value I am about to insert is bigger than the value at the top; if so, I pop the top and insert the new value (if not, I move on to the next value from the std::map).
Because each word appears only once(at this stage) I know for sure that each value at the priority queue is unique.
My question is: can this be done more efficiently in regards to complexity?
This is Python's collections.Counter, so you could look there for a real-world example. It essentially does the same thing you are doing: get counts by incrementing a dictionary, then heapq.nlargest on the (word, count) pairs. (A priority queue is a heap. I have no idea why they added a Q.)
Consider selecting the m largest/smallest out of N words. This should have a theoretical limit of O(N log m).
You should create the counts in O(N) time with an std::unordered_map. This is important, you don't care about sorting the words alphabetically, so don't use std::map here. If you use std::map, you're already at O(N log N) which is greater than the theoretical limit.
Now, when selecting the top 10, you need pretty much any method that only looks at 10 items at a time. A priority queue with a max size is a good option. The important point is that you don't track more than you need to. Your complexity here is O(N log m), which becomes O(N) in the special case when m is small compared to N. But the common mistake would be to include the whole data set when comparing items.
However, check if m >= N, because if you do need the whole data set, you can just call std::sort. I'm assuming you need them in order. If you didn't, this case would become really trivial. And check m == 1 so you can just use a plain max.
In conclusion, except for using the wrong map, I believe you've already met the theoretical limit in terms of big O complexity.
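A minimal sketch of that pipeline (reading whitespace-separated words from a hypothetical input.txt, counting with std::unordered_map, then keeping the top m in a bounded min-heap):

#include <fstream>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    const std::size_t m = 10;                      // how many top words to report
    std::unordered_map<std::string, long> counts;  // O(N) counting phase

    std::ifstream in("input.txt");                 // hypothetical file name
    std::string word;
    while (in >> word)
        ++counts[word];

    // Min-heap of (count, word) capped at m entries: O(N log m) overall.
    using Entry = std::pair<long, std::string>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> top;
    for (const auto& kv : counts) {
        top.emplace(kv.second, kv.first);
        if (top.size() > m)
            top.pop();                             // drop the current smallest
    }

    while (!top.empty()) {                         // printed smallest count first
        std::cout << top.top().second << ": " << top.top().first << '\n';
        top.pop();
    }
}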

How to efficiently gather the repeating elements in a given array?

I'd like to gather the duplicates in a given array. For example, I have an array like this:
{1,5,3,1,5,6,3}
and i want the result to be:
{3,3,1,1,5,5,6}
In my case, the number of clusters is unknown before the calculation, and the order doesn't matter.
I achieved this by using the built-in std::sort in C++. However, the ordering is not actually necessary, so I guess there are probably more efficient methods to accomplish it.
Thanks in advance.
First, construct a histogram noting frequencies of each number. You can use a dictionary to accomplish this in O(n) time and space.
Next, loop over the dictionary's keys (order is unimportant here) and for each one, write a number of instances of that key equal to the corresponding value.
Example:
{1,5,3,1,5,6,3} input
{1->2,5->2,3->2,6->1} histogram dictionary
{1,1,5,5,3,3,6} wrote two 1s, two 5s, two 3s, then one 6
This whole thing is O(n) time and space. Certainly you can't do better than O(n) time. Whether you can do better than O(n) space or not while maintaining O(n) time I cannot say.
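A minimal sketch of that approach, with std::unordered_map as the dictionary:

#include <iostream>
#include <unordered_map>
#include <vector>

// Groups equal values together; the order of the groups is unspecified.
std::vector<int> gather_duplicates(const std::vector<int>& input) {
    std::unordered_map<int, int> histogram;        // value -> frequency, O(n)
    for (int v : input)
        ++histogram[v];

    std::vector<int> result;
    result.reserve(input.size());
    for (const auto& kv : histogram)               // write each value kv.second times
        result.insert(result.end(), kv.second, kv.first);
    return result;
}

int main() {
    for (int v : gather_duplicates({1, 5, 3, 1, 5, 6, 3}))
        std::cout << v << ' ';                     // e.g. 3 3 6 5 5 1 1
    std::cout << '\n';
}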

Hash that returns the same value for all numbers in range?

I'm working on a problem where I have an entire table from a database in memory at all times; each row holds a low and a high bound, both 9-digit numbers. I'm given a 9-digit number that I need to use to look up the rest of the columns in the table based on whether that number falls in the range. For example, if the range was 100,000,000 to 125,000,000 and I was given the number 117,123,456, then I would know that I'm in the 100-125 mil range, and whatever vector of data that points to is what I will be using.
Now the best I can think of for lookup time is log(n) run time. This is OK, at best, but still pretty slow. The table has at least 100,000 entries and I will need to look up values in this table tens of thousands, if not hundreds of thousands, of times per execution of this application (10+ times/day).
So I was wondering if it was possible to use an unordered_set instead, writing my own Hash function that ALWAYS returns the same hash-value for every number in range. Using the same example above, 100,000,000 through 125,000,000 will always return, for example, a hash value of AB12CD. Then when I use the lookup value of 117,123,456, I will get that same AB12CD hash and have a lookup time of O(1).
Is this possible, and if so, any ideas how?
Thanks in advance.
Yes. Assuming that you can number your intervals in order, you could fit a polynomial to your cutoff values, and receive an index value from the polynomial. For instance, with cutoffs of 100,000,000, 125,000,000, 250,000,000, and 327,000,000, you could use points (100, 0), (125, 1), (250, 2), and (327, 3), restricting the first derivative to [0, 1]. Assuming that you have decently-behaved intervals, you'll be able to fit this with an (N+2)th-degree polynomial for N cutoffs.
Have a table of desired hash values; use floor[polynomial(i)] for the index into the table.
Can you write such a hash function? Yes. Will evaluating it be slower than a search? Well there's the catch...
I would personally solve this problem as follows. I'd have a sorted vector of all values. And then I'd have a jump table of indexes into that vector based on the value of n >> 8.
So now your logic is that you look in the jump table to figure out where you are jumping to and how many values you should consider. (Just look at where you land versus the next index to see the size of the range.) If the whole range goes to the same vector, you're done. If there are only a few entries, do a linear search to find where you belong. If they are a lot of entries, do a binary search. Experiment with your data to find when binary search beats a linear search.
A vague memory suggests that the tradeoff is around 100 or so because predicting a branch wrong is expensive. But that is a vague memory from many years ago, so run the experiment for yourself.
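A rough sketch of the jump-table idea (the answer suggests bucketing by n >> 8; the sketch uses a wider bucket just so the demo table stays tiny, and it assumes the ranges are described by their sorted lower bounds and cover everything from the first bound up):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

class RangeLookup {
public:
    explicit RangeLookup(std::vector<uint32_t> starts) : starts_(std::move(starts)) {
        std::size_t buckets = (starts_.back() >> kShift) + 2;
        jump_.resize(buckets);
        std::size_t idx = 0;
        for (std::size_t b = 0; b < buckets; ++b) {
            // Last range whose start is at or below the bucket's low end.
            while (idx + 1 < starts_.size() && starts_[idx + 1] <= (b << kShift))
                ++idx;
            jump_[b] = idx;
        }
    }

    // Index of the range containing value (assumes value >= starts_.front()).
    std::size_t lookup(uint32_t value) const {
        std::size_t b = std::min<std::size_t>(value >> kShift, jump_.size() - 2);
        std::size_t lo = jump_[b], hi = jump_[b + 1];
        // The answer's point: this window is small, so a binary search (or even
        // a linear scan) over [lo, hi] finishes quickly.
        auto first = starts_.begin() + lo;
        auto last = starts_.begin() + hi + 1;
        return std::size_t(std::upper_bound(first, last, value) - starts_.begin()) - 1;
    }

private:
    static constexpr unsigned kShift = 24;  // bucket width of ~16.7M; tune for your data
    std::vector<uint32_t> starts_;
    std::vector<std::size_t> jump_;
};

int main() {
    RangeLookup lut({100000000, 125000000, 250000000, 327000000});
    std::cout << lut.lookup(117123456) << '\n';   // 0: the 100-125 million range
}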

Not sure which data structure to use

Assuming I have the following text:
today was a good day and today was a sunny day.
I break this text up into words, separated by white space, which gives
Today
was
a
good
etc.
Now I use the vector data structure to simply count the number of words in a text via .size(). That's done.
However, I also want to check if a word comes up more than once, and if so, how many times. In my example "today" comes up 2 times.
I want to store that "today" and append a 2/x (depending on how often it comes up in a large text). That's not just for "today" but for every word in the text. I want to look up how often a word appears, append a counter, and sort it (the words + counters) in descending order (that's another thing, but not important right now).
I'm not sure which data structure to use here. A map perhaps? But I can't add counters to a map.
Edit: This is what I've done so far: http://pastebin.com/JncR4kw9
You should use a map. In fact, you should use an unordered_map.
unordered_map<string,int> will give you a hash table which will use strings as keys, and you can augment the integer to keep count.
unordered_map has the advantage of O(1) lookup and insertion over the O(logn) lookup and insertion of a map. This is because the former uses an array as a container whereas the latter uses some implementation of trees (red black, I think).
The only disadvantage of an unordered_map is that as mentioned in its name, you can't iterate over all the elements in lexical order. This should be clear from the explanation of their structure above. However, you don't seem to need such a traversal, and hence it shouldn't be an issue.
unordered_map<string,int> mymap;
mymap[word]++; // will increment the counter associated with the count of a word.
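A slightly fuller sketch of the same idea, reading whitespace-separated words from std::cin just for illustration:

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> counts;
    std::string word;
    while (std::cin >> word)        // each whitespace-separated token is a word
        ++counts[word];

    for (const auto& kv : counts)   // iteration order is unspecified
        std::cout << kv.first << " " << kv.second << '\n';
}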
Why not use two data structures? The vector you have now, and a map, using the string as the key, and an integer as data, which then will be the number of times the word was found in the text.
Sort the vector in alphabetical order.
Scan it and compare every word to those that follow, until you find a different one, and so on.
a, a, and, day, day, good, sunny, today, today, was, was
a: 2, and: 1, day: 2, good: 1, sunny: 1, today: 2, was: 2
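As a sketch, the sort-and-scan counting could look like this (the word list is just the example sentence split on spaces):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> words = {"today", "was", "a", "good", "day", "and",
                                      "today", "was", "a", "sunny", "day"};
    std::sort(words.begin(), words.end());

    // Each run of equal adjacent words is one distinct word; its length is the count.
    for (std::size_t i = 0; i < words.size(); ) {
        std::size_t j = i;
        while (j < words.size() && words[j] == words[i])
            ++j;
        std::cout << words[i] << ": " << (j - i) << '\n';
        i = j;
    }
}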
A better option to consider is a Radix Tree (https://en.wikipedia.org/wiki/Radix_tree),
which is quite memory efficient and, in the case of large text input, will perform better than the alternative data structures.
One can store the frequency of a word in the nodes of the tree. It will also reap the benefits of locality of reference for any text document.

How to store a sequence of timestamped data?

I have an application that needs to store a sequence of voltage data; each entry is something like a pair {time, voltage}.
The time is not necessarily continuous: if the voltage doesn't move, I will not have any reading.
The problem is that I also need a function that looks up a timestamp, like getVoltageOfTimestamp(float2second(922.325)).
My solution is to have a deque that stores the pairs; then, every 30 seconds, I take a sample and store the index into a map
std::map<interval_of_30_seconds, corresponding_index_of_deque>,
so inside getVoltageOfTimestamp(float2second(922.325)), I simply find the nearest interval_of_30_seconds to the desired time, move my deque pointer to that corresponding_index_of_deque, iterate from there, and find the correct voltage.
I am not sure whether there exists a more 'computer scientist' solution here; can anyone give me a clue?
You could use a binary search on your std::deque because the timestamps are in ascending order.
If you want to optimize for speed, you could also use a std::map<Timestamp, Voltage>. For finding an element, you can use upper_bound on the map and return the element before the one found by upper_bound. This approach uses more memory (because std::map<Timestamp, Voltage> has some overhead and it also allocates each entry separately).
Rather than using a separate map, you can do a binary search directly on the deque to find the closest timestamp. Given the complexity guarantees of std::map, doing a binary search will be just as efficient as a map lookup (both are O(log N)) and won't require the extra overhead.
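A minimal sketch of that binary search; the entry type and the "last reading at or before the requested time" semantics are assumptions, since the question only describes the entries as {time, voltage} pairs:

#include <algorithm>
#include <deque>
#include <iostream>

struct Sample {
    double time;      // seconds
    double voltage;
};

// Voltage of the last sample at or before t; since the voltage has not moved
// since that reading, it is still the current value at time t.
double getVoltageOfTimestamp(const std::deque<Sample>& samples, double t) {
    auto it = std::lower_bound(samples.begin(), samples.end(), t,
                               [](const Sample& s, double value) { return s.time < value; });
    if (it == samples.end())
        return samples.back().voltage;   // t is after the last reading
    if (it->time > t && it != samples.begin())
        --it;                            // step back to the sample at or before t
    return it->voltage;
}

int main() {
    std::deque<Sample> samples = {{0.0, 1.1}, {900.5, 2.4}, {930.0, 3.0}};
    std::cout << getVoltageOfTimestamp(samples, 922.325) << '\n';   // 2.4
}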
Do you mind using C++0x concepts? If not, deque<tuple<Time, Voltage>> will do the job.
One way you can improve over binary search is to exploit the sampling of your data. Assuming your samples come roughly every 30 milliseconds, store the readings in a vector/list as you get them. In another array, insert the index into that vector every 30 seconds. Now, given a timestamp, go to the index array to find roughly where the element is in the list, then go there and check the few elements preceding/succeeding it.
Hope this helps.