Find the lowest unused number - c++

I've set up a std::map to map some numbers. At this point I know which numbers I'm mapping from and to, e.g.:
std::map<int, int> myMap;
myMap[1] = 2;
myMap[2] = 4;
myMap[3] = 6;
Later, however, I want to map some numbers to the lowest possible number that is not yet used as a value in the map, e.g.:
myMap[4] = getLowestFreeNumberToMapTo(myMap); // I'd like this to return 1
myMap[5] = getLowestFreeNumberToMapTo(myMap); // I'd like this to return 3
Any easy way of doing this?
I considered building an ordered list of numbers as I added them to the map so I could just look for 1, not find it, use it, add it etc.

Something like
typedef std::set<int> SetType;
SetType used; // The already used numbers
int freeCounter = 1; // The first available free number
void AddToMap(int i)
{
used.insert(i);
// Actually add the value to map
}
int GetNewNumber()
{
SetType::iterator iter = used.lower_bound(freeCounter);
while (iter != used.end() && *iter == freeCounter)
{
++iter;
++freeCounter;
}
return freeCounter++;
}
If your map is quite big but sparse, this will work in O(log N), where N is the number of items in the map: in most cases you won't have to iterate through the set at all, or will take only a few steps.
Otherwise, if there are few gaps in the map, you would do better to keep a set of free items in the range [1..maxValueInTheMap].
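A minimal sketch of that second option, assuming numbers start at 1. The class and member names (FreeNumbers, markUsed, takeLowest, release) are made up for illustration; the idea is only that the free numbers below the highest number seen so far live in a std::set, so the lowest free one is always the first element.

```cpp
#include <set>

// Hypothetical helper: tracks which numbers in [1..maxSeen_] are free.
class FreeNumbers {
public:
    // Record that n (n >= 1) is now in use.
    void markUsed(int n) {
        // Grow the tracked range so every number up to n is known.
        while (maxSeen_ < n) free_.insert(++maxSeen_);
        free_.erase(n);
    }
    // Return the lowest unused number and mark it as used.
    int takeLowest() {
        if (free_.empty()) return ++maxSeen_;  // no gaps left: extend the range
        int n = *free_.begin();
        free_.erase(free_.begin());
        return n;
    }
    // Give a number back so it can be handed out again.
    void release(int n) {
        if (n >= 1 && n <= maxSeen_) free_.insert(n);
    }
private:
    std::set<int> free_;  // free numbers within [1..maxSeen_]
    int maxSeen_ = 0;     // highest number handed out or marked used so far
};
```

With the values from the question (2, 4 and 6 marked as used), successive calls to takeLowest() would hand out 1, 3 and 5 before moving past 6.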

Finding the lowest unused number is a very common operation in UNIX kernels, as every open/socket/etc. syscall is supposed to bind to the lowest unused FD number.
On Linux, the algorithm in fs/file.c#alloc_fd is:
keep track of next_fd, a low water mark -- it is not necessarily 100% accurate
whenever a FD is freed, next_fd = min(fd, next_fd)
to allocate a FD, start searching the bitmap starting from next_fd -- lib/find_next_bit.c#find_next_zero_bit is linear but still very fast, because it takes BITS_PER_LONG strides at a time
after a FD is allocated, next_fd = fd + 1
FreeBSD's sys/kern/kern_descrip.c#fdalloc follows the same idea: start with int fd_freefile; /* approx. next free file */, and search the bitmap upwards.
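A user-space sketch of the same low-water-mark idea, not the kernel code itself. SlotTable and its members are hypothetical names, and std::vector<bool> stands in for the kernel bitmap; the scan here goes one slot at a time rather than a word at a time.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical user-space analogue of an FD table with a low-water mark.
class SlotTable {
public:
    // Allocate the lowest free slot.
    std::size_t allocate() {
        std::size_t fd = next_;
        // Linear scan upwards from the low-water mark.
        while (fd < used_.size() && used_[fd]) ++fd;
        if (fd == used_.size()) used_.push_back(false);  // grow the table
        used_[fd] = true;
        next_ = fd + 1;  // every slot below fd + 1 is now in use
        return fd;
    }
    // Free a slot and lower the mark if needed.
    void release(std::size_t fd) {
        if (fd >= used_.size() || !used_[fd]) return;
        used_[fd] = false;
        if (fd < next_) next_ = fd;  // next_ = min(fd, next_)
    }
private:
    std::vector<bool> used_;  // bitmap: true means "in use"
    std::size_t next_ = 0;    // low-water mark: no free slot below it
};
```

In this sketch the mark is kept exact (no slot below it is ever free), which is slightly stricter than the "not necessarily 100% accurate" mark described above.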
However, these are all operating under the assumption that most processes have few FDs open, and very, very few have thousands. If the numbers will go much higher, with sparse holes, the common solution (as far as I've seen) is
#include <algorithm>
#include <functional>
#include <vector>
using namespace std;
int high_water_mark = 0;
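vector<int> unused_numbers; // min-heap of recycled numbers
int get_new_number() {
if (unused_numbers.empty())
return high_water_mark++;
// move the smallest recycled number to the back, then take it
pop_heap(unused_numbers.begin(), unused_numbers.end(), greater<int>());
int number = unused_numbers.back();
unused_numbers.pop_back();
return number;
}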
void recycle_number(int number) {
unused_numbers.push_back(number);
push_heap(unused_numbers.begin(), unused_numbers.end(), greater<int>());
}
(untested code... idea is: keep a high water mark; try to steal from unused below the high water mark, or up the high water mark otherwise; return freed to unused)
and if your assumption is that the used numbers will be sparse, then Dmitry's solution makes more sense.

I'd use a bidirectional map class for this problem. That way you can simply check if value 1 exists etc.
Edit
The benefit of using a bimap is that robust implementations of it already exist, and even if searching for the next free number is O(n), that is only an issue if n is large (or possibly if n is moderate and this is called very frequently). Overall this makes for a simple implementation that is unlikely to be error-prone and is easily maintainable.
If n is large or this operation is performed very frequently, then investing the effort of implementing a more advanced solution is merited. IMHO.
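To illustrate the simple probing approach without pulling in a real bimap library, here is a rough stand-in: a forward std::map plus a std::set of the values in use. NumberMap, assign and lowestFreeValue are hypothetical names, and the sketch assumes the mapping is one-to-one, as a bimap would enforce.

```cpp
#include <map>
#include <set>

// Hypothetical stand-in for a bidirectional map.
// Assumes each value is mapped to by at most one key.
struct NumberMap {
    std::map<int, int> forward;  // key -> value
    std::set<int> values;        // every value currently mapped to

    void assign(int key, int value) {
        std::map<int, int>::iterator it = forward.find(key);
        if (it != forward.end()) values.erase(it->second);  // drop the old value
        forward[key] = value;
        values.insert(value);
    }
    // Probe 1, 2, 3, ... until a value is found that is not in use.
    int lowestFreeValue() const {
        int candidate = 1;
        while (values.count(candidate)) ++candidate;
        return candidate;
    }
};
```

The probe takes a number of steps proportional to the length of the used run starting at 1, which is the O(n) cost mentioned above.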

Related

Efficiency of an algorithm for scrambled input

I am currently writing a program in C++ (it's done for the most part) that takes in a file with numbered indices and then outputs a scrambled quiz based on the initial input, so that no two are, theoretically, the same.
This is the code
// There has to be a more efficient way of doing this...
for (int tempCounter(inputCounter);
inputCounter != 0;
/* Blank on Purpose*/) {
randInput = (rand() % tempCounter) + 1;
inputIter = find (scrambledArray.begin(),
scrambledArray.end(),
randInput);
// Checks if the value passed in is within the given vector, no duplicates.
if (inputIter == scrambledArray.end()) {
--inputCounter;
scrambledArray.push_back(randInput);
}
}
The first comment states my problem. It won't matter under normal circumstances, but what if this were applied in a larger application? The code works, but it would be highly inefficient should the user want to scramble 10000 or so results.
I'm not asking about shortening or compacting the code to make it prettier. I was teaching someone, and upon getting to this point I concluded that this could be done in a much better way; I just don't know which way that would be...
So you want just the numbers 1..N shuffled? Yes, there is a more efficient way of doing that. You can use std::iota to construct your vector:
// first, construct your vector:
std::vector<int> scrambled(N);
std::iota(scrambled.begin(), scrambled.end(), 1);
And then std::shuffle it:
std::shuffle(scrambled.begin(), scrambled.end(),
std::mt19937{std::random_device{}()});
If you don't have C++11, the above would look like:
std::vector<int> scrambled;
scrambled.reserve(N);
for (int i = 1; i <= N; ++i) {
scrambled.push_back(i);
}
std::random_shuffle(scrambled.begin(), scrambled.end());

Is there any way of optimising this function?

This piece of code seems to be the worst offender in terms of time in my program. What my program is trying to do is find the minimum number of individual "nodes" required to satisfy a network with two constraints:
Each node must connect to x number of other nodes
Each node must have y degrees of separation between it and each of the nodes it's connected to.
However, for values of x greater than 600 this task takes a very long time. The task is on the order of exponential anyway, so I expect it to take forever at some point, but that also means that any small improvement made here would speed up the entire program by a lot.
uniint = unsigned long long int (64-bit)
network is a vector of the form vector<vector<uniint>>
The piece of code:
/* Checks if id2 is in id1's list of connections */
inline bool CheckIfInList (uniint id1, uniint id2)
{
uniint id1size = network[id1].size();
for (uniint itr = 0; itr < id1size; ++itr)
{
if (network[id1][itr] == id2)
{
return true;
}
}
return false;
}
The only way is to sort the network[id1] array when you build it.
If you arrive here with a sorted array, you can easily find what you are looking for, if it exists, using a binary (dichotomic) search.
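A small sketch of the sorted-list idea, assuming each connection list is kept sorted in ascending order as it is built. The function names are made up; uniint is the typedef from the question.

```cpp
#include <algorithm>
#include <vector>

typedef unsigned long long uniint;

// Insert id into an already sorted connection list, keeping it sorted.
void AddConnectionSorted(std::vector<uniint>& connections, uniint id) {
    connections.insert(
        std::lower_bound(connections.begin(), connections.end(), id), id);
}

// O(log n) membership test; requires connections to be sorted ascending.
bool IsInSortedList(const std::vector<uniint>& connections, uniint id) {
    return std::binary_search(connections.begin(), connections.end(), id);
}
```

The lookup drops from O(n) to O(log n), at the price of an O(n) insertion, so this pays off when lookups far outnumber insertions.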
Use std::map or std::unordered_map for fast search. I guess it's impossible to MICRO optimize this code, std::vector is cool. But not for 600 elements search.
I'm guessing CheckIfInList() is called in a loop? Perhaps a vector is not the best choice, you could try vector<set<uniint>>. This will give you O(log n) for a look up of the inner collection instead of O(n)
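A sketch of what that could look like, assuming the network is passed in explicitly rather than being a global as in the question. The Network typedef is introduced here only for illustration.

```cpp
#include <set>
#include <vector>

typedef unsigned long long uniint;

// Hypothetical replacement for vector<vector<uniint>>.
typedef std::vector<std::set<uniint> > Network;

// Checks if id2 is in id1's set of connections: O(log n) per lookup.
bool CheckIfInList(const Network& network, uniint id1, uniint id2) {
    return id1 < network.size() && network[id1].count(id2) != 0;
}
```

Swapping std::set for std::unordered_set (C++11) would give average O(1) lookups instead, if ordering of the connections is not needed.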
For quick micro-optimization, check whether your compiler optimizes the multiple calls to network[id1] away. If not, that is where you lose a lot of time, so remember the address:
vector<uniint>& connectedNodes = network[id1];
uniint id1size = connectedNodes.size();
for (uniint itr = 0; itr < id1size; ++itr)
{
if (connectedNodes[itr] == id2)
{
return true;
}
}
return false;
If your compiler already took care of that, I'm afraid that there's not much you can micro optimize about this method. The only real optimization can be achieved on the algorithmic level, starting with sorting the neighbour lists, moving on to using unordered_map<> instead of vector<>, and ending with asking yourself whether you can't somehow reduce the number of calls to CheckIfInList().
This is not as effective as HAL9000's suggestion, but it is good for cases when you have an unsorted list/array. You can ask fewer questions in each iteration if you put the value you are looking for at the end of the vector as a sentinel.
vector<uniint>& connectedNodes = network[id1];
uniint id1size = connectedNodes.size();
connectedNodes.push_back(id2); // sentinel: guarantees the loop terminates
uniint itr = 0;
while (connectedNodes[itr] != id2) ++itr;
connectedNodes.pop_back(); // remove the sentinel again
return itr != id1size; // stopped before the sentinel => id2 was really in the list
If itr stops before the sentinel, id2 was already in the list.
This way you don't need to check on every iteration whether you have reached the end of the list.

Time complexity issues with multimap

I created a program that finds the median of a list of numbers. The list of numbers is dynamic in that numbers can be removed and inserted (duplicate numbers can be entered) and during this time, the new median is re-evaluated and printed out.
I created this program using a multimap because
1) the benefit of it already being sorted,
2) easy insertion, deletion, searching (since multimap implements binary search)
3) duplicate entries are allowed.
The constraints for the number of entries + deletions (represented as N) are: 0 < N <= 100,000.
The program I wrote works and prints out the correct median, but it isn't fast enough. I know that unordered_multimap is faster than multimap, but the problem with unordered_multimap is that I would have to sort it, because to find the median you need a sorted list. So my question is, would it be practical to use an unordered_multimap and then quicksort the entries, or would that just be ridiculous? Would it be faster to just use a vector, quicksort the vector, and use a binary search? Or maybe I am forgetting some fabulous solution out there that I haven't even thought of.
Though I'm not new to C++, I will admit that my skills with time complexity are somewhat mediocre.
The more I look at my own question, the more I'm beginning to think that just using a vector with quicksort and binary search would be better since the data structures basically already implement vectors.
If you have only a few updates, use an unsorted std::vector plus the std::nth_element algorithm, which is O(N) on average. You don't need full sorting, which is O(N log N).
live demo of nth_element:
#include <algorithm>
#include <iterator>
#include <iostream>
#include <ostream>
#include <vector>
using namespace std;
template<typename RandomAccessIterator>
RandomAccessIterator median(RandomAccessIterator first,RandomAccessIterator last)
{
RandomAccessIterator m = first + distance(first,last)/2; // handle even middle if needed
nth_element(first,m,last);
return m;
}
int main()
{
vector<int> values = {5,1,2,4,3};
cout << *median(begin(values),end(values)) << endl;
}
Output is:
3
If you have many updates and only remove from the middle, use two heaps as comocomocomocomo suggests. If you use a fibonacci_heap, removing from an arbitrary position is O(N) if you don't have a handle to the element.
If you have many updates and need O(log N) removal from arbitrary places, use two multisets as ipc suggests.
If your purpose is to keep track of the median on the fly, as elements are inserted/removed, you should use a min-heap and a max-heap. Each one would contain one half of the elements... There was a related question a couple of days ago: How to implement a Median-heap
Though, if you need to search for specific values in order to remove elements, you still need some kind of map.
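A sketch of the two-heap idea for the insert-only case, using std::priority_queue. RunningMedian is a made-up name, and the sketch reports the lower median when the count is even. As noted above, it does not support removing arbitrary values.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Insert-only running median (lower median for even counts).
class RunningMedian {
public:
    void insert(int value) {
        if (low_.empty() || value <= low_.top()) low_.push(value);
        else high_.push(value);
        // Rebalance: low_ holds as many elements as high_, or one more.
        if (low_.size() > high_.size() + 1) {
            high_.push(low_.top());
            low_.pop();
        } else if (high_.size() > low_.size()) {
            low_.push(high_.top());
            high_.pop();
        }
    }
    bool empty() const { return low_.empty(); }
    // Precondition: !empty().
    int median() const { return low_.top(); }
private:
    std::priority_queue<int> low_;  // max-heap holding the lower half
    std::priority_queue<int, std::vector<int>, std::greater<int> > high_;  // min-heap holding the upper half
};
```

Each insert costs O(log N) and reading the median is O(1).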
You said that it is slow. Are you iterating from the beginning of the map to the (N/2)'th element every time you need the median? You don't need to. You can keep track of the median by maintaining an iterator pointing to it at all times and a counter of the number of elements less than that one. Every time you insert/remove, compare the new/old element with the median and update both iterator and counter.
Another way of seeing it is as two multimaps containing half the elements each. One holds the elements less than the median (or equal) and the other holds those greater. The heaps do this more efficiently, but they don't support searches.
If you only need the median a few times you can use the "select" algorithm. It is described in Sedgewick's book. It takes O(n) time on average. It is similar to quick sort but it does not sort completely. It just partitions the array with random pivots until, eventually, it gets to "select" on one side the smaller m elements (m=(n+1)/2). Then you search for the greatest of those m elements, and this is the median.
Here is how you could implement that in O(log N) per update:
template <typename T>
class median_set {
public:
std::multiset<T> below, above;
// O(log N)
void rebalance()
{
int diff = above.size() - below.size();
if (diff > 0) {
below.insert(*above.begin());
above.erase(above.begin());
} else if (diff < -1) {
above.insert(*below.rbegin());
below.erase(below.find(*below.rbegin()));
}
}
public:
// O(1)
bool empty() const { return below.empty() && above.empty(); }
// O(1)
T const& median() const
{
assert(!empty());
return *below.rbegin();
}
// O(log N)
void insert(T const& value)
{
if (!empty() && value > median())
above.insert(value);
else
below.insert(value);
rebalance();
}
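// O(log N); does nothing if value is not present
void erase(T const& value)
{
if (empty()) return;
std::multiset<T>& side = value > median() ? above : below;
typename std::multiset<T>::iterator it = side.find(value);
if (it == side.end()) return; // erasing end() would be undefined behaviour
side.erase(it);
rebalance();
}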
};
(Work in action with tests)
The idea is the following:
Keep track of the values above and below the median in two sets
If a new value is added, add it to the corresponding set. Always ensure that the set below has exactly 0 or 1 more elements than the other
If a value is removed, remove it from the set and make sure that the condition still holds.
You can't use priority_queues because they won't let you remove one item.
Can anyone help me with the space and time complexity of my following C# program, with details?
//Passing Integer array to Find Extreme from that Integer Array
public int extreme(int[] A)
{
int N = A.Length;
if (N == 0)
{
return -1;
}
else
{
int average = CalculateAverage(A);
return FindExtremes(A, average);
}
}
// Calculate Average of integerArray
private int CalculateAverage(int[] integerArray)
{
int sum = 0;
foreach (int value in integerArray)
{
sum += value;
}
return Convert.ToInt32(sum / integerArray.Length);
}
//Find Extreme from that Integer Array
private int FindExtremes(int[] integerArray, int average) {
int Index = -1; int ExtremeElement = integerArray[0];
for (int i = 0; i < integerArray.Length; i++)
{
int absolute = Math.Abs(integerArray[i] - average);
if (absolute > ExtremeElement)
{
ExtremeElement = integerArray[i];
Index = i;
}
}
return Index;
}
You are almost certainly better off using a vector. Possibly maintaining an auxiliary vector of indexes to be removed between median calculations so you can delete them in batches. New additions can also be put into an auxiliary vector, sorted, then merged in.
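A rough sketch of the batching idea for additions only, under the assumption that medians are requested much less often than values are added. BatchedMedian is a hypothetical name; removals would need a similar pending buffer and are left out here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical container: a sorted vector plus a buffer of pending additions.
class BatchedMedian {
public:
    void add(int value) { pending_.push_back(value); }  // O(1), merge is deferred

    // Precondition: at least one value has been added.
    int median() {
        flush();
        return sorted_[(sorted_.size() - 1) / 2];  // lower median
    }
private:
    // Sort the pending values, then merge them into the sorted vector.
    void flush() {
        if (pending_.empty()) return;
        std::sort(pending_.begin(), pending_.end());
        std::size_t mid = sorted_.size();
        sorted_.insert(sorted_.end(), pending_.begin(), pending_.end());
        std::inplace_merge(sorted_.begin(), sorted_.begin() + mid, sorted_.end());
        pending_.clear();
    }
    std::vector<int> sorted_;   // always sorted ascending
    std::vector<int> pending_;  // additions since the last median() call
};
```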

no duplicate function for a lottery program

Right now I'm trying to make a function that checks whether the user’s selection is already in the array, and if it is, tells them to choose a different number. How can I do this?
Do you mean something like this?
bool CheckNumberIsValid()
{
for(int i = 0 ; i < array_length; ++i)
{
if(array[i] == user_selection)
return false;
}
return true;
}
That should give you a clue, at least.
What's wrong with std::find? If you get the end iterator back, the
value isn't in the array; otherwise, it is. Or if this is homework, and
you're not allowed to use the standard library, a simple while loop
should do the trick: this is a standard linear search, algorithms for
which can be found anywhere. (On the other hand, some of the articles
which pop up when searching with Google are pretty bad. You really
should use the standard implementation:
Iterator
find( Iterator begin, Iterator end, ValueType target )
{
while ( begin != end && *begin != target )
++ begin;
return begin;
}
Simple, effective, and proven to work.)
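For the lottery case, using std::find might look like the sketch below. The function names and the choice of std::vector<int> for the picks are assumptions, since the question doesn't show its data structures.

```cpp
#include <algorithm>
#include <vector>

// True if number has already been picked.
bool AlreadyPicked(const std::vector<int>& picks, int number) {
    return std::find(picks.begin(), picks.end(), number) != picks.end();
}

// Adds the number only if it is new; returns false for a duplicate,
// so the caller can ask the user to choose again.
bool TryAddPick(std::vector<int>& picks, int number) {
    if (AlreadyPicked(picks, number)) return false;
    picks.push_back(number);
    return true;
}
```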
[added post factum]Oh, homework tag. Ah well, it won't really benefit you that much then, still - I'll leave my answer since it can be of some use to others browsing through SO.
If you'd need to have lots of unique random numbers in a range - say 45000 random numbers from 0..45100 - then you should see how this is going to get problematic using the approach of:
while (size_of_range > v.size()) {
int n = // get random
if ( /* n is not already in v */ ) {
v.push_back(n);
}
}
If the size of the pool and the range you want to get are close, and the pool size is not a very small integer - it'll get harder and harder to get a random number that wasn't already put in the vector/array.
In that case, you'll be much better off using std::vector (in <vector>) and std::random_shuffle (in <algorithm>):
unsigned short start = 10; // the minimum value of a pool
unsigned short step = 1; // for 10,11,12,13,14... values in the vector
// initialize the pool of 45100 numbers
std::vector<unsigned long> pool(45100);
for (unsigned long i = 0, j = start; i < pool.size(); ++i, j += step) {
pool[i] = j;
}
// get 45000 numbers from the pool without repetitions
std::random_shuffle(pool.begin(), pool.end());
return std::vector<unsigned long>(pool.begin(), pool.begin() + 45000);
You can obviously use any type, but you'll need to initialize the vector accordingly, so it'd contain all possible values you want.
Note that the memory overhead probably won't matter if you really need almost all of the numbers in the pool, and you'll get good performance. Using rand() and checking will take a lot of time, and if your RAND_MAX is 32767 it would be an infinite loop, since rand() could never produce enough distinct values.
The memory overhead is however noticeable if you only need few of those values. The first approach would usually be faster then.
If it really needs to be an array, you need to iterate or use the find function from the <algorithm> header. I would suggest putting the numbers in a set instead, as lookup in sets is fast and convenient using the set::find function.
ref: stl set
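With a std::set, the duplicate check and the insertion can be done in one step, because insert reports whether the value was new. AddUniquePick is a made-up name for this sketch.

```cpp
#include <set>

// Returns true if number was added, false if it was already in the set.
bool AddUniquePick(std::set<int>& picks, int number) {
    // insert() returns a pair; .second is false when the value already existed.
    return picks.insert(number).second;
}
```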
These are some of the steps (in pseudo-code since this is a homework question) on how you may get around to doing this:
1. Get user to enter a new number.
2. If the number entered is the first, push it to the vector anyway.
3. Sort the contents of the vector in case size is > 1.
4. Ask user to enter the number.
5. Perform a binary search on the contents to see if the number was entered.
6. If number is unique, push it into vector. If not unique, ask again.
7. Go to step 3.
HTH,
Sriram.

How to keep only the last duplicate when iterating through rows

Following code iterates through many data-rows, calcs some score per row and then sorts the rows according to that score:
unsigned count = 0;
score_pair* scores = new score_pair[num_rows];
while ((row = data.next_row())) {
float score = calc_score(data.next_feature());
scores[count].score = score;
scores[count].doc_id = row->docid;
count++;
}
assert(count <= num_rows);
qsort(scores, count, sizeof(score_pair), score_cmp);
Unfortunately, there are many duplicate rows with the same docid but different scores. Now I'd like to keep only the last score for each docid. The docids are unsigned int, but usually big (=> no lookup array). Using a HashMap to look up the last count for a docid would probably be too slow (many millions of rows; it should only take seconds, not minutes...).
Ok, I modified my code to use a std::map:
map<int, int> docid_lookup;
unsigned count = 0;
score_pair* scores = new score_pair[num_rows];
while ((row = data.next_row())) {
float score = calc_score(data.next_feature());
map<int, int>::iterator iter;
iter = docid_lookup.find(row->docid);
if (iter != docid_lookup.end()) {
scores[iter->second].score = score;
scores[iter->second].doc_id = row->docid;
} else {
scores[count].score = score;
scores[count].doc_id = row->docid;
docid_lookup[row->docid] = count;
count++;
}
}
It works and the performance hit is not as bad as i expected - now it runs a minute instead of 16 seconds, so it's about a factor of 3. Memory usage has also gone up from about 1Gb to 4Gb.
The first thing I'd try would be a map or unordered_map: I'd be surprised if performance is a factor of 60 slower than what you did without any unique-ification. If the performance there isn't acceptable, another option is something like this:
// get the computed data into a vector
std::vector<score_pair>::size_type count = 0;
std::vector<score_pair> scores;
scores.reserve(num_rows);
while ((row = data.next_row())) {
float score = calc_score(data.next_feature());
scores.push_back(score_pair(score, row->docid));
}
assert(scores.size() <= num_rows);
// remove duplicate doc_ids
std::reverse(scores.begin(), scores.end());
std::stable_sort(scores.begin(), scores.end(), docid_cmp);
scores.erase(
std::unique(scores.begin(), scores.end(), docid_eq),
scores.end()
);
// order by score
std::sort(scores.begin(), scores.end(), score_cmp);
Note that the use of reverse and stable_sort is because you want the last score for each doc_id, but std::unique keeps the first. If you wanted the first score you could just use stable_sort, and if you didn't care what score, you could just use sort.
The best way of handling this is probably to pass reverse iterators into std::unique, rather than a separate reverse operation. But I'm not confident I can write that correctly without testing, and errors might be really confusing, so you get the unoptimised code...
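For what it's worth, here is a sketch of the reverse-iterator variant. The score_pair layout and the comparator names are assumptions, since the original definitions are not shown. Walking the sorted vector backwards means std::unique keeps the last-read entry of each group, and the survivors end up at the back of the vector.

```cpp
#include <algorithm>
#include <vector>

struct score_pair {  // assumed layout; the original definition is not shown
    float score;
    unsigned doc_id;
};

bool docid_less(const score_pair& a, const score_pair& b) { return a.doc_id < b.doc_id; }
bool docid_eq(const score_pair& a, const score_pair& b) { return a.doc_id == b.doc_id; }

// Keep only the last-read entry for each doc_id; the result is ordered by doc_id.
void keep_last_per_docid(std::vector<score_pair>& scores) {
    // A stable sort preserves read order within each doc_id group.
    std::stable_sort(scores.begin(), scores.end(), docid_less);
    // Walking backwards, unique keeps the first element it meets in each
    // group, which is the last one that was read.
    std::vector<score_pair>::reverse_iterator new_rend =
        std::unique(scores.rbegin(), scores.rend(), docid_eq);
    // The survivors now occupy [new_rend.base(), end()).
    scores.erase(scores.begin(), new_rend.base());
}
```

This saves the separate std::reverse pass; sorting by score afterwards works the same as before.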
Edit: just for comparison with your code, here's how I'd use the map:
std::map<int, float> scoremap;
while ((row = data.next_row())) {
scoremap[row->docid] = calc_score(data.next_feature());
}
std::vector<score_pair> scores(scoremap.begin(), scoremap.end());
std::sort(scores.begin(), scores.end(), score_cmp);
Note that score_pair would need a constructor taking a std::pair<int,float>, which makes it non-POD. If that's not acceptable, use std::transform, with a function to do the conversion.
Finally, if there is much duplication (say, on average 2 or more entries per doc_id), and if calc_score is non-trivial, then I would be looking to see whether it's possible to iterate the rows of data in reverse order. If it is, then it will speed up the map/unordered_map approach, because when you get a hit for the doc_id you don't need to calculate the score for that row, just drop it and move on.
I'd go for a std::map of docids. If you could create an appropriate hashing function, a hash map would be preferable, but I guess that's too difficult. And no, std::map is not too slow. Access is O(log n), which is nearly as good as O(1). O(1) is array access time (and hash map access time, by the way).
By the way, if std::map is too slow, qsort at O(n log n) is too slow as well. And by using a std::map and iterating over its contents, you can perhaps save your qsort.
Some additions for the comment (by onebyone):
I did not go for the implementation details, since there wasn't enough information on that.
qsort may behave bad with sorted data (depending on the implementation). Std::map may not. This is a real advantage, especially if you read the values from a database that might output them ordered by key.
There was no word on the memory allocation strategy. Changing to a memory allocator with fast allocation of small objects may improve the performance.
Still - the fastest would be a hash map with an appropriate hash function. Since there's not enough information about the distribution of the keys, presenting one in this answer is not possible.
Short - if you ask general questions, you get general answers. This means - at least for me, looking at the time complexity in the O-Notation. Still you were right, depending on different factors, the std::map may be too slow while qsort is still fast enough - it may also be the other way round in the worst case of qsort, where it has n^2 complexity.
Unless I've misunderstood the question, the solution can be simplified considerably. At least as I understand it, you have a few million docid's (which are of type unsigned int) and for each unique docid, you want to store one 'score' (which is a float). If the same docid occurs more than once in the input, you want to keep the score from the last one. If that's correct, the code can be reduced to this:
std::map<unsigned, float> scores;
while ((row = data.next_row()))
scores[row->docid] = calc_score(data.next_feature());
This will probably be somewhat slower than your original version since it allocates a lot of individual blocks rather than one big block of memory. Given your statement that there are a lot of duplicates in the docid's, I'd expect this to save quite a bit of memory, since it only stores data for each unique docid rather than for every row in the original data.
If you wanted to optimize this, you could almost certainly do so -- since it uses a lot of small blocks, a custom allocator designed for that purpose would probably help quite a bit. One possibility would be to take a look at the small-block allocator in Andrei Alexandrescu's Loki library. He's done more work on the problem since, but the one in Loki is probably sufficient for the task at hand -- it'll almost certainly save a fair amount of memory and run faster as well.
If your C++ implementation has it, and most do, try hash_map instead of std::map (it's sometimes available under std::hash_map).
If the lookups themselves are your computational bottleneck, this could be a significant speedup over std::map's binary tree.
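hash_map was a pre-standard extension; since C++11 the standard equivalent is std::unordered_map. A minimal sketch of the "last score wins" bookkeeping with it follows. LastScores and its members are hypothetical names, and the reserve call assumes you have a rough idea of how many unique docids to expect.

```cpp
#include <cstddef>
#include <unordered_map>

// Keeps only the most recent score for each doc id.
class LastScores {
public:
    // Reserving buckets up front avoids repeated rehashing while filling.
    explicit LastScores(std::size_t expected_unique) { scores_.reserve(expected_unique); }
    // Later calls for the same docid overwrite earlier ones.
    void record(unsigned docid, float score) { scores_[docid] = score; }
    std::size_t size() const { return scores_.size(); }
    // Throws std::out_of_range if docid was never recorded.
    float at(unsigned docid) const { return scores_.at(docid); }
private:
    std::unordered_map<unsigned, float> scores_;
};
```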
Why not sort by doc id first, calculate scores, then for any subset of duplicates use the max score?
On re-reading the question; I'd suggest a modification to how scores are read in. Keep in mind C++ isn't my native language, so this won't quite be compilable.
unsigned count = 0;
pair<int, score_pair>* scores = new pair<int, score_pair>[num_rows];
while ((row = data.next_row())) {
float score = calc_score(data.next_feature());
scores[count].second.score = score;
scores[count].second.doc_id = row->docid;
scores[count].first = count; // read order, used as tie-breaker
count++;
}
assert(count <= num_rows);
qsort(scores, count, sizeof(pair<int, score_pair>), pair_docid_cmp);
// count the unique doc ids (only the 'count' entries actually read)
unsigned scoreCount = (count > 0) ? 1 : 0;
for (unsigned i = 1; i < count; i++)
if (scores[i-1].second.doc_id != scores[i].second.doc_id) scoreCount++;
score_pair* actualScores = new score_pair[scoreCount];
unsigned at = 0;
for (unsigned i = 0; i < count; i++)
{
// first entry of a new doc id; pair_docid_cmp puts the last-read row first
if (i == 0 || scores[i-1].second.doc_id != scores[i].second.doc_id)
actualScores[at++] = scores[i].second;
}
qsort(actualScores, scoreCount, sizeof(score_pair), score_cmp);
Where pair_docid_cmp would compare first on docid; grouping same docs together, then second by reverse order read; such that the last item read is the first in the sublist of items with the same docid. Should only be ~5/2x memory usage, and ~double the execution speed.