simple hash map with vectors in C++

I'm in my first semester of studies and as a part of my comp. science assignment I have to implement a simple hash map using vectors, but I have some problems understanding the concept.
First of all I have to implement a hash function. To avoid collisions I thought it would be better to use double hashing, as follows:
do {
    h = (k % m + j * (1 + (k % (m - 2))));
    j++;
} while (j % m != 0);
where h is the hash to be returned, k is the key and m is the size of hash_map (and a prime number; they are all of type int).
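For reference, the usual textbook form of double hashing applies a final mod m so that the result is always a valid index: the j-th probe for key k is (h1(k) + j*h2(k)) mod m, with h2(k) = 1 + (k mod (m-2)). A minimal sketch of just that probe function (my own naming, assuming non-negative keys and a prime table size m):
#include <cstddef>

// j-th probe index for key k in a table of prime size m (double hashing).
// Assumes k >= 0 and m > 2.
std::size_t probe(int k, std::size_t j, std::size_t m) {
    std::size_t h1 = static_cast<std::size_t>(k) % m;             // primary hash
    std::size_t h2 = 1 + static_cast<std::size_t>(k) % (m - 2);   // step size, never 0
    return (h1 + j * h2) % m;                                      // always a valid index
}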
This was easy, but then I need to be able to insert or remove a key and its corresponding value in the map.
Both functions are supposed to return bool, so I have to return either true or false, and I'm guessing insert should return true when there is no element yet at position h in the vector. (But I have no idea why remove should return bool as well.)
My problem is what to do when the insert function returns false (i.e. when there is already a key-value pair stored at position h - I implemented this check as a function named find). I could obviously move the new pair to the next free place by simply increasing j, but then the hash calculated by my hash function would no longer tell us where a certain key is stored, which would break the remove function.
Is there any good example online that doesn't use the predefined std methods? (My Google has been behaving weirdly for the past few days and only returns unhelpful hits in the local language.)

I've been told to move my comment to an answer, so here it is. I am presuming your get method takes the value you are looking for as an argument.
What we are going to do is a process called linear probing.
When we insert the value, we hash it as normal; let's say our hash value is 4:
[x,x,x,,,x,x]
As we can see, we can simply insert it:
[x,x,x,x,,x,x]
However, if slot 4 is taken when we insert, we simply move to the next slot that is empty:
[x,x,x,**x**,x,,x,x]
In linear probing, if we reach the end we loop back round to the beginning until we find a slot. You shouldn't run out of space, as you are using a vector, which can allocate extra space when it starts getting near full capacity.
This will cause problems when you are searching, because the value that hashed to 4 may not be at index 4 anymore (in this case it's at 5). To solve this we do a little bit of a hack. Note that we still get O(1) run time complexity for insertion and retrieval as long as the load factor is below 1.
In our get method, instead of returning the value in the array at 4, we instead start looking for our value at 4; if it's there we can return it. If not, we look at the value at 5, and so on until we find the value.
In pseudocode the new stuff looks like this:
bool insert(value){
    h = hash(value);
    while(node[h] != null){
        h++;
        if(h == node.length){
            h = 0;
        }
    }
    node[h] = value;
    return true;
}
And get:
get(value){
    h = hash(value);
    roundTrip = 0; //used to see if we keep going round the hashmap
    while(true){
        if(node[h] == value)
            return node[h];
        h++;
        if(h == node.length){
            h = 0;
            roundTrip++;
        }
        if(roundTrip > 1){ //we can't find it after going round the list once
            return -1;
        }
    }
}
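To connect this back to your remove question: with open addressing you can't simply empty a slot when you remove a key, or later lookups will stop too early at the hole. A common trick is to mark removed slots with a "deleted" flag (a tombstone) and keep probing past them. Below is a rough C++ sketch along those lines; the class name, the fixed table size and the int key/value types are assumptions of mine, not part of your assignment:
#include <cstddef>
#include <vector>

// Rough sketch: open addressing with linear probing and tombstones.
// Slot states: Empty (never used), Full (holds a pair), Deleted (tombstone left by remove).
class SimpleHashMap {
    enum State { Empty, Full, Deleted };
    struct Slot { int key; int value; State state; };
    std::vector<Slot> table;

    std::size_t hashOf(int key) const { return static_cast<std::size_t>(key) % table.size(); }

public:
    explicit SimpleHashMap(std::size_t size) : table(size, Slot{0, 0, Empty}) {}

    bool insert(int key, int value) {
        std::size_t h = hashOf(key);
        long long firstFree = -1;                         // first reusable (empty or deleted) slot seen
        for (std::size_t i = 0; i < table.size(); ++i) {
            std::size_t pos = (h + i) % table.size();     // linear probing
            if (table[pos].state == Full) {
                if (table[pos].key == key) return false;  // key already present
            } else {
                if (firstFree < 0) firstFree = static_cast<long long>(pos);
                if (table[pos].state == Empty) break;     // the key cannot appear past a never-used slot
            }
        }
        if (firstFree < 0) return false;                  // table is completely full
        table[static_cast<std::size_t>(firstFree)] = Slot{key, value, Full};
        return true;
    }

    bool remove(int key) {
        std::size_t h = hashOf(key);
        for (std::size_t i = 0; i < table.size(); ++i) {
            std::size_t pos = (h + i) % table.size();
            if (table[pos].state == Empty) return false;  // never-used slot: key is not in the map
            if (table[pos].state == Full && table[pos].key == key) {
                table[pos].state = Deleted;               // leave a tombstone so later probes keep going
                return true;
            }
        }
        return false;
    }

    bool find(int key, int& value) const {
        std::size_t h = hashOf(key);
        for (std::size_t i = 0; i < table.size(); ++i) {
            std::size_t pos = (h + i) % table.size();
            if (table[pos].state == Empty) return false;
            if (table[pos].state == Full && table[pos].key == key) {
                value = table[pos].value;
                return true;
            }
        }
        return false;
    }
};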

Related

C++ map showing values as 0 for all keys

I am inserting numbers from a vector into a map, with the number as the key and its (index + 1) as the value.
But when I print the contents of the map, the value shown is 0 for every key, even though I pass the integer i.
// taking input of n integers in vector s;
vector<int> s;
for(int i = 0; i < n; i++){
    int tmp;
    cin >> tmp;
    s.push_back(tmp);
}
//creating map int to int
map<int,int> m;
bool done = false;
for(int i = 1; i <= s.size(); i++){
    //check if number already in map
    if (m[s[i-1]] != 0){
        if (i - m[s[i-1]] > 1){
            done = true;
            break;
        }
    }
    // if number was not in map then insert the number and it's index + 1
    else{
        m.insert({s[i-1], i});
    }
}
for(auto it = m.begin(); it != m.end(); it++){
    cout << endl << it->first << ": " << it->second << endl;
}
For input
n = 3
and numbers as
1 2 1 in vector s, I expect the output to be
1: 1
2: 2
but output is
1: 0
2: 0
Why 0? What's wrong?
Your code block following the comment:
// check if number already in map
is logically faulty, because operator[] will actually insert an element, using value initialisation(a), if it does not currently exist.
If you were to instead use:
if (m.find(s[i-1]) != m.end())
that would get rid of this problem.
(a) I believe(b) that value initialisation for classes involves calling one of the constructors; for arrays, value initialisation of each item in the array; and, for other types (this scenario), zero initialisation. That would mean using your method creates an entry for your key, with a zero value, and returns that zero value.
It would then move to the else block (because the value is zero) and try to do the insert. However, this snippet from the standard (C++20, [map.modifiers] discussing insert) means that nothing happens:
If the map already contains an element whose key is equivalent to k, there is no effect.
(b) Though, as my kids will point out frequently, and without much prompting, I've been wrong before :-)
std::map::operator[] will create a default-constructed element if the key doesn't exist. Because you do m[s[i-1]] in the if condition, the m.insert({s[i-1],i}); in the else branch will never insert anything.
To check whether a key is already present in the map, use find(), count() or contains() (if your compiler supports C++20):
//either will work instead of `if (m[s[i-1]]!=0)`
if (m.find(s[i-1]) != m.end())
if (m.count(s[i-1]) == 1)
if (m.contains(s[i-1])) //C++20
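Putting that together, here is one way the fixed loop could look as a small self-contained program, using find(); the sample data is the 1 2 1 input from the question:
#include <iostream>
#include <map>
#include <vector>
using namespace std;

int main() {
    vector<int> s = {1, 2, 1};              // same sample input as in the question
    map<int, int> m;
    bool done = false;
    for (int i = 1; i <= (int)s.size(); i++) {
        auto it = m.find(s[i - 1]);          // membership test that does NOT insert
        if (it != m.end()) {
            if (i - it->second > 1) {        // distance between the two occurrences
                done = true;
                break;
            }
        } else {
            m.insert({s[i - 1], i});         // first occurrence: store index + 1
        }
    }
    for (auto it = m.begin(); it != m.end(); it++)
        cout << it->first << ": " << it->second << endl;   // prints 1: 1 and 2: 2
    return 0;
}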

find duplicate number in an array

I am debugging the problem below and posting the solution I am working on; this solution (or something similar) is posted on a couple of forums, but I think it has a bug when num[0] = 0, or in general when num[x] = x. Am I correct? Please feel free to correct me if I am wrong.
Given an array nums containing n + 1 integers where each integer is between 1 and n (inclusive), prove that at least one duplicate number must exist. Assume that there is only one duplicate number, find the duplicate one.
Note:
You must not modify the array (assume the array is read only).
You must use only constant, O(1) extra space.
Your runtime complexity should be less than O(n²).
There is only one duplicate number in the array, but it could be repeated more than once.
int findDuplicate3(vector<int>& nums)
{
    if (nums.size() > 1)
    {
        int slow = nums[0];
        int fast = nums[nums[0]];
        while (slow != fast)
        {
            slow = nums[slow];
            fast = nums[nums[fast]];
        }
        fast = 0;
        while (fast != slow)
        {
            fast = nums[fast];
            slow = nums[slow];
        }
        return slow;
    }
    return -1;
}
Below is my code which uses Floyd's cycle-finding algorithm:
#include <iostream>
#include <vector>
using namespace std;
int findDup(vector<int>& arr){
    int len = arr.size();
    if(len > 1){
        int slow = arr[0];
        int fast = arr[arr[0]];
        while(slow != fast){
            slow = arr[slow];
            fast = arr[arr[fast]];
        }
        fast = 0;
        while(slow != fast){
            slow = arr[slow];
            fast = arr[fast];
        }
        return slow;
    }
    return -1;
}
int main() {
    vector<int> v = {1,2,2,3,4};
    cout << findDup(v) << endl;
    return 0;
}
This works because zeroes aren't allowed, so the first element of the array isn't part of a cycle, and so the first element of the first cycle we find is referred to from both outside and inside the cycle. If zeroes were allowed, this would fail whenever arr[0] is on a cycle, e.g. [0,1,1].
The sum of integers from 1 to N = (N * (N + 1)) / 2. You can use this to find the duplicate -- sum the integers in the array, then subtract the above formula from the sum. That's the duplicate.
Update: The above solution is based on the (possibly invalid) assumption that the input array consists of the values from 1 to N plus a single duplicate.
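A rough sketch of that idea, valid only under that assumption (the array holds the values 1..N plus exactly one extra copy of some value, so it has N+1 elements):
#include <vector>

// Sum-difference trick: only works if the array is exactly 1..N plus one extra copy.
long long findDupBySum(const std::vector<int>& a) {
    long long n = static_cast<long long>(a.size()) - 1;   // the values range over 1..n
    long long expected = n * (n + 1) / 2;                  // sum of 1..n
    long long actual = 0;
    for (int v : a) actual += v;
    return actual - expected;                              // the surplus is the duplicate
}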
Start with two pointers to the first element: fast and slow.
Define a 'move' as incrementing fast by 2 steps (positions) and slow by 1.
After each move, check if slow & fast point to the same node.
If there is a loop, at some point they will. This is because after they are both in the loop, fast is moving twice as quickly as slow and will eventually 'run into' it.
Say they meet after k moves. The element where they meet is NOT NECESSARILY the repeated element, since it might not be the first element of the loop reached from outside the loop.
Call this element X.
Notice that fast has stepped 2k times, and slow has stepped k times.
Move fast back to zero.
Repeatedly advance fast and slow by ONE STEP EACH, comparing after each step.
Notice that after another k steps, slow will have moved a total of 2k steps and fast a total of k steps from the start, so they will again both be pointing to X.
Notice that if the prior step is on the loop for both of them, they were both pointing to X-1. If the prior step was only on the loop for slow, then they were pointing to different elements.
Ditto for X-2, X-3, ...
So in going forward, the first time they are pointing to the same element is the first element of the cycle reached from outside the cycle, which is the repeated element you're looking for.
Since you cannot use any additional space, using another hash table would be ruled out.
Now, coming to the approach of hashing on the existing array: it can be achieved if we are allowed to modify the array in place.
Algo:
1) Start with the first element.
2) Hash the first element and apply a transformation to the value at the hashed position. Let's say this transformation is making the value negative.
3) Proceed to the next element. Hash the element and, before applying the transformation, check whether a transformation has already been applied.
4) If yes, then the element is a duplicate.
Code:
for (int i = 0; i < size; i++)
{
    if (arr[abs(arr[i])] > 0)
        arr[abs(arr[i])] = -arr[abs(arr[i])];   // mark index arr[i] as "seen" by negating the value there
    else
        cout << abs(arr[i]) << endl;            // value at that index is already negative: arr[i] is a duplicate
}
This transformation is required because, for the hashing approach to work, hashing the same key twice has to produce a detectable collision.
I can't think of a way in which hashing can be used without any additional space and without modifying the array.

Insertion Sort Optimization

I'm trying to practice making some different sort functions and the insertion function that I came up with is giving me some trouble. I can sort lists that are less than 30K fairly quickly. But I have a list of 100K integers and it literally takes 15 minutes for the function to complete the sort. Everything is sorted correctly, but I don't believe it should take that long.
Am I missing something with my code that is making it take so long? Many thanks in advance.
void Sort::insertion_Sort(vector<int> v)
{
    int vecSize = v.size();
    //for loop to advance through the vector
    for (int i = 0; i < vecSize; i++)
    {
        //declare some variables
        int cursor = i;
        int inputCursor = i - 1;
        int temp = v[cursor];
        //check to see if we are considering only a single element
        if (cursor > 0)
        {
            //if there is more than 1 element, then we test the following:
            //1. is the cursor element less than the inputCursor element
            //   (which is the previous element)
            //2. is the input cursor greater than -1
            while (inputCursor > -1 && v[cursor] < v[inputCursor])
            {
                //if so, we swap the variables,
                //then move the cursors back to check
                //the previous element and see if we need to swap again.
                temp = v[cursor];
                v[cursor] = v[inputCursor];
                v[inputCursor] = temp;
                inputCursor--;
                cursor--;
            }
        }
    }
}
Insertion sort is an O(n^2) algorithm. It's slow for large inputs. It's going to take roughly 11 times longer to process a list of 100k items than a list of 30k items. For inputs larger than 20 or so, you should use something like quicksort, which is O(n*log(n)).
The O(n^2) vs O(n*log(n)) problem, as pointed out by the other answer, is the centre of this problem. I would suggest finding the insertion point with a binary search, as that stays close to the insertion-sort structure and is simpler to implement. It would look for the point of insertion by dividing the already sorted part of the vector in half and checking whether the integer to be inserted is greater than the integer in the middle; then it would split the chosen half again, and so on, recursively.
I think this is the best approach short of starting from scratch.
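A sketch of that variant: std::upper_bound does the binary search for the insertion point, which cuts the comparisons to O(n log n); the element shifts are still linear per insertion, so the worst case remains quadratic, but in practice it can be noticeably faster than swapping one position at a time. (Unrelated to speed, note that the original function takes the vector by value, so the caller never sees the sorted result; this sketch takes it by reference.)
#include <algorithm>
#include <vector>

// Insertion sort that finds the insertion point with binary search.
void binaryInsertionSort(std::vector<int>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        int key = v[i];
        // first position where key can go while keeping the prefix sorted (stable)
        auto pos = std::upper_bound(v.begin(), v.begin() + i, key);
        // shift the tail right by one and drop key into place
        std::move_backward(pos, v.begin() + i, v.begin() + i + 1);
        *pos = key;
    }
}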

Using a hash to find one duplicated and one missing number in an array

I had this question during an interview and am curious to see how it would be implemented.
Given an unsorted array of integers from 0 to x. One number is missing and one is duplicated. Find those numbers.
Here is what I came up with:
int counts[x+1];
for(int i = 0; i <= x; i++){
    counts[a[i]]++;
    if(counts[a[i]] == 2)
        cout << "Duplicate element: " << a[i]; //I realized I could find this here
}
for(int j = 0; j <= x; j++){
    if(counts[j] == 0)
        cout << "Missing element: " << j;
    //if(counts[j] == 2)
    //    cout << "Duplicate element: " << j; //No longer needed here.
}
My initial solution was to create another array of size x+1, loop through the given array and index into my array at the values of the given array and increment. If after the increment any value in my array is two, that is the duplicate. However, I then had to loop through my array again to find any value that was 0 for the missing number.
I pointed out that this might not be the most time efficient solution, but wasn't sure how to speed it up when I was asked. I realized I could move finding the duplicate into the first loop, but that didn't help with the missing number. After waffling for a bit, the interviewer finally gave me the idea that a hash would be a better/faster solution. I have not worked with hashes much, so I wasn't sure how to implement that. Can someone enlighten me? Also, feel free to point out any other glaring errors in my code... Thanks in advance!
If the range of values is about the same as, or smaller than, the number of values in the array, then using a hash table will not help. In this case, there are x+1 possible values in an array of size x+1 (one missing, one duplicated), so a hash table isn't needed, just a histogram, which you've already coded.
If the assignment were changed to be looking for duplicate 32 bit values in an array of size 1 million, then the second array (a histogram) could need to be 2^32 = 4 billion counts long. This is when a hash table would help, since the hash table size is a function of the array size, not the range of values. A hash table of size 1.5 to 2 million would be large enough. In this case, you would have 2^32 - 2^20 = 4293918720 "missing" values, so that part of the assignment would go away.
Wiki article on hash tables:
Hash Table
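For that large-range case, a sketch of what the hash-based version might look like, with std::unordered_map standing in for the hash table (the function name is mine):
#include <unordered_map>
#include <vector>

// Hash-based histogram: memory is proportional to the array size, not to the value range.
int findDuplicateHashed(const std::vector<int>& a) {
    std::unordered_map<int, int> counts;
    counts.reserve(a.size());
    for (int v : a)
        if (++counts[v] == 2)   // second time we see v: it is the duplicate
            return v;
    return -1;                  // no duplicate present
}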
If x were small enough (such that the sum of 0..x can be represented), you could compute the sum of the unique values in a, and subtract that from the sum of 0..x, to get the missing value, without needing the second loop.
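A sketch of that combination: the duplicate still comes out of the histogram loop, and the missing value then follows from the sum difference, so the second loop over counts disappears (this assumes the values really are 0..x with exactly one duplicate and one missing):
#include <vector>

// One pass over the data: the histogram finds the duplicate, the sums give the missing value.
void findDupAndMissing(const std::vector<int>& a, int& dup, int& missing) {
    int x = static_cast<int>(a.size()) - 1;               // values are drawn from 0..x
    std::vector<int> counts(x + 1, 0);
    long long sum = 0;
    dup = -1;
    for (int v : a) {
        sum += v;
        if (++counts[v] == 2) dup = v;                     // duplicate spotted in the same pass
    }
    long long expected = static_cast<long long>(x) * (x + 1) / 2;   // sum of 0..x
    missing = static_cast<int>(expected - (sum - dup));             // subtract out the extra copy first
}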
Here is a stab at a solution that uses an index (a true key-value hash doesn't make sense when the array is guaranteed to include only integers). Sorry OP, it's in Ruby:
values = mystery_array.sort.map.with_index { |n,i| n if n != i }.compact
missing_value,duplicate_value = mystery_array.include?(values[0] - 1) ? \
[values[-1] + 1, values[0]] : [values[0] - 1, values[-1]]
The functions used likely employ a non-trivial amount of looping behind the scenes, and this will create a (possibly very large) variable values which contains a range between the missing and/or duplicate value, as well as a second lookup loop, but it works.
Perhaps the interviewer meant to say Set instead of hash?
Sorting allowed?
auto first = std::begin(a);
auto last = std::end(a);
// sort it
std::sort(first, last);
// find the duplicate
auto first_duplicate = *std::adjacent_find(first, last);
// find the missing value
auto missing = std::adjacent_find(first, last, [](int x, int y) { return x + 2 == y; });
int missing_number = 0;
if (missing != last)
{
    missing_number = 1 + *missing;
}
else
{
    // no gap in the middle, so the missing value is at one of the ends
    if (*first != 0)
    {
        missing_number = 0;
    }
    else
    {
        missing_number = *(last - 1) + 1;   // the largest value is the one missing
    }
}
Both could be done in a single hand-written loop, but I wanted to use only stl algorithms. Any better idea for handling the corner cases?
for (i = 0 to length) {                      // first loop
    for (j = 0 to length) {                  // second loop
        if (t[i] == j+1) {
            if (counter == 0) {              // make sure duplicated number has not been found already
                for (k = i+1 to length) {    // search for duplicated number
                    if (t[k] == j+1) {
                        j+1 is the duplicated number;
                        if (missingIsFound)
                            exit;            // exit program, missing and dup are found
                        counter = 1;
                    }                        // end if t[k]..
                }                            // end loop for duplicated number
            }                                // end condition to search
            continue;                        // continue to first loop
        }
        else {
            j+1 is the missing number;
            if (duplicatedIsFound)
                exit;                        // exit program, missing and dup are found
            continue;                        // continue to first loop
        }
    }                                        // end second loop
}                                            // end first loop

Given an array of integers, find the first integer that is unique

Given an array of integers, find the first integer that is unique.
my solution: use std::map
Put the integers into it one by one (number as key, its index as value), which is O(n^2 lg n); if there is a duplicate, remove the entry from the map (O(lg n)); after putting all numbers into the map, iterate over it and find the key with the smallest index, which is O(n).
O(n^2 lg n) because the map needs to do sorting.
It is not efficient.
Are there other, better solutions?
I believe that the following would be the optimal solution, at least based on time / space complexity:
Step 1:
Store the integers in a hash map, which holds the integer as a key and the count of the number of times it appears as the value. This is generally an O(n) operation and the insertion / updating of elements in the hash table should be constant time, on the average. If an integer is found to appear more than twice, you really don't have to increment the usage count further (if you don't want to).
Step 2:
Perform a second pass over the integers. Look each up in the hash map and the first one with an appearance count of one is the one you were looking for (i.e., the first single appearing integer). This is also O(n), making the entire process O(n).
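A minimal sketch of those two passes, with std::unordered_map as the hash table (this is an illustration of the steps above, not the poster's code):
#include <unordered_map>
#include <vector>

// Pass 1: count occurrences. Pass 2: first value whose count is exactly 1.
// Expected O(n) time overall, O(n) extra space.
bool firstUnique(const std::vector<int>& a, int& result) {
    std::unordered_map<int, int> counts;
    counts.reserve(a.size());
    for (int v : a) ++counts[v];
    for (int v : a) {
        if (counts[v] == 1) {
            result = v;
            return true;
        }
    }
    return false;   // every value repeats
}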
Some possible optimizations for special cases:
Optimization A: It may be possible to use a simple array instead of a hash table. This guarantees O(1) even in the worst case for counting the number of occurrences of a particular integer as well as the lookup of its appearance count. Also, this enhances real time performance, since the hash algorithm does not need to be executed. There may be a hit due to potentially poorer locality of reference (i.e., a larger sparse table vs. the hash table implementation with a reasonable load factor). However, this would be for very special cases of integer orderings and may be mitigated by the hash table's hash function producing pseudorandom bucket placements based on the incoming integers (i.e., poor locality of reference to begin with).
Each byte in the array would represent the count (up to 255) for the integer represented by the index of that byte. This would only be possible if the difference between the lowest integer and the highest (i.e., the cardinality of the domain of valid integers) was small enough such that this array would fit into memory. The index in the array of a particular integer would be its value minus the smallest integer present in the data set.
For example on modern hardware with a 64-bit OS, it is quite conceivable that a 4GB array can be allocated which can handle the entire domain of 32-bit integers. Even larger arrays are conceivable with sufficient memory.
The smallest and largest integers would have to be known before processing, or another linear pass through the data using the minmax algorithm to find out this information would be required.
Optimization B: You could optimize Optimization A further, by using at most 2 bits per integer (One bit indicates presence and the other indicates multiplicity). This would allow for the representation of four integers per byte, extending the array implementation to handle a larger domain of integers for a given amount of available memory. More bit games could be played here to compress the representation further, but they would only support special cases of data coming in and therefore cannot be recommended for the still mostly general case.
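A rough sketch of Optimization B's idea, keeping the two flags ("seen once" and "seen again") in two bit-packed vectors; it assumes non-negative values in a known range [0, maxValue]:
#include <vector>

// Two bits of state per possible value: seen at least once, and seen more than once.
bool firstUniqueBits(const std::vector<int>& a, int maxValue, int& result) {
    std::vector<bool> seen(maxValue + 1, false);       // bit-packed by the library
    std::vector<bool> repeated(maxValue + 1, false);
    for (int v : a) {
        if (seen[v]) repeated[v] = true;               // second or later occurrence
        else seen[v] = true;                           // first occurrence
    }
    for (int v : a)
        if (!repeated[v]) { result = v; return true; } // first value never marked as repeated
    return false;
}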
All this for no reason. Just using 2 for-loops & a variable would give you a simple O(n^2) algo.
If you are taking all the trouble of using a hash map, then it might as well be what @Micheal Goldshteyn suggests.
UPDATE: I know this question is 1 year old, but I was looking through the questions I answered and came across this. I thought there is a better solution than using a hashtable.
When we say unique, we will have a pattern. E.g.: [5, 5, 66, 66, 7, 1, 1, 77]. In this, let's have a moving window of 3. First consider (5, 5, 66). We can easily establish that there is a duplicate here. So move the window by 1 element and we get (5, 66, 66). Same here. Move to the next, (66, 66, 7). Again dups here. Next, (66, 7, 1). No dups here! Take the middle element, as this has to be the first unique value in the set. The left element belongs to the dups, and so could the 1. Hence 7 is the first unique element.
space: O(1)
time: O(n) * O(m^2) = O(n) * 9 ≈ O(n)
Inserting into a map is O(log n), not O(n log n), so inserting n keys will be O(n log n). Also, it's better to use a set.
Although it's O(n^2), the following has small coefficients, isn't too bad on the cache, and uses memmem() which is fast.
// memmem() is a GNU extension: #define _GNU_SOURCE and #include <string.h>
for (int x = 0; x < len; x++)
    if (memmem(&array[x+1], sizeof(int)*(len-(x+1)), &array[x], sizeof(int)) == NULL &&
        memmem(&array[0],   sizeof(int)*x,           &array[x], sizeof(int)) == NULL)
        return array[x];
public static string firstUnique(int[] input)
{
    int size = input.Length;
    bool[] dupIndex = new bool[size];
    for (int i = 0; i < size; ++i)
    {
        if (dupIndex[i])
        {
            continue;
        }
        else if (i == size - 1)
        {
            return input[i].ToString();
        }
        for (int j = i + 1; j < size; ++j)
        {
            if (input[i] == input[j])
            {
                dupIndex[j] = true;
                break;
            }
            else if (j == size - 1)
            {
                return input[i].ToString();
            }
        }
    }
    return "No unique element";
}
@user3612419
The solution you've given is good, with complexity somewhat close to O(N*N/2), but further optimization of the same code is possible; I just added the two or three lines that you missed.
public static string firstUnique(int[] input)
{
    int size = input.Length;
    bool[] dupIndex = new bool[size];
    for (int i = 0; i < size; ++i)
    {
        if (dupIndex[i])
        {
            continue;
        }
        else if (i == size - 1)
        {
            return input[i].ToString();
        }
        for (int j = i + 1; j < size; ++j)
        {
            if (dupIndex[j] == true)
            {
                continue;
            }
            if (input[i] == input[j])
            {
                dupIndex[j] = true;
                dupIndex[i] = true;
                break;
            }
            else if (j == size - 1)
            {
                return input[i].ToString();
            }
        }
    }
    return "No unique element";
}