hashkey collision when removing C++ - c++

To make the search foreach "symbol" i want to remove from my hashTable, i have chosen to generate the hashkey i inserted it at. However, the problem that Im seeing in my remove function is when I need to remove a symbol from where a collision was found it previously results in my while loop condition testing false where i do not want.
bool hashmap::get(char const * const symbol, stock& s) const
{
int hash = this->hashStr( symbol );
while ( hashTable[hash].m_symbol != NULL )
{ // try to find a match for the stock associated with the symbol.
if ( strcmp( hashTable[hash].m_symbol , symbol ) == 0 )
{
s = &hashTable[hash];
return true;
}
++hash %= maxSize;
}
return false;
}
bool hashmap::put(const stock& s, int& usedIndex, int& hashIndex, int& symbolHash)
{
hashIndex = this->hashStr( s.m_symbol ); // Get remainder, Insert at that index.
symbolHash = (int&)s.m_symbol;
usedIndex = hashIndex;
while ( hashTable[hashIndex].m_symbol != NULL ) // collision found
{
++usedIndex %= maxSize; // if necessary wrap index around
if ( hashTable[usedIndex].m_symbol == NULL )
{
hashTable[usedIndex] = s;
return true;
}
else if ( strcmp( hashTable[usedIndex].m_symbol , s.m_symbol ) == 0 )
{
return false; // prevent duplicate entry
}
}
hashTable[hashIndex] = s; // insert if no collision
return true;
}
// What if I need to remove an index i generate?
bool hashmap::remove(char const * const symbol)
{
int hashVal = this->hashStr( symbol );
while ( hashTable[hashVal].m_symbol != NULL )
{
if ( strcmp( hashTable[hashVal].m_symbol, symbol ) == 0 )
{
stock temp = hashTable[hashVal]; // we cansave it
hashTable[hashVal].m_symbol = NULL;
return true;
}
++hashVal %= maxSize; // wrap around if needed
} // go to the next cell meaning their was a previous collision
return false;
}
int hashmap::hashStr(char const * const str)
{
size_t length = strlen( str );
int hash = 0;
for ( unsigned i = 0; i < length; i++ )
{
hash = 31 * hash + str[i];
}
return hash % maxSize;
}
What would I need to do to remove a "symbol" from my hashTable from a previous collision?
I am hoping it is not java's equation directly above.

It looks like you are implementing a hash table with open addressing, is that right? Deleting is a little tricky in that scheme. See http://www.maths.lse.ac.uk/Courses/MA407/del-hash.pdf:
"Deletion of keys is problematic with open addressing: If there are two colliding keys x and y with h(x) = h(y), and key x is inserted before key y, and one wants to delete key x, this cannot simply be done by marking T[h(x)] as FREE, since then y would no longer be found. One possibility would be to mark T[h(x)] as DELETED (another special entry), which is skipped when searching for a key. A table place marked as DELETED may also be re-used for storing another key z that one wants to insert if one is sure that this key z is not already in the table (i.e., by reaching the end of the probe sequence for key z and not finding it). Such re-use complicates the insertion method. Moreover, places with DELETED keys fill the table."

What you need to do is create a dummy sentinel value that represents a "deleted" item. When you insert a new value into the table, you need to check to see if an element is NULL or "deleted". If a slot contains this sentinel "deleted" value or the slot is NULL, then the slot is a valid slot for insertion.
That said, if you are writing this code for production, you should consider using the boost::unordered_map, instead of rolling your own hash map implementation. If this is for schoolwork,... well, good luck.

Related

How do I implement linear probing in C++?

I'm new to Hash Maps and I have an assignment due tomorrow. I implemented everything and it all worked out fine, except for when I get a collision. I cant quite understand the idea of linear probing, I did try to implement it based on what I understood, but the program stopped working for table size < 157, for some reason.
void hashEntry(string key, string value, entry HashTable[], int p)
{
key_de = key;
val_en = value;
for (int i = 0; i < sizeof(HashTable); i++)
{
HashTable[Hash(key, p) + i].key_de = value;
}
}
I thought that by adding a number each time to the hash function, 2 buckets would never get the same Hash index. But that didn't work.
A hash table with linear probing requires you
Initiate a linear search starting at the hashed-to location for an empty slot in which to store your key+value.
If the slot encountered is empty, store your key+value; you're done.
Otherwise, if they keys match, replace the value; you're done.
Otherwise, move to the next slot, hunting for any empty or key-matching slot, at which point (2) or (3) transpires.
To prevent overrun, the loop doing all of this wraps modulo the table size.
If you run all the way back to the original hashed-to location and still have no empty slot or matching-key overwrite, your table is completely populated (100% load) and you cannot insert more key+value pairs.
That's it. In practice it looks something like this:
bool hashEntry(string key, string value, entry HashTable[], int p)
{
bool inserted = false;
int hval = Hash(key, p);
for (int i = 0; !inserted && i < p; i++)
{
if (HashTable[(hval + i) % p].key_de.empty())
{
HashTable[(hval + i) % p].key_de = key;
}
if (HashTable[(hval + i) % p].key_de == key)
{
HashTable[(hval + i) % p].val_en = value;
inserted = true;
}
}
return inserted;
}
Note that expanding the table in a linear-probing hash algorithm is tedious. I suspect that will be forthcoming in your studies. Eventually you need to track how many slots are taken so when the table exceeds a specified load factor (say, 80%), you expand the table, rehashing all entries on the new p size, which will change where they all end up residing.
Anyway, hope it makes sense.

How can i make this algorithm more memory efficient?

I have been working on this piece of code for a couple of days and it seems
to work just fine for search trees of relative small size, but working with higher ones gives segmentation fault.
I have tried to find the specific place where such error comes from by using printing flags, and it seems to break at the is_goal() function. But the code of that function can be assumed as correct.
So the problems seems to be a memory problem, that's why I have the queue in the heap, and also the Nodes that I am creating. But it still crashes.
Are there any more memory saving tricks that can be taking into account that I am not using? Or maybe there is a better way to save such nodes?
It is important to note that I need to use BFS (I can't use A* to make the search tree smaller).
Also if you need to know, the map is a hash table where I save the coloring of the nodes so I don't have duplicates when exploring (This is because the hash saves depending on the state information and not the Node information).
EDIT: i have been told to indicate the goal to accomplish, such is to find the goal state in the search tree, the goal is to be be able to iterate over
big problems like rubik 3x3x3, n-puzzle 4x4 5x5, tower of hanoi and such.To do so, the memory used has to be minimal since the search tree of such problems are really big, the final objective is to compare this algorithm with others like a*, dfid, ida*, and so on. I hope this is enough information.
Node* BFS(state_t start ){
state_t state, child;
int d, ruleid;
state_map_t *map = new_state_map();
ruleid_iterator_t iter;
std::queue<Node*> *open = new std::queue<Node*>();
state_map_add( map, &start, 1);
Node* n = new Node(start);
open->push(n);
while( !open->empty() ) {
n = open->front();
state = n->get_state();
open->pop();
if (is_goal(&state) == 1) return n;
init_fwd_iter( &iter, &state );
while( ( ruleid = next_ruleid( &iter ) ) >= 0 ) {
apply_fwd_rule( ruleid, &state, &child );
const int *old_child_c = state_map_get( map, &child );
if ( old_child_c == NULL ) {
state_map_add( map, &child, 1 );
Node* nchild = new Node(child);
open->push(nchild);
}
}
}
return NULL;
}
I see a number of memory leaks.
open is never deleted but maybe could allocated in the stack instead of in the heap.
std::queue<Node*> open;
More important none of the node you push in the queue are deleted this is probably the origin of very big memory consumption.
Delete the nodes that you remove from the queue and that you don't plan to reuse. Delete the nodes of the queue before your get rid of the queue.
Node* BFS(state_t start ){
state_t state, child;
int d, ruleid;
state_map_t *map = new_state_map();
ruleid_iterator_t iter;
std::queue<Node*> open;
state_map_add( map, &start, 1);
Node* n = new Node(start);
open.push(n);
while( !open.empty() ) {
n = open.front();
state = n->get_state();
open.pop();
if (is_goal(&state) == 1) {
for (std::queue<Node*>::iterator it = open.begin(); it != open.end(); ++it)
delete *it;
return n;
}
else {
delete n;
}
init_fwd_iter( &iter, &state );
while( ( ruleid = next_ruleid( &iter ) ) >= 0 ) {
apply_fwd_rule( ruleid, &state, &child );
const int *old_child_c = state_map_get( map, &child );
if ( old_child_c == NULL ) {
state_map_add( map, &child, 1 );
Node* nchild = new Node(child);
open.push(nchild);
}
}
}
return NULL;
}

How to make sure user enters allowed enum

I have to write a program with an Enum state, which is the 50 2-letter state abbreviations(NY, FL, etc). I need to make a program that asks for the user info and they user needs to type in the 2 letters corresponding to the state. How can I check that their input is valid i.e matches a 2 letter state defined in Enum State{AL,...,WY}? I suppose I could make one huge if statement checking if input == "AL" || ... || input == "WY" {do stuff} else{ error input does not match state }, but having to do that for all 50 states would get a bit ridiculous. Is there an easier way to do this?
Also if Enum State is defined as {AL, NY, FL}, how could I cast a user input, which would be a string, into a State? If I changed the States to {"AL", "NY", "FL"} would that be easier or is there another way to do it?
Unfortunately C++ does not provide a portable way to convert enum to string and vice versa. Possible solution would be to populate a std::map<std::string,State> (or hash map) and on conversion do a lookup. You can either populate such map manually or create a simple script that will generate a function in a .cpp file to populate this map and call that script during build process. Your script can generate if/else statements instead of populating map as well.
Another, more difficult, but more flexible and stable solution is to use compiler like llvm and make a plugin that will generate such function based on syntax tree generated by compiler.
The simplest method is to use an STL std::map, but for academic exercises that may not be permitted (for example it may be required to use only techniques covered in the course material).
Unless explicitly initialised, enumerations are integer numbered sequentially starting from zero. Given that, you can scan a lookup-table of strings, and cast the matching index to an enum. For example:
enum eUSstate
{
AL, AK, AZ, ..., NOSTATE
} ;
eUSstate string_to_enum( std::string inp )
{
static const int STATES = 50 ;
std::string lookup[STATES] = { "AL", "AK", "AZ" ... } ;
int i = 0 ;
for( i = 0; i < STATES && lookup[i] != inp; i++ )
{
// do nothing
}
return static_cast<eUSstate>(i) ;
}
If perhaps you don't want to rely on a brute-force cast and maintaining a look-up table in the same order as the enumerations, then a lookup table having both the string and the matching enum may be used.
eUSstate string_to_enum( std::string inp )
{
static const int STATES = 50 ;
struct
{
std::string state_string ;
eUSstate state_enum ;
} lookup[STATES] { {"AL", AL}, {"AK", AK}, {"AZ", AL} ... } ;
eUSstate ret = NOSTATE ;
for( int i = 0; ret == NOSTATE && i < STATES; i++ )
{
if( lookup[i].state_string == inp )
{
ret = lookup[i].state_enum ;
}
}
return ret ;
}
The look-up can be optimised by taking advantage of alphabetical ordering and performing a binary search, but for 50 states it is hardly worth it.
What you need is a table. Because the enums are linear,
a simple table of strings would be sufficient:
char const* const stateNames[] =
{
// In the same order as in the enum.
"NY",
"FL",
// ...
};
Then:
char const* const* entry
= std::find( std::begin( stateNames ), std::end( stateNames ), userInput );
if (entry == std::end( stateNames ) ) {
// Illegal input...
} else {
State value = static_cast<State>( entry - std::begin( stateNames ) );
Alternatively, you can have an array of:
struct StateMapping
{
State enumValue;
char const* name;
struct OrderByName
{
bool operator()( StateMapping const& lhs, StateMapping const& rhs ) const
{
return std::strcmp( lhs.name, rhs. name ) < 0;
}
bool operator()( StateMapping const& lhs, std::string const& rhs ) const
{
return lhs.name < rhs;
}
bool operator()( std::string const& lhs, StateMapping const& rhs ) const
{
return lhs < rhs.name;
}
};
};
StateMapping const states[] =
{
{ NY, "NY" },
// ...
};
sorted by the key, and use std::lower_bound:
StateMapping::OrderByName cmp;
StateMapping entry =
std::lower_bound( std::begin( states ), std::end( states ), userInput, cmp );
if ( entry == std::end( states ) || cmp( userInput, *entry) {
// Illegal input...
} else {
State value = entry->enumValue;
// ...
}
The latter is probably slightly faster, but for only fifty
entries, I doubt you'll notice the difference.
And of course, you don't write this code manually; you generate
it with a simple script. (In the past, I had code which would
parse the C++ source for the enum definitions, and generate the
mapping functionality from them. It's simpler than it sounds,
since you can ignore large chunks of the C++ code, other than
for keeping track of the various nestings.)
The solution is simple, but only for 2 characters in the string (as in your case):
#include <stdio.h>
#include <stdint.h>
enum TEnum
{
AL = 'LA',
NY = 'YN',
FL = 'LF'
};
int _tmain(int argc, _TCHAR* argv[])
{
char* input = "NY";
//char* input = "AL";
//char* input = "FL";
switch( *(uint16_t*)input )
{
case AL:
printf("input AL");
break;
case NY:
printf("input NY");
break;
case FL:
printf("input FL");
break;
}
return 0;
}
In above example I used an enumeration with a double character code (it is legal) and passed to the switch statement a input string. I tested it end work!. Notice the word alignment in enumeration.
Ciao

How do I implement an erase function for a hash table?

I have a hash table using linear probing. I've been given the task to write an erase(int key) function with the following guidelines.
void erase(int key);
Preconditions: key >= 0
Postconditions: If a record with the specified key exists in the table, then
that record has been removed; otherwise the table is unchanged.
I was also given some hints to accomplish the task
It is important to realize that the insert function will allow you to add a new entry to the table, or to update an existing entry in the table.
For the linear probing version, notice that the code to insert an
item has two searches. The insert() function calls function
findIndex() to search the table to see if the item is already in the
table. If the item is not in the table, a second search is done to
find the position in the table to insert the item. Adding the ability
to delete an entry will require that the insertion process be
modified. When searching for an existing item, be sure that the
search does not stop when it comes to a location that was occupied
but is now empty because the item was deleted. When searching for a
position to insert a new item, use the first empty position - it does
not matter if the position has ever been occupied or not.
So I've started writing erase(key) and I seem to have run into the problem that the hints are referring to, but I'm not positive what it means. I'll provide code in a second, but what I've done to test my code is set up the hash table so that it will have a collision and then I erase that key and rehash the table but it doesn't go into the correct location.
For instance, I add a few elements into my hash table:
The hash table is:
Index Key Data
0 31 3100
1 1 100
2 2 200
3 -1
4 -1
5 -1
6 -1
7 -1
8 -1
9 -1
10 -1
11 -1
12 -1
13 -1
14 -1
15 -1
16 -1
17 -1
18 -1
19 -1
20 -1
21 -1
22 -1
23 -1
24 -1
25 -1
26 -1
27 -1
28 -1
29 -1
30 -1
So all of my values are empty except the first 3 indices. Obviously key 31 should be going into index 1. But since key 1 is already there, it collides and settles for index 0. I then erase key 1 and rehash the table but key 31 stays at index 0.
Here are the functions that may be worth looking at:
void Table::insert( const RecordType& entry )
{
bool alreadyThere;
int index;
assert( entry.key >= 0 );
findIndex( entry.key, alreadyThere, index );
if( alreadyThere )
table[index] = entry;
else
{
assert( size( ) < CAPACITY );
index = hash( entry.key );
while ( table[index].key != -1 )
index = ( index + 1 ) % CAPACITY;
table[index] = entry;
used++;
}
}
Since insert uses findIndex, I'll include that as well
void Table::findIndex( int key, bool& found, int& i ) const
{
int count = 0;
assert( key >=0 );
i = hash( key );
while ( count < CAPACITY && table[i].key != -1 && table[i].key != key )
{
count++;
i = (i + 1) % CAPACITY;
}
found = table[i].key == key;
}
And here is my current start on erase
void Table::erase(int key)
{
assert(key >= 0);
bool found, rehashFound;
int index, rehashIndex;
//check if key is in table
findIndex(key, found, index);
//if key is found, remove it
if(found)
{
//remove key at position
table[index].key = -1;
table[index].data = NULL;
cout << "Found key and removed it" << endl;
//reduce the number of used keys
used--;
//rehash the table
for(int i = 0; i < CAPACITY; i++)
{
if(table[i].key != -1)
{
cout << "Rehashing key : " << table[i].key << endl;
findIndex(table[i].key, rehashFound, rehashIndex);
cout << "Rehashed to index : " << rehashIndex << endl;
table[rehashIndex].key = table[i].key;
table[rehashIndex].data = table[i].data;
}
}
}
}
Can someone explain what I need to do to make it rehash properly? I understand the concept of a hash table, but I seem to be doing something wrong here.
EDIT
As per user's suggestion:
void Table::erase(int key)
{
assert(key >= 0);
bool found;
int index;
findIndex(key, found, index);
if(found)
{
table[index].key = -2;
table[index].data = NULL;
used--;
}
}
//modify insert(const RecordType & entry)
while(table[index].key != -1 || table[index].key != -2)
//modify findIndex
while(count < CAPACITY && table[i].key != -1
&& table[i].key != -2 && table[i].key != key)
When deleting an item from the table, don't move anything around. Just stick a "deleted" marker there. On an insert, treat deletion markers as empty and available for new items. On a lookup, treat them as occupied and keep probing if you hit one. When resizing the table, ignore the markers.
Note that this can cause problems if the table is never resized. If the table is never resized, after a while, your table will have no entries marked as never used, and lookup performance will go to hell. Since the hints mention keeping track of whether an empty position was ever used and treating once-used cells differently from never-used, I believe this is the intended solution. Presumably, resizing the table will be a later assignment.
It's not necessary to rehash the entire table every time a delete is done. If you want to minimise degradation in performance, then you can compact the table by considering whether any of the elements after (with wrapping from end to front allowed) the deleted element but before the next -1 hash to a bucket at or before the deleted element - if so, then they can be moved to or at least closer to their hash bucket, then you can repeat the compaction process for the just-moved element.
Doing this kind of compaction will remove the biggest flaw in your current code, which is that after a little use every bucket will be marked as either in use or having been used, and performance for e.g. find of a non-existent value will degrade to O(CAPACITY).
Off the top of my head with no compiler/testing...
int Table::next(int index) const
{
return (index + 1) % CAPACITY;
}
int Table::distance(int from, int to) const
{
return from < to ? to - from : to + CAPACITY - from;
}
void Table::erase(int key)
{
assert(key >= 0);
bool found;
int index;
findIndex(key, found, index);
if (found)
{
// compaction...
int limit = CAPACITY - 1;
for (int compact_from = next(index);
limit-- && table[compact_from].key >= 0;
compact_from = next(compact_from))
{
int ideal = hash(table[compact_from].key);
if (distance(ideal, index) <
distance(ideal, compact_from))
{
table[index] = table[compact_from];
index = compact_from;
}
}
// deletion
table[index].key = -1;
delete table[index].data; // or your = NULL if not a leak? ;-.
--used;
}
}

How do I make make my hash table with linear probing more efficient?

I'm trying to implement an efficient hash table where collisions are solved using linear probing with step. This function has to be as efficient as possible. No needless = or == operations. My code is working, but not efficient. This efficiency is evaluated by an internal company system. It needs to be better.
There are two classes representing a key/value pair: CKey and CValue. These classes each have a standard constructor, copy constructor, and overridden operators = and ==. Both of them contain a getValue() method returning value of internal private variable. There is also the method getHashLPS() inside CKey, which return hashed position in hash table.
int getHashLPS(int tableSize,int step, int collision) const
{
return ((value + (i*step)) % tableSize);
}
Hash table.
class CTable
{
struct CItem {
CKey key;
CValue value;
};
CItem **table;
int valueCounter;
}
Methods
// return collisions count
int insert(const CKey& key, const CValue& val)
{
int position, collision = 0;
while(true)
{
position = key.getHashLPS(tableSize, step, collision); // get position
if(table[position] == NULL) // free space
{
table[position] = new CItem; // save item
table[position]->key = CKey(key);
table[position]->value = CValue(val);
valueCounter++;
break;
}
if(table[position]->key == key) // same keys => overwrite value
{
table[position]->value = val;
break;
}
collision++; // current positions is full, try another
if(collision >= tableSize) // full table
return -1;
}
return collision;
}
// return collisions count
int remove(const CKey& key)
{
int position, collision = 0;
while(true)
{
position = key.getHashLPS(tableSize, step, collision);
if(table[position] == NULL) // free position - key isn't in table or is unreachable bacause of wrong rehashing
return -1;
if(table[position]->key == key) // found
{
table[position] = NULL; // remove it
valueCounter--;
int newPosition, collisionRehash = 0;
for(int i = 0; i < tableSize; i++, collisionRehash = 0) // rehash table
{
if(table[i] != NULL) // if there is a item, rehash it
{
while(true)
{
newPosition = table[i]->key.getHashLPS(tableSize, step, collisionRehash++);
if(newPosition == i) // same position like before
break;
if(table[newPosition] == NULL) // new position and there is a free space
{
table[newPosition] = table[i]; // copy from old, insert to new
table[i] = NULL; // remove from old
break;
}
}
}
}
break;
}
collision++; // there is some item on newPosition, let's count another
if(collision >= valueCounter) // item isn't in table
return -1;
}
return collision;
}
Both functions return collisions count (for my own purpose) and they return -1 when the searched CKey isn't in the table or the table is full.
Tombstones are forbidden. Rehashing after removing is a must.
The biggest change for improvement I see is in the removal function. You shouldn't need to rehash the entire table. You only need to rehash starting from the removal point until you reach an empty bucket. Also, when re-hashing, remove and store all of the items that need to be re-hashed before doing the re-hashing so that they don't get in the way when placing them back in.
Another thing. With all hashes, the quickest way to increase efficiency to to decrease the loadFactor (the ratio of elements to backing-array size). This reduces the number of collisions, which means less iterating looking for an open spot, and less rehashing on removal. In the limit, as the loadFactor approaches 0, collision probability approaches 0, and it becomes more and more like an array. Though of course memory use goes up.
Update
You only need to rehash starting from the removal point and moving forward by your step size until you reach a null. The reason for this is that those are the only objects that could possibly change their location due to the removal. All other objects would wind up hasing to the exact same place, since they don't belong to the same "collision run".
A possible improvement would be to pre-allocate an array of CItems, that would avoid the malloc()s / news and free() deletes; and you would need the array to be changed to "CItem *table;"
But again: what you want is basically a smooth ride in a car with square wheels.