I am going to store chars in some container, then take chars from a string and look up the index of each char in that container.
I also need to do the opposite: take an index and find out what char is there.
So it would be more like:
container<char> c;
int index = c.indexOf('a');
char letter = c[12];
I don't care about insert or remove operations.
I suppose the best solution would be to just use a string or char table.
Then do:
int index = 'a'-myString[0]; //for lookup
char c = myString[index]; //for index
The generic data structure that can look up in both directions is the bidirectional map. If implemented with hash tables, it offers constant-time lookup in both directions.
Now, if we can assume that char is 8 bits wide and that your indices are contiguous, we can use a much simpler data structure: simply store the characters in an array, and use another array of size 1 << CHAR_BIT to store the index of each character.
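For example, a minimal sketch of that two-array approach (the character set here is only an illustration):

#include <climits>
#include <cstddef>

int main()
{
    const char chars[] = "abcdefghijklmnopqrstuvwxyz";   // index -> char
    const std::size_t count = sizeof(chars) - 1;

    int indexOf[1 << CHAR_BIT];                          // char -> index
    for (std::size_t i = 0; i < (1u << CHAR_BIT); ++i)
        indexOf[i] = -1;                                 // -1 marks "character not stored"
    for (std::size_t i = 0; i < count; ++i)
        indexOf[(unsigned char)chars[i]] = (int)i;

    int index = indexOf[(unsigned char)'a'];             // 0
    char letter = chars[12];                             // 'm'
    (void)index; (void)letter;
}

Both lookups are single array accesses, so they run in constant time.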
So I'm just learning (or trying to) a bit about hashing. I'm attempting to write a hashing function, but I'm confused about where I save the data. I'm trying to calculate the number of collisions and print that out. I have made 3 different files: one with 10,000 words, one with 20,000 and one with 30,000. Each word is just 10 random numbers/letters.
long hash(char* s){
    long h = 0;
    for(int i = 0; i < 10; i++){
        h = h + (int)s[i];
    }
    //A lot of examples then mod h by the table size
    //I'm a bit confused what this table is... Is it an array of
    //10,000 (or however many words)?
    //h % TABLE_SIZE
    return h;
}
int main (int argc, char* argv[]){
    fstream input(argv[1]);
    char nextWord[32];   // a real buffer; an uninitialized char* has nowhere to store the word
    while(!input.eof()){
        input >> nextWord;
        hash(nextWord);
    }
}
So that's what I currently have, but I can't figure out what the table is exactly, as I said in the comments above. Is it a predefined array in my main with room for the number of words? For example, if I have a file of 10 words, do I make an array of size 10 in my main? Then if/when I return h, let's say the values come out in the order 3, 7, 2, 3.
The 4th word is a collision, correct? When that happens, do I add 1 to my collision count and then check whether slot 4 is also full?
Thanks for the help!
The point of hashing is to have constant-time access to every element you store. I'll try to explain with a simple example below.
First, you need to know how much data you'll have to store. Say, for example, you want to store numbers and you know you won't store any number greater than 10. The simplest solution is to create an array with 10 elements. That array is your "table", where you store your numbers. So how do you achieve that amazing constant-time access? With a hashing function! Its job is to return an index into your array. Let's create a simple one: if you'd like to store 7, you just save it in the array at position 7. Every time you'd like to look up element 7, you just pass it to your hashing function and bzaah! You get the position of your element in constant time! But what if you'd like to store more elements with value 7? Your simple hashing function returns 7 for every such element, and that position is already occupied! How do you solve that? Well, there aren't many solutions; the simplest are:
1: Open addressing (probing) - you simply save the element in the first free position. This has a significant drawback: imagine you want to delete some element ... (this is the method you are describing in your question)
2: Separate chaining (linked lists) - if you create an array of pointers to linked lists, you can easily add your new element at the end of the linked list that sits at position 7!
Both of these simple solutions have their drawbacks. I guess you can see them. As @rwols has said, you don't have to use an array. You can also use a tree, or be a real C++ master and use unordered_map and unordered_set with a custom hash function, which is quite cool. There is also a structure named trie, which is useful when you'd like to create some sort of dictionary (where it is really hard to know how many words you will need to store).
To sum it up: you have to know how many things you want to store, and then create an ideal hashing function that covers an array of appropriate size and, in a perfect world, has a uniform index distribution with no collisions. (Achieving this is pretty hard, and in the real world I guess it's impossible, so the fewer collisions, the better.)
Your hash function is pretty bad. It will have a lot of collisions (for example, the strings "ab" and "ba" hash to the same value), and you also need to take it mod m, with m being the size of your array (a.k.a. the table), so that you can actually store the result in the array. The modulo is there to make the hash function "fit" into the table you defined at the beginning, because you can't save an element at position 11, 12, ... if your array only has 10 slots.
What should a good hashing function look like? Well, there are better sources than me. Here is an example (alert: it's in Java).
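If it helps, here is one common, simple scheme written out in C++ (a djb2-style string hash; just a sketch, your course may expect something different):

// djb2-style string hash: mixes in the position of each character,
// so "ab" and "ba" no longer collide; reduce it mod the table size at the end
unsigned long djb2_hash(const char* s, unsigned long table_size){
    unsigned long h = 5381;
    while(*s){
        h = h * 33 + (unsigned char)*s++;
    }
    return h % table_size;
}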
To your example: you simply can't save 10k or more words into a table of size 10. That'll create a lot of collisions and you lose the main benefit of a hashing function - constant-time access to the elements you saved.
And how would your code look? Something like this:
int main (int argc, char* argv[]){
    fstream input(argv[1]);
    char nextWord[32];                      // buffer for one word
    TypeOfElement table[size_of_table];     // this array is your "table"
    while(input >> nextWord){
        table[hash(nextWord) % size_of_table] = // desired element which you want to save
    }
}
But I guess your goal isn't to save something somewhere, but to count the number of collisions. Also note that the code above doesn't resolve collisions. If you'd like to count collisions, create an array table of ints and initialize it to zero. Then just increment the value stored at the index returned by your hash function, like this:
table[hash(nextWord) % size_of_table]++;
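Putting those pieces together, a minimal self-contained version of the counting idea might look like this (the table size of 10007 and the 10-character words are assumptions based on your description):

#include <fstream>
#include <iostream>
using namespace std;

const int TABLE_SIZE = 10007;              // assumed table size

long hash_word(const char* s){             // same summing hash as in your question, reduced mod the table size
    long h = 0;
    for(int i = 0; i < 10 && s[i] != '\0'; i++){
        h += (unsigned char)s[i];
    }
    return h % TABLE_SIZE;
}

int main(int argc, char* argv[]){
    ifstream input(argv[1]);
    int table[TABLE_SIZE] = {0};           // how many words landed in each slot
    long collisions = 0;
    char word[32];
    while(input >> word){
        long idx = hash_word(word);
        if(table[idx] > 0)                 // slot already occupied: this word collides with an earlier one
            ++collisions;
        ++table[idx];
    }
    cout << "collisions: " << collisions << "\n";
}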
I hope I helped. Please specify what else you want to know.
If a hash table is required, then as others have stated, std::unordered_map will work in most cases. Now, if you need something more powerful because of a large entry base, then I would suggest looking into tries. Tries combine the concepts of array insertion, hashing, and linked lists. The run time is close to O(M), where M is the number of characters in the string you are hashing. They help remove the chance of collisions, and the more you add to a trie structure, the less work has to be done, as certain nodes are already opened and created. The one drawback is that tries require more memory. Here is a diagram.
Now your trie may vary in the size of its array depending on what you are storing, but the overall concept and construction is the same. If you were doing a word-definition lookup, then you may want an array of 26, or a few more, for each possible hashing character.
To count the number of words which have the same hash, we need to know the hashes of all previous words. When you compute the hash of a word, you should write it down, for example in an array. So you need an array with size equal to the number of words.
Then you compare each new hash with all previous ones. The method of counting depends on what you need - the number of colliding pairs or the number of equal elements.
A hash function should not be responsible for storing data. Normally you would have a container that uses the hash function internally.
From what you wrote, I understood that you want to create a hashtable. Here is one way you could do that (probably not the most efficient one, but it should give you an idea):
#include <fstream>
#include <vector>
#include <string>
#include <map>
#include <memory>
using namespace std;
namespace example {
long hash(char* s){
    long h = 0;
    for(int i = 0; i < 10; i++){
        h = h + (int)s[i];
    }
    return h;
}
}
int main (int argc, char* argv[]){
    fstream input(argv[1]);
    char nextWord[32];   // buffer for one word; an uninitialized char* would be undefined behaviour
    std::map<long, std::unique_ptr<std::vector<std::string>>> hashtable;
    while(input >> nextWord){
        long newHash = example::hash(nextWord);
        auto it = hashtable.find(newHash);
        // Is this the first word with this hash, or a collision?
        if (it == hashtable.end()) {
            hashtable.insert(std::make_pair(newHash, std::unique_ptr<std::vector<std::string>>(new std::vector<std::string> { nextWord } )));
        }
        else {
            it->second->push_back(nextWord);   // collision: another word with the same hash
        }
    }
}
I used some C++11 features to write the example faster.
I am not sure that I understand what you do not understand. The explanations below might help you.
A hash table is a kind of associative array. It is used to map keys to values in a similar manner to how an array is used to map indexes (keys) to values. For instance, an array of three numbers, { 11, -22, 33 }, associates index 0 with 11, index 1 with -22 and index 2 with 33.
Now, let us assume that we would like to associate 1 with 11, 2 with -22 and 3 with 33. The solution is simple: we keep the same array, only we transform the key by subtracting one from it, thus obtaining the original index.
This is fine until we realize that this is just a particular case. What if the keys are not so “predictable”? A solution would be to put the associations in a list of {key, value} pairs: {123, 11}, {3, -22}, {0, 33}. When the value associated with 3 is asked for, we simply search the keys in the list for a match and find -22. That's fine, but if the list is large we're in trouble. We could speed up the search by sorting the array by keys and using binary search, but the search may still take some time if the list is large.
The search speed may be further enhanced if we break the list in sub-lists (or buckets) made of related pairs. This is what a hash function does: puts together pairs by related keys (an ideal hash function would associate one key to one value).
A hash table is a two columns table (an array):
The first column is the hash key (the index computed by a hash function). The size of the hash table is given by the maximum value of the hash function. If, for instance, the last step in computing the hash function is modulo 10, the size of the table will be 10; the pairs list will be broken into 10 sub-lists.
The second column is a list (bucket) of key/value pairs (the sub-list I was talking about).
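To make the two-column picture concrete, here is a minimal sketch (the modulo-10 hash and the sample pairs are only illustrative):

#include <iostream>
#include <utility>
#include <vector>

int main()
{
    const int TABLE_SIZE = 10;                                    // the hash is simply "key % 10"
    std::vector<std::vector<std::pair<int, int>>> table(TABLE_SIZE);

    auto put = [&](int key, int value) {
        table[key % TABLE_SIZE].push_back({key, value});          // column 2: the bucket of pairs
    };
    auto get = [&](int key) {
        for (auto& kv : table[key % TABLE_SIZE])                  // search only one bucket, not the whole list
            if (kv.first == key) return kv.second;
        return -1;                                                // not found
    };

    put(123, 11);
    put(3, -22);
    put(0, 33);
    std::cout << get(3) << '\n';   // prints -22 (123 and 3 share bucket 3, so that bucket holds two pairs)
}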
I am trying the lower_bound function in C++.
I have used it multiple times for 1-D data types.
Now, I am trying it on a sorted array dict[5000][20] to find strings of size <= 20.
The string to be matched is in str.
bool recurseSerialNum(char *name, int s, int l, char (*keypad)[3], string str, char (*dict)[20], int dictlen)
{
char (*idx)[20]= lower_bound(&dict[0],&dict[0]+dictlen,str.c_str());
int tmp=idx-dict;
if(tmp!=dictlen)
printf("%s\n",*idx);
}
As per http://www.cplusplus.com/reference/algorithm/lower_bound/?kw=lower_bound , this function is supposed to return 'last' (one past the end) in case no match is found, i.e. tmp should equal dictlen.
In my case, it always returns the beginning, i.e. I get tmp equal to 0, both 1. when passed a string that is in the dict and 2. when passed a string that is not in the dict.
I think the issue is in the handling and passing of the pointer. The default comparator should be available for this case, as it is for vector. I also tried passing an explicit one, to no avail.
I tried this comparator -
bool compStr(const char *a, const char *b){
return strcmp(a,b)<0;
}
I know the alternative is to use vector, etc., but I would like to know what the issue is in this one.
I searched for this on Google as well as SO, but did not find anything similar.
There are two misunderstandings here, I believe.
std::lower_bound does not check if an element is part of a sorted range. Instead it finds the leftmost place where an element could be inserted without breaking the ordering.
You're not comparing the contents of the strings but their memory addresses.
It is true that dict in your case is a sorted range, in the sense that the memory addresses of the inner arrays are ascending. Where str.c_str() lies in relation to this is, of course, unspecified. In practice, since dict is a stack object, you will often find that the memory range for the heap (where str.c_str() invariably lies) is below that of the stack, in which case lower_bound quite correctly tells you that if you wanted to insert this address into the sorted range of addresses that you are treating dict as, you'd have to do so at the beginning.
For a solution, since there is an operator<(char const *, std::string const &), you could simply write
char (*idx)[20] = lower_bound(&dict[0], &dict[0] + dictlen, str);
...but are you perhaps really looking for std::find?
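For completeness, a small self-contained sketch of that fix (the dictionary contents are invented for illustration):

#include <algorithm>
#include <cstdio>
#include <string>

int main()
{
    char dict[10][20] = { "apple", "banana", "cherry", "date" };  // already sorted
    int dictlen = 4;
    std::string str = "banana";

    // Compares the strings themselves (via operator<(const char*, const std::string&)),
    // not the addresses of the inner arrays.
    char (*idx)[20] = std::lower_bound(&dict[0], &dict[0] + dictlen, str);
    int tmp = idx - dict;
    if (tmp != dictlen && str == *idx)      // lower_bound alone does not confirm an exact match
        std::printf("found %s at index %d\n", *idx, tmp);
}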
I'm trying to sort a 10X15 array of characters, where each row is a word. My goal is to sort it in a descending order, from the largest value word at the top, at array[row 0][column 0 through 14] position, and the smallest value word at the bottom array[row 9][column 0 through 14]. Each row is a word (yeah, they don't look as words, but it's to test the sorting capability of the program).
To clarify: What I need to do is this... considering that EACH row is a whole word, I need to sort the rows from the highest value word being at the top, and the lowest value word being at the bottom.
Edit:
Everything works now. For anyone who has a similar question, look to the comments below, there are several fantastic solutions, I just went with the one where I create my own sort function to learn more about sorting. And thanks to all of you for helping me! :)
You are using C++, so quit using raw arrays and begin with STL types:
Convert each row into a string:
string tempString;
for (int i = 0; i < rowSize; ++i) {
    tempString.push_back(array[row][i]);
}
add them to a vector
std::vector<std::string> sorter;
sorter.push_back(tempString);
Do that for each row.
std::vector<std::string> sorter;
for each row {
for each column {
the string thing
}
push back the string
}
Then sort the vector with std::sort and write the vector back into the array (if you have to but don't because arrays suck)
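Putting it together, a compact sketch might look like this (ROWS, COLS and the sample words are assumptions):

#include <algorithm>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    const int ROWS = 10, COLS = 15;
    char array[ROWS][COLS] = { "zebra", "apple", "mango" };      // remaining rows stay zero-filled

    // Convert each row into a std::string (padding characters included)
    std::vector<std::string> sorter;
    for (int r = 0; r < ROWS; ++r)
        sorter.push_back(std::string(array[r], COLS));

    // Sort descending: largest word at the top
    std::sort(sorter.begin(), sorter.end(), std::greater<std::string>());

    // Write the sorted words back into the array (only if you really have to)
    for (int r = 0; r < ROWS; ++r)
        sorter[r].copy(array[r], COLS);

    for (int r = 0; r < ROWS; ++r)
        std::cout << sorter[r].c_str() << '\n';                  // c_str() stops printing at the padding
}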
As usual, you need qsort:
void qsort( void *ptr, size_t count, size_t size,
            int (*comp)(const void *, const void *) );
That takes a void pointer to your starting address, the number of elements to sort, the size of each element, and a comparison function.
You would call it like this:
qsort( array, ROWS, COLS, compare_word );
Where you define compare_word to sort in reverse:
int compare_word( const void* a, const void* b )
{
    // arguments reversed so the sort comes out in descending order
    return strncmp( (const char*)b, (const char*)a, COLS );
}
Now, given that each word is 15 characters long, there may be padding to deal with. I don't have absolute knowledge that the array will be packed as 10 by 15 instead of 10 by 16. But if you suspect so, you could pass (&array[1][0] - &array[0][0]) as the element size instead of COLS.
If you are not allowed to use qsort and instead must write your own sorting algorithm, do something simple like selection sort. You can use strncmp to test the strings. Look up the function (google makes it easy, or if you use Linux, man 3 strncmp). To swap the characters, you could use a temporary char array of length COLS and then 3 calls to memcpy to swap the words.
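If you do write your own, a sketch of that selection sort might look like this (descending order, assuming the same packed ROWS x COLS layout):

#include <cstring>

const int ROWS = 10;
const int COLS = 15;

void sort_rows_desc(char array[ROWS][COLS])
{
    for (int i = 0; i < ROWS - 1; ++i) {
        int largest = i;
        for (int j = i + 1; j < ROWS; ++j)
            if (strncmp(array[j], array[largest], COLS) > 0)
                largest = j;                        // remember the largest remaining word
        if (largest != i) {
            char tmp[COLS];                         // 3 memcpy calls to swap the two words
            memcpy(tmp, array[i], COLS);
            memcpy(array[i], array[largest], COLS);
            memcpy(array[largest], tmp, COLS);
        }
    }
}

int main()
{
    char words[ROWS][COLS] = { "apple", "zebra", "mango" };
    sort_rows_desc(words);
}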
The problem with your new code using string and vector is a simple typo:
sorter[count] = array[count+1]; should be sorter[count] = sorter[count+1];
I have some data that needs to be stored and looked up efficiently. Preferably using C.
Each line of the data file is in the following format:
key1 key2 key3 data
where key1, key2, key3 are integers and data is an array of float.
I am thinking about converting key1, key2, key3 into a string, then using C++ std::map to map the string to a float pointer:
std::map<string, float*>
Are there better ways of doing it?
Note: integer keys 1, 2 and 3 each have a range of 0-4000, but are very sparsely populated. In other words, if you go through all the values in key1, you will find < 100 unique ints within the range of 0-4000.
You can use std::tuple to combine the three values into one:
std::map<std::tuple<int, int, int>, float *>
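For example, a minimal usage sketch (the key values and data are invented):

#include <map>
#include <tuple>

int main()
{
    std::map<std::tuple<int, int, int>, float*> table;

    static float data[] = { 1.0f, 2.0f, 3.0f };
    table[std::make_tuple(12, 345, 6)] = data;            // insert one record

    auto it = table.find(std::make_tuple(12, 345, 6));
    if (it != table.end()) {
        float first = it->second[0];                      // 1.0f
        (void)first;
    }
}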
You do not have to use strings if the data limit for each key is from 0 to 4000.
First generate the combined key as follows:
unsigned long long ulCombinedKey = key1 + ((unsigned long long)key2 << 12) + ((unsigned long long)key3 << 24);  // parentheses are required: << binds more loosely than +
After that you can use the map as you already stated in your question.
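A small usage sketch of that idea (a 64-bit key type is used here to be safe, since three 12-bit fields need 36 bits and do not fit in a 32-bit unsigned long on every platform):

#include <cstdint>
#include <map>

std::uint64_t make_key(int key1, int key2, int key3)
{
    // 12 bits per field is enough because each key is at most 4000 (< 4096)
    return (std::uint64_t)key1 | ((std::uint64_t)key2 << 12) | ((std::uint64_t)key3 << 24);
}

int main()
{
    std::map<std::uint64_t, float*> table;

    static float data[] = { 0.5f, 1.5f };
    table[make_key(7, 3999, 42)] = data;      // store
    float* p = table[make_key(7, 3999, 42)];  // look up by rebuilding the same packed key
    (void)p;
}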
A hierarchical map would do it:
map<int, map<int , map<int, list<float> > > > records;
and the access time would be good (logarithmic).
This way is efficient if the key range is very wide. Otherwise, for a range of 4000, the shift-based key suggested in the previous answer is faster and more efficient.
A hash provides very fast access to data, so you might want to use hashes to look up values from each of the three integers. This approach can be used in either C or C++.
For each line of data:
1. allocate space for the array of floats
2. store a pointer to the array of floats in an array of pointers
3. store the index of the pointer array in a hash based on int1
4. store the index of the pointer array in a hash based on int2
5. store the index of the pointer array in a hash based on int3
This way, given int1, int2, or int3, one could look up a pointer array index, retrieve the pointer, then follow the pointer to the array of floats. This approach uses some memory, but not too much, given the problem said there are < 100 unique values for each of int1, int2, and int3.
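A sketch of that scheme using std::unordered_map as the hash (the record type and key values are illustrative):

#include <cstddef>
#include <unordered_map>
#include <vector>

struct Record {
    std::vector<float> data;                 // the array of floats from one line
};

int main()
{
    std::vector<Record> records;                                       // steps 1-2: the record array
    std::unordered_map<int, std::vector<std::size_t>> byKey1, byKey2, byKey3;

    // Adding one line "17 230 5 1.0 2.0":
    int key1 = 17, key2 = 230, key3 = 5;
    records.push_back({ {1.0f, 2.0f} });
    std::size_t idx = records.size() - 1;
    byKey1[key1].push_back(idx);                                       // steps 3-5: remember where the record lives
    byKey2[key2].push_back(idx);
    byKey3[key3].push_back(idx);

    // Given key2 alone, visit all matching records:
    for (std::size_t i : byKey2[230]) {
        const Record& r = records[i];
        (void)r;
    }
}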
I need a way in which I can take the input in a 2D array and sort it row-wise in one of the fastest ways. I tried taking the input and sorting it simultaneously using insertion sort. The second thing I tried was taking a multimap for each row individually and inserting with the key being the value I want and the mapped value being some dummy value. Since a map sorts keys on insertion, I thought that could be one way.
The below code is for making sure that one row in my 2D array has its elements sorted in a
multimap. Basically you can say that I don't want to use a 2D structure at all, as I
will use these rows individually one by one, and hence each can be considered a 1D array.
I also want them to get rearranged while reading the input, so I don't have to do
extra operations on them afterwards.
for(long int j=1;j<=number_in_group;j++)
{
cin >> arrival_time;
arrival_map.insert(pair<long int, long int>(arrival_time,1));
}
Try an STL std::priority_queue? The output is guaranteed to be sorted, and if you shape the inputs as 2-D objects (that contain a row number, for example) your queue will build literally perfectly. At that point, simply slurp the numbers off the queue in batches of 'n', where 'n' is your row size, and each batch will be sorted correctly. You will need an element type that encodes both the value AND the row in your priority queue, and sorts biased to the row # first, then the value. Your example uses long int as the data type for your values. Assuming your rows are no larger than the size of a system unsigned int:
#include <functional>
#include <queue>
#include <vector>

class Element
{
public:
    Element(unsigned int row, long int val)
        : myrow(row), myval(val)
    {}
    bool operator >(const Element& elem) const
    {
        return (myrow > elem.myrow ||
                (myrow == elem.myrow && myval > elem.myval));
    }
    unsigned int myrow;
    long int myval;
};
typedef std::priority_queue<Element, std::vector<Element>, std::greater<Element> > MyQueue;
Note: std::priority_queue is a max-heap when left with its default std::less<> comparator, so to pop the smallest element first we pass std::greater<>, which compares items using the Element-defined operator >(). Once you have this, simply push your matrix into the queue, incrementing the row index as you switch to the next row.
MyQueue mq;
mq.push(Element(1,100));
mq.push(Element(1,99));
mq.push(Element(2,100));
mq.push(Element(2,101));
Popping the queue when finished will result in the following sequence:
99
100
100
101
Which I hope is what you want. Finally, please forgive the syntax errors and/or missing junk, as I just blasted this on the fly and have no compiler to check it against. Gotta love web cafes.