Efficient hash map of constant size string - c++

I need to map strings of constant size, containing just alphanumeric characters (A-Z, 0-9, no lowercase letters), to other strings. The unordered_map becomes very large (tens of millions of keys), while the mapped values come from a set of a few thousand strings. Upon profiling I found that most time is spent inserting new values into the map (operator[]), and that clearing the map also takes a very long time.
std::unordered_map<std::string, std::string> hashMap;
while (...) {
    ...
    hashMap[key] = value; // ~50% of program time is spent here
    ...
}
hashMap.clear(); // Takes a very long time; at this point hashMap.size() > 20,000,000
My thoughts are that the string allocations/deallocations are very slow, as well as hashing and inserting into the map.
Any suggestions to optimize this? Keep in mind that the key size is constant, and its contents are limited to a set of 36 characters, and that the mapped values are from a limited set. I'm open to using different container / data types other than strings and unordered_map.
Update
Following a suggestion by Baum Mit Augen, I changed my key type to unsigned long long and wrote a function to convert base 36 to decimal:
unsigned long long ConvertBase36(const char* num)
{
    unsigned long long retVal = 0;
    for (int i = 0; i < 12; i++) // the key length is constant (12 characters here)
    {
        unsigned int digit = 0;
        char currChar = num[i];
        if (currChar <= '9')
        {
            digit = currChar - '0';      // '0'-'9' map to 0-9
        }
        else
        {
            digit = currChar - 'A' + 10; // 'A'-'Z' map to 10-35
        }
        retVal *= 36;
        retVal += digit;
    }
    return retVal;
}
This gave me about 10% improvement in whole program runtime.
I then tried to use the unordered_map reserve function again to see if it made any difference and it did not.
Trying map instead of unordered_map did about 10% worse so I reverted that change.
Finally replacing the string value with an unsigned int made things a bit faster.

Two separate suggestions, but both related to std::unordered_map::reserve.
First, since your unordered map contains tens of millions of elements, there are probably many re-allocations/rehashes going on as you insert. At the start, you might want to reserve tens of millions of entries.
Since
the mapped values are from a set of a few thousand strings
you should be able to store the values themselves in a secondary unordered_set that you reserve up front to something large enough to ensure no iterators get invalidated on insert - see the invalidation guarantees for unordered associative containers.
Your (primary) unordered_map can then map strings to std::unordered_set::const_iterators.
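A minimal sketch of that layout, assuming the uint64_t key from the ConvertBase36 update; the reserve sizes and the names valuePool/ValueRef are illustrative assumptions, not measured values:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>

int main()
{
    // Pool of distinct mapped values; reserved up front so inserts never
    // rehash (a rehash would invalidate iterators into the set).
    std::unordered_set<std::string> valuePool;
    valuePool.reserve(10000); // "a few thousand" distinct values

    using ValueRef = std::unordered_set<std::string>::const_iterator;

    // Primary map: fixed-size base-36 key packed into a 64-bit integer,
    // mapped to an iterator into the value pool instead of a full string.
    std::unordered_map<std::uint64_t, ValueRef> hashMap;
    hashMap.reserve(20000000); // expected number of keys

    // Inserting one entry:
    std::uint64_t key = 12345;                         // e.g. ConvertBase36(...)
    ValueRef v = valuePool.insert("some value").first; // dedupes the value
    hashMap[key] = v;                                  // stores only an iterator
}

This avoids allocating a fresh std::string per mapped value; each entry in the primary map holds just a fixed-size key and an iterator.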

For such a large number of entries, you might also want to consider playing around with the number of buckets in your hash table.
For starters, you can query the implementation-defined default with this:
std::unordered_map<T, U> um;
std::cout << um.bucket_count();
Then you can play around and see which value produces the best results:
const size_t N = 100;
std::unordered_map<T, U> m(N); // constructs the map with an initial bucket count of N

Related

Is there a way to use an enum for characters in a string? C++

This was taken off LeetCode, but basically: given a string composed of a few unique characters that each have an associated integer value, I need to quickly compute the total integer value of the string. I thought enums would be useful since you know what is going to compose your strings.
The enum lists the types of characters that can be in my string (you can see that it's limited). If a character with a smaller value comes before a character with a bigger value, as in IV, then I subtract the preceding character's value from the one after it; otherwise I add. The code is my attempt, but I can't get enums to work with my algorithm...
std::string s = "III";
int sum = 0;
enum {I = 1, V = 5, X = 10, L = 50, C = 100, D = 500, M = 1000};

// O(n) iteration.
for (int i = 0; i < s.length(); i++) {
    // Must subtract.
    if (s[i] < s[i+1]) {
        sum += s[i+1] - s[i];
    }
    // Add.
    else {
        sum += s[i];
    }
}
std::cout << "sum is: " << sum;
My questions then are: 1) Is using an enum with a string possible? 2) I know it's possible to do with an unordered_map, but I think enums are much quicker.
If you don't mind a minor memory overhead, you can do something like this:
int table[256];
table['I']=1;
table['V']=5;
...
and then
sum += table[s[i]];
and so on. This approach is guaranteed to be O(1), which is basically the fastest solution you are able to get. You can also use std::array instead of a POD array, encapsulate all of this in some class and add assertions, but that is the idea.
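A self-contained sketch of that idea applied to the Roman-numeral sum; the function name and the zero-initialized std::array are my own choices, not part of the answer above:

#include <array>
#include <cstddef>
#include <iostream>
#include <string>

// Hypothetical helper for illustration: build the character-to-value table
// once; unknown characters stay 0.
int romanSum(const std::string& s)
{
    std::array<int, 256> table{}; // zero-initialized
    table['I'] = 1;   table['V'] = 5;   table['X'] = 10;
    table['L'] = 50;  table['C'] = 100; table['D'] = 500;
    table['M'] = 1000;

    int sum = 0;
    for (std::size_t i = 0; i < s.length(); i++) {
        int cur  = table[static_cast<unsigned char>(s[i])];
        int next = (i + 1 < s.length()) ? table[static_cast<unsigned char>(s[i + 1])] : 0;
        if (cur < next)
            sum -= cur;  // subtractive form, e.g. the I in "IV"
        else
            sum += cur;
    }
    return sum;
}

int main()
{
    std::cout << "sum is: " << romanSum("III") << '\n'; // prints 3
}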
2) I know it's possible to do with a unordered_map but I think enums
is much quicker.
You're comparing apples with oranges.
First, an enum is not a container; it's basically just a list of named constants.
If you mean the access time of operator[]:
for unordered_map:
Unordered map is an associative container that contains key-value
pairs with unique keys. Search, insertion, and removal of elements
have average constant-time complexity.
For std::string, operator[] is also constant-time access.
1) Is using enum with a string possible
No. An enumerator is basically an "alias" for its value, whereas a string is a sequence of characters:
V != "V"
It is not possible to convert a char or a string to an enum without some kind of mapping, because the compiler replaces each enumerator with its underlying value during compilation. So you cannot dynamically look up an enumerator by a name stored in a string.
You have to use either one of the map family of containers or an if/else construct to achieve this.
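For instance, a minimal sketch of the if/else route, spelling out the character-to-enumerator mapping by hand (the enum and helper names here are mine):

// Hypothetical helper: an explicit if/else mapping from a character to the
// enumerator's value; the compiler cannot do this lookup for you at runtime.
enum RomanDigit { I = 1, V = 5, X = 10, L = 50, C = 100, D = 500, M = 1000 };

int valueOf(char c)
{
    if (c == 'I') return I;
    if (c == 'V') return V;
    if (c == 'X') return X;
    if (c == 'L') return L;
    if (c == 'C') return C;
    if (c == 'D') return D;
    if (c == 'M') return M;
    return 0; // unknown character
}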

C++: First non-repeating character, O(n) time using hash map

I'm trying to write a function to get the first non-repeating character of a string. I haven't found a satisfactory answer on how to do this in O(n) time for all cases. My current solution is:
char getFirstNonRepeated(char * str) {
    if (strlen(str) > 0) {
        int visitedArray[256] = {}; // Where 256 is the size of the alphabet
        for (int i = 0; i < strlen(str); i++) {
            visitedArray[str[i]] += 1;
        }
        for (int j = 0; j < 256; j++) {
            if (visitedArray[j] == 1) return j;
        }
    }
    return '\0'; // Either strlen == 0 or all characters are repeated
}
However, as long as n < 256, this algorithm runs in O(n^2) time in the worst case. I've read that using a hash table instead of an array to store the number of times each character is visited could get the algorithm to run consistently in O(n) time, because insertions, deletions, and searches on hash tables run in O(1) time. I haven't found a question that explains how to do this properly. I don't have very much experience using hash maps in C++ so any help would be appreciated.
Why are you repeating those calls to strlen() on every loop iteration? That is linear in the length of the string, so your first loop effectively becomes O(n^2) for no good reason at all. Just calculate the length once and store it, or use str[i] as the end condition.
You should also be aware that if your compiler uses signed characters, any character value above 127 will be considered negative (and used as a negative, i.e. out-of-bounds, array offset). You can avoid this by explicitly casting your character values to unsigned char.
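A sketch of the same function with both fixes applied (length computed once, indices cast to unsigned char); the rest of the logic is left exactly as in the question:

#include <cstring>

char getFirstNonRepeated(const char* str) {
    std::size_t len = std::strlen(str); // computed once, not on every iteration
    if (len > 0) {
        int visitedArray[256] = {}; // 256 = size of the alphabet
        for (std::size_t i = 0; i < len; i++) {
            // Cast so characters above 127 don't produce a negative array index.
            visitedArray[(unsigned char)str[i]] += 1;
        }
        for (int j = 0; j < 256; j++) {
            if (visitedArray[j] == 1) return (char)j;
        }
    }
    return '\0'; // Either the string is empty or every character repeats
}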

How can I use a mixture of array and map in C++?

A short version of my problem: Is it possible to use a data structure that, for example, treats x[0] to x[10] as a normal array, and some other scattered values, say x[15] and x[20], as a map?
Reasons: I do not calculate or store any other values beyond those indices, and making the whole thing a map slows down the calculation significantly.
My initial problem: I am writing a fast program to calculate a series which has x(0) = 0, x(1) = 1, x(2k) = (3x(k) + 2x(Floor(k/2))) mod 2^60, x(2k+1) = (2x(k) + 3x(Floor(k/2))) mod 2^60, and my target is to list the numbers from x(10^12) to x(2*10^12).
I am listing and storing the first 10^8 values in a normal array:
for (unsigned long long int i = 2; i <= 100000000; i++) {
    if (i % 2 == 0) {
        x[i] = (3*x[i/2] + 2*x[i/4]) & 1152921504606846975;         // & (2^60 - 1), i.e. mod 2^60
    }
    else {
        x[i] = (2*x[(i-1)/2] + 3*x[(i-1)/4]) & 1152921504606846975;
    }
} // this code is for listing
unsigned long long int xtrans(unsigned long long int k) {
    if (k <= 100000000) return x[k]; // values in the dense range are precomputed
    unsigned long long int result;
    if (k % 2 == 0) {
        result = (3*xtrans(k/2) + 2*xtrans(k/4)) & 1152921504606846975;
    }
    else {
        result = (2*xtrans((k-1)/2) + 3*xtrans((k-1)/4)) & 1152921504606846975;
    }
    return result;
} // this code is for calculating x(k) beyond the stored range
Listing those numbers takes me around 2 s and 750 MB of memory.
For further optimization I am planning to store specific values, for example x[2*10^8] and x[4*10^8], without calculating and storing the values in between. But I would have to use a map in that situation. However, after converting the declaration of x from an array to a map, it took 90 s and 4.5 GB of memory to achieve the same listing.
So I am now wondering if it is possible to treat indices under 10^8 as an array, and the remaining part as a map?
Simply write a wrapper class for your idea:
class MyMap {
    ...
    uint64_t& operator[](size_t i) {
        return (i <= barrier_) ? array_[i] : map_[i];
    }
};
TL;DR
Why not create a custom class with an std::array of size 10 and a std::map as members, and overload the [] operator to check the index and pick the value from either the array or the map as needed.
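A minimal sketch of that wrapper, using std::vector for the dense part so its size can be chosen at runtime; the class name, member names, and threshold handling are illustrative:

#include <cstdint>
#include <map>
#include <vector>

// Dense indices live in a contiguous vector; everything beyond the threshold
// falls back to a std::map. operator[] hides the distinction from the caller.
class HybridMap {
public:
    explicit HybridMap(std::uint64_t denseSize) : dense_(denseSize, 0) {}

    std::uint64_t& operator[](std::uint64_t i) {
        if (i < dense_.size())
            return dense_[i];
        return sparse_[i]; // value-initializes to zero on first access, like a plain map
    }

private:
    std::vector<std::uint64_t> dense_;
    std::map<std::uint64_t, std::uint64_t> sparse_;
};

// Usage, e.g. with the series from the question:
// HybridMap x(100000001);
// x[0] = 0; x[1] = 1;
// x[200000000] = ...; // lands in the map part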
Theoretically, you can use the ArrayWithHash library to store your dictionary. It stores the dictionary as a hybrid of an array and a hash table, similar to the table implementation in the Lua interpreter.
Awh::ArrayWithHash<uint64_t, uint64_t> x;
x.Reserve(100000000, 0); // preallocate the array part
for (uint64_t i = 2; i <= 100000000; i++) {
    if (i % 2 == 0) {
        x.Set(i, (3 * x.Get(i/2) + 2 * x.Get(i/4)) & 1152921504606846975ULL);
    }
    ...
Unfortunately, memory consumption is one of the drawbacks of ArrayWithHash. It pads the array to a power-of-two size, so the array part would eat 1 GB. The hash table part is even less memory-efficient: it can take three times more memory than is required to store the key/value pairs.

Given an array of integers, find the first integer that is unique

Given an array of integers, find the first integer that is unique.
my solution: use std::map
Put each integer into it one by one (the number as key, its index as value) - O(n^2 lg n); if there is a duplicate, remove the entry from the map (O(lg n)). After putting all numbers into the map, iterate over the map and find the key with the smallest index, O(n).
O(n^2 lg n) because the map needs to do sorting.
It is not efficient.
other better solutions?
I believe that the following would be the optimal solution, at least based on time / space complexity:
Step 1:
Store the integers in a hash map, which holds the integer as a key and the count of the number of times it appears as the value. This is generally an O(n) operation and the insertion / updating of elements in the hash table should be constant time, on the average. If an integer is found to appear more than twice, you really don't have to increment the usage count further (if you don't want to).
Step 2:
Perform a second pass over the integers. Look each up in the hash map and the first one with an appearance count of one is the one you were looking for (i.e., the first single appearing integer). This is also O(n), making the entire process O(n).
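A sketch of those two passes in C++, assuming the input is a std::vector<int>; the function name is mine:

#include <optional>
#include <unordered_map>
#include <vector>

// Pass 1: count occurrences. Pass 2: return the first value whose count is 1.
std::optional<int> firstUnique(const std::vector<int>& nums)
{
    std::unordered_map<int, int> counts;
    for (int v : nums)
        ++counts[v];          // O(1) average per insertion/update

    for (int v : nums)        // second pass preserves the original order
        if (counts[v] == 1)
            return v;

    return std::nullopt;      // no unique element
}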
Some possible optimizations for special cases:
Optimization A: It may be possible to use a simple array instead of a hash table. This guarantees O(1) even in the worst case for counting the number of occurrences of a particular integer as well as the lookup of its appearance count. Also, this enhances real time performance, since the hash algorithm does not need to be executed. There may be a hit due to potentially poorer locality of reference (i.e., a larger sparse table vs. the hash table implementation with a reasonable load factor). However, this would be for very special cases of integer orderings and may be mitigated by the hash table's hash function producing pseudorandom bucket placements based on the incoming integers (i.e., poor locality of reference to begin with).
Each byte in the array would represent the count (up to 255) for the integer represented by the index of that byte. This would only be possible if the difference between the lowest integer and the highest (i.e., the cardinality of the domain of valid integers) was small enough such that this array would fit into memory. The index in the array of a particular integer would be its value minus the smallest integer present in the data set.
For example on modern hardware with a 64-bit OS, it is quite conceivable that a 4GB array can be allocated which can handle the entire domain of 32-bit integers. Even larger arrays are conceivable with sufficient memory.
The smallest and largest integers would have to be known before processing, or another linear pass through the data using the minmax algorithm to find out this information would be required.
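A sketch of Optimization A under those assumptions (byte counts saturated at 255, minimum and maximum found in an extra linear pass, value range small enough for the counting array to fit in memory); the function name is mine:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Count occurrences in a plain byte array indexed by (value - minimum).
std::optional<int> firstUniqueDense(const std::vector<int>& nums)
{
    if (nums.empty()) return std::nullopt;

    auto mm = std::minmax_element(nums.begin(), nums.end()); // extra linear pass
    int lo = *mm.first, hi = *mm.second;
    std::size_t range = static_cast<std::size_t>(static_cast<long long>(hi) - lo + 1);
    std::vector<std::uint8_t> counts(range, 0);

    for (int v : nums) {
        std::uint8_t& c = counts[v - lo];
        if (c < 255) ++c;        // saturate; we only care whether the count is exactly 1
    }
    for (int v : nums)           // second pass preserves the original order
        if (counts[v - lo] == 1)
            return v;

    return std::nullopt;
}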
Optimization B: You could optimize Optimization A further, by using at most 2 bits per integer (One bit indicates presence and the other indicates multiplicity). This would allow for the representation of four integers per byte, extending the array implementation to handle a larger domain of integers for a given amount of available memory. More bit games could be played here to compress the representation further, but they would only support special cases of data coming in and therefore cannot be recommended for the still mostly general case.
All this for no reason. Just using two for-loops and a variable would give you a simple O(n^2) algorithm.
If you are going to take all the trouble of using a hash map, then it might as well be what @Micheal Goldshteyn suggests.
UPDATE: I know this question is a year old, but I was looking through the questions I answered and came across this one. I thought of a better solution than using a hashtable.
When we say unique, we will have a pattern. E.g.: [5, 5, 66, 66, 7, 1, 1, 77]. In this, let's have a moving window of 3. First consider (5, 5, 66). We can easily establish that there is a duplicate here. So move the window by one element and we get (5, 66, 66). Same here. Move to the next, (66, 66, 7). Again duplicates here. Next, (66, 7, 1). No duplicates here! Take the middle element, as this has to be the first unique in the set. The left element belongs to the previous duplicate pair, and so could 1 (the right element). Hence 7 is the first unique element.
space: O(1)
time: O(n) * O(m^2) = O(n) * 9 ≈ O(n)
Inserting into a map is O(log n), not O(n log n), so inserting n keys will be O(n log n). Also, it's better to use a set.
Although it's O(n^2), the following has small coefficients, isn't too bad on the cache, and uses memmem() which is fast.
for (int x = 0; x < len; x++)
    if (memmem(&array[x+1], sizeof(int) * (len - (x+1)), &array[x], sizeof(int)) == NULL &&
        memmem(&array[0],   sizeof(int) * x,             &array[x], sizeof(int)) == NULL)
        return array[x];
public static string firstUnique(int[] input)
{
    int size = input.Length;
    bool[] dupIndex = new bool[size];
    for (int i = 0; i < size; ++i)
    {
        if (dupIndex[i])
        {
            continue;
        }
        else if (i == size - 1)
        {
            return input[i].ToString();
        }
        for (int j = i + 1; j < size; ++j)
        {
            if (input[i] == input[j])
            {
                dupIndex[j] = true;
                break;
            }
            else if (j == size - 1)
            {
                return input[i].ToString();
            }
        }
    }
    return "No unique element";
}
@user3612419
The solution you gave is good, with complexity somewhat close to O(N^2), but further optimization of the same code is possible; I just added the two or three lines that you missed.
public static string firstUnique(int[] input)
{
    int size = input.Length;
    bool[] dupIndex = new bool[size];
    for (int i = 0; i < size; ++i)
    {
        if (dupIndex[i])
        {
            continue;
        }
        else if (i == size - 1)
        {
            return input[i].ToString();
        }
        for (int j = i + 1; j < size; ++j)
        {
            if (dupIndex[j] == true)
            {
                continue;
            }
            if (input[i] == input[j])
            {
                dupIndex[j] = true;
                dupIndex[i] = true;
                break;
            }
            else if (j == size - 1)
            {
                return input[i].ToString();
            }
        }
    }
    return "No unique element";
}

Fast generation of random set, Monte Carlo Simulation

I have a set of ~100 numbers on which I wish to perform an MC simulation. The basic idea is: I fully randomize the set, do some comparisons/checks on the first ~20 values, store the result, and repeat.
Now the actual comparison/check algorithm is extremely fast it actually completes in about 50 CPU cycles. With this in mind, and in order to optimize these simulations I need to generate the random sets as fast as possible.
Currently I'm using a Multiply With Carry algorithm by George Marsaglia which provides me with a random integer in 17 CPU cycles, quite fast. However, using the Fisher-Yates shuffling algorithm I have to generate 100 random integers, ~1700 CPU cycles. This overshadows my comparison time by a long ways.
So my question is are there other well known/robust techniques for doing this type of MC simulation, where I can avoid the long random set generation time?
I thought about just randomly choosing 20 values from the set, but I would then have to do collision checks to ensure that 20 unique entries were chosen.
Update:
Thanks for the responses. I have another question regarding a method I came up with after my post. The question is: will this provide truly random output (assuming the RNG is good)? Basically, my method is to set up an array of integers the same length as my input array and set every value to zero. Then I begin randomly choosing 20 values from the input set like so:
int pcfast[100];
memset(pcfast, 0, sizeof(int) * 100);
int nchosen = 0;
while (nchosen < 20)
{
    int k = rand(100); // pseudocode: random index in [0, 99]
    if (pcfast[k] == 0)
    {
        pcfast[k] = 1;
        r[nchosen++] = s[k]; // r is the length-20 output, s the input set
    }
}
Basically this is what I mentioned above, choosing 20 values at random, except it seems like a somewhat optimized way of ensuring no collisions. Will this provide good random output? It's quite fast.
If you only use the first 20 values in the randomised array, then you only need to do 20 steps of the Fisher-Yates algorithm (Knuth's version). Then 20 values have been randomised (actually at the end of the array rather than at the beginning, in the usual formulation), in the sense that the remaining 80 steps of the algorithm are guaranteed not to move them. The other 80 positions aren't fully shuffled, but who cares?
C++ code (iterators should be random-access):
using std::swap;
template <typename Iterator, typename Rand> // you didn't specify the RNG type
void partial_shuffle(Iterator first, Iterator middle, Iterator last, Rand rnd) {
    size_t n = last - first;
    while (first != middle) {
        size_t k = rnd(n); // random integer from 0 to n-1
        swap(*(first + k), *first);
        --n;
        ++first;
    }
}
On return, the values from first through to middle-1 are shuffled. Use it like this:
int arr[100];
for (int i = 0; i < 100; ++i) arr[i] = i;
while (need_more_samples()) {
    partial_shuffle(arr, arr + 20, arr + 100, my_prng);
    process_sample(arr, arr + 20);
}
The Ross simulation book suggests something like the following:
double result[10]; // "return" is a keyword, so the output array needs a different name
for (int i = 0, n = 100; i < 10; i++) {
    int x = rand(n);     // pseudocode - generate a random integer on [0, n-1]
    result[i] = arr[x];
    arr[x] = arr[n - 1]; // move the last still-unpicked element into the chosen slot
    n--;
}