Reading from unordered_multiset results in crash - c++

While refactoring some old code, a cumbersome multilevel-map developed in-house was replaced by an std::unordered_multiset.
The multilevel-map was something like [string_key1,string_val] . A complex algorithm was applied to derive the keys from string_val and resulted in duplicate string_val being stored in the map but with different keys.
Eventually at some point of the application the multilevel-map was iterated to get the string_val and its number of occurrences.
It was replaced by an std::unordered_multiset, and the string_val values are just inserted into it. It seems much simpler than having an std::map<std::string,int> and checking-retrieving-updating the counter for every insertion.
What I want to do is retrieve the number of occurrences of each inserted element, but I do not have the keys beforehand. So I iterate over the buckets, but my program crashes upon creation of the string.
// hash map declaration
std::unordered_multiset<std::string> clevel;
// get element and occurences
for (size_t cbucket = clevel->bucket_count() - 1; cbucket != 0; --cbucket)
{
std::string cmsg(*clevel->begin(cbucket));
cmsg += t_str("times=") + \
std::to_string(clevel->bucket_size(cbucket));
}
I do not understand what is going on here, tried to debug it but I am somehow stack( overflown ?) :) . Program crashes in std::string cmsg(*it);

You should consider how multiset actually works as a hashtable. For example reading this introduction you should notice that hash maps actually preallocate their internal buckets , and the number of buckets is optimized.
Therefore if you insert element "hello" , you will probably get a number of buckets already created, but only the one corresponding to hash("hello") will actually have an element that you may dereference. The rest will be let's say invalid.
Dereferencing the begin iterator of an empty bucket is undefined behavior, which in practice often shows up as a SEGV, and that is your case here.
To remedy this situation you should check every time that begin is not past the end.
for (size_t cbucket = clevel->bucket_count(); cbucket-- > 0; ) // also visits bucket 0
{
auto it = clevel->begin(cbucket);
if (it != clevel->end(cbucket))
{
std::string cmsg(*it);
cmsg += t_str("times=") + \
std::to_string(clevel->bucket_size(cbucket));
}
}
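One more thing to keep in mind: bucket_size(cbucket) counts everything in the bucket, and distinct strings can share a bucket, so it is not necessarily the number of occurrences of one value. A minimal sketch that avoids buckets altogether, assuming all you need is each distinct string with its count (the helper name countOccurrences is made up for illustration):

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical helper: returns each distinct value together with its
// number of occurrences in the multiset.
std::vector<std::pair<std::string, std::size_t>>
countOccurrences(const std::unordered_multiset<std::string>& clevel)
{
    std::vector<std::pair<std::string, std::size_t>> counts;
    for (auto it = clevel.begin(); it != clevel.end(); )
    {
        // Equal elements are adjacent in iteration order, so equal_range
        // yields the whole run of elements equal to *it.
        auto range = clevel.equal_range(*it);
        counts.emplace_back(
            *it, static_cast<std::size_t>(std::distance(range.first, range.second)));
        it = range.second; // jump to the next distinct value
    }
    return counts;
}
```

The order of the returned pairs is unspecified, as with any iteration over an unordered container.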

Related

Can anyone help me make this function more efficient

So I am trying to sort through an unordered_map container. The container reads input from a file which is a list of people. Each line in the file will be like rCB, bIA, and this will be stored as an element in the map. The second string in each element acts as a pointer to the next person in the list, so later on it will appear again in a new line, in this case: bIA,TDV.
So far I can walk the list in order by creating an unordered_map iterator and passing the second string to find, which takes the iterator to the next element. My problem is going the other way. I am able to walk it in the opposite direction, but my implementation takes a very long time to finish, as we have input files of 3 million people.
list<string> SortEast(unordered_map<string, string> &TempUMap, unordered_map<string, string>::iterator IT, list<string> &TempList)
{
IT = TempUMap.begin();
while (TempList.size() != (TempUMap.size() + 1))
{
if (IT->second == TempList.front())
{
TempList.emplace_front(IT->first);
IT = TempUMap.begin();
}
IT++;
}
return TempList;
}
I've tried to make this more efficient but I cannot think of how. If I could find the value that goes at the start of the list, I could sort in order starting with that value, but again I don't know how I would find this value easily.
Any help would be appreciated.
EDIT:
A sample of one of our input is:
rBC,biA
vnN,CmR
CmR,gnz
Dgu,OWn
lnh,Dgu
OWn,YMO
YMO,SIZ
XbL,Cjj
TDV,jew
iVk,vnN
wTb,rBC
jew,sbE
sbE,iVk
Cjj,wTb
AGn,XbL
gnz,SMz
biA,TDV
SIZ,uvD
SMz,lnh
This is only 20 people. In this case AGn is the first value and uvD is the last. The output I end up with is:
AGn
XbL
Cjj
wTb
rBC
biA
TDV
jew
sbE
iVk
vnN
CmR
gnz
SMz
lnh
Dgu
OWn
YMO
SIZ
uvD
As this file starts with rBC, that is the point at which I need to sort backwards.
Can you not simply do something like this:
vector<string> orderAllTheNames(const unordered_map<string, string>& input, const string& begin)
{
vector<string> result;
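result.reserve(input.size() + 1); // N links connect N + 1 names
result.push_back(begin);
auto it = input.find(begin); // operator[] is not available on a const map
// follow the links until a name has no successor (the size check guards against cycles)
while (it != input.end() && result.size() <= input.size())
{
result.push_back(it->second);
it = input.find(it->second);
}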
return result;
}
I may have missed some details as I typed this on my phone. You can add some pointers and/or std::moves if you're worried about too many copies flying around.
I guess it's the same as your solution, but without the awkward list and emplace_front.
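As for finding the value that goes at the start of the list: the first person is the only key that never shows up as a second string. A minimal sketch of that idea, assuming the input really is one single chain (the helper name findFirstName is made up for illustration):

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical helper: returns the name that no other entry points to,
// i.e. the head of the chain. Returns an empty string if there is none.
std::string findFirstName(const std::unordered_map<std::string, std::string>& links)
{
    // Collect every name that appears as a "next person".
    std::unordered_set<std::string> pointedTo;
    pointedTo.reserve(links.size());
    for (const auto& link : links)
        pointedTo.insert(link.second);

    // The head is the key that nobody points to.
    for (const auto& link : links)
        if (pointedTo.count(link.first) == 0)
            return link.first;
    return std::string();
}
```

Both passes are linear on average, so for 3 million entries this should be far cheaper than rescanning the map for every element. For the sample input above it would return AGn, which can then be passed as the starting name.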

Fastest ways to check if a value already exists in a stl container

I am holding a very big list of memory addresses (around 400.000) and need to check if a certain address already exists in it 400.000 times a second.
A code example to illustrate my setup:
std::set<uintptr_t> existingAddresses; // this one contains 400.000 entries
while (true) {
// a new list with possible new addresses
std::set<uintptr_t> newAddresses; // also contains about ~400.000 entries
// in my own code, these represent a new address list
for (auto newAddress : newAddresses) {
// already processed this address, skip it
if (existingAddresses.find(newAddress) != existingAddresses.end()) {
continue;
}
// we didn't have this address yet, so process it.
SomeHeavyTask(newAddress);
// so we don't process it again
existingAddresses.emplace(newAddress);
}
Sleep(1000);
}
This is the first implementation I came up with and I think it can be greatly improved.
Next I came up with using some custom indexing strategy, also used in databases. The idea is to take a part of the value and use that to index it in its own group set. If I would take for example the last two hex digits of the address I would have 16^2 = 256 groups to put addresses in.
So I would end up with a map like this:
[FF] -> all address ending with `FF`
[EF] -> all addresses ending with `EF`
[00] -> all addresses ending with `00`
// etc...
With this I will only need to do a lookup on ~360 entries in the corresponding set. Resulting in ~360 lookups being done 400.000 times a second. Much better!
I am wondering if there are any other tricks or better ways to do this? My goal is to make this address lookup as FAST as possible.
std::set<uintptr_t> uses a balanced tree, so look-up time is O(log N).
std::unordered_set<uintptr_t>, on the other hand, is hash-based, with an average lookup time of O(1) (worst case O(N) if many elements collide).
Although this is only an asymptotic complexity measure, meaning that there is no guaranteed improvement due to constant factors involved, the difference may prove significant when the collection contains 400,000 elements.
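A minimal sketch of what the switch could look like, assuming the new addresses arrive in a std::vector (the helper name collectUnseen is made up for illustration). It also relies on insert() reporting whether the element was actually added, which saves the separate find():

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the addresses that were not seen before and
// records them in existingAddresses so they are skipped next time.
std::vector<std::uintptr_t> collectUnseen(
    std::unordered_set<std::uintptr_t>& existingAddresses,
    const std::vector<std::uintptr_t>& newAddresses)
{
    std::vector<std::uintptr_t> unseen;
    for (std::uintptr_t address : newAddresses)
    {
        // .second is true only if the address was not in the set yet
        if (existingAddresses.insert(address).second)
            unseen.push_back(address);
    }
    return unseen;
}
```

The caller would then run SomeHeavyTask on each returned address. Calling existingAddresses.reserve() up front with the expected number of entries may avoid rehashing while the set grows.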
You may use an algorithm similar to merge:
std::set<uintptr_t> existingAddresses; // this one contains 400.000 entries
while (true) {
// a new list with possible new addresses
std::set<uintptr_t> newAddresses; // also contains about ~400.000 entries
auto existing_it = existingAddresses.begin();
auto new_it = newAddresses.begin();
while (new_it != newAddresses.end() && existing_it != existingAddresses.end()) {
if (*new_it < *existing_it) {
// we didn't have this address yet, so process it.
SomeHeavyTask(*new_it);
// so we don't process it again
existingAddresses.insert(existing_it, *new_it);
++new_it;
} else if (*existing_it < *new_it) {
++existing_it;
} else { // Both equal
++existing_it;
++new_it;
}
}
while (new_it != newAddresses.end()) {
// we didn't have this address yet, so process it.
SomeHeavyTask(*new_it);
// so we don't process it again
existingAddresses.insert(existingAddresses.end(), *new_it);
++new_it;
}
Sleep(1000);
}
Complexity is now linear: O(N + M) instead of O(N log M) (with N number of new addresses, and M count for old ones).

Most efficient way to search for a value and return its index in a vector?

I am trying to iterate through a vector (k) and check whether it contains a value (key). Whenever it does, I want to take the value found at the same index of a different vector (val) and add it to a third vector (temp).
for(int i = 0; i < k.size(); ++i)
{
if(k.at(i) == key)
{
temp.push_back(val.at(i));
}
}
I've learned a lot lately but I'm still not super advanced in C++, this code does work for my purposes but it is extremely slow. It can handle small vectors of sizes like 10 or 100, but takes much too long for sizes bigger like 1000, 10000 or even 1000000.
My question is, is there a faster and more efficient way to do this?
I've tried this:
std::vector<int>::iterator it = k.begin();
while ((iter = std::find(it, k.end(), key)) != k.end())
{
int index = std::distance(k.begin(), it);
temp.push_back(val.at(index));
}
I thought maybe using a vector iterator would speed things up, but I can't get the code to work due to bad_alloc errors that I'm not sure how to fix.
Does anyone know what I can do to make this little bit of code much faster?
Here are a few things you could do:
Pre-allocate the data for temp, so that push_back doesn't cause repeated allocations:
temp.reserve(k.size());
If k is sorted, you can use that fact to speed things up a bit:
auto lowerIt = std::lower_bound(k.begin(), k.end(), key);
auto upperIt = std::upper_bound(k.begin(), k.end(), key);
for (auto it = lowerIt; it != upperIt; ++it)
temp.push_back(val[it - k.begin()]);
at does bounds checking, so it is a tad bit slower than []. You obviously have to guarantee that you are never accessing an out of bounds index.
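About the bad_alloc in the iterator attempt: the search there always restarts from the same position, so it most likely keeps finding the same element and pushes into temp until memory runs out. A minimal corrected sketch, assuming k and val are std::vector<int> of equal length (the helper name collectMatches is made up for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: collects val[i] for every i where k[i] == key.
// Assumes k and val have the same length.
std::vector<int> collectMatches(const std::vector<int>& k,
                                const std::vector<int>& val, int key)
{
    std::vector<int> temp;
    auto it = std::find(k.begin(), k.end(), key);
    while (it != k.end())
    {
        const auto index = static_cast<std::size_t>(std::distance(k.begin(), it));
        temp.push_back(val[index]);
        it = std::find(it + 1, k.end(), key); // resume right after the match
    }
    return temp;
}
```

This is still a linear scan, so it should not be expected to beat the plain index loop by much; it only removes the endless loop.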
Besides Rakete's suggestions:
If your keys vector is sorted - use a binary search instead of std::find (std::lower_bound rather than std::binary_search, since the latter only returns a bool) and then just iterate until the next value/end of vector.
If you're free to change your data structures, keep your data in std::unordered_multimap and use equal_range to access elements with your desired keys.
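A minimal sketch of the std::unordered_multimap variant, assuming int keys and values (the helper name valuesForKey is made up for illustration):

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical helper: returns every value stored under key.
std::vector<int> valuesForKey(const std::unordered_multimap<int, int>& data, int key)
{
    std::vector<int> temp;
    // equal_range gives the [first, second) run of entries with this key
    auto range = data.equal_range(key);
    for (auto it = range.first; it != range.second; ++it)
        temp.push_back(it->second);
    return temp;
}
```

The lookup is constant time on average plus the number of matches. Note that the relative order of the values inside the range is not specified, so this only fits if the original index order does not matter.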

Need suggestion to improve speed for word break (dynamic programming)

The problem is: Given a string s and a dictionary of words dict, determine if s can be segmented into a space-separated sequence of one or more dictionary words.
For example, given
s = "hithere",
dict = ["hi", "there"].
Return true because "hithere" can be segmented as "hi there".
My implementation is as below. This code is ok for normal cases. However, it suffers a lot for input like:
s = "aaaaaaaaaaaaaaaaaaaaaaab", dict = {"aa", "aaaaaa", "aaaaaaaa"}.
I want to memoize the processed substrings, but I cannot get it right. Any suggestions on how to improve? Thanks a lot!
class Solution {
public:
bool wordBreak(string s, unordered_set<string>& wordDict) {
int len = s.size();
if(len<1) return true;
for(int i(0); i<len; i++) {
string tmp = s.substr(0, i+1);
if((wordDict.find(tmp)!=wordDict.end())
&& (wordBreak(s.substr(i+1), wordDict)) )
return true;
}
return false;
}
};
It's logically a two-step process. Find all dictionary words within the input, consider the found positions (begin/end pairs), and then see if those words cover the whole input.
So you'd get for your example
aa: {0,2}, {1,3}, {2,4}, ... {20,22}
aaaaaa: {0,6}, {1,7}, ... {16,22}
aaaaaaaa: {0,8}, {1,9} ... {14,22}
This is a graph, with nodes 0-23 and a bunch of edges. But node 23 b is entirely unreachable - no incoming edge. This is now a simple graph theory problem.
Finding all places where dictionary words occur is pretty easy, if your dictionary is organized as a trie. But even an std::map is usable, thanks to its equal_range method. You have what appears to be an O(N*N) nested loop for begin and end positions, with O(log N) lookup of each word. But you can quickly determine if s.substr(begin,end) is still a viable prefix, and what dictionary words remain with that prefix.
Also note that you can build the graph lazily. Starting at begin=0 you find edges {0,2}, {0,6} and {0,8}. (And no others). You can now search nodes 2, 6 and 8. You even have a good algorithm - A* - that suggests you try node 8 first (reachable in just 1 edge). Thus, you'll find nodes {8,10}, {8,14} and {8,16} etc. As you see, you'll never need to build the part of the graph that contains {1,3} as it's simply unreachable.
Using graph theory, it's easy to see why your brute-force method breaks down. You arrive at node 8 (aaaaaaaa.aaaaaaaaaaaaaab) repeatedly, and each time search the subgraph from there on.
A further optimization is to run bidirectional A*. This would give you a very fast solution. At the second half of the first step, you look for edges leading to 23, b. As none exist, you immediately know that node {23} is isolated.
In your code, you are not using dynamic programming because you are not remembering the subproblems that you have already solved.
You can enable this remembering, for example, by storing the results based on the starting position of the string s within the original string, or even based on its length (because anyway the strings you are working with are suffixes of the original string, and therefore its length uniquely identifies it). Then, in the beginning of your wordBreak function, just check whether such length has already been processed and, if it has, do not rerun the computations, just return the stored value. Otherwise, run computations and store the result.
Note also that your approach with unordered_set will not allow you to obtain the fastest solution. The fastest solution that I can think of is O(N^2) by storing all the words in a trie (not in a map!) and following this trie as you walk along the given string. This achieves O(1) per loop iteration not counting the recursion call.
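A minimal sketch of the remembering step, keyed by the starting position within the original string. It keeps the unordered_set lookup rather than a trie, and the names canBreakFrom and wordBreakMemo are made up for illustration:

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// memo[start]: -1 = not computed yet, 0 = the suffix starting at `start`
// cannot be segmented, 1 = it can.
bool canBreakFrom(const std::string& s, std::size_t start,
                  const std::unordered_set<std::string>& wordDict,
                  std::vector<int>& memo)
{
    if (start == s.size()) return true;             // nothing left to segment
    if (memo[start] != -1) return memo[start] == 1; // already solved
    bool ok = false;
    for (std::size_t len = 1; start + len <= s.size() && !ok; ++len)
    {
        if (wordDict.count(s.substr(start, len)) != 0)
            ok = canBreakFrom(s, start + len, wordDict, memo);
    }
    memo[start] = ok ? 1 : 0; // store the result for later calls
    return ok;
}

bool wordBreakMemo(const std::string& s,
                   const std::unordered_set<std::string>& wordDict)
{
    std::vector<int> memo(s.size(), -1);
    return canBreakFrom(s, 0, wordDict, memo);
}
```

Each start position is now solved at most once, so the problematic input made of many a's followed by a b no longer triggers an exponential number of calls.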
Thanks for all the comments. I changed my previous solution to the implementation below. At this point, I didn't explore to optimize on the dictionary, but those insights are very valuable and are very much appreciated.
For the current implementation, do you think it can be further improved? Thanks!
class Solution {
public:
bool wordBreak(string s, unordered_set<string>& wordDict) {
int len = s.size();
if(len<1) return true;
if(wordDict.size()==0) return false;
vector<bool> dq (len+1,false);
dq[0] = true;
for(int i(0); i<len; i++) {// start point
if(dq[i]) {
for(int j(1); j<=len-i; j++) {// length of substring, 1:len
if(!dq[i+j]) {
auto pos = wordDict.find(s.substr(i, j));
dq[i+j] = dq[i+j] || (pos!=wordDict.end());
}
}
}
if(dq[len]) return true;
}
return false;
}
};
Try the following:
class Solution {
public:
bool wordBreak(string s, unordered_set<string>& wordDict)
{
if (s.empty()) return true; // base case: an empty string is trivially segmentable
for (const auto& w : wordDict)
{
auto pos = s.find(w);
if (pos != string::npos)
{
if (wordBreak(s.substr(0, pos), wordDict) &&
wordBreak(s.substr(pos + w.size()), wordDict))
return true;
}
}
return false;
}
};
Essentially, once you find a match, remove the matching part from the input string and continue testing on the smaller inputs.

no duplicate function for a lottery program

Right now I'm trying to make a function that checks whether the user's selection is already in the array, and if it is, tells you to choose a different number. How can I do this?
Do you mean something like this?
bool CheckNumberIsValid()
{
for(int i = 0 ; i < array_length; ++i)
{
if(array[i] == user_selection)
return false;
}
return true;
}
That should give you a clue, at least.
What's wrong with std::find? If you get the end iterator back, the
value isn't in the array; otherwise, it is. Or if this is homework, and
you're not allowed to use the standard library, a simple while loop
should do the trick: this is a standard linear search, algorithms for
which can be found anywhere. (On the other hand, some of the articles
which pop up when searching with Google are pretty bad. You really
should use the standard implementation:
template <typename Iterator, typename ValueType>
Iterator
find( Iterator begin, Iterator end, ValueType target )
{
while ( begin != end && *begin != target )
++ begin;
return begin;
}
Simple, effective, and proven to work.)
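A minimal sketch of the std::find check applied to the lottery case, assuming the picks are kept in a std::vector<int> (the helper name alreadyPicked is made up for illustration):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: true if `number` is already among the picks.
bool alreadyPicked(const std::vector<int>& picks, int number)
{
    // std::find returns the end iterator when the value is absent
    return std::find(picks.begin(), picks.end(), number) != picks.end();
}
```

With a plain array, std::begin(array) and std::end(array) can be passed instead of the vector's iterators.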
[added post factum]Oh, homework tag. Ah well, it won't really benefit you that much then, still - I'll leave my answer since it can be of some use to others browsing through SO.
If you'd need to have lots of unique random numbers in a range - say 45000 random numbers from 0..45100 - then you should see how this is going to get problematic using the approach of:
while (size_of_range > v.size()) {
int n = // get random
if ( /* n is not already in v */ ) {
v.push_back(n);
}
}
If the size of the pool and the range you want to get are close, and the pool size is not a very small integer - it'll get harder and harder to get a random number that wasn't already put in the vector/array.
In that case, you'll be much better off using std::vector (in <vector>) and std::random_shuffle (in <algorithm>; note that it was deprecated in C++14 and removed in C++17 in favor of std::shuffle):
unsigned short start = 10; // the minimum value of a pool
unsigned short step = 1; // for 10,11,12,13,14... values in the vector
// initialize the pool of 45100 numbers
std::vector<unsigned long> pool(45100);
for (unsigned long i = 0, j = start; i < pool.size(); ++i, j += step) {
pool[i] = j;
}
// get 45000 numbers from the pool without repetitions
std::random_shuffle(pool.begin(), pool.end());
return std::vector<unsigned long>(pool.begin(), pool.begin() + 45000);
You can obviously use any type, but you'll need to initialize the vector accordingly, so it'd contain all possible values you want.
Note that the memory overhead probably won't really matter if you really need almost all of the numbers in the pool, and you'll get good performance. Using rand() and checking will take a lot of time, and if your RAND_MAX is equal to 32767 then it'd be an infinite loop.
The memory overhead is however noticeable if you only need few of those values. The first approach would usually be faster then.
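On newer compilers the same pool idea can be written with std::shuffle and a <random> engine. A minimal sketch, assuming C++11 or later (the helper name drawUnique and its parameters are made up for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: draws `count` distinct values from the range
// [start, start + poolSize). If count > poolSize, the whole pool is returned.
std::vector<unsigned long> drawUnique(unsigned long start, std::size_t poolSize,
                                      std::size_t count)
{
    std::vector<unsigned long> pool(poolSize);
    std::iota(pool.begin(), pool.end(), start); // start, start + 1, ...
    std::mt19937 engine(std::random_device{}());
    std::shuffle(pool.begin(), pool.end(), engine);
    pool.resize(std::min(count, poolSize)); // keep only the first `count` values
    return pool;
}
```

For the numbers used above, the call would be drawUnique(10, 45100, 45000).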
If it really needs to be the array you need to iterate or use find function from algorithm header. Well, I would suggest you go for putting the numbers in a set as the look up is fast in sets and handy using set::find function
ref: stl set
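A minimal sketch of the set-based check, assuming the picks are ints (the helper name addPick is made up for illustration). std::set::insert already reports whether the value was new, so the lookup and the insertion happen in one call:

```cpp
#include <set>

// Hypothetical helper: records the pick and returns true if it was new,
// false if the number had already been chosen.
bool addPick(std::set<int>& picks, int number)
{
    // insert() returns a pair; .second is false when the value already exists
    return picks.insert(number).second;
}
```

When addPick returns false, the program would simply ask the user for a different number.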
These are some of the steps (in pseudo-code since this is a homework question) on how you may get around to doing this:
1. Get user to enter a new number.
2. If the number entered is the first, push it to the vector anyway.
3. Sort the contents of the vector in case size is > 1.
4. Ask user to enter the number.
5. Perform a binary search on the contents to see if the number was entered.
6. If number is unique, push it into vector. If not unique, ask again.
7. Go to step 3.
HTH,
Sriram.