C string map key - c++

Is there any issue with using a C string as a map key?
std::map<const char*, int> imap;
The order of the elements in the map doesn't matter, so it's okay if they are ordered using std::less<const char*>.
I'm using Visual Studio and according to MSDN (Microsoft specific):
In some cases, identical string literals can be "pooled" to save space in the executable file. In string-literal pooling, the compiler causes all references to a particular string literal to point to the same location in memory, instead of having each reference point to a separate instance of the string literal.
It says that they are only pooled in some cases, so it seems like accessing the map elements using a string literal would be a bad idea:
//Could these be referring to different map elements?
int i = imap["hello"];
int j = imap["hello"];
Is it possible to overload operator== for const char* so that the actual C string and not the pointer values would be used to determine if the map elements are equal:
bool operator==(const char* a, const char* b)
{
return strcmp(a, b) == 0 ? true : false;
}
Is it ever a good idea to use a C string as a map key?

Is it possible to overload operator== for const char* so that the actual C string and not the pointer values would be used to determine if the map elements are equal
No, it's not possible, and no, it's not a good idea, for exactly the reason pointed out in the question and because you don't need char*: you can use a std::string instead. (You can provide a custom compare function, as pointed out by simonc, but I'd advise against it.)
//Could these be referring to different map elements?
int i = imap["hello"];
int j = imap["hello"];
Yes, and they can even refer to elements that don't exist yet, but they'll be created by operator[] and be value initialized. The same issue exists with assignment:
imap["hello"] = 0;
imap["hello"] = 1;
The map could now have 1 or 2 elements.
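As a minimal sketch of the std::string alternative recommended above: the helper name count_after_two_assignments is hypothetical and only exists to make the example checkable. With std::string keys the two assignments compare by content, so they should always hit the same element.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: repeats the two assignments from above on a map
// keyed by std::string and reports how many elements result.
std::size_t count_after_two_assignments() {
    std::map<std::string, int> imap;
    imap["hello"] = 0;  // the literal is copied into a std::string key
    imap["hello"] = 1;  // equal by content, so this overwrites the value
    return imap.size();
}
```

The price is one string copy per inserted key, which in exchange removes any dependence on literal pooling.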

You can provide the map with a custom comparator which compares the C strings:
// requires <cstring> and <map>
struct CstrCmp
{
bool operator()(const char* a, const char* b) const
{
return strcmp(a, b) < 0; // order by content, not by address
}
};
std::map<const char*, YourType, CstrCmp> imap;
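To illustrate what a content-based comparator buys you, here is a small sketch. The names CStrLess and lookup_via_copy are made up for this example. It looks a key up through a second buffer that holds the same characters at a different address.

```cpp
#include <cstring>
#include <map>

// Orders C strings by content rather than by address.
struct CStrLess {
    bool operator()(const char* a, const char* b) const {
        return std::strcmp(a, b) < 0;
    }
};

// Hypothetical helper: stores one key, then searches for it through a
// separate buffer with identical text. Returns the value, or -1 if absent.
int lookup_via_copy() {
    static const char stored[] = "hello";  // must outlive the map
    std::map<const char*, int, CStrLess> imap;
    imap[stored] = 42;

    char probe[] = "hello";                // different address, same text
    auto it = imap.find(probe);
    return it != imap.end() ? it->second : -1;
}
```

Note that the map still stores only pointers, so the caller remains responsible for keeping every key's storage alive for as long as the map uses it.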

First, in order to introduce ordering over map keys you need to define a "less-than" comparison. A map says that two elements are "equivalent" if neither is less than the other. It's a bad idea to use char* for map keys because you will need to do memory management somewhere outside the map.
In most realistic scenarios when you query a map your keys will not be literals.
On the other hand, if you maintain a pool of string literals yourself and assign an ID to every literal you could use those IDs for map keys.
To summarize, I wouldn't rely on Microsoft saying "In some cases literals may be pooled". If you fill the map with literals and if you query the map with literals as keys you might as well use enum for keys.

Related

Usefulness of KeyEqual in std::unordered_set/std::unordered_map

I understand that this may be a vague question, but I wonder what the real-world cases are in which a custom comparator is useful for the hash containers in std.
I understand its usefulness in ordered containers, but for hash containers it seems a bit weird.
The reason is that the hash value for elements that are equal according to the comparator needs to be the same, and I believe that in most cases that actually means converting the lookup/insert element to some common representation (which is faster and easier to implement).
For example:
set of case insensitive strings: if you want to hash properly you need to uppercase/lowercase the entire string anyway.
set of fractions(where 2/3 == 42/63): you need to convert 42/63 to 2/3 and then hash that...
So I wonder if someone can provide some real world examples of usefulness of customizing std::unordered_ template parameters so I can recognize those patterns in future code I write.
Note 1: the "symmetry argument" (std::map enables customization of a comparator, so std::unordered_ should be customizable also) is something I considered, and I do not think it is convincing.
Note 2: I mixed two kinds of comparators (< and ==) in the post for brevity; I know that std::map uses < and std::unordered_map uses ==.
As per https://en.cppreference.com/w/cpp/container/unordered_set
Internally, the elements are not sorted in any particular order, but
organized into buckets. Which bucket an element is placed into depends
entirely on the hash of its value. This allows fast access to
individual elements, since once a hash is computed, it refers to the
exact bucket the element is placed into.
So the hash function defines the bucket your element will end up in, but once the bucket is decided, in order to find the element, the operator == will be used.
Basically operator == is used to resolve hash collisions, and hence you need your hash function and your operator == to be consistent. Furthermore, if your operator == says that two elements are equal, the set will not allow a duplicate.
As for customization, I think that the idea of a case-insensitive set of strings is a good one: given two strings, you will need to provide a case-insensitive hash function to allow the set to determine the bucket it has to store the string in. Then you will need to provide a custom KeyEqual to allow the set to actually retrieve the element.
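The case-insensitive set described above could be sketched roughly as follows. All names here (to_lower_ascii, CaseInsensitiveHash, CaseInsensitiveEqual, CaseInsensitiveSet) are hypothetical, and the lowercasing is ASCII-only for simplicity. The important property is that the hash and the equality predicate agree: both look at the lowercased form.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical helper: ASCII-only lowercase copy of a string.
std::string to_lower_ascii(std::string s) {
    for (char& c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// Hashes the lowercased form so that "Hello" and "HELLO" share a bucket.
struct CaseInsensitiveHash {
    std::size_t operator()(const std::string& s) const {
        return std::hash<std::string>()(to_lower_ascii(s));
    }
};

// Treats two strings as equal when their lowercased forms match.
struct CaseInsensitiveEqual {
    bool operator()(const std::string& a, const std::string& b) const {
        return to_lower_ascii(a) == to_lower_ascii(b);
    }
};

using CaseInsensitiveSet =
    std::unordered_set<std::string, CaseInsensitiveHash, CaseInsensitiveEqual>;
```

Unlike normalizing before insertion, this keeps the spelling of the first inserted string, so inserting "Hello" and then "HELLO" leaves "Hello" in the set.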
A case I had to deal with, in the past, was a way to allow users to insert strings, keeping track of their order of insertion but avoiding duplicates. So, given a struct like:
struct keyword{
std::string value;
int sequenceCounter;
};
You want to detect duplicates according only to value. One of the solutions I came up with was an unordered_set with a custom comparator/hash function, that used only value. This allowed me to check for the existence of a key before allowing insertion.
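A rough sketch of that idea, assuming the keyword struct from above: KeywordHash, KeywordEqual and the KeywordLog wrapper are hypothetical names, not the original author's code. Hash and equality both ignore sequenceCounter, so a repeated value is rejected regardless of when it arrives.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>

struct keyword {
    std::string value;
    int sequenceCounter;
};

// Hash and equality look at value only; sequenceCounter is ignored.
struct KeywordHash {
    std::size_t operator()(const keyword& k) const {
        return std::hash<std::string>()(k.value);
    }
};
struct KeywordEqual {
    bool operator()(const keyword& a, const keyword& b) const {
        return a.value == b.value;
    }
};

// Hypothetical wrapper that numbers each new value in insertion order.
class KeywordLog {
public:
    // Returns true if the value was new and has been recorded.
    bool add(std::string value) {
        keyword k{std::move(value), next_};
        bool inserted = seen_.insert(std::move(k)).second;
        if (inserted) ++next_;  // only consume a number for new values
        return inserted;
    }
    std::size_t size() const { return seen_.size(); }

private:
    std::unordered_set<keyword, KeywordHash, KeywordEqual> seen_;
    int next_ = 0;
};
```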
One interesting usage is to define memory efficient indexes (database sense of the term) for a given set of objects.
Example
Let's say we have a program that has a collection of N objects of this class:
struct Person {
// each object has a unique firstName/lastName pair
std::string firstName;
std::string lastName;
// each object has a unique ssn value
std::string socialSecurityNumber;
// each object has a unique email value
std::string email;
};
And we need to retrieve efficiently objects by the value of any unique property.
Implementations comparison
Time complexities are given assuming string comparisons are constant time (strings have limited length).
1) Single unordered_map
With a single map indexed by a single key (ex: email):
std::unordered_map<std::string,Person> indexedByEmail;
Time complexity: lookup by any unique property other than email requires a traversal of the map: average O(N).
Memory usage: the email value is duplicated. This could be avoided by using a single set with custom hash & compare (see 3).
2) Multiple unordered_map, no custom hash & compare
With a map for each unique property, with default hash & comparisons:
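// note: the standard library provides no std::hash for std::pair, so as written
// this map needs a user-supplied hash (or, for example, a concatenated std::string key)
std::unordered_map<std::pair<std::string,std::string>, Person> byName;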
std::unordered_map<std::string, const Person*> byEmail;
std::unordered_map<std::string, const Person*> bySSN;
Time complexity: by using the appropriate map, a lookup by any unique property is average O(1).
Memory usage: inefficient, because of all the string duplications.
3) Multiple unordered_set, custom hash & comparison:
With custom hash & comparison, we define different unordered_sets which hash & compare only specific fields of the objects. These sets can be used to perform lookups as if the items were stored in a map as in 2, but without duplicating any field.
using StrHash = std::hash<std::string>;
// --------------------
struct PersonNameHash {
std::size_t operator()(const Person& p) const {
// not the best hashing function in the world, but good enough for demo purposes.
return StrHash()(p.firstName) + StrHash()(p.lastName);
}
};
struct PersonNameEqual {
bool operator()(const Person& p1, const Person& p2) const {
return (p1.firstName == p2.firstName) && (p1.lastName == p2.lastName);
}
};
std::unordered_set<Person, PersonNameHash, PersonNameEqual> byName;
// --------------------
struct PersonSsnHash {
std::size_t operator()(const Person* p) const {
return StrHash()(p->socialSecurityNumber);
}
};
struct PersonSsnEqual {
bool operator()(const Person* p1, const Person* p2) const {
return p1->socialSecurityNumber == p2->socialSecurityNumber;
}
};
std::unordered_set<const Person*, PersonSsnHash, PersonSsnEqual> bySSN;
// --------------------
struct PersonEmailHash {
std::size_t operator()(const Person* p) const {
return StrHash()(p->email);
}
};
struct PersonEmailEqual {
bool operator()(const Person* p1, const Person* p2) const {
return p1->email == p2->email;
}
};
std::unordered_set<const Person*,PersonEmailHash,PersonEmailEqual> byEmail;
Time complexity: a lookup by any unique property is still O(1) average.
Memory usage: much better than 2): no string duplication.
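For completeness, here is one way a lookup against such a set might be written. This is a sketch under assumptions: the EmailIndex alias and the find_by_email helper are hypothetical, and the lookup works by filling in a temporary "probe" Person with only the field the index cares about.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

struct Person {
    std::string firstName;
    std::string lastName;
    std::string socialSecurityNumber;
    std::string email;
};

struct PersonEmailHash {
    std::size_t operator()(const Person* p) const {
        return std::hash<std::string>()(p->email);
    }
};
struct PersonEmailEqual {
    bool operator()(const Person* p1, const Person* p2) const {
        return p1->email == p2->email;
    }
};
using EmailIndex =
    std::unordered_set<const Person*, PersonEmailHash, PersonEmailEqual>;

// Hypothetical helper: searches the index with a temporary probe object
// that has only the email field set. Returns nullptr if nothing matches.
const Person* find_by_email(const EmailIndex& index, const std::string& email) {
    Person probe;
    probe.email = email;  // the only field the hash and comparison read
    auto it = index.find(&probe);
    return it != index.end() ? *it : nullptr;
}
```

The probe costs one string copy per lookup, and the indexed Person objects must stay alive and unmodified (at least in the indexed field) while they are in the set.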
The hash function extracts the features that identify an element, and the comparator's job is to decide whether the features of two elements are the same.
If you wrap the data in a "shell" type that exposes only those features, you may not need to customize the comparator at all.
In short: put a feature shell around the data and let the features be responsible for the comparison.
To be honest, I don't fully understand your problem description, so my answer may be somewhat confused. Please bear with me.
:)

Substring of an element in a set

Is there a way to find and replace subset of a char*/string in a set?
Example:
std::set<char*> myset;
myset.insert("catt");
myset.insert("world");
myset.insert("hello");
it = myset.subsetfind("tt");
myset.replace(it, "t");
There are at least three reasons why this won't work.
std::set provides only the means to search the set for a value that compares equally to the value being searched for, and not to a value that matches some arbitrary portion of the value.
The shown program is ill-formed. A string literal such as "hello" is an array of const char that converts to const char *, not char *. No self-respecting C++ compiler will allow you to insert a const char * into a container of char *s. And you can't modify const values, by definition, anyway.
Values in std::set cannot be modified. To effect the modification of an existing value in a set, it must be erase()d, then the new value insert()ed.
std::set is simply not the right container for the goals you're trying to accomplish.
No, you can't (or at least shouldn't) modify the key while it's in the set. Doing so could change the relative order of the elements, in which case the modification would render the set invalid.
You need to start with a set of things you can modify. Then you need to search for the item, remove it from the set, modify it, then re-insert the result back into the set.
std::set<std::string> myset {"catt", "world", "hello"};
// find the first element that contains "tt"
auto pos = std::find_if(myset.begin(), myset.end(), [](auto const &s) { return s.find("tt") != std::string::npos; });
if (pos != myset.end()) {
auto temp = *pos; // copy, because elements of a set are const
myset.erase(pos);
auto p = temp.find("tt");
temp.replace(p, 2, "t");
myset.insert(temp);
}
You cannot modify elements within a set.
You can find strings that contain the substring using std::find_if. Once you find matching elements, you can remove each from the set and add a modified copy of the string, with the substring replaced with something else.
PS. Remember that you cannot modify string literals. You will need to allocate some memory for the strings.
PPS. Implicit conversion of string literal to char* has been deprecated since C++ was standardized, and since C++11 such conversion is ill-formed.
PPPS. The default comparator will not be correct when you use pointers as the element type. I recommend you to use std::string instead. (A strcmp based comparator approach would also be possible, although much more prone to memory bugs).
You could use std::find_if with a predicate function/functor/lambda that searches for the substring you want.
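The erase/modify/re-insert approach described in the answers above can be generalized to every element that contains the substring. The following is only a sketch: the function name replace_in_set is hypothetical, and it replaces just the first occurrence within each matching element.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>

// Hypothetical helper: for every element containing `from`, replaces the
// first occurrence with `to`. Returns how many elements were rewritten.
std::size_t replace_in_set(std::set<std::string>& s,
                           const std::string& from, const std::string& to) {
    std::set<std::string> changed;  // collected separately so that new
                                    // strings are not visited again
    std::size_t count = 0;
    for (auto it = s.begin(); it != s.end();) {
        auto pos = it->find(from);
        if (pos == std::string::npos) {
            ++it;
            continue;
        }
        std::string copy = *it;     // set elements are const, so copy
        copy.replace(pos, from.size(), to);
        changed.insert(std::move(copy));
        it = s.erase(it);           // erase returns the next iterator
        ++count;
    }
    s.insert(changed.begin(), changed.end());
    return count;
}
```

If a rewritten string equals one already in the set, the set simply keeps a single copy, so the set may end up smaller than before.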

What can bring a std::map to not find one of its keys?

I have a std::map associating const char* keys with int values:
std::map<const char*, int> myMap;
I initialize it with three keys, then check if it can find it:
myMap["zero"] = 0;
myMap["first"] = 1;
myMap["second"] = 2;
if (myMap.at("zero") != 0)
{
std::cerr << "We have a problem here..." << std::endl;
}
And nothing is printed. From here, everything looks ok.
But later in my code, without any alteration of this map, I try to find again a key:
int value = myMap.at("zero");
But the at function throws an std::out_of_range exception, which means it cannot find the element. myMap.find("zero") thinks the same, because it returns an iterator on the end of the map.
But the creepiest part is that the key is really in the map, if just before the call to the at function, I print the content of the map like this:
for (auto it = myMap.begin(); it != myMap.end(); it++)
{
std::cout << (*it).first << std::endl;
}
The output is as expected:
zero
first
second
How is it even possible? I don't use any beta-test library or anything supposed to be unstable.
You have a map of pointers to characters, not strings. The map lookup is based on the pointer value (address) and not the value of what's pointed at. In the first case, where "zero" is found in the map, your compiler has performed some string merging and is using one array of characters for both identical strings. This is not required by the language but is a common optimization. In the second case, when the string is not found, this merging has not been done (possibly your code here is in a different source module), so the address being used for the lookup is different from the one that was inserted, and the key is not found.
To fix this either store std::string objects in the map, or specify a comparison in your map declaration to order based on the strings and not the addresses.
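The difference can be reproduced without relying on what the compiler does with literals. In this sketch (both function names are hypothetical) the lookup key is a separate character array with the same contents as the stored key, which is guaranteed to have a different address.

```cpp
#include <map>
#include <string>

// Hypothetical helper: looks up a copy of "zero" in a map keyed by
// const char*, where keys are compared by address.
bool found_by_pointer() {
    static const char stored[] = "zero";
    std::map<const char*, int> m;
    m[stored] = 0;
    char probe[] = "zero";            // same characters, different address
    return m.find(probe) != m.end();
}

// Hypothetical helper: the same lookup with std::string keys, where keys
// are compared by content.
bool found_by_string() {
    std::map<std::string, int> m;
    m["zero"] = 0;
    char probe[] = "zero";
    return m.find(probe) != m.end();  // probe converts to std::string
}
```

Here found_by_pointer() fails to find the key while found_by_string() succeeds, which mirrors the behavior seen in the question.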
The key type of the map is char*, so the map's comparison function compares raw pointer values rather than checking the C strings for equality. Declare the map with std::string as the key instead.
If you do not want to deal with std::string and still want the same functionality with better time complexity, a more sophisticated data structure is a trie. Look at implementations such as Judy arrays.
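To give an idea of what a trie looks like, here is a deliberately minimal sketch mapping strings to int values. The class and its members are hypothetical, it is not related to the Judy array implementation, and a production trie would use a far more compact node layout.

```cpp
#include <map>
#include <memory>
#include <string>

// Minimal trie sketch: one node per character along each key's path.
class Trie {
public:
    void insert(const std::string& key, int value) {
        Node* n = &root_;
        for (char c : key) {
            auto& child = n->children[c];
            if (!child) child.reset(new Node());
            n = child.get();
        }
        n->hasValue = true;
        n->value = value;
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* find(const std::string& key) const {
        const Node* n = &root_;
        for (char c : key) {
            auto it = n->children.find(c);
            if (it == n->children.end()) return nullptr;
            n = it->second.get();
        }
        return n->hasValue ? &n->value : nullptr;
    }

private:
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        bool hasValue = false;  // true only where an inserted key ends
        int value = 0;
    };
    Node root_;
};
```

A lookup walks at most one node per character of the key, so its cost depends on the key length rather than on the number of stored keys.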

Applying c++ "lower_bound" on an array of char strings

I am trying the lower_bound function in C++.
Used it multiple times for 1 d datatypes.
Now, I am trying it on a sorted array dict[5000][20] to find strings of size <=20.
The string to be matched is in str.
bool recurseSerialNum(char *name,int s,int l,char (*keypad)[3],string str,char (*dict)[20],int dictlen)
{
char (*idx)[20]= lower_bound(&dict[0],&dict[0]+dictlen,str.c_str());
int tmp=idx-dict;
if(tmp!=dictlen)
printf("%s\n",*idx);
}
As per http://www.cplusplus.com/reference/algorithm/lower_bound/?kw=lower_bound , this function is supposed to return the index of 'last'(beyond end) in case no match is found i.e. tmp should be equal dictlen.
In my case, it always returns the beginning index, i.e. tmp equals 0, both (1) when passed a string that is in dict and (2) when passed a string that is not in dict.
I think the issue is in handling and passing of the pointer. The default comparator should be available for this case as is available in case of vector. I also tried passing an explicit one, to no avail.
I tried this comparator -
bool compStr(const char *a, const char *b){
return strcmp(a,b)<0;
}
I know the alternative is to use a vector, etc., but I would like to understand the issue with this approach.
Searched on this over google as well as SO, but did not find anything similar.
There are two misunderstandings here, I believe.
std::lower_bound does not check if an element is part of a sorted range. Instead it finds the leftmost place where an element could be inserted without breaking the ordering.
You're not comparing the contents of the strings but their memory addresses.
It is true that dict in your case is a sorted range in the sense that the memory addresses of the inner arrays are ascending. Where str.c_str() lies in relation to them is, of course, unspecified. In practice, if dict is a stack object, you will often find that the memory range for the heap (where str.c_str() typically lies) is below that of the stack, in which case lower_bound quite correctly tells you that if you wanted to insert this address into the sorted range of addresses, which is how you are interpreting dict, you'd have to do so at the beginning.
For a solution, since there is an operator<(char const *, std::string const &), you could simply write
char (*idx)[20] = lower_bound(&dict[0], &dict[0] + dictlen, str);
...but are you perhaps really looking for std::find?
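Both points can be combined into a membership test: compare by content, then check whether the position returned by lower_bound actually holds the key. This is a sketch under assumptions. The function name find_in_dict is hypothetical, and it uses an explicit strcmp-based comparator rather than the operator< mentioned above.

```cpp
#include <algorithm>
#include <cstring>
#include <string>

// Hypothetical helper: returns the index of `key` in a sorted array of
// fixed-width C strings, or -1 if the key is not present.
int find_in_dict(const char (*dict)[20], int dictlen, const std::string& key) {
    auto end = dict + dictlen;
    // lower_bound yields the first entry that is not less than key.
    auto it = std::lower_bound(
        dict, end, key,
        [](const char (&entry)[20], const std::string& k) {
            return std::strcmp(entry, k.c_str()) < 0;  // compare contents
        });
    // A valid position is not a match by itself: check for equality.
    if (it != end && std::strcmp(*it, key.c_str()) == 0)
        return static_cast<int>(it - dict);
    return -1;
}
```

The array must be sorted according to the same strcmp ordering, otherwise the result of lower_bound is meaningless.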

std::map iteration - order differences between Debug and Release builds

Here's a common code pattern I have to work with:
class foo {
public:
void InitMap();
void InvokeMethodsInMap();
static void abcMethod();
static void defMethod();
private:
typedef std::map<const char*, pMethod> TMyMap;
TMyMap m_MyMap;
};
void
foo::InitMap()
{
m_MyMap["abc"] = &foo::abcMethod;
m_MyMap["def"] = &foo::defMethod;
}
void
foo::InvokeMethodsInMap()
{
for (TMyMap::const_iterator it = m_MyMap.begin();
it != m_MyMap.end(); it++)
{
(*it->second)(it->first);
}
}
However, I have found that the order that the map is processed in (within the for loop) can differ based upon whether the build configuration is Release or Debug. It seems that the compiler optimisation that occurs in Release builds affects this order.
I thought that by using begin() in the loop above, and incrementing the iterator after each method call, it would process the map in order of initialisation. However, I also remember reading that a map is implemented as a hash table, and order cannot be guaranteed.
This is particularly annoying, as most of the unit tests are run on a Debug build, and often strange order dependency bugs aren't found until the external QA team start testing (because they use a Release build).
Can anyone explain this strange behaviour?
Don't use const char* as the key for maps. That means the map is ordered by the addresses of the strings, not the contents of the strings. Use a std::string as the key type, instead.
std::map is not a hash table, it's usually implemented as a red-black tree, and elements are guaranteed to be ordered by some criteria (by default, < comparison between keys).
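A short sketch makes the ordering guarantee concrete. The helper name iteration_order is hypothetical. With std::string keys, iteration follows the key order regardless of the order of insertion and regardless of the build configuration.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: inserts the keys in the given order and returns
// them in the order in which std::map iterates over them.
std::vector<std::string> iteration_order(const std::vector<std::string>& keys) {
    std::map<std::string, int> m;
    int i = 0;
    for (const auto& k : keys) m[k] = i++;
    std::vector<std::string> out;
    for (const auto& kv : m) out.push_back(kv.first);  // sorted by key
    return out;
}
```

Inserting "def", "abc", "ghi" therefore yields "abc", "def", "ghi". If the order of insertion is what matters, a std::map alone will not provide it.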
The definition of map is:
map<Key, Data, Compare, Alloc>
Where the last two template parameters default to:
Compare: less<Key>
Alloc: allocator<value_type>
When a new value is inserted into a map, its key is compared against the existing keys using Compare to find its position (N.B. this is not a sequential search; the standard guarantees a maximum insert complexity of O(log N)). Because you are using const char* as the key, the Compare object is less<const char*>, which simply applies < to the two pointer values. In effect you are comparing the pointers, not the strings, so the order is unpredictable (you don't know where the compiler will put the string literals).
There are two possible solutions:
Change the type of the key so that it compares the string values.
Define another Compare Type that does what you need.
Personally I (like Chris) would just use a std::string, because the < operator used on strings returns a comparison based on the string content. But for argument's sake we can just define a Compare type.
struct StringLess
{
bool operator()(const char* const& left,const char* const& right) const
{
return strcmp(left,right) < 0;
}
};
///
typedef std::map<const char*, int,StringLess> TMyMap;
If you want to use const char * as the key for your map, also set a key comparison function that uses strcmp (or similar) to compare the keys. That way your map will be ordered by the string's contents, rather than the string's pointer value (i.e. location in memory).