std::map with a char[5] key that may contain null bytes - c++

The keys are binary garbage and I only defined them as chars because I need a 1-byte array.
They may contain null bytes.
The problem is that when I have two keys, ab(0)a and ab(0)b ((0) being a null byte), the map treats them as strings, considers them equal, and I don't get two unique map entries.
What's the best way to solve this?

Why not use std::string as key:
//must use this as:
std::string key1("ab\0a",4);
std::string key2("ab\0b",4);
std::string key3("a\0b\0b",5);
std::string key4("a\0\0b\0b",6);
Second argument should denote the size of the c-string. All of the above use this constructor:
string ( const char * s, size_t n );
description of which is this:
Content is initialized to a copy of the string formed by the first n characters in the array of characters pointed by s.

Use std::array<char,5> or maybe even better (if you want really to handle keys as binary values) std::bitset
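A minimal sketch of the std::array suggestion, assuming the keys are always exactly 5 bytes. The alias Key5 and the function count_distinct_keys are hypothetical names used only for illustration. std::array compares element by element, so an embedded null byte is treated like any other value:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical alias: a fixed-size 5-byte binary key.
using Key5 = std::array<char, 5>;

// Inserts two keys that differ only after an embedded null byte and
// returns how many distinct entries the map ends up with.
inline std::size_t count_distinct_keys()
{
    std::map<Key5, int> m;
    const Key5 k1{{'a', 'b', '\0', 'a', 'x'}};
    const Key5 k2{{'a', 'b', '\0', 'b', 'x'}};
    m[k1] = 1;
    m[k2] = 2;
    return m.size();
}
```

No custom comparator is needed here because std::array already provides a lexicographic operator<.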

If you really want to use char[5] as your key, consider writing your own comparison class to compare between keys correctly. The map class requires one of these in order to organize its contents. By default, it is using a version that doesn't work with your key.
Here's a page on the map class that shows the parameters for map. You'd want to write your own Compare class to replace less<Key> which is the third template parameter to map.

If you only need to distinguish the keys and don't rely on a lexicographical ordering, you could treat each key as a uint64_t. This has the advantage that you could easily replace std::map with a hash map implementation, and you don't have to write any comparison by hand.
Otherwise you can also write your own comparator somehow like this:
class MyKeyComp
{
public:
    // Lexicographic "less than" over all five bytes; a null byte is just another value.
    bool operator()(const char* lhs, const char* rhs) const
    {
        return lhs[0] != rhs[0] ? lhs[0] < rhs[0]
             : lhs[1] != rhs[1] ? lhs[1] < rhs[1]
             : lhs[2] != rhs[2] ? lhs[2] < rhs[2]
             : lhs[3] != rhs[3] ? lhs[3] < rhs[3]
             : lhs[4] < rhs[4];
    }
};
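The uint64_t idea mentioned above could look roughly like this. This is only a sketch: pack_key and count_packed_keys are hypothetical names, and the resulting integer order depends on the machine's byte order, so it is only suitable when you need distinct keys rather than a lexicographic ordering:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>

// Hypothetical helper: copies the 5 key bytes into a zero-initialized
// 64-bit integer. The remaining 3 bytes stay zero.
inline std::uint64_t pack_key(const char (&key)[5])
{
    std::uint64_t packed = 0;
    std::memcpy(&packed, key, sizeof key);
    return packed;
}

// Inserts two keys that differ only after an embedded null byte and
// returns the number of entries in the hash map.
inline std::size_t count_packed_keys()
{
    const char k1[5] = {'a', 'b', '\0', 'a', 'x'};
    const char k2[5] = {'a', 'b', '\0', 'b', 'x'};
    std::unordered_map<std::uint64_t, int> m;
    m[pack_key(k1)] = 1;
    m[pack_key(k2)] = 2;
    return m.size();
}
```

Because std::hash is already defined for std::uint64_t, no custom hash or equality function is required.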

Related

Same key, multiple entries for std::unordered_map?

I have a map inserting multiple values with the same key of C string type.
I would expect to have a single entry with the specified key.
However, the map seems to take its address into consideration when uniquely identifying a key.
#include <cassert>
#include <iostream>
#include <string>
#include <unordered_map>
typedef char const* const MyKey;
/// @brief Hash function for StatementMap keys
///
/// Delegates to std::hash<std::string>.
struct MyMapHash {
public:
    size_t operator()(MyKey& key) const {
        return std::hash<std::string>{}(std::string(key));
    }
};
typedef std::unordered_map<MyKey, int, MyMapHash> MyMap;
int main()
{
    // Build std::strings to prevent optimizations on the addresses of
    // underlying C strings.
    std::string key1_s = "same";
    std::string key2_s = "same";
    MyKey key1 = key1_s.c_str();
    MyKey key2 = key2_s.c_str();
    // Make sure addresses are different.
    assert(key1 != key2);
    // Make sure hashes are identical.
    assert(MyMapHash{}(key1) == MyMapHash{}(key2));
    // Insert two values with the same key.
    MyMap map;
    map.insert({key1, 1});
    map.insert({key2, 2});
    // Make sure we find them in the map.
    auto it1 = map.find(key1);
    auto it2 = map.find(key2);
    assert(it1 != map.end());
    assert(it2 != map.end());
    // Get values.
    int value1 = it1->second;
    int value2 = it2->second;
    // The first one of any of these asserts fails. Why is there not only one
    // entry in the map?
    assert(value1 == value2);
    assert(map.size() == 1u);
}
A print in the debugger shows that map contains two elements just after inserting them.
(gdb) p map
$4 = std::unordered_map with 2 elements = {
[0x7fffffffda20 "same"] = 2,
[0x7fffffffda00 "same"] = 1
}
Why does this happen if the hash function, which delegates to std::hash<std::string>, only takes its value into account (this is asserted in the code)?
Moreover, if this is the intended behaviour, how can I use a map with C string as key, but with a 1:1 key-value mapping?
The reason is that hash maps (like std::unordered_map) do not rely only on the hash function to determine whether two keys are equal. The hash function is the first comparison layer; after that, the keys are always also compared for equality. Even with a good hash function you might have collisions, where two different keys yield the same hash value, but you still need to be able to store both entries in the hash map. There are various strategies to handle that; you can find more information by searching for collision resolution in hash maps.
In your example both keys have the same hash value but different pointer values. The keys are compared by the default equality function (std::equal_to), which compares the char* pointers, and those differ. Therefore the equality comparison fails and you get two entries in the map. To solve your issue you also need to define a custom equality function for your hash map, which can be done by specifying the fourth template parameter, KeyEqual, of std::unordered_map.
This fails because the unordered_map does not and cannot solely rely on the hash function for the key to differentiate keys, but it must also compare keys with the same hash for equality. And comparing two char pointers compares the address pointed to.
If you want to change the comparison, pass a KeyEqual parameter to the map in addition to the hash.
struct MyKeyEqual
{
    bool operator()(MyKey const &lhs, MyKey const &rhs) const
    {
        return std::strcmp(lhs, rhs) == 0;
    }
};
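Putting the hash and the equality functor together, a complete sketch could look like the following. The names CStrHash, CStrEqual, CStrMap and insert_same_text_twice are hypothetical; note that the map only stores the pointers, so the pointed-to strings must outlive it:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>
#include <unordered_map>

// Hashes the characters the pointer refers to, not the pointer itself.
struct CStrHash {
    std::size_t operator()(const char* s) const {
        return std::hash<std::string>{}(s);
    }
};

// Compares the pointed-to characters, not the addresses.
struct CStrEqual {
    bool operator()(const char* lhs, const char* rhs) const {
        return std::strcmp(lhs, rhs) == 0;
    }
};

using CStrMap = std::unordered_map<const char*, int, CStrHash, CStrEqual>;

// Inserts the same text from two distinct buffers and returns the map size.
inline std::size_t insert_same_text_twice()
{
    std::string a = "same";
    std::string b = "same";   // distinct buffer, equal contents
    CStrMap m;
    m.emplace(a.c_str(), 1);
    m.emplace(b.c_str(), 2);  // rejected: an equal key is already present
    return m.size();
}
```

With both functors supplied, the second insertion is treated as a duplicate and the map keeps a single entry.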
unordered_map needs to be able to perform two operations on the key: checking equality and obtaining a hash code. Naturally, two unequal keys are allowed to have the same hash code. When this happens, unordered_map applies its hash collision resolution strategy to treat these unequal keys as distinct.
That is precisely what happens when you supply a character pointer for the key, and provide an implementation of hash to it: the default equality comparison for pointers kicks in, so two different pointers produce two different keys, even though the content of the corresponding C strings is the same.
You can fix it by providing a custom implementation of KeyEqual template parameter to perform actual comparison of C strings, for example, by calling strcmp:
return !strcmp(lhsKey, rhsKey);
You didn't define a map of keys but a map of pointers to a key.
typedef char const* const MyKey;
The compiler may merge two identical string literals such as "same" into a single instance in the constant data segment, but that may or may not happen. Whether it does is unspecified, so you cannot rely on the addresses being equal.
Your map should contain the key itself. Make the key a std::string or similar.

In place Tokenization of std::string into a map of Key Value

In C the delimiters can be replaced by Nulls and a map of char* -> char* with a comparison function would work.
I am trying to figure out the fastest possible way to do this in modern C++. The idea is to avoid copying characters into the map.
std::string sample_string("name=alpha;title=something;job=nothing");
to
std::map<std::string,std::string> sample_map;
Without copying characters.
It's ok to lose original input string.
Two std::strings cannot point to the same underlying bytes, so no, it's not possible to do this with strings.
To avoid copying bytes, you could use iterators:
struct Slice {
    std::string::iterator begin, end;
    bool operator<(const Slice& that) const {
        return std::lexicographical_compare(begin, end, that.begin, that.end);
    }
};
std::map<Slice, Slice> sample_map;
And beware that if you modify the original string, all the iterators will be invalid.
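If C++17 is available, std::string_view plays the same role as the Slice struct above. The following is only a sketch under that assumption: tokenize is a hypothetical helper, it assumes ';' separates pairs and '=' separates key from value, and it silently skips pairs without '='. The views point into the input, so the input must stay alive and unmodified while the map is in use:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <string_view>

// Hypothetical helper: splits "k=v;k=v" into views that point into `input`.
// No characters are copied.
inline std::map<std::string_view, std::string_view>
tokenize(std::string_view input)
{
    std::map<std::string_view, std::string_view> out;
    while (!input.empty()) {
        const std::size_t semi = input.find(';');
        // substr(0, npos) yields the whole remainder for the last pair.
        const std::string_view pair = input.substr(0, semi);
        const std::size_t eq = pair.find('=');
        if (eq != std::string_view::npos) {
            out.emplace(pair.substr(0, eq), pair.substr(eq + 1));
        }
        if (semi == std::string_view::npos) break;
        input.remove_prefix(semi + 1);
    }
    return out;
}
```

Do not pass a temporary std::string to this function, since the returned views would dangle as soon as the temporary is destroyed.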

How not to use custom comparison function of std::map in searching ( map::find)?

As you can see in my code, lenMap is a std::map with a custom comparison function. This function just checks the string's length.
Now when I want to search for some key (using map::find), the map still uses that custom comparison function.
But how can I force my map not to use it when I search for a key?
Code:
struct CompareByLength : public std::binary_function<string, string, bool>
{
    bool operator()(const string& lhs, const string& rhs) const
    {
        return lhs.length() < rhs.length();
    }
};
int main()
{
    typedef map<string, string, CompareByLength> lenMap;
    lenMap mymap;
    mymap["one"] = "one";
    mymap["a"] = "a";
    mymap["foobar"] = "foobar";
    // Now In mymap: [a, one, foobar]
    string target = "b";
    if (mymap.find(target) == mymap.end())
        cout << "Not Found :) !";
    else
        cout << "Found :( !"; // I don't want to reach here because of "a" item !
    return 0;
}
The map itself does not offer such an operation. The idea of the comparison functor is to create an internal ordering for faster lookup, so the elements are actually ordered according to your functor.
If you need to search for elements in a different way, you can either use the STL algorithm std::find_if() (which has linear time complexity) or create a second map that uses another comparison functor.
In your specific example, since you seem only to be interested in the string's length, you should rather use the length (of type std::size_t) and not the string itself as a key.
By the way, std::binary_function is not needed as a base class. It was deprecated in C++11 and removed in C++17.
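The std::find_if() route could be sketched like this. contains_exact and make_sample are hypothetical helper names; the search is linear and ignores the map's own comparator:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>

struct CompareByLength {
    bool operator()(const std::string& lhs, const std::string& rhs) const {
        return lhs.length() < rhs.length();
    }
};

using LenMap = std::map<std::string, std::string, CompareByLength>;

// Hypothetical helper: linear search by exact key text, bypassing the
// map's length-only comparator.
inline bool contains_exact(const LenMap& m, const std::string& target)
{
    return std::find_if(m.begin(), m.end(),
                        [&target](const LenMap::value_type& entry) {
                            return entry.first == target;
                        }) != m.end();
}

// Builds the same three-entry map as in the question.
inline LenMap make_sample()
{
    LenMap m;
    m["one"] = "one";
    m["a"] = "a";
    m["foobar"] = "foobar";
    return m;
}
```

Here map::find("b") would still report a match because "a" has the same length, while contains_exact(m, "b") does not.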
The comparison function tells the map how to order elements and how to differentiate between them. If it only compares the length, two different strings with the same length will occupy the same position in the map (one will overwrite the other).
Either store your strings in a different data structure and sort them, or perhaps try this comparison function:
struct CompareByLength
{
    bool operator()(const string& lhs, const string& rhs) const
    {
        if (lhs.length() < rhs.length())
        {
            return true;
        }
        else if (rhs.length() < lhs.length())
        {
            return false;
        }
        else
        {
            return lhs < rhs;
        }
    }
};
I didn't test it, but I believe this will first order strings by length, and then however strings normally compare.
You could also use std::map<std::string::size_type, std::map<std::string, std::string>> and use the length for the first map and the string value for the second map. You would probably want to wrap this in a class to make it easier to use, as there is no protection against messing it up.
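One way such a wrapper class might look is sketched below. LengthIndexedMap, its member names, and make_sample_index are all hypothetical; the point is only that the outer map is keyed by length and the inner map by the string itself:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical wrapper around the two-level map.
class LengthIndexedMap {
public:
    void set(const std::string& key, const std::string& value) {
        buckets_[key.size()][key] = value;
    }

    // Exact lookup; returns nullptr when the key is absent.
    const std::string* find(const std::string& key) const {
        const auto bucket = buckets_.find(key.size());
        if (bucket == buckets_.end()) return nullptr;
        const auto entry = bucket->second.find(key);
        return entry == bucket->second.end() ? nullptr : &entry->second;
    }

    // Number of stored keys that have exactly this length.
    std::size_t count_with_length(std::string::size_type length) const {
        const auto bucket = buckets_.find(length);
        return bucket == buckets_.end() ? 0 : bucket->second.size();
    }

private:
    std::map<std::string::size_type, std::map<std::string, std::string>> buckets_;
};

// Builds an index with the same three keys as in the question.
inline LengthIndexedMap make_sample_index()
{
    LengthIndexedMap index;
    index.set("one", "one");
    index.set("a", "a");
    index.set("foobar", "foobar");
    return index;
}
```

This keeps both kinds of query available: all keys of a given length, and an exact match on the key text.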

using a custom comparator with std::set

I'm trying to create a list of words read from a file arranged by their length. For that, I'm trying to use std::set with a custom comparator.
class Longer {
public:
    bool operator() (const string& a, const string& b)
    { return a.size() > b.size();}
};
set<string, Longer> make_dictionary (const string& ifile){
    // produces a map of words in 'ifile' sorted by their length
    ifstream ifs {ifile};
    if (!ifs) throw runtime_error ("couldn't open file for reading");
    string word;
    set<string, Longer> words;
    while (ifs >> word){
        strip(word);
        tolower(word);
        words.insert(word);
    }
    remove_plurals(words);
    if (ifs.eof()){
        return words;
    }
    else
        throw runtime_error ("input failed");
}
From this, I expect a list of all words in a file arranged by their length. Instead, I get a very short list, with exactly one word for each length occurring in the input:
polynomially-decidable
complexity-theoretic
linearly-decidable
lexicographically
alternating-time
finite-variable
newenvironment
documentclass
binoppenalty
investigate
usepackage
corollary
latexsym
article
remark
logic
12pt
box
on
a
Any idea of what's going on here?
With your comparator, equal-length words are equivalent, and you can't have duplicate equivalent entries in a set.
To maintain multiple words, you should modify your comparator so that it also performs, say, a lexicographic comparison if the lengths are the same.
Your comparator only compares by length, which means that equally sized but different strings are treated as equivalent by std::set. (std::set treats two elements as equivalent if neither a < b nor b < a is true, with < being your custom comparator function.)
That means your comparator should also consider the string contents to avoid this situation. The keyword here is lexicographic comparison, meaning you take multiple comparison criteria in account. The first criterion would be your string length, and the second would be the string itself. An easy way to write lexicographic comparison is to make use of std::tuple which provides a comparison operator performing lexicographic comparison on the components by overloading the operator<.
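To make your "reverse" ordering of length, which you wrote with operator>, compatible with the usual operator<, simply take the negative size of the strings, i.e. first rewrite a.size() > b.size() as -a.size() < -b.size(), then compose it with the string itself into tuples, and finally compare the tuples with <. Because size() returns an unsigned type, cast it to a signed type before negating; otherwise an empty string (size 0) would sort before all longer strings:
class Longer {
public:
    bool operator() (const string& a, const string& b) const
    {
        // Needs <tuple> and <cstddef>. The cast makes the negation signed.
        return std::make_tuple(-static_cast<std::ptrdiff_t>(a.size()), a)
             < std::make_tuple(-static_cast<std::ptrdiff_t>(b.size()), b);
        // first criterion: negated length; second criterion: the string itself
    }
};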
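If you would rather avoid the negation and the tuple copies, the same ordering can be written out by hand. This is a sketch with hypothetical names (LongerThenLex, make_words), not the only way to do it:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Longer strings first; equal-length strings in ascending lexicographic order.
struct LongerThenLex {
    bool operator()(const std::string& a, const std::string& b) const {
        if (a.size() != b.size()) return a.size() > b.size();
        return a < b;
    }
};

// Builds a small set; "box" is given twice and is stored only once.
inline std::set<std::string, LongerThenLex> make_words()
{
    return {"box", "on", "cat", "logic", "a", "box"};
}
```

With this comparator "box" and "cat" are both kept, because they are no longer equivalent, while the true duplicate "box" is still rejected.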

Idiomatic C++ for finding a range of equal length strings, given a vector of strings (ordered by length)

Given a std::vector< std::string > that is ordered by string length, how can I find a range of equal-length strings?
I am looking for an idiomatic solution in C++.
I have found this solution:
// any idea for a better name? (English is not my mother tongue)
bool less_length( const std::string& lhs, const std::string& rhs )
{
    return lhs.length() < rhs.length();
}
std::vector< std::string > words;
words.push_back("ape");
words.push_back("cat");
words.push_back("dog");
words.push_back("camel");
size_t length = 3;
// this will give a range from "ape" to "dog" (included):
std::equal_range( words.begin(), words.end(), std::string( length, 'a' ), less_length );
Is there a standard way of doing this (beautifully)?
I expect that you could write a comparator as follows:
struct LengthComparator {
    bool operator()(const std::string &lhs, std::string::size_type rhs) {
        return lhs.size() < rhs;
    }
    bool operator()(std::string::size_type lhs, const std::string &rhs) {
        return lhs < rhs.size();
    }
    bool operator()(const std::string &lhs, const std::string &rhs) {
        return lhs.size() < rhs.size();
    }
};
Then use it:
std::equal_range(words.begin(), words.end(), length, LengthComparator());
I expect the third overload of operator() is never used, because the information it provides is redundant. The range has to be pre-sorted, so there's no point the algorithm comparing two items from the range, it should be comparing items from the range against the target you supply. But the standard doesn't guarantee that. [Edit: and defining all three means you can use the same comparator class to put the vector in order in the first place, which might be convenient].
This works for me (gcc 4.3.4), and while I think this will work on your implementation too, I'm less sure that it is actually valid. It implements the comparisons that the description of equal_range says will be true of the result, and 25.3.3/1 doesn't require that the template parameter T must be exactly the type of the objects referred to by the iterators. But there might be some text I've missed which adds more restrictions, so I'd do more standards-trawling before using it in anything important.
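If the validity of a mixed-type comparator with std::equal_range is a concern, one alternative is to call std::lower_bound and std::upper_bound separately, since each of them calls the comparator with a fixed argument order. The sketch below uses hypothetical helper names (equal_length_range, count_with_length, sample_words) and assumes the vector is already sorted by non-decreasing length:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using WordIt = std::vector<std::string>::const_iterator;

// Hypothetical helper: returns the range of words whose length equals `length`.
inline std::pair<WordIt, WordIt>
equal_length_range(const std::vector<std::string>& words, std::size_t length)
{
    // std::lower_bound calls comp(element, value).
    const WordIt first = std::lower_bound(
        words.begin(), words.end(), length,
        [](const std::string& w, std::size_t len) { return w.size() < len; });
    // std::upper_bound calls comp(value, element).
    const WordIt last = std::upper_bound(
        first, words.end(), length,
        [](std::size_t len, const std::string& w) { return len < w.size(); });
    return {first, last};
}

// Convenience wrapper: how many words have exactly this length.
inline std::size_t count_with_length(const std::vector<std::string>& words,
                                     std::size_t length)
{
    const auto range = equal_length_range(words, length);
    return static_cast<std::size_t>(range.second - range.first);
}

// The sample data from the question, already ordered by length.
inline std::vector<std::string> sample_words()
{
    return {"ape", "cat", "dog", "camel"};
}
```

This still performs two binary searches, and no dummy string of the target length has to be constructed.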
Your way is definitely not unidiomatic, but having to construct a dummy string with the target length does not look very elegant and it isn't very readable either.
I'd perhaps write my own helper function (i.e. string_length_range), encapsulating a plain, simple loop through the string list. There is no need to use std:: tools for everything.
std::equal_range does a binary search. Which means the words vector must be sorted, which in this case means that it must be non-decreasing in length.
I think your solution is a good one, definitely better than writing your own implementation of binary search which is notoriously error prone and hard to prove correct.
If doing a binary search was not your intent, then I agree with Alexander. Just a simple for loop through the words is the cleanest.