Check whether equal string literals are stored at the same address - c++

I am developing a (C++) library that uses unordered containers. These require a hasher (usually a specialization of the template structure std::hash) for the types of the elements they store. In my case, those elements are classes that encapsulate string literals, similar to conststr of the example at the bottom of this page. The STL offers a specialization for constant char pointers, which, however, only hashes the pointer value, as explained here, in the 'Notes' section:
There is no specialization for C strings. std::hash<const char*>
produces a hash of the value of the pointer (the memory address), it
does not examine the contents of any character array.
Although this is very fast (or so I think), it is not guaranteed by the C++ standard whether several equal string literals are stored at the same address, as explained in this question. If they aren't, the first condition of hashers wouldn't be met:
For two parameters k1 and k2 that are equal, std::hash<Key>()(k1) ==
std::hash<Key>()(k2)
I would like to selectively compute the hash using the provided specialization if the aforementioned guarantee is given, or using some other algorithm otherwise. Although falling back to asking those who include my headers or build my library to define a particular macro is feasible, an implementation-defined one would be preferable.
Is there any macro, in any C++ implementation, but mainly g++ and clang, whose definition guarantees that several equal string literals are stored at the same address?
An example:
#ifdef __GXX_SAME_STRING_LITERALS_SAME_ADDRESS__
// Pointers, not arrays: two distinct array objects never share an address.
const char* str1 = "abc";
const char* str2 = "abc";
assert( str1 == str2 );
#endif
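The "some other algorithm" fallback mentioned above could, for instance, hash the characters instead of the address. The following is only a minimal sketch under assumptions: literal_ref is a hypothetical stand-in for the conststr-like wrapper class, literal_content_hash and equal_contents_hash_equal are made-up names, and C++17 is assumed for std::string_view.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string_view>

// Hypothetical wrapper, standing in for the question's conststr-like class.
struct literal_ref {
    const char* ptr;
    std::size_t len;
};

// Equality by content, so it stays consistent with the hasher below.
inline bool operator==(const literal_ref& a, const literal_ref& b) {
    return std::string_view(a.ptr, a.len) == std::string_view(b.ptr, b.len);
}

// Hashes the characters rather than the address.
struct literal_content_hash {
    std::size_t operator()(const literal_ref& k) const noexcept {
        return std::hash<std::string_view>{}(std::string_view(k.ptr, k.len));
    }
};

// Two distinct buffers with equal contents must hash equally.
inline bool equal_contents_hash_equal() {
    static const char a[] = "abc";
    static const char b[] = "abc";
    const char* pa = a;
    const char* pb = b;
    const literal_ref ka{pa, 3};
    const literal_ref kb{pb, 3};
    return pa != pb && ka == kb
        && literal_content_hash{}(ka) == literal_content_hash{}(kb);
}
```

This trades the speed of pointer hashing for a hasher that satisfies the k1 == k2 requirement regardless of whether the implementation merges literals.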

Is there any macro, in any C++ implementation, but mainly g++ and clang, whose definition guarantees that several equal string literals are stored at the same address?
gcc has the -fmerge-constants option (this is not a guarantee) :
Attempt to merge identical constants (string constants and floating-point constants) across compilation units.
This option is the default for optimized compilation if the assembler and linker support it. Use -fno-merge-constants to inhibit this behavior.
Enabled at levels -O, -O2, -O3, -Os.
Visual Studio has String Pooling (/GF option : "Eliminate Duplicate Strings")
String pooling allows what were intended as multiple pointers to multiple buffers to be multiple pointers to a single buffer. In the following code, s and t are initialized with the same string. String pooling causes them to point to the same memory:
char *s = "This is a character buffer";
char *t = "This is a character buffer";
Note: although MSDN uses char* string literals, const char* should be used.
clang apparently also has the -fmerge-constants option, but I can't find much about it, except in the --help section, so I'm not sure if it really is the equivalent of the gcc's one :
Disallow merging of constants
Anyway, how string literals are stored is implementation dependent (many do store them in the read-only portion of the program).
Rather than building your library on possible implementation-dependent hacks, I can only suggest using std::string instead of C-style strings: they will behave exactly as you expect.
You can construct your std::string in-place in your containers with the emplace() methods:
std::unordered_set<std::string> my_set;
my_set.emplace("Hello");

Although C++ does not seem to allow for any way that works with string literals, there is an ugly but somewhat workable way around the problem if you don't mind rewriting your string literals as character sequences.
template <typename T, T...values>
struct static_array {
static constexpr T array[sizeof...(values)] { values... };
};
template <typename T, T...values>
constexpr T static_array<T, values...>::array[];
template <char...values>
using str = static_array<char, values..., '\0'>;
int main() {
return str<'a','b','c'>::array != str<'a','b','c'>::array;
}
This is required to return zero. The compiler has to ensure that even if multiple translation units instantiate str<'a','b','c'>, those definitions get merged, and you only end up with a single array.
You would need to make sure you don't mix this with string literals, though. Any string literal is guaranteed not to compare equal to any of the template instantiations' arrays.
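As a rough illustration of why this matters for the original hashing question, the sketch below stores the template arrays in a std::unordered_set<const char*>, which hashes and compares by address. The helper names same_address and count_distinct are made up for this example, and the out-of-class definition is guarded because C++17 makes static constexpr data members implicitly inline.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

template <typename T, T... values>
struct static_array {
    static constexpr T array[sizeof...(values)] { values... };
};

#if __cplusplus < 201703L
// Needed before C++17; redundant (and deprecated) from C++17 onwards.
template <typename T, T... values>
constexpr T static_array<T, values...>::array[];
#endif

template <char... values>
using str = static_array<char, values..., '\0'>;

// The same instantiation always yields the same array object.
inline bool same_address() {
    const char* first = str<'a', 'b', 'c'>::array;
    const char* second = str<'a', 'b', 'c'>::array;
    return first == second;
}

// Pointer-based hashing is safe here: equal "strings" share one address.
inline std::size_t count_distinct() {
    std::unordered_set<const char*> seen;
    seen.insert(str<'a', 'b', 'c'>::array);
    seen.insert(str<'a', 'b', 'c'>::array);  // duplicate, same address
    seen.insert(str<'a', 'b', 'd'>::array);  // different instantiation
    return seen.size();
}
```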

The tacklelib C++11 library has a macro, backed by the tmpl_string class, that holds a string literal as a template class instance. The tmpl_string contains a static string with the same content, which guarantees the same address for the same template class instance.
https://sourceforge.net/p/tacklelib/tacklelib/HEAD/tree/trunk/include/tacklelib/tackle/tmpl_string.hpp
Tests:
https://sourceforge.net/p/tacklelib/tacklelib/HEAD/tree/trunk/src/tests/unit/test_tmpl_string.cpp
Example:
const auto s = TACKLE_TMPL_STRING(0, "my literal string");
I've used it in another macro to conveniently and consistently extract a literal string begin/end:
#include <tacklelib/tackle/tmpl_string.hpp>
#include <tacklelib/utility/string_identity.hpp>
//...
std::vector<char> xml_arr;
xml_arr.insert(xml_arr.end(), UTILITY_LITERAL_STRING_WITH_BEGINEND_TUPLE("<?xml version='1.0' encoding='UTF-8'?>\n"));
https://sourceforge.net/p/tacklelib/tacklelib/HEAD/tree/trunk/include/tacklelib/utility/string_identity.hpp

Related

Are two std::string_views referring to equal-comparing string literals always also equal?

I have an unordered_map which is supposed to mimic a filter, taking key and value as std::string_view respectively. Now say I want to compare two filters that have the same key-value-pairs: Will they always compare equal?
My thought is the following: The compiler tries its best to merge const char*'s with the same byte information into one place in the binary, therefore within a specific translation unit, the string literal addresses will always match. Later I'm passing these addresses into the constructor of std::string_view. Naturally, as std::string_view doesn't implement the comparison operator==(), the compiler will byte-compare the classes and only when address and length match exactly, the std::string_views compare equal.
However: What happens if I instantiate a filter outside of this translation unit with exactly the same contents as the first filter and link the files together later? Will the compiler be able to see beyond the TU boundaries and merge the string literal locations as well? Or will the equal comparison fail as the underlying string views will have different addresses for their respective string literals?
Demo
#include <unordered_map>
#include <string_view>
#include <cstdio>
using filter_t = std::unordered_map<std::string_view, std::string_view>;
int main()
{
filter_t myfilter = {{ "key1", "value"}, {"key2", "value2" }};
filter_t my_second_filter = {{ "key1", "value"}, {"key2", "value2" }};
if (my_second_filter == myfilter) {
printf("filters are the same!\n");
}
}
Naturally, as std::string_view doesn't implement the comparison operator==(), the compiler will byte-compare the classes
That is never the case. If no operator== overload (or since C++20 a rewritten candidate overload of e.g. operator<=>) is available for a class type, then it is simply impossible to compare the type with ==.
Even for non-class types the built-in == never performs a bitwise/bytewise comparison of the object representation (i.e. the values of the bytes of storage occupied by the object). The built-in == always performs a comparison of values held by the objects.
The only way to get bytewise comparison of the object representation is to explicitly have an operator== (or overload to which == can be rewritten) be defined to perform this comparison (e.g. by memcmp). Since that obviously wouldn't make sense for std::string_view (or most types), the standard library does not define std::string_view's operator== like that. It defines operator== (or operator<=> since C++20) instead to properly perform a comparison of the string it refers to, as a value, not as identity of a string literal or character array, see https://en.cppreference.com/w/cpp/string/basic_string_view/operator_cmp.
The equality comparison matches by content and not by address, so your equality check will still work, but you might end up with two copies of your string, depending on your compiler's optimizations.
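To make the content-versus-address point concrete, here is a small sketch. It uses separate character arrays rather than literals so that the two filters are guaranteed to refer to different storage, which models literals that were not merged across translation units. The function name is made up for illustration.

```cpp
#include <cassert>
#include <string_view>
#include <unordered_map>

using filter_t = std::unordered_map<std::string_view, std::string_view>;

inline bool filters_equal_despite_distinct_storage() {
    // Distinct array objects: equal contents, guaranteed different addresses.
    static const char key_a[] = "key1";
    static const char val_a[] = "value";
    static const char key_b[] = "key1";
    static const char val_b[] = "value";

    const filter_t first = {{key_a, val_a}};
    const filter_t second = {{key_b, val_b}};

    const char* pa = key_a;
    const char* pb = key_b;
    const bool different_storage = (pa != pb);

    // std::string_view hashes and compares the characters it refers to,
    // so the maps compare equal even though the addresses differ.
    return different_storage && first == second;
}
```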

C++ constexpr std::array of string literals

I've been happily using the following style of constant string literals in my code for a while, without really understanding how it works:
constexpr std::array myStrings = { "one", "two", "three" };
This may seem trivial, but I'm hazy on the details of what is going on under the hood. From my understanding, class template argument deduction (CTAD) is used to construct an array of the appropriate size and element type. My questions would be:
What is the element type of the std::array in this case, or is this implementation specific? Looking at the debugger (I'm using Microsoft C++), the elements are just pointers to non-contiguous locations.
Is it safe to declare constexpr arrays of string literals in this way?
I could do this instead, but it's not as tidy:
const std::array<std::string, 3> myOtherStrings = { "one", "two", "three" };
Yes, this is CTAD deducing the template arguments for you. (since C++17)
std::array has a deduction guide which enables CTAD with this form of initializer.
It will deduce the type of myStrings to
const std::array<const char*, 3>
The const char* is the result of usual array-to-pointer decay being applied to the elements of the initializer list (which are arrays of const chars).
const in front is a consequence of constexpr.
Each element of the array will point to the corresponding string literal.
constexpr is safe and you can use the array elements as you would individual string literals via const char* pointer. In particular trying to modify these literals or the array via const_cast will have undefined behavior though.
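The deduced type can be checked at compile time. A minimal sketch, assuming C++17 (the helper second_element is only there to show ordinary use of an element):

```cpp
#include <array>
#include <cassert>
#include <string_view>
#include <type_traits>

constexpr std::array myStrings = { "one", "two", "three" };

// CTAD deduces const char* elements; constexpr adds the outer const.
static_assert(std::is_same_v<decltype(myStrings),
                             const std::array<const char*, 3>>);
static_assert(myStrings.size() == 3);

// Each element can be used like any other pointer to a string literal.
inline std::string_view second_element() {
    return myStrings[1];
}
```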
const std::array<std::string, 3> also works, but will not be usable in constant expressions. constexpr is not allowed on this because of std::string.
CTAD can also be used to deduce this type though with the help of string literal operators:
#include<string>
using namespace std::string_literals;
//...
const std::array myOtherStrings = { "one"s, "two"s, "three"s };
or since C++20:
const auto myOtherStrings = std::to_array<std::string>({ "one", "two", "three" });
As user17732522 already noted, the type deduction for your original code produces a const std::array<const char*, 3>. This works, but it's not a C++ std::string, so every use needs to scan for the NUL terminator, and they can't contain embedded NULs. I just wanted to emphasize the suggestion from my comment to use std::string_view.
Since std::string inherently relies on run-time memory allocation, you can't use it unless the entirety of the associated code is also constexpr (so no actual strings exist at all at run-time, the compiler computes the final result at compile-time), and that's unlikely to help you here if the goal is to avoid unnecessary runtime work for something that is partially known at compile time (especially if the array gets recreated on each function call; it's not global or static, so it's done many times, not just initialized once before use).
That said, if you can rely on C++17, you can split the difference with std::string_view. It's got a very concise literal form (add sv as a suffix to any string literal), and it's fully constexpr, so by doing:
// Top of file
#include <array>
#include <string_view>
// Use one of your choice:
using namespace std::literals; // Enables all literals
using namespace std::string_view_literals; // Enables sv suffix only
using namespace std::literals::string_view_literals; // Enables sv suffix only
// Point of use
constexpr std::array myStrings = { "one"sv, "two"sv, "three"sv };
you get something that involves no runtime work, has most of the benefits of std::string (knows its own length, can contain embedded NULs, accepted by most string-oriented APIs), and therefore operates more efficiently than a C-style string for the three common ways a function accepts string data:
1. For modern APIs that need to read a string-like thing, they accept std::string_view by value and the overhead is just copying the pointer and length to the function
2. For older APIs that accept const std::string&, it constructs a temporary std::string when you call it, but it can use the constructor that extracts the length from the std::string_view so it doesn't need to prewalk a C-style string with strlen to figure out how much to allocate.
3. For any API that needs a std::string (because it will modify/store its own copy), they're receiving string by value, and you get the same benefit as in #2 (it must be built, but it's built more efficiently).
The only case where you do worse by using std::string_views than using std::string is case #2 (where if the std::array contained std::strings, no copies would occur), and you only lose there if you make several such calls; in that scenario, you'd just bite the bullet and use const std::array myStrings = { "one"s, "two"s, "three"s };, paying the minor runtime cost to build real strings in exchange for avoiding copies when passing to old-style APIs taking const std::string&.
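The "knows its own length" and "embedded NULs" properties can be verified entirely at compile time. A short sketch, assuming C++17; the names views, with_nul and total_length are invented for this example:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

using namespace std::string_view_literals;

constexpr std::array views = { "one"sv, "two"sv, "three"sv };

// The length is stored in the view, so no strlen-style scan is needed.
static_assert(views[2].size() == 5);

// The sv suffix keeps characters after an embedded NUL.
constexpr auto with_nul = "a\0b"sv;
static_assert(with_nul.size() == 3);

// Everything here is usable in constant expressions.
constexpr std::size_t total_length() {
    std::size_t n = 0;
    for (std::string_view v : views) n += v.size();
    return n;
}
static_assert(total_length() == 11);
```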

Should I compare a std::string to "string" or "string"s?

Consider this code snippet:
bool foo(const std::string& s) {
return s == "hello"; // comparing against a const char* literal
}
bool bar(const std::string& s) {
return s == "hello"s; // comparing against a std::string literal
}
At first sight, it looks like comparing against a const char* needs fewer assembly instructions [1], as using a string literal will lead to an in-place construction of the std::string.
(EDIT: As pointed out in the answers, I forgot about the fact that effectively s.compare(const char*) will be called in foo(), so of course no in-place construction takes place in this case. Therefore striking out some lines below.)
However, looking at the operator==(const char*, const std::string&) reference:
All comparisons are done via the compare() member function.
From my understanding, this means that we will need to construct a std::string anyway in order to perform the comparison, so I suspect the overhead will be the same in the end (although hidden by the call to operator==).
Which of the comparisons should I prefer?
Does one version have advantages over the other (may be in specific situations)?
[1] I'm aware that fewer assembly instructions doesn't necessarily mean faster code, but I don't want to go into micro benchmarking here.
Neither.
If you want to be clever, compare to "string"sv, which returns a std::string_view.
While comparing against a literal like "string" does not result in any allocation overhead, it's treated as a null-terminated string, with all the concomitant disadvantages: No tolerance for embedded nulls, and users must heed the null terminator.
"string"s does an allocation, barring small-string-optimisation or allocation elision. Also, the operator gets passed the length of the literal, no need to count, and it allows for embedded nulls.
And finally using "string"sv combines the advantages of both other approaches, avoiding their individual disadvantages. Also, a std::string_view is a far simpler beast than a std::string, especially if the latter uses SSO as all modern ones do.
At least since C++14 (which generally allowed eliding allocations), compilers could in theory optimise all options to the last one, given sufficient information (generally available for the example) and effort, under the as-if rule. We aren't there yet though.
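The embedded-null difference between the three spellings can be seen in a small sketch (C++17 assumed; the function names are made up for illustration). A plain literal is measured up to its first null character, while the s and sv suffixes receive the full length from the compiler.

```cpp
#include <cassert>
#include <string>
#include <string_view>

using namespace std::string_literals;
using namespace std::string_view_literals;

// A three-character string containing an embedded null.
inline std::string make_sample() {
    return std::string("a\0b", 3);
}

// Compared as a C string: only "a" is seen on the right-hand side.
inline bool equals_c_literal(const std::string& s) { return s == "a\0b"; }

// Builds a temporary std::string of length 3 (may allocate, barring SSO).
inline bool equals_s_literal(const std::string& s) { return s == "a\0b"s; }

// A std::string_view of length 3: no allocation, no length scan.
inline bool equals_sv_literal(const std::string& s) { return s == "a\0b"sv; }
```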
No, compare() does not require construction of a std::string for const char* operands.
You're using overload #4 here.
The comparison to string literal is the "free" version you're looking for. Instantiating a std::string here is completely unnecessary.
From my understanding, this means that we will need to construct a std::string anyway in order to perform the comparison, so I suspect the overhead will be the same in the end (although hidden by the call to operator==).
This is where that reasoning goes wrong. std::string::compare does not need its operand to be a std::string; it can work directly on a C-style null-terminated string. According to one of the overloads:
int compare( const CharT* s ) const; // (4)
4) Compares this string to the null-terminated character sequence beginning at the character pointed to by s with length Traits::length(s).
Although whether to allocate or not is an implementation detail, it does not seem reasonable that a sequence comparison would do so.

Add char to a std::string

I would like to append a char to a std::string in C++. I can think of at least three ways to do it:
std::string s = "hello";
char c = '!';
s += c;
s.append(1, c); // s.append(c) would not compile: append has no single-char overload
s.push_back(c);
which one is considered best practice ?
is there a performance difference (C++11) ?
Also, if I want to add two strings, same question with:
std::string s1 = "hello ", s2 = "world";
s1.append(s2);
s1 += s2;
The question is a matter of the compiler and personal taste and/or company coding style requirements. The overloaded operator probably resolves to your 2nd choice.
I would go with the first one myself.
which one is considered best practice?
If you've written a template accepting a variety of container types including string, the containers' APIs may only have a subset of those methods in common - in which case it's good to use that.
For example:
template <typename Container>
void f(Container& c, int n)
{
if (n % 7) c.push_back(n);
}
This will work with anything that supports push_back - namely std::vector, std::deque, std::list, std::string. This contrasts with += and append which - of the Standard Library containers - are only supported by std::string.
Otherwise, do whatever you think's clearest to read - likely +=.
is there a performance difference (C++11)?
No reason there would be in optimised code, though implementations aren't required by the Standard to ensure that.
(Talking about performance of unoptimised code is not often useful, but FWIW in unoptimised code it's possible one or more of these may be written in terms of another (i.e. it may call the other function), and if inlining isn't done then the less direct code may be ever-so-slightly slower: which that is may vary across implementations. It would be misguided to worry about this.)
Also, if I want to add two strings, same question [...]
For two std::strings it's a little more complicated, as something like this...
template <typename Container, typename Value>
void append_twice(Container& c, const Value& v)
{
c.push_back(v);
c.push_back(v);
}
...could still be used for vector, deque, list, and string, but it does something semantically different for string by implicitly extracting and appending char elements, so you might want to avoid supporting all the types if correct behaviour of later code in your algorithm depends on that difference between extracting the chars and extracting the same string elements from the container that were appended earlier.
Again there's no performance difference.
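A compact sketch of the points above: the three ways of appending a char give the same result, and append_twice appends whole elements to a vector but single chars to a string. The wrapper function names are invented for this example, and the char form of append is spelled append(1, c) because std::string has no append overload taking a lone char.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

inline std::string with_plus_equals(std::string s, char c) { s += c; return s; }
inline std::string with_append(std::string s, char c) { s.append(1, c); return s; }
inline std::string with_push_back(std::string s, char c) { s.push_back(c); return s; }

template <typename Container, typename Value>
void append_twice(Container& c, const Value& v)
{
    c.push_back(v);
    c.push_back(v);
}

// For a vector of strings, each push_back adds one whole string element.
inline std::size_t vector_size_after_append_twice() {
    std::vector<std::string> v;
    append_twice(v, std::string("ab"));
    return v.size();
}

// For a string, each push_back adds one char.
inline std::string string_after_append_twice() {
    std::string s = "hi";
    append_twice(s, '!');
    return s;
}
```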

Simplest, safest way of holding a bunch of const char* in a set?

I want to hold a bunch of const char pointers in a std::set container [1]. The std::set template requires a comparator functor, and the standard C++ library offers std::less, but its implementation is based on comparing the two keys directly, which is not standard for pointers.
I know I can define my own functor and implement the operator() by casting the pointers to integers and comparing them, but is there a cleaner, 'standard' way of doing it?
Please do not suggest creating std::strings - it is a waste of time and space. The strings are static, so they can be compared for (in)equality based on their address.
1: The pointers are to static strings, so there is no problem with their lifetimes - they won't go away.
If you don't want to wrap them in std::strings, you can define a functor class:
struct ConstCharStarComparator
{
bool operator()(const char *s1, const char *s2) const
{
return strcmp(s1, s2) < 0;
}
};
typedef std::set<const char *, ConstCharStarComparator> stringset_t;
stringset_t myStringSet;
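With this comparator the set orders and deduplicates by content. A small usage sketch, with the includes the functor needs; the two helper functions are hypothetical and use separate arrays so that equal contents are guaranteed to live at different addresses:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <set>

struct ConstCharStarComparator
{
    bool operator()(const char *s1, const char *s2) const
    {
        return std::strcmp(s1, s2) < 0;
    }
};
typedef std::set<const char *, ConstCharStarComparator> stringset_t;

// Two buffers with the same text count as one element.
inline std::size_t size_after_duplicate_content() {
    static const char first[] = "AAA";
    static const char second[] = "AAA";
    stringset_t s;
    s.insert(first);
    s.insert(second);  // rejected: equivalent to first under strcmp
    s.insert("BBB");
    return s.size();
}

// Lookup succeeds for any buffer holding the same characters.
inline bool finds_by_content() {
    stringset_t s;
    s.insert("apple");
    const char probe[] = "apple";
    return s.find(probe) != s.end();
}
```

Note that the set still stores only the pointers, so the pointed-to strings must outlive it.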
Just go ahead and use the default ordering which is less<>. The Standard guarantees that less will work even for pointers to different objects:
"For templates greater, less, greater_equal, and less_equal, the specializations for any
pointer type yield a total order, even if the built-in operators <, >, <=, >= do not."
The guarantee is there exactly for things like your set<const char*>.
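What the default ordering gives you is a set keyed on addresses. The sketch below (helper names are made up) shows the consequence: the same pointer is found again, but a different buffer with identical text is treated as a different key.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// std::less<const char*> provides a total order over addresses.
inline std::size_t address_based_size() {
    static const char first[] = "AAA";
    static const char second[] = "AAA";  // same text, different object
    std::set<const char*> s;
    s.insert(first);
    s.insert(second);
    s.insert(first);  // same address as before, so not inserted again
    return s.size();
}

// Lookup is by address, not by content.
inline bool lookup_is_by_address() {
    static const char stored[] = "AAA";
    static const char other[] = "AAA";
    std::set<const char*> s;
    s.insert(stored);
    return s.count(stored) == 1 && s.count(other) == 0;
}
```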
The "optimized way"
If we ignore the "premature optimization is the root of all evil", the standard way is to add a comparator, which is easy to write:
struct MyCharComparator
{
bool operator()(const char * A, const char * B) const
{
return (strcmp(A, B) < 0) ;
}
} ;
To use with a:
std::set<const char *, MyCharComparator>
The standard way
Use a:
std::set<std::string>
It will work even if you put a static const char * inside (because std::string, unlike const char *, is comparable by its contents).
Of course, if you need to extract the data, you'll have to go through std::string::c_str(). On the other hand, as it is a set, I guess you only want to know whether "AAA" is in the set, not extract the value "AAA".
Note: I did read about "Please do not suggest creating std::strings", but then, you asked the "standard" way...
The "never do it" way
I noted the following comment after my answer:
Please do not suggest creating std::strings - it is a waste of time and space. The strings are static, so they can be compared for (in)equality based on their address.
This smells of C (use of the deprecated "static" keyword, probable premature optimization used for std::string bashing, and string comparison through their addresses).
Anyway, you don't want to compare your strings through their address. Because I guess the last thing you want is to have a set containing:
{ "AAA", "AAA", "AAA" }
Of course, if you only use the same global variables to contain the string, this is another story.
In this case, I suggest:
std::set<const char *>
Of course, it won't work if you compare strings with the same contents but different variables/addresses.
And, of course, it won't work with static const char * strings if those strings are defined in a header.
But this is another story.
Depending on how big a "bunch" is, I would be inclined to store a corresponding bunch of std::strings in the set. That way you won't have to write any extra glue code.
Must the set contain const char*?
What immediately springs to mind is storing the strings in a std::string instead, and putting those into the std::set. This will allow comparisons without a problem, and you can always get the raw const char* with a simple function call:
const char* data = theString.c_str();
Either use a comparator, or use a wrapper type to be contained in the set. (Note: std::string is a wrapper, too....)
const char* a("a");
const char* b("b");
struct CWrap {
const char* p;
bool operator<(const CWrap& other) const{
return strcmp( p, other.p ) < 0;
}
CWrap( const char* p ): p(p){}
};
std::set<CWrap> myset;
myset.insert(a);
myset.insert(b);
Others have already posted plenty of solutions showing how to do lexical comparisons with const char*, so I won't bother.
Please do not suggest creating std::strings - it is a waste of time and space.
If std::string is a waste of time and space, then std::set might be a waste of time and space as well. Each element in a std::set is allocated separately from the free store. Depending on how your program uses sets, this may hurt performance more than std::set's O(log n) lookups help performance. You may get better results using another data structure, such as a sorted std::vector, or a statically allocated array that is sorted at compile time, depending on the intended lifetime of the set.
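One way the sorted std::vector alternative might look is sketched below. This is an illustration under assumptions, not a benchmark-backed recommendation: the keys and the helper names c_str_less, make_sorted_keys and contains are invented, and the vector is sorted once by content and then searched with std::binary_search.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <vector>

// Orders C strings by content.
inline bool c_str_less(const char* a, const char* b) {
    return std::strcmp(a, b) < 0;
}

// One contiguous allocation instead of one node per element.
inline std::vector<const char*> make_sorted_keys() {
    std::vector<const char*> keys = {"pear", "apple", "fig"};
    std::sort(keys.begin(), keys.end(), c_str_less);
    return keys;
}

// O(log n) membership test on the sorted vector.
inline bool contains(const std::vector<const char*>& sorted, const char* key) {
    return std::binary_search(sorted.begin(), sorted.end(), key, c_str_less);
}
```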
the standard C++ library offers std::less, but its implementation is based on comparing the two keys directly, which is not standard for pointers.
The strings are static, so they can be compared for (in)equality based on their address.
That depends on what the pointers point to. If all of the keys are allocated from the same array, then using operator< to compare pointers is not undefined behavior.
Example of an array containing separate static strings:
static const char keys[] = "apple\0banana\0cantaloupe";
If you create a std::set<const char*> and fill it with pointers that point into that array, their ordering will be well-defined.
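A sketch of that idea, with made-up helper names: the loop walks the array from one embedded terminator to the next and stores a pointer to the start of each string. Because all pointers point into the same array, their relative order is simply their position in it.

```cpp
#include <cassert>
#include <cstring>
#include <set>
#include <string>
#include <vector>

static const char keys[] = "apple\0banana\0cantaloupe";

// Collect a pointer to the start of each embedded string.
inline std::set<const char*> pointers_into_keys() {
    std::set<const char*> result;
    const char* const end = keys + sizeof(keys);  // one past the final '\0'
    for (const char* p = keys; p < end; p += std::strlen(p) + 1) {
        result.insert(p);
    }
    return result;
}

// Iteration follows address order, i.e. the order within the array.
inline std::vector<std::string> in_set_order() {
    std::vector<std::string> out;
    for (const char* p : pointers_into_keys()) {
        out.emplace_back(p);
    }
    return out;
}
```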
If, however, the strings are all separate string literals, comparing their addresses will most likely involve undefined behavior. Whether or not it works depends on your compiler/linker implementation, how you use it, and your expectations.
If your compiler/linker supports string pooling and has it enabled, duplicate string literals should have the same address, but are they guaranteed to in all cases? Is it safe to rely on linker optimizations for correct functionality?
If you only use the string literals in one translation unit, the set ordering may be based on the order that the strings are first used, but if you change another translation unit to use one of the same string literals, the set ordering may change.
I know I can define my own functor and implement the operator() by casting the pointers to integers and comparing them
Casting the pointers to uintptr_t would seem to have no benefit over using pointer comparisons. The result is the same either way: implementation-specific.
Presumably you don't want to use std::string because of performance reasons.
I'm running MSVC and gcc, and they both seem to not mind this:
bool foo = "blah" < "grar";
EDIT: However, the behaviour in this case is unspecified. See comments...
They also don't complain about std::set<const char*>.
If you're using a compiler that does complain, I would probably go ahead with your suggested functor that casts the pointers to ints.
Edit:
Hey, I got voted down... Despite being one of the few people here that most directly answered his question. I'm new to Stack Overflow, is there any way to defend yourself if this happens? That being said, I'll try to do so right here:
The question is not looking for std::string solutions. Every time you enter a std::string into the set, it will need to copy the entire string (until C++0x is standard, anyway). Also, every time you do a set look-up, it will need to do multiple string compares.
Storing the pointers in the set, however, incurs NO string copy (you're just copying the pointer around) and every comparison is a simple integer comparison on the addresses, not a string compare.
The question stated that storing the pointers to the strings was fine, I see no reason why we should all immediately assume that this statement was an error. If you know what you're doing, then there are considerable performance gains to using a const char* over either std::string or a custom comparison that calls strcmp. Yes, it's less safe, and more prone to error, but these are common trade-offs for performance, and since the question never stated the application, I think we should assume that he's already considered the pros and cons and decided in favor of performance.