Related
Consider this code snippet:
bool foo(const std::string& s) {
return s == "hello"; // comparing against a const char* literal
}
bool bar(const std::string& s) {
return s == "hello"s; // comparing against a std::string literal
}
At first sight, it looks like comparing against a const char* needs less assembly instructions1, as using a string literal will lead to an in-place construction of the std::string.
(EDIT: As pointed out in the answers, I forgot about the fact that effectively s.compare(const char*) will be called in foo(), so of course no in-place construction takes place in this case. Therefore striking out some lines below.)
However, looking at the operator==(const char*, const std::string&) reference:
All comparisons are done via the compare() member function.
From my understanding, this means that we will need to construct a std::string anyway in order to perform the comparison, so I suspect the overhead will be the same in the end (although hidden by the call to operator==).
Which of the comparisons should I prefer?
Does one version have advantages over the other (may be in specific situations)?
1 I'm aware that less assembly instructions doesn't neccessarily mean faster code, but I don't want to go into micro benchmarking here.
Neither.
If you want to be clever, compare to "string"sv, which returns a std::string_view.
While comparing against a literal like "string" does not result in any allocation-overhead, it's treated as a null terminated string, with all the concomittant disadvantages: No tolerance for embedded nulls, and users must heed the null terminator.
"string"s does an allocation, barring small-string-optimisation or allocation elision. Also, the operator gets passed the length of the literal, no need to count, and it allows for embedded nulls.
And finally using "string"sv combines the advantages of both other approaches, avoiding their individual disadvantages. Also, a std::string_view is a far simpler beast than a std::string, especially if the latter uses SSO as all modern ones do.
At least since C++14 (which generally allowed eliding allocations), compilers could in theory optimise all options to the last one, given sufficient information (generally available for the example) and effort, under the as-if rule. We aren't there yet though.
No, compare() does not require construction of a std::string for const char* operands.
You're using overload #4 here.
The comparison to string literal is the "free" version you're looking for. Instantiating a std::string here is completely unnecessary.
From my understanding, this means that we will need to construct a std::string anyway in order to perform the comparison, so I suspect the overhead will be the same in the end (although hidden by the call to operator==).
This is where that reasoning goes wrong. std::compare does not need to allocate its operand as a C-style null-terminated string to function. According to one of the overloads:
int compare( const CharT* s ) const; // (4)
4) Compares this string to the null-terminated character sequence beginning at the character pointed to by s with length Traits::length(s).
Although whether to allocate or not is an implementation detail, it does not seem reasonable that a sequence comparison would do so.
I have recently seen a colleague of mine using std::string as a buffer:
std::string receive_data(const Receiver& receiver) {
std::string buff;
int size = receiver.size();
if (size > 0) {
buff.resize(size);
const char* dst_ptr = buff.data();
const char* src_ptr = receiver.data();
memcpy((char*) dst_ptr, src_ptr, size);
}
return buff;
}
I guess this guy wants to take advantage of auto destruction of the returned string so he needs not worry about freeing of the allocated buffer.
This looks a bit strange to me since according to cplusplus.com the data() method returns a const char* pointing to a buffer internally managed by the string:
const char* data() const noexcept;
Memcpy-ing to a const char pointer? AFAIK this does no harm as long as we know what we do, but have I missed something? Is this dangerous?
Don't use std::string as a buffer.
It is bad practice to use std::string as a buffer, for several reasons (listed in no particular order):
std::string was not intended for use as a buffer; you would need to double-check the description of the class to make sure there are no "gotchas" which would prevent certain usage patterns (or make them trigger undefined behavior).
As a concrete example: Before C++17, you can't even write through the pointer you get with data() - it's const Tchar *; so your code would cause undefined behavior. (But &(str[0]), &(str.front()), or &(*(str.begin())) would work.)
Using std::strings for buffers is confusing to readers of your function's definition, who assume you would be using std::string for, well, strings. In other words, doing so breaks the Principle of Least Astonishment.
Worse yet, it's confusing for whoever might use your function - they too may think what you're returning is a string, i.e. valid human-readable text.
std::unique_ptr would be fine for your case, or even std::vector. In C++17, you can use std::byte for the element type, too. A more sophisticated option is a class with an SSO-like feature, e.g. Boost's small_vector (thank you, #gast128, for mentioning it).
(Minor point:) libstdc++ had to change its ABI for std::string to conform to the C++11 standard, so in some cases (which by now are rather unlikely), you might run into some linkage or runtime issues that you wouldn't with a different type for your buffer.
Also, your code may make two instead of one heap allocations (implementation dependent): Once upon string construction and another when resize()ing. But that in itself is not really a reason to avoid std::string, since you can avoid the double allocation using the construction in #Jarod42's answer.
You can completely avoid a manual memcpy by calling the appropriate constructor:
std::string receive_data(const Receiver& receiver) {
return {receiver.data(), receiver.size()};
}
That even handles \0 in a string.
BTW, unless content is actually text, I would prefer std::vector<std::byte> (or equivalent).
Memcpy-ing to a const char pointer? AFAIK this does no harm as long as we know what we do, but is this good behavior and why?
The current code may have undefined behavior, depending on the C++ version. To avoid undefined behavior in C++14 and below take the address of the first element. It yields a non-const pointer:
buff.resize(size);
memcpy(&buff[0], &receiver[0], size);
I have recently seen a colleague of mine using std::string as a buffer...
That was somewhat common in older code, especially circa C++03. There are several benefits and downsides to using a string like that. Depending on what you are doing with the code, std::vector can be a bit anemic, and you sometimes used a string instead and accepted the extra overhead of char_traits.
For example, std::string is usually a faster container than std::vector on append, and you can't return std::vector from a function. (Or you could not do so in practice in C++98 because C++98 required the vector to be constructed in the function and copied out). Additionally, std::string allowed you to search with a richer assortment of member functions, like find_first_of and find_first_not_of. That was convenient when searching though arrays of bytes.
I think what you really want/need is SGI's Rope class, but it never made it into the STL. It looks like GCC's libstdc++ may provide it.
There a lengthy discussion about this being legal in C++14 and below:
const char* dst_ptr = buff.data();
const char* src_ptr = receiver.data();
memcpy((char*) dst_ptr, src_ptr, size);
I know for certain it is not safe in GCC. I once did something like this in some self tests and it resulted in a segfault:
std::string buff("A");
...
char* ptr = (char*)buff.data();
size_t len = buff.size();
ptr[0] ^= 1; // tamper with byte
bool tampered = HMAC(key, ptr, len, mac);
GCC put the single byte 'A' in register AL. The high 3-bytes were garbage, so the 32-bit register was 0xXXXXXX41. When I dereferenced at ptr[0], GCC dereferenced a garbage address 0xXXXXXX41.
The two take-aways for me were, don't write half-ass self tests, and don't try to make data() a non-const pointer.
From C++17, data can return a non const char *.
Draft n4659 declares at [string.accessors]:
const charT* c_str() const noexcept;
const charT* data() const noexcept;
....
charT* data() noexcept;
The code is unnecessary, considering that
std::string receive_data(const Receiver& receiver) {
std::string buff;
int size = receiver.size();
if (size > 0) {
buff.assign(receiver.data(), size);
}
return buff;
}
will do exactly the same.
The big optimization opportunity I would investigate here is: Receiver appears to be some kind of container that supports .data() and .size(). If you can consume it, and pass it in as a rvalue reference Receiver&&, you might be able to use move semantics without making any copies at all! If it’s got an iterator interface, you could use those for range-based constructors or std::move() from <algorithm>.
In C++17 (as Serge Ballesta and others have mentioned), std::string::data() returns a pointer to non-const data. A std::string has been guaranteed to store all its data contiguously for years.
The code as written smells a bit, although it’s not really the programmer’s fault: those hacks were necessary at the time. Today, you should at least change the type of dst_ptr from const char* to char* and remove the cast in the first argument to memcpy(). You could also reserve() a number of bytes for the buffer and then use a STL function to move the data.
As others have mentioned, a std::vector or std::unique_ptr would be a more natural data structure to use here.
One downside is performance.
The .resize method will default-initialize all the new byte locations to 0.
That initialization is unnecessary if you're then going to overwrite the 0s with other data.
I do feel std::string is a legitimate contender for managing a "buffer"; whether or not it's the best choice depends on a few things...
Is your buffer content textual or binary in nature?
One major input to your decision should be whether the buffer content is textual in nature. It will be less potentially confusing to readers of your code if std::string is used for textual content.
char is not a good type for storing bytes. Keep in mind that the C++ Standard leaves it up to each implementation to decide whether char will be signed or unsigned, but for generic black-box handling of binary data (and sometimes even when passing characters to functions like std::toupper(int) that have undefined behaviour unless the argument is in range for unsigned char or is equal to EOF) you probably want unsigned data: why would you assume or imply that the first bit of each byte is a sign bit if it's opaque binary data?
Because of that, it's undeniably somewhat hackish to use std::string for "binary" data. You could use std::basic_string<std::byte>, but that's not what the question asks about, and you'd lose some inoperability benefits from using the ubiquitous std::string type.
Some potential benefits of using std::string
Firstly a few benefits:
it sports the RAII semantics we all know and love
most implementations feature short-string optimisation (SSO), which ensures that if the number of bytes is small enough to fit directly inside the string object, dynamic allocation/deallocation can be avoided (but there may be an extra branch each time the data is accessed)
this is perhaps more useful for passing around copies of data read or to be written, rather than for a buffer which should be pre-sized to accept a decent chunk of data if available (to improve throughput by handling more I/O at a time)
there's a wealth of std::string member functions, and non-member functions designed to work well with std::strings (including e.g. cout << my_string): if your client code would find them useful to parse/manipulate/process the buffer content, then you're off to a flying start
the API is very familiar to most C++ programmers
Mixed blessings
being a familiar, ubiquitous type, the code you interact with may have specialisations to for std::string that better suit your use for buffered data, or those specialisations might be worse: do evaluate that
Concerned
As Waxrat observed, what is lacking API wise is the ability to grow the buffer efficiently, as resize() writes NULs/'\0's into the characters added which is pointless if you're about to "receive" values into that memory. This isn't relevant in the OPs code where a copy of received data is being made, and the size is already known.
Discussion
Addressing einpoklum's concern:
std::string was not intended for use as a buffer; you would need to double-check the description of the class to make sure there are no "gotchas" which would prevent certain usage patterns (or make them trigger undefined behavior).
While it's true that std::string wasn't originally intended for this, the rest is mainly FUD. The Standard has made concessions to this kind of usage with C++17's non-const member function char* data(), and string has always supported embedded zero bytes. Most advanced programmers know what's safe to do.
Alternatives
static buffers (C char[N] array or std::array<char, N>) sized to some maximum message size, or ferrying slices of the data per call
a manually allocated buffer with std::unique_ptr to automate destruction: leaves you to do fiddly resizing, and track the allocated vs in-use sizes yourself; more error-prone overall
std::vector (possibly of std::byte for the element type; is widely understood to imply binary data, but the API is more restrictive and (for better or worse) it can't be expected to have anything equivalent to Short-String Optimisation.
Boost's small_vector: perhaps, if SSO was the only thing holding you back from std::vector, and you're happy using boost.
returning a functor that allows lazy access to the received data (providing you know it won't be deallocated or overwritten), deferring the choice of how it's stored by client code
use C++23's string::resize_and_overwrite
https://en.cppreference.com/w/cpp/string/basic_string/resize_and_overwrite
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1072r10.html
[[nodiscard]] static inline string formaterr (DWORD errcode) {
string strerr;
strerr.resize_and_overwrite(2048, [errcode](char* buf, size_t buflen) {
// https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-formatmessage
return FormatMessageA(
FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
nullptr,
errcode,
0,
buf,
static_cast<DWORD>(buflen),
nullptr
);
});
return strerr;
}
I am developing a (C++) library that uses unordered containers. These require a hasher (usually a specialization of the template structure std::hash) for the types of the elements they store. In my case, those elements are classes that encapsulate string literals, similar to conststr of the example at the bottom of this page. The STL offers an specialization for constant char pointers, which, however, only computes pointers, as explained here, in the 'Notes' section:
There is no specialization for C strings. std::hash<const char*>
produces a hash of the value of the pointer (the memory address), it
does not examine the contents of any character array.
Although this is very fast (or so I think), it is not guaranteed by the C++ standard whether several equal string literals are stored at the same address, as explained in this question. If they aren't, the first condition of hashers wouldn't be met:
For two parameters k1 and k2 that are equal, std::hash<Key>()(k1) ==
std::hash<Key>()(k2)
I would like to selectively compute the hash using the provided specialization, if the aforementioned guarantee is given, or some other algorithm otherwise. Although resorting back to asking those who include my headers or build my library to define a particular macro is feasible, an implementation defined one would be preferable.
Is there any macro, in any C++ implementation, but mainly g++ and clang, whose definition guarantees that several equal string literals are stored at the same address?
An example:
#ifdef __GXX_SAME_STRING_LITERALS_SAME_ADDRESS__
const char str1[] = "abc";
const char str2[] = "abc";
assert( str1 == str2 );
#endif
Is there any macro, in any C++ implementation, but mainly g++ and clang, whose definition guarantees that several equal string literals are stored at the same address?
gcc has the -fmerge-constants option (this is not a guarantee) :
Attempt to merge identical constants (string constants and floating-point constants) across compilation units.
This option is the default for optimized compilation if the assembler and linker support it. Use -fno-merge-constants to inhibit this behavior.
Enabled at levels -O, -O2, -O3, -Os.
Visual Studio has String Pooling (/GF option : "Eliminate Duplicate Strings")
String pooling allows what were intended as multiple pointers to multiple buffers to be multiple pointers to a single buffer. In the following code, s and t are initialized with the same string. String pooling causes them to point to the same memory:
char *s = "This is a character buffer";
char *t = "This is a character buffer";
Note: although MSDN uses char* strings literals, const char* should be used
clang apparently also has the -fmerge-constants option, but I can't find much about it, except in the --help section, so I'm not sure if it really is the equivalent of the gcc's one :
Disallow merging of constants
Anyway, how string literals are stored is implementation dependent (many do store them in the read-only portion of the program).
Rather than building your library on possible implementation-dependent hacks, I can only suggest the usage of std::string instead of C-style strings : they will behave exactly as you expect.
You can construct your std::string in-place in your containers with the emplace() methods :
std::unordered_set<std::string> my_set;
my_set.emplace("Hello");
Although C++ does not seem to allow for any way that works with string literals, there is an ugly but somewhat workable way around the problem if you don't mind rewriting your string literals as character sequences.
template <typename T, T...values>
struct static_array {
static constexpr T array[sizeof...(values)] { values... };
};
template <typename T, T...values>
constexpr T static_array<T, values...>::array[];
template <char...values>
using str = static_array<char, values..., '\0'>;
int main() {
return str<'a','b','c'>::array != str<'a','b','c'>::array;
}
This is required to return zero. The compiler has to ensure that even if multiple translation units instantiate str<'a','b','c'>, those definitions get merged, and you only end up with a single array.
You would need to make sure you don't mix this with string literals, though. Any string literal is guaranteed not to compare equal to any of the template instantiations' arrays.
The tacklelib C++11 library have a macro with the tmpl_string class to hold a literal string as a template class instance. The tmpl_string contains a static string with the same content which guarantees the same address for the same template class instance.
https://sourceforge.net/p/tacklelib/tacklelib/HEAD/tree/trunk/include/tacklelib/tackle/tmpl_string.hpp
Tests:
https://sourceforge.net/p/tacklelib/tacklelib/HEAD/tree/trunk/src/tests/unit/test_tmpl_string.cpp
Example:
const auto s = TACKLE_TMPL_STRING(0, "my literl string")
I've used it in another macro to conveniently and consistently extract a literal string begin/end:
#include <tacklelib/tackle/tmpl_string.hpp>
#include <tacklelib/utility/string_identity.hpp>
//...
std::vector<char> xml_arr;
xml_arr.insert(xml_arr.end(), UTILITY_LITERAL_STRING_WITH_BEGINEND_TUPLE("<?xml version='1.0' encoding='UTF-8'?>\n"));
https://sourceforge.net/p/tacklelib/tacklelib/HEAD/tree/trunk/include/tacklelib/utility/string_identity.hpp
It's well known that std::vector<bool> does not satisfy the Standard's container requirements, mainly because the packed representation prevents T* x = &v[i] from returning a pointer to a bool.
My question is: can this be remedied/mitigated when the reference_proxy overloads the address-of operator& to return a pointer_proxy?
The pointer-proxy could contain the same data as the reference_proxy in most implementations, namely a pointer into the packed data and a mask to isolate the particular bit inside the block pointed to. Indirection of the pointer_proxy would then yield the reference_proxy. Essentially both proxies are "fat" pointers, which are, however, still rather light-weight compared to disk-based proxy containers.
Instead of T* x = &v[0] one could then do auto x = &v[0], and use x like if(*x) without problems. I would also like to be able to write for(auto b: v) { /* ... */ }
Questions: would such a multi-proxy approach work with the STL's algorithms? Or do some algorithms really rely on the requirement that x needs to be a real bool*? Or are there too many consecutive user-defined conversions required that prevent this to work? I'd like to know any of such obstructions before trying to fully complete the above implementation sketch.
UPDATE (based on #HowardHinnant 's answer and this ancient discussion on comp.std.c++)
You can come a long way to almost mimic the builtin types: for any given type T, a pair of proxies (e.g. reference_proxy and iterator_proxy) can be made mutually consistent in the sense that reference_proxy::operator&() and iterator_proxy::operator*() are each other's inverse.
However, at some point one needs to map the proxy objects back to behave like T* or T&. For iterator proxies, one can overload operator->() and access the template T's interface without reimplementing all the functionality. However, for reference proxies, you would need to overload operator.(), and that is not allowed in current C++ (although Sebastian Redl presented such a proposal on BoostCon 2013). You can make a verbose work-around like a .get() member inside the reference proxy, or implement all of T's interface inside the reference (this is what is done for vector::bit_reference), but this will either lose the builtin syntax or introduce user-defined conversions that do not have builtin semantics for type conversions (you can have at most one user-defined conversion per argument).
My question is: can this be remedied/mitigated when the
reference_proxy overloads the address-of operator& to return a
pointer_proxy?
libc++ actually does this.
#include <vector>
#include <cassert>
int main()
{
std::vector<bool> v(1);
std::vector<bool>::pointer pb = &v[0];
assert(*pb == false);
*pb = true;
assert(v[0] == true);
std::vector<bool>::const_pointer cbp = pb;
assert(*cbp == true);
v[0] = false;
assert(*cbp == false);
}
It even extends to const_pointer and const_reference in ways that mimic the same types for vector<int>. This is a non-conforming extension for libc++. But it makes writing generic code which might be instantiated on vector<bool> far more likely to compile and behave correctly.
Questions: would such a multi-proxy approach work with the STL's
algorithms? Or do some algorithms really rely on the requirement that
x needs to be a real bool*? Or are there too many consecutive
user-defined conversions required that prevent this to work?
All of libc++'s algorithms work with vector<bool>. Some of them with quite spectacular performance. One algorithm in particular must have special treatment which the standard unfortunately does not mandate:
#include <vector>
#include <cassert>
int main()
{
std::vector<bool> v(1);
bool b = true;
assert(v[0] == false);
assert(b == true);
std::swap(b, v[0]);
assert(v[0] == true);
assert(b == false);
}
This is very easy for the implementation to accomplish. One simply needs to make sure swap works for any combination of bool and vector<bool>::reference. But I don't know if any implementation besides libc++ does this, and it is not mandated by C++11.
An array of bits is a wonderful data structure. But unfortunately it is poorly specified in the C++ standard. libc++ has gone somewhat outlaw to demonstrate that this can be a very useful and high performance data structure. The hope is that a future C++ standard may migrate in this direction to the benefit of the C++ programmer.
Offhand I would say first, that it will actually depend more on the particulars of each individual STL implementation since it doesn't officially conform to the standard requirement that a *reference_type to be lvalue of T*. So regarding potential implementation issues:
The main reason any piece of code would be explicitly dependent on the container's pointer being a real bool* is if the algo was using pointer arithmetic, in which case the size of the pointer type becomes relevant. Pointer arithmetic though would bypass the iterator interface and thus defeat the main purpose of the whole STL container-by-iterator design. std::vector<> itself is guaranteed to be contiguous in C++11, which allows optimized specializations of both STL algos and compiler for(:), both of which may use pointer arithmetic internally. If your type isn't derived from std::vector then that shouldn't be an issue; everything should just assume the iterator method instead.
However! STL code may still take pointers not for the purpose of pointer arithmetic but rather for some other purpose. In this case the problem is C++ syntax. Eg, quoting your own question:
Instead of T* x = &v[0] one could then do auto x = &v[0]
Any templated code in the STL will also have to do the same thing... and that seems entirely unlikely at this point that STL implementations will be making wide use of auto. There may be other situations were the STL attempts to do clever r-value casting tricks that end up failing because it isn't expecting mismatched reference types.
Regarding for(auto b: v) { /* ... */ }: I see no reason that shouldn't work. I think it will generate code that will be far less efficient than the same version you could just roll yourself in 15 mins (or less). I only bring it up since you mention intrinsics in the OP, which imples some consideration for performance. You won't be able to help it out using intrinsics either. There's nothing an intrinsic can do that somehow surpasses a simple bitwise shift for sequentially traversing an array of bits. Most of the added overhead will be from the compiler generating code to update the iterator pointer and mask values, and then reload the mask value on the next iteration. It won't be able to magically deduce what you're trying to do and turn it into a sequential shift op for you. It may at least be able to optimize out the pointer update+writeback stage by caching it into a register outside the loop, though honestly I'd be very skeptical based on my experiences.
Here's one way for going through bits from start to end, just for sake of comparison (a version capable of starting at any arbitrary point in the bitstream would require a little extra setup logic):
uint64_t* pBitSet = &v[-1]; // gets incremented on first iteration through loop.
uint64_t curBitSet = v[0];
for (int i=0; i<v.length(); ++i) {
if ((i % 64) == 0) {
curBitSet = *(++pBitSet);
}
int bit = curBitSet & 1;
curBitSet >>= 1;
// do stuff based on 'bit' here.
}
I want to hold a bunch of const char pointers into an std::set container [1]. std::set template requires a comparator functor, and the standard C++ library offers std::less, but its implementation is based on comparing the two keys directly, which is not standard for pointers.
I know I can define my own functor and implement the operator() by casting the pointers to integers and comparing them, but is there a cleaner, 'standard' way of doing it?
Please do not suggest creating std::strings - it is a waste of time and space. The strings are static, so they can be compared for (in)equality based on their address.
1: The pointers are to static strings, so there is no problem with their lifetimes - they won't go away.
If you don't want to wrap them in std::strings, you can define a functor class:
struct ConstCharStarComparator
{
bool operator()(const char *s1, const char *s2) const
{
return strcmp(s1, s2) < 0;
}
};
typedef std::set<const char *, ConstCharStarComparator> stringset_t;
stringset_t myStringSet;
Just go ahead and use the default ordering which is less<>. The Standard guarantees that less will work even for pointers to different objects:
"For templates greater, less, greater_equal, and less_equal, the specializations for any
pointer type yield a total order, even if the built-in operators <, >, <=, >= do not."
The guarantee is there exactly for things like your set<const char*>.
The "optimized way"
If we ignore the "premature optimization is the root of all evil", the standard way is to add a comparator, which is easy to write:
struct MyCharComparator
{
bool operator()(const char * A, const char * B) const
{
return (strcmp(A, B) < 0) ;
}
} ;
To use with a:
std::set<const char *, MyCharComparator>
The standard way
Use a:
std::set<std::string>
It will work even if you put a static const char * inside (because std::string, unlike const char *, is comparable by its contents).
Of course, if you need to extract the data, you'll have to extract the data through std::string.c_str(). In the other hand, , but as it is a set, I guess you only want to know if "AAA" is in the set, not extract the value "AAA" of "AAA".
Note: I did read about "Please do not suggest creating std::strings", but then, you asked the "standard" way...
The "never do it" way
I noted the following comment after my answer:
Please do not suggest creating std::strings - it is a waste of time and space. The strings are static, so they can be compared for (in)equality based on their address.
This smells of C (use of the deprecated "static" keyword, probable premature optimization used for std::string bashing, and string comparison through their addresses).
Anyway, you don't want to to compare your strings through their address. Because I guess the last thing you want is to have a set containing:
{ "AAA", "AAA", "AAA" }
Of course, if you only use the same global variables to contain the string, this is another story.
In this case, I suggest:
std::set<const char *>
Of course, it won't work if you compare strings with the same contents but different variables/addresses.
And, of course, it won't work with static const char * strings if those strings are defined in a header.
But this is another story.
Depending on how big a "bunch" is, I would be inclined to store a corresponding bunch of std::strings in the set. That way you won't have to write any extra glue code.
Must the set contain const char*?
What immediately springs to mind is storing the strings in a std::string instead, and putting those into the std::set. This will allow comparisons without a problem, and you can always get the raw const char* with a simple function call:
const char* data = theString.c_str();
Either use a comparator, or use a wrapper type to be contained in the set. (Note: std::string is a wrapper, too....)
const char* a("a");
const char* b("b");
struct CWrap {
const char* p;
bool operator<(const CWrap& other) const{
return strcmp( p, other.p ) < 0;
}
CWrap( const char* p ): p(p){}
};
std::set<CWrap> myset;
myset.insert(a);
myset.insert(b);
Others have already posted plenty of solutions showing how to do lexical comparisons with const char*, so I won't bother.
Please do not suggest creating std::strings - it is a waste of time and space.
If std::string is a waste of time and space, then std::set might be a waste of time and space as well. Each element in a std::set is allocated separately from the free store. Depending on how your program uses sets, this may hurt performance more than std::set's O(log n) lookups help performance. You may get better results using another data structure, such as a sorted std::vector, or a statically allocated array that is sorted at compile time, depending on the intended lifetime of the set.
the standard C++ library offers std::less, but its implementation is based on comparing the two keys directly, which is not standard for pointers.
The strings are static, so they can be compared for (in)equality based on their address.
That depends on what the pointers point to. If all of the keys are allocated from the same array, then using operator< to compare pointers is not undefined behavior.
Example of an array containing separate static strings:
static const char keys[] = "apple\0banana\0cantaloupe";
If you create a std::set<const char*> and fill it with pointers that point into that array, their ordering will be well-defined.
If, however, the strings are all separate string literals, comparing their addresses will most likely involve undefined behavior. Whether or not it works depends on your compiler/linker implementation, how you use it, and your expectations.
If your compiler/linker supports string pooling and has it enabled, duplicate string literals should have the same address, but are they guaranteed to in all cases? Is it safe to rely on linker optimizations for correct functionality?
If you only use the string literals in one translation unit, the set ordering may be based on the order that the strings are first used, but if you change another translation unit to use one of the same string literals, the set ordering may change.
I know I can define my own functor and implement the operator() by casting the pointers to integers and comparing them
Casting the pointers to uintptr_t would seem to have no benefit over using pointer comparisons. The result is the same either way: implementation-specific.
Presumably you don't want to use std::string because of performance reasons.
I'm running MSVC and gcc, and they both seem to not mind this:
bool foo = "blah" < "grar";
EDIT: However, the behaviour in this case is unspecified. See comments...
They also don't complain about std::set<const char*>.
If you're using a compiler that does complain, I would probably go ahead with your suggested functor that casts the pointers to ints.
Edit:
Hey, I got voted down... Despite being one of the few people here that most directly answered his question. I'm new to Stack Overflow, is there any way to defend yourself if this happens? That being said, I'll try to right here:
The question is not looking for std::string solutions. Every time you enter an std::string in to the set, it will need to copy the entire string (until C++0x is standard, anyway). Also, every time you do a set look-up, it will need to do multiple string compares.
Storing the pointers in the set, however, incurs NO string copy (you're just copying the pointer around) and every comparison is a simple integer comparison on the addresses, not a string compare.
The question stated that storing the pointers to the strings was fine, I see no reason why we should all immediately assume that this statement was an error. If you know what you're doing, then there are considerable performance gains to using a const char* over either std::string or a custom comparison that calls strcmp. Yes, it's less safe, and more prone to error, but these are common trade-offs for performance, and since the question never stated the application, I think we should assume that he's already considered the pros and cons and decided in favor of performance.