Uses of std::basic_string - c++

The basic_string class was apparently designed as a general-purpose container, as I cannot find any text-specific function in its specification except for the c_str() function. Just out of curiosity, have you ever used the std::basic_string container class for anything other than storing human-readable character data?
I ask because one often has to choose between making something general or specific. The designers chose to make the std::basic_string class general, but I doubt it is ever used that way.

It was designed as a string class (hence, for example, length() and all those dozens of find functions), but after the introduction of the STL into the std lib it was outfitted to be an STL container, too (hence size() and the iterators, with <algorithm> making all the find functions redundant).
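A minimal sketch of the two interfaces side by side, assuming nothing beyond the standard library. The helper name `finds_agree` is made up for illustration; it checks that the member find() and the generic std::find() locate the same position.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Returns true when the string-style member find() and the
// container-style std::find() agree on where (or whether) c occurs.
bool finds_agree(const std::string& s, char c) {
    // String-class interface: yields an index, or npos when absent.
    std::string::size_type pos = s.find(c);

    // Container interface: yields an iterator, or end() when absent.
    std::string::const_iterator it = std::find(s.begin(), s.end(), c);

    if (pos == std::string::npos) return it == s.end();
    return it != s.end() &&
           static_cast<std::string::size_type>(it - s.begin()) == pos;
}
```

The same duplication shows up in length() versus size(), which return the same value.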
Its main purpose is to store characters, though. Using anything other than PODs isn't guaranteed to work (and doesn't work, for example, when using Dinkumware's std lib). Also, the necessary std::char_traits isn't required to be available for anything other than char and wchar_t (although many implementations come with a reasonable implementation of the base template).
In the original standard, the class wasn't required to store its data in a contiguous piece of memory, but this has changed with C++11.
In short, it's mostly useful as a container of characters (a.k.a. "string"), where "character" has a fairly wide definition.
The "wildest" I have used it for is for storing differently encoded strings by using different character types. That way, strings of different encodings are incompatible even if they use the same character size (ASCII and UTF-8) and, e.g., assignment causes compile-time errors.
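A rough sketch of that idea. Note that this is a variation, not necessarily what the answer did: instead of a separate character type, it tags the string with a hypothetical traits class (`utf8_traits`), which is enough to make the two string types incompatible. The names `ascii_string`, `utf8_string` and `ascii_to_utf8` are assumptions for illustration.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical tag: behaves exactly like char_traits<char>, but
// makes basic_string<char, utf8_traits> a distinct type.
struct utf8_traits : std::char_traits<char> {};

typedef std::basic_string<char> ascii_string;              // same as std::string
typedef std::basic_string<char, utf8_traits> utf8_string;  // distinct type

// Same character size, yet the types cannot be mixed by accident.
static_assert(!std::is_same<ascii_string, utf8_string>::value,
              "the two encodings are different types");
static_assert(!std::is_assignable<ascii_string&, utf8_string>::value,
              "UTF-8 cannot be assigned to ASCII");
static_assert(!std::is_assignable<utf8_string&, ascii_string>::value,
              "ASCII cannot be assigned to UTF-8");

// The only way across is an explicit, visible conversion point.
// ASCII is a subset of UTF-8, so copying the bytes is sufficient here.
utf8_string ascii_to_utf8(const ascii_string& s) {
    return utf8_string(s.data(), s.size());
}
```

An accidental `utf8 = ascii;` then fails to compile, while the explicit conversion function documents where the encoding changes.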

Yes. I've implemented a state machine over 'unsigned int', and used basic_string to store and compare the states.

Related

C++ - Significance of local names for types

I recently read up how classes are allowed to define their own local names for types. One of the famous examples being size_type, provided almost by all STL containers. It was also mentioned that doing so helps hide implementation details from the user of the class. I am not quite sure how this is the case.
What are some examples where defining local names for types might be useful and how doing so hides implementation details?
It's more useful when you use templated algorithms or containers, which might assume that your type has such a type alias. Then, even if you change the type behind size_type (say, for some reason, from size_t to int), your type will still work with those algorithms and containers.
Beyond that, the standard requires size_type to be present in some cases, for example when you implement your own allocator.
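As a sketch of that point: the generic function below only ever names `Container::size_type`, so it works with standard containers and with a hypothetical `SmallBuffer` type that happens to use a narrower size type. `SmallBuffer` and `count_nonzero` are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container that chose a narrow size type.
// Callers never need to know that it is an unsigned short.
struct SmallBuffer {
    typedef unsigned short size_type;
    int data[4];
    size_type count;

    size_type size() const { return count; }
    const int& operator[](size_type i) const { return data[i]; }
};

// Generic algorithm: relies only on the nested size_type alias,
// size() and operator[], not on any concrete integer type.
template <class Container>
typename Container::size_type count_nonzero(const Container& c) {
    typename Container::size_type n = 0;
    for (typename Container::size_type i = 0; i < c.size(); ++i) {
        if (c[i] != 0) ++n;
    }
    return n;
}
```

If SmallBuffer later switches its size_type to std::size_t, count_nonzero keeps compiling unchanged, which is the implementation hiding the question asks about.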
Suppose you have a program where you define several variables of type size_type and that it is defined somewhere as an int.
Then, upon analysis and reflection, you realize that the variables never assume values bigger than 10 thousand. Therefore, the 32 bits used for each of these variables are somewhat overkill. In this case, you can redefine size_type as short instead of int, and you will end up saving some memory.
Regarding the examples, you can check clock_t, char16_t, char32_t, wchar_t, true_type and false_type.

What is std::mbstate_t?

I'm creating a custom locale by deriving from std::codecvt. Most of the methods I'm supposed to implement are pretty straightforward, except for this std::mbstate_t. On my compiler, VS2010, it's declared as an int. But Google tells me it's a POD type; it's sometimes a union (of what, I don't know) or a struct (again, I can't find it).
As I understand it, std::mbstate_t is a placeholder for partial conversions. And, I think, it comes into play when std::codecvt::do_out() requires more space to write the output, which in turn will call std::codecvt::do_unshift(). Please correct me if my assumptions are wrong.
I've read another post about storing pointers, though the post doesn't have an adequate answer. I've also read this example which presumes it to be a 32bit type although the standard states an int to be no less than 16bits.
My question. What can I safely store in std::mbstate_t? Can I safely replace it with another type? The answer to the above post suggests replacing it, but the following comment says otherwise.
I think that /the/ book concerning these things is C++ IOStreams and Locales by Langer and Kreft; if you seriously want to mess with these things, try to get hold of a copy. Now, coming back to your question: mbstate_t is used to hold the state of the conversion. Normally you would store this inside the conversion facet, but since facets are immutable, you need to store it externally. In practice, it is used when you need more than a sequence of bytes to determine the corresponding character; the Linux manpage of mbsinit() gives ISO-2022 and UTF-7 as examples of such encodings. Note that this does not affect UTF-8, where a single Unicode code point is always encoded by a sequence of bytes, without anything before or after it affecting the result. Partial UTF-8 sequences are not handled by it either; do_in() returns partial instead.
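To illustrate the "state is stored externally" point, here is a small sketch using the C-level function std::mbrtowc rather than a codecvt facet (an assumption made to keep the example short). The caller owns the std::mbstate_t and passes it to every call; `count_chars` is a made-up helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Counts the characters in the first n bytes of s, decoding with the
// current C locale. The conversion state lives in the caller-side
// variable 'state', not inside the conversion routine.
std::size_t count_chars(const char* s, std::size_t n) {
    std::mbstate_t state = std::mbstate_t();  // zero-initialised: initial state
    std::size_t count = 0;
    while (n > 0) {
        wchar_t wc;
        std::size_t r = std::mbrtowc(&wc, s, n, &state);
        if (r == static_cast<std::size_t>(-1) ||  // invalid sequence
            r == static_cast<std::size_t>(-2) ||  // incomplete sequence
            r == 0) {                             // decoded a null character
            break;
        }
        s += r;
        n -= r;
        ++count;
    }
    return count;
}
```

A zero-initialised mbstate_t is guaranteed to describe the initial conversion state, which std::mbsinit() can confirm.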
Now, what can you store in the mbstate_t? Since the actual type is unspecified and the number of functions to manipulate it is very limited, there is nothing you can do with it at first. However, nothing else does anything with that state either, so you can do some ugly hacking on it. This might require a few #ifdefs depending on the standard library, but then you can simply (ab)use the fact that it's a POD (ints and unions are also PODs) to store pretty much any type of POD that is not larger. This won't win you a beauty prize and the code won't automatically work on every system, but I think in this case it's unavoidable, and the porting work is limited.
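A hedged sketch of that hack, assuming your own facet is the only code that ever interprets the state. `ShiftState`, `save_state` and `load_state` are hypothetical names; the static_assert is the compile-time stand-in for the per-platform check the answer mentions.

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>

// Hypothetical conversion state for a custom facet. It must be a POD
// and must not be larger than the platform's mbstate_t.
struct ShiftState {
    unsigned char mode;  // e.g. 0 = initial shift state
};

static_assert(sizeof(ShiftState) <= sizeof(std::mbstate_t),
              "ShiftState does not fit into mbstate_t on this platform");

// Copies the custom state into the opaque mbstate_t byte-wise.
void save_state(std::mbstate_t& st, const ShiftState& s) {
    std::memcpy(&st, &s, sizeof s);
}

// Reads the custom state back out of the mbstate_t.
ShiftState load_state(const std::mbstate_t& st) {
    ShiftState s;
    std::memcpy(&s, &st, sizeof s);
    return s;
}
```

Keeping the all-zero value as the initial state is a sensible convention here, since callers typically hand the facet a zero-initialised mbstate_t.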
Finally, can you replace it? This type is part of std::char_traits, which in turn affects practically all strings and streams, so you would need to replace them throughout your program or convert. Further, if you now create a new char_traits class, you still can't easily instantiate e.g. basic_string with it, because there is no guarantee that a general basic_string template even exists; it is only required that the two specializations for char and wchar_t (and some more for C++11) exist. Ditto for streams. In short: no, you can't replace mbstate_t.

C++ - Is string a built-in data type?

In C++, is string a built-in data type?
Thanks.
What definition of built-in do you want to use? Is it built into the compiler toolset that you have? Yes, it should be. Is it treated specially by the compiler? No, the compiler treats that type like any user-defined type. Note that the same probably applies to many other languages for which most people would answer yes.
One of the focuses of the C++ committee is keeping the core language to a bare minimum, and provide as much functionality as possible in libraries. That has two intentions: the core language is more stable, libraries can be reimplemented, enhanced... without changing the compiler core. But more importantly, the fact that you do not need special compiler support to handle most of the standard library guarantees that the core language is expressive enough for most uses.
Put more simply, in negated form: if the language required special compiler support to implement std::string, that would mean that users don't have enough power to express that or a similar concept in the core language.
It's not a primitive -- that is, it's not "built in" the way that int, char, etc are. The closest built-in string-like type is char * or char[], which is the old C way of doing stringy stuff, but even that requires a bunch of library code in order to use productively.
Rather, std::string is a part of the standard library that comes with nearly every modern C++ compiler in existence. You'll need to #include <string> (or include something else that includes it, but really you should include what your code refers to) in order to use it.
If you are talking about std::string then no.
If you are talking about character array, I guess you can treat it as an array of a built in type.
No.
Built-in or "primitive" types can be used to create string-like functionality with the built-in type char. This, along with utility functions, is what was used in C. In C++ this functionality still exists, but a more intuitive way of using strings was added.
The string class is a part of the std namespace and is an instantiation of the basic_string template class. It is defined as
typedef basic_string<char> string;
It is a class with the ability to dynamically resize as needed and has many member functions acting as utilities. It also uses operator overloading so it is more intuitive to use. However, this functionality also means it has an overhead in terms of speed.
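A short sketch of both points, assuming only the standard library: the static_assert confirms that std::string really is basic_string<char>, and the made-up function `greet` uses the overloaded operators and automatic resizing.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// std::string is a library typedef, not a separate built-in type.
static_assert(std::is_same<std::string, std::basic_string<char> >::value,
              "std::string is basic_string<char>");

// Builds a greeting using the overloaded operators; the string
// grows its own storage as needed.
std::string greet(const std::string& name) {
    std::string s = "Hello, ";
    s += name;       // operator+= appends and resizes if necessary
    return s + "!";  // operator+ produces a new string
}
```

Nothing here needs compiler magic; everything is ordinary class and template code that a user could write too.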
Depends on what you mean by built-in, but probably not. std::string is defined by the standard library (and thus the C++ standard) and is very universally supported by different compilers, but it is not a part of the core language like int or char.
It can be built-in, but it doesn't have to be.
The C++ standard library has a documented interface for its components. This can be realized either as library code or as compiler built-ins. The standard doesn't say how it should be implemented.
When you use #include <string> you have the std::string implementation available. This could be either because the compiler implements it directly, or because it links to some library code. We don't know for sure, unless we check each compiler.
None of the known compilers have chosen to make it a built-in type, because they didn't have to. The performance of a pure library implementation was obviously good enough.
No. It's part of standard library.
No, string is a class.
Definitely not. String is a class from standard library.
char *, or char[] are built-in types, but char, int, float, double, void, bool without any additions (as pointers, arrays, sign or size modifiers - unsigned, long etc.) are fundamental types.
No. There are different implementations (e.g. Microsoft Visual C++), but char* is the C++ way of representing strings.

Handle std::basic_string<> with different type arguments

I want to implement a C++ library, and like many other libs I need to take string arguments from the user and give strings back. The current standard defines std::string and std::wstring (I prefer wstring). Theoretically I have to implement methods with string arguments twice:
virtual void foo(std::string &) = 0; // convert internally from a previous defined charset to unicode
virtual void foo(std::wstring &) = 0;
C++0x doesn't make life easier, for char16_t and char32_t I need:
virtual void foo(std::u16string &) = 0;
virtual void foo(std::u32string &) = 0;
Handling such different types internally (for example, putting them all into a private vector member) requires conversions, wrappers... it's horrible.
Another problem arises if a user (or I) want to work with custom allocators or customized trait classes: everything results in a completely new type. For example, to write custom codecvt specializations for multibyte charsets, the standard says I have to introduce a custom state_type, which requires a custom trait class, which results in a new std::basic_ifstream<> type, and that's completely incompatible with interfaces expecting std::ifstream& as an argument.
One -possible- solution is to construct each library class as a template that manages the value_type, traits and allocators specified by the user. But that's overkill and makes abstract base classes (interfaces) impossible.
Another solution is to just specify one type (e.g. u32string) as default, every user must pass data using this type. But now think about a project which uses 3 libraries, and the first lib uses u32string, the second lib u16string and the third lib wstring -> HELL.
What I really want is to declare a method just as void foo(put_unicode_string_here), without introducing my own UnicodeString or UnicodeStream class.
There is always a choice that has to be made if you don't want to support everything, but I personally feel restricting input to UTF-8 is the easiest of all. Just use plain old std::string and everyone's happy. In practice, the user (of your library) will only have to convert to UTF-8 if he's on Windows, but there's a plethora of ways to do that simple task.
UPDATE: on the other hand, you could template all of your code and leave the std::basic_string<T> as a template throughout your code. This only gets messy if you do different things dependent on the size of the template argument.
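A minimal sketch of the templated approach, assuming the function does the same thing regardless of character width. `count_spaces` is a hypothetical example function; it accepts any basic_string instantiation, including ones with custom traits or allocators.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Works for std::string, std::wstring, std::u16string, std::u32string
// and any basic_string with user-supplied traits or allocator.
template <class CharT, class Traits, class Alloc>
std::size_t count_spaces(const std::basic_string<CharT, Traits, Alloc>& s) {
    std::size_t n = 0;
    for (CharT c : s) {
        // Compare through the traits so custom traits are respected.
        if (Traits::eq(c, CharT(' '))) ++n;
    }
    return n;
}
```

This stays tidy only while the logic is width-independent; counting code units, as here, is not the same as counting user-visible characters in UTF-8 or UTF-16 data.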
char_traits is indeed a hopelessly awful wastebin of random traits. Should every string pre-specify the largest supported file size, case-sensitivity, and (ugh) state type of the encoding mechanism itself? NO.
However, what you ask is impossible even with well-designed traits. string and wstring are meaningfully different because the size of the internal character type differs. To run any kind of algorithm, you will need to query the object for char_t. That requires RTTI or virtual functions because basic_string doesn't (and shouldn't) maintain that info at runtime.
One -possible- solution is to construct each library class as a template that manages the value_type, traits and allocators specified by the user. But that's overkill and makes abstract base classes (interfaces) impossible.
This is the only complete solution. Templates actually do play well with abstract base classes: a number of templates can derive from a non-template abstract base, or the base can also be templated. However, it is difficult if not untenable because of the sensitivity and tedium of writing perfectly generic code.
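A small sketch of the "templates deriving from a non-template abstract base" arrangement. The names `TextSource` and `StringSource` are invented for illustration, and the interface is deliberately tiny.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Non-template abstract interface: usable across library boundaries.
struct TextSource {
    virtual ~TextSource() {}
    virtual std::size_t length() const = 0;
};

// Template implementation: one class per string type, all sharing
// the same abstract base.
template <class String>
class StringSource : public TextSource {
public:
    explicit StringSource(String s) : s_(std::move(s)) {}
    std::size_t length() const override { return s_.size(); }

private:
    String s_;
};
```

Code written against TextSource& never sees the character type. The hard part the answer alludes to is designing virtual functions that stay meaningful for every width, which this sketch sidesteps by exposing only a length.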
Another solution is to just specify one type (e.g. u32string) as default, every user must pass data using this type. But now think about a project which uses 3 libraries, and the first lib uses u32string, the second lib u16string and the third lib wstring -> HELL.
This is why I'm scared by C++11's "improved" Unicode support. It simplifies direct interaction with file data and discourages abstraction to a common wchar_t internal format. It would have been better to require specific codecvts for UTF-16 and UTF-32 and specify that wchar_t must be at least 21 bits. Whereas before there were only "dumb" char and "smart" wchar_t libraries among clean C++ interfaces, we may have to contend with additional widths — and char16_t is just an instant red flag.
But, that's down the road.
If you really end up using a number of incompatible libraries, and the problem is shuttling data between functions requiring different formats, then write a ScopeGuard-style utility to convert from and back to your chosen common format, such as wstring. This utility can be a template with an explicit specialization for each incompatible format you need, or a non-templated set of classes.
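One possible shape for such a utility, as a rough sketch only. `ScopedWide` is a hypothetical name, and the conversion is a plain character-by-character cast that is only correct for ASCII content; a real version would call a proper encoding conversion at both ends.

```cpp
#include <cassert>
#include <string>

// Converts a narrow string to the common wide format on construction
// and writes the (possibly modified) result back on destruction.
// ASSUMPTION: the data is pure ASCII, so a per-character cast suffices.
class ScopedWide {
public:
    explicit ScopedWide(std::string& target) : target_(target) {
        for (char c : target)
            wide_.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    }

    ~ScopedWide() {
        // Note: this allocates and could throw from a destructor;
        // acceptable for a sketch, questionable in production code.
        std::string out;
        for (wchar_t wc : wide_) out.push_back(static_cast<char>(wc));
        target_.swap(out);
    }

    // The common-format string handed to wide-only functions.
    std::wstring& get() { return wide_; }

private:
    ScopedWide(const ScopedWide&);             // non-copyable
    ScopedWide& operator=(const ScopedWide&);  // non-assignable

    std::string& target_;
    std::wstring wide_;
};
```

Usage would look like `{ ScopedWide w(s); wide_only_function(w.get()); }`, where `wide_only_function` stands for whatever library call needs the wstring; when the scope ends, `s` holds the converted result.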

C++ strings, when to use what?

It's been quite some time now that I've been coding in C++, and I think most who actually code in C++ would agree that one of the trickiest decisions is choosing from an almost dizzying number of available string types. I mostly prefer ATL CString for its ease of use and features, but would like a comparative study of the available options.
I've checked out SO and haven't found any content that helps with choosing the right string. There are websites that describe conversions from one string type to another, but that's not what we want here.
Would love to have a comparison based on specialty, performance, portability (Windows, Mac, Linux/Unix, etc), ease of use/features, multi language support(Unicode/MBCS), cons (if any), and any other special cases.
I'm listing out the strings that I've encountered so far. I believe, there would be more, so we may edit this later to accommodate other options. Mind you, I've worked mostly on Windows, so the list reflects the same:
1. char*
2. std::string
3. STL's basic_string
4. ATL's CString
5. MFC's CString
6. BSTR
7. _bstr_t
8. CComBstr
Don't mean to put a dampener on your enthusiasm for this, but realistically it's inefficient to mix a lot of string types in the one project, so the larger the project gets the more inevitably it should settle on std::string (which is a typedef to an instantiation of STL's basic_string for type char, not a different entity), given that's the only Standard value-semantic option. char* is ok mainly for fixed sized strings (e.g. string literals, fixed size buffers) or interfacing with C.
Why do I say it's inefficient? You end up with needless template instantiations for the variety of string arguments (permutations even, for multiple arguments). You find yourself calling functions that want to load a result into a string&, then have to call .c_str() on that and construct some other type, doing redundant memory allocation. Even const std::string& requires a string temporary if called using an ASCIIZ char* (e.g. to some other string type's buffer). When you want to write a function to handle the type of string a particular caller wants to use, you're pushed towards templates and therefore inline code, longer compile times and recompilation dependencies (there are some ways to mitigate this, but they get complex, and to be convenient or automated they tend to require changes to the various string types, e.g. a casting operator or member function returning some common interface/proxy object).
Projects may need to use non-Standard string types to interact with libraries they want to use, but you want to minimise that and limit the pervasiveness if possible.
The sorry story of C++ string handling is too depressing for me to write an essay on, but just a couple of points:
ATL and MFC CString are the same thing (same code and everything). They were merged years ago.
If you're using either _bstr_t or CComBstr, you probably wouldn't use BSTR except on calls into other people's APIs which take BSTR.
char* - fast, features include those in the <cstring> header, error-prone (too low-level)
std::string - this is actually a typedef for std::basic_string<char, char_traits<char> >. A beautiful thing: first of all, it's fast too. Second, you can use all the <algorithm> functions because basic_string provides iterators. For wide-character support there is another typedef, wstring, which is std::basic_string<wchar_t, char_traits<wchar_t> >. This (basic_string) is a standard type and is therefore absolutely portable. I'd go with this one.
ATL's and MFC's CStrings do not even provide iterators, therefore they are an abomination for me, because they are a class-wrapper around c-strings and they are very badly designed. IMHO
don't know about the rest.
Hope this partial information helps.
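To make the iterator point in the answer above concrete, here is a brief sketch of standard algorithms applied directly to std::string. The helper names `to_upper_ascii` and `is_palindrome` are made up, and the upper-casing assumes single-byte (ASCII) text.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Upper-cases a copy of s with std::transform.
// ASSUMPTION: single-byte text; this is not Unicode-aware.
std::string to_upper_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}

// Compares the string with its reverse using forward and
// reverse iterators; no temporary copy is needed.
bool is_palindrome(const std::string& s) {
    return std::equal(s.begin(), s.end(), s.rbegin());
}
```

Neither function needs anything string-specific beyond begin(), end() and rbegin(), which is exactly what a class without iterators cannot offer.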
Obviously, only the first three are portable, so they should be preferred in most cases. If you're doing C++, then you should avoid char * in most instances, as raw pointers and arrays are error-prone. Interfacing with low-level C, such as in system calls or drivers, is the exception. std::string should be preferred by default, IMHO, because it meshes so nicely with the rest of the STL.
Again, IMHO, if you need to work with e.g. MFC, you should work with everything as std::string in your business logic, and translate to and from CString when you hit the WinApi functions.
2 and 3 are the same. 4 and 5 are the same, too. 7 and 8 are wrappers of 6. So, arguably, the list contains just C's strings, standard C++'s strings, Microsoft's C++ strings, and Microsoft's COM strings. That gives you the answer: in standard C++, use standard C++ strings (std::string).