What is std::mbstate_t?

I'm creating a custom locale by deriving from std::codecvt. Most of the methods I'm supposed to implement are pretty straightforward, except for this std::mbstate_t. On my compiler (VS2010) it's declared as an int, but Google tells me it's a POD type, sometimes a union (of what, I don't know) or a struct (which, again, I can't pin down).
As I understand it, std::mbstate_t is a placeholder for partial conversions. And, I think, it comes into play when std::codecvt::do_out() requires more space to write the output, which in turn will call std::codecvt::do_unshift(). Please correct me if my assumptions are wrong.
I've read another post about storing pointers in it, though that post doesn't have an adequate answer. I've also read this example, which presumes it to be a 32-bit type, although the standard only requires an int to be at least 16 bits.
My question: what can I safely store in std::mbstate_t? Can I safely replace it with another type? The answer to the above post suggests replacing it, but the following comment says otherwise.

I think that *the* book concerning these things is C++ IOStreams and Locales by Langer and Kreft; if you seriously want to mess with these things, try to get hold of a copy. Now, coming back to your question, mbstate_t is used to hold the state of the conversion. Normally you would store this inside the conversion facet but, since the facets are immutable, it has to be stored externally. In practice, this is used when you need more than a sequence of bytes to determine the corresponding character; the Linux manpage of mbsinit() gives ISO-2022 and UTF-7 as examples of such encodings. Note that this does not affect UTF-8, where a single Unicode code point is always encoded by a self-contained sequence of bytes, with nothing before or after it affecting the result. Partial UTF-8 sequences are not handled via the state either; do_in() returns partial instead.
Now, what can you store in the mbstate_t? Since the actual type is undefined and the number of functions to manipulate it is very limited, there is nothing you can do with it at first. However, nothing else does anything with that state either, so you can do some ugly hacking on it. This might require a few #ifdefs depending on the standard library, but then you can simply (ab)use the fact that it's a POD (ints and unions are PODs, too) to store pretty much any POD type that is not larger. This won't win you a beauty prize and the code won't port to other systems automatically, but I think in this case that's unavoidable, and the porting work is limited.
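For illustration, a minimal sketch of that hack; ShiftState is a hypothetical conversion state, and the static_assert documents the size assumption the trick depends on:

#include <cstring>   // std::memcpy
#include <cwchar>    // std::mbstate_t

// Hypothetical conversion state we want to smuggle into mbstate_t.
struct ShiftState { unsigned char mode; };

static_assert(sizeof(ShiftState) <= sizeof(std::mbstate_t),
              "our state must fit into the opaque mbstate_t");

void store(std::mbstate_t& mb, const ShiftState& s)
{
    std::memcpy(&mb, &s, sizeof s);   // type-pun via memcpy, not a cast
}

ShiftState load(const std::mbstate_t& mb)
{
    ShiftState s;
    std::memcpy(&s, &mb, sizeof s);
    return s;
}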
Finally, can you replace it? This type is part of std::char_traits, which in turn affects really all strings and streams, so you would need to replace them throughout your program or convert between them. Further, if you now create a new char_traits class, you still can't easily instantiate e.g. basic_string with it, because there is no guarantee that a general basic_string template even exists; it is only required that the two specializations for char and wchar_t (and some more for C++11) exist. The same goes for streams. In short: no, you can't replace mbstate_t.


Why is there no overload for printing `std::byte`?

The following code does not compile in C++20:
#include <iostream>
#include <cstddef>

int main()
{
    std::byte b{65};
    std::cout << "byte: " << b << '\n';   // Missing overload
}
When std::byte was added in C++17, why was there no corresponding operator<< overload for printing it? I can maybe understand the choice of not printing containers, but why not std::byte? It tries to act as a primitive type, and we even have overloads for std::string, the recent std::string_view, perhaps the most closely related std::complex, and even std::bitset can be printed.
There are also std::hex and similar modifiers, so printing 0-255 by default should not be an issue.
Was this just an oversight? And what about operator>>? std::bitset has one, and it is not trivial at all.
EDIT: Found out even std::bitset can be printed.
From the paper on std::byte (P0298R3), emphasis mine:
Design Decisions
std::byte is not an integer and not a character
The key motivation here is to make byte a distinct type – to improve program safety by leveraging the type system. This leads to the design that std::byte is not an integer type, nor a character type. It is a distinct type for accessing the bits that ultimately make up object storage.
As such, it is not required to be implicitly convertible to (or interpretable as) a char or any integral type whatsoever, and hence cannot be printed using std::cout unless explicitly cast to a suitable type.
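So an explicit conversion is needed; std::to_integer from <cstddef> is the intended way:

#include <cstddef>
#include <iostream>

int main()
{
    std::byte b{65};
    // Explicitly convert to an integer type before printing.
    std::cout << "byte: " << std::to_integer<int>(b) << '\n';   // prints 65
}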
Furthermore, this question might help.
std::byte is intended for accessing raw data. It allows me to replace that damn uint8_t sprinkled all over the codebase with something that actually says "this is raw and unparsed", instead of something that could be misunderstood as a C string.
To underline: std::byte doesn't "try to be a primitive"; it represents something even less: raw data.
That it's implemented like this is mostly a quirk of C++ and compiler implementations (layout rules for "primitive" types are much simpler than for a struct or a class).
This kind of thing is mostly found in low-level code where, honestly, printing shouldn't be used, and sometimes isn't even possible.
My use case, for example, is receiving raw bytes over I2C (or RS485) and parsing them into a frame, which is then put into a struct. Why would I want to serialize raw bytes rather than the actual data, which I will have access to almost immediately?
To sum up this somewhat ranty answer, providing operator overloads for std::byte to work with iostream goes against the intent of this type.
And expressing intent in code as much as possible is one of the important principles of modern programming.

Does adding enumerators to an enum break the ABI?

In particular, I have the following code in a library interface:
typedef enum
{
    state1,
    state2,
    state3,
    state4,
    state5,
    state_error = -1,
} State;
I am strictly forbidden to break the ABI. However, I want to add state6 and state7. Will that break the ABI?
I found a tip here, but I somewhat doubt whether it applies to my case:
You can...
append new enumerators to an existing enum.
Exception: if that leads to the compiler choosing a larger underlying type for the enum, that makes the change binary-incompatible. Unfortunately, compilers have some leeway to choose the underlying type, so from an API-design perspective it's recommended to add a Max.... enumerator with an explicit large value (=255, =1<<15, etc.) to create an interval of numeric enumerator values that is guaranteed to fit into the chosen underlying type, whatever that may be.
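Applied to the enum from the question, that Max... guard might look like this (state_max is a made-up name):

typedef enum
{
    state1,
    state2,
    state3,
    state4,
    state5,
    state_error = -1,
    state_max   = 1 << 15   // pins the range the underlying type must cover
} State;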
Your question is a nice example of why long-term maintenance of ABI compatibility is a difficult task. The core of the problem here is that the compatibility depends not just on the given type, but also on how it is used in function/method prototypes or complex types (e.g. structures, unions, etc.).
(1) If the enumeration is used strictly as an input to the library (e.g. as a parameter of a function, where it just changes the behavior of the function/library), then compatibility is kept: you changed the contract in a way which can never hurt the customer, i.e. the calling application. Old applications will never use the new value and get the old behavior; new applications simply get more options.
(2) If the enumeration is used anywhere as an output from the library (e.g. a return value, or a function filling some address provided by the caller, a.k.a. an output parameter), the change would break the ABI. Consider the enumeration to be a contract saying "the application never sees values other than those listed". Adding a new enum member would break this contract, because old applications could now see values they never counted on.
That is, at least, if there are no measures to protect old applications from running into these troubles. Generally speaking, the library can still output the new value, but never for any valid input potentially provided by old applications.
There are some design patterns allowing such enum expansions:
E.g. the library can provide an initialization function which allows the application to specify the version of the ABI it is ready for. Old applications ask for version 1.0 and never get the new value; newer applications specify 1.1 or 2.0, and if the new enum value was added in version 1.1, only then may they get it.
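A minimal sketch of that first pattern (all names here are hypothetical, and the enum is assumed to have been extended with state6):

// Hypothetical sketch of ABI-version negotiation; names are made up.
static int g_abi_version = 100;         // default: ABI 1.0

void lib_init(int requested_abi)        // the application declares readiness
{
    g_abi_version = requested_abi;      // e.g. 110 for ABI 1.1
}

State do_something(void)
{
    State result = state6;              // ... stands in for the actual computation ...
    if (g_abi_version < 110 && result == state6)
        return state_error;             // old callers never see state6
    return result;
}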
Or, if a function DoSomething() takes some flags on input, you may add a new flag by which the application can declare that it is ready to see the new output value.
Or, if that's not possible, a new version of the library may add a new function DoSomethingEx(), which provides more complex behavior than the original DoSomething(). DoSomethingEx() can then return the new enum value; DoSomething() cannot.
As a side note, if you ever need to add such a DoSomethingEx(), do it in a way that allows similar expansions in the future. For consistency, it's usually a good idea to design it so that DoSomethingEx() with default flags (usually zero) behaves the same way as DoSomething(), and only with some new flag(s) does it offer different and more complex behavior.
The drawback, of course, is that the library implementation has to check what the application is ready for and provide behavior compatible with the expectations of old applications. That does not seem like much, but over time and many versions of the library, dozens of such checks may accumulate in the implementation, making it more complex and harder to maintain.
The quote actually covers your case. Simply add the new enum values at the end (but before state_error: since it has an explicit value of -1, anything placed after it would restart numbering at 0 and collide with state1) and it should be binary compatible, unless, as mentioned in the quote you provided, the compiler chooses a different-sized underlying type, which doesn't seem likely for such a small enum.
The best way is to try it and check: a simple sizeof(State) executed before and after the change should be enough (though you might also want to check that the existing values stay the same).
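For example (assuming the State enum from the question is in scope):

#include <cstdio>

int main()
{
    // Compare this output before and after adding state6/state7;
    // if it changes, the underlying type (and thus the ABI) changed.
    std::printf("sizeof(State) = %zu\n", sizeof(State));
    std::printf("state_error   = %d\n", static_cast<int>(state_error));
}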
Take a look at the highest-valued enumerator: state5 is 4.
That means, even if the compiler had chosen char as the underlying type, you could comfortably fit 100+ additional enumerators there without risking damage to binary compatibility.
That presupposes that users only ever supply a value of the enumeration, instead of reading one, though.

C++ strings, when to use what?

It's been quite some time now that I've been coding in C++, and I think most who actually code in C++ would agree that one of the trickiest decisions is choosing from an almost dizzying number of available string types. I mostly prefer ATL's CString for its ease of use and features, but would like a comparative study of the available options.
I've checked out SO and haven't found any content that helps one choose the right string. There are websites which describe conversions from one string type to another, but that's not what we want here.
I'd love to have a comparison based on specialty, performance, portability (Windows, Mac, Linux/Unix, etc.), ease of use/features, multi-language support (Unicode/MBCS), cons (if any), and any other special cases.
I'm listing the strings that I've encountered so far. I believe there are more, so we may edit this later to accommodate other options. Mind you, I've worked mostly on Windows, so the list reflects that:
1. char*
2. std::string
3. STL's basic_string
4. ATL's CString
5. MFC's CString
6. BSTR
7. _bstr_t
8. CComBSTR
I don't mean to put a dampener on your enthusiasm for this, but realistically it's inefficient to mix a lot of string types in one project, so the larger the project gets, the more inevitably it should settle on std::string (which is a typedef for an instantiation of the STL's basic_string with type char, not a different entity), given that it's the only Standard value-semantic option. char* is OK mainly for fixed-size strings (e.g. string literals, fixed-size buffers) or for interfacing with C.
Why do I say it's inefficient? You end up with needless template instantiations for the variety of string arguments (permutations, even, for multiple arguments). You find yourself calling functions that want to load a result into a string&, then have to call .c_str() on that and construct some other type, doing redundant memory allocation. Even const std::string& requires a string temporary if called with an ASCIIZ char* (e.g. one pointing into some other string type's buffer). When you want to write a function that handles whichever type of string a particular caller wants to use, you're pushed towards templates and therefore inline code, longer compile times and recompilation dependencies (there are some ways to mitigate this, but they get complex, and to be convenient or automated they tend to require changes to the various string types, e.g. a casting operator or a member function returning some common interface/proxy object).
Projects may need to use non-Standard string types to interact with libraries they want to use, but you want to minimise that and limit the pervasiveness if possible.
The sorry story of C++ string handling is too depressing for me to write an essay on, but just a couple of points:
ATL and MFC CString are the same thing (same code and everything). They were merged years ago.
If you're using either _bstr_t or CComBSTR, you probably wouldn't use BSTR except on calls into other people's APIs which take BSTR.
char* - fast; features are limited to those in the <cstring> header; error-prone (too low-level).
std::string - this is actually a typedef for std::basic_string<char, char_traits<char> >. A beautiful thing: first of all, it's fast too. Second, you can use all the <algorithm>s because basic_string provides iterators (see the sketch below). For wide-character support there is another typedef, wstring, which is std::basic_string<wchar_t, char_traits<wchar_t> >. basic_string is a standard type and is therefore absolutely portable. I'd go with this one.
ATL's and MFC's CStrings do not even provide iterators; to me they are therefore an abomination, because they are a class wrapper around C strings and they are very badly designed, IMHO.
I don't know about the rest.
Hope this partial information helps.
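A small illustration of the iterator point made above: any <algorithm> works on std::string directly.

#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>

int main()
{
    std::string s = "hello world";
    // Uppercase in place via std::transform over the string's iterators.
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    std::cout << s << '\n';   // HELLO WORLD
}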
Obviously, only the first three are portable, so they should be preferred in most cases. If you're doing C++, you should avoid char* in most instances, as raw pointers and arrays are error-prone; interfacing with low-level C, such as in system calls or drivers, is the exception. std::string should be preferred by default, IMHO, because it meshes so nicely with the rest of the STL.
Again, IMHO, if you need to work with e.g. MFC, you should keep everything as std::string in your business logic and translate to and from CString when you hit the WinAPI functions.
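A sketch of that boundary pattern (assuming the narrow CStringA variant for brevity; a real project would handle TCHAR/Unicode builds):

#include <atlstr.h>   // CStringA (ATL/MFC)
#include <cstddef>
#include <string>

// Business logic stays on std::string; convert only at the boundary.
CStringA to_cstring(const std::string& s)
{
    return CStringA(s.c_str(), static_cast<int>(s.size()));
}

std::string from_cstring(const CStringA& s)
{
    return std::string(s.GetString(), static_cast<std::size_t>(s.GetLength()));
}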
2 and 3 are the same. 4 and 5 are the same, too. 7 and 8 are wrappers around 6. So, arguably, the list contains just C's strings, standard C++'s strings, Microsoft's C++ strings, and Microsoft's COM strings. That gives you the answer: in standard C++, use standard C++ strings (std::string).

Uses of std::basic_string

The basic_string class was apparently designed as a general-purpose container, as I cannot find any text-specific function in its specification except for the c_str() function. Just out of curiosity: have you ever used the std::basic_string container class for anything other than storing human-readable character data?
The reason I ask is that one often has to choose between making something general or specific. The designers chose to make the std::basic_string class general, but I doubt it is ever used that way.
It was designed as a string class (hence, for example, length() and all those dozens of find functions), but after the introduction of the STL into the standard library it was outfitted to be an STL container, too (hence size() and the iterators, with <algorithm> making all the find functions redundant).
Its main purpose is to store characters, though. Using anything other than PODs isn't guaranteed to work (and doesn't work, for example, with Dinkumware's standard library). Also, the necessary std::char_traits isn't required to be available for anything other than char and wchar_t (although many implementations come with a reasonable implementation of the base template).
In the original standard, the class wasn't required to store its data in a contiguous piece of memory, but this changed with C++11.
In short, it's mostly useful as a container of characters (a.k.a. "string"), where "character" has a fairly wide definition.
The "wildest" I have used it for is for storing differently encoded strings by using different character types. That way, strings of different encodings are incompatible even if they use the same character size (ASCII and UTF-8) and, e.g., assignment causes compile-time errors.
Yes - I've implemented a state machine for 'unsigned int': basic_string was used to store and compare the states.

Valid use of typedef?

I have a char (i.e. byte) buffer that I'm sending over the network. At some point in the future I might want to switch the buffer to a different type, like unsigned char or short. I've been thinking about doing something like this:
typedef char bufferElementType;
And whenever I do anything with a buffer element, I declare it as bufferElementType rather than char. That way I could switch to another type by changing this typedef (of course it wouldn't be that simple, but it would at least be easy to identify the places that need to be modified... there'll be a bufferElementType nearby).
Is this a valid / good use of typedef? Is it not worth the trouble? Is it going to give me a headache at some point in the future? Is it going to make maintenance programmers hate me?
I've read through When Should I Use Typedef In C++, but no one really covered this.
It is a great (and normal) usage. You have to be careful, though, that, for example, the type you select meets the same signed/unsigned criteria, and that it responds similarly to operators. That will make it easier to change the type afterwards.
Another option is to use templates to avoid fixing the type until the moment you compile. A class defined as:
template <typename CharType>
class Whatever
{
    CharType aChar;
    // ...
};
is able to work with any character type you select, while responding to all the operators in the same way.
Another advantage of typedefs is that, if used wisely, they can increase readability. As a really dumb example, a Meter and a Degree can both be doubles, but you'd like to differentiate between them. Using a typedef is one quick & easy solution to make errors more visible.
Note: a more robust solution to the above example would be to create different types for a meter and a degree, so the compiler can enforce things itself. This requires a bit of work, which doesn't always pay off, however. Using typedefs is a quick & easy way to make errors visible, as described in the article linked above.
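For instance, a minimal version of such strong types (the names and the climb() function are purely illustrative):

// Minimal strong types: same representation, but not interchangeable.
struct Meter  { double value; };
struct Degree { double value; };

double climb(Meter height)          // accepts meters only
{
    return height.value * 2.0;      // illustrative computation
}

int main()
{
    Meter  m{120.0};
    Degree d{45.0};
    climb(m);       // fine
    // climb(d);    // error: Degree is not convertible to Meter
}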
Yes, this is the perfect usage for typedef, at least in C.
For C++ it may be argued that templates are a better idea (as Diego Sevilla has suggested), but they have their drawbacks. (Extra work if everything using the data type is not already wrapped in a few classes, slower compilation times, a more complex source-file structure, etc.)
It also makes sense to combine the two approaches, that is, give a typedef name to a template parameter.
Note that as you're sending data over a network, char and other integer types may not be interchangeable (e.g. due to endianness). In that case, using a templated class with specialized functions might make more sense (send<char> sends the byte as-is; send<short> converts it to network byte order first).
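A sketch of that idea; write_raw() is a hypothetical stand-in for the actual transport call, htons() is the usual POSIX/Winsock byte-order helper, and the template is named send_value() here to avoid clashing with the POSIX send():

#include <arpa/inet.h>   // htons (POSIX; Winsock offers the same function)
#include <cstddef>
#include <cstdint>

void write_raw(const void* data, std::size_t len);   // assumed transport call

template <typename T> void send_value(T value);

template <> void send_value<char>(char value)
{
    write_raw(&value, sizeof value);                 // one byte: nothing to swap
}

template <> void send_value<short>(short value)
{
    std::uint16_t wire = htons(static_cast<std::uint16_t>(value));  // to network order
    write_raw(&wire, sizeof wire);
}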
Yet another solution would be to create a "BufferElementType" class with helper methods (convertToNetworkOrderBytes()), but I'll bet that would be overkill for you.