How to convert uint16_t to a wide string (std::wstring) - c++

My question is related to this older question
Format specifiers for uint8_t, uint16_t, ...?
To recap, the original question asked how to use the format specifiers for uint8_t, uint16_t, uint32_t and uint64_t with scanf.
The answer to the question was as follows:
sscanf (line, "Value of integer: %" SCNd32 "\n", &my_integer);
But does anyone know how to do the same thing with a wide string?
i.e.
std::wstring line;
swscanf (line.c_str(), L"Value of integer: %" SCNd16 L"\n", &my_integer);
The above line gives me a concatenation error. I believe this is because SCNd16 is just not intended for wide strings?
Currently my solution is to create the std::string in the original answer and then convert it to a wide string
sscanf_s(line.c_str(), "%" SCNd16 "\n", &sixteenBitInteger);
// code here to check for EOF and EINVAL
//then I convert it
typedef std::codecvt_utf8<wchar_t> ConverterType;
std::wstring_convert<ConverterType, wchar_t> converter;
std::wstring convertedString = converter.from_bytes(line);
but it's rather ugly and I am sure there must be a more polished way to do this conversion?
If it helps to understand my use, I am using the uint16_t type to store the port number for a web server but I want to be able to convert it to a wide string as that is the expected display type. I am also using C++11 if that changes the answer at all and I do have access to the boost libraries although I would rather not use them.

This is a VS2013 compiler bug. Since it has been closed as "fixed", maybe it'll work in VS2015 (don't have the preview installed to give it a try).
The line of code you have
swscanf (line.c_str(), L"Value of integer: %" SCNd16 L"\n", &my_integer);
is well formed, because even if SCNd16 expands to a string literal that lacks the L prefix, the standard says that if out of two adjacent string literals, one lacks an encoding prefix, it is treated as if it has the same encoding prefix as the other.
§2.14.5/14 [lex.string]
In translation phase 6 (2.2), adjacent string literals are concatenated. If both string literals have the same encoding-prefix, the resulting concatenated string literal has that encoding-prefix. If one string literal has no encoding-prefix, it is treated as a string literal of the same encoding-prefix as the other operand. ...
Typically, you can use the preprocessor to widen strings by using token concatenation. For instance, defining a set of macros like this
#define WIDEN_(x) L##x
#define WIDEN(x) WIDEN_(x)
and converting the offending line of code to
swscanf (line.c_str(), L"Value of integer: %" WIDEN(SCNd16) L"\n", &my_integer);
would fix the problem, but it doesn't on VS2013 because of an implementation detail. The SCNd16 macro actually expands into two separate string literals - "h" "d". So the above macro widens the first literal, but not the second, and you run into the same (bogus) error.
Your options are to either hardcode the string "hd" or go with the runtime conversion solution you've shown.
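The hardcoded-format option can be sketched as below, together with an alternative for the display direction that the question describes (port number to wide string). This is a minimal sketch under assumptions, not the answerer's code: the helper names parse_port and port_to_wstring are hypothetical, and "%hu" with an unsigned short temporary is used instead of "%hd" because the target is unsigned.

```cpp
#include <cstdint>
#include <cwchar>
#include <string>

// Hypothetical helper: parse a port from wide text using a hardcoded "%hu"
// rather than the SCNu16 macro. "%hu" is defined for unsigned short, so the
// value is read into a temporary of that type and then converted.
bool parse_port(const std::wstring& line, std::uint16_t& port) {
    unsigned short tmp = 0;
    if (std::swscanf(line.c_str(), L"Value of integer: %hu", &tmp) != 1)
        return false;  // literal prefix or number did not match
    port = static_cast<std::uint16_t>(tmp);
    return true;
}

// Hypothetical helper: integer to wide string for display (C++11).
// The explicit cast just makes the chosen std::to_wstring overload obvious.
std::wstring port_to_wstring(std::uint16_t port) {
    return std::to_wstring(static_cast<unsigned>(port));
}
```

If only the display string is needed, port_to_wstring avoids both the format macros and the narrow-to-wide conversion step.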

A pure guess, as I don't have time at the moment to try it.
Can you use the preprocessor to token-paste the wide-string L prefix onto the front of the expanded SCNd16?

Related

__int64 to CString returns wrong values - C++ MFC

I want to convert a __int64 variable into a CString. The code is exactly this
__int64 i64TotalGB;
CString totalSpace;
i64TotalGB = 150;
printf("disk space: %I64d GB\n", i64TotalGB);
totalSpace.Format(_T("%I64d", i64TotalGB));
printf("totalSpace contains: %s", totalSpace);
The first printf prints
"disk space: 150 GB"
which is correct, but the second printf prints random large numbers like
"totalSpace contains: 298070026817519929"
I also tried to use a INT64 variable instead of a __int64 variable, but the result is the same. What can be the cause of this?
Here:
totalSpace.Format(_T("%I64d", i64TotalGB));
you're passing i64TotalGB as an argument to the _T() macro instead of passing it as the second argument to Format().
Try this:
totalSpace.Format(_T("%I64d"), i64TotalGB);
Having said that, thanks to MS's mess (ha) around character encodings, using _T here is not the right thing, as CString is composed of a TCHAR and not _TCHAR. So taking that into account, you might as well use TEXT() instead of _T(), as it depends on UNICODE and not _UNICODE:
totalSpace.Format(TEXT("%I64d"), i64TotalGB);
In addition, this line is wrong as it tries to pass an ATL CString as a char* (a.k.a. C-style string):
printf("totalSpace contains: %s", totalSpace);
For which the compiler gives this warning:
warning C4477: 'printf' : format string '%s' requires an argument of type 'char *', but variadic argument 1 has type 'ATL::CString'
While the structure of CString is practically compatible with passing it like you have, this is still formally undefined behavior. Use CString::GetString() to safeguard against it:
printf("totalSpace contains: %ls", totalSpace.GetString());
Note the %ls as under my configuration totalSpace.GetString() returned a const wchar_t*. However, as "printf does not currently support output into a UNICODE stream.", the correct version for this line, that will support characters outside your current code page, is a call to wprintf() in the following manner:
wprintf(L"totalSpace contains: %s", totalSpace.GetString());
Having said ALL that, here's some general advice, regardless of the direct problem behind the question. The far better practice nowadays is slightly different altogether, and I quote from the respectable answer by @IInspectable, saying that "generic-text mappings were relevant 2 decades ago".
What's the alternative? In the absence of a good enough reason, try sticking explicitly to CStringW (a Unicode character type string with CRT support). Prefer the L prefix on character and string literals over the archaic generic-text mappings that depend on whether the constant _UNICODE or _MBCS has been defined in your program. Likewise, the better practice is to use the wide-character versions of all API and language library calls, such as wprintf() instead of printf().
The bug is a result of numerous issues with the code, specifically these 2:
totalSpace.Format(_T("%I64d", i64TotalGB));
This uses the _T macro in a way it's not meant to be used. It should wrap a single character string literal. In the code it wraps a second argument.
printf("totalSpace contains: %s", totalSpace);
This assumes an ANSI-encoded character string, but passes a CString object, that can store both ANSI as well as Unicode encoded strings.
The recommended course of action is to drop generic-text mappings altogether, in favor of using Unicode (that's UTF-16LE on Windows) throughout [1]. The generic-text mappings were relevant 2 decades ago, to ease porting of Win9x code to the Windows NT based products.
To do this
Choose CStringW over CString.
Drop all occurrences of _T, TEXT, and _TEXT, and replace them with an L prefix.
Use the wide-character version of the Windows API, CRT, and C++ Standard Library.
The fixed code looks like this:
__int64 i64TotalGB;
CStringW totalSpace; // Use wide-character string
i64TotalGB = 150;
printf("disk space: %I64d GB\n", i64TotalGB);
totalSpace.Format(L"%I64d", i64TotalGB); // Use wide-character string literal
wprintf(L"totalSpace contains: %s", totalSpace.GetString()); // Use wide-character library
On an unrelated note, while it is technically safe to pass a CString object in place of a character pointer in a variable argument list, this is an implementation detail, and not formally documented to work. Call CString::GetString() if you care about correct code.
[1] Unless there is a justifiable reason to use a character encoding that uses char as its underlying type (like UTF-8 or ANSI). In that case you should still be explicit about it by using CStringA.
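For readers without MFC at hand, the same "wide throughout" idea can be sketched with the standard library alone. This is only a portable analogue of CStringW::Format, not MFC code: the helper name format_total_gb is hypothetical, and it assumes a swprintf that accepts the standard "%lld" conversion rather than the Microsoft-specific "%I64d".

```cpp
#include <cstddef>
#include <cstdint>
#include <cwchar>
#include <string>

// Hypothetical helper: format a 64-bit integer into a wide string.
// The format string is a wide literal and the value is passed as a separate
// argument, which is exactly what the broken Format call got wrong.
std::wstring format_total_gb(std::int64_t gb) {
    wchar_t buf[32];  // room for any 64-bit decimal value plus the terminator
    int n = std::swprintf(buf, sizeof buf / sizeof buf[0], L"%lld",
                          static_cast<long long>(gb));
    if (n < 0)
        return std::wstring();  // formatting failed
    return std::wstring(buf, static_cast<std::size_t>(n));
}
```

The cast to long long keeps the argument type in step with "%lld" regardless of how int64_t is defined on the platform.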
Try this:
totalSpace.Format(_T("%I64d"), i64TotalGB);

Compilation of string literals

Why can two string literals separated by a space, tab or "\n" be compiled without an error?
int main()
{
char * a = "aaaa" "bbbb";
}
"aaaa" is a char*
"bbbb" is a char*
There is no specific concatenation rule to process two string literals. And obviously the following code gives an error during compilation:
#include <iostream>
int main()
{
char * a = "aaaa";
char * b = "bbbb";
std::cout << a b;
}
Is this concatenation common to all compilers? Where is the null termination of "aaaa"? Is "aaaabbbb" a continuous block of RAM?
If you look at, e.g., this reference on the translation phases, you will see that phase 6 does the following:
Adjacent string literals are concatenated.
And that's exactly what happens here. You have two adjacent string literals, and they are concatenated into a single string literal.
It is standard behavior.
It only works for string literals, not two pointer variables, as you noticed.
In this statement
char * a = "aaaa" "bbbb";
the compiler, at a step of compilation before syntax analysis, treats adjacent string literals as one literal.
So for the compiler the above statement is equivalent to
char * a = "aaaabbbb";
that is the compiler stores only one string literal "aaaabbbb"
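A small sketch can make this concrete, including where the null terminator goes. The names joined, wide_joined and concat_at_runtime are hypothetical, and the code assumes a C++11 compiler.

```cpp
#include <string>

// Adjacent literals become one literal in translation phase 6, so there is
// exactly one terminator at the very end: 8 characters + 1 null = 9 bytes.
const char joined[] = "aaaa" "bbbb";
static_assert(sizeof(joined) == 9, "one array with a single terminator");

// Mixed prefixes: the unprefixed literal takes the prefix of its neighbour.
const wchar_t wide_joined[] = L"aaaa" "bbbb";
static_assert(sizeof(wide_joined) / sizeof(wchar_t) == 9,
              "the whole result is a wide literal");

// Pointer variables are not literals, so they need a run-time operation.
std::string concat_at_runtime(const char* a, const char* b) {
    return std::string(a) + b;
}
```

The static_assert lines are checked at compile time, which shows that the concatenation happens before the program ever runs.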
Adjacent string literals are concatenated as per the rules of C (and C++) standard. But no such rule exists for adjacent identifiers (i.e. variables a and b).
To quote, C++14 (N3797 draft), § 2.14.5:
In translation phase 6 (2.2), adjacent string literals are
concatenated. If both string literals have the same encoding-prefix,
the resulting concatenated string literal has that encoding-prefix. If
one string literal has no encoding-prefix, it is treated as a string
literal of the same encoding-prefix as the other operand. If a UTF-8
string literal token is adjacent to a wide string literal token, the
program is ill-formed. Any other concatenations are
conditionally-supported with implementation-defined behavior.
C and C++ compile adjacent string literals as a single string literal. For example this:
"Some text..." "and more text"
is equivalent to:
"Some text...and more text"
That is for historical reasons:
The original C language was designed in 1969-1972 when computing was still dominated by the 80 column punched card. Its designers used 80 column devices such as the ASR-33 Teletype. These devices did not automatically wrap text, so there was a real incentive to keep source code within 80 columns. Fortran and Cobol had explicit continuation mechanisms to do so, before they finally moved to free format.
It was a stroke of brilliance for Dennis Ritchie (I assume) to realise that there was no ambiguity in the grammar and that long ASCII strings could be made to fit into 80 columns by the simple expedient of getting the compiler to concatenate adjacent literal strings. Countless C programmers were grateful for that small feature.
Once the feature is in, why would it ever be removed? It causes no grief and is frequently handy. I for one wish more languages had it. The modern trend is to have extended strings with triple quotes or other symbols, but the simplicity of this feature in C has never been outdone.
Similar question here.
String literals placed side-by-side are concatenated at translation phase 6 (after the preprocessor). That is, "Hello," " world!" yields the (single) string "Hello, world!". If the two strings have the same encoding prefix (or neither has one), the resulting string will have the same encoding prefix (or no prefix).
(source)

How to print uint32_t variables value via wprintf function?

It is well known that to print values of fixed-width integer types (like uint32_t) you need to include the cinttypes (in C++) or inttypes.h (in C) header and use format specifier macros like PRIu32. But how do you do the same thing when the wprintf function is used? In that case such a macro would have to expand to a string literal with an L prefix.
Whether this works actually depends on which C standard the compiler implements.
From this string literal reference
Only two narrow or two wide string literals may be concatenated.
(until C99)
and
If one literal is unprefixed, the resulting string literal has the width/encoding specified by the prefixed literal. If the two string literals have different encoding prefixes, concatenation is implementation-defined. (since C99)
[Emphasis mine]
So if you're using an old compiler, or one that doesn't support the C99 standard (or later), it's not possible. Besides, fixed-width integer types were standardized in C99, so the macros don't really exist for such old compilers, making the issue moot.
For more modern compilers which support C99 and later, it's a non-issue since the string-literal concatenation will work and the compiler will turn the non-prefixed string into a wide-character string, so doing e.g.
wprintf(L"Value = %" PRIu32 "\n", uint32_t_value);
will work fine.
If you have a pre-C99 compiler, but still have the macros and fixed-width integer types, you can use function-like macros to prepend the L prefix to the string literals. Something like
#define LL(s) L ## s
#define L(s) LL(s)
...
wprintf(L"Value = %" L(PRIu32) L"\n", uint32_t_value);
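Both variants can be compared side by side by formatting into a buffer instead of printing. This is a sketch under assumptions: the function names are hypothetical, the macro is named WIDEN rather than L to avoid confusion with the literal prefix, and a C++11 compiler with a conforming swprintf is assumed.

```cpp
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cwchar>
#include <string>

// Two-level macro so that PRIu32 is expanded before L is pasted onto it.
#define WIDEN_(x) L##x
#define WIDEN(x) WIDEN_(x)

// Variant 1: rely on mixed-prefix concatenation (C99 / C++11 and later).
std::wstring format_mixed(std::uint32_t v) {
    wchar_t buf[32];
    int n = std::swprintf(buf, 32, L"Value = %" PRIu32 "\n", v);
    return n < 0 ? std::wstring() : std::wstring(buf, static_cast<std::size_t>(n));
}

// Variant 2: widen the macro's expansion explicitly.
std::wstring format_widened(std::uint32_t v) {
    wchar_t buf[32];
    int n = std::swprintf(buf, 32, L"Value = %" WIDEN(PRIu32) L"\n", v);
    return n < 0 ? std::wstring() : std::wstring(buf, static_cast<std::size_t>(n));
}
```

On a conforming compiler the two functions should produce identical strings. The explicit widening only matters where mixed concatenation is unavailable.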
Not sure where the problem is, but here (VS 2015) both
wprintf(L"AA %" PRIu32 L" BB", 123);
and
printf("AA %" PRIu32 " BB", 123);
compile correctly and give following output:
AA 123 BB
Even if your compiler does not support concatenation of differently-prefixed literals, you can always widen a narrow one:
#define WIDE(X) WIDE2(X)
#define WIDE2(X) L##X
wprintf(L"%" WIDE(PRIu32), foo);
A (weaker) alternative to using the macros from <inttypes.h> is to convert/cast the fixed-width type to an equivalent or larger standard type.
wprintf(L"%lu\n", 0ul + some_uint32_t_value);
// or
wprintf(L"%lu\n", (unsigned long) some_uint32_t_value);

difference between L"" and u8""

Is there any difference between the following?
auto s1 = L"你好";
auto s2 = u8"你好";
Are s1 and s2 referring to the same type?
If not, what's the difference and which one is preferred?
They are not the same type.
s2 is a UTF-8 or narrow string literal. The C++11 draft standard section 2.14.5 String literals paragraph 7 says:
A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.
And paragraph 8 says:
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration (3.7).
s1 is a wide string literal which can support UTF-16 and UTF-32. Section 2.14.5 String literals paragraph 11 says:
A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.
See UTF8, UTF16, and UTF32 for a good discussion on the differences and advantages of each.
A quick way to determine types is to use typeid:
std::cout << typeid(s1).name() << std::endl ;
std::cout << typeid(s2).name() << std::endl ;
On my system this is the output:
PKw
PKc
Checking each of these with c++filt -t gives me:
wchar_t const*
char const*
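The difference can also be checked at compile time instead of through typeid. The sketch below is an illustration rather than part of the answer: the constant names are hypothetical, and universal character names are used for 你好 so that the result does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <type_traits>

// A wide literal is an array of const wchar_t (seen here as an lvalue).
static_assert(std::is_same<decltype(L"ab"), const wchar_t (&)[3]>::value,
              "L\"ab\" is an array of 3 const wchar_t");

// A UTF-8 literal is made of one-byte code units.
static_assert(sizeof(u8"ab"[0]) == 1, "UTF-8 code units are one byte");

// U+4F60 U+597D is the text from the question. Both characters are in the
// Basic Multilingual Plane: one wchar_t each, but three UTF-8 bytes each.
constexpr std::size_t wide_units = sizeof(L"\u4f60\u597d") / sizeof(wchar_t);  // 3
constexpr std::size_t utf8_bytes = sizeof(u8"\u4f60\u597d");                   // 7
```

Both counts include the terminating null, which is one element in either encoding.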
L"" creates a null-terminated string, of type const wchar_t[]. This is valid in C++03. (Note that wchar_t refers to an implementation-dependent "wide-character" type).
u8"" creates a null-terminated UTF-8 string, of type const char[]. This is valid only from C++11 onward.
Which one you choose is strongly dependent on what needs you have. L"" works in C++03, so if you need to work with older code (which may need to be compiled with a C++03 compiler), you'll need to use that. u8"" is easier to work with in many circumstances, particularly when the system in question normally expects char * strings.
The first is a wide character string, which might be encoded as UTF-16 or UTF-32, or something else entirely (though Unicode is now common enough that a completely different encoding is pretty unlikely).
The second is a string of narrow characters using UTF-8 encoding.
As to which is preferred: it'll depend on what you're doing, what platform you're coding for, etc. If you're mostly dealing with something like a web page/URL that's already encoded as UTF-8, and you'll probably just read it in, possibly verify its content, and later echo it back, it may well make sense to store it as UTF-8 as well.
Wide character strings vary by platform. If, for one example, you're coding for Windows, and a lot of the code interacts directly with the OS (which uses UTF-16) then storing your strings as UTF-16 can make a great deal of sense (and that's what Microsoft's compiler uses for wide character strings).

C++ with wxWidgets, Unicode vs. ASCII, what's the difference?

I have a Code::Blocks 10.05 rev 0 and gcc 4.5.2 Linux/unicode 64bit and
WxWidgets version 2.8.12.0-0
I have a simple problem:
#define _TT(x) wxT(x)
string file_procstatus;
file_procstatus.assign("/PATH/TO/FILE");
printf("%s",file_procstatus.c_str());
wxLogVerbose(_TT("%s"),file_procstatus.c_str());
printf outputs "/PATH/TO/FILE" normally, while the wxLogVerbose output is garbage. When I want to change std::string to wxString I have to do the following:
wxString buf;
buf = wxString::From8BitData(file_procstatus.c_str());
Does anybody have an idea what might be wrong, and why I need to convert from 8-bit data?
This has to do with how the character data is stored in memory. Using "string" you produce a string of type char using the ASCII character set, whereas I would assume that the _TT macro expands to L"string", which creates a string of type wchar_t using a Unicode character set (UTF-32 on Linux, I believe).
The printf function expects a char string, whereas wxLogVerbose, I assume, expects a wchar_t string. This is where the need for conversion comes from. ASCII uses one byte per character (8-bit data) but wchar_t strings use multiple bytes per character, so the problem comes down to the character encoding.
If you don't want to have to call this conversion function then do something like the following:
wstring file_procstatus = wxT("/PATH/TO/FILE");
wxLogVerbose(_TT("%s"),file_procstatus.c_str());
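When the narrow string already exists, it has to be widened before it reaches a wide "%s". A minimal standard-library sketch of that step is shown below. It is not the wxWidgets API: the helper name widen_ascii is hypothetical, and it assumes the input is pure 7-bit ASCII (anything else needs a real encoding conversion such as the wxString function shown in the question).

```cpp
#include <string>

// Hypothetical helper: widen a pure-ASCII narrow string character by
// character. Each char becomes one wchar_t with the same numeric value,
// which is only correct for 7-bit ASCII input.
std::wstring widen_ascii(const std::string& narrow) {
    return std::wstring(narrow.begin(), narrow.end());
}
```

The result's c_str() is a const wchar_t*, which is the pointer type a wide format string expects for "%s" in a Unicode build.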
The following article gives a good explanation of the differences between the Unicode and ASCII character sets, how they are stored in memory, and how string functions work with them.
http://allaboutcharactersets.blogspot.in/